# tokenstream
[![GitHub Actions](https://github.com/vberlier/tokenstream/workflows/CI/badge.svg)](https://github.com/vberlier/tokenstream/actions)
[![PyPI](https://img.shields.io/pypi/v/tokenstream.svg)](https://pypi.org/project/tokenstream/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tokenstream.svg)](https://pypi.org/project/tokenstream/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
> A versatile token stream for handwritten parsers.
```python
from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]
```
## Introduction
Writing recursive-descent parsers by hand can be quite elegant, but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. This package provides a powerful general-purpose token stream that addresses these issues and more.
### Features
- Define the set of recognizable tokens dynamically with regular expressions
- Transparently skip over irrelevant tokens
- Expressive API for matching, collecting, peeking, and expecting tokens
- Clean error reporting with line numbers and column numbers
- Contextual support for indentation-based syntax
- Checkpoints for backtracking parsers
- Works well with Python 3.10+ match statements
Check out the [`examples`](https://github.com/vberlier/tokenstream/tree/main/examples) directory for practical examples.
## Installation
The package can be installed with `pip`.
```bash
pip install tokenstream
```
## Getting started
You can define tokens with the `syntax()` method. The keyword arguments map token types to regular expression patterns. The method returns a context manager; within its `with` block, the specified tokens are recognized.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
```
Check out the full [API reference](https://vberlier.github.io/tokenstream/api_reference/) for more details.
### Expecting tokens
The token stream is iterable and will yield all the extracted tokens one after the other. You can also retrieve tokens from the token stream one at a time by using the `expect()` method.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print(stream.expect().value)  # "hello"
    print(stream.expect().value)  # "world"
```
The `expect()` method lets you ensure that the extracted token matches a specified type and raises an exception otherwise.
```python
stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("number").value)  # UnexpectedToken: Expected number but got word 'world'
```
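The clean error reporting mentioned in the features list comes from the exception itself. Below is a minimal sketch of catching the error and printing a location; it assumes `UnexpectedToken` can be imported from `tokenstream` and that the raised exception carries a `location` with `lineno` and `colno` attributes, so check the API reference for the exact names in your version.

```python
from tokenstream import TokenStream, UnexpectedToken

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    stream.expect("word")  # consumes "hello"
    try:
        stream.expect("number")  # "world" is not a number
    except UnexpectedToken as exc:
        # Assumed attributes: the exception is expected to carry the
        # offending location for messages with line and column numbers.
        print(f"line {exc.location.lineno}, column {exc.location.colno}: {exc}")
```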
### Filtering the stream
Newlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in `newline` and `whitespace` tokens.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("newline", "whitespace"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("word").value)  # UnexpectedToken: Expected word but got whitespace ' '
```
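Intercepted tokens aren't rejected outright; they simply become visible in the stream, so your parser has to handle them explicitly. A small sketch building on the snippet above makes this visible by printing the token types:

```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("whitespace"):
    # The whitespace between the two words now shows up as its own token.
    print([token.type for token in stream])  # ['word', 'whitespace', 'word']
```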
The opposite of the `intercept()` method is `ignore()`. It skips over the specified tokens entirely, which makes handling comments pretty easy.
```python
stream = TokenStream(
    """
    # this is a comment
    hello # also a comment
    world
    """
)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.ignore("comment"):
    print([token.value for token in stream])  # ['hello', 'world']
```
### Indentation
To enable indentation, use the `indent()` method. The stream will then yield balanced pairs of `indent` and `dedent` tokens whenever the indentation changes.
```python
source = """
hello
world
"""
stream = TokenStream(source)
with stream.syntax(word=r"\w+"), stream.indent():
stream.expect("word")
stream.expect("indent")
stream.expect("word")
stream.expect("dedent")
```
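Since the pairs are balanced, deeper nesting simply produces more `indent` and `dedent` tokens. The sketch below illustrates this; the printed list is an assumption based on the balanced-pairs behavior described above rather than verified output.

```python
source = """
hello
    world
        nested
"""
stream = TokenStream(source)

with stream.syntax(word=r"\w+"), stream.indent():
    # Each increase in indentation opens an `indent` token and each decrease
    # closes it with a matching `dedent`, so the pairs stay balanced.
    print([token.type for token in stream])
    # ['word', 'indent', 'word', 'indent', 'word', 'dedent', 'dedent']
```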
To prevent some tokens from triggering unwanted indentation changes, you can use the `skip` argument.
```python
source = """
hello
# some comment
world
"""
stream = TokenStream(source)
with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.indent(skip=["comment"]):
stream.expect("word")
stream.expect("comment")
stream.expect("indent")
stream.expect("word")
stream.expect("dedent")
```
### Checkpoints
The `checkpoint()` method returns a context manager that resets the stream to the current token at the end of the `with` statement. Calling the returned `commit()` function keeps the state of the stream instead of rolling it back.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    with stream.checkpoint():
        print([token.value for token in stream])  # ['hello', 'world']
    with stream.checkpoint() as commit:
        print([token.value for token in stream])  # ['hello', 'world']
        commit()
    print([token.value for token in stream])  # []
```
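Checkpoints are the building block for backtracking. One possible pattern is sketched below, assuming `UnexpectedToken` can be imported from `tokenstream` as referenced in the error messages above: try one alternative and let the stream roll back to the checkpointed position when it fails.

```python
from tokenstream import TokenStream, UnexpectedToken

def parse_value(stream: TokenStream):
    """Try to parse a number, falling back to a word on failure."""
    with stream.checkpoint() as commit:
        try:
            token = stream.expect("number")
            commit()  # keep the consumed token
            return int(token.value)
        except UnexpectedToken:
            pass  # not committed, so the stream resets after the with block
    return stream.expect("word").value

stream = TokenStream("hello 42")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(parse_value(stream))  # 'hello'
    print(parse_value(stream))  # 42
```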
### Match statements
Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+, give it a try and see if you like it.
```python
from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value
```
## Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses [`poetry`](https://python-poetry.org/).
```bash
$ poetry install
```
You can run the tests with `poetry run pytest`.
```bash
$ poetry run pytest
```
The project must type-check with [`pyright`](https://github.com/microsoft/pyright). If you're using VSCode, the [`pylance`](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command line.
```bash
$ npm run watch
$ npm run check
$ npm run verifytypes
```
The code follows the [`black`](https://github.com/psf/black) code style. Import statements are sorted with [`isort`](https://pycqa.github.io/isort/).
```bash
$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests
```
---
License - [MIT](https://github.com/vberlier/tokenstream/blob/main/LICENSE)