# tokenstream
[![GitHub Actions](https://github.com/vberlier/tokenstream/workflows/CI/badge.svg)](https://github.com/vberlier/tokenstream/actions)
[![PyPI](https://img.shields.io/pypi/v/tokenstream.svg)](https://pypi.org/project/tokenstream/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tokenstream.svg)](https://pypi.org/project/tokenstream/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
> A versatile token stream for handwritten parsers.
```python
from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]
```
## Introduction
Writing recursive-descent parsers by hand can be quite elegant, but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. This package provides a powerful general-purpose token stream that addresses these issues and more.
### Features
- Define the set of recognizable tokens dynamically with regular expressions
- Transparently skip over irrelevant tokens
- Expressive API for matching, collecting, peeking, and expecting tokens
- Clean error reporting with line numbers and column numbers
- Contextual support for indentation-based syntax
- Checkpoints for backtracking parsers
- Works well with Python 3.10+ match statements
Check out the [`examples`](https://github.com/vberlier/tokenstream/tree/main/examples) directory for practical examples.
## Installation
The package can be installed with `pip`.
```bash
pip install tokenstream
```
## Getting started
You can define tokens with the `syntax()` method. The keyword arguments map token types to regular expression patterns. The method returns a context manager; within its `with` block, the specified tokens are recognized.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
```
Check out the full [API reference](https://vberlier.github.io/tokenstream/api_reference/) for more details.
### Expecting tokens
The token stream is iterable and will yield all the extracted tokens one after the other. You can also retrieve tokens from the token stream one at a time by using the `expect()` method.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print(stream.expect().value)  # "hello"
    print(stream.expect().value)  # "world"
```
The `expect()` method lets you ensure that the extracted token matches a specified type and raises an exception otherwise.
```python
stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("number").value)  # UnexpectedToken: Expected number but got word 'world'
```
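The clean error reporting mentioned in the features list comes from the exception itself. Below is a minimal sketch of catching the error and printing a location; it assumes `UnexpectedToken` can be imported from `tokenstream` and that the raised exception carries a `location` with `lineno` and `colno` attributes, so check the API reference for the exact names in your version.

```python
from tokenstream import TokenStream, UnexpectedToken

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    stream.expect("word")  # consumes "hello"
    try:
        stream.expect("number")  # "world" is not a number
    except UnexpectedToken as exc:
        # Assumed attributes: the exception is expected to carry the
        # offending location for messages with line and column numbers.
        print(f"line {exc.location.lineno}, column {exc.location.colno}: {exc}")
```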
### Filtering the stream
Newlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in `newline` and `whitespace` tokens.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("newline", "whitespace"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("word").value)  # UnexpectedToken: Expected word but got whitespace ' '
```
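Intercepted tokens aren't rejected outright; they simply become visible in the stream, so your parser has to handle them explicitly. A small sketch building on the snippet above makes this visible by printing the token types:

```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("whitespace"):
    # The whitespace between the two words now shows up as its own token.
    print([token.type for token in stream])  # ['word', 'whitespace', 'word']
```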
The opposite of the `intercept()` method is `ignore()`. It skips over the specified tokens entirely, which makes handling comments pretty easy.
```python
stream = TokenStream(
    """
    # this is a comment
    hello # also a comment
    world
    """
)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.ignore("comment"):
    print([token.value for token in stream])  # ['hello', 'world']
```
### Indentation
To enable indentation, use the `indent()` method. The stream will then yield balanced pairs of `indent` and `dedent` tokens whenever the indentation changes.
```python
source = """
hello
world
"""
stream = TokenStream(source)
with stream.syntax(word=r"\w+"), stream.indent():
stream.expect("word")
stream.expect("indent")
stream.expect("word")
stream.expect("dedent")
```
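Since the pairs are balanced, deeper nesting simply produces more `indent` and `dedent` tokens. The sketch below illustrates this; the printed list is an assumption based on the balanced-pairs behavior described above rather than verified output.

```python
source = """
hello
    world
        nested
"""
stream = TokenStream(source)

with stream.syntax(word=r"\w+"), stream.indent():
    # Each increase in indentation opens an `indent` token and each decrease
    # closes it with a matching `dedent`, so the pairs stay balanced.
    print([token.type for token in stream])
    # ['word', 'indent', 'word', 'indent', 'word', 'dedent', 'dedent']
```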
To prevent some tokens from triggering unwanted indentation changes, you can use the `skip` argument.
```python
source = """
hello
# some comment
world
"""
stream = TokenStream(source)
with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.indent(skip=["comment"]):
stream.expect("word")
stream.expect("comment")
stream.expect("indent")
stream.expect("word")
stream.expect("dedent")
```
### Checkpoints
The `checkpoint()` method returns a context manager that resets the stream to the current token at the end of the `with` statement. Calling the returned `commit()` function keeps the state of the stream instead of rolling it back.
```python
stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    with stream.checkpoint():
        print([token.value for token in stream])  # ['hello', 'world']
    with stream.checkpoint() as commit:
        print([token.value for token in stream])  # ['hello', 'world']
        commit()
    print([token.value for token in stream])  # []
```
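Checkpoints are the building block for backtracking. One possible pattern is sketched below, assuming `UnexpectedToken` can be imported from `tokenstream` as referenced in the error messages above: try one alternative and let the stream roll back to the checkpointed position when it fails.

```python
from tokenstream import TokenStream, UnexpectedToken

def parse_value(stream: TokenStream):
    """Try to parse a number, falling back to a word on failure."""
    with stream.checkpoint() as commit:
        try:
            token = stream.expect("number")
            commit()  # keep the consumed token
            return int(token.value)
        except UnexpectedToken:
            pass  # not committed, so the stream resets after the with block
    return stream.expect("word").value

stream = TokenStream("hello 42")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(parse_value(stream))  # 'hello'
    print(parse_value(stream))  # 42
```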
### Match statements
Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+, give it a try and see if you like it.
```python
from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value
```
## Contributing
Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses [`poetry`](https://python-poetry.org/).
```bash
$ poetry install
```
You can run the tests with `poetry run pytest`.
```bash
$ poetry run pytest
```
The project must type-check with [`pyright`](https://github.com/microsoft/pyright). If you're using VSCode, the [`pylance`](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command line.
```bash
$ npm run watch
$ npm run check
$ npm run verifytypes
```
The code follows the [`black`](https://github.com/psf/black) code style. Import statements are sorted with [`isort`](https://pycqa.github.io/isort/).
```bash
$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests
```
---
License - [MIT](https://github.com/vberlier/tokenstream/blob/main/LICENSE)