grammar-utils

Name: grammar-utils
Version: 0.1.0
Summary: Utilities for regex and grammar parsing and constraining
Upload time: 2024-11-22 09:43:09
Requires Python: >=3.10
Keywords: nlp, utilities, text, grammar, constraint

## Grammar utilities

This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions
and context-free grammars (LR(1)). Parsing is supported both for prefixes and full strings.

The following context-free [grammars](grammars) are already included in this repository:
- JSON
- SPARQL

### Installation

You can install the Python package from PyPI:

```bash
pip install grammar-utils
```

Windows (x64) and Linux are currently supported when installing from PyPI.

Alternatively, you can clone this repository and build the package yourself:

```bash
git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install "maturin[patchelf]"
maturin develop --release
```

### Usage

This library supports two use cases: parsing and constraining.

#### Parsing

Given a context-free grammar, parse a string and return the corresponding parse tree.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)
# you can also get a pruned parse tree, skipping empty nodes or collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)
```
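The snippets above only show successful parses. The library's failure behavior is not documented here, so the sketch below assumes that invalid input raises an ordinary Python exception and catches it broadly.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
try:
    # a dangling ':' is not valid JSON, so this parse should fail
    parser.parse('{"key": }')
except Exception as e:
    print(f"parse failed: {e}")
```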

Parsing is also supported for prefixes, in which case the input should be bytes
rather than a string. A tree covering the terminals that are already fixed is returned,
together with the remaining suffix of the input for which the next terminal is not yet known.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)
# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)
```
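Because `prefix_parse` only fixes terminals it has fully seen, feeding successively longer prefixes of the same document shows `rest` shrinking as terminals become unambiguous. A minimal sketch using only the calls shown above (the split points are arbitrary):

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
document = b'{"key": "value"}'
# parse successively longer prefixes of the same document
for end in (6, 10, len(document)):
    tree, rest = parser.prefix_parse(document[:end])
    print(f"prefix={document[:end]!r} -> rest={rest!r}")
```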

You can also use your own grammars.

```python
from grammar_utils import load_byte_vocab
from grammar_utils.parse import LR1Parser

# define your own grammar and lexer
grammar = "..."
lexer = "..."
vocab = load_byte_vocab()
parser = LR1Parser(grammar, lexer, vocab)
```

#### Constraining

Constraints determine which symbols from the vocabulary can follow the current prefix
such that the regular expression or context-free grammar can still be satisfied.

```python
import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to the vocabulary (decode only for human-readable strings)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if constraint is satisfied (should be False)
print(constraint.is_match())

# same for regular expressions
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
```
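Putting `reset`, `get`, `next`, and `is_match` together, you can drive a random walk through the grammar until the constraint is satisfied. This is a sketch of mine, not part of the library; the step cap and the empty-continuation guard are safeguards I added, since a random walk over the JSON grammar is not guaranteed to terminate quickly.

```python
import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
constraint.reset()  # start from the empty prefix

generated = bytearray()
for _ in range(128):  # cap the walk so it always terminates
    if constraint.is_match():
        break
    next_indices = constraint.get()
    if len(next_indices) == 0:
        break  # no valid continuation left
    index = random.choice(next_indices)
    constraint.next(index)
    generated.extend(vocab[index])
print(generated.decode(errors="replace"))
```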

You can also use your own grammars and regexes.

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)

# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)
```
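As a concrete illustration of `RegexConstraint`, here is a small example with a date regex of my own choice, assuming the library accepts standard regex syntax:

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import RegexConstraint

vocab = load_byte_vocab()
# constrain output to an ISO-style date such as 2024-11-22
constraint = RegexConstraint(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", vocab)
constraint.reset(b"2024-")
next_indices = constraint.get()
# after "2024-", only digits should be allowed
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
```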

### Use cases

#### Forcing a language model to generate structured text

The following example shows how to use a regex constraint to force GPT2
to output either "true" or "false" after a given prompt.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    with torch.no_grad():
        logits = gpt2(input_tensor).logits
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    # convert the selected index to a plain int before feeding it back
    index = int(valid_indices[torch.argmax(valid_logits)])
    constraint.next(index)
    input_ids.append(index)
    print(tokenizer.decode(input_ids))
```
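An equivalent way to apply the constraint is to mask the full distribution rather than gather only the valid logits. The variant below is my rearrangement of the example above, using the same library calls but swapping greedy argmax for temperature-1 sampling.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
input_ids = tokenizer.encode("Constrained decoding is cool: ")
while not (constraint.is_match() or constraint.is_invalid()):
    with torch.no_grad():
        logits = gpt2(torch.tensor([input_ids])).logits[0, -1]
    # set logits of all forbidden tokens to -inf, then sample
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.from_numpy(constraint.get())] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    index = int(torch.multinomial(probs, num_samples=1))
    constraint.next(index)
    input_ids.append(index)
print(tokenizer.decode(input_ids))
```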


            
