grammar-utils

Name: grammar-utils
Version: 0.1.0
Summary: Utilities for regex and grammar parsing and constraining
Upload time: 2024-11-22 09:43:09
Requires Python: >=3.10
Keywords: nlp, utilities, text, grammar, constraint

## Grammar utilities

This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions
and context-free grammars (LR(1)). Parsing is supported both for prefixes and full strings.

The following context-free [grammars](grammars) are already included in this repository:
- JSON
- SPARQL

### Installation

You can install the Python package from PyPI:

```bash
pip install grammar-utils
```

Windows (x64) and Linux are currently supported when installing from PyPI.

Alternatively, you can clone this repository and build the package yourself:

```bash
git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install "maturin[patchelf]"
maturin develop --release
```

### Usage

This library supports two use cases: parsing and constraining.

#### Parsing

Given a context-free grammar, parse a string and return the corresponding parse tree.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)
# you can also get a pruned parse tree, skipping empty nodes or collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)
```
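The snippets above only show successful parses. The library's failure behavior is not documented here, so the sketch below assumes that invalid input raises an ordinary Python exception and catches it broadly.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
try:
    # a dangling ':' is not valid JSON, so this parse should fail
    parser.parse('{"key": }')
except Exception as e:
    print(f"parse failed: {e}")
```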

Parsing is also supported for prefixes, in which case the input should be bytes
rather than a string. A tree covering the terminals that are already fixed is returned,
together with the remaining suffix of the input for which the next terminal is not yet known.

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)
# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)
```
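Because `prefix_parse` only fixes terminals it has fully seen, feeding successively longer prefixes of the same document shows `rest` shrinking as terminals become unambiguous. A minimal sketch using only the calls shown above (the split points are arbitrary):

```python
from grammar_utils.parse import load_lr1_parser

parser = load_lr1_parser("json")
document = b'{"key": "value"}'
# parse successively longer prefixes of the same document
for end in (6, 10, len(document)):
    tree, rest = parser.prefix_parse(document[:end])
    print(f"prefix={document[:end]!r} -> rest={rest!r}")
```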

You can also use your own grammars.

```python
from grammar_utils import load_byte_vocab
from grammar_utils.parse import LR1Parser

# define your own grammar and lexer
grammar = "..."
lexer = "..."
vocab = load_byte_vocab()
parser = LR1Parser(grammar, lexer, vocab)
```

#### Constraining

Constraints determine which symbols from the vocabulary can follow the current prefix
such that the regular expression or context-free grammar can still be satisfied.

```python
import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to the vocabulary (decode only for human-readable strings)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if constraint is satisfied (should be False)
print(constraint.is_match())

# same for regular expressions
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
```
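Putting `reset`, `get`, `next`, and `is_match` together, you can drive a random walk through the grammar until the constraint is satisfied. This is a sketch of mine, not part of the library; the step cap and the empty-continuation guard are safeguards I added, since a random walk over the JSON grammar is not guaranteed to terminate quickly.

```python
import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
constraint.reset()  # start from the empty prefix

generated = bytearray()
for _ in range(128):  # cap the walk so it always terminates
    if constraint.is_match():
        break
    next_indices = constraint.get()
    if len(next_indices) == 0:
        break  # no valid continuation left
    index = random.choice(next_indices)
    constraint.next(index)
    generated.extend(vocab[index])
print(generated.decode(errors="replace"))
```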

You can also use your own grammars and regexes.

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint

vocab = load_byte_vocab()

# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)

# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)
```
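As a concrete illustration of `RegexConstraint`, here is a small example with a date regex of my own choice, assuming the library accepts standard regex syntax:

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import RegexConstraint

vocab = load_byte_vocab()
# constrain output to an ISO-style date such as 2024-11-22
constraint = RegexConstraint(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", vocab)
constraint.reset(b"2024-")
next_indices = constraint.get()
# after "2024-", only digits should be allowed
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
```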

### Use cases

#### Forcing a language model to generate structured text

The following example shows how to use a regex constraint to force GPT2
to output either "true" or "false" after a given prompt.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    with torch.no_grad():
        logits = gpt2(input_tensor).logits
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    # convert the selected index to a plain int before feeding it back
    index = int(valid_indices[torch.argmax(valid_logits)])
    constraint.next(index)
    input_ids.append(index)
    print(tokenizer.decode(input_ids))
```
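An equivalent way to apply the constraint is to mask the full distribution rather than gather only the valid logits. The variant below is my rearrangement of the example above, using the same library calls but swapping greedy argmax for temperature-1 sampling.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
input_ids = tokenizer.encode("Constrained decoding is cool: ")
while not (constraint.is_match() or constraint.is_invalid()):
    with torch.no_grad():
        logits = gpt2(torch.tensor([input_ids])).logits[0, -1]
    # set logits of all forbidden tokens to -inf, then sample
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.from_numpy(constraint.get())] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    index = int(torch.multinomial(probs, num_samples=1))
    constraint.next(index)
    input_ids.append(index)
print(tokenizer.decode(input_ids))
```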


            
