| Field | Value |
| --- | --- |
| Name | grammar-utils |
| Version | 0.1.0 |
| Summary | Utilities for regex and grammar parsing and constraining |
| Author email | Sebastian Walter <swalter@cs.uni-freiburg.de> |
| Requires Python | >=3.10 |
| License | None |
| Keywords | nlp, utilities, text, grammar, constraint |
| Requirements | none recorded |
| Project URL | https://github.com/bastiscode/grammar-utils |
| Upload time | 2024-11-22 09:43:09 |
## Grammar utilities
This repository contains Python utilities (backed by Rust) to parse and constrain text with regular expressions
and context-free grammars (LR(1)). Parsing is supported both for prefixes and full strings.
The following context-free [grammars](grammars) are already included in this repository:
- JSON
- SPARQL
### Installation
You can install the Python package from PyPI:
```bash
pip install grammar-utils
```
Windows (x64) and Linux are currently supported when installing from PyPI.
Alternatively, you can clone this repository and build the package yourself:
```bash
git clone https://github.com/bastiscode/grammar-utils
cd grammar-utils
pip install "maturin[patchelf]"
maturin develop --release
```
### Usage
Two use cases are supported by this library: parsing and constraining.
#### Parsing
Given a context-free grammar, parse a string and return the corresponding parse tree.
```python
from grammar_utils.parse import load_lr1_parser
parser = load_lr1_parser("json")
tree = parser.parse('{"key": "value"}')
print(tree)
# you can also get a pruned parse tree, skipping empty nodes and collapsing single-child nodes
pruned_tree = parser.parse('{"key": "value"}', skip_empty=True, collapse_single=True)
print(pruned_tree)
```
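The bundled SPARQL grammar can presumably be loaded the same way. Note that the lowercase name `"sparql"` in the sketch below is an assumption, inferred from how the JSON grammar is loaded by its lowercase name above:

```python
from grammar_utils.parse import load_lr1_parser

# assumption: the bundled SPARQL grammar is registered under "sparql",
# mirroring the lowercase "json" name used above
sparql_parser = load_lr1_parser("sparql")
tree = sparql_parser.parse("SELECT ?s WHERE { ?s ?p ?o }")
print(tree)
```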
Parsing is also supported for prefixes, in which case the input should be bytes
rather than a string. Here a tree covering the terminals that are already fixed is
returned, together with the suffix of the input for which the next terminal is not yet determined.
```python
from grammar_utils.parse import load_lr1_parser
parser = load_lr1_parser("json")
tree, rest = parser.prefix_parse(b'{"key"')
print(tree)
print(rest)
# pruning is also supported here
pruned_tree, rest = parser.prefix_parse(b'{"key"', skip_empty=True, collapse_single=True)
print(pruned_tree)
print(rest)
```
You can also use your own grammars.
```python
from grammar_utils import load_byte_vocab
from grammar_utils.parse import LR1Parser

# define your own grammar and lexer
grammar = "..."
lexer = "..."
# a vocabulary is required as well, e.g. the byte-level one from this package
vocab = load_byte_vocab()
parser = LR1Parser(grammar, lexer, vocab)
```
#### Constraining
Constraints are used to check what symbols from the vocabulary can follow the current prefix
such that the regular expression or context-free grammar can still be satisfied.
```python
import random
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint, load_regex_constraint
vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)
# reset constraint to a given prefix, default is an empty prefix
constraint.reset(b'{"key"')
# get the next possible symbols
next_indices = constraint.get()
# the indices refer to entries in the vocabulary (decoded here only for readability)
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
# you can forward the constraint with a valid index
constraint.next(random.choice(next_indices))
# check if constraint is satisfied (should be False)
print(constraint.is_match())
# same for regular expressions
constraint = load_regex_constraint("boolean", vocab)
constraint.reset(b"tr")
next_indices = constraint.get()
# should only be 'u'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
print(constraint.is_match())
next_indices = constraint.get()
# should only be 'e'
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
constraint.next(next_indices[0])
# should be True
print(constraint.is_match())
```
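Putting these calls together, here is a minimal sketch of a purely random generation loop that extends a prefix until the JSON grammar is satisfied. It uses only the methods shown above; the starting prefix and the step cap are arbitrary choices for this example.

```python
import random

from grammar_utils import load_byte_vocab
from grammar_utils.constrain import load_lr1_constraint

vocab = load_byte_vocab()
constraint = load_lr1_constraint("json", vocab)

prefix = bytearray(b"[")
constraint.reset(bytes(prefix))
for _ in range(100):  # arbitrary cap to keep the example short
    if constraint.is_match() or constraint.is_invalid():
        break
    # advance with a random choice among the allowed continuations
    index = random.choice(constraint.get())
    constraint.next(index)
    prefix.extend(vocab[index])

# random bytes inside JSON strings may not be valid UTF-8
print(prefix.decode(errors="replace"))
```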
You can also use your own grammars and regexes.
```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import LR1Constraint, RegexConstraint
vocab = load_byte_vocab()
# define your own grammar and lexer
grammar = "..."
lexer = "..."
constraint = LR1Constraint(grammar, lexer, vocab)
# define your own regex
regex = "..."
constraint = RegexConstraint(regex, vocab)
```
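For instance, replacing the `"..."` placeholder with a concrete pattern, a constraint for ISO-style dates could look as follows. The pattern is illustrative and assumes the regex flavor supports character classes and bounded repetition:

```python
from grammar_utils import load_byte_vocab
from grammar_utils.constrain import RegexConstraint

vocab = load_byte_vocab()
# illustrative pattern for dates such as 2024-11-22
constraint = RegexConstraint(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", vocab)
constraint.reset(b"2024-1")
next_indices = constraint.get()
# should be the digits 0-9
print(f"allowed continuations: {[bytes(vocab[i]).decode() for i in next_indices]}")
```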
### Use cases
#### Forcing a language model to generate structured text
The following example shows how to use a regex constraint to force GPT2
to output either "true" or "false" after a given prompt.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from grammar_utils.constrain import load_regex_constraint
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = [
    token.replace("Ġ", " ").encode()
    for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
]
constraint = load_regex_constraint("boolean", vocab)
prefix = "Constrained decoding is cool: "
input_ids = tokenizer.encode(prefix)
while not (constraint.is_match() or constraint.is_invalid()):
    input_tensor = torch.tensor([input_ids])
    logits = gpt2(input_tensor).logits
    valid_indices = torch.from_numpy(constraint.get())
    valid_logits = logits[0, -1, valid_indices]
    # convert the selected tensor index to a plain Python int
    index = valid_indices[torch.argmax(valid_logits)].item()
    constraint.next(index)
    input_ids.append(index)
    print(tokenizer.decode(input_ids))
```
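The loop above always picks the highest-scoring valid token (greedy decoding). Sampling among the valid tokens instead is a small change; the temperature value in this sketch is an arbitrary choice:

```python
# replace the argmax selection inside the loop with sampling over
# the valid tokens, renormalized with an (arbitrary) temperature of 0.8
valid_probs = torch.softmax(valid_logits / 0.8, dim=-1)
index = valid_indices[torch.multinomial(valid_probs, num_samples=1)].item()
```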