llm-structured-output

Name	llm-structured-output JSON
Version	0.0.19 JSON
	download
home_page	None
Summary	Constrain LLM generation to structured output, such as function calling and tool use
upload_time	2024-10-28 23:19:55
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	None
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # LLM Structured Output: JSON Schema, Function Calling, Tools

This repository contains a library to constrain LLM generation to structured
output, such as function calling a.k.a. tool use.

We include examples of application implementations using the MLX library.

Differences with other approaches:

- "JSON mode": this library constrains output to be valid JSON, but goes
  beyond JSON mode in also enforcing a JSON schema. This enables much tighter
  steeing: specifying data types, property names, etc.

- GBNF translation: rather than converting the JSON schema to a formal grammar,
  we steer the output directly using the schema, which enables more flexible
  and deeper control with lower overhead. For example, expressing minimum and
  maximum array or string lengths in GBNF can lead to very large set of
  production rules, and certain JSON schema features are simply not possible.

- Fine-tuning: our approach is complementary to fine-tuning an LLM to produce
  structured output. While fine-tuning currently can enhance but not guarantee
  adherence to a schema, our system introduces strong guarantees on the output.

## Demo

https://github.com/otriscon/llm-structured-output/assets/165947759/f38704da-34b0-4601-be8b-48b92199445d

Without a schema, Mistral 7B Instruct 0.2 solves the data extraction task but,
despite our instructions to the contrary, it adds a lot of additional output that's
not necessary, is hard to parse, and wastes time.

https://github.com/otriscon/llm-structured-output/assets/165947759/f79a78ca-8244-4ec6-9e90-b6cdedfbb8b0

With the schema, the generation is precisely the output we require.

## What's in the box

You'll find:

- A framefork and set of acceptors for constraining LLM output, which are
  application-independent.

- Reference implementations and examples using Apple's MLX library.

### Framework and JSON acceptors

- An acceptor/state machine framework which progresses all valid states of a
  given graph simultaneously. This minimizes the need for backtracking, which
  is expensive for LLMs as it would require re-computing past tokens. In this
  sense, the concept is similar to a chart parser or Earley-style recognizer
  and shares a similar motivation. In practice, it's quite different because
  we're dealing with token-level input. We implemented several optimizations
  to minimize combinatorial explosion: we use a trie to traverse the token
  vocabulary in logarithmic time, and collapse the trie branches when multiple
  options are equivalent. We also prune the chart by removing equivalent
  states arrived at by different paths. See [acceptor.py](src/llm_structured_output/acceptor.py).

- A JSON acceptor based on the framework above that accepts valid JSON. See
  [json_acceptor.py](src/llm_structured_output/json_acceptor.py).

- A JSON schema acceptor based on both items above that accepts valid JSON that
  conforms to a JSON schema. See [json_schema_acceptor.py](src/llm_structured_output/json_schema_acceptor.py).
  Please note that most but not all JSON schema directives are implemented.
  Please open an issue if one that you need is not.

### Reference implementation / examples

- An example of using the acceptors above to guide decoding in an LLM using
  Apple's MLX framework. See [llm_schema.py](src/examples/llm_schema.py).
  This example includes several decoding techniques, including pre-emptive evaluation,
  which is a way to use the acceptor to anticipate the tokens that can be generated
  according to the schema, and use that to evaluate two tokens at a time instead of
  one, sometimes leading to noticeable performance improvements.

- A server example that implements an OpenAI-compatible API including tools / function
  calling. Unlike [OpenAI's](https://platform.openai.com/docs/api-reference/chat/object),
  this implementation always generates valid JSON, and does not return hallucinated
  parameters not defined in your function schema (but it may still hallucinate their
  values). See [server.py](src/examples/server.py).

## Usage

### Run the examples on Apple hardware with MLX

Clone this repo:

```sh
git clone https://github.com/otriscon/llm-structured-output.git
cd llm-structured-output
```

Optional, but recommended: create and activate a virtual environment with your favorite tool of choice, e.g.

```sh
python -m venv .venv
source .venv/bin/activate
```

Move into the examples folder and install the requirements, then move back:

```sh
cd src/examples
pip install -r requirements.txt
cd ..
```

Run the llm_schema example:

```sh
MODEL=mistralai/Mistral-7B-Instruct-v0.2

LLM_PROMPT='[INST] Parse the following address into a JSON object: "27 Barrow St, New York, NY 10014". Your answer should be only a JSON object according to this schema: {"type": "object", "properties": {"streetNumber": {"type": "number"}, "streetName": {"type": "string"}, "city": {"type": {"string"}}, "state": {"type": "string"}, "zipCode": {"type": "number"}}}. Do not explain the result, just output it. Do not add any additional information. [/INST]'

LLM_SCHEMA='{"type": "object", "properties": {"streetNumber": {"type": "number"}, "streetName": {"type": "string"}, "city": {"type": "string"}, "state": {"type": "string"}, "zipCode": {"type": "number"}}}'

python3 -m examples.llm_schema --model-path $MODEL --prompt "$LLM_PROMPT" --schema "$LLM_SCHEMA" --max-tokens 1000 --repeat-prompt
```

Run the server example:

```sh
MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2 uvicorn examples.server:app --port 8080 --reload
```

Try calling the server with this example adapted from [the OpenAI documentation (click on the example request titled _Functions_)](https://platform.openai.com/docs/api-reference/chat/create):
```sh
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ignored",
  "messages": [
    {
      "role": "user",
      "content": "What'\''s the weather like in Boston today?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'
```

### Using the JSON schema acceptor in your project

Install in your project with `pip install llm-structured-output` and
use a `JsonSchemaAcceptorDriver` within your normal generation loop:

```python
import json
import mlx.core as mx
from mlx_lm.utils import load # Needs pip import mlx_lm
from llm_structured_output import JsonSchemaAcceptorDriver, HuggingfaceTokenizerHelper, bias_logits


MODEL_PATH = "mistralai/Mistral-7B-Instruct-v0.2"
SCHEMA = {
    "type": "object",
    "properties": {
        "streetNumber": {"type": "number"},
        "streetName": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "zipCode": {"type": "number"},
    },
}
PROMPT = f'''
[INST] Parse the following address into a JSON object: "27 Barrow St, New York, NY 10014".
Your answer should be only a JSON object according to this schema: {json.dumps(SCHEMA)}
Do not explain the result, just output it. Do not add any additional information. [/INST]
'''


# Load the model as usual.
model, tokenizer = load(MODEL_PATH)

# Instantiate a token acceptor
tokenizer_helper = HuggingfaceTokenizerHelper(tokenizer)
vocabulary, eos_id = tokenizer_helper.extract_vocabulary()
token_acceptor_factory = JsonSchemaAcceptorDriver.driver_factory_for_model(vocabulary, eos_id)
token_acceptor = token_acceptor_factory(SCHEMA)

cache = None
tokens = tokenizer_helper.encode_prompt(PROMPT)

while tokens[-1] != eos_id:
    # Evaluate the model as usual. 
    logits, cache = model(mx.array(tokens)[None], cache)

    # Set probability to -inf for invalid tokens.
    accepted_token_bitmap = token_acceptor.select_valid_tokens()
    logits = bias_logits(mx, logits[0, -1, :], accepted_token_bitmap)

    # Sample as usual, e.g.:
    token = mx.argmax(logits, axis=-1).item()

    if token == eos_id:
      break

    # Store or use the generated token.
    tokens = [token]
    text = tokenizer_helper.no_strip_decode(tokens)
    print(text, end="")

    # Advance the acceptor to the next state.
    token_acceptor.advance_token(token)
```

## A note about guarantees on the output

Constraining the output of an LLM to follow a schema doesn't magically make the
LLM great at producing output that solves a particular task.

If an LLM that is not prompted or fine-tuned correctly to solve the task, it
will produce syntactically valid output but the values inside won't necessarily
constitute a good solution. As with any other technique, proper LLM prompting
and/or n-shot examples are crucial to avoid getting nice-looking,
well-formatted, schema-compliant nonsense.

In particular, it's crucial to instruct the LLM regarding the desired output
format, including making the desired schema part of the prompt. Here's an
example of a prompt that includes the schema:

```
Parse the following address into a JSON object: "27 Barrow St, New York, NY 10014".
Your answer should be only a JSON object according to this schema: {"type": "object", "properties": {"streetNumber": {"type": "number"}, "streetName": {"type": "string"}, "city": {"type": {"string"}}, "state": {"type": "string"}, "zipCode": {"type": "number"}}}.
Do not explain the result, just output it. Do not add any additional information.
```

In order to give the LLM a scratch-pad prior to JSON generation for e.g.
chain-of-thought reasoning, we have included an option for the acceptor to kick in
only on output within a section delimited by the lines `` ```json `` and `` ``` ``,
with the prior output treated as free text. This is enabled with the `is_encapsulated_json`
option of the `JsonSchemaAcceptorDriver` constructor. Here's an example of a
prompt that produces encapsulated JSON:
```
Your mission is to parse the following address into a JSON object: "27 Barrow St, New York, NY 10014".
Your answer should be a JSON object according to this schema: {"type": "object", "properties": {"streetNumber": {"type": "number"}, "streetName": {"type": "string"}, "city": {"type": {"string"}}, "state": {"type": "string"}, "zipCode": {"type": "number"}}}.
First, think through the task step by step, and then output a JSON object wrapped between the lines ```json and ```.
```

In our OpenAI-compatible server example, when the request specifies `tool_calls` or a
legacy `function_call`, we automatically prepend a system message to the prompt with
the schema and instructions for the LLM to use the tools provided. If your prompt already
includes these instructions (because e.g. you want to customize them), this can be disabled
with a non-standard option in the request payload: `"tool_options": { "no_prompt_steering": true }`


## Testing

The library has been tested with the following datasets:

- [Fireworks.ai's function calling eval dataset](https://huggingface.co/datasets/fireworks-ai/function-calling-eval-dataset-v0/)

- [ALU.AI's table extraction](https://blog.alu.ai/tables-and-structured-data/) evaluation dataset (not yet open-source)

## Evaluations

We're starting to perform evaluations to understand how well different LLMs perform
in function calling tasks. The tools and data can be found in the [src/tests](src/tests/) folder.

### Fireworks.ai function calling eval dataset

Environment:

- llm_structured_output v0.0.15
- mlx 0.14.1
- 2023 Mac Studio M2 Ultra 24 cores (16 performance and 8 efficiency) 192 GB RAM running macOS Sonoma 14.5
- LLM: mlx-community/Meta-Llama-3-8B-Instruct-4bit
- Benchmarking LLM: gpt-4o-2024-05-13

Results:

- [multi-turn dataset report](src/tests/data/fireworks-ai_function-calling-eval-dataset-v0/report-multi_turn.md)

- [single-turn dataset report](src/tests/data/fireworks-ai_function-calling-eval-dataset-v0/report-single_turn.md)


## Performance

Since we need to select the acceptable tokens prior to sampling, constraining
the output according to a schema introduces a delay for every token, which
depends on the complexity of the schema. On the other hand, since the output is
guaranteed to be valid JSON and to conform to the schema, it can reduce the
number of tokens generated and reduce or eliminate the number of retries
required to solve the task.

### Pre-emptive decoding experiment
As an experiment to improve performance, we implement the option to use
pre-emptive decoding: when the range of tokens that can be accepted after the
current one is small, as often happens with structured output, we submit to the
LLM a batch of two-token continuations where the first token is the one that
was to be evaluated anyway, and the second token in each item in the batch is
one of the possible continuations predicted according to the schema. We can
then sample two tokens instead of one.  We find that this approach can
occasionally produce considerable increases in token generation speed, but in
general it can also considerably slow it down, depending on model and
quantization. We found that it works better with no fp16 models (no quantization),
but batching performance degrades vastly in quantized models making pre-emptive
decoding not worth it for those models.

### Benchmarks

- The following tests were perfomed on an Apple Studio with an M2 Ultra (24 core)
with 192GB of RAM using MLX version 0.9.0, with models converted to MLX format.

- The results are the average of 5 runs on a simple data extraction task with a
127-token prompt.

- Pre-emptive decoding was tested in two different forms: with a constant batch
  size, where we always sent the same size matrices for evaluation, and variable-
  size batching, where we made the batch large or shorter depending on the numer
  of possible follow-up tokens.

<br>

| Mistral-7B-v0.2-Instruct (fp16) | Prompt tps | Generation tps | Generation tokens |
| --- | :-: | :-: | :-: |
| No schema | 305.82 | 34.76 | 321 |
| Schema | 307.00 |	31.70 | 42 |
| Pre-emptive constant batch =5 | 211.72 | 33.16 | 42 |
| Pre-emptive variable batch <=5 | 321.85 | 36.53  | 42 |


**Notes:**

- Pre-emptive decoding accelerates generation even over schemaless generation.

<br>
<br>

| Mistral-7B-v0.2-Instruct (q4) | Prompt tps | Generation tps | Generation tokens |
| --- | :-: | :-: | :-: |
| No schema | 487.19 | 86.36 | 137 |
| Schema | 487.83 | 67.60 | 42 |
| Pre-emptive constant batch =5 | 139.61 | 27.16 | 42 |
| Pre-emptive variable batch <=5 | 488.88 | 36.25 | 42 |

**Notes:**

- Pre-emptive decoding is vastly slower, with the only change being quantization.

<br>
<br>

| Mixtral-8x7B-Instruct-v0.1 (fp16) | Prompt tps | Generation tps | Generation tokens |
| --- | :-: | :-: | :-: |
| No schema | 3.48 | 2.23 | 50 |
| Schema | 3.49 | 2.21 | 50 |
| Pre-emptive constant batch =5 |2.36 | 1.16 | 50 |
| Pre-emptive variable batch <=5 | 3.18 | 1.68 | 50 |

**Notes:**

- This is the only tested model that outputs schema-conforming output without a schema.

- Pre-emptive decoding is a lot slower again.

<br>
<br>

| Mixtral-8x7B-Instruct-v0.1 (q4) | Prompt tps | Generation tps | Generation tokens |
| --- | :-: | :-: | :-: |
| No schema | 15.02 | 32.21 | 165 |
| Schema | 14.94 | 23.75 | 50 |
| Pre-emptive constant batch =5 | 9.29 | 11.28 | 50 |
| Pre-emptive variable batch <=5 | 15.02 | 17.94 | 50 |

## Roadmap

- Extend JSON schema support as needed (see TODOs in code). Please, feel free to
  open an issue if you need a feature that not supported at the moment. Also open to
  implement additional schemas such as YAML and reference implementations for other LLMs.

- Add formal test cases.

- Reference implementation for the Transformers library.

- Port to C++ and reference implementation for llama.cpp

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "llm-structured-output",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "\"Oscar D.P. Triscon\" <github@triscon.com>",
    "download_url": "https://files.pythonhosted.org/packages/be/46/34f421b38f45ccc73e6dfd8cce96d7dc54eeaa5a86fc48927fbe2a0470e6/llm_structured_output-0.0.19.tar.gz",
    "platform": null,
    "description": "# LLM Structured Output: JSON Schema, Function Calling, Tools\n\nThis repository contains a library to constrain LLM generation to structured\noutput, such as function calling a.k.a. tool use.\n\nWe include examples of application implementations using the MLX library.\n\nDifferences with other approaches:\n\n- \"JSON mode\": this library constrains output to be valid JSON, but goes\n  beyond JSON mode in also enforcing a JSON schema. This enables much tighter\n  steeing: specifying data types, property names, etc.\n\n- GBNF translation: rather than converting the JSON schema to a formal grammar,\n  we steer the output directly using the schema, which enables more flexible\n  and deeper control with lower overhead. For example, expressing minimum and\n  maximum array or string lengths in GBNF can lead to very large set of\n  production rules, and certain JSON schema features are simply not possible.\n\n- Fine-tuning: our approach is complementary to fine-tuning an LLM to produce\n  structured output. While fine-tuning currently can enhance but not guarantee\n  adherence to a schema, our system introduces strong guarantees on the output.\n\n## Demo\n\nhttps://github.com/otriscon/llm-structured-output/assets/165947759/f38704da-34b0-4601-be8b-48b92199445d\n\nWithout a schema, Mistral 7B Instruct 0.2 solves the data extraction task but,\ndespite our instructions to the contrary, it adds a lot of additional output that's\nnot necessary, is hard to parse, and wastes time.\n\nhttps://github.com/otriscon/llm-structured-output/assets/165947759/f79a78ca-8244-4ec6-9e90-b6cdedfbb8b0\n\nWith the schema, the generation is precisely the output we require.\n\n## What's in the box\n\nYou'll find:\n\n- A framefork and set of acceptors for constraining LLM output, which are\n  application-independent.\n\n- Reference implementations and examples using Apple's MLX library.\n\n### Framework and JSON acceptors\n\n- An acceptor/state machine framework which progresses all valid states of a\n  given graph simultaneously. This minimizes the need for backtracking, which\n  is expensive for LLMs as it would require re-computing past tokens. In this\n  sense, the concept is similar to a chart parser or Earley-style recognizer\n  and shares a similar motivation. In practice, it's quite different because\n  we're dealing with token-level input. We implemented several optimizations\n  to minimize combinatorial explosion: we use a trie to traverse the token\n  vocabulary in logarithmic time, and collapse the trie branches when multiple\n  options are equivalent. We also prune the chart by removing equivalent\n  states arrived at by different paths. See [acceptor.py](src/llm_structured_output/acceptor.py).\n\n- A JSON acceptor based on the framework above that accepts valid JSON. See\n  [json_acceptor.py](src/llm_structured_output/json_acceptor.py).\n\n- A JSON schema acceptor based on both items above that accepts valid JSON that\n  conforms to a JSON schema. See [json_schema_acceptor.py](src/llm_structured_output/json_schema_acceptor.py).\n  Please note that most but not all JSON schema directives are implemented.\n  Please open an issue if one that you need is not.\n\n### Reference implementation / examples\n\n- An example of using the acceptors above to guide decoding in an LLM using\n  Apple's MLX framework. See [llm_schema.py](src/examples/llm_schema.py).\n  This example includes several decoding techniques, including pre-emptive evaluation,\n  which is a way to use the acceptor to anticipate the tokens that can be generated\n  according to the schema, and use that to evaluate two tokens at a time instead of\n  one, sometimes leading to noticeable performance improvements.\n\n- A server example that implements an OpenAI-compatible API including tools / function\n  calling. Unlike [OpenAI's](https://platform.openai.com/docs/api-reference/chat/object),\n  this implementation always generates valid JSON, and does not return hallucinated\n  parameters not defined in your function schema (but it may still hallucinate their\n  values). See [server.py](src/examples/server.py).\n\n## Usage\n\n### Run the examples on Apple hardware with MLX\n\nClone this repo:\n\n```sh\ngit clone https://github.com/otriscon/llm-structured-output.git\ncd llm-structured-output\n```\n\nOptional, but recommended: create and activate a virtual environment with your favorite tool of choice, e.g.\n\n```sh\npython -m venv .venv\nsource .venv/bin/activate\n```\n\nMove into the examples folder and install the requirements, then move back:\n\n```sh\ncd src/examples\npip install -r requirements.txt\ncd ..\n```\n\nRun the llm_schema example:\n\n```sh\nMODEL=mistralai/Mistral-7B-Instruct-v0.2\n\nLLM_PROMPT='[INST] Parse the following address into a JSON object: \"27 Barrow St, New York, NY 10014\". Your answer should be only a JSON object according to this schema: {\"type\": \"object\", \"properties\": {\"streetNumber\": {\"type\": \"number\"}, \"streetName\": {\"type\": \"string\"}, \"city\": {\"type\": {\"string\"}}, \"state\": {\"type\": \"string\"}, \"zipCode\": {\"type\": \"number\"}}}. Do not explain the result, just output it. Do not add any additional information. [/INST]'\n\nLLM_SCHEMA='{\"type\": \"object\", \"properties\": {\"streetNumber\": {\"type\": \"number\"}, \"streetName\": {\"type\": \"string\"}, \"city\": {\"type\": \"string\"}, \"state\": {\"type\": \"string\"}, \"zipCode\": {\"type\": \"number\"}}}'\n\npython3 -m examples.llm_schema --model-path $MODEL --prompt \"$LLM_PROMPT\" --schema \"$LLM_SCHEMA\" --max-tokens 1000 --repeat-prompt\n```\n\nRun the server example:\n\n```sh\nMODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2 uvicorn examples.server:app --port 8080 --reload\n```\n\nTry calling the server with this example adapted from [the OpenAI documentation (click on the example request titled _Functions_)](https://platform.openai.com/docs/api-reference/chat/create):\n```sh\ncurl http://localhost:8080/v1/chat/completions \\\n-H \"Content-Type: application/json\" \\\n-d '{\n  \"model\": \"ignored\",\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"What'\\''s the weather like in Boston today?\"\n    }\n  ],\n  \"tools\": [\n    {\n      \"type\": \"function\",\n      \"function\": {\n        \"name\": \"get_current_weather\",\n        \"description\": \"Get the current weather in a given location\",\n        \"parameters\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"location\": {\n              \"type\": \"string\",\n              \"description\": \"The city and state, e.g. San Francisco, CA\"\n            },\n            \"unit\": {\n              \"type\": \"string\",\n              \"enum\": [\"celsius\", \"fahrenheit\"]\n            }\n          },\n          \"required\": [\"location\"]\n        }\n      }\n    }\n  ],\n  \"tool_choice\": \"auto\"\n}'\n```\n\n### Using the JSON schema acceptor in your project\n\nInstall in your project with `pip install llm-structured-output` and\nuse a `JsonSchemaAcceptorDriver` within your normal generation loop:\n\n```python\nimport json\nimport mlx.core as mx\nfrom mlx_lm.utils import load # Needs pip import mlx_lm\nfrom llm_structured_output import JsonSchemaAcceptorDriver, HuggingfaceTokenizerHelper, bias_logits\n\n\nMODEL_PATH = \"mistralai/Mistral-7B-Instruct-v0.2\"\nSCHEMA = {\n    \"type\": \"object\",\n    \"properties\": {\n        \"streetNumber\": {\"type\": \"number\"},\n        \"streetName\": {\"type\": \"string\"},\n        \"city\": {\"type\": \"string\"},\n        \"state\": {\"type\": \"string\"},\n        \"zipCode\": {\"type\": \"number\"},\n    },\n}\nPROMPT = f'''\n[INST] Parse the following address into a JSON object: \"27 Barrow St, New York, NY 10014\".\nYour answer should be only a JSON object according to this schema: {json.dumps(SCHEMA)}\nDo not explain the result, just output it. Do not add any additional information. [/INST]\n'''\n\n\n# Load the model as usual.\nmodel, tokenizer = load(MODEL_PATH)\n\n# Instantiate a token acceptor\ntokenizer_helper = HuggingfaceTokenizerHelper(tokenizer)\nvocabulary, eos_id = tokenizer_helper.extract_vocabulary()\ntoken_acceptor_factory = JsonSchemaAcceptorDriver.driver_factory_for_model(vocabulary, eos_id)\ntoken_acceptor = token_acceptor_factory(SCHEMA)\n\ncache = None\ntokens = tokenizer_helper.encode_prompt(PROMPT)\n\nwhile tokens[-1] != eos_id:\n    # Evaluate the model as usual. \n    logits, cache = model(mx.array(tokens)[None], cache)\n\n    # Set probability to -inf for invalid tokens.\n    accepted_token_bitmap = token_acceptor.select_valid_tokens()\n    logits = bias_logits(mx, logits[0, -1, :], accepted_token_bitmap)\n\n    # Sample as usual, e.g.:\n    token = mx.argmax(logits, axis=-1).item()\n\n    if token == eos_id:\n      break\n\n    # Store or use the generated token.\n    tokens = [token]\n    text = tokenizer_helper.no_strip_decode(tokens)\n    print(text, end=\"\")\n\n    # Advance the acceptor to the next state.\n    token_acceptor.advance_token(token)\n```\n\n## A note about guarantees on the output\n\nConstraining the output of an LLM to follow a schema doesn't magically make the\nLLM great at producing output that solves a particular task.\n\nIf an LLM that is not prompted or fine-tuned correctly to solve the task, it\nwill produce syntactically valid output but the values inside won't necessarily\nconstitute a good solution. As with any other technique, proper LLM prompting\nand/or n-shot examples are crucial to avoid getting nice-looking,\nwell-formatted, schema-compliant nonsense.\n\nIn particular, it's crucial to instruct the LLM regarding the desired output\nformat, including making the desired schema part of the prompt. Here's an\nexample of a prompt that includes the schema:\n\n```\nParse the following address into a JSON object: \"27 Barrow St, New York, NY 10014\".\nYour answer should be only a JSON object according to this schema: {\"type\": \"object\", \"properties\": {\"streetNumber\": {\"type\": \"number\"}, \"streetName\": {\"type\": \"string\"}, \"city\": {\"type\": {\"string\"}}, \"state\": {\"type\": \"string\"}, \"zipCode\": {\"type\": \"number\"}}}.\nDo not explain the result, just output it. Do not add any additional information.\n```\n\nIn order to give the LLM a scratch-pad prior to JSON generation for e.g.\nchain-of-thought reasoning, we have included an option for the acceptor to kick in\nonly on output within a section delimited by the lines `` ```json `` and `` ``` ``,\nwith the prior output treated as free text. This is enabled with the `is_encapsulated_json`\noption of the `JsonSchemaAcceptorDriver` constructor. Here's an example of a\nprompt that produces encapsulated JSON:\n```\nYour mission is to parse the following address into a JSON object: \"27 Barrow St, New York, NY 10014\".\nYour answer should be a JSON object according to this schema: {\"type\": \"object\", \"properties\": {\"streetNumber\": {\"type\": \"number\"}, \"streetName\": {\"type\": \"string\"}, \"city\": {\"type\": {\"string\"}}, \"state\": {\"type\": \"string\"}, \"zipCode\": {\"type\": \"number\"}}}.\nFirst, think through the task step by step, and then output a JSON object wrapped between the lines ```json and ```.\n```\n\nIn our OpenAI-compatible server example, when the request specifies `tool_calls` or a\nlegacy `function_call`, we automatically prepend a system message to the prompt with\nthe schema and instructions for the LLM to use the tools provided. If your prompt already\nincludes these instructions (because e.g. you want to customize them), this can be disabled\nwith a non-standard option in the request payload: `\"tool_options\": { \"no_prompt_steering\": true }`\n\n\n## Testing\n\nThe library has been tested with the following datasets:\n\n- [Fireworks.ai's function calling eval dataset](https://huggingface.co/datasets/fireworks-ai/function-calling-eval-dataset-v0/)\n\n- [ALU.AI's table extraction](https://blog.alu.ai/tables-and-structured-data/) evaluation dataset (not yet open-source)\n\n## Evaluations\n\nWe're starting to perform evaluations to understand how well different LLMs perform\nin function calling tasks. The tools and data can be found in the [src/tests](src/tests/) folder.\n\n### Fireworks.ai function calling eval dataset\n\nEnvironment:\n\n- llm_structured_output v0.0.15\n- mlx 0.14.1\n- 2023 Mac Studio M2 Ultra 24 cores (16 performance and 8 efficiency) 192 GB RAM running macOS Sonoma 14.5\n- LLM: mlx-community/Meta-Llama-3-8B-Instruct-4bit\n- Benchmarking LLM: gpt-4o-2024-05-13\n\nResults:\n\n- [multi-turn dataset report](src/tests/data/fireworks-ai_function-calling-eval-dataset-v0/report-multi_turn.md)\n\n- [single-turn dataset report](src/tests/data/fireworks-ai_function-calling-eval-dataset-v0/report-single_turn.md)\n\n\n## Performance\n\nSince we need to select the acceptable tokens prior to sampling, constraining\nthe output according to a schema introduces a delay for every token, which\ndepends on the complexity of the schema. On the other hand, since the output is\nguaranteed to be valid JSON and to conform to the schema, it can reduce the\nnumber of tokens generated and reduce or eliminate the number of retries\nrequired to solve the task.\n\n### Pre-emptive decoding experiment\nAs an experiment to improve performance, we implement the option to use\npre-emptive decoding: when the range of tokens that can be accepted after the\ncurrent one is small, as often happens with structured output, we submit to the\nLLM a batch of two-token continuations where the first token is the one that\nwas to be evaluated anyway, and the second token in each item in the batch is\none of the possible continuations predicted according to the schema. We can\nthen sample two tokens instead of one.  We find that this approach can\noccasionally produce considerable increases in token generation speed, but in\ngeneral it can also considerably slow it down, depending on model and\nquantization. We found that it works better with no fp16 models (no quantization),\nbut batching performance degrades vastly in quantized models making pre-emptive\ndecoding not worth it for those models.\n\n### Benchmarks\n\n- The following tests were perfomed on an Apple Studio with an M2 Ultra (24 core)\nwith 192GB of RAM using MLX version 0.9.0, with models converted to MLX format.\n\n- The results are the average of 5 runs on a simple data extraction task with a\n127-token prompt.\n\n- Pre-emptive decoding was tested in two different forms: with a constant batch\n  size, where we always sent the same size matrices for evaluation, and variable-\n  size batching, where we made the batch large or shorter depending on the numer\n  of possible follow-up tokens.\n\n<br>\n\n| Mistral-7B-v0.2-Instruct (fp16) | Prompt tps | Generation tps | Generation tokens |\n| --- | :-: | :-: | :-: |\n| No schema | 305.82 | 34.76 | 321 |\n| Schema | 307.00 |\t31.70 | 42 |\n| Pre-emptive constant batch =5 | 211.72 | 33.16 | 42 |\n| Pre-emptive variable batch <=5 | 321.85 | 36.53  | 42 |\n\n\n**Notes:**\n\n- Pre-emptive decoding accelerates generation even over schemaless generation.\n\n<br>\n<br>\n\n| Mistral-7B-v0.2-Instruct (q4) | Prompt tps | Generation tps | Generation tokens |\n| --- | :-: | :-: | :-: |\n| No schema | 487.19 | 86.36 | 137 |\n| Schema | 487.83 | 67.60 | 42 |\n| Pre-emptive constant batch =5 | 139.61 | 27.16 | 42 |\n| Pre-emptive variable batch <=5 | 488.88 | 36.25 | 42 |\n\n**Notes:**\n\n- Pre-emptive decoding is vastly slower, with the only change being quantization.\n\n<br>\n<br>\n\n| Mixtral-8x7B-Instruct-v0.1 (fp16) | Prompt tps | Generation tps | Generation tokens |\n| --- | :-: | :-: | :-: |\n| No schema | 3.48 | 2.23 | 50 |\n| Schema | 3.49 | 2.21 | 50 |\n| Pre-emptive constant batch =5 |2.36 | 1.16 | 50 |\n| Pre-emptive variable batch <=5 | 3.18 | 1.68 | 50 |\n\n**Notes:**\n\n- This is the only tested model that outputs schema-conforming output without a schema.\n\n- Pre-emptive decoding is a lot slower again.\n\n<br>\n<br>\n\n| Mixtral-8x7B-Instruct-v0.1 (q4) | Prompt tps | Generation tps | Generation tokens |\n| --- | :-: | :-: | :-: |\n| No schema | 15.02 | 32.21 | 165 |\n| Schema | 14.94 | 23.75 | 50 |\n| Pre-emptive constant batch =5 | 9.29 | 11.28 | 50 |\n| Pre-emptive variable batch <=5 | 15.02 | 17.94 | 50 |\n\n## Roadmap\n\n- Extend JSON schema support as needed (see TODOs in code). Please, feel free to\n  open an issue if you need a feature that not supported at the moment. Also open to\n  implement additional schemas such as YAML and reference implementations for other LLMs.\n\n- Add formal test cases.\n\n- Reference implementation for the Transformers library.\n\n- Port to C++ and reference implementation for llama.cpp\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Constrain LLM generation to structured output, such as function calling and tool use",
    "version": "0.0.19",
    "project_urls": {
        "Homepage": "https://github.com/otriscon/llm-structured-output",
        "Issues": "https://github.com/otriscon/llm-structured-output/issues"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b8164f6785175cb1a879809aaaa0b5d904564cc2b4148ea1fdb9a101175174f8",
                "md5": "38fabbf759b5aee95f5373b19d80f6d2",
                "sha256": "82a335be05e4e33a4754a1d008d154186c2a8b3e35d41a89ba4acbcb588559a7"
            },
            "downloads": -1,
            "filename": "llm_structured_output-0.0.19-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "38fabbf759b5aee95f5373b19d80f6d2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 29763,
            "upload_time": "2024-10-28T23:19:54",
            "upload_time_iso_8601": "2024-10-28T23:19:54.349094Z",
            "url": "https://files.pythonhosted.org/packages/b8/16/4f6785175cb1a879809aaaa0b5d904564cc2b4148ea1fdb9a101175174f8/llm_structured_output-0.0.19-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "be4634f421b38f45ccc73e6dfd8cce96d7dc54eeaa5a86fc48927fbe2a0470e6",
                "md5": "832db15ad1fa4c0960fbf92f3217d966",
                "sha256": "ef27f91096a787b25b12117fc5dd95ecab25198aca8f2595ca63d2ee51b5554d"
            },
            "downloads": -1,
            "filename": "llm_structured_output-0.0.19.tar.gz",
            "has_sig": false,
            "md5_digest": "832db15ad1fa4c0960fbf92f3217d966",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 25435,
            "upload_time": "2024-10-28T23:19:55",
            "upload_time_iso_8601": "2024-10-28T23:19:55.353424Z",
            "url": "https://files.pythonhosted.org/packages/be/46/34f421b38f45ccc73e6dfd8cce96d7dc54eeaa5a86fc48927fbe2a0470e6/llm_structured_output-0.0.19.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-28 23:19:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "otriscon",
    "github_project": "llm-structured-output",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "llm-structured-output"
}

None