Name | json-repair |
Version | 0.32.0 |
home_page | None |
Summary | A package to repair broken json strings |
upload_time | 2024-12-18 16:25:07 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License Copyright (c) 2023 Stefano Baccianella Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | json, repair, llm, parser |
VCS | |
bugtrack_url | |
requirements | build, coverage, pre-commit, pytest, pytest-benchmark, pytest-cov |
Travis-CI | No Travis. |
coveralls test coverage | |
[![PyPI](https://img.shields.io/pypi/v/json-repair)](https://pypi.org/project/json-repair/)
![Python version](https://img.shields.io/badge/python-3.9+-important)
[![PyPI downloads](https://img.shields.io/pypi/dm/json-repair)](https://pypi.org/project/json-repair/)
[![Github Sponsors](https://img.shields.io/github/sponsors/mangiucugna)](https://github.com/sponsors/mangiucugna)
[![GitHub Repo stars](https://img.shields.io/github/stars/mangiucugna/json_repair?style=flat)](https://github.com/mangiucugna/json_repair/stargazers)
This simple package can be used to fix an invalid JSON string. To see all the cases this package handles, check out the unit tests.
![banner](banner.png)
---
# Offer me a beer
If you find this library useful, you can help me by donating toward my monthly beer budget here: https://github.com/sponsors/mangiucugna
---
# Demo
If you are unsure if this library will fix your specific problem, or simply want your json validated online, you can visit the demo site on GitHub pages: https://mangiucugna.github.io/json_repair/
Or listen to an [audio deep dive generated by Google's NotebookLM](https://notebooklm.google.com/notebook/05312bb3-f6f3-4e49-a99b-bd51db64520b/audio) for an introduction to the module.
---
# Motivation
Some LLMs are a bit iffy when it comes to returning well-formed JSON data: sometimes they skip a bracket and sometimes they add some words to it, because that's what an LLM does.
Luckily, the mistakes LLMs make are simple enough to be fixed without destroying the content.
I searched for a lightweight python package that was able to reliably fix this problem but couldn't find any.
*So I wrote one*
### Wouldn't GPT-4o Structured Output make this library outdated?
As part of my job we use the OpenAI APIs, and we noticed that even with structured output the result sometimes isn't fully valid JSON.
So we still use this library to cover those outliers.
# Supported use cases
### Fixing Syntax Errors in JSON
- Repairs missing quotes, misplaced commas, unescaped characters, and incomplete key-value pairs.
- Repairs improperly formatted values (true, false, null) and corrupted key-value structures.
### Repairing Malformed JSON Arrays and Objects
- Repairs incomplete or broken arrays/objects by adding the necessary elements (e.g., commas, brackets) or default values (null, "").
- The library can process JSON that includes extra non-JSON characters like comments or improperly placed characters, cleaning them up while maintaining valid structure.
### Auto-Completion for Missing JSON Values
- Automatically completes missing values in JSON fields with reasonable defaults (like empty strings or null), ensuring validity.
# How to use
Install the library with pip:

    pip install json-repair

Then you can use it in your code like this:

    from json_repair import repair_json

    good_json_string = repair_json(bad_json_string)
    # If the string was super broken this will return an empty string

You can use this library to completely replace `json.loads()`:

    import json_repair

    decoded_object = json_repair.loads(json_string)

or just

    import json_repair

    decoded_object = json_repair.repair_json(json_string, return_objects=True)

### Avoid this antipattern
Some users of this library adopt the following pattern:

    obj = {}
    try:
        obj = json.loads(string)
    except json.JSONDecodeError as e:
        obj = json_repair.loads(string)
    ...

This is wasteful because `json_repair` already checks whether the JSON is valid before attempting any repair. If you still want to keep this pattern, add `skip_json_loads=True` to the `json_repair` call, as explained in the Performance considerations section below.
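
The behaviour described above (try the standard parser first, repair only on failure) can be sketched as follows. This is an illustration and not the library's source: `loads_with_fallback`, the `repair` callable, and `fake_repair` are hypothetical stand-ins, and only the `skip_json_loads` name is taken from the text.

```python
import json


def loads_with_fallback(text, repair, skip_json_loads=False):
    """Parse text with json.loads first; call repair(text) only if that fails.

    `repair` is a hypothetical stand-in for the repairing parser.
    """
    if not skip_json_loads:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass  # fall through to the repair step
    return repair(text)


def fake_repair(text):
    """Toy repair step for the demo: it only adds one missing closing brace."""
    return json.loads(text + "}")


print(loads_with_fallback('{"a": 1}', fake_repair))  # valid input, repair is never called
print(loads_with_fallback('{"a": 1', fake_repair))   # invalid input, repair runs
```

Wrapping such a function in another `try: json.loads(...)` would run the strict parse twice on every broken string, which is the waste the section warns about.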
### Read json from a file or file descriptor
JSON repair also provides a drop-in replacement for `json.load()`:

    import json_repair

    try:
        file_descriptor = open(fname, 'rb')
    except OSError:
        ...

    with file_descriptor:
        decoded_object = json_repair.load(file_descriptor)

and another method to read from a file:

    import json_repair

    try:
        decoded_object = json_repair.from_file(json_file)
    except OSError:
        ...
    except IOError:
        ...

Keep in mind that the library will not catch any IO-related exceptions; you will need to handle those yourself.
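
One way to handle those IO errors in a single place is a small wrapper like the sketch below. `read_json_or_default` is a hypothetical helper, not part of the library; its `loader` parameter defaults to the standard `json.load` so the snippet runs without the package installed, and the assumption is that `json_repair.load` can be passed in its place because it is described above as a drop-in replacement.

```python
import json


def read_json_or_default(path, loader=json.load, default=None):
    """Return loader(file) for path, or default when the file cannot be opened or read."""
    try:
        with open(path, "rb") as handle:
            return loader(handle)
    except OSError:
        # Covers missing files, permission errors, and other IO failures.
        return default


# A path that is assumed not to exist falls back to the default value.
print(read_json_or_default("does-not-exist.json", default={}))
```

Note that only `OSError` is caught here; with the default `json.load`, a malformed file would still raise `json.JSONDecodeError`.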
### Non-Latin characters
When working with non-Latin characters (such as Chinese, Japanese, or Korean), you need to pass `ensure_ascii=False` to `repair_json()` in order to preserve the non-Latin characters in the output.
Here's an example using Chinese characters:

    repair_json("{'test_chinese_ascii':'统一码'}")

will return

    {"test_chinese_ascii": "\u7edf\u4e00\u7801"}

Passing `ensure_ascii=False` instead:

    repair_json("{'test_chinese_ascii':'统一码'}", ensure_ascii=False)

will return

    {"test_chinese_ascii": "统一码"}

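
The same effect can be reproduced with the standard library alone. The CLI help further down says the flag is passed to `json.dumps()`, so the sketch below assumes `repair_json()` serializes its result the same way; it uses only `json.dumps()` and does not call the library.

```python
import json

data = {"test_chinese_ascii": "统一码"}

escaped = json.dumps(data)                        # ensure_ascii defaults to True
preserved = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)    # {"test_chinese_ascii": "\u7edf\u4e00\u7801"}
print(preserved)  # {"test_chinese_ascii": "统一码"}

# Both strings decode to the same object; only the text representation differs.
assert json.loads(escaped) == json.loads(preserved) == data
```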
### Performance considerations
If you find this library too slow because it uses `json.loads()`, you can skip that step by passing `skip_json_loads=True` to `repair_json`:

    from json_repair import repair_json

    good_json_string = repair_json(bad_json_string, skip_json_loads=True)

I made a choice of not using any fast json library to avoid having any external dependency, so that anybody can use it regardless of their stack.
Some rules of thumb to use:
- Setting `return_objects=True` will always be faster because the parser already returns an object and doesn't have to serialize that object to JSON
- `skip_json_loads` is faster only if you 100% know that the string is not a valid JSON
- If you are having issues with escaping pass the string as **raw** string like: `r"string with escaping\""`
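
The last rule of thumb is about Python string literals rather than the library itself. A minimal sketch using only the standard `json` module shows what the `r` prefix changes (the sample strings are made up for illustration):

```python
import json

# Without the r prefix, Python consumes the backslashes before any JSON parser sees them.
plain = '{"quote": "say \"hi\""}'
raw = r'{"quote": "say \"hi\""}'

print(plain)  # {"quote": "say "hi""}
print(raw)    # {"quote": "say \"hi\""}

try:
    json.loads(plain)
except json.JSONDecodeError as error:
    print("plain literal is not valid JSON:", error.msg)

print(json.loads(raw))  # {'quote': 'say "hi"'}
```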
### Use json_repair from CLI
Install the library for command-line with:
```
pipx install json-repair
```
To list all available options:
```
$ json_repair -h
usage: json_repair [-h] [-i] [-o TARGET] [--ensure_ascii] [--indent INDENT] filename

Repair and parse JSON files.

positional arguments:
  filename              The JSON file to repair

options:
  -h, --help            show this help message and exit
  -i, --inline          Replace the file inline instead of returning the output to stdout
  -o TARGET, --output TARGET
                        If specified, the output will be written to TARGET filename instead of stdout
  --ensure_ascii        Pass ensure_ascii=True to json.dumps()
  --indent INDENT       Number of spaces for indentation (Default 2)
```
## Adding to requirements
**Please pin this library only on the major version!**
We use TDD and strict semantic versioning, so there will be frequent updates and no breaking changes in minor and patch versions.
To pin only the major version of this library in your `requirements.txt`, specify the package name followed by the major version and a wildcard for the minor and patch versions. For example:

    json_repair==0.*

In this example, any version that starts with `0.` will be acceptable, allowing for updates on minor and patch versions.
---
# How to cite
If you are using this library in your academic work (as I know many folks are), please find the BibTeX here:

    @software{Baccianella_JSON_Repair_-_2024,
        author = {Baccianella, Stefano},
        month = aug,
        title = {{JSON Repair - A python module to repair invalid JSON, commonly used to parse the output of LLMs}},
        url = {https://github.com/mangiucugna/json_repair},
        version = {0.28.3},
        year = {2024}
    }

Thank you for citing my work and please send me a link to the paper if you can!
---
# How it works
This module will parse the JSON file following the BNF definition:

    <json> ::= <primitive> | <container>

    <primitive> ::= <number> | <string> | <boolean>
    ; Where:
    ; <number> is a valid real number expressed in one of a number of given formats
    ; <string> is a string of valid characters enclosed in quotes
    ; <boolean> is one of the literal strings 'true', 'false', or 'null' (unquoted)

    <container> ::= <object> | <array>
    <array> ::= '[' [ <json> *(', ' <json>) ] ']' ; A sequence of JSON values separated by commas
    <object> ::= '{' [ <member> *(', ' <member>) ] '}' ; A sequence of 'members'
    <member> ::= <string> ': ' <json> ; A pair consisting of a name, and a JSON value

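
To make the grammar concrete, here is a minimal strict reader that follows it rule by rule. It is a teaching sketch under simplifying assumptions, not the library's parser: all function names are hypothetical, numbers are returned as `float`, string escapes are left undecoded, and any deviation from the grammar raises `ValueError` instead of being repaired.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")
_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')
_LITERALS = {"true": True, "false": False, "null": None}


def _skip_ws(text, pos):
    """Advance pos past spaces, tabs, and line breaks."""
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def parse_value(text, pos=0):
    """<json> ::= <primitive> | <container>; returns (value, next_pos)."""
    pos = _skip_ws(text, pos)
    if text.startswith("{", pos):
        return _parse_container(text, pos, "}", _parse_member, dict)
    if text.startswith("[", pos):
        return _parse_container(text, pos, "]", parse_value, list)
    for word, value in _LITERALS.items():
        if text.startswith(word, pos):
            return value, pos + len(word)
    match = _STRING.match(text, pos)
    if match:
        return match.group(1), match.end()  # escapes left undecoded for brevity
    match = _NUMBER.match(text, pos)
    if match:
        return float(match.group()), match.end()
    raise ValueError(f"unexpected input at position {pos}")


def _parse_member(text, pos):
    """<member> ::= <string> ':' <json>; returns ((key, value), next_pos)."""
    key, pos = parse_value(text, pos)
    if not isinstance(key, str):
        raise ValueError("object keys must be strings")
    pos = _skip_ws(text, pos)
    if not text.startswith(":", pos):
        raise ValueError(f"expected ':' at position {pos}")
    value, pos = parse_value(text, pos + 1)
    return (key, value), pos


def _parse_container(text, pos, closer, parse_item, build):
    """<array> and <object>: comma-separated items between an opener and closer."""
    items = []
    pos = _skip_ws(text, pos + 1)
    if text.startswith(closer, pos):
        return build(items), pos + 1
    while True:
        item, pos = parse_item(text, pos)
        items.append(item)
        pos = _skip_ws(text, pos)
        if text.startswith(closer, pos):
            return build(items), pos + 1
        if not text.startswith(",", pos):
            raise ValueError(f"expected ',' or {closer!r} at position {pos}")
        pos += 1


def parse(text):
    """Parse one complete JSON document and reject trailing characters."""
    value, pos = parse_value(text)
    if _skip_ws(text, pos) != len(text):
        raise ValueError("trailing characters after the JSON value")
    return value


print(parse('{"name": "Ada", "tags": ["math", true, null], "age": 36}'))
```

A strict reader like this fails on input such as `{"a": [1, 2`, which is exactly the kind of string the repair heuristics are meant to handle.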
If something is wrong (a missing bracket or quote, for example) it will use a few simple heuristics to fix the JSON string:
- Add the missing closing bracket or brace if the parser believes that the array or object should be closed
- Quote strings or add missing single quotes
- Adjust whitespaces and remove line breaks
I am sure some corner cases will be missing. If you have examples, please open an issue or, even better, push a PR.
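
As an illustration of the first heuristic, the sketch below appends whatever closing quote and brackets are still pending at the end of a truncated string. It is a toy written for this explanation, not the library's implementation: `close_open_containers` is a hypothetical name, and it handles only truncation at the end of the input.

```python
import json


def close_open_containers(text: str) -> str:
    """Append the closing quote and brackets that are still pending at the end of text."""
    closers = {"{": "}", "[": "]"}
    stack = []         # closers we still owe, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # this character was escaped, ignore it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(stack))


repaired = close_open_containers('{"name": "Ada", "tags": ["math", "code')
print(repaired)              # {"name": "Ada", "tags": ["math", "code"]}
print(json.loads(repaired))  # {'name': 'Ada', 'tags': ['math', 'code']}
```

Tracking whether the scanner is inside a string matters because brackets that appear in string values must not be counted as containers.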
# How to develop
Just create a virtual environment and install `requirements.txt`. The setup uses [pre-commit](https://pre-commit.com/) to make sure all tests are run.
Make sure that the GitHub Actions that run after you push a new commit don't fail either.
# How to release
You will need owner access to this repository
- Edit `pyproject.toml` and update the version number appropriately using `semver` notation
- **Commit and push all changes to the repository before continuing or the next steps will fail**
- Run `python -m build`
- Create a new release in GitHub, making sure to tag all the issues solved and contributors. Create the new tag, same as the one in the build configuration
- Once the release is created, a new GitHub Actions workflow will start to publish on PyPI; make sure it doesn't fail
---
# Repair JSON in other programming languages
- TypeScript: https://github.com/josdejong/jsonrepair
- Go: https://github.com/RealAlexandreAI/json-repair
- Ruby: https://github.com/sashazykov/json-repair-rb
---
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=mangiucugna/json_repair&type=Date)](https://star-history.com/#mangiucugna/json_repair&Date)
Raw data
{
"_id": null,
"home_page": null,
"name": "json-repair",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "JSON, REPAIR, LLM, PARSER",
"author": null,
"author_email": "Stefano Baccianella <4247706+mangiucugna@users.noreply.github.com>",
"download_url": "https://files.pythonhosted.org/packages/ed/00/850d481c0cb290f399610d5b5693f95634780b6a21d04719ab32d772d8bf/json_repair-0.32.0.tar.gz",
"platform": null,
"description": "[![PyPI](https://img.shields.io/pypi/v/json-repair)](https://pypi.org/project/json-repair/)\n![Python version](https://img.shields.io/badge/python-3.9+-important)\n[![PyPI downloads](https://img.shields.io/pypi/dm/json-repair)](https://pypi.org/project/json-repair/)\n[![Github Sponsors](https://img.shields.io/github/sponsors/mangiucugna)](https://github.com/sponsors/mangiucugna)\n[![GitHub Repo stars](https://img.shields.io/github/stars/mangiucugna/json_repair?style=flat)](https://github.com/mangiucugna/json_repair/stargazers)\n\n\nThis simple package can be used to fix an invalid json string. To know all cases in which this package will work, check out the unit test.\n\n![banner](banner.png)\n\n---\n# Offer me a beer\nIf you find this library useful, you can help me by donating toward my monthly beer budget here: https://github.com/sponsors/mangiucugna\n\n---\n\n# Demo\nIf you are unsure if this library will fix your specific problem, or simply want your json validated online, you can visit the demo site on GitHub pages: https://mangiucugna.github.io/json_repair/\n\nOr hear an [audio deepdive generate by Google's NotebookLM](https://notebooklm.google.com/notebook/05312bb3-f6f3-4e49-a99b-bd51db64520b/audio) for an introduction to the module\n\n---\n\n# Motivation\nSome LLMs are a bit iffy when it comes to returning well formed JSON data, sometimes they skip a parentheses and sometimes they add some words in it, because that's what an LLM does.\nLuckily, the mistakes LLMs make are simple enough to be fixed without destroying the content.\n\nI searched for a lightweight python package that was able to reliably fix this problem but couldn't find any.\n\n*So I wrote one*\n\n### Wouldn't GPT-4o Structured Output make this library outdated?\n\nAs part of my job we use OpenAI APIs and we noticed that even with structured output sometimes the result isn't a fully valid json.\nSo we still use this library to cover those outliers.\n\n# Supported use 
cases\n\n### Fixing Syntax Errors in JSON\n\n- Missing quotes, misplaced commas, unescaped characters, and incomplete key-value pairs.\n- Missing quotation marks, improperly formatted values (true, false, null), and repairs corrupted key-value structures.\n\n### Repairing Malformed JSON Arrays and Objects\n\n- Incomplete or broken arrays/objects by adding necessary elements (e.g., commas, brackets) or default values (null, \"\").\n- The library can process JSON that includes extra non-JSON characters like comments or improperly placed characters, cleaning them up while maintaining valid structure.\n\n### Auto-Completion for Missing JSON Values\n\n- Automatically completes missing values in JSON fields with reasonable defaults (like empty strings or null), ensuring validity.\n\n# How to use\n\nInstall the library with pip\n\n pip install json-repair\n\nthen you can use use it in your code like this\n\n from json_repair import repair_json\n\n good_json_string = repair_json(bad_json_string)\n # If the string was super broken this will return an empty string\n\nYou can use this library to completely replace `json.loads()`:\n\n import json_repair\n\n decoded_object = json_repair.loads(json_string)\n\nor just\n\n import json_repair\n\n decoded_object = json_repair.repair_json(json_string, return_objects=True)\n\n### Avoid this antipattern\nSome users of this library adopt the following pattern:\n\n obj = {}\n try:\n obj = json.loads(string)\n except json.JSONDecodeError as e:\n obj = json_repair.loads(string)\n ...\n\nThis is wasteful because `json_repair` will already verify for you if the JSON is valid, if you still want to do that then add `skip_json_loads=True` to the call as explained the section below.\n\n### Read json from a file or file descriptor\n\nJSON repair provides also a drop-in replacement for `json.load()`:\n\n import json_repair\n\n try:\n file_descriptor = open(fname, 'rb')\n except OSError:\n ...\n\n with file_descriptor:\n decoded_object = 
json_repair.load(file_descriptor)\n\nand another method to read from a file:\n\n import json_repair\n\n try:\n decoded_object = json_repair.from_file(json_file)\n except OSError:\n ...\n except IOError:\n ...\n\nKeep in mind that the library will not catch any IO-related exception and those will need to be managed by you\n\n### Non-Latin characters\n\nWhen working with non-Latin characters (such as Chinese, Japanese, or Korean), you need to pass `ensure_ascii=False` to `repair_json()` in order to preserve the non-Latin characters in the output.\n\nHere's an example using Chinese characters:\n\n repair_json(\"{'test_chinese_ascii':'\u7edf\u4e00\u7801'}\")\n\nwill return\n\n {\"test_chinese_ascii\": \"\\u7edf\\u4e00\\u7801\"}\n\nInstead passing `ensure_ascii=False`:\n\n repair_json(\"{'test_chinese_ascii':'\u7edf\u4e00\u7801'}\", ensure_ascii=False)\n\nwill return\n\n {\"test_chinese_ascii\": \"\u7edf\u4e00\u7801\"}\n\n### Performance considerations\nIf you find this library too slow because is using `json.loads()` you can skip that by passing `skip_json_loads=True` to `repair_json`. 
Like:\n\n from json_repair import repair_json\n\n good_json_string = repair_json(bad_json_string, skip_json_loads=True)\n\nI made a choice of not using any fast json library to avoid having any external dependency, so that anybody can use it regardless of their stack.\n\nSome rules of thumb to use:\n- Setting `return_objects=True` will always be faster because the parser returns an object already and it doesn't have serialize that object to JSON\n- `skip_json_loads` is faster only if you 100% know that the string is not a valid JSON\n- If you are having issues with escaping pass the string as **raw** string like: `r\"string with escaping\\\"\"`\n\n### Use json_repair from CLI\n\nInstall the library for command-line with:\n```\npipx install json-repair\n```\nto know all options available:\n```\n$ json_repair -h\nusage: json_repair [-h] [-i] [-o TARGET] [--ensure_ascii] [--indent INDENT] filename\n\nRepair and parse JSON files.\n\npositional arguments:\n filename The JSON file to repair\n\noptions:\n -h, --help show this help message and exit\n -i, --inline Replace the file inline instead of returning the output to stdout\n -o TARGET, --output TARGET\n If specified, the output will be written to TARGET filename instead of stdout\n --ensure_ascii Pass ensure_ascii=True to json.dumps()\n --indent INDENT Number of spaces for indentation (Default 2)\n```\n\n## Adding to requirements\n**Please pin this library only on the major version!**\n\nWe use TDD and strict semantic versioning, there will be frequent updates and no breaking changes in minor and patch versions.\nTo ensure that you only pin the major version of this library in your `requirements.txt`, specify the package name followed by the major version and a wildcard for minor and patch versions. 
For example:\n\n json_repair==0.*\n\nIn this example, any version that starts with `0.` will be acceptable, allowing for updates on minor and patch versions.\n\n---\n# How to cite\nIf you are using this library in your academic work (as I know many folks are) please find the BibTex here:\n\n @software{Baccianella_JSON_Repair_-_2024,\n author = {Baccianella, Stefano},\n month = aug,\n title = {{JSON Repair - A python module to repair invalid JSON, commonly used to parse the output of LLMs}},\n url = {https://github.com/mangiucugna/json_repair},\n version = {0.28.3},\n year = {2024}\n }\n\nThank you for citing my work and please send me a link to the paper if you can!\n\n---\n\n# How it works\nThis module will parse the JSON file following the BNF definition:\n\n <json> ::= <primitive> | <container>\n\n <primitive> ::= <number> | <string> | <boolean>\n ; Where:\n ; <number> is a valid real number expressed in one of a number of given formats\n ; <string> is a string of valid characters enclosed in quotes\n ; <boolean> is one of the literal strings 'true', 'false', or 'null' (unquoted)\n\n <container> ::= <object> | <array>\n <array> ::= '[' [ <json> *(', ' <json>) ] ']' ; A sequence of JSON values separated by commas\n <object> ::= '{' [ <member> *(', ' <member>) ] '}' ; A sequence of 'members'\n <member> ::= <string> ': ' <json> ; A pair consisting of a name, and a JSON value\n\nIf something is wrong (a missing parentheses or quotes for example) it will use a few simple heuristics to fix the JSON string:\n- Add the missing parentheses if the parser believes that the array or object should be closed\n- Quote strings or add missing single quotes\n- Adjust whitespaces and remove line breaks\n\nI am sure some corner cases will be missing, if you have examples please open an issue or even better push a PR\n\n# How to develop\nJust create a virtual environment with `requirements.txt`, the setup uses [pre-commit](https://pre-commit.com/) to make sure all tests are 
run.\n\nMake sure that the Github Actions running after pushing a new commit don't fail as well.\n\n# How to release\nYou will need owner access to this repository\n- Edit `pyproject.toml` and update the version number appropriately using `semver` notation\n- **Commit and push all changes to the repository before continuing or the next steps will fail**\n- Run `python -m build`\n- Create a new release in Github, making sure to tag all the issues solved and contributors. Create the new tag, same as the one in the build configuration\n- Once the release is created, a new Github Actions workflow will start to publish on Pypi, make sure it didn't fail\n---\n# Repair JSON in other programming languages\n- Typescript: https://github.com/josdejong/jsonrepair\n- Go: https://github.com/RealAlexandreAI/json-repair\n- Ruby: https://github.com/sashazykov/json-repair-rb\n---\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=mangiucugna/json_repair&type=Date)](https://star-history.com/#mangiucugna/json_repair&Date)\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2023 Stefano Baccianella Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "A package to repair broken json strings",
"version": "0.32.0",
"project_urls": {
"Bug Tracker": "https://github.com/mangiucugna/json_repair/issues",
"Homepage": "https://github.com/mangiucugna/json_repair/",
"Live demo": "https://mangiucugna.github.io/json_repair/"
},
"split_keywords": [
"json",
" repair",
" llm",
" parser"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0d138a660fa1966882ff2d82ddd17943865d2ad560e5773d9b878938dc336c4c",
"md5": "8b7ab26ee58bf3f9cbea9528650786fb",
"sha256": "a06a83c62e75c69a58cda5902f5631adec567d6584413cf233b412b491cf8580"
},
"downloads": -1,
"filename": "json_repair-0.32.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8b7ab26ee58bf3f9cbea9528650786fb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 19149,
"upload_time": "2024-12-18T16:25:05",
"upload_time_iso_8601": "2024-12-18T16:25:05.039802Z",
"url": "https://files.pythonhosted.org/packages/0d/13/8a660fa1966882ff2d82ddd17943865d2ad560e5773d9b878938dc336c4c/json_repair-0.32.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ed00850d481c0cb290f399610d5b5693f95634780b6a21d04719ab32d772d8bf",
"md5": "cdf18463862f5fec9f495f8938985fd5",
"sha256": "eed776fb24dbcce5bcd200f3c254a7d70fda40405c31c97f52a5ca8cfb7cf3e4"
},
"downloads": -1,
"filename": "json_repair-0.32.0.tar.gz",
"has_sig": false,
"md5_digest": "cdf18463862f5fec9f495f8938985fd5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 28169,
"upload_time": "2024-12-18T16:25:07",
"upload_time_iso_8601": "2024-12-18T16:25:07.126675Z",
"url": "https://files.pythonhosted.org/packages/ed/00/850d481c0cb290f399610d5b5693f95634780b6a21d04719ab32d772d8bf/json_repair-0.32.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-18 16:25:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mangiucugna",
"github_project": "json_repair",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"requirements": [
{
"name": "build",
"specs": []
},
{
"name": "coverage",
"specs": []
},
{
"name": "pre-commit",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "pytest-benchmark",
"specs": []
},
{
"name": "pytest-cov",
"specs": []
}
],
"lcname": "json-repair"
}