# Chompjs
![license](https://img.shields.io/github/license/Nykakin/chompjs?style=flat-square)
![pypi version](https://img.shields.io/pypi/v/chompjs.svg)
![python version](https://img.shields.io/pypi/pyversions/chompjs.svg)
![downloads](https://img.shields.io/pypi/dm/chompjs.svg)
Transforms JavaScript objects into Python data structures.
In web scraping, you sometimes need to transform JavaScript objects embedded in HTML pages into valid Python dictionaries. `chompjs` is a library designed to do that, acting as a more powerful replacement for the standard `json.loads`:
```python
>>> chompjs.parse_js_object("{a: 100}")
{'a': 100}
>>>
>>> json_lines = """
... {'a': 12}
... {'b': 13}
... {'c': 14}
... """
>>> for entry in chompjs.parse_js_objects(json_lines):
...     print(entry)
...
{'a': 12}
{'b': 13}
{'c': 14}
```
[Reference documentation](https://nykakin.github.io/chompjs/)
## Quickstart
**1. installation**
```
> pip install chompjs
```
or build from source:
```bash
$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install
```
## Features
There are two functions available:
* `parse_js_object` - tries to parse the first encountered JSON-like object. Raises `ValueError` on failure.
* `parse_js_objects` - returns a generator yielding all encountered JSON-like objects. Can be used to read [JSON Lines](https://jsonlines.org/). Does not raise on invalid input (see the sketch below).
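For the second function, a minimal sketch of pulling several objects out of a single blob of JavaScript (the input and expected output below are illustrative, based on the documented behaviour):
```python
>>> import chompjs
>>> # yields every JSON-like object it encounters; fragments it cannot
>>> # parse are skipped instead of raising
>>> list(chompjs.parse_js_objects('var a = {x: 1}; var b = [2, 3];'))
[{'x': 1}, [2, 3]]
```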
An example usage with `scrapy`:
```python
import chompjs
import scrapy


class MySpider(scrapy.Spider):
    # ...

    def parse(self, response):
        script_css = 'script:contains("__NEXT_DATA__")::text'
        script_pattern = r'__NEXT_DATA__ = (.*);'
        # warning: for some pages you need to pass replace_entities=True
        # into re_first to have JSON escaped properly
        script_text = response.css(script_css).re_first(script_pattern)
        try:
            json_data = chompjs.parse_js_object(script_text)
        except ValueError:
            self.log('Failed to extract data from {}'.format(response.url))
            return

        # work on json_data
```
Parsing of [JSON5 objects](https://json5.org/) is supported:
```python
>>> data = """
... {
... // comments
... unquoted: 'and you can quote me on that',
... singleQuotes: 'I can use "double quotes" here',
... lineBreaks: "Look, Mom! \
... No \\n's!",
... hexadecimal: 0xdecaf,
... leadingDecimalPoint: .8675309, andTrailing: 8675309.,
... positiveSign: +1,
... trailingComma: 'in objects', andIn: ['arrays',],
... "backwardsCompatible": "with JSON",
... }
... """
>>> chompjs.parse_js_object(data)
{'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom! No \n's!", 'hexadecimal': 912559, 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}
```
If the input string is not yet escaped and contains a lot of `\\` characters, the `unicode_escape=True` argument might help to sanitize it:
```python
>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{'a': 12}
```
By default `chompjs` starts parsing at the first `{` or `[` character it finds, ignoring the surrounding text:
```python
>>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
[1, 2, 3]
```
Post-processed input is parsed using `json.loads` by default. A different loader such as `orjson` can be used via the `loader` argument:
```python
>>> import orjson
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{'a': 12}", loader=orjson.loads)
{'a': 12}
```
The `loader_args` and `loader_kwargs` arguments can be used to pass options to the underlying loader function. For example, for the default `json.loads` you can pass down options such as `strict` or `object_hook`:
```python
>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', loader_kwargs={'parse_float': decimal.Decimal})
[Decimal('23.2')]
```
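Building on this, a minimal sketch of forwarding `object_hook` to the default `json.loads` (the `as_pairs` hook below is a hypothetical example, not part of the library):
```python
>>> import chompjs
>>> def as_pairs(obj):
...     # hypothetical hook: represent each parsed object as a list of (key, value) pairs
...     return list(obj.items())
...
>>> chompjs.parse_js_object('{a: 1, b: 2}', loader_kwargs={'object_hook': as_pairs})
[('a', 1), ('b', 2)]
```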
# Rationale
In web scraping, data is often not present directly in the HTML, but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example:
```html
<html>
<head>...</head>
<body>
...
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
...
</body>
</html>
```
The standard library function `json.loads` is usually sufficient to extract this data:
```python
>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}
```
The problem is that not every valid JavaScript object is also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON:
* `"{'a': 'b'}"` is not valid JSON because it uses the `'` character for quoting
* `'{a: "b"}'` is not valid JSON because the property name is not quoted at all
* `'{"a": [1, 2, 3,]}'` is not valid JSON because there is an extra `,` character at the end of the array
* `'{"a": .99}'` is not valid JSON because the float value lacks a leading 0
As a result, `json.loads` fails to extract any of them:
```python
>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
...
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
...
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
...
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
...
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
```
The `chompjs` library was designed to bypass this limitation, allowing such JavaScript objects to be parsed into proper Python dictionaries:
```python
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{'a': 'b'}")
{'a': 'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{'a': 'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{'a': [1, 2, 3]}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}
```
Internally, `chompjs` uses a parser written in C to iterate over the raw string, fixing its issues along the way. The result is then passed to the standard library's `json.loads`, which keeps parsing fast compared to full-blown JavaScript parsers such as `demjson`:
```python
>>> import json
>>> import _chompjs
>>>
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{'a': 1}
```
# Development
Pull requests are welcome.
To run the unit tests:
```
$ tox
```