# extracto
[![PyPI](https://img.shields.io/pypi/v/extracto.svg)](https://pypi.org/project/extracto/)
[![Changelog](https://img.shields.io/github/v/release/cldellow/extracto?include_prereleases&label=changelog)](https://github.com/cldellow/extracto/releases)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/cldellow/extracto/blob/main/LICENSE)
Extract Python structures from HTML files, fast.
Built on the very fast [selectolax](https://github.com/rushter/selectolax) library,
`extracto` applies a few tricks to make your life happier.
## Installation
Install this library using `pip`:

    pip install extracto
## Usage
`extracto` supports two modes: **extract** and **infer**.
**extract** mode takes an HTML document and a recipe to convert that HTML document into a Python data structure.
**infer** mode takes an HTML document and its desired output, and tries to propose a good recipe. You don't need to use infer mode at all; it's just a handy shortcut.
You can infer/extract two shapes of data:
- tabular data, as a list of lists, e.g. `[['Alfie', 1986], ['Lily', 1985]]`
- shaped data, e.g. `[{'name': 'Alfie', 'year': 1986}, {'name': 'Lily', 'year': 1985}]`
Tabular data is the lowest-level layer of the system. Shaped data is built on top of tabular data.
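That layering can be sketched in a few lines of plain Python. This is an illustration of the relationship, not extracto's internal implementation: given column names, each tabular row zips into a shaped dict.

```python
# Illustration only: reshape tabular rows (a list of lists) into shaped
# rows (a list of dicts) by pairing each cell with a column name.
def shape(columns, rows):
    return [dict(zip(columns, row)) for row in rows]

tabular = [['Alfie', 1986], ['Lily', 1985]]
shaped = shape(['name', 'year'], tabular)
# shaped == [{'name': 'Alfie', 'year': 1986}, {'name': 'Lily', 'year': 1985}]
```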
### extract
#### Table data
```python
import re

from extracto import prepare, extract_table
from selectolax.parser import HTMLParser
html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''
tree = HTMLParser(html)
# Tweak the HTML to allow easier extractions.
prepare(tree, for_infer=False)
results = extract_table(
    'http://example.com/url-of-the-page',
    tree,
    {
        # Try to emit a row for every element matched by this selector
        'selector': 'h1 ~ div',
        'columns': [
            {
                # Columns are usually evaluated relative to the row selector,
                # but you can "break out" and have an absolute value by
                # prefixing the selector with "html"
                'selector': 'html h1',
                'conversions': [
                    # Strip "Famous" by capturing only the text that follows,
                    # and assigning it to the return value ('rv') group
                    re.compile('Famous (?P<rv>.+)')
                ]
            },
            {
                'selector': '.q-name + span',
            },
            {
                'selector': '.q-year + span',
                # Convert the year to an int
                'conversions': ['int'],
                # If we fail to extract something for this column,
                # that's OK -- just emit None
                'optional': True,
            },
            {
                'conversions': [
                    # Extract the value of the "data-occupation" attribute
                    '@data-occupation',
                    # Actors are boring
                    re.compile('singer|pharmaceutical-entrepreneur'),
                ],
            }
        ]
    }
)
```
Will result in:
```
[
  ['Allens', 'Lily', 1985, 'singer'],
  ['Allens', 'Tim', None, 'pharmaceutical-entrepreneur'],
]
```
Note that Alfie was excluded by the regular expression filter on
occupation, which permitted only `singer` and `pharmaceutical-entrepreneur` rows
through.
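The conversions in a column chain left to right: `'int'` parses an integer, a compiled regex must match (and, if it defines an `rv` group, only that captured text is kept), and a failure either drops the row or, for an `optional` column, yields `None`. A standalone, standard-library-only sketch of that behavior (`convert` is a hypothetical helper for illustration, not extracto's API):

```python
import re

# Hypothetical mini-interpreter for a conversion chain, mirroring the
# behavior described above. Not extracto's implementation.
def convert(text, conversions, optional=False):
    value = text
    try:
        for c in conversions:
            if c == 'int':
                value = int(value)
            elif hasattr(c, 'search'):
                m = c.search(str(value))
                if not m:
                    raise ValueError('regex did not match')
                # Keep the 'rv' group if the pattern defines one,
                # otherwise keep the whole match.
                value = m.groupdict().get('rv', m.group(0))
        return value
    except ValueError:
        # A failed conversion on an optional column becomes None;
        # on a required column, the row would be dropped.
        if optional:
            return None
        raise

convert('1985', ['int'])                                    # 1985
convert('Unknown', ['int'], optional=True)                  # None
convert('Famous Allens', [re.compile('Famous (?P<rv>.+)')]) # 'Allens'
```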
#### Shaped data
```python
from extracto import prepare, extract_object
from selectolax.parser import HTMLParser
html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''
tree = HTMLParser(html)
# Tweak the HTML to allow easier extractions.
prepare(tree, for_infer=False)
results = extract_object(
    'http://example.com/url-of-the-page',
    tree,
    {
        'label': {
            '$row': 'html',
            '$column': 'h1'
        },
        'people': {
            '$': {
                '$row': '[data-occupation]',
                'name': {
                    '$column': '.q-name + span'
                },
                'year': {
                    '$column': '.q-year + span',
                    '$conversions': ['int']
                },
                'job': {
                    '$column': '[data-occupation]',
                    '$conversions': ['@data-occupation']
                }
            }
        }
    }
)
```
Will give:
```
{
  "label": "Famous Allens",
  "people": [
    {
      "name": "Alfie",
      "year": 1986,
      "job": "actor"
    },
    {
      "name": "Lily",
      "year": 1985,
      "job": "singer"
    }
  ]
}
```
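Note that Tim is absent: his year, `Unknown`, fails the `int` conversion, and the `year` column is not marked optional, so his entry is dropped. The shaped result itself is an ordinary Python structure and composes with the rest of your code. Reproducing the output above by hand (so this snippet runs without extracto):

```python
import json

# The shaped result shown above, written out literally.
results = {
    "label": "Famous Allens",
    "people": [
        {"name": "Alfie", "year": 1986, "job": "actor"},
        {"name": "Lily", "year": 1985, "job": "singer"},
    ],
}

# Ordinary dict/list operations apply...
names = [person["name"] for person in results["people"]]
# names == ['Alfie', 'Lily']

# ...and the whole structure serializes directly to JSON.
as_json = json.dumps(results, indent=2)
```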
### infer
#### Table data
```python
from selectolax.parser import HTMLParser
from extracto import prepare, infer_table
html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''
tree = HTMLParser(html)
prepare(tree)
recipe = infer_table(
    'http://example.com/url-of-page',
    tree,
    [
        ['Alfie', '1986'],
        ['Lily', '1985']
    ]
)
```
## Development
To contribute to this library, first check out the code, then create a new virtual environment:

    cd extracto
    python -m venv venv
    source venv/bin/activate

Now install the dependencies and test dependencies:

    pip install -e '.[test]'

To run the tests:

    pytest