extracto


- Name: extracto
- Version: 0.12
- Home page: https://github.com/cldellow/extracto
- Summary: Extract Python dicts from HTML files, fast.
- Upload time: 2022-12-25 21:45:30
- Author: Colin Dellow
- Requires Python: >=3.7
- License: Apache License, Version 2.0

# extracto

[![PyPI](https://img.shields.io/pypi/v/extracto.svg)](https://pypi.org/project/extracto/)
[![Changelog](https://img.shields.io/github/v/release/cldellow/extracto?include_prereleases&label=changelog)](https://github.com/cldellow/extracto/releases)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/cldellow/extracto/blob/main/LICENSE)

Extract Python structures from HTML files, fast.

It is built on the very fast [selectolax](https://github.com/rushter/selectolax) library
and applies a few tricks to make your life easier.

## Installation

Install this library using `pip`:

    pip install extracto

## Usage

`extracto` supports two modes: **extract** and **infer**.

**extract** mode takes an HTML document and a recipe to convert that HTML document into a Python data structure.

**infer** mode takes an HTML document and its desired output, and tries to propose a good recipe. You don't need to use infer mode at all; it's just a handy shortcut.

You can infer or extract two shapes of data:
- tabular data, as a list of lists (e.g. `[['Alfie', 1986], ['Lily', 1985]]`)
- shaped data, as nested dicts and lists (e.g. `[ { 'name': 'Alfie', 'year': 1986 }, { 'name': 'Lily', 'year': 1985 }]`)

Tabular data is the lowest-level layer of the system. Shaped data is built on top of tabular data.
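
To make the relationship between the two shapes concrete, here is a minimal sketch in plain Python that turns table rows into a list of dicts. It does not use `extracto` at all; the helper name `rows_to_records` and the `keys` argument are assumptions made for illustration, not part of the library.

```python
from typing import Any, Dict, List, Sequence


def rows_to_records(
    rows: Sequence[Sequence[Any]], keys: Sequence[str]
) -> List[Dict[str, Any]]:
    """Pair each cell in a row with the key at the same position."""
    records = []
    for row in rows:
        if len(row) != len(keys):
            raise ValueError(f"expected {len(keys)} cells, got {len(row)}")
        records.append(dict(zip(keys, row)))
    return records


# Tabular data: a list of lists.
table = [['Alfie', 1986], ['Lily', 1985]]

# Shaped data: the same values, keyed by name.
people = rows_to_records(table, ['name', 'year'])
print(people)
# [{'name': 'Alfie', 'year': 1986}, {'name': 'Lily', 'year': 1985}]
```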

### extract

#### Table data

```python
import re

from extracto import prepare, extract_table
from selectolax.parser import HTMLParser

html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''

tree = HTMLParser(html)

# Tweak the HTML to allow easier extractions.
prepare(tree, for_infer=False)

results = extract_table(
    'http://example.com/url-of-the-page',
    tree,
    {
        # Try to emit a row for every element matched by this selector
        'selector': 'h1 ~ div',
        'columns': [
            {
                # Columns are usually evaluated relative to the row selector,
                # but you can "break out" and have an absolute value by
                # prefixing the selector with "html"
                'selector': 'html h1',
                'conversions': [
                    # Strip "Famous" by capturing only the text that follows,
                    # and assigning it to the return value ('rv') group
                    re.compile('Famous (?P<rv>.+)')
                ]
            },
            {
                'selector': '.q-name + span',
            },
            {
                'selector': '.q-year + span',
                # Convert the year to an int
                'conversions': ['int'],
                # If we fail to extract something for this column, that's OK--just emit None
                'optional': True,
            },
            {
                'conversions': [
                  # Extract the value of the "data-occupation" attribute
                  '@data-occupation',
                  # Actors are boring
                  re.compile('singer|pharmaceutical-entrepreneur'),
                ],
            }
        ]
    }
)
```

This results in:

```
[
  [ 'Allens', 'Lily', 1985, 'singer' ],
  [ 'Allens', 'Tim', None, 'pharmaceutical-entrepreneur' ],
]
```

Note that Alfie was excluded by the regular expression filter on
occupation, which permitted only `singer` and `pharmaceutical-entrepreneur` rows
through.
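
The regular-expression and `int` conversions described above can be approximated with the standard library alone. The sketch below is not extracto's implementation: the `SKIP` sentinel, the helper names, and the choice of `search` over `match` are all assumptions, used only to illustrate the `rv` capture group, regex filtering, and the effect of `optional`.

```python
import re

SKIP = object()  # hypothetical sentinel meaning "drop this row"


def apply_regex(pattern: re.Pattern, value: str):
    """Return the 'rv' group if the pattern defines one, else the value.

    Returns SKIP when the pattern does not match at all, which is how
    a plain regex acts as a filter in this sketch.
    """
    match = pattern.search(value)
    if match is None:
        return SKIP
    if 'rv' in pattern.groupindex:
        return match.group('rv')
    return value


def to_int(value: str, optional: bool = False):
    """Convert to int; on failure emit None if optional, else SKIP."""
    try:
        return int(value)
    except ValueError:
        return None if optional else SKIP


# Capture only the text after "Famous".
title = apply_regex(re.compile('Famous (?P<rv>.+)'), 'Famous Allens')

# Keep only the occupations the filter allows.
occupation_filter = re.compile('singer|pharmaceutical-entrepreneur')
occupations = ['actor', 'singer', 'pharmaceutical-entrepreneur']
kept = [o for o in occupations if apply_regex(occupation_filter, o) is not SKIP]

print(title)                             # Allens
print(kept)                              # ['singer', 'pharmaceutical-entrepreneur']
print(to_int('1985'))                    # 1985
print(to_int('Unknown', optional=True))  # None
```

In this sketch a required column that fails to convert yields `SKIP`, while an optional one yields `None`, which mirrors the `None` emitted for Tim's year in the output above.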

#### Shaped data

```python
from extracto import prepare, extract_object
from selectolax.parser import HTMLParser

html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''

tree = HTMLParser(html)

# Tweak the HTML to allow easier extractions.
prepare(tree, for_infer=False)

results = extract_object(
    'http://example.com/url-of-the-page',
    tree,
    {
        'label': {
          '$row': 'html',
          '$column': 'h1'
        },
        'people': {
            '$': {
                '$row': '[data-occupation]',
                'name': {
                    '$column': '.q-name + span'
                },
                'year': {
                    '$column': '.q-year + span',
                    '$conversions': ['int']
                },
                'job': {
                    '$column': '[data-occupation]',
                    '$conversions': ['@data-occupation']
                }
            }
        }
    }
)
```

This gives:

```
{
    "label": "Famous Allens",
    "people": [
        {
            "name": "Alfie",
            "year": 1986,
            "job": "actor"
        },
        {
            "name": "Lily",
            "year": 1985,
            "job": "singer"
        }
    ]
}
```
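
In the recipe above, keys that begin with `$` (such as `$row`, `$column` and `$conversions`) read as instructions, while the remaining keys (`name`, `year`, `job`) name fields in the output. The following sketch only illustrates that naming convention; `split_recipe` is a hypothetical helper and says nothing about how extracto parses recipes internally.

```python
from typing import Any, Dict, Tuple


def split_recipe(node: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate '$'-prefixed directive keys from output field keys."""
    directives = {k: v for k, v in node.items() if k.startswith('$')}
    fields = {k: v for k, v in node.items() if not k.startswith('$')}
    return directives, fields


# A fragment shaped like the 'people' entry in the recipe above.
person = {
    '$row': '[data-occupation]',
    'name': {'$column': '.q-name + span'},
    'year': {'$column': '.q-year + span', '$conversions': ['int']},
}

directives, fields = split_recipe(person)
print(sorted(directives))  # ['$row']
print(sorted(fields))      # ['name', 'year']
```

A check like this makes it easy to spot a directive written without its `$` prefix, since it would show up among the output fields instead.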

### infer

#### Table data

```python
from selectolax.parser import HTMLParser
from extracto import prepare, infer_table

html = '''
<h1>Famous Allens</h1>
<div data-occupation="actor">
  <div><b>Name</b> Alfie</div>
  <div><b>Year</b> 1986</div>
</div>
<div data-occupation="singer">
  <div><b>Name</b> Lily</div>
  <div><b>Year</b> 1985</div>
</div>
<div data-occupation="pharmaceutical-entrepreneur">
  <div><b>Name</b> Tim</div>
  <div><b>Year</b> Unknown</div>
</div>
'''


tree = HTMLParser(html)
prepare(tree)

recipe = infer_table(
    'http://example.com/url-of-page',
    tree,
    [
        ['Alfie', '1986'],
        ['Lily', '1985']
    ]
)
```

## Development

To contribute to this library, first check out the code. Then create a new virtual environment:

    cd extracto
    python -m venv venv
    source venv/bin/activate

Now install the dependencies and test dependencies:

    pip install -e '.[test]'

To run the tests:

    pytest

            
