xpath-filter

| Field | Value |
| --- | --- |
| Name | xpath-filter |
| Version | 1.0.1 |
| Home page | https://github.com/CarlosAdp/xpath-filter |
| Summary | XPath filter of HTML files |
| Upload time | 2023-05-30 12:53:44 |
| Docs URL | None |
| Author | Carlos Pinto |
| License | Apache License 2.0 |
| Keywords | xpath, html, scraping, webscraping, scraper, webscraper |
| Requirements | No requirements were recorded. |
| Travis CI | No |
| Coveralls test coverage | No |

# xpath-filter

[![version](https://img.shields.io/badge/version-1.0.1-blue)](https://pypi.org/project/xpath-filter)
[![tests](https://img.shields.io/badge/tests-passed-green)](tests/test_xpath_filter.py)

Filter HTML files using xpath mappings.

## Installation

Install `xpath-filter` using pip:

```shell
pip install xpath-filter
```

## Usage

Import the `xpath_filter` function from the `xpath_filter` module. The use
cases below show how to call it.

### Filtering an HTML file

```python
>>> from xpath_filter import xpath_filter
>>> xpaths = {
...     'article': {
...         'xpath': '//div[@class="article"]',
...         'matches': 'all',
...         'elements': {
...             'author': './@data-author',
...             'content': './p/text()'
...         }
...     }
... }
>>> xpath_filter('index.html', xpaths)
```

Result:

```python
{'article': [{'author': 'Ana', 'content': 'Awesome'}, {'author': 'Bob', 'content': 'Bad'}]}
```
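
The examples read an `index.html` that this README does not show. The sketch below supplies one possible page that is consistent with the result above; the markup, `SAMPLE_HTML`, `write_sample`, and `expected_articles` are all assumptions made for illustration, not files or functions from the xpath-filter project. The cross-check uses only the standard library and says nothing about how `xpath_filter` works internally.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumed markup, not taken from the xpath-filter repository:
# two article <div>s, each with a data-author attribute and one <p> child.
SAMPLE_HTML = """\
<html>
  <body>
    <div class="article" data-author="Ana"><p>Awesome</p></div>
    <div class="article" data-author="Bob"><p>Bad</p></div>
  </body>
</html>
"""


def write_sample(path="index.html"):
    """Write the assumed sample page to disk and return its path."""
    target = Path(path)
    target.write_text(SAMPLE_HTML, encoding="utf-8")
    return target


def expected_articles(html):
    """Standard-library cross-check of what the mapping is meant to select."""
    # ET.fromstring works here only because the sample is well-formed XML.
    root = ET.fromstring(html)
    return [
        {"author": div.get("data-author"), "content": div.findtext("p")}
        for div in root.iter("div")
        if div.get("class") == "article"
    ]
```

Calling `write_sample()` once in the working directory should be enough to try the `xpath_filter('index.html', ...)` calls in this README against a known input.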

### Filtering an HTML file from a YAML xpaths definition

File at "xpaths.yml":

```yml
article:
    xpath: //div[@class="article"]
    matches: all
    elements:
        author: './@data-author'
        content: ./p/text()
```

Code:

```python
>>> xpath_filter('index.html', 'xpaths.yml')
```

Result:

```python
{'article': [{'author': 'Ana', 'content': 'Awesome'}, {'author': 'Bob', 'content': 'Bad'}]}
```

### Simplified filtering

When a key maps to just an XPath string, only the first match is returned and no inner elements are searched.

```python
>>> xpath_filter('index.html', {'article': '//div[@class="article"]'})
>>> xpath_filter('index.html', {'article': '//div[@class="article"]/p/text()'})
```

Results, in the same order as the calls:

```python
{'article': <Element div at 0x1f08369ea80>}
{'article': 'Awesome'}
```
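
The practical difference between the shorthand form and `matches: all` is first match versus every match. The standard library's `xml.etree.ElementTree` draws the same line between `find` and `findall`, so the sketch below uses it purely as an analogy. The `PAGE` markup and the `ARTICLE` constant are assumptions for illustration, and nothing here claims to mirror xpath-filter's implementation.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup, parsed as XML for the sake of the illustration.
PAGE = ET.fromstring(
    "<html><body>"
    '<div class="article" data-author="Ana"><p>Awesome</p></div>'
    '<div class="article" data-author="Bob"><p>Bad</p></div>'
    "</body></html>"
)

# ElementTree supports only a subset of XPath; attribute predicates are included.
ARTICLE = ".//div[@class='article']"

first = PAGE.find(ARTICLE)     # first match only, like the shorthand form
every = PAGE.findall(ARTICLE)  # every match, like 'matches': 'all'

print(first.get("data-author"))               # Ana
print([d.get("data-author") for d in every])  # ['Ana', 'Bob']
```

If every matching element is needed, the full mapping form with `'matches': 'all'` shown earlier is the one to reach for.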

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/CarlosAdp/xpath-filter",
    "name": "xpath-filter",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "xpath html scraping webscraping scraper webscraper",
    "author": "Carlos Pinto",
    "author_email": "carlos.adpinto@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4a/7f/9c62aaede6a3600b8e94e75bcbeadc560a046a6e0cc8b82c4c0e672a227a/xpath_filter-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "XPath filter of HTML files",
    "version": "1.0.1",
    "project_urls": {
        "Homepage": "https://github.com/CarlosAdp/xpath-filter"
    },
    "split_keywords": [
        "xpath",
        "html",
        "scraping",
        "webscraping",
        "scraper",
        "webscraper"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "47968af617a83bf5a9e519ea8d97d892cd829da66885da262482c8d72c92db81",
                "md5": "0b2516fd51dd78778be0204da0b9e74f",
                "sha256": "c84c6c811675fc8ec4bb3c92a5ffe55af5400e2706841e752534742d98d20a8c"
            },
            "downloads": -1,
            "filename": "xpath_filter-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0b2516fd51dd78778be0204da0b9e74f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 8215,
            "upload_time": "2023-05-30T12:53:41",
            "upload_time_iso_8601": "2023-05-30T12:53:41.870945Z",
            "url": "https://files.pythonhosted.org/packages/47/96/8af617a83bf5a9e519ea8d97d892cd829da66885da262482c8d72c92db81/xpath_filter-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4a7f9c62aaede6a3600b8e94e75bcbeadc560a046a6e0cc8b82c4c0e672a227a",
                "md5": "933c7443096965901ca15eb1ac2f7a84",
                "sha256": "625273246a4b97980e6bfdf769b9277c64c05864b6f51ee5bca24ae2adc1b373"
            },
            "downloads": -1,
            "filename": "xpath_filter-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "933c7443096965901ca15eb1ac2f7a84",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7533,
            "upload_time": "2023-05-30T12:53:44",
            "upload_time_iso_8601": "2023-05-30T12:53:44.974400Z",
            "url": "https://files.pythonhosted.org/packages/4a/7f/9c62aaede6a3600b8e94e75bcbeadc560a046a6e0cc8b82c4c0e672a227a/xpath_filter-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-30 12:53:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CarlosAdp",
    "github_project": "xpath-filter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "xpath-filter"
}
        