# xpath-filter
[![version](https://img.shields.io/badge/version-1.0.1-blue)](https://pypi.org/project/xpath-filter)
[![tests](https://img.shields.io/badge/tests-passed-green)](tests/test_xpath_filter.py)
Filter HTML files using xpath mappings.
## Installation
Install `xpath-filter` using pip:
```shell
pip install xpath-filter
```
## Usage
Import the `xpath_filter` function from the `xpath_filter` module. The use
cases below show how to call it.
### Filtering an HTML file
```python
>>> from xpath_filter import xpath_filter
>>> xpaths = {
...     'article': {
...         'xpath': '//div[@class="article"]',
...         'matches': 'all',
...         'elements': {
...             'author': './@data-author',
...             'content': './p/text()'
...         }
...     }
... }
>>> xpath_filter('index.html', xpaths)
```
Result:
```python
{'article': [{'author': 'Ana', 'content': 'Awesome'}, {'author': 'Bob', 'content': 'Bad'}]}
```
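The README does not show the contents of `index.html`. As a minimal sketch, the snippet below writes a hypothetical input file whose structure is consistent with the result above; the markup and the `SAMPLE_HTML` name are assumptions made for illustration, not part of the package.

```python
from pathlib import Path

# Hypothetical input file: the README does not show the real index.html.
# Two <div class="article"> elements, each with a data-author attribute
# and a <p> child, match the xpaths used in the example above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <body>
    <div class="article" data-author="Ana"><p>Awesome</p></div>
    <div class="article" data-author="Bob"><p>Bad</p></div>
  </body>
</html>
"""

# Write the sample next to the script so it can be passed as 'index.html'.
Path('index.html').write_text(SAMPLE_HTML, encoding='utf-8')
```

With a file like this in place, the `article` xpath selects both `div` elements, and each entry under `elements` is evaluated relative to the matched `div`.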
### Filtering an HTML file from a YAML xpaths definition
File at "xpaths.yml":
```yml
article:
  xpath: //div[@class="article"]
  matches: all
  elements:
    author: './@data-author'
    content: ./p/text()
```
Code:
```python
>>> xpath_filter('index.html', 'xpaths.yml')
```
Result:
```python
{'article': [{'author': 'Ana', 'content': 'Awesome'}, {'author': 'Bob', 'content': 'Bad'}]}
```
### Simplified filtering
When a key maps directly to an xpath string instead of a full definition, only the first match is returned and no inner elements are searched.
```python
>>> xpath_filter('index.html', {'article': '//div[@class="article"]'})
>>> xpath_filter('index.html', {'article': '//div[@class="article"]/p/text()'})
```
Result:
```python
{'article': <Element div at 0x1f08369ea80>}
{'article': 'Awesome'}
```
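To make the difference between "first match" and "all matches" concrete, here is a rough sketch using the standard library's `xml.etree.ElementTree` as a stand-in. This is not how `xpath-filter` is implemented; ElementTree supports only a subset of XPath (no `text()` or `@attr` steps), and the sample markup is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for index.html (not from the README).
HTML = """
<html>
  <body>
    <div class="article" data-author="Ana"><p>Awesome</p></div>
    <div class="article" data-author="Bob"><p>Bad</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(HTML)

# 'First match' semantics: find() stops at the first matching element.
first_article = root.find('.//div[@class="article"]')
first_text = first_article.findtext('p')

# 'All matches' semantics: findall() returns every matching element.
all_articles = root.findall('.//div[@class="article"]')

# Rough equivalent of an 'elements' mapping, evaluated per matched div.
rows = [
    {'author': div.get('data-author'), 'content': div.findtext('p')}
    for div in all_articles
]

print(first_text)
print(rows)
```

The simplified form corresponds to the `find()` branch: one element or one text value. The full definition with `'matches': 'all'` corresponds to the `findall()` branch, which yields one dictionary per matched element.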
## Project metadata
`xpath-filter` 1.0.1 was uploaded to PyPI on 2023-05-30 as a wheel (`xpath_filter-1.0.1-py3-none-any.whl`) and a source distribution (`xpath_filter-1.0.1.tar.gz`). Author: Carlos Pinto (carlos.adpinto@gmail.com). License: Apache License 2.0. Homepage: https://github.com/CarlosAdp/xpath-filter. Keywords: xpath, html, scraping, webscraping, scraper, webscraper.