# WebExtractionHelper
A comprehensive Python package providing XPath selectors, regex patterns, and CSS selectors for scraping a wide range of web content, including Google SERP features such as featured snippets, related questions, and other search result elements.
## 🚀 Features
- **95+ Pre-built Selectors**: Comprehensive collection of XPath selectors for web scraping
- **Google Search Support**: Specialized selectors for Google SERP features
- **Multiple Content Types**: Support for featured snippets, related questions, images, links, and more
- **Easy to Use**: Simple API with clear explanations for each selector
- **Well Documented**: Each selector includes detailed explanations and usage examples
## 📦 Installation
### From PyPI (Recommended)
```bash
pip install webextractionhelper
```
### From Source
```bash
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e .
```
## 🔧 Requirements
- Python 3.7+
- lxml >= 4.6.0
## 📚 Quick Start
```python
from webextractionhelper import Selectors
# Create a Selectors instance
selectors = Selectors()
# Access Google featured snippet selectors
featured_title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
featured_text_xpath = selectors.selectors['google.featured_snippet_text']['xpath']
# Access related questions selectors
related_questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
print(f"Featured snippet title XPath: {featured_title_xpath}")
print(f"Featured snippet text XPath: {featured_text_xpath}")
print(f"Related questions XPath: {related_questions_xpath}")
```
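Because `selectors.selectors` is a plain dictionary keyed by selector name, you can also discover what is available by iterating over it. A minimal sketch, assuming each entry carries the `explanation` field described in the Selector Structure section below:
```python
# Print every bundled selector name alongside its human-readable description
for name, info in selectors.selectors.items():
    print(f"{name}: {info['explanation']}")
```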
## 🎯 Available Selector Categories
### Google Search Selectors (21 selectors)
- **Featured Snippets**: Title, text, bullet points, numbered lists, tables, URLs, images
- **Related Questions**: Individual questions, all questions, answer snippets, source titles/URLs
- **Search Results**: Main containers, links, titles, descriptions
### Meta & Open Graph Selectors (11 selectors)
- **Meta Tags**: Title, description, keywords, robots, viewport
- **Open Graph**: Title, description, image, URL, type, site name
### Social Media Selectors (6 selectors)
- **Twitter/X**: Card type, title, description, image, creator, site
### Content Selectors (10 selectors)
- **Headings**: H1, H2, H3, H4
- **Text Content**: Paragraphs, lists, blockquotes
- **Forms**: Input fields, buttons, labels
### Media Selectors (5 selectors)
- **Images**: Source, alt text, title, dimensions
- **Videos**: Source, poster, dimensions
### Link Selectors (7 selectors)
- **Navigation**: Main nav, footer links, breadcrumbs
- **Content Links**: Internal, external, download links
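The categories above correspond to the key prefixes used in the selector dictionary (for example `google.` or `meta.`), so a whole category can be pulled out at once. A small sketch, assuming all keys follow the `category.name` pattern shown in the Quick Start:
```python
# Collect every Google-related selector by filtering on its key prefix
google_selectors = {
    name: info
    for name, info in selectors.selectors.items()
    if name.startswith('google.')
}
print(f"Found {len(google_selectors)} Google selectors")
```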
## 🔍 Usage Examples
### Example 1: Extract Google Featured Snippet
```python
from webextractionhelper import Selectors
import requests
from lxml import html
selectors = Selectors()
# Get the page content
url = "https://www.google.com/search?q=python+programming"
response = requests.get(url)
tree = html.fromstring(response.content)
# Extract featured snippet title
title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
title_elements = tree.xpath(title_xpath)
if title_elements:
    title = title_elements[0].text_content()
    print(f"Featured snippet title: {title}")
```
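Note that Google commonly returns a consent or CAPTCHA page to clients without a browser-like `User-Agent`, so the plain `requests.get` call above may not yield a normal results page. A common mitigation (the header value here is purely illustrative):
```python
# Send a browser-like User-Agent so Google is more likely to serve a regular SERP
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers)
tree = html.fromstring(response.content)
```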
### Example 2: Extract All Related Questions
```python
# Get all related questions
questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
question_elements = tree.xpath(questions_xpath)
for i, question in enumerate(question_elements, 1):
    print(f"Question {i}: {question.text_content()}")
```
### Example 3: Extract Meta Information
```python
# Get page meta description
meta_desc_xpath = selectors.selectors['meta.description']['xpath']
meta_desc_elements = tree.xpath(meta_desc_xpath)
if meta_desc_elements:
    description = meta_desc_elements[0].get('content')
    print(f"Meta description: {description}")
```
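The Open Graph selectors listed above work the same way. The key name used below (`meta.og_title`) is an assumption for illustration only; check `selectors.selectors.keys()` for the names shipped in your version:
```python
# 'meta.og_title' is a hypothetical key name; verify against selectors.selectors.keys()
og_title_entry = selectors.selectors.get('meta.og_title')
if og_title_entry:
    og_title_elements = tree.xpath(og_title_entry['xpath'])
    if og_title_elements:
        print(f"Open Graph title: {og_title_elements[0].get('content')}")
```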
## 📋 Selector Structure
Each selector in the package follows this structure:
```python
{
    'explanation': 'Human-readable description of what this selector extracts',
    'xpath': 'The XPath expression to extract the content',
    'regex': 'Optional regex pattern for text processing',
    'css': 'Optional CSS selector alternative'
}
```
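Since the `regex` and `css` fields are optional, it is safest to read them with `dict.get`. A minimal sketch, reusing the `selectors` instance and parsed `tree` from the examples above and assuming the optional fields may be absent or empty:
```python
import re

entry = selectors.selectors['google.featured_snippet_title']

# The XPath is always present; regex and css may be missing
xpath = entry['xpath']
pattern = entry.get('regex')

elements = tree.xpath(xpath)
if elements:
    text = elements[0].text_content()
    # Apply the optional regex as a post-processing step when one is provided
    if pattern:
        match = re.search(pattern, text)
        text = match.group(0) if match else text
    print(text)
```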
## 🛠️ Development
### Setting up development environment
```bash
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e ".[dev]"
```
### Running tests
```bash
python test_package.py
python example_usage.py
```
### Building the package
```bash
python -m build
```
## 📄 License
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License - see the [LICENSE.txt](LICENSE.txt) file for details.
## 👨‍💻 Author
**Jens Verneuer**
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## 📞 Support
If you have any questions or need help, please:
1. Check the [GitHub Issues](https://github.com/Artistotle-ai/webextractionhelper/issues)
2. Create a new issue if your problem isn't already addressed
## 🔗 Links
- **GitHub Repository**: [https://github.com/Artistotle-ai/webextractionhelper](https://github.com/Artistotle-ai/webextractionhelper)
- **PyPI Package**: [https://pypi.org/project/webextractionhelper/](https://pypi.org/project/webextractionhelper/)
- **Documentation**: [https://github.com/Artistotle-ai/webextractionhelper#readme](https://github.com/Artistotle-ai/webextractionhelper#readme)
## 📈 Version History
- **0.1.0** - Initial release with 95+ selectors for web scraping
---
**Note**: This package is designed to help with web scraping tasks. Please ensure you comply with the terms of service of the websites you're scraping and respect robots.txt files.