webextractionhelper 0.1.1

- **Home page**: https://github.com/Artistotle-ai/webextractionhelper
- **Summary**: A comprehensive web scraping helper package with XPath selectors, regex patterns, and CSS selectors
- **Upload time**: 2025-09-03 19:40:35
- **Author**: Jens Verneuer
- **Requires Python**: >=3.7
- **License**: CC-BY-SA-4.0
- **Keywords**: web scraping, xpath, css selectors, regex, google search, serp
# WebExtractionHelper

A comprehensive Python package providing XPath selectors, regex patterns, and CSS selectors for scraping web content, including Google search features such as featured snippets, related questions, and other SERP elements.

## 🚀 Features

- **95+ Pre-built Selectors**: Comprehensive collection of XPath selectors for web scraping
- **Google Search Support**: Specialized selectors for Google SERP features
- **Multiple Content Types**: Support for featured snippets, related questions, images, links, and more
- **Easy to Use**: Simple API with clear explanations for each selector
- **Well Documented**: Each selector includes detailed explanations and usage examples

## 📦 Installation

### From PyPI (Recommended)
```bash
pip install webextractionhelper
```

### From Source
```bash
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e .
```

## 🔧 Requirements

- Python 3.7+
- lxml >= 4.6.0

## 📚 Quick Start

```python
from webextractionhelper import Selectors

# Create a Selectors instance
selectors = Selectors()

# Access Google featured snippet selectors
featured_title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
featured_text_xpath = selectors.selectors['google.featured_snippet_text']['xpath']

# Access related questions selectors
related_questions_xpath = selectors.selectors['google.related_questions_all']['xpath']

print(f"Featured snippet title XPath: {featured_title_xpath}")
print(f"Featured snippet text XPath: {featured_text_xpath}")
print(f"Related questions XPath: {related_questions_xpath}")
```

## 🎯 Available Selector Categories

### Google Search Selectors (21 selectors)
- **Featured Snippets**: Title, text, bullet points, numbered lists, tables, URLs, images
- **Related Questions**: Individual questions, all questions, answer snippets, source titles/URLs
- **Search Results**: Main containers, links, titles, descriptions

### Meta & Open Graph Selectors (11 selectors)
- **Meta Tags**: Title, description, keywords, robots, viewport
- **Open Graph**: Title, description, image, URL, type, site name

### Social Media Selectors (6 selectors)
- **Twitter/X**: Card type, title, description, image, creator, site

### Content Selectors (10 selectors)
- **Headings**: H1, H2, H3, H4
- **Text Content**: Paragraphs, lists, blockquotes
- **Forms**: Input fields, buttons, labels

### Media Selectors (5 selectors)
- **Images**: Source, alt text, title, dimensions
- **Videos**: Source, poster, dimensions

### Link Selectors (7 selectors)
- **Navigation**: Main nav, footer links, breadcrumbs
- **Content Links**: Internal, external, download links
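
The category counts above can be checked programmatically. Here is a minimal sketch, assuming (as the examples suggest) that `selectors.selectors` is a plain dict whose keys follow the dotted `category.name` convention:

```python
from collections import Counter

from webextractionhelper import Selectors

selectors = Selectors()

# Group selector keys by the prefix before the first dot, e.g. 'google', 'meta'
categories = Counter(key.split('.', 1)[0] for key in selectors.selectors)

for category, count in sorted(categories.items()):
    print(f"{category}: {count} selectors")
```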

## 🔍 Usage Examples

### Example 1: Extract Google Featured Snippet
```python
from webextractionhelper import Selectors
import requests
from lxml import html

selectors = Selectors()

# Get the page content
url = "https://www.google.com/search?q=python+programming"
response = requests.get(url)
tree = html.fromstring(response.content)

# Extract featured snippet title
title_xpath = selectors.selectors['google.featured_snippet_title']['xpath']
title_elements = tree.xpath(title_xpath)

if title_elements:
    title = title_elements[0].text_content()
    print(f"Featured snippet title: {title}")
```

### Example 2: Extract All Related Questions
```python
# Get all related questions (reuses `selectors` and `tree` from Example 1)
questions_xpath = selectors.selectors['google.related_questions_all']['xpath']
question_elements = tree.xpath(questions_xpath)

for i, question in enumerate(question_elements, 1):
    print(f"Question {i}: {question.text_content()}")
```

### Example 3: Extract Meta Information
```python
# Get the page meta description (reuses `selectors` and `tree` from Example 1)
meta_desc_xpath = selectors.selectors['meta.description']['xpath']
meta_desc_elements = tree.xpath(meta_desc_xpath)

if meta_desc_elements:
    description = meta_desc_elements[0].get('content')
    print(f"Meta description: {description}")
```

## 📋 Selector Structure

Each selector in the package follows this structure:

```python
{
    'explanation': 'Human-readable description of what this selector extracts',
    'xpath': 'The XPath expression to extract the content',
    'regex': 'Optional regex pattern for text processing',
    'css': 'Optional CSS selector alternative'
}
```
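
Because `regex` and `css` are optional, client code typically checks for them before use. The following is a minimal sketch, not part of the package API; it reuses the `'meta.description'` key from Example 3 and assumes optional fields may be absent or empty:

```python
import re

from webextractionhelper import Selectors

selectors = Selectors()
entry = selectors.selectors['meta.description']

print(entry['explanation'])

# Prefer the XPath expression; fall back to the CSS selector if one is provided
expression = entry.get('xpath') or entry.get('css')
print(f"Using expression: {expression}")

# Apply the optional regex (if any) to post-process extracted text
pattern = entry.get('regex')
if pattern:
    print(re.findall(pattern, "text extracted from the page"))
```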

## 🛠️ Development

### Setting up development environment
```bash
git clone https://github.com/Artistotle-ai/webextractionhelper.git
cd webextractionhelper
pip install -e ".[dev]"
```

### Running tests
```bash
python test_package.py
python example_usage.py
```

### Building the package
```bash
python -m build
```

## 📄 License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License; see the [LICENSE.txt](LICENSE.txt) file for details.

## 👨‍💻 Author

**Jens Verneuer**

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## 📞 Support

If you have any questions or need help, please:
1. Check the [GitHub Issues](https://github.com/Artistotle-ai/webextractionhelper/issues)
2. Create a new issue if your problem isn't already addressed

## 🔗 Links

- **GitHub Repository**: [https://github.com/Artistotle-ai/webextractionhelper](https://github.com/Artistotle-ai/webextractionhelper)
- **PyPI Package**: [https://pypi.org/project/webextractionhelper/](https://pypi.org/project/webextractionhelper/)
- **Documentation**: [https://github.com/Artistotle-ai/webextractionhelper#readme](https://github.com/Artistotle-ai/webextractionhelper#readme)

## 📈 Version History

- **0.1.0** - Initial release with 95+ selectors for web scraping

---

**Note**: This package is designed to help with web scraping tasks. Please ensure you comply with the terms of service of the websites you're scraping and respect robots.txt files.
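
For example, Python's standard library can consult a site's robots.txt before fetching; the URL and user-agent string below are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Only fetch the page if robots.txt allows it for this user agent
if rp.can_fetch("MyScraperBot/1.0", "https://example.com/some/page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```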

            
