| Name | scrapme |
| --- | --- |
| Version | 1.8.8 |
| home_page | https://ubix.pro/ |
| Summary | A comprehensive web scraping framework featuring both static and dynamic content extraction, automatic Selenium/geckodriver management, rate limiting, proxy rotation, and Unicode support (including Georgian). Built with BeautifulSoup4 and Selenium, it provides an intuitive API for extracting text, tables, links and more from any web source. |
| upload_time | 2024-10-27 21:28:11 |
| maintainer | None |
| docs_url | None |
| author | N.Sikharulidze |
| requires_python | >=3.8 |
| license | MIT License Copyright (c) 2024 N.Sikharulidze (https://ubix.pro/) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Scrapme
A comprehensive web scraping framework featuring both static and dynamic content extraction, automatic Selenium/geckodriver management, rate limiting, proxy rotation, and Unicode support (including Georgian). Built with BeautifulSoup4 and Selenium, it provides an intuitive API for extracting text, tables, links and more from any web source.
## Features
- 🚀 Simple and intuitive API
- 🔄 Support for JavaScript-rendered content using Selenium
- 🛠️ Automatic geckodriver management
- ⏱️ Built-in rate limiting
- 🔄 Proxy rotation with health tracking
- 📊 Automatic table parsing to Pandas DataFrames
- 🌐 Full Unicode support (including Georgian)
- 🧹 Clean text extraction
- 🎯 CSS selector support
- 🔍 Multiple content extraction methods
## Installation
```bash
pip install scrapme
```
## Quick Start
### Basic Usage (Static Content)
```python
from scrapme import WebScraper
# Initialize scraper
scraper = WebScraper()
# Get text content
text = scraper.get_text("https://example.com")
print(text)
# Extract all links
links = scraper.get_links("https://example.com")
for link in links:
    print(f"Text: {link['text']}, URL: {link['href']}")
# Parse tables into pandas DataFrames
tables = scraper.get_tables("https://example.com")
if tables:
    print(tables[0].head())
```
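Since `get_tables` returns standard pandas DataFrames, the results can be post-processed with the regular pandas API. A minimal sketch (the output filenames are illustrative):

```python
from scrapme import WebScraper

scraper = WebScraper()

# get_tables returns a list of pandas DataFrames, one per table found on the page
tables = scraper.get_tables("https://example.com")

for i, df in enumerate(tables):
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    df.to_csv(f"table_{i}.csv", index=False)  # plain pandas, nothing scrapme-specific
```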
### Dynamic Content (JavaScript-Rendered)
```python
from scrapme import SeleniumScraper
# Initialize with automatic geckodriver management
scraper = SeleniumScraper(headless=True)
# Get dynamic content
text = scraper.get_text("https://example.com")
print(text)
# Execute JavaScript
title = scraper.execute_script("return document.title;")
print(f"Page title: {title}")
# Handle infinite scrolling
scraper.scroll_infinite(max_scrolls=5)
```
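Because `execute_script` returns the script's result back to Python, it can also pull structured data out of a rendered page. A short sketch following the same pattern as above (the JavaScript snippet is illustrative, not part of the scrapme API):

```python
from scrapme import SeleniumScraper

scraper = SeleniumScraper(headless=True)
scraper.get_text("https://example.com")  # render the page in the browser first

# Selenium marshals JS return values (strings, numbers, lists, dicts) back to Python
hrefs = scraper.execute_script(
    "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
)
print(f"Found {len(hrefs)} links on the rendered page")
```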
### Custom Geckodriver Path
```python
from scrapme import SeleniumScraper
import os
# Use custom geckodriver path
driver_path = os.getenv('GECKODRIVER_PATH', '/path/to/geckodriver')
scraper = SeleniumScraper(driver_path=driver_path)
```
### Rate Limiting and Proxy Rotation
```python
from scrapme import WebScraper
# Initialize with rate limiting and proxies
proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080'
]

scraper = WebScraper(
    requests_per_second=0.5,  # One request every 2 seconds
    proxies=proxies
)
# Add new proxy at runtime
scraper.add_proxy('http://proxy3.example.com:8080')
# Update rate limit
scraper.set_rate_limit(0.2) # One request every 5 seconds
```
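With the rate limiter in place, crawling a list of pages is just a loop; the scraper paces the requests itself. A minimal sketch (the URLs are placeholders):

```python
from scrapme import WebScraper, ScraperException

# 0.5 requests per second: roughly one fetch every 2 seconds
scraper = WebScraper(requests_per_second=0.5)

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    try:
        text = scraper.get_text(url)
        print(f"{url}: {len(text)} characters")
    except ScraperException as e:
        print(f"Skipping {url}: {e}")
```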
### Unicode Support (Including Georgian)
```python
from scrapme import WebScraper
# Initialize with Georgian language support
scraper = WebScraper(
    headers={'Accept-Language': 'ka-GE,ka;q=0.9'},
    encoding='utf-8'
)
# Scrape Georgian content
text = scraper.get_text("https://example.ge")
print(text)
```
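When persisting scraped Unicode text, write with an explicit encoding so Georgian characters survive the round trip. A short sketch (the filename is illustrative):

```python
from scrapme import WebScraper

scraper = WebScraper(
    headers={'Accept-Language': 'ka-GE,ka;q=0.9'},
    encoding='utf-8'
)
text = scraper.get_text("https://example.ge")

# Explicit encoding avoids mojibake on platforms whose default is not UTF-8
with open("page_ka.txt", "w", encoding="utf-8") as f:
    f.write(text)
```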
## Advanced Features
### Content Selection Methods
```python
# Using CSS selectors
elements = scraper.find_by_selector("https://example.com", "div.content > p")
# By class name
elements = scraper.find_by_class("https://example.com", "main-content")
# By ID
element = scraper.find_by_id("https://example.com", "header")
# By tag name
elements = scraper.find_by_tag("https://example.com", "article")
```
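The `find_*` helpers are built on BeautifulSoup4, so the returned elements can presumably be handled as BeautifulSoup tags (an assumption based on the library's stated backend, not a documented guarantee). A small sketch:

```python
from scrapme import WebScraper

scraper = WebScraper()

# Assuming each result behaves like a BeautifulSoup Tag
paragraphs = scraper.find_by_selector("https://example.com", "div.content > p")
for p in paragraphs:
    print(p.get_text(strip=True))
```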
### Selenium Wait Conditions
```python
from scrapme import SeleniumScraper
scraper = SeleniumScraper()
url = "https://example.com"

# Wait for element presence
soup = scraper.get_soup(url, wait_for="#dynamic-content")
# Wait for element visibility
soup = scraper.get_soup(url, wait_for="#loading", wait_type="visibility")
```
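`get_soup` presumably returns a parsed BeautifulSoup document (the name suggests as much), so the usual bs4 selectors apply once the wait condition is satisfied. A minimal sketch (the selectors are illustrative):

```python
from scrapme import SeleniumScraper

scraper = SeleniumScraper(headless=True)
url = "https://example.com"

# Wait for the dynamic container, then query the parsed document with bs4
soup = scraper.get_soup(url, wait_for="#dynamic-content")
for item in soup.select("#dynamic-content .item"):
    print(item.get_text(strip=True))
```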
## Error Handling
The package provides custom exceptions for better error handling:
```python
from scrapme import ScraperException, RequestException, ParsingException
try:
    scraper.get_text("https://example.com")
except RequestException as e:
    print(f"Failed to fetch content: {e}")
except ParsingException as e:
    print(f"Failed to parse content: {e}")
except ScraperException as e:
    print(f"General scraping error: {e}")
```
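Transient network failures are common when scraping, so it can help to wrap fetches in a small retry loop keyed on `RequestException`. A hedged sketch using only the standard library (`fetch_with_retries` is not part of scrapme):

```python
import logging
import time

from scrapme import WebScraper, RequestException

scraper = WebScraper()

def fetch_with_retries(url, attempts=3, backoff=2.0):
    """Retry transient request failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return scraper.get_text(url)
        except RequestException as e:
            logging.warning("Attempt %d/%d failed for %s: %s", attempt, attempts, url, e)
            if attempt == attempts:
                raise
            time.sleep(backoff ** attempt)

try:
    text = fetch_with_retries("https://example.com")
    print(f"Fetched {len(text)} characters")
except RequestException as e:
    logging.error("Giving up after repeated failures: %s", e)
```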
## Best Practices
1. **Rate Limiting**: Always use rate limiting to avoid overwhelming servers:
   ```python
   scraper = WebScraper(requests_per_second=0.5)
   ```

2. **Proxy Rotation**: For large-scale scraping, rotate through multiple proxies:
   ```python
   scraper = WebScraper(proxies=['proxy1', 'proxy2', 'proxy3'])
   ```

3. **Resource Management**: Use context managers or clean up Selenium resources when you are done (a consolidated sketch follows this list):
   ```python
   scraper = SeleniumScraper()
   try:
       text = scraper.get_text("https://example.com")  # your scraping code
   finally:
       del scraper  # Closes browser automatically
   ```

4. **Error Handling**: Always implement proper error handling:
   ```python
   import logging

   try:
       scraper.get_text(url)
   except ScraperException as e:
       logging.error(f"Scraping failed: {e}")
   ```
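Putting these practices together, a minimal end-to-end sketch (proxy addresses and URLs are placeholders):

```python
import logging

from scrapme import WebScraper, ScraperException

logging.basicConfig(level=logging.INFO)

# Rate-limited, proxy-rotating scraper
scraper = WebScraper(
    requests_per_second=0.5,
    proxies=[
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ],
)

for url in ["https://example.com/a", "https://example.com/b"]:
    try:
        text = scraper.get_text(url)
        logging.info("Fetched %s (%d characters)", url, len(text))
    except ScraperException as e:
        logging.error("Scraping %s failed: %s", url, e)
```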
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
For support, please open an issue on the GitHub repository or contact info@ubix.pro.
Raw data
{
"_id": null,
"home_page": "https://ubix.pro/",
"name": "scrapme",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "N.Sikharulidze",
"author_email": "\"N.Sikharulidze\" <info@ubix.pro>",
"download_url": "https://files.pythonhosted.org/packages/9b/bb/5954a473ab9bb804b18da610453a616beec65e8cd2fd1882390d7049a74a/scrapme-1.8.8.tar.gz",
"platform": null,
"description": "# Scrapme\r\n\r\nA comprehensive web scraping framework featuring both static and dynamic content extraction, automatic Selenium/geckodriver management, rate limiting, proxy rotation, and Unicode support (including Georgian). Built with BeautifulSoup4 and Selenium, it provides an intuitive API for extracting text, tables, links and more from any web source.\r\n\r\n## Features\r\n\r\n- \ud83d\ude80 Simple and intuitive API\r\n- \ud83d\udd04 Support for JavaScript-rendered content using Selenium\r\n- \ud83d\udee0\ufe0f Automatic geckodriver management\r\n- \u23f1\ufe0f Built-in rate limiting\r\n- \ud83d\udd04 Proxy rotation with health tracking\r\n- \ud83d\udcca Automatic table parsing to Pandas DataFrames\r\n- \ud83c\udf10 Full Unicode support (including Georgian)\r\n- \ud83e\uddf9 Clean text extraction\r\n- \ud83c\udfaf CSS selector support\r\n- \ud83d\udd0d Multiple content extraction methods\r\n\r\n## Installation\r\n\r\n```bash\r\npip install scrapme\r\n```\r\n\r\n## Quick Start\r\n\r\n### Basic Usage (Static Content)\r\n\r\n```python\r\nfrom scrapme import WebScraper\r\n\r\n# Initialize scraper\r\nscraper = WebScraper()\r\n\r\n# Get text content\r\ntext = scraper.get_text(\"https://example.com\")\r\nprint(text)\r\n\r\n# Extract all links\r\nlinks = scraper.get_links(\"https://example.com\")\r\nfor link in links:\r\n print(f\"Text: {link['text']}, URL: {link['href']}\")\r\n\r\n# Parse tables into pandas DataFrames\r\ntables = scraper.get_tables(\"https://example.com\")\r\nif tables:\r\n print(tables[0].head())\r\n```\r\n\r\n### Dynamic Content (JavaScript-Rendered)\r\n\r\n```python\r\nfrom scrapme import SeleniumScraper\r\n\r\n# Initialize with automatic geckodriver management\r\nscraper = SeleniumScraper(headless=True)\r\n\r\n# Get dynamic content\r\ntext = scraper.get_text(\"https://example.com\")\r\nprint(text)\r\n\r\n# Execute JavaScript\r\ntitle = scraper.execute_script(\"return document.title;\")\r\nprint(f\"Page title: {title}\")\r\n\r\n# Handle infinite scrolling\r\nscraper.scroll_infinite(max_scrolls=5)\r\n```\r\n\r\n### Custom Geckodriver Path\r\n\r\n```python\r\nfrom scrapme import SeleniumScraper\r\nimport os\r\n\r\n# Use custom geckodriver path\r\ndriver_path = os.getenv('GECKODRIVER_PATH', '/path/to/geckodriver')\r\nscraper = SeleniumScraper(driver_path=driver_path)\r\n```\r\n\r\n### Rate Limiting and Proxy Rotation\r\n\r\n```python\r\nfrom scrapme import WebScraper\r\n\r\n# Initialize with rate limiting and proxies\r\nproxies = [\r\n 'http://proxy1.example.com:8080',\r\n 'http://proxy2.example.com:8080'\r\n]\r\n\r\nscraper = WebScraper(\r\n requests_per_second=0.5, # One request every 2 seconds\r\n proxies=proxies\r\n)\r\n\r\n# Add new proxy at runtime\r\nscraper.add_proxy('http://proxy3.example.com:8080')\r\n\r\n# Update rate limit\r\nscraper.set_rate_limit(0.2) # One request every 5 seconds\r\n```\r\n\r\n### Unicode Support (Including Georgian)\r\n\r\n```python\r\nfrom scrapme import WebScraper\r\n\r\n# Initialize with Georgian language support\r\nscraper = WebScraper(\r\n headers={'Accept-Language': 'ka-GE,ka;q=0.9'},\r\n encoding='utf-8'\r\n)\r\n\r\n# Scrape Georgian content\r\ntext = scraper.get_text(\"https://example.ge\")\r\nprint(text)\r\n```\r\n\r\n## Advanced Features\r\n\r\n### Content Selection Methods\r\n\r\n```python\r\n# Using CSS selectors\r\nelements = scraper.find_by_selector(\"https://example.com\", \"div.content > p\")\r\n\r\n# By class name\r\nelements = scraper.find_by_class(\"https://example.com\", \"main-content\")\r\n\r\n# By 
ID\r\nelement = scraper.find_by_id(\"https://example.com\", \"header\")\r\n\r\n# By tag name\r\nelements = scraper.find_by_tag(\"https://example.com\", \"article\")\r\n```\r\n\r\n### Selenium Wait Conditions\r\n\r\n```python\r\nfrom scrapme import SeleniumScraper\r\n\r\nscraper = SeleniumScraper()\r\n\r\n# Wait for element presence\r\nsoup = scraper.get_soup(url, wait_for=\"#dynamic-content\")\r\n\r\n# Wait for element visibility\r\nsoup = scraper.get_soup(url, wait_for=\"#loading\", wait_type=\"visibility\")\r\n```\r\n\r\n## Error Handling\r\n\r\nThe package provides custom exceptions for better error handling:\r\n\r\n```python\r\nfrom scrapme import ScraperException, RequestException, ParsingException\r\n\r\ntry:\r\n scraper.get_text(\"https://example.com\")\r\nexcept RequestException as e:\r\n print(f\"Failed to fetch content: {e}\")\r\nexcept ParsingException as e:\r\n print(f\"Failed to parse content: {e}\")\r\nexcept ScraperException as e:\r\n print(f\"General scraping error: {e}\")\r\n```\r\n\r\n## Best Practices\r\n\r\n1. **Rate Limiting**: Always use rate limiting to avoid overwhelming servers:\r\n ```python\r\n scraper = WebScraper(requests_per_second=0.5)\r\n ```\r\n\r\n2. **Proxy Rotation**: For large-scale scraping, rotate through multiple proxies:\r\n ```python\r\n scraper = WebScraper(proxies=['proxy1', 'proxy2', 'proxy3'])\r\n ```\r\n\r\n3. **Resource Management**: Use context managers or clean up Selenium resources:\r\n ```python\r\n scraper = SeleniumScraper()\r\n try:\r\n # Your scraping code\r\n finally:\r\n del scraper # Closes browser automatically\r\n ```\r\n\r\n4. **Error Handling**: Always implement proper error handling:\r\n ```python\r\n try:\r\n scraper.get_text(url)\r\n except ScraperException as e:\r\n logging.error(f\"Scraping failed: {e}\")\r\n ```\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## Contributing\r\n\r\nContributions are welcome! Please feel free to submit a Pull Request.\r\n\r\n## Support\r\n\r\nFor support, please open an issue on the GitHub repository or contact info@ubix.pro.\r\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 N.Sikharulidze (https://ubix.pro/) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "A comprehensive web scraping framework featuring both static and dynamic content extraction, automatic Selenium/geckodriver management, rate limiting, proxy rotation, and Unicode support (including Georgian). Built with BeautifulSoup4 and Selenium, it provides an intuitive API for extracting text, tables, links and more from any web source.",
"version": "1.8.8",
"project_urls": {
"Bug Tracker": "https://github.com/NSb0y/scrapme/issues",
"Documentation": "https://github.com/NSb0y/scrapme",
"Homepage": "https://ubix.pro/",
"Repository": "https://github.com/NSb0y/scrapme"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8feb3447a7f7d10728d3ec86dcbb3e2068701b5f9c754378da12ef7481ebd662",
"md5": "c452d5942f67643950c4cdd9fa364669",
"sha256": "9a53689b1625a51305b34cacf4bdcab572309b6c3a676413420107a3f6f084b5"
},
"downloads": -1,
"filename": "scrapme-1.8.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c452d5942f67643950c4cdd9fa364669",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 11128,
"upload_time": "2024-10-27T21:28:09",
"upload_time_iso_8601": "2024-10-27T21:28:09.699654Z",
"url": "https://files.pythonhosted.org/packages/8f/eb/3447a7f7d10728d3ec86dcbb3e2068701b5f9c754378da12ef7481ebd662/scrapme-1.8.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9bbb5954a473ab9bb804b18da610453a616beec65e8cd2fd1882390d7049a74a",
"md5": "c2df297db9d63ee561a793efd192fa89",
"sha256": "23ba42aeda83b431567092a3e97771d7f5ab4ed5ae995e8a903877729e437622"
},
"downloads": -1,
"filename": "scrapme-1.8.8.tar.gz",
"has_sig": false,
"md5_digest": "c2df297db9d63ee561a793efd192fa89",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 12116,
"upload_time": "2024-10-27T21:28:11",
"upload_time_iso_8601": "2024-10-27T21:28:11.094639Z",
"url": "https://files.pythonhosted.org/packages/9b/bb/5954a473ab9bb804b18da610453a616beec65e8cd2fd1882390d7049a74a/scrapme-1.8.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-27 21:28:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NSb0y",
"github_project": "scrapme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "scrapme"
}