# WebRover 🚀
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
**WebRover is a powerful Python library for generating high-quality datasets from web content, designed for training Large Language Models and other AI applications.**
---
## 🌟 Features
- **Smart Web Scraping**: Automatically find and scrape relevant content based on topics
- **Multiple Input Formats**: Support for JSON, YAML, TXT, and Markdown topic files
- **Async Processing**: Fast, concurrent scraping with built-in rate limiting
- **Quality Control**: Built-in content validation and cleaning
- **LLM-Ready Output**: Structured JSONL format perfect for model training
- **Error Handling**: Robust error tracking and recovery mechanisms
## ⚠️ Important Notes
### Cloud Environment Compatibility
When using WebRover in cloud environments like Google Colab or Kaggle Notebooks, you may need to handle nested asyncio loops. This is a limitation of these environments, not WebRover itself. To resolve this:
1. Install nest_asyncio:
```bash
pip install nest_asyncio
```
2. Add these lines at the start of your notebook:
```python
import nest_asyncio
nest_asyncio.apply()
```
This setup is only required for:
- Google Colab
- Kaggle Notebooks
- Similar cloud-based Jupyter environments
It's not needed for:
- Local Python scripts
- Command line usage
- Standard server deployments
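Putting it together, the first cell of a Colab or Kaggle notebook might look like the sketch below, which combines the setup above with the Quick Start API documented later in this README:

```python
# First notebook cell for Colab/Kaggle (sketch; the WebRover calls mirror
# the Quick Start section below).
import nest_asyncio
nest_asyncio.apply()  # patch the running event loop so WebRover can use it

from webrover import WebRover

rover = WebRover()
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    sites_per_topic=20
)
rover.save_dataset("my_dataset.jsonl")
```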
## 🚀 Troubleshooting
### Cloud Environment Issues
When using WebRover in cloud environments (Google Colab, Kaggle Notebooks), you may encounter asyncio-related errors. This is due to how these environments handle async operations. To fix:
```python
# In a notebook cell, install the required package first:
#   !pip install nest_asyncio

# Then add this at the start of your notebook:
import nest_asyncio
nest_asyncio.apply()
```
### Common Issues and Solutions
1. **Rate Limiting**
   - Symptom: many HTTP 429 errors
   - Solution: slow down scraping by increasing the sleep time between requests (a generic throttling sketch follows this list)
2. **Memory Issues with Large Datasets**
   - Symptom: out-of-memory errors
   - Solution: use smaller batch sizes or enable disk caching
3. **Blocked Access**
   - Symptom: HTTP 403 Forbidden errors
   - Solution: ensure your user agent is set correctly and respect robots.txt
4. **SSL Certificate Errors**
   - Symptom: SSL verification failures
   - Solution: update your Python SSL certificates or check your network settings
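If you need to throttle requests in your own code, a common pattern is an asyncio semaphore plus a short delay between requests. The sketch below uses aiohttp (one of WebRover's dependencies) and is generic client code, not part of the WebRover API:

```python
# Generic throttling sketch (not a WebRover API): cap concurrency with a
# semaphore and pause between requests to avoid HTTP 429 responses.
import asyncio
import aiohttp

SEMAPHORE = asyncio.Semaphore(5)  # at most 5 requests in flight
DELAY_SECONDS = 1.0               # pause before releasing each slot

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with SEMAPHORE:
        async with session.get(url) as response:
            response.raise_for_status()
            text = await response.text()
        await asyncio.sleep(DELAY_SECONDS)
        return text

async def main(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# pages = asyncio.run(main(["https://example.com"]))
```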
## 🚀 Quick Start
### Installation
```bash
pip install webrover
```
### Basic Usage
```python
from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    sites_per_topic=20  # fetch 20 sites for each topic
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")
```
### Using Topic Files
```python
# From a JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From a Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)
```
## 📖 Documentation
### Supported Topic File Formats
#### JSON
```json
{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}
```
#### YAML
```yaml
topics:
  - AI basics
  - machine learning
  - deep learning
```
#### Markdown
```markdown
- AI basics
- machine learning
- deep learning
```
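#### TXT

The Features list above also mentions TXT topic files. Their exact layout isn't documented here; presumably it's one topic per line, mirroring the other formats (an assumption, not confirmed by these docs):

```text
AI basics
machine learning
deep learning
```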
### Output Structure
```python
{
    'url': 'https://example.com/article',
    'title': 'Article Title',
    'content': 'Article content...',
    'metadata': {
        'length': 1234,
        'has_title': True,
        'domain': 'example.com'
    }
}
```
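Each line of the JSONL file is one such record, so the dataset can be read back with the standard library alone. A minimal sketch (the filename matches the Basic Usage example above):

```python
# Minimal sketch: load the saved JSONL dataset back into memory.
import json

records = []
with open("my_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip any blank lines
            records.append(json.loads(line))

print(len(records), "records loaded")
print(records[0]["url"], "-", records[0]["metadata"]["domain"])
```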
## 🛠️ Advanced Usage
```python
# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")
# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")
# Access dataset programmatically
dataset = rover.get_dataset()
```
## 📊 Output Files
- `final_dataset/dataset.jsonl`: Main dataset in JSONL format
- `websites_master.json`: List of all discovered URLs
- `websites_completed.json`: Successfully scraped URLs
- `websites_errors.json`: Failed attempts with error details
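After a run, these bookkeeping files can be inspected directly, for example to review what failed. A sketch (the internal structure of `websites_errors.json` isn't documented here, so the entries are printed raw and the top-level list shape is an assumption):

```python
# Sketch: review failed scrape attempts. Assumes websites_errors.json holds
# a JSON list of entries; the per-entry fields are not documented here.
import json

with open("websites_errors.json", encoding="utf-8") as f:
    errors = json.load(f)

print(f"{len(errors)} failed attempts")
for entry in errors[:5]:  # peek at the first few entries
    print(entry)
```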
## 🔄 Error Handling
WebRover automatically handles common issues:
- Rate limiting
- Network timeouts
- Invalid URLs
- Blocked requests
- Malformed content
## 🚧 Limitations
- Respects robots.txt and site rate limits
- Some sites may block automated access
- Large datasets require more processing time
- Google search may throttle excessive requests
## 🗺️ Roadmap
See [FUTURE.md](FUTURE.md) for planned features and improvements.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📜 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
Built with ❤️ by Area-25. Special thanks to all contributors.
---
**WebRover: Build better datasets, train better models.** 🚀