webrover

- Name: webrover
- Version: 0.1.11
- Summary: Generate high-quality datasets from web content for AI training
- Homepage: https://github.com/Area-25/webrover
- Author: Area-25
- Requires Python: >=3.10
- License: not declared in package metadata (README states MIT)
- Keywords: web-scraping, dataset-generation, machine-learning, ai-training, deep-learning
- Requirements: aiohttp==3.11.8, beautifulsoup4==4.12.3, googlesearch-python==1.2.5, pyyaml==6.0.2, setuptools==75.6.0
- Upload time: 2024-11-29 20:34:58
# WebRover 🚀

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**WebRover is a powerful Python library for generating high-quality datasets from web content, designed specifically for training Large Language Models and AI applications.**

---

## 🌟 Features

- **Smart Web Scraping**: Automatically find and scrape relevant content based on topics
- **Multiple Input Formats**: Support for JSON, YAML, TXT, and Markdown topic files
- **Async Processing**: Fast, concurrent scraping with built-in rate limiting
- **Quality Control**: Built-in content validation and cleaning
- **LLM-Ready Output**: Structured JSONL format perfect for model training
- **Error Handling**: Robust error tracking and recovery mechanisms

## ⚠️ Important Notes

### Cloud Environment Compatibility

When using WebRover in cloud environments like Google Colab or Kaggle Notebooks, you may need to handle nested asyncio loops. This is a limitation of these environments, not WebRover itself. To resolve this:

1. Install nest_asyncio:
```bash
pip install nest_asyncio
```

2. Add these lines at the start of your notebook:
```python
import nest_asyncio
nest_asyncio.apply()
```

This setup is only required for:
- Google Colab
- Kaggle Notebooks
- Similar cloud-based Jupyter environments

It's not needed for:
- Local Python scripts
- Command line usage
- Standard server deployments
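For code that must run both in notebooks and as plain scripts, the steps above can be wrapped in a small guard. This is a sketch, not part of WebRover's API; `patch_event_loop_if_needed` is a hypothetical helper that applies `nest_asyncio` only when an event loop is already running (i.e., in a notebook):

```python
import asyncio

def patch_event_loop_if_needed() -> bool:
    """Apply nest_asyncio only when a loop is already running (notebooks)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False  # plain script / CLI: no running loop, nothing to patch
    import nest_asyncio  # imported lazily; only required in notebook environments
    nest_asyncio.apply()
    return True
```

Calling this once at startup makes the same code safe in Colab, Kaggle, and local scripts alike.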

## πŸš€ Troubleshooting

### Cloud Environment Issues

When using WebRover in cloud environments (Google Colab, Kaggle Notebooks), you may encounter asyncio-related errors. This is due to how these environments handle async operations. To fix:

```bash
# Install the required package
pip install nest_asyncio
```

```python
# Add at the start of your notebook
import nest_asyncio
nest_asyncio.apply()
```

### Common Issues and Solutions

1. **Rate Limiting**
   - Symptom: Many HTTP 429 errors
   - Solution: Decrease scraping speed by increasing sleep time between requests

2. **Memory Issues with Large Datasets**
   - Symptom: Out of memory errors
   - Solution: Use smaller batch sizes or enable disk caching

3. **Blocked Access**
   - Symptom: HTTP 403 Forbidden errors
   - Solution: Ensure your user agent is set correctly and respect robots.txt

4. **SSL Certificate Errors**
   - Symptom: SSL verification failed
   - Solution: Update your Python SSL certificates or check network settings
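The rate-limiting advice in item 1 is usually implemented as exponential backoff with jitter. The sketch below is illustrative and not part of WebRover; `fetch_with_backoff` assumes an `aiohttp.ClientSession`-style `session` object:

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

async def fetch_with_backoff(session, url: str, retries: int = 4) -> str:
    """Retry a GET on HTTP 429, sleeping longer after each attempt."""
    for attempt in range(retries):
        async with session.get(url) as resp:
            if resp.status != 429:
                resp.raise_for_status()
                return await resp.text()
        await asyncio.sleep(backoff_delay(attempt))
    raise RuntimeError(f"still rate-limited after {retries} attempts: {url}")
```

Full jitter (a random delay between zero and the exponential cap) spreads retries out so many concurrent workers don't hammer the server in lockstep.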

## πŸš€ Quick Start

### Installation
```bash
pip install webrover
```

### Basic Usage
```python
from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    sites_per_topic=20  # Will get 20 sites for each topic
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")
```

### Using Topic Files
```python
# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)
```

## πŸ“– Documentation

### Supported Topic File Formats

#### JSON
```json
{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}
```

#### YAML
```yaml
topics:
  - AI basics
  - machine learning
  - deep learning
```

#### Markdown
```markdown
- AI basics
- machine learning
- deep learning
```
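To pre-validate a topic file before handing it to WebRover, a minimal loader for the three formats above might look like this. `load_topics` is a hypothetical helper written for illustration; WebRover's own parser may behave differently:

```python
import json
from pathlib import Path

def load_topics(path: str) -> list[str]:
    """Parse a topic file by extension: .json, .yaml/.yml, or .md/.txt."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".json":
        return json.loads(text)["topics"]
    if p.suffix in {".yaml", ".yml"}:
        import yaml  # requires pyyaml
        return yaml.safe_load(text)["topics"]
    # .md / .txt: one "- topic" bullet or bare topic per line
    return [line.lstrip("- ").strip() for line in text.splitlines() if line.strip()]
```

All three branches return a plain list of topic strings, so downstream code doesn't need to know which format the file used.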

### Output Structure
```python
{
    'url': 'https://example.com/article',
    'title': 'Article Title',
    'content': 'Article content...',
    'metadata': {
        'length': 1234,
        'has_title': True,
        'domain': 'example.com'
    }
}
```
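Because the dataset is JSONL (one JSON object per line), records can be streamed back without loading the whole file into memory. A minimal reader sketch, not part of WebRover's API:

```python
import json

def read_jsonl(path: str):
    """Yield one record per non-empty line of a JSONL dataset file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

This pairs well with the memory advice in Troubleshooting: iterate lazily instead of materializing the full dataset.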

## πŸ› οΈ Advanced Usage

```python
# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")

# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")

# Access dataset programmatically
dataset = rover.get_dataset()
```

## πŸ“Š Output Files

- `final_dataset/dataset.jsonl`: Main dataset in JSONL format
- `websites_master.json`: List of all discovered URLs
- `websites_completed.json`: Successfully scraped URLs
- `websites_errors.json`: Failed attempts with error details
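These bookkeeping files make it easy to audit a run after the fact. The sketch below assumes each file holds a JSON list, which may not match WebRover's actual schema; `summarize_run` is a hypothetical helper:

```python
import json

def summarize_run(master: str, completed: str, errors: str) -> dict:
    """Compute success/failure counts from WebRover's bookkeeping files.

    Assumes each file contains a JSON list (of URLs or error records).
    """
    def count(path):
        with open(path, encoding="utf-8") as fh:
            return len(json.load(fh))
    total, done, failed = count(master), count(completed), count(errors)
    return {"total": total, "completed": done, "failed": failed,
            "success_rate": done / total if total else 0.0}
```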

## πŸ”„ Error Handling

WebRover automatically handles common issues:
- Rate limiting
- Network timeouts
- Invalid URLs
- Blocked requests
- Malformed content

## 🚧 Limitations

- Respects robots.txt and site rate limits
- Some sites may block automated access
- Large datasets require more processing time
- Google search may throttle excessive requests

## πŸ—ΊοΈ Roadmap

See [FUTURE.md](FUTURE.md) for planned features and improvements.

## 🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## πŸ“œ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## πŸ™ Acknowledgments

Built with ❀️ by Area-25. Special thanks to all contributors.

---

**WebRover: Build better datasets, train better models.** πŸš€

            
