# NewsCrawler
NewsCrawler is a Python-based web scraping tool that extracts news articles from a variety of sources. To retrieve content behind paywalls and anti-bot measures, it combines several techniques: Google Cache, Selenium in stealth mode, Archive.is, and direct requests.
## Features
- **Multiple Parsing Methods:** Fetches articles via Google Cache, Selenium in stealth mode, Archive.is, and direct requests.
- **HTML Validation:** Checks the downloaded HTML and filters out pages with insufficient or irrelevant content.
- **Dynamic News Source Handling:** Uses a custom `NewsUrlGetter` to fetch news URLs dynamically for the specified topics.
- **Robust Error Handling:** Defines custom exceptions for HTML validation and download errors.
- **Extensible Design:** Can be extended with additional news sources or parsing methods.
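
The way the fetch methods, HTML validation, and custom exceptions listed above might fit together can be sketched as a simple fallback chain. This is a minimal illustration and not NewsCrawler's actual implementation: the names `DownloadError`, `HtmlValidationError`, `validate_html`, and `fetch_with_fallback`, as well as the length-based validity check, are assumptions made for the example. Stub fetchers stand in for the real network strategies.

```python
from typing import Callable, Iterable, Optional


class DownloadError(Exception):
    """Hypothetical: raised when every fetch strategy fails for a URL."""


class HtmlValidationError(Exception):
    """Hypothetical: raised when downloaded HTML looks too short to be an article."""


def validate_html(html: Optional[str], min_length: int = 200) -> str:
    """Return the HTML unchanged if it passes a basic length check (an assumed rule)."""
    if not html or len(html.strip()) < min_length:
        raise HtmlValidationError("content missing or too short")
    return html


Fetcher = Callable[[str], str]


def fetch_with_fallback(url: str, fetchers: Iterable[Fetcher], min_length: int = 200) -> str:
    """Try each fetcher in order and return the first result that validates."""
    errors = []
    for fetcher in fetchers:
        try:
            return validate_html(fetcher(url), min_length)
        except Exception as exc:  # a real tool would catch narrower exceptions
            errors.append(f"{getattr(fetcher, '__name__', 'fetcher')}: {exc}")
    raise DownloadError(f"all strategies failed for {url}: " + "; ".join(errors))


# Stand-in fetchers so the sketch runs without network access.
def blocked_fetcher(url: str) -> str:
    raise ConnectionError("blocked by anti-bot page")


def stub_fetcher(url: str) -> str:
    return "<html><body>" + "article text " * 50 + "</body></html>"


html = fetch_with_fallback("https://example.com/story", [blocked_fetcher, stub_fetcher])
print(len(html) > 200)
```

In this sketch a new strategy is added by appending another callable to the list, which is one way to get the extensibility described above.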
## Dependencies
- Python 3.x
- `requests`
- `selenium`
- `newspaper3k`
- `selenium-stealth`
- `beautifulsoup4`
Selenium requires ChromeDriver (the Chrome WebDriver) to be installed and available on your system's PATH.
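
A quick way to check the PATH requirement before running the crawler is to look up the driver executable with the standard library. This is only a sketch: the helper name `find_chromedriver` is hypothetical, and it assumes the executable is called `chromedriver`. Recent Selenium releases (4.6 and later) can also locate or download a driver through Selenium Manager, so a missing entry on PATH is not always fatal.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; Selenium may fail to start Chrome")
else:
    print(f"chromedriver found at {path}")
```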
## Installation
1. Clone the repository:
```sh
git clone https://github.com/yourgithubusername/newscrawler.git
```
2. Install the required Python packages:
```sh
cd newscrawler
pip install -r requirements.txt
```
## Usage
To use NewsCrawler, create a `NewsUrlGetter`, which controls URL filtering (result limit and date range), and pass it to `NewsParser` along with optional settings such as headless browsing. Then call the `get_news` method with your topic of interest:
```python
from newscrawler import NewsParser, NewsUrlGetter
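# Configure how news URLs are collected: result limit and date range
url_getter = NewsUrlGetter(max_results=20, start_date=(2023, 1, 20), end_date=(2023, 12, 25))
# Initialize the NewsParser with the URL getter, running the browser headless
news_parser = NewsParser(url_getter, headless=True)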
# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")
```
## Contributing
Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Raw data
{
"_id": null,
"home_page": "https://github.com/yourgithubusername/newscrawler",
"name": "KyaNewsScraper",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "news,web scraping,article scraping,news scraping",
"author": "Kya",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/88/b2/80a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9/KyaNewsScraper-1.0.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A Python-based tool for scraping news articles from various sources, using different techniques.",
"version": "1.0.5",
"project_urls": {
"Homepage": "https://github.com/yourgithubusername/newscrawler"
},
"split_keywords": [
"news",
"web scraping",
"article scraping",
"news scraping"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5782ec07285dac64e7d49dcac7be9aea338b82e4221db5b628d1153a8236a8b2",
"md5": "81bcd96758af4d52a2e765ba8bde7853",
"sha256": "bc3060a04384b1dc0d41171bd1905cbd22f477183ebb041f2f53dae7e836ac39"
},
"downloads": -1,
"filename": "KyaNewsScraper-1.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "81bcd96758af4d52a2e765ba8bde7853",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 8644,
"upload_time": "2024-03-18T21:27:29",
"upload_time_iso_8601": "2024-03-18T21:27:29.979587Z",
"url": "https://files.pythonhosted.org/packages/57/82/ec07285dac64e7d49dcac7be9aea338b82e4221db5b628d1153a8236a8b2/KyaNewsScraper-1.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "88b280a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9",
"md5": "5daafdc37f84fef54de75839b5f101f1",
"sha256": "ba54f79ef12195c73a0b5ad9ba51ffa12e42d690599618006608e866083e49ac"
},
"downloads": -1,
"filename": "KyaNewsScraper-1.0.5.tar.gz",
"has_sig": false,
"md5_digest": "5daafdc37f84fef54de75839b5f101f1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 7463,
"upload_time": "2024-03-18T21:27:32",
"upload_time_iso_8601": "2024-03-18T21:27:32.015576Z",
"url": "https://files.pythonhosted.org/packages/88/b2/80a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9/KyaNewsScraper-1.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-18 21:27:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yourgithubusername",
"github_project": "newscrawler",
"github_not_found": true,
"lcname": "kyanewsscraper"
}