KyaNewsScraper


- **Name:** KyaNewsScraper
- **Version:** 1.0.5
- **Home page:** https://github.com/yourgithubusername/newscrawler
- **Summary:** A Python-based tool for scraping news articles from various sources, using different techniques.
- **Upload time:** 2024-03-18 21:27:32
- **Author:** Kya
- **Requires Python:** >=3.6
- **Keywords:** news, web scraping, article scraping, news scraping
- **Requirements:** No requirements were recorded.

# NewsCrawler

NewsCrawler is a Python-based web scraping tool that extracts news articles from various sources using several techniques. To retrieve content behind paywalls and anti-bot measures, it draws on Google Cache, Selenium in stealth mode, and Archive.is.
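
To make the cache and archive routes concrete, here is a minimal sketch of how lookup URLs for the two services are conventionally built. The function names are hypothetical and are not taken from the package; the source does not show how NewsCrawler constructs these URLs internally, and the availability of either service is not guaranteed.

```python
from urllib.parse import quote


def google_cache_url(article_url: str) -> str:
    """Build a Google Cache lookup URL for a page (hypothetical helper)."""
    # The target URL is percent-encoded because it travels inside a query string.
    return (
        "https://webcache.googleusercontent.com/search?q=cache:"
        + quote(article_url, safe="")
    )


def archive_is_url(article_url: str) -> str:
    """Build an Archive.is URL for the newest snapshot of a page (hypothetical helper)."""
    return "https://archive.is/newest/" + article_url


example = "https://example.com/news/rates"
print(google_cache_url(example))
print(archive_is_url(example))
```

Either URL can then be fetched like any other page, which is what lets a scraper treat these services as alternative sources for the same article.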

## Features

- **Multiple Parsing Methods:** Fetches articles via Google Cache, Selenium with stealth mode, Archive.is, and direct requests.
- **HTML Validation:** Ensures the integrity of the downloaded content, filtering out insufficient or irrelevant data.
- **Dynamic News Source Handling:** Utilizes a custom `NewsUrlGetter` to dynamically fetch news URLs based on specified topics.
- **Robust Error Handling:** Implements custom exceptions for HTML validation and download errors, ensuring reliability.
- **Extensible Design:** Easily adaptable to include more news sources or parsing methods.
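
The features above (several fetch methods, HTML validation, custom exceptions) can be combined as a fallback chain. The following is a minimal sketch under stated assumptions, not the package's actual implementation: `DownloadError`, `HtmlValidationError`, `validate_html`, `fetch_with_fallback`, and the 200-character threshold are all hypothetical, and the fetchers are stubs so that the example runs without network access.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, Tuple


class DownloadError(Exception):
    """No fetch method returned usable HTML (hypothetical name)."""


class HtmlValidationError(Exception):
    """Downloaded HTML carries too little text to be an article (hypothetical name)."""


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def validate_html(html: str, min_chars: int = 200) -> str:
    """Return html unchanged if it has at least min_chars of visible text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.chunks).split())
    if len(text) < min_chars:
        raise HtmlValidationError(f"only {len(text)} characters of visible text")
    return html


Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str, fetchers: Iterable[Tuple[str, Fetcher]], min_chars: int = 200
) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (name, html) for the first valid page."""
    failures = []
    for name, fetch in fetchers:
        try:
            return name, validate_html(fetch(url), min_chars)
        except Exception as exc:  # a real implementation would catch narrower errors
            failures.append(f"{name}: {exc}")
    raise DownloadError(f"all methods failed for {url}: " + "; ".join(failures))


# Stub fetchers standing in for direct requests, Selenium, and a cache lookup.
def blocked(url: str) -> str:
    raise ConnectionError("403 Forbidden")


def paywalled(url: str) -> str:
    return "<html><body><p>Subscribe to read</p></body></html>"


def cached(url: str) -> str:
    return "<html><body><p>" + "Rates held steady. " * 20 + "</p></body></html>"


method, html = fetch_with_fallback(
    "https://example.com/article",
    [("direct", blocked), ("selenium", paywalled), ("cache", cached)],
)
print(method)
```

Ordering the fetchers from cheapest to most expensive keeps the common case fast, and adding a new source only means appending another `(name, fetcher)` pair.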

## Dependencies

- Python 3.6 or later
- `requests`
- `selenium`
- `newspaper3k`
- `selenium-stealth`
- `beautifulsoup4`

Ensure that ChromeDriver (the Chrome WebDriver) is installed and on your system's PATH so that Selenium can launch Chrome.
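
A quick way to confirm the driver requirement above is to look the executable up on PATH before starting a scrape. This is a minimal sketch using only the standard library; the helper name `find_chromedriver` is hypothetical, and the executable name `chromedriver` is the usual default, which may differ on your system.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; Selenium-based fetching will likely fail")
else:
    print(f"chromedriver found at {path}")
```

Recent Selenium releases (4.6 and later) include Selenium Manager, which can obtain a matching driver automatically, so this check matters mostly on older setups.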

## Installation

1. Clone the repository:
```sh
git clone https://github.com/yourgithubusername/newscrawler.git
```

2. Install the required Python packages from inside the cloned directory:
```sh
cd newscrawler
pip install -r requirements.txt
```

## Usage

To use NewsCrawler, create a `NewsUrlGetter` that controls which URLs are collected (result limit and date range), pass it to `NewsParser` along with options such as `headless`, then call the `get_news` method with your topic of interest:

```python
from newscrawler import NewsParser, NewsUrlGetter

# Configure URL discovery, then initialize the NewsParser with custom settings
url_getter = NewsUrlGetter(
    max_results=20,
    start_date=(2023, 1, 20),
    end_date=(2023, 12, 25),
)
news_parser = NewsParser(url_getter, headless=True)

# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")
```

## Contributing

Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/yourgithubusername/newscrawler",
    "name": "KyaNewsScraper",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "news,web scraping,article scraping,news scraping",
    "author": "Kya",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/88/b2/80a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9/KyaNewsScraper-1.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A Python-based tool for scraping news articles from various sources, using different techniques.",
    "version": "1.0.5",
    "project_urls": {
        "Homepage": "https://github.com/yourgithubusername/newscrawler"
    },
    "split_keywords": [
        "news",
        "web scraping",
        "article scraping",
        "news scraping"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5782ec07285dac64e7d49dcac7be9aea338b82e4221db5b628d1153a8236a8b2",
                "md5": "81bcd96758af4d52a2e765ba8bde7853",
                "sha256": "bc3060a04384b1dc0d41171bd1905cbd22f477183ebb041f2f53dae7e836ac39"
            },
            "downloads": -1,
            "filename": "KyaNewsScraper-1.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "81bcd96758af4d52a2e765ba8bde7853",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 8644,
            "upload_time": "2024-03-18T21:27:29",
            "upload_time_iso_8601": "2024-03-18T21:27:29.979587Z",
            "url": "https://files.pythonhosted.org/packages/57/82/ec07285dac64e7d49dcac7be9aea338b82e4221db5b628d1153a8236a8b2/KyaNewsScraper-1.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "88b280a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9",
                "md5": "5daafdc37f84fef54de75839b5f101f1",
                "sha256": "ba54f79ef12195c73a0b5ad9ba51ffa12e42d690599618006608e866083e49ac"
            },
            "downloads": -1,
            "filename": "KyaNewsScraper-1.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "5daafdc37f84fef54de75839b5f101f1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 7463,
            "upload_time": "2024-03-18T21:27:32",
            "upload_time_iso_8601": "2024-03-18T21:27:32.015576Z",
            "url": "https://files.pythonhosted.org/packages/88/b2/80a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9/KyaNewsScraper-1.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-18 21:27:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yourgithubusername",
    "github_project": "newscrawler",
    "github_not_found": true,
    "lcname": "kyanewsscraper"
}
        