news-fetch


| Field           | Value |
| --------------- | ----- |
| Name            | news-fetch |
| Version         | 0.3.0 |
| Home page       | https://santhoshse7en.github.io/news-fetch/ |
| Summary         | news-fetch is an open-source, easy-to-use news extractor with basic NLP features (cleaning text, keywords, summary) that just works. |
| Upload time     | 2024-11-03 07:10:21 |
| Maintainer      | None |
| Docs URL        | None |
| Author          | M Santhosh Kumar |
| Requires Python | None |
| License         | None |
| Keywords        | newspaper3k, news-fetch, without-api, google_scraper, news_scraper, bs4, lxml, news-crawler, news-extractor, crawler, extractor, news, news-websites, elasticsearch, json, python, nlp, data-gathering, news-archive, news-articles, commoncrawl, extract-articles, extract-information, news-scraper, spacy |
| Requirements    | No requirements were recorded. |

[![PyPI version](https://img.shields.io/pypi/v/news-fetch.svg?style=flat-square)](https://pypi.org/project/news-fetch)
[![License](https://img.shields.io/pypi/l/news-fetch.svg?style=flat-square)](https://pypi.python.org/pypi/news-fetch/)
[![Documentation Status](https://readthedocs.org/projects/pip/badge/?version=latest&style=flat-square)](https://santhoshse7en.github.io/news-fetch_doc)

# news-fetch

<img align="right" height="128px" width="128px" src="https://raw.githubusercontent.com/fhamborg/news-please/master/misc/logo/logo-256.png" />

**news-fetch** is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles. You only need to provide the root URL of the news website to crawl it completely. news-fetch builds on two established libraries, [news-please](https://github.com/fhamborg/news-please) by [Felix Hamborg](https://www.linkedin.com/in/felixhamborg/) and [Newspaper3K](https://github.com/codelucas/newspaper/) by [Lucas (欧阳象) Ou-Yang](https://www.linkedin.com/in/lucasouyang/), and combines features from both.

I built this tool to minimize NaN or empty values when scraping data from various news websites. It's platform-independent and written in Python 3, making it easy for programmers and developers to access news data for their applications.
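
One way to picture "minimizing empty values" is a fallback across several extractors: keep the first result that actually carries a value. The snippet below is only an illustrative sketch of that idea, not news-fetch's actual implementation; `is_empty`, `first_non_empty`, and the two sample dictionaries are hypothetical names introduced here.

```python
import math


def is_empty(value):
    """Return True for None, NaN, blank strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def first_non_empty(*candidates):
    """Return the first candidate that carries a usable value, else None."""
    for candidate in candidates:
        if not is_empty(candidate):
            return candidate
    return None


# Hypothetical results from two different extractors for the same article.
from_extractor_a = {"headline": "", "authors": ["A. Writer"]}
from_extractor_b = {"headline": "Example headline", "authors": []}

merged = {
    key: first_non_empty(from_extractor_a.get(key), from_extractor_b.get(key))
    for key in ("headline", "authors", "language")
}
print(merged)
```

A field stays `None` only when every extractor came back empty for it, as `language` does here.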

| Source         | Link                                                                   |
| -------------- | ---------------------------------------------------------------------- |
| PyPI:          | [https://pypi.org/project/news-fetch/](https://pypi.org/project/news-fetch/)  |
| Repository:    | [https://santhoshse7en.github.io/news-fetch/](https://santhoshse7en.github.io/news-fetch/) |
| Documentation: | [https://santhoshse7en.github.io/news-fetch_doc/](https://santhoshse7en.github.io/news-fetch_doc/) (**Not Yet Created!**) |

## Dependencies

- [news-please](https://pypi.org/project/news-please/)
- [newspaper3k](https://pypi.org/project/newspaper3k/)
- [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
- [selenium](https://pypi.org/project/selenium/)

## Extracted Information

news-fetch extracts the following attributes from news articles. You can also check out an [example JSON file](https://github.com/santhoshse7en/news-fetch/blob/master/newsfetch/example/sample.json) generated by news-please.

- Headline
- Author(s)
- Publication date
- Publication
- Category
- Source domain
- Article content
- Summary
- Keywords
- URL
- Language
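
To make the shape of such a record concrete, here is a minimal sketch of a container with one field per attribute listed above. The class name `ArticleRecord` and every field name are assumptions chosen for illustration; they are not taken from the news-fetch source and may differ from the attribute names the library exposes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container mirroring the attributes listed above."""

    headline: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    date_publish: Optional[str] = None
    publication: Optional[str] = None
    category: Optional[str] = None
    source_domain: Optional[str] = None
    article: Optional[str] = None
    summary: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    url: Optional[str] = None
    language: Optional[str] = None


# Fields that an extractor could not fill simply keep their defaults.
record = ArticleRecord(
    headline="Example headline",
    authors=["A. Writer"],
    url="https://example.com/news/1",
    language="en",
)
print(json.dumps(asdict(record), indent=2))
```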

## Dependency Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install the required dependencies:

```bash
pip install -r requirements.txt
```
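
After installing, it can be useful to confirm that the four dependencies listed earlier are importable. The following sketch uses only the standard library; the helper name `missing_packages` is hypothetical, and the mapping pairs each PyPI package name with the module name it is imported under.

```python
from importlib.util import find_spec

# PyPI package name -> top-level module name used in import statements.
REQUIRED_MODULES = {
    "news-please": "newsplease",
    "newspaper3k": "newspaper",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
}


def missing_packages(modules=None):
    """Return the package names whose module cannot be found."""
    modules = REQUIRED_MODULES if modules is None else modules
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```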

## Usage

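You can download the source as a [ZIP archive from GitHub](https://github.com/santhoshse7en/news-fetch/archive/master.zip), or install the published package from [PyPI](https://pypi.org/project/news-fetch/) with `pip install news-fetch`.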

To scrape all the news details from a single article, use the `Newspaper` class:

```python
from newsfetch.news import Newspaper

news = Newspaper(url='https://www.thehindu.com/news/cities/Madurai/aa-plays-a-pivotal-role-in-helping-people-escape-from-the-grip-of-alcoholism/article67716206.ece')
print(news.headline)
# Output: 'AA plays a pivotal role in helping people escape from the grip of alcoholism'
```
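
The example above confirms only the `headline` attribute. When collecting several fields at once, a defensive helper avoids an `AttributeError` if a name turns out not to exist. This is a sketch under assumptions: `article_to_dict` is a hypothetical helper, the field names other than `headline` are guesses, and a `SimpleNamespace` stands in for a `Newspaper` instance so the snippet runs offline.

```python
from types import SimpleNamespace


def article_to_dict(article, fields):
    """Copy the named attributes into a dict, using None when one is absent."""
    return {name: getattr(article, name, None) for name in fields}


# Stand-in for a Newspaper instance; only `headline` is documented above.
news = SimpleNamespace(headline="Example headline", language="en")

row = article_to_dict(news, ["headline", "language", "summary"])
print(row)
```

Rows built this way share the same keys, which makes them easy to write out as JSON lines or load into a table.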

To collect article URLs from a specific news site, instantiate `GoogleSearchNewsURLExtractor` with a search keyword (`keyword`) and the site's URL (`news_domain`):

```python
from newsfetch.google import GoogleSearchNewsURLExtractor

google = GoogleSearchNewsURLExtractor(keyword='Alcoholics Anonymous', news_domain='https://timesofindia.indiatimes.com/')
print(google.urls)
"""
['https://timesofindia.indiatimes.com/city/pune/pune-takes-a-stand-against-alcoholism-experts-collaborate-with-alcoholics-anonymous/articleshow/114438466.cms', 
'https://timesofindia.indiatimes.com/city/mumbai/we-have-lost-jobs-homes-alcoholics-anonymous/articleshow/96824383.cms', 
'https://timesofindia.indiatimes.com/city/gurgaon/gurgaons-alcoholics-open-up-about-their-road-to-recovery/articleshow/45080744.cms', 
'https://timesofindia.indiatimes.com/city/goa/alcoholism-is-illness-not-issue-of-weak-willpower-say-experts/articleshow/105320008.cms', 
'https://timesofindia.indiatimes.com/city/bhopal/alcoholism-is-an-illness-bhopal-aa-silver-jubilee-celebration/articleshow/106849014.cms', 
'https://timesofindia.indiatimes.com/city/ahmedabad/alcoholics-anonymous-switches-to-online-sessions/articleshow/76144639.cms', 
'https://timesofindia.indiatimes.com/city/kochi/keralites-trying-to-kick-alcoholism-alcoholics-anonymous/articleshow/13977818.cms', 
'https://timesofindia.indiatimes.com/city/chandigarh/alcoholics-anonymous-turned-their-lives-around/articleshow/18239.cms', 
'https://timesofindia.indiatimes.com/city/mumbai/like-air-india-flyer-alcoholics-anonymous-members-reap-whirlwind-of-job-loss-broken-homes/articleshow/96820403.cms', 
'https://timesofindia.indiatimes.com/city/nagpur/alcoholics-anonymous-meet-promotes-one-day-at-a-time/articleshow/50538092.cms']
"""
```
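
The two steps can be combined: take the list of URLs and fetch each article in turn, so that one failing page does not abort the batch. The sketch below is an assumption-laden illustration, not part of news-fetch: `fetch_articles` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a callable so the example runs without network access. With the library installed, one could pass `lambda url: Newspaper(url=url)` instead.

```python
def fetch_articles(urls, fetch):
    """Call fetch(url) for each unique URL; collect results and failures."""
    results, failures = {}, {}
    for url in dict.fromkeys(urls):  # drop duplicates, keep original order
        try:
            results[url] = fetch(url)
        except Exception as exc:  # a single bad page should not stop the run
            failures[url] = repr(exc)
    return results, failures


def fake_fetch(url):
    """Offline stand-in for a real article fetcher."""
    if url.endswith("/broken"):
        raise ValueError("could not parse page")
    return {"url": url}


urls = [
    "https://example.com/a",
    "https://example.com/broken",
    "https://example.com/a",  # duplicate on purpose
]
results, failures = fetch_articles(urls, fake_fetch)
print(len(results), "fetched,", len(failures), "failed")
```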

## Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Make sure to update tests as appropriate.

## License
This project is licensed under the [MIT](https://choosealicense.com/licenses/mit/) License.


            

Raw data

{
    "_id": null,
    "home_page": "https://santhoshse7en.github.io/news-fetch/",
    "name": "news-fetch",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "Newspaper3K, news-fetch, without-api, google_scraper, news_scraper, bs4, lxml, news-crawler, news-extractor, crawler, extractor, news, news-websites, elasticsearch, json, python, nlp, data-gathering, news-archive, news-articles, commoncrawl, extract-articles, extract-information, news-scraper, spacy",
    "author": "M Santhosh Kumar",
    "author_email": "santhoshse7en@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/93/16/6ab3205649a70b6faa20ac65f3b16b667629c056dbea314daeb828118374/news_fetch-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "news-fetch is an open-source, easy-to-use news extractor with basic NLP features (cleaning text, keywords, summary) that just works.",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://santhoshse7en.github.io/news-fetch/"
    },
    "split_keywords": [
        "newspaper3k",
        " news-fetch",
        " without-api",
        " google_scraper",
        " news_scraper",
        " bs4",
        " lxml",
        " news-crawler",
        " news-extractor",
        " crawler",
        " extractor",
        " news",
        " news-websites",
        " elasticsearch",
        " json",
        " python",
        " nlp",
        " data-gathering",
        " news-archive",
        " news-articles",
        " commoncrawl",
        " extract-articles",
        " extract-information",
        " news-scraper",
        " spacy"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "93166ab3205649a70b6faa20ac65f3b16b667629c056dbea314daeb828118374",
                "md5": "caadf88242bbb2cfaa2712909827420d",
                "sha256": "3fca9fbcb80c8bd7d5c4db4700a5055f6327dd2e29abee93843e7d89e34d4b26"
            },
            "downloads": -1,
            "filename": "news_fetch-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "caadf88242bbb2cfaa2712909827420d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 10066,
            "upload_time": "2024-11-03T07:10:21",
            "upload_time_iso_8601": "2024-11-03T07:10:21.961103Z",
            "url": "https://files.pythonhosted.org/packages/93/16/6ab3205649a70b6faa20ac65f3b16b667629c056dbea314daeb828118374/news_fetch-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-03 07:10:21",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "news-fetch"
}
        