fundus


| Field | Value |
|-------|-------|
| Name | fundus |
| Version | 0.3.1 |
| Summary | A very simple news crawler |
| Upload time | 2024-05-13 16:33:02 |
| Requires Python | >=3.8 |
| License | MIT |
| Keywords | web scraping, web crawling |

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_darkmode_with_font_and_clear_space.svg">
    <source media="(prefers-color-scheme: light)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg">
    <img src="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg" alt="Logo" width="50%" height="50%">
  </picture>
</p>

<p align="center">A very simple <b>news crawler</b> in Python.
Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/">Humboldt University of Berlin</a>.
</p>
<p align="center">
<a href="https://pypi.org/project/fundus/"><img alt="PyPi version" src="https://badge.fury.io/py/fundus.svg"></a>
<img alt="python" src="https://img.shields.io/badge/python-3.8-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/license-MIT-green">
<img alt="Publisher Coverage" src="https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/dobbersc/ca0ae056b05cbfeaf30fa42f84ddf458/raw/fundus_publisher_coverage.json">
</p>
<div align="center">
<hr>

[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md) | [Paper](https://arxiv.org/abs/2403.15279)

</div>


---

Fundus is:

* **A static news crawler.**
  Fundus lets you crawl online news articles, whether from live websites or the CC-NEWS dataset, with only a few lines of Python code.

* **An open-source Python package.**
  Fundus is built on the idea of building something together.
  We welcome contributions that help Fundus [grow](docs/how_to_contribute.md)!

<hr>

## Quick Start

To install Fundus with pip, run:

```
pip install fundus
```

Fundus requires Python 3.8+.


## Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

That's it!

If you run this code, it should print out something like this:

```console
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:   FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:   FoxNews (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the "Title" of the article, i.e. its headline 
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"
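
The same fields can also be read from the article object in code rather than from the printout. The following is a minimal sketch, assuming the article exposes `title` and `plaintext` attributes (see Tutorial 3 for the actual Article class). The `summarize` helper and the stand-in object are hypothetical and only serve to keep the example runnable without network access.

```python
import textwrap
from types import SimpleNamespace


def summarize(article, width=60):
    """Return a 'title: shortened text' summary of an article-like object.

    Assumption: the object exposes `title` and `plaintext` attributes.
    Missing or empty values are replaced by placeholders.
    """
    title = getattr(article, "title", None) or "<no title>"
    text = getattr(article, "plaintext", None) or ""
    # shorten() collapses whitespace and truncates on word boundaries
    short = textwrap.shorten(text, width=width, placeholder=" [...]")
    return f"{title}: {short}"


# Stand-in for a crawled article (hypothetical data, not real output)
demo = SimpleNamespace(title="Example headline", plaintext="Body text " * 20)
print(summarize(demo))
```

In a real crawl, you could call `summarize(article)` inside the `for article in crawler.crawl(...)` loop instead of `print(article)`.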


## Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

## Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out its [dataset page](https://paperswithcode.com/dataset/cc-news).

```python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
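
In all three examples the articles arrive one by one in a `for` loop, so they can be stored instead of printed. The sketch below writes one JSON object per article to a JSON Lines file. It assumes the article exposes `title` and `plaintext` attributes. The `articles_to_jsonl` helper is hypothetical and not part of Fundus, and stand-in objects replace a real crawl so that the code runs offline.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def articles_to_jsonl(articles, path):
    """Write one JSON object per article-like object to `path`.

    Assumption: each object exposes `title` and `plaintext` attributes.
    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for article in articles:
            record = {
                "title": getattr(article, "title", None),
                "text": getattr(article, "plaintext", None),
            }
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-ins for crawled articles (hypothetical data)
demo_articles = [
    SimpleNamespace(title="First headline", plaintext="First body."),
    SimpleNamespace(title="Second headline", plaintext="Second body."),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "articles.jsonl")
    written = articles_to_jsonl(demo_articles, out_path)
    print(f"wrote {written} records")
```

With a real crawler, you could pass `crawler.crawl(max_articles=2)` as the `articles` argument, since the helper only iterates over it once.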


## Tutorials

We provide **quick tutorials** to get you started with the library:

1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)

If you wish to contribute, check out these tutorials:
1. [**How to contribute**](docs/how_to_contribute.md)
2. [**How to add a publisher**](docs/how_to_add_a_publisher.md)

## Currently Supported News Sources

You can find the publishers currently supported [**here**](/docs/supported_publishers.md).

Also: **Adding a new publisher is easy - consider contributing to the project!**

## Evaluation benchmark

Check out our evaluation [benchmark](https://github.com/dobbersc/fundus-evaluation).

| **Scraper** | **Precision**             | **Recall**                | **F1-Score**              |
|-------------|---------------------------|---------------------------|---------------------------|
| [Fundus](https://github.com/flairNLP/fundus)      | **99.89**<sub>±0.57</sub> | 96.75<sub>±12.75</sub>    | **97.69**<sub>±9.75</sub> |
| [Trafilatura](https://github.com/adbar/trafilatura) | 90.54<sub>±18.86</sub>    | 93.23<sub>±23.81</sub>    | 89.81<sub>±23.69</sub>    |
| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py)         | 81.09<sub>±19.41</sub>    | **98.23**<sub>±8.61</sub> | 87.14<sub>±15.48</sub>    |
| [jusText](https://github.com/miso-belica/jusText)     | 86.51<sub>±18.92</sub>    | 90.23<sub>±20.61</sub>    | 86.96<sub>±19.76</sub>    |
| [news-please](https://github.com/fhamborg/news-please) | 92.26<sub>±12.40</sub>    | 86.38<sub>±27.59</sub>    | 85.81<sub>±23.29</sub>    |
| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet)   | 84.73<sub>±20.82</sub>    | 90.66<sub>±21.05</sub>    | 85.77<sub>±20.28</sub>    |
| [Boilerpipe](https://github.com/kohlschutter/boilerpipe)  | 82.89<sub>±20.65</sub>    | 82.11<sub>±29.99</sub>    | 79.90<sub>±25.86</sub>    |

## Cite

Please cite the following [paper](https://arxiv.org/abs/2403.15279) when using Fundus or building upon our work:

```bibtex
@misc{dallabetta2024fundus,
      title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions}, 
      author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
      year={2024},
      eprint={2403.15279},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact

Please email your questions or comments to [**Max Dallabetta**](mailto:max.dallabetta@googlemail.com?subject=[GitHub]%20Fundus).

## Contributing

Thanks for your interest in contributing! There are many ways to get involved;
start with our [contributor guidelines](docs/how_to_contribute.md) and then
check these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.

## License

[MIT](LICENSE)

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "fundus",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "web scraping, web crawling",
    "author": null,
    "author_email": "Max Dallabetta <max.dallabetta@googlemail.com>",
    "download_url": "https://files.pythonhosted.org/packages/79/e2/a7ba830d24df62d0d606041eff2b4093b3aca154b3092b3f0af2aa632fa3/fundus-0.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A very simple news crawler",
    "version": "0.3.1",
    "project_urls": {
        "Repository": "https://github.com/flairNLP/fundus"
    },
    "split_keywords": [
        "web scraping",
        " web crawling"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9228c9809c8ddffd3483bd02ef87bfc1c9e1c019b981328911877abfd8fce836",
                "md5": "bc0eef8cc98541d1be7d3c3ef55ce0a4",
                "sha256": "fec94d369ead453553e6615303d6d286c8709dfcec368c3e6ff6d3344dd6706e"
            },
            "downloads": -1,
            "filename": "fundus-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bc0eef8cc98541d1be7d3c3ef55ce0a4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 113589,
            "upload_time": "2024-05-13T16:33:01",
            "upload_time_iso_8601": "2024-05-13T16:33:01.336413Z",
            "url": "https://files.pythonhosted.org/packages/92/28/c9809c8ddffd3483bd02ef87bfc1c9e1c019b981328911877abfd8fce836/fundus-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "79e2a7ba830d24df62d0d606041eff2b4093b3aca154b3092b3f0af2aa632fa3",
                "md5": "403dbd797cce10f12a5e2d9c41a196bb",
                "sha256": "e18091bb7a2239a52d03bff4e0a726e397fcfa09e428964d192804432ba6b6e8"
            },
            "downloads": -1,
            "filename": "fundus-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "403dbd797cce10f12a5e2d9c41a196bb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 65248,
            "upload_time": "2024-05-13T16:33:02",
            "upload_time_iso_8601": "2024-05-13T16:33:02.672657Z",
            "url": "https://files.pythonhosted.org/packages/79/e2/a7ba830d24df62d0d606041eff2b4093b3aca154b3092b3f0af2aa632fa3/fundus-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-13 16:33:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "flairNLP",
    "github_project": "fundus",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "fundus"
}
        