news-crawlers

- Name: news-crawlers
- Version: 3.5.1
- Summary: An extensible Python library to create web crawlers which alert users on news.
- Upload time: 2023-05-26 12:17:31
- Requires Python: >=3.7
- License: MIT
- Keywords: crawler, news

# News Crawlers

Contains various spiders which crawl websites for new content. If any new
content is found, users are alerted via email or Pushover.

![Tests](https://github.com/jprevc/news_crawlers/actions/workflows/tests.yml/badge.svg)

Installation
------------------
Install this application with:

    python -m pip install news_crawlers

After installation, News Crawlers can be run from the CLI. To view the help, run:

    python -m news_crawlers -h

Configuration
----------------------------
News Crawlers is configured with a *news_crawlers.yaml* file.

The path to the configuration file can be provided via the CLI, like this:

    python -m news_crawlers -c {news_crawlers.yaml path}

If the path is not provided, the application will search for the file in the *config* folder (if it exists) and in the
current working directory.
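
The lookup described above can be sketched as follows. This is only an illustration, not the library's own code: the function name `find_config`, the assumption that the *config* folder is relative to the working directory, and the order in which the two locations are tried are all hypothetical.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "news_crawlers.yaml"


def find_config(explicit_path: Optional[str] = None, cwd: Optional[Path] = None) -> Path:
    """Return the configuration file to use, or raise FileNotFoundError."""
    if explicit_path is not None:
        # A path given with -c takes precedence over any search.
        return Path(explicit_path)

    base = Path.cwd() if cwd is None else cwd
    # Assumed order: the config folder first, then the working directory itself.
    candidates = [base / "config" / CONFIG_NAME, base / CONFIG_NAME]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{CONFIG_NAME} not found in: {[str(c) for c in candidates]}")
```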

When a spider is run, it appends any new items it finds to the *.nc_cache* folder. The location of that folder can be
customized with the --cache option, like this:

    python -m news_crawlers --cache {data path}

If not specified, the application will put the cache in *data/.nc_cache*, relative to the current working directory.
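
To illustrate why a cache is needed, here is a minimal sketch of new-item detection against a cache file. The JSON file format, the function name `append_new_items`, and the rule that an item is new when no identical dictionary is cached are assumptions made for this example; the library's actual cache layout may differ.

```python
import json
from pathlib import Path
from typing import Dict, List

Item = Dict[str, str]


def append_new_items(cache_file: Path, scraped: List[Item]) -> List[Item]:
    """Store unseen items in cache_file and return only those unseen items."""
    cached: List[Item] = []
    if cache_file.exists():
        cached = json.loads(cache_file.read_text(encoding="utf-8"))

    # An item counts as new when no identical dictionary is already known.
    new_items: List[Item] = []
    for item in scraped:
        if item not in cached and item not in new_items:
            new_items.append(item)

    if new_items:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(cached + new_items, indent=2), encoding="utf-8")
    return new_items
```

Only the returned items would need to be passed on to the notification step, so repeated runs do not alert users about the same item twice.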

The configuration file should contain a *spiders* segment, which lists the spiders and their configurations,
for example:

    spiders:
        bolha:
            notifications:
              email:
                email_user: "__env_EMAIL_USER"
                email_password: "__env_EMAIL_PASS"
                recipients: ['jost.prevc@gmail.com']
                message_body_format: "Query: {query}\nURL: {url}\nPrice: {price}\n"
              pushover:
                recipients: ['ukdwndomjog3swwos57umfydpsa2sk']
                send_separately: True
                message_body_format: "Query: {query}\nPrice: {price}\n"
            urls:
              'pet_prijateljev': https://www.bolha.com/?ctl=search_ads&keywords=pet+prijateljev
              'enid_blyton': https://www.bolha.com/?ctl=search_ads&keywords=enid%20blyton

The spider name ("bolha" in the example above) should match the *name* attribute of a spider defined in spiders.py.
Each spider should have a *notifications* and a *urls* segment. The *notifications* segment defines how users are
notified of any changes found when crawling the URLs listed in the *urls* segment.

Note that any configuration value prefixed with "\_\_env\_" is treated as a reference to an environment variable: the
rest of the string is taken as the variable name, and its value is read from the environment. For example,
"__env_EMAIL_USER" will be replaced with the value of the "EMAIL_USER" environment variable. This is useful for keeping
secrets out of the configuration file.
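
The substitution rule can be sketched in a few lines of Python. The helper name `resolve_env` and the decision to raise `KeyError` for an unset variable are assumptions of this sketch, not documented behaviour of the library.

```python
import os
from typing import Any

ENV_PREFIX = "__env_"


def resolve_env(value: Any) -> Any:
    """Recursively replace "__env_NAME" strings with os.environ["NAME"]."""
    if isinstance(value, dict):
        return {key: resolve_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item) for item in value]
    if isinstance(value, str) and value.startswith(ENV_PREFIX):
        name = value[len(ENV_PREFIX):]
        # Raising KeyError on a missing variable is a choice made for this sketch.
        return os.environ[name]
    return value
```

Applied to a parsed configuration dictionary, this leaves ordinary values untouched and replaces only the prefixed strings.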

Crawling can also be run on a schedule by adding a *schedule* segment to the news_crawlers.yaml file:

    schedule:
        every: 15
        units: minutes

A complete *news_crawlers.yaml* file would then look like this:

    schedule:
        every: 15
        units: minutes
    spiders:
        bolha:
            notifications:
              email:
                email_user: "__env_EMAIL_USER"
                email_password: "__env_EMAIL_PASS"
                recipients: ['jost.prevc@gmail.com']
                message_body_format: "Query: {query}\nURL: {url}\nPrice: {price}\n"
              pushover:
                recipients: ['ukdwndomjog3swwos57umfydpsa2sk']
                send_separately: True
                message_body_format: "Query: {query}\nPrice: {price}\n"
            urls:
              'pet_prijateljev': https://www.bolha.com/?ctl=search_ads&keywords=pet+prijateljev
              'enid_blyton': https://www.bolha.com/?ctl=search_ads&keywords=enid%20blyton
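
As an illustration of how the *schedule* segment can be read, the sketch below converts it to a repeat interval. The function name `schedule_interval` and the set of accepted unit names are assumptions; the README only shows `minutes`, and the units the library actually supports are not listed here.

```python
from datetime import timedelta
from typing import Any, Dict


def schedule_interval(schedule: Dict[str, Any]) -> timedelta:
    """Convert a schedule segment such as {"every": 15, "units": "minutes"}."""
    units = schedule["units"]
    # Only unit names that timedelta accepts as keyword arguments are allowed here.
    if units not in {"seconds", "minutes", "hours", "days", "weeks"}:
        raise ValueError(f"unsupported units: {units!r}")
    return timedelta(**{units: schedule["every"]})
```

With the example configuration above, the interval is 15 minutes, that is, 900 seconds between crawls.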

Notification configuration
------------------------------
Next, you should configure the notifications which will alert you about any news found. Currently, there are two
options: email via the Gmail SMTP server, or Pushover.

### Email configuration

Visit [Google app passwords](https://myaccount.google.com/apppasswords) and generate a new app password for your account.

The username and password can then be placed directly in the configuration file or referenced via environment variables
(see instructions above).
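
For orientation, this is roughly what sending a message through Gmail's SMTP server with an app password looks like when using only the Python standard library. It is a sketch, not the library's own sending code; the function names `build_email` and `send_email` are hypothetical.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_email(sender: str, recipients: List[str], subject: str, body: str) -> EmailMessage:
    """Create a plain-text message addressed to all recipients."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_email(message: EmailMessage, user: str, app_password: str) -> None:
    """Send the message through Gmail's SMTP server over SSL (port 465)."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        # The app password generated above is used instead of the account password.
        server.login(user, app_password)
        server.send_message(message)
```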

### Pushover configuration

[Pushover](https://pushover.net) is a platform which enables you to easily send and receive push notifications on your
smart device. To get it running, you will first need to create a user account. You can sign up
[here](https://pushover.net/signup). When sign-up is complete, you will receive a unique user token, which you
will have to copy and paste into your crawler configuration (see example configuration above). Every user who wants to
receive push notifications needs to create their own Pushover account to receive their own user token, which is then
stored in the crawler configuration.

Next, you should register your crawler application on Pushover. To do this, visit the [registration site](https://pushover.net/apps/build)
and fill out the provided form. Once your application is registered, you will receive an API token. This token can then
be placed directly in the configuration file or referenced via environment variables (see instructions above).
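
The sketch below shows how the two tokens fit together in a request to Pushover's message endpoint: the API token identifies the application and the user token identifies the recipient. It uses only the standard library and is not the library's own code; the function names are hypothetical.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_pushover_request(api_token: str, user_key: str, message: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for one recipient."""
    payload = urllib.parse.urlencode(
        {"token": api_token, "user": user_key, "message": message}
    ).encode("utf-8")
    return urllib.request.Request(PUSHOVER_URL, data=payload, method="POST")


def send_pushover(request: urllib.request.Request) -> int:
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```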

To receive notifications, every user should download the Pushover app to the smart device on which they want to
receive them. Once logged in, they will receive a push notification whenever a crawler finds news.

- [Android](https://play.google.com/store/apps/details?id=net.superblock.pushover)
- [AppStore](https://apps.apple.com/us/app/pushover-notifications/id506088175?ls=1)

Note: The Pushover trial version expires after 30 days. After that, you will need to make a one-time purchase of $5
to keep it working, see [pricing](https://pushover.net/pricing).


Running the crawlers
----------------------
Run the scraper by executing the following command from the project root:

    python -m news_crawlers scrape

You can also run individual spiders with:

    python -m news_crawlers scrape -s {spider_name}


This will run the specified spider and then send the configured notifications if any
news is found.

Contribution
==================

Checkout
----------------
Check out this project with:

    git clone https://github.com/jprevc/news_crawlers.git

Adding new custom crawlers
----------------------------

New spiders need to be added to the news_crawlers/spiders.py file. Each spider is a class which must subclass the
Spider class.

When crawling, a spider needs to yield every item it finds as a dictionary. The keys of each item need to correspond to
the placeholders referenced in the "message_body_format" field of the configuration file.
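
The relationship between item keys and "message_body_format" can be checked with Python's standard string formatting, as in the sketch below. The helper names `required_keys` and `render` are hypothetical and the Spider base class itself is not shown; the sketch only illustrates why the keys must match.

```python
from string import Formatter
from typing import Dict, Set


def required_keys(message_body_format: str) -> Set[str]:
    """Return the placeholder names used in a format string."""
    return {field for _, field, _, _ in Formatter().parse(message_body_format) if field}


def render(message_body_format: str, item: Dict[str, str]) -> str:
    """Format one item, failing early if a referenced key is missing."""
    missing = required_keys(message_body_format) - set(item)
    if missing:
        raise KeyError(f"item is missing keys: {sorted(missing)}")
    return message_body_format.format(**item)
```

For the format string used in the example configuration, a yielded item therefore needs at least the keys `query`, `url` and `price`.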

            
