tagesschauscraper


:Name: tagesschauscraper
:Version: 0.1.2
:Home page: https://github.com/TheFerry10/TagesschauScraper
:Summary: A library for scraping the German news archive of Tagesschau.de
:Upload time: 2023-02-02 15:53:16
:Author: Malte Sauerwein
:License: GPL-3.0
:Keywords: tagesschau, scraper, scraping, news, archive
:Requirements: none recorded

TagesschauScraper
=================

A library for scraping the German news archive of Tagesschau.de

Install
-------

TagesschauScraper is available on PyPI:

::

   $ pip install tagesschauscraper

Usage
-----

Here’s an example of how to use the library to scrape teaser info from
the Tagesschau archive:

.. code:: python

   import logging
   import os
   from datetime import date

   from tagesschauscraper import helper, tagesschau

   # Make the info message logged further down visible on the console
   logging.basicConfig(level=logging.INFO)

   # Scrape the teasers published on <date_> in a specific news category
   DATA_DIR = "data"
   date_ = date(2022, 3, 1)
   category = "wirtschaft"

   # Initialize scraper, create url and run
   tagesschauScraper = tagesschau.TagesschauScraper()
   url = tagesschau.create_url_for_news_archive(date_, category=category)
   teaser = tagesschauScraper.scrape_teaser(url)

   # Save output in a hierarchical directory tree
   os.makedirs(DATA_DIR, exist_ok=True)
   dateDirectoryTreeCreator = helper.DateDirectoryTreeCreator(
       date_, root_dir=DATA_DIR
   )
   file_path = dateDirectoryTreeCreator.create_file_path_from_date()
   dateDirectoryTreeCreator.make_dir_tree_from_file_path(file_path)
   file_name_and_path = os.path.join(
       file_path,
       helper.create_file_name_from_date(
           date_, suffix="_" + category, extension=".json"
       ),
   )
   logging.info(f"Save scraped teaser to file {file_name_and_path}")
   helper.save_to_json(teaser, file_name_and_path)

The result is saved to ``data/2022/03/2022-03-01_wirtschaft.json``. The
JSON document looks like the following (snippet only):

::

   {
       "teaser": [
           {
               "date": "2022-03-01 22:23:00",
               "topline": "Deutliche Verluste",
               "headline": "Der Krieg lastet auf der Wall Street",
               "shorttext": "Die intensiven K\u00e4mpfe in der Ukraine und die Auswirkungen der Sanktionen verschreckten die US-Investoren.",
               "link": "https://www.tagesschau.de/wirtschaft/finanzen/marktberichte/marktbericht-dax-dow-jones-213.html",
               "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht",
               "id": "d49cfb71130e46638dcfe2afe8d775ac9670a9a8"
           },
           {
               "date": "2022-03-01 18:54:00",
               "topline": "Pipeline-Projekt",
               "headline": "Nordstream-Betreiber offenbar insolvent",
               "shorttext": "Die Nord Stream 2 AG, die Schweizer Eigent\u00fcmergesellschaft der neuen Ostsee-Pipeline nach Russland, ist offenbar insolvent.",
               "link": "https://www.tagesschau.de/wirtschaft/unternehmen/nord-stream-insolvenz-gazrom-gas-pipeline-russland-ukraine-103.html",
               "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz",
               "id": "595aa643ed39edd3695b8401a99ce808afa539fb"
           },
           {
               "date": "2022-03-01 18:52:00",
               "topline": "Fehlende Teile wegen Ukraine-Kriegs",
               "headline": "Autobauern drohen Produktionsausf\u00e4lle",
               "shorttext": "Der anhaltende Krieg in der Ukraine bremst auch die deutsche Autoindustrie.",
               "link": "https://www.tagesschau.de/wirtschaft/autobauern-drohen-produktionsausfaelle-101.html",
               "tags": "Autowerke,BMW,Mercedes,Produktionsausf\u00e4lle,Ukraine,Ukraine-Krieg,VW",
               "id": "914174596c3590784c903908f569c099475981f7"
           },
           ...
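
The output path shown above follows a ``<root>/<year>/<month>/<date>_<category>.json``
layout. The sketch below rebuilds that layout with the standard library only,
to make the naming scheme explicit. ``archive_file_path`` is a hypothetical
helper written for this illustration; it is not part of tagesschauscraper and
may differ from how ``helper.DateDirectoryTreeCreator`` works internally.

```python
from datetime import date
from pathlib import Path


def archive_file_path(root_dir, day, category):
    """Hypothetical helper: <root>/<YYYY>/<MM>/<YYYY-MM-DD>_<category>.json."""
    # Year and month become nested directories below the root.
    directory = Path(root_dir) / f"{day:%Y}" / f"{day:%m}"
    # The file name combines the ISO date with the news category.
    return directory / f"{day.isoformat()}_{category}.json"


path = archive_file_path("data", date(2022, 3, 1), "wirtschaft")
print(path.as_posix())
```

Keeping one file per day and category in a year/month tree makes it easy to
re-run a single day without touching the rest of the archive.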

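Once a file has been written, the teasers can be processed with the standard
library alone. The following is a minimal sketch, assuming the structure shown
in the snippet above: a top-level ``teaser`` list whose entries carry ``date``
as a ``YYYY-MM-DD HH:MM:SS`` string and ``tags`` as a comma-separated string.
``parse_teasers`` and the shortened ``SAMPLE`` document are assumptions made
for this example and are not provided by the library.

```python
import json
from datetime import datetime

# Shortened sample that mirrors the structure of the snippet above.
SAMPLE = r"""
{
    "teaser": [
        {
            "date": "2022-03-01 22:23:00",
            "headline": "Der Krieg lastet auf der Wall Street",
            "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht"
        },
        {
            "date": "2022-03-01 18:54:00",
            "headline": "Nordstream-Betreiber offenbar insolvent",
            "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz"
        }
    ]
}
"""


def parse_teasers(raw_json):
    """Return teasers with ``date`` as datetime and ``tags`` as a list."""
    parsed = []
    for teaser in json.loads(raw_json)["teaser"]:
        item = dict(teaser)
        item["date"] = datetime.strptime(teaser["date"], "%Y-%m-%d %H:%M:%S")
        # An empty tag string yields an empty list instead of [""].
        item["tags"] = teaser["tags"].split(",") if teaser["tags"] else []
        parsed.append(item)
    return parsed


teasers = parse_teasers(SAMPLE)
for t in teasers:
    print(t["date"].isoformat(), t["headline"], t["tags"])
```

To read a real output file instead of the sample, pass the file's contents,
for example ``parse_teasers(open(file_name_and_path, encoding="utf-8").read())``.
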
Contributing
------------

If you’d like to contribute to TagesschauScraper, please fork the
repository and make changes as you’d like. Pull requests are welcome.

License
-------

TagesschauScraper is licensed under the GPL-3.0 license.

Disclaimer
----------

Please note that this is a scraping tool, and using it to scrape website
data without the website owner’s consent may be against their terms of
service. Use at your own risk.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/TheFerry10/TagesschauScraper",
    "name": "tagesschauscraper",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "tagesschau scraper scraping news archive",
    "author": "Malte Sauerwein",
    "author_email": "malte.sauerwein@live.de",
    "download_url": "https://files.pythonhosted.org/packages/5b/f6/70e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff/tagesschauscraper-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL-3.0 license",
    "summary": "A library for scraping the German news archive of Tagesschau.de",
    "version": "0.1.2",
    "split_keywords": [
        "tagesschau",
        "scraper",
        "scraping",
        "news",
        "archive"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "685de1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb",
                "md5": "93e8f459bc42fb634089a1610618134c",
                "sha256": "cc22839be1f1a3904ff3495e2eecbf2cd25e28573f39519d5c0bdd8a1f9ae612"
            },
            "downloads": -1,
            "filename": "tagesschauscraper-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "93e8f459bc42fb634089a1610618134c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 20889,
            "upload_time": "2023-02-02T15:53:14",
            "upload_time_iso_8601": "2023-02-02T15:53:14.352151Z",
            "url": "https://files.pythonhosted.org/packages/68/5d/e1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb/tagesschauscraper-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5bf670e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff",
                "md5": "71ff9f1e565253fea85e21b83ded91a1",
                "sha256": "84fd15e485b31c5eee9b10baad949ca229b2dcaf43fbafc569bfd5b33ffb4eb8"
            },
            "downloads": -1,
            "filename": "tagesschauscraper-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "71ff9f1e565253fea85e21b83ded91a1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 21818,
            "upload_time": "2023-02-02T15:53:16",
            "upload_time_iso_8601": "2023-02-02T15:53:16.001506Z",
            "url": "https://files.pythonhosted.org/packages/5b/f6/70e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff/tagesschauscraper-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-02-02 15:53:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "TheFerry10",
    "github_project": "TagesschauScraper",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "tagesschauscraper"
}
        