TagesschauScraper
=================
A library for scraping the German news archive of Tagesschau.de
Install
-------
Tagesschauscraper is available on PyPI:

::

    $ pip install tagesschauscraper

Usage
-----
Here’s an example of how to use the library to scrape teaser info from
the Tagesschau archive:

.. code:: python

    import logging
    import os
    from datetime import date

    from tagesschauscraper import helper, tagesschau

    # Make the INFO message below visible on the console
    logging.basicConfig(level=logging.INFO)

    # Scrape teasers published on <date_> in a specific news category
    DATA_DIR = "data"
    date_ = date(2022, 3, 1)
    category = "wirtschaft"

    # Initialize the scraper, create the archive URL and run
    tagesschauScraper = tagesschau.TagesschauScraper()
    url = tagesschau.create_url_for_news_archive(date_, category=category)
    teaser = tagesschauScraper.scrape_teaser(url)

    # Save the output in a hierarchical directory tree
    if not os.path.isdir(DATA_DIR):
        os.mkdir(DATA_DIR)
    dateDirectoryTreeCreator = helper.DateDirectoryTreeCreator(
        date_, root_dir=DATA_DIR
    )
    file_path = dateDirectoryTreeCreator.create_file_path_from_date()
    dateDirectoryTreeCreator.make_dir_tree_from_file_path(file_path)
    file_name_and_path = os.path.join(
        file_path,
        helper.create_file_name_from_date(
            date_, suffix="_" + category, extension=".json"
        ),
    )
    logging.info(f"Save scraped teaser to file {file_name_and_path}")
    helper.save_to_json(teaser, file_name_and_path)

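
The example stores one file per date and category inside a year/month
directory tree. As a rough illustration of that layout, the sketch below
rebuilds the same kind of path with the standard library only. The
function ``build_teaser_file_path`` is a hypothetical helper written for
this README; it is not part of tagesschauscraper and does not claim to
mirror how ``helper.DateDirectoryTreeCreator`` works internally.

.. code:: python

    import os
    from datetime import date


    def build_teaser_file_path(date_, category, root_dir="data"):
        """Return <root_dir>/<YYYY>/<MM>/<YYYY-MM-DD>_<category>.json.

        Hypothetical helper for illustration only; it just builds the
        path string and does not create any directories.
        """
        directory = os.path.join(
            root_dir, f"{date_.year:04d}", f"{date_.month:02d}"
        )
        file_name = f"{date_.isoformat()}_{category}.json"
        return os.path.join(directory, file_name)


    path = build_teaser_file_path(date(2022, 3, 1), "wirtschaft")
    print(path)

To create the directories before writing, something like
``os.makedirs(os.path.dirname(path), exist_ok=True)`` would work.
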
The result is saved to ``data/2022/03/2022-03-01_wirtschaft.json``. The
JSON document looks like the following (only a snippet):

::

    {
        "teaser": [
            {
                "date": "2022-03-01 22:23:00",
                "topline": "Deutliche Verluste",
                "headline": "Der Krieg lastet auf der Wall Street",
                "shorttext": "Die intensiven K\u00e4mpfe in der Ukraine und die Auswirkungen der Sanktionen verschreckten die US-Investoren.",
                "link": "https://www.tagesschau.de/wirtschaft/finanzen/marktberichte/marktbericht-dax-dow-jones-213.html",
                "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht",
                "id": "d49cfb71130e46638dcfe2afe8d775ac9670a9a8"
            },
            {
                "date": "2022-03-01 18:54:00",
                "topline": "Pipeline-Projekt",
                "headline": "Nordstream-Betreiber offenbar insolvent",
                "shorttext": "Die Nord Stream 2 AG, die Schweizer Eigent\u00fcmergesellschaft der neuen Ostsee-Pipeline nach Russland, ist offenbar insolvent.",
                "link": "https://www.tagesschau.de/wirtschaft/unternehmen/nord-stream-insolvenz-gazrom-gas-pipeline-russland-ukraine-103.html",
                "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz",
                "id": "595aa643ed39edd3695b8401a99ce808afa539fb"
            },
            {
                "date": "2022-03-01 18:52:00",
                "topline": "Fehlende Teile wegen Ukraine-Kriegs",
                "headline": "Autobauern drohen Produktionsausf\u00e4lle",
                "shorttext": "Der anhaltende Krieg in der Ukraine bremst auch die deutsche Autoindustrie.",
                "link": "https://www.tagesschau.de/wirtschaft/autobauern-drohen-produktionsausfaelle-101.html",
                "tags": "Autowerke,BMW,Mercedes,Produktionsausf\u00e4lle,Ukraine,Ukraine-Krieg,VW",
                "id": "914174596c3590784c903908f569c099475981f7"
            },
            ...

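
In the snippet above, ``date`` is a plain string and ``tags`` is a
comma-separated string. The following sketch shows one possible way to
turn both into richer Python types after loading such a file. It uses
only the standard library; ``SAMPLE`` is a trimmed copy of two entries
from the snippet, and ``parse_teaser`` is a hypothetical helper, not a
function provided by tagesschauscraper.

.. code:: python

    import json
    from datetime import datetime

    # Trimmed copy of two entries from the snippet above (fields omitted)
    SAMPLE = r"""
    {
        "teaser": [
            {
                "date": "2022-03-01 22:23:00",
                "headline": "Der Krieg lastet auf der Wall Street",
                "tags": "B\u00f6rse,DAX,Dow Jones,Marktbericht"
            },
            {
                "date": "2022-03-01 18:54:00",
                "headline": "Nordstream-Betreiber offenbar insolvent",
                "tags": "Insolvenz,Nord Stream 2,Pipeline,Russland,Schweiz"
            }
        ]
    }
    """


    def parse_teaser(raw):
        """Return a copy of a teaser with parsed date and a list of tags."""
        return {
            **raw,
            "date": datetime.strptime(raw["date"], "%Y-%m-%d %H:%M:%S"),
            "tags": raw["tags"].split(",") if raw["tags"] else [],
        }


    document = json.loads(SAMPLE)
    teasers = [parse_teaser(raw) for raw in document["teaser"]]

    # Example query: headlines of all teasers tagged "Russland"
    russia_headlines = [
        teaser["headline"] for teaser in teasers if "Russland" in teaser["tags"]
    ]
    print(russia_headlines)

For a file written by the usage example, replacing ``json.loads(SAMPLE)``
with ``json.load`` on the opened file should give the same structure,
assuming the file has the shape shown in the snippet.
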
Contributing
------------
If you’d like to contribute to TagesschauScraper, please fork the
repository and make changes as you’d like. Pull requests are welcome.
License
-------
TagesschauScraper is licensed under the GPL-3.0 license.
Disclaimer
----------
Please note that this is a scraping tool, and using it to scrape website
data without the website owner’s consent may be against their terms of
service. Use at your own risk.
Raw data
{
"_id": null,
"home_page": "https://github.com/TheFerry10/TagesschauScraper",
"name": "tagesschauscraper",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "tagesschau scraper scraping news archive",
"author": "Malte Sauerwein",
"author_email": "malte.sauerwein@live.de",
"download_url": "https://files.pythonhosted.org/packages/5b/f6/70e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff/tagesschauscraper-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0 license",
"summary": "A library for scraping the German news archive of Tagesschau.de",
"version": "0.1.2",
"split_keywords": [
"tagesschau",
"scraper",
"scraping",
"news",
"archive"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "685de1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb",
"md5": "93e8f459bc42fb634089a1610618134c",
"sha256": "cc22839be1f1a3904ff3495e2eecbf2cd25e28573f39519d5c0bdd8a1f9ae612"
},
"downloads": -1,
"filename": "tagesschauscraper-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "93e8f459bc42fb634089a1610618134c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 20889,
"upload_time": "2023-02-02T15:53:14",
"upload_time_iso_8601": "2023-02-02T15:53:14.352151Z",
"url": "https://files.pythonhosted.org/packages/68/5d/e1291265682cc2e8634947afa695af3a05ebd69dcca1d8cccb3fc4e21ceb/tagesschauscraper-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5bf670e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff",
"md5": "71ff9f1e565253fea85e21b83ded91a1",
"sha256": "84fd15e485b31c5eee9b10baad949ca229b2dcaf43fbafc569bfd5b33ffb4eb8"
},
"downloads": -1,
"filename": "tagesschauscraper-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "71ff9f1e565253fea85e21b83ded91a1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 21818,
"upload_time": "2023-02-02T15:53:16",
"upload_time_iso_8601": "2023-02-02T15:53:16.001506Z",
"url": "https://files.pythonhosted.org/packages/5b/f6/70e23288d029220c8a4f53c021973734aca2b7201814c3136d56476625ff/tagesschauscraper-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-02 15:53:16",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "TheFerry10",
"github_project": "TagesschauScraper",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "tagesschauscraper"
}