<h1 align="center">Welcome to the Wikipedia Periodic Revisions <code>(wikipedia_tools)</code> </h1>
<p align="center">
<a href="https://github.com/DLR-SC/wikipedia-periodic-revisions/blob/master/LICENSE">
<img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-yellow.svg" target="_blank" />
</a>
<a href="https://img.shields.io/badge/Made%20with-Python-1f425f.svg">
<img src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg" alt="Badge: Made with Python"/>
</a>
<a href="https://pypi.org/project/wikipedia_tools/"><img src="https://badge.fury.io/py/wikipedia_tools.svg" alt="Badge: PyPI version" height="18"></a>
<a href="https://twitter.com/dlr_software">
<img alt="Twitter: DLR Software" src="https://img.shields.io/twitter/follow/dlr_software.svg?style=social" target="_blank" />
</a>
<a href="https://open.vscode.dev/DLR-SC/wikipedia_tools">
<img alt="Badge: Open in VSCode" src="https://img.shields.io/static/v1?logo=visualstudiocode&label=&message=open%20in%20visual%20studio%20code&labelColor=2c2c32&color=007acc&logoColor=007acc" target="_blank" />
</a>
<a href="https://github.com/psf/black">
<img alt="Badge: Open in VSCode" src="https://img.shields.io/badge/code%20style-black-000000.svg" target="_blank" />
</a>
</p>
> `wikipedia_tools` is a Python package for downloading Wikipedia revisions of pages that belong to certain *categories*, within a given period of time. The package also provides overview statistics for the downloaded data.
---
## Dependencies and Credits
#### [Wikipedia API](https://github.com/goldsmith/Wikipedia)
This package is built on top of the [Wikipedia API](https://github.com/goldsmith/Wikipedia), whose code was forked into the `base` subpackage.
#### [ajoer/WikiRevParser](https://github.com/ajoer/WikiRevParser)
We also forked the code from [ajoer/WikiRevParser](https://github.com/ajoer/WikiRevParser) and modified it to support *from* and *to* datetimes, so that revisions can be fetched for a specific period; the modified code is `wikipedia_tools.scraper.wikirevparser_with_time.py`.
Note: there is no need to download these two projects separately; they are already integrated into this package.
## Installation
Via pip:
```
pip install wikipedia_tools
```
Or install manually by cloning the repository and then running:
```
pip install -e wikipedia_tools
```
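To verify the installation, a quick import is enough:
```
python -c "import wikipedia_tools"
```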
## wikipedia_tools package
This package is responsible for:
- fetching wiki page revisions based on a period of time,
- loading them into parquet files, and
- providing basic analysis.

It contains three main subpackages and a *utils* package with a few helper functions:
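Based on the modules referenced throughout this README, the layout is roughly as follows (a sketch, not an exhaustive listing):
```
wikipedia_tools
├── base        # forked Wikipedia API code
├── scraper     # downloads page revisions (incl. wikirevparser_with_time)
├── analyzer    # overview statistics over the downloaded revisions
└── utils       # helper functions
```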
### Download Wiki Article Revisions [[wikipedia_tools.scraper](wikipedia_tools/wikipedia_tools/scraper.py)]
This subpackage is responsible for downloading Wikipedia revisions from the web.
The code below shows how to download all revisions of pages that:
- belong to the *Climate_change* category, and
- were revised between the beginning of the month 8 months ago (1.1.2022, at the time of writing) and now (29.9.2022). The *get_x_months_ago_date* function returns the datetime of the beginning of the month, x months ago:
```python
from wikipedia_tools.utils import utils
utils.get_x_months_ago_date(8)
```
- if `save_each_page=True`, each page is fetched and saved on the spot under the folder **data/periodic_wiki_batches/{*categories_names*}/from{month-year}_to{month-year}**; otherwise, all page revisions are fetched first and then saved into one jsonl file.
```python
from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils
from datetime import datetime

wikirevs = downloader.WikiPagesRevision(
    categories=["Climate_change"],
    revisions_from=utils.get_x_months_ago_date(8),
    revisions_to=datetime.now(),
    save_each_page=True,
)

count, destination_folder = wikirevs.download()
```
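`download()` returns a count of the processed pages together with the destination folder, whose name follows the **data/periodic_wiki_batches/...** pattern described above, so you can inspect the batch right away (the folder name printed below is illustrative):

```python
from glob import glob

# count: number of pages processed; destination_folder: the batch folder,
# e.g. data/periodic_wiki_batches/Climate_change/from01-2022_to09-2022
print(f"Downloaded revisions for {count} pages into {destination_folder}")
print(glob(f"{destination_folder}/*")[:5])  # peek at the first few files
```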
For German wiki revisions, you can set the *lang* attribute to *de*. For example, you can download the revisions of German Wikipedia pages in the *Klimaveränderung* category as follows:
```python
from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils
from datetime import datetime

wikirevs = downloader.WikiPagesRevision(
    categories=["Klimaveränderung"],
    # beginning of last month; see the sketch below for fully custom past datetimes
    revisions_from=utils.get_x_months_ago_date(1),
    revisions_to=datetime.now(),
    save_each_page=True,
    lang="de",
)
count, destination_folder = wikirevs.download()
```
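As the comment above hints, you are not limited to month boundaries. A minimal sketch, assuming `python-dateutil` is installed, for computing an arbitrary past datetime relative to now:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Exactly three months and two weeks before now, not snapped to the
# beginning of a month (unlike utils.get_x_months_ago_date)
revisions_from = datetime.now() - relativedelta(months=3, weeks=2)
```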
You can then process each file, for example by reading the parquet files with pandas:
```python
import pandas as pd
from glob import glob

files = f"{destination_folder}/*.parquet"

# Loop over all wiki pages in this batch and read each page's revisions
# as a pandas dataframe
for page_path in glob(files):
    page_revs_df = pd.read_parquet(page_path)
    # dataframe with columns ['page', 'lang', 'timestamp', 'categories',
    # 'content', 'images', 'links', 'sections', 'urls', 'user']
    # process/use the dataframe ...
```
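If the batch fits in memory, you can also combine all pages into a single dataframe; a minimal sketch reusing `destination_folder` from the download step:

```python
import pandas as pd
from glob import glob

# Concatenate the revisions of all pages in the batch
all_revs_df = pd.concat(
    (pd.read_parquet(path) for path in glob(f"{destination_folder}/*.parquet")),
    ignore_index=True,
)

# e.g. order all revisions chronologically across pages
all_revs_df = all_revs_df.sort_values("timestamp")
```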
### Overview Stats
```python
# Initialize the analyzer object
from wikipedia_tools.analyzer.revisions import WikipediaRevisionAnalyzer

# category, CORPUS, and ROOT_PATH are user-defined; see the full example
# under the examples folder
analyzer = WikipediaRevisionAnalyzer(
    category=category,
    period=properties.PERIODS._YEARLY_,
    corpus=CORPUS,
    root=ROOT_PATH,
)

# Get the yearly number of articles that were created/edited at least once
unique_created_updated_articles = analyzer.get_edited_page_count(plot=True, save=True)

# Returns the number of created articles over time
unique_created_articles = analyzer.get_created_page_count(plot=True, save=True)

# Returns the number of revisions over time
rev_overtime_df = analyzer.get_revisions_over_time(save=True)

# Returns the number of words over time
words_overtime_df = analyzer.get_words_over_time(save=True)

# Returns the number of users over time, grouped by user type
users_overtime_df = analyzer.get_users_over_time(save=True)

# Returns the top n most edited wikipedia articles over time
top_edited = analyzer.get_most_edited_articles(top=4)

# Returns the articles sorted from most to least edited over time
most_to_least_revised = analyzer.get_periodic_most_to_least_revised(save=True)
```
You can find the full example under the examples folder.
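For a quick look at one of the returned dataframes, here is a minimal sketch assuming `rev_overtime_df` from above and matplotlib installed; the exact column layout depends on the analyzer output, so inspect the dataframe first:

```python
import matplotlib.pyplot as plt

# Inspect the structure before plotting
print(rev_overtime_df.head())

# Quick bar chart of revision counts per period
rev_overtime_df.plot(kind="bar")
plt.title("Revisions over time")
plt.tight_layout()
plt.show()
```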