wikipedia_tools


- Name: wikipedia_tools
- Version: 2.4.1
- Summary: This is a Wikipedia Tool to fetch revisions based on a period of time.
- Upload time: 2023-10-12 06:57:28
- Requires Python: >=3.7
- Keywords: wikipedia, wikipedia revisions, wikipedia stats
- Requirements: No requirements were recorded.
            
<h1 align="center">Welcome to the Wikipedia Periodic Revisions <code>(wikipedia_tools)</code> </h1>

<p align="center">
  <a href="https://github.com/DLR-SC/wikipedia-periodic-revisions/blob/master/LICENSE">
    <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-yellow.svg" target="_blank" />
  </a>
  <a href="https://img.shields.io/badge/Made%20with-Python-1f425f.svg">
    <img src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg" alt="Badge: Made with Python"/>
  </a>
  <a href="https://pypi.org/project/wikipedia_tools/"><img src="https://badge.fury.io/py/wikipedia_tools.svg" alt="Badge: PyPI version" height="18"></a>
  <a href="https://twitter.com/dlr_software">
    <img alt="Twitter: DLR Software" src="https://img.shields.io/twitter/follow/dlr_software.svg?style=social" target="_blank" />
  </a>
  <a href="https://open.vscode.dev/DLR-SC/wikipedia_tools">
    <img alt="Badge: Open in VSCode" src="https://img.shields.io/static/v1?logo=visualstudiocode&label=&message=open%20in%20visual%20studio%20code&labelColor=2c2c32&color=007acc&logoColor=007acc" target="_blank" />
  </a>
  

  <a href="https://github.com/psf/black">
    <img alt="Badge: Code style Black" src="https://img.shields.io/badge/code%20style-black-000000.svg" target="_blank" />
  </a>
</p>

> `wikipedia_tools` is a Python package for downloading Wikipedia revisions of pages belonging to certain *categories*, restricted to a period of time. The package also provides overview stats for the downloaded data.

---

## Dependencies and Credits

#### [Wikipedia API](https://github.com/goldsmith/Wikipedia)

This package is built on top of the [Wikipedia API](https://github.com/goldsmith/Wikipedia). Its code was forked into the `base` subpackage.

#### [ajoer/WikiRevParser](https://github.com/ajoer/WikiRevParser)

We also forked the code from [ajoer/WikiRevParser](https://github.com/ajoer/WikiRevParser) and modified it to accept *from* and *to* datetimes, so that only revisions within a given period are fetched; the modified code is `wikipedia_tools.scraper.wikirevparser_with_time.py`.

Note: There is no need to download these two projects; they are already integrated into this project.

## Installation

Via pip:

``` 
pip install wikipedia_tools
```

Or install manually by cloning the repository and then running:

``` 
pip install -e wikipedia_tools
```



## wikipedia_tools package

This package is responsible for:
- fetching the revisions of wiki pages for a given period of time,
- loading them into parquet, and
- providing basic analysis.

It contains three main subpackages, plus the *utils* subpackage, which contains a few helper functions:

### Download Wiki Article Revisions [[wikipedia_tools.scraper](wikipedia_tools/wikipedia_tools/scraper.py)]
This subpackage is responsible for downloading the Wikipedia revisions from the web.

The code below shows how to download all the revisions of pages:
  - belonging to the *Climate_change* category,
  - made between the beginning of the month 8 months ago (1.1.2022) and now (29.9.2022). The *get_x_months_ago_date* function returns the datetime of the beginning of the month 8 months ago.
  
    ```python 
    from wikipedia_tools.utils import utils 
    utils.get_x_months_ago_date(8)
    ```
  - If `save_each_page=True`, each page is fetched and saved on the spot under the folder **data/periodic_wiki_batches/{*categories_names*}/from{month-year}_to{month-year}**. Otherwise, all the page revisions are fetched first and then saved into one jsonl file.
  

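```python
from datetime import datetime

from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils  # provides get_x_months_ago_date

wikirevs = downloader.WikiPagesRevision(
    categories=["Climate_change"],
    revisions_from=utils.get_x_months_ago_date(8),  # beginning of the month 8 months ago
    revisions_to=datetime.now(),
    save_each_page=True,  # save each page as soon as it is fetched
)

# Returns the number of downloaded pages and the folder they were saved to
count, destination_folder = wikirevs.download()
```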
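
To make the start date concrete, the following is a minimal stand-alone sketch of what "the beginning of the month N months ago" means. It is not the package's implementation of `get_x_months_ago_date`; the helper name `start_of_months_ago` and its `now` parameter are hypothetical and exist only for illustration.

```python
from datetime import datetime


def start_of_months_ago(months, now=None):
    """Return midnight on the first day of the month `months` months before `now`.

    Hypothetical helper for illustration; not part of wikipedia_tools.
    """
    now = now or datetime.now()
    # Count months linearly so that crossing a year boundary is handled.
    month_index = now.year * 12 + (now.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    return datetime(year, month_zero_based + 1, 1)


# 8 months before 29.9.2022 falls in January 2022
print(start_of_months_ago(8, now=datetime(2022, 9, 29)))  # 2022-01-01 00:00:00
```

Passing `now` explicitly keeps the result reproducible, which is convenient when the same period has to be downloaded again later.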


For German wiki revisions, set the *lang* argument to *de*. For example, you can download the German Wikipedia page revisions for the *Klimaveränderung* (climate change) category as follows:

```python
from datetime import datetime

from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils  # provides get_x_months_ago_date

wikirevs = downloader.WikiPagesRevision(
    categories=["Klimaveränderung"],
    # Beginning of last month. To customize the past datetime relatively, you can
    # instead use datetime.now() + dateutil.relativedelta.relativedelta(...)
    revisions_from=utils.get_x_months_ago_date(1),
    revisions_to=datetime.now(),
    save_each_page=True,
    lang="de",  # German Wikipedia
)
count, destination_folder = wikirevs.download()
```

You can then process each file by, for example, reading the parquet file using pandas:

```python
from glob import glob

import pandas as pd

files = f"{destination_folder}/*.parquet"

# Loop over all wiki pages downloaded for this period and read the revisions
# of each page as a pandas dataframe
for page_path in glob(files):
    page_revs_df = pd.read_parquet(page_path)
    # dataframe with columns ['page', 'lang', 'timestamp', 'categories', 'content', 'images', 'links', 'sections', 'urls', 'user']
    # process/use file ....
```
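
As an illustration of the "process/use file" step, the sketch below counts revisions per calendar month and the number of distinct editors for one page. It runs on a small hand-made dataframe that reuses only the `page`, `timestamp`, and `user` column names listed above; the sample values, the timestamp format, and the helper name `revisions_per_month` are assumptions, not output of the package.

```python
import pandas as pd

# Hand-made stand-in for the revisions of one page (real files have more columns).
page_revs_df = pd.DataFrame(
    {
        "page": ["Climate change"] * 4,
        "timestamp": [
            "2022-01-03T10:00:00",
            "2022-01-20T08:30:00",
            "2022-02-11T14:15:00",
            "2022-02-11T16:45:00",
        ],
        "user": ["alice", "bob", "alice", "carol"],
    }
)


def revisions_per_month(df):
    """Count revisions per calendar month based on the 'timestamp' column."""
    # Parse explicitly, in case timestamps are stored as strings (assumption).
    timestamps = pd.to_datetime(df["timestamp"])
    return df.groupby(timestamps.dt.to_period("M")).size()


monthly_counts = revisions_per_month(page_revs_df)
unique_users = page_revs_df["user"].nunique()
print(monthly_counts)
print(unique_users)
```

The same function can be called on each `page_revs_df` inside the loop above, provided the real `timestamp` column can be parsed by `pd.to_datetime`.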
### Overview Stats

```python
# NOTE: `properties`, `category`, `CORPUS`, and `ROOT_PATH` are not defined in this
# snippet; import or define them first (see the full example in the examples folder).
from wikipedia_tools.analyzer.revisions import WikipediaRevisionAnalyzer

# Initialize the analyzer object
analyzer = WikipediaRevisionAnalyzer(
    category=category,
    period=properties.PERIODS._YEARLY_,
    corpus=CORPUS,
    root=ROOT_PATH,
)

# Returns the yearly number of articles that were created or edited at least once
unique_created_updated_articles = analyzer.get_edited_page_count(plot=True, save=True)

# Returns the number of created articles over time
unique_created_articles = analyzer.get_created_page_count(plot=True, save=True)

# Returns the number of revisions over time
rev_overtime_df = analyzer.get_revisions_over_time(save=True)

# Returns the number of words over time
words_overtime_df = analyzer.get_words_over_time(save=True)

# Returns the number of users over time, grouped by user type
users_overtime_df = analyzer.get_users_over_time(save=True)

# Returns the top n most edited Wikipedia articles over time
top_edited = analyzer.get_most_edited_articles(top=4)

# Returns the articles sorted from most to least edited over time
most_to_least_revised = analyzer.get_periodic_most_to_least_revised(save=True)
```

You can find the full example under the examples folder.
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "wikipedia_tools",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "wikipedia,wikipedia revisions,wikipedia stats",
    "author": null,
    "author_email": "Roxanne El Baff <roxanne.elbaff@dlr.de>",
    "download_url": "https://files.pythonhosted.org/packages/28/4b/5433d08ff68b6a04e798a2fe9b3bfdf42f48f1bb28b203c44b6e324857ef/wikipedia_tools-2.4.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "This is a Wikipedia Tool to fetch revisions based on a period of time.",
    "version": "2.4.1",
    "project_urls": {
        "Homepage": "https://github.com/DLR-SC/wikipedia-periodic-revisions"
    },
    "split_keywords": [
        "wikipedia",
        "wikipedia revisions",
        "wikipedia stats"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a32e82378c68626caf83b1d5d4ccdda85b11810c9d6297d373ba8713359ea545",
                "md5": "ea3b352f8bf4611a1d3470a5782d7e14",
                "sha256": "aa8a9bf9f9e7530e707b85fcd422cb5292798aef619352b462c49a19d27f4570"
            },
            "downloads": -1,
            "filename": "wikipedia_tools-2.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ea3b352f8bf4611a1d3470a5782d7e14",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 37181,
            "upload_time": "2023-10-12T06:47:00",
            "upload_time_iso_8601": "2023-10-12T06:47:00.219553Z",
            "url": "https://files.pythonhosted.org/packages/a3/2e/82378c68626caf83b1d5d4ccdda85b11810c9d6297d373ba8713359ea545/wikipedia_tools-2.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "284b5433d08ff68b6a04e798a2fe9b3bfdf42f48f1bb28b203c44b6e324857ef",
                "md5": "a2ff9811d987df0308ceb3881766dca2",
                "sha256": "ae8d9d316915507b82d5812c8267cfecd17ee10317888205190de4a7ab21e2de"
            },
            "downloads": -1,
            "filename": "wikipedia_tools-2.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "a2ff9811d987df0308ceb3881766dca2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 52635,
            "upload_time": "2023-10-12T06:57:28",
            "upload_time_iso_8601": "2023-10-12T06:57:28.841879Z",
            "url": "https://files.pythonhosted.org/packages/28/4b/5433d08ff68b6a04e798a2fe9b3bfdf42f48f1bb28b203c44b6e324857ef/wikipedia_tools-2.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-12 06:57:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "DLR-SC",
    "github_project": "wikipedia-periodic-revisions",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "wikipedia_tools"
}
        