wikiscraper

Name: wikiscraper
Version: 1.1.9
Home page: https://github.com/Alexandre333/wikiscraper
Summary: Easy scraper that extracts data from Wikipedia articles thanks to its URL slug
Author: Alexandre Meyer (contact@alexandremeyer.fr)
Upload time: 2023-08-20 09:32:09
Keywords: python, web scraping, wikipedia, slug
Requirements: no requirements were recorded.
            [![CC BY 4.0][cc-by-shield]][cc-by]
[![Downloads](https://static.pepy.tech/badge/wikiscraper)](https://pepy.tech/project/wikiscraper)

# wikiscraper

Easy scraper that extracts data from Wikipedia articles via their URL slug: title, images, summary, section paragraphs, and sidebar info.

Developed by Alexandre MEYER

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg


## Installation

```shell
$ pip install wikiscraper
```

## Initialization

Import
```python
import wikiscraper as ws
```

Main request
```python
# Set the Wikipedia language for the query
# (ISO 639-1 code; defaults to "en" for English)
ws.lang("fr")
```

```python
# Search and get content by the URL slug of the article
# (Example : https://fr.wikipedia.org/wiki/Paris)
result = ws.searchBySlug("Paris")
```
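Conceptually, the language code and slug together identify the article page. As an illustration of that mapping (a hypothetical helper, not part of wikiscraper's API), building the article URL looks like:

```python
# Hypothetical illustration only: how a language code and slug map to an
# article URL. The library's internal behavior may differ.
def article_url(slug, lang="en"):
    """Build a Wikipedia article URL from a slug and an ISO 639-1 code."""
    return f"https://{lang}.wikipedia.org/wiki/{slug}"

print(article_url("Paris", "fr"))  # https://fr.wikipedia.org/wiki/Paris
```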
## Examples

Title H1 & URL
```python
# Get article's title
result.getTitle()
# Get article's URL
result.getURL()
```

Sidebar
```python
# Get value of the sidebar information label
result.getSideInfo("Gentilé")
```

Abstract
```python
# Get all paragraphs of abstract
print(result.getAbstract())
# Get the second paragraph of abstract
print(result.getAbstract()[1])
# Optional: get the first x paragraphs
print(result.getAbstract(2))
```
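In other words, `getAbstract(x)` returns only the first `x` paragraphs. A stand-in sketch of that semantics with a plain Python list (hypothetical data, not the real scraped output):

```python
# Stand-in for real scraped paragraphs (hypothetical data).
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]

def get_abstract(limit=None):
    """Return all paragraphs, or only the first `limit` of them."""
    return paragraphs if limit is None else paragraphs[:limit]

print(get_abstract()[1])  # Second paragraph.
print(get_abstract(2))    # ['First paragraph.', 'Second paragraph.']
```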

Images
```python
# Get all illustration images
img = result.getImage()
# Get a specific image by its position in the page
print(img[0]) # Main image
```

Sections
```python
# Get table of contents
# First-level headlines only
print(result.getContentsTable())
# All headlines (first and second levels)
print(result.getContentsTable(subcontents=True))
```

```python
# Get paragraphs from a specific section via its parent header titles
# All optional args: .getSection(h2Title, h3Title, h4Title)
# Example: https://fr.wikipedia.org/wiki/Paris#Politique_et_administration
print(result.getSection('Politique et administration', 'Statut et organisation administrative', 'Historique')[0])
```
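So `getSection` descends the header hierarchy (h2, then h3, then h4). A rough sketch of that lookup over a nested dict, with header titles as keys and paragraph lists at the leaves (an illustrative data structure, not the library's internals):

```python
# Illustrative model of a page's section tree (not the library's internals).
sections = {
    "Politique et administration": {
        "Statut et organisation administrative": {
            "Historique": ["Paragraph on the city's administrative history."],
        },
    },
}

def get_section(tree, *titles):
    """Descend through nested header titles and return the leaf paragraphs."""
    node = tree
    for title in titles:
        node = node[title]
    return node

print(get_section(sections,
                  "Politique et administration",
                  "Statut et organisation administrative",
                  "Historique")[0])
```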

## Errors
> "Unable to find the requested query: please check the spelling of the slug"

* Check if the spelling of the slug is correct
* Check if the article exists
* Check if the language set for the query matches the slug's language (by default the search targets English articles)
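Many slug mistakes come from stray spaces or casing. A small helper (hypothetical, not part of wikiscraper) that normalizes a free-text title into a Wikipedia-style slug before calling `searchBySlug`:

```python
# Hypothetical helper (not part of wikiscraper): normalize a free-text title
# into a Wikipedia-style slug (underscores, capitalized first letter).
def normalize_slug(title):
    """Trim whitespace, join words with underscores, capitalize the first letter."""
    slug = "_".join(title.split())
    return slug[:1].upper() + slug[1:]

print(normalize_slug("  paris  "))                    # Paris
print(normalize_slug("politique et administration"))  # Politique_et_administration
```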

## Versions
- 1.1.0 = Error Handling
- 1.0.0 = init

            
