enwiki-offline


- Name: enwiki-offline
- Version: 0.2.27
- Home page: https://github.com/craigtrim/enwiki-offline
- Summary: High-performance offline access to Wikipedia data for Linked Data / NLP applications.
- Upload time: 2024-04-10 00:07:54
- Maintainer: Craig Trim
- Author: Craig Trim
- Requires Python: <4.0,>=3.8
- License: MIT
- Keywords: utility, helper, text, matching, wikipedia, offline access, nlp, linked data, data science, information retrieval
- Requirements: No requirements were recorded.
- Travis-CI: No Travis.
- Coveralls test coverage: No coveralls.

# enwiki-offline
High-performance offline access to Wikipedia data.

These functions are useful for Linked Data / NLP applications that need to determine whether a given entity has a corresponding Wikipedia entry.

Knowing whether an entity exists in Wikipedia (and what its ISO URL is) helps differentiate known (public) entities from entities unique to a data source. Existence lookups can help either to eliminate noise or to boost differentiating entities.
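
As a minimal sketch of that filtering idea, the snippet below partitions extracted entities into publicly known ones and source-specific ones. The `partition_entities` helper, the `toy_exists` predicate, and the sample data are all hypothetical and exist only for illustration; in practice the predicate would be this package's `exists` function.

```python
from typing import Callable, Iterable, List, Tuple


def partition_entities(
    entities: Iterable[str],
    exists_fn: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split entities into (known, source_specific) using an existence predicate."""
    known: List[str] = []
    source_specific: List[str] = []
    for entity in entities:
        # Entities with a Wikipedia entry are treated as public knowledge.
        (known if exists_fn(entity) else source_specific).append(entity)
    return known, source_specific


# Toy stand-in for a Wikipedia existence lookup (assumption, illustration only).
_toy_titles = {"network topology", "python"}


def toy_exists(entity: str) -> bool:
    return entity.lower() in _toy_titles


known, source_specific = partition_entities(
    ["Network Topology", "acme-internal-widget", "Python"], toy_exists
)
```

Passing the predicate in as an argument keeps the partitioning logic independent of where the lookup data lives.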

Runtime access to this package does not require remote calls to DBpedia, Wikipedia, or other Linked Data providers. Offline access is stable and offers consistent performance, at the cost of additional disk I/O. Slightly over 17,000 local files are required to enable offline access.

## Functions
```python
def exists(entity: str) -> bool
```
Performs a case-insensitive search and returns True if a Wikipedia entry exists for the input entity. Synonym, partial, and fuzzy searches are not supported; only exact matches are returned.

```python
def is_ambiguous(entity: str) -> bool
```
Returns True if multiple Wikipedia entries exist for this term.

```python
def titles(entity: str) -> Optional[List[str]]
```
Returns all Wikipedia Titles for this input entity.
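
The semantics described above can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not the package's implementation: the `_INDEX` dictionary and its contents are hypothetical toy data, and the three functions only mimic the documented behavior (case-insensitive, exact-match lookups).

```python
from typing import Dict, List, Optional

# Hypothetical toy index: lowercased lookup key -> matching Wikipedia titles.
# The real package reads its data from local files instead.
_INDEX: Dict[str, List[str]] = {
    "mercury": ["Mercury (planet)", "Mercury (element)", "Mercury (mythology)"],
    "alan turing": ["Alan Turing"],
}


def exists(entity: str) -> bool:
    """True if the lowercased entity is an exact key in the index."""
    return entity.lower() in _INDEX


def is_ambiguous(entity: str) -> bool:
    """True if more than one title is recorded for the entity."""
    return len(_INDEX.get(entity.lower(), [])) > 1


def titles(entity: str) -> Optional[List[str]]:
    """All recorded titles for the entity, or None if there is no entry."""
    return _INDEX.get(entity.lower())
```

With this toy data, `exists("MERCURY")` is true because the match ignores case, while `exists("merc")` is false because partial matches are not attempted.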

## Use Existing Data
Scroll down to the `DVC` section and use the `dvc pull` command to access the data.

## Parsing Wikipedia Titles
The latest enwiki file can be downloaded from https://dumps.wikimedia.org/enwiki/

You only need to do this if either of the following applies:
1. You don't want to refresh from DVC
2. You have a different version of the enwiki file
```sh
poetry run python drivers/parse_enwiki_all_titles.py "/path/to/file/enwiki-20240301-all-titles"
```
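
For orientation, here is a rough sketch of what parsing such a titles file could involve. It is not the logic in `drivers/parse_enwiki_all_titles.py`. It assumes the dump is a text file with a header row followed by two tab-separated columns (namespace, then title with underscores in place of spaces), and the `build_title_index` helper is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_title_index(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each lowercased article title to the original titles that share it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the assumed header row.
        if not line or line.startswith("page_namespace"):
            continue
        namespace, _, raw_title = line.partition("\t")
        # Keep only main-namespace (article) titles.
        if namespace != "0" or not raw_title:
            continue
        title = raw_title.replace("_", " ")
        index[title.lower()].append(title)
    return dict(index)


# Tiny in-memory sample in the assumed layout.
sample = [
    "page_namespace\tpage_title\n",
    "0\tAlan_Turing\n",
    "0\tMercury\n",
    "0\tMERCURY\n",
    "1\tAlan_Turing\n",
]
index = build_title_index(sample)
```

Keying the index on the lowercased title is what would make a case-insensitive lookup a single dictionary access, and a key with more than one original title corresponds to an ambiguous term.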

## DVC (Data Version Control)

### Initialize DVC and Configure S3 Remote
In your project root, initialize DVC if you haven't already, and configure your S3 bucket as the remote storage. Replace `enwikioffline` with your actual S3 bucket name if it's different. Run:

```shell
dvc init
dvc remote add -d myremote s3://enwikioffline
dvc remote modify myremote profile enwiki_offline
```

This setup:
- Initializes DVC in your project.
- Adds your S3 bucket as the default remote storage.
- Configures DVC to use the `enwiki_offline` AWS profile for S3 operations.

### Track and Push Data with DVC
To track the resources folder and push it to S3, execute:
```shell
dvc add resources
git add resources.dvc .gitignore
git commit -m "Track resources folder with DVC"
dvc push
```

This process:
- Tracks the `resources` folder with DVC, creating a `resources.dvc` file.
- Commits the DVC files to Git.
- Pushes the data to your S3 bucket using the configured AWS profile.

### Pull Data with DVC
To retrieve the data managed by DVC, use:
```sh
dvc pull
```
This command pulls the data from S3 into your local `resources` folder, based on the current DVC setup and the latest `resources.dvc` file in your repository.
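
After pulling, a quick sanity check can confirm that the data actually arrived. The sketch below counts the files under a folder; the `count_data_files` helper is hypothetical, and the assumption that all data files live under `resources` is taken from the DVC commands above rather than from the package's code.

```python
from pathlib import Path


def count_data_files(resources_dir: str = "resources") -> int:
    """Count regular files under resources_dir, or return 0 if it is missing."""
    root = Path(resources_dir)
    if not root.is_dir():
        return 0
    # Walk the tree recursively and count files only, not directories.
    return sum(1 for path in root.rglob("*") if path.is_file())
```

A result of 0 suggests that `dvc pull` has not been run or did not complete.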


            

        