sqlitekg2vec


* Name: sqlitekg2vec
* Version: 0.2.0
* Home page: https://github.com/khaller93/sqlitekg2vec
* Summary: SQLiteKG implements the KG class from pyRDF2Vec by using a local SQLite database for storing and querying a KG.
* Upload time: 2022-12-30 04:59:33
* Author: Kevin Haller
* Requires Python: >=3.8,<4.0
* Keywords: embeddings, knowledge-graph, rdf2vec, word2vec, sqlite

# sqlitekg2vec

sqlitekg2vec is an extension of
[pyRDF2Vec](https://github.com/IBCNServices/pyRDF2Vec), a popular library for
training RDF2Vec models on RDF-based knowledge graphs (KGs). It aims to be less
memory-hungry than building KGs from scratch using pyRDF2Vec, or running a
local/remote triplestore.


sqlitekg2vec creates a local SQLite database with a single large table for all
the statements of a knowledge graph, and an additional table that serves as an
index from KG entity names to integer IDs. This SQLite database is referred to
as SQLite KG in the remaining documentation.
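
To make the two-table layout described above concrete, here is a minimal
sketch using Python's built-in `sqlite3` module. The table names (`entity`,
`statement`), the column names, and the helper functions are assumptions made
for illustration; they are not taken from the sqlitekg2vec source, and the
actual schema may differ. The sketch also assumes that predicates are indexed
in the same name table as subjects and objects.

```python
import sqlite3

# Hypothetical schema for illustration only; the names below are assumptions
# and are not taken from the sqlitekg2vec source code.
triples = [
    ("ex:Vienna", "ex:capitalOf", "ex:Austria"),
    ("ex:Austria", "ex:partOf", "ex:Europe"),
]

con = sqlite3.connect(":memory:")
# index table: maps each name to a compact integer ID
con.execute(
    "CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")
# single statement table: stores only integer IDs
con.execute(
    "CREATE TABLE statement ("
    "subj INTEGER NOT NULL, pred INTEGER NOT NULL, obj INTEGER NOT NULL)")
con.execute("CREATE INDEX statement_subj ON statement (subj)")


def entity_id(name):
    """Return the integer ID for a name, inserting it on first sight."""
    con.execute("INSERT OR IGNORE INTO entity (name) VALUES (?)", (name,))
    row = con.execute(
        "SELECT id FROM entity WHERE name = ?", (name,)).fetchone()
    return row[0]


for s, p, o in triples:
    con.execute("INSERT INTO statement VALUES (?, ?, ?)",
                (entity_id(s), entity_id(p), entity_id(o)))
con.commit()


def hops(name):
    """Return the outgoing (predicate, object) names of a subject."""
    return con.execute(
        "SELECT p.name, o.name FROM statement AS st "
        "JOIN entity AS s ON s.id = st.subj "
        "JOIN entity AS p ON p.id = st.pred "
        "JOIN entity AS o ON o.id = st.obj "
        "WHERE s.name = ?", (name,)).fetchall()
```

Storing integer IDs in the statement table keeps each row small, and the graph
stays on disk (or in SQLite's page cache) rather than in Python objects, which
is the kind of memory saving the description above aims for.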

## Installation

The releases of this extension can be found on
[PyPI](https://pypi.org/project/sqlitekg2vec/). The `sqlitekg2vec` package can
be installed with `pip` or other package managers.

```bash
pip install sqlitekg2vec
```

**Requirements:**
* Python 3.8 or higher (Python 3.9 recommended)

## Usage

```python
import sqlitekg2vec

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

with sqlitekg2vec.open_from_pykeen_dataset('dbpedia50', combined=True) as kg:
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=100),
        walkers=[RandomWalker(max_walks=200,
                              max_depth=4,
                              random_state=133,
                              with_reverse=False,
                              n_jobs=4)],
        verbose=1
    )
    # train RDF2Vec
    ent = kg.entities()
    embeddings, _ = transformer.fit_transform(kg, ent)
    print(kg.pack(ent, embeddings))
```

### Create from PyKeen dataset

[PyKeen](https://github.com/pykeen/pykeen) is a popular library for knowledge
graph embeddings, and it provides a number of datasets that are commonly
referenced in scientific literature. An SQLite KG can be constructed from a
PyKeen dataset by specifying the name of the dataset or by passing the dataset
instance.

In the following code snippet, the `db100k` dataset, which is a subsample of
DBpedia, is used to construct an SQLite KG.

```python
import sqlitekg2vec

with sqlitekg2vec.open_from_pykeen_dataset('db100k', combined=True) as kg:
    # ...
    pass
```

**Parameters:**

* *combined* - `False` if only the training set of a dataset shall be used for
  the training of RDF2Vec. `True` if all the sets (training, testing and
  validation) shall be used. It is `False` by default.

### Create from TSV file

To save memory for big knowledge graphs, it might be a good idea to load the
statements of such a knowledge graph from a TSV file into an SQLite KG. Every
row in the TSV file must have three columns: the first column is the subject,
the second is the predicate, and the last is the object.

The following code snippet creates a new SQLite KG instance from the statements
of the specified TSV file, which has been compressed using GZIP.

```python
import sqlitekg2vec

with sqlitekg2vec.open_from_tsv_file('statements.tsv.gz',
                                     compression='gzip') as kg:
    # ...
    pass
```

**Parameters:**

* *skip_header* - `True` if the first row shall be skipped, because it is a
  header row for example. `False` if it shouldn't be skipped. It is `False` by
  default.
* *compression* - specifies the compression type of the source TSV file. The
  default value is `None`, which means that the source isn't compressed. At the
  moment, only `'gzip'` is supported as a compression type.
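
If the statements are not yet available in this format, a file in the
three-column layout described above can be prepared with the standard library
alone. The following is a minimal sketch: the example IRIs, the header labels,
and the helper names `write_statements` and `read_statements` are made up for
illustration and are not part of sqlitekg2vec. It also assumes that no field
contains tabs, quotes or newlines, since how sqlitekg2vec handles quoted
fields is not documented here.

```python
import csv
import gzip

# Illustrative statements; the IRIs are made up.
statements = [
    ("http://example.org/Vienna",
     "http://example.org/capitalOf",
     "http://example.org/Austria"),
    ("http://example.org/Austria",
     "http://example.org/partOf",
     "http://example.org/Europe"),
]


def write_statements(path, rows, header=None):
    """Write (subject, predicate, object) rows as a gzip-compressed TSV."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)


def read_statements(path, skip_header=False):
    """Read the rows back, optionally skipping the first (header) row."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        if skip_header:
            next(reader, None)
        return [tuple(row) for row in reader]


write_statements("statements.tsv.gz", statements,
                 header=("subject", "predicate", "object"))
```

Because this sketch writes a header row, such a file would be loaded with
`skip_header=True` in addition to `compression='gzip'`.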

### Create from Pandas dataframe

A knowledge graph can be represented in a Pandas dataframe, and this method
creates an SQLite KG from such a dataframe. While the dataframe can have more
than three columns, the three columns representing the subject, predicate and
object must be specified in this particular order.

The following code snippet creates a new SQLite KG instance from a dataframe.

```python
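import pandas as pd
import sqlitekg2vec

# illustrative one-row dataframe with subject, predicate and object columns
df = pd.DataFrame([('ex:Vienna', 'ex:capitalOf', 'ex:Austria')],
                  columns=['subj', 'pred', 'obj'])

with sqlitekg2vec.open_from_dataframe(df, column_names=(
        'subj', 'pred', 'obj')) as kg:
    # ...
    pass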
```

**Parameters:**

* *column_names* - a tuple of three column indices for the dataframe, each of
  which can be an integer or a string. The first entry of the tuple shall point
  to the subject, the second to the predicate, and the third to the object. The
  default is `(0, 1, 2)`.

## Limitations

This implementation has three limitations.

1) **Literals** are ignored by this implementation for now.
2) **Inverse traversal** isn't working properly. The walker might get stuck.
3) **Samplers** (besides the default one) might not work properly.

## Contact

* Kevin Haller - [contact@kevinhaller.dev](mailto:contact@kevinhaller.dev)
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/khaller93/sqlitekg2vec",
    "name": "sqlitekg2vec",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "embeddings,knowledge-graph,rdf2vec,word2vec,sqlite",
    "author": "Kevin Haller",
    "author_email": "contact@kevinhaller.dev",
    "download_url": "https://files.pythonhosted.org/packages/ff/03/c99c00cc195487ae30bfb6c2b29c81bf03ddda178c4537e3e937f6320817/sqlitekg2vec-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "SQLiteKG implements the KG class from pyRDF2Vec by using a local SQLite database for storing and querying a KG.",
    "version": "0.2.0",
    "split_keywords": [
        "embeddings",
        "knowledge-graph",
        "rdf2vec",
        "word2vec",
        "sqlite"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "2e69bbd992c45f456a8229b34001b1ef",
                "sha256": "6081bef94fa9d4fee483497c6db7d6d75cd3ef2a8721b00ad6eec04e0b970499"
            },
            "downloads": -1,
            "filename": "sqlitekg2vec-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2e69bbd992c45f456a8229b34001b1ef",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 14099,
            "upload_time": "2022-12-30T04:59:31",
            "upload_time_iso_8601": "2022-12-30T04:59:31.582480Z",
            "url": "https://files.pythonhosted.org/packages/95/61/48a7fa9e69492ba52c78ef2b9937fa125e5d52b5a73a53b989c671a646aa/sqlitekg2vec-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "7661eec12100f1ad1159b743a9ae335b",
                "sha256": "2d03a98023cb9f3ebe8761d0b6749a111dc1b4761b36bd660dec4a12423be191"
            },
            "downloads": -1,
            "filename": "sqlitekg2vec-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7661eec12100f1ad1159b743a9ae335b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 12055,
            "upload_time": "2022-12-30T04:59:33",
            "upload_time_iso_8601": "2022-12-30T04:59:33.192667Z",
            "url": "https://files.pythonhosted.org/packages/ff/03/c99c00cc195487ae30bfb6c2b29c81bf03ddda178c4537e3e937f6320817/sqlitekg2vec-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-30 04:59:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "khaller93",
    "github_project": "sqlitekg2vec",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "sqlitekg2vec"
}
        