rdflib-hdt


:Name: rdflib-hdt
:Version: 3.1
:Summary: A Store back-end for rdflib to allow for reading and querying HDT documents
:Upload time: 2023-06-02 12:22:02
:License: MIT License
:Keywords: rdflib, hdt, rdf, semantic web, search
:Requirements: No requirements were recorded.

|rdflib-htd logo|

|Build Status| |PyPI version|

A Store back-end for `rdflib <https://github.com/RDFLib>`_ to allow for reading and querying HDT documents.

`Online Documentation <https://rdflib.dev/rdflib-hdt/>`_

Requirements
============


* Python *version 3.6.4 or higher*
* `pip <https://pip.pypa.io/en/stable/>`_
* **gcc/clang** with **c++11 support**
* **Python Development headers**

  You should have the ``Python.h`` header available on your system.
  For example, for Python 3.6, install the ``python3.6-dev`` package on Debian/Ubuntu systems.
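
One possible way to check that the development headers are visible to the interpreter that will build the extension is to look for ``Python.h`` in the include directory reported by the standard ``sysconfig`` module. This is only a sketch: the helper name ``has_python_header`` is ours and is not part of rdflib-hdt, and a build may still locate headers elsewhere.

.. code-block:: python

   import sysconfig
   from pathlib import Path


   def has_python_header() -> bool:
       """Return True if Python.h exists in this interpreter's include directory."""
       include_dir = sysconfig.get_path("include")
       return include_dir is not None and Path(include_dir, "Python.h").is_file()


   print(f"Python.h found: {has_python_header()}")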


Installation
============

Installation using `pipenv <https://github.com/pypa/pipenv>`_ or a `virtualenv <https://virtualenv.pypa.io/en/stable/>`_ is **strongly advised!**

PyPI installation (recommended)
-------------------------------

.. code-block:: bash

   # you can install using pip
   pip install rdflib-hdt

   # or you can use pipenv
   pipenv install rdflib-hdt

Manual installation
-------------------

**Requirement:** `pipenv <https://github.com/pypa/pipenv>`_ 

.. code-block:: bash

   git clone https://github.com/Callidon/pyHDT
   cd pyHDT/
   ./install.sh

Getting started
===============

You can use the ``rdflib-hdt`` library in two modes: as an rdflib Graph or as a raw HDT document.

Graph usage (recommended)
-------------------------

.. code-block:: python

   from rdflib import Graph
   from rdflib_hdt import HDTStore
   from rdflib.namespace import FOAF

   # Load an HDT file. Missing indexes are generated automatically.
   # You can provide the index files by putting them in the same directory as the HDT file.
   store = HDTStore("test.hdt")

   # Display some metadata about the HDT document itself
   print(f"Number of RDF triples: {len(store)}")
   print(f"Number of subjects: {store.nb_subjects}")
   print(f"Number of predicates: {store.nb_predicates}")
   print(f"Number of objects: {store.nb_objects}")
   print(f"Number of shared subject-object: {store.nb_shared}")


Using the RDFlib API, you can also `execute SPARQL queries <https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html>`_ over an HDT document.
If you do so, we recommend that you first call the ``optimize_sparql`` function, which optimizes
the RDFlib SPARQL query engine for HDT documents.

.. code-block:: python

   from rdflib import Graph
   from rdflib_hdt import HDTStore, optimize_sparql

   # Calling this function optimizes the RDFlib SPARQL engine for HDT documents
   optimize_sparql()

   graph = Graph(store=HDTStore("test.hdt"))

   # You can execute SPARQL queries using the regular RDFlib API
   qres = graph.query("""
   PREFIX foaf: <http://xmlns.com/foaf/0.1/>
   SELECT ?name ?friend WHERE {
      ?a foaf:knows ?b.
      ?a foaf:name ?name.
      ?b foaf:name ?friend.
   }""")

   for row in qres:
     print(f"{row.name} knows {row.friend}")

HDT Document usage
------------------

.. code-block:: python

   from rdflib.namespace import FOAF
   from rdflib_hdt import HDTDocument

   # Load an HDT file. Missing indexes are generated automatically.
   # You can provide the index files by putting them in the same directory as the HDT file.
   document = HDTDocument("test.hdt")

   # Display some metadata about the HDT document itself
   print(f"Number of RDF triples: {document.total_triples}")
   print(f"Number of subjects: {document.nb_subjects}")
   print(f"Number of predicates: {document.nb_predicates}")
   print(f"Number of objects: {document.nb_objects}")
   print(f"Number of shared subject-object: {document.nb_shared}")

   # Fetch all triples that match { ?s foaf:name ?o }
   # Use None to indicate variables
   triples, cardinality = document.search_triples((None, FOAF.name, None))

   print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
   for s, p, o in triples:
     print(s, p, o)

   # The search also supports limit and offset
   triples, cardinality = document.search_triples((None, FOAF.name, None), limit=10, offset=100)
   # etc ...
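
The ``limit`` and ``offset`` arguments can be combined to read a large result set page by page. The sketch below is not part of rdflib-hdt: ``iter_pages`` is a hypothetical helper, and it assumes that the search callable accepts ``limit`` and ``offset`` keywords and returns an ``(iterator, cardinality)`` pair, as in the example above, with ``offset`` counted from the first matching triple.

.. code-block:: python

   from typing import Callable, Iterator, List, Tuple

   Triple = Tuple[object, object, object]


   def iter_pages(search: Callable[..., Tuple[Iterator[Triple], int]],
                  pattern: Triple,
                  page_size: int = 100) -> Iterator[List[Triple]]:
       """Yield successive pages of triples until a page comes back empty."""
       offset = 0
       while True:
           # Each call fetches at most page_size triples starting at offset
           triples, _cardinality = search(pattern, limit=page_size, offset=offset)
           page = list(triples)
           if not page:
               return
           yield page
           offset += len(page)

Under those assumptions, it could be called as ``iter_pages(document.search_triples, (None, FOAF.name, None), page_size=10)``.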

An HDT document also supports evaluating joins over a set of triple patterns.

.. code-block:: python

  from rdflib_hdt import HDTDocument
  from rdflib import Variable
  from rdflib.namespace import FOAF, RDF
  
  document = HDTDocument("test.hdt")
  
  # find the names of two entities that know each other
  tp_a = (Variable("a"), FOAF.knows, Variable("b"))
  tp_b = (Variable("a"), FOAF.name, Variable("name"))
  tp_c = (Variable("b"), FOAF.name, Variable("friend"))
  query = {tp_a, tp_b, tp_c}
  
  iterator = document.search_join(query)
  print(f"Estimated join cardinality: {len(iterator)}")
  
  # Join results are produced as ResultRow, like in the RDFlib SPARQL API
  for row in iterator:
     print(f"{row.name} knows {row.friend}")
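
To make the semantics of such a join concrete, here is a naive nested-loop join over an in-memory list of triples, written in plain Python. It only illustrates what a join over triple patterns computes and says nothing about how ``search_join`` is implemented. All names (``match``, ``join``, the ``?``-prefixed variables) are made up for this sketch.

.. code-block:: python

   from typing import Dict, List, Optional, Tuple

   Triple = Tuple[str, str, str]
   Bindings = Dict[str, str]


   def match(pattern: Triple, triple: Triple, bindings: Bindings) -> Optional[Bindings]:
       """Extend bindings so that pattern matches triple, or return None on conflict."""
       result = dict(bindings)
       for p_term, t_term in zip(pattern, triple):
           if p_term.startswith("?"):
               # A variable must keep the value it was bound to earlier
               if result.setdefault(p_term, t_term) != t_term:
                   return None
           elif p_term != t_term:
               return None
       return result


   def join(patterns: List[Triple], triples: List[Triple]) -> List[Bindings]:
       """Extend every partial solution with each pattern in turn."""
       solutions: List[Bindings] = [{}]
       for pattern in patterns:
           extended = []
           for partial in solutions:
               for triple in triples:
                   candidate = match(pattern, triple, partial)
                   if candidate is not None:
                       extended.append(candidate)
           solutions = extended
       return solutions

Each solution is a mapping from variable names to values, which plays the same role as a ``ResultRow`` in the example above.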

Handling non-UTF-8 strings in Python
====================================

If the HDT document has been encoded with a non-UTF-8 encoding, the previous code will not work correctly and will raise a ``UnicodeDecodeError``.
For more details on how C++ strings are converted to Python ``str``, see the `pybind11 documentation on strings <https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html>`_.

To handle this, we doubled the API of the HDT document by adding the following bytes-returning methods:

* ``search_triples_bytes(...)`` returns an iterator of triples as ``(py::bytes, py::bytes, py::bytes)``
* ``search_join_bytes(...)`` returns an iterator of sets of solution mappings as ``py::set(py::bytes, py::bytes)``
* ``convert_tripleid_bytes(...)`` returns a triple as ``(py::bytes, py::bytes, py::bytes)``
* ``convert_id_bytes(...)`` returns a ``py::bytes``

**Parameters and documentation are the same as the standard version**

.. code-block:: python

   from rdflib_hdt import HDTDocument

   document = HDTDocument("test.hdt")
   it = document.search_triples_bytes("", "", "")

   for s, p, o in it:
      print(s, p, o)  # prints b'...', b'...', b'...'
      # now decode it, or handle any error
      try:
         s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
      except UnicodeDecodeError as err:
         # try other codecs, ignore the error, etc.
         pass
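
The "try other codecs" step can be factored into a small helper. The following is a minimal sketch in plain Python: ``decode_term`` and its default list of encodings are our own assumptions, not part of rdflib-hdt, and the right candidate encodings depend on how the HDT file was produced.

.. code-block:: python

   from typing import Sequence


   def decode_term(raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")) -> str:
       """Decode raw with the first candidate encoding that succeeds."""
       for encoding in encodings:
           try:
               return raw.decode(encoding)
           except UnicodeDecodeError:
               continue
       # No candidate worked: keep the term, replacing undecodable bytes with U+FFFD.
       # With the defaults this is never reached, because latin-1 accepts any byte.
       return raw.decode("utf-8", errors="replace")

Each byte string produced by the ``*_bytes`` methods could then be passed through ``decode_term`` instead of calling ``decode('UTF-8')`` directly.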

.. |Build Status| image:: https://github.com/RDFLib/rdflib-hdt/workflows/Python%20tests/badge.svg
   :target: https://github.com/RDFLib/rdflib-hdt/actions?query=workflow%3A%22Python+tests%22
.. |PyPI version| image:: https://badge.fury.io/py/rdflib-hdt.svg
   :target: https://badge.fury.io/py/rdflib-hdt
.. |rdflib-htd logo| image:: https://raw.githubusercontent.com/RDFLib/rdflib-hdt/master/docs/source/_static/rdflib-hdt-250.png
   :target: https://rdflib.dev/rdflib-hdt/

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "rdflib-hdt",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "rdflib,hdt,rdf,semantic web,search",
    "author": "",
    "author_email": "Thomas Minier <tminier01@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ea/11/f83aedf9517a20fe10fbd2e1ccd147760322b8cce36eac9efedb8b3783be/rdflib_hdt-3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A Store back-end for rdflib to allow for reading and querying HDT documents",
    "version": "3.1",
    "project_urls": {
        "homepage": "https://rdflib.dev/rdflib-hdt",
        "repository": "https://github.com/RDFLib/rdflib-hdt.git"
    },
    "split_keywords": [
        "rdflib",
        "hdt",
        "rdf",
        "semantic web",
        "search"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3b04d584ae3e684d8522ce77baf1c155bb35ad2ca0462d5e1b6b675f8c3b3c87",
                "md5": "cb6993ccd6f278349d2e2d188c701fc4",
                "sha256": "5efa586ae8934b4c968c4f7b9ec14864b08efd6b9a3342f2463982172f6519ed"
            },
            "downloads": -1,
            "filename": "rdflib_hdt-3.1-py3.7-linux-x86_64.egg",
            "has_sig": false,
            "md5_digest": "cb6993ccd6f278349d2e2d188c701fc4",
            "packagetype": "bdist_egg",
            "python_version": "3.1",
            "requires_python": null,
            "size": 7374098,
            "upload_time": "2023-06-02T12:21:59",
            "upload_time_iso_8601": "2023-06-02T12:21:59.662810Z",
            "url": "https://files.pythonhosted.org/packages/3b/04/d584ae3e684d8522ce77baf1c155bb35ad2ca0462d5e1b6b675f8c3b3c87/rdflib_hdt-3.1-py3.7-linux-x86_64.egg",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ea11f83aedf9517a20fe10fbd2e1ccd147760322b8cce36eac9efedb8b3783be",
                "md5": "f7e619415939ff0ee56d6405160fef0f",
                "sha256": "0db95fe58e276fe58668cae6ef94dd7b7b30bf1e9f89dd06d3941208a17fcc44"
            },
            "downloads": -1,
            "filename": "rdflib_hdt-3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f7e619415939ff0ee56d6405160fef0f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 235883,
            "upload_time": "2023-06-02T12:22:02",
            "upload_time_iso_8601": "2023-06-02T12:22:02.609459Z",
            "url": "https://files.pythonhosted.org/packages/ea/11/f83aedf9517a20fe10fbd2e1ccd147760322b8cce36eac9efedb8b3783be/rdflib_hdt-3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-02 12:22:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "RDFLib",
    "github_project": "rdflib-hdt",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "rdflib-hdt"
}
        