PyStellarDB
===========

PyStellarDB is a Python API for executing Transwarp Extended OpenCypher (TEoC) and Hive queries.
It can also generate an RDD object for use in PySpark.
It is based on `PyHive <https://github.com/dropbox/PyHive>`_ and `PySpark <https://github.com/apache/spark/>`_.
PySpark RDD
===========
We hacked together a way to generate the RDD object using the same mechanism as ``sc.parallelize(data)``.
This can exhaust driver memory if the query returns a large amount of data.

If you do need to handle a huge result set, use the following workaround:

1. If you are querying a graph, refer to Chapter 4.4.5 of the StellarDB manual to save the query result into a temporary table.

2. If you are querying a SQL table, save your query result into a temporary table.

3. Find the HDFS path of the temporary table generated in Step 1 or Step 2.

4. Use an API like ``sc.newAPIHadoopFile()`` to generate the RDD.
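Step 4 above can be sketched as follows. The HDFS path, the Hadoop class names, and the assumption that the temporary table was written as plain text files are all illustrative; adjust them to your cluster and the table's actual storage format.

```python
# A sketch of step 4, assuming the temporary table was written as plain
# text files under the HDFS path found in step 3.

def load_temp_table(sc, hdfs_path):
    """Read the temporary table's files from HDFS as an RDD of text lines.

    `sc` is an existing pyspark.SparkContext.
    """
    # TextInputFormat produces (byte offset, line text) records; keep the text.
    return sc.newAPIHadoopFile(
        hdfs_path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
    ).map(lambda kv: kv[1])

# Example usage (requires a running Spark/Hadoop environment):
# from pyspark import SparkContext
# sc = SparkContext("local", "Workaround Demo")
# rdd = load_temp_table(sc, "hdfs://namenode:8020/path/to/tmp_table")
# print(rdd.take(5))
```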
Usage
=====
PLAIN Mode (No security is configured)
---------------------------------------
.. code-block:: python

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())
LDAP Mode
---------
.. code-block:: python

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, username='hive', password='123456', auth='LDAP', graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())
Kerberos Mode
-------------
.. code-block:: python

    # Make sure you have the correct realm information for the KDC server in /etc/krb5.conf
    # Make sure you have the correct keytab file in your environment
    # Run the kinit command first:
    # On Linux: kinit -kt FILE_PATH_OF_KEYTAB PRINCIPAL_NAME
    # On Mac:   kinit -t FILE_PATH_OF_KEYTAB -f PRINCIPAL_NAME

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, kerberos_service_name='hive', auth='KERBEROS', graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())
Execute Hive Query
------------------
.. code-block:: python

    from pystellardb import stellar_hive

    # If the `graph_name` parameter is None, it executes a Hive query and returns data just as PyHive does
    conn = stellar_hive.StellarConnection(host="localhost", port=10000, database='default')
    cur = conn.cursor()
    cur.execute('SELECT * FROM default.abc limit 10')
Execute Graph Query and change to a PySpark RDD object
------------------------------------------------------
.. code-block:: python

    from pyspark import SparkContext
    from pystellardb import stellar_hive

    sc = SparkContext("local", "Demo App")

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 10')

    rdd = cur.toRDD(sc)

    def f(x): print(x)

    rdd.map(lambda x: (x[0].toJSON(), x[1].toJSON(), x[2].toJSON())).foreach(f)

    # Each row of this query is a tuple of (VertexObject, EdgeObject, VertexObject)
    # Vertex and Edge objects have a toJSON() method that renders the object as a JSON string
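The strings produced by ``toJSON()`` can be processed with plain Python. A minimal sketch, using a made-up vertex payload for illustration (the real JSON schema depends on your graph and StellarDB version):

```python
import json

# A made-up vertex payload; the real toJSON() output depends on your
# graph's schema and is not reproduced here.
vertex_json = '{"label": "pokemon", "name": "pikachu", "type": "electric"}'

# Parse the JSON string into a dict and pick out a field.
vertex = json.loads(vertex_json)
print(vertex["name"])  # -> pikachu
```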
Execute Hive Query and change to a PySpark RDD object
-----------------------------------------------------
.. code-block:: python

    from pyspark import SparkContext
    from pystellardb import stellar_hive

    sc = SparkContext("local", "Demo App")

    conn = stellar_hive.StellarConnection(host="localhost", port=10000)
    cur = conn.cursor()
    cur.execute('select * from default_db.default_table limit 10')

    rdd = cur.toRDD(sc)

    def f(x): print(x)

    rdd.foreach(f)

    # Each row of this query is a tuple of column values
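Since the rows follow the DB-API tuple shape, they can be paired with column names from ``cursor.description``. A pure-Python sketch with made-up rows and description (no server needed; a real cursor provides both after ``execute()``):

```python
# Made-up DB-API metadata and rows for illustration only.
# The first field of each description entry is the column name.
description = [
    ("id", "int", None, None, None, None, None),
    ("name", "string", None, None, None, None, None),
]
rows = [(1, "pikachu"), (2, "bulbasaur")]

# Zip each row with the column names to get dict-shaped records.
columns = [col[0] for col in description]
records = [dict(zip(columns, row)) for row in rows]
print(records)
```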
Dependencies
============
Required:
------------
- Python 2.7+ / Python 3
System SASL
------------
Ubuntu:

.. code-block:: bash

    apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
    apt-get install python-dev gcc  # Update python and gcc if needed
RHEL/CentOS:

.. code-block:: bash

    yum install cyrus-sasl-md5 cyrus-sasl-plain cyrus-sasl-gssapi cyrus-sasl-devel
    yum install gcc-c++ python-devel.x86_64  # Update python and gcc if needed

    # If pip3 install fails with a message like "Can't connect to HTTPS URL because the SSL module is not available",
    # you may need to update OpenSSL and reinstall Python:
    # 1. Download a newer version of OpenSSL, e.g. https://www.openssl.org/source/openssl-1.1.1k.tar.gz
    # 2. Install OpenSSL: ./config && make && make install
    # 3. Link OpenSSL: echo /usr/local/lib64/ > /etc/ld.so.conf.d/openssl-1.1.1.conf
    # 4. Update the dynamic linker cache: ldconfig -v
    # 5. Uninstall Python and download a fresh Python source package
    # 6. Edit Modules/Setup, search for '_socket socketmodule.c', and uncomment:
    #        _socket socketmodule.c
    #
    #        SSL=/usr/local/ssl
    #        _ssl _ssl.c \
    #            -DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
    #            -L$(SSL)/lib -lssl -lcrypto
    #
    # 7. Install Python: ./configure && make && make install
Windows:

.. code-block:: bash

    # There are 3 ways of installing sasl for Python on Windows:
    # 1. (recommended) Download a .whl build of sasl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#sasl
    # 2. (recommended) If using Anaconda, run `conda install sasl`.
    # 3. Install Microsoft Visual C++ 9.0/14.0 build tools for Python 2.7/3.x, then `pip install sasl`.
Notices
=======
PyStellarDB >= 0.9 installs a ``beeline`` script to /usr/local/bin/beeline.
Requirements
============
Install using
- ``pip install 'pystellardb[hive]'`` for the Hive interface.
PyHive works with
- For Hive: `HiveServer2 <https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2>`_ daemon
Windows Kerberos Configuration
==============================
Windows Kerberos configuration can be a little tricky and needs a few extra steps.

First, install and configure Kerberos for Windows.
Get it from http://web.mit.edu/kerberos/dist/

After installation, configure the environment variables.
Make sure the Kerberos entry in your PATH comes before the JDK entry, so that the ``kinit`` bundled with the JDK is not picked up by mistake.

Find /etc/krb5.conf on your KDC and copy it to krb5.ini on Windows, with some modifications.
For example (krb5.conf on the KDC):
.. code-block:: bash

    [logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

    [libdefaults]
    default_realm = DEFAULT
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    allow_weak_crypto = true
    udp_preference_limit = 32700
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}

    [realms]
    DEFAULT = {
        kdc = host1:1088
        kdc = host2:1088
    }
Modify it: delete the ``[logging]`` section and the ``default_ccache_name`` entry in ``[libdefaults]``:
.. code-block:: bash

    [libdefaults]
    default_realm = DEFAULT
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    allow_weak_crypto = true
    udp_preference_limit = 32700

    [realms]
    DEFAULT = {
        kdc = host1:1088
        kdc = host2:1088
    }
The result is your krb5.ini for Kerberos on Windows. Put a copy in each of these 3 places:

    C:\ProgramData\MIT\Kerberos5\krb5.ini

    C:\Program Files\MIT\Kerberos\krb5.ini

    C:\Windows\krb5.ini

Finally, edit the hosts file at C:\Windows\System32\drivers\etc\hosts
and add IP mappings for host1 and host2 from the previous example, e.g.

.. code-block:: bash

    10.6.6.96 host1
    10.6.6.97 host2
Now, you can try running kinit in your command line!
Testing
=======
On its way.