# kurve
## Functionality
A library for turning data entities into a graph. The entities
can be files or RDBMS tables, depending on which connectors are supported
at the time you read this. To start, we support local flat files,
Snowflake, and PostgreSQL.
## Motivation
Data discovery is an increasingly hard problem, requiring domain
experts, good documentation, and somewhat reliable data. This becomes
an even larger problem when onboarding new team members, and gets
exacerbated by the data warehouse / lake pattern of shoving all an org's
data into a single location. The idea is to automate some of the discovery
by turning relational portions of data into graphs, which can then be navigated
visually and programmatically.
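To make the idea concrete, here is a toy sketch (not kurve's actual internals) of how foreign-key metadata maps naturally onto a graph: tables become nodes and key relationships become edges. The table and column names are hypothetical.

```python
# Toy illustration: derive graph nodes and edges from foreign-key metadata.
# Each tuple is (child_table, child_column, parent_table, parent_column).
foreign_keys = [
    ('orders', 'customer_id', 'customers', 'id'),
    ('order_items', 'order_id', 'orders', 'id'),
]

nodes = set()
edges = []
for child, child_col, parent, parent_col in foreign_keys:
    nodes.update([child, parent])
    edges.append((child, parent, f'{child}.{child_col} -> {parent}.{parent_col}'))

print(sorted(nodes))  # ['customers', 'order_items', 'orders']
print(len(edges))     # 2
```

Once relationships are expressed this way, navigating "which tables reference `orders`?" is a graph traversal rather than a documentation hunt.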
### Installation
```bash
pip install kurve
```
## Usage
### Postgres
```python
import os

from kurve.graph import Graph
from kurve.sources import PostgresSource

source = PostgresSource(
    host=os.getenv('MY_HOST'),
    user=os.getenv('MY_USER'),
    pw=os.getenv('MY_PASSWORD'),
    port=5432,
    database='MY_DATABASE',
    databases=['MY_DATABASE'],
    schemas=['SCHEMA1', 'SCHEMA2']
)

g = Graph(source)
g.build_graph()
g.save_graph('postgres_graph.pkl')

len(g)        # 1632 nodes
len(g.edges)  # 2450 edges

g.plot_graph(fname='my_first_graph.html')
```
### Graph output from `plot_graph()`
![plotted graph](https://github.com/wesmadrigal/kurve/blob/master/docs/postgres_graph_example.jpg?raw=true)
### BigQuery
```python
from google.cloud import bigquery

from kurve.graph import Graph
from kurve.sources import BigQuerySource

# assumes you have gcloud credentials configured
bq_client = bigquery.Client()
big_query_source = BigQuerySource(client=bq_client)

g = Graph(source=big_query_source)
g.build_graph()
g.save_graph('bigquery_graph.pkl')

len(g)        # 3 nodes
len(g.edges)  # 2 edges

g.plot_graph(fname='big_query_graph.html')
```
### Filesystem (local, S3, Azure Blob, GCS)
```python
import re

from kurve.enums import FileProvider, StorageFormat
from kurve.graph import Graph
from kurve.sources import FileSystemSource

# S3 example
efs = FileSystemSource(
    path_root='BUCKET_NAME',
    provider=FileProvider.s3,
    storage_format=StorageFormat.parquet,
    prefix='SUB/DIRECTORY/PATH/TO/FILES',
    regex_filter=re.compile(r"([a-fA-F\d]{32})"),
    entities_are_partitioned=True
)

g = Graph(source=efs)
g.build_graph()
g.save_graph('filesystem_graph.pkl')

len(g)        # 100 nodes
len(g.edges)  # 88 edges

g.plot_graph(fname='s3_warehouse_graph.html')
```
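The `regex_filter` above matches 32-character hex strings, i.e. MD5-style identifiers embedded in object names. A quick check of what it will and won't match (the file names below are hypothetical):

```python
import re

regex_filter = re.compile(r"([a-fA-F\d]{32})")

names = [
    'part-00000-9e107d9d372bb6826bd81d3542a419d6.parquet',  # contains a 32-hex-char id
    'events_2023.parquet',                                   # no hex id
]
matched = [n for n in names if regex_filter.search(n)]
print(matched)  # only the first name matches
```

Only objects whose names contain such an identifier are treated as entities, which helps skip manifests, logs, and other non-data files in the bucket.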
## Future work
* More sources
* Better cardinality support
* Optional compute "budget" specification for graph building
* Closer integration with compute and with GraphReduce paradigms
* Top-level project configuration for graph output locations, credentials, etc.
* Substituting `networkx` with a knowledge-graph backend (probably [`kglab`](https://github.com/derwenai/kglab))
* Consolidating data types, most likely by leveraging the open-source work in [`pyarrow`](https://arrow.apache.org/docs/python/api/datatypes.html)
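The last item is about mapping each connector's native column types onto one canonical set. A stdlib-only sketch of the idea (the mappings and canonical names here are illustrative, not kurve's; `pyarrow`'s type objects would replace the strings):

```python
# Hypothetical per-source type mappings onto a shared canonical set.
POSTGRES_TO_CANONICAL = {
    'integer': 'int64',
    'bigint': 'int64',
    'text': 'string',
    'timestamp without time zone': 'timestamp',
}
SNOWFLAKE_TO_CANONICAL = {
    'NUMBER': 'int64',
    'VARCHAR': 'string',
    'TIMESTAMP_NTZ': 'timestamp',
}

def canonical_type(source: str, native_type: str) -> str:
    """Resolve a connector-native type name to a canonical type name."""
    table = {'postgres': POSTGRES_TO_CANONICAL,
             'snowflake': SNOWFLAKE_TO_CANONICAL}[source]
    return table.get(native_type, 'unknown')

print(canonical_type('postgres', 'bigint'))    # int64
print(canonical_type('snowflake', 'VARCHAR'))  # string
```

A shared type vocabulary is what makes cross-source edges (e.g. a Postgres column joined to a parquet column) comparable at all.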