Name | agefreighter |
Version | 0.6.0 |
home_page | None |
Summary | AgeFreighter is a Python package that helps you to create a graph database using Azure Database for PostgreSQL. |
upload_time | 2024-12-20 03:39:05 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License Copyright (c) 2024 Rio Fujita Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
# AGEFreighter
AGEFreighter is a Python package that helps you create a graph database using Azure Database for PostgreSQL.
[Apache AGE™](https://age.apache.org/) is a graph database for PostgreSQL that is compatible with PostgreSQL's distributed assets and leverages graph data structures to analyze and use relationships and patterns in data.
[Azure Database for PostgreSQL](https://azure.microsoft.com/en-us/services/postgresql/) is a managed database service that is based on the open-source Postgres database engine.
For background, see the blog post [Introducing support for Graph data in Azure Database for PostgreSQL (Preview)](https://techcommunity.microsoft.com/blog/adforpostgresql/introducing-support-for-graph-data-in-azure-database-for-postgresql-preview/4275628).
## 0.5.0 Release
Refactored the code to make it more readable and maintainable, with separate classes created through a factory model.
Please note that the way to use the new version of the package is totally different from previous versions.
### 0.5.2 Release -AzureStorageFreighter-
* The AzureStorageFreighter class loads data from Azure Storage into the graph database. It works quite differently from the other classes:
  * If the 'subscription_id' argument is not set, the class tries to find the Azure Subscription ID from your local environment using the 'az' command.
  * Creates an Azure Storage account and a blob container in the resource group in which the PostgreSQL server runs.
  * Enables the 'azure_storage' extension in the PostgreSQL server if it is not already enabled.
  * Uploads the CSV file to the blob container.
  * Creates a UDF (User Defined Function) named 'load_from_azure_storage' in the PostgreSQL server. The UDF loads data from Azure Storage into the graph database.
  * Executes the UDF.
* This preparation takes time, which makes the class unsuitable for small files but effective for large ones. For instance, it takes under 3 seconds to load 'actorfilms.csv' after uploading.
* Please note that the class is still at an early stage of implementation, so there is room for optimization and there may be issues due to insufficient testing.
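The first step in the list above, falling back to the 'az' command when 'subscription_id' is not given, might look like the following sketch. The function name `find_subscription_id` and the injectable `run` parameter are hypothetical and exist only for illustration; the exact command the package runs internally is not documented here, so `az account show --query id -o tsv` is an assumption about one reasonable way to ask the Azure CLI for the active subscription.

```python
import subprocess
from typing import Callable, Optional


def find_subscription_id(
    subscription_id: Optional[str] = None,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the given ID, or ask the Azure CLI for the active subscription.

    Hypothetical helper: not part of the agefreighter API.
    """
    if subscription_id:
        return subscription_id
    # 'az account show' describes the currently selected subscription;
    # '--query id -o tsv' reduces the output to the bare ID string.
    result = run(
        ["az", "account", "show", "--query", "id", "-o", "tsv"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if 'az' exits with an error
    )
    return result.stdout.strip()
```

Passing `run` explicitly makes the lookup easy to exercise without the Azure CLI installed; in normal use the default `subprocess.run` is enough.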
### 0.5.3 Release -AzureStorageFreighter-
* The AzureStorageFreighter class has been completely refactored for better performance and scalability.
  * 0.5.2 didn't work well for large files.
  * Now, it works well for large files.
    Checked with a 5.4GB CSV file consisting of 10M start vertices, 10K end vertices, and 25M edges,
    it took 512 seconds to load the data into the graph database with PostgreSQL Flex,
    Standard_D32ds_v4 (32 vcpus, 128 GiB memory) and 512TB / 7500 iops of storage.
  * The test data was generated with tests/generate_dummy_data.py.
  * The UDF for loading data into the graph is no longer used.
* Please note that the class is still at an early stage of implementation, so there is room for optimization and there may be issues due to insufficient testing.
### 0.6.0 Release
* Added edge properties support.
  * The 'edge_props' argument (list) is added to the 'load()' method.
* The 'drop_graph' argument is obsolete. The 'create_graph' argument is added in its place.
  * 'create_graph' is set to True by default. CAUTION: If the graph already exists, it is dropped before the data is loaded.
  * If 'create_graph' is False, the data is loaded into the existing graph.
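For callers that still pass 'drop_graph', a small shim can translate old keyword arguments before they reach 'load()'. This is a minimal sketch, not part of the package: the helper name `migrate_load_kwargs` is hypothetical, and the mapping (an old `drop_graph` value is carried over unchanged as `create_graph`) is an assumption based on the descriptions above.

```python
import warnings


def migrate_load_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with 'drop_graph' replaced by 'create_graph'.

    Hypothetical helper: not part of the agefreighter API.
    """
    migrated = dict(kwargs)
    if "drop_graph" in migrated:
        old_value = migrated.pop("drop_graph")
        warnings.warn(
            "'drop_graph' is obsolete; use 'create_graph' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the old flag's value carries over as-is. An explicit
        # 'create_graph' supplied by the caller takes precedence.
        migrated.setdefault("create_graph", old_value)
    return migrated
```

The result can then be unpacked into the call, for example `await instance.load(**migrate_load_kwargs(old_kwargs))`.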
### Features
* Asynchronous connection pool support for psycopg PostgreSQL driver
* 'direct_loading' option for loading data directly into the graph. If 'direct_loading' is True, the data is loaded into the graph using the 'INSERT' statement, not Cypher queries.
* 'COPY' protocol support for loading data into the graph. If 'use_copy' is True, the data is loaded into the graph using the 'COPY' protocol.
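To make the difference between a Cypher query and a direct 'INSERT' concrete, the sketch below builds one statement of each kind for the same vertex. It only illustrates the statement shapes and is not the package's actual SQL. The function names are hypothetical, and the direct form assumes Apache AGE's per-label table layout (a table named after the label inside the graph's schema, with a 'properties' column of type agtype) and that the label already exists. Identifiers are interpolated without escaping, so this is not safe for untrusted input.

```python
import json


def cypher_create_vertex(graph: str, label: str, props: dict) -> str:
    """Build a statement that creates a vertex through AGE's cypher() function."""
    # Cypher map literal: unquoted keys, JSON-encoded values.
    body = ", ".join(f"{key}: {json.dumps(value)}" for key, value in props.items())
    return (
        f"SELECT * FROM cypher('{graph}', "
        f"$$ CREATE (:{label} {{{body}}}) $$) AS (v agtype);"
    )


def direct_insert_vertex(graph: str, label: str, props: dict) -> str:
    """Build a plain INSERT into the label's table (assumed AGE layout)."""
    # Double any single quotes so the JSON fits inside a SQL string literal.
    payload = json.dumps(props).replace("'", "''")
    return (
        f'INSERT INTO "{graph}"."{label}" (properties) '
        f"VALUES ('{payload}'::agtype);"
    )


print(cypher_create_vertex("AgeTester", "Actor", {"id": "1"}))
print(direct_insert_vertex("AgeTester", "Actor", {"id": "1"}))
```

The direct form skips Cypher parsing and planning for every row, which is the likely reason it is offered as a faster loading path.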
### Classes
* AzureStorageFreighter
* AvroFreighter
* CosmosGremlinFreighter
* CSVFreighter
* MultiCSVFreighter
* Neo4jFreighter
* NetworkXFreighter
* ParquetFreighter
* PGFreighter
### Method
All the classes have the same load() method. The method loads data into the graph database.
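The idea of a factory that hands out loaders sharing one 'load()' method can be sketched as follows. This is a toy model for illustration only, not the package's implementation: `ToyFactory`, `ToyBaseFreighter`, and the two toy loader classes are hypothetical names, and the real classes do far more than return a string.

```python
import asyncio


class ToyBaseFreighter:
    """Common interface: every loader exposes the same async load()."""

    async def load(self, **kwargs) -> str:
        raise NotImplementedError


class ToyCSVFreighter(ToyBaseFreighter):
    async def load(self, **kwargs) -> str:
        return f"csv:{kwargs['graph_name']}"


class ToyParquetFreighter(ToyBaseFreighter):
    async def load(self, **kwargs) -> str:
        return f"parquet:{kwargs['graph_name']}"


class ToyFactory:
    # Map public class names to implementations.
    _registry = {
        "CSVFreighter": ToyCSVFreighter,
        "ParquetFreighter": ToyParquetFreighter,
    }

    @staticmethod
    def create_instance(class_name: str) -> ToyBaseFreighter:
        try:
            return ToyFactory._registry[class_name]()
        except KeyError:
            raise ValueError(f"unknown class: {class_name}") from None


async def demo() -> str:
    # The caller only needs the class name; the call to load() is identical
    # regardless of which loader the factory returns.
    loader = ToyFactory.create_instance("CSVFreighter")
    return await loader.load(graph_name="AgeTester")


print(asyncio.run(demo()))
```

Because every loader shares the same method, switching data sources means changing the class name and the class-specific arguments, not the calling code.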
### Arguments
* Common arguments
  * graph_name (str) : the name of the graph
  * chunk_size (int) : the number of rows to be loaded at once
  * direct_loading (bool) : if True, the data is loaded into the graph using the 'INSERT' statement, not Cypher queries
  * use_copy (bool) : if True, the data is loaded into the graph using the 'COPY' protocol
  * create_graph (bool) : if True, the graph will be created after the existing graph is dropped
* Common arguments for 'Single Source' classes
  * AvroFreighter
  * AzureStorageFreighter
  * CosmosGremlinFreighter
  * Neo4jFreighter
  * NetworkXFreighter
  * ParquetFreighter
  * PGFreighter
    * start_v_label (str): Start Vertex Label
    * start_id (str): Start Vertex ID
    * start_props (list): Start Vertex Properties
    * edge_type (str): Edge Type
    * edge_props (list): Edge Properties
    * end_v_label (str): End Vertex Label
    * end_id (str): End Vertex ID
    * end_props (list): End Vertex Properties
* Class specific arguments
  * AvroFreighter
    * source_avro (str): The path to the Avro file.
  * CosmosGremlinFreighter
    * cosmos_gremlin_endpoint (str): The Cosmos Gremlin endpoint.
    * cosmos_gremlin_key (str): The Cosmos Gremlin key.
    * cosmos_username (str): The Cosmos username.
    * id_map (dict): ID Mapping
  * MultiCSVFreighter
    * vertex_csvs (list): The paths to the vertex CSV files.
    * vertex_labels (list): The labels of the vertices.
    * edge_csvs (list): The paths to the edge CSV files.
    * edge_types (list): The types of the edges.
  * Neo4jFreighter
    * neo4j_uri (str): The URI of the Neo4j database.
    * neo4j_user (str): The username of the Neo4j database.
    * neo4j_password (str): The password of the Neo4j database.
    * neo4j_database (str): The database of the Neo4j database.
    * id_map (dict): ID Mapping
  * NetworkXFreighter
    * networkx_graph (nx.Graph): The NetworkX graph.
    * id_map (dict): ID Mapping
  * ParquetFreighter
    * source_parquet (str): The path to the Parquet file.
  * PGFreighter
    * source_pg_con_string (str): The connection string of the source PostgreSQL database.
    * source_schema (str): The source schema.
    * source_tables (list): The source tables.
    * id_map (dict): ID Mapping
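The 'chunk_size' argument controls how many rows are loaded at once. How the package splits its input internally is not shown here, so the following is only a generic sketch of row batching; the helper name `chunked` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls the next chunk_size items without reading the rest.
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch


for batch in chunked(range(5), 2):
    print(batch)
```

Larger chunks mean fewer round trips to the server at the cost of more memory per batch, so the best value depends on row size and available resources.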
### Release Notes
* 0.4.0 : Added 'loadFromCosmosGremlin()' function.
* 0.4.1 : Changed base Python version to 3.9 to run on Azure Cloud Shell and Databricks 15.4ML.
* 0.4.2 : Tuning for 'loadFromCosmosGremlin()' function.
* 0.4.3 : Standardized the argument names. Enhanced the tests for each function.
* 0.4.4 : Performance tuning.
* 0.4.5 : Simplified 'loadFromNeo4j'.
* 0.4.6 : Added 'loadFromAvro()' function.
* 0.5.0 : Refactored the code to make it more readable and maintainable with the separated classes for factory model. Introduced concurrent.futures for better performance.
* 0.5.1 : Improved the usage.
* 0.5.2 : Added AzureStorageFreighter class, fixed a bug in ParquetFreighter class (thanks to my co-worker Srikanth-san for the report).
* 0.5.3 : Refactored AzureStorageFreighter class for better performance and scalability.
* 0.6.0 : Added edge properties support. The 'drop_graph' argument is obsolete. The 'create_graph' argument is added.
### Install
```bash
pip install agefreighter
```
### Prerequisites
* Python 3.9 or later
* This module runs on [psycopg](https://www.psycopg.org/) and [psycopg_pool](https://www.psycopg.org/)
* Enable the Apache AGE extension in your Azure Database for PostgreSQL instance. Log in to the Azure Portal, go to the 'Server parameters' blade, and check 'AGE' in both the 'azure.extensions' and 'shared_preload_libraries' parameters. See the blog post linked above for more information.
* Load the AGE extension in your PostgreSQL database.
```sql
CREATE EXTENSION IF NOT EXISTS age CASCADE;
```
### Usage
```python
import asyncio
import logging
import os

from agefreighter import Factory

log = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)


async def main():
    # Create a loader by class name via the factory.
    class_name = "CSVFreighter"
    instance = Factory.create_instance(class_name)

    # Connect using the connection string from the environment.
    await instance.connect(
        dsn=os.environ["PG_CONNECTION_STRING"],
        max_connections=64,
    )
    await instance.load(
        graph_name="AgeTester",
        start_v_label="Actor",
        start_id="ActorID",
        start_props=["Actor"],
        edge_type="ACTED_IN",
        edge_props=["Role", "Director"],
        end_v_label="Film",
        end_id="FilmID",
        end_props=["Film", "Year", "Votes", "Rating"],
        csv="./actorfilms.csv",
        # 'drop_graph' is obsolete as of 0.6.0. With create_graph=True,
        # an existing graph is dropped before the data is loaded.
        create_graph=True,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
See [tests/agefreightertester.py](https://github.com/rioriost/agefreighter/blob/main/tests/agefreightertester.py) for more details.
### Test & Samples
```bash
export PG_CONNECTION_STRING="host=your_host.postgres.database.azure.com port=5432 dbname=postgres user=account password=your_password"
cd tests/
python3.9 agefreightertester.py
```
### For more information about [Apache AGE](https://age.apache.org/)
* Apache AGE : https://age.apache.org/
* GitHub : https://github.com/apache/age
* Documentation : https://age.apache.org/age-manual/master/index.html
### License
MIT License
Raw data
{
"_id": null,
"home_page": null,
"name": "agefreighter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Rio Fujita <rifujita@microsoft.com>",
"download_url": "https://files.pythonhosted.org/packages/85/75/f580b5c6c19a46e96e82af12ab89c61e472d7cd69ddb52bdf0fbbbd738e5/agefreighter-0.6.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 Rio Fujita Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"summary": "AgeFreighter is a Python package that helps you to create a graph database using Azure Database for PostgreSQL.",
"version": "0.6.0",
"project_urls": {
"Homepage": "https://github.com/rioriost/agefreighter",
"Issues": "https://github.com/rioriost/agefreighter/issues"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "182ca65043e60e107a162c7804cc9b02b89b18b893f7d2c51692f2383ed84c10",
"md5": "a897d3016f315afd71e301df9c85266e",
"sha256": "a0498a6563daaed29ed8040d0cf75a88002fe263f3cc589b6b0a30d221095d44"
},
"downloads": -1,
"filename": "agefreighter-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a897d3016f315afd71e301df9c85266e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 30889,
"upload_time": "2024-12-20T03:39:04",
"upload_time_iso_8601": "2024-12-20T03:39:04.333473Z",
"url": "https://files.pythonhosted.org/packages/18/2c/a65043e60e107a162c7804cc9b02b89b18b893f7d2c51692f2383ed84c10/agefreighter-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8575f580b5c6c19a46e96e82af12ab89c61e472d7cd69ddb52bdf0fbbbd738e5",
"md5": "c07f516787d1748c65ace699418d2792",
"sha256": "e02edfeb2dc752edd69f61e7482ea546f030e572a877d7e950e54db322ca8410"
},
"downloads": -1,
"filename": "agefreighter-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "c07f516787d1748c65ace699418d2792",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 41748,
"upload_time": "2024-12-20T03:39:05",
"upload_time_iso_8601": "2024-12-20T03:39:05.814294Z",
"url": "https://files.pythonhosted.org/packages/85/75/f580b5c6c19a46e96e82af12ab89c61e472d7cd69ddb52bdf0fbbbd738e5/agefreighter-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-20 03:39:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rioriost",
"github_project": "agefreighter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "agefreighter"
}