graphreduce

Name: graphreduce
Version: 1.6.9
Home page: https://github.com/wesmadrigal/graphreduce
Summary: Leveraging graph data structures for complex feature engineering pipelines.
Upload time: 2024-10-30 17:48:15
Author: Wes Madrigal
License: MIT
Keywords: feature engineering, mlops, entity linking, graph algorithms
Requirements: abstract.jwrotator, dask, icecream, networkx, numpy, pandas, pyspark, pyvis, setuptools, structlog, pytest, woodwork, pydantic

# GraphReduce


## Description
GraphReduce is an abstraction for building machine learning feature
engineering pipelines that involve many tables in a composable way.
The library is intended to help bridge the gap between research feature
definitions and production deployment without the overhead of a full 
feature store.  Underneath the hood, GraphReduce uses graph data
structures to represent tables/files as nodes and foreign keys
as edges.

* Compute backends supported: `pandas`, `dask`, `spark`, AWS Athena, AWS Redshift, Snowflake, PostgreSQL, MySQL
* Compute backends coming soon: `ray`


### Installation
```bash
# from pypi
pip install graphreduce

# from github
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'

# install from source
git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
```


## Motivation
Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
are disconnected.  They can be represented as a graph, where tables
are nodes and join keys are edges.  In many model building scenarios
there isn't a nice ML-ready vector waiting for us, so we must curate
the data by joining many tables together to flatten them into a vector.
This is the problem `graphreduce` sets out to solve.  
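
As a toy illustration (hypothetical `customers` and `orders` tables, not from this repo), flattening a one-to-many relationship into an ML-ready vector looks like this in plain pandas:

```python
import pandas as pd

# hypothetical toy tables: one row per customer, many order rows per customer
customers = pd.DataFrame({'id': [1, 2], 'name': ['wes', 'john']})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 25.0, 5.0]})

# reduce the child table to the customer grain, then join back to the parent
order_feats = orders.groupby('customer_id').agg(
    order_count=('amount', 'count'),
    order_amount_sum=('amount', 'sum'),
).reset_index()
flat = customers.merge(order_feats, left_on='id', right_on='customer_id', how='left')
# `flat` now holds one feature vector per customer -- the flattening that
# graphreduce automates across many connected tables
```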

## Prior work
* [Deep Feature Synthesis](https://www.maxkanter.com/papers/DSAA_DSM_2015.pdf)
* [One Button Machine (IBM)](https://arxiv.org/abs/1706.00327)
* [autofeat (BASF)](http://arxiv.org/pdf/1901.07329)
* [featuretools (inspired by Deep Feature Synthesis)](https://github.com/alteryx/featuretools)

## Shortcomings of prior work
* point-in-time correctness is not always handled well
* Deep Feature Synthesis and `featuretools` are limited to `pandas` and a couple of SQL databases
* One Button Machine from IBM uses `spark`, but no public implementation beyond the paper could be found
* none of the prior implementations allow for custom computational graphs or additional third-party libraries

## We extend prior works and add the following functionality:
* point-in-time correctness on arbitrarily large computational graphs
* extensible computational layers, with support currently spanning: `pandas`, `dask`, `spark`, AWS Athena, AWS Redshift, Snowflake, PostgreSQL, MySQL, and more coming
* customizable node implementations for a mix of dynamic and custom feature engineering with the ability to use third party libraries for portions (e.g., [cleanlab](https://github.com/cleanlab/cleanlab) for cleaning)


An example dataset might look like the following:

![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)

## To get this example schema ready for an ML model we need to do the following:
* define the node-level interface and operations for filtering, annotating, normalizing, and reducing
* select the [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity) to which we'll reduce our data: in this example, `customer` 
* specify how much historical data will be included and what holdout period will be used (e.g., 365 days of historical data and 1 month of holdout data for labels)
* filter all data entities to include specified amount of history to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning))
* perform depth-first, bottom-up group by / aggregation operations to reduce the data


1. End-to-end example:
```python
import datetime
import pandas as pd
from graphreduce.node import GraphReduceNode, DynamicNode
from graphreduce.enum import ComputeLayerEnum, PeriodUnit
from graphreduce.graph_reduce import GraphReduce

# source from a csv file with the relationships
# using the file at: https://github.com/wesmadrigal/GraphReduce/blob/master/examples/cust_graph_labels.csv
reldf = pd.read_csv('cust_graph_labels.csv')

# using the data from: https://github.com/wesmadrigal/GraphReduce/tree/master/tests/data/cust_data
files = {
    'cust.csv' : {'prefix':'cu'},
    'orders.csv':{'prefix':'ord'},
    'order_products.csv': {'prefix':'op'},
    'notifications.csv':{'prefix':'notif'},
    'notification_interactions.csv':{'prefix':'ni'},
    'notification_interaction_types.csv':{'prefix':'nit'}

}
# create graph reduce nodes
gr_nodes = {
    f.split('/')[-1]: DynamicNode(
        fpath=f,
        fmt='csv',
        pk='id',
        prefix=files[f]['prefix'],
        date_key=None,
        compute_layer=ComputeLayerEnum.pandas,
        compute_period_val=730,
        compute_period_unit=PeriodUnit.day,
    )
    for f in files.keys()
}
gr = GraphReduce(
    name='cust_dynamic_graph',
    parent_node=gr_nodes['cust.csv'],
    fmt='csv',
    cut_date=datetime.datetime(2023,9,1),
    compute_layer=ComputeLayerEnum.pandas,
    auto_features=True,
    auto_feature_hops_front=1,
    auto_feature_hops_back=2,
    label_node=gr_nodes['orders.csv'],
    label_operation='count',
    label_field='id',
    label_period_val=60,
    label_period_unit=PeriodUnit.day
)
# Add graph edges
for ix, row in reldf.iterrows():
    gr.add_entity_edge(
        parent_node=gr_nodes[row['to_name']],
        relation_node=gr_nodes[row['from_name']],
        parent_key=row['to_key'],
        relation_key=row['from_key'],
        reduce=True
    )


gr.do_transformations()
2024-04-23 13:49:41 [info     ] hydrating graph attributes
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating graph data
2024-04-23 13:49:41 [info     ] checking for prefix uniqueness
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notification_interaction_types.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] depth-first traversal through the graph from source: <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=notification_interactions.csv fmt=csv> to <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=notifications.csv fmt=csv> to <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=order_products.csv fmt=csv> to <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=orders.csv fmt=csv> to <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] Had label node <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] computed labels for <GraphReduceNode: fpath=orders.csv fmt=csv>

gr.parent_node.df
cu_id	cu_name	notif_customer_id	notif_id_count	notif_customer_id_count	notif_ts_first	notif_ts_min	notif_ts_max	ni_notification_id_min	ni_notification_id_max	ni_notification_id_sum	ni_id_count_min	ni_id_count_max	ni_id_count_sum	ni_notification_id_count_min	ni_notification_id_count_max	ni_notification_id_count_sum	ni_interaction_type_id_count_min	ni_interaction_type_id_count_max	ni_interaction_type_id_count_sum	ni_ts_first_first	ni_ts_first_min	ni_ts_first_max	ni_ts_min_first	ni_ts_min_min	ni_ts_min_max	ni_ts_max_first	ni_ts_max_min	ni_ts_max_max	ord_customer_id	ord_id_count	ord_customer_id_count	ord_ts_first	ord_ts_min	ord_ts_max	op_order_id_min	op_order_id_max	op_order_id_sum	op_id_count_min	op_id_count_max	op_id_count_sum	op_order_id_count_min	op_order_id_count_max	op_order_id_count_sum	op_product_id_count_min	op_product_id_count_max	op_product_id_count_sum	ord_customer_id_dupe	ord_id_label
0	1	wes	1	6	6	2022-08-05	2022-08-05	2023-06-23	101.0	106.0	621.0	1.0	3.0	14.0	1.0	3.0	14.0	1.0	3.0	14.0	2022-08-06	2022-08-06	2023-05-15	2022-08-06	2022-08-06	2023-05-15	2022-08-08	2022-08-08	2023-05-15	1.0	2.0	2.0	2023-05-12	2023-05-12	2023-06-01	1.0	2.0	3.0	4.0	4.0	8.0	4.0	4.0	8.0	4.0	4.0	8.0	1.0	1.0
1	2	john	2	7	7	2022-09-05	2022-09-05	2023-05-22	107.0	110.0	434.0	1.0	1.0	4.0	1.0	1.0	4.0	1.0	1.0	4.0	2023-06-01	2023-06-01	2023-06-04	2023-06-01	2023-06-01	2023-06-04	2023-06-01	2023-06-01	2023-06-04	2.0	1.0	1.0	2023-01-01	2023-01-01	2023-01-01	3.0	3.0	3.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	NaN	NaN
2	3	ryan	3	2	2	2023-06-12	2023-06-12	2023-09-01	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaT	NaT	NaT	NaT	NaT	NaT	NaT	NaT	NaT	3.0	1.0	1.0	2023-06-01	2023-06-01	2023-06-01	5.0	5.0	5.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	NaN	NaN
3	4	tianji	4	2	2	2024-02-01	2024-02-01	2024-02-15	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0
```

2. Plot the graph reduce compute graph.
```python
gr.plot_graph('my_graph_reduce.html')
```


3. Use materialized dataframe for ML / analytics
```python

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# label: count of the customer's orders in the label period
Y = 'ord_id_label'
# features: numeric columns, excluding the label itself
X = [
    col for col, dtype in dict(gr.parent_node.df.dtypes).items()
    if (str(dtype).startswith('int') or str(dtype).startswith('float')) and col != Y
]
# fill missing aggregates/labels so the regression can fit
df = gr.parent_node.df[X + [Y]].fillna(0)
train, test = train_test_split(df)

mdl = LinearRegression()
mdl.fit(train[X], train[Y])
```


## Order of operations
![order of operations](https://github.com/wesmadrigal/GraphReduce/blob/master/docs/graph_reduce_ops.drawio.png)



# API definition

## GraphReduce instantiation and parameters
`graphreduce.graph_reduce.GraphReduce`
* `cut_date` controls the date around which we orient the data in the graph
* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
* `compute_period_unit` tells us what unit of time we're using
* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
```python
import datetime

from graphreduce.graph_reduce import GraphReduce
from graphreduce.enum import PeriodUnit

# `customer` is a previously defined node (see the node section below)
gr = GraphReduce(
    cut_date=datetime.datetime(2023, 2, 1),
    compute_period_val=365,
    compute_period_unit=PeriodUnit.day,
    parent_node=customer
)
```

## GraphReduce commonly used functions
* `do_transformations` perform all data transformations
* `plot_graph` plot the graph
* `add_entity_edge` add an edge
* `add_node` add a node
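
A minimal sketch of wiring a graph together with these calls, assuming `cust_node` and `orders_node` are node instances built as in the end-to-end example above:

```python
from graphreduce.graph_reduce import GraphReduce
from graphreduce.enum import ComputeLayerEnum

# assumes `cust_node` and `orders_node` are GraphReduceNode / DynamicNode instances
gr = GraphReduce(
    name='example_graph',
    parent_node=cust_node,
    fmt='csv',
    compute_layer=ComputeLayerEnum.pandas,
)
gr.add_node(cust_node)
gr.add_node(orders_node)
gr.add_entity_edge(
    parent_node=cust_node,
    relation_node=orders_node,
    parent_key='id',
    relation_key='customer_id',
    reduce=True,
)
gr.do_transformations()                  # run all transformations
gr.plot_graph('example_graph.html')      # write the graph visualization
```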

## Node definition and parameters
`graphreduce.node.GraphReduceNode`
* `do_annotate` annotation definitions (e.g., split a string column into a new column)
* `do_filters` filter the data on column(s)
* `do_normalize` clip anomalies like exceedingly large values and do normalization
* `post_join_annotate` annotate the current node after its relations are merged in, when their columns are also available
* `do_reduce` the most important node function; reduction operations: group bys, sum, min, max, etc.
* `do_labels` label definitions if any
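
A hedged sketch of a custom node implementing these hooks (the column names, method bodies, and exact hook signatures are illustrative, not the definitive interface):

```python
import pandas as pd
from graphreduce.node import GraphReduceNode

class CustomerNode(GraphReduceNode):
    def do_filters(self):
        # illustrative filter on a hypothetical column
        self.df = self.df[self.df[self.colabbr('is_active')] == 1]

    def do_annotate(self):
        # derive a new column from an existing one
        self.df[self.colabbr('name_length')] = self.df[self.colabbr('name')].str.len()

    def do_normalize(self):
        pass

    def do_reduce(self, reduce_key):
        # point-in-time filtered group by / aggregation down to the parent grain
        return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
            **{self.colabbr('num_rows'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count')}
        ).reset_index()

    def do_labels(self, reduce_key):
        pass

    def post_join_annotate(self):
        pass
```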
```python
# alternatively, use a dynamic node
from graphreduce.node import DynamicNode
from graphreduce.enum import ComputeLayerEnum

dyna = DynamicNode(
    fpath='s3://some.bucket/path.csv',
    compute_layer=ComputeLayerEnum.dask,
    fmt='csv',
    prefix='myprefix',
    date_key='ts',
    pk='id'
)
```

## Node commonly used functions
* `colabbr` abbreviate a column
* `prep_for_features` filter the node's data by the cut date and the compute period for point-in-time correctness (sometimes called "time travel")
* `prep_for_labels` filter the node's data by the cut date and the label period to prepare for labeling
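
Conceptually, these two helpers perform point-in-time filters like the following plain-pandas sketch (illustrative only; the real methods work off the node's `date_key`, the graph's `cut_date`, and the configured periods):

```python
import datetime
import pandas as pd

# illustrative stand-ins for the node/graph configuration
cut_date = datetime.datetime(2023, 9, 1)
compute_period = datetime.timedelta(days=730)   # compute_period_val in days
label_period = datetime.timedelta(days=60)      # label_period_val in days

df = pd.DataFrame({'ts': pd.to_datetime(['2021-01-01', '2023-08-15', '2023-09-20'])})

# prep_for_features-style filter: history strictly before the cut date,
# bounded by the compute period (prevents label leakage into features)
feature_rows = df[(df['ts'] >= cut_date - compute_period) & (df['ts'] < cut_date)]

# prep_for_labels-style filter: the holdout window after the cut date
label_rows = df[(df['ts'] >= cut_date) & (df['ts'] < cut_date + label_period)]
```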


## Roadmap
* integration with Ray
* more dynamic feature engineering abilities, possible integration with Deep Feature Synthesis

            
