graphreduce

Name	graphreduce
Version	1.8.6
home_page	https://github.com/wesmadrigal/graphreduce
Summary	Leveraging graph data structures for complex feature engineering pipelines.
upload_time	2025-07-22 13:25:25
maintainer	None
docs_url	None
author	Wes Madrigal
requires_python	None
license	MIT
keywords	feature engineering mlops entity linking graph algorithms
VCS
bugtrack_url
requirements	abstract.jwrotator dask dask deltalake duckdb getdaft httpx icecream networkx numpy pandas pyspark pyvis setuptools structlog pytest pydantic pytorch_frame pyiceberg
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# GraphReduce


## Description
GraphReduce is an abstraction for building machine learning feature
engineering pipelines that involve many tables in a composable way.
The library is intended to help bridge the gap between research feature
definitions and production deployment without the overhead of a full 
feature store.  Underneath the hood, GraphReduce uses graph data
structures to represent tables/files as nodes and foreign keys
as edges.
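As a rough illustration of the "tables as nodes, foreign keys as edges" idea, the sketch below builds a tiny schema graph in plain Python and walks it depth-first so that every child table is visited before its parent. The table and key names are illustrative assumptions, and `reduction_order` is a hypothetical helper, not part of the GraphReduce API.

```python
from collections import defaultdict

# (child_table, child_key, parent_table, parent_key): one tuple per foreign key.
# These names are illustrative assumptions, not read from any real schema.
foreign_keys = [
    ("orders", "customer_id", "cust", "id"),
    ("order_products", "order_id", "orders", "id"),
    ("notifications", "customer_id", "cust", "id"),
]

# Adjacency list from each parent table to the child tables that reference it.
children = defaultdict(list)
for child, child_key, parent, parent_key in foreign_keys:
    children[parent].append(child)


def reduction_order(table):
    """Depth-first, post-order walk: every child is listed before its parent."""
    order = []
    for child in children.get(table, []):
        order.extend(reduction_order(child))
    order.append(table)
    return order


order = reduction_order("cust")
print(order)  # ['order_products', 'orders', 'notifications', 'cust']
```

Listing children before parents is what allows each child table to be aggregated and joined upward before its parent is processed.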

Compute backends supported: `pandas`, `dask`, `spark`, `daft`, AWS Athena, AWS Redshift, Snowflake, PostgreSQL, MySQL

Compute backends coming soon: `ray`


### Installation
```bash
# from pypi
pip install graphreduce

# from github
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'

# install from source
git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
```


## Motivation
Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
are disconnected.  They can be represented as a graph, where tables
are nodes and join keys are edges.  In many model building scenarios
there isn't a nice ML-ready vector waiting for us, so we must curate
the data by joining many tables together to flatten them into a vector.
This is the problem `graphreduce` sets out to solve.  

## Prior work
* [Deep Feature Synthesis](https://www.maxkanter.com/papers/DSAA_DSM_2015.pdf)
* [One Button Machine (IBM)](https://arxiv.org/abs/1706.00327)
* [autofeat (BASF)](http://arxiv.org/pdf/1901.07329)
* [featuretools (inspired by Deep Feature Synthesis)](https://github.com/alteryx/featuretools)

## Shortcomings of prior work
* point in time correctness is not always handled well
* Deep Feature Synthesis and `featuretools` are limited to `pandas` and a couple of SQL databases
* One Button Machine from IBM uses `spark` but their implementation outside of the paper could not be found
* none of the prior implementations allow for custom computational graphs or additional third party libraries

## We extend prior works and add the following functionality:
* point in time correctness on arbitrarily large computational graphs
* extensible computational layers, with support currently spanning: `pandas`, `dask`, `spark`, AWS Athena, AWS Redshift, Snowflake, postgresql, mysql, `daft`
* customizable node implementations for a mix of dynamic and custom feature engineering with the ability to use third party libraries for portions (e.g., [cleanlab](https://github.com/cleanlab/cleanlab) for cleaning)



## To get this example schema ready for an ML model we need to do the following:
* define the node-level interface and operations for filtering, annotating, normalizing, and reducing
* select the [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity) to which we'll reduce our data: in this example `customer`
* specify how much historical data will be included and what holdout period will be used (e.g., 365 days of historical data and 1 month of holdout data for labels)
* filter all data entities to include only the specified amount of history to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning))
* perform depth-first, bottom-up group by / aggregation operations to reduce the data
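The steps above can be sketched by hand in `pandas` for a single parent/child pair. This is a minimal illustration under stated assumptions: the toy tables, column names, and the half-open history window are all hypothetical, and none of the code calls GraphReduce itself.

```python
import pandas as pd

cut_date = pd.Timestamp("2023-09-01")
history = pd.Timedelta(days=365)

# Hypothetical toy tables; column names are illustrative only.
customers = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
orders = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [5.0, 7.0, 3.0, 9.0],
    "ts": pd.to_datetime(["2023-01-15", "2023-08-01", "2021-01-01", "2023-10-01"]),
})

# 1. Keep only rows inside the history window (assumed half-open here),
#    so nothing at or after the cut date leaks into the features.
in_window = (orders["ts"] >= cut_date - history) & (orders["ts"] < cut_date)
orders_hist = orders[in_window]

# 2. Reduce the child table to the parent's granularity: one row per customer.
orders_reduced = (
    orders_hist.groupby("customer_id")
    .agg(order_count=("id", "count"), amount_sum=("amount", "sum"))
    .reset_index()
)

# 3. Join the reduced child onto the parent; customers with no rows get NaN.
features = customers.merge(
    orders_reduced, left_on="id", right_on="customer_id", how="left"
)
print(features)
```

With more tables, the same filter, reduce, and join steps repeat from the deepest child upward, which is the part the library is described as automating.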


1. End to end example:
```python
import datetime
import pandas as pd
from graphreduce.node import GraphReduceNode, DynamicNode
from graphreduce.enum import ComputeLayerEnum, PeriodUnit
from graphreduce.graph_reduce import GraphReduce

# source from a csv file with the relationships
# using the file at: https://github.com/wesmadrigal/GraphReduce/blob/master/examples/cust_graph_labels.csv
reldf = pd.read_csv('cust_graph_labels.csv')

# using the data from: https://github.com/wesmadrigal/GraphReduce/tree/master/tests/data/cust_data
files = {
    'cust.csv' : {'prefix':'cu'},
    'orders.csv':{'prefix':'ord'},
    'order_products.csv': {'prefix':'op'},
    'notifications.csv':{'prefix':'notif'},
    'notification_interactions.csv':{'prefix':'ni'},
    'notification_interaction_types.csv':{'prefix':'nit'}

}
# create graph reduce nodes
gr_nodes = {
    f.split('/')[-1]: DynamicNode(
        fpath=f,
        fmt='csv',
        pk='id',
        prefix=files[f]['prefix'],
        date_key=None,
        compute_layer=ComputeLayerEnum.pandas,
        compute_period_val=730,
        compute_period_unit=PeriodUnit.day,
    )
    for f in files.keys()
}
gr = GraphReduce(
    name='cust_dynamic_graph',
    parent_node=gr_nodes['cust.csv'],
    fmt='csv',
    cut_date=datetime.datetime(2023,9,1),
    compute_layer=ComputeLayerEnum.pandas,
    auto_features=True,
    auto_feature_hops_front=1,
    auto_feature_hops_back=2,
    label_node=gr_nodes['orders.csv'],
    label_operation='count',
    label_field='id',
    label_period_val=60,
    label_period_unit=PeriodUnit.day
)
# Add graph edges
for ix, row in reldf.iterrows():
    gr.add_entity_edge(
        parent_node=gr_nodes[row['to_name']],
        relation_node=gr_nodes[row['from_name']],
        parent_key=row['to_key'],
        relation_key=row['from_key'],
        reduce=True
    )


gr.do_transformations()
2024-04-23 13:49:41 [info     ] hydrating graph attributes
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating attributes for DynamicNode
2024-04-23 13:49:41 [info     ] hydrating graph data
2024-04-23 13:49:41 [info     ] checking for prefix uniqueness
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notification_interaction_types.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] running filters, normalize, and annotations for <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] depth-first traversal through the graph from source: <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=notification_interactions.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=notification_interactions.csv fmt=csv> to <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=notifications.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=notifications.csv fmt=csv> to <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=order_products.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=order_products.csv fmt=csv> to <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] reducing relation <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] performing auto_features on node <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] joining <GraphReduceNode: fpath=orders.csv fmt=csv> to <GraphReduceNode: fpath=cust.csv fmt=csv>
2024-04-23 13:49:41 [info     ] Had label node <GraphReduceNode: fpath=orders.csv fmt=csv>
2024-04-23 13:49:41 [info     ] computed labels for <GraphReduceNode: fpath=orders.csv fmt=csv>

gr.parent_node.df
cu_id	cu_name	notif_customer_id	notif_id_count	notif_customer_id_count	notif_ts_first	notif_ts_min	notif_ts_max	ni_notification_id_min	ni_notification_id_max	ni_notification_id_sum	ni_id_count_min	ni_id_count_max	ni_id_count_sum	ni_notification_id_count_min	ni_notification_id_count_max	ni_notification_id_count_sum	ni_interaction_type_id_count_min	ni_interaction_type_id_count_max	ni_interaction_type_id_count_sum	ni_ts_first_first	ni_ts_first_min	ni_ts_first_max	ni_ts_min_first	ni_ts_min_min	ni_ts_min_max	ni_ts_max_first	ni_ts_max_min	ni_ts_max_max	ord_customer_id	ord_id_count	ord_customer_id_count	ord_ts_first	ord_ts_min	ord_ts_max	op_order_id_min	op_order_id_max	op_order_id_sum	op_id_count_min	op_id_count_max	op_id_count_sum	op_order_id_count_min	op_order_id_count_max	op_order_id_count_sum	op_product_id_count_min	op_product_id_count_max	op_product_id_count_sum	ord_customer_id_dupe	ord_id_label
0	1	wes	1	6	6	2022-08-05	2022-08-05	2023-06-23	101.0	106.0	621.0	1.0	3.0	14.0	1.0	3.0	14.0	1.0	3.0	14.0	2022-08-06	2022-08-06	2023-05-15	2022-08-06	2022-08-06	2023-05-15	2022-08-08	2022-08-08	2023-05-15	1.0	2.0	2.0	2023-05-12	2023-05-12	2023-06-01	1.0	2.0	3.0	4.0	4.0	8.0	4.0	4.0	8.0	4.0	4.0	8.0	1.0	1.0
1	2	john	2	7	7	2022-09-05	2022-09-05	2023-05-22	107.0	110.0	434.0	1.0	1.0	4.0	1.0	1.0	4.0	1.0	1.0	4.0	2023-06-01	2023-06-01	2023-06-04	2023-06-01	2023-06-01	2023-06-04	2023-06-01	2023-06-01	2023-06-04	2.0	1.0	1.0	2023-01-01	2023-01-01	2023-01-01	3.0	3.0	3.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	4.0	NaN	NaN
2	3	ryan	3	2	2	2023-06-12	2023-06-12	2023-09-01	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaT	NaT	NaT	NaT	NaT	NaT	NaT	NaT	NaT	3.0	1.0	1.0	2023-06-01	2023-06-01	2023-06-01	5.0	5.0	5.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	NaN	NaN
3	4	tianji	4	2	2	2024-02-01	2024-02-01	2024-02-15	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0	NaN	NaN	0.0
```
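The example reads its edges from a relationships file with the columns `from_name`, `from_key`, `to_name`, and `to_key`. For readers without that file at hand, the sketch below builds an in-memory frame of the same shape. The rows are assumptions inferred from the output column names above (for example `ord_customer_id` and `op_order_id`); they are not copied from `cust_graph_labels.csv`, which may contain additional columns.

```python
import pandas as pd

# Hypothetical rows: each one is a foreign key from a relation (child) file
# to its parent file, inferred from the output column names above.
reldf = pd.DataFrame(
    [
        ("orders.csv", "customer_id", "cust.csv", "id"),
        ("order_products.csv", "order_id", "orders.csv", "id"),
        ("notifications.csv", "customer_id", "cust.csv", "id"),
        ("notification_interactions.csv", "notification_id", "notifications.csv", "id"),
        ("notification_interactions.csv", "interaction_type_id",
         "notification_interaction_types.csv", "id"),
    ],
    columns=["from_name", "from_key", "to_name", "to_key"],
)

# Print each edge in "child.key -> parent.key" form.
for _, row in reldf.iterrows():
    print(f"{row['from_name']}.{row['from_key']} -> {row['to_name']}.{row['to_key']}")
```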

2. Plot the graph reduce compute graph.
```python
gr.plot_graph('my_graph_reduce.html')
```


3. Use materialized dataframe for ML / analytics
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = gr.parent_node.df

# label: count of orders in the label period (NaN means no orders)
Y = 'ord_id_label'
# numeric feature columns, excluding the label itself to avoid leakage
X = [c for c in df.select_dtypes(include='number').columns if c != Y]

train, test = train_test_split(df)

mdl = LinearRegression()
# LinearRegression does not accept NaN, so fill missing values with 0
mdl.fit(train[X].fillna(0), train[Y].fillna(0))
```

## Paper
[![Preview of PDF](./docs/graphreduce_paper_abstract.jpeg)](./docs/GraphReduce_%20a%20scalable%20feature%20engineering%20system-4.pdf)



## Order of operations
![order of operations](https://github.com/wesmadrigal/GraphReduce/blob/master/docs/graph_reduce_ops.drawio.png)



# API definition

## GraphReduce instantiation and parameters
`graphreduce.graph_reduce.GraphReduce`
* `cut_date` controls the date around which we orient the data in the graph
* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
* `compute_period_unit` tells us what unit of time we're using
* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
```python
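import datetime

from graphreduce.graph_reduce import GraphReduce
from graphreduce.enum import PeriodUnit

# `customer` is a node instance (a GraphReduceNode subclass or DynamicNode) defined elsewhere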
gr = GraphReduce(
    cut_date=datetime.datetime(2023, 2, 1), 
    compute_period_val=365, 
    compute_period_unit=PeriodUnit.day,
    parent_node=customer
)
```
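To make the roles of `cut_date` and the period parameters concrete, here is a small sketch of the date arithmetic they imply. It assumes features come from the window before the cut date and labels from the window after it, both half-open; the 60-day label period and the helper functions are hypothetical, and the exact boundary handling inside GraphReduce may differ.

```python
import datetime

cut_date = datetime.datetime(2023, 2, 1)
compute_period = datetime.timedelta(days=365)  # compute_period_val=365, unit=day
label_period = datetime.timedelta(days=60)     # hypothetical label window

feature_start = cut_date - compute_period
label_end = cut_date + label_period


def in_feature_window(ts):
    """True if ts may be used to build features (strictly before the cut date)."""
    return feature_start <= ts < cut_date


def in_label_window(ts):
    """True if ts may be used to build labels (on or after the cut date)."""
    return cut_date <= ts < label_end


print(feature_start)  # 2022-02-01 00:00:00
print(label_end)      # 2023-04-02 00:00:00
print(in_feature_window(datetime.datetime(2022, 12, 25)))  # True
print(in_label_window(datetime.datetime(2022, 12, 25)))    # False
```

Keeping the two windows disjoint is what prevents label information from leaking into the features.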

## GraphReduce commonly used functions
* `do_transformations` perform all data transformations
* `plot_graph` plot the graph
* `add_entity_edge` add an edge
* `add_node` add a node

## Node definition and parameters
`graphreduce.node.GraphReduceNode`
* `do_annotate` annotation definitions (e.g., split a string column into a new column)
* `do_filters` filter the data on column(s)
* `do_normalize` clip anomalies like exceedingly large values and do normalization
* `post_join_annotate` annotations on current node after relations are merged in and we have access to their columns, too
* `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
* `do_labels` label definitions if any
```python
# alternatively can use a dynamic node
from graphreduce.enum import ComputeLayerEnum
from graphreduce.node import DynamicNode

dyna = DynamicNode(
    fpath='s3://some.bucket/path.csv',
    compute_layer=ComputeLayerEnum.dask,
    fmt='csv',
    prefix='myprefix',
    date_key='ts',
    pk='id'
)
```
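To give a feel for what the node-level hooks listed above might contain, the following sketch uses a stand-alone class that only borrows their names. `SketchNode` is hypothetical: it does not subclass `GraphReduceNode`, the method signatures and call order are assumptions, and the `<prefix>_<column>` naming in `colabbr` is inferred from the output columns shown earlier.

```python
import pandas as pd


class SketchNode:
    """Hypothetical stand-in mirroring the hook names; not the library class."""

    def __init__(self, df, prefix):
        self.df = df.copy()
        self.prefix = prefix

    def colabbr(self, col):
        # Assumed naming scheme: "<prefix>_<column>".
        return f"{self.prefix}_{col}"

    def do_annotate(self):
        # Split a string column into a new column.
        self.df = self.df.assign(sku_category=self.df["sku"].str.split("-").str[0])

    def do_filters(self):
        # Drop rows with non-positive amounts.
        self.df = self.df[self.df["amount"] > 0].copy()

    def do_normalize(self):
        # Clip exceedingly large values.
        self.df = self.df.assign(amount=self.df["amount"].clip(upper=1000))

    def do_reduce(self, reduce_key):
        # Aggregate to one row per reduce_key with prefixed column names.
        return (
            self.df.groupby(reduce_key)
            .agg(**{
                self.colabbr("id_count"): ("id", "count"),
                self.colabbr("amount_sum"): ("amount", "sum"),
            })
            .reset_index()
        )


orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 2],
    "sku": ["A-100", "B-200", "A-300", "C-400"],
    "amount": [20.0, 5000.0, -1.0, 30.0],
})

node = SketchNode(orders, prefix="ord")
# The call order below is an assumption made for this sketch.
node.do_annotate()
node.do_filters()
node.do_normalize()
reduced = node.do_reduce("customer_id")
print(reduced)
```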

## Node commonly used functions
* `colabbr` abbreviate a column
* `prep_for_features` filter the node's data by the cut date and the compute period for point in time correctness, also referred to as "time travel" in blogs
* `prep_for_labels` filter the node's data by the cut date and the label period to prepare for labeling




## License
Copyright 2025 Wes Madrigal

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Roadmap
* integration with Ray
* more dynamic feature engineering abilities, possible integration with Deep Feature Synthesis

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wesmadrigal/graphreduce",
    "name": "graphreduce",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "feature engineering, mlops, entity linking, graph algorithms",
    "author": "Wes Madrigal",
    "author_email": "wes@madconsulting.ai",
    "download_url": "https://files.pythonhosted.org/packages/07/8f/dccaa4c4c6f6b2692340d90e17ae42cf259eb09c0c354867cf963973fb58/graphreduce-1.8.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Leveraging graph data structures for complex feature engineering pipelines.",
    "version": "1.8.6",
    "project_urls": {
        "Homepage": "https://github.com/wesmadrigal/graphreduce",
        "Issue Tracker": "https://github.com/wesmadrigal/graphreduce/issues",
        "Source": "http://github.com/wesmadrigal/graphreduce"
    },
    "split_keywords": [
        "feature engineering",
        " mlops",
        " entity linking",
        " graph algorithms"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9c510dc0409c0068427d1b2a1868c432a9bc634b61eda31668eec77ba4bc1e05",
                "md5": "cf4b0e30319a6bed7d267243bc72435b",
                "sha256": "bce54d8ae7ab72f89778bea86550d040c439a3b1802bf8ab992462cf768f3d2e"
            },
            "downloads": -1,
            "filename": "graphreduce-1.8.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cf4b0e30319a6bed7d267243bc72435b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 43283,
            "upload_time": "2025-07-22T13:25:24",
            "upload_time_iso_8601": "2025-07-22T13:25:24.224320Z",
            "url": "https://files.pythonhosted.org/packages/9c/51/0dc0409c0068427d1b2a1868c432a9bc634b61eda31668eec77ba4bc1e05/graphreduce-1.8.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "078fdccaa4c4c6f6b2692340d90e17ae42cf259eb09c0c354867cf963973fb58",
                "md5": "bdd07366f16c676ae6ea5c17092d61bf",
                "sha256": "d57dbe209ba4351b6a38ff19ecfcba726350d57e21dabd8f0746fec24a4bc857"
            },
            "downloads": -1,
            "filename": "graphreduce-1.8.6.tar.gz",
            "has_sig": false,
            "md5_digest": "bdd07366f16c676ae6ea5c17092d61bf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 44583,
            "upload_time": "2025-07-22T13:25:25",
            "upload_time_iso_8601": "2025-07-22T13:25:25.497626Z",
            "url": "https://files.pythonhosted.org/packages/07/8f/dccaa4c4c6f6b2692340d90e17ae42cf259eb09c0c354867cf963973fb58/graphreduce-1.8.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-22 13:25:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wesmadrigal",
    "github_project": "graphreduce",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "abstract.jwrotator",
            "specs": [
                [
                    ">=",
                    "0.3"
                ]
            ]
        },
        {
            "name": "dask",
            "specs": []
        },
        {
            "name": "dask",
            "specs": []
        },
        {
            "name": "deltalake",
            "specs": [
                [
                    "==",
                    "0.20.1"
                ]
            ]
        },
        {
            "name": "duckdb",
            "specs": [
                [
                    "==",
                    "1.2.2"
                ]
            ]
        },
        {
            "name": "getdaft",
            "specs": []
        },
        {
            "name": "httpx",
            "specs": [
                [
                    "==",
                    "0.27.0"
                ]
            ]
        },
        {
            "name": "icecream",
            "specs": [
                [
                    "==",
                    "2.1.3"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    ">=",
                    "2.6.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "<",
                    "2"
                ],
                [
                    ">=",
                    "1.15"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.4"
                ]
            ]
        },
        {
            "name": "pyspark",
            "specs": [
                [
                    "==",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "pyvis",
            "specs": [
                [
                    ">=",
                    "0.3.1"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    ">=",
                    "65.5.1"
                ]
            ]
        },
        {
            "name": "structlog",
            "specs": [
                [
                    ">=",
                    "23.1.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "8.0.2"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": []
        },
        {
            "name": "pytorch_frame",
            "specs": []
        },
        {
            "name": "pyiceberg",
            "specs": [
                [
                    "==",
                    "0.8.1"
                ]
            ]
        }
    ],
    "lcname": "graphreduce"
}
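
The `requirements` array above stores each dependency as a `name` plus a list of `[operator, version]` pairs in `specs`. As a minimal sketch of how such entries might be turned back into pip-style requirement lines, the snippet below assumes that layout and uses a hypothetical helper, `to_requirement_strings`, which is not part of `graphreduce`. The sample entries are copied from the metadata above.

```python
import json


def to_requirement_strings(requirements):
    """Render [{'name': ..., 'specs': [[op, version], ...]}, ...] as pip-style lines.

    Hypothetical helper for illustration; not part of graphreduce.
    """
    lines = []
    for req in requirements:
        # Join every (operator, version) pair with commas, e.g. "<2,>=1.15".
        spec = ",".join(op + version for op, version in req.get("specs", []))
        lines.append(req["name"] + spec)
    return lines


# A few entries copied from the "requirements" array in the metadata.
sample = json.loads("""
[
  {"name": "numpy", "specs": [["<", "2"], [">=", "1.15"]]},
  {"name": "pyspark", "specs": [["==", "3.5.0"]]},
  {"name": "dask", "specs": []},
  {"name": "dask", "specs": []}
]
""")

requirement_lines = to_requirement_strings(sample)

# The metadata lists dask twice; drop exact duplicates while keeping order.
unique_lines = list(dict.fromkeys(requirement_lines))

print("\n".join(unique_lines))
```

An entry with an empty `specs` list renders as the bare package name, which pip treats as "any version".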
