Pandas Diff
===========

|CodeFactor| |Python 3|

Installation
------------

Install pandas_diff with pip:

.. code:: bash

    pip install pandas_diff

Usage/Examples
--------------

.. code:: python

    import pandas as pd
    import pandas_diff as pd_diff

    # Create two example dataframes
    df_infinity_war = pd.DataFrame([
        {"hero": "hulk", "power": "strength"},
        {"hero": "black_widow", "power": "spy"},
        {"hero": "thor", "hammers": 0},
        {"hero": "thor", "hammers": 1},
    ])
    df_endgame = pd.DataFrame([
        {"hero": "hulk", "power": "smart"},
        {"hero": "captain marvel", "power": "strength"},
        {"hero": "thor", "hammers": 2},
    ])

    # Get differences, using the key "hero"
    df = pd_diff.get_diffs(df_infinity_war, df_endgame, "hero")

    df

    #operation object_keys object_values object_json attribute_changed old_value new_value
    #0 create [hero] captain marvel {'hero': 'captain marvel', 'power': 'strength'... NaN NaN NaN
    #1 delete [hero] black_widow {'hero': 'black_widow', 'power': 'spy', 'hamme... NaN NaN NaN
    #2 modify [hero] thor {'hero': 'thor', 'power': nan, 'hammers': 2.0} hammers 1 2
    #3 modify [hero] hulk {'hero': 'hulk', 'power': 'smart', 'hammers': ... power strength smart

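
The result above comes from a keyed diff: rows are matched on the key column
and then classified as created, deleted, or modified. The following is a
minimal plain-Python sketch of that idea. It is an illustration only, not
pandas_diff's actual implementation; ``keyed_diff`` is a hypothetical name, and
the sketch assumes that the last row wins when a key is repeated (which matches
the ``old_value`` of 1 shown for thor above).

.. code:: python

    def keyed_diff(old_rows, new_rows, key):
        """Classify changes between two lists of dict records matched on `key`."""
        # Index both sides by key; a repeated key keeps its last record (assumption).
        old = {row[key]: row for row in old_rows}
        new = {row[key]: row for row in new_rows}

        events = []
        # Keys only in the new data were created.
        for k in sorted(new.keys() - old.keys()):
            events.append({"operation": "create", "key": k})
        # Keys only in the old data were deleted.
        for k in sorted(old.keys() - new.keys()):
            events.append({"operation": "delete", "key": k})
        # Keys on both sides are compared attribute by attribute.
        for k in sorted(old.keys() & new.keys()):
            for attr in sorted((set(old[k]) | set(new[k])) - {key}):
                before, after = old[k].get(attr), new[k].get(attr)
                if before != after:
                    events.append({
                        "operation": "modify",
                        "key": k,
                        "attribute": attr,
                        "old_value": before,
                        "new_value": after,
                    })
        return events


    infinity_war = [
        {"hero": "hulk", "power": "strength"},
        {"hero": "black_widow", "power": "spy"},
        {"hero": "thor", "hammers": 0},
        {"hero": "thor", "hammers": 1},
    ]
    endgame = [
        {"hero": "hulk", "power": "smart"},
        {"hero": "captain marvel", "power": "strength"},
        {"hero": "thor", "hammers": 2},
    ]

    events = keyed_diff(infinity_war, endgame, "hero")
    for event in events:
        print(event)
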
Why pandas_diff? Use cases
------------------------------

Migrating from batch to an event-driven architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At my workplace, we use many data pipelines to pull information from external
platforms (Active Directory, GitHub, Jira). Each run loads the new data,
replacing the entire table.

With pandas_diff we detect how the infrastructure changes between
executions and stream those change events into a Kafka cluster, so
other teams can subscribe to the events they care about. Also, by defining
a pandas_diff step in the master pipeline, every item in our project has
its life cycle events tracked.

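
As a rough sketch of the streaming step, each diff row can be turned into one
JSON message. The records below are hand-written to mirror the column names
from the usage example; ``to_event_messages`` and ``publish`` are hypothetical
helpers, and ``publish`` merely prints where a real Kafka producer call would
go. Rows taken from an actual diff DataFrame contain ``NaN`` in unused columns,
which would need converting to ``None`` first, because ``json.dumps`` writes
``NaN`` as a non-standard token.

.. code:: python

    import json


    def to_event_messages(diff_records, source):
        """Serialize diff records into JSON strings, one message per change."""
        messages = []
        for record in diff_records:
            event = {
                "source": source,
                "operation": record["operation"],
                "object": record["object_values"],
                # Only "modify" records carry these fields; others get None.
                "attribute": record.get("attribute_changed"),
                "old_value": record.get("old_value"),
                "new_value": record.get("new_value"),
            }
            messages.append(json.dumps(event, sort_keys=True))
        return messages


    def publish(topic, message):
        """Placeholder for a real producer; here it only prints the message."""
        print(topic, message)


    # Hand-written records shaped like the diff output columns (assumption).
    diff_records = [
        {"operation": "create", "object_values": "captain marvel"},
        {
            "operation": "modify",
            "object_values": "thor",
            "attribute_changed": "hammers",
            "old_value": 1,
            "new_value": 2,
        },
    ]

    messages = to_event_messages(diff_records, source="heroes")
    for message in messages:
        publish("hero-changes", message)
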
Events log
~~~~~~~~~~

pandas_diff gives you an event log for every item in a table, which you
can use to audit how resources are being consumed.

Reconciliation
~~~~~~~~~~~~~~

Reconcile a data source against the source of truth. For example, suppose a
CMDB holds information about virtual machines. Because VMs can be created in
several ways, you can use pandas_diff to replicate the state of the
infrastructure into the CMDB.

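
One way to picture the replication step: diff the CMDB contents (old side)
against the discovered infrastructure (new side), then apply each change to the
CMDB. The sketch below uses an in-memory dict in place of a real CMDB and
hand-written records that mirror the diff output columns. ``apply_to_cmdb`` is a
hypothetical helper, and treating ``object_json`` as a dict is an assumption.

.. code:: python

    def apply_to_cmdb(cmdb, diff_records):
        """Return a copy of `cmdb` (dict keyed by VM name) with the changes applied."""
        updated = {name: dict(attrs) for name, attrs in cmdb.items()}
        for record in diff_records:
            name = record["object_values"]
            if record["operation"] == "create":
                # The VM exists in the infrastructure but not yet in the CMDB.
                updated[name] = dict(record["object_json"])
            elif record["operation"] == "delete":
                # The VM is gone from the infrastructure.
                updated.pop(name, None)
            elif record["operation"] == "modify":
                # One attribute differs; take the infrastructure's value.
                updated.setdefault(name, {})[record["attribute_changed"]] = record["new_value"]
        return updated


    cmdb = {
        "vm-01": {"name": "vm-01", "cpus": 2},
        "vm-02": {"name": "vm-02", "cpus": 4},
    }

    # Hand-written records shaped like the diff output columns (assumption).
    diff_records = [
        {"operation": "create", "object_values": "vm-03",
         "object_json": {"name": "vm-03", "cpus": 8}},
        {"operation": "delete", "object_values": "vm-02"},
        {"operation": "modify", "object_values": "vm-01",
         "attribute_changed": "cpus", "old_value": 2, "new_value": 4},
    ]

    reconciled = apply_to_cmdb(cmdb, diff_records)
    print(reconciled)
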
Features
--------

- Filtering of columns

Roadmap
-------

- Support for stand alone app

Documentation
-------------

`Documentation <https://pandas-diff.readthedocs.io/en/latest/>`__

.. |CodeFactor| image:: https://www.codefactor.io/repository/github/jaimevalero/pandas_diff/badge
   :target: https://www.codefactor.io/repository/github/jaimevalero/pandas_diff
.. |Python 3| image:: https://pyup.io/repos/github/jaimevalero/pandas_diff/python-3-shield.svg
   :target: https://pyup.io/repos/github/jaimevalero/pandas_diff/

History
-------

0.7.18 (2021-12-05)
-------------------

\* Add codacy badge

0.7.19 (2021-12-05)
-------------------

\* Feat filter column

0.7.20 (2021-12-05)
-------------------

\* Feat filter column

0.7.21 (2021-12-05)
-------------------

\* Add filter fest

0.7.22 (2021-12-06)
-------------------

\* Add condition: keys exist in both dataframes

1.1.0 (2021-12-06)
------------------

\* Add condition: keys exist in both dataframes

1.2.0 (2021-12-06)
------------------

\* Improve doc

1.3.0 (2021-12-06)
--------------------

\* Remove workflows

1.4.0 (2021-12-06)
--------------------

\* Remove workflows

1.4.0 (2023-09-01)
--------------------

\* Improve doc

1.4.1 (2023-09-01)
--------------------

\* Improve doc

1.4.2 (2023-09-17)
--------------------

\* Bugfix version string

1.4.3 (2023-09-17)
--------------------

\* Bugfix version tag

1.4.4 (2023-09-17)
--------------------

\* Bugfix version tag

1.4.5 (2023-09-17)
--------------------

\* Bugfix history string

1.4.6 (2023-09-17)
--------------------

\* Bugfix history string

1.4.7 (2023-09-17)
--------------------

\* Bugfix release description

Raw data
{
"_id": null,
"home_page": "https://github.com/jaimevalero/pandas_diff",
"name": "pandas-diff",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "pandas_diff",
"author": "Jaime Valero",
"author_email": "jaimevalero78@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/b7/19/115c112b5d1f21900a0409e08db618e1d156c0d5ecb8c55c4f8d6bab7c8a/pandas_diff-1.4.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT license",
"summary": "Python utility to extract differences between two pandas dataframes.",
"version": "1.4.7",
"project_urls": {
"Homepage": "https://github.com/jaimevalero/pandas_diff"
},
"split_keywords": [
"pandas_diff"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b719115c112b5d1f21900a0409e08db618e1d156c0d5ecb8c55c4f8d6bab7c8a",
"md5": "c2e3c979e39731f2c4836e5e41de91dd",
"sha256": "fe5e4567ec3402eb77096a04cd7f2488950722fcdc488ca14bb71364f07fbdb1"
},
"downloads": -1,
"filename": "pandas_diff-1.4.7.tar.gz",
"has_sig": false,
"md5_digest": "c2e3c979e39731f2c4836e5e41de91dd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 12841,
"upload_time": "2023-09-17T08:51:50",
"upload_time_iso_8601": "2023-09-17T08:51:50.668200Z",
"url": "https://files.pythonhosted.org/packages/b7/19/115c112b5d1f21900a0409e08db618e1d156c0d5ecb8c55c4f8d6bab7c8a/pandas_diff-1.4.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-17 08:51:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jaimevalero",
"github_project": "pandas_diff",
"travis_ci": true,
"coveralls": false,
"github_actions": true,
"requirements": [],
"tox": true,
"lcname": "pandas-diff"
}