# Walacor Data Tracking
<div align="center">
<img src="https://www.walacor.com/wp-content/uploads/2022/09/Walacor_Logo_Tag.png" width="300" />
[![License Apache 2.0][badge-license]][license]
[Discord](https://discord.gg/BaEWpsg8Yc)
[LinkedIn](https://www.linkedin.com/company/walacor/)
[Walacor Product](https://www.walacor.com/product/)
</div>
[badge-license]: https://img.shields.io/badge/license-Apache2-green.svg?dummy
[license]: https://github.com/walacor/objectvalidation/blob/main/LICENSE
---
A schema-first framework to **track, version, and store the full lineage of data transformations** — from raw ingestion to final model output — using Walacor as a backend snapshot store.
---
## ✨ Why this exists
- **Reproducibility** – Every transformation, parameter, and artifact is captured in a graph you can replay.
- **Auditability** – Snapshots are immutable, version-controlled, and timestamped.
- **Collaboration** – Team members see the same lineage and can compare or branch workflows.
- **Extensibility** – Strict JSON-schemas keep today’s pipelines clean while allowing tomorrow’s to evolve safely.
---
## 🏗️ Core Concepts
| Concept | Stored as | Purpose |
| ------- | --------- | ------- |
| **Transform Node** | `transform_node` | One operation (e.g., “fit model”, “clean text”). |
| **Transform Edge** | `transform_edge` | Dependency between two nodes. |
| **Project Metadata** | `project_metadata` | Run-level info (owner, description, timestamps). |
> **Immutable Snapshots**
> Once a DAG is written to Walacor, it cannot mutate—only a *new* snapshot (with a higher SV or run ID) can supersede it.
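For orientation, here is a rough sketch of what one snapshot might contain. The field names are illustrative, inferred from the helper-API return values documented later in this README; the authoritative definitions are the `transform_node`, `transform_edge`, and `project_metadata` schemas.

```python
# Illustrative only: the approximate shape of a persisted DAG snapshot.
# Real field names and types come from the Walacor schema definitions.
snapshot = {
    "project_metadata": {
        "project_name": "MyProject",
        "pipeline_name": "daily_sales_pipeline",
        "status": "finished",
    },
    # one transform_node per operation
    "nodes": [
        {"UID": "node-1", "operation": "assign", "shape": (2, 3), "params_json": "{…}"},
        {"UID": "node-2", "operation": "rename", "shape": (2, 3), "params_json": "{…}"},
    ],
    # transform_edge rows: dependencies between nodes
    "edges": [
        {"source": "node-1", "target": "node-2"},
    ],
}
```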
---
## 🚀 Getting Started
### 1. Install the SDK
```bash
pip install walacor-data-tracker
```
> Make sure you're using Python 3.11+ and have internet access to reach the Walacor API.
### 2. Initialize the Tracking Components
To begin capturing your data lineage:
* **Start the Tracker** – This manages the session and records operations.
* **Attach an Adapter** – For example, use `PandasAdapter` to automatically track DataFrame transformations.
* **Add Writers** – Choose where to send the output:
  * `ConsoleWriter` for quick inspection in your terminal
  * `WalacorWriter` to persist snapshots to the Walacor backend
Once set up, your transformation history will be automatically recorded and can be exported or persisted.
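As a minimal wiring sketch, those three steps look like this (the same calls appear in the full example below, which also adds a `WalacorWriter`):

```python
from walacor_data_tracker import Tracker, PandasAdapter
from walacor_data_tracker.writers import ConsoleWriter

tracker = Tracker().start()     # 1. start the tracking session
PandasAdapter().start(tracker)  # 2. auto-capture DataFrame operations
ConsoleWriter()                 # 3. print recorded lineage to stdout

# ... your pandas transformations happen here ...

tracker.stop()                  # end the session when you are done
```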
---
## 🧪 Example Use Cases
* Track changes in a machine learning pipeline
* Visualize column-level transformations in pandas
* Record versions of a dataset as it’s cleaned and merged
* Keep an auditable log of automated workflows
---
## 🧪 Minimal Example
Here's how simple it is to start tracking transformations:
```python
import pandas as pd
from walacor_data_tracker import Tracker, PandasAdapter
from walacor_data_tracker.writers import ConsoleWriter
from walacor_data_tracker.writers.walacor import WalacorWriter
# 1️⃣ Start tracking
tracker = Tracker().start()
PandasAdapter().start(tracker) # auto-captures every DataFrame op
ConsoleWriter()  # (optional) prints lineage to stdout
# 2️⃣ Open a Walacor run in one line
wal_writer = WalacorWriter(
"https://your-walacor-url/api", # server
"your-username", # login
"your-password",
project_name="MyProject",
pipeline_name="daily_sales_pipeline", # ⇢ opens a new run right away
)
# 3️⃣ Do your normal pandas work
df = pd.DataFrame({"id": [1, 2], "value": [100, 200]})
df2 = df.assign(double=df.value * 2)
df3 = df2.rename(columns={"value": "v"})
# 4️⃣ Finish the run and stop tracking
wal_writer.close(status="finished") # marks the run "finished" in Walacor
tracker.stop()
print("Walacor run UID:", wal_writer._run_uid) # UID of the run you just wrote
```
> 💡 The `PandasAdapter` automatically tracks operations like `.assign()`, `.rename()`, `.merge()`, etc., so you can work with pandas as usual — but with versioned lineage behind the scenes.
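Continuing the snippet above, other wrapped methods (see the full list below) are captured in exactly the same way:

```python
# Both of these are recorded automatically as transform_node entries:
prices = pd.DataFrame({"id": [1, 2], "price": [9.5, 3.0]})

df4 = df3.merge(prices, on="id")        # merge
df4["total"] = df4["price"] * df4["v"]  # __setitem__ (df["col"] = ...)
```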
---
### 🛠️ Pandas operations automatically tracked
The current release wraps the pandas `DataFrame` API methods below.
Whenever you call any of them, a **`transform_node`** is emitted, parameters are
captured, and lineage is updated, with zero extra code required:
| Category | Supported `DataFrame` methods |
| --------------------------------- | ------------------------------------------------------------------- |
| **Structural copies / reshaping** | `copy`, `reset_index`, `set_axis`, `pivot_table`, `melt`, `explode` |
| **Column creation / update** | `assign`, `insert`, `__setitem__` (`df["col"] = …`) |
| **Cleaning & NA handling** | `fillna`, `dropna`, `replace` |
| **Column rename / re-order** | `rename`, `reindex`, `sort_values` |
| **Joins & merges** | `merge`, `join` |
| **Type & dtype changes** | `astype` |
> ℹ️ These map directly to the `_DF_METHODS` constant in `PandasAdapter`:
>
> ```python
> _DF_METHODS = [
> "copy", "pivot_table", "reset_index", "__setitem__",
> "fillna", "dropna", "replace", "rename", "assign",
> "merge", "join", "set_axis", "insert", "astype",
> "sort_values", "reindex", "explode", "melt",
> ]
> ```
#### Missing your favourite method?
Pull requests are welcome!
Add the method name to `_DF_METHODS`, ensure the wrapper captures a meaningful
snapshot, and open a PR. We’ll review and merge updates that keep to the
schema-first philosophy.
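If it helps to gauge the work involved, here is a deliberately simplified, hypothetical sketch of what such a wrapper has to capture (operation name, parameters, resulting shape). It is not the library's actual implementation; the real wrappers live inside `PandasAdapter`.

```python
import functools

import pandas as pd


def _wrap(method_name, emit):
    """Hypothetical sketch: patch one DataFrame method and report each call via `emit`."""
    original = getattr(pd.DataFrame, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        # In-place methods (e.g. __setitem__) return None, so fall back to `self`.
        out = result if isinstance(result, pd.DataFrame) else self
        emit({
            "operation": method_name,
            "params": {"args": args, "kwargs": kwargs},
            "shape": out.shape,
        })
        return result

    setattr(pd.DataFrame, method_name, wrapper)


# e.g. _wrap("assign", emit=print) would print a record for every df.assign(...) call
```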
---
## 🔍 Helper API — query your lineage
| Helper | Purpose | Key parameters | Returns |
| ----------------------------------------------------------------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| `get_projects()` | List every Walacor-tracked project. | *(none)* | `[{uid, project_name, description, user_tag}]` |
| `get_pipelines()` | List the **names of all pipelines** ever executed (across projects). | *(none)* | `["daily_etl", "train_model", ...]` |
| `get_pipelines_for_project(project_name, *, user_tag=None)` | Pipelines that belong to one project. | `project_name` – required<br>`user_tag` – optional filter (e.g. per machine or branch) | `["sales_pipeline", …]` |
| `get_runs(project_name, *, pipeline_name=None, user_tag=None)` | History of executions (“runs”). | `project_name` – required<br>`pipeline_name` – limit to one pipeline<br>`user_tag` – optional | `[{"UID","status","pipeline_name",…}, …]` |
| `get_nodes(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)` | Raw **transform\_node rows** (operations). | Same filters as above – *pick **one** of* `pipeline_name` **or** `run_uid`.<br>Omitting both returns every node in the project. | List of node dicts with `operation`, `shape`, `params_json`, … |
| `get_dag(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)` | Convenient “everything I need for a graph”. | Same filter rules. | `{"nodes": [...], "edges": [...]}` where edges come from `transform_edge`. |
| `get_projects_with_pipelines()` | High-level catalogue: each project, its pipelines and run-counts. | *(none)* | `[ { "project_name": "Proj", "pipelines":[{"name":"etl","runs":7}] }, … ]` |
### Parameter rules at a glance
| Filter combo | What you get |
| ------------------------------ | ----------------------------------------- |
| `project_name` **only** | all nodes / all edges in the project |
| `project_name + pipeline_name` | all runs & nodes for that pipeline |
| `project_name + run_uid` | nodes/edges of one specific run |
| `user_tag` | optional extra filter on any of the above |
### Example usage
```python
# 1️⃣ list all runs of “train_model” in “ML_Proj”
runs = wal_writer.get_runs("ML_Proj", pipeline_name="train_model")
first_run = runs[0]["UID"]
# 2️⃣ pull the DAG for that first run
dag = wal_writer.get_dag("ML_Proj", run_uid=first_run)
# 3️⃣ quick print
for n in dag["nodes"]:
print(n["operation"], n["shape"])
```
> These helpers leverage the official **[Walacor Python SDK](https://github.com/walacor/python-sdk)**, so every call hits Walacor’s fast *summary* view and transparently re-uses the writer’s authenticated session—no extra login or handshake required.
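Building on the table above, the catalogue helper gives a quick overview of everything tracked so far (the return shape is the one documented for `get_projects_with_pipelines()`):

```python
# Print every project, its pipelines, and how many runs each pipeline has
for project in wal_writer.get_projects_with_pipelines():
    print(project["project_name"])
    for pipeline in project["pipelines"]:
        print(f'  {pipeline["name"]}: {pipeline["runs"]} run(s)')
```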
---
## 🤝 Contributing
1. Fork → feature branch → PR.
2. Run `pre-commit run --all-files`.
3. Add/Update unit tests and **schema definitions**.
4. Keep the README & docs in sync.
---
## 📄 License
Apache 2.0 © 2025 Walacor & Contributors.