walacor-data-tracker

Name: walacor-data-tracker
Version: 0.0.5
Summary: SDK and CLI for capturing data-science lineage and persisting DAG snapshots to Walacor.
Upload time: 2025-07-16 21:12:28
Requires Python: >=3.11
License: Apache-2.0
Keywords: data-transformations, data-lineage, provenance, walacor, dag, snapshot
Source: https://github.com/walacor/walacor-data-tracker
Documentation: https://apidoc.walacor.com
# Walacor Data Tracking

<div align="center">

<img src="https://www.walacor.com/wp-content/uploads/2022/09/Walacor_Logo_Tag.png" width="300" />

[![License Apache 2.0][badge-license]][license]
[![Walacor (1100127456347832400)](https://img.shields.io/badge/My-Discord-%235865F2.svg?label=Walacor)](https://discord.gg/BaEWpsg8Yc)
[![Walacor (1100127456347832400)](https://img.shields.io/static/v1?label=Walacor&message=LinkedIn&color=blue)](https://www.linkedin.com/company/walacor/)
[![Walacor (1100127456347832400)](https://img.shields.io/static/v1?label=Walacor&message=Website&color)](https://www.walacor.com/product/)

</div>

[badge-license]: https://img.shields.io/badge/license-Apache2-green.svg?dummy
[license]: https://github.com/walacor/objectvalidation/blob/main/LICENSE

---



A schema-first framework to **track, version, and store the full lineage of data transformations** — from raw ingestion to final model output — using Walacor as a backend snapshot store.

---

## ✨ Why this exists
- **Reproducibility** – Every transformation, parameter, and artifact is captured in a graph you can replay.
- **Auditability** – Snapshots are immutable, version-controlled, and timestamped.
- **Collaboration** – Team members see the same lineage and can compare or branch workflows.
- **Extensibility** – Strict JSON-schemas keep today’s pipelines clean while allowing tomorrow’s to evolve safely.

---

## 🏗️ Core Concepts

| Concept | Stored as | Purpose |
| ------- | --------- | ------- |
| **Transform Node** | `transform_node` | One operation (e.g., “fit model”, “clean text”). |
| **Transform Edge** | `transform_edge` | Dependency between two nodes. |
| **Project Metadata** | `project_metadata` | Run-level info (owner, description, timestamps). |

> **Immutable Snapshots**
> Once a DAG is written to Walacor, it cannot mutate—only a *new* snapshot (with a higher SV or run ID) can supersede it.
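The three record types above can be pictured as plain dictionaries. A minimal sketch of how nodes and edges compose into a DAG (field names here are illustrative assumptions, not the actual Walacor schema):

```python
# Illustrative records only -- field names are assumptions, not the real schema.
nodes = [
    {"id": "n1", "operation": "read_csv",  "shape": (1000, 5)},
    {"id": "n2", "operation": "dropna",    "shape": (950, 5)},
    {"id": "n3", "operation": "fit_model", "shape": None},
]
edges = [
    {"from": "n1", "to": "n2"},  # dropna consumed read_csv's output
    {"from": "n2", "to": "n3"},  # fit_model consumed dropna's output
]

# Build a parent -> children adjacency view of the DAG.
children = {}
for e in edges:
    children.setdefault(e["from"], []).append(e["to"])

print(children)  # {'n1': ['n2'], 'n2': ['n3']}
```

Because snapshots are immutable, re-running a pipeline produces a fresh set of node and edge rows under a new run ID rather than mutating these records.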

---


## 🚀 Getting Started

### 1. Install the SDK

```bash
pip install walacor-data-tracker
```

> Make sure you're using Python 3.11+ (the package requires `>=3.11`) and have internet access to reach the Walacor API.

### 2. Initialize the Tracking Components

To begin capturing your data lineage:

* **Start the Tracker** – This manages the session and records operations.
* **Attach an Adapter** – For example, use `PandasAdapter` to automatically track DataFrame transformations.
* **Add Writers** – Choose where to send the output:

  * Console output for quick inspection
  * WalacorWriter to persist snapshots to the Walacor backend

Once set up, your transformation history will be automatically recorded and can be exported or persisted.
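Conceptually, the tracker records one event per operation and fans it out to every attached writer. A toy sketch of that pattern (these classes are stand-ins for illustration, not the library's actual implementation):

```python
class ConsoleWriterSketch:
    """Toy stand-in for a console writer: prints each lineage event."""
    def write(self, event):
        print(f"{event['operation']}: {event['params']}")

class TrackerSketch:
    """Toy stand-in for the Tracker: records events and fans them out."""
    def __init__(self):
        self.writers = []
        self.events = []

    def add_writer(self, writer):
        self.writers.append(writer)

    def record(self, operation, **params):
        event = {"operation": operation, "params": params}
        self.events.append(event)          # keep the in-memory history
        for writer in self.writers:        # fan out to every writer
            writer.write(event)

tracker = TrackerSketch()
tracker.add_writer(ConsoleWriterSketch())
tracker.record("rename", columns={"value": "v"})
```

The real `Tracker`, adapters, and writers follow this same division of labor: adapters produce events, the tracker holds the session, and writers decide where events go.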

---


## 🧪 Example Use Cases

* Track changes in a machine learning pipeline
* Visualize column-level transformations in pandas
* Record versions of a dataset as it’s cleaned and merged
* Keep an auditable log of automated workflows

---


---

## 🧪 Minimal Example

Here's how simple it is to start tracking transformations:

```python
import pandas as pd
from walacor_data_tracker import Tracker, PandasAdapter
from walacor_data_tracker.writers import ConsoleWriter
from walacor_data_tracker.writers.walacor import WalacorWriter

# 1️⃣  Start tracking
tracker = Tracker().start()
PandasAdapter().start(tracker)        # auto-captures every DataFrame op
ConsoleWriter()                       # (optional) print lineage to stdout

# 2️⃣  Open a Walacor run in one line
wal_writer = WalacorWriter(
    "https://your-walacor-url/api",    # server
    "your-username",                   # login
    "your-password",
    project_name="MyProject",
    pipeline_name="daily_sales_pipeline",   # ⇢ opens a new run right away
)

# 3️⃣  Do your normal pandas work
df = pd.DataFrame({"id": [1, 2], "value": [100, 200]})
df2 = df.assign(double=df.value * 2)
df3 = df2.rename(columns={"value": "v"})

# 4️⃣  Finish the run and stop tracking
wal_writer.close(status="finished")   # marks the run "finished" in Walacor
tracker.stop()

print("Walacor run UID:", wal_writer._run_uid)   # UID of the run you just wrote
```

> 💡 The `PandasAdapter` automatically tracks operations like `.assign()`, `.rename()`, `.merge()`, etc., so you can work with pandas as usual — but with versioned lineage behind the scenes.


---



### 🛠️  Pandas operations automatically tracked

The current release wraps the pandas `DataFrame` API methods below.
Whenever you call any of them, a **`transform_node`** is emitted, parameters are
captured, and lineage is updated, with zero extra code required:

| Category                          | Supported `DataFrame` methods                                       |
| --------------------------------- | ------------------------------------------------------------------- |
| **Structural copies / reshaping** | `copy`, `reset_index`, `set_axis`, `pivot_table`, `melt`, `explode` |
| **Column creation / update**      | `assign`, `insert`, `__setitem__` (`df["col"] = …`)                 |
| **Cleaning & NA handling**        | `fillna`, `dropna`, `replace`                                       |
| **Column rename / re-order**      | `rename`, `reindex`, `sort_values`                                  |
| **Joins & merges**                | `merge`, `join`                                                     |
| **Type & dtype changes**          | `astype`                                                            |

> ℹ️ These map directly to the constant in `PandasAdapter`:
>
> ```python
> _DF_METHODS = [
>     "copy", "pivot_table", "reset_index", "__setitem__",
>     "fillna", "dropna", "replace", "rename", "assign",
>     "merge", "join", "set_axis", "insert", "astype",
>     "sort_values", "reindex", "explode", "melt",
> ]
> ```

#### Missing your favourite method?

Pull requests are welcome!
Add the method name to `_DF_METHODS`, ensure the wrapper captures a meaningful
snapshot, and open a PR. We’ll review and merge updates that keep to the
schema-first philosophy.

---
## 🔍 Helper API — query your lineage

| Helper                                                                        | Purpose                                                              | Key parameters                                                                                                                    | Returns                                                                           |
| ----------------------------------------------------------------------------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------   | --------------------------------------------------------------------------------- |
| `get_projects()`                                                              | List every Walacor-tracked project.                                  | *(none)*                                                                                                                          | `[{uid, project_name, description, user_tag}]`                                    |
| `get_pipelines()`                                                             | List the **names of all pipelines** ever executed (across projects). | *(none)*                                                                                                                          | `["daily_etl", "train_model", ...]`                                               |
| `get_pipelines_for_project(project_name, *, user_tag=None)`                   | Pipelines that belong to one project.                                | `project_name` – required<br>`user_tag` – filter if you store multiple laptops/branches                                           | `["sales_pipeline", …]`                                                           |
| `get_runs(project_name, *, pipeline_name=None, user_tag=None)`                | History of executions (“runs”).                                      | `project_name` – required<br>`pipeline_name` – limit to one pipeline<br>`user_tag` – optional                                     | `[{"UID","status","pipeline_name",…}, …]`                                         |
| `get_nodes(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)` | Raw **transform\_node rows** (operations).                           | Same filters as above – *pick **one** of* `pipeline_name` **or** `run_uid`.<br>Omitting both returns every node in the project.   | List of node dicts with `operation`, `shape`, `params_json`, …                    |
| `get_dag(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)`   | Convenient “everything I need for a graph”.                          | Same filter rules.                                                                                                                | `{"nodes": [...], "edges": [...]}` where edges come from `transform_edge`.        |
| `get_projects_with_pipelines()`                                               | High-level catalogue: each project, its pipelines and run-counts.    | *(none)*                                                                                                                          | `[ { "project_name": "Proj", "pipelines":[{"name":"etl","runs":7}] }, … ]` |

### Parameter rules at a glance

| Filter combo                   | What you get                              |
| ------------------------------ | ----------------------------------------- |
| `project_name` **only**        | all nodes / all edges in the project      |
| `project_name + pipeline_name` | all runs & nodes for that pipeline        |
| `project_name + run_uid`       | nodes/edges of one specific run           |
| `user_tag`                     | optional extra filter on any of the above |
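The filter combinations above narrow results progressively. A client-side sketch of the same semantics over already-fetched run records (the real helpers apply these filters server-side; the sample data is made up):

```python
def filter_runs(runs, project_name, pipeline_name=None, user_tag=None):
    """Narrow a list of run dicts the way the helper filters compose."""
    out = [r for r in runs if r["project_name"] == project_name]
    if pipeline_name is not None:
        out = [r for r in out if r["pipeline_name"] == pipeline_name]
    if user_tag is not None:
        out = [r for r in out if r.get("user_tag") == user_tag]
    return out

runs = [
    {"UID": "r1", "project_name": "ML_Proj", "pipeline_name": "train_model"},
    {"UID": "r2", "project_name": "ML_Proj", "pipeline_name": "daily_etl"},
    {"UID": "r3", "project_name": "Other",   "pipeline_name": "train_model"},
]

matches = filter_runs(runs, "ML_Proj", pipeline_name="train_model")
print([r["UID"] for r in matches])  # ['r1']
```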

### Example usage

```python
# 1️⃣ list all runs of “train_model” in “ML_Proj”
runs = wal_writer.get_runs("ML_Proj", pipeline_name="train_model")
first_run = runs[0]["UID"]

# 2️⃣ pull the DAG for that first run
dag = wal_writer.get_dag("ML_Proj", run_uid=first_run)

# 3️⃣ quick print
for n in dag["nodes"]:
    print(n["operation"], n["shape"])
```

> These helpers leverage the official **[Walacor Python SDK](https://github.com/walacor/python-sdk)**, so every call hits Walacor’s fast *summary* view and transparently re-uses the writer’s authenticated session—no extra login or handshake required.
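Because a `get_dag` payload is plain dictionaries, quick summaries are one-liners. For example, counting how often each operation appears in a run (the payload below is a made-up sample in the `{"nodes": [...], "edges": [...]}` shape described above):

```python
from collections import Counter

# Made-up sample payload in the shape get_dag returns.
dag = {
    "nodes": [
        {"operation": "assign", "shape": (2, 3)},
        {"operation": "rename", "shape": (2, 3)},
        {"operation": "assign", "shape": (2, 4)},
    ],
    "edges": [],
}

# Tally operations across all nodes in the run.
op_counts = Counter(n["operation"] for n in dag["nodes"])
print(op_counts.most_common())  # [('assign', 2), ('rename', 1)]
```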

---

## 🤝 Contributing

1. Fork → feature branch → PR.
2. Run `pre-commit run --all-files`.
3. Add/Update unit tests and **schema definitions**.
4. Keep the README & docs in sync.

---

## 📄 License

Apache 2.0 © 2025 Walacor & Contributors.

            
