Name: fosho
Version: 0.1.0
Summary: Data Signing & Quality - Offline data integrity with CRC32 file hashing and MD5 schema hashing
Upload time: 2025-07-24 20:29:29
Requires Python: >=3.12
License: MIT
Keywords: crc32, csv, data-integrity, data-validation, pandera, parquet, schema
Homepage: https://github.com/joebhakim/fosho
# fosho - Data Validation for Half-Asleep Scientists

**F**ile-&-schema **O**ffline **S**igning & **H**ash **O**bservatory

Stop wondering if your downstream scripts are using the data you think they are. fosho gives you confidence that your data hasn't changed under your feet.

## Zombie-Proof Steps (Copy-Paste Ready)

**Scenario:** You have `downstream_script.py` that keeps breaking because your data changes. You want it to fail fast with clear errors instead of producing wrong results.

### Step 1: You already have data + a script that breaks
```bash
# Your current situation:
# - data/my_data.csv (keeps changing)  
# - downstream_script.py (breaks silently when data changes)
```

### Step 2: Generate schema and manifest (once)
```bash
# Scan your data directory - generates schemas automatically
uv run fosho scan data/

# This creates:
# - schemas/my_data_schema.py (auto-generated schema)
# - manifest.json (tracks file hashes and signing status)
```
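The manifest format itself isn't documented here; conceptually it records a checksum and signing state per file, along the lines of this hypothetical sketch (key names and hash values invented for illustration):

```json
{
  "files": {
    "data/my_data.csv": {
      "crc32": "3f2a8c1d",
      "schema": "schemas/my_data_schema.py",
      "signed": true
    }
  }
}
```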

### Step 3: Look at schema, edit if needed
```bash
cat schemas/my_data_schema.py
```
You'll see something like:
```python
schema = pa.DataFrameSchema({
    "id": pa.Column(int),
    "name": pa.Column(str),
    "score": pa.Column(float, nullable=True),
})
```
Edit this file if you want stricter validation (ranges, required values, etc.).

### Step 4: Sign your data (approve current state)
```bash
# Approve the current data state
uv run fosho sign

# Check status
uv run fosho status
```

### Step 5: Replace your pandas.read_csv() calls
**Before (dangerous):**
```python
import pandas as pd
df = pd.read_csv('data/my_data.csv')  # Silent failures
```

**After (safe):**
```python
import fosho

# Load with validation
df = fosho.read_csv(
    file='data/my_data.csv',
    schema='schemas/my_data_schema.py',
    manifest_path='manifest.json'
)

# Must validate before use - crashes if data changed since signing
validated_df = df.validate()  # 🚨 CRASHES if data changed
```

### Step 6: Run your script
- ✅ **If data matches schema:** Script runs normally
- 🚨 **If data changed:** Script crashes with clear error message
- 🎯 **No more silent failures:** You immediately know when data structure changes

## What This Solves

❌ **Before:** "Wait, did my preprocessing script change this CSV? Is my downstream analysis using old data?"

✅ **After:** Your script crashes with a clear error if the data changed. No more silent failures.

## The Magic

1. **File hashing** - Detects when CSVs change (even 1 byte)
2. **Schema validation** - Ensures data structure matches expectations  
3. **Signing workflow** - Explicit approval step prevents accidents
4. **Fail-fast** - Scripts error immediately if using stale/changed data
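None of this requires exotic machinery. A minimal sketch of the two hash checks (CRC32 over raw file bytes, MD5 over schema source text), written against the Python stdlib and assuming nothing about fosho's internals:

```python
import hashlib
import zlib

def crc32_of_file(path: str) -> str:
    """CRC32 over the raw file bytes -- changes if even one byte changes."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"

def md5_of_schema(schema_source: str) -> str:
    """MD5 over the schema file's text -- detects edits to validation rules."""
    return hashlib.md5(schema_source.encode("utf-8")).hexdigest()
```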

## Concrete Example

**Your messy situation:**
```bash
# You have this data that keeps changing
cat data/sales.csv
# id,product,revenue
# 1,widget,100.50
# 2,gadget,75.25

# And this script that breaks when data structure changes
cat analyze_sales.py
# import pandas as pd
# df = pd.read_csv('data/sales.csv')
# print(df['revenue'].mean())  # Breaks if 'revenue' column disappears
```

**The fosho solution:**
```bash
# 1. Scan and generate schema
uv run fosho scan data/

# 2. Check what it generated
cat schemas/sales_schema.py
# schema = pa.DataFrameSchema({
#     "id": pa.Column(int),
#     "product": pa.Column(str), 
#     "revenue": pa.Column(float),
# })

# 3. Sign the data state
uv run fosho sign

# 4. Update your script (safer approach)
cat analyze_sales_safe.py
# import fosho
# 
# df = fosho.read_csv(
#     file='data/sales.csv',
#     schema='schemas/sales_schema.py',
#     manifest_path='manifest.json'
# )
# validated_df = df.validate()  # <- This line protects you
# print(validated_df['revenue'].mean())  # Now safe!
```

**Result:** Your script now crashes immediately with a clear error if someone changes the data structure, instead of producing wrong results.

## When Things Change

If your data changes:
```bash
# Re-scan to update checksums
uv run fosho scan data/

# Review what changed
uv run fosho status

# Re-approve if changes are intentional
uv run fosho sign
```

Your Python scripts will refuse to run until you explicitly re-approve the changes.
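That refusal boils down to a checksum comparison at load time. A sketch of the idea, using a hypothetical manifest layout (the real key names and exception type are fosho internals, not shown here):

```python
import json
import zlib

def verify_signed(path: str, manifest_path: str = "manifest.json") -> None:
    """Raise unless `path` still matches its signed checksum.

    Illustrative only: the manifest layout below is an assumption,
    not fosho's actual on-disk format.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    entry = manifest["files"][path]  # hypothetical layout
    with open(path, "rb") as f:
        crc = f"{zlib.crc32(f.read()) & 0xFFFFFFFF:08x}"
    if not entry.get("signed") or crc != entry["crc32"]:
        raise RuntimeError(
            f"{path} changed since signing; re-run `fosho scan` and `fosho sign`"
        )
```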

## Commands

- `fosho scan data/` - Find CSVs, generate schemas, update manifest
- `fosho sign` - Approve all current data/schemas  
- `fosho status` - Show what's signed vs unsigned
- `fosho verify` - Check if everything still matches

## Installation

```bash
cd your-project
uv add fosho  # or pip install fosho
```

## Philosophy

Data scientists need **simple validation first**, not complex rules. fosho generates minimal schemas (just column types + nullability) so you can:

1. Get protection immediately
2. Add more validation rules later as you learn about your data
3. Never wonder "is my script using the right data?"

Perfect for preprocessing pipelines that keep changing and downstream analyses that need to stay in sync.
            
