easy-dataset-share


| Field | Value |
| --- | --- |
| Name | easy-dataset-share |
| Version | 0.3.1 |
| Summary | CLI tool to responsibly share datasets by gzipping, canarying, and tracking provenance. |
| Upload time | 2025-07-15 00:13:36 |
| Author | Edward Turner |
| Requires Python | `<4.0,>=3.10` |
| License | Other/Proprietary |
| Keywords | dataset, sharing, encryption, canary, robots, cli |

# Easy Dataset Share

A CLI tool that helps AI researchers share datasets responsibly. It aims to reduce evaluation contamination by keeping datasets easy for researchers to use but hard for automated scrapers to ingest.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Other/Proprietary](https://img.shields.io/badge/License-Other%2FProprietary-red.svg)](LICENSE)
[![PyPI](https://img.shields.io/badge/PyPI-easy--dataset--share-blue.svg)](https://pypi.org/project/easy-dataset-share/)
[![GitHub](https://img.shields.io/badge/GitHub-Responsible%20Dataset%20Sharing-green.svg)](https://github.com/Responsible-Dataset-Sharing/easy-dataset-share)

## Features
- **Canary markers**: Unique identifiers to detect if your dataset was used for training
- **Hash verification**: Checks dataset integrity with SHA-256 hashing
- **Protection layers**: ZIP compression, optional encryption, robots.txt
- **Clean removal**: Remove all protection while preserving original data
- **Web hosting** (optional): Deploy a protected download site with CAPTCHA - see [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md)
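
To make the canary idea concrete, here is a minimal Python sketch. It is not the tool's implementation: the marker format and the helper names `make_canary` and `contains_canary` are hypothetical, chosen only for illustration. The point is that a unique string planted in a dataset can later be searched for in model output or in a scraped corpus.

```python
import uuid


def make_canary(dataset_name: str) -> str:
    """Build a unique marker string (hypothetical format, for illustration only)."""
    return f"CANARY {dataset_name} GUID {uuid.uuid4()}"


def contains_canary(text: str, canary: str) -> bool:
    """Return True if the exact canary string appears in the text."""
    return canary in text


canary = make_canary("example-dataset")
document = f"Question: 2 + 2 = ?\n{canary}\nAnswer: 4"

# Text that reproduces the canary verbatim has very likely been derived from the dataset.
print(contains_canary(document, canary))          # True
print(contains_canary("unrelated text", canary))  # False
```

A random GUID keeps accidental matches vanishingly unlikely, which is why a hit is strong evidence that the marked files were ingested.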

## Installation

```bash
pip install easy-dataset-share
```

## Quick Start

### Protect a dataset
```bash
easy-dataset-share magic-protect-dir /path/to/dataset -p your-password
```

### Unprotect and clean
```bash
easy-dataset-share magic-unprotect-dir dataset.zip -p your-password --remove-canaries
```

### Verify integrity
```bash
easy-dataset-share hash /path/to/dataset
```

## Options
- `-p, --password` - Password for encryption (optional)
- `-o, --output` - Output file path (default: `<dir>.zip` or `<dir>.zip.enc`)
- `-c, --num-canary-files` - Number of canary files to create (default: 1)
- `-e, --embed-canaries` - Embed canaries in existing files (default: create separate files)
- `-a, --allow-crawling` - Allow web crawling in robots.txt (default: disallow all)
- `-u, --user-agent` - User-agent to target in robots.txt (default: *)
- `-on, --organization-name` - Organization name for TOS (default: "Example Corp")
- `-dn, --dataset-name` - Dataset name for TOS (default: "Example Dataset")
- `-ce, --contact-email` - Contact email for TOS (default: "support@example.com")
- `--no-tos` - Skip adding terms of service file
- `--no-gitignore` - Skip adding directory to .gitignore (default: auto-add)
- `-v, --verbose` - Enable verbose output
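
As a rough illustration of what the `-a` and `-u` options control, the Python sketch below builds a robots.txt for one user-agent group and checks it with the standard library's parser. The functions `build_robots_txt` and `may_fetch` and the agent name `ExampleBot` are assumptions made for this example; the file the tool actually writes may contain more than this.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(user_agent: str = "*", allow_crawling: bool = False) -> str:
    """Return robots.txt text for a single user-agent group (illustrative)."""
    # "Disallow: /" blocks every path; an empty "Disallow:" blocks nothing.
    rule = "Disallow:" if allow_crawling else "Disallow: /"
    return f"User-agent: {user_agent}\n{rule}\n"


def may_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Check a path against the rules with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


default_rules = build_robots_txt()
open_rules = build_robots_txt(allow_crawling=True)

print(default_rules)
print(may_fetch(default_rules, "ExampleBot", "/data.csv"))  # False
print(may_fetch(open_rules, "ExampleBot", "/data.csv"))     # True
```

Keep in mind that robots.txt is advisory: well-behaved crawlers honor it, but nothing forces a scraper to do so.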

## How it Works
1. **Hash** original dataset for integrity baseline
2. **Add** canary markers throughout the dataset
3. **Package** with robots.txt and optional encryption
4. **Verify** integrity when unprotecting (canaries removed, data unchanged)
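
The four steps above can be sketched in Python as a round trip. This is a simplified model under stated assumptions, not the tool's code: the canary file name, the hashing scheme (relative paths plus contents, in sorted order), and the functions `hash_dir`, `protect`, and `unprotect` are all hypothetical, and encryption and robots.txt are left out.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path

CANARY_NAME = "canary.txt"  # hypothetical file name, not the tool's


def hash_dir(root: Path) -> str:
    """SHA-256 over relative paths and file contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode() + b"\0")
        digest.update(path.read_bytes() + b"\0")
    return digest.hexdigest()


def protect(root: Path, archive: Path) -> str:
    """Hash, add a canary file, then zip. Returns the baseline hash."""
    baseline = hash_dir(root)                                        # step 1
    (root / CANARY_NAME).write_text("canary-marker\n")               # step 2
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:  # step 3
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            zf.write(path, path.relative_to(root).as_posix())
    return baseline


def unprotect(archive: Path, dest: Path) -> str:
    """Unzip, remove the canary file, and hash what remains."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    (dest / CANARY_NAME).unlink()                                    # step 4
    return hash_dir(dest)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "my_dataset"
    src.mkdir()
    (src / "data.csv").write_text("id,label\n1,cat\n")

    baseline = protect(src, Path(tmp) / "my_dataset.zip")
    restored = unprotect(Path(tmp) / "my_dataset.zip", Path(tmp) / "restored")
    print(baseline == restored)  # True
```

Because the baseline is computed before any canary is added, the hash after removal matches only if the original files come back byte for byte.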

## Example Workflow
```bash
# Protect
easy-dataset-share magic-protect-dir my_dataset -p secret123

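# Share the resulting archive publicly. By default it is named <dir>.zip,
# or <dir>.zip.enc when encrypted (see Options); dataset.zip below stands in for it.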

# Recipients unprotect and remove canaries
easy-dataset-share magic-unprotect-dir dataset.zip -p secret123 --remove-canaries
# Output shows: "📊 Dataset hash: abc123..." (matches original)
```

Use `-v` for verbose output to see hashing details and canary operations.

## Hosting with Anti-Scraper Protection
For datasets hosted outside of Hugging Face, we **strongly recommend** using [Cloudflare Turnstile](https://developers.cloudflare.com/turnstile/get-started/) to add an additional layer of protection against automated AI scrapers.

**Why Cloudflare Turnstile?**
- **Human verification**: Checks that a visitor is human before downloads are served, often without a puzzle to solve
- **Bot detection**: Advanced algorithms identify and block automated requests
- **Privacy-focused**: No tracking cookies or invasive data collection
- **Easy integration**: Simple JavaScript widget with server-side verification
- **Free tier available**: Generous limits for research datasets


This layered approach (dataset protection + hosting protection) makes automated data harvesting considerably harder while keeping the dataset accessible to legitimate researchers.
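
For the server-side verification step, a minimal Python sketch might look like the following. It assumes Turnstile's documented `siteverify` endpoint, which takes `secret` and `response` form fields and returns JSON with a `success` flag. The helper names `token_is_valid` and `_post_form` are hypothetical, and the HTTP call is injectable so the example runs offline with a stand-in.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def _post_form(url: str, fields: dict[str, str]) -> dict:
    """POST form-encoded fields and decode the JSON reply."""
    data = urllib.parse.urlencode(fields).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return json.load(resp)


def token_is_valid(
    token: str,
    secret: str,
    post: Callable[[str, dict[str, str]], dict] = _post_form,
) -> bool:
    """Ask Turnstile whether the widget token sent by the browser is genuine."""
    if not token:
        return False  # no token submitted, so no need to call the API
    result = post(SITEVERIFY_URL, {"secret": secret, "response": token})
    return bool(result.get("success"))


# Offline demonstration: a stand-in for the HTTP call (not a real API reply).
def fake_post(url: str, fields: dict[str, str]) -> dict:
    return {"success": fields["response"] == "good-token"}


print(token_is_valid("good-token", "my-secret", post=fake_post))  # True
print(token_is_valid("bad-token", "my-secret", post=fake_post))   # False
```

In a real deployment the download handler would call `token_is_valid` with the token from the submitted form and serve the file only when it returns `True`.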

# Maintenance + Development
This is meant to be a collaborative, community-driven project. Pull requests that improve this repo are welcome!

For development:
```bash
git clone https://github.com/Responsible-Dataset-Sharing/easy-dataset-share.git
cd easy-dataset-share
pip install -e .
git config core.hooksPath .githooks
```

## Current Maintainers
* Dipika Khullar <dkhullar98@gmail.com>
* Edward Turner <edward.turner01@outlook.com>
* Roy Rinberg <royrinberg@gmail.com>


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "easy-dataset-share",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "dataset, sharing, encryption, canary, robots, cli",
    "author": "Edward Turner",
    "author_email": "edward.turner01@outlook.com",
    "download_url": "https://files.pythonhosted.org/packages/1d/6d/6179996ea02fd25593a59ecea520fa67ea8744ed8eda0e3f8ddb82a0545b/easy_dataset_share-0.3.1.tar.gz",
    "platform": null,
    "description": "# Easy Dataset Share\n\nA CLI tool that helps AI researchers share datasets responsibly. Prevents evaluation contamination by making datasets easy for researchers to use but hard for automated scrapers to ingest.\n\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: Other/Proprietary](https://img.shields.io/badge/License-Other%2FProprietary-red.svg)](LICENSE)\n[![PyPI](https://img.shields.io/badge/PyPI-easy--dataset--share-blue.svg)](https://pypi.org/project/easy-dataset-share/)\n[![GitHub](https://img.shields.io/badge/GitHub-Responsible%20Dataset%20Sharing-green.svg)](https://github.com/Responsible-Dataset-Sharing/easy-dataset-share)\n\n## Features\n- **Canary markers**: Unique identifiers to detect if your dataset was used for training\n- **Hash verification**: Ensures dataset integrity through SHA256 hashing\n- **Protection layers**: ZIP compression, optional encryption, robots.txt\n- **Clean removal**: Remove all protection while preserving original data\n- **Web hosting** (optional): Deploy a protected download site with CAPTCHA - see [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md)\n\n## Installation\n\n```bash\npip install easy-dataset-share\n```\n\n## Quick Start\n\n### Protect a dataset\n```bash\neasy-dataset-share magic-protect-dir /path/to/dataset -p your-password\n```\n\n### Unprotect and clean\n```bash\neasy-dataset-share magic-unprotect-dir dataset.zip -p your-password --remove-canaries\n```\n\n### Verify integrity\n```bash\neasy-dataset-share hash /path/to/dataset\n```\n\n## Options\n- `-p, --password` - Password for encryption (optional)\n- `-o, --output` - Output file path (default: `<dir>.zip` or `<dir>.zip.enc`)\n- `-c, --num-canary-files` - Number of canary files to create (default: 1)\n- `-e, --embed-canaries` - Embed canaries in existing files (default: create separate files)\n- `-a, --allow-crawling` - Allow web crawling in robots.txt (default: disallow all)\n- 
`-u, --user-agent` - User-agent to target in robots.txt (default: *)\n- `-on, --organization-name` - Organization name for TOS (default: \"Example Corp\")\n- `-dn, --dataset-name` - Dataset name for TOS (default: \"Example Dataset\")\n- `-ce, --contact-email` - Contact email for TOS (default: \"support@example.com\")\n- `--no-tos` - Skip adding terms of service file\n- `--no-gitignore` - Skip adding directory to .gitignore (default: auto-add)\n- `-v, --verbose` - Enable verbose output\n\n## How it Works\n1. **Hash** original dataset for integrity baseline\n2. **Add** canary markers throughout the dataset\n3. **Package** with robots.txt and optional encryption\n4. **Verify** integrity when unprotecting (canaries removed, data unchanged)\n\n## Example Workflow\n```bash\n# Protect\neasy-dataset-share magic-protect-dir my_dataset -p secret123\n\n# Share dataset.zip publicly\n\n# Recipients unprotect and remove canaries\neasy-dataset-share magic-unprotect-dir dataset.zip -p secret123 --remove-canaries\n# Output shows: \"\ud83d\udcca Dataset hash: abc123...\" (matches original)\n```\n\nUse `-v` for verbose output to see hashing details and canary operations.\n\n## Hosting with Anti-Scraper Protection\nFor datasets hosted outside of Hugging Face, we **strongly recommend** using [Cloudflare Turnstile](https://developers.cloudflare.com/turnstile/get-started/) to add an additional layer of protection against automated AI scrapers.\n\n**Why Cloudflare Turnstile?**\n- **Human verification**: Requires user interaction to access downloads\n- **Bot detection**: Advanced algorithms identify and block automated requests\n- **Privacy-focused**: No tracking cookies or invasive data collection\n- **Easy integration**: Simple JavaScript widget with server-side verification\n- **Free tier available**: Generous limits for research datasets\n\n\nThis layered approach (dataset protection + hosting protection) provides comprehensive defense against automated data harvesting while 
maintaining accessibility for legitimate researchers.\n\n# Maintainence + Development\nThis is meant to be a collaborative and community project. Please feel encouraged to make PRs to update this repo!\n\nFor development:\n```bash\ngit clone https://github.com/Responsible-Dataset-Sharing/easy-dataset-share.git\ncd easy-dataset-share\npip install -e .\ngit config core.hooksPath .githooks\n```\n\n## Current Maintainers\n* Dipika Khullar <dkhullar98@gmail.com>\n* Edward Turner <edward.turner01@outlook.com>\n* Roy Rinberg <royrinberg@gmail.com>\n\n",
    "bugtrack_url": null,
    "license": "Other/Proprietary",
    "summary": "CLI tool to responsibly share datasets by gzipping, canarying, and tracking provenance.",
    "version": "0.3.1",
    "project_urls": {
        "Homepage": "https://github.com/Responsible-Dataset-Sharing/easy-dataset-share",
        "Repository": "https://github.com/Responsible-Dataset-Sharing/easy-dataset-share"
    },
    "split_keywords": [
        "dataset",
        " sharing",
        " encryption",
        " canary",
        " robots",
        " cli"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2e6e2673ba593ad0240f3919af03d37be846fd674d9740a04435d9b35c441ebd",
                "md5": "faf2beab55181f817025360296779f80",
                "sha256": "5322dd9952eb4e4e80eed73fecb5947dccc40149445673b125b7919bbc87e419"
            },
            "downloads": -1,
            "filename": "easy_dataset_share-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "faf2beab55181f817025360296779f80",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 457212,
            "upload_time": "2025-07-15T00:13:34",
            "upload_time_iso_8601": "2025-07-15T00:13:34.927469Z",
            "url": "https://files.pythonhosted.org/packages/2e/6e/2673ba593ad0240f3919af03d37be846fd674d9740a04435d9b35c441ebd/easy_dataset_share-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1d6d6179996ea02fd25593a59ecea520fa67ea8744ed8eda0e3f8ddb82a0545b",
                "md5": "f9d0e24da13bc41f2b7bea803f03a413",
                "sha256": "4ce77e358ef3b7f0af96ec425ef79b19f3f65e99f3f4b9795babdcb76fe328e7"
            },
            "downloads": -1,
            "filename": "easy_dataset_share-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f9d0e24da13bc41f2b7bea803f03a413",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 438588,
            "upload_time": "2025-07-15T00:13:36",
            "upload_time_iso_8601": "2025-07-15T00:13:36.603870Z",
            "url": "https://files.pythonhosted.org/packages/1d/6d/6179996ea02fd25593a59ecea520fa67ea8744ed8eda0e3f8ddb82a0545b/easy_dataset_share-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-15 00:13:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Responsible-Dataset-Sharing",
    "github_project": "easy-dataset-share",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "easy-dataset-share"
}
        