easy-dataset-share


- **Name**: easy-dataset-share
- **Version**: 0.4.3
- **Summary**: CLI tool to responsibly share datasets by gzipping, canarying, and tracking provenance.
- **Upload time**: 2025-07-22 03:58:21
- **Author**: Edward Turner
- **Requires Python**: <4.0,>=3.10
- **License**: Other/Proprietary
- **Keywords**: dataset, sharing, encryption, canary, robots, cli

# Easy Dataset Share

`easy-dataset-share` helps AI researchers share datasets responsibly. It helps prevent evaluation contamination by making datasets easy for researchers to use but hard for automated scrapers to ingest.

The `easy-dataset-share` CLI tool provides basic protection against scraping by making the dataset text itself less scrapeable.

However, sophisticated actors will still be able to scrape your content. Rather than adding unsophisticated defenses that inconvenience real users, we recommend outsourcing that defense to a provider like Cloudflare. We wrote an easy tutorial on setting up Cloudflare Turnstile, which is like CAPTCHA but (a) actually effective and (b) does not inconvenience your real users. See [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md).


[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/badge/PyPI-easy--dataset--share-blue.svg)](https://pypi.org/project/easy-dataset-share/)
[![GitHub](https://img.shields.io/badge/GitHub-Responsible%20Dataset%20Sharing-green.svg)](https://github.com/Responsible-Dataset-Sharing/easy-dataset-share)
[![License: Other/Proprietary](https://img.shields.io/badge/License-Other%2FProprietary-red.svg)](LICENSE)

## Features
`easy-dataset-share` includes features for:

- **Canary markers**: Unique identifiers to detect if your dataset was used for training.
- **Hash verification**: Hashing the dataset before and after protection to confirm that adding and removing canaries does not alter the data.
- **Protection layers**: Zipping the data so that basic crawlers cannot read it as plaintext, with optional encryption.
- **Default best practices**: Generating a `robots.txt` and Terms of Service that prohibit use for AI training.
- **Clean removal**: Removing all protection while preserving the original data (use hash verification to confirm data integrity).
- **Web hosting** (optional): Deploying a protected download site with Cloudflare Turnstile (a CAPTCHA replacement); see [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md).
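
To illustrate the canary idea from the list above, here is a minimal Python sketch. It is not the tool's implementation: the marker format (`CANARY-GUID` plus a UUID) and all function names are assumptions made for this example. The point is only that a canary is a unique string that can be added, detected in other text (for example model output), and removed again without touching the original data.

```python
import uuid

# Hypothetical marker prefix chosen for this sketch; the tool may use another format.
CANARY_PREFIX = "CANARY-GUID"


def make_canary() -> str:
    """Return a unique, searchable marker string."""
    return f"{CANARY_PREFIX} {uuid.uuid4()}"


def add_canary(text: str, canary: str) -> str:
    """Append the canary as its own trailing line so it can be stripped later."""
    return text + "\n" + canary + "\n"


def remove_canary(text: str, canary: str) -> str:
    """Undo add_canary exactly, restoring the original text."""
    suffix = "\n" + canary + "\n"
    return text[: -len(suffix)] if text.endswith(suffix) else text


def contains_canary(sample: str, canary: str) -> bool:
    """Check whether a text sample reproduces the canary string."""
    return canary in sample
```

If a model later emits the canary string verbatim, that is evidence the protected files were part of its training data.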

## Installation

```bash
pip install easy-dataset-share
```

## Quick Start

### Protect a dataset
```bash
easy-dataset-share protect-dir /path/to/dataset
```

### Unprotect and clean
```bash
easy-dataset-share unprotect-dir dataset.zip --remove-canaries
```

### Verify integrity
```bash
easy-dataset-share hash /path/to/dataset
```
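
For intuition about what a dataset hash can look like, the following Python sketch computes one SHA-256 digest over a directory tree. The algorithm and the function name `hash_directory` are assumptions for illustration; the `hash` subcommand may compute its digest differently, so the two values should not be expected to match.

```python
import hashlib
from pathlib import Path


def hash_directory(root: str) -> str:
    """Return a SHA-256 hex digest over relative paths and file contents."""
    base = Path(root)
    digest = hashlib.sha256()
    files = (p for p in base.rglob("*") if p.is_file())
    # Sort by relative POSIX path so the result does not depend on listing order.
    for path in sorted(files, key=lambda p: p.relative_to(base).as_posix()):
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

Hashing the relative path together with the bytes means that renaming or moving a file changes the digest, not only editing its content.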

## Options
Run `easy-dataset-share --help`, or for a subcommand, e.g. `easy-dataset-share protect-dir --help`, to get a description of the available options.

## How it Works
1. **Hash** original dataset for integrity baseline
2. **Add** canary markers throughout the dataset
3. **Package** with robots.txt and optional encryption
4. **Verify** integrity when unprotecting (canaries removed, data unchanged)
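
The four steps above can be sketched end to end with the Python standard library. This is a simplified model under stated assumptions, not the tool's code: the canary is a single added file with the hypothetical name `CANARY.txt`, the `robots.txt` content is a placeholder, encryption is omitted, and the source directory is assumed not to contain files with those two names already.

```python
import hashlib
import zipfile
from pathlib import Path

CANARY_NAME = "CANARY.txt"  # hypothetical file name, for illustration only


def tree_hash(root: Path) -> str:
    """SHA-256 over relative paths and file bytes, in sorted order."""
    digest = hashlib.sha256()
    for p in sorted(root.rglob("*"), key=lambda q: q.relative_to(root).as_posix()):
        if p.is_file():
            digest.update(p.relative_to(root).as_posix().encode() + b"\0")
            digest.update(p.read_bytes() + b"\0")
    return digest.hexdigest()


def protect(src: Path, archive: Path, canary: str) -> str:
    """Steps 1-3: hash the data, add a canary, package with robots.txt."""
    baseline = tree_hash(src)  # 1. integrity baseline
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in src.rglob("*"):
            if p.is_file():
                zf.write(p, p.relative_to(src).as_posix())
        zf.writestr(CANARY_NAME, canary)  # 2. canary marker
        zf.writestr("robots.txt", "User-agent: *\nDisallow: /\n")  # 3. package
    return baseline


def unprotect(archive: Path, dest: Path, baseline: str) -> bool:
    """Step 4: extract, drop the added files, and compare against the baseline."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    for extra in (CANARY_NAME, "robots.txt"):
        (dest / extra).unlink(missing_ok=True)
    return tree_hash(dest) == baseline
```

`unprotect` returns `True` only when the extracted data, with the added files removed, hashes to the same value recorded before protection.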

## Example Workflow
```bash
# Protect the dataset (add -p your-password to encrypt it)
easy-dataset-share protect-dir my_dataset

# Share dataset.zip publicly

# Recipients unprotect and remove canaries
easy-dataset-share unprotect-dir dataset.zip --remove-canaries
# Output shows: "📊 Dataset hash: abc123..." (matches original)
```
Use `-p your-password` to set a password when protecting the dataset; recipients pass the same password when unprotecting. The protected archive is then named `.zip.enc` instead of `.zip`.

Use `-v` for verbose output to see hashing details and canary operations.


## Notes about licensing for `robots.txt` (Text & Data Mining (TDM) opt-out)

The `robots.txt` generated by this tool adds TDM protections; see [EDRLab](https://www.edrlab.org/open-standards/tdmrep/) for more information.
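
As a rough illustration of how a `robots.txt` opt-out is evaluated, the Python sketch below checks a hand-written policy with the standard library's `urllib.robotparser`. The policy text is an assumption written for this example and is not the file this tool generates; `GPTBot` and `CCBot` are used only as examples of crawler user-agent tokens, and TDM-specific signals are not modeled here.

```python
from urllib.robotparser import RobotFileParser

# Example policy written for this sketch (not the tool's generated output).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the policy lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that `robots.txt` is advisory: it only affects crawlers that choose to honor it.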


## Hosting with Anti-Scraper Protection
For datasets hosted outside of Hugging Face, we **strongly recommend** using [Cloudflare Turnstile](https://developers.cloudflare.com/turnstile/get-started/) to add an additional layer of protection against automated AI scrapers. See [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md) for a guide on how to do this.


This layered approach (dataset protection + hosting protection) provides a stronger defense against automated data harvesting than either layer alone, while maintaining accessibility for legitimate researchers.
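
On the hosting side, a download endpoint would typically validate the Turnstile token on the server before serving the file. The Python sketch below shows that check against Cloudflare's `siteverify` endpoint using only the standard library. The function names are hypothetical, error handling is minimal, and WEB_HOSTING_GUIDE.md may set things up differently, so treat this as an outline and not as the guide's method.

```python
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def build_siteverify_request(secret: str, token: str) -> urllib.request.Request:
    """Build the form-encoded POST that asks Cloudflare to validate a token."""
    body = urllib.parse.urlencode({"secret": secret, "response": token}).encode()
    return urllib.request.Request(SITEVERIFY_URL, data=body, method="POST")


def token_is_valid(response_body: bytes) -> bool:
    """Interpret the JSON reply; the token passes only if "success" is true."""
    return bool(json.loads(response_body).get("success", False))


def verify_token(secret: str, token: str, timeout: float = 10.0) -> bool:
    """Send the request and return whether the token passed validation."""
    request = build_siteverify_request(secret, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return token_is_valid(resp.read())
```

The secret key must stay on the server; only the site key and the resulting token are ever visible to the browser.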

# Maintenance + Development
This is meant to be a collaborative and community project. Please feel encouraged to make PRs to update this repo!

For development:
```bash
git clone https://github.com/Responsible-Dataset-Sharing/easy-dataset-share.git
cd easy-dataset-share
pip install -e .
git config core.hooksPath .githooks
```

## Current Maintainers

* Roy Rinberg <royrinberg@gmail.com>
* Edward Turner <edward.turner01@outlook.com>
* Dipika Khullar <dkhullar98@gmail.com>

### Acknowledgements

This project was kickstarted by Alex Turner and then funded by the following supporters:

* Alex Turner ($500) - https://turntrout.com/
* Anna Wang ($500) - https://www.linkedin.com/in/annawang01/
* James Aung ($500) - https://jamesaung.com/
* Girish Sastry ($1000) - https://www.linkedin.com/in/girish-sastry-2a39348/

            
