# Easy Dataset Share
A CLI tool that helps AI researchers share datasets responsibly. It aims to reduce evaluation contamination by making datasets easy for researchers to use but hard for automated scrapers to ingest.
## Features
- **Canary markers**: Unique identifiers to detect if your dataset was used for training
- **Hash verification**: Ensures dataset integrity through SHA256 hashing
- **Protection layers**: ZIP compression, optional encryption, robots.txt
- **Clean removal**: Remove all protection while preserving original data
- **Web hosting** (optional): Deploy a protected download site with CAPTCHA - see [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md)
## Installation
```bash
pip install easy-dataset-share
```
## Quick Start
### Protect a dataset
```bash
easy-dataset-share magic-protect-dir /path/to/dataset -p your-password
```
### Unprotect and clean
```bash
easy-dataset-share magic-unprotect-dir dataset.zip -p your-password --remove-canaries
```
### Verify integrity
```bash
easy-dataset-share hash /path/to/dataset
```
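For intuition, here is a minimal Python sketch of what a directory-level SHA-256 hash can look like. The function name `hash_directory` and the choice to mix relative paths into the digest are assumptions made for illustration; this is not necessarily how `easy-dataset-share hash` computes its value, so the digests will likely differ from the tool's output.

```python
import hashlib
from pathlib import Path


def hash_directory(root: str) -> str:
    """Return a SHA-256 hex digest over every file under `root`.

    Hypothetical helper: files are visited in sorted order and each
    relative path is mixed into the digest, so both renames and
    content changes alter the result.
    """
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Sorting the paths keeps the digest stable across platforms that list directory entries in different orders.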
## Options
- `-p, --password` - Password for encryption (optional)
- `-o, --output` - Output file path (default: `<dir>.zip` or `<dir>.zip.enc`)
- `-c, --num-canary-files` - Number of canary files to create (default: 1)
- `-e, --embed-canaries` - Embed canaries in existing files (default: create separate files)
- `-a, --allow-crawling` - Allow web crawling in robots.txt (default: disallow all)
- `-u, --user-agent` - User-agent to target in robots.txt (default: *)
- `-on, --organization-name` - Organization name for TOS (default: "Example Corp")
- `-dn, --dataset-name` - Dataset name for TOS (default: "Example Dataset")
- `-ce, --contact-email` - Contact email for TOS (default: "support@example.com")
- `--no-tos` - Skip adding terms of service file
- `--no-gitignore` - Skip adding directory to .gitignore (default: auto-add)
- `-v, --verbose` - Enable verbose output
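
To make the `--allow-crawling` and `--user-agent` options more concrete, the sketch below builds a small robots.txt and checks it with Python's standard `urllib.robotparser`. The helpers `build_robots_txt` and `can_fetch` are hypothetical and only mirror the option descriptions above; the file the tool actually writes may look different. Keep in mind that robots.txt is advisory: it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(user_agent: str = "*", allow_crawling: bool = False) -> str:
    """Hypothetical generator: one rule for one user-agent."""
    rule = "Allow: /" if allow_crawling else "Disallow: /"
    return f"User-agent: {user_agent}\n{rule}\n"


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/dataset.zip"
    # Default settings: every crawler is asked to stay out.
    print(can_fetch(build_robots_txt(), "AnyBot", url))
    # Targeting a single user-agent leaves other crawlers unrestricted.
    print(can_fetch(build_robots_txt("GPTBot"), "OtherBot", url))
```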
## How it Works
1. **Hash** original dataset for integrity baseline
2. **Add** canary markers throughout the dataset
3. **Package** with robots.txt and optional encryption
4. **Verify** integrity when unprotecting (canaries removed, data unchanged)
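
The steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `canary_*.txt` naming scheme, the canary text, and the helper names are hypothetical, and packaging and encryption (step 3) are left out. The point is only to show why a hash taken before adding canaries can confirm that removing them restored the original data.

```python
import hashlib
import uuid
from pathlib import Path

CANARY_GLOB = "canary_*.txt"  # hypothetical naming scheme for this sketch


def hash_files(root: Path) -> str:
    """SHA-256 over relative paths and contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def add_canary_file(root: Path) -> Path:
    """Step 2: drop a file holding a unique, searchable identifier."""
    canary = root / f"canary_{uuid.uuid4().hex}.txt"
    canary.write_text(f"CANARY GUID {uuid.uuid4()}\n", encoding="utf-8")
    return canary


def remove_canary_files(root: Path) -> int:
    """Delete every file matching the canary pattern; return the count.

    Caveat of this sketch: a real data file named canary_*.txt would
    also be deleted.
    """
    removed = 0
    for path in list(root.rglob(CANARY_GLOB)):
        path.unlink()
        removed += 1
    return removed


def round_trip_preserves_data(root: Path) -> bool:
    baseline = hash_files(root)             # step 1: integrity baseline
    add_canary_file(root)                   # step 2: add a canary
    changed = hash_files(root) != baseline  # the canary alters the hash
    remove_canary_files(root)               # step 4: strip canaries
    return changed and hash_files(root) == baseline
```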
## Example Workflow
```bash
# Protect (-o names the archive so the later steps can refer to it)
easy-dataset-share magic-protect-dir my_dataset -p secret123 -o dataset.zip
# Share dataset.zip publicly
# Recipients unprotect and remove canaries
easy-dataset-share magic-unprotect-dir dataset.zip -p secret123 --remove-canaries
# Output shows: "📊 Dataset hash: abc123..." (matches original)
```
Use `-v` for verbose output to see hashing details and canary operations.
## Hosting with Anti-Scraper Protection
For datasets hosted outside of Hugging Face, we **strongly recommend** using [Cloudflare Turnstile](https://developers.cloudflare.com/turnstile/get-started/) to add another layer of protection against automated AI scrapers.
**Why Cloudflare Turnstile?**
- **Human verification**: Requires user interaction to access downloads
- **Bot detection**: Advanced algorithms identify and block automated requests
- **Privacy-focused**: No tracking cookies or invasive data collection
- **Easy integration**: Simple JavaScript widget with server-side verification
- **Free tier available**: Generous limits for research datasets
This layered approach (dataset protection + hosting protection) makes automated data harvesting harder while keeping the dataset accessible to legitimate researchers.
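
As a rough illustration of the server-side half of Turnstile, the sketch below posts a widget token to Cloudflare's `siteverify` endpoint using only the Python standard library. The function names are hypothetical, and this is not the code from WEB_HOSTING_GUIDE.md; it assumes your download handler already has the token submitted by the widget and your Turnstile secret key.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def build_siteverify_request(
    secret_key: str, token: str, remote_ip: Optional[str] = None
) -> urllib.request.Request:
    """Build the form-encoded POST that siteverify expects."""
    fields = {"secret": secret_key, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional visitor IP
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(SITEVERIFY_URL, data=data, method="POST")


def token_is_valid(
    secret_key: str,
    token: str,
    remote_ip: Optional[str] = None,
    timeout: float = 10.0,
) -> bool:
    """Return True only if Cloudflare reports the token as valid."""
    request = build_siteverify_request(secret_key, token, remote_ip)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return bool(payload.get("success"))
```

A download endpoint would call `token_is_valid(...)` before serving the archive and reject the request when it returns `False`.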
# Maintenance + Development
This is a collaborative community project. Pull requests are welcome!
For development:
```bash
git clone https://github.com/Responsible-Dataset-Sharing/easy-dataset-share.git
cd easy-dataset-share
pip install -e .
git config core.hooksPath .githooks
```
## Current Maintainers
* Dipika Khullar <dkhullar98@gmail.com>
* Edward Turner <edward.turner01@outlook.com>
* Roy Rinberg <royrinberg@gmail.com>