# datapluck
`datapluck` is a command-line tool and Python library for exporting datasets from the Hugging Face Hub to various file formats and importing datasets back to the Hugging Face Hub. Supported formats include CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets.
## Features
- Export datasets from the Hugging Face Hub
- Import datasets to the Hugging Face Hub
- Support multiple output formats: CSV, TSV, JSON, JSON Lines (jsonl), Microsoft Excel's XLSX, Parquet, SQLite, and Google Sheets
- Handle different dataset splits and subsets
- Connect to Google Sheets for import/export operations
- Filter columns during import
- Support for private datasets on Hugging Face
## Purposes
- Preview a dataset in the format of your choice
- Annotate a dataset in the editor of your choice (export, annotate, then import back)
- Simplify dataset management from the CLI and in CI/CD contexts
- Back up datasets, as a one-off or on a schedule
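
The backup use case pairs naturally with a scheduler. A hypothetical crontab entry (the dataset name and output path are placeholders) that exports a dataset to Parquet nightly:

```
# Nightly at 02:00: export team/dataset to Parquet
0 2 * * * datapluck export team/dataset --format parquet --output_file /backups/dataset.parquet
```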
### Quick Example
#### Export a dataset to CSV
```bash
datapluck export team/dataset --format csv --output_file data.csv
```
#### Import data to your account
```bash
datapluck import username/new-or-existing-dataset --input_file data.csv --format csv --private
```
## Authentication
Before using `datapluck`, ensure you are logged in to the Hugging Face Hub. This is required when accessing private datasets or updating your own. You can log in using the Hugging Face CLI:
```bash
huggingface-cli login
```
This will prompt you to enter your Hugging Face access token. Once logged in, `datapluck` will use your credentials for operations that require authentication.
## Installation
Install `datapluck` from PyPI:
```bash
pip install datapluck
```
## Usage
### Command-line Interface
1. Connect to Google Sheets (required for Google Sheets operations):
```bash
datapluck connect gsheet
```
2. Export a dataset:
```bash
# Export the entire 'imdb' dataset as CSV
datapluck export imdb --format csv --output_file imdb.csv
# Export the entire 'imdb' dataset as a Microsoft Excel spreadsheet (XLSX)
datapluck export imdb --format xlsx --output_file imdb.xlsx
# Export a specific split of the 'imdb' dataset as JSON
# (not recommended for large datasets, use jsonl instead)
datapluck export imdb --split test --format json --output_file imdb.json
# Export to Google Sheets
datapluck export imdb --format gsheet --spreadsheet_id YOUR_SPREADSHEET_ID --sheetname Sheet1
# Export to SQLite
datapluck export imdb --format sqlite --table_name imdb_data --output_file imdb.sqlite
```
3. Import a dataset:
```bash
# Import a CSV file to Hugging Face
datapluck import my_dataset --input_file data.csv --format csv
# Import from Google Sheets
datapluck import my_dataset --format gsheet --spreadsheet_id YOUR_SPREADSHEET_ID --sheetname Sheet1
# Import specific columns from a JSON file
datapluck import my_dataset --input_file data.json --format json --columns "col1,col2,col3"
# Import as a private dataset with a specific split
datapluck import my_dataset --input_file data.parquet --format parquet --private --split train
```
#### Commands
```
connect: Connect to a service (currently only supports Google Sheets).
export: Export a dataset from Hugging Face to a specified format.
import: Import a dataset from a file to Hugging Face.
```
#### Arguments
Common arguments:
```
dataset_name: The name of the dataset to export or import.
--format: The file format for export or import (default: csv).
Choices: csv, tsv, json, jsonl, parquet, gsheet, sqlite.
--spreadsheet_id: The ID of the Google Sheet to export to or import from (used by the gsheet format). When exporting from Hugging Face to Google Sheets, this argument can be omitted and a spreadsheet will be created for you automatically.
--sheetname: The name of the sheet in the Google Sheet (optional).
--subset: The subset of the dataset to export or import (if applicable).
--split: The dataset split to export or import (optional).
```
Export-specific arguments:
```
--output_file: The base name for the output file(s).
--table_name: The name of the table for SQLite export (optional).
```
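
As a mental model for the SQLite export, each dataset row becomes a row in the table named by `--table_name`. A minimal sketch of that shape using Python's stdlib `sqlite3` (illustrative only; the column names and values are made up, and this is not datapluck's implementation):

```python
import sqlite3

# Hypothetical rows standing in for an exported split.
rows = [("great movie", 1), ("terrible plot", 0)]

conn = sqlite3.connect(":memory:")  # a real export would write e.g. imdb.sqlite
conn.execute("CREATE TABLE imdb_data (text TEXT, label INTEGER)")
conn.executemany("INSERT INTO imdb_data VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM imdb_data").fetchone()[0]
print(count)  # → 2
```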
Import-specific arguments:
```
--input_file: The input file to import.
--private: Make the dataset private on Hugging Face.
--columns: Comma-separated list of columns to include in the dataset.
--table_name: The name of the table for SQLite import (optional).
```
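
Conceptually, `--columns` keeps only the listed fields from the input before the dataset is created. A rough stand-in for that filtering in plain Python (illustrative only, not datapluck's actual code):

```python
import csv
import io

raw = "col1,col2,col3,extra\na,b,c,d\n"   # stand-in for the input file
keep = "col1,col2,col3".split(",")        # value passed via --columns

reader = csv.DictReader(io.StringIO(raw))
filtered = [{k: row[k] for k in keep} for row in reader]
print(filtered)  # → [{'col1': 'a', 'col2': 'b', 'col3': 'c'}]
```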
### Python Package
You can use `datapluck` as a Python package:
```python
from datapluck import export_dataset, import_dataset

# Export a dataset
export_dataset(
    dataset_name='imdb',
    split='train',
    output_file='imdb_train',
    export_format='csv'
)

# Import a dataset
import_dataset(
    input_file='data.csv',
    dataset_name='my_dataset',
    private=True,
    format='csv',
    columns='col1,col2,col3',
    split='test'
)
```
#### `export_dataset` function
```python
def export_dataset(
    dataset_name,
    split=None,
    output_file=None,
    subset=None,
    export_format="csv",
    spreadsheet_id=None,
    sheetname=None,
    table_name=None
):
    """
    Export a dataset from the Hugging Face Hub.

    Args:
        dataset_name (str): Name of the dataset on the Hugging Face Hub.
        split (str, optional): Dataset split to export.
        output_file (str, optional): Base name for the output file(s).
        subset (str, optional): Subset of the dataset to export.
        export_format (str, optional): File format for export (default: "csv").
        spreadsheet_id (str, optional): ID of the Google Sheet for export.
        sheetname (str, optional): Name of the sheet in the Google Sheet.
        table_name (str, optional): Name of the table for SQLite export.
    """
```
#### `import_dataset` function
```python
def import_dataset(
    input_file,
    dataset_name,
    private=False,
    format="csv",
    spreadsheet_id=None,
    sheetname=None,
    columns=None,
    table_name=None,
    subset=None,
    split=None
):
    """
    Import a dataset to the Hugging Face Hub.

    Args:
        input_file (str): Path to the input file.
        dataset_name (str): Name for the dataset on the Hugging Face Hub.
        private (bool, optional): Make the dataset private (default: False).
        format (str, optional): File format of the input (default: "csv").
        spreadsheet_id (str, optional): ID of the Google Sheet for import.
        sheetname (str, optional): Name of the sheet in the Google Sheet.
        columns (str, optional): Comma-separated list of columns to include.
        table_name (str, optional): Name of the table for SQLite import.
        subset (str, optional): Subset name for the imported dataset.
        split (str, optional): Split name for the imported dataset.
    """
```
## Contributing
Contributions will be welcome once `datapluck` reaches feature-completeness from the author's standpoint.
## License
This project's license is currently TBD. The current version may be run without limitation for any lawful purpose, but it may only be distributed through the official PyPI package.
## Authors
- Omar Kamali - Initial work (datapluck@omarkama.li)