# FastDatasets
Generate high-quality LLM training datasets from documents. Distillation, augmentation, multi-format export.
## Install
```bash
pip install fastdatasets-llm
# Optional extras:
# pip install 'fastdatasets-llm[web]'  # Web UI / API
# pip install 'fastdatasets-llm[doc]'  # Better doc parsing (textract)
# pip install 'fastdatasets-llm[all]'  # Everything
```
## Configure LLM
Use environment variables or pass parameters directly (function args override env):
```bash
export LLM_API_KEY="sk-..."
export LLM_API_BASE="https://api.example.com/v1"
export LLM_MODEL="your-model"
```
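The precedence rule (explicit arguments override environment variables) can be sketched with a small helper; `resolve_llm_config` is a hypothetical illustration of the rule, not part of the fastdatasets API:

```python
import os

def resolve_llm_config(api_key=None, api_base=None, model_name=None):
    """Resolve LLM settings: explicit arguments win, env vars are the fallback."""
    return {
        "api_key": api_key or os.environ.get("LLM_API_KEY"),
        "api_base": api_base or os.environ.get("LLM_API_BASE"),
        "model_name": model_name or os.environ.get("LLM_MODEL"),
    }

os.environ["LLM_MODEL"] = "env-model"
print(resolve_llm_config()["model_name"])                        # env var used
print(resolve_llm_config(model_name="arg-model")["model_name"])  # argument wins
```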
## Quick Start (Python)
```python
from fastdatasets import generate_dataset_to_dir
dataset = generate_dataset_to_dir(
    inputs=["./docs", "./data/sample.txt"],
    output_dir="./output",
    formats=["alpaca", "sharegpt"],
    file_format="jsonl",
    chunk_size=1000,
    chunk_overlap=200,
    enable_cot=False,
    max_llm_concurrency=5,
    # api_key="sk-...", api_base="https://api.example.com/v1", model_name="your-model",
)
print(len(dataset))
```
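The `chunk_size`/`chunk_overlap` parameters above suggest the common sliding-window splitting scheme: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. A minimal sketch of that idea (my illustration, not the library's actual splitter):

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into windows of chunk_size characters; consecutive
    windows share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 chars with the defaults -> windows starting at 0, 800, 1600, 2400
chunks = split_text("x" * 2500)
print(len(chunks))  # 4
```

The overlap keeps sentences that straddle a chunk boundary visible in both chunks, which helps the LLM generate coherent Q&A pairs near boundaries.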
## CLI
```bash
# Core usage
fastdatasets generate ./data -o ./output -f alpaca,sharegpt --file-format jsonl
# Override LLM just for this command
LLM_API_KEY=sk-xxx LLM_API_BASE=https://api.example.com/v1 LLM_MODEL=your-model \
  fastdatasets generate ./docs -o ./out
```
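The `alpaca` and `sharegpt` format names refer to the two widely used SFT record layouts. The field names below follow those community conventions; the library's exact output keys may differ:

```python
import json

# Alpaca-style record: flat instruction / input / output fields.
alpaca = {
    "instruction": "Summarize the passage.",
    "input": "FastDatasets builds SFT data from documents.",
    "output": "It generates training datasets from source documents.",
}

# ShareGPT-style record: a list of conversation turns.
sharegpt = {
    "conversations": [
        {"from": "human", "value": "Summarize the passage."},
        {"from": "gpt", "value": "It generates training datasets from source documents."},
    ]
}

# With --file-format jsonl, each record is one JSON object per line.
for record in (alpaca, sharegpt):
    print(json.dumps(record, ensure_ascii=False))
```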
## Optional Features
- Web/API: `pip install 'fastdatasets-llm[web]'`, then start the web UI / API server
- Better doc parsing (PDF/DOCX): `pip install 'fastdatasets-llm[doc]'`
## Links
- Source: https://github.com/ZhuLinsen/FastDatasets
- Demo (Spaces): https://huggingface.co/spaces/mumu157/FastDatasets
- Issues: https://github.com/ZhuLinsen/FastDatasets/issues
## Package metadata
- PyPI name: `fastdatasets-llm` (version 0.1.3)
- License: Apache-2.0
- Requires: Python >= 3.8
- Dependencies: fastapi, uvicorn, gradio, pydantic, python-dotenv, loguru, httpx, tqdm, textract, openai, transformers, datasets