fastdatasets-llm


Namefastdatasets-llm JSON
Version 0.1.3 PyPI version JSON
download
home_pageNone
SummaryGenerate high-quality LLM training datasets from documents with distillation and augmentation.
upload_time2025-08-31 05:51:14
maintainerNone
docs_urlNone
authorFastDatasets Authors
requires_python>=3.8
licenseApache-2.0
keywords llm dataset alpaca sharegpt distillation sft
VCS
bugtrack_url
requirements fastapi uvicorn gradio pydantic python-dotenv loguru httpx tqdm textract openai asyncio transformers datasets
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # FastDatasets

Generate high-quality LLM training datasets from documents. Distillation, augmentation, multi-format export.

## Install

```bash
pip install fastdatasets
# Optional extras:
# pip install 'fastdatasets[web]'   # Web UI / API
# pip install 'fastdatasets[doc]'   # Better doc parsing (textract)
# pip install 'fastdatasets[all]'   # Everything
```

## Configure LLM

Use environment variables or pass parameters directly (function args override env):

```bash
export LLM_API_KEY="sk-..."
export LLM_API_BASE="https://api.example.com/v1"
export LLM_MODEL="your-model"
```

## Quick Start (Python)

```python
from fastdatasets import generate_dataset_to_dir

dataset = generate_dataset_to_dir(
  inputs=["./docs", "./data/sample.txt"],
  output_dir="./output",
  formats=["alpaca", "sharegpt"],
  file_format="jsonl",
  chunk_size=1000,
  chunk_overlap=200,
  enable_cot=False,
  max_llm_concurrency=5,
  # api_key="sk-...", api_base="https://api.example.com/v1", model_name="your-model",
)
print(len(dataset))
```

## CLI

```bash
# Core usage
fastdatasets generate ./data -o ./output -f alpaca,sharegpt --file-format jsonl

# Override LLM just for this command
LLM_API_KEY=sk-xxx LLM_API_BASE=https://api.example.com/v1 LLM_MODEL=your-model \
  fastdatasets generate ./docs -o ./out
```

## Optional Features
- Web/API: `pip install 'fastdatasets[web]'` then run your web/app code
- Better doc parsing (PDF/DOCX): `pip install 'fastdatasets[doc]'`

## Links
- Source: https://github.com/ZhuLinsen/FastDatasets
- Demo (Spaces): https://huggingface.co/spaces/mumu157/FastDatasets
- Issues: https://github.com/ZhuLinsen/FastDatasets/issues

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "fastdatasets-llm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "llm, dataset, alpaca, sharegpt, distillation, sft",
    "author": "FastDatasets Authors",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/b3/d6/25da548ac63d34dce1a4cc790ab9e5b15bb207db1623ec49015b8e83865d/fastdatasets_llm-0.1.3.tar.gz",
    "platform": null,
    "description": "# FastDatasets\n\nGenerate high-quality LLM training datasets from documents. Distillation, augmentation, multi-format export.\n\n## Install\n\n```bash\npip install fastdatasets\n# Optional extras:\n# pip install 'fastdatasets[web]'   # Web UI / API\n# pip install 'fastdatasets[doc]'   # Better doc parsing (textract)\n# pip install 'fastdatasets[all]'   # Everything\n```\n\n## Configure LLM\n\nUse environment variables or pass parameters directly (function args override env):\n\n```bash\nexport LLM_API_KEY=\"sk-...\"\nexport LLM_API_BASE=\"https://api.example.com/v1\"\nexport LLM_MODEL=\"your-model\"\n```\n\n## Quick Start (Python)\n\n```python\nfrom fastdatasets import generate_dataset_to_dir\n\ndataset = generate_dataset_to_dir(\n  inputs=[\"./docs\", \"./data/sample.txt\"],\n  output_dir=\"./output\",\n  formats=[\"alpaca\", \"sharegpt\"],\n  file_format=\"jsonl\",\n  chunk_size=1000,\n  chunk_overlap=200,\n  enable_cot=False,\n  max_llm_concurrency=5,\n  # api_key=\"sk-...\", api_base=\"https://api.example.com/v1\", model_name=\"your-model\",\n)\nprint(len(dataset))\n```\n\n## CLI\n\n```bash\n# Core usage\nfastdatasets generate ./data -o ./output -f alpaca,sharegpt --file-format jsonl\n\n# Override LLM just for this command\nLLM_API_KEY=sk-xxx LLM_API_BASE=https://api.example.com/v1 LLM_MODEL=your-model \\\n  fastdatasets generate ./docs -o ./out\n```\n\n## Optional Features\n- Web/API: `pip install 'fastdatasets[web]'` then run your web/app code\n- Better doc parsing (PDF/DOCX): `pip install 'fastdatasets[doc]'`\n\n## Links\n- Source: https://github.com/ZhuLinsen/FastDatasets\n- Demo (Spaces): https://huggingface.co/spaces/mumu157/FastDatasets\n- Issues: https://github.com/ZhuLinsen/FastDatasets/issues\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Generate high-quality LLM training datasets from documents with distillation and augmentation.",
    "version": "0.1.3",
    "project_urls": {
        "Demo": "https://huggingface.co/spaces/mumu157/FastDatasets",
        "Homepage": "https://github.com/ZhuLinsen/FastDatasets",
        "Issues": "https://github.com/ZhuLinsen/FastDatasets/issues",
        "Repository": "https://github.com/ZhuLinsen/FastDatasets"
    },
    "split_keywords": [
        "llm",
        " dataset",
        " alpaca",
        " sharegpt",
        " distillation",
        " sft"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9c0a48203bf916a267d0ea59e0a26d68a34d38b4704dd34263fefa097ec56cc0",
                "md5": "ed8c844b89182d1e879c2fc0bd579443",
                "sha256": "0551789c7a6501cbf3b7d2bf6de7737004bdc6280e01f8e7d936f9c4e6caafc2"
            },
            "downloads": -1,
            "filename": "fastdatasets_llm-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ed8c844b89182d1e879c2fc0bd579443",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 57239,
            "upload_time": "2025-08-31T05:51:11",
            "upload_time_iso_8601": "2025-08-31T05:51:11.900127Z",
            "url": "https://files.pythonhosted.org/packages/9c/0a/48203bf916a267d0ea59e0a26d68a34d38b4704dd34263fefa097ec56cc0/fastdatasets_llm-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b3d625da548ac63d34dce1a4cc790ab9e5b15bb207db1623ec49015b8e83865d",
                "md5": "d6eec0146af98ce3420ed8bdb48ae6de",
                "sha256": "3763c13ddf63d6a47fd5a6b2509257c9b9659ce48a5714f4acc4aeb1535ed5a9"
            },
            "downloads": -1,
            "filename": "fastdatasets_llm-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "d6eec0146af98ce3420ed8bdb48ae6de",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 53411,
            "upload_time": "2025-08-31T05:51:14",
            "upload_time_iso_8601": "2025-08-31T05:51:14.352678Z",
            "url": "https://files.pythonhosted.org/packages/b3/d6/25da548ac63d34dce1a4cc790ab9e5b15bb207db1623ec49015b8e83865d/fastdatasets_llm-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-31 05:51:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ZhuLinsen",
    "github_project": "FastDatasets",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "fastapi",
            "specs": []
        },
        {
            "name": "uvicorn",
            "specs": []
        },
        {
            "name": "gradio",
            "specs": []
        },
        {
            "name": "pydantic",
            "specs": []
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "==",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "loguru",
            "specs": [
                [
                    "==",
                    "0.7.2"
                ]
            ]
        },
        {
            "name": "httpx",
            "specs": [
                [
                    "==",
                    "0.27.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.66.2"
                ]
            ]
        },
        {
            "name": "textract",
            "specs": [
                [
                    "==",
                    "1.6.5"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    "==",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "asyncio",
            "specs": [
                [
                    "==",
                    "3.4.3"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": []
        },
        {
            "name": "datasets",
            "specs": []
        }
    ],
    "lcname": "fastdatasets-llm"
}
        
Elapsed time: 1.12350s