chromadb-data-pipes


Name: chromadb-data-pipes
Version: 0.0.12
Summary: Chroma Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB
Author: Trayan Azarov
License: MIT
Requires Python: <3.12,>=3.9
Upload time: 2024-10-22 11:37:53
Homepage: https://datapipes.chromadb.dev/
Source: https://github.com/amikos-tech/chromadb-data-pipes/
# ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of
"do one thing and do it well".

Roadmap:

- βœ… Integration with LangChain πŸ¦œπŸ”—
- 🚫 Integration with LlamaIndex πŸ¦™
- βœ… Support more than `all-MiniLM-L6-v2` as embedding functions (head over
  to [Embedding Processors](https://datapipes.chromadb.dev/processors/embedding/) for more info)
- 🚫 Multimodal support
- ♾️ Much more!

## Installation

```bash
pip install chromadb-data-pipes
```

## Usage

**Get help:**

```bash
cdp --help
```

### Example Use Cases

Here is a short list of use cases to help you evaluate whether this is the right tool for your needs:

- Import large datasets from local documents (PDF, TXT, etc.), from HuggingFace, from a locally persisted Chroma DB, or
  even from another remote Chroma DB.
- Export large datasets to HuggingFace or any other data format supported by the library (if your format is not
  supported, either implement it in a small function or open an issue).
- Create a dataset from your data that you can share with others (including the embeddings).
- Clone a collection with a different embedding function, distance function, or other HNSW fine-tuning parameters.
- Re-embed documents in a collection with a different embedding function.
- Back up your data to a `jsonl` file.
- Use existing Unix (or other) tools to transform your data after exporting from, or before importing into, Chroma DB.
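Because `cdp` commands read and write one JSON record per line (JSONL), standard Unix tools slot directly into a pipeline. A minimal sketch of that composition, using a simulated stream with an illustrative record shape (in a real pipeline the `printf` stage would be a `cdp export`, and the filtered stream could be piped back into `cdp import`):

```shell
# Simulate a two-record cdp export stream (record shape is hypothetical),
# then use plain grep to count the records whose document starts with "hello".
printf '%s\n' \
  '{"id":"1","document":"hello world"}' \
  '{"id":"2","document":"goodbye world"}' \
  | grep -c '"document":"hello'
```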

### Importing

**Fetch data from HuggingFace Datasets into a `.jsonl` file:**

```bash
cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl
```

**Import data from HuggingFace Datasets to Chroma DB:**

The command below imports the `train` split of the given dataset into the `chroma-qna` collection. The
collection will be created if it does not exist, and documents will be upserted.

```bash
cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create
```

**Importing from a directory with PDF files into Local Persisted Chroma DB:**

```bash
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create
```

> Note: The above command imports the first PDF file matching `2401.02412.pdf` from the `sample-data/papers/` directory,
> splits it into 500-word chunks, embeds each chunk, and imports the chunks into the `my-pdfs` collection in Chroma DB.
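The `grep`/`head -1` stages work because each pipeline stage emits one JSON document per line: `grep` keeps the lines that mention the target file, and `head -1` forwards only the first of them downstream. A toy illustration with hypothetical records:

```shell
# Two fake document records on a one-record-per-line stream; grep selects the
# matching record and head -1 keeps only the first match.
printf '%s\n' \
  '{"metadata":{"source":"2401.02412.pdf"},"document":"first"}' \
  '{"metadata":{"source":"other.pdf"},"document":"second"}' \
  | grep '2401.02412.pdf' | head -1
```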

### Exporting

**Export data from Local Persisted Chroma DB to `.jsonl` file:**

The command below exports the first 10 documents from the `chroma-qna` collection to the `chroma-qna.jsonl` file.

```bash
cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl
```

**Export data from Local Persisted Chroma DB to `.jsonl` file with filter:**

The command below exports data from a locally persisted Chroma DB to a `.jsonl` file, using a `where` filter to select
the documents to export.

```bash
cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl
```
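If a condition is not expressible as a `where` clause, the exported JSONL stream can also be filtered client-side with ordinary text tools. A rough sketch on a simulated stream (the record shape is hypothetical; in practice the `printf` stage would be `cdp export "file://chroma-data/chroma-qna"`):

```shell
# Simulated export stream; keep only records whose metadata carries document_id 123.
printf '%s\n' \
  '{"id":"a","metadata":{"document_id":"123"}}' \
  '{"id":"b","metadata":{"document_id":"456"}}' \
  | grep '"document_id":"123"'
```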

**Export data from Chroma DB to HuggingFace Datasets:**

The command below exports 10 documents, starting at offset 10, from the `chroma-qna` collection to the HuggingFace
Datasets `tazarov/chroma-qna-modified` dataset. The dataset will be uploaded to HF.

> HF Auth and Privacy: Make sure the `HF_TOKEN=hf_....` environment variable is set. If you want your dataset to
> be private, add the `--private` flag to the `cdp ds-put` command.

```bash
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
```

To export a dataset to a file instead, use a `file://` URI:

```bash
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"
```

> File location: The file path is relative to the current working directory.

### Processing

**Copy one Chroma collection to another, re-embedding the documents:**

```bash
cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create
```

> Note: See [Embedding Processors](./processors/embedding.md) for more info about supported embedding functions.

**Import dataset from HF to Local Persisted Chroma and embed the documents:**

```bash
cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create
```

**Chunk Large Documents:**

```bash
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500
```

### Misc

**Count the number of documents in a collection:**

```bash
cdp export "http://localhost:8000/chroma-qna" | wc -l
```
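This works because the export stream carries exactly one record per line, so counting lines counts documents. The mechanics, shown on a simulated stream (the `tr` step normalizes the padding some `wc` implementations print):

```shell
# Three one-line records: wc -l reports 3 documents.
printf '%s\n' '{"id":"1"}' '{"id":"2"}' '{"id":"3"}' | wc -l | tr -d ' '
```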

            
