docetl

- Name: docetl
- Version: 0.2.1
- Summary: ETL with LLM operations.
- Upload time: 2025-01-09 09:11:09
- Author: Shreya Shankar
- Requires Python: <4.0,>=3.10
- License: MIT

# 📜 DocETL: Powering Complex Document Processing Pipelines

[![Website](https://img.shields.io/badge/Website-docetl.org-blue)](https://docetl.org)
[![Documentation](https://img.shields.io/badge/Documentation-docs-green)](https://ucbepic.github.io/docetl)
[![Discord](https://img.shields.io/discord/1285485891095236608?label=Discord&logo=discord)](https://discord.gg/fHp7B2X3xx)
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2410.12189)

![DocETL Figure](docs/assets/readmefig.png)

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

1. An interactive UI playground for iterative prompt engineering and pipeline development
2. A Python package for running production pipelines from the command line or Python code

### 🌟 Community Projects

- [Conversation Generator](https://github.com/PassionFruits-net/docetl-conversation)
- [Text-to-speech](https://github.com/PassionFruits-net/docetl-speaker)
- [YouTube Transcript Topic Extraction](https://github.com/rajib76/docetl_examples)

### 📚 Educational Resources

- [UI/UX Thoughts](https://x.com/sh_reya/status/1846235904664273201)
- [Using Gleaning to Improve Output Quality](https://x.com/sh_reya/status/1843354256335876262)
- [Deep Dive on Resolve Operator](https://x.com/sh_reya/status/1840796824636121288)


## 🚀 Getting Started

There are two main ways to use DocETL:

### 1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)

[DocWrangler](https://docetl.org/playground) helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use

![DocWrangler](docs/assets/tutorial/one-operation.png)

DocWrangler is hosted at [docetl.org/playground](https://docetl.org/playground). To run the playground locally instead, you can either:
- Use Docker (recommended for quick start): `make docker`
- Set up the development environment manually

See the [Playground Setup Guide](https://ucbepic.github.io/docetl/playground/) for detailed instructions.

### 2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

#### Prerequisites
- Python 3.10 or later
- An OpenAI API key (or a key for the LLM provider of your choice)

```bash
pip install docetl
```

Create a `.env` file in your project directory:
```bash
OPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)
```

To see examples of how to use DocETL, check out the [tutorial](https://ucbepic.github.io/docetl/tutorial/).
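
Before running a pipeline, it can help to confirm that the prerequisites above are in place. The sketch below is a minimal, standard-library-only check and is not part of DocETL: `parse_env_file` and `check_prerequisites` are hypothetical helper names, and the parser assumes simple `KEY=VALUE` lines with optional trailing `#` comments, as in the example above (values that themselves contain `#` would be truncated).

```python
import sys
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_prerequisites(env_path=".env", key="OPENAI_API_KEY"):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or later is required")
    path = Path(env_path)
    if not path.is_file():
        problems.append(f"{env_path} not found")
    elif not parse_env_file(path).get(key):
        problems.append(f"{key} is missing or empty in {env_path}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Problem:", problem)
```

If you use a different LLM provider, pass that provider's variable name as `key` instead of `OPENAI_API_KEY`.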

### 🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

#### Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:

1. Create the required environment files:

Create `.env` in the root directory:
```bash
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=0.0.0.0
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
```

Create `.env.local` in the `website` directory:
```bash
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
```

2. Run Docker:
```bash
make docker
```

This will:
- Create a Docker volume for persistent data
- Build the DocETL image
- Run the container with the UI accessible at http://localhost:3000

To clean up Docker resources (note that this will delete the Docker volume):
```bash
make docker-clean
```
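
The two environment files need to agree with each other for the UI to reach the backend. As a rough sketch (not part of DocETL), the check below compares the values shown above. It assumes that `NEXT_PUBLIC_BACKEND_PORT` should equal `BACKEND_PORT` and that the frontend origin should appear in `BACKEND_ALLOW_ORIGINS`, which is how the example files are set up; `find_mismatches` is a hypothetical helper name.

```python
def find_mismatches(root_env, ui_env):
    """Compare settings from `.env` and `website/.env.local`, given as dicts."""
    issues = []

    # Assumption: the UI must point at the port the backend listens on.
    backend_port = root_env.get("BACKEND_PORT")
    ui_port = ui_env.get("NEXT_PUBLIC_BACKEND_PORT")
    if backend_port != ui_port:
        issues.append(f"backend port mismatch: {backend_port} vs {ui_port}")

    # Assumption: the frontend origin must be in the backend's allowed origins.
    allowed = [
        origin.strip()
        for origin in root_env.get("BACKEND_ALLOW_ORIGINS", "").split(",")
        if origin.strip()
    ]
    frontend_port = root_env.get("FRONTEND_PORT", "3000")
    origin = f"http://localhost:{frontend_port}"
    if origin not in allowed:
        issues.append(f"{origin} is not listed in BACKEND_ALLOW_ORIGINS")

    return issues


# Values copied from the example files above.
root_env = {
    "BACKEND_ALLOW_ORIGINS": "http://localhost:3000,http://127.0.0.1:3000",
    "BACKEND_PORT": "8000",
    "FRONTEND_PORT": "3000",
}
ui_env = {
    "NEXT_PUBLIC_BACKEND_HOST": "localhost",
    "NEXT_PUBLIC_BACKEND_PORT": "8000",
}
print(find_mismatches(root_env, ui_env))  # prints []
```

An empty list means the example values are mutually consistent; changing only one of the two port settings would produce a mismatch message.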

#### Option B: Manual Setup (Development)

For development or if you prefer not to use Docker:

1. Clone the repository:
```bash
git clone https://github.com/ucbepic/docetl.git
cd docetl
```

2. Set up environment variables in `.env` in the root/top-level directory:
```bash
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
```

Then create a `.env.local` file in the `website` directory with the following:
```bash
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
```

3. Install dependencies:
```bash
make install      # Install Python package
make install-ui   # Install UI dependencies
```

Note that the OpenAI API key, base URL, and model name in `website/.env.local` are used only by the UI assistant, not by the DocETL pipeline execution engine.

4. Start the development server:
```bash
make run-ui-dev
```

5. Visit http://localhost:3000/playground to access the interactive UI.
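
If the page does not load, a quick way to narrow things down is to check whether anything is listening on the configured ports. The following is a minimal sketch using only the standard library; `is_listening` is a hypothetical helper, and the ports assume the `FRONTEND_PORT=3000` and `BACKEND_PORT=8000` values from the example `.env` file.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the example .env file; adjust if you changed them.
    for name, port in (("frontend", 3000), ("backend", 8000)):
        state = "up" if is_listening("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

A successful connection only shows that some process accepts connections on that port, not that it is the DocWrangler server.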

### 🛠️ Development Setup

If you're planning to contribute to or modify DocETL, you can verify your setup by running the test suite:

```bash
make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)
```

For detailed documentation and tutorials, visit our [documentation](https://ucbepic.github.io/docetl).


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "docetl",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Shreya Shankar",
    "author_email": "shreyashankar@berkeley.edu",
    "download_url": "https://files.pythonhosted.org/packages/49/37/cd5181624182be1826d878da60f38ef8efe0c35f24a9127c3217b8322204/docetl-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "ETL with LLM operations.",
    "version": "0.2.1",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "159b028222a50ab9818cd9cae796e6f4c509e2d225ec86bf814a3e33fbad2d8e",
                "md5": "3b147e5551716a04eeb45185e0320211",
                "sha256": "d0fdb8487883accf09754495239d1c5d132e84a245f5367a9b218539407a3bf6"
            },
            "downloads": -1,
            "filename": "docetl-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3b147e5551716a04eeb45185e0320211",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 170956,
            "upload_time": "2025-01-09T09:11:06",
            "upload_time_iso_8601": "2025-01-09T09:11:06.254009Z",
            "url": "https://files.pythonhosted.org/packages/15/9b/028222a50ab9818cd9cae796e6f4c509e2d225ec86bf814a3e33fbad2d8e/docetl-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4937cd5181624182be1826d878da60f38ef8efe0c35f24a9127c3217b8322204",
                "md5": "abd79e0c374b93d856ddd09d12b64d28",
                "sha256": "836174ba94259f9fd4eae0f1b7082f0ad87008e0406d0a48a827f4ce79c870a4"
            },
            "downloads": -1,
            "filename": "docetl-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "abd79e0c374b93d856ddd09d12b64d28",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 144629,
            "upload_time": "2025-01-09T09:11:09",
            "upload_time_iso_8601": "2025-01-09T09:11:09.036806Z",
            "url": "https://files.pythonhosted.org/packages/49/37/cd5181624182be1826d878da60f38ef8efe0c35f24a9127c3217b8322204/docetl-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-09 09:11:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "docetl"
}
        