chunckerflow

Name	chunckerflow
Version	0.1.0
home_page	None
Summary	Production-grade async text chunking framework for RAG systems
upload_time	2025-10-19 14:10:33
maintainer	None
docs_url	None
author	ChunkFlow Contributors
requires_python	>=3.9
license	MIT
keywords	chunking rag retrieval embeddings nlp text-processing semantic-search
requirements	No requirements were recorded.

# ChunkFlow

**Production-grade async text chunking framework for RAG systems**

[![PyPI version](https://badge.fury.io/py/chunk-flow.svg)](https://badge.fury.io/py/chunk-flow)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/chunk-flow)](https://pypi.org/project/chunk-flow/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/chunk-flow)](https://pypi.org/project/chunk-flow/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![GitHub stars](https://img.shields.io/github/stars/chunkflow/chunk-flow?style=social)](https://github.com/chunkflow/chunk-flow)

ChunkFlow is an extensible framework for text chunking in Retrieval-Augmented Generation (RAG) systems. It provides multiple chunking strategies, pluggable embedding providers, and a suite of evaluation metrics, so you can choose a chunking setup based on measurements rather than guesswork.

## Why ChunkFlow?

**Chunking quality directly impacts retrieval accuracy, computational costs, and user experience** in RAG systems. Poor chunking can lead to hallucinations, missed context, and wasted API calls.

ChunkFlow addresses this with:
- **6+ chunking strategies** - From simple fixed-size to revolutionary late chunking
- **Pluggable architecture** - Easy integration with any embedding provider
- **Comprehensive evaluation** - 12+ metrics including RAGAS-inspired, NDCG, semantic coherence
- **Data-driven comparison** - Built-in strategy comparison and ranking framework
- **Production-ready** - Async-first, type-safe, structured logging, extensible design

## Key Features

### Chunking Strategies

- **Fixed-Size** - Simple character/token-based splitting (10K+ chunks/sec)
- **Recursive** - Hierarchical splitting with natural boundaries (recommended default)
- **Document-Based** - Format-aware (Markdown, HTML)
- **Semantic** - Embedding-based topic detection with similarity thresholds
- **Late Chunking** - Revolutionary context-preserving approach (6-9% accuracy improvement, Jina AI 2024)
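
To make the simplest strategy in the list concrete, here is a minimal sketch of character-based fixed-size splitting with overlap. It is not ChunkFlow's implementation: the function name `fixed_size_chunks` is hypothetical, and it assumes `overlap` means the number of characters shared by consecutive chunks.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

A recursive splitter typically applies the same size limit but first tries to cut at paragraph, sentence, or word boundaries before falling back to raw character windows.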

### Embedding Providers

- **OpenAI** - text-embedding-3-small/large with automatic cost tracking
- **HuggingFace** - Sentence Transformers (local, free, GPU/CPU support)
- **Extensible** - Easy to add custom providers via EmbeddingProvider base class
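
The interface of the `EmbeddingProvider` base class is not shown in this README, so the sketch below does not subclass it. Instead it defines a hypothetical stand-in (`SketchEmbeddingProvider`, `SketchEmbeddingResult`) that only mirrors what the usage examples suggest: an async `embed_texts` method returning an object with an `embeddings` attribute. The toy `HashingProvider` is purely illustrative.

```python
import asyncio
import hashlib
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class SketchEmbeddingResult:
    """Hypothetical result type; only `.embeddings` mirrors the README usage."""
    embeddings: List[List[float]]


class SketchEmbeddingProvider(ABC):
    """Stand-in for the library's base class, whose real interface is not shown here."""

    @abstractmethod
    async def embed_texts(self, texts: List[str]) -> SketchEmbeddingResult: ...


class HashingProvider(SketchEmbeddingProvider):
    """Toy provider: hashes each word into one of `dim` buckets, then L2-normalizes."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
        return [v / norm for v in vec]

    async def embed_texts(self, texts: List[str]) -> SketchEmbeddingResult:
        return SketchEmbeddingResult([self._embed(t) for t in texts])


result = asyncio.run(HashingProvider(dim=8).embed_texts(["late chunking", "fixed size"]))
print(len(result.embeddings), len(result.embeddings[0]))  # 2 8
```

Check the API reference for the actual abstract methods before adapting this shape to a real provider.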

### Evaluation Metrics

- **Retrieval** (4 metrics): NDCG@k, Recall@k, Precision@k, MRR
- **Semantic** (4 metrics): Coherence, Boundary Quality, Chunk Stickiness (MoC), Topic Diversity
- **RAG Quality** (4 metrics): Context Relevance, Answer Faithfulness, Context Precision, Context Recall (RAGAS-inspired)
- **Framework**: Unified EvaluationPipeline + StrategyComparator for comprehensive analysis
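
For reference, the four retrieval metrics have standard textbook definitions, sketched below for a single query with binary relevance. The function names are illustrative and are not ChunkFlow's API. MRR is the mean of the reciprocal rank over many queries, so only the per-query term is shown.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for idx in ranked[:k] if idx in relevant) / k


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for idx in ranked[:k] if idx in relevant) / len(relevant)


def reciprocal_rank(ranked: List[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, idx in enumerate(ranked[:k], start=1)
        if idx in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = [2, 7, 0, 4, 5]  # chunk indices in retrieval order
relevant = {0, 2, 5}      # ground-truth relevant chunk indices
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant), round(ndcg_at_k(ranked, relevant, 3), 4))
```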

## Quick Start

### Installation

```bash
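# Basic installation (the distribution is published on PyPI as chunckerflow)
pip install chunckerflow

# With specific providers (quotes keep shells such as zsh from expanding the brackets)
pip install "chunckerflow[openai]"
pip install "chunckerflow[huggingface]"

# With API server
pip install "chunckerflow[api]"

# Everything
pip install "chunckerflow[all]"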
```

### Basic Usage

```python
import asyncio

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline


async def main() -> None:
    document = "Your document text here..."  # placeholder input

    # 1. Chunk your document
    chunker = StrategyRegistry.create("recursive", {"chunk_size": 512, "overlap": 100})
    result = await chunker.chunk(document)

    # 2. Embed chunks
    embedder = EmbeddingProviderFactory.create("openai", {"model": "text-embedding-3-small"})
    emb_result = await embedder.embed_texts(result.chunks)

    # 3. Evaluate quality (semantic metrics - no ground truth needed)
    pipeline = EvaluationPipeline(metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness"])
    metrics = await pipeline.evaluate(
        chunks=result.chunks,
        embeddings=emb_result.embeddings,
    )

    print(f"Semantic Coherence: {metrics['semantic_coherence'].score:.4f}")
    print(f"Boundary Quality: {metrics['boundary_quality'].score:.4f}")


# The API is async, so the calls must run inside an event loop
asyncio.run(main())
```

### Strategy Comparison

Compare multiple strategies to find the best for your use case:

```python
import asyncio

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, StrategyComparator


async def compare(document: str, query_emb) -> None:
    # Create strategies to compare
    strategies = [
        StrategyRegistry.create("fixed_size", {"chunk_size": 500, "overlap": 50}),
        StrategyRegistry.create("recursive", {"chunk_size": 400, "overlap": 80}),
        StrategyRegistry.create("semantic", {"threshold_percentile": 80}),
    ]

    # Get embedder
    embedder = EmbeddingProviderFactory.create("huggingface")

    # Set up evaluation pipeline
    pipeline = EvaluationPipeline(
        metrics=["ndcg_at_k", "semantic_coherence", "chunk_stickiness"],
    )

    # Compare strategies; ndcg_at_k needs ground truth (a query embedding
    # plus the indices of the chunks relevant to that query)
    comparison = await pipeline.compare_strategies(
        strategies=strategies,
        text=document,
        ground_truth={"query_embedding": query_emb, "relevant_indices": [0, 2, 5]},
    )

    # Generate comparison report
    report = StrategyComparator.generate_comparison_report(
        {name: result["metric_results"]
         for name, result in comparison["strategies"].items()}
    )
    print(report)


# Supply your own document text and query embedding, then run:
#     asyncio.run(compare(document, query_emb))
# See examples/strategy_comparison.py for a complete working example
```
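
The `threshold_percentile` option used for the semantic strategy above is not explained in this README. A common way to implement percentile-based semantic chunking is sketched below: compute the cosine distance between consecutive sentence embeddings and start a new chunk wherever the distance exceeds the chosen percentile. This is an assumption about the general technique, not a description of ChunkFlow's internals, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treats zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile (pct in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_groups(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold_percentile: float = 80.0,
) -> List[List[str]]:
    """Group consecutive sentences, splitting where the embedding distance spikes."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_percentile)
    groups: List[List[str]] = [[sentences[0]]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:
            groups.append([sentence])  # topic shift: start a new chunk
        else:
            groups[-1].append(sentence)
    return groups


sentences = ["a1", "a2", "b1", "b2"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_groups(sentences, embeddings, threshold_percentile=80))
```

With the toy vectors above, the only large distance falls between `a2` and `b1`, so the sentences split into two groups.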

### API Server

```bash
# Start FastAPI server
uvicorn chunk_flow.api.app:app --reload

# Use the API
curl -X POST "http://localhost:8000/chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document here...",
    "strategy": "recursive",
    "strategy_config": {"chunk_size": 512}
  }'
```
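
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the payload of the `curl` call above. The helper names are hypothetical, and because the response schema is not documented in this README, the decoded JSON is returned as-is.

```python
import json
import urllib.request
from typing import Any


def build_chunk_request(
    text: str,
    strategy: str = "recursive",
    chunk_size: int = 512,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /chunk endpoint."""
    payload = {
        "text": text,
        "strategy": strategy,
        "strategy_config": {"chunk_size": chunk_size},
    }
    return urllib.request.Request(
        url=f"{base_url}/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_chunk(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chunk_request("Your document here...")
print(request.get_method(), request.full_url)
# Call post_chunk(request) once the server is running locally.
```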

## Architecture

ChunkFlow follows a clean, extensible architecture:

```
┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│  /chunk, /evaluate, /compare, /benchmark, /export           │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    Orchestration Layer                      │
│  ChunkingPipeline | EvaluationEngine | ResultsAggregator    │
└─────────────────────────────────────────────────────────────┘
                              ↓
        ┌─────────────────────┴─────────────────────┐
        ↓                                           ↓
┌──────────────────┐                    ┌──────────────────────┐
│ Chunking Module  │                    │  Embedding Module    │
│ ---------------- │                    │ -------------------- │
│ • Fixed-Size     │                    │ • OpenAI             │
│ • Recursive      │                    │ • HuggingFace        │
│ • Semantic       │                    │ • Google Vertex      │
│ • Late           │                    │ • Cohere             │
└──────────────────┘                    └──────────────────────┘
```

## Research-Backed

ChunkFlow implements cutting-edge research findings:

- **Late Chunking** (Jina AI, 2024): 6-9% improvement in retrieval accuracy
- **Optimal Chunk Sizes** (Bhat et al., 2025): 64-128 tokens for facts, 512-1024 for context
- **Semantic Independence** (HOPE, 2025): 56% gain in factual correctness
- **MoC Metrics** (Zhao et al., 2025): Boundary clarity and chunk stickiness
- **RAGAS** (ExplodingGradients, 2023): Reference-free RAG evaluation
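
The core idea of late chunking is to embed the tokens of the whole document first, so that every token vector carries document-wide context, and only then pool token vectors into one vector per chunk. The sketch below shows just the pooling step with toy token vectors standing in for the output of a long-context embedding model. The function name and the `(start, end)` span convention are assumptions, not ChunkFlow's API.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool the token vectors inside each half-open [start, end) token span."""
    pooled: List[List[float]] = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        window = token_embeddings[start:end]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Toy token embeddings for a four-token document (2 dimensions each)
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
print(late_chunk_pool(tokens, [(0, 2), (2, 4)]))  # [[2.0, 1.0], [1.0, 3.0]]
```

In conventional chunking the order is reversed: the text is split first and each chunk is embedded in isolation, which is where cross-chunk context gets lost.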

See [rag-summery-framework.md](rag-summery-framework.md) for a comprehensive research review.

## Documentation

- 📚 **[Documentation Hub](docs/README.md)** - Complete documentation index
- 🚀 **[Getting Started](docs/GETTING_STARTED.md)** - Installation and quick start
- 📖 **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation
- 🐳 **[Docker Guide](docs/DOCKER.md)** - Docker deployment
- 📓 **[Examples](examples/)** - Code examples and Jupyter notebooks

## Development

```bash
# Clone repository
git clone https://github.com/chunkflow/chunk-flow.git
cd chunk-flow

# Install with dev dependencies
make install-dev

# Run tests
make test

# Format and lint
make format
make lint

# Run full CI locally
make ci
```

## Contributing

ChunkFlow is currently a solo project. While contributions are not being accepted at this time, you can:

- **Report Bugs**: [GitHub Issues](https://github.com/chunkflow/chunk-flow/issues)
- **Request Features**: [GitHub Issues](https://github.com/chunkflow/chunk-flow/issues)
- **Ask Questions**: [GitHub Discussions](https://github.com/chunkflow/chunk-flow/discussions)
- **Star the Repo**: Help spread the word!

See [CONTRIBUTING.md](CONTRIBUTING.md) for more details.

## Roadmap

**Phase 1-4: Core Framework** ✅ COMPLETED
- [x] Core chunking strategies (Fixed, Recursive, Document-based)
- [x] Embedding providers (OpenAI, HuggingFace)
- [x] Semantic chunking
- [x] Late chunking implementation
- [x] Comprehensive evaluation metrics (12 metrics across 3 categories)
- [x] Evaluation pipeline and comparison framework

**Phase 5-6: Analysis & API** ✅ COMPLETED
- [x] ResultsDataFrame with analysis methods
- [x] Visualization utilities (heatmaps, comparison charts)
- [x] FastAPI server with all endpoints
- [x] Docker setup (multi-stage, production-ready)

**Phase 7-9: Testing & Release** ✅ COMPLETED
- [x] Comprehensive testing (unit, integration, E2E)
- [x] Benchmark suite with standard datasets
- [x] CI/CD pipeline (GitHub Actions)
- [x] Complete documentation
- [x] PyPI package release workflow
- [x] Production deployment guides

**v0.1.0 READY FOR RELEASE!** 🚀

**Future Roadmap (v0.2.0+)**
- [ ] Additional providers (Google Vertex, Cohere, Voyage AI)
- [ ] LLM-based chunking (GPT/Claude)
- [ ] Streamlit dashboard
- [ ] Redis caching and PostgreSQL storage
- [ ] Agentic chunking with dynamic boundaries
- [ ] Fine-tuning pipeline for custom strategies
- [ ] Public benchmark datasets (BeIR, MS MARCO)

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Citation

If you use ChunkFlow in your research, please cite:

```bibtex
@software{chunkflow2024,
  title = {ChunkFlow: Production-Grade Text Chunking Framework for RAG Systems},
  author = {ChunkFlow Development},
  year = {2024},
  url = {https://github.com/chunkflow/chunk-flow}
}
```

## Acknowledgments

ChunkFlow builds on research from Jina AI, ExplodingGradients, and the broader RAG community. Built with passion for the neglected field of text chunking.

---

[Documentation](https://chunk-flow.readthedocs.io) | [GitHub](https://github.com/chunkflow/chunk-flow) | [PyPI](https://pypi.org/project/chunk-flow/)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "chunckerflow",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "chunking, rag, retrieval, embeddings, nlp, text-processing, semantic-search",
    "author": "ChunkFlow Contributors",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/af/18/493e2b07178cfcae2018e5aa1f2a100dfe6af46764bfeb3f1a3ea9972deb/chunckerflow-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Production-grade async text chunking framework for RAG systems",
    "version": "0.1.0",
    "project_urls": {
        "Documentation": "https://github.com/guybass/chunckerflow",
        "Homepage": "https://github.com/guybass/chunckerflow",
        "Issues": "https://github.com/guybass/chunckerflow/issues",
        "Repository": "https://github.com/guybass/chunckerflow"
    },
    "split_keywords": [
        "chunking",
        " rag",
        " retrieval",
        " embeddings",
        " nlp",
        " text-processing",
        " semantic-search"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a3b09c0c29e3c14c073c70921dd782393748791569620d212c13d008c019de69",
                "md5": "2bf7cf991b6fff04f2655b3adb1aaf6a",
                "sha256": "c6bd17f8ea0fcb08381ae876542c8025e77f7064f96e13e85109e03a1acbe52e"
            },
            "downloads": -1,
            "filename": "chunckerflow-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2bf7cf991b6fff04f2655b3adb1aaf6a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 72125,
            "upload_time": "2025-10-19T14:10:32",
            "upload_time_iso_8601": "2025-10-19T14:10:32.403391Z",
            "url": "https://files.pythonhosted.org/packages/a3/b0/9c0c29e3c14c073c70921dd782393748791569620d212c13d008c019de69/chunckerflow-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "af18493e2b07178cfcae2018e5aa1f2a100dfe6af46764bfeb3f1a3ea9972deb",
                "md5": "ff7057e04303a1bfd15f75549e9ade68",
                "sha256": "c4c0832942ac09b942bcd508c48c541b55e6af9d89bc69136b028a4655564445"
            },
            "downloads": -1,
            "filename": "chunckerflow-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ff7057e04303a1bfd15f75549e9ade68",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 85523,
            "upload_time": "2025-10-19T14:10:33",
            "upload_time_iso_8601": "2025-10-19T14:10:33.929928Z",
            "url": "https://files.pythonhosted.org/packages/af/18/493e2b07178cfcae2018e5aa1f2a100dfe6af46764bfeb3f1a3ea9972deb/chunckerflow-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-19 14:10:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "guybass",
    "github_project": "chunckerflow",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "chunckerflow"
}
