# ChunkFlow
**Production-grade async text chunking framework for RAG systems**
ChunkFlow is a comprehensive, extensible framework for text chunking in Retrieval-Augmented Generation (RAG) systems. Built with production-grade practices, it provides multiple chunking strategies, pluggable embedding providers, and comprehensive evaluation metrics to help you make data-driven decisions.
## Why ChunkFlow?
RAG systems process billions of documents daily, and **chunking quality directly impacts retrieval accuracy, computational costs, and user experience**. Poor chunking causes hallucinations, missed context, and wasted API calls.
ChunkFlow addresses this with:
- **6+ chunking strategies** - From simple fixed-size to revolutionary late chunking
- **Pluggable architecture** - Easy integration with any embedding provider
- **Comprehensive evaluation** - 12+ metrics including RAGAS-inspired, NDCG, semantic coherence
- **Data-driven comparison** - Built-in strategy comparison and ranking framework
- **Production-ready** - Async-first, type-safe, structured logging, extensible design
## Key Features
### Chunking Strategies
- **Fixed-Size** - Simple character/token-based splitting (10K+ chunks/sec)
- **Recursive** - Hierarchical splitting with natural boundaries (recommended default)
- **Document-Based** - Format-aware (Markdown, HTML)
- **Semantic** - Embedding-based topic detection with similarity thresholds
- **Late Chunking** - Revolutionary context-preserving approach (6-9% accuracy improvement, Jina AI 2024)
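To make the first two strategies concrete, here is a minimal, dependency-free sketch of fixed-size splitting with overlap and of recursive splitting on progressively finer separators. The function names, the separator list, and the character-based sizing are assumptions made for illustration; they are not ChunkFlow's implementation, which this README does not show.

```python
from typing import List, Sequence


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into character windows; consecutive windows share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def recursive_chunks(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),  # hypothetical defaults
) -> List[str]:
    """Split on the coarsest separator first and recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to hard character windows.
        return fixed_size_chunks(text, chunk_size)
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(head):
        candidate = piece if not buffer else buffer + head + piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep merging small neighbouring pieces
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, rest))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```

In this sketch the recursive variant never cuts inside a word unless a single word exceeds `chunk_size`, which is the practical reason a boundary-aware splitter is usually a safer default than raw windows.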
### Embedding Providers
- **OpenAI** - text-embedding-3-small/large with automatic cost tracking
- **HuggingFace** - Sentence Transformers (local, free, GPU/CPU support)
- **Extensible** - Easy to add custom providers via EmbeddingProvider base class
### Evaluation Metrics
- **Retrieval** (4 metrics): NDCG@k, Recall@k, Precision@k, MRR
- **Semantic** (4 metrics): Coherence, Boundary Quality, Chunk Stickiness (MoC), Topic Diversity
- **RAG Quality** (4 metrics): Context Relevance, Answer Faithfulness, Context Precision, Context Recall (RAGAS-inspired)
- **Framework**: Unified EvaluationPipeline + StrategyComparator for comprehensive analysis
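For readers unfamiliar with the retrieval metrics listed above, the following sketch shows their textbook definitions with binary relevance. It is written against plain Python lists and sets; the function names are hypothetical and ChunkFlow's own metric classes may differ in signature and edge-case handling.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved chunk indices that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for idx in ranked[:k] if idx in relevant) / k


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant chunk indices that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for idx in ranked[:k] if idx in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]], relevants: Sequence[Set[int]]) -> float:
    """MRR: the reciprocal rank averaged over several queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2) for pos, idx in enumerate(ranked[:k]) if idx in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

With a ranking of `[3, 0, 7, 2, 5]` and relevant indices `{0, 2, 5}`, precision@3 and recall@3 both come out at one third because only chunk 0 appears in the top three.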
## Quick Start
### Installation
```bash
# Basic installation
pip install chunk-flow

# With specific providers (quote the extras so shells such as zsh do not expand the brackets)
pip install "chunk-flow[openai]"
pip install "chunk-flow[huggingface]"

# With API server
pip install "chunk-flow[api]"

# Everything
pip install "chunk-flow[all]"
```
### Basic Usage
```python
import asyncio

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline


async def main(document: str) -> None:
    # 1. Chunk your document
    chunker = StrategyRegistry.create("recursive", {"chunk_size": 512, "overlap": 100})
    result = await chunker.chunk(document)

    # 2. Embed chunks
    embedder = EmbeddingProviderFactory.create("openai", {"model": "text-embedding-3-small"})
    emb_result = await embedder.embed_texts(result.chunks)

    # 3. Evaluate quality (semantic metrics - no ground truth needed)
    pipeline = EvaluationPipeline(metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness"])
    metrics = await pipeline.evaluate(
        chunks=result.chunks,
        embeddings=emb_result.embeddings,
    )

    print(f"Semantic Coherence: {metrics['semantic_coherence'].score:.4f}")
    print(f"Boundary Quality: {metrics['boundary_quality'].score:.4f}")


# The API is async, so run the coroutine inside an event loop.
asyncio.run(main("Your document text here..."))
```
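The semantic metrics in the example are computed from chunk embeddings alone, which is why no ground truth is required. This README does not give ChunkFlow's formulas, so the sketch below is only an illustrative stand-in: it scores a chunk sequence by the mean cosine similarity of adjacent chunk embeddings. The function names and the choice of adjacency as the unit of comparison are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarity_score(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between each chunk embedding and the next one.

    Hypothetical proxy, not ChunkFlow's `semantic_coherence` definition.
    """
    if len(embeddings) < 2:
        return 1.0  # a single chunk has no neighbour to disagree with
    sims = [
        cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    return sum(sims) / len(sims)
```

A score near 1.0 under this proxy means neighbouring chunks discuss similar content, while a low score means most boundaries coincide with topic changes. Whether high or low is desirable depends on the metric being modelled.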
### Strategy Comparison
Compare multiple strategies to find the best for your use case:
```python
import asyncio

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, StrategyComparator


async def compare(document, query_emb):
    """The caller supplies the document text and a query embedding."""
    # Create strategies to compare
    strategies = [
        StrategyRegistry.create("fixed_size", {"chunk_size": 500, "overlap": 50}),
        StrategyRegistry.create("recursive", {"chunk_size": 400, "overlap": 80}),
        StrategyRegistry.create("semantic", {"threshold_percentile": 80}),
    ]

    # Get embedder
    embedder = EmbeddingProviderFactory.create("huggingface")

    # Set up evaluation pipeline
    pipeline = EvaluationPipeline(
        metrics=["ndcg_at_k", "semantic_coherence", "chunk_stickiness"],
    )

    # Compare strategies
    comparison = await pipeline.compare_strategies(
        strategies=strategies,
        text=document,
        ground_truth={"query_embedding": query_emb, "relevant_indices": [0, 2, 5]},
    )

    # Generate comparison report
    report = StrategyComparator.generate_comparison_report(
        {name: comparison["strategies"][name]["metric_results"]
         for name in comparison["strategies"].keys()}
    )
    print(report)


# Run with: asyncio.run(compare(document, query_emb))
# See examples/strategy_comparison.py for a complete working example
```
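The `threshold_percentile` setting passed to the semantic strategy above suggests percentile-based breakpoint detection. A common form of that technique is sketched below: compute the cosine distance between consecutive sentence embeddings and start a new chunk wherever the distance exceeds the chosen percentile of all distances. This is a generic sketch with hypothetical function names, assuming linear-interpolation percentiles; ChunkFlow's semantic strategy may work differently.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as maximal (1.0) if a vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_breakpoints(
    sentence_embeddings: Sequence[Sequence[float]],
    threshold_percentile: float = 80.0,
) -> List[int]:
    """Return sentence indices at which a new chunk should start."""
    distances = [
        cosine_distance(sentence_embeddings[i], sentence_embeddings[i + 1])
        for i in range(len(sentence_embeddings) - 1)
    ]
    if not distances:
        return []
    cutoff = percentile(distances, threshold_percentile)
    # A jump larger than the cutoff is treated as a topic change.
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]


def group_sentences(sentences: Sequence[str], breakpoints: Sequence[int]) -> List[str]:
    """Join sentences into chunks, cutting before each breakpoint index."""
    chunks: List[str] = []
    start = 0
    for stop in list(breakpoints) + [len(sentences)]:
        if stop > start:
            chunks.append(" ".join(sentences[start:stop]))
            start = stop
    return chunks
```

Raising the percentile makes the cutoff stricter, so fewer boundaries are emitted and chunks grow larger. Lowering it produces more, smaller chunks.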
### API Server
```bash
# Start FastAPI server
uvicorn chunk_flow.api.app:app --reload

# Use the API
curl -X POST "http://localhost:8000/chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document here...",
    "strategy": "recursive",
    "strategy_config": {"chunk_size": 512}
  }'
```
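The same request can be issued from Python using only the standard library. The sketch below mirrors the `curl` call above (same URL, header, and JSON fields). The helper names are hypothetical, and because the response schema is not documented in this README, the parsed JSON is returned as-is.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_chunk_request(
    text: str,
    strategy: str = "recursive",
    strategy_config: Optional[Dict[str, Any]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST /chunk request shown in the curl example."""
    payload = {
        "text": text,
        "strategy": strategy,
        "strategy_config": strategy_config or {},
    }
    return urllib.request.Request(
        url=f"{base_url}/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chunk_via_api(text: str, **kwargs: Any) -> Any:
    """Send the request and return the decoded JSON body (schema not assumed)."""
    request = build_chunk_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `chunk_via_api("Your document here...", strategy_config={"chunk_size": 512})` sends the same payload as the `curl` command.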
## Architecture
ChunkFlow follows a clean, extensible architecture:
```
┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│      /chunk, /evaluate, /compare, /benchmark, /export       │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                     Orchestration Layer                     │
│   ChunkingPipeline | EvaluationEngine | ResultsAggregator   │
└─────────────────────────────────────────────────────────────┘
                               ↓
         ┌─────────────────────┴─────────────────────┐
         ↓                                           ↓
┌──────────────────┐                      ┌──────────────────────┐
│ Chunking Module  │                      │   Embedding Module   │
│ ---------------- │                      │ -------------------- │
│ • Fixed-Size     │                      │ • OpenAI             │
│ • Recursive      │                      │ • HuggingFace        │
│ • Semantic       │                      │ • Google Vertex      │
│ • Late           │                      │ • Cohere             │
└──────────────────┘                      └──────────────────────┘
```
## Research-Backed
ChunkFlow implements cutting-edge research findings:
- **Late Chunking** (Jina AI, 2024): 6-9% improvement in retrieval accuracy
- **Optimal Chunk Sizes** (Bhat et al., 2025): 64-128 tokens for facts, 512-1024 for context
- **Semantic Independence** (HOPE, 2025): 56% gain in factual correctness
- **MoC Metrics** (Zhao et al., 2025): Boundary clarity and chunk stickiness
- **RAGAS** (ExplodingGradients, 2023): Reference-free RAG evaluation
See [rag-summery-framework.md](rag-summery-framework.md) for a comprehensive research review.
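The core step of late chunking can be sketched in a few lines: the whole document is first encoded by a long-context model into contextualised token embeddings, and only afterwards are those token vectors mean-pooled per chunk span. The sketch below covers the pooling step only and assumes the token embeddings and chunk token spans are already available. The function names are hypothetical and this is not ChunkFlow's implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk_embeddings(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool document-level token embeddings into one vector per chunk.

    token_embeddings: one vector per token, produced by encoding the full
        document in a single pass (so each token has seen the whole context).
    spans: half-open (start, end) token index ranges, one per chunk.
    """
    pooled: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        pooled.append(mean_pool(token_embeddings[start:end]))
    return pooled
```

The difference from conventional chunking is the order of operations: here chunk boundaries are applied after encoding, so each chunk vector is built from tokens that were embedded with the surrounding document in view.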
## Documentation
- 📚 **[Documentation Hub](docs/README.md)** - Complete documentation index
- 🚀 **[Getting Started](docs/GETTING_STARTED.md)** - Installation and quick start
- 📖 **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation
- 🐳 **[Docker Guide](docs/DOCKER.md)** - Docker deployment
- 📓 **[Examples](examples/)** - Code examples and Jupyter notebooks
## Development
```bash
# Clone repository
git clone https://github.com/chunkflow/chunk-flow.git
cd chunk-flow
# Install with dev dependencies
make install-dev
# Run tests
make test
# Format and lint
make format
make lint
# Run full CI locally
make ci
```
## Contributing
ChunkFlow is currently a solo project. While contributions are not being accepted at this time, you can:
- **Report Bugs**: [GitHub Issues](https://github.com/chunkflow/chunk-flow/issues)
- **Request Features**: [GitHub Issues](https://github.com/chunkflow/chunk-flow/issues)
- **Ask Questions**: [GitHub Discussions](https://github.com/chunkflow/chunk-flow/discussions)
- **Star the Repo**: Help spread the word!
See [CONTRIBUTING.md](CONTRIBUTING.md) for more details.
## Roadmap
**Phase 1-4: Core Framework** ✅ COMPLETED
- [x] Core chunking strategies (Fixed, Recursive, Document-based)
- [x] Embedding providers (OpenAI, HuggingFace)
- [x] Semantic chunking
- [x] Late chunking implementation
- [x] Comprehensive evaluation metrics (12 metrics across 3 categories)
- [x] Evaluation pipeline and comparison framework
**Phase 5-6: Analysis & API** ✅ COMPLETED
- [x] ResultsDataFrame with analysis methods
- [x] Visualization utilities (heatmaps, comparison charts)
- [x] FastAPI server with all endpoints
- [x] Docker setup (multi-stage, production-ready)
**Phase 7-9: Testing & Release** ✅ COMPLETED
- [x] Comprehensive testing (unit, integration, E2E)
- [x] Benchmark suite with standard datasets
- [x] CI/CD pipeline (GitHub Actions)
- [x] Complete documentation
- [x] PyPI package release workflow
- [x] Production deployment guides
**v0.1.0 READY FOR RELEASE!** 🚀
**Future Roadmap (v0.2.0+)**
- [ ] Additional providers (Google Vertex, Cohere, Voyage AI)
- [ ] LLM-based chunking (GPT/Claude)
- [ ] Streamlit dashboard
- [ ] Redis caching and PostgreSQL storage
- [ ] Agentic chunking with dynamic boundaries
- [ ] Fine-tuning pipeline for custom strategies
- [ ] Public benchmark datasets (BeIR, MS MARCO)
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Citation
If you use ChunkFlow in your research, please cite:
```bibtex
@software{chunkflow2024,
title = {ChunkFlow: Production-Grade Text Chunking Framework for RAG Systems},
author = {ChunkFlow Development},
year = {2024},
url = {https://github.com/chunkflow/chunk-flow}
}
```
## Acknowledgments
ChunkFlow builds on research from Jina AI, ExplodingGradients, and the broader RAG community. Built with passion for the neglected field of text chunking.
---
[Documentation](https://chunk-flow.readthedocs.io) | [GitHub](https://github.com/chunkflow/chunk-flow) | [PyPI](https://pypi.org/project/chunk-flow/)