ai-chunking 0.1.4

- Homepage: https://github.com/nexla-opensource/ai-chunking
- Summary: A powerful Python library for semantic document chunking and enrichment using AI
- Upload time: 2025-03-16 20:44:19
- Author: Amey Desai
- Requires Python: <4.0,>=3.9
- License: MIT
- Keywords: ai, nlp, document-processing, chunking, semantic-analysis, rag, embeddings
- Requirements: openai, pydantic, tiktoken, langchain, anthropic, google-generativeai, litellm, instructor, langchain_text_splitters, langchain-experimental, langchain-openai, tqdm, python-dotenv
# AI Chunking

A powerful Python library for semantic document chunking and enrichment using AI. It provides several strategies for splitting text while preserving semantic meaning, and is particularly useful for processing Markdown documentation.

## Features

- Multiple chunking strategies:
  - Recursive Text Splitting: Hierarchical document splitting with configurable chunk sizes
  - Section-based Semantic Chunking: Structure-aware semantic splitting using section markers
  - Base Chunking: Extensible base implementation for custom chunking strategies

- Key Benefits:
  - Preserve semantic meaning across chunks
  - Configurable chunk sizes and overlap
  - Support for various text formats
  - Easy to extend with custom chunking strategies

## Installation

```bash
pip install ai-chunking
```

## Quick Start

```python
from ai_chunking import RecursiveTextSplitter

# Initialize a recursive text splitter
chunker = RecursiveTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Read and process a markdown file
with open('documentation.md', 'r') as f:
    markdown_content = f.read()

chunks = chunker.split_text(markdown_content)

# Access the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n---")
```

## Usage Examples

### Recursive Text Splitting

The `RecursiveTextSplitter` splits markdown content into chunks while preserving markdown structure:

```python
from ai_chunking import RecursiveTextSplitter

splitter = RecursiveTextSplitter(
    chunk_size=1000,  # Maximum size of each chunk
    chunk_overlap=100,  # Overlap between chunks
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]  # Markdown-aware separators
)

# Process a large markdown documentation file
with open('large_documentation.md', 'r') as f:
    markdown_content = f.read()

chunks = splitter.split_text(markdown_content)

# Save chunks to separate files for processing
for i, chunk in enumerate(chunks, 1):
    with open(f'chunk_{i}.md', 'w') as f:
        f.write(chunk)
```

### Section-based Semantic Chunking

The `SectionBasedSemanticChunker` is particularly well-suited for markdown files as it respects heading hierarchy:

```python
from ai_chunking import SectionBasedSemanticChunker

chunker = SectionBasedSemanticChunker(
    section_markers=["# ", "## ", "### "],  # Markdown heading levels
    min_chunk_size=100,
    max_chunk_size=1000
)

# Process markdown documentation
with open('api_docs.md', 'r') as f:
    markdown_content = f.read()

chunks = chunker.split_text(markdown_content)

# Print sections with their headings
for i, chunk in enumerate(chunks, 1):
    print(f"Section {i}:\n{chunk}\n---")
```

### Custom Chunking Strategy

You can create your own markdown-specific chunking strategy:

```python
from ai_chunking import BaseChunker
import re

class MarkdownChunker(BaseChunker):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Split immediately before each heading (zero-width lookahead),
        # so the heading line stays attached to its section
        self.heading_pattern = re.compile(r'(?=^#{1,6}\s)', re.MULTILINE)

    def split_text(self, text: str) -> list[str]:
        # Split on markdown headings while preserving them
        sections = self.heading_pattern.split(text)
        # Remove empty sections and trim whitespace
        return [section.strip() for section in sections if section.strip()]

# Usage
chunker = MarkdownChunker()
with open('README.md', 'r') as f:
    chunks = chunker.split_text(f.read())
```

## Configuration

Each chunker accepts different configuration parameters:

### RecursiveTextSplitter
- `chunk_size`: Maximum size of each chunk (default: 500)
- `chunk_overlap`: Number of characters to overlap between chunks (default: 50)
- `separators`: List of separators to use for splitting (default: ["\n\n", "\n", " ", ""])
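
To make the separator cascade concrete, here is a standalone sketch of recursive splitting in plain Python. It is not the library's implementation (which builds on LangChain's splitters); the function name and logic are illustrative only: try each separator in order, merge pieces while they fit under `chunk_size`, and recurse with the remaining separators on any piece that is still too large.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Illustrative recursive splitter: cascade through separators until chunks fit."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # The final "" separator means "split into individual characters"
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate            # keep merging into the current chunk
        elif len(piece) <= chunk_size:
            if current:
                chunks.append(current)     # flush and start a fresh chunk
            current = piece
        else:
            if current:
                chunks.append(current)
                current = ""
            # Piece is still too large: recurse with finer-grained separators
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks
```

With `recursive_split("aaaa\n\nbbbb\n\ncccc", ["\n\n", "\n", " "], 10)` the first two paragraphs merge into one 10-character chunk and the third becomes its own chunk, which is the behavior the `chunk_size` parameter above describes (this sketch omits `chunk_overlap` for brevity).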

### SectionBasedSemanticChunker
- `section_markers`: List of strings that indicate section boundaries
- `min_chunk_size`: Minimum size of a chunk (default: 100)
- `max_chunk_size`: Maximum size of a chunk (default: 1000)
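
To illustrate how `min_chunk_size` and `max_chunk_size` can interact with section markers, here is a standalone sketch (a hypothetical helper, not the library's internals): split just before each marker so headings stay with their sections, then merge a section into its predecessor when the predecessor is below the minimum and the merge stays within the maximum.

```python
import re

def split_by_sections(text: str, section_markers: list[str],
                      min_chunk_size: int = 100,
                      max_chunk_size: int = 1000) -> list[str]:
    """Illustrative splitter: cut before each section marker, merge undersized chunks."""
    # Zero-width lookahead at line starts keeps the marker with its section
    alternation = "|".join(re.escape(m) for m in section_markers)
    sections = re.split(rf"(?m)(?=^(?:{alternation}))", text)
    sections = [s for s in sections if s.strip()]

    chunks: list[str] = []
    for section in sections:
        if (chunks
                and len(chunks[-1]) < min_chunk_size
                and len(chunks[-1]) + len(section) <= max_chunk_size):
            chunks[-1] += section  # previous chunk too small: absorb this section
        else:
            chunks.append(section)
    return chunks
```

For example, with `section_markers=["# ", "## ", "### "]` a short `# Title` section gets merged with the `## ` section that follows it when `min_chunk_size` is large, but stays separate when `min_chunk_size` is small.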

## Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a new branch for your feature
3. Write tests for your changes
4. Submit a pull request

For more details, see our [Contributing Guidelines](CONTRIBUTING.md).

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

- Issue Tracker: [GitHub Issues](https://github.com/nexla-opensource/ai-chunking/issues)
- Documentation: Coming soon

## Citation

If you use this software in your research, please cite:

```bibtex
@software{ai_chunking2024,
  title = {AI Chunking: A Python Library for Semantic Document Processing},
  author = {Desai, Amey},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/nexla-opensource/ai-chunking}
}
```


            
