| Name | cpp-chunker |
| Version | 0.1.4 |
| Summary | C++ chunker |
| upload_time | 2025-09-20 08:29:01 |
| maintainer | None |
| docs_url | None |
| author | Christian |
| requires_python | >=3.7 |
| license | MIT |
| keywords | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
        
        
            
            # Semantic Text Chunker
**Advanced semantic text chunking library that preserves meaning and context**
## Overview
This project provides a C++ library (with Python bindings via pybind11) for fast semantic chunking of large bodies of text. It splits text into coherent, context-preserving segments using semantic similarity, discourse markers, and section detection.
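To make the idea concrete, here is a minimal pure-Python sketch of one of the signals mentioned above: treating sentences that open with a discourse marker as likely chunk boundaries. The marker list and the scoring rule are illustrative assumptions, not the library's actual algorithm.

```python
# Illustrative sketch only: the real library combines several signals in C++.
# This marker set is an assumption for demonstration purposes.
DISCOURSE_MARKERS = ("however", "finally", "in addition", "moreover", "meanwhile")

def likely_boundary(sentence: str) -> bool:
    """Flag a sentence that opens with a discourse marker as a likely chunk boundary."""
    lowered = sentence.strip().lower()
    return any(lowered.startswith(marker) for marker in DISCOURSE_MARKERS)

sentences = [
    "The chunker splits text into segments.",
    "However, boundaries must preserve context.",
    "Finally, each segment is scored for coherence.",
]
print([likely_boundary(s) for s in sentences])
```

A production chunker would weigh such boundary hints against chunk-size constraints and semantic similarity rather than splitting on every marker.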
## Features
- Semantic chunking of text based on meaning and context
- Adjustable chunk size and coherence thresholds
- Extraction of chunk details: coherence scores, dominant topics, section types
- Python bindings for easy integration
- Memory-safe C++ implementation with comprehensive error handling
## Installation
### Prerequisites
- C++17 compiler (GCC 7+, Clang 5+, or MSVC 2017+)
- [pybind11](https://github.com/pybind/pybind11) (version 2.10.0+)
- CMake (version 3.16+)
- Python 3.7+
### Installing pybind11
#### Option 1: Using pip (Recommended)
```bash
pip install pybind11
```
#### Option 2: Using conda
```bash
conda install pybind11
```
#### Option 3: From source
```bash
git clone https://github.com/pybind/pybind11.git
cd pybind11
pip install .
```
#### Option 4: System package manager
```bash
# Ubuntu/Debian
sudo apt-get install pybind11-dev
# macOS with Homebrew
brew install pybind11
# Arch Linux
sudo pacman -S pybind11
```
### Build Instructions
#### Method 1: Using CMake (Recommended)
```bash
mkdir build
cd build
cmake ..
make
```
#### Method 2: Using pip (Automatic build)
```bash
pip install .
```
#### Method 3: Development installation
```bash
pip install -e .
```
## Usage
### Basic Usage
#### Simple Text Chunking
```python
import chunker_cpp
text = """This is a long text for testing the semantic chunker. It contains multiple sentences, some of which are quite lengthy and elaborate, while others are short. The purpose of this text is to simulate a realistic document that might be processed by the chunker.
In addition to regular sentences, this text includes various structures such as lists:
- First item in the list.
- Second item, which is a bit longer and more descriptive.
- Third item.
There are also paragraphs separated by blank lines.
Here is a new paragraph. It discusses a different topic and is intended to test how the chunker handles paragraph boundaries. Sometimes, paragraphs can be very long, spanning several lines and containing a lot of information. Other times, they are short.
Finally, this text ends with a concluding sentence to ensure the chunker can handle the end of input gracefully."""
# Basic chunking with default parameters
chunks = chunker_cpp.chunk_text_semantically(text)
print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")
```
#### Advanced Usage with Custom Parameters
```python
# Custom chunking parameters
chunks = chunker_cpp.chunk_text_semantically(
    text,
    max_chunk_size=1000,      # Maximum characters per chunk
    min_chunk_size=200,       # Minimum characters per chunk
    min_coherence_threshold=0.5  # Higher threshold for more coherent chunks
)
```
#### Getting Detailed Chunk Information
```python
# Get detailed information about each chunk
chunk_details = chunker_cpp.get_chunk_details(text)
for i, detail in enumerate(chunk_details):
    print(f"Chunk {i+1}:")
    print(f"  Text: {detail['text'][:100]}...")
    print(f"  Coherence Score: {detail['coherence_score']:.3f}")
    print(f"  Dominant Topics: {detail['dominant_topics']}")
    print(f"  Sentence Count: {detail['sentence_count']}")
    print(f"  Section Type: {detail['primary_section_type']}")
    print()
```
### Using the Class Interface
```python
# Create a chunker instance
chunker = chunker_cpp.SemanticTextChunker()
# Use the instance methods
chunks = chunker.chunk_text_semantically(text, max_chunk_size=1500)
details = chunker.get_chunk_details(text, min_coherence_threshold=0.4)
```
### Error Handling and Memory Safety
The library includes comprehensive error handling and memory safety features:
```python
import chunker_cpp
# Safe handling of edge cases
try:
    # Empty string
    result = chunker_cpp.chunk_text_semantically("")
    print("Empty string handled:", result)
    # Very large text
    large_text = "This is a test sentence. " * 10000
    result = chunker_cpp.chunk_text_semantically(large_text, 1000, 200)
    print("Large text processed successfully")
    # Special characters
    special_text = "Text with special chars: \x00\x01\x02\n\r\t" + "A" * 1000
    result = chunker_cpp.chunk_text_semantically(special_text)
    print("Special characters handled safely")
except Exception as e:
    print(f"Error: {e}")
```
### Performance Considerations
```python
import chunker_cpp
import time
# Benchmarking chunking performance
text = "Your large text here..." * 100  # Create a large text
start_time = time.time()
chunks = chunker_cpp.chunk_text_semantically(text)
end_time = time.time()
print(f"Processed {len(text)} characters in {end_time - start_time:.3f} seconds")
print(f"Generated {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(chunk) for chunk in chunks) / len(chunks):.0f} characters")
```
## API Reference
### Functions
#### `chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`
Chunk text semantically while preserving meaning and context.
**Parameters:**
- `text` (str): The input text to be chunked
- `max_chunk_size` (int): Maximum size for each chunk (default: 2000)
- `min_chunk_size` (int): Minimum size for each chunk (default: 500)
- `min_coherence_threshold` (float): Minimum coherence threshold (default: 0.3)
**Returns:**
- `List[str]`: List of text chunks
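To illustrate how `max_chunk_size` and `min_chunk_size` can interact, here is a naive pure-Python approximation: sentences are greedily packed up to the maximum size, and a trailing fragment shorter than the minimum is merged into the previous chunk. This is only a sketch of the size constraints; the actual library layers semantic scoring on top.

```python
import re

# Illustrative sketch only: approximates the size constraints, not the
# library's semantic boundary selection.
def greedy_chunk(text: str, max_chunk_size: int = 2000, min_chunk_size: int = 500) -> list[str]:
    """Greedily pack sentences into chunks no larger than max_chunk_size."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        # A trailing fragment below min_chunk_size is merged backwards.
        if chunks and len(current) < min_chunk_size:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks

print(greedy_chunk("One. Two. Three. Four.", max_chunk_size=12, min_chunk_size=4))
```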
#### `get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`
Get detailed information about each chunk including coherence scores and topics.
**Parameters:**
- `text` (str): The input text to be chunked
- `max_chunk_size` (int): Maximum size for each chunk (default: 2000)
- `min_chunk_size` (int): Minimum size for each chunk (default: 500)
- `min_coherence_threshold` (float): Minimum coherence threshold (default: 0.3)
**Returns:**
- `List[Dict[str, Any]]`: List of dictionaries containing chunk details
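The library does not document how `coherence_score` is computed. As a rough intuition for what such a score can capture, here is one plausible proxy in pure Python: average lexical (Jaccard) overlap between consecutive sentences. This is an assumption for illustration, not the library's metric.

```python
# Illustrative sketch only: a simple lexical proxy for chunk coherence.
def jaccard_coherence(sentences: list[str]) -> float:
    """Average word-set overlap between consecutive sentences, in [0.0, 1.0]."""
    if len(sentences) < 2:
        return 1.0  # a single sentence is trivially coherent
    scores = []
    for a, b in zip(sentences, sentences[1:]):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        union = wa | wb
        scores.append(len(wa & wb) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

print(jaccard_coherence(["the cat sat", "the cat ran", "dogs bark"]))
```

Higher overlap between adjacent sentences suggests a more topically unified chunk, which is the property a `min_coherence_threshold` is meant to enforce.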
### Class
#### `SemanticTextChunker`
Advanced semantic text chunking class that preserves meaning and context.
**Methods:**
- `chunk_text_semantically(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`
- `get_chunk_details(text, max_chunk_size=2000, min_chunk_size=500, min_coherence_threshold=0.3)`
## Development
### Building from Source
```bash
# Clone the repository
git clone https://github.com/Lumen-Labs/cpp-chunker.git
cd cpp_chunker
# Install dependencies
pip install pybind11
# Build using CMake
mkdir build && cd build
cmake ..
make
# Or build using pip
pip install .
```
### Running Tests
```bash
# Run basic functionality test
python test_cunk.py
# Run memory safety tests
python test_memory_safety.py
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.