nupunkt-rs

Name: nupunkt-rs
Version: 0.1.0
Home page: https://aleainstitute.ai
Summary: High-performance Rust implementation of nupunkt sentence/paragraph tokenization
Upload time: 2025-08-11 01:23:25
Requires Python: >=3.11
License: MIT
Keywords: nlp, natural language processing, tokenization, sentence boundary detection, paragraph detection, punkt, text processing, linguistics, rust
# nupunkt-rs

[![CI](https://github.com/alea-institute/nupunkt-rs/actions/workflows/CI.yml/badge.svg)](https://github.com/alea-institute/nupunkt-rs/actions/workflows/CI.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Rust](https://img.shields.io/badge/rust-stable-orange.svg)](https://www.rust-lang.org/)

High-performance Rust implementation of [nupunkt](https://github.com/alea-institute/nupunkt), a modern reimplementation of the Punkt sentence tokenizer optimized for high-precision legal and financial text processing. This project provides the same accurate sentence segmentation as the original Python nupunkt library, but with **3x faster performance** thanks to Rust's efficiency.

Based on the research paper: **[Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary](https://arxiv.org/abs/2504.04131)** (Bommarito et al., 2025)

## Features

- **🚀 High Performance**: 30M+ characters/second (3x faster than Python nupunkt)
- **🎯 High Precision**: 91.1% precision on legal text benchmarks
- **⚡ Runtime Adjustable**: Tune precision/recall balance at inference time without retraining
- **📚 Legal-Optimized**: Pre-trained model handles complex legal abbreviations and citations
- **🐍 Python API**: Drop-in replacement for Python nupunkt with PyO3 bindings
- **🧵 Thread-Safe**: Safe for parallel processing

## Installation

### From PyPI

```bash
# pip
pip install nupunkt-rs

# uv
uv pip install nupunkt-rs
```

### From Source

1. **Prerequisites**:
   - Python 3.11+
   - Rust toolchain (install from [rustup.rs](https://rustup.rs/))
   - maturin (`pip install maturin`)

2. **Clone and Install**:
```bash
git clone https://github.com/alea-institute/nupunkt-rs.git
cd nupunkt-rs

# pip
pip install maturin
maturin develop --release

# uv
uvx maturin develop --release --uv
```

## Quick Start

### Why nupunkt-rs for Legal & Financial Documents?

Most tokenizers fail on legal and financial text, breaking incorrectly at abbreviations like "v.", "U.S.", "Inc.", "Id.", and "Fed." This library is specifically optimized for high-precision tokenization of complex professional documents.

```python
import nupunkt_rs

# Real Supreme Court text with complex citations and abbreviations
legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150."""

# Most tokenizers would incorrectly break at "v.", "Inc.", "U.S.", "Co.", and "Id."
# nupunkt-rs handles all of these correctly:
sentences = nupunkt_rs.sent_tokenize(legal_text)
print(f"Correctly identified {len(sentences)} sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"\n{i}. {sent}")

# Output:
# Correctly identified 3 sentences:
#
# 1. As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability.
#
# 2. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
#
# 3. There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150.
```

### Fine-Tuning Precision with the `precision_recall` Parameter

The `precision_recall` parameter (0.0-1.0) controls the precision/recall trade-off: values near 0.0 favor recall (more breaks), while values near 1.0 favor precision (fewer breaks). For legal and financial documents, a setting in the 0.3-0.5 range typically avoids breaking at abbreviations while still splitting ordinary sentence boundaries.

```python
# Longer legal text to show the impact
long_legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150. This Court further noted that Rule 702 was amended in response to Daubert and this Court's subsequent cases. See Fed. Rule Evid. 702, Advisory Committee Notes to 2000 Amendments. The amendment affirms the trial court's role as gatekeeper but provides that "all types of expert testimony present questions of admissibility for the trial court." Ibid. Consequently, whether the specific expert testimony on the question at issue focuses on specialized observations, the specialized translation of those observations into theory, a specialized theory itself, or the application of such a theory in a particular case, the expert's testimony often will rest "upon an experience confessedly foreign in kind to [the jury's] own." Hand, Historical and Practical Considerations Regarding Expert Testimony, 15 Harv. L. Rev. 40, 54 (1901). For this reason, the trial judge, in all cases of proffered expert testimony, must find that it is properly grounded, well-reasoned, and not speculative before it can be admitted. The trial judge must determine whether the testimony has "a reliable basis in the knowledge and experience of [the relevant] discipline." Daubert, 509 U. S., at 592."""

# Compare different precision levels
print(f"High recall (PR=0.1): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.1))} sentences")
print(f"Balanced (PR=0.5):    {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5))} sentences")  
print(f"High precision (PR=0.9): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.9))} sentences")

# Output:
# High recall (PR=0.1): 8 sentences
# Balanced (PR=0.5):    7 sentences  
# High precision (PR=0.9): 5 sentences

# Show the actual sentences at balanced setting (recommended for legal text)
sentences = nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5)
print("\nBalanced output (PR=0.5) - Recommended for legal documents:")
for i, sent in enumerate(sentences, 1):
    # Show that abbreviations are correctly preserved
    if "v." in sent or "U.S." in sent or "Id." in sent or "Fed." in sent:
        print(f"\n{i}. โœ“ Correctly preserves legal abbreviations:")
        print(f"   {sent[:100]}...")
```

**Recommended `precision_recall` settings:**
- **Legal documents**: 0.3-0.5 (preserves "v.", "Id.", "Fed.", "U.S.", "Inc.")
- **Financial reports**: 0.4-0.6 (preserves "Inc.", "Ltd.", "Q1", monetary abbreviations)
- **Scientific papers**: 0.4-0.6 (preserves "et al.", "e.g.", "i.e.", technical terms)
- **General text**: 0.5 (default, balanced)
- **Social media**: 0.1-0.3 (more aggressive breaking for informal text)
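
One way to apply these recommendations is to map a document type to a `precision_recall` value and pass it through to `sent_tokenize`. A minimal sketch (the `PR_BY_DOCTYPE` table and `tokenize_for` helper are illustrative, not part of the library):

```python
import nupunkt_rs

# Illustrative midpoints of the recommended ranges above.
PR_BY_DOCTYPE = {
    "legal": 0.4,
    "financial": 0.5,
    "scientific": 0.5,
    "general": 0.5,
    "social": 0.2,
}

def tokenize_for(doc_type: str, text: str) -> list[str]:
    """Tokenize text with a precision/recall balance suited to its document type."""
    pr = PR_BY_DOCTYPE.get(doc_type, 0.5)  # fall back to the balanced default
    return nupunkt_rs.sent_tokenize(text, precision_recall=pr)

print(tokenize_for("legal", "See Smith v. Jones, 123 U.S. 456 (2020). The court agreed."))
```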

### Paragraph Tokenization

For documents with multiple paragraphs, you can tokenize at both paragraph and sentence levels:

```python
import nupunkt_rs

text = """First paragraph with legal citations.
See Smith v. Jones, 123 U.S. 456 (2020).

Second paragraph with more detail.
The court in Id. at 457 stated clearly."""

# Get paragraphs as lists of sentences
paragraphs = nupunkt_rs.para_tokenize(text)
print(f"Found {len(paragraphs)} paragraphs")
# Each paragraph is a list of properly segmented sentences

# Or get paragraphs as joined strings
paragraphs_joined = nupunkt_rs.para_tokenize_joined(text)
# Each paragraph is a single string with sentences joined
```

#### Advanced Approach (Using Tokenizer Class)

```python
import nupunkt_rs

# Create a tokenizer with the default model
tokenizer = nupunkt_rs.create_default_tokenizer()

# Default (0.5) - balanced mode
text = "The meeting is at 5 p.m. tomorrow. We'll discuss Q4."
print(tokenizer.tokenize(text))
# Output: ['The meeting is at 5 p.m. tomorrow.', "We'll discuss Q4."]

# High recall (0.1) - more breaks, may split at abbreviations
tokenizer.set_precision_recall_balance(0.1)
print(tokenizer.tokenize(text))
# May split after "p.m."

# High precision (0.9) - fewer breaks, preserves abbreviations
tokenizer.set_precision_recall_balance(0.9) 
print(tokenizer.tokenize(text))
# Won't split after "p.m."
```

### Common Use Cases

#### Processing Multiple Documents

```python
import nupunkt_rs

# Process multiple documents efficiently
documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences."
]

# Use list comprehension for batch processing
all_sentences = [nupunkt_rs.sent_tokenize(doc) for doc in documents]
print(all_sentences)
# Output: [['First doc.', 'Two sentences.'], ['Second document here.'], ['Third doc.', 'Also two sentences.']]
```
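
Because the tokenizer is thread-safe, larger batches can also be spread across worker threads. A minimal sketch with `concurrent.futures` (this assumes the Rust extension releases the GIL during tokenization; if it does not, `ProcessPoolExecutor` gives the same result at the cost of extra startup):

```python
import nupunkt_rs
from concurrent.futures import ThreadPoolExecutor

documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences.",
]

# map() preserves input order, so results line up with `documents`.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_sentences = list(pool.map(nupunkt_rs.sent_tokenize, documents))

print(all_sentences)
```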

#### Getting Character Positions

```python
import nupunkt_rs

# Get sentence boundaries as character positions
tokenizer = nupunkt_rs.create_default_tokenizer()
text = "First sentence. Second sentence."
spans = tokenizer.tokenize_spans(text)
print(spans)
# Output: [(0, 15), (16, 32)]

# Extract sentences using spans
for start, end in spans:
    print(f"'{text[start:end]}'")
# Output: 'First sentence.' 'Second sentence.'
```

### Command-Line Interface

```bash
# Quick tokenization with default model
echo "Dr. Smith arrived. He was late." | nupunkt tokenize

# Adjust precision/recall from command line
nupunkt tokenize --pr-balance 0.8 "Your text here."

# Process a file
nupunkt tokenize --input document.txt --output sentences.txt
```

## Advanced Usage

### Understanding Tokenization Decisions

Get detailed insights into why breaks occur or don't occur:

```python
import nupunkt_rs

tokenizer = nupunkt_rs.create_default_tokenizer()
text = "Dr. Smith arrived at the office. He was late for the meeting."

# Get detailed analysis of each token
analysis = tokenizer.analyze_tokens(text)

for token in analysis.tokens:
    if token.has_period:
        print(f"Token: {token.text}")
        print(f"  Break decision: {token.decision}")
        print(f"  Confidence: {token.confidence:.2f}")
        
# Explain a specific position
explanation = tokenizer.explain_decision(text, 2)  # position of the period after "Dr."
print(explanation)
```

### Getting Sentence Boundaries as Spans

```python
# Get character positions instead of text
spans = tokenizer.tokenize_spans(text)
# Returns: [(start1, end1), (start2, end2), ...]

for start, end in spans:
    print(f"Sentence: {text[start:end]}")
```

### Training Custom Models

For domain-specific text, you can train your own model:

```python
import nupunkt_rs

trainer = nupunkt_rs.Trainer()

# Optional: Load domain-specific abbreviations
trainer.load_abbreviations_from_json("legal_abbreviations.json")

# Train on your corpus (your_text_corpus is a string of your own domain text)
params = trainer.train(your_text_corpus, verbose=True)

# Save model for reuse
params.save("my_model.npkt.gz")

# Load and use later
params = nupunkt_rs.Parameters.load("my_model.npkt.gz")
tokenizer = nupunkt_rs.SentenceTokenizer(params)
```

## Performance

Benchmarks on commodity hardware (Linux, Intel x86_64):

| Text Size | Processing Time | Speed |
|-----------|----------------|--------|
| 1 KB | < 0.1ms | ~10 MB/s |
| 100 KB | ~3ms | ~30 MB/s |
| 1 MB | ~33ms | ~30 MB/s |
| 10 MB | ~330ms | ~30 MB/s |

For inputs beyond a few kilobytes, throughput holds steady at approximately **30 million characters per second**; very small inputs are dominated by fixed per-call overhead.

Memory usage is minimal - the default model uses about 12 MB of RAM, compared to 85+ MB for NLTK's Punkt implementation.
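
To sanity-check these numbers on your own hardware, a rough timing sketch (the sample text is an arbitrary legal snippet repeated to roughly 1 MB):

```python
import time

import nupunkt_rs

# Build a ~1 MB sample by repeating a short legal snippet.
sample = "See Smith v. Jones, 123 U.S. 456 (2020). The court agreed. " * 17_000

start = time.perf_counter()
sentences = nupunkt_rs.sent_tokenize(sample)
elapsed = time.perf_counter() - start

print(f"{len(sample):,} chars, {len(sentences):,} sentences in {elapsed * 1000:.1f} ms "
      f"({len(sample) / elapsed / 1e6:.1f}M chars/sec)")
```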

## API Reference

### Main Functions

- **`sent_tokenize(text, model_params=None, precision_recall=None)`** โ†’ List of sentences
  - `text`: The text to tokenize
  - `model_params`: Optional custom model parameters
  - `precision_recall`: Optional PR balance (0.0=recall, 1.0=precision, default=0.5)

- **`para_tokenize(text, model_params=None, precision_recall=None)`** โ†’ List of paragraphs (each as list of sentences)
  - Same parameters as `sent_tokenize`

- **`para_tokenize_joined(text, model_params=None, precision_recall=None)`** โ†’ List of paragraphs (each as single string)
  - Same parameters as `sent_tokenize`

- **`create_default_tokenizer()`** โ†’ Returns a `SentenceTokenizer` with default model
- **`load_default_model()`** โ†’ Returns default `Parameters`
- **`train_model(text, verbose=False)`** โ†’ Train new model on text
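
A short sketch combining these functions, based on the signatures above: the default `Parameters` are loaded once and passed explicitly through `model_params` (the input text is placeholder content):

```python
import nupunkt_rs

# Load the default model once and reuse it across calls.
params = nupunkt_rs.load_default_model()

text = "First paragraph. Two sentences.\n\nSecond paragraph here."

sentences = nupunkt_rs.sent_tokenize(text, model_params=params, precision_recall=0.5)
paragraphs = nupunkt_rs.para_tokenize(text, model_params=params)

print(sentences)   # flat list of sentences
print(paragraphs)  # list of paragraphs, each a list of sentences
```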

### Main Classes

- **`SentenceTokenizer`**: The main class for tokenizing text
  - `tokenize(text)` โ†’ List of sentences
  - `tokenize_spans(text)` โ†’ List of (start, end) positions
  - `tokenize_paragraphs(text)` โ†’ List of paragraphs (each as list of sentences)
  - `tokenize_paragraphs_flat(text)` โ†’ List of paragraphs (each as single string)
  - `set_precision_recall_balance(0.0-1.0)` โ†’ Adjust behavior
  - `analyze_tokens(text)` โ†’ Detailed token analysis
  - `explain_decision(text, position)` โ†’ Explain break decision at position
  
- **`Parameters`**: Model parameters
  - `save(path)` โ†’ Save model to disk (compressed)
  - `load(path)` โ†’ Load model from disk

- **`Trainer`**: For training custom models (advanced users only)
  - `train(text, verbose=False)` โ†’ Train on text corpus
  - `load_abbreviations_from_json(path)` โ†’ Load custom abbreviations
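
And a companion sketch exercising the `SentenceTokenizer` methods listed above on a small two-paragraph input:

```python
import nupunkt_rs

tokenizer = nupunkt_rs.create_default_tokenizer()
tokenizer.set_precision_recall_balance(0.5)  # balanced default

text = "First paragraph. Two sentences.\n\nSecond paragraph here."

print(tokenizer.tokenize(text))                  # sentences
print(tokenizer.tokenize_spans(text))            # (start, end) character positions
print(tokenizer.tokenize_paragraphs(text))       # paragraphs as lists of sentences
print(tokenizer.tokenize_paragraphs_flat(text))  # paragraphs as single strings
```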

## Development

### Running Tests

```bash
# Rust tests
cargo test

# Python tests
pytest python/tests/

# With coverage
cargo tarpaulin
pytest --cov=nupunkt_rs
```

### Code Quality

```bash
# Format code
cargo fmt
black python/

# Lint
cargo clippy -- -D warnings
ruff check python/

# Type checking
mypy python/
```

### Building Documentation

```bash
# Rust docs
cargo doc --open

# Python docs
cd docs && make html
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Areas for Contribution

- Additional language support
- Performance optimizations
- More abbreviation lists
- Documentation improvements
- Test coverage expansion

## License

MIT License - see [LICENSE](LICENSE) for details.

## Citation

If you use nupunkt-rs in your research, please cite the original nupunkt paper:

```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For the Rust implementation specifically:

```bibtex
@software{nupunkt-rs,
  title = {nupunkt-rs: High-performance Rust implementation of nupunkt},
  author = {ALEA Institute},
  year = {2025},
  url = {https://github.com/alea-institute/nupunkt-rs}
}
```

## Acknowledgments

- Original Punkt algorithm by Kiss & Strunk (2006)

## Support

- **Issues**: [GitHub Issues](https://github.com/alea-institute/nupunkt-rs/issues)
- **Discussions**: [GitHub Discussions](https://github.com/alea-institute/nupunkt-rs/discussions)
- **Email**: hello@aleainstitute.ai


            
