kcho-linguistic-toolkit

- **Name**: kcho-linguistic-toolkit
- **Version**: 0.1.0
- **Summary**: K'Cho linguistic analysis toolkit for low-resource NLP with collocation extraction, morphological analysis, and corpus processing
- **Uploaded**: 2025-10-24 10:25:58
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: nlp, low-resource, kcho, linguistics, collocation, morphology, corpus, language-processing
            # K'Cho Linguistic Toolkit

> **A comprehensive toolkit for K'Cho language processing with collocation extraction, morphological analysis, and corpus processing.**

Based on linguistic research by George Bedell and Kee Shein Mang (2012), this toolkit was developed by Hung Om, a K'Cho speaker and independent developer, to provide essential tools for working with K'Cho, a Kuki-Chin language spoken by 10,000-20,000 people in southern Chin State, Myanmar.

## 🎯 What This Toolkit Does

This is a **single, integrated package** that provides:

- ✅ **Collocation Extraction** - Extract meaningful word combinations using multiple association measures
- ✅ **Morphological Analysis** - Analyze K'Cho word structure (stems, affixes, particles)
- ✅ **Text Normalization** - Clean and normalize K'Cho text for analysis
- ✅ **Corpus Building** - Create annotated datasets with quality control
- ✅ **Lexicon Management** - Build and manage digital K'Cho dictionaries
- ✅ **Data Export** - Export to standard formats (JSON, CoNLL-U, CSV)
- ✅ **Evaluation Tools** - Evaluate collocation extraction quality
- ✅ **Parallel Corpus Processing** - Process aligned K'Cho-English texts
- ✅ **ML-Ready Output** - Prepare data for machine learning training

## 🚀 Quick Start

### Installation

```bash
# Install in development mode
pip install -e .

# Or install from PyPI (when published)
pip install kcho-linguistic-toolkit
```

### Basic Usage

```python
from kcho import CollocationExtractor, KChoSystem

# Initialize the system
system = KChoSystem()

# Extract collocations from corpus
extractor = CollocationExtractor()
corpus = ["Om noh Yong am paapai pe ci", "Ak'hmó lùum ci"]
results = extractor.extract(corpus)

# Use advanced defaultdict functionality
pos_patterns = system.corpus.analyze_pos_patterns()
word_contexts = extractor.analyze_word_contexts(corpus)
```

### Command Line Interface

```bash
# Run collocation extraction
python -m kcho.create_gold_standard --corpus data/sample_corpus.txt --output gold_standard.txt

# Use the main CLI
kcho analyze --corpus data/sample_corpus.txt --output results/
```

## 📦 Installation

Install the package in development mode:

```bash
# Clone the repository
git clone https://github.com/HungOm/kcho-linguistic-toolkit.git
cd kcho-linguistic-toolkit

# Install in development mode
pip install -e .

# Verify installation
python -c "from kcho import CollocationExtractor; print('✅ Installation successful!')"
```

## ๐Ÿ“ Project Structure

The toolkit is organized following Python packaging best practices:

```
KchoLinguisticToolkit/
├── kcho/                           # Main package
│   ├── __init__.py                 # Package initialization
│   ├── collocation.py              # Collocation extraction
│   ├── kcho_system.py              # Core system
│   ├── normalize.py                # Text normalization
│   ├── evaluation.py               # Evaluation utilities
│   ├── export.py                   # Export functions
│   ├── eng_kcho_parallel_extractor.py
│   ├── export_training_csv.py
│   ├── create_gold_standard.py     # Gold standard helper
│   ├── kcho_app.py                 # CLI entry point
│   └── data/                       # Package data
│       ├── linguistic_data.json
│       └── word_frequency_top_1000.csv
├── examples/                       # Example scripts
│   └── defaultdict_usage.py
├── data/                           # External data (not in package)
│   ├── README.md                   # Data documentation
│   ├── sample_corpus.txt           # Small, keep in git
│   ├── gold_standard_collocations.txt
│   ├── bible_versions/             # Large, .gitignored
│   ├── parallel_corpora/           # Medium, .gitignored
│   └── research_outputs/           # Generated, .gitignored
├── .gitignore                      # Comprehensive ignore rules
├── pyproject.toml                  # Package configuration
└── README.md                       # This file
```

## 🌟 Key Features

### 1. Collocation Extraction

Advanced collocation extraction with multiple association measures:

- **PMI (Pointwise Mutual Information)** - Classical measure for word association
- **NPMI (Normalized PMI)** - Variant bounded in [-1, 1] for cross-corpus comparison
- **t-score** - Statistical significance testing
- **Dice Coefficient** - Symmetric association measure
- **Log-likelihood Ratio (G²)** - Asymptotic significance testing
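The arithmetic behind the PMI and NPMI scores can be sketched in a few lines of plain Python. This is a simplified stand-in for the toolkit's extractor, limited to adjacent bigrams; `pmi_scores` and its output format are illustrative, not the package API:

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """Score adjacent word pairs with PMI and NPMI.

    All probabilities are estimated against the total token count N, which
    keeps NPMI within [-1, 1] (a pair can never be more frequent than
    either of its words).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n
        p_x = unigrams[w1] / n
        p_y = unigrams[w2] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        npmi = pmi / -math.log2(p_xy)  # normalize by joint self-information
        scores[(w1, w2)] = (pmi, npmi)
    return scores
```

Higher PMI means the pair co-occurs more often than its word frequencies predict; NPMI rescales that into a corpus-size-independent range.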

```python
from kcho import CollocationExtractor

extractor = CollocationExtractor()
results = extractor.extract(corpus)

# Group by POS patterns using defaultdict
pos_groups = extractor.group_collocations_by_pos_pattern(corpus)

# Analyze word contexts
contexts = extractor.analyze_word_contexts(corpus, context_window=3)
```

### 2. Morphological Analysis

Based on K'Cho linguistic research, the toolkit understands:

- **Applicative Suffix** (-na/-nák)
  - `luum-na` = "play with"
  - Automatically detects and analyzes

- **Agreement Particles** (ka, na, a)
- **Postpositions** (noh, ah, am, on)
- **Tense Markers** (ci, khai)

Example:
```python
sentence = toolkit.analyze("Ak'hmó noh k'khìm luum-na ci")
# Automatically identifies: subject + postposition + instrument + verb-APPL + tense
```
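The rule-based analysis described above can be approximated with a toy tagger. The particle inventories below are copied from this section; the function name and output format are illustrative, not the toolkit's actual API:

```python
# Closed-class inventories taken from the lists above
POSTPOSITIONS = {"noh", "ah", "am", "on"}
AGREEMENT = {"ka", "na", "a"}
TENSE = {"ci", "khai"}
APPLICATIVE = ("-na", "-nák")

def segment(sentence):
    """Tag each token coarsely and split off the applicative suffix."""
    analysis = []
    for tok in sentence.split():
        if tok in POSTPOSITIONS:
            analysis.append((tok, "POSTP"))
        elif tok in TENSE:
            analysis.append((tok, "TENSE"))
        elif tok in AGREEMENT:
            analysis.append((tok, "AGR"))
        elif tok.endswith(APPLICATIVE):
            stem = tok.rsplit("-", 1)[0]  # "luum-na" -> stem "luum"
            analysis.append((tok, f"V({stem})-APPL"))
        else:
            analysis.append((tok, "WORD"))
    return analysis
```

Running it on the example sentence tags `noh` as a postposition, `luum-na` as an applicative-marked verb with stem `luum`, and `ci` as a tense marker.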

### 3. Text Validation

Automatically detects K'cho text with confidence scoring:

```python
is_kcho, confidence, metrics = toolkit.validate("Om noh Yong am paapai pe ci")
# Returns: (True, 0.875, {...detailed metrics...})
```

**Validation Features:**
- Character set validation
- K'cho marker detection (postpositions, particles)
- Pattern matching for K'cho structures
- Confidence scoring (0-100%)
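A marker-based confidence score like the one `validate()` returns can be sketched as follows. This is a toy scorer using only the postposition and tense-marker signal; the real validator combines more features (character set, structural patterns), and `validate_kcho` is a hypothetical name:

```python
def validate_kcho(text, markers=frozenset({"noh", "ah", "am", "on",
                                           "ci", "khai", "ka", "na"})):
    """Return (is_kcho, confidence) from the share of known K'cho markers.

    The score is scaled so that a sentence where roughly half the tokens
    are markers already reaches full confidence.
    """
    tokens = text.lower().split()
    if not tokens:
        return False, 0.0
    hits = sum(1 for t in tokens if t in markers)
    confidence = min(1.0, hits / max(1, len(tokens) // 2))
    return confidence >= 0.5, confidence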

### 4. Corpus Building

Build clean, annotated K'cho datasets:

```python
# Add with automatic analysis
toolkit.add_to_corpus(
    "Om noh Yong am paapai pe ci",
    translation="Om gave Yong flowers"
)

# Get statistics
stats = toolkit.corpus_stats()
# Returns: total_sentences, vocabulary_size, POS distribution, etc.

# Create ML splits
splits = toolkit.corpus.create_splits(train_ratio=0.8)
# Returns: {'train': [...], 'dev': [...], 'test': [...]}
```
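A `create_splits`-style shuffle-and-slice split can be sketched as below. Only `train_ratio` appears in the README; the `dev_ratio` and `seed` parameters are assumptions added to make the sketch deterministic:

```python
import random

def create_splits(sentences, train_ratio=0.8, dev_ratio=0.1, seed=42):
    """Partition a corpus into train/dev/test by shuffling then slicing."""
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_ratio)
    n_dev = int(n * dev_ratio)
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],
    }
```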

### 5. Lexicon Management

SQLite-based dictionary with full search:

```python
from kcho import LexiconEntry

# Add words
entry = LexiconEntry(
    headword="paapai",
    pos="N",
    gloss_en="flower",
    gloss_my="ပန်း",  # Myanmar translation
    examples=["Om noh Yong am paapai pe ci"]
)
toolkit.lexicon.add_entry(entry)

# Search
results = toolkit.search_lexicon("flower")

# Get frequency list
top_words = toolkit.lexicon.get_frequency_list(100)
```
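Under the hood, an SQLite-backed lexicon search reduces to a table plus a `LIKE` query. The table layout and column names below are assumptions for illustration, not the toolkit's actual schema:

```python
import sqlite3

# In-memory database standing in for kcho_lexicon.db
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon (headword TEXT PRIMARY KEY, pos TEXT, "
    "gloss_en TEXT, gloss_my TEXT)"
)
conn.execute(
    "INSERT INTO lexicon VALUES (?, ?, ?, ?)",
    ("paapai", "N", "flower", "ပန်း"),
)

# Search by English gloss, as toolkit.search_lexicon("flower") would
rows = conn.execute(
    "SELECT headword, pos FROM lexicon WHERE gloss_en LIKE ?", ("%flower%",)
).fetchall()
```

Parameterized queries (the `?` placeholders) keep user-supplied search strings from being interpreted as SQL.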

### 6. Data Export

Export to multiple standard formats:

```python
# JSON (for ML training)
toolkit.corpus.export_json("corpus.json")

# CoNLL-U (for linguistic research)
toolkit.corpus.export_conllu("corpus.conllu")

# CSV (for spreadsheet analysis)
toolkit.corpus.export_csv("corpus.csv")

# Or export everything at once
toolkit.export_all()
```
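For a sense of what the CoNLL-U export produces, here is a minimal writer for one tokenized sentence. Only the ID and FORM columns are filled; the remaining eight of CoNLL-U's ten columns are left as `_`, which is valid for unannotated fields. The function name is illustrative:

```python
def to_conllu(tokens, sent_id=1):
    """Render one tokenized sentence as a CoNLL-U block."""
    lines = [f"# sent_id = {sent_id}", f"# text = {' '.join(tokens)}"]
    for i, tok in enumerate(tokens, 1):
        # ID, FORM, then LEMMA..MISC left unannotated
        lines.append("\t".join([str(i), tok] + ["_"] * 8))
    return "\n".join(lines) + "\n"
```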

## 📊 Use Cases

### Machine Translation Training

```python
# Build parallel corpus
for kcho, english in parallel_sentences:
    toolkit.add_to_corpus(kcho, translation=english)

# Create splits
splits = toolkit.corpus.create_splits()

# Export for training
for split_name, sentences in splits.items():
    data = [{'source': s.text, 'target': s.translation} for s in sentences]
    # Use with Hugging Face, Fairseq, etc.
```
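The export step above is commonly materialized as JSON Lines, one pair per line, which most training stacks (Hugging Face `datasets` among them) load directly. The `export_jsonl` helper and its `source`/`target` field names are a common convention, not a toolkit requirement:

```python
import json
import os
import tempfile

def export_jsonl(pairs, path):
    """Write (source, target) pairs as JSON Lines for ML training."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            record = {"source": src, "target": tgt}
            # ensure_ascii=False keeps K'cho diacritics readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Demo round-trip with the README's example pair
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
export_jsonl([("Om noh Yong am paapai pe ci", "Om gave Yong flowers")], path)
```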

### Linguistic Research

```python
# Analyze corpus
stats = toolkit.corpus_stats()
print(f"POS distribution: {stats['pos_distribution']}")

# Study verb paradigms
paradigm = toolkit.get_verb_forms('lùum')
# Returns complete conjugation tables

# Export to CoNLL-U for dependency parsing research
toolkit.corpus.export_conllu("research_corpus.conllu")
```

### Dictionary Application Backend

```python
# Search API
results = toolkit.search_lexicon(query)

# Morphological analysis API
analysis = toolkit.analyze(user_input)

# Validation API
is_valid, confidence, _ = toolkit.validate(user_text)
```

## ๐Ÿ“ File Structure

The toolkit creates this organized structure:

```
your_project/
├── kcho_lexicon.db          # SQLite dictionary
├── corpus/                  # Raw corpus data
├── exports/                 # Exported datasets
│   ├── corpus_*.json
│   ├── corpus_*.conllu
│   ├── corpus_*.csv
│   └── lexicon_*.json
└── reports/                 # Quality reports
    └── report_*.json
```

## 🎓 Examples

See `kcho_examples.py` for 8 complete examples:

1. **Basic Analysis** - Analyze K'cho sentences
2. **Build Corpus** - Create annotated corpus
3. **Validate Text** - Detect K'cho text
4. **Lexicon Management** - Work with dictionary
5. **Verb Paradigms** - Generate conjugation tables
6. **Data Export** - Export to different formats
7. **Quality Control** - Validate corpus quality
8. **ML Preparation** - Prepare training data

Run examples:
```bash
python kcho_examples.py
```

## 📖 Documentation

- **[KCHO_TOOLKIT_DOCS.md](KCHO_TOOLKIT_DOCS.md)** - Complete API reference and usage guide
- **[kcho_examples.py](kcho_examples.py)** - 8 practical examples
- **[kcho_toolkit.py](kcho_toolkit.py)** - Main source code (well-documented)

## 📊 Data Organization

The toolkit includes several types of data:

### Package Data (included in installation)
- `kcho/data/linguistic_data.json` - Core linguistic knowledge base
- `kcho/data/word_frequency_top_1000.csv` - High-frequency word list

### External Data (not in package)
- `data/sample_corpus.txt` - Small sample corpus for testing
- `data/gold_standard_collocations.txt` - Gold standard annotations
- `data/bible_versions/` - Bible translations (public domain, large files)
- `data/parallel_corpora/` - Aligned parallel texts
- `data/research_outputs/` - Generated analysis results

**Note**: Large data files are not included in the package to keep it lightweight. See `data/README.md` for details on data sources and copyright information.

## 🔬 Based on Research

This toolkit implements findings from:

- **Bedell, G. & Mang, K. S. (2012)**. "The Applicative Suffix -na in K'cho"
- **Jordan, M. (1969)**. "Chin Dictionary and Grammar"
- **K'cho linguistic research** on verb stem alternation and morphology

## 🎯 What You Can Build

With this toolkit, you can create:

1. **K'cho-English Machine Translation**
   - Generate parallel corpus
   - Export in ML-ready format
   - Train transformer models

2. **K'cho Dictionary App**
   - SQLite backend ready
   - Full-text search
   - Multi-lingual support

3. **Text Analysis Tools**
   - Morphological analyzer
   - Grammar checker
   - Spell checker (with lexicon validation)

4. **Linguistic Research Tools**
   - Annotated corpus
   - Statistical analysis
   - Pattern discovery

5. **Language Learning Apps**
   - Verb conjugation practice
   - Example sentence database
   - Vocabulary lists by frequency

## 📈 Data Quality

Built-in quality control:

- ✅ **Text validation** with confidence scoring
- ✅ **Morphological validation** (checks grammatical structure)
- ✅ **Character set validation** (ensures K'cho characters)
- ✅ **Quality reports** (identifies issues in corpus)

Example:
```python
quality = toolkit.corpus.quality_report()
print(f"Validated: {quality['validated_sentences']}/{quality['total_sentences']}")
print(f"Avg confidence: {quality['avg_confidence']:.2%}")
```

## 🚦 Project Status

**Status**: Production Ready ✅

- ✅ Core features complete
- ✅ Fully documented
- ✅ Example code provided
- ✅ Based on peer-reviewed research
- ✅ No external dependencies

## ๐Ÿค Contributing

To extend the toolkit:

1. **Add vocabulary**: Extend `KchoConfig.VERB_STEMS`
2. **Add patterns**: Update validation patterns
3. **Add languages**: Add more gloss languages to `LexiconEntry`
4. **Report issues**: Document any K'cho linguistic features not yet handled

## ๐Ÿ“ Citation

If you use this toolkit in research, please cite:

```bibtex
@misc{kcho_toolkit_2025,
  title={K'cho Language Toolkit: A Unified Package for K'cho Language Processing},
  author={Based on research by Bedell, George and Mang, Kee Shein},
  year={2025},
  note={Linguistic analysis based on "The Applicative Suffix -na in K'cho" (2012)}
}
```

## โš ๏ธ Important Notes

- K'cho has **no standard orthography** - this toolkit handles common variants
- The toolkit focuses on **Mindat Township dialect** (southern Chin State)
- Based on research published between 1969 and 2012 - contemporary usage may vary
- Speaker population: approximately 10,000-20,000

## 🔮 Future Enhancements

Potential additions (not yet implemented):

- [ ] Audio processing (speech recognition/synthesis)
- [ ] Neural morphological analyzer
- [ ] Automatic tokenization improvements
- [ ] More comprehensive verb stem database
- [ ] Integration with existing Chin language tools

## 📞 Support

For K'cho linguistic questions, refer to:
- Published papers by George Bedell and Kee Shein Mang
- Jordan's Chin Dictionary and Grammar (1969)
- K'cho community language documentation

## 📄 License

This toolkit is provided for K'cho language research, documentation, and preservation.

---

**Version**: 0.1.0  
**Language**: K'cho (Kuki-Chin family)  
**Region**: Mindat Township, Southern Chin State, Myanmar  
**Speakers**: ~10,000-20,000

---

## Quick Links

- [Complete Documentation](KCHO_TOOLKIT_DOCS.md)
- [Example Scripts](kcho_examples.py)
- [Source Code](kcho_toolkit.py)

---

*"Preserving K'cho for future generations through technology"*

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "kcho-linguistic-toolkit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "nlp, low-resource, kcho, linguistics, collocation, morphology, corpus, language-processing",
    "author": null,
    "author_email": "Hung Om <hungom@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/fe/1d/7eac91f05842ac2840db67f445f7380689f1f7b8be0c55c47606a4720443/kcho_linguistic_toolkit-0.1.0.tar.gz",
    "platform": null,
    "description": "# K'Cho Linguistic Toolkit\n\n> **A comprehensive toolkit for K'Cho language processing with collocation extraction, morphological analysis, and corpus processing.**\n\nBased on linguistic research by George Bedell and Kee Shein Mang (2012), this toolkit is developed by Hung Om, an enthusiastic K'Cho speaker and independent developer to provide essential tools for working with K'Cho, a Kuki-Chin language spoken by 10,000-20,000 people in southern Chin State, Myanmar.\n\n## \ud83c\udfaf What This Toolkit Does\n\nThis is a **single, integrated package** that provides:\n\n- \u2705 **Collocation Extraction** - Extract meaningful word combinations using multiple association measures\n- \u2705 **Morphological Analysis** - Analyze K'Cho word structure (stems, affixes, particles)\n- \u2705 **Text Normalization** - Clean and normalize K'Cho text for analysis\n- \u2705 **Corpus Building** - Create annotated datasets with quality control\n- \u2705 **Lexicon Management** - Build and manage digital K'Cho dictionaries\n- \u2705 **Data Export** - Export to standard formats (JSON, CoNLL-U, CSV)\n- \u2705 **Evaluation Tools** - Evaluate collocation extraction quality\n- \u2705 **Parallel Corpus Processing** - Process aligned K'Cho-English texts\n- \u2705 **ML-Ready Output** - Prepare data for machine learning training\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n```bash\n# Install in development mode\npip install -e .\n\n# Or install from PyPI (when published)\npip install kcho-linguistic-toolkit\n```\n\n### Basic Usage\n\n```python\nfrom kcho import CollocationExtractor, KChoSystem\n\n# Initialize the system\nsystem = KChoSystem()\n\n# Extract collocations from corpus\nextractor = CollocationExtractor()\ncorpus = [\"Om noh Yong am paapai pe ci\", \"Ak'hm\u00f3 l\u00f9um ci\"]\nresults = extractor.extract(corpus)\n\n# Use advanced defaultdict functionality\npos_patterns = system.corpus.analyze_pos_patterns()\nword_contexts = 
extractor.analyze_word_contexts(corpus)\n```\n\n### Command Line Interface\n\n```bash\n# Run collocation extraction\npython -m kcho.create_gold_standard --corpus data/sample_corpus.txt --output gold_standard.txt\n\n# Use the main CLI\nkcho analyze --corpus data/sample_corpus.txt --output results/\n```\n\n## \ud83d\udce6 Installation\n\nInstall the package in development mode:\n\n```bash\n# Clone the repository\ngit clone https://github.com/HungOm/kcho-linguistic-toolkit.git\ncd kcho-linguistic-toolkit\n\n# Install in development mode\npip install -e .\n\n# Verify installation\npython -c \"from kcho import CollocationExtractor; print('\u2705 Installation successful!')\"\n```\n\n## \ud83d\udcc1 Project Structure\n\nThe toolkit is organized following Python packaging best practices:\n\n```\nKchoLinguisticToolkit/\n\u251c\u2500\u2500 kcho/                           # Main package\n\u2502   \u251c\u2500\u2500 __init__.py                 # Package initialization\n\u2502   \u251c\u2500\u2500 collocation.py              # Collocation extraction\n\u2502   \u251c\u2500\u2500 kcho_system.py              # Core system\n\u2502   \u251c\u2500\u2500 normalize.py                # Text normalization\n\u2502   \u251c\u2500\u2500 evaluation.py               # Evaluation utilities\n\u2502   \u251c\u2500\u2500 export.py                   # Export functions\n\u2502   \u251c\u2500\u2500 eng_kcho_parallel_extractor.py\n\u2502   \u251c\u2500\u2500 export_training_csv.py\n\u2502   \u251c\u2500\u2500 create_gold_standard.py     # Gold standard helper\n\u2502   \u251c\u2500\u2500 kcho_app.py                 # CLI entry point\n\u2502   \u2514\u2500\u2500 data/                       # Package data\n\u2502       \u251c\u2500\u2500 linguistic_data.json\n\u2502       \u2514\u2500\u2500 word_frequency_top_1000.csv\n\u251c\u2500\u2500 examples/                       # Example scripts\n\u2502   \u2514\u2500\u2500 defaultdict_usage.py\n\u251c\u2500\u2500 data/                           # External data 
(not in package)\n\u2502   \u251c\u2500\u2500 README.md                   # Data documentation\n\u2502   \u251c\u2500\u2500 sample_corpus.txt           # Small, keep in git\n\u2502   \u251c\u2500\u2500 gold_standard_collocations.txt\n\u2502   \u251c\u2500\u2500 bible_versions/             # Large, .gitignored\n\u2502   \u251c\u2500\u2500 parallel_corpora/           # Medium, .gitignored\n\u2502   \u2514\u2500\u2500 research_outputs/           # Generated, .gitignored\n\u251c\u2500\u2500 .gitignore                      # Comprehensive ignore rules\n\u251c\u2500\u2500 pyproject.toml                  # Package configuration\n\u2514\u2500\u2500 README.md                       # This file\n```\n\n## \ud83c\udf1f Key Features\n\n### 1. Collocation Extraction\n\nAdvanced collocation extraction with multiple association measures:\n\n- **PMI (Pointwise Mutual Information)** - Classical measure for word association\n- **NPMI (Normalized PMI)** - Bounded [0,1] variant for comparison\n- **t-score** - Statistical significance testing\n- **Dice Coefficient** - Symmetric association measure\n- **Log-likelihood Ratio (G\u00b2)** - Asymptotic significance testing\n\n```python\nfrom kcho import CollocationExtractor\n\nextractor = CollocationExtractor()\nresults = extractor.extract(corpus)\n\n# Group by POS patterns using defaultdict\npos_groups = extractor.group_collocations_by_pos_pattern(corpus)\n\n# Analyze word contexts\ncontexts = extractor.analyze_word_contexts(corpus, context_window=3)\n```\n\n### 2. 
Morphological Analysis\n\nBased on K'Cho linguistic research, the toolkit understands:\n\n- **Applicative Suffix** (-na/-n\u00e1k)\n  - `luum-na` = \"play with\"\n  - Automatically detects and analyzes\n\n- **Agreement Particles** (ka, na, a)\n- **Postpositions** (noh, ah, am, on)\n- **Tense Markers** (ci, khai)\n\nExample:\n```python\nsentence = toolkit.analyze(\"Ak'hm\u00f3 noh k'kh\u00ecm luum-na ci\")\n# Automatically identifies: subject + postposition + instrument + verb-APPL + tense\n```\n\n### 2. Text Validation\n\nAutomatically detects K'cho text with confidence scoring:\n\n```python\nis_kcho, confidence, metrics = toolkit.validate(\"Om noh Yong am paapai pe ci\")\n# Returns: (True, 0.875, {...detailed metrics...})\n```\n\n**Validation Features:**\n- Character set validation\n- K'cho marker detection (postpositions, particles)\n- Pattern matching for K'cho structures\n- Confidence scoring (0-100%)\n\n### 3. Corpus Building\n\nBuild clean, annotated K'cho datasets:\n\n```python\n# Add with automatic analysis\ntoolkit.add_to_corpus(\n    \"Om noh Yong am paapai pe ci\",\n    translation=\"Om gave Yong flowers\"\n)\n\n# Get statistics\nstats = toolkit.corpus_stats()\n# Returns: total_sentences, vocabulary_size, POS distribution, etc.\n\n# Create ML splits\nsplits = toolkit.corpus.create_splits(train_ratio=0.8)\n# Returns: {'train': [...], 'dev': [...], 'test': [...]}\n```\n\n### 4. Lexicon Management\n\nSQLite-based dictionary with full search:\n\n```python\nfrom kcho_toolkit import LexiconEntry\n\n# Add words\nentry = LexiconEntry(\n    headword=\"paapai\",\n    pos=\"N\",\n    gloss_en=\"flower\",\n    gloss_my=\"\u1015\u1014\u103a\u1038\",  # Myanmar translation\n    examples=[\"Om noh Yong am paapai pe ci\"]\n)\ntoolkit.lexicon.add_entry(entry)\n\n# Search\nresults = toolkit.search_lexicon(\"flower\")\n\n# Get frequency list\ntop_words = toolkit.lexicon.get_frequency_list(100)\n```\n\n### 5. 
Data Export\n\nExport to multiple standard formats:\n\n```python\n# JSON (for ML training)\ntoolkit.corpus.export_json(\"corpus.json\")\n\n# CoNLL-U (for linguistic research)\ntoolkit.corpus.export_conllu(\"corpus.conllu\")\n\n# CSV (for spreadsheet analysis)\ntoolkit.corpus.export_csv(\"corpus.csv\")\n\n# Or export everything at once\ntoolkit.export_all()\n```\n\n## \ud83d\udcca Use Cases\n\n### Machine Translation Training\n\n```python\n# Build parallel corpus\nfor kcho, english in parallel_sentences:\n    toolkit.add_to_corpus(kcho, translation=english)\n\n# Create splits\nsplits = toolkit.corpus.create_splits()\n\n# Export for training\nfor split_name, sentences in splits.items():\n    data = [{'source': s.text, 'target': s.translation} for s in sentences]\n    # Use with Hugging Face, Fairseq, etc.\n```\n\n### Linguistic Research\n\n```python\n# Analyze corpus\nstats = toolkit.corpus_stats()\nprint(f\"POS distribution: {stats['pos_distribution']}\")\n\n# Study verb paradigms\nparadigm = toolkit.get_verb_forms('l\u00f9um')\n# Returns complete conjugation tables\n\n# Export to CoNLL-U for dependency parsing research\ntoolkit.corpus.export_conllu(\"research_corpus.conllu\")\n```\n\n### Dictionary Application Backend\n\n```python\n# Search API\nresults = toolkit.search_lexicon(query)\n\n# Morphological analysis API\nanalysis = toolkit.analyze(user_input)\n\n# Validation API\nis_valid, confidence, _ = toolkit.validate(user_text)\n```\n\n## \ud83d\udcc1 File Structure\n\nThe toolkit creates this organized structure:\n\n```\nyour_project/\n\u251c\u2500\u2500 kcho_lexicon.db          # SQLite dictionary\n\u251c\u2500\u2500 corpus/                  # Raw corpus data\n\u251c\u2500\u2500 exports/                 # Exported datasets\n\u2502   \u251c\u2500\u2500 corpus_*.json\n\u2502   \u251c\u2500\u2500 corpus_*.conllu\n\u2502   \u251c\u2500\u2500 corpus_*.csv\n\u2502   \u2514\u2500\u2500 lexicon_*.json\n\u2514\u2500\u2500 reports/                 # Quality reports\n    
\u2514\u2500\u2500 report_*.json\n```\n\n## \ud83c\udf93 Examples\n\nSee `kcho_examples.py` for 8 complete examples:\n\n1. **Basic Analysis** - Analyze K'cho sentences\n2. **Build Corpus** - Create annotated corpus\n3. **Validate Text** - Detect K'cho text\n4. **Lexicon Management** - Work with dictionary\n5. **Verb Paradigms** - Generate conjugation tables\n6. **Data Export** - Export to different formats\n7. **Quality Control** - Validate corpus quality\n8. **ML Preparation** - Prepare training data\n\nRun examples:\n```bash\npython kcho_examples.py\n```\n\n## \ud83d\udcd6 Documentation\n\n- **[KCHO_TOOLKIT_DOCS.md](KCHO_TOOLKIT_DOCS.md)** - Complete API reference and usage guide\n- **[kcho_examples.py](kcho_examples.py)** - 8 practical examples\n- **[kcho_toolkit.py](kcho_toolkit.py)** - Main source code (well-documented)\n\n## \ud83d\udcca Data Organization\n\nThe toolkit includes several types of data:\n\n### Package Data (included in installation)\n- `kcho/data/linguistic_data.json` - Core linguistic knowledge base\n- `kcho/data/word_frequency_top_1000.csv` - High-frequency word list\n\n### External Data (not in package)\n- `data/sample_corpus.txt` - Small sample corpus for testing\n- `data/gold_standard_collocations.txt` - Gold standard annotations\n- `data/bible_versions/` - Bible translations (public domain, large files)\n- `data/parallel_corpora/` - Aligned parallel texts\n- `data/research_outputs/` - Generated analysis results\n\n**Note**: Large data files are not included in the package to keep it lightweight. See `data/README.md` for details on data sources and copyright information.\n\n## \ud83d\udd2c Based on Research\n\nThis toolkit implements findings from:\n\n- **Bedell, G. & Mang, K. S. (2012)**. \"The Applicative Suffix -na in K'cho\"\n- **Jordan, M. (1969)**. \"Chin Dictionary and Grammar\"\n- **K'cho linguistic research** on verb stem alternation and morphology\n\n## \ud83c\udfaf What You Can Build\n\nWith this toolkit, you can create:\n\n1. 
**K'cho-English Machine Translation**\n   - Generate parallel corpus\n   - Export in ML-ready format\n   - Train transformer models\n\n2. **K'cho Dictionary App**\n   - SQLite backend ready\n   - Full-text search\n   - Multi-lingual support\n\n3. **Text Analysis Tools**\n   - Morphological analyzer\n   - Grammar checker\n   - Spell checker (with lexicon validation)\n\n4. **Linguistic Research Tools**\n   - Annotated corpus\n   - Statistical analysis\n   - Pattern discovery\n\n5. **Language Learning Apps**\n   - Verb conjugation practice\n   - Example sentence database\n   - Vocabulary lists by frequency\n\n## \ud83d\udcc8 Data Quality\n\nBuilt-in quality control:\n\n- \u2705 **Text validation** with confidence scoring\n- \u2705 **Morphological validation** (checks grammatical structure)\n- \u2705 **Character set validation** (ensures K'cho characters)\n- \u2705 **Quality reports** (identifies issues in corpus)\n\nExample:\n```python\nquality = toolkit.corpus.quality_report()\nprint(f\"Validated: {quality['validated_sentences']}/{quality['total_sentences']}\")\nprint(f\"Avg confidence: {quality['avg_confidence']:.2%}\")\n```\n\n## \ud83d\udea6 Project Status\n\n**Status**: Production Ready \u2705\n\n- \u2705 Core features complete\n- \u2705 Fully documented\n- \u2705 Example code provided\n- \u2705 Based on peer-reviewed research\n- \u2705 No external dependencies\n\n## \ud83e\udd1d Contributing\n\nTo extend the toolkit:\n\n1. **Add vocabulary**: Extend `KchoConfig.VERB_STEMS`\n2. **Add patterns**: Update validation patterns\n3. **Add languages**: Add more gloss languages to `LexiconEntry`\n4. 
**Report issues**: Document any K'cho linguistic features not yet handled\n\n## \ud83d\udcdd Citation\n\nIf you use this toolkit in research, please cite:\n\n```bibtex\n@misc{kcho_toolkit_2025,\n  title={K'cho Language Toolkit: A Unified Package for K'cho Language Processing},\n  author={Based on research by Bedell, George and Mang, Kee Shein},\n  year={2025},\n  note={Linguistic analysis based on \"The Applicative Suffix -na in K'cho\" (2012)}\n}\n```\n\n## \u26a0\ufe0f Important Notes\n\n- K'cho has **no standard orthography** - this toolkit handles common variants\n- The toolkit focuses on **Mindat Township dialect** (southern Chin State)\n- Based on research from early 2000s - contemporary usage may vary\n- Speaker population: approximately 10,000-20,000\n\n## \ud83d\udd2e Future Enhancements\n\nPotential additions (not yet implemented):\n\n- [ ] Audio processing (speech recognition/synthesis)\n- [ ] Neural morphological analyzer\n- [ ] Automatic tokenization improvements\n- [ ] More comprehensive verb stem database\n- [ ] Integration with existing Chin language tools\n\n## \ud83d\udcde Support\n\nFor K'cho linguistic questions, refer to:\n- Published papers by George Bedell and Kee Shein Mang\n- Jordan's Chin Dictionary and Grammar (1969)\n- K'cho community language documentation\n\n## \ud83d\udcc4 License\n\nThis toolkit is provided for K'cho language research, documentation, and preservation.\n\n---\n\n**Version**: 1.0.0  \n**Language**: K'cho (Kuki-Chin family)  \n**Region**: Mindat Township, Southern Chin State, Myanmar  \n**Speakers**: ~10,000-20,000\n\n---\n\n## Quick Links\n\n- [Complete Documentation](KCHO_TOOLKIT_DOCS.md)\n- [Example Scripts](kcho_examples.py)\n- [Source Code](kcho_toolkit.py)\n\n---\n\n*\"Preserving K'cho for future generations through technology\"*\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "K'Cho linguistic analysis toolkit for low-resource NLP with collocation extraction, morphological analysis, and corpus processing",
    "version": "0.1.0",
    "project_urls": {
        "Documentation": "https://github.com/HungOm/kcho-linguistic-toolkit#readme",
        "Homepage": "https://github.com/HungOm/kcho-linguistic-toolkit",
        "Issues": "https://github.com/HungOm/kcho-linguistic-toolkit/issues",
        "Repository": "https://github.com/HungOm/kcho-linguistic-toolkit"
    },
    "split_keywords": [
        "nlp",
        " low-resource",
        " kcho",
        " linguistics",
        " collocation",
        " morphology",
        " corpus",
        " language-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "de5b02da4d5151e2d52c7c9679de9db2b38387fce64dcdf807b4fec8eb1e3c57",
                "md5": "2b9ea9ab55ab2bb5708f026380491d44",
                "sha256": "97c9127fc59b09eb3a0844c30a4d9822d15146a7788281f067c934f7d36dff34"
            },
            "downloads": -1,
            "filename": "kcho_linguistic_toolkit-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2b9ea9ab55ab2bb5708f026380491d44",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 59194,
            "upload_time": "2025-10-24T10:25:57",
            "upload_time_iso_8601": "2025-10-24T10:25:57.324073Z",
            "url": "https://files.pythonhosted.org/packages/de/5b/02da4d5151e2d52c7c9679de9db2b38387fce64dcdf807b4fec8eb1e3c57/kcho_linguistic_toolkit-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fe1d7eac91f05842ac2840db67f445f7380689f1f7b8be0c55c47606a4720443",
                "md5": "e48d7969f6ef135c95c9b4d1bde3ae8e",
                "sha256": "77b3aee5155b119d0468774f7b2de88d3548daac8217057207549d4d63f64929"
            },
            "downloads": -1,
            "filename": "kcho_linguistic_toolkit-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "e48d7969f6ef135c95c9b4d1bde3ae8e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 60468,
            "upload_time": "2025-10-24T10:25:58",
            "upload_time_iso_8601": "2025-10-24T10:25:58.907416Z",
            "url": "https://files.pythonhosted.org/packages/fe/1d/7eac91f05842ac2840db67f445f7380689f1f7b8be0c55c47606a4720443/kcho_linguistic_toolkit-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-24 10:25:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "HungOm",
    "github_project": "kcho-linguistic-toolkit#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kcho-linguistic-toolkit"
}
        
Elapsed time: 1.70150s