# K'Cho Linguistic Toolkit
> **A comprehensive toolkit for K'Cho language processing with collocation extraction, morphological analysis, and corpus processing.**
Based on linguistic research by George Bedell and Kee Shein Mang (2012), this toolkit is developed by Hung Om, an enthusiastic K'Cho speaker and independent developer, to provide essential tools for working with K'Cho, a Kuki-Chin language spoken by 10,000-20,000 people in southern Chin State, Myanmar.
## 🎯 What This Toolkit Does
This is a **single, integrated package** that provides:
- ✅ **Collocation Extraction** - Extract meaningful word combinations using multiple association measures
- ✅ **Morphological Analysis** - Analyze K'Cho word structure (stems, affixes, particles)
- ✅ **Text Normalization** - Clean and normalize K'Cho text for analysis
- ✅ **Corpus Building** - Create annotated datasets with quality control
- ✅ **Lexicon Management** - Build and manage digital K'Cho dictionaries
- ✅ **Data Export** - Export to standard formats (JSON, CoNLL-U, CSV)
- ✅ **Evaluation Tools** - Evaluate collocation extraction quality
- ✅ **Parallel Corpus Processing** - Process aligned K'Cho-English texts
- ✅ **ML-Ready Output** - Prepare data for machine learning training
## 🚀 Quick Start
### Installation
```bash
# Install in development mode
pip install -e .
# Or install from PyPI (when published)
pip install kcho-linguistic-toolkit
```
### Basic Usage
```python
from kcho import CollocationExtractor, KChoSystem
# Initialize the system
system = KChoSystem()
# Extract collocations from corpus
extractor = CollocationExtractor()
corpus = ["Om noh Yong am paapai pe ci", "Ak'hmó lùum ci"]
results = extractor.extract(corpus)
# Use advanced defaultdict functionality
pos_patterns = system.corpus.analyze_pos_patterns()
word_contexts = extractor.analyze_word_contexts(corpus)
```
### Command Line Interface
```bash
# Run collocation extraction
python -m kcho.create_gold_standard --corpus data/sample_corpus.txt --output gold_standard.txt
# Use the main CLI
kcho analyze --corpus data/sample_corpus.txt --output results/
```
## 📦 Installation
Install the package in development mode:
```bash
# Clone the repository
git clone https://github.com/HungOm/kcho-linguistic-toolkit.git
cd kcho-linguistic-toolkit
# Install in development mode
pip install -e .
# Verify installation
python -c "from kcho import CollocationExtractor; print('✅ Installation successful!')"
```
## 📁 Project Structure
The toolkit is organized following Python packaging best practices:
```
KchoLinguisticToolkit/
├── kcho/                          # Main package
│   ├── __init__.py                # Package initialization
│   ├── collocation.py             # Collocation extraction
│   ├── kcho_system.py             # Core system
│   ├── normalize.py               # Text normalization
│   ├── evaluation.py              # Evaluation utilities
│   ├── export.py                  # Export functions
│   ├── eng_kcho_parallel_extractor.py
│   ├── export_training_csv.py
│   ├── create_gold_standard.py    # Gold standard helper
│   ├── kcho_app.py                # CLI entry point
│   └── data/                      # Package data
│       ├── linguistic_data.json
│       └── word_frequency_top_1000.csv
├── examples/                      # Example scripts
│   └── defaultdict_usage.py
├── data/                          # External data (not in package)
│   ├── README.md                  # Data documentation
│   ├── sample_corpus.txt          # Small, keep in git
│   ├── gold_standard_collocations.txt
│   ├── bible_versions/            # Large, .gitignored
│   ├── parallel_corpora/          # Medium, .gitignored
│   └── research_outputs/          # Generated, .gitignored
├── .gitignore                     # Comprehensive ignore rules
├── pyproject.toml                 # Package configuration
└── README.md                      # This file
```
## 🌟 Key Features
### 1. Collocation Extraction
Advanced collocation extraction with multiple association measures:
- **PMI (Pointwise Mutual Information)** - Classical measure for word association
- **NPMI (Normalized PMI)** - Bounded variant in [-1, 1] for cross-corpus comparison
- **t-score** - Statistical significance testing
- **Dice Coefficient** - Symmetric association measure
- **Log-likelihood Ratio (G²)** - Asymptotic significance testing
```python
from kcho import CollocationExtractor
extractor = CollocationExtractor()
results = extractor.extract(corpus)
# Group by POS patterns using defaultdict
pos_groups = extractor.group_collocations_by_pos_pattern(corpus)
# Analyze word contexts
contexts = extractor.analyze_word_contexts(corpus, context_window=3)
```
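For reference, the measures listed above can be computed directly from bigram and unigram counts. The following is an illustrative sketch of the standard formulas, not the extractor's internal implementation:

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Common association measures for a bigram (x, y).

    f_xy: bigram frequency, f_x / f_y: unigram frequencies, n: corpus size.
    """
    p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
    pmi = math.log2(p_xy / (p_x * p_y))          # pointwise mutual information
    npmi = pmi / -math.log2(p_xy)                # normalized to [-1, 1]
    t_score = (f_xy - f_x * f_y / n) / math.sqrt(f_xy)  # observed vs. expected
    dice = 2 * f_xy / (f_x + f_y)                # symmetric overlap measure
    return {"pmi": pmi, "npmi": npmi, "t": t_score, "dice": dice}

scores = association_scores(f_xy=30, f_x=100, f_y=60, n=10_000)
```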
### 2. Morphological Analysis
Based on K'Cho linguistic research, the toolkit understands:
- **Applicative Suffix** (-na/-nák)
- `luum-na` = "play with"
- Automatically detects and analyzes
- **Agreement Particles** (ka, na, a)
- **Postpositions** (noh, ah, am, on)
- **Tense Markers** (ci, khai)
Example:
```python
sentence = toolkit.analyze("Ak'hmó noh k'khìm luum-na ci")
# Automatically identifies: subject + postposition + instrument + verb-APPL + tense
```
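As a rough illustration of how a hyphenated -na/-nák applicative could be segmented, here is a hypothetical rule-based sketch; `segment_applicative` is not a toolkit API:

```python
import re

# Matches a stem followed by a hyphenated applicative suffix (-na or -nák).
APPLICATIVE = re.compile(r"^(?P<stem>.+?)-(?P<suffix>na|nák)$")

def segment_applicative(word):
    """Return (stem, suffix) if the word carries -na/-nák, else (word, None)."""
    m = APPLICATIVE.match(word)
    if m:
        return m.group("stem"), m.group("suffix")
    return word, None

print(segment_applicative("luum-na"))   # → ('luum', 'na')
```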
### 3. Text Validation
Automatically detects K'cho text with confidence scoring:
```python
is_kcho, confidence, metrics = toolkit.validate("Om noh Yong am paapai pe ci")
# Returns: (True, 0.875, {...detailed metrics...})
```
**Validation Features:**
- Character set validation
- K'cho marker detection (postpositions, particles)
- Pattern matching for K'cho structures
- Confidence scoring (0-100%)
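One way such a confidence score can be derived is from the density of known K'cho markers in a sentence. The marker sets and scoring below are assumptions for illustration, not the toolkit's actual heuristic:

```python
# Closed-class markers drawn from the morphology section above.
MARKERS = {"noh", "ah", "am", "on",   # postpositions
           "ka", "na", "a",           # agreement particles
           "ci", "khai"}              # tense markers

def marker_confidence(sentence):
    """Fraction of tokens that are known K'cho markers (0.0-1.0)."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MARKERS)
    return hits / len(tokens)

conf = marker_confidence("Om noh Yong am paapai pe ci")
```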
### 4. Corpus Building
Build clean, annotated K'cho datasets:
```python
# Add with automatic analysis
toolkit.add_to_corpus(
    "Om noh Yong am paapai pe ci",
    translation="Om gave Yong flowers",
)
# Get statistics
stats = toolkit.corpus_stats()
# Returns: total_sentences, vocabulary_size, POS distribution, etc.
# Create ML splits
splits = toolkit.corpus.create_splits(train_ratio=0.8)
# Returns: {'train': [...], 'dev': [...], 'test': [...]}
```
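A split like this can be produced with a seeded shuffle and ratio slicing; the sketch below shows the general idea (the toolkit's `create_splits` may differ in details such as stratification):

```python
import random

def make_splits(sentences, train_ratio=0.8, dev_ratio=0.1, seed=0):
    """Deterministic train/dev/test split by shuffling with a fixed seed."""
    items = list(sentences)
    random.Random(seed).shuffle(items)       # reproducible across runs
    n_train = int(len(items) * train_ratio)
    n_dev = int(len(items) * dev_ratio)
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        "test": items[n_train + n_dev:],     # remainder
    }

splits = make_splits([f"sent {i}" for i in range(100)])
```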
### 5. Lexicon Management
SQLite-based dictionary with full search:
```python
from kcho import LexiconEntry
# Add words
entry = LexiconEntry(
    headword="paapai",
    pos="N",
    gloss_en="flower",
    gloss_my="ပန်း",  # Myanmar translation
    examples=["Om noh Yong am paapai pe ci"],
)
toolkit.lexicon.add_entry(entry)
# Search
results = toolkit.search_lexicon("flower")
# Get frequency list
top_words = toolkit.lexicon.get_frequency_list(100)
```
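A minimal sketch of what an SQLite-backed gloss search involves (the schema here is illustrative; the toolkit's actual schema may differ):

```python
import sqlite3

# In-memory database for demonstration; the toolkit persists to a .db file.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lexicon (
    headword TEXT PRIMARY KEY,
    pos      TEXT,
    gloss_en TEXT)""")
conn.execute("INSERT INTO lexicon VALUES (?, ?, ?)",
             ("paapai", "N", "flower"))

# Substring search over the English gloss, parameterized to avoid injection.
row = conn.execute(
    "SELECT headword FROM lexicon WHERE gloss_en LIKE ?",
    ("%flower%",)).fetchone()
```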
### 6. Data Export
Export to multiple standard formats:
```python
# JSON (for ML training)
toolkit.corpus.export_json("corpus.json")
# CoNLL-U (for linguistic research)
toolkit.corpus.export_conllu("corpus.conllu")
# CSV (for spreadsheet analysis)
toolkit.corpus.export_csv("corpus.csv")
# Or export everything at once
toolkit.export_all()
```
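For orientation, a CoNLL-U sentence is a block of comment lines followed by ten tab-separated columns per token, with `_` for unset fields. A minimal writer might look like this (hypothetical token/POS data, not the toolkit's exporter):

```python
def to_conllu(tokens, sent_id):
    """Render (form, upos) pairs as one CoNLL-U sentence block."""
    lines = [f"# sent_id = {sent_id}",
             f"# text = {' '.join(form for form, _ in tokens)}"]
    for i, (form, upos) in enumerate(tokens, start=1):
        # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, "_", upos, "_", "_", "_", "_", "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n\n"   # blank line terminates the sentence

block = to_conllu([("Om", "PROPN"), ("noh", "ADP"), ("ci", "PART")], 1)
```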
## 📊 Use Cases
### Machine Translation Training
```python
# Build parallel corpus
for kcho, english in parallel_sentences:
    toolkit.add_to_corpus(kcho, translation=english)
# Create splits
splits = toolkit.corpus.create_splits()
# Export for training
for split_name, sentences in splits.items():
    data = [{'source': s.text, 'target': s.translation} for s in sentences]
    # Use with Hugging Face, Fairseq, etc.
```
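Exported pairs are commonly written as JSON Lines for MT training frameworks. A sketch, assuming `source`/`target` field names (adapt to your trainer's expected keys):

```python
import io
import json

pairs = [("Om noh Yong am paapai pe ci", "Om gave Yong flowers")]

# One JSON object per line; ensure_ascii=False keeps diacritics readable.
buf = io.StringIO()
for src, tgt in pairs:
    buf.write(json.dumps({"source": src, "target": tgt},
                         ensure_ascii=False) + "\n")

rec = json.loads(buf.getvalue().splitlines()[0])
```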
### Linguistic Research
```python
# Analyze corpus
stats = toolkit.corpus_stats()
print(f"POS distribution: {stats['pos_distribution']}")
# Study verb paradigms
paradigm = toolkit.get_verb_forms('lùum')
# Returns complete conjugation tables
# Export to CoNLL-U for dependency parsing research
toolkit.corpus.export_conllu("research_corpus.conllu")
```
### Dictionary Application Backend
```python
# Search API
results = toolkit.search_lexicon(query)
# Morphological analysis API
analysis = toolkit.analyze(user_input)
# Validation API
is_valid, confidence, _ = toolkit.validate(user_text)
```
## 📁 File Structure
The toolkit creates this organized structure:
```
your_project/
├── kcho_lexicon.db    # SQLite dictionary
├── corpus/            # Raw corpus data
├── exports/           # Exported datasets
│   ├── corpus_*.json
│   ├── corpus_*.conllu
│   ├── corpus_*.csv
│   └── lexicon_*.json
└── reports/           # Quality reports
    └── report_*.json
```
## 🎓 Examples
See `kcho_examples.py` for 8 complete examples:
1. **Basic Analysis** - Analyze K'cho sentences
2. **Build Corpus** - Create annotated corpus
3. **Validate Text** - Detect K'cho text
4. **Lexicon Management** - Work with dictionary
5. **Verb Paradigms** - Generate conjugation tables
6. **Data Export** - Export to different formats
7. **Quality Control** - Validate corpus quality
8. **ML Preparation** - Prepare training data
Run examples:
```bash
python kcho_examples.py
```
## 📖 Documentation
- **[KCHO_TOOLKIT_DOCS.md](KCHO_TOOLKIT_DOCS.md)** - Complete API reference and usage guide
- **[kcho_examples.py](kcho_examples.py)** - 8 practical examples
- **[kcho_toolkit.py](kcho_toolkit.py)** - Main source code (well-documented)
## 📊 Data Organization
The toolkit includes several types of data:
### Package Data (included in installation)
- `kcho/data/linguistic_data.json` - Core linguistic knowledge base
- `kcho/data/word_frequency_top_1000.csv` - High-frequency word list
### External Data (not in package)
- `data/sample_corpus.txt` - Small sample corpus for testing
- `data/gold_standard_collocations.txt` - Gold standard annotations
- `data/bible_versions/` - Bible translations (public domain, large files)
- `data/parallel_corpora/` - Aligned parallel texts
- `data/research_outputs/` - Generated analysis results
**Note**: Large data files are not included in the package to keep it lightweight. See `data/README.md` for details on data sources and copyright information.
## 🔬 Based on Research
This toolkit implements findings from:
- **Bedell, G. & Mang, K. S. (2012)**. "The Applicative Suffix -na in K'cho"
- **Jordan, M. (1969)**. "Chin Dictionary and Grammar"
- **K'cho linguistic research** on verb stem alternation and morphology
## 🎯 What You Can Build
With this toolkit, you can create:
1. **K'cho-English Machine Translation**
- Generate parallel corpus
- Export in ML-ready format
- Train transformer models
2. **K'cho Dictionary App**
- SQLite backend ready
- Full-text search
- Multi-lingual support
3. **Text Analysis Tools**
- Morphological analyzer
- Grammar checker
- Spell checker (with lexicon validation)
4. **Linguistic Research Tools**
- Annotated corpus
- Statistical analysis
- Pattern discovery
5. **Language Learning Apps**
- Verb conjugation practice
- Example sentence database
- Vocabulary lists by frequency
## 📈 Data Quality
Built-in quality control:
- ✅ **Text validation** with confidence scoring
- ✅ **Morphological validation** (checks grammatical structure)
- ✅ **Character set validation** (ensures K'cho characters)
- ✅ **Quality reports** (identifies issues in corpus)
Example:
```python
quality = toolkit.corpus.quality_report()
print(f"Validated: {quality['validated_sentences']}/{quality['total_sentences']}")
print(f"Avg confidence: {quality['avg_confidence']:.2%}")
```
## 🚦 Project Status

**Status**: Production Ready ✅

- ✅ Core features complete
- ✅ Fully documented
- ✅ Example code provided
- ✅ Based on peer-reviewed research
- ✅ No external dependencies
## 🤝 Contributing
To extend the toolkit:
1. **Add vocabulary**: Extend `KchoConfig.VERB_STEMS`
2. **Add patterns**: Update validation patterns
3. **Add languages**: Add more gloss languages to `LexiconEntry`
4. **Report issues**: Document any K'cho linguistic features not yet handled
## 📝 Citation
If you use this toolkit in research, please cite:
```bibtex
@misc{kcho_toolkit_2025,
  title={K'cho Language Toolkit: A Unified Package for K'cho Language Processing},
  author={Based on research by Bedell, George and Mang, Kee Shein},
  year={2025},
  note={Linguistic analysis based on "The Applicative Suffix -na in K'cho" (2012)}
}
```
## ⚠️ Important Notes
- K'cho has **no standard orthography** - this toolkit handles common variants
- The toolkit focuses on **Mindat Township dialect** (southern Chin State)
- Based on research from the early 2000s - contemporary usage may vary
- Speaker population: approximately 10,000-20,000
## 🔮 Future Enhancements
Potential additions (not yet implemented):
- [ ] Audio processing (speech recognition/synthesis)
- [ ] Neural morphological analyzer
- [ ] Automatic tokenization improvements
- [ ] More comprehensive verb stem database
- [ ] Integration with existing Chin language tools
## 📞 Support
For K'cho linguistic questions, refer to:
- Published papers by George Bedell and Kee Shein Mang
- Jordan's Chin Dictionary and Grammar (1969)
- K'cho community language documentation
## 📄 License
This toolkit is provided for K'cho language research, documentation, and preservation.
---
**Version**: 1.0.0
**Language**: K'cho (Kuki-Chin family)
**Region**: Mindat Township, Southern Chin State, Myanmar
**Speakers**: ~10,000-20,000
---
## Quick Links
- [Complete Documentation](KCHO_TOOLKIT_DOCS.md)
- [Example Scripts](kcho_examples.py)
- [Source Code](kcho_toolkit.py)
---
*"Preserving K'cho for future generations through technology"*