# Vocabulous
A bootstrapping language detection system that builds high-quality dictionaries from noisy training data.
[Python 3.8+](https://www.python.org/downloads/)
[PyPI version](https://badge.fury.io/py/vocabulous)
[License: MIT](https://opensource.org/licenses/MIT)
[Tests](https://github.com/omarkamali/vocabulous/actions/workflows/pytest.yml)
## Overview
Vocabulous addresses a common challenge in NLP: building reliable language detection systems when you only have noisy, potentially mislabeled training data. Traditional approaches require either clean, manually curated datasets or sophisticated neural networks. Vocabulous takes a different approach, using iterative dictionary building and progressive data cleaning to bootstrap accurate language detection from imperfect data.
### Key Features
- **Bootstrapping from Noisy Data**: Starts with potentially mislabeled training data and iteratively improves
- **Dictionary-Based Detection**: Uses word frequency dictionaries for fast, interpretable language detection
- **Progressive Data Cleaning**: Removes ambiguous and mislabeled samples across training cycles
- **Multi-Script Support**: Handles both Latin and Arabic scripts with appropriate text normalization
- **Configurable Training**: Adjustable confidence threshold and margin parameters for different scenarios
- **Model Persistence**: Save and load trained models for reuse
## Installation
```bash
pip install vocabulous
```
### Development Installation
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
pip install -e .
```
## Quick Start
```python
from vocabulous import Vocabulous
# Initialize model
model = Vocabulous()
# Prepare training data (list of dicts with 'text' and 'lang' keys)
train_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour le monde', 'lang': 'fr'},
    {'text': 'Hola mundo', 'lang': 'es'},
    # ... more training examples
]
# Evaluation data for monitoring training progress
eval_data = [
    {'text': 'Good morning', 'lang': 'en'},
    {'text': 'Bon matin', 'lang': 'fr'},
    {'text': 'Buenos días', 'lang': 'es'},
]
# Train the model
model, report = model.train(
    train_data,
    eval_data,
    cycles=3,
    base_confidence=0.5,
    confidence_margin=0.3
)
# Use for language detection
scores = model._score_sentence("Hello there")
print(scores) # {'en': 1.0}
# Save the model
model.save('my_model.json')
# Load later
loaded_model = Vocabulous.load('my_model.json')
```
## Methodology
### The Bootstrapping Approach
Vocabulous implements a novel bootstrapping methodology for language detection (one cycle is sketched in code after this list):
1. **Initial Dictionary Building**: Creates word-language frequency dictionaries from all training data
2. **Scoring & Evaluation**: Scores evaluation data to measure current model performance
3. **Data Cleaning**: Removes training samples that contradict the current dictionaries
4. **Iteration**: Repeats the process with cleaned data to progressively improve quality
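A minimal sketch of one such cycle, assuming a whitespace tokenizer and hypothetical helper names (the library's actual internals may differ):

```python
# Toy version of one bootstrapping cycle; illustrative, not the real internals.
def bootstrap_cycle(samples, base_confidence=0.5, confidence_margin=0.3):
    # 1. Build word -> language frequency dictionaries from all samples
    word_lang_freq = {}
    for s in samples:
        for word in s['text'].lower().split():
            langs = word_lang_freq.setdefault(word, {})
            langs[s['lang']] = langs.get(s['lang'], 0) + 1

    def score(text):
        # Normalized per-language scores for a sentence
        totals = {}
        for word in text.lower().split():
            for lang, freq in word_lang_freq.get(word, {}).items():
                totals[lang] = totals.get(lang, 0) + freq
        norm = sum(totals.values()) or 1
        return {lang: f / norm for lang, f in totals.items()}

    # 2./3. Keep only samples whose own label wins clearly
    kept = []
    for s in samples:
        ranked = sorted(score(s['text']).items(), key=lambda kv: -kv[1])
        if not ranked:
            continue
        top_lang, top = ranked[0]
        second = ranked[1][1] if len(ranked) > 1 else 0.0
        if (top_lang == s['lang'] and top >= base_confidence
                and top - second >= confidence_margin):
            kept.append(s)
    return kept  # 4. Feed the kept samples into the next cycle
```

Each cycle shrinks the training set to the samples the current dictionaries can confidently confirm, so the dictionaries rebuilt in the next cycle are cleaner.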
### Why This Works
The approach is based on several key insights (a toy illustration of the first follows this list):
- **Majority Signal**: Even noisy datasets typically contain more correct than incorrect labels
- **Word Uniqueness**: Many words are language-specific and provide strong signals
- **Progressive Refinement**: Each iteration removes the most problematic samples first
- **Convergence**: The process naturally converges when no more samples can be confidently removed
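To make the majority-signal point concrete: even with 10% label noise, the correct language still dominates a word's frequency counts, so the dictionaries point the right way. A toy illustration:

```python
from collections import Counter

# 'bonjour' appears 100 times in the noisy data, 10 of them mislabeled 'en'
labels_for_bonjour = ['fr'] * 90 + ['en'] * 10
print(Counter(labels_for_bonjour).most_common(1))  # [('fr', 90)]
```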
### Training Parameters
- **`cycles`**: Number of training iterations (default: 2)
- **`base_confidence`**: Minimum score threshold for keeping samples (0-1)
- **`confidence_margin`**: Minimum difference between top two language scores (0-1)
Higher values of `base_confidence` and `confidence_margin` make the filtering more aggressive, while lower values are more permissive.
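For intuition on how the two thresholds interact, here is an illustrative check (not necessarily the library's exact rule):

```python
def passes(scores, base_confidence=0.5, confidence_margin=0.3):
    # Keep a sample only if the top score clears the base threshold
    # AND beats the runner-up by the required margin
    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    return top >= base_confidence and (top - second) >= confidence_margin

print(passes({'en': 0.62, 'fr': 0.30}))  # True: 0.62 >= 0.5, margin 0.32 >= 0.3
print(passes({'en': 0.55, 'fr': 0.45}))  # False: margin 0.10 < 0.3
```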
## Use Cases
### 1. Bootstrapping Language Detection
**Scenario**: You have a large dataset of multilingual text with potentially noisy language labels.
```python
# Start with noisy data
noisy_data = [
    {'text': 'Hello world', 'lang': 'en'},
    {'text': 'Bonjour', 'lang': 'en'},  # Mislabeled!
    {'text': 'Hello', 'lang': 'fr'},    # Mislabeled!
    {'text': 'Comment ça va?', 'lang': 'fr'},
    # ... thousands more with ~10% label noise
]
model = Vocabulous()
model, report = model.train(noisy_data, eval_data, cycles=3)
# The model learns to ignore mislabeled samples
print(f"Dictionary size: {len(model.word_lang_freq)}")
print(f"Final accuracy: {report['cycle_reports'][-1]['accuracy']:.3f}")
```
### 2. Data Cleaning Pipeline
**Scenario**: Clean a noisy multilingual dataset before using it for other NLP tasks.
```python
# Train model on subset of data
model, _ = model.train(sample_data, eval_data)
# Clean the full dataset
cleaned_dataset = model.clean(full_noisy_dataset)
# Now use cleaned_dataset for training other models
print(f"Kept {len(cleaned_dataset)}/{len(full_noisy_dataset)} samples")
```
### 3. Incremental Learning
**Scenario**: Continuously improve language detection as new data becomes available.
```python
# Initial training
model, _ = model.train(initial_data, eval_data)
# Later, integrate new data
model, updated_report = model.train(
    new_data + initial_data,  # Combine old and new
    eval_data,
    cycles=2
)
```
### 4. Cross-Domain Adaptation
**Scenario**: Adapt a model trained on one domain (e.g., news) to another (e.g., social media).
```python
# Train on news data
news_model = Vocabulous()
news_model, _ = news_model.train(news_data, news_eval)
# Adapt to social media by combining datasets
adapted_model = Vocabulous()
adapted_model, _ = adapted_model.train(
    social_media_data + news_data,
    social_media_eval,
    cycles=3,
    base_confidence=0.3  # Lower threshold for noisy social media text
)
```
## Advanced Usage
### Custom Text Preprocessing
```python
# Subclass to customize text cleaning
class CustomVocabulous(Vocabulous):
    def _clean_text(self, text):
        # Add custom preprocessing
        text = super()._clean_text(text)
        # Your custom logic here
        return text
```
### Training Monitoring
```python
model, report = model.train(train_data, eval_data, cycles=5)
# Analyze training progress
for i, cycle_report in enumerate(report['cycle_reports']):
    print(f"Cycle {i+1}:")
    print(f"  Accuracy: {cycle_report['accuracy']:.3f}")
    print(f"  F1 Score: {cycle_report['f1']:.3f}")
    print(f"  Samples removed: {cycle_report['removed_samples']}")
    print(f"  Confidence Margin: {cycle_report['confidence_margin']:.3f}")
```
### Confidence Scoring
```python
# Get detailed scores for a sentence
scores = model._score_sentence("Hello world")
# Returns: {'en': 0.75, 'fr': 0.25}
# For datasets
scored_df = model._score(test_data)
print(scored_df[['text', 'scores', 'lang']])
```
## Performance Tips
### Memory Optimization
```python
# For large datasets, disable training data storage
model = Vocabulous(store_training_data=False)
```
### Speed Optimization
```python
# Use fewer cycles for faster training
model, _ = model.train(data, eval_data, cycles=1)
# Lower confidence margin for less aggressive filtering
model, _ = model.train(data, eval_data, confidence_margin=0.1)
```
### Quality Optimization
```python
# More cycles for higher quality
model, _ = model.train(data, eval_data, cycles=5)
# Higher confidence threshold for cleaner dictionaries
model, _ = model.train(data, eval_data, base_confidence=0.7)
```
## Evaluation Metrics
Vocabulous provides a comprehensive set of evaluation metrics (a toy computation of two of them follows this list):
- **Accuracy**: Overall classification accuracy
- **Precision/Recall/F1**: Per-language and macro-averaged metrics
- **Confusion Score**: Measures how often languages are confused with each other
- **Confidence Margin**: Average difference between top two language scores (higher = more confident)
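A toy computation of accuracy and average confidence margin from per-sample scores; the data shapes here are illustrative assumptions, since the library returns these metrics in the training report:

```python
def summarize(predictions):
    """predictions: list of (scores_dict, true_lang) pairs."""
    correct, margins = 0, []
    for scores, true_lang in predictions:
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        top_lang, top = ranked[0]
        second = ranked[1][1] if len(ranked) > 1 else 0.0
        correct += int(top_lang == true_lang)
        margins.append(top - second)
    return {'accuracy': correct / len(predictions),
            'confidence_margin': sum(margins) / len(margins)}

print(summarize([({'en': 0.8, 'fr': 0.2}, 'en'),
                 ({'es': 0.6, 'en': 0.4}, 'es')]))
# {'accuracy': 1.0, 'confidence_margin': 0.4}
```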
## Limitations
1. **Vocabulary-Based**: Works best with languages that have distinct vocabularies
2. **Training Data Size**: Requires sufficient training data for each language
3. **Script Mixing**: May struggle with code-switched text within sentences
4. **Short Text**: Performance degrades on very short texts (1-2 words)
## API Reference
### Core Classes
#### `Vocabulous(store_training_data=False)`
Main class for language detection and training.
**Parameters:**
- `store_training_data` (bool): Whether to store training data internally
#### Methods
##### `train(train_df, eval_df, cycles=2, base_confidence=0.5, confidence_margin=0.5)`
Train the model on provided data.
**Parameters:**
- `train_df`: Training data (list of dicts or DataFrame)
- `eval_df`: Evaluation data (list of dicts or DataFrame)
- `cycles` (int): Number of training cycles
- `base_confidence` (float): Minimum confidence threshold
- `confidence_margin` (float): Minimum score difference threshold
**Returns:**
- `(model, report)`: Updated model and training report
##### `clean(dataset)`
Clean a dataset by filtering confident predictions.
**Parameters:**
- `dataset`: DataFrame with 'text' and 'lang' columns
**Returns:**
- DataFrame with confident predictions only
##### `save(path)` / `load(path)`
Save/load model to/from JSON file.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
### Development Setup
```bash
git clone https://github.com/omarkamali/vocabulous.git
cd vocabulous
pip install -e ".[dev]"
pytest tests/
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
This project is supported by Omneity Labs, a research lab focused on building NLP and generative AI models for low-resource languages and techniques for cultural alignment.
## Citation
If you use Vocabulous in your research, please cite:
```bibtex
@software{vocabulous2025,
title={Vocabulous: Bootstrapping Language Detection from Noisy \& Ambiguous Data},
author={Omar Kamali},
year={2025},
url={https://github.com/omarkamali/vocabulous},
note={Project developed under Omneity Labs}
}
```
## Support
- **Issues**: [GitHub Issues](https://github.com/omarkamali/vocabulous/issues)
- **Discussions**: [GitHub Discussions](https://github.com/omarkamali/vocabulous/discussions)
- **Documentation**: [GitHub README](https://github.com/omarkamali/vocabulous#readme)