- **Name**: innit
- **Version**: 0.0.1a0
- **Summary**: Placeholder package for innit — name reserved while model is trained
- **Upload time**: 2025-08-26 12:04:41
- **Requires Python**: >=3.9
- **Keywords**: language-detection, onnx, nlp, english
- **Author / maintainer / home page / license**: not specified
- **Recorded requirements**: none
# innit - Fast English Detection
Note: The current PyPI release is a lightweight placeholder to reserve the
package name while the model is trained and productized. It installs quickly
and does not include heavy training dependencies. The CLI expects you to
provide an ONNX model file.
A tiny, fast, and dependency-light tool to determine if text is English or not English. Perfect for book-length texts where you need quick language detection without heavy ML frameworks.
## Features
- **Fast**: Sub-millisecond inference per 2KB window on CPU
- **Small**: ~1-2MB model size (0.5-1MB with int8 quantization)
- **Simple**: Binary classification - English vs Not-English
- **Legal**: Trained only on legally clean datasets
- **Deployable**: Ships as an ONNX model; inference needs only `onnxruntime` (no PyTorch dependency)
## Installation
### For inference only (lightweight):
```bash
pip install onnxruntime
# Download the innit.onnx model file
```
### For training and development:
```bash
git clone <repo>
cd innit
pip install -e .
```
## Quick Start
### CLI Usage
```bash
# Analyze a text file
innit book.txt
# Output as JSON
innit book.txt --json
# Use specific model
innit book.txt --model path/to/innit.onnx
```
### Python API
```python
from innit.onnx_runner import ONNXInnitRunner, score_text_onnx
# Load model
runner = ONNXInnitRunner("innit.onnx")
# Score text
result = score_text_onnx(runner, text)
print(result["label"]) # "ENGLISH", "NOT-EN", or "UNCERTAIN"
```
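The result dict carries one of the three documented labels. A minimal sketch of dispatching on them — the stubbed `result` stands in for a real `score_text_onnx` call, and any keys beyond `"label"` are not documented here:

```python
# Hypothetical stand-in for: result = score_text_onnx(runner, text)
result = {"label": "UNCERTAIN"}

def handle(result: dict) -> str:
    """Map the three documented labels to an action of your choosing."""
    label = result["label"]
    if label == "ENGLISH":
        return "keep"
    elif label == "NOT-EN":
        return "skip"
    else:  # "UNCERTAIN" — e.g. very short input
        return "review"

print(handle(result))  # -> review
```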
## Training Your Own Model
1. **Train the model**:
```bash
python train_innit.py
```
2. **Export to ONNX**:
```bash
python export_onnx.py
```
3. **Test evaluation**:
```bash
python eval_innit.py sample_text.txt
```
## How It Works
- **Architecture**: Tiny byte-level CNN with depthwise separable convolutions
- **Input**: UTF-8 bytes (no tokenizer needed)
- **Strategy**: Slides 2KB windows over text and aggregates predictions
- **Thresholds**: Conservative - requires high confidence across many windows
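The windowing-plus-aggregation strategy above can be sketched as follows. This is an illustration only: the per-window scorer is assumed to exist, and the threshold values (`hi`, `lo`, `min_margin`) are made up for the example, not the package's actual settings:

```python
def windows(data: bytes, size: int = 2048, stride: int = 2048):
    """Slide fixed-size byte windows over the text."""
    for start in range(0, max(len(data) - size + 1, 1), stride):
        yield data[start:start + size]

def aggregate(scores, hi=0.9, lo=0.1, min_margin=0.8):
    """Conservative aggregation: commit to a label only when
    a large fraction of windows agree with high confidence."""
    if not scores:
        return "UNCERTAIN"
    n = len(scores)
    frac_en = sum(s >= hi for s in scores) / n
    frac_non = sum(s <= lo for s in scores) / n
    if frac_en >= min_margin:
        return "ENGLISH"
    if frac_non >= min_margin:
        return "NOT-EN"
    return "UNCERTAIN"

# Fake per-window English probabilities, standing in for model output:
print(aggregate([0.97, 0.95, 0.99, 0.92]))  # -> ENGLISH
```

Because both conditions demand near-unanimity across windows, mixed-language or ambiguous texts fall through to `"UNCERTAIN"` rather than being mislabeled.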
## Model Details
- **Input**: Sequences of up to 2048 UTF-8 bytes
- **Architecture**: 4-block CNN with residual connections
- **Output**: Binary classification (English probability)
- **Training**: ~50K samples each of English and non-English text
- **Datasets**: Project Gutenberg (English) + multilingual sources (non-English)
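Because the model consumes raw UTF-8 bytes, input preparation is trivial. A sketch of encoding text into a fixed-length byte-ID window — zero-padding and simple truncation are assumptions here; the model's actual preprocessing may differ:

```python
def encode_window(text: str, max_len: int = 2048) -> list[int]:
    """Turn text into a fixed-length byte-ID sequence (no tokenizer).

    Each UTF-8 byte is its own ID (0-255). Padding with zeros and
    truncating at max_len are illustrative assumptions.
    """
    data = text.encode("utf-8")[:max_len]
    ids = list(data)
    ids += [0] * (max_len - len(ids))  # pad to the model's window size
    return ids

ids = encode_window("innit?")
print(len(ids))  # -> 2048
```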
## Legal & Licensing
### Training Data Sources
- **English**: Project Gutenberg texts (public domain in US)
- **Non-English**: HuggingFace multilingual datasets with permissive licenses
- See `DATA_SOURCES.md` for complete dataset information
### Model License
This model and code are released under MIT License. See `LICENSE` for details.
### Usage Notes
- The model weights are original work trained on legally clean data
- No copyrighted text content is redistributed
- Safe for commercial use
## Performance
| Metric | Value |
|--------|--------|
| Model Size (FP32) | ~1.5 MB |
| Model Size (INT8) | ~0.8 MB |
| Inference Speed | <1ms per 2KB window |
| Memory Usage | <100 MB |
| Accuracy | >95% on book-length texts |
## Contributing
1. Fork the repository
2. Create your feature branch
3. Add tests if applicable
4. Submit a pull request
## Troubleshooting
**Model file not found**: Ensure you've either trained a model with `python train_innit.py` or downloaded a pre-trained `innit.onnx` file.
**Import errors**: For inference, you only need `onnxruntime`. For training, install the full development dependencies.
**Poor performance**: The model works best on book-length texts (>1KB). Very short texts may return "UNCERTAIN".