# intelli3text
Ingestion of texts (Web/PDF/DOCX/TXT), cleaning, and multilingual normalization (PT/EN/ES) with **paragraph-level language detection** and **PDF export**.
Focus on **frictionless install (`pip install`)**: on first run it **auto-downloads** the required models (fastText LID and spaCy) and works **offline** with sensible fallbacks.
**Docs website:** https://jeffersonspeck.github.io/intelli3text/
**PyPI:** https://pypi.org/project/intelli3text/
**Repository:** https://github.com/jeffersonspeck/intelli3text
---
## Table of Contents
- [Usage Manual](USAGE.md)
- [Why this project?](#why-this-project)
- [Key features](#key-features)
- [Requirements](#requirements)
- [Installation](#installation)
- [Quick start (CLI)](#quick-start-cli)
- [CLI examples](#cli-examples)
- [Python usage (API)](#python-usage-api)
- [Language identification (LID)](#language-identification-lid)
- [spaCy models & normalization](#spacy-models--normalization)
- [Cleaning pipeline](#cleaning-pipeline)
- [PDF export](#pdf-export)
- [Cache, auto-downloads & offline mode](#cache-auto-downloads--offline-mode)
- [Architecture & Design Patterns](#architecture--design-patterns)
- [Design Science Research (DSR)](#design-science-research-dsr)
- [Binary compatibility (NumPy/Thinc/spaCy)](#binary-compatibility-numpythincspacy)
- [Performance tips](#performance-tips)
- [Extensibility](#extensibility)
- [Troubleshooting](#troubleshooting)
- [Roadmap](#roadmap)
- [License](#license)
- [How to cite](#how-to-cite)
---
## Why this project?
In research and production, common needs include:
1. **Ingest** text from heterogeneous sources (web, PDFs, DOCX, TXT);
2. **Clean** and **normalize** the content;
3. **Lemmatize** and remove stopwords;
4. **Detect language** accurately, including **bilingual** documents;
5. **Export** results with traceability (PDF that shows normalized, cleaned, and raw text).
**intelli3text** is built to be **plug-and-play**: `pip install` and go — no native toolchains, no manual compiles, no painful environment setup.
---
## Key features
- **Ingestion**: URL (HTML), PDF (`pdfminer.six`), DOCX (`python-docx`), TXT.
- **Cleaning**: Unicode fixes (`ftfy`), noise removal (`clean-text`), PDF-specific line-break & hyphenation heuristics.
- **Paragraph-level LID**: **fastText LID** (176 languages) with tolerant fallback.
- **spaCy normalization**: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
- **PDF export**: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
- **Auto-download on first run**:
- `lid.176.bin` (fastText LID);
- spaCy models for PT/EN/ES (`lg→md→sm`) with offline fallback.
- **CLI & Python API**: use from shell or embed in code.
---
## Requirements
- **Python 3.9+**
- Internet only on **first run** (to download models). After that, it works offline.
- To avoid binary mismatches, the package pins **compatible** versions of `numpy`, `thinc`, and `spacy`.
---
## Installation
```bash
pip install intelli3text
# or from a local repo:
# pip install .
```
> **No extra scripts.**
> On first execution, required models are fetched to a local cache automatically.
---
## Quick start (CLI)
```bash
intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf
```
Output:
* JSON to `stdout` with `language_global`, `cleaned`, `normalized`, and a list of `paragraphs`.
* A PDF report at `output.pdf`.
---
## CLI examples
* Local PDF:

  ```bash
  intelli3text "./my_paper.pdf" --export-pdf report.pdf
  ```

* Choose spaCy model size:

  ```bash
  intelli3text "URL" --nlp-size md
  # options: lg (default) | md | sm
  ```

* Select cleaners:

  ```bash
  intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
  ```

* Save JSON to file:

  ```bash
  intelli3text "URL" --json-out result.json
  ```

* Use CLD3 as primary (if installed as extra):

  ```bash
  pip install intelli3text[cld3]
  intelli3text "URL" --lid-primary cld3 --lid-fallback none
  ```
> Full CLI reference: see **Docs → CLI** on the website:
> [https://jeffersonspeck.github.io/intelli3text/](https://jeffersonspeck.github.io/intelli3text/)
---
## Python usage (API)
```python
from intelli3text import PipelineBuilder, Intelli3Config
cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",  # or "cld3" if you installed the extra
    lid_fallback=None,       # or "cld3"
    nlp_model_pref="lg",     # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)
pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")
print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])
```
> More samples (including safe-to-import examples): **Docs → Examples**.
---
## Language identification (LID)
* **Primary**: **fastText LID** (`lid.176.bin`) auto-downloaded on first use.
* **Tolerant**: if `fasttext` is unavailable, the pipeline **won’t crash** — it returns `"pt"` with confidence `0.0` as a safe fallback.
* **Accuracy**: detection runs per **paragraph**; `language_global` is the most frequent paragraph-level language.
* **Optional**: `pycld3` via extra:

  ```bash
  pip install intelli3text[cld3]
  # CLI: --lid-primary cld3 --lid-fallback none
  ```
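The aggregation from paragraph-level predictions to `language_global` can be sketched as a simple majority vote, including the `"pt"` safe fallback described above. (`global_language` is a hypothetical helper for illustration, not part of the public API.)

```python
from collections import Counter

def global_language(paragraph_langs):
    """Pick the document-level language as the most frequent
    paragraph-level prediction (ties broken by first occurrence)."""
    if not paragraph_langs:
        return "pt"  # safe fallback when no paragraph could be detected
    return Counter(paragraph_langs).most_common(1)[0][0]
```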
---
## spaCy models & normalization
* Size preference: **`lg` → `md` → `sm`**.
* If the model is missing, the library **tries to download it**.
* **Offline**: falls back to `spacy.blank(<lang>)` with a `sentencizer` (no crash).
* Normalization includes:
* tokenization;
* dropping stopwords/punctuation/whitespace;
* **lemmatization** (when the model has a lexicon);
* joining lemmas.
---
## Cleaning pipeline
Default order (`--cleaners ftfy,clean_text,pdf_breaks`):
1. **FTFY**: fixes Unicode glitches.
2. **clean-text**: removes URLs/emails/phones; keeps numbers/punctuation by default.
3. **pdf_breaks**: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).
You can customize the list/order via CLI or API.
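The `pdf_breaks` heuristics can be sketched with stdlib regexes (an illustrative approximation, not the library's implementation):

```python
import re

def fix_pdf_breaks(text: str) -> str:
    """Sketch of the three PDF heuristics:
    de-hyphenate across line breaks, merge artificial single
    breaks into spaces, collapse blank-line runs into one break."""
    # 1. "infor-\nmation" -> "information"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # protect real paragraph boundaries before merging
    text = re.sub(r"\n{2,}", "\0", text)
    # 2. remaining single newlines are artificial wraps
    text = text.replace("\n", " ")
    # 3. restore paragraph breaks as a single blank line
    return text.replace("\0", "\n\n")
```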
---
## PDF export
The report includes:
* **Summary** (global language, total paragraphs),
* **Global Normalized Text** (optional),
* **Per-paragraph table** (language, confidence, normalized preview),
* Per-paragraph sections showing:
* **normalized**,
* **cleaned**,
* **raw**.
Library: **ReportLab**.
---
## Cache, auto-downloads & offline mode
* Default **cache** directory: `~/.cache/intelli3text/`
Override via env var:
`INTELLI3TEXT_CACHE_DIR=/your/custom/path`
* **Auto-download** on first use:
* `lid.176.bin` (fastText LID),
* spaCy models PT/EN/ES in order `lg→md→sm`.
* **Offline** behavior:
* LID returns fallback `"pt", 0.0` if fastText is unavailable;
* spaCy uses `blank()` (functional, but without full lexical features).
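Cache-directory resolution can be sketched as follows; only the environment variable name and the default path come from the docs above, and `cache_dir` itself is a hypothetical helper:

```python
import os
from pathlib import Path

def cache_dir() -> Path:
    """Resolve the model cache directory: INTELLI3TEXT_CACHE_DIR
    wins when set; otherwise fall back to ~/.cache/intelli3text."""
    override = os.environ.get("INTELLI3TEXT_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "intelli3text"
```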
---
## Architecture & Design Patterns
**Applied patterns**:
* **Builder**: `PipelineBuilder` composes extractors, cleaners, LID, normalizer, and exporters from declarative config.
* **Strategy**:
* *Extractors* (Web/PDF/DOCX/TXT) implement `IExtractor`.
* *Cleaners* implement `ICleaner`, chained via `CleanerChain`.
* *Language Detectors* implement a simple interface (`FastTextLID`, `CLD3LID`).
* *Normalizer* implements `INormalizer` (`SpacyNormalizer` here).
* *Exporters* implement `IExporter` (`PDFExporter` here).
* **Factory/Registry**: lazy loading of spaCy models by lang/size with fallbacks.
* **Facade**: CLI and `Pipeline.process()` offer a simple entry point.
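The Strategy + Chain of Responsibility combination for cleaners can be sketched like this (a pattern illustration with simplified signatures, not the library's exact interfaces):

```python
from typing import Iterable, Protocol

class ICleaner(Protocol):
    """Strategy interface: each cleaner transforms text in place."""
    def clean(self, text: str) -> str: ...

class CleanerChain:
    """Chain of Responsibility: apply cleaners in the configured order,
    feeding each cleaner's output into the next."""
    def __init__(self, cleaners: Iterable[ICleaner]):
        self.cleaners = list(cleaners)

    def clean(self, text: str) -> str:
        for cleaner in self.cleaners:
            text = cleaner.clean(text)
        return text
```

Because each stage only sees a string, the order in `--cleaners` maps directly to the order of `self.cleaners`.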
**Package layout (summary)**
```
src/intelli3text/
  __init__.py
  __main__.py           # CLI
  config.py             # Intelli3Config (parameters)
  utils.py              # cache/download helpers
  builder.py            # PipelineBuilder (Builder)
  pipeline.py           # Pipeline (Facade)
  extractors/           # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py
  cleaners/             # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py
  lid/                  # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py
  nlp/
    base.py
    registry.py         # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py # Strategy
  export/
    base.py
    pdf_reportlab.py    # Strategy
```
---
## Design Science Research (DSR)
* **Artifact**: robust ingestion/cleaning/LID/normalization/export pipeline prioritizing reproducibility and trivial install.
* **Problem**: heterogeneous sources, bilingual content, and environment friction (native deps, binary mismatches).
* **Design**: auto-downloads, fallbacks, and stable binary pins; per-paragraph LID; auditable PDF report.
* **Demonstration**: clean CLI & Python API; Web/PDF/DOCX/TXT; PT/EN/ES.
* **Evaluation**: empirical stability across environments (user site, WSL, Windows), LID quality (fastText), normalization quality (spaCy).
* **Contributions**: engineering best practices (Builder/Strategy/Factory) to minimize friction and maximize reuse in research/production.
---
## Binary compatibility (NumPy/Thinc/spaCy)
To avoid the classic `numpy.dtype size changed` error:
* We pin **compatible** versions in `pyproject.toml`.
* If you already had other global packages and hit this error:
1. `pip uninstall -y spacy thinc numpy`
2. `pip cache purge`
3. `pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"`
4. `pip install --user --no-cache-dir intelli3text` (or `-e .` from the local repo)
> Tip: always use the **same Python** that runs `intelli3text` (check `head -1 ~/.local/bin/intelli3text`).
---
## Performance tips
* **Paragraph length**: controlled by `paragraph_min_chars` (default 30) and `lid_min_chars` (default 60).
* **LID sample cap**: very long texts are truncated to roughly 2,000 characters before detection, which speeds things up with negligible loss of accuracy.
* **spaCy model size**: `sm` is lighter; `lg` gives better quality (default).
---
## Extensibility
* **New sources**: implement `IExtractor` and register in `PipelineBuilder`.
* **New cleaners**: implement `ICleaner` and map it in `NAME2CLEANER`.
* **New LIDs**: implement the interface under `lid/base.py`.
* **Exporters**: implement `IExporter` (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.
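As an example of the exporter extension point, a JSONL exporter might look like this (`JsonlExporter` and its `export` signature are assumptions for illustration; match the actual `IExporter` interface in `export/base.py`):

```python
import json

class JsonlExporter:
    """Hypothetical IExporter implementation: one JSON object per
    paragraph, convenient for streaming into downstream tools."""
    def export(self, result: dict, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            for para in result.get("paragraphs", []):
                fh.write(json.dumps(para, ensure_ascii=False) + "\n")
```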
---
## Troubleshooting
* **Trafilatura ‘unidecode’ warning**: already handled — we depend on `Unidecode`.
* **No Internet on first run**:
* LID: fallback `"pt", 0.0`.
* spaCy: `spacy.blank(<lang>)`.
* Later, with Internet, run again to fetch full models.
* **`ModuleNotFoundError: fasttext`**:
* We depend on `fasttext-wheel` (prebuilt wheels).
* Reinstall: `pip install fasttext-wheel`.
> More tips and parameter-by-parameter guidance:
> [https://jeffersonspeck.github.io/intelli3text/](https://jeffersonspeck.github.io/intelli3text/)
---
## Roadmap
* [ ] Exporters: HTML/Markdown with paragraph navigation.
* [ ] Quality metrics (lexical density, diversity, etc.).
* [ ] More languages via custom spaCy models.
* [ ] Optional normalization using Stanza.
---
## License
**MIT** — you’re free to use, modify and distribute.
> Note: the original upstream licenses of third-party models and libraries still apply.
---
## How to cite
> Speck, J. (2025). **intelli3text**: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: [https://github.com/jeffersonspeck/intelli3text](https://github.com/jeffersonspeck/intelli3text)