intelli3text

Name: intelli3text
Version: 0.2.5
Summary: Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaCy-based normalization; PDF export.
Upload time: 2025-10-13 00:46:31
Requires Python: >=3.9
License: MIT
Keywords: NLP, spaCy, language id, LID, cleaning, normalization, PDF, web extraction, text processing, Portuguese, Spanish, English
# intelli3text

Ingestion of texts (Web/PDF/DOCX/TXT), cleaning, and multilingual normalization (PT/EN/ES) with **paragraph-level language detection** and **PDF export**.  
Focus on **frictionless install (`pip install`)**: on first run it **auto-downloads** the required models (fastText LID and spaCy) and works **offline** with sensible fallbacks.

**Docs website:** https://jeffersonspeck.github.io/intelli3text/  
**PyPI:** https://pypi.org/project/intelli3text/  
**Repository:** https://github.com/jeffersonspeck/intelli3text

---

## Table of Contents

- [Usage Manual](USAGE.md)
- [Why this project?](#why-this-project)
- [Key features](#key-features)
- [Requirements](#requirements)
- [Installation](#installation)
- [Quick start (CLI)](#quick-start-cli)
- [CLI examples](#cli-examples)
- [Python usage (API)](#python-usage-api)
- [Language identification (LID)](#language-identification-lid)
- [spaCy models & normalization](#spacy-models--normalization)
- [Cleaning pipeline](#cleaning-pipeline)
- [PDF export](#pdf-export)
- [Cache, auto-downloads & offline mode](#cache-auto-downloads--offline-mode)
- [Architecture & Design Patterns](#architecture--design-patterns)
- [Design Science Research (DSR)](#design-science-research-dsr)
- [Binary compatibility (NumPy/Thinc/spaCy)](#binary-compatibility-numpythincspacy)
- [Performance tips](#performance-tips)
- [Extensibility](#extensibility)
- [Troubleshooting](#troubleshooting)
- [Roadmap](#roadmap)
- [License](#license)
- [How to cite](#how-to-cite)

---

## Why this project?

In research and production, common needs include:

1. **Ingest** text from heterogeneous sources (web, PDFs, DOCX, TXT);
2. **Clean** and **normalize** the content;
3. **Lemmatize** and remove stopwords;
4. **Detect language** accurately, including **bilingual** documents;
5. **Export** results with traceability (PDF that shows normalized, cleaned, and raw text).

**intelli3text** is built to be **plug-and-play**: `pip install` and go — no native toolchains, no manual compiles, no painful environment setup.

---

## Key features

- **Ingestion**: URL (HTML), PDF (`pdfminer.six`), DOCX (`python-docx`), TXT.
- **Cleaning**: Unicode fixes (`ftfy`), noise removal (`clean-text`), PDF-specific line-break & hyphenation heuristics.
- **Paragraph-level LID**: **fastText LID** (176 languages) with tolerant fallback.
- **spaCy normalization**: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
- **PDF export**: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
- **Auto-download on first run**:
  - `lid.176.bin` (fastText LID);
  - spaCy models for PT/EN/ES (`lg→md→sm`) with offline fallback.
- **CLI & Python API**: use from shell or embed in code.

---

## Requirements

- **Python 3.9+**
- Internet only on **first run** (to download models). After that, it works offline.
- To avoid binary mismatches, the package pins **compatible** versions of `numpy`, `thinc`, and `spacy`.

---

## Installation

```bash
pip install intelli3text
# or from a local repo:
# pip install .
```

> **No extra scripts.**
> On first execution, required models are fetched to a local cache automatically.

---

## Quick start (CLI)

```bash
intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf
```

Output:

* JSON to `stdout` with `language_global`, `cleaned`, `normalized`, and a list of `paragraphs`.
* A PDF report at `output.pdf`.

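If you want to post-process that JSON programmatically, here is a minimal sketch using the `--json-out` flag shown in the CLI examples below and the documented top-level keys (everything else is standard library):

```python
import json
import subprocess

# Run the CLI and write the JSON result to a file (same flags as documented below).
subprocess.run(
    ["intelli3text", "https://pt.wikipedia.org/wiki/Howard_Gardner", "--json-out", "result.json"],
    check=True,
)

with open("result.json", encoding="utf-8") as fh:
    res = json.load(fh)

print(res["language_global"])      # global language, e.g. "pt"
print(len(res["paragraphs"]))      # number of paragraphs in the result
```
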
---

## CLI examples

* Local PDF:

  ```bash
  intelli3text "./my_paper.pdf" --export-pdf report.pdf
  ```

* Choose spaCy model size:

  ```bash
  intelli3text "URL" --nlp-size md
  # options: lg (default) | md | sm
  ```

* Select cleaners:

  ```bash
  intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
  ```

* Save JSON to file:

  ```bash
  intelli3text "URL" --json-out result.json
  ```

* Use CLD3 as primary (if installed as extra):

  ```bash
  pip install intelli3text[cld3]
  intelli3text "URL" --lid-primary cld3 --lid-fallback none
  ```

> Full CLI reference: see **Docs → CLI** on the website:
> [https://jeffersonspeck.github.io/intelli3text/](https://jeffersonspeck.github.io/intelli3text/)

---

## Python usage (API)

```python
from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",         # or "cld3" if you installed the extra
    lid_fallback=None,              # or "cld3"
    nlp_model_pref="lg",            # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")

print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])
```

> More samples (including safe-to-import examples): **Docs → Examples**.

---

## Language identification (LID)

* **Primary**: **fastText LID** (`lid.176.bin`) auto-downloaded on first use.
* **Tolerant**: if `fasttext` is unavailable, the pipeline **won’t crash** — it returns `"pt"` with confidence `0.0` as a safe fallback.
* **Accuracy**: detection runs per **paragraph**; `language_global` is the most frequent paragraph-level language (see the sketch after this list).
* **Optional**: `pycld3` via extra:

  ```bash
  pip install intelli3text[cld3]
  # CLI: --lid-primary cld3 --lid-fallback none
  ```

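The per-paragraph detection plus "most frequent wins" aggregation can be illustrated with a short sketch. This mirrors the behaviour described above but is not the library's internal code, and the model path is only illustrative (intelli3text resolves it inside its cache):

```python
from collections import Counter

import fasttext  # provided by the fasttext-wheel dependency

# Illustrative path; intelli3text downloads and caches lid.176.bin for you.
model = fasttext.load_model("lid.176.bin")

def detect(paragraph: str) -> tuple:
    # fastText's predict() rejects newlines, so flatten the paragraph first.
    labels, probs = model.predict(paragraph.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__"), float(probs[0])

paragraphs = ["Primeiro parágrafo em português.", "Second paragraph written in English."]
per_paragraph = [detect(p) for p in paragraphs]
language_global = Counter(lang for lang, _ in per_paragraph).most_common(1)[0][0]
print(per_paragraph, language_global)
```
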
---

## spaCy models & normalization

* Size preference: **`lg` → `md` → `sm`**.
* If the model is missing, the library **tries to download it**.
* **Offline**: falls back to `spacy.blank(<lang>)` with a `sentencizer` (no crash).
* Normalization includes:

  * tokenization;
  * dropping stopwords/punctuation/whitespace;
  * **lemmatization** (when the model has a lexicon);
  * joining lemmas.

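A compact sketch of the `lg → md → sm` preference and the token filtering described above (illustrative only; intelli3text wraps this logic in its model registry and `SpacyNormalizer`, and also attempts to download missing models, which is omitted here):

```python
import spacy

MODEL_PREFIX = {"pt": "pt_core_news_", "en": "en_core_web_", "es": "es_core_news_"}

def load_model(lang="pt", pref="lg"):
    sizes = {"lg": ["lg", "md", "sm"], "md": ["md", "sm"], "sm": ["sm"]}[pref]
    for size in sizes:
        try:
            return spacy.load(MODEL_PREFIX[lang] + size)   # e.g. pt_core_news_lg
        except OSError:
            continue                                       # model not installed, try smaller
    nlp = spacy.blank(lang)                                # offline fallback (no full lexicon)
    nlp.add_pipe("sentencizer")
    return nlp

def normalize(nlp, text):
    doc = nlp(text)
    lemmas = [t.lemma_ or t.text                           # blank models have empty lemmas
              for t in doc
              if not (t.is_stop or t.is_punct or t.is_space)]
    return " ".join(lemmas)

nlp = load_model("pt", "lg")
print(normalize(nlp, "Os gatos estavam dormindo no sofá da sala."))
```
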
---

## Cleaning pipeline

Default order (`--cleaners ftfy,clean_text,pdf_breaks`):

1. **FTFY**: fixes Unicode glitches.
2. **clean-text**: removes URLs/emails/phones; keeps numbers/punctuation by default.
3. **pdf_breaks**: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).

You can customize the list and order of cleaners via the CLI or the API; a sketch of the `pdf_breaks` heuristics follows below.

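The `pdf_breaks` step, for example, can be approximated with a few regular expressions. This is an illustration of the heuristics listed above, not the package's exact implementation:

```python
import re

def pdf_breaks(text: str) -> str:
    # De-hyphenation across line breaks: "informa-\nção" -> "informação"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Merge artificial single line breaks inside a sentence into spaces
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of blank lines into a single paragraph break
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text

print(pdf_breaks("A informa-\nção foi extraída\nde um PDF.\n\n\nNovo parágrafo."))
```
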
---

## PDF export

The report includes:

* **Summary** (global language, total paragraphs),
* **Global Normalized Text** (optional),
* **Per-paragraph table** (language, confidence, normalized preview),
* Per-paragraph sections showing:

  * **normalized**,
  * **cleaned**,
  * **raw**.

Library: **ReportLab**.

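For orientation, a minimal ReportLab sketch of this kind of report (illustrative only; the layout produced by intelli3text's `PDFExporter` differs, and the row data here is made up):

```python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Table

styles = getSampleStyleSheet()
rows = [("pt", 0.99, "howard gardner psicólogo desenvolvimentista ..."),
        ("en", 0.97, "theory of multiple intelligences ...")]

story = [
    Paragraph("intelli3text report", styles["Title"]),
    Paragraph(f"Global language: pt ({len(rows)} paragraphs)", styles["Normal"]),
    Table([("language", "confidence", "normalized preview")]
          + [(lang, f"{conf:.2f}", preview[:60]) for lang, conf, preview in rows]),
]
SimpleDocTemplate("report.pdf", pagesize=A4).build(story)
```
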
---

## Cache, auto-downloads & offline mode

* Default **cache** directory: `~/.cache/intelli3text/`.
  Override via the env var `INTELLI3TEXT_CACHE_DIR=/your/custom/path` (see the example after this list).

* **Auto-download** on first use:

  * `lid.176.bin` (fastText LID),
  * spaCy models PT/EN/ES in order `lg→md→sm`.

* **Offline** behavior:

  * LID returns fallback `"pt", 0.0` if fastText is unavailable;
  * spaCy uses `blank()` (functional, but without full lexical features).

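For example, to point the cache at a shared directory before running the CLI (the path is just a placeholder):

```bash
export INTELLI3TEXT_CACHE_DIR=/data/intelli3text-cache
intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --json-out result.json
```
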
---

## Architecture & Design Patterns

**Applied patterns**:

* **Builder**: `PipelineBuilder` composes extractors, cleaners, LID, normalizer, and exporters from declarative config.
* **Strategy**:

  * *Extractors* (Web/PDF/DOCX/TXT) implement `IExtractor`.
  * *Cleaners* implement `ICleaner`, chained via `CleanerChain`.
  * *Language Detectors* implement a simple interface (`FastTextLID`, `CLD3LID`).
  * *Normalizer* implements `INormalizer` (`SpacyNormalizer` here).
  * *Exporters* implement `IExporter` (`PDFExporter` here).
* **Factory/Registry**: lazy loading of spaCy models by lang/size with fallbacks.
* **Facade**: CLI and `Pipeline.process()` offer a simple entry point.

**Package layout (summary)**

```
src/intelli3text/
  __init__.py
  __main__.py            # CLI
  config.py              # Intelli3Config (parameters)
  utils.py               # cache/download helpers
  builder.py             # PipelineBuilder (Builder)
  pipeline.py            # Pipeline (Facade)

  extractors/            # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py

  cleaners/              # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py

  lid/                   # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py

  nlp/
    base.py
    registry.py          # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py  # Strategy

  export/
    base.py
    pdf_reportlab.py     # Strategy
```

---

## Design Science Research (DSR)

* **Artifact**: robust ingestion/cleaning/LID/normalization/export pipeline prioritizing reproducibility and trivial install.
* **Problem**: heterogeneous sources, bilingual content, and environment friction (native deps, binary mismatches).
* **Design**: auto-downloads, fallbacks, and stable binary pins; per-paragraph LID; auditable PDF report.
* **Demonstration**: clean CLI & Python API; Web/PDF/DOCX/TXT; PT/EN/ES.
* **Evaluation**: empirical stability across environments (user site, WSL, Windows), LID quality (fastText), normalization quality (spaCy).
* **Contributions**: engineering best practices (Builder/Strategy/Factory) to minimize friction and maximize reuse in research/production.

---

## Binary compatibility (NumPy/Thinc/spaCy)

To avoid the classic `numpy.dtype size changed` error:

* We pin **compatible** versions in `pyproject.toml`.
* If you already had other global packages and hit this error:

  1. `pip uninstall -y spacy thinc numpy`
  2. `pip cache purge`
  3. `pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"`
  4. `pip install --user --no-cache-dir intelli3text` (or `-e .` from the local repo)

> Tip: always use the **same Python** that runs `intelli3text` (check `head -1 ~/.local/bin/intelli3text`).

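A quick way to confirm that the interpreter running `intelli3text` actually sees the pinned stack:

```bash
python -c "import numpy, thinc, spacy; print(numpy.__version__, thinc.__version__, spacy.__version__)"
```
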
---

## Performance tips

* **Paragraph length**: controlled by `paragraph_min_chars` (default 30) and `lid_min_chars` (default 60); see the sketch after this list.
* **LID sample cap**: very long texts are truncated (~2k chars) before detection, which speeds things up with little loss of accuracy.
* **spaCy model size**: `sm` is lighter; `lg` gives better quality (default).

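If these thresholds are exposed through `Intelli3Config`, tuning them could look like the sketch below. Treat the keyword names as assumptions mirroring the parameters above; check the docs website for the authoritative parameter list.

```python
from intelli3text import Intelli3Config, PipelineBuilder

cfg = Intelli3Config(
    nlp_model_pref="sm",        # lighter spaCy models for throughput
    paragraph_min_chars=30,     # skip very short paragraphs (assumed keyword name)
    lid_min_chars=60,           # minimum length before running LID (assumed keyword name)
)
pipeline = PipelineBuilder(cfg).build()
```
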
---

## Extensibility

* **New sources**: implement `IExtractor` and register in `PipelineBuilder`.
* **New cleaners**: implement `ICleaner` and map it in `NAME2CLEANER`.
* **New LIDs**: implement the interface under `lid/base.py`.
* **Exporters**: implement `IExporter` (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.

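A custom cleaner might look like the sketch below. The exact `ICleaner` signature and the `NAME2CLEANER` registration mechanism should be checked against `cleaners/base.py`; the method name and registration line here are assumptions about that interface.

```python
import re

class StripPageNumbersCleaner:
    """Remove lines that contain only a page number (a common PDF artifact)."""

    def clean(self, text: str) -> str:  # assumed method name on ICleaner
        return re.sub(r"(?m)^\s*\d{1,4}\s*$\n?", "", text)

# Assumed registration: map a CLI/config name to the class (see NAME2CLEANER).
# NAME2CLEANER["strip_page_numbers"] = StripPageNumbersCleaner
```
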
---

## Troubleshooting

* **Trafilatura ‘unidecode’ warning**: already handled — we depend on `Unidecode`.
* **No Internet on first run**:

  * LID: fallback `"pt", 0.0`.
  * spaCy: `spacy.blank(<lang>)`.
  * Later, with Internet, run again to fetch full models.
* **`ModuleNotFoundError: fasttext`**:

  * We depend on `fasttext-wheel` (prebuilt wheels).
  * Reinstall: `pip install fasttext-wheel`.

> More tips and parameter-by-parameter guidance:
> [https://jeffersonspeck.github.io/intelli3text/](https://jeffersonspeck.github.io/intelli3text/)

---

## Roadmap

* [ ] Exporters: HTML/Markdown with paragraph navigation.
* [ ] Quality metrics (lexical density, diversity, etc.).
* [ ] More languages via custom spaCy models.
* [ ] Optional normalization using Stanza.

---

## License

**MIT** — you’re free to use, modify and distribute.

> Note: the original upstream licenses of third-party models and libraries still apply.

---

## How to cite

> Speck, J. (2025). **intelli3text**: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: [https://github.com/jeffersonspeck/intelli3text](https://github.com/jeffersonspeck/intelli3text)
