text2clusters

Version: 0.1.3
Summary: Text clustering with embeddings, PCA, DBSCAN/KMeans, and reporting.
Homepage: https://github.com/self-ms/text2clusters
Author: mohammd salehi <sefl.mohammd.salehi@gmail.com>
Requires Python: >=3.9
License: MIT
Keywords: clustering, dbscan, embeddings, nlp, text, unsupervised
Upload time: 2025-11-08 10:01:50
# text2clusters

A practical, configurable toolkit for turning raw text into meaningful clusters with high‑quality reports. It pipelines modern sentence embeddings, dimensionality reduction, and clustering (DBSCAN or K‑Means), then produces an audit‑friendly JSON/CSV report of clusters, exemplars, and per‑sample assignments.

> Note: The algorithm shines on medium to large datasets (thousands of texts). It works on small toy sets, but structure becomes clearer as data grows.

---

## Features

- **Embeddings**: Pluggable backends, including Sentence-Transformers (multilingual) and a lightweight TF‑IDF fallback.
- **Reduction**: PCA with adaptive component selection; optional 2D projection (PCA or t‑SNE) for visualization.
- **Clustering**: DBSCAN (with epsilon sweep) or K‑Means; cosine or euclidean metrics.
- **Reporting**: Per-cluster stats, top representatives, per-sample assignments, noise cluster handling, and export to JSON/CSV.
- **CLI + Python API**: Use it from the command line or integrate in Python notebooks/pipelines.
- **CPU‑friendly defaults**: Works without a GPU. When available, can auto‑select GPU for faster embeddings.

---

## Installation

### From PyPI (recommended)
```bash
pip install text2clusters
```

### Optional dependencies
If you want Sentence-Transformers and PyTorch for high‑quality multilingual embeddings:
```bash
pip install "text2clusters[embeddings]"
```
This installs dependencies like `transformers`, `sentence-transformers`, and `torch`.

> Without extras, the package falls back to TF‑IDF embeddings. This is fast but less semantically rich.
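
For a dependency-light run, here is a minimal sketch using only the TF-IDF fallback, assuming the `Text2Clusters` constructor and config classes shown in the Quickstart below (`model_name="tfidf"` is documented under `EmbeddingConfig`); the exact dataset and parameter values are illustrative:

```python
# Hedged sketch: TF-IDF fallback, no torch/sentence-transformers required.
import pandas as pd
from text2clusters import Text2Clusters, EmbeddingConfig, ReductionConfig, ClusterConfig, ReportingConfig

df = pd.DataFrame({"text": [
    "pizza was great", "loved the pizza crust",
    "the football game was exciting", "our team won the match",
]})

tc = Text2Clusters(
    embedding=EmbeddingConfig(model_name="tfidf", device="cpu", batch_size=8, max_length=256),
    reduction=ReductionConfig(var_ratio=0.95, max_components=10, plot_2d_method="pca"),
    clustering=ClusterConfig(method="kmeans", k=2, metric="cosine"),
    reporting_cfg=ReportingConfig(top_representatives=2),
)
fit = tc.fit(df, text_col="text")
print(fit.result_df[["text", "label"]])
```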

### From source
```bash
git clone https://github.com/self-ms/text2clusters.git
cd text2clusters
pip install -e .
```

---

## Quickstart (Python)

Below is a complete example that you can run as-is. It constructs a small synthetic dataset, fits the pipeline, and prints the report and assignments.

```python
from text2clusters import Text2Clusters, EmbeddingConfig, ReductionConfig, ClusterConfig, ReportingConfig
import pandas as pd

# Synthetic dataset for a quick demo
texts = [
    "Best pizza in town, the crust is amazing!",
    "I love this pizzeria. Great sauce and crispy base.",
    "Terrible service at the restaurant. Waited 40 minutes.",
    "The waiter was rude and the food arrived cold.",
    "The museum exhibition on impressionism was breathtaking.",
    "I enjoyed the modern art gallery, especially the sculptures.",
    "Football match was exciting, our team scored twice!",
    "The coach changed tactics and we won the game.",
    "New GPU benchmarks show impressive ray tracing performance.",
]

df = pd.DataFrame({"text": texts})

tc = Text2Clusters(
    embedding=EmbeddingConfig(
        model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        device="auto",
        batch_size=8,
        max_length=256
    ),
    reduction=ReductionConfig(
        var_ratio=0.95,
        max_components=10,
        plot_2d_method="pca"   # "pca" or "tsne"
    ),
    clustering=ClusterConfig(
        method="dbscan",       # "dbscan" or "kmeans"
        min_samples=2,
        metric="euclidean",    # "euclidean" or "cosine"
        eps_start=0.1,
        eps_end=200.0,
        eps_lr=0.1,
        use_tsne=False         # set True to compute a 2D t-SNE projection
    ),
    reporting_cfg=ReportingConfig(
        top_representatives=5
    )
)

fit = tc.fit(df, text_col="text")
assignments = fit.result_df      # DataFrame with text, label, and any projections
report = fit.report              # JSON-like dict with cluster summary
embeddings = fit.embeddings      # np.ndarray of embedding vectors
reduced = fit.reduced            # np.ndarray of reduced components (e.g., PCA)

print("=== REPORT (summary) ===")
print(report)
print("\n=== ASSIGNMENTS (head) ===")
print(assignments.head())
```

### Using your own CSV
If you have a CSV with a `text` column:
```python
import pandas as pd
from text2clusters import Text2Clusters, EmbeddingConfig, ReductionConfig, ClusterConfig, ReportingConfig

df = pd.read_csv("your_texts.csv")  # must contain a 'text' column

tc = Text2Clusters(
    embedding=EmbeddingConfig(
        model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        device="auto",
        batch_size=8,
        max_length=256
    ),
    reduction=ReductionConfig(var_ratio=0.95, max_components=20, plot_2d_method="pca"),
    clustering=ClusterConfig(method="dbscan", min_samples=5, metric="cosine", eps_start=0.1, eps_end=50.0, eps_lr=0.2, use_tsne=False),
    reporting_cfg=ReportingConfig(top_representatives=5)
)

fit = tc.fit(df, text_col="text")
fit.result_df.to_csv("out_assignments.csv", index=False)
fit.save_report("out_report.json")
```

---

## Command Line Interface (CLI)

After installation, a console script (e.g., `text2clusters`) should be available. Typical usage:

```bash
# Minimal run: read CSV, detect text column automatically if only one string column
text2clusters fit \
  --input data.csv \
  --text-col text \
  --method dbscan \
  --out-assignments out_assignments.csv \
  --out-report out_report.json
```

More options:
```bash
text2clusters fit \
  --input data.csv \
  --text-col text \
  --embedding-model "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" \
  --device auto \
  --batch-size 16 \
  --max-length 256 \
  --reduction-var-ratio 0.95 \
  --reduction-max-components 20 \
  --plot-2d pca \
  --method dbscan \
  --metric cosine \
  --min-samples 5 \
  --eps-start 0.1 \
  --eps-end 50.0 \
  --eps-lr 0.2 \
  --use-tsne false \
  --top-representatives 5 \
  --out-assignments out_assignments.csv \
  --out-report out_report.json
```

If you prefer K‑Means:
```bash
text2clusters fit \
  --input data.csv \
  --text-col text \
  --method kmeans \
  --k 8 \
  --metric cosine \
  --out-assignments out_assignments.csv \
  --out-report out_report.json
```

---

## API Reference (Configs)

### `EmbeddingConfig`
- `model_name: str` Sentence-Transformers model id. Use `"tfidf"` for TF‑IDF fallback.
- `device: str` `"auto"`, `"cpu"`, or specific like `"cuda:0"`.
- `batch_size: int` Embedding batch size.
- `max_length: int` Truncation length for transformer models (ignored for TF‑IDF).

### `ReductionConfig`
- `var_ratio: float` Target explained variance ratio for PCA; the number of components is chosen automatically (illustrated after this list).
- `max_components: int` Hard cap on PCA components.
- `plot_2d_method: str` `"pca"` or `"tsne"` for 2D projection saved in `result_df`.
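
To make `var_ratio` concrete, here is an illustration of variance-ratio component selection using scikit-learn's `PCA`; this mirrors the documented behavior but is not the package's internal code:

```python
# Illustration only: keep the smallest number of PCA components that explains
# >= 95% of variance, then apply a hard cap (cf. var_ratio / max_components).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 384))  # stand-in embedding matrix
pca = PCA(n_components=0.95, svd_solver="full")       # float target => variance ratio
X_reduced = pca.fit_transform(X)

max_components = 20                                   # hard cap, as in ReductionConfig
X_reduced = X_reduced[:, :max_components]
print(X_reduced.shape)
```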

### `ClusterConfig`
- `method: str` `"dbscan"` or `"kmeans"`.
- `metric: str` `"euclidean"` or `"cosine"`.
- **DBSCAN-only**:
  - `min_samples: int` Minimum points to form a core point.
  - `eps_start, eps_end, eps_lr: float` Range and step size ("learning rate") for the epsilon sweep; see the sketch after this list.
  - `use_tsne: bool` If `True`, compute a 2D t‑SNE projection for visualization (costly on large data).
- **K‑Means-only**:
  - `k: int | None` Number of clusters. If `None`, the algorithm may pick a heuristic (e.g., sqrt(n)).
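
The epsilon sweep can be pictured with plain scikit-learn. This is a conceptual sketch only; the package's internal sweep may differ, and the additive `eps_lr` step is an assumption:

```python
# Conceptual epsilon sweep with scikit-learn's DBSCAN (not the package's internals).
import numpy as np
from sklearn.cluster import DBSCAN

def sweep_eps(X: np.ndarray, eps_start: float, eps_end: float, eps_lr: float,
              min_samples: int = 5, metric: str = "euclidean"):
    """Run DBSCAN over a grid of eps values; return (eps, n_clusters, n_noise) per step."""
    out = []
    eps = eps_start
    while eps <= eps_end:
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        out.append((eps, n_clusters, int((labels == -1).sum())))
        eps += eps_lr  # additive step; assumed interpretation of eps_lr
    return out
```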

### `ReportingConfig`
- `top_representatives: int` Number of exemplar texts per cluster in the report.
- Additional fields may include saving keyword summaries, noise cluster handling, etc., depending on version.

### `FitResult`
- `result_df: pd.DataFrame` One row per sample with columns like `text`, `label`, `pca_x`, `pca_y` or `tsne_x`, `tsne_y`.
- `report: dict` Cluster‑level statistics and exemplar texts.
- `embeddings: np.ndarray` High‑dimensional embeddings.
- `reduced: np.ndarray | None` PCA‑reduced array (if reduction enabled).
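
A small helper for inspecting assignments, relying only on the `label` column documented above (`-1`, when present, is the DBSCAN noise bucket):

```python
import pandas as pd

def cluster_sizes(result_df: pd.DataFrame) -> pd.Series:
    """Samples per cluster label; -1 (if present) is the DBSCAN noise bucket."""
    return result_df["label"].value_counts().sort_index()

# e.g., cluster_sizes(fit.result_df)
```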

---

## Tips and Guidance

- **Scale matters**: structure becomes more reliable as you approach thousands of texts. With only dozens, expect less stable clusters.
- **Metric**: `metric="cosine"` often works well for sentence embeddings; embeddings are normalized by default when cosine is used, for both DBSCAN and K‑Means.
- **DBSCAN tuning**: sweep `eps` across a sensible range. Start narrow if your data is dense; widen for varied topics.
- **Hardware**: on CPU, transformer embeddings can be slow. If you have a GPU, set `device="cuda:0"` or keep `auto` and let the library decide.
- **Caching**: for repeated runs on the same data, consider caching embeddings to disk (a sketch follows this list).
- **Reproducibility**: set `random_state` where available for deterministic behavior.
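
A caching sketch for the tip above; the cache path and `embed_fn` callable are illustrative, not built-in features:

```python
import os
import numpy as np

def load_or_embed(texts, embed_fn, cache_path="embeddings.npy"):
    """Reuse embeddings from disk when present; otherwise compute and save."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    emb = np.asarray(embed_fn(texts))  # embed_fn: any texts -> array-like callable
    np.save(cache_path, emb)
    return emb
```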

---

## Troubleshooting

- **ValueError: perplexity must be less than n_samples**  
  You asked for t‑SNE on too few samples. Lower `perplexity` (if configurable) or disable `use_tsne` until you have more data.
- **Out of memory during embedding**  
  Reduce `batch_size`, switch to a smaller model, or use TF‑IDF fallback (`model_name="tfidf"`).
- **All points labeled -1 (noise) in DBSCAN**  
  Increase `eps`, decrease `min_samples`, or switch to cosine distance.
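
For the all-noise case, a quick look at the noise share can guide tuning; a minimal sketch using the documented `label` column:

```python
def noise_fraction(result_df) -> float:
    """Share of samples DBSCAN labeled as noise (label == -1)."""
    return float((result_df["label"] == -1).mean())
```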

---

## License

This project is released under the MIT License. See `LICENSE` for details.

---

## Changelog

See `CHANGELOG.md` for notable changes between releases.
            
