# Installation
## From GitHub
```
pip3 install git+https://github.com/er1kb/distinction
```
or clone and install locally:
```
git clone https://github.com/er1kb/distinction.git && cd distinction && pip3 install .
```
## From PyPI
```
python3 -m pip install distinction
```
## Dependencies
* [NumPy](https://numpy.org/) >= 1.25.0
* [SentenceTransformers](https://sbert.net/) >= 3.0.1
* [Plotext](https://github.com/piccolomo/plotext) >= 5.3.2 (optional)
# English
## What is it
A common task is predicting whether a piece of text belongs to a given category or not. You may be working with custom logs (customer service requests, reviews, etc.) or open-ended survey responses that need to be coded at scale. Although neural networks can be used to classify latent variables in natural language, their complexity and overhead are a disadvantage.
Embeddings are the features that neural networks use. They quantify information in the original data. Sentence transformer models encode meaning by producing sentence embeddings, i.e. representing text as high-dimensional vectors. These models are comparatively fast and lightweight and can even run on the CPU. Their output is easily stored in a vector database, so each text only needs to be encoded once. Since vectors are points in an abstract space, we can measure whether these points are close to each other (similar) or further apart (unrelated or opposite).
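As a rough illustration of measuring closeness between embeddings, the sketch below computes cosine similarity (the measure named in the package keywords) in plain Python. The three-dimensional vectors and their names are made up for the example; real sentence embeddings have hundreds of dimensions, and this is not the library's own implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, chosen only for illustration).
refund_request = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

# Texts with related meaning point in a similar direction ...
sim_close = cosine_similarity(refund_request, money_back)
# ... while unrelated texts are close to orthogonal.
sim_far = cosine_similarity(refund_request, weather_report)
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are unrelated, and a negative value means they point in opposite directions.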
Classification can be done by comparing the embedding of an individual text to a "typical" embedding for a given category. To "train" the classifier, you need a manually classified dataset. The minimum size of this dataset will depend on the number of dependent variables, how well-defined these variables are, and the ability of the sentence transformer model to encode relevant signals in your dataset.
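One simple way to obtain a "typical" embedding is to average the embeddings of all training texts labelled with the category. The sketch below assumes that approach; the record layout, the `complaint` label and the `mean_vector` helper are hypothetical and may differ from what the library does internally.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical, manually classified training records with toy embeddings.
training_data = [
    {"text": "The parcel never arrived.", "embedding": [0.9, 0.1, 0.0], "complaint": 1},
    {"text": "I was charged twice.", "embedding": [0.7, 0.3, 0.0], "complaint": 1},
    {"text": "Thanks for the quick reply!", "embedding": [0.1, 0.2, 0.9], "complaint": 0},
]

# Average only the embeddings of texts that carry the label.
positives = [r["embedding"] for r in training_data if r["complaint"] == 1]
typical_complaint = mean_vector(positives)
```

A new text can then be compared against `typical_complaint` instead of against every training example.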
The classifier uses a relevant subset of the vector dimensions to separate signal from noise. A similarity threshold is chosen so that similarities at or above the threshold are assigned a 1, and those below are assigned a 0. Comparisons are made at the level of individual sentences, which tend to be the main unit of coherent meaning. The classifier can be tuned by repeatedly running it on a validation dataset and selecting the threshold value with the best outcome. In the absence of validation data, this process can also be done manually.
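The ideas in this paragraph can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the selection rule (keep the dimensions where the typical positive and negative embeddings differ most), the accuracy metric, the candidate thresholds and all function names are hypothetical and not taken from the library.

```python
def select_dimensions(pos_centroid, neg_centroid, k):
    """Indices of the k dimensions where the two centroids differ most."""
    diffs = [abs(p - n) for p, n in zip(pos_centroid, neg_centroid)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:k])

def predict(similarity, threshold):
    """Return 1 when the similarity reaches the threshold, else 0."""
    return 1 if similarity >= threshold else 0

def accuracy(similarities, labels, threshold):
    """Share of validation items whose prediction matches the manual label."""
    hits = sum(predict(s, threshold) == y for s, y in zip(similarities, labels))
    return hits / len(labels)

# Feature selection on toy centroids: dimensions 1 and 3 carry no signal.
selected = select_dimensions([0.8, 0.2, 0.1, 0.5], [0.1, 0.2, 0.9, 0.5], k=2)

# Hypothetical validation set: similarity of each sentence to the typical
# embedding of a category, paired with the manually assigned label.
similarities = [0.91, 0.78, 0.66, 0.52, 0.40, 0.15]
labels = [1, 1, 1, 0, 0, 0]

# Tuning: try each candidate threshold and keep the one that scores best.
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
best_threshold = max(candidates, key=lambda t: accuracy(similarities, labels, t))
```

With unbalanced categories, a metric such as F1 may be a better tuning target than accuracy.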
## Examples
Input data is an iterable (list/generator) of dicts. Export from your favourite dataset library using polars.DataFrame.[to\_dicts()](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.to_dicts.html) or pandas.DataFrame.[to\_dict('records')](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html).
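If the data lives in a plain CSV file rather than a dataframe, a generator of dicts can be built with the standard library alone. The sketch below is a minimal example; the column names `id`, `text` and `complaint` are assumptions and not required by the library.

```python
import csv
import io

# Stand-in for an open file; the columns are hypothetical.
csv_text = (
    "id,text,complaint\n"
    "1,The parcel never arrived.,1\n"
    "2,Thanks for the quick reply!,0\n"
)

def records(source):
    """Yield one dict per CSV row, converting the label column to int."""
    for row in csv.DictReader(source):
        row["complaint"] = int(row["complaint"])
        yield row

# A generator is consumed lazily; list() materializes it for inspection.
data = list(records(io.StringIO(csv_text)))
```

Because `records` is a generator, large files can be streamed row by row instead of being loaded into memory at once.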
<details>
<summary>Split records</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Combine records</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Classifier from training\_data - raw text</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Classifier from training\_data - pre-encoded</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Tune similarity</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Tune selection</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Tune with plots</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Use optimized criteria from tune()</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>Set up a prediction pipeline for continuous data streams</summary>
<br>
TODO: example
```
pass
```
</details>
<details>
<summary>...</summary>
<br>
TODO: example
```
pass
```
</details>
# Swedish
## TODO
```
import distinction as ds
C = ds.Classifier(**kwargs)
[*C.train(training_data = ...)]
predictions = [*C.predict(...)]
```
# Package metadata
* Name: distinction
* Version: 0.0.1.1
* Summary: A fast binary classifier using feature selection of sentence transformer embeddings.
* Requires Python: >= 3.9
* Author: Erik Broman <mikroberna@gmail.com>
* Keywords: binary classification, cosine similarity, embeddings, feature selection, sentence transformer
* Homepage: https://github.com/er1kb/distinction
* Bug tracker: https://github.com/er1kb/distinction/issues
* Uploaded: 2025-02-17