# Classy Classification
Have you ever struggled with needing a [spaCy TextCategorizer](https://spacy.io/api/textcategorizer) but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using [sentence-transformers](https://github.com/UKPLab/sentence-transformers) or [spaCy models](https://spacy.io/usage/models), provide a dictionary with labels and examples, or just provide a list of labels for zero-shot classification with [Hugging Face zero-shot classifiers](https://huggingface.co/models?pipeline_tag=zero-shot-classification).
# Install
```
pip install classy-classification
```
## SetFit support
I got a lot of requests for SetFit support, but I decided to create a [separate package](https://github.com/davidberenstein1957/spacy-setfit) for this. Feel free to check it out. ❤️
# Quickstart
## spaCy embeddings
```python
import spacy
# or import standalone
# from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
```
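The printed scores can be post-processed with plain Python. The helper below is a minimal sketch and not part of the package: `flatten_scores` and `top_label` are hypothetical names, and the code assumes `_.cats` has one of the shapes shown in the output comments of this README (a list of label-to-score dicts, or a single dict).

```python
from typing import Dict, List, Tuple, Union

Scores = Union[Dict[str, float], List[Dict[str, float]]]


def flatten_scores(cats: Scores) -> Dict[str, float]:
    """Merge a list of {label: score} dicts into one flat dict."""
    if isinstance(cats, dict):
        return dict(cats)
    flat: Dict[str, float] = {}
    for entry in cats:
        flat.update(entry)
    return flat


def top_label(cats: Scores) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    flat = flatten_scores(cats)
    label = max(flat, key=flat.get)
    return label, flat[label]


# Example using the scores from the output comment above.
example = [{"furniture": 0.21}, {"kitchen": 0.79}]
print(top_label(example))  # ('kitchen', 0.79)
```

Flattening first keeps the lookup independent of whether the scores arrive as one dict or as a list of single-entry dicts.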
### Sentence level classification
```python
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "include_sent": True
    }
)

doc = nlp("I am looking for kitchen appliances. And I love doing so.")
# doc.sents is a generator, so materialize it before indexing
print(list(doc.sents)[0]._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
```
### Define random seed and verbosity
```python
# assumes `nlp` and `data` are defined as in the examples above
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "verbose": True,
        "config": {"seed": 42}
    }
)
```
### Multi-label classification
Sometimes multiple labels are necessary to fully describe the contents of a text. In that case, use the **multi-label** implementation, where the sum of label scores is not limited to 1. Just pass the same training data to multiple keys.
```python
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa.",
                  "We have a new dinner table.",
                  "There also exist things like fridges.",
                  "I hope to be getting a new stove today.",
                  "Do you also have some ovens.",
                  "We have a new dinner table."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table.",
                "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "multi_label": True,
    }
)

print(nlp("I am looking for furniture and kitchen equipment.")._.cats)

# Output:
#
# [{"furniture": 0.92}, {"kitchen": 0.91}]
```
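To illustrate why multi-label scores need not sum to 1, the sketch below contrasts two common ways of turning raw scores into probabilities: a softmax, which normalizes across labels, and an independent sigmoid per label. This is a generic illustration with made-up raw scores; it is not taken from the package's internals, which may compute scores differently.

```python
import math
from typing import Dict


def softmax(raw: Dict[str, float]) -> Dict[str, float]:
    """Single-label style: scores compete and sum to 1."""
    peak = max(raw.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in raw.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def sigmoid_per_label(raw: Dict[str, float]) -> Dict[str, float]:
    """Multi-label style: each label is scored independently."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in raw.items()}


# Hypothetical raw scores for a text about both furniture and kitchens.
raw_scores = {"furniture": 2.4, "kitchen": 2.3}

single = softmax(raw_scores)
multi = sigmoid_per_label(raw_scores)

print(single)  # the two scores split a total of 1
print(multi)   # both labels can score high at the same time
```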
### Outlier detection
Sometimes it is useful to do outlier detection or binary classification. This can be approached with a binary training dataset; however, I have also implemented support for a `OneClassSVM` for [outlier detection using a single label](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html). Note that this method does not return probabilities, but the output is formatted as label-score pairs to ensure uniformity.
Approach 1:
```python
import spacy

data_binary = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_binary,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]
```
Approach 2:
```python
import spacy

data_singular = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa.",
                  "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_singular,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'furniture': 0, 'not_furniture': 1}]
```
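The 0/1 values in the second approach follow from how a one-class SVM predicts: scikit-learn's `OneClassSVM.predict` returns `+1` for inliers and `-1` for outliers rather than a probability. The sketch below shows one way such a prediction could be mapped to the label-score layout shown above. The function name `format_one_class_prediction` is hypothetical, and the `not_` prefix is inferred from the example output, so treat this as an illustration and not as the package's actual code.

```python
from typing import Dict


def format_one_class_prediction(label: str, prediction: int) -> Dict[str, int]:
    """Map a +1 (inlier) / -1 (outlier) prediction to label-score pairs."""
    if prediction not in (1, -1):
        raise ValueError("expected +1 (inlier) or -1 (outlier)")
    is_inlier = prediction == 1
    return {label: int(is_inlier), f"not_{label}": int(not is_inlier)}


# An outlier prediction for the single "furniture" label.
print(format_one_class_prediction("furniture", -1))
# {'furniture': 0, 'not_furniture': 1}
```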
## Sentence-transformer embeddings
```python
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
```
## Hugging Face zero-shot classifiers
```python
import spacy

data = ["furniture", "kitchen"]

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "typeform/distilbert-base-uncased-mnli",
        "cat_type": "zero",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
```
# Credits
## Inspiration Drawn From
[Hugging Face](https://huggingface.co/) does offer some nice models for few/zero-shot classification, but these are not tailored to multilingual approaches. Rasa NLU has [a nice approach](https://rasa.com/blog/rasa-nlu-in-depth-part-1-intent-classification/) for this, but it's too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate [sentence-transformers](https://github.com/UKPLab/sentence-transformers) and [Hugging Face zero-shot](https://huggingface.co/models?pipeline_tag=zero-shot-classification) instead of default [word embeddings](https://arxiv.org/abs/1301.3781). Finally, I decided to integrate with spaCy, since training a custom [spaCy TextCategorizer](https://spacy.io/api/textcategorizer) seems like a lot of hassle if you want something quick and dirty.
- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [Rasa NLU](https://github.com/RasaHQ/rasa)
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)
- [spaCy](https://github.com/explosion/spaCy)
## Or buy me a coffee
[Buy me a coffee](https://www.buymeacoffee.com/98kf2552674)
# Standalone usage without spaCy
```python
from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

classifier = ClassyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])

# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")

# overwrite embedding model (see https://www.sbert.net/docs/pretrained_models.html)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernel": ["linear"],
        "max_cross_validation_folds": 5
    }
)
classifier("I am looking for kitchen appliances.")
```
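The SVC config above lists several values for `C` and one for `kernel`, which reads like a hyperparameter grid to be searched with cross-validation. The sketch below only enumerates the candidate combinations such a grid would contain. It assumes that list-valued entries form the grid and that `max_cross_validation_folds` is a separate setting; `expand_grid` is a hypothetical helper, and how the package actually consumes this config is not shown here.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Config copied from the example above.
config = {
    "C": [1, 2, 5, 10, 20, 100],
    "kernel": ["linear"],
    "max_cross_validation_folds": 5,
}


def expand_grid(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every combination of the listed hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Assumption: list-valued entries form the search grid, the rest are settings.
grid = {key: value for key, value in config.items() if isinstance(value, list)}
candidates = list(expand_grid(grid))

print(len(candidates))  # 6
print(candidates[0])    # {'C': 1, 'kernel': 'linear'}
```

Adding a second kernel to the list would double the number of candidates, which is worth keeping in mind when training data is small and every fit is repeated per fold.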
## Save and load models
```python
import pickle

from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}
classifier = ClassyClassifier(data=data)

# save the fitted classifier to disk
with open("./classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

# load it again; the context manager closes the file afterwards
with open("./classifier.pkl", "rb") as f:
    classifier = pickle.load(f)
classifier("I am looking for kitchen appliances.")
```
Raw data
{
"_id": null,
"home_page": "https://github.com/davidberenstein1957/classy-classification",
"name": "classy-classification",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.12,>=3.8",
"maintainer_email": null,
"keywords": "spacy, rasa, few-shot classification, nlu, sentence-transformers",
"author": "David Berenstein",
"author_email": "david.m.berenstein@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/bf/72/fd3cbb27389e31364f48f00f003fae925cabdaba8c0225b975326a1bd8b9/classy_classification-1.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Have you every struggled with needing a Spacy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!",
"version": "1.0.1",
"project_urls": {
"Documentation": "https://github.com/davidberenstein1957/classy-classification",
"Homepage": "https://github.com/davidberenstein1957/classy-classification",
"Repository": "https://github.com/davidberenstein1957/classy-classification"
},
"split_keywords": [
"spacy",
" rasa",
" few-shot classification",
" nlu",
" sentence-transformers"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2cbb4139462c8aeb0ebfb684fa08bb693edb98fd34ec6335e949cd309885cd52",
"md5": "9e9a7e316b1372038aa755e6b994d07a",
"sha256": "42d78c7e998fd0de86cee614e0e3f70f02818aad04a24e29a5cd05c74e15caf4"
},
"downloads": -1,
"filename": "classy_classification-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9e9a7e316b1372038aa755e6b994d07a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.8",
"size": 15499,
"upload_time": "2024-11-27T09:17:10",
"upload_time_iso_8601": "2024-11-27T09:17:10.184010Z",
"url": "https://files.pythonhosted.org/packages/2c/bb/4139462c8aeb0ebfb684fa08bb693edb98fd34ec6335e949cd309885cd52/classy_classification-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bf72fd3cbb27389e31364f48f00f003fae925cabdaba8c0225b975326a1bd8b9",
"md5": "82a4ca1e2c86da951822c7d22d8ed5b0",
"sha256": "2f71e20074c30bbe0cb3845fa3291096d3a3d2cf9384dd98cd63d74a59cee5cd"
},
"downloads": -1,
"filename": "classy_classification-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "82a4ca1e2c86da951822c7d22d8ed5b0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.8",
"size": 13309,
"upload_time": "2024-11-27T09:17:14",
"upload_time_iso_8601": "2024-11-27T09:17:14.104273Z",
"url": "https://files.pythonhosted.org/packages/bf/72/fd3cbb27389e31364f48f00f003fae925cabdaba8c0225b975326a1bd8b9/classy_classification-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-27 09:17:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "davidberenstein1957",
"github_project": "classy-classification",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "classy-classification"
}