# Positional Vectorizer
The Positional Vectorizer is a scikit-learn transformer that converts text into a bag-of-words vector, using a positional ranking algorithm to assign scores. Like scikit-learn's CountVectorizer and TfidfVectorizer, it maps each term to a dimension of the vector; unlike them, the value of each dimension is based on the term's position in the original text.
## How to use
```bash
pip install positional-vectorizer
```
Using it to generate text vectors:
```python
from positional_vectorizer import PositionalVectorizer
input_texts = ["my text here", "other text here"]
vectorizer = PositionalVectorizer()
vectorizer.fit(input_texts)
encoded_texts = vectorizer.transform(input_texts)
```
Using it in a scikit-learn pipeline:
```python
from positional_vectorizer import PositionalVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
pipeline = Pipeline([
('vect', PositionalVectorizer(ngram_range=(1, 2))),
('clf', SGDClassifier(random_state=42, loss='modified_huber'))
])
# X_train is assumed to be an iterable of raw text documents, y_train their labels
pipeline.fit(X_train, y_train)
```
## Why this new vectorizer?
Bag-of-words representations with count, binary, or TF-IDF weighting are highly effective in most scenarios. In certain cases, however, such as in Latin-derived languages like Portuguese, the position of terms is crucial, and none of these weightings capture it.
For instance, consider the importance of word position in a Portuguese classification task distinguishing between a smartphone device and a smartphone accessory. In traditional bag-of-words approaches with stop words removed, the following titles yield identical representations:
* "xiaomi com fone de ouvido" => {"xiaomi", "fone", "ouvido"}
* "fone de ouvido do xiaomi" => {"xiaomi", "fone", "ouvido"}
As demonstrated, the order of words significantly alters the meaning, but this meaning is not reflected in the vectorization.
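The collision above can be reproduced in a few lines of plain Python (the stop-word set here is an assumed, minimal subset for illustration):

```python
# Illustrative sketch: a set-based bag of words after stop-word removal.
# "com", "de", and "do" stand in for a real Portuguese stop-word list.
def bag_of_words(text, stop_words):
    return {token for token in text.split() if token not in stop_words}

STOP_WORDS = {"com", "de", "do"}

device = bag_of_words("xiaomi com fone de ouvido", STOP_WORDS)
accessory = bag_of_words("fone de ouvido do xiaomi", STOP_WORDS)
assert device == accessory  # identical representations, different meanings
```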
One common workaround is to employ n-grams instead of single words, but this can inflate the feature dimensionality, potentially increasing the risk of overfitting.
## How it works
The value in each dimension is calculated as `1 / math.log(rank + 1)` (similar to the Discounted Cumulative Gain formula), where the rank denotes the position of the corresponding term, starting from 1.
If a term appears multiple times in the text, only its lowest rank (i.e., its first occurrence) is taken into account.
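The scoring rule can be sketched as follows (a simplified illustration of the formula above, not the library's actual implementation, which also handles tokenization, n-grams, and vocabulary building):

```python
import math

def positional_scores(text):
    # Each term scores 1 / log(rank + 1), where rank is its position
    # starting at 1; only the first (lowest-rank) occurrence counts.
    scores = {}
    for rank, term in enumerate(text.split(), start=1):
        if term not in scores:
            scores[term] = 1 / math.log(rank + 1)
    return scores

scores = positional_scores("fone de ouvido do xiaomi")
# "fone" (rank 1) receives a higher score than "xiaomi" (rank 5),
# so the two titles from the previous section now differ.
```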
## TODO
* Test the common parameters of `_VectorizerMixin` to identify potential issues when upgrading scikit-learn. Currently, only the `ngram_range` and `analyzer` parameters are covered by automated tests.
* Implement the `max_features` parameter.