# TFIDFVectorizer
TFIDFVectorizer is a custom implementation of the TF-IDF transformation, built on scikit-learn's TfidfVectorizer. It is written in Python and relies on NumPy, scikit-learn and other commonly used packages.
The main aim of this implementation is to provide a simple and efficient way to turn a collection of text documents into a matrix representation that can be used as input to machine learning algorithms.
## Installation
The package can be installed using pip:
```bash
pip install tfIdfInheritVectorizer
```
## Usage
To use the TFIDFVectorizer, create an instance of the class and call its fit_transform method. The method takes a list of text documents as input and returns a sparse matrix of TF-IDF scores, with one row per document.
```python
from tfIdfInheritVectorizer.feature_extraction.vectorizer import TFIDFVectorizer
text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vectorizer = TFIDFVectorizer()
tfidf = vectorizer.fit_transform(text_data)
```
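To make the scores concrete, here is a minimal from-scratch sketch of the weighting. It assumes the package keeps scikit-learn's documented defaults: lowercasing, tokens of two or more word characters, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2-normalized rows. The helpers tokenize and tfidf_rows are illustrative names and are not part of this package, and the sketch returns plain dictionaries rather than a sparse matrix.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters
    # (assumed to match scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_rows(docs):
    """Return (sorted vocabulary, one {term: weight} dict per document)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize the row so documents of different length are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return vocab, rows


text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocab, rows = tfidf_rows(text_data)
print(vocab)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in every document (here "is", "the" and "this") receive the lowest IDF, so rarer terms such as "first" end up with larger weights within a row.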
In addition to fit_transform, the TFIDFVectorizer has a transform method that converts new text data into a matrix representation, provided the vectorizer has already been fitted to the training data.
```python
new_text_data = [
"This is a new document.",
"Is this a new one?"
]
new_tfidf = vectorizer.transform(new_text_data)
```
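The vocabulary learned during fitting stays fixed at transform time: scikit-learn's vectorizers ignore terms that were not seen during fit, so the new matrix has the same columns as the training matrix. The sketch below illustrates this with raw counts only (IDF weighting and normalization are left out for brevity). fit_vocabulary and count_with_vocabulary are hypothetical helpers, not part of this package.

```python
import re


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters
    # (assumed to match scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(docs):
    """Map every term seen in the training documents to a column index."""
    terms = sorted({term for doc in docs for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def count_with_vocabulary(docs, vocabulary):
    """Count terms per document, silently dropping terms not in the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in tokenize(doc):
            index = vocabulary.get(term)
            if index is not None:
                row[index] += 1
        rows.append(row)
    return rows


train = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocabulary = fit_vocabulary(train)

new_counts = count_with_vocabulary(
    ["This is a new document.", "Is this a new one?"], vocabulary
)
# "new" was never seen during fitting, so it contributes to no column.
print(len(vocabulary))  # 9
print(new_counts[0])    # [0, 1, 0, 1, 0, 0, 0, 0, 1]
print(new_counts[1])    # [0, 0, 0, 1, 1, 0, 0, 0, 1]
```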
## Configuration
The TFIDFVectorizer has several parameters that can be configured to customize its behavior. Some of the most important parameters are:
- stop_words: a list of stop words that will be ignored during the tokenization process.
```python
vectorizer = TFIDFVectorizer(stop_words=["is", "the", "this"])
```
- max_features: the maximum number of features to keep, based on term frequency across the entire corpus.
```python
vectorizer = TFIDFVectorizer(max_features=50)
```
- use_idf: a flag indicating whether to use the inverse document frequency (IDF) weighting.
```python
vectorizer = TFIDFVectorizer(use_idf=False)
```
For a full list of parameters, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
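As a rough illustration of how stop_words and max_features shape the vocabulary, the following sketch removes stop words and then keeps the most frequent terms across the corpus. It is a simplified stand-in, not this package's code: build_vocabulary is a hypothetical helper, and breaking ties between equally frequent terms alphabetically is an assumption of the sketch that may differ from scikit-learn.

```python
import re
from collections import Counter


def build_vocabulary(docs, stop_words=(), max_features=None):
    """Return the sorted list of terms kept after filtering and truncation."""
    stop = set(stop_words)
    # Corpus-wide term frequency, skipping stop words.
    counts = Counter(
        term
        for doc in docs
        for term in re.findall(r"\b\w\w+\b", doc.lower())
        if term not in stop
    )
    # Most frequent first; ties are broken alphabetically (sketch assumption).
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    kept = ranked if max_features is None else ranked[:max_features]
    return sorted(kept)


docs = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
print(build_vocabulary(docs, max_features=3))
# ['is', 'the', 'this']
print(build_vocabulary(docs, stop_words=["is", "the", "this"], max_features=2))
# ['document', 'first']
```

Without stop words, the three most frequent terms carry little information, which is why the two options are often combined.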
## Conclusion
TFIDFVectorizer is a simple and efficient implementation of the TF-IDF transformation algorithm, suitable for use in various machine learning applications. By using scikit-learn as a base, it provides a wide range of customization options and can be easily integrated into existing machine learning workflows.
## License
[MIT](https://choosealicense.com/licenses/mit/)
Raw data
{
"_id": null,
"home_page": "",
"name": "tfIdfInheritVectorizer",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "machine-learning tf-idf",
"author": "Berke Dilekoglu",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/fc/80/14a1edd3773972d02da249a71b92262b543f56e064d7ba44b3bf433a6ec0/tfIdfInheritVectorizer-0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "",
"version": "0.1",
"split_keywords": [
"machine-learning",
"tf-idf"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fc8014a1edd3773972d02da249a71b92262b543f56e064d7ba44b3bf433a6ec0",
"md5": "f0f15a5e868dba3c0f7412899439020d",
"sha256": "5466c939bc6b2471100fcb18aedaf5712653c800d4844ee8eb05dd81630f46fd"
},
"downloads": -1,
"filename": "tfIdfInheritVectorizer-0.1.tar.gz",
"has_sig": false,
"md5_digest": "f0f15a5e868dba3c0f7412899439020d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4475,
"upload_time": "2023-02-02T11:35:19",
"upload_time_iso_8601": "2023-02-02T11:35:19.632389Z",
"url": "https://files.pythonhosted.org/packages/fc/80/14a1edd3773972d02da249a71b92262b543f56e064d7ba44b3bf433a6ec0/tfIdfInheritVectorizer-0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-02 11:35:19",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "tfidfinheritvectorizer"
}