# TFIDFVectorizer
TFIDFVectorizer is a custom implementation of the TF-IDF transformation algorithm, using scikit-learn's TfidfVectorizer as a base. The implementation is written in Python and makes use of NumPy, scikit-learn, and other commonly used packages.
The main aim of this implementation is to provide a simple and efficient way of transforming a collection of text documents into a matrix representation, which can then be used as input to various machine learning algorithms.
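For reference, a term's TF-IDF score is its term frequency weighted by its inverse document frequency. The sketch below is independent of this package and illustrates the idea in plain NumPy on a tiny hypothetical count matrix, using scikit-learn's smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1:

```python
import numpy as np

# Toy count matrix: rows are documents, columns are terms.
# Hypothetical vocabulary for illustration: ["document", "first", "second"]
counts = np.array([
    [1.0, 1.0, 0.0],  # "first document"
    [1.0, 0.0, 1.0],  # "second document"
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF, as in scikit-learn
tfidf = counts * idf                        # raw TF-IDF scores

# scikit-learn additionally L2-normalises each row by default
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
```

A term that appears in every document (here, "document") gets the minimum IDF of 1, while rarer terms are weighted up.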
## Installation
The package can be installed using pip:
```bash
pip install afinitiprojecttfidf
```
## Usage
To use the TFIDFVectorizer, create an instance of the class and call its fit_transform method. The method takes a list of text documents as input and returns a sparse matrix of TF-IDF scores, with one row per document.
```python
from afinitiprojecttfidf.feature_extraction.vectorizer import TFIDFVectorizer
text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vectorizer = TFIDFVectorizer()
tfidf = vectorizer.fit_transform(text_data)
```
In addition to the fit_transform method, the TFIDFVectorizer also has a transform method that can be used to transform new text data into a matrix representation, provided the model has already been fitted to the training data.
```python
new_text_data = [
"This is a new document.",
"Is this a new one?"
]
new_tfidf = vectorizer.transform(new_text_data)
```
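The key property of transform is that the vocabulary is frozen at fit time, so terms unseen during fitting are simply ignored. A sketch of the same workflow with the scikit-learn base class (whose behavior this implementation inherits) demonstrates this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit([
    "This is the first document.",
    "This is the second document.",
])

# "new" never appeared during fitting, so it contributes no column:
# the transformed matrix keeps the original vocabulary's width.
new_tfidf = vectorizer.transform(["This is a new document."])
print(new_tfidf.shape[1] == len(vectorizer.vocabulary_))  # True
```

This is why transform must only be called after fitting: the column layout of the output is fixed by the training corpus.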
## Configuration
The TFIDFVectorizer has several parameters that can be configured to customize its behavior. Some of the most important parameters are:
- stop_words: a list of stop words to ignore during tokenization.
```python
vectorizer = TFIDFVectorizer(stop_words=["is", "the", "this"])
```
- max_features: the maximum number of features to keep, ranked by term frequency across the entire corpus.
```python
vectorizer = TFIDFVectorizer(max_features=50)
```
- use_idf: a flag indicating whether to apply inverse document frequency (IDF) weighting.
```python
vectorizer = TFIDFVectorizer(use_idf=False)
```
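These options can be combined. A quick check with the scikit-learn base class (from which this implementation inherits its parameters) shows their effect on the learned vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This is the second document.",
]

vectorizer = TfidfVectorizer(stop_words=["is", "the", "this"], max_features=2)
vectorizer.fit(corpus)

# Stop words never enter the vocabulary, and at most 2 features are kept,
# chosen by term frequency across the corpus.
print(sorted(vectorizer.vocabulary_))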
Because TFIDFVectorizer builds on scikit-learn's TfidfVectorizer, the full list of supported parameters is covered in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
## Conclusion
TFIDFVectorizer is a simple and efficient implementation of the TF-IDF transformation algorithm, suitable for use in various machine learning applications. By building on scikit-learn, it provides a wide range of customization options and can be easily integrated into existing machine learning workflows.
## License
[MIT](https://choosealicense.com/licenses/mit/)