berkoidf

- Name: berkoidf
- Version: 0.6
- Home page: https://github.com/berkedilekoglu/tfidfvectorizerinheritencecase
- Author: Berke Dilekoglu
- License: MIT
- Keywords: machine-learning, tf-idf
- Upload time: 2023-02-02 00:02:36
- Requirements: none recorded
# TFIDFVectorizer

TFIDFVectorizer is a custom implementation of the TF-IDF transformation, built on scikit-learn's TfidfVectorizer. It is written in Python and makes use of NumPy, scikit-learn, and other commonly used packages.

The main aim of this implementation is to provide a simple and efficient way of transforming a collection of text documents into a matrix representation, which can then be used as input to various machine learning algorithms.

## Installation

The package can be installed using pip:

```bash
pip install afinitiprojecttfidf
```

## Usage

To use the TFIDFVectorizer, create an instance of the class and call its fit_transform method. The method takes a list of text documents as input and returns a sparse matrix of the TF-IDF scores for each document.

```python
from afinitiprojecttfidf.feature_extraction.vectorizer import TFIDFVectorizer

text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TFIDFVectorizer()
tfidf = vectorizer.fit_transform(text_data)
```
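
The returned object can be inspected like any other document-term matrix. A minimal sketch, assuming the result follows scikit-learn's convention of a SciPy CSR sparse matrix with one row per document and one column per vocabulary term:

```python
# Assumption: tfidf is a SciPy sparse matrix, as returned by scikit-learn's
# TfidfVectorizer, with shape (n_documents, n_features).
print(tfidf.shape)      # e.g. (4, n_features) for the four documents above
print(tfidf.toarray())  # dense view of the TF-IDF weights (small corpora only)
```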

In addition to fit_transform, the TFIDFVectorizer provides a transform method that converts new text data into the same matrix representation, provided the model has already been fitted to the training data.

```python
new_text_data = [
    "This is a new document.",
    "Is this a new one?"
]

new_tfidf = vectorizer.transform(new_text_data)
```
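
Because the vocabulary is fixed when the model is fitted, the transformed matrix has the same number of columns as the training matrix, and terms unseen during fitting are ignored. A quick sanity check, assuming standard scikit-learn transform semantics:

```python
# Assumption: transform reuses the vocabulary learned by fit_transform,
# so the column counts of the two matrices match.
assert new_tfidf.shape[1] == tfidf.shape[1]
print(new_tfidf.shape)  # (2, n_features) for the two new documents
```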

## Configuration

The TFIDFVectorizer has several parameters that can be configured to customize its behavior. Some of the most important parameters are:

- stop_words: a list of stop words to ignore during tokenization.

```python
vectorizer = TFIDFVectorizer(stop_words=["is", "the", "this"])

```

- max_features: the maximum number of features to keep, based on term frequency across the entire corpus.

```python
vectorizer = TFIDFVectorizer(max_features=50)

```

- use_idf: a flag indicating whether to use the inverse document frequency (IDF) weighting.

```python
vectorizer = TFIDFVectorizer(use_idf=False)

```

For a full list of supported parameters, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
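
These options can also be combined in a single constructor call. A short sketch, assuming the class accepts the same keyword arguments as scikit-learn's TfidfVectorizer (ngram_range is taken from that interface and is not documented above):

```python
# Assumption: keyword arguments are forwarded to the underlying
# scikit-learn TfidfVectorizer.
vectorizer = TFIDFVectorizer(
    stop_words="english",  # built-in English stop-word list
    max_features=100,      # keep only the 100 most frequent terms
    ngram_range=(1, 2),    # unigrams and bigrams
)
tfidf = vectorizer.fit_transform(text_data)
```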

## Conclusion

TFIDFVectorizer is a simple and efficient implementation of the TF-IDF transformation algorithm, suitable for use in various machine learning applications. By using scikit-learn as a base, it provides a wide range of customization options and can be easily integrated into existing machine learning workflows.
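
For example, if the class implements the standard scikit-learn estimator interface (a reasonable assumption given its TfidfVectorizer base), it can be placed in a Pipeline in front of a classifier. The corpus and labels below are purely illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from afinitiprojecttfidf.feature_extraction.vectorizer import TFIDFVectorizer

# Hypothetical toy corpus and labels, used only to make the sketch runnable.
texts = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
labels = [0, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TFIDFVectorizer()),   # assumed to behave like TfidfVectorizer
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["Is this another document?"]))
```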

## License

[MIT](https://choosealicense.com/licenses/mit/)

            
