mim-nlp


- Name: mim-nlp
- Version: 0.2.0
- Home page: https://github.com/mim-solutions/mim_nlp
- Summary: A Python package with ready-to-use models for various NLP tasks and text preprocessing utilities. The implementation allows fine-tuning.
- Upload time: 2024-07-25 11:29:32
- Author: Michał Brzozowski
- Requires Python: <4.0,>=3.9
- License: MIT
- Keywords: nlp, natural-language-processing, machine-learning, deep-learning, neural-network, transfer-learning, text-classification, text-regression, seq2seq, summarization, text, text-preprocessing, text-cleaning, lemmatization, deduplication, transformers, pytorch

# MIM NLP
With this package you can easily use pre-trained models and fine-tune them,
as well as create and train your own neural networks.

Below, we list NLP tasks and models that are available:
  * Classification
    * Neural Network
    * SVM
  * Regression
    * Neural Network
  * Seq2Seq
    * Summarization (Neural Network)

It comes with utilities for text pre-processing such as:
  * Text cleaning
  * Lemmatization
  * Deduplication

## Installation

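### PyPI package
The package is published on PyPI as `mim-nlp`, so it can be installed with `pip install mim-nlp`.
It comes with the following extras (optional dependencies for given modules), which can be requested with the standard pip syntax, e.g. `pip install "mim-nlp[svm]"`: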
- `svm` - simple SVM model for classification
- `classifier` - classification models: SVM, neural networks
- `regressor` - regression models
- `preprocessing` - cleaning text, lemmatization and deduplication
- `seq2seq` - `Seq2Seq` and `Summarizer` models

## Usage
Examples can be found in the [notebooks directory](/notebooks).

### Model classes
* `classifier.nn.NNClassifier` - Neural Network Classifier
* `classifier.svm.SVMClassifier` - Support Vector Machine Classifier
* `classifier.svm.SVMClassifierWithFeatureSelection` - `SVMClassifier` with additional feature selection step
* `regressor.AutoRegressor` - regressor based on transformers' Auto Classes
* `regressor.NNRegressor` - Neural Network Regressor
* `seq2seq.AutoSummarizer` - summarizer based on transformers' Auto Classes

### Interface
All the model classes share a common interface:
* `fit`
* `predict`
* `save`
* `load`

as well as additional, model-specific methods.
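
The following is only a rough sketch of what a `fit` / `predict` / `save` / `load` interface tends to look like in practice; it is not taken from the package. The class `MajorityClassifier`, all argument names, the `model.pkl` file name and the choice of `load` as a classmethod are assumptions made for illustration, so check the notebooks for the actual signatures.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


class MajorityClassifier:
    """Hypothetical toy model: always predicts the most frequent training label."""

    def __init__(self) -> None:
        self.label_ = None

    def fit(self, x_train: list[str], y_train: list[int]) -> None:
        # "Training" here just remembers the most common label.
        self.label_ = Counter(y_train).most_common(1)[0][0]

    def predict(self, x: list[str]) -> list[int]:
        if self.label_ is None:
            raise RuntimeError("call fit() before predict()")
        return [self.label_ for _ in x]

    def save(self, model_dir: str) -> None:
        # Persist the learned state into a directory (file name is an assumption).
        path = Path(model_dir)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / "model.pkl", "wb") as f:
            pickle.dump(self.label_, f)

    @classmethod
    def load(cls, model_dir: str) -> "MajorityClassifier":
        # Rebuild a model instance from the saved state.
        model = cls()
        with open(Path(model_dir) / "model.pkl", "rb") as f:
            model.label_ = pickle.load(f)
        return model


model = MajorityClassifier()
model.fit(["good", "bad", "great"], [1, 0, 1])
with tempfile.TemporaryDirectory() as tmp:
    model.save(tmp)
    restored = MajorityClassifier.load(tmp)
predictions = restored.predict(["fine", "awful"])
```

The point of a shared interface is that code which trains, evaluates, or persists a model does not need to know which concrete model class it is handling.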

### Text pre-processing
* `preprocessing.TextCleaner` - define a pipeline for text cleaning, supports concurrent processing
* `preprocessing.lemmatize` - lemmatize text in Polish with Morfeusz
* `preprocessing.Deduplicator` - find near-duplicate texts (depending on `threshold`) with Jaccard index for n-grams
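
To make the idea behind threshold-based deduplication more concrete, here is a minimal, self-contained sketch of near-duplicate detection with the Jaccard index on n-grams. It does not use or reproduce `preprocessing.Deduplicator`: the function names, the use of character trigrams, the lowercasing and whitespace normalization, and the `>=` comparison against `threshold` are all assumptions chosen for illustration.

```python
from itertools import combinations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Texts shorter than n are treated as a single n-gram (or none if empty).
        return {normalized} if normalized else set()
    return {normalized[i : i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(
    texts: list[str], threshold: float = 0.8, n: int = 3
) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose n-gram Jaccard index is >= threshold."""
    grams = [char_ngrams(t, n) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "A completely different sentence.",
]
pairs = near_duplicate_pairs(texts, threshold=0.8)  # the first two texts form a pair
```

This brute-force version compares every pair of texts, so it scales quadratically with the number of texts; it is meant to show the similarity measure, not an efficient implementation.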

## Development
Remember to use a separate environment for each project.
Run the commands below inside the project's environment.

### Dependencies
We use `poetry` for dependency management.
If you have never used it, consult
[poetry documentation](https://python-poetry.org/docs/)
for installation guidelines and basic usage instructions.

```sh
poetry install --with dev
```

If you hit the `Failed to unlock the collection!` error or package installation gets stuck, run the command below:
```sh
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
```

### Git hooks
We use `pre-commit` for git hook management.
If you have never used it, consult
[pre-commit documentation](https://pre-commit.com/)
for installation guidelines and basic usage instructions.
```sh
pre-commit install
```

There are two hooks available:
* _isort_ – runs `isort` for both `.py` files and notebooks.
Fails if any changes are made, so you have to run `git add` and `git commit` once again.
* _Strip notebooks_ – produces _stripped_ versions of notebooks in the `stripped` directory.

### Tests
```sh
pytest
```

### Linting
We use `isort` and `flake8` along with `nbqa` to ensure code quality.
The appropriate options are set in configuration files.
You can run them with:
```sh
isort .
nbqa isort notebooks
```
and
```sh
flake8 .
nbqa flake8 notebooks --nbqa-shell
```

### Code formatting
You can run `black` to format the code (including notebooks):
```sh
black .
```

### New version release
To release the next version of the package to PyPI, follow these steps:
- First, increment the package version in `pyproject.toml`.
- Then build the new version: run `poetry build` in the root directory.
- Finally, upload the two newly created files (the source archive and the wheel) to PyPI with `poetry publish`.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mim-solutions/mim_nlp",
    "name": "mim-nlp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "nlp, natural-language-processing, machine-learning, deep-learning, neural-network, transfer-learning, text-classification, text-regression, seq2seq, summarization, text, text-preprocessing, text-cleaning, lemmatization, deduplication, transformers, pytorch",
    "author": "Micha\u0142 Brzozowski",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/17/de/57b6f482df07e3bf7d87146ef40a639c627a45c6b1fe66852bf06ed7ed2d/mim_nlp-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python package with ready-to-use models for various NLP tasks and text preprocessing utilities. The implementation allows fine-tuning.",
    "version": "0.2.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/mim-solutions/mim_nlp/issues",
        "Homepage": "https://github.com/mim-solutions/mim_nlp",
        "Repository": "https://github.com/mim-solutions/mim_nlp"
    },
    "split_keywords": [
        "nlp",
        " natural-language-processing",
        " machine-learning",
        " deep-learning",
        " neural-network",
        " transfer-learning",
        " text-classification",
        " text-regression",
        " seq2seq",
        " summarization",
        " text",
        " text-preprocessing",
        " text-cleaning",
        " lemmatization",
        " deduplication",
        " transformers",
        " pytorch"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "238c5bbc2487cf236d2e78450ffa3876a8759a91a504b2c267172088623ae2ea",
                "md5": "5afd2d49fd42332c7ae51e894f28ce8b",
                "sha256": "ad9e614a0d168e90cb4e22563f03dec039ccc965787b2a449a72b3afc3dcfe1b"
            },
            "downloads": -1,
            "filename": "mim_nlp-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5afd2d49fd42332c7ae51e894f28ce8b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 32266,
            "upload_time": "2024-07-25T11:29:31",
            "upload_time_iso_8601": "2024-07-25T11:29:31.107613Z",
            "url": "https://files.pythonhosted.org/packages/23/8c/5bbc2487cf236d2e78450ffa3876a8759a91a504b2c267172088623ae2ea/mim_nlp-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "17de57b6f482df07e3bf7d87146ef40a639c627a45c6b1fe66852bf06ed7ed2d",
                "md5": "374ddc36de2371c02fa1823b725cf680",
                "sha256": "b82f2676fc04948934e3f831d001499e5880360dfedea2c764a42131ee09d946"
            },
            "downloads": -1,
            "filename": "mim_nlp-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "374ddc36de2371c02fa1823b725cf680",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 25265,
            "upload_time": "2024-07-25T11:29:32",
            "upload_time_iso_8601": "2024-07-25T11:29:32.669580Z",
            "url": "https://files.pythonhosted.org/packages/17/de/57b6f482df07e3bf7d87146ef40a639c627a45c6b1fe66852bf06ed7ed2d/mim_nlp-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-25 11:29:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mim-solutions",
    "github_project": "mim_nlp",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "mim-nlp"
}
        