# MIM NLP
With this package you can easily use pre-trained models and fine-tune them,
as well as create and train your own neural networks.
Below, we list NLP tasks and models that are available:
* Classification
  * Neural Network
  * SVM
* Regression
  * Neural Network
* Seq2Seq
  * Summarization (Neural Network)
It comes with utilities for text pre-processing such as:
* Text cleaning
* Lemmatization
* Deduplication
## Installation
### PyPI package
The package comes with the following extras (optional dependencies for the given modules), which can be installed as shown below:
- `svm` - a simple SVM model for classification
- `classifier` - classification models: SVM and neural networks
- `regressor` - regression models
- `preprocessing` - text cleaning, lemmatization, and deduplication
- `seq2seq` - `Seq2Seq` and `Summarizer` models
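Assuming the package is published on PyPI under the name `mim-nlp` (as in the project metadata), it can be installed with selected extras using the standard pip extras syntax:
```sh
pip install mim-nlp                                # core package only
pip install "mim-nlp[classifier,preprocessing]"    # with selected extras from the list above
```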
## Usage
Examples can be found in the [notebooks directory](/notebooks).
### Model classes
* `classifier.nn.NNClassifier` - Neural Network Classifier
* `classifier.svm.SVMClassifier` - Support Vector Machine Classifier
* `classifier.svm.SVMClassifierWithFeatureSelection` - `SVMClassifier` with additional feature selection step
* `regressor.AutoRegressor` - regressor based on transformers' Auto Classes
* `regressor.NNRegressor` - Neural Network Regressor
* `seq2seq.AutoSummarizer` - summarizer based on transformers' Auto Classes
### Interface
All the model classes share a common interface:
* `fit`
* `predict`
* `save`
* `load`
and model-specific additional methods.
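As a minimal sketch of this interface, assuming the top-level package is `mim_nlp` and using `SVMClassifier` as an example; constructor arguments and exact signatures are model-specific, so consult the notebooks for real usage:
```python
from mim_nlp.classifier.svm import SVMClassifier  # import path assumed from the module layout above

texts = ["I love this product!", "Terrible experience."]
labels = [1, 0]

model = SVMClassifier()            # constructor arguments are model-specific (defaults assumed here)
model.fit(texts, labels)           # train on raw texts and labels
predictions = model.predict(["Great value for the money."])

model.save("svm_model")                      # persist the fitted model
model = SVMClassifier.load("svm_model")      # load it back later (assuming `load` is a classmethod)
```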
### Text pre-processing
* `preprocessing.TextCleaner` - define a pipeline for text cleaning; supports concurrent processing
* `preprocessing.lemmatize` - lemmatize text in Polish with Morfeusz
* `preprocessing.Deduplicator` - find near-duplicate texts (controlled by `threshold`) using the Jaccard index on n-grams
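A rough, illustrative sketch of how these utilities might be used; the exact signatures (e.g. whether `lemmatize` takes a single text and what the `Deduplicator` method is called) are assumptions here, so check the notebooks for the real API:
```python
from mim_nlp.preprocessing import Deduplicator, lemmatize  # import path assumed

texts = ["Ala ma kota.", "Ala ma kota!", "Zupełnie inny tekst."]

# Polish lemmatization via Morfeusz (assumed to accept a single text at a time)
lemmatized = [lemmatize(t) for t in texts]

# near-duplicate detection; `threshold` is documented above, the method name is an assumption
deduplicator = Deduplicator(threshold=0.8)
unique_texts = deduplicator.deduplicate_texts(texts)
```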
## Development
Remember to use a separate environment for each project.
Run the commands below inside the project's environment.
### Dependencies
We use `poetry` for dependency management.
If you have never used it, consult
[poetry documentation](https://python-poetry.org/docs/)
for installation guidelines and basic usage instructions.
```sh
poetry install --with dev
```
To fix the `Failed to unlock the collection!` error or a stuck package installation, run the command below:
```sh
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
```
### Git hooks
We use `pre-commit` for git hook management.
If you have never used it, consult
[pre-commit documentation](https://pre-commit.com/)
for installation guidelines and basic usage instructions.
```sh
pre-commit install
```
There are two hooks available:
* _isort_ – runs `isort` for both `.py` files and notebooks.
It fails if any changes are made, so you have to run `git add` and `git commit` again.
* _Strip notebooks_ – produces _stripped_ versions of notebooks in the `stripped` directory.
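You can also run all hooks manually on the entire repository at any time (not just on commit) with the standard `pre-commit` invocation:
```sh
pre-commit run --all-files
```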
### Tests
```sh
pytest
```
### Linting
We use `isort` and `flake8` along with `nbqa` to ensure code quality.
The appropriate options are set in configuration files.
You can run them with:
```sh
isort .
nbqa isort notebooks
```
and
```sh
flake8 .
nbqa flake8 notebooks --nbqa-shell
```
### Code formatting
You can run `black` to format the code (including notebooks):
```sh
black .
```
### New version release
To release a new version of the package to PyPI, follow these steps (summarized in the snippet below):
- First, increment the package version in `pyproject.toml`.
- Then build the new version: run `poetry build` in the root directory.
- Finally, upload the two newly created files to PyPI: `poetry publish`.
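For reference, the release commands in sequence (after bumping the version in `pyproject.toml`; `poetry publish` assumes PyPI credentials are already configured):
```sh
# bump `version` in pyproject.toml first
poetry build      # creates the sdist and wheel in dist/
poetry publish    # uploads the two newly built files to PyPI
```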