| Name | nlp-service |
| --- | --- |
| Version | 1.4.10 |
| Summary | Microservice for NLP tasks using gRPC |
| Home page | http://recap.uni-trier.de |
| Repository | https://github.com/recap-utr/nlp-service |
| Author | Mirko Lenz |
| License | MIT |
| Requires Python | >=3.10,<3.13 |
| Upload time | 2023-11-07 08:26:32 |
# NLP Microservice
The goal of this project is to provide a [gRPC](https://grpc.io) server for resource-heavy NLP tasks—for instance, computing vectors/embeddings for words or sentences.
By using [protobuf](https://developers.google.com/protocol-buffers) internally, our NLP server provides native and strongly typed interfaces for many programming languages.
There are multiple advantages to outsourcing such computations to a dedicated server:
- If multiple apps rely on NLP, the underlying models (which are usually quite large) only need to be loaded once into the main memory.
- All programming languages supported by gRPC get easy access to state-of-the-art NLP architectures (e.g., transformers).
- The logic is consolidated at a central place, drastically decreasing the maintenance effort required.
In addition to the server, we also provide a client containing convenience functions.
This makes it easier for Python applications to interact with the gRPC server.
We will discuss the client at the end of this README.
## Installation and Setup
We use `nix` and `poetry` to manage the dependencies and also provide a ready-to-use Docker image.
### Docker (recommended)
The container caches the downloaded models, so you should not pass `--rm` to `docker run`.
```sh
docker run ghcr.io/recap-utr/nlp-service:latest "0.0.0.0:50100"
```
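If you want the cache to survive container replacements (not just restarts), you can mount a named volume at the cache directory. A minimal sketch, assuming the models are cached under `/root/.cache` inside the container; the exact path is an assumption and should be verified against the image:

```sh
# Hypothetical setup: persist downloaded models in a named volume.
# The cache path /root/.cache is an assumption, not taken from the image docs.
docker volume create nlp-service-models
docker run -v nlp-service-models:/root/.cache \
  ghcr.io/recap-utr/nlp-service:latest "0.0.0.0:50100"
```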
### Nix (advanced)
```sh
nix run github:recap-utr/nlp-service -- "127.0.0.1:50100"
# or after cloning this repository
nix develop -c poetry run python -m nlp_service "127.0.0.1:50100"
```
### Poetry (advanced)
```sh
# The server dependencies are optional, so they have to be installed explicitly.
poetry install --extras all
# To run the server, you need to specify the address it should listen on.
# In this example, it listens on port 50100 on localhost.
poetry run python -m nlp_service "127.0.0.1:50100"
```
## General Usage
Once the server is running, you are free to call any of the functions defined in the underlying [protobuf file](https://github.com/recap-utr/arg-services/blob/main/arg_services/nlp/v1/nlp.proto).
The corresponding documentation is located at the [Buf Schema Registry](https://buf.build/recap/arg-services/docs/main:arg_services.nlp.v1).
_Please note:_ The examples here use the Python programming language, but the concepts apply directly to any other language supported by gRPC.
```python
import grpc
import numpy as np
from arg_services.nlp.v1 import nlp_pb2, nlp_pb2_grpc

# First of all, we create a channel (i.e., establish a connection to our server).
channel = grpc.insecure_channel("127.0.0.1:50100")
# The channel can now be used to create the actual client (allowing us to call all available functions)
client = nlp_pb2_grpc.NlpServiceStub(channel)
# Now the time has come to prepare our actual function call.
# We will start by creating a very simple NlpConfig with the default spacy model.
# For details about the parameters, please have a look at the next section.
config = nlp_pb2.NlpConfig(
    language="en",
    spacy_model="en_core_web_lg",
)
# Next, we will build a request to query vectors from our server.
request = nlp_pb2.VectorsRequest(
    # The first parameter is a list of strings that shall be embedded by our server.
    texts=["What a great tutorial!", "I will definitely recommend this to my friends."],
    # Now we need to specify which embeddings have to be computed.
    # In this example, we create one vector for each text.
    embedding_levels=[nlp_pb2.EmbeddingLevel.EMBEDDING_LEVEL_DOCUMENT],
    # The only thing missing now is the spacy configuration we created in the previous step.
    config=config,
)
# Having created the request, we can now send it to the server and retrieve the corresponding response.
response = client.Vectors(request)
# Due to technical constraints, we cannot directly transfer numpy arrays, thus we convert our response.
vectors = [np.array(entry.document.vector) for entry in response.vectors]
```
<!-- TODO: Prefer Vectors instead of Similarities for Python to increase performance. -->
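Since the raw vectors are now available on the client, similarity scores can also be computed locally instead of issuing additional RPCs. A minimal sketch using plain numpy on the `vectors` list from the example above; the cosine formula here is standard and not part of the service API:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Standard cosine similarity: dot product normalized by the vector lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare the two documents embedded in the previous example.
score = cosine_similarity(vectors[0], vectors[1])
print(f"Cosine similarity: {score:.4f}")
```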
## Advanced Usage
A central piece of all available functions is the `NlpConfig` message, which allows you to compose even complex embedding models easily.
In addition to [its documentation](https://buf.build/recap/arg-services/docs/main:arg_services.nlp.v1), the following examples demonstrate the possibilities you have.
```python
from arg_services.nlp.v1 import nlp_pb2
# In the example above, we already introduced a quite basic config:
config = nlp_pb2.NlpConfig(
    # You have to provide a language for every config: https://spacy.io/usage/models#languages
    language="en",
    # Also, you need to specify the model that spacy should load: https://spacy.io/models/en
    spacy_model="en_core_web_lg",
)
# A central feature of our library is the ability to combine multiple embedding models, potentially capturing more contextual information.
config = nlp_pb2.NlpConfig(
    language="en",
    # This parameter expects a list of models. If you pass more than one, the respective vectors are **concatenated** with each other
    # (e.g., two 300-dimensional embeddings will result in a 600-dimensional one).
    # This approach is based on https://arxiv.org/abs/1803.01400
    embedding_models=[
        nlp_pb2.EmbeddingModel(
            # First select the type of model you would like to use (e.g., SBERT/Sentence Transformers).
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SENTENCE_TRANSFORMERS,
            # Then select the actual model.
            # Any of those specified on the website (https://www.sbert.net/docs/pretrained_models.html) are allowed.
            model_name="all-mpnet-base-v2",
        ),
        nlp_pb2.EmbeddingModel(
            # It is also possible to use a standard spacy model.
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SPACY,
            model_name="en_core_web_lg",
            # Since we have selected a word embedding (i.e., it cannot directly encode sentences), the token vectors need to be aggregated somehow.
            # The default strategy is the arithmetic mean, but you are free to use other strategies (e.g., the geometric mean).
            pooling_type=nlp_pb2.Pooling.POOLING_GMEAN,
        ),
        nlp_pb2.EmbeddingModel(
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SPACY,
            model_name="en_core_web_lg",
            # Alternatively, it is also possible to use the generalized mean / power mean.
            # In this example, the selected pmean corresponds to the geometric mean (thus this embedding is identical to the previous one).
            # This approach is based on https://arxiv.org/abs/1803.01400
            pmean=0,
        ),
    ],
    # This setting is now optional and only needed if you require spacy features (e.g., POS tagging) besides embeddings.
    # spacy_model="en_core_web_lg",
)
# When computing the similarity between strings, you get one additional parameter.
config = nlp_pb2.NlpConfig(
    language="en",
    # To keep the example simple, we will now only use a single spacy model instead of the more powerful embedding models.
    # However, it is of course possible to use them here as well.
    spacy_model="en_core_web_lg",
    # If not specified, we will always use the cosine similarity when comparing two strings.
    # As indicated in a recent paper (https://arxiv.org/abs/1904.13264), you may achieve better results with alternative approaches like DynaMax Jaccard.
    # Please note that this particular method ignores your selected pooling method, as the individual word embeddings are not pooled at all.
    similarity_method=nlp_pb2.SimilarityMethod.SIMILARITY_METHOD_DYNAMAX_JACCARD,
)
# It is also possible to determine a similarity score without the use of embeddings.
config = nlp_pb2.NlpConfig(
    language="en",
    spacy_model="en_core_web_sm",
    # Traditional metrics (Jaccard similarity and Levenshtein edit distance) are also available.
    similarity_method=nlp_pb2.SimilarityMethod.SIMILARITY_METHOD_EDIT,
)
```
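To illustrate why `pmean=0` reproduces the geometric-mean pooling above: the power mean `(mean(x_i^p))^(1/p)` converges to the geometric mean as `p` approaches 0. A small numpy sketch of that relationship (purely illustrative; it does not call the service, and it uses positive values since the geometric mean is only defined for positive inputs):

```python
import numpy as np
from scipy.stats import gmean

# Token vectors standing in for word embeddings (3 tokens, 2 dimensions).
tokens = np.array([[1.0, 2.0], [4.0, 8.0], [16.0, 4.0]])

def power_mean(x: np.ndarray, p: float) -> np.ndarray:
    # Generalized (power) mean over the token axis.
    return np.mean(x**p, axis=0) ** (1 / p)

# As p approaches 0, the power mean converges to the geometric mean.
print(power_mean(tokens, 1e-9))  # approximately [4. 4.]
print(gmean(tokens, axis=0))     # exactly [4. 4.]
```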