| Name | nlp-service |
| --- | --- |
| Version | 1.4.10 |
| Summary | Microservice for NLP tasks using gRPC |
| Home page | http://recap.uni-trier.de |
| Repository | https://github.com/recap-utr/nlp-service |
| Author | Mirko Lenz |
| License | MIT |
| Requires Python | >=3.10,<3.13 |
| Upload time | 2023-11-07 08:26:32 |
# NLP Microservice
The goal of this project is to provide a [gRPC](https://grpc.io) server for resource-heavy NLP tasks—for instance, computing vectors/embeddings for words or sentences.
By using [protobuf](https://developers.google.com/protocol-buffers) internally, our NLP server provides native and strongly typed interfaces for many programming languages.
There are multiple advantages to outsourcing such computations to a dedicated server:
- If multiple apps rely on NLP, the underlying models (which are usually quite large) only need to be loaded once into the main memory.
- All programming languages supported by gRPC get easy access to state-of-the-art NLP architectures (e.g., transformers).
- The logic is consolidated at a central place, drastically decreasing the maintenance effort required.
In addition to the server, we also provide a client containing convenience functions.
This makes it easier for Python applications to interact with the gRPC server.
We will discuss the client at the end of this README.
## Installation and Setup
We use `nix` and `poetry` to manage the dependencies and also provide a ready-to-use Docker image.
### Docker (recommended)
The container caches the downloaded models, so you should not pass `--rm` to `docker run`.
```sh
docker run ghcr.io/recap-utr/nlp-service:latest "0.0.0.0:50100"
```
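If you want the cache to survive container replacements (not just restarts), you can mount a named volume at the cache directory. A minimal sketch, assuming the models are cached under `/root/.cache` inside the container; the exact path is an assumption and should be verified against the image:

```sh
# Hypothetical setup: persist downloaded models in a named volume.
# The cache path /root/.cache is an assumption, not taken from the image docs.
docker volume create nlp-service-models
docker run -v nlp-service-models:/root/.cache \
  ghcr.io/recap-utr/nlp-service:latest "0.0.0.0:50100"
```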
### Nix (advanced)
```sh
nix run github:recap-utr/nlp-service -- "127.0.0.1:50100"
# or after cloning this repository
nix develop -c poetry run python -m nlp_service "127.0.0.1:50100"
```
### Poetry (advanced)
```sh
# The server dependencies are optional, so they have to be installed explicitly.
poetry install --extras all
# To run the server, you need to specify the address it should listen on.
# In this example, it listens on port 50100 on localhost.
poetry run python -m nlp_service "127.0.0.1:50100"
```
## General Usage
Once the server is running, you are free to call any of the functions defined in the underlying [protobuf file](https://github.com/recap-utr/arg-services/blob/main/arg_services/nlp/v1/nlp.proto).
The corresponding documentation is located at the [Buf Schema Registry](https://buf.build/recap/arg-services/docs/main:arg_services.nlp.v1).
_Please note:_ The examples here use the Python programming language, but the concepts apply directly to any other language supported by gRPC.
```python
import grpc
import numpy as np
from arg_services.nlp.v1 import nlp_pb2, nlp_pb2_grpc

# First of all, we create a channel (i.e., establish a connection to our server).
channel = grpc.insecure_channel("127.0.0.1:50100")
# The channel can now be used to create the actual client (allowing us to call all available functions)
client = nlp_pb2_grpc.NlpServiceStub(channel)
# Now the time has come to prepare our actual function call.
# We will start by creating a very simple NlpConfig with the default spacy model.
# For details about the parameters, please have a look at the next section.
config = nlp_pb2.NlpConfig(
    language="en",
    spacy_model="en_core_web_lg",
)
# Next, we will build a request to query vectors from our server.
request = nlp_pb2.VectorsRequest(
    # The first parameter is a list of strings that shall be embedded by our server.
    texts=["What a great tutorial!", "I will definitely recommend this to my friends."],
    # Now we need to specify which embeddings have to be computed.
    # In this example, we create one vector for each text.
    embedding_levels=[nlp_pb2.EmbeddingLevel.EMBEDDING_LEVEL_DOCUMENT],
    # The only thing missing now is the spacy configuration we created in the previous step.
    config=config,
)
# Having created the request, we can now send it to the server and retrieve the corresponding response.
response = client.Vectors(request)
# Due to technical constraints, we cannot directly transfer numpy arrays, thus we convert our response.
vectors = [np.array(entry.document.vector) for entry in response.vectors]
```
<!-- TODO: Prefer Vectors instead of Similarities for Python to increase performance. -->
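Since the raw vectors are now available on the client, similarity scores can also be computed locally instead of issuing additional RPCs. A minimal sketch using plain numpy on the `vectors` list from the example above; the cosine formula here is standard and not part of the service API:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Standard cosine similarity: dot product normalized by the vector lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare the two documents embedded in the previous example.
score = cosine_similarity(vectors[0], vectors[1])
print(f"Cosine similarity: {score:.4f}")
```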
## Advanced Usage
A central piece of all available functions is the `NlpConfig` message, which allows you to compose even complex embedding models easily.
In addition to [its documentation](https://buf.build/recap/arg-services/docs/main:arg_services.nlp.v1), the following examples demonstrate the possibilities you have.
```python
from arg_services.nlp.v1 import nlp_pb2
# In the example above, we already introduced a quite basic config:
config = nlp_pb2.NlpConfig(
    # You have to provide a language for every config: https://spacy.io/usage/models#languages
    language="en",
    # Also, you need to specify the model that spacy should load: https://spacy.io/models/en
    spacy_model="en_core_web_lg",
)
# A central feature of our library is the ability to combine multiple embedding models, potentially capturing more contextual information.
config = nlp_pb2.NlpConfig(
    language="en",
    # This parameter expects a list of models. If you pass more than one, the respective vectors are **concatenated** with each other
    # (e.g., two 300-dimensional embeddings will result in a 600-dimensional one).
    # This approach is based on https://arxiv.org/abs/1803.01400
    embedding_models=[
        nlp_pb2.EmbeddingModel(
            # First select the type of model you would like to use (e.g., SBERT/Sentence Transformers).
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SENTENCE_TRANSFORMERS,
            # Then select the actual model.
            # Any of those specified on the website (https://www.sbert.net/docs/pretrained_models.html) are allowed.
            model_name="all-mpnet-base-v2",
        ),
        nlp_pb2.EmbeddingModel(
            # It is also possible to use a standard spacy model.
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SPACY,
            model_name="en_core_web_lg",
            # Since we have selected a word embedding (i.e., it cannot directly encode sentences), the token vectors need to be aggregated somehow.
            # The default strategy is the arithmetic mean, but you are free to use other strategies (e.g., the geometric mean).
            pooling_type=nlp_pb2.Pooling.POOLING_GMEAN,
        ),
        nlp_pb2.EmbeddingModel(
            model_type=nlp_pb2.EmbeddingType.EMBEDDING_TYPE_SPACY,
            model_name="en_core_web_lg",
            # Alternatively, it is also possible to use the generalized mean / power mean.
            # In this example, the selected pmean corresponds to the geometric mean (thus this embedding is identical to the previous one).
            # This approach is based on https://arxiv.org/abs/1803.01400
            pmean=0,
        ),
    ],
    # This setting is now optional and only needed if you require spacy features (e.g., POS tagging) besides embeddings.
    # spacy_model="en_core_web_lg",
)
# When computing the similarity between strings, you get one additional parameter.
config = nlp_pb2.NlpConfig(
    language="en",
    # To keep the example simple, we will now only use a single spacy model instead of the more powerful embedding models.
    # However, it is of course possible to use them here as well.
    spacy_model="en_core_web_lg",
    # If not specified, we will always use the cosine similarity when comparing two strings.
    # As indicated in a recent paper (https://arxiv.org/abs/1904.13264), you may achieve better results with alternative approaches like DynaMax Jaccard.
    # Please note that this particular method ignores your selected pooling method, as the individual word embeddings are not pooled at all.
    similarity_method=nlp_pb2.SimilarityMethod.SIMILARITY_METHOD_DYNAMAX_JACCARD,
)
# It is also possible to determine a similarity score without the use of embeddings.
config = nlp_pb2.NlpConfig(
    language="en",
    spacy_model="en_core_web_sm",
    # Traditional metrics (Jaccard similarity and Levenshtein edit distance) are also available.
    similarity_method=nlp_pb2.SimilarityMethod.SIMILARITY_METHOD_EDIT,
)
```
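To illustrate why `pmean=0` reproduces the geometric-mean pooling above: the power mean `(mean(x_i^p))^(1/p)` converges to the geometric mean as `p` approaches 0. A small numpy sketch of that relationship (purely illustrative; it does not call the service, and it uses positive values since the geometric mean is only defined for positive inputs):

```python
import numpy as np
from scipy.stats import gmean

# Token vectors standing in for word embeddings (3 tokens, 2 dimensions).
tokens = np.array([[1.0, 2.0], [4.0, 8.0], [16.0, 4.0]])

def power_mean(x: np.ndarray, p: float) -> np.ndarray:
    # Generalized (power) mean over the token axis.
    return np.mean(x**p, axis=0) ** (1 / p)

# As p approaches 0, the power mean converges to the geometric mean.
print(power_mean(tokens, 1e-9))  # approximately [4. 4.]
print(gmean(tokens, axis=0))     # exactly [4. 4.]
```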