litserve

Name: litserve
Version: 0.0.0.dev2
Home page: https://github.com/Lightning-AI/litserve
Summary: Lightweight AI server.
Upload time: 2024-03-07 23:37:12
Author: Lightning-AI et al.
Requires Python: >=3.8
License: Apache-2.0
Keywords: deep learning, pytorch, ai

# litserve

Lightweight inference server for AI/ML models based on FastAPI.

litserve supports:

- serving models across multiple GPUs
- full flexibility in the definition of request and response
- automatic schema validation
- handling of timeouts and disconnections
- API key authentication

While being fairly capable and fast, the server is extremely simple and hackable.

For example, adding batching and streaming logic should be relatively straightforward,
and we'll probably add both to the library in the near future.

## Install

```bash
pip install litserve
```

## Use

### Creating a simple API server

```python
from litserve import LitAPI


class SimpleLitAPI(LitAPI):
    def setup(self, device):
        """
        Set up the model so it can be called in `predict`.
        """
        self.model = lambda x: x**2

    def decode_request(self, request):
        """
        Convert the request payload to your model input.
        """
        return request["input"]

    def predict(self, x):
        """
        Run the model on the input and return the output.
        """
        return self.model(x)

    def encode_response(self, output):
        """
        Convert the model output to a response payload.
        """
        return {"output": output}
```

Now instantiate the API and start the server on an accelerator:

```python
from litserve import LitServer

api = SimpleLitAPI()

server = LitServer(api, accelerator="cpu")
server.run(port=8000)
```

The server expects the client to send a `POST` to the `/predict` URL with a JSON payload.
The way the payload is structured is up to the implementation of the `LitAPI` subclass.

Once the server starts, it generates an example client you can run like this:

```bash
python client.py
```

The generated client simply posts a sample request to the server:

```python
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
```
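
The response payload mirrors whatever `encode_response` returns. For the sample request above, printing it should give something like the following (the model simply squares the input, so 4.0 becomes 16.0):

```python
print(response.status_code)  # 200 on success
print(response.json())       # expected: {"output": 16.0}, since the model squares the input
```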

### Pydantic models

You can also define request and response as [Pydantic models](https://docs.pydantic.dev/latest/),
with the advantage that the API will automatically validate the request.

```python
from pydantic import BaseModel

from litserve import LitAPI, LitServer


class PredictRequest(BaseModel):
    input: float


class PredictResponse(BaseModel):
    output: float


class SimpleLitAPI2(LitAPI):
    def setup(self, device):
        self.model = lambda x: x**2

    def decode_request(self, request: PredictRequest) -> float:
        return request.input

    def predict(self, x):
        return self.model(x)

    def encode_response(self, output: float) -> PredictResponse:
        return PredictResponse(output=output)


if __name__ == "__main__":
    api = SimpleLitAPI2()
    server = LitServer(api, accelerator="cpu")
    server.run(port=8888)
```
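
With the Pydantic models in place, a request whose payload does not match `PredictRequest` should be rejected before it reaches your model, typically with a FastAPI validation error. A quick way to check, assuming the server above is running on port 8888 (the exact status codes depend on the litserve version):

```python
import requests

# valid payload: "input" parses as a float, so decode_request receives a PredictRequest
ok = requests.post("http://127.0.0.1:8888/predict", json={"input": 4.0})

# invalid payload: "input" is not a number, so validation should reject it
bad = requests.post("http://127.0.0.1:8888/predict", json={"input": "not a number"})

print(ok.status_code)   # expected: 200
print(bad.status_code)  # expected: a 4xx validation error
```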

### Serving on GPU

`LitServer` can coordinate serving across multiple GPUs.

For example, here is how to run the API server on a 4-GPU machine, with a copy of a PyTorch model served from each GPU:

```python
from fastapi import Request, Response

from litserve import LitAPI, LitServer

import torch
import torch.nn as nn


class Linear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.linear.weight.data.fill_(2.0)
        self.linear.bias.data.fill_(1.0)

    def forward(self, x):
        return self.linear(x)


class SimpleLitAPI(LitAPI):
    def setup(self, device):
        # move the model to the correct device
        # keep track of the device for moving data accordingly
        self.model = Linear().to(device)
        self.device = device

    def decode_request(self, request: Request):
        # get the input and create a 1D tensor on the correct device
        content = request["input"]
        return torch.tensor([content], device=self.device)

    def predict(self, x):
        # the model expects a batch dimension, so create it
        return self.model(x[None, :])

    def encode_response(self, output) -> Response:
        # convert the output tensor to a Python float (this also moves it to CPU memory)
        return {"output": float(output)}


if __name__ == "__main__":
    # accelerator="cuda", devices=4 will lead to 4 workers serving the
    # model from "cuda:0", "cuda:1", "cuda:2", "cuda:3" respectively
    server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=4)
    server.run(port=8000)
```
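
As a quick sanity check, the same kind of client request used earlier should work against this server as well. With the weights hard-coded above, the model computes `y = 2x + 1`, so an input of 4.0 should come back as roughly 9.0:

```python
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
print(response.json())  # expected: {"output": 9.0}, i.e. 2 * 4.0 + 1.0
```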

The `devices` argument can also be a list specifying which device IDs to
run the model on:

```python
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=[0, 3])
```

Finally, if the model is small, you can run multiple copies of it on the
same device. The following will load two copies of the model on each of
the 4 GPUs:

```python
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=4, workers_per_device=2)
```

### Timeouts and disconnections

The server will remove a queued request if the client requesting it disconnects.

You can configure a timeout (in seconds) after which clients will receive a `504` HTTP
response (Gateway Timeout) indicating that their request has timed out.

For example, this is how to configure the server with a timeout of 30 seconds per response:

```python
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=4, timeout=30)
```

This is useful to prevent requests from queuing up beyond the server's ability to respond.
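
On the client side, a timed-out request shows up as an ordinary HTTP error. A minimal sketch, assuming a server like the ones above is listening on port 8000:

```python
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
if response.status_code == 504:
    # the server gave up on this request after the configured timeout
    print("request timed out on the server, try again later")
else:
    response.raise_for_status()
    print(response.json())
```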

### Using API key authentication

To secure the API behind an API key, define the `LIT_SERVER_API_KEY`
environment variable when starting the server:

```bash
LIT_SERVER_API_KEY=supersecretkey python main.py
```

Clients are expected to authenticate by sending the same API key in the `X-API-Key` HTTP header.
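
For example, a minimal client sketch (the key value is only an example and must match whatever you set on the server; the URL assumes the first server above):

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
    headers={"X-API-Key": "supersecretkey"},  # must match LIT_SERVER_API_KEY
)
print(response.json())
```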

## License

litserve is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
See the LICENSE file for details.
