<!-- PROJECT SHIELDS -->
<!--
*** I'm using markdown "reference style" links for readability.
*** Reference links are enclosed in brackets [ ] instead of parentheses ( ).
*** See the bottom of this document for the declaration of the reference variables
*** for contributors-url, forks-url, etc. This is an optional, concise syntax you may use.
*** https://www.markdownguide.org/basic-syntax/#reference-style-links
-->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
# Infinity ♾️
[![codecov][codecov-shield]][codecov-url]
[![ci][ci-shield]][ci-url]
[![Downloads][pepa-shield]][pepa-url]
[![DOI](https://zenodo.org/badge/703686617.svg)](https://zenodo.org/doi/10.5281/zenodo.11406462)
![Docker pulls](https://img.shields.io/docker/pulls/michaelf34/infinity)
Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali. Infinity is developed under the [MIT License](https://github.com/michaelfeil/infinity/blob/main/LICENSE).
## Why Infinity
* **Deploy any model from HuggingFace**: deploy any embedding, reranking, CLIP, or sentence-transformer model from [HuggingFace](https://huggingface.co/models?other=text-embeddings-inference&sort=trending)
* **Fast inference backends**: The inference server is built on top of [PyTorch](https://github.com/pytorch/pytorch), [optimum (ONNX/TensorRT)](https://huggingface.co/docs/optimum/index) and [CTranslate2](https://github.com/OpenNMT/CTranslate2), using FlashAttention to get the most out of your **NVIDIA CUDA**, **AMD ROCM**, **CPU**, **AWS INF2** or **APPLE MPS** accelerator. Infinity uses dynamic batching and runs tokenization in dedicated worker threads.
* **Multi-modal and multi-model**: Mix-and-match multiple models. Infinity orchestrates them.
* **Tested implementation**: Unit and end-to-end tested. Embeddings produced via Infinity are verified for correctness. Lets API users create embeddings till infinity and beyond.
* **Easy to use**: Built on [FastAPI](https://fastapi.tiangolo.com/). The Infinity CLI v2 lets you set every argument via environment variable or CLI flag. The OpenAPI schema is aligned to [OpenAI's API specs](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings). View the docs at [https://michaelfeil.github.io/infinity](https://michaelfeil.github.io/infinity/) on how to get started.
<p align="center">
<a href="https://github.com/basetenlabs/truss-examples/tree/7025918c813d08d718b8939f44f10651a0ff2c8c/custom-server/infinity-embedding-server"><img src="https://avatars.githubusercontent.com/u/54861414" alt="Logo Baseten.co" width="50"/></a>
<a href="https://github.com/runpod-workers/worker-infinity-embedding"><img src="https://github.com/user-attachments/assets/24f1906d-31b8-4e16-a479-1382cbdea046" alt="Logo Runpod" width="50"/></a>
<a href="https://www.truefoundry.com/cognita"><img src="https://github.com/user-attachments/assets/1b515b0f-2332-4b12-be82-933056bddee4" alt="Logo TrueFoundry" width="50"/></a>
<a href="https://vast.ai/article/serving-infinity"><img src="https://github.com/user-attachments/assets/8286d620-f403-48f5-bd7f-f471b228ae7b" alt="Logo Vast" width="46"/></a>
<a href="https://www.dataguard.de"><img src="https://github.com/user-attachments/assets/3fde1ac6-c299-455d-9fc2-ba4012799f9c" alt="Logo DataGuard" width="50"/></a>
<a href="https://community.sap.com/t5/artificial-intelligence-and-machine-learning-blogs/bring-open-source-llms-into-sap-ai-core/ba-p/13655167"><img src="https://github.com/user-attachments/assets/743e932b-ed5b-4a71-84cb-f28235707a84" alt="Logo SAP" width="47"/></a>
<a href="https://x.com/StuartReid1929/status/1763434100382163333"><img src="https://github.com/user-attachments/assets/477a4c54-1113-434b-83bc-1985f10981d3" alt="Logo Nosible" width="44"/></a>
<a href="https://github.com/freshworksinc/freddy-infinity"><img src="https://github.com/user-attachments/assets/a68da78b-d958-464e-aaf6-f39132be68a0" alt="Logo FreshWorks" width="50"/></a>
<a href="https://github.com/dstackai/dstack/tree/master/examples/deployment/infinity"><img src="https://github.com/user-attachments/assets/9cde2d6b-dc16-4f0a-81ba-535a84321467" alt="Logo Dstack" width="50"/></a>
<a href="https://embeddedllm.com/blog/"><img src="https://avatars.githubusercontent.com/u/148834374" alt="Logo JamAI" width="50"/></a>
<a href="https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct#infinity_emb"><img src="https://avatars.githubusercontent.com/u/1961952" alt="Logo Alibaba Group" width="50"/></a>
<a href="https://github.com/bentoml/BentoInfinity/"><img src="https://avatars.githubusercontent.com/u/49176046" alt="Logo BentoML" width="50"/></a>
<a href="https://x.com/bo_wangbo/status/1766371909086724481"><img src="https://avatars.githubusercontent.com/u/60539444" alt="Logo JinaAi" width="50"/></a>
<a href="https://github.com/dwarvesf/llm-hosting"><img src="https://avatars.githubusercontent.com/u/10388449" alt="Logo Dwarves Foundation" width="50"/></a>
<a href="https://github.com/huggingface/chat-ui/blob/daf695ea4a6e2d081587d7dbcae3cacd466bf8b2/docs/source/configuration/embeddings.md#openai"><img src="https://avatars.githubusercontent.com/u/25720743" alt="Logo HF" width="50"/></a>
<a href="https://www.linkedin.com/posts/markhng525_join-me-and-ekin-karabulut-at-the-ai-infra-activity-7163233344875393024-LafB?utm_source=share&utm_medium=member_desktop"><img src="https://avatars.githubusercontent.com/u/86131705" alt="Logo Gradient.ai" width="50"/></a>
</p>
### Latest News 🔥
- [2024/11] AMD, CPU, ONNX docker images
- [2024/10] `pip install infinity_client`
- [2024/07] Inference deployment example via [Modal](./infra/modal/README.md) and a [free GPU deployment](https://infinity.modal.michaelfeil.eu/)
- [2024/06] Support for multi-modal: clip, text-classification & launch all arguments from env variables
- [2024/05] launch multiple models using the `v2` cli, including `--api-key`
- [2024/03] infinity supports experimental int8 (CPU/CUDA) and fp8 (H100/MI300) inference
- [2024/03] Docs are online: https://michaelfeil.github.io/infinity/latest/
- [2024/02] Community meetup at the [Run:AI Infra Club](https://discord.gg/7D4fbEgWjv)
- [2024/01] TensorRT / ONNX inference
- [2023/10] Initial release
## Getting started
### Launch the CLI via pip install
```bash
pip install "infinity-emb[all]"
```
After your pip install, with your venv active, you can run the CLI directly.
```bash
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
```
Check the `v2 --help` command to get a description for all parameters.
```bash
infinity_emb v2 --help
```
### Launch the CLI using a pre-built docker container (recommended)
Instead of installing the CLI via pip, you may also use Docker to run `michaelf34/infinity`.
Make sure you mount your accelerator (i.e. install `nvidia-docker` and activate it with `--gpus all`).
```bash
port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data
docker run -it --gpus all \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest \
v2 \
--model-id $model1 \
--model-id $model2 \
--port $port
```
The cache path inside the docker container is set by the environment variable `HF_HOME`.
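Once the container is up, you can smoke-test it from Python. A minimal sketch, assuming the server listens on `localhost:7997` and exposes the OpenAI-aligned `/embeddings` route (the model name matches `$model1` from the `docker run` above):

```python
import requests

# Query the OpenAI-compatible /embeddings route of the running container.
resp = requests.post(
    "http://localhost:7997/embeddings",
    json={
        "model": "michaelfeil/bge-small-en-v1.5",  # one of the --model-id values above
        "input": ["A sentence to embed."],
    },
)
resp.raise_for_status()
print(len(resp.json()["data"][0]["embedding"]))  # embedding dimension
```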
#### Specialized docker images
<details>
<summary>Docker container for CPU</summary>
Use the `latest-cpu` image or `x.x.x-cpu` for a slimmer image.
Run it like any other CPU-only Docker image.
Optimum/ONNX is often the preferred engine.
```bash
docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-cpu \
v2 \
--engine optimum \
--model-id $model1 \
--model-id $model2 \
--port $port
```
</details>
<details>
<summary>Docker Container for ROCm (MI200 Series and MI300 Series)</summary>
Use the `latest-rocm` image or `x.x.x-rocm` for ROCm-compatible inference.
**This image is currently not built via CI/CD (too large); consider pinning to an exact version.**
Make sure ROCm is correctly installed and ready to use with Docker.
Visit the [Docs](https://michaelfeil.github.io/infinity) for more info.
</details>
<details>
<summary>Docker Container for Onnx-GPU, Cuda Extensions, TensorRT</summary>
Use the `latest-trt-onnx` image or `x.x.x-trt-onnx` for NVIDIA-compatible inference.
**This image is currently not built via CI/CD (too large); consider pinning to an exact version.**
This image has support for:
- ONNX-CUDA "CUDAExecutionProvider"
- ONNX-TensorRT "TensorrtExecutionProvider" (may not always work due to version mismatch with ORT)
- CUDA extensions and packages, e.g. Tri Dao's `pip install flash-attn` package when using PyTorch.
- nvcc compiler support
```bash
docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-trt-onnx \
v2 \
--engine optimum \
--device cuda \
--model-id $model1 \
--port $port
```
</details>
#### Advanced CLI usage
<details>
<summary>Launching multiple models at once</summary>
Since `infinity_emb>=0.0.34`, you can use the CLI `v2` command to launch multiple models at the same time.
Check out `infinity_emb v2 --help` for all args and validation.
Multiple Model CLI Playbook (see the sketch after this list):
- 1. CLI options can be repeated, e.g. `v2 --model-id model/id1 --model-id model/id2 --batch-size 8 --batch-size 4`. This will create two models, `model/id1` and `model/id2`.
- 2. Or adapt the defaults by setting ENV variables separated by `;`: `INFINITY_MODEL_ID="model/id1;model/id2;" && INFINITY_BATCH_SIZE="8;4;"`
- 3. Single items are broadcast to the `--model-id` length: `v2 --model-id model/id1 --model-id model/id2 --batch-size 8` gives both models a batch size of 8.
- 4. Everything is broadcast to the number of `--model-id` arguments, and API requests are routed via `--served-model-name` (which defaults to `--model-id`).
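A sketch of the routing in point 4, assuming two models are served on `localhost:7997` behind the OpenAI-aligned `/embeddings` route (the model ids are the hypothetical `model/id1` and `model/id2` from above):

```python
import requests

# The "model" field of the request selects which served model handles it.
for model in ("model/id1", "model/id2"):
    resp = requests.post(
        "http://localhost:7997/embeddings",
        json={"model": model, "input": ["Routed to the matching --model-id."]},
    )
    resp.raise_for_status()
    print(model, "->", resp.json()["model"])  # response echoes the serving model
```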
</details>
<details>
<summary>Using environment variables instead of the cli</summary>
All CLI arguments are also launchable via environment variables.
Environment variables start with `INFINITY_{UPPER_CASE_SNAKE_CASE}` and often match the `--{lower-case-kebab-case}` cli arguments.
The following two are equivalent:
- CLI `infinity_emb v2 --model-id BAAI/bge-base-en-v1.5`
- ENV-CLI: `export INFINITY_MODEL_ID="BAAI/bge-base-en-v1.5" && infinity_emb v2`
Multiple arguments can be used via `;` syntax: `INFINITY_MODEL_ID="model/id1;model/id2;"`
</details>
<details>
<summary>API Key</summary>
Supply `--api-key secret123` via the CLI or the ENV variable `INFINITY_API_KEY="secret123"` (see the sketch below).
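A minimal sketch of calling a protected instance, assuming the key is checked as an OpenAI-style Bearer token:

```python
import requests

# Assumption: the API key is passed as a standard Bearer token header.
resp = requests.post(
    "http://localhost:7997/embeddings",
    headers={"Authorization": "Bearer secret123"},
    json={"model": "BAAI/bge-small-en-v1.5", "input": ["hello"]},
)
resp.raise_for_status()
```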
</details>
<details>
<summary>Choosing the fastest engine</summary>
With `--engine torch`, the model must be compatible with https://github.com/UKPLab/sentence-transformers/ and AutoModel.
With `--engine optimum`, there must be an ONNX file. Models from https://huggingface.co/Xenova are recommended.
With `--engine ctranslate2`, only `BERT` models are supported.
</details>
<details>
<summary>Telemetry opt-out</summary>
See which telemetry is collected: https://michaelfeil.eu/infinity/main/telemetry/
```bash
# Disable
export INFINITY_ANONYMOUS_USAGE_STATS="0"
```
</details>
### Supported Tasks and Models by Infinity
Infinity aims to be the inference server with the broadest functionality for embeddings, reranking and related RAG tasks. Infinity tests 15+ architectures and all of the cases below in the GitHub CI.
Click on the sections below to find tasks and **validated example models**.
<details>
<summary>Text Embeddings</summary>
Text embeddings measure the relatedness of text strings. Embeddings are used for search, clustering, and recommendations.
Think of it as a privately deployed version of OpenAI's text embeddings: https://platform.openai.com/docs/guides/embeddings
Tested embedding models:
- [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
- [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1)
- [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
- [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)
- [BAAI/bge-m3, no sparse](https://huggingface.co/BAAI/bge-m3)
- decoder-based models. Keep in mind that they are ~20-100x larger (& slower) than bert-small models:
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/discussions/20)
- [Salesforce/SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R/discussions/6)
- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/discussions/39)
Other models:
- Most embedding models are likely supported: https://huggingface.co/models?pipeline_tag=feature-extraction&other=text-embeddings-inference&sort=trending
- Check the MTEB leaderboard for models: https://huggingface.co/spaces/mteb/leaderboard.
</details>
<details>
<summary>Reranking</summary>
Given a query and a list of documents, reranking orders the documents from most to least semantically relevant to the query.
Think of it as a locally deployed version of https://docs.cohere.com/reference/rerank
Tested reranking models:
- [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1)
- [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)
- [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)
- [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- [jinaai/jina-reranker-v1-turbo-en](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en)
Other reranking models:
- Reranking models supported by Infinity are bert-style classification models with a single output category.
- Most reranking models are likely supported: https://huggingface.co/models?pipeline_tag=text-classification&other=text-embeddings-inference&sort=trending
- https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=rerank
</details>
<details>
<summary>Multi-modal and cross-modal - image and audio embeddings</summary>
Specialized embedding models that allow for image<->text or image<->audio search.
Typically, these models allow for text<->text, text<->other and other<->other search, with accuracy tradeoffs when going cross-modal.
Image<->text models can be used e.g. for photo-gallery search, where users type keywords to find photos or use a photo to find related images.
Audio<->text models are less popular; they can be used e.g. to find songs based on a text description or to find related songs.
Tested image<->text models:
- [wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M](https://huggingface.co/wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M)
- [jinaai/jina-clip-v1](https://huggingface.co/jinaai/jina-clip-v1)
- [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- Models of type: ClipModel / SiglipModel in `config.json`
Tested audio<->text models:
- [Clap Models from LAION](https://huggingface.co/collections/laion/clap-contrastive-language-audio-pretraining-65415c0b18373b607262a490)
- only a limited number of open-source organizations train these models
- *Note: the sampling rate of the audio data needs to match the model.*
Not supported:
- Plain vision models e.g. nomic-ai/nomic-embed-vision-v1.5
</details>
<details>
<summary>ColBert-style late-interaction Embeddings</summary>
ColBERT embeddings don't perform any special pooling, but return the raw **token embeddings**.
The **token embeddings** are then scored with the MaxSim metric in a VectorDB (Qdrant / Vespa); a sketch of the metric follows below.
For usage via the REST API, late-interaction embeddings may best be transported via `base64` encoding.
Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing
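Since the server returns raw token embeddings, MaxSim scoring happens client-side or in the VectorDB. A minimal NumPy sketch of the metric, for illustration only:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim: for each query token, take the best-matching document token,
    then sum those maxima. Shapes: (num_q, dim) and (num_d, dim)."""
    sim = query_tokens @ doc_tokens.T     # (num_q, num_d) dot-product similarities
    return float(sim.max(axis=1).sum())   # best doc token per query token, summed

# Toy example with random unit-normalized token embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(9, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```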
Tested ColBERT models:
- [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0)
- [jinaai/jina-colbert-v2](https://huggingface.co/jinaai/jina-colbert-v2)
- [mixedbread-ai/mxbai-colbert-large-v1](https://huggingface.co/mixedbread-ai/mxbai-colbert-large-v1)
- [answerai-colbert-small-v1 - click link for instructions](https://huggingface.co/answerdotai/answerai-colbert-small-v1/discussions/14)
</details>
<details>
<summary>ColPali-style late-interaction Image<->Text Embeddings</summary>
Similar usage to ColBERT, but scanning over image<->text pairs instead of text only.
For usage via the REST API, late-interaction embeddings may best be transported via `base64` encoding (see the sketch below).
Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing
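A sketch of decoding such a `base64` payload client-side, assuming the OpenAI-style convention of little-endian `float32` bytes:

```python
import base64
import numpy as np

def decode_token_embeddings(b64_payload: str, dim: int) -> np.ndarray:
    """Decode a base64 embedding payload into token vectors of size `dim`.
    Assumes float32 little-endian bytes, reshaped to (num_tokens, dim)."""
    flat = np.frombuffer(base64.b64decode(b64_payload), dtype="<f4")
    return flat.reshape(-1, dim)
```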
Tested ColPali/ColQwen models:
- [vidore/colpali-v1.2-merged](https://huggingface.co/michaelfeil/colpali-v1.2-merged)
- [michaelfeil/colqwen2-v0.1](https://huggingface.co/michaelfeil/colqwen2-v0.1)
- No LoRA adapters are supported, only "merged" models.
</details>
<details>
<summary>Text classification</summary>
Bert-style multi-label text classification: classifies text into distinct categories.
Tested models:
- [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert), financial news classification
- [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions), text to emotion categories.
- bert-style text-classification models with more than one label in `config.json`
</details>
### Infinity usage via the Python API
Instead of the CLI & REST API, use Infinity's interface directly via the Python API.
This gives you the most flexibility. The Python API builds on `asyncio` with its `async/await` features to allow concurrent processing of requests. All arguments of the CLI are also available via Python.
#### Embeddings
```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
EngineArgs(model_name_or_path = "BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])
async def embed_text(engine: AsyncEmbeddingEngine):
async with engine:
embeddings, usage = await engine.embed(sentences=sentences)
# or handle the async start / stop yourself.
await engine.astart()
embeddings, usage = await engine.embed(sentences=sentences)
await engine.astop()
asyncio.run(embed_text(array[0]))
```
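Because the engine is fully async, several `embed` calls can be awaited concurrently, and dynamic batching merges them internally. A minimal sketch, reusing the engine API from above:

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

batches = [["First batch."], ["Second batch."], ["Third batch."]]
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
])

async def embed_concurrently(engine: AsyncEmbeddingEngine):
    async with engine:
        # gather schedules all calls at once; dynamic batching merges them.
        results = await asyncio.gather(*(engine.embed(sentences=b) for b in batches))
    for embeddings, usage in results:
        print(len(embeddings), usage)

asyncio.run(embed_concurrently(array[0]))
```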
#### Reranking
Reranking gives you a score for the similarity between a query and multiple documents.
Use it in conjunction with a VectorDB + embeddings, or standalone for a small number of documents.
Please select a model from Hugging Face that is compatible with AutoModelForSequenceClassification and has a single output label.
```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = ["This is a document not related to the python package infinity_emb, hence...",
"Paris is in France!",
"infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!"]
array = AsyncEngineArray.from_args(
[EngineArgs(model_name_or_path = "mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)
async def rerank(engine: AsyncEmbeddingEngine):
async with engine:
ranking, usage = await engine.rerank(query=query, docs=docs)
print(list(zip(ranking, docs)))
# or handle the async start / stop yourself.
await engine.astart()
ranking, usage = await engine.rerank(query=query, docs=docs)
await engine.astop()
asyncio.run(rerank(array[0]))
```
When using the CLI, use this command to launch rerankers:
```bash
infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1
```
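To call the reranker over REST instead, a minimal sketch, assuming the server exposes a Cohere-style `/rerank` route (see the Cohere reference linked above):

```python
import requests

resp = requests.post(
    "http://localhost:7997/rerank",
    json={
        "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
        "query": "What is the python package infinity_emb?",
        "documents": ["infinity_emb is a python package.", "Paris is in France!"],
    },
)
resp.raise_for_status()
print(resp.json())  # relevance scores per document
```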
#### Image-Embeddings: CLIP models
CLIP models are able to encode images and text in the same embedding space.
```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
model_name_or_path = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])
async def embed(engine: AsyncEmbeddingEngine):
await engine.astart()
embeddings, usage = await engine.embed(sentences=sentences)
embeddings_image, _ = await engine.image_embed(images=images)
await engine.astop()
asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
```
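Images don't have to be URLs. A sketch of embedding a locally downloaded image as raw bytes, under the assumption that `image_embed` accepts raw bytes the same way `audio_embed` does in the example below:

```python
import asyncio
import requests
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

# Assumption: image_embed accepts raw bytes, mirroring the audio example below.
image_bytes = requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg"
).content

engine_args = EngineArgs(
    model_name_or_path="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M", engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed_local(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings_image, _ = await engine.image_embed(images=[image_bytes])
    await engine.astop()

asyncio.run(embed_local(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
```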
#### Audio-Embeddings: CLAP models
CLAP models are able to encode audio and text in the same embedding space.
```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
sentences = ["This is awesome.", "I am bored."]
url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content
audios = [raw_bytes]
engine_args = EngineArgs(
model_name_or_path = "laion/clap-htsat-unfused",
dtype="float32",
engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])
async def embed(engine: AsyncEmbeddingEngine):
await engine.astart()
embeddings, usage = await engine.embed(sentences=sentences)
embedding_audios = await engine.audio_embed(audios=audios)
await engine.astop()
asyncio.run(embed(array["laion/clap-htsat-unfused"]))
```
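As noted in the supported-tasks section, the sampling rate of the audio must match what the model expects. A sketch of inspecting the rate before embedding, using `soundfile` (resampling itself is left to a library such as `librosa`; LAION CLAP checkpoints commonly expect 48 kHz):

```python
import io

import requests
import soundfile as sf

url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content

# Inspect the sampling rate and shape before passing the bytes to audio_embed.
data, sample_rate = sf.read(io.BytesIO(raw_bytes))
print(sample_rate, data.shape)
```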
#### Text Classification
Use text classification with Infinity's `classify` feature, which allows for sentiment analysis, emotion detection, and more classification tasks.
```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
model_name_or_path = "SamLowe/roberta-base-go_emotions",
engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])
async def classifier(engine: AsyncEmbeddingEngine):
async with engine:
predictions, usage = await engine.classify(sentences=sentences)
# or handle the async start / stop yourself.
await engine.astart()
predictions, usage = await engine.classify(sentences=sentences)
await engine.astop()
asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
```
### Infinity usage via the Python Client
Infinity ships generated client code for REST API usage.
If you want to call a remote Infinity instance via the REST API, install the following package locally:
```bash
pip install infinity_client
```
For more information, check out the [Client README](https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client).
## Integrations
- [Serverless deployments at Runpod](https://github.com/runpod-workers/worker-infinity-embedding)
- [Truefoundry Cognita](https://github.com/truefoundry/cognita)
- [Langchain example](https://github.com/langchain-ai/langchain)
- [imitater - A unified language model server built upon vllm and infinity.](https://github.com/the-seeds/imitater)
- [Dwarves Foundation: Deployment examples using Modal.com](https://github.com/dwarvesf/llm-hosting)
- [infiniflow/Ragflow](https://github.com/infiniflow/ragflow)
- [SAP Core AI](https://github.com/SAP-samples/btp-generative-ai-hub-use-cases/tree/main/10-byom-oss-llm-ai-core)
- [gpt_server - gpt_server is an open-source framework designed for production-level deployment of LLMs (Large Language Models) or Embeddings.](https://github.com/shell-nlp/gpt_server)
- [KubeAI: Kubernetes AI Operator for inferencing](https://github.com/substratusai/kubeai)
- [LangChain](https://python.langchain.com/docs/integrations/text_embedding/infinity)
- [Batched, modification of the batching algorithm in Infinity](https://github.com/mixedbread-ai/batched)
## Documentation
View the docs at [https://michaelfeil.github.io/infinity](https://michaelfeil.github.io/infinity) on how to get started.
After startup, the Swagger UI will be available under `{url}:{port}/docs`, in this case `http://localhost:7997/docs`. You can also find an interactive preview here: https://infinity.modal.michaelfeil.eu/docs (and https://michaelfeil-infinity.hf.space/docs)
## Contribute and Develop
Install via Poetry 1.8.1 and Python 3.11 on Ubuntu 22.04:
```bash
cd libs/infinity_emb
poetry install --extras all --with lint,test
```
To pass the CI:
```bash
cd libs/infinity_emb
make precommit
```
All contributions must be made in a way to be compatible with the MIT License of this repo.
### Citation
```bibtex
@software{feil_2023_11630143,
author = {Feil, Michael},
title = {Infinity - To Embeddings and Beyond},
month = oct,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.11630143},
url = {https://doi.org/10.5281/zenodo.11630143}
}
```
### 💚 Current contributors <a name="Current contributors"></a>
<a href="https://github.com/michaelfeil/infinity=y/graphs/contributors">
<img src="https://contributors-img.web.app/image?repo=michaelfeil/infinity" />
</a>
<!-- MARKDOWN LINKS & IMAGES -->
<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->
[contributors-shield]: https://img.shields.io/github/contributors/michaelfeil/infinity.svg?style=for-the-badge
[contributors-url]: https://github.com/michaelfeil/infinity/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/michaelfeil/infinity.svg?style=for-the-badge
[forks-url]: https://github.com/michaelfeil/infinity/network/members
[stars-shield]: https://img.shields.io/github/stars/michaelfeil/infinity.svg?style=for-the-badge
[stars-url]: https://github.com/michaelfeil/infinity/stargazers
[issues-shield]: https://img.shields.io/github/issues/michaelfeil/infinity.svg?style=for-the-badge
[issues-url]: https://github.com/michaelfeil/infinity/issues
[license-shield]: https://img.shields.io/github/license/michaelfeil/infinity.svg?style=for-the-badge
[license-url]: https://github.com/michaelfeil/infinity/blob/main/LICENSE
[pepa-shield]: https://static.pepy.tech/badge/infinity-emb
[pepa-url]: https://www.pepy.tech/projects/infinity-emb
[codecov-shield]: https://codecov.io/gh/michaelfeil/infinity/branch/main/graph/badge.svg?token=NMVQY5QOFQ
[codecov-url]: https://codecov.io/gh/michaelfeil/infinity/branch/main
[ci-shield]: https://github.com/michaelfeil/infinity/actions/workflows/ci.yaml/badge.svg
[ci-url]: https://github.com/michaelfeil/infinity/actions