# OnPrem.LLM
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
> A toolkit for running large language models on-premises using
> non-public data
**[OnPrem.LLM](https://github.com/amaiya/onprem)** is a simple Python
package that makes it easier to apply large language models (LLMs) to
non-public data on your own machines (possibly behind corporate
firewalls). Inspired largely by the
[privateGPT](https://github.com/imartinez/privateGPT) GitHub repo,
**OnPrem.LLM** is intended to help integrate local LLMs into practical
applications.
The full documentation is [here](https://amaiya.github.io/onprem/).
A Google Colab demo of installing and using **OnPrem.LLM** is
[here](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing).
------------------------------------------------------------------------
*Latest News* 🔥
- \[2024/12\] v0.7.0 released and now includes support for [structured
outputs](https://amaiya.github.io/onprem/#structured-and-guided-outputs).
- \[2024/12\] v0.6.0 released and now includes support for PDF to
Markdown conversion (which includes Markdown representations of
tables), as shown
[here](https://amaiya.github.io/onprem/#extract-text-from-documents).
- \[2024/11\] v0.5.0 released and now includes support for running LLMs
with Hugging Face
[transformers](https://github.com/huggingface/transformers) as the
backend instead of
[llama.cpp](https://github.com/abetlen/llama-cpp-python). See [this
example](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp).
- \[2024/11\] v0.4.0 released and now includes a `default_model`
parameter to more easily use models like **Llama-3.1** and
**Zephyr-7B-beta**.
- \[2024/10\] v0.3.0 released and now includes support for
[concept-focused
summarization](https://amaiya.github.io/onprem/examples_summarization.html#concept-focused-summarization).
- \[2024/09\] v0.2.0 released and now includes PDF OCR support and
better PDF table handling.
- \[2024/06\] v0.1.0 of **OnPrem.LLM** has been released. Lots of new
updates!
  - [Ability to use with any OpenAI-compatible
    API](https://amaiya.github.io/onprem/#connecting-to-llms-served-through-rest-apis)
    (e.g., vLLM, Ollama, OpenLLM, etc.).
  - Pipeline for [information
    extraction](https://amaiya.github.io/onprem/examples_information_extraction.html)
    from raw documents.
  - Pipeline for [few-shot text
    classification](https://amaiya.github.io/onprem/examples_classification.html)
    (i.e., training a classifier on a tiny number of labeled examples)
    along with the ability to explain few-shot predictions.
  - Default model changed to
    [Mistral-7B-Instruct-v0.2](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF).
  - [API augmentations and bug
    fixes](https://github.com/amaiya/onprem/blob/master/CHANGELOG.md).
------------------------------------------------------------------------
## Install
Once you have [installed
PyTorch](https://pytorch.org/get-started/locally/), you can install
**OnPrem.LLM** with the following steps:
1. Install **llama-cpp-python**:
   - **CPU:** `pip install llama-cpp-python` ([extra
     steps](https://github.com/amaiya/onprem/blob/master/MSWindows.md)
     required for Microsoft Windows)
   - **GPU**: Follow [instructions
     below](https://amaiya.github.io/onprem/#on-gpu-accelerated-inference).
2. Install **OnPrem.LLM**: `pip install onprem`
### On GPU-Accelerated Inference
When installing **llama-cpp-python** with
`pip install llama-cpp-python`, the LLM will run on your **CPU**. To
generate answers much faster, you can run the LLM on your **GPU** by
building **llama-cpp-python** based on your operating system.
- **Linux**:
`CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`
- **Mac**: `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python`
- **Windows 11**: Follow the instructions
[here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-the-system-python-in-windows-11s).
- **Windows Subsystem for Linux (WSL2)**: Follow the instructions
[here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-wsl2-with-gpu-acceleration).
For Linux and Windows, you will need [an up-to-date NVIDIA
driver](https://www.nvidia.com/en-us/drivers/) along with the [CUDA
toolkit](https://developer.nvidia.com/cuda-downloads) installed before
running the installation commands above.
After following the instructions above, supply the `n_gpu_layers=-1`
parameter when instantiating an LLM to use your GPU for fast inference:
``` python
llm = LLM(n_gpu_layers=-1, ...)
```
Quantized models with 8B parameters and below can typically run on GPUs
with as little as 6GB of VRAM. If a model does not fit on your GPU
(e.g., you get a “CUDA Error: Out-of-Memory” error), you can offload a
subset of layers to the GPU by experimenting with different values for
the `n_gpu_layers` parameter (e.g., `n_gpu_layers=20`). Setting
`n_gpu_layers=-1`, as shown above, offloads all layers to the GPU.
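To pick a starting value for `n_gpu_layers`, a back-of-the-envelope estimate can help. The sketch below is only a rough heuristic and is not part of **OnPrem.LLM** or llama-cpp-python: the helper name `estimate_gpu_layers`, the assumption that all layers are equally sized, and the 1 GB reserve for the context cache are all illustrative assumptions. Treat the result as a first guess and adjust it by experiment.

```python
def estimate_gpu_layers(model_size_gb: float, total_layers: int,
                        free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Rough first guess for n_gpu_layers (hypothetical helper).

    Assumes every layer takes an equal share of the model file and keeps
    `reserve_gb` of VRAM free for the context cache and other overhead.
    Returns -1 (offload everything) when all layers appear to fit.
    """
    if model_size_gb <= 0 or total_layers <= 0:
        raise ValueError("model_size_gb and total_layers must be positive")
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    gb_per_layer = model_size_gb / total_layers
    layers_that_fit = int(usable_gb // gb_per_layer)
    return -1 if layers_that_fit >= total_layers else layers_that_fit


# Illustrative numbers only: an 8 GB model file with 32 layers on a 6 GB GPU.
print(estimate_gpu_layers(model_size_gb=8.0, total_layers=32, free_vram_gb=6.0))
```

The returned value can then be passed as `LLM(n_gpu_layers=...)` and lowered further if an out-of-memory error still occurs.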
See [the FAQ](https://amaiya.github.io/onprem/#faq) for extra tips, if
you experience issues with
[llama-cpp-python](https://pypi.org/project/llama-cpp-python/)
installation.
**Note:** Installing **llama-cpp-python** is optional if either of the
following is true:
- You use Hugging Face Transformers (instead of llama-cpp-python) as the
LLM backend by supplying the `model_id` parameter when instantiating
an LLM, as [shown
here](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp).
- You are using **OnPrem.LLM** with an LLM being served through an
[external REST API](#connecting-to-llms-served-through-rest-apis)
(e.g., vLLM, OpenLLM, Ollama).
## How to Use
### Setup
``` python
from onprem import LLM
llm = LLM()
```
By default, a 7B-parameter model (**Mistral-7B-Instruct-v0.2**) is
downloaded and used. If `default_model='llama'` is supplied, then a
**Llama-3.1-8B-Instruct** model is automatically downloaded and used
(which is useful if the default Mistral model struggles with a
particular task):
``` python
# Llama 3.1 is downloaded here and the correct prompt template for Llama-3.1 is automatically configured and used
llm = LLM(default_model='llama')
```
Similarly, supplying `default_model='zephyr'` will use
**Zephyr-7B-beta**. Of course, you can also easily supply the URL to an
LLM of your choosing to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) (see the
[code generation
example](https://amaiya.github.io/onprem/examples_code.html) or the
[FAQ](https://amaiya.github.io/onprem/#faq) for examples). Any extra
parameters supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) are forwarded
directly to `llama-cpp-python`.
**Note:** The default context window size (`n_ctx`) is set to 3900 and
the default output size (`max_tokens`) is set to 512. Both are configurable
parameters to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm). Increase them if
you have larger prompts or need longer outputs.
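As a rough way to decide whether these defaults need raising, you can estimate whether a prompt plus the requested output fits in the context window. This is a sketch under stated assumptions: the helper `fits_context` is hypothetical, and the "about four characters per token" ratio is a common rule of thumb for English text rather than an exact tokenizer count.

```python
def fits_context(prompt: str, n_ctx: int = 3900, max_tokens: int = 512,
                 chars_per_token: float = 4.0) -> bool:
    """Heuristic check (hypothetical helper): does prompt + output fit in n_ctx?

    The token count is approximated from the character count, so the answer
    is only an estimate; a real tokenizer may count more or fewer tokens.
    """
    estimated_prompt_tokens = int(len(prompt) / chars_per_token) + 1
    return estimated_prompt_tokens + max_tokens <= n_ctx


long_prompt = "word " * 4000  # roughly 20,000 characters
if not fits_context(long_prompt):
    print("Consider a larger n_ctx, e.g. LLM(n_ctx=8192), or a shorter prompt.")
```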
### Send Prompts to the LLM to Solve Problems
This is an example of few-shot prompting, where we provide an example of
what we want the LLM to do.
``` python
prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""
saved_output = llm.prompt(prompt)
```
Cillian Murphy, Florence Pugh.
Additional prompt examples are [shown
here](https://amaiya.github.io/onprem/examples.html).
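When you have several worked examples, it can be convenient to assemble the few-shot prompt programmatically. The following is a minimal sketch, not an **OnPrem.LLM** API: `build_few_shot_prompt` is a hypothetical helper that reproduces the `Sentence:` / `People:` layout used in the prompt above.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble a few-shot prompt from (sentence, expected_answer) pairs."""
    parts = [instruction]
    for sentence, people in examples:
        # Each worked example shows the model the expected output format.
        parts.append(f"Sentence: {sentence}\nPeople:\n{people}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Sentence:\n{query}\nPeople:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract the names of people in the supplied sentences. Here is an example:",
    [("James Gandolfini and Paul Newman were great actors.",
      "James Gandolfini, Paul Newman")],
    "I like Cillian Murphy's acting. Florence Pugh is great, too.",
)
print(prompt)
```

The resulting string can be passed to `llm.prompt(prompt)` exactly like the hand-written prompt.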
### Talk to Your Documents
Answers are generated from the content of your documents (i.e.,
[retrieval augmented generation](https://arxiv.org/abs/2005.11401) or
RAG). Here, we will use [GPU
offloading](https://amaiya.github.io/onprem/#speeding-up-inference-using-a-gpu)
to speed up answer generation using the default model. However, the
Zephyr-7B model may perform even better and respond faster, and it is used in
our [example
notebook](https://amaiya.github.io/onprem/examples_rag.html).
``` python
from onprem import LLM
llm = LLM(n_gpu_layers=-1)
```
#### Step 1: Ingest the Documents into a Vector Database
``` python
llm.ingest("./sample_data")
```
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from ./sample_data
Loaded 12 new documents from ./sample_data
Split into 153 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
Loading new documents: 100%|██████████████████████| 3/3 [00:00<00:00, 13.71it/s]
100%|██████████████████████| 1/1 [00:02<00:00, 2.49s/it]
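The log above reports that documents were split into chunks of at most 500 characters before embedding. To illustrate the idea, here is a minimal sketch of greedy, whitespace-based chunking. It is an assumption-laden stand-in, not the splitter that `LLM.ingest` actually uses, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole words into chunks of at most `max_chars` characters.

    A single word longer than `max_chars` is kept intact as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "ktrain is a low-code machine learning library. " * 40
pieces = chunk_text(sample, max_chars=500)
print(len(pieces), max(len(p) for p in pieces))
```

Smaller chunks make retrieval more precise, while larger chunks keep more surrounding context together.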
#### Step 2: Answer Questions About the Documents
``` python
question = """What is ktrain?"""
result = llm.ask(question)
```
Ktrain is a low-code machine learning library designed to facilitate the full machine learning workflow from curating and preprocessing inputs to training, tuning, troubleshooting, and applying models. Ktrain is well-suited for domain experts who may have less experience with machine learning and software coding.
The sources used by the model to generate the answer are stored in
`result['source_documents']`:
``` python
print("\nSources:\n")
for i, document in enumerate(result["source_documents"]):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)
```
Sources:
1.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2
2.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing
defaults), but also allows users to make choices that best fit their unique application
requirements. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai
3.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
with custom models and data formats, as well.
Inspired by other low-code (and no-
code) open-source ML libraries such as fastai (Howard and Gugger, 2020) and ludwig
(Molino et al., 2019), ktrain is intended to help further democratize machine learning by
enabling beginners and domain experts with minimal programming or data science experi-
4. http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
6
4.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
ktrain: A Low-Code Library for Augmented Machine Learning
toML platform and more of what might be called a “low-code” ML platform. Through
automation or semi-automation, ktrain facilitates the full machine learning workflow from
curating and preprocessing inputs (i.e., ground-truth-labeled training data) to training,
tuning, troubleshooting, and applying models. In this way, ktrain is well-suited for domain
experts who may have less experience with machine learning and software coding. Where
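The two steps above (ingest, then ask) follow the usual RAG pattern: retrieve the chunks most relevant to the question, then place them in the prompt as context. The toy sketch below illustrates that flow only. It ranks chunks by simple word overlap, whereas **OnPrem.LLM** uses embeddings stored in a vector database, and all function names here are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)),
                    reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Put the retrieved chunks in front of the question as context."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return (f"Use the context to answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


chunks = [
    "ktrain is a low-code machine learning library.",
    "The weather was sunny today.",
    "ktrain automates parts of the ML workflow.",
]
print(build_rag_prompt("What is ktrain?", chunks))
```

The retrieved chunks play the same role as the entries in `result['source_documents']`: they show which text the answer was grounded in.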
### Extract Text from Documents
The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function can extract text from a range of different document formats
(e.g., PDFs, Microsoft PowerPoint, Microsoft Word, etc.). It is
automatically invoked when calling
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest).
Extracted text is represented as LangChain `Document` objects, where
`Document.page_content` stores the extracted text and
`Document.metadata` stores any extracted document metadata.
For PDFs, in particular, a number of different options are available
depending on your use case.
**Fast PDF Extraction (default)**
- **Pro:** Fast
- **Con:** Does not infer/retain structure of tables in PDF documents
``` python
from onprem.ingest import load_single_document
docs = load_single_document('sample_data/1/ktrain_paper.pdf')
docs[0].metadata
```
{'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
'file_path': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
'page': 0,
'total_pages': 9,
'format': 'PDF 1.4',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'creator': 'LaTeX with hyperref',
'producer': 'dvips + GPL Ghostscript GIT PRERELEASE 9.22',
'creationDate': "D:20220406214054-04'00'",
'modDate': "D:20220406214054-04'00'",
'trapped': ''}
**Automatic OCR of PDFs**
- **Pro:** Automatically extracts text from scanned PDFs
- **Con:** Slow
The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function will automatically OCR PDFs that require it (i.e., PDFs that
are scanned hard-copies of documents). If a document is OCR’ed during
extraction, the `metadata['ocr']` field will be populated with `True`.
``` python
docs = load_single_document('sample_data/4/lynn1975.pdf')
docs[0].metadata
```
{'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/4/lynn1975.pdf',
'ocr': True}
**Markdown Conversion in PDFs**
- **Pro**: Better chunking for QA
- **Con**: Slower than default PDF extraction
The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function can convert PDFs to Markdown instead of plain text when you supply
`pdf_markdown=True` as an argument:
``` python
docs = load_single_document('your_pdf_document.pdf',
pdf_markdown=True)
```
Converting to Markdown can facilitate downstream tasks like
question-answering. For instance, when supplying `pdf_markdown=True` to
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest),
documents are chunked in a Markdown-aware fashion (e.g., the abstract of
a research paper tends to be kept together into a single chunk instead
of being split up). Note that Markdown will not be extracted if the
document requires OCR.
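To see why Markdown helps chunking, consider splitting on headings so that each section stays together. This is a simplified sketch of Markdown-aware splitting under the assumption that sections begin at `#`-style headings. It is not the chunking logic used by `LLM.ingest`, and `split_markdown_sections` is a hypothetical helper.

```python
import re


def split_markdown_sections(markdown: str) -> list[str]:
    """Split Markdown text into sections, starting a new one at each heading."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A heading (one to six '#' followed by whitespace) closes the previous section.
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


paper = "# Title\n\n## Abstract\nLine one.\nLine two.\n\n## Introduction\nText."
for section in split_markdown_sections(paper):
    print(repr(section))
```

Here the abstract remains one chunk, whereas a fixed-size character split could cut it in the middle of a sentence.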
**Inferring Table Structure in PDFs**
- **Pro**: Makes it easier for LLMs to analyze information in tables
- **Con**: Slower than default PDF extraction
When supplying `infer_table_structure=True` to either
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
or
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest),
tables are inferred and extracted from PDFs using a TableTransformer
model. Tables are represented as **Markdown** (or **HTML** if Markdown
conversion is not possible).
``` python
docs = load_single_document('your_pdf_document.pdf',
infer_table_structure=True)
```
**Parsing Extracted Text Into Sentences or Paragraphs**
For some analyses (e.g., using prompts for information extraction), it
may be useful to parse the text extracted from documents into individual
sentences or paragraphs. This can be accomplished using the
[`segment`](https://amaiya.github.io/onprem/utils.html#segment)
function:
``` python
from onprem.ingest import load_single_document
from onprem.utils import segment
text = load_single_document('sample_data/3/state_of_the_union.txt')[0].page_content
```
``` python
segment(text, unit='paragraph')[0]
```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'
``` python
segment(text, unit='sentence')[0]
```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.'
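For intuition about what segmentation does, the sketch below splits text with two simple regular expressions. It is a naive stand-in and not the implementation of `segment`: it treats blank lines as paragraph breaks and `.`, `!`, or `?` followed by whitespace as sentence ends, so abbreviations such as "e.g." would be split incorrectly. Both function names are hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


text = ("Madam Speaker, Madam Vice President. Members of Congress.\n\n"
        "My fellow Americans.")
first_paragraph = split_paragraphs(text)[0]
print(first_paragraph)
print(split_sentences(first_paragraph))
```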
### Summarization Pipeline
Summarize your raw documents (e.g., PDFs, MS Word) with an LLM.
#### Map-Reduce Summarization
Summarize each chunk in a document and then generate a single summary
from the individual summaries.
``` python
from onprem import LLM
llm = LLM(n_gpu_layers=-1, verbose=False, mute_stream=True) # disabling viewing of intermediate summarization prompts/inferences
```
``` python
from onprem.pipelines import Summarizer
summ = Summarizer(llm)
resp = summ.summarize('sample_data/1/ktrain_paper.pdf', max_chunks_to_use=5) # omit max_chunks_to_use parameter to consider entire document
print(resp['output_text'])
```
Ktrain is an open-source machine learning library that offers a unified interface for various machine learning tasks. The library supports both supervised and non-supervised machine learning, and includes methods for training models, evaluating models, making predictions on new data, and providing explanations for model decisions. Additionally, the library integrates with various explainable AI libraries such as shap, eli5 with lime, and others to provide more interpretable models.
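The map-reduce pattern itself is small enough to sketch. In the example below, `map_reduce_summarize` is a hypothetical function, and `keep_first_sentences` is a deliberately trivial stand-in for an LLM call so that the code runs without a model. In practice the `summarize` argument could wrap something like `llm.prompt` with a summarization instruction.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str],
                         summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the combined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


def keep_first_sentences(text: str) -> str:
    """Toy 'summarizer': keep only the first sentence of every line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return " ".join(line.split(".")[0].strip() + "." for line in lines)


chunks = ["A is first. More about A.", "B is second. More about B."]
print(map_reduce_summarize(chunks, keep_first_sentences))
```

Because each chunk is summarized independently, the map step works on documents far larger than the context window; only the short partial summaries need to fit into the final call.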
#### Concept-Focused Summarization
Summarize a large document with respect to a particular concept of
interest.
``` python
from onprem import LLM
from onprem.pipelines import Summarizer
```
``` python
llm = LLM(default_model='zephyr', n_gpu_layers=-1, verbose=False, temperature=0)
summ = Summarizer(llm)
summary, sources = summ.summarize_by_concept('sample_data/1/ktrain_paper.pdf', concept_description="question answering")
```
The context provided describes the implementation of an open-domain question-answering system using ktrain, a low-code library for augmented machine learning. The system follows three main steps: indexing documents into a search engine, locating documents containing words in the question, and extracting candidate answers from those documents using a BERT model pretrained on the SQuAD dataset. Confidence scores are used to sort and prune candidate answers before returning results. The entire workflow can be implemented with only three lines of code using ktrain's SimpleQA module. This system allows for the submission of natural language questions and receives exact answers, as demonstrated in the provided example. Overall, the context highlights the ease and accessibility of building sophisticated machine learning models, including open-domain question-answering systems, through ktrain's low-code interface.
### Information Extraction Pipeline
Extract information from raw documents (e.g., PDFs, MS Word documents)
with an LLM.
``` python
from onprem import LLM
from onprem.pipelines import Extractor
# Notice that we're using a cloud-based, off-premises model here! See "OpenAI" section below.
llm = LLM(model_url='openai://gpt-3.5-turbo', verbose=False, mute_stream=True, temperature=0)
extractor = Extractor(llm)
prompt = """Extract the names of research institutions (e.g., universities, research labs, corporations, etc.)
from the following sentence delimited by three backticks. If there are no organizations, return NA.
If there are multiple organizations, separate them with commas.
```{text}```
"""
df = extractor.apply(prompt, fpath='sample_data/1/ktrain_paper.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
```
/home/amaiya/projects/ghub/onprem/onprem/core.py:159: UserWarning: The model you supplied is gpt-3.5-turbo, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\
'Institute for Defense Analyses'
### Few-Shot Classification
Make accurate text classification predictions using only a tiny number
of labeled examples.
``` python
# create classifier
from onprem.pipelines import FewShotClassifier
clf = FewShotClassifier(use_smaller=True)
# Fetching data
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
classes = ["soc.religion.christian", "sci.space"]
newsgroups = fetch_20newsgroups(subset="all", categories=classes)
corpus, group_labels = np.array(newsgroups.data), np.array(newsgroups.target_names)[newsgroups.target]
# Wrangling data into a dataframe and selecting training examples
data = pd.DataFrame({"text": corpus, "label": group_labels})
train_df = data.groupby("label").sample(5)
test_df = data.drop(index=train_df.index)
# X_sample only contains 5 examples of each class!
X_sample, y_sample = train_df['text'].values, train_df['label'].values
# test set
X_test, y_test = test_df['text'].values, test_df['label'].values
# train
clf.train(X_sample, y_sample, max_steps=20)
# evaluate
print(clf.evaluate(X_test, y_test)['accuracy'])
#output: 0.98
# make predictions
clf.predict(['Elon Musk likes launching satellites.']).tolist()[0]
#output: sci.space
```
### Using Hugging Face Transformers Instead of Llama.cpp
By default, the LLM backend employed by **OnPrem.LLM** is
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), which
requires models in [GGUF format](https://huggingface.co/docs/hub/gguf).
As of v0.5.0, it is now possible to use [Hugging Face
transformers](https://github.com/huggingface/transformers) as the LLM
backend instead. This is accomplished by using the `model_id` parameter
(instead of supplying a `model_url` argument). In the example below, we
run the
[Llama-3.1-8B](https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
model.
``` python
# llama-cpp-python does NOT need to be installed when using model_id parameter
llm = LLM(model_id="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", device_map='cuda')
```
This allows you to more easily use any model on the Hugging Face hub in
[SafeTensors format](https://huggingface.co/docs/safetensors/index)
provided it can be loaded with the Hugging Face `transformers.pipeline`.
Note that, when using the `model_id` parameter, the `prompt_template` is
set automatically by `transformers`.
The Llama-3.1 model loaded above was quantized using
[AWQ](https://huggingface.co/docs/transformers/main/en/quantization/awq),
which allows the model to fit onto smaller GPUs (e.g., laptop GPUs with
6GB of VRAM) similar to the default GGUF format. AWQ models require
the [autoawq](https://pypi.org/project/autoawq/) package to be
installed: `pip install autoawq` (AWQ supports only Linux systems,
including Windows Subsystem for Linux). If you need to load a model
that is not quantized, you can supply a quantization configuration at
load time (known as “inflight quantization”). In the following example,
we load an unquantized [Zephyr-7B-beta
model](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) that will be
quantized during loading to fit on GPUs with as little as 6GB of VRAM:
``` python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True,
)
llm = LLM(model_id="HuggingFaceH4/zephyr-7b-beta", device_map='cuda',
model_kwargs={"quantization_config":quantization_config})
```
When supplying a `quantization_config`, the
[bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/installation)
library is used. It is a lightweight Python wrapper around CUDA custom
functions, in particular 8-bit optimizers, matrix multiplication
(LLM.int8()), and 8-bit and 4-bit quantization functions. There are
ongoing efforts by the bitsandbytes team to support multiple backends in
addition to CUDA. If you receive errors related to bitsandbytes, please
refer to the [bitsandbytes
documentation](https://huggingface.co/docs/bitsandbytes/main/en/installation).
### Connecting to LLMs Served Through REST APIs
**OnPrem.LLM** can be used with LLMs being served through any
OpenAI-compatible REST API. This means you can easily use **OnPrem.LLM**
with tools like [vLLM](https://github.com/vllm-project/vllm),
[OpenLLM](https://github.com/bentoml/OpenLLM),
[Ollama](https://ollama.com/blog/openai-compatibility), and the
[llama.cpp
server](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md).
For instance, using [vLLM](https://github.com/vllm-project/vllm), you
can serve a LLaMA 3 model as follows:
``` sh
python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```
You can then connect OnPrem.LLM to the LLM by supplying the URL of the
server you just started:
``` python
from onprem import LLM
llm = LLM(model_url='http://localhost:8000/v1', api_key='token-abc123')
# Note: The API key can either be supplied directly or stored in the OPENAI_API_KEY environment variable.
# If the server does not require an API key, `api_key` should still be supplied with a dummy value like 'na'.
```
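For context, "OpenAI-compatible" means the server accepts requests in the OpenAI chat-completions format. The sketch below builds such a request with the standard library only, and does not send it, to show what travels over the wire. It illustrates the protocol rather than how **OnPrem.LLM** issues requests internally, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "token-abc123",
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "Hello!",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request with `urllib.request.urlopen(request)` would require the vLLM server from the previous step to be running.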
That’s it! Solve problems with **OnPrem.LLM** as you normally would
(e.g., RAG question-answering, summarization, few-shot prompting, code
generation, etc.).
### Using OpenAI Models with OnPrem.LLM
Even when using on-premises language models, it can sometimes be useful
to have easy access to non-local, cloud-based models (e.g., OpenAI) for
testing, producing baselines for comparison, and generating synthetic
examples for fine-tuning. For these reasons, in spite of the name,
**OnPrem.LLM** now includes support for OpenAI chat models:
``` python
from onprem import LLM
llm = LLM(model_url='openai://gpt-4o', temperature=0)
```
/home/amaiya/projects/ghub/onprem/onprem/core.py:196: UserWarning: The model you supplied is gpt-4o, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\
This OpenAI [`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm)
instance can now be used as the engine for most features in
OnPrem.LLM (e.g., RAG, information extraction, summarization, etc.).
Here we simply use it for general prompting:
``` python
saved_result = llm.prompt('List three cute names for a cat and explain why each is cute.')
```
Certainly! Here are three cute names for a cat, along with explanations for why each is adorable:
1. **Whiskers**: This name is cute because it highlights one of the most distinctive and charming features of a cat: their whiskers. It's playful and endearing, evoking the image of a curious cat twitching its whiskers as it explores its surroundings.
2. **Mittens**: This name is cute because it conjures up the image of a cat with little white paws that look like they are wearing mittens. It's a cozy and affectionate name that suggests warmth and cuddliness, much like a pair of soft mittens.
3. **Pumpkin**: This name is cute because it brings to mind the warm, orange hues of a pumpkin, which can be reminiscent of certain cat fur colors. It's also associated with the fall season, which is often linked to comfort and coziness. Plus, the name "Pumpkin" has a sweet and affectionate ring to it, making it perfect for a beloved pet.
**Using Vision Capabilities in GPT-4o**
``` python
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
saved_result = llm.prompt('Describe the weather in this image.', image_path_or_url=image_url)
```
The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. The sunlight is bright, illuminating the green grass and landscape.
**Using OpenAI-Style Message Dictionaries**
``` python
messages = [
{'content': [{'text': 'describe the weather in this image',
'type': 'text'},
{'image_url': {'url': image_url},
'type': 'image_url'}],
'role': 'user'}]
saved_result = llm.prompt(messages)
```
The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. The sunlight is bright, casting clear shadows and illuminating the green landscape.
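If you build such message lists often, a small helper keeps the nesting manageable. The function below is a hypothetical convenience, not part of **OnPrem.LLM**. It simply returns the same structure as the hand-written `messages` list above.

```python
def image_message(text: str, image_url: str, role: str = "user") -> list[dict]:
    """Build an OpenAI-style message list with one text part and one image part."""
    return [{
        "role": role,
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]


# The URL below is a placeholder for illustration.
messages = image_message("describe the weather in this image",
                         "https://example.com/boardwalk.jpg")
print(messages)
```

The result can be passed directly to `llm.prompt(messages)`.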
**Azure OpenAI**
For Azure OpenAI models, use the following URL format:
``` python
llm = LLM(model_url='azure://<deployment_name>', ...)
# <deployment_name> is the Azure deployment name and additional Azure-specific parameters
# can be supplied as extra arguments to LLM (or set as environment variables)
```
### Structured and Guided Outputs
The
[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)
method allows you to specify the desired structure of the LLM’s output
as a Pydantic model.
``` python
from pydantic import BaseModel, Field
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")
from onprem import LLM
llm = LLM(default_model='llama', verbose=False)
structured_output = llm.pydantic_prompt('Tell me a joke.', pydantic_model=Joke)
```
llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
{
"setup": "Why couldn't the bicycle stand alone?",
"punchline": "Because it was two-tired!"
}
The output is a Pydantic object instead of a string:
``` python
structured_output
```
Joke(setup="Why couldn't the bicycle stand alone?", punchline='Because it was two-tired!')
``` python
print(structured_output.setup)
print()
print(structured_output.punchline)
```
Why couldn't the bicycle stand alone?
Because it was two-tired!
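Conceptually, structured output means the model's JSON text is parsed and checked against a schema before you receive it. The sketch below mimics that step with a standard-library dataclass so it runs without extra dependencies. It is a simplified stand-in for Pydantic validation, not the code behind `pydantic_prompt`, and `parse_joke` is a hypothetical helper.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Joke:
    setup: str
    punchline: str


def parse_joke(raw_json: str) -> Joke:
    """Parse JSON text and fail loudly if a required field is missing."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(Joke)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Joke(**{name: str(data[name]) for name in expected})


raw = ('{"setup": "Why couldn\'t the bicycle stand alone?", '
       '"punchline": "Because it was two-tired!"}')
joke = parse_joke(raw)
print(joke.setup)
print(joke.punchline)
```

A typed object like this lets downstream code access `joke.punchline` directly instead of re-parsing free-form text.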
You can also use **OnPrem.LLM** with the
[Guidance](https://github.com/guidance-ai/guidance) package to guide the
LLM to generate outputs based on your conditions and constraints. We’ll
show a couple of examples here, but see [our documentation on guided
prompts](https://amaiya.github.io/onprem/examples_guided_prompts.html)
for more information.
``` python
from onprem import LLM
llm = LLM(n_gpu_layers=-1, verbose=False)
from onprem.pipelines.guider import Guider
guider = Guider(llm)
```
With the Guider, you can use regular expressions to control LLM
generation:
``` python
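from guidance import gen  # `gen` marks the span that the LLM fills in
prompt = """Question: Luke has ten balls. He gives three to his brother. How many balls does he have left?
Answer: """ + gen(name='answer', regex=r'\d+')  # raw string avoids an invalid-escape warning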
guider.prompt(prompt, echo=False)
```
{'answer': '7'}
``` python
prompt = '19, 18,' + gen(name='output', max_tokens=50, stop_regex=r'[^\d]7[^\d]')
guider.prompt(prompt)
```
19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,
{'output': ' 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,'}
See [the
documentation](https://amaiya.github.io/onprem/examples_guided_prompts.html)
for more examples of how to use
[Guidance](https://github.com/guidance-ai/guidance) with **OnPrem.LLM**.
## Built-In Web App
**OnPrem.LLM** includes a built-in Web app to access the LLM. To start
it, run the following command after installation:
``` shell
onprem --port 8000
```
Then, enter `localhost:8000` (or `<domain_name>:8000` if running on a
remote server) in a Web browser to access the application:
<img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/onprem_screenshot.png" border="1" alt="screenshot" width="775"/>
For more information, [see the corresponding
documentation](https://amaiya.github.io/onprem/webapp.html).
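
If the page does not load, it can help to confirm that something is actually listening on the port before debugging the browser or firewall. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper (not part of **OnPrem.LLM**), and the host and port are the ones used in the command above.

```python
import socket


def is_port_open(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


print(is_port_open("localhost", 8000))
```

A `False` result only says that no TCP listener was reachable; it does not say why the Web app failed to start.
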
## FAQ
1. **How do I use other models with OnPrem.LLM?**
> You can supply the URL to other models to the `LLM` constructor,
> as we did above in the code generation example.
> As of v0.0.20, we support models in GGUF format, which supersedes
> the older GGML format. You can find llama.cpp-supported models
> with `GGUF` in the file name on
> [huggingface.co](https://huggingface.co/models?sort=trending&search=gguf).
> Make sure you are pointing to the URL of the actual GGUF model
> file, which is the "download" link on the model's page. An example
> for **Mistral-7B** is shown below:
> <img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/model_download_link.png" border="1" alt="screenshot" width="775"/>
> Note that some models have specific prompt formats. For instance,
> the prompt template required for **Zephyr-7B**, as described on
> the [model's
> page](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF), is:
>
> `<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>`
>
> So, to use the **Zephyr-7B** model, you must supply the
> `prompt_template` argument to the `LLM` constructor (or specify it
> in the `webapp.yml` configuration for the Web app).
>
> ``` python
> # how to use Zephyr-7B with OnPrem.LLM
> from onprem import LLM
> llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
>           prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
>           n_gpu_layers=33)
> llm.prompt("List three cute names for a cat.")
> ```
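
To see what the template does to a prompt, here is a minimal sketch of filling the `{prompt}` placeholder with ordinary Python string formatting. It assumes plain `str.format`-style substitution; how **OnPrem.LLM** applies the template internally may differ, and `build_prompt` is a hypothetical helper, not part of the package.

```python
# Zephyr-7B prompt template from the model page (empty system message).
ZEPHYR_TEMPLATE = "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>"


def build_prompt(template: str, user_prompt: str) -> str:
    """Hypothetical helper: substitute the user's text for {prompt}."""
    return template.format(prompt=user_prompt)


full_prompt = build_prompt(ZEPHYR_TEMPLATE, "List three cute names for a cat.")
print(full_prompt)
```

Printing the result makes it easy to compare the final text against the format documented on the model's page before loading the model.
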
2. **When installing `onprem`, I'm getting "build" errors related to
`llama-cpp-python` (or `chroma-hnswlib`) on Windows/Mac/Linux?**
> See [this LangChain documentation on
> Llama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp)
> for help on installing the `llama-cpp-python` package for your
> system. Additional tips for different operating systems are shown
> below:
> For **Linux** systems like Ubuntu, try this:
> `sudo apt-get install build-essential g++ clang`. Other tips are
> [here](https://github.com/oobabooga/text-generation-webui/issues/1534).
> For **Windows** systems, please try following [these
> instructions](https://github.com/amaiya/onprem/blob/master/MSWindows.md).
> We recommend you use [Windows Subsystem for Linux
> (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install)
> instead of using Microsoft Windows directly. If you do need to use
> Microsoft Windows directly, be sure to install the [Microsoft C++
> Build
> Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
> and make sure the **Desktop development with C++** workload is selected.
> For **Macs**, try following [these
> tips](https://github.com/imartinez/privateGPT/issues/445#issuecomment-1563333950).
> There are also various other tips for each of the above OSes in
> [this privateGPT repo
> thread](https://github.com/imartinez/privateGPT/issues/445). Of
> course, you can also [easily
> use](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)
> **OnPrem.LLM** on Google Colab.
> Finally, if you still can't overcome issues with building
> `llama-cpp-python`, you can try [installing the pre-built wheel
> file](https://abetlen.github.io/llama-cpp-python/whl/cpu/llama-cpp-python/)
> for your system:
> **Example:**
> `pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`
>
> **Tip:** There are [pre-built wheel files for
> `chroma-hnswlib`](https://pypi.org/project/chroma-hnswlib/#files),
> as well. If running `pip install onprem` fails on building
> `chroma-hnswlib`, it may be because a pre-built wheel doesn't yet
> exist for the version of Python you're using (in which case you
> can try downgrading Python).
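
Before rebuilding anything, it can be useful to check which of the compiled dependencies already import and which Python version is in use, since missing wheels are tied to the Python version. This is a minimal sketch using only the standard library. The import names `llama_cpp` and `hnswlib` are assumed to correspond to the `llama-cpp-python` and `chroma-hnswlib` packages, and `check_build_dependencies` is a hypothetical helper.

```python
import importlib.util
import platform


def check_build_dependencies(modules=("llama_cpp", "hnswlib")):
    """Report which compiled dependencies are already importable."""
    report = {"python_version": platform.python_version()}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_build_dependencies())
```

A `False` entry identifies the package whose build or wheel installation still needs attention.
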
3. **I'm behind a corporate firewall and am receiving an SSL error when
trying to download the model?**
> Try this:
>
> ``` python
> from onprem import LLM
> LLM.download_model(url, ssl_verify=False)
> ```
> You can download the embedding model (used by `LLM.ingest` and
> `LLM.ask`) as follows:
>
> ``` sh
> wget --no-check-certificate https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/all-MiniLM-L6-v2.zip
> ```
> Supply the unzipped folder name as the `embedding_model_name`
> argument to `LLM`.
> If you're getting SSL errors even when running `pip install`, try
> this:
>
> ``` sh
> pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs
> ```
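
For the embedding model archive downloaded with `wget` above, the unzip step can be sketched in Python as follows. `unzip_embedding_model` is a hypothetical helper and the file and folder names are illustrative. The sketch also assumes the archive holds the model files at its top level; if it contains a single top-level folder instead, pass that inner folder to `LLM`.

```python
import zipfile
from pathlib import Path


def unzip_embedding_model(zip_path, dest_dir):
    """Extract the downloaded archive and return the destination folder."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


# Usage (requires onprem and the downloaded zip file):
# from onprem import LLM
# folder = unzip_embedding_model("all-MiniLM-L6-v2.zip", "all-MiniLM-L6-v2")
# llm = LLM(embedding_model_name=str(folder))
```
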
4. **How do I use this on a machine with no internet access?**
> Use the `LLM.download_model` method to download the model files to
> `<your_home_directory>/onprem_data` and transfer them to the same
> location on the air-gapped machine.
> For the `ingest` and `ask` methods, you will need to also download
> and transfer the embedding model files:
>
> ``` python
> from sentence_transformers import SentenceTransformer
> model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
> model.save('/some/folder')
> ```
> Copy the `/some/folder` folder to the air-gapped machine and supply
> the path to `LLM` via the `embedding_model_name` parameter.
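
Large model files are easy to truncate during a manual transfer, so it may be worth comparing checksums on both machines. The sketch below is one way to do that with the standard library; `sha256_of` and `build_manifest` are hypothetical helpers, and the `onprem_data` path in the usage comment is the location mentioned above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte models do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 checksum."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Run on both machines and compare the two dictionaries, e.g.:
# manifest = build_manifest(Path.home() / "onprem_data")
```

If the two manifests differ for a file, copy that file again before trying to load the model.
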
5. **My model is not loading when I call `llm = LLM(...)`?**
> This can happen if the model file is corrupt (in which case you
> should delete it from `<home directory>/onprem_data` and
> re-download). It can also happen if the version of
> `llama-cpp-python` needs to be upgraded to the latest.
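
A quick first check for a bad download is the file header: GGUF files begin with the four ASCII bytes `GGUF`. The sketch below tests only that header, so it can catch a wrong-format file (for example, an HTML error page or an older GGML file saved in place of the model) but not corruption later in the file. `looks_like_gguf` is a hypothetical helper, not part of **OnPrem.LLM**.

```python
def looks_like_gguf(path):
    """Return True if the file begins with the 4-byte GGUF magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == b"GGUF"


# Usage (path is illustrative):
# print(looks_like_gguf("/path/to/onprem_data/model.gguf"))
```
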
6. **I'm getting an `Illegal instruction (core dumped)` error when
instantiating a `langchain.llms.LlamaCpp` or `onprem.LLM` object?**
> Your CPU may not support instructions that `cmake` is using for
> one reason or another (e.g., [due to Hyper-V in VirtualBox
> settings](https://stackoverflow.com/questions/65780506/how-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit)).
> You can try turning them off when building and installing
> `llama-cpp-python`:
> ``` sh
> # example
> CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir
> ```
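
To decide which of the flags in the example above actually need to be turned off, you can look at what the CPU reports. This is a minimal sketch for x86 Linux (including WSL and virtual machines) that reads the `flags` lines of `/proc/cpuinfo`; other platforms expose this information differently, and the helper names are hypothetical.

```python
from pathlib import Path

# CPU features that the CMAKE_ARGS in the example above switch off.
FEATURES = ("avx", "avx2", "f16c", "fma")


def parse_cpu_flags(cpuinfo_text):
    """Collect the flag names from the 'flags' lines of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supported_features(cpuinfo_text, features=FEATURES):
    """Map each feature name to whether the CPU reports it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in features}


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only
    print(supported_features(cpuinfo.read_text()))
```

Features reported as `False` are the ones worth disabling in `CMAKE_ARGS`; the rest can usually stay enabled.
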
7. **How can I speed up
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest)
using my GPU?**
> Try using the `embedding_model_kwargs` argument:
>
> ``` python
> from onprem import LLM
> llm = LLM(embedding_model_kwargs={'device':'cuda'})
> ```
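
If the same script has to run on machines with and without a GPU, the device can be chosen at runtime. The sketch below assumes PyTorch is how GPU availability is detected and falls back to the CPU when PyTorch is missing or reports no CUDA device; `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


embedding_model_kwargs = {"device": pick_device()}
print(embedding_model_kwargs)

# Usage (requires onprem):
# from onprem import LLM
# llm = LLM(embedding_model_kwargs=embedding_model_kwargs)
```
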
'+\\\n\nThis OpenAI [`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm)\ninstance can now be used as the engine for most features in\nOnPrem.LLM (e.g., RAG, information extraction, summarization, etc.).\nHere we simply use it for general prompting:\n\n``` python\nsaved_result = llm.prompt('List three cute names for a cat and explain why each is cute.')\n```\n\n Certainly! Here are three cute names for a cat, along with explanations for why each is adorable:\n\n 1. **Whiskers**: This name is cute because it highlights one of the most distinctive and charming features of a cat\u2014their whiskers. It's playful and endearing, evoking the image of a curious cat twitching its whiskers as it explores its surroundings.\n\n 2. **Mittens**: This name is cute because it conjures up the image of a cat with little white paws that look like they are wearing mittens. It's a cozy and affectionate name that suggests warmth and cuddliness, much like a pair of soft mittens.\n\n 3. **Pumpkin**: This name is cute because it brings to mind the warm, orange hues of a pumpkin, which can be reminiscent of certain cat fur colors. It's also associated with the fall season, which is often linked to comfort and coziness. Plus, the name \"Pumpkin\" has a sweet and affectionate ring to it, making it perfect for a beloved pet.\n\n**Using Vision Capabilities in GPT-4o**\n\n``` python\nimage_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\nsaved_result = llm.prompt('Describe the weather in this image.', image_path_or_url=image_url)\n```\n\n The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. 
The sunlight is bright, illuminating the green grass and landscape.\n\n**Using OpenAI-Style Message Dictionaries**\n\n``` python\nmessages = [\n {'content': [{'text': 'describe the weather in this image', \n 'type': 'text'},\n {'image_url': {'url': image_url},\n 'type': 'image_url'}],\n 'role': 'user'}]\nsaved_result = llm.prompt(messages)\n```\n\n The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. The sunlight is bright, casting clear shadows and illuminating the green landscape.\n\n**Azure OpenAI**\n\nFor Azure OpenAI models, use the following URL format:\n\n``` python\nllm = LLM(model_url='azure://<deployment_name>', ...) \n# <deployment_name> is the Azure deployment name and additional Azure-specific parameters \n# can be supplied as extra arguments to LLM (or set as environment variables)\n```\n\n### Structured and Guided Outputs\n\nThe\n[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)\nmethod allows you to specify the desired structure of the LLM\u2019s output\nas a Pydantic model.\n\n``` python\nfrom pydantic import BaseModel, Field\n\nclass Joke(BaseModel):\n setup: str = Field(description=\"question to set up a joke\")\n punchline: str = Field(description=\"answer to resolve the joke\")\n\nfrom onprem import LLM\nllm = LLM(default_model='llama', verbose=False)\nstructured_output = llm.pydantic_prompt('Tell me a joke.', pydantic_model=Joke)\n```\n\n llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (131072) -- the full capacity of the model will not be utilized\n\n {\n \"setup\": \"Why couldn't the bicycle stand alone?\",\n \"punchline\": \"Because it was two-tired!\"\n }\n\nThe output is a Pydantic object instead of a string:\n\n``` python\nstructured_output\n```\n\n Joke(setup=\"Why couldn't the bicycle stand alone?\", punchline='Because it was two-tired!')\n\n``` 
python\nprint(structured_output.setup)\nprint()\nprint(structured_output.punchline)\n```\n\n Why couldn't the bicycle stand alone?\n\n Because it was two-tired!\n\nYou can also use **OnPrem.LLM** with the\n[Guidance](https://github.com/guidance-ai/guidance) package to guide the\nLLM to generate outputs based on your conditions and constraints. We\u2019ll\nshow a couple of examples here, but see [our documentation on guided\nprompts](https://amaiya.github.io/onprem/examples_guided_prompts.html)\nfor more information.\n\n``` python\nfrom onprem import LLM\n\nllm = LLM(n_gpu_layers=-1, verbose=False)\nfrom onprem.pipelines.guider import Guider\nguider = Guider(llm)\n```\n\nWith the Guider, you can use regular expressions to control LLM\ngeneration:\n\n``` python\nfrom guidance import gen  # gen comes from the Guidance package\n\nprompt = \"\"\"Question: Luke has ten balls. He gives three to his brother. How many balls does he have left?\nAnswer: \"\"\" + gen(name='answer', regex=r'\\d+')\n\nguider.prompt(prompt, echo=False)\n```\n\n {'answer': '7'}\n\n``` python\nprompt = '19, 18,' + gen(name='output', max_tokens=50, stop_regex=r'[^\\d]7[^\\d]')\nguider.prompt(prompt)\n```\n\n {'output': ' 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,'}\n\nSee [the\ndocumentation](https://amaiya.github.io/onprem/examples_guided_prompts.html)\nfor more examples of how to use\n[Guidance](https://github.com/guidance-ai/guidance) with **OnPrem.LLM**.\n\n## Built-In Web App\n\n**OnPrem.LLM** includes a built-in Web app to access the LLM. To start\nit, run the following command after installation:\n\n``` shell\nonprem --port 8000\n```\n\nThen, enter `localhost:8000` (or `<domain_name>:8000` if running on a\nremote server) in a Web browser to access the application:\n\n<img src=\"https://raw.githubusercontent.com/amaiya/onprem/master/images/onprem_screenshot.png\" border=\"1\" alt=\"screenshot\" width=\"775\"/>\n\nFor more information, [see the corresponding\ndocumentation](https://amaiya.github.io/onprem/webapp.html).\n\n## FAQ\n\n1. **How do I use other models with OnPrem.LLM?**\n\n > You can supply the URL to other models to the `LLM` constructor,\n > as we did above in the code generation example.\n\n > As of v0.0.20, we support models in GGUF format, which supersedes\n > the older GGML format. You can find llama.cpp-supported models\n > with `GGUF` in the file name on\n > [huggingface.co](https://huggingface.co/models?sort=trending&search=gguf).\n\n > Make sure you are pointing to the URL of the actual GGUF model\n > file, which is the \u201cdownload\u201d link on the model\u2019s page. An example\n > for **Mistral-7B** is shown below:\n\n > <img src=\"https://raw.githubusercontent.com/amaiya/onprem/master/images/model_download_link.png\" border=\"1\" alt=\"screenshot\" width=\"775\"/>\n\n > Note that some models have specific prompt formats. 
For instance,\n > the prompt template required for **Zephyr-7B**, as described on\n > the [model\u2019s\n > page](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF), is:\n >\n > `<|system|>\\n</s>\\n<|user|>\\n{prompt}</s>\\n<|assistant|>`\n >\n > So, to use the **Zephyr-7B** model, you must supply the\n > `prompt_template` argument to the `LLM` constructor (or specify it\n > in the `webapp.yml` configuration for the Web app).\n >\n > ``` python\n > # how to use Zephyr-7B with OnPrem.LLM\n > llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',\n > prompt_template = \"<|system|>\\n</s>\\n<|user|>\\n{prompt}</s>\\n<|assistant|>\",\n > n_gpu_layers=33)\n > llm.prompt(\"List three cute names for a cat.\")\n > ```\n\n2. **When installing `onprem`, I\u2019m getting \u201cbuild\u201d errors related to\n `llama-cpp-python` (or `chroma-hnswlib`) on Windows/Mac/Linux?**\n\n > See [this LangChain documentation on\n > LLama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp)\n > for help on installing the `llama-cpp-python` package for your\n > system. Additional tips for different operating systems are shown\n > below:\n\n > For **Linux** systems like Ubuntu, try this:\n > `sudo apt-get install build-essential g++ clang`. Other tips are\n > [here](https://github.com/oobabooga/text-generation-webui/issues/1534).\n\n > For **Windows** systems, please try following [these\n > instructions](https://github.com/amaiya/onprem/blob/master/MSWindows.md).\n > We recommend you use [Windows Subsystem for Linux\n > (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install)\n > instead of using Microsoft Windows directly. 
If you do need to use\n > Microsoft Windows directly, be sure to install the [Microsoft C++\n > Build\n > Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)\n > and make sure the **Desktop development with C++** workload is selected.\n\n > For **Macs**, try following [these\n > tips](https://github.com/imartinez/privateGPT/issues/445#issuecomment-1563333950).\n\n > There are also various other tips for each of the above OSes in\n > [this privateGPT repo\n > thread](https://github.com/imartinez/privateGPT/issues/445). Of\n > course, you can also [easily\n > use](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)\n > **OnPrem.LLM** on Google Colab.\n\n > Finally, if you still can\u2019t overcome issues with building\n > `llama-cpp-python`, you can try [installing the pre-built wheel\n > file](https://abetlen.github.io/llama-cpp-python/whl/cpu/llama-cpp-python/)\n > for your system:\n\n > **Example:**\n > `pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`\n >\n > **Tip:** There are [pre-built wheel files for\n > `chroma-hnswlib`](https://pypi.org/project/chroma-hnswlib/#files),\n > as well. If running `pip install onprem` fails on building\n > `chroma-hnswlib`, it may be because a pre-built wheel doesn\u2019t yet\n > exist for the version of Python you\u2019re using (in which case you\n > can try downgrading Python).\n\n3. 
**I\u2019m behind a corporate firewall and am receiving an SSL error when\n trying to download the model?**\n\n > Try this:\n >\n > ``` python\n > from onprem import LLM\n > LLM.download_model(url, ssl_verify=False)\n > ```\n\n > You can download the embedding model (used by `LLM.ingest` and\n > `LLM.ask`) as follows:\n >\n > ``` sh\n > wget --no-check-certificate https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/all-MiniLM-L6-v2.zip\n > ```\n\n > Supply the unzipped folder name as the `embedding_model_name`\n > argument to `LLM`.\n\n > If you\u2019re getting SSL errors even when running `pip install`, try\n > this:\n >\n > ``` sh\n > pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs\n > ```\n\n4. **How do I use this on a machine with no internet access?**\n\n > Use the `LLM.download_model` method to download the model files to\n > `<your_home_directory>/onprem_data` and transfer them to the same\n > location on the air-gapped machine.\n\n > For the `ingest` and `ask` methods, you will need to also download\n > and transfer the embedding model files:\n >\n > ``` python\n > from sentence_transformers import SentenceTransformer\n > model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n > model.save('/some/folder')\n > ```\n\n > Copy the `/some/folder` folder to the air-gapped machine and supply\n > the path to `LLM` via the `embedding_model_name` parameter.\n\n5. **My model is not loading when I call `llm = LLM(...)`?**\n\n > This can happen if the model file is corrupt (in which case you\n > should delete it from `<home directory>/onprem_data` and\n > re-download). It can also happen if the version of\n > `llama-cpp-python` needs to be upgraded to the latest version.\n\n6. 
**I\u2019m getting an `Illegal instruction (core dumped)` error when\n instantiating a `langchain.llms.Llamacpp` or `onprem.LLM` object?**\n\n > Your CPU may not support instructions that `cmake` is using for\n > one reason or another (e.g., [due to Hyper-V in VirtualBox\n > settings](https://stackoverflow.com/questions/65780506/how-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit)).\n > You can try turning them off when building and installing\n > `llama-cpp-python`:\n\n > ``` sh\n > # example\n > CMAKE_ARGS=\"-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF\" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir\n > ```\n\n7. **How can I speed up\n [`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest)\n using my GPU?**\n\n > Try using the `embedding_model_kwargs` argument:\n >\n > ``` python\n > from onprem import LLM\n > llm = LLM(embedding_model_kwargs={'device':'cuda'})\n > ```\n",
"bugtrack_url": null,
"license": "Apache Software License 2.0",
"summary": "A tool for running on-premises large language models on non-public data",
"version": "0.7.1",
"project_urls": {
"Homepage": "https://github.com/amaiya/onprem"
},
"split_keywords": [
"nbdev",
"jupyter",
"notebook",
"python"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9a6ab853532fad92b862e1e95ce12c61035b8225c87f33e7ee56d02c2e292a64",
"md5": "5136393a369c05495707943a0b5118a2",
"sha256": "0832be2a2671590d2b6e24fe990239ea782e32c5e39de1850620ad43fc7bebb4"
},
"downloads": -1,
"filename": "onprem-0.7.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5136393a369c05495707943a0b5118a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 83010,
"upload_time": "2024-12-19T19:14:12",
"upload_time_iso_8601": "2024-12-19T19:14:12.495204Z",
"url": "https://files.pythonhosted.org/packages/9a/6a/b853532fad92b862e1e95ce12c61035b8225c87f33e7ee56d02c2e292a64/onprem-0.7.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d3bfd0898779af925732a83d1fe1a94e85f701cfac3e7070f5471528e9e860ef",
"md5": "9472c706ceebd754a3c081420b34c48e",
"sha256": "19afaa7e0da7fca73d61d59e1c5f90567c73a5350167a5946b9d319d4556e3ac"
},
"downloads": -1,
"filename": "onprem-0.7.1.tar.gz",
"has_sig": false,
"md5_digest": "9472c706ceebd754a3c081420b34c48e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 97528,
"upload_time": "2024-12-19T19:14:13",
"upload_time_iso_8601": "2024-12-19T19:14:13.668592Z",
"url": "https://files.pythonhosted.org/packages/d3/bf/d0898779af925732a83d1fe1a94e85f701cfac3e7070f5471528e9e860ef/onprem-0.7.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-19 19:14:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "amaiya",
"github_project": "onprem",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "onprem"
}