# TokenProbs
Extract token-level probability scores from generative language models (GLMs) without fine-tuning. It is often useful to request probability estimates for binary or multi-class outcomes, but GLMs are not well-suited to producing them directly. `LogitExtractor` obtains these label probabilities without fine-tuning.
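Conceptually, label probabilities of this kind come from the model's next-token logits: take the logit of each candidate label token and renormalize with a softmax over just those candidates. The helper below (`candidate_probs` is a name made up for illustration, not part of TokenProbs) sketches that computation with a toy logits vector standing in for real model output:

```python
import math

def candidate_probs(logits, candidate_ids):
    """Renormalize next-token logits over a set of candidate token IDs.

    logits: one float per vocabulary entry (here a toy 6-token vocab).
    candidate_ids: token IDs of the candidate labels.
    Returns a dict mapping token ID -> probability, summing to 1.
    """
    # Softmax restricted to the candidates (subtract the max for stability)
    selected = [logits[i] for i in candidate_ids]
    m = max(selected)
    exps = [math.exp(x - m) for x in selected]
    total = sum(exps)
    return {tid: e / total for tid, e in zip(candidate_ids, exps)}

# Toy vocabulary; IDs 1, 3, 4 play the roles of 'positive'/'neutral'/'negative'
logits = [0.2, 2.0, -1.0, 1.0, 0.5, 0.0]
probs = candidate_probs(logits, [1, 3, 4])
print(probs)
```

In a real run the logits come from a forward pass of the language model at the position right after the prompt; the renormalization step is the same.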
## Installation
```bash
conda create -n TokenProbs python=3.9
pip3 install TokenProbs
```
__Troubleshooting__
If you receive `CUDA Setup failed despite GPU being available.`, identify the location of the CUDA driver (typically found under `/usr/local/`) and run the following commands. The example below assumes CUDA 12.3:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.3 # adjust 12.3 to your CUDA version
export BNB_CUDA_VERSION=123 # 123 (i.e., 12.3) must match as well
```
## Usage
```python
from TokenProbs import LogitExtractor
extractor = LogitExtractor(
    model_name='mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit"  # None = full precision; "4bit" also supported
)
# Test sentence
sentence = "AAPL shares were up in morning trading, but closed even on the day."
# Prompt sentence
prompt = \
"""Instructions: What is the sentiment of this news article? Select from {positive/neutral/negative}.
\nInput: %text_input
Answer:"""
prompted_sentence = prompt.replace("%text_input",sentence)
# Provide tokens to extract (can be TokenIDs or strings)
pred_tokens = ['positive','neutral','negative']
# Extract normalized token probabilities
probabilities = extractor.logit_extraction(
    input_data=prompted_sentence,
    tokens=pred_tokens,
    batch_size=1
)
print(f"Probabilities: {probabilities}")
# Probabilities: {'positive': 0.7, 'neutral': 0.2, 'negative': 0.1}
# Compare to text output
text_output = extractor.text_generation(prompted_sentence, batch_size=1)
```
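Building prompts for many sentences is plain string templating on the same `%text_input` placeholder. The sketch below constructs such a list; whether `logit_extraction` accepts a list directly is an assumption (suggested by its `batch_size` parameter), so the batched call is shown as a comment:

```python
prompt = (
    "Instructions: What is the sentiment of this news article? "
    "Select from {positive/neutral/negative}.\n"
    "Input: %text_input\nAnswer:"
)
sentences = [
    "AAPL shares were up in morning trading, but closed even on the day.",
    "The company missed earnings expectations and shares fell sharply.",
]
# Fill the %text_input placeholder for each sentence
prompted = [prompt.replace("%text_input", s) for s in sentences]

# Assumption: list input is supported alongside batch_size, e.g.
# probabilities = extractor.logit_extraction(
#     input_data=prompted, tokens=['positive','neutral','negative'], batch_size=2)
print(prompted[0])
```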
## Additional Features
`LogitExtractor` also provides functionality for applying Low-rank Adaptation (LoRA) fine-tuning tailored to extracting logit scores for next-token predictions.
```python
from datasets import load_dataset
from TokenProbs import LogitExtractor
# Load dataset
dataset = load_dataset("financial_phrasebank",'sentences_50agree')['train']
# Apply training and test split
dataset = dataset.train_test_split(seed=42)
train = dataset['train']
# Convert class labels to text
labels = [{0:'negative',1:'neutral',2:'positive'}[i] for i in train['label']]
# Prompt sentences (reusing `prompt` from the Usage section)
prompted_sentences = [prompt.replace("%text_input",sent) for sent in train['sentence']]
# Add labels to prompted sentences
training_texts = [prompted_sentences[i] + labels[i] for i in range(len(labels))]
# Load model
extractor = LogitExtractor(
model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
quantization="8bit"
)
# Set up SFTTrainer
extractor.trainer_setup(
    train_ds=training_texts,  # either a DataLoader object or a list of texts
    response_seq="\nAnswer:",  # train only on text following "\nAnswer:"
    # response_seq can be a text string or a list of token IDs; note that
    # tokenization can differ depending on surrounding context.
    lora_alpha=16,
    lora_rank=32,
    lora_dropout=0.1
)
extractor.trainer.train()
# Push the fine-tuned model to the Hugging Face Hub
extractor.trainer.model.push_to_hub('<HF_USERNAME>/<MODEL_NAME>')
# Load the fine-tuned model later
trained_model = LogitExtractor(
    model_name='<HF_USERNAME>/<MODEL_NAME>',
    quantization="8bit"
)
```
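The held-out split from `train_test_split` above can then be scored by taking the argmax label of each probability dict that `logit_extraction` returns. A self-contained sketch of that scoring step (`accuracy` is a hypothetical helper, and the probability dicts below are made up, standing in for real model outputs in the shape shown in the Usage section):

```python
def accuracy(prob_dicts, gold_labels):
    """Argmax accuracy over per-example label-probability dicts."""
    preds = [max(p, key=p.get) for p in prob_dicts]
    n_correct = sum(pred == gold for pred, gold in zip(preds, gold_labels))
    return n_correct / len(gold_labels)

# Stand-in outputs for three test sentences
outputs = [
    {'positive': 0.7, 'neutral': 0.2, 'negative': 0.1},
    {'positive': 0.1, 'neutral': 0.3, 'negative': 0.6},
    {'positive': 0.2, 'neutral': 0.5, 'negative': 0.3},
]
gold = ['positive', 'negative', 'positive']
print(accuracy(outputs, gold))  # 2 of 3 argmax labels match the gold labels
```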
## Examples
Coming soon.