# TokenProbs
Extract token-level probability scores from generative language models (GLMs) without fine-tuning. It is often useful to ask a model for a probability assessment over binary or multi-class outcomes, but GLMs are not well suited to this task. Instead, use `LogitExtractor` to obtain label probabilities without fine-tuning.
## Installation
Install with `pip`:
```bash
conda create -n TokenProbs python=3.11 # Note: not available for 3.13
conda activate TokenProbs
pip3 install TokenProbs
```
Install from the GitHub repository:
```bash
conda create -n TokenProbs python=3.12 # Note: not available for 3.13
conda activate TokenProbs
git clone https://github.com/francescoafabozzi/TokenProbs.git
cd TokenProbs
pip3 install -e . # Install in editable mode
```
## Usage
See `examples/FinancialPhrasebank.ipynb` for an example of using `LogitExtractor` to extract token-level probabilities for a sentiment classification task.
```python
from TokenProbs import LogitExtractor
extractor = LogitExtractor(
    model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit" # None = full precision, "4bit" also supported
)
# Test sentence
sentence = "AAPL shares were up in morning trading, but closed even on the day."
# Prompt sentence
prompt = \
"""Instructions: What is the sentiment of this news article? Select from {positive/neutral/negative}.
\nInput: %text_input
Answer:"""
prompted_sentence = prompt.replace("%text_input",sentence)
# Provide tokens to extract (can be TokenIDs or strings)
pred_tokens = ['positive','neutral','negative']
# Extract normalized token probabilities
probabilities = extractor.logit_extraction(
    input_data = prompted_sentence,
    tokens = pred_tokens,
    batch_size=1
)
print(f"Probabilities: {probabilities}")
# Example output:
# Probabilities: {'positive': 0.7, 'neutral': 0.2, 'negative': 0.1}
# Compare to text output
text_output = extractor.text_generation(prompted_sentence,batch_size=1)
```
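This README does not spell out how the probabilities are normalized. One common approach, shown here purely as an illustrative sketch and not as the library's confirmed implementation, is a softmax restricted to the logits of the label tokens. The function name `normalize_label_logits` and the logit values are hypothetical.

```python
import math
from typing import Dict


def normalize_label_logits(label_logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the label tokens only (hypothetical helper, not part of TokenProbs)."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(label_logits.values())
    exp_scores = {label: math.exp(logit - max_logit) for label, logit in label_logits.items()}
    total = sum(exp_scores.values())
    # Dividing by the total makes the label probabilities sum to 1.
    return {label: score / total for label, score in exp_scores.items()}


# Made-up next-token logits for the three label tokens.
example_logits = {"positive": 2.0, "neutral": 0.75, "negative": 0.0}
probs = normalize_label_logits(example_logits)
print(probs)
```

Restricting the softmax to the label tokens means the probabilities describe the relative preference among the labels only, ignoring any probability mass the model places on other tokens.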
## Troubleshooting Installation
__Import Errors due to `torch`__
If you receive import errors due to `torch`, a specific torch version may be required. Follow the steps below:
__Step 1__: Identify the CUDA version (for GPU users):
```bash
nvcc --version
```
```bash
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```
In this case, the CUDA version is 12.3.
__Step 2__: Navigate to the [PyTorch website](https://pytorch.org/get-started/locally/) and select the build that matches the CUDA version.
There is no PyTorch build for CUDA 12.3, so select the closest build below 12.3 (here, 12.1).
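The selection rule in Step 2 can be sketched in a few lines of Python. The helper `pick_torch_cuda_tag` is hypothetical, and the list of available builds is an assumption for illustration; check the PyTorch website for the builds that are actually offered.

```python
from typing import List, Optional, Tuple


def pick_torch_cuda_tag(installed: str, available: List[str]) -> Optional[str]:
    """Return the wheel tag (e.g. 'cu121') of the newest build not above the installed CUDA."""

    def as_tuple(version: str) -> Tuple[int, ...]:
        # "12.3" -> (12, 3) so that versions compare numerically.
        return tuple(int(part) for part in version.split("."))

    candidates = [v for v in available if as_tuple(v) <= as_tuple(installed)]
    if not candidates:
        return None  # No compatible build in the list.
    best = max(candidates, key=as_tuple)
    return "cu" + best.replace(".", "")


# Assumed list of CUDA builds; 12.3 is the version reported by nvcc above.
print(pick_torch_cuda_tag("12.3", ["11.8", "12.1"]))  # cu121
```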
__Step 3__: Uninstall torch and reinstall the build for the selected CUDA version:
```bash
pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
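After reinstalling, it may help to confirm which CUDA version the installed torch build targets. The snippet below is a small sketch using standard `torch` attributes; the function name `torch_cuda_report` is hypothetical.

```python
def torch_cuda_report():
    """Summarize the installed torch build, or return None if torch cannot be imported."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or its shared libraries failed to load.
        return None
    return {
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_cuda_report())
```

If `built_for_cuda` does not match the build selected in Step 2, or `cuda_available` is `False` on a GPU machine, the reinstall likely did not take effect in the active environment.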
__Issues with `bitsandbytes`__
If you receive a `CUDA Setup failed despite GPU being available.` error, locate the CUDA installation, typically found under `/usr/local/`, and run the following commands from the command line. The example below uses cuda-12.3:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.3 # change 12.3 to appropriate location
export BNB_CUDA_VERSION=123 # 123 (i.e., 12.3) also needs to be changed
```
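The two values above follow directly from the `nvcc --version` output. As a hedged sketch, the hypothetical helper `bnb_exports` below builds both `export` lines from that output, assuming the CUDA installation lives at `/usr/local/cuda-<major>.<minor>` as in the example.

```python
import re
from typing import List


def bnb_exports(nvcc_output: str, cuda_root: str = "/usr/local") -> List[str]:
    """Build the two export lines from `nvcc --version` output (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("Could not find a 'release X.Y' entry in the nvcc output")
    major, minor = match.groups()
    return [
        # Assumes the install directory is named cuda-<major>.<minor>.
        f"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{cuda_root}/cuda-{major}.{minor}",
        # BNB_CUDA_VERSION is the version with the dot removed, e.g. 12.3 -> 123.
        f"export BNB_CUDA_VERSION={major}{minor}",
    ]


for line in bnb_exports("Cuda compilation tools, release 12.3, V12.3.107"):
    print(line)
```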
<!--
## Additional Features
`LogitExtractor` also provides functionality for applying Low-rank Adaptation (LoRA) fine-tuning tailored to extracting logit scores for next-token predictions.
Below is an example of fine-tuning Mistral on Financial Phrasebank, a financial sentiment classification dataset.
```python
from datasets import load_dataset
from TokenProbs import LogitExtractor
# Load dataset
dataset = load_dataset("financial_phrasebank",'sentences_50agree')['train']
# Apply training and test split
dataset = dataset.train_test_split(seed=42)
train = dataset['train']
# Convert class labels to text
labels = [{0:'negative',1:'neutral',2:'positive'}[i] for i in train['label']]
# Get sentences
prompted_sentences = [prompt.replace("%text_input",sent) for sent in train['sentence']]
# Add labels to prompted sentences
training_texts = [prompted_sentences[i] + labels[i] for i in range(len(labels))]
# Load model
extractor = LogitExtractor(
model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
quantization="8bit"
)
# Set up SFTTrainer
extractor.trainer_setup(
train_ds = training_texts, #either a dataloader object or text list
response_seq = "\nAnswer:", # Tells trainer to train only on text following "\nAnswer: "
# Input can be text string or list of TokenIDs. Be careful, tokens can differ based on context.
lora_alpha=16,
lora_rank=32,
lora_dropout=0.1
)
extractor.trainer.train()
# Push model to huggingface
extractor.trainer.model.push_to_hub('<HF_USERNAME>/<MODEL_NAME>')
# Load model later
trained_model = LogitExtractor(
    model_name = '<HF_USERNAME>/<MODEL_NAME>',
    quantization="8bit"
)
```
-->
<!-- ## Examples -->
<!-- Coming soon. -->
Raw data
{
"_id": null,
"home_page": null,
"name": "TokenProbs",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.8",
"maintainer_email": null,
"keywords": "python, LLMs, finance, forecasting, language models, huggingface",
"author": "Francesco A. Fabozzi",
"author_email": "francescoafabozzi@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/59/29/64c7b7dcc7a0f5d4bcb8af6538d368f40dae61fae97ceeb4db81878063d9/tokenprobs-1.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Extract token-level probabilities from LLMs for classification-type outputs.",
"version": "1.0.3",
"project_urls": null,
"split_keywords": [
"python",
" llms",
" finance",
" forecasting",
" language models",
" huggingface"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "14dab1d0bfb16a03cec7bec097f17cb2766627e0ab0bd17ea72fd65972288f06",
"md5": "a8dd330497e99e1687e1f1a804ca215c",
"sha256": "e5a691042c745c82b73a284c0fca59a4c0984d9444c87797236422fadd7d89e0"
},
"downloads": -1,
"filename": "TokenProbs-1.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a8dd330497e99e1687e1f1a804ca215c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.8",
"size": 11647,
"upload_time": "2024-10-31T18:49:32",
"upload_time_iso_8601": "2024-10-31T18:49:32.184103Z",
"url": "https://files.pythonhosted.org/packages/14/da/b1d0bfb16a03cec7bec097f17cb2766627e0ab0bd17ea72fd65972288f06/TokenProbs-1.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "592964c7b7dcc7a0f5d4bcb8af6538d368f40dae61fae97ceeb4db81878063d9",
"md5": "f3efd0107a2a364ee1d15ad081c5fde0",
"sha256": "064c6b2563f4e209e0393fa3d067218d6ea3615af3b1bb3c86da0e0ce9e0a3d4"
},
"downloads": -1,
"filename": "tokenprobs-1.0.3.tar.gz",
"has_sig": false,
"md5_digest": "f3efd0107a2a364ee1d15ad081c5fde0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.8",
"size": 11420,
"upload_time": "2024-10-31T18:49:33",
"upload_time_iso_8601": "2024-10-31T18:49:33.444959Z",
"url": "https://files.pythonhosted.org/packages/59/29/64c7b7dcc7a0f5d4bcb8af6538d368f40dae61fae97ceeb4db81878063d9/tokenprobs-1.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-31 18:49:33",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "tokenprobs"
}