TokenProbs

Name	TokenProbs
Version	1.0.3
home_page	None
Summary	Extract token-level probabilities from LLMs for classification-type outputs.
upload_time	2024-10-31 18:49:33
maintainer	None
docs_url	None
author	Francesco A. Fabozzi
requires_python	<3.13,>=3.8
license	None
keywords	python llms finance forecasting language models huggingface

            
# TokenProbs

Extract token-level probability scores from generative language models (GLMs) without fine-tuning. It is often useful to ask a model for a probability assessment of binary or multi-class outcomes, but GLMs generate text by default and are not well suited to this task on their own. Instead, use `LogitExtractor` to obtain label probabilities without fine-tuning.
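
To make the idea concrete, the sketch below shows one way label probabilities can be derived from next-token logits: keep only the logits of the candidate label tokens and apply a softmax over that subset. This is an illustration under stated assumptions, not necessarily how `LogitExtractor` is implemented; the function name `normalize_label_logits` and the logit values are hypothetical.

```python
import math


def normalize_label_logits(label_logits):
    """Softmax over a restricted set of label logits.

    `label_logits` maps each candidate label to the (hypothetical) raw
    next-token logit the model assigned to that label's token.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(label_logits.values())
    exp_scores = {
        label: math.exp(value - max_logit)
        for label, value in label_logits.items()
    }
    total = sum(exp_scores.values())
    # Divide by the sum so the label probabilities add up to 1.
    return {label: score / total for label, score in exp_scores.items()}


# Made-up logits, for illustration only.
example_logits = {"positive": 2.0, "neutral": 0.5, "negative": -1.0}
label_probs = normalize_label_logits(example_logits)
print(label_probs)
```

Because the softmax is taken only over the candidate labels, the resulting values always sum to 1, even though the model's full vocabulary distribution assigns mass to many other tokens.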


## Installation

Install with `pip`:

```bash
conda create -n TokenProbs python=3.11 # Note: not available for 3.13
conda activate TokenProbs
pip3 install TokenProbs 
```

Install from the GitHub repository:
```bash
conda create -n TokenProbs python=3.12 # Note: not available for 3.13
conda activate TokenProbs

git clone https://github.com/francescoafabozzi/TokenProbs.git
cd TokenProbs
pip3 install -e . # Install in editable mode 
```



## Usage

See `examples/FinancialPhrasebank.ipynb` for an example of using `LogitExtractor` to extract token-level probabilities for a sentiment classification task.

```python
from TokenProbs import LogitExtractor

extractor = LogitExtractor(
    model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit" # None = full precision; "4bit" is also supported
)

# Test sentence
sentence = "AAPL shares were up in morning trading, but closed even on the day."

# Prompt sentence
prompt = \
"""Instructions: What is the sentiment of this news article? Select from {positive/neutral/negative}.
\nInput: %text_input
Answer:"""

prompted_sentence = prompt.replace("%text_input",sentence)

# Provide tokens to extract (can be TokenIDs or strings)
pred_tokens = ['positive','neutral','negative']


# Extract normalized token probabilities
probabilities = extractor.logit_extraction(
    input_data = prompted_sentence,
    tokens = pred_tokens,
    batch_size=1
)

print(f"Probabilities: {probabilities}")
# Example output:
# Probabilities: {'positive': 0.7, 'neutral': 0.2, 'negative': 0.1}

# Compare to text output
text_output = extractor.text_generation(prompted_sentence, batch_size=1)
```
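
Once a probability dictionary of the shape shown in the example output is available, turning it into a predicted label is a small post-processing step. The sketch below works on plain strings and dictionaries only and does not call the library; the helper names `build_prompt` and `predicted_label` and the second sample sentence are hypothetical and not part of TokenProbs.

```python
# Same prompt text as in the usage example, written as one explicit string.
PROMPT_TEMPLATE = (
    "Instructions: What is the sentiment of this news article? "
    "Select from {positive/neutral/negative}.\n"
    "\nInput: %text_input\n"
    "Answer:"
)


def build_prompt(sentence, template=PROMPT_TEMPLATE):
    """Insert a sentence into the prompt template."""
    return template.replace("%text_input", sentence)


def predicted_label(probabilities):
    """Return the label with the highest probability."""
    return max(probabilities, key=probabilities.get)


sentences = [
    "AAPL shares were up in morning trading, but closed even on the day.",
    "The company reported record quarterly revenue.",  # made-up sample
]
prompts = [build_prompt(sentence) for sentence in sentences]

# A dictionary shaped like the example output above.
example_probs = {"positive": 0.7, "neutral": 0.2, "negative": 0.1}
print(predicted_label(example_probs))
```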

## Troubleshooting Installation

__Import Errors due to `torch`__

If you receive import errors due to `torch`, a specific torch version may be required. Follow the steps below:

__Step 1__: Identify the CUDA version (for GPU users):
```bash
nvcc --version
``` 

```bash
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```

In this case, the CUDA version is 12.3. 

__Step 2__: Navigate to the [PyTorch website](https://pytorch.org/get-started/locally/) and select the build that matches the CUDA version.

PyTorch does not offer a build for CUDA 12.3, so select the nearest build for a lower CUDA version (in this case, 12.1).
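
The selection rule in Steps 1 and 2 can be sketched in a few lines of Python: parse the release number from the `nvcc --version` output, then pick the highest available build that does not exceed it. The list `AVAILABLE_CUDA_BUILDS` is an assumption for illustration (check the PyTorch website for the current list), and the helper names are hypothetical. The index URL follows the pattern of the `cu121` install command in Step 3.

```python
import re

# Assumption: CUDA builds offered by PyTorch; verify on the PyTorch website.
AVAILABLE_CUDA_BUILDS = [(11, 8), (12, 1)]


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from text such as 'release 12.3, V12.3.107'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in the nvcc output")
    return int(match.group(1)), int(match.group(2))


def pick_cuda_build(installed, available=AVAILABLE_CUDA_BUILDS):
    """Pick the highest available build that is not newer than `installed`."""
    candidates = [build for build in available if build <= installed]
    if not candidates:
        raise ValueError(f"no available build at or below CUDA {installed}")
    return max(candidates)


def wheel_index_url(build):
    """Build the pip index URL for a (major, minor) CUDA build."""
    major, minor = build
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


sample_output = "Cuda compilation tools, release 12.3, V12.3.107"
installed = parse_nvcc_release(sample_output)
build = pick_cuda_build(installed)
print(installed, build, wheel_index_url(build))
```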

__Step 3__: Uninstall torch and reinstall it with the matching CUDA build:
```bash
pip3 uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
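
After reinstalling, it can help to confirm which torch build is actually being imported. The following is a minimal check, assuming only that `torch` may or may not be importable in the current environment; the function name `describe_torch_install` is hypothetical.

```python
def describe_torch_install():
    """Return basic facts about the installed torch build, if any."""
    try:
        import torch
    except (ImportError, OSError) as error:
        # torch is missing or failed to load one of its shared libraries.
        return {"installed": False, "error": str(error)}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible to torch.
        "cuda_available": torch.cuda.is_available(),
    }


torch_info = describe_torch_install()
print(torch_info)
```

If `built_for_cuda` does not match the build chosen in Step 2, the old installation is probably still the one being imported.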

__Issues with `bitsandbytes`__

If you receive the error `CUDA Setup failed despite GPU being available.`, identify the location of the CUDA installation, typically found under `/usr/local/`, and run the following commands from the command line. The example below uses cuda-12.3:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.3 # change 12.3 to appropriate location
export BNB_CUDA_VERSION=123 # 123 (i.e., 12.3) also needs to be changed
```
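
Both values in the commands above derive from the same CUDA version string, so they can be generated together to avoid a mismatch. This is a small sketch, assuming the CUDA directory follows the `/usr/local/cuda-<major>.<minor>` naming shown above; the helper name `bnb_exports` is hypothetical.

```python
def bnb_exports(cuda_version, cuda_root="/usr/local"):
    """Return the two export lines for a CUDA version such as '12.3'."""
    major, minor = cuda_version.split(".")
    cuda_dir = f"{cuda_root}/cuda-{major}.{minor}"
    return [
        f"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{cuda_dir}",
        # BNB_CUDA_VERSION is the version with the dot removed (12.3 -> 123).
        f"export BNB_CUDA_VERSION={major}{minor}",
    ]


for line in bnb_exports("12.3"):
    print(line)
```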


<!-- 
## Additional Features

`LogitExtractor` also provides functionality for applying Low-rank Adaptation (LoRA) fine-tuning tailored to extracting logit scores for next-token predictions.

Below is an example of fine-tuning Mistral on Financial Phrasebank, a financial sentiment classification dataset.

```python
from datasets import load_dataset
from TokenProbs import LogitExtractor

# Load dataset
dataset = load_dataset("financial_phrasebank",'sentences_50agree')['train']
# Apply training and test split
dataset = dataset.train_test_split(seed=42)
train = dataset['train']

# Convert class labels to text
labels = [{0:'negative',1:'neutral',2:'positive'}[i] for i in train['label']]
# Get sentences 
prompted_sentences = [prompt.replace("%text_input",sent) for sent in train['sentence']]

# Add labels to prompted sentences
training_texts = [prompted_sentences[i] + labels[i] for i in range(len(labels))]

# Load model
extractor = LogitExtractor(
    model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit"
)

# Set up SFTTrainer
extractor.trainer_setup(
    train_ds = training_texts, #either a dataloader object or text list
    response_seq = "\nAnswer:", # Tells trainer to train only on text following "\nAnswer: "
    # Input can be text string or list of TokenIDs. Be careful, tokens can differ based on context.
    lora_alpha=16,
    lora_rank=32,
    lora_dropout=0.1
)
extractor.trainer.train()
# Push model to huggingface
extractor.trainer.model.push_to_hub('<HF_USERNAME>/<MODEL_NAME>')

# Load model later
trained_model = LogitExtractor(
    model_name = '<HF_USERNAME>/<MODEL_NAME>',
    quantization="8bit"
)
```
-->

<!-- ## Examples -->

<!-- Coming soon. -->

