TokenProbs

Name: TokenProbs
Version: 1.0.0
Summary: Extract token-level probabilities from LLMs for classification-type outputs.
Author: Francesco A. Fabozzi
Uploaded: 2024-04-19 16:25:01
Keywords: python, LLMs, finance, forecasting, language models, huggingface
# TokenProbs

Extract token-level probability scores from generative language models (GLMs) without fine-tuning. It is often useful to obtain probability assessments for binary or multi-class outcomes, but GLMs generate text rather than label scores, so they are not well-suited to this task out of the box. `LogitExtractor` obtains label probabilities directly from the model's logits, without fine-tuning.


## Installation

```bash
conda create -n TokenProbs python=3.9
pip3 install TokenProbs
```

__Troubleshooting__

If you receive `CUDA Setup failed despite GPU being available.`, identify the location of the CUDA driver (typically found under `/usr/local/`) and run the following commands. The example below assumes cuda-12.3:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.3 # change 12.3 to appropriate location
export BNB_CUDA_VERSION=123 # 123 (i.e., 12.3) also needs to be changed
```

## Usage
```python
from TokenProbs import LogitExtractor

extractor = LogitExtractor(
    model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit" # None = full precision; "4bit" is also supported
)

# Test sentence
sentence = "AAPL shares were up in morning trading, but closed even on the day."

# Prompt sentence
prompt = \
"""Instructions: What is the sentiment of this news article? Select from {positive/neutral/negative}.
\nInput: %text_input
Answer:"""

prompted_sentence = prompt.replace("%text_input",sentence)

# Provide tokens to extract (can be TokenIDs or strings)
pred_tokens = ['positive','neutral','negative']


# Extract normalized token probabilities
probabilities = extractor.logit_extraction(
    input_data = prompted_sentence,
    tokens = pred_tokens,
    batch_size=1
)

print(f"Probabilities: {probabilities}")
# Probabilities: {'positive': 0.7, 'neutral': 0.2, 'negative': 0.1}

# Compare to text output
text_output = extractor.text_generation(prompted_sentence, batch_size=1)
```
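Conceptually, the normalized probabilities returned above correspond to a softmax restricted to the logits of the candidate answer tokens at the answer position. A minimal sketch of that normalization step, assuming raw logits are already available (`normalize_token_logits` is an illustrative helper, not part of the package API):

```python
import math

def normalize_token_logits(logits_by_token):
    """Softmax restricted to the candidate tokens' logits.

    logits_by_token: dict mapping each candidate token (e.g. 'positive')
    to the raw logit the model assigned it at the answer position.
    Returns a dict of probabilities over the candidates that sums to 1.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits_by_token.values())
    exps = {tok: math.exp(l - max_logit) for tok, l in logits_by_token.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = normalize_token_logits({'positive': 2.0, 'neutral': 1.0, 'negative': 0.5})
print(probs)  # probabilities ordered positive > neutral > negative, summing to 1
```

Note that this distribution ignores all non-candidate tokens, which is exactly why it can disagree with the model's free-form text output.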

## Additional Features

`LogitExtractor` also provides functionality for applying Low-rank Adaptation (LoRA) fine-tuning tailored to extracting logit scores for next-token predictions.

```python
from datasets import load_dataset
from TokenProbs import LogitExtractor

# Load dataset
dataset = load_dataset("financial_phrasebank",'sentences_50agree')['train']
# Apply training and test split
dataset = dataset.train_test_split(seed=42)
train = dataset['train']

# Convert class labels to text
labels = [{0:'negative',1:'neutral',2:'positive'}[i] for i in train['label']]
# Prompt the sentences (reusing the prompt template from the Usage section)
prompted_sentences = [prompt.replace("%text_input",sent) for sent in train['sentence']]

# Add labels to prompted sentences
training_texts = [prompted_sentences[i] + labels[i] for i in range(len(labels))]

# Load model
extractor = LogitExtractor(
    model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
    quantization="8bit"
)

# Set up the SFTTrainer
extractor.trainer_setup(
    train_ds = training_texts, # either a DataLoader object or a list of texts
    response_seq = "\nAnswer:", # train only on the text following "\nAnswer:"
    # Can be a text string or a list of token IDs. Note: tokenization can differ based on context.
    lora_alpha=16,
    lora_rank=32,
    lora_dropout=0.1
)
extractor.trainer.train()
# Push the model to the Hugging Face Hub
extractor.trainer.model.push_to_hub('<HF_USERNAME>/<MODEL_NAME>')

# Load the fine-tuned model later
trained_model = LogitExtractor(
    model_name = '<HF_USERNAME>/<MODEL_NAME>',
    quantization="8bit"
)
```
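The `response_seq` argument restricts the training loss to tokens after the answer marker. A rough sketch of that masking step, assuming the Hugging Face convention that label positions set to `-100` are ignored by the loss (`mask_before_response` is a hypothetical helper, not the package's actual implementation):

```python
def mask_before_response(input_ids, response_ids):
    """Return labels equal to input_ids, with every position up to and
    including the response marker set to -100 (ignored by the loss)."""
    labels = list(input_ids)
    n = len(response_ids)
    # Find the last occurrence of the response marker in the sequence.
    start = -1
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == list(response_ids):
            start = i + n
    if start == -1:
        raise ValueError("response sequence not found")
    for i in range(start):
        labels[i] = -100
    return labels

# Toy example: token IDs [7, 8] stand in for "\nAnswer:"; only the final
# token (the appended label) contributes to the loss.
labels = mask_before_response([1, 2, 3, 7, 8, 42], [7, 8])
print(labels)  # [-100, -100, -100, -100, -100, 42]
```

This is why appending the label text directly after the prompt (as in `training_texts` above) is sufficient: everything before the marker is masked out.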

## Examples

Coming soon.




            
