# nano-askllm
Unofficial implementation of the Ask-LLM paper 'How to Train Data-Efficient LLMs', [arXiv:2402.09668](https://arxiv.org/abs/2402.09668).
[![PyPI](https://img.shields.io/pypi/v/nano-askllm?color=blue)](https://pypi.org/project/nano-askllm/)
[![GitHub License](https://img.shields.io/github/license/susumuota/nano-askllm)](https://github.com/susumuota/nano-askllm/blob/main/LICENSE)
[![GitHub last commit](https://img.shields.io/github/last-commit/susumuota/nano-askllm)](https://github.com/susumuota/nano-askllm/commits)
<img width="514" alt="Ask-LLM prompt" src="https://github.com/susumuota/nano-askllm/assets/1632335/f7bd37dc-3702-43f9-a6db-d4f74d7822ea">
## Installation
```bash
pip install nano-askllm
```
## Usage
- Scoring the C4 English dataset with the `flan-t5-small` model.
> **Note**: Flan-T5 models cannot properly tokenize multilingual text (e.g. Japanese).
```python
# pip install datasets sentencepiece accelerate

from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Stream the dataset so the full C4 corpus is not downloaded up front.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

llm = AskLLM(tokenizer, model)

batch_size = 2
num_ask = 5

# Score batches of datapoints and print a short preview of each.
for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)
```
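The scores estimate how useful each datapoint is as pre-training data, so they can drive the filtering step the Ask-LLM paper describes: rank the corpus by score and keep only the top examples. Below is a minimal filtering sketch that continues the example above; the `threshold` value of 0.9 is an arbitrary illustration, not a recommendation from the paper or this library.

```python
# Minimal filtering sketch (continues the example above).
# The 0.9 threshold is an arbitrary illustrative value.
threshold = 0.9
kept = []

for _ in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    kept.extend(dp for s, dp in zip(scores.tolist(), datapoints) if s > threshold)
    dataset = dataset.skip(batch_size)

print(f"kept {len(kept)} of {num_ask * batch_size} datapoints")
```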
- Scoring the mC4 Japanese dataset with the `gemma-2b-it` model. For `gemma` models, the prompt template and the yes tokens need to be tweaked, as shown below.
```python
# pip install datasets sentencepiece accelerate
# huggingface-cli login

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("allenai/c4", "ja", split="train", streaming=True)

# gemma models need a custom prompt template and yes-token list
# instead of the Flan-T5 defaults.
prompt_template_prefix = "###\n"
prompt_template_postfix = """
###

Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of the world, and strictly NOT have any harmful, racist, sexist, etc. content.

OPTIONS: yes/no
ANSWER:"""

yes_tokens = ["yes", "Yes", "YES", " yes", " Yes", " YES"]

llm = AskLLM(
    tokenizer,
    model,
    prompt_template_prefix=prompt_template_prefix,
    prompt_template_postfix=prompt_template_postfix,
    yes_tokens=yes_tokens,
    max_tokens=512,  # can be increased up to 8192 for gemma-2b-it
)

batch_size = 2
num_ask = 5

for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)
```
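Under the hood, an Ask-LLM score is derived from the probability the model assigns to an affirmative answer at the end of the prompt. The following is a minimal sketch of that idea for a causal LM, assuming the prompt is the simple concatenation of prefix, datapoint, and postfix; the `yes_probability` helper is illustrative and not part of this library's API.

```python
import torch

def yes_probability(model, tokenizer, prompt: str, yes_tokens: list[str]) -> float:
    """Illustrative sketch: sum of next-token probabilities of the yes variants."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    # Use the first token id of each "yes" variant.
    yes_ids = {tokenizer(t, add_special_tokens=False).input_ids[0] for t in yes_tokens}
    return sum(probs[i].item() for i in yes_ids)

# e.g. yes_probability(model, tokenizer,
#                      prompt_template_prefix + datapoint + prompt_template_postfix,
#                      yes_tokens)
```

`llm.ask` batches this computation and may normalize the result differently; the sketch only shows where the signal comes from.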
If you want to see debug logs, configure the library's logger as follows:
```python
from logging import DEBUG, StreamHandler, getLogger
logger = getLogger("nano_askllm.askllm")
logger.setLevel(DEBUG)
handler = StreamHandler()
handler.setLevel(DEBUG)
logger.addHandler(handler)
```
## Development
```bash
poetry -V # Poetry (version 1.5.1)
git clone https://github.com/susumuota/nano-askllm.git
cd nano-askllm
poetry install
poetry run pytest -s # run pytest once
poetry run -- ptw -- -s # watch for changes and run pytest
```
## Citation
```bibtex
@misc{sachdeva2024train,
  title={How to Train Data-Efficient LLMs},
  author={Noveen Sachdeva and Benjamin Coleman and Wang-Cheng Kang and Jianmo Ni and Lichan Hong and Ed H. Chi and James Caverlee and Julian McAuley and Derek Zhiyuan Cheng},
  year={2024},
  eprint={2402.09668},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
## License
MIT License. See [LICENSE](LICENSE) for details.
## TODO
- [ ] Add Colab notebook
- [x] Add examples using Hugging Face Datasets