# PromptInject
[**Paper: Ignore Previous Prompt: Attack Techniques For Language Models**](https://arxiv.org/abs/2211.09527)
## Abstract
> Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PROMPTINJECT, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3’s stochastic nature, creating long-tail risks.

Figure 1: Diagram showing how adversarial user input can derail model instructions. In both attacks,
the attacker aims to change the goal of the original prompt. In *goal hijacking*, the new goal is to print
a specific target string, which may contain malicious instructions, while in *prompt leaking*, the new
goal is to print the application prompt. Application Prompt (gray box) shows the original prompt,
where `{user_input}` is substituted by the user input. In this example, a user would normally input
a phrase to be corrected by the application (blue boxes). *Goal Hijacking* and *Prompt Leaking* (orange
boxes) show malicious user inputs (left) for both attacks and the respective model outputs (right)
when the attack is successful.
## Install
Run:
pip install git+https://github.com/agencyenterprise/PromptInject
## Usage
See [notebooks/Example.ipynb](notebooks/Example.ipynb) for an example.
## Cite
Bibtex:
@misc{ignore_previous_prompt,
doi = {10.48550/ARXIV.2211.09527},
url = {https://arxiv.org/abs/2211.09527},
author = {Perez, Fábio and Ribeiro, Ian},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Ignore Previous Prompt: Attack Techniques For Language Models},
publisher = {arXiv},
year = {2022}
}
## Contributing
We appreciate any additional request and/or contribution to `PromptInject`. The [issues](/issues) tracker is used to keep a list of features and bugs to be worked on. Please see our [contributing documentation](/CONTRIBUTING.md) for some tips on getting started.
Raw data
{
"_id": null,
"home_page": "https://github.com/agencyenterprise/PromptInject",
"name": "promptinject",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "nlp,ai-safety,language-models",
"author": "Fabio Perez",
"author_email": "fabioperez@users.noreply.github.com",
"download_url": "https://files.pythonhosted.org/packages/e3/9d/868cf6b3571334d00741150539ef713700d250324dc5a5ebffeac8272497/promptinject-0.1.1.1.tar.gz",
"platform": null,
"description": "# PromptInject\n\n[**Paper: Ignore Previous Prompt: Attack Techniques For Language Models**](https://arxiv.org/abs/2211.09527)\n\n## Abstract\n\n> Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PROMPTINJECT, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3\u2019s stochastic nature, creating long-tail risks.\n\n\n\nFigure 1: Diagram showing how adversarial user input can derail model instructions. In both attacks,\nthe attacker aims to change the goal of the original prompt. In *goal hijacking*, the new goal is to print\na specific target string, which may contain malicious instructions, while in *prompt leaking*, the new\ngoal is to print the application prompt. Application Prompt (gray box) shows the original prompt,\nwhere `{user_input}` is substituted by the user input. In this example, a user would normally input\na phrase to be corrected by the application (blue boxes). *Goal Hijacking* and *Prompt Leaking* (orange\nboxes) show malicious user inputs (left) for both attacks and the respective model outputs (right)\nwhen the attack is successful.\n\n## Install\n\nRun:\n\n pip install git+https://github.com/agencyenterprise/PromptInject\n\n## Usage\n\nSee [notebooks/Example.ipynb](notebooks/Example.ipynb) for an example.\n\n## Cite\n\nBibtex:\n\n @misc{ignore_previous_prompt,\n doi = {10.48550/ARXIV.2211.09527},\n url = {https://arxiv.org/abs/2211.09527},\n author = {Perez, F\u00e1bio and Ribeiro, Ian},\n keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},\n title = {Ignore Previous Prompt: Attack Techniques For Language Models},\n publisher = {arXiv},\n year = {2022}\n }\n\n## Contributing\n\nWe appreciate any additional request and/or contribution to `PromptInject`. The [issues](/issues) tracker is used to keep a list of features and bugs to be worked on. Please see our [contributing documentation](/CONTRIBUTING.md) for some tips on getting started.\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks.",
"version": "0.1.1.1",
"project_urls": {
"Homepage": "https://github.com/agencyenterprise/PromptInject",
"Repository": "https://github.com/agencyenterprise/PromptInject"
},
"split_keywords": [
"nlp",
"ai-safety",
"language-models"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "76ce394b0237b7b0b77f273ecd5229e7c6bd7bbd9b8184b9f767ec7482495ff7",
"md5": "4754238bce5bb7698d3d46022f2ff46f",
"sha256": "98d020b6878c0c32703110d19c0b1651868e906da0f3b1472d349923bda5fedf"
},
"downloads": -1,
"filename": "promptinject-0.1.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4754238bce5bb7698d3d46022f2ff46f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 14684,
"upload_time": "2024-02-27T04:47:34",
"upload_time_iso_8601": "2024-02-27T04:47:34.512946Z",
"url": "https://files.pythonhosted.org/packages/76/ce/394b0237b7b0b77f273ecd5229e7c6bd7bbd9b8184b9f767ec7482495ff7/promptinject-0.1.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e39d868cf6b3571334d00741150539ef713700d250324dc5a5ebffeac8272497",
"md5": "d6c894676c44c680388da192ff012a86",
"sha256": "b7c1790c75b3ab7f28c891f7d4e60d6cbe7896ab0e6d9a9e80199aac8e21f4fc"
},
"downloads": -1,
"filename": "promptinject-0.1.1.1.tar.gz",
"has_sig": false,
"md5_digest": "d6c894676c44c680388da192ff012a86",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 14474,
"upload_time": "2024-02-27T04:47:36",
"upload_time_iso_8601": "2024-02-27T04:47:36.276939Z",
"url": "https://files.pythonhosted.org/packages/e3/9d/868cf6b3571334d00741150539ef713700d250324dc5a5ebffeac8272497/promptinject-0.1.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-27 04:47:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "agencyenterprise",
"github_project": "PromptInject",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "promptinject"
}