<img src="./pkm.png" width="400px"></img>
## Product Key Memory
[![PyPI version](https://badge.fury.io/py/product-key-memory.svg)](https://badge.fury.io/py/product-key-memory)
Standalone <a href="https://arxiv.org/abs/1907.05242">Product Key Memory</a> module for augmenting Transformer models
## Install
```bash
$ pip install product-key-memory
```
## Usage
Replace the feedforward layers in a Transformer with the following:
```python
import torch
from product_key_memory import PKM
pkm = PKM(
    dim = 512,
    heads = 4,
    dim_head = 128,   # keep at 128 for best results
    num_keys = 256,   # number of subkeys; the number of values will be num_keys ^ 2
    topk = 32         # number of top subkeys to select
)
x = torch.randn(1, 1024, 512)
mask = torch.ones((1, 1024)).bool()
values = pkm(x, input_mask = mask) # (1, 1024, 512)
```
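To see why `num_keys = 256` subkeys address `num_keys ^ 2 = 65536` values, here is a hypothetical minimal sketch of the product-key lookup for a single query (not the library's internal implementation): the query is split in half, each half is scored against its own subkey table, and the cartesian sum of the two top-k score sets selects from the full `num_keys ** 2` value table while only ever scoring `2 * num_keys` subkeys plus `topk ** 2` candidates.

```python
import torch

num_keys, topk, dim = 256, 32, 128
query = torch.randn(dim)
subkeys_a = torch.randn(num_keys, dim // 2)  # first subkey table
subkeys_b = torch.randn(num_keys, dim // 2)  # second subkey table

# split the query and score each half against its subkey table
q_a, q_b = query[: dim // 2], query[dim // 2 :]
scores_a, idx_a = (subkeys_a @ q_a).topk(topk)
scores_b, idx_b = (subkeys_b @ q_b).topk(topk)

# cartesian sum of the two top-k score sets -> topk ** 2 candidates
cand_scores = (scores_a[:, None] + scores_b[None, :]).reshape(-1)
cand_idx = (idx_a[:, None] * num_keys + idx_b[None, :]).reshape(-1)

# final top-k over the candidates, indexing into num_keys ** 2 values
final_scores, pos = cand_scores.topk(topk)
final_idx = cand_idx[pos]
```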
## Learning Rates
To give different learning rates to the value parameters of the product-key-memory network, use the following helper function.
```python
from torch.optim import Adam
from product_key_memory import fetch_pkm_value_parameters
# given your root model, this helper finds all PKM modules and separates
# out their value (embedding bag) parameters from the rest
pkm_parameters, other_parameters = fetch_pkm_value_parameters(model)
optim = Adam([
    {'params': other_parameters},
    {'params': pkm_parameters, 'lr': 1e-2}
], lr = 1e-3)
```
Or, if the product-key-memory value parameters are the only ones that need a different learning rate:
```python
from torch.optim import Adam
from product_key_memory import fetch_optimizer_parameters
# automatically builds the parameter groups, with the learning rate for PKM values set to 1e-2
parameters = fetch_optimizer_parameters(model)
optim = Adam(parameters, lr=1e-3)
```
## Appreciation
Special thanks go to <a href="https://github.com/AranKomat">Aran</a> for encouraging me to look into this, and to <a href="https://github.com/madisonmay">Madison May</a> for his <a href="https://www.pragmatic.ml/large-memory-layers-with-product-keys/">educational blog post</a>, which helped me understand this better.
## Todo
- [x] offer stochasticity with annealed gumbel noise, which has shown dramatic effects in the vector-quantization setting
- [x] offer a way for smaller value dimensions + concat and linear combination of heads (like multi-head attention)
- [ ] get caught up on latest literature on product key memories, if any
- [ ] instead of additive scores, try multiplicative using coordinate descent routing
## Citations
```bibtex
@misc{lample2019large,
    title         = {Large Memory Layers with Product Keys},
    author        = {Guillaume Lample and Alexandre Sablayrolles and Marc'Aurelio Ranzato and Ludovic Denoyer and Hervé Jégou},
    year          = {2019},
    eprint        = {1907.05242},
    archivePrefix = {arXiv}
}
```
```bibtex
@misc{liu2020evolving,
    title         = {Evolving Normalization-Activation Layers},
    author        = {Hanxiao Liu and Andrew Brock and Karen Simonyan and Quoc V. Le},
    year          = {2020},
    eprint        = {2004.02967},
    archivePrefix = {arXiv}
}
```
```bibtex
@article{Shen2023ASO,
    title   = {A Study on ReLU and Softmax in Transformer},
    author  = {Kai Shen and Junliang Guo and Xuejiao Tan and Siliang Tang and Rui Wang and Jiang Bian},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2302.06461},
    url     = {https://api.semanticscholar.org/CorpusID:256827573}
}
```
```bibtex
@article{Csordas2023ApproximatingTF,
    title   = {Approximating Two-Layer Feedforward Networks for Efficient Transformers},
    author  = {R{\'o}bert Csord{\'a}s and Kazuki Irie and J{\"u}rgen Schmidhuber},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2310.10837},
    url     = {https://api.semanticscholar.org/CorpusID:264172384}
}
```
```bibtex
@inproceedings{anonymous2025continual,
    title     = {Continual Learning via Sparse Memory Finetuning},
    author    = {Anonymous},
    booktitle = {Submitted to The Fourteenth International Conference on Learning Representations},
    year      = {2025},
    url       = {https://openreview.net/forum?id=LGo7U1m24L},
    note      = {under review}
}
```