# pymsh
<p>
<img alt="PyPI" src="https://img.shields.io/pypi/v/pymsh?color=blue">
<img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/pymsh">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/pymsh">
<img alt="PyPI - License" src="https://img.shields.io/pypi/l/pymsh?label=license">
<img alt="Test Status" src="https://github.com/cgshep/pymsh/actions/workflows/python-package.yml/badge.svg">
</p>
**pymsh** is a Python library that provides **multiset hash (MSH)** functions. These functions let you hash collections of items _without_ worrying about the order of those items.
## What is a Multiset Hash?
A _multiset_ is like a set, except it can contain multiple instances of the same item. For example:
- A set might be `[apple, banana, cherry]` (with no duplicates).
- A multiset could be `[apple, apple, apple, banana, banana, cherry]` (duplicates matter).
Multiset hashing (MSH) produces a hash value that reflects both the _types_ of items you have and the _quantities_ of each item, but _not_ their order. If we hash the following using an MSH scheme, then the same hash values will be produced: `hash(apple, banana, banana, apple, apple, cherry) == hash(apple, apple, apple, banana, banana, cherry)`
We can see that the order does not matter as long as the elements and their quantities are the same.
### Why Is This Useful?
If you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.
Furthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.
## Installation
```bash
pip install pymsh
```
## Basic Usage
For most general use cases, we recommend using the **additive multiset hash** (accessible via the shortcut class `Hasher`).
1. **Prepare a multiset**: You can use our helper `list_to_multiset` if you have a Python list containing repeated elements.
2. **Hash it**: Pass the resulting dictionary (`element -> count`) into a hasher.
*Note: pymsh expects your inputs to be passed as bytes.*
Example:
```python
from pymsh import list_to_multiset, Hasher
# Suppose you have this list with repeated elements
fruit_list = [b"apple", b"apple", b"banana", b"cherry", b"banana", b"apple"]
# 1) Convert the list to a multiset:
multiset = list_to_multiset(fruit_list)
# => {b"apple": 3, b"banana": 2, b"cherry": 1}
# 2) Hash your multiset (Hasher is an alias for MSetAddHash)
msh = Hasher().hash(multiset)
print("Multiset hash:", msh)
```
That’s it! You’ll get a tuple representing the multiset, independent of how you ordered "apple, banana, cherry."
If you are using `Hasher` (`MSetAddHash`). The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key.
## Advanced Usage
`pymsh` implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to **Incremental vs. One-shot Hashing** if you don’t need these details right now.
<details>
<summary><strong>MSetXORHash</strong> (Keyed, Set-collision Resistant)</summary>
- **What it does**: A keyed hash using XOR operations internally.
- **Best for**: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).
- **Supports incremental hashing?**: Yes.
- **Uses a secret key**: Yes.
- It is **NOT** multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.
</details>
<details>
<summary><strong>MSetAddHash</strong> (Keyed, Multiset-collision Resistant)</summary>
- **What it does**: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.
- **Best for**: Most general-purpose scenarios. This is the same as the default `Hasher` class.
- **Supports incremental hashing?**: Yes.
- **Uses a secret key**: Yes.
</details>
<details>
<summary><strong>MSetMuHash</strong> (Keyless, Multiset-collision Resistant)</summary>
- **What it does**: Uses multiplication in a finite field (large prime modulus).
- **Best for**: Keyless scenarios with a short output size. Good when you want collision resistance without managing keys.
- **Supports incremental hashing?**: No.
- **Uses a secret key**: No.
</details>
<details>
<summary><strong>MSetVAddHash</strong> (Keyless, Multiset-collision Resistant)</summary>
- **What it does**: Uses vector addition space.
- **Best for**: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.
- **Supports incremental hashing?**: Yes.
- **Requires a Key**: No.
</details>
### Examples
```python
import secrets
from pymsh import (
MSetXORHash,
MSetAddHash,
MSetMuHash,
MSetVAddHash
)
# Sample secret key for keyed hashes
key = secrets.token_bytes(32)
# A sample multiset: item -> count
multiset = {
b"apple": 3,
b"banana": 2,
b"cherry": 1
}
# 1) XOR Hash (Keyed, set-collision resistant)
xor_hasher = MSetXORHash(key)
xor_result = xor_hasher.hash(multiset)
print("XOR Hash (one-shot):", xor_result)
# 2) Additive Hash (Keyed, multiset-collision resistant)
add_hasher = MSetAddHash(key)
one_shot = add_hasher.hash(multiset)
print("Additive Hash (one-shot):", one_shot)
# Incremental usage of Additive Hash
add_hasher.update(b"apple", 3)
add_hasher.update(b"banana", 2)
add_hasher.update(b"cherry", 1)
incremental_hash = add_hasher.digest()
print("Additive Hash (incremental):", incremental_hash)
assert one_shot == incremental_hash # They should match
# 3) MuHash (Keyless)
mu_hasher = MSetMuHash()
print("MuHash:", mu_hasher.hash(multiset))
# 4) Vector Add Hash (Keyless)
vadd_hasher = MSetVAddHash()
print("VAdd Hash:", vadd_hasher.hash(multiset))
```
---
## Incremental vs. One-shot Hashing
**One‐shot**: Pass a full dictionary (e.g., `{item: count}`) at once using `.hash(multiset)`.
**Incremental**: Create an instance, then call `.update(item, count)` for each new element as needed, and finally call `.digest()` to get the final hash.
## Which Should I Pick?
For most **general-purpose** tasks, use **MSetAddHash** (the default `Hasher`).
If you prefer **keyless** usage or want a smaller output size, consider **MSetMuHash**. However, if you need **incremental** and **keyless**, try **MSetVAddHash**. Here's a comparison table:
| Hash Type | Security | Key Required | Incremental | Notes |
|-----------------|-------------------|--------------|-------------|------------------------------|
| `MSetXORHash` | Set-collision resistance | Yes | Yes | Fast set verification |
| `MSetAddHash` | Multiset-collision resistance | Yes | Yes | General purpose |
| `MSetMuHash` | Multiset-collision| No | No | Keyless; short outputs |
| `MSetVAddHash` | Multiset-collision| No | Yes | Efficient, but longer hashes |
## References
1. D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. [“Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking,”](https://www.iacr.org/cryptodb/data/paper.php?pubkey=151) ASIACRYPT 2003.
## Note
This project has not been audited or verified; do not rely on this library for serious production systems.
## Contribute
Feel free to open an issue or pull request if you have questions or suggestions!
Raw data
{
"_id": null,
"home_page": null,
"name": "pymsh",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "crypto, cryptography, hash, hashing, multi-set, multiset, security",
"author": null,
"author_email": "Carlton Shepherd <carlton@linux.com>",
"download_url": "https://files.pythonhosted.org/packages/bc/a1/ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf/pymsh-1.2.1.tar.gz",
"platform": null,
"description": "# pymsh\n\n<p>\n <img alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/pymsh?color=blue\">\n <img alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/pymsh\">\n <img alt=\"PyPI - Downloads\" src=\"https://img.shields.io/pypi/dm/pymsh\">\n <img alt=\"PyPI - License\" src=\"https://img.shields.io/pypi/l/pymsh?label=license\">\n <img alt=\"Test Status\" src=\"https://github.com/cgshep/pymsh/actions/workflows/python-package.yml/badge.svg\">\n</p>\n\n**pymsh** is a Python library that provides **multiset hash (MSH)** functions. These functions let you hash collections of items _without_ worrying about the order of those items.\n\n## What is a Multiset Hash?\n\nA _multiset_ is like a set, except it can contain multiple instances of the same item. For example:\n- A set might be `[apple, banana, cherry]` (with no duplicates).\n- A multiset could be `[apple, apple, apple, banana, banana, cherry]` (duplicates matter).\n\nMultiset hashing (MSH) produces a hash value that reflects both the _types_ of items you have and the _quantities_ of each item, but _not_ their order. If we hash the following using an MSH scheme, then the same hash values will be produced: `hash(apple, banana, banana, apple, apple, cherry) == hash(apple, apple, apple, banana, banana, cherry)` \n\nWe can see that the order does not matter as long as the elements and their quantities are the same.\n\n### Why Is This Useful?\n\nIf you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.\n\nFurthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.\n\n## Installation\n\n```bash\npip install pymsh\n```\n\n## Basic Usage\n\nFor most general use cases, we recommend using the **additive multiset hash** (accessible via the shortcut class `Hasher`).\n\n1. **Prepare a multiset**: You can use our helper `list_to_multiset` if you have a Python list containing repeated elements.\n2. **Hash it**: Pass the resulting dictionary (`element -> count`) into a hasher.\n\n*Note: pymsh expects your inputs to be passed as bytes.*\n\nExample:\n```python\nfrom pymsh import list_to_multiset, Hasher\n\n# Suppose you have this list with repeated elements\nfruit_list = [b\"apple\", b\"apple\", b\"banana\", b\"cherry\", b\"banana\", b\"apple\"]\n\n# 1) Convert the list to a multiset:\nmultiset = list_to_multiset(fruit_list)\n# => {b\"apple\": 3, b\"banana\": 2, b\"cherry\": 1}\n\n# 2) Hash your multiset (Hasher is an alias for MSetAddHash)\nmsh = Hasher().hash(multiset)\nprint(\"Multiset hash:\", msh)\n```\n\nThat\u2019s it! You\u2019ll get a tuple representing the multiset, independent of how you ordered \"apple, banana, cherry.\"\n\nIf you are using `Hasher` (`MSetAddHash`). The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key.\n\n## Advanced Usage\n\n`pymsh` implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to **Incremental vs. One-shot Hashing** if you don\u2019t need these details right now.\n\n\n<details>\n<summary><strong>MSetXORHash</strong> (Keyed, Set-collision Resistant)</summary>\n\n- **What it does**: A keyed hash using XOR operations internally.\n- **Best for**: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).\n- **Supports incremental hashing?**: Yes.\n- **Uses a secret key**: Yes.\n- It is **NOT** multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.\n</details>\n\n\n<details>\n<summary><strong>MSetAddHash</strong> (Keyed, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.\n- **Best for**: Most general-purpose scenarios. This is the same as the default `Hasher` class.\n- **Supports incremental hashing?**: Yes.\n- **Uses a secret key**: Yes.\n</details>\n\n<details>\n<summary><strong>MSetMuHash</strong> (Keyless, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses multiplication in a finite field (large prime modulus).\n- **Best for**: Keyless scenarios with a short output size. Good when you want collision resistance without managing keys.\n- **Supports incremental hashing?**: No.\n- **Uses a secret key**: No.\n</details>\n\n<details>\n<summary><strong>MSetVAddHash</strong> (Keyless, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses vector addition space.\n- **Best for**: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.\n- **Supports incremental hashing?**: Yes.\n- **Requires a Key**: No.\n</details>\n\n### Examples\n\n```python\nimport secrets\n\nfrom pymsh import (\n MSetXORHash,\n MSetAddHash,\n MSetMuHash,\n MSetVAddHash\n)\n\n# Sample secret key for keyed hashes\nkey = secrets.token_bytes(32)\n\n# A sample multiset: item -> count\nmultiset = {\n b\"apple\": 3,\n b\"banana\": 2,\n b\"cherry\": 1\n}\n\n# 1) XOR Hash (Keyed, set-collision resistant)\nxor_hasher = MSetXORHash(key)\nxor_result = xor_hasher.hash(multiset)\nprint(\"XOR Hash (one-shot):\", xor_result)\n\n# 2) Additive Hash (Keyed, multiset-collision resistant)\nadd_hasher = MSetAddHash(key)\none_shot = add_hasher.hash(multiset)\nprint(\"Additive Hash (one-shot):\", one_shot)\n\n# Incremental usage of Additive Hash\nadd_hasher.update(b\"apple\", 3)\nadd_hasher.update(b\"banana\", 2)\nadd_hasher.update(b\"cherry\", 1)\nincremental_hash = add_hasher.digest()\nprint(\"Additive Hash (incremental):\", incremental_hash)\nassert one_shot == incremental_hash # They should match\n\n# 3) MuHash (Keyless)\nmu_hasher = MSetMuHash()\nprint(\"MuHash:\", mu_hasher.hash(multiset))\n\n# 4) Vector Add Hash (Keyless)\nvadd_hasher = MSetVAddHash()\nprint(\"VAdd Hash:\", vadd_hasher.hash(multiset))\n```\n\n---\n\n## Incremental vs. One-shot Hashing\n\n**One\u2010shot**: Pass a full dictionary (e.g., `{item: count}`) at once using `.hash(multiset)`.\n\n**Incremental**: Create an instance, then call `.update(item, count)` for each new element as needed, and finally call `.digest()` to get the final hash.\n\n## Which Should I Pick?\n\nFor most **general-purpose** tasks, use **MSetAddHash** (the default `Hasher`).\n\nIf you prefer **keyless** usage or want a smaller output size, consider **MSetMuHash**. However, if you need **incremental** and **keyless**, try **MSetVAddHash**. Here's a comparison table:\n\n| Hash Type | Security | Key Required | Incremental | Notes |\n|-----------------|-------------------|--------------|-------------|------------------------------|\n| `MSetXORHash` | Set-collision resistance | Yes | Yes | Fast set verification |\n| `MSetAddHash` | Multiset-collision resistance | Yes | Yes | General purpose |\n| `MSetMuHash` | Multiset-collision| No | No | Keyless; short outputs |\n| `MSetVAddHash` | Multiset-collision| No | Yes | Efficient, but longer hashes |\n\n\n## References\n\n1. D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. [\u201cIncremental Multiset Hash Functions and Their Application to Memory Integrity Checking,\u201d](https://www.iacr.org/cryptodb/data/paper.php?pubkey=151) ASIACRYPT 2003.\n\n## Note\nThis project has not been audited or verified; do not rely on this library for serious production systems.\n\n## Contribute\n\nFeel free to open an issue or pull request if you have questions or suggestions!\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "The Python multiset hashing library",
"version": "1.2.1",
"project_urls": {
"Homepage": "https://github.com/cgshep/pymsh",
"Issues": "https://github.com/cgshep/pymsh/issues"
},
"split_keywords": [
"crypto",
" cryptography",
" hash",
" hashing",
" multi-set",
" multiset",
" security"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "bf557e69abce361f8e358342763e5ab5e11afb4b7bf098dfa697311cda6ddc57",
"md5": "ea2d8aa72feb9adbb6e93f7cb02420f8",
"sha256": "cca56ec5a5bf1c417aefa0aadd2caa6f6ed64c1ab286bf50b5c6f7ec3a3382ba"
},
"downloads": -1,
"filename": "pymsh-1.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ea2d8aa72feb9adbb6e93f7cb02420f8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 9258,
"upload_time": "2025-01-29T21:37:13",
"upload_time_iso_8601": "2025-01-29T21:37:13.089703Z",
"url": "https://files.pythonhosted.org/packages/bf/55/7e69abce361f8e358342763e5ab5e11afb4b7bf098dfa697311cda6ddc57/pymsh-1.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "bca1ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf",
"md5": "bc5abfe94560f34c31f4ddb3533d56f2",
"sha256": "c664dd72449cf52d09912f615a9b36a1f1b7086e9b981e2914486291f72d0795"
},
"downloads": -1,
"filename": "pymsh-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "bc5abfe94560f34c31f4ddb3533d56f2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 12612,
"upload_time": "2025-01-29T21:37:16",
"upload_time_iso_8601": "2025-01-29T21:37:16.650575Z",
"url": "https://files.pythonhosted.org/packages/bc/a1/ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf/pymsh-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-29 21:37:16",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "cgshep",
"github_project": "pymsh",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pymsh"
}