pymsh


Namepymsh JSON
Version 1.2.1 PyPI version JSON
download
home_pageNone
SummaryThe Python multiset hashing library
upload_time2025-01-29 21:37:16
maintainerNone
docs_urlNone
authorNone
requires_python>=3.7
licenseMIT
keywords crypto cryptography hash hashing multi-set multiset security
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # pymsh

<p>
   <img alt="PyPI" src="https://img.shields.io/pypi/v/pymsh?color=blue">
   <img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/pymsh">
   <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/pymsh">
   <img alt="PyPI - License" src="https://img.shields.io/pypi/l/pymsh?label=license">
   <img alt="Test Status" src="https://github.com/cgshep/pymsh/actions/workflows/python-package.yml/badge.svg">
</p>

**pymsh** is a Python library that provides **multiset hash (MSH)** functions. These functions let you hash collections of items _without_ worrying about the order of those items.

## What is a Multiset Hash?

A _multiset_ is like a set, except it can contain multiple instances of the same item. For example:
- A set might be `[apple, banana, cherry]` (with no duplicates).
- A multiset could be `[apple, apple, apple, banana, banana, cherry]` (duplicates matter).

Multiset hashing (MSH) produces a hash value that reflects both the _types_ of items you have and the _quantities_ of each item, but _not_ their order. If we hash the following using an MSH scheme, then the same hash values will be produced: `hash(apple, banana, banana, apple, apple, cherry) == hash(apple, apple, apple, banana, banana, cherry)` 

We can see that the order does not matter as long as the elements and their quantities are the same.

### Why Is This Useful?

If you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.

Furthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.

## Installation

```bash
pip install pymsh
```

## Basic Usage

For most general use cases, we recommend using the **additive multiset hash** (accessible via the shortcut class `Hasher`).

1. **Prepare a multiset**: You can use our helper `list_to_multiset` if you have a Python list containing repeated elements.
2. **Hash it**: Pass the resulting dictionary (`element -> count`) into a hasher.

*Note: pymsh expects your inputs to be passed as bytes.*

Example:
```python
from pymsh import list_to_multiset, Hasher

# Suppose you have this list with repeated elements
fruit_list = [b"apple", b"apple", b"banana", b"cherry", b"banana", b"apple"]

# 1) Convert the list to a multiset:
multiset = list_to_multiset(fruit_list)
# => {b"apple": 3, b"banana": 2, b"cherry": 1}

# 2) Hash your multiset (Hasher is an alias for MSetAddHash)
msh = Hasher().hash(multiset)
print("Multiset hash:", msh)
```

That’s it! You’ll get a tuple representing the multiset, independent of how you ordered "apple, banana, cherry."

If you are using `Hasher` (`MSetAddHash`). The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key.

## Advanced Usage

`pymsh` implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to **Incremental vs. One-shot Hashing** if you don’t need these details right now.


<details>
<summary><strong>MSetXORHash</strong> (Keyed, Set-collision Resistant)</summary>

- **What it does**: A keyed hash using XOR operations internally.
- **Best for**: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).
- **Supports incremental hashing?**: Yes.
- **Uses a secret key**: Yes.
- It is **NOT** multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.
</details>


<details>
<summary><strong>MSetAddHash</strong> (Keyed, Multiset-collision Resistant)</summary>

- **What it does**: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.
- **Best for**: Most general-purpose scenarios. This is the same as the default `Hasher` class.
- **Supports incremental hashing?**: Yes.
- **Uses a secret key**: Yes.
</details>

<details>
<summary><strong>MSetMuHash</strong> (Keyless, Multiset-collision Resistant)</summary>

- **What it does**: Uses multiplication in a finite field (large prime modulus).
- **Best for**: Keyless scenarios with a short output size. Good when you want collision resistance without managing keys.
- **Supports incremental hashing?**: No.
- **Uses a secret key**: No.
</details>

<details>
<summary><strong>MSetVAddHash</strong> (Keyless, Multiset-collision Resistant)</summary>

- **What it does**: Uses vector addition space.
- **Best for**: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.
- **Supports incremental hashing?**: Yes.
- **Requires a Key**: No.
</details>

### Examples

```python
import secrets

from pymsh import (
    MSetXORHash,
    MSetAddHash,
    MSetMuHash,
    MSetVAddHash
)

# Sample secret key for keyed hashes
key = secrets.token_bytes(32)

# A sample multiset: item -> count
multiset = {
    b"apple": 3,
    b"banana": 2,
    b"cherry": 1
}

# 1) XOR Hash (Keyed, set-collision resistant)
xor_hasher = MSetXORHash(key)
xor_result = xor_hasher.hash(multiset)
print("XOR Hash (one-shot):", xor_result)

# 2) Additive Hash (Keyed, multiset-collision resistant)
add_hasher = MSetAddHash(key)
one_shot = add_hasher.hash(multiset)
print("Additive Hash (one-shot):", one_shot)

# Incremental usage of Additive Hash
add_hasher.update(b"apple", 3)
add_hasher.update(b"banana", 2)
add_hasher.update(b"cherry", 1)
incremental_hash = add_hasher.digest()
print("Additive Hash (incremental):", incremental_hash)
assert one_shot == incremental_hash  # They should match

# 3) MuHash (Keyless)
mu_hasher = MSetMuHash()
print("MuHash:", mu_hasher.hash(multiset))

# 4) Vector Add Hash (Keyless)
vadd_hasher = MSetVAddHash()
print("VAdd Hash:", vadd_hasher.hash(multiset))
```

---

## Incremental vs. One-shot Hashing

**One‐shot**: Pass a full dictionary (e.g., `{item: count}`) at once using `.hash(multiset)`.

**Incremental**: Create an instance, then call `.update(item, count)` for each new element as needed, and finally call `.digest()` to get the final hash.

## Which Should I Pick?

For most **general-purpose** tasks, use **MSetAddHash** (the default `Hasher`).

If you prefer **keyless** usage or want a smaller output size, consider **MSetMuHash**. However, if you need **incremental** and **keyless**, try **MSetVAddHash**. Here's a comparison table:

| Hash Type       | Security          | Key Required | Incremental | Notes                        |
|-----------------|-------------------|--------------|-------------|------------------------------|
| `MSetXORHash`   | Set-collision resistance    | Yes          | Yes         | Fast set verification        |
| `MSetAddHash`   | Multiset-collision resistance | Yes          | Yes         | General purpose              |
| `MSetMuHash`    | Multiset-collision| No           | No          | Keyless; short outputs       |
| `MSetVAddHash`  | Multiset-collision| No           | Yes         | Efficient, but longer hashes |


## References

1. D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. [“Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking,”](https://www.iacr.org/cryptodb/data/paper.php?pubkey=151) ASIACRYPT 2003.

## Note
This project has not been audited or verified; do not rely on this library for serious production systems.

## Contribute

Feel free to open an issue or pull request if you have questions or suggestions!

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pymsh",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "crypto, cryptography, hash, hashing, multi-set, multiset, security",
    "author": null,
    "author_email": "Carlton Shepherd <carlton@linux.com>",
    "download_url": "https://files.pythonhosted.org/packages/bc/a1/ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf/pymsh-1.2.1.tar.gz",
    "platform": null,
    "description": "# pymsh\n\n<p>\n   <img alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/pymsh?color=blue\">\n   <img alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/pymsh\">\n   <img alt=\"PyPI - Downloads\" src=\"https://img.shields.io/pypi/dm/pymsh\">\n   <img alt=\"PyPI - License\" src=\"https://img.shields.io/pypi/l/pymsh?label=license\">\n   <img alt=\"Test Status\" src=\"https://github.com/cgshep/pymsh/actions/workflows/python-package.yml/badge.svg\">\n</p>\n\n**pymsh** is a Python library that provides **multiset hash (MSH)** functions. These functions let you hash collections of items _without_ worrying about the order of those items.\n\n## What is a Multiset Hash?\n\nA _multiset_ is like a set, except it can contain multiple instances of the same item. For example:\n- A set might be `[apple, banana, cherry]` (with no duplicates).\n- A multiset could be `[apple, apple, apple, banana, banana, cherry]` (duplicates matter).\n\nMultiset hashing (MSH) produces a hash value that reflects both the _types_ of items you have and the _quantities_ of each item, but _not_ their order. If we hash the following using an MSH scheme, then the same hash values will be produced: `hash(apple, banana, banana, apple, apple, cherry) == hash(apple, apple, apple, banana, banana, cherry)` \n\nWe can see that the order does not matter as long as the elements and their quantities are the same.\n\n### Why Is This Useful?\n\nIf you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.\n\nFurthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.\n\n## Installation\n\n```bash\npip install pymsh\n```\n\n## Basic Usage\n\nFor most general use cases, we recommend using the **additive multiset hash** (accessible via the shortcut class `Hasher`).\n\n1. **Prepare a multiset**: You can use our helper `list_to_multiset` if you have a Python list containing repeated elements.\n2. **Hash it**: Pass the resulting dictionary (`element -> count`) into a hasher.\n\n*Note: pymsh expects your inputs to be passed as bytes.*\n\nExample:\n```python\nfrom pymsh import list_to_multiset, Hasher\n\n# Suppose you have this list with repeated elements\nfruit_list = [b\"apple\", b\"apple\", b\"banana\", b\"cherry\", b\"banana\", b\"apple\"]\n\n# 1) Convert the list to a multiset:\nmultiset = list_to_multiset(fruit_list)\n# => {b\"apple\": 3, b\"banana\": 2, b\"cherry\": 1}\n\n# 2) Hash your multiset (Hasher is an alias for MSetAddHash)\nmsh = Hasher().hash(multiset)\nprint(\"Multiset hash:\", msh)\n```\n\nThat\u2019s it! You\u2019ll get a tuple representing the multiset, independent of how you ordered \"apple, banana, cherry.\"\n\nIf you are using `Hasher` (`MSetAddHash`). The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key.\n\n## Advanced Usage\n\n`pymsh` implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to **Incremental vs. One-shot Hashing** if you don\u2019t need these details right now.\n\n\n<details>\n<summary><strong>MSetXORHash</strong> (Keyed, Set-collision Resistant)</summary>\n\n- **What it does**: A keyed hash using XOR operations internally.\n- **Best for**: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).\n- **Supports incremental hashing?**: Yes.\n- **Uses a secret key**: Yes.\n- It is **NOT** multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.\n</details>\n\n\n<details>\n<summary><strong>MSetAddHash</strong> (Keyed, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.\n- **Best for**: Most general-purpose scenarios. This is the same as the default `Hasher` class.\n- **Supports incremental hashing?**: Yes.\n- **Uses a secret key**: Yes.\n</details>\n\n<details>\n<summary><strong>MSetMuHash</strong> (Keyless, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses multiplication in a finite field (large prime modulus).\n- **Best for**: Keyless scenarios with a short output size. Good when you want collision resistance without managing keys.\n- **Supports incremental hashing?**: No.\n- **Uses a secret key**: No.\n</details>\n\n<details>\n<summary><strong>MSetVAddHash</strong> (Keyless, Multiset-collision Resistant)</summary>\n\n- **What it does**: Uses vector addition space.\n- **Best for**: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.\n- **Supports incremental hashing?**: Yes.\n- **Requires a Key**: No.\n</details>\n\n### Examples\n\n```python\nimport secrets\n\nfrom pymsh import (\n    MSetXORHash,\n    MSetAddHash,\n    MSetMuHash,\n    MSetVAddHash\n)\n\n# Sample secret key for keyed hashes\nkey = secrets.token_bytes(32)\n\n# A sample multiset: item -> count\nmultiset = {\n    b\"apple\": 3,\n    b\"banana\": 2,\n    b\"cherry\": 1\n}\n\n# 1) XOR Hash (Keyed, set-collision resistant)\nxor_hasher = MSetXORHash(key)\nxor_result = xor_hasher.hash(multiset)\nprint(\"XOR Hash (one-shot):\", xor_result)\n\n# 2) Additive Hash (Keyed, multiset-collision resistant)\nadd_hasher = MSetAddHash(key)\none_shot = add_hasher.hash(multiset)\nprint(\"Additive Hash (one-shot):\", one_shot)\n\n# Incremental usage of Additive Hash\nadd_hasher.update(b\"apple\", 3)\nadd_hasher.update(b\"banana\", 2)\nadd_hasher.update(b\"cherry\", 1)\nincremental_hash = add_hasher.digest()\nprint(\"Additive Hash (incremental):\", incremental_hash)\nassert one_shot == incremental_hash  # They should match\n\n# 3) MuHash (Keyless)\nmu_hasher = MSetMuHash()\nprint(\"MuHash:\", mu_hasher.hash(multiset))\n\n# 4) Vector Add Hash (Keyless)\nvadd_hasher = MSetVAddHash()\nprint(\"VAdd Hash:\", vadd_hasher.hash(multiset))\n```\n\n---\n\n## Incremental vs. One-shot Hashing\n\n**One\u2010shot**: Pass a full dictionary (e.g., `{item: count}`) at once using `.hash(multiset)`.\n\n**Incremental**: Create an instance, then call `.update(item, count)` for each new element as needed, and finally call `.digest()` to get the final hash.\n\n## Which Should I Pick?\n\nFor most **general-purpose** tasks, use **MSetAddHash** (the default `Hasher`).\n\nIf you prefer **keyless** usage or want a smaller output size, consider **MSetMuHash**. However, if you need **incremental** and **keyless**, try **MSetVAddHash**. Here's a comparison table:\n\n| Hash Type       | Security          | Key Required | Incremental | Notes                        |\n|-----------------|-------------------|--------------|-------------|------------------------------|\n| `MSetXORHash`   | Set-collision resistance    | Yes          | Yes         | Fast set verification        |\n| `MSetAddHash`   | Multiset-collision resistance | Yes          | Yes         | General purpose              |\n| `MSetMuHash`    | Multiset-collision| No           | No          | Keyless; short outputs       |\n| `MSetVAddHash`  | Multiset-collision| No           | Yes         | Efficient, but longer hashes |\n\n\n## References\n\n1. D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. [\u201cIncremental Multiset Hash Functions and Their Application to Memory Integrity Checking,\u201d](https://www.iacr.org/cryptodb/data/paper.php?pubkey=151) ASIACRYPT 2003.\n\n## Note\nThis project has not been audited or verified; do not rely on this library for serious production systems.\n\n## Contribute\n\nFeel free to open an issue or pull request if you have questions or suggestions!\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "The Python multiset hashing library",
    "version": "1.2.1",
    "project_urls": {
        "Homepage": "https://github.com/cgshep/pymsh",
        "Issues": "https://github.com/cgshep/pymsh/issues"
    },
    "split_keywords": [
        "crypto",
        " cryptography",
        " hash",
        " hashing",
        " multi-set",
        " multiset",
        " security"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bf557e69abce361f8e358342763e5ab5e11afb4b7bf098dfa697311cda6ddc57",
                "md5": "ea2d8aa72feb9adbb6e93f7cb02420f8",
                "sha256": "cca56ec5a5bf1c417aefa0aadd2caa6f6ed64c1ab286bf50b5c6f7ec3a3382ba"
            },
            "downloads": -1,
            "filename": "pymsh-1.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ea2d8aa72feb9adbb6e93f7cb02420f8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 9258,
            "upload_time": "2025-01-29T21:37:13",
            "upload_time_iso_8601": "2025-01-29T21:37:13.089703Z",
            "url": "https://files.pythonhosted.org/packages/bf/55/7e69abce361f8e358342763e5ab5e11afb4b7bf098dfa697311cda6ddc57/pymsh-1.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bca1ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf",
                "md5": "bc5abfe94560f34c31f4ddb3533d56f2",
                "sha256": "c664dd72449cf52d09912f615a9b36a1f1b7086e9b981e2914486291f72d0795"
            },
            "downloads": -1,
            "filename": "pymsh-1.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "bc5abfe94560f34c31f4ddb3533d56f2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 12612,
            "upload_time": "2025-01-29T21:37:16",
            "upload_time_iso_8601": "2025-01-29T21:37:16.650575Z",
            "url": "https://files.pythonhosted.org/packages/bc/a1/ec686abc00c6b7edb673658198759faadd16f2701cefd0d967b9924dffbf/pymsh-1.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-29 21:37:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cgshep",
    "github_project": "pymsh",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pymsh"
}
        
Elapsed time: 2.27449s