imohash
=======

imohash is a fast, constant-time hashing library. It uses file
size and sampling to calculate hashes quickly, regardless of file size.
It was originally released as a `Go library <https://github.com/kalafut/imohash>`__.

``imosum`` is a sample application that hashes files from the command line,
similar to ``md5sum``.

Alternative implementations
---------------------------
* **Go**: https://github.com/kalafut/imohash
* **Java**: https://github.com/dynatrace-oss/hash4j
* **Rust**: https://github.com/hiql/imohash

Installation
------------

``pip install imohash``

Usage
-----

As a library:

.. code-block:: python

    from imohash import hashfile, hashfileobject

    hashfile('foo.txt')
    'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

    hashfile('foo.txt', hexdigest=True)
    'a608658926d8aa86b3db8208ad279bfe'

    # hash the whole file if it is smaller than 200000 bytes. Default is 128K
    hashfile('foo.txt', sample_threshhold=200000)
    'x86\x9dE\x0e3LI\x83\r~\xa3O\x9b\xbd\xd3[E'

    # use samples of 1000 bytes. Default is 16K
    hashfile('foo.txt', sample_size=1000)
    'E\x0e3LI\x83\r~\xa3O\x9b\xbd\xd3[E\x23\x25'

    # hash an already opened file.
    # note: the file-like object must be opened in binary mode. Text mode
    #       behavior is undefined (and will likely raise an exception)
    with open('foo.txt', 'rb') as f:
        hashfileobject(f)
    'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

    # hash a file on a remote server
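    import paramiko
    ssh = paramiko.SSHClient()
    # connect() rejects hosts with unknown keys by default, so load known hosts first
    ssh.load_system_host_keys()
    ssh.connect('host', username='username', password='verysecurepassword')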
    ftp = ssh.open_sftp()
    hashfileobject(ftp.file('/path/to/remote/file/foo.txt'))
    'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

Or from the command line:

``imosum *.jpg``

Uses
----

Because imohash only reads a small portion of a file's data, it is very
fast and well suited to file synchronization and deduplication,
especially over a fairly slow network. A need to manage media (photos
and video) over Wi-Fi between a NAS and multiple family computers is how
the library was born.

If you just need to check whether two files are the same, and understand
the limitations that sampling imposes (see below), imohash may be a good
fit.
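
As a rough illustration of the deduplication use case, the sketch below groups
paths whose hashes match. ``group_by_hash`` and its ``hash_func`` parameter are
hypothetical names chosen for this example and are not part of imohash; any
function that maps a path to a hashable digest will do.

.. code-block:: python

    from collections import defaultdict


    def group_by_hash(paths, hash_func):
        """Return lists of paths that share a hash (hypothetical helper)."""
        groups = defaultdict(list)
        for path in paths:
            # hash_func is supplied by the caller, e.g. a wrapper around hashfile
            groups[hash_func(path)].append(path)
        # Keep only hashes seen more than once: these are duplicate candidates.
        return [group for group in groups.values() if len(group) > 1]

With imohash installed, ``hash_func`` could be
``lambda p: hashfile(p, hexdigest=True)``. Matching hashes only suggest that
files are identical, so a byte-for-byte comparison is a sensible confirmation
step before deleting anything.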

Misuses
-------

Because imohash only reads a small portion of a file's data, it is not
suitable for:

-  file verification or integrity monitoring
-  cases where fixed-size files are manipulated
-  anything cryptographic
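
The fixed-size case can be illustrated with a small, self-contained sketch. It
does not call imohash; ``sample`` is a hypothetical stand-in that mimics
beginning/middle/end sampling with 16K chunks, and the exact middle offset is an
assumption. A one-byte change that keeps the size and falls between the sampled
regions leaves the sampled data untouched, so a sampling hash could not see it.

.. code-block:: python

    CHUNK = 16 * 1024  # default sample size mentioned in this document


    def sample(data, chunk=CHUNK):
        """Hypothetical stand-in: bytes a sampling hash would look at."""
        mid = len(data) // 2  # assumed middle offset
        return data[:chunk] + data[mid:mid + chunk] + data[-chunk:]


    original = bytes(1024 * 1024)  # 1 MiB of zero bytes
    modified = bytearray(original)
    modified[100_000] = 0xFF       # lies after the first chunk, before the middle one
    modified = bytes(modified)

    print(original == modified)                  # False: contents differ
    print(len(original) == len(modified))        # True: size is unchanged
    print(sample(original) == sample(modified))  # True: samples are identical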

Design
------

(Note: a more precise description is provided in the `algorithm
description <https://github.com/kalafut/imohash/blob/master/algorithm.md>`__.)

imohash works by hashing small chunks of data from the beginning,
middle and end of a file. It also incorporates the file size into the
final 128-bit hash. This approach is based on a few assumptions which
will vary by application. First, file size alone *tends* (1) to be a
pretty good differentiator, especially as file size increases. And when
people do things to files (such as editing photos), size tends to
change. So size is used directly in the hash, and **any files that have
different sizes will have different hashes**.
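
One way to see why differing sizes force differing hashes is to write the size
into the digest itself. The sketch below assumes the size is stored as an
unsigned varint that overwrites the leading bytes of the 128-bit hash, which is
our reading of the linked algorithm description; ``encode_uvarint`` and
``embed_size`` are hypothetical helper names, not imohash functions.

.. code-block:: python

    def encode_uvarint(n):
        """Encode a non-negative int as a little-endian base-128 varint."""
        out = bytearray()
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            n >>= 7
        out.append(n)
        return bytes(out)


    def embed_size(size, hash16):
        """Overwrite the start of a 16-byte hash with the encoded size (assumed layout)."""
        prefix = encode_uvarint(size)
        return prefix + hash16[len(prefix):]


    digest = embed_size(300, bytes(16))
    print(digest.hex())  # starts with "ac02", the varint encoding of 300

Because the varint encoding is unique for each size, two digests built this way
can only be equal if the sizes are equal.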

Size is an effective differentiator but isn't sufficient. It can show
that two files aren't the same, but to increase confidence that
like-size files are the same, a few segments are hashed using
`murmur3 <https://en.wikipedia.org/wiki/MurmurHash>`__, a fast and
effective hashing algorithm. By default, 16K chunks from the beginning,
middle and end of the file are used. The ends of files often contain
metadata which is more prone to changing without affecting file size.
The middle is for good measure. The sample size can be changed for your
application.

Note (1): try ``du -a . | sort -nr | less`` on a sample of your files to check
this assertion.
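
The sampling step described above can be sketched with the standard library
alone. This is an illustration under stated assumptions, not the library's
code: ``sampled_bytes`` is a hypothetical name, the middle chunk is assumed to
start at ``size // 2``, and the murmur3 hashing of the collected bytes is left
out.

.. code-block:: python

    import io
    import os

    SAMPLE_THRESHOLD = 128 * 1024  # default small-file cutoff stated in this document
    SAMPLE_SIZE = 16 * 1024        # default chunk size stated in this document


    def sampled_bytes(f, sample_threshold=SAMPLE_THRESHOLD, sample_size=SAMPLE_SIZE):
        """Return (size, data): the file size and the bytes that would be hashed."""
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(0)
        if size < sample_threshold or sample_size < 1:
            return size, f.read()             # small file: use everything
        head = f.read(sample_size)            # beginning
        f.seek(size // 2)                     # assumed middle offset
        middle = f.read(sample_size)
        f.seek(max(size - sample_size, 0))    # end
        tail = f.read(sample_size)
        return size, head + middle + tail


    size, data = sampled_bytes(io.BytesIO(bytes(256 * 1024)))
    print(size, len(data))  # 262144 49152

The file object must support ``seek`` and ``tell`` and be opened in binary
mode, matching the note in the usage section.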

Small file exemption
~~~~~~~~~~~~~~~~~~~~

Small files are more likely to collide on size than large ones. They're
also probably more likely to change in subtle ways that sampling will
miss (e.g. editing a large text file). For this reason, imohash
hashes the entire file if it is smaller than 128K. This threshold is
also configurable through the ``sample_threshhold`` parameter.

Performance
-----------

The standard hash performance metrics make no sense for imohash since
it's only reading a limited set of the data. That said, the real-world
performance is very good. If you are working with large files and/or a
slow network, expect huge speedups. (**spoiler**: reading 48K is quicker
than reading 500MB.)
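
The 48K figure follows from three 16K samples. A small sketch of that
arithmetic, assuming the defaults given in this document (``bytes_read`` is a
hypothetical helper, not part of imohash):

.. code-block:: python

    def bytes_read(file_size, sample_threshold=128 * 1024, sample_size=16 * 1024):
        """Approximate number of bytes read for a file of the given size."""
        if file_size < sample_threshold:
            return file_size        # small files are read in full
        return 3 * sample_size      # beginning, middle and end samples


    size = 500 * 1024 * 1024        # a 500MB file
    read = bytes_read(size)
    print(read // 1024)             # 48 (K)
    print(f"{read / size:.6%}")     # fraction of the file that is read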

Name
----

Inspired by `ILS marker
beacons <https://en.wikipedia.org/wiki/Marker_beacon>`__.

Credits
-------

-  The "sparseFingerprints" used in
   `TMSU <https://github.com/oniony/TMSU>`__ gave me some confidence in
   this approach to hashing.
-  Sébastien Paolacci's
   `murmur3 <https://github.com/spaolacci/murmur3>`__ library does all
   of the heavy lifting in the Go version.
-  As does Hajime Senuma's
   `mmh3 <https://github.com/hajimes/mmh3>`__ library for the Python version.
