
:Name: hashfs2
:Version: 1.1.0
:Summary: A content-addressable file management system.
:Upload time: 2024-10-06 06:53:57
:Author: Derrick Gilland
:License: MIT License
:Keywords: hashfs hash file system content addressable fixed storage
:Requirements: flake8, pytest-cov, pytest, Sphinx, tox, twine, wheel
:Coveralls test coverage: No coveralls.

******
HashFS
******

|version| |travis| |coveralls| |license|

HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file's hash.

Typical use cases for this kind of system are ones where:

- Files are written once and never change (e.g. image storage).
- It's desirable to have no duplicate files (e.g. user uploads).
- File metadata is stored elsewhere (e.g. in a database).
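
As a rough illustration of what "saved based on the file's hash" means, the sketch below derives an ID from content using ``hashlib``. The ``content_id`` helper is hypothetical and is not part of the ``HashFS`` API.

.. code-block:: python

    import hashlib


    def content_id(data, algorithm="sha256"):
        """Return the hex digest of ``data`` (bytes). Hypothetical helper."""
        return hashlib.new(algorithm, data).hexdigest()


    first = content_id(b"some content")
    second = content_id(b"some content")
    other = content_id(b"other content")

    # Identical content always maps to the same ID; different content does not.
    print(first == second)  # True
    print(first == other)   # False

Because the ID depends only on the bytes, the same upload arriving twice resolves to the same storage location.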

Extra
=====

This package was forked from https://github.com/dgilland/hashfs/ and adds the following:

- Initializes all directories up front to improve performance.
- Supports putting a file with a given hash ID.
- Uses ``pathlib.Path`` instead of ``str`` for paths.


Features
========

- Files are stored once and never duplicated.
- Uses an efficient folder structure optimized for a large number of files. File paths are based on the content hash and are nested based on the first ``n`` characters.
- Can save files from local file paths or readable objects (open file handlers, IO buffers, etc).
- Able to repair the root folder by reindexing all files. Useful if the hashing algorithm or folder structure options change or to initialize existing files.
- Supports any hashing algorithm available via ``hashlib.new``.
- Python 2.7+/3.3+ compatible.
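
The first and third features can be sketched with the standard library alone. The ``put`` function below is a hypothetical stand-in, not the ``HashFS.put()`` implementation: it assumes a flat directory layout and reads the whole input into memory, which a real implementation would likely avoid for large files.

.. code-block:: python

    import hashlib
    import io
    import os
    import tempfile


    def put(root, source, algorithm="sha256"):
        """Store ``source`` (a file path or a binary readable object) under ``root``.

        Returns ``(file_id, is_duplicate)``. Hypothetical helper with a flat layout.
        """
        if isinstance(source, (str, os.PathLike)):
            with open(source, "rb") as handle:
                data = handle.read()
        else:
            data = source.read()
        file_id = hashlib.new(algorithm, data).hexdigest()
        target = os.path.join(root, file_id)
        is_duplicate = os.path.exists(target)
        if not is_duplicate:
            # Only write when no file with this content hash exists yet.
            os.makedirs(root, exist_ok=True)
            with open(target, "wb") as handle:
                handle.write(data)
        return file_id, is_duplicate


    root = tempfile.mkdtemp()
    first_id, first_dup = put(root, io.BytesIO(b"some content"))
    second_id, second_dup = put(root, io.BytesIO(b"some content"))
    print(first_dup, second_dup, len(os.listdir(root)))  # False True 1

Storing the same content twice leaves a single file on disk, and the second call reports a duplicate.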


Links
=====

- Project: https://github.com/dgilland/hashfs
- Documentation: http://hashfs.readthedocs.org
- PyPI: https://pypi.python.org/pypi/hashfs/
- TravisCI: https://travis-ci.org/dgilland/hashfs


Quickstart
==========

Install using pip (this fork is published on PyPI as ``hashfs2``):


::

    pip install hashfs2


Initialization
--------------

.. code-block:: python

    from hashfs import HashFS

Designate a root folder for ``HashFS``. If the folder doesn't already exist, it will be created.

.. code-block:: python

    # Set `depth` to the number of subfolder levels the file's hash is split into when saving.
    # Set `width` to the number of hash characters used for each subfolder name.
    fs = HashFS('temp_hashfs', depth=4, width=1, algorithm='sha256')

    # With depth=4 and width=1, files will be saved in the following pattern:
    # temp_hashfs/a/b/c/d/efghijklmnopqrstuvwxyz

    # With depth=3 and width=2, files will be saved in the following pattern:
    # temp_hashfs/ab/cd/ef/ghijklmnopqrstuvwxyz

**NOTE:** The ``algorithm`` value should be a valid string argument to ``hashlib.new()``.
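
To make the ``depth``/``width`` patterns above concrete, here is a minimal sketch of how a digest could be split into path segments. The ``shard`` and ``id_path`` helpers are hypothetical names used for illustration and are not taken from the ``HashFS`` source.

.. code-block:: python

    import hashlib
    import os


    def shard(digest, depth, width):
        """Split ``digest`` into ``depth`` tokens of ``width`` characters plus the remainder."""
        tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
        tokens.append(digest[depth * width:])
        # Drop empty tokens in case the digest is shorter than depth * width.
        return [token for token in tokens if token]


    def id_path(root, digest, depth=4, width=1):
        """Join ``root`` with the sharded digest to form the storage path."""
        return os.path.join(root, *shard(digest, depth, width))


    print(shard("abcdefghijklmnopqrstuvwxyz", 4, 1))
    # ['a', 'b', 'c', 'd', 'efghijklmnopqrstuvwxyz']
    print(shard("abcdefghijklmnopqrstuvwxyz", 3, 2))
    # ['ab', 'cd', 'ef', 'ghijklmnopqrstuvwxyz']

    # Any name accepted by hashlib.new() can produce the digest.
    digest = hashlib.new("sha256", b"some content").hexdigest()
    print(id_path("temp_hashfs", digest))

Spreading files across nested subfolders keeps any single directory from accumulating a very large number of entries.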

Basic Usage
===========

``HashFS`` supports basic file storage, retrieval, and removal as well as some more advanced features like file repair.

Storing Content
---------------

Add content to the folder using either readable objects (e.g. ``StringIO``) or file paths (e.g. ``'a/path/to/some/file'``).

.. code-block:: python

    from io import StringIO

    some_content = StringIO('some content')

    address = fs.put(some_content)

    # Or if you'd like to save the file with an extension...
    address = fs.put(some_content, '.txt')

    # The id of the file (i.e. the hexdigest of its contents).
    address.id

    # The absolute path where the file was saved.
    address.abspath

    # The path relative to fs.root.
    address.relpath

    # Whether the file previously existed.
    address.is_duplicate

Retrieving File Address
-----------------------

Get a file's ``HashAddress`` by address ID or path. This address would be identical to the address returned by ``put()``.

.. code-block:: python

    assert fs.get(address.id) == address
    assert fs.get(address.relpath) == address
    assert fs.get(address.abspath) == address
    assert fs.get('invalid') is None

Retrieving Content
------------------

Get a ``BufferedReader`` handler for an existing file by address ID or path.

.. code-block:: python

    fileio = fs.open(address.id)

    # Or using the full path...
    fileio = fs.open(address.abspath)

    # Or using a path relative to fs.root
    fileio = fs.open(address.relpath)

**NOTE:** When getting a file that was saved with an extension, it's not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.
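
One way such an extension-agnostic lookup could work is sketched below. ``find_by_id`` is a hypothetical helper that assumes a flat directory; it is not the lookup code used by ``HashFS``.

.. code-block:: python

    import glob
    import os
    import tempfile


    def find_by_id(root, file_id):
        """Return the stored path for ``file_id`` regardless of extension, or None."""
        exact = os.path.join(root, file_id)
        if os.path.isfile(exact):
            return exact
        # Fall back to any file named "<file_id>.<extension>".
        matches = glob.glob(glob.escape(exact) + ".*")
        return matches[0] if matches else None


    root = tempfile.mkdtemp()
    stored = os.path.join(root, "abc123.txt")
    with open(stored, "w") as handle:
        handle.write("some content")

    print(find_by_id(root, "abc123") == stored)  # True
    print(find_by_id(root, "missing"))           # None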

Removing Content
----------------

Delete a file by address ID or path.

.. code-block:: python

    fs.delete(address.id)
    fs.delete(address.abspath)
    fs.delete(address.relpath)

**NOTE:** When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.
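
The pruning behavior described in the note can be approximated as follows. ``delete_and_prune`` is a hypothetical helper, and keeping the root directory itself in place is an assumption of this sketch rather than documented ``HashFS`` behavior.

.. code-block:: python

    import os
    import tempfile


    def delete_and_prune(root, path):
        """Remove ``path``, then remove any empty parent directories below ``root``."""
        root = os.path.realpath(root)
        path = os.path.realpath(path)
        os.remove(path)
        parent = os.path.dirname(path)
        while parent != root and parent.startswith(root + os.sep):
            try:
                os.rmdir(parent)  # Raises OSError if the directory is not empty.
            except OSError:
                break
            parent = os.path.dirname(parent)


    root = tempfile.mkdtemp()
    nested = os.path.join(root, "a", "b", "c")
    os.makedirs(nested)
    target = os.path.join(nested, "defg")
    with open(target, "w") as handle:
        handle.write("some content")

    delete_and_prune(root, target)
    print(os.listdir(root))  # []

The loop stops at the first non-empty directory, so sibling files stored under a shared prefix are left untouched.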

Advanced Usage
==============

Below are some of the more advanced features of ``HashFS``.

Repairing Files
---------------

The ``HashFS`` files may not always be in sync with its ``depth``, ``width``, or ``algorithm`` settings (e.g. if ``HashFS`` takes ownership of a directory that wasn't previously stored using content hashes or if the ``HashFS`` settings change). These files can be reindexed using ``repair()``.

.. code-block:: python

    repaired = fs.repair()

    # Or if you want to drop file extensions...
    repaired = fs.repair(extensions=False)

**WARNING:** It's recommended that a backup of the directory be made before repairing just in case something goes wrong.
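
One simple way to follow this advice is to copy the directory before repairing it. The ``backup_directory`` helper and the ``.bak`` suffix below are assumptions made for illustration; ``HashFS`` does not provide them.

.. code-block:: python

    import os
    import shutil
    import tempfile


    def backup_directory(root, suffix=".bak"):
        """Copy ``root`` to a sibling directory and return the backup path."""
        backup = root.rstrip(os.sep) + suffix
        shutil.copytree(root, backup)  # Raises if the backup already exists.
        return backup


    root = os.path.join(tempfile.mkdtemp(), "temp_hashfs")
    os.makedirs(os.path.join(root, "a", "b"))
    with open(os.path.join(root, "a", "b", "cdef"), "w") as handle:
        handle.write("some content")

    backup = backup_directory(root)
    # At this point fs.repair() could be run against ``root``.
    print(os.path.isfile(os.path.join(backup, "a", "b", "cdef")))  # True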

Walking Corrupted Files
-----------------------

Instead of actually repairing the files, you can iterate over them for custom processing.

.. code-block:: python

    for corrupted_path, expected_address in fs.corrupted():
        # do something
        print(corrupted_path, expected_address)

**WARNING:** ``HashFS.corrupted()`` is a generator so be aware that modifying the file system while iterating could have unexpected results.
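
A common way to sidestep this is to materialize the generator into a list before changing anything on disk. The sketch below uses hypothetical stand-ins (``expected_relpath`` and ``misplaced``) built on ``os.walk`` rather than the real ``HashFS.corrupted()``, and it drops file extensions when computing the expected path.

.. code-block:: python

    import hashlib
    import os
    import tempfile


    def expected_relpath(data, depth=2, width=1, algorithm="sha256"):
        """Return the sharded relative path for ``data`` (bytes)."""
        digest = hashlib.new(algorithm, data).hexdigest()
        tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
        return os.path.join(*(tokens + [digest[depth * width:]]))


    def misplaced(root):
        """Yield ``(path, expected_path)`` for files not stored where their hash says."""
        for folder, _, names in os.walk(root):
            for name in names:
                path = os.path.join(folder, name)
                with open(path, "rb") as handle:
                    expected = os.path.join(root, expected_relpath(handle.read()))
                if path != expected:
                    yield path, expected


    root = tempfile.mkdtemp()
    with open(os.path.join(root, "stray.bin"), "wb") as handle:
        handle.write(b"some content")

    # Snapshot the generator first so that moving files cannot disturb the walk.
    for path, expected in list(misplaced(root)):
        os.makedirs(os.path.dirname(expected), exist_ok=True)
        os.replace(path, expected)

    print(list(misplaced(root)))  # []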

Walking All Files
-----------------

Iterate over files.

.. code-block:: python

    for file in fs.files():
        # do something
        print(file)

    # Or using the class' iter method...
    for file in fs:
        # do something
        print(file)

Iterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).

.. code-block:: python

    for folder in fs.folders():
        # do something
        print(folder)
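
To show which folders qualify, the following sketch applies the same rule with ``os.walk``. ``folders_with_files`` is a hypothetical helper and not the ``HashFS.folders()`` implementation.

.. code-block:: python

    import os
    import tempfile


    def folders_with_files(root):
        """Yield folders under ``root`` that directly contain at least one file."""
        for folder, _, names in os.walk(root):
            if names:
                yield folder


    root = tempfile.mkdtemp()
    leaf = os.path.join(root, "a", "b")
    os.makedirs(leaf)
    with open(os.path.join(leaf, "cdef"), "w") as handle:
        handle.write("some content")

    # Only the leaf folder is reported; ``root`` and ``root/a`` hold only folders.
    print(list(folders_with_files(root)) == [leaf])  # True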

Computing Size
--------------

Compute the size in bytes of all files in the ``root`` directory.

.. code-block:: python

    total_bytes = fs.size()

Count the total number of files.

.. code-block:: python

    total_files = fs.count()

    # Or via len()...
    total_files = len(fs)
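
For intuition, both numbers can be derived from a plain directory walk. The helpers below (``walk_files``, ``total_size``, ``total_count``) are hypothetical and only approximate what ``size()`` and ``count()`` report.

.. code-block:: python

    import os
    import tempfile


    def walk_files(root):
        """Yield the path of every file under ``root``."""
        for folder, _, names in os.walk(root):
            for name in names:
                yield os.path.join(folder, name)


    def total_size(root):
        """Sum the size in bytes of all files under ``root``."""
        return sum(os.path.getsize(path) for path in walk_files(root))


    def total_count(root):
        """Count all files under ``root``."""
        return sum(1 for _ in walk_files(root))


    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "a"))
    for name, data in (("bcd", b"12345"), ("efg", b"678")):
        with open(os.path.join(root, "a", name), "wb") as handle:
            handle.write(data)

    print(total_size(root), total_count(root))  # 8 2

Both helpers walk the whole tree on every call, so the cost grows with the number of stored files.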

For more details, please see the full documentation at http://hashfs.readthedocs.org.

.. |version| image:: http://img.shields.io/pypi/v/hashfs.svg?style=flat-square
    :target: https://pypi.python.org/pypi/hashfs/

.. |travis| image:: http://img.shields.io/travis/dgilland/hashfs/master.svg?style=flat-square
    :target: https://travis-ci.org/dgilland/hashfs

.. |coveralls| image:: http://img.shields.io/coveralls/dgilland/hashfs/master.svg?style=flat-square
    :target: https://coveralls.io/r/dgilland/hashfs

.. |license| image:: http://img.shields.io/pypi/l/hashfs.svg?style=flat-square
    :target: https://pypi.python.org/pypi/hashfs/


Changelog
=========


v0.7.2 (2019-10-24)
-------------------

- Fix out-of-memory issue when computing file ID hashes of large files.


v0.7.1 (2018-10-13)
-------------------

- Replace usage of ``distutils.dir_util.mkpath`` with ``os.makedirs``.


v0.7.0 (2016-04-19)
-------------------

- Use ``shutil.move`` instead of ``shutil.copy`` to move temporary file created during ``put`` operation to ``HashFS`` directory.


v0.6.0 (2015-10-19)
-------------------

- Add faster ``scandir`` package for iterating over files/folders when platform is Python < 3.5. Scandir implementation was added to ``os`` module starting with Python 3.5.


v0.5.0 (2015-07-02)
-------------------

- Rename private method ``HashFS.copy`` to ``HashFS._copy``.
- Add ``is_duplicate`` attribute to ``HashAddress``.
- Make ``HashFS.put()`` return ``HashAddress`` with ``is_duplicate=True`` when file with same hash already exists on disk.


v0.4.0 (2015-06-03)
-------------------

- Add ``HashFS.size()`` method that returns the size of all files in bytes.
- Add ``HashFS.count()``/``HashFS.__len__()`` methods that return the count of all files.
- Add ``HashFS.__iter__()`` method to support iteration. Proxies to ``HashFS.files()``.
- Add ``HashFS.__contains__()`` method to support ``in`` operator. Proxies to ``HashFS.exists()``.
- Don't create the root directory (if it doesn't exist) until at least one file has been added.
- Fix ``HashFS.repair()`` not using ``extensions`` argument properly.


v0.3.0 (2015-06-02)
-------------------

- Rename ``HashFS.length`` parameter/property to ``width``. (**breaking change**)


v0.2.0 (2015-05-29)
-------------------

- Rename ``HashFS.get`` to ``HashFS.open``. (**breaking change**)
- Add ``HashFS.get()`` method that returns a ``HashAddress`` or ``None`` given a file ID or path.


v0.1.0 (2015-05-28)
-------------------

- Add ``HashFS.get()`` method that retrieves a reader object given a file ID or path.
- Add ``HashFS.delete()`` method that deletes a file ID or path.
- Add ``HashFS.folders()`` method that returns the folder paths that directly contain files (i.e. subpaths that only contain folders are ignored).
- Add ``HashFS.detokenize()`` method that returns the file ID contained in a file path.
- Add ``HashFS.repair()`` method that reindexes any files under root directory whose file path does not match its tokenized file ID.
- Rename ``Address`` class to ``HashAddress``. (**breaking change**)
- Rename ``HashAddress.digest`` to ``HashAddress.id``. (**breaking change**)
- Rename ``HashAddress.path`` to ``HashAddress.abspath``. (**breaking change**)
- Add ``HashAddress.relpath`` which represents path relative to ``HashFS.root``.


v0.0.1 (2015-05-27)
-------------------

- First release.
- Add ``HashFS`` class.
- Add ``HashFS.put()`` method that saves a file path or file-like object by content hash.
- Add ``HashFS.files()`` method that returns all files under root directory.
- Add ``HashFS.exists()`` which checks either a file hash or file path for existence.


Raw data

    "_id": null,
    "home_page": "https://github.com/ramwin/hashfs",
    "name": "hashfs2",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "hashfs hash file system content addressable fixed storage",
    "author": "Derrick Gilland",
    "author_email": "dgilland@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/18/0a/f96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac/hashfs2-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A content-addressable file management system.",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/ramwin/hashfs"
    "split_keywords": [
    "urls": [
            "comment_text": "",
            "digests": {
                "blake2b_256": "128d847c8543ec998f7b08034ce01dd4d6f4b2f058c42aca83f4bb3be9056072",
                "md5": "550c58d5929a8ae30b926480b26e7107",
                "sha256": "b9fc7ac601482172f0a10828c982969b15962d2a39c06aa34a8aeab31939c2b0"
            "downloads": -1,
            "filename": "hashfs2-1.1.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "550c58d5929a8ae30b926480b26e7107",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 12720,
            "upload_time": "2024-10-06T06:53:55",
            "upload_time_iso_8601": "2024-10-06T06:53:55.572589Z",
            "url": "https://files.pythonhosted.org/packages/12/8d/847c8543ec998f7b08034ce01dd4d6f4b2f058c42aca83f4bb3be9056072/hashfs2-1.1.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
            "comment_text": "",
            "digests": {
                "blake2b_256": "180af96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac",
                "md5": "222ff124a08129d6f9edacb85d34adc6",
                "sha256": "04d730ae1584b1b41938342348c07c6ddb5a42cbc04fb3dd6c5b7da486750617"
            "downloads": -1,
            "filename": "hashfs2-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "222ff124a08129d6f9edacb85d34adc6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 24694,
            "upload_time": "2024-10-06T06:53:57",
            "upload_time_iso_8601": "2024-10-06T06:53:57.581190Z",
            "url": "https://files.pythonhosted.org/packages/18/0a/f96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac/hashfs2-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
    "upload_time": "2024-10-06 06:53:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ramwin",
    "github_project": "hashfs",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
            "name": "flake8",
            "specs": []
            "name": "pytest-cov",
            "specs": []
            "name": "pytest",
            "specs": []
            "name": "Sphinx",
            "specs": []
            "name": "tox",
            "specs": []
            "name": "twine",
            "specs": []
            "name": "wheel",
            "specs": []
    "tox": true,
    "lcname": "hashfs2"