******
HashFS
******
|version| |travis| |coveralls| |license|
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file's hash.
Typical use cases for this kind of system are ones where:
- Files are written once and never change (e.g. image storage).
- It's desirable to have no duplicate files (e.g. user uploads).
- File metadata is stored elsewhere (e.g. in a database).
Extra
=====
this was forked from https://github.com/dgilland/hashfs/ and it supports:
- init all directory to optimize the performance
- support put file with hashid
- change str to pathlib.Path
Features
========
- Files are stored once and never duplicated.
- Uses an efficient folder structure optimized for a large number of files. File paths are based on the content hash and are nested based on the first ``n`` number of characters.
- Can save files from local file paths or readable objects (open file handlers, IO buffers, etc).
- Able to repair the root folder by reindexing all files. Useful if the hashing algorithm or folder structure options change or to initialize existing files.
- Supports any hashing algorithm available via ``hashlib.new``.
- Python 2.7+/3.3+ compatible.
Links
=====
- Project: https://github.com/dgilland/hashfs
- Documentation: http://hashfs.readthedocs.org
- PyPI: https://pypi.python.org/pypi/hashfs/
- TravisCI: https://travis-ci.org/dgilland/hashfs
Quickstart
==========
Install using pip:
::
pip install hashfs
Initialization
--------------
.. code-block:: python
from hashfs import HashFS
Designate a root folder for ``HashFS``. If the folder doesn't already exist, it will be created.
.. code-block:: python
# Set the `depth` to the number of subfolders the file's hash should be split when saving.
# Set the `width` to the desired width of each subfolder.
fs = HashFS('temp_hashfs', depth=4, width=1, algorithm='sha256')
# With depth=4 and width=1, files will be saved in the following pattern:
# temp_hashfs/a/b/c/d/efghijklmnopqrstuvwxyz
# With depth=3 and width=2, files will be saved in the following pattern:
# temp_hashfs/ab/cd/ef/ghijklmnopqrstuvwxyz
**NOTE:** The ``algorithm`` value should be a valid string argument to ``hashlib.new()``.
Basic Usage
===========
``HashFS`` supports basic file storage, retrieval, and removal as well as some more advanced features like file repair.
Storing Content
---------------
Add content to the folder using either readable objects (e.g. ``StringIO``) or file paths (e.g. ``'a/path/to/some/file'``).
.. code-block:: python
from io import StringIO
some_content = StringIO('some content')
address = fs.put(some_content)
# Or if you'd like to save the file with an extension...
address = fs.put(some_content, '.txt')
# The id of the file (i.e. the hexdigest of its contents).
address.id
# The absolute path where the file was saved.
address.abspath
# The path relative to fs.root.
address.relpath
# Whether the file previously existed.
address.is_duplicate
Retrieving File Address
-----------------------
Get a file's ``HashAddress`` by address ID or path. This address would be identical to the address returned by ``put()``.
.. code-block:: python
assert fs.get(address.id) == address
assert fs.get(address.relpath) == address
assert fs.get(address.abspath) == address
assert fs.get('invalid') is None
Retrieving Content
------------------
Get a ``BufferedReader`` handler for an existing file by address ID or path.
.. code-block:: python
fileio = fs.open(address.id)
# Or using the full path...
fileio = fs.open(address.abspath)
# Or using a path relative to fs.root
fileio = fs.open(address.relpath)
**NOTE:** When getting a file that was saved with an extension, it's not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.
Removing Content
----------------
Delete a file by address ID or path.
.. code-block:: python
fs.delete(address.id)
fs.delete(address.abspath)
fs.delete(address.relpath)
**NOTE:** When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.
Advanced Usage
==============
Below are some of the more advanced features of ``HashFS``.
Repairing Files
---------------
The ``HashFS`` files may not always be in sync with it's ``depth``, ``width``, or ``algorithm`` settings (e.g. if ``HashFS`` takes ownership of a directory that wasn't previously stored using content hashes or if the ``HashFS`` settings change). These files can be easily reindexed using ``repair()``.
.. code-block:: python
repaired = fs.repair()
# Or if you want to drop file extensions...
repaired = fs.repair(extensions=False)
**WARNING:** It's recommended that a backup of the directory be made before repairing just in case something goes wrong.
Walking Corrupted Files
-----------------------
Instead of actually repairing the files, you can iterate over them for custom processing.
.. code-block:: python
for corrupted_path, expected_address in fs.corrupted():
# do something
**WARNING:** ``HashFS.corrupted()`` is a generator so be aware that modifying the file system while iterating could have unexpected results.
Walking All Files
-----------------
Iterate over files.
.. code-block:: python
for file in fs.files():
# do something
# Or using the class' iter method...
for file in fs:
# do something
Iterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).
.. code-block:: python
for folder in fs.folders():
# do something
Computing Size
--------------
Compute the size in bytes of all files in the ``root`` directory.
.. code-block:: python
total_bytes = fs.size()
Count the total number of files.
.. code-block:: python
total_files = fs.count()
# Or via len()...
total_files = len(fs)
For more details, please see the full documentation at http://hashfs.readthedocs.org.
.. |version| image:: http://img.shields.io/pypi/v/hashfs.svg?style=flat-square
:target: https://pypi.python.org/pypi/hashfs/
.. |travis| image:: http://img.shields.io/travis/dgilland/hashfs/master.svg?style=flat-square
:target: https://travis-ci.org/dgilland/hashfs
.. |coveralls| image:: http://img.shields.io/coveralls/dgilland/hashfs/master.svg?style=flat-square
:target: https://coveralls.io/r/dgilland/hashfs
.. |license| image:: http://img.shields.io/pypi/l/hashfs.svg?style=flat-square
:target: https://pypi.python.org/pypi/hashfs/
Changelog
=========
v0.7.2 (2019-10-24)
-------------------
- Fix out-of-memory issue when computing file ID hashes of large files.
v0.7.1 (2018-10-13)
-------------------
- Replace usage of ``distutils.dir_util.mkpath`` with ``os.path.makedirs``.
v0.7.0 (2016-04-19)
-------------------
- Use ``shutil.move`` instead of ``shutil.copy`` to move temporary file created during ``put`` operation to ``HashFS`` directory.
v0.6.0 (2015-10-19)
-------------------
- Add faster ``scandir`` package for iterating over files/folders when platform is Python < 3.5. Scandir implementation was added to ``os`` module starting with Python 3.5.
v0.5.0 (2015-07-02)
-------------------
- Rename private method ``HashFS.copy`` to ``HashFS._copy``.
- Add ``is_duplicate`` attribute to ``HashAddress``.
- Make ``HashFS.put()`` return ``HashAddress`` with ``is_duplicate=True`` when file with same hash already exists on disk.
v0.4.0 (2015-06-03)
-------------------
- Add ``HashFS.size()`` method that returns the size of all files in bytes.
- Add ``HashFS.count()``/``HashFS.__len__()`` methods that return the count of all files.
- Add ``HashFS.__iter__()`` method to support iteration. Proxies to ``HashFS.files()``.
- Add ``HashFS.__contains__()`` method to support ``in`` operator. Proxies to ``HashFS.exists()``.
- Don't create the root directory (if it doesn't exist) until at least one file has been added.
- Fix ``HashFS.repair()`` not using ``extensions`` argument properly.
v0.3.0 (2015-06-02)
-------------------
- Rename ``HashFS.length`` parameter/property to ``width``. (**breaking change**)
v0.2.0 (2015-05-29)
-------------------
- Rename ``HashFS.get`` to ``HashFS.open``. (**breaking change**)
- Add ``HashFS.get()`` method that returns a ``HashAddress`` or ``None`` given a file ID or path.
v0.1.0 (2015-05-28)
-------------------
- Add ``HashFS.get()`` method that retrieves a reader object given a file ID or path.
- Add ``HashFS.delete()`` method that deletes a file ID or path.
- Add ``HashFS.folders()`` method that returns the folder paths that directly contain files (i.e. subpaths that only contain folders are ignored).
- Add ``HashFS.detokenize()`` method that returns the file ID contained in a file path.
- Add ``HashFS.repair()`` method that reindexes any files under root directory whose file path doesn't not match its tokenized file ID.
- Rename ``Address`` classs to ``HashAddress``. (**breaking change**)
- Rename ``HashAddress.digest`` to ``HashAddress.id``. (**breaking change**)
- Rename ``HashAddress.path`` to ``HashAddress.abspath``. (**breaking change**)
- Add ``HashAddress.relpath`` which represents path relative to ``HashFS.root``.
v0.0.1 (2015-05-27)
-------------------
- First release.
- Add ``HashFS`` class.
- Add ``HashFS.put()`` method that saves a file path or file-like object by content hash.
- Add ``HashFS.files()`` method that returns all files under root directory.
- Add ``HashFS.exists()`` which checks either a file hash or file path for existence.
Raw data
{
"_id": null,
"home_page": "https://github.com/ramwin/hashfs",
"name": "hashfs2",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "hashfs hash file system content addressable fixed storage",
"author": "Derrick Gilland",
"author_email": "dgilland@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/18/0a/f96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac/hashfs2-1.1.0.tar.gz",
"platform": null,
"description": "******\nHashFS\n******\n\n|version| |travis| |coveralls| |license|\n\n\nHashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file's hash.\n\nTypical use cases for this kind of system are ones where:\n\n- Files are written once and never change (e.g. image storage).\n- It's desirable to have no duplicate files (e.g. user uploads).\n- File metadata is stored elsewhere (e.g. in a database).\n\nExtra\n=====\nthis was forked from https://github.com/dgilland/hashfs/ and it supports:\n\n- init all directory to optimize the performance\n- support put file with hashid\n- change str to pathlib.Path\n\nFeatures\n========\n\n- Files are stored once and never duplicated.\n- Uses an efficient folder structure optimized for a large number of files. File paths are based on the content hash and are nested based on the first ``n`` number of characters.\n- Can save files from local file paths or readable objects (open file handlers, IO buffers, etc).\n- Able to repair the root folder by reindexing all files. Useful if the hashing algorithm or folder structure options change or to initialize existing files.\n- Supports any hashing algorithm available via ``hashlib.new``.\n- Python 2.7+/3.3+ compatible.\n\n\nLinks\n=====\n\n- Project: https://github.com/dgilland/hashfs\n- Documentation: http://hashfs.readthedocs.org\n- PyPI: https://pypi.python.org/pypi/hashfs/\n- TravisCI: https://travis-ci.org/dgilland/hashfs\n\n\nQuickstart\n==========\n\nInstall using pip:\n\n\n::\n\n pip install hashfs\n\n\nInitialization\n--------------\n\n.. code-block:: python\n\n from hashfs import HashFS\n\n\nDesignate a root folder for ``HashFS``. If the folder doesn't already exist, it will be created.\n\n\n.. code-block:: python\n\n # Set the `depth` to the number of subfolders the file's hash should be split when saving.\n # Set the `width` to the desired width of each subfolder.\n fs = HashFS('temp_hashfs', depth=4, width=1, algorithm='sha256')\n\n # With depth=4 and width=1, files will be saved in the following pattern:\n # temp_hashfs/a/b/c/d/efghijklmnopqrstuvwxyz\n\n # With depth=3 and width=2, files will be saved in the following pattern:\n # temp_hashfs/ab/cd/ef/ghijklmnopqrstuvwxyz\n\n\n**NOTE:** The ``algorithm`` value should be a valid string argument to ``hashlib.new()``.\n\n\nBasic Usage\n===========\n\n``HashFS`` supports basic file storage, retrieval, and removal as well as some more advanced features like file repair.\n\n\nStoring Content\n---------------\n\nAdd content to the folder using either readable objects (e.g. ``StringIO``) or file paths (e.g. ``'a/path/to/some/file'``).\n\n\n.. code-block:: python\n\n from io import StringIO\n\n some_content = StringIO('some content')\n\n address = fs.put(some_content)\n\n # Or if you'd like to save the file with an extension...\n address = fs.put(some_content, '.txt')\n\n # The id of the file (i.e. the hexdigest of its contents).\n address.id\n\n # The absolute path where the file was saved.\n address.abspath\n\n # The path relative to fs.root.\n address.relpath\n\n # Whether the file previously existed.\n address.is_duplicate\n\n\nRetrieving File Address\n-----------------------\n\nGet a file's ``HashAddress`` by address ID or path. This address would be identical to the address returned by ``put()``.\n\n.. code-block:: python\n\n assert fs.get(address.id) == address\n assert fs.get(address.relpath) == address\n assert fs.get(address.abspath) == address\n assert fs.get('invalid') is None\n\n\nRetrieving Content\n------------------\n\nGet a ``BufferedReader`` handler for an existing file by address ID or path.\n\n\n.. code-block:: python\n\n fileio = fs.open(address.id)\n\n # Or using the full path...\n fileio = fs.open(address.abspath)\n\n # Or using a path relative to fs.root\n fileio = fs.open(address.relpath)\n\n\n**NOTE:** When getting a file that was saved with an extension, it's not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.\n\n\nRemoving Content\n----------------\n\nDelete a file by address ID or path.\n\n\n.. code-block:: python\n\n fs.delete(address.id)\n fs.delete(address.abspath)\n fs.delete(address.relpath)\n\n\n**NOTE:** When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.\n\n\nAdvanced Usage\n==============\n\nBelow are some of the more advanced features of ``HashFS``.\n\n\nRepairing Files\n---------------\n\nThe ``HashFS`` files may not always be in sync with it's ``depth``, ``width``, or ``algorithm`` settings (e.g. if ``HashFS`` takes ownership of a directory that wasn't previously stored using content hashes or if the ``HashFS`` settings change). These files can be easily reindexed using ``repair()``.\n\n\n.. code-block:: python\n\n repaired = fs.repair()\n\n # Or if you want to drop file extensions...\n repaired = fs.repair(extensions=False)\n\n\n**WARNING:** It's recommended that a backup of the directory be made before repairing just in case something goes wrong.\n\n\nWalking Corrupted Files\n-----------------------\n\nInstead of actually repairing the files, you can iterate over them for custom processing.\n\n\n.. code-block:: python\n\n for corrupted_path, expected_address in fs.corrupted():\n # do something\n\n\n**WARNING:** ``HashFS.corrupted()`` is a generator so be aware that modifying the file system while iterating could have unexpected results.\n\n\nWalking All Files\n-----------------\n\nIterate over files.\n\n\n.. code-block:: python\n\n for file in fs.files():\n # do something\n\n # Or using the class' iter method...\n for file in fs:\n # do something\n\n\nIterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).\n\n\n.. code-block:: python\n\n for folder in fs.folders():\n # do something\n\n\nComputing Size\n--------------\n\nCompute the size in bytes of all files in the ``root`` directory.\n\n\n.. code-block:: python\n\n total_bytes = fs.size()\n\n\nCount the total number of files.\n\n\n.. code-block:: python\n\n total_files = fs.count()\n\n # Or via len()...\n total_files = len(fs)\n\n\nFor more details, please see the full documentation at http://hashfs.readthedocs.org.\n\n\n\n.. |version| image:: http://img.shields.io/pypi/v/hashfs.svg?style=flat-square\n :target: https://pypi.python.org/pypi/hashfs/\n\n.. |travis| image:: http://img.shields.io/travis/dgilland/hashfs/master.svg?style=flat-square\n :target: https://travis-ci.org/dgilland/hashfs\n\n.. |coveralls| image:: http://img.shields.io/coveralls/dgilland/hashfs/master.svg?style=flat-square\n :target: https://coveralls.io/r/dgilland/hashfs\n\n.. |license| image:: http://img.shields.io/pypi/l/hashfs.svg?style=flat-square\n :target: https://pypi.python.org/pypi/hashfs/\n\n\nChangelog\n=========\n\n\nv0.7.2 (2019-10-24)\n-------------------\n\n- Fix out-of-memory issue when computing file ID hashes of large files.\n\n\nv0.7.1 (2018-10-13)\n-------------------\n\n- Replace usage of ``distutils.dir_util.mkpath`` with ``os.path.makedirs``.\n\n\nv0.7.0 (2016-04-19)\n-------------------\n\n- Use ``shutil.move`` instead of ``shutil.copy`` to move temporary file created during ``put`` operation to ``HashFS`` directory.\n\n\nv0.6.0 (2015-10-19)\n-------------------\n\n- Add faster ``scandir`` package for iterating over files/folders when platform is Python < 3.5. Scandir implementation was added to ``os`` module starting with Python 3.5.\n\n\nv0.5.0 (2015-07-02)\n-------------------\n\n- Rename private method ``HashFS.copy`` to ``HashFS._copy``.\n- Add ``is_duplicate`` attribute to ``HashAddress``.\n- Make ``HashFS.put()`` return ``HashAddress`` with ``is_duplicate=True`` when file with same hash already exists on disk.\n\n\nv0.4.0 (2015-06-03)\n-------------------\n\n- Add ``HashFS.size()`` method that returns the size of all files in bytes.\n- Add ``HashFS.count()``/``HashFS.__len__()`` methods that return the count of all files.\n- Add ``HashFS.__iter__()`` method to support iteration. Proxies to ``HashFS.files()``.\n- Add ``HashFS.__contains__()`` method to support ``in`` operator. Proxies to ``HashFS.exists()``.\n- Don't create the root directory (if it doesn't exist) until at least one file has been added.\n- Fix ``HashFS.repair()`` not using ``extensions`` argument properly.\n\n\nv0.3.0 (2015-06-02)\n-------------------\n\n- Rename ``HashFS.length`` parameter/property to ``width``. (**breaking change**)\n\n\nv0.2.0 (2015-05-29)\n-------------------\n\n- Rename ``HashFS.get`` to ``HashFS.open``. (**breaking change**)\n- Add ``HashFS.get()`` method that returns a ``HashAddress`` or ``None`` given a file ID or path.\n\n\nv0.1.0 (2015-05-28)\n-------------------\n\n- Add ``HashFS.get()`` method that retrieves a reader object given a file ID or path.\n- Add ``HashFS.delete()`` method that deletes a file ID or path.\n- Add ``HashFS.folders()`` method that returns the folder paths that directly contain files (i.e. subpaths that only contain folders are ignored).\n- Add ``HashFS.detokenize()`` method that returns the file ID contained in a file path.\n- Add ``HashFS.repair()`` method that reindexes any files under root directory whose file path doesn't not match its tokenized file ID.\n- Rename ``Address`` classs to ``HashAddress``. (**breaking change**)\n- Rename ``HashAddress.digest`` to ``HashAddress.id``. (**breaking change**)\n- Rename ``HashAddress.path`` to ``HashAddress.abspath``. (**breaking change**)\n- Add ``HashAddress.relpath`` which represents path relative to ``HashFS.root``.\n\n\nv0.0.1 (2015-05-27)\n-------------------\n\n- First release.\n- Add ``HashFS`` class.\n- Add ``HashFS.put()`` method that saves a file path or file-like object by content hash.\n- Add ``HashFS.files()`` method that returns all files under root directory.\n- Add ``HashFS.exists()`` which checks either a file hash or file path for existence.\n",
"bugtrack_url": null,
"license": "MIT License",
"summary": "A content-addressable file management system.",
"version": "1.1.0",
"project_urls": {
"Homepage": "https://github.com/ramwin/hashfs"
},
"split_keywords": [
"hashfs",
"hash",
"file",
"system",
"content",
"addressable",
"fixed",
"storage"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "128d847c8543ec998f7b08034ce01dd4d6f4b2f058c42aca83f4bb3be9056072",
"md5": "550c58d5929a8ae30b926480b26e7107",
"sha256": "b9fc7ac601482172f0a10828c982969b15962d2a39c06aa34a8aeab31939c2b0"
},
"downloads": -1,
"filename": "hashfs2-1.1.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "550c58d5929a8ae30b926480b26e7107",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 12720,
"upload_time": "2024-10-06T06:53:55",
"upload_time_iso_8601": "2024-10-06T06:53:55.572589Z",
"url": "https://files.pythonhosted.org/packages/12/8d/847c8543ec998f7b08034ce01dd4d6f4b2f058c42aca83f4bb3be9056072/hashfs2-1.1.0-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "180af96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac",
"md5": "222ff124a08129d6f9edacb85d34adc6",
"sha256": "04d730ae1584b1b41938342348c07c6ddb5a42cbc04fb3dd6c5b7da486750617"
},
"downloads": -1,
"filename": "hashfs2-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "222ff124a08129d6f9edacb85d34adc6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 24694,
"upload_time": "2024-10-06T06:53:57",
"upload_time_iso_8601": "2024-10-06T06:53:57.581190Z",
"url": "https://files.pythonhosted.org/packages/18/0a/f96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac/hashfs2-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-06 06:53:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ramwin",
"github_project": "hashfs",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "flake8",
"specs": []
},
{
"name": "pytest-cov",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "Sphinx",
"specs": []
},
{
"name": "tox",
"specs": []
},
{
"name": "twine",
"specs": []
},
{
"name": "wheel",
"specs": []
}
],
"tox": true,
"lcname": "hashfs2"
}