barecat

Name	barecat
Version	0.2.7
home_page	None
Summary	Scalable archive format for storing millions of small files with random access and SQLite indexing.
upload_time	2025-09-17 16:34:13
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	MIT License Copyright (c) 2023 István Sárándi Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	sqlite dataset storage archive random-access image-dataset filesystem key-value-store deep-learning data-loader file-indexing
requirements	No requirements were recorded.

# Barecat

**[Full API Reference Docs](https://istvansarandi.com/docs/barecat/api/barecat/Barecat.html)**

Barecat (**bare** con**cat**enation) is a highly scalable, simple aggregate storage format for
storing many (tens of millions and more) small files, with a focus on fast random access and
minimal overhead.

Barecat can be thought of as a simple filesystem, or as something akin to an indexed tarball, or a
key-value store. Indeed, it can be [mounted via FUSE](https://github.com/isarandi/barecat-mount), converted to a tarball, or used like a dictionary
within Python.

Barecat associates strings (file paths) with binary data (file contents). It's like a dictionary,
but it has some special handling for '/' characters in the keys, supporting a filesystem-like
experience (`listdir`, `walk`, `glob`, etc.).

Internally, all the data is simply concatenated one after another into one or more data shard files.
Additionally, an index is maintained in an SQLite database, which stores the shard number, the offset
and the size of each inner file (as well as a checksum, and further filesystem-like metadata
like modification time). Barecat also maintains aggregate statistics for each directory, such as the
total number of files and total file size.

![Architecture](./figure.png)

As the figure shows, the Barecat format is very simple. Readers and writers are easy to implement in
any language, since SQLite is a widely supported format.
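To make the idea concrete, here is a toy sketch of "concatenate the data, index it in SQLite". It is not Barecat's implementation: the table layout, the column names (`shard`, `byte_offset`, `size`) and the file naming are invented for illustration, and it writes only a single shard with no checksums or directory statistics.

```python
import os
import sqlite3
import tempfile


def write_toy_archive(prefix, files):
    """Concatenate file contents into one shard file and index them in SQLite.

    The schema below is hypothetical; it is NOT Barecat's real schema.
    """
    index = sqlite3.connect(prefix + "-index.sqlite")
    index.execute(
        "CREATE TABLE files ("
        "path TEXT PRIMARY KEY, shard INTEGER, byte_offset INTEGER, size INTEGER)"
    )
    with open(prefix + "-shard-0", "wb") as shard:
        for path, data in files.items():
            offset = shard.tell()  # where this file's bytes begin
            shard.write(data)      # bare concatenation, no per-file header
            index.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?)",
                (path, 0, offset, len(data)),
            )
    index.commit()
    index.close()


def read_toy_file(prefix, path):
    """Look up one path in the index, then seek and read only that byte range."""
    index = sqlite3.connect(prefix + "-index.sqlite")
    row = index.execute(
        "SELECT shard, byte_offset, size FROM files WHERE path = ?", (path,)
    ).fetchone()
    index.close()
    if row is None:
        raise KeyError(path)
    shard, offset, size = row
    with open(f"{prefix}-shard-{shard}", "rb") as f:
        f.seek(offset)
        return f.read(size)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    write_toy_archive(
        prefix, {"images/a.jpg": b"first file", "images/b.jpg": b"second file"}
    )
    print(read_toy_file(prefix, "images/b.jpg"))
```

A lookup touches one index row and one contiguous byte range, which is why random access stays cheap regardless of how many files the archive holds.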

## Background

A typical use case for Barecat is storing image files for training deep learning models, where the
files are accessed randomly during training. The files are typically stored on a network file
system, where accessing many small files can be slow, and clusters often limit the number of files
a user may store. So the small files have to be merged into larger ones somehow.
However, typical archive formats such as tar are not suitable, since they don't allow fast random
lookups. In tar, one has to scan the entire archive, as there is no central directory.
Zip is better, but still requires scanning the central directory, which can be slow for very large
archives with millions or tens of millions of files.

We need an index into the archive, and to support very large datasets, the index itself must not
have to be loaded into memory in full.

Therefore, in this format the metadata is indexed separately in an SQLite database for fast lookup
based on paths. The index also allows fast listing of directory contents and contains aggregate
statistics (total file size, number of files) for each directory.

## Features

- **Fast random access**: Files are addressed by path and can be read in any order,
  without having to scan the entire archive or all the metadata.
  The index is stored in a separate SQLite database file, which itself does not need to be loaded
  entirely into memory. Ideal for storing training image data for deep learning jobs.
- **Sharding**: To make it easier to move the data around or to distribute it across multiple
  storage devices, the archive can be split into multiple files of equal size (shards, or volumes).
  The shards do not have to be concatenated to be used; the library keeps all shard files open
  and loads data from the appropriate one during normal operations.
- **Fast browsing**: The SQLite database contains an index for the parent directories, allowing
  fast listing of directory contents and aggregate statistics (total file size, number of files).
- **Intuitive API**: Familiar filesystem-like API, as well as a dictionary-like one.
- **Mountable**: The archive can be efficiently mounted in read-only or read-write mode.
- **Simple storage format**: The files are simply concatenated one after another, and the index
  contains the offset and size of each file. There is no header format to understand. The index
  can be dumped into any format with simple SQL queries.
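Since the index is an ordinary SQLite file, it can be explored with any SQLite client. The table and column names are not documented in this README, so the sketch below does not assume them: it only lists whatever tables and columns a given SQLite file contains, using the standard `sqlite_master` table and `PRAGMA table_info`. The helper name `describe_index` is made up for this example, and the file name in the usage comment follows the `mydata.barecat-sqlite-index` naming shown in the command line section.

```python
import pathlib
import sqlite3


def describe_index(index_path):
    """Return {table name: [column names]} for any SQLite database file.

    Opens the file read-only so that inspecting an index cannot modify it.
    """
    uri = pathlib.Path(index_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
            schema[table] = [
                col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")
            ]
        return schema
    finally:
        conn.close()


# Hypothetical usage against a real archive's index file:
# for table, columns in describe_index("mydata.barecat-sqlite-index").items():
#     print(table, columns)
```

Once the actual table names are known, an ordinary `SELECT` is enough to export the index to CSV, JSON or any other format.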

## Command line interface

To create a Barecat archive, use the `barecat-create` or `barecat-create-recursive` commands, which
are installed as executables along with the pip package.

```bash
barecat-create --file=mydata.barecat --shard-size=100G < path_of_paths.txt

find dirname -name '*.jpg' -print0 | barecat-create --null --file=mydata.barecat --shard-size=100G

barecat-create-recursive dir1 dir2 dir3 --file=mydata.barecat --shard-size=100G
```

This may yield the following files:

```
mydata.barecat-shard-00001
mydata.barecat-shard-00002
mydata.barecat-sqlite-index
```

The files can be extracted again. Unix-like permissions, modification times and owner info are
preserved.

```bash
barecat-extract --file=mydata.barecat --target-directory=targetdir/
```

## Python API

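```python

import barecat

# Open for writing. binary_file_data is assumed to be a bytes object you already have.
with barecat.Barecat('mydata.barecat', readonly=False) as bc:
    # Dictionary-style write: the key is the path inside the archive
    bc['path/to/file/as/stored.jpg'] = binary_file_data
    # Add a file from disk by its path
    bc.add_by_path('path/to/file/on/disk.jpg')

    # Add from an already opened file object
    with open('path', 'rb') as f:
        bc.add('path/to/file/on/disk.jpg', fileobj=f)

# Reopen the archive for reading
with barecat.Barecat('mydata.barecat') as bc:
    # Dictionary-style read returns the file contents
    binary_file_data = bc['path/to/file.jpg']
    # Filesystem-style browsing
    entrynames = bc.listdir('path/to')
    for root, dirs, files in bc.walk('path/to/something'):
        print(root, dirs, files)

    paths = bc.glob('path/to/**/*.jpg', recursive=True)

    # File-like access, here reading only the first 123 bytes
    with bc.open('path/to/file.jpg', 'rb') as f:
        data = f.read(123)
```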
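For the training use case described in the Background section, the dictionary-style read fits naturally into a map-style dataset. The class below is a hypothetical sketch, not part of Barecat: it works with any object that supports `store[path]`, and it opens the store lazily on first access, a common precaution when data loader workers run in separate processes (whether Barecat requires this is not stated here).

```python
class RandomAccessDataset:
    """Map-style dataset over any dictionary-like store of path -> bytes.

    Hypothetical helper for illustration; it is not provided by Barecat.
    """

    def __init__(self, open_store, paths):
        self._open_store = open_store  # zero-argument callable returning the store
        self._paths = list(paths)
        self._store = None             # opened on first access

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, index):
        if self._store is None:
            self._store = self._open_store()
        path = self._paths[index]
        return path, self._store[path]


# Stand-in store for demonstration; a plain dict behaves like the archive here.
fake_store = {"cats/1.jpg": b"\xff\xd8cat", "dogs/1.jpg": b"\xff\xd8dog"}
dataset = RandomAccessDataset(lambda: fake_store, sorted(fake_store))
print(len(dataset), dataset[1][0])

# With a real archive, the factory could be (using the constructor shown above):
# dataset = RandomAccessDataset(lambda: barecat.Barecat('mydata.barecat'), paths)
```

Passing a factory rather than an open archive keeps the dataset object cheap to copy into worker processes, each of which then opens its own handle.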

## Image viewer

Barecat comes with a simple image viewer that can be used to browse the contents of a Barecat
archive.

```bash
barecat-image-viewer mydata.barecat
```
