# richfile
A more natural approach to saving hierarchical data structures.

`richfile` saves any Python object using directory structures on disk, and loads
them back again into the same Python objects.

`richfile` can save any atomic Python object, including custom classes, so long
as you can write a function to save and load it. It is intended as a replacement
for things like `pickle`, `json`, `yaml`, `HDF5`, `Parquet`, `netCDF`, `zarr`,
`numpy`, etc. when you want to save a complex data structure in a human-readable
and editable format. We find the `richfile` format ideal when you are
building a data processing pipeline and want to store intermediate results
in a format that allows for custom data types, is insensitive to version changes
(pickling issues), allows for easy debugging, and is human-readable.

It is easy to use, the code is simple, pure Python, and the operations follow [ACID](https://en.wikipedia.org/wiki/ACID) principles.

## Installation
```bash
pip install richfile
```
## Examples
Try out the examples in the [demo_notebook.ipynb](https://github.com/RichieHakim/richfile/blob/main/demo_notebook.ipynb) file.
## Usage
Saving and loading data is simple:
```python
## Given some complex data structure
data = {
    "name": "John Doe",
    "age": 25,
    "address": {
        "street": "1234 Elm St",
        "zip": None
    },
    "siblings": [
        "Jane",
        "Jim"
    ],
    "data": [1, 2, 3],
    (1, 2, 3): "complex key",
}

## Save it
import richfile as rf
r = rf.RichFile("path/to/data.richfile").save(data)

## Load it back
data = rf.RichFile("path/to/data.richfile").load()
```
You can also load just a part of the data:
```python
r = rf.RichFile("path/to/data.richfile")
first_sibling = r["siblings"][0] ## Lazily load a single item using pythonic indexing
print(f"First sibling: {first_sibling}")
## Output: First sibling: Jane
```
View the contents of a richfile directory without loading it:
```python
r.view_directory_structure()
```
Output:
```
Directory structure:
Viewing tree structure of richfile at path: ~/path/data.richfile (dict)
├── name.dict_item (dict_item)
| ├── key.json (str)
| ├── value.json (str)
|
├── age.dict_item (dict_item)
| ├── key.json (str)
| ├── value.json (int)
|
├── address.dict_item (dict_item)
| ├── key.json (str)
| ├── value.dict (dict)
| | ├── street.dict_item (dict_item)
| | | ├── key.json (str)
| | | ├── value.json (str)
| | |
| | ├── zip.dict_item (dict_item)
| | | ├── key.json (str)
| | | ├── value.json (None)
| | |
| |
|
├── siblings.dict_item (dict_item)
| ├── key.json (str)
| ├── value.list (list)
| | ├── 0.json (str)
| | ├── 1.json (str)
| |
|
├── data.dict_item (dict_item)
| ├── key.json (str)
| ├── value.list (list)
| | ├── 0.json (int)
| | ├── 1.json (int)
| | ├── 2.json (int)
| |
|
├── 5.dict_item (dict_item)
| ├── key.tuple (tuple)
| | ├── 0.json (int)
| | ├── 1.json (int)
| | ├── 2.json (int)
| |
| ├── value.json (str)
|
```
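
To make the layout above more concrete, here is a toy sketch of how a nested structure could be mapped onto directories and JSON files. It is not `richfile`'s implementation: the names `save_tree` and `load_tree` are hypothetical, dict items are named by index rather than by key, and none of the library's metadata, locking, or custom types are modeled.

```python
import json
import tempfile
from pathlib import Path


def save_tree(obj, parent: Path, stem: str) -> Path:
    """Toy writer: containers become directories, leaves become JSON files."""
    if isinstance(obj, dict):
        path = parent / f"{stem}.dict"
        path.mkdir()
        for i, (key, value) in enumerate(obj.items()):
            item = path / f"{i}.dict_item"
            item.mkdir()
            save_tree(key, item, "key")      # keys are stored like any other object
            save_tree(value, item, "value")
    elif isinstance(obj, (list, tuple)):
        path = parent / f"{stem}.{type(obj).__name__}"  # ".list" or ".tuple"
        path.mkdir()
        for i, element in enumerate(obj):
            save_tree(element, path, str(i))
    else:
        path = parent / f"{stem}.json"
        path.write_text(json.dumps(obj))
    return path


def load_tree(path: Path):
    """Toy reader: the suffix of each entry decides how it is rebuilt."""
    if path.suffix == ".dict":
        result = {}
        for item in sorted(path.iterdir(), key=lambda p: int(p.stem)):
            key = load_tree(next(item.glob("key.*")))
            result[key] = load_tree(next(item.glob("value.*")))
        return result
    if path.suffix in (".list", ".tuple"):
        children = sorted(path.iterdir(), key=lambda p: int(p.stem))
        elements = [load_tree(child) for child in children]
        return tuple(elements) if path.suffix == ".tuple" else elements
    return json.loads(path.read_text())


data = {
    "name": "John Doe",
    "address": {"zip": None},
    "data": [1, 2, 3],
    (1, 2, 3): "complex key",
}
with tempfile.TemporaryDirectory() as tmp:
    root = save_tree(data, Path(tmp), "data")
    n_files = len(list(root.rglob("*.json")))
    restored = load_tree(root)

print(restored == data, n_files)
```

Because every leaf is its own small file, a single item can be read without touching the rest of the tree, which is the idea behind the lazy indexing shown earlier.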
You can also add new data types easily:
```python
import numpy as np
import richfile as rf

## Add type to a RichFile object
r = rf.RichFile("path/to/data.richfile")
r.register_type(
    type_name='numpy_array',
    function_load=lambda path: np.load(path),
    function_save=lambda path, obj: np.save(path, obj),
    object_class=np.ndarray,
    library='numpy',
    suffix='npy',
)

## OR
## Add type to environment so that all new RichFile objects can use it
rf.functions.register_type(
    type_name='numpy_array',
    function_load=lambda path: np.load(path),
    function_save=lambda path, obj: np.save(path, obj),
    object_class=np.ndarray,
    library='numpy',
    suffix='npy',
)
```
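
For a custom class, the save and load functions can be ordinary functions rather than lambdas. The sketch below is a minimal example using only the standard library; the `Experiment` class and both function names are hypothetical, and the only thing taken from the example above is the argument order: `function_save(path, obj)` and `function_load(path)`. It also checks that the pair round-trips exactly.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Experiment:  # hypothetical custom class, for illustration only
    name: str
    shape: tuple
    tags: frozenset


def save_experiment(path, obj):
    """Same (path, obj) argument order as function_save in the example above."""
    payload = {"name": obj.name, "shape": list(obj.shape), "tags": sorted(obj.tags)}
    Path(path).write_text(json.dumps(payload))


def load_experiment(path):
    """Same (path) signature as function_load in the example above."""
    payload = json.loads(Path(path).read_text())
    # JSON has no tuple or set types, so restore them explicitly.
    return Experiment(payload["name"], tuple(payload["shape"]), frozenset(payload["tags"]))


original = Experiment("run_01", (64, 64), frozenset({"raw", "calcium"}))
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "experiment.json"
    save_experiment(path, original)
    restored = load_experiment(path)

# Round-trip check: the loaded object should equal the saved one.
assert restored == original
```

A pair like this would presumably be passed as `function_save` and `function_load` to `register_type`, with `object_class=Experiment` and the remaining arguments chosen as in the numpy example.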
## Installation from source
```bash
git clone https://github.com/RichieHakim/richfile
cd richfile
pip install -e .
```
## Considerations and Limitations
- **Invertibility**: When creating custom data types, it is important to consider whether the saving and loading operations are exactly reversible.
- [**ACID**](https://en.wikipedia.org/wiki/ACID) principles are reasonably followed via the use of temporary files, file locks, and atomic operations. However, the library is not a database, and therefore cannot guarantee the same level of ACID compliance as a database. In addition, atomic replacements of existing non-empty directories require two operations, which reduces atomicity.
- **Performance**: Data structures with many branches will require many files and operations, which may become slow. Consider packaging highly branched data structures into a single file that supports hierarchical data, such as JSON, HDF5, Parquet, netCDF, zarr, numpy, etc. and making a custom data type for it.
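
As an illustration of the temporary-file pattern mentioned in the ACID point, the sketch below writes a file to a temporary name in the same directory and then swaps it into place with `os.replace`. This is a generic standard-library pattern, not `richfile`'s actual code; `atomic_write_text` is a hypothetical helper, and file locking is left out.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text so that readers see either the old file or the new one."""
    path = Path(path)
    # The temp file lives in the target directory so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # single rename step for a file target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "value.json"
    atomic_write_text(target, "1")
    atomic_write_text(target, "2")  # overwrites the existing file in one step
    final_text = target.read_text()
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "value.json"]

print(final_text, leftovers)
```

The same single-step swap is not available for a non-empty directory target: on POSIX systems a rename onto a non-empty directory fails, so the old directory has to be moved or removed first, which matches the two-operation limitation noted above.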
## TODO:
- [ ] Tests
- [ ] Documentation
- [x] Examples
- [x] Readme
- [ ] License
- [x] PyPI
- [x] ~~Hashing~~
- [x] ~~Item assignment (safely)~~
- [x] Custom saving/loading functions
- [x] ~~Put the library imports in the function calls~~
- [x] Add handling for data without a known type
- [ ] Change name of library to something more descriptive
- [x] Test out memmap stuff
- [x] ~~Make it a .zip type~~
Raw data
{
"_id": null,
"home_page": "https://github.com/RichieHakim/richfile",
"name": "richfile",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "data analysis, machine learning, neuroscience",
"author": "Richard Hakim",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/7a/eb/be45e035e4bcdcfad3c4cd43d51a39962a5bfcf68ceb470ed0f17abfc17a/richfile-0.4.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "LICENSE",
"summary": "A library for reading and writing hierarchical data files",
"version": "0.4.5",
"project_urls": {
"Homepage": "https://github.com/RichieHakim/richfile"
},
"split_keywords": [
"data analysis",
" machine learning",
" neuroscience"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "958ca8c1c15852724409da57023c842bb579c2ed3a5db190f9b8c9c80e7bb04d",
"md5": "7618de61e2946a59bf2483a48e0a5e39",
"sha256": "0d4d4c4dbc7c9acc649e6e9acaf742212bb8c9b1d365897e6c21236e4d6839cb"
},
"downloads": -1,
"filename": "richfile-0.4.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7618de61e2946a59bf2483a48e0a5e39",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 22500,
"upload_time": "2024-10-05T08:48:02",
"upload_time_iso_8601": "2024-10-05T08:48:02.063649Z",
"url": "https://files.pythonhosted.org/packages/95/8c/a8c1c15852724409da57023c842bb579c2ed3a5db190f9b8c9c80e7bb04d/richfile-0.4.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7aebbe45e035e4bcdcfad3c4cd43d51a39962a5bfcf68ceb470ed0f17abfc17a",
"md5": "cb7d89cbcb51493bc4ab09aeb411ae03",
"sha256": "05e899ebc4ed6315b1ac8ba9b621e33845e54fbe123fadf6882624fe304e10d9"
},
"downloads": -1,
"filename": "richfile-0.4.5.tar.gz",
"has_sig": false,
"md5_digest": "cb7d89cbcb51493bc4ab09aeb411ae03",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 26708,
"upload_time": "2024-10-05T08:48:03",
"upload_time_iso_8601": "2024-10-05T08:48:03.534700Z",
"url": "https://files.pythonhosted.org/packages/7a/eb/be45e035e4bcdcfad3c4cd43d51a39962a5bfcf68ceb470ed0f17abfc17a/richfile-0.4.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-05 08:48:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "RichieHakim",
"github_project": "richfile",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "filelock",
"specs": [
[
">",
"3.15"
]
]
},
{
"name": "hypothesis",
"specs": [
[
"==",
"6.112.2"
]
]
}
],
"lcname": "richfile"
}