# Installation
`pip install .` or `pip install persist-to-disk`
**By default, a folder called `.cache/persist_to_disk` is created under your home directory, and will be used to store cache files.**
If you want to change it, see "Global Settings" below.
# Global Settings
To set global settings (for example, where the cache should go by default), please do the following:
```
import persist_to_disk as ptd
ptd.config.generate_config()
```
Then, you could (optionally) change the settings in the generated `config.ini`:
1. `persist_path`: where to store the cache.
All projects you have on this machine will have a folder under `persist_path` by default, unless you specify it within the project (See examples below).
2. `hashsize`: How many hash buckets to use to store each function's outputs. Default=500.
3. `lock_granularity`:
How granular the lock is.
This could be `call`, `func` or `global`.
* `call` means each hash bucket has its own lock, so only processes trying to read from or write to the same hash bucket share a lock.
* `func` means each function has one lock, so all processes calling the same function share the same lock.
* `global` means all processes share a single lock. (I have tested that this nested locking mechanism works on Unix.)
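As a rough mental model of the bucketing (this is an illustrative sketch, not `ptd`'s actual implementation), a call's arguments can be serialized and mapped deterministically to one of `hashsize` buckets; each bucket file then holds every cached result whose key lands in it:

```python
import pickle
import zlib

HASHSIZE = 500  # mirrors the `hashsize` setting; illustrative only

def bucket_for(args: tuple, kwargs: dict) -> int:
    """Map a call signature to a stable bucket index in [0, HASHSIZE)."""
    key = pickle.dumps((args, sorted(kwargs.items())))
    return zlib.crc32(key) % HASHSIZE

# Identical calls always land in the same bucket:
print(bucket_for(("MNIST",), {"lr": 1e-3}) == bucket_for(("MNIST",), {"lr": 1e-3}))  # → True
```

With `call` granularity, two processes only contend for a lock when their calls hash to the same bucket index.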
# Quick Start
### Basic Example
Using `persist_to_disk` is very easy.
For example, if you want to write a general training function:
```
import persist_to_disk as ptd
import torch

@ptd.persistf()
def train_a_model(dataset, model_cls, lr, epochs, device='cpu'):
    ...  # train the model
    return trained_model_or_key

if __name__ == '__main__':
    train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)
```
Suppose the above is in a file with path `~/project_name/pipeline/train.py`.
If we are in `~/project_name` and run `python -m pipeline.train`, a cache folder will be created under `PERSIST_PATH`, like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│ ├── pipeline
│ │ ├── train
│ │ │ ├── train_a_model
│ │ │ │ ├──[hashed_bucket].pkl
```
Note that in the above, `[autoid]` is an auto-generated id.
`[hashed_bucket]` will be an int in [0, `hashsize`).
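Conceptually, the decorator hashes the call, looks the key up in the matching bucket's pickle file, and only runs the function body on a miss. The following is a simplified toy sketch of that pattern (not the library's real code; it omits locking, the directory tree, and the other features described below):

```python
import functools
import os
import pickle
import tempfile
import zlib

CACHE_DIR = tempfile.mkdtemp()  # stand-in for PERSIST_PATH/<project>/<module>/<func>
HASHSIZE = 500

def toy_persist(func):
    """Minimal sketch of disk memoization with hash buckets."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = pickle.dumps((args, sorted(kwargs.items())))
        bucket = os.path.join(CACHE_DIR, f"{zlib.crc32(key) % HASHSIZE}.pkl")
        store = {}
        if os.path.exists(bucket):
            with open(bucket, "rb") as f:
                store = pickle.load(f)
        if key not in store:  # cache miss: compute, then persist to the bucket
            store[key] = func(*args, **kwargs)
            with open(bucket, "wb") as f:
                pickle.dump(store, f)
        return store[key]
    return wrapper

calls = []

@toy_persist
def slow_square(x):
    calls.append(x)  # record every real execution
    return x * x

slow_square(4)
slow_square(4)
print(len(calls))  # → 1: the second call was served from disk
```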
### Multiprocessing
Note that `ptd.persistf` can be used with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) directly.
# Advanced Settings
## `config.set_project_path` and `config.set_persist_path`
There are two important paths for each workspace/project: `project_path` and `persist_path`.
You could set them by calling `ptd.config.set_project_path` and `ptd.config.set_persist_path`.
On a high level, `persist_path` determines *where* the results are cached/persisted, and `project_path` determines the structure of the cache file tree.
Following the basic example, `ptd.config.set_persist_path(PERSIST_PATH)` will only change the root directory.
On the other hand, suppose we add the line `ptd.config.set_project_path("./pipeline")` to `train.py` and run it again; a new file structure will be created under `PERSIST_PATH`, like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── pipeline-[autoid]
│ ├── train
│ │ ├── train_a_model
│ │ │ ├──[hashed_bucket].pkl
```
Alternatively, we might also keep some notebooks under `~/project_name/notebook/`.
In that case, we could set the `project_path` back to `~/project_name`.
You could check the mapping from projects to autoids in `~/.persist_to_disk/project_to_pids.txt`.
## Additional Parameters
`persistf` takes additional arguments.
For example, consider the new function below:
```
@ptd.persistf(groupby=['dataset', 'epochs'],
              expand_dict_kwargs=['model_kwargs'],
              skip_kwargs=['device'])
def train_a_model(dataset, model_cls, model_kwargs, lr, epochs, device='cpu'):
    model = model_cls(**model_kwargs)
    model.to(device)
    ...  # train the model
    model.save(path)
    return path
```
The kwargs we passed to `persistf` have the following effects:
* `groupby`: We will create additional intermediate directories based on the arguments listed in `groupby`.
In the example above, the new cache structure will look like
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│ ├── pipeline
│ │ ├── train
│ │ │ ├── train_a_model
│ │ │ │ ├── MNIST
│ │ │ │ │ ├── 20
│ │ │ │ │ │ ├──[hashed_bucket].pkl
│ │ │ │ │ ├── 10
│ │ │ │ │ │ ├──[hashed_bucket].pkl
│ │ │ │ ├── CIFAR10
│ │ │ │ │ ├── 30
│ │ │ │ │ │ ├──[hashed_bucket].pkl
```
* `expand_dict_kwargs`: This allows dictionary arguments to be passed in.
Dictionaries cannot be hashed directly, so `ptd` applies additional preprocessing steps to these arguments.
Note that you can also set `expand_dict_kwargs='all'` to avoid listing individual dictionary arguments.
However, please only do so IF YOU KNOW what you are passing in - a very big nested dictionary can make cache retrieval very slow and use a lot of disk space unnecessarily.
* `skip_kwargs`: This specifies arguments that will be *ignored*.
For example, if we call `train_a_model(..., device='cpu')` and then `train_a_model(..., device='cuda:0')`, the second run will simply read the cache, as `device` is ignored.
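As a rough sketch of how these three kwargs could shape the cache location and key, consider the hypothetical helper below (the function name and its return shape are illustrative assumptions, not `ptd`'s internals):

```python
def cache_location(func_dir, kwargs, groupby=(), skip_kwargs=(), expand_dict_kwargs=()):
    """Hypothetical helper: show how decorator kwargs could shape the cache.
    For illustration only -- not ptd's real implementation."""
    kwargs = dict(kwargs)
    for name in skip_kwargs:          # ignored: same cache hit either way
        kwargs.pop(name, None)
    for name in expand_dict_kwargs:   # dicts flattened into hashable tuples
        if isinstance(kwargs.get(name), dict):
            kwargs[name] = tuple(sorted(kwargs[name].items()))
    # groupby values become extra intermediate directories
    subdirs = [str(kwargs.pop(name)) for name in groupby]
    path = "/".join([func_dir] + subdirs)
    key = tuple(sorted(kwargs.items()))
    return path, key

path, key = cache_location(
    "train_a_model",
    {"dataset": "MNIST", "epochs": 20, "lr": 1e-3,
     "model_kwargs": {"bias": True}, "device": "cuda:0"},
    groupby=("dataset", "epochs"),
    skip_kwargs=("device",),
    expand_dict_kwargs=("model_kwargs",),
)
print(path)  # → train_a_model/MNIST/20
```

Note how `device` dropped out of the key entirely, while `dataset` and `epochs` moved from the key into the directory path.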
### Other useful parameters
* `hash_size`: Defaults to 500.
If a function accumulates many cache entries, you can increase this to spread them across more buckets, keeping each `.pkl` file on disk small.
# Changelog
## 0.0.7
1. Shared cache vs. local cache (the latter specified by `persist_path_local` in the config). This assumes local reads are faster; the local cache is optional and can be skipped.
2. Add support for `argparse.Namespace` to support a common practice.
3. Add support for argument `alt_dirs` for `persistf`.
For example, suppose the function is called `func1` and its default cache path is `/path/repo-2/module/func1`, while we have cache from a similar code base at a different location, whose cache lives at `/path/repo-1/module/func1`.
Then, we could do:
```
@ptd.persistf(alt_dirs=["/path/repo-1/module/func1"])
def func1(a=1):
    print(1)
```
A call to `func1` will read cache from `repo-1` and write it to `repo-2`.
4. Add support for argument `alt_root` for `manual_cache`. It could be a function that modifies the default path.
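The `alt_dirs` fallback in item 3 can be pictured with a toy sketch: on a local miss, look for the bucket file in each alternative directory and carry it forward into the default location. (The copy-forward helper below is an illustrative assumption, not the library's actual lookup code.)

```python
import os
import shutil
import tempfile

def read_with_fallback(bucket_name, default_dir, alt_dirs):
    """Toy sketch: if a bucket file is missing locally, copy it from an alt dir."""
    target = os.path.join(default_dir, bucket_name)
    if not os.path.exists(target):
        for alt in alt_dirs:
            candidate = os.path.join(alt, bucket_name)
            if os.path.exists(candidate):
                shutil.copy(candidate, target)  # migrate the old cache forward
                break
    return target if os.path.exists(target) else None

repo1 = tempfile.mkdtemp()  # stands in for /path/repo-1/module/func1
repo2 = tempfile.mkdtemp()  # stands in for /path/repo-2/module/func1
open(os.path.join(repo1, "7.pkl"), "wb").close()  # pre-existing cache in repo-1

found = read_with_fallback("7.pkl", repo2, [repo1])
print(found is not None)  # → True: repo-1's cache now also lives under repo-2
```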
## 0.0.6
1. Added the json serialization mode. This could be specified by `hash_method` when calling `persistf`.
2. If a function is called with `cache=ptd.READONLY`, no file lock will be used (to avoid unnecessary conflicts).
## 0.0.5
1. `lock_granularity` can be set differently for each function.
2. Changed the default cache folder to `.cache/persist_to_disk`.
## 0.0.4
1. Changed the behavior of `switch_kwarg`. Now, this is not considered an input to the wrapped function. For example, the correct usage is
```
@ptd.persistf(switch_kwarg='switch')
def func1(a=1):
    print(1)

func1(a=1, switch=ptd.NOCACHE)
```
Note how `switch` is not an argument of `func1`.
2. Fixed the path inference step, which now resolves the absolute paths of `project_path` or `file_path` (the path to the file containing the function) before inferring the structure.
## 0.0.3
1. Added `set_project_path` to config.