# persist-to-disk

- **Name**: persist-to-disk
- **Version**: 0.0.7
- **Summary**: Persist expensive operations on disk.
- **Home page**: https://github.com/zlin7/python-persist_to_disk
- **Author**: Zhen Lin
- **License**: MIT
- **Keywords**: cache, persist
- **Upload time**: 2024-06-02 04:59:31
# Installation

`pip install .` or `pip install persist-to-disk`

**By default, a folder called `.cache/persist_to_disk` is created under your home directory, and will be used to store cache files.**
If you want to change it, see "Global Settings" below.

# Global Settings

To set global settings (for example, where the cache should go by default), please do the following:

```
import persist_to_disk as ptd
ptd.config.generate_config()
```
Then, you could (optionally) change the settings in the generated `config.ini`:

1. `persist_path`: where to store the cache.
    All projects you have on this machine will have a folder under `persist_path` by default, unless you specify it within the project (See examples below).
2. `hashsize`: How many hash buckets to use to store each function's outputs. Default=500.
3. `lock_granularity`:
    How granular the lock is.
    This could be `call`, `func` or `global`.

    * `call` means each hash bucket will have one lock, so only processes trying to write/read to/from the same hash bucket will share the same lock.
    * `func` means each function will have one lock, so if you have many processes calling the same function they will all be using the same lock.
    * `global` means all processes share the same lock (I tested that it is OK to have nested locks on Unix).
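The generated `config.ini` is a plain INI file, so it can be inspected with Python's `configparser`. The section name and exact layout below are illustrative assumptions, not the guaranteed format of the generated file — check yours after running `generate_config()`:

```python
import configparser

# Hypothetical config.ini contents -- section and key layout are
# assumptions for illustration; your generated file may differ.
SAMPLE_CONFIG = """
[DEFAULT]
persist_path = ~/.cache/persist_to_disk
hashsize = 500
lock_granularity = call
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

persist_path = config["DEFAULT"]["persist_path"]
hashsize = config["DEFAULT"].getint("hashsize")
granularity = config["DEFAULT"]["lock_granularity"]
```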


# Quick Start

### Basic Example
Using `persist_to_disk` is very easy.
For example, if you want to write a general training function:
```
import torch

import persist_to_disk as ptd

@ptd.persistf()
def train_a_model(dataset, model_cls, lr, epochs, device='cpu'):
    ...
    return trained_model_or_key

if __name__ == '__main__':
    train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)
```

Suppose the above is in a file with path `~/project_name/pipeline/train.py`.
If we are in `~/project_name` and run `python -m pipeline.train`, a cache folder will be created under `PERSIST_PATH`, like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│   ├── pipeline
│   │   ├── train
│   │   │   ├── train_a_model
│   │   │   │   ├──[hashed_bucket].pkl
```
Note that in the above, `[autoid]` is an auto-generated id.
`[hashed_bucket]` will be an int in [0, `hashsize`).
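The bucketing idea can be sketched in plain Python — this is an illustrative reimplementation of the concept, not `ptd`'s actual hashing code:

```python
import hashlib
import pickle

HASHSIZE = 500  # mirrors the `hashsize` setting


def hashed_bucket(*args, **kwargs):
    """Map a call's arguments to a deterministic bucket index in [0, HASHSIZE)."""
    payload = pickle.dumps((args, sorted(kwargs.items())))
    digest = hashlib.sha256(payload).hexdigest()
    return int(digest, 16) % HASHSIZE


bucket = hashed_bucket("MNIST", lr=1e-3, epochs=30)
# Identical arguments always land in the same [hashed_bucket].pkl file.
```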

### Multiprocessing
Note that `ptd.persistf` can be used with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) directly.


# Advanced Settings

## `config.set_project_path` and `config.set_persist_path`

There are two important paths for each workspace/project: `project_path` and `persist_path`.
You could set them by calling `ptd.config.set_project_path` and `ptd.config.set_persist_path`.

On a high level, `persist_path` determines *where* the results are cached/persisted, and `project_path` determines the structure of the cache file tree.
Following the basic example, `ptd.config.set_persist_path(PERSIST_PATH)` will only change the root directory.
On the other hand, suppose we add the line `ptd.config.set_project_path("./pipeline")` to `train.py` and run it again; the new file structure created under `PERSIST_PATH` will look like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── pipeline-[autoid]
│   ├── train
│   │   ├── train_a_model
│   │   │   ├──[hashed_bucket].pkl
```

Alternatively, we might also store some notebooks under `~/project_name/notebook/`.
In this case, we could set the `project_path` back to `~/project_name`.
You could check the mapping from projects to autoids in `~/.persist_to_disk/project_to_pids.txt`.
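How the cache directory follows from these two paths can be pictured with stdlib path arithmetic. This is a sketch of the structure described above (the `autoid` handling is simplified to a fixed placeholder), not the library's actual code:

```python
import os


def sketch_cache_dir(persist_path, project_path, file_path, func_name, autoid="0"):
    """Illustrative: persist_path / <project>-<autoid> / <file's path relative
    to project_path, without extension> / <function name>."""
    project_name = os.path.basename(os.path.normpath(project_path))
    rel = os.path.splitext(os.path.relpath(file_path, project_path))[0]
    return os.path.join(persist_path, f"{project_name}-{autoid}", rel, func_name)


cache_dir = sketch_cache_dir(
    "/persist",
    "/home/u/project_name",
    "/home/u/project_name/pipeline/train.py",
    "train_a_model",
)
# -> /persist/project_name-0/pipeline/train/train_a_model
```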



## Additional Parameters
`persistf` takes additional arguments.
For example, consider the new function below:
```
@ptd.persistf(groupby=['dataset', 'epochs'], expand_dict_kwargs=['model_kwargs'], skip_kwargs=['device'])
def train_a_model(dataset, model_cls, model_kwargs, lr, epochs, device='cpu'):
    model = model_cls(**model_kwargs)
    model.to(device)
    ... # train the model
    model.save(path)
    return path
```
The kwargs we passed to `persistf` have the following effects:

* `groupby`: We will create more intermediate directories based on what's in `groupby`.
In the example above, the new cache structure will look like
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│   ├── pipeline
│   │   ├── train
│   │   │   ├── train_a_model
│   │   │   │   ├── MNIST
│   │   │   │   │   ├── 20
│   │   │   │   │   │   ├──[hashed_bucket].pkl
│   │   │   │   │   ├── 10
│   │   │   │   │   │   ├──[hashed_bucket].pkl
│   │   │   │   ├── CIFAR10
│   │   │   │   │   ├── 30
│   │   │   │   │   │   ├──[hashed_bucket].pkl
```

* `expand_dict_kwargs`: This allows dictionary arguments to be passed in.
Since a dictionary cannot be hashed directly, `ptd` performs additional preprocessing on these arguments.
Note that you can also set `expand_dict_kwargs='all'` to avoid specifying individual dictionary arguments.
However, please only do so IF YOU KNOW what you are passing in - a very big nested dictionary can make cache retrieval very slow and use a lot of disk space unnecessarily.
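The preprocessing can be pictured as flattening a dict kwarg into a stable, hashable form, so that key order in the dict does not matter. This is an illustrative sketch of the idea, not `ptd`'s internals:

```python
def expand_dict(name, d):
    """Flatten a dict kwarg into sorted, hashable (key, value) pairs."""
    return tuple((f"{name}.{k}", d[k]) for k in sorted(d))


a = expand_dict("model_kwargs", {"in_features": 784, "out_features": 10})
b = expand_dict("model_kwargs", {"out_features": 10, "in_features": 784})
# Insertion order no longer matters: a == b, and both are hashable.
```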

* `skip_kwargs`: This specifies arguments that will be *ignored*.
For example, if we call `train_a_model(..., device='cpu')` and `train_a_model(..., device='cuda:0')`, the second run will simply read the cache, as `device` is ignored.
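The effect of `skip_kwargs` amounts to dropping the named arguments before the cache key is computed. A toy illustration (not library code):

```python
def cache_key(kwargs, skip_kwargs=()):
    """Build a cache key from kwargs, ignoring any argument in skip_kwargs."""
    return tuple(sorted((k, v) for k, v in kwargs.items() if k not in skip_kwargs))


key_cpu = cache_key(
    {"dataset": "MNIST", "epochs": 30, "device": "cpu"}, skip_kwargs=["device"]
)
key_gpu = cache_key(
    {"dataset": "MNIST", "epochs": 30, "device": "cuda:0"}, skip_kwargs=["device"]
)
# Both calls map to the same key, so the second reads the first's cache.
```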

### Other useful parameters:
* `hash_size`: Defaults to 500.
If a function has a lot of cached calls, you can increase this so that each `.pkl` bucket file on disk stays small.

## 0.0.7
==================
1. Shared cache vs. local cache (the latter specified by `persist_path_local` in the config). This assumes reads from the local cache are faster; this feature can be skipped.
2. Add support for `argparse.Namespace` to support a common practice.
3. Add support for argument `alt_dirs` for `persistf`.
    For example, suppose the function is called `func1` and its default cache path is `/path/repo-2/module/func1`, while a similar code base at a different location has cache at `/path/repo-1/module/func1`.
    Then, we could do:
    ```
    @ptd.persistf(alt_dirs=["/path/repo-1/module/func1"])
    def func1(a=1):
        print(1)
    ```
    A call to `func1` will read cache from `repo-1` and write it to `repo-2`.
4. Add support for argument `alt_root` for `manual_cache`. It could be a function that modifies the default path.

## 0.0.6
==================
1. Added the json serialization mode. This could be specified by `hash_method` when calling `persistf`.
2. If a function is specified to be `cache=ptd.READONLY`, no file lock will be used (to avoid unnecessary conflict).

## 0.0.5
==================
1. `lock_granularity` can be set differently for each function.
2. Changed the default cache folder to `.cache/persist_to_disk`.

## 0.0.4
==================
1. Changed the behavior of `switch_kwarg`. Now, this is not considered an input to the wrapped function. For example, the correct usage is
    ```
    @ptd.persistf(switch_kwarg='switch')
    def func1(a=1):
        print(1)
    func1(a=1, switch=ptd.NOCACHE)
    ```
    Note how `switch` is not an argument of `func1`.
2. Fixed the path inference step, which now resolves the absolute paths of `project_path` and `file_path` (the path to the file containing the function) before inferring the structure.

## 0.0.3
==================

1. Added `set_project_path` to config.


            
