doorget


Namedoorget JSON
Version 1.1 PyPI version JSON
download
home_page
SummaryMemoize package supporting data dependencies
upload_time2024-03-11 16:59:16
maintainer
docs_urlNone
author
requires_python>=3.6
license
keywords memoize memoization doorget
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # doorget

[![build](https://github.com/fdieulle/doorget/actions/workflows/build.yml/badge.svg)](https://github.com/fdieulle/doorget/actions/workflows/build.yml)
[![release](https://github.com/fdieulle/doorget/actions/workflows/release.yml/badge.svg)](https://github.com/fdieulle/doorget/actions/workflows/release.yml)
[![codecov](https://codecov.io/gh/fdieulle/doorget/graph/badge.svg?token=NQQ17SQUYK)](https://codecov.io/gh/fdieulle/doorget)

[![license](https://img.shields.io/badge/license-MIT-blue.svg?maxAge=3600)](./LICENSE) 
[![pypi](https://img.shields.io/pypi/v/doorget.svg)](https://pypi.org/project/doorget/)
[![python supported](https://img.shields.io/pypi/pyversions/doorget.svg)](https://pypi.org/project/doorget/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Python package which memoizes functions and supports data dependencies across them.
With the `cache` decorator the memoization of your code is low touch.

## Memoization

### Definition

> In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts, such as in simple mutually recursive descent parsing. [Wikipedia](https://en.wikipedia.org/wiki/Memoization) 

### Example

```python
from doorget import cache
import panda as pd

@cache
def fetch(name: str) -> pd.DataFrame:
    pass

# do it
foo = fetch('foo') # The function is called
# get it
foo_again = fetch('foo') # The previous returned data is read from the cache and the function is not called.
# do it
bar = fetch('bar') # The function is called because the input is not known yet
```

## Storage modes

The package propose you 3 built in storage modes and the ability to proivde your customized storages.

* `Memory`: The fastest mode.
* `Disk`: The unlimited mode.
* `Identity` The tracker mode for data [dependencies](#data_dependencies) and [cascades](#cascade_data_identity).
* `Custom`: Your best mode.

### Memory

The memory storage use the RAM memory to memoize data.

This is the default storage mode. It is the fastest way to retreive  data from an existing input. 
But this mode has 2 caveats. The first caveat is that the data is cached within your current process only. Once your process ends all cached data are lost. It avoids you sharing the cached data between processes neither. The second caveat is the limited memory, constrained by your hardware to a couple of Giga Bytes maximum.

```python
from doorget import cache, StorageMode
import panda as pd


@cache # Memory by default
def foo(name: str) -> pd.DataFrame:
    pass

@cache(mode=StorageMode.Memory)
def bar(name: str) -> pd.DataFrame:
    pass

```

### Disk

The disk storage use the disk to memoize data.

This is the most common usage of the memoization. It is slower than the `Memory` mode because the data is read from
the disk but it solves the 2 caveats. The built in `Disk` mode uses `parquet` format if a pandas `DataFrame`
is returned, `pickle` otherwise.

If no cache folder is specified, a global folder is used instead. When the global folder is prefered, a sub folder is created by function from its name and module to guaranty an unique storage location. 

The global folder can change at any time with the function `setup_disk_storage`.

```python
from doorget import cache, StorageMode, setup_disk_storage
import panda as pd


@cache(mode=StorageMode.Disk) # Use the global folder
def fetch_default(name: str) -> pd.DataFrame:
    pass

@cache(cache_folder='./my_custom/folder/bar') # Overrides the global folder
def fetch_custom(name: str) -> pd.DataFrame:
    pass

# do it
fetch_default('foo')
# do it
fetch_custom('bar')

setup_disk_storage('./my_default/folder')

# do it
fetch_default('foo') # Function is called because the cache folder changed
# get it
fetch_custom('bar') # The function is not called because the custom folder is unchanged
```

### Custom

The custom storage allows you to provide you own cache where the data is memoized.

When you specify this mode you have to specify the `storage` argument. This argument takes any kind of storage which could fits better your use case, for performances, infrastructure, sharing, ... purpose.

Your custom storage imlplementation must inherit from the `Storage` class.

```python
from typing import Any, List
from dataclasses import dataclass
from doorget import CacheKey

@dataclass
class Storage:
    name: str

    # Core functions
    def contains(self, key: CacheKey) -> bool:
        pass
    def fetch(self, key: CacheKey) -> Any:
        pass
    def store(self, key: CacheKey, data: Any) -> None:
        pass

    # Cache management helpers
    def clear(self) -> None:
        pass
    def remove(self, key: CacheKey) -> bool:
        pass
    def keys(self) -> List[CacheKey]:
        pass
```

## Data dependency

The added value from a simple memoized package is the carrying of data dependencies. 

When a complex object, like a pandas `DataFrame`, is passed as an argument, a simple memoization won't work.
With `doorget` when a complex object is passed as an argument, it is not used directly to build the key, but substitued with its own memoized key.

```python
from doorget import cache
import panda as pd

@cache
def fetch(name: str) -> pd.DataFrame:
    pass

@cache
def summarize(by: str, df: pd.DataFrame) -> pd.DataFrame:
    pass

df = fetch('foo') # key for df: <fetch('foo')>
summary = summarize('date', df) # key for summary: <summarize(<fetch('foo')>, 'daily')>

```

Under the hood a memoized function keep a track of any object reference (Identity in python) returned by attaching its 
memoized key. If the function returns a `Tuple`, the contained items are tracked as well.


## Cascade data identity

Sometime it is faster to transform a data than memoizing it. If this transformed data is used as an argument for another memoized function, the data dependency is lost. To avoid breaking a dependency, you can isolate your
data transformation within a memoized function with the `Identity` storage mode. It will carry no additional storage
than what your program is currently using. Under the hood the data graph is tracked by their references or `Identity`
in python.

```python
from doorget import cache, StorageMode
import panda as pd

@cache(mode=StorageMode.Identity)
def transform(x: pd.DataFrame) -> pd.DataFrame:
    return x.copy()

df = fetch('foo') # key for df: <fetch('foo')>
df_copy = transform(df) # key for df_copy: <transform(<fetch('foo')>)>

summary = summarize('date', df_copy) # key for summary: <summarize(<transform(<fetch('foo')>)>, 'daily')>
```

## Administrate your caches

The package provides you some backdoors to administrate your caches and more precisely your storages.

* `list_memory_storages()`
* `list_disk_storages()`
* `list_custom_storages()`

Each memoized function has its owned storage list. From a storage you can list all the contained keys,  
remove some items or clear them all. A function call, is stored in a `CacheKey` object. A `CacheKey` object 
contains the function name with its arguments combination as a `Tuple`. If the key has data depedencies, 
the dependencies are nested by using `Tuple` recursively.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "doorget",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "memoize,memoization,doorget",
    "author": "",
    "author_email": "Fabien Dieulle <fabien.dieull@hotmail.fr>",
    "download_url": "https://files.pythonhosted.org/packages/45/8b/9ea69010e53de2646deeaa4fe9ac3b5f9a450ca43272c402cb034c493207/doorget-1.1.tar.gz",
    "platform": null,
    "description": "# doorget\n\n[![build](https://github.com/fdieulle/doorget/actions/workflows/build.yml/badge.svg)](https://github.com/fdieulle/doorget/actions/workflows/build.yml)\n[![release](https://github.com/fdieulle/doorget/actions/workflows/release.yml/badge.svg)](https://github.com/fdieulle/doorget/actions/workflows/release.yml)\n[![codecov](https://codecov.io/gh/fdieulle/doorget/graph/badge.svg?token=NQQ17SQUYK)](https://codecov.io/gh/fdieulle/doorget)\n\n[![license](https://img.shields.io/badge/license-MIT-blue.svg?maxAge=3600)](./LICENSE) \n[![pypi](https://img.shields.io/pypi/v/doorget.svg)](https://pypi.org/project/doorget/)\n[![python supported](https://img.shields.io/pypi/pyversions/doorget.svg)](https://pypi.org/project/doorget/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\nPython package which memoizes functions and supports data dependencies across them.\nWith the `cache` decorator the memoization of your code is low touch.\n\n## Memoization\n\n### Definition\n\n> In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts, such as in simple mutually recursive descent parsing. [Wikipedia](https://en.wikipedia.org/wiki/Memoization) \n\n### Example\n\n```python\nfrom doorget import cache\nimport panda as pd\n\n@cache\ndef fetch(name: str) -> pd.DataFrame:\n    pass\n\n# do it\nfoo = fetch('foo') # The function is called\n# get it\nfoo_again = fetch('foo') # The previous returned data is read from the cache and the function is not called.\n# do it\nbar = fetch('bar') # The function is called because the input is not known yet\n```\n\n## Storage modes\n\nThe package propose you 3 built in storage modes and the ability to proivde your customized storages.\n\n* `Memory`: The fastest mode.\n* `Disk`: The unlimited mode.\n* `Identity` The tracker mode for data [dependencies](#data_dependencies) and [cascades](#cascade_data_identity).\n* `Custom`: Your best mode.\n\n### Memory\n\nThe memory storage use the RAM memory to memoize data.\n\nThis is the default storage mode. It is the fastest way to retreive  data from an existing input. \nBut this mode has 2 caveats. The first caveat is that the data is cached within your current process only. Once your process ends all cached data are lost. It avoids you sharing the cached data between processes neither. The second caveat is the limited memory, constrained by your hardware to a couple of Giga Bytes maximum.\n\n```python\nfrom doorget import cache, StorageMode\nimport panda as pd\n\n\n@cache # Memory by default\ndef foo(name: str) -> pd.DataFrame:\n    pass\n\n@cache(mode=StorageMode.Memory)\ndef bar(name: str) -> pd.DataFrame:\n    pass\n\n```\n\n### Disk\n\nThe disk storage use the disk to memoize data.\n\nThis is the most common usage of the memoization. It is slower than the `Memory` mode because the data is read from\nthe disk but it solves the 2 caveats. The built in `Disk` mode uses `parquet` format if a pandas `DataFrame`\nis returned, `pickle` otherwise.\n\nIf no cache folder is specified, a global folder is used instead. When the global folder is prefered, a sub folder is created by function from its name and module to guaranty an unique storage location. \n\nThe global folder can change at any time with the function `setup_disk_storage`.\n\n```python\nfrom doorget import cache, StorageMode, setup_disk_storage\nimport panda as pd\n\n\n@cache(mode=StorageMode.Disk) # Use the global folder\ndef fetch_default(name: str) -> pd.DataFrame:\n    pass\n\n@cache(cache_folder='./my_custom/folder/bar') # Overrides the global folder\ndef fetch_custom(name: str) -> pd.DataFrame:\n    pass\n\n# do it\nfetch_default('foo')\n# do it\nfetch_custom('bar')\n\nsetup_disk_storage('./my_default/folder')\n\n# do it\nfetch_default('foo') # Function is called because the cache folder changed\n# get it\nfetch_custom('bar') # The function is not called because the custom folder is unchanged\n```\n\n### Custom\n\nThe custom storage allows you to provide you own cache where the data is memoized.\n\nWhen you specify this mode you have to specify the `storage` argument. This argument takes any kind of storage which could fits better your use case, for performances, infrastructure, sharing, ... purpose.\n\nYour custom storage imlplementation must inherit from the `Storage` class.\n\n```python\nfrom typing import Any, List\nfrom dataclasses import dataclass\nfrom doorget import CacheKey\n\n@dataclass\nclass Storage:\n    name: str\n\n    # Core functions\n    def contains(self, key: CacheKey) -> bool:\n        pass\n    def fetch(self, key: CacheKey) -> Any:\n        pass\n    def store(self, key: CacheKey, data: Any) -> None:\n        pass\n\n    # Cache management helpers\n    def clear(self) -> None:\n        pass\n    def remove(self, key: CacheKey) -> bool:\n        pass\n    def keys(self) -> List[CacheKey]:\n        pass\n```\n\n## Data dependency\n\nThe added value from a simple memoized package is the carrying of data dependencies. \n\nWhen a complex object, like a pandas `DataFrame`, is passed as an argument, a simple memoization won't work.\nWith `doorget` when a complex object is passed as an argument, it is not used directly to build the key, but substitued with its own memoized key.\n\n```python\nfrom doorget import cache\nimport panda as pd\n\n@cache\ndef fetch(name: str) -> pd.DataFrame:\n    pass\n\n@cache\ndef summarize(by: str, df: pd.DataFrame) -> pd.DataFrame:\n    pass\n\ndf = fetch('foo') # key for df: <fetch('foo')>\nsummary = summarize('date', df) # key for summary: <summarize(<fetch('foo')>, 'daily')>\n\n```\n\nUnder the hood a memoized function keep a track of any object reference (Identity in python) returned by attaching its \nmemoized key. If the function returns a `Tuple`, the contained items are tracked as well.\n\n\n## Cascade data identity\n\nSometime it is faster to transform a data than memoizing it. If this transformed data is used as an argument for another memoized function, the data dependency is lost. To avoid breaking a dependency, you can isolate your\ndata transformation within a memoized function with the `Identity` storage mode. It will carry no additional storage\nthan what your program is currently using. Under the hood the data graph is tracked by their references or `Identity`\nin python.\n\n```python\nfrom doorget import cache, StorageMode\nimport panda as pd\n\n@cache(mode=StorageMode.Identity)\ndef transform(x: pd.DataFrame) -> pd.DataFrame:\n    return x.copy()\n\ndf = fetch('foo') # key for df: <fetch('foo')>\ndf_copy = transform(df) # key for df_copy: <transform(<fetch('foo')>)>\n\nsummary = summarize('date', df_copy) # key for summary: <summarize(<transform(<fetch('foo')>)>, 'daily')>\n```\n\n## Administrate your caches\n\nThe package provides you some backdoors to administrate your caches and more precisely your storages.\n\n* `list_memory_storages()`\n* `list_disk_storages()`\n* `list_custom_storages()`\n\nEach memoized function has its owned storage list. From a storage you can list all the contained keys,  \nremove some items or clear them all. A function call, is stored in a `CacheKey` object. A `CacheKey` object \ncontains the function name with its arguments combination as a `Tuple`. If the key has data depedencies, \nthe dependencies are nested by using `Tuple` recursively.\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Memoize package supporting data dependencies",
    "version": "1.1",
    "project_urls": {
        "Homepage": "https://github.com/fdieulle/doorget",
        "Issues": "https://github.com/fdieulle/doorget/issues"
    },
    "split_keywords": [
        "memoize",
        "memoization",
        "doorget"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2b8bd70df9dd04c52f17772071326e4d8f1275e89fd3171edb4fb4623c75f475",
                "md5": "8437bd829729aa6e5c3bab8505d36ff9",
                "sha256": "da1462d193935dd68629a73a4884e33bc8dc3d90864b0f6e55550d2aedce1adc"
            },
            "downloads": -1,
            "filename": "doorget-1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8437bd829729aa6e5c3bab8505d36ff9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 11882,
            "upload_time": "2024-03-11T16:59:15",
            "upload_time_iso_8601": "2024-03-11T16:59:15.709012Z",
            "url": "https://files.pythonhosted.org/packages/2b/8b/d70df9dd04c52f17772071326e4d8f1275e89fd3171edb4fb4623c75f475/doorget-1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "458b9ea69010e53de2646deeaa4fe9ac3b5f9a450ca43272c402cb034c493207",
                "md5": "415b1f9f243ce5c79b51afb7271e0462",
                "sha256": "61a4cd87e712e604832e02c6c941ec6c9c5abcb968f27b70524d547ee9c59091"
            },
            "downloads": -1,
            "filename": "doorget-1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "415b1f9f243ce5c79b51afb7271e0462",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 15900,
            "upload_time": "2024-03-11T16:59:16",
            "upload_time_iso_8601": "2024-03-11T16:59:16.764094Z",
            "url": "https://files.pythonhosted.org/packages/45/8b/9ea69010e53de2646deeaa4fe9ac3b5f9a450ca43272c402cb034c493207/doorget-1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-11 16:59:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "fdieulle",
    "github_project": "doorget",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "doorget"
}
        
Elapsed time: 2.97783s