memori


Name: memori
Version: 0.3.6
Summary: A python library for creating memoized data and code for neuroimaging pipelines
Upload time: 2023-04-05 03:13:10
Author: Andrew Van <vanandrew@wustl.edu>
Requires Python: >=3.7
License: MIT License
Keywords: neuroimaging, pipeline, memoization

# memori
[![CircleCI](https://circleci.com/gh/vanandrew/memori/tree/main.svg?style=svg)](https://circleci.com/gh/vanandrew/memori/tree/main)
[![Python package](https://github.com/vanandrew/memori/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/vanandrew/memori/actions/workflows/python-package.yml)
[![codecov](https://codecov.io/gh/vanandrew/memori/branch/main/graph/badge.svg?token=DSVJMHTVLE)](https://codecov.io/gh/vanandrew/memori)

A python library for creating memoized data and code for neuroimaging pipelines

## Table of Contents

1. [Installation](#installation)
2. [Command-Line Usage](#command-line-usage)
    1. [`memori`](#memori)
    2. [`pathman`](#pathman)
3. [Python Usage](#python-usage)
    1. [The `Stage` Object](#the-stage-object)
    2. [The `Pipeline` Object](#the-pipeline-object)
    3. [Stage Aliases and Complex Pipelines](#stage-aliases-and-complex-pipelines)
    4. [Hashing external functions](#hashing-external-functions)
    5. [Path Management](#path-management)

## Installation

To install, use `pip`:
```
pip install memori
```
## Command-Line Usage

`memori` can be used to memoize the running of command-line scripts. It checks
the inputs and the SHA256 integrity of the calling script to determine whether
that script needs to be run again. It accomplishes this through 3 checks:

1. Check against the stored cache that the input arguments are all the same.
2. Check that the SHA256 hashes of the calling script (and its dependents) are the same.
3. (Optional) Check that the desired outputs match the hashes stored in the cache.

If at least one of these conditions is not met, `memori` will re-run the script.
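
The three checks can be sketched in plain Python. This is only an illustration of the idea, not `memori`'s actual implementation: the helper names (`sha256_of`, `needs_rerun`, `record_run`) and the single-JSON-file cache layout are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(cache_file, args, scripts, outputs=()):
    """Return True if any of the three checks fails (hypothetical cache layout)."""
    cache_file = Path(cache_file)
    if not cache_file.is_file():
        return True  # nothing has been cached yet
    cached = json.loads(cache_file.read_text())
    # check 1: input arguments are all the same
    if cached.get("args") != list(args):
        return True
    # check 2: hashes of the calling script and its dependents are the same
    if cached.get("scripts") != {str(s): sha256_of(s) for s in scripts}:
        return True
    # check 3 (optional): expected outputs exist and match their cached hashes
    for out in outputs:
        if not Path(out).is_file():
            return True
        if cached.get("outputs", {}).get(str(out)) != sha256_of(out):
            return True
    return False


def record_run(cache_file, args, scripts, outputs=()):
    """Store the state of a successful run so later calls can be skipped."""
    state = {
        "args": list(args),
        "scripts": {str(s): sha256_of(s) for s in scripts},
        "outputs": {str(o): sha256_of(o) for o in outputs},
    }
    Path(cache_file).write_text(json.dumps(state))
```

A caller would run the command only when `needs_rerun(...)` returns `True`, and call `record_run(...)` after the command succeeds.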

### memori

The main command-line script is simply called `memori`:

```bash
memori -h [any command/script here]
```

The above command will let you view the help of the memori script.

To call `memori` on a script, simply add `memori` before the command you want
to call. For example:

```bash
# wrapping echo in memori
memori echo "this echo command has been wrapped in memori"
```

This will call the `echo` command with `memori`. To cache the running of the
command, you need to specify the `-d/--hash_output` flag:

```bash
# same command but with cached run
memori -d /path/to/cache echo 1
# the first call will print 1
memori -d /path/to/cache echo 1
# running this a second time will not print anything to the screen
# since the inputs/command is the same, so execution is skipped!
```

Since `memori` detects changes to a calling script through hashing, you
may also want execution to be re-triggered when a script that it depends on
changes. For example, script 1 may call script 2, and changes may be made
to script 2. This can be accomplished through the `-c/--dependents` flag.

```bash
# script execution of script1.sh is now sensitive to changes in script2.sh
memori -c script2.sh -d /path/to/cache script1.sh arg0 arg1...
```

If we are expecting certain files to be written from a calling script,
we can inform `memori` of their existence through the `-o/--outputs` flag.
`memori` will re-run the calling script if the files are missing/modified.

```bash
memori -o /path/to/an/expected/output -d /path/to/cache script.sh arg0 arg1...
```

The `-k/--kill` flag can be used to kill the parent process if the calling
script returns an error code. This can be useful to halt a parent script if
execution has failed.

Use the `--verbose` flag for under-the-hood logging info!

### pathman

`pathman` is a script for conveniently manipulating file paths. To view the
full help:

```bash
pathman -h
```

## Python Usage

`memori` uses a directed acyclic graph (DAG) approach to constructing pipelines.
Nodes of the graph represent a "logical unit of processing" (up to the user
to define) that can be encapsulated in a function. The edges of the
graph transfer data between these nodes to create a pipeline.
To represent this, `memori` employs the `Stage` and `Pipeline` objects.

### The `Stage` object

A `Stage` is a wrapper around a python function and is the conceptual equivalent
of a node of our graph. A `Stage` object can take input/output from/to other `Stage`
objects, but can also be run in isolation. Here is an example of a `Stage` wrapped
around a python function:

```python
# our example function
def test_function(a, b, c):
    # Do some stuff
    d = a + b
    e = b + c
    
    # and return stuff
    return d, e
```

We can wrap this function in a `Stage` object and run it:
```python
from memori import Stage

# any values a function returns need to be labeled with the `stage_outputs` parameter
my_test_stage = Stage(test_function, stage_outputs=["d", "e"])

# we can run this stage with the run method and store the results
result = my_test_stage.run(1, 2, 3)
# result will return a dictionary containing: {"d": 3, "e": 5}

# running it again with different parameters
result = my_test_stage.run(2, 3, 4)
# result will return a dictionary containing: {"d": 5, "e": 7}
```

Now let's write a second function that can take input from our `test_function`. Note that
the input arguments for this function should match the key names of the stage outputs
for the `test_function`.

```python
# new test function with input arguments matching previous stage
# function stage_output names
def test_function2(d, e):
    return d + e

# and wrap this in a Stage
my_test_stage2 = Stage(test_function2, stage_outputs=["f"])

# to run this we merely need to use **result (kwarg unpacking) to pass information
# from my_test_stage to my_test_stage2
result2 = my_test_stage2.run(**result)
# result2 will return a dictionary containing: {"f": 12}

# or running the entire pipeline from the beginning
result2 = my_test_stage2.run(**my_test_stage.run(1, 2, 3))
# result2 will return a dictionary containing: {"f": 8}

# The previous call is equivalent to running
test_function2(*test_function(1, 2, 3))
```

We can create static values in our `Stage` object that override any inputs (for example,
from other stages) that are passed into the `run` method.

```python3
# Stage will take the same params as test_function
# and use them as static values
my_test_stage = Stage(
    test_function,
    stage_outputs=["d", "e"],
    a=1,
    b=2,
    c=3
)

# when we run the stage, we will see that it does not change with the input (2, 3, 4)
result = my_test_stage.run(2, 3, 4)
# result will return a dictionary containing: {"d": 3, "e": 5}
# if static values weren't used this should return {"d": 5, "e": 7}
```

Now we know how to wrap the functions we write into a `Stage` object, but what benefit
does this provide? The main feature of `memori` is to memoize the inputs to each
stage and recall the outputs if the inputs are the same. This enables long-running
functions to be skipped if the results are going to be the same!

```python
# To enable the memoization feature, we need to add the hash_output
# parameter when constructing a Stage object. hash_output is
# just some directory where the memoization files can be
# written to.
my_test_stage = Stage(test_function, stage_outputs=["d", "e"], hash_output="/test/directory")

# run the stage
my_test_stage.run(1, 2, 3)
```

This will write 3 files: `test_function.inputs`, `test_function.stage`, and
`test_function.outputs` at the location `/test/directory`.
These 3 files record the important states of the Stage for memoization after it has
been run.

The `.stage` file contains information about the function that was run.
It contains some rudimentary static analysis to check whether any code
wrapped by a Stage has changed in a way that will affect the result. If `memori`
detects such a change, it will rerun the stage. Note that this file contains binary data
and is mostly non-human readable (unlike the `.inputs` and `.outputs` files).

The `.inputs` and `.outputs` files contain information about the inputs and outputs of the stage. These files are simply JSON files and upon opening them in a text editor you should see the following:

`test_function.inputs`
```json
{
    "a": 1,
    "b": 2,
    "c": 3
}
```

`test_function.outputs`
```json
{
    "d": 3,
    "e": 5
}
```

`memori` checks the `.inputs` file on each run to determine if the stage needs to be run (assuming it has also passed the `.stage` file check). If the stage is skipped, the `.outputs` file is used to load the results into the stage.
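
The input/output part of this check can be sketched as follows. This is a simplified stand-in and not `memori`'s own code: `run_memoized` is a hypothetical helper, it only handles keyword inputs, and it leaves out the `.stage` check entirely.

```python
import json
from pathlib import Path


def run_memoized(func, output_names, hash_output, **inputs):
    """Run func, or reload its outputs if the recorded inputs are unchanged."""
    hash_dir = Path(hash_output)
    hash_dir.mkdir(parents=True, exist_ok=True)
    inputs_file = hash_dir / f"{func.__name__}.inputs"
    outputs_file = hash_dir / f"{func.__name__}.outputs"

    # skip execution when the stored inputs match the current ones
    if inputs_file.is_file() and outputs_file.is_file():
        if json.loads(inputs_file.read_text()) == inputs:
            return json.loads(outputs_file.read_text())

    # otherwise run the function and label its return values
    result = func(**inputs)
    if len(output_names) == 1:
        result = (result,)
    outputs = dict(zip(output_names, result))

    # record the new state for the next call
    inputs_file.write_text(json.dumps(inputs, indent=4))
    outputs_file.write_text(json.dumps(outputs, indent=4))
    return outputs
```

Calling `run_memoized` twice with the same inputs executes the function only once; the second call reads the `.outputs` file instead.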

By default, `memori` uses the name of the function as the name for the hash files. If you
would like to use a different name for these files, you can set the name of the Stage object with
the `stage_name` parameter in the constructor:

```python
# Stage with a custom stage name
my_test_stage = Stage(
    test_function,
    stage_outputs=["d", "e"],
    hash_output="/test/directory",
    stage_name="my_stage_name",
)
```

When passing path/file strings between `Stage` objects, `memori` has a special behavior: if it
determines the string to be a valid file on the disk, it will hash it with the SHA256
algorithm. For files, this gives memoization results that can reflect changes in data integrity:

```python
# now we specify the input and output to be files on the disk
file0 = "/Some/file/path"
file1 = "/Some/second/file/path"

# define our simple test_function that outputs a file path
def test_function3(file0):
    # always return file1
    return file1

# Now we wrap it in a stage
my_test_stage3 = Stage(test_function3, stage_outputs=["file1"], hash_output="/test/directory")

# and run the stage with file0 as the input
results3 = my_test_stage3.run(file0)
```
Now if you examine the `test_function3.inputs` and `test_function3.outputs` you will see the following:

`test_function3.inputs`
```json
{
    "file0": {
        "file": "/Some/file/path",
        "hash": "f0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b"
    }
}
```

`test_function3.outputs`
```json
{
    "file1": {
        "file": "/Some/second/file/path",
        "hash": "f91c3b6b3ec826aca3dfaf46d47a32cc627d2ba92e2d63d945fbd98b87b2b002"
    }
}
```

As shown above, `memori` replaces a valid file path with a dictionary entry containing the `"file"` and `"hash"` keys. Valid files are compared by hash values rather than by path/filename, ensuring data integrity.
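
A minimal sketch of this substitution, using only the standard library, might look like the following. The function name `hash_if_file` is made up for this example and the real logic inside `memori` may differ.

```python
import hashlib
from pathlib import Path


def hash_if_file(value):
    """Replace a string naming an existing file with a {"file", "hash"} entry."""
    if isinstance(value, str) and Path(value).is_file():
        digest = hashlib.sha256(Path(value).read_bytes()).hexdigest()
        return {"file": value, "hash": digest}
    # anything else (numbers, non-file strings, ...) is passed through unchanged
    return value
```

Because the digest is computed from the file contents, two runs that receive the same path but different data produce different entries, which is what forces a rerun.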

> **NOTE**: `"file"` and `"hash"` are keywords used to hash valid files, so they are reserved and should NOT be used as keys when returning a dictionary output from a stage. Doing so could lead to catastrophic results!

> **CAUTION**: `memori` uses JSON to memoize and pass information
> between `Stage` objects. This means that the inputs/outputs of your function MUST be JSON
> serializable or you will get a serialization error. You can
> also get data conversion effects if you don't use the proper
> data types. For example, Python always converts a tuple to a
> list when serializing a dictionary to JSON. This will cause the
> hash check to fail each time you run the Stage, because whenever `memori` loads the stage
> output data from the `.outputs` file, the tuple in the code will never match the
> list it was converted to in the JSON. So take care to
> use only JSON-compatible data types (this means None, integers, floats,
> strings, bools, lists, and dictionaries are the only valid
> input/output data types in `memori`).
>
> For data that is not JSON serializable, the typical workaround is to save it to a file
> and pass the file location between the `Stage` objects. This also allows you to take
> advantage of the SHA256 file hashing features of `memori`.
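
The tuple-to-list conversion is easy to reproduce with the standard `json` module alone. The dictionary below is an arbitrary example, not data produced by `memori`.

```python
import json

# a stage output that contains a tuple
outputs = {"shape": (2, 3)}

# round-trip through JSON, as happens when outputs are written and reloaded
reloaded = json.loads(json.dumps(outputs))
print(reloaded)             # {'shape': [2, 3]}
print(reloaded == outputs)  # False, a list never compares equal to a tuple

# using a list from the start keeps the round trip stable
safe = {"shape": [2, 3]}
print(json.loads(json.dumps(safe)) == safe)  # True
```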

### The `Pipeline` object

What happens when you have more complex pipelines? Maybe you have a `Stage`
that needs to provide input to two different `Stage` objects.

This is where the `Pipeline` object comes in. A `Pipeline` is a collection of Stage
objects with their input/output connections defined. A `Pipeline` object represents
the conceptual DAG that was mentioned above.

```python
from memori import Stage, Pipeline

# create some stages (see the previous section on Stages for details)
stage0 = Stage(...)  # some params go here
stage1 = Stage(...)  # some params go here
stage2 = Stage(...)  # some params go here
stage3 = Stage(...)  # some params go here

# Now we create a Pipeline object. A pipeline takes a definition list during construction.
# The definition list is a list of tuples specifying the connections between stages.
#
# The "start" keyword is a special instruction that the Pipeline object can read.
# It specifies that a particular stage has no preceding Stage and should be a Stage
# that is run first in the Pipeline.
p = Pipeline([
    ("start", stage0),  # stage0 takes no input from other stages, so it should run first
    (stage0, stage1),  # stage0 passes its output to stage1
    (stage0, stage2),  # and also to stage2
    ((stage1, stage2), stage3)  # stage3 needs inputs from stage1 and stage2, so we use a
                                # special tuple-in-tuple so that it can get outputs from both
                                # NOTE: if stage1 and stage2 have stage_outputs with the same
                                # name, the right-most stage takes precedence for that
                                # output
])

# we can run the Pipeline with the run method, and get its result
result = p.run(...)  # some input parameters go here
```

Running the pipeline has the effect of invoking the run method 
of each `Stage` object individually, and passing the result of the stage onto the
next stage as defined by the `Pipeline` definition passed in during `Pipeline`
initialization.
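
To make the data flow concrete, here is a toy re-creation of the idea in plain Python. `ToyStage` and `run_toy_pipeline` are hypothetical stand-ins written for this example and are not `memori`'s classes. The sketch assumes the definition list is already in execution order, and it returns only the output of the last stage.

```python
import inspect


class ToyStage:
    """Hypothetical stand-in for a Stage: a function plus names for its outputs."""

    def __init__(self, func, stage_outputs):
        self.func = func
        self.stage_outputs = stage_outputs

    def run(self, **kwargs):
        # keep only the keyword arguments that the wrapped function accepts
        params = inspect.signature(self.func).parameters
        result = self.func(**{k: v for k, v in kwargs.items() if k in params})
        if len(self.stage_outputs) == 1:
            result = (result,)
        return dict(zip(self.stage_outputs, result))


def run_toy_pipeline(definition, **inputs):
    """Run each (source, stage) pair in order, feeding outputs forward."""
    results = {}  # maps each stage object to its output dictionary
    for source, stage in definition:
        if isinstance(source, str) and source == "start":
            kwargs = dict(inputs)
        else:
            sources = source if isinstance(source, tuple) else (source,)
            kwargs = {}
            for s in sources:
                kwargs.update(results[s])  # right-most source wins on name clashes
        results[stage] = stage.run(**kwargs)
    return results[stage]


stage0 = ToyStage(lambda a, b: a + b, ["c"])
stage1 = ToyStage(lambda c: c + 1, ["d"])
stage2 = ToyStage(lambda c: c * 2, ["e"])
stage3 = ToyStage(lambda d, e: d + e, ["f"])

result = run_toy_pipeline(
    [
        ("start", stage0),
        (stage0, stage1),
        (stage0, stage2),
        ((stage1, stage2), stage3),
    ],
    a=1,
    b=2,
)
print(result)  # {'f': 10}
```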

### Stage Aliases and Complex Pipelines

When building a complicated pipeline, sometimes the functions that you write
will have input argument names that are different from the `stage_outputs` names
that you have defined in a `Stage`. Consider the following example:

```python
def test_function(a, b):
    return a + b

def test_function2(c):
    # this might represent some complicated processing
    c += 1
    return c

def test_function3(d):
    # this might be another function with some more complicated processing
    d += 2
    return d 
```

Now let's say I want to pass the result of `test_function` to both `test_function2` and
`test_function3`. This presents an issue because `test_function2` and `test_function3` have
different input argument names. If I define the `stage_outputs` of the wrapped `test_function`
to be `stage_outputs=["c"]`, this won't work for `test_function3`, and if I define it to be
`stage_outputs=["d"]`, it won't work for `test_function2`.

One way of solving this issue would be to rewrite the `test_function2` and `test_function3`
functions to have the same argument name, but this may not always be possible (particularly when
wrapping a function call from a third-party library). Another option would be to wrap the
call of either `test_function2` or `test_function3` to take in the same input. For example:

```python3
# this is necessary for hashing external function calls
# more about the hashable wrapper in the next section
from memori import hashable

# we wrap the call of test_function3
def test_function3_mod(c):
    return hashable(test_function3)(c)
```

Now when we create the `Stage` for each function, `test_function2` and `test_function3_mod` now have the same input argument names and can take in input from `test_function`.

While this solution works (and indeed this was how it used to be done), `memori` provides a more
convenient solution through Stage aliases. Aliases can map the name of one of the stage outputs to
another name. When creating a `Stage` object, you can define this through the `aliases` parameter.

```python
# We wrap test_function in a Stage, and specify an alias from d -> c
test_stage = Stage(test_function, stage_outputs=["c"], aliases={"d": "c"})

# Now I can construct stages around test_function2 and test_function3 without
# writing extra code
test_stage2 = Stage(test_function2, stage_outputs=["e"])
test_stage3 = Stage(test_function3, stage_outputs=["f"])

# now define the pipeline
my_pipeline = Pipeline(
    [
        ("start", test_stage),
        (test_stage, test_stage2),
        (test_stage, test_stage3), # because we mapped d -> c, memori knows where to pass the result to
    ]
)
    ]
)
```

Stage aliases reduce the need for extra boilerplate code, and adding an extra
stage that feeds from `test_stage` is as simple as adding another alias.
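
Conceptually, an alias makes an existing output available under a second name. A minimal sketch of that idea, using a hypothetical helper `apply_aliases` rather than anything from `memori`, could look like this:

```python
def apply_aliases(outputs, aliases):
    """Return a copy of `outputs` with every alias name pointing at its target's value."""
    expanded = dict(outputs)
    for alias, target in aliases.items():
        expanded[alias] = outputs[target]
    return expanded


# the stage produced "c"; the alias d -> c exposes the same value as "d"
outputs = {"c": 3}
expanded = apply_aliases(outputs, {"d": "c"})
print(expanded)  # {'c': 3, 'd': 3}
```

A downstream function that expects `c` and another that expects `d` can then both pick their argument out of the same dictionary.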

### Hashing external functions

In the last section, we saw the use of the hashable wrapper when trying to wrap a
function call in another function. But what does it actually do? Consider the
following example:

```python
def test_function(a, b):
    c = a + b
    d = test_function2(c)
    return d

def test_function2(c):
    return c + 1

stage0 = Stage(test_function, stage_outputs=["d"], hash_output="test")
result = stage0.run(1, 2)
# this will return the result {"d": 4}
```

Now, what if we change the code of test_function to:

```python
# change up test_function!
def test_function(a, b):
    c = a + b + 1
    d = test_function2(c)
    return d
```

Rebuilding the stage on this function and invoking the `run` method will cause the
`.stage` hash to mismatch (since the function's code is different with the added
`+ 1`), and the function will rerun instead of loading from cache
(this should return the result `{"d": 5}`).

So the function hashing feature of `memori` works! But what happens when we modify
`test_function2` and rerun our stage?

```python
# will memori see this change?
def test_function2(c):
    return c + 2
```

Rerunning the stage with the updated `test_function2`, you will see that after invoking
`run`, the `Stage` object simply loads the result from the `.outputs` file and ignores
the difference in the updated `test_function2` (this will still return `{"d": 5}` rather
than `{"d": 6}`).

This occurs because `memori` function hashing is only one call deep, meaning that
only the instructions of the wrapped callable are hashed. Function calls inside a function are simply recorded as constants, so only
the name `test_function2` is memoized, not the actual instructions!
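
The effect can be reproduced with CPython's own code objects. The sketch below is only an illustration and makes no claim about how `memori` builds its `.stage` hash: `code_hash` is a hypothetical helper that hashes a function's bytecode, constants, and referenced names.

```python
import hashlib


def code_hash(func):
    """Hash the bytecode, constants, and referenced names of a single function."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode() + repr(code.co_names).encode()
    return hashlib.sha256(payload).hexdigest()


def helper(c):
    return c + 1


def outer(a, b):
    return helper(a + b)


before = code_hash(outer)


# redefine the helper; outer itself is untouched
def helper(c):
    return c + 2


after = code_hash(outer)

print(before == after)                      # True, the change is invisible to the hash
print("helper" in outer.__code__.co_names)  # True, only the name is stored in outer
print(outer(1, 2))                          # 5, yet the behavior did change
```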

To correct this issue, `memori` provides the `hashable` wrapper. This wrapper marks 
a function so that memori knows to try and hash it.

```python
# wrap test_function2 in hashable
def test_function(a, b):
    c = a + b + 1
    d = hashable(test_function2)(c)
    return d
```

Alternatively, you can add the hashable wrapper as a decorator.

```python
# this is the same as calling hashable(test_function2)
# but makes everything transparent
@hashable
def test_function2(c):
    return c + 1
```

This allows you to simply call `test_function2` without worrying about calling
the hashable wrapper each time.

### Path Management

`memori` also provides a path management utility called `PathManager`. It
is useful for manipulating file paths as well as suffixes and extensions.
It is derived from the `Path` object of the [pathlib](https://docs.python.org/3/library/pathlib.html) library, and so can use any of the
parent methods as well.

Here are a few useful examples:

```python
from memori import PathManager as PathMan

# a string to a path I want PathManager to manage
my_file_path_pm = PathMan("/my/path/to/a/file.ext.ext2")

# get only the file prefix
prefix = my_file_path_pm.get_prefix()
# prefix contains "file"

# get the path and file prefix
path_and_prefix = my_file_path_pm.get_path_and_prefix()
# path_and_prefix contains "/my/path/to/a/file"

# change path of the file, keeping the filename the same
repathed = my_file_path_pm.repath("/new/path")
# repathed contains "/new/path/file.ext.ext2"

# append a suffix (following the BIDS standard, suffixes should always have _)
suffixed = my_file_path_pm.append_suffix("_newsuffix")
# suffixed contains "/my/path/to/a/file_newsuffix.ext.ext2"

# replace last suffix
replaced = suffixed.replace_suffix("_newsuffix2")
# replaced contains "/my/path/to/a/file_newsuffix2.ext.ext2"

# delete last suffix
deleted = replaced.delete_suffix()
# deleted contains "/my/path/to/a/file.ext.ext2"

# methods can be chained together
chained = my_file_path_pm.repath("/new").append_suffix("_test").get_path_and_prefix()
# chained contains "/new/file_test"

# return as a string
my_file_path = chained.path
# my_file_path contains "/new/file_test"
```
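
For readers curious how such manipulations relate to plain `pathlib`, the sketch below approximates three of the operations with the standard library only. The function names are made up for this example and the code is not taken from `PathManager`; it assumes that everything from the first dot of the filename onward counts as the extension.

```python
from pathlib import PurePosixPath


def split_prefix(path):
    """Split a path into (parent directory, file prefix, combined extensions)."""
    p = PurePosixPath(path)
    extensions = "".join(p.suffixes)  # e.g. ".ext.ext2"
    prefix = p.name[: len(p.name) - len(extensions)] if extensions else p.name
    return p.parent, prefix, extensions


def append_suffix(path, suffix):
    """Insert a suffix between the file prefix and its extensions."""
    parent, prefix, extensions = split_prefix(path)
    return str(parent / f"{prefix}{suffix}{extensions}")


def repath(path, new_dir):
    """Move the file to a new directory, keeping the filename the same."""
    return str(PurePosixPath(new_dir) / PurePosixPath(path).name)


example = "/my/path/to/a/file.ext.ext2"
print(split_prefix(example)[1])              # file
print(append_suffix(example, "_newsuffix"))  # /my/path/to/a/file_newsuffix.ext.ext2
print(repath(example, "/new/path"))          # /new/path/file.ext.ext2
```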

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "memori",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "neuroimaging,pipeline,memoization",
    "author": "",
    "author_email": "Andrew Van <vanandrew@wustl.edu>",
    "download_url": "https://files.pythonhosted.org/packages/6b/cf/251c4e213117e1d891742022af999045cafac0c63aa6347c809499c305c4/memori-0.3.6.tar.gz",
    "platform": null,
    "description": "# memori\n[![CircleCI](https://circleci.com/gh/vanandrew/memori/tree/main.svg?style=svg)](https://circleci.com/gh/vanandrew/memori/tree/main)\n[![Python package](https://github.com/vanandrew/memori/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/vanandrew/memori/actions/workflows/python-package.yml)\n[![codecov](https://codecov.io/gh/vanandrew/memori/branch/main/graph/badge.svg?token=DSVJMHTVLE)](https://codecov.io/gh/vanandrew/memori)\n\nA python library for creating memoized data and code for neuroimaging pipelines\n\n## Table of Contents\n\n1. [Installation](#installation)\n2. [Command-Line Usage](#command-line-usage)\n    1. [`memori`](#memori)\n    2. [`pathman`](#pathman)\n3. [Python Usage](#python-usage)\n    1. [The `Stage` Object](#the-stage-object)\n    2. [The `Pipeline` Object](#the-pipeline-object)\n    3. [Stage Aliases and Complex Pipelines](#stage-aliases-and-complex-pipelines)\n    4. [Hashing external functions](#hashing-external-functions)\n    5. [Path Management](#path-management)\n\n## Installation\n\nTo install, use `pip`:\n```\npip install memori\n```\n## Command-Line Usage\n\n`memori` can be used to memoize the running of command-line scripts. It is designed\nto check the inputs and sha256 integrity of the calling script and determines whether the running of that calling script should be run or not. It accomplishes\nthis through 3 checks:\n\n1. Check against the stored cache that input arguments are all the same.\n2. Check that the sha256 hash of the calling script (and dependents) are the same.\n3. 
(Optional) Check that the desired outputs match the hashes stored in the cache.\n\nIf at least one of these conditions is not met, `memori` will re-run the script.\n\n### memori\n\nThe main command-line script to use `memori` is simply called `memori` on the command-line:\n\n```bash\nmemori -h [any command/script here]\n```\n\nThe above command will let you view the help of the memori script.\n\nTo call memori on a script simply add `memori` before the command you want\nto call. For example:\n\n```bash\n# wrapping echo in memori\nmemori echo \"this echo command has been wrapped in memori\"\n```\n\nThis will call the `echo` command with `memori`. To cache the running of the\ncommand, you need to specify the `-d/--hash_output` flag:\n\n```bash\n# same command but with cached run\nmemori -d /path/to/cache echo 1\n# the first call will print 1\nmemori -d /path/to/cache echo 1\n# running this a second time will not print anything to the screen\n# since the inputs/command is the same, so execution is skipped!\n```\n\nSince memori determines if a calling script has changed through hashing, you\nmay want to determine script execution if the calling script depends on another\nscript. This can occur if calling script 1 calls script 2 and changes are made\nto script 2. This can be accomplished through the `-c/--dependents` flag.\n\n```bash\n# script execution of script1.sh is now sensitive to changes in script2.sh\nmemori -c script2.sh -d /path/to/cache script1.sh arg0 arg1...\n```\n\nIf we are expecting certain files to be written from a calling script,\nwe can inform `memori` of their existence through the `-o/--outputs` flag.\n`memori` will re-run the calling script if the files are missing/modified.\n\n```bash\nmemori -o /path/to/an/expected/output -d /path/to/cache script.sh arg0 arg1...\n```\n\nThe `-k/--kill` flag can be used to kill the parent process, if the calling\nscript returns an error code. 
This can be useful to halt a parent script if\nexecution has failed.\n\nUse the `--verbose` flag for under the hood logging info!\n\n### pathman\n\n`pathman` is a script that allows for the convenient management of file\npath manipulations.\n\n```bash\npathman -h\n```\n\nTo view the full help.\n\n###\n\n## Python Usage\n\n`memori` uses a directed acyclic graph (DAG) approach to constructing pipelines.\nNodes of the the graph represent a \"logical unit of processing\" (up to the user\nto define) that can be encomposed in a function. The edges of the\ngraph transfers data between these nodes to create a pipeline.\nTo represent this `memori` employs the use of the `Stage` and `Pipeline` objects.\n\n### The `Stage` object\n\nA `Stage` is a wrapper around a python function and is the conceptual equivalent\nof a node of our graph. A `Stage` object can take input/output from/to other `Stage`\nobjects, but can also be run in isolation. Here is an example of a `Stage` wrapped\naround a python function:\n\n```python\n# our example function\ndef test_function(a, b, c):\n    # Do some stuff\n    d = a + b\n    e = b + c\n    \n    # and return stuff\n    return d, e\n```\n\nWe can wrap this function in a `Stage` object and run it:\n```python\nfrom memori import Stage\n\n# any values a function returns need to be labeled with the `stage_outputs` parameter\nmy_test_stage = Stage(test_function, stage_outputs=[\"d\", \"e\"])\n\n# we can run this stage with the run method and store the results\nresult = my_test_stage.run(1, 2, 3)\n# result will return a dictionary containing: {\"d\": 3, \"e\": 5}\n\n# running it again with different parameters\nresult = my_test_stage.run(2, 3, 4)\n# result will return a dictionary containing: {\"d\": 5, \"e\": 7}\n```\n\nNow lets write a 2nd function that can take input from our `test_function`. 
Note that \nthe input arguments for this function should match the key names of the stage outputs \nfor the `test_function`.\n\n```python\n# new test function with input arguments matching previous stage\n# function stage_output names\ndef test_function2(d, e):\n    return d + e\n\n# and wrap this in a Stage\nmy_test_stage2 = Stage(test_function2, stage_outputs=[\"f\"])\n\n# to run this we just merely need to **results (kwarg unpacking) to pass information\n# from my_test_stage to my_test_stage2\nresult2 = my_test_stage2.run(**results)\n# result2 will return a dictionary containing: {\"f\": 12}\n\n# or running the entire pipeline from the beginning\nresult2 = my_test_stage2.run(**my_test_stage.run(1, 2, 3))\n# result2 will return a dictionary containing: {\"f\": 8}\n\n# The previous two lines is the equivalent to running\ntest_function2(**test_function(1, 2, 3))\n```\n\nWe can create static values in our `Stage` object that ignores inputs from other stages \nthat are passed into the `run` method.\n\n```python3\n# Stage will take the same params as test_function\n# and use them as static values\nmy_test_stage = Stage(\n    test_function,\n    stage_outputs=[\"d\", \"e\"],\n    a=1,\n    b=2,\n    c=3\n)\n\n# when we run the stage, we will see that it does not change with the input (2, 3, 4)\nresult = my_test_stage.run(2, 3, 4)\n# result will return a dictionary containing: {\"d\": 3, \"e\": 5}\n# if static values weren't used this should return {\"d\": 5, \"e\": 7}\n```\n\nNow we know how to wrap the functions we write into a `Stage` object, but what benefit \ndoes this provide? The main feature of `memori` is to `memoize` the inputs to each \nstage and recall the outputs if they are the same. This can enable long running \nfunctions to be skipped if the results are going to be the same!\n\n```python\n# To enable memoization feature, we need to add the hash_output \n# parameter when constructing a Stage object. 
hash_output is \n# just some directory to where the memoization files can be \n# written to.\nmy_test_stage = Stage(test_function, stage_output=[\"d\", \"e\"], hash_output=\"/test/directory\")\n\n# run the stage\nmy_test_stage.run(1, 2, 3)\n```\nThis will write 3 files: `test_function.inputs`, `test_function.stage`, and \n`test_function.outputs` at the location: /test/directory\nThese 3 files record the important states of the Stage for memoization, after it has\nbeen run.\n\nThe `.stage` file contains information about the function that was run.\nIt contains some rudimentary static analysis to check whether and code\nwrapped by a Stage has changed in a way that will affect the result. If it has \ndetected this, it will rerun the stage. Note that this file contains binary data\nis mostly non-human readable (unlike the `.inputs` and `.outputs` files).\n\nThe `.inputs` and `.outputs` files contain information about the inputs and outputs of the stage. These files are simply JSON files and upon opening them in a text editor you should see the following:\n\n`test_function.inputs`\n```json\n{\n    \"a\": 1,\n    \"b\": 2,\n    \"c\": 3\n}\n```\n\n`test_function.outputs`\n```json\n{\n    \"d\": 3,\n    \"e\": 5\n}\n```\n\n`memori` checks the `.inputs` file on each run to determine if the stage needs to be run (assuming it has also passed the `.stage` file check). If the stage is skipped, the `.outputs` file is used to load the results into the stage.\n\nBy default, `memori` uses the name of the function as the name for the hash files. 
If you\nwould like to use a different name for these files, you can set the name of the Stage object with\nthe `stage_name` parameter in the constructor:\n\n```python\n# Stage with a custom stage name\nStage(...\n    stage_name=\"my_stage_name\"\n...)\n```\n\nWhen passing path/file strings between `Stage` objects, `memori` has a special behavior: if it\ndetermines the string to be a valid file on the disk, it will hash it with the SHA256\nalgorithm. For files, this gives memoization results that can reflect changes in data integrity:\n\n```python\n# now we specify the input and output to be files on the disk\nfile0 = \"/Some/file/path\"\nfile1 = \"/Some/second/file/path\"\n\n# define our simple test_function that outputs a file path\ndef test_function3(f0):\n    # always return file1\n    return file1\n\n# Now we wrap it in a stage\nmy_test_stage3 = Stage(test_function3, stage_outputs=[\"file1\"], hash_output=\"/test/directory\")\n\n# and run the stage with file0 as the input\nresults3 = my_test_stage3.run(file0)\n```\nNow if you examine the `test_function3.inputs` and `test_function3.outputs` you will see the following:\n\n`test_function3.inputs`\n```json\n{\n    \"file0\": {\n        \"file\": \"/Some/file/path\",\n        \"hash\": \"f0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b\"\n    }\n}\n```\n\n`test_function3.outputs`\n```json\n{\n    \"file1\": {\n        \"file\": \"/Some/second/file/path\",\n        \"hash\": \"f91c3b6b3ec826aca3dfaf46d47a32cc627d2ba92e2d63d945fbd98b87b2b002\"\n    }\n}\n```\n\nAs shown above `memori` replaces a valid file path with a dictionary entry containing the `\"file\"` and `\"hash\"` keys. Valid files are compared by hash values rather than path/filename ensuring data integrity.\n\n> **NOTE**: Since `\"file\"` and `\"hash\"` are keywords used to hash valid files. These are reserved keywords that should NOT be used when returning an output from a stage using a dictionary. 
Doing so could lead to catastrophic results!\n\n> **CAUTION**: `memori` uses JSON to memoize and pass information \n> between `Stage` objects. This means that the inputs/outputs of your function MUST be JSON\n> serializable or you will get a serialization error. You can\n> also get data conversion effects if you don't use the proper\n> data types. For example, Python always converts a tuple to a\n> list when serializing a dictionary to JSON. This will cause the\n> hash check to fail each time you run the Stage, because whenever memori loads the stage\n> output data from the `.outputs` file, the tuple in the code will never match the \n> list it was converted to in the JSON. So take care to\n> use only JSON compatible data types (this means None, integers, floats, \n> strings, bools, lists, and dictionaries are the only valid\n> input/output data types in `memori`). \n>\n> For data that is not JSON serializable, the typical workaround is to save it to a file\n> and pass the file location between the `Stage` objects. This also allows you to take\n> advantage of the SHA256 file hashing features of `memori`.\n\n### The `Pipeline` object\n\nWhat happens when you have more complex pipelines? Maybe you have a `Stage`\nthat needs to provide input to two different `Stage` objects.\n\nThis is where the `Pipeline` object comes in. A `Pipeline` is a collection of `Stage`\nobjects with their input/output connections defined. 
A `Pipeline` object represents\nthe conceptual DAG that was mentioned above.\n\n```python\nfrom memori import Stage, Pipeline\n\n# create some stages (see the last section on Stages for details)\nstage0 = Stage(some params go here...)\nstage1 = Stage(some params go here...)\nstage2 = Stage(some params go here...)\nstage3 = Stage(some params go here...)\n\n# Now we create a Pipeline object. A pipeline takes a definition list during construction;\n# the definition list is a list of tuples specifying the connections between stages.\n#\n# The \"start\" keyword is a special instruction that the Pipeline object can read.\n# It specifies that a particular stage has no preceding Stage and should be a Stage\n# that is run first in the Pipeline.\np = Pipeline([\n    (\"start\", stage0),  # stage0 takes no input from other stages, so it should run first\n    (stage0, stage1),  # stage0 passes its output to stage1\n    (stage0, stage2),  # and also to stage2\n    ((stage1, stage2), stage3)  # stage3 needs inputs from stage1 and stage2, so we use a\n                                # special tuple-in-tuple so that it can get outputs from both\n                                # NOTE: if stage1 and stage2 have stage_outputs with the same\n                                # name, the last (right-most) stage will have precedence\n                                # for its output\n])\n\n# we can run the Pipeline with the run method, and get its result\nresult = p.run(some input parameters here...)\n```\n\nRunning the pipeline has the effect of invoking the run method \nof each `Stage` object individually, and passing the result of the stage onto the\nnext stage as defined by the `Pipeline` definition passed in during `Pipeline`\ninitialization.\n\n### Stage Aliases and Complex Pipelines\n\nWhen building a complicated pipeline, sometimes the functions that you write\nwill have input argument names that are different from the `stage_outputs` names\nthat you have defined in a `Stage`. 
Consider the following example:\n\n```python\ndef test_function(a, b):\n    return a + b\n\ndef test_function2(c):\n    # this might represent some complicated processing\n    c += 1\n    return c\n\ndef test_function3(d):\n    # this might be another function with some more complicated processing\n    d += 2\n    return d \n```\n\nNow let's say I want to pass the result of `test_function` to both `test_function2` and\n`test_function3`. This presents an issue because `test_function2` and `test_function3` have\ndifferent input argument names. So if I define the `stage_outputs` of the wrapped `test_function`\nto be `stage_outputs=[\"c\"]` this won't work for `test_function3`, and if I define it to be\n`stage_outputs=[\"d\"]` it won't work for `test_function2`.\n\nOne way of solving this issue would be to rewrite the `test_function2` and `test_function3`\nfunctions to have the same argument name, but this may not always be possible (particularly when\nwrapping a function call from a third-party library). Another option would be to wrap the\ncall of either `test_function2` or `test_function3` to take in the same input. For example:\n\n```python3\n# this is necessary for hashing external function calls\n# more about the hashable wrapper in the next section\nfrom memori import hashable\n\n# we wrap the call of test_function3\ndef test_function3_mod(c):\n    return hashable(test_function3)(c)\n```\n\nNow when we create the `Stage` for each function, `test_function2` and `test_function3_mod` have the same input argument names and can take in input from `test_function`.\n\nWhile this solution works (and indeed this was how it used to be done), `memori` provides a more \nconvenient solution through Stage aliases. Aliases can map the name of one of the stage outputs to \nanother name. 
When creating a `Stage` object, you can define this through the `aliases` parameter.\n\n```python\n# We wrap test_function in a Stage, and specify an alias from d -> c\ntest_stage = Stage(test_function, stage_outputs=[\"c\"], aliases={\"d\": \"c\"})\n\n# Now I can construct stages around test_function2 and test_function3 without\n# writing extra code\ntest_stage2 = Stage(test_function2, stage_outputs=[\"e\"])\ntest_stage3 = Stage(test_function3, stage_outputs=[\"f\"])\n\n# now define the pipeline\nmy_pipeline = Pipeline(\n    [\n        (\"start\", test_stage),\n        (test_stage, test_stage2),\n        (test_stage, test_stage3), # because we mapped d -> c, memori knows where to pass the result to\n    ]\n)\n```\n\nStage aliases reduce the need for extra boilerplate code, and adding on an extra\nstage that feeds from `test_stage` is as simple as adding another alias.\n\n### Hashing external functions\n\nIn the last section, we saw the use of the hashable wrapper when trying to wrap a\nfunction call in another function. But what does it actually do? Consider the\nfollowing example:\n\n```python\ndef test_function(a, b):\n    c = a + b\n    d = test_function2(c)\n    return d\n\ndef test_function2(c):\n    return c + 1\n\nstage0 = Stage(test_function, stage_outputs=[\"d\"], hash_output=\"test\")\nresult = stage0.run(1, 2)\n# this will return the result {\"d\": 4}\n```\n\nNow, what if we change the code of test_function to:\n\n```python\n# change up test_function!\ndef test_function(a, b):\n    c = a + b + 1\n    d = test_function2(c)\n    return d\n```\n\nRebuilding the stage on this function and invoking the `run` method will cause the\n`.stage` hash to mismatch (since the function's code is different with the added\n`+ 1`), and the function will rerun instead of loading from cache\n(this should return the result `{\"d\": 5}`).\n\nSo the function hashing feature of memori works! 
But what happens when we modify\n`test_function2` and rerun our stage?\n\n```python\n# will memori see this change?\ndef test_function2(c):\n    return c + 2\n```\n\nRerunning the stage with the updated `test_function2`, you will see that after invoking\n`run`, the `Stage` object simply loads the result from the `.outputs` file and ignores\nthe difference in the updated `test_function2` (this will still return `{\"d\": 5}` rather\nthan `{\"d\": 6}`).\n\nThis occurs because `memori` function hashing only occurs one call deep, meaning that\nthe instructions of the wrapped callable are the only thing that is hashed. Function calls inside a function are simply recorded as constants, meaning that only\nthe name `test_function2` is memoized, not the actual instructions!\n\nTo correct this issue, `memori` provides the `hashable` wrapper. This wrapper marks \na function so that memori knows to try and hash it.\n\n```python\n# wrap test_function2 in hashable\ndef test_function(a, b):\n    c = a + b + 1\n    d = hashable(test_function2)(c)\n    return d\n```\n\nAlternatively, you can apply the hashable wrapper as a decorator.\n\n```python\n# this is the same as calling hashable(test_function2)\n# but makes everything transparent\n@hashable\ndef test_function2(c):\n    return c + 1\n```\n\nThis allows you to simply call `test_function2` without worrying about calling\nthe hashable wrapper each time.\n\n### Path Management\n\n`memori` also provides a path management utility called `PathManager`. 
It\nis useful for manipulating file paths as well as suffixes and extensions.\nIt is derived from a `Path` object from the [pathlib](https://docs.python.org/3/library/pathlib.html) library, and so can use any of the\nparent methods as well.\n\nHere are a few useful examples:\n\n```python\nfrom memori import PathManager as PathMan\n\n# a string to a path I want PathManager to manage\nmy_file_path_pm = PathMan(\"/my/path/to/a/file.ext.ext2\")\n\n# get only the file prefix\nprefix = my_file_path_pm.get_prefix()\n# prefix contains \"file\"\n\n# get the path and file prefix\npath_and_prefix = my_file_path_pm.get_path_and_prefix()\n# path_and_prefix contains \"/my/path/to/a/file\"\n\n# change path of the file, keeping the filename the same\nrepathed = my_file_path_pm.repath(\"/new/path\")\n# repathed contains \"/new/path/file.ext.ext2\"\n\n# append a suffix (following the BIDS standard, suffixes should always have _)\nsuffixed = my_file_path_pm.append_suffix(\"_newsuffix\")\n# suffixed contains \"/my/path/to/a/file_newsuffix.ext.ext2\"\n\n# replace last suffix\nreplaced = suffixed.replace_suffix(\"_newsuffix2\")\n# replaced contains \"/my/path/to/a/file_newsuffix2.ext.ext2\"\n\n# delete last suffix\ndeleted = replaced.delete_suffix()\n# deleted contains \"/my/path/to/a/file.ext.ext2\"\n\n# methods can be chained together\nchained = my_file_path_pm.repath(\"/new\").append_suffix(\"_test\").get_path_and_prefix()\n# chained contains \"/new/file_test\"\n\n# return as a string\nmy_file_path = my_file_path_pm.path\n# my_file_path contains \"/my/path/to/a/file.ext.ext2\"\n```\n",
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A python library for creating memoized data and code for neuroimaging pipelines",
    "version": "0.3.6",
    "split_keywords": [
        "neuroimaging",
        "pipeline",
        "memoization"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8630064a350690f5c490a5106ab704f4c380c4d6e0a25d818cfa431e5dc08eff",
                "md5": "f7a5115aa3f6e632a2b216bc4481de87",
                "sha256": "4228f3d321abd8b65a2456f1707a3d72fe5457e90581cf33128acc569f6c07a7"
            },
            "downloads": -1,
            "filename": "memori-0.3.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f7a5115aa3f6e632a2b216bc4481de87",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 29028,
            "upload_time": "2023-04-05T03:13:08",
            "upload_time_iso_8601": "2023-04-05T03:13:08.987848Z",
            "url": "https://files.pythonhosted.org/packages/86/30/064a350690f5c490a5106ab704f4c380c4d6e0a25d818cfa431e5dc08eff/memori-0.3.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6bcf251c4e213117e1d891742022af999045cafac0c63aa6347c809499c305c4",
                "md5": "e02f44874bf38d31f46db312f9265d48",
                "sha256": "34bb591354cc062120d2fd4bb8bd9e6d02031de04f3223a393aabc9ad791401a"
            },
            "downloads": -1,
            "filename": "memori-0.3.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e02f44874bf38d31f46db312f9265d48",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 33864,
            "upload_time": "2023-04-05T03:13:10",
            "upload_time_iso_8601": "2023-04-05T03:13:10.998580Z",
            "url": "https://files.pythonhosted.org/packages/6b/cf/251c4e213117e1d891742022af999045cafac0c63aa6347c809499c305c4/memori-0.3.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-05 03:13:10",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "memori"
}
        