# tarn

- Name: tarn
- Version: 0.14.0
- Summary: A generic framework for key-value storage
- Home page: https://github.com/neuro-ml/tarn
- Author: Max
- Requires Python: >=3.7
- Keywords: storage, cache, invalidation
- Uploaded: 2024-02-21 12:17:55

[![codecov](https://codecov.io/gh/neuro-ml/tarn/branch/master/graph/badge.svg)](https://codecov.io/gh/neuro-ml/tarn)
[![pypi](https://img.shields.io/pypi/v/tarn?logo=pypi&label=PyPi)](https://pypi.org/project/tarn/)
![License](https://img.shields.io/github/license/neuro-ml/tarn)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/tarn)](https://pypi.org/project/tarn/)

A generic framework for key-value storage

# Install

```shell
pip install tarn
```

# Recipes

## A simple datalake

Let's start small and create a simple disk-based datalake. It will store various files, and the keys will be their
[sha256](https://en.wikipedia.org/wiki/SHA-2) digests:

```python
from tarn import HashKeyStorage

storage = HashKeyStorage('/path/to/some/folder')
# here `key` is the sha256 digest
key = storage.write('/path/to/some/file.png')
# now we can use the key to read the file at a later time
with storage.read(key) as value:
    # this will output something like Path('/path/to/some/folder/a0/ff9ae8987..')
    print(value.resolve())

# you can also store values directly from memory
# - either byte strings
key = storage.write(b'my-bytes')
# - or file-like objects
# in this example we stream data from a URL directly to the datalake
import requests

# stream=True keeps the response body unread, so `.raw` can be
# consumed as a file-like object by the storage
key = storage.write(requests.get('https://example.com', stream=True).raw)
```
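
Since the keys are content hashes, writing the same bytes twice should simply return the same key. Here is a minimal sketch of that deduplication property (the folder path is a placeholder):

```python
from tarn import HashKeyStorage

storage = HashKeyStorage('/path/to/some/folder')

# identical content hashes to the same sha256 digest,
# so writing it twice should yield the same key
key_a = storage.write(b'same payload')
key_b = storage.write(b'same payload')
assert key_a == key_b
```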

## Smart cache to disk

A really cool feature of `tarn` is [memoization](https://en.wikipedia.org/wiki/Memoization) with automatic invalidation:

```python
from tarn import smart_cache


@smart_cache('/path/to/storage')
def my_expensive_function(x):
    y = x ** 2
    return my_other_function(x, y)


def my_other_function(x, y):
    ...
    z = x * y
    return x + y + z
```

Now the calls to `my_expensive_function` will be automatically cached to disk.
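
Continuing the example above, a hedged usage sketch: the first call computes the result and persists it under `/path/to/storage`, while a repeated call with the same argument is expected to be served from the on-disk cache:

```python
# first call: computed and persisted under /path/to/storage
first = my_expensive_function(4)

# second call with the same argument: expected to be read back
# from the cache instead of being recomputed
second = my_expensive_function(4)
assert first == second
```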

But that's not all! Let's assume that `my_expensive_function` and `my_other_function` are prone to frequent change,
and we would like to invalidate the cache when they do. Just annotate these functions with a decorator:

```python
from tarn import smart_cache, mark_unstable


@smart_cache('/path/to/storage')
@mark_unstable
def my_expensive_function(x):
    ...


@mark_unstable
def my_other_function(x, y):
    ...
```

Now any change to these functions will cause the cache to invalidate itself!

## Other storage locations

We support multiple storage locations out of the box.

Didn't find the location you were looking for? Create an [issue](https://github.com/neuro-ml/tarn/issues).

### S3

```python
from tarn import HashKeyStorage, S3

storage = HashKeyStorage(S3('my-storage-url', 'my-bucket'))
```

### Redis

If your files are small and you want fast in-memory storage, [Redis](https://redis.io/) is a great option:

```python
from tarn import HashKeyStorage, RedisLocation

storage = HashKeyStorage(RedisLocation('localhost'))
```

### SFTP

```python
from tarn import HashKeyStorage, SFTP

storage = HashKeyStorage(SFTP('myserver', '/path/to/root/folder'))
```

### SCP

```python
from tarn import HashKeyStorage, SCP

storage = HashKeyStorage(SCP('myserver', '/path/to/root/folder'))
```

### Nginx

Nginx has an [autoindex](https://nginx.org/en/docs/http/ngx_http_autoindex_module.html#autoindex_format) option that
allows it to serve files and list directory contents. This is useful when you want to access files over HTTP/HTTPS:

```python
from tarn import HashKeyStorage, Nginx

storage = HashKeyStorage(Nginx('https://example.com/storage'))
```

## Advanced

Here we'll show more specific (but useful!) use cases.

### Fanout

You might have several HDDs and want to spread your datalake across them without creating a RAID array:

```python
from tarn import HashKeyStorage, Fanout

storage = HashKeyStorage(Fanout(
    '/mount/hdd1/lake',
    '/mount/hdd2/lake',
))
```

Now both disks are used, and we'll start writing to `/mount/hdd2/lake` after `/mount/hdd1/lake` becomes full.
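
As a hedged sketch, the fanout stays transparent to callers: you keep writing to and reading from the single combined storage, and which disk actually holds a given file is handled for you:

```python
from tarn import HashKeyStorage, Fanout

storage = HashKeyStorage(Fanout(
    '/mount/hdd1/lake',
    '/mount/hdd2/lake',
))

key = storage.write(b'some payload')
# the read goes through the same object; the disk that actually holds
# the file is an implementation detail of the fanout
with storage.read(key) as path:
    print(path.resolve())
```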

You can even use other types of locations:

```python
from tarn import HashKeyStorage, Fanout, S3

storage = HashKeyStorage(Fanout(S3('server1', 'bucket1'), S3('server2', 'bucket2')))
```

Or mix and match them as you please:

```python
from tarn import HashKeyStorage, Fanout, S3

# write to S3, then start writing to the HDD after S3 becomes full
storage = HashKeyStorage(Fanout(S3('server2', 'bucket2'), '/mount/hdd1/lake'))
```

### Lazy migration

Let's say you want to seamlessly replicate an old storage to a new location, copying only the files that are actually needed:

```python
from tarn import HashKeyStorage, Levels

storage = HashKeyStorage(Levels(
    '/mount/new-hdd/lake',
    '/mount/old-hdd/lake',
))
```

This will create something like a [cache hierarchy](https://en.wikipedia.org/wiki/Cache_hierarchy) with copy-on-read
behaviour. Each time we read a key, if we don't find it in `/mount/new-hdd/lake`, we read it from `/mount/old-hdd/lake`
and save a copy to `/mount/new-hdd/lake`.
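
A hedged sketch of the effect, assuming the old lake already contains some data and the new one starts out empty:

```python
from tarn import HashKeyStorage, Levels

# populate the old lake directly, so the key initially exists only there
old = HashKeyStorage('/mount/old-hdd/lake')
key = old.write(b'legacy payload')

storage = HashKeyStorage(Levels(
    '/mount/new-hdd/lake',
    '/mount/old-hdd/lake',
))

# first read through the levels: the key is missing from the new lake,
# so it is fetched from the old one and a copy should be left
# in /mount/new-hdd/lake for subsequent reads
with storage.read(key) as path:
    print(path.resolve())
```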

### Cache levels

The same [cache hierarchy](https://en.wikipedia.org/wiki/Cache_hierarchy) logic can be used if you have a combination of
HDDs and SSDs, which will seriously speed up reads:

```python
from tarn import HashKeyStorage, Levels, Level

storage = HashKeyStorage(Levels(
    Level('/mount/fast-ssd/lake', write=False),
    Level('/mount/slow-hdd/lake', write=False),
    '/mount/slower-nfs/lake',
))
```

The setup above is similar to the one we use in our lab:

- we have a slow but _huge_ NFS-mounted storage
- a faster but smaller HDD
- and a super fast but even smaller SSD

Now, we only write to the NFS storage, but the data gets lazily replicated to the local HDD and SSD to speed up the
reads.

### Caching small files to Redis

We can take this approach even further and use ultra-fast in-memory storage, such as Redis:

```python
from tarn import HashKeyStorage, Levels, Small, RedisLocation

storage = HashKeyStorage(Levels(
    # max file size = 100KiB
    Small(RedisLocation('my-host'), max_size=100 * 1024),
    '/mount/hdd/lake',
))
```

Here we use `Small`, a wrapper that only allows small files (at most 100 KiB in this case) to be written to it.
In our experiments we observed a 10x speedup when reading small files.
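
A hedged sketch of how the size limit plays out, assuming values above `max_size` simply skip the Redis level and land on the next location:

```python
from tarn import HashKeyStorage, Levels, Small, RedisLocation

storage = HashKeyStorage(Levels(
    Small(RedisLocation('my-host'), max_size=100 * 1024),
    '/mount/hdd/lake',
))

# well under 100 KiB: small enough to be accepted by the Redis level
small_key = storage.write(b'tiny payload')

# 1 MiB: exceeds max_size, so it is expected to bypass Redis
# and be stored in /mount/hdd/lake instead
big_key = storage.write(b'\x00' * (1024 * 1024))
```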

## Composability

Because all the locations implement the same interface, you can start creating more complex storage logic specifically
tailored to your needs. You can make setups as crazy as you want!

```python
from tarn import HashKeyStorage, Levels, Fanout, RedisLocation, Small, S3, SFTP

storage = HashKeyStorage(Levels(
    Small(RedisLocation('my-host'), max_size=10 * 1024 ** 2),
    '/mount/fast-ssd/lake',

    Fanout(
        '/mount/hdd1/lake',
        '/mount/hdd2/lake',
        '/mount/hdd3/lake',

        # nested locations are not a problem!
        Levels(
            # apparently we want mirrored locations here
            '/mount/hdd3/lake',
            '/mount/old-hdd/lake',
        ),
    ),

    '/mount/slower-nfs/lake',

    S3('my-s3-host', 'my-bucket'),

    # pull missing files over sftp when needed
    SFTP('remote-host', '/path/to/remote/folder'),
))
```

# Acknowledgements

Some parts of our cache invalidation machinery were heavily inspired by
the [cloudpickle](https://github.com/cloudpipe/cloudpickle) project.



            
