Name | dryml JSON |
Version |
0.2.2
JSON |
| download |
home_page | https://github.com/ncsa/dryml |
Summary | Don't Repeat Yourself Machine Learning: A Machine Learning library to reduce code duplication, and increase portability and interoperability in the Machine Learning Space. |
upload_time | 2024-08-30 21:41:51 |
maintainer | None |
docs_url | None |
author | Matthew Krafczyk |
requires_python | >=3.7 |
license | Apache-2.0 |
keywords |
ml
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# DRYML
![Py 3.8 tests](https://github.com/ncsa/dryml/actions/workflows/test38.yaml/badge.svg)
![Py 3.9 tests](https://github.com/ncsa/dryml/actions/workflows/test39.yaml/badge.svg)
![Py 3.10 tests](https://github.com/ncsa/dryml/actions/workflows/test310.yaml/badge.svg)
![Py 3.11 tests](https://github.com/ncsa/dryml/actions/workflows/test311.yaml/badge.svg)
[![codecov](https://codecov.io/gh/ncsa/dryml/branch/main/graph/badge.svg?token=ELz0TSuOzo)](https://codecov.io/gh/ncsa/dryml)
Don't Repeat Yourself Machine Learning: A Machine Learning library to reduce code duplication, automate testing, perform hyper paramter searches, and ease model serialization.
DRYML aims to empower the Machine Learning practitioner to spend less time writing boilerplate code, and more time implementing new techniques. DRYML provides a model serialization framework along with serialization implementation for many common ML frameworks and model types, a framework for defining and training models on a specific problem, and a system to compare models from different ML frameworks on the same footing.
## DRYML Programming Philosophy
### Easy Object Serialization
All DRYML `Object`s can be uniquely identified, and serialized to disk. Once saved, you can load `Object`s directly from disk without having to first build a holding object as originally constructed. This allows for instance, Neural Net objects to be initialized with the correct parameters before loading the model weights without user intervention. Loading an `Object` consists of a single command `load_object`. Basic save/load logic is available for some major ML platforms, but the user is able and encouraged to implement new `Object`s with custom save/load methods and so is extendable to any ML system.
### Reuse Model Components
DRYML borrows from the Entity Component System programming pattern, and many Model types are created from components which are attached to the model, and can be reused. These can include training procedure, optimization algorithm, loss function, and the underlying NN model itself. This compartmentalization allows us to enable hyperparameter searches over nearly any parameter of your ML algorithm from the NN topology, from the optimizer learning rate, to the training procedure. It also allows the user to define a complex training procedure once, and then reuse it for multiple related problems in the future. DRYML also has defined training procedures for some common ML frameworks.
### Compare Models Between Frameworks
Many ML Problems can be tackled by different ML Frameworks. DRYML's API places all supported Frameworks on equal footing. All models output data in the form of DRYML Datasets. This means metrics on Datasets can be reused for models in different frameworks and models from different frameworks can be compared directly, allowing the ML practictioner to make decisions about which method is best for their problem. Models can also be easily chained together
### Allow Frameworks to work together
Modern ML frameworks such as tensorflow and pytorch are greedy about GPU use. DRYML implements a context system to enforce constraints on these frameworks when possible. The context system also provides a resource request API to allow the user to request the types of resources each framework is allowed to use. This allows elements from multiple frameworks to co-exist within the same data pipeline. For example, we can use a pytorch dataset and preprocessing with a tensorflow machine learning model.
## Bringing ML Frameworks together
The following ML Frameworks are currently supported, or planned to be supported
* Tensorflow (Initial support complete)
* Pytorch (Initial support complete)
* Sklearn (Initial support complete)
* XGBoost (Initial support complete)
* Statsmodels (support planned)
## DRYML Major Components
### DRYML `Object` and `ObjectDef`
The DRYML API provides the `Object` class which automatically implements all basic machinery for automatic object serialization. Any class you create which you want to serialize must inherit from the `Object` class. Here's a simple example:
```python
>>> from dryml import Object
>>> class Data(Object):
... def __init__(self, data):
... pass
...
>>> data_obj = Data([1, 2, 3, 4, 5])
>>> data_obj.dry_args
([1, 2, 3, 4, 5],)
>>> data_obj.dry_kwargs
{'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}
```
We can see that `Object` gives the new `Data` class some extra powers! It remembers the arguments used to create it, and it receives a unique identifier (if you don't specify it yourself!). `data_obj` can also be easily serialized to disk with the `save_self` member method or `save_object` global method. Let's see that here:
```python
>>> from dryml import save_object, load_object
>>> save_object(data_obj, 'test.dry')
True
>>> new_obj = load_object('test.dry')
>>> new_obj
{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}
```
Now, why not just use `pickle` or `dill`? There is one major issue with that. `pickle` and `dill` tries to save every python object contained within the object you're trying to save. This will fail if your model object contains data which isn't supported by these major serialization platforms! Tensorflow tensors for example aren't supported.
If we want to add the ability for an `Object` to store an internal state, we need to implement the `save_object_imp` and `load_object_imp` methods as well.
`data_obj` has another ability too. It has the method `definition` which builds an `ObjectDef` object matching the arguments `data_obj` was constructed with. Let's take a quick look at that.
```python
>>> obj_def = data_obj.definition()
>>> obj_def
{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}
>>> new_obj_2 = obj_def.build()
>>> new_obj_2
{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}
>>> new_obj_2.dry_args
([1, 2, 3, 4, 5],)
```
So we can see that `ObjectDef` is a factory object creating objects matching the arguments used to initially construct `data_obj`!.
We can create new `ObjectDef`s directly and use it to create new objects with different definitions! Let's look at that here:
```python
>>> from dryml import ObjectDef
>>> obj_def_2 = ObjectDef(Data, 3)
>>> obj_def_2
{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (3,), 'dry_kwargs': {}}
>>> test_obj_2 = obj_def_2.build()
>>> test_obj_2.dry_args
(3,)
```
This is great for creating copies of an object which contains internal data that is randomly initialized. We can for example, create many copies of the same neural network, train them, and see how well each network trains.
### DRYML `Repo` and `Selector`
A major issue working with many machine learning models is we want to try different things, which means different models and parameters. This can get unyieldy as the variety of models we're interested in trying gets larger. DRYML introduces the `Repo` and `Selector` types to help solve this issue. Any `Object` can be added to a `Repo`, and `Repo`s methods can be used to automate saving, loading, and application of a method across a collection of `Objects`. `Selector` is a type which can match properties of an `Object`. It's created similarly to `ObjectDef` and is a callable object. When passed an `ObjectDef` or `Object`, it indicates with a `bool` whether that object is compatible with the `Selector`. With `Selector`, `Repo` can return only those objects contained which match the `Selector`. Let's look at this now.
First, let's create a `Repo` and save a few objects.
```python
>>> repo = Repo(directory='./test', create=True)
>>> obj_def = ObjectDef(Data, 1)
>>> obj_def
{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (1,), 'dry_kwargs': {}}
>>> for i in range(5):
... obj = obj_def.build()
... repo.add_object(obj)
...
>>> obj_def_2 = ObjectDef(Data, 2)
>>> for i in range(5):
... obj = obj_def_2.build()
... repo.add_object(obj)
...
>>> len(repo)
10
```
Do now our repo has 10 objects, 5 of each type. Let's use a `Selector` to grab only those `Objects` with a 2.
```python
>>> repo.get(Selector(Data, obj_args=(2,)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'obj_args'
>>> repo.get(Selector(Data, args=(2,)))
[{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': '99d796f9-6bc6-4341-947e-94b1b89a9ff3'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'b9208924-8714-448a-b280-d63eefa758a7'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'deb0b98b-6ec7-4a98-a98f-81cd0c9b3f3f'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': '77561cf0-bdf5-4ae0-b7d8-9b9302544cd8'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'c256d4c2-45d0-495a-8b67-e29f4d5e824f'}}]
>>> len(repo.get(Selector(Data, args=(2,))))
5
```
And now we can work with the selected models!
### DRYML Dataset
The DRYML API provides the `Dataset` class which represents a machine learning dataset. It presents a number of useful methods for working with data, and also transformations to change datasets defined within major machine learning systems like `tensorflow` or `pytorch` into a more relevant framework or data type. We'll create a small Dataset here, and look at the `unbatch`, and `peek` methods.
```python
>>> import numpy as np
>>> num_examples = 32
>>> data_shape = (10, 10)
>>> data = np.random.random((num_examples,)+data_shape)
>>> data.shape
(32, 10, 10)
>>> from dryml.data import NumpyDataset
>>> data_ds = NumpyDataset(data)
>>> data_ds.peek().shape
(32, 10, 10)
>>> type(data_ds.peek())
<class 'numpy.ndarray'>
>>> type(data_ds.tf().unbatch().peek())
<class 'tensorflow.python.framework.ops.EagerTensor'>
>>> data_ds.tf().peek().shape
TensorShape([32, 10, 10])
>>> data_ds.tf().unbatch().peek().shape
TensorShape([10, 10])
>>> type(data_ds.torch().peek())
<class 'torch.Tensor'>
```
We can also see that `tf` turns the Dataset into a `TFDataset` which is backed by a `tf.data.Dataset`. Thus the elements retrievable become tensorflow `Tensor`s. Similarly, `torch` turns the `Dataset` into a `TorchDataset` which is backed by a `torch.utils.data.IterableDataset`.
### DRYML Context
The DRYML API provides a context system to manage the allocation of compute devices like GPUs. It also provides a decorator `dryml.context.compute_context` which can be applied to any method we want to launch in a separate process with a specific set of context requirements. This allows the user to prevent code which allocates memory on a device like a GPU from running unless you explicitly allow it. `Object` supports this as well with the `load_compute_imp` and `save_compute_imp` methods which manage an `Object`'s transition into and out of 'compute mode' in which an `Object` may allocate memory on a device. This is especially useful for situations when we may want to compare results from a `tensorflow` based model and a `pytorch` based model. We can wrap a method with `compute_context` and then request a context with requirements `{'tf': {}}` for the `tensorflow` model, and requirements `{'torch': {}}` for the `pytorch` model.
The user can specify the amount or even specific resources with a context requirement for example:
```python
{
'tf': {'gpu/1': 0.5},
'torch': {'gpu/1': 0.5},
}
```
Let's look at a simple example. We'll first define a method which loads a tensorflow dataset, but first checks whether the context can support it. It then returns the first element of that dataset. Then we'll decorate that method so we can execute it in a separate process with certain contexts. We'll then try a couple contexts and see what happens!
```python
>>> def test_method():
... import dryml
... import tensorflow_datasets as tfds
... from dryml.data.tf import TFDataset
... dryml.context.context_check({'tf': {}})
... (test_ds,) = tfds.load(
... 'mnist',
... split=['test'],
... shuffle_files=True,
... as_supervised=True)
... test_ds = TFDataset(
... test_ds,
... supervised=True)
... return test_ds.unbatch().numpy().peek()
...
>>> test_method = dryml.context.compute_context()(test_method)
>>> test_method()
Exception encountered in context thread! pid: 724396
Traceback (most recent call last):
File "/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py", line 34, in run
super().run()
File "/data0/matthew/Software/NCSA/DRYML/venv_dryml_dev/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py", line 186, in __call__
self.final_call(f, ctx_send_q, ctx_ret_q, *args, **kwargs)
File "/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py", line 129, in final_call
res = f(*args, **kwargs)
File "<stdin>", line 5, in test_method
File "/data0/matthew/Software/NCSA/DRYML/src/dryml/context/context_tracker.py", line 420, in context_check
raise ContextIncompatibilityError(
dryml.context.context_tracker.ContextIncompatibilityError: Context doesn't satisfy requirements {'tf': {}}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py", line 370, in wrapped_func
raise e
dryml.context.context_tracker.ContextIncompatibilityError: Context doesn't satisfy requirements {'tf': {}}
>>> x, y = test_method(call_context_reqs={'tf': {}})
2022-10-06 15:27:33.318424: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
>>> x.shape
(28, 28, 1)
>>> y.shape
()
```
Raw data
{
"_id": null,
"home_page": "https://github.com/ncsa/dryml",
"name": "dryml",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "ML",
"author": "Matthew Krafczyk",
"author_email": "mkrafcz2@illinois.edu",
"download_url": "https://files.pythonhosted.org/packages/2f/2d/3980e0aabcfb53c56f5213c8a1dbc3dec9ce54bfeff8a9c0801f2cfe78a3/dryml-0.2.2.tar.gz",
"platform": null,
"description": "# DRYML\n\n![Py 3.8 tests](https://github.com/ncsa/dryml/actions/workflows/test38.yaml/badge.svg)\n![Py 3.9 tests](https://github.com/ncsa/dryml/actions/workflows/test39.yaml/badge.svg)\n![Py 3.10 tests](https://github.com/ncsa/dryml/actions/workflows/test310.yaml/badge.svg)\n![Py 3.11 tests](https://github.com/ncsa/dryml/actions/workflows/test311.yaml/badge.svg)\n[![codecov](https://codecov.io/gh/ncsa/dryml/branch/main/graph/badge.svg?token=ELz0TSuOzo)](https://codecov.io/gh/ncsa/dryml)\n\nDon't Repeat Yourself Machine Learning: A Machine Learning library to reduce code duplication, automate testing, perform hyper paramter searches, and ease model serialization.\n\nDRYML aims to empower the Machine Learning practitioner to spend less time writing boilerplate code, and more time implementing new techniques. DRYML provides a model serialization framework along with serialization implementation for many common ML frameworks and model types, a framework for defining and training models on a specific problem, and a system to compare models from different ML frameworks on the same footing.\n\n## DRYML Programming Philosophy\n\n### Easy Object Serialization\n\nAll DRYML `Object`s can be uniquely identified, and serialized to disk. Once saved, you can load `Object`s directly from disk without having to first build a holding object as originally constructed. This allows for instance, Neural Net objects to be initialized with the correct parameters before loading the model weights without user intervention. Loading an `Object` consists of a single command `load_object`. Basic save/load logic is available for some major ML platforms, but the user is able and encouraged to implement new `Object`s with custom save/load methods and so is extendable to any ML system.\n\n### Reuse Model Components\n\nDRYML borrows from the Entity Component System programming pattern, and many Model types are created from components which are attached to the model, and can be reused. These can include training procedure, optimization algorithm, loss function, and the underlying NN model itself. This compartmentalization allows us to enable hyperparameter searches over nearly any parameter of your ML algorithm from the NN topology, from the optimizer learning rate, to the training procedure. It also allows the user to define a complex training procedure once, and then reuse it for multiple related problems in the future. DRYML also has defined training procedures for some common ML frameworks.\n\n### Compare Models Between Frameworks\n\nMany ML Problems can be tackled by different ML Frameworks. DRYML's API places all supported Frameworks on equal footing. All models output data in the form of DRYML Datasets. This means metrics on Datasets can be reused for models in different frameworks and models from different frameworks can be compared directly, allowing the ML practictioner to make decisions about which method is best for their problem. Models can also be easily chained together\n\n### Allow Frameworks to work together\n\nModern ML frameworks such as tensorflow and pytorch are greedy about GPU use. DRYML implements a context system to enforce constraints on these frameworks when possible. The context system also provides a resource request API to allow the user to request the types of resources each framework is allowed to use. This allows elements from multiple frameworks to co-exist within the same data pipeline. For example, we can use a pytorch dataset and preprocessing with a tensorflow machine learning model.\n\n## Bringing ML Frameworks together\n\nThe following ML Frameworks are currently supported, or planned to be supported\n\n* Tensorflow (Initial support complete)\n* Pytorch (Initial support complete)\n* Sklearn (Initial support complete)\n* XGBoost (Initial support complete)\n* Statsmodels (support planned)\n\n## DRYML Major Components\n\n### DRYML `Object` and `ObjectDef`\n\nThe DRYML API provides the `Object` class which automatically implements all basic machinery for automatic object serialization. Any class you create which you want to serialize must inherit from the `Object` class. Here's a simple example:\n\n```python\n>>> from dryml import Object\n>>> class Data(Object):\n... def __init__(self, data):\n... pass\n... \n>>> data_obj = Data([1, 2, 3, 4, 5])\n>>> data_obj.dry_args\n([1, 2, 3, 4, 5],)\n>>> data_obj.dry_kwargs\n{'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}\n```\n\nWe can see that `Object` gives the new `Data` class some extra powers! It remembers the arguments used to create it, and it receives a unique identifier (if you don't specify it yourself!). `data_obj` can also be easily serialized to disk with the `save_self` member method or `save_object` global method. Let's see that here:\n\n```python\n>>> from dryml import save_object, load_object\n>>> save_object(data_obj, 'test.dry')\nTrue\n>>> new_obj = load_object('test.dry')\n>>> new_obj\n{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}\n```\n\nNow, why not just use `pickle` or `dill`? There is one major issue with that. `pickle` and `dill` tries to save every python object contained within the object you're trying to save. This will fail if your model object contains data which isn't supported by these major serialization platforms! Tensorflow tensors for example aren't supported.\n\nIf we want to add the ability for an `Object` to store an internal state, we need to implement the `save_object_imp` and `load_object_imp` methods as well.\n\n`data_obj` has another ability too. It has the method `definition` which builds an `ObjectDef` object matching the arguments `data_obj` was constructed with. Let's take a quick look at that.\n\n```python\n>>> obj_def = data_obj.definition()\n>>> obj_def\n{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}\n>>> new_obj_2 = obj_def.build()\n>>> new_obj_2\n{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': 'e18d670c-b3b8-41b3-a941-3c2f7bf0b11e'}}\n>>> new_obj_2.dry_args\n([1, 2, 3, 4, 5],)\n```\n\nSo we can see that `ObjectDef` is a factory object creating objects matching the arguments used to initially construct `data_obj`!. \n\nWe can create new `ObjectDef`s directly and use it to create new objects with different definitions! Let's look at that here:\n\n```python\n>>> from dryml import ObjectDef\n>>> obj_def_2 = ObjectDef(Data, 3)\n>>> obj_def_2\n{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (3,), 'dry_kwargs': {}}\n>>> test_obj_2 = obj_def_2.build()\n>>> test_obj_2.dry_args\n(3,)\n```\n\nThis is great for creating copies of an object which contains internal data that is randomly initialized. We can for example, create many copies of the same neural network, train them, and see how well each network trains.\n\n\n### DRYML `Repo` and `Selector`\n\nA major issue working with many machine learning models is we want to try different things, which means different models and parameters. This can get unyieldy as the variety of models we're interested in trying gets larger. DRYML introduces the `Repo` and `Selector` types to help solve this issue. Any `Object` can be added to a `Repo`, and `Repo`s methods can be used to automate saving, loading, and application of a method across a collection of `Objects`. `Selector` is a type which can match properties of an `Object`. It's created similarly to `ObjectDef` and is a callable object. When passed an `ObjectDef` or `Object`, it indicates with a `bool` whether that object is compatible with the `Selector`. With `Selector`, `Repo` can return only those objects contained which match the `Selector`. Let's look at this now.\n\nFirst, let's create a `Repo` and save a few objects.\n\n```python\n>>> repo = Repo(directory='./test', create=True)\n>>> obj_def = ObjectDef(Data, 1)\n>>> obj_def\n{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (1,), 'dry_kwargs': {}}\n>>> for i in range(5):\n... obj = obj_def.build()\n... repo.add_object(obj)\n... \n>>> obj_def_2 = ObjectDef(Data, 2)\n>>> for i in range(5):\n... obj = obj_def_2.build()\n... repo.add_object(obj)\n... \n>>> len(repo)\n10\n```\n\nDo now our repo has 10 objects, 5 of each type. Let's use a `Selector` to grab only those `Objects` with a 2.\n\n```python\n>>> repo.get(Selector(Data, obj_args=(2,)))\nTraceback (most recent call last):\n File \"<stdin>\", line 1, in <module>\nTypeError: __init__() got an unexpected keyword argument 'obj_args'\n>>> repo.get(Selector(Data, args=(2,)))\n[{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': '99d796f9-6bc6-4341-947e-94b1b89a9ff3'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'b9208924-8714-448a-b280-d63eefa758a7'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'deb0b98b-6ec7-4a98-a98f-81cd0c9b3f3f'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': '77561cf0-bdf5-4ae0-b7d8-9b9302544cd8'}}, {'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': (2,), 'dry_kwargs': {'dry_id': 'c256d4c2-45d0-495a-8b67-e29f4d5e824f'}}]\n>>> len(repo.get(Selector(Data, args=(2,))))\n5\n```\n\nAnd now we can work with the selected models!\n\n### DRYML Dataset\n\nThe DRYML API provides the `Dataset` class which represents a machine learning dataset. It presents a number of useful methods for working with data, and also transformations to change datasets defined within major machine learning systems like `tensorflow` or `pytorch` into a more relevant framework or data type. We'll create a small Dataset here, and look at the `unbatch`, and `peek` methods.\n\n```python\n>>> import numpy as np\n>>> num_examples = 32\n>>> data_shape = (10, 10)\n>>> data = np.random.random((num_examples,)+data_shape)\n>>> data.shape\n(32, 10, 10)\n>>> from dryml.data import NumpyDataset\n>>> data_ds = NumpyDataset(data)\n>>> data_ds.peek().shape\n(32, 10, 10)\n>>> type(data_ds.peek())\n<class 'numpy.ndarray'>\n>>> type(data_ds.tf().unbatch().peek())\n<class 'tensorflow.python.framework.ops.EagerTensor'>\n>>> data_ds.tf().peek().shape\nTensorShape([32, 10, 10])\n>>> data_ds.tf().unbatch().peek().shape\nTensorShape([10, 10])\n>>> type(data_ds.torch().peek())\n<class 'torch.Tensor'>\n```\n\nWe can also see that `tf` turns the Dataset into a `TFDataset` which is backed by a `tf.data.Dataset`. Thus the elements retrievable become tensorflow `Tensor`s. Similarly, `torch` turns the `Dataset` into a `TorchDataset` which is backed by a `torch.utils.data.IterableDataset`.\n\n### DRYML Context\n\nThe DRYML API provides a context system to manage the allocation of compute devices like GPUs. It also provides a decorator `dryml.context.compute_context` which can be applied to any method we want to launch in a separate process with a specific set of context requirements. This allows the user to prevent code which allocates memory on a device like a GPU from running unless you explicitly allow it. `Object` supports this as well with the `load_compute_imp` and `save_compute_imp` methods which manage an `Object`'s transition into and out of 'compute mode' in which an `Object` may allocate memory on a device. This is especially useful for situations when we may want to compare results from a `tensorflow` based model and a `pytorch` based model. We can wrap a method with `compute_context` and then request a context with requirements `{'tf': {}}` for the `tensorflow` model, and requirements `{'torch': {}}` for the `pytorch` model.\n\nThe user can specify the amount or even specific resources with a context requirement for example:\n\n```python\n{\n 'tf': {'gpu/1': 0.5},\n 'torch': {'gpu/1': 0.5},\n}\n```\n\nLet's look at a simple example. We'll first define a method which loads a tensorflow dataset, but first checks whether the context can support it. It then returns the first element of that dataset. Then we'll decorate that method so we can execute it in a separate process with certain contexts. We'll then try a couple contexts and see what happens!\n\n```python\n>>> def test_method():\n... import dryml\n... import tensorflow_datasets as tfds\n... from dryml.data.tf import TFDataset\n... dryml.context.context_check({'tf': {}})\n... (test_ds,) = tfds.load(\n... 'mnist',\n... split=['test'],\n... shuffle_files=True,\n... as_supervised=True)\n... test_ds = TFDataset(\n... test_ds,\n... supervised=True)\n... return test_ds.unbatch().numpy().peek()\n...\n>>> test_method = dryml.context.compute_context()(test_method)\n>>> test_method()\nException encountered in context thread! pid: 724396\nTraceback (most recent call last):\n File \"/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py\", line 34, in run\n super().run()\n File \"/data0/matthew/Software/NCSA/DRYML/venv_dryml_dev/lib/python3.8/multiprocessing/process.py\", line 108, in run\n self._target(*self._args, **self._kwargs)\n File \"/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py\", line 186, in __call__\n self.final_call(f, ctx_send_q, ctx_ret_q, *args, **kwargs)\n File \"/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py\", line 129, in final_call\n res = f(*args, **kwargs)\n File \"<stdin>\", line 5, in test_method\n File \"/data0/matthew/Software/NCSA/DRYML/src/dryml/context/context_tracker.py\", line 420, in context_check\n raise ContextIncompatibilityError(\ndryml.context.context_tracker.ContextIncompatibilityError: Context doesn't satisfy requirements {'tf': {}}\n\nTraceback (most recent call last):\n File \"<stdin>\", line 1, in <module>\n File \"/data0/matthew/Software/NCSA/DRYML/src/dryml/context/process.py\", line 370, in wrapped_func\n raise e\ndryml.context.context_tracker.ContextIncompatibilityError: Context doesn't satisfy requirements {'tf': {}}\n>>> x, y = test_method(call_context_reqs={'tf': {}})\n2022-10-06 15:27:33.318424: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.\n>>> x.shape\n(28, 28, 1)\n>>> y.shape\n()\n```\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Don't Repeat Yourself Machine Learning: A Machine Learning library to reduce code duplication, and increase portability and interoperability in the Machine Learning Space.",
"version": "0.2.2",
"project_urls": {
"Bug Tracker": "https://github.com/ncsa/dryml/issues",
"Homepage": "https://github.com/ncsa/dryml"
},
"split_keywords": [
"ml"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "11b3e5f662da5732877ab0d990501c28f4646a89b3a12416f4bf726f445f1d33",
"md5": "0b8fb98c3f0daa54f5021c7d320f21f4",
"sha256": "f431b4092c4e20c2f9bc7eeff50738a5e3927e250f1a22e553f72fd8ff8f1e6a"
},
"downloads": -1,
"filename": "dryml-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0b8fb98c3f0daa54f5021c7d320f21f4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 88749,
"upload_time": "2024-08-30T21:41:49",
"upload_time_iso_8601": "2024-08-30T21:41:49.620081Z",
"url": "https://files.pythonhosted.org/packages/11/b3/e5f662da5732877ab0d990501c28f4646a89b3a12416f4bf726f445f1d33/dryml-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2f2d3980e0aabcfb53c56f5213c8a1dbc3dec9ce54bfeff8a9c0801f2cfe78a3",
"md5": "ceeaf5b4b2e9dc5be620217dbcba6306",
"sha256": "af7df7e06d96f449b6f3ff83a23797eb8da7eb2eaf7ca1d8ad0f5bd8dedd2e41"
},
"downloads": -1,
"filename": "dryml-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "ceeaf5b4b2e9dc5be620217dbcba6306",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 94847,
"upload_time": "2024-08-30T21:41:51",
"upload_time_iso_8601": "2024-08-30T21:41:51.229352Z",
"url": "https://files.pythonhosted.org/packages/2f/2d/3980e0aabcfb53c56f5213c8a1dbc3dec9ce54bfeff8a9c0801f2cfe78a3/dryml-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-30 21:41:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ncsa",
"github_project": "dryml",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"test_requirements": [
{
"name": "flake8",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "pytest-cov",
"specs": []
},
{
"name": "tensorflow",
"specs": []
},
{
"name": "ray",
"specs": []
},
{
"name": "scikit-learn",
"specs": []
},
{
"name": "xgboost",
"specs": []
},
{
"name": "torch",
"specs": [
[
"<=",
"2.3"
]
]
},
{
"name": "GPUtil",
"specs": []
}
],
"lcname": "dryml"
}