# What is this?
I needed an efficient data logger for my machine learning experiments. Specifically, one that
- could log in a hierarchical way (not one big global logging variable)
- while still having a flat, table-like structure for queries/summaries
- without tons of duplicated data

This library would likely also work well with PySpark.
# What is a Use-case Example?
Let's say you're going to perform
- 3 experiments
- each experiment has 10 episodes
- each episode has 100,000 timesteps
- there is an `x` and a `y` value at each timestep
#### Example goal:
- We want to get the average `x` value across all timesteps in episode 2 (we don't care which experiment they're from)
Our timestep data could look like:
```python
record1 = { "x":1, "y":1 } # first timestep
record2 = { "x":2, "y":2 } # second timestep
record3 = { "x":3, "y":3 } # third timestep
```
#### Problem
Those records don't contain the experiment number or the episode number (and we need those for our goal)
#### Bad Solution
Duplicating the data would provide a flat structure, but (for 100,000 timesteps) that's a huge memory cost
```python
record1 = { "x":1, "y":1, "episode":1, "experiment": 1, } # first timestep
record2 = { "x":2, "y":2, "episode":1, "experiment": 1, } # second timestep
record3 = { "x":3, "y":3, "episode":1, "experiment": 1, } # third timestep
```
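To make the cost concrete, here is a rough, hedged illustration in plain Python (no library involved); the exact numbers vary by Python version and platform:
```python
import sys

# 100,000 flat records, each re-storing the same parent values and keys
flat = [{"x": i, "y": i, "episode": 1, "experiment": 1} for i in range(100_000)]

# getsizeof counts only the dict containers, not the values they hold,
# so this is a lower bound on the duplication overhead
print(sum(sys.getsizeof(each) for each in flat), "bytes of dict containers alone")
```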
#### Good-ish Solution
We could instead use references, which is more memory-efficient and also allows adding parent data after the fact (see the sketch after this block)
```python
# parent data
experiment_data = { "experiment": 1 }
episode_data = { "episode":1, "parent": experiment_data }
record1 = { "x":1, "y":1, "parent": episode_data } # first timestep
record2 = { "x":2, "y":2, "parent": episode_data } # second timestep
record3 = { "x":3, "y":3, "parent": episode_data } # third timestep
```
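For example, because every record shares the same parent dicts by reference, annotating a parent later is instantly visible from every child. A plain-Python sketch, with a hypothetical `git_commit` field for illustration:
```python
experiment_data = { "experiment": 1 }
episode_data = { "episode": 1, "parent": experiment_data }
record1 = { "x": 1, "y": 1, "parent": episode_data }

# add parent data after the records already exist
experiment_data["git_commit"] = "abc123"  # hypothetical field

# every child sees it through the shared reference
print(record1["parent"]["parent"]["git_commit"])  # prints: abc123
```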
We could reduce the cost of key duplication further by sharing keys across timesteps; each line below shows the same dict's state after another timestep is recorded
```python
# parent data
experiment_data = { "experiment": 1 }
episode_data = { "episode":1, "parent": experiment_data }
episode_keeper = {"parent": episode_data} # timestep 0
episode_keeper = { "x":[1], "y":[1], "parent": episode_data} # first timestep (keys added on-demand)
episode_keeper = { "x":[1,2], "y":[1,2], "parent": episode_data} # second timestep
episode_keeper = { "x":[1,2,3], "y":[1,2,3], "parent": episode_data} # third timestep
```
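A flat, query-ready row can still be recovered from this shared-key form by indexing the lists and walking the parent chain. A minimal plain-Python sketch, using the dicts from the block above:
```python
def flat_row(keeper, i):
    # this timestep's own values
    row = { key: values[i] for key, values in keeper.items() if key != "parent" }
    # walk up the parent chain, folding in the shared fields
    parent = keeper.get("parent")
    while parent is not None:
        row.update({ key: value for key, value in parent.items() if key != "parent" })
        parent = parent.get("parent")
    return row

print(flat_row(episode_keeper, 1))  # {'x': 2, 'y': 2, 'episode': 1, 'experiment': 1}
```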
#### How does Rigorous Recorder Fix This?
The "Good-ish Solution" above is still crude, this library cleans it up
1. The `Recorder` class in this library is the core/pure data structure
2. The `ExperimentCollection` class automates common boilerplate for saving (python pickle), catching errors, managing experiments, etc
```python
from rigorous_recorder import Recorder
recorder = Recorder()
# parent data
experiment_recorder = Recorder(experiment=1).set_parent(recorder)
episode_recorder = Recorder(episode=1).set_parent(experiment_recorder)
episode_recorder.push(x=1, y=1) # timestep 1
episode_recorder.push(x=2, y=2) # timestep 2
episode_recorder.push(x=3, y=3) # timestep 3
recorder.save_to("where/ever/you_want.pickle")
```
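Circling back to the example goal (the average `x` across all timesteps in episode 2): once every record carries its ancestors' keys, the query is a one-liner. Note that `recorder.records` below is a hypothetical accessor for illustration; the documented access in the larger example further down is `collection.records`:
```python
from statistics import mean

# `recorder.records` is hypothetical here; see `collection.records`
# in the ExperimentCollection example below for the documented access
episode2_x = [ each["x"] for each in recorder.records if each["episode"] == 2 ]
print(mean(episode2_x))
```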
# How do I use this?
`pip install rigorous-recorder`
Super simple usage:
```python
from rigorous_recorder import RecordKeeper
record_keeper = RecordKeeper().live_write_to("where/ever/you_want.yaml", as_yaml=True)
record_keeper.push(x=1, y=1)
```
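Reading the live-written file back is then ordinary YAML parsing. This sketch assumes the file parses to a list of record mappings (an assumption about the on-disk layout, not a documented guarantee) and uses PyYAML:
```python
import yaml  # pip install pyyaml

with open("where/ever/you_want.yaml") as f:
    records = yaml.safe_load(f)  # assumed: a list of dicts, one per push
print(records)
```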
Project/Experiment collection usage:
```python
from rigorous_recorder import RecordKeeper, ExperimentCollection
from statistics import mean as average
from random import random
collection = ExperimentCollection("data/my_study") # <- filepath
number_of_new_experiments = 1
for _ in range(number_of_new_experiments):
    # at the end (even when an error is thrown), all data is saved to disk automatically
    # experiment number increments based on the last saved-to-disk experiment number
    # running again (after error) won't double-increment the experiment number (same number until a non-error run is achieved)
    with collection.new_experiment() as experiment_recorder:
        # we can create a hierarchy like this:
        #
        #                    experiment_recorder
        #                   /                   \
        #        model1_recorder                model2_recorder
        #        /             \                /             \
        # m1_train_recorder m1_test_recorder m2_test_recorder m2_train_recorder
        #
        model1_recorder = RecordKeeper(model="model1").set_parent(experiment_recorder)
        model2_recorder = RecordKeeper(model="model2").set_parent(experiment_recorder)
        
        #
        # training
        #
        model1_train_recorder = RecordKeeper(training=True).set_parent(model1_recorder)
        model2_train_recorder = RecordKeeper(training=True).set_parent(model2_recorder)
        for each_index in range(100_000):
            # one approach
            model1_train_recorder.push(index=each_index, loss=random())
            
            # alternative approach (same outcome)
            model2_train_recorder.add(index=each_index)
            # - this pattern is handy when data is added in one method (like a loss function)
            #   while .commit() is called in a different method (like a weight update)
            model2_train_recorder.add({ "loss": random() })
            model2_train_recorder.commit()
        
        #
        # testing
        #
        model1_test_recorder = RecordKeeper(testing=True).set_parent(model1_recorder)
        model2_test_recorder = RecordKeeper(testing=True).set_parent(model2_recorder)
        for each_index in range(500):
            # one method
            model1_test_recorder.push(
                index=each_index,
                accuracy=random(),
            )
            
            # alternative way (same outcome)
            model2_test_recorder.add(index=each_index, accuracy=random())
            model2_test_recorder.commit()
#
#
# Analysis
#
#
all_records = collection.records
print("first record", all_records[0]) # behaves just like a regular dictionary
# slice across both models (first 500 training records from both models)
records_first_half_of_time = tuple(each for each in all_records if each["training"] and each["index"] < 500)
# average loss across both models
first_half_average_loss = average(tuple(each["loss"] for each in records_first_half_of_time))
# average only for model 1
model1_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model1"))
# average only for model 2
model2_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model2"))
```
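Since the records come out flat, they drop straight into DataFrame tooling for heavier analysis (PySpark, mentioned above, or pandas). A hedged pandas sketch; the `dict(each)` conversion assumes records expose their items the way dicts do (the example above only shows `["key"]` access):
```python
import pandas as pd

# assumption: each record converts cleanly to a plain dict; if not,
# build the dicts manually from the keys you pushed
df = pd.DataFrame([dict(each) for each in collection.records])

# same query as above, expressed as a table operation
train = df[(df["training"] == True) & (df["index"] < 500)]
print(train.groupby("model")["loss"].mean())
```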
# What are some other details?
The `ExperimentCollection` adds 6 keys as a parent to every record:
```
experiment_number      # int
error_number           # int, only incremented for back-to-back error runs
had_error              # boolean, for easy filtering
experiment_start_time  # the output of time.time() from python's time module
experiment_end_time    # the output of time.time() from python's time module
experiment_duration    # end time minus start time (for easy graphing/filtering)
```
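These keys make post-hoc filtering straightforward. For instance, keeping only the records from the most recent error-free experiment is plain filtering over `collection.records` as shown above:
```python
# drop records from runs that errored out
clean = [ each for each in collection.records if not each["had_error"] ]

# keep only the most recent experiment
latest = max(each["experiment_number"] for each in clean)
latest_records = [ each for each in clean if each["experiment_number"] == latest ]
```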