flexibuff

Name: flexibuff
Version: 0.2.2
Summary: Multi-Agent RL memory buffer which supports Numpy array and PyTorch Tensor formats
Author email: Timothy Flavin <timothy.c.flavin@gmail.com>
Homepage: https://github.com/Timothy-Flavin/flexibuff/
Requires Python: >=3.7
License: None
Upload time: 2025-01-22 02:34:58

Flexibuff is a minimalistic library designed to store replay
buffers and episode rollouts for multi-agent RL, but it works
just as well for single-agent RL.

## Motivation

Flexibuff came to be out of frustration with integrating many kinds of RL models on the
same environment for benchmarking purposes. Some models, like
deep Q-learning, only require `[state,state_,action,reward,terminated]`,
where transitions can be sampled in any order off-policy. Other
algorithms, like vanilla policy gradient, require memory rollouts
in chronological order to calculate the discounted rewards G.
Still other algorithms, such as QMIX and other CTDE methods,
require many agent buffers to be sampled synchronously, meaning
that the same timestep is needed for each agent to perform mixing.
More exotic still, some RL algorithms such as TAMER maintain a
second reward signal which comes from human preference. Some models
also have mixed action spaces or multiple outputs at the same
time, such as a search-and-rescue robot which must operate a radio
and navigate itself at the same time. Lastly, some policy gradient
algorithms require log probabilities to be stored where deep
Q-learning does not, and any of the algorithms above might use
memory weighting to bias transition sampling or for other effects.

Comparing these methods to each other and programming memory
buffers for each kind of agent takes a lot of time and code, and
it introduces yet another step in the process where errors can
creep in and cost precious debugging time. Flexibuff aims to fit
every one of these use cases at once, with optional storage
for human rewards, log probabilities, memory weights, and more.
Additionally, Flexibuff can sample either transitions or entire
chronologically ordered episodes synchronized across all agents, with
samples returned as either numpy arrays or torch tensors.

## Bare bones documentation (WIP)

Flexible Buffer supports both numpy and torch tensor output formats,
but all memories are held internally as numpy buffers, because
`torch.from_numpy()` shares the same memory in RAM either way.
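
For readers unfamiliar with that behavior, here is a minimal sketch (plain NumPy and
PyTorch, not the Flexibuff API) showing that `torch.from_numpy()` wraps the existing
array without copying:

```
import numpy as np
import torch

buffer = np.zeros((4, 3), dtype=np.float32)  # a tiny stand-in for an internal numpy buffer
as_tensor = torch.from_numpy(buffer)         # no copy; the tensor views the same memory

buffer[0, 0] = 1.0                           # write through the numpy side...
print(as_tensor[0, 0].item())                # ...and the tensor sees it: 1.0
```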

Flexible Buffer stores the memories of 'n_agents' agents
in their own separate memory blocks, where each agent has 'num_steps'
storage capacity. Setting 'n_agents' to 1 will remove a dimension
from the returned buffer results for single-agent tasks.

Flexibuff supports both continuous and discrete actions at the
same time, and it can sample episodes for use in recurrent training
or policy gradient methods using recorded discounted episodic rewards,
'G'. Flexibuff can also store action masks for environments with
illegal actions, and a second reward signal called 'global_auxiliary_reward' for
simultaneous human and MDP rewards for RLHF + RL.

For mixed discrete and continuous actions, actions will be saved and
returned in the following format:
```
    discrete_actions
        [   # Discrete action tensor
            [d0_s0,d1_s0,d2_s0,...,dN-1_s0],
            [d0_s1,d1_s1,d2_s1,...,dN-1_s1],
                        ...,
            [d0_sB,d1_sB,d2_sB,...,dN-1_sB],
        ],
    continuous_actions
        [   # Continuous Action Tensor
            [c0_s0,c1_s0,c2_s0,...,cM-1_s0],
            [c0_s1,c1_s1,c2_s1,...,cM-1_s1],
                        ...,
            [c0_sB,c1_sB,c2_sB,...,cM-1_sB]
        ],
```
where d0_s0 refers to discrete dimension 0 out of 'N' dimensions at
sample 0 out of 'B' batch timesteps. c2_s1 would refer to continuous
dimension 2 out of 'M' at sample timestep 1 out of 'B'.
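
As an illustration of these shapes only (plain NumPy, not the Flexibuff API), with B
timesteps, N discrete action outputs, and M continuous dimensions:

```
import numpy as np

B, N, M = 5, 2, 3  # batch timesteps, discrete action outputs, continuous dims

# discrete_actions[s, d] holds the d-th discrete choice at timestep s
discrete_actions = np.zeros((B, N), dtype=np.int64)
# continuous_actions[s, c] holds the c-th continuous value at timestep s
continuous_actions = np.zeros((B, M), dtype=np.float32)

d0_s0 = discrete_actions[0, 0]    # discrete dimension 0 at sample 0
c2_s1 = continuous_actions[1, 2]  # continuous dimension 2 at sample 1
print(discrete_actions.shape, continuous_actions.shape)  # (5, 2) (5, 3)
```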

Init variables (see the construction sketch after this list):

- `num_steps`: int. Number of timesteps per agent to be saved in the buffer.
- `obs_size`: int. Number of dimensions in the flattened 1-dimensional observation
  for a particular agent.
- `global_auxiliary_reward=False`: bool. Whether to record a second reward signal
  for human feedback.
- `action_mask`: [bool]. List of whether to mask each dimension of the discrete
  portion of the actions.
- `discrete_action_cardinalities`: [int]. List of integers denoting the number of
  discrete action choices for each discrete action output.
- `continuous_action_dimension`: int. Number of continuous action dimensions.
  (Note: suppose a network outputs a distribution for each continuous dimension,
  like [mean, std]; then `continuous_action_dimension` should be set to
  2*n_action_dimensions, because Flexibuff will save exactly as many numbers as
  specified here.)
- `path`: String. The path where Flexibuff will be saved if a path is not passed
  at save time; if no such path exists it will be made. Default is './default_dir/'.
- `name`: The name which will be appended onto the path to save these numpy arrays.
  Default is 'flexibuff_test'.
- `n_agents`: int. The number of agents to save buffers for.
- `state_size`: int. The number of dimensions of a global state for use in
  centralized training. None by default, assuming observations are local.
- `global_reward`: bool. Reward given to a group of agents.
- `global_auxiliary_reward`: bool. A second global reward, such as human feedback.
- `individual_reward`: bool. Reward given to an individual agent.
- `individual_auxiliary_reward`: bool. A second reward given to an individual agent,
  such as with human feedback.
- `log_prob_discrete`: bool. Whether to track log probabilities for the discrete
  action space.
- `log_prob_continuous`: int = 0. The dimension of probabilities to track for
  continuous action spaces. For instance, if there is one continuous action
  parameterized by a normal distribution with mean mu and std sigma, then
  `continuous_action_dimension` = 2, but `log_prob_continuous` would only be
  storing a single probability, so it would be 1.
- `memory_weights`: bool. Whether or not to store weights along with each timestep,
  for memory weighting or another purpose.
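
Below is a hedged construction sketch. The class name `FlexibleBuffer` and the import
path are assumptions (check the package source for the actual export); the keyword
names follow the parameter list above, but the real signature may differ.

```
# Hedged sketch: class name, import path, and exact signature are assumptions.
from flexibuff import FlexibleBuffer  # assumed export name

buffer = FlexibleBuffer(
    num_steps=10_000,                      # capacity per agent
    obs_size=12,                           # flattened observation size per agent
    n_agents=3,
    discrete_action_cardinalities=[4, 2],  # two discrete outputs with 4 and 2 choices
    continuous_action_dimension=2,         # e.g. [mean, std] of one continuous action
    state_size=24,                         # global state for centralized training
    global_reward=True,
    log_prob_discrete=True,
    memory_weights=False,
    path="./default_dir/",
    name="flexibuff_test",
)
```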

## Methods of Interest

- `save_transition`: Takes the actions / observations / states, etc. from a step and saves it.
- `sample_transitions`: Samples unordered transitions.
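
The method names above come from this README, but the argument names and return
handling in the following usage sketch are illustrative assumptions only:

```
import numpy as np

# Hedged usage sketch: keyword and argument names below are assumptions;
# only `save_transition` and `sample_transitions` are named in the docs.
obs = np.zeros((3, 12), dtype=np.float32)   # one flattened observation per agent
obs_next = np.ones((3, 12), dtype=np.float32)

buffer.save_transition(                     # `buffer` from the construction sketch above
    obs=obs,
    obs_=obs_next,
    discrete_actions=np.array([[1, 0], [3, 1], [2, 0]]),
    continuous_actions=np.random.randn(3, 2).astype(np.float32),
    global_rewards=1.0,
    terminated=False,
)

batch = buffer.sample_transitions(batch_size=128, as_torch=True)  # names assumed
```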
            
