# pamai

- **Name**: pamai
- **Version**: 1.0.5
- **Home page**: https://github.com/ArthurZucker/PAMAI
- **Download**: https://github.com/ArthurZucker/PAMAI/archive/refs/tags/v1.0.5.tar.gz
- **Summary**: Package for PAMAI written by Arthur Zucker and Chris Rauch.
- **Upload time**: 2023-05-09 08:27:37
- **Author**: Chris Rauch
- **Requires Python**: ~=3.8
- **License**: MIT
- **Keywords**: pamai, ai, audio, denet, arthur zucker
- **Requirements**: none recorded
[![wakatime](https://wakatime.com/badge/user/57d887d6-525a-4214-a78c-21863f2f88f7/project/93d14295-7eb1-438b-b391-744be6d71661.svg)](https://wakatime.com/badge/user/57d887d6-525a-4214-a78c-21863f2f88f7/project/93d14295-7eb1-438b-b391-744be6d71661)
# DENet training code 

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## Author 
- [Arthur Zucker](https://github.com/ArthurZucker)
- [Chris Rauch](https://github.com/chrisrauch193)

## Installation 
Clone this repository

## Model 
The model is available in the original GitHub repository: git@github.com:MiviaLab/DENet.git

## Information 
The original DENet code is written in TensorFlow, but I need to use both YOLOR and DENet, and since YOLOR is in PyTorch, I have to convert the DENet code to PyTorch. That is also a good exercise, so I will do it. 

Moreover, I am using my template implementation, which makes it easy to switch between models and data types; integrating DENet into it also requires a bit of work. 
## TODO 

- [x] Create dataset
- [x] Add `get_model`, `get_loss`, `get_optimizer` helpers that build objects from their names (see the sketch after this list)
- [x] Create dataloaders
- [x] Convert both [__init__.py](./__init__.py) and [layers.py](./layers.py) to PyTorch (since they are currently in TensorFlow)
- [x] Create model 
- [x] Create training agent
- [x] Add utils and choose metric 
- [x] Sample training on subset 
- [ ] Clean hparams
- [ ] Use the profiler to check training time and data leaks
- [ ] Use time distributed for `visualize` function (process full audio)
- [ ] **Use fp16 and mixed precision with apex!**
- [ ] Use https://tut-arg.github.io/sed_eval/sound_event.html for validation metrics
- [ ] Create Docker 
- [ ] Details on the readme 
- [ ] Add to my website 
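
For reference, here is a minimal sketch of what the name-based `get_model` / `get_loss` / `get_optimizer` helpers could look like. The registry contents (the `linear_baseline` model in particular) are placeholders for illustration, not the actual PAMAI code:

```python
import torch
import torch.nn as nn

# Placeholder registries: the real project would register its DENet/SincNet classes here.
MODELS = {"linear_baseline": lambda: nn.Linear(16, 2)}
LOSSES = {"cross_entropy": nn.CrossEntropyLoss, "bce": nn.BCEWithLogitsLoss}
OPTIMIZERS = {"adam": torch.optim.Adam, "sgd": torch.optim.SGD}


def get_model(name: str) -> nn.Module:
    """Build a model from its name."""
    return MODELS[name]()


def get_loss(name: str) -> nn.Module:
    """Build a loss module from its name."""
    return LOSSES[name]()


def get_optimizer(name: str, model: nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    """Build an optimizer from its name, bound to the model's parameters."""
    return OPTIMIZERS[name](model.parameters(), lr=lr)


model = get_model("linear_baseline")
criterion = get_loss("cross_entropy")
optimizer = get_optimizer("adam", model, lr=1e-4)
```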

## Issue : 
The output of SincNet and DENet is a label for each 50 ms window, so the labels I use should be in the same format. This means that when I select an audio file, I randomly pick a part of it and then extract the label for that sample. 
Available labels per sample: {500->550}, {...}, {...}, i.e. a list of sample ranges where we know that a flying fox called! 

DENet's input : 1 sequence of 10 frames of 50 ms audio (thus 500 ms sequences), with per-frame labels. 
SincNet's input : 1 chunk of 200 ms, 10 ms overlap, batch size 128. The labels are a single label per wav file, which is a bit strange. 
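
To make the DENet framing concrete, here is a minimal sketch of slicing a waveform into 10 consecutive 50 ms frames (one 500 ms input sequence with one label slot per frame). The sample rate is an assumption for illustration only, not a project setting:

```python
import numpy as np

SAMPLE_RATE = 250_000          # assumed sample rate, for illustration only
FRAME_MS = 50                  # DENet frame length in milliseconds
FRAMES_PER_SEQ = 10            # frames per input sequence (500 ms total)

frame_len = SAMPLE_RATE * FRAME_MS // 1000   # samples per 50 ms frame
seq_len = frame_len * FRAMES_PER_SEQ         # samples per 500 ms sequence

audio = np.random.randn(seq_len).astype(np.float32)   # stand-in for a real 500 ms chunk

# Shape (10, frame_len): one row per 50 ms frame, matching the per-frame labels.
frames = audio.reshape(FRAMES_PER_SEQ, frame_len)
labels = np.zeros(FRAMES_PER_SEQ, dtype=np.int64)      # one label slot per frame
print(frames.shape, labels.shape)
```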


### Annotations 

I realized that the annotations were a bit... strange. 
In `FileInfo.csv`, each call's start and end sample is labeled. For example, for the following recording (visualized with Sonic Visualiser), two calls are visible and the labels are the following: 

![example image](assets/images/example_call.png)

| File name              | Voice start sample (1) | Voice end sample (1) | Voice start sample (2) | Voice end sample (2) |
|------------------------|------------------------|----------------------|------------------------|----------------------|
| 120601011047196133.WAV | 47553                  | 95094                | 298537                 | 349426               |

But in the `annotation.csv`, the corresponding labels are the following: 

| FileID | Context | Start sample | End sample |
|--------|---------|--------------|------------|
| 30     | 1       | 298537       | 607056     |

Not only is a single call labeled, but its tail is also marked as `context = 1`, which is a bit strange. Still, we can actually see some noise/signal at the end of the recording, so which annotation should we trust? 

I decided to merge the two files, ignoring the files without labels. This also allows for subsampling of the already huge dataset. 
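
A minimal sketch of that merge step using pandas. The join key and the exact column names are assumptions based on the tables shown above, not the actual file layout:

```python
import pandas as pd

file_info = pd.read_csv("FileInfo.csv")      # per-call start/end samples
annotations = pd.read_csv("annotation.csv")  # FileID, Context, Start sample, End sample

# Assumed join key shared by both files (hypothetical).
merged = annotations.merge(file_info, on="FileID", how="inner")

# Drop files that carry no context label, which also subsamples the dataset.
merged = merged.dropna(subset=["Context"])
merged.to_csv("assets/labels.csv", index=False)
```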


### Dataloader

This has a huge impact on how I will choose to label the data for a given random begin and end sample. 
I don't have many options: 

1. Use the call annotations without processing them. That is the simplest solution but not very clever. Still, given that the GRU uses previous samples, it should be able to figure out on its own that a given frame is no longer a call (the tail of a call). 
2. Look into unsupervised (or rather semi-supervised) learning, since some of the calls are labeled and others are not. In this case I should use the labeled calls, then process them *using the phoneme annotation* to at least remove the silent parts. Since the GRU uses previous samples, I guess it makes sense to say "this is still part of the audio event" and process it as a whole. 

**Solution** : I will simply process the labels beforehand; the dataloader will then load the label of the randomly chosen sample.
I should also subsample the dataset to keep mostly calls that have a class or emitter label. This will both reduce the database size and make data augmentation (adding other animals) easier. 
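
A minimal PyTorch `Dataset` sketch of that idea. It assumes the preprocessed label file already maps each audio file to the overall sample span of its calls; the file paths, column names and the `.npy` audio loader are placeholders, not the actual pipeline:

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class RandomChunkDataset(Dataset):
    """Return a random fixed-length chunk of an audio file and its label."""

    def __init__(self, label_csv: str, chunk_len: int):
        # Assumed columns: path, start_sample, end_sample (hypothetical).
        self.labels = pd.read_csv(label_csv)
        self.chunk_len = chunk_len

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        row = self.labels.iloc[idx]
        audio = np.load(row["path"])  # placeholder loader (pre-converted .npy files)
        start = np.random.randint(0, max(1, len(audio) - self.chunk_len))
        chunk = audio[start : start + self.chunk_len]
        # Positive label if the chunk overlaps the annotated call span.
        label = int(start < row["end_sample"] and start + self.chunk_len > row["start_sample"])
        return torch.from_numpy(chunk).float(), torch.tensor(label)
```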

One thing worth noting is that bat calls (and here I mean full "calls", not phonemes) can easily be grouped together based on their separation in time. YOLOR's results give phonemes, but we can simply infer the actual call from them. This is not very accurate, but it can work. 

#### In progress : 
- Given the previous section, I will write a data-processing function in `/utils` that writes the `.csv` label file to the `/assets` directory. 
Labels are then given to a random audio sequence based on its position: if it lies between the minimum `start_sample` and the maximum `end_sample`, I give it the label `1` for now, i.e. a bat call. Giving the context or emitter label instead is straightforward but still needs to be done. 
> @TODO 
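
A minimal sketch of that preprocessing step, assuming the merged table from the Annotations section and the same placeholder column names as before; it collapses each file's annotations into one overall span and writes the result to `/assets`, which a dataloader like the one sketched above can then read:

```python
import pandas as pd


def write_label_file(merged: pd.DataFrame, out_path: str = "assets/labels.csv") -> None:
    """Collapse each file's annotations to its minimum start_sample and maximum end_sample."""
    spans = (
        merged.groupby("path")  # hypothetical per-file key
        .agg(start_sample=("start_sample", "min"), end_sample=("end_sample", "max"))
        .reset_index()
    )
    # A random chunk falling inside this span will get label 1 (bat call).
    spans.to_csv(out_path, index=False)
```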

### Visualization 

I am creating scripts to visualize the detections on validation data at validation time. Again, since the input of the network is 200 ms (or more), **I can't feed it the full audio, can I?**
> If I can't, I need to write a script that processes the full audio and produces individual labels. 
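
One possible way to do that is sketched below: slide a window of the network's input length over the full recording and collect one prediction per window. The `model` here is any PyTorch module mapping a `(batch, window)` waveform to class logits, not the actual DENet implementation:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict_full_audio(model: nn.Module, audio: torch.Tensor, win_len: int, hop_len: int):
    """Run the model over a full recording, one fixed-length window at a time."""
    model.eval()
    preds = []
    for start in range(0, audio.shape[-1] - win_len + 1, hop_len):
        window = audio[..., start : start + win_len].unsqueeze(0)  # add a batch dimension
        preds.append(model(window).argmax(dim=-1).item())
    return preds  # one predicted label per window position
```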

## More ideas:
- Add flying-fox (FF) calls to background sounds and use that for training. Hashizume's FF recordings already contain background noise. 
- Add other animal calls, rain, wind and other degradations in the transformations. 

### Improving the dataset  

Our dataset only has positive instances -- bat calls are almost always present in the recordings -- so we need to augment it with more recordings where other events happen. The dataset is large enough that we can subsample it and combine it with random noises from well-known benchmark datasets, although the audio conditions will be different :confused:
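
A minimal sketch of that kind of augmentation: mix a call recording with a background clip at a chosen signal-to-noise ratio. The noise source and the SNR value are assumptions for illustration, not part of the current pipeline:

```python
import numpy as np


def mix_with_noise(call: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add background noise to a call at the requested SNR (mono arrays of equal length)."""
    call_power = np.mean(call ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(call_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(call_power / (noise_power * 10 ** (snr_db / 10)))
    return call + scale * noise


# Example with synthetic data; real usage would load a bat call and a benchmark noise clip.
rng = np.random.default_rng(0)
call = rng.standard_normal(16_000).astype(np.float32)
noise = rng.standard_normal(16_000).astype(np.float32)
augmented = mix_with_noise(call, noise, snr_db=10.0)
```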


## Dataset description : 

- 293,235 different audio files in total, each containing at least one call. 
- 91,081 labeled audio files in total. The labels are: `Emitter, Addressee, Context, Emitter pre-vocalization action, Addressee pre-vocalization action, Emitter post-vocalization action, Addressee post-vocalization action`
- In the case of the emitter, 7,858 and 44,075 calls have no label (a "-" sign in the label means that the bat is either the emitter or the addressee).
- 39,147 audio files have a known emitter. 
- 51 different individuals. 
- 44 different individuals emitted calls.
- 60,813 audio files have a known call context (General and Unknown labels excluded).
- 31,922 audio files have an emitter with a known gender (pups without a known gender excluded).

| ID | Context | Description | Count |
|----|---------|-------------|-------|
| 0 | Unknown | Unknown context. | 640 |
| 1 | Separation | Emitted (rarely) by adults when separated from the group. | 504 |
| 2 | Biting | Emitted by a bat after being bitten by another. | 1788 |
| 3 | Feeding | The interaction involves food. | 6683 |
| 4 | Fighting | The interaction involves intense aggressive physical contact. | 7963 |
| 5 | Grooming | The interaction involves one bat grooming another. | 383 |
| 6 | Isolation | Emitted by young pups. | 5714 |
| 7 | Kissing | The interaction involves one bat licking another's mouth. | 362 |
| 8 | Landing | The interaction involves one bat landing on top of another. | 16 |
| 9 | Mating protest | Emitted by a female protesting a mating attempt. | 2338 |
| 10 | Threat-like | The interaction involves contactless aggressive displays. | 1065 |
| 11 | General | Unspecified context. The interacting bats are usually 10-20 cm apart (in other interactions the bats are usually closer). | 29627 |
| 12 | Sleeping | The interaction occurs in the sleep cluster. |  |



## DENet 
I implemented it, but for now I can't use the attention mechanism because its MLP takes too much memory.

## Docker
```bash
docker build . --network=host -t pytorch-latest-pamai
docker run -it --rm --name pytorch-container --network=host pytorch-latest-pamai bash
```
