| Field | Value |
| --- | --- |
| Name | rouskinhf |
| Version | 0.3.5 |
| home_page | |
| Summary | A library to manipulate data for our DMS prediction models. |
| upload_time | 2023-11-22 07:46:35 |
| maintainer | |
| docs_url | None |
| author | |
| requires_python | >=3.10 |
| license | MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
[CI](https://github.com/rouskinlab/rouskinhf/actions/workflows/CI.yml)
[Release](https://github.com/rouskinlab/rouskinhf/actions/workflows/release.yml)


# Download your RNA data from HuggingFace with rouskinhf!
A repo to manipulate the data for our RNA structure prediction model. It allows you to:
- pull datasets from Rouskinlab's HuggingFace
- create datasets from local files and push them to HuggingFace, from the following formats:
  - `.fasta`
  - `.ct`
  - `.json` (DREEM output format)
  - `.json` (Rouskinlab's HuggingFace format)
## Important notes
- Sequences containing bases other than `A`, `C`, `G`, `T`, `U`, `N`, `a`, `c`, `g`, `t`, `u`, `n` are not supported; such sequences are filtered out of the data (see the illustrative check below).
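For illustration only, the rule above can be checked with a small standalone helper. This is a hypothetical sketch, not the filter `rouskinhf` uses internally; the helper name and sample data are made up:

```python
# Hypothetical sketch of the filtering rule above -- not rouskinhf's internal code.
ALLOWED_BASES = set("ACGTUNacgtun")

def has_only_allowed_bases(sequence: str) -> bool:
    """Return True if the sequence contains only A/C/G/T/U/N (either case)."""
    return set(sequence) <= ALLOWED_BASES

# Example: the second sequence contains an unsupported base 'X' and would be dropped.
sequences = {"ref1": "ACGUUGCA", "ref2": "ACGXUGCA"}
kept = {ref: seq for ref, seq in sequences.items() if has_only_allowed_bases(seq)}
print(kept)  # {'ref1': 'ACGUUGCA'}
```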
## Dependencies
- [RNAstructure](https://rna.urmc.rochester.edu/RNAstructure.html) (also available on [Rouskinlab GitHub](https://github.com/rouskinlab/RNAstructure)).
## Push a new release to PyPI
1. Edit the version to `vx.y.z` in `pyproject.toml`, then run in a terminal: `git add . && git commit -m 'vx.y.z' && git push`.
2. Create and push a git tag `vx.y.z` by running in a terminal: `git tag 'vx.y.z' && git push --tags`.
3. Create a release for the tag `vx.y.z` on GitHub Releases.
4. Make sure that the GitHub Action `Publish distributions 📦 to PyPI` passed on GitHub Actions.
## Installation
### Get a HuggingFace token
Go to [HuggingFace](https://huggingface.co/) and create an account. Then go to your profile and copy your token ([huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)).
### Create an environment file
Open a terminal and type:
```bash
nano env
```
Copy-paste the following content and change the values to your own:
```bash
export HUGGINGFACE_TOKEN="your token here" # you must change this to your HuggingFace token
export DATA_FOLDER="data/datafolders" # where the datafolder are stored by default, change it if you want to store it somewhere else
export DATA_FOLDER_TESTING="data/input_files_for_testing" # Don't touch this
export RNASTRUCTURE_PATH="/Users/ymdt/src/RNAstructure/exe" # Change this to the path of your RNAstructure executable
export RNASTRUCTURE_TEMP_FOLDER="temp" # You can change this to the path of your RNAstructure temp folder
```
Then save the file and exit nano.
### Source the environment
```bash
source env
```
### Install the package with pip
```bash
pip install rouskinhf
```
## Tutorials
### Authenticate your machine with HuggingFace
See the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/huggingface.ipynb).
### Download a datafolder from HuggingFace
See the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/use_for_models.ipynb).
### Create a datafolder from local files and push it to HuggingFace
See the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/create_push_pull.ipynb).
## About
### Sourcing the environment and keeping your environment variables secret
The variables defined in the `env` file are required by `rouskinhf`. Make sure that before you use `rouskinhf`, you run in a terminal:
```bash
source env
```
or, in a Jupyter notebook:
```python
!pip install python-dotenv
%load_ext dotenv
%dotenv env
```
or, in a Python script or Jupyter notebook:
```python
from rouskinhf import setup_env
setup_env(
    HUGGINGFACE_TOKEN="your token here",
    DATA_FOLDER="data/datafolders",
    ...
)
```
The point of using environment variables is to keep your HuggingFace token private. Make sure to add your `env` file to your `.gitignore`, so your HuggingFace token doesn't get pushed to any public repository.
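If you are unsure whether the variables were actually picked up, a quick sanity check from Python (illustrative only, not part of `rouskinhf`) is:

```python
import os

# Check that the variables from the `env` file above are visible to this process.
# The variable names come from the env file; the check itself is just an illustration.
required = ["HUGGINGFACE_TOKEN", "DATA_FOLDER", "RNASTRUCTURE_PATH"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}. Did you `source env`?")
```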
### Import data with ``import_dataset``
This repo provides a function ``import_dataset``, which lets you pull a dataset from HuggingFace and store it locally. If the data is already stored locally, it is loaded from the local folder. The available types of data are the DMS signal and the structure, the latter represented as tuples of paired bases. The function has the following signature:
```python
def import_dataset(name: str, data: str, force_download: bool = False) -> np.ndarray:
    """Finds the dataset with the given name for the given type of data.

    Parameters
    ----------
    name : str
        Name of the dataset to find.
    data : str
        Name of the type of data to find the dataset for (structure or DMS).
    force_download : bool
        Whether to force download the dataset from HuggingFace Hub. Defaults to False.

    Returns
    -------
    ndarray
        The dataset with the given name for the given type of data.

    Example
    -------
    >>> import_dataset(name='for_testing', data='structure').keys()
    dict_keys(['references', 'sequences', 'structure'])
    >>> import_dataset(name='for_testing', data='DMS').keys()
    dict_keys(['references', 'sequences', 'DMS'])
    >>> import_dataset(name='for_testing', data='structure', force_download=True).keys()
    dict_keys(['references', 'sequences', 'structure'])
    >>> import_dataset(name='for_testing', data='DMS', force_download=True).keys()
    dict_keys(['references', 'sequences', 'DMS'])
    """
```
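As a minimal usage sketch, assuming `import_dataset` is importable from the package top level (like `setup_env` above) and that the return value is dict-like, as the docstring examples suggest; the iteration and key names are taken from those examples:

```python
from rouskinhf import import_dataset

# Pull (or load from the local datafolder) the structure data of the test dataset.
data = import_dataset(name="for_testing", data="structure")

# Key names taken from the docstring examples above.
references = data["references"]
sequences = data["sequences"]
structures = data["structure"]

# Illustrative: look at the first entry.
print(references[0], sequences[0], structures[0])
```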
### FYI, the datafolder object
The datafolder object is a wrapper around your local folder and the HuggingFace API that keeps a consistent data structure across your datasets. It provides methods to create datasets from various input formats, store the data and metadata in a systematic way, and push/pull from HuggingFace.
On HuggingFace, the datafolder stores the data with the following structure:
```bash
HUGGINGFACE DATAFOLDER
- [datafolder name]
    - source
        - whichever file(s) you used to create the dataset (fasta, set of CTs, etc.)
    - data.json   # the data in a human-readable format
    - info.json   # the metadata of the dataset; indicates how the DMS signal and the structures were obtained (directly from the source or from a prediction)
    - README.md   # the metadata of the dataset in a human-readable format
```
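Once a datafolder is synced locally (same structure, see below), its human-readable files can be inspected with the standard library. This is a sketch: the dataset name `my_dataset` is hypothetical and the `data/datafolders` root is an assumption based on the default `DATA_FOLDER` above; the file names come from the structure shown:

```python
import json
from pathlib import Path

# Assumed local layout: DATA_FOLDER (default "data/datafolders") / dataset name.
datafolder = Path("data/datafolders") / "my_dataset"  # "my_dataset" is hypothetical

with open(datafolder / "data.json") as f:
    data = json.load(f)        # the data in a human-readable format
with open(datafolder / "info.json") as f:
    info = json.load(f)        # how the DMS signal and structures were obtained

print(list(data)[:3])          # a few top-level keys of data.json
print(info)
```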
Locally, we have the same structure, with the addition of `.npy` files that contain the data in a machine-readable format. Each `.npy` file holds a numpy array, and the file name matches the corresponding key in `data.json`. The source files are not downloaded by default. Hence, the local structure is:
```bash
LOCAL DATAFOLDER
- [datafolder name]
    ...
    - README.md   # the metadata of the dataset in a human-readable format
    - references.npy
    - sequences.npy
    - base_pairs.npy
    - dms.npy
```

## Raw data

```json
{
    "_id": null,
    "home_page": "",
    "name": "rouskinhf",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "",
    "keywords": "",
    "author": "",
    "author_email": "Yves Martin <yves@martin.yt>, Alberic de Lajarte <albericlajarte@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ee/4d/8314e2f0679bc2f3c6cc047a6ae6e8a0b7d5d3c9067763fe6e8b9394b39b/rouskinhf-0.3.5.tar.gz",
    "platform": null,
"description": "[](https://github.com/rouskinlab/rouskinhf/actions/workflows/CI.yml)\n[](https://github.com/rouskinlab/rouskinhf/actions/workflows/release.yml)\n\n\n\n# Download your RNA data from HuggingFace with rouskinhf!\n\nA repo to manipulate the data for our RNA structure prediction model. This repo allows you to:\n- pull datasets from the Rouskinlab's HuggingFace\n- create datasets from local files and push them to HuggingFace, from the formats:\n - `.fasta`\n - `.ct`\n - `.json` (DREEM output format)\n - `.json` (Rouskinlab's huggingface format)\n\n## Important notes\n\n- Sequences with bases different than `A`, `C`, `G`, `T`, `U`, `N`, `a`, `c`, `g`, `t`, `u`, `n` are not supported. The data will be filtered out.\n\n## Dependencies\n- [RNAstructure](https://rna.urmc.rochester.edu/RNAstructure.html) (also available on [Rouskinlab GitHub](https://github.com/rouskinlab/RNAstructure)).\n\n## Push a new release to Pypi\n\n1. Edit version to `vx.y.z` in `pyproject.toml`. Then run in a terminal `git add . && git commit -m 'vx.y.z' && git push`.\n2. Create and push a git tag `vx.y.z` by running in a terminal `git tag 'vx.y.z' && git push --tag`.\n3. Create a release for the tag `vx.y.z` on Github Release.\n4. Make sure that the Github Action `Publish distributions \ud83d\udce6 to PyPI` passed on Github Actions.\n\n## Installation\n\n### Get a HuggingFace token\n\nGo to [HuggingFace](https://huggingface.co/) and create an account. Then go to your profile and copy your token ([huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)).\n\n### Create an environment file\n\nOpen a terminal and type:\n\n```bash\nnano env\n```\n\nCopy paste the following content, and change the values to your own:\n\n```bash\nexport HUGGINGFACE_TOKEN=\"your token here\" # you must change this to your HuggingFace token\nexport DATA_FOLDER=\"data/datafolders\" # where the datafolder are stored by default, change it if you want to store it somewhere else\nexport DATA_FOLDER_TESTING=\"data/input_files_for_testing\" # Don't touch this\nexport RNASTRUCTURE_PATH=\"/Users/ymdt/src/RNAstructure/exe\" # Change this to the path of your RNAstructure executable\nexport RNASTRUCTURE_TEMP_FOLDER=\"temp\" # You can change this to the path of your RNAstructure temp folder\n```\n\nThen save the file and exit nano.\n\n### Source the environment\n\n```bash\nsource env\n```\n\n### Install the package with pip\n\n```bash\npip install rouskinhf\n```\n\n\n## Tutorials\n\n### Authentify your machine to HuggingFace\n\nSee the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/huggingface.ipynb).\n\n### Download a datafolder from HuggingFace\n\nSee the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/use_for_models.ipynb).\n\n### Create a datafolder from local files and push it to HuggingFace\n\nSee the [tutorial](https://github.com/rouskinlab/rouskinhf/blob/main/tutorials/create_push_pull.ipynb).\n\n## About\n\n### Sourcing the environment and keeping your environment variable secret\n\nThe variables defined in the `env` file are required by `rouskinhf`. 
Make that before you use `rouskinhf`, you run in a terminal:\n\n```bash\nsource env\n```\n or, in a Jupyter notebook:\n\n```python\n!pip install python-dotenv\n%load_ext dotenv\n%dotenv env\n```\n\nor, in a python script or Jupyter notebook:\n\n```python\nfrom rouskinhf import setup_env\nsetup_env(\n HUGGINGFACE_TOKEN=\"your token here\",\n DATA_FOLDER=\"data/datafolders\",\n ...\n)\n```\n\n The point of using environment variables is to ensure the privacy of your huggingface token. Make sure to add your `env` file to your `.gitignore`, so your HuggingFace token doesn't get pushed to any public repository.\n\n### Import data with ``import_dataset``\n\nThis repo provides a function ``import_dataset``, which allows your to pull a dataset from HuggingFace and store it locally. If the data is already stored locally, it will be loaded from the local folder. The type of data available is the DMS signal and the structure, under the shape of paired bases tuples. The function has the following signature:\n\n```python\ndef import_dataset(name:str, data:str, force_download:bool=False)->np.ndarray:\n\n \"\"\"Finds the dataset with the given name for the given type of data.\n\n Parameters\n ----------\n\n name : str\n Name of the dataset to find.\n data : str\n Name of the type of data to find the dataset for (structure or DMS).\n force_download : bool\n Whether to force download the dataset from HuggingFace Hub. Defaults to False.\n\n Returns\n -------\n\n ndarray\n The dataset with the given name for the given type of data.\n\n Example\n -------\n\n >>> import_dataset(name='for_testing', data='structure').keys()\n dict_keys(['references', 'sequences', 'structure'])\n >>> import_dataset(name='for_testing', data='DMS').keys()\n dict_keys(['references', 'sequences', 'DMS'])\n >>> import_dataset(name='for_testing', data='structure', force_download=True).keys()\n dict_keys(['references', 'sequences', 'structure'])\n >>> import_dataset(name='for_testing', data='DMS', force_download=True).keys()\n dict_keys(['references', 'sequences', 'DMS'])\n```\n\n### FYI, the datafolder object\n\nThe datafolder object is a wrapper around your local folder and HuggingFace API, to keep a consistent datastructure across your datasets. It contains multiple methods to create datasets from various input formats, store the data and metadata in a systematic way, and push / pull from HuggingFace.\n\nOn HuggingFace, the datafolder stores the data under the following structure:\n\n```bash\nHUGGINGFACE DATAFOLDER\n- [datafolder name]\n - source\n - whichever file(s) you used to create the dataset (fasta, set of CTs, etc.).\n - data.json # the data under a human readable format.\n - info.json # the metadata of the dataset. This file indicates how we got the DMS signal and the structures (directly from the source or from a prediction).\n - README.md # the metadata of the dataset in a human readable format.\n```\n\nLocally, we have the same structure with the addition of .npy files which contain the data in a machine readable format. Each .npy file contains a numpy array of the data, and the name of the file is the name of the corresponding key in the data.json file. The source file won\u2019t be downloaded by default. Hence, the local structure is:\n\n```bash\nLOCAL DATAFOLDER\n- [datafolder name]\n ...\n - README.md # the metadata of the dataset in a human readable format\n - references.npy\n - sequences.npy\n - base_pairs.npy\n - dms.npy\n```\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE ",
"summary": "A library to manipulate data for our DMS prediction models.",
"version": "0.3.5",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "226111fbc42ad69d95c6a1ae33605372b4ef8f7e62a403c3ddddd655311aafb2",
"md5": "f4eec87a6bc2ec3811ce6533ae39fadf",
"sha256": "fa3a5975ede9981307bfe845a39b521a52c609dccab5af76d706d36ea82b2177"
},
"downloads": -1,
"filename": "rouskinhf-0.3.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f4eec87a6bc2ec3811ce6533ae39fadf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 25300,
"upload_time": "2023-11-22T07:46:34",
"upload_time_iso_8601": "2023-11-22T07:46:34.137764Z",
"url": "https://files.pythonhosted.org/packages/22/61/11fbc42ad69d95c6a1ae33605372b4ef8f7e62a403c3ddddd655311aafb2/rouskinhf-0.3.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ee4d8314e2f0679bc2f3c6cc047a6ae6e8a0b7d5d3c9067763fe6e8b9394b39b",
"md5": "d620c16842a40d56dbf24bd500a743ed",
"sha256": "5f18314100cb0ee4eecc6a3c112de6defc0766a41c06140c90e3cb25ac274bdd"
},
"downloads": -1,
"filename": "rouskinhf-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "d620c16842a40d56dbf24bd500a743ed",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 24636,
"upload_time": "2023-11-22T07:46:35",
"upload_time_iso_8601": "2023-11-22T07:46:35.711328Z",
"url": "https://files.pythonhosted.org/packages/ee/4d/8314e2f0679bc2f3c6cc047a6ae6e8a0b7d5d3c9067763fe6e8b9394b39b/rouskinhf-0.3.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-22 07:46:35",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "rouskinhf"
}