biosets


- Name: biosets
- Version: 1.2.1
- Home page: https://github.com/psmyth94/biosets
- Summary: Bioinformatics datasets and tools
- Upload time: 2024-11-14 14:29:34
- Author: Patrick Smyth
- Requires Python: <3.12.0,>=3.8.0
- License: Apache 2.0
- Keywords: omics, machine learning, bioinformatics, datasets

<p align="center">
    $${\Huge{\textbf{\textsf{\color{#2E8B57}Bio\color{#4682B4}sets}}}}$$
    <br/>
    <br/>
</p> 
<p align="center">
    <a href="https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main"></a>
    <a href="https://github.com/psmyth94/biosets/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/psmyth94/biosets.svg?color=blue"></a>
    <a href="https://github.com/psmyth94/biosets/tree/main/docs"><img alt="Documentation" src="https://img.shields.io/website/http/github/psmyth94/biosets/tree/main/docs.svg?down_color=red&down_message=offline&up_message=online"></a>
    <a href="https://github.com/psmyth94/biosets/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/psmyth94/biosets.svg"></a>
    <a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg"></a>
    <a href="https://zenodo.org/records/14028772"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg" alt="DOI"></a>
</p>

**Biosets** is a specialized library that extends 🤗 [Datasets](https://github.com/huggingface/datasets) for bioinformatics data, providing the following main features:

- **Bioinformatics Specialization**: Streamlines data management specific to bioinformatics, such as handling samples, features, batches, and associated metadata.
- **Automatic Column Detection**: Infers sample, batch, input features, and target columns, simplifying downstream preprocessing.
- **Custom Data Classes**: Leverages specialized data classes (`ValueWithMetadata`, `Sample`, `Batch`, `RegressionTarget`, etc.) to manage metadata-rich bioinformatics data.
- **Polars Integration**: Optional [Polars](https://github.com/pola-rs/polars) integration enables high-performance data manipulation, ideal for large datasets.
- **Flexible Task Support**: Native support for binary classification, multiclass classification, multiclass-to-binary classification, and regression, adapting to diverse bioinformatics tasks.
- **Integration with 🤗 Datasets**: The `load_dataset` function loads common bioinformatics file formats such as CSV, JSON, and NPZ, among others, and integrates the associated metadata.
- **Arrow File Caching**: Uses [Apache Arrow](https://github.com/apache/arrow) for efficient on-disk caching, enabling fast access to datasets larger than available memory.

Biosets helps bioinformatics researchers focus on analysis rather than data handling, with seamless compatibility with 🤗 Datasets.
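
To make the "multiclass-to-binary" task in the list above concrete, here is a minimal sketch of the idea in plain Python. It does not use or reproduce the Biosets API: the function name `multiclass_to_binary` and the example labels are hypothetical, chosen only to illustrate collapsing several classes into a positive and a negative group.

```python
from typing import Hashable, Iterable, List


def multiclass_to_binary(
    labels: Iterable[Hashable], positive: Iterable[Hashable]
) -> List[int]:
    """Map each label to 1 if it belongs to `positive`, otherwise 0."""
    positive_set = set(positive)
    return [1 if label in positive_set else 0 for label in labels]


# Hypothetical labels, for illustration only.
labels = ["healthy", "stage_1", "stage_2", "healthy"]
binary = multiclass_to_binary(labels, positive=["stage_1", "stage_2"])
print(binary)  # [0, 1, 1, 0]
```

How Biosets itself selects the positive classes is not covered here; consult the project documentation for the actual options.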

## Installation

### With pip

You can install **Biosets** from PyPI:

```bash
pip install biosets
```

### With conda

Install **Biosets** via conda:

```bash
conda install -c patrico49 biosets
```
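
Both install routes assume a supported interpreter. The package metadata for version 1.2.1 declares `requires_python` as `<3.12.0,>=3.8.0`, so a quick check like the following sketch can explain a failed install. The helper name `python_supported` is hypothetical and not part of Biosets.

```python
import sys

# Bounds copied from the package metadata: requires_python "<3.12.0,>=3.8.0".
MIN_VERSION = (3, 8, 0)   # inclusive lower bound
MAX_VERSION = (3, 12, 0)  # exclusive upper bound


def python_supported(version_info=sys.version_info):
    """Return True when the interpreter version falls inside the declared range."""
    return MIN_VERSION <= tuple(version_info[:3]) < MAX_VERSION


if not python_supported():
    print("This interpreter is outside the range declared for biosets 1.2.1.")
```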

## Usage

**Biosets** provides a straightforward API for handling bioinformatics datasets with integrated metadata management. Here's a quick example:

```python
# Import the same function that is called below.
from biosets import load_dataset

bio_data = load_dataset(
    data_files="data_with_samples.csv",
    sample_metadata_files="sample_metadata.csv",
    feature_metadata_files="feature_metadata.csv",
    target_column="metadata1",
    experiment_type="metagenomics",
    batch_column="batch",
    sample_column="sample",
    metadata_columns=["metadata1", "metadata2"],
    drop_samples=False
)["train"]  # select the "train" split
```
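
The example above refers to three CSV files. As a rough sketch, the snippet below writes toy versions of them using only the standard library so you can experiment. The column layout is an assumption inferred from the argument names (`sample`, `batch`, `metadata1`, `metadata2`); the feature names, the `feature` and `taxonomy` columns, and all values are made up. Check the data loading documentation for the layout Biosets actually expects.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a header row followed by data rows to `path`."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


workdir = Path(tempfile.mkdtemp())

# Measurements: one row per sample (feature names are hypothetical).
write_csv(
    workdir / "data_with_samples.csv",
    ["sample", "batch", "feature_a", "feature_b"],
    [["s1", "b1", 10, 0], ["s2", "b1", 3, 7], ["s3", "b2", 0, 12]],
)

# Sample metadata: one row per sample, keyed by the sample column.
write_csv(
    workdir / "sample_metadata.csv",
    ["sample", "metadata1", "metadata2"],
    [["s1", "case", 34], ["s2", "control", 41], ["s3", "case", 29]],
)

# Feature metadata: one row per feature column (assumed layout).
write_csv(
    workdir / "feature_metadata.csv",
    ["feature", "taxonomy"],
    [["feature_a", "Bacteroides"], ["feature_b", "Prevotella"]],
)

print(sorted(p.name for p in workdir.glob("*.csv")))
```

Pointing `data_files` and the two metadata arguments at the files in `workdir` should then give you something small to load.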

For further details, check the [advanced usage documentation](./docs/DATA_LOADING.md).

## Main Differences Between Biosets and 🤗 Datasets

- **Bioinformatics Focus**: While 🤗 Datasets is a general-purpose library, Biosets is tailored for the bioinformatics domain.
- **Seamless Metadata Integration**: Biosets is built for datasets with metadata dependencies, like sample and feature metadata.
- **Automatic Column Detection**: Reduces preprocessing time with automatic inference of sample, batch, feature, and label columns.
- **Specialized Data Classes**: Biosets introduces custom classes (e.g., `Sample`, `Batch`, `ValueWithMetadata`) to enable richer data representation.
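
To illustrate what "metadata integration" means in practice, the sketch below joins per-sample metadata onto measurement rows by sample identifier, using plain dictionaries. This is a conceptual illustration only, not how Biosets implements it: `attach_sample_metadata` and the example records are hypothetical.

```python
from typing import Dict, List


def attach_sample_metadata(
    rows: List[Dict], sample_metadata: List[Dict], sample_column: str = "sample"
) -> List[Dict]:
    """Return copies of `rows` with metadata merged in by sample identifier."""
    by_sample = {meta[sample_column]: meta for meta in sample_metadata}
    merged = []
    for row in rows:
        meta = by_sample.get(row[sample_column], {})
        # Measurement values win on key collisions, so metadata never overwrites data.
        merged.append({**meta, **row})
    return merged


# Hypothetical records, for illustration only.
rows = [{"sample": "s1", "feature_a": 10}, {"sample": "s2", "feature_a": 3}]
sample_metadata = [{"sample": "s1", "metadata1": "case"}]

merged = attach_sample_metadata(rows, sample_metadata)
print(merged[0])  # {'sample': 's1', 'metadata1': 'case', 'feature_a': 10}
```

Samples without a metadata record (here `s2`) pass through unchanged in this sketch; a real pipeline would need to decide whether to keep or drop them.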

## Disclaimers

Biosets may run Python code from custom `datasets` scripts to handle specific data formats. For security, users should:

- Inspect dataset scripts prior to execution.
- Use pinned versions for any repository dependencies.

If you manage a dataset and wish to update or remove it, please open a discussion or pull request on the Community tab of 🤗's datasets page.

## BibTeX

If you'd like to cite **Biosets**, please use the following:

```bibtex
@misc{smyth2024biosets,
    title = {psmyth94/biosets: 1.1.0},
    author = {Patrick Smyth},
    year = {2024},
    url = {https://github.com/psmyth94/biosets},
    note = {A library designed to support bioinformatics data with custom features, metadata integration, and compatibility with 🤗 Datasets.}
}
```




            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/psmyth94/biosets",
    "name": "biosets",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12.0,>=3.8.0",
    "maintainer_email": null,
    "keywords": "omics machine learning bioinformatics datasets",
    "author": "Patrick Smyth",
    "author_email": "psmyth1994@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/3e/48/4af7a13ae318f51c9c0dc8644b9b08217b5d8dac5c275ef331d52342941c/biosets-1.2.1.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\n    $${\\Huge{\\textbf{\\textsf{\\color{#2E8B57}Bio\\color{#4682B4}sets}}}}$$\n    <br/>\n    <br/>\n</p> \n<p align=\"center\">\n    <a href=\"https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain\"><img alt=\"Build\" src=\"https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main\"></a>\n    <a href=\"https://github.com/psmyth94/biosets/blob/main/LICENSE\"><img alt=\"GitHub\" src=\"https://img.shields.io/github/license/psmyth94/biosets.svg?color=blue\"></a>\n    <a href=\"https://github.com/psmyth94/biosets/tree/main/docs\"><img alt=\"Documentation\" src=\"https://img.shields.io/website/http/github/psmyth94/biosets/tree/main/docs.svg?down_color=red&down_message=offline&up_message=online\"></a>\n    <a href=\"https://github.com/psmyth94/biosets/releases\"><img alt=\"GitHub release\" src=\"https://img.shields.io/github/release/psmyth94/biosets.svg\"></a>\n    <a href=\"CODE_OF_CONDUCT.md\"><img alt=\"Contributor Covenant\" src=\"https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg\"></a>\n    <a href=\"https://zenodo.org/records/14028772\"><img src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg\" alt=\"DOI\"></a>\n</p>\n\n**Biosets** is a specialized library that extends \ud83e\udd17 [Datasets](https://github.com/huggingface/datasets) for bioinformatics data, providing the following main features:\n\n- **Bioinformatics Specialization**: Streamlines data management specific to bioinformatics, such as handling samples, features, batches, and associated metadata.\n- **Automatic Column Detection**: Infers sample, batch, input features, and target columns, simplifying downstream preprocessing.\n- **Custom Data Classes**: Leverages specialized data classes (`ValueWithMetadata`, `Sample`, `Batch`, `RegressionTarget`, etc.) 
to manage metadata-rich bioinformatics data.\n- **Polars Integration**: Optional [Polars](https://github.com/pola-rs/polars) integration enables high-performance data manipulation, ideal for large datasets.\n- **Flexible Task Support**: Native support for binary classification, multiclass classification, multiclass-to-binary classification, and regression, adapting to diverse bioinformatics tasks.\n- **Integration with \ud83e\udd17 Datasets**: `load_dataset` function supports loading various bioinformatics formats like CSV, JSON, NPZ, and more, including metadata integration.\n- **Arrow File Caching**: Uses [Apache Arrow](https://github.com/apache/arrow) for efficient on-disk caching, enabling fast access to large datasets without memory limitations.\n\nBiosets helps bioinformatics researchers focus on analysis rather than data handling, with seamless compatibility with \ud83e\udd17 Datasets.\n\n## Installation\n\n### With pip\n\nYou can install **Biosets** from PyPI:\n\n```bash\npip install biosets\n```\n\n### With conda\n\nInstall **Biosets** via conda:\n\n```bash\nconda install -c patrico49 biosets\n```\n\n## Usage\n\n**Biosets** provides a straightforward API for handling bioinformatics datasets with integrated metadata management. 
Here's a quick example:\n\n```python\nfrom biosets import load_biodata\n\nbio_data = load_dataset(\n    data_files=\"data_with_samples.csv\",\n    sample_metadata_files=\"sample_metadata.csv\",\n    feature_metadata_files=\"feature_metadata.csv\",\n    target_column=\"metadata1\",\n    experiment_type=\"metagenomics\",\n    batch_column=\"batch\",\n    sample_column=\"sample\",\n    metadata_columns=[\"metadata1\", \"metadata2\"],\n    drop_samples=False\n)[\"train\"]\n```\n\nFor further details, check the [advance usage documentation](./docs/DATA_LOADING.md).\n\n## Main Differences Between Biosets and \ud83e\udd17 Datasets\n\n- **Bioinformatics Focus**: While \ud83e\udd17 Datasets is a general-purpose library, Biosets is tailored for the bioinformatics domain.\n- **Seamless Metadata Integration**: Biosets is built for datasets with metadata dependencies, like sample and feature metadata.\n- **Automatic Column Detection**: Reduces preprocessing time with automatic inference of sample, batch, feature, and label columns.\n- **Specialized Data Classes**: Biosets introduces custom classes (e.g., `Sample`, `Batch`, `ValueWithMetadata`) to enable richer data representation.\n\n## Disclaimers\n\nBiosets may run Python code from custom `datasets` scripts to handle specific data formats. 
For security, users should:\n\n- Inspect dataset scripts prior to execution.\n- Use pinned versions for any repository dependencies.\n\nIf you manage a dataset and wish to update or remove it, please open a discussion or pull request on the Community tab of \ud83e\udd17's datasets page.\n\n## BibTeX\n\nIf you'd like to cite **Biosets**, please use the following:\n\n```bibtex\n@misc{smyth2024biosets,\n    title = {psmyth94/biosets: 1.1.0},\n    author = {Patrick Smyth},\n    year = {2024},\n    url = {https://github.com/psmyth94/biosets},\n    note = {A library designed to support bioinformatics data with custom features, metadata integration, and compatibility with \ud83e\udd17 Datasets.}\n}\n```\n\n\n\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Bioinformatics datasets and tools",
    "version": "1.2.1",
    "project_urls": {
        "Download": "https://github.com/psmyth94/biosets/tags",
        "Homepage": "https://github.com/psmyth94/biosets"
    },
    "split_keywords": [
        "omics",
        "machine",
        "learning",
        "bioinformatics",
        "datasets"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8d8d1ec0df3d5de9423f2c76e2246de55fa68e58ddf0ce1fbabae090ddfc4030",
                "md5": "e82587f0a14b73da24527371fd0bec92",
                "sha256": "3dde3ac1e7e7ae754e725084b6d7a8cab3f420706e182a3fa537a585f0060180"
            },
            "downloads": -1,
            "filename": "biosets-1.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e82587f0a14b73da24527371fd0bec92",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12.0,>=3.8.0",
            "size": 93224,
            "upload_time": "2024-11-14T14:29:31",
            "upload_time_iso_8601": "2024-11-14T14:29:31.719057Z",
            "url": "https://files.pythonhosted.org/packages/8d/8d/1ec0df3d5de9423f2c76e2246de55fa68e58ddf0ce1fbabae090ddfc4030/biosets-1.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3e484af7a13ae318f51c9c0dc8644b9b08217b5d8dac5c275ef331d52342941c",
                "md5": "f53a865c784e06d1515b4f38ae03be19",
                "sha256": "7adf64679a07aa52c96dd1bcdaea2f222e16db04de4907c90b557059a0bf865b"
            },
            "downloads": -1,
            "filename": "biosets-1.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f53a865c784e06d1515b4f38ae03be19",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12.0,>=3.8.0",
            "size": 83681,
            "upload_time": "2024-11-14T14:29:34",
            "upload_time_iso_8601": "2024-11-14T14:29:34.108994Z",
            "url": "https://files.pythonhosted.org/packages/3e/48/4af7a13ae318f51c9c0dc8644b9b08217b5d8dac5c275ef331d52342941c/biosets-1.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-14 14:29:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "psmyth94",
    "github_project": "biosets",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "biosets"
}
        