biofit

Name	biofit
Version	0.0.1
home_page	https://github.com/psmyth94/biofit
Summary	BioFit: Bioinformatics Machine Learning Framework
upload_time	2024-11-18 19:47:28
maintainer	None
docs_url	None
author	Patrick Smyth
requires_python	<3.12.0,>=3.8.0
license	MIT
keywords	omics machine learning bioinformatics metrics
requirements	No requirements were recorded.

<p align="center">
    $${\Huge{\textbf{\textsf{\color{#2E8B57}Bio\color{red}fit}}}}$$
    <br/>
    <br/>
</p>
<p align="center">
    <a href="https://github.com/psmyth94/biofit/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/psmyth94/biofit/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main"></a>
    <a href="https://github.com/psmyth94/biofit/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/psmyth94/biofit.svg?color=blue"></a>
    <a href="https://github.com/psmyth94/biofit/tree/main/docs"><img alt="Documentation" src="https://img.shields.io/website/http/github/psmyth94/biofit/tree/main/docs.svg?down_color=red&down_message=offline&up_message=online"></a>
    <a href="https://github.com/psmyth94/biofit/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/psmyth94/biofit.svg"></a>
    <a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg"></a>
    <!-- <a href="https://zenodo.org/records/14028772"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg" alt="DOI"></a> -->
</p>

**Biofit** is a machine learning library designed for bioinformatics datasets. It
provides tools for transforming, extracting, training, and evaluating machine learning
models on biomedical data. It also provides automatic data preprocessing, visualization,
and configurable processing pipelines. Here are some of the main features of Biofit:

- **Automatic Data Preprocessing:** Automatically preprocess biomedical datasets using
  built-in preprocessing steps.
- **Automatic Visualization:** Automatically visualize data using built-in visualization
  methods geared towards biomedical data.
- **Configurable Processing Pipelines:** Define and customize data processing pipelines.
- **Data Handling Flexibility:** Support for a wide range of data formats, including:
  - [Pandas](https://github.com/pandas-dev/pandas)
  - [Polars](https://github.com/pola-rs/polars)
  - [NumPy](https://github.com/numpy/numpy)
  - [CSR (SciPy)](https://github.com/scipy/scipy)
  - [Arrow](https://github.com/apache/arrow)
  - 🤗 [Datasets](https://github.com/huggingface/datasets)
  - [Biosets](https://github.com/psmyth94/biosets)
- **Machine Learning Models:** Supports a wide range of machine learning models, including:
  - [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
    - [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    - [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
    - [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
    - [Lasso Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
  - [LightGBM](https://github.com/microsoft/LightGBM)
  - More to come!
- **Caching and Reuse:** Caches intermediate results using Apache Arrow for efficient reuse.
- **Batch Processing and Multiprocessing:** Uses batch processing and multiprocessing to handle large-scale data efficiently.

## Installation

You can install Biofit via pip:

```bash
pip install biofit
```
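
The package metadata lists `requires_python` as `<3.12.0,>=3.8.0`, so it can help to
check the interpreter before installing. The snippet below is a minimal sketch of that
check; `is_supported` is a hypothetical helper written for this example and is not part
of Biofit:

```python
import sys

# Bounds copied from the package metadata: requires_python "<3.12.0,>=3.8.0".
MIN_SUPPORTED = (3, 8, 0)
MAX_EXCLUSIVE = (3, 12, 0)


def is_supported(version=None):
    """Return True if `version`, a (major, minor, micro) tuple, is in range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version is None:
        version = tuple(sys.version_info[:3])
    return MIN_SUPPORTED <= tuple(version) < MAX_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if is_supported() else "unsupported"
    print(status, sys.version.split()[0])
```

If the interpreter is out of range, pip should refuse to install the release, so
creating a virtual environment with a supported Python version is the usual fix.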

## Quick Start

### Preprocessing Data

Biofit provides preprocessing capabilities tailored for omics data. You can use
built-in classes to load preprocessing steps based on the experiment type or create
custom preprocessing pipelines. The preprocessing pipeline in Biofit uses a syntax
similar to sklearn and supports distributed processing.

#### Using a Preprocessor

Biofit lets you fit and transform your data in a few lines, with an interface similar
to scikit-learn's. For example, `LogTransformer` applies a log transformation to your data:

```python
from biofit.preprocessing import LogTransformer
import pandas as pd

dataset = pd.DataFrame({"feature1": [1, 2, 3, 4, 5]})
log_transformer = LogTransformer()
preprocessed_data = log_transformer.fit_transform(dataset)
# Applying log transformation: 100%|█████████████████████████████| 5/5 [00:00<00:00, 7656.63 examples/s]
print(preprocessed_data)
#    feature1
# 0  0.000000
# 1  0.693147
# 2  1.098612
# 3  1.386294
# 4  1.609438
```
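
Judging from the printed values, the transformation shown is a plain natural logarithm
with no pseudocount added. The standard-library sketch below reproduces the same
numbers without Biofit; it only illustrates the arithmetic and makes no claim about how
`LogTransformer` is implemented:

```python
import math

values = [1, 2, 3, 4, 5]

# Natural log of each value, rounded to six decimals like the printed frame.
logged = [round(math.log(v), 6) for v in values]

print(logged)
# [0.0, 0.693147, 1.098612, 1.386294, 1.609438]
```

Because zero has no finite logarithm, count data that contains zeros usually needs a
pseudocount or a filtering step before a transformation like this one.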

#### Auto Preprocessing

You can automatically apply standard preprocessing steps by specifying the experiment
type. This allows you to load tailored preprocessing steps for the type of data you are
working with, such as "otu", "asv", "snp", or "maldi":

```python
from biofit.preprocessing import AutoPreprocessor

preprocessor = AutoPreprocessor.for_experiment("snp", [{"min_prevalence": 0.1}, None])
print(preprocessor)
# [('min_prevalence_row', MinPrevalencFilter(min_prevalence=0.1)),
#  ('min_prevalence', MinPrevalenceFeatureSelector(min_prevalence=0.01))]

# Fit and transform the dataset using the preprocessor
preprocessed_data = preprocessor.fit_transform(dataset)
```
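
In the call above, the second argument appears to act as a list of per-step overrides:
the dictionary changes the first step and `None` leaves the second step at its default.
The sketch below mimics that merging behaviour in plain Python. `DEFAULT_STEPS` and
`resolve_steps` are hypothetical names for this illustration, and the default value of
0.01 is taken from the printed output in this README, not from Biofit's source:

```python
from copy import deepcopy

# Hypothetical defaults for one experiment type; step names mirror the printed output.
DEFAULT_STEPS = {
    "snp": [
        ("min_prevalence_row", {"min_prevalence": 0.01}),
        ("min_prevalence", {"min_prevalence": 0.01}),
    ],
}


def resolve_steps(experiment_type, overrides=None):
    """Merge positional overrides into the default steps.

    overrides[i] updates the parameters of step i; None keeps the defaults.
    """
    steps = deepcopy(DEFAULT_STEPS[experiment_type])
    overrides = overrides or [None] * len(steps)
    if len(overrides) != len(steps):
        raise ValueError("expected one override (or None) per step")
    return [
        (name, {**params, **(override or {})})
        for (name, params), override in zip(steps, overrides)
    ]


steps = resolve_steps("snp", [{"min_prevalence": 0.1}, None])
for name, params in steps:
    print(name, params)
# min_prevalence_row {'min_prevalence': 0.1}
# min_prevalence {'min_prevalence': 0.01}
```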

Biofit is made with [Biosets](https://github.com/psmyth94/biosets) in mind. Instead of
an experiment-type string, you can pass a dataset that was loaded with an
`experiment_type`, and the matching preprocessors are loaded for it:

```python
from biofit.preprocessing import AutoPreprocessor
from biosets import load_dataset

dataset = load_dataset("csv", data_files="my_file.csv", experiment_type="snp")

preprocessor = AutoPreprocessor.for_experiment(dataset)
print(preprocessor)
# [('min_prevalence_row', MinPrevalencFilter(min_prevalence=0.01)),
#  ('min_prevalence', MinPrevalenceFeatureSelector(min_prevalence=0.01))]
preprocessed_data = preprocessor.fit_transform(dataset)
```
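
Both steps printed above are parameterized by `min_prevalence`. As a rough illustration
of the idea, the sketch below keeps only the features that are present in at least a
given fraction of samples. It assumes "present" means a non-zero, non-missing value;
Biofit's own definition may differ, and `prevalence` and `filter_features` are
hypothetical helpers written for this example:

```python
def prevalence(values):
    """Fraction of entries that are present (non-zero and not None)."""
    present = sum(1 for v in values if v is not None and v != 0)
    return present / len(values)


def filter_features(table, min_prevalence):
    """Keep columns whose prevalence across samples meets the threshold."""
    return {
        name: column
        for name, column in table.items()
        if prevalence(column) >= min_prevalence
    }


# Toy genotype-like table: keys are features, lists hold one value per sample.
table = {
    "snp_a": [0, 1, 2, 0],  # present in 2 of 4 samples
    "snp_b": [0, 0, 0, 0],  # never present
    "snp_c": [1, 1, 0, 2],  # present in 3 of 4 samples
}

kept = filter_features(table, min_prevalence=0.5)
print(sorted(kept))
# ['snp_a', 'snp_c']
```

The row-wise variant would apply the same test to each sample across its features
instead of to each feature across samples.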

#### Custom Preprocessing Pipeline

Biofit allows you to create custom preprocessing pipelines using the
`PreprocessorPipeline` class. This allows chaining multiple preprocessing steps from
`sklearn` and Biofit in a single operation:

```python
from biofit import load_dataset
from biofit.preprocessing import LogTransformer, PreprocessorPipeline
from sklearn.preprocessing import StandardScaler

# Load the dataset
dataset = load_dataset("csv", data_files="my_file.csv")

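# Define a custom preprocessing pipeline. The log transform runs first because
# standardized values can be negative, where the logarithm is undefined.
pipeline = PreprocessorPipeline(
    [("log_transformer", LogTransformer()), ("scaler", StandardScaler())]
)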

# Fit and transform the dataset using the pipeline
preprocessed_data = pipeline.fit_transform(dataset.to_pandas())
```
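
To make the chaining concrete, the sketch below shows the scikit-learn-style convention
that such pipelines follow: each step is fitted on the output of the previous step and
then transforms it. This is a simplified stand-in using only the standard library, not
Biofit's `PreprocessorPipeline`; the classes `Log` and `Standardize` and the function
`fit_transform_all` are hypothetical. The log step comes before scaling here because
standardized values can be negative:

```python
import math


class Log:
    """Stateless step: natural log of every value."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [[math.log(v) for v in row] for row in rows]


class Standardize:
    """Stateful step: learns per-column mean and standard deviation in fit."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [sum(c) / len(c) for c in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [
            math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, self.means)
        ]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


def fit_transform_all(steps, rows):
    """Fit each step on the output of the previous one, then transform."""
    for _name, step in steps:
        rows = step.fit(rows).transform(rows)
    return rows


rows = [[1.0, 10.0], [2.0, 100.0], [4.0, 1000.0]]
result = fit_transform_all([("log", Log()), ("scale", Standardize())], rows)

# After standardization, every column mean should be close to zero.
print([abs(sum(col) / len(col)) < 1e-9 for col in zip(*result)])
```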

For further details, see the [advanced usage documentation](./docs/PREPROCESSING.md).

## License

Biofit is licensed under the Apache 2.0 License. See the [LICENSE](./LICENSE) file for
more information.

## Contributing

If you would like to contribute to Biofit, please read the
[CONTRIBUTING](./CONTRIBUTING.md) guidelines.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/psmyth94/biofit",
    "name": "biofit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12.0,>=3.8.0",
    "maintainer_email": null,
    "keywords": "omics machine learning bioinformatics metrics",
    "author": "Patrick Smyth",
    "author_email": "psmyth1994@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/f4/d2/0077210f703999578e2a5827e69a8db4f197bfdebfa5c2d0523d87d04d29/biofit-0.0.1.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\n    $${\\Huge{\\textbf{\\textsf{\\color{#2E8B57}Bio\\color{red}fit}}}}$$\n    <br/>\n    <br/>\n</p>\n<p align=\"center\">\n    <a href=\"https://github.com/psmyth94/biofit/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain\"><img alt=\"Build\" src=\"https://github.com/psmyth94/biofit/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main\"></a>\n    <a href=\"https://github.com/psmyth94/biofit/blob/main/LICENSE\"><img alt=\"GitHub\" src=\"https://img.shields.io/github/license/psmyth94/biofit.svg?color=blue\"></a>\n    <a href=\"https://github.com/psmyth94/biofit/tree/main/docs\"><img alt=\"Documentation\" src=\"https://img.shields.io/website/http/github/psmyth94/biofit/tree/main/docs.svg?down_color=red&down_message=offline&up_message=online\"></a>\n    <a href=\"https://github.com/psmyth94/biofit/releases\"><img alt=\"GitHub release\" src=\"https://img.shields.io/github/release/psmyth94/biofit.svg\"></a>\n    <a href=\"CODE_OF_CONDUCT.md\"><img alt=\"Contributor Covenant\" src=\"https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg\"></a>\n    <!-- <a href=\"https://zenodo.org/records/14028772\"><img src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg\" alt=\"DOI\"></a> -->\n</p>\n\n**Biofit** is a machine learning library designed for bioinformatics datasets. It\nprovides tools for transforming, extracting, training, and evaluating machine learning\nmodels on biomedical data. It also provides automatic data preprocessing, visualization,\nand configurable processing pipelines. 
Here are some of the main features of Biofit:\n\n- **Automatic Data Preprocessing:** Automatically preprocess biomedical datasets using\n  built-in preprocessing steps.\n- **Automatic Visualization:** Automatically visualize data using built-in visualization\n  methods geared towards biomedical data.\n- **Configurable Processing Pipelines:** Define and customize data processing pipelines.\n- **Data Handling Flexibility:** Support for a wide range of data formats, including:\n  - [Pandas](https://github.com/pandas-dev/pandas)\n  - [Polars](https://github.com/pola-rs/polars)\n  - [NumPy](https://github.com/numpy/numpy)\n  - [CSR (SciPy)](https://github.com/scipy/scipy)\n  - [Arrow](https://github.com/apache/arrow)\n  - \ud83e\udd17 [Datasets](https://github.com/huggingface/datasets)\n  - [Biosets](https://github.com/psmyth94/biosets)\n- **Machine Learning Models:** Supports a wide range of machine learning models, including:\n  - [Scikit-learn](https://github.com/scikit-learn/scikit-learn)\n    - [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)\n    - [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n    - [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n    - [Lasso Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)\n  - [LightGBM](https://github.com/microsoft/LightGBM)\n  - More to come!\n- **Caching and Reuse:** Caches intermediate results using Apache Arrow for efficient reuse.\n- **Batch Processing and Multiprocessing:** Utilize batch processing and multiprocessing for efficient handling of large-scale data.\n\n## Installation\n\nYou can install Biofit via pip:\n\n```bash\npip install biofit\n```\n\n## Quick Start\n\n### Preprocessing Data\n\nBiofit provides preprocessing capabilities tailored for omics data. 
You can use\nbuilt-in classes to load preprocessing steps based on the experiment type or create\ncustom preprocessing pipelines. The preprocessing pipeline in Biofit uses a syntax\nsimilar to sklearn and supports distributed processing.\n\n#### Using a Preprocessor\n\nBiofit allows you to fit and transform your data in a few lines, similar to sklearn.\nFor example, you can use the LogTransformer to apply a log transformation to your data:\n\n```python\nfrom biofit.preprocessing import LogTransformer\nimport pandas as pd\n\ndataset = pd.DataFrame({\"feature1\": [1, 2, 3, 4, 5]})\nlog_transformer = LogTransformer()\npreprocessed_data = log_transformer.fit_transform(dataset)\n# Applying log transformation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 5/5 [00:00<00:00, 7656.63 examples/s]\nprint(preprocessed_data)\n#    feature1\n# 0  0.000000\n# 1  0.693147\n# 2  1.098612\n# 3  1.386294\n# 4  1.609438\n```\n\n#### Auto Preprocessing\n\nYou can automatically apply standard preprocessing steps by specifying the experiment\ntype. This allows you to load tailored preprocessing steps for the type of data you are\nworking with, such as \"otu\", \"asv\", \"snp\", or \"maldi\":\n\n```python\nfrom biofit.preprocessing import AutoPreprocessor\n\npreprocessor = AutoPreprocessor.for_experiment(\"snp\", [{\"min_prevalence\": 0.1}, None])\nprint(preprocessor)\n# [('min_prevalence_row', MinPrevalencFilter(min_prevalence=0.1)),\n#  ('min_prevalence', MinPrevalenceFeatureSelector(min_prevalence=0.01))]\n\n# Fit and transform the dataset using the preprocessor\npreprocessed_data = preprocessor.fit_transform(dataset)\n```\n\nBiofit is made with [Biosets](https://github.com/psmyth94/biosets) in mind. 
You can\npass the loaded dataset instead of a string to load the preprocessors:\n\n```python\nfrom biosets import load_dataset\n\ndataset = load_dataset(\"csv\", data_files=\"my_file.csv\", experiment_type=\"snp\")\n\npreprocessor = AutoPreprocessor.for_experiment(dataset)\nprint(preprocessor)\n# [('min_prevalence_row', MinPrevalencFilter(min_prevalence=0.01)),\n#  ('min_prevalence', MinPrevalenceFeatureSelector(min_prevalence=0.01))]\npreprocessed_data = preprocessor.fit_transform(dataset)\n```\n\n#### Custom Preprocessing Pipeline\n\nBiofit allows you to create custom preprocessing pipelines using the\n`PreprocessorPipeline` class. This allows chaining multiple preprocessing steps from\n`sklearn` and Biofit in a single operation:\n\n```python\nfrom biofit import load_dataset\nfrom biofit.preprocessing import LogTransformer, PreprocessorPipeline\nfrom sklearn.preprocessing import StandardScaler\n\n# Load the dataset\ndataset = load_dataset(\"csv\", data_files=\"my_file.csv\")\n\n# Define a custom preprocessing pipeline\npipeline = PreprocessorPipeline(\n    [(\"scaler\", StandardScaler()), (\"log_transformer\", LogTransformer())]\n)\n\n# Fit and transform the dataset using the pipeline\npreprocessed_data = pipeline.fit_transform(dataset.to_pandas())\n```\n\nFor further details, check the [advance usage documentation](./docs/PREPROCESSING.md).\n\n# License\n\nBiofit is licensed under the Apache 2.0 License. See the [LICENSE](./LICENSE) file for\nmore information.\n\n# Contributing\n\nIf you would like to contribute to Biofit, please read the\n[CONTRIBUTING](./CONTRIBUTING.md) guidelines.\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "BioFit: Bioinformatics Machine Learning Framework",
    "version": "0.0.1",
    "project_urls": {
        "Download": "https://github.com/psmyth94/biofit/tags",
        "Homepage": "https://github.com/psmyth94/biofit"
    },
    "split_keywords": [
        "omics",
        "machine",
        "learning",
        "bioinformatics",
        "metrics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "609d2b81cc24c6c2cc814ba11de571c27dd084e662ceab828d51eed1956f5495",
                "md5": "dda36965d22fa26a1a1418852796a362",
                "sha256": "afadb1674f28d8aa7979a7ab043418b9a40c6571e45c19b26a7cfcac8fb64066"
            },
            "downloads": -1,
            "filename": "biofit-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dda36965d22fa26a1a1418852796a362",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12.0,>=3.8.0",
            "size": 268509,
            "upload_time": "2024-11-18T19:47:26",
            "upload_time_iso_8601": "2024-11-18T19:47:26.190846Z",
            "url": "https://files.pythonhosted.org/packages/60/9d/2b81cc24c6c2cc814ba11de571c27dd084e662ceab828d51eed1956f5495/biofit-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f4d20077210f703999578e2a5827e69a8db4f197bfdebfa5c2d0523d87d04d29",
                "md5": "49f37ce45db98dde9cf53e4bc1c4bb02",
                "sha256": "8578e289df6773d1a3ef60a262bc1fa957a14b02d0ed47355c07aa30f1e8fe1e"
            },
            "downloads": -1,
            "filename": "biofit-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "49f37ce45db98dde9cf53e4bc1c4bb02",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12.0,>=3.8.0",
            "size": 208008,
            "upload_time": "2024-11-18T19:47:28",
            "upload_time_iso_8601": "2024-11-18T19:47:28.559321Z",
            "url": "https://files.pythonhosted.org/packages/f4/d2/0077210f703999578e2a5827e69a8db4f197bfdebfa5c2d0523d87d04d29/biofit-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-18 19:47:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "psmyth94",
    "github_project": "biofit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "biofit"
}
