rfmix-reader


- Name: rfmix-reader
- Version: 0.1.21
- Home page: https://rfmix-reader.readthedocs.io/en/latest/
- Summary: RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.
- Upload time: 2025-01-07 20:50:53
- Maintainer: Kynon JM Benjamin
- Docs URL: None
- Author: Kynon JM Benjamin
- Requires Python: `<4.0,>=3.10`
- License: GPL-3.0
- Keywords: file parser, rfmix, gpu acceleration, local ancestry
- Requirements: No requirements were recorded.
- Travis-CI: No Travis.
- Coveralls test coverage: No coveralls.

# RFMix-reader
`RFMix-reader` is a Python package designed to efficiently read and process output
files generated by [`RFMix`](https://github.com/slowkoni/rfmix), a popular tool 
for estimating local ancestry in admixed populations. The package employs a lazy
loading approach, which minimizes memory consumption by reading only the loci that
are accessed by the user, rather than loading the entire dataset into memory at 
once. Additionally, we leverage GPU acceleration to improve computational speed.

## Install
`rfmix-reader` can be installed using [pip](https://pypi.python.org/pypi/pip):

```bash
pip install rfmix-reader
```

**GPU Acceleration:**
`rfmix-reader` leverages GPU acceleration for improved performance. To use this
functionality, you will need to install the following libraries for your specific
CUDA version:
- `RAPIDS`: Refer to the official installation guide [here](https://docs.rapids.ai/install)
- `PyTorch`: Installation instructions can be found [here](https://pytorch.org/)

**Additional Notes:**
- We have not tested installation with `Docker` or `Conda` environments. Compatibility
  may vary.
- If you do not have a GPU, you can still use the basic functionality of `rfmix-reader`.
  This is still much faster than processing the files with standard scripting.


## Key Features
**Lazy Loading**
- Reads data on-the-fly as requested, reducing memory footprint.
- Ideal for working with large RFMix output files that may not fit entirely in memory.

**Efficient Data Access**
- Provides convenient access to specific loci or regions of interest.
- Allows for selective loading of data, enabling faster processing times.

**Seamless Integration**
- Designed to work seamlessly with existing Python data analysis workflows.
- Facilitates downstream analysis and manipulation of `RFMix` output data.

**Loci Imputation**
- Imputes local ancestry loci onto the larger set of genomic positions found in genotype data.
- Returns array-based data for easy integration with downstream analysis.

Whether you are working with large-scale genomic datasets or have limited
computational resources, `RFMix-reader` offers an efficient and memory-conscious
solution for reading and processing `RFMix` output files. Its lazy loading approach
ensures optimal resource utilization, making it a valuable tool for researchers
and bioinformaticians working with admixed population data.
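
As an illustration of the lazy loading idea described above, here is a minimal sketch that reads only the requested row from a binary file instead of loading the whole file. It is not the package's implementation: the class `LazyRowReader`, the helper `write_demo_file`, and the float32 row-major file layout are all assumptions made for this example.

```python
import os
import struct
import tempfile


class LazyRowReader:
    """Read fixed-width rows of float32 values from a binary file on demand.

    Hypothetical helper for illustration; not part of rfmix-reader.
    """

    def __init__(self, path, n_cols):
        self.path = path
        self.n_cols = n_cols
        self.row_bytes = 4 * n_cols  # float32 uses 4 bytes per value
        self.n_rows = os.path.getsize(path) // self.row_bytes

    def __len__(self):
        return self.n_rows

    def __getitem__(self, row):
        if not 0 <= row < self.n_rows:
            raise IndexError(row)
        with open(self.path, "rb") as handle:
            handle.seek(row * self.row_bytes)  # jump straight to the row
            raw = handle.read(self.row_bytes)  # read only that row
        return struct.unpack(f"<{self.n_cols}f", raw)


def write_demo_file(path, rows):
    """Write a small row-major float32 matrix to disk (demo data only)."""
    with open(path, "wb") as handle:
        for row in rows:
            handle.write(struct.pack(f"<{len(row)}f", *row))


# Three loci (rows) by two ancestries (columns); values are made up
demo_rows = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_demo_file(demo_path, demo_rows)

reader = LazyRowReader(demo_path, n_cols=2)
print(len(reader))  # number of rows, derived from the file size alone
print(reader[1])    # only the bytes of row 1 are read from disk
```

Memory use here depends on the size of the rows requested, not on the size of the file, which is the property the lazy loading approach relies on.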

## Simulation Data
Simulation data for testing two- and three-population admixture is available on
Synapse: [syn61691659](https://www.synapse.org/Synapse:syn61691659).

## Usage
This works similarly to `pandas-plink`.

### Two Population Admixture Example
This is a two-part process.

#### Generate Binary Files
To reduce computational time and memory, we leverage binary files.
While `RFMix` does not generate these directly, we provide a function
for their creation: `create_binaries`. This function can also be invoked 
via the command line:
`create-binaries [-h] [--version] [--binary_dir BINARY_DIR] file_prefix`.

```python
from rfmix_reader import create_binaries

# Generate binary files
file_path = "examples/two_popuations/out/"
binary_dir = "./binary_files"
create_binaries(file_path, binary_dir=binary_dir)
```

You can also do this on the fly.

```python
from rfmix_reader import read_rfmix

file_path = "examples/two_popuations/out/"
binary_dir = "./binary_files"
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
                               generate_binary=True)
```

This option is off by default because binary generation is the
rate-limiting step. It can take 20 to 25 minutes or more,
depending on the size of the `*fb.tsv` file.
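
Because generation is slow, one option is to run it only when the binary directory is missing or empty. The sketch below shows that pattern under stated assumptions: `ensure_binaries` and `fake_generate` are hypothetical names that are not part of `rfmix-reader`, and an empty-directory check is only a rough proxy for "binaries exist", since the binary file naming is not described here.

```python
from pathlib import Path
import tempfile


def ensure_binaries(file_prefix, binary_dir, generate):
    """Run `generate` only when `binary_dir` is missing or empty.

    `generate` is any callable invoked as generate(file_prefix, binary_dir=...),
    for example rfmix_reader.create_binaries.
    Returns True if generation ran, False if existing files were reused.
    """
    target = Path(binary_dir)
    if target.is_dir() and any(target.iterdir()):
        return False  # files already present; skip the slow step
    target.mkdir(parents=True, exist_ok=True)
    generate(file_prefix, binary_dir=str(target))
    return True


def fake_generate(file_prefix, binary_dir):
    """Stand-in generator so the sketch runs without RFMix output."""
    (Path(binary_dir) / "placeholder.bin").write_bytes(b"\x00")


work_dir = Path(tempfile.mkdtemp()) / "binary_files"
first_run = ensure_binaries("examples/out/", work_dir, fake_generate)
second_run = ensure_binaries("examples/out/", work_dir, fake_generate)
print(first_run, second_run)  # generation runs once, then is skipped
```

In real use, `create_binaries` would be passed in place of `fake_generate`.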

#### Main Function
Once binary files are generated, you can use the main function
to process the RFMix results. With a GPU, this takes less than
5 minutes.

```python
from rfmix_reader import read_rfmix

file_path = "examples/two_popuations/out/"
loci, rf_q, admix = read_rfmix(file_path)
```
**Note:** `./binary_files` is the default for `binary_dir`,
so this is an optional parameter.

### Three Population Admixture Example
`RFMix-reader` handles any number of admixed populations; the call is
the same as in the two-population case.

```python
from rfmix_reader import read_rfmix

file_path = "examples/three_popuations/out/"
binary_dir = "./binary_files"
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
                               generate_binary=True)
```

### Loci Imputation
Imputing local ancestry loci information to genotype variant locations improves
integration of the local ancestry information with genotype data. As such, we provide
the `interpolate_array` function to efficiently interpolate missing values when local
ancestry loci information is mapped onto the denser set of genotype variant locations.
It uses [`Zarr`](https://zarr.readthedocs.io/en/stable/index.html)
arrays, which makes it suitable for large datasets while keeping memory
usage under control.

#### Features
- **CUDA Acceleration**: Uses CUDA for performance enhancement when available; 
  otherwise, it defaults to `NumPy`.
- **Chunk Processing**: Processes data in manageable chunks to optimize memory usage,
  making it ideal for large datasets.
- **Progress Monitoring**: Displays progress through a `tqdm` progress bar, providing
  real-time feedback during execution.
- **Column-wise Interpolation**: Employs the `_interpolate_col` function to perform 
  interpolation along each column of the dataset.
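
To make the column-wise and chunked processing concrete, here is a pure-Python sketch. It assumes linear interpolation between the flanking known values and chunking over columns; the package's `_interpolate_col` and its chunking strategy may differ, and the function names below are hypothetical.

```python
def interpolate_col(values):
    """Linearly fill None entries between known neighbours.

    Leading and trailing gaps copy the nearest known value.
    (Assumed behaviour for illustration only.)
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("column has no known values")
    out = list(values)
    # Edges: extend the first and last known value outward
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    # Interior gaps: straight line between the flanking known values
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def interpolate_columns(matrix, chunk_size=2):
    """Interpolate each column of a row-major matrix, chunk_size columns at a time."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for start in range(0, n_cols, chunk_size):
        for col in range(start, min(start + chunk_size, n_cols)):
            filled_col = interpolate_col([row[col] for row in matrix])
            for r in range(n_rows):
                result[r][col] = filled_col[r]
    return result


# Rows are variant positions; None marks positions without a local ancestry call
expanded = [
    [0.0, 2.0, 3.0],
    [None, None, None],
    [None, None, None],
    [3.0, 2.0, 0.0],
]
filled = interpolate_columns(expanded)
for row in filled:
    print(row)
```

Each column is filled independently, so only one chunk of columns needs to be held in working memory at a time.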

#### Example Usage
```python
import numpy as np
import pandas as pd
import dask.array as da
from rfmix_reader import interpolate_array

# Outer merged dataframe of loci and variant locations
# "i" is from the loci information; "chrom" and "pos" from both dataframes
# NaN marks variant positions without a matching local ancestry locus
variant_loci_df = pd.DataFrame({'chrom': ['1', '1', '1', '1'],
                                'pos': [100, 200, 300, 400],
                                'i': [1, np.nan, np.nan, 2]})

# Dask array of admixture data, which will have fewer rows than variant_loci_df
admix = da.random.random((2, 3)) # Random data here

# This expands the Dask array (admix) and interpolates missing data
# Default chunk_size = 50000 assuming variant_loci_df 6-9M rows. 
# Adjust this based on variant_loci_df size.
z = interpolate_array(variant_loci_df, admix, '/path/to/output', chunk_size=1)

# Check the shape of the resulting Zarr array, which should have the same
# row numbers as variant_loci_df
print(z.shape)  # Output: (4, 3)
```

#### Example Preprocessing Functions
The helper functions `_load_genotypes` and `_load_admix` are designed to facilitate
the loading of loci and genotype data for constructing the `variant_loci_df`
DataFrame.

1. **`_load_genotypes(plink_prefix_path)`**: This function uses the `tensorqtl` 
   library to read genotype data from PLINK files (`PGEN`). It returns both the 
   loaded genotype data and a DataFrame containing variant information, which 
   includes chromosome and position details. The chromosome identifiers are 
   formatted to include the "chr" prefix for consistency.
2. **`_load_admix(prefix_path, binary_dir)`**: This function employs the 
   `rfmix_reader` library to load local ancestry data from specified paths. It 
   reads the ancestry information into a suitable format for further processing, 
   enabling integration with genotype data.

These functions ensure accurate loading and formatting of variant and local ancestry 
data, streamlining subsequent analyses.

```python
def _load_genotypes(plink_prefix_path):
    from tensorqtl import pgen
    pgr = pgen.PgenReader(plink_prefix_path)
    variant_df = pgr.variant_df
    variant_df.loc[:, "chrom"] = "chr" + variant_df.chrom
    return pgr.load_genotypes(), variant_df

def _load_admix(prefix_path, binary_dir):
    from rfmix_reader import read_rfmix
    return read_rfmix(prefix_path, binary_dir=binary_dir)

def __testing__():
    from rfmix_reader import interpolate_array
    basename = "/projects/b1213/large_projects/brain_coloc_app/input"
    # Local ancestry
    prefix_path = f"{basename}/local_ancestry_rfmix/_m/"
    binary_dir = f"{basename}/local_ancestry_rfmix/_m/binary_files/"
    loci, _, admix = _load_admix(prefix_path, binary_dir)
    loci.rename(columns={"chromosome": "chrom",
                         "physical_position": "pos"},
                inplace=True)
    # Variant data
    plink_prefix = f"{basename}/genotypes/TOPMed_LIBD"
    _, variant_df = _load_genotypes(plink_prefix)
    variant_df = variant_df.drop_duplicates(subset=["chrom", "pos"],
                                            keep='first')
    # Keep all locations for more accurate imputation
    variant_loci_df = variant_df.merge(loci.to_pandas(), on=["chrom", "pos"],
                                       how="outer", indicator=True)\
                                .loc[:, ["chrom", "pos", "i", "_merge"]]
    data_path = f"{basename}/local_ancestry_rfmix/_m"
    z = interpolate_array(variant_loci_df, admix, data_path)
    # Match variant data genomic positions: drop rows found only in the loci table
    keep = variant_loci_df["_merge"] != "right_only"
    # Row indices as a host NumPy array for indexing the Zarr array
    arr_geno = variant_loci_df[keep].index.to_numpy()
    new_admix = z[arr_geno, :]
```

**Note**: Following imputation, `variant_loci_df` will include genomic positions for
both local ancestry and genotype data.
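
The outer merge and the final row selection in `__testing__` can be pictured with plain Python data structures. This is only a conceptual sketch: `outer_merge` is a hypothetical function, the positions are made up, and the real workflow uses `pandas.DataFrame.merge` with `how="outer"` and `indicator=True`.

```python
def outer_merge(variant_positions, loci_index):
    """Outer-join variant positions with local ancestry loci on (chrom, pos).

    variant_positions: iterable of (chrom, pos) tuples from the genotype data.
    loci_index: dict mapping (chrom, pos) to the row index "i" of the ancestry array.
    Returns rows of (chrom, pos, i, merge_flag), sorted by key, where
    merge_flag mirrors the values of pandas' `_merge` indicator column.
    """
    variants = set(variant_positions)
    rows = []
    for key in sorted(variants | set(loci_index)):
        if key in variants and key in loci_index:
            flag = "both"
        elif key in variants:
            flag = "left_only"   # genotype position with no ancestry locus
        else:
            flag = "right_only"  # ancestry locus absent from the genotype data
        rows.append((*key, loci_index.get(key), flag))
    return rows


variants = [("chr1", 100), ("chr1", 200), ("chr1", 300)]
loci = {("chr1", 100): 0, ("chr1", 250): 1}

merged = outer_merge(variants, loci)
# Row numbers to pull from the imputed array: everything except loci-only rows
keep = [n for n, row in enumerate(merged) if row[3] != "right_only"]
for row in merged:
    print(row)
print(keep)
```

Rows flagged `left_only` have no `i` value and are the ones that imputation fills in; rows flagged `right_only` inform the interpolation but are dropped when matching back to the genotype positions.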

## Author(s)
* [Kynon JM Benjamin](https://github.com/Krotosbenjamin)

## Citation
If you use this software in your work, please cite it.
[![DOI](https://zenodo.org/badge/807052842.svg)](https://zenodo.org/doi/10.5281/zenodo.12629787)

Benjamin, K. J. M. (2024). RFMix-reader (Version v0.1.15) [Computer software]. 
https://github.com/heart-gen/rfmix_reader

Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for
local ancestry studies." *bioRxiv*. 2024.
DOI: [10.1101/2024.07.13.603370](https://www.biorxiv.org/content/10.1101/2024.07.13.603370v2).

## Funding
This work was supported by grants from the National Institutes of Health,
National Institute on Minority Health and Health Disparities (NIMHD) 
K99MD016964 / R00MD016964.

            
