| Field | Value |
|---|---|
| Name | optimask |
| Version | 1.3.3 |
| home_page | https://optimask.readthedocs.io |
| Summary | OptiMask: extracting the largest (non-contiguous) submatrix without NaN |
| upload_time | 2024-10-21 08:11:57 |
| maintainer | None |
| docs_url | None |
| author | Cyril Joly |
| requires_python | None |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# <img src="https://raw.githubusercontent.com/CyrilJl/OptiMask/main/docs/source/_static/icon.svg" alt="Logo OptiMask" width="200" height="200" align="right"> OptiMask: Efficient NaN Data Removal in Python
`OptiMask` is a Python package for handling NaN values in matrices: it computes the largest (non-contiguous) submatrix that contains no NaN. OptiMask uses a heuristic method and relies solely on NumPy for speed and efficiency. In machine learning applications, it keeps more valid data available for model fitting than traditional approaches such as pandas `dropna`, which by default drops every row that contains a NaN. OptiMask instead selects which columns (features) and rows (samples) to keep and which to remove, so that the largest NaN-free submatrix it finds can be used for training models.
The problem differs from computing the largest rectangle of 1s in a binary matrix (which can be tackled with dynamic programming), because the retained rows and columns need not be adjacent, and it requires a different approach.
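To make the problem statement concrete, here is a minimal brute-force sketch that enumerates every column subset of a tiny matrix. It is not OptiMask's heuristic and only illustrates what "largest non-contiguous submatrix without NaN" means; the function name `largest_nan_free_submatrix` is hypothetical, and the exhaustive search is only practical for a handful of columns.

```python
from itertools import combinations

import numpy as np


def largest_nan_free_submatrix(x):
    """Exhaustive search over column subsets (illustration only, exponential cost)."""
    nan_mask = np.isnan(x)
    n_cols = x.shape[1]
    best_rows = np.array([], dtype=int)
    best_cols = np.array([], dtype=int)
    for k in range(1, n_cols + 1):
        for subset in combinations(range(n_cols), k):
            cols = np.array(subset)
            # Every row with no NaN in the chosen columns can be kept.
            rows = np.flatnonzero(~nan_mask[:, cols].any(axis=1))
            if len(rows) * len(cols) > len(best_rows) * len(best_cols):
                best_rows, best_cols = rows, cols
    return best_rows, best_cols


x = np.array([
    [1.0, np.nan, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 1.0, np.nan, 2.0],
    [3.0, 4.0, 5.0, 6.0],
])
rows, cols = largest_nan_free_submatrix(x)
print(rows, cols, len(rows) * len(cols))
```

On this 4x4 matrix the best selection keeps nine cells using columns that are not adjacent, whereas dropping every row that contains a NaN would keep only eight.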
## Key Features
- **Largest Submatrix without NaN:** OptiMask computes the largest submatrix without NaN that its heuristic can find, so more of the valid data remains available for analysis.
- **Efficient Computation:** OptiMask's optimized computation returns results quickly, even on large matrices.
- **NumPy and Pandas Compatibility:** OptiMask works with both NumPy and pandas data structures.
## Installation
Install the `optimask` package with pip:
```bash
pip install optimask
```
OptiMask is also available on the conda-forge channel, through either conda or mamba:
```bash
conda install -c conda-forge optimask
```
```bash
mamba install optimask
```
## Usage Example
Import the `OptiMask` class from the `optimask` package and call its `solve` method to get the rows and columns to keep:
```python
from optimask import OptiMask
import numpy as np
# Create a matrix with NaN values
m = 120
n = 7
data = np.zeros(shape=(m, n))
data[24:72, 3] = np.nan
data[95, :5] = np.nan
# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)
# Calculate the fraction of the original cells kept in the selected submatrix
coverage_ratio = len(rows) * len(cols) / data.size
# Check if there are any NaN values in the selected submatrix
has_nan_values = np.isnan(data[rows][:, cols]).any()
# Print or display the results
print(f"Coverage Ratio: {coverage_ratio:.2f}, Has NaN Values: {has_nan_values}")
# Output: Coverage Ratio: 0.85, Has NaN Values: False
```
The grey cells represent the NaN locations, the blue ones represent the valid data, and the red ones represent the rows and columns removed by the algorithm:
<img src="https://github.com/CyrilJl/OptiMask/blob/main/docs/source/_static/example0.png?raw=true" width="400">
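For comparison, the sketch below applies plain pandas `dropna` to the same NaN pattern as the usage example. It only uses pandas and NumPy, not OptiMask, and is meant to show how much data each naive strategy keeps relative to the 0.85 coverage reported above.

```python
import numpy as np
import pandas as pd

# Same NaN pattern as in the usage example.
data = np.zeros(shape=(120, 7))
data[24:72, 3] = np.nan
data[95, :5] = np.nan
df = pd.DataFrame(data)

rows_only = df.dropna(axis=0)  # drop every row that contains a NaN
cols_only = df.dropna(axis=1)  # drop every column that contains a NaN

print(f"dropna(axis=0): shape {rows_only.shape}, coverage {rows_only.size / df.size:.2f}")
print(f"dropna(axis=1): shape {cols_only.shape}, coverage {cols_only.size / df.size:.2f}")
```

Dropping rows alone keeps 71 of the 120 rows, and dropping columns alone keeps 2 of the 7 columns, so both retain noticeably less data than removing one row and one column together.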
OptiMask’s algorithm is useful for handling unstructured NaN patterns, as shown in the following example:
<img src="https://github.com/CyrilJl/OptiMask/blob/main/docs/source/_static/example2.png?raw=true" width="400">
## Performance
`OptiMask` efficiently handles large matrices, delivering results within reasonable computation times:
```python
from optimask import OptiMask
import numpy as np

def generate_random(m, n, ratio):
    """Missing at random arrays"""
    arr = np.zeros((m, n))
    nan_count = int(ratio * m * n)
    indices = np.random.choice(m * n, nan_count, replace=False)
    arr.flat[indices] = np.nan
    return arr

x = generate_random(m=100_000, n=1_000, ratio=0.02)
# %time is an IPython/Jupyter magic: run this line in a notebook or an IPython shell
%time rows, cols = OptiMask(verbose=True).solve(x)
# Output:
#   Trial 1 : submatrix of size 37094x49 (1817606 elements) found.
#   Trial 2 : submatrix of size 35667x51 (1819017 elements) found.
#   Trial 3 : submatrix of size 37908x48 (1819584 elements) found.
#   Trial 4 : submatrix of size 37047x49 (1815303 elements) found.
#   Trial 5 : submatrix of size 37895x48 (1818960 elements) found.
# Result: the largest submatrix found is of size 37908x48 (1819584 elements) found.
# CPU times: total: 172 ms
# Wall time: 435 ms
```
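A quick back-of-the-envelope check helps interpret the sizes reported above. The sketch below assumes NaN cells are independent with probability `ratio` (an approximation of the generator, which samples without replacement) and that the kept columns are chosen arbitrarily; the helper name `expected_nan_free_rows` is hypothetical.

```python
def expected_nan_free_rows(m, ratio, k):
    """Expected number of rows with no NaN in k given columns, assuming independent NaN cells."""
    return m * (1 - ratio) ** k


m, ratio = 100_000, 0.02
for k in (48, 1_000):
    n_rows = expected_nan_free_rows(m, ratio, k)
    print(f"{k} columns: about {n_rows:.4g} NaN-free rows, {n_rows * k:.4g} cells")
```

Under these assumptions, keeping 48 columns leaves about 37,900 NaN-free rows, which is in line with the 37908x48 result above, while keeping all 1,000 columns leaves essentially no complete row, so a row-wise `dropna` would discard nearly everything.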
## Documentation
For detailed documentation, including installation instructions, API usage, and examples, visit [OptiMask Documentation](https://optimask.readthedocs.io/en/latest/index.html).
## Repository Link
Find out more about OptiMask on [GitHub](https://github.com/CyrilJl/OptiMask).
## Citation
If you use OptiMask in your research or work, please cite it:
```bibtex
@software{optimask2024,
author = {Cyril Joly},
title = {OptiMask: NaN Removal and Largest Submatrix Computation},
year = {2024},
url = {https://github.com/CyrilJl/OptiMask},
}
```
Or:
```OptiMask (2024). NaN Removal and Largest Submatrix Computation. Developed by Cyril Joly: https://github.com/CyrilJl/OptiMask```
Raw data
{
"_id": null,
"home_page": "https://optimask.readthedocs.io",
"name": "optimask",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Cyril Joly",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/71/7a/694b2ba87a05ccb020337f317f96b87453cffc9cd00ff66eaa99ae78106e/optimask-1.3.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "OptiMask: extracting the largest (non-contiguous) submatrix without NaN",
"version": "1.3.3",
"project_urls": {
"Homepage": "https://optimask.readthedocs.io"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "717a694b2ba87a05ccb020337f317f96b87453cffc9cd00ff66eaa99ae78106e",
"md5": "b40c73f9daf4efecb832108d1f44e470",
"sha256": "b4c9cf5a78b095500966c20f8b5f2035402dfb1463b555a09d11425e91acf111"
},
"downloads": -1,
"filename": "optimask-1.3.3.tar.gz",
"has_sig": false,
"md5_digest": "b40c73f9daf4efecb832108d1f44e470",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 8302,
"upload_time": "2024-10-21T08:11:57",
"upload_time_iso_8601": "2024-10-21T08:11:57.783465Z",
"url": "https://files.pythonhosted.org/packages/71/7a/694b2ba87a05ccb020337f317f96b87453cffc9cd00ff66eaa99ae78106e/optimask-1.3.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-21 08:11:57",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "optimask"
}