MatFold

Name	MatFold
Version	1.2.3
home_page	None
Summary	Package for systematic insights into materials discovery models’ performance through standardized chemical cross-validation protocols
upload_time	2024-11-26 05:29:24
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT License Copyright (c) 2024 Peter Schindler Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	machine learning materials science cross-validation generalization error
requirements	No requirements were recorded.

<div align="center">
  <img alt="MatFold Logo" src="logo.svg" width="200"><br>
</div>

# `MatFold` – Cross-validation Protocols for Materials Science Data 

![Python - Version](https://img.shields.io/pypi/pyversions/MatFold)
[![PyPI - Version](https://img.shields.io/pypi/v/MatFold?color=blue)](https://pypi.org/project/MatFold)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13147391.svg)](https://doi.org/10.5281/zenodo.13147391)

This is a Python package for gaining systematic insights into materials discovery models’ 
performance through standardized, reproducible, and featurization-agnostic chemical and structural cross-validation protocols.

Please cite the following paper if you use `MatFold` in your research:
> Matthew D. Witman and Peter Schindler, *MatFold: systematic insights into materials discovery models’ performance 
> through standardized cross-validation protocols*, ChemRxiv (2024) [10.26434/chemrxiv-2024-bmw1n](https://doi.org/10.26434/chemrxiv-2024-bmw1n)

## Installation

`MatFold` can be installed using pip by running `pip install MatFold`.
Alternatively, this repository can be downloaded/cloned and then installed by running `pip install .` inside the main folder.

## Usage

### Data Preparation and Loading

To use `MatFold`, the user has to provide their materials data for initialization as 
a Pandas dataframe `df` and a dictionary `bulk_dict`.

The first column of the dataframe `df` has to contain strings of the form `<structureid>` or
`<structureid>:<structuretag>` (where `<structureid>` refers to a bulk ID and `<structuretag>` 
refers to an identifier of a derivative structure). All other columns are optional and are 
retained during the splitting process by default.

The dictionary `bulk_dict` has to contain `<structureid>` as keys and the corresponding bulk pymatgen
dictionary as values. This dictionary can also be directly created from cif files 
using the convenience function `cifs_to_dict`. The user should ensure that all bulk structures that 
are referred to in the `df` labels are provided in `bulk_dict` (and each string 
specifying `structureid` should match).
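
A quick consistency check before initialization can catch mismatched identifiers. The following is a minimal sketch and not part of `MatFold`: the helper name `missing_structure_ids` is hypothetical, and it assumes that `:` separates `<structureid>` from `<structuretag>` as described above.

```python
def missing_structure_ids(labels, bulk_dict):
    """Return the sorted structure IDs referenced by `labels` that are absent from `bulk_dict`."""
    # Everything before the first ':' is the <structureid>;
    # a label without ':' is the structure ID itself.
    structure_ids = {str(label).split(":", 1)[0] for label in labels}
    return sorted(structure_ids - set(bulk_dict))


# Toy data: the empty dictionaries stand in for pymatgen structure dictionaries.
labels = ["mp-1", "mp-1:slab_100", "mp-2:slab_111", "mp-3"]
bulk_dict = {"mp-1": {}, "mp-2": {}}
print(missing_structure_ids(labels, bulk_dict))  # ['mp-3']
```

With a dataframe, the labels would be taken from the first column, for example `df.iloc[:, 0]`.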

During initialization of `MatFold`, the user can also pick a random subset of the data by specifying the 
variable `return_frac`. When this value is set to less than 1.0, the variable 
`always_include_n_elements` can be specified to ensure that materials with a certain number of unique elements 
are always included (*i.e.*, not affected by the `return_frac` sampling). 
For example, `always_include_n_elements=[1, 2]` would ensure that all elemental and binary compounds remain 
in the selected subset of the data.
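
The interplay of the two variables can be illustrated with a small sketch. This is not the `MatFold` implementation: the function `sample_subset` is hypothetical, and it assumes that the fraction is applied only to the entries that are not protected by `always_include_n_elements`.

```python
import random


def sample_subset(n_elements, return_frac, always_include_n_elements=(), seed=0):
    """Pick a random fraction of entries while always keeping the protected ones.

    `n_elements[i]` is the number of unique elements of entry i.
    Returns the sorted indices of the selected entries.
    """
    protected = [i for i, n in enumerate(n_elements) if n in always_include_n_elements]
    others = [i for i, n in enumerate(n_elements) if n not in always_include_n_elements]
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    # Assumption: the fraction applies to the non-protected entries only.
    n_sampled = round(return_frac * len(others))
    return sorted(protected + rng.sample(others, n_sampled))


# Toy dataset: entries 0, 1 and 6 are elemental or binary compounds.
n_elements = [1, 2, 3, 3, 4, 3, 2, 5, 4]
subset = sample_subset(n_elements, return_frac=0.5, always_include_n_elements=(1, 2))
print(subset)  # always contains 0, 1 and 6, plus three randomly chosen other entries
```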

### Creating Splits with Different Chemical and Structural Holdout Strategies

Once the `MatFold` class is initialized with the material data, the user can choose from various chemical and 
structural holdout strategies when creating their splits. The available splitting options are: 
 - *"index"* (naive random splitting)
 - *"structureid"* (split by parent bulk structure - this is identical to *"index"* for datasets where each entry corresponds to a unique bulk structure)
 - *"composition"*
 - *"chemsys"*
 - *"sgnum"* (Space group number)
 - *"pointgroup"*
 - *"crystalsys"*
 - *"elements"*
 - *"periodictablerows"*
 - *"periodictablegroups"*

Further, the user can analyze the distribution of unique split values and the corresponding 
fraction (prevalence) in the dataset by using the class function `split_statistics`. 
There are several optional variables that the user can specify (full list in the documentation below). 
Most notably, the numbers of inner and outer splits for nested folding are specified in 
`n_inner_splits` and `n_outer_splits`, respectively. If either of these two values is set to 0, 
then `MatFold` will set it equal to the number of possible split label options (*i.e.*, this corresponds 
to leave-one-out cross-validation).
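
The idea of folding by split label, with 0 meaning leave-one-out, can be sketched in plain Python. This is an illustration under simplifying assumptions and not the `MatFold` algorithm: the function `group_folds` is hypothetical, every entry carries exactly one label, and labels are dealt into folds round-robin after shuffling.

```python
import random


def group_folds(labels, n_splits, seed=0):
    """Split entry indices into folds so that each unique label is held out exactly once.

    n_splits=0 means one fold per unique label (leave-one-out over labels).
    Returns a list of (train_indices, test_indices) tuples.
    """
    unique = sorted(set(labels))
    if n_splits == 0:
        n_splits = len(unique)
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of unique labels")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Deal the shuffled labels round-robin into n_splits groups.
    label_groups = [unique[k::n_splits] for k in range(n_splits)]
    folds = []
    for group in label_groups:
        held_out = set(group)
        test = [i for i, lab in enumerate(labels) if lab in held_out]
        train = [i for i, lab in enumerate(labels) if lab not in held_out]
        folds.append((train, test))
    return folds


labels = ["cubic", "cubic", "hexagonal", "tetragonal", "cubic", "hexagonal"]
folds = group_folds(labels, n_splits=0)
print(len(folds))  # 3
```

Because whole label groups are moved between training and test sets, no label ever appears on both sides of a fold, which is what distinguishes these holdout strategies from naive index splitting.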

The user can also create a single leave-one-out split (rather than all possible splits) by utilizing the class 
function `create_loo_split` while specifying a single label that is to be held out in `loo_label` for 
the specified `split_type`.
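
A single leave-one-out split amounts to moving every entry that carries the chosen label into the test set. The sketch below shows this for an element holdout; the function `loo_split` is hypothetical, and the rule that an entry is held out as soon as it contains the label is an assumption about the intended behavior.

```python
def loo_split(entry_labels, loo_label):
    """Return (train_indices, test_indices) holding out every entry that carries `loo_label`.

    `entry_labels[i]` is the set of labels of entry i (e.g., its elements).
    """
    test = [i for i, labs in enumerate(entry_labels) if loo_label in labs]
    train = [i for i, labs in enumerate(entry_labels) if loo_label not in labs]
    return train, test


# Toy dataset: element sets of four materials.
elements = [{"Fe", "O"}, {"Li", "Co", "O"}, {"Fe"}, {"Na", "Cl"}]
train, test = loo_split(elements, "Fe")
print(train, test)  # [1, 3] [0, 2]
```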

### Example Use

Below is an example of how running `MatFold` could look:

```python
from MatFold import MatFold
import pandas as pd
import json

df = pd.read_csv('data.csv')  # Ensure that first column contains the correct label format
with open('bulk.json', 'r') as fp:  
    # Ensure all bulk pymatgen dictionaries are contained with the same key as specified in `df`
    bulk_dict = json.load(fp)

# Initialize MatFold and work with 50% of the dataset, but ensure to include all binary compounds
mf = MatFold(df, bulk_dict, return_frac=0.5, always_include_n_elements=[2])
stats = mf.split_statistics('crystalsys')
print(stats)  # print out statistics for the `crystalsys` split strategy
# Create all nested (and non-nested) splits utilizing `crystalsys` with the outer 
# split being leave-one-out and the inner splits being split into 5.
mf.create_splits('crystalsys', n_outer_splits=0, n_inner_splits=5, output_dir='./output/', verbose=True)
# Create a single leave-one-out split where Iron is held out from the dataset
mf.create_loo_split('elements', 'Fe', output_dir='./output/', verbose=True)
```

## Code Documentation

Below is detailed documentation of all `MatFold` capabilities and a description of the variables.

### Function `cifs_to_dict`

```python
def cifs_to_dict(directory: str | os.PathLike) -> dict
```

Converts a directory of CIF files into a dictionary with keys `<filename>` (of `<filename>.cif`) 
and the corresponding pymatgen dictionaries (parsed from `<filename>.cif`) as values.

**Arguments**:

- `directory`: Directory where cif files are stored

**Returns**:

Dictionary of CIF files with keys `<filename>` (of `<filename>.cif`).
Can be used as input `bulk_dict` to the `MatFold` class.

### Class `MatFold`

#### \_\_init\_\_

```python
def __init__(df: pd.DataFrame,
             bulk_dict: dict,
             return_frac: float = 1.0,
             always_include_n_elements: list | int | None = None,
             cols_to_keep: list | None = None,
             seed: int = 0) -> None
```

MatFold class constructor

**Arguments**:

- `df`: Pandas dataframe with the first column containing strings of either form `<structureid>` or
`<structureid>:<structuretag>` (where `<structureid>` refers to a bulk ID and `<structuretag>` refers to
an identifier of a derivative structure). All other columns are optional and may be retained by specifying the
`cols_to_keep` parameter described below.
- `bulk_dict`: Dictionary containing `<structureid>` as keys and the corresponding bulk pymatgen
dictionary as values.
- `return_frac`: The fraction of the `df` dataset that is utilized during splitting.
Must be greater than 0.0 and less than or equal to 1.0 (=100%).
- `always_include_n_elements`: A list of numbers of elements for which the corresponding materials are
always to be included in the dataset (for cases where `return_frac` < 1.0).
- `cols_to_keep`: List of columns to keep in the splits. If left `None`, then all columns of the
original df are kept.
- `seed`: Seed for selecting random subset of data and splits.


#### from\_json

```python
@classmethod
def from_json(cls,
              df: pd.DataFrame,
              bulk_dict: dict,
              json_file: str | os.PathLike,
              create_splits: bool = True)
```

Reconstruct a `MatFold` class instance, along with its associated splits, from a JSON file previously generated 
by the `create_splits` or `create_loo_split` methods. The same `df` and `bulk_dict` used during
the original split creation must be provided to guarantee that the exact splits are regenerated.

**Arguments**:

- `df`: Pandas dataframe with the first column containing strings of either form `<structureid>` or
`<structureid>:<structuretag>` (where `<structureid>` refers to a bulk ID and `<structuretag>` refers to
an identifier of a derivative structure). All other columns are optional (see the `cols_to_keep`
constructor parameter described above).
- `bulk_dict`: Dictionary containing `<structureid>` as keys and the corresponding bulk pymatgen
dictionary as values.
- `json_file`: Location of the JSON file that is created when MatFold is used to generate splits.
- `create_splits`: Whether to create the splits using the settings stored in the JSON file.

**Returns**:

MatFold class instance


#### split\_statistics

```python
def split_statistics(split_type: str) -> dict
```

Analyzes the statistics of the `sgnum`, `pointgroup`, `crystalsys`, `chemsys`, `composition`, `elements`, 
`periodictablerows`, and `periodictablegroups` splits.

**Arguments**:

- `split_type`: String specifying the splitting type

**Returns**:

Dictionary with the unique split values as keys and, for each key, the fraction of the entire dataset 
in which it is represented.
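
The shape of the returned dictionary can be illustrated with a small stand-in. This sketch is not the `MatFold` code: the function `label_fractions` is hypothetical and assumes one label per entry, as is the case for a split type such as `crystalsys`.

```python
from collections import Counter


def label_fractions(labels):
    """Map each unique label to the fraction of entries that carry it."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


labels = ["cubic", "cubic", "hexagonal", "cubic"]
print(label_fractions(labels))  # {'cubic': 0.75, 'hexagonal': 0.25}
```

Inspecting these fractions before splitting helps to spot labels that dominate the dataset or occur only a handful of times.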


#### create\_splits

```python
def create_splits(split_type: str,
                  n_inner_splits: int = 10,
                  n_outer_splits: int = 10,
                  fraction_upper_limit: float = 1.0,
                  fraction_lower_limit: float = 0.0,
                  keep_n_elements_in_train: list | int | None = None,
                  min_train_test_factor: float | None = None,
                  inner_equals_outer_split_strategy: bool = True,
                  write_base_str: str = 'mf',
                  output_dir: str | os.PathLike | None = None,
                  verbose: bool = False) -> None
```

Creates splits based on `split_type`, `n_inner_splits`, `n_outer_splits` among other specifications 
(cf. full list of function variables). The splits are saved in `output_dir` as csv files named
`<write_base_str>.<split_type>.k<i>_outer.<train/test>.csv` and
`<write_base_str>.<split_type>.k<i>_outer.l<j>_inner.<train/test>.csv` for all outer (index `<i>`) and inner
splits (index `<j>`), respectively. Additionally, a summary of the created splits is saved as
`<write_base_str>.<split_type>.summary.k<n_outer_splits>.l<n_inner_splits>.<self.return_frac>.csv`.
Lastly, a JSON file is saved that stores all relevant class and function variables to recreate the splits
utilizing the class function `from_json` and is named `<write_base_str>.<split_type>.json`.
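
When post-processing the written files, the naming scheme above can be encoded in two small helpers. This is a sketch: the helper names `outer_split_path` and `inner_split_path` are hypothetical, and since the text does not state whether the indices `<i>` and `<j>` start at 0 or 1, the values used here are placeholders that should be checked against the files actually written.

```python
from pathlib import Path


def outer_split_path(output_dir, split_type, i, subset, write_base_str="mf"):
    """Path of the outer split file; `subset` is 'train' or 'test'."""
    return Path(output_dir) / f"{write_base_str}.{split_type}.k{i}_outer.{subset}.csv"


def inner_split_path(output_dir, split_type, i, j, subset, write_base_str="mf"):
    """Path of the inner split file `j` that belongs to outer split `i`."""
    name = f"{write_base_str}.{split_type}.k{i}_outer.l{j}_inner.{subset}.csv"
    return Path(output_dir) / name


print(outer_split_path("output", "crystalsys", 0, "train").name)
# mf.crystalsys.k0_outer.train.csv
print(inner_split_path("output", "crystalsys", 0, 2, "test").name)
# mf.crystalsys.k0_outer.l2_inner.test.csv
```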

**Arguments**:

- `split_type`: Defines the type of splitting, must be either "index", "structureid", "composition",
"chemsys", "sgnum", "pointgroup", "crystalsys", "elements", "periodictablerows", or "periodictablegroups"
- `n_inner_splits`: Number of inner splits (for nested k-fold); if set to 0, then `n_inner_splits` is set
equal to the number of inner test possibilities (i.e., each inner test set holds one possibility out
for all possible options)
- `n_outer_splits`: Number of outer splits (k-fold); if set to 0, then `n_outer_splits` is set equal to the
number of test possibilities (i.e., each outer test set holds one possibility out for all possible options)
- `fraction_upper_limit`: If a split possibility is represented in the dataset with a fraction above
this limit then the corresponding indices will be forced to be in the training set by default.
- `fraction_lower_limit`: If a split possibility is represented in the dataset with a fraction below
this limit then the corresponding indices will be forced to be in the training set by default.
- `keep_n_elements_in_train`: List of numbers of elements for which the corresponding materials are kept
in the training set by default (i.e., not k-folded). For example, '2' will keep all binaries in the training set.
- `min_train_test_factor`: Minimum factor by which the training set needs to be
larger than the test set (for factors greater than 1.0).
- `inner_equals_outer_split_strategy`: If true, then the inner splitting strategy used is equal to
the outer splitting strategy, if false, then inner splitting strategy is random (by index).
- `write_base_str`: Beginning string of csv file names of the written splits
- `output_dir`: Directory where the splits are written to
- `verbose`: Whether to print out details during code execution.

**Returns**:

None
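
The effect of `fraction_lower_limit` and `fraction_upper_limit` can be previewed from a dictionary of label fractions of the kind returned by `split_statistics`. The following sketch only mirrors the description of the two limits given above; the function `labels_forced_into_train` and the toy fractions are hypothetical.

```python
def labels_forced_into_train(fractions, fraction_lower_limit=0.0, fraction_upper_limit=1.0):
    """Return the labels whose prevalence lies below the lower or above the upper limit."""
    return sorted(
        label
        for label, frac in fractions.items()
        if frac < fraction_lower_limit or frac > fraction_upper_limit
    )


# Toy prevalence values for a `crystalsys` split.
fractions = {"cubic": 0.62, "hexagonal": 0.25, "tetragonal": 0.12, "triclinic": 0.01}
print(labels_forced_into_train(fractions, fraction_lower_limit=0.05, fraction_upper_limit=0.5))
# ['cubic', 'triclinic']
```

With the default limits of 0.0 and 1.0 no label is excluded from the test sets, so the returned list is empty.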

#### create\_loo\_split

```python
def create_loo_split(split_type: str,
                     loo_label: str,
                     keep_n_elements_in_train: list | int | None = None,
                     write_base_str: str = 'mf',
                     output_dir: str | os.PathLike | None = None,
                     verbose: bool = False) -> None
```

Creates leave-one-out split based on `split_type`, specified `loo_label` and `keep_n_elements_in_train`.
The splits are saved in `output_dir` as csv files named
`<write_base_str>.<split_type>.loo.<loo_label>.<train/test>.csv`. Additionally, a summary of the created split
is saved as `<write_base_str>.<split_type>.summary.loo.<loo_label>.<self.return_frac>.csv`.
Lastly, a JSON file is saved that stores all relevant class and function variables to recreate the splits
utilizing the class function `from_json` and is named `<write_base_str>.<split_type>.loo.<loo_label>.json`.

**Arguments**:

- `split_type`: Defines the type of splitting, must be either "structureid", "composition", "chemsys",
"sgnum", "pointgroup", "crystalsys", "elements", "periodictablerows", or "periodictablegroups".
- `loo_label`: Label specifying which single option is to be left out (i.e., constitute the test set).
This label must be a valid option for the specified `split_type`.
- `keep_n_elements_in_train`: List of numbers of elements for which the corresponding materials are kept
in the training set by default (i.e., never held out). For example, '2' will keep all binaries in the training set.
- `write_base_str`: Beginning string of csv file names of the written splits
- `output_dir`: Directory where the splits are written to
- `verbose`: Whether to print out details during code execution.

**Returns**:

None

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "MatFold",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Peter Schindler <p.schindler@northeastern.edu>",
    "keywords": "machine learning, materials science, cross-validation, generalization error",
    "author": null,
    "author_email": "Peter Schindler <p.schindler@northeastern.edu>, \"Matthew D. Witman\" <mwitman@sandia.gov>",
    "download_url": "https://files.pythonhosted.org/packages/57/2f/5dc6e02d475c4d5e87eec301d03d2eac70ee6313b60eaad62c79854ad16f/matfold-1.2.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "summary": "Package for systematic insights into materials discovery models\u2019 performance through standardized chemical cross-validation protocols",
    "version": "1.2.3",
    "project_urls": {
        "Repository": "https://github.com/d2r2group/MatFold"
    },
    "split_keywords": [
        "machine learning",
        " materials science",
        " cross-validation",
        " generalization error"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "82a895a90a95c3af998bfdf79b7e6397bc295dc0fd6f76dba05870bf572508fd",
                "md5": "d473b4bb5da213acca586ea3b046d4b1",
                "sha256": "1a525490eafe5cc41248561139cbe22ce746c878a182dd1a7c0d3a665efd3831"
            },
            "downloads": -1,
            "filename": "MatFold-1.2.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d473b4bb5da213acca586ea3b046d4b1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 18322,
            "upload_time": "2024-11-26T05:29:22",
            "upload_time_iso_8601": "2024-11-26T05:29:22.612291Z",
            "url": "https://files.pythonhosted.org/packages/82/a8/95a90a95c3af998bfdf79b7e6397bc295dc0fd6f76dba05870bf572508fd/MatFold-1.2.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "572f5dc6e02d475c4d5e87eec301d03d2eac70ee6313b60eaad62c79854ad16f",
                "md5": "66c8051ca6101fc1ca842b743d5f5f8c",
                "sha256": "e9053cc55e8636088bc4a430f4141129620e3ec4ba4bb6680ea810f04a239c83"
            },
            "downloads": -1,
            "filename": "matfold-1.2.3.tar.gz",
            "has_sig": false,
            "md5_digest": "66c8051ca6101fc1ca842b743d5f5f8c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 30871,
            "upload_time": "2024-11-26T05:29:24",
            "upload_time_iso_8601": "2024-11-26T05:29:24.317374Z",
            "url": "https://files.pythonhosted.org/packages/57/2f/5dc6e02d475c4d5e87eec301d03d2eac70ee6313b60eaad62c79854ad16f/matfold-1.2.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-26 05:29:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "d2r2group",
    "github_project": "MatFold",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "matfold"
}