active-sampler


- **Name**: active-sampler
- **Version**: 0.1.0
- **Home page**: https://github.com/rogeriog/active_sampler
- **Summary**: An active learning package for experimental design in chemistry and materials science.
- **Upload time**: 2025-02-08 19:09:49
- **Author**: Rogerio Gouvea
- **Requires Python**: >=3.8
- **Maintainer**: None
- **Docs URL**: None
- **License**: None recorded
- **Requirements**: No requirements were recorded.

# ActiveSampler: An Active Learning Package for Experimental Design in Chemistry and Materials Science

ActiveSampler is a Python package for active learning workflows tailored to experimental design in chemistry and materials science. By selecting the most informative candidate experiments to run and label next, it aims to reduce the number of experiments needed, cut costs, and accelerate discovery in these fields.

## Features

- **Model Training and Prediction**: Supports both classification and regression tasks using models like Logistic Regression, Random Forest, and XGBoost.
- **Uncertainty Calculation**: Computes uncertainty for classification using entropy and for regression using variance.
- **Objective Function Evaluation**: Allows custom objective functions to guide the selection of samples.
- **Diversity and Acquisition**: Incorporates diversity measures and acquisition functions to balance exploration and exploitation.
- **Grid Sampling and Constraints**: Generates sampling grids and applies constraints to ensure valid experimental designs.
- **Active Learning Selection**: Selects the most informative samples to enhance model performance with customizable weights for objective, uncertainty, and diversity.
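
The entropy-based and variance-based uncertainties listed above can be illustrated with a short standard-library sketch. This is not ActiveSampler's internal code: the function names `entropy_uncertainty` and `variance_uncertainty` are hypothetical, and the sketch assumes that class probabilities come from a classifier and that the regression variance is taken across the predictions of ensemble members (for example, the trees of a random forest).

```python
import math
from statistics import pvariance


def entropy_uncertainty(class_probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)


def variance_uncertainty(member_predictions):
    """Population variance of one point's predictions across ensemble members."""
    return pvariance(member_predictions)


# A confident and an undecided 3-class prediction (illustrative numbers).
confident = [0.98, 0.01, 0.01]
undecided = [1 / 3, 1 / 3, 1 / 3]
print(entropy_uncertainty(confident) < entropy_uncertainty(undecided))  # True

# Ensemble members that agree vs. disagree on a regression target.
print(variance_uncertainty([10.0, 10.0, 10.0]))  # 0.0
print(variance_uncertainty([8.0, 10.0, 12.0]))
```

Higher values mark candidates the model is less sure about; a selection step can then favor them through the uncertainty weight.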

## Installation

ActiveSampler is published on PyPI as `active-sampler` (`pip install active-sampler`). To install from source instead, clone the repository and install the dependencies:

```bash
git clone https://github.com/rogeriog/active_sampler.git
cd active_sampler
pip install -r requirements.txt
```

## Usage

### Example

The following input data is used to select new data points for a LARP synthesis. The full dataset is in [examples/example1_LARP/input.csv](examples/example1_LARP/input.csv):

```csv
ligand_quantity,ligand_ii_quantity,halogen_alloy_quantity,antisolvent_quantity,structural_response
10.0,300,0,3000,1
5.0,300,0,3000,1
...
```

Here is the code to sample these new points:

```python
from active_sampler import active_sampling, load_and_preprocess_data

# Define the path to your data file
filepath = 'input.csv'

# Specify target columns and their types
target_columns = ['structural_response']
target_types = {
    'structural_response': 'classification',
}
num_classes_dict = {
    'structural_response': 3
}

# Define the objective function as a string
obj_fn_str = 'structural_response_class_2'

# Load and preprocess data
X, y_dict = load_and_preprocess_data(
    filepath,
    target_columns,
    target_types,
)

# Start active learning selection
active_sampling(
    X,
    y_dict,
    target_types,
    obj_fn_str,
    num_classes_dict=num_classes_dict,
    num_sampling=25,
    alpha=0.25,  # Objective weight
    beta=0.25,  # Uncertainty weight
    gamma=0.5,  # Diversity weight
    sufix='LARP',
)
```

### Input Data Format

The input data should be a CSV file with a header row, one column per experimental variable (feature), and one column per target:

```csv
ligand_quantity,ligand_ii_quantity,halogen_alloy_quantity,antisolvent_quantity,structural_response
10.0,300,0,3000,1
5.0,300,0,3000,1
...
```
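
Before passing a file to the package, it can help to confirm that the header contains every target column. The sketch below does this with Python's standard `csv` module; the helper `check_input_csv` is hypothetical (it is not part of ActiveSampler), and the two data rows are the ones shown above.

```python
import csv
import io

# Two rows copied from the example above; a real file would be opened with open().
SAMPLE = """ligand_quantity,ligand_ii_quantity,halogen_alloy_quantity,antisolvent_quantity,structural_response
10.0,300,0,3000,1
5.0,300,0,3000,1
"""


def check_input_csv(handle, target_columns):
    """Return (feature_columns, n_rows); raise if a target column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in target_columns if col not in header]
    if missing:
        raise ValueError(f"Missing target columns: {missing}")
    features = [col for col in header if col not in target_columns]
    n_rows = sum(1 for _ in reader)
    return features, n_rows


features, n_rows = check_input_csv(io.StringIO(SAMPLE), ["structural_response"])
print(features)
print(n_rows)  # 2
```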

### `load_and_preprocess_data` Function

The `load_and_preprocess_data` function loads, cleans, and prepares your data. It handles column renaming, missing values, and row and column removal, then splits the data into features (`X`) and targets (`y_dict`). See the examples for detailed usage.

**Parameters:** `filepath`, `target_columns`, `target_types`, `column_mapping` (optional), `categorical_cols` (optional), `missing_value_strategy` (optional), `imputation_values` (optional), `rows_to_remove` (optional), `columns_to_remove` (optional), `regex_columns_to_remove` (optional).
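
To give a feel for the kind of steps these parameters suggest, here is a plain pandas sketch. It is not the package's implementation: `preprocess_sketch` is a hypothetical function, the column names and values in `RAW` are made up for illustration, and the behavior of each option (for instance, mean imputation) is an assumption based on the parameter names only.

```python
import io

import pandas as pd

# Made-up data: one missing value and one free-text column to drop.
RAW = """Ligand,ligand_ii_quantity,notes_free_text,structural_response
10.0,300,ok,1
,300,repeat,1
5.0,250,ok,2
"""


def preprocess_sketch(handle, target_columns, column_mapping=None,
                      regex_columns_to_remove=None):
    """Rename columns, drop columns by regex, impute means, split X / y."""
    df = pd.read_csv(handle)
    if column_mapping:
        df = df.rename(columns=column_mapping)
    if regex_columns_to_remove:
        drop = df.filter(regex=regex_columns_to_remove).columns
        df = df.drop(columns=drop)
    # Mean imputation for numeric feature columns (one possible strategy).
    feature_cols = [c for c in df.columns if c not in target_columns]
    means = df[feature_cols].mean(numeric_only=True)
    df[feature_cols] = df[feature_cols].fillna(means)
    X = df[feature_cols]
    y_dict = {name: df[name] for name in target_columns}
    return X, y_dict


X, y_dict = preprocess_sketch(
    io.StringIO(RAW),
    target_columns=["structural_response"],
    column_mapping={"Ligand": "ligand_quantity"},
    regex_columns_to_remove=r"^notes_",
)
print(X)
print(y_dict["structural_response"].tolist())  # [1, 1, 2]
```

The return shape mirrors what the README describes: a feature DataFrame and a dictionary mapping each target name to its Series.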

### `active_sampling` Function Parameters

- `X`: Feature DataFrame.
- `y_dict`: Dictionary mapping target names to their Series.
- `target_types`: Dictionary mapping target names to 'classification' or 'regression'.
- `obj_fn_str`: String defining the objective function. It can reference the following names:
    - Classification: `target_class_i` (e.g., `'structure_type_class_2'`).
    - Regression: `target` (e.g., `'contact_angle'`).
    - Normalized Regression: `norm_target` (e.g., `'norm_contact_angle'`).
- `sufix`: Suffix for output files.
- `categorical_cols`: List of categorical columns.
- `num_classes_dict`: Dictionary mapping classification targets to number of classes.
- `initial_train_size`: Initial training set size (or `None` for all data).
- `num_sampling`: Number of samples to select.
- `alpha`, `beta`, `gamma`: Weights for objective, uncertainty, and diversity.
- `user_num_grid_points`: Custom grid points per numerical variable (int, 'unique', or dict).
- `variable_constraints`: Constraints to filter the sampling grid (list of dicts).  Each dict has `conditions`, `assignments`, and optional `mutual_constraint`.
- `unc_fn_str`: Custom formula for combining uncertainties. References: `target_unc`, `norm_target_unc`.
- `diversity_settings`: Settings for diversity: `neighbor_distance_metric` (default: 'euclidean'), `same_cluster_penalty` (default: 0.5), `number_of_clusters` (default: 'num_sampling').
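
The interplay between the sampling grid and the `alpha`, `beta`, and `gamma` weights can be sketched with the standard library. Everything here is an assumption made for illustration: the score is taken to be a simple weighted sum of rescaled objective, uncertainty, and diversity terms, diversity is taken to be the distance to the nearest already-measured point, and the helpers `linspace`, `rescale`, and `select` are hypothetical. The package's actual formula and clustering-based diversity settings may differ.

```python
import itertools
import math


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def rescale(values):
    """Min-max rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def select(grid, measured, objective, uncertainty, k, alpha, beta, gamma):
    """Rank grid points by a weighted sum of three rescaled terms."""
    obj = rescale([objective(p) for p in grid])
    unc = rescale([uncertainty(p) for p in grid])
    # Diversity term: distance to the nearest already-measured point.
    div = rescale([min(math.dist(p, m) for m in measured) for p in grid])
    scores = [alpha * o + beta * u + gamma * d for o, u, d in zip(obj, unc, div)]
    ranked = sorted(range(len(grid)), key=lambda i: scores[i], reverse=True)
    return [grid[i] for i in ranked[:k]]


# Two variables, three grid points each -> 9 candidates.
grid = list(itertools.product(linspace(0.0, 10.0, 3), linspace(0.0, 1.0, 3)))
measured = [(0.0, 0.0)]
picked = select(
    grid, measured,
    objective=lambda p: p[0],               # toy stand-in for a model prediction
    uncertainty=lambda p: abs(p[1] - 0.5),  # toy stand-in for model uncertainty
    k=2, alpha=0.25, beta=0.25, gamma=0.5,
)
print(picked)
```

This sketch ranks all candidates once and takes the top `k`, so it does not penalize selected points for being close to each other; a setting such as `same_cluster_penalty` suggests the package handles that case separately.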

### Output

The `active_sampling` function generates a `.txt` file and a `.csv` file containing the coordinates of the selected samples, sorted by all columns.  See the examples folder for detailed output formats.
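
The selected samples can be read back for planning the next experiments. The sketch below uses the standard `csv` module and assumes the file follows the `selected_samples_[sufix].csv` naming shown in the examples folder, with one numeric column per variable. The helper `read_selected_samples` is hypothetical, and the stand-in file contents are made up so that the snippet runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def read_selected_samples(path):
    """Read a selected-samples CSV into a list of {column: float} dicts."""
    with open(path, newline="") as handle:
        return [
            {name: float(value) for name, value in row.items()}
            for row in csv.DictReader(handle)
        ]


# Stand-in file so the sketch runs on its own; values are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "selected_samples_LARP.csv"
    path.write_text(
        "ligand_quantity,antisolvent_quantity\n"
        "5.0,3000\n"
        "10.0,1500\n"
    )
    samples = read_selected_samples(path)

print(len(samples))  # 2
print(samples[0])
```

Files that contain categorical columns would need those columns left as strings instead of being converted with `float`.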

### Examples

The package includes several examples demonstrating different use cases, located in the `examples` folder. The structure is as follows:

```
├── README.md
├── active_sampler
│   ├── __init__.py
│   ├── core.py
│   └── utils.py
├── examples
│   ├── example1_LARP
│   │   ├── example1.py
│   │   ├── input.csv
│   │   ├── selected_samples_LARP.csv
│   │   └── selected_samples_LARP.txt
│   ├── example2_PhobicSurfaces
│   │   ├── example2.py
│   │   ├── input.csv
│   │   ├── selected_samples_PhobicSurfaces.csv
│   │   └── selected_samples_PhobicSurfaces.txt
│   ├── example3_BatteryOptimization
│   │   ├── example3.py
│   │   ├── input.csv
│   │   ├── selected_samples_BatteryOptimization.csv
│   │   └── selected_samples_BatteryOptimization.txt
│   └── example4_ProcessingAndConstraints
│       ├── example4.py
│       ├── input.csv
│       ├── selected_samples_LARP_advanced_features.csv
│       └── selected_samples_LARP_advanced_features.txt
```

Each example folder contains:

-   `example[N].py`: The Python script implementing the active learning workflow.
-   `input.csv`: The input data used for the example.

Pre-generated output files are also provided for each example:

-   `selected_samples_[sufix].csv`: The CSV file with the selected samples.
-   `selected_samples_[sufix].txt`: The text file with the selected samples and run information.

Here's a breakdown of each example:

- **`example1_LARP`**: A basic example focused on optimizing a **LARP (Ligand-Assisted Reprecipitation)** synthesis. It uses a single classification target (`structural_response`) to predict the structural outcome of the synthesis.

- **`example2_PhobicSurfaces`**: This example deals with predicting the **contact angle** of surfaces, a regression problem. It also demonstrates the use of categorical features (`metal_precursor`, `surface_coating_material`).

- **`example3_BatteryOptimization`**: A more complex, multi-output example focused on **battery material optimization**. It involves multiple regression targets (specific capacity, capacity retention, etc.) and uses custom objective and uncertainty functions to guide the selection process.  It also uses categorical features.

- **`example4_ProcessingAndConstraints`**: This example showcases advanced features like **custom grid points** (restricting the sampling space for certain variables), **variable constraints** (ensuring logical relationships between variables), and more detailed data preprocessing options. It uses a combination of classification and regression targets.

Run them directly from the `examples` folder (e.g., `python example1_LARP/example1.py`) after ensuring the `active_sampler` package is installed and the `input.csv` files are present.

## Contributing

Contributions are welcome! Please submit a Pull Request.

## License

This project is licensed under the MIT License.

## Contact

For questions or issues, please contact rogeriog.em@gmail.com.

            
