# MoML-CA: Molecular Machine Learning for Chemical Applications
MoML-CA is a Python package for molecular representation learning and property prediction with graph neural networks (GNNs). It provides tools for converting molecular structures to graph representations, training GNN models, and predicting molecular properties.
## Features
- **Molecular Graph Creation**: Convert SMILES and RDKit molecules to graph representations with extensive feature extraction
- **Hierarchical Graph Representations**: Create multi-level graph representations for improved model performance
- **Modular Model Architecture**: Flexible and extensible GNN architectures with easy configuration
- **Training Utilities**: Comprehensive training pipelines with callbacks and monitoring
- **Evaluation Tools**: Metrics calculation and visualization of predictions
- **Example Scripts**: Ready-to-use examples for common molecular machine learning tasks
- **Command-Line Tools**: Easy-to-use CLI for model training and prediction
- **Data Processing**: Efficient batch processing of molecular datasets
- **Visualization**: Tools for visualizing molecular graphs and model predictions
## Large Files Handling
Large data files (over 100 MB), such as training datasets and saved models, are not stored in the Git repository. They are excluded via the `.gitignore` file and should be shared by other means (cloud storage, direct transfer, etc.).
In particular, large files in the `data/qm9/processed/` directory (notably `*.pt` files) are automatically excluded from Git.
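Before committing, it can help to check whether any local file exceeds that threshold. The following is a minimal sketch that uses only the Python standard library; the helper name `find_large_files` and its defaults are illustrative assumptions and are not part of MoML-CA.

```python
import os


def find_large_files(root=".", limit_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files under `root` larger than `limit_bytes`."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Broken symlinks or unreadable files are skipped.
                continue
            if size > limit_bytes:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files():
        print(f"{size / 1024 ** 2:8.1f} MB  {path}")
```

Any file this reports that is not already covered by `.gitignore` is a candidate for sharing outside the repository.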
## Installation
```bash
# Clone the repository (choose HTTPS or SSH)
git clone https://github.com/SAKETH11111/MoML-CA.git
# or, if you have SSH keys configured:
# git clone git@github.com:SAKETH11111/MoML-CA.git
cd MoML-CA
# Create a conda environment
conda env create -f environment.yml
# Activate the environment
conda activate moml-ca
# Install dependencies
pip install -r requirements.txt
# Install the package in development mode
pip install -e .
```
## Quick Start
```python
import torch
from rdkit import Chem
from moml.core import create_graph_processor
from moml.models.mgnn.training import initialize_model, MGNNConfig, create_trainer
from moml.models.mgnn.evaluation.predictor import create_predictor
# Create molecular graph
processor = create_graph_processor({'use_partial_charges': True})
smiles = "C(C(F)(F)F)(C(F)(F)F)(F)F" # Perfluoropropane (C3F8)
graph = processor.smiles_to_graph(smiles)
# Initialize model with configuration
config = MGNNConfig({
'model_type': 'multi_task_djmgnn',
'hidden_dim': 64,
'n_blocks': 3
})
model = initialize_model(config, graph.x.shape[1], graph.edge_attr.shape[1])
# Train the model. train_loader and val_loader must be PyTorch DataLoader objects
# wrapping your training and validation datasets; they are not defined in this snippet.
# See the examples directory (examples/training_examples or examples/quickstart_examples)
# for how to create these dataloaders. For example:
# from torch.utils.data import DataLoader
# train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# val_loader = DataLoader(val_dataset, batch_size=32)
trainer = create_trainer(config=config, train_loader=train_loader, val_loader=val_loader)
history = trainer.train(epochs=50)
# Make predictions
predictor = create_predictor(model_path="path/to/saved_model.pt") # Or pass model directly
predictions = predictor.predict_from_dataloader(val_loader) # Or predictor.predict([graph])
```
See the [examples directory](examples) for more comprehensive examples.
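The Quick Start assumes that training and validation datasets already exist. As one possible first step, a list of SMILES strings can be split reproducibly before any graphs are built. This is a minimal sketch using only the standard library; `split_smiles` is a hypothetical helper, not a MoML-CA function, and the molecules listed are arbitrary examples.

```python
import random


def split_smiles(smiles_list, val_fraction=0.2, seed=0):
    """Shuffle a copy of `smiles_list` and split it into (train, val) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(smiles_list)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


smiles = [
    "FC(F)(F)F",                        # tetrafluoromethane
    "FC(F)(F)C(F)(F)F",                 # perfluoroethane
    "FC(F)(F)C(F)(F)C(F)(F)F",          # perfluoropropane
    "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)F",   # perfluorobutane
    "OC(=O)C(F)(F)F",                   # trifluoroacetic acid
]
train_smiles, val_smiles = split_smiles(smiles, val_fraction=0.2, seed=42)
print(len(train_smiles), len(val_smiles))
```

Each string in the two lists can then be passed to `processor.smiles_to_graph` as in the Quick Start. How the resulting graphs are wrapped into datasets depends on the package's dataset classes, so the examples directory remains the reference for that step.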
### Generating force field labels
After running ORCA calculations, you can generate a JSON file containing atom
types, partial charges, and other force field parameters for each PFAS molecule:
```bash
python scripts/generate_force_field_labels.py
```
The output `force_field_labels.json` will be placed in
`orca_results_b3lyp_sto3g/`.
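The layout of `force_field_labels.json` is not documented here, so a cautious way to start working with it is to inspect its top-level structure before relying on any particular keys. The sketch below does that with the standard library only; `summarize_labels` is a hypothetical helper and makes no assumption about the schema.

```python
import json
from pathlib import Path


def summarize_labels(path):
    """Load a JSON file and describe its top-level layout without assuming a schema."""
    with Path(path).open("r", encoding="utf-8") as handle:
        data = json.load(handle)
    summary = {
        "type": type(data).__name__,
        "n_entries": len(data) if isinstance(data, (dict, list)) else 1,
    }
    if isinstance(data, dict):
        # Show a few top-level keys as a hint of how entries are indexed.
        summary["first_keys"] = list(data)[:5]
    return summary


if __name__ == "__main__":
    labels_path = Path("orca_results_b3lyp_sto3g") / "force_field_labels.json"
    if labels_path.exists():
        print(summarize_labels(labels_path))
    else:
        print(f"{labels_path} not found; run the label generation script first.")
```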
## Project Structure
```
MoML-CA/
├── moml/ # Main package directory
│ ├── core/ # Core functionality
│ │ ├── graph_coarsening.py # Graph coarsening algorithms
│ │ └── molecular_graph.py # Molecular graph representation
│ ├── models/ # Model implementations
│ │ ├── mgnn/ # MGNN models
│ │ │ ├── djmgnn.py # DJMGNN implementation
│ │ │ ├── training/ # Training utilities
│ │ │ └── evaluation/ # Evaluation utilities
│ │ └── lstm/ # LSTM models
│ ├── data/ # Data handling utilities
│ │ ├── dataset.py # Dataset implementations
│ │ └── processors.py # Data processors
│ ├── utils/ # Utility functions
│ │ ├── visualization/ # Visualization tools
│ │ ├── molecular/ # Molecular utilities
│ │ └── graph/ # Graph utilities
│ ├── pipeline/ # Pipeline orchestration
│ ├── simulation/ # Simulation utilities
│ └── __init__.py # Package initialization
├── examples/ # Example scripts
│ ├── quickstart/ # Quickstart examples
│ ├── training/ # Training examples
│ ├── prediction/ # Prediction examples
│ ├── molecular_graph/ # Molecular graph examples
│ └── preprocess/ # Preprocessing examples
└── tests/ # Test directory
```
## Recent Improvements
- **Enhanced Model Architecture**: Improved hierarchical graph representations and attention mechanisms
- **Streamlined API**: Simplified interface with factory functions and better error handling
- **Advanced Training Features**: Added support for mixed precision training and gradient accumulation
- **Improved Data Processing**: Enhanced batch processing and memory efficiency
- **Better Visualization**: New tools for visualizing molecular graphs and model attention
- **Command-Line Interface**: Added CLI tools for common tasks
- **Documentation**: Comprehensive documentation with examples and tutorials
## Documentation
See the [docs](docs/) directory for comprehensive documentation.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For guidelines on contributing, see [CONTRIBUTING.md](CONTRIBUTING.md).
## License
This project is licensed under the terms of the MIT license.
Raw data
{
"_id": null,
"home_page": null,
"name": "moml-ca",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "molecular, machine learning, graph neural networks, chemistry, PFAS",
"author": null,
"author_email": "SAKETH11111 <sakethbaddam10@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/27/67/a5021bd7b76cc04a4a37d200150813e5357ac663bd811688822c45e4a421/moml_ca-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Molecular Machine Learning for Chemical Applications - A comprehensive Python package for molecular representation learning and property prediction using Graph Neural Networks",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/SAKETH11111/MoML-CA",
"Issues": "https://github.com/SAKETH11111/MoML-CA/issues",
"Repository": "https://github.com/SAKETH11111/MoML-CA"
},
"split_keywords": [
"molecular",
" machine learning",
" graph neural networks",
" chemistry",
" pfas"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "0c21dfc54a353150f2b2d6ae113a2afe29d0e3cb0af7c69bf174eee16919a6d0",
"md5": "158119adb4622fdc1a12bbb057ea7dc7",
"sha256": "a86b57fee478e74e4ebb1590ef49d7d055980daee8cabc3c442b4c5cf9c130d1"
},
"downloads": -1,
"filename": "moml_ca-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "158119adb4622fdc1a12bbb057ea7dc7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 197261,
"upload_time": "2025-08-06T00:51:40",
"upload_time_iso_8601": "2025-08-06T00:51:40.090008Z",
"url": "https://files.pythonhosted.org/packages/0c/21/dfc54a353150f2b2d6ae113a2afe29d0e3cb0af7c69bf174eee16919a6d0/moml_ca-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2767a5021bd7b76cc04a4a37d200150813e5357ac663bd811688822c45e4a421",
"md5": "2f369a55b8ce8c5cf3de9555aa161595",
"sha256": "90e13a674b0d462b9d10c026585db64d7c0904bb62844e43051f090d5d3ee3bc"
},
"downloads": -1,
"filename": "moml_ca-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "2f369a55b8ce8c5cf3de9555aa161595",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 261330,
"upload_time": "2025-08-06T00:51:41",
"upload_time_iso_8601": "2025-08-06T00:51:41.563778Z",
"url": "https://files.pythonhosted.org/packages/27/67/a5021bd7b76cc04a4a37d200150813e5357ac663bd811688822c45e4a421/moml_ca-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-06 00:51:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SAKETH11111",
"github_project": "MoML-CA",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "numpy",
"specs": [
[
">=",
"1.21.0"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.7.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.3.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"1.12.0"
]
]
},
{
"name": "torch-geometric",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "rdkit",
"specs": [
[
">=",
"2022.03.1"
]
]
},
{
"name": "openmm",
"specs": [
[
">=",
"7.5.0"
]
]
},
{
"name": "mdtraj",
"specs": [
[
">=",
"1.9.5"
]
]
},
{
"name": "MDAnalysis",
"specs": [
[
">=",
"2.4.0"
]
]
},
{
"name": "pdbfixer",
"specs": [
[
">=",
"1.11"
]
]
},
{
"name": "dask",
"specs": [
[
">=",
"2022.2.0"
]
]
},
{
"name": "distributed",
"specs": [
[
">=",
"2022.2.0"
]
]
},
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0"
]
]
},
{
"name": "python-json-logger",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.0.0"
]
]
},
{
"name": "pytest-cov",
"specs": [
[
">=",
"3.0.0"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.5.0"
]
]
},
{
"name": "seaborn",
"specs": [
[
">=",
"0.11.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "networkx",
"specs": [
[
">=",
"2.6.0"
]
]
},
{
"name": "plotly",
"specs": [
[
">=",
"5.3.0"
]
]
},
{
"name": "h5py",
"specs": [
[
">=",
"3.6.0"
]
]
},
{
"name": "luigi",
"specs": [
[
">=",
"3.0.0"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.62.0"
]
]
},
{
"name": "black",
"specs": [
[
">=",
"21.12b0"
]
]
},
{
"name": "flake8",
"specs": [
[
">=",
"4.0.0"
]
]
},
{
"name": "isort",
"specs": [
[
">=",
"5.10.0"
]
]
}
],
"lcname": "moml-ca"
}