# pKa-predictor
Leveraging our Teaching Experience to Improve Machine Learning: Application to pKa Prediction

Jérôme Genzling, Ziling Luo, Benjamin Weiser, Nicolas Moitessier
nicolas.moitessier@mcgill.ca
2023-12-07 – revised 2025-05-16

# 🔍 What is this?
A Graph Neural Network (GNN) model for:
- Predicting pKa values of ionizable centers
- Identifying protonation sites
- Estimating dominant protonation states at a given pH
- Supporting iterative protonation/deprotonation of polyprotic molecules
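
To make the link between a predicted pKa and the dominant protonation state concrete, here is a minimal sketch of the Henderson-Hasselbalch relationship for a single ionizable site. It is a textbook formula, not the code used by this repository; the function name `fraction_protonated` and the example pKa values are assumptions chosen for illustration.

```python
def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of a single ionizable site that still carries its proton at `ph`.

    Henderson-Hasselbalch gives [A-]/[HA] = 10 ** (pH - pKa), so the
    protonated fraction is 1 / (1 + 10 ** (pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    # Hypothetical pKa values, used only to illustrate the formula.
    for label, pka in [("acid-like site", 4.8), ("amine-like site", 9.5)]:
        frac = fraction_protonated(pka, ph=7.4)
        state = "mostly protonated" if frac > 0.5 else "mostly deprotonated"
        print(f"{label}: pKa={pka}, pH=7.4 -> {state} ({frac:.3f})")
```

A site is half protonated when the pH equals its pKa, mostly protonated below it, and mostly deprotonated above it.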
# 🧪 Core Functionalities
- Input: CSV with SMILES and (optionally) ionizable atom indices
- Output: pKa value(s), and major protonated species at given pH
- Iterative inference for molecules with multiple ionizable centers
- Easily extendable to new datasets or re-trainable on custom data
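
For a molecule with several ionizable centers, a simplified picture is that sites lose their protons one at a time as the pH rises, in order of increasing pKa. The sketch below illustrates that picture with fixed, independent pKa values; it is an assumption-laden toy, not the iterative procedure implemented in this repository, and the helper names `deprotonation_order` and `protons_retained` are hypothetical.

```python
from typing import List


def deprotonation_order(pkas: List[float]) -> List[int]:
    """Return site indices in the order they lose their proton as pH rises."""
    return sorted(range(len(pkas)), key=lambda i: pkas[i])


def protons_retained(pkas: List[float], ph: float) -> int:
    """Count the sites expected to be mostly protonated at `ph`.

    Simplification: each site is treated independently with a fixed pKa,
    so a site keeps its proton whenever its pKa is above the pH.
    """
    return sum(1 for pka in pkas if pka > ph)


if __name__ == "__main__":
    # Hypothetical pKa values for a molecule with three ionizable sites.
    site_pkas = [9.7, 2.3, 6.0]
    print("Deprotonation order (site indices):", deprotonation_order(site_pkas))
    print("Protons retained at pH 7.4:", protons_retained(site_pkas, 7.4))
```

In a real molecule, removing one proton changes the pKa of the remaining sites, so fixed per-site values like these are only a rough guide.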
# 📦 Required Libraries
Install with pip:
pip install torch torch_geometric pandas numpy rdkit seaborn hyperopt
You can also recreate our virtual environment using environment.yml
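
After installing, it can help to confirm that every library is importable before running the model. The following is a small sketch using only the Python standard library; the helper name `missing_packages` is an assumption, and the list simply mirrors the pip command above.

```python
import importlib.util

# Import names matching the packages listed in the pip command.
REQUIRED = ["torch", "torch_geometric", "pandas", "numpy", "rdkit", "seaborn", "hyperopt"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that the packages can be located; it does not verify versions or GPU support.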
# 📁 Repository Structure
- Datasets/ : All cleaned, split, and raw datasets
- Baseline_Models/Descriptors/ : Code to generate traditional descriptors
- Baseline_Models/RF, /XGB : Traditional model training scripts (Random Forest/XGB)
- GNN/ : All code related to GNN/GAT models
- MolGpKa_retrained/ : Code and data for retraining MolGpKa
# 🚀 Getting Started with the GNN
## 1. See available options
python main.py --mode usage
All possible arguments and their default values will be printed.
## 2. Predict pKa on a sample set
Your CSV will need to have at least two columns: 'Name' and 'Smiles'
On Windows:
python main.py --mode infer --input your_input.csv > infer_your_input.out
On Linux:
python main.py --mode infer --data_path ../Datasets/ --input your_input.csv --infer_pickled ../Datasets/pickled_data/infer_pickled.pkl --model_dir ../Model/ > infer_your_input.out
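
If you do not yet have an input file, the sketch below writes a CSV with the two required columns, 'Name' and 'Smiles', using the standard library. The helper name `write_input_csv` and the example molecules are assumptions for illustration; any optional columns the tool may accept are not shown.

```python
import csv

# Example molecules (name, SMILES), chosen only for illustration.
EXAMPLE_MOLECULES = [
    ("acetic_acid", "CC(=O)O"),
    ("pyridine", "c1ccncc1"),
]


def write_input_csv(path, molecules):
    """Write (name, smiles) pairs to `path` with a 'Name','Smiles' header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Smiles"])
        for name, smiles in molecules:
            writer.writerow([name, smiles])
```

Calling `write_input_csv("your_input.csv", EXAMPLE_MOLECULES)` produces a file that can be passed to `--input`.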
## 3. Predict from a CSV in Python
You can also use the predict() function directly:
from predict import predict
predicted_pkas, protonated_smiles = predict("your_dataset.csv", pH=7.4)
## 4. Verbose Levels
Use the --verbose flag to control output detail:
- --verbose 0: Silent mode; no details are printed
- --verbose 1: Summary of predictions plus some cleaning details
- --verbose 2: Detailed view of every deprotonation step
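
When running many inputs, it may be convenient to launch the command-line interface from a Python script. The sketch below assembles the inference command from the flags described above (`--mode infer`, `--input`, `--verbose`) and redirects the output to a file. The helper names `build_infer_command` and `run_inference` are hypothetical, and the sketch assumes it is run from the directory that contains main.py.

```python
import subprocess
import sys


def build_infer_command(input_csv, verbose=1):
    """Assemble the argument list for an inference run of main.py."""
    if verbose not in (0, 1, 2):
        raise ValueError("verbose must be 0, 1, or 2")
    return [
        sys.executable, "main.py",
        "--mode", "infer",
        "--input", input_csv,
        "--verbose", str(verbose),
    ]


def run_inference(input_csv, log_path, verbose=1):
    """Run inference and write everything printed to stdout into `log_path`."""
    with open(log_path, "w") as log:
        return subprocess.run(
            build_infer_command(input_csv, verbose), stdout=log, check=True
        )


if __name__ == "__main__":
    # Only print the command here; call run_inference() to actually execute it.
    print(" ".join(build_infer_command("your_input.csv", verbose=2)))
```

Passing the arguments as a list avoids shell quoting issues, which matters when paths contain spaces or backslashes.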
# 📖 Citation
If you use this code or model, please cite:
Genzling J, Luo Z, Weiser B, Moitessier N. Leveraging our Teacher’s Experience to Improve Machine Learning: Application to pKa Prediction. ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-bpd53-v2
This content is a preprint and has not been peer-reviewed.
# 🧠 Tips
- Use the Cheminfo SMILES viewer to visualize and debug SMILES (https://www.cheminfo.org/Chemistry/Cheminformatics/Smiles/index.html).
- If protonation states are off, check atom indexing or consider using neutral forms.
- You can retrain on your own dataset by modifying train_pKa_predictor.py.
# 🛠 Support
Feel free to reach out via email or GitHub issues if you need help using or adapting the model.
Raw data
{
"_id": null,
"home_page": "https://github.com/MoitessierLab/pKa-predictor",
"name": "pka-predictor-moitessier",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "pKa prediction GNN chemistry rdkit",
"author": "Moitessier Lab",
"author_email": "nicolas.moitessier@mcgill.ca",
"download_url": "https://files.pythonhosted.org/packages/f5/5e/92f8942d2e1224a46d3eb1c3ebcb0985e19b474d76b1573ed9ea3702f962/pka_predictor_moitessier-0.1.12.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0",
"summary": "Graph-based pKa prediction for small molecules",
"version": "0.1.12",
"project_urls": {
"Homepage": "https://github.com/MoitessierLab/pKa-predictor"
},
"split_keywords": [
"pka",
"prediction",
"gnn",
"chemistry",
"rdkit"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f58d291f775255ee862f359058a24ee4404bfb64e91f1facb9ae257b6977f1c2",
"md5": "d3115f8be103e67b2991d2951a93d4f4",
"sha256": "5c478fc6e1518056dc7836594505f4141e0a86eb7a684e9542682486758f7232"
},
"downloads": -1,
"filename": "pka_predictor_moitessier-0.1.12-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d3115f8be103e67b2991d2951a93d4f4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 9109704,
"upload_time": "2025-07-29T17:46:51",
"upload_time_iso_8601": "2025-07-29T17:46:51.531341Z",
"url": "https://files.pythonhosted.org/packages/f5/8d/291f775255ee862f359058a24ee4404bfb64e91f1facb9ae257b6977f1c2/pka_predictor_moitessier-0.1.12-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f55e92f8942d2e1224a46d3eb1c3ebcb0985e19b474d76b1573ed9ea3702f962",
"md5": "5b98a6b026036997f77f9feb5b46cd68",
"sha256": "1a9103c3b331c1747a06d03e464379c6812790799a58c3ca9b339b8d74485baf"
},
"downloads": -1,
"filename": "pka_predictor_moitessier-0.1.12.tar.gz",
"has_sig": false,
"md5_digest": "5b98a6b026036997f77f9feb5b46cd68",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 9104142,
"upload_time": "2025-07-29T17:46:53",
"upload_time_iso_8601": "2025-07-29T17:46:53.667761Z",
"url": "https://files.pythonhosted.org/packages/f5/5e/92f8942d2e1224a46d3eb1c3ebcb0985e19b474d76b1573ed9ea3702f962/pka_predictor_moitessier-0.1.12.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-29 17:46:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "MoitessierLab",
"github_project": "pKa-predictor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pka-predictor-moitessier"
}