# MASSA Algorithm
MASSA Algorithm: A tool for separating data sets of molecules into training and test sets. Developed with the objective of preparing data sets for the generation of prediction models in cheminformatics.
## Installation
MASSA Algorithm can be installed using pip:
```
pip install MASSA_Algorithm
```
To upgrade to the latest version (recommended), also use pip:
```
pip install --upgrade MASSA_Algorithm
```
Alternatively, you can build the latest development version from source:
```
git clone https://github.com/gcverissimo/MASSA_Algorithm.git
cd MASSA_Algorithm
python setup.py install
```
### Requirements
* python: >= 3.8;
* rdkit;
* numpy: < 2.0;
* pandas;
* matplotlib: >= 3.2;
* scipy: >= 1.6;
* scikit-learn: > 0.24;
* kmodes:¹ >= 0.10.
## Usage
Once installed, the program can be run directly from the command line:
```
MASSA_Algorithm -i <input_file>.sdf -o <output_file>.sdf
```
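When the split is one step in a larger Python workflow, the same command can be launched with the standard library's `subprocess` module. The sketch below is not part of MASSA: the helper names `build_massa_command` and `run_massa` and the file names are hypothetical, and only the `-i` and `-o` flags shown above are used.

```python
import subprocess


def build_massa_command(input_path, output_path):
    """Return the argument list for a basic MASSA_Algorithm call (hypothetical helper)."""
    return ["MASSA_Algorithm", "-i", str(input_path), "-o", str(output_path)]


def run_massa(input_path, output_path):
    """Launch MASSA_Algorithm and raise CalledProcessError if it exits with an error."""
    # stdin is inherited, so any interactive prompt from the program still works.
    subprocess.run(build_massa_command(input_path, output_path), check=True)


# Show the command that would be executed; call run_massa(...) to actually launch it.
cmd = build_massa_command("molecules.sdf", "molecules_split.sdf")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.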
Required arguments:
* **Input file**: ```-i``` or ```--input```.
  * Accepted input formats are .sdf, .mol, .mol2, .xlsx, .xls, and .csv; the .sdf format is preferred. Notes:
    * .mol2 files have limitations in storing molecular properties and follow different saving patterns. For this reason, only .mol2 files generated by Discovery Studio Visualizer are supported.
    * For .xlsx, .xls, and .csv:
      * MASSA will look for a column with smiles in its name and use it as the source of input molecules.
      * MASSA will look for a column with one of the following names, in this priority order: Molecule name, Name, Molecule ChEMBL ID, or ID. If none of these columns is found, it will use the molecule index.
* **Output file**: ```-o``` or ```--output```.
  * Enter the output file name or file path. Image files will be saved to a folder within the same directory as the output file.
  * It is highly recommended to use an .sdf file to avoid errors.
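The column lookup described for spreadsheet and .csv inputs can be pictured with a small sketch. This is an illustration of the documented rules, not MASSA's actual code: the function names are hypothetical, and the case-insensitive match on "smiles" and the exact match on name columns are assumptions.

```python
# Name columns in the priority order given in the notes above.
NAME_PRIORITY = ["Molecule name", "Name", "Molecule ChEMBL ID", "ID"]


def find_smiles_column(columns):
    """Return the first column whose name contains 'smiles' (case-insensitive: an assumption)."""
    for col in columns:
        if "smiles" in col.lower():
            return col
    return None


def find_name_column(columns):
    """Return the highest-priority name column, or None to fall back to the molecule index."""
    for candidate in NAME_PRIORITY:
        if candidate in columns:  # exact match: an assumption
            return candidate
    return None


columns = ["Molecule ChEMBL ID", "Canonical SMILES", "pIC50"]
print(find_smiles_column(columns))  # Canonical SMILES
print(find_name_column(columns))    # Molecule ChEMBL ID
```

Checking the header of a file this way before running MASSA can help spot a missing SMILES column early.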
Optional arguments include:
* **Percentage of molecules in training set**: ```-p``` or ```--percentage_of_training```.
  * Fraction of molecules assigned to the training set. Must be a number from 0 to 1.
  * Default = 0.8.
* **Number of biological activities for separation**: ```-b``` or ```--number_of_biological```.
  * Number of biological activities that will be used to separate the set into training and test.
  * Default = 1.
* **Name of biological activities for separation**: ```-s``` or ```--the_biological_activities```.
  * The biological activity(ies) or other y-property(ies) used for QSAR or machine learning modeling. The algorithm can handle one or more properties. Enter the names of the biological activities as a list separated by commas, with no spaces. These properties must be represented as either integers or floating-point numbers. If your dataset is represented in classes (e.g., active or inactive; soluble, partially soluble, or insoluble), you should represent them as integers (e.g., 0 or 1; 1, 2, or 3).
  * Example: ```MASSA_Algorithm -i <input_file>.sdf -o <output_file>.sdf -s pIC50,pMIC```.
  * Default = none; if the names are not entered on the command line, they will be requested during execution.
* **Number of principal components in PCA**: ```-n``` or ```--number_of_PCs```.
  * Defines the number of principal components used to reduce the dimensionality of the variables in the biological, physicochemical, and structural domains. If the value is a decimal between 0 and 1, the number of principal components is chosen so that they explain (```<input number>``` * 100)% of the variance. If the value is greater than 1, the number of PCs will be exactly the input integer, with the following caveats:
    1) If the number of PCs is an integer equal to or greater than the number of physicochemical properties (7), the PCA step will be bypassed for the physicochemical domain.
    2) The same applies to the biological domain: if the number of PCs is equal to or greater than the number of biological activities, PCA is bypassed for that domain.
    3) If the number of biological activities is less than 3, the PCA step will be bypassed for the biological domain.
  * Default = 0.85.
* **SVD solver parameter for PCA**: ```-v``` or ```--svd_solver_for_PCA```.
  * See the sklearn.decomposition.PCA topic on https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html for more info.
  * Default = full.
* **Extension of image files**: ```-t``` or ```--image_type```.
  * Extension of the image files that will be generated. Suggested = png or svg.
  * Default = png.
* **Font size for X-axis of dendrograms**: ```-d``` or ```--dendrogram_Xfont_size```.
  * Sets the font size on the x-axis of the dendrogram (molecule labels).
  * Default = 5.
* **Font size for X-axis of bar plots**: ```-x``` or ```--barplot_Xfont_size```.
  * Sets the font size on the x-axis of the bar plot (cluster labels).
  * Default = 12.
* **HCA linkage method**: ```-l``` or ```--linkage_method```.
  * The linkage criterion to use. The algorithm will merge the pairs of clusters that minimize this criterion.
  * Options = complete, single, ward, average, weighted, centroid, median. For more info, see the scipy.cluster.hierarchy.linkage topic on https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html?highlight=linkage#scipy.cluster.hierarchy.linkage.
  * Default = complete.
* **Enable Dendrogram plot**: ```-f``` or ```--dendrogram_plot```.
  * Defines whether or not dendrogram images will be generated.
  * Options = true (dendrogram will be generated), false (dendrogram will not be generated).
  * Default = true.
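The bypass rules listed for ```-n``` / ```--number_of_PCs``` can be summarised as a small decision function. This is a reading of the three notes above, not MASSA's implementation: the function name and its parameters are hypothetical, and treating any whole number of 1 or more as a component count is an assumption about the boundary case.

```python
N_PHYSICOCHEMICAL = 7  # number of physicochemical properties stated in the -n notes


def pca_is_bypassed(n_pcs, n_features, is_biological=False):
    """Return True if PCA would be skipped for a domain with n_features variables."""
    # Note 3: fewer than 3 biological activities always skips PCA.
    if is_biological and n_features < 3:
        return True
    # A value between 0 and 1 is a variance fraction, so notes 1 and 2 do not apply.
    is_component_count = float(n_pcs) >= 1 and float(n_pcs).is_integer()
    # Notes 1 and 2: a count that meets or exceeds the number of variables skips PCA.
    return is_component_count and n_pcs >= n_features


print(pca_is_bypassed(0.85, N_PHYSICOCHEMICAL))       # False: default variance fraction
print(pca_is_bypassed(7, N_PHYSICOCHEMICAL))          # True: 7 PCs for 7 properties
print(pca_is_bypassed(0.85, 2, is_biological=True))   # True: only 2 activities
```

With the default settings (```-n 0.85``` and one biological activity), this reading implies that PCA is applied to the physicochemical domain and skipped for the biological one.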
#### Command line help
A full description of the arguments can also be viewed directly from the command line using the command:
```
MASSA_Algorithm -h
```
or
```
MASSA_Algorithm --help
```
## Cite
1. Please cite the MASSA article:
Veríssimo, G. C.; Panteleão, S. Q.; Fernandes, P. O.; Gertrudes, J. C.; Kronenberger, T.; Honorio, K. M.; Maltarollo, V. G. MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling. J. Comput. Aided Mol. Des. 2023, 37, 735–754. https://doi.org/10.1007/s10822-023-00536-y.
2. Please also cite the program itself, using the following template:
```
@Misc{veríssimo2021,
author = {Gabriel Corrêa Veríssimo},
title = {MASSA Algorithm: Molecular data set sampling for training-test separation},
howpublished = {\url{https://github.com/gcverissimo/MASSA_Algorithm}},
year = {2021}
}
```
## References
[1]: DE VOS, N. J. kmodes categorical clustering library. https://github.com/nicodv/kmodes. 2015-2021.
Raw data
{
"_id": null,
"home_page": "https://github.com/gcverissimo/MASSA_Algorithm",
"name": "MASSA-Algorithm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "chemoinformatics, training, test, training-test, dataset preparation, data set preparation, rdkit",
"author": "Gabriel Corr\u00eaa Ver\u00edssimo",
"author_email": "gcverissimo@outlook.com",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "AGPLv3",
"summary": "MASSA Algorithm is a Python package to separate data sets of molecules into training and test sets, considering the diversity of structural, physicochemical and biological characteristics of these molecules.",
"version": "0.9.6",
"project_urls": {
"Homepage": "https://github.com/gcverissimo/MASSA_Algorithm"
},
"split_keywords": [
"chemoinformatics",
" training",
" test",
" training-test",
" dataset preparation",
" data set preparation",
" rdkit"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f7aacd66d32407718804f25048a97475f585c2128b89cc7cf43d445be8b725a3",
"md5": "4d04a0e0170a4bccdf22dea7f4a587c4",
"sha256": "316657cbb44183d4dcaad43bf8b2ea31c68623eaa4bc2ca6a233a2411e67a209"
},
"downloads": -1,
"filename": "MASSA_Algorithm-0.9.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4d04a0e0170a4bccdf22dea7f4a587c4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 35519,
"upload_time": "2024-12-04T19:52:52",
"upload_time_iso_8601": "2024-12-04T19:52:52.230333Z",
"url": "https://files.pythonhosted.org/packages/f7/aa/cd66d32407718804f25048a97475f585c2128b89cc7cf43d445be8b725a3/MASSA_Algorithm-0.9.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-04 19:52:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "gcverissimo",
"github_project": "MASSA_Algorithm",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "massa-algorithm"
}