`PsmPy`
=====
Matching techniques for epidemiological observational studies, implemented in Python. Propensity score matching is a statistical matching technique for observational data that helps assess whether a causal link between a treatment or intervention and an outcome of interest can plausibly be inferred. It does so by balancing a set of covariates across a binary treatment state (received the intervention or not, as in a randomized controlled trial), which controls for potential confounding when outcomes such as death or length of stay are compared between the treatment and control groups. Applying this technique to observational data gives insight into the effect, or lack thereof, of an intervention.
---
## Citing this work:
A. Kline and Y. Luo, *PsmPy: A Package for Retrospective Cohort Matching in Python,* 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2022, pp. 1354-1357, doi: 10.1109/EMBC48229.2022.9871333.
---
* Integration with Jupyter Notebooks
* Additional plotting functionality to assess balance before and after
* A more modular, user-specified matching process
* Ability to define 1:1 or 1:many matching
---
# Installation
Install the package through pip:
```bash
$ pip install psmpy
```
* [Installation](#installation)
* [Data Preparation](#data-prep)
* [Predict Scores](#predict-scores)
* [Matching algorithm](#matching-algorithm---version-1)
* [Graphical Outputs](#graphical-outputs)
* [Extra Attributes](#extra-attributes)
* [Cohen D Function](#cohen-d-function)
* [Conclusion](#conclusion)
----
# Data Prep
## Import psmpy class and functions
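```python
# import relevant libraries
import seaborn as sns  # imported explicitly so sns.set() does not rely on the star import
from psmpy import PsmPy
from psmpy.functions import cohenD
from psmpy.plotting import *

# set the default figure size and font scale for the plots
sns.set(rc={'figure.figsize':(10,8)}, font_scale = 1.3)
```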
----
```python
# read in your data; `path` is the location of your CSV file
import pandas as pd

df = pd.read_csv(path)
```
----
# Initialize PsmPy Class
Initialize the `PsmPy` class:
```python
psm = PsmPy(df, treatment='treatment', indx='pat_id', exclude = [])
```
**Note:**
* `PsmPy` - the class. It uses all covariates in the dataset unless they are explicitly excluded through the `exclude` argument.
* `df` - the dataframe passed to the class.
* `treatment` - name of the column holding the binary treatment/intervention indicator.
* `exclude` - (optional) list of strings naming covariates (columns) to ignore during model fitting. The unique index column does not need to be listed here; it is handled internally once `indx` is specified.
* `indx` - (required) name of the column holding a unique ID for each case in the dataset.
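
The effect of `treatment`, `indx` and `exclude` on which columns end up as covariates can be sketched in plain Python. This is an illustration only: `select_covariates` is a hypothetical helper, not part of `psmpy`, and the column names (including `zip_code`) are made up.

```python
def select_covariates(columns, treatment, indx, exclude=None):
    """Return the columns that would act as covariates (illustrative only)."""
    exclude = set(exclude or [])
    # The treatment flag and the unique ID are never covariates themselves.
    reserved = {treatment, indx}
    return [c for c in columns if c not in reserved and c not in exclude]


# Hypothetical column names for a small patient dataset.
columns = ["pat_id", "treatment", "age", "sex", "hypertension", "zip_code"]
covariates = select_covariates(
    columns, treatment="treatment", indx="pat_id", exclude=["zip_code"]
)
print(covariates)  # ['age', 'sex', 'hypertension']
```

Leaving `exclude` empty keeps every column other than the treatment and index columns.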
# Predict Scores
Calculate logistic propensity scores/logits:
```python
psm.logistic_ps(balance = True)
```
**Note:**
* `balance` - whether the logistic regression is run in a balanced fashion, default = True.

There is often a significant **class imbalance** in the data, which the software detects automatically when the majority group has more records than the minority group. Setting `balance=True` when calling `psm.logistic_ps()` accounts for this: `PsmPy` samples from the majority group when fitting the logistic regression model so that the two groups are of equal size. This process is repeated until all entries of the major class have been regressed against the minor class in equal partitions. The result is both a logistic propensity score and a logit for each entry.
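
The partitioning idea behind `balance=True` can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `balanced_partitions` is a hypothetical helper, the IDs are made up, and how `PsmPy` treats a final, smaller partition is not described in this document.

```python
import random


def balanced_partitions(major_ids, minor_ids, seed=0):
    """Split the majority group into chunks no larger than the minority group."""
    rng = random.Random(seed)
    shuffled = list(major_ids)
    rng.shuffle(shuffled)
    size = len(minor_ids)
    # Each chunk would be paired with the full minority group for one model fit.
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]


major = list(range(100, 110))  # 10 hypothetical control IDs
minor = [1, 2, 3, 4]           # 4 hypothetical treated IDs
chunks = balanced_partitions(major, minor)
for chunk in chunks:
    print(len(chunk), "controls fitted against", len(minor), "treated")
```

Every majority record appears in exactly one chunk, so each one is scored by a model that saw roughly balanced classes.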
Review values in dataframe:
```python
psm.predicted_data
```
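
The propensity score and the propensity logit carry the same information on different scales. The sketch below assumes the standard definition, logit(p) = ln(p / (1 - p)); the helper names `logit` and `inverse_logit` are illustrative and not taken from `psmpy`.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


for score in (0.1, 0.5, 0.9):
    print(score, round(logit(score), 3))
```

Scores close to 0 or 1 are spread out on the logit scale, which is one common reason for matching on the logit rather than on the raw score.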
---
# Matching algorithm - version 1
Perform KNN matching.
```python
psm.knn_matched(matcher='propensity_logit', replacement=False, caliper=None, drop_unmatched=True)
```
**Note:**
* `matcher` - `propensity_logit` (default), generated in the previous step; the alternative is `propensity_score`. Specifies the column on which matching proceeds.
* `replacement` - `False` (default). Determines whether matching happens with or without replacement. When `replacement=False`, matching is 1:1.
* `caliper` - `None` (default). The user can specify a caliper size relative to the standard deviation of the control sample, restricting the neighbors eligible for matching to those within a certain distance.
* `drop_unmatched` - `True` (default). If an index has no match because of the caliper size, it is removed from `df_matched`, `matched_ids` and subsequent effect size calculations.
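
How `replacement` and `caliper` interact can be seen in a small greedy nearest-neighbor sketch. This is not how `knn_matched` is implemented internally (the document does not describe that); `greedy_match` and the IDs and logit values are hypothetical, and the caliper is interpreted as a multiple of the control-group standard deviation, following the description above.

```python
import statistics


def greedy_match(minor, major, caliper=None, replacement=False):
    """Match each minor-class ID to its nearest major-class ID by logit.

    minor, major: dicts mapping ID -> propensity logit.
    caliper: optional width in units of the major-class standard deviation.
    Returns (major_id, minor_id) pairs; unmatched minor IDs get None.
    """
    max_dist = None
    if caliper is not None:
        max_dist = caliper * statistics.stdev(list(major.values()))
    available = dict(major)
    pairs = []
    for minor_id, value in minor.items():
        if not available:
            pairs.append((None, minor_id))
            continue
        best = min(available, key=lambda k: abs(available[k] - value))
        if max_dist is not None and abs(available[best] - value) > max_dist:
            pairs.append((None, minor_id))  # nothing within the caliper
            continue
        pairs.append((best, minor_id))
        if not replacement:
            del available[best]  # a control can be used only once
    return pairs


treated = {"t1": 0.40, "t2": 0.45, "t3": 3.00}
controls = {"c1": 0.42, "c2": 0.50, "c3": 1.00, "c4": -0.30}
print(greedy_match(treated, controls))
print(greedy_match(treated, controls, replacement=True))
print(greedy_match(treated, controls, caliper=0.5))
```

Without replacement every control is used at most once; with replacement `c1` serves both `t1` and `t2`; with a caliper the distant `t3` is left unmatched.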
---
# Matching algorithm - version 2
Perform KNN matching 1:many
```python
psm.knn_matched_12n(matcher='propensity_logit', how_many=1)
```
**Note:**
* `matcher` - `propensity_logit` (default), generated in the previous step; the alternative is `propensity_score`. Specifies the column on which matching proceeds.
* `how_many` - `1` (default). Performs 1:n matching, where `n` is specified by the user: each minor-class entry is matched to `n` entries from the major class.
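
The 1:n idea can be sketched in the same spirit. The example assumes matching without replacement, which this document does not state for `knn_matched_12n`; `match_one_to_many` and all IDs and values are hypothetical.

```python
def match_one_to_many(minor, major, how_many=1):
    """Give each minor-class ID its `how_many` nearest major-class IDs."""
    available = dict(major)
    matches = {}
    for minor_id, value in minor.items():
        # Rank the remaining controls by distance and keep the closest ones.
        ranked = sorted(available, key=lambda k: abs(available[k] - value))
        chosen = ranked[:how_many]
        matches[minor_id] = chosen
        for major_id in chosen:
            del available[major_id]  # assumption: each control is used at most once
    return matches


treated = {"t1": 0.10, "t2": 0.80}
controls = {"c1": 0.12, "c2": 0.05, "c3": 0.75, "c4": 0.90, "c5": 0.40}
result = match_one_to_many(treated, controls, how_many=2)
print(result)
```

With `how_many=2`, each treated entry receives its two closest controls and `c5` is left unused.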
---
# Graphical Outputs
## Plot the propensity score or propensity logits
Plot the distribution of the propensity scores (or logits) for the two groups side by side. Note that here the names are coded as 'treatment' and 'control' under the assumption that the majority class you are sampling from is the control group. If this is not the case you will need to flip the order of these.
```python
psm.plot_match(Title='Side by side matched controls', Ylabel='Number of patients', Xlabel='Propensity logit', names=['treatment', 'control'], colors=['#E69F00', '#56B4E9'], save=True)
```
**Note:**
* `Title` - 'Side by side matched controls' (default), string, plot title
* `Ylabel` - 'Number of patients' (default), string, label for y-axis
* `Xlabel` - 'Propensity logit' (default), string, label for x-axis
* `names` - ['treatment', 'control'] (default), list of strings for legend
* `colors` - ['#E69F00', '#56B4E9'] (default), list of plotting colors
* `save` - False (default), saves the generated figure to the current working directory if True
## Plot the effect sizes
```python
psm.effect_size_plot(title='Standardized Mean differences across covariates before and after matching', before_color='#FCB754', after_color='#3EC8FB', save=False)
```
**Note:**
* `title` - title of the plot
* `before_color` - color (hex) for the effect size before matching
* `after_color` - color (hex) for the effect size after matching
* `save` - False (default), saves the generated figure to the current working directory if True
---
# Extra Attributes
Other attributes available to user:
## Matched IDs
```python
psm.matched_ids
```
* `matched_ids` - returns a dataframe of indices from the minor class and their associated matched indices from the major class.
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>Major_ID</th>
<th>Minor_ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>6781</td>
<td>9432</td>
</tr>
<tr>
<td>3264</td>
<td>7624</td>
</tr>
<tr>
</tr>
</tbody>
</table>
**Note:**
Not all matches will be unique if `replacement=True`, because the same major-class record can then be matched to more than one minor-class record.
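
One way to check whether any major-class ID appears more than once among the matched pairs is to count them. The sketch below works on plain tuples rather than the `matched_ids` dataframe; the third pair is made up for illustration.

```python
from collections import Counter

# Hypothetical (Major_ID, Minor_ID) pairs; the last pair reuses a major ID.
pairs = [(6781, 9432), (3264, 7624), (6781, 5510)]

major_counts = Counter(major_id for major_id, _ in pairs)
reused = {major_id: n for major_id, n in major_counts.items() if n > 1}
print(reused)  # {6781: 2}
```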
## Matched Dataframe
```python
psm.df_matched
```
* `df_matched` - returns a subset of the original dataframe using indices that were matched. This works regardless of which matching protocol is used.
## Effect sizes per variable
```python
psm.effect_size
```
* `effect_size` - returns dataframe with columns 'variable', 'matching' (before or after), and 'effect_size'
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>variable</th>
<th>matching</th>
<th>effect_size</th>
</tr>
</thead>
<tbody>
<tr>
<td>hypertension</td>
<td>before</td>
<td>0.5</td>
</tr>
<tr>
<td>hypertension</td>
<td>after</td>
<td>0.01</td>
</tr>
<tr>
<td>age</td>
<td>before</td>
<td>...</td>
</tr>
<tr>
<td>age</td>
<td>after</td>
<td>...</td>
</tr>
<tr>
<td>sex</td>
<td>before</td>
<td>...</td>
</tr>
<tr>
</tr>
</tbody>
</table>
**Note:** The thresholds for a small, medium and large effect size were characterized by Cohen in: J. Cohen, "A Power Primer", Psychological Bulletin, vol. 112, no. 1, pp. 155-159, 1992.
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>Relative Size</th>
<th>Effect Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>small</td>
<td> ≤ 0.2</td>
</tr>
<tr>
<td>medium</td>
<td> ≤ 0.5</td>
</tr>
<tr>
<td>large</td>
<td> ≤ 0.8</td>
</tr>
<tr>
</tr>
</tbody>
</table>
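
The cut-offs in the table above can be turned into a small labelling helper. `effect_size_label` is a hypothetical function, not part of `psmpy`, and it assumes that values above 0.8 also count as large.

```python
def effect_size_label(d):
    """Label the magnitude of an effect size using the tabulated cut-offs."""
    magnitude = abs(d)
    if magnitude <= 0.2:
        return "small"
    if magnitude <= 0.5:
        return "medium"
    # Assumption: anything above 0.5, including values beyond 0.8, is "large".
    return "large"


for value in (0.01, 0.5, -0.7, 1.2):
    print(value, effect_size_label(value))
```

Applied to the `hypertension` example above, 0.5 before matching is labelled medium and 0.01 after matching is labelled small.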
---
# Cohen D Function
A function to calculate effect size (Cohen's d) can be imported on its own if needed. It returns a floating point number representing the effect size of a variable between the two levels of the binary treatment.
```python
from psmpy.functions import cohenD
cohenD(df, treatment, metricName)
```
* `df` - dataframe with data under investigation
* `treatment` - name of binary treatment/intervention under investigation
* `metricName` - name of the variable whose difference between the treatment/intervention groups is to be measured
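
For reference, the textbook form of Cohen's d with a pooled sample standard deviation is sketched below. This is not necessarily the exact formula used by `cohenD` (the package may use a different variance convention); `cohens_d` and the sample values are hypothetical.

```python
import math
import statistics


def cohens_d(treated, control):
    """Standardized mean difference using a pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    var1 = statistics.variance(treated)
    var2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# Hypothetical values of one covariate in each group.
treated_values = [2.0, 4.0, 6.0]
control_values = [1.0, 3.0, 5.0]
print(cohens_d(treated_values, control_values))  # 0.5
```

Here the group means differ by 1 and the pooled standard deviation is 2, which gives an effect size of 0.5.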
---
# Conclusion
This package offers a user-friendly propensity score matching protocol for the Python environment. It aims to provide automatic figure generation, contextualization of the results, and flexibility in the matching and modeling protocol, so that it can serve a wide user base.
Raw data
{
"_id": null,
"home_page": null,
"name": "psmpy",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "propensity score matching, statistics, plotting",
"author": null,
"author_email": "Adrienne Kline <askline1@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/fc/44/544360d8102cfa89f2874daf6afa78876cb71398e9d2a8d06472b0c66574/psmpy-0.3.14.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Propensity score matching for python and graphical plots",
"version": "0.3.14",
"project_urls": {
"Documentation": "https://pypi.org/project/psmpy",
"Homepage": "https://github.com/adriennekline/psmpy",
"Issues": "https://github.com/adriennekline/psmpy/issues"
},
"split_keywords": [
"propensity score matching",
" statistics",
" plotting"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2b4c6d39f1e96b2117fb15fbdc9e04c12d14dcb972b95f89396fb98ce4da7868",
"md5": "c9acd0043cb0f0dacc514aa25b15797c",
"sha256": "6b7114fee4a439a035c3257563ab0a2bf9fa7e91f12ec17d2482428a44839e12"
},
"downloads": -1,
"filename": "psmpy-0.3.14-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c9acd0043cb0f0dacc514aa25b15797c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 13862,
"upload_time": "2025-07-18T19:05:57",
"upload_time_iso_8601": "2025-07-18T19:05:57.384879Z",
"url": "https://files.pythonhosted.org/packages/2b/4c/6d39f1e96b2117fb15fbdc9e04c12d14dcb972b95f89396fb98ce4da7868/psmpy-0.3.14-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "fc44544360d8102cfa89f2874daf6afa78876cb71398e9d2a8d06472b0c66574",
"md5": "d04929a2deba1f33b428bf1e17ca71bc",
"sha256": "58ede4d31208f9f0684891a21d72a244cf1cf9e500f427077919c689083d0b47"
},
"downloads": -1,
"filename": "psmpy-0.3.14.tar.gz",
"has_sig": false,
"md5_digest": "d04929a2deba1f33b428bf1e17ca71bc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 16474,
"upload_time": "2025-07-18T19:05:58",
"upload_time_iso_8601": "2025-07-18T19:05:58.550610Z",
"url": "https://files.pythonhosted.org/packages/fc/44/544360d8102cfa89f2874daf6afa78876cb71398e9d2a8d06472b0c66574/psmpy-0.3.14.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-18 19:05:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "adriennekline",
"github_project": "psmpy",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "psmpy"
}