# SCD Polars Matching Plugin
A high-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology to avoid immortal time bias.
## Overview
This plugin implements time-to-event methodology for matching severe chronic disease (SCD) cases with controls, processing cases chronologically and ensuring controls are eligible at the time of each case's diagnosis.
## Installation
```bash
pip install scd-polars-matching-plugin
```
Or build from source:
```bash
maturin develop
```
## Usage
```python
from matching_plugin import complete_scd_matching_workflow

# Perform complete matching workflow
matched_df = complete_scd_matching_workflow(
    mfr_data=mfr_df,
    lpr_data=lpr_df,
    vital_data=vital_df,  # Optional
    matching_ratio=5,
    birth_date_window_days=30,
    parent_birth_date_window_days=365,
    match_parent_birth_dates=True,
    match_parity=True,
)
```
## Input Data Formats
### MFR Data (Birth Registry)
Required columns:
- `PNR`: Person identifier (string)
- `FOEDSELSDATO`: Birth date (date)
- `CPR_MODER`: Mother's identifier (string)
- `CPR_FADER`: Father's identifier (string)
- `MODER_FOEDSELSDATO`: Mother's birth date (date)
- `FADER_FOEDSELSDATO`: Father's birth date (date)
- `PARITET`: Birth order/parity (integer)
**Example MFR Data:**
```
┌─────────────┬──────────────┬─────────────┬─────────────┬────────────────────┬────────────────────┬─────────┐
│ PNR         ┆ FOEDSELSDATO ┆ CPR_MODER   ┆ CPR_FADER   ┆ MODER_FOEDSELSDATO ┆ FADER_FOEDSELSDATO ┆ PARITET │
├─────────────┼──────────────┼─────────────┼─────────────┼────────────────────┼────────────────────┼─────────┤
│ person_0001 ┆ 1995-01-15   ┆ mother_0001 ┆ father_0001 ┆ 1970-03-22         ┆ 1968-07-10         ┆ 1       │
│ person_0002 ┆ 1995-02-20   ┆ mother_0002 ┆ father_0002 ┆ 1972-11-15         ┆ 1969-05-03         ┆ 2       │
│ person_0003 ┆ 1995-03-10   ┆ mother_0003 ┆ father_0003 ┆ 1973-08-07         ┆ 1971-12-25         ┆ 1       │
└─────────────┴──────────────┴─────────────┴─────────────┴────────────────────┴────────────────────┴─────────┘
```
### LPR Data (Patient Registry)
Required columns:
- `PNR`: Person identifier (string)
- `SCD_STATUS`: Disease status ("SCD", "SCD_LATE", "NO_SCD")
- `SCD_DATE`: Diagnosis date (date, null for non-cases)
- `ICD_CODE`: Diagnosis code (string, optional)
**Example LPR Data:**
```
┌─────────────┬────────────┬────────────┬──────────┐
│ PNR         ┆ SCD_STATUS ┆ SCD_DATE   ┆ ICD_CODE │
├─────────────┼────────────┼────────────┼──────────┤
│ person_0001 ┆ SCD        ┆ 1997-06-15 ┆ D57.1    │
│ person_0002 ┆ NO_SCD     ┆ null       ┆ null     │
│ person_0003 ┆ SCD_LATE   ┆ 2001-03-22 ┆ D57.0    │
│ person_0004 ┆ NO_SCD     ┆ null       ┆ null     │
└─────────────┴────────────┴────────────┴──────────┘
```
### Vital Events Data (Optional)
Required columns:
- `PNR`: Person identifier (string)
- `EVENT`: Event type ("DEATH", "EMIGRATION")
- `EVENT_DATE`: Event date (date)
- `ROLE`: Individual role ("CHILD", "PARENT")
**Example Vital Events Data:**
```
┌─────────────┬────────────┬────────────┬────────┐
│ PNR         ┆ EVENT      ┆ EVENT_DATE ┆ ROLE   │
├─────────────┼────────────┼────────────┼────────┤
│ person_0001 ┆ EMIGRATION ┆ 1999-12-01 ┆ CHILD  │
│ mother_0002 ┆ DEATH      ┆ 1998-07-15 ┆ PARENT │
│ person_0004 ┆ DEATH      ┆ 2000-03-10 ┆ CHILD  │
│ father_0001 ┆ EMIGRATION ┆ 1997-11-20 ┆ PARENT │
└─────────────┴────────────┴────────────┴────────┘
```
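One way to think about this table is as a lookup of each person's earliest exit date, which can then be compared against an index date. The sketch below uses only the Python standard library and plain tuples rather than the plugin itself. The helper names `earliest_exit_dates` and `is_present` are hypothetical, and treating an event on the index date itself as an exit is an assumption, not documented plugin behaviour.

```python
from datetime import date

# Rows mirror the example table above: (PNR, EVENT, EVENT_DATE, ROLE).
vital_rows = [
    ("person_0001", "EMIGRATION", date(1999, 12, 1), "CHILD"),
    ("mother_0002", "DEATH", date(1998, 7, 15), "PARENT"),
    ("person_0004", "DEATH", date(2000, 3, 10), "CHILD"),
    ("father_0001", "EMIGRATION", date(1997, 11, 20), "PARENT"),
]

def earliest_exit_dates(rows):
    """Map each PNR to its earliest death or emigration date (hypothetical helper)."""
    exits = {}
    for pnr, _event, event_date, _role in rows:
        if pnr not in exits or event_date < exits[pnr]:
            exits[pnr] = event_date
    return exits

def is_present(pnr, index_date, exits):
    """True if no exit is recorded on or before index_date (assumed rule)."""
    exit_date = exits.get(pnr)
    return exit_date is None or exit_date > index_date

exits = earliest_exit_dates(vital_rows)
# person_0001 emigrates in 1999, so is still present in mid-1997.
present_1997 = is_present("person_0001", date(1997, 6, 15), exits)
# person_0004 dies in 2000, so is no longer present in 2001.
present_2001 = is_present("person_0004", date(2001, 1, 1), exits)
```

People with no row in the table are treated as present throughout, which matches the data being optional.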
### Data Relationships
- **MFR and LPR**: Must be joined on `PNR` to combine birth registry and patient data
- **Vital Events**: Optional supplementary data that tracks death/emigration events
- **Parent Links**: `CPR_MODER` and `CPR_FADER` in MFR link to parent `PNR` values in vital events
- **Temporal Logic**: All dates must be proper date types for chronological processing
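The relationships above can be illustrated without Polars. The following is a minimal standard-library sketch, assuming the registry rows arrive with dates as ISO 8601 strings. It converts them to `datetime.date` values and attaches LPR information to each MFR record by `PNR`. The `parse_date` helper and the choice to default people missing from LPR to `"NO_SCD"` are assumptions made for illustration.

```python
from datetime import date

# Raw rows as they might arrive from a text export, with dates still as strings.
mfr_rows = [
    {"PNR": "person_0001", "FOEDSELSDATO": "1995-01-15", "PARITET": 1},
    {"PNR": "person_0002", "FOEDSELSDATO": "1995-02-20", "PARITET": 2},
]
lpr_rows = [
    {"PNR": "person_0001", "SCD_STATUS": "SCD", "SCD_DATE": "1997-06-15"},
    {"PNR": "person_0002", "SCD_STATUS": "NO_SCD", "SCD_DATE": None},
]

def parse_date(value):
    """Convert an ISO 8601 string to a date, passing None through for non-cases."""
    return None if value is None else date.fromisoformat(value)

# Index LPR by PNR, then attach diagnosis information to each birth record.
lpr_by_pnr = {row["PNR"]: row for row in lpr_rows}
combined = []
for birth in mfr_rows:
    patient = lpr_by_pnr.get(birth["PNR"], {})
    combined.append({
        "PNR": birth["PNR"],
        "FOEDSELSDATO": parse_date(birth["FOEDSELSDATO"]),
        "PARITET": birth["PARITET"],
        # Assumption: a person absent from LPR is treated as a non-case.
        "SCD_STATUS": patient.get("SCD_STATUS", "NO_SCD"),
        "SCD_DATE": parse_date(patient.get("SCD_DATE")),
    })
```

Once the values are real dates, comparisons such as "diagnosed before the index date" become ordinary `<` checks rather than string comparisons.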
## Output Format
The function returns a Polars DataFrame with the following columns:
- `MATCH_INDEX`: Unique identifier for each case-control group (integer)
- `PNR`: Person identifier (string)
- `ROLE`: Individual role in the match ("case" or "control")
- `INDEX_DATE`: SCD diagnosis date from the case (date)
### Example Output
```
┌─────────────┬─────────────┬─────────┬────────────┐
│ MATCH_INDEX ┆ PNR         ┆ ROLE    ┆ INDEX_DATE │
├─────────────┼─────────────┼─────────┼────────────┤
│ 1           ┆ person_0001 ┆ case    ┆ 1997-01-01 │
│ 1           ┆ person_0002 ┆ control ┆ 1997-01-01 │
│ 1           ┆ person_0003 ┆ control ┆ 1997-01-01 │
│ 2           ┆ person_0004 ┆ case    ┆ 1997-06-15 │
│ 2           ┆ person_0005 ┆ control ┆ 1997-06-15 │
└─────────────┴─────────────┴─────────┴────────────┘
```
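A quick sanity check on output in this long format is to group rows by `MATCH_INDEX` and count the controls attached to each case. The sketch below does this with the standard library on the example rows above. It assumes each group contains exactly one case and a single shared `INDEX_DATE`, which the column descriptions suggest but do not state outright.

```python
from collections import defaultdict
from datetime import date

# Rows mirror the example output: (MATCH_INDEX, PNR, ROLE, INDEX_DATE).
matched_rows = [
    (1, "person_0001", "case", date(1997, 1, 1)),
    (1, "person_0002", "control", date(1997, 1, 1)),
    (1, "person_0003", "control", date(1997, 1, 1)),
    (2, "person_0004", "case", date(1997, 6, 15)),
    (2, "person_0005", "control", date(1997, 6, 15)),
]

groups = defaultdict(lambda: {"case": [], "control": [], "dates": set()})
for match_index, pnr, role, index_date in matched_rows:
    groups[match_index][role].append(pnr)
    groups[match_index]["dates"].add(index_date)

controls_per_case = {}
for match_index, group in sorted(groups.items()):
    # Assumed invariants: one case per group, one index date per group.
    assert len(group["case"]) == 1
    assert len(group["dates"]) == 1
    controls_per_case[group["case"][0]] = len(group["control"])
```

Groups with fewer controls than `matching_ratio` are worth inspecting, since they indicate cases for which few eligible controls were found.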
## Key Features
### Risk-Set Sampling
- **Chronological Processing**: Cases are processed in order of diagnosis date
- **Temporal Validity**: Controls must be eligible (alive, present, undiagnosed) at case diagnosis time
- **No Immortal Time Bias**: Individuals who are diagnosed with SCD later can still serve as controls for earlier cases, as long as they are undiagnosed at that case's diagnosis date
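The three points above can be sketched in a few lines of standard-library Python. This is an illustration of risk-set sampling in general, not the plugin's Rust implementation. The cohort, the `risk_set` and `sample_controls` helpers, the fixed random seed, and the choice to let the same person serve as a control for more than one case are all assumptions, and the matching criteria are left out to keep the example short.

```python
import random
from datetime import date

# Hypothetical cohort: PNR -> diagnosis date (None if never diagnosed).
diagnosis = {
    "p1": date(1997, 6, 15),
    "p2": None,
    "p3": date(2001, 3, 22),
    "p4": None,
}
# Hypothetical exits: PNR -> death or emigration date.
exit_date = {"p4": date(1996, 1, 1)}

def risk_set(case_pnr, index_date):
    """People still at risk at index_date: undiagnosed so far and not yet exited."""
    eligible = []
    for pnr, dx in diagnosis.items():
        if pnr == case_pnr:
            continue
        undiagnosed = dx is None or dx > index_date
        present = pnr not in exit_date or exit_date[pnr] > index_date
        if undiagnosed and present:
            eligible.append(pnr)
    return eligible

def sample_controls(matching_ratio, seed=0):
    """Process cases in order of diagnosis date and draw controls from each risk set."""
    rng = random.Random(seed)
    cases = sorted((dx, pnr) for pnr, dx in diagnosis.items() if dx is not None)
    matches = {}
    for index_date, case_pnr in cases:
        pool = risk_set(case_pnr, index_date)
        matches[case_pnr] = rng.sample(pool, min(matching_ratio, len(pool)))
    return matches

matches = sample_controls(matching_ratio=2)
```

Here `p3` is a valid control for `p1` because `p3` is diagnosed only in 2001, while `p1` cannot be a control for `p3` because `p1` is already diagnosed by then. `p4` never enters a risk set because the exit precedes both index dates.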
### Matching Criteria
- **Birth Date Window**: Match controls within specified days of case birth date
- **Parent Birth Dates**: Optional matching on parental birth dates with configurable windows
- **Parity Matching**: Optional matching on birth order
- **Vital Status**: Optional incorporation of death/emigration events
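As a rough illustration of the birth date and parity criteria, the predicate below checks whether one person is a candidate control for another. It is a standard-library sketch with a hypothetical `is_candidate` helper. Treating the window as inclusive and parity as an exact equality check are assumptions about how the criteria behave.

```python
from datetime import date

def is_candidate(case, control, birth_date_window_days=30, match_parity=True):
    """Check the birth date window and, optionally, parity (illustrative rule only)."""
    gap = abs((case["FOEDSELSDATO"] - control["FOEDSELSDATO"]).days)
    if gap > birth_date_window_days:
        return False
    if match_parity and case["PARITET"] != control["PARITET"]:
        return False
    return True

case = {"FOEDSELSDATO": date(1995, 1, 15), "PARITET": 1}
near_same_parity = {"FOEDSELSDATO": date(1995, 2, 10), "PARITET": 1}   # 26 days apart
near_other_parity = {"FOEDSELSDATO": date(1995, 2, 10), "PARITET": 2}  # 26 days apart
far_same_parity = {"FOEDSELSDATO": date(1995, 3, 10), "PARITET": 1}    # 54 days apart

results = [
    is_candidate(case, near_same_parity),
    is_candidate(case, near_other_parity),
    is_candidate(case, near_other_parity, match_parity=False),
    is_candidate(case, far_same_parity),
]
```

Widening `birth_date_window_days` or disabling `match_parity` enlarges the pool of candidates at the cost of looser matching.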
### Performance
- **Rust Implementation**: High-performance core algorithms
- **Polars Integration**: Seamless integration with Polars DataFrames
- **Memory Efficient**: Optimized for large datasets
## Parameters
- `matching_ratio`: Number of controls per case (default: 5)
- `birth_date_window_days`: Maximum birth date difference in days (default: 30)
- `parent_birth_date_window_days`: Maximum parent birth date difference in days (default: 365)
- `match_parent_birth_dates`: Enable parent birth date matching (default: True)
- `match_mother_birth_date_only`: Match only maternal birth dates (default: False)
- `require_both_parents`: Require both parents for matching (default: False)
- `match_parity`: Enable parity matching (default: True)
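The parent-related flags interact, and the sketch below shows one possible reading of them in plain Python. The `parents_match` helper is hypothetical, and its handling of missing parental dates (skipped unless `require_both_parents` is set) is an assumption, since the parameter list above does not spell out the exact semantics.

```python
from datetime import date

def parents_match(case, control, parent_birth_date_window_days=365,
                  match_mother_birth_date_only=False, require_both_parents=False):
    """Compare parental birth dates under one assumed reading of the flags."""
    if match_mother_birth_date_only:
        keys = ["MODER_FOEDSELSDATO"]
    else:
        keys = ["MODER_FOEDSELSDATO", "FADER_FOEDSELSDATO"]
    for key in keys:
        a, b = case.get(key), control.get(key)
        if a is None or b is None:
            # Assumed: a missing date only disqualifies when both parents are required.
            if require_both_parents:
                return False
            continue
        if abs((a - b).days) > parent_birth_date_window_days:
            return False
    return True

# Parental dates for person_0001 and person_0002 from the MFR example.
case = {"MODER_FOEDSELSDATO": date(1970, 3, 22), "FADER_FOEDSELSDATO": date(1968, 7, 10)}
control_a = {"MODER_FOEDSELSDATO": date(1972, 11, 15), "FADER_FOEDSELSDATO": date(1969, 5, 3)}
# Hypothetical control with a close maternal birth date and no recorded father.
control_b = {"MODER_FOEDSELSDATO": date(1970, 9, 1), "FADER_FOEDSELSDATO": None}

outside_window = parents_match(case, control_a)
father_missing_ok = parents_match(case, control_b)
father_missing_required = parents_match(case, control_b, require_both_parents=True)
```

Under this reading, `control_a` is rejected because the mothers were born more than 365 days apart, while `control_b` is accepted unless both parents are required.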
## License
This project is licensed under the MIT License.
Raw data
{
"_id": null,
"home_page": null,
"name": "scd-polars-matching-plugin",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "epidemiology, case-control, matching, polars, rust, scd",
"author": "Tobias Kragholm",
"author_email": null,
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "High-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology",
"version": "0.1.0",
"project_urls": null,
"split_keywords": [
"epidemiology",
" case-control",
" matching",
" polars",
" rust",
" scd"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2e61828a757511f2910bdab34d89bc9e7bd9acd3daf6c26a387473749facf00e",
"md5": "e4fc36bc6a4ea780406b8c3109fa4f8a",
"sha256": "e6ed38d3ca06144a209c9d53f04637694bd3ee80cddb9b8563c3bafc2796f53a"
},
"downloads": -1,
"filename": "scd_polars_matching_plugin-0.1.0-cp39-abi3-win_amd64.whl",
"has_sig": false,
"md5_digest": "e4fc36bc6a4ea780406b8c3109fa4f8a",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.11",
"size": 29474092,
"upload_time": "2025-08-27T11:06:32",
"upload_time_iso_8601": "2025-08-27T11:06:32.100485Z",
"url": "https://files.pythonhosted.org/packages/2e/61/828a757511f2910bdab34d89bc9e7bd9acd3daf6c26a387473749facf00e/scd_polars_matching_plugin-0.1.0-cp39-abi3-win_amd64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-27 11:06:32",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "scd-polars-matching-plugin"
}