scd-polars-matching-plugin


- Name: scd-polars-matching-plugin
- Version: 0.1.0
- Summary: High-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology
- Upload time: 2025-08-27 11:06:32
- Author: Tobias Kragholm
- Requires Python: >=3.11
- License: MIT
- Keywords: epidemiology, case-control, matching, polars, rust, scd

# SCD Polars Matching Plugin

A high-performance Rust-based plugin for Polars that performs epidemiologically sound case-control matching with proper risk-set sampling methodology to avoid immortal time bias.

## Overview

This plugin implements time-to-event methodology for matching severe chronic disease (SCD) cases with controls, processing cases chronologically and ensuring controls are eligible at the time of each case's diagnosis.

## Installation

```bash
pip install scd-polars-matching-plugin
```

Or build from source (requires a Rust toolchain and maturin):
```bash
maturin develop
```

## Usage

```python
from matching_plugin import complete_scd_matching_workflow

# mfr_df, lpr_df and vital_df are Polars DataFrames laid out as
# described under "Input Data Formats" below.
matched_df = complete_scd_matching_workflow(
    mfr_data=mfr_df,
    lpr_data=lpr_df,
    vital_data=vital_df,  # Optional
    matching_ratio=5,  # controls per case
    birth_date_window_days=30,
    parent_birth_date_window_days=365,
    match_parent_birth_dates=True,
    match_parity=True,
)
```

## Input Data Formats

### MFR Data (Birth Registry)
Required columns:
- `PNR`: Person identifier (string)
- `FOEDSELSDATO`: Birth date (date)
- `CPR_MODER`: Mother's identifier (string)
- `CPR_FADER`: Father's identifier (string)
- `MODER_FOEDSELSDATO`: Mother's birth date (date)
- `FADER_FOEDSELSDATO`: Father's birth date (date)
- `PARITET`: Birth order/parity (integer)

**Example MFR Data:**
```
┌─────────────┬──────────────┬─────────────┬─────────────┬────────────────────┬────────────────────┬─────────┐
│ PNR         ┆ FOEDSELSDATO ┆ CPR_MODER   ┆ CPR_FADER   ┆ MODER_FOEDSELSDATO ┆ FADER_FOEDSELSDATO ┆ PARITET │
├─────────────┼──────────────┼─────────────┼─────────────┼────────────────────┼────────────────────┼─────────┤
│ person_0001 ┆ 1995-01-15   ┆ mother_0001 ┆ father_0001 ┆ 1970-03-22         ┆ 1968-07-10         ┆ 1       │
│ person_0002 ┆ 1995-02-20   ┆ mother_0002 ┆ father_0002 ┆ 1972-11-15         ┆ 1969-05-03         ┆ 2       │
│ person_0003 ┆ 1995-03-10   ┆ mother_0003 ┆ father_0003 ┆ 1973-08-07         ┆ 1971-12-25         ┆ 1       │
└─────────────┴──────────────┴─────────────┴─────────────┴────────────────────┴────────────────────┴─────────┘
```

### LPR Data (Patient Registry)
Required columns:
- `PNR`: Person identifier (string)
- `SCD_STATUS`: Disease status ("SCD", "SCD_LATE", "NO_SCD")
- `SCD_DATE`: Diagnosis date (date, null for non-cases)
- `ICD_CODE`: Diagnosis code (string, optional)

**Example LPR Data:**
```
┌─────────────┬────────────┬────────────┬──────────┐
│ PNR         ┆ SCD_STATUS ┆ SCD_DATE   ┆ ICD_CODE │
├─────────────┼────────────┼────────────┼──────────┤
│ person_0001 ┆ SCD        ┆ 1997-06-15 ┆ D57.1    │
│ person_0002 ┆ NO_SCD     ┆ null       ┆ null     │
│ person_0003 ┆ SCD_LATE   ┆ 2001-03-22 ┆ D57.0    │
│ person_0004 ┆ NO_SCD     ┆ null       ┆ null     │
└─────────────┴────────────┴────────────┴──────────┘
```

### Vital Events Data (Optional)
Required columns (when this table is supplied):
- `PNR`: Person identifier (string)
- `EVENT`: Event type ("DEATH", "EMIGRATION")
- `EVENT_DATE`: Event date (date)
- `ROLE`: Individual role ("CHILD", "PARENT")

**Example Vital Events Data:**
```
┌─────────────┬────────────┬────────────┬────────┐
│ PNR         ┆ EVENT      ┆ EVENT_DATE ┆ ROLE   │
├─────────────┼────────────┼────────────┼────────┤
│ person_0001 ┆ EMIGRATION ┆ 1999-12-01 ┆ CHILD  │
│ mother_0002 ┆ DEATH      ┆ 1998-07-15 ┆ PARENT │
│ person_0004 ┆ DEATH      ┆ 2000-03-10 ┆ CHILD  │
│ father_0001 ┆ EMIGRATION ┆ 1997-11-20 ┆ PARENT │
└─────────────┴────────────┴────────────┴────────┘
```

### Data Relationships
- **MFR and LPR**: Must be joined on `PNR` to combine birth registry and patient data
- **Vital Events**: Optional supplementary data that tracks death/emigration events
- **Parent Links**: `CPR_MODER` and `CPR_FADER` in MFR link to parent `PNR` values in vital events
- **Temporal Logic**: All dates must be proper date types for chronological processing
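
The relationships above can be sketched in plain Python. This is only an illustration of how the three tables link together, not the plugin's implementation: the rows are dicts rather than Polars DataFrames, `link_records` is a hypothetical helper, and treating the MFR/LPR join as an inner join is an assumption.

```python
from datetime import date

# Illustrative rows mirroring the example tables (plain dicts, not Polars).
mfr = [
    {"PNR": "person_0001", "CPR_MODER": "mother_0001", "CPR_FADER": "father_0001"},
    {"PNR": "person_0002", "CPR_MODER": "mother_0002", "CPR_FADER": "father_0002"},
]
lpr = [
    {"PNR": "person_0001", "SCD_STATUS": "SCD", "SCD_DATE": date(1997, 6, 15)},
    {"PNR": "person_0002", "SCD_STATUS": "NO_SCD", "SCD_DATE": None},
]
vital = [
    {"PNR": "person_0001", "EVENT": "EMIGRATION", "EVENT_DATE": date(1999, 12, 1), "ROLE": "CHILD"},
    {"PNR": "mother_0002", "EVENT": "DEATH", "EVENT_DATE": date(1998, 7, 15), "ROLE": "PARENT"},
    {"PNR": "father_0001", "EVENT": "EMIGRATION", "EVENT_DATE": date(1997, 11, 20), "ROLE": "PARENT"},
]


def link_records(mfr, lpr, vital):
    """Join MFR and LPR on PNR, then attach child and parent vital events."""
    lpr_by_pnr = {row["PNR"]: row for row in lpr}
    events_by_pnr = {}
    for event in vital:
        events_by_pnr.setdefault(event["PNR"], []).append(event)

    linked = []
    for birth in mfr:
        diagnosis = lpr_by_pnr.get(birth["PNR"])
        if diagnosis is None:
            continue  # assumption: inner join, births without an LPR row are dropped
        linked.append({
            **birth,
            **diagnosis,
            # The child's own events are keyed by PNR; parent events are
            # found through CPR_MODER and CPR_FADER.
            "child_events": events_by_pnr.get(birth["PNR"], []),
            "mother_events": events_by_pnr.get(birth["CPR_MODER"], []),
            "father_events": events_by_pnr.get(birth["CPR_FADER"], []),
        })
    return linked


linked = link_records(mfr, lpr, vital)
```

Here `person_0001` ends up with its own emigration event and the father's emigration event attached, while `person_0002` carries the mother's death event.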

## Output Format

The function returns a Polars DataFrame with the following columns:

- `MATCH_INDEX`: Unique identifier for each case-control group (integer)
- `PNR`: Person identifier (string)
- `ROLE`: Individual role in the match ("case" or "control")
- `INDEX_DATE`: SCD diagnosis date from the case (date)

### Example Output
```
┌─────────────┬─────────────┬─────────┬────────────┐
│ MATCH_INDEX ┆ PNR         ┆ ROLE    ┆ INDEX_DATE │
├─────────────┼─────────────┼─────────┼────────────┤
│ 1           ┆ person_0001 ┆ case    ┆ 1997-01-01 │
│ 1           ┆ person_0002 ┆ control ┆ 1997-01-01 │
│ 1           ┆ person_0003 ┆ control ┆ 1997-01-01 │
│ 2           ┆ person_0004 ┆ case    ┆ 1997-06-15 │
│ 2           ┆ person_0005 ┆ control ┆ 1997-06-15 │
└─────────────┴─────────────┴─────────┴────────────┘
```
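
A quick sanity check on output of this shape is to count cases and controls per `MATCH_INDEX`. The sketch below works on plain dicts copied from the example rows; `summarize_matches` is a hypothetical helper, not part of the plugin. On a real Polars DataFrame the same summary would more naturally be written with `group_by`.

```python
from datetime import date

# Rows copied from the example output above.
rows = [
    {"MATCH_INDEX": 1, "PNR": "person_0001", "ROLE": "case", "INDEX_DATE": date(1997, 1, 1)},
    {"MATCH_INDEX": 1, "PNR": "person_0002", "ROLE": "control", "INDEX_DATE": date(1997, 1, 1)},
    {"MATCH_INDEX": 1, "PNR": "person_0003", "ROLE": "control", "INDEX_DATE": date(1997, 1, 1)},
    {"MATCH_INDEX": 2, "PNR": "person_0004", "ROLE": "case", "INDEX_DATE": date(1997, 6, 15)},
    {"MATCH_INDEX": 2, "PNR": "person_0005", "ROLE": "control", "INDEX_DATE": date(1997, 6, 15)},
]


def summarize_matches(rows):
    """Count cases and controls and collect index dates for each match group."""
    summary = {}
    for row in rows:
        group = summary.setdefault(
            row["MATCH_INDEX"],
            {"n_cases": 0, "n_controls": 0, "index_dates": set()},
        )
        if row["ROLE"] == "case":
            group["n_cases"] += 1
        else:
            group["n_controls"] += 1
        group["index_dates"].add(row["INDEX_DATE"])
    return summary


summary = summarize_matches(rows)
```

Each group should contain exactly one case and a single `INDEX_DATE` shared by the case and its controls.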

## Key Features

### Risk-Set Sampling
- **Chronological Processing**: Cases are processed in order of diagnosis date
- **Temporal Validity**: Controls must be eligible (alive, not emigrated, and not yet diagnosed) on the case's diagnosis date
- **No Immortal Time Bias**: People diagnosed with SCD later remain eligible as controls for cases diagnosed before them
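
The idea behind these three points can be illustrated with a small pure-Python sketch. It is not the plugin's Rust implementation: `risk_set_sample`, the `EXIT_DATE` field (earliest death or emigration date) and the rule that a control is used at most once are all assumptions made for the example, and the matching criteria are left out entirely.

```python
import random
from datetime import date


def risk_set_sample(people, ratio, seed=0):
    """Match each case to up to `ratio` controls who are at risk on the index date."""
    rng = random.Random(seed)
    # Process cases in order of diagnosis date (PNR breaks ties deterministically).
    cases = sorted(
        (p for p in people if p["SCD_DATE"] is not None),
        key=lambda p: (p["SCD_DATE"], p["PNR"]),
    )
    used_controls = set()  # assumption: each control is used at most once
    matches = []
    for match_index, case in enumerate(cases, start=1):
        index_date = case["SCD_DATE"]
        risk_set = [
            p for p in people
            if p["PNR"] != case["PNR"]
            and p["PNR"] not in used_controls
            # Undiagnosed at the index date: never diagnosed, or diagnosed later.
            and (p["SCD_DATE"] is None or p["SCD_DATE"] > index_date)
            # Alive and resident: no death or emigration on or before the index date.
            and (p["EXIT_DATE"] is None or p["EXIT_DATE"] > index_date)
        ]
        chosen = rng.sample(risk_set, min(ratio, len(risk_set)))
        used_controls.update(p["PNR"] for p in chosen)
        matches.append({"MATCH_INDEX": match_index, "PNR": case["PNR"],
                        "ROLE": "case", "INDEX_DATE": index_date})
        for control in chosen:
            matches.append({"MATCH_INDEX": match_index, "PNR": control["PNR"],
                            "ROLE": "control", "INDEX_DATE": index_date})
    return matches


people = [
    {"PNR": "person_0001", "SCD_DATE": date(1997, 1, 1), "EXIT_DATE": None},
    {"PNR": "person_0002", "SCD_DATE": date(2001, 3, 22), "EXIT_DATE": None},  # later case
    {"PNR": "person_0003", "SCD_DATE": None, "EXIT_DATE": date(1996, 5, 1)},   # left before 1997
    {"PNR": "person_0004", "SCD_DATE": None, "EXIT_DATE": None},
]
matches = risk_set_sample(people, ratio=2)
```

In this toy data `person_0002` is selected as a control for the 1997 case and later appears as a case in their own right, while `person_0003` is never eligible because they left the population before the first index date.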

### Matching Criteria
- **Birth Date Window**: Match controls within specified days of case birth date
- **Parent Birth Dates**: Optional matching on parental birth dates with configurable windows
- **Parity Matching**: Optional matching on birth order
- **Vital Status**: Optional incorporation of death/emigration events
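
As a rough illustration of how these criteria combine, the sketch below checks one candidate control against one case. `is_match_candidate` is a hypothetical helper, not the plugin's API; it reuses the documented parameter names and defaults, assumes every date is present, and does not model `match_mother_birth_date_only` or `require_both_parents`.

```python
from datetime import date


def within_days(a, b, window_days):
    """True if the two dates are at most `window_days` apart."""
    return abs((a - b).days) <= window_days


def is_match_candidate(case, control, *, birth_date_window_days=30,
                       parent_birth_date_window_days=365,
                       match_parent_birth_dates=True, match_parity=True):
    """Apply the birth date, parent birth date and parity criteria in turn."""
    if not within_days(case["FOEDSELSDATO"], control["FOEDSELSDATO"],
                       birth_date_window_days):
        return False
    if match_parent_birth_dates:
        for column in ("MODER_FOEDSELSDATO", "FADER_FOEDSELSDATO"):
            if not within_days(case[column], control[column],
                               parent_birth_date_window_days):
                return False
    if match_parity and case["PARITET"] != control["PARITET"]:
        return False
    return True


# person_0001 and person_0003 from the example MFR table.
case = {"FOEDSELSDATO": date(1995, 1, 15), "MODER_FOEDSELSDATO": date(1970, 3, 22),
        "FADER_FOEDSELSDATO": date(1968, 7, 10), "PARITET": 1}
candidate = {"FOEDSELSDATO": date(1995, 3, 10), "MODER_FOEDSELSDATO": date(1973, 8, 7),
             "FADER_FOEDSELSDATO": date(1971, 12, 25), "PARITET": 1}

default_result = is_match_candidate(case, candidate)
relaxed_result = is_match_candidate(case, candidate, birth_date_window_days=60,
                                    match_parent_birth_dates=False)
```

With the default windows the candidate is rejected because the birth dates are 54 days apart; widening the birth date window to 60 days and disabling parent matching leaves only parity, which agrees.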

### Performance
- **Rust Implementation**: High-performance core algorithms
- **Polars Integration**: Seamless integration with Polars DataFrames
- **Memory Efficient**: Optimized for large datasets

## Parameters

- `matching_ratio`: Number of controls per case (default: 5)
- `birth_date_window_days`: Maximum birth date difference in days (default: 30)
- `parent_birth_date_window_days`: Maximum parent birth date difference in days (default: 365)
- `match_parent_birth_dates`: Enable parent birth date matching (default: True)
- `match_mother_birth_date_only`: Match only maternal birth dates (default: False)
- `require_both_parents`: Require both parents for matching (default: False)
- `match_parity`: Enable parity matching (default: True)

## License

This project is licensed under the MIT License.


            
