# flatiron-cleaner
`flatiron-cleaner` is a Python package that cleans Flatiron Health cancer datasets into analysis-ready formats, designed with predictive modeling and survival analysis in mind. By automating complex and tedious data processing workflows, it reduces preparation time and helps researchers keep their results reproducible.
Key features of the package include:
- Providing a modular architecture that allows researchers to select which Flatiron files to process
- Converting long-format dataframes into wide-format dataframes with one row per unique PatientID
- Ensuring appropriate data types for predictive modeling and statistical analysis
- Standardizing data cleaning around a user-specified index date, such as metastatic diagnosis or treatment initiation
- Engineering clinically relevant variables for analysis
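To make the long-to-wide and index-date ideas above concrete, here is a minimal pandas-only sketch. It does not use `flatiron-cleaner` and is not the package's implementation; the column names (`EcogDate`, `EcogValue`, `IndexDate`), the helper `closest_value_in_window`, and the "closest record to the index date" rule are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical long-format ECOG records: several rows per patient.
ecog_long = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B", "B"],
    "EcogDate": pd.to_datetime(
        ["2020-01-05", "2020-01-25", "2020-03-01", "2019-12-20", "2020-02-10"]
    ),
    "EcogValue": [1, 2, 3, 0, 1],
})

# Hypothetical index dates: exactly one row per patient.
index_dates = pd.DataFrame({
    "PatientID": ["A", "B", "C"],
    "IndexDate": pd.to_datetime(["2020-02-01", "2020-02-15", "2020-01-01"]),
})


def closest_value_in_window(long_df, index_df, days_before=30, days_after=0):
    """Keep, per patient, the record closest to the index date within the window."""
    merged = long_df.merge(index_df, on="PatientID", how="inner")
    merged["days_from_index"] = (merged["EcogDate"] - merged["IndexDate"]).dt.days

    # Restrict to records inside [-days_before, +days_after] (inclusive).
    in_window = merged[merged["days_from_index"].between(-days_before, days_after)]

    # Pick the record nearest to the index date for each patient.
    closest = (
        in_window.assign(abs_days=in_window["days_from_index"].abs())
        .sort_values(["PatientID", "abs_days"])
        .drop_duplicates("PatientID", keep="first")[["PatientID", "EcogValue"]]
        .rename(columns={"EcogValue": "ecog_index"})
    )

    # Left-join back so every patient keeps exactly one row, even without records.
    wide = index_df[["PatientID"]].merge(closest, on="PatientID", how="left")
    wide["ecog_index"] = wide["ecog_index"].astype("Int64")  # nullable integer dtype
    return wide


wide_ecog = closest_value_in_window(ecog_long, index_dates)
print(wide_ecog)
```

Patient `C` has no ECOG records, so it still appears in the output with a missing value; keeping such patients is what makes the result safe to merge with other one-row-per-patient tables.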
## Installation
Built and tested with Python 3.13.
```bash
pip install flatiron-cleaner
```
## Available Processors
### Cancer-Specific Processors
The following cancers have their own dedicated data processor class:
| Cancer Type | Processor Name |
|-------------|-----------------|
| Advanced Urothelial Cancer | `DataProcessorUrothelial` |
| Advanced NSCLC | `DataProcessorNSCLC` |
| Metastatic Colorectal Cancer | `DataProcessorColorectal` |
| Metastatic Breast Cancer | `DataProcessorBreast` |
| Metastatic Prostate Cancer | `DataProcessorProstate` |
| Metastatic Renal Cell Cancer | `DataProcessorRenal` |
### General Processor
For cancer types without a dedicated processor, `DataProcessorGeneral` is available with standard methods.
## Processing Methods
### Standard Methods
The following methods are available across all processor classes, including the general processor:
| Method | Description | File Processed |
|--------|-------------|----------------|
| `process_demographics()` | Processes patient demographic information | Demographics.csv |
| `process_mortality()` | Processes mortality data | Enhanced_Mortality_V2.csv |
| `process_ecog()` | Processes performance status data | ECOG.csv |
| `process_medications()` | Processes medication administration records | MedicationAdministration.csv |
| `process_diagnosis()` | Processes ICD coding information | Diagnosis.csv |
| `process_labs()` | Processes laboratory test results | Lab.csv |
| `process_vitals()` | Processes vital signs data | Vitals.csv |
| `process_insurance()` | Processes insurance information | Insurance.csv |
| `process_practice()` | Processes practice type data | Practice.csv |
### Cancer-Specific Methods
Cancer-specific classes contain additional methods (e.g., `process_enhanced()` and `process_biomarkers()`). For a complete list of available methods for each cancer type, refer to the source code or use Python's built-in help functionality:
```python
from flatiron_cleaner import DataProcessorUrothelial

# Print the class documentation, including its available methods
help(DataProcessorUrothelial)
```
## Usage Example
```python
import pandas as pd

from flatiron_cleaner import DataProcessorUrothelial
from flatiron_cleaner import merge_dataframes

# Initialize class
processor = DataProcessorUrothelial()

# Import dataframe with PatientIDs and index date of interest
df = pd.read_csv('path/to/your/data')

# Load and clean data
cleaned_ecog_df = processor.process_ecog('path/to/your/ECOG.csv',
                                         index_date_df=df,
                                         index_date_column='AdvancedDiagnosisDate',
                                         days_before=30,
                                         days_after=0)

cleaned_medication_df = processor.process_medications('path/to/your/MedicationAdministration.csv',
                                                      index_date_df=df,
                                                      index_date_column='AdvancedDiagnosisDate',
                                                      days_before=180,
                                                      days_after=0)

# Merge dataframes
merged_data = merge_dataframes(cleaned_ecog_df, cleaned_medication_df)
```
For a more detailed usage demonstration, see the notebook titled "tutorial" in the `example/` directory.
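For intuition about the final merge step in the usage example, here is a pandas-only sketch of combining several one-row-per-patient dataframes on `PatientID`. It is not the implementation of `merge_dataframes`: the helper name `merge_on_patient_id`, the outer join, the uniqueness check, and the example columns are all assumptions made for illustration.

```python
from functools import reduce

import pandas as pd


def merge_on_patient_id(*dataframes):
    """Outer-join any number of one-row-per-patient dataframes on PatientID."""
    for frame in dataframes:
        if not frame["PatientID"].is_unique:
            raise ValueError("each dataframe must have one row per PatientID")
    # Fold the dataframes together pairwise, keeping patients found in any input.
    return reduce(
        lambda left, right: left.merge(right, on="PatientID", how="outer"),
        dataframes,
    )


# Hypothetical cleaned outputs with partially overlapping patients.
ecog = pd.DataFrame({"PatientID": ["A", "B"], "ecog_index": [1, 2]})
meds = pd.DataFrame({"PatientID": ["B", "C"], "steroid": [1, 0]})

merged = merge_on_patient_id(ecog, meds)
print(merged)
```

An outer join keeps patients that are missing from one of the inputs and fills their columns with missing values; an inner join would silently drop them, which is usually undesirable when building a modeling cohort.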
## Contact
Contributions and feedback are welcome. Contact: xavierorcutt@gmail.com
Raw data
{
"_id": null,
"home_page": null,
"name": "flatiron-cleaner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "cancer, healthcare, data, flatiron",
"author": null,
"author_email": "Xavier Orcutt <xavierorcutt@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/d7/cd/1297296ef26131aed3a5ada37fae2041e1d1e43e5ef500a6522235708a73/flatiron_cleaner-0.1.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package for cleaning and harmonizing Flatiron Health cancer data",
"version": "0.1.4",
"project_urls": {
"Bug Tracker": "https://github.com/xavier-orcutt/FlatironCleaner/issues",
"Documentation": "https://github.com/xavier-orcutt/FlatironCleaner#readme",
"Homepage": "https://github.com/xavier-orcutt/FlatironCleaner"
},
"split_keywords": [
"cancer",
" healthcare",
" data",
" flatiron"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "fb0ec63b3fabd235ac6a38e7999c6d17a1b357b9fccc83319fe23db22eecd230",
"md5": "a7bdadfa00ff756982615fcd57d54fd2",
"sha256": "a91dbd071c50e53e5bbbd3971752805dea23a3351e908d0c974def8bb562e102"
},
"downloads": -1,
"filename": "flatiron_cleaner-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a7bdadfa00ff756982615fcd57d54fd2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 182325,
"upload_time": "2025-10-30T05:30:50",
"upload_time_iso_8601": "2025-10-30T05:30:50.585872Z",
"url": "https://files.pythonhosted.org/packages/fb/0e/c63b3fabd235ac6a38e7999c6d17a1b357b9fccc83319fe23db22eecd230/flatiron_cleaner-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d7cd1297296ef26131aed3a5ada37fae2041e1d1e43e5ef500a6522235708a73",
"md5": "3313bae1f72e3be66540623fc81081a6",
"sha256": "71d3fa331ea11318b7aeb54eb2e933f0608301738944b784059d0e41b109fa33"
},
"downloads": -1,
"filename": "flatiron_cleaner-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "3313bae1f72e3be66540623fc81081a6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 171806,
"upload_time": "2025-10-30T05:30:51",
"upload_time_iso_8601": "2025-10-30T05:30:51.736581Z",
"url": "https://files.pythonhosted.org/packages/d7/cd/1297296ef26131aed3a5ada37fae2041e1d1e43e5ef500a6522235708a73/flatiron_cleaner-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-30 05:30:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "xavier-orcutt",
"github_project": "FlatironCleaner",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "asttokens",
"specs": [
[
"==",
"3.0.0"
]
]
},
{
"name": "decorator",
"specs": [
[
"==",
"5.1.1"
]
]
},
{
"name": "executing",
"specs": [
[
"==",
"2.2.0"
]
]
},
{
"name": "ipython",
"specs": [
[
"==",
"8.31.0"
]
]
},
{
"name": "jedi",
"specs": [
[
"==",
"0.19.2"
]
]
},
{
"name": "matplotlib-inline",
"specs": [
[
"==",
"0.1.7"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"2.2.2"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.2.3"
]
]
},
{
"name": "parso",
"specs": [
[
"==",
"0.8.4"
]
]
},
{
"name": "pexpect",
"specs": [
[
"==",
"4.9.0"
]
]
},
{
"name": "prompt_toolkit",
"specs": [
[
"==",
"3.0.50"
]
]
},
{
"name": "ptyprocess",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "pure_eval",
"specs": [
[
"==",
"0.2.3"
]
]
},
{
"name": "Pygments",
"specs": [
[
"==",
"2.19.1"
]
]
},
{
"name": "python-dateutil",
"specs": [
[
"==",
"2.9.0.post0"
]
]
},
{
"name": "pytz",
"specs": [
[
"==",
"2024.2"
]
]
},
{
"name": "six",
"specs": [
[
"==",
"1.17.0"
]
]
},
{
"name": "stack-data",
"specs": [
[
"==",
"0.6.3"
]
]
},
{
"name": "traitlets",
"specs": [
[
"==",
"5.14.3"
]
]
},
{
"name": "tzdata",
"specs": [
[
"==",
"2025.1"
]
]
},
{
"name": "wcwidth",
"specs": [
[
"==",
"0.2.13"
]
]
}
],
"lcname": "flatiron-cleaner"
}