xcube

Name: xcube
Version: 0.0.9
Home page: https://github.com/debjyotiSRoy/xcube
Summary: Extreme Multi-Label Text Classification
Upload time: 2024-01-23 16:23:33
Author: debjyotiSRoy
Requires Python: >=3.9
License: Apache Software License 2.0
Keywords: nbdev, jupyter, notebook, python
```
#|hide
#| eval: false
! [ -e /content ] && pip install -q condacolab && python3 -c  "import condacolab; condacolab.install()" # create conda environment in colab
! [ -e /content ] && pip install -Uqq xcube  # upgrade xcube on colab
```


```
#| hide
from xcube.text.all import *

%load_ext autoreload
%autoreload 2
```

# xcube

> xcube trains and explains XMTC models

## E**X**plainable E**X**treme Multi-Label Te**X**t Classification:

-   *What is XMTC?* Extreme Multi-Label Text Classification (XMTC) addresses the problem of automatically assigning each data point the most relevant subset of labels from an extremely large label set (see the illustrative sketch after this list). One major application of XMTC is in the global healthcare system, specifically in the context of the International Classification of Diseases (ICD). ICD coding is the process of assigning codes representing diagnoses and procedures performed during a patient visit, using the clinical notes documented by health professionals.

-   *Datasets?* Examples of ICD coding datasets: [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) and [MIMIC-IV](https://physionet.org/content/mimic-iv-note/2.2/). Please note that you need to be a credentialed user and complete the required training to access the data.

-   *What is xcube?* xcube trains and explains XMTC models using LLM fine-tuning.
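To make the task concrete, here is a minimal, hypothetical sketch of what a single XMTC data point looks like: one free-text document paired with a small set of labels drawn from a space of thousands (the note and codes below are illustrative, not taken from MIMIC):

```
# One XMTC example: a document and its (multi-)label set.
note = "Patient admitted with chest pain; history of type 2 diabetes ..."
codes = {"414.01", "250.00", "401.9"}  # a tiny subset of ~9k possible ICD-9 codes
```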

## Install

- Create a new conda environment:
```sh
conda create -n xxx python=3.10
```
```sh
conda activate xxx
```

- Install PyTorch with CUDA enabled (optional):
```sh 
conda search pytorch
```
<img alt="output of conda search pytorch" width="400" src="pics/pytorch.png" caption="Output of conda search pytorch" id="img_pytorch_search">

Use the build string that matches your Python and CUDA versions, replacing the PyTorch version and build string as appropriate:
```sh
conda install pytorch=2.0.0=cuda118py310h072bc4c pytorch-cuda=11.8 -c pytorch -c nvidia
```

Update the CUDA toolkit:
```sh
sudo apt install nvidia-cuda-toolkit
```

Verify CUDA is available: run `python` and evaluate `import torch; torch.cuda.is_available()`.
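For example, a quick check in the Python REPL (standard PyTorch API, nothing xcube-specific):

```
import torch

print(torch.cuda.is_available())           # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the first GPU
```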


- Install xcube:
```sh
pip install xcube
```
Then configure accelerate:
```sh
accelerate config
```

## How to use

You can either clone the repo and open it on your own machine, or, if you don't want to set up a Python development environment, an even easier and quicker approach is to open this repo in [Google Colab](https://colab.research.google.com/). You can open this README in Colab using this [link](https://colab.research.google.com/github/debjyotiSRoy/xcube/blob/plant/nbs/index.ipynb).


```
IN_COLAB = is_colab()
```


```
source_mimic3 = untar_xxx(XURLs.MIMIC3_DEMO)
source_mimic4 = untar_xxx(XURLs.MIMIC4)
path = Path.cwd().parent/f"{'xcube' if IN_COLAB else ''}" # root of the repo
(path/'tmp/models').mkdir(exist_ok=True, parents=True)
tmp = path/'tmp'
```

Check your GPU memory! If you are running this on Google Colab, be sure to turn on the GPU runtime. You should be able to train and run inference on all the models with at least 16GB of GPU memory. Note, however, that training the full versions of the datasets from scratch requires at least 48GB.


```
cudamem()
```
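If you'd rather not rely on xcube's `cudamem` helper, a plain-PyTorch equivalent (a minimal sketch using only the standard `torch.cuda` API) is:

```
import torch

# Report free/total memory on the current CUDA device.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```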

### Train and Infer on MIMIC3-rare50

MIMIC3-rare50 refers to a split of [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) that contains the 50 rarest codes (refer to [Knowledge Injected Prompt Based Fine-Tuning for Multi-label Few-shot ICD Coding](https://aclanthology.org/2022.findings-emnlp.127/) for how the split was created).


```
data = join_path_file('mimic3-9k_rare50', source_mimic3, ext='.csv')
!head -n 1 {data}
```


```
df = pd.read_csv(data,
                 header=0,
                 names=['subject_id', 'hadm_id', 'text', 'labels', 'length', 'is_valid', 'split'],
                 dtype={'subject_id': str, 'hadm_id': str, 'text': str, 'labels': str, 'length': np.int64, 'is_valid': bool, 'split': str})
df.head(2)
```
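Since the `labels` column stores each note's codes as a semicolon-separated string, a quick way to inspect how many codes each note carries (plain pandas, relying only on the CSV layout shown above):

```
# Per-note label counts: split the semicolon-separated codes and summarize.
df['labels'].str.split(';').str.len().describe()
```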

To launch the training of an XMTC model on MIMIC3-rare50:


```
os.chdir(path/'scripts')
!./run_scripts.sh --script_list_file script_list_mimic3_rare50train
```

### Train and Infer on MIMIC3-top50

MIMIC3-top50 refers to a split of [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) that contains the 50 most frequent codes (refer to [Explainable Prediction of Medical Codes from Clinical Text](https://aclanthology.org/N18-1100/) for how the split was created).


```
data = join_path_file('mimic3-9k_top50', source_mimic3, ext='.csv')
!head -n 1 {data}
```


```
df = pd.read_csv(data,
                 header=0,
                 names=['subject_id', 'hadm_id', 'text', 'labels', 'length', 'is_valid', 'split'],
                 dtype={'subject_id': str, 'hadm_id': str, 'text': str, 'labels': str, 'length': np.int64, 'is_valid': bool, 'split': str})
df.head(2)
```

To run inference with one of our pretrained XMTC models on MIMIC3-top50 (inference metrics: Precision@3,5,8,15):


```
model_fnames = L(source_mimic3.glob("**/*top50*.pth")).map(str)
print('\n'.join(model_fnames))
fname = Path(shutil.copy(model_fnames[2], tmp/'models')).name.split('.')[0]
print(f"We are going to infer model {fname}.")
```


```
os.chdir(path/'scripts')
!./launches/launch_top50_mimic3 --fname {fname} --no_running_decoder --infer 1
```

To launch the training of an XMTC model on MIMIC3-top50 from scratch:


```
#| eval: false
os.chdir(path/'scripts')
!./run_scripts.sh --script_list_file script_list_mimic3_top50train
```

### Train and Infer on MIMIC3-full:

MIMIC3-full refers to the full [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) dataset (refer to [Explainable Prediction of Medical Codes from Clinical Text](https://aclanthology.org/N18-1100/) for details of how the data was curated).


```
data = join_path_file('mimic3-9k_full', source_mimic3, ext='.csv')
!head -n 1 {data}
```


```
df = pd.read_csv(data,
                 header=0,
                 names=['subject_id', 'hadm_id', 'text', 'labels', 'length', 'is_valid', 'split'],
                 dtype={'subject_id': str, 'hadm_id': str, 'text': str, 'labels': str, 'length': np.int64, 'is_valid': bool, 'split': str})
df.head(2)
```
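Before looking at individual codes, it can help to see the scale of the label space in the full split (a quick check, assuming the same semicolon-separated `labels` column):

```
# Count distinct ICD-9 codes across all notes in this split.
all_codes = {code for row in df['labels'].dropna() for code in row.split(';')}
print(f"{len(all_codes)} distinct ICD-9 codes")
```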

Let's look at the descriptions of some of the ICD9 codes:


```
des = load_pickle(source_mimic3/'code_desc.pkl')
lbl_dict = {lbl: des.get(lbl, 'NF') for lbl in df.labels[1].split(';')}  # 'NF' = no description found
pd.DataFrame(lbl_dict.items(), columns=['icd9_code', 'description'])
```

To run inference with one of our pretrained XMTC models on MIMIC3-full (inference metrics: Precision@3,5,8,15):


```
model_fnames = L(source_mimic3.glob("**/*full*.pth")).map(str)
print('\n'.join(model_fnames))
fname = Path(shutil.copy(model_fnames[0], tmp/'models')).name.split('.')[0]
print(f"Let's infer the pretrained model {fname}.")
```


```
os.chdir(path/'scripts')
!./launches/launch_complete_mimic3 --fname {fname} --infer 1 --no_running_decoder
```

### Train and Infer on MIMIC4-full:

MIMIC4-full refers to the full [MIMIC-IV](https://physionet.org/content/mimic-iv-note/2.2/) dataset with ICD10 codes (refer to [Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study](https://arxiv.org/pdf/2304.10909.pdf) for details of how the data was curated).


```
data = join_path_file('mimic4_icd10_full', source_mimic4, ext='.csv')
!head -n 1 {data}
```


```
df = pd.read_csv(data,
                    header=0,
                    usecols=['subject_id', '_id', 'text', 'labels', 'num_targets', 'is_valid', 'split'],
                    dtype={'subject_id': str, '_id': str, 'text': str, 'labels': str, 'num_targets': np.int64, 'is_valid': bool, 'split': str})
df.head(2)
```
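Here the per-note label count is already materialized in the `num_targets` column, so its distribution is one call away:

```
# Summary statistics of labels-per-note for the MIMIC4 split.
df['num_targets'].describe()
```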

Let's look at some of the descriptions of ICD10 codes: 



```
stripped_codes = [''.join(filter(str.isalnum, s)) for s in df.labels[0].split(';')]
desc = get_description(stripped_codes)
pd.DataFrame(desc.items(), columns=['icd10_code', 'description'])
```

To run inference with one of our pretrained XMTC models on MIMIC4-full (inference metrics: Precision@5,8,15):


```
model_fnames = L(source_mimic4.glob("**/*full*.pth")).map(str)
print('\n'.join(model_fnames))
model_fname = next(source_mimic4.glob("**/mimic4_icd10_clas_full.pth"))  # locate the checkpoint instead of hard-coding an absolute path
fname = Path(shutil.copy(model_fname, tmp/'models')).name.split('.')[0]
print(f"Let's infer the pretrained model {fname}.")
```


```
#| eval: false
os.chdir(path/'scripts')
!./launches/launch_complete_mimic4_icd10 --fname mimic4_icd10_clas_full --no_running_decoder --infer 1
```

## Acknowledgement

This repository is my attempt to create Extreme Multi-Label Text Classifiers using language model fine-tuning as proposed by [Jeremy Howard](https://jeremy.fast.ai) and [Sebastian Ruder](https://www.ruder.io) in [ULMFiT](https://arxiv.org/pdf/1801.06146v5.pdf). I am also heavily influenced by the [fast.ai](https://fast.ai) course [Practical Deep Learning for Coders](https://course.fast.ai/) and the excellent library [fastai](https://github.com/fastai/fastai). I have adopted the coding style of [fastai](https://github.com/fastai/fastai), using the Jupyter-based dev environment [nbdev](https://nbdev.fast.ai/). Since this is one of my first attempts at creating a full-fledged Python library, I have at times replicated implementations from fastai with some modifications. A big thanks to Jeremy and his team at [fast.ai](https://fast.ai) for everything they have been doing to make AI accessible to everyone.

            
