zensols.mimicsid

Name: zensols.mimicsid
Version: 1.7.0
Home page: https://github.com/plandes/mimicsid
Summary: Use the MedSecId section annotations with MIMIC-III corpus parsing.
Author: Paul Landes
Upload time: 2024-03-07 17:50:26
Keywords: tooling
# MIMIC-III corpus parsing and section prediction with MedSecId

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python310-badge]][python310-link]
[![Python 3.11][python311-badge]][python311-link]

This repository contains a Python package to automatically segment and
identify sections of clinical notes, such as electronic health record (EHR)
medical documents.  It also provides access to the MedSecId section annotations
with MIMIC-III corpus parsing from the paper [A New Public Corpus for Clinical
Section Identification: MedSecId].  See the [medsecid repository] to reproduce
the results from the paper.

This package provides the following:

* The same access to MIMIC-III data as provided in the [mimic package].
* Access to the annotated MedSecId notes as an easy-to-use Python object graph.
* Pretrained model inferencing, which produces a Python object graph similar to
  the annotations (providing the class `PredictedNote` instead of
  `AnnotatedNote`).


<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
## Table of Contents

- [Obtaining](#obtaining)
- [Documentation](#documentation)
- [Installation](#installation)
- [Usage](#usage)
    - [Prediction Usage](#prediction-usage)
    - [Annotation Access](#annotation-access)
- [Differences from the Paper Repository](#differences-from-the-paper-repository)
- [Training](#training)
    - [Preprocessing Step](#preprocessing-step)
    - [Training and Testing](#training-and-testing)
- [Training Production Models](#training-production-models)
- [Models](#models)
    - [MedCAT Models](#medcat-models)
    - [Performance Metrics](#performance-metrics)
        - [Version 0.0.2](#version-002)
        - [Version 0.0.3](#version-003)
- [Citation](#citation)
- [Docker](#docker)
- [Changelog](#changelog)
- [Community](#community)
- [License](#license)

<!-- markdown-toc end -->



## Obtaining

The easiest way to install the command line program is via the `pip` installer:
```bash
pip3 install zensols.mimicsid
```

Binaries are also available on [pypi].

A [docker](#docker) image is now available as well.


## Documentation

See the [full documentation](https://plandes.github.io/mimicsid/index.html).
The [API reference](https://plandes.github.io/mimicsid/api.html) is also
available.


## Installation

If you only want to predict sections using the pretrained model, you need only
[install](#obtaining) the package.  However, if you want to access the
annotated notes, you must install a Postgres MIMIC-III database as described in
the [mimic package install section].


## Usage

This package provides models to predict sections of a medical note and access
to the MIMIC-III section annotations available on [Zenodo].  The first time it
is run it will take a while to download the annotation set and the pretrained
models.

See the [examples](example) for the complete code and additional documentation.
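
The package also installs a `mimicsid` command line program that uses the same
pretrained models.  Below is a minimal sketch, assuming the note text has been
saved to a file named `note.txt` (a hypothetical name); the `predict` action
and the `preds/note-pred.txt` output path are taken from the
[Docker](#docker) steps below and assumed to apply here as well:

```bash
# predict the sections of the note text stored in note.txt (hypothetical file name)
mimicsid predict note.txt

# the predictions are written under the preds/ directory
cat preds/note-pred.txt
```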


### Prediction Usage

The `SectionPredictor` class creates section annotation span IDs/types and
header token spans.  See the example below:

```python
from zensols.nlp import FeatureToken
from zensols.mimic import Section
from zensols.mimicsid import PredictedNote, ApplicationFactory
from zensols.mimicsid.pred import SectionPredictor

if __name__ == '__main__':
    # get the section predictor from the application context in the app
    section_predictor: SectionPredictor = ApplicationFactory.section_predictor()

    # read in a test note to predict
    with open('../../test-resources/note.txt') as f:
        content: str = f.read().strip()

    # predict the sections of the note read in above and print it
    note: PredictedNote = section_predictor.predict([content])[0]
    note.write()

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)

    # concepts or special MIMIC tokens from the addendum section
    sec = note.sections_by_name['addendum'][0]
    tok: FeatureToken
    for tok in sec.body_doc.token_iter():
        print(tok, tok.mimic_, tok.cui_)
```


### Annotation Access

Annotated notes are provided as a Python [Note class], which contains most of
the MIMIC-III data from the `NOTEEVENTS` table.  This includes not only the
text, but also parsed `FeatureDocument` instances.  However, you must build a
Postgres database and provide a login to it in the application as detailed
below:

```python
from zensols.config import IniConfig
from zensols.mimic import Section
from zensols.mimicsid import ApplicationFactory
from zensols.mimic import Note
from zensols.mimicsid import AnnotatedNote, NoteStash

if __name__ == '__main__':
    # create a configuration with the Postgres database login
    config = IniConfig('db.conf')
    # get the `dict`-like data structure that has notes keyed by `row_id`
    note_stash: NoteStash = ApplicationFactory.note_stash(
        **config.get_options(section='mimic_postgres_conn_manager'))

    # get a note by `row_id`
    note: Note = note_stash[14793]

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)
```
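
The contents of `db.conf` depend on how the Postgres database was set up per
the [mimic package install section].  The sketch below is illustrative only:
the `mimic_postgres_conn_manager` section name comes from the example above,
but the option names are assumptions and not necessarily the actual
configuration keys used by the [mimic package]:

```ini
# hypothetical Postgres connection options; the key names below are assumptions
[mimic_postgres_conn_manager]
host = localhost
port = 5432
db_name = mimic3
user = mimic3
password = changeme
```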


## Differences from the Paper Repository

The paper's [medsecid repository] differs from this one in quite a few ways,
mostly around reproducibility.  This repository is designed as a package for
research that applies the model.  To reproduce the results of the paper, please
refer to the [medsecid repository].  To use the best performing model
(the BiLSTM-CRF token model) from that paper, use this repository.

Perhaps the largest difference is that this repository has a pretrained model
and code for header tokens.  This is a separate model whose header token
predictions are "merged" with the section ID/type predictions.

The differences in performance between the section ID/type models and the
metrics reported in the paper involve several factors.  The primary difference
is that the released models were also trained on the test data (which is why
only validation performance metrics are reported) to increase the pretrained
models' performance.  Other changes include:

* Uses the [mednlp package], which uses [MedCAT] to parse clinical medical
  text.  This includes changes such as fixing misspellings and expanding
  acronyms.
* Uses the [mimic package], which builds on the [mednlp package] and parses
  [MIMIC-III] text by configuring the [spaCy] tokenizer to deal with pseudo
  tokens (e.g. `[**First Name**]`).  This is a significant change given how
  these tokens are treated between the models and term mapping (`Pt.` becomes
  `patient`).  This was changed so the model will work well on non-MIMIC data.
* Feature set differences, such as those provided by the [Zensols Deep NLP
  package].
* Model changes include the LSTM hidden layer parameter size and activation
  function.
* White space tokens are removed in the [medsecid repository] and added back in
  this package to give the model additional cues on when to break a section.
  However, this might have had the opposite effect.

There are also changes in the libraries used:

* PyTorch was upgraded from 1.9.1 to 1.12.1.
* [spaCy] was upgraded from 3.0.7 to 3.2.4.
* Python was upgraded from 3.9 to 3.10.


## Training

This section explains how to create and package models for distribution.


### Preprocessing Step

1. To train the model, first install the MIMIC-III Postgres database per the [mimic
   package] instructions in the *Installation* section.
2. Add the MIMIC-III Postgres credentials and database configuration to
   `etc/batch.conf`.
3. Comment out the line `resource(zensols.mimicsid): resources/model/adm.conf`
   in `resources/app.conf`.
4. Vectorize the batches using the preprocessing script:
   `$ ./src/bin/preprocess.sh`.  This also creates cached hospital admission and
   spaCy data parse files.


### Training and Testing

To get performance metrics on the test set by training on the training set, use
the command `./mimicsid traintest -c models/glove300.conf` for the section ID
model.  The configuration file can be any of those in the `models` directory.
For the header model, use:

```bash
./mimicsid traintest -c models/glove300.conf --override mimicsid_default.model_type=header
```


## Training Production Models

To train models used in your projects, train the model on both the training and
test sets.  This still leaves the validation set to inform when to save the
model, i.e., on epochs where the validation loss decreases:

1. Update the `deeplearn_model_packer:version` in `resources/app.conf`.
2. Preprocess the data (see the [Preprocessing Step](#preprocessing-step) section).
3. Run the script that trains the models and packages them: `src/bin/package.sh`.
4. Check for errors and verify models: `$ ./src/bin/verify-model.py`.
5. Don't forget to revert files `etc/batch.conf` and `resources/app.conf`.


## Models

You can mix and match section and header models (see [Performance
Metrics](#performance-metrics)).  By default the package uses the best
performing models, but you can select the models you want by adding a
configuration file and specifying it on the command line with `-c`:

```ini
[mimicsid_default]
section_prediction_model = bilstm-crf-tok-fasttext
header_prediction_model = bilstm-crf-tok-glove-300d
```
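
For example, if the configuration above were saved to a file named
`my-models.conf` (a hypothetical name), it could be passed to the command line
program with `-c`:

```bash
# use the models selected in my-models.conf (hypothetical file name) for prediction
mimicsid predict note.txt -c my-models.conf
```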

The resources live on [Zenodo] and are automatically downloaded to the
`~/.cache` directory (or a similar home directory on Windows) the first time
the program is used.


### MedCAT Models

The [mednlp package] dependency uses the [default MedCAT
model](https://github.com/plandes/mednlp#medcat-models).



### Performance Metrics

The distributed models add the test set to the training set to improve
performance for inferencing, which is why only the validation metrics are
given.  The validation set performance of the pretrained models is given below
(standard definitions of these metrics are sketched after the list), where:

* **wF1** is the weighted F1
* **mF1** is the micro F1
* **MF1** is the macro F1
* **acc** is the accuracy
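
For reference, these are the standard metric definitions (general background,
not specific to this package), computed per class from true positives (TP),
false positives (FP), and false negatives (FN):

```latex
% precision, recall, and F1 for a single class c
\[ P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
   R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
   F_{1,c} = \frac{2 P_c R_c}{P_c + R_c} \]

% macro F1 (MF1): unweighted mean of the per-class F1 scores over the C classes
\[ \mathrm{MF1} = \frac{1}{C} \sum_{c=1}^{C} F_{1,c} \]

% weighted F1 (wF1): per-class F1 weighted by the class support (count) n_c
\[ \mathrm{wF1} = \frac{\sum_{c=1}^{C} n_c\, F_{1,c}}{\sum_{c=1}^{C} n_c} \]

% micro F1 (mF1): F1 computed from TP, FP, and FN counts pooled over all classes
```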

Fundamental API changes have necessitated subsequent versions of the model.
Each version of this package is tied to a model version.  While some minor
changes in each version might produce language parsing differences, such as
sentence chunking, the metric differences are most likely statistically
insignificant.


#### Version 0.0.2

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| `BiLSTM-CRF_tok (fastText)`   | Section | bilstm-crf-tok-fasttext-section-type   | 0.918 | 0.925 | 0.797 | 0.925 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Section | bilstm-crf-tok-glove-300d-section-type | 0.917 | 0.922 | 0.809 | 0.922 |
| `BiLSTM-CRF_tok (fastText)`   | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.959 | 0.996 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.962 | 0.996 |


#### Version 0.0.3

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| `BiLSTM-CRF_tok (fastText)`   | Section | bilstm-crf-tok-fasttext-section-type   | 0.911 | 0.917 | 0.792 | 0.917 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Section | bilstm-crf-tok-glove-300d-section-type | 0.929 | 0.933 | 0.810 | 0.933 |
| `BiLSTM-CRF_tok (fastText)`   | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.965 | 0.996 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.962 | 0.996 |


## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2022-new,
    title = "A New Public Corpus for Clinical Section Identification: {M}ed{S}ec{I}d",
    author = "Landes, Paul  and
      Patel, Kunal  and
      Huang, Sean S.  and
      Webb, Adam  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.326",
    pages = "3709--3721"
}
```

Also please cite the [Zensols Framework]:

```bibtex
@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}
```


## Docker

A [docker](docker/app/README.md) image is now available as well.

To use the docker image, do the following:

1. Create (or obtain) the [Postgres docker image].
1. Clone this repository: `git clone --recurse-submodules
   https://github.com/plandes/mimicsid`
1. Set the working directory to the repo: `cd mimicsid`
1. Copy the configuration from the installed [mimicdb] image configuration:
   `make -C docker/mimicdb SRC_DIR=<cloned mimicdb directory> cpconfig`
1. Start the container: `make -C docker/app up`
1. Test sectioning a document: `make -C docker/app testdumpsec`
1. Log in to the container: `make -C docker/app devlogin`
1. Output a note to a temporary file: `mimic note 1118471 > note.txt`
1. Predict the sections on the note: `mimicsid predict note.txt`
1. Look at the section predictions: `cat preds/note-pred.txt`


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## Community

Please star this repository and let me know how and where you use this API.
Contributions as pull requests, feedback, and any other input are welcome.


## License

[MIT License](LICENSE.md)

Copyright (c) 2022 - 2024 Paul Landes


<!-- links -->
[pypi]: https://pypi.org/project/zensols.mimicsid/
[pypi-link]: https://pypi.python.org/pypi/zensols.mimicsid
[pypi-badge]: https://img.shields.io/pypi/v/zensols.mimicsid.svg
[python310-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python310-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110

[MedCAT]: https://github.com/CogStack/MedCAT
[spaCy]: https://spacy.io

[mednlp package]: https://github.com/plandes/mednlp
[mimic package]: https://github.com/plandes/mimic
[mimic package install section]: https://github.com/plandes/mimic#installation
[medsecid repository]: https://github.com/uic-nlp-lab/medsecid
[Zensols Deep NLP package]: https://github.com/plandes/deepnlp
[Zensols Framework]: https://github.com/plandes/deepnlp

[annotation example]: example/anon/anon.py
[A New Public Corpus for Clinical Section Identification: MedSecId]: https://aclanthology.org/2022.coling-1.326.pdf
[Zenodo]: https://zenodo.org/record/7150451#.Yz30BS2B3Bs

[Postgres docker image]: https://github.com/plandes/mimicdb#installation
[mimicdb]: https://github.com/plandes/mimicdb

[Note class]: https://plandes.github.io/mimic/api/zensols.mimic.html#zensols.mimic.note.Note

            
