- Name: kapipe
- Version: 0.0.3
- Home page: https://github.com/norikinishida/kapipe
- Summary: A learnable pipeline for knowledge acquisition
- Upload time: 2024-07-23 09:35:09
- Author: Noriki Nishida
- Requires Python: >=3.10
- License: Apache License 2.0
- Keywords: NLP, knowledge acquisition, information extraction, named entity recognition, entity disambiguation, relation extraction
- Requirements: numpy, scipy, pandas, spacy, torch, opt-einsum, faiss-gpu, pyhocon, tqdm, jsonlines, Levenshtein, transformers, accelerate, bitsandbytes

# KAPipe

KAPipe is a learnable pipeline for knowledge acquisition, with a particular focus on (semi-)automatically complementing knowledge bases in specialized domains.

## Features

- KAPipe provides **trained pipelines** for end-to-end knowledge graph construction from text.
- A pipeline is designed as a cascade of the following task components:
    - **Named Entity Recognition (NER)**: Extracting entity mention spans and their entity types from the input text.
    - **Entity Disambiguation - Retrieval (ED-Retrieval)**: Retrieving a set of candidate entity IDs for each given mention in the text, based on a knowledge-base entity pool.
    - **Entity Disambiguation - Reranking (ED-Reranking)**: Reranking the retrieved entity IDs and selecting the most likely entity ID for each given mention in the text.
    - **Document-level Relation Extraction (DocRE)**: Extracting a set of relational triples (head entity, relation, tail entity) for a given entity set.
- It is possible to use only specific task components.
- KAPipe uses **state-of-the-art models** for each task component.
- KAPipe also supports **training** of the pipeline (or specific task components) for new domains, entity types, relation labels, and knowledge bases.
- This repository also contains the source code for experiments on custom models, including BERT-based supervised learning models and Large Language Model (LLM)-based In-Context Learning. The following customizable models are implemented for each task:
    - NER: Biaffine-NER ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)), LLM-NER
    - ED-Retrieval: BLINK Bi-Encoder ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)), BM25, Levenshtein-based retriever
    - ED-Reranking: BLINK Cross-Encoder ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)), LLM-ED
    - DocRE: ATLOP ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)), LLM-DocRE, MA-ATLOP and MA-QA (Oumaima and Nishida et al., 2024)

## Installation

```bash
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install kapipe
```

## Data Format: *Document*

We define a common dictionary format as the input and output for the pipeline.
We call this dictionary format **"Document"**.
The pipeline is a cascade of task components, and the input and output of each task component are also Documents.
The information in the input Document is either passed on to the output Document unchanged or updated.
Note: Because Documents are plain dictionaries, users can add their own meta-information to them (for example, the correspondence between each word and its position in the PDF), and this information is retained in the pipeline's output.

### Input:

```json
{
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
        "This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .",
        "Sixty - three percent of these infants had tricuspid valve involvement .",
        "Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .",
        "It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery ."
    ]
}
```
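
An input Document like the one above can also be assembled in Python. The following is a minimal sketch, not part of KAPipe: the helper `make_document` and the meta-information key `source_file` are hypothetical names introduced only for illustration. It assumes, as in the example, that each sentence is a single string whose tokens are separated by spaces.

```python
import json


def make_document(doc_key, sentences, **meta):
    """Build an input Document from pre-tokenized sentences.

    Each sentence is one string with space-separated tokens. Extra keyword
    arguments are stored as user-defined meta-information.
    """
    document = {"doc_key": str(doc_key), "sentences": list(sentences)}
    document.update(meta)
    return document


document = make_document(
    "6794356",
    [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "Sixty - three percent of these infants had tricuspid valve involvement .",
    ],
    source_file="example.pdf",  # hypothetical meta-information key
)
print(json.dumps(document, indent=4))
```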

### Output:

```json
{
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
        "This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .",
        "Sixty - three percent of these infants had tricuspid valve involvement .",
        "Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .",
        "It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery ."
    ],
    "mentions": [
        {
            "span": [
                0,
                2
            ],
            "name": "Tricuspid valve regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                4,
                5
            ],
            "name": "lithium carbonate",
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "span": [
                6,
                6
            ],
            "name": "toxicity",
            "entity_type": "Disease",
            "entity_id": "D064420"
        },
        {
            "span": [
                16,
                17
            ],
            "name": "tricuspid regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                19,
                20
            ],
            "name": "atrial flutter",
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "span": [
                22,
                24
            ],
            "name": "congestive heart failure",
            "entity_type": "Disease",
            "entity_id": "D006333"
        },
        {
            "span": [
                30,
                30
            ],
            "name": "lithium",
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "span": [
                43,
                44
            ],
            "name": "tricuspid regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "span": [
                46,
                47
            ],
            "name": "atrial flutter",
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "span": [
                55,
                56
            ],
            "name": "cardiac disease",
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "span": [
                61,
                61
            ],
            "name": "lithium",
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "span": [
                82,
                83
            ],
            "name": "Lithium carbonate",
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "span": [
                93,
                95
            ],
            "name": "congenital heart disease",
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "span": [
                105,
                106
            ],
            "name": "neurologic depression",
            "entity_type": "Disease",
            "entity_id": "D003866"
        },
        {
            "span": [
                108,
                108
            ],
            "name": "cyanosis",
            "entity_type": "Disease",
            "entity_id": "D003490"
        },
        {
            "span": [
                111,
                112
            ],
            "name": "cardiac arrhythmia",
            "entity_type": "Disease",
            "entity_id": "D001145"
        }
    ],
    "entities": [
        {
            "mention_indices": [
                0,
                3,
                7
            ],
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        {
            "mention_indices": [
                1,
                11
            ],
            "entity_type": "Chemical",
            "entity_id": "D016651"
        },
        {
            "mention_indices": [
                2
            ],
            "entity_type": "Disease",
            "entity_id": "D064420"
        },
        {
            "mention_indices": [
                4,
                8
            ],
            "entity_type": "Disease",
            "entity_id": "D001282"
        },
        {
            "mention_indices": [
                5
            ],
            "entity_type": "Disease",
            "entity_id": "D006333"
        },
        {
            "mention_indices": [
                6,
                10
            ],
            "entity_type": "Chemical",
            "entity_id": "D008094"
        },
        {
            "mention_indices": [
                9,
                12
            ],
            "entity_type": "Disease",
            "entity_id": "D006331"
        },
        {
            "mention_indices": [
                13
            ],
            "entity_type": "Disease",
            "entity_id": "D003866"
        },
        {
            "mention_indices": [
                14
            ],
            "entity_type": "Disease",
            "entity_id": "D003490"
        },
        {
            "mention_indices": [
                15
            ],
            "entity_type": "Disease",
            "entity_id": "D001145"
        }
    ],
    "relations": [
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 7
        },
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 8
        },
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 9
        }
    ]
}
```
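
Judging from the example above, a mention's `span` appears to hold inclusive, document-level token indices (tokens are counted across all sentences), and `arg1`/`arg2` in `relations` appear to be indices into the `entities` list. The sketch below decodes a Document under those assumptions. The helper names are hypothetical and the toy Document is made up for illustration; neither comes from KAPipe.

```python
def flatten_tokens(document):
    """Return the document's tokens as one list (sentences split on spaces)."""
    return [tok for sent in document["sentences"] for tok in sent.split()]


def mention_text(document, mention):
    """Recover a mention's surface text from its inclusive token span."""
    tokens = flatten_tokens(document)
    begin, end = mention["span"]
    return " ".join(tokens[begin : end + 1])


def relation_triples(document):
    """Map each relation's entity indices to (head ID, label, tail ID)."""
    entities = document["entities"]
    return [
        (
            entities[r["arg1"]]["entity_id"],
            r["relation"],
            entities[r["arg2"]]["entity_id"],
        )
        for r in document["relations"]
    ]


# Toy Document for illustration only.
document = {
    "doc_key": "toy",
    "sentences": ["Lithium carbonate may cause cyanosis ."],
    "mentions": [
        {"span": [0, 1], "name": "Lithium carbonate",
         "entity_type": "Chemical", "entity_id": "D016651"},
        {"span": [4, 4], "name": "cyanosis",
         "entity_type": "Disease", "entity_id": "D003490"},
    ],
    "entities": [
        {"mention_indices": [0], "entity_type": "Chemical", "entity_id": "D016651"},
        {"mention_indices": [1], "entity_type": "Disease", "entity_id": "D003490"},
    ],
    "relations": [{"arg1": 0, "relation": "CID", "arg2": 1}],
}

for mention in document["mentions"]:
    print(mention["span"], mention_text(document, mention))
print(relation_triples(document))
```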

## Downloading Trained Pipelines

Trained pipelines can be downloaded from https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp.

Download the latest compressed file `release.YYYYMMDD.tar.gz`, and then extract it in the `~/.kapipe` directory as follows:

```bash
mkdir -p ~/.kapipe
mv release.YYYYMMDD.tar.gz ~/.kapipe
cd ~/.kapipe
tar -zxvf release.YYYYMMDD.tar.gz
```

## Loading and Using Pipeline

The easiest way to apply the knowledge acquisition pipeline (i.e., the cascade of NER, ED, and DocRE tasks) to an input document is to load the pipeline using `kapipe.load()` and just apply it to the document.

```python
import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")
document = ka(document)
```
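
Since the pipeline maps one Document to one Document, applying it to a collection is a simple loop. The sketch below assumes the Documents are stored one JSON object per line (a JSON Lines file), which is an assumption rather than a format prescribed by KAPipe. The helpers `read_jsonl`, `write_jsonl`, and `apply_to_all` are hypothetical, and `fake_pipeline` is a stand-in so that the sketch runs without downloaded models; in practice the object returned by `kapipe.load()` would be passed instead.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read one Document per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, documents):
    """Write one Document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for document in documents:
            f.write(json.dumps(document) + "\n")


def apply_to_all(pipeline, documents):
    """Apply any Document -> Document callable to each Document."""
    return [pipeline(document) for document in documents]


def fake_pipeline(document):
    # Stand-in for the loaded pipeline: copies the input and adds empty
    # annotation fields instead of running real models.
    return {**document, "mentions": [], "entities": [], "relations": []}


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.jsonl")
    out_path = os.path.join(tmp, "output.jsonl")
    write_jsonl(in_path, [{"doc_key": "1", "sentences": ["A toy sentence ."]}])
    results = apply_to_all(fake_pipeline, read_jsonl(in_path))
    write_jsonl(out_path, results)
    reloaded = read_jsonl(out_path)

print(reloaded[0]["doc_key"])
```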

The above code loads and uses models that have already been trained for specific domains, entity types (in NER), knowledge bases (in ED), and relation labels (in DocRE).
Specifically, the identifier `"cdr_biaffinener_blink_atlop"` above indicates that the Biaffine-NER, BLINK, and ATLOP models trained on the CDR dataset (biomedical abstracts, Chemical and Disease entity types, entity IDs based on the MeSH ontology, and Chemical-Induce-Disease relation label) are used for NER, ED, and DocRE, respectively.

It is also possible to apply specific tasks by directly calling the task components.
For example, if you would like to perform only NER and ED, please do the following.

```python
import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")

# NER
document = ka.ner(document)
# ED-Retrieval
document, candidate_entities = ka.ed_ret(document, num_candidate_entities=10)
# ED-Reranking
document = ka.ed_rank(document, candidate_entities)
```

Similarly, if mentions and entities have already been annotated (by humans or external systems) and you would like to perform only DocRE, do the following.
Note that the input document must already contain the mentions and entities.

```python
import kapipe
ka = kapipe.load("cdr_biaffinener_blink_atlop")

# DocRE
document = ka.docre(document_with_gold_mentions_and_entities)
```
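
If an external annotation provides only mentions with entity IDs, the `entities` field can be derived by grouping mentions that share an ID. The following is a minimal sketch under the assumption that the layout shown in the Output example (entities ordered by their first mention, each listing its `mention_indices`) is what DocRE expects. The helper `group_mentions_into_entities` is hypothetical and not part of KAPipe.

```python
def group_mentions_into_entities(mentions):
    """Group mentions that share an entity ID into entity records.

    Entities are ordered by their first mention, matching the layout of
    the Output example.
    """
    entities = {}  # dicts preserve insertion order
    for index, mention in enumerate(mentions):
        entity = entities.setdefault(
            mention["entity_id"],
            {
                "mention_indices": [],
                "entity_type": mention["entity_type"],
                "entity_id": mention["entity_id"],
            },
        )
        entity["mention_indices"].append(index)
    return list(entities.values())


# The first four mentions of the Output example.
mentions = [
    {"span": [0, 2], "name": "Tricuspid valve regurgitation",
     "entity_type": "Disease", "entity_id": "D014262"},
    {"span": [4, 5], "name": "lithium carbonate",
     "entity_type": "Chemical", "entity_id": "D016651"},
    {"span": [6, 6], "name": "toxicity",
     "entity_type": "Disease", "entity_id": "D064420"},
    {"span": [16, 17], "name": "tricuspid regurgitation",
     "entity_type": "Disease", "entity_id": "D014262"},
]
entities = group_mentions_into_entities(mentions)
print(entities)
```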

## Available Trained Pipelines

The following pipelines are currently available.

| identifier | NER Model and Dataset (Entity Types) | ED-Retrieval Model and Dataset with Knowledge Base | ED-Reranking Model and Dataset with Knowledge Base | DocRE Model and Dataset (Relation Labels) |
| --- | --- | --- | --- | --- |
| cdr_biaffinener_blink_atlop | Biaffine-NER on CDR (Chemical, Disease) | BLINK Bi-Encoder on CDR + MeSH (2015) | BLINK Cross-Encoder on CDR + MeSH (2015) | ATLOP on CDR (Chemical-Induce-Disease) |

## Training

If the trained pipelines do not cover your target domain, entity types, knowledge base, or relation labels, please train each task component in the pipeline on your dataset.
Once you have trained the pipeline, please save it for future reuse. You can set your own identifier.

```python
import kapipe

ka = kapipe.blank(gpu_map={"ner": 0, "ed_retrieval": 1, "ed_reranking": 2, "docre": 3})

# NER
ka.ner.fit(
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "allow_nested_entities": True,
        "max_epoch": 10,
    }
)

# ED-Retrieval
ka.ed_ret.fit(
    entity_dict,
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
    }
)

# ED-Reranking
train_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in train_documents
]
dev_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents
]
ka.ed_rank.fit(
    entity_dict,
    train_documents, train_candidate_entities,
    dev_documents, dev_candidate_entities,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
    }
)

# DocRE
ka.docre.fit(
    train_documents,
    dev_documents,
    optional_config={
        "bert_pretrained_name_or_path": "allenai/scibert_scivocab_uncased",
        "bert_learning_rate": 2e-5,
        "task_learning_rate": 1e-4,
        "dataset_name": "example_dataset",
        "max_epoch": 10,
        "possible_head_entity_types": ["Chemical"], # or None
        "possible_tail_entity_types": ["Disease"], # or None
    }
)
 
ka.save("your favorite identifier")
```
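
Before training on your own dataset, it may help to check that each annotated Document is internally consistent. The sketch below is one possible check, not a KAPipe feature: `check_document` is a hypothetical helper, and it assumes the index conventions suggested by the Output example (inclusive document-level token spans, `mention_indices` pointing into `mentions`, and `arg1`/`arg2` pointing into `entities`).

```python
def check_document(document):
    """Return a list of human-readable problems found in a Document."""
    problems = []
    num_tokens = sum(len(s.split()) for s in document["sentences"])
    mentions = document.get("mentions", [])
    entities = document.get("entities", [])

    # Every span must lie inside the document.
    for i, mention in enumerate(mentions):
        begin, end = mention["span"]
        if not (0 <= begin <= end < num_tokens):
            problems.append(f"mention {i}: span {mention['span']} is out of range")

    # Every entity must point at existing mentions with the same ID.
    for i, entity in enumerate(entities):
        for j in entity["mention_indices"]:
            if not (0 <= j < len(mentions)):
                problems.append(f"entity {i}: mention index {j} is out of range")
            elif mentions[j]["entity_id"] != entity["entity_id"]:
                problems.append(f"entity {i}: mention {j} has a different entity_id")

    # Every relation argument must point at an existing entity.
    for i, relation in enumerate(document.get("relations", [])):
        for key in ("arg1", "arg2"):
            if not (0 <= relation[key] < len(entities)):
                problems.append(f"relation {i}: {key}={relation[key]} is out of range")
    return problems


# Toy Documents for illustration only.
good = {
    "doc_key": "toy",
    "sentences": ["Lithium causes cyanosis ."],
    "mentions": [
        {"span": [0, 0], "name": "Lithium",
         "entity_type": "Chemical", "entity_id": "D008094"},
        {"span": [2, 2], "name": "cyanosis",
         "entity_type": "Disease", "entity_id": "D003490"},
    ],
    "entities": [
        {"mention_indices": [0], "entity_type": "Chemical", "entity_id": "D008094"},
        {"mention_indices": [1], "entity_type": "Disease", "entity_id": "D003490"},
    ],
    "relations": [{"arg1": 0, "relation": "CID", "arg2": 1}],
}
bad = {**good, "relations": [{"arg1": 0, "relation": "CID", "arg2": 5}]}

print(check_document(good))
print(check_document(bad))
```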

If you would like to train only specific task components (for example, if you would like to use models trained on CDR for NER and DocRE, and train a new model on a different version of MeSH for ED), please do the following.

```python
import kapipe

ka = kapipe.load("cdr_biaffinener_blink_atlop")

# ED-Retrieval
ka.ed_ret.fit(
    entity_dict,
    train_documents,
    dev_documents
)

# ED-Reranking
train_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in train_documents
]
dev_candidate_entities = [
    ka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents
]
ka.ed_rank.fit(
    entity_dict,
    train_documents, train_candidate_entities,
    dev_documents, dev_candidate_entities
)

ka.save("your favorite identifier")
```

To use the saved pipeline, load it by specifying its identifier.

```python
import kapipe

ka = kapipe.load("your favorite identifier")
```

## Experiments on Custom Models

The pipeline is a top-level wrapper class that consists of a cascade of task components, and each task component is itself a black-box class that wraps a specific model (e.g., Biaffine-NER, ATLOP).
To train, evaluate, and analyze specific methods, it may be more intuitive to instantiate each method (hereafter referred to as a "system") directly rather than going through the pipeline.

The core of KAPipe is the systems, and the pipeline is just a wrapper to make them easy to use with minimal coding. If you are familiar with coding and your goal is not just to apply the KA pipeline, but also to develop the methods, it would be better to work directly with the systems rather than using the pipeline.

The fastest way to find out how to initialize, train and evaluate each system is to look at the `experiments/codes/run_*` scripts.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/norikinishida/kapipe",
    "name": "kapipe",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "NLP, knowledge acquisition, information extraction, named entity recognition, entity disambiguation, relation extraction",
    "author": "Noriki Nishida",
    "author_email": "norikinishida@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/26/e2/2a44a54bda885e219063ade5c230cc33bd23694efa9b4584cb5191381e0f/kapipe-0.0.3.tar.gz",
    "platform": null,
    "description": "# KAPipe\n\nKAPipe is a learnable pipeline for knowledge acquisition, with a particular focus on (semi-)automatically complementing knowledge bases in specialized domains.\n\n## Features\n\n- KAPipe provides **trained pipelines** for end-to-end knowledge graph construction from text.\n- A pipeline is designed as a cascade of the following task components:\n    - **Named Entity Recognition (NER):** Extracting entity mention spans and their entity types from the input text.\n    - **Entity Disambiguation - Retrieval (ED-Retrieval)**: Retrieving a set of candidate entity IDs for each given mention in the text, based on a knowledge-base entity pool.\n    - **Entity Disambiguation - Reranking (ED-Reranking)**: Reranking the retrieved entity IDs and selecting the most likely entity ID for each given mention in the text.\n    - **Document-level Relation Extraction (DocRE)**: Extracting a set of relational triples (head entity, relation, tail entity) for a given entity set.\n- It is possible to use only specific task components.\n- KAPipe uses the **state-of-the-art models** for each task component.\n- KAPipe also supports **training** of the pipeline (or specific task components) for new domains, entity types, relation labels, and knowledge bases.\n- This repository also contains the source codes for experiments on custom models, including BERT-based supervised learning models and Large Language Model (LLM)-based In-Context Learning. 
The following customizable models are implemented for each task:\n    - NER: Biaffine-NER ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)), LLM-NER\n    - ED-Retrieval: BLINK Bi-Encoder ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)), BM25, Levenshtein-based retriever\n    - ED-Reranking: BLINK Cross-Encoder ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)), LLM-ED\n    - DocRE: ATLOP ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)), LLM-DocRE, MA-ATLOP and MA-QA (Oumaima and Nishida et al., 2024)\n\n## Installation\n\n```bash\npython -m venv .env\nsource .env/bin/activate\npip install -U pip setuptools wheel\npip install kapipe\n```\n\n## Data Format: *Document*\n\nWe define a common dictionary format as the input and output for the pipeline.\nWe call this dictionary format **\"Document\"**.\nThe pipeline is a cascade of the tasks components, and the input and output of each task component is also a Document.\nThe information in the input Document is either passed on to the output Document or updated.\nNote: As Documents are just dictionary data, users can add their own meta-information, such as information on the correspondence between each word and its position on the PDF, to the Documents, and this information will be retained in the pipeline's output.\n\n### Input:\n\n```json\n{\n    \"doc_key\": \"6794356\",\n    \"sentences\": [\n        \"Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .\",\n        \"A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .\",\n        \"This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .\",\n        \"Sixty - three percent of these infants had 
tricuspid valve involvement .\",\n        \"Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .\",\n        \"It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery .\"\n    ]\n}\n```\n\n### Output:\n\n```json\n{\n    \"doc_key\": \"6794356\",\n    \"sentences\": [\n        \"Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .\",\n        \"A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .\",\n        \"This is the first patient to initially manifest tricuspid regurgitation and atrial flutter , and the 11th described patient with cardiac disease among infants exposed to lithium compounds in the first trimester of pregnancy .\",\n        \"Sixty - three percent of these infants had tricuspid valve involvement .\",\n        \"Lithium carbonate may be a factor in the increasing incidence of congenital heart disease when taken during early pregnancy .\",\n        \"It also causes neurologic depression , cyanosis , and cardiac arrhythmia when consumed prior to delivery .\"\n    ],\n    \"mentions\": [\n        {\n            \"span\": [\n                0,\n                2\n            ],\n            \"name\": \"Tricuspid valve regurgitation\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        {\n            \"span\": [\n                4,\n                5\n            ],\n            \"name\": \"lithium carbonate\",\n            \"entity_type\": \"Chemical\",\n            \"entity_id\": \"D016651\"\n        },\n        {\n            \"span\": [\n                6,\n                6\n            ],\n            \"name\": \"toxicity\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D064420\"\n        },\n        {\n            \"span\": [\n       
         16,\n                17\n            ],\n            \"name\": \"tricuspid regurgitation\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        {\n            \"span\": [\n                19,\n                20\n            ],\n            \"name\": \"atrial flutter\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D001282\"\n        },\n        {\n            \"span\": [\n                22,\n                24\n            ],\n            \"name\": \"congestive heart failure\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D006333\"\n        },\n        {\n            \"span\": [\n                30,\n                30\n            ],\n            \"name\": \"lithium\",\n            \"entity_type\": \"Chemical\",\n            \"entity_id\": \"D008094\"\n        },\n        {\n            \"span\": [\n                43,\n                44\n            ],\n            \"name\": \"tricuspid regurgitation\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        {\n            \"span\": [\n                46,\n                47\n            ],\n            \"name\": \"atrial flutter\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D001282\"\n        },\n        {\n            \"span\": [\n                55,\n                56\n            ],\n            \"name\": \"cardiac disease\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D006331\"\n        },\n        {\n            \"span\": [\n                61,\n                61\n            ],\n            \"name\": \"lithium\",\n            \"entity_type\": \"Chemical\",\n            \"entity_id\": \"D008094\"\n        },\n        {\n            \"span\": [\n                82,\n                83\n            ],\n            \"name\": \"Lithium carbonate\",\n            \"entity_type\": 
\"Chemical\",\n            \"entity_id\": \"D016651\"\n        },\n        {\n            \"span\": [\n                93,\n                95\n            ],\n            \"name\": \"congenital heart disease\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D006331\"\n        },\n        {\n            \"span\": [\n                105,\n                106\n            ],\n            \"name\": \"neurologic depression\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D003866\"\n        },\n        {\n            \"span\": [\n                108,\n                108\n            ],\n            \"name\": \"cyanosis\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D003490\"\n        },\n        {\n            \"span\": [\n                111,\n                112\n            ],\n            \"name\": \"cardiac arrhythmia\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D001145\"\n        }\n    ],\n    \"entities\": [\n        {\n            \"mention_indices\": [\n                0,\n                3,\n                7\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        {\n            \"mention_indices\": [\n                1,\n                11\n            ],\n            \"entity_type\": \"Chemical\",\n            \"entity_id\": \"D016651\"\n        },\n        {\n            \"mention_indices\": [\n                2\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D064420\"\n        },\n        {\n            \"mention_indices\": [\n                4,\n                8\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D001282\"\n        },\n        {\n            \"mention_indices\": [\n                5\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D006333\"\n     
   },\n        {\n            \"mention_indices\": [\n                6,\n                10\n            ],\n            \"entity_type\": \"Chemical\",\n            \"entity_id\": \"D008094\"\n        },\n        {\n            \"mention_indices\": [\n                9,\n                12\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D006331\"\n        },\n        {\n            \"mention_indices\": [\n                13\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D003866\"\n        },\n        {\n            \"mention_indices\": [\n                14\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D003490\"\n        },\n        {\n            \"mention_indices\": [\n                15\n            ],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D001145\"\n        }\n    ],\n    \"relations\": [\n        {\n            \"arg1\": 1,\n            \"relation\": \"CID\",\n            \"arg2\": 7\n        },\n        {\n            \"arg1\": 1,\n            \"relation\": \"CID\",\n            \"arg2\": 8\n        },\n        {\n            \"arg1\": 1,\n            \"relation\": \"CID\",\n            \"arg2\": 9\n        }\n    ]\n}\n```\n\n## Downloading Trained Pipelines\n\nTrained pipelines can be downloaded from https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp.\n\nDownload the latest compressed file `release.YYYYMMDD.tar.gz`, and then unzip it in `~/.kapipe` directory as follows:\n\n```bash\nmkdir ~/.kapipe\nmv release.YYYYMMDD.tar.gz ~/.kapipe\ncd ~/.kapipe\ntar -zxvf release.YYYYMMDD.tar.gz\n```\n\n## Loading and Using Pipeline\n\nThe easiest way to apply the knowledge acquisition pipeline (i.e., the cascade of NER, ED, and DocRE tasks) to an input document is to load the pipeline using `kapipe.load()` and just apply it to the document.\n\n```python\nimport kapipe\nka = 
kapipe.load(\"cdr_biaffinener_blink_atlop\")\ndocument = ka(document)\n```\n\nThe above code loads and uses models that have already been trained for specific domains, entity types (in NER), knowledge bases (in ED), and relation labels (in DocRE).\nSpecifically, the identifier `\"cdr_biaffinener_blink_atlop\"` above indicates that the Biaffine-NER, BLINK, and ATLOP models trained on the CDR dataset (biomedical abstracts, Chemical and Disease entity types, entity IDs based on the MeSH ontology, and Chemical-Induce-Disease relation label) are used for NER, ED, and DocRE, respectively.\n\nIt is also possible to apply specific tasks by directly calling the task components.\nFor example, if you would like to perform only NER and ED, please do the following.\n\n```python\nimport kapipe\nka = kapipe.load(\"cdr_biaffinener_blink_atlop\")\n\n# NER\ndocument = ka.ner(document)\n# ED-Retrieval\ndocument, candidate_entities = ka.ed_ret(document, num_candidate_entities=10)\n# ED-Reranking\ndocument = ka.ed_rank(document, candidate_entities)\n```\n\nAlso, for example, if mentions and entities have already been annotated (by humans or external systems), and if you would like to perform only DocRE, do the following.\nNote that the mentions and entities have already been integrated into the input document.\n\n```python\nimport kapipe\nka = kapipe.load(\"cdr_biaffinener_blink_atlop\")\n\n# DocRE\ndocument = ka.docre(document_with_gold_mentions_and_entities)\n```\n\n## Available Trained Pipelines\n\nThe following pipelines are currently available.\n\n| identifier | NER Model and Dataset (Entity Types) | ED-Retrieval Model and Dataset with Knowledge Base | ED-Reranking Model and Dataset with Knowledge Base | DocRE Model and Dataset (Relation Labels) |\n| --- | --- | --- | --- | --- |\n| cdr_biaffinener_blink_atlop | Biaffine-NER on CDR (Chemical, Disease) | BLINK Bi-Encoder on CDR + MeSH (2015) | BLINK Cross-Encoder on CDR + MeSH (2015) | ATLOP on CDR (Chemical-Induce-Disease) |\n\n## 
Training\n\nIf the trained pipelines do not cover your target domain, entity types, knowledge base, or relation labels, please train each task component in the pipeline on your dataset.\nOnce you have trained the pipeline, please save it for future reuse. You can set your own identifier.\n\n```python\nimport kapipe\n\nka = kapipe.blank(gpu_map={\"ner\":0, \"ed_retrieval\":1, \"ed_reranking\": 2, \"docre\": 3})\n\n# NER\nka.ner.fit(\n    train_documents,\n    dev_documents,\n    optional_config={\n        \"bert_pretrained_name_or_path\": \"allenai/scibert_scivocab_uncased\",\n        \"bert_learning_rate\": 2e-5,\n        \"task_learning_rate\": 1e-4,\n        \"dataset_name\": \"example_dataset\",\n        \"allow_nested_entities\": True,\n        \"max_epoch\": 10,\n    }\n)\n\n# ED-Retrieval\nka.ed_ret.fit(\n    entity_dict,\n    train_documents,\n    dev_documents,\n    optional_config={\n        \"bert_pretrained_name_or_path\": \"allenai/scibert_scivocab_uncased\",\n        \"bert_learning_rate\": 2e-5,\n        \"task_learning_rate\": 1e-4,\n        \"dataset_name\": \"example_dataset\",\n        \"max_epoch\": 10,\n    }\n)\n\n# ED-Reranking\ntrain_candidate_entities = [\n\tka.ed_ret(d, retrieval_size=128)[1] for d in train_documents\n]\ndev_candidate_entities = [\n\tka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents\n]\nka.ed_rank.fit(\n    entity_dict,\n    train_documents, train_candidate_entities,\n    dev_documents, dev_candidate_entities,\n    optional_config={\n        \"bert_pretrained_name_or_path\": \"allenai/scibert_scivocab_uncased\",\n        \"bert_learning_rate\": 2e-5,\n        \"task_learning_rate\": 1e-4,\n        \"dataset_name\": \"example_dataset\",\n        \"max_epoch\": 10,\n    }\n)\n\n# DocRE\nka.docre.fit(\n    train_documents,\n    dev_documents,\n    optional_config={\n        \"bert_pretrained_name_or_path\": \"allenai/scibert_scivocab_uncased\",\n        \"bert_learning_rate\": 2e-5,\n        \"task_learning_rate\": 
1e-4,\n        \"dataset_name\": \"example_dataset\",\n        \"max_epoch\": 10,\n        \"possible_head_entity_types\": [\"Chemical\"], # or None\n        \"possible_tail_entity_types\": [\"Disease\"], # or None\n    }\n)\n\nka.save(\"your favorite identifier\")\n```\n\nIf you would like to train only specific task components (for example, to keep the models trained on CDR for NER and DocRE while training new ED models on a different version of MeSH), please do the following.\n\n```python\nimport kapipe\n\nka = kapipe.load(\"cdr_biaffinener_blink_atlop\")\n\n# ED-Retrieval\nka.ed_ret.fit(\n    entity_dict,\n    train_documents,\n    dev_documents\n)\n\n# ED-Reranking\ntrain_candidate_entities = [\n    ka.ed_ret(d, retrieval_size=128)[1] for d in train_documents\n]\ndev_candidate_entities = [\n    ka.ed_ret(d, retrieval_size=128)[1] for d in dev_documents\n]\nka.ed_rank.fit(\n    entity_dict,\n    train_documents, train_candidate_entities,\n    dev_documents, dev_candidate_entities\n)\n\nka.save(\"your favorite identifier\")\n```\n\nTo use the saved pipeline, please load it by specifying the identifier.\n\n```python\nimport kapipe\n\nka = kapipe.load(\"your favorite identifier\")\n```\n\n## Experiments on Custom Models\n\nThe pipeline is a top-level wrapper class that consists of a cascade of task components, and each task component is itself a black-box class that wraps a specific model (e.g., Biaffine-NER, ATLOP).\nTo train, evaluate, and analyze a specific method, it may be more intuitive to instantiate that method directly (hereafter referred to as a \"system\") rather than going through the pipeline.\n\nThe systems are the core of KAPipe; the pipeline is just a wrapper that makes them easy to use with minimal coding. 
If you are familiar with coding and your goal is not just to apply the KA pipeline but also to develop the methods, it is better to work directly with the systems than with the pipeline.\n\nThe fastest way to learn how to initialize, train, and evaluate each system is to look at the `experiments/codes/run_*` scripts.\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "A learnable pipeline for knowledge acquisition",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/norikinishida/kapipe"
    },
    "split_keywords": [
        "nlp",
        " knowledge acquisition",
        " information extraction",
        " named entity recognition",
        " entity disambiguation",
        " relation extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e38e1c005e9fd2c1eaf8f7d5713cd2e00236481168ae9135d6d3bee26f1a0565",
                "md5": "7c48a34bd9a264641c00f6d6ab015c75",
                "sha256": "0e5ae6a73880e51dfe817cbc0ce9175467eb9a963aaf4c1fbf903a4abbad96a4"
            },
            "downloads": -1,
            "filename": "kapipe-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7c48a34bd9a264641c00f6d6ab015c75",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 141978,
            "upload_time": "2024-07-23T09:35:07",
            "upload_time_iso_8601": "2024-07-23T09:35:07.599808Z",
            "url": "https://files.pythonhosted.org/packages/e3/8e/1c005e9fd2c1eaf8f7d5713cd2e00236481168ae9135d6d3bee26f1a0565/kapipe-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "26e22a44a54bda885e219063ade5c230cc33bd23694efa9b4584cb5191381e0f",
                "md5": "9c45c58936ba157ffb901d82b1fbaa12",
                "sha256": "b156d9cc9e4b48ed1ab94301da550c4b0e026e2d2c40dd8dbe227939caa3e429"
            },
            "downloads": -1,
            "filename": "kapipe-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "9c45c58936ba157ffb901d82b1fbaa12",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 93472,
            "upload_time": "2024-07-23T09:35:09",
            "upload_time_iso_8601": "2024-07-23T09:35:09.758192Z",
            "url": "https://files.pythonhosted.org/packages/26/e2/2a44a54bda885e219063ade5c230cc33bd23694efa9b4584cb5191381e0f/kapipe-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-23 09:35:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "norikinishida",
    "github_project": "kapipe",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.22.2"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.10.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.5.3"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.7.1"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": []
        },
        {
            "name": "opt-einsum",
            "specs": [
                [
                    ">=",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "faiss-gpu",
            "specs": [
                [
                    ">=",
                    "1.7.2"
                ]
            ]
        },
        {
            "name": "pyhocon",
            "specs": [
                [
                    ">=",
                    "0.3.60"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.66.1"
                ]
            ]
        },
        {
            "name": "jsonlines",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "Levenshtein",
            "specs": [
                [
                    ">=",
                    "0.25.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": []
        },
        {
            "name": "accelerate",
            "specs": []
        },
        {
            "name": "bitsandbytes",
            "specs": []
        }
    ],
    "lcname": "kapipe"
}
        