accord-nlp

Name	accord-nlp
Version	1.0.0
home_page	https://github.com/Accord-Project/accord-nlp
Summary	ACCORD-NLP: Transformer/language model-based information extraction from regulatory text
upload_time	2024-04-25 19:47:36
maintainer	None
docs_url	None
author	Hansi Hettiarachchi
requires_python	>=3.9
license	None
keywords	nlp ner relation extraction information extraction

# ACCORD-NLP Framework

ACCORD-NLP is a Natural Language Processing (NLP) framework developed as part of the Horizon Europe project Automated Compliance Checks for Construction, Renovation or Demolition Works ([ACCORD](https://accordproject.eu/)) to facilitate Automated Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector.

Compliance checking plays a pivotal role in the AEC sector, ensuring the safety, reliability, stability, and usability of building designs. Traditionally, this process has relied on manual approaches, which are resource-intensive and time-consuming. Attention has therefore shifted towards automated methods to streamline compliance checks. Automating these processes requires transforming building regulations, which are written as text aimed at domain experts, into machine-processable formats. This has been challenging, primarily due to the inherent complexity and unstructured nature of natural language. Moreover, regulatory texts often exhibit domain-specific characteristics, ambiguities, and intricate clausal structures, further complicating the task.

ACCORD-NLP offers data, AI models and workflows developed using state-of-the-art NLP techniques to extract rules from textual data, supporting ACC.

## Installation <a name="installation"> </a>

As the initial step, PyTorch needs to be installed. The recommended PyTorch version is 2.0.1. Please refer to the [PyTorch](https://pytorch.org/get-started/locally/#start-locally) 
installation page for the specific installation command for your platform.

Once PyTorch has been installed, accord-nlp can be installed either from the source or as a Python package via pip. 
The latter approach is recommended. 

### From Source
```
git clone https://github.com/Accord-Project/accord-nlp.git
cd accord-nlp
pip install -r requirements.txt
```

### From pip
```
pip install accord-nlp
```
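
After installation, it can help to confirm which versions ended up in the active environment. The snippet below is a minimal sketch using only the Python standard library; the helper name `installed_version` is introduced here for illustration and is not part of accord-nlp.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "torch" is the distribution name of PyTorch; "accord-nlp" is this package.
for name in ("torch", "accord-nlp"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If `torch` reports a version other than the recommended 2.0.1, the package may still work, but that configuration is not the one recommended above.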

## Features
1. [Data Augmentation](#da)
2. [Entity Classification](#ner)
3. [Relation Classification](#re)
4. [Information Extraction](#ie)

### Data Augmentation <a name="da"> </a>

Data augmentation supports the synthetic oversampling of relation-annotated data within a domain-specific context. It can be run with the following code. The original experiment script is available [here](https://github.com/Accord-Project/accord-nlp/blob/main/experiments/data_augmentation/da_experiment.py).

```python
from accord_nlp.data_augmentation import RelationDA

entities = ['object', 'property', 'quality', 'value']
rda = RelationDA(entity_categories=entities)

relations_path = '<.csv file path to original relation-annotated data>'
entities_path = '<.csv file path to entity samples per category>'
output_path = '<.csv file path to save newly created data>'
rda.replace_entities(relations_path, entities_path, output_path, n=12)
```
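
To give an intuition for entity replacement, the sketch below swaps each marked entity in a relation sample for another entity of the same category. It is an illustration only, not the `RelationDA` implementation: the sample layout (a sentence with `<e1>`/`<e2>` markers plus per-entity category fields), the entity pool, and the helper names `replace_entity` and `augment` are all assumptions made for this example.

```python
import random
import re


def replace_entity(sentence, tag, new_text):
    """Swap the text between <tag> and </tag> for new_text, keeping the markers."""
    pattern = re.compile(rf"(<{tag}>).*?(</{tag}>)")
    # A function replacement avoids backslash-escape handling in new_text.
    return pattern.sub(lambda m: m.group(1) + new_text + m.group(2), sentence, count=1)


def augment(sample, entity_pool, n, seed=0):
    """Return up to n distinct synthetic variants of one relation sample."""
    rng = random.Random(seed)
    variants = set()
    for _ in range(n * 10):  # bounded number of attempts
        if len(variants) >= n:
            break
        sentence = sample["sentence"]
        for tag in ("e1", "e2"):
            category = sample[f"{tag}_category"]
            sentence = replace_entity(sentence, tag, rng.choice(entity_pool[category]))
        if sentence != sample["sentence"]:
            variants.add(sentence)
    return sorted(variants)


# Hypothetical inputs, for illustration only.
sample = {
    "sentence": "The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.",
    "e1_category": "property",
    "e2_category": "value",
}
entity_pool = {
    "property": ["gradient", "width", "height"],
    "value": ["five per cent", "900 mm", "two metres"],
}
for variant in augment(sample, entity_pool, n=3):
    print(variant)
```

Because the relation label is kept while the entity text changes, replacements drawn from the same category are what keep the synthetic samples plausible within the domain.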

#### Available Datasets

The data augmentation approach was applied to the relation-annotated training data in the [CODE-ACCORD](https://github.com/Accord-Project/CODE-ACCORD) corpus. It generated 2,912 synthetic data samples, resulting in a training set of 6,375 relations. Our paper, listed below, provides more details about the data statistics.

The augmented training dataset can be loaded into a Pandas DataFrame using the following code.

```python
from datasets import load_dataset

data_files = {"augmented_train": "augmented.csv"}
dataset = load_dataset("ACCORD-NLP/CODE-ACCORD-Relations", data_files=data_files, split="augmented_train")
# Convert the Hugging Face Dataset into a pandas DataFrame
augmented_train = dataset.to_pandas()
```

### Entity Classification <a name="ner"> </a>

We adapted the transformer's sequence labelling architecture to fine-tune the entity classifier, following its remarkable results in the NLP domain. The general transformer architecture was modified by adding individual softmax layers per output token to support entity classification. 

Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations. 
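
The per-token softmax idea described above can be sketched in plain Python as follows. This is a conceptual illustration, not the model code: the label set (including the `O` label for non-entity tokens) and the logits are made up for the example, and in the real model the logits come from the transformer's output for each token.

```python
import math

# Illustrative label set: the four entity categories plus an assumed "O" (no entity).
LABELS = ["O", "object", "property", "quality", "value"]


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(tokens, token_logits):
    """Apply a softmax to each token's logits and pick the most probable label."""
    results = []
    for token, logits in zip(tokens, token_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((token, LABELS[best], probs[best]))
    return results


# Made-up logits for two tokens, one row per token and one score per label.
tokens = ["gradient", "exceed"]
token_logits = [
    [0.1, 0.2, 3.0, 0.0, 0.1],
    [2.5, 0.1, 0.3, 0.2, 0.1],
]
for token, label, prob in classify_tokens(tokens, token_logits):
    print(f"{token}: {label} ({prob:.2f})")
```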

#### Available Models

We fine-tuned four pre-trained transformer models (i.e. BERT, ELECTRA, ALBERT and RoBERTa) for entity classification.
All the fine-tuned models are available on [Hugging Face](https://huggingface.co/ACCORD-NLP), and can be accessed using the following code.

```python
from accord_nlp.text_classification.ner.ner_model import NERModel

model = NERModel('roberta', 'ACCORD-NLP/ner-roberta-large')
predictions, raw_outputs = model.predict(['The gradient of the passageway should not exceed five per cent.'])
print(predictions)
```

### Relation Classification <a name="re"> </a>

Relation classification aims to predict the semantic relationship between two entities within a context. We introduced four special tokens (i.e. \<e1>, \</e1>, \<e2> and \</e2>) to mark an entity pair in the input text. \<e1> and \</e1> mark the start and end of the first entity in the selected text sequence, while \<e2> and \</e2> mark the start and end of the second entity. The transformer outputs corresponding to \<e1> and \<e2> were passed through a softmax layer to predict the relation category. 

Our paper, listed below, provides more details about the model architecture, fine-tuning process, experiments and evaluations.
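
The input format described above can be produced from character offsets with a small helper. The function below is a sketch written for this README; `mark_entity_pair` is a hypothetical name and is not part of the accord-nlp API.

```python
def mark_entity_pair(text, e1_span, e2_span):
    """Wrap two non-overlapping (start, end) character spans in <e1>/<e2> markers."""
    # Sort by position so the markers are inserted left to right.
    spans = sorted([(tuple(e1_span), "e1"), (tuple(e2_span), "e2")])
    (a_start, a_end), a_tag = spans[0]
    (b_start, b_end), b_tag = spans[1]
    if a_end > b_start:
        raise ValueError("entity spans must not overlap")
    return (
        text[:a_start]
        + f"<{a_tag}>" + text[a_start:a_end] + f"</{a_tag}>"
        + text[a_end:b_start]
        + f"<{b_tag}>" + text[b_start:b_end] + f"</{b_tag}>"
        + text[b_end:]
    )


sentence = "The gradient of the passageway should not exceed five per cent."
first, second = "gradient", "five per cent"
e1 = (sentence.index(first), sentence.index(first) + len(first))
e2 = (sentence.index(second), sentence.index(second) + len(second))
print(mark_entity_pair(sentence, e1, e2))
# The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.
```

The helper keeps the `e1`/`e2` tags attached to the spans they were given for, so the first entity may appear after the second one in the sentence.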

#### Available Models

We fine-tuned three pre-trained transformer models (i.e. BERT, ALBERT and RoBERTa) for relation classification. 
All the fine-tuned models are available on [Hugging Face](https://huggingface.co/ACCORD-NLP), and can be accessed using the following code.

```python
from accord_nlp.text_classification.relation_extraction.re_model import REModel

model = REModel('roberta', 'ACCORD-NLP/re-roberta-large')
predictions, raw_outputs = model.predict(['The <e1>gradient</e1> of the passageway should not exceed <e2>five per cent</e2>.'])
print(predictions)
```

### Information Extraction <a name="ie"> </a>

Our information extraction pipeline aims to transform a regulatory sentence into a machine-processable output (i.e., a knowledge graph of entities and relations). It utilises the entity and relation classifiers mentioned above to sequentially extract information from the text to build the final graph. 

Our paper, listed below, provides more details about the pipeline, including its individual components. 
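
As a rough picture of how entity and relation predictions can be combined into a graph, consider the sketch below. It is not the `InformationExtractor` implementation: it assumes that every ordered entity pair is passed to the relation classifier, the relation labels (`part-of`, `less-equal`, `none`) and entity categories are placeholders, and a lookup table stands in for the fine-tuned models.

```python
from itertools import permutations


def build_graph(entities, classify_relation):
    """Assemble nodes and labelled edges from entities and a pairwise relation classifier.

    entities: list of (text, category) tuples, as an entity classifier might return.
    classify_relation: callable taking (head, tail) and returning a label,
        where "none" means no relation holds.
    """
    graph = {"nodes": list(entities), "edges": []}
    for head, tail in permutations(entities, 2):
        label = classify_relation(head, tail)
        if label != "none":
            graph["edges"].append((head[0], label, tail[0]))
    return graph


# Stand-in predictions; a real pipeline would call the fine-tuned models instead.
entities = [("gradient", "property"), ("passageway", "object"), ("five per cent", "value")]
known = {
    ("gradient", "passageway"): "part-of",
    ("gradient", "five per cent"): "less-equal",
}


def lookup_relation(head, tail):
    """Return the placeholder relation for an entity pair, or "none"."""
    return known.get((head[0], tail[0]), "none")


graph = build_graph(entities, lookup_relation)
for edge in graph["edges"]:
    print(edge)
```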


The default pipeline configuration uses the best-performing entity and relation classification models, and the 
default pipeline can be accessed using the following code.

```python
from accord_nlp.information_extraction.ie_pipeline import InformationExtractor

sentence = 'The gradient of the passageway should not exceed five per cent.'

ie = InformationExtractor()
ie.sentence_to_graph(sentence)
```

The following code can be used to access the pipeline with different configurations. Please refer to [ie_pipeline.py](https://github.com/Accord-Project/accord-nlp/blob/main/accord_nlp/information_extraction/ie_pipeline.py) 
for more details about the input parameters. 

```python
from accord_nlp.information_extraction.ie_pipeline import InformationExtractor

sentence = 'The gradient of the passageway should not exceed five per cent.'

ner_model_info = ('roberta', 'ACCORD-NLP/ner-roberta-large')
re_model_info = ('roberta', 'ACCORD-NLP/re-roberta-large')
ie = InformationExtractor(ner_model_info=ner_model_info, re_model_info=re_model_info, debug=True)
ie.sentence_to_graph(sentence)

```

A live demo of the Information Extractor is also available on [Hugging Face](https://huggingface.co/spaces/ACCORD-NLP/information-extractor). 


## Reference

*Please note that the corresponding paper for this work is currently in progress and will be made available soon. Thank you for your patience and interest.*
