[![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
![example event parameter](https://github.com/davidsbatista/BREDS/actions/workflows/code_checks.yml/badge.svg?event=pull_request)
![code coverage](https://raw.githubusercontent.com/davidsbatista/BREDS/coverage-badge/coverage.svg?raw=true)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)](https://github.com/davidsbatista/BREDS/blob/main/CONTRIBUTING.md)
# BREDS
BREDS extracts relationships using a bootstrapping/semi-supervised approach, it relies on an initial set of
seeds, i.e. paris of named-entities representing relationship type to be extracted.
The algorithm expands the initial set of seeds using distributional semantics to generalize the relationship while
limiting the semantic drift.
## Extracting companies headquarters:
The input text needs to have the named-entities tagged, like show in the example bellow:
```yaml
The tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.
<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.
<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.
<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.
<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.
<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.
<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.
```
We need to give seeds to boostrap the extraction process, specifying the type of each named-entity and
relationships examples that should also be present in the input text:
```yaml
e1:ORG
e2:LOC
Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
DoubleClick;New York
SAP;Walldorf
```
To run a simple example, [download](https://drive.google.com/drive/folders/0B0CbnDgKi0PyQ1plbHo0cG5tV2M?resourcekey=0-h_UaGhD4dLfoYITP3pvvUA) the following files
```
- afp_apw_xin_embeddings.bin
- sentences_short.txt.bz2
- seeds_positive.txt
```
Install BREDS using pip
```sh
pip install breads
```
Run the following command:
```sh
breds --word2vec=afp_apw_xin_embeddings.bin --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6
```
After the process is terminated an output file `relationships.jsonl` is generated containing the extracted relationships.
You can pretty print it's content to the terminal with: `jq '.' < relationships.jsonl`:
```json
{
"entity_1": "Medtronic",
"entity_2": "Minneapolis",
"confidence": 0.9982486865148862,
"sentence": "<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . ",
"bef_words": "",
"bet_words": ", based in",
"aft_words": ", is",
"passive_voice": false
}
{
"entity_1": "DynCorp",
"entity_2": "Reston",
"confidence": 0.9982486865148862,
"sentence": "Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .",
"bef_words": "Because",
"bet_words": ", headquartered in",
"aft_words": ", Va.",
"passive_voice": false
}
{
"entity_1": "Handspring",
"entity_2": "Silicon Valley",
"confidence": 0.893486865148862,
"sentence": "There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .",
"bef_words": "firms like",
"bet_words": ", a company based in",
"aft_words": "that looks",
"passive_voice": false
}
```
<br>
BREDS has several parameters to tune the extraction process, in the example above it uses the default values, but these
can be set in the configuration file: `parameters.cfg`
```yaml
max_tokens_away=6 # maximum number of tokens between the two entities
min_tokens_away=1 # minimum number of tokens between the two entities
context_window_size=2 # number of tokens to the left and right of each entity
alpha=0.2 # weight of the BEF context in the similarity function
beta=0.6 # weight of the BET context in the similarity function
gamma=0.2 # weight of the AFT context in the similarity function
wUpdt=0.5 # < 0.5 trusts new examples less on each iteration
number_iterations=4 # number of bootstrap iterations
wUnk=0.1 # weight given to unknown extracted relationship instances
wNeg=2 # weight given to extracted relationship instances
min_pattern_support=2 # minimum number of instances in a cluster to be considered a pattern
```
and passed with the argument `--config=parameters.cfg`.
The full command line parameters are:
```sh
-h, --help show this help message and exit
--config CONFIG file with bootstrapping configuration parameters
--word2vec WORD2VEC an embedding model based on word2vec, in the format of a .bin file
--sentences SENTENCES
a text file with a sentence per line, and with at least two entities per sentence
--positive_seeds POSITIVE_SEEDS
a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'
--negative_seeds NEGATIVE_SEEDS
a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'
--similarity SIMILARITY
the minimum similarity between tuples and patterns to be considered a match
--confidence CONFIDENCE
the minimum confidence score for a match to be considered a true positive
--number_iterations NUMBER_ITERATIONS
the number of iterations the run
```
Please, refer to the section [References and Citations](#references-and-citations) for details on the parameters.
In the first step BREDS pre-processes the input file `sentences.txt` generating word vector representations of
relationships (i.e.: `processed_tuples.pkl`).
This is done so that then you can experiment with different seed examples without having to repeat the process of
generating word vectors representations. Just pass the argument `--sentences=processed_tuples.pkl` instead to skip
this generation step.
---
# References and Citations
[Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics, EMNLP'15](https://aclanthology.org/D15-1056/)
```BibTeX
@inproceedings{batista-etal-2015-semi,
title = "Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics",
author = "Batista, David S. and Martins, Bruno and Silva, M{\'a}rio J.",
booktitle = "Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing",
month = sep,
year = "2015",
address = "Lisbon, Portugal",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D15-1056",
doi = "10.18653/v1/D15-1056",
pages = "499--504",
}
```
["Large-Scale Semantic Relationship Extraction for Information Discovery" - Chapter 5, David S Batista, Ph.D. Thesis](http://davidsbatista.net/assets/documents/publications/dsbatista-phd-thesis-2016.pdf)
```BibTeX
@incollection{phd-dsbatista2016
title = {Large-Scale Semantic Relationship Extraction for Information Discovery},
author = {Batista, David S.},
school = {Instituto Superior Técnico, Universidade de Lisboa},
year = {2016}
}
```
# Presenting BREDS at PyData Berlin 2017
[![Presentation at PyData Berlin 2017](https://img.youtube.com/vi/Ra15lX-wojg/hqdefault.jpg)](https://www.youtube.com/watch?v=Ra15lX-wojg)
---
# Contributing to BREDS
Improvements, adding new features and bug fixes are welcome. If you wish to participate in the development of BREDS,
please read the following guidelines.
## The contribution process at a glance
1. Preparing the development environment
2. Code away!
3. Continuous Integration
4. Submit your changes by opening a pull request
Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in
an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer.
## Preparing the development environment
Make sure you have Python3.9 installed on your system
macOs
```sh
brew install python@3.9
python3.9 -m pip install --user --upgrade pip
python3.9 -m pip install virtualenv
```
Clone the repository and prepare the development environment:
```sh
git clone git@github.com:davidsbatista/BREDS.git
cd BREDS
python3.9 -m virtualenv venv # create a new virtual environment for development using python3.9
source venv/bin/activate # activate the virtual environment
pip install -r requirements_dev.txt # install the development requirements
pip install -e . # install BREDS in edit mode
```
## Continuous Integration
BREDS runs a continuous integration (CI) on all pull requests. This means that if you open a pull request (PR), a full
test suite is run on your PR:
- The code is formatted using `black` and `isort`
- Unused imports are auto-removed using `pycln`
- Linting is done using `pyling` and `flake8`
- Type checking is done using `mypy`
- Tests are run using `pytest`
Nevertheless, if you prefer to run the tests & formatting locally, it's possible too.
```sh
make all
```
## Opening a Pull Request
Every PR should be accompanied by short description of the changes, including:
- Impact and motivation for the changes
- Any open issues that are closed by this PR
---
Give a ⭐️ if this project helped you!
Raw data
{
"_id": null,
"home_page": "",
"name": "breds",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "",
"keywords": "nlp,semantic relationship extraction,bootstrapping,emnlp,vector representations,word embeddings,word2vec,phrase embeddings",
"author": "",
"author_email": "\"David S. Batista\" <dsbatista@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/a1/a3/68a77d43ff1100e90839539da0b816eeb6ce3e7cb1e7572538b1eb016a76/breds-1.0.1.tar.gz",
"platform": null,
"description": "[![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)\n \n![example event parameter](https://github.com/davidsbatista/BREDS/actions/workflows/code_checks.yml/badge.svg?event=pull_request)\n \n![code coverage](https://raw.githubusercontent.com/davidsbatista/BREDS/coverage-badge/coverage.svg?raw=true)\n \n[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)\n \n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n \n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n \n[![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)](https://github.com/davidsbatista/BREDS/blob/main/CONTRIBUTING.md)\n\n\n# BREDS\n\nBREDS extracts relationships using a bootstrapping/semi-supervised approach, it relies on an initial set of \nseeds, i.e. paris of named-entities representing relationship type to be extracted. \n\nThe algorithm expands the initial set of seeds using distributional semantics to generalize the relationship while \nlimiting the semantic drift.\n\n\n## Extracting companies headquarters:\n\nThe input text needs to have the named-entities tagged, like show in the example bellow:\n \n```yaml\nThe tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.\n<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.\n<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.\n<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.\n<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.\n<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.\n<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.\n```\n\nWe need to give seeds to boostrap the extraction process, specifying the type of each named-entity and \nrelationships examples that should also be present in the input text:\n\n```yaml\ne1:ORG\ne2:LOC\n\nLufthansa;Cologne\nNokia;Espoo\nGoogle;Mountain View\nDoubleClick;New York\nSAP;Walldorf\n``` \n\nTo run a simple example, [download](https://drive.google.com/drive/folders/0B0CbnDgKi0PyQ1plbHo0cG5tV2M?resourcekey=0-h_UaGhD4dLfoYITP3pvvUA) the following files\n\n\n```\n- afp_apw_xin_embeddings.bin\n- sentences_short.txt.bz2\n- seeds_positive.txt\n```\n\nInstall BREDS using pip\n\n```sh\npip install breads\n```\n\nRun the following command:\n\n```sh\nbreds --word2vec=afp_apw_xin_embeddings.bin --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6\n```\n\nAfter the process is terminated an output file `relationships.jsonl` is generated containing the extracted relationships. \n\nYou can pretty print it's content to the terminal with: `jq '.' < relationships.jsonl`: \n\n```json\n{\n \"entity_1\": \"Medtronic\",\n \"entity_2\": \"Minneapolis\",\n \"confidence\": 0.9982486865148862,\n \"sentence\": \"<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . \",\n \"bef_words\": \"\",\n \"bet_words\": \", based in\",\n \"aft_words\": \", is\",\n \"passive_voice\": false\n}\n\n{\n \"entity_1\": \"DynCorp\",\n \"entity_2\": \"Reston\",\n \"confidence\": 0.9982486865148862,\n \"sentence\": \"Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .\",\n \"bef_words\": \"Because\",\n \"bet_words\": \", headquartered in\",\n \"aft_words\": \", Va.\",\n \"passive_voice\": false\n}\n\n{\n \"entity_1\": \"Handspring\",\n \"entity_2\": \"Silicon Valley\",\n \"confidence\": 0.893486865148862,\n \"sentence\": \"There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .\",\n \"bef_words\": \"firms like\",\n \"bet_words\": \", a company based in\",\n \"aft_words\": \"that looks\",\n \"passive_voice\": false\n}\n```\n<br>\n\nBREDS has several parameters to tune the extraction process, in the example above it uses the default values, but these \ncan be set in the configuration file: `parameters.cfg`\n\n```yaml\nmax_tokens_away=6 # maximum number of tokens between the two entities\nmin_tokens_away=1 # minimum number of tokens between the two entities\ncontext_window_size=2 # number of tokens to the left and right of each entity\n\nalpha=0.2 # weight of the BEF context in the similarity function\nbeta=0.6 # weight of the BET context in the similarity function\ngamma=0.2 # weight of the AFT context in the similarity function\n\nwUpdt=0.5 # < 0.5 trusts new examples less on each iteration\nnumber_iterations=4 # number of bootstrap iterations\nwUnk=0.1 # weight given to unknown extracted relationship instances\nwNeg=2 # weight given to extracted relationship instances\nmin_pattern_support=2 # minimum number of instances in a cluster to be considered a pattern\n```\n\nand passed with the argument `--config=parameters.cfg`.\n\nThe full command line parameters are:\n\n```sh\n -h, --help show this help message and exit\n --config CONFIG file with bootstrapping configuration parameters\n --word2vec WORD2VEC an embedding model based on word2vec, in the format of a .bin file\n --sentences SENTENCES\n a text file with a sentence per line, and with at least two entities per sentence\n --positive_seeds POSITIVE_SEEDS\n a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'\n --negative_seeds NEGATIVE_SEEDS\n a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'\n --similarity SIMILARITY\n the minimum similarity between tuples and patterns to be considered a match\n --confidence CONFIDENCE\n the minimum confidence score for a match to be considered a true positive\n --number_iterations NUMBER_ITERATIONS\n the number of iterations the run\n```\n\nPlease, refer to the section [References and Citations](#references-and-citations) for details on the parameters.\n\nIn the first step BREDS pre-processes the input file `sentences.txt` generating word vector representations of \nrelationships (i.e.: `processed_tuples.pkl`). \n\nThis is done so that then you can experiment with different seed examples without having to repeat the process of \ngenerating word vectors representations. Just pass the argument `--sentences=processed_tuples.pkl` instead to skip \nthis generation step.\n\n---\n\n# References and Citations\n[Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics, EMNLP'15](https://aclanthology.org/D15-1056/)\n```BibTeX\n@inproceedings{batista-etal-2015-semi,\n title = \"Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics\",\n author = \"Batista, David S. and Martins, Bruno and Silva, M{\\'a}rio J.\",\n booktitle = \"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing\",\n month = sep,\n year = \"2015\",\n address = \"Lisbon, Portugal\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://aclanthology.org/D15-1056\",\n doi = \"10.18653/v1/D15-1056\",\n pages = \"499--504\",\n}\n```\n[\"Large-Scale Semantic Relationship Extraction for Information Discovery\" - Chapter 5, David S Batista, Ph.D. Thesis](http://davidsbatista.net/assets/documents/publications/dsbatista-phd-thesis-2016.pdf)\n```BibTeX\n@incollection{phd-dsbatista2016\n title = {Large-Scale Semantic Relationship Extraction for Information Discovery},\n author = {Batista, David S.},\n school = {Instituto Superior T\u00e9cnico, Universidade de Lisboa},\n year = {2016}\n}\n```\n\n# Presenting BREDS at PyData Berlin 2017\n[![Presentation at PyData Berlin 2017](https://img.youtube.com/vi/Ra15lX-wojg/hqdefault.jpg)](https://www.youtube.com/watch?v=Ra15lX-wojg)\n\n---\n\n# Contributing to BREDS\n\nImprovements, adding new features and bug fixes are welcome. If you wish to participate in the development of BREDS, \nplease read the following guidelines.\n\n## The contribution process at a glance\n\n1. Preparing the development environment\n2. Code away!\n3. Continuous Integration\n4. Submit your changes by opening a pull request\n\nSmall fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in \nan issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer. \n\n\n## Preparing the development environment\n\nMake sure you have Python3.9 installed on your system\n\nmacOs\n```sh\nbrew install python@3.9\npython3.9 -m pip install --user --upgrade pip\npython3.9 -m pip install virtualenv\n```\n\nClone the repository and prepare the development environment:\n\n```sh\ngit clone git@github.com:davidsbatista/BREDS.git\ncd BREDS \npython3.9 -m virtualenv venv # create a new virtual environment for development using python3.9 \nsource venv/bin/activate # activate the virtual environment\npip install -r requirements_dev.txt # install the development requirements\npip install -e . # install BREDS in edit mode\n```\n\n\n## Continuous Integration\n\nBREDS runs a continuous integration (CI) on all pull requests. This means that if you open a pull request (PR), a full \ntest suite is run on your PR: \n\n- The code is formatted using `black` and `isort` \n- Unused imports are auto-removed using `pycln`\n- Linting is done using `pyling` and `flake8`\n- Type checking is done using `mypy`\n- Tests are run using `pytest`\n\nNevertheless, if you prefer to run the tests & formatting locally, it's possible too. \n\n```sh\nmake all\n```\n\n## Opening a Pull Request\n\nEvery PR should be accompanied by short description of the changes, including:\n- Impact and motivation for the changes\n- Any open issues that are closed by this PR\n\n---\n\nGive a \u2b50\ufe0f if this project helped you!\n",
"bugtrack_url": null,
"license": "GNU GPLv3",
"summary": "Bootstrapping Relationship Extraction with Distributional Semantics",
"version": "1.0.1",
"project_urls": {
"documentation": "https://www.davidsbatista.net/assets/documents/publications/breds-emnlp_15.pdf",
"homepage": "https://github.com/davidsbatista/BREDS",
"repository": "https://github.com/davidsbatista/BREDS"
},
"split_keywords": [
"nlp",
"semantic relationship extraction",
"bootstrapping",
"emnlp",
"vector representations",
"word embeddings",
"word2vec",
"phrase embeddings"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "aa5cc1019f2272a7111e125159f5d4d075b9269f39b774c138a62e875392f579",
"md5": "890d4435eedfa1454e2c0d638ff10f84",
"sha256": "77aad19ef8043c5774a3e44900f9450dd2267dcb304b67cbe7f67f1698a83f25"
},
"downloads": -1,
"filename": "breds-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "890d4435eedfa1454e2c0d638ff10f84",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 24176,
"upload_time": "2023-05-15T13:17:53",
"upload_time_iso_8601": "2023-05-15T13:17:53.174626Z",
"url": "https://files.pythonhosted.org/packages/aa/5c/c1019f2272a7111e125159f5d4d075b9269f39b774c138a62e875392f579/breds-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a1a368a77d43ff1100e90839539da0b816eeb6ce3e7cb1e7572538b1eb016a76",
"md5": "d70f591a1c75b1b9a1b6e6f0d395b891",
"sha256": "239e0367f6fe8018fa6cb7575822fe7f05b4e2adb319ce9d3c5849ee6c2e687e"
},
"downloads": -1,
"filename": "breds-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "d70f591a1c75b1b9a1b6e6f0d395b891",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 42913,
"upload_time": "2023-05-15T13:17:54",
"upload_time_iso_8601": "2023-05-15T13:17:54.904522Z",
"url": "https://files.pythonhosted.org/packages/a1/a3/68a77d43ff1100e90839539da0b816eeb6ce3e7cb1e7572538b1eb016a76/breds-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-15 13:17:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "davidsbatista",
"github_project": "BREDS",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "gensim",
"specs": [
[
">=",
"3.7.3"
]
]
},
{
"name": "nltk",
"specs": [
[
">=",
"=3.4.1"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.16.3"
]
]
},
{
"name": "nltk",
"specs": [
[
"~=",
"3.8.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
"~=",
"4.65.0"
]
]
}
],
"lcname": "breds"
}