[![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
![example event parameter](https://github.com/davidsbatista/Snowball/actions/workflows/code_checks.yml/badge.svg?event=pull_request)
![code coverage](https://raw.githubusercontent.com/davidsbatista/Snowball/coverage-badge/coverage.svg?raw=true)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)](https://github.com/davidsbatista/BREDS/blob/main/CONTRIBUTING.md)
Snowball: Extracting Relations from Large Plain-Text Collections
================================================================
An implementation of Snowball, a relationship extraction system that uses a bootstrapping/semi-supervised approach: it
relies on an initial set of seeds, i.e. pairs of named-entities representing the relationship type to be extracted.
## Extracting companies' headquarters
The input text needs to have the named-entities tagged, as shown in the example below:
```yaml
The tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.
<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.
<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.
<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.
<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.
<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.
<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.
```
We need to provide seeds to bootstrap the extraction process, specifying the type of each named-entity and example
relationship pairs that should also be present in the input text:
```yaml
e1:ORG
e2:LOC
Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
DoubleClick;New York
SAP;Walldorf
```
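As an illustration, a seeds file in this format could be parsed with a few lines of Python. This helper is a hypothetical sketch, not part of the package:

```python
from typing import List, Tuple


def read_seeds(path: str) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Return the type of e1, the type of e2, and the list of (e1, e2) seed pairs."""
    e1_type = e2_type = ""
    pairs: List[Tuple[str, str]] = []
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue
            if line.startswith("e1:"):
                e1_type = line.split(":", 1)[1]
            elif line.startswith("e2:"):
                e2_type = line.split(":", 1)[1]
            else:
                ent1, ent2 = line.split(";")
                pairs.append((ent1, ent2))
    return e1_type, e2_type, pairs
```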
To run a simple example, [download](https://drive.google.com/drive/folders/0B0CbnDgKi0PyQ1plbHo0cG5tV2M?resourcekey=0-h_UaGhD4dLfoYITP3pvvUA) the following files:
```
- sentences_short.txt.bz2
- seeds_positive.txt
```
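The sentences file is bzip2-compressed; the commands below assume the decompressed file, which you can obtain with, e.g.:

```sh
bunzip2 sentences_short.txt.bz2   # yields sentences_short.txt
```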
Install Snowball using pip:
```sh
pip install snowball-extractor
```
Run the following command:
```sh
snowball --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6
```
After the process terminates, an output file `relationships.jsonl` is generated containing the extracted relationships.
You can pretty-print its content to the terminal with `jq '.' < relationships.jsonl`:
```json
{
"entity_1": "Medtronic",
"entity_2": "Minneapolis",
"confidence": 0.9982486865148862,
"sentence": "<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . ",
"bef_words": "",
"bet_words": ", based in",
"aft_words": ", is",
"passive_voice": false
}
{
"entity_1": "DynCorp",
"entity_2": "Reston",
"confidence": 0.9982486865148862,
"sentence": "Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .",
"bef_words": "Because",
"bet_words": ", headquartered in",
"aft_words": ", Va.",
"passive_voice": false
}
{
"entity_1": "Handspring",
"entity_2": "Silicon Valley",
"confidence": 0.893486865148862,
"sentence": "There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .",
"bef_words": "firms like",
"bet_words": ", a company based in",
"aft_words": "that looks",
"passive_voice": false
}
```
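As a quick post-processing sketch, the output can be loaded with a few lines of Python, assuming one JSON object per line as the `.jsonl` extension suggests (the field names match the records above):

```python
import json

# load all extracted relationships, skipping blank lines
with open("relationships.jsonl", encoding="utf8") as f_in:
    relationships = [json.loads(line) for line in f_in if line.strip()]

# keep only high-confidence extractions
for rel in relationships:
    if rel["confidence"] >= 0.9:
        print(f'{rel["entity_1"]}\t{rel["entity_2"]}\t{rel["confidence"]:.3f}')
```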
Snowball has several parameters to tune the extraction process. The example above uses the default values, but
these can be set in the configuration file `parameters.cfg`:
```yaml
max_tokens_away=6 # maximum number of tokens between the two entities
min_tokens_away=1 # minimum number of tokens between the two entities
context_window_size=2 # number of tokens to the left and right of each entity
alpha=0.2 # weight of the BEF context in the similarity function
beta=0.6 # weight of the BET context in the similarity function
gamma=0.2 # weight of the AFT context in the similarity function
wUpdt=0.5 # < 0.5 trusts new examples less on each iteration
number_iterations=3 # number of bootstrap iterations
wUnk=0.1 # weight given to unknown extracted relationship instances
wNeg=2 # weight given to extracted relationship instances matching negative seeds
min_pattern_support=2 # minimum number of instances in a cluster to be considered a pattern
```
and passed with the argument `--config=parameters.cfg`.
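The `alpha`, `beta` and `gamma` weights implement the context-weighted similarity from the original Snowball paper: a candidate tuple is compared against a pattern by combining the similarities of their before (BEF), between (BET) and after (AFT) context vectors. Below is a minimal sketch of that computation, using cosine similarity over generic context vectors; the exact vector representation is an implementation detail:

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


def match(tup, pattern, alpha=0.2, beta=0.6, gamma=0.2) -> float:
    """Weighted similarity of two (bef, bet, aft) context-vector triples."""
    return (alpha * cosine(tup[0], pattern[0])
            + beta * cosine(tup[1], pattern[1])
            + gamma * cosine(tup[2], pattern[2]))
```

A tuple is considered a match for a pattern when this score is above the `--similarity` threshold.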
The full command line parameters are:
```sh
-h, --help show this help message and exit
--config CONFIG file with bootstrapping configuration parameters
--sentences SENTENCES
a text file with a sentence per line, and with at least two entities per sentence
--positive_seeds POSITIVE_SEEDS
a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'
--negative_seeds NEGATIVE_SEEDS
a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'
--similarity SIMILARITY
the minimum similarity between tuples and patterns to be considered a match
--confidence CONFIDENCE
the minimum confidence score for a match to be considered a true positive
--number_iterations NUMBER_ITERATIONS
the number of bootstrap iterations to run
```
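For instance, negative seeds can be supplied alongside positive ones to penalise wrong extractions. The file name `seeds_negative.txt` here is hypothetical; it should follow the same format as the positive seeds file:

```sh
snowball --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --negative_seeds=seeds_negative.txt --similarity=0.6 --confidence=0.6
```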
In the first step Snowball pre-processes the input file (e.g. `sentences.txt`), generating word vector representations
of the relationship contexts and caching them in `processed_tuples.pkl`.
This is done so that you can experiment with different seed examples without having to repeat the generation of the
word vector representations: just pass the argument `--sentences=processed_tuples.pkl` instead to skip this
generation step.
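For example, reusing the cached tuples from a previous run:

```sh
# first run: pre-process the sentences and cache the vectorised tuples
snowball --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6

# later runs: reuse the cached tuples, e.g. with different thresholds or seeds
snowball --sentences=processed_tuples.pkl --positive_seeds=seeds_positive.txt --similarity=0.7 --confidence=0.7
```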
You can find more details about the original system here:
- Eugene Agichtein and Luis Gravano, [Snowball: Extracting Relations from Large Plain-Text Collections](http://www.mathcs.emory.edu/~eugene/papers/dl00.pdf). In Proceedings of the Fifth ACM Conference on Digital Libraries. ACM, 2000.
- Hong Yu and Eugene Agichtein, [Extracting synonymous gene and protein terms from biological literature](http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i340.full.pdf). In Bioinformatics, 19(suppl 1), Oxford University Press, 2003.
For details about this particular implementation and how it was used, please refer to the following publications:
- David S Batista, Bruno Martins, and Mário J Silva, [Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics](http://davidsbatista.net/assets/documents/publications/breds-emnlp_15.pdf). In Empirical Methods in Natural Language Processing. ACL, 2015. (Honorable Mention for Best Short Paper)
- David S Batista, Ph.D. Thesis, [Large-Scale Semantic Relationship Extraction for Information Discovery (Chapter 5)](http://davidsbatista.net/assets/documents/publications/dsbatista-phd-thesis-2016.pdf), Instituto Superior Técnico, University of Lisbon, 2016
# Contributing to Snowball
Improvements, adding new features and bug fixes are welcome. If you wish to participate in the development of Snowball,
please read the following guidelines.
## The contribution process at a glance
1. Preparing the development environment
2. Code away!
3. Continuous Integration
4. Submit your changes by opening a pull request
Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in
an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer.
## Preparing the development environment
Make sure you have Python 3.9 installed on your system.

On macOS:
```sh
brew install python@3.9
python3.9 -m pip install --user --upgrade pip
python3.9 -m pip install virtualenv
```
Clone the repository and prepare the development environment:
```sh
git clone git@github.com:davidsbatista/Snowball.git
cd Snowball
python3.9 -m virtualenv venv # create a new virtual environment for development using python3.9
source venv/bin/activate # activate the virtual environment
pip install -r requirements_dev.txt # install the development requirements
pip install -e . # install Snowball in editable mode
```
## Continuous Integration
Snowball runs continuous integration (CI) checks on all pull requests. This means that if you open a pull request (PR),
a full suite of checks runs against it:
- The code is formatted using `black` and `isort`
- Unused imports are auto-removed using `pycln`
- Linting is done using `pylint` and `flake8`
- Type checking is done using `mypy`
- Tests are run using `pytest`
You can also run the tests and formatting locally:
```sh
make all
```
## Opening a Pull Request
Every PR should be accompanied by a short description of the changes, including:
- Impact and motivation for the changes
- Any open issues that are closed by this PR
---
Give a ⭐️ if this project helped you!