snowball-extractor


Name: snowball-extractor
Version: 1.0.5
Summary: Snowball: Extracting Relations from Large Plain-Text Collections
Upload time: 2023-05-14 21:49:21
Requires Python: >=3.9
License: GNU GPLv3
Keywords: nlp, semantic relationship extraction, bootstrapping, emnlp, tf-idf
            [![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
 
![example event parameter](https://github.com/davidsbatista/Snowball/actions/workflows/code_checks.yml/badge.svg?event=pull_request)
 
![code coverage](https://raw.githubusercontent.com/davidsbatista/Snowball/coverage-badge/coverage.svg?raw=true)
 
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
 
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 
[![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)](https://github.com/davidsbatista/BREDS/blob/main/CONTRIBUTING.md)


Snowball: Extracting Relations from Large Plain-Text Collections
================================================================

An implementation of Snowball, a relationship extraction system that uses a bootstrapping (semi-supervised) approach. It 
relies on an initial set of seeds, i.e. pairs of named entities representing the relationship type to be extracted. 

## Extracting company headquarters

The input text needs to have the named entities tagged, as shown in the example below:
 
```yaml
The tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.
<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.
<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.
<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.
<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.
<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.
<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.
```
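To make the expected input format concrete, here is a minimal sketch of how the tagged entities in a sentence could be read back with a regular expression. The `extract_entities` helper and the `ENTITY_RE` pattern are illustrative names, not part of Snowball's API, and the sketch assumes tags are uppercase and never nested.

```python
import re

# Matches <TAG>text</TAG>, where TAG is an uppercase entity type such as ORG or LOC.
# Assumption: tags are not nested and the closing tag repeats the opening one.
ENTITY_RE = re.compile(r"<([A-Z]+)>(.+?)</\1>")


def extract_entities(sentence):
    """Return (entity_type, surface_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(sentence)]


sentence = (
    "The tech company <ORG>Soundcloud</ORG> is based in "
    "<LOC>Berlin</LOC>, capital of Germany."
)
entities = extract_entities(sentence)
print(entities)
```

A sentence is only useful for extraction if it yields at least two entities, which matches the requirement stated for the `--sentences` file.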

We need to provide seeds to bootstrap the extraction process, specifying the type of each named entity and example 
relationships that should also be present in the input text:

```yaml
e1:ORG
e2:LOC

Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
DoubleClick;New York
SAP;Walldorf
```   
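The seeds file layout above can be read with a few lines of Python. This is a sketch of one way to parse it, assuming the two header lines start with `e1:` and `e2:` and every other non-empty line is an `entity1;entity2` pair; `parse_seeds` is a hypothetical helper, not a function exported by the package.

```python
def parse_seeds(text):
    """Parse seed text into (e1_type, e2_type, set of (e1, e2) pairs)."""
    e1_type = e2_type = None
    pairs = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("e1:"):
            e1_type = line.split(":", 1)[1].strip()
        elif line.startswith("e2:"):
            e2_type = line.split(":", 1)[1].strip()
        else:
            # Seed lines have the form 'entity1;entity2'.
            e1, e2 = line.split(";", 1)
            pairs.add((e1.strip(), e2.strip()))
    return e1_type, e2_type, pairs


seeds_text = """e1:ORG
e2:LOC

Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
"""
e1_type, e2_type, seeds = parse_seeds(seeds_text)
print(e1_type, e2_type, sorted(seeds))
```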

To run a simple example, [download](https://drive.google.com/drive/folders/0B0CbnDgKi0PyQ1plbHo0cG5tV2M?resourcekey=0-h_UaGhD4dLfoYITP3pvvUA) the following files:


```
- sentences_short.txt.bz2
- seeds_positive.txt
```
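The sentences file is distributed bzip2-compressed, while the run command in this README passes `sentences_short.txt`. Whether the tool can read the `.bz2` file directly is not stated here, so the sketch below assumes it needs to be decompressed first. It uses only Python's standard `bz2` and `shutil` modules; `decompress_bz2` is an illustrative helper name.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path):
    """Stream-decompress a .bz2 file to dst_path without loading it all in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For the files listed above this would be called as `decompress_bz2("sentences_short.txt.bz2", "sentences_short.txt")`.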

Install Snowball using pip

```sh
pip install snowball-extractor
```

Run the following command:

```sh
snowball --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6
```

When the process finishes, an output file `relationships.jsonl` is generated containing the extracted relationships. 

You can pretty-print its contents to the terminal with `jq '.' < relationships.jsonl`: 

```json
{
  "entity_1": "Medtronic",
  "entity_2": "Minneapolis",
  "confidence": 0.9982486865148862,
  "sentence": "<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . ",
  "bef_words": "",
  "bet_words": ", based in",
  "aft_words": ", is",
  "passive_voice": false
}

{
  "entity_1": "DynCorp",
  "entity_2": "Reston",
  "confidence": 0.9982486865148862,
  "sentence": "Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .",
  "bef_words": "Because",
  "bet_words": ", headquartered in",
  "aft_words": ", Va.",
  "passive_voice": false
}

{
  "entity_1": "Handspring",
  "entity_2": "Silicon Valley",
  "confidence": 0.893486865148862,
  "sentence": "There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .",
  "bef_words": "firms like",
  "bet_words": ", a company based in",
  "aft_words": "that looks",
  "passive_voice": false
}
```
<br>
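The output can also be post-processed in Python. The sketch below assumes `relationships.jsonl` follows the usual JSON Lines convention of one JSON object per line (the listing above is the pretty-printed form) and that every record carries the `confidence` field shown. `load_relationships` is a hypothetical helper, not part of the package.

```python
import json


def load_relationships(lines, min_confidence=0.0):
    """Parse JSON Lines records and keep those at or above min_confidence."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["confidence"] >= min_confidence:
            records.append(record)
    return records


# Two abbreviated records taken from the example output above.
sample = [
    '{"entity_1": "Medtronic", "entity_2": "Minneapolis", "confidence": 0.9982486865148862}',
    '{"entity_1": "Handspring", "entity_2": "Silicon Valley", "confidence": 0.893486865148862}',
]
for rel in load_relationships(sample, min_confidence=0.9):
    print(rel["entity_1"], "->", rel["entity_2"])
```

Because an open file iterates line by line, the same function can be fed `open("relationships.jsonl", encoding="utf-8")` directly.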

Snowball has several parameters to tune the extraction process. The example above uses the default values, but 
these can be set in a configuration file, `parameters.cfg`:

```yaml
max_tokens_away=6           # maximum number of tokens between the two entities
min_tokens_away=1           # minimum number of tokens between the two entities
context_window_size=2       # number of tokens to the left and right of each entity

alpha=0.2                   # weight of the BEF context in the similarity function
beta=0.6                    # weight of the BET context in the similarity function
gamma=0.2                   # weight of the AFT context in the similarity function

wUpdt=0.5                   # < 0.5 trusts new examples less on each iteration
number_iterations=3         # number of bootstrap iterations
wUnk=0.1                    # weight given to unknown extracted relationship instances
wNeg=2                      # weight given to extracted relationship instances
min_pattern_support=2       # minimum number of instances in a cluster to be considered a pattern
```

Pass the configuration file with the argument `--config=parameters.cfg`.
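As an illustration of how these parameters might be consumed, the sketch below parses the `key=value  # comment` layout and shows one common formulation of a Snowball-style pattern confidence that uses `wNeg` and `wUnk`. Both `parse_config` and `pattern_confidence` are hypothetical helpers, and the confidence formula is an assumption about how the weights are applied, not a statement of what this package computes internally.

```python
def parse_config(text):
    """Parse 'key=value  # comment' lines into a dict of ints and floats."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        params[key] = float(value) if "." in value else int(value)
    return params


def pattern_confidence(positive, negative, unknown, w_neg, w_unk):
    """Assumed formulation: positive / (positive + w_neg*negative + w_unk*unknown)."""
    denominator = positive + w_neg * negative + w_unk * unknown
    return positive / denominator if denominator else 0.0


config_text = """
wUnk=0.1                    # weight given to unknown extracted relationship instances
wNeg=2                      # weight given to extracted relationship instances
min_pattern_support=2       # minimum number of instances in a cluster
"""
params = parse_config(config_text)
score = pattern_confidence(8, 1, 5, w_neg=params["wNeg"], w_unk=params["wUnk"])
print(params, round(score, 4))
```

Under this formulation, a negative match counts more heavily against a pattern than an unknown one, which is why `wNeg` is larger than `wUnk` in the defaults.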

The full command line parameters are:

```sh
  -h, --help            show this help message and exit
  --config CONFIG       file with bootstrapping configuration parameters
  --sentences SENTENCES
                        a text file with a sentence per line, and with at least two entities per sentence
  --positive_seeds POSITIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'
  --negative_seeds NEGATIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'
  --similarity SIMILARITY
                        the minimum similarity between tuples and patterns to be considered a match
  --confidence CONFIDENCE
                        the minimum confidence score for a match to be considered a true positive
  --number_iterations NUMBER_ITERATIONS
                        the number of iterations the run
```
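The `--similarity` threshold interacts with the `alpha`, `beta` and `gamma` weights from the configuration file. A minimal sketch of such a weighted similarity follows, assuming each context (BEF, BET, AFT) is a sparse term-weight vector and that the per-context scores are cosine similarities. The function names and the dict-based representation are illustrative choices, not the package's internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty context contributes nothing
    return dot / (norm_u * norm_v)


def weighted_similarity(tuple_ctx, pattern_ctx, alpha=0.2, beta=0.6, gamma=0.2):
    """Combine BEF, BET and AFT context similarities with the configured weights."""
    return (
        alpha * cosine(tuple_ctx["bef"], pattern_ctx["bef"])
        + beta * cosine(tuple_ctx["bet"], pattern_ctx["bet"])
        + gamma * cosine(tuple_ctx["aft"], pattern_ctx["aft"])
    )


# Toy vectors: the two instances share only the BET context.
candidate = {"bef": {}, "bet": {"based": 1.0}, "aft": {"is": 1.0}}
pattern = {"bef": {}, "bet": {"based": 1.0}, "aft": {"said": 1.0}}
score = weighted_similarity(candidate, pattern)
print(score, score >= 0.6)
```

With the default weights, the BET context dominates, so two instances that agree only on the words between the entities can still reach a threshold of 0.6.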

In the first step, Snowball pre-processes the input file `sentences.txt`, generating word vector representations of the 
relationships, which are saved to `processed_tuples.pkl`. 

This lets you experiment with different seed examples without repeating the process of 
generating the word vector representations. Pass the argument `--sentences=processed_tuples.pkl` instead to skip 
this generation step.
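The idea of reusing a pickled intermediate result can be sketched generically as a build-once, load-afterwards cache. This is not the package's own code: `load_or_build` and its arguments are hypothetical, and the sketch only assumes that the expensive result is picklable.

```python
import os
import pickle


def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists; otherwise build, pickle and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = build_fn()  # the expensive step, run only on a cache miss
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.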


You can find more details about the original system here: 

- Eugene Agichtein and Luis Gravano, [Snowball: Extracting Relations from Large Plain-Text Collections](http://www.mathcs.emory.edu/~eugene/papers/dl00.pdf). In Proceedings of the fifth ACM conference on Digital libraries. ACM, 2000.
- H Yu, E Agichtein, [Extracting synonymous gene and protein terms from biological literature](http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i340.full.pdf). In Bioinformatics, 19(suppl 1), 2003 - Oxford University Press


For details about this particular implementation and how it was used, please refer to the following publications:

- David S Batista, Bruno Martins, and Mário J Silva, [Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics](http://davidsbatista.net/assets/documents/publications/breds-emnlp_15.pdf). In Empirical Methods in Natural Language Processing. ACL, 2015. (Honorable Mention for Best Short Paper)
- David S Batista, Ph.D. Thesis, [Large-Scale Semantic Relationship Extraction for Information Discovery (Chapter 5)](http://davidsbatista.net/assets/documents/publications/dsbatista-phd-thesis-2016.pdf), Instituto Superior Técnico, University of Lisbon, 2016


# Contributing to Snowball

Improvements, new features, and bug fixes are welcome. If you wish to participate in the development of Snowball, 
please read the following guidelines.

## The contribution process at a glance

1. Preparing the development environment
2. Code away!
3. Continuous Integration
4. Submit your changes by opening a pull request

Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in 
an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer. 


## Preparing the development environment

Make sure you have Python 3.9 installed on your system.

macOS:
```sh
brew install python@3.9
python3.9 -m pip install --user --upgrade pip
python3.9 -m pip install virtualenv
```

Clone the repository and prepare the development environment:

```sh
git clone git@github.com:davidsbatista/Snowball.git
cd Snowball            
python3.9 -m virtualenv venv         # create a new virtual environment for development using python3.9 
source venv/bin/activate             # activate the virtual environment
pip install -r requirements_dev.txt  # install the development requirements
pip install -e .                     # install Snowball in editable mode
```


## Continuous Integration

Snowball runs continuous integration (CI) on all pull requests. This means that when you open a pull request (PR), the 
full test suite is run on your PR: 

- The code is formatted using `black` and `isort` 
- Unused imports are auto-removed using `pycln`
- Linting is done using `pylint` and `flake8`
- Type checking is done using `mypy`
- Tests are run using `pytest`

If you prefer, you can also run the tests and formatting locally: 

```sh
make all
```

## Opening a Pull Request

Every PR should be accompanied by a short description of the changes, including:
- Impact and motivation for the changes
- Any open issues that are closed by this PR

---

Give a ⭐️ if this project helped you!

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "snowball-extractor",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "nlp,semantic relationship extraction,bootstrapping,emnlp,tf-idf",
    "author": "",
    "author_email": "\"David S. Batista\" <dsbatista@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a7/c6/2b84a640d20a1c9f10e802c0170dc2629e456b9980951aae8cb8d6348f73/snowball-extractor-1.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GNU GPLv3",
    "summary": "Snowball: Extracting Relations from Large Plain-Text Collections",
    "version": "1.0.5",
    "project_urls": {
        "documentation": "https://www.davidsbatista.net/assets/documents/publications/breds-emnlp_15.pdf",
        "homepage": "https://github.com/davidsbatista/Snowball",
        "repository": "https://github.com/davidsbatista/Snowball"
    },
    "split_keywords": [
        "nlp",
        "semantic relationship extraction",
        "bootstrapping",
        "emnlp",
        "tf-idf"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "05d899e18385b6c144b9ad068d2c6cdd5a0aacf02b3bbec63fc36914dc147a8b",
                "md5": "75e35c7466d5b7625073aff1b2b1bd45",
                "sha256": "e159c6b8e9c8f6c5a93632b5d721e5576a33e33d1db55cb28dda5ab26ecd484d"
            },
            "downloads": -1,
            "filename": "snowball_extractor-1.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "75e35c7466d5b7625073aff1b2b1bd45",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 34324,
            "upload_time": "2023-05-14T21:49:19",
            "upload_time_iso_8601": "2023-05-14T21:49:19.601609Z",
            "url": "https://files.pythonhosted.org/packages/05/d8/99e18385b6c144b9ad068d2c6cdd5a0aacf02b3bbec63fc36914dc147a8b/snowball_extractor-1.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a7c62b84a640d20a1c9f10e802c0170dc2629e456b9980951aae8cb8d6348f73",
                "md5": "24c7bb8b5ca1d6deb9329503f1c5b8ae",
                "sha256": "7b0103184446c8486390e29f384dbd9ffd87572c96339b920a2a26d6a7bd709f"
            },
            "downloads": -1,
            "filename": "snowball-extractor-1.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "24c7bb8b5ca1d6deb9329503f1c5b8ae",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 38342,
            "upload_time": "2023-05-14T21:49:21",
            "upload_time_iso_8601": "2023-05-14T21:49:21.620988Z",
            "url": "https://files.pythonhosted.org/packages/a7/c6/2b84a640d20a1c9f10e802c0170dc2629e456b9980951aae8cb8d6348f73/snowball-extractor-1.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-14 21:49:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "davidsbatista",
    "github_project": "Snowball",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "snowball-extractor"
}
        