wordsprobability


* Name: wordsprobability
* Version: 0.17
* Home page: https://github.com/tpimentelms/probability-of-a-word
* Summary: Method to get a word's probability with fixes from "How to Compute the Probability of a Word".
* Upload time: 2024-07-10 13:28:31
* Author: Tiago Pimentel and Clara Meister
* License: MIT
* Keywords: language modelling, surprisal, word, tokenisation, probability

# probability-of-a-word

[![CircleCI](https://circleci.com/gh/tpimentelms/probability-of-a-word.svg?style=svg)](https://circleci.com/gh/tpimentelms/probability-of-a-word)

Code to compute a word's probability using the fixes from ["How to Compute the Probability of a Word"](https://arxiv.org/abs/2406.14561).


### Installation

You can install WordsProbability directly from PyPI:

`pip install wordsprobability`

Or from source:

```
git clone git@github.com:tpimentelms/probability-of-a-word.git
cd probability-of-a-word
pip install -e .
```

#### Dependencies

WordsProbability has the following requirements:

* [Pandas](https://pandas.pydata.org)
* [PyTorch](https://pytorch.org)
* [Transformers](https://huggingface.co/docs/transformers/en/index)

### Usage

#### Basic Usage

After installing the package, run:
```bash
$ wordsprobability --model pythia-70m --input examples/abstract.txt --output temp.tsv
```

The input must be a plain-text (`.txt`) file with one sequence per line.
The output is a TSV file with one word per row, together with its computed `surprisal` value.
To also get `surprisal_buggy` values (computed without our paper's correction), pass the optional flag `--return-buggy-surprisals`.
Currently supported models are: `pythia-70m`, `pythia-160m`, `pythia-410m`, `pythia-14b`, `pythia-28b`, `pythia-69b`, `pythia-120b`, `gpt2-small`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`.
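
The file formats described above can be prepared and consumed with the standard library alone. The sketch below is illustrative and not part of the package: `write_sequences` and `read_surprisals` are hypothetical helper names, and it assumes the output TSV starts with a header row that names its columns (only the `surprisal` column is documented here).

```python
import csv
from pathlib import Path


def write_sequences(sequences, path):
    """Write one sequence per line, the input layout expected by the CLI."""
    # A newline inside a sequence would split it into several input lines,
    # so join any internal lines with a single space first.
    lines = [" ".join(seq.splitlines()) for seq in sequences]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_surprisals(path):
    """Read a tab-separated file into a list of dicts, one per word row.

    Assumption: the first row is a header naming the columns.
    Values are returned as strings; convert with float() as needed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

A typical round trip would call `write_sequences([...], "input.txt")`, run the `wordsprobability` command shown above on that file, and then load the result with `read_surprisals("temp.tsv")`.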

#### Using in other Applications

Import wordsprobability in your application and get surprisals with:
```python
from wordsprobability import get_surprisal_per_word
df = get_surprisal_per_word(text='Hello world! Who are you???\nWho am I?', model_name='pythia-70m')
```
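
Since the package reports surprisals, a short conversion step recovers the word probabilities themselves. The sketch below is an illustration rather than part of the package: `surprisal_to_probability` is a hypothetical helper, it assumes the returned `df` exposes a `surprisal` column like the CLI output, and it assumes surprisals are measured in nats (natural log). If they are in bits, pass `log_base=2` instead.

```python
import math


def surprisal_to_probability(surprisals, log_base=math.e):
    """Map each surprisal s = -log_b p(word) back to p(word) = b ** (-s)."""
    return [log_base ** (-float(s)) for s in surprisals]


# Toy values standing in for df["surprisal"]; these are not real model output.
toy_surprisals = [0.0, math.log(4.0)]
probabilities = surprisal_to_probability(toy_surprisals)
```

With a real result, the call would be `surprisal_to_probability(df["surprisal"])`; a surprisal of zero corresponds to a probability of one, and larger surprisals to smaller probabilities.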


## Extra Information

#### Citation

If this code or the paper was useful to you, consider citing it:


```bibtex
@article{pimentel-etal-2024-howto,
    title = "How to Compute the Probability of a Word",
    author = "Pimentel, Tiago and
    Meister, Clara",
    year = "2024",
    eprint = {2406.14561},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2406.14561},
    journal = "arXiv preprint arXiv:2406.14561",
}
```


#### Contact

To ask questions or report problems, please open an [issue](https://github.com/tpimentelms/probability-of-a-word/issues).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/tpimentelms/probability-of-a-word",
    "name": "wordsprobability",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "language modelling surprisal word tokenisation probability",
    "author": "Tiago Pimentel and Clara Meister",
    "author_email": "tpimentelms@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/72/62/2a99ec06452435a991282fe83d551035111c8a13d2b9400fd494f146da55/wordsprobability-0.17.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Method to get a words probability with fixes from How to Compute the Probability of a Word.",
    "version": "0.17",
    "project_urls": {
        "Homepage": "https://github.com/tpimentelms/probability-of-a-word"
    },
    "split_keywords": [
        "language",
        "modelling",
        "surprisal",
        "word",
        "tokenisation",
        "probability"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9b5f43a3c191154aa19ec08acfe53db14ff4c866c0f9774091cea97376ce6fb4",
                "md5": "096909062635a4dfafc8ed12fe1b7b0b",
                "sha256": "7a4bc6ad27160bc5795d1e17aa379c26d0760691101a2517f4eb5b4cb46f928b"
            },
            "downloads": -1,
            "filename": "wordsprobability-0.17-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "096909062635a4dfafc8ed12fe1b7b0b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 8150,
            "upload_time": "2024-07-10T13:28:29",
            "upload_time_iso_8601": "2024-07-10T13:28:29.707388Z",
            "url": "https://files.pythonhosted.org/packages/9b/5f/43a3c191154aa19ec08acfe53db14ff4c866c0f9774091cea97376ce6fb4/wordsprobability-0.17-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "72622a99ec06452435a991282fe83d551035111c8a13d2b9400fd494f146da55",
                "md5": "2fdc9b9ad8c1d43fb639da821dfe6031",
                "sha256": "9554a8d84e98c414acd44465b699ccd91bf6f3e20de7ef074a35232b3fffa9e9"
            },
            "downloads": -1,
            "filename": "wordsprobability-0.17.tar.gz",
            "has_sig": false,
            "md5_digest": "2fdc9b9ad8c1d43fb639da821dfe6031",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7954,
            "upload_time": "2024-07-10T13:28:31",
            "upload_time_iso_8601": "2024-07-10T13:28:31.220931Z",
            "url": "https://files.pythonhosted.org/packages/72/62/2a99ec06452435a991282fe83d551035111c8a13d2b9400fd494f146da55/wordsprobability-0.17.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-10 13:28:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tpimentelms",
    "github_project": "probability-of-a-word",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "circle": true,
    "lcname": "wordsprobability"
}
        