faststylometry


* Name: faststylometry
* Version: 1.0.4
* Home page: https://fastdatascience.com/fast-stylometry-python-library
* Summary: Python library for calculating the Burrows Delta.
* Upload time: 2023-09-15 21:06:47
* Author: Thomas Wood
* Requires Python: >=3.6
* Keywords: stylometry, nlp, burrows delta, delta, forensic stylometry, natural language processing
* Requirements: No requirements were recorded.

![Fast Data Science logo](https://raw.githubusercontent.com/fastdatascience/brand/main/primary_logo.svg)

<a href="https://fastdatascience.com"><span align="left">🌐 fastdatascience.com</span></a>
<a href="https://www.linkedin.com/company/fastdatascience/"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/linkedin.svg" alt="Fast Data Science | LinkedIn" width="21px"/></a>
<a href="https://twitter.com/fastdatascienc1"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/x.svg" alt="Fast Data Science | X" width="21px"/></a>
<a href="https://www.instagram.com/fastdatascience/"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/instagram.svg" alt="Fast Data Science | Instagram" width="21px"/></a>
<a href="https://www.facebook.com/fastdatascienceltd"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/fb.svg" alt="Fast Data Science | Facebook" width="21px"/></a>
<a href="https://www.youtube.com/channel/UCLPrDH7SoRT55F6i50xMg5g"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/yt.svg" alt="Fast Data Science | YouTube" width="21px"/></a>

# Fast Stylometry Python library: Natural Language Processing tool

<!-- badges: start -->
![my badge](https://badgen.net/badge/Status/In%20Development/orange)

[![PyPI package](https://img.shields.io/badge/pip%20install-faststylometry-brightgreen)](https://pypi.org/project/faststylometry/) [![version number](https://img.shields.io/pypi/v/faststylometry?color=green&label=version)](https://github.com/fastdatascience/faststylometry/releases) [![License](https://img.shields.io/github/license/fastdatascience/faststylometry)](https://github.com/fastdatascience/faststylometry/blob/main/LICENSE)

You can run the walkthrough notebook in [Google Colab](https://colab.research.google.com/github/fastdatascience/faststylometry/blob/main/Burrows%20Delta%20Walkthrough.ipynb) with a single click: <a href="https://colab.research.google.com/github/fastdatascience/faststylometry/blob/main/Burrows%20Delta%20Walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!-- badges: end -->

# ☄ Fast Stylometry - Burrows Delta NLP technique ☄

Developed by [**Fast Data Science**](https://fastdatascience.com). Fast Data Science develops [products](https://fastdatascience.com/demos/) and offers [consulting services](https://fastdatascience.com/case-studies/) and [training courses](https://fastdatascience.com/training-and-upskilling-analytics-teams-in-data-science/) in [natural language processing (NLP)](https://fastdatascience.com/guide-natural-language-processing-nlp/).

Source code at https://github.com/fastdatascience/faststylometry

Tutorial at https://fastdatascience.com/fast-stylometry-python-library/

**Fast Stylometry** is a Python library for calculating Burrows' Delta, an algorithm for measuring how similar the writing styles of documents are. It is a technique used in [forensic stylometry](https://fastdatascience.com/how-you-can-identify-the-author-of-a-document/).

* [A useful explanation of the maths and thinking behind Burrows' Delta and how it works](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python#third-stylometric-test-john-burrows-delta-method-advanced)
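
As a rough illustration of the idea (not the library's own implementation), Burrows' Delta can be sketched with the standard library alone. Take the most frequent words across the known texts, convert each text's relative word frequencies into z-scores, and score the unknown text against each candidate by the mean absolute difference of z-scores. The function names, the `n_words` parameter, and the use of the population standard deviation are assumptions made for this sketch.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Return the share of `tokens` taken up by each word in `vocabulary`."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def burrows_delta(known_texts, unknown_tokens, n_words=50):
    """Score `unknown_tokens` against each entry of `known_texts`.

    known_texts maps a label (for example an author) to a list of tokens.
    Returns a dict of label -> Delta; lower means stylistically closer.
    """
    # 1. Vocabulary: the n most frequent words across all known texts.
    pooled = Counter()
    for tokens in known_texts.values():
        pooled.update(tokens)
    vocabulary = [word for word, _ in pooled.most_common(n_words)]

    # 2. Relative frequency of each vocabulary word in each known text.
    freqs = {label: relative_frequencies(tokens, vocabulary)
             for label, tokens in known_texts.items()}

    # 3. Mean and standard deviation of each word's frequency across texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def z_scores(values):
        return [(v - m) / s for v, m, s in zip(values, means, stds)]

    # 4. Delta: mean absolute difference between the two z-score profiles.
    z_unknown = z_scores(relative_frequencies(unknown_tokens, vocabulary))
    return {
        label: mean(abs(a - b) for a, b in zip(z_scores(values), z_unknown))
        for label, values in freqs.items()
    }
```

Calling `burrows_delta({"A": tokens_a, "B": tokens_b}, unknown_tokens)` returns one score per candidate label, and the candidate with the lowest score is the closest stylistic match.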


# 💻 Installing the Fast Stylometry Python package

You can install from [PyPI](https://pypi.org/project/faststylometry).

```
pip install faststylometry
```

# 🌟 Using Fast Stylometry NLP library for the first time 🌟

⚠️ We recommend that you follow the walkthrough notebook titled [Burrows Delta Walkthrough.ipynb](Burrows%20Delta%20Walkthrough.ipynb) to understand how the library works. If you don't have a suitable environment set up on your machine, you can run the walkthrough notebook using [this link to create a notebook in Google Colab](https://colab.research.google.com/github/fastdatascience/faststylometry/blob/main/Burrows%20Delta%20Walkthrough.ipynb).

# 💡 Usage examples

Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.

We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.

You can get the training corpus by cloning https://github.com/woodthom2/faststylometry. The data is in `faststylometry/data`.

## 📖 Create a corpus

The [Burrows Delta Walkthrough.ipynb](Burrows%20Delta%20Walkthrough.ipynb) Jupyter notebook is the best place to start, but the basic commands for using the library are summarised here.

To create a corpus and add books, the pattern is as follows:

```
from faststylometry import Corpus

whole_book_text = "..."  # placeholder: the full text of the book as one string
corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", whole_book_text)
```

Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method `util.load_corpus_from_folder(folder, pattern)`.

```
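import os
import re

from faststylometry.corpus import Corpus

# Folder containing files named "<author>_-_<title>.txt"
folder = "faststylometry/data/train"

corpus = Corpus()
for root, _, files in os.walk(folder):
    for filename in files:
        # Only read text files that follow the "<author>_-_<title>.txt" pattern
        if filename.endswith(".txt") and "_-_" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Strip the extension, then split the name into author and title
            author, book = re.split("_-_", re.sub(r"\.txt$", "", filename), maxsplit=1)

            corpus.add_book(author, book, text)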
```
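
The loop above relies on files being named `<author>_-_<title>.txt`. As a small sketch, that convention can be isolated into a helper that is easy to test on its own. The function name `parse_author_and_title` and the example filenames are hypothetical and not part of the library.

```python
import re


def parse_author_and_title(filename):
    """Split '<author>_-_<title>.txt' into (author, title).

    Returns None when the filename does not follow the convention,
    so callers can skip unrelated files instead of raising an error.
    """
    match = re.fullmatch(r"(?P<author>.+?)_-_(?P<title>.+)\.txt", filename)
    if match is None:
        return None
    return match.group("author"), match.group("title")
```

Returning `None` for non-matching names lets a directory walk ignore stray files without a `try`/`except` around the tuple unpacking.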


## 💡 Example 1

Download some example data (Project Gutenberg texts) from the Fast Stylometry repository:

```
from faststylometry import download_examples
download_examples()
```

Load a corpus and calculate Burrows' Delta:

```
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("faststylometry/data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("faststylometry/data/test", pattern="sense")

test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)
```

This returns a pandas DataFrame of Burrows' Delta scores.
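
Example 1 passes `tokenise_remove_pronouns_en` to `Corpus.tokenise`. The library's own tokeniser is not reproduced here. As a rough stand-in, assuming simple regex tokenisation and a short hand-picked pronoun list, a function of that kind might look like the sketch below. Pronouns are often dropped in stylometry because their frequency tends to reflect narrative perspective more than an author's style. The name `tokenise_without_pronouns` and the contents of `PRONOUNS` are assumptions for illustration.

```python
import re

# A short, hand-picked pronoun list for illustration only;
# a real implementation would likely use a fuller list.
PRONOUNS = {
    "i", "me", "my", "mine", "you", "your", "yours",
    "he", "him", "his", "she", "her", "hers", "it", "its",
    "we", "us", "our", "ours", "they", "them", "their", "theirs",
}


def tokenise_without_pronouns(text):
    """Lower-case the text, split it into word tokens, and drop pronouns."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in PRONOUNS]
```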

## 💡 Example 2

Using the probability calibration functionality, you can calculate the probability of two books being by the same author.

```
from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)
```

This outputs a pandas DataFrame of probabilities.
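
This README does not describe how `calibrate` works internally. One common way to turn a distance such as Burrows' Delta into a probability is to fit a logistic curve on pairs of texts whose authorship is known. The sketch below assumes that approach and uses invented toy numbers; the function names and data are hypothetical and are not taken from the library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(deltas, same_author, learning_rate=1.0, steps=5000):
    """Fit p(same author | delta) = sigmoid(w * delta + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(steps):
        # Prediction error for every labelled pair under the current w and b.
        errors = [sigmoid(w * d + b) - y for d, y in zip(deltas, same_author)]
        # Gradient of the mean log-loss with respect to w and b.
        w -= learning_rate * sum(e * d for e, d in zip(errors, deltas)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Invented toy data: small deltas for same-author pairs, large for different.
deltas = [0.4, 0.5, 0.6, 0.7, 1.4, 1.6, 1.8, 2.0]
same_author = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_logistic(deltas, same_author)


def probability_same_author(delta):
    """Map a delta to a probability using the fitted curve."""
    return sigmoid(w * delta + b)
```

With this toy data the fitted slope is negative, so the estimated probability of shared authorship falls as the delta grows.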

# ✉️ Who to contact

Thomas Wood at [Fast Data Science](https://fastdatascience.com)

## 🤝 Contributing to the project

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our [GitHub repository](https://github.com/fastdatascience/faststylometry). You can also [raise an issue](https://github.com/fastdatascience/faststylometry/issues).

## Developing the library

### Automated tests

Test code is in the **tests/** folder and uses [unittest](https://docs.python.org/3/library/unittest.html).

The testing tool `tox` is used in the automation with GitHub Actions CI/CD.

### Use tox locally

Install tox and run it:

```
pip install tox
tox
```

In our configuration, tox checks the source distribution using [check-manifest](https://pypi.org/project/check-manifest/) (which requires the repository to be initialised with `git init` and its files staged with `git add .`), runs the setuptools check, and runs the unit tests using pytest. You don't need to install check-manifest or pytest yourself, because tox installs them in a separate environment.

The automated tests run against several Python versions, but your machine may have only one. If that is Python 3.9, run:

```
tox -e py39
```

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally. If you want to do so anyway, see the "Re-releasing the package manually" section.

### 🤖 Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

- uses GitHub Actions for both testing and publishing
- is tested on every push to the `master` or `main` branch, and is published when a release is created
- includes test files in the source distribution
- uses **setup.cfg** for [version single-sourcing](https://packaging.python.org/guides/single-sourcing-package-version/) (setuptools 46.4.0+)

## 🧍 Re-releasing the package manually

The code to re-release Fast Stylometry on PyPI is as follows:

```
source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*
```

## 😊 Who worked on the Fast Stylometry NLP library?

The tool was developed by:

* Thomas Wood, Natural Language Processing consultant and data scientist at [Fast Data Science](https://fastdatascience.com).

## 📜 License of Fast Stylometry library

MIT License. Copyright (c) 2023 [Fast Data Science](https://fastdatascience.com)

## ✍️ Citing the Fast Stylometry library

If you are undertaking research in AI, NLP, or other areas, and are publishing your findings, I would be grateful if you could please cite the project.

Wood, T.A., Fast Stylometry [Computer software], Version 1.0.4, accessed at [https://fastdatascience.com/fast-stylometry-python-library](https://fastdatascience.com/fast-stylometry-python-library), Fast Data Science Ltd (2023)

```
@unpublished{faststylometry,
    AUTHOR = {Wood, T.A.},
    TITLE  = {Fast Stylometry (Computer software), Version 1.0.4},
    YEAR   = {2023},
    Note   = {To appear},
}
```

            
