# SaysWho
**SaysWho** is a Python package for identifying and attributing quotes in text. It uses a combination of logic and grammar to find quotes and their speakers, then uses a [coreferencing model](https://explosion.ai/blog/coref) to better clarify who is speaking. It's built on [Textacy](https://textacy.readthedocs.io/en/latest/) and [SpaCy](https://spacy.io/).
## Notes
- Coreferencing is an experimental feature not fully integrated into SpaCy, and the current pipeline is built on SpaCy 3.4. I haven't had any problems using it with SpaCy 3.5+, but it takes some finesse to navigate the different versions.
- SaysWho grew out of a larger project for analyzing newspaper articles from Lexis between ~250 and ~2000 words, and it is optimized to navigate the syntax and common errors particular to that text.
- The output of this version is kind of open-ended, and possibly not as useful as it could be. HTML viz is coming, but I'm open to any suggestions about how this could be more useful!
## Installation
Install and update using [pip](https://pip.pypa.io/en/stable/):
```
$ pip install sayswho
```
You will probably also need to run the following to work around some versioning issues (see [Notes](#notes)):
```
$ pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.0/en_coreference_web_trf-3.4.0a0-py3-none-any.whl
$ pip install spacy -U
$ spacy download en_core_web_lg
```
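After installing, it can help to confirm that each piece is visible to your interpreter. The sketch below uses only the standard library. The module names are inferred from the install commands above (in particular, the coreference wheel is assumed to install as `en_coreference_web_trf`), so adjust them if your environment differs.

```python
from importlib.util import find_spec

# Module names inferred from the install commands above (an assumption).
REQUIRED = ["sayswho", "spacy", "en_core_web_lg", "en_coreference_web_trf"]


def check_installed(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as `MISSING` points to the install step that needs to be repeated.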
## A Simple Example
##### Sample text adapted from [here](https://sports.yahoo.com/nets-jacque-vaughn-looking-forward-150705556.html):
> Nets Coach Jacque Vaughn was optimistic when discussing Ben Simmons's prospects on NBA TV.
>
> “It’s been great, being able to check in with Ben," Vaughn said, via Nets Daily. “I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.
>
> "He has an innate ability to impact the basketball game on both ends of the floor. So, we missed that in the Philly series and looking forward to it.”
>
> Simmons arrived in Brooklyn during the 2021-22 season, but did not play that year after a back injury. The 26-year-old would make 42 appearances (33 starts) during a tumult-filled season for Brooklyn.
>
> “He is on the court. No setbacks," Vaughn later told reporters about Simmons' workouts. “We’ll continue to see him improve through the offseason.”
#### Instantiate `SaysWho` and run `.attribute` on target text.
```python
from sayswho import SaysWho

# `text` holds the sample article above as a single string.
sw = SaysWho(text)
```
#### See speaker, cue and content of every quote with `.quotes`.
```python
print(sw.quotes)
```
```
[DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,"),
DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward."),
DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,"),
DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”)]
```
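Since each quote carries `speaker`, `cue` and `content` fields, one simple way to use the list is to group quote content by speaker. The sketch below is an illustration, not part of SaysWho: it uses a stand-in `DQTriple` namedtuple with plain strings so that it runs without any models installed, and `quotes_by_speaker` is a hypothetical helper name.

```python
from collections import defaultdict, namedtuple

# Stand-in for the DQTriple objects shown above (an assumption for this
# sketch); plain strings replace the spaCy tokens and spans.
DQTriple = namedtuple("DQTriple", ["speaker", "cue", "content"])

quotes = [
    DQTriple(["Vaughn"], ["said"], "It's been great, being able to check in with Ben,"),
    DQTriple(["Vaughn"], ["told"], "He is on the court. No setbacks,"),
]


def quotes_by_speaker(quotes):
    """Group quote content under the joined text of each quote's speaker."""
    grouped = defaultdict(list)
    for quote in quotes:
        speaker = " ".join(str(token) for token in quote.speaker)
        grouped[speaker].append(str(quote.content))
    return dict(grouped)


for speaker, contents in quotes_by_speaker(quotes).items():
    print(speaker, len(contents))
```

Because the helper only calls `str()` on each field, it should also work on the objects in `sw.quotes`, assuming their tokens and spans convert to their text.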
#### See resolved entity clusters with `.clusters`.
```python
print(sw.clusters)
```
```
[[Ben Simmons's,
Ben,
a healthy Ben Simmons,
him,
He,
Simmons,
The 26-year-old,
He,
Simmons'x,
him],
[Nets Coach Jacque Vaughn, Vaughn, I, Vaughn],
[Nets, The team, our, we],
[an innate ability to impact the basketball game on both ends of the floor,
that,
it],
[the 2021-22 season, that year],
[Brooklyn, Brooklyn, We]]
```
#### Use `.print_clusters()` to see the unique text in each cluster in an easier-to-read form.
```python
sw.print_clusters()
```
```
0 {'Ben', 'He', 'The 26-year-old', 'a healthy Ben Simmons', "Simmons'x", "Ben Simmons's", 'Simmons', 'him'}
1 {'I', 'Nets Coach Jacque Vaughn', 'Vaughn'}
2 {'The team', 'our', 'we', 'Nets'}
3 {'it', 'an innate ability to impact the basketball game on both ends of the floor', 'that'}
4 {'that year', 'the 2021-22 season'}
5 {'Brooklyn', 'We'}
```
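The relationship between the two views can be sketched as a de-duplication step: each cluster's mentions are reduced to the set of their distinct text. This is an illustration of the idea rather than the package's own implementation. Plain strings stand in for the spans in `sw.clusters`, and `unique_cluster_text` is a hypothetical helper name.

```python
# Plain-string stand-ins for three of the clusters shown above.
clusters = [
    ["Nets Coach Jacque Vaughn", "Vaughn", "I", "Vaughn"],
    ["the 2021-22 season", "that year"],
    ["Brooklyn", "Brooklyn", "We"],
]


def unique_cluster_text(clusters):
    """Return one set of distinct mention texts per cluster."""
    return [{str(mention) for mention in cluster} for cluster in clusters]


for index, texts in enumerate(unique_cluster_text(clusters)):
    print(index, texts)
```

Sets are unordered, which is why the order of items within each printed cluster can differ from the order in `.clusters`.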
#### Quote/cluster matches are saved to `.quote_matches` as `namedtuples`.
```python
for qm in sw.quote_matches:
    print(qm)
```
```
QuoteClusterMatch(quote_index=0, cluster_index=1)
QuoteClusterMatch(quote_index=1, cluster_index=1)
QuoteClusterMatch(quote_index=2, cluster_index=1)
QuoteClusterMatch(quote_index=3, cluster_index=1)
```
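Each match is a pair of positions: `quote_index` points into `.quotes` and `cluster_index` points into `.clusters`. A minimal sketch of following those indices is shown below. The field names come from the output above; the data is a plain-string stand-in, and `resolve_matches` is a hypothetical helper rather than a SaysWho function.

```python
from collections import namedtuple

# Stand-in with the same field names as the matches printed above.
QuoteClusterMatch = namedtuple("QuoteClusterMatch", ["quote_index", "cluster_index"])

# Plain-string stand-ins for sw.quotes and sw.clusters.
quotes = [
    "It's been great, being able to check in with Ben,",
    "I look forward to coaching a healthy Ben Simmons.",
]
clusters = [
    ["Ben Simmons's", "Ben", "him"],
    ["Nets Coach Jacque Vaughn", "Vaughn", "I", "Vaughn"],
]
quote_matches = [QuoteClusterMatch(0, 1), QuoteClusterMatch(1, 1)]


def resolve_matches(quote_matches, quotes, clusters):
    """Pair each matched quote with the cluster its speaker belongs to."""
    return [
        (quotes[match.quote_index], clusters[match.cluster_index])
        for match in quote_matches
    ]


for quote, cluster in resolve_matches(quote_matches, quotes, clusters):
    print(quote, "->", cluster)
```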
#### Use `.expand_match()` to view and interpret quote/cluster matches.
```python
sw.expand_match()
```
```
QUOTE : 0
DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 1
DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 2
DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 3
DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”)
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
```
Raw data
{
"_id": null,
"home_page": "https://github.com/afriedman412/sayswho",
"name": "sayswho",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9,<4.0",
"maintainer_email": "",
"keywords": "nlp,natural-language-processing,spacy",
"author": "Andy Friedman",
"author_email": "afriedman412@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/95/8c/65492b607846e710f60467e88623a815e6d142e89edf8ffba559ee158dce/sayswho-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Quote identification, attribution and resolution.",
"version": "0.1.2",
"project_urls": {
"Homepage": "https://github.com/afriedman412/sayswho",
"Repository": "https://github.com/afriedman412/sayswho"
},
"split_keywords": [
"nlp",
"natural-language-processing",
"spacy"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c35d9fa3bf101f6efe2af74e60653d522cf70e8b5d507378971060f0e3aadf37",
"md5": "1d77c009e957625e26a0f3182cf3cf90",
"sha256": "ebe5b4679965167a5ade4f4a9db6d02f323082ae0334c49043198886514f41cb"
},
"downloads": -1,
"filename": "sayswho-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1d77c009e957625e26a0f3182cf3cf90",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9,<4.0",
"size": 15460,
"upload_time": "2023-07-16T00:46:13",
"upload_time_iso_8601": "2023-07-16T00:46:13.446814Z",
"url": "https://files.pythonhosted.org/packages/c3/5d/9fa3bf101f6efe2af74e60653d522cf70e8b5d507378971060f0e3aadf37/sayswho-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "958c65492b607846e710f60467e88623a815e6d142e89edf8ffba559ee158dce",
"md5": "268167fc3b1a0019e97e7dd911cd4ea3",
"sha256": "613479b591fa217fd93bdc8872d0dc1db6d49f862429cfcf6f0239af015d8381"
},
"downloads": -1,
"filename": "sayswho-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "268167fc3b1a0019e97e7dd911cd4ea3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9,<4.0",
"size": 14881,
"upload_time": "2023-07-16T00:46:15",
"upload_time_iso_8601": "2023-07-16T00:46:15.015465Z",
"url": "https://files.pythonhosted.org/packages/95/8c/65492b607846e710f60467e88623a815e6d142e89edf8ffba559ee158dce/sayswho-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-16 00:46:15",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "afriedman412",
"github_project": "sayswho",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "sayswho"
}