| Field | Value |
| --- | --- |
| Name | str_sim_scorer |
| Version | 3.0.2 |
| upload_time | 2025-08-28 15:51:28 |
| author | Devin McCabe |
| requires_python | >=3.9 |
# STR Similarity Scorer
This repository contains a Python package to compute the [similarity score](https://www.cellosaurus.org/str-search/help.html) for pairs of records in an input data frame, some of which may be optionally specified as reference profiles.
This package computes the number of matching and total alleles and common loci largely through matrix algebra, so it's fast enough to be run on thousands of samples (millions of pairs).
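To illustrate the matrix-algebra idea, here is a minimal NumPy sketch on a three-locus subset of the example data used later in this README. It is not the package's actual implementation: the matrix names `X`, `C`, and `P` and the `parse_alleles` helper are assumptions made for illustration, and missing loci are represented with `None`.

```python
import numpy as np

profiles = {
    "sample1": {"csf1po": "11, 13", "d21s11": "29, 31.2", "th01": "6, 8"},
    "sample2": {"csf1po": "12", "d21s11": "28, 33.2", "th01": "7, 9.3"},
    "sample3": {"csf1po": "11, 12", "d21s11": None, "th01": "6, 9.3"},
}
loci = ["csf1po", "d21s11", "th01"]
sample_ids = list(profiles)


def parse_alleles(cell):
    """Hypothetical helper: turn "11, 13" into {"11", "13"}; None means no data."""
    return set() if cell is None else {a.strip() for a in cell.split(",")}


# One column per (locus, allele) combination seen anywhere in the data.
columns = sorted(
    {(locus, a) for p in profiles.values() for locus in loci for a in parse_alleles(p[locus])}
)
col_index = {col: k for k, col in enumerate(columns)}

X = np.zeros((len(sample_ids), len(columns)), dtype=int)  # allele indicators
C = np.zeros((len(sample_ids), len(loci)), dtype=int)  # allele counts per locus
for i, sample_id in enumerate(sample_ids):
    for l_idx, locus in enumerate(loci):
        alleles = parse_alleles(profiles[sample_id][locus])
        C[i, l_idx] = len(alleles)
        for a in alleles:
            X[i, col_index[(locus, a)]] = 1
P = (C > 0).astype(int)  # 1 where a sample has data at a locus

shared = X @ X.T  # matching alleles for every pair of samples
n_common_loci = P @ P.T  # loci where both samples have data
total = C @ P.T + P @ C.T  # alleles of both samples, counted at common loci only

# Tanabe-style score: 2 * shared / total, NaN where there is nothing to compare.
tanabe = np.divide(2 * shared, total, out=np.full(total.shape, np.nan), where=total > 0)
```

Every pairwise quantity comes from a handful of matrix products, so no Python-level loop over sample pairs is needed.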
## Installation
Install `str_sim_scorer` [from PyPI](https://pypi.org/project/str_sim_scorer/) using the package manager of your choice.
## Usage
The `StrSimScorer` class provides an object-oriented interface with caching for efficient computation:
```py
import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

# Create the comparison object
scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
)

scores = scorer.scores(output="df")
```
### Output formats
Using `output="df"` returns a DataFrame for distinct pairs of IDs:
```
>>> print(scores)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample3           12  0.285714
```
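Because the result is an ordinary pandas DataFrame, the usual filtering and sorting tools apply. The sketch below rebuilds the printed table by hand so that it runs without the package installed; the `min_score` cutoff is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Hand-built copy of the table printed above (scores written as exact fractions).
scores = pd.DataFrame(
    {
        "id1": ["sample1", "sample1", "sample2"],
        "id2": ["sample2", "sample3", "sample3"],
        "n_loci_used": [14, 12, 12],
        "score": [12 / 46, 4 / 35, 10 / 35],
    }
)

min_score = 0.2  # hypothetical cutoff, not a recommended threshold

# Keep pairs at or above the cutoff, most similar first.
hits = (
    scores[scores["score"] >= min_score]
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(hits)
```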
Using `output="full_df"` returns the same data with both (id1, id2) and (id2, id1) rows:
```
>>> scores_full = scorer.scores(output="full_df")
>>> print(scores_full)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample1           14   0.26087
3  sample2  sample3           12  0.285714
4  sample3  sample1           12  0.114286
5  sample3  sample2           12  0.285714
```
Using `output="array"` returns the raw similarity matrix as a numpy [masked array](https://numpy.org/doc/stable/reference/routines.ma.html):
```
>>> array = scorer.scores(output="array")
>>> array
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [0.2608695652173913, --, 0.2857142857142857],
        [0.11428571428571428, 0.2857142857142857, --]],
  mask=[[ True, False, False],
        [False,  True, False],
        [False, False,  True]],
  fill_value=0.0)
>>> print(scorer.sample_ids) # the row/col names of the matrix
['sample1', 'sample2', 'sample3']
```
Only cells representing a valid and relevant pair of samples are unmasked, which is why the diagonal is masked in this example.
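Masked cells are skipped by NumPy's masked-array operations, which makes it straightforward to summarize the matrix. The following sketch rebuilds the example array by hand (it does not call the package) and looks up the best-scoring partner for each sample; the variable names are illustrative.

```python
import numpy as np

sample_ids = ["sample1", "sample2", "sample3"]
scores = np.ma.masked_array(
    data=[
        [0.0, 12 / 46, 4 / 35],
        [12 / 46, 0.0, 10 / 35],
        [4 / 35, 10 / 35, 0.0],
    ],
    mask=np.eye(3, dtype=bool),  # mask the diagonal (self-comparisons)
)

n_valid = scores.count()  # number of unmasked cells

# Column index of the highest unmasked score in each row. Masked cells are
# treated as -inf, so a fully masked row would simply report column 0.
best_col = scores.argmax(axis=1, fill_value=-np.inf)
best_match = {sample_ids[i]: sample_ids[j] for i, j in enumerate(best_col)}

# Convert to a plain ndarray with NaN in place of masked cells.
as_nan = scores.filled(np.nan)
```

Check the mask before trusting an `argmax` result for rows that may be entirely masked, such as reference rows in the reference mode described below.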
### Algorithms
#### Tanabe / non-empty markers
This package implements two [algorithms](https://www.cellosaurus.org/str-search/help.html). For a pair of samples where neither is indicated as a "reference", the score is calculated using the Tanabe algorithm under the "non-empty markers" mode. Thus, `n_loci_used` is the number of loci where both samples had data.
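As a worked check, the plain-Python sketch below computes the Tanabe score for `sample1` and `sample3` from the usage example, assuming the score is twice the number of shared alleles divided by the total number of alleles in both profiles, counted only at loci where both have data. The `parse_alleles` and `tanabe_score` helpers are hypothetical and not part of the package's API; `None` stands in for `pd.NA`.

```python
def parse_alleles(cell):
    """Hypothetical helper: turn "11, 13" into {"11", "13"}; None means no data."""
    return set() if cell is None else {a.strip() for a in cell.split(",")}


def tanabe_score(profile_a, profile_b):
    """Return (score, n_loci_used) under the "non-empty markers" mode."""
    shared = total = n_loci = 0
    for locus in profile_a.keys() & profile_b.keys():
        a = parse_alleles(profile_a[locus])
        b = parse_alleles(profile_b[locus])
        if not a or not b:
            continue  # skip loci where either sample lacks data
        n_loci += 1
        shared += len(a & b)
        total += len(a) + len(b)
    score = 2 * shared / total if total else float("nan")
    return score, n_loci


sample1 = {
    "csf1po": "11, 13", "d13s317": "11, 12", "d16s539": "9, 12", "d18s51": "11, 19",
    "d21s11": "29, 31.2", "d3s1358": "17", "d5s818": "13", "d7s820": "10",
    "d8s1179": "12, 13", "fga": "24", "penta_d": "9, 12", "penta_e": "7, 13",
    "th01": "6, 8", "tpox": "11",
}
sample3 = {
    "csf1po": "11, 12", "d13s317": "8", "d16s539": "11", "d18s51": "18",
    "d21s11": None, "d3s1358": "16", "d5s818": "10, 11", "d7s820": "12",
    "d8s1179": "11", "fga": "26", "penta_d": None, "penta_e": "13.1, 12.1",
    "th01": "6, 9.3", "tpox": "12",
}

score, n_loci_used = tanabe_score(sample1, sample3)
print(round(score, 6), n_loci_used)  # 2 shared alleles out of 19 + 16 -> 0.114286 12
```

The two samples share one allele at `csf1po` and one at `th01`, giving 2 × 2 / 35 ≈ 0.114286 over 12 loci, which matches the `sample1`/`sample3` row in the output above.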
#### Masters vs. reference / reference markers
If your input data frame has a boolean column indicating that some samples are references (e.g. STR profiles from a canonical source like Cellosaurus), set the `is_reference_col_name` argument to the name of that column. A cell `i,j` in the scores matrix will be computed using the "masters vs. reference" algorithm if `scorer.sample_ids[i]` is a master (i.e. real sample) and `scorer.sample_ids[j]` is a reference. In this case, `n_loci_used` is the number of loci present in the reference.
```py
import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "is_ref": False,
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "is_ref": True,
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "is_ref": False,
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
    is_reference_col_name="is_ref",
)

scores = scorer.scores(output="array")
```
```
>>> scores
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [--, --, --],
        [0.11428571428571428, 0.21739130434782608, --]],
  mask=[[ True, False, False],
        [ True,  True,  True],
        [False, False,  True]],
  fill_value=0.0)
```
- Cells for pairs of non-reference samples like `0,2` and `2,0` (`sample1` vs. `sample3`) are computed using the Tanabe algorithm.
- Cells like `0,1` (`sample1` vs. `sample2`) and `2,1` (`sample3` vs. `sample2`) are computed against the reference `sample2`, so they use the masters vs. reference algorithm and the "reference markers" loci counting mode.
- Cells `1,0` (`sample2` vs. `sample1`) and `1,2` (`sample2` vs. `sample3`) are masked since these involve comparisons that are irrelevant under both algorithms.
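The value in cell `2,1` can be reproduced by hand. The sketch below assumes the masters vs. reference score is the number of shared alleles divided by the number of alleles in the reference, counted over every locus present in the reference. The helper names are hypothetical and this is not the package's own code; `None` stands in for `pd.NA`.

```python
def parse_alleles(cell):
    """Hypothetical helper: turn "11, 13" into {"11", "13"}; None means no data."""
    return set() if cell is None else {a.strip() for a in cell.split(",")}


def masters_vs_reference(query, reference):
    """Return (score, n_loci_used) under the "reference markers" mode."""
    shared = ref_total = n_loci = 0
    for locus, cell in reference.items():
        ref_alleles = parse_alleles(cell)
        if not ref_alleles:
            continue  # only loci present in the reference are counted
        n_loci += 1
        ref_total += len(ref_alleles)
        # A locus missing from the query contributes no shared alleles.
        shared += len(ref_alleles & parse_alleles(query.get(locus)))
    score = shared / ref_total if ref_total else float("nan")
    return score, n_loci


sample2_ref = {
    "csf1po": "12", "d13s317": "11, 12", "d16s539": "8, 12", "d18s51": "17, 18",
    "d21s11": "28, 33.2", "d3s1358": "16", "d5s818": "11", "d7s820": "8, 13",
    "d8s1179": "9, 10", "fga": "21, 25", "penta_d": "9, 12", "penta_e": "7",
    "th01": "7, 9.3", "tpox": "8",
}
sample3 = {
    "csf1po": "11, 12", "d13s317": "8", "d16s539": "11", "d18s51": "18",
    "d21s11": None, "d3s1358": "16", "d5s818": "10, 11", "d7s820": "12",
    "d8s1179": "11", "fga": "26", "penta_d": None, "penta_e": "13.1, 12.1",
    "th01": "6, 9.3", "tpox": "12",
}

score, n_loci_used = masters_vs_reference(sample3, sample2_ref)
print(round(score, 6), n_loci_used)  # 5 shared / 23 reference alleles -> 0.217391 14
```

Under this assumption the denominator depends only on the reference, so the `d21s11` and `penta_d` loci missing from `sample3` still count toward the 23 reference alleles. That is why cell `2,1` (5 / 23 ≈ 0.217) is lower than the Tanabe score of 10 / 35 ≈ 0.286 for the same pair in the earlier example.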
## Development
### Installation
1. Install the required system dependencies:
   - [pyenv](https://github.com/pyenv/pyenv)
   - [Poetry](https://python-poetry.org/)
   - [pre-commit](https://pre-commit.com/)
2. Install the required Python version (>=3.9):
   ```bash
   pyenv install "$(cat .python-version)"
   ```
3. Confirm that `python` maps to the correct version:
   ```bash
   python --version
   ```
4. Set the Poetry interpreter and install the Python dependencies:
```bash
poetry env use "$(pyenv which python)"
poetry install
```
Run `poetry run pyright` to check static types with [Pyright](https://microsoft.github.io/pyright).
### Testing
```bash
poetry run pytest
```
Raw data
{
"_id": null,
"home_page": null,
"name": "str_sim_scorer",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Devin McCabe",
"author_email": "dmccabe@broadinstitute.org",
"download_url": "https://files.pythonhosted.org/packages/c7/de/3d2191ce321d1860869811b7cd673b0a0cda75a1ee2cc3065fcc31d3b0f7/str_sim_scorer-3.0.2.tar.gz",
"platform": null,
"description": "from str_sim_scorer import StrSimScorer\n\n# STR Similarity Scorer\n\nThis repository contains a Python package to compute the [similarity score](https://www.cellosaurus.org/str-search/help.html) for pairs of records in an input data frame, some of which may be optionally specified as reference profiles.\n\nThis package computes the number of matching and total alleles and common loci largely through matrix algebra, so it's fast enough to be run on thousands of samples (millions of pairs).\n\n## Installation\n\nInstall `str_sim_scorer` [from PyPI](https://pypi.org/project/str_sim_scorer/) using the package manager of your choice.\n\n## Usage\n\nThe `StrSimScorer` class provides an object-oriented interface with caching for efficient computation:\n\n```py\nimport pandas as pd\nfrom str_sim_scorer import StrSimScorer\n\ndf = pd.DataFrame(\n [\n {\n \"id\": \"sample1\",\n \"csf1po\": \"11, 13\",\n \"d13s317\": \"11, 12\",\n \"d16s539\": \"9, 12\",\n \"d18s51\": \"11, 19\",\n \"d21s11\": \"29, 31.2\",\n \"d3s1358\": \"17\",\n \"d5s818\": \"13\",\n \"d7s820\": \"10\",\n \"d8s1179\": \"12, 13\",\n \"fga\": \"24\",\n \"penta_d\": \"9, 12\",\n \"penta_e\": \"7, 13\",\n \"th01\": \"6, 8\",\n \"tpox\": \"11\",\n },\n {\n \"id\": \"sample2\",\n \"csf1po\": \"12\",\n \"d13s317\": \"11, 12\",\n \"d16s539\": \"8, 12\",\n \"d18s51\": \"17, 18\",\n \"d21s11\": \"28, 33.2\",\n \"d3s1358\": \"16\",\n \"d5s818\": \"11\",\n \"d7s820\": \"8, 13\",\n \"d8s1179\": \"9, 10\",\n \"fga\": \"21, 25\",\n \"penta_d\": \"9, 12\",\n \"penta_e\": \"7\",\n \"th01\": \"7, 9.3\",\n \"tpox\": \"8\",\n },\n {\n \"id\": \"sample3\",\n \"csf1po\": \"11, 12\",\n \"d13s317\": \"8\",\n \"d16s539\": \"11\",\n \"d18s51\": \"18\",\n \"d21s11\": pd.NA,\n \"d3s1358\": \"16\",\n \"d5s818\": \"10, 11\",\n \"d7s820\": \"12\",\n \"d8s1179\": \"11\",\n \"fga\": \"26\",\n \"penta_d\": pd.NA,\n \"penta_e\": \"13.1, 12.1\",\n \"th01\": \"6, 9.3\",\n \"tpox\": \"12\",\n },\n ]\n)\n\n# Create the 
comparison object\nscorer = StrSimScorer(\n df,\n sample_id_col_name=\"id\",\n locus_col_names=[\n \"csf1po\",\n \"d13s317\",\n \"d16s539\",\n \"d18s51\",\n \"d21s11\",\n \"d3s1358\",\n \"d5s818\",\n \"d7s820\",\n \"d8s1179\",\n \"fga\",\n \"penta_d\",\n \"penta_e\",\n \"th01\",\n \"tpox\",\n ],\n)\n\nscores = scorer.scores(output=\"df\")\n```\n\n### Output formats\n\nUsing `output=\"df\"` returns a DataFrame for distinct pairs of IDs:\n```\n>>> print(scores)\n id1 id2 n_loci_used score\n0 sample1 sample2 14 0.26087\n1 sample1 sample3 12 0.114286\n2 sample2 sample3 12 0.285714\n```\n\nUsing `output=\"full_df\"` returns the same data with both (id1, id2) and (id2, id1) rows:\n```\n>>> scores_sym = comp.scores(output=\"symmetric_df\")\n>>> print(scores_sym)\n id1 id2 n_loci_used score\n0 sample1 sample2 14 0.26087\n1 sample1 sample3 12 0.114286\n2 sample2 sample1 14 0.26087\n3 sample2 sample3 12 0.285714\n4 sample3 sample1 12 0.114286\n5 sample3 sample2 12 0.285714\n```\n\nUsing `output=\"array\"` returns the raw similarity matrix as a numpy [masked array](https://numpy.org/doc/stable/reference/routines.ma.html):\n```\n>>> array = comp.scores(output=\"array\")\n>>> print(array)\nmasked_array(\n data=[[--, 0.2608695652173913, 0.11428571428571428],\n [0.2608695652173913, --, 0.2857142857142857],\n [0.11428571428571428, 0.2857142857142857, --]],\n mask=[[ True, False, False],\n [False, True, False],\n [False, False, True]],\n fill_value=0.0)\n>>> print(scorer.sample_ids) # the row/col names of the matrix \n['sample1', 'sample2', 'sample3']\n```\n\nOnly cells representing a valid and relevant pair of samples are unmasked, which is why the diagonal is masked in this example. \n\n### Algorithms\n\n#### Tanabe / non-empty markers\n\nThis package implements two [algorithms](https://www.cellosaurus.org/str-search/help.html). 
For a pair of samples where neither is indicated as a \"reference\", the score is calculated using the Tanabe algorithm under the \"non-empty markers\" mode. Thus, `n_loci_used` is the number of loci where both samples had data.\n\n#### Master vs. reference / reference markers\n\nIf your input data frame has a boolean column indicating that some samples are references (e.g. STR profiles from a canonical source like Cellosaurus), set the `is_reference_col_name` argument to the name of that column. A cell `i,j` in the scores matrix will be computed using the \"masters vs. reference\" algorithm if `scorer.sample_ids[i]` is a master (i.e. real sample) and `scorer.sample_ids[j]` is a reference. In ths case, `n_loci_used` is the number of loci present in the reference.\n\n```py\nimport pandas as pd\nfrom str_sim_scorer import StrSimScorer\n\ndf = pd.DataFrame(\n [\n {\n \"id\": \"sample1\",\n \"is_ref\": False,\n \"csf1po\": \"11, 13\",\n \"d13s317\": \"11, 12\",\n \"d16s539\": \"9, 12\",\n \"d18s51\": \"11, 19\",\n \"d21s11\": \"29, 31.2\",\n \"d3s1358\": \"17\",\n \"d5s818\": \"13\",\n \"d7s820\": \"10\",\n \"d8s1179\": \"12, 13\",\n \"fga\": \"24\",\n \"penta_d\": \"9, 12\",\n \"penta_e\": \"7, 13\",\n \"th01\": \"6, 8\",\n \"tpox\": \"11\",\n },\n {\n \"id\": \"sample2\",\n \"is_ref\": True,\n \"csf1po\": \"12\",\n \"d13s317\": \"11, 12\",\n \"d16s539\": \"8, 12\",\n \"d18s51\": \"17, 18\",\n \"d21s11\": \"28, 33.2\",\n \"d3s1358\": \"16\",\n \"d5s818\": \"11\",\n \"d7s820\": \"8, 13\",\n \"d8s1179\": \"9, 10\",\n \"fga\": \"21, 25\",\n \"penta_d\": \"9, 12\",\n \"penta_e\": \"7\",\n \"th01\": \"7, 9.3\",\n \"tpox\": \"8\",\n },\n {\n \"id\": \"sample3\",\n \"is_ref\": False,\n \"csf1po\": \"11, 12\",\n \"d13s317\": \"8\",\n \"d16s539\": \"11\",\n \"d18s51\": \"18\",\n \"d21s11\": pd.NA,\n \"d3s1358\": \"16\",\n \"d5s818\": \"10, 11\",\n \"d7s820\": \"12\",\n \"d8s1179\": \"11\",\n \"fga\": \"26\",\n \"penta_d\": pd.NA,\n \"penta_e\": \"13.1, 12.1\",\n \"th01\": \"6, 
9.3\",\n \"tpox\": \"12\",\n },\n ]\n)\n\nscorer = StrSimScorer(\n df,\n sample_id_col_name=\"id\",\n locus_col_names=[\n \"csf1po\",\n \"d13s317\",\n \"d16s539\",\n \"d18s51\",\n \"d21s11\",\n \"d3s1358\",\n \"d5s818\",\n \"d7s820\",\n \"d8s1179\",\n \"fga\",\n \"penta_d\",\n \"penta_e\",\n \"th01\",\n \"tpox\",\n ],\n is_reference_col_name=\"is_ref\",\n)\n\nscores = scorer.scores(output=\"array\")\n```\n\n```\n>>> print(scores)\nmasked_array(\n data=[[--, 0.2608695652173913, 0.11428571428571428],\n [--, --, --],\n [0.11428571428571428, 0.21739130434782608, --]],\n mask=[[ True, False, False],\n [ True, True, True],\n [False, False, True]],\n fill_value=0.0)\n```\n\n- Cells for pairs of non-reference samples like `0,2` and `2,0` (a vs. c) are computed using the Tanabe algorithm.\n- Cells like `0,1` (a vs. b) and `2,1` (c vs. b) are computed against the reference sample b, so they use the other algorithm and loci counting mode.\n- Cells `1,0` (b vs. a) and `1,2` (b vs. c) are masked since these involve comparisons that are irrelevant under both algorithms. \n\n## Development\n\n### Installation\n\n1. Install the required system dependencies:\n - [pyenv](https://github.com/pyenv/pyenv)\n - [Poetry](https://python-poetry.org/)\n - [pre-commit](https://pre-commit.com/)\n \n3. Install the required Python version (>=3.9):\n\t```bash\n\tpyenv install \"$(cat .python-version)\"\n\t```\n\n4. Confirm that `python` maps to the correct version:\n\t```\n\tpython --version\n\t```\n\n5. Set the Poetry interpreter and install the Python dependencies:\n\t```bash\n\tpoetry env use \"$(pyenv which python)\"\n\tpoetry install\n\t```\n\nRun `poetry run pyright` to check static types with [Pyright](https://microsoft.github.io/pyright).\n\n### Testing\n\n```bash\npoetry run pytest\n```\n",
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "3.0.2",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1ae5d6f0993544a4bb0b900e81ae44bea962840e2847ba9b6cbfae502c128c01",
"md5": "746cbba5836511b1714bc98ce6cce2f6",
"sha256": "027e5d8480de49bbce45e760f76f5005c8fb3e0d628adb30c44bb35b847b0a8f"
},
"downloads": -1,
"filename": "str_sim_scorer-3.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "746cbba5836511b1714bc98ce6cce2f6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 9660,
"upload_time": "2025-08-28T15:51:27",
"upload_time_iso_8601": "2025-08-28T15:51:27.605255Z",
"url": "https://files.pythonhosted.org/packages/1a/e5/d6f0993544a4bb0b900e81ae44bea962840e2847ba9b6cbfae502c128c01/str_sim_scorer-3.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c7de3d2191ce321d1860869811b7cd673b0a0cda75a1ee2cc3065fcc31d3b0f7",
"md5": "980bfb7238c0f2873e13cccc8dfb91c9",
"sha256": "78659de55ade2b0fffc17f8bd7b6994c4a526a432035654ab1cf089bdfe74a88"
},
"downloads": -1,
"filename": "str_sim_scorer-3.0.2.tar.gz",
"has_sig": false,
"md5_digest": "980bfb7238c0f2873e13cccc8dfb91c9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 10374,
"upload_time": "2025-08-28T15:51:28",
"upload_time_iso_8601": "2025-08-28T15:51:28.375810Z",
"url": "https://files.pythonhosted.org/packages/c7/de/3d2191ce321d1860869811b7cd673b0a0cda75a1ee2cc3065fcc31d3b0f7/str_sim_scorer-3.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-28 15:51:28",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "str_sim_scorer"
}