str_sim_scorer

Name	str_sim_scorer
Version	3.0.2
upload_time	2025-08-28 15:51:28
author	Devin McCabe
requires_python	>=3.9

# STR Similarity Scorer

This repository contains a Python package to compute the [similarity score](https://www.cellosaurus.org/str-search/help.html) for pairs of records in an input data frame, some of which may be optionally specified as reference profiles.

This package computes the number of matching and total alleles and common loci largely through matrix algebra, so it's fast enough to be run on thousands of samples (millions of pairs).
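As a rough illustration of the matrix approach (a sketch only, not the package's internal code), the shared-allele counts, common-locus counts, and Tanabe scores for every pair can be obtained from a few matrix products. The toy `profiles` data and all variable names below are hypothetical.

```python
import numpy as np

# Hypothetical toy profiles: sample -> locus -> set of alleles.
# A locus that is absent from a sample's dict means "no data".
profiles = {
    "s1": {"th01": {"6", "8"}, "tpox": {"11"}, "fga": {"24"}},
    "s2": {"th01": {"6", "9.3"}, "tpox": {"11"}},
    "s3": {"th01": {"7"}, "fga": {"24", "25"}},
}

samples = list(profiles)
loci = sorted({locus for p in profiles.values() for locus in p})
features = sorted(
    {(locus, a) for p in profiles.values() for locus, als in p.items() for a in als}
)

# onehot[i, k] = 1 if sample i carries feature k (a locus/allele pair)
onehot = np.zeros((len(samples), len(features)), dtype=int)
# has_data[i, m] = 1 if sample i has any allele at locus m
has_data = np.zeros((len(samples), len(loci)), dtype=int)
# n_alleles[i, m] = number of alleles sample i has at locus m
n_alleles = np.zeros((len(samples), len(loci)), dtype=int)

for i, s in enumerate(samples):
    for locus, alleles in profiles[s].items():
        m = loci.index(locus)
        has_data[i, m] = 1
        n_alleles[i, m] = len(alleles)
        for a in alleles:
            onehot[i, features.index((locus, a))] = 1

shared = onehot @ onehot.T            # alleles common to samples i and j
n_common_loci = has_data @ has_data.T  # loci where both samples have data
at_common = n_alleles @ has_data.T     # alleles of i at loci where j has data
total = at_common + at_common.T        # total alleles over the common loci

# Tanabe score for all pairs at once (pairs with no common loci give NaN)
with np.errstate(divide="ignore", invalid="ignore"):
    tanabe = 2 * shared / total
```

Here `tanabe[0, 1]` is 2 × 2 / 6 ≈ 0.667, because `s1` and `s2` share two alleles out of six at their two common loci.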

## Installation

Install `str_sim_scorer` [from PyPI](https://pypi.org/project/str_sim_scorer/) using the package manager of your choice.

## Usage

The `StrSimScorer` class provides an object-oriented interface with caching for efficient computation:

```py
import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

# Create the comparison object
scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
)

scores = scorer.scores(output="df")
```

### Output formats

Using `output="df"` returns a DataFrame for distinct pairs of IDs:
```
>>> print(scores)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample3           12  0.285714
```

Using `output="symmetric_df"` returns the same data with both (id1, id2) and (id2, id1) rows:
```
>>> scores_sym = scorer.scores(output="symmetric_df")
>>> print(scores_sym)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample1           14   0.26087
3  sample2  sample3           12  0.285714
4  sample3  sample1           12  0.114286
5  sample3  sample2           12  0.285714
```
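
Because every sample appears in the `id1` column of the two-directional output, it is convenient for per-sample lookups. The sketch below rebuilds that table by hand from the values printed above (so it runs without the package) and picks the best-scoring partner for each sample with plain pandas; the name `best` is just illustrative.

```python
import pandas as pd

# Values copied from the printed example output
scores_sym = pd.DataFrame(
    {
        "id1": ["sample1", "sample1", "sample2", "sample2", "sample3", "sample3"],
        "id2": ["sample2", "sample3", "sample1", "sample3", "sample1", "sample2"],
        "n_loci_used": [14, 12, 14, 12, 12, 12],
        "score": [0.26087, 0.114286, 0.26087, 0.285714, 0.114286, 0.285714],
    }
)

# For each id1, keep the row with the highest score
best_rows = scores_sym.groupby("id1")["score"].idxmax()
best = scores_sym.loc[best_rows].reset_index(drop=True)
```

With these values, `best` pairs sample1 with sample2, and both sample2 and sample3 with each other.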

Using `output="array"` returns the raw similarity matrix as a numpy [masked array](https://numpy.org/doc/stable/reference/routines.ma.html):
```
>>> array = scorer.scores(output="array")
>>> array
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [0.2608695652173913, --, 0.2857142857142857],
        [0.11428571428571428, 0.2857142857142857, --]],
  mask=[[ True, False, False],
        [False,  True, False],
        [False, False,  True]],
  fill_value=0.0)
>>> print(scorer.sample_ids) # the row/col names of the matrix 
['sample1', 'sample2', 'sample3']
```

Only cells representing a valid and relevant pair of samples are unmasked, which is why the diagonal is masked in this example. 
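
Masked cells are skipped by numpy's masked-array reductions, so the matrix can be queried directly. The following sketch builds an equivalent masked array by hand from the values shown above (the `1.0` entries under the mask are placeholders, not values produced by the package) and looks up each row's best-scoring column.

```python
import numpy as np

sample_ids = ["sample1", "sample2", "sample3"]

# Off-diagonal values copied from the example; diagonal entries are
# placeholders that the mask hides.
data = np.array(
    [
        [1.0, 0.2608695652173913, 0.11428571428571428],
        [0.2608695652173913, 1.0, 0.2857142857142857],
        [0.11428571428571428, 0.2857142857142857, 1.0],
    ]
)
array = np.ma.masked_array(data, mask=np.eye(3, dtype=bool))

# argmax ignores masked cells, so a sample is never matched to itself
best_idx = array.argmax(axis=1)
best_match = [sample_ids[j] for j in best_idx]

# Number of unmasked cells, and a plain ndarray with NaN in masked cells
n_valid = array.count()
as_nan = array.filled(np.nan)
```

Converting with `filled(np.nan)` is one way to hand the matrix to code that does not understand masks.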

### Algorithms

#### Tanabe / non-empty markers

This package implements two [algorithms](https://www.cellosaurus.org/str-search/help.html). For a pair of samples where neither is indicated as a "reference", the score is calculated using the Tanabe algorithm under the "non-empty markers" mode. Thus, `n_loci_used` is the number of loci where both samples had data.
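
To make the calculation concrete, here is a plain-Python sketch of the Tanabe score in "non-empty markers" mode, written from the description on the Cellosaurus help page rather than taken from this package. It reproduces the sample1 vs. sample3 values from the usage example (12 loci, score 0.114286). The helper names `parse` and `tanabe_non_empty` are hypothetical.

```python
LOCI = ["csf1po", "d13s317", "d16s539", "d18s51", "d21s11", "d3s1358", "d5s818",
        "d7s820", "d8s1179", "fga", "penta_d", "penta_e", "th01", "tpox"]


def parse(values):
    """Map each locus to a set of allele strings; None means no data."""
    return {
        locus: ({a.strip() for a in v.split(",")} if v is not None else set())
        for locus, v in zip(LOCI, values)
    }


sample1 = parse(["11, 13", "11, 12", "9, 12", "11, 19", "29, 31.2", "17", "13",
                 "10", "12, 13", "24", "9, 12", "7, 13", "6, 8", "11"])
sample3 = parse(["11, 12", "8", "11", "18", None, "16", "10, 11",
                 "12", "11", "26", None, "13.1, 12.1", "6, 9.3", "12"])


def tanabe_non_empty(a, b):
    """Return (n_loci_used, score) using only loci where both profiles have data."""
    common = [locus for locus in LOCI if a[locus] and b[locus]]
    shared = sum(len(a[locus] & b[locus]) for locus in common)
    total = sum(len(a[locus]) + len(b[locus]) for locus in common)
    if total == 0:
        return 0, None  # no overlapping loci, so no score
    return len(common), 2 * shared / total


n_loci_used, score = tanabe_non_empty(sample1, sample3)
```

The two profiles share 2 alleles (`11` at csf1po and `6` at th01) out of 19 + 16 alleles at the 12 common loci, giving 2 × 2 / 35 ≈ 0.114286.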

#### Master vs. reference / reference markers

If your input data frame has a boolean column indicating that some samples are references (e.g. STR profiles from a canonical source like Cellosaurus), set the `is_reference_col_name` argument to the name of that column. A cell `i,j` in the scores matrix will be computed using the "masters vs. reference" algorithm if `scorer.sample_ids[i]` is a master (i.e. real sample) and `scorer.sample_ids[j]` is a reference. In this case, `n_loci_used` is the number of loci present in the reference.

```py
import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "is_ref": False,
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "is_ref": True,
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "is_ref": False,
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
    is_reference_col_name="is_ref",
)

scores = scorer.scores(output="array")
```

```
>>> scores
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [--, --, --],
        [0.11428571428571428, 0.21739130434782608, --]],
  mask=[[ True, False, False],
        [ True,  True,  True],
        [False, False,  True]],
  fill_value=0.0)
```

- Cells for pairs of non-reference samples like `0,2` and `2,0` (sample1 vs. sample3) are computed using the Tanabe algorithm.
- Cells like `0,1` (sample1 vs. sample2) and `2,1` (sample3 vs. sample2) are computed against the reference sample2, so they use the "masters vs. reference" algorithm and the "reference markers" loci counting mode.
- Cells `1,0` (sample2 vs. sample1) and `1,2` (sample2 vs. sample3) are masked since these involve comparisons that are irrelevant under both algorithms.
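
As a cross-check on the reference-based score, the sketch below computes it in plain Python, assuming the formula from the Cellosaurus help page (shared alleles divided by the number of alleles in the reference, counted over all loci present in the reference). It is not the package's own code, and the helper names are hypothetical, but it reproduces the value in cell `2,1` (sample3 vs. the reference sample2).

```python
LOCI = ["csf1po", "d13s317", "d16s539", "d18s51", "d21s11", "d3s1358", "d5s818",
        "d7s820", "d8s1179", "fga", "penta_d", "penta_e", "th01", "tpox"]


def parse(values):
    """Map each locus to a set of allele strings; None means no data."""
    return {
        locus: ({a.strip() for a in v.split(",")} if v is not None else set())
        for locus, v in zip(LOCI, values)
    }


sample3 = parse(["11, 12", "8", "11", "18", None, "16", "10, 11",
                 "12", "11", "26", None, "13.1, 12.1", "6, 9.3", "12"])
sample2_ref = parse(["12", "11, 12", "8, 12", "17, 18", "28, 33.2", "16", "11",
                     "8, 13", "9, 10", "21, 25", "9, 12", "7", "7, 9.3", "8"])


def masters_vs_reference(query, reference):
    """Return (n_loci_used, score), counting every locus present in the reference."""
    ref_loci = [locus for locus in LOCI if reference[locus]]
    shared = sum(len(query[locus] & reference[locus]) for locus in ref_loci)
    n_ref_alleles = sum(len(reference[locus]) for locus in ref_loci)
    if n_ref_alleles == 0:
        return 0, None  # empty reference profile, so no score
    return len(ref_loci), shared / n_ref_alleles


n_loci_used, score = masters_vs_reference(sample3, sample2_ref)
```

sample3 shares 5 alleles with the reference, which has 23 alleles across its 14 loci, so the score is 5 / 23 ≈ 0.217391. Note that the two loci missing from sample3 still count toward the denominator, which is why this differs from the Tanabe score of 0.285714 for the same pair.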

## Development

### Installation

1. Install the required system dependencies:
   - [pyenv](https://github.com/pyenv/pyenv)
   - [Poetry](https://python-poetry.org/)
   - [pre-commit](https://pre-commit.com/)
 
2. Install the required Python version (>=3.9):
	```bash
	pyenv install "$(cat .python-version)"
	```

3. Confirm that `python` maps to the correct version:
	```bash
	python --version
	```

4. Set the Poetry interpreter and install the Python dependencies:
	```bash
	poetry env use "$(pyenv which python)"
	poetry install
	```

Run `poetry run pyright` to check static types with [Pyright](https://microsoft.github.io/pyright).

### Testing

```bash
poetry run pytest
```
