treetag


- Name: treetag
- Version: 0.1.1
- Summary: Hierarchical cell-type tagging with YAML ontologies and markers.
- Upload time: 2025-09-06 21:32:48
- Requires Python: >=3.10
- License: MIT
- Keywords: single-cell, bioinformatics, annotation, scanpy

# TreeTag

TreeTag is a lightweight Python package that automatically annotates single-cell RNA-seq data. It reads two editable YAML files: one lays out the hierarchy of cell types, and the other lists positive and negative marker genes for each type. Because the marker rules stay human-readable and re-annotation is near-instant, marker sets and ontologies can be adjusted quickly and interactively. Marker pruning avoids misleading assignments from dataset- or batch-specific markers, while smoothing helps overcome the inherent sparsity of scRNA-seq data by integrating consistent signals from a PCA-driven neighborhood embedding.

---

## Features

- Integrates smoothly with AnnData and scanpy.

- Reads human‑editable YAMLs for the ontology and for positive/negative markers and builds the ontology as a graph (via igraph) for hierarchical traversal.

- Visualizes the ontology to inspect and validate structure **(plot_tree)**.

- Pre-scales marker columns (sparse-friendly), caches them, and runs lean matrix operations for fast scoring **(TreeTag)**.

- Computes hierarchical marker-based scores top-down, optionally applying KNN smoothing and a majority vote over a PCA-driven neighborhood embedding **(TreeTag toggles)**.

- Assigns cell‑type tags **(TreeTag in AnnData object)**.

- Prunes unreliable markers when they fail to separate the intended type **(TreeTag toggles)**.

- Exposes per-cell scores for manual inspection within AnnData/scanpy **(`*_score` in AnnData object)**.

- Detects likely doublets after scoring, using per‑node scores to flag candidates for review/removal **(find_doublets)**.
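
The marker-pruning idea can be illustrated with a small, standard-library-only sketch. It is not TreeTag's implementation: the function name `prune_markers`, the input layout, and the pseudocount `eps` are assumptions made for illustration. The only detail taken from this README is the rule itself, namely that a positive marker is dropped when its fold change against the average of the other siblings' averages falls below a threshold (1.5 by default).

```python
from statistics import mean

def prune_markers(mean_expr, target, markers, min_fc=1.5, eps=1e-9):
    """Keep markers whose fold change in `target` vs. its siblings is >= min_fc.

    mean_expr maps each sibling cell type to {gene: mean expression} and must
    contain at least one sibling besides `target`.
    `eps` is a hypothetical pseudocount that avoids division by zero.
    """
    others = [s for s in mean_expr if s != target]
    kept = []
    for gene in markers:
        in_target = mean_expr[target].get(gene, 0.0)
        # Average of the per-sibling averages, as described for min_pruning_fc.
        background = mean(mean_expr[s].get(gene, 0.0) for s in others)
        if (in_target + eps) / (background + eps) >= min_fc:
            kept.append(gene)
    return kept

# Illustrative numbers only; MALAT1 is expressed everywhere, so it cannot separate B cells.
mean_expr = {
    "B": {"MS4A1": 2.0, "CD79A": 1.8, "MALAT1": 3.0},
    "T_NK": {"MS4A1": 0.1, "CD79A": 0.2, "MALAT1": 3.1},
    "Myeloid": {"MS4A1": 0.1, "CD79A": 0.1, "MALAT1": 2.9},
}
kept = prune_markers(mean_expr, "B", ["MS4A1", "CD79A", "MALAT1"])
print(kept)
```

In this toy example the two lineage-specific genes survive and the ubiquitous gene is pruned, which is the behavior the feature list describes.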

---

## Installation
From PyPI (recommended)
```bash
pip install treetag
```
Upgrade
```bash
pip install --upgrade treetag
```
Verify installation
```bash
python -c "import treetag; print('TreeTag', treetag.__version__)"
```
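
If you prefer to check from inside Python without importing the package, the standard library's `importlib.metadata` can report the installed version. This is a generic sketch; the helper name `installed_version` is an illustration and not part of TreeTag.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Prints None when treetag is not installed in the current environment.
print("treetag:", installed_version("treetag"))
```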
---
## Quickstart
```python

# 1) Install scanpy and TreeTag from GitHub (notebook syntax; `pip install treetag` from PyPI also works)
!pip install -q scanpy
!pip install -q git+https://github.com/valleyofdawn/treetag.git
import treetag as tt, scanpy as sc
import matplotlib.pyplot as plt

# 2) Load example PBMC dataset (downloaded directly from CZI / cellxgene)
!wget -O PBMC_dataset.h5ad \
https://datasets.cellxgene.cziscience.com/fdf57c52-ad71-4004-9db2-a962e849b524.h5ad
adata = sc.read_h5ad("PBMC_dataset.h5ad")

# 3) (Recommended) Harmonize gene names
tt.convert(adata, prefer_var_cols=("feature_name",))

# 4) Neighbors (needed only if smoothing=True or majority_vote=True)
sc.pp.pca(adata)
sc.pp.neighbors(adata, use_rep="X_pca")

# 5) Copy example YAMLs to a local folder so you can edit them
print("Available example files:", tt.list_files())
tree_yaml, markers_yaml = tt.fetch_files(["PBMC_tree.yaml", "PBMC_markers.yaml"], dest=".")  # returns paths

# 6) Plot the ontology tree (structure only, subtree of root)
plt.rcParams["figure.figsize"] = (8, 8)
tt.plot_tree("PBMC_tree.yaml", root="root")

# 7) Run TreeTag (explicit YAML paths)
tt.TreeTag(
    adata,
    tree_yaml="PBMC_tree.yaml",
    markers_yaml="PBMC_markers.yaml",
    root="root",
    smoothing=True,
    majority_vote=True,
    save_scores=True,
)

# 8) Inspect results compared to ground truth
sc.pl.umap(adata, color=["TreeTag",'scType_celltype'], size=5, legend_loc='on data', legend_fontsize=10, legend_fontweight='regular')

# 9) Doublet detection
tt.find_doublets(adata, tree_yaml="PBMC_tree.yaml", markers_yaml="PBMC_markers.yaml", root="root")
sc.pl.umap(adata, color=["doublet_score", "cell#1", "cell#2"], size=5, legend_loc="lower left", legend_fontsize=10, legend_fontweight="regular")

# 10) Cell type markers (sign can also be "neg" or "both")
print("B-cell markers (pos):", tt.markers(cell_type="B", sign="pos", markers_yaml="PBMC_markers.yaml"))

# 11) Cell scores
score_cols = tt.subscores(root_cell="CD4", adata=adata, markers_yaml="PBMC_markers.yaml", tree_yaml="PBMC_tree.yaml", only_leaves=True)
sc.pl.umap(adata, color=score_cols, size=5, legend_loc="on data", legend_fontsize=10, legend_fontweight="regular")

```

## YAML File Formats

#### Ontology YAML

```yaml
root:
  T_NK:
    CD4_T:
      Treg:
      Th:
    CD8_T:
  B:
    Naive_B:
    Memory_B:
  Myeloid:
    Mono:
    DC:
    _mac:
      res_mac:
      mono_mac:
```

**`!` note:** Keys starting with "_" are **disabled**; the cell-type ("mac" in the example above) and its entire subtree ("res_mac" and "mono_mac") are skipped.
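
To make the "disabled subtree" rule concrete, here is a small sketch that applies it to the nested mapping a YAML loader would typically produce for the ontology above (leaves load as `None`). The function name `drop_disabled` is hypothetical, and TreeTag's own loader may represent the tree differently.

```python
def drop_disabled(tree):
    """Return a copy of a nested ontology mapping without "_"-prefixed keys.

    A disabled key is removed together with its entire subtree.
    Leaves (None or empty mappings) become empty dicts in the copy.
    """
    if not tree:
        return {}
    return {
        name: drop_disabled(children)
        for name, children in tree.items()
        if not name.startswith("_")
    }

# The Myeloid branch of the example ontology, as a loader would return it.
ontology = {
    "root": {
        "Myeloid": {
            "Mono": None,
            "DC": None,
            "_mac": {"res_mac": None, "mono_mac": None},
        }
    }
}
active = drop_disabled(ontology)
print(active)
```

Only `Mono` and `DC` remain under `Myeloid`; `_mac`, `res_mac`, and `mono_mac` are all gone.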

#### Markers YAML

```yaml
T_NK: [CD2, IL32, CD7, CD247, CD3E, LCK, IFITM1, GIMAP7, -MS4A1]
CD4_T: [CD4, TRAT1, ICOS, GPR183, CD40LG, IL6ST, -CD8A, -CD8B]
Treg: [FOXP3, RTKN2, IL2RA, IKZF2, CTLA4, TNFRSF18, TIGIT, -CD40LG]
```
**`!` note:** At least 2 positive markers are needed per cell type. Negative markers start with "-" and are not obligatory. Make sure not to put spaces after "-".
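
The two rules in the note can be checked before running an annotation. The sketch below splits one marker list into positive and negative genes and rejects the two mistakes the note warns about. The function name `split_markers` and the choice to raise `ValueError` are assumptions for illustration; TreeTag's own parser may report these problems differently.

```python
def split_markers(entries, min_positive=2):
    """Split a marker list into (positive, negative) gene lists.

    Raises ValueError for a "-" followed by a space (or nothing) and for
    fewer than `min_positive` positive markers.
    """
    pos, neg = [], []
    for entry in entries:
        gene = str(entry)
        if gene.startswith("-"):
            name = gene[1:]
            if not name or name != name.strip():
                raise ValueError(f"malformed negative marker {gene!r}: no space allowed after '-'")
            neg.append(name)
        else:
            pos.append(gene)
    if len(pos) < min_positive:
        raise ValueError(f"need at least {min_positive} positive markers, got {len(pos)}")
    return pos, neg

# The CD4_T entry from the example above.
cd4_entry = ["CD4", "TRAT1", "ICOS", "GPR183", "CD40LG", "IL6ST", "-CD8A", "-CD8B"]
pos, neg = split_markers(cd4_entry)
print(pos, neg)
```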

## Function reference

### `TreeTag`

**What it does:** Hierarchical cell‑type tagging using positive/negative markers.
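
As a rough mental model of top-down tagging, the sketch below walks a single cell from the root to a leaf, choosing the best-scoring child at every split. The scoring formula (mean positive-marker expression minus mean negative-marker expression), the function names, and the B-cell marker list are assumptions for illustration; the actual scoring in TreeTag additionally involves scaling, pruning, and optional smoothing.

```python
from statistics import mean

def node_score(expr, pos, neg):
    """Assumed score: mean positive-marker expression minus mean negative-marker expression."""
    positive = mean(expr.get(g, 0.0) for g in pos)
    negative = mean(expr.get(g, 0.0) for g in neg) if neg else 0.0
    return positive - negative

def tag_cell(expr, tree, markers, root="root"):
    """Descend from `root`, picking the highest-scoring child at each level."""
    node, children = root, tree[root]
    while children:
        scored = {c: node_score(expr, *markers[c]) for c in children if c in markers}
        if not scored:
            break
        node = max(scored, key=scored.get)
        children = children[node]
    return node

tree = {"root": {"T_NK": {"CD4_T": {}, "CD8_T": {}}, "B": {}}}
markers = {  # {cell type: (positive markers, negative markers)}; illustrative lists
    "T_NK": (["CD3E", "CD2"], ["MS4A1"]),
    "B": (["MS4A1", "CD79A"], ["CD3E"]),
    "CD4_T": (["CD4", "CD40LG"], ["CD8A"]),
    "CD8_T": (["CD8A", "CD8B"], ["CD4"]),
}
cell = {"CD3E": 2.0, "CD2": 1.5, "CD4": 1.2, "CD40LG": 0.8}
label = tag_cell(cell, tree, markers)
print(label)
```

The cell first wins the `T_NK` versus `B` split and then the `CD4_T` versus `CD8_T` split, so only markers relevant to the current split are ever compared.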

**Signature:**

```python
TreeTag(
    adata,                        # the AnnData object to analyze
    tree_yaml: str,               # YAML file describing the cell ontology
    markers_yaml: str,            # YAML file with the positive and negative markers for each cell type in tree_yaml
    root: str = 'root',           # start node in the ontology; e.g., if the ontology covers all PBMCs
                                  # but the dataset contains only T and NK cells, pass root="T_NK"
    min_marker_count: int = 2,    # minimum number of positive markers required for a cell type to be scored
    verbose: bool = False,        # print per-split diagnostics and pruning details
    smoothing: bool = True,       # KNN score smoothing using the neighbors graph in adata.obsp
    majority_vote: bool = True,   # one-pass label consensus using the same neighbors graph
    save_scores: bool = False,    # write <cell type>_score columns to adata.obs
    min_score: float = 0.0,       # label cells scoring below this as "unknown" (0 disables); can reveal cell types
                                  # missing from the ontology and keep irrelevant types from absorbing ambiguous cells
    min_pruning_fc: float = 1.5,  # prune a positive marker if its fold change vs. the average of the
                                  # other siblings' averages is smaller than this
)
```

**Writes:** `adata.obs["TreeTag"]`; if `save_scores=True`, also `<node>_score` columns.

**Requires (if enabled):** neighbors in `adata.obsp` for `smoothing`/`majority_vote`.

**Common errors (and fixes):**

* *No neighbor graph:* run `sc.pp.neighbors(adata, use_rep="X_pca")` **or** set `smoothing=False, majority_vote=False`.
* *No subtree markers found:* check gene naming (symbols vs Ensembl vs Entrez), root, and `.raw` usage.
* *Neighbor shape mismatch:* rebuild neighbors **after** any cell filtering.

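
The `smoothing` and `majority_vote` options both rely on a neighbor graph. The sketch below shows the general idea on plain Python lists: scores are averaged over each cell and its neighbors, and labels are replaced in a single pass by the most common label in the same neighborhood. The function names, the unweighted averaging, and the tie rule (a cell keeps its own label on ties) are assumptions; TreeTag reads the graph from `adata.obsp` and may weight neighbors differently.

```python
from collections import Counter

def smooth_scores(scores, neighbors):
    """Average each cell's score with the scores of its neighbors (unweighted)."""
    return [
        (scores[i] + sum(scores[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(scores))
    ]

def majority_vote(labels, neighbors):
    """One pass: each cell takes the most common label among itself and its neighbors."""
    voted = []
    for i, label in enumerate(labels):
        counts = Counter([label] + [labels[j] for j in neighbors[i]])
        top, n = counts.most_common(1)[0]
        # Assumed tie rule: keep the cell's own label if it is among the most common.
        voted.append(label if counts[label] == n else top)
    return voted

# neighbors[i] lists the indices of the cells adjacent to cell i.
neighbors = [[1, 2], [0, 2], [0, 1], [4], [3]]
smoothed = smooth_scores([1.0, 0.0, 1.0, 0.2, 0.4], neighbors)
voted = majority_vote(["B", "T", "B", "T", "T"], neighbors)
print(smoothed, voted)
```

Cell 1 starts with a zero score and a `T` label inside a `B` neighborhood; smoothing lifts its score and the vote relabels it, which is how sparsity-induced dropouts are meant to be absorbed.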
---


### `markers`

**What it does:** Returns marker genes for a node (optionally filtered to genes present in `adata`).

**Signature:**

```python
markers(
    cell_type: str,
    sign: str = "pos",            # "pos" or "neg" (the Quickstart also mentions "both")
    markers_yaml: str = "markers.yaml",
    tree_yaml: str = "ontology.yaml",
    adata=None,                    # optional filter to adata.var_names/raw.var_names
) -> list[str]
```
---

### `subscores`

**What it does:** Lists existing `<node>_score` columns under a root (useful after `TreeTag(save_scores=True)`). The Quickstart call also passes `only_leaves=True`, which the signature below does not list.

**Signature:**

```python
subscores(
    root_cell: str,
    adata,
    markers_yaml: str,
    tree_yaml: str,
) -> list[str]
```

---

### `find_doublets`

**What it does:** Flags likely doublets **after scoring** using per‑node score patterns (e.g., strong scores for incompatible lineages).

**Signature (minimal):**

```python
find_doublets(
    adata,
    threshold: float = 0.25,   # heuristic overlap metric; implementation‑specific
    write: bool = True,
    key: str = "doublet_like",
) -> "pd.Series[bool] | np.ndarray[bool]"
```

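**Writes (if `write=True`):** `adata.obs["doublet_like"]` boolean mask. The Quickstart call also passes `tree_yaml`, `markers_yaml`, and `root`, and plots the columns `doublet_score`, `cell#1`, and `cell#2`, so the installed version may accept more arguments and write more columns than this minimal signature shows; `help(tt.find_doublets)` reports the actual signature.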
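
One simple way to read "strong scores for incompatible lineages" is sketched below: a cell is flagged when its second-highest score among sibling lineages reaches the threshold. This interpretation, the function name `doublet_like`, and the input layout are assumptions; the README describes the real metric only as heuristic and implementation-specific.

```python
def doublet_like(lineage_scores, threshold=0.25):
    """Flag cells whose second-best sibling-lineage score is at least `threshold`.

    lineage_scores is a list with one {lineage: score} mapping per cell.
    """
    flags = []
    for scores in lineage_scores:
        ranked = sorted(scores.values(), reverse=True)
        flags.append(len(ranked) > 1 and ranked[1] >= threshold)
    return flags

cells = [
    {"T_NK": 0.9, "B": 0.05, "Myeloid": 0.0},  # clean T/NK profile
    {"T_NK": 0.7, "B": 0.6, "Myeloid": 0.1},   # strong in two lineages at once
]
flags = doublet_like(cells)
print(flags)
```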

---

### `plot_tree`

**What it does:** Renders the ontology tree (optionally overlaying counts/assignments).

**Signature (typical):**

```python
plot_tree(
    tree_yaml: str | None = None,
    markers_yaml: str | None = None,
    root: str | None = None,
    G=None,                      # alternatively pass a prebuilt graph
    adata=None,                  # optional: color by counts/labels
    ax=None,
    layout: str = "rt",         # e.g., top‑down
) -> "matplotlib.axes.Axes"
```
---
## Results Gallery

### A UMAP of PBMC cell types produced with TreeTag

![UMAP of PBMCs](docs/img/UMAP.png)

---

### Visualization of the cell ontology producing the above UMAP

![Ontology of PBMCs](docs/img/Tree.png)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "treetag",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "single-cell, bioinformatics, annotation, scanpy",
    "author": null,
    "author_email": "Guy Shakhar <guy.shakhar@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/64/d5/94c5bc87fa81813e0e44ec434ac4359c62b7ec819a3a52fcd312de9c43d2/treetag-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Hierarchical cell-type tagging with YAML ontologies and markers.",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/valleyofdawn/TreeTag",
        "Issues": "https://github.com/valleyofdawn/TreeTag/issues",
        "Repository": "https://github.com/valleyofdawn/TreeTag"
    },
    "split_keywords": [
        "single-cell",
        " bioinformatics",
        " annotation",
        " scanpy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "504f4aef0a2fbae59903bb97ec4454d1ada82a68f73e54e64f5239f18865f527",
                "md5": "4597c830c26e9d754223ae1c36ba8e6e",
                "sha256": "77778ce0b24c9264bc458367dc9f49209da2cb91864722b43dd947aa8ec28722"
            },
            "downloads": -1,
            "filename": "treetag-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4597c830c26e9d754223ae1c36ba8e6e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 25668,
            "upload_time": "2025-09-06T21:32:47",
            "upload_time_iso_8601": "2025-09-06T21:32:47.551568Z",
            "url": "https://files.pythonhosted.org/packages/50/4f/4aef0a2fbae59903bb97ec4454d1ada82a68f73e54e64f5239f18865f527/treetag-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "64d594c5bc87fa81813e0e44ec434ac4359c62b7ec819a3a52fcd312de9c43d2",
                "md5": "548e07abcf14417c42fb6cf12b9e47b8",
                "sha256": "30cd78bfbf2d229ce970e0ae542e88e11ee2b40079286aebe269e88192056ccd"
            },
            "downloads": -1,
            "filename": "treetag-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "548e07abcf14417c42fb6cf12b9e47b8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 29761,
            "upload_time": "2025-09-06T21:32:48",
            "upload_time_iso_8601": "2025-09-06T21:32:48.830213Z",
            "url": "https://files.pythonhosted.org/packages/64/d5/94c5bc87fa81813e0e44ec434ac4359c62b7ec819a3a52fcd312de9c43d2/treetag-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-06 21:32:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "valleyofdawn",
    "github_project": "TreeTag",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "treetag"
}
        