[![PyPI version](https://badge.fury.io/py/treesimi.svg)](https://badge.fury.io/py/treesimi)
[![PyPi downloads](https://img.shields.io/pypi/dm/treesimi)](https://img.shields.io/pypi/dm/treesimi)
[![DOI](https://zenodo.org/badge/318838452.svg)](https://zenodo.org/badge/latestdoi/318838452)
# treesimi: Shingling for measuring tree similarity
Compute similarity between trees, e.g. dependency trees
## Convert an Adjacency List into a Nested Set Table
For example, CoNLL-U's `['id', 'head']` fields form an adjacency list of a dependency tree.
Traversing an adjacency list is slower than reading a nested set.
Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.
```py
import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
```
### Demo: Query a nested set table
To extract a subtree we just need to iterate through the list ($O(n)$)
```py
_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]
```
or `ts.get_subtree(nested, sub_id=2)`
### Set node attributes
```py
import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
```
### Convert Adjacency List with attributes
```py
import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]
```
## Extract subtree patterns
We can extract the following patterns from one tree:
* Depth dimension
* Full subtrees
* Truncate leaves
* Sibling dimension
* All siblings
* Drop siblings (and their subtree)
* Placeholder attribute field
### Full subtrees
The function `extract_subtrees` returns all subtrees of a tree.
The depth information is adjusted accordingly for each subtree.
```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
# [[1, 2, 0, 'd']],
# [[1, 2, 0, 'c']]
# ]
```
### Truncate leaves
In the first step, the function `trunc_leaves` removes leaves of the largest depth level.
The result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.
In the next steps, the depth level is further removed down to `depth=1`.
```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]
```
Hint: Run `trunc_leaves` for each subtree extracted by `extract_subtrees`. Call `unique_trees` after each step.
### Drop sibling nodes
Generate variants of a tree by dropping each node once.
Again, the result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.
```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
# [[1, 8, 0, 'a']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
# [[1, 8, 0, 'a']]
# ]
```
Hints: Create subtrees with `extract_subtrees` and `trunc_leaves`, and run `drop_nodes` on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. `drop_nodes(drop_nodes(...))`.
### Placeholder attribute field
The `replace_attr` removes the data attribute of a node with a generic placeholder.
```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
# [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]
```
## Demo Notebooks about Shingling for MinHash
We recommend using the `mmh3` hash function, and 32 permutations in `datasketch.MinHash`.
- [Create subtrees as shingle sets](https://github.com/ulf1/treesimi/blob/master/demo/Create%20subtrees%20as%20shingle%20sets.ipynb)
- [Jaccard Similarity between Dependency Trees](https://github.com/ulf1/treesimi/blob/master/demo/Jaccard%20Similarity%20between%20Dependency%20Trees.ipynb)
- [Shingle Dependency Trees for datasketch's Minhash](https://github.com/ulf1/treesimi/blob/master/demo/Shingle%20Dependency%20Trees%20for%20datasketch's%20Minhash.ipynb)
Start jupyter to run the demo notebook
```sh
source .venv/bin/activate
jupyter lab
```
## Appendix
### Installation
The `treesimi` [git repo](http://github.com/ulf1/treesimi) is available as [PyPi package](https://pypi.org/project/treesimi)
```sh
pip install treesimi
pip install git+ssh://git@github.com/ulf1/treesimi.git
```
### Commands
Install a virtual environment
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
```
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)
### Python commands
* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`
* Run Unit Tests: `pytest`
Publish
```sh
python setup.py sdist
twine upload -r pypi dist/*
```
### Clean up
```sh
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
```
### Support
Please [open an issue](https://github.com/ulf1/treesimi/issues/new) for support.
### Contributing
Please contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/treesimi/compare/).
### Acknowledgements
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [433249742](https://gepris.dfg.de/gepris/projekt/433249742) (GU 798/27-1; GE 1119/11-1).
### Maintenance
- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project [433249742](https://gepris.dfg.de/gepris/projekt/433249742)
- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.
Raw data
{
"_id": null,
"home_page": "http://github.com/ulf1/treesimi",
"name": "treesimi",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Ulf Hamster",
"author_email": "554c46@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/58/3f/dbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f/treesimi-0.3.0.tar.gz",
"platform": null,
"description": "[![PyPI version](https://badge.fury.io/py/treesimi.svg)](https://badge.fury.io/py/treesimi)\n[![PyPi downloads](https://img.shields.io/pypi/dm/treesimi)](https://img.shields.io/pypi/dm/treesimi)\n[![DOI](https://zenodo.org/badge/318838452.svg)](https://zenodo.org/badge/latestdoi/318838452)\n\n\n# treesimi: Shingling for measuring tree similarity\nCompute similarity between trees, e.g. dependency trees\n\n\n## Convert an Adjacency List into a Nested Set Table\nFor example, CoNLL-U's `['id', 'head']` fields form an adjacency list of a dependency tree.\nTraversing an adjacency list is slower than reading a nested set.\nThus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.\n\n```py\nimport treesimi as ts\nadjac = [(1, 0), (2, 1), (3, 1), (4, 2)]\nnested = ts.adjac_to_nested(adjac)\n# columns: node id, left, right, depth\n# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]\n```\n\n### Demo: Query a nested set table\nTo extract a subtree we just need to iterate through the list ($O(n)$)\n\n```py\n_, lft0, rgt0, _ = nested[1]\nsubtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]\n# [[2, 2, 5, 1], [4, 3, 4, 2]]\n```\n\nor `ts.get_subtree(nested, sub_id=2)`\n\n### Set node attributes\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]\nattrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]\nnested = ts.set_attr(nested, attrs)\n# columns: node id, left, right, depth, attributes\n# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\n```\n\n### Convert Adjacency List with attributes\n\n```py\nimport treesimi as ts\nadjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]\nnested = ts.adjac_to_nested_with_attr(adjac)\n# columns: node id, left, right, depth\n# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]\n```\n\n\n## Extract subtree patterns\nWe can extract the following patterns from one tree:\n\n* Depth dimension\n * Full subtrees\n * Truncate leaves\n* Sibling dimension\n * All siblings\n * Drop siblings (and their subtree)\n* Placeholder attribute field\n\n\n### Full subtrees\nThe function `extract_subtrees` returns all subtrees of a tree.\nThe depth information is adjusted accordingly for each subtree.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.extract_subtrees(nested)\n# [\n# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],\n# [[1, 4, 0, 'b'], [2, 3, 1, 'd']],\n# [[1, 2, 0, 'd']],\n# [[1, 2, 0, 'c']]\n# ]\n```\n\n### Truncate leaves\nIn the first step, the function `trunc_leaves` removes leaves of the largest depth level.\nThe result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.\nIn the next steps, the depth level is further removed down to `depth=1`.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.trunc_leaves(nested)\n# [\n# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]\n# ]\n```\n\nHint: Run `trunc_leaves` for each subtree extracted by `extract_subtrees`. Call `unique_trees` after each step.\n\n\n### Drop sibling nodes\nGenerate variants of a tree by dropping each node once.\nAgain, the result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.drop_nodes(nested)\n# [\n# [[1, 8, 0, 'a']],\n# [[1, 8, 0, 'a'], [2, 5, 1, 'b']],\n# [[1, 8, 0, 'a']]\n# ]\n```\n\nHints: Create subtrees with `extract_subtrees` and `trunc_leaves`, and run `drop_nodes` on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. `drop_nodes(drop_nodes(...))`.\n\n\n### Placeholder attribute field\nThe `replace_attr` removes the data attribute of a node with a generic placeholder.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.replace_attr(nested, placeholder='[MASK]')\n# [\n# [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],\n# [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']], \n# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']], \n# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]\n# ]\n```\n\n## Demo Notebooks about Shingling for MinHash\nWe recommend using the `mmh3` hash function, and 32 permutations in `datasketch.MinHash`.\n\n- [Create subtrees as shingle sets](https://github.com/ulf1/treesimi/blob/master/demo/Create%20subtrees%20as%20shingle%20sets.ipynb)\n- [Jaccard Similarity between Dependency Trees](https://github.com/ulf1/treesimi/blob/master/demo/Jaccard%20Similarity%20between%20Dependency%20Trees.ipynb)\n- [Shingle Dependency Trees for datasketch's Minhash](https://github.com/ulf1/treesimi/blob/master/demo/Shingle%20Dependency%20Trees%20for%20datasketch's%20Minhash.ipynb)\n\nStart jupyter to run the demo notebook\n\n```sh\nsource .venv/bin/activate\njupyter lab\n```\n\n## Appendix\n\n### Installation\nThe `treesimi` [git repo](http://github.com/ulf1/treesimi) is available as [PyPi package](https://pypi.org/project/treesimi)\n\n```sh\npip install treesimi\npip install git+ssh://git@github.com/ulf1/treesimi.git\n```\n\n### Commands\nInstall a virtual environment\n\n```sh\npython3 -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip\npip install -r requirements.txt --no-cache-dir\npip install -r requirements-dev.txt --no-cache-dir\npip install -r requirements-demo.txt --no-cache-dir\n```\n\n(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)\n\n### Python commands\n\n* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`\n* Run Unit Tests: `pytest`\n\nPublish\n\n```sh\npython setup.py sdist \ntwine upload -r pypi dist/*\n```\n\n### Clean up \n\n```sh\nfind . -type f -name \"*.pyc\" | xargs rm\nfind . -type d -name \"__pycache__\" | xargs rm -r\nrm -r .pytest_cache\nrm -r .venv\n```\n\n\n### Support\nPlease [open an issue](https://github.com/ulf1/treesimi/issues/new) for support.\n\n\n### Contributing\nPlease contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/treesimi/compare/).\n\n\n### Acknowledgements\nThe \"Evidence\" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [433249742](https://gepris.dfg.de/gepris/projekt/433249742) (GU 798/27-1; GE 1119/11-1).\n\n### Maintenance\n- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project [433249742](https://gepris.dfg.de/gepris/projekt/433249742)\n- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.\n\n\n",
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "Compute similarity between netsted set based trees.",
"version": "0.3.0",
"project_urls": {
"Homepage": "http://github.com/ulf1/treesimi"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "583fdbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f",
"md5": "cd68a221fa0050c900be018a792e169e",
"sha256": "7f8c278308b6b74a4683e88fbd0d894a23de50ce1589642dd18589cadf81d2cd"
},
"downloads": -1,
"filename": "treesimi-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "cd68a221fa0050c900be018a792e169e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 16031,
"upload_time": "2023-07-10T15:28:45",
"upload_time_iso_8601": "2023-07-10T15:28:45.967385Z",
"url": "https://files.pythonhosted.org/packages/58/3f/dbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f/treesimi-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-10 15:28:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ulf1",
"github_project": "treesimi",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "treesimi"
}