treesimi


Nametreesimi JSON
Version 0.3.0 PyPI version JSON
download
home_pagehttp://github.com/ulf1/treesimi
SummaryCompute similarity between netsted set based trees.
upload_time2023-07-10 15:28:45
maintainer
docs_urlNone
authorUlf Hamster
requires_python>=3.6
licenseApache License 2.0
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            [![PyPI version](https://badge.fury.io/py/treesimi.svg)](https://badge.fury.io/py/treesimi)
[![PyPi downloads](https://img.shields.io/pypi/dm/treesimi)](https://img.shields.io/pypi/dm/treesimi)
[![DOI](https://zenodo.org/badge/318838452.svg)](https://zenodo.org/badge/latestdoi/318838452)


# treesimi: Shingling for measuring tree similarity
Compute similarity between trees, e.g. dependency trees


## Convert an Adjacency List into a Nested Set Table
For example, CoNLL-U's `['id', 'head']` fields form an adjacency list of a dependency tree.
Traversing an adjacency list is slower than reading a nested set.
Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.

```py
import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
```

### Demo: Query a nested set table
To extract a subtree we just need to iterate through the list ($O(n)$)

```py
_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]
```

or `ts.get_subtree(nested, sub_id=2)`

### Set node attributes

```py
import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
```

### Convert Adjacency List with attributes

```py
import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]
```


## Extract subtree patterns
We can extract the following patterns from one tree:

* Depth dimension
    * Full subtrees
    * Truncate leaves
* Sibling dimension
    * All siblings
    * Drop siblings (and their subtree)
* Placeholder attribute field


### Full subtrees
The function `extract_subtrees` returns all subtrees of a tree.
The depth information is adjusted accordingly for each subtree.

```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
#    [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
#    [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
#    [[1, 2, 0, 'd']],
#    [[1, 2, 0, 'c']]
# ]
```

### Truncate leaves
In the first step, the function `trunc_leaves` removes leaves of the largest depth level.
The result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.
In the next steps, the depth level is further removed down to `depth=1`.

```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]
```

Hint: Run `trunc_leaves` for each subtree extracted by `extract_subtrees`. Call `unique_trees` after each step.


### Drop sibling nodes
Generate variants of a tree by dropping each node once.
Again, the result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.

```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
#   [[1, 8, 0, 'a']],
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
#   [[1, 8, 0, 'a']]
# ]
```

Hints: Create subtrees with `extract_subtrees` and `trunc_leaves`, and run `drop_nodes` on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. `drop_nodes(drop_nodes(...))`.


### Placeholder attribute field
The `replace_attr` removes the data attribute of a node with a generic placeholder.

```py
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
#   [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
#   [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']], 
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']], 
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]
```

## Demo Notebooks about Shingling for MinHash
We recommend using the `mmh3` hash function, and 32 permutations in `datasketch.MinHash`.

- [Create subtrees as shingle sets](https://github.com/ulf1/treesimi/blob/master/demo/Create%20subtrees%20as%20shingle%20sets.ipynb)
- [Jaccard Similarity between Dependency Trees](https://github.com/ulf1/treesimi/blob/master/demo/Jaccard%20Similarity%20between%20Dependency%20Trees.ipynb)
- [Shingle Dependency Trees for datasketch's Minhash](https://github.com/ulf1/treesimi/blob/master/demo/Shingle%20Dependency%20Trees%20for%20datasketch's%20Minhash.ipynb)

Start jupyter to run the demo notebook

```sh
source .venv/bin/activate
jupyter lab
```

## Appendix

### Installation
The `treesimi` [git repo](http://github.com/ulf1/treesimi) is available as [PyPi package](https://pypi.org/project/treesimi)

```sh
pip install treesimi
pip install git+ssh://git@github.com/ulf1/treesimi.git
```

### Commands
Install a virtual environment

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
```

(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)

### Python commands

* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`
* Run Unit Tests: `pytest`

Publish

```sh
python setup.py sdist 
twine upload -r pypi dist/*
```

### Clean up 

```sh
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
```


### Support
Please [open an issue](https://github.com/ulf1/treesimi/issues/new) for support.


### Contributing
Please contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/treesimi/compare/).


### Acknowledgements
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [433249742](https://gepris.dfg.de/gepris/projekt/433249742) (GU 798/27-1; GE 1119/11-1).

### Maintenance
- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project [433249742](https://gepris.dfg.de/gepris/projekt/433249742)
- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.



            

Raw data

            {
    "_id": null,
    "home_page": "http://github.com/ulf1/treesimi",
    "name": "treesimi",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Ulf Hamster",
    "author_email": "554c46@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/58/3f/dbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f/treesimi-0.3.0.tar.gz",
    "platform": null,
    "description": "[![PyPI version](https://badge.fury.io/py/treesimi.svg)](https://badge.fury.io/py/treesimi)\n[![PyPi downloads](https://img.shields.io/pypi/dm/treesimi)](https://img.shields.io/pypi/dm/treesimi)\n[![DOI](https://zenodo.org/badge/318838452.svg)](https://zenodo.org/badge/latestdoi/318838452)\n\n\n# treesimi: Shingling for measuring tree similarity\nCompute similarity between trees, e.g. dependency trees\n\n\n## Convert an Adjacency List into a Nested Set Table\nFor example, CoNLL-U's `['id', 'head']` fields form an adjacency list of a dependency tree.\nTraversing an adjacency list is slower than reading a nested set.\nThus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.\n\n```py\nimport treesimi as ts\nadjac = [(1, 0), (2, 1), (3, 1), (4, 2)]\nnested = ts.adjac_to_nested(adjac)\n# columns: node id, left, right, depth\n# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]\n```\n\n### Demo: Query a nested set table\nTo extract a subtree we just need to iterate through the list ($O(n)$)\n\n```py\n_, lft0, rgt0, _ = nested[1]\nsubtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]\n# [[2, 2, 5, 1], [4, 3, 4, 2]]\n```\n\nor `ts.get_subtree(nested, sub_id=2)`\n\n### Set node attributes\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]\nattrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]\nnested = ts.set_attr(nested, attrs)\n# columns: node id, left, right, depth, attributes\n# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\n```\n\n### Convert Adjacency List with attributes\n\n```py\nimport treesimi as ts\nadjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]\nnested = ts.adjac_to_nested_with_attr(adjac)\n# columns: node id, left, right, depth\n# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]\n```\n\n\n## Extract subtree patterns\nWe can extract the following patterns from one tree:\n\n* Depth dimension\n    * Full subtrees\n    * Truncate leaves\n* Sibling dimension\n    * All siblings\n    * Drop siblings (and their subtree)\n* Placeholder attribute field\n\n\n### Full subtrees\nThe function `extract_subtrees` returns all subtrees of a tree.\nThe depth information is adjusted accordingly for each subtree.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.extract_subtrees(nested)\n# [\n#    [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],\n#    [[1, 4, 0, 'b'], [2, 3, 1, 'd']],\n#    [[1, 2, 0, 'd']],\n#    [[1, 2, 0, 'c']]\n# ]\n```\n\n### Truncate leaves\nIn the first step, the function `trunc_leaves` removes leaves of the largest depth level.\nThe result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.\nIn the next steps, the depth level is further removed down to `depth=1`.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.trunc_leaves(nested)\n# [\n#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]\n# ]\n```\n\nHint: Run `trunc_leaves` for each subtree extracted by `extract_subtrees`. Call `unique_trees` after each step.\n\n\n### Drop sibling nodes\nGenerate variants of a tree by dropping each node once.\nAgain, the result is always an incomplete tree, and the `lft` and `rgt` values are **not adjusted** to indicate that **there is a missing node**.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.drop_nodes(nested)\n# [\n#   [[1, 8, 0, 'a']],\n#   [[1, 8, 0, 'a'], [2, 5, 1, 'b']],\n#   [[1, 8, 0, 'a']]\n# ]\n```\n\nHints: Create subtrees with `extract_subtrees` and `trunc_leaves`, and run `drop_nodes` on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. `drop_nodes(drop_nodes(...))`.\n\n\n### Placeholder attribute field\nThe `replace_attr` removes the data attribute of a node with a generic placeholder.\n\n```py\nimport treesimi as ts\nnested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]\nnested = ts.remove_node_ids(nested)\nsubtrees = ts.replace_attr(nested, placeholder='[MASK]')\n# [\n#   [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],\n#   [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']], \n#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']], \n#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]\n# ]\n```\n\n## Demo Notebooks about Shingling for MinHash\nWe recommend using the `mmh3` hash function, and 32 permutations in `datasketch.MinHash`.\n\n- [Create subtrees as shingle sets](https://github.com/ulf1/treesimi/blob/master/demo/Create%20subtrees%20as%20shingle%20sets.ipynb)\n- [Jaccard Similarity between Dependency Trees](https://github.com/ulf1/treesimi/blob/master/demo/Jaccard%20Similarity%20between%20Dependency%20Trees.ipynb)\n- [Shingle Dependency Trees for datasketch's Minhash](https://github.com/ulf1/treesimi/blob/master/demo/Shingle%20Dependency%20Trees%20for%20datasketch's%20Minhash.ipynb)\n\nStart jupyter to run the demo notebook\n\n```sh\nsource .venv/bin/activate\njupyter lab\n```\n\n## Appendix\n\n### Installation\nThe `treesimi` [git repo](http://github.com/ulf1/treesimi) is available as [PyPi package](https://pypi.org/project/treesimi)\n\n```sh\npip install treesimi\npip install git+ssh://git@github.com/ulf1/treesimi.git\n```\n\n### Commands\nInstall a virtual environment\n\n```sh\npython3 -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip\npip install -r requirements.txt --no-cache-dir\npip install -r requirements-dev.txt --no-cache-dir\npip install -r requirements-demo.txt --no-cache-dir\n```\n\n(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)\n\n### Python commands\n\n* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`\n* Run Unit Tests: `pytest`\n\nPublish\n\n```sh\npython setup.py sdist \ntwine upload -r pypi dist/*\n```\n\n### Clean up \n\n```sh\nfind . -type f -name \"*.pyc\" | xargs rm\nfind . -type d -name \"__pycache__\" | xargs rm -r\nrm -r .pytest_cache\nrm -r .venv\n```\n\n\n### Support\nPlease [open an issue](https://github.com/ulf1/treesimi/issues/new) for support.\n\n\n### Contributing\nPlease contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/treesimi/compare/).\n\n\n### Acknowledgements\nThe \"Evidence\" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [433249742](https://gepris.dfg.de/gepris/projekt/433249742) (GU 798/27-1; GE 1119/11-1).\n\n### Maintenance\n- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project [433249742](https://gepris.dfg.de/gepris/projekt/433249742)\n- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.\n\n\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Compute similarity between netsted set based trees.",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "http://github.com/ulf1/treesimi"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "583fdbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f",
                "md5": "cd68a221fa0050c900be018a792e169e",
                "sha256": "7f8c278308b6b74a4683e88fbd0d894a23de50ce1589642dd18589cadf81d2cd"
            },
            "downloads": -1,
            "filename": "treesimi-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "cd68a221fa0050c900be018a792e169e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 16031,
            "upload_time": "2023-07-10T15:28:45",
            "upload_time_iso_8601": "2023-07-10T15:28:45.967385Z",
            "url": "https://files.pythonhosted.org/packages/58/3f/dbf6357eafcaf7c20419e3851e115dc4786645cd770da9565569afe6102f/treesimi-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-10 15:28:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ulf1",
    "github_project": "treesimi",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "treesimi"
}
        
Elapsed time: 0.09563s