**fog** — PyPI package metadata

* **Version**: 0.11.9
* **Summary**: A fuzzy matching & clustering library for python.
* **Homepage**: http://github.com/Yomguithereal/fog
* **Author**: Guillaume Plique
* **License**: MIT
* **Requires Python**: >=3
* **Keywords**: fuzzy
* **Uploaded**: 2023-09-08 12:13:36
* **Requirements**: none recorded

[![Build Status](https://travis-ci.org/Yomguithereal/fog.svg)](https://travis-ci.org/Yomguithereal/fog)

# Fog

A fuzzy matching/clustering library for Python.

## Installation

You can install `fog` with pip using the following command:

```bash
pip install fog
```

## Usage

* [Evaluation](#evaluation)
  * [best_matching_macro_average](#best_matching_macro_average)
* [Graph](#graph)
  * [floatsam_sparsification](#floatsam_sparsification)
  * [monopartite_projection](#monopartite_projection)
* [Keyers](#keyers)
  * [omission_key](#omission_key)
  * [skeleton_key](#skeleton_key)
* [Metrics](#metrics)
  * [cosine_similarity](#cosine_similarity)
  * [sparse_cosine_similarity](#sparse_cosine_similarity)
  * [sparse_dot_product](#sparse_dot_product)
  * [binary_cosine_similarity](#binary_cosine_similarity)
  * [sparse_binary_cosine_similarity](#sparse_binary_cosine_similarity)
  * [dice_coefficient](#dice_coefficient)
  * [jaccard_similarity](#jaccard_similarity)
  * [weighted_jaccard_similarity](#weighted_jaccard_similarity)
  * [overlap_coefficient](#overlap_coefficient)

### Evaluation

#### best_matching_macro_average

Efficient implementation of the "macro average best matching F1" evaluation
metric for clusters.

Note that this metric is not symmetric and will match truth -> predicted.

*Arguments*
* **truth** *iterable*: the truth clusters.
* **predicted** *iterable*: the predicted clusters.
* **allow_additional_items** *?bool* [`False`]: Whether to allow additional items
that don't exist in truth clusters to be found in predicted ones. Those
additional items will then be ignored when computing the metrics instead
of raising an error when found.

### Graph

#### floatsam_sparsification

Function using an iterative algorithm to try and find the best weight
threshold to apply to trim the given graph's edges while keeping the
underlying community structures.

It works by iteratively increasing the threshold and stopping as soon as
a significant connected component starts to drift away from the principal
one.

This is basically a very naive gradient descent with a very naive cost
function but it works decently for typical cases.

*Arguments*
* **graph** *nx.Graph*: Graph to sparsify.
* **starting_treshold** *?float* [`0.0`]: Starting similarity threshold.
* **learning_rate** *?float* [`0.05`]: How much to increase the threshold
at each step of the algorithm.
* **max_drifter_size** *?int*: Maximum size a connected component may have
when detaching from the principal one before the algorithm stops. If not
provided, it defaults to the logarithm of the graph's total
number of nodes.
* **weight** *?str* [`weight`]: Name of the edge weight attribute (following networkx conventions).
* **remove_edges** *?bool* [`False`]: Whether to remove edges from the graph
having a weight less than found threshold or not. Note that if
`True`, this will mutate the given graph.

#### monopartite_projection

Function computing a monopartite projection of the given bipartite graph.
This projection can be basic, creating a weighted edge each time two nodes
in the target partition share a common neighbor, or it can be weighted and
filtered using a similarity metric such as Jaccard or cosine similarity.

*Arguments*
* **bipartite** *nx.Graph*: Target bipartite graph.
* **project** *str*: Name of the partition to project.
* **part** *?str* [`bipartite`]: Name of the node attribute on which the
graph partition is built e.g. "color" or "type" etc.
* **weight** *?str* [`weight`]: Name of the weight edge attribute.
* **metric** *?str* [`None`]: Metric to use. If `None`, the basic projection
will be returned. Also accepts `jaccard`, `overlap`, `dice`,
`cosine` or `binary_cosine`.
* **threshold** *?float* [`None`]: Optional similarity threshold under which
edges won't be added to the monopartite projection.
* **use_topology** *?bool*: Whether to use the bipartite graph's
topology to attempt a subquadratic time projection. Intuitively,
this works by not computing similarities of all pairs of nodes but
only of pairs of nodes that share at least a common neighbor.
It generally works better than the quadratic approach but can
sometimes hurt your performance by losing time on graph traversals
when your graph is very dense.
* **bipartition_check** *?bool*: Whether to check that the given graph
truly is bipartite before starting, since the function can get stuck
in an infinite loop otherwise. Disable this kwarg for better
performance if you know beforehand that your graph is bipartite.

### Keyers

#### omission_key

Function returning a string's omission key, which is constructed thusly:
1. First we record the string's set of consonants in an order
   where the most frequently misspelled consonants come last.
2. Then we record the string's set of vowels in order of
   first appearance.

This key is very useful when searching for misspelled strings because,
when sorted using this key, similar strings will be next to each other.

*Arguments*
* **string** *str*: The string to encode.

#### skeleton_key

Function returning a string's skeleton key which is constructed thusly:
1. The first letter of the string
2. Unique consonants in order of appearance
3. Unique vowels in order of appearance

This key is very useful when searching for misspelled strings because,
when sorted using this key, similar strings will be next to each other.

*Arguments*
* **string** *str*: The string to encode.

### Metrics

#### cosine_similarity

Function computing the cosine similarity of the given sequences.
Runs in O(n), n being the sum of A & B's sizes.

*Arguments*
* **A** *iterable*: First sequence.
* **B** *iterable*: Second sequence.

#### sparse_cosine_similarity

Function computing cosine similarity on sparse weighted sets represented
as python dicts.

Runs in O(n), n being the sum of A & B's sizes.

```python
from fog.metrics import sparse_cosine_similarity

# Basic
sparse_cosine_similarity({'apple': 34, 'pear': 3}, {'pear': 1, 'orange': 1})
>>> ~0.062
```

*Arguments*
* **A** *Counter*: First weighted set.
* **B** *Counter*: Second weighted set.

#### sparse_dot_product

Function computing the dot product of sparse weighted sets represented
as python dicts.

Runs in O(n), n being the size of the smallest set.

*Arguments*
* **A** *Counter*: First weighted set.
* **B** *Counter*: Second weighted set.

#### binary_cosine_similarity

Function computing the binary cosine similarity of the given sequences.
Runs in O(n), n being the size of the smallest set.

*Arguments*
* **A** *iterable*: First sequence.
* **B** *iterable*: Second sequence.

#### sparse_binary_cosine_similarity

Function computing binary cosine similarity on sparse vectors represented
as python sets.

Runs in O(n), n being the size of the smaller set.

*Arguments*
* **A** *set*: First set.
* **B** *set*: Second set.

#### dice_coefficient

Function computing the Dice coefficient. That is to say twice the size of
the intersection of both sets divided by the sum of both their sizes.

Runs in O(n), n being the size of the smallest set.

```python
from fog.metrics import dice_coefficient

# Basic
dice_coefficient('context', 'contact')
>>> ~0.727
```

*Arguments*
* **A** *iterable*: First sequence.
* **B** *iterable*: Second sequence.

#### jaccard_similarity

Function computing the Jaccard similarity. That is to say the intersection
of input sets divided by their union.

Runs in O(n), n being the size of the smallest set.

```python
from fog.metrics import jaccard_similarity

# Basic
jaccard_similarity('context', 'contact')
>>> ~0.571
```

*Arguments*
* **A** *iterable*: First sequence.
* **B** *iterable*: Second sequence.

#### weighted_jaccard_similarity

Function computing the weighted Jaccard similarity.
Runs in O(n), n being the sum of A & B's sizes.

```python
from fog.metrics import weighted_jaccard_similarity

# Basic
weighted_jaccard_similarity({'apple': 34, 'pear': 3}, {'pear': 1, 'orange': 1})
>>> ~0.026
```

*Arguments*
* **A** *Counter*: First weighted set.
* **B** *Counter*: Second weighted set.

#### overlap_coefficient

Function computing the overlap coefficient of the given sets, i.e. the size
of their intersection divided by the size of the smallest set.

Runs in O(n), n being the size of the smallest set.

*Arguments*
* **A** *iterable*: First sequence.
* **B** *iterable*: Second sequence.
