gaspra


Namegaspra JSON
Version 0.1.0a3 PyPI version JSON
download
home_pagehttps://github.com/mawicks/gaspra
SummaryA fast Python tool for searching, diffing, and merging text
upload_time2023-12-29 03:01:07
maintainer
docs_urlNone
authorMark Wicks
requires_python>=3.10,<4.0
licenseBSD-3-Clause
keywords diff lcs merge search text-processing
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # What is GASPRA

GASPRA is the Generalized Automaton-based String Processing for Rapid Alignment.
GASPRA implements an efficient algorithm to solve the Longest Common
Substring (LCS) problem.  Specifically it uses a suffix automaton, which
is a finite state machine and is extremely efficient at string searching
and matching problems.  It uses the suffix automaton as a base on which
to build diffing and merging tools.  Because it's efficient, it's
able to apply correct merge files viewing them as sequences of characters,
where most traditional merge programs view files as sequences of lines.
Documents with a lot of multi-line text (e.g., markdown files, and LaTex
files, and lengthy comments in source code) confuse line-oriented
merge algorithms, and you end up with a lot of
"conflicts" that don't really exist.  

The line-oriented approach is an order of magnitude faster, and GASPRA
allows you to choose line-oriented or character-oriented merges.  Both 
will use the suffix automaton approach.  It's a matter of whether you
view a text file as a sequence of characters or a sequence of lines.
It's still a sequence of tokens and the suffix automaton approach works
on both cases.

Here's an example of running `gaspra-merge` on three files were the
first had no line breaks, the second was edited (without adding line
breaks), and the third was reflowed.  These files were merged with no
conflicts.  Line-oriented tools show the entire paragraph as a conflict.
The result shown is the output of the command:

```
gaspra-merge -sdc data/tale0 data/tale1-reflowed data/tale2-edited
```

![Merged a reflowed and an edited version of text](./images/text-merge.png)

See [Merging](#merging) for another example.


# What's included

* `gaspra-diff` - A tool for diffing two text files in either a
  line-oriented or a character-oriented fashion.
* `gaspra-merge` - A tool for doing a 3-way merge in either a
  line-oriented or a character-oriented fashion.

GASPRA has been tested on [Bill Ritcher's pathological merge test
casea](https://www.guiffy.com/SureMergeWP.html) and performs quite
well

 
# Why GASPRA exists

Recently, I became interested in string matching. In particular, I was
interested in the LCS problem and related problems. I
discovered that efficient string matching is hard to find, especially in
Python. Much of what you find on the internet is either slow, or in some
cases, simply doesn't work.  I wanted a pure Python solution that is
reasonably fast to experiment with some string matching algorithms.


Though implementations appear uncommon, efficient algorithms exist.
The LCS problem is solvable in linear time, yet most online examples
use a classical dynamic programming approach which is quadratic in
both space and time. Even the venerable `difflib`, which is part of
Python's standard library uses quadratic-time algorithms.
GASPRA uses efficient suffix automata
with linear time and space complexity. The difference is dramatic.
It's the same as the difference between a quick-sort and a bubble sort.
Because of this difference, nobody uses a bubble sort, which is
universally recognized as a naive approach.  Yet somehow, using
dynamic programming to solve text matching is widely accepted.
`Difflib` is practically unusable on large, document-length
strings. The table below shows the time to find the LCS in two
strings for `difflib` compared to this package. The strings are
equal-length, random sequences of the letters a, b, and c.
The lengths shown in the table are the combined length of
the two strings.
The quadratic complexity of 'difflib' is obvious.  Even though it's a pure
Python solution, GASPRA is quite fast by comparison.

| Length   |   Match Length |   Difflib (ms) |      GASPRA (ms) |
|----------|----------------|----------------|------------------|
| 2k       |             13 |             31 |                3 |
| 4k       |             13 |            120 |                5 |
| 8k       |             15 |            472 |               11 |
| 16k      |             15 |          1,897 |               32 |
| 32k      |             16 |          7,828 |               52 |
| 64k      |             18 |         33,314 |              121 |
| 128k     |             19 |        135,413 |              341 |
| 256k     |             23 |        554,665 |              690 |

# Installation

You can install GASPRA using `pip` (or similar compatible tools):

```
pip install gaspra
```
# Examples.

## Finding Longest Common Substrings

The function `gaspra.find_lcs()` returns the starting positions and the
length of common substring of two strings with maximal length:

```
>>> import gaspra                                                                                               
>>> a="Call me Ishmael. Some years ago—never mind how long \
precisely—having little or no money in my purse, and nothing \
particular to interest me on shore, I thought I would sail \
about a little and see the watery part of the world"                                                                                                                  
>>> b="It was the best of times, it was the worst of times, \
it was the age of wisdom, it was the age of foolishness, \
it was the epoch of belief, it was the epoch of incredulity, \
it was the season of Light, it was the season of Darkness, \
it was the spring of hope, it was the winter of despair, we \
had everything before us, we had nothing before us, we were \
all going direct to Heaven, we were all going direct the other \
way – in short, the period was so far like the present period, \
that some of its noisiest authorities insisted on its being \
received, for good or for evil, in the superlative degree of \
comparison only."
>>> gaspra.find_lcs(a,b)
(103, 321, 10)
>>> a[103:103+10]
'd nothing '
>>> b[321:321+10]
'd nothing '
```


There's also `find_lcs_multiple()` for n-way common strings:

```
>>> a="The quick brown fox jumps over the lazy dog near the riverbank."
>>> b="The quick brown fox leaps over the lazy dogs near the river."
>>> c="The quick, clever fox jumps across the lazy dogs by the riverbank."
>>> gaspra.find_lcs_multiple(a, b, c)
((30, 30, 34), 13)
>>> a[30:30+13]
' the lazy dog'
```

## Finding Diffs

Given an original string and a modified string, the function
`gaspra.diff()` returns the sequence of changes represented
interspersed with fragments from the original string.  The sequence is
a sequence of strings (fragments from the original) and tuples of two strings
representing an insertion/deletion pair.  Note that an insertion is a tuple
where the deletion string is empty, and vice versa.

```
>>> import gaspra
>>> original="The quick brown fox jumps over the lazy dog near the riverbank."
>>> modified="The quick brown fox leaps over the lazy dogs near the river"
>>> list(gaspra.diff(original, modified))
['The quick brown fox ', ('lea', 'jum'), 'ps over the lazy dog', ('s', ''), ' near the river', ('', 'bank.')]
```

# Merging

Here's an example of two different revisions of the same sentence that can be
resolved without any conflicts.

``` 
>>> original = "The quick brown fox jumps over the lazy dog near the riverbank."
>>> editor1 = "The quick brown fox leaps over the lazy dogs near the river."
>>> editor2 = "The quick, clever fox jumps across the lazy dogs by the riverbank."
>>> list(gaspra.diff(original, editor1))
['The quick brown fox ', ('lea', 'jum'), 'ps over the lazy dog', ('s', ''), ' near the river', ('', 'bank'), '.']
>>> list(gaspra.diff(original, editor2))
['The quick', (',', ''), ' ', ('cleve', 'b'), 'r', ('', 'own'), ' fox jumps ', ('ac', 'ove'), 'r', ('oss', ''), ' the lazy dog', ('s', ''), ' ', ('by', 'near'), ' the riverbank.']
>>> list(gaspra.merge(original, editor1, editor2))
['The quick, clever fox leaps across the lazy dogs by the river.']
```

Here's the same example but with a different revision by editor2 that *does*
conflict with editor1.  The conflicts are represented as a tuple with two
alternate versions of the merged text.

```
>>> conflicts_with_1 = "The swift, agile fox leaps over the sleepy dog near the riverside."
>>> list(gaspra.diff(original, conflicts_with_1))
['The ', ('s', 'quick bro'), 'w', ('ift, agile', 'n'), ' fox ', ('lea', 'jum'), 'ps over the ', ('s', ''), 'l', ('eep', 'az'), 'y dog near the river', ('side', 'bank'), '.']
>>> list(gaspra.merge(original, editor1, conflicts_with_1))
['The swift, agile fox leaps over the sleepy dogs near the river', ('', 'side'), '.']
```



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mawicks/gaspra",
    "name": "gaspra",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<4.0",
    "maintainer_email": "",
    "keywords": "diff,LCS,merge,search,text-processing",
    "author": "Mark Wicks",
    "author_email": "mawicks@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/82/7c/4fbf87925a304e8747ce4fae11bdaf51d2258a01ef58c55f6ca33218f226/gaspra-0.1.0a3.tar.gz",
    "platform": null,
    "description": "# What is GASPRA\n\nGASPRA is the Generalized Automaton-based String Processing for Rapid Alignment.\nGASPRA implements an efficient algorithm to solve the Longest Common\nSubstring (LCS) problem.  Specifically it uses a suffix automaton, which\nis a finite state machine and is extremely efficient at string searching\nand matching problems.  It uses the suffix automaton as a base on which\nto build diffing and merging tools.  Because it's efficient, it's\nable to apply correct merge files viewing them as sequences of characters,\nwhere most traditional merge programs view files as sequences of lines.\nDocuments with a lot of multi-line text (e.g., markdown files, and LaTex\nfiles, and lengthy comments in source code) confuse line-oriented\nmerge algorithms, and you end up with a lot of\n\"conflicts\" that don't really exist.  \n\nThe line-oriented approach is an order of magnitude faster, and GASPRA\nallows you to choose line-oriented or character-oriented merges.  Both \nwill use the suffix automaton approach.  It's a matter of whether you\nview a text file as a sequence of characters or a sequence of lines.\nIt's still a sequence of tokens and the suffix automaton approach works\non both cases.\n\nHere's an example of running `gaspra-merge` on three files were the\nfirst had no line breaks, the second was edited (without adding line\nbreaks), and the third was reflowed.  These files were merged with no\nconflicts.  Line-oriented tools show the entire paragraph as a conflict.\nThe result shown is the output of the command:\n\n```\ngaspra-merge -sdc data/tale0 data/tale1-reflowed data/tale2-edited\n```\n\n![Merged a reflowed and an edited version of text](./images/text-merge.png)\n\nSee [Merging](#merging) for another example.\n\n\n# What's included\n\n* `gaspra-diff` - A tool for diffing two text files in either a\n  line-oriented or a character-oriented fashion.\n* `gaspra-merge` - A tool for doing a 3-way merge in either a\n  line-oriented or a character-oriented fashion.\n\nGASPRA has been tested on [Bill Ritcher's pathological merge test\ncasea](https://www.guiffy.com/SureMergeWP.html) and performs quite\nwell\n\n \n# Why GASPRA exists\n\nRecently, I became interested in string matching. In particular, I was\ninterested in the LCS problem and related problems. I\ndiscovered that efficient string matching is hard to find, especially in\nPython. Much of what you find on the internet is either slow, or in some\ncases, simply doesn't work.  I wanted a pure Python solution that is\nreasonably fast to experiment with some string matching algorithms.\n\n\nThough implementations appear uncommon, efficient algorithms exist.\nThe LCS problem is solvable in linear time, yet most online examples\nuse a classical dynamic programming approach which is quadratic in\nboth space and time. Even the venerable `difflib`, which is part of\nPython's standard library uses quadratic-time algorithms.\nGASPRA uses efficient suffix automata\nwith linear time and space complexity. The difference is dramatic.\nIt's the same as the difference between a quick-sort and a bubble sort.\nBecause of this difference, nobody uses a bubble sort, which is\nuniversally recognized as a naive approach.  Yet somehow, using\ndynamic programming to solve text matching is widely accepted.\n`Difflib` is practically unusable on large, document-length\nstrings. The table below shows the time to find the LCS in two\nstrings for `difflib` compared to this package. The strings are\nequal-length, random sequences of the letters a, b, and c.\nThe lengths shown in the table are the combined length of\nthe two strings.\nThe quadratic complexity of 'difflib' is obvious.  Even though it's a pure\nPython solution, GASPRA is quite fast by comparison.\n\n| Length   |   Match Length |   Difflib (ms) |      GASPRA (ms) |\n|----------|----------------|----------------|------------------|\n| 2k       |             13 |             31 |                3 |\n| 4k       |             13 |            120 |                5 |\n| 8k       |             15 |            472 |               11 |\n| 16k      |             15 |          1,897 |               32 |\n| 32k      |             16 |          7,828 |               52 |\n| 64k      |             18 |         33,314 |              121 |\n| 128k     |             19 |        135,413 |              341 |\n| 256k     |             23 |        554,665 |              690 |\n\n# Installation\n\nYou can install GASPRA using `pip` (or similar compatible tools):\n\n```\npip install gaspra\n```\n# Examples.\n\n## Finding Longest Common Substrings\n\nThe function `gaspra.find_lcs()` returns the starting positions and the\nlength of common substring of two strings with maximal length:\n\n```\n>>> import gaspra                                                                                               \n>>> a=\"Call me Ishmael. Some years ago\u2014never mind how long \\\nprecisely\u2014having little or no money in my purse, and nothing \\\nparticular to interest me on shore, I thought I would sail \\\nabout a little and see the watery part of the world\"                                                                                                                  \n>>> b=\"It was the best of times, it was the worst of times, \\\nit was the age of wisdom, it was the age of foolishness, \\\nit was the epoch of belief, it was the epoch of incredulity, \\\nit was the season of Light, it was the season of Darkness, \\\nit was the spring of hope, it was the winter of despair, we \\\nhad everything before us, we had nothing before us, we were \\\nall going direct to Heaven, we were all going direct the other \\\nway \u2013 in short, the period was so far like the present period, \\\nthat some of its noisiest authorities insisted on its being \\\nreceived, for good or for evil, in the superlative degree of \\\ncomparison only.\"\n>>> gaspra.find_lcs(a,b)\n(103, 321, 10)\n>>> a[103:103+10]\n'd nothing '\n>>> b[321:321+10]\n'd nothing '\n```\n\n\nThere's also `find_lcs_multiple()` for n-way common strings:\n\n```\n>>> a=\"The quick brown fox jumps over the lazy dog near the riverbank.\"\n>>> b=\"The quick brown fox leaps over the lazy dogs near the river.\"\n>>> c=\"The quick, clever fox jumps across the lazy dogs by the riverbank.\"\n>>> gaspra.find_lcs_multiple(a, b, c)\n((30, 30, 34), 13)\n>>> a[30:30+13]\n' the lazy dog'\n```\n\n## Finding Diffs\n\nGiven an original string and a modified string, the function\n`gaspra.diff()` returns the sequence of changes represented\ninterspersed with fragments from the original string.  The sequence is\na sequence of strings (fragments from the original) and tuples of two strings\nrepresenting an insertion/deletion pair.  Note that an insertion is a tuple\nwhere the deletion string is empty, and vice versa.\n\n```\n>>> import gaspra\n>>> original=\"The quick brown fox jumps over the lazy dog near the riverbank.\"\n>>> modified=\"The quick brown fox leaps over the lazy dogs near the river\"\n>>> list(gaspra.diff(original, modified))\n['The quick brown fox ', ('lea', 'jum'), 'ps over the lazy dog', ('s', ''), ' near the river', ('', 'bank.')]\n```\n\n# Merging\n\nHere's an example of two different revisions of the same sentence that can be\nresolved without any conflicts.\n\n``` \n>>> original = \"The quick brown fox jumps over the lazy dog near the riverbank.\"\n>>> editor1 = \"The quick brown fox leaps over the lazy dogs near the river.\"\n>>> editor2 = \"The quick, clever fox jumps across the lazy dogs by the riverbank.\"\n>>> list(gaspra.diff(original, editor1))\n['The quick brown fox ', ('lea', 'jum'), 'ps over the lazy dog', ('s', ''), ' near the river', ('', 'bank'), '.']\n>>> list(gaspra.diff(original, editor2))\n['The quick', (',', ''), ' ', ('cleve', 'b'), 'r', ('', 'own'), ' fox jumps ', ('ac', 'ove'), 'r', ('oss', ''), ' the lazy dog', ('s', ''), ' ', ('by', 'near'), ' the riverbank.']\n>>> list(gaspra.merge(original, editor1, editor2))\n['The quick, clever fox leaps across the lazy dogs by the river.']\n```\n\nHere's the same example but with a different revision by editor2 that *does*\nconflict with editor1.  The conflicts are represented as a tuple with two\nalternate versions of the merged text.\n\n```\n>>> conflicts_with_1 = \"The swift, agile fox leaps over the sleepy dog near the riverside.\"\n>>> list(gaspra.diff(original, conflicts_with_1))\n['The ', ('s', 'quick bro'), 'w', ('ift, agile', 'n'), ' fox ', ('lea', 'jum'), 'ps over the ', ('s', ''), 'l', ('eep', 'az'), 'y dog near the river', ('side', 'bank'), '.']\n>>> list(gaspra.merge(original, editor1, conflicts_with_1))\n['The swift, agile fox leaps over the sleepy dogs near the river', ('', 'side'), '.']\n```\n\n\n",
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "A fast Python tool for searching, diffing, and merging text",
    "version": "0.1.0a3",
    "project_urls": {
        "Homepage": "https://github.com/mawicks/gaspra",
        "Repository": "https://github.com/mawicks/gaspra"
    },
    "split_keywords": [
        "diff",
        "lcs",
        "merge",
        "search",
        "text-processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1b26fbdab0f1a6027a88fc78ee8328f00b43089ff29050443e0a018efc92ae59",
                "md5": "e3c60fbd4d4e90a252f26431b47335e0",
                "sha256": "3916f8fa71150f0d68d3fa39ec69de26f1b1d597a25d64309d4518b050c04dd1"
            },
            "downloads": -1,
            "filename": "gaspra-0.1.0a3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e3c60fbd4d4e90a252f26431b47335e0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<4.0",
            "size": 29148,
            "upload_time": "2023-12-29T03:01:05",
            "upload_time_iso_8601": "2023-12-29T03:01:05.871847Z",
            "url": "https://files.pythonhosted.org/packages/1b/26/fbdab0f1a6027a88fc78ee8328f00b43089ff29050443e0a018efc92ae59/gaspra-0.1.0a3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "827c4fbf87925a304e8747ce4fae11bdaf51d2258a01ef58c55f6ca33218f226",
                "md5": "529c9cc1075d85effae0865f5fe0b3c4",
                "sha256": "8266def47578bdbb123778595f310c479f1b60ccd433cd8e46acfe710fe8b81d"
            },
            "downloads": -1,
            "filename": "gaspra-0.1.0a3.tar.gz",
            "has_sig": false,
            "md5_digest": "529c9cc1075d85effae0865f5fe0b3c4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<4.0",
            "size": 23902,
            "upload_time": "2023-12-29T03:01:07",
            "upload_time_iso_8601": "2023-12-29T03:01:07.729341Z",
            "url": "https://files.pythonhosted.org/packages/82/7c/4fbf87925a304e8747ce4fae11bdaf51d2258a01ef58c55f6ca33218f226/gaspra-0.1.0a3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-29 03:01:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mawicks",
    "github_project": "gaspra",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "gaspra"
}
        
Elapsed time: 4.05636s