text-equivalences


 - Name: text-equivalences
 - Version: 0.1.0
 - Home page: https://github.com/o-alexandre-felipe/text-equivalences
 - Summary: Rule based language modeling
 - Upload time: 2023-03-30 16:34:10
 - Author: Alexandre Felipe
 - License: MIT
 - Keywords: fst, transducers, regular expression, text normalization, graph, language model
 - Requirements: No requirements were recorded.

# text-equivalences

## Introduction

This package introduces a formalism for easily writing a kind of regular expression that carries semantic information.
This could be used to align two versions of a text, or enumerate variations of one.

Take the following examples: _First of January of twenty twenty one_, _Jan 1st 2021_, _01/01/2021_. As humans, we understand that these are equivalent because we understand the concept behind them. In my experience, this problem is usually tackled by applying some text replacements before comparing; this helps, but it is difficult to track what happened. What if we could compare the different versions directly?


## Language support

The language is defined as a reinterpretation of Python syntax: rules are written as Python code but are given a different meaning.
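
One way to give Python syntax a different meaning is to parse a rule with the standard `ast` module instead of executing it. The sketch below only shows that a rule such as `'first' | '1st'` is a well-formed Python expression; whether the package works this way internally is an assumption, not something stated here.

```python
import ast

# Parse a rule as an ordinary Python expression; nothing is evaluated.
# (Evaluating it directly would fail, since `|` is not defined for str.)
tree = ast.parse("'first' | '1st'", mode="eval")
node = tree.body

# The rule is a binary operation whose operator is `|` (BitOr)
# and whose operands are two string constants.
print(type(node).__name__)                 # BinOp
print(type(node.op).__name__)              # BitOr
print(node.left.value, node.right.value)   # first 1st
```

A tool built on this idea would walk the tree and map each operator node to its rule-language meaning.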

### Literal input
Text can be quoted with single or double quotes, e.g. `'single'`, `"double"`. Strings can be made case insensitive by adding the suffix `i`, e.g. `'he'i` will match any of `'He'`, `'he'`, `'HE'` or `'hE'`.

### Equivalence operator `|` 

The `|` operator denotes equivalence between two inputs:

```
      'first' | '1st'
```

If each of two inputs matches one of the terms in such a chain, the inputs are considered equivalent.
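
The equivalence check can be illustrated with a minimal sketch in plain Python. The `equivalent` helper and the set-based representation are hypothetical and are not part of the package's API.

```python
# Hypothetical helper, not part of the package: an equivalence chain
# is modelled as a set of interchangeable terms.
def equivalent(a, b, terms):
    """Return True when both inputs match some term of the chain."""
    return a in terms and b in terms


first = {"first", "1st"}  # stands for the rule: 'first' | '1st'

print(equivalent("first", "1st", first))  # True
print(equivalent("first", "2nd", first))  # False
```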

### Alternative operator `/`

The `/` operator lists alternatives that remain distinct from one another:

```
    'one' / 'two' / 'three' | '3'
```
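
Because rules reuse Python syntax, Python's operator precedence decides how an unparenthesised rule is grouped: `/` binds more tightly than `|`. The sketch below uses the standard `ast` module to show the grouping of the rule above. How the package interprets that grouping is not stated here, so if the intent is to make only `'three'` and `'3'` equivalent, writing `'one' / 'two' / ('three' | '3')` makes it explicit.

```python
import ast

tree = ast.parse("'one' / 'two' / 'three' | '3'", mode="eval")
top = tree.body

# `|` is the outermost operator because `/` binds more tightly.
print(type(top.op).__name__)       # BitOr

# Its left operand is the whole `/` chain; its right operand is '3'.
print(type(top.left.op).__name__)  # Div
print(top.right.value)             # 3
```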

### Sequence

Sequences are defined using the binary `+` operator, e.g.
```
  'First' + 'of' + 'January'
```

### Quantifiers

Quantifiers are prefix operators that make it possible to match a variable number of occurrences of the operand:
 - `+`: at least one
 - `-`: at most one

This example accepts both _millimetre_ and _millimetres_:
```
   'millimetre' + -'s'
```

Concatenation with an optional term can be written more simply with the binary `-` operator:

```
   'millimetre' -'s'
```

Zero or more repetitions can be achieved with the `-+` prefix operator.
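
The sequence and optional operators can be mimicked in ordinary Python with operator overloading. This is a minimal sketch under stated assumptions: the `Rule` class and `lit` function are hypothetical, each rule is represented by the finite set of strings it accepts, and the `+` prefix quantifier is left out because it would make that set infinite.

```python
from itertools import product


class Rule:
    """Hypothetical rule node holding the finite set of strings it accepts."""

    def __init__(self, variants):
        self.variants = frozenset(variants)

    def __add__(self, other):
        # a + b : sequence, every variant of a followed by every variant of b
        return Rule(x + y for x, y in product(self.variants, other.variants))

    def __neg__(self):
        # -a : at most one occurrence, so the empty string is also accepted
        return Rule(self.variants | {""})

    def __sub__(self, other):
        # a - b : a followed by an optional b
        return self + (-other)


def lit(text):
    """Wrap a literal string as a rule (hypothetical helper)."""
    return Rule([text])


plural = lit("millimetre") + -lit("s")
print(sorted(plural.variants))           # ['millimetre', 'millimetres']

same = lit("millimetre") - lit("s")
print(same.variants == plural.variants)  # True
```

Enumerating variants this way is only practical for small grammars; it is meant to show the operator semantics, not an efficient matcher.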

### Rule assignment

Grammar rules can be stored in local variables, e.g.
```
  first = 'first' | '1st'
```


## Ideas for future versions

### Capture groups (Python 3.8)

The output of a matched pattern can be assigned to different capture groups:

```
(day:=Day + 'of'  -'the' (month:=Month)) | ((month:=Month) + (day:=Day))
```
In the above example two date formats are compared: if the captured groups match, the outputs are considered equivalent.
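
Since capture groups are only a proposal, the idea can be approximated today with named groups from the standard `re` module. In this sketch the `DAY` and `MONTH` patterns and the `captures` helper are hypothetical stand-ins for the `Day` and `Month` rules; two inputs count as equivalent when their captured groups are equal.

```python
import re

# Hypothetical patterns standing in for the Day and Month rules.
DAY = r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"
MONTH = r"(?P<month>January|February|March)"

# Two date formats: "<day> of [the] <month>" and "<month> <day>".
day_first = re.compile(rf"{DAY} of (?:the )?{MONTH}")
month_first = re.compile(rf"{MONTH} {DAY}")


def captures(text):
    """Return the captured groups of the first format that matches, else None."""
    for pattern in (day_first, month_first):
        m = pattern.fullmatch(text)
        if m:
            return m.groupdict()
    return None


a = captures("1st of January")
b = captures("January 1st")
print(a == b)  # True
```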

### Capture reference

Capture references make it possible to check the content at a given position against the content at another position in the input:

```
size=Number -!unit 'by' Number (unit:=Unit);
```

### Mapping

Alternatively, the input can be mapped to an output in more sophisticated ways by means of mapping rules:

```
Digit = ('one' >> '1') / ('two' >> '2') / ('three' >> '3') / ('four' >> '4')
Size =   ((w:=Digit) ('by' | 'x') (h:=Digit)) >> (w ' x ' h)
```

This will translate `one by two` to `1 x 2`.
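
Until mapping rules exist, the same translation can be sketched in plain Python. The `DIGITS` table, `SEPARATORS` set and `translate_size` function are hypothetical names chosen for illustration; they mirror the `Digit` and `Size` rules above but are not the package's API.

```python
# Hypothetical tables mirroring the Digit rule and the ('by' | 'x') separator.
DIGITS = {"one": "1", "two": "2", "three": "3", "four": "4"}
SEPARATORS = {"by", "x"}


def translate_size(text):
    """Map '<digit> by|x <digit>' to '<w> x <h>'; return None if no match."""
    words = text.split()
    if len(words) != 3:
        return None
    w, sep, h = words
    if w in DIGITS and sep in SEPARATORS and h in DIGITS:
        return f"{DIGITS[w]} x {DIGITS[h]}"
    return None


print(translate_size("one by two"))  # 1 x 2
```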

            
