# Maxwell 👹
[![PyPI
version](https://badge.fury.io/py/Maxwell.svg)](https://pypi.org/project/maxwell/)
[![Supported Python
versions](https://img.shields.io/pypi/pyversions/maxwell.svg)](https://pypi.org/project/maxwell/)
[![CircleCI](https://dl.circleci.com/status-badge/img/gh/CUNY-CL/maxwell/tree/main.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/CUNY-CL/maxwell/tree/main)
Maxwell is a Python library for learning the stochastic edit distance (SED)
between source and target alphabets for string transduction.

Given a corpus of source and target string pairs, it uses
expectation-maximization to learn the log-probability weights of edit actions
(copy, substitution, deletion, insertion) that best fit the observed
source-target pairs. These weights can then be used to find the optimal edit
sequence for unseen string pairs through Viterbi decoding.
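
To make the decoding step concrete, here is a minimal sketch of Viterbi decoding over edit actions. It is not Maxwell's implementation: the function name `viterbi_edits`, the action tuples, and the toy weight tables are all assumptions made for illustration. Costs are treated as negative log-probabilities, so lower is better, and the end-of-string weight is left out for brevity.

```python
import math


def viterbi_edits(source, target, sub, delete, insert, copy_cost):
    """Returns the cheapest edit sequence and its total cost.

    All costs are negative log-probabilities (hypothetical toy values),
    so the cheapest path is also the most probable one.
    """
    n, m = len(source), len(target)
    # best[i][j] is the lowest cost of turning source[:i] into target[:j].
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0:
                s = source[i - 1]
                cost = best[i - 1][j] + delete[s]
                candidates.append((cost, i - 1, j, ("del", s)))
            if j > 0:
                t = target[j - 1]
                cost = best[i][j - 1] + insert[t]
                candidates.append((cost, i, j - 1, ("ins", t)))
            if i > 0 and j > 0:
                s, t = source[i - 1], target[j - 1]
                if s == t:
                    step, action = copy_cost, ("copy", s)
                else:
                    step, action = sub[(s, t)], ("sub", s, t)
                cost = best[i - 1][j - 1] + step
                candidates.append((cost, i - 1, j - 1, action))
            if candidates:
                cost, pi, pj, action = min(candidates, key=lambda c: c[0])
                best[i][j], back[i][j] = cost, (pi, pj, action)
    # Follows the back-pointers from the final cell to recover the actions.
    actions = []
    i, j = n, m
    while back[i][j] is not None:
        i, j, action = back[i][j]
        actions.append(action)
    actions.reverse()
    return actions, best[n][m]


# Toy weights over a four-symbol alphabet (made-up values).
alphabet = "abcd"
sub = {(s, t): 2.0 for s in alphabet for t in alphabet if s != t}
delete = {s: 1.5 for s in alphabet}
insert = {t: 1.5 for t in alphabet}
actions, cost = viterbi_edits("abc", "abd", sub, delete, insert, 0.25)
```

With these toy weights, copying is cheap, so the decoder prefers to copy the shared prefix and substitute only the final symbol.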
## Install
First install dependencies:

    pip install -r requirements.txt

Then install:

    pip install .

It can then be imported like a regular Python module:
```python
import maxwell
```
## Usage
SED training can be run either from the command line or from Python, by
importing the library.

For command-line use, run:

    maxwell-train \
        --train /path/to/train/data \
        --output /path/to/output/file \
        --epochs "${NUM_EPOCHS}"

From Python, pass any iterable of source-target pairs to
`StochasticEditDistance.fit_from_data` for training. Learned edit weights can
then be saved with the `write_params` method:
```python
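from maxwell import sed

# `training_samples` is any iterable of (source, target) pairs and
# `NUM_EPOCHS` is the number of training epochs; define both beforehand.
aligner = sed.StochasticEditDistance.fit_from_data(
    training_samples, NUM_EPOCHS
)
aligner.params.write_params("/path/to/output/file")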
```
After training, parameters can be loaded from a file and used to compute the
optimal edits between two strings. The `action_sequence` method returns a tuple
of the optimal action sequence and the weight assigned to that sequence:
```python
from maxwell import sed
params = sed.ParamsDict.read_params("/path/to/learned/parameters")
aligner = sed.StochasticEditDistance(params)
optimal_sequence, optimal_cost = aligner.action_sequence(source, target)
```
If only the weight is needed, and not the actions themselves, call
`action_sequence_cost` instead:
```python
optimal_cost = aligner.action_sequence_cost(source, target)
```
Individual actions can also be evaluated with the `action_cost` method:
```python
action_cost = aligner.action_cost(action)
```
## Details
### Data
The default data format is based on the SIGMORPHON 2017 shared tasks:

    source target ...

That is, the first column is the source (a lemma) and the second is the target.

If the data are formatted differently, use the `--source-col` and
`--target-col` flags. For instance, for the SIGMORPHON 2016 shared task data
format:

    source ... target

one would pass `--target-col 3` to use the third column as target strings
(note the use of 1-based indexing).
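
The column convention can be illustrated with a short reader. This is a sketch rather than Maxwell's own loader: it assumes tab-separated columns, and the function `read_pairs` and its parameters are hypothetical names that mirror the 1-based `--source-col` and `--target-col` flags.

```python
import csv
import io


def read_pairs(lines, source_col=1, target_col=2):
    """Yields (source, target) pairs from tab-separated lines.

    Column numbers are 1-based, like the command-line flags.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # Skips blank lines.
        yield row[source_col - 1], row[target_col - 1]


# Toy rows in a three-column layout: source, features, target.
sample = io.StringIO(
    "walk\tpos=V,tense=PST\twalked\n"
    "sing\tpos=V,tense=PST\tsang\n"
)
pairs = list(read_pairs(sample, target_col=3))
```

The resulting list is an iterable of source-target pairs, which is the shape of input the library training interface is described as accepting.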
### Edit actions
Edit weights are maintained as a `ParamsDict` object, a dataclass comprising
three dictionaries and one float. The fields, and the indexing of the
dictionaries, are as follows:

1. `delta_sub` Keys: Tuples from source alphabet × target alphabet. Values:
   Substitution weight for each non-identical source-target pair. If the
   source symbol equals the target symbol, a separate copy probability is used.
2. `delta_del` Keys: All symbols in the source string alphabet. Values:
   Deletion weight for removing the source symbol from the string.
3. `delta_ins` Keys: All symbols in the target string alphabet. Values:
   Insertion weight for introducing the target symbol into the string.
4. `delta_eos` A float value representing the probability of terminating the
   string.

In Python, these values may be accessed through a `StochasticEditDistance`
object's `params` attribute.
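
As a rough picture of this layout, the sketch below defines a stand-in dataclass with the same four field names and scores an action sequence by summing log-probability weights. The class `ToyParams`, the function `sequence_weight`, the action tuples, and the choice to add `delta_eos` once at the end are assumptions for illustration; the real `ParamsDict` may represent actions and copy weights differently.

```python
import dataclasses
import math
from typing import Dict, Tuple


@dataclasses.dataclass
class ToyParams:
    """Stand-in for the layout described above (not the real ParamsDict)."""

    delta_sub: Dict[Tuple[str, str], float]
    delta_del: Dict[str, float]
    delta_ins: Dict[str, float]
    delta_eos: float


def sequence_weight(params, actions):
    """Sums the log-probability weights of the actions, plus end-of-string."""
    total = params.delta_eos
    for kind, *symbols in actions:
        if kind == "sub":
            total += params.delta_sub[tuple(symbols)]
        elif kind == "del":
            total += params.delta_del[symbols[0]]
        elif kind == "ins":
            total += params.delta_ins[symbols[0]]
        else:
            raise ValueError(f"Unknown action: {kind}")
    return total


# Toy log-probabilities; the underlying probabilities are made up.
toy = ToyParams(
    delta_sub={("c", "k"): math.log(0.25)},
    delta_del={"e": math.log(0.125)},
    delta_ins={"s": math.log(0.125)},
    delta_eos=math.log(0.5),
)
weight = sequence_weight(toy, [("sub", "c", "k"), ("del", "e"), ("ins", "s")])
```

Because the weights are log-probabilities, summing them corresponds to multiplying the probabilities of the individual actions.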
## References
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum likelihood from incomplete
data via the EM algorithm. *Journal of the Royal Statistical Society, Series B*
39(1): 1-38.

Ristad, E. S. and Yianilos, P. N. 1998. Learning string-edit distance. *IEEE
Transactions on Pattern Analysis and Machine Intelligence* 20(5): 522-532.

Raw data
{
"_id": null,
"home_page": null,
"name": "maxwell",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "computational linguistics, morphology, natural language processing, language",
"author": "Simon Clematide, Peter Makarov, Travis Bartley",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/38/05/fbacbb45cdfb0d309e98de58e81cbaa443afb6d40b7355c448bd9e0496d7/maxwell-0.2.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Stochastic Edit Distance aligner for string transduction",
"version": "0.2.5",
"project_urls": {
"homepage": "https://github.com/CUNY-CL/maxwell"
},
"split_keywords": [
"computational linguistics",
" morphology",
" natural language processing",
" language"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "70021aa3008dbe8490f0af2105a229b0aa1ca6193745733ee39e7065162c9dd5",
"md5": "35e680eee66d26785a9ce1937bb9c122",
"sha256": "42767c05b6034eb59394742df357dcb964af0abafe01ab82dae712559231abc3"
},
"downloads": -1,
"filename": "maxwell-0.2.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "35e680eee66d26785a9ce1937bb9c122",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 14680,
"upload_time": "2024-10-25T18:43:49",
"upload_time_iso_8601": "2024-10-25T18:43:49.158483Z",
"url": "https://files.pythonhosted.org/packages/70/02/1aa3008dbe8490f0af2105a229b0aa1ca6193745733ee39e7065162c9dd5/maxwell-0.2.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3805fbacbb45cdfb0d309e98de58e81cbaa443afb6d40b7355c448bd9e0496d7",
"md5": "27cc656148d119591b6fbefeec45397c",
"sha256": "1f8e9002dfc30e6ad4812f73a68cabb9fae914b3b06fb706caf78ed09fd65eed"
},
"downloads": -1,
"filename": "maxwell-0.2.5.tar.gz",
"has_sig": false,
"md5_digest": "27cc656148d119591b6fbefeec45397c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 16420,
"upload_time": "2024-10-25T18:43:51",
"upload_time_iso_8601": "2024-10-25T18:43:51.033130Z",
"url": "https://files.pythonhosted.org/packages/38/05/fbacbb45cdfb0d309e98de58e81cbaa443afb6d40b7355c448bd9e0496d7/maxwell-0.2.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-25 18:43:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "CUNY-CL",
"github_project": "maxwell",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"circle": true,
"requirements": [
{
"name": "black",
"specs": [
[
">=",
"22.3.0"
]
]
},
{
"name": "build",
"specs": [
[
">=",
"0.10.0"
]
]
},
{
"name": "flake8",
"specs": [
[
">=",
"3.9.2"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.20.1"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.4.0"
]
]
},
{
"name": "twine",
"specs": [
[
">=",
"4.0.2"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.64.1"
]
]
}
],
"lcname": "maxwell"
}