# Yoyodyne 🪀
[![PyPI
version](https://badge.fury.io/py/yoyodyne.svg)](https://pypi.org/project/yoyodyne)
[![Supported Python
versions](https://img.shields.io/pypi/pyversions/yoyodyne.svg)](https://pypi.org/project/yoyodyne)
[![CircleCI](https://dl.circleci.com/status-badge/img/gh/CUNY-CL/yoyodyne/tree/master.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/CUNY-CL/yoyodyne/tree/master)
Yoyodyne provides neural models for small-vocabulary sequence-to-sequence
generation with and without feature conditioning.
These models are implemented using [PyTorch](https://pytorch.org/) and
[Lightning](https://www.pytorchlightning.ai/).
While we provide classic LSTM and transformer models, some of the provided
models are particularly well-suited for problems where the source-target
alignments are roughly monotonic (e.g., `transducer` and `hard_attention_lstm`)
and/or where source and target vocabularies have substantial overlap (e.g.,
`pointer_generator_lstm`).
## Philosophy
Yoyodyne is inspired by [FairSeq](https://github.com/facebookresearch/fairseq)
(Ott et al. 2019) but differs on several key points of design:
- It is for small-vocabulary sequence-to-sequence generation, and therefore
includes no affordances for machine translation or language modeling.
Because of this:
    - The architectures provided are intended to be reasonably exhaustive.
    - There is little need for data preprocessing; it works with TSV files.
- It has support for using features to condition decoding, with
architecture-specific code for handling feature information.
- It supports the use of validation accuracy (not loss) for model selection
and early stopping.
- Releases are made regularly.
- 🚧 UNDER CONSTRUCTION 🚧: It has exhaustive test suites.
- 🚧 UNDER CONSTRUCTION 🚧: It has performance benchmarks.
## Authors
Yoyodyne was created by [Adam Wiemerslage](https://adamits.github.io/), [Kyle
Gorman](https://wellformedness.com/), Travis Bartley, and [other
contributors](https://github.com/CUNY-CL/yoyodyne/graphs/contributors) like
yourself.
## Installation
### Local installation
Yoyodyne currently supports Python 3.9 through 3.12.
First install dependencies:

    pip install -r requirements.txt

Then install:

    pip install .

It can then be imported like a regular Python module:
```python
import yoyodyne
```
### Google Colab
Yoyodyne is compatible with [Google Colab](https://colab.research.google.com/)
GPU runtimes. [This
notebook](https://colab.research.google.com/drive/1O4VWvpqLrCxxUvyYMbGH9HOyXQSoh5bP?usp=sharing)
provides a worked example. Colab also provides access to TPU runtimes, but this
is not yet compatible with Yoyodyne to our knowledge.
## Usage
### Training
Training is performed by the [`yoyodyne-train`](yoyodyne/train.py) script. One
must specify the following required arguments:
- `--model_dir`: path for model metadata and checkpoints
- `--experiment`: name of experiment (pick something unique)
- `--train`: path to TSV file containing training data
- `--val`: path to TSV file containing validation data
The user can also specify various optional training and architectural arguments.
See below or run [`yoyodyne-train --help`](yoyodyne/train.py) for more
information.
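
As a minimal sketch, the required arguments above can be assembled into a command line from Python. The helper name `train_command` and the paths and experiment name are hypothetical; only flags named in this README are used.

```python
import shlex


def train_command(model_dir, experiment, train, val, **optional):
    """Builds a yoyodyne-train argument list from the four required flags.

    Keyword arguments are turned into `--flag value` pairs; the caller is
    responsible for passing only flags that yoyodyne-train accepts.
    """
    argv = [
        "yoyodyne-train",
        "--model_dir", model_dir,
        "--experiment", experiment,
        "--train", train,
        "--val", val,
    ]
    for flag, value in optional.items():
        argv.extend([f"--{flag}", str(value)])
    return argv


# Hypothetical paths and experiment name, for illustration only.
argv = train_command(
    "models", "my_experiment", "train.tsv", "val.tsv",
    arch="attentive_lstm", batch_size=32,
)
print(shlex.join(argv))
# To launch training, pass the list to subprocess.run(argv, check=True).
```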
### Validation
Validation is run at intervals requested by the user. See `--val_check_interval`
and `--check_val_every_n_epoch`
[here](https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api).
Additional evaluation metrics can also be requested with `--eval_metric`. For
example

    yoyodyne-train --eval_metric ser ...

will additionally compute symbol error rate (SER) each time validation is
performed. Additional metrics can be added to
[`evaluators.py`](yoyodyne/evaluators.py).
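
To make the metric concrete, here is a sketch of one common definition of symbol error rate: total edit distance between hypotheses and gold targets, divided by the total gold length. This is an assumption for illustration; the definition in `evaluators.py` may differ, and the function names below are hypothetical.

```python
def edit_distance(hyp, gold):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(gold) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, g in enumerate(gold, 1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (h != g),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def symbol_error_rate(pairs):
    """Edits per gold symbol over (hypothesis, gold) pairs.

    Assumes at least one gold sequence is non-empty.
    """
    errors = sum(edit_distance(hyp, gold) for hyp, gold in pairs)
    total = sum(len(gold) for _, gold in pairs)
    return errors / total


ser = symbol_error_rate([("abc", "abd"), ("xy", "xy")])
print(ser)
```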
### Prediction
Prediction is performed by the [`yoyodyne-predict`](yoyodyne/predict.py) script.
One must specify the following required arguments:
- `--arch`: architecture, matching the one used for training
- `--model_dir`: path for model metadata
- `--experiment`: name of experiment
- `--checkpoint`: path to checkpoint
- `--predict`: path to file containing data to be predicted
- `--output`: path for predictions
The `--predict` file can either be a TSV file or an ordinary TXT file with one
source string per line; in the latter case, specify `--target_col 0`. Run
[`yoyodyne-predict --help`](yoyodyne/predict.py) for more information.
Beam search is implemented (currently only for LSTM-based models) and can be
enabled by setting `--beam_width` to a value greater than 1. When using beam search, the
log-likelihood for each hypothesis is always returned. The outputs are pairs of
hypotheses and the associated log-likelihoods.
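
The plain-TXT case can be sketched as follows: write one source string per line, then build a prediction command that sets `--target_col 0`. The helper names, the example strings, and every path (including the checkpoint filename) are hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def write_sources(path, sources):
    """Writes one source string per line, the TXT layout described above."""
    Path(path).write_text("".join(f"{s}\n" for s in sources), encoding="utf-8")


def predict_command(arch, model_dir, experiment, checkpoint, predict, output):
    """Builds a yoyodyne-predict argument list for a file without targets."""
    return [
        "yoyodyne-predict",
        "--arch", arch,
        "--model_dir", model_dir,
        "--experiment", experiment,
        "--checkpoint", checkpoint,
        "--predict", predict,
        "--output", output,
        "--target_col", "0",  # a plain TXT file has no target column
    ]


with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "sources.txt"
    write_sources(txt, ["walk", "sing"])
    lines = txt.read_text(encoding="utf-8").splitlines()
    argv = predict_command(
        "attentive_lstm",
        "models",
        "my_experiment",
        "models/my_experiment/version_0/checkpoints/example.ckpt",  # hypothetical
        str(txt),
        "predictions.txt",
    )
print(shlex.join(argv))
```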
## Data format
The default data format is a two-column TSV file in which the first column is
the source string and the second the target string.

    source   target

To enable the use of a feature column, one specifies a (non-zero) argument to
`--features_col`. For instance, in the [SIGMORPHON 2017 shared
task](https://sigmorphon.github.io/sharedtasks/2017/), the first column is the
source (a lemma), the second is the target (the inflection), and the third
contains semicolon-delimited feature strings:

    source   target   feat1;feat2;...

This format is specified by `--features_col 3`.

Alternatively, for the [SIGMORPHON 2016 shared
task](https://sigmorphon.github.io/sharedtasks/2016/) data:

    source   feat1,feat2,...   target

This format is specified by `--features_col 2 --features_sep , --target_col 3`.
In order to ensure that targets are ignored during prediction, one can specify
`--target_col 0`.
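
The column flags can be read as 1-indexed positions, with `0` meaning "no such column". The sketch below illustrates that reading on single rows; it is not Yoyodyne's own parser, `parse_row` is a hypothetical name, and it assumes the source is always the first column.

```python
def parse_row(line, features_col=0, target_col=2, features_sep=";"):
    """Splits one TSV row into (source, features, target).

    Columns are 1-indexed; a value of 0 disables that column.
    """
    cols = line.rstrip("\n").split("\t")
    source = cols[0]  # assumption: source is the first column
    features = cols[features_col - 1].split(features_sep) if features_col else []
    target = cols[target_col - 1] if target_col else None
    return source, features, target


# SIGMORPHON 2017 layout: --features_col 3
row_2017 = parse_row("walk\twalked\tV;PST", features_col=3)
# SIGMORPHON 2016 layout: --features_col 2 --features_sep , --target_col 3
row_2016 = parse_row(
    "walk\tV,PST\twalked", features_col=2, features_sep=",", target_col=3
)
# Prediction on a file without targets: --target_col 0
row_txt = parse_row("walk", target_col=0)
print(row_2017, row_2016, row_txt)
```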
## Reserved symbols
Yoyodyne reserves symbols of the form `<...>` for internal use.
Feature-conditioned models also use `[...]` to avoid clashes between feature
symbols and source and target symbols, and `--no_tie_embeddings` uses `{...}` to
avoid clashes between source and target symbols. Therefore, users should not
provide any symbols of the form `<...>`, `[...]`, or `{...}`.
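
Before training, one might scan the data for symbols of these reserved forms. The following is a sketch of such a check; it is not part of Yoyodyne, the name `reserved_symbols` is hypothetical, and the example symbols are made up.

```python
import re

# Matches whole symbols of the form <...>, [...], or {...}.
RESERVED = re.compile(r"^(<.*>|\[.*\]|\{.*\})$")


def reserved_symbols(symbols):
    """Returns the symbols that collide with the reserved forms."""
    return [symbol for symbol in symbols if RESERVED.match(symbol)]


# A lone "<" is not of the form <...>, so it is not flagged.
clashes = reserved_symbols(["a", "<x>", "[V]", "{b}", "<"])
print(clashes)
```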
## Model checkpointing
Checkpointing is handled by
[Lightning](https://pytorch-lightning.readthedocs.io/en/stable/common/checkpointing_basic.html).
The path for model information, including checkpoints, is specified by a
combination of `--model_dir` and `--experiment`, such that we build the path
`model_dir/experiment/version_n`, where each run of an experiment with the same
`model_dir` and `experiment` is namespaced with a new version number. A version
stores everything needed to reload the model, including the hyperparameters
(`model_dir/experiment_name/version_n/hparams.yaml`) and the checkpoints
directory (`model_dir/experiment_name/version_n/checkpoints`).
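
The layout described above can be sketched with `pathlib`. The helper name `version_paths` and the directory and experiment names are hypothetical; the path components follow the description in this section.

```python
from pathlib import Path


def version_paths(model_dir, experiment, n):
    """Returns the hyperparameter file and checkpoint directory for version n."""
    version = Path(model_dir) / experiment / f"version_{n}"
    return version / "hparams.yaml", version / "checkpoints"


hparams, checkpoints = version_paths("models", "my_experiment", 0)
print(hparams.as_posix())
print(checkpoints.as_posix())
```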
By default, each run initializes a new model from scratch, unless the
`--train_from` argument is specified. To continue training from a specific
checkpoint, the **full path to the checkpoint** should be specified for the
`--train_from` argument. This creates a new version, but starts training from
the provided model checkpoint.
By default, one checkpoint is saved. To save more than one checkpoint, use the
`--num_checkpoints` flag. To save a checkpoint every epoch, set
`--num_checkpoints -1`. By default, the checkpoints saved are those which
maximize validation accuracy. To instead select checkpoints which minimize
validation loss, set `--checkpoint_metric loss`.
## Models
The user specifies the overall architecture for the model using the `--arch`
flag. The value of this flag specifies the decoder's architecture and whether or
not an attention mechanism is present. This flag also specifies a default
architecture for the encoder(s), but it is possible to override this with
additional flags. Supported values for `--arch` are:
- `attentive_gru`: This is a GRU decoder with GRU encoders (by default) and
an attention mechanism. The initial hidden state is treated as a learned
parameter.
- `attentive_lstm`: This is similar to the `attentive_gru` but instead uses an
LSTM decoder and encoder (by default).
- `gru`: This is a GRU decoder with GRU encoders (by default); in lieu of an
attention mechanism, the last non-padding hidden state of the encoder is
concatenated with the decoder hidden state.
- `hard_attention_gru`: This is a GRU encoder/decoder modeling generation as
a Markov process. By default, it assumes a non-monotonic progression over
the source string, but with `--enforce_monotonic` the model must progress
over each source character in order. A non-zero value of
`--attention_context` (default: `0`) widens the context window for
conditioning state transitions to include one or more previous states.
- `hard_attention_lstm`: This is similar to the `hard_attention_gru` but
instead uses an LSTM decoder and encoder (by default). `--attention_context`
(default: `0`) widens the context window for conditioning state transitions
to include one or more previous states.
- `lstm`: This is similar to the `gru` but instead uses an LSTM decoder and
encoder (by default).
- `pointer_generator_gru`: This is a GRU decoder with GRU encoders (by
default) and a pointer-generator mechanism. Since this model contains a copy
mechanism, it may be superior to an ordinary attentive GRU when the source
and target vocabularies overlap significantly. Note that this model requires
that the number of `--encoder_layers` and `--decoder_layers` match.
- `pointer_generator_lstm`: This is similar to the `pointer_generator_gru` but
instead uses an LSTM decoder and encoder (by default).
- `pointer_generator_transformer`: This is similar to the
`pointer_generator_gru` and `pointer_generator_lstm` but instead uses a
transformer decoder and encoder (by default). When using features, the user
may wish to specify the number of features attention heads (with
`--features_attention_heads`).
- `transducer_gru`: This is a GRU decoder with GRU encoders (by default) and
a neural transducer mechanism. On model creation, expectation maximization
is used to learn a sequence of edit operations, and imitation learning is
used to train the model to implement the oracle policy, with roll-in
controlled by the `--oracle_factor` flag (default: `1`). Since this model
assumes monotonic alignment, it may be superior to attentive models when the
alignment between input and output is roughly monotonic and when input and
output vocabularies overlap significantly.
- `transducer_lstm`: This is similar to the `transducer_gru` but instead uses
an LSTM decoder and encoder (by default).
- `transformer`: This is a transformer decoder with transformer encoders (by
default). Sinusoidal positional encodings and layer normalization are used.
The user may wish to specify the number of attention heads (with
`--source_attention_heads`; default: `4`).
The `--arch` flag specifies the decoder type; the user can override default
encoder types using the `--source_encoder_arch` flag and, when features are
present, the `--features_encoder_arch` flag. Valid values are:
- `feature_invariant_transformer` (`--source_encoder_arch` only): a variant of
the transformer encoder used with features; it concatenates source and
features and uses a learned embedding to distinguish between source and
features symbols.
- `linear`: a linear encoder.
- `gru`: a GRU encoder.
- `lstm`: an LSTM encoder.
- `transformer`: a transformer encoder.
For all models, the user may also wish to specify:
- `--decoder_layers` (default: `1`): number of decoder layers
- `--embedding` (default: `128`): embedding size
- `--encoder_layers` (default: `1`): number of encoder layers
- `--hidden_size` (default: `512`): hidden layer size
By default, RNN-backed (i.e., GRU and LSTM) encoders are bidirectional. One can
disable this with the `--no_bidirectional` flag.
## Training options
A non-exhaustive list includes:
- Batch size:
    - `--batch_size` (default: `32`)
    - `--accumulate_grad_batches` (default: not enabled)
- Regularization:
    - `--dropout` (default: `0.2`)
    - `--label_smoothing` (default: `0.0`)
    - `--gradient_clip_val` (default: not enabled)
- Optimizer:
    - `--learning_rate` (default: `0.001`)
    - `--optimizer` (default: `"adam"`)
    - `--beta1` (default: `0.9`): $\beta_1$ hyperparameter for the Adam
      optimizer (`--optimizer adam`)
    - `--beta2` (default: `0.99`): $\beta_2$ hyperparameter for the Adam
      optimizer (`--optimizer adam`)
    - `--scheduler` (default: not enabled)
- Duration:
    - `--max_epochs`
    - `--min_epochs`
    - `--max_steps`
    - `--min_steps`
    - `--max_time`
- Seeding:
    - `--seed`
- [Weights & Biases](https://wandb.ai/site):
    - `--log_wandb` (default: `False`): enables Weights & Biases tracking
Additional training options are discussed below.
### Early stopping
To enable early stopping, use the `--patience` and `--patience_metric` flags.
Early stopping occurs after `--patience` epochs with no improvement (when
validation loss stops decreasing if `--patience_metric loss`, or when validation
accuracy stops increasing if `--patience_metric accuracy`). Early stopping is
not enabled by default.
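
The patience rule can be illustrated with a small simulation over per-epoch validation values. This is a simplified sketch of the behavior described above, not the callback Yoyodyne actually uses; the function name `stopping_epoch` is hypothetical, and refinements such as a minimum improvement threshold are ignored.

```python
def stopping_epoch(values, patience, metric="accuracy"):
    """Returns the 0-indexed epoch at which training would stop, or None.

    With metric="accuracy" an epoch improves if its value is higher than the
    best so far; with metric="loss" it improves if its value is lower.
    """
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(values):
        if best is None:
            improved = True
        elif metric == "accuracy":
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Accuracy plateaus after epoch 1, so with patience 2 training stops at epoch 3.
stop = stopping_epoch([0.5, 0.6, 0.6, 0.59, 0.7], patience=2)
print(stop)
```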
### Schedulers
By default, Yoyodyne uses a constant learning rate during training, but best
practice is to gradually decrease learning rate as the model approaches
convergence using a [scheduler](yoyodyne/schedulers.py). The following
schedulers are supported and are selected with `--scheduler`:
- `reduceonplateau`: reduces the learning rate (multiplying it by
`--reduceonplateau_factor`) after `--reduceonplateau_patience` epochs with
no improvement (when validation loss stops decreasing if
`--reduceonplateau_metric loss`, or when validation accuracy stops increasing if
`--reduceonplateau_metric accuracy`) until the learning rate is less than or
equal to `--min_learning_rate`.
- `warmupinvsqrt`: linearly increases the learning rate from 0 to
`--learning_rate` for `--warmup_steps` steps, then decreases learning rate
according to an inverse square root schedule.
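
The shape of the `warmupinvsqrt` schedule can be sketched as a function of the step number. This is one standard formulation consistent with the description above (linear warmup, then decay proportional to the inverse square root of the step); the exact formula in `schedulers.py` may differ, and the function name is hypothetical.

```python
import math


def warmup_inv_sqrt(step, learning_rate, warmup_steps):
    """Learning rate at a given step under warmup plus inverse sqrt decay."""
    warmup_steps = max(1, warmup_steps)  # guards against division by zero
    if step < warmup_steps:
        # Linear increase from 0 to learning_rate.
        return learning_rate * step / warmup_steps
    # Equals learning_rate at step == warmup_steps, then decays as 1/sqrt(step).
    return learning_rate * math.sqrt(warmup_steps / step)


for step in (0, 50, 100, 400):
    print(step, warmup_inv_sqrt(step, learning_rate=0.001, warmup_steps=100))
```

Under this formulation the peak learning rate is reached exactly at the end of warmup, and quadrupling the step count afterwards halves the rate.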
### Tied embeddings
By default, the source and target vocabularies are shared. This can be disabled
with the flag `--no_tie_embeddings`, which uses `{...}` to avoid clashes between
source and target symbols.
### Batch size tricks
**Choosing a good batch size is key to fast training and optimal performance.**
Batch size is specified by the `--batch_size` flag.
One may wish to train with a larger batch size than will fit "in core". For
example, suppose one wishes to fit with a batch size of 4,096, but this gives an
out of memory (OOM) exception. Then, with minimal overhead, one could simulate
an effective batch size of 4,096 by using batches of size 1,024, [accumulating
gradients from 4 batches per
update](https://lightning.ai/docs/pytorch/stable/common/optimization.html#id3):

    yoyodyne-train --batch_size 1024 --accumulate_grad_batches 4 ...

The `--find_batch_size` flag enables [automatic computation of the batch
size](https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder).
With `--find_batch_size max`, it simply uses the maximum batch size, ignoring
`--batch_size`. With `--find_batch_size opt`, it finds the maximum batch size,
and then interprets it as follows:
- If the maximum batch size is greater than `--batch_size`, then
`--batch_size` is used as the batch size.
- However, if the maximum batch size is less than `--batch_size`, it solves
for the optimal gradient accumulation trick and uses the largest batch size
and the smallest number of gradient accumulation steps whose product is
`--batch_size`.
If one wishes to solve for these quantities without actually training, pass
`--find_batch_size opt` and `--max_epochs 0`. This will halt after computing and
logging the solution.
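
The `opt` rule above amounts to a small search: find the fewest accumulation steps that divide `--batch_size` and yield a per-step batch no larger than the maximum. The sketch below illustrates that search under this reading; it is not Yoyodyne's implementation, and `solve_accumulation` is a hypothetical name.

```python
def solve_accumulation(batch_size, max_batch_size):
    """Returns (per-step batch size, accumulation steps).

    The product of the two always equals batch_size.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    if max_batch_size >= batch_size:
        # The requested batch fits in memory; no accumulation is needed.
        return batch_size, 1
    # The smallest divisor found first gives the largest per-step batch.
    for steps in range(2, batch_size + 1):
        if batch_size % steps == 0 and batch_size // steps <= max_batch_size:
            return batch_size // steps, steps
    raise ValueError("unreachable for positive inputs")


# A desired batch of 4,096 with room for at most 1,500 examples per step.
solution = solve_accumulation(4096, 1500)
print(solution)
```

With these inputs the search recovers the manual setting shown earlier in this section: batches of 1,024 accumulated over 4 steps.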
### Hyperparameter tuning
**No neural model should be deployed without proper hyperparameter tuning.**
However, the default options give reasonable initial settings for an attentive
biLSTM. For transformer-based architectures, experiment with multiple encoder
and decoder layers, much larger batches, and the warmup-plus-inverse square root
decay scheduler.
### Weights & Biases tuning
[`wandb_sweeps`](examples/wandb_sweeps) shows how to use [Weights &
Biases](https://wandb.ai/site) to run hyperparameter sweeps.
## Accelerators
[Hardware
accelerators](https://pytorch-lightning.readthedocs.io/en/stable/extensions/accelerator.html)
can be used during training or prediction. In addition to CPU (the default) and
GPU (`--accelerator gpu`), [other
accelerators](https://pytorch-lightning.readthedocs.io/en/stable/extensions/accelerator.html)
may also be supported but not all have been tested yet.
## Precision
By default, training uses 32-bit precision. However, the `--precision` flag
allows the user to perform training with half precision (`16`) or with the
[`bfloat16` half precision
format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) if
supported by the accelerator. This may reduce the size of the model and batches
in memory, allowing one to use larger batches. Note that only default precision
is expected to work with CPU training.
## Examples
The [`examples`](examples) directory contains interesting examples, including:
- [`wandb_sweeps`](examples/wandb_sweeps) shows how to use [Weights &
Biases](https://wandb.ai/site) to run hyperparameter sweeps.
## For developers
*Developers, developers, developers!* - Steve Ballmer
This section contains instructions for the Yoyodyne maintainers.
### Releasing
1. Create a new branch. E.g., if you want to call this branch "release":
`git checkout -b release`
2. Sync your fork's branch to the upstream master branch. E.g., if the upstream
remote is called "upstream": `git pull upstream master`
3. Increment the version field in [`pyproject.toml`](pyproject.toml).
4. Stage your changes: `git add pyproject.toml`.
5. Commit your changes: `git commit -m "your commit message here"`
6. Push your changes. E.g., if your branch is called "release":
`git push origin release`
7. Submit a PR for your release and wait for it to be merged into `master`.
8. Tag the `master` branch's last commit. The tag should begin with `v`; e.g.,
if the new version is 3.1.4, the tag should be `v3.1.4`. This can be done:
- on GitHub itself: click the "Releases" or "Create a new release" link on
the right-hand side of the Yoyodyne GitHub page and follow the
dialogues.
- from the command-line using `git tag`.
9. Build the new release: `python -m build`
10. Upload the result to PyPI: `twine upload dist/*`
## References
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and
Auli, M. 2019. [fairseq: a fast, extensible toolkit for sequence
modeling](https://aclanthology.org/N19-4009/). In *Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics (Demonstrations)*, pages 48-53.
Raw data
{
"_id": null,
"home_page": null,
"name": "yoyodyne",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "computational linguistics, morphology, natural language processing, language",
"author": "Adam Wiemerslage, Kyle Gorman, Travis Bartley",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/86/40/1404fcff36af4af0f6740e401c2d58efb380ed450ec2be91ceba4d3cef9d/yoyodyne-0.2.14.tar.gz",
"platform": null,
"description": "# Yoyodyne \ud83e\ude80\n\n[![PyPI\nversion](https://badge.fury.io/py/yoyodyne.svg)](https://pypi.org/project/yoyodyne)\n[![Supported Python\nversions](https://img.shields.io/pypi/pyversions/yoyodyne.svg)](https://pypi.org/project/yoyodyne)\n[![CircleCI](https://dl.circleci.com/status-badge/img/gh/CUNY-CL/yoyodyne/tree/master.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/CUNY-CL/yoyodyne/tree/master)\n\nYoyodyne provides neural models for small-vocabulary sequence-to-sequence\ngeneration with and without feature conditioning.\n\nThese models are implemented using [PyTorch](https://pytorch.org/) and\n[Lightning](https://www.pytorchlightning.ai/).\n\nWhile we provide classic LSTM and transformer models, some of the provided\nmodels are particularly well-suited for problems where the source-target\nalignments are roughly monotonic (e.g., `transducer` and `hard_attention_lstm`)\nand/or where source and target vocabularies have substantial overlap (e.g.,\n`pointer_generator_lstm`).\n\n## Philosophy\n\nYoyodyne is inspired by [FairSeq](https://github.com/facebookresearch/fairseq)\n(Ott et al.\u00a02019) but differs on several key points of design:\n\n- It is for small-vocabulary sequence-to-sequence generation, and therefore\n includes no affordances for machine translation or language modeling.\n Because of this:\n - The architectures provided are intended to be reasonably exhaustive.\n - There is little need for data preprocessing; it works with TSV files.\n- It has support for using features to condition decoding, with\n architecture-specific code for handling feature information.\n- It supports the use of validation accuracy (not loss) for model selection\n and early stopping.\n- Releases are made regularly.\n- \ud83d\udea7 UNDER CONSTRUCTION \ud83d\udea7: It has exhaustive test suites.\n- \ud83d\udea7 UNDER CONSTRUCTION \ud83d\udea7: It has performance benchmarks.\n\n## Authors\n\nYoyodyne was created by [Adam 
Wiemerslage](https://adamits.github.io/), [Kyle\nGorman](https://wellformedness.com/), Travis Bartley, and [other\ncontributors](https://github.com/CUNY-CL/yoyodyne/graphs/contributors) like\nyourself.\n\n## Installation\n\n### Local installation\n\nYoyodyne currently supports Python 3.9 through 3.12.\n\nFirst install dependencies:\n\n pip install -r requirements.txt\n\nThen install:\n\n pip install .\n\nIt can then be imported like a regular Python module:\n\n```python\nimport yoyodyne\n```\n\n### Google Colab\n\nYoyodyne is compatible with [Google Colab](https://colab.research.google.com/)\nGPU runtimes. [This\nnotebook](https://colab.research.google.com/drive/1O4VWvpqLrCxxUvyYMbGH9HOyXQSoh5bP?usp=sharing)\nprovides a worked example. Colab also provides access to TPU runtimes, but this\nis not yet compatible with Yoyodyne to our knowledge.\n\n## Usage\n\n### Training\n\nTraining is performed by the [`yoyodyne-train`](yoyodyne/train.py) script. One\nmust specify the following required arguments:\n\n- `--model_dir`: path for model metadata and checkpoints\n- `--experiment`: name of experiment (pick something unique)\n- `--train`: path to TSV file containing training data\n- `--val`: path to TSV file containing validation data\n\nThe user can also specify various optional training and architectural arguments.\nSee below or run [`yoyodyne-train --help`](yoyodyne/train.py) for more\ninformation.\n\n### Validation\n\nValidation is run at intervals requested by the user. See `--val_check_interval`\nand `--check_val_every_n_epoch`\n[here](https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api).\nAdditional evaluation metrics can also be requested with `--eval_metric`. For\nexample\n\n yoyodyne-train --eval_metric ser ...\n\nwill additionally compute symbol error rate (SER) each time validation is\nperformed. 
Additional metrics can be added to\n[`evaluators.py`](yoyodyne/evaluators.py).\n\n### Prediction\n\nPrediction is performed by the [`yoyodyne-predict`](yoyodyne/predict.py) script.\nOne must specify the following required arguments:\n\n- `--arch`: architecture, matching the one used for training\n- `--model_dir`: path for model metadata\n- `--experiment`: name of experiment\n- `--checkpoint`: path to checkpoint\n- `--predict`: path to file containing data to be predicted\n- `--output`: path for predictions\n\nThe `--predict` file can either be a TSV file or an ordinary TXT file with one\nsource string per line; in the latter case, specify `--target_col 0`. Run\n[`yoyodyne-predict --help`](yoyodyne/predict.py) for more information.\n\nBeam search is implemented (currently only for LSTM-based models) and can be\nenabled by setting `--beam_width` \\> 1. When using beam search, the\nlog-likelihood for each hypothesis is always returned. The outputs are pairs of\nhypotheses and the associated log-likelihoods.\n\n## Data format\n\nThe default data format is a two-column TSV file in which the first column is\nthe source string and the second the target string.\n\n source target\n\nTo enable the use of a feature column, one specifies a (non-zero) argument to\n`--features_col`. For instance in the [SIGMORPHON 2017 shared\ntask](https://sigmorphon.github.io/sharedtasks/2017/), the first column is the\nsource (a lemma), the second is the target (the inflection), and the third\ncontains semi-colon delimited feature strings:\n\n source target feat1;feat2;...\n\nthis format is specified by `--features_col 3`.\n\nAlternatively, for the [SIGMORPHON 2016 shared\ntask](https://sigmorphon.github.io/sharedtasks/2016/) data:\n\n source feat1,feat2,... 
target\n\nthis format is specified by `--features_col 2 --features_sep , --target_col 3`.\n\nIn order to ensure that targets are ignored during prediction, one can specify\n`--target_col 0`.\n\n## Reserved symbols\n\nYoyodyne reserves symbols of the form `<...>` for internal use.\nFeature-conditioned models also use `[...]` to avoid clashes between feature\nsymbols and source and target symbols, and `--no_tie_embeddings` uses `{...}` to\navoid clashes between source and t arget symbols. Therefore, users should not\nprovide any symbols of the form `<...>`, `[...]`, or `{...}`.\n\n## Model checkpointing\n\nCheckpointing is handled by\n[Lightning](https://pytorch-lightning.readthedocs.io/en/stable/common/checkpointing_basic.html).\nThe path for model information, including checkpoints, is specified by a\ncombination of `--model_dir` and `--experiment`, such that we build the path\n`model_dir/experiment/version_n`, where each run of an experiment with the same\n`model_dir` and `experiment` is namespaced with a new version number. A version\nstores everything needed to reload the model, including the hyperparameters\n(`model_dir/experiment_name/version_n/hparams.yaml`) and the checkpoints\ndirectory (`model_dir/experiment_name/version_n/checkpoints`).\n\nBy default, each run initializes a new model from scratch, unless the\n`--train_from` argument is specified. To continue training from a specific\ncheckpoint, the **full path to the checkpoint** should be specified with for the\n`--train_from` argument. This creates a new version, but starts training from\nthe provided model checkpoint.\n\nBy default 1 checkpoint is saved. To save more than one checkpoint, use the\n`--num_checkpoints` flag. To save a checkpoint every epoch, set\n`--num_checkpoints -1`. By default, the checkpoints saved are those which\nmaximize validation accuracy. 
To instead select checkpoints which minimize\nvalidation loss, set `--checkpoint_metric loss`.\n\n## Models\n\nThe user specifies the overall architecture for the model using the `--arch`\nflag. The value of this flag specifies the decoder's architecture and whether or\nnot an attention mechanism is present. This flag also specifies a default\narchitecture for the encoder(s), but it is possible to override this with\nadditional flags. Supported values for `--arch` are:\n\n- `attentive_gru`: This is an GRU decoder with GRU encoders (by default) and\n an attention mechanism. The initial hidden state is treated as a learned\n parameter.\n- `attentive_lstm`: This is similar to the `attentive_gru` but instead uses an\n LSTM decoder and encoder (by default).\n- `gru`: This is an GRU decoder with GRU encoders (by default); in lieu of an\n attention mechanism, the last non-padding hidden state of the encoder is\n concatenated with the decoder hidden state.\n- `hard_attention_gru`: This is an GRU encoder/decoder modeling generation as\n a Markov process. By default, it assumes a non-monotonic progression over\n the source string, but with `--enforce_monotonic` the model must progress\n over each source character in order. A non-zero value of\n `--attention_context` (default: `0`) widens the context window for\n conditioning state transitions to include one or more previous states.\n- `hard_attention_lstm`: This is similar to the `hard_attention_gru` but\n instead uses an LSTM decoder and encoder (by deafult). `--attention_context`\n (default: `0`) widens the context window for conditioning state transitions\n to include one or more previous states.\n- `lstm`: This is similar to the `gru` but instead uses an LSTM decoder and\n encoder (by default).\n- `pointer_generator_gru`: This is an GRU decoder with GRU encoders (by\n default) and a pointer-generator mechanism. 
Since this model contains a copy\n mechanism, it may be superior to an ordinary attentive GRU when the source\n and target vocabularies overlap significantly. Note that this model requires\n that the number of `--encoder_layers` and `--decoder_layers` match.\n- `pointer_generator_lstm`: This is similar to the `pointer_generator_gru` but\n instead uses an LSTM decoder and encoder (by default).\n- `pointer_generator_transformer`: This is similar to the\n `pointer_generator_gru` and `pointer_generator_lstm` but instead uses a\n transformer decoder and encoder (by default). When using features, the user\n may wish to specify the number of features attention heads (with\n `--features_attention_heads`).\n- `transducer_gru`: This is an GRU decoder with GRU encoders (by default) and\n a neural transducer mechanism. On model creation, expectation maximization\n is used to learn a sequence of edit operations, and imitation learning is\n used to train the model to implement the oracle policy, with roll-in\n controlled by the `--oracle_factor` flag (default: `1`). Since this model\n assumes monotonic alignment, it may be superior to attentive models when the\n alignment between input and output is roughly monotonic and when input and\n output vocabularies overlap significantly.\n- `transducer_lstm`: This is similar to the `transducer_gru` but instead uses\n an LSTM decoder and encoder (by default).\n- `transformer`: This is a transformer decoder with transformer encoders (by\n default). Sinusodial positional encodings and layer normalization are used.\n The user may wish to specify the number of attention heads (with\n `--source_attention_heads`; default: `4`).\n\nThe `--arch` flag specifies the decoder type; the user can override default\nencoder types using the `--source_encoder_arch` flag and, when features are\npresent, the `--features_encoder_arch` flag. 
Valid values are:\n\n- `feature_invariant_transformer` (`--source_encoder_arch` only): a variant of\n the transformer encoder used with features; it concatenates source and\n features and uses a learned embedding to distinguish between source and\n features symbols.\n- `linear`: a linear encoder.\n- `gru`: a GRU encoder.\n- `lstm`: a LSTM encoder.\n- `transformer`: a transformer encoder.\n\nFor all models, the user may also wish to specify:\n\n- `--decoder_layers` (default: `1`): number of decoder layers\n- `--embedding` (default: `128`): embedding size\n- `--encoder_layers` (default: `1`): number of encoder layers\n- `--hidden_size` (default: `512`): hidden layer size\n\nBy default, RNN-backed (i.e., GRU and LSTM) encoders are bidirectional. One can\ndisable this with the `--no_bidirectional` flag.\n\n## Training options\n\nA non-exhaustive list includes:\n\n- Batch size:\n - `--batch_size` (default: `32`)\n - `--accumulate_grad_batches` (default: not enabled)\n- Regularization:\n - `--dropout` (default: `0.2`)\n - `--label_smoothing` (default: `0.0`)\n - `--gradient_clip_val` (default: not enabled)\n- Optimizer:\n - `--learning_rate` (default: `0.001`)\n - `--optimizer` (default: `\"adam\"`)\n - `--beta1` (default: `0.9`): $\\beta_1$ hyperparameter for the Adam\n optimizer (`--optimizer adam`)\n - `--beta2` (default: `0.99`): $\\beta_2$ hyperparameter for the Adam\n optimizer (`--optimizer adam`)\n - `--scheduler` (default: not enabled)\n- Duration:\n - `--max_epochs`\n - `--min_epochs`\n - `--max_steps`\n - `--min_steps`\n - `--max_time`\n- Seeding:\n - `--seed`\n- [Weights & Biases](https://wandb.ai/site):\n - `--log_wandb` (default: `False`): enables Weights & Biases tracking\n\nAdditional training options are discussed below.\n\n### Early stopping\n\nTo enable early stopping, use the `--patience` and `--patience_metric` flags.\nEarly stopping occurs after `--patience` epochs with no improvement (when\nvalidation loss stops decreasing if `--patience_metric 
loss`, or when validation\naccuracy stops increasing if `--patience_metric accuracy`). Early stopping is\nnot enabled by default.\n\n### Schedulers\n\nBy default, Yoyodyne uses a constant learning rate during training, but best\npractice is to gradually decrease learning rate as the model approaches\nconvergence using a [scheduler](yoyodyne/schedulers.py). The following\nschedulers are supported and are selected with `--scheduler`:\n\n- `reduceonplateau`: reduces the learning rate (multiplying it by\n `--reduceonplateau_factor`) after `--reduceonplateau_patience` epochs with\n no improvement (when validation loss stops decreasing if\n `--reduceonplateau_metric loss`, or when validation accuracy stops increasing if\n `--reduceonplateau_metric accuracy`) until the learning rate is less than or\n equal to `--min_learning_rate`.\n- `warmupinvsqrt`: linearly increases the learning rate from 0 to\n `--learning_rate` for `--warmup_steps` steps, then decreases learning rate\n according to an inverse square root schedule.\n\n### Tied embeddings\n\nBy default, the source and target vocabularies are shared. This can be disabled\nwith the flag `--no_tie_embeddings`, which uses `{...}` to avoid clashes between\nsource and target symbols.\n\n### Batch size tricks\n\n**Choosing a good batch size is key to fast training and optimal performance.**\nBatch size is specified by the `--batch_size` flag.\n\nOne may wish to train with a larger batch size than will fit \"in core\". For\nexample, suppose one wishes to fit with a batch size of 4,096, but this gives an\nout of memory (OOM) exception. 
Then, with minimal overhead, one could simulate\nan effective batch size of 4,096 by using batches of size 1,024, [accumulating\ngradients from 4 batches per\nupdate](https://lightning.ai/docs/pytorch/stable/common/optimization.html#id3):\n\n yoyodyne-train --batch_size 1024 --accumulate_grad_batches 4 ...\n\nThe `--find_batch_size` flag enables [automatic computation of the batch\nsize](https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder).\nWith `--find_batch_size max`, it simply uses the maximum batch size, ignoring\n`--batch_size`. With `--find_batch_size opt`, it finds the maximum batch size,\nand then interprets it as follows:\n\n- If the maximum batch size is greater than `--batch_size`, then\n `--batch_size` is used as the batch size.\n- However, if the maximum batch size is less than `--batch_size`, it solves\n for the optimal gradient accumulation trick and uses the largest batch size\n and the smallest number of gradient accumulation steps whose product is\n `--batch_size`.\n\nIf one wishes to solve for these quantities without actually training, pass\n`--find_batch_size opt` and `--max_epochs 0`. This will halt after computing and\nlogging the solution.\n\n### Hyperparameter tuning\n\n**No neural model should be deployed without proper hyperparameter tuning.**\nHowever, the default options give reasonable initial settings for an attentive\nbiLSTM. For transformer-based architectures, experiment with multiple encoder\nand decoder layers, much larger batches, and the warmup-plus-inverse square root\ndecay scheduler.\n\n### Weights & Biases tuning\n\n[`wandb_sweeps`](examples/wandb_sweeps) shows how to use [Weights &\nBiases](https://wandb.ai/site) to run hyperparameter sweeps.\n\n## Accelerators\n\n[Hardware\naccelerators](https://pytorch-lightning.readthedocs.io/en/stable/extensions/accelerator.html)\ncan be used during training or prediction. 
In addition to CPU (the default) and\nGPU (`--accelerator gpu`), [other\naccelerators](https://pytorch-lightning.readthedocs.io/en/stable/extensions/accelerator.html)\nmay also be supported but not all have been tested yet.\n\n## Precision\n\nBy default, training uses 32-bit precision. However, the `--precision` flag\nallows the user to perform training with half precision (`16`) or with the\n[`bfloat16` half precision\nformat](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) if\nsupported by the accelerator. This may reduce the size of the model and batches\nin memory, allowing one to use larger batches. Note that only default precision\nis expected to work with CPU training.\n\n## Examples\n\nThe [`examples`](examples) directory contains interesting examples, including:\n\n- [`wandb_sweeps`](examples/wandb_sweeps) shows how to use [Weights &\n Biases](https://wandb.ai/site) to run hyperparameter sweeps.\n\n## For developers\n\n*Developers, developers, developers!* - Steve Ballmer\n\nThis section contains instructions for the Yoyodyne maintainers.\n\n### Releasing\n\n1. Create a new branch. E.g., if you want to call this branch \"release\":\n `git checkout -b release`\n2. Sync your fork's branch to the upstream master branch. E.g., if the upstream\n remote is called \"upstream\": `git pull upstream master`\n3. Increment the version field in [`pyproject.toml`](pyproject.toml).\n4. Stage your changes: `git add pyproject.toml`.\n5. Commit your changes: `git commit -m \"your commit message here\"`\n6. Push your changes. E.g., if your branch is called \"release\":\n `git push origin release`\n7. Submit a PR for your release and wait for it to be merged into `master`.\n8. Tag the `master` branch's last commit. The tag should begin with `v`; e.g.,\n if the new version is 3.1.4, the tag should be `v3.1.4`. 
This can be done:\n - on GitHub itself: click the \"Releases\" or \"Create a new release\" link on\n the right-hand side of the Yoyodyne GitHub page and follow the\n dialogues.\n - from the command-line using `git tag`.\n9. Build the new release: `python -m build`\n10. Upload the result to PyPI: `twine upload dist/*`\n\n## References\n\nOtt, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and\nAuli, M. 2019. [fairseq: a fast, extensible toolkit for sequence\nmodeling](https://aclanthology.org/N19-4009/). In *Proceedings of the 2019\nConference of the North American Chapter of the Association for Computational\nLinguistics (Demonstrations)*, pages 48-53.\n",
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Small-vocabulary neural sequence-to-sequence models",
"version": "0.2.14",
"project_urls": {
"homepage": "https://github.com/CUNY-CL/yoyodyne"
},
"split_keywords": [
"computational linguistics",
" morphology",
" natural language processing",
" language"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ae7313eface98971bc68514e4bc05cb9f2b434c158f584b5d7f5304e1e8ddfcc",
"md5": "164befc1a58a8a03cd3bb902e7bc1756",
"sha256": "9c526c02e5dfd3567b3064307fd2eb1c49dc768b892e90a308c1f203c2726ae7"
},
"downloads": -1,
"filename": "yoyodyne-0.2.14-py3-none-any.whl",
"has_sig": false,
"md5_digest": "164befc1a58a8a03cd3bb902e7bc1756",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 85128,
"upload_time": "2024-10-31T00:10:03",
"upload_time_iso_8601": "2024-10-31T00:10:03.343330Z",
"url": "https://files.pythonhosted.org/packages/ae/73/13eface98971bc68514e4bc05cb9f2b434c158f584b5d7f5304e1e8ddfcc/yoyodyne-0.2.14-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "86401404fcff36af4af0f6740e401c2d58efb380ed450ec2be91ceba4d3cef9d",
"md5": "e4c37180725c0aa2a42a7fbb6ca49687",
"sha256": "6df22d4516f0ebec03b45fa80226813f69a93f860928289f46befeba3d06412c"
},
"downloads": -1,
"filename": "yoyodyne-0.2.14.tar.gz",
"has_sig": false,
"md5_digest": "e4c37180725c0aa2a42a7fbb6ca49687",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 75099,
"upload_time": "2024-10-31T00:10:05",
"upload_time_iso_8601": "2024-10-31T00:10:05.132586Z",
"url": "https://files.pythonhosted.org/packages/86/40/1404fcff36af4af0f6740e401c2d58efb380ed450ec2be91ceba4d3cef9d/yoyodyne-0.2.14.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-31 00:10:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "CUNY-CL",
"github_project": "yoyodyne",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"circle": true,
"requirements": [
{
"name": "black",
"specs": [
[
">=",
"24.10.0"
]
]
},
{
"name": "build",
"specs": [
[
">=",
"1.2.1"
]
]
},
{
"name": "flake8",
"specs": [
[
">=",
"7.1.0"
]
]
},
{
"name": "lightning",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.7.0"
]
]
},
{
"name": "maxwell",
"specs": [
[
">=",
"0.2.5"
]
]
},
{
"name": "numpy",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.26.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.2.2"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"8.3.2"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.13.1"
]
]
},
{
"name": "setuptools",
"specs": [
[
">=",
"75.3.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"2.5.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.66.6"
]
]
},
{
"name": "twine",
"specs": [
[
">=",
"5.1.1"
]
]
},
{
"name": "wandb",
"specs": [
[
">=",
"0.18.5"
]
]
},
{
"name": "wheel",
"specs": [
[
">=",
"0.40.0"
]
]
}
],
"lcname": "yoyodyne"
}