langvae 0.5.2

- **Summary:** LangVAE: Large Language VAEs made simple
- **Author:** Danilo S. Carvalho
- **Homepage:** https://github.com/neuro-symbolic-ai/LangVAE
- **Requires Python:** >=3.9
- **Keywords:** vae, llm, generative, nlp
- **License:** GPLv3 (per the README; not recorded in the package metadata)
- **Uploaded:** 2024-07-23 10:17:57

# LangVAE: Large Language VAEs made simple

LangVAE is a Python library for training and running Variational Autoencoders (VAEs) built on top of pre-trained language models. It provides an easy-to-use interface for training VAEs on text data, letting users customize the model architecture, loss function, and training parameters.

## Installation

To install LangVAE, simply run:

```bash
pip install langvae
```

This will install all necessary dependencies and set up the package for use in your Python projects.
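
A quick way to confirm the installation is to check that the imports used in the example below resolve:

```python
# Sanity check: these imports should succeed after installation.
from langvae import LangVAE
from langvae.encoders import SentenceEncoder
from langvae.decoders import SentenceDecoder
```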

## Usage

Here's a basic example of how to train a VAE on text data using LangVAE:

```python
from pythae.models.vae import VAEConfig
from langvae import LangVAE
from langvae.encoders import SentenceEncoder
from langvae.decoders import SentenceDecoder
from langvae.data_conversion.tokenization import TokenizedDataSet
from langvae.pipelines import LanguageTrainingPipeline
from langvae.trainers import CyclicalScheduleKLThresholdTrainerConfig
from saf_datasets import EntailmentBankDataSet

DEVICE = "cuda"
LATENT_SIZE = 32
MAX_SENT_LEN = 32

# Load pre-trained sentence encoder and decoder models.
decoder = SentenceDecoder("gpt2", LATENT_SIZE, MAX_SENT_LEN, device=DEVICE)
encoder = SentenceEncoder("bert-base-cased", LATENT_SIZE, decoder.tokenizer, device=DEVICE)

# Select explanatory sentences from the EntailmentBank dataset.
dataset = [
    sent for sent in EntailmentBankDataSet()
    if (sent.annotations["type"] == "answer" or 
        sent.annotations["type"].startswith("context"))
]

# Set training and evaluation datasets with auto tokenization.
eval_size = int(0.1 * len(dataset))
train_dataset = TokenizedDataSet(dataset[:-eval_size], decoder.tokenizer, decoder.max_len)
eval_dataset = TokenizedDataSet(dataset[-eval_size:], decoder.tokenizer, decoder.max_len)


# Define VAE model configuration
model_config = VAEConfig(
    input_dim=(train_dataset[0]["data"].shape[-2], train_dataset[0]["data"].shape[-1]),
    latent_dim=LATENT_SIZE
)

# Initialize LangVAE model
model = LangVAE(model_config, encoder, decoder)

# Train VAE on explanatory sentences
training_config = CyclicalScheduleKLThresholdTrainerConfig(
    output_dir='expl_vae',
    num_epochs=5,
    learning_rate=1e-4,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=50,
    steps_saving=1,
    optimizer_cls="AdamW",
    scheduler_cls="ReduceLROnPlateau",
    scheduler_params={"patience": 5, "factor": 0.5},
    max_beta=1.0,
    n_cycles=40,
    target_kl=2.0
)

pipeline = LanguageTrainingPipeline(
    training_config=training_config,
    model=model
)

pipeline(
    train_data=train_dataset,
    eval_data=eval_dataset
)
```

This example loads pre-trained encoder and decoder models, defines a VAE model configuration, initializes the LangVAE model, and trains it on explanatory sentences from EntailmentBank using a cyclical KL-annealing training pipeline.
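
A note on the trainer configuration: the name `CyclicalScheduleKLThresholdTrainerConfig` refers to cyclical KL annealing with a KL threshold. In this widely used scheme the KL weight $\beta$ ramps from 0 up to `max_beta` repeatedly over `n_cycles` cycles (Fu et al., 2019), while `target_kl` sets a floor below which the KL term is not pushed further, in the spirit of "free bits". One common formulation of the schedule, with $T$ total training steps and $C$ cycles, is

$$\beta(t) = \beta_{\max} \cdot \min\left(1,\ \frac{t \bmod \lceil T/C \rceil}{R\,\lceil T/C \rceil}\right),$$

where $R$ is the fraction of each cycle spent ramping (often $0.5$). This describes the general technique, not necessarily the library's exact implementation.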

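After training, a natural next step is to encode sentences into the latent space or to sample latent vectors and decode them back to text. The sketch below is illustrative only: `encode_z` and `decode_sentences` are assumed helper names, not verified against the langvae API, so check the library's documentation for the actual interface.

```python
import torch

# Illustrative post-training usage; `encode_z` and `decode_sentences`
# are assumed method names, not verified against the langvae API.

# Encode one tokenized sentence into a latent vector.
z = model.encode_z(train_dataset[0]["data"])

# Sample latent vectors from the standard normal prior and decode to text.
z_prior = torch.randn(4, LATENT_SIZE)
generated = model.decode_sentences(z_prior)  # assumed to return a list of strings
print(generated)
```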

## License

LangVAE is licensed under the GPLv3. See the LICENSE file for details.
