langmo

Name	langmo
Version	0.2.0
home_page	http://vecto.space
Summary	toolbox for various tasks in the area of vector space models of computational linguistic
upload_time	2023-08-17 01:38:51
maintainer
docs_url	None
author
requires_python	>=3.5
license	Apache License 2.0
keywords	nlp linguistics language
VCS
bugtrack_url
requirements	No requirements were recorded.

langmo
######

The library for distributed pretraining and finetuning of language models.

Supported features:

- vanilla pre-training of BERT-like models
- distributed training on multi-node/multi-GPU systems
- benchmarking/finetuning the following tasks
    - all GLUE
    - MNLI + additional validation on HANS
    - more coming soon
- using siamese architectures for finetuning


Pretraining
-----------

Pretraining a model::

    mpirun -np N python -m langmo.pretraining config.yaml

langmo saves 2 types of snapshots: in pytorch_lightning format

To resume a crashed or aborted pretraining session::

    mpirun -np N python -m langmo.pretraining.resume path_to_run


Finetuning/Evaluation
---------------------

Finetuning on one of the GLUE tasks::

    mpirun -np N python -m langmo.benchmarks.GLUE config.yaml glue_task

supported tasks: **cola, rte, stsb, mnli, mnli-mm, mrpc, sst2, qqp, qnli**
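
To run several GLUE tasks one after another, the command above can be assembled programmatically. The following is a minimal sketch and not part of langmo: the helper name ``glue_command`` and the default of 4 processes are assumptions made for illustration; only the module path and the task names are taken from the text above.

```python
import shlex

# Task names as listed in the README.
GLUE_TASKS = ["cola", "rte", "stsb", "mnli", "mnli-mm", "mrpc", "sst2", "qqp", "qnli"]


def glue_command(task, config="config.yaml", n_procs=4):
    """Build the argument list for one GLUE finetuning run (hypothetical helper)."""
    if task not in GLUE_TASKS:
        raise ValueError(f"unsupported GLUE task: {task!r}")
    return [
        "mpirun", "-np", str(n_procs),
        "python", "-m", "langmo.benchmarks.GLUE",
        config, task,
    ]


# Print one shell-ready command line per task; nothing is launched here.
for task in GLUE_TASKS:
    print(shlex.join(glue_command(task)))
```

Each returned list can be passed to ``subprocess.run(cmd, check=True)`` to actually launch the run.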

The NLI task has an additional, dedicated implementation that supports validation on the adversarial HANS dataset
and reports additional statistics for each label/heuristic.

To perform finetuning on NLI, run::

    mpirun -np N python -m langmo.benchmarks.NLI config.yaml


Finetuning on extractive question-answering tasks::

    mpirun -np N python -m langmo.benchmarks.QA config.yaml qa_task

supported tasks: **squad, squad_v2**

example config file:

::

    model_name: "roberta-base"
    batch_size: 32
    cnt_epochs: 4
    path_results: ./logs
    max_lr: 0.0005
    siamese: true
    freeze_encoder: false
    encoder_wrapper: pooler
    shuffle: true
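
When comparing hyperparameters, it can be convenient to generate variants of this config file rather than edit it by hand. The sketch below is an illustration only: ``render_config`` is a hypothetical helper, the learning-rate values are arbitrary, and the keys are simply the ones shown in the example above. It relies on the fact that JSON scalars are also valid YAML scalars, so no YAML library is needed.

```python
import json

# Keys and default values copied from the example config file.
BASE_CONFIG = {
    "model_name": "roberta-base",
    "batch_size": 32,
    "cnt_epochs": 4,
    "path_results": "./logs",
    "max_lr": 0.0005,
    "siamese": True,
    "freeze_encoder": False,
    "encoder_wrapper": "pooler",
    "shuffle": True,
}


def render_config(**overrides):
    """Return the config as YAML text with the given keys overridden."""
    config = {**BASE_CONFIG, **overrides}
    # json.dumps yields quoted strings, plain numbers and true/false,
    # all of which are valid YAML scalars.
    # Caveat: very small floats serialize in exponent form (e.g. 1e-05),
    # which some YAML parsers read as a string.
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in config.items())


# Show three variants that differ only in max_lr (values are arbitrary).
for max_lr in (0.0001, 0.0005, 0.001):
    print(f"--- max_lr={max_lr} ---")
    print(render_config(max_lr=max_lr))
```

Each rendered text can be written to its own file and passed as the ``config.yaml`` argument of a finetuning command.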


Automatic evaluation
--------------------

langmo supports automatic scheduling of evaluation runs for a model saved in a given location, or for all snapshots found in the /snapshots folder.
To configure langmo, the user has to create the following files:

./configs/langmo.yaml with the entry "submit_command" corresponding to the job submission command of a given cluster. If the file is not present, the jobs will not be submitted to the job queue, but executed immediately one by one on the same node.

./configs/auto_finetune.inc - the content of this file will be copied to the beginning of the job scripts. Place here directives for the job scheduler (e.g. Slurm), such as
which resource group to use, how many nodes to allocate, time limit etc. Set up all necessary environment variables, particularly NUM_GPUS_PER_NODE and
PL_TORCH_DISTRIBUTED_BACKEND (MPI, NCCL or GLOO). Finally, add the mpirun command with the necessary options and end the file with a new line.
The command to invoke langmo in the right way will be added automatically.

./configs/auto_finetune.yaml - any parameters, such as batch size, to override the defaults in a fine-tuning run.
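
As an illustration of the three files described above, the sketch below writes a possible set of them. Everything specific in it is an assumption: a Slurm cluster with ``sbatch`` as the submission command, the partition name, node count, time limit, GPU count and ``mpirun`` options are placeholders to be replaced with the values of the actual cluster. Only the file names, the ``submit_command`` entry and the two environment variable names come from the description.

```python
from pathlib import Path

# Assumption: jobs are submitted with Slurm's sbatch.
LANGMO_YAML = 'submit_command: "sbatch"\n'

# Placeholder scheduler directives, environment variables and mpirun options.
# The file ends with a newline, as the description requires.
AUTO_FINETUNE_INC = """\
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --time=02:00:00
export NUM_GPUS_PER_NODE=4
export PL_TORCH_DISTRIBUTED_BACKEND=NCCL
mpirun -np 8
"""

# Override of a default finetuning parameter (key taken from the example config).
AUTO_FINETUNE_YAML = "batch_size: 32\n"


def write_configs(root="."):
    """Create <root>/configs with the three files and return that directory."""
    config_dir = Path(root) / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    files = {
        "langmo.yaml": LANGMO_YAML,
        "auto_finetune.inc": AUTO_FINETUNE_INC,
        "auto_finetune.yaml": AUTO_FINETUNE_YAML,
    }
    for name, content in files.items():
        (config_dir / name).write_text(content)
    return config_dir
```

Calling ``write_configs()`` from the directory where evaluation jobs are scheduled creates ``./configs`` with these placeholder files.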

To schedule evaluation jobs, run from the login node::

    python -m langmo.benchmarks path_to_model task_name

The results will be saved in the eval/task_name/run_name/ subfolder of the folder where the model is saved.
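
Given that layout, finished runs for a task can be located with a few lines of Python. This is a sketch under the assumption that ``path_to_model`` is the directory containing the model; the helper name ``list_eval_runs`` is hypothetical, and the format of the files inside each run folder is not covered here.

```python
from pathlib import Path


def list_eval_runs(path_to_model, task_name):
    """Return the run directories under <path_to_model>/eval/<task_name>/, sorted by name."""
    task_dir = Path(path_to_model) / "eval" / task_name
    if not task_dir.is_dir():
        return []  # no evaluation has been saved for this task yet
    return sorted(p for p in task_dir.iterdir() if p.is_dir())
```

For example, ``list_eval_runs("path_to_model", "rte")`` returns one ``Path`` per ``run_name`` folder, or an empty list if the task has not been evaluated.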

Fugaku notes
------------

Add these lines before the :code:`return` statement of :code:`_compare_version`
in :code:`pytorch_lightning/utilities/imports.py`::

    if str(pkg_version).startswith(version):
        return True

This :code:`sed` command should do the trick::

    sed -i -e '/pkg_version = Version(pkg_version.base_version/a\    if str(pkg_version).startswith(version):\n\        return True' \
      ~/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py
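
Where :code:`sed` is inconvenient, the same edit can be sketched in Python. The version below mirrors the command above: it appends the two lines, with the same fixed indentation, after the line containing the anchor text, and it skips files that already contain the patch. The function names ``apply_patch`` and ``patch_file`` are hypothetical, and the sketch assumes the anchor line is present in the installed pytorch_lightning version.

```python
from pathlib import Path

# Anchor text and inserted lines, copied from the sed command.
ANCHOR = "pkg_version = Version(pkg_version.base_version"
PATCH_LINES = [
    "    if str(pkg_version).startswith(version):",
    "        return True",
]


def apply_patch(source):
    """Return source with PATCH_LINES appended after each line containing ANCHOR."""
    if PATCH_LINES[0] in source:
        return source  # already patched, leave unchanged
    out = []
    for line in source.splitlines():
        out.append(line)
        if ANCHOR in line:
            out.extend(PATCH_LINES)
    return "\n".join(out) + "\n"


def patch_file(path):
    """Apply the patch to the file at path in place."""
    path = Path(path)
    path.write_text(apply_patch(path.read_text()))
```

``patch_file`` would be called with the path of the installed ``imports.py``, for example the one used in the :code:`sed` command.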

Raw data

{
    "_id": null,
    "home_page": "http://vecto.space",
    "name": "langmo",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.5",
    "maintainer_email": "",
    "keywords": "NLP,linguistics,language",
    "author": "",
    "author_email": "",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "toolbox for various tasks in the area of vector space models of computational linguistic",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "http://vecto.space"
    },
    "split_keywords": [
        "nlp",
        "linguistics",
        "language"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0fab0b7fe68e8662514d66a47c98f3886c91bdf2f97ea60f94a50134f59ff2cf",
                "md5": "15046d93bc43963ff3ad3295bb4f2241",
                "sha256": "ab926f539b88dc3c0a0587a406154f494fb62b8a891c49e688719d2feaf69db4"
            },
            "downloads": -1,
            "filename": "langmo-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "15046d93bc43963ff3ad3295bb4f2241",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.5",
            "size": 76345,
            "upload_time": "2023-08-17T01:38:51",
            "upload_time_iso_8601": "2023-08-17T01:38:51.167212Z",
            "url": "https://files.pythonhosted.org/packages/0f/ab/0b7fe68e8662514d66a47c98f3886c91bdf2f97ea60f94a50134f59ff2cf/langmo-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-17 01:38:51",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "langmo"
}