# aiu-fms-testing-utils
## Setup your environment
In this directory, check out the Foundation Model Stack (FMS) and the FMS Model Optimizer:
```shell
git clone https://github.com/foundation-model-stack/foundation-model-stack.git
git clone https://github.com/foundation-model-stack/fms-model-optimizer.git
```
Install FMS, FMS Model Optimizer, and aiu-fms-testing-utils:
```shell
cd foundation-model-stack
pip install -e .
cd ..
cd fms-model-optimizer
pip install -e .
cd ..
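# Install aiu-fms-testing-utils (this repository, from its top-level directory)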
pip install -e .
```
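To double-check that all three editable installs are visible to `pip`, a quick query (a sketch; the grep pattern is an assumption and the package names in your environment may differ) is:
```shell
# Optional sanity check: list the editable installs (pattern is an assumption)
pip list | grep -i -E "fms|aiu"
```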
### Running in OpenShift
Use the `pod.yaml` file to get started with your OpenShift allocation:
* Modify the `ibm.com/aiu_pf_tier0` values to indicate the number of AIUs that you want to use
* Modify the `namespace` to match your namespace/project (i.e., the output of `oc project`; see the command below)
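For example, print the current project name to use for the `namespace` field:
```shell
# Print the current OpenShift project; use this value for `namespace` in pod.yaml
oc project -q
```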
Start the pod:
```shell
oc apply -f pod.yaml
```
Copy this repository into the pod (includes scripts and the FMS stack):
```shell
oc cp ${PWD} my-workspace:/tmp/
```
Exec into the pod:
```shell
oc rsh my-workspace bash -l
```
When you are finished, make sure to delete your pod:
```shell
oc delete -f pod.yaml
```
### Setup the environment in the container
Verify that AIU discovery has happened by looking for output like the following when you exec into the pod:
```console
---- IBM AIU Device Discovery...
---- IBM AIU Environment Setup... (Generate config and environment)
---- IBM AIU Devices Found: 2
------------------------
[1000760000@my-workspace ~]$ echo $AIU_WORLD_SIZE
2
```
Inside the container, set up the envars to use the FMS:
```shell
export HOME=/tmp
cd ${HOME}/aiu-fms-testing-utils/foundation-model-stack/
# Install the FMS stack
pip install -e .
```
Run on the AIU instead of the default senulator (simulator):
```shell
export FLEX_COMPUTE=SENTIENT
export FLEX_DEVICE=VFIO
```
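To switch back to the default senulator, one option is to remove the overrides (an assumption: this relies on the runtime falling back to the senulator when these envars are unset):
```shell
# Assumption: the runtime falls back to the default senulator when these overrides are unset
unset FLEX_COMPUTE FLEX_DEVICE
```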
Optional envars to suppress debugging output:
```shell
export DTLOG_LEVEL=error
export TORCH_SENDNN_LOG=CRITICAL
export DT_DEEPRT_VERBOSE=-1
```
## Example runs
Tensor parallel execution is only supported on the AIU through the [Foundation Model Stack](https://github.com/foundation-model-stack/foundation-model-stack).
The `--nproc-per-node` command line option controls the number of AIUs to use (number of parallel processes).
### Small Toy
The `small-toy.py` script is a slimmed-down version of the Big Toy model. Its purpose is to demonstrate how to run a tensor parallel model with the FMS on AIU hardware.
```bash
cd ${HOME}/aiu-fms-testing-utils/scripts
# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 1 ./small-toy.py --backend aiu
# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 2 ./small-toy.py --backend aiu
```
Example Output
```console
shell$ torchrun --nproc-per-node 4 ./small-toy.py --backend aiu
------------------------------------------------------------
0 / 4 : Python Version : 3.11.7
0 / 4 : PyTorch Version : 2.2.2+cpu
0 / 4 : Dynamo Backend : aiu -> sendnn
0 / 4 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 4 : PCI Addr. for Rank 1 : 0000:b6:00.0
0 / 4 : PCI Addr. for Rank 2 : 0000:b9:00.0
0 / 4 : PCI Addr. for Rank 3 : 0000:b5:00.0
------------------------------------------------------------
0 / 4 : Creating the model...
0 / 4 : Compiling the model...
0 / 4 : Running model: First Time...
0 / 4 : Running model: Second Time...
0 / 4 : Done
```
### Roberta
The `roberta.py` script is a simple version of the RoBERTa model. Its purpose is to demonstrate how to run a tensor parallel model with the FMS on AIU hardware.
**Note**: We need to disable the tensor parallel `Embedding` conversion to avoid a `torch.distributed` interface that `gloo` does not support, namely `torch.ops._c10d_functional.all_gather_into_tensor`. The `roberta.py` script sets the following envar to avoid the problematic conversion. This workaround will be removed in a future PyTorch release.
```shell
export DISTRIBUTED_STRATEGY_IGNORE_MODULES=WordEmbedding,Embedding
```
```bash
cd ${HOME}/aiu-fms-testing-utils/scripts
# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./roberta.py
# AIU backend
torchrun --nproc-per-node 1 ./roberta.py --backend aiu
# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./roberta.py
# AIU backend
torchrun --nproc-per-node 2 ./roberta.py --backend aiu
```
Example Output
```console
shell$ torchrun --nproc-per-node 2 ./roberta.py --backend aiu
------------------------------------------------------------
0 / 2 : Python Version : 3.11.7
0 / 2 : PyTorch Version : 2.2.2+cpu
0 / 2 : Dynamo Backend : aiu -> sendnn
0 / 2 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 2 : PCI Addr. for Rank 1 : 0000:b6:00.0
------------------------------------------------------------
0 / 2 : Creating the model...
0 / 2 : Compiling the model...
0 / 2 : Running model: First Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Running model: Second Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Done
```
### LLaMA/Granite
```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1
# run 194m on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic
# run 194m on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32
# run 7b on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic
# run 7b on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32
# run gpt_bigcode (granite) 3b on AIU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --prompt_type=code --compile --default_dtype=fp16 --compile_dynamic
# run gpt_bigcode (granite) 3b on CPU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --prompt_type=code --default_dtype=fp32
```
To try mini-batch, use `--batch_input`.
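For example, appended to the 194m AIU run above (a sketch, assuming `--batch_input` is a boolean flag):
```bash
# 194m on AIU with mini-batch input (sketch; --batch_input assumed to be a boolean flag)
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic --batch_input
```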
For the validation script, here are a few examples:
```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1
# Run a llama 194m model, grab the example inputs in the script, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic
# Run a llama 194m model, grab the example inputs in a folder, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --compile_dynamic
# Run a llama 194m model, grab the example inputs in a folder, grab validation text from a folder, validate token equivalency (will only validate up to max(max_new_tokens, tokens_in_validation_file)):
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --validation_files_path=/home/devel/aiu-fms-testing-utils/prompts/validation/*.txt --compile_dynamic
# Validate a reduced size version of llama 8b
python3 scripts/validation.py --architecture=hf_configured --model_path=/home/devel/models/llama-8b --tokenizer=/home/devel/models/llama-8b --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --extra_get_model_kwargs nlayers=3 --compile_dynamic
```
To run a logits-based validation, pass `--validation_level=1` to the validation script. This checks that the logits output matches at every step of the model, using cross-entropy loss.
You can control the acceptable threshold with `--logits_loss_threshold`.
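For example, extending the first validation command above (a sketch; the threshold value is an arbitrary placeholder, not a recommendation):
```bash
# Logits-based validation of the llama 194m model; 2.5 is an illustrative threshold only
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic --validation_level=1 --logits_loss_threshold=2.5
```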
## Common Errors
### Pod connection error
Errors like the following often indicate that the pod has not started or is still in the process of starting.
```console
error: unable to upgrade connection: container not found ("my-pod")
```
Use `oc get pods` to check on the status. `ContainerCreating` indicates that the pod is being created. `Running` indicates that it is ready to use.
If there is an error, use `oc describe pod/my-workspace` to see a full diagnostic view. The `Events` list at the bottom will often tell you what the problem is.
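For example:
```shell
# Check pod status; then inspect the Events list if the pod is not Running
oc get pods
oc describe pod/my-workspace
```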
### torchrun generic error
Below is the generic `torchrun` failure trace. It is not helpful for locating the problem in your program; instead, look for the actual error message a little higher in the output.
```console
[2024-09-16 16:10:15,705] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1479484) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./roberta.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-09-16_16:10:15
  host : ibm-aiu-rdma-jjhursey
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1479484)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
### Additional warnings
You may see the following additional warnings/notices printed to the console. They are normal and expected at this point in time. The team will work on cleaning these up.
```console
CUDA extension not installed.
using tensor parallel
ignoring module=Embedding when distributing module
[WARNING] Keys from checkpoint (adapted to FMS) not copied into model: {'roberta.embeddings.token_type_embeddings.weight', 'lm_head.bias'}
```