aiu-fms-testing-utils


Name: aiu-fms-testing-utils
Version: 0.0.2a1
Summary: Spyre FMS Testing Utils
Upload time: 2025-07-25 20:39:20
Home page: None
Maintainer: None
Docs URL: None
Author: None
Requires Python: ~=3.12
License: None
Keywords: aiu-fms-testing-utils, python, utils
Requirements: ibm-fms, fms-model-optimizer, sentencepiece, numpy, transformers
# aiu-fms-testing-utils

## Setup your environment

In this directory, check out the Foundation Model Stack (FMS) and the FMS Model Optimizer:
```shell
git clone https://github.com/foundation-model-stack/foundation-model-stack.git
git clone https://github.com/foundation-model-stack/fms-model-optimizer.git
```

Install FMS, the FMS Model Optimizer, and aiu-fms-testing-utils:
```shell
cd foundation-model-stack
pip install -e .
cd ..

cd fms-model-optimizer
pip install -e .
cd ..

pip install -e .
```
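
As a quick sanity check, you can confirm that the installs import cleanly. The module names below (`fms`, `aiu_fms_testing_utils`) are assumptions based on the package names; adjust them if your install exposes different top-level modules.
```shell
# Optional sanity check; module names are assumed, not taken from this README
python3 -c "import fms; import aiu_fms_testing_utils; print('imports OK')"
```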

### Running in OpenShift

Use the `pod.yaml` file to get started with your OpenShift allocation:
 * Modify the `ibm.com/aiu_pf_tier0` values to indicate the number of AIUs that you want to use
 * Modify the `namespace` to match your namespace/project (i.e., the output of `oc project`); see the helper snippet below
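
As an illustration, the current project can be queried and substituted into `pod.yaml` directly. This is a sketch: it assumes `pod.yaml` has a single top-level `namespace:` field and that GNU `sed` is available.
```shell
# Show the current project, then patch it into pod.yaml (field layout assumed)
oc project -q
sed -i "s/^\(\s*namespace:\s*\).*/\1$(oc project -q)/" pod.yaml
```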

Start the pod:
```shell
oc apply -f pod.yaml
```

Copy this repository (including the scripts and the FMS checkouts) into the pod:
```shell
oc cp ${PWD} my-workspace:/tmp/
```

Exec into the pod:
```shell
oc rsh my-workspace bash -l
```

When you are finished, make sure to delete your pod:
```shell
oc delete -f pod.yaml
```

### Set up the environment in the container

Verify that AIU discovery has happened by looking for output like the following when you exec into the pod:
```console
---- IBM AIU Device Discovery...
---- IBM AIU Environment Setup... (Generate config and environment)
---- IBM AIU Devices Found: 2
------------------------
[1000760000@my-workspace ~]$  echo $AIU_WORLD_SIZE
2
```

Inside the container, set up the environment and install the FMS:
```shell
export HOME=/tmp
cd ${HOME}/aiu-fms-testing-utils/foundation-model-stack/
# Install the FMS stack
pip install -e .
```

To run on the AIU instead of the default senulator, set:
```shell
export FLEX_COMPUTE=SENTIENT
export FLEX_DEVICE=VFIO
```
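
To switch back to the senulator, change these values. The settings below are a sketch based on typical Spyre/sendnn environments and are not taken from this README; verify them against your container's documentation.
```shell
# Assumed senulator settings (not documented here); verify before relying on them
export FLEX_COMPUTE=SENULATOR
export FLEX_DEVICE=MOCK
```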

Optional environment variables to suppress debugging output:
```shell
export DTLOG_LEVEL=error
export TORCH_SENDNN_LOG=CRITICAL
export DT_DEEPRT_VERBOSE=-1
```

## Example runs

Tensor parallel execution is only supported on the AIU through the [Foundation Model Stack](https://github.com/foundation-model-stack/foundation-model-stack).

The `--nproc-per-node` command line option controls the number of AIUs to use (number of parallel processes).
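
The general launch pattern is shown below; the script name is a placeholder, and concrete examples follow in the next sections.
```shell
# One process per AIU; replace <N> with the number of AIUs and your-script.py with a real script
torchrun --nproc-per-node <N> ./your-script.py --backend aiu
```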

### Small Toy

The `small-toy.py` script is a slimmed-down version of the Big Toy model. Its purpose is to demonstrate how to run a tensor parallel model with the FMS on AIU hardware.

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 1 ./small-toy.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 2 ./small-toy.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 4 ./small-toy.py --backend aiu
------------------------------------------------------------
0 / 4 : Python Version  : 3.11.7
0 / 4 : PyTorch Version : 2.2.2+cpu
0 / 4 : Dynamo Backend  : aiu -> sendnn
0 / 4 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 4 : PCI Addr. for Rank 1 : 0000:b6:00.0
0 / 4 : PCI Addr. for Rank 2 : 0000:b9:00.0
0 / 4 : PCI Addr. for Rank 3 : 0000:b5:00.0
------------------------------------------------------------
0 / 4 : Creating the model...
0 / 4 : Compiling the model...
0 / 4 : Running model: First Time...
0 / 4 : Running model: Second Time...
0 / 4 : Done
```


### Roberta

The `roberta.py` script is a simplified version of the RoBERTa model. Its purpose is to demonstrate how to run a tensor parallel model with the FMS on AIU hardware.

**Note**: We need to disable the Tensor Parallel `Embedding` conversion to avoid the use of a `torch.distributed` interface that `gloo` does not support, namely `torch.ops._c10d_functional.all_gather_into_tensor`. The `roberta.py` script sets the following envar to avoid the problematic conversion. This workaround will be removed in a future PyTorch release.
```shell
export DISTRIBUTED_STRATEGY_IGNORE_MODULES=WordEmbedding,Embedding
```

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./roberta.py
# AIU backend
torchrun --nproc-per-node 1 ./roberta.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./roberta.py
# AIU backend
torchrun --nproc-per-node 2 ./roberta.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 2 ./roberta.py --backend aiu
------------------------------------------------------------
0 / 2 : Python Version  : 3.11.7
0 / 2 : PyTorch Version : 2.2.2+cpu
0 / 2 : Dynamo Backend  : aiu -> sendnn
0 / 2 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 2 : PCI Addr. for Rank 1 : 0000:b6:00.0
------------------------------------------------------------
0 / 2 : Creating the model...
0 / 2 : Compiling the model...
0 / 2 : Running model: First Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Running model: Second Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Done
```

### LLaMA/Granite
```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# run 194m on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 194m on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run 7b on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 7b on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run gpt_bigcode (granite) 3b on AIU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --prompt_type=code --compile --default_dtype=fp16 --compile_dynamic

# run gpt_bigcode (granite) 3b on CPU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --prompt_type=code --default_dtype=fp32
```

To try mini-batches, use `--batch_input` (see the sketch below).
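
For example, a mini-batch run might reuse the 194m AIU command from above with `--batch_input` added. This is only a sketch; the exact form of `--batch_input` (boolean flag vs. value) is not documented here, so treat it as an assumption.
```shell
# Illustrative only: the 194m AIU command with --batch_input appended (exact flag form assumed)
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic --batch_input
```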

For the validation script, here are a few examples:

```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# Run a llama 194m model, grab the example inputs in the script, generate validation tokens on cpu, validate token equivalency: 
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic

# Run a llama 194m model, grab the example inputs in a folder, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --compile_dynamic

# Run a llama 194m model, grab the example inputs in a folder, grab validation text from a folder, validate token equivalency (will only validate up to max(max_new_tokens, tokens_in_validation_file)):
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --validation_files_path=/home/devel/aiu-fms-testing-utils/prompts/validation/*.txt --compile_dynamic

# Validate a reduced size version of llama 8b
python3 scripts/validation.py --architecture=hf_configured --model_path=/home/devel/models/llama-8b --tokenizer=/home/devel/models/llama-8b --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --extra_get_model_kwargs nlayers=3 --compile_dynamic
```

To run a logits-based validation, pass `--validation_level=1` to the validation script. This checks that the logits output matches at every step of the model, using cross-entropy loss.
You can control the acceptable threshold with `--logits_loss_threshold`; an illustrative example follows.
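
The command below is a sketch based on the first validation example above; the threshold value is arbitrary and only for illustration.
```shell
# Logits-based validation; 2.5 is an arbitrary example threshold, not a recommendation
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic --validation_level=1 --logits_loss_threshold=2.5
```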

## Common Errors

### Pod connection error

Errors like the following often indicate that the pod has not started or is still in the process of starting.
```console
error: unable to upgrade connection: container not found ("my-pod")
```

Use `oc get pods` to check on the status. `ContainerCreating` indicates that the pod is being created. `Running` indicates that it is ready to use.

If there is an error, then use `oc describe pod/my-workspace` to see a full diagnostic view. The `Events` list at the bottom will often tell you what the problem is.
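
A typical check sequence:
```shell
# Check pod status; wait until it reports Running
oc get pods
# If it is stuck, the Events section at the end of this output usually names the cause
oc describe pod/my-workspace
```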

### torchrun generic error

Below is the generic `torchrun` failure trace. It is not helpful for locating the problem in your program; instead, look for the actual error message a little higher in the output.

```console
[2024-09-16 16:10:15,705] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1479484) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./roberta.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-16_16:10:15
  host      : ibm-aiu-rdma-jjhursey
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1479484)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

### Additional warnings

You may see the following additional warnings/notices printed to the console. They are normal and expected at this point in time. The team will work on cleaning these up.

```console
CUDA extension not installed.
using tensor parallel
ignoring module=Embedding when distributing module
[WARNING] Keys from checkpoint (adapted to FMS) not copied into model: {'roberta.embeddings.token_type_embeddings.weight', 'lm_head.bias'}
```

            
