This repository was archived by the owner on Aug 12, 2025. It is now read-only.
Non-OK-status: GpuLaunchKernel error during distributed training of a large model  #88

@bojone

Description

I am attempting to train a model with 3 billion parameters on two A100 GPUs using nvidia-tensorflow 1.15 (21.07-tf1-py3), with a batch size of 24 and tf.distribute.MirroredStrategy.

The error message is:

2023-06-03 07:27:26.364872: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:161] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast<IntType>(output->dimension(0)), static_cast<IntType>(output->dimension(1)), output->data()) status: Internal: invalid configuration argument

The error seems to occur only when the model is large enough and distributed training is used together: the model trains successfully on a single GPU with a batch_size of 12, and a 1.5B-parameter model trains successfully on two GPUs.
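To make the comparison between the working and failing runs concrete, here is a small back-of-the-envelope sketch. It assumes the batch size of 24 is the global batch that MirroredStrategy splits evenly across the two GPUs, and it treats "3 billion" and "1.5B" as round parameter counts. The helper names (`per_replica_batch`, `fits_int32`, `fp32_gib`) are hypothetical. The signed 32-bit comparison is only a hypothesis worth checking, not a confirmed cause: nothing in the error message establishes that the failing concat operates on a buffer of the full model size.

```python
# Rough arithmetic on the configurations described in this report.
# All figures are round numbers taken from the text, not measured values.

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def per_replica_batch(global_batch, num_replicas):
    """Per-GPU batch, ASSUMING the reported batch size is the global batch."""
    if global_batch % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    return global_batch // num_replicas


def fits_int32(num_elements):
    """True if an element count is representable as a signed 32-bit integer."""
    return num_elements <= INT32_MAX


def fp32_gib(num_elements):
    """Size in GiB of a float32 buffer holding num_elements values."""
    return num_elements * 4 / 2**30


configs = [
    # (label, approx. parameters, global batch, GPUs, outcome from the report)
    ("3B model, 1 GPU", 3_000_000_000, 12, 1, "trains"),
    ("1.5B model, 2 GPUs", 1_500_000_000, 24, 2, "trains"),
    ("3B model, 2 GPUs", 3_000_000_000, 24, 2, "GpuLaunchKernel error"),
]

for label, params, batch, gpus, outcome in configs:
    print(
        f"{label}: per-GPU batch={per_replica_batch(batch, gpus)}, "
        f"params fit int32={fits_int32(params)}, "
        f"fp32 gradients={fp32_gib(params):.1f} GiB, outcome={outcome}"
    )
```

Under these assumptions the per-GPU batch is the same in the working single-GPU run and the failing two-GPU run, so the difference between them would lie in the cross-replica step rather than in per-GPU load. The 3B parameter count also exceeds the signed 32-bit range while 1.5B does not, which would be consistent with the observed pattern if gradients are packed into one large tensor before aggregation. Whether that packing actually happens in this setup would need to be verified.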

I understand that using TensorFlow for training large models may not be the best option, but at present, I need to address this issue.
