How to Resolve the NVIDIA GPU Error: uncorrectable ECC error

卡亦克 2024-10-01 17:01:04

I. How the problem was discovered

Recently, a server reported the following error while training a digital human avatar model:

<code>[2024-03-22 03:13:57] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] [W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 206, in <module>
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] main(args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 178, in main
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn(subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] while not context.join():
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] raise ProcessRaisedException(msg, error_index, failed_process.pid)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn.ProcessRaisedException:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] -- Process 2 terminated with the following error:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] fn(i, *args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 165, in subprocess_fn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] training_loop(rank=rank, args=args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 90, in training_loop
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] input = input.to(device)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] RuntimeError: CUDA error: uncorrectable ECC error encountered
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.</code>

Analyzing the exception log surfaced the key message: RuntimeError: CUDA error: uncorrectable ECC error encountered
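The keyword search in the log can be sketched as a small scan. This is a minimal sketch, not the author's tooling: the function name find_ecc_failure and both regular expressions are assumptions modeled on the log lines in this post. It also extracts the index of the failed worker process (2 in this log). Whether that index matches a GPU index depends on how the training script maps ranks to devices, which the log does not show.

```python
import re

# Patterns modeled on the log in this post (assumed, not an official format).
ECC_PATTERN = re.compile(r"uncorrectable ECC error encountered")
PROCESS_PATTERN = re.compile(r"-- Process (\d+) terminated with the following error")


def find_ecc_failure(log_lines):
    """Return (has_ecc_error, failed_process_index or None) for the given log lines."""
    has_ecc_error = False
    failed_process = None
    for line in log_lines:
        if ECC_PATTERN.search(line):
            has_ecc_error = True
        match = PROCESS_PATTERN.search(line)
        if match:
            failed_process = int(match.group(1))
    return has_ecc_error, failed_process


# Two lines copied from the log in this post.
sample_lines = [
    "[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] "
    "[train_algorithm_execute] -- Process 2 terminated with the following error:",
    "[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] "
    "[train_algorithm_execute] RuntimeError: CUDA error: uncorrectable ECC error encountered",
]
print(find_ecc_failure(sample_lines))  # (True, 2)
```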

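II. Impact of the problem

The exception caused the digital human avatar model training to fail, so avatars could not be produced for customers, which in turn affected delivery.

III. Detailed troubleshooting process

I first searched Baidu for [CUDA warning: uncorrectable ECC error encountered] to learn what ECC is and what it does.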

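ECC (Error Correcting Code) is a memory protection mechanism: dedicated hardware monitors the data in GPU memory, detects bit errors, and automatically corrects the ones it can. In the nvidia-smi output, the Volatile Uncorr. ECC column shows the number of uncorrectable ECC errors detected since the driver was last loaded. It reads 0 when no such errors have occurred, and cards without ECC support show N/A.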
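The same counter can also be read programmatically. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu interface with the ecc.errors.uncorrected.volatile.total field. The function names parse_ecc_counts and read_ecc_counts are hypothetical, and non-numeric values such as [N/A] are mapped to None.

```python
import subprocess

# Query fields: GPU index plus the volatile uncorrectable ECC error total.
QUERY_FIELDS = "index,ecc.errors.uncorrected.volatile.total"


def parse_ecc_counts(csv_text):
    """Parse 'index, count' lines into {gpu_index: count}, using None for non-numeric values."""
    counts = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index_text, value_text = [part.strip() for part in line.split(",", 1)]
        counts[int(index_text)] = int(value_text) if value_text.isdigit() else None
    return counts


def read_ecc_counts():
    """Run nvidia-smi and return the parsed per-GPU counters (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_counts(result.stdout)
```

On a GPU host, calling read_ecc_counts() returns one entry per card, which makes it easy to compare the cards against each other.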

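I then searched for a fix. Baidu turned up nothing particularly useful, so I switched to Google. The results included a thread on NVIDIA's official forum, which I opened right away to look for a solution.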

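IV. How the problem was solved

1. Check the GPU status with nvidia-smi. The key field is [Volatile Uncorr. ECC]: of the 4 cards, the third one showed a value different from the other three, which pinpointed the faulty card.

2. Reset that card's ECC error state with nvidia-smi -i 2 -p 0. Here -i 2 selects GPU index 2 (the third card, since indices start at 0) and -p 0 resets the volatile ECC error counters.

The card's status was restored and looked as good as new.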
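The two steps above can be sketched as a small helper. This is only an illustration under stated assumptions: the function names find_suspect_gpus and build_reset_command are hypothetical, the counter values in the example are made up (the post does not give the actual numbers), and "non-zero count" is used as a stand-in for "differs from the other cards". The command is built but not executed, since resetting ECC counters generally requires root privileges.

```python
def find_suspect_gpus(ecc_counts):
    """Return the sorted indices of GPUs whose uncorrectable ECC count is non-zero."""
    return sorted(index for index, count in ecc_counts.items() if count)


def build_reset_command(gpu_index):
    """Build the reset command from this post: nvidia-smi -i <index> -p 0."""
    return ["nvidia-smi", "-i", str(gpu_index), "-p", "0"]


# Hypothetical counters for 4 cards; only GPU index 2 reports errors.
example_counts = {0: 0, 1: 0, 2: 1, 3: 0}

for gpu_index in find_suspect_gpus(example_counts):
    # Print the command instead of running it, so an operator can review it first.
    print(" ".join(build_reset_command(gpu_index)))  # nvidia-smi -i 2 -p 0
```

Printing rather than executing keeps the decision with the operator, who may want to stop running jobs on the card before resetting it.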

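I then contacted business operations to restart the avatar model training job.

V. Summary and reflection

When a production issue comes up and Baidu, the domestic search engine, turns up no solution, try Google, the international one. There are always more solutions than problems.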


