Single-machine multi-GPU training of rt-detrv2-r101: loss backward pass raises ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr. #9094

DoctorDream · 2024-08-16T03:30:04Z

Search before asking

I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

When I train rt-detr with the following command:

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error is raised:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/zqy/zqy/Codes/PaddleDetection/ppdet/engine/trainer.py", line 614, in train
    loss.backward()
  File "/usr/local/lib/python3.10/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

When I train on a single GPU, the error does not occur.

Environment

OS: Linux
PaddlePaddle: paddlepaddle-gpu 2.6.1.post117 and paddlepaddle-gpu 2.6.0.post117
PaddleDetection: develop/2.7
python: 3.10
CUDA: 11.7

Bug description confirmation

I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

I'd like to help by submitting a PR!

Sunting78 · 2024-08-19T03:03:38Z

Hello, could you try switching to release/2.7.1?

DoctorDream · 2024-08-23T01:40:26Z

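> Hello, could you try switching to release/2.7.1?

Hello, after switching to the PaddleDetection release/2.7.1 branch, I found that there is no rtdetrv2 folder under configs. When I ran the same command that I had used on the release/2.7.0 branch: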

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error was raised:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 213, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 209, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 149, in run
    trainer = Trainer(cfg, mode='train')
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/engine/trainer.py", line 116, in __init__
    self.model = create(cfg.architecture)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 255, in create
    cls_kwargs.update(cls.from_config(config, **kwargs))
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/modeling/architectures/detr.py", line 63, in from_config
    transformer = create(cfg['transformer'], **kwargs)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 229, in create
    raise ValueError("The module {} is not registered".format(name))
ValueError: The module RTDETRTransformerv2 is not registered
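
A minimal sketch of how this kind of error can arise, assuming a name-based class registry loosely modeled on the `create` call seen in the traceback. Everything below (`_REGISTRY`, `register`, `try_create`) is hypothetical and is not PaddleDetection's actual implementation; only the error message format is taken from the traceback. The point is that a config naming a class which the checked-out branch never registered fails at model creation, before any training starts.

```python
# Hypothetical minimal registry; these names are illustrative assumptions,
# not the real ppdet/core/workspace.py code.
_REGISTRY = {}


def register(cls):
    """Class decorator that records a class under its own name."""
    _REGISTRY[cls.__name__] = cls
    return cls


def create(name, **kwargs):
    """Instantiate a registered class by name, or raise like the traceback above."""
    if name not in _REGISTRY:
        raise ValueError("The module {} is not registered".format(name))
    return _REGISTRY[name](**kwargs)


@register
class RTDETRTransformer:
    """Stand-in for a module that this (hypothetical) branch does ship."""

    def __init__(self, num_layers=6):
        self.num_layers = num_layers


def try_create(name):
    """Return the created object, or the error message if the name is unknown."""
    try:
        return create(name)
    except ValueError as exc:
        return str(exc)


if __name__ == "__main__":
    print(type(try_create("RTDETRTransformer")).__name__)
    print(try_create("RTDETRTransformerv2"))
```

Under this assumption, the v2 config can only work on a branch whose source tree defines and registers the v2 transformer class.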

lyuwenyu · 2024-08-23T03:16:45Z

First try running v1 on the same data to see whether the problem also occurs there.

DoctorDream · 2024-08-23T03:38:22Z

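> First try running v1 on the same data to see whether the problem also occurs there.

Sorry, I made a mistake when I filled in the environment earlier. The corrected version is:

Environment

OS: Linux
PaddlePaddle: paddlepaddle-gpu 2.6.1.post117 and paddlepaddle-gpu 2.6.0.post117
PaddleDetection: develop (previously listed as 2.7.0, now corrected to develop, because only that branch has rtdetr v2)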
python: 3.10
CUDA: 11.7

After testing, multi-GPU training of rtdetr works fine in this environment, but multi-GPU training of rtdetr v2 raises the error reported above:

ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

DoctorDream · 2024-08-29T02:10:23Z

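> First try running v1 on the same data to see whether the problem also occurs there.

Is there a short-term solution for this problem? Thank you for your help.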

lyuwenyu · 2024-08-30T06:42:04Z

Got it, I will find time to look into this soon. In fact, the first-stage training of v1 and v2 is essentially the same.

yski · 2024-09-26T03:49:32Z

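> First try running v1 on the same data to see whether the problem also occurs there.

Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error (paddle 3.0b1 + paddledetection develop).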

DoctorDream · 2024-09-27T02:13:09Z

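> > First try running v1 on the same data to see whether the problem also occurs there.
>
> Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error (paddle 3.0b1 + paddledetection develop).

Does multi-GPU training also work for you?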

zhang-prog · 2024-09-27T12:03:27Z

@lyuwenyu could you take a look?

yski · 2024-09-28T06:21:21Z

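> > > First try running v1 on the same data to see whether the problem also occurs there.
> >
> > Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error (paddle 3.0b1 + paddledetection develop).
>
> Does multi-GPU training also work for you?

I have not tested that. I have only been using a single GPU on Windows.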

paddle-bot bot assigned nemonameless Aug 16, 2024

TingquanGao assigned Sunting78 Aug 16, 2024

Sunting78 assigned lyuwenyu and unassigned nemonameless and Sunting78 Aug 29, 2024

TingquanGao assigned zhang-prog Aug 30, 2024
