Single-node multi-GPU training of rt-detrv2-r101: loss backward raises ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr. #9094

Open
2 of 3 tasks
DoctorDream opened this issue Aug 16, 2024 · 10 comments

@DoctorDream

Search before asking

  • I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

When I train rt-detr with the following command:

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error occurs:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/zqy/zqy/Codes/PaddleDetection/ppdet/engine/trainer.py", line 614, in train
    loss.backward()
  File "/usr/local/lib/python3.10/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

When I train on a single GPU, the error does not occur.
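
One way to narrow down a failure that appears only with multiple GPUs is to run one single-GPU step and list the trainable parameters that received no gradient. Parameters that take no part in the loss are a frequent source of trouble once gradients are synchronized across cards. Nothing in this report confirms that this is the cause here. The helper below is a hypothetical diagnostic sketch, and `params_without_grad` is a name introduced for illustration. It relies only on the `grad` and `stop_gradient` attributes that Paddle parameters expose.

```python
def params_without_grad(named_params):
    """Return names of trainable parameters whose gradient is still None.

    `named_params` is any iterable of (name, param) pairs, for example
    `model.named_parameters()` on a paddle.nn.Layer after `loss.backward()`.
    """
    missing = []
    for name, param in named_params:
        if param.stop_gradient:
            continue  # frozen on purpose, so no gradient is expected
        if param.grad is None:
            missing.append(name)
    return missing
```

Calling `params_without_grad(model.named_parameters())` right after `loss.backward()` in a single-GPU run gives a list to inspect. A non-empty result is a lead worth following up, not a diagnosis.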

Environment

  • OS:Linux
  • PaddlePaddle: paddlepaddle-gpu 2.6.1.post117 and paddlepaddle-gpu 2.6.0.post117
  • PaddleDetection: develop/2.7
  • python: 3.10
  • CUDA: 11.7
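
The versions listed above can also be collected programmatically, which avoids transcription mistakes in a report. The following is a minimal sketch: `collect_env` is a hypothetical helper name, and the Paddle fields are filled in only when `paddle` can be imported.

```python
import platform
import sys


def collect_env():
    """Gather version details that are useful in a bug report."""
    info = {
        "os": platform.system(),
        "python": "{}.{}".format(*sys.version_info[:2]),
    }
    try:
        import paddle  # optional: only reported when installed
    except ImportError:
        info["paddle"] = "not installed"
    else:
        info["paddle"] = paddle.__version__
        info["cuda"] = paddle.version.cuda()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The PaddleDetection branch is not covered by this sketch. It would have to be read from the checkout itself, for example with `git branch --show-current`.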

Bug description confirmation

  • I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!
@Sunting78
Collaborator

Hello, could you try switching to release/2.7.1?

@DoctorDream
Author

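Hello, could you try switching to release/2.7.1?

Hello, after I switched to the PaddleDetection release/2.7.1 branch, there was no rtdetrv2 folder under configs. When I ran the following command, as given for the release/2.7.0 branch: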

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error occurred:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 213, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 209, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 149, in run
    trainer = Trainer(cfg, mode='train')
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/engine/trainer.py", line 116, in __init__
    self.model = create(cfg.architecture)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 255, in create
    cls_kwargs.update(cls.from_config(config, **kwargs))
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/modeling/architectures/detr.py", line 63, in from_config
    transformer = create(cfg['transformer'], **kwargs)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 229, in create
    raise ValueError("The module {} is not registered".format(name))
ValueError: The module RTDETRTransformerv2 is not registered
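
The "not registered" error means the config names a class that the checked-out code never registered. The following is a simplified stand-in for a name-based registry, not PaddleDetection's actual `ppdet.core.workspace` code. `REGISTRY`, `register`, and this version of `create` are hypothetical names that only mirror the error message in the traceback.

```python
REGISTRY = {}


def register(cls):
    """Class decorator that records a class under its own name."""
    REGISTRY[cls.__name__] = cls
    return cls


def create(name, **kwargs):
    """Instantiate a registered class by name, as a config loader might."""
    if name not in REGISTRY:
        raise ValueError("The module {} is not registered".format(name))
    return REGISTRY[name](**kwargs)


@register
class RTDETRTransformer:
    """Placeholder for a class that exists on the checked-out branch."""

    def __init__(self, num_layers=6):
        self.num_layers = num_layers


transformer = create("RTDETRTransformer", num_layers=3)

try:
    create("RTDETRTransformerv2")  # named in the config, but never registered
except ValueError as err:
    message = str(err)
```

Under this model, a v2 config used on a branch without the v2 code fails at model construction, before any training step runs.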

@lyuwenyu
Collaborator

First run the same data with v1 and check whether the problem also occurs.

@DoctorDream
Author

DoctorDream commented Aug 23, 2024

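First run the same data with v1 and check whether the problem also occurs.

Sorry, I made a mistake when I filled in the environment earlier. The corrected environment is:

Environment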

  • OS:Linux
  • PaddlePaddle: paddlepaddle-gpu 2.6.1.post117 and paddlepaddle-gpu 2.6.0.post117
  • PaddleDetection: develop (originally listed as 2.7.0, now corrected to develop, because only that version has rtdetr v2)
  • python: 3.10
  • CUDA: 11.7

In my tests, multi-GPU training of rtdetr works in this environment, but multi-GPU training of rtdetr v2 fails with the error reported above:

ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

@DoctorDream
Author

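First run the same data with v1 and check whether the problem also occurs.

Is there a short-term solution to this problem? Thank you for your effort.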

@lyuwenyu
Collaborator

Got it. I'll set aside time to look into this soon. In fact, the first training stage is essentially the same for v1 and v2.

@yski

yski commented Sep 26, 2024

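First run the same data with v1 and check whether the problem also occurs.

Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error, on paddle3.0b1 + paddledetection develop.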

@DoctorDream
Author

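First run the same data with v1 and check whether the problem also occurs.

Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error, on paddle3.0b1 + paddledetection develop.

Does multi-GPU training also work for you?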

@zhang-prog
Collaborator

@lyuwenyu could you take a look at this?

@yski

yski commented Sep 28, 2024

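First run the same data with v1 and check whether the problem also occurs.

Could you please take a look at the export problem? With v2, training and inference both work, but export raises an error, on paddle3.0b1 + paddledetection develop.

Does multi-GPU training also work for you?

I haven't tested that. I have only ever used a single GPU on Windows.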
