After replacing FairMOT's backbone with X101, training runs normally, but eval_mot floods the log with output #9139

Open
1 task done
zouhan6806504 opened this issue Sep 13, 2024 · 11 comments

@zouhan6806504

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

The configuration file is as follows:

use_gpu: true
use_xpu: false
use_mlu: false
use_npu: false
log_iter: 100
save_dir: /home/aistudio/output
#save_dir: /home/aistudio/output
snapshot_epoch: 1
print_flops: false
print_params: false

# Exporting the model
export:
  post_process: True  # Whether post-processing is included in the network when export model.
  nms: True           # Whether NMS is included in the network when export model.
  benchmark: False    # It is used to testing model performance, if set `True`, post-process and NMS will not be exported.
  fuse_conv_bn: False


metric: MCMOT
num_classes: 228
#/home/aistudio/data/mot
TrainDataset:
  !MCMOTDataSet
    dataset_dir: /home/aistudio/data/mot # change this to your own dataset directory
    image_lists: ['IKCEST.train']
    data_fields: ['image', 'gt_bbox', 'gt_class', 'gt_ide']
    label_list: /home/aistudio/data/mot/label_list.txt

EvalMOTDataset:
  !MOTImageFolder
    dataset_dir: /home/aistudio/data/mot
    data_root: IKCEST/images/test/
    keep_ori_im: False # set True if save visualization images or video, or used in DeepSORT
    anno_path: /home/aistudio/data/mot/label_list.txt

TestMOTDataset:
  !MOTImageFolder
    dataset_dir: /home/aistudio/data/mot/IKCEST/images/test/
    keep_ori_im: True # set True if save visualization images or video
    anno_path: /home/aistudio/data/mot/label_list.txt


#pretrain_weights: https://paddledet.bj.bcebos.com/models/centernet_dla34_140e_coco.pdparams
pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/ResNeXt101_vd_64x4d_pretrained.pdparams
architecture: FairMOT
for_mot: True

FairMOT:
  detector: CenterNet
  reid: FairMOTEmbeddingHead
  loss: FairMOTLoss
  tracker: JDETracker # multi-class tracker

CenterNet:
  backbone: ResNet
  neck: CenterNetDLAFPN
  head: CenterNetHead
  post_process: CenterNetPostProcess

ResNet:
  # for ResNeXt: groups, base_width, base_channels
  depth: 101
  groups: 64
  base_width: 4
  variant: d
  norm_type: bn
  freeze_at: 0
  return_idx: [0,1,2,3]
  num_stages: 4
  dcn_v2_stages: [1,2,3]

CenterNetDLAFPN:
  down_ratio: 4
  last_level: 3
  out_channel: 0
  first_level: 0
  dcn_v2: True
  with_sge: True



CenterNetHead:
  head_planes: 256
  prior_bias: -2.19
  regress_ltrb: False
  size_loss: 'L1'
  loss_weight: {'heatmap': 1.0, 'size': 0.1, 'offset': 1.0, 'iou': 0.0}
  add_iou: False

FairMOTEmbeddingHead:
  ch_head: 256
  ch_emb: 128

CenterNetPostProcess:
  max_per_img: 200
  down_ratio: 4
  regress_ltrb: False

JDETracker:
  conf_thres: 0.4
  tracked_thresh: 0.4
  metric_type: cosine
  min_box_area: 0
  vertical_ratio: 0 # for pedestrian
  use_byte: True
  match_thres: 0.8
  low_conf_thres: 0.2

weights: /home/aistudio/output/dla
#weights: /home/aistudio/output/dla

epoch: 30
LearningRate:
  base_lr: 0.00025
  schedulers:
    - name: CosineDecay
      max_epochs: 36
    - name: LinearWarmup
      start_factor: 0.
      epochs: 1

OptimizerBuilder:
  regularizer: false
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    param_groups:
      - params: ['absolute_pos_embed', 'relative_position_bias_table', 'norm']
        weight_decay: 0.0

worker_num: 1
TrainReader:
  inputs_def:
    image_shape: [3, 608, 1088]
  sample_transforms:
    - Decode: {}
    - RGBReverse: {}
    - AugmentHSV: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - MOTRandomAffine: {reject_outside: False}
    - RandomFlip: {}
    - BboxXYXY2XYWH: {}
    - NormalizeBox: {}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1]}
    - RGBReverse: {}
    - Permute: {}
  batch_transforms:
    - Gt2FairMOTTarget: {}
  batch_size: 4
  shuffle: True
  drop_last: True
  use_shared_memory: True

EvalMOTReader:
  sample_transforms:
    - Decode: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1], is_scale: True}
    - Permute: {}
  batch_size: 1


TestMOTReader:
  inputs_def:
    image_shape: [3, 608, 1088]
  sample_transforms:
    - Decode: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1], is_scale: True}
    - Permute: {}
  batch_size: 1

During eval, the following warning is printed over and over:

Warning:: 0D Tensor cannot be used as 'Tensor.numpy()[0]' . In order to avoid this problem, 0D Tensor will be changed to 1D numpy currently, but it's not correct and will be removed in release 2.6. For Tensor contain only one element, Please modify 'Tensor.numpy()[0]' to 'float(Tensor)' as soon as possible, otherwise 'Tensor.numpy()[0]' will raise error in release 2.6.
I0913 17:10:51.835693 264737 eager_method.cc:140] Warning:: 0D Tensor cannot be used as 'Tensor.numpy()[0]' . In order to avoid this problem, 0D Tensor will be changed to 1D numpy currently, but it's not correct and will be removed in release 2.6. For Tensor contain only one element, Please modify 'Tensor.numpy()[0]' to 'float(Tensor)' as soon as possible, otherwise 'Tensor.numpy()[0]' will raise error in release 2.6.
I0913 17:10:51.835812 264737 eager_method.cc:140]
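
The warning text itself recommends replacing 'Tensor.numpy()[0]' with 'float(Tensor)' for single-element tensors. A minimal sketch of a helper that avoids indexing altogether, assuming the value either exposes `.item()` (as `paddle.Tensor` and NumPy arrays do) or can be passed to `float()`. The name `to_python_float` is hypothetical and is not part of Paddle or PaddleDetection:

```python
def to_python_float(value):
    """Convert a single-element tensor, array, or number to a Python float.

    Hypothetical helper: it never indexes with [0], so it behaves the same
    for 0-D inputs and for 1-D inputs that hold exactly one element.
    """
    # Tensors and arrays expose .item() for single-element values.
    if hasattr(value, "item"):
        return float(value.item())
    # Plain Python numbers, and anything else float() accepts.
    return float(value)
```

A call site such as `score = x.numpy()[0]` would then become `score = to_python_float(x)`.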

When swapping components like this, what else needs to be changed?

@Bobholamovic
Member

Please provide the PaddleDetection and Paddle versions you are using so that we can investigate.
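
One way to collect that information is to query the installed distributions. This is only a sketch: the distribution names `paddlepaddle`, `paddlepaddle-gpu`, and `paddledet` are assumptions about how the packages were installed, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed distribution names; adjust them if your install differs.
CANDIDATES = ("paddlepaddle", "paddlepaddle-gpu", "paddledet")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```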

@zouhan6806504
Author

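> Please provide the PaddleDetection and Paddle versions you are using so that we can investigate.

Shared link to the project "ikcest2024_notebook" (valid for three days): https://aistudio.baidu.com/studio/project/partial/verify/8294004/4a9858342afa488d8eac9acd98c09667
Run the cells in order to reproduce the problem.
PaddleDetection 2.7, Paddle 2.5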

@Bobholamovic
Member

You may need to change x.numpy()[0] to np.array(x.numpy())[0] in these two places:
https://github.com/search?q=repo%3APaddlePaddle%2FPaddleDetection%20%22.numpy()%5B0%5D%22&type=code
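
The same search can be run against a local checkout, which also catches call sites in locally modified files. A minimal sketch using only the standard library. `find_numpy_index_calls` is a hypothetical helper, and the pattern only matches the literal `.numpy()[0]` form, so it will miss variants such as an intermediate variable:

```python
import re
from pathlib import Path

# Matches `.numpy()[0]`, allowing optional whitespace between the tokens.
PATTERN = re.compile(r"\.numpy\(\s*\)\s*\[\s*0\s*\]")


def find_numpy_index_calls(root):
    """Yield (path, line number, stripped line) for each match under root."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                yield str(path), lineno, line.strip()
```

For example, `for hit in find_numpy_index_calls("ppdet"): print(hit)` would list every candidate line, assuming the checkout keeps its sources under a `ppdet` directory.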

@zouhan6806504
Author

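> You may need to change x.numpy()[0] to np.array(x.numpy())[0] in these two places: https://github.com/search?q=repo%3APaddlePaddle%2FPaddleDetection%20%22.numpy()%5B0%5D%22&type=code

After changing these places, I still get "0D Tensor cannot be used as 'Tensor.numpy()[0]'".
I then tried switching Paddle to 2.6. The warnings did disappear, but the generated txt files are all empty.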
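
To confirm whether every result file is empty or only some of them, the row counts can be tallied per file. This is a sketch under assumptions: the results are taken to be plain-text `.txt` files with one row per line, and `count_rows` and `summarize_result_files` are hypothetical helper names, not part of PaddleDetection.

```python
from pathlib import Path


def count_rows(text):
    """Count the non-blank lines in the text of one result file."""
    return sum(1 for line in text.splitlines() if line.strip())


def summarize_result_files(result_dir):
    """Map each *.txt file under result_dir to its number of non-blank lines."""
    counts = {}
    for path in sorted(Path(result_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts[str(path)] = count_rows(text)
    return counts
```

Files that map to 0 have no tracking rows at all. If only some sequences are empty, that points in a different direction than every file being empty.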

@Bobholamovic
Member

Are there any error messages in the log?

@zouhan6806504
Author

> Are there any error messages in the log?

After switching to Paddle 2.6, the log shows no errors.

@Bobholamovic
Member

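If there are no errors, could the empty results be a model quality problem? Did you use your own data, and what is the model's accuracy on the validation set?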

@zouhan6806504
Author

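> If there are no errors, could the empty results be a model quality problem? Did you use your own data, and what is the model's accuracy on the validation set?

The training data is identical, so I don't think the model is the problem. I compared it against DLA-34 with --eval during training, and the loss is lower than DLA-34's. All I did was swap the backbone in fairmot_dla34: DLA -> ResNet101.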

@zouhan6806504
Author

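> If there are no errors, could the empty results be a model quality problem? Did you use your own data, and what is the model's accuracy on the validation set?

The DLA loss never gets below about 4.4, while ResNet101 reaches 4.2.
Below is the ResNet101 training log: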

[09/12 09:28:19] ppdet.engine INFO: Epoch: [13] [10500/10687] learning_rate: 0.000174 loss: 4.230835 heatmap_loss: 0.332204 size_loss: 0.522570 offset_loss: 0.156625 det_loss: 0.551854 reid_loss: 1159.913696 eta: 5 days, 7:38:55 batch_cost: 2.6880 data_cost: 0.0003 ips: 1.4881 images/s
[09/12 09:32:47] ppdet.engine INFO: Epoch: [13] [10600/10687] learning_rate: 0.000174 loss: 4.205076 heatmap_loss: 0.312019 size_loss: 0.504238 offset_loss: 0.155205 det_loss: 0.523159 reid_loss: 1159.933716 eta: 5 days, 7:34:25 batch_cost: 2.6793 data_cost: 0.0003 ips: 1.4929 images/s

The DLA log:

[09/10 13:17:44] ppdet.engine INFO: Epoch: [13] [5200/5343] learning_rate: 0.000279 loss: 4.377353 heatmap_loss: 0.517096 size_loss: 0.592028 offset_loss: 0.163601 det_loss: 0.738086 reid_loss: 1160.759399 eta: 4 days, 0:43:39 batch_cost: 2.8901 data_cost: 1.8178 ips: 2.7681 images/s
[09/10 13:22:36] ppdet.engine INFO: Epoch: [13] [5300/5343] learning_rate: 0.000279 loss: 4.386300 heatmap_loss: 0.523919 size_loss: 0.597360 offset_loss: 0.163826 det_loss: 0.752244 reid_loss: 1160.773315 eta: 4 days, 0:26:39 batch_cost: 2.9197 data_cost: 1.8222 ips: 2.7400 images/s

Because of GPU memory limits, the batch sizes of the two runs differ by a factor of two.
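
The batch sizes can be read back from the log lines themselves. A minimal sketch, assuming `ips` is images per second and `batch_cost` is seconds per batch, which the numbers in both logs are consistent with. `parse_log_line` is a hypothetical helper, and the two strings are the first line of each log above:

```python
import re

# Matches "name: 1.234" pairs in a ppdet.engine training log line.
FIELD = re.compile(r"(\w+): ([0-9]+\.[0-9]+)")
# Matches the "[step/steps_per_epoch]" counter.
STEPS = re.compile(r"\[(\d+)/(\d+)\]")


def parse_log_line(line):
    """Extract the numeric fields and derive the batch size from one log line."""
    fields = {name: float(value) for name, value in FIELD.findall(line)}
    _, total = STEPS.search(line).groups()
    fields["steps_per_epoch"] = int(total)
    # images/s times s/batch gives images/batch.
    fields["batch_size"] = round(fields["ips"] * fields["batch_cost"])
    return fields


RESNET_LINE = (
    "[09/12 09:28:19] ppdet.engine INFO: Epoch: [13] [10500/10687] "
    "learning_rate: 0.000174 loss: 4.230835 heatmap_loss: 0.332204 "
    "size_loss: 0.522570 offset_loss: 0.156625 det_loss: 0.551854 "
    "reid_loss: 1159.913696 eta: 5 days, 7:38:55 batch_cost: 2.6880 "
    "data_cost: 0.0003 ips: 1.4881 images/s"
)
DLA_LINE = (
    "[09/10 13:17:44] ppdet.engine INFO: Epoch: [13] [5200/5343] "
    "learning_rate: 0.000279 loss: 4.377353 heatmap_loss: 0.517096 "
    "size_loss: 0.592028 offset_loss: 0.163601 det_loss: 0.738086 "
    "reid_loss: 1160.759399 eta: 4 days, 0:43:39 batch_cost: 2.8901 "
    "data_cost: 1.8178 ips: 2.7681 images/s"
)

if __name__ == "__main__":
    for label, line in (("ResNet101", RESNET_LINE), ("DLA", DLA_LINE)):
        parsed = parse_log_line(line)
        print(label, parsed["batch_size"], parsed["steps_per_epoch"], parsed["loss"])
```

For these two lines the derived batch sizes are 4 and 8. With 10687 and 5343 steps per epoch, that is 42748 versus 42744 images per epoch, which fits the same dataset being used with drop_last: True.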

@Bobholamovic
Member

On which data are the test results empty?

@lyuwenyu
Collaborator

lyuwenyu commented Oct 17, 2024

Does DLA produce normal output?
