This discussion was converted from issue #5863 on October 30, 2024 06:16.
Reminder
System Info
llamafactory version: 0.9.1.dev0

Reproduction
llamafactory-cli train \
    --stage pt \
    --do_train True \
    --model_name_or_path /root/autodl-tmp/Qwen/Qwen2___5-1___5B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset class_stand \
    --cutoff_len 4096 \
    --learning_rate 1e-05 \
    --num_train_epochs 20.0 \
    --max_samples 10000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type linear \
    --max_grad_norm 0.5 \
    --logging_steps 5 \
    --save_steps 200 \
    --warmup_steps 200 \
    --optim adamw_torch \
    --packing True \
    --report_to none \
    --use_galore True \
    --output_dir /root/autodl-tmp/Qwen2.5/full/train_2024-10-29-11-37-39 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --galore_rank 8 \
    --galore_update_interval 200 \
    --galore_scale 0.25 \
    --galore_target all \
    --val_size 0.1 \
    --eval_strategy steps \
    --eval_steps 200 \
    --per_device_eval_batch_size 2 \
    --overwrite_output_dir \
    --save_total_limit 1 \
    --load_best_model_at_end True
Expected behavior
My question is this: a large language model's knowledge is usually injected during pretraining, and pretraining typically cuts long documents into individual segments. If that is the case, how is the context across those segments connected during training? For example, my dataset contains a document roughly 30k tokens long, and I want the model, after continued pretraining, to fully memorize everything in that document. How should I do this?
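To make the question concrete, here is a minimal sketch of what I understand the segmentation to look like with --packing True and --cutoff_len 4096. This is my own illustration, not LLaMA-Factory's actual preprocessing code; the Hugging Face tokenizer id and the `overlap` parameter are assumptions made only for the example.

```python
# Minimal sketch of fixed-length chunking for continued pretraining.
# NOT LLaMA-Factory's actual preprocessing code; the tokenizer id and the
# `overlap` option are assumptions for illustration only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # assumed HF id

def chunk_for_pretraining(text: str, cutoff_len: int = 4096, overlap: int = 0):
    """Split one long document into blocks of at most `cutoff_len` tokens.

    Each block becomes an independent training sample: the model only attends
    within a block, so content in different blocks never appears in the same
    context window. A nonzero `overlap` (sliding window) is one common way to
    give adjacent blocks some shared context.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max(cutoff_len - overlap, 1)
    return [ids[i : i + cutoff_len] for i in range(0, len(ids), step)]

# A ~30k-token document with cutoff_len=4096 yields about 8 disjoint blocks
# (more, partially overlapping ones if `overlap` > 0); "memorizing" the whole
# document would then rely on many epochs over these blocks, not a single pass.
```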
Others
No response