model.generate and llamafactory-cli train do_predict give inconsistent results #5845

Open
mzc2113391 opened this issue Oct 28, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@mzc2113391

Reminder

  • I have read the README and searched the existing issues.

System Info

Hello. After finishing LoRA training in the SFT stage, I am testing inference with the default template. I run llamafactory-cli train with do_predict and do_sample=False set in the config yaml, which writes the predictions to generated_predictions.jsonl. However, these differ from the results of my own model.generate implementation: the meaning of the outputs is roughly the same, but the tokens differ quite a bit. What should I do to make model.generate produce exactly the same results as generated_predictions.jsonl?

Below are my data example and code.

Data example (an actual question used in my fine-tuning):
{
"instruction": "What are the three primary colors?",
"input": "",
"output": "The three primary colors are red, blue, and yellow."
}

Based on this data example and the default template, my input should be:
user_input = "Human: What are the three primary colors?\nAssistant:"

Reproduction

## contents of llama3_lora_predict.yaml
### model
model_name_or_path: checkpoint-25000
adapter_name_or_path: checkpoint-37575

# deepspeed: ds_z3_config.json
### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: finetuning_test
dataset_dir: data
template: default
cutoff_len: 1024
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
do_sample: false
fp16: true
ddp_timeout: 180000000
## run the following command for do_predict
llamafactory-cli train llama3_lora_predict.yaml
# Python implementation: the following code is used to simulate predict
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

tokenizer = AutoTokenizer.from_pretrained('checkpoint-25000', use_fast=False, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained('checkpoint-25000', device_map="auto", torch_dtype=torch.float16)
lora_path = 'checkpoint-37575'
model = PeftModel.from_pretrained(model, model_id=lora_path)
model.eval()

# prompt built by hand according to the default template
user_input = "Human: What are the three primary colors?\nAssistant:"
input_ids = tokenizer(user_input, return_tensors="pt").input_ids.to(model.device)

output_ids = model.generate(
    input_ids=input_ids,
    max_length=1024,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# decode only the newly generated tokens, dropping the prompt
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(response)
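
A quick way to check whether the prompt itself is the source of the difference is to compare the prompt that LLaMA-Factory recorded with the hand-built one. A minimal sketch, assuming generated_predictions.jsonl in the output_dir contains "prompt"/"label"/"predict" fields (as in recent LLaMA-Factory versions); re-tokenizing the decoded prompt is only an approximation, but a string diff usually reveals a missing system prompt or separator:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('checkpoint-25000', use_fast=False, trust_remote_code=True)

# first record written by do_predict (path assumed from output_dir: predict)
with open('predict/generated_predictions.jsonl', encoding='utf-8') as f:
    record = json.loads(f.readline())

lf_prompt = record['prompt']
my_prompt = "Human: What are the three primary colors?\nAssistant:"

print(repr(lf_prompt))
print(repr(my_prompt))
print('token ids equal:',
      tokenizer(lf_prompt).input_ids == tokenizer(my_prompt).input_ids)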

Expected behavior

The built-in predict of llamafactory-cli goes through trainer.predict, and its results differ from those of model.generate. How can I make the two identical?
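
One option that may avoid the mismatch altogether is to drive inference through LLaMA-Factory's own Python API, so that exactly the same template and generation pipeline are used as in do_predict. A minimal sketch; the ChatModel import path, argument names, and Response.response_text attribute are assumptions based on LLaMA-Factory ~0.9.x and should be checked against src/llamafactory/chat in the installed version:

from llamafactory.chat import ChatModel  # assumed import path, verify in your version

chat_model = ChatModel(dict(
    model_name_or_path="checkpoint-25000",
    adapter_name_or_path="checkpoint-37575",
    template="default",
    finetuning_type="lora",
    do_sample=False,  # mirror do_sample: false from the predict yaml
))

messages = [{"role": "user", "content": "What are the three primary colors?"}]
responses = chat_model.chat(messages)
print(responses[0].response_text)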

Others

@lizbaka

lizbaka commented Oct 28, 2024

I am running into the same problem. I fine-tuned deepseek-coder-1.3B with LoRA on a binary classification task. The output format is correct both when running do_predict with llamafactory-cli and when predicting with model.generate, but the classification performance with llamafactory is much better.

At the prompt level I have already confirmed that the input prompts of the two methods are aligned, and I also tried the gen_kwargs that llamafactory uses during prediction, but I am not sure what else might still be misaligned.

Below is my own implementation of the inference code.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
from tqdm import tqdm
import argparse
import sys
import json

MAX_LENGTH = 1024
INSTRUCTION = '******\n'


def infer(model, tokenizer, code):
    # input_text = INSTRUCTION + code
    input_text = code
    messages = [
        {
            'role': 'user',
            'content': input_text
        }
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        add_generation_prompt=True, 
        return_tensors="pt", 
        max_length = MAX_LENGTH,
        truncation = True).to(model.device)
    input_prompt = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    # outputs = model.generate(input_ids, max_new_tokens = MAX_LENGTH, eos_token_id = tokenizer.eos_token_id, pad_token_id = tokenizer.pad_token_id, do_sample = False)
    # obtain from gen_kwargs
    # note: default_system from gen_kwargs is a template option, not a generate() argument, so it is not passed here
    outputs = model.generate(input_ids, do_sample=True, temperature=0.95, top_p=0.7, top_k=50, num_beams=1,
                             max_new_tokens=MAX_LENGTH, repetition_penalty=1.0, length_penalty=1.0,
                             eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
    output_text = tokenizer.decode(outputs[0][len(input_ids[0]):], skip_special_tokens=True)
    return input_prompt, output_text


def test(args):
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

    test_dataset = load_dataset('json', data_files = args.test_dataset_file)['train']
    
    model = AutoModelForCausalLM.from_pretrained(args.model_path, trust_remote_code=True, torch_dtype=torch.float32, device_map='auto').eval()
    model = PeftModel.from_pretrained(model, args.output_dir).eval()
    
    tp, tn, fp, fn = 0, 0, 0, 0
    with open(f'{args.output_dir}/generated_predictions.jsonl', 'w', encoding='utf-8') as f:
        for example in tqdm(test_dataset, total=len(test_dataset)):
            input_prompt, output_text = infer(model, tokenizer, example['instruction'])
            f.write(json.dumps({"prompt": input_prompt, "label": example["output"], "predict": output_text}, ensure_ascii=False) + '\n')
            label = example['output'].strip()
            output_text = output_text.strip()
            if label == output_text:
                if output_text == 'yes':
                    tp += 1
                else:
                    tn += 1
            else:
                if output_text == 'yes':
                    fp += 1
                else:
                    fn += 1
            acc = (tp + tn) / (tp + tn + fp + fn)
            print(f'Accuracy: {acc}')
            sys.stdout.flush()
    
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else float('nan')
    recall = tp / (tp + fn) if tp + fn > 0 else float('nan')
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else float('nan')
    fpr = fp / (fp + tn) if fp + tn > 0 else float('nan')
    print(f'Accuracy: {acc}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')
    print(f'False Positive Rate: {fpr}')


def main(args):
    test(args)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_dir', type=str)
    parser.add_argument('--test_dataset_file', type=str)
    parser.add_argument('--model_path', type=str)
    args = parser.parse_args()
    main(args)
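
Side note: with do_sample=True and temperature=0.95 every run is stochastic, so even a perfectly aligned pipeline will not reproduce the file written by do_predict exactly. For an apples-to-apples comparison it may help to force greedy decoding on both sides first. A minimal sketch for the script above (reusing model, tokenizer, input_ids, and MAX_LENGTH from infer(); set do_sample: false in the predict yaml as well):

from transformers import GenerationConfig

greedy_config = GenerationConfig(
    do_sample=False,   # disable sampling entirely
    num_beams=1,       # plain greedy search
    max_new_tokens=MAX_LENGTH,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

outputs = model.generate(input_ids, generation_config=greedy_config)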
