model.generate and llamafactory-cli train do_predict give inconsistent results #5845

Open
mzc2113391 opened this issue Oct 28, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@mzc2113391

Reminder

  • I have read the README and searched the existing issues.

System Info

Hello. After finishing LoRA training in the SFT stage, I am testing inference with the default template. I run llamafactory-cli train with do_predict and do_sample=False set in the config yaml, which writes the predictions to generated_predictions.jsonl. However, these differ from the results of my own model.generate implementation: the meaning of the outputs is roughly the same, but the tokens differ quite a bit. What should I do to make model.generate produce exactly the same results as generated_predictions.jsonl?

Below are my data example and code.

Data example (an actual question used in my fine-tuning):
{
"instruction": "What are the three primary colors?",
"input": "",
"output": "The three primary colors are red, blue, and yellow."
}

Based on this data example and the default template, my input should be:
user_input = "Human: What are the three primary colors?\nAssistant:"

Reproduction

## contents of llama3_lora_predict.yaml
### model
model_name_or_path: checkpoint-25000
adapter_name_or_path: checkpoint-37575

# deepspeed: ds_z3_config.json
### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: finetuning_test
dataset_dir: data
template: default
cutoff_len: 1024
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
do_sample: false
fp16: true
ddp_timeout: 180000000
## run the following command for do_predict
llamafactory-cli train llama3_lora_predict.yaml
# Python implementation: the following code is used to simulate predict
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

tokenizer = AutoTokenizer.from_pretrained('checkpoint-25000', use_fast=False, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained('checkpoint-25000', device_map="auto", torch_dtype=torch.float16)
lora_path = 'checkpoint-37575'
model = PeftModel.from_pretrained(model, model_id=lora_path)
model.eval()

# prompt built by hand according to the default template
user_input = "Human: What are the three primary colors?\nAssistant:"
input_ids = tokenizer(user_input, return_tensors="pt").input_ids.to(model.device)

output_ids = model.generate(
    input_ids=input_ids,
    max_length=1024,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# decode only the newly generated tokens, dropping the prompt
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(response)
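
A quick way to check whether the prompt itself is the source of the difference is to compare the prompt that LLaMA-Factory recorded with the hand-built one. A minimal sketch, assuming generated_predictions.jsonl in the output_dir contains "prompt"/"label"/"predict" fields (as in recent LLaMA-Factory versions); re-tokenizing the decoded prompt is only an approximation, but a string diff usually reveals a missing system prompt or separator:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('checkpoint-25000', use_fast=False, trust_remote_code=True)

# first record written by do_predict (path assumed from output_dir: predict)
with open('predict/generated_predictions.jsonl', encoding='utf-8') as f:
    record = json.loads(f.readline())

lf_prompt = record['prompt']
my_prompt = "Human: What are the three primary colors?\nAssistant:"

print(repr(lf_prompt))
print(repr(my_prompt))
print('token ids equal:',
      tokenizer(lf_prompt).input_ids == tokenizer(my_prompt).input_ids)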

Expected behavior

The built-in predict of llamafactory-cli goes through trainer.predict, and its results differ from those of model.generate. How can I make the two identical?
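
One option that may avoid the mismatch altogether is to drive inference through LLaMA-Factory's own Python API, so that exactly the same template and generation pipeline are used as in do_predict. A minimal sketch; the ChatModel import path, argument names, and Response.response_text attribute are assumptions based on LLaMA-Factory ~0.9.x and should be checked against src/llamafactory/chat in the installed version:

from llamafactory.chat import ChatModel  # assumed import path, verify in your version

chat_model = ChatModel(dict(
    model_name_or_path="checkpoint-25000",
    adapter_name_or_path="checkpoint-37575",
    template="default",
    finetuning_type="lora",
    do_sample=False,  # mirror do_sample: false from the predict yaml
))

messages = [{"role": "user", "content": "What are the three primary colors?"}]
responses = chat_model.chat(messages)
print(responses[0].response_text)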

Others

@lizbaka

lizbaka commented Oct 28, 2024

I am running into the same problem. I fine-tuned deepseek-coder-1.3B with LoRA on a binary classification task. The output format is correct both when running do_predict with llamafactory-cli and when predicting with model.generate, but the classification performance with llamafactory is much better.

At the prompt level I have already confirmed that the input prompts of the two methods are aligned, and I also tried the gen_kwargs that llamafactory uses during prediction, but I am not sure what else might still be misaligned.

Below is my own implementation of the inference code.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
from tqdm import tqdm
import argparse
import sys
import json

MAX_LENGTH = 1024
INSTRUCTION = '******\n'


def infer(model, tokenizer, code):
    # input_text = INSTRUCTION + code
    input_text = code
    messages = [
        {
            'role': 'user',
            'content': input_text
        }
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        add_generation_prompt=True, 
        return_tensors="pt", 
        max_length = MAX_LENGTH,
        truncation = True).to(model.device)
    input_prompt = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    # outputs = model.generate(input_ids, max_new_tokens = MAX_LENGTH, eos_token_id = tokenizer.eos_token_id, pad_token_id = tokenizer.pad_token_id, do_sample = False)
    # obtain from gen_kwargs
    # note: default_system from gen_kwargs is a template option, not a generate() argument, so it is not passed here
    outputs = model.generate(input_ids, do_sample=True, temperature=0.95, top_p=0.7, top_k=50, num_beams=1,
                             max_new_tokens=MAX_LENGTH, repetition_penalty=1.0, length_penalty=1.0,
                             eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
    output_text = tokenizer.decode(outputs[0][len(input_ids[0]):], skip_special_tokens=True)
    return input_prompt, output_text


def test(args):
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

    test_dataset = load_dataset('json', data_files = args.test_dataset_file)['train']
    
    model = AutoModelForCausalLM.from_pretrained(args.model_path, trust_remote_code=True, torch_dtype=torch.float32, device_map='auto').eval()
    model = PeftModel.from_pretrained(model, args.output_dir).eval()
    
    tp, tn, fp, fn = 0, 0, 0, 0
    with open(f'{args.output_dir}/generated_predictions.jsonl', 'w', encoding='utf-8') as f:
        for example in tqdm(test_dataset, total=len(test_dataset)):
            input_prompt, output_text = infer(model, tokenizer, example['instruction'])
            f.write(json.dumps({"prompt": input_prompt, "label": example["output"], "predict": output_text}, ensure_ascii=False) + '\n')
            label = example['output'].strip()
            output_text = output_text.strip()
            if label == output_text:
                if output_text == 'yes':
                    tp += 1
                else:
                    tn += 1
            else:
                if output_text == 'yes':
                    fp += 1
                else:
                    fn += 1
            acc = (tp + tn) / (tp + tn + fp + fn)
            print(f'Accuracy: {acc}')
            sys.stdout.flush()
    
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else float('nan')
    recall = tp / (tp + fn) if tp + fn > 0 else float('nan')
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else float('nan')
    fpr = fp / (fp + tn) if fp + tn > 0 else float('nan')
    print(f'Accuracy: {acc}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')
    print(f'False Positive Rate: {fpr}')


def main(args):
    test(args)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_dir', type=str)
    parser.add_argument('--test_dataset_file', type=str)
    parser.add_argument('--model_path', type=str)
    args = parser.parse_args()
    main(args)
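
Side note: with do_sample=True and temperature=0.95 every run is stochastic, so even a perfectly aligned pipeline will not reproduce the file written by do_predict exactly. For an apples-to-apples comparison it may help to force greedy decoding on both sides first. A minimal sketch for the script above (reusing model, tokenizer, input_ids, and MAX_LENGTH from infer(); set do_sample: false in the predict yaml as well):

from transformers import GenerationConfig

greedy_config = GenerationConfig(
    do_sample=False,   # disable sampling entirely
    num_beams=1,       # plain greedy search
    max_new_tokens=MAX_LENGTH,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

outputs = model.generate(input_ids, generation_config=greedy_config)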
