This article walks through the complete workflow of fine-tuning the Qwen3-Embedding-0.6B model with LoRA for a semantic similarity task: model modification, processing of the Ant Financial semantic similarity dataset (AFQMC), fine-tuning, and testing. In the experiment the model reaches 83.17% accuracy and an 83.16 F1 score on the validation set, slightly below the chinese-roberta-wwm-ext baseline. The full code from data preprocessing to training is provided, and the article is aimed at readers learning large-model fine-tuning.


1. The Semantic Similarity Task

The semantic similarity task, also called text semantic equivalence, simply asks a model to judge whether two sentences mean the same thing, much as people often phrase one idea in two different ways. Typical applications: in search engines it helps the system capture the deeper semantic link between a user query and page content and make an accurate match even when the wording differs; in intelligent customer service it can decide whether a user's question matches the intent of a standard answer in the knowledge base, enabling fast, accurate replies.

An earlier article in this column fine-tuned the hfl/chinese-roberta-wwm-ext model on this task, reaching an accuracy of 85.1485 and an F1 score of 85.1480 on the validation set; see:

Fine-tuning RoBERTa for Sentence Semantic Equivalence Recognition

By 小毕超, WeChat official account: 狂热JAVA小毕超

This article instead applies LoRA fine-tuning to the open-source Qwen3-Embedding-0.6B model for the same semantic similarity task, to see how well a Qwen3-Embedding model performs on a downstream NLU task. The dataset is again the Ant Financial semantic similarity dataset (AFQMC).

The main dependency versions used in the experiments below:

```plaintext
torch==2.6.0
transformers==4.51.3
peft==0.12.0
```

Model Modification

Pretrained model download:

https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B

The LoRA modification is implemented with the Hugging Face PEFT framework, attaching down-project/up-project adapter pairs to the q_proj, k_proj, and v_proj modules of the self-attention layers in Qwen3-Embedding-0.6B:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, TaskType

model_name = "Qwen/Qwen3-Embedding-0.6B"
model = AutoModel.from_pretrained(model_name)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj"],
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
print(model)
```

Output:

```plaintext
trainable params: 1,605,632 || all params: 597,382,144 || trainable%: 0.2688
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen3Model(
      (embed_tokens): Embedding(151669, 1024)
      (layers): ModuleList(
        (0-27): 28 x Qwen3DecoderLayer(
          (self_attn): Qwen3Attention(
            (q_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=2048, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=2048, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (k_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=1024, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=1024, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3MLP(
            (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
          (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        )
      )
      (norm): Qwen3RMSNorm((1024,), eps=1e-06)
      (rotary_emb): Qwen3RotaryEmbedding()
    )
  )
)
```

The trainable parameters amount to only 0.2688% of the total. The printed structure shows that q_proj, k_proj, and v_proj in every layer each gain a lora_A/lora_B pair that projects down to rank 8 and back up; this low-rank update is what adapts the original model's behavior.
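The 1,605,632 figure can be checked by hand from the printed shapes: each of the 28 decoder layers gets a rank-8 A/B pair on q_proj (1024→2048) and on k_proj and v_proj (both 1024→1024). A quick sanity check, independent of the libraries above:

```python
# Verify the LoRA trainable-parameter count printed above.
# Each lora.Linear adds A (in_features x r) and B (r x out_features), no bias.
r = 8
layers = 28

def lora_params(in_features, out_features, r):
    # down-projection A plus up-projection B
    return in_features * r + r * out_features

per_layer = (
    lora_params(1024, 2048, r)    # q_proj
    + lora_params(1024, 1024, r)  # k_proj
    + lora_params(1024, 1024, r)  # v_proj
)
total = per_layer * layers
print(total)  # 1605632, matching print_trainable_parameters()
```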

Dataset

Dataset download:

https://modelscope.cn/datasets/modelscope/afqmc

Dataset statistics:

| Train (train.csv) | Dev (dev.csv) | Test (test.csv) |
| --- | --- | --- |
| 34,334 | 4,316 | 3,861 |

The dataset format looks like this:

```plaintext
sentence1,sentence2,label,id
蚂蚁借呗等额还款可以换成先息后本吗,借呗有先息到期还本吗,0,0
蚂蚁花呗说我违约一次,蚂蚁花呗违约行为是什么,0,1
帮我看一下本月花呗账单有没有结清,下月花呗账单,0,2
蚂蚁借呗多长时间综合评估一次,借呗得评估多久,0,3
我的花呗账单是***,还款怎么是***,我的花呗,月结出来说让我还***元,我自己算了一下详细名单我应该还***元,1,4
蚂蚁借呗的额度可以从申请不,蚂蚁借呗节假日可以借款吗,0,5
```

A label of "1" means sentence1 and sentence2 have similar meanings; a label of "0" means their meanings differ.
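It is also worth checking the class balance before training; even the six sample rows above hint that label 0 dominates. A minimal count over just those sample labels (on the full train.csv you would call `df["label"].value_counts()` instead):

```python
from collections import Counter

# Labels of the six sample rows shown above
labels = [0, 0, 0, 0, 1, 0]
counts = Counter(labels)
positive_rate = counts[1] / len(labels)
print(dict(counts), f"positive rate: {positive_rate:.2%}")  # {0: 5, 1: 1} positive rate: 16.67%
```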

It is useful to look at the token-count distribution of `train.csv` first, so that a reasonable `max_length` can be chosen for training:

```python
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['font.sans-serif'] = ['SimHei']

# Count tokens per sentence pair
def get_num_tokens(file_path, tokenizer):
    input_num_tokens = []
    df = pd.read_csv(file_path)
    for index, row in df.iterrows():
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]
        tokens = len(tokenizer(sentence1, sentence2)["input_ids"])
        input_num_tokens.append(tokens)
    return input_num_tokens

# Compute the distribution over fixed-width intervals
def count_intervals(num_tokens, interval):
    max_value = max(num_tokens)
    intervals_count = {}
    for lower_bound in range(0, max_value + 1, interval):
        upper_bound = lower_bound + interval
        count = len([num for num in num_tokens if lower_bound <= num < upper_bound])
        intervals_count[f"{lower_bound}-{upper_bound}"] = count
    return intervals_count

def main():
    model_path = "Qwen/Qwen3-Embedding-0.6B"
    train_data_path = "dataset/train.csv"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_num_tokens = get_num_tokens(train_data_path, tokenizer)
    intervals_count = count_intervals(input_num_tokens, 20)
    print(intervals_count)
    x = [k for k, v in intervals_count.items()]
    y = [v for k, v in intervals_count.items()]
    plt.figure(figsize=(8, 6))
    bars = plt.bar(x, y)
    plt.title('训练集Token分布情况')
    plt.ylabel('数量')
    for bar in bars:
        yval = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2, yval, int(yval), va='bottom')
    plt.show()

if __name__ == '__main__':
    main()
```

![](http://cdn.zhipoai.cn/4c0d227e.jpg)

Token counts are mostly concentrated in the `20-60` range, so `max_length` can reasonably be set to `64`.
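Rather than eyeballing the histogram, the same token-count list can be used to compute exactly what fraction of samples a candidate `max_length` keeps untruncated. A small sketch, shown here with made-up token counts standing in for the real `get_num_tokens` output:

```python
def coverage(num_tokens, max_length):
    """Fraction of samples whose token count fits within max_length."""
    within = sum(1 for n in num_tokens if n <= max_length)
    return within / len(num_tokens)

# Hypothetical token counts in place of get_num_tokens("dataset/train.csv", ...)
num_tokens = [25, 31, 44, 58, 62, 70, 39, 41]
print(f"{coverage(num_tokens, 64):.2%}")  # 87.50%
```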

2. Fine-tuning
------

Build `classify_qwen_dataset.py`:

```python
# -*- coding: utf-8 -*-
from torch.utils.data import Dataset
import torch
import numpy as np
import pandas as pd

class ClassifyDataset(Dataset):
    def __init__(self, tokenizer, data_path, max_length) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = []
        if data_path:
            df = pd.read_csv(data_path)
            for index, row in df.iterrows():
                sentence1 = row["sentence1"]
                sentence2 = row["sentence2"]
                label = row["label"]
                self.data.append({
                    "sentence1": sentence1,
                    "sentence2": sentence2,
                    "label": label,
                })
        print("data load , size:", len(self.data))

    def preprocess(self, sentence1, sentence2, label):
        encoding = self.tokenizer.encode_plus(
            sentence1, sentence2,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        input_ids = encoding["input_ids"].squeeze()
        attention_mask = encoding["attention_mask"].squeeze()
        return input_ids, attention_mask, label

    def __getitem__(self, index):
        item_data = self.data[index]
        input_ids, attention_mask, label = self.preprocess(**item_data)
        return {
            "input_ids": torch.LongTensor(np.array(input_ids)),
            "attention_mask": torch.LongTensor(np.array(attention_mask)),
            "label": torch.LongTensor([label])
        }

    def __len__(self):
        return len(self.data)
```

The fine-tuning training script:

```python
# -*- coding: utf-8 -*-
import os.path
import torch
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from classify_qwen_dataset import ClassifyDataset
from tqdm import tqdm
import time, sys
from sklearn.metrics import f1_score
from peft import LoraConfig, get_peft_model, TaskType

transformers.logging.set_verbosity_error()

def train_model(model, train_loader, val_loader, optimizer,
                device, num_epochs, model_output_dir, scheduler, writer):
    batch_step = 0
    best_f1 = 0.0
    for epoch in range(num_epochs):
        time1 = time.time()
        model.train()
        for index, data in enumerate(tqdm(train_loader, file=sys.stdout, desc="Train Epoch: " + str(epoch))):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            label = data['label'].to(device)
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=label
            )
            loss = outputs.loss
            # Backward pass: compute gradients
            loss.backward()
            # Update the parameters
            optimizer.step()
            # Clear accumulated gradients
            optimizer.zero_grad()
            writer.add_scalar('Loss/train', loss, batch_step)
            batch_step += 1
            # Print the loss every 100 steps
            if index % 100 == 0 or index == len(train_loader) - 1:
                time2 = time.time()
                tqdm.write(
                    f"{index}, epoch: {epoch} -loss: {str(loss)} ; lr: {optimizer.param_groups[0]['lr']} ;each step's time spent: {(str(float(time2 - time1) / float(index + 0.0001)))}")
        # Validation
        model.eval()
        accuracy, val_loss, f1 = validate_model(model, device, val_loader)
        writer.add_scalar('Loss/val', val_loss, epoch)
        writer.add_scalar('Accuracy/val', accuracy, epoch)
        writer.add_scalar('F1/val', f1, epoch)
        print(f"val loss: {val_loss} , val accuracy: {accuracy}, f1: {f1}, epoch: {epoch}")
        # Adjust the learning rate
        scheduler.step(f1)
        # Save the best model
        if f1 > best_f1:
            best_f1 = f1
            best_model_path = os.path.join(model_output_dir, "best")
            print("Save Best Model To ", best_model_path, ", epoch: ", epoch)
            model.save_pretrained(best_model_path)
        # Save the latest model
        last_model_path = os.path.join(model_output_dir, "last")
        print("Save Last Model To ", last_model_path, ", epoch: ", epoch)
        model.save_pretrained(last_model_path)

def validate_model(model, device, val_loader):
    running_loss = 0.0
    correct = 0
    total = 0
    y_true = []
    y_pred = []
    with torch.no_grad():
        for _, data in enumerate(tqdm(val_loader, file=sys.stdout, desc="Validation Data")):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            label = data['label'].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=label
            )
            loss = outputs.loss
            logits = outputs['logits']
            total += label.size(0)
            predicted = logits.max(-1, keepdim=True)[1]
            correct += predicted.eq(label.view_as(predicted)).sum().item()
            running_loss += loss.item()
            y_true.extend(label.cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())
    f1 = f1_score(y_true, y_pred, average='macro')
    return correct / total * 100, running_loss / len(val_loader), f1 * 100

def main():
    # Base model path
    model_name = "Qwen/Qwen3-Embedding-0.6B"
    # Train & validation sets
    train_dataset_path = "dataset/train.csv"
    val_dataset_path = "dataset/dev.csv"
    max_length = 64
    num_classes = 2
    epochs = 15
    batch_size = 128
    lr = 1e-4
    model_output_dir = "output"
    logs_dir = "logs"
    # Device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    if model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        target_modules=["q_proj", "k_proj", "v_proj"],
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    print("Start Load Train Data...")
    train_params = {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": 0,
    }
    training_set = ClassifyDataset(tokenizer, train_dataset_path, max_length)
    training_loader = DataLoader(training_set, **train_params)
    print("Start Load Validation Data...")
    val_params = {
        "batch_size": batch_size,
        "shuffle": False,
        "num_workers": 0,
    }
    val_set = ClassifyDataset(tokenizer, val_dataset_path, max_length)
    val_loader = DataLoader(val_set, **val_params)
    # TensorBoard logging
    writer = SummaryWriter(logs_dir)
    # Optimizer
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
    # LR scheduler: after two epochs without improvement, multiply the LR by 0.8
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', patience=2, factor=0.8)
    model = model.to(device)
    # Start training
    print("Start Training...")
    train_model(
        model=model,
        train_loader=training_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        device=device,
        num_epochs=epochs,
        model_output_dir=model_output_dir,
        scheduler=scheduler,
        writer=writer
    )
    writer.close()

if __name__ == '__main__':
    main()
```

The training process is as follows:

With a `batch_size` of `128`, GPU memory usage is about `30.6G`. If you run out of memory, reduce `batch_size` or switch to gradient accumulation to lower the memory requirement.
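Gradient accumulation splits one large batch into several micro-batches, accumulating their gradients and stepping the optimizer only once per group, so the effective batch size stays at 128 while peak memory drops. The mechanics can be shown without torch, as a toy sketch with a scalar model y = w*x and squared-error loss; in the real loop you would scale `loss` by `1 / accum_steps` before `loss.backward()` and call `optimizer.step()` plus `optimizer.zero_grad()` only every `accum_steps` micro-batches:

```python
# Toy model: y = w * x, loss = mean((w*x - t)^2) over a batch.
# dL/dw for one (x, t) sample is 2 * (w*x - t) * x.

def grad(w, batch):
    # mean gradient over a batch of (x, t) pairs
    return sum(2 * (w * x - t) * x for x, t in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# Full-batch gradient (what one big batch of size 4 would give)
full = grad(w, data)

# Same batch split into 2 equal micro-batches; each contributes
# grad(micro) / accum_steps, mirroring the loss / accum_steps scaling
accum_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad(w, mb) / accum_steps for mb in micro_batches)

print(full, accumulated)  # identical: -22.0 -22.0
```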

At the end of training, the best result on the `dev` set is a `loss` of `0.4412`, with an accuracy of `83.17` and an F1 score of `83.16`.

This is slightly worse than the earlier `chinese-roberta-wwm-ext` fine-tune, which reached an accuracy of `85.1485` and an F1 score of `85.1480`.

You can also inspect the training curves with `tensorboard`:

```shell
tensorboard --logdir=logs --bind_all
```

Then open `http://ip:6006/` in a browser.

3. Testing the Model

```python
# -*- coding: utf-8 -*-
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd

def main():
    base_path = "Qwen/Qwen3-Embedding-0.6B"
    model_path = "output/best"
    test_path = "dataset/test.csv"
    max_length = 64
    num_classes = 2
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(base_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)
    model.to(device)
    classify = {
        0: "语义不相关", 1: "语义相似"
    }
    df = pd.read_csv(test_path)
    for index, row in df.iterrows():
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]
        encoding = tokenizer.encode_plus(
            sentence1, sentence2,
            max_length=max_length,
            return_tensors="pt"
        )
        input_ids = encoding["input_ids"].to(device)
        attention_mask = encoding["attention_mask"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        logits = outputs['logits']
        predicted = logits.max(-1, keepdim=True)[1].item()
        print(f"{sentence1} - {sentence2}  >>>  {classify[predicted]}")

if __name__ == '__main__':
    main()
```

![](https://i-blog.csdnimg.cn/img_convert/fdfba3b0ab39e9724b1fa568954afa6c.jpeg)

