This article walks through the complete workflow of fine-tuning the Qwen3-Embedding-0.6B model with LoRA for a semantic similarity task: model modification, processing of the Ant Financial semantic similarity dataset (AFQMC), fine-tuning, and testing. In the experiment the model reaches 83.17% accuracy and an 83.16 F1 score on the validation set, slightly below the chinese-roberta-wwm-ext baseline. The full code from data preprocessing to training is provided, and the article is aimed at readers learning large-model fine-tuning.


1. The Semantic Similarity Task

The semantic similarity task, also called text semantic equivalence, simply asks a model to judge whether two sentences mean the same thing, much as people often phrase one idea in two different ways. Typical applications: in search engines it helps the system capture the deeper semantic link between a user query and page content and make an accurate match even when the wording differs; in intelligent customer service it can decide whether a user's question matches the intent of a standard answer in the knowledge base, enabling fast, accurate replies.

An earlier article in this column fine-tuned the hfl/chinese-roberta-wwm-ext model on this task, reaching an accuracy of 85.1485 and an F1 score of 85.1480 on the validation set; see:

Fine-tuning RoBERTa for Sentence Semantic Equivalence Recognition

By 小毕超, WeChat official account: 狂热JAVA小毕超

This article instead applies LoRA fine-tuning to the open-source Qwen3-Embedding-0.6B model for the same semantic similarity task, to see how well a Qwen3-Embedding model performs on a downstream NLU task. The dataset is again the Ant Financial semantic similarity dataset (AFQMC).

The main dependency versions used in the experiments below:

```plaintext
torch==2.6.0
transformers==4.51.3
peft==0.12.0
```

Model Modification

Pretrained model download:

https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B

The LoRA modification is implemented with the Hugging Face PEFT framework, attaching down-project/up-project adapter pairs to the q_proj, k_proj, and v_proj modules of the self-attention layers in Qwen3-Embedding-0.6B:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, TaskType

model_name = "Qwen/Qwen3-Embedding-0.6B"
model = AutoModel.from_pretrained(model_name)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj"],
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
print(model)
```

Output:

```plaintext
trainable params: 1,605,632 || all params: 597,382,144 || trainable%: 0.2688
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen3Model(
      (embed_tokens): Embedding(151669, 1024)
      (layers): ModuleList(
        (0-27): 28 x Qwen3DecoderLayer(
          (self_attn): Qwen3Attention(
            (q_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=2048, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=2048, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (k_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=1024, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (v_proj): lora.Linear(
              (base_layer): Linear(in_features=1024, out_features=1024, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1024, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=1024, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3MLP(
            (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
          (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        )
      )
      (norm): Qwen3RMSNorm((1024,), eps=1e-06)
      (rotary_emb): Qwen3RotaryEmbedding()
    )
  )
)
```

The trainable parameters amount to only 0.2688% of the total. The printed structure shows that q_proj, k_proj, and v_proj in every layer each gain a lora_A/lora_B pair that projects down to rank 8 and back up; this low-rank update is what adapts the original model's behavior.
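The 1,605,632 figure can be checked by hand from the printed shapes: each of the 28 decoder layers gets a rank-8 A/B pair on q_proj (1024→2048) and on k_proj and v_proj (both 1024→1024). A quick sanity check, independent of the libraries above:

```python
# Verify the LoRA trainable-parameter count printed above.
# Each lora.Linear adds A (in_features x r) and B (r x out_features), no bias.
r = 8
layers = 28

def lora_params(in_features, out_features, r):
    # down-projection A plus up-projection B
    return in_features * r + r * out_features

per_layer = (
    lora_params(1024, 2048, r)    # q_proj
    + lora_params(1024, 1024, r)  # k_proj
    + lora_params(1024, 1024, r)  # v_proj
)
total = per_layer * layers
print(total)  # 1605632, matching print_trainable_parameters()
```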

Dataset

Dataset download:

https://modelscope.cn/datasets/modelscope/afqmc

Dataset statistics:

| Train (train.csv) | Dev (dev.csv) | Test (test.csv) |
| --- | --- | --- |
| 34,334 | 4,316 | 3,861 |

The dataset format looks like this:

```plaintext
sentence1,sentence2,label,id
蚂蚁借呗等额还款可以换成先息后本吗,借呗有先息到期还本吗,0,0
蚂蚁花呗说我违约一次,蚂蚁花呗违约行为是什么,0,1
帮我看一下本月花呗账单有没有结清,下月花呗账单,0,2
蚂蚁借呗多长时间综合评估一次,借呗得评估多久,0,3
我的花呗账单是***,还款怎么是***,我的花呗,月结出来说让我还***元,我自己算了一下详细名单我应该还***元,1,4
蚂蚁借呗的额度可以从申请不,蚂蚁借呗节假日可以借款吗,0,5
```

A label of "1" means sentence1 and sentence2 have similar meanings; a label of "0" means their meanings differ.
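It is also worth checking the class balance before training; even the six sample rows above hint that label 0 dominates. A minimal count over just those sample labels (on the full train.csv you would call `df["label"].value_counts()` instead):

```python
from collections import Counter

# Labels of the six sample rows shown above
labels = [0, 0, 0, 0, 1, 0]
counts = Counter(labels)
positive_rate = counts[1] / len(labels)
print(dict(counts), f"positive rate: {positive_rate:.2%}")  # {0: 5, 1: 1} positive rate: 16.67%
```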

It is useful to look at the token-count distribution of `train.csv` first, so that a reasonable `max_length` can be chosen for training:

```python
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['font.sans-serif'] = ['SimHei']

# Count tokens per sentence pair
def get_num_tokens(file_path, tokenizer):
    input_num_tokens = []
    df = pd.read_csv(file_path)
    for index, row in df.iterrows():
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]
        tokens = len(tokenizer(sentence1, sentence2)["input_ids"])
        input_num_tokens.append(tokens)
    return input_num_tokens

# Compute the distribution over fixed-width intervals
def count_intervals(num_tokens, interval):
    max_value = max(num_tokens)
    intervals_count = {}
    for lower_bound in range(0, max_value + 1, interval):
        upper_bound = lower_bound + interval
        count = len([num for num in num_tokens if lower_bound <= num < upper_bound])
        intervals_count[f"{lower_bound}-{upper_bound}"] = count
    return intervals_count

def main():
    model_path = "Qwen/Qwen3-Embedding-0.6B"
    train_data_path = "dataset/train.csv"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_num_tokens = get_num_tokens(train_data_path, tokenizer)
    intervals_count = count_intervals(input_num_tokens, 20)
    print(intervals_count)
    x = [k for k, v in intervals_count.items()]
    y = [v for k, v in intervals_count.items()]
    plt.figure(figsize=(8, 6))
    bars = plt.bar(x, y)
    plt.title('训练集Token分布情况')
    plt.ylabel('数量')
    for bar in bars:
        yval = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2, yval, int(yval), va='bottom')
    plt.show()

if __name__ == '__main__':
    main()
```

![](http://cdn.zhipoai.cn/4c0d227e.jpg)

Token counts are mostly concentrated in the `20-60` range, so `max_length` can reasonably be set to `64`.
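Rather than eyeballing the histogram, the same token-count list can be used to compute exactly what fraction of samples a candidate `max_length` keeps untruncated. A small sketch, shown here with made-up token counts standing in for the real `get_num_tokens` output:

```python
def coverage(num_tokens, max_length):
    """Fraction of samples whose token count fits within max_length."""
    within = sum(1 for n in num_tokens if n <= max_length)
    return within / len(num_tokens)

# Hypothetical token counts in place of get_num_tokens("dataset/train.csv", ...)
num_tokens = [25, 31, 44, 58, 62, 70, 39, 41]
print(f"{coverage(num_tokens, 64):.2%}")  # 87.50%
```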

2. Fine-tuning
------

Build `classify_qwen_dataset.py`:

```python
# -*- coding: utf-8 -*-
from torch.utils.data import Dataset
import torch
import numpy as np
import pandas as pd

class ClassifyDataset(Dataset):
    def __init__(self, tokenizer, data_path, max_length) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = []
        if data_path:
            df = pd.read_csv(data_path)
            for index, row in df.iterrows():
                sentence1 = row["sentence1"]
                sentence2 = row["sentence2"]
                label = row["label"]
                self.data.append({
                    "sentence1": sentence1,
                    "sentence2": sentence2,
                    "label": label,
                })
        print("data load , size:", len(self.data))

    def preprocess(self, sentence1, sentence2, label):
        encoding = self.tokenizer.encode_plus(
            sentence1, sentence2,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        input_ids = encoding["input_ids"].squeeze()
        attention_mask = encoding["attention_mask"].squeeze()
        return input_ids, attention_mask, label

    def __getitem__(self, index):
        item_data = self.data[index]
        input_ids, attention_mask, label = self.preprocess(**item_data)
        return {
            "input_ids": torch.LongTensor(np.array(input_ids)),
            "attention_mask": torch.LongTensor(np.array(attention_mask)),
            "label": torch.LongTensor([label])
        }

    def __len__(self):
        return len(self.data)
```

The fine-tuning training script:

```python
# -*- coding: utf-8 -*-
import os.path
import torch
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from classify_qwen_dataset import ClassifyDataset
from tqdm import tqdm
import time, sys
from sklearn.metrics import f1_score
from peft import LoraConfig, get_peft_model, TaskType

transformers.logging.set_verbosity_error()

def train_model(model, train_loader, val_loader, optimizer,
                device, num_epochs, model_output_dir, scheduler, writer):
    batch_step = 0
    best_f1 = 0.0
    for epoch in range(num_epochs):
        time1 = time.time()
        model.train()
        for index, data in enumerate(tqdm(train_loader, file=sys.stdout, desc="Train Epoch: " + str(epoch))):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            label = data['label'].to(device)
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=label
            )
            loss = outputs.loss
            # Backward pass: compute gradients
            loss.backward()
            # Update the parameters
            optimizer.step()
            # Clear accumulated gradients
            optimizer.zero_grad()
            writer.add_scalar('Loss/train', loss, batch_step)
            batch_step += 1
            # Print the loss every 100 steps
            if index % 100 == 0 or index == len(train_loader) - 1:
                time2 = time.time()
                tqdm.write(
                    f"{index}, epoch: {epoch} -loss: {str(loss)} ; lr: {optimizer.param_groups[0]['lr']} ;each step's time spent: {(str(float(time2 - time1) / float(index + 0.0001)))}")
        # Validation
        model.eval()
        accuracy, val_loss, f1 = validate_model(model, device, val_loader)
        writer.add_scalar('Loss/val', val_loss, epoch)
        writer.add_scalar('Accuracy/val', accuracy, epoch)
        writer.add_scalar('F1/val', f1, epoch)
        print(f"val loss: {val_loss} , val accuracy: {accuracy}, f1: {f1}, epoch: {epoch}")
        # Adjust the learning rate
        scheduler.step(f1)
        # Save the best model
        if f1 > best_f1:
            best_f1 = f1
            best_model_path = os.path.join(model_output_dir, "best")
            print("Save Best Model To ", best_model_path, ", epoch: ", epoch)
            model.save_pretrained(best_model_path)
        # Save the latest model
        last_model_path = os.path.join(model_output_dir, "last")
        print("Save Last Model To ", last_model_path, ", epoch: ", epoch)
        model.save_pretrained(last_model_path)

def validate_model(model, device, val_loader):
    running_loss = 0.0
    correct = 0
    total = 0
    y_true = []
    y_pred = []
    with torch.no_grad():
        for _, data in enumerate(tqdm(val_loader, file=sys.stdout, desc="Validation Data")):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            label = data['label'].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=label
            )
            loss = outputs.loss
            logits = outputs['logits']
            total += label.size(0)
            predicted = logits.max(-1, keepdim=True)[1]
            correct += predicted.eq(label.view_as(predicted)).sum().item()
            running_loss += loss.item()
            y_true.extend(label.cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())
    f1 = f1_score(y_true, y_pred, average='macro')
    return correct / total * 100, running_loss / len(val_loader), f1 * 100

def main():
    # Base model path
    model_name = "Qwen/Qwen3-Embedding-0.6B"
    # Train & validation sets
    train_dataset_path = "dataset/train.csv"
    val_dataset_path = "dataset/dev.csv"
    max_length = 64
    num_classes = 2
    epochs = 15
    batch_size = 128
    lr = 1e-4
    model_output_dir = "output"
    logs_dir = "logs"
    # Device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    if model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        target_modules=["q_proj", "k_proj", "v_proj"],
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    print("Start Load Train Data...")
    train_params = {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": 0,
    }
    training_set = ClassifyDataset(tokenizer, train_dataset_path, max_length)
    training_loader = DataLoader(training_set, **train_params)
    print("Start Load Validation Data...")
    val_params = {
        "batch_size": batch_size,
        "shuffle": False,
        "num_workers": 0,
    }
    val_set = ClassifyDataset(tokenizer, val_dataset_path, max_length)
    val_loader = DataLoader(val_set, **val_params)
    # TensorBoard logging
    writer = SummaryWriter(logs_dir)
    # Optimizer
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
    # LR scheduler: after two epochs without improvement, multiply the LR by 0.8
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', patience=2, factor=0.8)
    model = model.to(device)
    # Start training
    print("Start Training...")
    train_model(
        model=model,
        train_loader=training_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        device=device,
        num_epochs=epochs,
        model_output_dir=model_output_dir,
        scheduler=scheduler,
        writer=writer
    )
    writer.close()

if __name__ == '__main__':
    main()
```

The training process is as follows:

With a `batch_size` of `128`, GPU memory usage is about `30.6G`. If you run out of memory, reduce `batch_size` or switch to gradient accumulation to lower the memory requirement.
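Gradient accumulation splits one large batch into several micro-batches, accumulating their gradients and stepping the optimizer only once per group, so the effective batch size stays at 128 while peak memory drops. The mechanics can be shown without torch, as a toy sketch with a scalar model y = w*x and squared-error loss; in the real loop you would scale `loss` by `1 / accum_steps` before `loss.backward()` and call `optimizer.step()` plus `optimizer.zero_grad()` only every `accum_steps` micro-batches:

```python
# Toy model: y = w * x, loss = mean((w*x - t)^2) over a batch.
# dL/dw for one (x, t) sample is 2 * (w*x - t) * x.

def grad(w, batch):
    # mean gradient over a batch of (x, t) pairs
    return sum(2 * (w * x - t) * x for x, t in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# Full-batch gradient (what one big batch of size 4 would give)
full = grad(w, data)

# Same batch split into 2 equal micro-batches; each contributes
# grad(micro) / accum_steps, mirroring the loss / accum_steps scaling
accum_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad(w, mb) / accum_steps for mb in micro_batches)

print(full, accumulated)  # identical: -22.0 -22.0
```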

At the end of training, the best result on the `dev` set is a `loss` of `0.4412`, with an accuracy of `83.17` and an F1 score of `83.16`.

This is slightly worse than the earlier `chinese-roberta-wwm-ext` fine-tune, which reached an accuracy of `85.1485` and an F1 score of `85.1480`.

You can also inspect the training curves with `tensorboard`:

```shell
tensorboard --logdir=logs --bind_all
```

Then open `http://ip:6006/` in a browser.

3. Testing the Model

```python
# -*- coding: utf-8 -*-
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd

def main():
    base_path = "Qwen/Qwen3-Embedding-0.6B"
    model_path = "output/best"
    test_path = "dataset/test.csv"
    max_length = 64
    num_classes = 2
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(base_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)
    model.to(device)
    classify = {
        0: "语义不相关", 1: "语义相似"
    }
    df = pd.read_csv(test_path)
    for index, row in df.iterrows():
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]
        encoding = tokenizer.encode_plus(
            sentence1, sentence2,
            max_length=max_length,
            return_tensors="pt"
        )
        input_ids = encoding["input_ids"].to(device)
        attention_mask = encoding["attention_mask"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        logits = outputs['logits']
        predicted = logits.max(-1, keepdim=True)[1].item()
        print(f"{sentence1} - {sentence2}  >>>  {classify[predicted]}")

if __name__ == '__main__':
    main()
```

![](https://i-blog.csdnimg.cn/img_convert/fdfba3b0ab39e9724b1fa568954afa6c.jpeg)

