A Model for Sentiment Analysis

天真的笑下去 · Published 2023-06-07 21:07:05

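import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import BertModel, BertTokenizer, get_linear_schedule_with_warmup

# Load and preprocess the data.
# Assumption: the post never shows how df is built. Following the walkthrough below, it is read
# from a CSV with a `text` column and an integer `label` column (0..n_classes-1); 'data.csv' is a placeholder.
df = pd.read_csv('data.csv').dropna(subset=['text'])
df['label'] = df['label'].astype(int)

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
MAX_LEN = 128    # assumed maximum sequence length
BATCH_SIZE = 16  # assumed batch size

def encode_sentence(s):
    # The tokenizer adds [CLS]/[SEP], truncates, pads, and builds the attention mask.
    enc = tokenizer(s, max_length=MAX_LEN, padding='max_length', truncation=True)
    return enc['input_ids'], enc['attention_mask']

ids, masks = zip(*df.text.apply(encode_sentence))
df['input_ids'] = list(ids)
df['attention_mask'] = list(masks)

df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)  # assumed split ratio

class SentimentDataset(Dataset):
    # Wraps an encoded DataFrame and returns one sample as a dict of tensors.
    def __init__(self, frame):
        self.frame = frame.reset_index(drop=True)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        return {
            'input_ids': torch.tensor(row['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(row['attention_mask'], dtype=torch.long),
            'targets': torch.tensor(row['label'], dtype=torch.long),
        }

train_data_loader = DataLoader(SentimentDataset(df_train), batch_size=BATCH_SIZE, shuffle=True)
val_data_loader = DataLoader(SentimentDataset(df_val), batch_size=BATCH_SIZE)

# Define the model
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-chinese')
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # Recent Transformers versions return a ModelOutput object, so read the pooled
        # [CLS] vector by name instead of tuple-unpacking the result.
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        output = self.drop(outputs.pooler_output)
        return self.out(output)

# Train the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SentimentClassifier(len(df.label.unique()))
model = model.to(device)

EPOCHS = 10
# torch.optim.AdamW replaces the deprecated transformers.AdamW; it has no
# correct_bias argument, so that argument is dropped.
optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_data_loader) * EPOCHS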

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss().to(device)

def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return correct_predictions.double() / n_examples, np.mean(losses)

# Evaluate the model
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses)

# Run training and evaluation
for epoch in range(EPOCHS):
    print('Epoch {}/{}'.format(epoch + 1, EPOCHS))
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        scheduler,
        len(df_train)
    )
    print('Train loss {} accuracy {}'.format(train_loss, train_acc))
    val_acc, val_loss = eval_model(
        model,
        val_data_loader,
        loss_fn,
        device,
        len(df_val)
    )
    print('Val   loss {} accuracy {}'.format(val_loss, val_acc))
    print()

# Save and load the model
def save_model(model, path):
    torch.save(model.state_dict(), path)

def load_model(path):
    model = SentimentClassifier(len(df.label.unique()))
    # map_location lets weights saved on a GPU load on a CPU-only machine.
    model.load_state_dict(torch.load(path, map_location=device))
    return model.to(device)

save_model(model, "model_weights.pth")
model = load_model("model_weights.pth")

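This code uses a BERT model for Chinese sentiment classification, built on PyTorch and Hugging Face's Transformers library.

1. Load the required libraries

pandas: a data analysis library, used here to load and preprocess the dataset.
torch: PyTorch, an open-source deep learning library. It provides tensor computation, optimizers, loss functions, and the low-level support for neural networks.
torch.nn: the PyTorch module for building neural networks.
numpy: a library for multi-dimensional arrays, with a large set of fast mathematical functions.
torch.utils.data: the PyTorch data loading module, which builds iterable data loaders so that training can process data in batches.
transformers: Hugging Face's Transformers library, which provides pretrained models and tools that can be used directly for text classification, question answering, summarization, and similar tasks.
sklearn.model_selection: a Scikit-Learn module for splitting data and selecting models; its train_test_split function divides the dataset into a training set and a validation set.
sklearn.metrics: a Scikit-Learn module with model evaluation metrics such as accuracy, recall, and F1 score.

2. Load the data

First, read the CSV dataset with the pandas read_csv function. Then use dropna to delete every row whose text field is empty.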
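
As a minimal sketch of this step, the snippet below reads a tiny in-memory CSV instead of a real file. The column names `text` and `label` follow the code above; the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = "text,label\n这部电影很好看,1\n,0\n服务态度太差了,0\n"

df = pd.read_csv(io.StringIO(csv_text))
# Drop rows whose text field is empty; other columns are not checked.
df = df.dropna(subset=["text"]).reset_index(drop=True)

print(len(df))           # 2 rows remain
print(df.text.tolist())
```

Passing `subset=["text"]` keeps rows that are missing values in other columns, which matches the description of removing only rows with empty text.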

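3. Preprocess and encode the data

Use BertTokenizer from the Transformers library to tokenize and encode the Chinese text, producing the input_ids and attention_mask that BERT expects. input_ids holds the vocabulary ID of each token, and attention_mask distinguishes real tokens from padding.
Use zip and apply to preprocess and encode the whole dataset, turning each text into its input_ids and attention_mask.
Use astype to convert the label column to integers.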
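
The relationship between input_ids and attention_mask can be sketched by hand. The toy vocabulary and the `encode` helper below are hypothetical and only mimic what the tokenizer does; the real IDs come from the bert-base-chinese vocabulary, and real code should rely on BertTokenizer.

```python
# Toy vocabulary for illustration only; the real IDs come from bert-base-chinese.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "很": 4, "好": 5, "差": 6}

def encode(text, max_len):
    """Map characters to IDs, add [CLS]/[SEP], then pad to max_len."""
    tokens = ["[CLS]"] + list(text)[: max_len - 2] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    padding = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding         # 0 marks padding
    return input_ids, attention_mask

ids, mask = encode("很好", max_len=6)
print(ids)   # [1, 4, 5, 2, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every position with mask 0 is padding, so the model ignores it when computing attention.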

4. Split the dataset

Use the train_test_split function to divide the dataset into a training set and a validation set.
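
A small sketch of the split on made-up data follows. The split ratio, the seed, and the `stratify` argument are assumptions; the post does not say which settings were used. Stratifying is an optional addition that keeps the label proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with an imbalanced label column.
df = pd.DataFrame({"text": [f"sample {i}" for i in range(10)],
                   "label": [0] * 8 + [1] * 2})

df_train, df_val = train_test_split(
    df,
    test_size=0.5,         # assumed ratio, chosen so the toy numbers divide evenly
    random_state=42,       # fixed seed for a reproducible split
    stratify=df["label"],  # keep the label proportions the same in both parts
)

print(len(df_train), len(df_val))
print(df_val["label"].value_counts().to_dict())
```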

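5. Define the PyTorch Dataset

Define a SentimentDataset class on top of the PyTorch Dataset interface. The class must implement two methods: __len__, which returns the size of the dataset, and __getitem__, which returns one sample.
Inside __getitem__, the input_ids, attention_mask, and label are converted to PyTorch tensors and returned.

6. Create the DataLoaders

Use the PyTorch DataLoader class to create the training and validation DataLoaders, setting batch_size and shuffle.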
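
To show what a batch looks like when it reaches the training loop, here is a sketch with a toy dataset. `ToyDataset` is a hypothetical stand-in for SentimentDataset, and the sample values and batch size are made up.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for SentimentDataset, holding already-encoded samples."""

    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.input_ids[idx], dtype=torch.long),
            "attention_mask": torch.tensor(self.attention_mask[idx], dtype=torch.long),
            "targets": torch.tensor(self.labels[idx], dtype=torch.long),
        }

# Five fake samples, each already padded to length 4.
dataset = ToyDataset(
    input_ids=[[1, 4, 2, 0]] * 5,
    attention_mask=[[1, 1, 1, 0]] * 5,
    labels=[0, 1, 0, 1, 1],
)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 4])
print(batch["targets"])          # tensor([0, 1])
print(len(loader))               # 3 batches: 2 + 2 + 1
```

The default collate function stacks the per-sample dictionaries into one dictionary of batched tensors, which is why the training loop can index a batch with `d["input_ids"]`.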

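7. Define the model

Define a SentimentClassifier class, a sentiment classifier built on BERT. The BERT part extracts features from the text, and a linear layer on top performs the classification. The class must implement the forward method, which runs the forward pass.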
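
The tensor shapes in the forward pass can be checked without downloading pretrained weights. The sketch below builds a tiny, randomly initialized BERT from a BertConfig; the layer sizes, the three-class head, and the input IDs are arbitrary choices for illustration and are not the settings of bert-base-chinese.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# Tiny random BERT so the sketch runs without downloading weights (sizes are arbitrary).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
bert = BertModel(config).eval()
head = nn.Linear(config.hidden_size, 3)  # 3 classes, chosen for illustration

input_ids = torch.tensor([[1, 4, 5, 2, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

with torch.no_grad():
    outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
    logits = head(outputs.pooler_output)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 6, 32])
print(outputs.pooler_output.shape)      # torch.Size([1, 32])
print(logits.shape)                     # torch.Size([1, 3])
```

The pooled output is one vector per sequence, so the linear layer maps it to one row of class logits per sample.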

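8. Define the training and validation functions

The train_epoch function runs one epoch of training. For each batch it moves the inputs and labels to the device (the GPU when one is available), runs the forward pass, computes the loss, backpropagates, clips the gradients, updates the parameters and the learning rate, and then clears the optimizer's gradients. At the end it returns the training accuracy and the mean loss.
The eval_model function validates the model. Validation needs no backpropagation or parameter updates, so it runs under torch.no_grad(). It also returns the accuracy and the mean loss.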
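
Both functions derive accuracy from the raw logits in the same way. The sketch below repeats that calculation on made-up logits for four samples and three classes.

```python
import torch
import torch.nn as nn

# Made-up logits for a batch of 4 samples and 3 classes.
outputs = torch.tensor([[2.0, 0.5, 0.1],
                        [0.2, 1.5, 0.3],
                        [0.1, 0.2, 3.0],
                        [1.0, 0.9, 0.8]])
targets = torch.tensor([0, 1, 2, 1])

_, preds = torch.max(outputs, dim=1)          # index of the largest logit per row
correct = torch.sum(preds == targets).item()  # number of matching predictions
loss = nn.CrossEntropyLoss()(outputs, targets)

print(preds.tolist())          # [0, 1, 2, 0]
print(correct / len(targets))  # 0.75
print(loss.item())
```

CrossEntropyLoss takes the logits directly and applies the softmax internally, so the model does not need a softmax layer of its own.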

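9. Train the model

Define the model, the optimizer, and the loss function. Then move the model to the GPU.
For each epoch, train first and then validate. Finally, save the model.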
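
The training code also steps a scheduler created with get_linear_schedule_with_warmup after every batch. The plain-Python sketch below shows how the learning rate changes under such a schedule. `linear_schedule_factor` is a hypothetical helper written from my understanding of the schedule, not the library's source, and the figure of 100 batches per epoch is assumed.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 100 * 10  # hypothetical: 100 batches per epoch, 10 epochs

for step in (0, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With num_warmup_steps=0, as in the code above, the rate starts at the full 2e-5 and decays linearly to zero by the last step, which is why total_steps must be computed from the length of the training DataLoader.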
