Basic Principles of the RNN (Recurrent Neural Network)
1. Basic Concepts of the RNN
- Recurrent neural networks were proposed from the idea of a memory model: the network should remember features that appeared earlier and use them to infer what comes later. The overall structure loops continuously, which is where the name "recurrent neural network" comes from.
- The defining feature of an RNN is that it contains a loop (a cycle). This loop lets data flow through the network repeatedly: the RNN remembers past data while continually taking in the newest data.
- An RNN (Recurrent Neural Network) generally takes sequence data as input, captures the relationships among the elements of the sequence through its internal structure, and generally produces output in sequence form as well.
- The basic structure is very simple: the network's output is stored in a memory cell, and that memory cell enters the network together with the next input. Changing the order of the input sequence changes the network's output.
- The recurrence mechanism lets the hidden-layer result $h_t$ produced at one time step become part of the input at the next time step. The output at the current step is therefore influenced both by the normal input $x_t$ and by the previous step's hidden output $h_{t-1}$.
2. RNN Structure Taxonomy
2.1 By Input and Output Structure
2.1.1 N-to-N RNN
- The most basic RNN form: the input and output sequences are of equal length. This constraint keeps its range of application fairly narrow.
- Typical application:
  - generating lines of verse of equal length
2.1.2 N-to-1 RNN
- The input is a sequence; the output is a single value.
- Typical application:
  - text classification
2.1.3 1-to-N RNN
- The input is not a sequence; the output is a sequence.
- Typical application:
  - image captioning (generating a sentence from one image)
2.1.4 N-to-M RNN
- An RNN structure that places no constraint on the input or output length.
- It is composed of an encoder and a decoder, each of which is internally some kind of RNN; this is also called the seq-to-seq structure.
- The input first passes through the encoder, which finally emits a hidden (context) variable c (an N-to-1 structure); c is then applied at every step of the decoder's decoding (a 1-to-N structure) to ensure the input information is used effectively.
- Typical applications:
  - machine translation
  - reading comprehension
  - text summarization, etc.
2.2 By Internal Structure
- traditional RNN
- LSTM
- Bi-LSTM
- GRU
- Bi-GRU
3. How the Traditional RNN Works
3.1 The Basic RNN Equations

| Parameter | Meaning |
|---|---|
| $t$ | time step $t$, the index into the sequence |
| $h_t$ | the memory cell at step $t$ ($h_t$ is computed from the previous output $h_{t-1}$); also called the hidden state or hidden-state vector |
| $h_{t-1}$ | the memory cell at step $t-1$ |
| $f_W$ | the nonlinear activation function, $\tanh$ |
| $W_h$ | the weight applied to the memory cell $h_{t-1}$ at every step |
| $W_x$ | the weight applied to the input $x_t$ at every step |
| $y_t$ | the output at step $t$ |
| $A$ | feature fusion: $\tanh(W_h h_{t-1} + W_x x_t)$ |
| $x_t$ | the sequence feature with index $t$, i.e. the input data at step $t$ |
| $h_0$ | the initial memory cell; since nothing precedes step 0, $h_0$ is initialized as an all-zero matrix |
$$
\begin{aligned}
h_t &= f_W(h_{t-1}, x_t) = \tanh(W_h h_{t-1} + W_x x_t)\\
y_t &= W_{hy} h_t
\end{aligned}
$$
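The two equations above can be sketched in a few lines of NumPy, written in the row-vector convention also used by the code in section 5. The sizes (N=2 samples, input dimension 4, hidden size 3, output size 5) are illustrative assumptions, not fixed by the text.

```python
import numpy as np

# One forward step of the basic RNN: h_t = tanh(h_{t-1} Wh + x_t Wx),
# y_t = h_t Why.  Shapes are illustrative assumptions.
rng = np.random.default_rng(0)
N, D, H, V = 2, 4, 3, 5

x_t = rng.standard_normal((N, D))
h_prev = np.zeros((N, H))           # h_0 starts as an all-zero matrix
Wx = rng.standard_normal((D, H))    # input weights
Wh = rng.standard_normal((H, H))    # memory-cell weights
Why = rng.standard_normal((H, V))   # output weights

h_t = np.tanh(h_prev @ Wh + x_t @ Wx)  # memory-cell update
y_t = h_t @ Why                        # output at step t

print(h_t.shape, y_t.shape)  # (2, 3) (2, 5)
```

Because of the $\tanh$, every component of $h_t$ stays in $(-1, 1)$, regardless of the input scale.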
3.2 Understanding the Structure

- $h_{t-1}$ is the memory cell: the hidden output of the previous time step. By the last step, $h_t$ has accumulated information from the entire sequence.
- $x_t$ is the new input at the current step. It is concatenated with $h_{t-1}$ (concatenate: the number of rows stays the same, the number of columns grows), and the nonlinear activation $\tanh$ fuses the two sources of information.
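The "concatenate, then fuse" description is the same computation as the two separate matrix products $h W_h + x W_x$: stacking $W_h$ on top of $W_x$ and multiplying by the concatenated $[h, x]$ gives an identical result. A small NumPy check (the shapes are illustrative assumptions):

```python
import numpy as np

# Equivalence of the "concatenate" view and the "two products" view.
rng = np.random.default_rng(1)
N, H, D = 2, 3, 4
h = rng.standard_normal((N, H))    # previous hidden state h_{t-1}
x = rng.standard_normal((N, D))    # current input x_t
Wh = rng.standard_normal((H, H))
Wx = rng.standard_normal((D, H))

separate = h @ Wh + x @ Wx                          # two products, then add
stacked = np.hstack([h, x]) @ np.vstack([Wh, Wx])   # concat view

print(np.allclose(separate, stacked))  # True
```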
3.3 Deriving the Gradients of W and X
3.3.1 What the Formulas Mean
- Question: in the gradient formulas for $dW$ and $dX$, why does the upstream gradient $\frac{\partial L}{\partial Y}$ appear on the left of one product but on the right of the other?

3.3.2 Derivation
- Differentiation rules
  - During backpropagation, first compute the gradient $\frac{\partial L}{\partial Y}$, then use it to update $W$ and $X$; the updates must not change their dimensions.
  - In a matrix product, the number of columns of the left matrix must equal the number of rows of the right matrix.
- Parameter dimensions

| Parameter | Shape | Meaning |
|---|---|---|
| $X$ | (N, 2) | (number of samples, input vector length) |
| $W$ | (2, 3) | (input vector length, number of neurons) |
| $B$ | (3,) | (number of neurons,) |
| $XW$ | (N, 3) | (number of samples, number of neurons) |
| $Y$ | (N, 3) | (number of samples, number of neurons) |
| $\frac{\partial L}{\partial Y}$ | (N, 3) | (number of samples, number of neurons) |

- Derivation of $\frac{\partial L}{\partial X}$
  - $X$ has shape (N, 2), and its gradient must keep that shape.
  - Between $\frac{\partial L}{\partial Y}$ and $W$, put the factor whose shape contains N first; that is $\frac{\partial L}{\partial Y}$.
  - $\frac{\partial L}{\partial Y}$ has 3 columns, so $W$ must be transposed before multiplying, which gives:
$$
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \, W^{T}
$$
- Derivation of $\frac{\partial L}{\partial W}$
  - $W$ has shape (2, 3), and its gradient must keep that shape.
  - Between $\frac{\partial L}{\partial Y}$ and $X$, put the factor that can supply the leading dimension 2 first: that is $X$, transposed.
  - $X^{T}$ has N columns, matching the N rows of $\frac{\partial L}{\partial Y}$, so the two can be multiplied directly, which gives:
$$
\frac{\partial L}{\partial W} = X^{T} \, \frac{\partial L}{\partial Y}
$$
- As shown in the figure below.

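The two formulas can be verified with a quick NumPy shape check; the array values below are arbitrary, and only the shapes from the table above matter.

```python
import numpy as np

# Shape check for dL/dX = dL/dY @ W.T and dL/dW = X.T @ dL/dY,
# using the (N, 2) x (2, 3) layout from the table.
N = 5
X = np.random.randn(N, 2)
W = np.random.randn(2, 3)
dY = np.random.randn(N, 3)   # upstream gradient dL/dY

dX = dY @ W.T    # (N, 3) @ (3, 2) -> (N, 2), same shape as X
dW = X.T @ dY    # (2, N) @ (N, 3) -> (2, 3), same shape as W

print(dX.shape, dW.shape)  # (5, 2) (2, 3)
```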
3.4 Backward Pass Through a Single Layer
| Parameter | Meaning |
|---|---|
| $h_{next}$ | the RNN layer's output at the current step (the RNN layer's input at the next step) |
| $h_{pre}$ | the RNN layer's input at the current step (the previous step's output), equivalent to $h_{t-1}$ |
| $dh_{next}$ | the gradient flowing back from the next step's RNN layer |
| $d_{current}$ | the derivative of the activation at the current step |
| $d_t$ | the gradient at the current step |
| $x$ | the input at the current step, equivalent to $x_t$ |
- Matrix differentiation shorthand (right/left-multiply the upstream gradient)
$$
\frac{\partial AB}{\partial A} = B^{T}, \qquad \frac{\partial AB}{\partial B} = A^{T}
$$
- Forward equations (the pre-activation $a$ is passed through $\tanh$)
$$
\begin{aligned}
a &= h_{t-1} W_h + x_t W_x + b\\
h_{next} &= \tanh(a)
\end{aligned}
$$
- Gradient at the current step
  - Derivation, using $\tanh'(a) = 1 - \tanh^2(a) = 1 - h_{next}^2$:
$$
d_t = dh_{next} * (1 - h_{next}^2)
$$
  - Related code (from `RNN.backward` in section 5.2): `dt = dh_next * (1 - h_next ** 2)`
- Gradient of the bias $b$
  - Derivation
$$
d_b = \frac{\partial (h_{t-1}W_h + x_t W_x + b)}{\partial b} * d_t = 1 * d_t = d_t
$$
  - Related code (from `RNN.backward` in section 5.2, summed over the batch axis): `db = np.sum(dt, axis=0)`
- Gradient of $W_h$
  - Derivation
$$
d_{W_h} = \frac{\partial (h_{t-1}W_h + x_t W_x + b)}{\partial W_h} * d_t = h_{t-1}^{T} * d_t = h_{pre}^{T} * d_t
$$
  - Related code (from `RNN.backward` in section 5.2): `dWh = np.dot(h_prev.T, dt)`
- Gradient of $W_x$
  - Derivation
$$
d_{W_x} = \frac{\partial (h_{t-1}W_h + x_t W_x + b)}{\partial W_x} * d_t = x_t^{T} * d_t
$$
  - Related code (from `RNN.backward` in section 5.2): `dWx = np.dot(x.T, dt)`
- Gradient of $x$
  - Derivation
$$
d_x = d_t * \frac{\partial (h_{t-1}W_h + x_t W_x + b)}{\partial x_t} = d_t * W_x^{T}
$$
  - Related code (from `RNN.backward` in section 5.2): `dx = np.dot(dt, Wx.T)`
- Gradient of $h_{pre}$ (i.e. $h_{t-1}$)
  - Derivation
$$
d_{h_{pre}} = d_{h_{t-1}} = d_t * \frac{\partial (h_{t-1}W_h + x_t W_x + b)}{\partial h_{t-1}} = d_t * W_h^{T}
$$
  - Related code (from `RNN.backward` in section 5.2): `dh_prev = np.dot(dt, Wh.T)`
- The backpropagation function (combining the steps above)

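The per-term gradients above can be assembled into one backward function. The following is a self-contained sketch (the class name `RNNCell` and the tensor sizes are illustrative assumptions), finished with a finite-difference check of `dx`:

```python
import numpy as np

# One-step RNN cell implementing the forward/backward formulas above.
class RNNCell:
    def __init__(self, Wx, Wh, b):
        self.params = Wx, Wh, b
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        h_next = np.tanh(h_prev @ Wh + x @ Wx + b)
        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache
        dt = dh_next * (1 - h_next ** 2)   # tanh derivative
        db = dt.sum(axis=0)                # d_b, summed over the batch
        dWh = h_prev.T @ dt                # d_Wh = h_pre^T * dt
        dWx = x.T @ dt                     # d_Wx = x^T  * dt
        dx = dt @ Wx.T                     # d_x  = dt * Wx^T
        dh_prev = dt @ Wh.T                # d_hpre = dt * Wh^T
        return dx, dh_prev, dWx, dWh, db

# Numerical check of dx against a finite-difference estimate.
rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b = rng.standard_normal(H)
x = rng.standard_normal((N, D))
h_prev = rng.standard_normal((N, H))

cell = RNNCell(Wx, Wh, b)
h = cell.forward(x, h_prev)
dx, *_ = cell.backward(np.ones_like(h))   # upstream gradient of ones

eps = 1e-5
x2 = x.copy(); x2[0, 0] += eps
num = (cell.forward(x2, h_prev).sum() - h.sum()) / eps  # finite difference
print(np.isclose(dx[0, 0], num, atol=1e-4))  # True
```

The same six lines appear, with identical names, inside `RNN.backward` in the complete code of section 5.2.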
3.5 Backpropagation Through Time for the Parameters
- Forward equations
$$
\begin{aligned}
h_i &= W_h h_{t-1} + W_x x_t\\
h_t &= \tanh(W_h h_{t-1} + W_x x_t)\\
y_t &= W_{yh} h_t\\
Loss_t &= \frac{1}{2}(y - y_t)^2
\end{aligned}
$$
- Chain rule
$$
\frac{\partial Loss_t}{\partial W_h} = \frac{\partial Loss_t}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_i} \cdot \frac{\partial h_i}{\partial W_h}
$$
- Results for each factor
$$
\begin{aligned}
\frac{\partial Loss_t}{\partial y_t} &= \frac{\partial\, \frac{1}{2}(y_{true} - y_t)^2}{\partial y_t}\\
\frac{\partial y_t}{\partial h_t} &= \frac{\partial (W_{yh} h_t)}{\partial h_t} = W_{yh}\\
\frac{\partial h_i}{\partial W_h} &= \frac{\partial (W_h h_{t-1} + W_x x_t)}{\partial W_h} = h_{t-1}\\
\frac{\partial h_t}{\partial h_i} &= \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{i+1}}{\partial h_i} = \prod_{k=i}^{t-1} \frac{\partial h_{k+1}}{\partial h_k}\\
\frac{\partial h_{k+1}}{\partial h_k} &= \mathrm{diag}\!\left(f'(W_h h_k + W_x x_{k+1})\right) W_h\\
\frac{\partial h_k}{\partial h_1} &= \prod_{j=1}^{k-1} \mathrm{diag}\!\left(f'(W_h h_j + W_x x_{j+1})\right) W_h
\end{aligned}
$$
- These formulas show that computing $\frac{\partial h_k}{\partial h_1}$ involves an accumulated product, which produces a factor on the order of $W_h^{k}$.
  - When k is large enough and $W_h$ is greater than 1, the result blows up toward infinity: gradient explosion. The network parameters receive huge updates, values may overflow (producing NaN), the computation can crash, and the model becomes unstable. Excessively large parameter updates also destabilize the model on their own.
  - When k is large enough and $W_h$ is less than 1, the result shrinks toward zero: gradient vanishing. The parameters can no longer be updated, and training fails.
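The $W_h^k$ effect is easy to demonstrate: backpropagating through k time steps multiplies the gradient by $W_h^T$ k times. Scaled identity matrices are used below so the growth/decay rate (the largest eigenvalue) is known exactly; this is an illustrative sketch, not the full RNN backward pass.

```python
import numpy as np

# Repeated multiplication by W_h^T grows or shrinks a gradient
# exponentially, depending on whether the largest eigenvalue of
# W_h is above or below 1.
H, k = 5, 50
Wh_big = 1.5 * np.eye(H)    # largest eigenvalue 1.5 > 1
Wh_small = 0.5 * np.eye(H)  # largest eigenvalue 0.5 < 1

g_big = np.ones(H)
g_small = np.ones(H)
for _ in range(k):
    g_big = Wh_big.T @ g_big        # grows like 1.5^k  -> explosion
    g_small = Wh_small.T @ g_small  # shrinks like 0.5^k -> vanishing

print(np.linalg.norm(g_big))    # ~1.4e9:  gradient explosion
print(np.linalg.norm(g_small))  # ~2.0e-15: gradient vanishing
```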
4. Strengths and Weaknesses of the Traditional RNN
- Strengths
  - The internal structure is simple and demands little computing power.
  - It has far fewer parameters than its variants (the LSTM and GRU models).
  - Performance on short-sequence tasks is excellent.
- Weaknesses
  - The traditional RNN performs poorly when modeling relationships across long sequences.
  - The reason is that during backpropagation, overly long sequences destabilize the gradient computation, producing gradient vanishing or explosion.
  - It therefore suffers from the long-term dependency problem: it is "forgetful".
5. A Simple RNN Implementation
5.1 Dataset Construction Code
- Task: predict the next word from the previous word.
```python
def preprocess_rnnlm(sentences_list, lis=[]):
    """
    Corpus preprocessing.
    :param sentences_list: list of sentences
    :return: word_list: list of words
             word_dict: word -> word ID
             number_dict: word ID -> word
             n_class: vocabulary size
             corpus: word ID sequence
    """
    # note: the mutable default argument `lis` is shared across calls
    for i in sentences_list:
        text = i.split('.')[0].split(' ')  # split on spaces to tokenize the sentence
        word_list = list({}.fromkeys(text).keys())  # deduplicate to build the vocabulary
        lis = lis + word_list
        word_list = list({}.fromkeys(lis).keys())
    corpus = [i for i, w in enumerate(word_list)]
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}
    n_class = len(word_dict)  # vocabulary size, also the number of softmax classes
    return word_list, word_dict, number_dict, n_class, corpus


def make_batch_rnnlm(sentences_list, word_dict, windows_size=1):
    """
    Encoding function.
    :param sentences_list: list of sentences
    :param word_dict: dict {'You': 0, ...}; key: word, value: index
    :param windows_size: window size
    :return: input_batch: input word IDs, target_batch: label word IDs
    """
    input_batch, target_batch = [], []
    for sen in sentences_list:
        word_repeat_list = sen.split(' ')  # split on spaces
        for i in range(windows_size, len(word_repeat_list)):  # iterate over target-word positions
            target = word_repeat_list[i]  # the target word
            input_index = [word_dict[word_repeat_list[j]]
                           for j in range((i - windows_size), i)][0]  # input word ID for this target
            target_index = word_dict[target]  # target word ID
            input_batch.append(input_index)
            target_batch.append(target_index)
    return input_batch, target_batch


if __name__ == '__main__':
    # build the dataset
    sentences_list = ['After learning his achievement in science in a speech Tom has admired teddy much for his concentration on what he studies']  # training data
    word_list, word_to_id, id_to_word, n_class, corpus = preprocess_rnnlm(sentences_list)
    print('word_to_id:', word_to_id)
    print('id_to_word:', id_to_word)
    print('corpus:', corpus)  # word ID sequence of the text
    xs, ts = make_batch_rnnlm(sentences_list, word_to_id, windows_size=1)  # inputs and target labels
    # dataset statistics
    vocab_size = int(max(corpus) + 1)  # number of distinct words
    corpus_size = int(max(corpus))     # largest word ID
    data_size = len(xs)                # number of samples
    print('xs', xs)  # input text
    print('ts', ts)  # targets (supervision labels)
```

5.2 Complete Code
- The code below shows the end-to-end training flow.
```python
import numpy as np
import matplotlib.pyplot as plt


def preprocess_rnnlm(sentences_list, lis=[]):
    """
    Corpus preprocessing.
    :param sentences_list: list of sentences
    :return: word_list: list of words
             word_dict: word -> word ID
             number_dict: word ID -> word
             n_class: vocabulary size
             corpus: word ID sequence
    """
    for i in sentences_list:
        text = i.split('.')[0].split(' ')  # split on spaces
        word_list = list({}.fromkeys(text).keys())  # deduplicate
        lis = lis + word_list
        word_list = list({}.fromkeys(lis).keys())
    corpus = [i for i, w in enumerate(word_list)]
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}
    n_class = len(word_dict)  # vocabulary size, also the number of softmax classes
    return word_list, word_dict, number_dict, n_class, corpus


def make_batch_rnnlm(sentences_list, word_dict, windows_size=1):
    """
    Encoding function.
    :param sentences_list: list of sentences
    :param word_dict: dict {'You': 0, ...}
    :param windows_size: window size
    :return: input_batch: input word IDs, target_batch: label word IDs
    """
    input_batch, target_batch = [], []
    for sen in sentences_list:
        word_repeat_list = sen.split(' ')
        for i in range(windows_size, len(word_repeat_list)):
            target = word_repeat_list[i]
            input_index = [word_dict[word_repeat_list[j]]
                           for j in range((i - windows_size), i)][0]
            target_index = word_dict[target]
            input_batch.append(input_index)
            target_batch.append(target_index)
    return input_batch, target_batch


class SGD:
    '''Stochastic Gradient Descent'''
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]


def softmax(x):
    if x.ndim == 2:
        x = x - x.max(axis=1, keepdims=True)
        x = np.exp(x)
        x /= x.sum(axis=1, keepdims=True)
    elif x.ndim == 1:
        x = x - np.max(x)
        x = np.exp(x) / np.sum(np.exp(x))
    return x


def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    # if the labels are one-hot vectors, convert them to class indices
    if t.size == y.size:
        t = t.argmax(axis=1)
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size


class TimeSoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None
        self.ignore_label = -1

    def forward(self, xs, ts):
        N, T, V = xs.shape
        if ts.ndim == 3:  # labels given as one-hot vectors
            ts = ts.argmax(axis=2)
        mask = (ts != self.ignore_label)
        # flatten the batch and time axes
        xs = xs.reshape(N * T, V)
        ts = ts.reshape(N * T)
        mask = mask.reshape(N * T)
        ys = softmax(xs)
        ls = np.log(ys[np.arange(N * T), ts])
        ls *= mask  # set the loss to 0 where the label equals ignore_label
        loss = -np.sum(ls)
        loss /= mask.sum()
        self.cache = (ts, ys, mask, (N, T, V))
        return loss

    def backward(self, dout=1):
        ts, ys, mask, (N, T, V) = self.cache
        dx = ys
        dx[np.arange(N * T), ts] -= 1
        dx *= dout
        dx /= mask.sum()
        dx *= mask[:, np.newaxis]  # set the gradient to 0 where the label equals ignore_label
        dx = dx.reshape((N, T, V))
        return dx


class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  # softmax output
        self.t = None  # supervision labels

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        # if the labels are one-hot vectors, convert them to class indices
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)
        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size
        return dx


class SimpleRnnlm:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn
        # initialize the weights
        embed_W = (rn(V, D) / 100).astype('f')
        rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
        rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
        rnn_b = np.zeros(H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')
        # build the layers
        self.layers = [
            TimeEmbedding(embed_W),
            TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.rnn_layer = self.layers[1]
        # collect all weights and gradients into lists
        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, ts):
        for layer in self.layers:
            xs = layer.forward(xs)
        loss = self.loss_layer.forward(xs, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.rnn_layer.reset_state()


class TimeRNN:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None
        self.h, self.dh = None, None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        D, H = Wx.shape
        self.layers = []
        hs = np.empty((N, T, H), dtype='f')
        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')
        for t in range(T):
            layer = RNN(*self.params)
            self.h = layer.forward(xs[:, t, :], self.h)
            hs[:, t, :] = self.h
            self.layers.append(layer)
        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D, H = Wx.shape
        dxs = np.empty((N, T, D), dtype='f')
        dh = 0
        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh = layer.backward(dhs[:, t, :] + dh)
            dxs[:, t, :] = dx
            for i, grad in enumerate(layer.grads):
                grads[i] += grad
        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh
        return dxs

    def set_state(self, h):
        self.h = h

    def reset_state(self):
        self.h = None


class RNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
        h_next = np.tanh(t)
        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache
        dt = dh_next * (1 - h_next ** 2)
        db = np.sum(dt, axis=0)
        dWh = np.dot(h_prev.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dWx = np.dot(x.T, dt)
        dx = np.dot(dt, Wx.T)
        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db
        return dx, dh_prev


class TimeEmbedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.layers = None
        self.W = W

    def forward(self, xs):
        N, T = xs.shape
        V, D = self.W.shape
        out = np.empty((N, T, D), dtype='f')
        self.layers = []
        for t in range(T):
            layer = Embedding(self.W)
            out[:, t, :] = layer.forward(xs[:, t])
            self.layers.append(layer)
        return out

    def backward(self, dout):
        N, T, D = dout.shape
        grad = 0
        for t in range(T):
            layer = self.layers[t]
            layer.backward(dout[:, t, :])
            grad += layer.grads[0]
        self.grads[0][...] = grad
        return None


class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        if GPU:
            np.scatter_add(dW, self.idx, dout)
        else:
            np.add.at(dW, self.idx, dout)
        return None


class TimeAffine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        N, T, D = x.shape
        W, b = self.params
        rx = x.reshape(N * T, -1)
        out = np.dot(rx, W) + b
        self.x = x
        return out.reshape(N, T, -1)

    def backward(self, dout):
        x = self.x
        N, T, D = x.shape
        W, b = self.params
        dout = dout.reshape(N * T, -1)
        rx = x.reshape(N * T, -1)
        db = np.sum(dout, axis=0)
        dW = np.dot(rx.T, dout)
        dx = np.dot(dout, W.T)
        dx = dx.reshape(*x.shape)
        self.grads[0][...] = dW
        self.grads[1][...] = db
        return dx


if __name__ == '__main__':
    GPU = False
    # build the dataset
    sentences_list = ['After learning his achievement in science in a speech Tom has admired teddy much for his concentration on what he studies']  # training data
    word_list, word_to_id, id_to_word, n_class, corpus = preprocess_rnnlm(sentences_list)
    print('word_to_id:', word_to_id)
    print('id_to_word:', id_to_word)
    print('corpus:', corpus)  # word ID sequence of the text
    xs, ts = make_batch_rnnlm(sentences_list, word_to_id, windows_size=1)  # inputs and target labels
    vocab_size = int(max(corpus) + 1)  # number of distinct words
    corpus_size = int(max(corpus))     # largest word ID
    data_size = len(xs)                # number of samples
    print('xs', xs)  # input text
    print('ts', ts)  # targets (supervision labels)

    # hyperparameters
    batch_size = 5       # mini-batch size
    wordvec_size = 100   # word-vector length
    hidden_size = 100    # number of hidden units
    time_size = 2        # Truncated BPTT window (backpropagation length)
    lr = 0.1             # learning rate
    max_epoch = 100      # number of training epochs

    print('corpus size: %d, vocab size: %d' % (corpus_size, vocab_size))
    print('data_size, batch_size:', data_size, batch_size)  # batch_size * time_size = samples per iteration
    max_iters = data_size // (batch_size * time_size)
    print('max_iters', max_iters)
    time_idx, total_loss, loss_count, ppl_list = 0, 0, 0, []

    # build the model
    model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
    optimizer = SGD(lr)

    # compute the start position of each row within the mini-batch;
    # each batch row reads the corpus from its own offset
    jump = corpus_size // batch_size
    offsets = [i * jump for i in range(batch_size)]  # e.g. [0, 3, 6, 9, 12]
    print('jump', jump)
    print('offsets', offsets)
    print('batch_size', batch_size)
    print('time_size', time_size)

    for epoch in range(max_epoch):
        for iter in range(max_iters):
            # fetch a mini-batch (np.empty allocates without initializing;
            # the values are filled in below)
            batch_x = np.empty((batch_size, time_size), dtype='i')  # shape (5, 2)
            batch_t = np.empty((batch_size, time_size), dtype='i')  # shape (5, 2)
            for t in range(time_size):  # fill each column of the batch
                for i, offset in enumerate(offsets):  # i: batch row
                    batch_x[i, t] = xs[(offset + time_idx) % data_size]
                    batch_t[i, t] = ts[(offset + time_idx) % data_size]
                time_idx += 1
            # compute the gradients and update the parameters
            loss = model.forward(batch_x, batch_t)
            model.backward()
            optimizer.update(model.params, model.grads)
            total_loss += loss
            loss_count += 1
        # perplexity evaluation for each epoch
        ppl = np.exp(total_loss / loss_count)
        print('| epoch %d | perplexity %.2f' % (epoch + 1, ppl))
        ppl_list.append(float(ppl))
        total_loss, loss_count = 0, 0

    # plot the learning curve
    x = np.arange(len(ppl_list))
    plt.plot(x, ppl_list, label='train')
    plt.xlabel('epochs')
    plt.ylabel('perplexity')
    plt.show()
```
5.3 Results
(figure: training-set perplexity over epochs, plotted by the code above)