LLM: Text Generation Evaluation Metrics
1. BLEU (precision-based metric)
- Measures precision: analogous to classification precision, it asks how many of the generated words also appear in the reference.
- Problem 1:
  - If the generation repeats a word that occurs in the reference, plain precision gives an inflated score.
  - The authors' fix: count a word at most as many times as it occurs in the reference.
  - Example:
    - ref: "the cat is on the mat"
    - g: "the the the the the the"
    - $P_{vanilla}=\frac{6}{6}$, $P_{mod}=\frac{2}{6}$
- Fix for problem 1: clip
  - The count of each n-gram is capped at the number of times it occurs in the reference sentence.
  - $p_n=\frac{ \sum_{geSnt \in C}\sum_{n\text{-}gram \in geSnt} Count_{clip}(n\text{-}gram) }{ \sum_{geSnt \in C}\sum_{n\text{-}gram \in geSnt} Count(n\text{-}gram) }$, where $C$ is the set of generated sentences.
- Problem 2:
  - Because the metric is a precision, it clearly favors short generations and underrates longer ones.
- Fix for problem 2: brevity penalty
  - $BR = \min(1, e^{1 - \frac{l_{ref}}{l_{gen}}})$: equal to 1 when the generation is at least as long as the reference, and in $(0, 1)$ when it is shorter.

Final formula:
$$BLEU\text{-}N = BR \cdot \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$
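The clipping rule, the brevity penalty, and the final formula can be sketched in a few lines of Python. This is a minimal, unsmoothed, single-reference version that assumes whitespace tokenization. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`, `bleu`) are made up for illustration and are not part of any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(gen_tokens, ref_tokens, n):
    """Modified n-gram precision p_n for one generated/reference pair."""
    gen_counts = Counter(ngrams(gen_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(gen_counts.values())
    if total == 0:
        return 0.0
    # Each generated n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    return clipped / total


def brevity_penalty(gen_len, ref_len):
    """BR = min(1, exp(1 - l_ref / l_gen))."""
    if gen_len == 0:
        return 0.0
    return min(1.0, math.exp(1.0 - ref_len / gen_len))


def bleu(gen, ref, max_n=4):
    """Unsmoothed BLEU-N for a single sentence pair (illustrative only)."""
    gen_tokens, ref_tokens = gen.split(), ref.split()
    precisions = [clipped_precision(gen_tokens, ref_tokens, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # one zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(gen_tokens), len(ref_tokens)) * math.exp(log_mean)


# The repeated-word example: plain precision would be 6/6, clipped precision is 2/6.
print(clipped_precision("the the the the the the".split(), "the cat is on the mat".split(), 1))
```

The geometric mean is computed in log space, so the zero-precision case has to be handled explicitly before taking logarithms.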
Example: computing BLEU-4
- ref: "the cat sat on the mat"
- g: "the cat the cat is on the mat"
- BR: $BR=\min(1, e^{1-6/8})=1$
- n=1
  - 1-gram: org: {"the", "cat", "sat", "on", "mat"} ge: {"the", "cat", "is", "on", "mat"}
  - clip, for each 1-gram in geSnt: $count_{clip}(\text{"the"}) = 2$, $count_{clip}(\text{"cat"}) = 1$, $count_{clip}(\text{"is"}) = 0$, $count_{clip}(\text{"on"}) = 1$, $count_{clip}(\text{"mat"}) = 1$
  - $p_1 = \frac{5}{8}$
- n=2
  - 2-gram: org: {"the cat", "cat sat", "sat on", "on the", "the mat"} ge: {"the cat", "cat the", "cat is", "is on", "on the", "the mat"}
  - $p_2 = \frac{3}{7}$
- n=3
  - 3-gram: org: {"the cat sat", "cat sat on", "sat on the", "on the mat"} ge: {"the cat the", "cat the cat", "the cat is", "cat is on", "is on the", "on the mat"}
  - $p_3 = \frac{1}{6}$
- n=4
  - 4-gram: org: {"the cat sat on", "cat sat on the", "sat on the mat"} ge: {"the cat the cat", "cat the cat is", "the cat is on", "cat is on the", "is on the mat"}
  - $p_4 = \frac{0}{5}$
- BLEU-4: $1 \cdot \left(\frac{5}{8}\cdot\frac{3}{7}\cdot\frac{1}{6}\cdot\frac{0}{5}\right)^{1/4}=0$
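The numerators and denominators of $p_1$ to $p_4$ in this worked example can be checked with a short script. It is a sketch that assumes whitespace tokenization. `ngram_counts` is a hypothetical helper, and the clipping is done with `Counter` intersection (`&`), which keeps the minimum count of each n-gram.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the contiguous n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


ref = "the cat sat on the mat"
gen = "the cat the cat is on the mat"

counts, totals = [], []
for n in range(1, 5):
    gen_counts = ngram_counts(gen, n)
    ref_counts = ngram_counts(ref, n)
    # Counter intersection takes the per-n-gram minimum, which is exactly the clip.
    counts.append(sum((gen_counts & ref_counts).values()))
    totals.append(sum(gen_counts.values()))

print(counts)  # [5, 3, 1, 0]
print(totals)  # [8, 7, 6, 5]
```

Because the 4-gram count is zero, the unsmoothed geometric mean, and therefore BLEU-4, is zero for this pair.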
1.1 Hugging Face load_metric
Calls sacrebleu.
The package computes the score on the same principle as described above.
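# Notebook cell: install the backend before loading the metric.
!pip install sacrebleu
# Note: load_metric is deprecated in newer datasets releases in favor of the evaluate library.
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")
bleu_metric.add(prediction="the cat the cat is on the mat", reference=["the cat sat on the mat"])
# smooth_method="floor" with smooth_value=0 leaves zero n-gram counts unsmoothed.
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results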
"""
{'score': 0.0,
'counts': [5, 3, 1, 0],
'totals': [8, 7, 6, 5],
'precisions': [62.5, 42.857142857142854, 16.666666666666668, 0.0],
'bp': 1.0,
'sys_len': 8,
'ref_len': 6}
"""
2. ROUGE (recall-based metric)
- Measures recall: analogous to classification recall, it asks how many of the reference words appear in the generation.
$$ROUGE\text{-}N = \frac{ \sum_{orgSnt \in C}\sum_{n\text{-}gram \in orgSnt} Count_{match}(n\text{-}gram) }{ \sum_{orgSnt \in C}\sum_{n\text{-}gram \in orgSnt} Count(n\text{-}gram) }$$
- A separate score, ROUGE-L, is based on the longest common subsequence (LCS). With reference $X$ of length $m$ and generation $Y$ of length $n$:
$$R_{LCS}=\frac{LCS(X,Y)}{m}; \quad P_{LCS}=\frac{LCS(X,Y)}{n}$$
$$F_{LCS}=\frac{(1+\beta^2)R_{LCS}P_{LCS}}{R_{LCS}+\beta^2 P_{LCS}}, \quad \beta=\frac{P_{LCS}}{R_{LCS}}$$
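As an illustration of the ROUGE-L quantities, the sketch below computes the LCS length with the usual dynamic-programming table and then derives recall, precision, and the F-measure. It assumes whitespace tokenization and the standard $F_\beta$ form with $\beta^2$ in the denominator, and it defaults to $\beta = 1$ (plain F1). The function names `lcs_length` and `rouge_l` are made up for this example.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of x[:i] and y[:j].
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]


def rouge_l(ref, gen, beta=1.0):
    """Return (recall, precision, F) based on the LCS of ref and gen."""
    ref_tokens, gen_tokens = ref.split(), gen.split()
    lcs = lcs_length(ref_tokens, gen_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(ref_tokens)      # LCS(X, Y) / m
    precision = lcs / len(gen_tokens)   # LCS(X, Y) / n
    f = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return recall, precision, f


print(rouge_l("the cat sat on the mat", "the cat the cat is on the mat"))
```

For this pair the LCS is "the cat on the mat" (5 tokens), so recall is 5/6 and precision is 5/8.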
Example: computing ROUGE-1
- ref: "the cat sat on the mat"
- g: "the cat the cat is on the mat"
- 1-gram: org: {"the", "cat", "sat", "on", "mat"} ge: {"the", "cat", "is", "on", "mat"}
- $ROUGE\text{-}1^{r}=\frac{2+1+0+1+1}{6}=\frac{5}{6}$
- $BLEU\text{-}1^{p}=\frac{\min(3,2)+\min(2,1)+0+1+1}{8}=\frac{5}{8}$
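The recall computation in this example generalizes to any n. The following is a minimal sketch for a single sentence pair, assuming whitespace tokenization. `ngram_counts` and `rouge_n_recall` are illustrative names, not functions from the rouge_score package.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the contiguous n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(ref, gen, n):
    """Share of reference n-grams that are also found in the generation."""
    ref_counts = ngram_counts(ref, n)
    gen_counts = ngram_counts(gen, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # A reference n-gram is matched at most as often as it occurs in the generation.
    matched = sum((ref_counts & gen_counts).values())
    return matched / total


ref = "the cat sat on the mat"
gen = "the cat the cat is on the mat"
print(rouge_n_recall(ref, gen, 1))  # 5/6
print(rouge_n_recall(ref, gen, 2))  # 3/5
```

The denominator is the number of reference n-grams, whereas BLEU's precision divides by the number of generated n-grams. That is why the same 5 matched unigrams give 5/6 here and 5/8 for BLEU-1.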
2.1 Hugging Face load_metric
Calls rouge_score.
The package computes the score on the same principle as described above.
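# Notebook cell: install the backend before loading the metric.
!pip install rouge_score
# Note: load_metric is deprecated in newer datasets releases in favor of the evaluate library.
from datasets import load_metric

rouge_metric = load_metric("rouge")
rouge_metric.add(prediction="the cat the cat is on the mat", reference=["the cat sat on the mat"])
results = rouge_metric.compute()
# Sanity check: the harmonic mean (F1) of ROUGE-1 precision 5/8 and recall 5/6.
print(1/(0.5* 1/0.625 + 0.5* 1/0.8333333333333334))
results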
"""
0.7142857142857143
{'rouge1': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
'rouge2': AggregateScore(low=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), mid=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), high=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5)),
'rougeL': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
'rougeLsum': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143))}
"""
