[ASR Datasets] Chinese Speech ASR Test Sets
This article collects basic information on several Chinese ASR test datasets, including AISHELL-1, AISHELL-2, WenetSpeech, KeSpeech, Common Voice, MAGICDATA-READ, and MAGICDATA-RAMC: paper links, test-set sizes, and other key facts. For each dataset it describes the data source, shows how the test set is extracted, and provides processing code samples, such as how to extract audio paths and text labels. These standardized test sets are valuable for evaluating and comparing Chinese speech recognition systems.
Overview of Chinese ASR Test Sets
| Dataset | Paper | Test-set sentences |
|---|---|---|
| AISHELL-1 | https://ieeexplore.ieee.org/document/8384449 | 7176 |
| AISHELL-2(AISHELL-2018A-EVAL) | https://arxiv.org/abs/1808.10583 | 5000 |
| WenetSpeech Internet/meeting domain | https://arxiv.org/abs/2110.03370 | 24774 / 8370 |
| KeSpeech | https://openreview.net/forum?id=b3Zoeq2sCLq | 19723 |
| Common Voice 22 | N/A | 10635 |
| MAGICDATA-READ | N/A | 24280 |
| MAGICDATA-RAMC | https://arxiv.org/abs/2203.16844 | 23012 |
Chinese ASR Test Sets
AISHELL-1
Introduction: https://www.aishelltech.com/kysjcp
AISHELL-ASR0009-OS1, the AISHELL open-source Mandarin speech corpus, contains 178 hours of recordings and is a subset of the AISHELL-ASR0009 Mandarin speech database. The AISHELL-ASR0009 recording texts cover 11 domains, including smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The high-fidelity recordings were downsampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China took part in the recording. The transcriptions were produced by professional annotators and passed strict quality inspection, giving a text accuracy above 95%. The corpus is split into training, development, and test sets.
Dataset processing
Data sample
speech_asr_aishell_testsets.csv
Audio:FILE,Text:LABEL
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0121.wav,甚 至 出 现 交 易 几 乎 停 滞 的 情 况
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0122.wav,一 二 线 城 市 虽 然 也 处 于 调 整 中
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0123.wav,但 因 为 聚 集 了 过 多 公 共 资 源
Extracting the test set
import csv

def get_aishell1_test_data():
    audio_path_list = []
    references = []
    with open("speech_asr_aishell_testsets.csv", "r", encoding="utf-8") as fin:
        csv_dict_reader = csv.DictReader(fin)
        for row in csv_dict_reader:
            audio_path = row['Audio:FILE']
            # Labels are space-separated characters; join them back together.
            reference = row['Text:LABEL'].replace(" ", "")
            audio_path_list.append(audio_path)
            references.append(reference)
    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
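These test sets exist to score ASR hypotheses, usually with character error rate (CER) for Chinese. A minimal, self-contained sketch of computing CER over loaded references; the `edit_distance` helper is illustrative, not part of the original pipeline:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(references, hypotheses) -> float:
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```

In practice a maintained scoring tool would be used instead; this only shows the metric the test sets are scored with.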
AISHELL-2(AISHELL-2018A-EVAL)
Introduction: https://www.aishelltech.com/aishell_2
The AISHELL-2 Mandarin speech corpus contains 1000 hours of speech: 718 hours from AISHELL-ASR0009-[ZH-CN] and 282 hours from AISHELL-ASR0010-[ZH-CN]. The recording texts cover 12 domains, including wake words, voice control commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). AISHELL-2 uses the speech recorded on the iOS phones. 1991 speakers from different accent regions of China took part in the recording. The transcriptions were produced by professional annotators and passed strict quality inspection, giving a text accuracy above 96%. (Academic research only; commercial use is prohibited without permission.)
Dataset processing
Data sample
trans.txt
IT0011W0001 换一首歌
IT0011W0002 几点了
IT0011W0003 早上好
wav.scp
IT0011W0001 wav/T0011/IT0011W0001.wav
IT0011W0002 wav/T0011/IT0011W0002.wav
IT0011W0003 wav/T0011/IT0011W0003.wav
Extracting the test set
import os

def get_aishell2_test_data():
    test_dir = "AISHELL-DEV-TEST-SET/iOS/test/"
    uttid1_list = []
    audio_path_list = []
    with open(os.path.join(test_dir, "wav.scp"), "r", encoding="utf-8") as fin:
        for line in fin:
            data = line.strip().split('\t')
            assert len(data) == 2
            uttid1_list.append(data[0])
            audio_path = os.path.join(test_dir, data[1])
            audio_path_list.append(audio_path)
    uttid2_list = []
    references = []
    with open(os.path.join(test_dir, "trans.txt"), "r", encoding="utf-8") as fin:
        for line in fin:
            data = line.strip().split('\t')
            assert len(data) == 2
            uttid2_list.append(data[0])
            references.append(data[1])
    # wav.scp and trans.txt must list the same utterances in the same order.
    assert len(uttid1_list) == len(uttid2_list)
    for i in range(len(uttid2_list)):
        assert uttid1_list[i] == uttid2_list[i]
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
WenetSpeech
Introduction: https://wenet.org.cn/WenetSpeech/
WenetSpeech is a multi-domain Mandarin corpus with more than 10000 hours of transcribed speech, sourced from YouTube and podcasts. The YouTube and podcast recordings were labeled with optical character recognition (OCR) and automatic speech recognition (ASR) technology, respectively. To improve corpus quality, a novel end-to-end label error detection method was also used to further validate and filter the data.
Dataset processing
Data sample
All metadata is stored in a single large JSON file, WenetSpeech.json.
Extracting the test set
The dataset is distributed encrypted; you need to apply for the decryption password, then decrypt with:
openssl aes-256-cbc -d -salt -pass pass:PASSWORD -pbkdf2 -in ENCRYPTED_FILE | tar xzf - -C OUTPUT_DIR
Extracting the test files
import json
import os

from tqdm import tqdm

def get_ws_net_test_data():
    # from modelscope.msdatasets import MsDataset
    # ds = MsDataset.load('wenet/WenetSpeech', subset_name='default', split='test_net')
    # print(list(ds))
    audio_path_list = []
    references = []
    audio_dir = "./ws_net"
    output_dir = "./ws_net/ws_net_wav"
    ws_net = json.load(open("ws_net/WenetSpeech.json", "r", encoding="utf-8"))
    print(f"total audio num: {len(ws_net['audios'])}")
    for long_audio in ws_net['audios']:
        if "TEST_NET" in long_audio["aid"]:
            print("=" * 20)
            aid = long_audio["aid"]
            path = long_audio["path"]
            segments = long_audio["segments"]
            # Optionally convert the whole long recording in one pass first:
            # os.system("ffmpeg -y -i {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
            #     os.path.join(audio_dir, path), os.path.join(output_dir, f"{os.path.split(path)[-1].split('.')[0]}.wav")
            # ))
            for segment in tqdm(segments):
                sid = segment["sid"]
                begin_time = segment["begin_time"]
                end_time = segment["end_time"]
                text = segment["text"]
                audio_path = os.path.join(output_dir, f"{sid}.wav")
                # Cut each segment out as 16 kHz mono PCM WAV.
                os.system("ffmpeg -loglevel quiet -y -i {} -ss {} -to {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
                    os.path.join(audio_dir, path), begin_time, end_time, audio_path
                ))
                audio_path_list.append(audio_path)
                references.append(text)
                # print(f"segment save to {audio_path}. text: {text}")
            print(f"process {aid} done. load {len(segments)} segments.")
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
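Spawning one ffmpeg process per segment works but is slow for TEST_NET's ~25k segments. As an alternative sketch (not from the original), the long recording can be decoded once and segments cut in memory; this assumes the waveform is already a 16 kHz mono NumPy array:

```python
import numpy as np

SAMPLE_RATE = 16000  # the pipeline above resamples everything to 16 kHz mono

def slice_segment(samples: np.ndarray, begin_time: float, end_time: float,
                  sr: int = SAMPLE_RATE) -> np.ndarray:
    """Cut the [begin_time, end_time) span (seconds) out of a decoded waveform."""
    begin = int(round(begin_time * sr))
    end = min(int(round(end_time * sr)), len(samples))
    return samples[begin:end]

# Synthetic 10-second waveform standing in for a decoded long recording.
audio = np.zeros(10 * SAMPLE_RATE, dtype=np.int16)
seg = slice_segment(audio, 1.5, 3.0)  # 1.5 s worth of samples
```

Decoding the long file once (e.g. with soundfile or librosa) and then slicing avoids repeated process startup and input re-reads.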
def get_ws_meeting_test_data():
    # temp_dir = "./datasets/ws_meeting"
    # # replace with your own password
    # password = "XXXXXXXXXXX"
    # from modelscope.msdatasets import MsDataset
    # ds = MsDataset.load('wenet/WenetSpeech', subset_name='default', split='test_meeting')
    # for id, item in tqdm(enumerate(list(ds)[0][1])):
    #     temp_path = os.path.join(temp_dir, f"B000{id}.ase.tgz")
    #     os.system(f"cp {item} {temp_path}")
    #     os.system(f"openssl aes-256-cbc -d -salt -pass pass:{password} -pbkdf2 -in {temp_path} | tar xzf - -C {temp_dir}")
    audio_path_list = []
    references = []
    audio_dir = "./ws_meeting"
    output_dir = "./ws_meeting/ws_meeting_wav"
    ws_meeting = json.load(open("ws_meeting/WenetSpeech.json", "r", encoding="utf-8"))
    print(f"total audio num: {len(ws_meeting['audios'])}")
    for long_audio in ws_meeting['audios']:
        if "TEST_MEETING" in long_audio["aid"]:
            print("=" * 20)
            aid = long_audio["aid"]
            path = long_audio["path"]
            segments = long_audio["segments"]
            # In TEST_MEETING each entry carries exactly one segment.
            assert len(segments) == 1
            audio_path = os.path.join(output_dir, f"{os.path.split(path)[-1].split('.')[0]}.wav")
            os.system("ffmpeg -loglevel quiet -y -i {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
                os.path.join(audio_dir, path), audio_path
            ))
            text = segments[0]["text"]
            audio_path_list.append(audio_path)
            references.append(text)
            print(f"segment save to {audio_path}. text: {text}")
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
KeSpeech
Introduction: https://openreview.net/pdf?id=b3Zoeq2sCLq
The KeSpeech dataset contains 1542 hours of speech recorded by 27237 speakers from 34 cities in China, covering standard Mandarin and eight of its subdialects.
Dataset processing
Data sample
Tasks/ASR/test/text
1000043_58fb23b5 参与风筝表演的都是世界级顶尖高手
1000043_6674a7fd 参与重大矛盾纠纷化解十一起
1000043_71dd2737 反而在面颊间扫上淡淡的腮红
Tasks/ASR/test/utt2subdialect
1000043_58fb23b5 Northeastern
1000043_6674a7fd Northeastern
1000043_71dd2737 Northeastern
Tasks/ASR/test/wav.scp
1000043_58fb23b5 Audio/1000043/phase1/1000043_58fb23b5.wav
1000043_6674a7fd Audio/1000043/phase1/1000043_6674a7fd.wav
1000043_71dd2737 Audio/1000043/phase1/1000043_71dd2737.wav
Extracting the test set
The KeSpeech dataset ships as a single archive, so the test set has to be extracted first.
import os
import shutil

from tqdm import tqdm

def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as fin:
        lines = fin.readlines()
    return [line.strip().split() for line in lines]

dataset_dir = "KeSpeech"
test_dataset_dir = "KeSpeech-test"
text_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "text")
wav_scp_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "wav.scp")
subdialects_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "utt2subdialect")
text_list = read_text_file(text_file)
wav_scp_list = read_text_file(wav_scp_file)
subdialects_list = read_text_file(subdialects_file)
assert len(text_list) == len(wav_scp_list) == len(subdialects_list)
print(f"Total {len(text_list)} utterances in the test set.")
with open(os.path.join(test_dataset_dir, "test.txt"), 'w', encoding='utf-8') as fout:
    for i in tqdm(range(len(text_list))):
        # The three files must be aligned on utterance ID.
        assert text_list[i][0] == wav_scp_list[i][0] == subdialects_list[i][0]
        utt_id = text_list[i][0]
        text = text_list[i][1]
        wav_path = wav_scp_list[i][1]
        subdialect = subdialects_list[i][1]
        wav_dir = os.path.join(test_dataset_dir, os.path.dirname(wav_path))
        os.makedirs(wav_dir, exist_ok=True)
        shutil.copy(os.path.join(dataset_dir, wav_path), wav_dir)
        fout.write(f"{utt_id}\t{text}\t{subdialect}\t{wav_path}\n")
print(f"Test dataset prepared at {test_dataset_dir}.")
Reading the test data
def get_kespeech_test_data():
    audio_path_list = []
    references = []
    subdialects = []
    test_dir = "./KeSpeech-test"
    with open(os.path.join(test_dir, "test.txt"), "r", encoding="utf-8") as fin:
        for line in fin:
            data = line.strip().split('\t')
            assert len(data) == 4
            audio_path = os.path.join(test_dir, data[3])
            reference = data[1]
            subdialect = data[2]
            audio_path_list.append(audio_path)
            references.append(reference)
            subdialects.append(subdialect)
    assert len(audio_path_list) == len(references) == len(subdialects)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references, subdialects
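Since the function also returns each utterance's subdialect label, KeSpeech results are typically broken down per accent region. A small sketch of grouping utterance indices by subdialect (the labels and references below are illustrative stand-ins):

```python
from collections import defaultdict

def group_by_subdialect(references, subdialects):
    """Map each subdialect label to the list of utterance indices it covers."""
    groups = defaultdict(list)
    for idx, subdialect in enumerate(subdialects):
        groups[subdialect].append(idx)
    return dict(groups)

# Illustrative labels; real ones come from utt2subdialect.
groups = group_by_subdialect(
    ["甲", "乙", "丙"], ["Northeastern", "Mandarin", "Northeastern"])
```

With these index groups, per-subdialect CER can be computed by scoring only the utterances in each group.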
Common Voice
Introduction: https://commonvoice.mozilla.org
Common Voice is a free, open-source platform for community-driven data collection: a publicly accessible open speech dataset covering more than 130 languages, built by the community for ASR, STT, TTS, and other NLP uses.
Dataset processing
Data sample
cv-corpus-22.0-2025-06-20/zh-CN/test.tsv
client_id path sentence_id sentence sentence_domain up_votes down_votes age gender accents variant locale segment
0294d9751511843184ab839d1837ab7a41a9834876686496779fc5ba5705d615a199390220ec553bacbb6e883a6914e2ffbfb65aa11ce2b03d5a6b2b8c860aa4 common_voice_zh-CN_33411896.mp3 fc88f495ef25b785e3fbc0c56d95a74761d3ae1da57855b608d9fcc5c553cec8 黑身准裂腹鱼为辐鳍鱼纲鲤形目鲤科的其中一种。 2 0 zh-CN
02bf7ccb5f078eb0294cc22f6d725720e4079d0190fa81e71db04ff4cbdf4f22126bdf528b8591f4fe1069fad05f1911977cf960ef8e09a922b3b10d6a6926f0 common_voice_zh-CN_32269533.mp3 8c25ebbacc7b9fcfcca3c1d22acba9181ce38edf00502da657fd1f73c9300ecb 否 2 1 zh-CN Benchmark
02ec74191c6ccc7dcf6ecaa217268263c477273b4de93fee0ca6aa2974916fbd14445737920d32cc628f3b2895e55662ea880b43ed07f34fae40d0b770efad1d common_voice_zh-CN_22069600.mp3 73f88858e2402cca1896ec78186d8db1a7f5e66221fc7fa98152a973f8a21db1 宋朝末年年间定居粉岭围。 2 0 zh-CN
Extracting the test set
import os

from tqdm import tqdm

output_dir = "common_voice_22-zh/audio"
os.system("cp zh-CN/test.tsv common_voice_22-zh/test.tsv")
with open("zh-CN/test.tsv", "r", encoding="utf-8") as fin:
    lines = fin.readlines()
print(lines[0].strip().split("\t"))  # header row
for line in tqdm(lines[1:]):
    parts = line.strip().split("\t")
    client_id = parts[0]
    path = parts[1]
    sentence = parts[3]
    if not os.path.exists(f"zh-CN/clips/{path}"):
        print(f"File zh-CN/clips/{path} does not exist for client {client_id}: {path}, {sentence}.")
        exit(1)
    os.system(f"cp zh-CN/clips/{path} {output_dir}/{path}")
    if not os.path.exists(f"{output_dir}/{path}"):
        print(f"Failed to copy {path} to {output_dir}.")
        exit(1)
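Because test.tsv is tab-separated with a header row, the manual split("\t") above can also be done with the standard library's csv.DictReader, which addresses columns by name and is less fragile if the column order changes. A sketch using an in-memory sample with the same columns:

```python
import csv
import io

def read_cv_tsv(fileobj):
    """Yield (path, sentence) pairs from a Common Voice test.tsv file object."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    for row in reader:
        yield row["path"], row["sentence"]

# Minimal in-memory example mirroring the test.tsv header shown above.
sample = ("client_id\tpath\tsentence_id\tsentence\n"
          "abc\tcommon_voice_zh-CN_1.mp3\tsid\t你好\n")
pairs = list(read_cv_tsv(io.StringIO(sample)))
```

For the real file, pass the opened test.tsv instead of the StringIO sample.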
Reading the test data
def get_common_voice_test_data():
    audio_path_list = []
    references = []
    test_dir = "./common_voice_22-zh/audio"
    with open(os.path.join("./common_voice_22-zh", "test.tsv"), "r", encoding="utf-8") as fin:
        lines = fin.readlines()
    for line in tqdm(lines[1:]):  # skip the header row
        parts = line.strip().split("\t")
        audio_path = os.path.join(test_dir, parts[1])
        reference = parts[3]
        audio_path_list.append(audio_path)
        references.append(reference)
    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
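Common Voice clips are MP3, while the other test sets here end up as 16 kHz mono WAV; if a model's front end requires WAV, the clips can be converted with ffmpeg. The sketch below only builds the command list (mirroring the flags used for WenetSpeech above) rather than presuming conversion is required:

```python
import os

def mp3_to_wav_cmd(mp3_path: str, wav_dir: str) -> list:
    """Build an ffmpeg command converting an MP3 clip to 16 kHz mono PCM WAV."""
    base = os.path.splitext(os.path.basename(mp3_path))[0]
    wav_path = os.path.join(wav_dir, base + ".wav")
    return ["ffmpeg", "-loglevel", "quiet", "-y", "-i", mp3_path,
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", wav_path]

cmd = mp3_to_wav_cmd("clips/common_voice_zh-CN_1.mp3", "audio")
# Run with subprocess.run(cmd, check=True) if ffmpeg is installed.
```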
MAGICDATA-READ
Introduction: https://www.openslr.org/68/
This corpus was developed by MAGIC DATA Technology Co., Ltd. and is released free of charge for non-commercial use only.
Key facts about the corpus:
- The corpus contains 755 hours of speech data, the vast majority recorded on mobile devices.
- 1080 speakers from different dialect regions of China were invited to record.
- Sentence transcription accuracy is above 98%.
- All recordings were made in quiet indoor environments.
- The database is split into training, validation, and test sets at a ratio of 51:1:2.
- Details such as speech data encoding and speaker information are stored in metadata files.
- The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, and home command and control.
- Segmented transcripts are also provided.
The corpus is intended to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore completely free for academic use.
Dataset processing
Data sample
TRANS.txt
UtteranceID SpeakerID Transcription
38_5715_20170914193306.wav 38_5715 口口音乐
38_5716_20170914202211.wav 38_5716 嗨天气寒冷记得添衣保暖哦
38_5716_20170914202228.wav 38_5716 人的能耐再大大不过天阵雨转阴又变了天了天要下雨娘要嫁人由她去吧
Extracting the test set
def get_magicdata_read_test_data():
    audio_path_list = []
    references = []
    test_dir = "./MAGICDATA-test"
    with open(os.path.join(test_dir, "TRANS.txt"), "r", encoding="utf-8") as fin:
        next(fin)  # skip the "UtteranceID SpeakerID Transcription" header
        for line in fin:
            data = line.strip().split('\t')
            assert len(data) == 3
            # Audio files live under a per-speaker directory.
            audio_path = os.path.join(test_dir, data[1], data[0])
            reference = data[2]
            audio_path_list.append(audio_path)
            references.append(reference)
    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
MAGICDATA-RAMC
Introduction: https://magichub.com/datasets/magicdata-ramc/
Contains 180 hours of conversational Mandarin speech; the training, development, and test sets are 150, 10, and 20 hours respectively.
Dataset processing
Data sample
MagicData-RAMC/DataPartition/test.tsv
../MDT2021S003/WAV/
CTS-CN-F2F-2019-11-15-53.wav 30310800
CTS-CN-F2F-2019-11-15-75.wav 29812400
CTS-CN-F2F-2019-11-15-144.wav 29768400
Extracting the test set
import os

from tqdm import tqdm

output_dir = "./MagicData-RAMC-test"
os.makedirs(os.path.join(output_dir, "WAV"), exist_ok=True)
os.makedirs(os.path.join(output_dir, "TXT"), exist_ok=True)
os.system(f"cp MDT2021S003/README.txt {output_dir}")
os.system(f"cp MDT2021S003/SPKINFO.txt {output_dir}")
os.system(f"cp MDT2021S003/UTTERANCEINFO.txt {output_dir}")
os.system(f"cp DataPartition/test.tsv {output_dir}")
with open("./DataPartition/test.tsv", "r", encoding="utf-8") as fin:
    for line in tqdm(fin.readlines()[1:]):  # first line is the "../MDT2021S003/WAV/" prefix
        data = line.strip().split("\t")
        audio_path = os.path.join("MDT2021S003", "WAV", data[0])
        reference_path = os.path.join("MDT2021S003", "TXT", data[0].replace(".wav", ".txt"))
        os.system(f"cp {audio_path} {output_dir}/WAV")
        os.system(f"cp {reference_path} {output_dir}/TXT")
Reading the test data
MagicData-RAMC consists of long conversational recordings; here each one is split into short utterances according to the annotations before testing.
def get_magicdata_ramc_test_data():
    import re
    import string
    from zhon.hanzi import punctuation as chinese_punctuation

    audio_path_list = []
    references = []
    # Strip both ASCII and Chinese punctuation from the transcripts.
    all_punctuation = string.punctuation + chinese_punctuation
    pattern = re.compile(f"[{re.escape(all_punctuation)}]")
    test_dir = "./MagicData-RAMC-test"
    temp_dir = "./MagicData-RAMC-test/temp"
    with open(os.path.join(test_dir, "test.tsv"), "r", encoding="utf-8") as fin:
        lines = fin.readlines()
    for line in tqdm(lines[1:]):
        data = line.strip().split("\t")
        audio_name = data[0][:-4]  # drop the ".wav" suffix
        audio_path = os.path.join(test_dir, "WAV", data[0])
        reference_path = os.path.join(test_dir, "TXT", data[0].replace(".wav", ".txt"))
        with open(reference_path, "r", encoding="utf-8") as ftxt:
            for idx, anno in enumerate(ftxt):
                data_list = anno.strip().split("\t")
                reference = data_list[3]
                # Skip non-speech annotation lines such as bracketed event tags.
                if reference[0] == "[" and reference[-1] == "]":
                    continue
                new_audio_path = os.path.join(temp_dir, f"{audio_name}-{idx}.wav")
                # The timestamp field looks like "[start,end]" in seconds.
                time_list = data_list[0][1:-1].split(",")
                assert len(time_list) == 2
                start_time, end_time = float(time_list[0]), float(time_list[1])
                # os.system(f"ffmpeg -loglevel quiet -y -i {audio_path} -ss {start_time} -to {end_time} -acodec pcm_s16le -ar 16000 -ac 1 {new_audio_path}")
                audio_path_list.append(new_audio_path)
                reference = reference.replace("[+]", "")
                reference = pattern.sub('', reference)
                references.append(reference)
    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references
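The per-line parsing buried in get_magicdata_ramc_test_data can be isolated into a small helper. Note that the line layout assumed here (a "[start,end]" timestamp field, two intermediate fields, then the transcript, tab-separated) is inferred from the indexing in the code above, not from official MagicData-RAMC documentation:

```python
def parse_ramc_line(line: str):
    """Parse one RAMC TXT annotation line into (start, end, transcript).

    Assumed layout (inferred, not official):
    "[start,end]<TAB>field<TAB>field<TAB>transcript"
    """
    fields = line.strip().split("\t")
    start_s, end_s = fields[0][1:-1].split(",")  # strip the surrounding brackets
    return float(start_s), float(end_s), fields[3]

# Synthetic example line in the assumed layout.
start, end, text = parse_ramc_line("[1.230,4.560]\tUTT1\tG0001\t你好吗")
```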