Overview of Chinese ASR Test Sets

Dataset Paper Test-set utterance count
AISHELL-1 https://ieeexplore.ieee.org/document/8384449 7176
AISHELL-2(AISHELL-2018A-EVAL) https://arxiv.org/abs/1808.10583 5000
WenetSpeech Internet/meeting domain https://arxiv.org/abs/2110.03370 24774 / 8370
KeSpeech https://openreview.net/forum?id=b3Zoeq2sCLq 19723
Common Voice 22 10635
MAGICDATA-READ 24280
MAGICDATA-RAMC https://arxiv.org/abs/2203.16844 23012

Chinese ASR Test Sets

AISHELL-1

Introduction: https://www.aishelltech.com/kysjcp

AISHELL-ASR0009-OS1, the open-source Mandarin corpus released by AISHELL (Beijing Shell Shell Technology), contains 178 hours of recordings and is a subset of the AISHELL-ASR0009 corpus. The recording texts of AISHELL-ASR0009 cover 11 domains, including smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The high-fidelity recordings were downsampled to 16 kHz to produce AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China participated in the recording. Transcripts were produced by professional annotators and passed strict quality inspection; transcription accuracy is above 95%. The corpus is split into training, development, and test sets.

Dataset processing

Data sample

speech_asr_aishell_testsets.csv

Audio:FILE,Text:LABEL
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0121.wav,甚 至 出 现 交 易 几 乎 停 滞 的 情 况 
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0122.wav,一 二 线 城 市 虽 然 也 处 于 调 整 中 
speech_asr_aishell_testsets/wav/test/S0764/BAC009S0764W0123.wav,但 因 为 聚 集 了 过 多 公 共 资 源 
Extracting the test set
import csv


def get_aishell1_test_data():
    audio_path_list = []
    references = []

    with open("speech_asr_aishell_testsets.csv", "r", encoding="utf-8") as fin:
        csv_dict_reader = csv.DictReader(fin)
        for row in csv_dict_reader:
            audio_path = row['Audio:FILE']
            reference = row['Text:LABEL'].replace(" ", "")
            audio_path_list.append(audio_path)
            references.append(reference)

    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")

    return audio_path_list, references
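Once loaded, the test set is typically used to score an ASR system with character error rate (CER). A minimal, self-contained sketch; `edit_distance` and `cer` are generic helpers written for illustration, not part of any dataset toolkit:

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Character-level Levenshtein distance (insert/delete/substitute, cost 1 each).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def cer(references, hypotheses) -> float:
    # CER = total edit distance / total reference characters.
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars


# One substituted character ("滞" -> "止") out of 13 reference characters.
refs = ["甚至出现交易几乎停滞的情况"]
hyps = ["甚至出现交易几乎停止的情况"]
print(f"CER: {cer(refs, hyps):.4f}")  # CER: 0.0769
```

In practice, hypotheses should be normalized the same way as the references (for example, the space removal above) before scoring.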

AISHELL-2 (AISHELL-2018A-EVAL)

Introduction: https://www.aishelltech.com/aishell_2

The AISHELL-2 Mandarin speech corpus contains 1000 hours of speech: 718 hours from AISHELL-ASR0009-[ZH-CN] and 282 hours from AISHELL-ASR0010-[ZH-CN]. The recording texts cover 12 domains, including wake words, voice-control commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). AISHELL-2 uses the speech recorded on the iOS phones. 1991 speakers from different accent regions of China participated in the recording. Transcripts were produced by professional annotators and passed strict quality inspection; transcription accuracy is above 96%. (Available for academic research; commercial use is prohibited without permission.)

Dataset processing

Data sample

trans.txt

IT0011W0001	换一首歌
IT0011W0002	几点了
IT0011W0003	早上好

wav.scp

IT0011W0001	wav/T0011/IT0011W0001.wav
IT0011W0002	wav/T0011/IT0011W0002.wav
IT0011W0003	wav/T0011/IT0011W0003.wav
Extracting the test set
import os


def get_aishell2_test_data():
    test_dir = "AISHELL-DEV-TEST-SET/iOS/test/"

    uttid1_list = []
    audio_path_list = []
    with open(os.path.join(test_dir, "wav.scp"), "r", encoding="utf-8") as fin:
        for line in fin.readlines():
            data = line.strip().split('\t')
            assert len(data) == 2
            uttid1_list.append(data[0])
            audio_path = os.path.join(test_dir, data[1])
            audio_path_list.append(audio_path)

    uttid2_list = []
    references = []
    with open(os.path.join(test_dir, "trans.txt"), "r", encoding="utf-8") as fin:
        for line in fin.readlines():
            data = line.strip().split('\t')
            assert len(data) == 2
            uttid2_list.append(data[0])
            references.append(data[1])

    assert uttid1_list == uttid2_list
    print(f"Load testset done, total {len(references)} utterances.")

    return audio_path_list, references

WenetSpeech

Introduction: https://wenet.org.cn/WenetSpeech/

WenetSpeech is a multi-domain Mandarin corpus with more than 10,000 hours of transcribed speech, collected from YouTube and podcasts. Optical character recognition (OCR) and automatic speech recognition (ASR) were used to label the YouTube and podcast recordings, respectively. To improve corpus quality, a novel end-to-end label-error-detection method was applied to further validate and filter the data.

Dataset processing

Data sample

All of the data is indexed by a single large JSON file, WenetSpeech.json.
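Only a few of its fields are needed for test-set extraction. Their layout, as inferred from the extraction code in this section, looks roughly like this; every value below is invented for illustration:

```python
# Illustrative shape of WenetSpeech.json (values are made up; only the
# fields read by the extraction code are shown).
wenetspeech = {
    "audios": [
        {
            "aid": "TEST_NET_Y0000000000",        # long-recording id; prefix marks the split
            "path": "audio/test_net/example.opus",
            "segments": [
                {
                    "sid": "TEST_NET_Y0000000000_S00000",
                    "begin_time": 0.0,
                    "end_time": 3.2,
                    "text": "示例转写文本",
                },
            ],
        },
    ],
}

# Each segment is cut out of its long recording and paired with its transcript.
for audio in wenetspeech["audios"]:
    for seg in audio["segments"]:
        print(seg["sid"], seg["begin_time"], seg["end_time"], seg["text"])
```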

Extracting the test set

The dataset archives are encrypted; you need to apply for the decryption password.

openssl aes-256-cbc -d -salt -pass pass:PASSWORD -pbkdf2 -in ENCRYPTED_FILE | tar xzf - -C OUTPUT_DIR
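When many archives need decrypting, the same pipeline can be assembled from Python; `build_decrypt_cmd` is a hypothetical helper, and its flags simply mirror the openssl command above (the file names are also made up):

```python
import shlex


def build_decrypt_cmd(password: str, encrypted_file: str, output_dir: str) -> str:
    # Mirror of: openssl aes-256-cbc -d -salt -pass pass:... -pbkdf2 -in ... | tar xzf - -C ...
    return (
        f"openssl aes-256-cbc -d -salt -pass pass:{shlex.quote(password)} -pbkdf2 "
        f"-in {shlex.quote(encrypted_file)} | tar xzf - -C {shlex.quote(output_dir)}"
    )


cmd = build_decrypt_cmd("XXXXXXXXXXX", "ws_net/B00001.ase.tgz", "./ws_net")
print(cmd)
# run it with e.g. subprocess.run(cmd, shell=True, check=True)
```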

Extracting the test files

import json
import os

from tqdm import tqdm


def get_ws_net_test_data():
    # from modelscope.msdatasets import MsDataset
    # ds = MsDataset.load('wenet/WenetSpeech', subset_name='default', split='test_net')
    # print(list(ds))

    audio_path_list = []
    references = []

    audio_dir = "./ws_net"
    output_dir = "./ws_net/ws_net_wav"
    os.makedirs(output_dir, exist_ok=True)
    with open("ws_net/WenetSpeech.json", "r", encoding="utf-8") as fin:
        ws_net = json.load(fin)
    print(f"total audio num: {len(ws_net['audios'])}")
    for long_audio in ws_net['audios']:
        if "TEST_NET" in long_audio["aid"]:
            print("=" * 20)
            aid = long_audio["aid"]
            path = long_audio["path"]
            segments = long_audio["segments"]
            # os.system("ffmpeg -y -i {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
            #     os.path.join(audio_dir, path), os.path.join(output_dir, f"{os.path.split(path)[-1].split('.')[0]}.wav")
            # ))
            for segment in tqdm(segments):
                sid = segment["sid"]
                begin_time = segment["begin_time"]
                end_time = segment["end_time"]
                text = segment["text"]
                audio_path = os.path.join(output_dir, f"{sid}.wav")

                os.system("ffmpeg -loglevel quiet -y -i {} -ss {} -to {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
                    os.path.join(audio_dir, path), begin_time, end_time, audio_path
                ))

                audio_path_list.append(audio_path)
                references.append(text)
                # print(f"segment save to {audio_path}. text: {text}")
            print(f"process {aid} done. load {len(segments)} segments.")

    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references


def get_ws_meeting_test_data():
    # temp_dir = "./datasets/ws_meeting"
    # # replace with your own password
    # password = "XXXXXXXXXXX"

    # from modelscope.msdatasets import MsDataset
    # ds =  MsDataset.load('wenet/WenetSpeech', subset_name='default', split='test_meeting')
    # for id, item in tqdm(enumerate(list(ds)[0][1])):
    #     temp_path = os.path.join(temp_dir, f"B000{id}.ase.tgz")
    #     os.system(f"cp {item} {temp_path}")
    #     os.system(f"openssl aes-256-cbc -d -salt -pass pass:{password} -pbkdf2 -in {temp_path} | tar xzf - -C {temp_dir}")


    audio_path_list = []
    references = []

    audio_dir = "./ws_meeting"
    output_dir = "./ws_meeting/ws_meeting_wav"
    os.makedirs(output_dir, exist_ok=True)
    with open("ws_meeting/WenetSpeech.json", "r", encoding="utf-8") as fin:
        ws_meeting = json.load(fin)
    print(f"total audio num: {len(ws_meeting['audios'])}")
    for long_audio in ws_meeting['audios']:
        if "TEST_MEETING" in long_audio["aid"]:
            print("=" * 20)
            aid = long_audio["aid"]
            path = long_audio["path"]
            segments = long_audio["segments"]
            assert len(segments) == 1
            audio_path = os.path.join(output_dir, f"{os.path.split(path)[-1].split('.')[0]}.wav")
            os.system("ffmpeg -loglevel quiet -y -i {} -acodec pcm_s16le -ar 16000 -ac 1 {}".format(
                os.path.join(audio_dir, path), audio_path
            ))
            text = segments[0]["text"]
            audio_path_list.append(audio_path)
            references.append(text)
            print(f"segment save to {audio_path}. text: {text}")

    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references

KeSpeech

Introduction: https://openreview.net/pdf?id=b3Zoeq2sCLq

The KeSpeech dataset contains 1542 hours of speech recorded by 27,237 speakers from 34 cities in China, covering standard Mandarin and eight of its subdialects.

Dataset processing

Data sample

Tasks/ASR/test/text

1000043_58fb23b5 参与风筝表演的都是世界级顶尖高手
1000043_6674a7fd 参与重大矛盾纠纷化解十一起
1000043_71dd2737 反而在面颊间扫上淡淡的腮红

Tasks/ASR/test/utt2subdialect

1000043_58fb23b5 Northeastern
1000043_6674a7fd Northeastern
1000043_71dd2737 Northeastern

Tasks/ASR/test/wav.scp

1000043_58fb23b5 Audio/1000043/phase1/1000043_58fb23b5.wav
1000043_6674a7fd Audio/1000043/phase1/1000043_6674a7fd.wav
1000043_71dd2737 Audio/1000043/phase1/1000043_71dd2737.wav
Extracting the test set

The KeSpeech dataset ships as one archive, so the test set has to be extracted first.

import os
import shutil
from tqdm import tqdm


def read_text_file(file_path):
    # Each line is "<utt-id> <value>"; split only once so values that
    # contain spaces stay intact.
    with open(file_path, 'r', encoding='utf-8') as fin:
        return [line.strip().split(maxsplit=1) for line in fin]


dataset_dir = "KeSpeech"
test_dataset_dir = "KeSpeech-test"

text_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "text")
wav_scp_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "wav.scp")
subdialects_file = os.path.join(dataset_dir, "Tasks", "ASR", "test", "utt2subdialect")

text_list = read_text_file(text_file)
wav_scp_list = read_text_file(wav_scp_file)
subdialects_list = read_text_file(subdialects_file)

assert len(text_list) == len(wav_scp_list) == len(subdialects_list)
print(f"Total {len(text_list)} utterances in the test set.")

os.makedirs(test_dataset_dir, exist_ok=True)
with open(os.path.join(test_dataset_dir, "test.txt"), 'w', encoding='utf-8') as fout:
    for i in tqdm(range(len(text_list))):
        assert text_list[i][0] == wav_scp_list[i][0] == subdialects_list[i][0]
        utt_id = text_list[i][0]
        text = text_list[i][1]
        wav_path = wav_scp_list[i][1]
        subdialect = subdialects_list[i][1]
        wav_dir = os.path.join(test_dataset_dir, os.path.dirname(wav_path))
        os.makedirs(wav_dir, exist_ok=True)
        shutil.copy(os.path.join(dataset_dir, wav_path), wav_dir)
        fout.write(f"{utt_id}\t{text}\t{subdialect}\t{wav_path}\n")
print(f"Test dataset prepared at {test_dataset_dir}.")

Reading the test data

import os


def get_kespeech_test_data():
    audio_path_list = []
    references = []
    subdialects = []

    test_dir = "./KeSpeech-test"

    with open(os.path.join(test_dir, "test.txt"), "r", encoding="utf-8") as fin:
        for line in fin.readlines():
            data = line.strip().split('\t')
            assert len(data) == 4
            audio_path = os.path.join(test_dir, data[3])
            reference = data[1]
            subdialect = data[2]
            audio_path_list.append(audio_path)
            references.append(reference)
            subdialects.append(subdialect)

    assert len(audio_path_list) == len(references) == len(subdialects)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references, subdialects 
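Since the loader also returns each utterance's subdialect, KeSpeech results are usually broken down per dialect. A small sketch of that grouping (the utterance entries here are invented):

```python
from collections import defaultdict


def group_by_subdialect(audio_path_list, references, subdialects):
    # Map each subdialect to its (audio, reference) pairs for per-dialect scoring.
    groups = defaultdict(list)
    for audio, ref, dialect in zip(audio_path_list, references, subdialects):
        groups[dialect].append((audio, ref))
    return dict(groups)


groups = group_by_subdialect(
    ["a.wav", "b.wav", "c.wav"],
    ["文本一", "文本二", "文本三"],
    ["Northeastern", "Mandarin", "Northeastern"],
)
for dialect, items in groups.items():
    print(dialect, len(items))
```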

Common Voice

Introduction: https://commonvoice.mozilla.org

Common Voice is a free, open-source platform for community-driven data collection: a publicly accessible open speech dataset covering more than 130 languages, built by the community for ASR, STT, TTS, and other NLP uses.

Dataset processing

Data sample

cv-corpus-22.0-2025-06-20/zh-CN/test.tsv

client_id	path	sentence_id	sentence	sentence_domain	up_votes	down_votes	age	gender	accents	variant	locale	segment
0294d9751511843184ab839d1837ab7a41a9834876686496779fc5ba5705d615a199390220ec553bacbb6e883a6914e2ffbfb65aa11ce2b03d5a6b2b8c860aa4	common_voice_zh-CN_33411896.mp3	fc88f495ef25b785e3fbc0c56d95a74761d3ae1da57855b608d9fcc5c553cec8	黑身准裂腹鱼为辐鳍鱼纲鲤形目鲤科的其中一种。		2	0					zh-CN	
02bf7ccb5f078eb0294cc22f6d725720e4079d0190fa81e71db04ff4cbdf4f22126bdf528b8591f4fe1069fad05f1911977cf960ef8e09a922b3b10d6a6926f0	common_voice_zh-CN_32269533.mp3	8c25ebbacc7b9fcfcca3c1d22acba9181ce38edf00502da657fd1f73c9300ecb	否		2	1					zh-CN	Benchmark
02ec74191c6ccc7dcf6ecaa217268263c477273b4de93fee0ca6aa2974916fbd14445737920d32cc628f3b2895e55662ea880b43ed07f34fae40d0b770efad1d	common_voice_zh-CN_22069600.mp3	73f88858e2402cca1896ec78186d8db1a7f5e66221fc7fa98152a973f8a21db1	宋朝末年年间定居粉岭围。		2	0					zh-CN	
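Since test.tsv has a header row, it can also be parsed with the standard library's `csv.DictReader`, which keeps column access correct even if columns are reordered. A sketch, with an inline stand-in for the real file:

```python
import csv
import io

# Stand-in for zh-CN/test.tsv; only the columns used later are filled in.
sample_tsv = (
    "client_id\tpath\tsentence_id\tsentence\tsentence_domain\tup_votes\tdown_votes\t"
    "age\tgender\taccents\tvariant\tlocale\tsegment\n"
    "abc123\tcommon_voice_zh-CN_33411896.mp3\tsid1\t黑身准裂腹鱼为辐鳍鱼纲鲤形目鲤科的其中一种。"
    "\t\t2\t0\t\t\t\t\tzh-CN\t\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
for row in rows:
    print(row["path"], row["sentence"])
```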
Extracting the test set

import os

from tqdm import tqdm

output_dir = "common_voice_22-zh/audio"
os.makedirs(output_dir, exist_ok=True)

os.system("cp zh-CN/test.tsv common_voice_22-zh/test.tsv")

with open("zh-CN/test.tsv", "r", encoding="utf-8") as fin:
    lines = fin.readlines()
    print(lines[0].strip().split("\t"))  # header row
    for line in tqdm(lines[1:]):
        parts = line.strip().split("\t")
        client_id = parts[0]
        path = parts[1]
        sentence = parts[3]
        if not os.path.exists(f"zh-CN/clips/{path}"):
            print(f"File zh-CN/clips/{path} does not exist for client {client_id}: {sentence}.")
            exit(1)
        os.system(f"cp zh-CN/clips/{path} {output_dir}/{path}")
        if not os.path.exists(f"{output_dir}/{path}"):
            print(f"Failed to copy {path} to {output_dir}.")
            exit(1)

Reading the test data

import os

from tqdm import tqdm


def get_common_voice_test_data():
    audio_path_list = []
    references = []

    test_dir = "./common_voice_22-zh/audio"
    with open(os.path.join("./common_voice_22-zh", "test.tsv"), "r", encoding="utf-8") as fin:
        lines = fin.readlines()
        for line in tqdm(lines[1:]):
            parts = line.strip().split("\t")
            audio_path = os.path.join(test_dir, parts[1])
            reference = parts[3]
            audio_path_list.append(audio_path)
            references.append(reference)

    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references

MAGICDATA-READ

Introduction: https://www.openslr.org/68/

This corpus was developed by Magic Data Technology Co., Ltd. and is released free of charge for non-commercial use only.
Corpus contents and notes:

  • The corpus contains 755 hours of speech data, the vast majority recorded on mobile devices.
  • 1080 speakers from different dialect regions of China were invited to participate.
  • Sentence transcription accuracy is above 98%.
  • All recordings were made in a quiet indoor environment.
  • The database is divided into training, validation, and test sets in a 51:1:2 ratio.
  • Details such as the speech data encoding and speaker information are stored in metadata files.
  • The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, and home command and control.
  • Segmented transcripts are also provided.

The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore completely free for academic use.

Dataset processing

Data sample

TRANS.txt

UtteranceID	SpeakerID	Transcription
38_5715_20170914193306.wav	38_5715	口口音乐
38_5716_20170914202211.wav	38_5716	嗨天气寒冷记得添衣保暖哦
38_5716_20170914202228.wav	38_5716	人的能耐再大大不过天阵雨转阴又变了天了天要下雨娘要嫁人由她去吧
Extracting the test set
import os


def get_magicdata_read_test_data():
    audio_path_list = []
    references = []

    test_dir = "./MAGICDATA-test"
    with open(os.path.join(test_dir, "TRANS.txt"), "r", encoding="utf-8") as fin:
        for line in fin.readlines()[1:]:  # skip the header line
            data = line.strip().split('\t')
            assert len(data) == 3
            audio_path = os.path.join(test_dir, data[1], data[0])
            reference = data[2]
            audio_path_list.append(audio_path)
            references.append(reference)

    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references

MAGICDATA-RAMC

Introduction: https://magichub.com/datasets/magicdata-ramc/

MagicData-RAMC contains 180 hours of conversational Mandarin speech, with training, development, and test sets of 150, 10, and 20 hours respectively.

Dataset processing

Data sample

MagicData-RAMC/DataPartition/test.tsv

../MDT2021S003/WAV/
CTS-CN-F2F-2019-11-15-53.wav	30310800
CTS-CN-F2F-2019-11-15-75.wav	29812400
CTS-CN-F2F-2019-11-15-144.wav	29768400
Extracting the test set

import os

from tqdm import tqdm

output_dir = "./MagicData-RAMC-test"
os.makedirs(os.path.join(output_dir, "WAV"), exist_ok=True)
os.makedirs(os.path.join(output_dir, "TXT"), exist_ok=True)

os.system(f"cp MDT2021S003/README.txt {output_dir}")
os.system(f"cp MDT2021S003/SPKINFO.txt {output_dir}")
os.system(f"cp MDT2021S003/UTTERANCEINFO.txt {output_dir}")
os.system(f"cp DataPartition/test.tsv {output_dir}")

with open("./DataPartition/test.tsv", "r", encoding="utf-8") as fin:
    for line in tqdm(fin.readlines()[1:]):
        data = line.strip().split("\t")
        audio_path = os.path.join("MDT2021S003", "WAV", data[0])
        reference_path = os.path.join("MDT2021S003", "TXT", data[0].replace(".wav", ".txt"))
        os.system(f"cp {audio_path} {output_dir}/WAV")
        os.system(f"cp {reference_path} {output_dir}/TXT")

Reading the test data
MagicData-RAMC consists of long conversational recordings; for testing, each recording is split into short utterances according to the annotations.
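Each line of the per-conversation TXT files is tab-separated, with a `[start,end]` time interval in field 0 and the transcript in field 3 (this layout is inferred from the parsing code in this section; the sample line below is invented to match it):

```python
def parse_ramc_line(line: str):
    # Field 0: "[start,end]" in seconds; field 3: transcript.
    # The two middle fields are not used here.
    fields = line.rstrip("\n").split("\t")
    start, end = (float(t) for t in fields[0][1:-1].split(","))
    return start, end, fields[3]


start, end, text = parse_ramc_line("[0.000,2.500]\tG0001\t女\t喂你好\n")
print(start, end, text)  # 0.0 2.5 喂你好
```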

import os
import re
import string

from tqdm import tqdm
from zhon.hanzi import punctuation as chinese_punctuation


def get_magicdata_ramc_test_data():
    audio_path_list = []
    references = []

    # Strip both ASCII and Chinese punctuation from the references.
    all_punctuation = string.punctuation + chinese_punctuation
    pattern = re.compile(f"[{re.escape(all_punctuation)}]")

    test_dir = "./MagicData-RAMC-test"
    temp_dir = "./MagicData-RAMC-test/temp"
    with open(os.path.join(test_dir, "test.tsv"), "r", encoding="utf-8") as fin:
        lines = fin.readlines()
    for line in tqdm(lines[1:]):
        data = line.strip().split("\t")
        audio_name = data[0][:-4]
        audio_path = os.path.join(test_dir, "WAV", data[0])
        reference_path = os.path.join(test_dir, "TXT", data[0].replace(".wav", ".txt"))
        with open(reference_path, "r", encoding="utf-8") as ref_fin:
            for seg_id, seg_line in enumerate(ref_fin.readlines()):
                seg_fields = seg_line.strip().split("\t")
                reference = seg_fields[3]
                # Skip non-speech annotations such as [NOISE] or [MUSIC].
                if reference[0] == "[" and reference[-1] == "]":
                    continue
                new_audio_path = os.path.join(temp_dir, f"{audio_name}-{seg_id}.wav")
                time_list = seg_fields[0][1:-1].split(",")  # "[start,end]"
                assert len(time_list) == 2
                start_time, end_time = float(time_list[0]), float(time_list[1])
                # os.system(f"ffmpeg -loglevel quiet -y -i {audio_path} -ss {start_time} -to {end_time} -acodec pcm_s16le -ar 16000 -ac 1 {new_audio_path}")
                audio_path_list.append(new_audio_path)
                reference = reference.replace("[+]", "")
                reference = pattern.sub('', reference)
                references.append(reference)

    assert len(audio_path_list) == len(references)
    print(f"Load testset done, total {len(references)} utterances.")
    return audio_path_list, references