6.1 An In-Depth Guide to Data Preprocessing for VITS | "VITS in Practice: High-Quality Natural Speech Synthesis from Beginner to Practitioner"

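This article walks through the data preprocessing workflow for VITS: audio preprocessing steps such as format conversion, resampling, and volume normalization, and the construction of single-speaker and multi-speaker filelists. Preprocessing has a direct effect on how well VITS trains; consistent audio quality checks, text alignment, and filelist construction help improve model performance. The Python examples show a complete processing flow for a custom dataset and serve as a practical reference when preparing data for VITS.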

火马编程 · Published 2026-01-17 14:12:19

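Introduction

Data preprocessing is the foundation of VITS training and directly affects the final model. Earlier installments of this column covered data preparation and preprocessing only briefly, without a detailed workflow for custom data. This article covers the workflow in depth: processing a custom dataset, audio preprocessing, filelist construction, building and using Monotonic Alignment Search (MAS), and data augmentation.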

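Core Concepts

Why Data Preprocessing Matters

Data preprocessing converts raw data into a format the model can consume. For VITS, careful preprocessing can:

Improve training efficiency
Improve the quality of the generated speech
Improve generalization
Make it possible to support several kinds of datasets

Data Requirements of VITS

VITS expects its input data to meet the following requirements:

Audio format: WAV, typically at a 22050 Hz sampling rate
Text format: a transcript corresponding to each audio file
Data structure: filelist files in a specific format
Alignment information: produced by Monotonic Alignment Search (MAS) during training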
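The audio requirement above can be checked mechanically before training. The following is a minimal sketch that uses only the standard-library `wave` module. The helper names `wav_meets_requirements` and `write_test_tone` are hypothetical, and the mono (single-channel) expectation is an assumption rather than something stated above. The `wave` module handles uncompressed PCM WAV files only.

```python
import math
import os
import struct
import tempfile
import wave

def wav_meets_requirements(path, expected_sr=22050, expected_channels=1):
    """Return True if the WAV header matches the expected rate and channel count."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == expected_sr
                and wf.getnchannels() == expected_channels)

def write_test_tone(path, sr, seconds=0.1, freq=440.0):
    """Write a short 16-bit mono sine tone so the check runs without real data."""
    n_samples = int(sr * seconds)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq * n / sr)))
        for n in range(n_samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sr)
        wf.writeframes(frames)

# Hypothetical demo files in a temporary directory
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good_22050.wav")
bad_path = os.path.join(tmp_dir, "bad_16000.wav")
write_test_tone(good_path, sr=22050)
write_test_tone(bad_path, sr=16000)

print(wav_meets_requirements(good_path))
print(wav_meets_requirements(bad_path))
```

A check like this is cheap because it reads only the header, so it can be run over a whole dataset before any resampling is attempted.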

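Processing a Custom Dataset

1. Audio Preprocessing

Audio preprocessing is the first step. It mainly consists of format conversion, resampling, and volume normalization.

1.1 Format Conversion and Resampling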

import librosa
import soundfile as sf
import os
from tqdm import tqdm

def preprocess_audio(input_dir, output_dir, target_sr=22050):
    """
    Preprocess audio: format conversion, resampling, and volume normalization.

    Args:
        input_dir: directory containing the original audio
        output_dir: output directory for the preprocessed audio
        target_sr: target sampling rate
    """
    # Create the output directory
    os.makedirs(output_dir, exist_ok=True)

    # Collect all audio files (extension match is case-insensitive)
    audio_files = []
    for root, _, files in os.walk(input_dir):
        for file in files:
            if file.lower().endswith(('.wav', '.mp3', '.flac', '.ogg')):
                audio_files.append(os.path.join(root, file))

    # Preprocess each audio file
    for audio_path in tqdm(audio_files, desc="Preprocessing audio"):
        # Load at the native sampling rate (librosa downmixes to mono by default)
        y, sr = librosa.load(audio_path, sr=None)

        # Resample if needed
        if sr != target_sr:
            y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)

        # Peak-normalize the volume (maximum absolute amplitude becomes 1.0)
        y = librosa.util.normalize(y)

        # Build the output path, preserving the directory layout under input_dir
        relative_path = os.path.relpath(audio_path, input_dir)
        output_path = os.path.join(output_dir, os.path.splitext(relative_path)[0] + '.wav')

        # Create the output subdirectory
        os.makedirs(os.path.dirname(output_path), exist_ok=True)

        # Save the preprocessed audio as WAV
        sf.write(output_path, y, target_sr)

# Usage example
input_dir = "/path/to/original/audio"
output_dir = "/path/to/preprocessed/audio"
preprocess_audio(input_dir, output_dir, target_sr=22050)
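With its default arguments, `librosa.util.normalize` performs peak normalization: every sample is divided by the largest absolute value in the clip. The sketch below reproduces that arithmetic in plain Python on a toy signal, purely for illustration. The function name `peak_normalize` and the handling of silent clips are assumptions of this sketch, not librosa's code.

```python
def peak_normalize(samples, peak=1.0):
    """Scale samples so the largest absolute value equals `peak`."""
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        # A silent clip has nothing to scale; return it unchanged.
        return list(samples)
    gain = peak / max_abs
    return [s * gain for s in samples]

quiet_clip = [0.0, 0.1, -0.25, 0.05]   # toy waveform, peak at 0.25
normalized = peak_normalize(quiet_clip)
print(normalized)                       # the -0.25 sample is scaled to -1.0
```

Peak normalization equalizes the loudest sample across clips, not the perceived loudness, so a clip with a single loud click can still end up quieter overall than its neighbors.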

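1.2 Audio Quality Checks

During preprocessing we also need to check audio quality and filter out low-quality clips:

def check_audio_quality(audio_path, min_duration=0.5, max_duration=30.0):
    """
    Check the quality of an audio file.

    Args:
        audio_path: path to the audio file
        min_duration: minimum duration in seconds
        max_duration: maximum duration in seconds

    Returns:
        bool: True if the audio passes the checks
    """
    try:
        y, sr = librosa.load(audio_path, sr=None)
        duration = len(y) / sr

        # Check the duration
        if duration < min_duration or duration > max_duration:
            return False

        # Check the energy (mean of the frame-wise RMS)
        rms = librosa.feature.rms(y=y)[0].mean()
        if rms < 0.01:  # energy too low
            return False

        return True
    except Exception as e:
        print(f"Error while checking audio quality for {audio_path}: {e}")
        return False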
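To get a feel for the `0.01` energy threshold used above, it helps to express RMS in decibels relative to full scale (dBFS). The sketch below is illustrative arithmetic only: it computes a single RMS over the whole clip, whereas `check_audio_quality` averages frame-wise RMS values, so the two numbers are close but not identical. The helper names are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a whole clip."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def rms_to_dbfs(rms_value):
    """Convert an RMS amplitude (full scale = 1.0) to dBFS."""
    return 20.0 * math.log10(rms_value)

sr = 22050
# One second of a 440 Hz sine at full scale and at a peak of only 0.01
loud = [math.sin(2 * math.pi * 440.0 * n / sr) for n in range(sr)]
quiet = [0.01 * s for s in loud]

print(round(rms_to_dbfs(0.01), 1))   # the threshold expressed in dBFS
print(round(rms(loud), 4))           # RMS of a sine is peak / sqrt(2)
print(rms(quiet) < 0.01)             # a tone with peak 0.01 falls below the threshold
```

In other words, the threshold rejects clips whose average level sits roughly 40 dB below full scale, which after peak normalization mostly means near-silent recordings.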

2. Building the Filelist

The filelist is the core data manifest for VITS training: each line pairs an audio file path with its transcript.
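As an illustration of the line format, the sketch below parses both variants used in this article: `audio_path|text` for a single speaker and `audio_path|speaker_id|text` for multiple speakers. The function name `parse_filelist_line` and the example paths are hypothetical.

```python
def parse_filelist_line(line):
    """Split one filelist line into (audio_path, speaker_id, text).

    speaker_id is None for single-speaker lines.
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) == 2:
        audio_path, text = parts
        return audio_path, None, text
    if len(parts) == 3:
        audio_path, speaker_id, text = parts
        return audio_path, int(speaker_id), text
    raise ValueError(f"unexpected number of fields ({len(parts)}): {line!r}")

# Hypothetical example lines
single = "wavs/utt_0001.wav|Hello there.\n"
multi = "wavs/spk_a/utt_0001.wav|3|Hello there.\n"

print(parse_filelist_line(single))
print(parse_filelist_line(multi))
```

Because `|` is the field separator, a transcript that itself contains `|` yields extra fields and is rejected here, which is one reason to sanitize transcripts before writing them.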

2.1 Building a Single-Speaker Filelist

def build_single_speaker_filelist(audio_dir, text_dir, output_file):
    """
    Build a single-speaker filelist.

    Args:
        audio_dir: directory of preprocessed audio
        text_dir: directory of text transcripts
        output_file: path of the filelist to write
    """
    # Make sure the directory for the filelist exists
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)

    with open(output_file, 'w', encoding='utf-8') as f:
        for root, _, files in os.walk(audio_dir):
            for file in files:
                if file.endswith('.wav'):
                    # Path of the audio file
                    audio_path = os.path.join(root, file)

                    # Check the audio quality
                    if not check_audio_quality(audio_path):
                        continue

                    # Path of the matching transcript file
                    text_filename = os.path.splitext(file)[0] + '.txt'
                    text_path = os.path.join(text_dir, text_filename)

                    # Read the transcript
                    if os.path.exists(text_path):
                        with open(text_path, 'r', encoding='utf-8') as tf:
                            # Collapse line breaks and repeated spaces so each entry stays on one line
                            text = " ".join(tf.read().split())

                        # Write the entry (format: audio_path|text)
                        f.write(f"{audio_path}|{text}\n")

# Usage example
audio_dir = "/path/to/preprocessed/audio"
text_dir = "/path/to/text/transcriptions"
output_file = "filelists/my_dataset_train_filelist.txt"
build_single_speaker_filelist(audio_dir, text_dir, output_file)

2.2 Building a Multi-Speaker Filelist

A multi-speaker filelist additionally carries a speaker ID on every line:

def build_multi_speaker_filelist(audio_dir, text_dir, speaker_map, output_file):
    """
    Build a multi-speaker filelist.

    Args:
        audio_dir: directory of preprocessed audio (one subdirectory per speaker)
        text_dir: directory of text transcripts (one subdirectory per speaker)
        speaker_map: dict mapping speaker names to integer IDs
        output_file: path of the filelist to write
    """
    # Make sure the directory for the filelist exists
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)

    with open(output_file, 'w', encoding='utf-8') as f:
        for speaker_name, speaker_id in speaker_map.items():
            speaker_audio_dir = os.path.join(audio_dir, speaker_name)
            speaker_text_dir = os.path.join(text_dir, speaker_name)

            # Skip speakers that lack either audio or transcripts
            if not os.path.exists(speaker_audio_dir) or not os.path.exists(speaker_text_dir):
                continue

            for root, _, files in os.walk(speaker_audio_dir):
                for file in files:
                    if file.endswith('.wav'):
                        # Path of the audio file
                        audio_path = os.path.join(root, file)

                        # Check the audio quality
                        if not check_audio_quality(audio_path):
                            continue

                        # Path of the matching transcript file
                        text_filename = os.path.splitext(file)[0] + '.txt'
                        text_path = os.path.join(speaker_text_dir, text_filename)

                        # Read the transcript
                        if os.path.exists(text_path):
                            with open(text_path, 'r', encoding='utf-8') as tf:
                                # Collapse line breaks and repeated spaces so each entry stays on one line
                                text = " ".join(tf.read().split())

                            # Write the entry (format: audio_path|speaker_id|text)
                            f.write(f"{audio_path}|{speaker_id}|{text}\n")

# Usage example
speaker_map = {"speaker1": 0, "speaker2": 1, "speaker3": 2}
output_file = "filelists/my_multi_speaker_train_filelist.txt"
build_multi_speaker_filelist(audio_dir, text_dir, speaker_map, output_file)
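Once a multi-speaker filelist exists, it is worth confirming which speaker IDs actually made it into the file, since speakers whose clips were all filtered out leave gaps. The following is a small sketch with a hypothetical helper, `summarize_speaker_ids`. It assumes the training configuration has a speaker-count setting (called `n_speakers` in the official multi-speaker config, as far as I know) that must be larger than the highest ID in the filelist.

```python
def summarize_speaker_ids(filelist_lines):
    """Count utterances per speaker ID and report gaps in the ID range."""
    counts = {}
    for line in filelist_lines:
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # not a multi-speaker line
        speaker_id = int(parts[1])
        counts[speaker_id] = counts.get(speaker_id, 0) + 1
    # The speaker count must cover the highest ID, so it is max ID + 1
    min_n_speakers = max(counts) + 1 if counts else 0
    missing = [sid for sid in range(min_n_speakers) if sid not in counts]
    return counts, min_n_speakers, missing

# Hypothetical filelist content: speaker 1 has no surviving utterances
sample_lines = [
    "audio/speaker1/a.wav|0|Hello there.\n",
    "audio/speaker1/b.wav|0|Good morning.\n",
    "audio/speaker3/a.wav|2|Welcome back.\n",
]
counts, min_n_speakers, missing = summarize_speaker_ids(sample_lines)
print(counts, min_n_speakers, missing)
```

A non-empty `missing` list is not an error by itself, but it usually signals that a speaker directory was skipped or that every clip of that speaker failed the quality check.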

2.3 Cleaning the Filelist

After building the filelist, clean it to drop invalid entries. This step only filters lines; in the official VITS repository the .cleaned suffix conventionally marks filelists whose text has already been run through the text cleaners by preprocess.py, so keep the two meanings apart.

def clean_filelist(input_file, output_file):
    """
    Clean a filelist by dropping invalid entries.

    Args:
        input_file: path of the input filelist
        output_file: path of the cleaned filelist to write
    """
    valid_lines = []

    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Split according to the filelist format
        parts = line.split('|')
        if len(parts) == 2:  # single speaker
            audio_path, text = parts
        elif len(parts) == 3:  # multi-speaker
            audio_path, speaker_id, text = parts
        else:
            continue

        # Check that the audio file exists
        if not os.path.exists(audio_path):
            continue

        # Check that the text is not empty
        if not text.strip():
            continue

        # Check the audio quality
        if not check_audio_quality(audio_path):
            continue

        valid_lines.append(line)

    # Write the cleaned filelist
    with open(output_file, 'w', encoding='utf-8') as f:
        for line in valid_lines:
            f.write(f"{line}\n")

# Usage example
input_file = "filelists/my_dataset_train_filelist.txt"
output_file = "filelists/my_dataset_train_filelist.txt.cleaned"
clean_filelist(input_file, output_file)

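3. Building and Using Monotonic Alignment Search (MAS)

Monotonic Alignment Search (MAS) is a core component of VITS. It solves the alignment problem between text and speech.

3.1 Compiling and Installing MAS

# Change into the monotonic_align directory (this is the author's local path; use your own checkout)
cd e:\TTSProjects\VITS\vits\monotonic_align

# Build the MAS extension
python setup.py build_ext --inplace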
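After the build, a quick import test tells you whether the compiled extension can be loaded. The sketch below is an assumption-laden convenience, not part of VITS: `mas_extension_available` is a hypothetical helper, and it assumes the script runs from the repository root so that the `monotonic_align` package (whose `__init__` imports the compiled core) is on the import path.

```python
import importlib

def mas_extension_available(module_name="monotonic_align"):
    """Return True if the given module (and whatever it imports) can be loaded."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        # Raised when the package is missing or its compiled core was not built
        return False

if mas_extension_available():
    print("monotonic_align imported successfully")
else:
    print("monotonic_align is not importable; rerun the build step")
```

Running this before launching training turns a failure deep inside the training script into a one-line diagnosis.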

3.2 Text Preprocessing and G2P Conversion

Before MAS can be used, the text must be cleaned and converted to phonemes (G2P, grapheme-to-phoneme):

from text import text_to_sequence

# Text preprocessing example
def preprocess_text(text, language="english"):
    """
    Preprocess text.

    Args:
        text: input text
        language: language of the text (english, chinese, ...)

    Returns:
        list: sequence of symbol IDs
    """
    # text_to_sequence looks cleaners up by name in text/cleaners.py,
    # so pass cleaner names (strings) rather than function objects.
    if language == "chinese":
        # chinese_cleaners is not in the official VITS repo; it must exist in your fork
        cleaner_names = ["chinese_cleaners"]
    else:
        # English, and the fallback for any other language
        cleaner_names = ["english_cleaners2"]

    # Convert the text to a symbol ID sequence
    text_norm = text_to_sequence(text, cleaner_names)

    return text_norm

# Usage example
text = "Hello, welcome to the VITS tutorial!"
text_sequence = preprocess_text(text, language="english")
print(f"Preprocessed text sequence: {text_sequence}")

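3.3 Alignment with MAS

VITS runs MAS automatically during training, so there is no need to generate alignments by hand. Understanding how MAS works is still valuable when debugging and tuning the model.

The core idea of MAS is to find the best monotonic alignment path between text and speech, so that each phoneme maps to the right stretch of audio.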
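The search for the best monotonic path can be written as a small dynamic program. The sketch below is a pure-Python illustration for a single utterance, not the Cython implementation that VITS compiles. It assumes a matrix `log_lik[i][j]` holding the log-likelihood of frame `j` under text token `i`, and a path that starts at the first token, ends at the last, never moves backward, and never skips a token. All function names are hypothetical.

```python
import math

def monotonic_alignment_search(log_lik):
    """Return, for each frame, the index of the token it is aligned to."""
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total log-likelihood of a path ending at token i, frame j
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                                 # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # move to the next token
            q[i][j] = log_lik[i][j] + max(stay, advance)

    # Backtrack from the last token at the last frame
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return alignment

def durations_from_alignment(alignment, n_tokens):
    """Count how many frames each token received."""
    durations = [0] * n_tokens
    for token_index in alignment:
        durations[token_index] += 1
    return durations

# Toy example: 2 tokens, 4 frames; frames 0-1 fit token 0, frames 2-3 fit token 1
toy_log_lik = [
    [-1.0, -1.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -1.0],
]
toy_alignment = monotonic_alignment_search(toy_log_lik)
print(toy_alignment, durations_from_alignment(toy_alignment, 2))
```

The per-token frame counts recovered from the path are what a duration predictor learns from, which is why a poor alignment tends to show up as unnatural rhythm in the synthesized speech.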

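4. Data Augmentation

Data augmentation is an important way to improve generalization. For VITS, common techniques fall into time-domain, frequency-domain, and text augmentation.

4.1 Time-Domain Augmentation

Time-domain augmentation operates directly on the audio waveform. Common techniques include:

import numpy as np
import librosa

# Speed change
def time_stretch(audio, rate=1.0):
    """
    Change the speed of the audio.

    Args:
        audio: audio waveform
        rate: stretch factor; >1.0 speeds up, <1.0 slows down

    Returns:
        np.array: time-stretched audio
    """
    return librosa.effects.time_stretch(audio, rate=rate)

# Volume change
def volume_scale(audio, scale=1.0):
    """
    Change the volume of the audio.

    Args:
        audio: audio waveform
        scale: volume scaling factor

    Returns:
        np.array: volume-scaled audio
    """
    # Clip to [-1, 1] so that scale > 1 on peak-normalized audio does not overflow
    return np.clip(audio * scale, -1.0, 1.0)

# Additive noise
def add_noise(audio, noise_level=0.01):
    """
    Add random Gaussian noise to the audio.

    Args:
        audio: audio waveform
        noise_level: standard deviation of the noise

    Returns:
        np.array: noisy audio
    """
    noise = np.random.randn(len(audio)) * noise_level
    return audio + noise

# Time-domain augmentation example
def augment_audio(audio, sr=22050):
    """
    Apply time-domain augmentation to an audio clip.

    Args:
        audio: audio waveform
        sr: sampling rate

    Returns:
        list: the original audio followed by its augmented variants
    """
    augmented_audios = [audio]  # keep the original audio

    # Speed variants
    for rate in [0.9, 1.1]:
        augmented = time_stretch(audio, rate=rate)
        augmented_audios.append(augmented)

    # Volume variants
    for scale in [0.8, 1.2]:
        augmented = volume_scale(audio, scale=scale)
        augmented_audios.append(augmented)

    # Noise variants
    for noise_level in [0.005, 0.01]:
        augmented = add_noise(audio, noise_level=noise_level)
        augmented_audios.append(augmented)

    return augmented_audios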
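The `noise_level` values above are absolute standard deviations, so the same value is mild for a loud clip and harsh for a quiet one. One way to reason about this is in terms of signal-to-noise ratio (SNR). The sketch below, in plain Python with hypothetical helper names, derives the noise standard deviation from a target SNR in dB; it is an illustration of the relationship, not a replacement for `add_noise`.

```python
import math
import random

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def noise_std_for_snr(samples, snr_db):
    """Noise standard deviation that yields the requested SNR for this clip."""
    return rms(samples) / (10.0 ** (snr_db / 20.0))

def add_noise_at_snr(samples, snr_db, seed=0):
    """Add zero-mean Gaussian noise scaled to the requested SNR."""
    rng = random.Random(seed)
    noise_std = noise_std_for_snr(samples, snr_db)
    return [s + rng.gauss(0.0, noise_std) for s in samples]

sr = 22050
clean = [0.5 * math.sin(2 * math.pi * 220.0 * n / sr) for n in range(sr)]
noisy = add_noise_at_snr(clean, snr_db=30.0)

# Measure the SNR that was actually achieved
noise = [b - a for a, b in zip(clean, noisy)]
measured_snr = 20.0 * math.log10(rms(clean) / rms(noise))
print(round(measured_snr, 1))
```

For example, a clip with an RMS of 0.1 and `noise_level=0.01` sits at 20 dB SNR, since 20 · log10(0.1 / 0.01) = 20.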

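4.2 Frequency-Domain Augmentation

Frequency-domain augmentation operates on the spectrogram. Common techniques include:

# Spectrogram masking
def spectral_augmentation(mel_spec, freq_mask_param=15, time_mask_param=35):
    """
    Apply frequency and time masking to a batch of mel spectrograms.

    Args:
        mel_spec: mel spectrograms with shape (batch, n_mels, frames)
        freq_mask_param: maximum width of the frequency mask, in mel bins
        time_mask_param: maximum width of the time mask, in frames

    Returns:
        np.array: masked mel spectrograms
    """
    batch_size, n_mels, n_frames = mel_spec.shape
    augmented_mel = mel_spec.copy()

    for i in range(batch_size):
        # Frequency mask: zero out f consecutive mel bins starting at f0
        # (widths and offsets are integers so they can be used as slice bounds)
        f = np.random.randint(0, min(freq_mask_param, n_mels) + 1)
        f0 = np.random.randint(0, n_mels - f + 1)
        augmented_mel[i, f0:f0 + f, :] = 0

        # Time mask: zero out t consecutive frames starting at t0
        t = np.random.randint(0, min(time_mask_param, n_frames) + 1)
        t0 = np.random.randint(0, n_frames - t + 1)
        augmented_mel[i, :, t0:t0 + t] = 0

    return augmented_mel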

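4.3 Text Augmentation

Text augmentation modifies the input text to add variety to the training data. A modified transcript no longer matches its original recording, so it should only be paired with audio that actually speaks the modified text.

# Synonym replacement
def synonym_replacement(text, n=1):
    """
    Augment text by synonym replacement.

    Args:
        text: input text
        n: maximum number of words to replace

    Returns:
        str: augmented text
    """
    # A tiny illustrative synonym table; real use calls for a proper synonym dictionary
    synonyms = {
        "hello": ["hi", "hey", "greetings"],
        "welcome": ["greet", "receive", "accept"],
        "tutorial": ["guide", "lesson", "instruction"]
    }

    words = text.split()
    new_words = words.copy()

    # Pick up to n distinct positions whose word has a synonym
    candidates = [i for i, word in enumerate(words) if word.lower() in synonyms]
    np.random.shuffle(candidates)
    for i in candidates[:n]:
        new_words[i] = np.random.choice(synonyms[words[i].lower()])

    return ' '.join(new_words)

# Text perturbation
def text_perturbation(text):
    """
    Augment text by a small random perturbation.

    Args:
        text: input text

    Returns:
        str: augmented text
    """
    # Simple perturbation examples
    perturbations = [
        lambda x: x + ".",
        lambda x: "" + x + "",  # as written, leaves the text unchanged
        lambda x: x.replace(",", ";"),
        lambda x: x.replace(".", "!")
    ]

    # Pick one perturbation at random
    perturbation = perturbations[np.random.randint(len(perturbations))]
    return perturbation(text)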

Data Preprocessing Pipeline

1. The Complete Preprocessing Flow

import os
import numpy as np

def run_data_preprocessing_pipeline(raw_audio_dir, raw_text_dir, output_dir, language="english"):
    """
    Complete data preprocessing pipeline.

    Args:
        raw_audio_dir: directory of raw audio
        raw_text_dir: directory of raw text
        output_dir: output directory
        language: language of the dataset
    """
    # Create the output directories
    preprocessed_audio_dir = os.path.join(output_dir, "preprocessed_audio")
    filelist_dir = os.path.join(output_dir, "filelists")

    os.makedirs(preprocessed_audio_dir, exist_ok=True)
    os.makedirs(filelist_dir, exist_ok=True)

    # 1. Audio preprocessing
    print("1. Preprocessing audio...")
    preprocess_audio(raw_audio_dir, preprocessed_audio_dir, target_sr=22050)

    # 2. Build the filelist
    print("2. Building the filelist...")
    train_filelist = os.path.join(filelist_dir, f"{language}_audio_text_train_filelist.txt")
    build_single_speaker_filelist(preprocessed_audio_dir, raw_text_dir, train_filelist)

    # 3. Clean the filelist
    print("3. Cleaning the filelist...")
    cleaned_train_filelist = os.path.join(filelist_dir, f"{language}_audio_text_train_filelist.txt.cleaned")
    clean_filelist(train_filelist, cleaned_train_filelist)

    # 4. Split into training, validation, and test sets
    print("4. Splitting into training, validation, and test sets...")
    with open(cleaned_train_filelist, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Shuffle with a fixed seed so the split is reproducible
    np.random.seed(42)
    np.random.shuffle(lines)

    # Split ratios: 90% training, 5% validation, 5% test (the remainder)
    train_ratio = 0.9
    val_ratio = 0.05

    train_size = int(len(lines) * train_ratio)
    val_size = int(len(lines) * val_ratio)

    train_lines = lines[:train_size]
    val_lines = lines[train_size:train_size+val_size]
    test_lines = lines[train_size+val_size:]

    # Write the split filelists. The training split overwrites the full cleaned
    # list from step 3, which has already been read into memory.
    filelists = [
        (train_lines, os.path.join(filelist_dir, f"{language}_audio_text_train_filelist.txt.cleaned")),
        (val_lines, os.path.join(filelist_dir, f"{language}_audio_text_val_filelist.txt.cleaned")),
        (test_lines, os.path.join(filelist_dir, f"{language}_audio_text_test_filelist.txt.cleaned"))
    ]

    for split_lines, file_path in filelists:
        with open(file_path, 'w', encoding='utf-8') as f:
            for line in split_lines:
                f.write(line)

    print("Data preprocessing finished!")
    print(f"Training set size: {len(train_lines)}")
    print(f"Validation set size: {len(val_lines)}")
    print(f"Test set size: {len(test_lines)}")

# Usage example
raw_audio_dir = "/path/to/raw/audio"
raw_text_dir = "/path/to/raw/text"
output_dir = "/path/to/preprocessed/data"
run_data_preprocessing_pipeline(raw_audio_dir, raw_text_dir, output_dir, language="english")
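The split arithmetic in the pipeline truncates with `int(...)` and gives whatever remains to the test set, which matters for small datasets. The sketch below isolates that logic in a hypothetical helper, `split_filelist`, using the standard-library `random` module instead of NumPy so it runs on its own. The ratios are the ones used above.

```python
import random

def split_filelist(lines, train_ratio=0.9, val_ratio=0.05, seed=42):
    """Shuffle deterministically and split into (train, val, test)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_ratio)
    val_size = int(len(shuffled) * val_ratio)
    train = shuffled[:train_size]
    val = shuffled[train_size:train_size + val_size]
    test = shuffled[train_size + val_size:]  # the remainder
    return train, val, test

# Hypothetical filelist lines
large = [f"wavs/utt_{i:04d}.wav|text {i}\n" for i in range(1000)]
small = large[:10]

print([len(part) for part in split_filelist(large)])
print([len(part) for part in split_filelist(small)])
```

With 1000 lines the split is 900/50/50, but with only 10 lines the validation set comes out empty because `int(10 * 0.05)` is 0. For very small datasets it is safer to set a minimum validation size explicitly.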

2. Multi-Speaker Preprocessing Flow

def run_multi_speaker_preprocessing_pipeline(raw_data_dir, output_dir):
    """
    Multi-speaker data preprocessing pipeline.

    Args:
        raw_data_dir: raw data directory containing one subdirectory per speaker
        output_dir: output directory
    """
    # Create the output directories
    preprocessed_audio_dir = os.path.join(output_dir, "preprocessed_audio")
    filelist_dir = os.path.join(output_dir, "filelists")

    os.makedirs(preprocessed_audio_dir, exist_ok=True)
    os.makedirs(filelist_dir, exist_ok=True)

    # Collect all speakers: sorted so that IDs are reproducible, and skipping
    # the shared "txt" transcript directory so it is not counted as a speaker
    speakers = sorted(
        d for d in os.listdir(raw_data_dir)
        if os.path.isdir(os.path.join(raw_data_dir, d)) and d != "txt"
    )
    speaker_map = {speaker: i for i, speaker in enumerate(speakers)}

    print(f"Found {len(speakers)} speakers: {speakers}")

    # 1. Audio preprocessing
    print("1. Preprocessing audio...")
    for speaker in speakers:
        raw_speaker_dir = os.path.join(raw_data_dir, speaker, "wavs")
        preprocessed_speaker_dir = os.path.join(preprocessed_audio_dir, speaker)

        if os.path.exists(raw_speaker_dir):
            preprocess_audio(raw_speaker_dir, preprocessed_speaker_dir, target_sr=22050)

    # 2. Build the multi-speaker filelist
    print("2. Building the multi-speaker filelist...")
    train_filelist = os.path.join(filelist_dir, "vctk_audio_sid_text_train_filelist.txt")
    text_dir = os.path.join(raw_data_dir, "txt")  # assumes transcripts live under txt/<speaker>/
    build_multi_speaker_filelist(preprocessed_audio_dir, text_dir, speaker_map, train_filelist)

    # 3. Clean the filelist
    print("3. Cleaning the filelist...")
    cleaned_train_filelist = os.path.join(filelist_dir, "vctk_audio_sid_text_train_filelist.txt.cleaned")
    clean_filelist(train_filelist, cleaned_train_filelist)

    print("Multi-speaker data preprocessing finished!")
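Speaker IDs need to stay the same across runs, because a trained model's speaker embeddings are indexed by them. One simple safeguard, sketched below with hypothetical helper names and a hypothetical `speaker_map.json` file, is to build the map from sorted speaker names and store it next to the filelists so that inference code can load the same mapping later.

```python
import json
import os
import tempfile

def build_speaker_map(speaker_names):
    """Assign IDs in sorted-name order so the mapping is reproducible."""
    return {name: i for i, name in enumerate(sorted(speaker_names))}

def save_speaker_map(speaker_map, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(speaker_map, f, ensure_ascii=False, indent=2)

def load_speaker_map(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Directory listings can come back in any order; sorting removes that dependence
map_a = build_speaker_map(["speaker3", "speaker1", "speaker2"])
map_b = build_speaker_map(["speaker2", "speaker3", "speaker1"])

map_path = os.path.join(tempfile.mkdtemp(), "speaker_map.json")
save_speaker_map(map_a, map_path)
print(load_speaker_map(map_path))
```

Note that adding a new speaker whose name sorts before existing ones still shifts the IDs, so for a model that is already trained the saved file, not a fresh directory scan, should be treated as the source of truth.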

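Best Practices

1. Best Practices for Data Preprocessing

Put data quality first: use high-quality audio and text, and filter out low-quality samples
Process consistently: run every sample through the same preprocessing flow
Split the dataset sensibly: a 90% training, 5% validation, 5% test split is typical
Augment in moderation: choose augmentation techniques based on the size and quality of the dataset
Monitor the data distribution: analyze distributions such as clip duration and speaking rate to make sure they are reasonable
Validate regularly: recheck the preprocessed data from time to time to confirm it is still correct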
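Monitoring the duration distribution, as recommended above, needs nothing more than a list of clip lengths in seconds. The sketch below uses the standard-library `statistics` module; the helper names are hypothetical, and the 0.5 s and 30 s bounds are simply the defaults from `check_audio_quality`.

```python
import statistics

def summarize_durations(durations):
    """Basic descriptive statistics for a list of clip durations in seconds."""
    ordered = sorted(durations)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "total_hours": sum(ordered) / 3600.0,
    }

def count_outside(durations, min_duration=0.5, max_duration=30.0):
    """Number of clips that the duration filter would reject."""
    return sum(1 for d in durations if d < min_duration or d > max_duration)

# Hypothetical durations; in practice collect len(y) / sr for each clip
example_durations = [0.3, 2.0, 4.0, 6.0, 35.0]
summary = summarize_durations(example_durations)
print(summary)
print(count_outside(example_durations))
```

A mean far above the median, as in this toy example, points to a few very long clips that are worth inspecting or splitting before training.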

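2. Common Problems and Solutions

Problem	Solution
Inconsistent audio sampling rates	Resample all audio to a single sampling rate (for example 22050 Hz)
Transcription errors	Check transcripts manually, or verify them with automatic speech recognition
Audio clips too short or too long	Set sensible duration thresholds and filter out clips that fall outside them
Filelist format errors	Generate and clean the filelist with scripts to keep the format correct
MAS fails to compile	Make sure the right build tools are installed, such as Visual Studio (Windows) or gcc (Linux)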