end2end-asr-pytorch - audio processing - speech signal processing

https://github.com/gentaiscool/end2end-asr-pytorch

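The sampling rate (sampling frequency) is the number of audio samples captured per second. A higher sampling rate reproduces the sound more faithfully, at the cost of more storage and processing.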

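The bit depth (also called sample size or quantization precision) is the number of bits used to quantize the amplitude of each sample. A larger bit depth gives finer amplitude resolution and a wider dynamic range.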

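Each sample records an amplitude, and the precision of that amplitude depends on the bit depth:
1 byte (8 bit): the amplitude is divided into 2^8 = 256 levels.
2 bytes (16 bit): the amplitude is divided into 2^16 = 65536 levels.
4 bytes (32 bit): the amplitude is divided into 2^32 = 4294967296 levels.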
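The relationship between bit depth and the number of amplitude levels can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `quantization_levels` and `signed_range` are made up for this example and do not come from the repository.

```python
def quantization_levels(bit_depth):
    """Number of distinct amplitude levels for a given bit depth."""
    return 2 ** bit_depth


def signed_range(bit_depth):
    """Smallest and largest value of a signed integer sample of that width."""
    half = 2 ** (bit_depth - 1)
    return -half, half - 1


for bits in (8, 16, 32):
    print(bits, quantization_levels(bits), signed_range(bits))
```

The signed range matters later: dividing a signed 32-bit sample by 2^31 is what maps it into [-1, 1].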

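Mono and stereo refer to the number of audio channels. Mono audio carries a single channel, so every speaker plays the same signal. Stereo carries separate left and right channels that drive two speakers, which gives a better sense of space.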

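An audio frame records one unit of sound: one sample for every channel at a single sampling instant. The size of one frame is the sample size multiplied by the number of channels. For example, a 16-bit stereo frame takes 2 bytes x 2 channels = 4 bytes.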

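In interleaved mode the data is stored frame by frame: first the left-channel and right-channel samples of frame 1, then the left-channel and right-channel samples of frames 2, 3, 4, … in order.
In non-interleaved mode, the left-channel samples of all frames in one period are stored first, followed by all the right-channel samples.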
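The two layouts can be converted into each other with simple slicing. The sketch below uses plain Python lists so that it runs without dependencies; `deinterleave` and `interleave` are hypothetical helper names, and real code would more likely reshape a NumPy array or a tensor.

```python
def deinterleave(samples, num_channels):
    """Split interleaved samples [L0, R0, L1, R1, ...] into one list per channel."""
    if len(samples) % num_channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    # Every num_channels-th sample, starting at the channel index, belongs to that channel.
    return [samples[ch::num_channels] for ch in range(num_channels)]


def interleave(channels):
    """Merge per-channel lists back into frame-by-frame order."""
    return [sample for frame in zip(*channels) for sample in frame]


interleaved = [10, -10, 20, -20, 30, -30]  # three stereo frames stored as (L, R) pairs
planar = deinterleave(interleaved, 2)      # all left samples, then all right samples
print(planar)
print(interleave(planar) == interleaved)
```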

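The bit rate is the number of bits transferred per second, measured in bps (bits per second). For uncompressed PCM audio it equals sampling rate x bit depth x number of channels.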
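As a quick worked example of the frame size and bit rate definitions above, the following sketch computes both for uncompressed PCM audio. The function names are illustrative only, and compressed formats such as MP3 do not follow this formula.

```python
def frame_size_bytes(bit_depth, num_channels):
    """Bytes in one frame: one sample per channel."""
    return (bit_depth // 8) * num_channels


def pcm_bit_rate(sample_rate, bit_depth, num_channels):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate * bit_depth * num_channels


# CD-quality stereo: 44.1 kHz, 16 bit, 2 channels
print(frame_size_bytes(16, 2))
print(pcm_bit_rate(44100, 16, 2))
# Typical speech recognition input: 16 kHz, 16 bit, mono
print(pcm_bit_rate(16000, 16, 1))
```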

1. Normalization + multi-channel averaging

end2end-asr-pytorch/utils/audio.py
https://github.com/gentaiscool/end2end-asr-pytorch/blob/master/utils/audio.py

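import torchaudio


def load_audio(path):
    # normalization=True scales samples to [-1, 1]. This is the legacy torchaudio
    # signature documented below; newer releases name the argument `normalize`.
    sound, _ = torchaudio.load(path, normalization=True)
    sound = sound.numpy().T  # [C x L] -> [L x C]
    if len(sound.shape) > 1:
        if sound.shape[1] == 1:
            sound = sound.squeeze()  # single channel: drop the channel axis
        else:
            sound = sound.mean(axis=1)  # multiple channels, average
    return sound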

load(filepath, out=None, normalization=True, channels_first=True, num_frames=0, offset=0, signalinfo=None, encodinginfo=None, filetype=None)

Loads an audio file from disk into a tensor.

1.1 Args

filepath (str or pathlib.Path): Path to audio file.

out (torch.Tensor, optional): An output tensor to use instead of creating one. (Default: None).

normalization (bool, number, or callable, optional): If boolean True, then output is divided by 1 << 31 = 2147483648 = 2^31 (assumes signed 32-bit audio), and normalizes to [-1, 1].
If number, then output is divided by that number.
If callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)

channels_first (bool): Set channels first or length first in result. (Default: True) - channels first - [C x L].

num_frames (int, optional): Number of frames to load. num_frames = 0 to load everything after the offset. (Default: 0)

offset (int, optional): Number of frames from the start of the file to begin data loading. (Default: 0)

signalinfo (sox_signalinfo_t, optional): A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined. (Default: None)

encodinginfo (sox_encodinginfo_t, optional): A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined. (Default: None)

filetype (str, optional): A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

1.2 Returns

Tuple[torch.Tensor, int]: An output tensor of size [C x L] or [L x C] (Default: channels first) where L is the number of audio frames and C is the number of channels. An integer which is the sample rate of the audio (as listed in the metadata of the file).

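If the speech features to be extracted do not distinguish between channels, multi-channel audio must first be converted to mono. The conversion simply averages the samples across channels. Mono audio needs no conversion.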
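To make the two steps concrete without requiring torchaudio, here is a dependency-free sketch of what normalization and channel averaging do to the sample values. It assumes signed 32-bit integer samples, as the `normalization=True` description does; `normalize_int32` and `to_mono` are hypothetical names and are not part of torchaudio or the repository.

```python
INT32_SCALE = 1 << 31  # 2147483648, the divisor used when normalization is True


def normalize_int32(samples):
    """Scale signed 32-bit integer samples into the range [-1, 1]."""
    return [s / INT32_SCALE for s in samples]


def to_mono(frames):
    """Average the channels of each frame; `frames` has shape [L x C]."""
    return [sum(frame) / len(frame) for frame in frames]


raw = [1 << 30, -(1 << 31), 0]
print(normalize_int32(raw))

stereo = [(0.5, -0.5), (1.0, 0.0), (0.25, 0.75)]  # three (left, right) frames
print(to_mono(stereo))
```

Averaging keeps the result inside [-1, 1], whereas summing the channels could exceed that range.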

2. Pre-emphasis

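Pre-emphasis boosts the high-frequency part of the signal relative to the low-frequency part. It acts as a first-order high-pass filter, y[n] = x[n] - a * x[n-1], where a is the pre-emphasis coefficient: high-frequency components are amplified, while the amplitude of low-frequency components is strongly reduced. The filter also changes the phase of the signal, not only its magnitude. To some extent this step offsets the effect of the lips on the produced sound (lip radiation).

Speech processing typically uses a coefficient between 0.9 and 0.97; pre-emphasis = 0.97 is the most common choice.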
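A minimal sketch of a first-order pre-emphasis filter, y[n] = x[n] - coeff * x[n-1], is shown below. Passing the first sample through unchanged is an assumption of this example (a common convention), and `pre_emphasis` is an illustrative name rather than a function from the repository.

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept as is."""
    if not signal:
        return []
    rest = [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]
    return [signal[0]] + rest


# A constant (lowest-frequency) signal is almost cancelled after the first sample.
low = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# A signal that flips sign every sample (highest frequency) is nearly doubled.
high = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(low)
print(high)
```

The two test signals show the high-pass behaviour directly: the constant input shrinks to about 0.03 per sample, while the alternating input grows to about 1.97 in magnitude.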

3. Resampling

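Speech signals may come from different devices, and the parameters set at recording time differ from one device to another. The most important of these is the sampling rate, so recordings are resampled to one common sampling rate before further processing.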
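The idea of resampling can be sketched with linear interpolation between neighbouring samples. This is a deliberately simplified illustration, not the method used by the repository: production resamplers apply a band-limiting low-pass filter so that downsampling does not cause aliasing. The name `resample_linear` is hypothetical.

```python
def resample_linear(signal, orig_sr, target_sr):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if not signal:
        return []
    n_out = int(len(signal) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr            # position in the original signal
        left = int(pos)
        right = min(left + 1, len(signal) - 1)   # clamp at the last sample
        frac = pos - left
        out.append((1 - frac) * signal[left] + frac * signal[right])
    return out


up = resample_linear([0.0, 1.0, 2.0, 3.0], orig_sr=4, target_sr=8)
down = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], orig_sr=8, target_sr=4)
print(up)
print(down)
```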

4. Framing the speech signal

hop_length - frame shift (the step between the starts of consecutive frames)
win_length - window length (the length of one frame)

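Speech analysis relies on short-time analysis. A speech signal changes over time and is a non-stationary process, so digital signal processing techniques designed for stationary signals cannot be applied to it directly. Different speech sounds are produced when the muscles of the mouth shape the vocal tract, and this muscle movement is very slow compared with the frequencies present in speech. So although the signal is time-varying, its properties stay essentially unchanged over a short interval (10 ms - 30 ms), and within that interval it can be treated as a quasi-stationary process. This property is called the short-time stationarity of speech.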
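Framing with a window length and a frame shift can be sketched as follows. The 25 ms window and 10 ms shift are assumed example values that fall inside the 10 ms - 30 ms range mentioned above, and `frame_signal` is a hypothetical helper; samples at the end that do not fill a whole frame are simply dropped here, whereas other implementations pad instead.

```python
def frame_signal(signal, win_length, hop_length):
    """Split a signal into overlapping frames of win_length samples."""
    if win_length <= 0 or hop_length <= 0:
        raise ValueError("win_length and hop_length must be positive")
    if len(signal) < win_length:
        return []
    num_frames = 1 + (len(signal) - win_length) // hop_length
    return [signal[i * hop_length : i * hop_length + win_length]
            for i in range(num_frames)]


sample_rate = 16000
win_length = sample_rate * 25 // 1000   # 25 ms -> 400 samples
hop_length = sample_rate * 10 // 1000   # 10 ms -> 160 samples
signal = list(range(sample_rate))       # one second of dummy samples
frames = frame_signal(signal, win_length, hop_length)
print(len(frames), len(frames[0]), frames[1][0])
```

Because the shift is shorter than the window, consecutive frames overlap (here by 240 samples), which keeps the frame-to-frame transition smooth.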

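Speech signal processing often needs to determine how the frequency components of the speech are distributed. The Fourier transform assumes a stationary input; applied to a non-stationary signal, its result has little meaning. Speech is non-stationary at large time scales but can be regarded as stationary over a short enough time, so short segments can be cut out and Fourier-transformed individually.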
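To illustrate what the Fourier transform of a single frame yields, the sketch below computes a naive discrete Fourier transform of one 25 ms frame containing a pure 1000 Hz tone. All values are assumptions chosen for the example, and `dft_magnitudes` is a hypothetical helper; practical code would use an FFT routine and usually multiplies each frame by a window function first.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the DFT bins from 0 up to half the sampling rate."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


sample_rate = 8000
frame_len = 200                      # 25 ms at 8 kHz
tone_hz = 1000
frame = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(frame_len)]

mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / frame_len   # each bin spans sample_rate / frame_len Hz
print(peak_bin, peak_hz)
```

The frame holds a whole number of tone periods, so all the energy lands in a single bin; for real speech the energy spreads over many bins, and that distribution is what later feature extraction works with.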

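Viewed as a whole, this speech signal is non-stationary. The part marked by the red box is one frame, and within that frame the signal can be regarded as stationary.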

[Figure: speech waveform with one frame marked by a red box]