【人工智能】Transformers之Pipeline（二）：自动语音识别（automatic-speech-recognition）_Python

一、引言

二、自动语音识别（automatic-speech-recognition）

2.1 概述

2.3.1 pipeline对象实例化参数

2.3.2 pipeline对象使用参数

2.3.3 pipeline对象返回参数

2.4 pipeline实战

2.4.1 facebook/wav2vec2-base-960h（默认模型）

2.4.2 openai/whisper-medium

2.5 模型排名

三、总结

一、引言

pipeline（管道）是huggingface transformers库中一种极简方式使用大模型推理的抽象，将所有大模型分为音频（audio）、计算机视觉（computer vision）、自然语言处理（nlp）、多模态（multimodal）等4大类，28小类任务（tasks），共计覆盖32万个模型。

今天介绍audio音频的第二篇，自动语音识别（automatic-speech-recognition），在huggingface库内共有1.8万个音频分类模型。

二、自动语音识别（automatic-speech-recognition）

2.1 概述

自动语音识别 (asr)，也称为语音转文本 (stt)，是将给定音频转录为文本的任务。主要应用场景有人机对话、语音转文本、歌词识别、字幕生成等。

2.2 技术原理

自动语音识别主要原理是音频切分成25ms-60ms的音谱后，采用卷机网络抽取音频特征，再通过transformer等网络结构与文本进行对齐训练。比较知名的自动语音识别当属openai的whisper和meta的wav2vec 2.0。

2.2.1 whisper模型

语音部分：基于680000小时音频数据进行训练，包含英文、其他语言转英文、非英文等多种语言。将音频数据转换成梅尔频谱图，再经过两个卷积层后送入 transformer 模型。

文本部分：文本token包含3类：special tokens（标记tokens）、text tokens（文本tokens）、timestamp tokens（时间戳），基于标记tokens控制文本的开始和结束，基于timestamp tokens让语音时间与文本对其。

不同尺寸模型参数量、多语言支持情况、需要现存大小以及推理速度如下

2.2.2 wav2vec 2.0模型

wav2vec 2.0是 meta在2020年发表的无监督语音预训练模型。它的核心思想是通过向量量化（vector quantization，vq）构造自建监督训练目标，对输入做大量掩码后利用对比学习损失函数进行训练。模型结构如图，基于卷积网络（convoluational neural network，cnn）的特征提取器将原始音频编码为帧特征序列，通过 vq 模块把每帧特征转变为离散特征 q，并作为自监督目标。同时，帧特征序列做掩码操作后进入 transformer [5] 模型得到上下文表示 c。最后通过对比学习损失函数，拉近掩码位置的上下文表示与对应的离散特征 q 的距离，即正样本对。

2.3 pipeline参数

2.3.1 pipeline对象实例化参数

2.3.2 pipeline对象使用参数

2.3.3 pipeline对象返回参数

2.4 pipeline实战

2.4.1 facebook/wav2vec2-base-960h（默认模型）

pipeline对于automatic-speech-recognition的默认模型是facebook/wav2vec2-base-960h，使用pipeline时，如果仅设置task=automatic-speech-recognition，不设置模型，则下载并使用默认模型。

import os
os.environ["hf_endpoint"] = "https://hf-mirror.com"
os.environ["cuda_visible_devices"] = "2"

from transformers import pipeline

speech_file = "./output_video_enhanced.mp3"
pipe = pipeline(task="automatic-speech-recognition")
result = pipe(speech_file)
print(result)

可以将.mp3内的音频转为文本：

{'text': "well to day's story meeting is officially started someone said that you have been telling stories for two or three years for such a long time and you still have a story meeting to tell"}

2.4.2 openai/whisper-medium

我们指定模型openai/whisper-medium，具体代码为：

import os
os.environ["hf_endpoint"] = "https://hf-mirror.com"
os.environ["cuda_visible_devices"] = "2"

from transformers import pipeline

speech_file = "./output_video_enhanced.mp3"
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
result = pipe(speech_file)
print(result)

输入为一段mp3格式的语音，输出为

{'text': " well, today's story meeting is officially started. someone said that you have been telling stories for two or three years for such a long time, and you still have a story meeting to tell."}

2.5 模型排名

在huggingface上，我们筛选自动语音识别模型，并按下载量从高到低排序：

三、总结

本文对transformers之pipeline的自动语音识别（automatic-speech-recognition）从概述、技术原理、pipeline参数、pipeline实战、模型排名等方面进行介绍，读者可以基于pipeline使用文中的代码极简的进行自动语音识别推理，应用于语音识别、字幕提取等业务场景。

期待您的3连+关注，如何还有时间，欢迎阅读我的其他文章：

《transformers-pipeline概述》

【人工智能】transformers之pipeline（概述）：30w+大模型极简应用

《transformers-pipeline 第一章：音频（audio）篇》

【人工智能】transformers之pipeline（一）：音频分类（audio-classification）

【人工智能】transformers之pipeline（二）：自动语音识别（automatic-speech-recognition）

【人工智能】transformers之pipeline（三）：文本转音频（text-to-audio）

【人工智能】transformers之pipeline（四）：零样本音频分类（zero-shot-audio-classification）

《transformers-pipeline 第二章：计算机视觉（cv）篇》

【人工智能】transformers之pipeline（五）：深度估计（depth-estimation）

【人工智能】transformers之pipeline（六）：图像分类（image-classification）

【人工智能】transformers之pipeline（七）：图像分割（image-segmentation）

【人工智能】transformers之pipeline（八）：图生图（image-to-image）

【人工智能】transformers之pipeline（九）：物体检测（object-detection）

【人工智能】transformers之pipeline（十）：视频分类（video-classification）

【人工智能】transformers之pipeline（十一）：零样本图片分类（zero-shot-image-classification）

【人工智能】transformers之pipeline（十二）：零样本物体检测（zero-shot-object-detection）

《transformers-pipeline 第三章：自然语言处理（nlp）篇》

【人工智能】transformers之pipeline（十三）：填充蒙版（fill-mask）

【人工智能】transformers之pipeline（十四）：问答（question-answering）

【人工智能】transformers之pipeline（十五）：总结（summarization）

【人工智能】transformers之pipeline（十六）：表格问答（table-question-answering）

【人工智能】transformers之pipeline（十七）：文本分类（text-classification）

【人工智能】transformers之pipeline（十八）：文本生成（text-generation）

【人工智能】transformers之pipeline（十九）：文生文（text2text-generation）

【人工智能】transformers之pipeline（二十）：令牌分类（token-classification）

【人工智能】transformers之pipeline（二十一）：翻译（translation）

【人工智能】transformers之pipeline（二十二）：零样本文本分类（zero-shot-classification）

《transformers-pipeline 第四章：多模态（multimodal）篇》

【人工智能】transformers之pipeline（二十三）：文档问答（document-question-answering）

【人工智能】transformers之pipeline（二十四）：特征抽取（feature-extraction）

【人工智能】transformers之pipeline（二十五）：图片特征抽取（image-feature-extraction）

【人工智能】transformers之pipeline（二十六）：图片转文本（image-to-text）

【人工智能】transformers之pipeline（二十七）：掩码生成（mask-generation）

【人工智能】transformers之pipeline（二十八）：视觉问答（visual-question-answering）

【人工智能】Transformers之Pipeline（二）：自动语音识别（automatic-speech-recognition）

2024年07月28日 • Python •我要评论

一、引言

二、自动语音识别（automatic-speech-recognition）

2.1 概述

2.2 技术原理

2.2.1 whisper模型

2.2.2 wav2vec 2.0模型

2.3 pipeline参数

2.3.1 pipeline对象实例化参数

2.3.2 pipeline对象使用参数

2.3.3 pipeline对象返回参数

2.4 pipeline实战

2.4.1 facebook/wav2vec2-base-960h（默认模型）

2.4.2 openai/whisper-medium

2.5 模型排名

三、总结

相关文章:

Python进阶-部署Flask项目(以TensorFlow图像识别项目WSGI方式启动为例)

已解决“from tensorflow.keras.preprocessing.image import ImageDataGenerator”标红，解决版本问题导致导包失败

Python与OpenCV：图像处理与计算机视觉实战指南

发表评论


验证码：

【人工智能】Transformers之Pipeline（二）：自动语音识别（automatic-speech-recognition）

2024年07月28日 • Python •我要评论

一、引言

二、自动语音识别（automatic-speech-recognition）

2.1 概述

2.2 技术原理

2.2.1 whisper模型

2.2.2 wav2vec 2.0模型

2.3 pipeline参数

2.3.1 pipeline对象实例化参数​​​​​​​

2.3.2 pipeline对象使用参数

2.3.3 pipeline对象返回参数

2.4 pipeline实战

2.4.1 facebook/wav2vec2-base-960h（默认模型）

2.4.2 openai/whisper-medium

2.5 模型排名

三、总结

相关文章:

Python进阶-部署Flask项目(以TensorFlow图像识别项目WSGI方式启动为例)

已解决“from tensorflow.keras.preprocessing.image import ImageDataGenerator”标红，解决版本问题导致导包失败

Python与OpenCV：图像处理与计算机视觉实战指南

发表评论

2.3.1 pipeline对象实例化参数