How Well Does OpenAI Whisper, the AI Speech Recognition Powerhouse, Support Chinese?

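Table of Contents

Preface
1. Preparing the Materials
2. Setting Up the Whisper Environment
- Step 1: Install Whisper
- Step 2: Install ffmpeg
3. Testing Whisper
Summary
- Other Notes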

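Preface

Speech recognition has long been one of the core technologies in artificial intelligence, and the arrival of the large-model era has changed it fundamentally. It is a consistently hot topic in AI-related discussions.

Among today's open-source speech recognition tools, OpenAI Whisper is the clear leader, and its performance even surpasses that of many commercial products. So how well does OpenAI Whisper support Chinese? Today we run a simple test.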

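1. Preparing the Materials

Since the focus today is Chinese recognition, I prepared several distinctive audio clips:

Grade 1 text "小青蛙" (Little Frog), standard Mandarin: 1.mp3 (532 KB)

Content:

Three Character Classic (三字经) plain recitation 11, standard Mandarin: 2.mp3 (533 KB)

Content:

A Cantonese clip (very different from Mandarin): 3.mp3 (306 KB)

Content:

A pingshu (storytelling) clip by 李伯伯 in Sichuanese (fairly close to Mandarin): 4.mp3 (1.4 MB)

The content is fairly long, so we will look at how it is recognized later.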

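2. Setting Up the Whisper Environment

OpenAI Whisper is currently the most popular open-source speech recognition project. Project page: https://github.com/openai/whisper. As the name suggests, it was open-sourced by OpenAI, and it relies mainly on large-model training. It supports 99 languages, with a particularly low error rate for English. Whisper comes in five model sizes: tiny, base, small, medium, and large.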

Model	Size (parameters)	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~32x
base	74 M	`base.en`	`base`	~1 GB	~16x
small	244 M	`small.en`	`small`	~2 GB	~6x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x
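As a rough illustration of how to read the table, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The helper name `pick_model` and the fallback to `tiny` are assumptions made for this example, and the numbers are simply the approximate values from the table, so treat the result as a starting point rather than a guarantee.

```python
# Approximate VRAM requirements in GB, copied from the table above.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    Hypothetical helper for illustration; falls back to "tiny" if nothing fits.
    """
    best = "tiny"
    # Dicts keep insertion order, so this walks from smallest to largest.
    for name, need_gb in MODEL_VRAM_GB.items():
        if need_gb <= available_vram_gb:
            best = name
    return best


if __name__ == "__main__":
    for vram in (1, 4, 8, 12):
        print(f"{vram} GB -> {pick_model(vram)}")
```

Smaller models are also much faster according to the table, so on a CPU-only machine it may still make sense to start with `tiny` or `base` regardless of memory.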

Whisper's error rates are shown in the figure below:
[Figure: Whisper error rates]

Now let's look at installation. Installing Whisper requires a Python environment. The requirements are:

Python 3.9.9+
pip 24.0+
ffmpeg
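The requirements above can be checked with a short script. This is a minimal sketch: the function name `check_environment` is made up for this example, it only checks that `pip` and `ffmpeg` are on the `PATH` (it does not compare the pip version against 24.0), and the minimum Python version is the one stated in this article.

```python
import shutil
import sys

MIN_PYTHON = (3, 9, 9)  # minimum version listed in this article


def check_environment() -> dict:
    """Report whether each requirement listed above appears to be met."""
    return {
        # Compare (major, minor, micro) against the minimum version.
        "python": sys.version_info[:3] >= MIN_PYTHON,
        # Only checks that a pip executable is on PATH, not its version.
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        # ffmpeg must be on PATH so that Whisper can decode audio files.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'missing or too old'}")
```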

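First check that your machine meets these requirements. If it does, run the following commands:

Step 1: Install Whisper

pip install -U openai-whisper

Output similar to the following indicates that the installation succeeded: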

Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20231117-py3-none-any.whl size=801358 sha256=9c53589d5935329764df742678ccdf63238285771a946ef7157912e71a623bb3
  Stored in directory: /root/.cache/pip/wheels/0f/3e/0a/683df97c94e7b6f0818ba78f0177ebe638c30d192bdd39f399
Successfully built openai-whisper

Step 2: Install ffmpeg

The way to install ffmpeg differs by operating system. Here are the commands for several systems:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

On CentOS 7, installing ffmpeg takes a few extra steps:
Import the Nux Dextop repository:

sudo rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro
sudo rpm -Uvh http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-1.el7.nux.noarch.rpm

Install:

sudo yum update -y
sudo yum install ffmpeg -y

After installation, verify ffmpeg:

ffmpeg -help

3. Testing Whisper

Once installed, Whisper can be used directly from the console:

whisper --help

To run recognition, the command takes this form:

whisper audio.mp3 [options]

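Common options:

--task

Sets the task. The default, --task transcribe, transcribes the speech in its original language; --task translate translates it instead, and English is currently the only supported target language.

--model

Sets the model to use. The default in the release installed above is --model small (check whisper --help for your version). Whisper also has English-only models, named by appending .en to the model name, which are faster.

--language

Sets the spoken language. By default Whisper uses the first 30 seconds of audio to detect the language, but it is better to specify it explicitly, for example --language Chinese for Chinese.

--device

Sets the compute device. By default it is chosen automatically (CUDA if available, otherwise the CPU); --device cuda uses the GPU, cpu uses the CPU, and mps targets Apple M1-series chips.

--output_format

Output format of the recognition results (txt, vtt, srt, tsv, json, all). Default: all.

--output_dir

Directory where the recognition results are written.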
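To make the options above concrete, here is a small sketch that assembles a `whisper` command line from Python. The function `build_whisper_command` is a hypothetical helper written for this article; it only uses the flags described above, and the defaults mirror the ones listed there, which may differ between Whisper releases.

```python
import shlex
from typing import List, Optional


def build_whisper_command(
    audio: str,
    model: str = "small",
    language: Optional[str] = None,
    task: str = "transcribe",
    output_format: str = "all",
    output_dir: str = ".",
) -> List[str]:
    """Assemble a whisper command line from the options described above."""
    cmd = [
        "whisper", audio,
        "--model", model,
        "--task", task,
        "--output_format", output_format,
        "--output_dir", output_dir,
    ]
    # Leave --language out to let Whisper detect the language itself.
    if language is not None:
        cmd += ["--language", language]
    return cmd


if __name__ == "__main__":
    cmd = build_whisper_command(
        "1.mp3", model="tiny", language="Chinese",
        output_format="txt", output_dir="out",
    )
    print(shlex.join(cmd))
    # To actually run it (requires whisper and ffmpeg to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.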

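Besides the console, Whisper can also be used from Python. Sample code:

# coding=utf-8

import whisper

if __name__ == '__main__':
    # Load the smallest model; it is downloaded on first use
    model = whisper.load_model("tiny")

    # Load the audio and pad or trim it to exactly 30 seconds
    audio = whisper.load_audio("1.mp3")
    audio = whisper.pad_or_trim(audio)

    # Compute the log-Mel spectrogram on the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # Decode the 30-second window and print the recognized text
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)

    print(result.text)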

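You can of course also test directly from the console. I compiled the test results in the figure below:
[Figure: test results]
Here I output plain txt. With the vtt format, each piece of text also comes with its corresponding timestamps.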
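To show what timestamped output looks like structurally, the sketch below turns a list of segments into WebVTT text. It is not Whisper's own writer: `format_timestamp` and `segments_to_vtt` are illustrative names, and the sample segments are placeholders rather than real recognition results. It assumes each segment is a dict with `start`, `end` (in seconds), and `text` keys, which is the shape of the `segments` list returned by `model.transcribe(...)`.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Render segments (dicts with start, end, text) as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder segments for illustration only, not real Whisper output.
    sample = [
        {"start": 0.0, "end": 2.5, "text": " first placeholder line"},
        {"start": 2.5, "end": 5.0, "text": " second placeholder line"},
    ]
    print(segments_to_vtt(sample))
```

With Whisper installed, something like `segments_to_vtt(model.transcribe("1.mp3", language="zh")["segments"])` should produce comparable output, although passing `--output_format vtt` on the command line is the simpler route.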

Here is a shell script I wrote to automate the tests, so that you can run similar experiments yourself:

#!/bin/bash
# Transcribe every audio file in the current directory with each Whisper model,
# then collect the transcripts (with elapsed time) into report.csv.
# Note: arrays require bash, so this cannot run under plain /bin/sh.

suffixes=("mp3")

models=("tiny" "base" "small" "medium" "large")

# models=("tiny" "base")

# Run every model on every file that has the given suffix.
find_audio(){
  suffix=$1
  for file in ./*."$suffix"; do
      if [ -f "$file" ]; then
        name=$(basename "$file")
        dir=${name%.*}   # file name without its extension, e.g. 1.mp3 -> 1
        for model in "${models[@]}"; do
          do_whisper "$file" "$dir" "$model"
        done
      fi
  done
}

# $1 = audio file, $2 = file name without extension, $3 = model name
do_whisper(){
  start_time=$(date +%s)
  whisper "$1" --language Chinese --output_dir "${2}_${3}" --output_format txt --model "$3"
  end_time=$(date +%s)
  time_sec=$((end_time - start_time))
  # Append the elapsed time to the transcript so it appears in the report.
  echo "(elapsed: ${time_sec}s)" >> "${2}_${3}/${2}.txt"
}

# Append one CSV row per audio file, with one column per model.
do_report(){
  suffix=$1
  for file in ./*."$suffix"; do
      if [ -f "$file" ]; then
        txt_rs=$(basename "$file")
        dir=${txt_rs%.*}
        for model in "${models[@]}"; do
          # Join the transcript into a single line and wrap it in quotes.
          text=$(tr -d '\r\n' < "${dir}_${model}/${dir}.txt")
          txt_rs="${txt_rs},\"${text}\""
        done
        echo "$txt_rs" >> report.csv
      fi
  done
}

for suffix in "${suffixes[@]}"; do
  find_audio "$suffix"
done

# Write the CSV header once, then add the rows for every suffix.
models_strs=$(printf ",%s" "${models[@]}")
echo "audio${models_strs}" > report.csv

for suffix in "${suffixes[@]}"; do
  do_report "$suffix"
done

You can adjust the parameters in the script to run your own tests.
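If you have a reference transcript for a clip, one way to put a number on recognition quality is the character error rate (CER): the edit distance between reference and recognized text, divided by the reference length. The sketch below is a plain dynamic-programming implementation. The function names are made up for this example, it is not part of Whisper, and the sample strings are illustrative rather than actual test output. Punctuation and whitespace are counted as characters here, so you may want to strip them first.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only, not actual Whisper output.
    print(cer("小青蛙", "小清蛙"))
```

Character-level scoring suits Chinese because the text has no spaces between words, which makes word-level error rates awkward to compute.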

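Summary

The tests above show that recognition of standard Mandarin is already quite good. What surprised me most is how low the error rate was for Cantonese: the speech was essentially translated into text. Sichuanese pronunciation is fairly close to Mandarin, but some of its vocabulary differs a lot, so the error rate there was still high.

Overall, for an open-source product, Whisper's Chinese support is already quite good, and it even surpasses some domestic commercial products. I ran the Cantonese clip through the platforms of several major vendors, and most of them failed to recognize it. You can use my script to test more dialects or speech recorded under different conditions.

If you have a GPU, it is worth trying how Whisper performs on it.

OpenAI Whisper's approach to speech looks like a case of brute force working wonders: large-model training that covers most languages. It has also upended traditional speech recognition techniques, and I expect better models to appear soon. Looking at Whisper's model download logic, I see that large-v1, large-v2, and large-v3 already seem to be available, but the models are large and I have no environment to test them in, so feel free to try them yourself. The model download logic is in the source file python3.12/site-packages/whisper/__init__.py.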