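Rapid Evolution, Lightning-Fast Transcription: Real-Time Speech-to-Text Subtitles and Speech Recognition with Whisper.cpp, the C/C++ Port of Whisper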

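OpenAI's open-source Whisper model is the leading open-source speech-to-text model. Its one flaw is that it cannot use Apple M-series chips to speed up transcription. Whisper.cpp is a C/C++ port of the Whisper model: it has no dependencies and a low memory footprint, and, importantly, it adds Core ML support, which makes it a good fit for Apple M-series chips.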

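Whisper.cpp's tensor operators are heavily optimized for the CPU of Apple M-series chips. Depending on the size of the computation, they use either Arm Neon SIMD intrinsics or CBLAS routines from the Accelerate framework. The latter is especially effective for larger sizes, because the Accelerate framework can use the dedicated AMX coprocessor available in Apple M-series chips.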

Configuring Whisper.cpp

As usual, run git to clone the Whisper.cpp project:

git clone https://github.com/ggerganov/whisper.cpp.git

Then enter the project directory:

cd whisper.cpp

The project's default base model does not support Chinese, so the medium model is recommended here. Download it with the bundled shell script:

bash ./models/download-ggml-model.sh medium

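Once the download completes, the model file ggml-medium.bin is saved in the project's models directory. It is about 1.53 GB (shown as 1.4G in the listing below, which counts in binary units):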

➜  whisper.cpp git:(master) cd models   
➜  models git:(master) ll  
total 3006000  
-rw-r--r--  1 liuyue  staff   3.2K  4 21 07:21 README.md  
-rw-r--r--  1 liuyue  staff   7.2K  4 21 07:21 convert-h5-to-ggml.py  
-rw-r--r--  1 liuyue  staff   9.2K  4 21 07:21 convert-pt-to-ggml.py  
-rw-r--r--  1 liuyue  staff    13K  4 21 07:21 convert-whisper-to-coreml.py  
drwxr-xr-x  4 liuyue  staff   128B  4 22 00:33 coreml-encoder-medium.mlpackage  
-rwxr-xr-x  1 liuyue  staff   2.1K  4 21 07:21 download-coreml-model.sh  
-rw-r--r--  1 liuyue  staff   1.3K  4 21 07:21 download-ggml-model.cmd  
-rwxr-xr-x  1 liuyue  staff   2.0K  4 21 07:21 download-ggml-model.sh  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-base.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-base.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-large.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-medium.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-medium.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-small.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-small.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-tiny.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-tiny.en.bin  
-rwxr-xr-x  1 liuyue  staff   1.4K  4 21 07:21 generate-coreml-interface.sh  
-rwxr-xr-x@ 1 liuyue  staff   769B  4 21 07:21 generate-coreml-model.sh  
-rw-r--r--  1 liuyue  staff   1.4G  3 22 16:04 ggml-medium.bin

After the model is downloaded, build the executables from the project root:

make

The program returns:

➜  whisper.cpp git:(master) make  
I whisper.cpp build info:   
I UNAME_S:  Darwin  
I UNAME_P:  arm  
I UNAME_M:  arm64  
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE  
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread  
I LDFLAGS:   -framework Accelerate  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)  
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)  
  
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/bench/bench.cpp ggml.o whisper.o -o bench  -framework Accelerate

At this point, Whisper.cpp is configured.

A Quick Test

Now let's test a piece of audio and see how it does:

./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav

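This command uses the ggml-medium.bin model we just downloaded to recognize the samples/jfk.wav audio file that ships with the project, and the -osrt flag writes the result as an SRT subtitle file. The clip is a famous speech by U.S. President John F. Kennedy. The program returns: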

➜  whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav  
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'  
whisper_model_load: loading model  
whisper_model_load: n_vocab       = 51865  
whisper_model_load: n_audio_ctx   = 1500  
whisper_model_load: n_audio_state = 1024  
whisper_model_load: n_audio_head  = 16  
whisper_model_load: n_audio_layer = 24  
whisper_model_load: n_text_ctx    = 448  
whisper_model_load: n_text_state  = 1024  
whisper_model_load: n_text_head   = 16  
whisper_model_load: n_text_layer  = 24  
whisper_model_load: n_mels        = 80  
whisper_model_load: f16           = 1  
whisper_model_load: type          = 4  
whisper_model_load: mem required  = 1725.00 MB (+   43.00 MB per decoder)  
whisper_model_load: adding 1608 extra tokens  
whisper_model_load: model ctx     = 1462.35 MB  
whisper_model_load: model size    = 1462.12 MB  
whisper_init_state: kv self size  =   42.00 MB  
whisper_init_state: kv cross size =  140.62 MB  
  
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |   
  
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...  
  
  
[00:00:00.000 --> 00:00:11.000]   And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.  
  
output_srt: saving output to 'samples/jfk.wav.srt'

The 11-second clip is transcribed as a single segment, and the subtitles are also written to the samples/jfk.wav.srt file.

On this clip, the English transcription is 100 percent accurate.
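To work with the generated subtitles programmatically, a small parser is enough. The following is a minimal Python sketch that assumes the file follows the standard SRT layout (cue number, a `start --> end` line with `HH:MM:SS,mmm` timestamps, then text, with blank lines between cues). The `Cue` type, the `parse_srt` function, and the `SAMPLE` string are hypothetical names and data for illustration; the sample is not copied from the actual samples/jfk.wav.srt file.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an HH:MM:SS,mmm timestamp to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into cues, skipping blocks that are too short."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = lines[1].split("-->")
        text = "\n".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0]), _to_seconds(start), _to_seconds(end), text))
    return cues


# Hypothetical sample in standard SRT layout (an assumption, not the real file).
SAMPLE = """1
00:00:00,000 --> 00:00:11,000
And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
"""

for cue in parse_srt(SAMPLE):
    print(cue.index, cue.start, cue.end, cue.text)
```

In practice you would read the file with `open("samples/jfk.wav.srt", encoding="utf-8").read()` and pass the result to `parse_srt`.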

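Now let's switch to Chinese audio. You can record any clip you like. Note that Whisper.cpp only supports audio files in wav format, so first use ffmpeg to convert the mp3 file to a 16 kHz, mono, 16-bit PCM wav file: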

ffmpeg -i ./test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le ./test1.wav

The program returns:

ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers  
  built with Apple clang version 14.0.0 (clang-1400.0.29.202)  
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/5.1.2_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-neon  
  libavutil      57. 28.100 / 57. 28.100  
  libavcodec     59. 37.100 / 59. 37.100  
  libavformat    59. 27.100 / 59. 27.100  
  libavdevice    59.  7.100 / 59.  7.100  
  libavfilter     8. 44.100 /  8. 44.100  
  libswscale      6.  7.100 /  6.  7.100  
  libswresample   4.  7.100 /  4.  7.100  
  libpostproc    56.  6.100 / 56.  6.100  
[mp3 @ 0x130e05580] Estimating duration from bitrate, this may be inaccurate  
Input #0, mp3, from './test1.mp3':  
  Duration: 00:05:41.33, start: 0.000000, bitrate: 48 kb/s  
  Stream #0:0: Audio: mp3, 24000 Hz, mono, fltp, 48 kb/s  
Stream mapping:  
  Stream #0:0 -> #0:0 (mp3 (mp3float) -> pcm_s16le (native))  
Press [q] to stop, [?] for help  
Output #0, wav, to './test1.wav':  
  Metadata:  
    ISFT            : Lavf59.27.100  
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s  
    Metadata:  
      encoder         : Lavc59.37.100 pcm_s16le  
[mp3float @ 0x132004260] overread, skip -6 enddists: -4 -4ed=N/A      
    Last message repeated 1 times  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -7 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -9 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -5 enddists: -1 -1  
    Last message repeated 1 times  
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -8 enddists: -5 -5  
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -6 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -6 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -6 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -7 enddists: -6 -6  
[mp3float @ 0x132004260] overread, skip -9 enddists: -6 -6  
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
size=   10667kB time=00:05:41.32 bitrate= 256.0kbits/s speed=2.08e+03x      
video:0kB audio:10666kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000714%

Here, a clip of 5 minutes 41 seconds is converted to a wav file.
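Before transcribing, it can be worth confirming that the converted file really has the format requested in the ffmpeg command (`-ar 16000 -ac 1 -c:a pcm_s16le`). The sketch below uses Python's standard `wave` module. The function names `describe_wav` and `matches_conversion_target` are hypothetical, and the demo runs on a synthetic one-second clip rather than on test1.wav.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read the basic format fields from a PCM wav file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "samples": frames,
            "seconds": frames / rate,
        }


def matches_conversion_target(info: dict) -> bool:
    """True if the file is 16 kHz, mono, 16-bit, as the ffmpeg command requests."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )


# Demo on a synthetic clip, since test1.wav is not available here.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                   # 2 bytes = 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)  # one second of silence

demo_info = describe_wav(demo_path)
print(demo_info, matches_conversion_target(demo_info))
```

For the converted recording, calling `describe_wav("./test1.wav")` should report a duration close to the 00:05:41 shown in the ffmpeg output.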

Then, with test1.wav placed in the samples directory, run the command to start transcribing:

./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh

The -l parameter is needed here to tell the program that the audio is in Chinese. The program returns:

➜  whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh  
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'  
whisper_model_load: loading model  
whisper_model_load: n_vocab       = 51865  
whisper_model_load: n_audio_ctx   = 1500  
whisper_model_load: n_audio_state = 1024  
whisper_model_load: n_audio_head  = 16  
whisper_model_load: n_audio_layer = 24  
whisper_model_load: n_text_ctx    = 448  
whisper_model_load: n_text_state  = 1024  
whisper_model_load: n_text_head   = 16  
whisper_model_load: n_text_layer  = 24  
whisper_model_load: n_mels        = 80  
whisper_model_load: f16           = 1  
whisper_model_load: type          = 4  
whisper_model_load: mem required  = 1725.00 MB (+   43.00 MB per decoder)  
whisper_model_load: adding 1608 extra tokens  
whisper_model_load: model ctx     = 1462.35 MB  
whisper_model_load: model size    = 1462.12 MB  
whisper_init_state: kv self size  =   42.00 MB  
whisper_init_state: kv cross size =  140.62 MB  
  
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |   
  
main: processing 'samples/test1.wav' (5461248 samples, 341.3 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...  
  
  
[00:00:00.000 --> 00:00:03.340]  Hello 大家好, 这里是刘越的技术博客。
[00:00:03.340 --> 00:00:05.720]  最近的事件大家都知道了,  
[00:00:05.720 --> 00:00:07.880]  某公司技术经理魅上欺下,  
[00:00:07.880 --> 00:00:10.380]  打工人应答进队, 不易快灾,  
[00:00:10.380 --> 00:00:12.020]  不易壮灾,  
[00:00:12.020 --> 00:00:14.280]  所谓魅上者必欺下,  
[00:00:14.280 --> 00:00:16.020]  今人诚不我窃。
[00:00:16.020 --> 00:00:17.360]  技术经理者,  
[00:00:17.360 --> 00:00:20.160]  公然在聊天群里大玩职场 PUA,  
[00:00:20.160 --> 00:00:22.400]  气焰嚣张, 有恃无恐,  
[00:00:22.400 --> 00:00:23.700]  最终引发众目,  
[00:00:23.700 --> 00:00:26.500]  嘿嘿, 技术经理, 团队领导,  
[00:00:26.500 --> 00:00:29.300]  原来团队领导这四个字是这么用的,  
[00:00:29.300 --> 00:00:31.540]  奴媚显达, 构陷上司,  
[00:00:31.540 --> 00:00:32.780]  人文巨损,  
[00:00:32.780 --> 00:00:33.840]  逢迎上意,  
[00:00:33.840 --> 00:00:34.980]  傲然下欺,  
[00:00:34.980 --> 00:00:36.080]  矫揉造作,  
[00:00:36.080 --> 00:00:37.180]  极尽投机,  
[00:00:37.180 --> 00:00:38.320]  负别人之负,  
[00:00:38.320 --> 00:00:39.620]  康别人之愷,  
[00:00:39.620 --> 00:00:42.180]  如此者, 堪称团队领导也。
[00:00:42.180 --> 00:00:43.980]  中国的所谓传统文化,  
[00:00:43.980 --> 00:00:45.320]  除了仁义理智性,  
[00:00:45.320 --> 00:00:46.620]  除了金石子极,  
[00:00:46.620 --> 00:00:47.820]  除了争争风骨,  
[00:00:47.820 --> 00:00:49.560]  其实还有很多别的货色,  
[00:00:49.560 --> 00:00:52.020]  被大家或无意或无心的漠视了,  
[00:00:52.020 --> 00:00:53.300]  比方功利实用,  
[00:00:53.300 --> 00:00:54.300]  屈颜附示,  
[00:00:54.300 --> 00:00:55.360]  以兼至善,  
[00:00:55.360 --> 00:01:01.000]  官本位和钱规定的传统, 在某种程度上, 传统文化这没硬币的另一面,  
[00:01:01.000 --> 00:01:03.900]  才是更须要咱们去面对和正视的,  
[00:01:03.900 --> 00:01:07.140]  我认为, 这在目前流行实惠价值观的时候,  
[00:01:07.140 --> 00:01:08.940]  提一提还是必要的,  
[00:01:08.940 --> 00:01:10.240]  有的人说了,  
[00:01:10.240 --> 00:01:13.740]  在开发群里对领导, 十分畅快, 十分爽,  
[00:01:13.740 --> 00:01:17.180]  然而, 而后呢, 有用吗?  
[00:01:17.180 --> 00:01:19.260]  晦气的还不是本人,  
[00:01:19.260 --> 00:01:22.520]  没错, 这就是功利且实用的传统,  
[00:01:22.520 --> 00:01:28.780]  各种精力, 思辨, 镇压, 愤恨, 都抵不过三个字, 有用吗?  
[00:01:28.780 --> 00:01:31.820]  事实上, 凡是叫做某种精力的,  
[00:01:31.820 --> 00:01:33.320]  那就是哲学思辨,  
[00:01:33.320 --> 00:01:36.220]  就是一种绝对无用的思辨和学术,  
[00:01:36.220 --> 00:01:39.180]  而中国职场有很强的实用传统,  
[00:01:39.180 --> 00:01:42.140]  但这不是学术思辨, 也没有实践构架,  
[00:01:42.140 --> 00:01:44.380]  仅仅是一种短视的经验论,  
[00:01:44.380 --> 00:01:47.220]  所以, 功利主义, 是密尔,  
[00:01:47.220 --> 00:01:48.980]  编庆的伦理价值学说,  
[00:01:48.980 --> 00:01:52.700]  强调的是, 追求幸福, 如何取得最大效用,  
[00:01:52.700 --> 00:01:55.580]  实用主义, 是东方的一个学术流派,  
[00:01:55.580 --> 00:01:58.260]  比方杜威, 胡适, 就是代表,  
[00:01:58.260 --> 00:02:01.180]  实用主义的另一个名字, 叫人本主义,  
[00:02:01.180 --> 00:02:04.780]  意思是, 以人作为教训和万物的尺度,  
[00:02:04.780 --> 00:02:06.080]  换句话说,  
[00:02:06.080 --> 00:02:09.420]  功利主义, 拥护的正是那种短视的功利,  
[00:02:09.420 --> 00:02:13.220]  实用主义, 拥护的也正是那种但凡看对本人,  
[00:02:13.220 --> 00:02:15.220]  是不是无利的局限判断,  
[00:02:15.220 --> 00:02:17.260]  而在中国职场功利,  
[00:02:17.260 --> 00:02:21.060]  实用的传统中, 恰好是不会有这些实践构架的,  
[00:02:21.060 --> 00:02:23.700]  并且, 不仅没有实践构架,  
[00:02:23.700 --> 00:02:26.140]  还要对那些无用的, 思辨的,  
[00:02:26.140 --> 00:02:29.980]  纯正的精力, 视如避喜, 吃之以鼻,  
[00:02:29.980 --> 00:02:32.260]  没错, 在技术团队里,  
[00:02:32.260 --> 00:02:35.260]  咱们器重技术, 器重实用的迷信,  
[00:02:35.260 --> 00:02:38.900]  然而支流职场并不激励去搞那些看似无用的货色,  
[00:02:38.900 --> 00:02:41.380]  比方一般劳动者的合法权益,  
[00:02:41.380 --> 00:02:43.580]  张义谋的满江红,  
[00:02:43.580 --> 00:02:45.220]  大家想必也都看了的,  
[00:02:45.220 --> 00:02:46.820]  人们总感觉很奇怪,  
[00:02:46.820 --> 00:02:48.300]  为什么那么坏的人,  
[00:02:48.300 --> 00:02:50.020]  皇帝为啥不罢免他?  
[00:02:50.020 --> 00:02:53.140]  为什么君子能当权来构陷坏蛋呢?  
[00:02:53.140 --> 00:02:55.980]  当咱们理解了传统文化中的法家思维,  
[00:02:55.980 --> 00:02:57.300]  就了然了,  
[00:02:57.300 --> 00:02:59.260]  在法家的思维规定下,  
[00:02:59.260 --> 00:03:01.660]  君子得是, 忠良备辱,  
[00:03:01.660 --> 00:03:03.140]  事事所必然,  
[00:03:03.140 --> 00:03:04.900]  因为他一开始的设定,  
[00:03:04.900 --> 00:03:07.540]  就使得劣币驱赶良币的游戏规则,  
[00:03:07.540 --> 00:03:09.940]  所以, 在这种观点下,  
[00:03:09.940 --> 00:03:12.460]  现代常见的一种职场智慧就是,  
[00:03:12.460 --> 00:03:14.820]  自污名节, 以求自保,  
[00:03:14.820 --> 00:03:16.420]  在这种环境下,  
[00:03:16.420 --> 00:03:17.780]  要想生存,  
[00:03:17.780 --> 00:03:19.260]  就只有一条前途,  
[00:03:19.260 --> 00:03:20.900]  那就是附丽势力,  
[00:03:20.900 --> 00:03:23.700]  并且, 谁能领有更大的势力,  
[00:03:23.700 --> 00:03:25.700]  谁就能生存得更好,  
[00:03:25.700 --> 00:03:27.500]  如何附丽势力呢?  
[00:03:27.500 --> 00:03:29.180]  那就是当初正在产生的,  
[00:03:29.180 --> 00:03:31.900]  胡作非为的大腕职场 PUA,  
[00:03:31.900 --> 00:03:33.060]  除此之外,  
[00:03:33.060 --> 00:03:34.340]  这种势力关系,  
[00:03:34.340 --> 00:03:36.900]  在现代会渗透到方方面面,  
[00:03:36.900 --> 00:03:40.300]  因为势力零碎是一个简单而高效的运行机器,  
[00:03:40.300 --> 00:03:42.940]  CPU, 内存, 硬盘,  
[00:03:42.940 --> 00:03:44.900]  甚至一颗 C 面底螺丝钉,  
[00:03:44.900 --> 00:03:47.140]  都是势力机器上的一个环节,  
[00:03:47.140 --> 00:03:48.060]  于是,  
[00:03:48.060 --> 00:03:50.420]  官僚体系之外的所有职场人,  
[00:03:50.420 --> 00:03:52.340]  都会面临一个难堪的处境,  
[00:03:52.340 --> 00:03:54.340]  一方面遭逢势力的打压,  
[00:03:54.340 --> 00:03:55.340]  另一方面,  
[00:03:55.340 --> 00:03:57.900]  也都会多少尝到势力的苦头,  
[00:03:57.900 --> 00:03:58.900]  于是乎,  
[00:03:58.900 --> 00:04:01.420]  势力的细胞渗透到角角落落,  
[00:04:01.420 --> 00:04:02.980]  即使没有组织势力,  
[00:04:02.980 --> 00:04:04.620]  也要谋求文化势力,  
[00:04:04.620 --> 00:04:05.500]  父权,  
[00:04:05.500 --> 00:04:06.380]  夫权,  
[00:04:06.380 --> 00:04:07.460]  家长势力,  
[00:04:07.460 --> 00:04:08.580]  宗族势力,  
[00:04:08.580 --> 00:04:09.660]  老师势力,  
[00:04:09.660 --> 00:04:10.780]  公司势力,  
[00:04:10.780 --> 00:04:12.140]  团队领导势力,  
[00:04:12.140 --> 00:04:13.100]  点点滴滴,  
[00:04:13.100 --> 00:04:15.580]  滴滴点点, 追赶势力,  
[00:04:15.580 --> 00:04:18.140]  简直成为人们生存的全副意义,  
[00:04:18.140 --> 00:04:18.980]  故而,  
[00:04:18.980 --> 00:04:19.980]  遵从势力,  
[00:04:19.980 --> 00:04:21.180]  遵从下级,  
[00:04:21.180 --> 00:04:22.420]  不得罪共事,  
[00:04:22.420 --> 00:04:23.660]  不得罪敌人,  
[00:04:23.660 --> 00:04:25.060]  不得罪陌生人,  
[00:04:25.060 --> 00:04:26.100]  因为你不晓得,  
[00:04:26.100 --> 00:04:28.260]  他们背地有什么的势力关系,  
[00:04:28.260 --> 00:04:30.940]  他们又会不会用这个势力来凑合你,  
[00:04:30.940 --> 00:04:31.940]  没错,  
[00:04:31.940 --> 00:04:34.380]  当咱们解构群里那位领导的行为时,  
[00:04:34.380 --> 00:04:36.220]  咱们也在解构咱们本人,  
[00:04:36.220 --> 00:04:37.420]  毫无疑问,  
[00:04:37.420 --> 00:04:39.380]  对于这位敢于发声的职场人,  
[00:04:39.380 --> 00:04:41.180]  深安职场底层逻辑的,  
[00:04:41.180 --> 00:04:43.220]  咱们肯定能猜到他的终局,  
[00:04:43.220 --> 00:04:44.700]  他的终局是注定的,  
[00:04:44.700 --> 00:04:46.220]  同时也是悲痛的,  
[00:04:46.220 --> 00:04:47.340]  问题是,  
[00:04:47.340 --> 00:04:48.540]  这样做,  
[00:04:48.540 --> 00:04:49.660]  值得吗?  
[00:04:49.660 --> 00:04:52.580]  香港驰名导演王家卫拍过一部电影,  
[00:04:52.580 --> 00:04:54.420]  叫做东邪西毒,  
[00:04:54.420 --> 00:04:56.340]  电影中有这样一个情节,  
[00:04:56.340 --> 00:04:59.620]  有个女人的弟弟被太尉府的一群刀客杀了,  
[00:04:59.620 --> 00:05:00.860]  他想报仇,  
[00:05:00.860 --> 00:05:02.300]  可本人没有文治,  
[00:05:02.300 --> 00:05:04.060]  只能请刀客出手,  
[00:05:04.060 --> 00:05:05.540]  但家里穷没钱,  
[00:05:05.540 --> 00:05:08.540]  最有价值的资产是一篮子鸡蛋,  
[00:05:08.540 --> 00:05:09.260]  于是,  
[00:05:09.260 --> 00:05:10.900]  他提着那一篮子鸡蛋,  
[00:05:10.900 --> 00:05:13.420]  天天站在刀客剑客们通过的路口,  
[00:05:13.420 --> 00:05:14.700]  申请他们出手,  
[00:05:14.700 --> 00:05:16.220]  报仇就是鸡蛋,  
[00:05:16.220 --> 00:05:17.860]  没有人违心为了鸡蛋,  
[00:05:17.860 --> 00:05:20.020]  去单挑太尉府的刀客,  
[00:05:20.020 --> 00:05:21.460]  除了洪七,  
[00:05:21.460 --> 00:05:24.260]  洪七单独力战太尉府那帮刀客,  
[00:05:24.260 --> 00:05:26.780]  所得的报仇是一个鸡蛋,  
[00:05:26.780 --> 00:05:29.020]  然而洪七付出的代价太大,  
[00:05:29.020 --> 00:05:30.060]  混战中,  
[00:05:30.060 --> 00:05:32.700]  洪七被对手砍断了一根手指,  
[00:05:32.700 --> 00:05:33.820]  为了一个鸡蛋,  
[00:05:33.820 --> 00:05:35.500]  而失去一只手指,  
[00:05:35.500 --> 00:05:36.740]  值得吗?  
[00:05:36.740 --> 00:05:37.860]  不值得,  
[00:05:37.860 --> 00:05:39.300]  然而我感觉畅快,  
[00:05:39.300 --> 00:05:40.540]  因為這才是我本人  
  
output_srt: saving output to 'samples/test1.wav.srt'  
  
whisper_print_timings:     load time =   978.82 ms  
whisper_print_timings:     fallbacks =   0 p /   0 h  
whisper_print_timings:      mel time =   438.81 ms  
whisper_print_timings:   sample time =   980.66 ms /  2343 runs (0.42 ms per run)  
whisper_print_timings:   encode time = 31476.10 ms /    13 runs (2421.24 ms per run)  
whisper_print_timings:   decode time = 47833.70 ms /  2343 runs (20.42 ms per run)  
whisper_print_timings:    total time = 81797.88 ms

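A clip of more than five minutes (341.3 seconds) is transcribed in a little over a minute (about 82 seconds in total), roughly four times faster than real time. Full marks for efficiency.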
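The speed figure can be read straight off the log. The following Python sketch extracts the audio duration from the `main: processing` line and the wall time from the `total time` line, then divides one by the other. The function name `realtime_speedup` is hypothetical, and the regular expressions assume the log layout shown above.

```python
import re


def realtime_speedup(log: str) -> float:
    """Return audio duration divided by total processing time."""
    audio = re.search(r"\(\d+ samples, ([\d.]+) sec\)", log)
    total = re.search(r"total time\s*=\s*([\d.]+) ms", log)
    if audio is None or total is None:
        raise ValueError("log does not contain the expected lines")
    audio_seconds = float(audio.group(1))
    total_seconds = float(total.group(1)) / 1000.0
    return audio_seconds / total_seconds


# The two relevant lines, copied from the run above.
LOG = """\
main: processing 'samples/test1.wav' (5461248 samples, 341.3 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...
whisper_print_timings:    total time = 81797.88 ms
"""

print(f"{realtime_speedup(LOG):.2f}x real time")  # prints 4.17x real time
```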

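Of course, the accuracy still leaves room for improvement. You can choose the large model for better accuracy, but transcription time increases accordingly.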
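Switching models only changes the path passed to -m. As an illustration, the Python sketch below assembles the same command line used earlier from a model name. The helper `build_transcribe_cmd` is hypothetical, and the `./models/ggml-<name>.bin` naming is an assumption extrapolated from ggml-medium.bin; check the models directory for the actual file name after downloading another model.

```python
import shlex
from typing import List, Optional


def build_transcribe_cmd(
    model_name: str,
    wav_path: str,
    language: Optional[str] = None,
    write_srt: bool = True,
) -> List[str]:
    """Build the argument list for ./main using only flags shown in this article."""
    cmd = ["./main"]
    if write_srt:
        cmd.append("-osrt")  # also write an .srt subtitle file
    # Assumed naming pattern: ./models/ggml-<model_name>.bin
    cmd += ["-m", f"./models/ggml-{model_name}.bin", "-f", wav_path]
    if language:
        cmd += ["-l", language]  # e.g. "zh" for Chinese audio
    return cmd


print(shlex.join(build_transcribe_cmd("medium", "samples/test1.wav", "zh")))
print(shlex.join(build_transcribe_cmd("large", "samples/test1.wav", "zh")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root to launch the transcription.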

Model Conversion for Apple M-Series Chips

Mac users are in luck: through Core ML, Whisper.cpp can run encoder inference on the Apple Neural Engine (ANE), which can be more than three times faster than running on the CPU alone.

First, install the dependencies needed for conversion:

pip install ane_transformers  
pip install openai-whisper  
pip install coremltools

Then run the conversion script:

./models/generate-coreml-model.sh medium 

The argument is the name of the model.

The program returns:

➜  models git:(master) python3 convert-whisper-to-coreml.py --model medium --encoder-only True   
scikit-learn version 1.2.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.  
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=1024, n_audio_head=16, n_audio_layer=24, n_vocab=51865, n_text_ctx=448, n_text_state=1024, n_text_head=16, n_text_layer=24)  
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!  
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"  
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').  
  scale = (n_state // self.n_head) ** -0.25  
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 1971/1972 [00:00<00:00, 3247.25  
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 54.69 passes/s]  
Running MIL default pipeline: 100%|████████| 57/57 [00:09<00:00,  6.29 passes/s]  
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 444.13 passe  
  
  
  
  
  
  
done converting

Once the conversion is done, rebuild the project:

make clean  
WHISPER_COREML=1 make -j

Then transcribe with the converted model. The command is the same as before and still points at ggml-medium.bin:

./main -m models/ggml-medium.bin -f samples/jfk.wav

With that, Mac users are promoted to first-class citizens.
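One way to check which backends a build uses is the `system_info` line printed at startup, which reads `COREML = 0` in the CPU-only runs earlier in this article. The Python sketch below splits that line into fields. The names `parse_system_info` and `coreml_enabled` are hypothetical, and it is an assumption here that a Core ML build reports `COREML = 1`.

```python
def parse_system_info(line: str) -> dict:
    """Split a 'system_info: A = 1 | B = 0 | ...' line into a dict of strings."""
    prefix = "system_info:"
    if not line.startswith(prefix):
        raise ValueError("not a system_info line")
    fields = {}
    for part in line[len(prefix):].split("|"):
        if "=" not in part:
            continue  # skip the empty piece after the trailing '|'
        key, value = part.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def coreml_enabled(fields: dict) -> bool:
    """Assumption: a build with Core ML support reports COREML = 1."""
    return fields.get("COREML") == "1"


# Line copied from the CPU-only run shown earlier.
LINE = (
    "system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | "
    "NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | "
    "SSE3 = 0 | VSX = 0 | COREML = 0 | "
)

info = parse_system_info(LINE)
print(info["NEON"], info["BLAS"], coreml_enabled(info))
```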

Conclusion

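Whisper.cpp is both a faithful re-creation and a rebirth of Whisper. It inherits all of Whisper's functionality and, on top of that, improves the speed and efficiency of speech-to-text transcription as well as cross-platform portability. The rapid development of open-source technology makes one thing clear: the spread of high-quality technology is far more valuable than the technology itself.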
