关于人工智能:自制有声书阅读器用PaddleSpeech打开读书新方式

吕声辉，飞桨开发者技术专家（PPDE），某网络科技公司研发工程师。次要钻研方向为图像识别，自然语言解决等。 • AI Studio主页
https://aistudio.baidu.com/aistudio/personalcenter/thirdview/...

我的项目背景

随着互联网的倒退，普通用户对于书籍展现模式的需要已由纯文字变成了图文、语音、视频等多种形式，因而将文本书籍转换为有声读物具备很大的市场需求。本文以飞桨语音模型库PaddleSpeech提供的语音合成技术为外围，通过音色克隆、语速设置、音量调整等附加性能，展现有声书籍的技术可行计划。

最终出现成果如
player.bilibili.com/player.html?bvid=BV1x84y1V7SR

网页体验拜访地址
https://book.weixin12306.com/

环境筹备

PaddleSpeech 是基于飞桨的语音方向开源模型库，用于语音和音频中的各种要害工作的开发，蕴含大量基于深度学习的前沿和有影响力的模型。首先进行PaddleSpeech装置环境的配置，配置如下：

# 留神如果之前运行过这步 下次就不必再运行了，这个目录重启我的项目也不会清空的# 下载解压谈话人编码器!wget -P data https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip!unzip -o -d work data/ge2e_ckpt_0.3.zip# 下载解压声码器!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip!unzip -o -d work data/pwg_aishell3_ckpt_0.5.zip# 下载解压声学模型!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip!unzip -o -d work data/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip#  下载解压nltk包!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz!tar zxvf data/nltk_data.tar.gz# 装置PaddleSpeech!pip install pytest-runner!pip install paddlespeech# 将nltk_data 拷贝到 /home/aistudio 目录!cp -r /home/aistudio/work/nltk_data /home/aistudio# 装置moviepy !pip install moviepy==1.0.3

数据处理

每本书的内容均以json格局寄存在txt文本中，门路为
/work/books/inputs/bookname.txt。为不便演示，这里以三国演义为例。

{   “name”: “三国演义”,    “lists”: [{      “title”: “第一回 宴桃园俊杰三结义 斩黄巾英雄首犯罪”      “content”: “滚滚长江东逝水，浪花淘尽英雄。是非成败转头空。青山依}, {      “title”: “第二回 张翼德怒鞭督邮 何国舅谋诛宦竖”,      “content”: “且说董卓字仲颖，陇西临洮人也，官拜河东太守，自来自豪   }]}

音频合成

段落句子宰割

以换行符"\n"宰割为段落，以"。"宰割为句子。

# 段落和句子宰割def lists(self, lists):    results = []    for i in range(len(lists)):        item = lists[i]        title = item['title']        content = item['content']        sections = []        sentences = []        contents = content.split('\n')        for citem in contents:            if len(citem) > 1:                sections.append(citem)        sentenceIndex = 0        for sitems in sections:            sitems_ = []            for tmp  in sitems.split('。'):                if len(tmp) > 1:                    sitems_.append(tmp)            for j in range(len(sitems_)):                sentence = {                    'id':sentenceIndex,                    'sentence': sitems_[j],                    'end': 0 if j < len(sitems_) - 1 else 1                }                sentences.append(sentence)                sentenceIndex += 1        result = {            'id':i,            'title':title,            'sentences':sentences        }        results.append(result)    return results

特殊字符解决

在国学书籍中，有可能呈现很多生僻字或者特殊符号，这里须要做针对性的替换。

# 非凡解决示例，工程化最好用字典主动判断替换def dealText(self, text):    text = text.replace('-','')    text = text.replace(' ', '')    text = text.replace('’','')    text = text.replace('﨑','崎')    text = text.replace("[",' ')    text = text.replace("]",' ')    text = text.replace('　',' ')    text = text.replace("，]","")    text = text.replace("１","1")    text = text.replace("２",'2')    text = text.replace("６","6")    text = text.replace("〔","")    text = text.replace("─","")    text = text.replace("┬","")    text = text.replace("┼","")    text = text.replace("┴","")    text = text.replace("〖"," ")    text = text.replace("〗"," ")    text = text.replace("礻殳","祋")    return text

音频合成

依据宰割的ID，保留到对应地位。

# 音频合成def audio(self, contents):    self.tts = TTSExecutor()    for i in range(len(contents['lists'])):        item = contents['lists'][i]        basePath = self.bookPathOutput+'/'+self.bookname+'/'+str(i)        if os.path.exists(basePath) is False:            os.makedirs(r''+basePath)        # 生成每回题目音频        self.text2audio(item['title'], basePath+'/title.wav')        # 生成每句内容音频        for j in range(len(item['sentences'])):            sitem = item['sentences'][j]            self.text2audio(sitem['sentence'], basePath+'/'+str(sitem['id'])+'.wav')def text2audio(self, text, path):    text = self.dealText(text)    self.voice_cloning(text, path)#self.tts(text=text, output=path)

音色克隆

能够当时将不同音色音频搁置在 /work/sounds 目录下。此处音色克隆局部的性能次要参考自PaddleSpeech语音克隆我的项目。

我的项目链接
https://aistudio.baidu.com/aistudio/projectdetail/4265795?channelType=0&channel=0

def clone_pre(self):        # Init body.        with open(self.am_config) as f:            am_config = CfgNode(yaml.safe_load(f))        self.am_config_ = am_config        with open(self.voc_config) as f:            voc_config = CfgNode(yaml.safe_load(f))        # speaker encoder        p = SpeakerVerificationPreprocessor(            sampling_rate=16000,            audio_norm_target_dBFS=-30,            vad_window_length=30,            vad_moving_average_width=8,            vad_max_silence_length=6,            mel_window_length=25,            mel_window_step=10,            n_mels=40,            partial_n_frames=160,            min_pad_coverage=0.75,            partial_overlap_ratio=0.5)        print("Audio Processor Done!")        self.p = p        speaker_encoder = LSTMSpeakerEncoder(            n_mels=40, num_layers=3, hidden_size=256, output_size=256)        speaker_encoder.set_state_dict(paddle.load(self.ge2e_params_path))        speaker_encoder.eval()        self.speaker_encoder = speaker_encoder        print("GE2E Done!")        with open(self.phones_dict, "r") as f:            phn_id = [line.strip().split() for line in f.readlines()]        vocab_size = len(phn_id)        print("vocab_size:", vocab_size)        # acoustic model        odim = am_config.n_mels        # model: {model_name}_{dataset}        am_name = self.am[:self.am.rindex('_')]        am_dataset = self.am[self.am.rindex('_') + 1:]        am_class = dynamic_import(am_name, self.model_alias)        am_inference_class = dynamic_import(            am_name + '_inference', self.model_alias)      if am_name == 'fastspeech2':            am = am_class(                idim=vocab_size, odim=odim, spk_num=None, **am_config["model"])        elif am_name == 'tacotron2':            am = am_class(idim=vocab_size, odim=odim, **am_config["model"])        am.set_state_dict(paddle.load(self.am_ckpt)["main_params"])        am.eval()        am_mu, am_std = np.load(self.am_stat)        am_mu = paddle.to_tensor(am_mu)        am_std = paddle.to_tensor(am_std)        am_normalizer = ZScore(am_mu, am_std)        am_inference = am_inference_class(am_normalizer, am)        am_inference.eval()        self.am_inference = am_inference        print("acoustic model done!")        # vocoder        # model: {model_name}_{dataset}        voc_name = self.voc[:self.voc.rindex('_')]        voc_class = dynamic_import(voc_name, self.model_alias)        voc_inference_class = dynamic_import(            voc_name + '_inference', self.model_alias)        voc = voc_class(**voc_config["generator_params"])        voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"])        voc.remove_weight_norm()        voc.eval()        voc_mu, voc_std = np.load(self.voc_stat)        voc_mu = paddle.to_tensor(voc_mu)        voc_std = paddle.to_tensor(voc_std)        voc_normalizer = ZScore(voc_mu, voc_std)        voc_inference = voc_inference_class(voc_normalizer, voc)        voc_inference.eval()        self.voc_inference = voc_inference        print("voc done!")        self.frontend = Frontend(phone_vocab_path=self.phones_dict)        print("frontend done!")        # 获取音色        ref_audio_path = self.soundsInput+'/'+str(self.sound)+'.mp3'        mel_sequences = self.p.extract_mel_partials(self.p.preprocess_wav(ref_audio_path))        # print("mel_sequences: ", mel_sequences.shape)        with paddle.no_grad():            spk_emb = self.speaker_encoder.embed_utterance(paddle.to_tensor(mel_sequences))        # print("spk_emb shape: ", spk_emb.shape)        self.spk_emb = spk_embdef voice_cloning(self, text, path):    input_ids = self.frontend.get_input_ids(text, merge_sentences=True)    phone_ids = input_ids["phone_ids"][0]    with paddle.no_grad():        wav = self.voc_inference(self.am_inference(phone_ids, spk_emb=self.spk_emb))    sf.write(path, wav.numpy(), samplerate=self.am_config_.fs)

语速和音量调整

def post_del(self, path):    old_au = AudioFileClip(path)    new_au = old_au.fl_time(lambda t:  self.speed*t, apply_to=['mask', 'audio'])    new_au = new_au.set_duration(old_au.duration/self.speed)    new_au = (new_au.fx(afx.volumex, self.volumex))    final_path = path.replace('outputs','final')    print(path, final_path)    new_au.write_audiofile(final_path)    print('^^^^^^')

音色、语速和音量须要在 main.py 的头部中设置。

class Main(object):    def __init__(self):        self.bookPathInput = './books/inputs' # 书籍输出目录        self.bookPathOutput = './books/outputs' # 惯例输入目录        self.bookPathFinal = './books/final' # 最终输入目录        self.bookname = 'sanguoyanyi'        self.tts = None        self.soundsInput = './sounds' # 音色文件寄存目录        self.sound = '001' # 音色编号        self.speed = 1.0 # 语速        self.volumex = 1.1 # 音量# 音频合成，一键命令%cd /home/aistudio/work/!python main.py

查看生成后果

最终切分好的数据在
/work/outputs/sanguoyanyi目录下，原始语速和音量音频在outputs目录下，指定语速和音量音频在final目录下。其中的outputs.txt为切分内容，而音频会依照每个章节以及每个章节的句子索引排序好。

以下为outputs.txt 内容：

{   “name”: “三国演义”,   “lists”: [{      “id”: 0,      “title”: “第一回 宴桃园俊杰三结义 斩黄巾英雄首犯罪”,       “sentence”: [{         “id”: 0         “sentence”: “滚滚长江东逝水，浪花淘尽英雄”,         “end”: 0   }, {      “id”: 1,      “sentence”: “是否成败转头空”,      “end”: 0   }, {      “id”: 2,      “sentence”: “青山仍旧在，几度夕阳红”,      “end”: 0   }, {      “id”: 3,      “sentence”: “白发渔樵江渚上，惯看秋月春风”,      “end”: 0}, {      “id”: 4,      “sentence”: “一壶浊酒喜相逢”,      “end”: 0}, {

以下为第一回的每个句子wav格局音频。

客户端展现

输入第三局部生成好的内容和音频。这里用H5页面简略展现一下有声书浏览的成果，包含内容展现和逐句朗诵高亮两种性能。

[video(video-pUpZJ8ZD-1678071814221)(type-csdn)(url-https://live.csdn.net/v/embed/280333)(image-https://video-community.csdnimg.cn/vod-84deb4/5a4f23f0bbc971e...)(title-用PaddleSpeech实现有声书浏览)]

H5的具体代码已放在GitHub 上，大家可在下方链接中查看
https://github.com/lvsh2012/book2audio

手机或者PC也可间接体验
https://book.weixin12306.com/

总结

通过PaddleSpeech能够简略疾速地实现语音合成性能，轻松实现书籍有声化。使用者在这里须要关注下，当以H5展现播放成果时，须要留神内容和音频的对应关系。除了语音合成性能外，PaddleSpeech还提供了包含语音辨认、声纹提取、标点复原等其余性能。置信大家基于PaddleSpeech能够在该畛域挖掘出更多的可能性！