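Lü Shenghui (吕声辉) is a PaddlePaddle Developer Expert (PPDE) and an R&D engineer at a network technology company. Main research areas: image recognition and natural language processing.
AI Studio home page:
https://aistudio.baidu.com/aistudio/personalcenter/thirdview/…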
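Project background
With the growth of the internet, what ordinary readers expect from books has shifted from plain text to a mix of illustrated text, audio, and video, so there is strong market demand for converting text books into audiobooks. This article builds on the speech synthesis technology in PaddleSpeech, the PaddlePaddle speech model library, and adds voice cloning, speaking-rate control, and volume adjustment to demonstrate a technically feasible approach to audiobook production.
The final result is shown here:
player.bilibili.com/player.html?bvid=BV1x84y1V7SR
Web demo:
https://book.weixin12306.com/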
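Environment setup
PaddleSpeech is an open-source speech model library built on PaddlePaddle. It supports development of a wide range of key speech and audio tasks and includes many state-of-the-art and influential deep-learning models. Start by configuring the environment for PaddleSpeech as follows:
# Note: if you have already run this step, there is no need to run it again; this directory is not cleared when the project restarts
# Download and unzip the speaker encoder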
!wget -P data https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip
!unzip -o -d work data/ge2e_ckpt_0.3.zip
# Download and unzip the vocoder
!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip
!unzip -o -d work data/pwg_aishell3_ckpt_0.5.zip
# Download and unzip the acoustic model
!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip
!unzip -o -d work data/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip
# Download and extract the nltk data package
!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz
!tar zxvf data/nltk_data.tar.gz
# Install PaddleSpeech
!pip install pytest-runner
!pip install paddlespeech
# Copy nltk_data to the /home/aistudio directory
!cp -r /home/aistudio/work/nltk_data /home/aistudio
# Install moviepy
!pip install moviepy==1.0.3
Data processing
Each book is stored as JSON in a .txt file at /work/books/inputs/bookname.txt. For demonstration, Romance of the Three Kingdoms (三国演义) is used as the example.
{"name": "三国演义", "lists": [{"title": "第一回 宴桃园豪杰三结义 斩黄巾英雄首立功", "content": "滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。青山依…"
}, {"title": "第二回 张翼德怒鞭督邮 何国舅谋诛宦竖", "content": "且说董卓字仲颖,陇西临洮人也,官拜河东太守,自来骄傲…"
}]
}
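As a minimal sketch of reading such a file, assuming it holds valid JSON with the `name` and `lists` keys shown above: the function name `load_book` and the validation rules are hypothetical and are not taken from the project.

```python
import json
from pathlib import Path


def load_book(path):
    """Load a book file and check that it has the expected shape."""
    # The file has a .txt extension but its content is JSON text.
    book = json.loads(Path(path).read_text(encoding="utf-8"))
    if "name" not in book or "lists" not in book:
        raise ValueError("expected top-level keys 'name' and 'lists'")
    for index, chapter in enumerate(book["lists"]):
        missing = {"title", "content"} - chapter.keys()
        if missing:
            raise ValueError(f"chapter {index} is missing keys: {sorted(missing)}")
    return book
```

Failing early on a malformed chapter is a design choice here: it avoids discovering the problem only after part of the audio has been synthesized.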
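Audio synthesis
Paragraph and sentence splitting
Text is split into paragraphs on the newline character "\n" and into sentences on the Chinese full stop "。".
# Split into paragraphs and sentences
def lists(self, lists):
    results = []
    for i in range(len(lists)):
        item = lists[i]
        title = item['title']
        content = item['content']
        sections = []
        sentences = []
        # Paragraphs: split on newlines and drop fragments of one character or less
        contents = content.split('\n')
        for citem in contents:
            if len(citem) > 1:
                sections.append(citem)
        sentenceIndex = 0
        for sitems in sections:
            # Sentences: split on the Chinese full stop and drop short fragments
            sitems_ = []
            for tmp in sitems.split('。'):
                if len(tmp) > 1:
                    sitems_.append(tmp)
            for j in range(len(sitems_)):
                sentence = {
                    'id': sentenceIndex,
                    'sentence': sitems_[j],
                    # 'end' is 1 for the last sentence of a paragraph, otherwise 0
                    'end': 0 if j < len(sitems_) - 1 else 1
                }
                sentences.append(sentence)
                sentenceIndex += 1
        result = {
            'id': i,
            'title': title,
            'sentences': sentences
        }
        results.append(result)
    return results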
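The splitting rule can also be tried in isolation. The sketch below is a standalone restatement of the same rule for a single chapter's text; the function name `split_sentences` is hypothetical, and it is meant for experimenting with inputs rather than as a replacement for the method above.

```python
def split_sentences(content):
    """Split chapter text into sentence records, numbered across paragraphs."""
    sentences = []
    # Paragraphs are separated by newlines; fragments of one character or less are skipped.
    paragraphs = [p for p in content.split("\n") if len(p) > 1]
    for paragraph in paragraphs:
        parts = [s for s in paragraph.split("。") if len(s) > 1]
        for j, text in enumerate(parts):
            sentences.append({
                "id": len(sentences),
                "sentence": text,
                # end == 1 marks the last sentence of its paragraph
                "end": 1 if j == len(parts) - 1 else 0,
            })
    return sentences
```

Note that splitting on "。" drops the full stop itself and does not treat "!" or "?" as sentence boundaries, so sentences ending in those marks stay attached to the following text.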
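Special character handling
Classical Chinese texts may contain many rare characters or special symbols, which need targeted replacement here.
# Special-case handling example; in production, a lookup dictionary that detects and replaces characters automatically is a better fit
def dealText(self, text):
    text = text.replace('-','')
    # text = text.replace('','')   # search character lost in extraction; as written this is a no-op
    text = text.replace('’','')
    text = text.replace('﨑','崎')
    text = text.replace("[",' ')
    text = text.replace("]",' ')
    # text = text.replace('',' ')  # search character lost in extraction; as written this would put a space around every character
    text = text.replace(",]","")
    # As extracted, the three digit replacements below are no-ops; the originals presumably mapped full-width digits to ASCII
    text = text.replace("1","1")
    text = text.replace("2",'2')
    text = text.replace("6","6")
    text = text.replace("〔","")
    text = text.replace("─","")
    text = text.replace("┬","")
    text = text.replace("┼","")
    text = text.replace("┴","")
    text = text.replace("〖"," ")
    text = text.replace("〗"," ")
    text = text.replace("礻殳","祋")
    return text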
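The dictionary-driven approach suggested in the comment could look roughly like the sketch below. The table contents, the name `REPLACEMENTS`, and the function `clean_text` are illustrative assumptions; a real table would be built from the characters that actually occur in the books being processed.

```python
# Hypothetical replacement table; extend it with the characters found in your own books.
REPLACEMENTS = {
    "﨑": "崎",
    "礻殳": "祋",
    "〖": " ",
    "〗": " ",
    "─": "",
    "┬": "",
}


def clean_text(text, replacements=REPLACEMENTS):
    """Apply every replacement in the table, longest source string first."""
    # Longest-first ordering keeps multi-character entries such as "礻殳"
    # from being broken up by a shorter entry that matches part of them.
    for source in sorted(replacements, key=len, reverse=True):
        text = text.replace(source, replacements[source])
    return text
```

Keeping the table as data means new characters can be added without touching the replacement logic.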
Audio synthesis
Each audio file is saved to the location that corresponds to its split ID.
# Audio synthesis
def audio(self, contents):
    self.tts = TTSExecutor()
    for i in range(len(contents['lists'])):
        item = contents['lists'][i]
        # One output directory per chapter: <output>/<bookname>/<chapter index>
        basePath = self.bookPathOutput+'/'+self.bookname+'/'+str(i)
        if os.path.exists(basePath) is False:
            os.makedirs(r''+basePath)
        # Generate the audio for the chapter title
        self.text2audio(item['title'], basePath+'/title.wav')
        # Generate the audio for each sentence
        for j in range(len(item['sentences'])):
            sitem = item['sentences'][j]
            self.text2audio(sitem['sentence'], basePath+'/'+str(sitem['id'])+'.wav')

def text2audio(self, text, path):
    text = self.dealText(text)
    self.voice_cloning(text, path)
    #self.tts(text=text, output=path)
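The directory layout used above (one folder per chapter, one `title.wav`, and one wav per sentence ID) can be captured in small helpers. This is a sketch only; the function names are hypothetical, and the example arguments reuse the directory and book name that appear in this article.

```python
from pathlib import Path


def chapter_dir(output_root, bookname, chapter_index):
    """Directory that holds all audio for one chapter."""
    return Path(output_root) / bookname / str(chapter_index)


def title_wav(output_root, bookname, chapter_index):
    """Path of the audio file for the chapter title."""
    return chapter_dir(output_root, bookname, chapter_index) / "title.wav"


def sentence_wav(output_root, bookname, chapter_index, sentence_id):
    """Path of the audio file for one sentence, named after its ID."""
    return chapter_dir(output_root, bookname, chapter_index) / f"{sentence_id}.wav"


# Example: where the audio for sentence 3 of chapter 0 would be written.
print(sentence_wav("./books/outputs", "sanguoyanyi", 0, 3).as_posix())
```

Deriving every path from the same helpers keeps the writer and any later reader, such as a playback page, in agreement about where each sentence's audio lives.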
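Voice cloning
Reference audio for different voices can be placed in the /work/sounds directory in advance. The voice cloning functionality here is mainly based on the PaddleSpeech voice cloning project.
Project link:
https://aistudio.baidu.com/aistudio/projectdetail/4265795?channelType=0&channel=0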
def clone_pre(self):
    # Init body.
    with open(self.am_config) as f:
        am_config = CfgNode(yaml.safe_load(f))
    self.am_config_ = am_config
    with open(self.voc_config) as f:
        voc_config = CfgNode(yaml.safe_load(f))
    # speaker encoder
    p = SpeakerVerificationPreprocessor(
        sampling_rate=16000,
        audio_norm_target_dBFS=-30,
        vad_window_length=30,
        vad_moving_average_width=8,
        vad_max_silence_length=6,
        mel_window_length=25,
        mel_window_step=10,
        n_mels=40,
        partial_n_frames=160,
        min_pad_coverage=0.75,
        partial_overlap_ratio=0.5)
    print("Audio Processor Done!")
    self.p = p
    speaker_encoder = LSTMSpeakerEncoder(n_mels=40, num_layers=3, hidden_size=256, output_size=256)
    speaker_encoder.set_state_dict(paddle.load(self.ge2e_params_path))
    speaker_encoder.eval()
    self.speaker_encoder = speaker_encoder
    print("GE2E Done!")
    with open(self.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    # acoustic model
    odim = am_config.n_mels
    # model: {model_name}_{dataset}
    am_name = self.am[:self.am.rindex('_')]
    am_dataset = self.am[self.am.rindex('_') + 1:]
    am_class = dynamic_import(am_name, self.model_alias)
    am_inference_class = dynamic_import(am_name + '_inference', self.model_alias)
    if am_name == 'fastspeech2':
        am = am_class(idim=vocab_size, odim=odim, spk_num=None, **am_config["model"])
    elif am_name == 'tacotron2':
        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
    am.set_state_dict(paddle.load(self.am_ckpt)["main_params"])
    am.eval()
    am_mu, am_std = np.load(self.am_stat)
    am_mu = paddle.to_tensor(am_mu)
    am_std = paddle.to_tensor(am_std)
    am_normalizer = ZScore(am_mu, am_std)
    am_inference = am_inference_class(am_normalizer, am)
    am_inference.eval()
    self.am_inference = am_inference
    print("acoustic model done!")
    # vocoder
    # model: {model_name}_{dataset}
    voc_name = self.voc[:self.voc.rindex('_')]
    voc_class = dynamic_import(voc_name, self.model_alias)
    voc_inference_class = dynamic_import(voc_name + '_inference', self.model_alias)
    voc = voc_class(**voc_config["generator_params"])
    voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"])
    voc.remove_weight_norm()
    voc.eval()
    voc_mu, voc_std = np.load(self.voc_stat)
    voc_mu = paddle.to_tensor(voc_mu)
    voc_std = paddle.to_tensor(voc_std)
    voc_normalizer = ZScore(voc_mu, voc_std)
    voc_inference = voc_inference_class(voc_normalizer, voc)
    voc_inference.eval()
    self.voc_inference = voc_inference
    print("voc done!")
    self.frontend = Frontend(phone_vocab_path=self.phones_dict)
    print("frontend done!")
    # Extract the voice (speaker embedding) from the reference audio
    ref_audio_path = self.soundsInput+'/'+str(self.sound)+'.mp3'
    mel_sequences = self.p.extract_mel_partials(self.p.preprocess_wav(ref_audio_path))
    # print("mel_sequences:", mel_sequences.shape)
    with paddle.no_grad():
        spk_emb = self.speaker_encoder.embed_utterance(paddle.to_tensor(mel_sequences))
    # print("spk_emb shape:", spk_emb.shape)
    self.spk_emb = spk_emb

def voice_cloning(self, text, path):
    input_ids = self.frontend.get_input_ids(text, merge_sentences=True)
    phone_ids = input_ids["phone_ids"][0]
    with paddle.no_grad():
        wav = self.voc_inference(self.am_inference(phone_ids, spk_emb=self.spk_emb))
    sf.write(path, wav.numpy(), samplerate=self.am_config_.fs)
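The setup code derives the model class name from a tag of the form `{model_name}_{dataset}` by cutting at the last underscore. The sketch below isolates that one step. The helper name `split_model_tag` and the example tag are assumptions for illustration; the article does not show the actual values of `self.am` or `self.voc`.

```python
def split_model_tag(tag):
    """Split a '{model_name}_{dataset}' tag at the last underscore."""
    cut = tag.rindex("_")  # raises ValueError if the tag has no underscore
    return tag[:cut], tag[cut + 1:]


# Example with an assumed tag value.
model_name, dataset = split_model_tag("fastspeech2_aishell3")
print(model_name, dataset)
```

Cutting at the last underscore rather than the first lets a model name contain underscores of its own, as long as the dataset name does not.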
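Speaking rate and volume adjustment
# Assumed imports for moviepy 1.0.3: from moviepy.editor import AudioFileClip, afx
def post_del(self, path):
    old_au = AudioFileClip(path)
    # Remap time so that output time t reads the source at speed * t
    new_au = old_au.fl_time(lambda t: self.speed*t, apply_to=['mask', 'audio'])
    new_au = new_au.set_duration(old_au.duration/self.speed)
    # Scale the volume by the configured factor
    new_au = (new_au.fx(afx.volumex, self.volumex))
    # Write the adjusted file to the matching path under the final directory
    final_path = path.replace('outputs','final')
    print(path, final_path)
    new_au.write_audiofile(final_path)
    print('^^^^^^')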
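To make the time mapping concrete, here is a library-free sketch of the same idea on a plain list of samples: output sample i is read from source position speed * i, so the result has about len / speed samples, and each sample is then multiplied by the volume factor. The function name is hypothetical and this is not how moviepy is implemented internally. Plain time remapping of this kind also shifts the pitch along with the rate, which is worth checking by ear at speeds far from 1.0.

```python
def change_speed_and_volume(samples, speed=1.0, volume=1.0):
    """Time-scale a mono signal by index remapping, then scale its amplitude."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    new_length = int(len(samples) / speed)
    last = len(samples) - 1
    # Output sample i is taken from source position speed * i (clamped to the last sample).
    return [samples[min(int(i * speed), last)] * volume for i in range(new_length)]
```

With speed above 1 the signal gets shorter (faster playback); with speed below 1 samples are repeated and the signal gets longer.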
The voice, speaking rate, and volume are set at the top of main.py.
class Main(object):
    def __init__(self):
        self.bookPathInput = './books/inputs'  # book input directory
        self.bookPathOutput = './books/outputs'  # regular output directory
        self.bookPathFinal = './books/final'  # final output directory
        self.bookname = 'sanguoyanyi'
        self.tts = None
        self.soundsInput = './sounds'  # directory holding the reference voice files
        self.sound = '001'  # voice ID
        self.speed = 1.0  # speaking rate
        self.volumex = 1.1  # volume factor
# Run the whole audio synthesis with a single command
%cd /home/aistudio/work/
!python main.py
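Viewing the results
The final split data is in the /work/outputs/sanguoyanyi directory. Audio at the original rate and volume is under the outputs directory, and audio at the specified rate and volume is under the final directory. The outputs.txt file holds the split content, and the audio files are ordered by chapter and by sentence index within each chapter.
The contents of outputs.txt:
{"name": "三国演义", "lists": [{"id": 0, "title": "第一回 宴桃园豪杰三结义 斩黄巾英雄首立功", "sentence": [{"id": 0, "sentence": "滚滚长江东逝水,浪花淘尽英雄", "end": 0
}, {"id": 1, "sentence": "是非成败转头空", "end": 0
}, {"id": 2, "sentence": "青山依旧在,几度夕阳红", "end": 0
}, {"id": 3, "sentence": "白发渔樵江渚上,惯看秋月春风", "end": 0
}, {"id": 4, "sentence": "一壶浊酒喜相逢", "end": 0
}, {
…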
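Each sentence record carries an `end` flag, which makes it possible to rebuild paragraphs from the flat sentence list, for example when a client wants to render paragraph breaks. The sketch below assumes a list of records with the `id`, `sentence`, and `end` fields shown above; the function name `group_paragraphs` is hypothetical.

```python
def group_paragraphs(sentences):
    """Rebuild paragraphs from a flat sentence list using the 'end' flag."""
    paragraphs, current = [], []
    for item in sentences:
        current.append(item)
        if item["end"] == 1:  # last sentence of a paragraph
            paragraphs.append(current)
            current = []
    if current:  # keep a trailing paragraph that has no end marker
        paragraphs.append(current)
    return paragraphs
```

Because the sentence IDs are preserved inside each group, every sentence can still be matched to its wav file after grouping.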
Below are the wav audio files for each sentence of Chapter 1.
Client display
This step presents the content and audio generated in part three. A simple H5 page demonstrates the audiobook reading experience, with two features: content display and sentence-by-sentence read-aloud highlighting.
Video: Implementing audiobook reading with PaddleSpeech, https://live.csdn.net/v/embed/280333
The H5 source code is on GitHub at the link below:
https://github.com/lvsh2012/book2audio
You can also try it directly on a phone or PC:
https://book.weixin12306.com/
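Summary
With PaddleSpeech, speech synthesis can be implemented quickly and simply, which makes turning books into audiobooks straightforward. One thing to watch when presenting playback in an H5 page is the mapping between each piece of content and its audio file. Beyond speech synthesis, PaddleSpeech also offers speech recognition, voiceprint extraction, punctuation restoration, and other features. There is plenty of room to explore further possibilities in this area with PaddleSpeech.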