1 change: 1 addition & 0 deletions README_cn.md
@@ -159,6 +159,7 @@


### Recent Updates

- 👑 2022.05.13: PaddleSpeech released [PP-ASR](./docs/source/asr/PPASR_cn.md), a streaming speech recognition system, [PP-TTS](./docs/source/tts/PPTTS_cn.md), a streaming text-to-speech system, and [PP-VPR](docs/source/vpr/PPVPR_cn.md), a full-pipeline voiceprint recognition system
- 👏🏻 2022.05.06: PaddleSpeech Streaming Server is live! It covers speech recognition (with punctuation restoration and timestamps) and speech synthesis.
- 👏🏻 2022.05.06: PaddleSpeech Server is live! It covers audio classification, speech recognition, speech synthesis, voiceprint recognition, and punctuation restoration.
16 changes: 16 additions & 0 deletions demos/speech_web/.gitignore
@@ -0,0 +1,16 @@
*/.vscode/*
*.wav
*/resource/*
.Ds*
*.pyc
*.pcm
*.npy
*.diff
*.sqlite
*/static/*
*.pdparams
*.pdiparams*
*.pdmodel
*/source/*
*/PaddleSpeech/*

168 changes: 168 additions & 0 deletions demos/speech_web/README_cn.md
@@ -0,0 +1,168 @@
# Paddle Speech Demo

PaddleSpeechDemo is a demo project built around the voice-interaction capabilities of PaddleSpeech. It is meant to help you get started with PaddleSpeech and to build your own applications on top of it.

The speech-interaction part is powered by PaddleSpeech, dialogue and information extraction by PaddleNLP, and the web front end is developed with Vue3.

Main features:

+ Voice chat: PaddleSpeech's speech recognition and speech synthesis, with the dialogue part based on PaddleNLP's chit-chat capability
+ Voiceprint recognition: a showcase of PaddleSpeech's speaker verification capability
+ Speech recognition: supports three modes: real-time (streaming) recognition, end-to-end recognition, and audio-file recognition
+ Speech synthesis: supports both streaming synthesis and end-to-end synthesis
+ Voice commands: combines PaddleSpeech's speech recognition with PaddleNLP's information extraction to implement smart reimbursement of travel expenses

Demo screenshot:

![Demo](docs/效果展示.png)

## Installation

### Back-end setup

```
# install the dependencies
cd speech_server
pip install -r requirements.txt
```
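If you prefer to keep the demo's Python dependencies isolated, the install can be run inside a virtual environment first. This is optional; a minimal sketch using the standard `venv` module:

```shell
# optional: create and activate an isolated environment before installing
python -m venv venv
source venv/bin/activate
```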


### Front-end setup

The front end depends on Node.js, which must be installed in advance; make sure npm is available (tested with npm 8.3.1). We recommend installing a stable Node.js release from the [official site](https://nodejs.org/en/).
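Before installing, a quick check confirms the toolchain is in place (the demo was tested against npm 8.3.1):

```shell
# verify that Node.js and npm are available
node -v
npm -v
```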

```
# enter the front-end directory
cd web_client

# install yarn (skip this if it is already installed)
npm install -g yarn

# install the front-end dependencies with yarn
yarn install
```


## Starting the Services

### Start the back-end service

```
cd speech_server
# the default port is 8010
python main.py --port 8010
```
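Once the back end reports that it is running, a quick request from another terminal can confirm it is reachable. This is only a sketch: it assumes the back end is a FastAPI/uvicorn app (so an interactive docs page is served at `/docs`); adjust the path if your build differs.

```shell
# hedged sanity check: assumes a FastAPI back end exposing /docs on port 8010
curl -s http://localhost:8010/docs | head -n 5
```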

### Start the front-end service

```
cd web_client
yarn dev --port 8011
```

With the default configuration, the back-end address configured in the front end is localhost, so make sure the back-end server and the browser that opens the page are on the same machine. If they are not, see the FAQ below: "How do I change the configuration if the back end is deployed on another machine or on a different port?"

## Running with Docker

### Back-end Docker
The back-end Docker setup uses the [official PaddlePaddle Docker image](https://www.paddlepaddle.org.cn); the CPU version is demonstrated here.
```
# clone the PaddleSpeech repository
cd PaddleSpeechServer
git clone https://github.com/PaddlePaddle/PaddleSpeech.git

# pull the image
docker pull registry.baidubce.com/paddlepaddle/paddle:2.3.0

# start the container
docker run --name paddle -it -p 8010:8010 -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.3.0 /bin/bash

# inside the container, switch to the mounted directory
cd /paddle

# install the dependencies
pip install -r requirements.txt

# start the service
python main.py --port 8010

```
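Because the container was started with `--name paddle`, you can re-attach to it later with standard Docker commands if you exit the shell:

```shell
# restart and re-enter the previously created container
docker start paddle
docker exec -it paddle /bin/bash
```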

### Front-end Docker

The front end can directly use the [official Node Docker image](https://hub.docker.com/_/node).

```shell
docker pull node
```

Install the dependencies inside the image:

```shell
cd PaddleSpeechWebClient
# map external port 8011
docker run -it -p 8011:8011 -v $PWD:/paddle node:latest /bin/bash
# inside the container
cd /paddle
# install the dependencies
yarn install
# start the front end
yarn dev --port 8011
```





## FAQ

#### Q: How do I install Node.js?

A: See this [tutorial (runoob)](https://www.runoob.com/nodejs/nodejs-install-setup.html) for installing Node.js, and make sure npm is available afterwards.

#### Q: How do I change the configuration if the back end is deployed on another machine or on a different port?

A: The back-end address is spread across two files.

First, edit `PaddleSpeechWebClient/vite.config.js`:

```javascript
server: {
host: "0.0.0.0",
proxy: {
"/api": {
target: "http://localhost:8010", // 这里改成后端所在接口
changeOrigin: true,
rewrite: (path) => path.replace(/^\/api/, ""),
},
},
}
```

Then edit `PaddleSpeechWebClient/src/api/API.js` (the WebSocket proxy configuration does not take effect, so the WebSocket addresses must be changed in this file directly):

```javascript
// websocket (change these to the address where the back end runs)
CHAT_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/offlineStream', // ChatBot websocket endpoint
ASR_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/onlineStream', // streaming ASR endpoint
TTS_SOCKET_RECORD: 'ws://localhost:8010/ws/tts/online', // streaming TTS endpoint
```
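After changing these addresses, a WebSocket client can be used to confirm that the new endpoints are reachable. The sketch below assumes the third-party `websocat` CLI is installed (any WebSocket client works) and only checks connectivity, not the message protocol:

```shell
# hedged connectivity check: open a WebSocket to the streaming ASR endpoint
# (replace localhost:8010 with the back end's actual address)
websocat ws://localhost:8010/ws/asr/onlineStream
```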

#### Q: The front end cannot record audio when the back end is accessed by IP address

A: This is caused by the browser's security policy; you need to change the browser settings and then restart the browser. For details, see [js-audio-recorder reports that the browser does not support getUserMedia](https://blog.csdn.net/YRY_LIKE_YOU/article/details/113745273).

Chrome settings page: chrome://flags/#unsafely-treat-insecure-origin-as-secure




## References

Recording audio in Vue: https://blog.csdn.net/qq_41619796/article/details/107865602#t1

Repositories referenced for streaming audio playback in the front end:

https://github.com/AnthumChris/fetch-stream-audio

https://bm.enthuses.me/buffered.php?bref=6677
Binary file added demos/speech_web/docs/效果展示.png
103 changes: 103 additions & 0 deletions demos/speech_web/speech_server/conf/tts_online_application.yaml
@@ -0,0 +1,103 @@
# This is the parameter configuration file for the streaming TTS server.

#################################################################################
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8092

# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx']; the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'http'
engine_list: ['tts_online-onnx']


#################################################################################
# ENGINE CONFIG #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
# fastspeech2_cnndecoder_csmsc support streaming am infer.
am: 'fastspeech2_csmsc'
am_config:
am_ckpt:
am_stat:
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0

# voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
voc_stat:

# others
lang: 'zh'
device: 'cpu' # set 'gpu:id' or 'cpu'
# am_block and am_pad are only used for streaming am inference with the fastspeech2_cnndecoder_csmsc model;
# when am_pad is set to 12, the streaming synthetic audio is identical to the non-streaming synthetic audio
am_block: 72
am_pad: 12
# voc_block and voc_pad are used for streaming voc inference;
# when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes the streaming synthetic audio identical to the non-streaming audio; the minimum pad is 7, with which the streaming audio still sounds normal
# when the voc model is hifigan_csmsc, setting voc_pad to 19 makes the streaming synthetic audio identical to the non-streaming audio; with voc_pad set to 14 the streaming audio still sounds normal
voc_block: 36
voc_pad: 14



#################################################################################
# ENGINE CONFIG #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# fastspeech2_cnndecoder_csmsc_onnx support streaming am infer.
am: 'fastspeech2_cnndecoder_csmsc_onnx'
# am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
# if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
am_ckpt: # list
am_stat:
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0
am_sample_rate: 24000
am_sess_conf:
device: "cpu" # set 'gpu:id' or 'cpu'
use_trt: False
cpu_threads: 4

# voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
voc_sess_conf:
device: "cpu" # set 'gpu:id' or 'cpu'
use_trt: False
cpu_threads: 4

# others
lang: 'zh'
# am_block and am_pad are only used for streaming am inference with the fastspeech2_cnndecoder_onnx model;
# when am_pad is set to 12, the streaming synthetic audio is identical to the non-streaming synthetic audio
am_block: 72
am_pad: 12
# voc_block and voc_pad are used for streaming voc inference;
# when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes the streaming synthetic audio identical to the non-streaming audio; the minimum pad is 7, with which the streaming audio still sounds normal
# when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes the streaming synthetic audio identical to the non-streaming audio; with voc_pad set to 14 the streaming audio still sounds normal
voc_block: 36
voc_pad: 14
# voc_upsample should be the same as n_shift in the voc config.
voc_upsample: 300
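
For reference, a configuration file in this format is normally passed to the PaddleSpeech streaming server when it starts. A minimal sketch, assuming the `paddlespeech_server` CLI from the PaddleSpeech serving package is installed and the config is launched standalone:

```shell
# hedged example: start a streaming TTS server from this config
paddlespeech_server start --config_file conf/tts_online_application.yaml
```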

@@ -0,0 +1,48 @@
# This is the parameter configuration file for PaddleSpeech Serving.

#################################################################################
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8090

# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_online']
# protocol = ['websocket'] (only one can be selected).
# websocket only supports the online engine type.
protocol: 'websocket'
engine_list: ['asr_online']


#################################################################################
# ENGINE CONFIG #
#################################################################################

################################### ASR #########################################
################### speech task: asr; engine_type: online #######################
asr_online:
model_type: 'conformer_online_wenetspeech'
am_model: # the pdmodel file of am static model [optional]
am_params: # the pdiparams file of am static model [optional]
lang: 'zh'
sample_rate: 16000
cfg_path:
force_yes: True
device: 'cpu' # cpu or gpu:id
decode_method: "attention_rescoring"
continuous_decoding: True # enable continue decoding when endpoint detected
num_decoding_left_chunks: 16
am_predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config

chunk_buffer_conf:
window_n: 7 # frame
shift_n: 4 # frame
window_ms: 25 # ms
shift_ms: 10 # ms
sample_rate: 16000
sample_width: 2
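
As with the TTS config above, a streaming ASR configuration of this form can in principle be launched standalone. A minimal sketch, assuming the `paddlespeech_server` CLI is installed; the file path below is a placeholder, since the config's file name is not shown here:

```shell
# hedged example: point ASR_CONFIG at wherever this YAML is saved, then start the server
ASR_CONFIG=conf/asr_online_application.yaml   # placeholder path
paddlespeech_server start --config_file "$ASR_CONFIG"
```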