Short description

DDESE is an efficient end-to-end automatic speech recognition (ASR) engine built on DeePhi's deep learning acceleration solution of algorithm, software and hardware co-design (covering pruning, quantization, compilation and FPGA inference). We use the Baidu DeepSpeech2 framework with the LibriSpeech 1000-hour dataset for model training and compression. Users can run the test scripts both for CPU/FPGA performance comparison and for single-sentence recognition.

Features

Innovative full-stack acceleration solution for deep learning in acoustic speech recognition (ESE: Best Paper at FPGA 2017)

  • Support both unidirectional and bi-directional LSTM acceleration on FPGA for model inference
  • Support CNN layers, Fully-Connected (FC) layers, Batch Normalization layers and a variety of activation functions such as Sigmoid, Tanh and HardTanh
  • Support testing for both performance comparison of CPU/FPGA and single sentence recognition
  • Support recognition of users' own test audio (English, 16 kHz sample rate, no longer than 3 seconds)

Solution

Our solution is an algorithm, software and hardware co-design (covering pruning, quantization, compilation and FPGA inference).

Pruning reduces the model to a sparse one (15%–20% density) with little loss of accuracy. The weights and activations are then quantized to 16 bits, so the whole model is compressed by more than 10X. The compressed model is encoded in CSC (Compressed Sparse Column) format by the compiler and deployed on the Descartes platform for efficient inference on FPGA.
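
As an illustration of this flow (a minimal sketch only, not DeePhi's actual compiler; the matrix shape, pruning threshold and scale factor below are made-up values), a pruned weight matrix can be quantized to 16 bits and stored in CSC format with NumPy/SciPy:

import numpy as np
from scipy.sparse import csc_matrix

# A dense LSTM weight matrix (illustrative shape only)
w = np.random.randn(1024, 512).astype(np.float32)

# Pruning: zero out the smallest-magnitude weights, keeping ~15% density
threshold = np.percentile(np.abs(w), 85)
w[np.abs(w) < threshold] = 0.0

# Quantization: map the surviving weights to 16-bit integers
scale = np.abs(w).max() / 32767.0
w_q = np.round(w / scale).astype(np.int16)

# Compilation: store the sparse, quantized matrix in CSC format
w_csc = csc_matrix(w_q)
dense_bytes = w.size * 4
csc_bytes = w_csc.data.nbytes + w_csc.indices.nbytes + w_csc.indptr.nbytes
print("density: %.2f" % (w_csc.nnz / float(w_q.size)))
# SciPy stores 32-bit indices, so this understates the compression achievable
# with a more compact index encoding on the FPGA.
print("dense fp32 bytes: %d, CSC int16 bytes: %d" % (dense_bytes, csc_bytes))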

Performance

Our ASR system and model structure are as follows:

Our achievements with DDESE are as follows:

Considering only the LSTM layers, 2.87X and 2.56X speedup can be achieved compared to a GPU (Tesla P4 + cuDNN) for the unidirectional and bi-directional LSTM models, respectively.

Considering both the CNN and bi-directional LSTM layers for further acceleration, 2.06X speedup can be achieved compared to a GPU (Tesla P4 + cuDNN) for the whole end-to-end speech recognition process.

  • For LSTM layers only (input audio: 1 second)
  • For CNN layers + bi-directional LSTM layers (input audio: 1 second)
    Note: E2E is short for end-to-end, ACT is short for activation, WER is short for word error rate.

The details of the performance comparison for the bi-directional LSTM model are as follows:

Usage Information

We assume you are familiar with AWS F1 instances. Please refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html if you are not. You should launch and log in to the DDESE instance before the test.

Test the DDESE (DeePhi Descartes Efficient Speech Recognition Engine)

Environment Settings
# sudo bash (make sure you are in the root environment)
# source /opt/Xilinx/SDx/2017.1.rte/setup.sh (start the SDAccel platform)

# cd ASR_Accelerator/deepspeech2 (where the test tools are placed)
# source activate test_py3 (activate the Python 3.6 environment)

After the above steps are done, you are free to test the ASR process.

Test Example

The following command deploys a model on CPU and transcribes the same sentence 1000 times.

# python aws_test.py --audio_path data/middle_audio/wav/middle1.wav --single_test

The following command deploys a model on FPGA and transcribes the same sentence 1000 times.

# python aws_test.py --fpga_config deephi/config/fpga_cnnblstm_0.15.json --audio_path data/middle_audio/wav/middle1.wav --no_cpu --single_test

With the help of these tests, you can compare the performance of the same automatic speech recognition task on CPU and FPGA.
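
If you want to time the two runs yourself, a simple wall-clock comparison can be scripted around the two commands above (a rough sketch only; it assumes the environment from "Environment Settings" is already active and measures the total run time of each script, including model loading):

import subprocess
import time

cpu_cmd = ["python", "aws_test.py",
           "--audio_path", "data/middle_audio/wav/middle1.wav",
           "--single_test"]
fpga_cmd = ["python", "aws_test.py",
            "--fpga_config", "deephi/config/fpga_cnnblstm_0.15.json",
            "--audio_path", "data/middle_audio/wav/middle1.wav",
            "--no_cpu", "--single_test"]

def timed_run(cmd):
    # Run one test command and return its wall-clock time in seconds
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

t_cpu = timed_run(cpu_cmd)
t_fpga = timed_run(fpga_cmd)
print("CPU: %.1f s, FPGA: %.1f s, speedup: %.2fX" % (t_cpu, t_fpga, t_cpu / t_fpga))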

Command Description

In this part, we describe in more detail the commands that you can use to test the DeePhi_ASRAcc. You can also change some parameters according to the parameter descriptions below.

# python aws_test.py (multi-sentence test to show the performance of FPGA over CPU)

By default, this command will deploy a model on CPU and transcribe all the sentences (“.wav” format) under data/short_audio/wav/ and print the output logs.

# python transcribe.py (single-sentence test to show the accuracy of the model)

By default, this command will deploy the model on CPU and transcribe data/short_audio/wav/short_audio1.wav and print the output logs.

By default, both commands deploy the model on CPU only. You can add an FPGA configuration to deploy the model on FPGA, as shown below:

# python aws_test.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on both CPU and FPGA and run the test)

By running this command, models will be deployed on both CPU AND FPGA, and the ASR process will be tested on CPU and FPGA one by one.

# python transcribe.py --fpga_config deephi/config/fpga_bilstm_0.15.json
(deploy the model on FPGA and do the ASR)

By running this command, the model will be deployed on FPGA INSTEAD of on CPU, and the ASR process will run on FPGA.

Command Parameter Descriptions

A. for command aws_test.py:

--no_cpu
: set this parameter to skip running the ASR process on CPU

--wav_folder ROOTDIR_OF_YOUR_WAV_FILES
: set ROOTDIR_OF_YOUR_WAV_FILES to the folder where your wav files are saved; the command will then transcribe every .wav file under that folder. This parameter SHOULD NOT be used together with the --single_test parameter.

--audio_path PATH_TO_YOUR_WAV_FILE
: set PATH_TO_YOUR_WAV_FILE to the wav file that you want to transcribe; the command will then transcribe the specified sentence 1000 times. This parameter SHOULD be used together with the --single_test parameter.

--single_test
: set this parameter to run single-test mode, i.e. transcribe the same sentence 1000 times on the specified models. Otherwise all the sentences under the specified folder are transcribed once.
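
For example (an illustrative combination only, using the sample audio folder shipped under data/), the folder and FPGA options can be combined as follows:

# python aws_test.py --wav_folder data/middle_audio/wav/ --fpga_config deephi/config/fpga_bilstm_0.15.json
(transcribe every .wav file under data/middle_audio/wav/ once, on both CPU and FPGA; --wav_folder is not combined with --single_test)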

B. for command transcribe.py:

--audio_path PATH_TO_YOUR_WAV_FILE
: set PATH_TO_YOUR_WAV_FILE to the wav file that you want to transcribe.

Note: The folder named “data” contains short, middle and long audio clips.

Try Using Your Own Input

Please upload your own wav file (it must have a 16 kHz sample rate, be recorded in a clean environment and be shorter than 3 seconds). Then use the following command to transcribe the uploaded sentence:

# python transcribe.py --audio_path PATH_TO_YOUR_WAV_FILE
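
If you are not sure whether your recording meets these constraints, you can check it with Python's standard wave module before transcribing (a minimal sketch; it only verifies the sample-rate and length requirements listed above and reports the channel count for reference):

import wave

def check_wav(path):
    # Read the basic properties of the wav file
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        channels = f.getnchannels()
        duration = f.getnframes() / float(rate)
    print("sample rate: %d Hz, channels: %d, duration: %.2f s" % (rate, channels, duration))
    if rate != 16000:
        print("WARNING: the sample rate should be 16 kHz")
    if duration >= 3.0:
        print("WARNING: the audio should be shorter than 3 seconds")

check_wav("PATH_TO_YOUR_WAV_FILE")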

Contact Us

If you are interested in our work or have any problems running our solution on AWS F1, please contact us at the following email address: