Click this link to the original paper.

- Main Point

This paper proposes an HMM-free speech recognition system. It consists of an end-to-end model that outputs the conditional probability of a character sequence given the acoustic signal, and a corresponding decoding mechanism that produces the recognition result.

The end-to-end model can be divided into two parts, an encoder and a decoder. The encoder, called the listener in the paper, uses a stacked pyramidal bidirectional long short-term memory RNN (pBLSTM) to represent the original signal frames as high-level features. The decoder, called the speller in the paper, is an attention-based LSTM transducer that takes the encoder output as input and outputs the distribution over the next character, conditioned on all previous characters and the whole acoustic feature sequence.
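Written as a formula, conditioning each character on all previous characters and on the whole acoustic sequence amounts to a chain-rule factorization. The symbols here are my own notation for this note: x is the acoustic feature sequence and y is the character sequence.

```latex
P(\mathbf{y} \mid \mathbf{x}) = \prod_{i} P(y_i \mid \mathbf{x},\, y_{<i})
```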

For the last piece of the system, a left-to-right beam search generates the recognition result without any dictionary or language model, achieving a WER of 14.1% on a Google voice search task. The paper also shows that a language model (LM) can be applied to rescore the beams, which improves the WER to 10.3% on the same task. It also mentions that an LM can be incorporated in other ways, for example through an FST.
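A left-to-right beam search of this kind can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: `next_log_probs`, `EOS`, and `toy_model` are hypothetical names, and the toy model simply stands in for the speller's next-character distribution.

```python
import math

EOS = "</s>"  # hypothetical end-of-sentence token


def beam_search(next_log_probs, beam_width=2, max_len=10):
    """Return hypotheses as (characters, log probability), best first.

    next_log_probs(prefix) is assumed to return a dict mapping each
    possible next character to its log probability given the prefix.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for char, log_p in next_log_probs(prefix).items():
                hyp = (prefix + (char,), score + log_p)
                if char == EOS:
                    finished.append(hyp)    # complete hypothesis, stop extending
                else:
                    candidates.append(hyp)  # still open, may be extended
        if not candidates:
            break
        # Keep only the beam_width most probable open hypotheses.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_width]
    results = finished if finished else beams
    return sorted(results, key=lambda h: h[1], reverse=True)


def toy_model(prefix):
    """Hypothetical stand-in for the speller's character distribution."""
    if len(prefix) == 0:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if len(prefix) == 1:
        return {"b": math.log(0.7), EOS: math.log(0.3)}
    return {EOS: 0.0}


best_chars, best_log_prob = beam_search(toy_model, beam_width=2)[0]
```

In a real system the scores would come from the speller network, and the finished list is where LM rescoring could be applied.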

- End-to-end model in figures and formulas:

The basic structure is described in ‘Main Point’; a simple illustration of it can be found in Figure 1 of the paper. I expand the listener (encoder) part to try to make things clearer.

For the listener, the system uses an LSTM architecture to better capture long-range context. One possible version of the LSTM is shown in Fig.1. The listener is also bidirectional, so that it can use both past and future context. Fig.2 shows the combination of these two structures. Beyond this combination, the listener also uses a pyramidal structure to reduce the computational complexity, which is shown in Fig.3.
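The effect of the pyramid on sequence length can be sketched with NumPy. As I understand the paper, each pyramidal layer concatenates pairs of consecutive outputs from the layer below, halving the time resolution. The sketch below shows only that reshaping step and not the BLSTM itself. The function names are hypothetical, and dropping a trailing odd frame is my own assumption.

```python
import numpy as np


def pyramid_reduce(frames: np.ndarray) -> np.ndarray:
    """Concatenate each pair of consecutive frames: (T, d) -> (T // 2, 2 * d)."""
    num_frames, dim = frames.shape
    usable = num_frames - (num_frames % 2)  # assumption: drop a trailing odd frame
    return frames[:usable].reshape(usable // 2, 2 * dim)


def stacked_length(num_frames: int, num_pyramid_layers: int) -> int:
    """Number of time steps left after stacking several pyramidal layers."""
    for _ in range(num_pyramid_layers):
        num_frames //= 2
    return num_frames


# Five 2-dimensional frames become two 4-dimensional frames.
reduced = pyramid_reduce(np.arange(10).reshape(5, 2))
```

With three pyramidal layers, the speller attends over roughly one eighth as many time steps as there are input frames.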

For the speller, at each time step it calculates a context vector that summarizes the whole acoustic feature sequence (details are discussed later). The decoder state is updated as a function of the previous decoder state, the previous context and the previously emitted character. This function is computed by a 2-layer LSTM. Finally, the distribution over the currently emitted character is the softmax output of an MLP that takes the current state and context as input.
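One way to write the speller step compactly is shown below. This is my reading of the paper, and the notation is an assumption made for this note: s_i is the decoder state, c_i the context vector, y_i the emitted character, and h the listener output.

```latex
\begin{aligned}
s_i &= \mathrm{RNN}(s_{i-1},\, y_{i-1},\, c_{i-1}) \\
c_i &= \mathrm{AttentionContext}(s_i,\, \mathbf{h}) \\
P(y_i \mid \mathbf{x},\, y_{<i}) &= \mathrm{CharacterDistribution}(s_i,\, c_i)
\end{aligned}
```

Here RNN is the 2-layer LSTM and CharacterDistribution is the MLP with softmax output.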

The context vector in the speller is calculated as follows: each feature in the listener output is scored against the current decoder state, and the scores are normalized, following equations (9) and (10) in the paper. The context vector is then just the weighted sum of the features, using the normalized scores as weights.
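The score, normalize and weighted-sum steps can be sketched as follows. This is a simplified illustration, not the paper's exact model: the single linear projections `w_state` and `w_feat` are hypothetical stand-ins for whatever networks transform the decoder state and the listener features before scoring, and the dot-product score is an assumption of this sketch.

```python
import numpy as np


def attention_context(state, features, w_state, w_feat):
    """Return (context, weights) for one decoder step.

    state:    (s,)   current decoder state
    features: (U, d) listener output, one row per time step
    w_state:  (k, s) hypothetical projection of the decoder state
    w_feat:   (k, d) hypothetical projection of each listener feature
    """
    query = w_state @ state          # (k,)
    keys = features @ w_feat.T       # (U, k)
    scores = keys @ query            # (U,) one scalar score per feature
    scores = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax normalization
    context = weights @ features     # (d,) weighted sum of the features
    return context, weights
```

The weights sum to one, so the context vector always lies in the convex hull of the listener features.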

- Pipeline: