kaldi-hybrid-decoder

In automatic speech recognition (ASR), a decoder is either static (based on a weighted finite-state transducer, WFST) or dynamic (based on a history-conditioned word prefix tree/graph). This project provides a unified approach within Kaldi's framework, extending its decoder to more application scenarios.

Kaldi's decoder

Kaldi contains several "versions" of its decoder, and newcomers often get confused about which one to use. Before I explain the "hybrid" decoder in this project, I think a short introduction to the existing decoders in Kaldi might be helpful.

When you look into Kaldi's code (e.g. kaldi/src/decoder/*.{h,cc}), you will find several keywords that reveal the nature of each decoder version:

"simple" decoder:

An "alpha" implementation that focuses on code simplicity. It is mainly used for comparison and debugging while developing more complex decoders.

"faster" decoder:

A more efficient hash-list data structure is used to organize the Viterbi search tokens, which speeds up the search.
More sophisticated pruning strategies: beam pruning, plus an adaptive beam that keeps the number of active tokens between a minimum and a maximum.
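
As a rough illustration of the pruning idea above, here is a minimal Python sketch (not Kaldi's actual code). The function name `prune_tokens` and the parameters `beam`, `max_active` and `min_active` are hypothetical names chosen for this example. It computes a cost cutoff for one frame: start from best cost plus beam, tighten the cutoff if too many tokens fall inside the beam, and loosen it if too few do.

```python
def prune_tokens(costs, beam, max_active, min_active=0):
    """Return (cutoff, survivors) for one frame.

    costs: dict mapping state id -> accumulated cost (lower is better).
    All names here are illustrative assumptions, not Kaldi identifiers.
    """
    if not costs:
        return float("inf"), {}
    sorted_costs = sorted(costs.values())
    cutoff = sorted_costs[0] + beam  # plain beam pruning

    if max_active < len(sorted_costs) and sorted_costs[max_active - 1] < cutoff:
        # Too many tokens inside the beam: shrink the beam adaptively.
        cutoff = sorted_costs[max_active - 1]
    elif min_active > 0:
        # Too few tokens inside the beam: widen it to keep min_active tokens.
        idx = min(min_active, len(sorted_costs)) - 1
        if sorted_costs[idx] > cutoff:
            cutoff = sorted_costs[idx]

    # Tokens that tie exactly at the cutoff all survive.
    survivors = {s: c for s, c in costs.items() if c <= cutoff}
    return cutoff, survivors
```

For example, with costs `{0: 1.0, 1: 2.0, 2: 3.0, 3: 10.0}` and `beam=5.0`, a large `max_active` keeps states 0, 1 and 2, while `max_active=2` tightens the cutoff so that only states 0 and 1 survive.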

"latgen" decoder:

Normally a decoder only needs to keep a backpointer in each token, so that it can trace back from the best token at the end to get the ASR result. However, in some cases the entire searched space contains rich information for later post-processing, so lattice generation is basically a must-have feature for a serious decoder implementation.
The latgen version keeps a record of all nodes and arcs traversed during the Viterbi search, as a "graph". In the end a pruned, dense "lattice" can be generated from that graph, and the best ASR result is obtained by running a "best-path" algorithm over the saved lattice.
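
To make the "best-path over a saved lattice" step concrete, here is a small Python sketch under simplifying assumptions: the lattice is a plain list of arcs `(src, dst, word, cost)`, node ids are already topologically ordered (`src < dst`), and `""` stands for an arc that emits no word. The function `best_path` and this data layout are made up for illustration and do not reflect Kaldi's lattice types.

```python
def best_path(arcs, start, finals):
    """Return (total_cost, words) of the cheapest path through a toy lattice.

    arcs:   list of (src, dst, word, cost) with src < dst.
    finals: dict mapping final node -> final cost.
    Assumes at least one final node is reachable from start.
    """
    # node -> (best cost so far, (previous node, word) or None for the start)
    best = {start: (0.0, None)}
    # Visiting arcs by increasing src guarantees best[src] is already final.
    for src, dst, word, cost in sorted(arcs, key=lambda a: a[0]):
        if src not in best:
            continue  # src is unreachable from the start node
        cand = best[src][0] + cost
        if dst not in best or cand < best[dst][0]:
            best[dst] = (cand, (src, word))

    end = min((n for n in finals if n in best),
              key=lambda n: best[n][0] + finals[n])
    total = best[end][0] + finals[end]

    # Trace back through the stored predecessors to recover the words.
    words = []
    node = end
    while best[node][1] is not None:
        prev, word = best[node][1]
        if word:
            words.append(word)
        node = prev
    words.reverse()
    return total, words
```

Because every arc is kept, the same structure can also serve later post-processing such as rescoring, which a single backpointer chain cannot.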

"online" decoder:

Sometimes a decoder is required to output partial recognition results in the middle of decoding. This process has to be lightweight, but running the "best-path" algorithm over a lattice is not. So in the online version, in addition to the searched space, an extra backpointer is maintained in each token to enable fast trace-back.
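
The backpointer idea can be sketched in a few lines of Python. This is only an illustration: the `Token` class, its fields, and `partial_result` are hypothetical and much simpler than Kaldi's token structures. Each token remembers its predecessor, so a partial result is just a walk back from the currently best token, with no lattice processing involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    cost: float               # accumulated cost, lower is better
    word: str                 # word emitted on the arc into this token, "" if none
    prev: Optional["Token"]   # backpointer to the predecessor token


def partial_result(active_tokens: List[Token]) -> List[str]:
    """Trace back from the best active token and return the words so far."""
    tok: Optional[Token] = min(active_tokens, key=lambda t: t.cost)
    words = []
    while tok is not None:
        if tok.word:
            words.append(tok.word)
        tok = tok.prev
    words.reverse()
    return words
```

The cost of a partial result is linear in the number of frames decoded so far, which is why it is cheap enough to call repeatedly during decoding.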

"biglm" decoder:

A static WFST decoder has its limitations: optimizing a large WFST consumes an extremely large amount of memory and is very slow, so it is basically not possible to apply a large language model (LM) in Kaldi's decoder directly.
The solution is to use a small WFST to generate lattices and then rescore them with the large language model in post-processing; this is standard practice in the Kaldi community. However, post-processing introduces latency, which is undesirable in many applications.
The "biglm" version uses on-the-fly composition to dynamically bind a large language model during decoding, which is very similar to a dynamic decoder. Each Viterbi state is a pair of states: a state index in the "first-pass" small WFST and a state in the larger LM. With this approach users no longer need to incorporate the large LM into the WFST, which not only makes WFST building and optimization fast, but also reduces latency compared with lattice rescoring by the large LM in post-processing.
However, on-the-fly composition not only introduces dead-end states in the virtual "biglm-composed" WFST, but also results in an unnecessarily large search space, which is detrimental to the decoder's real-time factor (RTF). Kaldi's biglm decoder is slow.
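
The pair-state idea can be sketched as follows. This is a toy Python model, not Kaldi's implementation: `expand`, `graph_arcs` and `lm_arcs` are hypothetical names, the decoding graph and the LM are plain dictionaries, and `""` stands for an arc without a word label. A composed state is a pair `(graph_state, lm_state)`, and the LM is only consulted when an arc actually emits a word.

```python
def expand(pair, graph_arcs, lm_arcs):
    """Return the successors of one composed state as (next_pair, cost, word).

    pair:       (graph_state, lm_state)
    graph_arcs: dict graph_state -> list of (word, cost, next_graph_state)
    lm_arcs:    dict (lm_state, word) -> (lm_cost, next_lm_state)
    """
    graph_state, lm_state = pair
    successors = []
    for word, cost, next_graph_state in graph_arcs.get(graph_state, []):
        if word == "":
            # No word on this arc: the LM state is carried along unchanged.
            successors.append(((next_graph_state, lm_state), cost, word))
        elif (lm_state, word) in lm_arcs:
            # Word arc: advance the LM and add its cost on the fly.
            lm_cost, next_lm_state = lm_arcs[(lm_state, word)]
            successors.append(((next_graph_state, next_lm_state),
                               cost + lm_cost, word))
        # Otherwise the LM has no matching arc and this path simply stops,
        # which is how dead ends appear in the virtual composed graph.
    return successors
```

In this toy model the same graph state reached with two different LM states yields two distinct composed states, which gives a feel for how the search space grows compared with decoding on the small WFST alone.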