ARPA and Grammar FST in Kaldi

1 What is ARPA?

ARPA originated from MIT and is a text format for representing an n-gram back-off language model. Although it is not as efficient as binary formats, it is well accepted by popular speech toolkits such as HTK, Sphinx, and Kaldi. Thus, it is commonly used as a way to save an LM, which is later converted into whatever format best suits a specific platform.

In this post, I will first introduce ARPA in more detail and then talk about the recipe Kaldi uses to convert an ARPA into an FST (Finite State Transducer).

2 What is ARPA format?

Below is an illustration of a fake ARPA file (HelloWorld.arpa) that will be used throughout this post. (If needed, you can zoom in on the web page to see the text on the graph more clearly.) You can also learn about ARPA [here].

The ARPA file was first created using [mit-lm], with a training corpus of just one sentence, “hello world”. The ARPA shown below is actually a modified version in which the probabilities were manually picked, so that you will see a pretty graph in the end.

[Figure: HelloWorld.arpa with illustration (1.png)]
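
In case the picture is hard to read, the general layout of an ARPA file is standard: a \data\ header listing how many n-grams of each order the model contains, then one \N-grams: section per order in which each line holds a log10 probability, the n-gram itself, and (optionally) a log10 back-off weight, followed by an \end\ marker. Schematically (this shows the layout only, not the exact contents of HelloWorld.arpa):

    \data\
    ngram 1=<count>
    ngram 2=<count>
    ngram 3=<count>

    \1-grams:
    <log10 prob>   <word>   <log10 back-off weight>
    ...

    \2-grams:
    <log10 prob>   <word1 word2>   <log10 back-off weight>
    ...

    \3-grams:
    <log10 prob>   <word1 word2 word3>
    ...

    \end\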

3 How to use ARPA to calculate LM score on a given sentence?

Given a sentence W_1, W_2, ..., W_k, the LM score is P(W_1, W_2, ..., W_k).

We will break down this probability based on three rules:

  • We can predict any word based on its previous n-1 words, where n is the order of the n-gram model.
  • Any sentence starts with <s> and ends with </s>.
  • If a particular combination is not found in the ARPA, its probability can be calculated from the language model as follows:
P( W_M | W_{M-N+1},..., W_{M-2}, W_{M-1}) =
P( W_M | W_{M-N+2},..., W_{M-2}, W_{M-1}) * backoff-weight( W_{M-1} | W_{M-N+1}, ...., W_{M-2} )

If the backoff-weight is also not present in the ARPA file, it is taken to be 1 (log10 backoff-weight of 0). A small Python sketch of this back-off recursion is shown right below.
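
To make the recursion concrete, here is a minimal Python sketch of the back-off lookup and of the sentence-level score. The dictionaries ngram_logprob and ngram_backoff are hypothetical stand-ins for the entries of an ARPA file (keyed by the n-gram as a tuple of words); they are not something defined in the original post.

    # Hypothetical containers for an ARPA file's entries, e.g.
    #   ngram_logprob[("<s>", "hello")] -> log10 P(hello | <s>)
    #   ngram_backoff[("<s>",)]         -> log10 back-off weight of "<s>"
    ngram_logprob = {}
    ngram_backoff = {}

    def log10_prob(words):
        """log10 P(w_M | w_{M-N+1} ... w_{M-1}) with recursive back-off.

        `words` is the whole n-gram (context words + predicted word) as a tuple.
        """
        if words in ngram_logprob:          # the n-gram is listed directly
            return ngram_logprob[words]
        if len(words) == 1:                 # unseen unigram: nothing left to back off to
            return float("-inf")
        # Back-off weight of the context; a missing weight means 1 (log10 = 0).
        bow = ngram_backoff.get(words[:-1], 0.0)
        # Drop the oldest context word and retry with the shorter n-gram.
        return bow + log10_prob(words[1:])

    def sentence_log10_score(sentence, order):
        """log10 P(sentence), with <s> and </s> added, under an n-gram model."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for i in range(1, len(words)):
            start = max(0, i - order + 1)
            total += log10_prob(tuple(words[start:i + 1]))
        return total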

Based on the above rules:

If n = 2 (bigram):

P(W_1, W_2, ..., W_k) =
P(W_1 | <s>) * P(W_2 | W_1) * P(W_3 | W_2) * ... * P(W_k | W_{k-1}) * P(</s> | W_k)

If n = 3 (trigram):

P(W_1, W_2, ..., W_k) =
P(W_1 | <s>) * P(W_2 | <s>, W_1) * P(W_3 | W_1, W_2) * ... * P(W_k | W_{k-2}, W_{k-1}) * P(</s> | W_{k-1}, W_k)

Now, let me show you two real examples of this calculation using the HelloWorld.arpa above.

#1 Sentence “hello world”

P(hello, world) = P(hello | <s>) * P(world | <s>, hello) * P(</s> | hello, world)
                = 10^(-0.217147241 + -0.086858897 + -0.043429448)
                = 10^(-0.347435586)
                = 0.449329

In this case, all the probabilities already exist in the ARPA file, so the calculation is straightforward. A more complex example is shown below.

#2 Sentence “world hello”

P(world, hello) = P(world | <s>) * P(hello | <s>, world) * P(</s> | world, hello)

In this case, none of the three probabilities appears in the ARPA file. We will use the back-off formula to calculate them:

P(world | <s>)         = P(world) * backoff-weight(<s>)
                       = 10^(-0.26057669 + 0.26057669)
                       = 10^(0)
                       = 1
P(hello | <s>, world)  = P(hello | world) * backoff-weight(world | <s>)
                       = P(hello) * backoff-weight(world) * backoff-weight(world | <s>)
                       = 10^(-0.304006138 + 0.086858897 + 0)
                       = 10^(-0.217147241)
P(</s> | world, hello) = P(</s> | hello) * backoff-weight(hello | world)
                       = P(</s>) * backoff-weight(hello) * backoff-weight(hello | world)
                       = 10^(-0.108573621 + 0.043429448 + 0)
                       = 10^(-0.065144173)

Finally, having solved those three probabilities, we can get the LM score:

P(world, hello) = P(world | <s>) * P(hello | <s>, world) * P(</s> | world, hello)
                = 10^(0 + -0.217147241 + -0.065144173)
                = 10^(-0.282291414)
                = 0.522046
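
If you plug the log10 values quoted above into the Python sketch from the rules in section 3, both examples are reproduced. The dictionaries below contain only the entries actually cited in this post, not the full HelloWorld.arpa:

    ngram_logprob = {
        ("world",): -0.26057669,
        ("hello",): -0.304006138,
        ("</s>",): -0.108573621,
        ("<s>", "hello"): -0.217147241,
        ("<s>", "hello", "world"): -0.086858897,
        ("hello", "world", "</s>"): -0.043429448,
    }
    ngram_backoff = {
        ("<s>",): 0.26057669,
        ("world",): 0.086858897,
        ("hello",): 0.043429448,
    }

    print(10 ** sentence_log10_score("hello world", 3))  # ~0.449329
    print(10 ** sentence_log10_score("world hello", 3))  # ~0.522046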

Please do not read anything into the numbers themselves, because all the numbers in this ARPA are fake.

By now, you should understand what is included in an ARPA file and how to use it to calculate an LM score. If you are interested in how to train an ARPA model, you can start from [here] or [here].

4 How to convert ARPA into FST?

Nowadays, it is very common to convert an ARPA LM into a WFST (weighted finite-state transducer), because many speech recognition engines decode over a WFST built by the HCLG recipe, where “G” stands for grammar (the language model). Since many speech engines rely on the Kaldi toolkit, I will introduce Kaldi's recipe for converting an ARPA into an FST.

The code implementation can be found [here]. FST information can be found [here].

A description and illustration are shown in the slides below.
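
For reference, Kaldi recipes typically perform this conversion with the arpa2fst binary. The call below is only a sketch: the file names (lm.arpa, words.txt, G.fst) are placeholders of mine, and it assumes a compiled Kaldi with arpa2fst on your PATH.

    import subprocess

    # Roughly how G.fst is built from an ARPA LM in Kaldi recipes.
    subprocess.run(
        [
            "arpa2fst",
            "--disambig-symbol=#0",           # special symbol used on back-off arcs
            "--read-symbol-table=words.txt",  # word symbol table of the lang directory
            "lm.arpa",                        # input ARPA language model
            "G.fst",                          # output grammar FST
        ],
        check=True,
    )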

5 How to use FST to calculate LM score on a given sentence?
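
In short, the idea (a rough sketch, not the full recipe) is to represent the sentence as a linear FST over the same word symbol table, compose it with G.fst, and read off the total weight of the matching path; the sentence-begin and sentence-end probabilities are handled by G's start state and final weights rather than by explicit <s> and </s> arcs. Note that the weights in G are negated natural logarithms of the ARPA probabilities, and that the back-off arcs carry either epsilon or the special #0 symbol depending on how G was built, so the score you read off matches the hand calculation in section 3 only after converting between log bases.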

Now you should be able to convert an ARPA into an FST by yourself. I hope you find this post helpful. Let me know if I wrote anything wrong above or if anything is not clear enough.

6 References

[1] ARPA Language models - SPHINX

[2] The ARPA-MIT LM format - HTK

[3] Kaldi recipe - code

[4] Kaldi Documentation - Preparing the grammar G
