All of the following starts from this thread: https://groups.google.com/g/kaldi-help/search?q=confidence
It is possible to obtain posterior probabilities for each of the n-best sentences. If you use nbest-to-linear you can get the lm-cost and the acoustic cost for each nbest. If you scale down the acoustics, negate both of the costs to get logprobs, add the lm and acoustic costs and exponentiate, you’ll get an unnormalized probability for each element of the n-best list. You could normalize those to sum to one. [of course, you’d compute this differently to avoid overflow in the exp.]
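The normalization step above can be sketched in a few lines. This is a minimal illustration, not Kaldi code: the function names and the acoustic scale value are assumptions, and the log-sum-exp trick is the standard way to do the "compute this differently to avoid overflow in the exp" part.

```python
import math

def normalize_nbest(lm_costs, ac_costs, acoustic_scale=0.1):
    """Turn per-hypothesis LM and acoustic costs (as printed by
    nbest-to-linear; costs are negated log-probs) into posterior
    probabilities that sum to one, using log-sum-exp to avoid
    overflow/underflow in exp()."""
    # Scale down the acoustics, negate both costs, and add them
    # to get an (unnormalized) log-probability per hypothesis.
    logprobs = [-(lm + acoustic_scale * ac)
                for lm, ac in zip(lm_costs, ac_costs)]
    # Subtract the max before exponentiating so exp() stays in range.
    m = max(logprobs)
    exps = [math.exp(lp - m) for lp in logprobs]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum log-prob changes nothing after normalization, but keeps every exponent at or below zero.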
You could perhaps use the output of lattice-to-ctm-conf (or modify
code to use this internally), that has confidences. If you use a
phone-based language model for the unknown word, that will help you
catch situations where the lexicon itself doesn’t seem to match, and
an unknown word seems to be the better match; see
Look at the usage message of `lattice-confidence`:
Compute sentence-level lattice confidence measures for each lattice.
The output is simply the difference between the total costs of the best and
second-best paths in the lattice (or a very large value if the lattice
had only one path). Caution: this is not necessarily a very good confidence
measure. You almost certainly want to specify the acoustic scale.
If the input is a state-level lattice, you need to specify
--read-compact-lattice=false, or the confidences will be very small
(and wrong). You can get word-level confidence info from lattice-mbr-decode.
Likelihood, usually “acoustic likelihood”, is a probabilistic term which describes the probability of a particular observation in the model space. It is basically the probability of a certain observation in the space of the training data. Sometimes this probability is not normalized to sum to 1, which is why it is called a “likelihood” rather than a “probability”. Basically, consider all the training data and think how frequently you see something similar to the current observation. This is usually a pretty small value.
Acoustic cost is usually the same thing as the likelihood, but it refers more to the actual value computed in software, which might be adjusted by a normalization factor or rounded for faster computation. Usually you say “acoustic cost” when you discuss the values of software variables, dropping their probabilistic nature and considering only the best-path search in a graph.
Posterior probability is a measure in the posterior space. Once you have seen the observation, you can compare the outcome with all other possible outcomes and compute how probable the observed value is. This is a Bayesian term, which can be computed as the ratio between the model probability of the observation (its likelihood) and the model probabilities of all other observations (also likelihoods). This value is usually pretty high compared to the model probability, and basically tells you how certain you are about a model decision.
Since posterior probability is a certainty measure, you can call it a “confidence score”, and usually posterior probabilities are used as confidence scores. However, not all confidence scores are probabilistic. For example, you can adjust a score with an estimate of the expected time, or penalize it depending on the time of arrival, without considering its probabilistic nature. That would also be a “confidence score”, but since it is not within a probabilistic framework, it is not a “posterior” anymore. So “confidence score” is a more software-related term that generalizes posterior probability.
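The likelihood-versus-posterior distinction above can be made concrete with a toy calculation (illustrative only; the numbers are made up): each individual likelihood is tiny, but the posterior of the winning outcome, being a ratio against the sum over all outcomes, can be close to 1.

```python
def posterior(likelihoods, k):
    """Posterior of outcome k: its likelihood divided by the sum
    of the likelihoods of all competing outcomes."""
    return likelihoods[k] / sum(likelihoods)

# Three competing hypotheses with very small model likelihoods.
likelihoods = [1e-8, 1e-10, 1e-11]
# Each likelihood is tiny, yet the posterior of the best
# hypothesis is near 1 -- a usable certainty measure.
best_posterior = posterior(likelihoods, 0)
```

This is why the posterior, not the raw likelihood, is what gets used as a confidence score.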
./lattice-push ark:"gunzip -c lat.1.gz |" ark:- | ./lattice-align-words-lexicon ./align_lexicon.int ./40.mdl ark:- ark:- | ./lattice-to-ctm-conf --acoustic-scale=0.0769 --frame-shift=0.01 --print-silence=true ark:- - | ./int2sym.pl -f 5 ./words.txt 2>/dev/null
The confidences won’t always be very good, they are just derived from
the lattice posterior. They will be particularly poor if the language
model doesn’t contain a lot of short words (which could act as a kind
of filler model).
Yes, the last field is the confidence; the other fields are the channel
(normally 1, 2, A or B), the start time, and the duration.
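Putting the field descriptions above together, a line of the CTM-with-confidence output can be parsed like this (a small sketch; the field layout follows the description in this thread, and the dict keys are my own naming):

```python
def parse_ctm_conf(line):
    """Parse one line of lattice-to-ctm-conf output, assumed to be:
    <utterance-id> <channel> <start-time> <duration> <word> <confidence>"""
    utt, channel, start, dur, word, conf = line.split()
    return {
        "utt": utt,            # utterance identifier
        "channel": channel,    # normally 1, 2, A or B
        "start": float(start), # start time in seconds
        "dur": float(dur),     # duration in seconds
        "word": word,          # the recognized word
        "conf": float(conf),   # last field: the confidence
    }
```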
A good method is to take a weak system and a strong system and see how
much difference there is in the transcripts. There is a very good
correlation between that and the word error rate. (George Saon may
have a paper about this; he used GMM systems before and after fMLLR
since that was before DNNs). You can choose which type of weak system
you want, e.g. a GMM system, or one with a weak language model.
If the edit distance between the two transcripts (from the weak and the strong system) is high, then the transcripts are likely to be worse.
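The weak-versus-strong comparison above amounts to computing a word-level edit distance between the two transcripts and using the disagreement rate as a WER proxy. A minimal sketch (function names are my own; this is plain Levenshtein over words, not Kaldi's scoring):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two transcripts,
    using a single rolling DP row."""
    ref, hyp = ref.split(), hyp.split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (r != h))
    return d[len(hyp)]

def disagreement_rate(strong, weak):
    """Edit distance between the strong- and weak-system transcripts,
    normalized by the strong transcript's length. Higher disagreement
    correlates with higher word error rate."""
    return edit_distance(strong, weak) / max(1, len(strong.split()))
```

Utterances (or whole recordings) with a high disagreement rate can then be flagged as likely to have poor transcripts, without any reference text.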