All starting from here https://groups.google.com/g/kaldi-help/search?q=confidence

Topic 1: sentence confidence and nbest hypothesis

2016-10-20

It is possible to obtain posterior probabilities for each of the n-best sentences. If you use nbest-to-linear you can get the LM cost and the acoustic cost for each n-best entry. If you scale down the acoustic costs, negate both costs to get log-probabilities, add the LM and acoustic log-probabilities, and exponentiate, you’ll get an unnormalized probability for each element of the n-best list. You could then normalize those to sum to one. [Of course, you’d compute this differently to avoid overflow in the exp.]
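The recipe above can be sketched in a few lines of Python. This is only an illustration, not Kaldi code: the function name `nbest_posteriors` and the default acoustic scale of 0.1 are assumptions, and the costs are expected to have been extracted already (for example from the nbest-to-linear output). Subtracting the maximum log-probability before exponentiating is one common way to avoid numerical trouble in the exp.

```python
import math


def nbest_posteriors(lm_costs, ac_costs, acoustic_scale=0.1):
    """Return normalized posteriors for an n-best list.

    lm_costs, ac_costs: per-hypothesis costs (negated log-probabilities).
    acoustic_scale: illustrative default; use the scale you decode with.
    """
    # Scale down the acoustics, then negate the summed cost to get a log-prob.
    logprobs = [-(lm + acoustic_scale * ac)
                for lm, ac in zip(lm_costs, ac_costs)]
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    best = max(logprobs)
    unnormalized = [math.exp(lp - best) for lp in logprobs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Made-up costs for a 3-best list.
    print(nbest_posteriors([10.0, 12.0, 11.0], [1000.0, 1010.0, 1030.0]))
```

The shift by the maximum does not change the result, because the same factor cancels in the normalization.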

2017-7-24

You could perhaps use the output of lattice-to-ctm-conf (or modify code to use this internally), that has confidences. If you use a phone-based language model for the unknown word, that will help you catch situations where the lexicon itself doesn’t seem to match, and an unknown word seems to be the better match; see tedlium/s5_r2/local/run_unk_model.sh.

2020-5-8

Look at the usage message of `lattice-confidence`:

Compute sentence-level lattice confidence measures for each lattice. The output is simply the difference between the total costs of the best and second-best paths in the lattice (or a very large value if the lattice had only one path). Caution: this is not necessarily a very good confidence measure. You almost certainly want to specify the acoustic scale. If the input is a state-level lattice, you need to specify --read-compact-lattice=false, or the confidences will be very small (and wrong). You can get word-level confidence info from lattice-mbr-decode.
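To make the measure concrete, here is a minimal Python sketch of the same idea applied to a list of path costs. It is not the `lattice-confidence` implementation: the function name `best_path_margin`, the input format, and the placeholder `large_value` are all assumptions made for illustration.

```python
def best_path_margin(lm_costs, ac_costs, acoustic_scale, large_value=1.0e10):
    """Difference between the second-best and best total path costs.

    lm_costs, ac_costs: per-path costs (hypothetical input, one entry per path).
    large_value: placeholder returned when there is only one path; the actual
    value used by the Kaldi tool is not assumed here.
    """
    # Total cost of each path, with the acoustic part scaled.
    totals = sorted(lm + acoustic_scale * ac
                    for lm, ac in zip(lm_costs, ac_costs))
    if len(totals) < 2:
        return large_value
    # A larger margin means the best path is further ahead of its competitor.
    return totals[1] - totals[0]
```

Because the margin depends directly on the scaled acoustic cost, leaving the acoustic scale at an unsuitable value changes the number substantially, which is consistent with the caution in the usage message.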

2017-8-21 nshm…@gmail.com

Likelihood, usually “acoustic likelihood”, is a probabilistic term that describes the probability of a particular observation under the model. It is basically the probability of a certain observation in the space of the training data. Sometimes this probability is not normalized to 1, which is why it is called “likelihood” rather than “probability”. Basically, consider all the training data and think about how frequently you see something similar to the current observation. This is usually a pretty small value.

Acoustic cost usually refers to the same quantity as “likelihood” (in Kaldi, a cost is a negated log-likelihood), but it relates more to the actual value computed in software, which might be adjusted by a normalization factor or somehow rounded for faster computation. Usually you say “acoustic cost” when you discuss the values of software variables, setting aside their probabilistic nature and considering only the best-path search in a graph.

Posterior probability is a measure in posterior space. Once you have seen the observation, you can compare the outcome with all other possible outcomes and compute how probable the observed value is. This is a Bayesian term; it can be computed as the ratio between the model probability of the observation under the chosen outcome (a likelihood) and the total model probability over all possible outcomes (also likelihoods). This value is usually pretty high compared to the model probability and basically tells you how certain you are of a model decision.
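In standard Bayes notation this ratio can be written as follows. The symbols are assumptions added for illustration and do not appear in the original post: W is a candidate word sequence, O the observed audio, p(O | W) the acoustic likelihood, and P(W) the language model probability.

```latex
P(W \mid O) = \frac{p(O \mid W)\, P(W)}{\sum_{W'} p(O \mid W')\, P(W')}
```

In a lattice-based system the sum in the denominator runs only over the paths kept in the lattice, so the result is an approximation of the true posterior.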

Since posterior probability is a certainty measure, you can call it a “confidence score”, and posterior probabilities are usually used as confidence scores. However, not all confidence scores are probabilistic. For example, you can adjust the score with an estimate of expected time, or somehow penalize it depending on the time of arrival, without regard to its probabilistic nature. That would also be a “confidence score”, but since it is not within the probabilistic framework it is not a “posterior” anymore. So “confidence score” is a more software-related term that generalizes posterior probability.

./lattice-push ark:"gunzip -c lat.1.gz |" ark:- | ./lattice-align-words-lexicon ./align_lexicon.int ./40.mdl ark:- ark:- | ./lattice-to-ctm-conf --acoustic-scale=0.0769 --frame-shift=0.01 --print-silence=true ark:- - | ./int2sym.pl -f 5 ./words.txt 2>/dev/null

The confidences won’t always be very good, they are just derived from the lattice posterior. They will be particularly poor if the language model doesn’t contain a lot of short words (which could act as a kind of filler model).

Yes, the last field is the confidence; the fields in red are “channel” (normally 1, 2, A or B), start-time, duration.
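As a rough illustration of reading that output, the sketch below parses CTM lines with a trailing confidence field and picks out low-confidence words. It assumes the six-field layout utterance-id, channel, start-time, duration, word, confidence; the names `CtmWord`, `parse_ctm_line`, and `low_confidence_words`, the threshold of 0.5, and the sample lines are all made up for the example.

```python
from typing import List, NamedTuple


class CtmWord(NamedTuple):
    utt: str
    channel: str
    start: float
    duration: float
    word: str
    confidence: float


def parse_ctm_line(line: str) -> CtmWord:
    """Parse one 'utt channel start duration word confidence' line."""
    utt, channel, start, duration, word, confidence = line.split()
    return CtmWord(utt, channel, float(start), float(duration),
                   word, float(confidence))


def low_confidence_words(lines: List[str], threshold: float = 0.5) -> List[CtmWord]:
    """Return words whose confidence is below an (arbitrary) threshold."""
    words = [parse_ctm_line(l) for l in lines if l.strip()]
    return [w for w in words if w.confidence < threshold]


if __name__ == "__main__":
    # Made-up sample lines in the assumed layout.
    sample = [
        "utt1 1 0.25 0.30 hello 0.98",
        "utt1 1 0.55 0.40 world 0.41",
    ]
    for w in low_confidence_words(sample):
        print(w.word, w.start, w.confidence)
```

Given the caveat above about lattice posteriors, any threshold would need to be tuned on held-out data rather than taken at face value.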

inferring quality of transcription from confidence scores (or other indicators…)

A good method is to take a weak system and a strong system and see how much difference there is in the transcripts. There is a very good correlation between that and the word error rate. (George Saon may have a paper about this; he used GMM systems before and after fMLLR since that was before DNNs). You can choose which type of weak system you want, e.g. a GMM system, or one with a weak language model.

If the edit distance between the two transcripts (from the weak and the strong system) is high, the transcript is likely to be worse, i.e. to have a higher word error rate.
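A minimal sketch of that comparison, assuming both transcripts are plain whitespace-separated strings: it computes a word-level Levenshtein distance and divides by the length of the strong system's transcript. The function names and the choice of normalizer are assumptions for illustration, not something prescribed in the thread.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists (sub/ins/del cost 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (wa != wb)))  # substitution or match
        prev = cur
    return prev[-1]


def disagreement_rate(strong_transcript, weak_transcript):
    """Edit distance between the two outputs, per word of the strong output."""
    strong = strong_transcript.split()
    weak = weak_transcript.split()
    return word_edit_distance(strong, weak) / max(len(strong), 1)


if __name__ == "__main__":
    # Made-up transcripts from a hypothetical strong and weak system.
    print(disagreement_rate("the cat sat", "the bat sat down"))
```

Utterances with a high disagreement rate could then be flagged as likely to contain more errors, without needing a reference transcript.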