(This is under construction.)
| GENERATING LATTICES AND N-BEST LISTS, AND SOME FACTS | 
Generating lattices:
Lattices can be generated by including the flag
    -outlatdir [directory in which you want to write lattices]
in the arguments of the s3decode binary. For each utterance, the decoder then writes a lattice to the directory you have specified. The lattice is named utteranceid.lat.gz, and the contents of the file can be viewed by running "zcat filename" from the command line on a Unix machine.
If the utterance names in the ctl file include directory names too, you have the option of including or excluding them from the lattice filenames by including or excluding, respectively, the string ,CTL in the argument you give to -outlatdir. This string is appended directly after the argument, without a space. Thus if the argument you give is
    -outlatdir current
the extended argument would be
    -outlatdir current,CTL
Generating N-best lists from lattices:
N-best lists can be generated from the lattices by using the binary s3astar. It works just like the decoder: it takes the same control file as the decoder, and the inlatdir is the same as the outlatdir that the decoder used. You additionally need to provide an nbestdir where the N-best files are written. The number of hypotheses in any N-best list can be specified using the -nbest argument. The default value is 200, but note that asking for 200 hypotheses does not mean that you will get 200: if the lattice holds fewer than 200 possible hypotheses, you will get fewer. The N-best files look like matchseg outputs.
Example of a lattice and explanation of format:
The lattice has three distinct sections. The first lists all the nodes in the graph with their associated words and their begin and end times. The second lists the acoustic scores associated with each of the nodes. The final section lists the score associated with the edge between any two words. The lattice also has additional lines of information giving the total number of nodes in the graph, the ids of the first and last nodes, and text describing the format of the lines in the lattice. In addition, the lattice may contain lines that begin with a "#". These are comments.
Here are examples from each of the components of the lattice. Explanations are interspersed.
    .....
    Nodes 1949 (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
    0 ++GARBAGE++ 254 256 256
    1 ++LAUGH++ 254 256 256
    2 ++N++ 254 256 256
    3 ++GARBAGE++ 253 255 255
    4 ++LAUGH++ 253 255 255
    5 ++N++ 253 255 255
    6 ++GARBAGE++ 252 254 254
    7 ++LAUGH++ 252 254 254
    8 ++N++ 252 254 254
    9 ++GARBAGE++ 251 253 253
    10 ++N++ 251 253 253
    11 A 251 253 253
    12 ++GARBAGE++ 250 252 252
    13 ++N++ 250 252 252
    14 A 250 252 252
    15 ++N++ 249 251 251
    16 HAVE 245 250 253
    17 ARE(2) 245 250 250
    18 HAVE 244 249 249
    19 GO 244 249 251
    .....
Node no. 16 is the word HAVE; it begins on the 245th frame and can end anywhere between the 250th and 253rd frames.
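As an illustration of the node format described above, here is a minimal sketch that reads one line of the Nodes section. The struct LatNode and the functions parse_node_line and node_can_end_at are names made up for this example; they are not part of the Sphinx sources.

```c
#include <stdio.h>

/* One record from the Nodes section (hypothetical type for this sketch). */
typedef struct {
    int  id;           /* NODEID */
    char word[64];     /* WORD, e.g. "HAVE" or "ARE(2)" */
    int  start_frame;  /* STARTFRAME */
    int  first_end;    /* FIRST-ENDFRAME */
    int  last_end;     /* LAST-ENDFRAME */
} LatNode;

/* Parse "NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME".
 * Returns 1 on success, 0 if the line does not have that shape
 * (for example a "#" comment line). */
int parse_node_line(const char *line, LatNode *n)
{
    return sscanf(line, "%d %63s %d %d %d",
                  &n->id, n->word, &n->start_frame,
                  &n->first_end, &n->last_end) == 5;
}

/* A node may end at any frame between its first and last end frame. */
int node_can_end_at(const LatNode *n, int frame)
{
    return frame >= n->first_end && frame <= n->last_end;
}
```

For node 16 above, node_can_end_at would accept frames 250 through 253 and reject everything else.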
    .....
    #
    Initial 1948
    Final 82
    .....
Nodes are written out in *reverse* order in the lattice. As a result, the node that is written out last is actually the *first* node in the lattice. Nodes are also not written in strictly reverse sequential order: because of the "stretch" in the ending frames of different nodes, it is difficult to determine a precise sequence for all but the first node. As a result, in this lattice, the first node was node number 1948 (the one written out last), but the last node was actually node 82.
    .....
    #
    BestSegAscr 13865 (NODEID ENDFRAME ASCORE)
    1948 2 -172014
    1948 3 -207858
    1948 4 -220188
    1947 5 -351673
    .....
While a node can end at many different frames, the acoustic score associated with the node when it ends at a particular frame will be different from that associated with it when it ends at a different frame. This portion of the lattice shows this information. In this example, when node number 1948 ends at frame 2 it has an acoustic score of -172014, when it ends at frame 3 the acoustic score is -207858, and so on. Note, however, that this acoustic score is only the best score and is not really useful: because of cross-word triphones, the true score for the node depends on the path being considered.
    .....
    #
    Edges (FROM-NODEID TO-NODEID ASCORE)
    33 23 -243293
    33 20 -297751
    35 23 -1599007
    37 23 -1923161
    .....
The true acoustic score for any word depends on the word following it in the path. We therefore associate this score with the *edge* leading from that word to the following word. There can be many edges leading out of a node, even at a given frame, and each of these edges is likely to have a different score from the others. The portion of the lattice above tells us that the edge from node 33 to node 23 has the score -243293, the edge from node 33 to node 20 has the score -297751, and so on. Keep in mind that there can be only one edge between any two nodes, even though a node can end at many different frames. This is because only one of these possible ending frames will permit a proper edge to the unique starting frame of the next word.
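Since there is at most one edge between any two nodes, an edge score can be looked up by its (from, to) pair. The following is a minimal sketch of reading edge lines and doing that lookup with a plain linear search. LatEdge, parse_edge_line and edge_score are hypothetical names used only for this illustration.

```c
#include <stdio.h>

/* One record from the Edges section (hypothetical type for this sketch). */
typedef struct {
    int  from;    /* FROM-NODEID */
    int  to;      /* TO-NODEID */
    long ascore;  /* ASCORE of the FROM word given this successor */
} LatEdge;

/* Parse "FROM-NODEID TO-NODEID ASCORE". Returns 1 on success, 0 otherwise. */
int parse_edge_line(const char *line, LatEdge *e)
{
    return sscanf(line, "%d %d %ld", &e->from, &e->to, &e->ascore) == 3;
}

/* Find the single edge from -> to. On success store its score in *score
 * and return 1; return 0 if the two nodes are not connected. */
int edge_score(const LatEdge *edges, int n_edges, int from, int to, long *score)
{
    for (int i = 0; i < n_edges; i++) {
        if (edges[i].from == from && edges[i].to == to) {
            *score = edges[i].ascore;
            return 1;
        }
    }
    return 0;
}
```

The acoustic score of a path through the lattice could then be accumulated by summing edge_score over consecutive node pairs. A real tool would likely index edges by source node rather than scan the whole list.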
    .....
    1948 1440 -2083713
    1948 1399 -220188
    End
    .....
The lattice ends here.
Note also that a lattice is actually a *tree*, so the left context of any node is fixed. Since any node in a tree can have only one predecessor, the variations in the acoustic scores of words are due only to the right contexts. However, what the sphinx3 decoder writes out is not a lattice but a DAG, or directed acyclic graph: nodes representing the same word in the lattice are merged if they have identical time stamps. What you see in the "lattice" file is actually this DAG and not a tree-structured lattice at all.
An important consideration in combining lattices from different sources:
If you had two parallel paths of this kind:
    ......> WORD1 ------> WORD2
    ......> WORD1 ------> WORD2
(WORD1 = "and", say, and WORD2 = "the")
you CAN merge them into
    .....> WORD1 ------> WORD2
*if* you are using CI models! In that case the two parallel edges (the dashed edges) would have had close to identical scores, so you could just take the highest score. But if you are using CD models, here is what will happen: the edges from path1 and path2 will have *different* scores in the lattice *even* if both WORD1 and WORD2 begin and end at exactly the same time instants in both cases. This is because the word preceding WORD1 would have been different in the two cases, so the cross-word triphone score of the first phone in the word would have been different. For example:
    OK......> WORD1(and) ------> WORD2(the)
    BIT......> WORD1(and) ------> WORD2(the)
The word preceding "and" in the first path is "ok", so the cross-word triphone at the beginning of "and" in the first path is A(EY,N). The preceding word in path 2 is "bit", so the cross-word triphone at the beginning of "and" is A(T,N). The score of the edge between "and" and "the" reflects this and is therefore different in the two paths. All this holds even when you are working with only a *single* lattice (e.g. the MFC lattice). Any heuristic, such as using the highest score for the merged path (nodes/edges), is likely to backfire for this reason (though this would have to be tested experimentally). If path1 is (say) from an MFC-based lattice and path2 is (say) from a PLP-based one, the problem is compounded by the additional problem of how to come up with correct scaling factors for the scores.
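For the CI-model case only, the "take the highest score" merge mentioned above could be sketched as follows. This assumes the edges from both sources have already been mapped onto a common set of node ids and that their scores are directly comparable, which, as just explained, is generally not true for CD models or for lattices from different front ends. LatEdge and merge_parallel_edges are hypothetical names.

```c
/* An edge with its acoustic score (hypothetical type for this sketch). */
typedef struct {
    int  from;
    int  to;
    long ascore;
} LatEdge;

/* Collapse parallel edges (same from and to) in place, keeping the
 * highest score for each pair. Returns the number of edges that remain.
 * Quadratic in the number of edges; written for clarity, not speed. */
int merge_parallel_edges(LatEdge *edges, int n_edges)
{
    int m = 0;  /* number of distinct edges kept so far */
    for (int i = 0; i < n_edges; i++) {
        int k;
        for (k = 0; k < m; k++) {
            if (edges[k].from == edges[i].from && edges[k].to == edges[i].to)
                break;
        }
        if (k == m) {
            edges[m++] = edges[i];             /* first time we see this pair */
        } else if (edges[i].ascore > edges[k].ascore) {
            edges[k].ascore = edges[i].ascore; /* keep the better score */
        }
    }
    return m;
}
```

Log-likelihood scores are negative, so "highest" here means closest to zero.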
| EXPLANATION OF VARIOUS FIELDS IN AN ARPA FORMAT LM | 
    \data\
    ngram 1=NUM1
    ngram 2=NUM2
    ngram 3=NUM3
This means that there are NUM1 unigrams, NUM2 bigrams and NUM3 trigrams in the LM. Then you have a line
    \1-grams:
This means that all following lines are unigrams until you encounter a line "\2-grams:" or an "\end\" marker. The \end\ marker marks the end of the ARPA LM. All unigrams have the form
    NUMa WORD NUMb
NUMa is the log probability of the unigram for the word WORD. NUMb is the back-off weight associated with that word. Bigram entries may be
    NUMa WORD1 WORD2 NUMb
or
    NUMa WORD1 WORD2
The first form of entry is used when the LM also has trigrams. If it is only a bigram LM, the entries will be of the second form. Here, NUMa is the log probability of the bigram P(WORD2 | WORD1) and NUMb is the back-off weight for the word pair (WORD1 WORD2). The general N-gram entry is of the form
    NUMa WORD1 WORD2 ... WORDN NUMb
or, if the LM is only an N-gram model (so that N is its highest order),
    NUMa WORD1 WORD2 ... WORDN
All logarithms are base 10.
To prune the LM you can delete all N-gram entries for which the difference between the probability entry for that N-gram and the probability predicted for it by backing off is very small. The predicted probability for trigrams is
    P(C|A,B) ~ P(C|B) * backoffwt(A,B)
and for bigrams
    P(C|B) ~ P(C) * backoffwt(B)
Pruning is most easily done only on the highest-order N-grams, since deleting a lower-order N-gram also deletes the back-off weight for that N-gram and so affects our prediction for the higher-order N-grams. For example, if we pruned P(B|A) out of the LM, then backoffwt(A,B) would also get pruned out. This affects the estimate of P(C|A,B), and the pruning heuristic would have to take that into account.
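The pruning test described above can be sketched in a few lines. This assumes that the back-off weights in the file are stored as base-10 logarithms, like the probabilities, so the product P(C|B) * backoffwt(A,B) becomes a sum. It also assumes that "very small" is judged in the log domain against a threshold you choose yourself. The function names and the threshold parameter are hypothetical.

```c
/* log10 of the probability predicted by backing off:
 *   log10 P(C|A,B) ~ log10 P(C|B) + log10 backoffwt(A,B)   (trigram case)
 *   log10 P(C|B)   ~ log10 P(C)   + log10 backoffwt(B)     (bigram case) */
double backoff_logprob(double lower_order_logprob, double log_backoff_wt)
{
    return lower_order_logprob + log_backoff_wt;
}

/* Return 1 if the explicit N-gram entry is within `threshold` (in log10
 * units) of what backing off would predict anyway, i.e. it is a
 * candidate for pruning; return 0 otherwise. */
int is_prunable(double explicit_logprob, double lower_order_logprob,
                double log_backoff_wt, double threshold)
{
    double diff = explicit_logprob
                - backoff_logprob(lower_order_logprob, log_backoff_wt);
    if (diff < 0.0)
        diff = -diff;
    return diff < threshold;
}
```

For a trigram entry, explicit_logprob would be its NUMa, lower_order_logprob the NUMa of the bigram (B C), and log_backoff_wt the NUMb of the bigram (A B).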
| GENERATING MATCHSEG FILES, FORMAT OF MATCHSEG OUTPUT | 
SB01 
    S 36683016 
    T -27154407 
    A -21711975 
    L -858796 
    0 -603268 0 <s> 
    17 -814154 -49595 SHOW 
    40 -594450 -11806 THE(3) 
    51 -463637 -23023 <sil> 
    65 -2880525 -88849 ITS(2) 
    133 -1203392 -72333 DATES 
    171 -806587 -5753 FOR 
    185 -898344 -26773 ALL 
    210 -2459017 -71603 DEPLOYED 
    260 -1765774 -94176 CEP
    302 -1791387 -76622 EVERETT(2) 
    338 -843218 -62384 THEIR 
    355 -848925 -17748 HOME 
    376 -1482528 -2325 PORT 
    411 -1058809 -79875 SPEED 
    435 -786647 -37608 BY 
    453 -1771960 -32704 FIVE 
    482 -363323 -23023 <sil> 
    489 -276030 -82596 </s> 
    514
This is a hypothesis in the "matchseg format". This output usually comes on a single line, but it has been rearranged above to make it easier to read. The first word (or "field") is the filename. S is a scaling factor (to prevent integers from wrapping around due to underflow; it can be thought of as a normalization factor for likelihoods). T is the total likelihood of the utterance, A is the acoustic likelihood, and L is the LM likelihood (these are all log likelihoods, hence the large numbers). From there on the format is
    beginning_frame_number acoustic_score lm_score WORD beginning_frame_number acoustic_score lm_score WORD ........
and so on. At the end, the LAST frame number of the utterance is written (514 in this case).
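A single-line matchseg hypothesis can be split into its fields with sscanf. The sketch below follows the field order described above. MatchHeader, MatchSeg and parse_matchseg are hypothetical names invented for this example, and the 63-character limit on ids and words is an arbitrary choice.

```c
#include <stdio.h>

/* Utterance-level fields (hypothetical type for this sketch). */
typedef struct {
    char uttid[64];   /* first field: the filename / utterance id */
    long scale;       /* S */
    long total;       /* T */
    long ascr;        /* A */
    long lscr;        /* L */
    int  last_frame;  /* final field on the line */
} MatchHeader;

/* One word segment: beginning_frame_number acoustic_score lm_score WORD. */
typedef struct {
    int  start_frame;
    long ascr;
    long lscr;
    char word[64];
} MatchSeg;

/* Parse one matchseg line. Returns the number of word segments stored in
 * segs, or -1 if the line is malformed or has more than max_segs words. */
int parse_matchseg(const char *line, MatchHeader *h, MatchSeg *segs, int max_segs)
{
    int used = 0, n = 0;
    char extra;

    if (sscanf(line, "%63s S %ld T %ld A %ld L %ld%n",
               h->uttid, &h->scale, &h->total, &h->ascr, &h->lscr, &used) != 5)
        return -1;
    line += used;

    /* Read 4-field segments until the pattern stops matching, which
     * happens at the trailing last-frame number. */
    while (n < max_segs &&
           sscanf(line, "%d %ld %ld %63s%n", &segs[n].start_frame,
                  &segs[n].ascr, &segs[n].lscr, segs[n].word, &used) == 4) {
        line += used;
        n++;
    }

    /* Exactly one integer must remain: the last frame of the utterance. */
    if (sscanf(line, "%d %c", &h->last_frame, &extra) != 1)
        return -1;
    return n;
}
```

In the SB01 example above, the per-word acoustic scores add up exactly to the A value and the per-word LM scores to the L value. This makes a convenient sanity check on a parsed line.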
| EXPLANATION OF SOME SPHINX-II DECODER FLAGS | 
 -compress : compress excess background frames
 -compressprior : compress excess background frames based on prior utt
Typical silence compression code is as follows:
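    /* Keep only the frames for which histo_add_c0() returns nonzero */
    if (silcomp == COMPRESS_PRIOR) {
        j = 0;                          /* next free slot in the compacted array */
        for (i = 0; i < nfr; i++) {
            if (histo_add_c0 (mfc[i][0])) {
                /* Frame is kept: slide it down over any skipped frames */
                if (i != j)
                    memcpy (mfc[j], mfc[i], sizeof(float)*CEP_SIZE);
                comp2rawfr[j++] = i;    /* remember the original frame index */
            }
            /* Else skip the frame, don't copy across */
        }
        nfr = j;                        /* the utterance now has only j frames */
    }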
The "silence" frames are actually deleted.
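To make the effect of the loop above concrete, here is a standalone sketch of the same compaction. It is not the decoder's code: keep_frame is a hypothetical stand-in for histo_add_c0 that simply thresholds c0, and the value of CEP_SIZE is assumed for the example.

```c
#include <string.h>

#define CEP_SIZE 13  /* assumed number of cepstral coefficients per frame */

/* Hypothetical stand-in for histo_add_c0(): keep a frame when its c0
 * (first cepstral coefficient) is at or above a fixed threshold. */
static int keep_frame(float c0, float threshold)
{
    return c0 >= threshold;
}

/* Compact mfc[] in place, dropping the frames that keep_frame rejects.
 * comp2rawfr[k] receives the original index of compacted frame k.
 * Returns the new number of frames. */
int compress_frames(float mfc[][CEP_SIZE], int nfr, float threshold,
                    int *comp2rawfr)
{
    int j = 0;
    for (int i = 0; i < nfr; i++) {
        if (keep_frame(mfc[i][0], threshold)) {
            if (i != j)
                memcpy(mfc[j], mfc[i], sizeof(float) * CEP_SIZE);
            comp2rawfr[j++] = i;
        }
    }
    return j;
}
```

After the call, the dropped frames are gone from the feature array. Only comp2rawfr records where the surviving frames originally sat, so frame numbers can be mapped back afterwards but the decoder itself never sees the removed frames.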
Note: This is not good when you are using models trained in the standard manner using SPHINX-III. Deleting silence frames completely during decoding (regardless of whether they are put back in the seg file later) is bad. We explicitly train cross-word triphones with silence as context, and there are usually hundreds of such triphones in the model set. If there are no silence frames at all in the sequence of frames being decoded, the cross-word triphones with silence context never get a chance to be used. Note also that in most model sets the silence and breath models are the best-trained models.