Mosur Ravishankar
School of Computer Science, CMU
Nov 19, 1999
 
- Recognition problem
  - Search for the most likely word sequence matching the input speech, given the various models
- Illustrated using Sphinx-3 (original)
- Lextree search (Sphinx-3.2)
- Search organization
- Pruning
- Experiments
- Conclusion
 
- Search organization
- Continuous speech recognition
- Cross-word triphone modeling
- Language model integration
- Pruning for efficiency
 
- Flat lexical structure
- Cross-word triphone modeling
  - Multiplexing at word beginning
  - Replication at word end
  - Single-phone words: combination of both
- LM score applied upon transition into word
- Trigram language model
  - However, only the single best history is maintained
- Beam-based pruning
  - Long-tailed distribution of active HMMs/frame
 
- Flat lexicon; every word treated independently
- Evaluating an HMM w.r.t. input speech: Viterbi search
  - Score the best state-sequence through the HMM, given the input
 
- Instantaneous score: how well a given HMM state matches the speech input at a given time frame
- Path: a sequence of HMM states traversed during a given segment of input speech
- Path-score: product of the instantaneous scores and state transition probabilities corresponding to a given path
- Viterbi search: an efficient lattice structure and algorithm for computing the best path-score for a given segment of input speech
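
The toy sketch below (not Sphinx-3 code; the 3-state topology, names such as viterbi_step, and all numbers are invented for illustration) works in the log domain, where the product of instantaneous scores and transition probabilities becomes a sum; each frame, every state keeps the best path-score that can reach it.

    #include <stdio.h>

    #define NSTATES 3              /* toy 3-state left-to-right HMM */
    #define LOGZERO (-1.0e30)      /* stand-in for log(0) */

    /* One Viterbi update step.  path[s] holds the best log path-score ending in
     * state s after the previous frame; inst[s] is the instantaneous (acoustic)
     * log-score of state s for the current frame; tp[i][j] is the log transition
     * probability from state i to state j. */
    static void viterbi_step(double path[NSTATES],
                             const double inst[NSTATES],
                             const double tp[NSTATES][NSTATES])
    {
        double newpath[NSTATES];
        for (int j = 0; j < NSTATES; j++) {
            double best = LOGZERO;
            for (int i = 0; i <= j; i++) {        /* left-to-right topology */
                double sc = path[i] + tp[i][j];
                if (sc > best)
                    best = sc;
            }
            newpath[j] = best + inst[j];          /* product -> sum in log domain */
        }
        for (int j = 0; j < NSTATES; j++)
            path[j] = newpath[j];
    }

    int main(void)
    {
        /* Illustrative log transition probabilities: self-loop or advance by one. */
        const double tp[NSTATES][NSTATES] = {
            { -0.3, -1.4, LOGZERO },
            { LOGZERO, -0.3, -1.4 },
            { LOGZERO, LOGZERO, -0.1 },
        };
        double path[NSTATES] = { 0.0, LOGZERO, LOGZERO };   /* start in state 0 */

        /* Fake instantaneous scores for 4 frames of "input speech". */
        const double inst[4][NSTATES] = {
            { -1.0, -5.0, -9.0 }, { -2.0, -1.5, -6.0 },
            { -4.0, -1.0, -2.0 }, { -7.0, -3.0, -0.5 },
        };
        for (int t = 0; t < 4; t++)
            viterbi_step(path, inst[t], tp);

        /* Best path-score through the HMM for this 4-frame segment. */
        printf("best path-score into exit state: %f\n", path[NSTATES - 1]);
        return 0;
    }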
 
- Search all words in parallel
- Initialize the start state of every word with path-score 1.0
- For each frame of input speech:
  - Update path-scores within each HMM
  - Propagate the exit-state score from each HMM to the initial state of its successor (using the Viterbi criterion)
- Select the word with the best exit-state path-score
 
- Add null transitions from word ends to word beginnings
- Apply the Viterbi search algorithm to the modified network
- Q: How to recover the recognized word sequence?
 
- Each word exit is recorded in the BP (backpointer) table
- Upon transitioning from an exited word A to another word B:
  - Inject a pointer to the BP table entry for A into the start state of B.  (This identifies the predecessor of B.)
  - Propagate these pointers along with path-scores during Viterbi search
- At the end of the utterance, identify the best exited word and trace back using the predecessor pointers
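
A minimal sketch of this bookkeeping, assuming a hypothetical entry layout (bp_entry_t, bp_backtrace); the real Sphinx-3 table records more fields, but the essence is a word id, an exit frame, a path-score, and a predecessor index.

    #include <stdio.h>

    /* Hypothetical backpointer (BP) table entry: one per recorded word exit. */
    typedef struct {
        int    wid;        /* word that just exited */
        int    end_frame;  /* frame at which it exited */
        double score;      /* best path-score at the word's exit state */
        int    pred;       /* index of predecessor word's BP entry; -1 if none */
    } bp_entry_t;

    /* Recover the recognized word sequence by following predecessor pointers
     * from the best final entry back to the start of the utterance. */
    static int bp_backtrace(const bp_entry_t *tbl, int best_final,
                            int *wids, int max_wids)
    {
        int n = 0;
        for (int b = best_final; b >= 0 && n < max_wids; b = tbl[b].pred)
            wids[n++] = tbl[b].wid;

        /* Entries were collected end-to-start; reverse into utterance order. */
        for (int i = 0, j = n - 1; i < j; i++, j--) {
            int tmp = wids[i]; wids[i] = wids[j]; wids[j] = tmp;
        }
        return n;
    }

    int main(void)
    {
        /* Tiny hand-built table: "<s> nineteen ninety" (toy word ids). */
        bp_entry_t tbl[] = {
            { 0, 10,  -120.0, -1 },   /* <s>      */
            { 7, 55,  -800.0,  0 },   /* nineteen */
            { 9, 92, -1450.0,  1 },   /* ninety   */
        };
        int wids[16];
        int n = bp_backtrace(tbl, 2, wids, 16);
        for (int i = 0; i < n; i++)
            printf("%d ", wids[i]);
        printf("\n");
        return 0;
    }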
 
- Some additional information available from the BP table:
  - All candidate words recognized during recognition
  - Word segmentations
  - Word segment acoustic scores
  - "Lattice density": no. of competing word hypotheses at any instant (see the sketch below)
- Useful for postprocessing steps:
  - Lattice rescoring
  - N-best list generation
  - Confidence measures
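
For example, if the BP table also records word start frames, lattice density can be read off directly; the entry layout below is hypothetical.

    #include <stdio.h>

    /* Hypothetical BP entry carrying the word's time segment. */
    typedef struct {
        int wid;
        int start_frame;
        int end_frame;
    } bp_seg_t;

    /* Lattice density at frame t: number of word hypotheses whose segment
     * [start_frame, end_frame] covers t. */
    static int lattice_density(const bp_seg_t *tbl, int n, int t)
    {
        int count = 0;
        for (int i = 0; i < n; i++)
            if (tbl[i].start_frame <= t && t <= tbl[i].end_frame)
                count++;
        return count;
    }

    int main(void)
    {
        bp_seg_t tbl[] = {
            { 7, 11, 55 },   /* nineteen */
            { 9, 11, 92 },   /* ninety   */
            { 9, 56, 92 },   /* ninety   */
        };
        printf("density at frame 40: %d\n", lattice_density(tbl, 3, 40));
        printf("density at frame 70: %d\n", lattice_density(tbl, 3, 70));
        return 0;
    }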
 
- Exhaustive search over a large vocabulary is too expensive, and unnecessary
- Use a "beam" to "prune" the set of active HMMs (see the sketch below):
  - At the start of each frame, find the best available path-score S
  - Use a scale factor f (< 1.0) to set a pruning threshold T = S*f
  - Deactivate an HMM if no state in it has a path-score >= T
  - Effect: the no. of active HMMs is larger if there is no clear frontrunner
- Two kinds of beams:
  - To control the active set of HMMs
    - No. of active HMMs per frame typically 10-20% of the total space
  - To control word exits taken (and recorded in the BP table)
    - No. of words exited typically 10-20 per frame
- Recognition accuracy essentially unaffected
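
A sketch of the HMM beam, assuming log-domain scores (so the scale factor f < 1.0 becomes an additive offset log f < 0); the types and names are illustrative, not the Sphinx-3 ones.

    #include <stdio.h>

    #define NSTATES 3

    typedef struct {
        double state_score[NSTATES];  /* log path-scores of this HMM's states */
        int    active;                /* evaluated in the next frame? */
    } hmm_t;

    /* Beam pruning for one frame: keep an HMM active only if some state in it
     * is within the beam (logf, a negative log scale factor) of the best score. */
    static void beam_prune(hmm_t *hmm, int n, double logf)
    {
        double best = -1.0e30;
        for (int i = 0; i < n; i++)               /* best available path-score S */
            for (int s = 0; s < NSTATES; s++)
                if (hmm[i].active && hmm[i].state_score[s] > best)
                    best = hmm[i].state_score[s];

        double thresh = best + logf;              /* T = S * f in the log domain */
        for (int i = 0; i < n; i++) {
            if (!hmm[i].active)
                continue;
            int keep = 0;
            for (int s = 0; s < NSTATES; s++)
                if (hmm[i].state_score[s] >= thresh)
                    keep = 1;
            hmm[i].active = keep;                 /* deactivate if below the beam */
        }
    }

    int main(void)
    {
        hmm_t h[2] = {
            { { -10.0, -12.0, -30.0 }, 1 },
            { { -90.0, -95.0, -99.0 }, 1 },
        };
        beam_prune(h, 2, -50.0);                  /* beam width: 50 in log units */
        printf("hmm0 active=%d, hmm1 active=%d\n", h[0].active, h[1].active);
        return 0;
    }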
 
- Language models essential for recognition accuracy
  - Reduce word error rate by an order of magnitude
  - Reduce the active search space significantly
- Implementation: associate LM probabilities with transitions between words
 
- Two issues with large-vocabulary bigram LMs:
  - With vocabulary size V and N word exits per frame, NxV cross-word transitions per frame
  - Bigram probabilities very sparse; most word pairs "back off" to unigrams
- Optimize cross-word transitions using a "backoff node" (see the sketch below):
  - A Viterbi decision at the backoff node selects the single best predecessor
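
The sketch below illustrates the backoff-node idea on a toy LM (a sparse explicit-bigram list plus unigram and backoff weights, all in log probabilities); the data structures are invented, and taking the better of the backed-off and explicit-bigram paths is the usual Viterbi approximation rather than exact backoff arithmetic. Exited words compete once at the backoff node, so the per-frame cost is roughly N + V + (number of explicit bigrams) instead of NxV.

    #include <stdio.h>

    #define V 4            /* toy vocabulary size */

    /* Toy LM in log probabilities. */
    static const double log_unigram[V] = { -2.0, -2.3, -1.6, -2.5 };
    static const double log_backoff[V] = { -0.5, -0.7, -0.4, -0.6 };

    typedef struct { int w1, w2; double logprob; } bigram_t;
    static const bigram_t bigrams[] = {          /* sparse explicit bigrams */
        { 0, 2, -0.9 }, { 2, 3, -1.1 },
    };
    #define NBIGRAMS 2

    /* exit_score[w]: best log path-score of word w at its exit state this frame
     * (a very negative value means the word did not exit).
     * entry_score[w]: resulting best log score entering word w. */
    static void cross_word_transitions(const double exit_score[V],
                                       double entry_score[V])
    {
        /* Viterbi decision at the backoff node: keep only the single best
         * "exit score + backoff weight" over all exited words. */
        double backoff_node = -1.0e30;
        for (int w = 0; w < V; w++) {
            double sc = exit_score[w] + log_backoff[w];
            if (sc > backoff_node)
                backoff_node = sc;
        }

        /* Backed-off path into every word: one transition each, V total. */
        for (int w = 0; w < V; w++)
            entry_score[w] = backoff_node + log_unigram[w];

        /* Explicit bigrams override the backed-off path where they are better. */
        for (int i = 0; i < NBIGRAMS; i++) {
            double sc = exit_score[bigrams[i].w1] + bigrams[i].logprob;
            if (sc > entry_score[bigrams[i].w2])
                entry_score[bigrams[i].w2] = sc;
        }
    }

    int main(void)
    {
        double exit_score[V] = { -100.0, -1.0e30, -95.0, -1.0e30 };
        double entry_score[V];
        cross_word_transitions(exit_score, entry_score);
        for (int w = 0; w < V; w++)
            printf("entry_score[%d] = %g\n", w, entry_score[w]);
        return 0;
    }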
 
- Sphinx uses "triphone" or "phoneme-in-context" HMMs
- Cross-word transitions use the appropriate exit model, and inject the left context into the entry state
 
    initialize start state of <S> with path-score = 1;
    for each frame of input speech {
        evaluate all active HMMs; find best path-score, pruning thresholds;
        for each active HMM {
            if above pruning threshold {
                activate HMM for next frame;
                transition to and activate successor HMM within word, if any;
                if word-final HMM and above word-pruning threshold
                    record word-exit in BP table;
            }
        }
        transition from words exited into initial state of entire lexicon
            (using the LM), and activate HMMs entered;
    }
    find final </S> BP table entry and back-trace through table to retrieve result;
 
- Most active HMMs are word-initial models, decaying rapidly thereafter
  - On a 60K-word Hub-4 task, 55% of active HMMs are word-initial
  - (Same reason for handling left/right contexts differently.)
- But the no. of distinct word-initial model types is much smaller
- Use a "prefix tree" structure to maximize sharing among words (see the sketch below)
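
A minimal sketch of building such a prefix tree from a toy pronunciation dictionary, sharing nodes by base phone only (Sphinx-3.2 additionally requires identical triphone SSIDs, as the next slide describes); all names and the dictionary are illustrative. On the three toy pronunciations, 13 flat-lexicon phone positions collapse to 7 tree nodes; the ~50x reduction quoted later is for word-initial models on the full 60K-word task.

    #include <stdio.h>
    #include <string.h>

    #define MAXNODES 256

    /* One lextree node: a phone, its first child and next sibling. */
    typedef struct {
        char phone[8];
        int  first_child;   /* -1 if none */
        int  next_sibling;  /* -1 if none */
    } node_t;

    static node_t tree[MAXNODES];
    static int n_nodes = 0;

    static int new_node(const char *phone)
    {
        int n = n_nodes++;
        strncpy(tree[n].phone, phone, sizeof(tree[n].phone) - 1);
        tree[n].phone[sizeof(tree[n].phone) - 1] = '\0';
        tree[n].first_child = tree[n].next_sibling = -1;
        return n;
    }

    /* Find or create the child of `parent` labelled `phone`. */
    static int get_child(int parent, const char *phone)
    {
        for (int c = tree[parent].first_child; c >= 0; c = tree[c].next_sibling)
            if (strcmp(tree[c].phone, phone) == 0)
                return c;                        /* shared with an earlier word */
        int c = new_node(phone);
        tree[c].next_sibling = tree[parent].first_child;
        tree[parent].first_child = c;
        return c;
    }

    /* Add one pronunciation (space-separated phone string) to the tree. */
    static void add_word(int root, const char *pron)
    {
        char buf[128];
        strncpy(buf, pron, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        int cur = root;
        for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
            cur = get_child(cur, tok);
    }

    int main(void)
    {
        int root = new_node("<root>");
        const char *dict[] = {                   /* toy pronunciations */
            "N AY N T IY",                       /* ninety   */
            "N AY N T IY N",                     /* nineteen */
            "N OW",                              /* no       */
        };
        int n_phones = 0;
        for (int i = 0; i < 3; i++) {
            add_word(root, dict[i]);
            for (const char *p = dict[i]; *p; p++)
                if (*p == ' ') n_phones++;
            n_phones++;                          /* phones = spaces + 1 */
        }
        printf("flat lexicon phones: %d, lextree nodes: %d\n",
               n_phones, n_nodes - 1);
        return 0;
    }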
 
- Nodes shared if their triphone "state-sequence IDs" (SSIDs) are identical
- Leaf (word-final) nodes not shared
- In the 60K-word BN task, word-initial models reduced ~50x
 
- Root nodes replicated for left context
- Again, nodes shared if SSIDs identical
- During search, very few distinct incoming left contexts at any time; so only very few copies activated
 
- Leaf nodes use "composite" SSID models
  - Simplifies lextree and backpointer table implementation
  - Simplifies the cross-word transition implementation
 
- Problem: LM probabilities cannot be determined upon transition into lextree root nodes
  - Root nodes are shared among several unrelated words
- Several solutions possible:
  - Incremental evaluation, using composite LM scores
  - Lextree replication (Ney, Antoniol)
  - Rescoring at every node (BBN)
  - Post-tree evaluation (Sphinx-II)
 
- Incremental LM score accumulation, e.g. with a bigram LM (see the sketch below)
- Large computation and memory requirements
- Overhead for dynamic lextree creation/destruction
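
The sketch below illustrates the incremental idea for one lextree copy and one predecessor history: each node carries the best bigram probability over the words below it, and the LM score is applied in increments on the transitions into nodes, so a root-to-leaf path accumulates exactly log P(word | history). The tree layout and names are invented for illustration and do not reproduce the Ney or Antoniol schemes in detail.

    #include <stdio.h>
    #include <math.h>

    #define NNODES 6
    #define NWORDS 3

    /* Toy lextree: node 0 is the root; nodes 2, 3, 5 are leaves for words 0-2. */
    static const int parent[NNODES]    = { -1, 0, 1, 1, 0, 4 };
    static const int leaf_word[NNODES] = { -1, -1, 0, 1, -1, 2 };

    /* For one predecessor history h: best[n] = max log P(w|h) over words below
     * node n; incr[n] = LM increment applied on the transition into node n. */
    static void lm_increments(const double logP[NWORDS],
                              double best[NNODES], double incr[NNODES])
    {
        for (int n = 0; n < NNODES; n++)
            best[n] = -1.0e30;

        /* Push each word's bigram log-probability up from its leaf to the root,
         * keeping the max at every node along the way. */
        for (int n = 0; n < NNODES; n++) {
            if (leaf_word[n] < 0)
                continue;
            for (int m = n; m >= 0; m = parent[m])
                if (logP[leaf_word[n]] > best[m])
                    best[m] = logP[leaf_word[n]];
        }

        /* Increments along any root-to-leaf path sum to log P(word | h). */
        for (int n = 0; n < NNODES; n++)
            incr[n] = (parent[n] < 0) ? best[n] : best[n] - best[parent[n]];
    }

    int main(void)
    {
        double logP[NWORDS] = { log(0.5), log(0.3), log(0.2) };
        double best[NNODES], incr[NNODES];
        lm_increments(logP, best, incr);

        /* Check: path root -> node 1 -> leaf 3 accumulates log P(word 1 | h). */
        printf("accumulated = %f, log P = %f\n",
               incr[0] + incr[1] + incr[3], logP[1]);
        return 0;
    }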
 
- Again, incremental LM score accumulation
- Still, large computation/memory requirements and overhead for dynamic lextree maintenance
- Multiple LM transitions between some word pairs
 
- Single lextree
- Null transitions from leaf nodes back to root nodes
- No LM score upon transition into a root or non-leaf node of the lextree
- Upon reaching a leaf node for word W (see the sketch below):
  - Find all possible LM histories of W (from the BP table)
  - Find LM scores for W w.r.t. each LM history
  - Choose the best resulting path-score for W
- Drawbacks:
  - Inexact acoustic scores
    - A root node is evaluated w.r.t. a single left context, but the resulting score is used w.r.t. all histories (with possibly different left contexts)
  - Impoverished word segmentations
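
A sketch of the leaf-node step, with hypothetical types and a stubbed bigram lookup: the word-internal acoustic score for W is combined with each candidate history's exit score and LM score, and only the single best combination survives (which is exactly why only one history is kept per word exit in this scheme).

    #include <stdio.h>

    /* Hypothetical BP entry: a word exit that could precede W. */
    typedef struct {
        int    hist_wid;   /* the history (predecessor) word */
        double score;      /* log path-score at that word's exit */
    } bp_entry_t;

    /* Stand-in for a bigram lookup: log P(w | hist), toy values only. */
    static double lm_logprob(int hist_wid, int w)
    {
        return (hist_wid == 7 && w == 9) ? -0.5 : -4.0;
    }

    /* Sphinx-II style post-tree evaluation: the acoustic score at W's leaf node
     * is rescored against every candidate LM history; only the single best
     * resulting path survives. */
    static double best_leaf_score(double leaf_ac_score, int w,
                                  const bp_entry_t *bp, int n_bp, int *best_hist)
    {
        double best = -1.0e30;
        *best_hist = -1;
        for (int i = 0; i < n_bp; i++) {
            double sc = bp[i].score + lm_logprob(bp[i].hist_wid, w) + leaf_ac_score;
            if (sc > best) {
                best = sc;
                *best_hist = bp[i].hist_wid;
            }
        }
        return best;
    }

    int main(void)
    {
        bp_entry_t bp[] = { { 7, -800.0 }, { 3, -790.0 } };
        int hist;
        double sc = best_leaf_score(-50.0, 9, bp, 2, &hist);
        printf("best score %f with history word %d\n", sc, hist);
        return 0;
    }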
 
- Q: Which transition wins?
- Flat lexicon (separate model per word):
  - At A: word ninety entered with LM score P(ninety | ninety)
  - At B: word ninety entered with P(ninety | nineteen)
  - Since the latter is much better, it prevails over the former
  - Result: correct recognition, and correct segmentation for ninety
 
- Tree lexicon:
  - At A: root node for ninety entered without any LM score
  - At B: attempt to enter the root node for ninety again
    - The transition may or may not succeed (no LM score used)
  - At C: obtain the LM score for ninety w.r.t. all predecessors
  - If the transition at B failed, the only candidate predecessor is ninety; result: incorrect segmentation for the second ninety, and incorrect recognition
 
- Post-lextree LM scoring (as above); however:
- Limited, static lextree replication
  - Limits memory requirements
  - No dynamic lextree management overhead
- Transitions into lextrees staggered across time:
  - At any time, only one lextree is entered
  - "-epl" (entries per lextree) parameter: the number of consecutive frames for which one lextree is entered before switching to the next
  - More word segmentations (start times) survive
- Full LM histories; upon reaching a leaf node for word W:
  - Find all possible LM histories of W (from the BP table)
  - Include LM scores for W w.r.t. each LM history
  - Create a separate BP table entry for each resulting history (see the sketch below)
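
The contrast with the previous scheme, sketched with the same kind of hypothetical types: at a leaf exit, one BP entry is created per surviving LM history rather than collapsing to the single best, so later (e.g. trigram) decisions can still choose among full histories.

    #include <stdio.h>

    typedef struct {
        int    wid;        /* word that exited (here: W) */
        int    hist_wid;   /* LM history it exited with */
        double score;      /* log path-score including that history's LM score */
    } bp_entry_t;

    /* Stand-in for a bigram lookup: log P(w | hist), toy values only. */
    static double lm_logprob(int hist_wid, int w)
    {
        return (hist_wid == 7 && w == 9) ? -0.5 : -4.0;
    }

    /* Sphinx-3.2 style leaf exit: create one BP entry per surviving LM history
     * instead of collapsing to the single best one. */
    static int record_word_exits(double leaf_ac_score, int w,
                                 const int *hist, const double *hist_score,
                                 int n_hist, bp_entry_t *out)
    {
        int n = 0;
        for (int i = 0; i < n_hist; i++) {
            out[n].wid = w;
            out[n].hist_wid = hist[i];
            out[n].score = hist_score[i] + lm_logprob(hist[i], w) + leaf_ac_score;
            n++;
        }
        return n;          /* all histories kept for later LM decisions */
    }

    int main(void)
    {
        int    hist[]       = { 7, 3 };
        double hist_score[] = { -800.0, -790.0 };
        bp_entry_t out[8];
        int n = record_word_exits(-50.0, 9, hist, hist_score, 2, out);
        for (int i = 0; i < n; i++)
            printf("W=%d history=%d score=%f\n",
                   out[i].wid, out[i].hist_wid, out[i].score);
        return 0;
    }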
 
- Pure beam pruning has a long-tailed distribution of active HMMs/frame
- Absolute pruning to control worst-case performance:
  - Max. active HMMs per frame
    - Implemented approximately, using "histogram pruning", which avoids an expensive sorting step (see the sketch below)
  - Max. unique words exiting per frame
  - Max. LM histories saved in the BP table per frame
  - Word error rate unaffected
- Additional beam for lextree-internal, cross-HMM transitions
- Unigram "lookahead" scores used in the lextree for yet more pruning
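
A sketch of histogram pruning with an invented binning scheme: log path-scores are binned below the frame's best score, and the threshold is set at the bin where the cumulative count first exceeds the allowed maximum, approximating top-K selection without a sort.

    #include <stdio.h>

    #define NBINS 64

    /* Pick an approximate pruning threshold so that at most max_active of the
     * given log-scores stay above it, by histogramming instead of sorting. */
    static double histogram_threshold(const double *score, int n,
                                      int max_active, double beam_width)
    {
        double best = -1.0e30;
        for (int i = 0; i < n; i++)
            if (score[i] > best)
                best = score[i];

        /* Bin scores within [best - beam_width, best]; bin 0 is nearest to best. */
        int histo[NBINS] = { 0 };
        double binsz = beam_width / NBINS;
        for (int i = 0; i < n; i++) {
            double d = best - score[i];
            if (d < beam_width)
                histo[(int)(d / binsz)]++;
        }

        /* Walk bins from the best score downward until the budget is used up. */
        int cum = 0, b;
        for (b = 0; b < NBINS; b++) {
            cum += histo[b];
            if (cum > max_active)
                break;
        }
        return best - b * binsz;     /* keeps roughly the top max_active scores */
    }

    int main(void)
    {
        double score[] = { -10, -12, -13, -40, -41, -42, -90, -95 };
        double th = histogram_threshold(score, 8, 3, 100.0);
        printf("threshold = %f\n", th);
        for (int i = 0; i < 8; i++)
            if (score[i] >= th)
                printf("keep %f\n", score[i]);
        return 0;
    }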
 
- 1997 BN eval set, excluding F2 (telephone speech)
- 6K tied-state CHMM, 20 densities/state model (1997)
- 60K-word vocab, trigram LM
 
- 1998 BN eval set
- 5K tied-state CHMM, 32 densities/state model (1998)
- 60K-word vocab, trigram LM
 
- Effect of the absolute pruning parameters (1998 BN):
  - Per-frame computation stats for each utterance
  - Distribution of stats over the entire test set (375 utts)
- Absolute pruning highly effective in controlling variance in computational cost
 
- 10x CHMM system available (thanks to P-!!!)
- With the lextree implementation, only about 1-2% of total HMMs active per frame
  - An order of magnitude fewer than with the flat-lexicon search
- Lextree replication improves WER noticeably
- Absolute pruning parameters improve worst-case behavior significantly, without penalizing accuracy
  - When the active search space grows beyond some threshold, there is no hope of correct recognition anyway
 
- Tree lexicon still not as accurate as the flat-lexicon baseline
  - Residual word segmentation problems?
  - Try lextree replication?
  - Use of the composite SSID model at leaf nodes?
  - Parameters not close to optimal?
- HMM state acoustic score computation now dominant
  - Back to efficient Gaussian selection/computation