corpus.lm

    #############################################################################
    ## Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
    ## Ronald Rosenfeld and Philip Clarkson
    ## Version 3, Copyright (c) 2006, Carnegie Mellon University 
    ## Contributors include Wen Xu, Ananlada Chotimongkol,
    ## David Huggins-Daines, Arthur Chan and Alan Black 
    #############################################################################
    =============================================================================
    ===============  This file was produced by the CMU-Cambridge  ===============
    ===============     Statistical Language Modeling Toolkit     ===============
    =============================================================================
    This is a 3-gram language model, based on a vocabulary of 31 words,
      which begins "</s>", "<s>", "<unk>"...
    This is a CLOSED-vocabulary model
      (OOVs eliminated from training data and are forbidden in test data)
    Good-Turing discounting was applied.
    1-gram frequency of frequency : 0 
    2-gram frequency of frequency : 0 6 0 6 0 0 0 
    3-gram frequency of frequency : 0 20 0 6 0 0 31 
    1-gram discounting ratios : 
    2-gram discounting ratios : 
    3-gram discounting ratios : 
    This file is in the ARPA-standard format introduced by Doug Paul.
    
    p(wd3|wd1,wd2)= if(trigram exists)             p_3(wd1,wd2,wd3)
                    else if(bigram wd1,wd2 exists) bo_wt_2(wd1,wd2)*p(wd3|wd2)
                    else                           p(wd3|wd2)
    
    p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
                else              bo_wt_1(wd1)*p_1(wd2)
    
    All probs and back-off weights (bo_wt) are given in log10 form.
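
The back-off rules above can be sketched in Python as follows. This is a minimal illustration, not the toolkit's own code: the function names and the dictionary layout are assumptions made for the example. Because all values are stored in log10 form, the product `bo_wt * p` becomes a sum. The unigram values are copied from the 1-gram section of this file; the bigram entry is made up for the demonstration.

```python
import math

def log10_bigram_prob(wd1, wd2, unigrams, bigrams):
    """log10 p(wd2 | wd1): use the bigram if present, else back off to the unigram."""
    if (wd1, wd2) in bigrams:
        return bigrams[(wd1, wd2)][0]
    # log10(bo_wt_1(wd1) * p_1(wd2)) = log10 bo_wt_1(wd1) + log10 p_1(wd2)
    return unigrams[wd1][1] + unigrams[wd2][0]

def log10_trigram_prob(wd1, wd2, wd3, unigrams, bigrams, trigrams):
    """log10 p(wd3 | wd1, wd2) following the three-case rule in the header."""
    if (wd1, wd2, wd3) in trigrams:
        return trigrams[(wd1, wd2, wd3)]
    if (wd1, wd2) in bigrams:
        return bigrams[(wd1, wd2)][1] + log10_bigram_prob(wd2, wd3, unigrams, bigrams)
    return log10_bigram_prob(wd2, wd3, unigrams, bigrams)

# word -> (log10 probability, log10 back-off weight), taken from the 1-gram list
unigrams = {
    "de": (-1.2808, -1.9328),
    "la": (-1.2957, -1.9365),
    "cuisine": (-1.7388, -1.4595),
}
# Hypothetical bigram entry, NOT taken from this file: (log10 prob, log10 bo_wt)
bigrams = {("de", "la"): (-0.5, -0.3)}
trigrams = {}

bigram_backoff = log10_bigram_prob("de", "cuisine", unigrams, bigrams)
trigram_backoff = log10_trigram_prob("de", "la", "cuisine", unigrams, bigrams, trigrams)
# Convert back to a linear probability when needed
linear_prob = 10 ** trigram_backoff
```

Here `bigram_backoff` is the sum of the back-off weight of "de" and the unigram log-probability of "cuisine", since no bigram "de cuisine" exists in the toy tables.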
    
    Data formats:
    
    Beginning of data mark: \data\
    ngram 1=nr            # number of 1-grams
    ngram 2=nr            # number of 2-grams
    ngram 3=nr            # number of 3-grams
    
    \1-grams:
    p_1     wd_1 bo_wt_1
    \2-grams:
    p_2     wd_1 wd_2 bo_wt_2
    \3-grams:
    p_3     wd_1 wd_2 wd_3 
    
    end of data mark: \end\
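
As an illustration of the layout described above, the following Python sketch reads the section between `\data\` and `\end\` into dictionaries. The function name `parse_arpa` and the returned structure are assumptions made for this example; the sample text is an abbreviated three-entry excerpt, so its `ngram 1=3` count differs from the full file.

```python
def parse_arpa(text):
    """Parse ARPA-format text into (counts, ngrams).

    counts: {order: declared number of n-grams}
    ngrams: {order: {word tuple: (log10 prob, log10 back-off weight or None)}}
    """
    counts, ngrams = {}, {}
    order = None
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            in_data = True
            continue
        if not in_data:
            continue  # free-form header before the data mark is ignored
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            ngrams[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logp = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The highest-order n-grams carry no back-off weight
        bo_wt = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (logp, bo_wt)
    return counts, ngrams

SAMPLE = r"""
\data\
ngram 1=3

\1-grams:
-0.8957 </s> -2.2791
-0.8937 <s> -2.2811
-1.7531 <unk> -1.4461

\end\
"""

counts, ngrams = parse_arpa(SAMPLE)
```

The parser only starts reading at a line consisting solely of `\data\`, so the format description in the header (which also mentions `\data\` and `\1-grams:`) is skipped.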
    
    \data\
    ngram 1=31
    ngram 2=65
    ngram 3=123
    
    \1-grams:
    -0.8957 </s>	-2.2791
    -0.8937 <s>	-2.2811
    -1.7531 <unk>	-1.4461
    -2.0841 allume	-1.1816
    -1.9080 arrête	-1.3105
    -1.9080 baisse	-1.3105
    -1.7388 cuisine	-1.4595
    -1.2808 de	-1.9328
    -1.4378 des	-1.7968
    -1.9080 descends	-1.3105
    -1.7988 douche	-1.4033
    -2.6281 du	-0.7771
    -1.9080 ferme	-1.3105
    -1.7388 l'entrée	-1.4595
    -1.2957 la	-1.9365