The trigram language model is implemented in Python. The implementation is based on a template provided by Prof. Daniel Bauer.
Unseen words are handled with a pre-defined lexicon before n-grams are extracted. The function corpus_reader takes an optional lexicon parameter, a Python set of tokens; every token that does not appear in the lexicon is replaced with the special "UNK" token (a minimal sketch is shown below).
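The following sketch illustrates one way corpus_reader could perform this replacement; the lowercasing and whitespace tokenization are assumptions here, not necessarily what the template does.

```python
def corpus_reader(corpusfile, lexicon=None):
    """Yield each sentence in the corpus as a list of tokens.

    If a lexicon (a Python set of tokens) is given, any token not in the
    lexicon is replaced with the special "UNK" token.
    """
    with open(corpusfile, "r") as corpus:
        for line in corpus:
            if line.strip():
                sequence = line.lower().strip().split()
                if lexicon is None:
                    yield sequence
                else:
                    yield [tok if tok in lexicon else "UNK" for tok in sequence]
```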
Probabilities are smoothed with linear interpolation between the raw trigram, bigram, and unigram probabilities, as sketched below.
In this project the interpolation weights are set to lambda1 == lambda2 == lambda3 == 1/3.
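A minimal sketch of the interpolation, assuming callables that return the maximum-likelihood (raw) estimates; the function and parameter names below are illustrative, not necessarily those used in the project.

```python
def smoothed_trigram_probability(trigram, raw_trigram_p, raw_bigram_p, raw_unigram_p):
    """Linear interpolation of the raw trigram, bigram, and unigram estimates.

    raw_trigram_p, raw_bigram_p, raw_unigram_p are callables returning the
    maximum-likelihood estimates P(w | u, v), P(w | v), and P(w).
    """
    lambda1 = lambda2 = lambda3 = 1.0 / 3.0
    u, v, w = trigram
    return (lambda1 * raw_trigram_p((u, v, w))
            + lambda2 * raw_bigram_p((v, w))
            + lambda3 * raw_unigram_p((w,)))
```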
- Compute perplexity on brown_train.txt (training file) and brown_test.txt (test file); see the perplexity sketch below.
The perplexity is 300.17653468276893
- Compute the correct prediction rate on the ets_toefl_data dataset; see the scoring sketch below.
The accuracy is 84.86%
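Perplexity sketch: one way the corpus perplexity could be computed, assuming the model exposes a sentence_logprob method (an illustrative name) that returns the log2 probability of a sentence under the interpolated model.

```python
def perplexity(model, corpus):
    """Perplexity of `model` over an iterable of tokenized sentences."""
    log_prob_sum = 0.0
    total_tokens = 0
    for sentence in corpus:
        log_prob_sum += model.sentence_logprob(sentence)  # assumed to be log2 P(sentence)
        total_tokens += len(sentence)
    l = log_prob_sum / total_tokens  # average log2 probability per token
    return 2 ** (-l)
```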

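Scoring sketch: the correct prediction rate could be computed along these lines, assuming two trigram models trained on the two essay categories in ets_toefl_data and that each test essay is assigned to the category whose model gives it the lower perplexity. The directory layout and names below are illustrative; the sketch reuses the corpus_reader and perplexity functions sketched above.

```python
import os

def essay_scoring_accuracy(model_high, model_low, testdir_high, testdir_low, lexicon):
    """Fraction of essays assigned to the correct category by perplexity comparison."""
    correct, total = 0, 0
    for filename in os.listdir(testdir_high):
        path = os.path.join(testdir_high, filename)
        pp_high = perplexity(model_high, corpus_reader(path, lexicon))
        pp_low = perplexity(model_low, corpus_reader(path, lexicon))
        correct += pp_high < pp_low  # correct if the "high" model scores lower
        total += 1
    for filename in os.listdir(testdir_low):
        path = os.path.join(testdir_low, filename)
        pp_high = perplexity(model_high, corpus_reader(path, lexicon))
        pp_low = perplexity(model_low, corpus_reader(path, lexicon))
        correct += pp_low < pp_high  # correct if the "low" model scores lower
        total += 1
    return correct / total
```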