#C206. N-Gram Next Word Prediction

    ID: 45334 Type: Default 1000ms 256MiB

You are given a seed text, an integer n, and a corpus of text. Your task is to predict the next word according to an n-gram model constructed from the corpus. In particular, you need to:

1. Tokenize the corpus into words. In our case, a word is defined as a sequence of alphanumeric characters. All tokens should be converted to lowercase.
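One way step 1 could be implemented in Python is with a single regular expression; this is a sketch, and the helper name `tokenize` is an assumption, not part of the problem statement:

```python
import re

def tokenize(text):
    # A word is a maximal run of alphanumeric characters, lowercased.
    # [a-zA-Z0-9]+ is used rather than \w+ because \w also matches '_'.
    return re.findall(r"[a-zA-Z0-9]+", text.lower())
```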

2. Build an n-gram frequency distribution. For every contiguous sequence of n words in the corpus, use the first n-1 words as the key and count how often each possible nth word follows that key.
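Step 2 can be sketched as a nested frequency table; the function name `build_ngram_counts` is hypothetical:

```python
from collections import defaultdict, Counter

def build_ngram_counts(tokens, n):
    # Map each (n-1)-word context to a Counter of the words that follow it.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts
```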

3. Tokenize the seed text in the same manner. If the number of tokens is less than n-1, output ERROR.

4. Use the last n-1 tokens of the seed text to look up the most frequent following word in the corpus. If there is no such entry, output None.
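The four steps above can be combined into one prediction routine. This is a minimal sketch under the assumption that frequency ties (which the statement does not address) may be broken arbitrarily; the name `predict` is hypothetical:

```python
import re
from collections import defaultdict, Counter

def predict(seed, n, corpus):
    tok = lambda s: re.findall(r"[a-zA-Z0-9]+", s.lower())

    # Steps 1-2: tokenize the corpus and count nth words per (n-1)-word key.
    corpus_tokens = tok(corpus)
    counts = defaultdict(Counter)
    for i in range(len(corpus_tokens) - n + 1):
        key = tuple(corpus_tokens[i : i + n - 1])
        counts[key][corpus_tokens[i + n - 1]] += 1

    # Step 3: tokenize the seed and check it is long enough.
    seed_tokens = tok(seed)
    if len(seed_tokens) < n - 1:
        return "ERROR"

    # Step 4: look up the last n-1 seed tokens (an empty key when n == 1).
    key = tuple(seed_tokens[len(seed_tokens) - (n - 1):])
    if key not in counts:
        return "None"
    # most_common(1) breaks ties arbitrarily; the statement leaves
    # tie-breaking unspecified.
    return counts[key].most_common(1)[0][0]
```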

The mathematical definition of n-gram here is as follows:

\[ \text{For a corpus } T = (w_1, w_2, \dots, w_m), \text{ each } n\text{-gram is } (w_i, w_{i+1}, \dots, w_{i+n-1}), \quad 1 \le i \le m-n+1. \]

## inputFormat

The input is read from stdin and consists of three lines:

  1. A string representing the seed text.
  2. An integer n representing the size of the n-gram.
  3. A string representing the corpus text.

## outputFormat

Output to stdout a single line containing the predicted next word. If the seed text has fewer than n-1 tokens, output ERROR. If no prediction is possible (i.e., the last n-1 tokens of the seed text do not occur as a key in the corpus), output None.

## sample

Input:

```
the quick
3
the quick brown fox jumps over the lazy dog. the fox is quick.
```

Output:

```
brown
```