Keyword-Based Tweet Classification

ID: 29658

Type: Default

Time limit: 1000ms

Memory limit: 256MiB

This problem requires you to implement a simple keyword-based tweet classification system. You are given a set of training tweets, each with an identifier, the tweet text, and a category label. Your task is to preprocess the tweets by converting all letters to lowercase and removing special characters, then build a keyword frequency model for each category. Finally, you will classify a set of test tweets by summing the keyword frequencies for each category and choosing the category with the highest score. In the event of a tie, choose the category that appeared first in the training data.
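
The preprocessing and model-building steps can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the statement does not define "special characters", so the code treats anything other than ASCII letters, digits, and whitespace as special and deletes it (so "Can't" becomes "cant"). The names `preprocess` and `build_model` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Lowercase the text, delete special characters, split on whitespace.

    Assumption: "special" means anything that is not an ASCII letter,
    a digit, or whitespace; such characters are deleted, not replaced.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()


def build_model(samples):
    """Build per-category keyword counts.

    samples: iterable of (tweet_text, category) pairs in training order.
    Returns (model, order), where model maps category -> Counter of word
    counts and order lists categories by first appearance, which is what
    the tie-breaking rule needs.
    """
    model = defaultdict(Counter)
    order = []
    for text, category in samples:
        if category not in model:  # membership test does not create the key
            order.append(category)
        model[category].update(preprocess(text))
    return model, order
```

Keeping the first-appearance order alongside the counts avoids relying on the category labels being sortable or numeric.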

The score of a tweet for a given category is the sum, over the words in the tweet, of how often each word occurs in that category's training tweets:

\(\text{score} = \sum_{w \in \text{tweet}} \text{frequency}(w, \text{category})\)
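
The scoring rule and the tie-break can be sketched together. The sketch below assumes a hypothetical layout in which `model` maps each category to a `Counter` of training word counts and `order` lists the categories by first appearance in the training data. It also assumes that a word repeated in the test tweet is counted once per occurrence, which the statement does not spell out.

```python
from collections import Counter


def classify(tokens, model, order):
    """Return the category with the highest summed keyword frequency.

    tokens: preprocessed words of one test tweet.
    model:  category -> Counter of training word counts (assumed layout).
    order:  categories in order of first appearance in the training data.
    """
    best_category, best_score = None, -1
    for category in order:
        # A Counter returns 0 for unseen words, so unknown tokens add nothing.
        score = sum(model[category][w] for w in tokens)
        # Strict '>' keeps the earliest category when scores tie.
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = {
    "2": Counter({"breaking": 1, "the": 1, "stock": 1, "market": 1, "crashes": 1}),
    "3": Counter({"the": 1, "new": 1, "movie": 1}),
}
order = ["2", "3"]
print(classify(["the", "new", "movie"], model, order))  # prints 3
print(classify(["the", "economy"], model, order))       # prints 2 (tie, 2 is first)
```

Iterating in first-appearance order with a strict comparison means the tie rule needs no separate handling.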

You must read input from STDIN and write the output to STDOUT.

inputFormat

The input consists of two parts.

The first part describes the training data:
- The first line contains an integer \(N\) representing the number of training tweets.
- Each of the next \(N\) lines contains a training sample in the format: tweet_id;tweet_text;category.

The second part describes the test tweets:
- The next line contains an integer \(M\) representing the number of tweets to classify.
- Each of the following \(M\) lines contains a tweet text.
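
One way to parse this layout is sketched below. The statement does not say whether the tweet text may itself contain `;`, so the sketch assumes the identifier is the first field and the category the last, and keeps everything in between as the text. The function name `parse_input` is hypothetical.

```python
def parse_input(text):
    """Parse the whole input into (training_samples, test_tweets).

    training_samples: list of (tweet_id, tweet_text, category) tuples.
    test_tweets:      list of raw tweet strings.
    """
    lines = text.splitlines()
    n = int(lines[0])
    training = []
    for line in lines[1:1 + n]:
        # Assumption: the id contains no ';' and the category is the last
        # field, so any ';' inside the tweet text is preserved.
        tweet_id, rest = line.split(";", 1)
        tweet_text, category = rest.rsplit(";", 1)
        training.append((tweet_id.strip(), tweet_text, category.strip()))
    m = int(lines[1 + n])
    tests = lines[2 + n:2 + n + m]
    return training, tests
```

In a full program the argument would typically come from `sys.stdin.read()`.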

outputFormat

Output \(M\) lines, each containing the predicted category for the corresponding test tweet.

## sample

3
1;Just won a marathon!;1
2;Breaking: the stock market crashes!;2
3;Can't wait for the new superhero movie!;3
3
I love playing basketball!
Latest news on the economy
The new movie is amazing!

1
2
3
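
As a check on the sample, the scores can be worked out by hand, assuming special characters are deleted and words are split on whitespace. After preprocessing, the training keywords are: category 1: just, won, a, marathon; category 2: breaking, the, stock, market, crashes; category 3: cant, wait, for, the, new, superhero, movie. Each keyword has frequency 1 in its category.

- "I love playing basketball!": no word is a training keyword, so every category scores \(0\); the three-way tie goes to category 1, which appears first.
- "Latest news on the economy": only "the" matches, so categories 2 and 3 each score \(1\) and category 1 scores \(0\); the tie goes to category 2.
- "The new movie is amazing!": category 3 scores \(1 + 1 + 1 = 3\) (the, new, movie), category 2 scores \(1\) (the), and category 1 scores \(0\), so the prediction is 3.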

#K5401. Keyword-Based Tweet Classification
