#K5401. Keyword-Based Tweet Classification
This problem requires you to implement a simple keyword-based tweet classification system. You are given a set of training tweets, each with an identifier, the tweet text, and a category label. Your task is to preprocess the tweets by converting all letters to lowercase and removing special characters, then build a keyword frequency model for each category. Finally, you will classify a set of test tweets by summing the keyword frequencies for each category and choosing the category with the highest score. In the event of a tie, choose the category that appeared first in the training data.
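The preprocessing step can be sketched as follows. This is a minimal illustration, not a reference solution: the statement does not define "special characters", so the sketch assumes they are anything other than ASCII letters, digits, and whitespace, and that they are deleted rather than replaced by spaces. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(text: str) -> list[str]:
    """Lowercase the text, delete special characters, and split into words."""
    lowered = text.lower()
    # Assumption: a "special character" is anything that is not an ASCII
    # letter, a digit, or whitespace. Such characters are deleted outright.
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    return cleaned.split()


print(preprocess("Can't wait for the new superhero movie!"))
# ['cant', 'wait', 'for', 'the', 'new', 'superhero', 'movie']
```

Under this assumption an apostrophe joins its neighbours ("Can't" becomes "cant"). Replacing special characters with spaces instead would yield "can" and "t", so the choice can matter for other inputs.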
The score of a tweet for a given category is the sum, over the words of the preprocessed tweet, of each word's frequency in that category's training tweets:
\(\text{score}(\text{category}) = \sum_{w \in \text{tweet}} \text{frequency}(w, \text{category})\)
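One way to realize the frequency model and the scoring rule is sketched below. The names `build_model` and `classify` are hypothetical, the inputs are assumed to be already preprocessed word lists, and the sketch assumes that a word repeated in a test tweet contributes once per occurrence.

```python
from collections import Counter


def build_model(samples):
    """Map each category to a Counter of word frequencies.

    `samples` is a list of (words, category) pairs. A dict keeps insertion
    order, so categories stay in order of first appearance in the training data.
    """
    model = {}
    for words, category in samples:
        model.setdefault(category, Counter()).update(words)
    return model


def classify(words, model):
    """Return the category with the highest summed word frequency."""
    best_category, best_score = None, -1
    for category, freq in model.items():
        # A Counter returns 0 for words it has never seen.
        score = sum(freq[w] for w in words)
        # Strict '>' keeps the earliest category when scores tie.
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = build_model([
    (["just", "won", "a", "marathon"], "1"),
    (["breaking", "the", "stock", "market", "crashes"], "2"),
    (["cant", "wait", "for", "the", "new", "superhero", "movie"], "3"),
])
print(classify(["the", "new", "movie", "is", "amazing"], model))  # 3
```

Iterating over the categories in first-seen order and comparing with a strict inequality implements the tie-breaking rule without any extra bookkeeping.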
You must read input from STDIN and write the output to STDOUT.
## Input Format
The input consists of two parts.
- The first part describes the training data:
  - The first line contains an integer \(N\), the number of training tweets.
  - Each of the next \(N\) lines contains a training sample in the format `tweet_id;tweet_text;category`.
- The second part describes the test tweets:
  - The next line contains an integer \(M\), the number of tweets to classify.
  - Each of the following \(M\) lines contains a tweet text.
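Reading this layout might look like the sketch below. The function name `parse_input` is hypothetical. The statement does not say whether a tweet text may itself contain `;`, so the sketch assumes the identifier ends at the first semicolon and the category begins after the last one.

```python
def parse_input(lines):
    """Split the input lines into training samples and test tweets."""
    n = int(lines[0])
    training = []
    for line in lines[1:1 + n]:
        # Assumption: the id ends at the first ';' and the category starts
        # after the last ';', so any ';' inside the tweet text is preserved.
        tweet_id, rest = line.split(";", 1)
        text, category = rest.rsplit(";", 1)
        training.append((tweet_id, text, category.strip()))
    m = int(lines[1 + n])
    tests = lines[2 + n:2 + n + m]
    return training, tests


training, tests = parse_input([
    "2",
    "1;Just won a marathon!;1",
    "2;Breaking: the stock market crashes!;2",
    "1",
    "Latest news on the economy",
])
print(training[1])  # ('2', 'Breaking: the stock market crashes!', '2')
print(tests)        # ['Latest news on the economy']
```

In a full program the list of lines could come from `sys.stdin.read().splitlines()`.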
## Output Format
Output \(M\) lines, each containing the predicted category for the corresponding test tweet.
## sample
Input:
3
1;Just won a marathon!;1
2;Breaking: the stock market crashes!;2
3;Can't wait for the new superhero movie!;3
3
I love playing basketball!
Latest news on the economy
The new movie is amazing!
Output:
1
2
3
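The sample can be checked by hand. The scores below assume that special characters are deleted and that each training word counts once per occurrence; \(s_1\), \(s_2\), \(s_3\) denote the scores for categories 1, 2, and 3.

\[
\begin{array}{l|ccc|c}
\text{test tweet} & s_1 & s_2 & s_3 & \text{prediction} \\ \hline
\text{i love playing basketball} & 0 & 0 & 0 & 1 \\
\text{latest news on the economy} & 0 & 1 & 1 & 2 \\
\text{the new movie is amazing} & 0 & 1 & 3 & 3
\end{array}
\]

The first tweet shares no word with any training tweet, so all three scores tie at 0 and category 1 wins because it appears first in the training data. In the second tweet only "the" matches, and it occurs once in both category 2 and category 3, so the tie goes to category 2. In the third tweet "the", "new", and "movie" all occur in category 3.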