#P9800. Source Code Compression and Identifier Renaming

    ID: 22946 Type: Default 1000ms 256MiB

Source Code Compression and Identifier Renaming

Source Code Compression and Identifier Renaming

You are given a program consisting of several lines. Each line contains zero or more tokens separated by spaces. Your task is to "compress" the program by performing the following steps:

  1. Comment removal: In each line, if a # character is encountered, it marks the beginning of a comment. The # and everything following it on that line should be ignored.
  2. Tokenization: After comment removal, parse each line from left to right by repeatedly skipping any spaces and extracting the longest possible token. A token is one of the following:
    • Reserved identifier: A fixed string consisting of one or more non-space ASCII characters (and not containing #). These include operators, separators, literals, reserved keywords, or library function names. These tokens are preserved as they are.
    • Numeric identifier: A sequence of one or more digits. It is described by the regular expression: \( ^[0-9]+$ \).
    • Word identifier: A sequence of characters taken from the set of lowercase letters, uppercase letters, digits, underscore _, and dollar sign $, with the condition that the first character is not a digit. (i.e. it matches \( ^[A-Za-z_$][A-Za-z0-9_$]*$ \)).

    Note: Even if a token matches the numeric or word definition, if it exactly equals one of the reserved identifiers (from a given reserved list), it must be treated as a reserved identifier. For simplicity, you can assume that any token which contains characters outside the [A-Za-z0-9_$] set is a reserved identifier.

  3. Renaming word identifiers: When processing the sequence of tokens, every time you encounter a word identifier (as defined above) that is not a reserved identifier, perform the following renaming using a target word list s:
    • Define \( s \) as the sequence of all strings made up solely of lowercase letters, sorted first by length and then in lexicographical order. In other words, \( s = \{a, b, \dots, z, aa, ab, \dots\} \).
    • The first time a new word identifier appears, rename it to the first word in \( s \). Every subsequent occurrence of that same identifier is renamed to the same new name. Then, the next new (distinct) word identifier is renamed using the next word in \( s \), and so on.
  4. Whitespace removal: You are allowed to remove unnecessary spaces and newlines. However, you must not remove spaces in such a way that two tokens that were originally separate become concatenated into a new token that qualifies as an identifier. To ensure this, when concatenating tokens, if the last character of the previous token and the first character of the current token are both in the set [A-Za-z0-9_$] (i.e. they could form a word), then a space must be inserted between them.

Your program should read the source code from input and output the compressed version according to the above rules.

LaTeX Formulas:

  • Numeric identifier: \( ^[0-9]+$ \)
  • Word identifier: \( ^[A-Za-z_$][A-Za-z0-9_$]*$ \)

inputFormat

The input consists of several lines representing the source code. Each line may contain tokens separated by spaces. Comments starting with a # should be ignored.

outputFormat

Output a single line which is the compressed version of the input source code after processing all lines using the steps defined above.

sample

abc 123
a 123