Module tokenization
This module allows you to tokenize dictionaries for better results.
Functions
def build_2D_substitute_matrix(dictionary, alphabet, substitute_dict)
build_2D_substitute_matrix()
Initializes and fills a two-dimensional matrix (a dict of dicts) by browsing the dictionary.
- dictionary (list): the input dictionary (after processing)
- alphabet (list): the used alphabet (from input file or from dictionary)
- substitute_dict (dict): the substituted characters indexed by single substitution character
- return (dict): the matrix representing the probability of one letter following another
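The dict-of-dicts layout described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the name `build_bigram_matrix` is hypothetical, the sketch stores raw counts rather than normalized probabilities, and it leaves out `substitute_dict` because its exact role in the real function is not documented here.

```python
def build_bigram_matrix(dictionary, alphabet):
    """Count how often each character of `alphabet` is directly followed by another."""
    # One row per first character, one column per second character.
    matrix = {first: {second: 0 for second in alphabet} for first in alphabet}
    for word in dictionary:
        # zip(word, word[1:]) yields every pair of consecutive characters.
        for first, second in zip(word, word[1:]):
            if first in matrix and second in matrix[first]:
                matrix[first][second] += 1
    return matrix


matrix = build_bigram_matrix(["abba", "abc"], ["a", "b", "c"])
print(matrix["a"]["b"])  # 2
```

Characters outside the alphabet are skipped in this sketch; the real function may handle them differently.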
def check_tokenizable(dictionary)
check_tokenizable()
Checks whether the dictionary contains any word with a digit or an uppercase character.
- dictionary (list): the input dictionary (after processing)
- return (bool): False if any word contains a digit or an uppercase character, True otherwise
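A check of this kind could look like the sketch below. The name `is_tokenizable` is hypothetical and the code only mirrors the documented return contract; it is not taken from the module's source.

```python
def is_tokenizable(dictionary):
    """Return False if any word contains a digit or an uppercase character."""
    return not any(
        ch.isdigit() or ch.isupper()
        for word in dictionary
        for ch in word
    )


print(is_tokenizable(["hello", "world"]))  # True
print(is_tokenizable(["Hello", "world"]))  # False
```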
def find_max(matrix, alphabet)
find_max()
Finds the most frequent character sequence.
- matrix (dict): the matrix representing the probability of one letter following another
- alphabet (list): the used alphabet (from input file or from dictionary)
- return (tuple): the most frequent consecutive character sequence
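One way to find the most frequent pair is a plain scan over every cell of the matrix. The sketch below assumes the dict-of-dicts layout described for `build_2D_substitute_matrix()`; the name `find_most_frequent_pair` and the tie-breaking rule (first pair met in alphabet order wins) are assumptions, not documented behavior.

```python
def find_most_frequent_pair(matrix, alphabet):
    """Return the (first, second) pair with the highest value in the matrix."""
    best_pair, best_value = None, -1
    for first in alphabet:
        for second in alphabet:
            value = matrix[first][second]
            # Strict comparison keeps the earliest pair on ties.
            if value > best_value:
                best_pair, best_value = (first, second), value
    return best_pair


example = {"a": {"a": 0, "b": 2}, "b": {"a": 1, "b": 1}}
print(find_most_frequent_pair(example, ["a", "b"]))  # ('a', 'b')
```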
def plot_2D_matrix(matrix, alphabet, filename)
plot_2D_matrix()
Plots the matrix as a diagram using matplotlib.
- matrix (dict): the matrix representing the probability of one letter following another
- alphabet (list): the used alphabet (from input file or from dictionary)
- filename (str): the name of the file to save the plot to
- return (None)
def print_2D_matrix(matrix, alphabet)
print_2D_matrix()
Prints the matrix row by row.
- matrix (dict): the matrix representing the probability of one letter following another
- alphabet (list): the used alphabet (from input file or from dictionary)
- return (None)
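Row-by-row printing of a dict-of-dicts matrix can be sketched as below. The helper `format_matrix_rows` and the header-row layout are hypothetical; the actual output format of `print_2D_matrix()` is not specified in this reference.

```python
def format_matrix_rows(matrix, alphabet):
    """Build one text line per matrix row, preceded by a header of column labels."""
    rows = ["  " + " ".join(alphabet)]
    for first in alphabet:
        cells = " ".join(str(matrix[first][second]) for second in alphabet)
        rows.append(f"{first} {cells}")
    return rows


example = {"a": {"a": 0, "b": 2}, "b": {"a": 1, "b": 1}}
rows = format_matrix_rows(example, ["a", "b"])
for row in rows:
    print(row)
```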
def reverse_substitution(word, substitute_dict)
reverse_substitution()
Decodes a word from its substituted form back to human-readable form.
- word (str): the word to decode back
- substitute_dict (dict): the substituted characters indexed by single substitution character
- return (str): the decoded word
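Decoding can be sketched as a single pass over the word. The sketch assumes `substitute_dict` maps each single substitution character to the character sequence it stands for, as the parameter description suggests, and that substitutions are not nested. The name `expand_substitutions` and the sample mapping are hypothetical.

```python
def expand_substitutions(word, substitute_dict):
    """Replace each substitution character with the sequence it stands for."""
    # Characters that are not substitution characters are kept unchanged.
    return "".join(substitute_dict.get(ch, ch) for ch in word)


sample_substitutes = {"A": "th", "B": "er"}  # hypothetical mapping
print(expand_substitutions("oAB", sample_substitutes))  # other
```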
def write_substitute_dictionary(dictionary, substitute_dict, filename)
write_substitute_dictionary()
Writes the dictionary to a file with substitutions.
- dictionary (list): the input dictionary (after processing)
- substitute_dict (dict): the substituted characters indexed by single substitution character
- filename (str): the name of the file to open (write mode)
- return (None)
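Writing the substituted dictionary might look like the sketch below. It rests on several assumptions that this reference does not confirm: each sequence is replaced by its substitution character in the mapping's insertion order, the output holds one word per line, and the file is UTF-8. The name `write_with_substitutions` and the sample data are hypothetical.

```python
import os
import tempfile


def write_with_substitutions(dictionary, substitute_dict, filename):
    """Write each word on its own line, with sequences replaced by their tokens."""
    with open(filename, "w", encoding="utf-8") as handle:
        for word in dictionary:
            # substitute_dict maps token -> sequence, so replace sequence by token.
            for token, sequence in substitute_dict.items():
                word = word.replace(sequence, token)
            handle.write(word + "\n")


path = os.path.join(tempfile.mkdtemp(), "substituted.txt")
write_with_substitutions(["other", "the"], {"A": "th", "B": "er"}, path)
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)  # ['oAB', 'Ae']
```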