Module tokenization

This module allows you to tokenize dictionaries for better results.

Functions

def build_2D_substitute_matrix(dictionary, alphabet, substitute_dict)

build_2D_substitute_matrix() initializes and fills a two-dimensional matrix (a dict of dicts) by iterating over the dictionary.

  • dictionary (list): the input dictionary (after processing)
  • alphabet (list): the used alphabet (from input file or from dictionary)
  • substitute_dict (dict): the substituted characters indexed by single substitution character
  • return (dict): the matrix representing the probability of one letter following another
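
To illustrate the dict-of-dicts layout, the sketch below counts how often each character is immediately followed by another and normalizes the counts into relative frequencies. It is an assumption about the general idea, not this module's actual implementation: the name build_pair_matrix is hypothetical, and the substitute_dict argument is left out because its exact role in filling the matrix is not documented here.

```python
def build_pair_matrix(dictionary, alphabet):
    """Hypothetical sketch: relative frequency of each adjacent character pair."""
    # Initialize every cell to 0 so matrix[a][b] exists for all pairs.
    matrix = {first: {second: 0 for second in alphabet} for first in alphabet}
    total = 0
    for word in dictionary:
        # zip(word, word[1:]) yields each pair of consecutive characters.
        for first, second in zip(word, word[1:]):
            if first in matrix and second in matrix[first]:
                matrix[first][second] += 1
                total += 1
    # Normalize counts into shares of all counted pairs (skipped if none were seen).
    if total:
        for first in alphabet:
            for second in alphabet:
                matrix[first][second] /= total
    return matrix


pair_matrix = build_pair_matrix(["abab", "ba"], ["a", "b"])
print(pair_matrix["a"]["b"])  # share of all adjacent pairs that are "ab"
```

Characters outside the alphabet are simply ignored in this sketch; the real function may treat them differently.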

def check_tokenizable(dictionary)

check_tokenizable() checks if the dictionary contains any word with a digit or an uppercase character.

  • dictionary (list): the input dictionary (after processing)
  • return (bool): False if any word contains a digit or an uppercase character, True otherwise
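
A check of this kind can be expressed in a few lines. The following is a minimal sketch based only on the description above; the name is_tokenizable is hypothetical and the module's own code may differ.

```python
def is_tokenizable(dictionary):
    """Hypothetical sketch: True only if no word has a digit or uppercase character."""
    return not any(
        ch.isdigit() or ch.isupper()
        for word in dictionary
        for ch in word
    )


print(is_tokenizable(["hello", "world"]))
print(is_tokenizable(["Hello", "world"]))
```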

def find_max(matrix, alphabet)

find_max() finds the most frequent character sequence.

  • matrix (dict): the matrix representing the probability of one letter following another
  • alphabet (list): the used alphabet (from input file or from dictionary)
  • return (tuple): the most frequent consecutive character sequence
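
One way to locate the largest cell in a dict-of-dicts matrix is a plain double loop over the alphabet. The sketch below assumes the returned tuple is the (first, second) character pair; the name find_most_frequent_pair and the tie-breaking rule are illustrative choices, not taken from the module.

```python
def find_most_frequent_pair(matrix, alphabet):
    """Hypothetical sketch: return the (first, second) pair with the highest value."""
    best_pair = None
    best_value = float("-inf")
    for first in alphabet:
        for second in alphabet:
            value = matrix[first][second]
            # Strict comparison: on a tie, the pair seen first is kept.
            if value > best_value:
                best_pair, best_value = (first, second), value
    return best_pair


example_matrix = {"a": {"a": 0.0, "b": 0.5}, "b": {"a": 0.25, "b": 0.25}}
print(find_most_frequent_pair(example_matrix, ["a", "b"]))
```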

def plot_2D_matrix(matrix, alphabet, filename)

plot_2D_matrix() plots the matrix as a diagram using matplotlib.

  • matrix (dict): the matrix representing the probability of one letter following another
  • alphabet (list): the used alphabet (from input file or from dictionary)
  • filename (str): the name of the file to save the plot to
  • return (None)

def print_2D_matrix(matrix, alphabet)

print_2D_matrix() prints the matrix row by row.

  • matrix (dict): the matrix representing the probability of one letter following another
  • alphabet (list): the used alphabet (from input file or from dictionary)
  • return (None)
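
Row-by-row output of a dict-of-dicts matrix might look like the sketch below. The helper format_matrix_rows is hypothetical, and the column width and two-decimal formatting are arbitrary choices made for the example; it returns the lines instead of printing them directly so that the layout is easy to inspect.

```python
def format_matrix_rows(matrix, alphabet):
    """Hypothetical sketch: a header line followed by one text line per matrix row."""
    header = "  " + " ".join(f"{col:>5}" for col in alphabet)
    lines = [header]
    for row in alphabet:
        # Each cell is right-aligned in 5 characters with 2 decimals.
        cells = " ".join(f"{matrix[row][col]:5.2f}" for col in alphabet)
        lines.append(f"{row} {cells}")
    return lines


demo_matrix = {"a": {"a": 0.0, "b": 0.5}, "b": {"a": 0.25, "b": 0.25}}
for line in format_matrix_rows(demo_matrix, ["a", "b"]):
    print(line)
```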

def reverse_substitution(word, substitute_dict)

reverse_substitution() decodes a word from its substituted form back to human-readable text.

  • word (str): the word to decode
  • substitute_dict (dict): the substituted characters indexed by single substitution character
  • return (str): the decoded word
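
Given a mapping from each single substitution character to the character sequence it replaces, decoding can be a single pass over the word. The sketch below makes two assumptions that are not confirmed by this page: substitutions are not nested (a replaced sequence never contains another substitution character), and characters without an entry are kept as they are. The name decode_word and the sample mapping are illustrative only.

```python
def decode_word(word, substitute_dict):
    """Hypothetical sketch: expand each substitution character back to its sequence."""
    # Characters without an entry in substitute_dict are copied unchanged.
    return "".join(substitute_dict.get(ch, ch) for ch in word)


sample_substitutes = {"A": "th", "B": "er"}  # illustrative values only
print(decode_word("oAB", sample_substitutes))
```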

def write_substitute_dictionary(dictionary, substitute_dict, filename)

write_substitute_dictionary() writes the dictionary to a file with substitutions.

  • dictionary (list): the input dictionary (after processing)
  • substitute_dict (dict): the substituted characters indexed by single substitution character
  • filename (str): the name of the file to open (write mode)
  • return (None)