Encryption Series: Breaking a Substitution Cypher

We automate the cryptanalysis of a substitution cipher using Rust. With letter frequency data and a dictionary, the script makes educated guesses, applies substitutions, and iteratively refines them to recover plaintext.

Encryption Series: Breaking a Substitution Cypher
Photo by Samuel Marques Lucio / Unsplash

This article directly follows my earlier piece on monoalphabetic cipher implementation and manual cryptanalysis. If you have not yet explored that, I suggest reading it first, as this article will specifically demonstrate how to automate the breaking process programmatically.

To begin, we require a comprehensive list of common English words. Fortuitously, there exists a repository containing the 10,000 most frequently used English words, thoughtfully ranked by their prevalence—ideal for our purposes and notably beneficial in terms of computational efficiency. The repository may be found here:

https://github.com/first20hours/google-10000-english/tree/master

For convenience, I have renamed the downloaded file english_words.csv and placed it neatly within the root directory of my development container.

Strategic Considerations

In my earlier manual efforts, I initially resorted to letter-frequency analysis, a venerable technique employed since the earliest days of cryptography. However, given modern computing speeds (and particularly Rust’s admirable performance), frequency analysis, though intellectually appealing, offers limited practical benefit here. Iterating substitutions through the modest 26-letter English alphabet is so trivially quick that it obviates the need for preliminary frequency calculations. Should you nonetheless wish to implement a frequency analysis algorithm, perhaps for didactic purposes, the complexity resembles a straightforward LeetCode exercise, little more than populating a frequency hashmap.

Algorithmic Approach

Our automated method for breaking the substitution cipher relies on exploiting characteristic patterns in the encoded words, rather than traditional letter-frequency analysis. Given that the cipher consistently maps each plaintext letter to exactly one ciphertext character (and vice versa), each unique word in the ciphertext will yield a distinct "pattern signature," reflecting its structure of repeated and unique characters. We then systematically match these signatures against a pre-processed dictionary of English words to narrow down possible substitutions.

Step 1: Pattern Signature Generation

We first convert each ciphertext word into a numerical vector capturing its internal character repetition pattern. Consider the word "HELLO"; it translates neatly into the numerical sequence [0, 1, 2, 2, 3], since 'L' appears twice in positions 2 and 3. This technique is implemented as:

/// Compute the pattern signature of a word:
/// e.g. "HELLO" -> vec![0,1,2,2,3]
fn pattern_signature(word: &str) -> Vec<usize> {
    let mut map = HashMap::new();
    let mut next_id = 0;
    word.chars().map(|c| {
        let entry = map.entry(c).or_insert_with(|| {
            let id = next_id;
            next_id += 1;
            id
        });
        *entry
    }).collect()
}

Step 2: Dictionary Preparation

Before decoding, we process our dictionary (in our case, english_words.csv) into a lookup structure. This involves computing the pattern signatures for all dictionary words and grouping them accordingly. The structure allows rapid retrieval of candidate words matching a given ciphertext pattern signature:

/// Load dictionary from a CSV (one word per line, no header),
/// group by pattern signature.
fn load_dictionary(path: &str) -> Result<HashMap<Vec<usize>, Vec<String>>, Box<dyn Error>> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut by_pattern: HashMap<Vec<usize>, Vec<String>> = HashMap::new();

    for line in reader.lines() {
        let word = line?.trim().to_uppercase();
        if word.chars().all(|c| c.is_ascii_alphabetic()) {
            let sig = pattern_signature(&word);
            by_pattern.entry(sig).or_default().push(word);
        }
    }
    Ok(by_pattern)
}

Step 3: Extracting and Sorting Ciphertext Words

From the ciphertext, we extract all unique words, filtering out trivial or single-letter words. To maximise decoding efficiency, we arrange these unique ciphertext words in descending order of length. Longer words typically present more distinctive pattern signatures, significantly reducing ambiguity early in our decoding process.

Step 4: Recursive Backtracking and Letter Substitution

The core of our solution employs a recursive backtracking algorithm. Starting with the longest word, we attempt to match each ciphertext word’s pattern signature against candidate words from our dictionary:

  • For each candidate, we provisionally assign substitutions, ensuring each ciphertext character consistently maps to exactly one plaintext character and vice versa.
  • If a candidate introduces contradictions—such as mapping one ciphertext character to multiple plaintext characters or assigning the same plaintext character to multiple ciphertext letters—we discard it immediately and try another candidate.
  • Once a suitable match is found for a word, the algorithm proceeds recursively to the next word.
  • If at any stage no valid substitution exists, the algorithm gracefully backtracks, retracting tentative substitutions and exploring alternative paths.

This approach is encoded in the following recursive routine:

/// Recursively attempt to build a full mapping by trying to match each
/// ciphertext word to candidates of the same pattern.
fn solve_recursive<'a>(
    words: &[String],
    dict: &'a HashMap<Vec<usize>, Vec<String>>,
    idx: usize,
    mapping: &mut HashMap<char, char>,
    used: &mut HashSet<char>,
) -> Option<HashMap<char,char>> {
    if idx == words.len() {
        return Some(mapping.clone());
    }

    let cipher = &words[idx];
    let sig = pattern_signature(cipher);
    if let Some(candidates) = dict.get(&sig) {
        'candidate: for cand in candidates {
            // check compatibility
            let mut new_pairs = Vec::new();
            for (c_c, p_c) in cipher.chars().zip(cand.chars()) {
                if let Some(&mapped) = mapping.get(&c_c) {
                    if mapped != p_c { continue 'candidate; }
                } else {
                    if used.contains(&p_c) { continue 'candidate; }
                    new_pairs.push((c_c, p_c));
                }
            }
            // extend mapping
            for &(c_c, p_c) in &new_pairs {
                mapping.insert(c_c, p_c);
                used.insert(p_c);
            }
            // recurse
            if let Some(sol) = solve_recursive(words, dict, idx + 1, mapping, used) {
                return Some(sol);
            }
            // backtrack
            for &(c_c, p_c) in &new_pairs {
                mapping.remove(&c_c);
                used.remove(&p_c);
            }
        }
    }
    None
}

Step 5: Final Reconstruction of the Plaintext

Once a complete, consistent mapping is found—one that seamlessly decodes every ciphertext word—we apply the substitutions to the entire ciphertext message, including punctuation and spacing. Any unresolved character mappings (though rare if the algorithm has succeeded) are marked explicitly with placeholders.

This final decoding step is implemented in the following code:

/// Attempt to break the substitution cipher in `ciphertext` using dictionary at `dict_path`.
/// Returns the first full-length English decode found.
pub fn break_substitution(ciphertext: &str, dict_path: &str) -> Result<Option<String>, Box<dyn Error>> {
    // 1. Load dictionary
    let dict = load_dictionary(dict_path)?;

    // 2. Extract and sort unique words in ciphertext (longest first)
    let mut words: Vec<String> = ciphertext
        .split(|c: char| !c.is_ascii_alphabetic())
        .filter(|w| w.len() >= 2)
        .map(|w| w.to_uppercase())
        .collect();
    words.sort_by_key(|w| std::cmp::Reverse(w.len()));
    words.dedup();

    // 3. Backtracking state
    let mut mapping = HashMap::new();
    let mut used = HashSet::new();

    // 4. Solve
    if let Some(sol_map) = solve_recursive(&words, &dict, 0, &mut mapping, &mut used) {
        // Apply to full ciphertext
        let plaintext: String = ciphertext.chars().map(|c| {
            if c.is_ascii_alphabetic() {
                let up = c.to_ascii_uppercase();
                if let Some(&p) = sol_map.get(&up) {
                    if c.is_ascii_lowercase() { p.to_ascii_lowercase() } else { p }
                } else {
                    '?'  // unmapped
                }
            } else {
                c
            }
        }).collect();
        Ok(Some(plaintext))
    } else {
        Ok(None)
    }
}

Step 6: Practical Demonstration

Applying the described algorithm to our ciphertext yields a rapid and precise plaintext decoding. The efficiency of this method is particularly evident when compared with manual cryptanalysis:

vscode ➜ /workspaces/vscode-remote-try-rust (main) $ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/encryption`
Encoded: sk mpg cgvw rkk p rjcxd obrdpgmk pjkpo lzd sk mpg rkk tvkgdw djkxk djpd gkkor dc lk ocgk
Decoded: we can only see a short distance ahead but we can see plenty there that needs to be done

Though conceptually straightforward, the implementation involves a modest quantity of code. Consequently, I have supplied a fully commented Rust source file (main.rs), which you may download and experiment with directly:

In practice, the algorithm executes in mere fractions of a second, even without optimisations, highlighting the impressive performance of Rust’s robust standard library and efficient compilation. Such speed starkly contrasts with manual methods, offering clear evidence of computational cryptanalysis’s practical advantages.