Encryption Series: The Monoalphabetic Substitution Cipher

An introduction to the monoalphabetic substitution cipher, its historical roots, keyspace complexity, and a full implementation in Rust. We demonstrate how it improves on Caesar’s cipher, and where it still falls short.

Encryption Series: The Monoalphabetic Substitution Cipher
Photo by John Jennings / Unsplash

In our previous exploration into encryption, we examined the Caesar cipher and quickly recognised its vulnerability. Indeed, the Caesar cipher proved so trivial that one might decipher it mentally: mere pattern recognition could suffice. Consider, for instance, the encrypted message below:

Amm mgm bw mgm

We observe that there are three three-letter words, two which are identical words ("mgm") and a solitary two-letter word ("bw"). Crucially, the letter "m" occurs disproportionately frequently, constituting more than half of the entire text. The presence of repeated letters further reduces the complexity of potential solutions.

Approaching this cipher systematically reveals several avenues of attack. Firstly, the English language contains a finite set of two- and three-letter words; the occurrence of double letters narrows this set further. Secondly, certain letters are inherently more frequent in English texts, providing a strategic starting point for analysis.

If we hypothesise the most common English letter ("E") corresponds to "m", the cipher transforms to:

Aee ege bw ege

This appears promising. Assuming the Caesar cipher, testing reveals a shift of eight, which, when applied across all letters, yields the plain text:

See eye to eye

Thus, without brute force or computational assistance, the message has succumbed to mere educated guesswork. Clearly, to preserve secrecy, we must refine our cipher substantially. In engineering, such iterative improvement is fundamental; perfection is an asymptotic ideal, approached incrementally. A good engineer addresses problems sequentially; a great engineer adopts a holistic view. However, the experienced professional resists the seductive danger of scope creep. Let us emulate the experienced.

The critical weakness in the Caesar cipher lies in its uniformity. Each letter shifts by an identical numerical key between 1 and 25 (excluding the redundant 26), allowing an adversary to deduce the key from a single correct guess. Therefore, a more secure approach would employ a distinct substitution for each letter. Instead of a numerical key, we adopt a complete substitution alphabet.

Though this insight may seem innovative, it is nearly as ancient as Caesar's method. Historically, civilisations have repeatedly harnessed substitution ciphers in warfare to preserve strategic secrecy. Warfare has often accelerated technological advancements, with encryption being among its notable beneficiaries.

Implementing a monoalphabetic substitution cipher involves assigning each letter of the plain alphabet to a unique cipher letter. Crucially, the mapping must remain bijective: no cipher letter may correspond to more than one plain letter, preserving information integrity as mandated by the CIA triad (Confidentiality, Integrity, Availability).

Consider the following illustrative example:

Plain Alphabet:    ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher Alphabet:   QWERTYUIOPASDFGHJKLZXCVBNM

This cipher alphabet employs the QWERTY keyboard layout, a layout familiar to most readers, as its base. Unlike the Caesar cipher, here a letter may coincidentally map to itself without security consequences, since detection of such an occurrence would be impossible without additional information.

Applying this substitution cipher to our earlier plain text "See eye to eye," we obtain:

  • S -> L
  • E -> T
  • Y -> N
  • T -> Z
  • O -> G

This gives us:

Ltt tnt zg tnt

Attempting to decrypt this message through simplistic letter substitution (such as replacing the frequent "t" with "e") yields no useful information. Here, frequency analysis would need to be more rigorous. However, before we delve deeper into decryption strategies, let us further explore the practical implementation of this cipher.

The Mathematics Underpinning Substitution Ciphers

The cipher's theoretical complexity derives from permutations. With 26 letters, the total number of distinct cipher alphabets possible is 26 factorial (26!). That's (\(1\times2\times3\times\ldots\times25\times26\)):

\(26! = 403,291,461,126,605,635,584,000,000 \)

This number is sufficiently vast to discourage brute-force attacks through sheer computational magnitude.

Implementation in Code

use std::collections::HashMap;

fn create_substitution_map(key: &str) -> HashMap<char, char> {
    let alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    alphabet.chars().zip(key.chars()).collect()
}

fn encrypt(plaintext: &str, key: &str) -> String {
    let sub_map = create_substitution_map(key);
    plaintext.chars().map(|c| {
        if c.is_ascii_alphabetic() {
            // work in uppercase space for lookup
            let upper = c.to_ascii_uppercase();
            // look up, defaulting to the letter itself
            let mapped = sub_map.get(&upper).unwrap_or(&upper);
            // restore original case
            if c.is_ascii_lowercase() {
                mapped.to_ascii_lowercase()
            } else {
                *mapped
            }
        } else {
            c
        }
    }).collect()
}

fn decrypt(ciphertext: &str, key: &str) -> String {
    let sub_map = create_substitution_map(key);
    // build an owned inverse map: cipher-char → plain-char
    let inv_map: HashMap<char, char> =
        sub_map.iter().map(|(&plain, &cipher)| (cipher, plain)).collect();

    ciphertext.chars().map(|c| {
        if c.is_ascii_alphabetic() {
            let upper = c.to_ascii_uppercase();
            let mapped = inv_map.get(&upper).unwrap_or(&upper);
            if c.is_ascii_lowercase() {
                mapped.to_ascii_lowercase()
            } else {
                *mapped
            }
        } else {
            c
        }
    }).collect()
}

fn main() {
    let key = "QWERTYUIOPASDFGHJKLZXCVBNM";
    let plaintext = "See eye to eye";
    let encrypted = encrypt(plaintext, key);
    let decrypted = decrypt(&encrypted, key);

    println!("Original:  {}", plaintext);
    println!("Encrypted: {}", encrypted);
    println!("Decrypted: {}", decrypted);
}

In our provided Rust implementation, the cipher alphabet serves as our key. We construct mappings to translate plaintext into ciphertext (encryption) and ciphertext back into plaintext (decryption). Default behaviours gracefully handle unexpected or non-alphabetic characters. Admittedly, further abstraction might optimise readability, but our present aim is cryptographic clarity rather than pristine programming style.

Returning to the task of cipher-breaking, we encounter the cipher's primary weakness: frequency analysis. Although brute-forcing all permutations is computationally impractical, language's inherent statistical patterns provide a critical vulnerability.

Consider a new ciphertext encrypted using a fresh substitution alphabet:

sk mpg cgvw rkk p rjcxd obrdpgmk pjkpo lzd sk mpg rkk tvkgdw djkxk djpd gkkor dc lk ocgk

All punctuation and uppercase letters have been removed deliberately, complicating direct interpretation through visual clues alone.

Letter |   Count |    Percent
-------+---------+------------
K      |      15 |     21.43%
D      |       8 |     11.43%
P      |       7 |     10.00%
G      |       7 |     10.00%
R      |       5 |      7.14%
O      |       4 |      5.71%
J      |       4 |      5.71%
C      |       4 |      5.71%
M      |       3 |      4.29%
W      |       2 |      2.86%
X      |       2 |      2.86%
V      |       2 |      2.86%
L      |       2 |      2.86%
S      |       2 |      2.86%
Z      |       1 |      1.43%
T      |       1 |      1.43%
B      |       1 |      1.43%

Letter frequency analysis of the ciphertext

According to Wikipedia, the letter "E" is the most frequently occurring letter in English, appearing approximately 12.7% of the time. Consequently, our first logical substitution is to replace the cipher's most common letter, "K", with "E". Additionally, I have noticed the presence of a single-letter word: "P". Single-letter words in English are invariably either "I" or "A"; with "A" slightly more prevalent (8.2% frequency compared to 7.0% for "I"), and given that "P" appears several times throughout our ciphertext, we shall tentatively replace "P" with "A". Applying these initial substitutions yields the following intermediate text:

se mag cgvw ree a rjcxd obrdagme ajeao lzd se mag ree tvegdw djexe djad geeor dc le ocge

While still nonsensical, this version of the text invites further analysis. The letter "D" is the second most frequent character in our cipher. According to the aforementioned Wikipedia source, "T" is the second most frequent letter in English, occurring roughly 9.1% of the time. Therefore, our next step is to substitute "D" with "T". This substitution produces the following result:

se mag cgvw ree a rjcxt obrtagme ajeao lzt se mag ree tvegtw tjexe tjat geeor tc le ocge

To make further progress, we must identify patterns. Two-letter words in particular are valuable indicators, and all two-letter words identified so far appear to contain the letter we have presumed as "E" (though we remain cautiously in guesswork mode).

The letters "S" and "L" each appear in two-letter words within our ciphertext, yet they are relatively infrequent according to our frequency table. However, the three-letter word "ree" offers promise, since "R" has a moderate frequency of 7.14%. In English, there are relatively few three-letter words ending with a double "e", with "see" being notably common. Considering "S" has a frequency of approximately 6.3% (again referencing Wikipedia), it is a reasonable candidate for substitution. Thus, replacing "R" with "S" gives us:

se mag cgvw see a sjcxt obstagme ajeao lzt se mag see tvegtw tjexe tjat geeos tc le ocge

At this juncture, we have successfully substituted four of the five most frequently occurring letters. Our next logical step involves deducing the likely plaintext representation of the cipher letter "G". It remains fairly common and notably appears in the ciphertext word "gkkor", which, with previous substitutions ("K" → "E" and "R" → "S"), now reads "geeos". There are numerous five-letter English words containing a double "e". However, by constraining our search to words specifically ending with the letters "e", "e", and "s" in succession, we significantly narrow our possibilities.

Consulting Merriam-Webster, we find 63 English words matching this particular pattern, of which only 23 are classified as common. Merriam-Webster provides authoritative definitions for over 300,000 English words and is continuously updated with new meanings and lexical entries (https://www.merriam-webster.com).

Merriam-Webster: America’s Most Trusted Dictionary
Find definitions for over 300,000 words from the most authoritative English dictionary. Continuously updated with new words and meanings.

Proceeding systematically down our frequency table, the next unused common letter is "I". Unfortunately, none of our shortlisted 23 common words begin with "I". Moving on, we consider the letter "N", which appears with a frequency of 6.7% in English. One particularly common candidate word from Merriam-Webster’s list that fits our pattern is "needs". Assuming "geeos" represents "needs", we substitute "G" with "N" and "O" with "D", resulting in:

se man cnvw see a sjcxt dbstanme ajead lzt se man see tventw tjexe tjat needs tc le dcne

Words and recognisable patterns now emerge more clearly. For example:

  • "ajead" strongly suggests "ahead".
  • "tc" aligns neatly with "to".
  • "se man" likely corresponds to "we can".

Incorporating these insights, the cipher becomes:

we can onvw see a sjoxt dbstance ajead lzt we can see tventw tjexe tjat needs to le done

At this point, clarity rapidly improves. Without additional substitutions, the plaintext phrase appears evidently as:

We can only see a short distance ahead {something} we can see {something} there that needs to be done

Integrating the logical substitutions derived thus far confirms this suspicion:

we can only see a short distance ahead bzt we can see tlenty there that needs to be done

Now the message is unmistakably recognisable, and the plaintext readily discernible as:

we can only see a short distance ahead but we can see plenty there that needs to be done

Although we have not uncovered the entire substitution alphabet, due to certain letters remaining unused, we have successfully deciphered the message. Unlike the Caesar cipher, this cipher is not trivially solvable mentally. Instead, it requires deliberate analysis (perhaps aided by paper), marking an incremental improvement in cipher strength.

Automating such frequency analysis through programming would further accelerate this process. However, given this article's length and educational intent, this computational approach warrants separate treatment. Interested readers may pursue further details in the accompanying article: Breaking a Substitution Cypher.

Thus concludes our exploration of the monoalphabetic substitution cipher: a step forward from Caesar's primitive method, yet still vulnerable enough to inspire further cryptographic innovations.