Artifacts… Historic and Combinatorial

Bradley Hauer and Grzegorz Kondrak, of the Department of Computing Science at the University of Alberta, recently released a paper describing their algorithmic analysis of text in the Voynich Manuscript. Here is how the media presented it [underlines added for emphasis]:

  • Independent (UK): Mysterious 15th century manuscript decoded by computer scientists using artificial intelligence
  • Smithsonian: Artificial Intelligence Takes a Crack at Decoding the Mysterious Voynich Manuscript
  • artnet News: AI May Have Just Decoded a Mystical 600-Year-Old Manuscript That Baffled Humans for Decades
  • ScienceAlert: AI May Have Finally Decoded The Bizarre, Mysterious Voynich Manuscript
  • ExtremeTech: AI May Have Unlocked the Secrets of the Mysterious Voynich Manuscript

Contrary to the headlines, Hauer and Kondrak make no claim in their paper of having used artificial intelligence (note that the source code is in Perl, an excellent language for parsing text, but not usually the first choice for AI routines unless combined with other software). Nor do they claim to have decoded the VMS. In fact, they explicitly state:

The results presented in this section could be interpreted either as tantalizing clues for Hebrew as the source language of the VMS, or simply as artifacts of the combinatorial power of anagramming and language models.

There you have it—right up front, “…artifacts of the combinatorial power of anagramming…”

If you aren’t sure what that means, I’ll explain it with some examples…

“Finally, we consider the possibility that the underlying script is an abjad, in which only consonants are explicitly represented.”

First, let’s imagine the VMS were a natural language encrypted without vowels (and without anagramming). In English, three letters like mnt could potentially represent the following words:

mint, minty, minute, mount, Monet, Monty, amount, minuet

… and that’s just in English. There are thousands of possibilities in other languages.

Imagine if arbitrary anagramming were permitted, as well. The number of interpretations of mnt becomes much greater:

mint, minty, minute, mount, Monet, Monty, amount, minuet, enmity, autumn, anytime, mutiny, ataman, inmate, amenity, atman, ament, amniote, manito, tamein, matin, meant, onetime, toneme, matinee, etamin, motion, etymon, animate, anatomy, emotion…

There is a certain subjective flexibility inherent in 1) anagramming, 2) choosing vowels that seem to work best, and 3) choosing the language that seems most similar to the resulting text.
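To make the mnt example concrete, here is a minimal Python sketch. It is only an illustration and not the researchers' code: the word list is a small hand-picked sample, the function names are invented for this example, and y is treated as a vowel (an assumption that lets words like minty count as matches).

```python
# Illustrative sketch only; not taken from Hauer and Kondrak's paper.
# Assumption: "y" is treated as a vowel so that "minty" reduces to "mnt".
VOWELS = set("aeiouy")

# Hypothetical sample word list; a real test would use a full dictionary.
WORDS = [
    "mint", "minty", "minute", "mount", "amount", "minuet", "meant",
    "enmity", "autumn", "mutiny", "inmate", "motion", "anatomy", "emotion",
    "mantle", "tin",
]


def skeleton(word):
    """Strip the vowels, keeping the consonants in their original order."""
    return "".join(ch for ch in word.lower() if ch.isalpha() and ch not in VOWELS)


def matches_in_order(key, words):
    """Words whose consonants equal the key exactly (abjad, no anagramming)."""
    return [w for w in words if skeleton(w) == key]


def matches_anagrammed(key, words):
    """Words whose consonants equal the key in any order (abjad plus anagramming)."""
    target = sorted(key)
    return [w for w in words if sorted(skeleton(w)) == target]


if __name__ == "__main__":
    print("in order:  ", matches_in_order("mnt", WORDS))
    print("anagrammed:", matches_anagrammed("mnt", WORDS))
```

Even with this tiny list, allowing the consonants to be reordered adds words such as enmity, autumn, and motion to the candidates, and a full dictionary in several languages would add far more.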

The Terms of Engagement

In their paper, the researchers declare their focus specifically as “monoalphabetic substitution” ciphers.

How well does this apply to the Voynich Manuscript?

Monoalphabetic ciphers were common in the Middle Ages, and still are, so it is not unreasonable to develop algorithms to crack them.
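For readers unfamiliar with the term, a monoalphabetic substitution cipher replaces each letter with one fixed substitute throughout the text. The following Python sketch shows the idea under simple assumptions: a 26-letter lowercase alphabet, a randomly shuffled key, and hypothetical function names chosen for this example. It says nothing about how the VMS was actually written.

```python
import random
import string


def make_key(seed=0):
    """Build a one-to-one letter mapping by shuffling the alphabet.

    The seed is only there to make the example repeatable.
    """
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encrypt(text, key):
    """Replace each letter with its fixed substitute; leave other characters alone."""
    return "".join(key.get(ch, ch) for ch in text.lower())


def decrypt(text, key):
    """Invert the mapping to recover the lowercase plaintext."""
    inverse = {cipher: plain for plain, cipher in key.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    key = make_key(seed=1)
    secret = encrypt("a mysterious manuscript", key)
    print(secret)
    print(decrypt(secret, key))
```

Because every letter always maps to the same substitute, letter frequencies and repeated-letter patterns within words survive encryption, which is what makes this class of cipher a reasonable target for automated attack.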

Anagrammed texts are not unusual either, but one would hope that if the VMS text were anagrammed, it would be in some regular way. Otherwise, the number of possible interpretations (assuming meaningful text can be extracted) increases exponentially with the length of the text (that’s what the researchers mean by “combinatorial power”). If you have 20 possible interpretations for the first word, and another 20 for the second word, and so on… the number of ways in which the combined words can be decrypted goes into the stratosphere.
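The arithmetic is easy to check. This short Python sketch multiplies the number of candidate readings per word; the figure of 20 per word is just the illustrative number used above, not a measurement from the manuscript, and the function name is made up for this example.

```python
import math


def count_readings(candidates_per_word):
    """Total readings of a passage = product of the candidate counts per word."""
    return math.prod(candidates_per_word)


if __name__ == "__main__":
    # Assumption: every word has exactly 20 plausible interpretations.
    for length in (1, 2, 5, 10):
        print(length, "words:", count_readings([20] * length))
```

With 20 candidates per word, a ten-word sentence already has more than ten trillion possible readings, so finding one that sounds vaguely meaningful is not strong evidence by itself.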

The Basic Steps

The researchers state that their first step is to identify the encrypted language. To accomplish this, they are working with a data bank of text samples in natural languages to test and fine-tune the recognition and decryption software. They claim up to 97% accuracy for 380 [natural] languages, and 93% from a smaller pool of 50 arbitrarily anagrammed ciphers in five languages.
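It may seem odd that a language could be guessed before the text is deciphered. One property that makes this conceivable is sketched below in Python: the sorted letter counts of a text do not change under monoalphabetic substitution or under anagramming within words. This is a toy demonstration of that single property, not a reproduction of the paper's methods, and the function names and sample sentence are invented for the example.

```python
from collections import Counter
import random
import string


def frequency_profile(text):
    """Letter counts sorted from most to least common, ignoring which letter is which."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return sorted(counts.values(), reverse=True)


def substitute(text, seed=0):
    """Apply a random monoalphabetic substitution (seeded for repeatability)."""
    alphabet = string.ascii_lowercase
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    table = str.maketrans(alphabet, "".join(shuffled))
    return text.lower().translate(table)


def anagram_words(text, seed=0):
    """Shuffle the letters inside each whitespace-separated word."""
    rng = random.Random(seed)
    scrambled = []
    for word in text.split():
        chars = list(word)
        rng.shuffle(chars)
        scrambled.append("".join(chars))
    return " ".join(scrambled)


if __name__ == "__main__":
    plain = "a mysterious manuscript that has baffled readers for decades"
    cipher = anagram_words(substitute(plain, seed=1), seed=2)
    print(cipher)
    print(frequency_profile(plain) == frequency_profile(cipher))
```

A profile like this could, in principle, be compared against profiles of known languages, although languages with similar statistics will look alike by this measure.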

So far, so good. Working from this “proof of concept”, they decided to try the software on the Voynich Manuscript—an intriguing experiment.

“However, the biggest obstacle to deciphering the [Voynich] manuscript is the lack of knowledge of what language it represents.”

I would agree—this is a frequent stumbling block to deciphering encrypted text—software to expedite the process would probably be welcomed. Even when the underlying language is known, some codes can be hard to crack, but it should be kept in mind that the VMS may not represent a natural language (or any language).

  • If it is meaningful text, it might be multiple languages, heavily abbreviated, or a synthetic language. There are precedents. In the 12th century, Hildegard von Bingen invented a language that was part cipher, part rule-set, and part glossary lookup. In the 13th century, Roger Bacon invented numerous methods of encrypting text. In the 14th and 15th centuries, texts in Latin and many other languages were so heavily abbreviated they resembled shorthand. By the 16th century, as the use of Latin faded and global exploration increased, scholars were inventing universal languages to bridge the communication gap.
  • There are also fantasy languages. Edward Talbot/Kelley ingratiated himself with John Dee by “channeling” angelic language conveyed by spirits in a looking glass. This combined effort produced a “language” now known as Enochian. They also pored over charts in the Book of Soyga, trying to make sense of text that had been algorithmically encoded in a stepwise fashion in page-long charts.

What these examples illustrate is that decipherable and not-so-decipherable texts in many different forms did exist in the Middle Ages.

The Process of Decryption

The best code-breakers are usually good at context-switching, pattern recognition, and lateral thinking… If it isn’t this, then maybe it’s this [insert a completely different form of attack].

Context-switching is not inherent in brute-force methods of coding. Even artificial intelligence programmers struggle to create algorithms that can “think outside the box”. If you have seen the movie “AlphaGo” about the development of Google’s game-playing software that was pitted against Lee Sedol, world-champion Go player, you’ll note near the end that even these programmers admit they used a significant amount of brute-force programming to deal with common patterns that occur in certain positions on the board (known as joseki).

Most software programming is about anticipating scenarios (and building in pre-scripted responses). It is not so easy to write code that analyzes and tries to process inscrutable data in entirely new ways, without human intervention. Many so-called “expert systems” have no AI programming at all. They are essentially very large keyed and prioritized databases. The only thing they “learn” is which lookups the user does most often and this is a simple algorithm that can sometimes be more of a hindrance than a help.

But to get back to Hauer and Kondrak’s attack on the Voynich Manuscript…

The researchers admit that a native Hebrew speaker declared the decrypted first sentence as “not quite a coherent sentence” and that “a couple of spelling corrections” were made to the text, after which it was fed into Google Translate. Even after this double intervention, the resulting grammar is questionable, and I would argue that the Google translation of a couple of specific words is also questionable. Keep in mind that Google’s software is designed to try to make sense of imperfect text.

Too Little, Too Soon

It’s not the fault of the researchers that the press declared this as a solution achieved through artificial intelligence, because neither of these claims is made in their paper, but even so, when attempting to decipher coded information, one has to be very cautious about reading too much into small amounts of text. Sometimes what looks like a pattern falls apart when one examines the bigger picture.

Take, as an example, the system proposed by Stephen Bax in 2014, in which he announced he had decoded about a dozen words. When his substitution system is used to decrypt a full paragraph or even a full sentence on any page of the manuscript, the result is gibberish. What he had was a theory, not a “provisional” decoding. There’s no way to prove one has a solution if it doesn’t generalize to larger blocks of text. Bax is not alone in thinking he had solved the VMS (or parts of it)—many proposed solutions do work in a spotty fashion, but only because they ignore the vast amounts of text that don’t fall into line.

A few words, or even an isolated sentence that seems to make sense here or there, can be found in the VMS in many languages. I’ve located hundreds of words and sometimes full phrases in Greek, Spanish, Portuguese, Latin, and other languages, but I do not have a solution or even a partial solution. The VMS includes thousands of word-tokens in different combinations, almost all of which are going to match something in some language, especially languages with similar statistical properties.


Hauer and Kondrak have some interesting technology. I can think of many practical uses for it, and some of their graphs provide additional perspective on the VMS. But before everyone jumps on the next bandwagon and declares the VMS solved, I suggest they read the research paper first.

J.K. Petersen

Copyright © 2018, All Rights Reserved

2 thoughts on “Artifacts… Historic and Combinatorial”

  1. Koen Gheuens

    A welcome clarification for the few of us who aren’t programmers 😉

    The type of AI I’m kind of familiar with on a superficial level is one that can teach itself how to reach a desired result. Learning the nature of the result is not the goal – we already know what we want – but rather the method.

    In the case of the VM though, we absolutely have no idea what the result should look like. So as I understand it, it would be hard to deploy AI since it wouldn’t know what to do. You develop AI to perform a very specific task for you, not to crack ancient mysteries. But maybe in a few decades.

  2. J.K. Petersen Post author

    Yes, you make a good point. There are forms of AI that can use incoming data to self-modify to achieve a desired outcome or improve their “hit ratio” when the goal is known or better defined. There are others where the code can be likened to a nervous system, which reacts and responds to feedback in order to interact (and improve its interaction) with its “environment”, but isn’t guaranteed an expected or even a desired outcome. Robotics researchers are especially interested in these kinds of problems.

    Software developers trying more specific approaches are hoping for a bingo, a bit of luck (and why not, if you have the technology), and it helps to have data that looks more objectively at people’s hunches (such as a possible similarity to Semitic abjads or Asian syllabic languages), but the likelihood of more conventional approaches solving the VMS is probably low.

    Stepping back to determine whether the VMS is a monoalphabetic system, an abjad, anagrammed text, or something completely different like steganography, or text that is intended to be transcribed vertically or diagonally, or read as numbers, is still mostly a human skill. We don’t even know whether glyphs that look similar are intended to be interpreted as the same glyph or as different glyphs. In the garbage-in/garbage-out department, using a transcript based on a flawed premise is unlikely to have a good result, even if the software is well designed.

