Bradley Hauer and Grzegorz Kondrak, of the Department of Computing Science at the University of Alberta, recently released a paper describing their algorithmic analysis of text in the Voynich Manuscript. Here is how the media presented it [underlines added for emphasis]:
- Independent (UK): Mysterious 15th century manuscript decoded by computer scientists using artificial intelligence
- Smithsonian: Artificial Intelligence Takes a Crack at Decoding the Mysterious Voynich Manuscript
- artnet News: AI May Have Just Decoded a Mystical 600-Year-Old Manuscript That Baffled Humans for Decades
- ScienceAlert: AI May Have Finally Decoded The Bizarre, Mysterious Voynich Manuscript
- ExtremeTech: AI May Have Unlocked the Secrets of the Mysterious Voynich Manuscript
Contrary to the headlines, Hauer and Kondrak make no claim in their paper of having used artificial intelligence (note that the source code is in Perl, an excellent language for parsing text, but not usually the first choice for AI work unless combined with other software). Nor do they claim to have decoded the VMS. In fact, they explicitly state:
The results presented in this section could be interpreted either as tantalizing clues for Hebrew as the source language of the VMS, or simply as artifacts of the combinatorial power of anagramming and language models.
There you have it—right up front, “…artifacts of the combinatorial power of anagramming…”
If you aren’t sure what that means, I’ll explain it with some examples…
“Finally, we consider the possibility that the underlying script is an abjad, in which only consonants are explicitly represented.”
First, let’s imagine the VMS were a natural language encrypted without vowels (and without anagramming). In English, three letters like mnt could potentially represent the following words:
mint, minty, minute, mount, Monet, Monty, amount, minuet,
… and that’s just in English. There are thousands of possibilities in other languages.
Imagine if arbitrary anagramming were permitted, as well. The number of interpretations of mnt becomes much greater:
mint, minty, minute, mount, Monet, Monty, amount, minuet, enmity, autumn, anytime, mutiny, ataman, inmate, amenity, atman, ament, amniote, manito, tamein, matin, meant, onetime, toneme, matinee, etamin, motion, etymon, animate, anatomy, emotion…
There is a certain subjective flexibility inherent in 1) anagramming, 2) choosing vowels that seem to work best, and 3) choosing the language that seems most similar to the resulting text.
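To make that flexibility concrete, here is a minimal sketch (my own illustration, not the researchers' code) of how many English words collapse onto the same consonant "key" once vowels are dropped and letter order is ignored. The tiny stand-in lexicon, and the simplifying choice to treat y as a vowel, are my assumptions; a real dictionary would produce far more collisions:

```python
def consonant_key(word: str) -> str:
    """Drop vowels (y treated as a vowel here, for simplicity),
    then sort the remaining letters so anagrams become identical."""
    consonants = [c for c in word.lower() if c.isalpha() and c not in "aeiouy"]
    return "".join(sorted(consonants))

# Tiny stand-in lexicon; a full dictionary yields many more collisions.
lexicon = ["mint", "mount", "amount", "mutiny", "inmate", "anatomy",
           "emotion", "autumn", "anytime", "matinee", "ship", "water"]

# Every word whose devoweled, anagrammed skeleton matches "mnt".
matches = [w for w in lexicon if consonant_key(w) == consonant_key("mnt")]
print(matches)
```

Ten of the twelve words in even this toy lexicon reduce to the same three-consonant key, which is exactly the ambiguity the researchers acknowledge.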
The Terms of Engagement
In their paper, the researchers declare their focus specifically as “monoalphabetic substitution” ciphers.
How well does this apply to the Voynich Manuscript?
Monoalphabetic ciphers were common in the Middle Ages, and still are, so it is not unreasonable to develop algorithms to crack them.
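For readers unfamiliar with the term, here is a quick sketch of the class of cipher the paper targets: each plaintext letter always maps to the same ciphertext letter. The key below is an arbitrary permutation I invented for illustration, not anything derived from the VMS:

```python
import string

# An arbitrary permutation of the alphabet serves as the key.
key = str.maketrans(string.ascii_lowercase, "qwertyuiopasdfghjklzxcvbnm")
inverse = str.maketrans("qwertyuiopasdfghjklzxcvbnm", string.ascii_lowercase)

ciphertext = "voynich".translate(key)
print(ciphertext)                     # each letter replaced consistently
print(ciphertext.translate(inverse))  # recovers "voynich"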
Anagrammed texts are not unusual either, but one would hope that if the VMS text were anagrammed, it would be in some regular way, otherwise the possible interpretations (assuming meaningful text can be extracted), increases exponentially (that’s what the researchers mean by “combinatorial power”). If you have 20 possible interpretations for the first word, and another 20 for the second word, and so on… the number of ways in which the combined words can be decrypted goes into the stratosphere.
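The arithmetic behind that stratosphere is easy to verify: if each word admits roughly 20 plausible readings (my illustrative figure, not one from the paper), a passage of n words admits about 20^n decryptions:

```python
# Illustrative only: assume ~20 plausible readings per word.
for n_words in (1, 2, 5, 10):
    print(n_words, 20 ** n_words)
```

By ten words the count is already in the tens of trillions, which is why unconstrained anagramming makes any "match" easy to find and hard to trust.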
The Basic Steps
The researchers state that their first step is to identify the encrypted language. To accomplish this, they work with a databank of text samples in natural languages to test and fine-tune the recognition and decryption software. They claim up to 97% accuracy across 380 [natural] languages, and 93% on a smaller pool of 50 arbitrarily anagrammed ciphers in five languages.
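As an aside, the basic idea of algorithmic language identification can be shown with a toy version that is far simpler than the models in the paper: score a text against per-language letter-frequency profiles and pick the closest. The two "training" samples below are made-up miniatures of my own, not real corpora:

```python
from collections import Counter

def letter_profile(text: str) -> Counter:
    """Relative frequency of each letter in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter({c: n / len(letters) for c, n in Counter(letters).items()})

def similarity(p: Counter, q: Counter) -> float:
    # Overlap of two frequency distributions (1.0 means identical).
    return sum(min(p[c], q[c]) for c in set(p) | set(q))

# Made-up miniature samples standing in for real training data.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then some",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso",
}
mystery = "the dog jumps over the fox"
best = max(samples, key=lambda lang: similarity(letter_profile(samples[lang]),
                                                letter_profile(mystery)))
print(best)
```

A real system would use character n-grams and much larger corpora, but the principle, matching statistical fingerprints rather than reading meaning, is the same.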
So far, so good. Working from this “proof of concept”, they decided to try the software on the Voynich Manuscript—an intriguing experiment.
“However, the biggest obstacle to deciphering the [Voynich] manuscript is the lack of knowledge of what language it represents.”
I would agree; this is a frequent stumbling block in deciphering encrypted text, and software to expedite the process would probably be welcomed. Even when the underlying language is known, some codes can be hard to crack, but it should be kept in mind that the VMS may not represent a natural language (or any language).
- If it is meaningful text, it might be multiple languages, heavily abbreviated, or a synthetic language. There are precedents. In the 12th century, Hildegard von Bingen invented a language that was part cipher, part rule-set, and part glossary lookup. In the 13th century, Roger Bacon invented numerous methods of encrypting text. In the 14th and 15th centuries, Latin texts in many languages were so heavily abbreviated they resembled shorthand. By the 16th century, as the use of Latin faded and global exploration increased, scholars were inventing universal languages to bridge the communication gap.
- There are also fantasy languages. Edward Talbot/Kelley ingratiated himself with John Dee by “channeling” angelic language conveyed by spirits in a looking glass. This combined effort produced a “language” now known as Enochian. They also pored over charts in the Book of Soyga, trying to make sense of text that had been algorithmically encoded in a stepwise fashion in page-long charts.
What these examples illustrate is that decipherable and not-so-decipherable texts in many different forms did exist in the Middle Ages.
The Process of Decryption
The best code-breakers are usually good at context-switching, pattern recognition, and lateral thinking… If it isn’t this, then maybe it’s this [insert a completely different form of attack].
Context-switching is not inherent in brute-force methods of coding. Even artificial intelligence programmers struggle to create algorithms that can “think outside the box”. If you have seen the documentary “AlphaGo”, about the development of DeepMind’s game-playing software that was pitted against the world-champion Go player Lee Sedol, you’ll note near the end that even those programmers admit they used a significant amount of brute-force programming to deal with common patterns that occur in certain positions on the board (known as joseki).
Most software programming is about anticipating scenarios (and building in pre-scripted responses). It is not so easy to write code that analyzes and tries to process inscrutable data in entirely new ways, without human intervention. Many so-called “expert systems” have no AI programming at all. They are essentially very large keyed and prioritized databases. The only thing they “learn” is which lookups the user does most often and this is a simple algorithm that can sometimes be more of a hindrance than a help.
But to get back to Hauer and Kondrak’s attack on the Voynich Manuscript…
The researchers admit that a native Hebrew speaker declared the decrypted first sentence to be “not quite a coherent sentence”, and that “a couple of spelling corrections” were made to the text before it was fed into Google Translate. Even after this double intervention, the resulting grammar is questionable, and I would argue that Google’s translation of a couple of specific words is questionable as well. Keep in mind that Google’s software is designed to try to make sense of imperfect text.
Too Little, Too Soon
It’s not the fault of the researchers that the press declared this a solution achieved through artificial intelligence; neither claim (AI, or a decipherment) is made in their paper. Even so, when attempting to decipher coded information, one has to be very cautious about reading too much into small amounts of text. Sometimes what looks like a pattern falls apart when one examines the bigger picture.
Take, as an example, the system proposed by Stephen Bax in 2014, in which he announced he had decoded about a dozen words. When his substitution system is used to decrypt a full paragraph or even a full sentence on any page of the manuscript, the result is gibberish. What he had was a theory, not a “provisional” decoding. There’s no way to prove one has a solution if it doesn’t generalize to larger blocks of text. Bax is not alone in thinking he had solved the VMS (or parts of it)—many proposed solutions do work in a spotty fashion, but only because they ignore the vast amounts of text that don’t fall into line.
A few words, or even an isolated sentence that seems to make sense here or there, can be found in the VMS in many languages. I’ve located hundreds of words and sometimes full phrases in Greek, Spanish, Portuguese, Latin, and other languages, but I do not have a solution or even a partial solution. The VMS includes thousands of word-tokens in different combinations, almost all of which are going to match something in some language, especially languages with similar statistical properties.
Hauer and Kondrak have some interesting technology. I can think of many practical uses for it, and some of their graphs provide additional perspective on the VMS. But before everyone jumps on the next bandwagon and declares the VMS solved, I suggest they read the research paper first.
Copyright © 2018, All Rights Reserved