Katie on Rails

On Internets and other things

Why Is Code in English?

Most of it has to do with not reinventing the wheel. What we think of as the web’s underpinnings was created by Woodstock-era American engineers and later spread to engineers and scientists in England and the rest of Europe, where it reached an audience that wrote in the Roman alphabet and was largely comfortable with English. It didn’t become interesting to a larger audience until the early 90s, when the Briton Tim Berners-Lee invented HTML, an accessible language that let relatively non-technical people create and read pages on the web without typing lots and lots of things into a command line.

When Tim Berners-Lee began inventing HTML, he didn’t start by creating a new language and a new alphabet. He used what he already knew and had available: in this case, the English language and alphabet. HTML, like nearly every other computer language used online, is based on English words typed on a Roman keyboard. HTML is mostly English words, or parts of English words, typed inside angle brackets, and the same remains true of other languages today. For example, the (never standardized, now obsolete) code for blinking text was the word <blink>. The code for bold text is <b>, short for the English word “bold”; the <strong> tag that later joined it marks importance rather than appearance, but it, too, is just an English word. When developers from non-Western countries want to create web pages, they must write code in Roman characters based on the English language. This is not as difficult as it sounds; because so much computing technology originated in the West, many computer keyboards in non-Western countries consist of Roman letters with local characters added.
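To make this concrete, here is a minimal sketch of an HTML page (the title and text are made up); every tag in it is an English word or an abbreviation of one:

    <html>
      <head>
        <title>An Example Page</title>                               <!-- "title" is simply the English word -->
      </head>
      <body>
        <p>Some text with one <strong>important</strong> word.</p>   <!-- "p" is short for "paragraph" -->
      </body>
    </html>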

What does this mean? Well, perhaps it means that the world is even more geared towards English than previously thought, and the web along with it. For example, languages written in the Roman alphabet read left-to-right, so people who read Western languages are primed to write code left-to-right. That’s why embedded Ruby in a Rails template contains things like <% stuff.each do |s| %><%= s %><% end %>, which only makes sense when read left-to-right, instead of <% end %><%= s %><% stuff.each do |s| %>, which would only make sense read right-to-left. There’s nothing inherently better about considering things starting from the left or the right, or indeed from the top or the bottom, but the people who built the structures on which the Internet rests set them up this way because they based their decisions on what they already knew.
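To see that left-to-right (and top-to-bottom) habit written out, here is a minimal sketch of how such a loop might look in a Rails view; the @stuff list and the surrounding markup are hypothetical:

    <ul>
      <% @stuff.each do |s| %>    <!-- "each", "do", and (below) "end" are all English words -->
        <li><%= s %></li>         <!-- the value of s is written into the page -->
      <% end %>
    </ul>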

I originally investigated this issue at the behest of an international law professor with an anthropology background who wanted me to learn what “biases are built into the web”. I’d initially thought there weren’t any, that the web was a just and utopian place. I began by searching for academic articles on the Internet’s role in world affairs. They barely existed, and roughly half of those that did exist were written by Columbia’s Tim Wu. (Thanks, Tim!) I finally realized that I could simply take a sampling of foreign newspapers and read their source code online. When I did, I found a number of comments to fellow programmers in characters I could not understand at all, as well as lots and lots of HTML, which I could read perfectly well because it was essentially in English.

I’m not really sure where this leaves us – there have been fabulous advances in technology since HTML was invented, almost all of which have built on that Roman-alphabet, left-to-right foundation, and there’s clearly no going back. Still, it’s worth considering what’s been lost. For example, it only became possible to write URLs entirely in non-Roman characters in 2010, when the first non-Latin top-level domains went live. I tried to visit a web page that TechCrunch linked to as an example of a non-Roman URL, but could not, because the reviewer did not understand the address and so entered it as a relative link.

This blog post is based on a paper I wrote in grad school on the Internet and international law, which fed into a thesis on the Internet and democratization. You should read them.