Language codes

AUTHOR’S NOTE – You’re reading the HTML version of a chapter from the book Building Accessible Websites (ISBN 0-7357-1150-X). Copyright © Joe Clark, 2002. All rights reserved.

Under the Web Content Accessibility Guidelines, you are required to specify changes in the “natural” or human language used in documents. You do this by adding the lang="languagecode" attribute to virtually any tag (like <p></p>, <span></span>, <cite></cite>, or <hx></hx>). Also, in order to specify a change in language, you must already have declared the default, base, or original language, which you do by adding lang="languagecode" to the <body> or (preferably) <html> tag.
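Here is a minimal sketch of both declarations. The page title and sentence are invented for illustration; only the lang attributes matter. The first example sets the base language on the root element, and the second marks a change of language on a phrase inside the page.

```html
<!-- Example 1: declare the base language once, on the root element -->
<html lang="en">
  <head>
    <title>Language codes</title>
  </head>
  <body>
    <!-- Example 2: mark a change of language on the element that contains it -->
    <p>The special of the day was <span lang="fr-ca">pâté chinois</span>.</p>
  </body>
</html>
```

Everything inside the <html> element inherits lang="en" until a nested element, here the <span>, declares otherwise.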

So just what are those language codes? They’re two-letter abbreviations, optionally followed by a hyphen and some other qualifier. In fr-ca, for example, French is specified (fr), but of the Canadian variety (ca).

The exact specification is ISO 639-1, “Codes for the Representation of Names of Languages,” whose homepage resides at the Library of Congress: lcweb.loc.gov/standards/iso639-2/. (Yes, that URL says “iso639-2”; you have to hunt around at the site to find the 639-1 section, which is a bit outdated.)

Note that the companion standard, ISO 639-2, provides three-letter codes for languages – and for a vastly wider range of languages, at that. Online, however, we must stick with the two-letter codes. At least, this is my interpretation. A page at the World Wide Web Consortium Internationalization site tells us:

According to RFC 3066, for languages with both a two-letter and a three-letter code, the two-letter code must be used. This also solves the problem of those languages that have two different three-letter codes, because all of them also have a two-letter code.

So this “solves the problem,” does it? I don’t see a lot of problems that are actually “solved” here. The RFC (Request for Comments) mentioned in this citation merely refers back to ISO 639-1 and tells us, in effect, that the only three-letter language codes we may use are those that do not have a two-letter code. But there are somewhat complex rules in place governing when a three-letter code may be coined without creating a corresponding two-letter code.
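A quick sketch of what that rule means in markup. French is one of the languages with a two-letter code (fr) and two different three-letter codes in ISO 639-2 (fre and fra), so it makes a handy test case. The sentence itself is just filler.

```html
<!-- Correct under RFC 3066: French has a two-letter code, so use it -->
<p lang="fr">Bonjour tout le monde.</p>

<!-- Not permitted: a three-letter code for a language
     that already has a two-letter one -->
<p lang="fra">Bonjour tout le monde.</p>
```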

From an accessibility perspective, this restriction will eventually have to be lifted. Textual media are not the only kind available on the Web, and as more and more video becomes available, more and more sign languages will be available, and all sign-language names exist in the three-character specification (under sgn). It is technically impossible to specify a sign language on a Website as the standards currently exist.

My recommendation? Damn the torpedoes! If you have to specify a language with a three-letter code because you cannot find a two-letter code, do it. Such a practice appears to be permitted anyway and is the only one that makes sense.
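As a hedged illustration of that recommendation: Hawaiian has the three-letter code haw in ISO 639-2 and no two-letter equivalent, so the three-letter code is all there is to use. The second line tries the same approach for a signed video using the sgn prefix mentioned above. The filename is hypothetical, and whether any browser or device does something useful with either code is an open question.

```html
<!-- Hawaiian has no two-letter code, so the three-letter code is used -->
<p lang="haw">Aloha kakahiaka.</p>

<!-- A link to sign-language video, flagged with the sgn code.
     The filename is made up for this example. -->
<a href="welcome-signed.mov" hreflang="sgn">Welcome message (signed)</a>
```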

Let’s start with the two-letter codes. Now, hundreds of languages have been defined, and I’m not going to list every single one of them here because the super-obscure language codes have no practical value to my audience. (It’s nice to know that Faroese has its own language code, but how many readers of a book on Web accessibility will have cause to design Websites in Faroese? And won’t such designers already know that Faroese’s language code is fo?) Besides, the ISO 639-1 specs are all online and provide all the codes for you.

I have not found a truly reliable source for the Top Ten languages used online (after English – the Top Eleven, really). I have synthesized various lists into the following somewhat longer compilation – not quite Top Forty, but close.

Very-widely-used languages online

Language      Code
Japanese      ja
German        de
Chinese       zh
French        fr
Spanish       es
Italian       it
Dutch         nl
Portuguese    pt
Finnish       fi
Swedish       sv
Norwegian     no
Danish        da
Korean        ko
Polish        pl
Russian       ru
Hebrew        he
Hungarian     hu
Greek         el
Turkish       tr
Czech         cs
Thai          th
Arabic        ar
Icelandic     is

Confusable codes

Note that country codes and language codes are often just different enough to get you into trouble if you’re not eagle-eyed.
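A few of the pairs that trip people up, sketched as markup. This selection is mine, drawn from the language list above and the ISO 3166 country codes. The greetings are filler; the lang values are the language codes, and the comments give the look-alike country codes that do not belong there.

```html
<!-- lang takes the LANGUAGE code; the COUNTRY code is the trap -->
<p lang="ja">こんにちは</p>    <!-- Japanese is ja; jp is Japan -->
<p lang="sv">Hej</p>           <!-- Swedish is sv; se is Sweden -->
<p lang="da">Goddag</p>        <!-- Danish is da; dk is Denmark -->
<p lang="cs">Dobrý den</p>     <!-- Czech is cs; cz is the Czech Republic -->
<p lang="el">Καλημέρα</p>      <!-- Greek is el; gr is Greece -->
<p lang="ko">안녕하세요</p>    <!-- Korean is ko; kr is South Korea -->
<p lang="zh">你好</p>          <!-- Chinese is zh; cn is China -->
```

Some of the wrong guesses are themselves valid codes for other languages (se, for instance, is the language code for Northern Sami), so a typo here may not even register as an error.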

Dialects

Some dialect names are standardized under ISO 639-1, while others, usually of a more fanciful nature (Cockney, Newfoundland, joual) are not. Both types are permitted; it is up to the browser or device to interpret the codes correctly.

It is possible and legal, for example, to specify variants of English such as en-us, en-gb, and en-cockney.

You must not assume, however, that browsers or devices will be able to understand or represent anything beyond the first dash.
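To make that concrete, here is a sketch with a handful of English variants. The particular variants and sentences are my own picks for illustration. A browser or device that stops reading at the first dash will treat every one of these paragraphs as plain en.

```html
<html lang="en">
  <body>
    <p lang="en-us">The color of the sidewalk.</p>
    <p lang="en-gb">The colour of the pavement.</p>
    <p lang="en-ca">The colour of the sidewalk.</p>
    <!-- A fanciful, non-geographic variant; support is anyone's guess -->
    <p lang="en-cockney">Use your loaf.</p>
  </body>
</html>
```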

In rather more important cases, like the two variations of Norwegian, Bokmål and Nynorsk, enough social importance is given to the dialects that they have their own codes: nb for Bokmål and nn for Nynorsk.

Authors writing in Norwegian will likely know which dialect they are using and can cite it appropriately. Authors who merely quote Norwegian text or make some other casual use of it may not know which is which; that’s what the generic no code is for.
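A small sketch of the three Norwegian options, using the ISO 639-1 codes nb (Bokmål) and nn (Nynorsk) alongside the generic no. The sentences are invented for the example.

```html
<!-- Author knows the text is Bokmål -->
<p lang="nb">Jeg heter Kari.</p>

<!-- Author knows the text is Nynorsk -->
<p lang="nn">Eg heiter Kari.</p>

<!-- Casual quotation where the variety is unknown: generic Norwegian -->
<p>They greeted us with <q lang="no">Takk for sist</q>.</p>
```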

If you’re wondering about Chinese (no doubt you are), Mandarin and Cantonese are not the only recognized dialects, but all of them are subsumed under zh. You must use dialect codes for Mandarin (zh-guoyu) and Cantonese (zh-yue) if you wish to differentiate them. (The distinction is nearly meaningless on Websites that do not use voice given that the two dialects use the same writing system.) There is no difference in language code between Traditional and Simplified Chinese; arguably there should be.
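Here is roughly how that plays out in markup. Written text gets plain zh, while the dialect codes earn their keep only on spoken material. The audio filenames are hypothetical, and using hreflang on the links is my choice for this sketch rather than a requirement.

```html
<!-- Written Chinese: the dialect distinction adds little, so zh suffices -->
<p lang="zh">你好</p>

<!-- Spoken recordings, where Mandarin and Cantonese really do differ.
     Both filenames are made up for this example. -->
<a href="greeting-mandarin.mp3" hreflang="zh-guoyu">Greeting (Mandarin audio)</a>
<a href="greeting-cantonese.mp3" hreflang="zh-yue">Greeting (Cantonese audio)</a>
```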

Take my word for this as a linguist and an accessibility obsessif: This stuff is more detailed and pedantic than trainspotting, and almost as addictive to susceptible personalities. Just keep in mind that dinner-party guests are never really as interested in this topic as we are.

