May 5, 2008 1:35 PM PDT

Google: Unicode conquers ASCII on the Web

I picture it happening this way. The Roman alphabet is on the run, pursued by a much larger army of Arabic characters with long scimitar-like ligatures, Chinese characters that look like throwing stars, and European peasant letters bristling with umlauts, cedillas, and tildes.

Unicode now is the most common character encoding method on the Web.

(Credit: Google)

Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web, Mark Davis, Google's senior international software architect, said in a blog post. Also vanquished at almost exactly the same time was the Western European encoding.

Unicode is a character encoding standard that gracefully accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters (256 in its 8-bit extensions) and has a hard time extending beyond the range of a century-old Remington typewriter.

Unicode vanquished ASCII and Western European within 10 days in December, Davis said.

"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.

Google's a fan of Unicode Web sites. When it processes data from Web sites, it first converts the data into Unicode if it isn't already encoded that way. That improves international search abilities.
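As a rough illustration of what "converting to Unicode first" can mean, here is a minimal Python sketch. It is not Google's pipeline; the function name `to_utf8` and the sample bytes are assumptions made up for this example. It decodes bytes from a declared legacy encoding into Unicode text, then re-encodes that text as UTF-8.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Hypothetical helper: decode legacy bytes to Unicode text, then encode as UTF-8."""
    text = raw.decode(declared_encoding)  # bytes -> Unicode code points
    return text.encode("utf-8")           # Unicode code points -> UTF-8 bytes


# "cafe" with an acute accent on the e, as stored in ISO-8859-1 (one byte per character).
legacy = b"caf\xe9"
print(to_utf8(legacy, "iso-8859-1"))  # b'caf\xc3\xa9'
```

Once every page is in one encoding, later processing steps no longer need to know which encoding the page originally used.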

"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.

Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."

One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.

5 comments
It doesn't always take up more storage
by chriswaco May 5, 2008 4:24 PM PDT
The story says "One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character". That's not really true with UTF-8. For most Western/Roman characters, UTF-8 takes up exactly one byte per character just like ASCII. When you get into accent marks and non-Roman character sets, though, UTF-8 can take up more than two bytes per character. See: http://en.wikipedia.org/wiki/UTF-8
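The byte counts described in this comment can be checked with a short Python sketch. The sample characters and the helper name `utf8_len` are chosen for illustration only and do not come from the article.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to store ch in UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples: a plain Roman letter, two accented letters,
# a Chinese character, and a Malayalam letter.
samples = ["A", "\u00e9", "\u00fc", "\u4e2d", "\u0d2e"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")

# Plain ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Roman".encode("ascii") == "Roman".encode("utf-8"))  # True
```

The plain letter takes one byte, the accented letters take two, and the Chinese and Malayalam characters take three each.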
Most Unicode content on the web is encoded in UTF-8
by JasonTrue May 5, 2008 5:06 PM PDT
UTF-8 doesn't take dramatically more space than ASCII or ISO-8859-1 encodings, except for East Asian languages and certain European characters, which can take up to 50% more space than the 16-bit encoding for Unicode. In Windows programs, text is typically represented as UTF-16 internally, which does take up more space, but generally behaves faster, since the Windows APIs are natively UTF-16. The older single-byte/double-byte API equivalents are quietly converted to Unicode on each call, which can slow programs down a bit if they are particularly text-heavy.
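To make the size comparison concrete, the following Python sketch measures the same strings in UTF-8 and in UTF-16. The helper name `encoded_sizes` and the sample strings are assumptions for this example; `utf-16-le` is used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_sizes("hello"))               # (5, 10): UTF-8 is half the size
print(encoded_sizes("\u00e9"))              # (2, 2): accented Latin letter, same size
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # (9, 6): Japanese text, UTF-8 is 50% larger
```

For mostly ASCII text UTF-8 is the smaller of the two, while for East Asian text the three-byte UTF-8 sequences account for the 50% figure mentioned above.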
UTF-8
by RussHolsclaw May 6, 2008 7:50 AM PDT
As mentioned by others, the UTF-8 format of Unicode encoding significantly reduces the overhead of Unicode, in most cases, because the character codes that correspond to the base ASCII character set are identical to ASCII itself: one byte per character. For others, the overhead is not too great. This is especially true when compared to the typical HTML/XML method of encoding non-ASCII characters by the use of "character entity" sequences, which allow non-ASCII characters to be included on a web page. These sequences are all much longer than the equivalent UTF-8 encoding of the same characters. UTF-8 also permits a single web page to contain text in multiple languages at the same time. Also, since web pages consist largely of HTML tags and client-side scripts, which are made up of pure ASCII characters, these take up no more space than if the page were ordinary ASCII or some ISO ASCII extension set.
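The comparison with character entities can be sketched in Python. The helper name `entity_vs_utf8` is made up for this example; it relies on the standard `xmlcharrefreplace` error handler, which writes a non-ASCII character as a numeric character reference such as `&#233;`.

```python
import html


def entity_vs_utf8(ch: str) -> tuple[int, int]:
    """Return (bytes as a numeric character reference, bytes as UTF-8) for ch."""
    as_entity = ch.encode("ascii", "xmlcharrefreplace")  # e.g. b'&#233;'
    as_utf8 = ch.encode("utf-8")
    return len(as_entity), len(as_utf8)


print(entity_vs_utf8("\u00e9"))  # (6, 2): '&#233;' versus two UTF-8 bytes
print(entity_vs_utf8("\u4e2d"))  # (8, 3): '&#20013;' versus three UTF-8 bytes

# A named entity is longer still: '&eacute;' is 8 bytes for the same character.
print(html.unescape("&eacute;") == "\u00e9")  # True
```

In both samples the entity form is at least twice as long as the UTF-8 bytes for the same character.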
by krosavcheg May 9, 2008 2:18 AM PDT
1) The "meteoric rise" of Unicode is indisputable, but the graph is misleading. 75% of the web is still not Unicode. Since the family of Unicode text encodings aims to replace all other encodings, the graph really should have only 2 lines, "Unicode encodings" and "other encodings".
2) As other commenters remarked, the overhead of Unicode encodings is minimal. Overhead should never be an argument against using a Unicode encoding. Anyone who has to deal with multiple text encodings in organically evolved (i.e. not carefully designed) IT systems will agree.
wcoenen (logged in with bugmenot.com)
  • About Underexposed

  • This blog sheds light on digital photography, science and open-source software--Stephen Shankland's eclectic beat. Shankland joined CNET News.com in 1998 after a five-year stint as a science writer. He's a lab rat who grew up in Los Alamos, New Mexico, and graduated from Harvard.

    Contact Stephen at Stephen.Shankland@cnet.com
