December 20, 2006 2:31 PM PST

Grant funds open-source challenge to Google library

The nonprofit Internet Archive announced Wednesday it has received $1 million from the Alfred P. Sloan Foundation to continue its effort to scan public domain works for open online accessibility.

The archiving organization's Open-Access Text Archive is an open-source alternative to book-scanning efforts like the ones from Google and Microsoft. The Internet Archive, perhaps best known for its Wayback Machine archive of Web pages by date, is also an online digital library of text, audio, software, images and video content.

"Brewster Kahle and the Internet Archive are pioneers in this exciting and historic opportunity to create a universal digital library that is both open-access and non-proprietary," said Doron Weber, who oversees public understanding of science and technology at the Sloan Foundation, in a statement.

Kahle was one of the inventors of Wide Area Information Servers (WAIS), a text-based search system that searched database indexes on remote servers before there were Internet search engines. After WAIS was sold to AOL in 1995 for several million dollars, Kahle founded the Internet Archive, which works closely with the Open Content Alliance (OCA). The OCA developed a set of principles dedicated to a "permanent archive of multilingual digitized text and multimedia content" for free and open access.

The grant from the Sloan charitable trust will enable Internet Archive and the OCA to scan collections from several major institutions, including the entire collection of publications from the Metropolitan Museum of Art as well as several thousand images from the museum; John Adams' personal library of over 3,800 works at the Boston Public Library; and other collections from The Getty Research Institute, Johns Hopkins University and the University of California, Berkeley.

The announcement comes just after the San Francisco-based Internet Archive reached the milestone of scanning 100,000 books. That may not sound like a lot compared to Google Book Search's claim of millions within a decade, but the OCA has ramped up its scanning recently to about 12,000 books a month. According to its own statistics, the organization has also archived 65 billion pages from 50 million Web sites.

"Google is so good at the media being their PR machine, that you would not know there was an alternative out there," Kahle said. "We have brand name institutions going open and foundations like the Sloan are funding (us). It shows that the Open Content Alliance is viable, that there is support for public interest. We don't have to privatize the library system."

Google has begun to offer full-text, printable PDFs of public domain works with plans to add more as it scans more books. But its platform is closed, and its PDF pages have a "Digitized by Google" watermark. The company is not planning to share its scanned material with the OCA or Internet Archive, according to Kahle.

"We think they (Google) are doing great stuff. If the materials would be made available for broad public search and educational use we'd be all for it, but in my discussion with the founders (Google co-founders Larry Page and Sergey Brin) they aren't going to," said Kahle.

Google did not respond to requests for comment about its book scanning project.

Google scans and indexes both public domain and copyrighted works, an issue that has raised legal concerns. The Google Book Search engine restricts full access to copyrighted works while still offering snippet views, instead of excluding the work from its search feature altogether, according to the Google Book Search Web site.

"This whole Google Book Search looks like Amazon's Search Inside the Book," said Kahle. "Let's go open with these collections...These are beautiful books."

Yahoo is a supporter of the OCA and has helped the OCA index some of the scanned content, but its project is smaller than those of Google and Microsoft, according to Gregory Crane, a classics professor and digital library expert at Tufts University.

Microsoft was an early supporter of the OCA and in June worked with it on a project scanning and indexing materials from the University of California and the University of Toronto libraries as part of its Windows Live Book Search project. But Microsoft has become more proprietary in recent months, Kahle said.

"We continue to work with Microsoft, but the results going forward are not strictly OCA principles," Kahle later added in an e-mail. "To their credit, they are interested in helping get more scanning done in the open, of course because they can use the books as well, but still, this is more than other projects."

Jay Girotto, who heads Microsoft's Live Book Search selection team, further explained his company's position.

"We support the fundamental mission of the OCA, and hope that many more partners like the Sloan Foundation will step forward and contribute significant resources to scan public-domain materials under the OCA principles," he said in a statement.

Research impacts

Tufts' Crane thinks the companies are reluctant to share for fear of helping the competition.

"My impression is that both Microsoft and Google don't want the other benefiting from their investment," he wrote in an e-mail. "Now each is hoarding. Ideally, each would split the cost of digitizing content and then make the public domain material available in the OCA. At the moment, Google is well ahead, and I would think that they would feel that Microsoft would benefit too much."

A lack of open-source access, Crane explained, impedes research that requires access to multiple groups of works in bulk, and prevents researchers from applying more nuanced OCR (optical character recognition) searches to those texts.

"We are evaluating OCR on classical Greek. Google runs OCR on all its texts--that's how it generates searchable OCR. The Google OCR, though, doesn't know Greek and produces no usable text as far as we can tell. Google says that you have to get permission to run OCR, etc...on its PDF books," Crane said, further explaining, "Even if the PDF books are good enough quality to support OCR--they might be lower than the archival resolution.

"I am sure that Google would be open to us doing this work, but that means (for each academic project) getting their attention, writing letters, and a lot of hassle," Crane said. "I think it's easier and better in the long run to open the library up and let the world have at it."
