Quantity Over Quality at Google Book Search

Campus Technology has a well-documented article about Google Book Search: The Good, the Bad & the Ugly, which suggests that Google’s project is more about quantity than quality. For example, the University of California has to deliver 3,000 books a day to Google, according to their agreement. “All of the libraries are talking about that, in the sense of what might be the most interesting materials to scan. But I’ll be very frank: There’s a real balance point between volume and selection, especially when looking at these numbers. UC is trying to meet the needs of the contract it’s signed,” says Robin Chandler, former director of data acquisitions for UC’s California Digital Library.

And since Google has to scan a lot of books, it needs a scalable scanning technology. “When it first started, the technical challenge was simply building a scanning device that worked. The next technical challenge was being able to run this scanning process at scale. We would have been quite happy to use commercial scanning technologies if they were adequate to scale to this. We only built our own scanning process because that was the way to make this project achievable for Google,” says Dan Clancy from Google.

Surprisingly, the scanning process involves humans, as you can see in some books from Google’s index (TechCrunch, Google Blogoscoped, George Hernandez and The Genealogue all spotted fingers). “If you go into [Google Book Search] and look at any book, you’ll be able to see by the number of parts and fingerprints that [the pages] are being turned manually,” suggests Linda Becker, VP at Kirtas, the company that produces the fastest robotic scanner in the world: APT BookScan 2400. “If you were to go to the site, you’d see that one out of every five pages is either missing, or has fingers in it, or is cut off, or is blurry.”


Larry Page announced in October 2007 that the index is “over a million books”. A search for “now” returns 2,190,600 results (1,740,600 available in limited preview and 214,600 fully available for reading and downloading).

The conclusion of the article is optimistic:

When it comes down to it, then, this brave new world of book search probably needs to be understood as Book Search 1.0. And maybe participants should not get so hung up on quality that they obstruct the flow of an astounding amount of information. Right now, say many, the conveyor belt is running and the goal is to manage quantity, knowing that with time the rest of what’s important will follow. Certainly, there’s little doubt that in five years or so, book search as defined by Google will be very different. The lawsuits will have been resolved, the copyright issues sorted out, the standards settled, the technologies more broadly available, the integration more transparent.

(Via Google Operating System.)



