
The Growing Google Books Corpus: Preservation, Access and the Long Tail

[Photograph of a stack of books]

Ithaca, NY: The Google Books Project (GBP)[1] has been characterized simultaneously as a violation of copyright laws in the U.S. and elsewhere, a boon to the global publishing industry, a step toward open knowledge, an information monopoly, and a key component in the digital preservation of, and access to, the worldwide record of human science and culture. Which is it?

Google has digitized approximately 15 million books out of an estimated 130 million that, for many reasons, are no longer easy to access. Other groups, such as the HathiTrust [2] shared digital repository, preserve and provide access to some of the materials scanned by Google and other institutions. HathiTrust currently includes about 6 million volumes and works with ten large institutional library partners [3]. More than 1 million HathiTrust works are in the public domain.

Several of the world’s great libraries are Google Books partners because, like Google, they believe the project is one way to preserve and provide access to human knowledge, in spite of ongoing negotiations over copyright violations and control and access policies. Oya Rieger, Associate Cornell University Librarian for the Division of Library Information Technologies [4], along with members of the Cornell University Library (CUL) GBP group, recently provided an update on the Google Books Project at Cornell. She reminded the audience, "It is our belief that our books are discovered when they are digitized." Based on an informal look at circulation records, she estimates that about 85-90% of the books being digitized as part of the GBP have not circulated in recent history–a potential CUL long tail of knowledge waiting for online discovery.


CUL has established a broad framework for both planning and executing large-scale digitization of its collections. During the 1990s CUL worked with Microsoft to digitize approximately 85,000 books. Since 2008 CUL has been one of 35 libraries participating in the Google Books Project. In the first phase approximately 265,000 items were digitized. Only 5.5% of those materials are now available in full text; 27% are available in a metadata or snippet view [5].

The rights to the GBP corpus are in dispute. Access and rights are complicated by recent (and pending) court decisions supporting the broad idea that, under existing copyright law, mass digitization of any large group of works is illegal.

Publishers, countries, and interest groups representing content creators and distributors of every stripe have sued the company to prevent Google from making scanned books available on a pay-per-view basis. Everyone has a different axe to grind: some stakeholders object to the lack of open access, some want more distributed control, some are concerned about copyright infringement, and others want a share, or a larger share, of potential profits. Even if current lawsuits and negotiations are settled in the U.S., any work whose rights are held in another country will not be accessible as part of the GBP. For now, Google allows only limited access to the broad digital corpus. Even the 35 institutions that regularly ship materials to Google for scanning must apply for special permission to access the data.

Rieger explained, “We need to have our rights to our own collection protected.” CUL retains full ownership of the 85,000 works that were digitized in partnership with Microsoft and is a HathiTrust partner working toward better access to digital collections.

A sample of recent articles regarding the Google Books Project point to the ongoing complexity of satisfying the concerns of multiple stakeholders:

• 2009: Google’s Digital-Book Project Hangs in the Balance
• 2009: Google Book Search
• 2010: France: More Publishers Sue Google
• 2010: Visual Artists to Sue Google Over Vast Library Project
• 2010: Harvard Discusses Google Book Project Lawsuit


CUL's GBP staff are kept busy with a rigorous and physically demanding process of selecting, packing, tracking, shipping, and reshelving the numerous volumes sent to and returned from Google for digitization.

The process begins with a “Pick List” used to physically prepare shipments from campus locations that do not hold special collections materials. Every three weeks staff load books onto specially prepared carts, which are then transferred to large trucks that carry the books to a scanning center in New England.

Humans and machines are part of an integrated process for evaluating the quality of the resulting scans. Google believes that the quality of the book corpus will be improved over time as scanning technology and workflows improve.


A potential full-text online corpus of 130 million GBP volumes will change scholarship. Early research into how such a corpus might be used points to non-consumptive uses of books [6]: treating them as text containers rather than as end-to-end linear works. The ability to analyze works from a linguistic point of view will be a significant scholarly advance, particularly for humanities researchers.

Scholarly advances may be nothing more than an interesting footnote for a nation of on-the-fly app users and ubiquitous computing enthusiasts. The collaborative process for developing future-focused preservation and access policies is not available as an app–yet. The Google Books Project is a bold and complex attempt to create access to a significant part of our shared intellectual heritage. The GBP partners [7] have signed on to see an imperfect process evolve to the benefit of the information consumers of the future.
[5] /screenshots.html#snippetview