Baltimore, MD Is there anyone out there who has an in-box, spam filter, hard drive, or update feed that is not brimming with outdated, digital junk? And are you even sure about whether it’s junk or not? Like old string, your institution may have a particular reason for keeping a collection of regularly updated data. It might even be an important reason. Welcome to the world of the Sun Preservation and Archiving Special Interest Group (PASIG) fall meeting.
Ask any systems administrator holding back a flood of content and they will tell you that their finger in the digital dike is the only thing keeping your personal computing devices from being swept away by a virtual rising tide of data. At last month’s Sun PASIG fall meeting, held in Baltimore, use cases, data “floodwatch” metrics, technical architectures and storage strategies were examined and discussed as ways to move towards making use of, and tracking, what seem like “bazillions” of proliferating data points. To download Sun PASIG Meeting presentations please visit http://events-at-sun.com/pasig_fall08/presentations.html.
Mike Keller, University Librarian, Director of Academic Information Resources, Stanford University, opened Sun PASIG. He painted an optimistic picture of new directions that Sun Microsystems will take in the future, particularly with regard to creating an ongoing global best practices forum for high performance computing, and solutions for storage and easier-to-implement reference architectures. Keller introduced Ernie Ingles, Vice-Provost and Chief Librarian, University of Alberta, who suggested, “The future has not yet been preserved.” He asked attendees to imagine laying the technical and social frameworks for preserving “memory objects” whose meaning will be far greater than any anonymous data for future generations.
National Digital Information Infrastructure and Preservation Program (NDIIPP)
Martha Anderson, Director of Program Management, NDIIPP, U.S. Library of Congress, presented a “Major Trends Overview.” The Library of Congress’s efforts to preserve US heritage and knowledge take into account multiple dimensions of preservation from the vantage points of “today, tomorrow and forever.” She suggested that reading Philip K. Dick’s 1969 science fiction collection The Preserving Machine would help attendees gain insight into some of the real issues involved in an expansive view of preserving knowledge, issues that were the stuff of science fiction 40 years ago. Anderson said, “We [NDIIPP] want to bridge the present to the future and we are building machines to help us do that.”
Sun Storage Technologies
Sun systems and architecture specialists sketched out assumptions, solutions and new ideas around creating long-term facilities for large data acquisition, storage and management. Chris Wood, Storage CTO, Sun Microsystems, Inc. said, “Over time systems, software and people will be replaced and data will be preserved. Every component will fail or be swapped out.” He went on, “Twenty years ago they were thinking machines, and then general purpose computing took over—specialized hardware will always lose. Multiple archive models must be supported.” The model he presented supports LOCKSS, CLOCKSS, as well as ILS models. Wood believes that many types of architectures that support clouds or “federated services” will emerge over time.
Several Sun presenters emphasized the role of energy consumption in the digital preservation equation. Over time the energy price tag for operating almost any piece of hardware will surpass its original cost. As the cost of storage systems decreases and energy prices increase, institutions will continue to be faced with deciding how much data they can afford to keep.
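To make the energy point concrete, here is a rough back-of-the-envelope sketch. None of the figures come from the Sun presentations: the purchase price, power draw, electricity rate and cooling overhead factor are all hypothetical values chosen only to show how the break-even point might be estimated.

```python
def years_until_energy_exceeds_purchase(purchase_cost, watts, price_per_kwh, overhead=1.0):
    """Estimate the years of continuous operation after which cumulative
    energy spending equals the original purchase cost.

    overhead is an assumed multiplier for cooling and power distribution
    (1.0 means the device's own draw is the only energy cost).
    """
    hours_per_year = 24 * 365
    annual_kwh = watts / 1000 * hours_per_year * overhead
    annual_energy_cost = annual_kwh * price_per_kwh
    return purchase_cost / annual_energy_cost


# Hypothetical device: $3,000 purchase price, 500 W draw,
# $0.10 per kWh, and an assumed overhead factor of 2.0 for cooling.
years = years_until_energy_exceeds_purchase(
    purchase_cost=3000.0, watts=500.0, price_per_kwh=0.10, overhead=2.0
)
print(f"Energy spend matches purchase cost after about {years:.1f} years")
```

Under these assumed numbers the energy bill catches up with the purchase price in under four years, well within the working life of much storage hardware. Cheaper hardware or pricier electricity shortens that interval further.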
The DSpace Foundation and Fedora Commons Collaborate on “DuraSpace”
DSpace and Fedora Commons held several meetings at Sun PASIG. The first introduced the organizations’ joint DuraSpace initiative. This six-month investigation funded by the Mellon Foundation is being led by the DSpace Foundation and Fedora Commons to determine the feasibility of establishing an open, durable store service layer to leverage repository development in compute and storage clouds. The idea behind DuraSpace is to provide a trusted, value-added service layer to augment the capabilities of generic storage providers by making stored digital content more durable, manageable, accessible and sharable.
The second part of the meeting was dedicated to a community discussion about establishing a professional development curriculum for existing and potential repository developers, managers and curators with support from Sun Microsystems.
Outreach staff from DSpace and Fedora Commons explained that the key shared objective is to strengthen and engage repository user and developer communities worldwide. Attendees expressed interest in the concept of a repository professional development seminar series that would include preservation and archiving as part of an integrated curriculum as well as one-off profiles, use cases and “how-to” topics. Seminar leaders and topics are being sought. Please contact Carissa Smith if you are interested in working on the joint seminar series.
DSpace and Fedora developers met at noon on Nov. 19 to move towards a shared understanding of how the two popular repositories’ technology development strategies might be brought closer together. By taking a step back from existing concepts of how each system operates, simple storage may be viewed as a logical first “rung” on a “ladder” towards interoperability. Four progressive laddering concepts are:
1. Content blobs (bottom rung of the ladder)
2. Facts about blobs and their interrelationships (next rung up the ladder)
3. Aggregations (next rung up the ladder)
4. Enriched semantic understanding: recommendations and best practices about how to expose things that filter down to the bottom layer (top rung of the ladder)
U.S. National Archives and Records Administration
Kenneth Thibodeau, Director, Electronic Records Archives Program, NHE, U.S. National Archives and Records Administration (NARA) (http://www.archives.gov/) opened afternoon sessions on November 19 by explaining that he woke up each day thinking, “Today’s the day the sky is going to fall.” With 852 requirements statements for establishing a scalable, extensible and evolvable view of digitally preserving the reliable transmission of digitally encoded information over time and technology in support of nothing less than protecting “Records [that] help us claim our rights and entitlements, hold our elected officials accountable for their actions, and document our history as a nation,” his concern is understandable.
As a small example of the volume of data that NARA is legally responsible for, Thibodeau explained that they will take legal ownership of 150 terabytes of data containing 100 million email messages when President Bush leaves office.
He advocates a cyclical approach to systems development and production designed for growth, evolution, openness and closure, and stresses that NARA systems cannot satisfy end users. There is a need for information “brokers,” like university libraries, to provide user services. Thibodeau would like to make it easier for third parties to package and promote NARA resources and information.
Beyond Fedora 3.0
After the 2008 releases of Fedora 3.0 and 3.01 with a Content Model Architecture (CMA) providing an integrated structure for persisting and delivering the essential characteristics of digital objects, Sandy Payette, Executive Director, Fedora Commons, said, “We stood back and strategized.”
She went on, “Waves of repository-enabled applications have emerged: institutional repository and digital library apps; collaborative ‘Web 2.0’ apps; eScience, eResearch, and data curation apps… We can build amazing private islands.” Should we stay within this institution-specific and organization-specific small island application development paradigm? She believes that repository communities must evolve from “systems” to “networks” that are characterized by:
Payette concluded by saying that new ways to expose content, backend storage abstraction, performance and scalability, and the content service ladder concept promote greater interoperability. Strategic collaborations such as the DSpace and Fedora DuraSpace initiative will find ways to blend networks using repository infrastructure and services. “What we build today needs to be evolveable and organic. The only systems that last are the ones that change,” she said.
The National Science Foundation Office of Cyberinfrastructure DataNet Program
Lucy Nowell, Program Director, Office of Cyberinfrastructure (OCI), U.S. National Science Foundation, says, “In 2007 the amount of information created surpassed, for the first time, our ability to preserve it on the planet.” A single project funded by OCI, for example, might generate 30 terabytes of data per night.
OCI’s answer to what to do about this wealth of information lies in establishing the Sustainable Digital Data Preservation and Access Network Partners (DataNet) program. DataNet is focused on creating a national framework, culture change, tools, resources, opportunities and exploration around the curation and use of data by building a “network of data networks” similar to the Internet.
DataNet seeks to preserve the history of science by not only capturing “big science” data, but also by saving the long tail of small science that provides critical evidence of primary science.
A Very Large, Scalable, Data Architecture
The Church of Jesus Christ of Latter-day Saints (LDS) has established and maintains the largest genealogical library in the world to “preserve the heritage of mankind,” and makes it freely available to the public. How much information is that? More than 10 billion names, equal to 10% of all the human beings who have ever lived on the planet, have been cataloged since 1894 from images of birth and death information recorded in family and public documents. Their near-term goal is to be able to publish 1 billion new genealogical record images every year.
Gary Wright, Digital Preservation Product Manager, and Randy Stokes, Principal Engineer, FamilySearch, The Church of Jesus Christ of Latter-day Saints, explained that records are available through the catalog of the Family History Library and may be accessed through the FamilySearch web site. Though recent record collection efforts are digitally based, the historical preservation strategy for collecting images of birth and death records from around the planet was to store them on microfilm. Currently that microfilm is being digitized and a new architecture for preserving the collection is being developed.
Complex systems for ingesting, disseminating, preserving and supporting tens to hundreds of petabytes of data consisting of more than 100 billion objects are currently being developed using the Sun StorageTek SL8500 Modular Library System.
Final Observations: How the PASIG fits into the Global Project Landscape
Clifford Lynch, Executive Director, Coalition for Networked Information (CNI), offered a final keynote summary at Sun PASIG. In the ongoing “tech vs. policy” conversation he suggests that there are limits to what can be accomplished. There must be an emphasis on economic issues with an eye to what can be optimized and what can be left behind. We cannot pursue perfection, and a balance between better, faster and cheaper must be found.
Lynch is looking for economies of scale in federation models. With advances such as the Open Archives Initiative Object Reuse and Exchange (ORE), moving assets is getting easier. Services that can help to establish provenance and authenticity are needed, along with more network trust models such as Lots of Copies Keep Stuff Safe (LOCKSS).
“Cloud” storage could be an economic win, especially if facilities are located where energy costs are low. Lynch cautioned that there are no standards for cloud agreements: what are the services, how will risks be monetized, and how will media and format migrations be handled?
He reminded the audience, who perhaps were already on this page after three days of presentations about ubiquitous and persistent data, that there is a lack of framework to discuss “what to give up” as knowledge preservation issues loom. “It is an ugly conversation but we are morally obligated to have it,” Lynch concluded.