Sandy Payette at IASSIST: Repositories and Cloud Services for Data Cyberinfrastructure

Thu, 2010-06-03 11:58 -- Anonymous (not verified)

Sandy Payette, CEO DuraSpace, speaking to IASSIST delegates on June 2, 2010 at Cornell University.

Ithaca, NY. It was a blue-sky morning in central New York State as social scientists, researchers and technology specialists gathered at Cornell University to address issues and opportunities around the theme of “Social Data and Social Networking: Connecting Social Science Communities Across the Globe.” The 2010 IASSIST (International Association for Social Science Information Services and Technology) Conference was hosted by the Cornell Institute for Social and Economic Research (CISER) and the Cornell University Library (CUL) from June 1-4. Delegates from the US, Canada and Europe were on hand to share ideas about how to move a strategic agenda forward.

Sandy Payette, CEO of DuraSpace, presented a plenary lecture entitled “Repositories and Cloud Services for Data Cyberinfrastructure” that included thoughts on how the DuraSpace suite of open technologies might help change the way social scientists interact with and preserve research data. IASSIST, a 300-member organization of professionals working with information technology and data services to support research and teaching in the social sciences, is in a period of strategic renewal, as reflected in a new planning document.

George Alter, Acting Director, Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan, introduced Ms. Payette and characterized her as uniquely positioned to talk to IASSIST about repositories and cyberinfrastructure because she published the original article about Fedora (Flexible Extensible Digital Object Repository Architecture) with Carl Lagoze in 1998. She is now CEO of DuraSpace, an independent 501(c)(3) not-for-profit committed to serving academic, scientific, cultural, and technology communities in creating practical solutions to help ensure that current and future generations have access to our collective digital heritage. DuraSpace was launched in 2009.

Payette began by asking the audience to visualize data cyberinfrastructure: is it beautiful, colorful and improvised? Can it also be managed, structured, orderly and predictable? She challenged the group to consider these contradictory views of what data cyberinfrastructure is: “How can we achieve both without jeopardizing either?”

“We all want to get it, share it, store it and leverage it,” she said of how data cyberinfrastructure should operate. She pointed to historical perspectives on infrastructure development, citing “Understanding Infrastructure: Dynamics, Tensions and Design,” the January 2007 report of a workshop on “History and Theory of Infrastructure: Lessons for New Scientific Cyberinfrastructures.” [1]

The loose systems of components that are eventually assembled into cohesive infrastructures usually start out creatively and in a somewhat unstructured way. How do “systems, i.e. linked sets of devices that fill a functional need,” [1] become more standard and orderly over time?

Hardening depends on the development of, and adherence to, protocols and standards. The system-of-systems perspective on cyberinfrastructure refers to heterogeneous systems with distributed controls. “This model is like the Web,” she said, “but when you get down to it, it depends on some well-defined standards.”

The push to develop data cyberinfrastructure in support of scholarship is similar across many use cases and disciplines. She quoted John Wilbanks, executive director of Science Commons, who has pointed out that a shift in how we manage large data sets is really about changing how scientists view themselves and their work. We all live in a networked world and scientists, like everyone else, no longer work alone.

Repositories are key components of networked infrastructure and are defined in many ways and by various characteristics. arXiv, Amazon Public Data, Google Data, Internet Archive, Fedora and DSpace are all considered to be “repositories.” “Ultimately it doesn’t matter. We at DuraSpace are concerned with preservation and accessibility with integrity,” she said. In discussing the more than 1,000 DSpace and Fedora repository instances worldwide, she remarked, “We can build amazing private islands, but should we?”

There have been three waves of repository-enabled applications:

1. Institutional repositories and digital libraries

2. Web and collaborative “Web 2.0”

3. Infrastructure for data-intensive research

She suggested that the most interesting of these waves is the last. How will we learn to surf elegantly on rough and very large data waves? Payette said that our work and the digital products of our work will be more distributed, collaborative, web-oriented, open and interoperable going forward.

In the slide entitled “Fedora–what’s in the brand?” she was glad “not to put up an architectural diagram.” Hallmarks that have remained true over time are the platform’s durability, flexibility, service-oriented framework, semantic linking capability, and scalability. “Every essential characteristic in a digital object is written in a file, not stored somewhere else. If you have a back-up of the core files you have all of your data with or without Fedora,” she said. This basic digital preservation feature has made Fedora a good choice for users who are particularly concerned with data longevity. Scalability is another important data storage requirement. Fedora community collaborators have performed scalability tests of up to 150 million objects with no degradation in function.

She explained that the DSpace brand also fits perfectly into the DuraSpace open technology portfolio because the application is easy to use and has a strong position with libraries, institutional repositories and the Open Access movement all over the world. She said, “DSpace is focused on the user and user features with a turn-key ability to have it up and running right away.”

What about the Cloud?

A key component of integrated cyberinfrastructure as part of a big data solution is cloud computing, in which massively scalable IT-related capabilities are provided as services over the Internet. Payette noted that DuraSpace is developing DuraCloud (http://DuraCloud.org), an open source technology and service designed to provide preservation services for durable digital content. DuraCloud also addresses issues of trust in utility clouds such as Amazon’s S3. “We will provide a Chinese menu of services in the cloud to allow our users to benefit from utility cloud characteristics: flexibility, scalability, elasticity, easy implementation and cost effectiveness,” she said.

In closing, Payette noted that the DuraCloud Pilot Program is currently implementing use cases with partner organizations that use DSpace and Fedora repositories and that require different types of solutions for large amounts of data. The goal of the Pilot Program is to test the new service and document user experiences and issues. Fitting DuraCloud into existing preservation and archiving workflows will help ensure that the service is useful to customers when it launches.

DuraSpace will continue to serve its stakeholder communities with software and services that cross the boundaries of institutional systems, the Web, and cloud infrastructure because, she reminded the audience, “Preserving the world’s intellectual, cultural and scientific heritage isn’t easy. It’s vital. Together we can do it.”