The Petabyte Problem: Scrubbing, Curating and Publishing Big Data
Submitted by on Mon, 2008-06-16 09:04
One strategy for classifying the millions of galaxies mapped by the Sloan Digital Sky Survey was to open the Galaxy Zoo, invite the public to look at the new creatures, and give them tools to record their observations.
Pittsburgh, PA – When Alex Szalay is not considering improved strategies for managing and sharing big data, and how that might be an effective force for advancing science, he is the lead guitarist for the jazz and progressive rock band Panta Rhei. Szalay presented the third and final keynote, “Scientific Publishing in the Era of Petabyte Data,” at JCDL on June 19, 2008.
He opened with a look at the evolution of science: 1,000 years ago science was empirical; during the last few hundred years it was theoretical, using models and generalizations; a computational branch emerged in the last few decades; and today science is about data exploration.
Scientific data doubles every year, which has fundamentally changed the nature of scientific computing. Today scientific computing cuts across disciplines and has become unwieldy, making it more difficult to extract knowledge. He noted that 20% of the world’s servers are feeding information to big data centers–Google, MSN, Yahoo, Amazon, and eBay–so the problem is not limited to scientific data.
Szalay has been personally involved in the exponential growth of astronomy data from the late 1990s to 2008 through his role with the Sloan Digital Sky Survey (SDSS), which has been “mapping the universe” as part of the Virtual Observatory activities for the last ten years. SDSS is now complete, and its final data release is being developed. The completed SDSS archive will contain over 100 terabytes and will be managed by Johns Hopkins University. Sky Survey user sessions show steady and increasing use of the SDSS data.
Data versioning was SDSS’s biggest challenge, and he emphasized the need to automate more of the steps in curating data for final publication–collection, raw, calibrated, and derived.
Szalay believes that scientific discoveries are made at the edges and boundaries of large data sets–the places where you might not naturally be looking. The more connections that can be made among data sets, the more likely it is that something new will be discovered along the edges, which suggests that data federation is significant.
Scientific projects that generate data are often short-term, lasting 3 to 5 years. Data is only “uploaded” at the end of a project, so the data never catches up with the published discoveries. He advocates for projects becoming more active data curators and publishers further upstream in the investigative process.
One way to do this is to consider methods for “taking the analysis to the data”: manipulating the data where it is stored, at the database, rather than moving it to the analyst.
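As a rough illustration of “taking the analysis to the data,” the sketch below lets the database compute a per-class summary so that only a few result rows leave it, instead of every raw record. The `galaxies` table, its columns, and the sample values are hypothetical and are not drawn from the SDSS schema; an in-memory SQLite database stands in for a real archive.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table standing in for a large archive (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE galaxies (id INTEGER PRIMARY KEY, kind TEXT, redshift REAL)"
    )
    rows = [
        (1, "spiral", 0.02),
        (2, "elliptical", 0.10),
        (3, "spiral", 0.04),
        (4, "elliptical", 0.30),
    ]
    conn.executemany("INSERT INTO galaxies VALUES (?, ?, ?)", rows)
    return conn


def mean_redshift_in_database(conn):
    """Ask the database for the aggregate; only one row per class is returned."""
    query = (
        "SELECT kind, AVG(redshift), COUNT(*) "
        "FROM galaxies GROUP BY kind ORDER BY kind"
    )
    return {kind: (avg, count) for kind, avg, count in conn.execute(query)}


conn = build_demo_db()
summary = mean_redshift_in_database(conn)
for kind, (avg, count) in summary.items():
    print(f"{kind}: mean redshift {avg:.3f} over {count} objects")
```

With four rows the saving is trivial, but the same query shape applies when the table holds hundreds of millions of objects and the client could never download them all.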
In any scenario, he noted, finding the right data to answer a question cannot be optimal because data is fuzzy and machine resources are limited. Next-generation data analysis will require a combination of statistics and computer science to create novel data structures and randomized algorithms.
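The talk summary does not name a particular randomized algorithm. As one well-known example of the kind, the sketch below uses reservoir sampling, which draws a uniform random sample of fixed size from a stream too large to hold in memory. The function name and parameters are illustrative choices, not something from the keynote.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are kept in memory at any time (classic "Algorithm R").
    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


# A range stands in for a data stream that would not fit in memory.
picked = reservoir_sample(range(100_000), 5, random.Random(42))
print(picked)
```

The trade-off is the one the paragraph describes: the answer is a statistical estimate rather than an exact result, in exchange for a bounded use of machine resources.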
Szalay suggests that power laws arise in social systems where people are faced with many choices, such as in the analysis of enormous data sets–more choices make the distribution, or long tail, more extreme. People’s choices, made by brains that are naturally designed to sort, order, and balance, affect one another and are not random events. He cited long-tail distribution observations including those of Pareto, who suggested that 20% of the population holds 80% of the wealth, and more recently those of Chris Anderson, who believes that everything on the web is a power law.
He suggests that there is a science project pyramid–single labs at the base, multi-campus projects in the center, and international consortia on top. Often a scientific discipline will recognize the need for a major “giga” initiative, such as supercomputing research, that is highly collaborative and distributed. The output from these efforts at every scale contains:
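To make the 80/20 observation concrete, the sketch below assumes that wealth follows a Pareto distribution with shape parameter alpha; that modeling choice is an assumption for illustration, not a claim from the talk. Under it, the top fraction p of the population holds a share p^(1 - 1/alpha) of the total, and the 80/20 split corresponds to alpha = log 5 / log 4, roughly 1.16.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto(alpha) model.

    Requires alpha > 1 so that the distribution has a finite mean.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return p ** (1 - 1 / alpha)


# Shape parameter at which the top 20% hold exactly 80%.
ALPHA_80_20 = math.log(5) / math.log(4)

share = top_share(0.20, ALPHA_80_20)
print(f"alpha = {ALPHA_80_20:.3f}, top 20% hold {share:.0%}")

# A heavier tail (smaller alpha) concentrates even more in the top 20%.
heavier = top_share(0.20, 1.05)
print(f"alpha = 1.05, top 20% hold {heavier:.0%}")
```

The second call shows the effect described above: as the tail gets heavier, the distribution becomes more extreme and the top group’s share grows.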
–Derived and re-combined data
Szalay would like to see a continuous feedback loop among these three aspects where data and analysis are always updating.
To answer the question, “How can you publish data so that others might recreate your results in 100 years?”, he referred to Gray’s laws of data engineering: scientific computing revolves around data; scale out the solution for analysis; take the analysis to the data; start with 20 queries; and go from working to working.
One successful experiment in scaling out the solution for analysis came about because the Sloan Digital Sky Survey generated more data than scientists have time to study or classify, coupled with the fact that astronomy is attractive to the public. Astronomers asked citizens for help in classifying over a million galaxies by establishing the Galaxy Zoo.
This public science analysis solution has received enormous publicity and has allowed 100,000 citizens from all over the globe to contribute to discovery by helping to classify galaxies online while viewing beautiful images of unknown locations in the universe. For example, a Dutch teacher found and called attention to an object that she had no experience in analyzing. Her observation turned out to be a significant discovery: the object now known as Hanny’s Voorwerp.
Szalay believes that the educational impact of this work is enormous. Data sharing and publishing would benefit from the establishment of specialized journals for data. He emphasized that scholarly communications are no longer characterized by a paper trail, but rather by an email trail along with resources collected by the Internet Archive, wiki pages, some science blogs, collaborative workbenches, and even instant messages.
Technology, sociology, and economics must come together to continue working on how to preserve our intellectual data resources. No single discipline is enough to solve the data deluge problem. The promise and the unpredictability of increased participation in citizen science are yet another unknown. If there are thousands of new discoveries each day in public science, is there any way to know how this will scale, or does this create a horrifying potential for even more data?