“Six Reasons Why Mulgara’s XA2 Storage Layer Should Matter to You”

Wed, 2008-03-12 17:18 -- Anonymous (not verified)

San Francisco, CA: The Topaz Project hosted developers from Mulgara, Fedora Commons, and Topaz on March 5-6, 2008 for a series of design meetings and discussions based on Mulgara’s XA2 Storage Layer.

Paul Gearon, Mulgara, presented an overview of how Mulgara’s storage currently works. It consists of a “string pool” (mapping numbers to URIs and literals) and a set of six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. Each of the indexes contains a full copy of the (Subject, Predicate, Object, Graph) quads, just in a different sorted order.
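
To make the layout concrete, here is a minimal in-memory sketch of a string pool plus six sorted quad indexes. It is an illustration only, not Mulgara's code: the class names (StringPool, QuadStore), the method names, and the use of sorted Python lists in place of on-disk trees are all assumptions.

```python
from bisect import bisect_left

# Index names as listed in the text; each letter selects one field of a quad.
ORDERINGS = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]
FIELD = {"S": 0, "P": 1, "O": 2, "G": 3}


class StringPool:
    """Hypothetical string pool: maps URIs/literals to integer IDs and back."""

    def __init__(self):
        self._to_id = {}
        self._to_value = []

    def intern(self, value):
        # Assign the next free ID the first time a value is seen.
        if value not in self._to_id:
            self._to_id[value] = len(self._to_value)
            self._to_value.append(value)
        return self._to_id[value]

    def find(self, value):
        return self._to_id.get(value)

    def lookup(self, node_id):
        return self._to_value[node_id]


class QuadStore:
    """Hypothetical store: every index holds every quad, reordered and sorted."""

    def __init__(self):
        self.pool = StringPool()
        self.indexes = {name: [] for name in ORDERINGS}

    def add(self, s, p, o, g):
        quad = tuple(self.pool.intern(x) for x in (s, p, o, g))
        for name, index in self.indexes.items():
            key = tuple(quad[FIELD[c]] for c in name)
            pos = bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)  # keep the index sorted, no duplicates

    def match_prefix(self, name, *values):
        """Return (S, P, O, G) quads whose key in index `name` starts with `values`."""
        ids = [self.pool.find(v) for v in values]
        if any(i is None for i in ids):
            return []  # an unknown term cannot match anything
        prefix = tuple(ids)
        index = self.indexes[name]
        pos = bisect_left(index, prefix)
        out = []
        while pos < len(index) and index[pos][:len(prefix)] == prefix:
            quad = [None] * 4
            for c, node_id in zip(name, index[pos]):
                quad[FIELD[c]] = self.pool.lookup(node_id)
            out.append(tuple(quad))
            pos += 1
        return out
```

For example, after adding a few quads, store.match_prefix("OSPG", "ex:c") finds every statement whose object is ex:c with a single range scan. Each query shape is served by whichever index starts with its bound terms, which is why the store keeps six full copies of the data.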

Andrae Muys, Netymon, a specialist supplier of technology development and consultancy services, introduced his ideas on XA2, a redesign of Mulgara’s storage layer. The major goals are to support concurrent writes and up to 100 billion triples in one instance (two orders of magnitude above what Mulgara can achieve today). Read Muys’ discussion paper:

Six Reasons Why Mulgara’s XA2 Storage Layer Should Matter to You

Muys proposes replacing the string pool with a Trie-based data structure capable of storing 100 billion+ strings. He also suggests a novel approach for the indexes using redundant binary numbers. This approach would reduce the cost of updates to the indexes.
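
For readers unfamiliar with tries, the sketch below shows the general idea of a trie used as a string pool: strings that share a prefix (as URIs in one namespace do) share the nodes for that prefix. This is a generic in-memory trie and not the XA2 design, whose details are in the discussion paper; the names TrieNode and TrieStringPool are hypothetical.

```python
class TrieNode:
    """One character step in the trie; node_id is set only where a string ends."""

    __slots__ = ("children", "node_id")

    def __init__(self):
        self.children = {}
        self.node_id = None


class TrieStringPool:
    """Hypothetical trie-backed string pool (illustration only)."""

    def __init__(self):
        self._root = TrieNode()
        self._next_id = 0
        self.node_count = 1  # counts the root

    def intern(self, value):
        # Walk the trie one character at a time, creating nodes as needed.
        node = self._root
        for ch in value:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
                self.node_count += 1
            node = child
        if node.node_id is None:
            node.node_id = self._next_id
            self._next_id += 1
        return node.node_id

    def find(self, value):
        # Return the ID for an exact match, or None if the string is absent.
        node = self._root
        for ch in value:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.node_id
```

Interning http://example.org/a and then http://example.org/b adds only one new node for the second string, because the shared prefix is already stored.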

A series of triplestore-related talks was hosted by SDForum in Palo Alto. David Wood gave an overview of his company (Zepheira) and talked about Mulgara goals. Andy Palmer (co-founder of Vertica, which uses C-Store) and Sam Madden (co-developer of C-Store at MIT) gave an overview of how Vertica can be used as a read-optimized triplestore.

Ideas for incremental improvements to Mulgara that would improve performance were presented. One idea was to change the string pool so it doesn’t constantly bother with re-allocation of space. This would keep “dirty” data on disk for longer, deferring cleanup to a later time. Another idea was to use BTrees for the indexes. Currently, Mulgara employs AVLTrees. AVLTrees have the advantage of always being balanced (so they’re always optimized for read performance), but they are relatively expensive to re-balance when deletes occur. Paul also presented an interesting “out of the box” idea on storing all the indexes as pairs: (subjectID, statementID), (predicateID, statementID), (objectID, statementID), versus triples (subject, predicate, object). This would give every statement in the store a unique ID (think “reification”), but he made a case for why it would actually be more performant than what we have today.
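
One possible reading of the pair-based idea, sketched here as an assumption rather than as Paul's actual proposal: each statement gets an ID, each of the three indexes maps a node ID to the IDs of the statements it appears in, and a query intersects the matching sets. The class name PairIndexStore and its methods are hypothetical, and plain integers stand in for node IDs.

```python
from collections import defaultdict


class PairIndexStore:
    """Hypothetical store with (nodeID, statementID) pair indexes."""

    def __init__(self):
        self.statements = []                  # statement ID -> (s, p, o)
        self.by_subject = defaultdict(set)    # subject ID -> statement IDs
        self.by_predicate = defaultdict(set)  # predicate ID -> statement IDs
        self.by_object = defaultdict(set)     # object ID -> statement IDs

    def add(self, s, p, o):
        # The statement's position in the list serves as its unique ID.
        statement_id = len(self.statements)
        self.statements.append((s, p, o))
        self.by_subject[s].add(statement_id)
        self.by_predicate[p].add(statement_id)
        self.by_object[o].add(statement_id)
        return statement_id

    def match(self, s=None, p=None, o=None):
        """Return sorted statements matching the bound terms (None is a wildcard)."""
        candidates = None
        for index, term in (
            (self.by_subject, s),
            (self.by_predicate, p),
            (self.by_object, o),
        ):
            if term is None:
                continue
            ids = index.get(term, set())
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:
            candidates = range(len(self.statements))
        return sorted(self.statements[i] for i in candidates)
```

In this sketch an update touches three small pair indexes instead of re-sorting a full copy of each statement in every index. Whether that trade pays off at query time is exactly the case Paul argued, and the sketch does not settle it.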

James Leigh introduced the idea of supporting Sesame’s SAIL API in Mulgara. SAIL is similar in purpose to Trippi, but it supports transactions and uses the OpenRDF APIs (as opposed to JRDF, which is falling out of favor). The Mulgara team was generally receptive to the idea. From a technical perspective, it appears the challenge will be mapping the abstract syntax of iTQL to that of SAIL (which is currently the union of SeRQL and SPARQL). From a social perspective, there is some concern that the use of SAIL (traditionally seen as “the Sesame API”) may weaken Mulgara’s brand. However, there is general recognition of the need for a (transactional) JDBC-like API for working with triplestores, and if this API were supported by both Sesame and Mulgara, it would be a win for the community and would attract more users to Mulgara.

Contributed by Chris Wilper, Lead Software Developer, Fedora Commons
