Fedora4 Beta Phase Sprint Eight: Performance, Search and Large Files++

Tue, 2013-12-10 15:46 -- carol

Winchester, MA - The Fedora team has concluded the eighth sprint within the "Beta Phase" of Fedora4 development. The work of this and every sprint is planned and completed thanks to the contributions of Fedora stakeholder institutions that allocate developer time. If you would like to be involved with Fedora4 development, please send an email to Andrew Woods (awoods@duraspace.org) or the Fedora Steering Committee (ff-steering@googlegroups.com). If you have comments on the work from this sprint, please also send an email or comment directly on the wiki.

Read the Sprint B8 summary:

https://wiki.duraspace.org/display/FF/Sprint+B8+Summary

About Sprint 8, Beta Phase

Development Team

Benjamin Armintor - Columbia University
frank asseg - FIZ Karlsruhe
Chris Beer - Stanford University
Esme Cowles - University of California, San Diego
Osman Din - Yale University
Michael Durbin - University of Virginia
Eric James - Yale University
Greg Jansen - University of North Carolina, Chapel Hill (scrum master)
Scott Prater - University of Wisconsin
A. Soroka - University of Virginia

Sprint Themes

1) Performance

A primary focus of Sprint B8 was performance benchmarking and improvement. This effort consisted of several approaches:

- Establishing performance test scenarios [1] which define the intersection of varying platform [2], repository [3], setup [4], and workflow [5] profiles
- Running and analyzing select test scenarios with CPU and memory profiling enabled
- Running ingest scenarios with a production collection [6] of 16,000+ objects, 65,000+ items, totaling 250+ GB of content
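As a rough illustration of how test scenarios can be formed as the intersection of the four profile types, the sketch below enumerates every combination of one platform, repository, setup, and workflow profile. The profile names and the `build_scenarios` helper are hypothetical placeholders; the actual profiles are the ones defined on the wiki pages cited as [2] through [5].

```python
from itertools import product

# Hypothetical profile names for illustration only; the real profiles
# are defined on the wiki pages cited as [2]-[5].
platform_profiles = ["platform-a", "platform-b"]
repository_profiles = ["single-node", "clustered"]
setup_profiles = ["empty-repository", "preloaded-repository"]
workflow_profiles = ["ingest", "read"]


def build_scenarios(platforms, repositories, setups, workflows):
    """Return every combination of one profile from each dimension."""
    return [
        {"platform": p, "repository": r, "setup": s, "workflow": w}
        for p, r, s, w in product(platforms, repositories, setups, workflows)
    ]


scenarios = build_scenarios(
    platform_profiles, repository_profiles, setup_profiles, workflow_profiles
)
print(len(scenarios))  # 2 * 2 * 2 * 2 = 16 combinations
```

Because the number of combinations grows multiplicatively with each profile added, running only select scenarios, as this sprint did, keeps the benchmarking effort manageable.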

This work surfaced potential bottlenecks relating to versioning configurations and the messaging framework as well as limitations of different backend persistence choices.

This sprint represents an early and significant step towards painting a clear picture of how Fedora 4 compares with Fedora 3 across a variety of usage scenarios.

2) Search

Fedora 4 is designed to support two search services:

- External search (e.g. a standalone Solr index populated by a repository event listener)
- Administrative search (i.e. advanced legacy field-search)

This sprint established the administrative search service. If your repository requires a user-facing, full-featured search service, external search is the better fit. If instead repository administrators need to run queries over resource properties, the new administrative search may suffice. Administrative search exposes both a text search over resource properties and a SPARQL endpoint over repository subjects.

For more details on the administrative search and its usage, see the wiki [7].

3) Large Files

One of the long-standing requirements of Fedora is support for the management and serving of large files. Given the recent software contributions made to the upstream project, ModeShape [8], it appears that earlier identified bottlenecks seen when projecting over large files on the filesystem have now been addressed.

Exploring large file support from the HTTP interface, this sprint determined that 300-GB content [9] could be both ingested and read from the local machine via the Fedora 4 REST API [10]. Tests for the actual hard limits imposed by the repository or REST API are still being explored.

4) Content Modeling

In an effort to clarify how a Fedora 4 repository manager might define content models, an example set of configurations [11] was constructed that represents an initial set of Fedora 3 content models translated into Fedora 4 node types. Fedora 4 defines content models via Compact Node Definitions (CNDs) [12], which can be written as configuration files (as in the examples mentioned) or updated via the Fedora REST API [13].

5) Versioning

This sprint improved the basic versioning capability by adding support for common user needs. The initial versioning feature was simply turned on for all repository resource modification actions, with no way to limit it. The incremental improvement introduced in this sprint is the ability to define which node types (if any) are subject to versioning [14].

Additionally, support was added to return and view previous versions of a resource via the REST API and HTML interfaces.

6) House-Cleaning

A significant portion of the energy of this and most sprints was dedicated to refactoring, refining, and pruning the codebase. Without trying to enumerate all such enhancements, some notable updates include:

- Delete action on workspaces endpoint [15]
- Content upload API support for inclusion of user-provided checksum [16]
- JAX-RS Provider support for RDF iteration
- Clean-up of HTML user interface

References

[1] https://wiki.duraspace.org/display/FF/Test+-+Performance+Test+Profiles
[2] https://wiki.duraspace.org/display/FF/Test+-+Platform+Profiles+for+Performance+Testing
[3] https://wiki.duraspace.org/display/FF/Test+-+Repository+Profiles+for+Performance+Testing
[4] https://wiki.duraspace.org/display/FF/Test+-+Setup+Profiles+for+Repository+Testing
[5] https://wiki.duraspace.org/display/FF/Test+-+Workflow+Profiles+for+Performance+Testing
[6] https://wiki.duraspace.org/display/FF/Test+-+Performance+Testing+-+Stanford+SALT+Collection
[7] https://wiki.duraspace.org/display/FF/Design+-+Administrative+Search
[8] https://docs.jboss.org/author/display/MODE/Home
[9] https://wiki.duraspace.org/display/FF/How+to+handle+large+files
[10] https://wiki.duraspace.org/display/FF/HTTP+API
[11] https://github.com/futures/fcrepo-content-model-examples
[12] https://docs.jboss.org/author/display/MODE/Defining+custom+node+types
[13] https://wiki.duraspace.org/display/FF/HTTP+API#HTTPAPI-Nodetypes
[14] https://wiki.duraspace.org/display/FF/Versioning
[15] https://wiki.duraspace.org/display/FF/HTTP+API#HTTPAPI-Workspaces
[16] https://wiki.duraspace.org/display/FF/HTTP+API#HTTPAPI-BinaryContent
