Fedora4 Team Completes Ninth Beta Phase Sprint: Performance, Versioning, Authorization++

Thu, 2013-12-19 11:56 -- carol

Winchester, MA - The Fedora team has concluded the ninth sprint within the "Beta Phase" of Fedora4 development. The work of this and every sprint is planned and completed thanks to the contributions of Fedora stakeholder institutions that allocate developer time. If you would like to be involved with Fedora4 development, please send an email to Andrew Woods (awoods@duraspace.org) or to the Fedora Steering Committee (ff-steering@googlegroups.com). If you have comments on the work from this sprint, please also send an email or comment directly on the wiki.

Read the Sprint B9 summary:


About Sprint 9, Beta Phase

Development Team

Benjamin Armintor - Columbia University
Frank Asseg - FIZ Karlsruhe
Chris Beer - Stanford University
Esme Cowles - University of California, San Diego
Osman Din - Yale University
Michael Durbin - University of Virginia (scrum master)
Eric James - Yale University
Greg Jansen - University of North Carolina, Chapel Hill
Adam Soroka - University of Virginia

Sprint Themes

1) Performance
It often goes without saying that Fedora 4 must be performant under a range of use cases and scenarios. A very specific theme of this sprint was verifying that this assumption holds true and, in the cases where it does not, surfacing and addressing the reasons. The following performance-related topics received attention this sprint.

1a) Profiling
Profiling was employed as an initial means of inspecting the hotspots within the codebase. In general, it was determined that the greatest sources of slowdown relate to:
- Extraneous creation of JCR sessions
- JCR node lookups
- Synchronous internal index updates

1b) Benchmarking
In parallel to the profiling work, significant effort was put towards painting a clear picture of the current performance status of Fedora 4 across a variety of hardware, configurations, and scenarios.

Tests were performed with consistent and documented setups across test servers at the following institutions:
- FIZ Karlsruhe
- Stanford University
- University of California, San Diego
- University of North Carolina, Chapel Hill
- University of Wisconsin
- Yale University

Each test is defined by the combination of the following four variables [1]:
- Platform Profile - the hardware and networking used to conduct the tests
- Repository Profile - the Fedora-specific configuration options
- Setup Profile - the data loaded into the repository as a baseline before testing
- Workflow Profile - the specific tests performed, what tools were used, and what was measured

Of particular interest are the results of ingest/read/update/delete workflows with a Fedora 4 single-node installation [2].
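
As an illustration of how the four profile variables listed above identify a test, here is a minimal Python sketch. The class name and the profile labels are hypothetical and are not taken from the performance testing wiki pages; the sketch only shows that two tests count as the same test when all four profiles match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One benchmark test, identified by its four profiles."""
    platform: str    # hardware and networking used to conduct the test
    repository: str  # Fedora-specific configuration options
    setup: str       # data loaded into the repository before testing
    workflow: str    # tests performed, tools used, what was measured


# Hypothetical profile labels, for illustration only.
a = BenchmarkRun("platform-A", "repository-default", "setup-empty", "workflow-ingest")
b = BenchmarkRun("platform-A", "repository-default", "setup-empty", "workflow-read")

# The two runs differ only in workflow, so they are different tests.
same_test = (a == b)
```

Because the class is frozen, instances can also be used as dictionary keys when collecting results per test.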

1c) Performance Benchmark - Authorization

An additional set of benchmarks was collected to determine the effect of authorization on performance [3].

As expected, there is a performance penalty with authorization enabled; however, these tests tend to indicate the impact to be less than 10% across the ingest/read/update/delete functions.
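
For readers who want to run the same comparison on their own timings, the relative penalty can be computed as in the following Python sketch. The function name and the timings are hypothetical and are not taken from the benchmark results [3].

```python
def relative_overhead(baseline_seconds: float, with_authz_seconds: float) -> float:
    """Return the slowdown of the authorized run as a fraction of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (with_authz_seconds - baseline_seconds) / baseline_seconds


# Hypothetical timings, NOT taken from the published benchmark results.
overhead = relative_overhead(baseline_seconds=100.0, with_authz_seconds=108.0)
print(f"{overhead:.1%}")  # prints "8.0%"
```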

1d) Performance Benchmark - Fedora 3 vs. 4

Defining the goals for acceptable performance levels for a repository is an ambiguous task. There are many variables that come into play, and generating test cases that simulate production scenarios is not always effective. That said, one concrete measure of performance is the relative behavior of Fedora 4 in comparison to Fedora 3.

Significant work remains in this comparison, but some initial numbers look favorable for Fedora 4's ingest capability [4].

1e) Performance Benchmark - Large files

In terms of performance related to large files, this sprint tested the limits and performance of:
- Ingest and retrieval via the Fedora 4 REST API
- Retrieval via Fedora 4 filesystem projection

In both cases, content as large as 1 TB was successfully tested, and the throughput was documented [5].

1f) Performance Benchmark - Clustering

In a similar fashion to the single-server performance tests, clustered configurations need to be benchmarked as well.

The first step towards clustered tests was the creation in this sprint of a GitHub project [6] that standardizes selected configurations and installations of clustered Fedora 4 repositories.

2) Versioning

Versioning was introduced in a previous sprint. This sprint extended the capability by allowing the repository user to enable versioning more selectively, at the resource level [7].

A version of a specific repository resource can now be created via the REST API [8], optionally with a label associated with the version.

Additionally, auto-versioning can be enabled by setting a property on a resource that indicates the activation of auto-versioning [9]. This property can either be set at runtime by the repository user, or more globally as a default property defined in the node-type definitions (CND [10]).
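
To make the on-demand case concrete, here is a small Python sketch that builds (but does not send) a version-creation request. The URL layout is an assumption: it supposes that versions are created by POSTing to an "fcr:versions" child of the resource, with the optional label as a trailing path segment. The base URL and resource path are hypothetical. Please consult the REST API documentation [8] for the actual request format.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def version_request(base_url: str, resource_path: str,
                    label: Optional[str] = None) -> Request:
    """Build a POST request that would ask the repository for a new version."""
    # ASSUMPTION: "fcr:versions" endpoint with an optional trailing label.
    url = f"{base_url.rstrip('/')}/{resource_path.strip('/')}/fcr:versions"
    if label:
        url += "/" + quote(label, safe="")
    return Request(url, method="POST")


# Hypothetical repository location and resource path.
req = version_request("http://localhost:8080/rest", "objects/obj1", "v1.0")
print(req.get_method(), req.full_url)
```

The request object can be passed to urllib.request.urlopen once the URL layout has been confirmed against a running repository.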

3) Authorization
In an attempt to increase the visibility and ease the use of the basic Fedora 4 authorization capabilities, this sprint migrated the existing authorization code from its separate GitHub project into the main codebase.

The ability to add and update access roles on resources has been available via the REST API [11]. This sprint additionally added an input form to the Fedora 4 HTML UI [12] which simplifies the creation and update of access roles on a given resource.

Until Fedora 4 establishes a runtime wiring and configuration framework [13], certain repository configurations must be set at either build time or deploy time. Therefore, this sprint introduced an additional build artifact for the Fedora 4 webapp which has basic roles [14] authorization enabled.

4) Batch Operations
This sprint enhanced the previous batch operations capability to support a more standardized approach to performing the following actions batched as a single request:
- Retrieve multiple binary resources in a single request
- Create multiple resources in a single request
- Modify multiple resources in a single request
- Delete multiple resources in a single request

In addition to batching multiple actions of the same type, create/modify/delete actions can also be mixed in a single request.

Examples and feature documentation can be found on the wiki [15], along with the REST API documentation [16].

5) External Search
External search went through a significant round of refactoring this sprint, both to address performance issues discovered in the application profiling effort and to establish a flexible pattern for transforming resource properties into indexable fields. Following the pattern employed by the external triplestore feature, external search relies on repository event messages to trigger index updates. These messages have been refactored to contain only minimal, essential event and resource information, which eliminates the overhead the eventing machinery previously incurred by making additional lookups back into the repository.

Which resources are indexed, and which transformation maps their properties to indexable fields, is configurable. The basic approach is as follows:
- Set the property on a resource that flags it for indexing
- Optionally, set the property on a resource that references the properties mapping transformation
- Optionally, create a new resource that contains the actual LDPath [17] transformation referenced in the previous step
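
The three steps above can be sketched in Python as follows. This is only an illustration of the flow, under stated assumptions: the dictionary keys ("indexable", "transform", "properties"), the default mapping, and the use of a plain field-to-property dictionary as a stand-in for an LDPath program are all hypothetical and do not reflect the actual Fedora 4 property names or LDPath syntax.

```python
from typing import Dict, Optional

# Stand-in for an LDPath program: index field name -> resource property name.
Transform = Dict[str, str]

# Hypothetical default used when a resource references no transformation.
DEFAULT_TRANSFORM: Transform = {"title": "dc:title"}


def to_index_fields(resource: Dict[str, object],
                    transforms: Dict[str, Transform]) -> Optional[Dict[str, object]]:
    """Return the fields to index, or None if the resource is not flagged."""
    if not resource.get("indexable"):       # step 1: flagged for indexing?
        return None
    name = resource.get("transform")        # step 2: optional reference
    transform = transforms.get(name, DEFAULT_TRANSFORM)  # step 3: look it up
    props = resource.get("properties", {})
    return {field: props[prop] for field, prop in transform.items() if prop in props}


# Hypothetical transformation and resource.
transforms = {"custom": {"title": "dc:title", "author": "dc:creator"}}
resource = {
    "indexable": True,
    "transform": "custom",
    "properties": {"dc:title": "Report", "dc:creator": "A. Author", "dc:date": "2013"},
}
fields = to_index_fields(resource, transforms)
```

Properties that the transformation does not mention (here, "dc:date") are simply left out of the index document.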

More details of the external search feature and its configuration can be found on the wiki [18].

6) Storage Durability
The fundamental principles of Fedora have always included a commitment to a non-proprietary, transparent persistence format. Within the Fedora 4 architecture, there are several available approaches to defining the backend persistence store.

The two backend stores that have primarily been used so far in development are the filesystem and LevelDB [19] implementations. In both cases, Fedora 4 persists the binary content in a tree of directories and the resource properties as binary JSON.

Details of the format of the JSON and its nested fields are described on the wiki [20].

7) Easy Deployment
In support of DevOps users, ongoing effort is dedicated to making the deployment and configuration of Fedora 4 as straightforward and reproducible as possible. As has been noted on the fedora-tech mailing list, there has at times been some confusion as to which system properties should be set to configure Fedora 4 persistence locations.

This sprint defines a single system property (fcrepo.home) that allows a Fedora 4 installation to specify the base directory under which all other application data will be written.
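
As a deployment aid, the following Python sketch shows one way the property might be passed to the JVM. It relies on the standard -D syntax for Java system properties. It assumes, without confirmation from the deployment documentation, that the servlet container reads JVM options from a JAVA_OPTS environment variable; the helper name and the directory are hypothetical.

```python
from typing import Dict


def with_fcrepo_home(env: Dict[str, str], home: str) -> Dict[str, str]:
    """Return a copy of env whose JAVA_OPTS sets the fcrepo.home property."""
    updated = dict(env)
    existing = updated.get("JAVA_OPTS", "")
    # ASSUMPTION: the container picks up JVM options from JAVA_OPTS.
    updated["JAVA_OPTS"] = f"{existing} -Dfcrepo.home={home}".strip()
    return updated


# Hypothetical base directory for all Fedora 4 application data.
env = with_fcrepo_home({"JAVA_OPTS": "-Xmx1g"}, "/var/lib/fcrepo")
print(env["JAVA_OPTS"])  # prints "-Xmx1g -Dfcrepo.home=/var/lib/fcrepo"
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the container from a script.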

Details of the deployment and configuration of Fedora 4 are described on the wiki [21].

8) Documentation
As we move closer to a Beta release of Fedora 4, it is vital that there exist developer and administrator documentation for the application. An initial structuring of this documentation can be found on the wiki [22].

The following sections contain user-facing documentation:
- Administrator Guide [23]
- Developers Guide [24]
- Feature Tour [25]
- Features [26]
- Glossary [27]

9) House-cleaning
Several HTML UI improvements were included in this sprint. Additionally, predicates from the PREMIS namespace [28] are now being used for a number of standard resource properties.


[1] https://wiki.duraspace.org/display/FF/Performance+Testing
[2] https://wiki.duraspace.org/display/FF/Single-Node+Test+Results
[3] https://wiki.duraspace.org/display/FF/AuthZ+-+No+AuthZ+Fedora+4+Comparison+Performance+Testing
[4] https://wiki.duraspace.org/display/FF/Single-Node+Test+Results#Single-NodeTestResults-Fedora3/4Comparison
[5] https://wiki.duraspace.org/display/FF/Large+File+Ingest+and+Retrieval
[6] https://github.com/futures/fcrepo-test-profiles
[7] https://wiki.duraspace.org/display/FF/Versioning
[8] https://wiki.duraspace.org/display/FF/REST+API+-+Versioning
[10] https://github.com/futures/fcrepo4/blob/master/fcrepo-kernel/src/main/resources/fedora-node-types.cnd
[11] https://wiki.duraspace.org/display/FF/REST+API+-+Access+Roles
[12] https://wiki.duraspace.org/display/FF/Feature+Tour+-+Action+-+Access+Roles
[13] https://wiki.duraspace.org/display/FF/Design+-+Wiring+and+configuration
[14] https://wiki.duraspace.org/display/FF/Basic+Role-based+PEP
[15] https://wiki.duraspace.org/display/FF/Batch+Operations
[16] https://wiki.duraspace.org/display/FF/REST+API+-+Batch+Operations
[17] http://wiki.apache.org/marmotta/LDPath
[18] https://wiki.duraspace.org/display/FF/External+Search
[19] https://code.google.com/p/leveldb/
[24] https://wiki.duraspace.org/display/FF/Developers+Guide
[25] https://wiki.duraspace.org/display/FF/Feature+Tour
[26] https://wiki.duraspace.org/display/FF/Features
[27] https://wiki.duraspace.org/display/FF/Glossary
[28] http://www.loc.gov/premis/rdf/v1#
