Short Read Archive
for NCBI/NLM/NIH

Introduction

NCBI began preparing to implement its Short Read Archive knowing mostly the magnitude of what was to come, but having only general information about its form. It was known that massive amounts of data would be stored on-line, in quantities that would overburden commercially available databases, and that once entered, most data would never be updated again.

Based upon experience with its Trace Archives, NCBI decided to archive data directly into a networked file system. The Trace Archives had initially been implemented with a commercial DBMS, but that approach eventually proved untenable in several respects, e.g. database administration, replication, and disk space. To address the situation we developed an archiving format and access libraries specific to Trace data.

The new Trace Archive format was designed to replace the commercial databases on a one-for-one basis and provide just the functionality required of a read-only database. We placed major emphasis on modularity, compactness and speed of retrieval. Complex queries are handled by an internal, NCBI-specific service based upon Sybase Open Server technology and relayed to our access library, which today manages nearly 47 TB of Trace data.

[Illustration: histogram]

When we began work on the Short Read Archive, it was clear that we wanted a similar facility: one that represents data in a very compact form yet can be retrieved quickly from multiple hosts, that scales by adding storage devices to the file system or more front-end servers, and whose content may be edited by moving files into and out of the directory structure. We also decided to transfer more of the archival functions to the file system, meaning we needed increased indexing and querying abilities, plus metadata storage.

Finally, our new archive format had to be flexible enough to accept and preserve the structure of data arriving in unforeseen forms. This is when the hitherto custom archival format and software became general purpose, which in turn has led to greater applicability. It is our goal to develop this solution to the point where the balance of benefits and costs is optimal.

Abstract storage format description

As previously mentioned, the file system is used for storage. This is in contrast to the typical DBMS which accesses the disk as a raw storage device and bypasses the file system altogether.

There are currently two standard representations of the archive. The first is implemented as a directory structure, and the other is a single file archive of the first. Of course, the former can be archived using tar or a similar facility to produce single file archives, but the second form is meant to be directly executable.

Bottom-up structural description

[Illustration: column structure]

At the bottom of the system is the column. It is a self-contained unit designed to store arbitrarily sized blobs, indexed by integer id. Each blob may be stored with or without a checksum that may be used to validate data integrity upon retrieval (NCBI stores a 32-bit checksum with each blob in each column, but may tailor this in the future, as its presence or absence does not affect operation).

The column itself is a directory with one or more data files, optional metadata and multi-stage indexing files. The indices are designed for compact on-disk representation and runtime caching.

Operationally, the column is in itself a mini-database. It makes use of externally assigned integer index keys to access its blobs.
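To make the column's role concrete, here is a minimal sketch of a blob store keyed by externally assigned integer ids, with a 32-bit checksum kept per blob and verified on retrieval. The single data file, the in-memory index and all names in this sketch are illustrative assumptions; the actual multi-stage index files and on-disk layout are not reproduced here.

    /* Sketch of a column as a blob store keyed by integer id.
     * Hypothetical layout: a single append-only data file plus an
     * in-memory index of (offset, size, crc32) per id; the real format's
     * file names and multi-stage index files are not shown. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>          /* crc32() */

    typedef struct { long offset; size_t size; unsigned long crc; } BlobEntry;

    /* append a blob for the given id, storing its checksum in the index */
    static int column_write(FILE *data, BlobEntry *idx, unsigned id,
                            const void *blob, size_t size)
    {
        fseek(data, 0, SEEK_END);
        idx[id].offset = ftell(data);
        idx[id].size   = size;
        idx[id].crc    = crc32(0L, (const Bytef *)blob, (uInt)size);
        return fwrite(blob, 1, size, data) == size ? 0 : -1;
    }

    /* fetch a blob by id and validate it against the stored checksum */
    static int column_read(FILE *data, const BlobEntry *idx, unsigned id,
                           void *buf)
    {
        fseek(data, idx[id].offset, SEEK_SET);
        if (fread(buf, 1, idx[id].size, data) != idx[id].size)
            return -1;
        return crc32(0L, (const Bytef *)buf, (uInt)idx[id].size) == idx[id].crc
            ? 0 : -2;          /* -2: integrity check failed */
    }

    int main(void)
    {
        FILE *data = fopen("basecall.dat", "w+b");
        BlobEntry idx[16] = { 0 };
        char out[64];

        column_write(data, idx, 1, "ACGTACGT", 8);   /* blob for spot id 1 */
        int rc = column_read(data, idx, 1, out);
        printf("read rc=%d blob=%.8s\n", rc, out);
        fclose(data);
        return 0;
    }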

[Illustration: table structure]

The next level is that of the run, which is effectively a table in database parlance. It groups an arbitrary number of columns into a single structure, assigns an integer spot id to each unique key string, and provides key-string-to-id conversion (a projection index). A novel feature of this type of table is that it takes its runtime structural definition from the contents of the file system, i.e. it determines its component columns dynamically upon open. This makes it possible to add or remove columns in a run by simply changing the contents of its directory, e.g. to convert a full run to the functional equivalent of FASTQ by knocking out all but the base-call and quality columns. Another feature is the ability to store as many columns as desired without incurring retrieval or interchange penalties when fewer columns are requested, because the data are not stored by row.
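The dynamic structural definition can be illustrated with a short sketch that, on open, simply scans the run directory and treats each subdirectory found as a column. The directory names and the assumption that every column is a plain subdirectory are hypothetical simplifications, not a description of the actual layout.

    /* Sketch of a run discovering its component columns at open time by
     * scanning its directory; any subdirectory is treated as a column. */
    #include <stdio.h>
    #include <dirent.h>
    #include <sys/stat.h>

    /* list the columns currently present in a run directory */
    static void run_open(const char *run_path)
    {
        DIR *dir = opendir(run_path);
        struct dirent *ent;
        char path[4096];
        struct stat st;

        if (dir == NULL)
            return;
        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] == '.')
                continue;                            /* skip . and .. */
            snprintf(path, sizeof path, "%s/%s", run_path, ent->d_name);
            if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
                printf("column: %s\n", ent->d_name); /* found a column */
        }
        closedir(dir);
    }

    int main(void)
    {
        /* removing all but the base-call and quality subdirectories from
           this directory would make the run behave like a FASTQ equivalent */
        run_open("SRR000001");
        return 0;
    }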

We have not mentioned data compression or other treatment so far because the storage format itself does not require it or have any awareness of it. Nevertheless, it is a major issue in storing Short Read Archive data, and is currently implemented in a layer above the run/table, distributed within NCBI between tools and custom Sybase Open Server libraries. We will describe this layer functionally and, to a lesser extent, structurally, as the current implementation is NCBI-specific and will shortly be replaced with a general purpose structure.

The process of populating a run starts with source-specific information such as platform, number of channels, etc. stored in the run's metadata. The input data are then parsed into spots, separated, and written into columns, and integer ids are assigned to each key string. Data are manipulated as appropriate to each type to convert them to standard formats, e.g. float to integer, rotation, FASTA to 2na. Most manipulations are lossless, but some (such as float to integer) may incur unavoidable truncation loss, while still others are intentionally quantized (ln, arctan, etc.).
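As an example of one such lossless transform, the following sketch packs unambiguous base calls into the 2na form, two bits per base, four bases per byte. The bit ordering within a byte is an assumption made for illustration.

    /* Sketch of packing FASTA text into 2na: A,C,G,T -> 0..3, 2 bits each. */
    #include <stdio.h>
    #include <string.h>

    /* returns number of bytes written, or 0 if an ambiguous base is seen */
    static size_t pack_2na(const char *bases, size_t n, unsigned char *out)
    {
        size_t i;
        memset(out, 0, (n + 3) / 4);
        for (i = 0; i < n; ++i) {
            unsigned code;
            switch (bases[i]) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default:  return 0;      /* ambiguous bases need 4na instead */
            }
            out[i / 4] |= code << (6 - 2 * (i % 4));  /* high bits first (assumed) */
        }
        return (n + 3) / 4;
    }

    int main(void)
    {
        unsigned char packed[8];
        size_t nbytes = pack_2na("ACGTACGT", 8, packed);
        printf("packed 8 bases into %zu bytes (first byte 0x%02X)\n",
               nbytes, (unsigned)packed[0]);
        return 0;
    }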

After applying data transforms (if any), data are optionally compressed using Huffman or other encodings. Compression tables are stored as column-associated metadata for use during retrieval.

The retrieval side of the upper layer is able to treat several individual archives as a single, extended, distributed archive. Within NCBI, this layer sees multiple filers, each with multiple mirror-image volumes, and several runs on each volume, named by accession. Data may be accessed from one or several runs within the upper layer and assembled into a single result. A query for data from one or more spots involves requesting the appropriate column blobs and assembling them into rows for presentation. The column blobs are initially fetched from the file system, decompressed, have any applicable reverse transforms applied, and are then cached.
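The column-oriented assembly step can be sketched as follows: a query names only the columns it needs, and for each spot the corresponding blob is fetched from each of those columns and combined into a row. The fetch_blob stand-in and the run and column names are hypothetical; the real fetch path includes the decompression, reverse transforms and caching described above.

    /* Sketch of row assembly: only the requested columns are touched. */
    #include <stdio.h>

    /* placeholder for the real fetch path (file system + cache) */
    static const char *fetch_blob(const char *run, const char *column,
                                  unsigned spot_id)
    {
        static char buf[128];
        snprintf(buf, sizeof buf, "<%s/%s #%u>", run, column, spot_id);
        return buf;
    }

    int main(void)
    {
        const char *wanted[] = { "basecall", "quality" };  /* FASTQ-like view */
        unsigned spot_id = 42;
        size_t i;

        printf("spot %u:", spot_id);
        for (i = 0; i < sizeof wanted / sizeof wanted[0]; ++i)
            printf(" %s", fetch_blob("SRR000001", wanted[i], spot_id));
        printf("\n");
        return 0;
    }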

There are search facilities with an expression language for querying archives based upon content. Currently these facilities implement matching using the 2na and 4na alphabets, where the latter of course permits wild cards. The basic expression operation is a nucleotide comparison, but several operators are provided that implement basic Boolean logic ( !, &, | ) and sub-expressions, plus anchoring: the ability to require a match at a designated location within the sequence, e.g. the beginning or end ( ^, $ ).
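The 4na matching idea can be sketched as a bitwise test: each base or ambiguity code is a 4-bit set, and a pattern position matches when its set intersects the subject's. The sketch below maps only a few IUPAC codes and shows anchoring simply as matching at a fixed offset; it is not the archive's actual expression language or API.

    /* Sketch of 4na matching: A=1 C=2 G=4 T=8, N=15, match by bitwise AND. */
    #include <stdio.h>
    #include <string.h>

    /* map a base or ambiguity code to its 4-bit set */
    static unsigned na4(char c)
    {
        switch (c) {
            case 'A': return 1; case 'C': return 2;
            case 'G': return 4; case 'T': return 8;
            case 'R': return 1 | 4;          /* A or G */
            case 'Y': return 2 | 8;          /* C or T */
            case 'N': return 15;             /* any base (wild card) */
            default:  return 0;
        }
    }

    /* return 1 if pattern matches sequence starting at offset pos */
    static int match_at(const char *seq, size_t pos, const char *pat)
    {
        size_t i, n = strlen(pat), len = strlen(seq);
        if (pos + n > len)
            return 0;
        for (i = 0; i < n; ++i)
            if ((na4(seq[pos + i]) & na4(pat[i])) == 0)
                return 0;
        return 1;
    }

    int main(void)
    {
        const char *seq = "ACGTTGCA";
        printf("anchored at start: %d\n", match_at(seq, 0, "ACGN"));    /* ^ */
        printf("anchored at end:   %d\n",
               match_at(seq, strlen(seq) - 4, "TGCA"));                 /* $ */
        return 0;
    }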

Generic differences

Our standard format makes use of column-based storage, which is a simple but radical departure from the general approach. It is also heavily but efficiently indexed in every representation. Together these features make read access very fast, which was one of our basic requirements. They also make it possible to perform real-time content-based searches over entire runs; NCBI currently uses the search API to find spots by pattern expression with no content indexing at all.

Finally, our archives are file-based "junior" databases. They are optimized for space and retrieval speed, and the format was designed for execution as well as storage.

Short sequence format requirements

The Generic Format for Sequence Data document, version 1.3.1, provides a set of requirements that have guided the development of SRF. While they differ somewhat from those that drove our work for NCBI, a comparison is useful.

  Req.      Requirement                         Status                          Notes
  R1        Open format                         Complies
  R2        Streamable                          Starting with v2                See discussion
  R3        Efficient                           Complies
  R3 [sic]  Random access                       Complies                        Fully indexed
  R4        No experimental info required       Complies
  R5        Individual and multiple read sets   Complies
  R6        Unique id per read                  Complies
  R7        Big endian                          Source architecture dependent   See discussion
  R8        Multi-platform content              Complies
  R9        Image data not required             Complies

There are two unsatisfied requirements from the SRF document when applied to our V1 format: R2 (streamability) and R7 (big endian).

Streamability can have different meanings in different contexts. In our case, the definition that makes most sense may be taken from the viewpoint of a server, where the ability to stream an object means that it can do so without needing to buffer unbounded (or bounded but unreasonable) amounts of data. Streamability of static files according to this definition is not at issue, but that of dynamically generated objects is. Any format or implementation that violates a server's reasonable buffering bounds is not streamable under those conditions. We intend to introduce a streamable format with version 2.

The big vs. little endian issue must be addressed in file formats. For us, the requirement was to have an unambiguous architecture indication in the format, and no endian exposure in the API. A fixed byte order in the format hurts performance on incompatible architectures; big-endian, for example, penalizes all Intel-based machines, e.g. those running Linux, Mac OS or Windows. Our format indicates byte order in its headers and permits byte-order reversal if desired, for example between Sun and Linux. The important thing is that the API handles byte order properly on all platforms, making the format portable.
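A minimal sketch of this approach: a header value written in the producer's native byte order lets the reader detect whether its own order matches, and the access layer swaps values only when it does not. The magic constant and header handling shown are assumptions for illustration, not the format's actual header.

    /* Sketch of byte-order detection and swapping in the access layer. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAGIC 0x53524131u                  /* hypothetical "SRA1" marker */

    static uint32_t swap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0xFF00u)
             | ((v & 0xFF00u) << 8) | (v << 24);
    }

    /* read a 32-bit value, swapping if the file's order differs from ours */
    static uint32_t read_u32(uint32_t raw, int reverse)
    {
        return reverse ? swap32(raw) : raw;
    }

    int main(void)
    {
        uint32_t file_magic = swap32(MAGIC);   /* pretend the file came from
                                                  an opposite-endian machine */
        int reverse = (file_magic != MAGIC);   /* detect mismatch from header */

        uint32_t stored = swap32(1234u);       /* a value as stored in the file */
        printf("reverse=%d value=%u\n", reverse, read_u32(stored, reverse));
        return 0;
    }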

Summary

This general storage solution for the Short Read Archive started as part of a custom database for NCBI. It brings some novel features with it, but is most remarkable when considered as a whole for its robustness, flexibility, efficiency and performance. It offers features normally associated with a relational database, yet in a portable, light-weight representation comparable in convenience to simple container files. At the same time, it offers unique capabilities that have proven useful within NCBI.