Next-Generation Storage for Next-Generation Sequencing
High-throughput, next-generation sequencing (NGS) has significantly increased the quantity of raw and processed genome sequencing data researchers must manage. To compound matters, sequencing data is routinely stored in redundant sets, because researchers process and annotate data iteratively and seldom, if ever, delete anything. As a result, researchers are struggling to find cost-effective storage for petabytes of genome sequencing data. Fully customizable, scalable data storage solutions can help manage sequencing data at this scale.
Storing Petabytes of Genome Sequencing Data
The lack of cost-effective, scalable data storage with strong input/output (I/O) performance and high data integrity has slowed genomic research. The consequences accumulate: lost or irretrievable data, I/O bottlenecks, limited IT capacity and expertise, and rising power and floor-space costs.
Limitations of Cloud-Based Storage
To mitigate petabyte storage costs, many researchers have uploaded genome sequencing data to inexpensive, off-site (cloud) storage. One downside is that limited bandwidth hampers upload and download times; most labs simply do not have the connectivity to work effectively and efficiently on petabyte-scale data sets held off-site. Another limitation is simultaneous access: multiple researchers working on the same sequencing data within the cloud require still more bandwidth, and everyone working on the same data set creates more data, which demands more space and yet more bandwidth to access it. These bandwidth constraints have escalated in recent years to the point where better solutions have become imperative.
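A back-of-the-envelope calculation shows why bandwidth dominates at this scale. The sketch below assumes a dedicated link at the stated speed with zero protocol overhead, so real-world transfers would be slower still; the link speeds shown are illustrative, not from the original text.

```python
# Rough transfer-time estimates for petabyte-scale data sets.
# Assumes a dedicated link at full speed with no protocol overhead.

def transfer_days(data_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move data_bytes over a link of link_bits_per_sec."""
    seconds = (data_bytes * 8) / link_bits_per_sec
    return seconds / 86_400  # seconds per day

PETABYTE = 10 ** 15  # bytes

for gbps in (1, 10, 40):
    days = transfer_days(PETABYTE, gbps * 10 ** 9)
    print(f"1 PB over a {gbps} Gb/s link: {days:.1f} days")
```

Even at 10 Gb/s, moving a single petabyte takes on the order of nine days, which is why round-tripping whole data sets to and from the cloud quickly becomes impractical.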
What Researchers Need
Researchers need genome sequencing data to be stored in a high-performance, scalable, cost-effective system with high data integrity. They also need integrated software that allows them to manage petabytes of data and billions of files. All the data must be easily and simultaneously accessible to multiple researchers, who create many redundant sets of the same data in a sequencing processing and analysis pipeline. The best large-scale data storage solution must address these and other pain points by understanding how each is interrelated and can affect the performance and capabilities of the system as a whole:
Scalability
Scalability has become a critical attribute for storing genome sequencing data, since NGS analysis must handle ever-larger datasets and more jobs and users. Iterative processing and annotation can multiply this quantity many times over.
Data Integrity
Genome sequencing data may be stored, processed, analyzed, and annotated for years. Therefore, data integrity is of paramount importance from the initial sequencing output to assembled genomes and from computational analyses to reporting. In some cases, accurate sequence information must be preserved to meet regulatory requirements, to support ongoing research and development, and to protect intellectual property.
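One common way to enforce integrity over years of storage is checksum verification: record a cryptographic digest when data is ingested and re-verify it on every read or migration. The sketch below is a minimal illustration of that idea using SHA-256; the chunked read is there because sequencing files are far too large to hold in memory.

```python
# Minimal sketch of checksum-based integrity verification: a digest is
# recorded at ingest and compared against a fresh digest on later reads.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (files may be huge)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, recorded: str) -> bool:
    """True if the file's current checksum matches the one recorded at ingest."""
    return sha256_of(path) == recorded
```

Production systems typically store such digests alongside the data and re-verify on a schedule (often called scrubbing) so silent corruption is caught before the original copy is gone.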
Fault Tolerance
Genome sequencing data must be protected and accessible in the event of systematic hardware and/or software failures. Fault tolerance is the ability of the data storage system to handle several different types of failures while protecting data and making it continuously accessible.
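The classic building block for this kind of fault tolerance is parity: store the XOR of a group of data blocks, and any single lost block can be rebuilt from the survivors. The toy sketch below illustrates the idea behind single-parity schemes such as RAID 5; real arrays add distributed parity, rebuild scheduling, and multi-failure protection (e.g. RAID 6).

```python
# Toy illustration of single-parity protection: parity is the XOR of
# the data blocks, so any one missing block can be rebuilt.
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-sized data blocks into one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Recover the single missing block from survivors plus parity."""
    return parity(surviving + [parity_block])

data = [b"GATTACA!", b"ACGTACGT", b"TTTTCCCC"]
p = parity(data)
assert rebuild([data[0], data[2]], p) == data[1]  # lost block recovered
```

The same XOR identity that builds the parity block also inverts the loss, which is why a degraded array can keep serving reads while the failed disk is rebuilt.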
Storage Performance
To be useful, I/O performance and throughput should be fast enough to allow researchers to process and work on sequencing data with minimal lag time. In general, read performance is valued more than write performance, because sequencing data is typically written once but read many times during downstream analysis. This will be especially critical with petabytes of NGS output.
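When evaluating a storage system, sequential read throughput is straightforward to measure. The sketch below times a chunked read of one file; it assumes the file is large enough to dominate timing overhead, and note that a repeat pass may be served from the operating system's page cache rather than the storage device.

```python
# Rough sketch of measuring sequential read throughput for one file.
# A second pass over the same file may hit the OS page cache, so drop
# caches (or use a fresh file) when benchmarking the device itself.
import time

def read_throughput_mb_s(path: str, chunk_size: int = 1 << 20) -> float:
    """Read the file sequentially and return observed MiB/s."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / (1 << 20)) / elapsed
```

Tools such as fio perform this kind of measurement far more rigorously (queue depths, random vs. sequential patterns, direct I/O), but the principle is the same.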
I/O Bandwidth
High I/O storage bandwidth is essential for fast upload and download speeds. The best storage solution will ingest NGS output directly while providing a variety of data management tools to turn sequencing data into usable and useful information.
Tiered Storage
Raw sequence data is heavily accessed initially, during cleaning and assembly, but much less often once this is done; assembled genome and variant data, by contrast, are accessed regularly by researchers working on different projects, sometimes years later. Multiple tiers of storage allow higher-cost technologies for fast access to be blended with lower-cost nearline and archival storage. Managing this shifting prioritization of data access across tiers is an ongoing challenge for researchers.
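A simple tiering policy can be driven by file age or last access time. The sketch below is a hypothetical illustration: the tier names and the 90-day cutoff are assumptions for the example, not details from the original text, and real tiering software tracks access patterns far more richly.

```python
# Hypothetical age-based tiering policy: files untouched for longer
# than a cutoff are flagged for migration to a cheaper tier.
import os
import time

def tier_for(path: str, archive_after_days: int = 90) -> str:
    """Pick a tier from the file's last-access time (cutoff is illustrative)."""
    idle_days = (time.time() - os.path.getatime(path)) / 86_400
    return "archive" if idle_days > archive_after_days else "fast"
```

In practice a scheduled job would scan the namespace with a policy like this and migrate flagged files transparently, so researchers see one namespace regardless of which tier holds the bytes.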
RAID Inc. Next-Generation Storage Solutions for NGS
Genome sequencing is producing data faster, and at a lower cost per byte, than storage capacity can keep pace with. The lack of cost-effective, scalable data storage for the massive amounts of sequencing data being produced has hindered genomic research. Cloud storage is one solution, but bandwidth limitations often make it unrealistic for research needs.
RAID Inc.’s storage solutions are fully customizable and scalable, allowing researchers to manage very large-scale genome sequencing data while addressing data integrity, fault tolerance, tiered storage, storage performance, and I/O bandwidth challenges.