Big Data Demands Flexible High Performance Computing
Big data, defined as a massive volume of structured and unstructured data that is difficult to process with traditional technology, holds a wealth of possibility. However, standard parallel relational database technology has not proven cost-effective, nor has it delivered the performance needed to analyze massive amounts of data in a timely manner.
As technology advances and data becomes more essential to business intelligence, many organizations are overwhelmed. They have collected and stored massive datasets, but the sheer volume poses a problem: the data must be processed and analyzed quickly to be useful.
Traditional Database Centralization Poses Challenges for Big Data
Traditionally, databases are broken into two classes: analytical and transactional. Transactional databases capture structured information and maintain the relationships between that information. Transactional data is one feedstock for big data. Analytical databases then sift through the structured and unstructured data to extract actionable intelligence. Oftentimes, this actionable intelligence is then stored back in a transactional database.
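The flow between the two classes can be illustrated with a minimal sketch in Python, using an in-memory SQLite database as a stand-in for both roles; the table names, columns and values are purely hypothetical.

```python
import sqlite3

# Minimal sketch of the transactional -> analytical -> transactional flow.
# In-memory SQLite stands in for both database roles; the schema is illustrative.
conn = sqlite3.connect(":memory:")

# Transactional side: capture structured records as they occur.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 40.0)],
)

# Analytical side: sift the captured data to extract actionable intelligence.
summary = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()

# Store the derived intelligence back in a transactional table.
conn.execute("CREATE TABLE customer_value (customer TEXT PRIMARY KEY, lifetime_spend REAL)")
conn.executemany("INSERT INTO customer_value VALUES (?, ?)", summary)

print(conn.execute("SELECT * FROM customer_value").fetchall())
```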
Because of the volume and velocity of data being processed, centralization is anathema to big data. Big data requires decentralization: networking, storage and compute must all be decentralized or they will not scale. Centralization, however, is a core tenet of traditional SQL databases, which tightly couple computation, caching and storage on a single machine in order to deliver optimal performance.
Petabyte-Scale Data Processing Requires Data Parallelism
Many computing problems are suitable for parallelization, and data-parallel applications are a potential solution to petabyte-scale data processing requirements. Data parallelism means applying a computation independently to each item in a dataset, which allows the degree of parallelism to scale with the volume of data.
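As a concrete illustration, the sketch below applies an independent computation to each item of a dataset using Python's standard multiprocessing module; the record fields and the scoring function are hypothetical stand-ins for a real per-item computation.

```python
from multiprocessing import Pool

def score(record):
    """Computation applied independently to a single data item."""
    return record["clicks"] / max(record["impressions"], 1)

if __name__ == "__main__":
    # Hypothetical data items; in practice these would be partitions of a
    # much larger dataset.
    records = [{"clicks": c, "impressions": 100} for c in range(1_000)]

    # Because each item is processed independently, the degree of
    # parallelism can grow with the number of workers and data items.
    with Pool(processes=4) as pool:
        scores = pool.map(score, records)

    print(sum(scores) / len(scores))
```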
The best way of achieving this type of parallelism at scale is to use a parallel file system. The principal reason for developing data-parallel applications on a parallel file system is the potential for scalable storage and performance in high-performance computing, which can yield improvements of several orders of magnitude.
The Agility of Shared-Data Database Clusters Works for Big Data
Shared-data database clusters deliver the agility required to handle big data. Unlike sharded databases, shared-data clusters support elastic scaling: if your database requires more compute, you add compute nodes; if it is I/O bound, you add storage nodes. In keeping with the big data principle of distributing the workload, shared-data clusters push some processing down to smart storage nodes, further reducing bottlenecks and allowing you to scale to meet your big data needs.
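Conceptually, pushing work down to smart storage nodes looks something like the following sketch. It is not any particular product's API: the StorageNode class, its scan method and the sample partitions are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

class StorageNode:
    """Hypothetical 'smart' storage node that can filter its own partition."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate):
        # The filter runs at the storage node, so only matching rows
        # travel back to the compute layer.
        return [r for r in self.rows if predicate(r)]

# Three storage nodes, each holding one partition of the data.
nodes = [
    StorageNode([{"region": "eu", "amount": 10}, {"region": "us", "amount": 5}]),
    StorageNode([{"region": "us", "amount": 7}]),
    StorageNode([{"region": "eu", "amount": 3}]),
]

predicate = lambda row: row["region"] == "eu"

# Scatter the predicate to every node in parallel, then gather and
# aggregate the pre-filtered results on the compute side.
with ThreadPoolExecutor() as pool:
    partials = pool.map(lambda n: n.scan(predicate), nodes)

total = sum(row["amount"] for partial in partials for row in partial)
print(total)  # 13
```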
Also unlike sharded databases, shared-data clusters retain the flexibility to add new tables and relationships on the fly. This flexibility is imperative for keeping up with the ever-changing data sources and data relationships driven by big data. Shared-data clusters can scale to thousands of storage nodes, enabling nearly unlimited scaling.
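Schema flexibility of this kind amounts to issuing DDL at runtime as new data sources appear. The sketch below is a rough illustration only, with SQLite standing in for the cluster's SQL interface and a hypothetical sensor_readings source.

```python
import sqlite3

# Rough illustration of adding a table on the fly for a newly arrived data
# source; the source name and fields are hypothetical, and SQLite stands in
# for the cluster's SQL interface.
conn = sqlite3.connect(":memory:")

new_source = {
    "name": "sensor_readings",
    "fields": {"device_id": "TEXT", "reading": "REAL", "ts": "TEXT"},
}

columns = ", ".join(f"{col} {sql_type}" for col, sql_type in new_source["fields"].items())
conn.execute(f"CREATE TABLE IF NOT EXISTS {new_source['name']} ({columns})")

# The new table is immediately queryable alongside existing ones.
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```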