Big Data, Traditional Database Challenges, and HPC
The Issue: Maximizing the Potential of Big Data
Big data, defined as a volume of structured and unstructured data so massive that it is difficult to process with traditional technology, holds a wealth of possibility. Standard parallel relational database technology, however, has not proven cost-effective, nor has it delivered the performance needed to analyze massive amounts of data in a timely manner (Nyland, Prins, Goldberg, & Mills, 2000).
As technology advances and data becomes increasingly essential to business intelligence, many organizations are overwhelmed. They have collected and stored large amounts of data in massive datasets, but the sheer volume poses a problem: the data must be processed and analyzed quickly to be useful.
Traditional Database Centralization Offers Challenges to Big Data
Traditionally, databases are broken into two classes: analytical and transactional. Transactional databases capture structured information and maintain the relationships between that information. Transactional data is one feedstock for Big Data. Analytical databases then sift through the structured and unstructured data to extract actionable intelligence. Oftentimes, this actionable intelligence is then stored back in a transactional database.
Because of the volume and velocity of data being processed, centralization is anathema to Big Data. Big Data requires decentralization. The networking, storage and compute must be decentralized or they will not scale. However, centralization is a core tenet of SQL databases. Traditional databases tightly link computation, caching and storage in a single machine in order to deliver optimal performance (Biem et al., 2013; Barlow, 2013; Kusnetzky, 2012; ScaleBase, 2012).
SQL Database Sharding Inflexible for Big Data
There are two approaches to scaling SQL databases to handle Big Data: sharding and shared-data clustering. Given an existing schema, sharding removes the relations between tables and stores those tables in separate databases, forcing the application layer to maintain, and in some cases reconstruct, those relationships. One common approach to sharding is to split customers across multiple databases; for example, customers 1-10,000 might live in one database, customers 10,001-20,000 in another, and so on.
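As an illustration of how such range-based routing pushes relational bookkeeping into the application layer, the following minimal Python sketch maps a customer ID to its shard. The shard names and ranges are hypothetical placeholders, not part of any particular product.

```python
# Minimal sketch of range-based sharding: route each customer ID to the
# database that holds its range. Shard names are hypothetical examples.
SHARD_RANGES = [
    (1, 10_000, "customers_db_1"),
    (10_001, 20_000, "customers_db_2"),
    (20_001, 30_000, "customers_db_3"),
]

def shard_for_customer(customer_id: int) -> str:
    """Return the name of the database that stores this customer's rows."""
    for low, high, db_name in SHARD_RANGES:
        if low <= customer_id <= high:
            return db_name
    raise ValueError(f"No shard configured for customer {customer_id}")

# The application layer, not the database, must now remember that related
# rows (orders, invoices, and so on) live in the same shard.
print(shard_for_customer(12_345))   # -> customers_db_2
```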
Sharding is one way to scale your data-handling needs, but it is very inflexible; it does not adhere to the Big Data principle of agility. A sharded database cannot add new data sources, or new ways of processing that data, on the fly. Sharding creates a rigid structure that necessitates a painful re-sharding each time you modify or expand the data or the relationships between the data.
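To see why re-sharding is painful, consider the sketch below, which assumes a naive hash-based routing scheme (an illustrative assumption, not drawn from the text above): growing from three shards to four reassigns most customer IDs, each of which the application must then migrate and re-link.

```python
# Illustrative sketch of re-sharding cost: with naive modulo hash routing,
# changing the shard count reassigns most keys, forcing a bulk migration
# coordinated by the application layer.

def shard_index(customer_id: int, num_shards: int) -> int:
    """Naive hash routing: customer ID modulo the number of shards."""
    return customer_id % num_shards

customers = range(1, 10_001)
moved = sum(
    1 for cid in customers
    if shard_index(cid, 3) != shard_index(cid, 4)   # grow from 3 to 4 shards
)
print(f"{moved} of {len(customers)} customers must move to a new shard")
# Roughly three quarters of the rows relocate, and every relationship the
# application reconstructs across shards has to be revalidated as well.
```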