Rethink data storage?

Recently Werner Vogels mentioned a article about an interview with Michael Stonebraker at his blog about the current sitiuation on the database market aka what has changed since 70ths.  Besides it seems Stonebraker is a Ruby fan … But the core of the article is that we have to rethink how we store our data and because this is also a serious problem at my work place, some experiences I have made:

I have to problems to conquer, one is to store large volumes of data, in the range of several terrabytes, the other challange is the ability to insert huge numbers of data online (means queries take place constantly on the same table). To store data someone will think relational databases on the first place, build a big, central database which stores all information very well indexed. But in reality, relational databases are definitelly not ideal for this task. I have tested IBM’s DB2, Microsoft’s SQL Server (our current system) and Oracle, all in their latest version. Oracle was by at least the factor 2 the fastest one, but not as fast as we want it. A simple test driver program tries to load 170 million records as fast as possible into the database. Oracle behaves very well if no index is set, needs only 660 sec. to store the information (with JDBC and the use of batched Prepared Statements). As comparison, pure file writing needs 386 sec., so not a bad value at all. But the trouble begins if you want to have an active index on the table, the time needed raises to 10469 sec.! If you look at at the Oracle Enterprise Manager Dashboard (which is quite useful), most time is needed to reorganize the index data. Droping and rebuilding of the index is no option, because you need to execute queries in parallel. The only solution is to use Oracle’s Materialized Views, which handle the query access and are refreshed after the main table was filled up.

Besides the relational databases I am also testing a so called “post relational” database, Intersystem Caché. This is basically a hybrid hierarchical/multidimensional array based approach (with anchestors back to 1960) with a deticated SQL and Object Oriented Layer. Both layers are stable (at least under Java), and the database behaves like any other one. Caché is at least as fast on queries as DB2 or the SQL Server, but it looses the performance crown to Oracle if the Server has a lot of processing cores (in my case eight Opteron Cores) because of the very effective Oracle Query Optimizer. Caché is very fast if it comes to the insert of the data, normally the performance is compareable to the one of Oracle without active index. Because it seems not to suffer from the old habits of hierarchical databases to support only some access paths well, it is a alternative solution for our problem.
Another alternative solution is the partitioning of the database, distribute it over a lot of independent servers with the help of something like Sequoia. I will test it as soon as possible because if it works it could be the best solution besides a single, expensive System as Oracle RAC or Netezza.

The second problem is easier to solve. Do not store large text fields at all inside the database! Because we use external idnex system such as Lucene for information retrieval, the database has only to store the bibliographic data, associated meta information and the path to the text fragment file on a storage solution based on SUN’s ZFS file system. Storing data inside the database works well for small amounts of data or if you have to process the text inside SQL queries. Otherwise externalization is the best option.

Others like Google have build their own storage systems, similar to Apache Hadoop, and are not using traditional products. They definitively are right to do so because the traditional relational databases are by no way sufficient for todays data storage demands. Stream based processing, dedicated DWH systems (like Teradata and Netezza), XML based systems and large text repositories are the future, the time of then one system fits it all are gone.

So what is your experience with databases? Does anyone know of better solutions for the mentioned problems or know promising new ways? Maybe another solution is to use replicative caches like Oracle Coherence or to use small local storage systems on the client/service side like db4o and give up the idea of a consistent data repositories?


No comments yet

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: