Archive for July, 2007

“Director’s cut” of news feeds

This is an older but no less interesting documentary about what you can see and hear if you listen in on satellite news feeds. I think this is not so easy any more, because today most material is encrypted and already cut before broadcasting.

Visual Studio 2008 first impressions

Microsoft has released the second beta of Visual Studio 2008 and I installed it (the Professional edition) yesterday. The first positive thing: it installs without problems alongside my existing VS 2005 installation, and both seem to work correctly. One of the most anticipated features of VS 2008 is multi-targeting. It is now possible to build for .NET 3.5 as well as for .NET 2.0 and 3.0. The nice thing is that the target is a project property (which seems natural), so mixed solutions are possible. .NET 3.5 brings no new CLR; it still runs on the old, proven 2.0 runtime (Silverlight 1.1 being an exception), so all new language extensions in C# are pure syntactic sugar.

I think everyone has read about the new language possibilities, but now they are integrated very well into the editor and debugger of VS 2008. For example, if you use LINQ to query a collection and return objects, with the possibility to create new anonymous types on the fly, the editor shows the correct runtime type of the return value when you hover the mouse cursor over the var keyword. IntelliSense is also fully aware of the created type before it has been compiled for the first time. That is really a sophisticated feature I have not seen in any other editor (NetBeans 6.0 with JRuby really needs something like this …). Windows Presentation Foundation (WPF) is now fully integrated and the editor works flawlessly, so it is finally possible to get rid of the silly WinForms constructs. Windows Communication Foundation (WCF) and Workflow Foundation are at the same integration level as in VS 2005; at least I have not found anything new so far.
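The hover-over-var behavior is hard to demonstrate outside of C#, but as a rough Java analogue (assuming Java 16+ records and streams; all type names below are made up for illustration), the same "project a collection into a new shape and let the compiler infer the result type" pattern looks like this:

```java
import java.util.List;

public class LinqAnalogue {
    // Hypothetical source type; a record stands in for the anonymous
    // type that a C# LINQ projection would create on the fly.
    record Book(String title, int year) {}
    record Hit(String title, boolean old) {}

    // Filter and project, roughly what a LINQ select/where would do.
    static List<Hit> query(List<Book> books) {
        return books.stream()
                .filter(b -> b.year() < 2000)
                .map(b -> new Hit(b.title(), true))
                .toList();
    }

    public static void main(String[] args) {
        // `var` infers List<Hit>; hovering it in an IDE shows the full type,
        // similar to the VS 2008 editor behavior described above.
        var hits = query(List.of(new Book("SICP", 1985), new Book("POAA", 2003)));
        System.out.println(hits);
    }
}
```

Unlike C#'s anonymous types, the record must still be declared explicitly, which is exactly the boilerplate the LINQ feature removes.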

I have to admit VS 2008 Beta 2 is a very impressive release. It seems very stable, it does not disturb existing installations, and the new editor has awesome support for the new language features. The only thing missing is IronRuby 😉

OSCON 2007

Great presentations from the O’Reilly OSCON 2007 conference.

Never get lost on Web 2.0

A fascinating map of trends in 2007!

Rethink data storage?

Recently Werner Vogels mentioned on his blog an article about an interview with Michael Stonebraker on the current situation in the database market, i.e. what has changed since the ’70s. Besides, it seems Stonebraker is a Ruby fan … But the core of the article is that we have to rethink how we store our data, and because this is also a serious problem at my workplace, here are some experiences I have made:

I have two problems to conquer. One is to store large volumes of data, in the range of several terabytes; the other challenge is the ability to insert huge numbers of records online (meaning queries run constantly against the same table). To store data, most people will think of relational databases first: build a big, central database which stores all information, well indexed. But in reality, relational databases are definitely not ideal for this task. I have tested IBM’s DB2, Microsoft’s SQL Server (our current system) and Oracle, all in their latest versions. Oracle was the fastest by at least a factor of 2, but not as fast as we want. A simple test driver program tries to load 170 million records as fast as possible into the database. Oracle behaves very well if no index is set and needs only 660 sec. to store the information (with JDBC and batched prepared statements). As a comparison, pure file writing needs 386 sec., so not a bad value at all. But the trouble begins if you want an active index on the table: the time needed rises to 10469 sec.! If you look at the Oracle Enterprise Manager dashboard (which is quite useful), most of the time is spent reorganizing the index data. Dropping and rebuilding the index is no option, because queries need to execute in parallel. The only solution is to use Oracle’s materialized views, which handle the query access and are refreshed after the main table has been filled.
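The batched-insert loading pattern mentioned above can be sketched in JDBC roughly like this (table name, column layout and batch size are hypothetical, not the actual test driver):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchLoader {
    static final int BATCH_SIZE = 10_000; // hypothetical; worth tuning per driver

    // Pure helper: how many executeBatch() round trips a load of n rows needs.
    static int batchCount(int rows) {
        return (rows + BATCH_SIZE - 1) / BATCH_SIZE;
    }

    // Sketch of a bulk load over a live connection (needs a real JDBC driver).
    static void load(Connection con, Iterable<Object[]> rows) throws SQLException {
        con.setAutoCommit(false); // avoid one commit per row
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO records (id, title) VALUES (?, ?)")) {
            int pending = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();                 // queue the row instead of executing
                if (++pending == BATCH_SIZE) { // flush a full batch in one round trip
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch(); // flush the remainder
        }
        con.commit();
    }
}
```

With 170 million rows and this batch size, the loader would issue 17,000 batch round trips instead of 170 million single-row inserts, which is where most of the JDBC-side speedup comes from; the index-maintenance cost discussed above is unaffected by batching.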

Besides the relational databases, I am also testing a so-called “post-relational” database, InterSystems Caché. This is basically a hybrid hierarchical/multidimensional-array-based approach (with ancestors going back to the 1960s) with a dedicated SQL and object-oriented layer. Both layers are stable (at least under Java), and the database behaves like any other. Caché is at least as fast on queries as DB2 or SQL Server, but it loses the performance crown to Oracle if the server has many processor cores (in my case eight Opteron cores), because of Oracle’s very effective query optimizer. Caché is very fast when it comes to inserting data; normally the performance is comparable to Oracle without an active index. Because it does not seem to suffer from the old habit of hierarchical databases of supporting only some access paths well, it is an alternative solution for our problem.
Another alternative solution is to partition the database, distributing it over many independent servers with the help of something like Sequoia. I will test it as soon as possible, because if it works it could be the best solution short of a single, expensive system like Oracle RAC or Netezza.

The second problem is easier to solve: do not store large text fields inside the database at all! Because we use external index systems such as Lucene for information retrieval, the database only has to store the bibliographic data, the associated meta information and the path to the text fragment file on a storage solution based on Sun’s ZFS file system. Storing text inside the database works well for small amounts of data or if you have to process the text inside SQL queries; otherwise externalization is the best option.
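A minimal sketch of the externalization idea, assuming a hypothetical sharded directory layout for the text fragments (the real Lucene index and the ZFS storage layer are not shown; the database row would keep only the metadata plus the returned path):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextExternalizer {
    // Hypothetical layout: shard fragment files into subdirectories
    // by the first two characters of the document id, to keep any
    // single directory from growing too large.
    static Path fragmentPath(Path root, String docId) {
        String shard = docId.length() >= 2 ? docId.substring(0, 2) : docId;
        return root.resolve(shard).resolve(docId + ".txt");
    }

    // Write the full text to the file system and return the path;
    // only this path (not the text) goes into the database.
    static Path store(Path root, String docId, String fullText) throws IOException {
        Path p = fragmentPath(root, docId);
        Files.createDirectories(p.getParent());
        Files.writeString(p, fullText);
        return p;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("fragments");
        Path p = store(root, "ab12345", "full article text ...");
        System.out.println(p); // e.g. <tempdir>/ab/ab12345.txt
    }
}
```

The queries then run against the full-text index and the database in parallel, and the application joins results via the document id.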

Others like Google have built their own storage systems, similar to Apache Hadoop, and are not using traditional products. They are definitely right to do so, because the traditional relational databases are in no way sufficient for today’s data storage demands. Stream-based processing, dedicated DWH systems (like Teradata and Netezza), XML-based systems and large text repositories are the future; the time of the one system that fits all is gone.

So what is your experience with databases? Does anyone know of better solutions for the mentioned problems, or of promising new approaches? Maybe another solution is to use replicated caches like Oracle Coherence, or to use small local storage systems on the client/service side like db4o and give up the idea of a single consistent data repository?

On the evil side

Google has the motto “Don’t be evil”, but sometimes they are definitely misguided. Google’s advertising group for the health-care industry has its own blog, and the newest entry is a recommendation for the poor companies targeted by Michael Moore’s newest movie “Sicko” on how to help people find the “true” information about them with the help of Google Ads. Michael Moore always exaggerates in his movies, but they help to put the spotlight on the weak spots of American society. Google is the premium access point for finding information on the Internet, and it is, also incontestably, a commercial company whose primary goal is to make money, nothing else. But I think Google now also has a responsibility to balance the interests of the public against its commercial targets; yet if they have to choose, they always choose the profitable one (as seen with China).