Archive for November, 2007

A comprehensive overview of useful data mining algorithms

The book Programming Collective Intelligence by Toby Segaran is a practical and useful guide through the most common data mining algorithms, more precisely classification algorithms. The book covers traditional algorithms such as decision trees, the naive Bayesian classifier, neural networks, and clustering, as well as less common ones such as support vector machines and non-negative matrix factorization. The last one was new to me. The first paper seems not to have appeared before 2000, so it is a relatively new technique, and it has shown very good results for feature extraction from large numerical spaces. Optimization techniques like simulated annealing and genetic algorithms are also covered. Interestingly, there is even a small example of genetic programming.
All algorithms are explained with examples and small programs in Python. The only thing I missed is a mathematical representation in addition to the explanation itself, but at least all utility functions are described more formally in the appendix.
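Since non-negative matrix factorization was the technique that was new to me, here is a minimal sketch of the idea using the classic multiplicative update rules; this is my own toy version (assuming numpy), not the code from the book:

```python
import numpy as np

def nmf(V, rank, iterations=100, eps=1e-9):
    """Factor a non-negative matrix V into W * H with W, H >= 0,
    using multiplicative updates (Lee & Seung style)."""
    n, m = V.shape
    W = np.random.rand(n, rank)   # basis features
    H = np.random.rand(rank, m)   # weights of the features per column of V
    for _ in range(iterations):
        # update H, then W; eps avoids division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# tiny usage example: factor a random non-negative 6x5 matrix into rank 2
V = np.random.rand(6, 5)
W, H = nmf(V, rank=2)
print(np.round(W @ H, 2))  # should roughly reconstruct V
```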

Content-Aware Image Resizing

At SIGGRAPH 2007, Shai Avidan presented an amazing demo of how to resize images, both shrinking and enlarging them, without losing important content information. Everyone was amazed by this technology, but the paper is available, and by now there are in fact several implementations of the algorithm, called Seam Carving.
The results are really fascinating because the algorithm works without any knowledge about the content of the image. So check out the link on Mike Swanson's blog; he also wrote a first C# implementation.
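To give a rough idea of how Seam Carving works (my own sketch based on the paper, not Mike Swanson's code): compute an energy map from the image gradients, use dynamic programming to find the connected vertical path of pixels with the lowest total energy, and remove that path; repeating this shrinks the image column by column while leaving the high-energy, "important" regions mostly untouched. A minimal Python version, assuming numpy and a grayscale image given as a 2D float array, could look like this:

```python
import numpy as np

def energy(gray):
    """Simple gradient-magnitude energy map of a grayscale image."""
    dy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    dx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    return dx + dy

def remove_vertical_seam(gray):
    """Remove the lowest-energy connected vertical seam (dynamic programming)."""
    e = energy(gray)
    h, w = e.shape
    cost = e.copy()
    # accumulate the minimal seam cost from top to bottom
    for y in range(1, h):
        left = np.roll(cost[y - 1], 1)
        left[0] = np.inf
        right = np.roll(cost[y - 1], -1)
        right[-1] = np.inf
        cost[y] += np.minimum(np.minimum(left, cost[y - 1]), right)
    # backtrack the cheapest path from bottom to top
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    # drop one pixel per row to get an image that is one column narrower
    return np.array([np.delete(row, sx) for row, sx in zip(gray, seam)])
```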

IRF Symposium 2007

I’m currently working for a startup in the information retrieval domain which, a year ago, founded an open community platform called the Information Retrieval Facility. An international symposium was organized last week; these were my impressions:
My impression from overhearing some discussions is that there are basically three overlapping problem fields. First, there are the purely engineering-based problems, which are sometimes caused by not being aware of current technology and possibilities; in fact most of the solutions are 10 or more years behind the current state of the art. With professional software engineering, EVERY company is able to solve them; the only thing that matters is whether the technical solution itself supports the actual revenue-generating product.
The second problem field is the “big ball of mud” of the patent information itself. The raw information is very inconsistent, contains a lot of errors, comes in nearly every possible flavor and format, is not standardized in any way (and will not be in the near future), and is nowadays more a legal document than a technical description. There is no other way than to handle each small bucket of mud carefully and with as much domain knowledge as possible. Because patents are written for humans, only actual domain experts will be capable of working through them. The demand here is to find a way to provide a supporting workflow as well as “conversational interfaces” (language-oriented, not solely voice-oriented) to help mine information out of the pile of mud. Simple things such as flexible filter combinations can help a lot, as can building up a working feedback loop (OK, not so simple). As for draining the mud entirely, only Google would be able to do that…
The third field is the area of scientific problems in information and knowledge retrieval. Whether on the structural, semantic, language, or domain-specific level, there are a lot of not-so-simple problems which have to be solved, and there is mostly no existing shortcut. The research here goes in two directions: making the information and the actual (meta)knowledge more computable, and providing humans with tools that help them cope with the knowledge itself, through navigation, presentation, abstraction, or decision making.
Some other common themes: it is clear that China will be the biggest challenge in the patent world; patent search is a challenging and very important factor, at least for international companies, and will become more important for smaller ones as well; and searching over special features such as chemical structures or images, as well as feature and knowledge extraction from pure text, are not yet well integrated into a flexible tool that supports search strategies.

So I have to say the event was very well organized, and the idea of building a community platform for different professions seems to be working out nicely. Presentations and pictures can be found here.

UPDATE: The videos of all presentations can now also be found here.

Singularity Summit 2007

For all who are interested in the progress and current state of artificial intelligence, interesting talks from this year’s summit of the Singularity Institute for Artificial Intelligence can be found here. Only a few have transcripts available; the others are only available as podcasts and are sometimes hard to follow. But there are a lot of interesting ideas and thoughts to hear.

DARPA Urban Challenge 2007

Yesterday I watched the complete DARPA Urban Challenge, the whole 7 hours, and it was worth the time (at least more interesting than some F1 races). Eleven teams qualified for the final race; six of them completed all three missions, and all in good time. The goal for every bot was to finish the missions as fast as possible, drive without violating the California traffic rules, and not collide with any other bot or with any of the other 37 vehicles on the road. The course was very large, with trees and off-road sections, so GPS had some problems, as did some unmapped streets. Every car got its randomly generated missions 5 minutes before starting, so no team knew in advance what to expect.
The first three cars had one thing in common: they drove really confidently, especially the Stanford car “Junior”, which was not only the fastest one but also drove absolutely perfectly. The second one was, as in 2005, “The Boss” from Carnegie Mellon. The last three were also very interesting to watch: “Skynet” and “Little Ben” were very careful and “polite” drivers; they waited at every crossing until every other car had passed, and Skynet was by far the best-looking car. MIT’s “Talos” was very interesting to watch. They had more sensors and processing power attached to the car than everyone else, trying to process as much environmental data as possible, which resulted in very “spastic” driving, because every 5 minutes the car was definitely not sure what to do. Talos was also the bully in this race: it collided with Team CarOLO and with Skynet, and it also ignored most of the traffic rules (in fact it drove like a teenager). But because this was MIT’s first participation in the race, it was a very good show.
The winner will be announced today, but I think “Junior” has made it.

So after the 2004 Grand Challenge, where only one car drove about 8 miles, now six cars have succeeded in an urban environment. In three years the technology evolved very fast, and if you consider that at least Lexus is now starting to build robotic behavior into its cars (self-parking functionality), it is reasonable to expect that in ten years it will be technically feasible to build very reliable robotic vehicles. Maybe this was the last challenge, but I think DARPA has shown what is technologically possible in a very short time.

UPDATE: OK, I was wrong. Dr. William “Red” Whittaker has made it this year with “The Boss”: his car wins the DARPA Urban Challenge 2007, and second this time is Stanford. He was actually faster, with an equally perfect driving performance.