Analytics


Here are some papers and presentations I have put together on various topics in analytics:


A QSAR Model using a Probabilistic Neural Network Ensemble (PDF – Presentation at SIAM Symposium on Data Mining 2011)


Market Basket Analysis/Affinity Analysis (PDF – old Sensorlytics white paper)


Movie Revenue Forecasting (PDF – old Sensorlytics white paper)


Methods for Blending Multiple Models (PDF – my entry to AusDM2009)


Method for Mining Clinical Data for Drug Side-effect Associations (PDF – Method for OMOP Cup)


Netflix Prize Presentation at Cascade Systems Society, 5 Nov 2010 (PDF)


A Method for Scheduling Railroad Fueling Operations (PDF)


A Simple Football Prediction Algorithm




Some of the analytics competitions I have competed in are:


Kaggle Claim Prediction Challenge (31st / 107)

The goal of this competition was to predict the likelihood and size of injury claims filed by a given driver.


Kaggle dunnhumby's Shopper Challenge (93rd / 284)

The goal of this competition was to predict the day a person would next go shopping and how much they would spend.


SIAM SDM'11 Contest: Prediction of Biological Properties of Molecules from Chemical Structure ( 1st Place !!!! )

The process of identifying useful drugs is still largely a trial-and-error process, in which researchers physically test candidate compounds for even basic properties such as solubility in water. The purpose of this challenge is to develop algorithms that can look at structural features of the molecules (which can be extracted from the formulae by current modeling software) and compare them to a set of reference molecules with known physical properties. While this technique won't provide certainty, it may be useful in reducing the number of candidate molecules that need to be 'wet-tested' in the lab.


E-LICO Multi-nomics Prediction Challenge ( 4th Place )

The goal of this competition is to come up with algorithms that can estimate the extent of kidney problems (differential function and blockage) from biological markers, as opposed to more invasive measures.


INFORMS Railroad Applications Section Problem Solving Competition (2nd Place / 31!!)

This competition was not data mining, but operations research. The problem was to come up with an algorithm that would efficiently schedule fueling operations for a small railroad system (73 fueling depots and 214 locomotives). Years ago, I studied how to model factories and was interested in scheduling problems. At the time it didn't seem like anybody other than the military actually used OR techniques, so I wandered off into the wilds of the electronics industry. Since I was down in Austin for the awards for this competition, I also spent a few days at the INFORMS conference attending sessions in various areas (Transportation, Healthcare, Energy). What really surprised me was that after 20+ years these kinds of problems still have not been completely solved.


IEEE ICDM2010 Contest (Ranks 15th/101, 10th/40, 6th/17 in the three tracks)

This competition is sponsored by TomTom (the GPS maker) as part of an IEEE conference and concerns a problem we can all relate to: Road Traffic. The goal is to come up with algorithms that can predict traffic levels and traffic jams in advance using data collected from roadways or vehicle GPS transponders. This competition had three separate 'tracks' (events) in which one could enter - each track represented a different problem to forecast.


RSCTC 2010 Discovery Challenge (Rank 19th / 96)

This was another conference-tied competition where the goal was to classify DNA microplate data. This was an interesting problem in that I still have no idea what a DNA microplate really is or how you collect data from one. Still, that didn't seem to be a serious impediment in terms of building a system to classify the data :)


2009/2010 OMOP Cup (25th / 55)

Competition sponsored by the Foundation for the National Institutes of Health. Given simulated medical records (patient symptoms and drug course treatments), the challenge was to identify which drugs caused which side effects. This was a large-scale dataset, with 10,000,000 patient records to process.


AusDM2009 Analytic Challenge (11th / 19)

Competition sponsored by the Australasian Data Mining Conference. The task was to figure out how to optimally combine a series of expert predictions (in this case derived from the Netflix Prize Competition).
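To give a flavor of what "combining expert predictions" can mean, here is a minimal sketch of the simplest possible blend: a weighted average of two models, with the weight chosen by least squares on data where the true answers are known. This is not the method from my AusDM2009 entry (that is in the PDF listed above); the function names and the tiny data set are made up purely for illustration.

```python
import math


def rmse(preds, actual):
    """Root-mean-square error between predictions and true values."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, actual)) / len(actual))


def best_blend_weight(a, b, actual):
    """Weight w that minimizes squared error of w*a + (1 - w)*b.

    Setting the derivative of sum((y - b - w*(a - b))**2) to zero gives
    w = sum((a - b)*(y - b)) / sum((a - b)**2).
    """
    num = sum((ai - bi) * (yi - bi) for ai, bi, yi in zip(a, b, actual))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # If the two models agree everywhere, any weight gives the same blend.
    return 0.5 if den == 0 else num / den


def blend(a, b, w):
    """Weighted average of two prediction lists."""
    return [w * ai + (1 - w) * bi for ai, bi in zip(a, b)]


# Hypothetical data: true ratings and two imperfect models.
actual = [4.0, 3.0, 5.0, 2.0, 1.0]
model_a = [3.5, 3.5, 4.5, 2.5, 1.5]
model_b = [4.5, 2.0, 5.0, 1.0, 2.0]

w = best_blend_weight(model_a, model_b, actual)
blended = blend(model_a, model_b, w)
print("weight on model A:", round(w, 3))
print("RMSE A, B, blend:", rmse(model_a, actual), rmse(model_b, actual), rmse(blended, actual))
```

On the data used to fit the weight, the blend can never be worse than either model alone, because weights of 1 and 0 reproduce the individual models. Whether it holds up on unseen data is the real question, and that is where the competition got interesting.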


Netflix Prize Competition (770th / 40,000+ entrants)

Sponsored by Netflix. Given a set of customer ratings for movies, the task was to predict how customers will rate movies they have yet to see. With roughly 480,000 customers, nearly 18,000 movies, and over 100,000,000 ratings, this is a large-scale consumer preference modeling problem.
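For readers who have not seen this kind of problem, here is a rough sketch of a very simple baseline predictor: the overall average rating, adjusted by how far each movie and each customer tends to sit from that average. This is a generic illustration, not my actual Netflix Prize entry; the function names and the handful of sample ratings are hypothetical, and the 1-to-5 clamp assumes a five-star rating scale.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit global mean plus per-movie and per-user offsets.

    ratings: list of (user, movie, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Movie offset: average amount a movie's ratings differ from the mean.
    movie_sum, movie_n = defaultdict(float), defaultdict(int)
    for _, m, r in ratings:
        movie_sum[m] += r - mu
        movie_n[m] += 1
    movie_bias = {m: movie_sum[m] / movie_n[m] for m in movie_sum}

    # User offset: what is left over after removing mean and movie offset.
    user_sum, user_n = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        user_sum[u] += r - mu - movie_bias[m]
        user_n[u] += 1
    user_bias = {u: user_sum[u] / user_n[u] for u in user_sum}

    return mu, user_bias, movie_bias


def predict(model, user, movie, lo=1.0, hi=5.0):
    """Predict a rating; unknown users or movies fall back to a zero offset."""
    mu, user_bias, movie_bias = model
    raw = mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(hi, max(lo, raw))


# Hypothetical sample data.
sample = [
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u2", "m2", 2),
    ("u3", "m1", 3),
]
model = fit_baseline(sample)
print(predict(model, "u3", "m2"))
```

A real system would at least shrink the offsets for customers and movies with only a few ratings, but even a crude baseline like this makes a useful starting point to measure fancier models against.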



My Tools:

To this point the main software tool I have used in these data-mining exploits is Microsoft VB.NET Express (because Microsoft is giving this away for free, the price is right!). I have poked around a bit with R and GNU Octave but have yet to use them in a serious way. As for hardware, my number-crunching box is an Intel E5300 (dual-core Pentium) with 4GB of RAM that I bought at Office Depot for less than $400. One thing that has become apparent to me through all these competitions is that very simple analytics algorithms running on modest hardware can often be nearly as effective as vastly more complex ones running on bleeding-edge hardware - a refreshing change from the comically macho computer nerd culture where having the trendiest and most 'powerful' hardware and software seems to be a substitute for the size of one's wiener.


Last Updated 1/1/12