It is easy to be fooled by statistics. Analysis of a big data set sometimes leads to conclusions that make no sense, and a correlation found in the data does not establish cause and effect.

Take the example of Kaggle, which ran a contest in 2012 on the quality of used cars. A used car dealer supplied data on cars and their characteristics so that contestants could predict which cars were likely to have problems and which were not. A correlation analysis showed that cars painted orange were far less prone to defects, at about half the rate of other cars.

What does car color have to do with defects? Paint color does not cause mechanical reliability, and rightly so; the pattern was just a chance result that the analysis pulled out. But once such a correlation between defects and color has been found, the conclusions drawn from it tend to get ridiculous:

• Paint your car orange to have fewer defects.
• Buy an orange car and it will last longer no matter how you treat it, so forget about the oil change.
• If you have an orange car, you do not need to maintain it.

The further you push such conclusions, the more absurd they become. Even with the most sophisticated analysis, it is important to reason about whether a result makes sense rather than believe everything the data appears to show.
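A small simulation can illustrate how a pattern like the orange car result can appear by chance alone. The sketch below is not the Kaggle data or analysis: the color mix, the defect rate, and the function name `defect_rates_by_color` are all made up for illustration. Every simulated car has the same probability of being defective regardless of color, yet rare colors form small groups, so their observed defect rates can drift well away from the overall rate.

```python
import random


def defect_rates_by_color(n_cars=5000, base_rate=0.12, seed=0):
    """Simulate cars whose defects are independent of color.

    Returns (rates, overall): the observed defect rate per color and
    the overall defect rate. All numbers here are hypothetical.
    """
    rng = random.Random(seed)
    # Assumed color mix: orange is rare, so its group is small and noisy.
    colors = ["white", "silver", "black", "blue", "red", "green", "orange"]
    weights = [30, 25, 20, 12, 8, 4, 1]

    counts = {c: [0, 0] for c in colors}  # color -> [defective, total]
    for _ in range(n_cars):
        color = rng.choices(colors, weights=weights)[0]
        defective = rng.random() < base_rate  # same probability for every color
        counts[color][0] += defective
        counts[color][1] += 1

    # Skip any color that happened to receive no cars at all.
    rates = {c: d / t for c, (d, t) in counts.items() if t > 0}
    overall = sum(d for d, _ in counts.values()) / n_cars
    return rates, overall


if __name__ == "__main__":
    rates, overall = defect_rates_by_color()
    print(f"overall defect rate: {overall:.3f}")
    for color, rate in sorted(rates.items(), key=lambda kv: kv[1]):
        print(f"{color:>7}: {rate:.3f}")
```

Running this with different `seed` values tends to show the rare colors swinging furthest from the overall rate, sometimes far below it and sometimes far above, even though color has no effect on defects in the simulation. Searching many characteristics for the most striking gap is how a chance result gets singled out.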