It is easy to be fooled by statistics. Analysis of big data sets sometimes leads to conclusions that make no sense, and cause and effect do not follow automatically from a correlation: just because two things are correlated does not mean one causes the other. Take an example from Kaggle. In 2012 they ran a contest on used-car quality: a used-car dealer supplied data on the characteristics of cars, and competitors had to predict which cars were likely to have problems and which were not. A correlation analysis showed that cars painted orange were far less prone to defects, about half the rate of other cars. What does a car's color have to do with defects? Nothing, and rightly so; the correlation was simply a chance pattern pulled out of the data. But once such a correlation between defects and color has been found, the conclusions people draw tend to get ridiculous:

• Paint your car orange to have fewer defects.
• Buy an orange car and it will last longer, no matter how you treat it; forget about the oil change.
• If you have an orange car, you do not need to maintain it.

The further such conclusions are taken, the sillier they get. Even with the most sophisticated analysis, it is important to reason about causes rather than believe everything the data appears to conclude.
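To see how easily a chance correlation like the orange-car result can appear, here is a minimal sketch (not the actual Kaggle data; all numbers are made up for illustration). It simulates cars whose defects are completely random and independent of color, then measures the defect rate per color. With many colors and limited cars per color, some color will always look "lucky" purely by chance:

```python
import random

random.seed(42)

# Hypothetical data: 1000 cars, each defective with probability 0.5,
# completely independent of any other attribute.
n_cars = 1000
defective = [random.random() < 0.5 for _ in range(n_cars)]

# Assign each car one of 20 colors uniformly at random, so color
# carries no real information about defects whatsoever.
colors = [f"color_{random.randrange(20)}" for _ in range(n_cars)]

# Measure the observed defect rate for each color group.
rates = {}
for color in set(colors):
    group = [d for d, c in zip(defective, colors) if c == color]
    rates[color] = sum(group) / len(group)

overall = sum(defective) / n_cars
luckiest = min(rates, key=rates.get)
unluckiest = max(rates, key=rates.get)
print(f"overall defect rate: {overall:.2f}")
print(f"luckiest color {luckiest}: {rates[luckiest]:.2f}")
print(f"unluckiest color {unluckiest}: {rates[unluckiest]:.2f}")
```

Because each color group has only about 50 cars, the per-color defect rates scatter noticeably around the true 50% rate. Pick the extreme group after the fact and you have "discovered" a color that halves or doubles defects, even though the data contains no such effect. This is the multiple-comparisons trap: check enough attributes and something will correlate by chance.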