Data analysis and Big Data (52)

12 Jan 2017

This is just a short note: if you do any biological data analysis, whether on sequence data, flow cytometry data or any other kind of data, look first at the Bioconductor website. It hosts open source software, mostly written in R, that lets you see the process used to analyze the data and gives you a working solution for the cost of processing time. There are many packages for genome sequence analysis and bioinformatics, but there are other interesting packages too. For example, "imageHTS" handles microscopy HTS data: analysis of high-throughput microscopy-based screens.

Most of these packages run on the local machine, but it is also possible to run them in the cloud, depending on your setup.

What is your favorite package in Bioconductor?

08 Sep 2014

Did you know about Paul the Octopus, who could predict soccer matches? Since 2008, Paul the psychic octopus, who lived in a German zoo, has been correctly picking the winners of World Cup soccer matches, especially those involving Germany. It works like this: to get a prediction from Paul, the zookeepers present him with two transparent boxes, each labeled with the flag of one of the competing countries and loaded with a tasty mussel to eat. The country whose box Paul eats from first is the one predicted to win the game.

Does Paul know about soccer? No.

Can he unscrew bottle lids? Yes, the octopus is one of the most intelligent species in the invertebrate world.

Does he predict correctly? Yes, most of the time.

Paul correctly predicted 8 out of 8 games in the 2010 FIFA World Cup, which, if each pick were a pure guess, converts to 1 chance in 256.

Probabilities are a concept that we learn in school with a coin toss. If you toss an unbiased coin, it has an equal probability of landing heads or tails. Thus the chance of correctly guessing 8 straight matches is 0.5 x 0.5 x 0.5 x 0.5 x 0.5 x 0.5 x 0.5 x 0.5 = 1 chance in 256.
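
The arithmetic above can be checked with a short Python sketch. It assumes, as the coin-toss analogy does, that every match is an independent 50/50 guess; the function name `chance_of_streak` is made up for this illustration.

```python
from fractions import Fraction


def chance_of_streak(n_matches, p_correct=Fraction(1, 2)):
    """Probability of guessing n_matches outcomes in a row correctly.

    Assumes each guess is independent and correct with probability p_correct.
    """
    return p_correct ** n_matches


# Eight straight correct picks, each treated as a fair coin toss (an assumption).
streak = chance_of_streak(8)
print(f"Chance of 8 correct guesses in a row: 1 in {streak.denominator}")
```

Using exact fractions instead of floating point keeps the "1 in 256" form readable. Real matches are not fair coin tosses, so this is only the baseline against which a lucky guesser would be judged.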

Because the prediction was so odd, numerous articles have been written about Paul, some of them convinced on sheer statistics that Paul was psychic. A good web search will bring up several articles about his incredible ability. And the more one data mines Paul's correlations with World Cup results, the more convincing it gets that Paul had some innate ability (the correlation coefficient is incredible!). After all, how could this be possible any other way? So what gives?

It turns out that Paul was in a German zoo and was mostly presented with the German flag, to predict the outcome of German matches. It also turns out that octopuses like horizontal lines, and Paul picked the German flag for its horizontal lines. Though the octopus cannot see color, a look at the German flag shown below gives an idea of what the image processing in the octopus eye picks out. Each time it picked the German flag, it got the tasty treat. Pretty soon, it learned to pick the German flag first.

German Flag

Somehow, the octopus is attracted to the lines, where it also finds food, and so it selects the flag. Curiously, on the other occasions when it selected a flag, it selected Spain, whose flag has a similar pattern to the German flag.

Spanish Flag

At other times, it was the Serbian flag, which is also very similar.

Serbian Flag

Thus, although the data showed a strong correlation between the octopus's picks and the winners of the soccer games, ascribing it to image processing in the octopus eye became possible only once that processing was understood. Finding correlative data should therefore be a starting point for more investigation rather than a conclusion.


10 Jul 2014

Sometimes, biology is difficult. There are so many factors that are not understood, things that cannot be controlled and complex systems that are hard to predict that it is difficult to understand the full consequences of common things like food, diet, exercise and all the other complicating factors in health. However, one area that gets a lot of interest is predicting how long a person will live. This is obviously important in financial planning, not only for individuals but also for countries managing their populations. It also matters to insurance companies, who would like to be sure that their prediction for a person is statistically close to that for the population. If too many of their policyholders start filing claims, it becomes difficult to manage the insurance company.

However, it is surprising to see how many people claim to understand longevity. Others are still seeking solutions.

See this problem: “Novel Approaches for Predicting Individual Life Expectancy” on the link below, or search for this problem.

Now, is it possible to combine data and do this reasonably for a population? Probably, yes. But doing it for any one particular individual may be difficult. For example, if you know that the individual smokes, the insurance company may slash X years off their life expectancy, but how do you quantify whether the person is “angry” or “calm”? Are those factors correlated with longevity? Does the individual have gene X that makes him prone to obesity, and how does that factor into his longevity? These are complex factors, but there is significant incentive to figure them out. If you think you know the answer, go to the website and enter the challenge for some money.

11 Aug 2013

Cancer is a difficult disease to manage and harder still to treat. Many people have many thoughts on what it takes to cure it. Some of these ideas may be practical and others not so practical; however, it will take a revolutionary approach to come up with drug discovery paradigms that help in curing cancer.

See the link below to the interesting Forbes article that discusses how more cancer drugs might be found through a Kickstarter-like project. It is a great article, but it forgets to mention that the success rate for any discovered drug is low and that discovering a drug takes a long time: years or decades. That calls for a different approach.

One method is to share data. Look at the video below to hear about the proposal to share data that can be used to develop new therapies.

06 Aug 2013

When anyone measures a lot of data and then correlates it with phenomena seemingly unrelated to the data, some relationships appear to become evident. Is there a cause and effect? Not necessarily, but it is still surprising how closely the two can be related.

Consider the example of a correlation between a movie shown on TV and the stock market. There is a strange correlation between a Studio Ghibli movie being shown on a Friday night and bad trading sessions after that. This has happened many times and has convinced traders and the public to such an extent that it has become a self-fulfilling prophecy.

Take this a little further: when the movie showing coincides with the announcement of non-farm payroll data by the US, the yen does very badly. This has happened nearly eight out of nine times, has led to surprising results, and has convinced people of the value of this correlation.

However, it seems logical that there is no real connection, and a statistician will say that this is probably random noise, yet it has still convinced many Japanese that there is a correlation. Read more at the link below.
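
To put the "eight out of nine" figure in perspective, here is a rough Python sketch of the statistician's argument. It rests on assumptions that are not in the article: that with no real link the yen has a bad session half the time, that sessions are independent, and that 100 unrelated event/market pairings were examined (a purely hypothetical number). The function names are made up for this illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Binomial tail: chance of k or more 'hits' in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_any_coincidence(p_single, n_pairings):
    """Chance that at least one of n_pairings unrelated pairings shows the streak."""
    return 1 - (1 - p_single) ** n_pairings


# Assumption: with no real link, the yen does badly on half of such days.
p_streak = prob_at_least(8, 9)
print(f"8 or more bad sessions out of 9 by chance: {p_streak:.4f}")

# Hypothetical: 100 unrelated event/market pairings were examined.
p_any = prob_any_coincidence(p_streak, 100)
print(f"At least one such streak among 100 pairings: {p_any:.2f}")
```

A single streak like this is unlikely by chance, but once enough unrelated pairings are examined, finding one somewhere becomes likely, which is why such a pattern is better treated as a question than as an answer.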

05 Aug 2013

Data visualization requires special tools. One that has been very useful is “Weave”, a ridiculously powerful visualization tool. Instead of describing it in detail, look at the video below showing the UK elections and at the link that covers all the features very well.

03 Aug 2013

There are several ways to visualize data. The most commonly used is probably Excel charts. However, when the data sets are large and complicated and the ability to display and chart the data has to be given to the individual user or consumer, new techniques are required to plot the data.

One such technique involves keeping the data in a database and then giving the user the ability to plot it through a web-based interface using Java. This has been used in many instances, and one of them can be seen below. This graph shows a large amount of data, with animation and the ability for users to change the graph according to their requirements.

Three rules of data visualization:

  1. Simple is better. This is true of Steve Jobs's mobile phones as well as of graphs. A graph with a lot of points, bars and lines makes things complicated. Displaying just one data series, with the ability to add more, gives much more clarity and makes the graph easier to understand.
  2. Graphs can display data in various forms. Giving users the ability to change the display to suit their own needs makes the graph much more useful to them.
  3. Animation helps the user understand the data and unfolds the story much better than presenting all the data at once and then hoping the user draws the conclusion that you would like them to draw.

27 Jul 2013

When you have a business website, it is important to track visitors, the places they visit, the hits you get and other analytics that tell you more about whether your content is relevant and whether there is information that is useful to visitors or customers.

Most companies make understanding their visitors a part of their daily, weekly or monthly review process. There are many tools out there, and Google makes its own tools available to website owners who have incorporated them into their websites.

For those with an Android phone, one app that has been getting good reviews is gAnalytics. It lets you see your analytics statistics on the road, away from the office. The best that can be said about it: it works.

Download it at the link below and try it out with your website data.

26 Jul 2013

The data analysis engine that most people commonly use is something like an Excel spreadsheet, with charts to look at the data that they have collected. However, once the dataset gets bigger, the tools to analyze the data also need to scale. Additionally, as the dataset gets bigger, it is managed through a database or other engines.

One company, called Chartio, has made it its business to create graphical representations of large datasets. It takes inputs from various sources and databases and creates a variety of plots. This was a need for large datasets such as a logfile with multiple statements and datapoints.

A few years ago, the idea that there was a need for charting websites would have been unheard of. However, there are several other niches in Big Data that need to be filled, and it will be up to companies to recognize those needs and fulfill them.

22 Jul 2013

Every researcher spends quality time managing the literature that they have collected in the form of papers and PDFs. However, the collection of PDFs continues to grow, and there is very little in the way of tools to help manage it. There are three great open source tools that work wonderfully to manage a collection of PDFs.


I, Librarian: This is a complete collection of web server and PHP-based programs that manage a database of PDFs stored in SQLite. It has been around for quite some time and has a great interface. The negative is that it requires installing the web server that will manage the database and serve it over a web page. It serves most purposes, and the ability to see the source code is very useful. You can also try it out on a large database, such as the one operated by the Canadian Council on Animal Care. Interestingly, it also has a touch-screen-friendly interface that enables reference lookup through a mobile device. The developers also make installation easier for Windows and Linux users.


JabRef: This is written in Java, so it is relatively platform independent. It is oriented more towards LaTeX and BibTeX files but works equally well with PDFs.


Refbase: This is another great database, with a great interface, good search capability and the ability to gather an RSS feed. It is also capable of reading a wide variety of documents, including Microsoft Word and Excel files. It can create direct download links and offers a convenient way to upload files. What distinguishes this database is its use of MySQL, which makes it very rugged and search optimized. It may be more Linux friendly, and so will require more knowledge of the operating system to install.


