When machine learning invades biology

After giving an intuition of what machine learning is, it is time to discuss what machine learning can do. Of course, machine learning has plenty of applications in many fields and will have more in the coming years, but let’s focus on biology. In this article I will distinguish three levels: the technical level, the research level and the business level.

Update: I recommend this review by Libbrecht & Noble on machine learning applications in biology. It gives an overall view of different aspects of machine learning, illustrated by examples in genomics.

Technical level

The technical level represents the daily questions you have to answer in a project. These are obviously not something you would want to publish or sell as a product, but tricky questions you need to solve in order to move on. With the current implementations of machine learning methods, the cost of deploying such methods is strongly reduced, and it is possible to use them to solve routine problems.

In biology, we often have categories, such as types of patients, proteins belonging to families, or true data vs artefacts of the experiment. The most common tasks are therefore either classification or, even more frequently, clustering to check whether our data fall into the expected groups.
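To make the clustering check concrete, here is a minimal sketch, assuming made-up 2-D measurements and two expected groups "A" and "B". It runs a small k-means and then measures how well the clusters match the expected groups (purity). The function names, the data and the deterministic starting centres are all illustrative choices, not a reference implementation.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda i: sq_dist(point, centres[i]))

def kmeans(points, k=2, n_iter=10):
    """Plain k-means with a deterministic start (assumption made for
    reproducibility): the first point, then the point farthest from
    the centres already chosen."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    for _ in range(n_iter):
        assign = [nearest(p, centres) for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:  # keep the old centre if a cluster is empty
                centres[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return [nearest(p, centres) for p in points]

def purity(assign, expected):
    """Fraction of samples that carry the majority expected label of their cluster."""
    total = 0
    for cluster in set(assign):
        labels = [e for a, e in zip(assign, expected) if a == cluster]
        total += Counter(labels).most_common(1)[0][1]
    return total / len(assign)

# Hypothetical measurements for two expected groups of samples.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (4.8, 5.2), (5.1, 4.9)]
expected = ["A", "A", "A", "B", "B", "B"]

clusters = kmeans(points, k=2)
agreement = purity(clusters, expected)
print(clusters, agreement)
```

A purity close to 1 suggests the data fall into the expected groups; a low value is a hint to look for artefacts or for groups that are not what we thought.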

A real-life example: a few years ago I was working for a European university, and one of my colleagues came to me with melting curves of amplification. For every batch, she had thousands of curves to review to find which samples had properly amplified. This repetitive task took her hours, so she asked me whether it could be automated. To save her those hours, I created an SVM classifier that could recognize the shape of a curve and label the sample as “amplified” or “not amplified”, using her previous work as the knowledge base.

The output was optimized to let the human eye identify at a glance which samples had properly amplified, and to keep the process under the control of the expert (who could quickly check difficult cases). An example is given below.

Figure (pcr): green curves are “amplified” samples, red curves are “not amplified” samples, and black curves are “questionable” samples, at which the human expert will have a second look.

The total processing time (including the human visualisation) dropped from 3 hours to 15 minutes, with fewer labelling errors.
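The three-colour idea can be sketched in a few lines. This is not the original SVM: to stay dependency-free, the sketch swaps in a simple nearest-centroid score as a stand-in for the SVM decision value, and uses synthetic logistic curves instead of real data. The function names, the curve generator and the 25% margin are all assumptions made for illustration. What it shows is the triage step: confident scores get a label, and scores near the decision boundary are sent to the expert.

```python
import math

def sigmoid_curve(plateau, midpoint, n_cycles=40):
    """Synthetic amplification-like curve: a logistic rise up to `plateau`."""
    return [plateau / (1.0 + math.exp(-(c - midpoint))) for c in range(n_cycles)]

def centroid(curves):
    """Point-wise mean of equally long curves."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]

def distance(a, b):
    """Euclidean distance between two curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def score(curve, pos_centroid, neg_centroid):
    """Signed score, positive when the curve is closer to the 'amplified'
    centroid. Stand-in for an SVM decision value."""
    return distance(curve, neg_centroid) - distance(curve, pos_centroid)

def triage(s, margin):
    """Map a signed score to one of three labels; scores within
    +/- margin of zero are left to the human expert."""
    if s >= margin:
        return "amplified"
    if s <= -margin:
        return "not amplified"
    return "questionable"

# Synthetic "previously reviewed" curves play the role of the expert's past work.
amplified = [sigmoid_curve(p, m) for p in (0.9, 1.0, 1.1) for m in (18, 20, 22)]
not_amplified = [sigmoid_curve(p, 20) for p in (0.0, 0.05, 0.1)]
pos_c, neg_c = centroid(amplified), centroid(not_amplified)

# Assumption: the "questionable" band is 25% of the distance between centroids.
margin = 0.25 * distance(pos_c, neg_c)

new_curves = [sigmoid_curve(1.0, 20), sigmoid_curve(0.02, 20), sigmoid_curve(0.5, 20)]
labels = [triage(score(c, pos_c, neg_c), margin) for c in new_curves]
print(labels)
```

The width of the “questionable” band is the knob that trades expert time against labelling errors: a wider band sends more curves to the expert, a narrower one trusts the classifier more.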

Research level

What I call “research applications” can be either academic or R&D. This is a full project that could give birth to a tool that will drastically change processes in labs and companies, or to a new type of classifier. It is obviously something you may want to publish.

Projects at this level involve a deep knowledge of the topic, a fair amount of good-quality data and multiple pre-processing steps. They more often use supervised learning methods, because predicting a label or a value is more often the kind of information that brings value to the analysis. Feature selection is also starting to be used, especially because biologists don’t like “black boxes” (predictions without an explanation of how they were made).

As an example of such an application, I recently worked for a world-leading cereal research institute that wanted to implement genomic selection. The idea was to be able to predict agronomic traits such as yield or flowering time from groups of genomic markers in plants. Using a population for which both the genotypes and the trait values were known, I designed predictive models (using different algorithms) that are able to predict trait values for new lines where only the genotypes are known. With the decreasing cost of sequencing, being able to predict trait values for tested lines, to keep the promising lines and to remove the unpromising ones could both improve the genetic value of the lines produced and save tens of thousands of dollars by not testing some probable “dead end” lines.
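The principle can be sketched on toy data. The sketch below assumes ridge regression on marker allele counts (0, 1 or 2), which is one common choice for this kind of prediction and not necessarily one of the algorithms used in the project. The eight training lines, the three markers and the simulated trait are invented for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_ridge(genotypes, traits, penalty=0.1):
    """Estimate marker effects by ridge regression on centred data.
    Returns (marker_means, trait_mean, marker_effects)."""
    n, p = len(genotypes), len(genotypes[0])
    g_mean = [sum(row[j] for row in genotypes) / n for j in range(p)]
    t_mean = sum(traits) / n
    xc = [[row[j] - g_mean[j] for j in range(p)] for row in genotypes]
    yc = [t - t_mean for t in traits]
    # Normal equations with the ridge penalty on the diagonal: (X'X + penalty*I) w = X'y
    xtx = [[sum(r[i] * r[j] for r in xc) + (penalty if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(p)]
    return g_mean, t_mean, solve(xtx, xty)

def predict(model, genotype):
    """Predict the trait value of a line from its marker genotype."""
    g_mean, t_mean, effects = model
    return t_mean + sum(e * (g - m) for e, g, m in zip(effects, genotype, g_mean))

# Hypothetical training population: allele counts at 3 markers for 8 lines.
genotypes = [[0, 0, 0], [2, 0, 1], [0, 2, 2], [2, 2, 0],
             [1, 1, 1], [1, 0, 2], [0, 1, 0], [2, 1, 2]]
# Simulated trait (assumption): marker 0 helps, marker 1 hurts, marker 2 is neutral.
traits = [5.0 + 1.0 * g[0] - 0.5 * g[1] for g in genotypes]

model = fit_ridge(genotypes, traits)

# New lines with genotypes only: rank them by predicted trait value.
candidates = {"line_x": [2, 0, 0], "line_y": [0, 2, 1], "line_z": [1, 1, 0]}
predictions = {name: predict(model, g) for name, g in candidates.items()}
ranking = sorted(predictions, key=predictions.get, reverse=True)
print(ranking)
```

In practice there are far more markers than lines and the traits are noisy, so the choice of penalty and a proper cross-validation matter much more than in this toy case.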

Business level

In the age of big data, machine learning can be used to create new products with special business models. After the social era, we are entering the Internet of Things (IoT) era. Gigabytes, terabytes or more of personal data will flow, and companies that are able to exploit them could offer dedicated products or services.

In biology, personal healthcare is one of the next big innovations to come. For centuries, we have studied the human body to improve health. Until recently, our paradigm was more or less “the same treatment(s) for the same disease”, but with our capacity to monitor ourselves, things are going to change. By combining disease knowledge, the (genetic) background of the patient, and the patient’s current parameters (sleeping hours over the last few days, heart rate, outside temperature, diet…), diagnoses could become more and more precise. Outside of disease treatment, there is room for products and services that will help people stay in good health on a daily basis. And machine learning methods will be at the heart of such business opportunities…