When I talk about my work, many people wonder what machine learning is. Although “a kind of AI” is an acceptable short answer, I think the field deserves a better one. In this post, I will not define machine learning (there are dictionaries and Wikipedia for that), but I’ll try to explain what it is and how it works.
What does machine learning do?
In a few words, machine learning produces algorithms that learn to recognize something. Yes, just that. If you feel disappointed, just think of how useful it is to recognize symbols such as letters. If you know the Latin alphabet (which you obviously do if you are reading this post), you can recognize that all the symbols below stand for the same letter (well, technically, the bottom one is another symbol, but it stands for another form of the same letter).
This is something people who never learned the Latin alphabet may not be able to do, and it allows you to read, go to school, get degrees and reveal the mysteries of the universe (of course, you can do the same with any alphabet). Not bad for a “simple” ability to recognize things!
Machine learning input
I won’t go into too much detail because I will devote a full article to data later, but a minimal explanation is required.
The usual dataset is made of many “individuals” (e.g. patients, plants, genes…) described by features (genomic markers, physiological values…). A dataset may also contain labels (e.g. “cancer” or “healthy” for patients) or associated values (e.g. yield for a plant), which are used by supervised methods (see below).
The features are encoded as vectors, one per individual, which are read by the algorithm and used for recognition. Choosing the right features and the right encoding scheme can be key to getting an effective classifier.
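To make the idea of "one vector per individual" concrete, here is a minimal sketch in plain Python. The patient records, the feature names (`age`, `marker_a`, `blood_pressure`) and the 1.0/0.0 encoding of the categorical marker are all invented for illustration; a real project would pick its own features and encoding.

```python
# Hypothetical patient records; names and values are invented for illustration.
patients = [
    {"age": 54, "marker_a": "present", "blood_pressure": 130.0},
    {"age": 37, "marker_a": "absent", "blood_pressure": 118.0},
    {"age": 61, "marker_a": "present", "blood_pressure": 145.0},
]
# Optional labels, only needed by supervised methods.
labels = ["cancer", "healthy", "cancer"]

# Fixed feature order, so every vector has the same layout.
FEATURES = ["age", "marker_a", "blood_pressure"]


def encode(record):
    """Turn one individual into a fixed-length numeric vector."""
    vector = []
    for name in FEATURES:
        value = record[name]
        if name == "marker_a":
            # Categorical feature: encode presence/absence as 1.0/0.0.
            vector.append(1.0 if value == "present" else 0.0)
        else:
            # Numeric feature: keep the value as a float.
            vector.append(float(value))
    return vector


# One vector per individual.
X = [encode(p) for p in patients]
print(X)
```

Each row of `X` is what the algorithm actually reads. Changing the encoding (for instance, scaling `age` and `blood_pressure` to comparable ranges) changes what the algorithm can pick up, which is why the encoding choice matters.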
Supervised or unsupervised, that is the question
The machine learning field has many different algorithms, and those algorithms may have different purposes. To make a long story short, algorithms are roughly of two types: supervised and unsupervised.
Unsupervised algorithms aim mainly at clustering data by looking for hidden structures in the features. In that sense they may seem simpler, but they are very valuable on new data about which we know nothing. Imagine you have cancer patient data and you want to check whether multiple types of cancer (lung cancer, colon cancer, skin cancer…) are mixed up in your data. An unsupervised learning tool can help you separate them, and the result may look like this.
The colours in the figure above were assigned by the user. The tool (a PCA in which the first two principal components are used as coordinates) tells us that the yellow dot on the right was assigned to the “yellow group” (perhaps data coming from the same source and assumed to be the same type of cancer) but seems closer to the “blue group” (something to confirm by computing distances). That is valuable to know before analysing the data more deeply.
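The "confirm by computing distances" step can be sketched as follows. This is only an illustration under stated assumptions: the 2D coordinates are made up (they stand in for the first two principal components), and comparing a point to each group's centroid is one simple choice among several possible distance checks.

```python
import math

# Hypothetical 2D coordinates, e.g. the first two principal components.
# The suspect point is kept out of its own group so it does not bias the centroid.
groups = {
    "yellow": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "blue": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)],
}
suspect = (4.7, 4.8)  # labelled "yellow" by the user


def centroid(points):
    """Mean position of a group of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest_group(point, groups):
    """Return the name of the closest group and all point-to-centroid distances."""
    distances = {
        name: math.dist(point, centroid(pts)) for name, pts in groups.items()
    }
    return min(distances, key=distances.get), distances


closest, distances = nearest_group(suspect, groups)
print(closest, distances)
```

With these invented numbers the suspect point is much closer to the blue centroid than to the yellow one, which is the kind of signal that would justify checking where that sample really came from.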
Supervised algorithms are slightly more complex. The first step in building a supervised model is to train it on a training set in which each individual is either labelled (the task is then a classification) or associated with a value (the task is then a regression). The algorithm tries to build a model that associates features with the label or value (each algorithm has its own way of doing this). Once trained, the model can be used to predict the label or value of new individuals for which you have the feature values but no label or output value. Here is an illustration of how most supervised algorithms work.
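To make the train-then-predict workflow concrete, here is a minimal sketch of a classifier in plain Python. It uses a nearest-centroid rule (predict the label whose average training vector is closest), chosen only because it is short; it is one possible algorithm, not the one the post has in mind. The data and the `train` and `predict` function names are invented for illustration.

```python
import math


def train(X, y):
    """Build a model from labelled data: one mean vector (centroid) per label."""
    by_label = {}
    for vector, label in zip(X, y):
        by_label.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, vector):
    """Predict the label whose centroid is closest to the new individual."""
    return min(model, key=lambda label: math.dist(vector, model[label]))


# Hypothetical training set: feature vectors with their known labels.
X_train = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]
y_train = ["healthy", "healthy", "cancer", "cancer"]

# Step 1: training produces the model.
model = train(X_train, y_train)

# Step 2: the model predicts a label for an individual it has never seen.
prediction = predict(model, [8.5, 9.0])
print(prediction)
```

For a regression, the same two steps apply, except that the model outputs a number instead of a label.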
Last but not least, you can use a “reverse-classifier” version of most machine learning techniques to find which features matter most in explaining how the data are clustered, classified or linked to the value. This is called “feature analysis” or sometimes “feature selection”, and it is quite useful if you have some insight into the features.
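One simple way to get a feel for feature analysis is to remove each feature in turn and see how much the classifier's accuracy drops. The sketch below does that with a small nearest-centroid classifier. This ablation approach is a stand-in chosen for brevity, not necessarily the “reverse-classifier” the post refers to, and the data and feature names (`marker_a`, `noise`) are invented. For simplicity, accuracy is measured on the training data itself; a real analysis would use held-out data.

```python
import math


def train(X, y):
    """One mean vector (centroid) per label."""
    by_label = {}
    for vector, label in zip(X, y):
        by_label.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, vector):
    """Label of the closest centroid."""
    return min(model, key=lambda label: math.dist(vector, model[label]))


def accuracy(X, y):
    """Fraction of individuals classified correctly (on the training data)."""
    model = train(X, y)
    hits = sum(predict(model, vector) == label for vector, label in zip(X, y))
    return hits / len(y)


def drop_feature(X, index):
    """Copy of X without the feature at the given position."""
    return [[v for i, v in enumerate(row) if i != index] for row in X]


# Hypothetical data: "marker_a" separates the groups, "noise" does not.
feature_names = ["marker_a", "noise"]
X = [[1.0, 5.0], [2.0, 1.0], [1.5, 4.0], [8.0, 2.0], [9.0, 5.0], [8.5, 1.0]]
y = ["a", "a", "a", "b", "b", "b"]

baseline = accuracy(X, y)
# Importance of a feature = accuracy lost when that feature is removed.
importance = {
    name: baseline - accuracy(drop_feature(X, i), y)
    for i, name in enumerate(feature_names)
}
print(importance)
```

Here removing `marker_a` hurts accuracy while removing `noise` changes nothing, so `marker_a` is the feature worth a closer look.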
This is of course a simplistic way to explain what machine learning is, but I hope it helps people understand what lies behind the esoteric term “machine learning” (and, in some ways, AI).
Maybe you are also starting to imagine what we can do with such a powerful tool? That will be the next topic of this blog (mostly for biology), so stay tuned!