As always, my examples come from the field of biology, but they can easily be extended to other fields.
Everything ends up as a vector
First, let’s recall the usual data format for a machine learning method.
Most algorithms take vectors as input. A vector represents one “individual”, i.e. a sample. Examples of individuals may be: one patient (or healthy person used as a control), one gene, one metabolic pathway… This individual is represented by features, i.e. characteristics. For a patient, it could be their genome sequence for some specific genes (linked to the disease being studied); for a gene, it could be the number of occurrences of some specific sub-sequences; for a metabolic pathway, it could be a set of preselected reactions. Remember that the set of features must remain the same for all individuals.
To summarize, the real input for machine learning looks like this:
Well, not exactly, but we will see that later.
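As a minimal sketch (with made-up values), the individuals-by-features layout can be written as a NumPy matrix, one row per individual and one column per feature:

```python
import numpy as np

# Hypothetical dataset: 4 individuals (samples), each described by
# the same 3 features (e.g. counts of specific sub-sequences).
X = np.array([
    [12, 0, 7],   # individual 1
    [ 3, 5, 1],   # individual 2
    [ 8, 2, 0],   # individual 3
    [ 0, 9, 4],   # individual 4
])

# One label per individual (e.g. patient = 1, control = 0).
y = np.array([1, 0, 1, 0])

print(X.shape)  # (4, 3): 4 individuals, each with the same 3 features
```

The key constraint from above is visible in the shape: every row must have exactly the same columns, i.e. the same set of features.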
There are two things to think about even before collecting the data: how much data you need and which features to extract (and how many).
Number of individuals
The question of the number of individuals is always tricky. Below hundreds of individuals, it is optimistic to expect very accurate classifiers, but some applications may work with 200-300 individuals while others may require thousands or more. This number depends on the quality of the information carried by your training set. A training set that represents the target population well, with well-chosen features (see below), will need fewer individuals to be accurate than a less informative one.
Rather than giving obscure rules of thumb to estimate the number of individuals required, here are the elements to consider:
- You need enough data to train your classifier and a bit more to tune parameters and test it. So, if you think you need X individuals for proper training, plan to collect enough individuals to have X plus two thirds of X (one third going to parameter estimation, one third to testing).
- Unless you are totally confident in your data generation process, generate a bit more than what you will finally need, to avoid issues with artifacts.
- Your training set must be representative of the population you want to predict. If you want to predict fruits but your training set is made of many types of berries while the classifier is meant to predict exotic fruits, it won’t work.
Beyond that, it is always a trade-off between what you can generate (given time and cost) and the final accuracy you expect.
The question of which features to choose depends completely on the question to solve.
There is no free lunch in feature selection. It is exactly where humans (still) beat the machine, and an expert on the data is required. If you have a lot of resources (time and money) for that part, collect everything that seems relevant and select during training (see below). If data generation is expensive (as in biology), thinking ahead about which features to collect can be critical to the success of the project.
The number of features to select clearly depends on the quantity of information carried by each feature. In an extreme case, if one feature is perfectly correlated with what you want to predict, that one is enough… but in this case, you don’t need a machine learning approach. In some cases, fewer than a dozen features may work (see the “Titanic example” https://blog.socialcops.com/engineering/machine-learning-python/), but for complex questions, like genomic prediction, you may need hundreds of thousands of features or more. Even more than with individuals, you may be limited by the material available or by the cost of collecting a huge number of features.
There is more than one way to do it
Once you have your data, you need to transform them into suitable inputs for your machine learning method. This is called vectorization.
There is more than one way to convert data into numeric vector components. Though straightforward solutions are always available for numerical or categorical features, more robust encoding schemes can be designed for most data science questions. Here again, the data scientist’s imagination and expertise can make the difference!
For instance, when encoding diploid individuals with a sample of markers as features (such as in GWAS analyses), people tend to use a -1/0/1 (or 0/1/2) scheme. This scheme encodes both whether the individual is homozygous (-1 or 1) or heterozygous (0) and the frequency of the allele (if the individual is homozygous for the most frequent allele at that marker, it is 1, -1 otherwise). This scheme is widely used because it is actually more informative than encoding the sequence alone (A=1, C=2, G=3, T=4): we bring two explicit pieces of information (homozygosity and frequency) instead of one (the sequence).
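A minimal sketch of this -1/0/1 scheme, with invented genotypes and major alleles (the helper name and data are hypothetical, not from any GWAS library):

```python
def encode_genotype(genotype, major_alleles):
    """Encode one diploid individual as a -1/0/1 vector.

    genotype: list of (allele1, allele2) pairs, one per marker.
    major_alleles: most frequent allele at each marker.
    """
    vector = []
    for (a1, a2), major in zip(genotype, major_alleles):
        if a1 != a2:
            vector.append(0)     # heterozygous
        elif a1 == major:
            vector.append(1)     # homozygous, most frequent allele
        else:
            vector.append(-1)    # homozygous, minor allele
    return vector

# Made-up example: 3 markers for one individual.
major_alleles = ["A", "G", "T"]
individual = [("A", "A"), ("G", "C"), ("C", "C")]
print(encode_genotype(individual, major_alleles))  # [1, 0, -1]
```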
Whatever the encoding scheme you plan to use, there are some good practices to keep in mind. First, data have to be scaled. Unscaled data may raise issues, especially if features are not of the same type (e.g. isoelectric point and molecular weight for a protein). Second, even better than just scaling the data, normalize them! Although 1234/5678/7890 seems to mean the same thing as -1/0/1, some classifiers may be sensitive to high feature values, which leads to numerical issues. Last, remove features with low variance. Even if your initial analysis flagged these features as “potentially interesting”, low variance means low information. Don’t overload your model with features like that.
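These three practices can be sketched in a few lines of NumPy (the data here are made-up; the 1e-6 variance threshold is an arbitrary choice for this illustration):

```python
import numpy as np

# Made-up data: 4 individuals, 3 features on very different scales;
# the third feature is constant, hence carries no information.
X = np.array([
    [1234.0, 5.1, 1.0],
    [5678.0, 6.3, 1.0],
    [7890.0, 4.8, 1.0],
    [2345.0, 5.9, 1.0],
])

# 1) Remove low-variance features (here, the constant third column).
keep = X.var(axis=0) > 1e-6
X_varied = X[:, keep]

# 2) Standardize the remaining features: zero mean, unit variance,
#    so no feature dominates just because of its raw magnitude.
X_scaled = (X_varied - X_varied.mean(axis=0)) / X_varied.std(axis=0)

print(X_scaled.shape)  # (4, 2): the constant feature is gone
```

One caveat worth remembering: the means, standard deviations, and variance threshold must be computed on the training set only, then reused as-is on the test set.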
To sum up:
- think about which data you need: number of individuals, type and number of features
- select or design an encoding process
- scale your data
- normalize your data
- check the quantity of information in your features.