Machine learning is set to play a bigger role in SEO, so I think it is important that we have a basic understanding of how it works.
Machine learning is appropriate to use when there is a problem that does not have an exact answer (i.e. there isn’t a right or wrong answer) and/or one that does not have a method of solution that we can fully describe.
Examples where machine learning is not appropriate would be a computer program that counts the words in a document, simply adds some numbers together, or counts the hyperlinks on a page.
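To make the contrast concrete, here is a minimal sketch of two of those "exact answer" tasks in Python. The function and class names (`count_words`, `LinkCounter`, `count_hyperlinks`) are my own, and splitting on whitespace is only one simple definition of a "word". The point is that every step can be written down as a rule, so nothing needs to be learnt.

```python
from html.parser import HTMLParser


def count_words(text: str) -> int:
    """Count whitespace-separated words; an exact rule, no learning needed."""
    return len(text.split())


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1


def count_hyperlinks(html: str) -> int:
    """Return the number of hyperlinks found in an HTML string."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.links
```

Calling `count_hyperlinks` on the same page always gives the same number, and there is no room for two people to disagree about the result.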
Examples where machine learning would be appropriate are optical character recognition, determining whether an email is spam, or identifying a face in a photo. In all of these cases it is almost impossible for a human (who is most likely extremely good at these tasks) to write an exact set of rules for how to go about doing these things that they can feed into a computer program. Furthermore, there isn’t always a right answer; one man’s spam is another man’s informative newsletter.
Take the picking of watermelons. Watermelons do not continue to ripen once they are picked, so it is important to pick them when they are perfectly ripe. Anyone who has been picking watermelons for years can look at a watermelon, give it a feel with their hands, and from its size, colour and from how firm it feels they can determine whether it is under-ripe, over-ripe or just right. They can do this with a high degree of accuracy. However, if you asked them to write down a list of rules or a flow chart that you or I could use to determine whether a specific watermelon was ripe, then they would almost certainly fail – the problem doesn’t have a clear-cut answer you can write into rules. Also note that there isn’t necessarily a right or wrong answer – there may even be disagreement among the farmers.
The same is true about how to identify whether a webpage is spammy or not; it is hard or impossible to write an exact set of rules that work well, and there is room for disagreement.
We can set up a computer (there are various methods, we don’t need to know the details at this point, but the method you’ve likely heard of is artificial neural networks) such that we can feed it information about one melon after another (size, firmness, color, etc.), and we also tell the computer whether that melon is ripe or not. This collection of melons is our “training set,” and depending on the complexity of what is being learnt it needs to have a lot of “melons” (or webpages or whatever) in it.
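As a rough illustration of what a training set might look like inside the computer, here is a small Python sketch. The `Melon` class, its field names, the value ranges and the four example melons are all made up for this example; a real training set would be far larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Melon:
    size_cm: float   # hypothetical feature: length in centimetres
    firmness: float  # hypothetical feature: 0.0 (soft) to 1.0 (hard)
    color: float     # hypothetical feature: 0.0 (pale) to 1.0 (deep green)
    is_ripe: bool    # the label we were told by someone who knows melons


# A toy training set; the values are invented for illustration only.
training_set = [
    Melon(38.0, 0.55, 0.80, True),
    Melon(30.0, 0.90, 0.40, False),
    Melon(41.0, 0.50, 0.85, True),
    Melon(36.0, 0.20, 0.90, False),
]


def to_features_and_labels(melons):
    """Split melons into the inputs the computer sees and the answers we supply."""
    features = [[m.size_cm, m.firmness, m.color] for m in melons]
    labels = [1 if m.is_ripe else 0 for m in melons]
    return features, labels


X, y = to_features_and_labels(training_set)
```

Each row of `X` is what we "show" the computer about one melon, and the matching entry in `y` is the answer we tell it.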
Over time, the computer will begin to construct a model of how it thinks the various attributes of the melon play into it being ripe or not. Machine learning can handle situations where these interactions could be relatively complex (e.g. the firmness of a ripe melon may change depending on the melon’s color and the ambient temperature). We show each melon in the training set many times in a round robin fashion (imagine this was you; now that you’ve noticed something you didn’t before you can go back to previous melons and learn even more from them).
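Here is a minimal sketch of that "show every melon many times" loop. I am assuming a very simple model (logistic regression trained one example at a time) rather than a neural network, and the two features, the toy data, the number of passes and the learning rate are all invented for illustration. A straight-line model like this one cannot capture the more complex interactions mentioned above; it is only meant to show the repeated passes.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1 so it reads as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_ripe_probability(weights, bias, features):
    """The model's current belief that a melon with these features is ripe."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(examples, labels, passes=200, learning_rate=0.5):
    """Go round the whole training set `passes` times, nudging the model each time."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(passes):
        for features, label in zip(examples, labels):
            # How far off was the model on this melon?
            error = predict_ripe_probability(weights, bias, features) - label
            # Nudge each weight in the direction that reduces the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Toy data: each melon is [firmness, color], both scaled 0..1 (made-up values).
examples = [[0.9, 0.2], [0.8, 0.3], [0.4, 0.8], [0.5, 0.9], [0.85, 0.35], [0.3, 0.7]]
labels = [0, 0, 1, 1, 0, 1]  # 1 = ripe, 0 = not ripe

weights, bias = train(examples, labels)
predictions = [predict_ripe_probability(weights, bias, f) >= 0.5 for f in examples]
```

The learnt `weights` are the "model": here the computer ends up treating high firmness as a sign against ripeness and deep colour as a sign for it, purely because that is the pattern in the toy data.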
Once we’re feeling confident that the computer is getting the hang of it, we can test it by showing it melons from another collection it has not yet seen (we call this set of melons the “validation set”), but we don’t share whether these melons are ripe or not. Now the computer tries to apply what it has learnt and predict whether the melons are ripe or not (or even how ripe they may or may not be). The share of melons it identifies correctly tells us how well it has learnt. If it didn’t learn well, we may need to show it more melons, or we may need to tweak the algorithm (the “brain”) behind the scenes and start again.
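The test step can be sketched like this. To keep it short I am assuming a deliberately crude model: a single firmness cut-off, below which a melon is called ripe. The function names and all the numbers are invented for illustration. What matters is that the cut-off is chosen using only the training melons, and then scored on melons the model never saw.

```python
def fit_threshold(training_set):
    """Pick the firmness cut-off that gets the most training melons right."""
    candidates = sorted(firmness for firmness, _ in training_set)

    def correct(threshold):
        return sum((firmness <= threshold) == ripe for firmness, ripe in training_set)

    return max(candidates, key=correct)


def accuracy(threshold, labelled_set):
    """Fraction of melons where the model's guess matches the true label."""
    hits = sum((firmness <= threshold) == ripe for firmness, ripe in labelled_set)
    return hits / len(labelled_set)


# Each entry is (firmness, is_ripe); values are made up.
training_melons = [(0.3, True), (0.4, True), (0.5, True),
                   (0.7, False), (0.8, False), (0.9, False)]
validation_melons = [(0.35, True), (0.45, True), (0.6, True),
                     (0.75, False), (0.95, False)]

threshold = fit_threshold(training_melons)
training_accuracy = accuracy(threshold, training_melons)
validation_accuracy = accuracy(threshold, validation_melons)
```

With these toy numbers the model is perfect on the melons it trained on but gets one of the five unseen melons wrong, which is exactly the kind of gap the validation set is there to reveal.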
This type of approach is called supervised learning, where we supply the learning algorithm with the details about whether the original melons are ripe or not. There do exist alternative methods, but supervised learning is the best starting point and likely covers a fair bit of what Google is doing.
One thing to note here is that even after you’ve trained the computer to identify ripe melons well, it cannot write that exhaustive set of rules we wanted from the farmer any more than the farmer could.
Google likely created a training set by having their teams of human quality assessors give webpages a score for how spammy that page was. They would have had hundreds or thousands of assessors all review hundreds or thousands of pages to produce a huge list of webpages with associated spam scores (averaged from multiple assessors). I’m not 100% sure on exactly what format this process would have taken, but we can get a general understanding using the above explanation.
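Purely to illustrate the averaging idea (this is my own guess at the general shape, not Google's actual process), here is a short Python sketch that turns individual assessor ratings into one spam score per page. The URLs, assessor names and the 0 to 1 scoring scale are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (page, assessor, spam score from 0.0 = clean to 1.0 = pure spam); made-up data.
ratings = [
    ("example.com/cheap-pills", "assessor_1", 1.0),
    ("example.com/cheap-pills", "assessor_2", 0.5),
    ("example.org/newsletter", "assessor_1", 0.0),
    ("example.org/newsletter", "assessor_2", 0.25),
    ("example.org/newsletter", "assessor_3", 0.5),
]


def average_spam_scores(ratings):
    """Collapse many assessor opinions into one averaged label per page."""
    scores_by_page = defaultdict(list)
    for url, _assessor, score in ratings:
        scores_by_page[url].append(score)
    return {url: mean(scores) for url, scores in scores_by_page.items()}


training_labels = average_spam_scores(ratings)
```

Averaging is one simple way to deal with the disagreement between assessors; the averaged score then plays the same role as the "ripe or not" label did for the melons.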
Now, recall that to learn how ripe the watermelons are we have to have a lot of melons and we have to look at each of them multiple times. This is a lot of work and takes time, especially given that we have to learn and update our understanding (we call that the “model”) of how to determine ripeness. After that step we need to try our model out on the validation set (the melons we’ve not seen before) to assess whether it is working well or not.
In Google’s case, this process is taking place across its whole index of the web. I’m not clear on the exact approach they would be using here, of course, but it seems clear that applying the above “learn and test” approach across the whole index is immensely resource intensive. The types of breakthroughs that Caffeine enabled with a live index and faster computation on just parts of the graph are what made machine learning finally viable. You can imagine that previously, if it took hours (or even minutes) to recompute values (be it PageRank or a spam metric), then doing this the thousands of times necessary to apply machine learning simply was not possible. Once Caffeine allowed them to begin, the timeline to Panda and subsequently Penguin was pretty quick, demonstrating that once they were able, they were keen to utilise machine learning as part of the algorithm (and it is clear why).