Why AlphaGo is a milestone but still has not achieved AGI
Congrats, Google DeepMind. You did it! The top-ranked Go player is no longer a human. Should we rejoice or fear this formidable feat? Is this milestone the start of the endgame in the quest for Artificial General Intelligence (AGI)? My short answer is yes and no.
First, some context. Go is an ancient Chinese board game with very simple rules: black and white stones, played in alternating moves by two opponents, fight to conquer the most territory on a 19×19 grid. Despite its simple rules, the game is of incredible complexity. Like chess, Go has the Markov property: the information needed for the next move is completely contained in the present configuration of stones on the board.
However, unlike chess, Go is a much more fluid game, where a single move can drastically change its fate. In chess, after some moves the outcome of the game is determined (for a good enough player). In Go, the outcome is much harder to predict (even for two top players) until the very end of the game, which may take 200 or more moves. Meanwhile, the number of possible continuations is beyond imagination: if we want to look 50 moves ahead, for example, we get about 3×10^100 possibilities, a number larger than the number of atoms in the entire universe.
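To get a feel for this explosion, here is a quick back-of-the-envelope calculation in Python (the branching factor of roughly 100 legal moves per turn is my assumption for illustration, not an exact figure):

```python
# Rough, illustrative estimate of Go's search-space explosion.
# The branching factor of ~100 is an assumed average, not an exact value.
branching_factor = 100
lookahead = 50  # number of moves we try to look ahead

positions = branching_factor ** lookahead
print(f"about 10^{len(str(positions)) - 1} positions to consider")  # ~10^100
```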
So, how did AlphaGo do it? How was it able to navigate this gigantic space of possibilities within seconds? Brute force will not work. The simple answer is that it uses algorithms that come close to human-level intelligence. What are they? We don't know for sure, but some sort of high-level heuristics and shortcuts, like the ability to see that certain configurations are more defensible than others. These take years for humans to master, but AlphaGo learned them from scratch, initially learning from records of expert human games and later improving by playing millions of games against itself. It learned from its mistakes and was able to devise what may be called strategies, pretty much the same way the human brain uses heuristics.
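To make the "learning from its own games" idea concrete, here is a heavily simplified self-play sketch on a toy game (single-pile Nim: take 1 to 3 stones, whoever takes the last stone wins). This is only my illustration of the loop, not AlphaGo's actual training procedure; the game, the tabular values and the update rule are all stand-ins:

```python
import random
from collections import defaultdict

values = defaultdict(float)   # learned value of (stones_left, action) pairs
counts = defaultdict(int)

def choose(stones, eps=0.2):
    actions = [a for a in (1, 2, 3) if a <= stones]
    if random.random() < eps:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: values[(stones, a)])  # exploit what self-play taught us

def play_one_game():
    stones, moves, player = 15, {0: [], 1: []}, 0
    while True:
        action = choose(stones)
        moves[player].append((stones, action))
        stones -= action
        if stones == 0:
            return moves, player       # the player who took the last stone wins
        player = 1 - player

for _ in range(50_000):                # both "players" share and improve the same values
    moves, winner = play_one_game()
    for player, history in moves.items():
        reward = 1.0 if player == winner else -1.0
        for stones, action in history:
            counts[(stones, action)] += 1
            # Move the value estimate toward the eventual outcome of the game.
            values[(stones, action)] += (reward - values[(stones, action)]) / counts[(stones, action)]

# The learned preference from 15 stones (optimal play would take 3, leaving a multiple of 4).
print(max((1, 2, 3), key=lambda a: values[(15, a)]))
```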
In this respect, this is a remarkable feat, as it relies on a completely different approach from the one Deep Blue used to beat Kasparov at chess. More technically, AlphaGo takes advantage of one of the most powerful classes of machine learning models ever built: deep neural networks (in AlphaGo's case, deep convolutional policy and value networks) trained with reinforcement learning. A celebrated member of the same family is the recurrent network with Long Short-Term Memory (LSTM), a technique proposed in 1997 by Hochreiter and Schmidhuber, the latter of IDSIA, Switzerland. Around the networks sit further techniques for reinforcement learning, most notably Monte Carlo tree search, but the core concept is the deep neural network.
Neither the ability to generate heuristics nor neural networks are new concepts. Neural networks have been around for more than 50 years. They never got much respect from the academic and AI community, because how they work, and why they work, is rather obscure, somewhat like our own brain. They only became popular, though for a short period, in the late 80s, when Hinton and others proposed an algorithm called back-propagation that made it possible to train networks with hidden layers (layers between the input and the output). Those architectures gained relevance because they could extract non-linear relationships between inputs and outputs (like the XOR problem) that previous networks couldn't. They got some visibility on problems like OCR (recognition of typeset or handwritten digits).
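To make the hidden-layer idea concrete, here is a minimal NumPy sketch of back-propagation learning XOR, a function no network without a hidden layer can represent. The 2-4-1 architecture, learning rate and iteration count are my illustrative choices, not taken from the original papers:

```python
import numpy as np

# A minimal 2-4-1 network trained with back-propagation on the XOR problem.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input  -> hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20_000):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)         # backward pass (squared-error gradient)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(3))      # should end up close to [[0], [1], [1], [0]]
```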
However, after the early 90s, enthusiasm declined in favor of more “elegant” and more mathematically grounded machines like the SVM, especially kernel-based learning. Neural network research was confined to a few strongholds, most prominently Geoffrey Hinton in Toronto, Yann LeCun in New York, Yoshua Bengio in Montreal and Jürgen Schmidhuber in Lugano. They worked hard to solve a fundamental problem: how to train deep neural networks?
Two main difficulties seemed insurmountable: 1) how to avoid the vanishing and exploding gradient problems, and 2) how to avoid overfitting?
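A toy illustration of the vanishing-gradient part of difficulty 1 (the depth, the random pre-activations and the use of plain sigmoids are assumptions made purely for demonstration): in a deep chain of sigmoid units, the backpropagated gradient is a product of per-layer derivatives, each at most 0.25, so it shrinks exponentially with depth.

```python
import numpy as np

# Each sigmoid layer contributes a factor sigma'(z) <= 0.25 to the gradient;
# chaining many such layers makes the gradient vanish exponentially.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
gradient = 1.0
for layer in range(30):                        # a 30-layer chain, purely illustrative
    z = rng.normal()                           # pre-activation of this layer
    gradient *= sigmoid(z) * (1 - sigmoid(z))  # local derivative, <= 0.25
print(f"gradient after 30 layers: {gradient:.3e}")  # vanishingly small, ~1e-20 or less
```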
Faster computers, more data and efficient learning tricks came to the rescue. A new wave was initiated in 2006, when Hinton published a landmark work proposing an algorithm to train stacks of machines called Restricted Boltzmann Machines (RBM) with several layers. The idea was simple: train the network layer by layer with an algorithm called Contrastive Divergence (CD); once a layer is trained, its weights are frozen and its hidden states are taken as the input to the next layer.
This architecture of greedily stacked layers was called a Deep Belief Network (DBN). It has the advantage that every extra layer grows the expressive capability of the function it represents exponentially. We can think of each layer of these networks as a representation of ever more abstract features of the data, pretty much like the human visual cortex.
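Here is a hedged sketch of the two ingredients just described: one CD-1 update for a binary RBM, and the greedy layer-by-layer stacking. The layer sizes, learning rate and stand-in data are illustrative, not Hinton's original setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_update(W, a, b, v0, lr=0.05):
    """One Contrastive Divergence (CD-1) step for a binary RBM.
    W: visible-by-hidden weights, a/b: visible/hidden biases, v0: batch of data."""
    h0_prob = sigmoid(v0 @ W + b)          # positive phase
    h0 = sample(h0_prob)
    v1_prob = sigmoid(h0 @ W.T + a)        # one step of Gibbs sampling (reconstruction)
    h1_prob = sigmoid(v1_prob @ W + b)     # negative phase
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return sigmoid(v0 @ W + b)             # hidden activations: input for the next layer

# Greedy layer-wise stacking: each trained layer's hidden activations
# become the "data" for the next RBM in the stack.
data = sample(np.full((100, 784), 0.1))    # stand-in for binarized MNIST images
layer_sizes = [784, 500, 250]
for n_visible, n_hidden in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for epoch in range(10):                # a few sweeps per layer, purely illustrative
        data_next = cd1_update(W, a, b, data)
    data = data_next                       # freeze this layer, move on to the next
```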
DBNs worked fine, achieving top performance on the classic MNIST handwritten digit recognition problem. Other deep architectures were also put to work, like convolutional neural networks (CNN) and deep stacked auto-encoders (DsAE).
Note that, from a statistical point of view, training a deep neural network (i.e. finding the latent parameters describing the posterior probability distribution given a set of observations) is an intractable problem: the number of possibilities is far too high to compute the likelihood integrals (sorry for all the Bayesian statistical jargon…). There are ways around this difficulty: sampling and variational inference. In the case of AlphaGo, a technique called Monte Carlo Tree Search (MCTS) samples promising moves among zillions of possibilities. Deep neural networks trained with reinforcement learning, using MCTS for exploration, did the trick. And it worked!
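Below is a bare-bones sketch of the MCTS idea (UCT selection plus random rollouts), demonstrated on the same toy Nim game as above so that it actually runs. AlphaGo's real search follows the same select/expand/simulate/backup skeleton but uses its policy and value networks to guide selection and replace the random rollouts, which is not shown here:

```python
import math, random

def legal_moves(stones):
    return [a for a in (1, 2, 3) if a <= stones]

class Node:
    def __init__(self, stones, parent=None):
        self.stones, self.parent = stones, parent
        self.children = {}                 # move -> child Node
        self.visits, self.wins = 0, 0.0    # wins counted for the player who moved into the node

def select(node, c=1.4):
    # UCT: balance how good a child looks (wins/visits) against how little it was explored.
    return max(node.children.values(),
               key=lambda ch: ch.wins / (ch.visits + 1e-9)
                              + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def rollout(stones):
    """Random playout; returns 0 if the player to move from this state wins, else 1."""
    player = 0
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return player
        player = 1 - player

def mcts(root_stones, n_simulations=20_000):
    root = Node(root_stones)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend through expanded, non-terminal nodes.
        while node.children and node.stones > 0:
            node = select(node)
        # 2. Expansion: create children for a non-terminal leaf, simulate from one of them.
        if node.stones > 0:
            for move in legal_moves(node.stones):
                node.children[move] = Node(node.stones - move, parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: random playout, from the point of view of the side to move.
        to_move_wins = False if node.stones == 0 else (rollout(node.stones) == 0)
        # 4. Backpropagation: perspective flips at each level because players alternate.
        mover_won = not to_move_wins       # the player who moved INTO `node`
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_won else 0.0
            mover_won = not mover_won
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(15))   # with enough simulations this tends to 3, the optimal move from 15 stones
```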
However, bear in mind that these machines are wild beasts. They are very complex and time-consuming to train, and finding the right architecture can become a nightmare, since almost no theoretical guidance has been developed so far. These machines can have hundreds of millions of parameters, and finding the right set is like trying to locate the Grand Canyon in complete darkness…
Despite all these difficulties, deep learning is the thing. It beats all other techniques on complex problems, sometimes by orders of magnitude: image recognition, video, speech recognition, translation, NLP. To the dismay of the human race, we have reached a point where machines beat humans at recognizing objects in an image. So it is not much of a surprise that Google beat humans at Go. Our old brains simply cannot cope with these huge and powerful learning algorithms.
So much for deep learning. Now comes the “no” part.
Why will deep learning alone not lead us to AGI?
What is the problem in reaching Artificial General Intelligence (AGI)? Haven't machines proved powerful enough to solve all the hard problems we throw at them: locomotion, playing games, driving cars, translation, even cooking? So, we are already there, aren't we? No.
The problem I see can be stated in one phrase: these machines have no “free will”. In other words, they don't know what to do unless a human tells them the goal, or, in ML terms, the objective function (or loss function) to be minimized.
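The point is concrete in any ML setting: the objective is something the human engineer writes down, not something the machine discovers. A schematic illustration (the cross-entropy choice and the toy numbers are placeholders of my own):

```python
import numpy as np

# The "goal" of a supervised learner is whatever loss the human writes down.
# Nothing inside the machine chose this objective; we did.
def cross_entropy(predicted_probs, true_labels):
    return -np.mean(np.log(predicted_probs[np.arange(len(true_labels)), true_labels]))

# Hypothetical predictions over 3 classes for 2 examples, plus the "correct" answers.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])
print(cross_entropy(probs, labels))   # the number the machine is told to minimize
```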
We learn mostly by unsupervised learning, not supervised learning as in Go. How to reach the point where a machine automatically finds which goals to pursue and optimize is a long shot from where we are today. So, what is the problem?
The problem is finding what “pseudo-objective function” we should give the machine to learn. Reward optimization, as proposed by Schmidhuber, is fine when the task is well defined and you have frequent, or even infrequent, rewards. However, for higher-level cognition we have no obvious rewards. What is the reward of curiosity, of art, of creativity?
Furthermore, you always need an external observer or teacher to interpret the machine's results and to choose which reward to pick among an infinite number of possibilities. That's the problem: the machine is isolated from the world, and it ALWAYS needs an outside conscious agent to tell it the meaning of its inputs and outputs and its goals.
My take on this is simple: the pseudo-objective function is “meaning”. The machine will try to minimize surprise, the mismatch between what it thinks the world is and what it actually observes of the world.
Given its computational capabilities, it will start creating models of the world as it observes data (a model being simply a set of abstractions it learns in order to represent invariant properties found in the observations).
Based on these models, the agent will generalize the patterns to other, unseen observations. If it finds what it expected, the model is confirmed. If it sees something unexpected, it will revise its model in light of what it observed in the past and what it is seeing now.
The agent tries to create a story, a narrative, that gives a sense of unity and cohesion to everything it is observing. In other words, the objective is to maximize its state of internal coherence: the meaning. The observations play only a secondary role and are conditioned by the biases of the model the system already has. The agent is conditioned by the world, but its learning runs independently of it. That's the challenge.
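One way to read this proposal in code, purely as my own illustrative rendering and not an established algorithm, is a prediction-error (surprise) minimization loop: the agent keeps an internal model, predicts what it should observe, and revises the model only when the world contradicts its expectations. The hidden linear "world" and the linear model are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": a hidden linear rule the agent never sees directly.
true_weights = np.array([2.0, -1.0])
def observe():
    x = rng.normal(size=2)
    return x, true_weights @ x + rng.normal(scale=0.1)   # noisy observation

# The agent's internal model: its own current story about how the world works.
model_weights = np.zeros(2)

for step in range(1000):
    x, outcome = observe()
    prediction = model_weights @ x
    surprise = outcome - prediction              # mismatch between expectation and world
    model_weights += 0.05 * surprise * x         # revise the model to reduce future surprise

print(model_weights)   # should drift close to [2.0, -1.0]: the world as the agent expects it
```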
So, unless we create curiosity-driven machines capable of incorporating some level of subjectivity about the world, we will not solve the last missing piece of the puzzle on the way to AGI.