A striking feature of the human brain is its ability to associate abstract concepts with sensory inputs, such as vision and audio. As a result of this multimodal association, a concept can be located and translated from a representation in one modality (say, visual) to a representation in another (for instance, audio), and back again. For example, the abstract concept “ball” in the sentence “John plays with a ball” can be associated with several instances of spherical shapes (visual input) and sound waves (audio input). Neuroscience, psychology, and artificial intelligence all seek to determine the factors involved in this task, but it remains an open problem.
This is known as the Symbol Grounding Problem. Infants begin learning these associations while acquiring language in a multimodal setting. Their initial vocabulary consists mainly of nouns, such as “dad”, “mom”, “cat”, and “dog”, and two different patterns occur in infants' brain activity depending on the semantic agreement between a visual and an auditory stimulus – a kind of self-consistency check. The brain activity is pattern-specific: the patterns eventually merge if the visual and audio signals represent the same semantic concept; otherwise, a different pattern is assigned.
However, recent advances in multimodal deep learning (combining image and text, or image and audio) can help address this old problem. By jointly embedding latent representations of both sensory input types, the symbolic association of abstract concepts such as “above/below” and “in/out” can be learned. The algorithm learns the meaning of words from visual cues and, conversely, the meaning of images from the descriptions it receives. A team from Baidu has developed a system that queries images using natural-language questions such as “what is below the big red table?” – see this publication for details. This is a remarkable feat: for the first time, we can envisage the possibility of Artificial General Intelligence (AGI), where machines not only understand the complexity and ambiguity of human language but also create their own symbolic representation of the world, i.e., their own language.
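To make the joint-embedding idea concrete, here is a minimal sketch in Python/NumPy. All names, dimensions, and feature vectors are hypothetical: two modality-specific projection matrices map image features and word features into a shared space, where cosine similarity measures how well a word matches an image. In a real system these projections would be deep networks trained with a contrastive or ranking loss; here they are random, to show the structure only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features: 4-dim image vectors, 3-dim word vectors.
image_feats = {"ball": rng.normal(size=4), "table": rng.normal(size=4)}
word_feats = {"ball": rng.normal(size=3), "table": rng.normal(size=3)}

EMBED_DIM = 2
W_img = rng.normal(size=(EMBED_DIM, 4))  # projects image features (untrained here)
W_txt = rng.normal(size=(EMBED_DIM, 3))  # projects word features (untrained here)

def embed(W, v):
    """Project a feature vector into the shared space and L2-normalise it."""
    z = W @ v
    return z / np.linalg.norm(z)

def similarity(word, image):
    """Cosine similarity between a word and an image in the joint space."""
    return float(embed(W_txt, word_feats[word]) @ embed(W_img, image_feats[image]))
```

Because both modalities land in the same space, retrieval works in either direction: rank images by `similarity(word, image)` to ground a word visually, or rank words for a given image to describe it.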
Systems with end-to-end learning from complex environments, which segment and classify objects for the object-word association problem, have already been tested at one of the largest artificial-intelligence groups in Europe. Cases where a semantic concept is missing in one modality can be auto-completed. The association problem is fully dynamic and flexible, unlike the traditional setup where the association is fixed via a pre-defined coding scheme of the classes before training – here it is performed using Long Short-Term Memory (LSTM) networks whose outputs are aligned with Dynamic Time Warping.
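The alignment step mentioned above can be sketched with the classic Dynamic Time Warping recurrence. This is a generic DTW implementation, not the cited group's actual code: plain numbers stand in for per-timestep LSTM outputs, and DTW finds the cheapest monotonic alignment between two sequences of different lengths.

```python
def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Dynamic Time Warping distance between two sequences.

    cost[i][j] holds the minimal accumulated cost of aligning
    seq_a[:i] with seq_b[:j]; each cell extends the cheapest of the
    three neighbouring alignments.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 1, 2, 2, 3])` is `0.0`: the shorter sequence is stretched to match the longer one at no cost, which is exactly why DTW suits aligning network outputs that unfold at different rates.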
This allows, for instance, giving human-like instructions to guide a robot through complex mazes. See this work and this one for how a robot can understand instructions like: “Place your back against the wall of the “T” intersection. Go forward one segment to the intersection with the blue-tiled hall. This interesction [sic] contains a chair. Turn left. Go forward to the end of the hall. Turn left. Go forward one segment to the intersection with the wooden-floored hall. This intersection conatains [sic] an easel. Turn right. Go forward two segments to the end of the hall. Turn left. Go forward one segment to the intersection containing the lamp. Turn right. Go forward one segment to the empty corner.”
Finally, check out the Robot Barista, an impressive robot that integrates images, text, and movements and learns by example from a set of instructions, much like a child. What is remarkable about this robot is that it can understand ambiguous commands (like “switch the machine on”) and use tools it has never seen before.
Remarkable times lie ahead.