Artificial Intelligence (AI) populates our imagination. Instead of fearing it, why not embrace it to solve one of the oldest, hardest and most important problems of humanity: the search for a soul mate? It’s a big and messy problem. But it’s also a very interesting one that has long captured my imagination. Can we really solve this problem in the age of big data and AI? Will technology show its strengths here, or only its weaknesses, navigating such fuzzy waters?
Over the next posts I’ll describe an algorithm for multimodal latent representations that allows semantic matching by combining text, images and social media data. It’s a work in progress, so if you have other data or want to validate my work, please let me know.
Why do we need an algorithm for dating?
Thousands of algorithms have been proposed for matchmaking across countless dating websites. The usual suspects range from Boolean matching (age, location, height, education) to more sophisticated personality scores and, more recently, photo matching. Some of these algos do a pretty reasonable job of sorting out serial killers and otherwise pathological individuals.
However, after all these filters, it’s still a big mess to sort through thousands or millions of putative partners. The problem is that even if only 0.1% of candidates meet your bar, on a site with 100,000 profiles you would still have to date 100 guys. That’s scary and time-consuming.
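The funnel arithmetic above is simple, but it is worth making explicit, because it drives the whole argument:

```python
# Back-of-the-envelope funnel: even a very strict filter leaves many candidates.
def remaining_candidates(total_profiles: int, pass_rate: float) -> int:
    """Number of profiles that survive a filter with the given pass rate."""
    return round(total_profiles * pass_rate)

# 0.1% of 100,000 profiles still means 100 dates to sit through.
print(remaining_candidates(100_000, 0.001))  # → 100
```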
So, in dating terms, every marginal improvement of an algorithm can save you hours, days or even your whole life. How many of those 100 lucky guys are paranoid, can’t make you smile, can’t stop talking, or even worse, can’t kiss properly? You only have one life, so you’d better choose wisely.
Let me say it upfront: I don’t believe such a perfect algorithm will ever be possible. Sometimes, even when the best match is right in front of us, we don’t realize it. After all, we are humans, not machines. We make a lot of mistakes and rely extensively on emotions.
Nor do I believe that any perfect match is possible without some interactivity. You cannot know a person from photos, a description, or a dating CV. You have to see him, talk with him, even smell him. There is no way I can figure out an algorithm that tells you, with 100% accuracy, that things are going to work out.
But let’s face it, humans are not very good at this either. How often did you get it right: one in ten? 10% would already be a good mark; my bet is that we are much worse than that at picking the right person. Matching algorithms don’t have to be perfect – they just have to be better than us.
To the rescue come big data and advanced machine learning algorithms – mostly Deep Learning architectures. In this post I will describe a matching algorithm that takes advantage of these technologies to boost accuracy and, more importantly, to understand why a specific match makes sense. It combines text, images and social media data to create a sparse latent representation of users. By integrating all this information, it allows a powerful semantic search.
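To make the idea of a fused sparse representation concrete, here is a deliberately simplified sketch. Everything in it is a stand-in: `embed_text` uses a toy bag-of-words weighting instead of a learned encoder, and the image/graph signals are reduced to binary tags. The real model described in this series is far richer; this only illustrates the shape of "fuse modalities into one sparse vector, then match by similarity":

```python
# Hedged sketch of a multimodal sparse representation. Function names and
# feature schemes below are illustrative assumptions, not the actual model.

def embed_text(description: str) -> dict[str, float]:
    """Toy bag-of-words weighting as a stand-in for a learned text encoder."""
    words = description.lower().split()
    return {f"txt:{w}": words.count(w) / len(words) for w in set(words)}

def embed_tags(tags: list[str], prefix: str) -> dict[str, float]:
    """Binary indicator features, e.g. image concepts or graph interactions."""
    return {f"{prefix}:{t}": 1.0 for t in tags}

def fuse(*vectors: dict[str, float], top_k: int = 20) -> dict[str, float]:
    """Merge modality vectors and keep only the top-k weights (sparsity)."""
    merged: dict[str, float] = {}
    for v in vectors:
        merged.update(v)
    return dict(sorted(merged.items(), key=lambda kv: -kv[1])[:top_k])

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors, used for matching."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

alice = fuse(embed_text("loves hiking and jazz"), embed_tags(["outdoors"], "img"))
bob = fuse(embed_text("jazz fan who enjoys hiking"), embed_tags(["outdoors"], "img"))
print(round(cosine(alice, bob), 2))
```

Because every feature keeps a human-readable name, a high-scoring pair can be explained by listing the shared features – which is exactly the "why a match makes sense" property mentioned above.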
Some interesting findings were uncovered: for instance, a significant fraction of users are inconsistent when describing themselves. Finally, I show some examples of multimodal semantic exploration (combining text and images to browse profiles).
I used data from a dating website and some external sources. The inputs were of the following types:
• Static (age, gender, education, …)
• Non-structured, i.e., text (description, likes and dislikes, …)
• Personality tests (based on Q&As)
• Social Media Graph (social media profiles)
• Photos (profile pictures)
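The five input types above can be collected into one record per profile. The field names below are illustrative assumptions, not the site’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    # Static fields
    age: int
    gender: str
    education: str
    # Non-structured text
    description: str
    likes: list[str] = field(default_factory=list)
    dislikes: list[str] = field(default_factory=list)
    # Personality test answers (question id -> answer)
    personality: dict[str, str] = field(default_factory=dict)
    # Social media graph: ids of connected entities (people, bands, places)
    interacts_with: list[str] = field(default_factory=list)
    # Up to three profile photo URLs
    photos: list[str] = field(default_factory=list)

p = Profile(age=29, gender="F", education="MSc", description="jazz and hiking")
print(p.age)
```

As the next section notes, most fields are optional in practice, which is why everything beyond the static attributes defaults to empty.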
I scraped about 10,000 female profiles from 3 cities: Sao Paulo, New York and London. I captured information about age, sexual preferences and description. I also collected data from personality tests and the first 3 profile photos. Social media data was added when available – more on this later.
The scraper was a C++ program developed for the purpose. It starts from a profile and follows the links to related profiles – note that this introduces some bias, as the website tends to show only profiles similar to the initially selected one. Most profiles don’t have all fields filled in. Some strange facts emerge from the data: for instance, around 18% of profiles have no description at all (I wonder why?), and some don’t even have a photo. Surely there are lots of fake and phantom profiles. There are also inconsistencies in the data, and the option fields used by the site are somewhat contradictory.
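The actual scraper is C++, but its link-following logic is easy to sketch in a few lines of Python. Here `fetch_related` is a hypothetical stand-in for the HTTP call that returns the site’s "related profiles" list:

```python
from collections import deque

def crawl(seed_id: str, fetch_related, max_profiles: int = 10_000) -> list[str]:
    """Breadth-first walk over 'related profile' links from one seed.
    Note the bias mentioned above: the frontier only ever contains
    profiles the site considers similar to ones already visited."""
    seen = {seed_id}
    frontier = deque([seed_id])
    visited = []
    while frontier and len(visited) < max_profiles:
        pid = frontier.popleft()
        visited.append(pid)
        for rel in fetch_related(pid):  # stand-in for the real HTTP request
            if rel not in seen:
                seen.add(rel)
                frontier.append(rel)
    return visited

# Toy graph standing in for the site's "related profiles" widget.
related = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(crawl("a", lambda pid: related.get(pid, [])))  # → ['a', 'b', 'c', 'd']
```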
The data was stored in neo4j, a graph database, which is a more natural way to represent it. Each person is a node in a high-dimensional graph. The graph contains several types of nodes (people, places, music bands, etc.) and several types of relations (edges): likes, is a member of, has been to, etc. These can all be aggregated as “interacts”.
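A minimal in-memory sketch of that schema, with the node and edge type names chosen for illustration (in neo4j each would be a node label and a relationship type, queried via Cypher):

```python
# Heterogeneous graph: typed nodes, typed edges.
nodes = {
    "alice": {"type": "person"},
    "radiohead": {"type": "band"},
    "london": {"type": "place"},
}
edges = [
    ("alice", "LIKES", "radiohead"),
    ("alice", "HAS_BEEN", "london"),
]

def interacts(person: str) -> set[str]:
    """Collapse every edge type (likes, is a member of, has been to, ...)
    into one generic 'interacts' relation, as described above."""
    return {dst for src, _rel, dst in edges if src == person}

print(sorted(interacts("alice")))  # → ['london', 'radiohead']
```

The advantage of the aggregated "interacts" view is that downstream models can treat all social signals uniformly, while the typed edges remain available when the distinction matters.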
These are completely unsupervised models, meaning that the algorithms just look for patterns in the data without any “external” knowledge. That’s the beauty (and the weakness) of it. Supervision would definitely help, but we will have to wait a few years for that.
In the next post I’ll describe the details of the model’s implementation.
(to be continued).