The other day I found myself at a breakfast seminar titled “What do I need AI for?”. Halfway through, something caught my interest: the speaker mentioned a few ways companies are collecting data for machine learning. I was intrigued. Everybody knows that companies use customer data to enhance their products or services. But I was fascinated by how smart some companies are, not only using the data they already have but also collecting new data in a productive or fun manner.
Just to recap: explaining what machine learning is isn’t the easiest thing to do. I did manage to find a pretty straightforward explanation in this blog, though, which I really liked:
“[Machine learning is] a subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data. Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.”
This basically tells us that to teach a machine learning algorithm how it should work, you need quite a lot of data for it to base its decisions on. Or at least, the more data it has, the better its decisions tend to be. So how do companies collect this data?
The first example they spoke about at the event was Quick, Draw!, a game from Google. It’s not a sneaky or hidden way to collect data; instead, Google labels it as “an A.I. experiment”. The collected data set is also open source and “created to help with machine learning research”. Anyway, it’s basically a game of Pictionary where you play against a neural network. Not only is this quite a fun game and an inspiring way of understanding machine learning, it is also a very smart way for Google to collect data. Just imagine how many drawings they get from people playing this game. As an example, they have almost 140,000 drawings people have made of an apple. In total, they have collected over 50 million drawings. Imagine how Google can use this knowledge about how people visualize or recognize different items.
Another example from Google is reCAPTCHA. “CAPTCHA” is a backronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”, and reCAPTCHA is Google’s own system for CAPTCHA. You have probably used it yourself when signing up for a site or making a purchase online: a typical test to distinguish humans from bots. But this isn’t the only use case. Since 2011 reCAPTCHA has also assisted in digitizing The New York Times archive as well as a large number of books from Google Books. reCAPTCHA started out with blurred or otherwise unidentifiable words. Most of the time the computer shows a control word, whose answer is already known, together with an unidentified word, to which the computer doesn’t know the answer. When a human enters an answer for the unidentified word, the computer accepts it as “probably valid”. Once enough users have entered the same word, it is accepted as valid and is reused as a control word. Again, a very smart way of using data to provide value. These blurred words are still used, probably to digitize new books, but reCAPTCHA has also started to use pictures and numbers. Pictures can be used in massive data sets to train machine learning algorithms to distinguish what is in the picture, and most of the numbers are from Google Street View, where they help map out specific addresses in more detail. Man, those guys are smart.
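To make the control-word idea a bit more concrete, here is a minimal sketch of how such a voting scheme could work. This is my own illustration, not Google’s actual implementation: the class name `WordPairCaptcha`, the image ids and the `min_votes` threshold are all made up for the example.

```python
from collections import Counter, defaultdict


class WordPairCaptcha:
    """Toy model of a two-word captcha: one known control word, one unknown word.

    Hypothetical sketch only; names and the vote threshold are assumptions.
    """

    def __init__(self, min_votes=3):
        self.min_votes = min_votes          # assumed number of matching answers needed
        self.known = {}                     # image id -> verified text (control words)
        self.votes = defaultdict(Counter)   # unknown image id -> answer counts

    def add_control(self, image_id, text):
        """Register a word whose correct answer is already known."""
        self.known[image_id] = text.strip().lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes the human check."""
        if control_answer.strip().lower() != self.known[control_id]:
            return False  # failed the control word, so the other guess is ignored

        if unknown_id in self.known:
            return True   # this word has already been verified

        # The guess is only "probably valid": count it as one vote.
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1

        # Enough users agree: promote the word so it can serve as a control word.
        if self.votes[unknown_id][guess] >= self.min_votes:
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True


captcha = WordPairCaptcha(min_votes=3)
captcha.add_control("img-1", "morning")
for _ in range(3):
    captcha.submit("img-1", "morning", "img-2", "harbour")
print(captcha.known.get("img-2"))  # prints "harbour" after three matching answers
```

The point of the control word is that the system never has to trust a single person: it only counts a guess from someone who got the known word right, and it only believes the guess once several people agree.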
Another example is Foldit, a crowdsourced computer game where the user contributes to science simply by playing the game. There are many examples like this out there. Just think about when Facebook recognizes a face and suggests a person to tag. It’s not always the right person, but the more you help it by tagging people, the better it becomes at suggesting the right person the next time. Or think of all those “which XXX character are you?” tests popping up on Facebook. People find them fun and like to share and discuss the results with their friends, but primarily these tests detect user characteristics and behaviors in order to target users with more suitable ads and suggestions.
Companies are smart these days, and they keep getting smarter and coming up with new, innovative ways of collecting interesting data. So think twice the next time you come across these kinds of games, tasks or controls. You might be part of an experiment or data collection without even knowing it.