Reading images and recognizing the objects in them is an important and challenging task in image processing and computer vision. A simple search for “automatic object recognition” on Google Scholar will return a long list of articles mixing all sorts of sophisticated equations and algorithms, dating from the early ’90s to the present. This means that the subject has intrigued researchers from the very beginning of the field, yet it still seems to be a work in progress.
Not long ago we took you into the world of Google Image Search with an in-depth case study. We then tried to figure out whether Google can read text from images and what the implications are for the SEO world. We are now back with an interesting piece of research in the same “image search” field, one that tries to shed light on Google’s progress in the area and on how an SEO professional should keep up in order to rank better in the future.
The future belongs to those who are prepared for it today! In the very near future Google will probably change the algorithms it uses to rank images, and those changes will dramatically affect search and, with it, the SEO world.
We are not talking about a minor algorithm change, one of those that affect only a small percentage of searches. It is the next generation of image search that we are about to face. It will undoubtedly be a great step forward for the search industry, but the true success will belong to those who were one step ahead and already prepared for these big changes.
Why is object detection in images important to the digital marketing community?
At the end of the day, it all comes down to rankings.
Object Detection in Images will add an extra layer of ranking signals that cannot be easily altered.
An Image with a blue dog will rank on a blue dog related keyword and not on a red dog related keyword. This has two important implications for the SEO industry:
1. Searches for a particular keyword will return fewer false positives, because results will reflect what the image actually contains.
2. It can also be used to relate a page’s content to the actual image without any other external factors. If a page has a lot of photos of blue dogs and various other things related to dogs, then that automatically strengthens the signal that the page is about dogs.
Yet, a question may come from this:
Could this be a new era for Object Stuffing in Images as a shady SEO technique?
I do not think so, as the algorithms are advanced enough nowadays to detect this kind of spam intent. However, it’s definitely a new generation of search we are talking about, one that may come with big changes and challenges for the SEO world.
Google, Artificial Intelligence & Image Understanding
Image understanding is a pretty big deal for everyone, which is why a visual recognition challenge has existed since 2010. It is called the ImageNet Large Scale Visual Recognition Challenge (or ILSVRC) and it is a fine example of how competition fosters progress. There are three main tracks in ILSVRC: classification, classification with localization, and detection. This means that algorithms entered in this challenge are tested both on how well they recognize objects in a particular picture and on whether they “understand” where each object is located in the picture.
This level of superior performance in the detection challenge requires the industry to push beyond annotating an image with a “bag of labels” in hopes that something will stick.
In order for an algorithm to succeed in this challenge, it must be able to describe a complex scene by accurately locating and identifying many objects in it. This means that given a picture of someone riding a moped, the software should be able to not only distinguish between several separate objects (moped, person and helmet, for instance), but also correctly place them in space and correctly classify them. As we can see in the image below, separate items are correctly identified and classified.
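To make “correctly place them in space” a bit more concrete: a common way to score a predicted bounding box is intersection over union (IoU) with the human-annotated box, with an overlap of 0.5 often used as the cutoff. The snippet below is only a minimal sketch of that idea. The box coordinates, the labels and the helper names (Box, iou, is_correct) are made up for illustration and are not taken from the challenge's own evaluation code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (illustrative layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    # A box whose max edge lies before its min edge has zero area.
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, a value between 0 and 1."""
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    intersection = area(overlap)
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct(pred_label, pred_box, true_label, true_box, threshold=0.5):
    """A detection counts only if the class matches AND the boxes overlap enough."""
    return pred_label == true_label and iou(pred_box, true_box) >= threshold


# Hypothetical annotation and two hypothetical predictions for the moped.
true_label, true_box = "moped", Box(0, 0, 100, 100)
print(is_correct("moped", Box(10, 10, 110, 110), true_label, true_box))  # True
print(is_correct("moped", Box(60, 60, 160, 160), true_label, true_box))  # False
```

The second prediction has the right label but sits mostly off the object, so it is rejected. Naming the object is not enough; the algorithm also has to know where it is.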
Any search engine with this capability would make it extremely difficult for anyone to try and pass pictures of people riding mopeds as “race drivers driving Porsche” pictures by stuffing them with metadata that simply says so. As you can see in the examples below, the technology is pretty advanced and any misleading scheme could be easily exposed.
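As a rough illustration of how such a mismatch could be exposed, the sketch below compares the tags a publisher claims for an image with the labels a detector reports. It is a toy example under stated assumptions: the function name unsupported_tags is hypothetical, the detector output is hard-coded, and nothing here describes how Google actually does it.

```python
def unsupported_tags(claimed_tags, detected_labels):
    """Return the claimed tags that no detected object backs up."""
    detected = {label.lower() for label in detected_labels}
    claimed = {tag.lower() for tag in claimed_tags}
    return sorted(tag for tag in claimed if tag not in detected)


# Metadata says one thing, the (hypothetical) detector sees another.
claimed = ["Porsche", "race driver"]
detected = ["moped", "person", "helmet"]

print(unsupported_tags(claimed, detected))  # ['porsche', 'race driver']
```

A real system would need to handle synonyms and related concepts rather than exact string matches, but the principle is the same: the picture itself becomes the evidence.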
Google participated in this year’s ILSVRC, where it won with its team GoogLeNet, and made the code open source in order to share it with the community and help the technology advance faster. This has tremendous significance in terms of progress, since ILSVRC 2014 is already hundreds and even thousands of times more complex than the similar object detection challenge of just two years ago. Even within the span of a single year, the advance made at this year’s competition seems significantly greater than last year’s: 60,658 new images were collected and fully annotated with 200 object categories, yielding 132,953 new bounding box annotations compared to 2013. At the risk of bringing too much data into the equation, this means that in 2013 the number of images was about 395,000 and only one year later it had grown to about 457,000. And yes, this sounds as twisted and impressive as Google’s data center that you can see in the image below.
This year’s winning algorithm uses the DistBelief infrastructure, which not only looks at images in a very complex manner and can identify objects regardless of their size and position within the picture, but is also capable of learning. This is neither the first nor the only time Google has focused on machine-learning technologies to make things better. Last year, Andrew Ng, the director of Stanford University’s Artificial Intelligence Lab and a former visiting scholar at Google’s research group “Google X”, put forward an architecture that is able to teach and grow:
“Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines.”
Google+ Already Uses Object Detection in Images. Google Search Next?
In fact, a smart image detection algorithm based on convolutional neural network architecture has already been in use at Google+ for more than a year. Part of the code presented at the ImageNet challenge has been used to improve the search engine’s algorithms when it comes to searching for specific (types of) photos even when they were not properly labeled.
1. For one, Google’s algorithm has proven itself to be able to match objects from web images (close-up, artificial light, detailed) with objects from unstaged photos (middle-ground, natural light with shadows, varying levels of detail). A flower seemed just as much a flower by any other resolution or lighting conditions.
2. Moreover, the big G managed to identify some very specific visual classes beyond the general ones. Not only did it identify most flowers as flowers, it also identified certain specific flowers (such as hibiscus or dahlia) as such.
3. Google’s algorithms also managed to do well with more abstract categories of objects, by recognizing a fair and varied number of pictures that could, for instance, be categorized as “dance” or “meal”, or “kiss”. This takes a lot more than simply detecting an orange as an orange.
4. Classes with multi-modal appearance were also handled well. “Car” is not necessarily an abstract concept, but it can be a little tricky. Is it a picture of a car only if we can see the whole car? Is the inside of a car still a picture of a car? We would say yes and so, it seems, does Google’s new algorithm.
5. The new model is not without sin. It makes mistakes. But it is important to note that even when talking about mistakes, those show progress. Given a certain context, it is reasonable to mistake a donkey head for a dog head. Or a slug for a snake. Even in error, Google’s current algorithm is head and shoulders above previous algorithms.
Is a Google Knowledge Graph & Image Detection “Marriage” Possible?
As impressive as Google’s new model is, it is even more impressive that it is just one part of a larger picture of learning machines, perfectly integrated with the Knowledge Graph, which is already impressive on its own. The “entities” which form the basis of the latter also help shape the classification and detection capabilities of the image detection algorithm. Objects and classes of objects are each given a unique code (so, for instance, a jaguar, the animal, could never be mistaken for a Jaguar, the car) and then used to help the algorithm learn by providing it with a knowledge base against which to test its attempts.
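The “unique code” idea can be sketched with a tiny lookup table. Everything in the snippet below is hypothetical: the identifiers (such as ent:animal/jaguar), the Entity class and the resolve helper are invented for illustration and do not reflect real Knowledge Graph IDs or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str  # hypothetical identifier, not a real Knowledge Graph ID
    name: str
    kind: str


# Two different things share one name, but never one ID.
ENTITIES = {
    "ent:animal/jaguar": Entity("ent:animal/jaguar", "jaguar", "animal"),
    "ent:car/jaguar": Entity("ent:car/jaguar", "Jaguar", "car"),
}


def candidates(name):
    """All entities whose name matches the word, ignoring case."""
    return [e for e in ENTITIES.values() if e.name.lower() == name.lower()]


def resolve(name, detected_kind):
    """Use what the image detector saw to pick the one matching entity."""
    matches = [e for e in candidates(name) if e.kind == detected_kind]
    return matches[0] if len(matches) == 1 else None


print(len(candidates("jaguar")))              # 2: the word alone is ambiguous
print(resolve("jaguar", "animal").entity_id)  # ent:animal/jaguar
print(resolve("jaguar", "car").entity_id)     # ent:car/jaguar
```

The word “jaguar” on its own is ambiguous, but once the image tells us whether we are looking at an animal or a car, only one entity is left.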
Google is turning search into something that understands and translates your words and images into the real-world entities you’re talking about.
That’s right! It doesn’t just generate results based on specific words or images; it really “understands” you.
How Object Detection in Images Might Affect Your SEO
“These technological advances will enable even better image understanding on our side and progress is directly transferable to Google products such as photo search, image search, YouTube, self-driving cars, and any place where it is useful to understand what is in an image as well as where things are.” (@GoogleResearch)
In terms of plain SEO, this move is tremendous and it will help foster Google’s vision of quality-content driven SEO. Just as it has become increasingly hard to trick the search engine through various linking schemes, it might become comparably difficult to trick it with mislabeled photos or the sheer amount of them. Good content (i.e. quality pictures, clearly identified objects, topical pictures) is likely to become central soon enough when it comes to visual objects as well.
Tagging is also likely to become much more about photographic composition rather than manual or artificial labeling. If you want your picture to show up when people are looking for “yellow dog” images, your SEO will have to start with how you take the picture and what you put in that picture.
How Object Detection in Images Really Works
If you haven’t already, it’s time to grab a cup of coffee and bear with me while we make a short journey into the “geeks’ territory”.
So what is so particular about the DistBelief Infrastructure? The straight answer is that it makes it possible to train neural networks in a distributed manner and is based on the Hebbian principle and on the principle of scale invariance.
Still confused? Yes, there is a lot of math packed in there, but it can be unpacked at a more basic level. The term “neural networks” refers to what you would expect: how neurons inside our brains are wired. So when we talk about them here we actually mean artificial neural networks (ANNs), which are computational models based on the ideas of learning and pattern recognition. That makes sense in the context we are talking about, right? The example below of how object detection works might shed some light on this pretty-hard-to-understand field.
GoogLeNet used a particular type of ANN called a convolutional neural network, which is based on the idea that individual neurons respond to different (but overlapping) regions in the visual field and that it is possible to tile them to get a more complex image. To grossly simplify, it’s a little bit like working with layers. One of the perks of a convolutional neural network is that it handles translation very well. In mathematics, translation refers to moving an object from one place in space to another without rotating or resizing it. So if we put together all we know so far, DistBelief is pretty good at recognizing an object no matter where it is placed in a certain picture. It can also do more than that. Scale invariance is also a mathematical principle, and it basically states that the properties of objects do not change if scales of length are multiplied by a common factor. This means that DistBelief should be pretty good at recognizing an orange regardless of whether it is as big as your screen or as tiny as an icon: it will still be an orange and recognized as such (yay for oranges!).
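For the curious, the “same detector, tiled across the picture” idea can be tried out in a few lines. The sketch below slides one small filter over a toy image, which is the basic operation inside a convolutional layer. It is an illustration only, not GoogLeNet or DistBelief code. The 3x3 pattern, the 8x8 image and the helper names (correlate2d, place, peak) are assumptions made up for the example, and it demonstrates translation only, not scale.

```python
def correlate2d(image, kernel):
    """Slide the kernel over every position where it fully fits in the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def place(pattern, top, left, size=8):
    """Draw the pattern on an empty size x size image at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


def peak(feature_map):
    """Return (strongest response, row, col) of a feature map."""
    return max((v, r, c)
               for r, row in enumerate(feature_map)
               for c, v in enumerate(row))


# A toy "object", also used as the filter that looks for it.
PATTERN = [[1, 0, 1],
           [0, 1, 0],
           [1, 0, 1]]

for top, left in [(0, 0), (2, 4), (5, 1)]:
    value, row, col = peak(correlate2d(place(PATTERN, top, left), PATTERN))
    print(f"object at ({top}, {left}) -> strongest response {value} at ({row}, {col})")
```

Wherever the object is drawn, the strongest response has the same value (5) and simply moves along with the object. Taking only that maximum and ignoring its position gives an answer that does not depend on where the object sits in the picture.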
The Hebbian principle has less to do with object recognition and more to do with how neural networks learn or are trained. In neuroscience this principle is often summarized as “Cells that fire together, wire together”, the cells being, of course, neurons. Applied to artificial neural networks, this principle basically means that software based on it would be able to teach itself and get better over time.
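The slogan can be written down as a one-line rule: the connection between an input and an output gets stronger whenever both are active at the same time. The sketch below shows only that textbook rule, with a made-up learning rate and made-up inputs. It is not the training procedure used by GoogLeNet; networks of that kind are typically trained with backpropagation and gradient descent.

```python
def hebbian_update(weights, inputs, output, learning_rate=0.5):
    """Plain Hebbian rule: each weight grows by learning_rate * input * output."""
    return [w + learning_rate * x * output for w, x in zip(weights, inputs)]


weights = [0.0, 0.0, 0.0]

# Inputs 0 and 1 fire together with the output; input 2 stays silent.
for _ in range(4):
    weights = hebbian_update(weights, inputs=[1, 1, 0], output=1)

print(weights)  # [2.0, 2.0, 0.0]
```

The two inputs that “fire together” with the output end up strongly wired to it, while the silent one stays at zero.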
Google’s Acquisitions Regarding Artificial Intelligence and Image Understanding
Google has already developed some of these technologies on its own; others, such as Andrew Ng’s architecture, it acquired. It is also worth noting Google’s strategy of growing while letting the market grow, by leaving many of its solutions open source for others to improve on (why stifle the competition when you can afford to buy out the competition?).
One of the most interesting acquisitions, though, is DeepMind, a $400 million investment. Why would Google make such a big purchase, you might wonder?
It’s very likely that this particular acquisition, aimed at adding skilled experts rather than specific products, marks an acceleration in efforts by Google, Facebook, and other Internet firms to monopolize the biggest brains in artificial intelligence research. Of course, this is just a supposition. Yet, for a better understanding of this matter, we highly recommend watching the video below, which gives a well-illustrated picture of how big Google actually is and how it impacts our everyday life.
The ability of humans to recognize thousands of object categories in cluttered scenes, despite variability in pose, changes in illumination and occlusions, is one of the most surprising capabilities of visual perception, still unmatched by computer vision algorithms. Or at least this is what an article from 2007 stated. Here we are, a few years later, in a situation where search engines are about to use automatic object recognition on a daily basis. What is more, Google is already taking steps forward: it has owned a patent for automatic large-scale video object recognition since 2012.
Organic results might not look the way they do today and for sure important improvements will be made soon. Google is switching “from strings to things” as the Knowledge Graph will be fully integrated in the search landscape. Algorithms will change too and they will probably be more related to the actual entities in the content and how these entities are linked together.
It’s true that only time and context will prove us right or wrong. Things can only be understood backwards; but they must be lived forward.