We’ve all seen it. Whether it’s there to keep automated spammers away from your blog comments or to make sure you are a real person who is registering for an account, at some point we’ve all had to deal with a graphic like the one above. It’s called CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart. While there is some controversy over who invented it, the process was first patented in 1998 by Mark D. Lillibridge, Martin Abadi, Krishna Bharat, and Andrei Z. Broder at AltaVista.
Why is CAPTCHA so effective? Because even though it is relatively simple for you and me to read the obscured and distorted words in a graphic, so far no one has been able to program an automated system to do the same thing. Computers can be programmed to scan a picture of a page of printed text and read the words in the picture – a process called optical character recognition (OCR). However, when the words are obscured or distorted too much, the program no longer recognizes them. A human looking at the same picture can read the words, even when the most sophisticated automated system cannot.
A team of scientists at the Salk Institute for Biological Studies is starting to reveal the amazing complexity behind our ability to interpret such images.
Scientists have long known that there are several “levels” of interpretation involved in the human visual system. The light-sensitive cells in the eye send signals to the brain, where they are processed by brain cells called neurons. The first neurons to process the signals coming from the eye respond only to a tiny region of the visual field. However, those neurons send their results to other neurons for further processing, and those neurons send their results “up the line” to yet another set of neurons. By the time the signals reach an area of the brain referred to as V4, the neurons there are processing information from a large section of what is being viewed by the eyes. This kind of processing allows the brain to recognize shapes, like lines and curves.
Consider, for example, looking at the number “5.” The early neurons that process the visual information coming from the eyes don’t see the number at all; they just see tiny sections of it. However, they send their processed information to other neurons, which do more processing and then pass their results on to yet another set of neurons. By the time the information reaches the V4 section of the brain, the neurons there are tuned to recognize specific shapes from larger regions of the visual field. One set of neurons, for example, might recognize the curve at the bottom of the “5.” Another set might recognize the vertical line that sits on top of the curve, and another set might recognize the horizontal line that sits on the vertical line. The real story is a lot more complicated than that, but it gives you some idea of how all this works.
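To make the idea of a hierarchy concrete, here is a toy sketch in Python. To be clear, this is my own invented illustration, not a model from the papers: the 5×5 image, the tiny “early” units, and the pooling stage are all made up. “Early” units each see only a three-pixel patch, and a later, V4-like stage pools their outputs over a whole band of the image to report an entire stroke of a “5.”

```python
# A crude two-stage hierarchy (invented for illustration, not the papers' model).

FIVE = [  # a 5x5 binary rendering of the digit 5
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

def horiz_unit(img, r, c):
    """An 'early' unit: sees only the 1x3 patch at (r, c) and fires
    only if all three of its pixels are on."""
    return c + 3 <= len(img[0]) and all(img[r][c + i] for i in range(3))

def pool(img, unit, rows):
    """A later-stage unit: ORs many early units across a band of rows,
    so it responds to a stroke anywhere in that larger region."""
    return any(unit(img, r, c) for r in rows for c in range(len(img[0])))

top    = pool(FIVE, horiz_unit, rows=[0])  # top bar of the 5
middle = pool(FIVE, horiz_unit, rows=[2])  # middle bar
bottom = pool(FIVE, horiz_unit, rows=[4])  # bottom bar
print(top, middle, bottom)  # True True True
```

No single `horiz_unit` ever sees a whole bar; only the pooled stage “knows” one is there. That small-patches-feeding-larger-regions arrangement is the point of the paragraph above.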
In two separate papers, the scientists investigated how neurons in V4 responded to simple shapes (like the lines and curve in a “5”)¹ and how they responded to images that contained more complicated shapes (like a natural scene).² In the second paper, the authors start with this statement:
Although object recognition feels effortless, it is in fact a challenging computational problem. There are two important properties that any system that mediates robust object recognition must have. The first property is known as “invariance”: the ability of the system to respond similarly to different views of the same object. The second property is known as “selectivity.” Selectivity requires that systems’ components, such as neurons within the ventral visual stream, produce different responses to potentially quite similar objects (such as different faces) even when presented from similar viewpoints. It is straightforward to make detectors that are invariant but not selective or selective but not invariant. The difficulty lies in how to make detectors that are both selective and invariant.
In other words, to recognize objects, a sensor must respond the same way to different views of the same object, but it must respond differently to similar (but distinct) objects, even when they are presented from the same view. It was thought that the brain took care of all this by the time the signals reached the V4 section of the brain – that the neurons in V4 are already selective and invariant, able to recognize the curve of the “5” or the lines in the “5” no matter where they are in the visual field.
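The selective/invariant distinction can be shown with another toy Python example (again my own, hypothetical illustration – the four-pixel “images” and the detectors have nothing to do with the papers’ data): one detector that is invariant but not selective, one that is selective but not invariant, and one that manages both by pooling a selective template over every position, a standard trick in computational models of vision.

```python
# Toy four-pixel "images" (invented for illustration):
A_LEFT  = [1, 1, 0, 0]   # shape A on the left
A_RIGHT = [0, 0, 1, 1]   # the same shape A, shifted right
B_LEFT  = [1, 0, 1, 0]   # a different shape B on the left

def invariant_only(img):
    # Total activity: unchanged when A shifts (invariant), but it also
    # gives the same answer for B (not selective).
    return sum(img)

def selective_only(img):
    # Exact template match at one fixed spot: distinguishes A from B
    # there (selective), but misses A when it moves (not invariant).
    return img[:2] == [1, 1]

def selective_and_invariant(img):
    # The same template, pooled over every position: still rejects B,
    # yet finds A wherever it sits. The pooling is the extra processing.
    return any(img[i:i + 2] == [1, 1] for i in range(len(img) - 1))

print(invariant_only(A_LEFT), invariant_only(A_RIGHT), invariant_only(B_LEFT))  # 2 2 2
print(selective_only(A_LEFT), selective_only(A_RIGHT))                          # True False
print(selective_and_invariant(A_RIGHT), selective_and_invariant(B_LEFT))        # True False
```

Building the first two detectors is trivial, as the quoted passage says; it is the third kind – selective *and* invariant – that takes real work.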
The authors show that the V4 neurons are invariant when it comes to simple shapes (like the lines in the “5”), but they are not invariant when it comes to more complicated shapes (like the curve in the “5”). For the visual system to become invariant to the complicated shapes, more processing has to be done. This makes sense, of course, since it should be easier to process simple shapes than complicated ones. However, until now, we haven’t appreciated just how much processing this requires.
So why can we interpret a CAPTCHA when the most sophisticated automated system that exists today cannot? Because our brains can process visual stimuli so that the information is both selective and invariant. Right now, the best of human technology cannot match such a feat. Of course, as we learn more about how the brain produces both selectivity and invariance, our automated systems will become better, because the designers of the automated systems will be able to copy (in a crude way) the design of the Ultimate Engineer.
1. Anirvan S. Nandy, Tatyana O. Sharpee, John H. Reynolds, and Jude F. Mitchell, “The Fine Structure of Shape Tuning in Area V4,” Neuron 78(6):1102-1115, 2013
2. Tatyana O. Sharpee, Minjoon Kouh, and John H. Reynolds, “Trade-off between curvature tuning and position invariance in visual area V4,” Proceedings of the National Academy of Sciences of the United States of America, DOI:10.1073/pnas.1217479110, 2013