Something is (still) rotten in the kingdom of artificial intelligence

Four types of problems in the land of artificial intelligence.


Nobody can deny that artificial intelligence (or machine learning, deep learning, or cognitive computing) is booming these days. And—as before, as this is in fact the second round for AI—the hype is almost unlimited. But there are serious problems, and I suspect it will not be long before they become undeniable again and we’re back to a more realistic assessment of what the technology is bringing us.

There are roughly four types of problems in the land of AI. Let’s start with an illustration of one of these, where the hype doesn’t live up to reality yet and where it is quite possible that it never will.


In 2015, Google had to apologize because its image-recognition software tagged a picture of two black people as gorillas. Google’s chief architect at the time wrote: “We used to have a problem with people (of all races) being tagged as dogs, for similar reasons. We’re also working on longer-term fixes around both linguistics (words to be careful about in photos of people) and image recognition itself (e.g., better recognition of dark-skinned faces). Lots of work being done and lots still to be done, but we’re very much on it.” Fast-forward three years, and it turns out that Google’s quick-and-dirty fix was to ban the word “gorilla” altogether. This is indicative of a fundamental problem: in three years, Google was not able to fix the issue. Banning the word “gorilla” altogether is not a fix; it is an admission of failure.

And here is a sign that we’re still deep in hype mode: when reporting on this, the Guardian writes: “The failure of the company to develop a more sustainable fix in the following two years highlights the extent to which machine learning technology, which underpins the image recognition feature, is still maturing” (italics mine). This is a sign of hype because it assumes (without reasonable proof) that the problem is solvable at all. And the excuse is by definition limited in time: you cannot keep writing that something is “maturing” indefinitely.

These kinds of problems (and this kind of reporting in the press) fully mirror what happened in the 1960s and 1970s. Many reported successes weren’t real, and the failures were either painted over, not reported, or labelled “still maturing,” implying that there is nothing wrong with the approach itself. I noticed the same vein of reporting in journals like New Scientist years ago. Good examples were articles listing “the X most interesting scientific and technological breakthroughs.” Such lists generally contained at least one AI-like item, and that item was often the only one that wasn’t reality yet. Often (just as in the first AI hype period) these were reported as failures “that just needed to mature,” “fix a few outstanding issues,” and so on. All the other items in the list were proven breakthroughs (e.g., a mechanism that worked but was still too expensive to produce in volume), but the AI ones were only breakthroughs if we accepted totally unfounded extrapolations. There is a disturbing lack of criticism when AI is being reported on. By the way, I like New Scientist a lot; this just shows that even serious science journalism is not immune to the hype.

For those experienced enough (a nice way of saying “old buggers”) to have been on the inside of the first period of AI hype (1960 to 1990), the current hype shows some distressing parallels with the previous one, in which hundreds of billions were spent in vain on the promised “artificial intelligence.” By 1972, with Hubert Dreyfus’s book What Computers Can’t Do, it was already clear the approaches were doomed. It is a testament to the strength of the (mistaken) conviction behind the first AI hype that it took another 20 years and billions of dollars, spent by governments (mainly in the US, the EU, and Japan) and private companies (Bill Gates’s Microsoft was a big believer, for instance), for the AI hype to peter out and the “winter of AI” to set in.

Did that mean AI failed completely in the previous century? Not completely, but what definitely did fail was the initial wave and the initial assumptions on which it was built. The current AI hype is seemingly built on different assumptions, so theoretically it might succeed. The question then becomes “Are we closing in on the goal this time?” The answer is again no, for fundamental reasons I discuss below, but the situation is also different, and the current efforts do produce more actual value.

Great expectations and failure, round 1

With the early rise of computers in the 1950s and 1960s came a fast-growing belief that these wonderful machines, which could calculate so much faster than any human, would give rise to artificial (human-like) intelligence. Early pioneers wrote small-scale programs that were able to beat humans first at tic-tac-toe, then at checkers, and by 1970 we were already reading predictions that computers would be “as smart as humans in a matter of years” and “much smarter a few months later.” Within ten years, the world chess champion would be beaten by a computer.

Computers as intelligent as humans (even if specialist programs eventually could outcalculate human intelligence in microworlds such as chess) failed to materialize decade after decade, until in the 1980s the “winter of AI” slowly set in. I often lament the billions Bill Gates’s Microsoft poured into AI (in which he believed strongly) to produce duds like Microsoft Bob, Microsoft Agent, the infamous Clippy, and lots of vaporware, when it could have fixed its software to be less of a security nightmare and to actually work decently. (For example, it is moronic that even in Windows 10 today, if you share a spreadsheet from within Excel, which creates a mail message in Outlook, Outlook is incapable of doing anything else until the window containing the spreadsheet to be sent has been closed. Something NeXTStep, now macOS, could already do 25 years ago. But I digress.)

But in the end the whole endeavor came to nothing. Expert systems turned out to be costly to build and maintain, and very, very brittle. Machine translation based on grammars (rules) and dictionaries never amounted to much. The actual successes often had one thing in common: they restricted the problem to something computers could manage, a microworld. For instance, using a small subset of English (e.g., Simplified Technical English). Or exploiting the fact that the possible combinations of street names, numbers, postal codes, and city names limited the search space so well that mail-sorting machines could get by with a measly 70 percent accuracy at actually reading the addresses themselves.

As mentioned above, the failure was already predicted in the late 1960s by Dreyfus, who—after studying the approaches and ideas of the cognitive science and AI communities—noticed that their approaches were based on misunderstandings already known from analytic philosophy, which had hit those same limitations before. But who listens to philosophers, even if they are of the analytic kind?

Anyway, in 1979 the second, revised edition of Dreyfus’s seminal book was published. It showed that much of the additionally reported progress was again illusory, and it analyzed why. Finally, in 1992 the third and definitive edition was published by MIT Press (MIT having been a hotbed of anti-Dreyfus sentiment in the years before). Twenty years in a field supposedly developing at a speed beyond everything that went before, with computers that had become many orders of magnitude more powerful, and yet the book remained correct, and in fact is correct (and worthwhile to read) to this day.

In the 1992 edition, Dreyfus described an early failure of the then-nascent “neural networks” approach (the approach that underlies many of today’s successes, such as Google beating the human Go champion). The U.S. Department of Defense had commissioned a program that would, based on neural network technology, be able to detect tanks in a forest from photos. The researchers built the program and fed it the first batch of a set of photo pairs showing the same locations, one with a tank and one without. After the neural network had been trained this way, the program was able to detect tanks with uncanny precision on photos from the training batch itself. Even more impressive, when they used the rest of the set (the photos that had not been used to train the neural network), the program did extremely well, and the Department of Defense was much impressed. Too soon, though. The researchers went back to the forest to take more pictures and were surprised to find that the neural network was not capable of discriminating between pictures with and without tanks at all. An investigation provided the answer: it takes time to put tanks in place or remove them.

As a result, the original photos with tanks had been taken on a cloudy day, while the photos without tanks had been taken on a sunny day. The network had effectively been trained to discriminate between a sunny day (sharp shadows) and a cloudy day (hardly any shadows). Fast-forward thirty (!) years, and the deep learning neural networks or equivalents from Google run into exactly the same problem. Their neural networks have learned to discriminate, but not well enough to distinguish black people from gorillas, and it is actually not very clear what they discriminate on.
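The tank anecdote can be sketched in a few lines of code. The sketch below is deliberately crude and entirely synthetic (the data, the single brightness feature, and the one-threshold “learner” are all invented for illustration): a model trained on photos where cloudiness happens to coincide with tanks learns the weather, not the tank.

```python
# Toy illustration of the tank anecdote: a "classifier" that latches onto a
# spurious feature (overall photo brightness) instead of the intended signal.
# All data is synthetic; the one-feature threshold model stands in for what a
# shallow network can end up learning.

def train_threshold(samples):
    """Learn a brightness threshold separating the two classes.

    samples: list of (brightness, has_tank) pairs.
    Predict 'tank' when brightness falls below the returned threshold.
    """
    tank = [b for b, t in samples if t]
    no_tank = [b for b, t in samples if not t]
    # Midpoint between the classes: the crudest possible learner.
    return (max(tank) + min(no_tank)) / 2

def predict(threshold, brightness):
    return brightness < threshold  # dark photo => "tank"

# Training set: tank photos taken on a cloudy day (dark),
# tank-free photos taken on a sunny day (bright).
train = [(0.20, True), (0.30, True), (0.25, True),
         (0.80, False), (0.90, False), (0.85, False)]
thr = train_threshold(train)

# Held-out photos from the SAME shoots: still cloudy-with-tank, sunny-without.
held_out = [(0.22, True), (0.88, False)]
print(all(predict(thr, b) == t for b, t in held_out))   # looks perfect: True

# New photos: tanks on a sunny day, empty forest on a cloudy day.
new_batch = [(0.90, True), (0.20, False)]
print(all(predict(thr, b) == t for b, t in new_batch))  # collapses: False
```

The model looks perfect on held-out photos from the same shoots because the spurious correlation still holds there; it fails the moment the weather and the tanks are decoupled. Deep networks can fail the same way, only with rules we cannot read off.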

And yes, Google can now beat the best people at Go, which is much more difficult than chess (where IBM beat the best people in the 1990s), but Go and chess are still examples of the microworlds of the late 1960s. Even more important: these are strictly logical (discrete) microworlds, and people are extremely bad at (discrete) logic. We might be better at it than any other living thing on this planet, but we’re still much better at Frisbee. (This, by the way, is one of the reasons why rule-based approaches for enterprise/IT architecture fare so poorly: the world in which enterprises/computers must act is not logical at all.)

Great expectations, round 2. Failure ahead?

Neural networks and other forms of statistics require lots of data and “deep” (i.e., many-layered) networks to amount to anything. In the 1960s and 1970s, computers were puny and data sets were tiny. So, generally, the “intelligence” was based on the direct creation of symbolic rules in expert systems, without a need for statistics on data. These rule-based systems (sometimes disguised as data-driven, such as the Cyc project, which stubbornly plods along with the goal of modeling all facts and thus arriving at intelligence) failed outside of a few microworlds. Cyc, by the way, is proving that even with a million facts and rules, you still have nothing that looks like true intelligence.

Anyway, the first neural networks were “thin”: initially two, maybe three layers to create the correlations between input (e.g., a photo) and output (“Tank!”). Such thin networks are very brittle. With more “depth” today, they contain more hidden rules. The statistical correlations are still equivalent to rules, these days deeper than mistaking a sunny versus a cloudy day for “tank” versus “no tank.” In fact, while we do not know the actual rules, the neural networks on digital computers are still data-driven, rule-based systems in disguise. Smart “psychometrics” systems may still contain relations like the one between liking “I hate Israel” on Facebook and a tendency to like Nike shoes and Kit Kat.

Now, over the last decade, the amount of available data has become huge (at least according to our current view; don’t forget that researchers in the 1960s already talked about the issues that came with trying to program the “very powerful” computers of their day). This has opened up a new resource, data, in this light sometimes called the new “oil,” that can be exploited with statistics. To us, logically woefully underpowered people, it looks like magic and intelligence. These statistical tricks can draw reliable conclusions that are impossible for us to draw, if only because we as human beings are unable to process all that data.

But look a bit deeper and you will still find that shockingly simple rules are in fact unearthed and used. One specialist told me: most of the time, when we do statistical research on the data, we find correlations around some 14 different aspects: ZIP code, age, sex, income, education, and so on. And these predict with decent enough reliability how we should approach customers.
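To make the specialist’s point concrete, here is a minimal sketch of what such a “simple rule” can look like in practice: a clamped linear score over a handful of demographic fields. The field names, weights, and the example customer are all hypothetical; real systems fit their coefficients from data, but the shape of the model is often no deeper than this.

```python
# Hedged sketch of a "simple rule" over demographic aspects: a linear score,
# clamped to [0, 1]. Weights are invented for illustration, not fitted.

WEIGHTS = {                 # hypothetical, hand-picked coefficients
    "age": -0.02,
    "income_k": 0.01,       # income in thousands
    "years_education": 0.05,
}
BIAS = 0.3

def propensity(customer):
    """Crude purchase-propensity score in [0, 1] (clamped linear model)."""
    score = BIAS + sum(WEIGHTS[k] * customer.get(k, 0) for k in WEIGHTS)
    return max(0.0, min(1.0, score))

# A hypothetical customer record with only the modeled fields:
alice = {"age": 40, "income_k": 50, "years_education": 12}
print(propensity(alice))  # a score around 0.6 for this record
```

Nothing here resembles “understanding” a customer; it is a weighted checklist. That is the sense in which the unearthed patterns are rules, just rules we let statistics pick for us.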

The patterns (rules) unearthed are thus rather simple and have two important properties:

  • They are not perfectly reliable. They all come with a percentage of reliability, as in “90 percent reliable correlation.” That means in many cases they produce erroneous results.
  • The more restricted the range of predictions, the more reliable they are.
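The first property is worth making concrete with some back-of-the-envelope arithmetic (the volumes and reliability figures below are mine, chosen for illustration, not taken from any real system):

```python
# What "90 percent reliable" means at scale: the 10 percent error rate
# becomes an enormous absolute number of mistakes. Figures are illustrative.

def expected_errors(n_predictions, reliability):
    """Expected number of wrong calls for a rule with the given reliability."""
    return round(n_predictions * (1.0 - reliability))

# One million automated decisions with a 90-percent-reliable correlation:
print(expected_errors(1_000_000, 0.90))  # 100000 customers handled wrongly

# Restricting the prediction range (the second property) buys reliability,
# but even 99 percent still leaves a pile of mistakes at scale:
print(expected_errors(1_000_000, 0.99))  # 10000 wrong calls
```

Whether 100,000 (or even 10,000) mishandled cases per million is acceptable depends entirely on how loud those cases get, which is exactly the point of the next paragraph.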

So, how bad is it that these methods have reliability issues? Will this statistics-based AI fail as spectacularly as the earlier symbolic wave did? The answer is no. If you can support your customer well nine out of ten times with a method, that is a good thing, right? Unless, of course, the remaining 10 percent make so much fuss on social media that your brand suffers (“Google labeled me a gorilla!”). At this stage in the game, nobody pays much attention to the outliers, and that is a recipe for disaster.
