Written by: Michael Mozer, Scientific Advisor at AnswerOn and Professor at the University of Colorado

In the late 1980s, neural networks were hot.

Arnold as the Terminator announced that his brain was a neural network. Dick Tracy learned that neural networks were computers that learned through experience. And long before Google and Facebook, cutting-edge businesses tried using neural networks for difficult prediction problems, including forecasting stock prices and customer churn. At the heart of the neural net revolution was a training procedure, back propagation, that allowed a neural network to discover its “weight” parameters from a set of labeled examples (Rumelhart, Hinton, & Williams, 1986). For example, a neural net could be presented with images of cats and dogs labeled by category, and the training procedure would automatically program the neural network not only to classify the training images correctly but also to generalize to novel images.
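To make the idea of learning weights from labeled examples concrete, here is a minimal sketch of back propagation in plain Python. It is an illustration only, not the code of Rumelhart, Hinton, and Williams: the XOR toy data set stands in for the cat and dog images, and the names `train`, `forward`, and `XOR`, the three hidden units, the learning rate, and the epoch count are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one input."""
    x_b = x + [1.0]  # append a constant bias input
    h = [sigmoid(sum(w * v for w, v in zip(row, x_b))) for row in w_hidden]
    y = sigmoid(sum(w * v for w, v in zip(w_out, h + [1.0])))
    return h, y


def train(examples, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Full-batch gradient descent on squared error for a one-hidden-layer net."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    n = len(examples)
    # Start from small random weights; training adjusts them from the examples.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    losses = []
    for _ in range(epochs):
        g_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [0.0] * (n_hidden + 1)
        total = 0.0
        for x, target in examples:
            h, y = forward(x, w_hidden, w_out)
            total += 0.5 * (y - target) ** 2
            # Error signal at the output unit.
            delta_out = (y - target) * y * (1.0 - y)
            for j, hj in enumerate(h + [1.0]):
                g_out[j] += delta_out * hj
            # Propagate the error signal back to each hidden unit.
            for j in range(n_hidden):
                delta_h = delta_out * w_out[j] * h[j] * (1.0 - h[j])
                for i, xi in enumerate(x + [1.0]):
                    g_hidden[j][i] += delta_h * xi
        losses.append(total / n)  # mean loss measured before this epoch's update
        w_out = [w - lr * g / n for w, g in zip(w_out, g_out)]
        w_hidden = [[w - lr * g / n for w, g in zip(row, g_row)]
                    for row, g_row in zip(w_hidden, g_hidden)]
    return w_hidden, w_out, losses


# Toy labeled examples (an assumption for illustration): the XOR function.
XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

w_hidden, w_out, losses = train(XOR)
print("mean loss before training:", round(losses[0], 4))
print("mean loss after training: ", round(losses[-1], 4))
```

Nothing in the code spells out the rule that maps inputs to labels. The weights start out random, and the only source of information about the task is the set of labeled examples.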

[Image: Dick Tracy neural network cartoon]

In the community of artificial-intelligence researchers, neural networks were offered as a general-purpose “black box” tool that could solve difficult classification and regression problems. The promise was not entirely realized, and by the mid-1990s, new, mathematically principled methods had been invented that seemed to have greater potential. These methods, including support-vector machines and Bayesian networks, became the new rage, and researchers still developing neural networks were considered out of touch and behind the times.

Fast forward twenty years. Neural networks are hot again, both in the academic community and in the popular press.

Networks developed by Google researchers watch hundreds of hours of YouTube videos and train themselves to recognize cats. Networks developed by Microsoft researchers perform real-time spoken language translation. A New York Times article appears in the Science section touting “Brainlike Computers, Learning From Experience” (12/29/13).

What changed between 1995 and 2015? Essentially, computers have gotten orders of magnitude faster and training data sets have gotten orders of magnitude larger. Beyond the gains from Moore’s law and the era of big data, the dogged computer scientists who continued to investigate neural networks discovered some essential tricks and techniques (e.g., the method of dropout, proposed by Hinton et al., 2012). These researchers most notably include Geoffrey Hinton at the University of Toronto, Yoshua Bengio at the University of Montreal, Juergen Schmidhuber at the Swiss AI lab IDSIA, and Yann LeCun at NYU.
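As a rough illustration of one of these tricks, the sketch below shows the core idea of dropout as described by Hinton et al. (2012): during training, each hidden unit's activation is randomly omitted with some probability, and at test time all units are kept but scaled down to compensate. The function name `dropout` and its parameters are assumptions made for this example. Scaling the activations at test time is used here as a stand-in for scaling the outgoing weights, which has the same effect on the next layer's input.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Randomly omit units during training; scale all units at test time."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if training:
        # Each unit is dropped independently with probability p_drop.
        return [0.0 if rng.random() < p_drop else a for a in activations]
    # At test time every unit is present, so scale by the keep probability.
    return [a * (1.0 - p_drop) for a in activations]


hidden = [0.2, 0.9, 0.4, 0.7]
print("training pass:", dropout(hidden, rng=random.Random(1)))
print("test pass:    ", dropout(hidden, training=False))
```

Because a different random subset of units is silenced on each training pass, no unit can rely on the presence of any particular other unit, which is the co-adaptation the paper's title refers to.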

Neural networks have been rebranded as deep learning (LeCun, Bengio, & Hinton, 2015), to reflect the claim that the key to the success of a neural network is its “deep” architecture, consisting of many layers of neurons. Each layer encodes a transformation of the information encoded in the layer below. With many layers, the deep architecture captures statistical regularities of a task as it transforms input representations to output representations. It is still not entirely clear why having many layers of transformation is critical to the success of a neural net. A single hidden layer of neurons is sufficient to approximate an arbitrary mapping from inputs to outputs (Hornik, 1991), but for some reason learning is facilitated by having many layers.
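The layered structure described above can be sketched as a simple composition of transformations. This is a minimal illustration, not any particular published architecture: the names `layer` and `deep_forward`, the tanh nonlinearity, the layer sizes, and the random untrained weights are all assumptions for the example.

```python
import math
import random


def layer(inputs, weights, biases):
    """One layer: a weighted sum per neuron followed by a tanh nonlinearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def deep_forward(inputs, layers):
    """Pass the input up through the stack, recording each representation."""
    representation = inputs
    trace = [representation]
    for weights, biases in layers:
        # Each layer transforms the representation produced by the layer below.
        representation = layer(representation, weights, biases)
        trace.append(representation)
    return trace


def random_layer(n_in, n_out, rng):
    """Hypothetical helper: an untrained layer with small random weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [4, 5, 3, 2]  # input size followed by three layer sizes
stack = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

for depth, rep in enumerate(deep_forward([0.5, -0.2, 0.1, 0.9], stack)):
    print("layer", depth, "has", len(rep), "values")
```

Adding depth simply means appending more entries to `stack`; the forward pass itself does not change.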

As in the early 1990s, neural networks are being hyped as a general-purpose solution to all problems. If history is any indication, they will not live up to this hype and a backlash is inevitable. History offers us many other lessons, and in future postings, I will present some of these lessons and their relevance to deep learning and to AnswerOn.


Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580v1 [cs.NE].

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251-257.

LeCun, Y., Bengio, Y., & Hinton, G. E. (2015). Deep learning. Nature, 521, 436-444.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.