08 February 2017

Data Science DC: Deep Learning Past, Present, and Near Future


Here in Washington DC we are lucky to have a robust technical community, and as a result we have many great technical meetups, including a number that fall under the local data science community. I recently attended the Data Science DC (DSDC) talk "Deep Learning Past, Present, and Near Future" (video) (pdf slides) presented by Dr. John Kaufhold, a data scientist and managing partner of Deep Learning Analytics. This is the second DSDC post I have written; the first is here. Parts of this post will be cross-posted as a series on the DSDC blog (TBA); references to my previous blog posts, either here or there, refer to my blog elegantcoding.com. My previous DSDC post was pretty much play-by-play coverage of the talk, so I decided to take a slightly different approach this time. My goal here is to present some of the ideas and source material in John's talk while adding and fleshing out additional details. This post should be viewed as "augmentation," since he covers some things that I will not capture here.

I am guessing that some proportion of the audience that night was like me in feeling the need to learn more about deep learning and machine learning in general, a pressure that currently seems omnipresent in our industry. I know some people who are aggressively pursuing machine learning and math courses. So I would like to add some additional resources, vetted by John, for people like myself who want to learn more about deep learning.

I decided to make this a single post with two parts. It essentially parallels John's presentation, although I will change the order a bit. The first part summarizes the past, present, and future of deep learning, with some editorializing of my own on the AI future. The second part is an augmented list of resources, including concepts, papers, and books, in conjunction with what he provided for getting started and learning more in depth about deep learning.

I will give a brief history of deep learning based in part on John's slide, or rather Lukas Masuch's slide. I am a fan of history, especially James Burke Connections-style history, and it is hard for me to look at recent history without thinking of the greater historical context. I have a broader computer history post that might be of interest to history of technology and history of science fans. Three earlier relevant events might be the development of the method of least squares, which led to statistical regression; Thomas Bayes's "An Essay towards solving a Problem in the Doctrine of Chances"; and Andrey Markov's discovery of Markov chains. I will leave those in the broader context and focus on the relatively recent history of deep learning.

An Incomplete History of Deep Learning

1951: Marvin Minsky and Dean Edmonds build SNARC (Stochastic Neural Analog Reinforcement Calculator), a neural net machine that is able to learn. It is a randomly connected network of Hebb synapses.

1957: Frank Rosenblatt invents the Perceptron at the Cornell Aeronautical Laboratory with naval research funding. He publishes "The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain" in 1958. It is initially implemented in software for the IBM 704 and subsequently implemented in custom-built hardware as the "Mark 1 Perceptron".

1969: Marvin Minsky and Seymour Papert publish Perceptrons, which shows that single-layer perceptrons of this class are incapable of learning an XOR function.

1970: Seppo Linnainmaa publishes the general method for automatic differentiation of discrete connected networks of nested differentiable functions. This corresponds to the modern version of back propagation which is efficient even when the networks are sparse.

1972: Stephen Grossberg publishes the first of a series of papers introducing networks capable of modeling differential, contrast-enhancing, and XOR functions: "Contour enhancement, short-term memory, and constancies in reverberating neural networks (pdf)".

1973: Stuart Dreyfus uses backpropagation to adapt parameters of controllers in proportion to error gradients.

1974: Paul Werbos mentions the possibility of applying backpropagation to artificial neural networks. Backpropagation had initially been derived in the context of control theory by Henry J. Kelley and Arthur E. Bryson in the early 1960s; around the same time, Stuart Dreyfus published a simpler derivation based only on the chain rule.

1974–80: The first AI winter, which may have been caused in part by the often mis-cited 1969 Perceptrons by Minsky and Papert.

1980: The Neocognitron, a hierarchical, multilayered artificial neural network, is proposed by Kunihiko Fukushima. It has been used for handwritten character recognition and other pattern recognition tasks and served as the inspiration for convolutional neural networks.

1982: John Hopfield popularizes Hopfield networks, a type of recurrent neural network that can serve as content-addressable memory systems.

1982: Stuart Dreyfus applies Linnainmaa's automatic differentiation method to neural networks in the way that is widely used today.

1986: David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams publish "Learning representations by back-propagating errors (pdf)". They show through computer experiments that this method can generate useful internal representations of incoming data in the hidden layers of neural networks.

1987: Minsky and Papert publish "Perceptrons - Expanded Edition" where some errors in the original text are shown and corrected.

1987–93: The second AI winter occurs, caused in part by the collapse of the Lisp machine market and the decline of expert systems. Additionally, training times for deep neural networks are too long, making them impractical for real-world applications.

1993: Eric A. Wan wins an international pattern recognition contest. This is the first time backpropagation is used to win such a contest.

1995: Corinna Cortes and Vladimir Vapnik publish the current standard (soft margin) incarnation of the support vector machine in "Support-Vector Networks". The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. SVMs then take on a dominant role in AI.

1997: Sepp Hochreiter and Jürgen Schmidhuber invent long short-term memory (LSTM) recurrent neural networks, greatly improving the efficiency and practicality of recurrent neural networks.

1998: A team led by Yann LeCun releases the MNIST database, a dataset comprising a mix of handwritten digits from American Census Bureau employees and American high school students. The MNIST database has since become a benchmark for evaluating handwriting recognition. LeCun, Bottou, Bengio, and Haffner publish "Gradient-Based Learning Applied to Document Recognition (pdf)".

1999: NVIDIA releases the GeForce 256, marketing it as the world's first GPU.

2006: Geoffrey Hinton and Ruslan Salakhutdinov publish "Reducing the Dimensionality of Data with Neural Networks (pdf)", an unsupervised learning breakthrough that now allows for the training of much deeper networks.

2007: NVIDIA launches the CUDA programming platform. This opens up the general purpose parallel processing capabilities of the GPU.

2009: NIPS Workshop on Deep Learning for Speech Recognition discovers that with a large enough data set, the neural networks don't need pre-training, and the error rates drop significantly.

2012: Artificial pattern-recognition algorithms achieve human-level performance on certain tasks. This is demonstrated by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton using the ImageNet dataset, published as "ImageNet Classification with Deep Convolutional Neural Networks (pdf)".

2012: Google's deep learning algorithm discovers cats. AI now threatens the cat cabal control of the internet.

2015: Facebook puts deep learning technology, called DeepFace, into operation to automatically tag and identify Facebook users in photographs. The system performs face recognition using deep networks with roughly 120 million parameters.

A Brief and Incomplete History of Game Playing Programs and AIs

1951: Alan Turing is the first to publish a program capable of playing a full game of chess, developed entirely on paper (the original whiteboard coding interview).

1952: Arthur Samuel joins IBM's Poughkeepsie Laboratory and begins working on some of the very first machine learning programs, first creating programs that play checkers.

1992: Gerald Tesauro at IBM's Thomas J. Watson Research Center develops TD-Gammon, a computer backgammon program. Its name comes from the fact that it is an artificial neural net trained by a form of temporal-difference learning.

1997: IBM's Deep Blue beats Garry Kasparov, the world champion at chess.

2011: IBM's Watson, using a combination of machine learning, natural language processing, and information retrieval techniques, beats two human champions on the TV game show Jeopardy! Later versions of Watson successfully banter with Alex Trebek, thus further surpassing human capabilities.

2013: DeepMind learns to play seven Atari games, surpassing humans on three of them, using deep reinforcement learning and no adjustments to the architecture or algorithm. It eventually learns additional games, taking the total to 46.

2016: AlphaGo, Google DeepMind's algorithm, defeats the professional Go player Lee Sedol.

2017: DeepStack becomes the first computer program to beat professional poker players in heads-up no-limit Texas hold'em.

The Present and Future

John also included the idea of Geoffrey Hinton, Yann LeCun, and Yoshua Bengio as the triumvirate that weathered the AI winter, a theme he used in his first DSDC talk as well. Since "triumvirate" was one of Homer's vocabulary words, I couldn't resist creating my own graphic for it:

While the triumvirate researchers are definitely major players in the field, there are many others. There seems to be some drama about giving credit where credit is due, and John goes into some of it. Unfortunately, it seems to be human nature that drama sometimes surrounds scientific discoveries, and the deep learning community today is no exception. Geoffrey Hinton also catches some flak on Reddit (ok, no surprise there) for not properly crediting Seppo Linnainmaa for backpropagation. It is natural to want credit for one's ideas; however, these ideas are never conceived in a vacuum, and sometimes they are discovered independently. Calculus springs to mind. As they say, history is written by the winners. John made a remark about Rosalind Franklin and the Nobel Prize for the double helix that she never received. I asked him how many people in the audience he thought got that remark and he replied: "One, apparently :) I'll take it!" Lise Meitner is another interesting example of someone who failed to be recognized by the Nobel committee for her work in nuclear fission.

A major name in the field is Jeff Dean. John mentioned Jeff Dean facts, which are apparently based on the Chuck Norris facts. Jeff Dean is somewhat of a legend, and it is hard to tell what is real and what is made up about him. Do compilers really apologize to him? We may never know. As I understand it, he was one of the crucial early engineers who helped create the success of Google. He was instrumental in the conception and development of both MapReduce and Bigtable, not only solidifying Google's successful infrastructure but also laying down much of the foundation of major big data tools like Hadoop and Cassandra. There is some serendipitous timing here in that Jeff Dean published "The Google Brain team - Looking Back on 2016" on the Google Research Blog a few days after John's talk. His post lays out a number of areas where deep learning has made significant contributions, as well as other updates from his team.

The present and near future for deep learning are very bright. Many deep learning companies have been acquired, and many of these are small, averaging about seven people. John describes this as a startup stampede. Most likely a lot of these are acqui-hires, since the labor pool for deep learning is very small. Apparently the top talent is receiving NFL-equivalent compensation.

It's fairly obvious that deep learning and machine learning are the current hotness in our industry. One metric: at any random moment, go on Hacker News and you will most likely see a deep learning article, if not a machine learning article, on the front page. As if this were not enough, Kristen Stewart, famous for the Twilight movies, has now coauthored an AI paper: "Bringing Impressionism to Life with Neural Style Transfer in Come Swim". I am now waiting for an AI paper from Jesse Eisenberg.

As for perspectives on the future of machine learning, John recommends chapters 2, 8, and 9 of the free O'Reilly book The Future of Machine Intelligence. He also points out that deep learning does not necessarily mean neural nets; "deep" refers to the number of operations between the input and the output. This means that machine learning methods other than neural nets can have a deep architecture, and it is the depth itself that learns the complex relationships. For this he cites "Learning Feature Representations with K-means (pdf)" by Adam Coates and Andrew Y. Ng.
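
To make that concrete, here is a rough sketch, my own and not from the talk or the paper, of the Coates and Ng idea: learn a dictionary of image-patch features with plain k-means, then represent an image by its soft similarity to each learned centroid. Stacking such unsupervised feature layers gives you depth without a neural net. The patch size, centroid count, and simplified normalization (no whitening) are my assumptions.

```python
# Sketch of k-means feature learning in the spirit of Coates & Ng (not their code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

def _normalize(patches):
    # Per-patch mean/variance normalization; the paper's whitening step is omitted here.
    return (patches - patches.mean(axis=1, keepdims=True)) / (patches.std(axis=1, keepdims=True) + 1e-8)

def learn_patch_features(images, patch_size=(6, 6), n_centroids=100):
    """images: array of shape (n, H, W), grayscale, values in [0, 1]."""
    patches = np.vstack([
        extract_patches_2d(img, patch_size, max_patches=50, random_state=0)
          .reshape(-1, patch_size[0] * patch_size[1])
        for img in images
    ])
    # The k-means centroids play the role of a learned feature dictionary.
    return KMeans(n_clusters=n_centroids, n_init=10, random_state=0).fit(_normalize(patches))

def encode(img, km, patch_size=(6, 6)):
    """Soft 'triangle' encoding: activation_k = max(0, mean_distance - distance_k)."""
    patches = extract_patches_2d(img, patch_size).reshape(-1, patch_size[0] * patch_size[1])
    d = km.transform(_normalize(patches))           # distances to each centroid
    activations = np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)
    return activations.mean(axis=0)                 # pool over patches -> one feature vector per image
```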

Christopher Olah, in his post "Neural Networks, Types, and Functional Programming", talks about how deep learning is a very young field and how, like other young fields in the past, things are developed in an ad hoc manner, with the understanding and formalisms discovered later. As he points out, we will probably have a very different view of what we are doing in 30 years. Although that last point is a bit obvious, it still helps to put things in context: it is probably hard to tell which AI algorithms will be the most fruitful even ten years from now.

At the end of his talk John touches on a bigger issue that is becoming ever more common in conversations about deep learning and AI: what are the larger implications for society? So I thought I would contemplate some possibilities.

The first and perhaps most imminent and frightening concern is the loss of jobs due to automation. Many people argue that this is the same fear that was raised during the first two industrial revolutions, which occurred in the late 18th to early 19th century and the late 19th to early 20th century. Of course, how you classify industrial revolutions can vary, and you can easily be off by one if you include the Roman innovation of water power. Those industrial revolutions displaced certain skilled jobs by mechanizing work and created new, perhaps less skilled, factory jobs. The third industrial revolution, in the mid-to-late 20th century, can be thought of as the rise of factory automation and the development of the microprocessor, which can be described in simple terms as a general electronic controller, although of course it is far more than that. This continued automation led to things like specialized robots for manufacturing, which meant still fewer jobs for people. Now we are on the cusp of what is being called the fourth industrial revolution, with generalized AI algorithms and multipurpose robotics.

I have often heard the argument that this has all happened before. John brings up a valid point: when it happened before, there was a lot more time for society to recover, and new jobs were created. There was an article on NPR identifying truck driver as one of the most common jobs in America. There is some dispute over this, in that the most common jobs are arguably retail jobs. Regardless of which is the most common job, there are by some estimates 3.5 million truck driving jobs in the US. Self-driving vehicles are very possibly less than five years away, which means those jobs, not to mention taxi jobs, will be lost. How long before you click purchase on an ecommerce website like Amazon and no humans are involved in that purchase arriving at your house? That might be on about the same timeframe. This is an unprecedented level of automation, and it affects many other professions; for example, some paralegal and radiology jobs are potentially now under threat.

There is of course an upside. If AIs become better at medical jobs than humans, that could be a good thing: they won't be prone to fatigue or human error. In theory, people could be liberated from having to perform many tedious jobs as well. AIs could usher in new scientific discoveries. Perhaps they could even make many aspects of our society more efficient and more equal. In that case I, for one, would welcome our AI overlords.

It is hard to contemplate the future of society without thinking of the work of Orwell and Huxley; this comic sums up some of those concerns and raises the question of how much of this we currently see in our society. From the Orwellian perspective, programs like PRISM and the NSA's Utah Data Center allow for the collection and storage of our data. AIs could comb through this data in ways humans never could, which creates the potential for a powerful surveillance state. Of course, it is how this power is used or abused that will determine whether we go down the trajectory of the Orwellian dystopia. The other scary aspect of AI is the idea of automated killer drones, which of course is the Skynet dystopia. I think we should also fear the Elysium dystopia, with both its aerial drones and its robotic enforcers.

Maybe we go down the other path, with a universal income that frees us from having to perform daily labor to earn our right to live. How do we then fill our days? Do we all achieve whatever artistic or intellectual pursuits we are now free to explore? What happens if those pursuits can also be done better by the AIs? Is there a point in painting a picture or writing and singing a song if you will just be outcompeted by an AI? Some of this is already possible: ML algorithms have written reasonable songs, created artistic images, and written newspaper articles that are mostly indistinguishable from the work of human writers. Even math and science may fall under their superior abilities; for instance, Shinichi Mochizuki's ABC conjecture proof seems to be incomprehensible to other humans. What if math and science become too big and complex for humans to advance and they become the domain of the AIs? So in this case do we fall into Huxley's dystopia? Many people would probably find themselves, as some do now, exclusively pursuing drugs, alcohol, carnal pleasures, porn, video games, TV, movies, celebrity gossip, etc. Do we all fall into this in spite of our aspirations? And the AIs will most likely be able to create all of our entertainment desires, including the future synthetic Kim Kardashian-style celebs.

Ok, so that's enough gloom and doom. I can't really say where it will go. Maybe we'll end up with AI hardware wired directly to our brains and be able to take our human potential to new levels. Of course, why fight it? If you can't beat them, join them, and the best way to do that is to learn more about deep learning.

Deep Learning 101/Getting Started

Online Learning Resources

John mentions a few resources for getting started. He mentions Christopher Olah's blog, describing him as the Strunk and White of deep learning for his clear communication using a visual language. I have had his "Neural Networks, Manifolds, and Topology" post on my todo list, and I agree that he has a bunch of excellent posts. I admit to being a fan of his visual style from his old blog, as my post on Pascal's Triangle was partially inspired by his work. His newer work seems to be posted on http://distill.pub/. John also mentions Andrej Karpathy's blog as well as the comprehensive course on convolutional neural nets that he teaches at Stanford, CS231n: Convolutional Neural Networks for Visual Recognition, which can additionally be found here.

Additionally, John did a write-up, "Where are the Deep Learning Courses?", after his first DSDC talk.

If you are local, we also have the DC-Deep-Learning-Working-Group, which is currently working through Richard Socher's Stanford NLP course.

Some Additional Resources

I wanted to include some additional resources, so I came up with a list and asked John's opinion on it. These are some that I found and that he endorses, with an especially high recommendation for Richard Socher's NLP tutorial.

Neural Networks and Deep Learning By Michael Nielsen http://neuralnetworksanddeeplearning.com/

Deep Learning by Ian Goodfellow and Yoshua Bengio and Aaron Courville http://www.deeplearningbook.org/

Udacity: Deep Learning by Google

ACL 2012 + NAACL 2013 Tutorial: Deep Learning for NLP (without Magic), Richard Socher, Chris Manning and Yoshua Bengio

CS224d: Deep Learning for Natural Language Processing, Richard Socher

These are the additional resources that looked interesting to me:

Deep Learning in a Nutshell by Nikhil Buduma

Deep Learning Video Lectures by Ruslan Salakhutdinov, Department of Statistical Sciences, University of Toronto:

Lecture 1

Lecture 2

Nuts and Bolts of Applying Deep Learning (Andrew Ng)(video)

Core Concepts

I wanted to list out some core concepts that one might consider learning, so I reached out to John with a list. He replied with some that he thought were important, while pointing out that such a list can grow rapidly and that it is better to take a more empirical approach and learn as you go by doing. Still, I think a brief list might be helpful to people new to machine learning and deep learning.

Some basic ML concepts might include:

These are some that are more specific to Neural Nets:

Some additional areas that might be of interest are:

Additionally, a couple of articles that lay out some core concepts are The Evolution and Core Concepts of Deep Learning & Neural Networks and Deep Learning in a Nutshell: Core Concepts.

Git Clone, Hack, and Repeat

John mentions the approach of git clone, hack, and repeat. I asked him whether there were any specific GitHub repos he would recommend. I am just using his reply here, with some of the links and descriptions filled in:

This is really, choose your own adventure, but when I look in my history for recent git clone references, I see:

Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation. github

Efficient, reusable RNNs and LSTMs for torch

Torch interface to HDF5 library

Computation using data flow graphs for scalable machine learning

This repository contains IPython Notebook with sample code, complementing Google Research blog post about Neural Network art.

Neural Face uses Deep Convolutional Generative Adversarial Networks (DCGAN)

A tensorflow implementation of "Deep Convolutional Generative Adversarial Networks"

Among others--playing with OpenAI's universe and pix2pix both seem full of possibilities, too.

Image-to-image translation using conditional adversarial nets

OpenAI Universe

To learn, I'm a big advocate of finding a big enough dataset you like and then trying to find interesting patterns in it--so the best hacks will usually end up combining elements of multiple repos in new ways. A bit of prosaic guidance I seem to find myself repeating a lot: always try to first reproduce what the author did; then start hacking. No sense practicing machine learning art when there's some dependency, sign error, or numerical overflow or underflow that needs fixing first. And bugs are everywhere.

Data Sets

I asked John to recommend some data sets for getting into deep learning; he recommended the following:

MNIST

CIFAR-10

NORB

Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" provides some examples using a Shakespeare corpus, Wikipedia data, and Linux kernel code that illustrate a lot of really interesting properties of RNNs.
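
If you want to try Karpathy-style experiments yourself, here is a tiny sketch, my own rather than anything from his post, of the data preparation: it turns any plain-text corpus (Shakespeare, a Wikipedia dump, kernel source) into fixed-length integer sequences that a character-level RNN can train on. The file name input.txt and the sequence length of 100 are placeholder choices.

```python
# Sketch of char-level data preparation for an RNN experiment (not Karpathy's code).
import numpy as np

text = open('input.txt', encoding='utf-8').read()
chars = sorted(set(text))
char_to_ix = {c: i for i, c in enumerate(chars)}  # map each character to an integer id

seq_len = 100
# Slide a non-overlapping window over the corpus: each sample is seq_len characters,
# and the training target is the single character that follows the window.
X = [[char_to_ix[c] for c in text[i:i + seq_len]]
     for i in range(0, len(text) - seq_len, seq_len)]
y = [char_to_ix[text[i + seq_len]] for i in range(0, len(text) - seq_len, seq_len)]
X, y = np.array(X), np.array(y)
print("%d distinct characters, %d training sequences" % (len(chars), len(X)))
```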

Deep Learning Frameworks

He also mentions a number of deep learning frameworks to look at for getting into deep learning; the top three are TensorFlow, Caffe, and Keras.

As I was writing this, I saw "Google Tensorflow chooses Keras". This makes Keras the first high-level library added to core TensorFlow at Google, which will effectively make it TensorFlow's default API.
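
As a taste of what the Keras API looks like, here is a minimal sketch, my own rather than anything from the talk, that trains a small fully connected network on MNIST. The layer sizes, optimizer, and epoch count are arbitrary starting points, and the syntax follows Keras 2.

```python
# Minimal Keras example: a small fully connected classifier on MNIST.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

# Load and flatten the images, scale pixel values to [0, 1], one-hot encode the labels.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),  # one hidden layer
    Dense(10, activation='softmax'),                     # one output per digit class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]
```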

Building Deep Learning Systems

One area of interest to me is how to build better software. In traditional software development, tests are used to help verify a system: you can usually expect known inputs to yield known outputs. This is something that software developers strive for, and it is helped by things like referential transparency and immutability. But how do you handle systems that behave probabilistically instead of deterministically? As I was researching how to successfully test and deploy ML systems, I found these two resources: "What's your ML test score? A rubric for ML production systems" and "Rules of Machine Learning: Best Practices for ML Engineering (pdf)". I asked John his thoughts on this, and he recommended "Machine Learning: The High-Interest Credit Card of Technical Debt" as well as "Infrastructure for Deep Learning", although he said the latter is more of a metaphor for us normal humans who don't have Altman/Musk-funded budgets. He also pointed out that building these types of systems still falls under the standard best-practices approach to any software, and trying to maintain high software development standards is just as helpful here as in any project.
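
To make the "known inputs, known outputs" contrast concrete, here is a minimal sketch, my own illustration rather than anything prescribed by those papers, of the kind of statistical test they suggest for a trained classifier: assert an accuracy floor on held-out data and stability under a tiny label-preserving perturbation, instead of asserting exact outputs. The thresholds and the model.predict interface are assumptions.

```python
# Sketch of testing a probabilistic system with statistical assertions rather than exact outputs.
import numpy as np

def test_model_quality(model, x_val, y_val):
    # 1. Threshold test: accuracy on held-out data must not fall below an agreed floor.
    preds = model.predict(x_val).argmax(axis=1)
    accuracy = (preds == y_val).mean()
    assert accuracy >= 0.95, "validation accuracy regressed: %.3f" % accuracy

    # 2. Invariance test: tiny input noise should almost never flip a prediction.
    noisy = np.clip(x_val + np.random.normal(0.0, 0.01, x_val.shape), 0.0, 1.0)
    noisy_preds = model.predict(noisy).argmax(axis=1)
    agreement = (preds == noisy_preds).mean()
    assert agreement >= 0.99, "model is unstable under small perturbations: %.3f" % agreement
```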

Final Thoughts

My contemplation of the future is pretty pessimistic from the societal perspective, but pretty optimistic from the AI perspective on what can be done. New articles about AI achievements, like "Deep learning algorithm does as well as dermatologists in identifying skin cancer", are constantly appearing. But then you see things like "Medical Equipment Crashes During Heart Procedure Because of Antivirus Scan". Although this is not related to deep learning, it does touch on a well-known issue: our lives are effectively run by software, and we still don't really know how to truly engineer it. From what I understand, we really don't know exactly how some of these deep learning algorithms work, which makes the future of our software and AI seem even more uncertain. Hopefully we will overcome that.

Lastly, I wanted to thank John for being very generous with his time in helping me with additional resources for this post.

References and Further Reading

Some Leading Deep Learning Researchers' Links