For those interested in the details, back propagation uses the gradient of the error (or cost) function with respect to the weights and biases of the model to discover the correct direction to minimize the error. Two things control the application of corrections: the optimization algorithm and the learning rate variable. The learning rate variable usually needs to be small to guarantee convergence and avoid causing dead ReLU neurons.

### Optimizers for neural networks

Optimizers for neural networks typically use some form of gradient descent algorithm to drive the back propagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimizing randomly selected mini-batches (Stochastic Gradient Descent) and applying *momentum* corrections to the gradient. Some optimization algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).

As with all machine learning, you need to check the predictions of the neural network against a separate validation data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.

### Deep learning algorithms

A deep neural network for a real problem might have upwards of 10 hidden layers. Its topology might be simple, or quite complex.

The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train.

Convolutional neural networks (CNN) are often used for machine vision. Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically takes the integrals of many small overlapping regions. The pooling layer performs a form of non-linear down-sampling. ReLU layers apply the non-saturating activation function f(x) = max(0,x). In a fully connected layer, the neurons have connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss function for classification, or a Euclidean loss function for regression.

Recurrent neural networks (RNN) are often used for natural language processing (NLP) and other sequence processing, as are Long Short-Term Memory (LSTM) networks and attention-based neural networks. In feed-forward neural networks, information flows from the input, through the hidden layers, to the output. This limits the network to dealing with a single state at a time.

In recurrent neural networks, the information cycles through a loop, which allows the network to remember recent previous outputs. This allows for the analysis of sequences and time series. RNNs have two common issues: exploding gradients (easily fixed by clamping the gradients) and vanishing gradients (not so easy to fix).

In LSTMs, the network is capable of forgetting (gating) previous information as well as remembering it, in both cases by altering weights. This effectively gives an LSTM both long-term and short-term memory, and solves the vanishing gradient problem. LSTMs can deal with sequences of hundreds of past inputs.

Attention modules are generalized gates that apply weights to a vector of inputs. A hierarchical neural attention encoder uses multiple layers of attention modules to deal with tens of thousands of past inputs.

Random Decision Forests (RDF), which are not neural networks, are useful for a range of classification and regression problems. RDFs are constructed from many layers, but instead of neurons an RDF is constructed from decision trees, and outputs a statistical average (mode for classification or mean for regression) of the predictions of the individual trees. The randomized aspects of RDFs are the use of bootstrap aggregation (a.k.a. *bagging*) for individual trees, and taking random subsets of the features for the trees.

XGBoost (eXtreme Gradient Boosting), also *not* a deep neural network, is a scalable, end-to-end tree boosting system that has produced state-of-the-art results on many machine learning challenges. Bagging and boosting are often mentioned in the same breath; the difference is that instead of generating an ensemble of randomized trees, gradient tree boosting starts with a single decision or regression tree, optimizes it, and then builds the next tree from the residuals of the first tree.

Some of the best Python deep learning frameworks are TensorFlow, Keras, PyTorch, and MXNet. Deeplearning4j is one of the best Java deep learning frameworks. ONNX and TensorRT are runtimes for deep learning models.

## Deep learning vs. machine learning

In general, classical (non-deep) machine learning algorithms train and predict much faster than deep learning algorithms; one or more CPUs will often be sufficient to train a classical model. Deep learning models often need hardware accelerators such as GPUs, TPUs, or FPGAs for training, and also for deployment at scale; without them, the models would take months to train.

For many problems, some classical machine learning algorithms will produce a “good enough” model. For other problems, classical machine learning algorithms have not worked terribly well in the past.

One area that is usually attacked with deep learning is natural language processing, which encompasses language translation, automatic summarization, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.

Another prime area for deep learning is image classification, which includes image classification with localization, object detection, object segmentation, image style transfer, image colorization, image reconstruction, image super-resolution, and image synthesis.

In addition, deep learning has been used successfully to predict how molecules will interact in order to help pharmaceutical companies design new drugs, to search for subatomic particles, and to automatically parse microscope images used to construct a three-dimensional map of the human brain.