An intro to Automatic Differentiation
I originally wrote the text below to put on my homepage, but then I decided it would just clutter up the page without tremendous benefit. So the text is now here fofr you to enjoy.
In automatic differentiation (AD), we study how to generalise taking symbolic/algebraic derivatives from simple mathematical expressions to whole computer programs, while keeping computational efficiency in mind. Being able to take the derivative of a numerical function implemented as a program is useful whenever you want to get some idea a program's local behaviour when you change its inputs a little; in the case of AD, we consider continuous inputs — in practice, floating point numbers. (If you're interested in the case where inputs are discrete, such as integers, look to incremental computation instead — about which I know very little.)
For example, in machine learning, the usual approach is to formulate a model with a lot of parameters, and then optimise those parameters to make the model compute some kind of target function as well as possible. This model is a computer program; neural networks (as considered in machine learning) are simply computer programs of a particular, quite structured form. Optimising a model involves changing the parameters in such a way as to get the output closer to the desired output on some set of inputs, and the derivative (gradient, in this case) of the model tells you, at least locally, how to change the parameters to make the model a little bit better. Gradient descent, and other more sophisticated algorithms, use this derivative information to optimise a model in some fashion. AD allows model implementors to have a lot of freedom in designing their models, because whatever they do, the AD algorithm will let the system compute the model's derivative.
Another application of AD is in statistical inference, such as in Stan, where users formulate a statistical model (with some parameters) of some part of the real world and then try to determine, given some real-world observations, where the parameters will approximately lie. (This is the Bayesian inference approach to statistical inference — frequentist inference fairly directly reduces to optimisation, for example using gradient descent.) Such inference algorithms need to compute a probability distribution (namely, the result: the distribution telling you where the parameters lie), and while it is relatively easy to give an indication of where these distributions are high (likely parameter values) and where they are low (unlikely ones), it's harder to normalise this distribution, which means computing what the actual probability of some parameter value is (as opposed to just being more or less likely than another parameter value). This normalisation involves integrating the produced unnormalised probability distribution, and clever integration algorithms (for continuous parameters, HMC) use the local "relief" (derivative) of your statistical model to more quickly find areas of interest, and more quickly converge to a good estimate of the overall integral (and thus the actual probabilities). Here, again, AD allows model implementors to not have to worry about writing their models in a specific form — any computation will do.