%!TEX root = main.tex
\section{Introduction}
Artificial neural networks have dozens of hyperparameters which influence their behaviour during training and evaluation. One of these hyperparameters is the choice of activation function. While in principle every neuron could have a different activation function, in practice networks typically use only two: the softmax function for the output layer, in order to obtain a probability distribution over the possible classes, and one activation function for all other neurons.
Activation functions should have the following properties:
\begin{itemize}
    \item \textbf{Non-linearity}: A linear activation function in a simple feed-forward network leads to a linear function. This means that no matter how many layers the network uses, there is an equivalent network with only the input and the output layer. Note that \glspl{CNN} which use max pooling are different, as max pooling is a non-linear operation.
    \item \textbf{Differentiability}: Activation functions need to be differentiable in order to apply gradient descent. They do not need to be differentiable at every point: in practice, the gradient at non-differentiable points can simply be set to zero in order to prevent weight updates at these points.
    \item \textbf{Non-zero gradient}: The sign function is not suitable for gradient descent based optimizers, as its gradient is zero at all differentiable points. An activation function should have infinitely many points with non-zero gradient.
\end{itemize}

One of the simplest and most widely used activation functions for \glspl{CNN} is \gls{ReLU}~\cite{AlexNet-2012}, but others such as \gls{ELU}~\cite{clevert2015fast}, \gls{PReLU}~\cite{he2015delving}, softplus~\cite{7280459} and softsign~\cite{bergstra2009quadratic} have been proposed. Activation functions differ in their range of values and their derivatives. The definitions of eleven activation functions and further comparisons are given in~\cref{table:activation-functions-overview}.

\section{Important Differences of Proposed Activation Functions}
Theoretical explanations for why one activation function is preferable to another in some scenarios include the following:
\begin{itemize}
    \item \textbf{Vanishing gradient}: Activation functions like tanh and the logistic function saturate outside of the interval $[-5, 5]$, which means that weight updates of preceding neurons become very small. This is especially a problem for very deep or recurrent networks, as described in~\cite{bengio1994learning}. Even if the neurons learn eventually, learning is slower~\cite{AlexNet-2012}. A short numerical illustration is given below.
    \item \textbf{Dying ReLU}: The dying \gls{ReLU} problem is similar to the vanishing gradient problem. The gradient of the \gls{ReLU} function is~0 for all non-positive values. Hence, if all elements of the training set lead to a negative input for one neuron at any point in the training process, this neuron no longer receives any updates and thus stops participating in the training process. This problem is addressed in~\cite{maas2013rectifier}.
    \item \textbf{Mean unit activation}: Some publications like~\cite{clevert2015fast,BatchNormalization-2015} claim that mean unit activations close to~0 are desirable, as they reduce the bias shift effect and thereby speed up learning. The speedup is supported by many experiments. Hence the possibility of negative activations is desirable.
\end{itemize}
These considerations are listed in~\cref{table:properties-of-activation-functions} for 11~activation functions.
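
To make the vanishing gradient argument concrete, consider the derivative of the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$:
\[\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},\]
where the maximum $\frac{1}{4}$ is attained at $x = 0$ and, for example, $\sigma'(5) \approx 6.6 \cdot 10^{-3}$. During backpropagation every logistic unit on a path multiplies the gradient by such a factor, so the gradient that reaches a layer which is $k$ logistic layers away from the output is scaled by at most $4^{-k}$ from the activations alone, ignoring the weight matrices.
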
Besides the theoretical properties, empirical results are provided in~\cref{table:CIFAR-100-accuracies-activation-functions,table:CIFAR-100-timing-activation-functions}. The baseline network was adjusted so that every activation function except the one of the output layer was replaced by one of the 11~activation functions. As expected, \gls{PReLU} and \gls{ELU} performed best. Unexpectedly, the logistic function, tanh and softplus performed worse than the identity, and it is unclear why the pure-softmax network performed so much better than the logistic function. One hypothesis why the logistic function performs so badly is that it cannot produce negative outputs. Hence the logistic$^-$ function was developed:
\[\text{logistic}^{-}(x) = \frac{1}{1+ e^{-x}} - 0.5\]
The logistic$^-$ function has the same derivative as the logistic function and hence still suffers from the vanishing gradient problem. The network with the logistic$^-$ function achieves an accuracy which is \SI{11.30}{\percent} better than the network with the logistic function, but is still \SI{5.54}{\percent} worse than the \gls{ELU} network. Similarly, \gls{ReLU} was adjusted so that it can produce negative outputs:
\[\text{ReLU}^{-}(x) = \max(-1, x) = \text{ReLU}(x+1) - 1\]
The results of \gls{ReLU}$^-$ are much worse on the training set, but similar on the test set. This indicates that the possibility of a hard zero, and thus a sparse representation, is either not important or about as important as the possibility to produce negative outputs. This contradicts~\cite{glorot2011deep,srivastava2014understanding}.

A key difference between the logistic$^-$ function and \gls{ELU} is that \gls{ELU} neither suffers from the vanishing gradient problem nor is its range of values bounded. For this reason, the S2ReLU activation function was developed:
\begin{align*}
    \StwoReLU(x) &= \ReLU \left (\frac{x}{2} + 1 \right ) - \ReLU \left (-\frac{x}{2} + 1 \right)\\
                 &= \begin{cases}\frac{x}{2} - 1 &\text{if } x \le -2\\
                                 x               &\text{if } -2 \le x \le 2\\
                                 \frac{x}{2} + 1 &\text{if } x \ge 2\end{cases}
\end{align*}
This function is similar to SReLUs as introduced in~\cite{jin2016deep}; the difference is that S2ReLU does not introduce learnable parameters. S2ReLU was designed to be symmetric, to be the identity close to zero and to have a smaller absolute value than the identity farther away from zero. It is easy to compute and easy to implement; a short implementation sketch is given below.

These results --- not only the absolute values, but also the relative comparison --- might depend on the network architecture, the training algorithm, the initialization and the dataset. Results for MNIST can be found in~\cref{table:MNIST-accuracies-activation-functions} and for HASYv2 in~\cref{table:HASYv2-accuracies-activation-functions}. For both datasets, the logistic function has a much shorter training time and a noticeably lower test accuracy.
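
The three modified activation functions are straightforward to implement. The following is a minimal sketch of how logistic$^-$, \gls{ReLU}$^-$ and S2ReLU could be defined with the Keras backend API; the function names, the toy model at the end, its layer sizes and the optimizer are illustrative only and do not correspond to the baseline architecture used in the experiments.
\begin{verbatim}
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense


def logistic_minus(x):
    # logistic^-(x) = 1 / (1 + exp(-x)) - 0.5: same derivative as the
    # logistic function, but able to produce negative activations.
    return K.sigmoid(x) - 0.5


def relu_minus(x):
    # ReLU^-(x) = max(-1, x) = ReLU(x + 1) - 1
    return K.relu(x + 1.0) - 1.0


def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1): identity on [-2, 2],
    # slope 1/2 outside, no learnable parameters.
    return K.relu(x / 2.0 + 1.0) - K.relu(-x / 2.0 + 1.0)


# Illustrative toy model only; the actual baseline CNN is defined elsewhere.
model = Sequential()
model.add(Dense(128, activation=s2relu, input_shape=(3072,)))
model.add(Dense(100, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
\end{verbatim}
Since all three functions are compositions of element-wise operations provided by the backend, their gradients are derived automatically and no custom backward pass has to be specified.
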
\glsunset{LReLU}
\begin{table}[H]
    \centering
    \begin{tabular}{lccc}
    \toprule
    \multirow{2}{*}{Function} & Vanishing & Negative activation & Bounded \\
                              & gradient  & possible            & activation \\\midrule
    Identity          & \cellcolor{green!25}No                & \cellcolor{green!25} Yes & \cellcolor{green!25}No \\
    Logistic          & \cellcolor{red!25} Yes                & \cellcolor{red!25} No    & \cellcolor{red!25} Yes \\
    Logistic$^-$      & \cellcolor{red!25} Yes                & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
    Softmax           & \cellcolor{red!25} Yes                & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
    Tanh              & \cellcolor{red!25} Yes                & \cellcolor{green!25} Yes & \cellcolor{red!25} Yes \\
    Softsign          & \cellcolor{red!25} Yes                & \cellcolor{green!25}Yes  & \cellcolor{red!25} Yes \\
    ReLU              & \cellcolor{yellow!25}Yes\footnotemark & \cellcolor{red!25} No    & \cellcolor{yellow!25}Half-sided \\
    Softplus          & \cellcolor{green!25}No                & \cellcolor{red!25} No    & \cellcolor{yellow!25}Half-sided \\
    S2ReLU            & \cellcolor{green!25}No                & \cellcolor{green!25}Yes  & \cellcolor{green!25} No \\
    \gls{LReLU}/PReLU & \cellcolor{green!25}No                & \cellcolor{green!25}Yes  & \cellcolor{green!25} No \\
    ELU               & \cellcolor{green!25}No                & \cellcolor{green!25}Yes  & \cellcolor{green!25} No \\
    \bottomrule
    \end{tabular}
    \caption[Activation function properties]{Properties of activation functions.}
    \label{table:properties-of-activation-functions}
\end{table}
\footnotetext{The dying ReLU problem is similar to the vanishing gradient problem.}
\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
    \toprule
    \multirow{2}{*}{Function} & \multicolumn{2}{c}{Inference time per} & Training & \multirow{2}{*}{Epochs} & Mean total \\\cline{2-3}
                 & 1 Image & 128 Images & time & & training time \\\midrule
    Identity     & \SI{8}{\milli\second}          & \SI{42}{\milli\second}          & \SI{31}{\second\per\epoch}          & 108 -- \textbf{148} & \SI{3629}{\second} \\
    Logistic     & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 167 & \textbf{\SI{2234}{\second}} \\
    Logistic$^-$ & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \textbf{\SI{22}{\second\per\epoch}} & 133 -- 255          & \SI{3421}{\second} \\
    Softmax      & \SI{7}{\milli\second}          & \SI{37}{\milli\second}          & \SI{33}{\second\per\epoch}          & 127 -- 248          & \SI{5250}{\second} \\
    Tanh         & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 125 -- 211          & \SI{3141}{\second} \\
    Softsign     & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 122 -- 205          & \SI{3505}{\second} \\
    \gls{ReLU}   & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 118 -- 192          & \SI{3449}{\second} \\
    Softplus     & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 165 & \SI{2718}{\second} \\
    S2ReLU       & \textbf{\SI{5}{\milli\second}} & \SI{32}{\milli\second}          & \SI{26}{\second\per\epoch}          & 108 -- 209          & \SI{3231}{\second} \\
    \gls{LReLU}  & \SI{7}{\milli\second}          & \SI{34}{\milli\second}          & \SI{25}{\second\per\epoch}          & 109 -- 198          & \SI{3388}{\second} \\
    \gls{PReLU}  & \SI{7}{\milli\second}          & \SI{34}{\milli\second}          & \SI{28}{\second\per\epoch}          & 131 -- 215          & \SI{3970}{\second} \\
    \gls{ELU}    & \SI{6}{\milli\second}          & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 146 -- 232          & \SI{3692}{\second} \\
    \bottomrule
    \end{tabular}
    \caption[Activation function timing results on CIFAR-100]{Training time and inference time of adjusted baseline models trained with different activation functions on GTX~970 \glspl{GPU} on CIFAR-100.
The identity was expected to be the fastest function, but it was not. This is likely an implementation-specific issue of Keras~2.0.4 or TensorFlow~1.1.0.}
    \label{table:CIFAR-100-timing-activation-functions}
\end{table}
\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
    \toprule
    \multirow{2}{*}{Function} & \multicolumn{2}{c}{Single model} & Ensemble & \multicolumn{2}{c}{Epochs}\\\cline{2-3}\cline{5-6}
                & Accuracy & std & Accuracy & Range & Mean \\\midrule
    Identity    & \SI{99.45}{\percent}          & $\sigma=0.09$              & \SI{99.63}{\percent}          & 55 -- \hphantom{0}77          & 62.2\\% TODO: Really?
    Logistic    & \SI{97.27}{\percent}          & $\sigma=2.10$              & \SI{99.48}{\percent}          & \textbf{37} -- \hphantom{0}76 & \textbf{54.5}\\
    Softmax     & \SI{99.60}{\percent}          & $\boldsymbol{\sigma=0.03}$ & \SI{99.63}{\percent}          & 44 -- \hphantom{0}73          & 55.6\\
    Tanh        & \SI{99.40}{\percent}          & $\sigma=0.09$              & \SI{99.57}{\percent}          & 56 -- \hphantom{0}80          & 67.6\\
    Softsign    & \SI{99.40}{\percent}          & $\sigma=0.08$              & \SI{99.57}{\percent}          & 72 -- 101                     & 84.0\\
    \gls{ReLU}  & \textbf{\SI{99.62}{\percent}} & $\boldsymbol{\sigma=0.04}$ & \textbf{\SI{99.73}{\percent}} & 51 -- \hphantom{0}94          & 71.7\\
    Softplus    & \SI{99.52}{\percent}          & $\sigma=0.05$              & \SI{99.62}{\percent}          & 62 -- \hphantom{0}\textbf{70} & 68.9\\
    \gls{PReLU} & \SI{99.57}{\percent}          & $\sigma=0.07$              & \textbf{\SI{99.73}{\percent}} & 44 -- \hphantom{0}89          & 71.2\\
    \gls{ELU}   & \SI{99.53}{\percent}          & $\sigma=0.06$              & \SI{99.58}{\percent}          & 45 -- 111                     & 72.5\\
    \bottomrule
    \end{tabular}
    \caption[Activation function evaluation results on MNIST]{Test accuracy of adjusted baseline models trained with different activation functions on MNIST.}
    \label{table:MNIST-accuracies-activation-functions}
\end{table}
\glsreset{LReLU}