%!TEX root = write-math-ba-paper.tex
\section{Algorithms}
\subsection{Preprocessing}\label{sec:preprocessing}
Preprocessing in symbol recognition is done to improve the quality and
expressive power of the data. It makes follow-up tasks like feature
extraction and classification easier, more effective or faster. It does so
by resolving errors in the input data, reducing duplicate information and
removing irrelevant information. Preprocessing algorithms fall into two
groups: normalization and noise reduction algorithms.

A very important normalization algorithm in single-symbol recognition is
\textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
its bounding box fits into a unit square. As the aspect ratio of a recording
is almost never 1:1, only one dimension will fit the unit square exactly.
For this paper, the recording is scaled such that its larger dimension spans
the interval $[0,1]$ of the $[0,1] \times [0,1]$ unit square. After that,
the recording is shifted along its smaller dimension such that its bounding
box is centered around zero.

Another normalization preprocessing algorithm is
resampling~\cite{Guyon91,Manke01}. As the data points on the pen trajectory
are generated asynchronously and with different time resolutions depending
on the hardware and software used, it is desirable to resample the
recordings so that the points are spread equally in time for every
recording. This was done by linear interpolation of the $(x,t)$ and $(y,t)$
sequences, sampling a fixed number of equally spaced points per stroke.

\textit{Stroke connection} is a noise reduction algorithm which is mentioned
in~\cite{Tappert90}. Sometimes the hardware detects that the user lifted the
pen although the user certainly did not do so. Such cases can be detected by
measuring the Euclidean distance between the end of one stroke and the
beginning of the next stroke. If this distance is below a threshold, the
strokes are connected.

Due to the limited resolution of the recording device and due to erratic
handwriting, the pen trajectory might not be smooth. One way to smooth it is
to replace each point by a weighted average of its own coordinates and the
coordinates of its neighbors. Another way is to reduce the stroke with the
Douglas-Peucker algorithm to the points that are most relevant for its
overall shape and then to interpolate the stroke between those points.

The Douglas-Peucker stroke simplification algorithm is usually used in
cartography to simplify the shape of roads. It works recursively to find a
subset of points of a stroke that is simpler and still similar to the
original shape. The algorithm adds the first and the last point $p_1$ and
$p_n$ of a stroke to the simplified set of points $S$. Then it searches for
the point $p_i$ in between that has the maximum distance from the line
$p_1 p_n$. If this distance is above a threshold $\varepsilon$, the point
$p_i$ is added to $S$ and the algorithm is applied recursively to the
segments $p_1 p_i$ and $p_i p_n$. It is described as \enquote{Algorithm 1}
in~\cite{Visvalingam1990}.
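The following Python sketch illustrates how these preprocessing steps could
be implemented. It is only an illustration under the assumption that every
stroke is given as a list of $(x, y, t)$ tuples; the function names, the
point format and the numerical guards are chosen for this sketch and are not
taken from a particular implementation.
\begin{verbatim}
import numpy as np

def scale_and_shift(strokes):
    """Fit the larger dimension of the bounding
    box into [0, 1]; center the smaller one
    around zero."""
    pts = np.array([(x, y) for s in strokes
                    for (x, y, t) in s])
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    size = hi - lo
    factor = 1.0 / max(size.max(), 1e-9)
    shift = -lo * factor        # larger dim -> [0, 1]
    small = int(np.argmin(size))
    shift[small] = -(lo[small] + hi[small]) / 2 * factor
    return [[(x * factor + shift[0],
              y * factor + shift[1], t)
             for (x, y, t) in s] for s in strokes]

def resample(stroke, n=20):
    """Linearly interpolate the (x, t) and (y, t)
    sequences to n points equally spaced in time."""
    x, y, t = (np.array(v, float)
               for v in zip(*stroke))
    t_new = np.linspace(t[0], t[-1], n)
    return list(zip(np.interp(t_new, t, x),
                    np.interp(t_new, t, y), t_new))

def douglas_peucker(stroke, epsilon):
    """Keep only the points that are relevant
    for the overall shape of the stroke."""
    if len(stroke) < 3:
        return list(stroke)
    (x1, y1), (xn, yn) = stroke[0][:2], stroke[-1][:2]
    norm = max(np.hypot(xn - x1, yn - y1), 1e-9)
    # perpendicular distance of inner points
    # to the line p1-pn
    d = [abs((xn - x1) * (y - y1)
             - (yn - y1) * (x - x1)) / norm
         for (x, y, *_) in stroke[1:-1]]
    i = int(np.argmax(d)) + 1
    if d[i - 1] > epsilon:
        left = douglas_peucker(stroke[:i + 1], epsilon)
        right = douglas_peucker(stroke[i:], epsilon)
        return left[:-1] + right
    return [stroke[0], stroke[-1]]
\end{verbatim}
In this sketch, \texttt{douglas\_peucker} returns only the simplified point
set; a smoothed stroke is then obtained by interpolating between those
points, as described above.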
\subsection{Features}\label{sec:features}
Features can be \textit{global}, that is, calculated for the complete
recording or for complete strokes. Other features are calculated for single
points on the pen trajectory and are called \textit{local}. Global features
are the \textit{number of strokes} in a recording, the \textit{aspect ratio}
of a recording's bounding box or the \textit{ink} used for a recording. The
ink feature is calculated as the combined length of all strokes. The
re-curvature, which was introduced in~\cite{Huang06}, is defined as
\[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
and is a stroke-global feature. The simplest local feature is the coordinate
of the point itself. Other local features are the speed, the curvature and a
local, small-resolution bitmap around the point, which was introduced by
Manke, Finke and Waibel in~\cite{Manke1995}.

\subsection{Multilayer Perceptrons}\label{sec:mlp-training}
\Glspl{MLP} are explained in detail in~\cite{Mitchell97}. They can have
different numbers of hidden layers; the number of neurons per layer and the
activation functions can be varied as well. The learning algorithm is
parameterized by the learning rate $\eta \in (0, \infty)$, the momentum
$\alpha \in [0, \infty)$ and the number of epochs.

The topology of \glspl{MLP} will be denoted in the following by separating
the number of neurons per layer with colons. For example, the notation
$160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
there are three hidden layers with 500~neurons per layer and one output
layer with 369~neurons.

The training of \glspl{MLP} can be executed in various ways, for example
with \acrfull{SLP}. In the case of a \gls{MLP} with the topology
$160{:}500{:}500{:}500{:}369$, \gls{SLP} works as follows: At first, a
\gls{MLP} with one hidden layer ($160{:}500{:}369$) is trained. Then the
output layer is discarded, a new hidden layer and a new output layer are
added, and the network is trained again, resulting in a
$160{:}500{:}500{:}369$ \gls{MLP}. Finally, the output layer is discarded
once more, another hidden layer and a new output layer are added, and the
training is executed a third time, yielding the
$160{:}500{:}500{:}500{:}369$ \gls{MLP}.

Denoising auto-encoders are another way of pretraining. An
\textit{auto-encoder} is a neural network that is trained to restore its
input, which means that the number of input neurons is equal to the number
of output neurons. The weights define an \textit{encoding} of the input that
allows restoring the input. As the neural network finds the encoding by
itself, it is called an auto-encoder. If the hidden layer is smaller than
the input layer, it can be used for dimensionality
reduction~\cite{Hinton1989}. If only one hidden layer with linear activation
functions is used, then the hidden layer contains the principal components
after training~\cite{Duda2001}.

Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that
is more robust to partial corruption of the input features. This robustness
is achieved by adding noise to the input features during training. There are
multiple ways in which noise can be added; Gaussian noise and randomly
masking elements with zero are two possibilities.
\cite{Deeplearning-Denoising-AE} describes how such a denoising auto-encoder
with masking noise can be implemented. The corruption $\varkappa \in [0, 1)$
is the probability of a feature being masked.
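As an illustration of the masking noise described above, the following
sketch corrupts a feature matrix with probability $\varkappa$. The function
name and the use of NumPy are assumptions made for this sketch and do not
refer to the implementation in~\cite{Deeplearning-Denoising-AE}.
\begin{verbatim}
import numpy as np

def mask_noise(features, kappa, rng=np.random):
    """Set every feature to zero with probability
    kappa; the auto-encoder is trained to restore
    the uncorrupted input from the corrupted one."""
    mask = rng.random_sample(features.shape) >= kappa
    return features * mask

# Example: a batch of 128 feature vectors with
# 160 features each, corrupted with kappa = 0.3.
# The clean matrix x stays the training target.
x = np.random.random_sample((128, 160))
x_corrupted = mask_noise(x, kappa=0.3)
\end{verbatim}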