Add papers/write-math-paper

2025-04-19 11:38:05 +02:00 · 2015-10-14 14:46:02 +02:00 · 2015-10-14 14:46:02 +02:00 · fe78311901
commit fe78311901
parent 7740f0147f
25 changed files with 10624 additions and 0 deletions
--- a/documents/papers/write-math-paper/Makefile
+++ b/documents/papers/write-math-paper/Makefile
@ -0,0 +1,15 @@
 DOKUMENT = write-math-ba-paper
 make:
 	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # aux-files for makeindex / makeglossaries
 	makeglossaries $(DOKUMENT)
 	bibtex $(DOKUMENT)
 	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include index
 	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
 	pdflatex -shell-escape $(DOKUMENT).tex -interaction=batchmode -output-format=pdf # include symbol table
 	make clean
 combine:
 	pdftk Dok1-open-kit.pdf KIT_SWP_Vorlage_Impressum_en_2015.pdf write-math-ba-paper.pdf KIT-WSP_RS_en.pdf cat output single-symbol-classification-paper.pdf
 clean:
 	rm -rf  $(TARGET) *.class *.html *.log *.aux *.out *.thm *.idx *.toc *.ind *.ilg figures/torus.tex *.glg *.glo *.gls *.ist *.xdy *.fdb_latexmk *.bak *.blg *.bbl *.glsdefs *.acn *.acr *.alg *.nls *.nlo *.bak *.pyg *.lot *.lof
--- a/documents/papers/write-math-paper/README.md
+++ b/documents/papers/write-math-paper/README.md
@ -0,0 +1,11 @@
 [Download compiled PDF](https://github.com/MartinThoma/write-math-paper/blob/master/write-math-ba-paper.pdf?raw=true)
 ## License
 This work is licensed under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/).
 ## Spell checking
 * Spell checking with aspell: `for f in ch*.tex; do aspell --lang=en --mode=tex check "$f"; done`
 * Writing-style checking with [Academic-Writing-Check](https://github.com/devd/Academic-Writing-Check): `for f in ch*.tex; do /home/moose/GitHub/Academic-Writing-Check/checkwriting "$f"; done`
 * Online spell checking with `http://www.reverso.net/spell-checker`
--- a/documents/papers/write-math-paper/abstract-500-chars.txt
+++ b/documents/papers/write-math-paper/abstract-500-chars.txt
@ -0,0 +1,17 @@
 Authors: Thoma, Martin; Kilgour, Kevin; Stüker, Sebastian; Waibel, Alexander
 Title: On-line Recognition of Handwritten Mathematical Symbols
 Institute: Institute for Anthropomatics and Robotics
 Abstract (max 500 characters):
 This paper presents a classification system which uses the pen trajectory to
 classify handwritten symbols. Five preprocessing steps, one data multiplication
 algorithm, five features and five variants for multilayer perceptron training
 were evaluated using $\num{166898}$ recordings. The evaluation results of
 21~experiments were used to create an optimized recognizer. Its improvement
 over the baseline was achieved by \acrlong{SLP} and adding new features.
 Keywords (max 5): recognition; machine learning; neural networks; symbols;
 multilayer perceptron
 Planned publication date: August 1, 2015
--- a/documents/papers/write-math-paper/baseline-1.csv
+++ b/documents/papers/write-math-paper/baseline-1.csv
--- a/documents/papers/write-math-paper/baseline-2-pretraining.csv
+++ b/documents/papers/write-math-paper/baseline-2-pretraining.csv
--- a/documents/papers/write-math-paper/baseline-2.csv
+++ b/documents/papers/write-math-paper/baseline-2.csv
--- a/documents/papers/write-math-paper/ch1-introduction.tex
+++ b/documents/papers/write-math-paper/ch1-introduction.tex
@ -0,0 +1,46 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Introduction}
 On-line recognition makes use of the pen trajectory. One possible
 representation of the data is given as groups of sequences of tuples $(x, y, t)
 \in \mathbb{R}^3$, where each group represents a stroke, $(x, y)$ is the
 position of the pen on a canvas and $t$ is the time.
 % On-line data was used to classify handwritten natural language text in many
 % different variants. For example, the $\text{NPen}^{++}$ system classified
 % cursive handwriting into English words by using hidden Markov models and neural
 % networks~\cite{Manke1995}.
 % Several systems for mathematical symbol recognition with on-line data have been
 % described so far~\cite{Kosmala98,Mouchere2013}, but no standard test set
 % existed to compare the results of different classifiers for single-symbol
 % classification of mathematical symbols. The used symbols differed in most
 % papers. This is unfortunate as the choice of symbols is crucial for the top-$n$
 % error. For example, the symbols $o$, $O$, $\circ$ and $0$ are very similar and
 % systems which know all those classes will certainly have a higher top-$n$ error
 % than systems which only accept one of them. But not only the classes differed,
 % also the used data to train and test had to be collected by each author again.
 \cite{Kirsch}~describes a system called Detexify which uses
 time warping to classify on-line handwritten symbols and reports a top-3 error
 of less than $\SI{10}{\percent}$ for a set of $\num{100}$~symbols. He also
 recently published his data on \url{https://github.com/kirel/detexify-data},
 which was collected by a crowdsourcing approach via
 \url{http://detexify.kirelabs.org}. Those recordings as well as some recordings
 which were collected by a similar approach via \url{http://write-math.com} were
 merged into a single data set. The labels were semi-automatically checked for
 correctness and the data was used to train and evaluate different classifiers. A more
 detailed description of all used software, data and experiments is given
 in~\cite{Thoma:2014}.
 In this paper we present a baseline system for the classification of on-line
 handwriting into $369$ classes of which some are very similar. An optimized
 classifier was developed which has a $\SI{29.7}{\percent}$ relative improvement
 of the top-3 error. This was achieved by using better features and \gls{SLP}.
 The absolute improvements compared to the baseline of those changes will also
 be shown.
 In the following, we will give a general overview of the system design, give
 information about the used data and implementation, describe the algorithms
 we used to classify the data, report results of our experiments and present
 the optimized recognizer we created.
--- a/documents/papers/write-math-paper/ch2-general-system-design.tex
+++ b/documents/papers/write-math-paper/ch2-general-system-design.tex
@ -0,0 +1,36 @@
 %!TEX root = write-math-ba-paper.tex
 \section{General System Design}
 The following steps are used for symbol classification:\nobreak
 \begin{enumerate}
    \item \textbf{Preprocessing}: Recorded data is never perfect. Devices have
          errors and people make mistakes while using the devices. To tackle
          these problems there are preprocessing algorithms to clean the data.
          The preprocessing algorithms can also remove unnecessary variations
          of the data that do not help in the classification process, but hide
          what is important. Having slightly different sizes of the same symbol
          is an example of such a variation. Four preprocessing algorithms that
          clean or normalize recordings are explained in
          \cref{sec:preprocessing}.
    \item \textbf{Data multiplication}: Learning systems need lots of data
          to learn internal parameters. If there is not enough data available,
          domain knowledge can be considered to create new artificial data from
          the original data. In the domain of on-line handwriting recognition,
          data can be multiplied by adding rotated variants.
    \item \textbf{Feature extraction}: A feature is high-level information
          derived from the raw data after preprocessing. Some systems like
          Detexify take the result of the preprocessing step, but many compute
          new features. Those features can be designed by a human engineer or
          learned. Non-raw data features have the advantage that less
          training data is needed since the developer uses knowledge about
          handwriting to compute highly discriminative features. Various
          features are explained in \cref{sec:features}.
 \end{enumerate}
 After these steps, what remains is a classification task for which the classifier has to
 learn internal parameters before it can classify new recordings. We classified
 recordings by computing constant-sized feature vectors and using
 \glspl{MLP}. There are many ways to adjust \glspl{MLP} (number of neurons and
 layers, activation functions) and their training (learning rate, momentum,
 error function). Some of them are described in~\cref{sec:mlp-training} and the
 evaluation results are presented in \cref{ch:Optimization-of-System-Design}.
--- a/documents/papers/write-math-paper/ch3-data-and-implementation.tex
+++ b/documents/papers/write-math-paper/ch3-data-and-implementation.tex
@ -0,0 +1,20 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Data and Implementation}
 We used $\num{369}$ symbol classes with a total of $\num{166898}$ labeled
 recordings. Each class has at least $\num{50}$ labeled recordings, but over
 $200$ symbols have more than $\num{200}$ labeled recordings and over $100$
 symbols have more than $500$ labeled recordings.
 The data was collected by two crowd-sourcing projects (Detexify and
 \href{http://write-math.com}{write-math.com}) where users wrote
 symbols, were then given a list ordered by an early classification system and
 clicked on the symbol they wrote.
 The data of Detexify and \href{http://write-math.com}{write-math.com} was
 combined, filtered semi-automatically and can be downloaded via
 \href{http://write-math.com/data}{write-math.com/data} as a compressed tar
 archive of CSV files.
 All of the following preprocessing and feature computation algorithms were
 implemented and are publicly available as open-source software in the Python
 package \texttt{hwrt}.
--- a/documents/papers/write-math-paper/ch4-algorithms.tex
+++ b/documents/papers/write-math-paper/ch4-algorithms.tex
@ -0,0 +1,113 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Algorithms}
 \subsection{Preprocessing}\label{sec:preprocessing}
 Preprocessing in symbol recognition is done to improve the quality and
 expressive power of the data. It makes follow-up tasks like feature extraction
 and classification easier, more effective or faster. It does so by resolving
 errors in the input data, reducing duplicate information and removing
 irrelevant information.
 Preprocessing algorithms fall into two groups: Normalization and noise
 reduction algorithms.
 A very important normalization algorithm in single-symbol recognition is
 \textit{scale-and-shift}~\cite{Thoma:2014}. It scales the recording so that
 its bounding box fits into a unit square. As the aspect ratio of a recording is
 almost never 1:1, only one dimension will fit exactly in the unit square. For
 this paper, the recording is first shifted in the direction of its bigger
 dimension so that it lies in the $[0,1] \times [0,1]$ unit square. After that, the recording
 is shifted in the direction of its smaller dimension such that its bounding box is
 centered around zero.
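The scale-and-shift step can be sketched in a few lines of Python. This is only an illustration under assumptions, not the \texttt{hwrt} implementation: the function name `scale_and_shift` is hypothetical, a recording is assumed to be a list of strokes, each a list of `(x, y, t)` tuples, and the bigger dimension is assumed to start at 0 while the smaller one is centered around 0.

```python
def scale_and_shift(strokes):
    """Scale a recording into the unit square while keeping its aspect ratio.

    `strokes` is a list of strokes; each stroke is a list of (x, y, t) tuples.
    """
    xs = [x for stroke in strokes for (x, _, _) in stroke]
    ys = [y for stroke in strokes for (_, y, _) in stroke]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    biggest = max(width, height)
    # A recording that consists of a single point has no extent.
    factor = 1.0 / biggest if biggest > 0 else 1.0
    # Assumption: the bigger dimension covers [0, 1]; the bounding box of
    # the smaller dimension is centered around 0.
    if width >= height:
        off_x, off_y = 0.0, -height * factor / 2.0
    else:
        off_x, off_y = -width * factor / 2.0, 0.0
    return [[((x - min_x) * factor + off_x, (y - min_y) * factor + off_y, t)
             for (x, y, t) in stroke]
            for stroke in strokes]
```

For a recording that is 20 units wide and 10 units high, the $x$ coordinates end up in $[0, 1]$ and the $y$ coordinates in $[-0.25, 0.25]$.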
 Another normalization preprocessing algorithm is
 resampling~\cite{Guyon91,Manke01}. As the data points on the pen trajectory are
 generated asynchronously and with different time-resolutions depending on the
 used hardware and software, it is desirable to resample the recordings to have
 points spread equally in time for every recording. This was done by linear
 interpolation of the $(x,t)$ and $(y,t)$ sequences and getting a fixed number
 of equally spaced points per stroke.
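A minimal sketch of this resampling step, assuming a stroke is a list of `(x, y, t)` tuples sorted by time and that at least two points are requested. The name `resample_stroke` is hypothetical and the code is not taken from \texttt{hwrt}.

```python
def resample_stroke(stroke, n_points=20):
    """Linearly interpolate a stroke at n_points equally spaced times."""
    if len(stroke) == 1:
        # A single point cannot be interpolated; repeat it instead.
        return [stroke[0]] * n_points
    t_start, t_end = stroke[0][2], stroke[-1][2]
    resampled = []
    j = 0  # index of the segment (stroke[j], stroke[j + 1]) that contains t
    for k in range(n_points):
        t = t_start + (t_end - t_start) * k / (n_points - 1)
        while j < len(stroke) - 2 and stroke[j + 1][2] < t:
            j += 1
        (x0, y0, t0), (x1, y1, t1) = stroke[j], stroke[j + 1]
        # Guard against duplicate timestamps within a segment.
        w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
        resampled.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0), t))
    return resampled
```

Both the $(x,t)$ and the $(y,t)$ sequence are interpolated with the same weight `w`, so the resampled points lie on the original polyline.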
 \textit{Stroke connection} is a noise reduction algorithm which is mentioned
 in~\cite{Tappert90}. Sometimes the hardware detects that the
 user lifted the pen although the user certainly did not do so. This can be detected
 by measuring the Euclidean distance between the end of one stroke and the
 beginning of the next stroke. If this distance is below a threshold, then the
 strokes are connected.
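The stroke connection rule described above could look as follows. This is a sketch with a hypothetical function name `connect_strokes`; strokes are assumed to be non-empty lists of `(x, y, t)` tuples, and the threshold is given in the same unit as the coordinates.

```python
import math


def connect_strokes(strokes, threshold=10.0):
    """Merge consecutive strokes whose gap is smaller than `threshold`."""
    if not strokes:
        return []
    connected = [list(strokes[0])]
    for stroke in strokes[1:]:
        (x0, y0, _), (x1, y1, _) = connected[-1][-1], stroke[0]
        # Euclidean distance between the end of the previous stroke and
        # the beginning of the current one.
        if math.hypot(x1 - x0, y1 - y0) < threshold:
            connected[-1].extend(stroke)
        else:
            connected.append(list(stroke))
    return connected
```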
 Due to a limited resolution of the recording device and due to erratic
 handwriting, the pen trajectory might not be smooth. One way to smooth it is
 to replace each point by the weighted average of
 its coordinates and its neighbors' coordinates. Another way to do smoothing
 is to reduce the number of points with the Douglas-Peucker
 algorithm to the points that are more relevant for the
 overall shape of a stroke and then interpolate the stroke between those points.
 The Douglas-Peucker stroke simplification algorithm is usually used in
 cartography to simplify the shape of roads. It works recursively to find a
 subset of points of a stroke that is simpler and still similar to the original
 shape. The algorithm adds the first and the last point $p_1$ and $p_n$ of a
 stroke to the simplified set of points $S$. Then it searches the point $p_i$ in
 between that has maximum distance from the line $p_1 p_n$. If this distance is
 above a threshold $\varepsilon$, the point $p_i$ is added to $S$. Then the
 algorithm gets applied to $p_1 p_i$ and $p_i p_n$ recursively. It is described
 as \enquote{Algorithm 1} in~\cite{Visvalingam1990}.
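The recursion can be written down compactly. The following sketch follows the description above; the function names are hypothetical, and points are assumed to be tuples whose first two entries are the $x$ and $y$ coordinates.

```python
import math


def point_line_distance(p, a, b):
    """Distance of point p from the line through a and b."""
    length = math.hypot(b[0] - a[0], b[1] - a[1])
    if length == 0:
        # a and b coincide, so fall back to the point-to-point distance.
        return math.hypot(p[0] - a[0], p[1] - a[1])
    cross = (b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1])
    return abs(cross) / length


def douglas_peucker(points, epsilon):
    """Return the simplified subset S of `points`, in original order."""
    if len(points) < 3:
        return list(points)
    distances = [point_line_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    i = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[i - 1] <= epsilon:
        # No point in between is far enough away: keep only the end points.
        return [points[0], points[-1]]
    left = douglas_peucker(points[:i + 1], epsilon)
    right = douglas_peucker(points[i:], epsilon)
    return left[:-1] + right  # p_i is contained in both halves
```

For an L-shaped stroke with slightly jittered points on both arms, only the two end points and the corner remain.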
 \subsection{Features}\label{sec:features}
 Features can be \textit{global}, that means calculated for the complete
 recording or complete strokes. Other features are calculated for single points
 on the pen trajectory and are called \textit{local}.
 Global features are the \textit{number of strokes} in a recording, the
 \textit{aspect ratio} of a recording's bounding box or the
 \textit{ink} being used for a recording. The ink feature gets calculated by
 measuring the length of all strokes combined. The re-curvature, which was
 introduced in~\cite{Huang06}, is defined as
 \[\text{re-curvature}(stroke) := \frac{\text{height}(stroke)}{\text{length}(stroke)}\]
 and is a stroke-global feature.
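As an illustration, the ink and re-curvature features can be computed as sketched below. The function names are hypothetical, the height of a stroke is assumed to be the height of its bounding box, and a stroke without any length is assumed to have a re-curvature of 0.

```python
import math


def stroke_length(stroke):
    """Length of the polyline through the (x, y, t) points of a stroke."""
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0, _), (x1, y1, _) in zip(stroke, stroke[1:]))


def ink(strokes):
    """Summed length of all strokes of a recording."""
    return sum(stroke_length(stroke) for stroke in strokes)


def re_curvature(stroke):
    """Height of a stroke divided by its length."""
    ys = [y for (_, y, _) in stroke]
    length = stroke_length(stroke)
    # Assumption: degenerate strokes (a single point) get the value 0.
    return (max(ys) - min(ys)) / length if length > 0 else 0.0
```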
 The simplest local feature is the coordinate of the point itself. Speed,
 curvature and a local small-resolution bitmap around the point, which was
 introduced by Manke, Finke and Waibel in~\cite{Manke1995}, are other local
 features.
 \subsection{Multilayer Perceptrons}\label{sec:mlp-training}
 \Glspl{MLP} are explained in detail in~\cite{Mitchell97}. The number of
 hidden layers, the number of neurons per layer and the
 activation functions can be varied. The learning algorithm is parameterized by
 the learning rate $\eta \in (0, \infty)$, the momentum $\alpha \in [0, \infty)$
 and the number of epochs.
 The topology of \glspl{MLP} will be denoted in the following by separating the
 number of neurons per layer with colons. For example, the notation
 $160{:}500{:}500{:}500{:}369$ means that the input layer gets 160~features,
 there are three hidden layers with 500~neurons per layer and one output layer
 with 369~neurons.
 \Gls{MLP} training can be executed in various ways, for example
 with \acrfull{SLP}. In case of a \gls{MLP} with the topology
 $160{:}500{:}500{:}500{:}369$, \gls{SLP} works as follows: At first a \gls{MLP}
 with one hidden layer ($160{:}500{:}369$) is trained. Then the output layer is
 discarded, a new hidden layer and a new output layer are added and it is trained
 again, resulting in a $160{:}500{:}500{:}369$ \gls{MLP}. The output layer is
 discarded again, another hidden layer and a new output layer are added
 and the training is executed again.
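The sequence of networks that this procedure trains can be listed with a small helper. The sketch only enumerates the topologies and does not train anything; the name `slp_topologies` is hypothetical.

```python
def slp_topologies(topology):
    """List the topologies that are trained one after another.

    `topology` is the final topology as a list, e.g. [160, 500, 500, 500, 369].
    """
    n_in, hidden, n_out = topology[0], topology[1:-1], topology[-1]
    # Stage k keeps the first k hidden layers and gets a fresh output layer.
    return [[n_in] + hidden[:k] + [n_out] for k in range(1, len(hidden) + 1)]


for stage in slp_topologies([160, 500, 500, 500, 369]):
    print(":".join(str(n) for n in stage))
```

For the example topology this prints `160:500:369`, `160:500:500:369` and `160:500:500:500:369`.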
 Denoising auto-encoders are another way of pretraining. An
 \textit{auto-encoder} is a neural network that is trained to restore its input.
 This means the number of input neurons is equal to the number of output
 neurons. The weights define an \textit{encoding} of the input that allows
 restoring the input. As the neural network finds the encoding by itself, it is
 called auto-encoder. If the hidden layer is smaller than the input layer, it
 can be used for dimensionality reduction~\cite{Hinton1989}. If only one hidden
 layer with linear activation functions is used, then the hidden layer contains
 the principal components after training~\cite{Duda2001}.
 Denoising auto-encoders are a variant introduced in~\cite{Vincent2008} that
 is more robust to partial corruption of the input features. They are trained to
 become robust by adding noise to the input features.
 There are multiple ways to add noise. Gaussian noise and randomly
 masking elements with zero are two possibilities.
 \cite{Deeplearning-Denoising-AE} describes how such a denoising auto-encoder
 with masking noise can be implemented. The corruption $\varkappa \in [0, 1)$ is
 the probability of a feature being masked.
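Masking noise itself is simple to write down. The following sketch shows only the corruption of one feature vector, not the auto-encoder training; the function name `mask_features` and the optional `rng` argument are assumptions made for this illustration.

```python
import random


def mask_features(features, corruption=0.3, rng=None):
    """Set each feature to zero with probability `corruption`."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < corruption else value
            for value in features]
```

The auto-encoder receives the masked vector as input and is trained to restore the unmasked one.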
--- a/documents/papers/write-math-paper/ch5-optimization-of-system-design.tex
+++ b/documents/papers/write-math-paper/ch5-optimization-of-system-design.tex
@ -0,0 +1,214 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
 In order to evaluate the effect of different preprocessing algorithms, features
 and adjustments in the \gls{MLP} training and topology, the following baseline
 system was used:
 Scale the recording to fit into a unit square while keeping the aspect ratio,
 shift it as described in \cref{sec:preprocessing},
 resample it with linear interpolation to get 20~points per stroke, spaced
 evenly in time. Take the first 4~strokes with 20~points per stroke and
 2~coordinates per point as features, resulting in 160~features which is equal
 to the number of input neurons. If a recording has fewer than 4~strokes, the
 remaining features are filled with zeroes.
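The construction of the 160-dimensional input vector can be sketched as follows, assuming the recording was already scaled, shifted and resampled so that every stroke has exactly 20~points. The function name `baseline_features` is hypothetical.

```python
def baseline_features(strokes, n_strokes=4, n_points=20):
    """Flatten the first strokes into a constant-sized feature vector."""
    features = []
    for stroke in strokes[:n_strokes]:
        for (x, y, _) in stroke[:n_points]:
            features.extend([x, y])
    # Recordings with fewer strokes are padded with zeroes.
    size = n_strokes * n_points * 2
    return features + [0.0] * (size - len(features))
```

With the default values the vector has $4 \cdot 20 \cdot 2 = 160$ entries, independent of the number of strokes in the recording.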
 All experiments were evaluated with four baseline systems $B_{hl=i}$, $i \in \Set{1,
 2, 3, 4}$, where $i$ is the number of hidden layers as different topologies
 could have a severe influence on the effect of new features or preprocessing
 steps. Each hidden layer in all evaluated systems has $500$ neurons.
 Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
 of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
 the sigmoid function. The neurons in the
 output layer use the softmax function. For every experiment, exactly one part
 of the baseline systems was changed.
 \subsection{Random Weight Initialization}
 The weights of the neural networks in all experiments were initialized with small random
 values
 \[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}, 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
 where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
 of neuron $i$, and $n_i$ is the number of neurons in layer $i$. This random
 initialization was suggested in
 \cite{deeplearningweights} and is done to break symmetry.
 This can lead to different error rates for the same systems just because the
 initialization was different.
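To give a feeling for the size of the initial weights, the interval can be evaluated for a concrete pair of layers. The sketch below uses only the Python standard library; the function names are hypothetical and the weight matrix is a plain list of lists.

```python
import math
import random


def init_bound(n_in, n_out):
    """Upper bound of the uniform distribution used for initialization."""
    return 4.0 * math.sqrt(6.0 / (n_in + n_out))


def init_weights(n_in, n_out, rng=None):
    """Draw an n_in x n_out weight matrix from U(-bound, bound)."""
    rng = rng or random.Random()
    bound = init_bound(n_in, n_out)
    return [[rng.uniform(-bound, bound) for _ in range(n_out)]
            for _ in range(n_in)]
```

Between an input layer with 160~neurons and a hidden layer with 500~neurons, `init_bound(160, 500)` is about $0.381$, so all initial weights of that layer lie in roughly $[-0.381, 0.381]$.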
 In order to get an impression of the magnitude of this influence on the different
 topologies and error rates, the baseline models were trained 5~times with
 random initializations.
 \Cref{table:baseline-systems-random-initializations-summary}
 shows a summary of the results. The more hidden layers are used, the more
 the results vary between different random weight initializations.
 \begin{table}[h]
    \centering
    \begin{tabular}{crrr|rrr} %chktex 44
    \toprule
    \multirow{3}{*}{System}  & \multicolumn{6}{c}{Classification error}\\
    \cmidrule(l){2-7}
               & \multicolumn{3}{c}{Top-1}   & \multicolumn{3}{c}{Top-3}\\
               & Min                   & Max                   & Mean                  & Min                  & Max                  & Mean\\\midrule
    $B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
    $B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$}& \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
    $B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
    $B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
             trained and evaluated 5~times to estimate the influence of random
             weight initialization.}
 \label{table:baseline-systems-random-initializations-summary}
 \end{table}
 \subsection{Stroke connection}
 In order to solve the problem of interrupted strokes, pairs of strokes
 can be connected with the stroke connection algorithm. The idea is that for
 a pair of consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$ is
 close to the first point of $s_{i+1}$ if a stroke was accidentally split
 into two strokes.
 $\SI{59}{\percent}$ of all stroke pair distances in the collected data are
 between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connection
 algorithm was evaluated with $\SI{5}{\pixel}$, $\SI{10}{\pixel}$ and
 $\SI{20}{\pixel}$.
 The top-3 error of all models improved with a threshold of $\theta = \SI{10}{\pixel}$
 by at least $\num{0.2}$ percentage points, except for $B_{hl=4}$, which did not notably
 improve.
 \subsection{Douglas-Peucker Smoothing}
 The Douglas-Peucker algorithm was applied with a threshold of $\varepsilon =
 0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
 but before resampling. The interpolation in the resampling step was done
 linearly and with cubic splines in two experiments. The recording was scaled
 and shifted again after the interpolation because the bounding box might have
 changed.
 Applying the Douglas-Peucker smoothing with $\varepsilon > 0.05$ led to a
 large increase of the top-1 and top-3 error for all models $B_{hl=i}$.
 This means that the simplification process removes some relevant information and
 does not, as was expected, remove only noise. For $\varepsilon = 0.05$
 with linear interpolation the top-1 error of some models improved, but the
 changes were small. They could be an effect of random weight initialization.
 However, cubic spline interpolation made all systems perform more than
 $\num{1.7}$ percentage points worse for top-1 and top-3 error.
 The lower the value of $\varepsilon$, the less the recording changes after
 this preprocessing step. As it was applied after scaling the recording such that
 the biggest dimension of the recording (width or height) is $1$, a value of
 $\varepsilon = 0.05$ means that an intermediate point is only kept if its
 distance from the connecting line is more than $\SI{5}{\percent}$
 of the biggest dimension.
 \subsection{Global Features}
 Single global features were added one at a time to the baseline systems. Those
 features were re-curvature
 $\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
 as described in \cite{Huang06}, the ink feature which is the summed length
 of all strokes, the stroke count, the aspect ratio and the stroke center points
 for the first four strokes. The stroke center point feature improved the system
 $B_{hl=1}$ by $\num{0.3}$~percentage points for the top-3 error and system $B_{hl=3}$ for
 the top-1 error by $\num{0.7}$~percentage points, but all other systems and
 error measures either got worse or did not improve much.
 The other global features did improve the systems $B_{hl=1}$ -- $B_{hl=3}$, but not
 $B_{hl=4}$. The highest improvement was achieved with the re-curvature feature. It
 improved the systems $B_{hl=1}$ -- $B_{hl=4}$ by more than $\num{0.6}$~percentage points
 top-1 error.
 \subsection{Data Multiplication}
 Data multiplication can be used to make the model invariant to transformations.
 However, this idea seems not to work well in the domain of on-line handwritten
 mathematical symbols. We tripled the data by adding a version that is rotated
 3~degrees to the left and another one that is rotated 3~degrees to the right
 around the center of mass. This data multiplication made all classifiers perform
 worse for most error measures, by more than $\num{2}$~percentage points for
 the top-1 error.
 The same experiment was executed by rotating by 6~degrees and in another
 experiment by 9~degrees, but those performed even worse.
 Multiplying the data by a factor of 5 by adding two 3-degree rotated
 variants and two 6-degree rotated variants also made the classifiers perform worse
 by more than $\num{2}$~percentage points.
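The rotation used for data multiplication can be sketched as follows. The function names are hypothetical, the center of mass is assumed to be the mean of all points of the recording, and whether a positive angle appears as a rotation to the left or to the right depends on the orientation of the $y$ axis of the canvas.

```python
import math


def rotate_recording(strokes, degrees):
    """Rotate all points of a recording around its center of mass."""
    points = [p for stroke in strokes for p in stroke]
    # Assumption: the center of mass is the mean of all points.
    cx = sum(x for (x, _, _) in points) / len(points)
    cy = sum(y for (_, y, _) in points) / len(points)
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    return [[(cx + (x - cx) * cos_a - (y - cy) * sin_a,
              cy + (x - cx) * sin_a + (y - cy) * cos_a,
              t)
             for (x, y, t) in stroke]
            for stroke in strokes]


def triple_recording(strokes, degrees=3.0):
    """Return the original recording and two slightly rotated variants."""
    return [rotate_recording(strokes, -degrees),
            strokes,
            rotate_recording(strokes, degrees)]
```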
 \subsection{Pretraining}\label{subsec:pretraining-evaluation}
 Pretraining is a technique used to improve the training of \glspl{MLP} with
 multiple hidden layers.
 \Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
 performance by $\num{1.6}$ percentage points for the top-1 error and
 $\num{1.0}$ percentage points for the top-3 error. As one can see in
 \cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
 is not only the case because of the longer training as the test error is
 relatively stable after $\num{1000}$ epochs of training. This was confirmed
 by an experiment where the baseline systems were trained for $\num{10000}$
 epochs and did not perform notably differently.
 \begin{figure}[htb]
    \centering
    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
    \caption{Training- and test error by number of trained epochs for different
             topologies with \acrfull{SLP}. The plot shows
             that all pretrained systems performed much better than the systems
             without pretraining. None of the plotted systems improved
             with more epochs of training.}
 \label{fig:training-and-test-error-for-different-topologies-pretraining}
 \end{figure}
 \begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
                & Top-1                  & Change               & Top-3                & Change                 \\\midrule
    $B_{hl=1}$     & $\SI{23.2}{\percent}$  & -                    & $\SI{6.7}{\percent}$ & - \\
    $B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
    $B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
    $B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with 1--4 hidden layers which used \acrfull{SLP}
             compared to the mean of systems $B_{hl=1}$--$B_{hl=4}$ displayed
             in \cref{table:baseline-systems-random-initializations-summary}
             which used pure gradient descent. The \gls{SLP}
             systems clearly performed better.}
 \label{table:pretraining-slp}
 \end{table}
 Pretraining with denoising auto-encoders led to the much worse results listed in
 \cref{table:pretraining-denoising-auto-encoder}. The first layer used a $\tanh$
 activation function. Every layer was trained for $1000$ epochs with the
 \gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
 $\varkappa = 0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were
 chosen. This pretraining setup made all systems perform
 much worse for all error measures.
 \begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
                 & Top-1                  & Change               & Top-3                & Change                 \\\midrule
    $B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
    $B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
    $B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
    $B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with denoising \acrfull{AEP} compared to pure
             gradient descent. The \gls{AEP} systems performed worse.}
 \label{table:pretraining-denoising-auto-encoder}
 \end{table}
--- a/documents/papers/write-math-paper/ch6-summary.tex
+++ b/documents/papers/write-math-paper/ch6-summary.tex
@ -0,0 +1,123 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Summary}
 Four baseline recognition systems were adjusted in many experiments and their
 recognition capabilities were compared in order to build a recognition system
 that can recognize 369 mathematical symbols with low error rates as well as to
 evaluate which preprocessing steps and features help to improve the recognition
 rate.
 All recognition systems were trained and evaluated with
 $\num{\totalCollectedRecordings{}}$ recordings for \totalClassesAnalyzed{}
 symbols. These recordings were collected by two crowdsourcing projects
 (\href{http://detexify.kirelabs.org/classify.html}{Detexify} and
 \href{http://write-math.com}{write-math.com}) and created with various devices. While
 some recordings were created with standard touch devices such as tablets and
 smartphones, others were created with the mouse.
 \Glspl{MLP} were used for the classification task. Four baseline systems with
 different numbers of hidden layers were used, as the number of hidden layers
 influences the capabilities and problems of \glspl{MLP}.
 All baseline systems used the same preprocessing queue. The recordings were
 scaled and shifted as described in \cref{sec:preprocessing} and resampled with
 linear interpolation so that every stroke had exactly 20~points which are
 spread equidistantly in time. The 80~($x,y$) coordinates of the first 4~strokes
 were used to get exactly $160$ input features for every recording. The baseline
 system $B_{hl=2}$ has a top-3 error of $\SI{5.7}{\percent}$.
 Adding two slightly rotated variants for each recording and hence tripling the
 training set made the systems $B_{hl=3}$ and $B_{hl=4}$ perform much worse, but
 improved the performance of the smaller systems.
 The global features re-curvature, ink, stroke count and aspect ratio improved
 the systems $B_{hl=1}$--$B_{hl=3}$, whereas the stroke center point feature
 made $B_{hl=2}$ perform worse.
 Denoising auto-encoders were evaluated as one way to use pretraining, but
 they increased the error rate notably. However, \acrlong{SLP} improved the
 performance decidedly.
 The stroke connection algorithm was added to the preprocessing steps of the
 baseline system as well as the re-curvature feature, the ink feature, the
 number of strokes and the aspect ratio. The training setup of the baseline
 system was changed to \acrlong{SLP} and the resulting model was trained with a
 lower learning rate again. This optimized recognizer $B_{hl=2,c}'$ had a top-3
 error of $\SI{4.0}{\percent}$. This means that the top-3 error dropped by over
 $\num{1.7}$ percentage points in comparison to the baseline system $B_{hl=2}$.
 A top-3 error of $\SI{4.0}{\percent}$ makes the system usable for symbol
 lookup. It could also be used as a starting point for the development of a
 multiple-symbol classifier.
 The aim of this work was to develop a symbol recognition system which is easy
 to use, fast and has high recognition rates as well as evaluating ideas for
 single symbol classifiers. Some of those goals were reached. The recognition
 system $B_{hl=2,c}'$ evaluates new recordings in a fraction of a second and has
 acceptable recognition rates.
 % Many algorithms were evaluated. However, there are still many other
 % algorithms which could be evaluated and, at the time of this work, the best
 % classifier $B_{hl=2,c}'$ is only available through the Python package
 % \texttt{hwrt}. It is planned to add an web version of that classifier online.
 \section{Optimized Recognizer}
 All preprocessing steps and features that proved useful were combined to create
 the best-performing recognizer.
 All resulting models performed much better than every system evaluated before.
 The results of this experiment show that single-symbol recognition with
 \totalClassesAnalyzed{} classes, recorded with common touch devices and the
 mouse, can be done with a top-1 error rate of $\SI{18.6}{\percent}$ and a top-3
 error of $\SI{4.1}{\percent}$. This was
 achieved by a \gls{MLP} with a $167{:}500{:}500{:}\totalClassesAnalyzed{}$ topology.
 It used the stroke connection algorithm to connect strokes whose ends were less
 than $\SI{10}{\pixel}$ apart, scaled each recording to a unit square and shifted
 it as described in \cref{sec:preprocessing}. After that, a linear resampling step
 was applied to the first 4 strokes to resample them to 20 points each. All
 other strokes were discarded.
 \goodbreak
 The 167 features were\mynobreakpar%
 \begin{itemize}
     \item the first 4 strokes with 20 points per stroke resulting in 160
           features,
     \item the re-curvature for the first 4 strokes,
     \item the ink,
     \item the number of strokes and
     \item the aspect ratio of the bounding box.
 \end{itemize}
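 As a plausibility check of the feature count, the individual contributions can
 be added up. This check assumes that every resampled point contributes both
 its $x$ and its $y$ coordinate, which is inferred from the 160 features stated
 for $4 \cdot 20$ points. The terms are, in this order, the point
 coordinates, the re-curvature values, the ink, the number of strokes and the
 aspect ratio:
 \[
     4 \cdot 20 \cdot 2 + 4 + 1 + 1 + 1 = 160 + 7 = 167
 \]
 Under this assumption the sum matches the 167 input neurons of the topology.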
 \Gls{SLP} was applied with $\num{1000}$ epochs per layer, a
 learning rate of $\eta=0.1$ and a momentum of $\alpha=0.1$. After that, the
 complete model was trained again for $\num{1000}$ epochs with standard mini-batch
 gradient descent, resulting in the systems $B_{hl=1,c}$ -- $B_{hl=4,c}$.
 After the models $B_{hl=1,c}$ -- $B_{hl=4,c}$ had been trained for these
 $\num{1000}$ epochs, they were trained for another $\num{1000}$ epochs with a
 learning rate of $\eta = 0.05$, resulting in the systems $B_{hl=1,c}'$ --
 $B_{hl=4,c}'$. \Cref{table:complex-recognizer-systems-evaluation} shows that
 this improved the classifiers again.
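 For reference, one common formulation of mini-batch gradient descent with
 momentum is sketched here. The exact update rule of the implementation is not
 stated in this section, so the form of the update, the weight vector $w$, the
 update $\Delta w$, the step index $t$ and the mini-batch error $E$ are
 assumptions made for illustration. Only the learning rate $\eta$ and the
 momentum $\alpha$ are taken from the text:
 \begin{align*}
     \Delta w^{(t)} &= -\eta \, \nabla_{w} E\bigl(w^{(t)}\bigr) + \alpha \, \Delta w^{(t-1)},\\
     w^{(t+1)} &= w^{(t)} + \Delta w^{(t)}.
 \end{align*}
 Under this formulation, lowering $\eta$ from $0.1$ to $0.05$ halves the
 contribution of each new gradient, which permits finer adjustments of an
 already trained model.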
 \begin{table}[htb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System}  & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
              & Top-1                 & Change                & Top-3                & Change\\\midrule
    $B_{hl=1,c}$ & $\SI{21.0}{\percent}$ & $\SI{-2.2}{\percent}$ & $\SI{5.2}{\percent}$ & $\SI{-1.5}{\percent}$\\
    $B_{hl=2,c}$ & $\SI{18.3}{\percent}$ & $\SI{-3.3}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
    $B_{hl=3,c}$ & \underline{$\SI{18.2}{\percent}$} & $\SI{-3.7}{\percent}$ & \underline{$\SI{4.1}{\percent}$} & $\SI{-1.6}{\percent}$\\
    $B_{hl=4,c}$ & $\SI{18.6}{\percent}$ & $\SI{-5.3}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\\midrule
    $B_{hl=1,c}'$ & $\SI{19.3}{\percent}$ & $\SI{-3.9}{\percent}$ & $\SI{4.8}{\percent}$ & $\SI{-1.9}{\percent}$ \\
    $B_{hl=2,c}'$ & \underline{$\SI{17.5}{\percent}$} & $\SI{-4.1}{\percent}$ & \underline{$\SI{4.0}{\percent}$} & $\SI{-1.7}{\percent}$\\
    $B_{hl=3,c}'$ & $\SI{17.7}{\percent}$ & $\SI{-4.2}{\percent}$ & $\SI{4.1}{\percent}$ & $\SI{-1.6}{\percent}$\\
    $B_{hl=4,c}'$ & $\SI{17.8}{\percent}$ & $\SI{-6.1}{\percent}$ & $\SI{4.3}{\percent}$ & $\SI{-1.9}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Error rates of the optimized recognizer systems. The systems
             $B_{hl=i,c}'$ were trained another $\num{1000}$ epochs with a learning rate
             of $\eta=0.05$.}
 \label{table:complex-recognizer-systems-evaluation}
 \end{table}
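 The Change columns can be checked against the baseline results. Assuming that
 they report the difference in percentage points to the corresponding baseline
 system $B_{hl=i}$, which the table does not state explicitly, the top-3 entry
 of $B_{hl=2,c}'$ follows from the baseline top-3 error of
 $\SI{5.7}{\percent}$ of $B_{hl=2}$:
 \[
     \SI{4.0}{\percent} - \SI{5.7}{\percent} = \SI{-1.7}{\percent}
 \]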
--- a/documents/papers/write-math-paper/ch7-mfrdb-eval.tex
+++ b/documents/papers/write-math-paper/ch7-mfrdb-eval.tex
@ -0,0 +1,32 @@
 %!TEX root = write-math-ba-paper.tex
 \section{Evaluation}
 The optimized classifier was evaluated on three publicly available data sets:
 \verb+MfrDB_Symbols_v1.0+ \cite{Stria2012}, CROHME~2011 \cite{Mouchere2011},
 and CROHME~2012 \cite{Mouchere2012}.
 \verb+MfrDB_Symbols_v1.0+ contains recordings for 105~symbols, but for
 11~symbols fewer than 50~recordings were available. For this reason, the
 optimized classifier was evaluated on 94~of the 105~symbols.
 The evaluation results are given in \cref{table:public-eval-results}.
 \begin{table}[htb]
    \centering
    \begin{tabular}{lcrr}
    \toprule
    \multirow{2}{*}{Dataset}  & \multirow{2}{*}{Symbols}  & \multicolumn{2}{c}{Classification error}\\
    \cmidrule(l){3-4}
              & & Top-1                 & Top-3                \\\midrule
    MfrDB       & 94 & $\SI{8.4}{\percent}$  & $\SI{1.3}{\percent}$ \\
    CROHME 2011 & 56 & $\SI{10.2}{\percent}$ & $\SI{3.7}{\percent}$ \\
    CROHME 2012 & 75 & $\SI{12.2}{\percent}$ & $\SI{4.1}{\percent}$ \\
    \bottomrule
    \end{tabular}
    \caption{Error rates of the optimized recognizer on publicly available
             data sets. The system's output layer was adjusted to the number of
             symbols it should recognize, and the system was trained with the
             combined data from write-math and the training data provided by
             the data sets.}
 \label{table:public-eval-results}
 \end{table}
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/Makefile
@ -0,0 +1,35 @@
 SOURCE = errors-by-epoch-pretraining
 DELAY = 80
 DENSITY = 300
 WIDTH = 512
 make:
 	pdflatex $(SOURCE).tex -output-format=pdf
 	make clean
 clean:
 	rm -rf  $(TARGET) *.class *.html *.log *.aux *.data *.gnuplot
 gif:
 	pdfcrop $(SOURCE).pdf
 	convert -verbose -delay $(DELAY) -loop 0 -density $(DENSITY) $(SOURCE)-crop.pdf $(SOURCE).gif
 	make clean
 png:
 	make
 	make svg
 	inkscape $(SOURCE).svg -w $(WIDTH) --export-png=$(SOURCE).png
 transparentGif:
 	convert $(SOURCE).pdf -transparent white result.gif
 	make clean
 svg:
 	make
 	#inkscape $(SOURCE).pdf --export-plain-svg=$(SOURCE).svg
 	pdf2svg $(SOURCE).pdf $(SOURCE).svg
 	# Necessary, as pdf2svg does not always create valid svgs:
 	inkscape $(SOURCE).svg --export-plain-svg=$(SOURCE).svg
 	rsvg-convert -a -w $(WIDTH) -f svg $(SOURCE).svg -o $(SOURCE)2.svg
 	inkscape $(SOURCE)2.svg --export-plain-svg=$(SOURCE).svg
 	rm $(SOURCE)2.svg
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-1.csv
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2-pretraining.csv
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-2.csv
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-3-pretraining.csv
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/baseline-4-pretraining.csv
--- a/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex
+++ b/documents/papers/write-math-paper/figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex
@ -0,0 +1,31 @@
 \begin{tikzpicture}
    \begin{axis}[
            axis x line=middle,
            axis y line=middle,
            enlarge y limits=true,
            xmin=0,
            % xmax=1000,
            ymin=0.18, ymax=0.4,
            minor ytick={0, 0.01, ..., 1},
            % width=15cm, height=8cm,     % size of the image
            grid = both,
            minor grid style={dashed, gray!30},
            major grid style={gray!40},
            %grid style={dashed, gray!30},
            ylabel=error,
            xlabel=epoch,
            legend cell align=left,
            legend style={
                at={(0.5,-0.1)},
                anchor=north,
                legend columns=2
            }
         ]
          \addplot[mark=x,green] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-1.csv};
          \addplot[mark=x,orange] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2.csv};
          \addplot[mark=x,red] table [each nth point=20,x=epoch, y=testerror, col sep=comma] {baseline-2-pretraining.csv};
          \legend{{1 hidden layer},
                  {2 hidden layers},
                  {2 hidden layers with pretraining}}
    \end{axis}
 \end{tikzpicture}
--- a/documents/papers/write-math-paper/glossary.tex
+++ b/documents/papers/write-math-paper/glossary.tex
@ -0,0 +1,74 @@
 %!TEX root = write-math-ba-paper.tex
 %Term definitions
 \newacronym{ANN}{ANN}{artificial neural network}
 \newacronym{CSR}{CSR}{cursive script recognition}
 \newacronym{DTW}{DTW}{dynamic time warping}
 \newacronym{GTW}{GTW}{greedy time warping}
 \newacronym{HMM}{HMM}{hidden Markov model}
 \newacronym{HWR}{HWR}{handwriting recognition}
 \newacronym{HWRT}{HWRT}{handwriting recognition toolkit}
 \newacronym{MLP}{MLP}{multilayer perceptron}
 \newacronym{MSE}{MSE}{mean squared error}
 \newacronym{OOV}{OOV}{out of vocabulary}
 \newacronym{TDNN}{TDNN}{time delay neural network}
 \newacronym{PCA}{PCA}{principal component analysis}
 \newacronym{LDA}{LDA}{linear discriminant analysis}
 \newacronym{CROHME}{CROHME}{Competition on Recognition of Online Handwritten Mathematical Expressions}
 \newacronym{GMM}{GMM}{Gaussian mixture model}
 \newacronym{SVM}{SVM}{support vector machine}
 \newacronym{PyPI}{PyPI}{Python Package Index}
 \newacronym{CFM}{CFM}{classification figure of merit}
 \newacronym{CE}{CE}{cross entropy}
 \newacronym{GPU}{GPU}{graphics processing unit}
 \newacronym{CUDA}{CUDA}{Compute Unified Device Architecture}
 \newacronym{SLP}{SLP}{supervised layer-wise pretraining}
 \newacronym{AEP}{AEP}{auto-encoder pretraining}
 % Term definitions
 \newglossaryentry{Detexify}{name={Detexify}, description={A system used for
 on-line handwritten symbol recognition which is described in \cite{Kirsch}}}
 \newglossaryentry{epoch}{name={epoch}, description={During iterative training of a neural network, an \textit{epoch} is a single pass through the entire training set, followed by testing of the verification set~\cite{Concise12}}}
 \newglossaryentry{hypothesis}{
    name={hypothesis},
    description={The recognition result which a classifier returns is called a hypothesis. In other words, it is the \enquote{guess} of a classifier},
    plural=hypotheses
 }
 \newglossaryentry{reference}{
    name={reference},
    description={Labeled data is used to evaluate classifiers. Those labels are called references},
 }
 \newglossaryentry{YAML}{name={YAML}, description={YAML is a human-readable data format that can be used for configuration files}}
 \newglossaryentry{MER}{name={MER}, description={An error measure which combines symbols to equivalence classes. It was introduced on \cpageref{merged-error-introduction}}}
 \newglossaryentry{JSON}{name={JSON}, description={JSON, short for JavaScript Object Notation, is a language-independent data format that can be used to transmit data between a server and a client in web applications}}
 \newglossaryentry{hyperparamter}{name={hyperparameter}, description={A
 \textit{hyperparameter} is a parameter of a neural net that cannot be learned,
 but has to be chosen}, symbol={\ensuremath{\theta}}}
 \newglossaryentry{learning rate}{name={learning rate}, description={A factor $0 \leq \eta \in \mdr$ that affects how fast new weights are learned. $\eta=0$ means that no new data is learned}, symbol={\ensuremath{\eta}}} % Andrew Ng: \alpha
 \newglossaryentry{learning rate decay}{name={learning rate decay}, description={The learning rate decay $0 < \alpha \leq 1$ is used to adjust the learning rate. After each epoch the learning rate $\eta$ is updated to $\eta \gets \eta \times \alpha$}, symbol={\ensuremath{\alpha}}}
 \newglossaryentry{preactivation}{name={preactivation}, description={The preactivation of a neuron is the weighted sum of its input, before the activation function is applied}}
 \newglossaryentry{stroke}{name={stroke}, description={The path the pen took from
 the point where the pen was put down to the point where the pen was first lifted}}
 \newglossaryentry{line}{name={line}, description={Geometric object that is infinitely long
 and is defined by two points}}
 \newglossaryentry{line segment}{name={line segment}, description={Geometric object that has finite length
 and is defined by two points}}
 \newglossaryentry{symbol}{name={symbol}, description={An atomic semantic entity. A more detailed description can be found in \cref{sec:what-is-a-symbol}}}
 \newglossaryentry{weight}{name={weight}, description={A
 \textit{weight} is a parameter of a neural net that can be learned}, symbol={\ensuremath{\weight}}}
 \newglossaryentry{control point}{name={control point}, description={A
 \textit{control point} is a point recorded by the input device}}
--- a/documents/papers/write-math-paper/sRGBIEC1966-2.1.icm
+++ b/documents/papers/write-math-paper/sRGBIEC1966-2.1.icm
--- a/documents/papers/write-math-paper/variables.tex
+++ b/documents/papers/write-math-paper/variables.tex
@ -0,0 +1,12 @@
 \newcommand{\totalCollectedRecordings}{166898}  % ACTUALITY
 \newcommand{\detexifyCollectedRecordings}{153423}
 \newcommand{\trainingsetsize}{134804}
 \newcommand{\validtionsetsize}{15161}
 \newcommand{\testsetsize}{17012}
 \newcommand{\totalClasses}{1111}
 \newcommand{\totalClassesAnalyzed}{369}
 \newcommand{\totalClassesAboveFifty}{680}
 \newcommand{\totalClassesNotAnalyzedBelowFifty}{431}
 \newcommand{\detexifyPercentage}{$\SI{91.93}{\percent}$}
 \newcommand{\recordingsWithDots}{$\SI{2.77}{\percent}$}  % excluding i,j, ...
 \newcommand{\recordingsWithDotsSizechange}{$\SI{0.85}{\percent}$}  % excluding i,j, ...
--- a/documents/papers/write-math-paper/write-math-ba-paper.bib
+++ b/documents/papers/write-math-paper/write-math-ba-paper.bib
--- a/documents/papers/write-math-paper/write-math-ba-paper.tex
+++ b/documents/papers/write-math-paper/write-math-ba-paper.tex
@ -0,0 +1,87 @@
 \documentclass[9pt,technote,a4paper]{IEEEtran}
 \usepackage{amssymb, amsmath} % needed for math
 \usepackage[a-1b]{pdfx}
 \usepackage{filecontents}
 \begin{filecontents*}{\jobname.xmpdata}
    \Keywords{recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron}
    \Title{On-line Recognition of Handwritten Mathematical Symbols}
    \Author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
    \Org{Institute for Anthropomatics and Robotics}
    \Doi{}
 \end{filecontents*}
 \RequirePackage{ifpdf}
 \ifpdf \PassOptionsToPackage{pdfpagelabels}{hyperref} \fi
 \RequirePackage{hyperref}
 \usepackage{parskip}
 \usepackage[pdftex,final]{graphicx}
 \usepackage{csquotes}
 \usepackage{braket}
 \usepackage{booktabs}
 \usepackage{multirow}
 \usepackage{pgfplots}
 \usepackage{wasysym}
 \usepackage{caption}
 % \captionsetup{belowskip=12pt,aboveskip=4pt}
 \makeatletter
 \newcommand\mynobreakpar{\par\nobreak\@afterheading}
 \makeatother
 \usepackage[noadjust]{cite}
 \usepackage[nameinlink,noabbrev]{cleveref} % has to be after hyperref, ntheorem, amsthm
 \usepackage[binary-units,group-separator={,}]{siunitx}
 \sisetup{per-mode=fraction,binary-units=true}
 \DeclareSIUnit\pixel{px}
 \usepackage{glossaries}
 \loadglsentries[main]{glossary}
 \makeglossaries
 \title{On-line Recognition of Handwritten Mathematical Symbols}
 \author{Martin Thoma, Kevin Kilgour, Sebastian St{\"u}ker and Alexander Waibel}
 \hypersetup{
  pdfauthor   = {Martin Thoma\sep Kevin Kilgour\sep Sebastian St{\"u}ker\sep Alexander Waibel},
  pdfkeywords = {recognition\sep machine learning\sep neural networks\sep symbols\sep multilayer perceptron},
  pdfsubject  = {Recognition},
  pdftitle    = {On-line Recognition of Handwritten Mathematical Symbols},
 }
 \input{variables}
 \crefname{table}{Table}{Tables}
 \crefname{figure}{Figure}{Figures}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 % Begin document                                                    %
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{document}
 \maketitle
 \begin{abstract}
 The automatic recognition of single handwritten symbols has three main
 applications: supporting users who know what a symbol looks like but not what
 it is called, providing the necessary commands for professional publishing, and
 serving as a building block for formula recognition.
 This paper presents a system which uses the pen trajectory to classify
 handwritten symbols. Five preprocessing steps, one data multiplication
 algorithm, five features and five variants for multilayer perceptron training
 were evaluated using $\num{166898}$ recordings. Those recordings were made
 publicly available. The evaluation results of these 21~experiments were used to
 create an optimized recognizer which has a top-1 error of less than
 $\SI{17.5}{\percent}$ and a top-3 error of $\SI{4.0}{\percent}$. This is a
 relative improvement of $\SI{18.5}{\percent}$ for the top-1 error and
 $\SI{29.7}{\percent}$ for the top-3 error compared to the baseline system. This
 improvement was achieved by \acrlong{SLP} and by adding new features. The
 improved classifier can be used via \href{http://write-math.com/}{write-math.com}.
 \end{abstract}
 \input{ch1-introduction}
 \input{ch2-general-system-design}
 \input{ch3-data-and-implementation}
 \input{ch4-algorithms}
 \input{ch5-optimization-of-system-design}
 \input{ch6-summary}
 \input{ch7-mfrdb-eval}
 \bibliographystyle{IEEEtranSA}
 \bibliography{write-math-ba-paper}
 \end{document}