%!TEX root = write-math-ba-paper.tex
\section{Optimization of System Design}\label{ch:Optimization-of-System-Design}
In order to evaluate the effect of different preprocessing algorithms, features
and adjustments in the \gls{MLP} training and topology, the following baseline
system was used:

Scale the recording to fit into a unit square while keeping the aspect ratio,
shift it as described in \cref{sec:preprocessing}, and
resample it with linear interpolation to get 20~points per stroke, spaced
evenly in time. Take the first 4~strokes with 20~points per stroke and
2~coordinates per point as features, resulting in 160~features, which is equal
to the number of input neurons. If a recording has fewer than 4~strokes, the
remaining features are filled with zeroes.
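
The following sketch in Python illustrates how such a feature vector could be
computed; it is not the implementation used in this work, and it assumes the
recording has already been scaled and shifted as described above.
\begin{verbatim}
import numpy as np

def resample_stroke(stroke, n=20):
    """Linear interpolation to n points evenly
    spaced in time; stroke is a list of (x, y, t)."""
    x, y, t = np.array(stroke, dtype=float).T
    ts = np.linspace(t[0], t[-1], n)
    return np.column_stack([np.interp(ts, t, x),
                            np.interp(ts, t, y)])

def baseline_features(recording, strokes=4, points=20):
    """Build the 4 x 20 x 2 = 160-dimensional feature vector."""
    feats = np.zeros(strokes * points * 2)
    for i, stroke in enumerate(recording[:strokes]):
        pts = resample_stroke(stroke, points)  # shape (20, 2)
        feats[i*points*2:(i+1)*points*2] = pts.ravel()
    return feats  # missing strokes remain zero-padded
\end{verbatim}
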
All experiments were evaluated with four baseline systems $B_{hl=i}$,
$i \in \Set{1, 2, 3, 4}$, where $i$ is the number of hidden layers, as
different topologies could have a severe influence on the effect of new
features or preprocessing steps. Each hidden layer in all evaluated systems has
$500$ neurons.

Each \gls{MLP} was trained with a learning rate of $\eta = 0.1$ and a momentum
of $\alpha = 0.1$. The activation function of every neuron in a hidden layer is
the sigmoid function. The neurons in the output layer use the softmax function.
For every experiment, exactly one part of the baseline systems was changed.
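
As an illustration only, the topology and forward pass of such a baseline
network can be sketched as follows in Python; the number of output classes and
the weight initialization used here are placeholders, not values taken from
this section.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_baseline(hidden_layers, n_in=160, n_hidden=500,
                  n_out=369):  # n_out: placeholder class count
    sizes = [n_in] + [n_hidden] * hidden_layers + [n_out]
    rng = np.random.default_rng(0)
    # placeholder init; the random initialization actually
    # used is described in the next subsection
    return [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        x = softmax(x) if k == len(layers) - 1 else sigmoid(x)
    return x

# Training uses plain gradient descent with
# eta = 0.1 and momentum alpha = 0.1.
\end{verbatim}
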
\subsection{Random Weight Initialization}
The neural networks in all experiments were initialized with small random
weights
\[w_{i,j} \sim U\left(-4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}, 4 \cdot \sqrt{\frac{6}{n_l + n_{l+1}}}\right)\]
where $w_{i,j}$ is the weight between the neurons $i$ and $j$, $l$ is the layer
of neuron $i$, and $n_l$ is the number of neurons in layer $l$. This random
initialization was suggested in \cite{deeplearningweights} and is done to break
symmetry.
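
A direct transcription of this initialization into Python (for illustration;
not the code used in this work) could look like this:
\begin{verbatim}
import numpy as np

def init_weights(n_l, n_lp1, seed=None):
    """Draw weights from U(-r, r) with
    r = 4 * sqrt(6 / (n_l + n_{l+1}))."""
    rng = np.random.default_rng(seed)
    r = 4.0 * np.sqrt(6.0 / (n_l + n_lp1))
    return rng.uniform(-r, r, size=(n_l, n_lp1))

# Example: weights between the 160 input neurons
# and the first hidden layer with 500 neurons.
W = init_weights(160, 500)
\end{verbatim}
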
This can lead to different error rates for the same systems just because the
initialization was different.

In order to get an impression of the magnitude of this influence on the
different topologies and error rates, the baseline models were trained 5~times
with different random initializations.
\Cref{table:baseline-systems-random-initializations-summary}
shows a summary of the results. The more hidden layers are used, the more the
results vary between different random weight initializations.
\begin{table}[h]
    \centering
    \begin{tabular}{crrr|rrr} %chktex 44
    \toprule
    \multirow{3}{*}{System} & \multicolumn{6}{c}{Classification error}\\
    \cmidrule(l){2-7}
      & \multicolumn{3}{c}{Top-1} & \multicolumn{3}{c}{Top-3}\\
      & Min & Max & Mean & Min & Max & Mean\\\midrule
    $B_{hl=1}$ & $\SI{23.1}{\percent}$ & $\SI{23.4}{\percent}$ & $\SI{23.2}{\percent}$ & $\SI{6.7}{\percent}$ & $\SI{6.8}{\percent}$ & $\SI{6.7}{\percent}$ \\
    $B_{hl=2}$ & \underline{$\SI{21.4}{\percent}$} & \underline{$\SI{21.8}{\percent}$} & \underline{$\SI{21.6}{\percent}$} & $\SI{5.7}{\percent}$ & \underline{$\SI{5.8}{\percent}$} & \underline{$\SI{5.7}{\percent}$}\\
    $B_{hl=3}$ & $\SI{21.5}{\percent}$ & $\SI{22.3}{\percent}$ & $\SI{21.9}{\percent}$ & \underline{$\SI{5.5}{\percent}$} & $\SI{5.8}{\percent}$ & \underline{$\SI{5.7}{\percent}$}\\
    $B_{hl=4}$ & $\SI{23.2}{\percent}$ & $\SI{24.8}{\percent}$ & $\SI{23.9}{\percent}$ & $\SI{6.0}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{6.2}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{The systems $B_{hl=1}$ -- $B_{hl=4}$ were randomly initialized,
             trained and evaluated 5~times to estimate the influence of random
             weight initialization.}
\label{table:baseline-systems-random-initializations-summary}
\end{table}
\subsection{Stroke Connection}
In order to solve the problem of interrupted strokes, pairs of strokes can be
connected with a stroke connection algorithm. The idea is that for a pair of
consecutively drawn strokes $s_{i}, s_{i+1}$ the last point of $s_i$ is close
to the first point of $s_{i+1}$ if a stroke was accidentally split into two
strokes.
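
One possible implementation of this idea is sketched below (Python,
illustrative only); the threshold $\theta$ is the one evaluated in the
following paragraph.
\begin{verbatim}
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def connect_strokes(strokes, theta):
    """Merge consecutive strokes whose gap is below theta.
    Each stroke is a list of (x, y) points in drawing order."""
    if not strokes:
        return []
    merged = [list(strokes[0])]
    for stroke in strokes[1:]:
        if dist(merged[-1][-1], stroke[0]) < theta:
            merged[-1].extend(stroke)  # treat as one stroke
        else:
            merged.append(list(stroke))
    return merged
\end{verbatim}
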
$\SI{59}{\percent}$ of all stroke pair distances in the collected data are
between $\SI{30}{\pixel}$ and $\SI{150}{\pixel}$. Hence the stroke connection
algorithm was evaluated with thresholds of $\SI{5}{\pixel}$, $\SI{10}{\pixel}$
and $\SI{20}{\pixel}$.
The top-3 error of all models improved with a threshold of $\theta = \SI{10}{\pixel}$
by at least $\num{0.2}$ percentage points, except for $B_{hl=4}$, which did not
notably improve.
\subsection{Douglas-Peucker Smoothing}
The Douglas-Peucker algorithm was applied with thresholds of $\varepsilon =
0.05$, $\varepsilon = 0.1$ and $\varepsilon = 0.2$ after scaling and shifting,
but before resampling. The interpolation in the resampling step was done
linearly in one set of experiments and with cubic splines in another. The
recording was scaled and shifted again after the interpolation because the
bounding box might have changed.
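
For reference, a standard recursive formulation of the Douglas-Peucker
simplification is sketched below (Python, illustrative; not the implementation
used here):
\begin{verbatim}
import math

def point_line_dist(p, a, b):
    """Distance of point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def douglas_peucker(points, epsilon):
    """Keep only points that deviate more than epsilon
    from the approximating line segment."""
    if len(points) < 3:
        return list(points)
    dists = [point_line_dist(p, points[0], points[-1])
             for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] > epsilon:
        left = douglas_peucker(points[:i + 1], epsilon)
        right = douglas_peucker(points[i:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
\end{verbatim}
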
The result of applying the Douglas-Peucker smoothing with $\varepsilon > 0.05$
was a sharp rise of the top-1 and top-3 error for all models $B_{hl=i}$. This
means that the simplification process removes some relevant information and
does not---as was expected---remove only noise. For $\varepsilon = 0.05$ with
linear interpolation the top-1 error of some models improved, but the changes
were small and could be an effect of random weight initialization. However,
cubic spline interpolation made all systems perform more than $\num{1.7}$
percentage points worse in the top-1 and top-3 error.

The lower the value of $\varepsilon$, the less the recording changes in this
preprocessing step. As the algorithm was applied after scaling the recording
such that the biggest dimension of the recording (width or height) is $1$, a
value of $\varepsilon = 0.05$ means that a point is only kept if its distance
from the approximating line segment is at least $\SI{5}{\percent}$ of the
biggest dimension.
\subsection{Global Features}
Single global features were added one at a time to the baseline systems. Those
features were the re-curvature
$\text{re-curvature}(stroke) = \frac{\text{height}(stroke)}{\text{length}(stroke)}$
as described in \cite{Huang06}, the ink feature, which is the summed length of
all strokes, the stroke count, the aspect ratio and the stroke center points
for the first four strokes. The stroke center point feature improved the system
$B_{hl=1}$ by $\num{0.3}$~percentage points for the top-3 error and the system
$B_{hl=3}$ by $\num{0.7}$~percentage points for the top-1 error, but all other
systems and error measures either got worse or did not improve much.
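
The following sketch illustrates how these global features could be computed
(Python, illustrative only); taking the mean of a stroke's points as its center
is an assumption, not taken from the implementation.
\begin{verbatim}
import math

def stroke_length(s):
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(s, s[1:]))

def global_features(strokes):
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    ink = sum(stroke_length(s) for s in strokes)
    aspect_ratio = width / height if height else 0.0
    # re-curvature(s) = height(s) / length(s), cf. Huang06
    re_curvature = [(max(y for _, y in s) - min(y for _, y in s))
                    / (stroke_length(s) or 1.0) for s in strokes]
    centers = [(sum(x for x, _ in s) / len(s),
                sum(y for _, y in s) / len(s))
               for s in strokes[:4]]
    return ink, len(strokes), aspect_ratio, re_curvature, centers
\end{verbatim}
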
The other global features did improve the systems $B_{hl=1}$ -- $B_{hl=3}$, but
not $B_{hl=4}$. The highest improvement was achieved with the re-curvature
feature. It improved the systems $B_{hl=1}$ -- $B_{hl=4}$ by more than
$\num{0.6}$~percentage points in the top-1 error.
\subsection{Data Multiplication}
Data multiplication can be used to make the model invariant to transformations.
However, this idea seems not to work well in the domain of on-line handwritten
mathematical symbols. We tripled the data by adding a version that is rotated
3~degrees to the left and another one that is rotated 3~degrees to the right
around the center of mass. This data multiplication made all classifiers
perform worse for most error measures, by more than $\num{2}$~percentage points
for the top-1 error.
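
A sketch of this kind of data multiplication, rotating each recording around
its center of mass (Python, illustrative only):
\begin{verbatim}
import math

def rotate_recording(strokes, angle_deg):
    """Rotate all points around the center of mass."""
    pts = [p for s in strokes for p in s]
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    a = math.radians(angle_deg)
    c, si = math.cos(a), math.sin(a)
    return [[(cx + (x - cx) * c - (y - cy) * si,
              cy + (x - cx) * si + (y - cy) * c)
             for x, y in s] for s in strokes]

def triple(data):
    """Original plus 3-degree rotations in both directions."""
    out = []
    for strokes, label in data:
        out += [(strokes, label),
                (rotate_recording(strokes, -3), label),
                (rotate_recording(strokes, +3), label)]
    return out
\end{verbatim}
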
The same experiment was repeated with rotations of 6~degrees and, in another
experiment, of 9~degrees, but those systems performed even worse.

Multiplying the data by a factor of 5 by adding two 3-degree rotated variants
and two 6-degree rotated variants also made the classifiers perform worse by
more than $\num{2}$~percentage points.
\subsection{Pretraining}\label{subsec:pretraining-evaluation}
Pretraining is a technique used to improve the training of \glspl{MLP} with
multiple hidden layers.

\Cref{table:pretraining-slp} shows that \gls{SLP} improves the classification
performance by at least $\num{1.6}$ percentage points for the top-1 error and
at least $\num{1.0}$ percentage points for the top-3 error. As one can see in
\cref{fig:training-and-test-error-for-different-topologies-pretraining}, this
is not only the case because of the longer training, as the test error is
relatively stable after $\num{1000}$ epochs of training. This was confirmed by
an experiment in which the baseline systems were trained for $\num{10000}$
epochs and did not perform notably differently.
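
Reading \gls{SLP} as supervised layer-wise pretraining, the scheme can be
sketched as follows; this is a rough illustration with hypothetical helper
functions (build_net, add_hidden_layer, train_net), not the actual procedure
used here.
\begin{verbatim}
def supervised_layerwise_pretraining(data, n_hidden_layers,
                                     build_net, add_hidden_layer,
                                     train_net):
    """Grow the network one hidden layer at a time,
    training on the labeled data after each step."""
    net = build_net(hidden_layers=1)      # e.g. 160-500-output
    train_net(net, data)
    for _ in range(n_hidden_layers - 1):
        # insert a fresh 500-neuron layer before the output
        # layer, keeping the weights trained so far
        net = add_hidden_layer(net, size=500)
        train_net(net, data)
    return net
\end{verbatim}
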
\begin{figure}[htb]
    \centering
    \input{figures/errors-by-epoch-pretraining/errors-by-epoch-pretraining.tex}
    \caption{Training and test error by number of trained epochs for different
             topologies with \acrfull{SLP}. The plot shows
             that all pretrained systems performed much better than the systems
             without pretraining. None of the plotted systems improved
             with more epochs of training.}
\label{fig:training-and-test-error-for-different-topologies-pretraining}
\end{figure}
\begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
      & Top-1 & Change & Top-3 & Change \\\midrule
    $B_{hl=1}$     & $\SI{23.2}{\percent}$ & -                     & $\SI{6.7}{\percent}$ & - \\
    $B_{hl=2,SLP}$ & $\SI{19.9}{\percent}$ & $\SI{-1.7}{\percent}$ & $\SI{4.7}{\percent}$ & $\SI{-1.0}{\percent}$\\
    $B_{hl=3,SLP}$ & \underline{$\SI{19.4}{\percent}$} & $\SI{-2.5}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.1}{\percent}$\\
    $B_{hl=4,SLP}$ & $\SI{19.6}{\percent}$ & $\SI{-4.3}{\percent}$ & \underline{$\SI{4.6}{\percent}$} & $\SI{-1.6}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with 1--4 hidden layers which used \acrfull{SLP}
             compared to the mean of the systems $B_{hl=1}$--$B_{hl=4}$
             displayed in \cref{table:baseline-systems-random-initializations-summary}
             which used pure gradient descent. The \gls{SLP}
             systems clearly performed better.}
\label{table:pretraining-slp}
\end{table}
Pretraining with a denoising auto-encoder led to the much worse results listed
in \cref{table:pretraining-denoising-auto-encoder}. The first layer used a
$\tanh$ activation function. Every layer was trained for $1000$ epochs with the
\gls{MSE} loss function. A learning rate of $\eta = 0.001$, a corruption of
$\varkappa = 0.3$ and an $L_2$ regularization of $\lambda = 10^{-4}$ were
chosen. This pretraining setup made all systems perform much worse on all error
measures.
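
A minimal sketch of denoising auto-encoder pretraining for a single layer with
these hyperparameters is given below (Python with NumPy, tied weights assumed;
illustrative only, not the setup used in this work):
\begin{verbatim}
import numpy as np

def pretrain_layer(X, n_hidden, epochs=1000, eta=0.001,
                   corruption=0.3, l2=1e-4, seed=0):
    """Denoising auto-encoder, MSE loss, L2 penalty."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(scale=0.01, size=(n_in, n_hidden))
    b_h, b_o = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        mask = rng.random(X.shape) > corruption  # drop inputs
        Xc = X * mask
        H = np.tanh(Xc @ W + b_h)        # encode
        R = H @ W.T + b_o                # decode (tied weights)
        err = R - X                      # MSE gradient (up to 2)
        dH = (err @ W) * (1 - H**2)      # backprop through tanh
        dW = Xc.T @ dH + err.T @ H
        W -= eta * (dW / len(X) + l2 * W)
        b_h -= eta * dH.mean(axis=0)
        b_o -= eta * err.mean(axis=0)
    return W, b_h
\end{verbatim}
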
\begin{table}[tb]
    \centering
    \begin{tabular}{lrrrr}
    \toprule
    \multirow{2}{*}{System} & \multicolumn{4}{c}{Classification error}\\
    \cmidrule(l){2-5}
      & Top-1 & Change & Top-3 & Change \\\midrule
    $B_{hl=1,AEP}$ & $\SI{23.8}{\percent}$ & $\SI{+0.6}{\percent}$ & $\SI{7.2}{\percent}$ & $\SI{+0.5}{\percent}$\\
    $B_{hl=2,AEP}$ & \underline{$\SI{22.8}{\percent}$} & $\SI{+1.2}{\percent}$ & $\SI{6.4}{\percent}$ & $\SI{+0.7}{\percent}$\\
    $B_{hl=3,AEP}$ & $\SI{23.1}{\percent}$ & $\SI{+1.2}{\percent}$ & \underline{$\SI{6.1}{\percent}$} & $\SI{+0.4}{\percent}$\\
    $B_{hl=4,AEP}$ & $\SI{25.6}{\percent}$ & $\SI{+1.7}{\percent}$ & $\SI{7.0}{\percent}$ & $\SI{+0.8}{\percent}$\\
    \bottomrule
    \end{tabular}
    \caption{Systems with denoising \acrfull{AEP} compared to pure
             gradient descent. The \gls{AEP} systems performed worse.}
\label{table:pretraining-denoising-auto-encoder}
\end{table}