
% \pagebreak[4]
% \hspace*{1cm}
% \pagebreak[4]
% \hspace*{1cm}
% \pagebreak[4]
\chapter{The transforming autoencoder}
\label{chap:transautoencoder}
\ifpdf
    \graphicspath{{Chapter1/Chapter1Figs/PNG/}{Chapter1/Chapter1Figs/PDF/}{Chapter1/Chapter1Figs/}}
\else
    \graphicspath{{Chapter1/Chapter1Figs/EPS/}{Chapter1/Chapter1Figs/}}
\fi

\section{The architecture of the transforming autoencoder}
\subsection{A high level overview}
The transforming autoencoder consists of many capsules. Each capsule is an independent subnetwork used to extract a single set of instantiation parameters. Each parameterized feature consists of a single gating unit, indicating whether a visual entity is present, and some instantiation parameters, representing its pose.\footnote{We use \emph{visual entity} to mean things in the image; \emph{parameterized feature} to mean gating units and instantiation parameters; \emph{instantiation parameters} to mean the compact representation obtained by the transforming autoencoder; and \emph{capsule} to mean the hardware (or subnetwork) used to compute these instantiation parameters. We avoid the general word \emph{feature} because it could mean either the representation (parameterized features and instantiation parameters are examples of representations) or things in the image (visual entities).}

We introduce the idea in its simplest form: 2-d translations. The instantiation parameters (2-dimensional) should represent the position of the visual entity that the capsule is responsible for. Since we know the position difference between our input and output vectors, we can add this known difference to the predicted position and ask the network to reconstruct the correct output. Adding the difference forces a meaning onto the extracted code: the network has to extract the positions of visual entities in the image. Although each capsule processes information independently, it cannot cheat, since the code it extracts has to contain all the information needed to reconstruct the output. As a result, capsules tend to learn to represent visual entities different from those of other capsules.

It would be more satisfying (and theoretically possible) if the network did not need to be told this transformation, as in \cite{?}. But the transformation is readily available and can serve to force the autoencoder to find an appropriate representation. This information is available to us as well. If we move our head while keeping our eyes fixed on a letter on this page, we do not perceive any movement in the world, because our visual system expects the change it receives. When we saccade around the page, we can incorporate information from the new view into what we obtained before, because we know how far we saccaded. Similar information would be available in robotics applications through feedback from the motor control system. So we explore what happens when we have direct access to such transformations, and introduce an architecture that takes advantage of this information.

\subsection{Details of the transforming autoencoder}
\label{sec:details}
We first demonstrate the concept on image translations and then generalize to full affine transformations. Suppose the raw data $w^i \in \mathbb{R}^{784}$ are $28\times28$ images of MNIST digits. We apply simple geometric transformations (in this case just translations) to $w^i$ to generate our training data set, where the superscript denotes training case $i$.
$$\{x^i = T(s^i_x, w^i),\; t^i = T(s^i_t, w^i),\; s^i\}_{i=1}^{M}$$
For each training case (dropping the superscript for cleaner notation), $T(s, x) \in \mathbb{R}^{784}$ takes $x \in \mathbb{R}^{784}$ to the image of $x$ shifted by $s \in \mathbb{R}^2$, and $s = s_t - s_x$ is the difference between the underlying translations of $t$ and $x$. For example, $s_t, s_x \in \mathbb{R}^2$ might be distributed according to a spherical Gaussian with variance $\sigma^2$.

The transforming autoencoder is now easy to define. Let the transforming autoencoder have $N_c$ capsules. Each capsule has receptive hidden units $H_r \in \mathbb{R}^{N_r}$, parameter units $c \in \mathbb{R}^2$, a gating unit $p \in (0, 1)$, and generative hidden units $H_g \in \mathbb{R}^{N_g}$. The input to each capsule is the whole input image, and the capsules are independent of each other. Each capsule produces an output $y \in \mathbb{R}^{784}$; these outputs are added up linearly, $Y = \sum_j y_j \in \mathbb{R}^{784}$, and the final output is $O = \sigma(Y + b_o) \in \mathbb{R}^{784}$, where $b_o \in \mathbb{R}^{784}$ is the output bias. The output of the network $O$ should match the target $t$.
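Concretely, a training case could be generated along the following lines (a sketch in Python/NumPy, not the implementation used here; \texttt{np.roll} wraps around at the image border, which only approximates the shift $T$):

\begin{verbatim}
import numpy as np

def shift_image(w, s):
    """Translate a 28x28 image (given as a flat 784-vector) by the
    integer offset s = (dx, dy). np.roll wraps around at the border,
    a simplification of the shift T(s, w) used in the text."""
    img = w.reshape(28, 28)
    img = np.roll(img, int(s[0]), axis=1)   # horizontal shift
    img = np.roll(img, int(s[1]), axis=0)   # vertical shift
    return img.reshape(-1)

def make_training_case(w, sigma=3.0, rng=np.random):
    """Return (x, t, s): shifted input, shifted target, and the known
    difference s = s_t - s_x between the two underlying translations."""
    s_x = np.round(rng.randn(2) * sigma)
    s_t = np.round(rng.randn(2) * sigma)
    return shift_image(w, s_x), shift_image(w, s_t), s_t - s_x
\end{verbatim}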

% perhaps you should use a generic number instead of 784
\begin{figure}[h!]
\centering
\includegraphics[scale = .5]{transautolbl.pdf}
\caption{Simplest transforming autoencoder architecture for translations. $N$ is the number of capsules. Each of the $N$ capsules is gated by its unit $p$ at the final output of the capsule.}
\end{figure}

Let $x, y$ be the input and output of an individual capsule. Then, in order of computation,
$$H_r = \sigma(W_{xh} x + b_r)$$
$$c = W_{hc} H_r + b_c \in \mathbb{R}^{2}$$
$$p = \sigma(W_{hp} H_r + b_p) \in (0,1)$$
$$c' = c + s \in \mathbb{R}^{2}$$
$$H_g = \sigma(W_{cg} c' + b_g)$$
$$y = p \, (W_{hy} H_g)$$

Here $\sigma(x) = \frac{1}{1+e^{-x}}$, and $c = c(x)$, $p = p(x)$ are the values of the parameter units and gating unit of the current capsule. $W_{xh}, W_{hc}, W_{hp}, W_{cg}, W_{hy}$ are, respectively, the weight matrices from the input to the receptive hidden units, from the receptive hidden units to the parameter units, from the receptive hidden units to the gating unit, from the parameter units to the generative hidden units, and from the generative hidden units to the output. Training is done by fixing this architecture and minimizing the difference between $O$ and $t$. Essentially, we want the final network function $F$, where $O = F(s, x)$, to behave like the translation function $T$, where $t = T(s, x)$. But it has to do so by first extracting a small number of instantiation parameters and putting them in a compact code $c \in \mathbb{R}^2$ for each capsule.
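The forward computation of a single capsule can be sketched as follows (Python/NumPy rather than the actual GPU implementation; the weight and bias names mirror the equations above and are assumed to be already initialized with compatible shapes):

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def capsule_forward(x, s, W_xh, b_r, W_hc, b_c, W_hp, b_p, W_cg, b_g, W_hy):
    """Forward pass of one capsule: image x (784,), known shift s (2,).
    Returns the capsule's gated contribution y (784,) to the summed output."""
    H_r = sigmoid(W_xh @ x + b_r)        # receptive hidden units
    c = W_hc @ H_r + b_c                 # predicted position, 2 numbers
    p = sigmoid(W_hp @ H_r + b_p)        # gating unit in (0, 1)
    c_prime = c + s                      # add the known transformation
    H_g = sigmoid(W_cg @ c_prime + b_g)  # generative hidden units
    return p * (W_hy @ H_g)              # y = p (W_hy H_g)

# The full network sums the contributions of all capsules and applies
# the output nonlinearity: O = sigmoid(sum_j y_j + b_o).
\end{verbatim}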

\section{Training the transforming autoencoder}
\subsection{How to train}
We can train this autoencoder using back-propagation \cite{?} (see \ref{chap:nn}). For training data $\{x, t, s\}$, we minimize the cross-entropy objective function
$$ C(t, O) = - \sum_{j=1}^{784} \left[ t_j \log(O_j) + (1-t_j) \log(1-O_j) \right]$$
In practice, we use mini-batches of size 100; see \ref{sub:backprop} for more details.

\subsection{The transforming autoencoder on translations of MNIST}
I trained the transforming autoencoder described above on the MNIST dataset. The data is translated by amounts distributed according to a 2-d isotropic Gaussian with $\sigma = 3$ and $\sigma = 10$. I used a weight decay of $10^{-4}$ and mini-batches of 100, and the net runs on a GPU board. For more implementation details please see \ref{chap:nn}. The table and figures that follow show the squared reconstruction errors, what the reconstructions look like, and what the receptive and generative fields of each capsule look like.

\begin{center}
\begin{tabular}{ l | l | l }
\hline
case & training error & testing error \\ \hline
100 capsules, $\sigma = 10$ & 5.31 & 5.61 \\ \hline
100 capsules, $\sigma = 3$ & 3.58 & 3.75 \\ \hline
\end{tabular}
\end{center}
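For concreteness, a minimal single-case sketch of the training objective and of the squared reconstruction error reported above (up to the exact normalization; the actual runs used mini-batches of 100 on a GPU, and the \texttt{eps} guard against $\log 0$ is an implementation assumption):

\begin{verbatim}
import numpy as np

def cross_entropy(t, O, eps=1e-10):
    """The training objective C(t, O), summed over the 784 pixels."""
    return -np.sum(t * np.log(O + eps) + (1.0 - t) * np.log(1.0 - O + eps))

def squared_error(t, O):
    """The squared reconstruction error reported in the table above."""
    return np.sum((t - O) ** 2)
\end{verbatim}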

\begin{figure}[h!]
\centering

\includegraphics{recontrans5.png}
\caption{Reconstructions of translated digits using a transforming autoencoder with 100 capsules, each with 10 recognition units and 10 generative units. \textbf{Top row}: input, \textbf{middle row}: reconstruction, \textbf{bottom row}: target output.}
\footnote{The reconstruction tasks shown above, and in the other figures of this section, are on test data the transforming autoencoder has not seen before. These test cases are easier than the training cases, however, since during training the input is also transformed.}
\end{figure}

\begin{figure}[h!]
\centering
\includegraphics[scale=.7]{transfilter.png}
\caption{All 10 generative units of each of the first 10 (out of 100) capsules that produced the above reconstructions.}
\end{figure}

The same net can also be applied to large translations.

\begin{figure}[h!]
\centering
\includegraphics{recontranslarge.png}
\caption{Reconstructions of digits under large translations using a transforming autoencoder with 100 capsules, each with 10 recognition units and 10 generative units. \textbf{Top row}: input, \textbf{middle row}: reconstruction, \textbf{bottom row}: target output.}
\end{figure}

\begin{figure}[h!]
\centering
\includegraphics[scale=.7]{largetransfilter.png}
\caption{All 10 generative units of each of the first 10 (out of 100) capsules that produced the above reconstructions.}
\end{figure}

\subsection{The transforming autoencoder on more complex transformations}
The transforming autoencoder can also be applied to more complex transformations, and there are several approaches. One is to simply introduce explicit parameters for orientation, scaling, or shear in addition to $x$ and $y$; for example, to deal with rotation and translation, each capsule predicts parameters $x, y, \theta$, and we provide the known $\Delta x, \Delta y, \Delta\theta$, where $\theta$ is the orientation parameter. A more general approach is to change the way the known transformation interacts with the capsule parameters: instead of adding a difference, we multiply by a matrix. Each capsule predicts a $3\times3$ homography or affine transformation matrix $A$ from the input image, and the known transformation matrix $T$ is applied to obtain $TA$, which is then used to predict the output image. The net shown in the figure below has 50 capsules with 40 hidden units each (20 recognition and 20 generative); each capsule outputs parameters $x, y, \theta$.

\begin{center}
\begin{tabular}{ l | l | l }
\hline
Case & training error & testing error \\ \hline
50 capsules, translation and rotation & 13 & 13 \\ \hline
25 capsules, full affine 2 shift & 5.9 & 6.2 \\ \hline
\end{tabular}
\end{center}

\begin{figure}[h!]
\centering
\includegraphics{recontransrot.png}


\caption{Reconstructions of translated and rotated digits using a transforming autoencoder with 50 capsules, each having 20 recognition units and 20 generative units. \textbf{Top row}: input, \textbf{middle row}: reconstruction, \textbf{bottom row}: target output.}
\end{figure}

\begin{figure}[h!]
\centering
\includegraphics[scale=.7]{filtertransrot.png}
\caption{All 20 generative fields of the first 5 capsules of the transforming autoencoder that produced the above output. Each capsule has parameters $x, y, \theta$ (to which $\Delta x, \Delta y, \Delta\theta$ are added) and a gating unit.}
\end{figure}

If each capsule is given 9 real-valued outputs that are treated as a $3\times3$ matrix $A$, a transforming autoencoder can be trained to predict all 2-d affine transformations (translation, rotation, scaling and shearing). A known transformation matrix $T$ is applied to the capsule's output $A$ to get the matrix $TA$, which is then used to predict the target output image. See \ref{chap:nn} for how the training data was generated.

\begin{figure}[h!]
\centering
\includegraphics{reconsfullaffine.png}
\caption{Full affine reconstructions using a transforming autoencoder with 25 capsules, each having 40 recognition units and 40 generative units. \textbf{Top row}: input, \textbf{middle row}: reconstruction, \textbf{bottom row}: target output.}
\end{figure}

\begin{figure}[h!]
\centering
\includegraphics[scale=.7]{filterfullaffine.png}
\caption{The first 20 (out of 40) generative units of the first 7 (out of 25) capsules that produced the above reconstructions. Each capsule outputs a full affine matrix $A \in \mathbb{R}^{3\times3}$.}
\end{figure}
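As a sketch of how the predicted matrix and the known transformation could be composed (the helper \texttt{known\_transform} and the row-wise reshape of the 9 outputs are illustrative assumptions, and how $TA$ is then fed to the generative units is left abstract here):

\begin{verbatim}
import numpy as np

def known_transform(theta, dx, dy):
    """A known 2-d affine transformation in homogeneous coordinates:
    rotation by theta followed by a translation (dx, dy)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

def compose_pose(capsule_outputs, T):
    """Reshape a capsule's 9 real-valued outputs into the matrix A and
    compose it with the known transformation T; the product T A is what
    the capsule's generative units receive."""
    A = np.asarray(capsule_outputs).reshape(3, 3)
    return T @ A
\end{verbatim}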

\section{How does the transforming autoencoder work?}
The reconstructions are decent, and the filters look interesting. But how do the capsules really work? Here we investigate how the capsule parameters change as we translate the input image.

\begin{figure}[h!]
\centering
\includegraphics[scale=.8]{capsuleoutputs.png}
\caption{Red scatter: the $x$ outputs of capsule 0 on a set of images shifted by 3 pixels vs.\ its $x$ outputs on the original set of images; blue line: $y = x + 3$; green line: $y = x$.}
\end{figure}

Trained on a small range of transformations, each capsule indeed outputs the coordinates of the visual entity it is responsible for. The situation is more complicated for a transforming autoencoder trained on large transformations. In this case, the generative units get most of the training and the receptive units do not behave well.

\begin{figure}[h!]
\centering
\includegraphics[scale=.8]{capsuleplot.png}
\caption{Plotted are the capsule outputs as a particular input image shifts in the direction $(1,1)$. Red line: gating units; blue line: $x$ outputs; green line: $y$ outputs. Each subplot is a different capsule.}
\end{figure}
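In outline, the scatter plot above can be produced as follows; \texttt{x\_output} (capsule 0's $x$ parameter on an image) and \texttt{shift\_image} are placeholders for whatever the implementation provides and are assumptions of this sketch:

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

def capsule_scatter(images, x_output, shift_image, dx=3):
    """Plot capsule 0's x output on shifted images against its x output
    on the original images. If the capsule has learned a positional code,
    the points should fall near the line y = x + dx."""
    x0 = np.array([x_output(im) for im in images])
    x1 = np.array([x_output(shift_image(im, (dx, 0))) for im in images])
    lo, hi = x0.min(), x0.max()
    plt.scatter(x0, x1, c='red', s=8)
    plt.plot([lo, hi], [lo + dx, hi + dx], 'b-', label='y = x + %d' % dx)
    plt.plot([lo, hi], [lo, hi], 'g-', label='y = x')
    plt.xlabel('x output on original image')
    plt.ylabel('x output on shifted image')
    plt.legend()
    plt.show()
\end{verbatim}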

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../thesis"
%%% End: