This thesis explores a variant of the bag-of-visual-words framework with a large fraction of
unsupervised learning to predict the presence or absence of objects in images. We extract local
image features with different SIFT detector and descriptor implementations from the PASCAL
VOC 2007 dataset. Following the bag-of-visual-words assumption, we quantize these features into
visual words and count their occurrences per image using Sculley's Mini-batch k-Means. We then train
Neural Networks with a Replicated Softmax input layer and a multilabel classification output layer.
We also apply these Neural Networks, with a multiclass (softmax) output layer, to the task of
document classification.
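The quantization step can be sketched roughly as follows. This is a minimal illustration using scikit-learn's MiniBatchKMeans rather than the thesis code; the toy random "descriptors" and all variable names are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, n_words=1000, batch_size=1000, seed=0):
    """Cluster local descriptors into a visual vocabulary with mini-batch k-means."""
    km = MiniBatchKMeans(n_clusters=n_words, batch_size=batch_size,
                         n_init=3, random_state=seed)
    km.fit(descriptors)
    return km

def image_histogram(km, image_descriptors):
    """Assign each descriptor to its nearest visual word and count occurrences."""
    words = km.predict(image_descriptors)
    return np.bincount(words, minlength=km.n_clusters)

# Toy example: random 128-d vectors stand in for real SIFT descriptors.
rng = np.random.default_rng(0)
all_desc = rng.random((500, 128))
km = build_vocabulary(all_desc, n_words=20, batch_size=100)
hist = image_histogram(km, all_desc[:50])  # visual word counts for one "image"
```

The resulting count vector per image is exactly the histogram input the later models consume.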
Major contributions of this thesis encompass a detailed mathematical derivation and implementation
of the Replicated Softmax (RSM) model presented by Salakhutdinov and Hinton, as
well as a detailed mathematical derivation of Welling et al.'s Exponential Family Harmoniums
(EFH). We report classification results on the 20 Newsgroups dataset that are competitive with,
for example, the directed DiscLDA model. Moreover, Neural Networks with RSM input layers significantly
outperform standard feed-forward Neural Networks on the PASCAL VOC 2007 image object
classification challenge (in terms of mean Average Precision).
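One alternating Gibbs step of the RSM, following Salakhutdinov and Hinton's formulation, can be sketched as below: the hidden bias is scaled by the document length D, and the visible side is a softmax over words sampled D times. Array names and the toy dimensions are assumptions, not the thesis implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rsm_gibbs_step(v, W, b_vis, b_hid, rng):
    """One alternating Gibbs step of the Replicated Softmax model.

    v: (K,) word-count vector, W: (K, F) weights, b_vis: (K,), b_hid: (F,).
    """
    D = int(v.sum())
    # Hidden units: p(h_j = 1 | v) = sigma(sum_k v_k W_kj + D * a_j)
    p_h = sigmoid(v @ W + D * b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Visible units: softmax over the K words, sampled D times (multinomial)
    logits = W @ h + b_vis
    p_v = np.exp(logits - logits.max())
    p_v /= p_v.sum()
    v_new = rng.multinomial(D, p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(1)
K, F = 10, 4                       # toy vocabulary and hidden sizes
W = rng.normal(scale=0.1, size=(K, F))
v0 = rng.multinomial(30, np.ones(K) / K).astype(float)  # a 30-word "document"
v1, h = rsm_gibbs_step(v0, W, np.zeros(K), np.zeros(F), rng)
```

Note that the reconstruction v1 always sums to the same document length D as the input, which is what makes the model well suited to count histograms of varying total size.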
Finally, we present the DualRSM model for image object classification, which adds a second
wing of visible units to the RSM model and hence enables it to combine two different histogram
inputs. In particular, we train it on visual word counts together with the corresponding histogram
of all-to-all distances between the visual words. This is an attempt to incorporate information
about the spatial relationships among the visual words, i.e. to mitigate the strongly simplifying
bag-of(-visual)-words assumption.
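The second input wing can be sketched as a histogram over all pairwise keypoint distances. This is a plain NumPy illustration; the bin count, range handling, and function name are assumptions rather than the thesis implementation:

```python
import numpy as np

def distance_histogram(keypoints, n_bins=16, max_dist=None):
    """Histogram of all-to-all Euclidean distances between keypoint locations.

    keypoints: (N, 2) array of (x, y) positions. Returns n_bins integer counts,
    a coarse summary of the spatial layout that plain bag-of-words discards.
    """
    diff = keypoints[:, None, :] - keypoints[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    # Keep each unordered pair once (strict upper triangle).
    iu = np.triu_indices(len(keypoints), k=1)
    pair_dists = dists[iu]
    if max_dist is None:
        max_dist = pair_dists.max()
    counts, _ = np.histogram(pair_dists, bins=n_bins, range=(0.0, max_dist))
    return counts

rng = np.random.default_rng(2)
pts = rng.random((40, 2)) * 100    # toy keypoint coordinates in a 100x100 image
hist = distance_histogram(pts, n_bins=16)
```

Since this is again a count histogram of fixed total mass per image, it has the same form as the visual word counts and can feed the second visible wing directly.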
Download: landthal2011.pdf
The Python implementation of the Replicated Softmax model can be found here: Replicated Softmax implementation.
Master's thesis in Informatics, advised by Christian Osendorfer at TU München, I6.
You can reach me here: joerg [-at-] fylance [-.-] de