
Ten Trending Academic Papers on the Future of Computer Vision

These papers provide a breadth of information about Computer Vision that is generally useful and interesting from a data science perspective.


Contents

• Learning Individual Styles of Conversational Gesture
• Textured Neural Avatars
• DSFD: Dual Shot Face Detector
• GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction
• DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images
• Inverse Cooking: Recipe Generation from Food Images
• ArcFace: Additive Angular Margin Loss for Deep Face Recognition
• Fast Online Object Tracking and Segmentation: A Unifying Approach
• Revealing Scenes by Inverting Structure from Motion Reconstructions
• Semantic Image Synthesis with Spatially-Adaptive Normalization


Learning Individual Styles of Conversational Gesture

Shiry Ginosar* (UC Berkeley), Amir Bar* (Zebra Medical Vision), Gefen Kohavi (UC Berkeley), Caroline Chan (MIT), Andrew Owens (UC Berkeley), Jitendra Malik (UC Berkeley)
* Indicates equal contribution.
arXiv:1906.04160v1 [cs.CV] 10 Jun 2019

Figure 1: Speech-to-gesture translation example. In this paper, we study the connection between conversational gesture and speech. Here, we show the result of our model that predicts gesture from audio. From the bottom upward: the input audio, arm and hand pose predicted by our model, and video frames synthesized from pose predictions using [10]. (See http://people.eecs.berkeley.edu/~shiry/speech2gesture for video results.)

Abstract

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures.

1. Introduction

When we talk, we convey ideas via two parallel channels of communication—speech and gesture. These conversational, or co-speech, gestures are the hand and arm motions we spontaneously emit when we speak [34]. They complement speech and add non-verbal information that helps our listeners comprehend what we say [6]. Kendon [23] places conversational gestures at one end of a continuum, with sign language, a true language, at the other end. In between the two extremes are pantomime and emblems like "Italianate" gestures, with an agreed-upon vocabulary and culture-specific meanings. A gesture can be subdivided into phases describing its progression from the speaker's rest position, through the gesture preparation, stroke, hold and retraction back to rest.

Is the information conveyed in speech and gesture correlated? This is a topic of ongoing debate. The hand-in-hand hypothesis claims that gesture is redundant to speech when speakers refer to subjects and objects in scenes [43]. In contrast, according to the trade-off hypothesis, speech and gesture are complementary, since people use gesture when speaking would require more effort and vice versa [15]. We approach the question from a data-driven learning perspective and ask to what extent we can predict gesture motion from the raw audio signal of speech.
Figure 2: Speaker-specific gesture dataset (one panel per speaker: Almaram, Angelica, Kubinec, Covach, Kagan, Conan, Oliver, Stewart, Meyers, Ellen). We show a representative video frame for each speaker in our dataset. Below each one is a heatmap depicting the frequency with which their arms and hands appear in different spatial locations (using the skeletal representation of gestures shown in Figure 1). This visualization reveals the speaker's resting pose and how they tend to move—for example, Angelica tends to keep her hands folded, whereas Kubinec frequently points towards the screen with his left hand. Note that some speakers, like Kagan, Conan and Ellen, alternate between sitting and standing and thus the distribution of their arm positions is bimodal.

We present a method for temporal cross-modal translation. Given an input audio clip of a spoken statement (Figure 1 bottom), we generate a corresponding motion of the speaker's arms and hands which matches the style of the speaker, despite the fact that we have never seen or heard this person say this utterance in training (Figure 1 middle). We then use an existing video synthesis method to visualize what the speaker might have looked like when saying these words (Figure 1 top).

To generate motion from speech, we must learn a mapping between audio and pose. While this can be formulated as translation, in practice there are two inherent challenges to using the natural pairing of audio-visual data in this setting. First, gesture and speech are asynchronous, as gesture can appear before, after or during the corresponding utterance [4]. Second, this is a multimodal prediction task, as speakers may perform different gestures while saying the same thing on different occasions. Moreover, acquiring human annotations for large amounts of video is infeasible. We therefore need to get a training signal from pseudo ground truth of 2D human pose detections on unlabeled video.

Nevertheless, we are able to translate speech to gesture in an end-to-end fashion from the raw audio to a sequence of poses. To overcome the asynchronicity issue we use a large temporal context (both past and future) for prediction. Temporal context also allows for smooth gesture prediction despite the noisy automatically-annotated pseudo ground truth. Due to multimodality, we do not expect our predicted motion to be the same as the ground truth. However, as this is the only training signal we have, we still use automatic pose detections for learning through regression. To avoid regressing to the mean of all modes, we apply an adversarial discriminator [19] to our predicted motion. This ensures that we produce motion that is "real" with respect to the current speaker.

Gesture is idiosyncratic [34], as different speakers tend to use different styles of motion (see Figure 2). It is therefore important to learn a personalized gesture model for each speaker. To address this, we present a large, 144-hour person-specific video dataset of 10 speakers that we make publicly available at http://people.eecs.berkeley.edu/~shiry/speech2gesture. We deliberately pick a set of speakers for which we can find hours of clean single-speaker footage. Our speakers come from a diverse set of backgrounds: television show hosts, university lecturers and televangelists. They span at least three religions and discuss a large range of topics from commentary on current affairs through the philosophy of death, chemistry and the history of rock music, to readings in the Bible and the Qur'an.
2. Related Work

Conversational Gestures McNeill [34] divides gestures into several classes [34]: emblematics have specific conventional meanings (e.g. "thumbs up!"); iconics convey physical shapes or direction of movements; metaphorics describe abstract content using concrete motion; deictics are pointing gestures, and beats are repetitive, fast hand motions that provide a temporal framing to speech.

Many psychologists have studied questions related to co-speech gestures [34, 23] (see [46] for a review). This vast body of research has mostly relied on studying a small number of individual subjects using recorded choreographed story retelling in lab settings. Analysis in these studies was a manual process. Our goal, instead, is to study conversational gestures in the wild using a data-driven approach.

Conditioning gesture prediction on speech is arguably an ambiguous task, since gesture and speech may not be synchronous. While McNeill [34] suggests that gesture and speech originate from a common source and thus should co-occur in time according to well-defined rules, Kendon [23] suggests that gesture starts before the corresponding utterance. Others even argue that the temporal relationships between speech and gesture are not yet clear and that gesture can appear before, after or during an utterance [4].

Sign language and emblematic gesture recognition There has been a great deal of computer vision work geared towards recognizing sign language gestures from video. This includes methods that use video transcripts as a weak source of supervision [3], as well as recent methods based on CNNs [37, 26] and RNNs [13]. There has also been work that recognizes emblematic hand and face gestures [17, 14], head gestures [35], and co-speech gestures [38]. By contrast, our goal is to predict co-speech gestures from audio.

Conversational agents Researchers have proposed a number of methods for generating plausible gestures, particularly for applications with conversational agents [8]. In early work, Cassell et al. [7] proposed a system that guided arm/hand motions based on manually defined rules. Subsequent rule-based systems [27] proposed new ways of expressing gestures via annotations.

More closely related to our approach are methods that learn gestures from speech and text, without requiring an author to hand-specify rules. Notably, [9] synthesized gestures using natural language processing of spoken text, and Neff [36] proposed a system for making person-specific gestures. Levine et al. [30] learned to map acoustic prosody features to motion using a HMM. Later work [29] extended this approach to use reinforcement learning and speech recognition, combined acoustic analysis with text [33], created hybrid rule-based systems [40], and used restricted Boltzmann machines for inference [11]. Since the goal of these methods is to generate motions for virtual agents, they use lab-recorded audio, text, and motion capture. This allows them to use simplifying assumptions that present challenges for in-the-wild video analysis like ours: e.g., [30] requires precise 3D pose and assumes that motions occur on syllable boundaries, and [11] assumes that gestures are initiated by an upward motion of the wrist. In contrast with these methods, our approach does not explicitly use any text or language information during training—it learns gestures from raw audio-visual correspondences—nor does it use hand-defined gesture categories: arm/hand pose are predicted directly from audio.

Visualizing predicted gestures One of the most common ways of visualizing gestures is to use them to animate a 3D avatar [45, 29, 20]. Since our work studies personalized gestures for in-the-wild videos, where 3D data is not available, we use a data-driven synthesis approach inspired by Bregler et al. [2]. To do this, we employ the pose-to-video method of Chan et al. [10], which uses a conditional generative adversarial network (GAN) to synthesize videos of human bodies from pose.

Sound and vision Aytar et al. [1] use the synchronization of visual and audio signals in natural phenomena to learn sound representations from unlabeled in-the-wild videos. To do this, they transfer knowledge from trained discriminative models in the visual domain to the audio domain. Synchronization of audio and visual features can also be used for synthesis. Langlois et al. [28] try to optimize for synchronous events by generating rigid-body animations of objects falling or tumbling that temporally match an input sound wave of the desired sequence of contact events with the ground plane. More recently, Shlizerman et al. [42] animated the hands of a 3D avatar according to input music. However, their focus was on music performance, rather than gestures, and consequently the space of possible motions was limited (e.g., the zig-zag motion of a violin bow). Moreover, while music is uniquely defined by the motion that generates it (and is synchronous with it), gestures are neither unique to, nor synchronous with speech utterances.

Several works have focused on the specific task of synthesizing videos of faces speaking, given audio input. Chung et al. [12] generate an image of a talking face from a still image of the speaker and an input speech segment by learning a joint embedding of the face and audio. Similarly, [44] synthesizes videos of Obama saying novel words by using a recurrent neural network to map speech audio to mouth shapes and then embedding the synthesized lips in ground truth facial video. While both methods enable the creation of fake content by generating faces saying words taken from a different person, we focus on single-person models that are optimized for animating same-speaker utterances. Most importantly, generating gesture, rather than lip motion, from speech is more involved as gestures are asynchronous with speech, multimodal and person-specific.

3. A Speaker-Specific Gesture Dataset

We introduce a large 144-hour video dataset specifically tailored to studying speech and gesture of individual speakers in a data-driven fashion. As shown in Figure 2, our dataset contains in-the-wild videos of 10 gesturing speakers that were originally recorded for television shows or university lectures. We collect several hours of video per speaker, so that we can individually model each one. We chose speakers that cover a wide range of topics and gesturing styles. Our dataset contains: 5 talk show hosts, 3 lecturers and 2 televangelists. Details about data collection and processing, as well as an analysis of the individual styles of gestures, can be found in the supplementary material.

Gesture representation and annotation We represent the speakers' pose over time using a temporal stack of 2D skeletal keypoints, which we obtain using OpenPose [5]. From the complete set of keypoints detected by OpenPose, we use the 49 points corresponding to the neck, shoulders, elbows, wrists and hands to represent gestures. Together with the video footage, we provide the skeletal keypoints for each frame of the data at 15 fps. Note, however, that these are not ground truth annotations, but a proxy for the ground truth from a state-of-the-art pose detection system.

Quality of dataset annotations All ground truth, whether from human observers or otherwise, has associated error. The pseudo ground truth we collect using automatic pose detection may have much larger error than human annotations, but it enables us to train on much larger amounts of data. Still, we must estimate whether the accuracy of the pseudo ground truth is good enough to support our quantitative conclusions. We compare the automatic pose detections to labels obtained from human observers on a subset of our training data and find that the pseudo ground truth is close to human labels and that the error in the pseudo ground truth is small enough for our task. The full experiment is detailed in our supplementary material.

4. Method

Given raw audio of speech, our goal is to generate the speaker's corresponding arm and hand gesture motion. We approach this task in two stages—first, since the only signal we have for training are corresponding audio and pose detection sequences, we learn a mapping from speech to gesture using L1 regression to temporal stacks of 2D keypoints. Second, to avoid regressing to the mean of all possible modes of gesture, we employ an adversarial discriminator that ensures that the motion we produce is plausible with respect to the typical motion of the speaker.

Figure 3: Speech to gesture translation model. A convolutional audio encoder downsamples the 2D spectrogram and transforms it to a 1D signal. The translation model, G, then predicts a corresponding temporal stack of 2D poses. L1 regression to the ground truth poses provides a training signal, while an adversarial discriminator, D, ensures that the predicted motion is both temporally coherent and in the style of the speaker.

4.1. Speech-to-Gesture Translation

Any realistic gesture motion must be temporally coherent and smooth. We accomplish smoothness by learning an audio encoding which is a representation of the whole utterance, taking into account the full temporal extent of the input speech, s, and predicting the whole temporal sequence of corresponding poses, p, at once (rather than recurrently). Our fully convolutional network consists of an audio encoder followed by a 1D UNet [39, 22] translation architecture, as shown in Figure 3. The audio encoder takes a 2D log-mel spectrogram as input, and downsamples it through a series of convolutions, resulting in a 1D signal with the same sampling rate as our video (15 Hz). The UNet translation architecture then learns to map this signal to a temporal stack of pose vectors (see Section 3 for details of our gesture representation) via an L1 regression loss:

L_{L1}(G) = E_{s,p}[ ||p - G(s)||_1 ].   (1)

We use a UNet architecture for translation since its bottleneck provides the network with past and future temporal context, while the skip connections allow for high frequency temporal information to flow through, enabling prediction of fast motion.
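To make the pipeline in Section 4.1 concrete, below is a minimal PyTorch-style sketch of the pieces it describes: a convolutional audio encoder that reduces a 2D log-mel spectrogram to a 1D signal at the video frame rate, a small 1D UNet that maps that signal to a temporal stack of pose vectors, and the L1 regression loss of Eq. (1). Layer sizes, channel counts, and the spectrogram shape are illustrative placeholders, not the authors' released configuration; only the 49-keypoint, 64-frame output convention is taken from the text.

```python
# Minimal sketch of the speech-to-gesture translator (Sec. 4.1), not the authors' code.
# Assumed shapes: spectrogram (B, 1, n_mels, T_audio) -> pose stack (B, 2*49, 64).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Downsamples a 2D spectrogram into a 1D signal at the video rate (15 Hz)."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2d(128, out_channels, 3, stride=(2, 2), padding=1), nn.ReLU(),
        )
        self.freq_pool = nn.AdaptiveAvgPool2d((1, None))  # collapse frequency axis

    def forward(self, spec, t_pose=64):
        h = self.conv2d(spec)                     # (B, C, n_mels', T')
        h = self.freq_pool(h).squeeze(2)          # (B, C, T')
        # resample to one feature vector per output video frame
        return F.interpolate(h, size=t_pose, mode="linear", align_corners=False)

class UNet1d(nn.Module):
    """1D UNet: the bottleneck sees past/future context, skips preserve fast motion."""
    def __init__(self, channels=256, out_dim=2 * 49):  # 49 keypoints, (x, y) each
        super().__init__()
        self.down1 = nn.Conv1d(channels, channels, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(channels, channels, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(channels, channels, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(2 * channels, channels, 4, stride=2, padding=1)
        self.head = nn.Conv1d(2 * channels, out_dim, 1)

    def forward(self, x):
        d1 = F.relu(self.down1(x))
        d2 = F.relu(self.down2(d1))
        u2 = F.relu(self.up2(d2))
        u1 = F.relu(self.up1(torch.cat([u2, d1], dim=1)))
        return self.head(torch.cat([u1, x], dim=1))   # (B, out_dim, T)

def l1_regression_loss(pred_poses, gt_poses):
    """Eq. (1): L1 between predicted and pseudo ground-truth pose stacks."""
    return (pred_poses - gt_poses).abs().mean()

if __name__ == "__main__":
    spec = torch.randn(8, 1, 64, 256)   # a batch of ~4-second audio clips
    poses = torch.randn(8, 98, 64)      # 64 frames of 49 (x, y) keypoints
    enc, unet = AudioEncoder(), UNet1d()
    pred = unet(enc(spec))
    print(pred.shape, l1_regression_loss(pred, poses).item())
```

The essential design choice mirrored here is that the whole utterance is encoded and the whole pose sequence is predicted in one shot, so the bottleneck can exploit both past and future audio context.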
4.2. Predicting Plausible Motion

While L1 regression to keypoints is the only way we can extract a training signal from our data, it suffers from the known issue of regression to the mean, which produces overly smooth motion. This can be seen in our supplementary video results. To combat the issue and ensure that we produce realistic motion, we add an adversarial discriminator [22, 10] D, conditioned on the differences of the predicted sequence of poses, i.e. the input to the discriminator is the vector m = [p_2 - p_1, ..., p_T - p_{T-1}], where p_i are 2D pose keypoints and T is the temporal extent of the input audio and predicted pose sequence. The discriminator D tries to maximize the following objective while the generator G (translation architecture, Section 4.1) tries to minimize it:

L_{GAN}(G, D) = E_m[ log D(m) ] + E_s[ log(1 - D(G(s))) ],   (2)

where s is the input audio speech segment and m is the motion derivative of the predicted stack of poses. Thus, the generator learns to produce real-seeming speaker motion while the discriminator learns to classify whether a given motion sequence is real. Our full objective is therefore:

min_G max_D  L_{GAN}(G, D) + λ L_{L1}(G).   (3)

4.3. Implementation Details

We obtain translation invariance by subtracting (per frame) the neck keypoint location from all other keypoints in our pseudo ground truth gesture representation (Section 3). We then normalize each keypoint (e.g. left wrist) across all frames by subtracting the per-speaker mean and dividing by the standard deviation. During training, we take as input spectrograms corresponding to about 4 seconds of audio and predict 64 pose vectors, which correspond to about 4 seconds at a 15 Hz frame rate. At test time we can run our network on arbitrary audio durations. We optimize using Adam [24] with a batch size of 32 and a learning rate of 10^-4. We train for 300K/90K iterations with and without an adversarial loss, respectively, and select the best performing model on the validation set.
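The following sketch ties Sections 4.2 and 4.3 together: keypoints are made translation-invariant by subtracting the neck joint and standardized per speaker, and the generator is trained with L1 regression plus an adversarial term whose discriminator sees frame-to-frame pose differences. It is an illustration under assumptions, not the released training code: the discriminator architecture and λ are placeholders, and the stable binary-cross-entropy-with-logits form is used as a standard surrogate for the log-likelihood terms of Eq. (2).

```python
# Illustrative preprocessing and training step for Eqs. (2)-(3); shapes assumed below.
# Conventions assumed: poses has shape (B, T, K, 2) and the neck is keypoint index 0.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_poses(poses, speaker_mean, speaker_std, neck_idx=0):
    """Subtract the per-frame neck location, then standardize per speaker (Sec. 4.3)."""
    poses = poses - poses[:, :, neck_idx:neck_idx + 1, :]   # translation invariance
    return (poses - speaker_mean) / (speaker_std + 1e-8)

def motion_differences(pose_stack):
    """m = [p2 - p1, ..., pT - p_{T-1}]: the discriminator's input (Sec. 4.2)."""
    return pose_stack[:, 1:] - pose_stack[:, :-1]

class MotionDiscriminator(nn.Module):
    """Placeholder 1D convolutional discriminator over motion differences."""
    def __init__(self, pose_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(pose_dim, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 1),
        )

    def forward(self, m):                              # m: (B, T-1, K*2)
        return self.net(m.transpose(1, 2)).mean(dim=(1, 2))   # one score per clip

def training_step(generator, discriminator, g_opt, d_opt, spec, real_poses, lam=1.0):
    """One alternating discriminator/generator update (a sketch of Eq. 3)."""
    B, T = real_poses.shape[:2]
    fake = generator(spec).transpose(1, 2).reshape(B, T, -1)   # (B, T, K*2)
    real = real_poses.reshape(B, T, -1)

    # Discriminator: real motion differences vs. generated ones.
    d_opt.zero_grad()
    d_real = discriminator(motion_differences(real))
    d_fake = discriminator(motion_differences(fake.detach()))
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator + L1 regression to pseudo ground truth.
    g_opt.zero_grad()
    g_adv = discriminator(motion_differences(fake))
    g_loss = F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv)) + \
             lam * (fake - real).abs().mean()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```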
5. Experiments

We show that our method produces motion that quantitatively outperforms several baselines, as well as a previous method that we adapt to the problem.

5.1. Setup

We describe our experimental setup including our baselines for comparison and evaluation metric.

5.1.1 Baselines

We compare our method to several other models.

Always predict the median pose Speakers spend most of their time in rest position [23], so predicting the speaker's median pose can be a high-quality baseline. For a visualization of each speaker's rest position, see Figure 2.

Predict a randomly chosen gesture In this baseline, we randomly select a different gesture sequence (which does not correspond to the input utterance) from the training set of the same speaker, and use this as our prediction. While we would not expect this method to perform well quantitatively, there is reason to think it would generate qualitatively appealing motion: these are real speaker gestures—the only way to tell they are fake is to evaluate how well they correspond to the audio.

Nearest neighbors Instead of selecting a completely random gesture sequence from the same speaker, we can use audio as a similarity cue. For an input audio track, we find its nearest neighbor for the speaker using pretrained audio features, and transfer its corresponding motion. To represent the audio, we use the state-of-the-art VGGish feature embedding [21] pretrained on AudioSet [18], and use cosine distance on normalized features.

RNN-based model [42] We further compare our motion prediction to an RNN architecture proposed by Shlizerman et al. Similar to us, Shlizerman et al. predict arm and hand motion from audio in a 2D skeletal keypoint space. However, while our model is a convolutional neural network with log-mel spectrogram input, theirs uses a 1-layer LSTM model that takes MFCC features (a low-dimensional, hand-crafted audio feature representation) as input. We evaluated both feature types and found that for [42], MFCC features outperform the log-mel spectrogram features on all speakers. We therefore use their original MFCC features in our experiments. For consistency with our own model, instead of measuring L2 distance on PCA features, as they do, we add an extra hidden layer and use L1 distance.

Ours, no GAN Finally, as an ablation, we compare our full model to the prediction of the translation architecture alone, without the adversarial discriminator.

5.1.2 Evaluation Metrics

Our main quantitative evaluation metric is the L1 regression loss of the different models in comparison. We additionally report results according to the percent of correct keypoints (PCK) [47], a widely accepted metric for pose detection. Here, a predicted keypoint is defined as correct if it falls within α max(h, w) pixels of the ground truth keypoint, where h and w are the height and width of the person bounding box, respectively.

We note that PCK was designed for localizing object parts, whereas we use it here for a cross-modal prediction task (predicting pose from audio). First, unlike L1, PCK is not linear and correctness scores fall to zero outside a hard threshold. Since our goal is not to predict the ground truth motion but rather to use it as a training signal, L1 is more suited to measuring how we perform on average. Second, PCK is sensitive to large gesture motion as the correctness radius depends on the width of the span of the speaker's arms. While [47] suggest α = 0.1 for data with full people and α = 0.2 for data where only half the person is visible, we take an average over α = 0.1, 0.2 and show the full results in the supplementary.
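The PCK computation described above is short to state in code. The sketch below follows the definition in the text (a keypoint is correct if it lies within α·max(h, w) pixels of the ground truth, with scores averaged over α = 0.1 and 0.2); the array layout is an assumption.

```python
# PCK (percent of correct keypoints) as described in Sec. 5.1.2 -- illustrative only.
import numpy as np

def pck(pred, gt, bbox_hw, alphas=(0.1, 0.2)):
    """pred, gt: (N, K, 2) keypoint arrays; bbox_hw: (N, 2) person box (height, width)."""
    dists = np.linalg.norm(pred - gt, axis=-1)               # (N, K) pixel distances
    thresh = np.max(bbox_hw, axis=1, keepdims=True)          # (N, 1) = max(h, w)
    scores = [np.mean(dists <= a * thresh) for a in alphas]  # fraction of correct keypoints
    return float(np.mean(scores))                            # averaged over alpha values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(0, 200, size=(100, 49, 2))
    pred = gt + rng.normal(scale=10.0, size=gt.shape)
    print(pck(pred, gt, bbox_hw=np.full((100, 2), 200.0)))
```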
5.2. Quantitative Evaluation

We compare the results of our method to the baselines using our quantitative metrics. To assess whether our results are perceptually convincing, we conduct a user study. Finally, we ask whether the gestures we predict are person-specific and whether the input speech is indeed a better predictor of motion than the initial pose of the gesture.

5.2.1 Numerical Comparison

We compare to all baselines on 2,048 randomly chosen test set intervals per speaker and display the results in Table 1. We see that on most speakers, our model outperforms all others, where our no-GAN condition is slightly better than the GAN one. This is expected, as the adversarial discriminator pushes the generator to snap to a single mode of the data, which is often further away from the actual ground truth than the mean predicted by optimizing L1 loss alone. Our model outperforms the RNN-based model on most speakers. Qualitatively, we find that this baseline predicts relatively small motions on our data, which may be due to the fact that it has relatively low capacity compared to our UNet model.

5.2.2 Human Study

To gain insight into how synthesized gestures perceptually compare to real motion, we conducted a small-scale real vs. fake perceptual study on Amazon Mechanical Turk. We used two speakers who are always shot from the same camera viewpoint: Oliver, whose gestures are relatively dynamic, and Meyers, who is relatively stationary. We visualized gesture motion using videos of skeletal wire frames. To provide participants with additional context, we included the ground truth mouth and facial keypoints of the speaker in the videos. We show examples of skeletal wire frame videos in our video supplementary material.

Participants watched a series of video pairs. In each pair, one video was produced from a real pose sequence; the other was generated by an algorithm—our model or a baseline. Participants were then asked to identify the video containing the motion that corresponds to the speech sound (we did not verify that they in fact listened to the speech while answering the question). Videos of 4 seconds or 12 seconds each of resolution 400×226 (downsampled from 910×512 in order to fit two videos side-by-side on different screen sizes) were shown, and after each pair, participants were given unlimited time to respond. We sampled 100 input audio intervals at random and predicted from them a 2D-keypoint motion sequence using each method. Each task consisted of 20 pairs of videos and was performed by 300 different participants. Each participant was given a short training set of 10 video pairs before the start of the task, and was given feedback indicating whether they had correctly identified the ground-truth motion.

We compared all the gesture-prediction models (Section 5.1.1) and assessed the quality of each method using the rate at which its output fooled the participants. Interestingly, we found that for the dynamic speaker all methods that generate realistic motion fooled humans at similar rates. As shown in Table 2, our results for this speaker were comparable to real motion sequences, whether selected by an audio-based nearest neighbor approach or randomly. For the stationary speaker, who spends most of the time in rest position, real motion was more often selected as there is no prediction error associated with it. While the nearest neighbor and random motion models are significantly less accurate quantitatively (Table 1), they are perceptually convincing because their components are realistic.

Figure 4: Our trained models are person-specific. For every speaker audio input (row) we apply all other individually trained speaker models (columns). Color saturation corresponds to L1 loss values on a held out test set (lower is better). For each row, the entry on the diagonal is lightest as models work best using the input speech of the person they were trained on.

5.2.3 The Predicted Gestures are Person-Specific

For every speaker's speech input (Figure 4 rows), we predict gestures using all other speakers' trained models (Figure 4 columns). We find that on average, predicting using our model trained on a different speaker performs better numerically than predicting random motion, but significantly worse than always predicting the median pose of the input speaker (and far worse than the predictions from the model trained on the input speaker). The diagonal structure of the confusion matrix in Figure 4 exemplifies this.

5.2.4 Speech is a Good Predictor for Gesture

Seeing the success of our translation model, we ask how much the audio signal helps when the initial pose of the gesture sequence is known. In other words, how much can sound tell us beyond what can be predicted from motion dynamics? To study this, we augment our model by providing it the pose of the speaker directly preceding their speech, which we incorporate into the bottleneck of the UNet (Figure 3). We consider the following conditions: Predict median pose, as in the baselines above. Predict the input initial pose, a model that simply repeats the input initial ground-truth pose as its prediction. Speech input, our model. Initial pose input, a variation of our model in which the audio input is ablated and the network predicts the future pose from only an initial ground-truth pose input. And Speech & initial pose input, where we condition the prediction on both the speech and the initial pose.

Table 3 displays the results of the comparison for our model trained without the adversarial discriminator (no GAN). When comparing the Initial pose input and Speech & initial pose input conditions, we find that the addition of speech significantly improves accuracy when we average the loss across all speakers (p < 10^-3 using a two-sided t-test). Interestingly, we find that most of the gains come from a small number of speakers (e.g. Oliver) who make large motions during speech.
Model         | Meyers | Oliver | Conan | Stewart | Ellen | Kagan | Kubinec | Covach | Angelica | Almaram | Avg. L1 | Avg. PCK
Median        | 0.66   | 0.69   | 0.79  | 0.63    | 0.75  | 0.80  | 0.80    | 0.70   | 0.74     | 0.76    | 0.73    | 38.11
Random        | 0.93   | 1.00   | 1.10  | 0.94    | 1.07  | 1.11  | 1.12    | 1.00   | 1.04     | 1.08    | 1.04    | 26.55
NN [21]       | 0.88   | 0.96   | 1.05  | 0.93    | 1.02  | 1.11  | 1.10    | 0.99   | 1.01     | 1.06    | 1.01    | 27.92
RNN [42]      | 0.61   | 0.66   | 0.76  | 0.62    | 0.71  | 0.74  | 0.73    | 0.72   | 0.72     | 0.75    | 0.70    | 39.69
Ours, no GAN  | 0.57   | 0.60   | 0.63  | 0.61    | 0.71  | 0.72  | 0.68    | 0.69   | 0.75     | 0.76    | 0.67    | 44.62
Ours, GAN     | 0.77   | 0.63   | 0.64  | 0.68    | 0.81  | 0.74  | 0.70    | 0.72   | 0.78     | 0.83    | 0.73    | 41.95

Table 1: Quantitative results for the speech to gesture translation task using L1 loss (lower is better) on the test set. The rightmost column is the average PCK value (higher is better) over all speakers and α = 0.1, 0.2 (see full results in the supplementary).

Model         | Oliver, 4 s | Oliver, 12 s | Meyers, 4 s | Meyers, 12 s
Median        | 12.1 ± 2.8  | 6.7 ± 2.0    | 34.0 ± 4.2  | 25.8 ± 3.9
Random        | 34.2 ± 4.0  | 29.1 ± 3.7   | 40.9 ± 4.6  | 34.3 ± 4.4
NN [21]       | 36.9 ± 3.9  | 26.4 ± 3.8   | 43.5 ± 4.5  | 33.3 ± 4.4
RNN [42]      | 18.2 ± 3.2  | 10.0 ± 2.5   | 37.5 ± 4.6  | 19.4 ± 3.6
Ours, no GAN  | 25.0 ± 3.8  | 19.8 ± 3.4   | 36.1 ± 4.3  | 33.1 ± 4.2
Ours, GAN     | 35.4 ± 4.0  | 27.8 ± 3.9   | 33.2 ± 4.4  | 22.0 ± 4.0

Table 2: Human study results for the speech to gesture translation task on 4- and 12-second video clips of two speakers—one dynamic (Oliver) and one relatively stationary (Meyers). As a metric for comparison, we use the percentage of times participants were fooled by the generated motions and picked them as real over the ground truth motion in a two-alternative forced choice. We found that humans were not sensitive to the alignment of speech and gesture. For the dynamic speaker, gestures with realistic motion—whether randomly selected from another video of the same speaker or generated by our GAN-based model—fooled humans at equal rates (no statistically significant difference between the bolded numbers). Since the stationary speaker is usually at rest position, real unaligned motion sequences look more realistic as they do not suffer from prediction noise like the generated ones.

Model                                   | Avg. L1 | Avg. PCK
Pred.: Predict the median pose          | 0.73    | 38.11
Pred.: Predict the input initial pose   | 0.53    | 60.50
Input: Speech input                     | 0.67    | 44.62
Input: Initial pose input               | 0.49    | 61.24
Input: Speech & initial pose input      | 0.47    | 62.39

Table 3: How much information does sound provide once we know the initial pose of the speaker? We see that the initial pose of the gesture sequence is a good predictor for the rest of the 4-second motion sequence (second to last row), but that adding audio improves the prediction (last row). We use both average L1 loss (lower is better) and average PCK over all speakers and α = 0.1, 0.2 (higher is better) as metrics of comparison. We compare two baselines and three conditions of inputs.
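For reference, the "percentage of times participants were fooled" entries in Table 2 are proportions over two-alternative forced-choice trials. The snippet below shows one standard way to attach an uncertainty margin to such a proportion (a normal-approximation binomial interval); the paper does not state exactly how its ± margins were computed, so treat this purely as an illustration with hypothetical counts.

```python
# Fooling rate for a real-vs-fake 2AFC study, with an approximate 95% interval.
# Illustrative only; not necessarily how the margins in Table 2 were derived.
import math

def fooling_rate(n_fooled, n_trials, z=1.96):
    p = n_fooled / n_trials
    margin = z * math.sqrt(p * (1.0 - p) / n_trials)
    return 100.0 * p, 100.0 * margin   # percent fooled, +/- percent

if __name__ == "__main__":
    rate, margin = fooling_rate(n_fooled=62, n_trials=500)   # hypothetical counts
    print(f"{rate:.1f} +/- {margin:.1f} %")
```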

5.3. Qualitative Results

We qualitatively compare our speech to gesture translation results to the baselines and the ground truth gesture sequences in Figure 5. Please refer to our supplementary video results, which better convey temporal information.

Figure 5: Speech to gesture translation qualitative results. We show the input audio spectrogram and the predicted poses overlaid on the ground-truth video for Dr. Kubinec (lecturer) and Conan O'Brien (show host). See our supplementary material for more results.

6. Conclusion

Humans communicate through both sight and sound, yet the connection between these modalities remains unclear [23]. In this paper, we proposed the task of predicting person-specific gestures from "in-the-wild" speech as a computational means of studying the connections between these communication channels. We created a large person-specific video dataset and used it to train a model for predicting gestures from speech. Our model outperforms other methods in an experimental evaluation.
Despite its strong performance on these tasks, our model has limitations that can be addressed by incorporating insights from other work. For instance, using audio as input has its benefits compared to using textual transcriptions, as audio is a rich representation that contains information about prosody, intonation, rhythm, tone and more. However, audio does not directly encode high-level language semantics that may allow us to predict certain types of gesture (e.g. metaphorics), nor does it separate the speaker's speech from other sounds (e.g. audience laughter). Additionally, we treat pose estimations as though they were ground truth, which introduces a significant amount of noise—particularly on the speakers' fingers.

We see our work as a step toward a computational analysis of conversational gesture, opening three possible directions for further research. The first is in using gestures as a representation for video analysis: co-speech hand and arm motion make a natural target for video prediction tasks. The second is using in-the-wild gestures as a way of training conversational agents: we presented one way of visualizing gesture predictions, based on GANs [10], but, following classic work [8], these predictions could also be used to drive the motions of virtual agents. Finally, our method is one of only a handful of initial attempts to predict motion from audio. This cross-modal translation task is fertile ground for further research.

Acknowledgements: This work was supported, in part, by the AWS Cloud Credits for Research and the DARPA MediFor programs, and the UC Berkeley Center for Long-Term Cybersecurity. Special thanks to Alyosha Efros, the bestest advisor, and to Tinghui Zhou for his dreams of late-night talk show stardom.

References

[1] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016. 3
[2] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Computer Graphics and Interactive Techniques, SIGGRAPH, pages 353–360. ACM, 1997. 3
[3] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). In Computer Vision and Pattern Recognition (CVPR), pages 2961–2968. IEEE, 2009. 3
[4] B. Butterworth and U. Hadar. Gesture, speech, and computational stages: A reply to McNeill. Psychological Review, 96:168–74, Feb. 1989. 2, 3
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 4
[6] J. Cassell, D. McNeill, and K.-E. McCullough. Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics and Cognition, 7(1):1–34, 1999. 1
[7] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Computer Graphics and Interactive Techniques, SIGGRAPH, pages 413–420. ACM, 1994. 3
[8] J. Cassell, J. Sullivan, E. Churchill, and S. Prevost. Embodied conversational agents. MIT Press, 2000. 3, 8
[9] J. Cassell, H. H. Vilhjálmsson, and T. Bickmore. Beat: the behavior expression animation toolkit. In Life-Like Characters, pages 163–185. Springer, 2004. 3
[10] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody Dance Now. ArXiv e-prints, Aug. 2018. 1, 3, 4, 8
[11] C.-C. Chiu and S. Marsella. How to train your avatar: A data driven approach to gesture generation. In International Workshop on Intelligent Virtual Agents, pages 127–140. Springer, 2011. 3
[12] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? In British Machine Vision Conference, 2017. 3
[13] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden. Neural sign language translation. In Computer Vision and Pattern Recognition (CVPR). IEEE, June 2018. 3
[14] T. J. Darrell, I. A. Essa, and A. P. Pentland. Task-specific gesture analysis in real-time using interpolated views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1236–1242, Dec. 1996. 3
[15] J. P. de Ruiter, A. Bangerter, and P. Dings. The interplay between gesture and speech in the production of referring expressions: Investigating the tradeoff hypothesis. Topics in Cognitive Science, 4(2):232–248, Mar. 2012. 1
[16] D. F. Fouhey, W.-c. Kuo, A. A. Efros, and J. Malik. From lifestyle vlogs to everyday interactions. arXiv preprint arXiv:1712.02310, 2017. 10
[17] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. In Workshop on Automatic Face and Gesture Recognition. IEEE, June 1995. 3
[18] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In International Conference on Acoustics, Speech and Signal Processing, pages 776–780, Mar. 2017. 5
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014. 2
[20] A. Hartholt, D. Traum, S. C. Marsella, A. Shapiro, G. Stratou, A. Leuski, L.-P. Morency, and J. Gratch. All Together Now: Introducing the Virtual Human Toolkit. In 13th International Conference on Intelligent Virtual Agents, Edinburgh, UK, Aug. 2013. 3
[21] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson. CNN architectures for large-scale audio classification. In International Conference on Acoustics, Speech and Signal Processing, 2017. 5, 7
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017. 4
[23] A. Kendon. Gesture: Visible Action as Utterance. Cambridge University Press, 2004. 1, 3, 5, 7, 10, 11
[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5
[25] M. Kipp, M. Neff, K. H. Kipp, and I. Albrecht. Towards natural gesture synthesis: Evaluating gesture units in a data-driven approach to gesture synthesis. In C. Pelachaud, J.-C. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé, editors, Intelligent Virtual Agents, pages 15–28, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. 10
[26] O. Koller, H. Ney, and R. Bowden. Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In Computer Vision and Pattern Recognition (CVPR), pages 3793–3802. IEEE, 2016. 3
[27] S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker, K. R. Thórisson, and H. Vilhjálmsson. Towards a common framework for multimodal generation: The behavior markup language. In International Workshop on Intelligent Virtual Agents, pages 205–217. Springer, 2006. 3
[28] T. R. Langlois and D. L. James. Inverse-foley animation: Synchronizing rigid-body motions to sound. ACM Transactions on Graphics, 33(4):41:1–41:11, July 2014. 3
[29] S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun. Gesture controllers. In ACM Transactions on Graphics, volume 29, page 124. ACM, 2010. 3
[30] S. Levine, C. Theobalt, and V. Koltun. Real-time prosody-driven synthesis of body language. In ACM Transactions on Graphics, volume 28, page 172. ACM, 2009. 3
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), Zürich, 2014. Oral. 10
[32] R. C. B. Madeo, S. M. Peres, and C. A. de Moraes Lima. Gesture phase segmentation using support vector machines. Expert Systems with Applications, 56:100–115, 2016. 11
[33] S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. Virtual character performance from speech. In Symposium on Computer Animation, SCA, pages 25–35. ACM, 2013. 3
[34] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, 1992. 1, 2, 3, 10
[35] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007. 3
[36] M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics, 27(1):5:1–5:24, Mar. 2008. 3
[37] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Asian Conference on Computer Vision, pages 538–552. Springer, 2014. 3
[38] F. Quek, D. McNeill, R. Bryll, S. Duncan, X.-F. Ma, C. Kirbas, K. E. McCullough, and R. Ansari. Multimodal human discourse: gesture and speech. ACM Transactions on Computer-Human Interaction (TOCHI), 9(3):171–193, 2002. 3
[39] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. 4
[40] N. Sadoughi and C. Busso. Retrieving target gestures toward speech driven animation with meaningful behaviors. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, ICMI '15, pages 115–122. ACM, 2015. 3
[41] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, Feb. 1978. 11
[42] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman. Audio to body dynamics. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 3, 5, 7
[43] W. C. So, S. Kita, and S. Goldin-Meadow. Using the hands to identify who does what to whom: Gesture and speech go hand-in-hand. Cognitive Science, 33(1):115–125, Feb. 2009. 1
[44] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4):95:1–95:13, July 2017. 3
[45] M. Thiebaux, S. Marsella, A. N. Marshall, and M. Kallmann. Smartbody: Behavior realization for embodied conversational agents. In International Joint Conference on Autonomous Agents and Multiagent Systems, volume 1, pages 151–158. International Foundation for Autonomous Agents and Multiagent Systems, 2008. 3
[46] P. Wagner, Z. Malisz, and S. Kopp. Gesture and speech in interaction: An overview. Speech Communication, 57:209–232, 2014. 3
[47] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, Dec. 2013. 5

7. Appendix

7.1. Dataset

Data collection and processing We collected internet videos by querying YouTube for each speaker, and de-duplicated the data using the approach of [16]. We then used out-of-the-box face recognition and pose detection systems to split each video into intervals in which only the subject appears in frame and all detected keypoints are visible. Our dataset consists of 60,000 such intervals with an average length of 8.7 seconds and a standard deviation of 11.3 seconds. In total, there are 144 hours of video. We split the data into 80% train, 10% validation, and 10% test sets, such that each source video only appears in one set.

Quality of dataset annotations We estimate whether the accuracy of the pseudo ground truth is good enough to support our quantitative conclusions via the following experiment. We took a 200-frame subset of the pseudo ground truth used for training and had it labeled by 3 human observers with neck and arm keypoints. We quantified the consensus between annotators via σ_i, a standard deviation per keypoint-type i, as is typical in COCO [31] evaluation. We also computed ||op_i − µ_i||, the distance between the OpenPose detection and the mean of the annotations, and ||prediction − µ_i||, the distance between our audio-to-motion prediction and the annotation mean. We found that the pseudo ground truth is close to human labels, since 0.14 = E[||op_i − µ_i||] ≈ E[σ_i] = 0.06; and that the error in the pseudo ground truth is small enough for our task, since 0.25 = ||prediction − µ_i|| ≫ σ_i = 0.06. Note that this is a lower bound on the prediction error since it is computed on training data samples.
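The consistency check above boils down to two quantities per keypoint: the spread of the human annotations and the distance from the automatic detection (or the model prediction) to their mean. A small sketch of that comparison follows; the array shapes, the number of keypoints, and the use of mean distance as a proxy for the per-keypoint spread are assumptions, not the paper's exact evaluation script.

```python
# Annotator-consensus vs. pseudo-ground-truth check (Sec. 7.1) -- a sketch only.
import numpy as np

def annotation_agreement(human_labels, openpose, prediction):
    """human_labels: (A, K, 2) for A annotators; openpose, prediction: (K, 2)."""
    mu = human_labels.mean(axis=0)                                   # annotator mean per keypoint
    spread = np.linalg.norm(human_labels - mu, axis=-1).mean(axis=0) # annotator spread per keypoint
    op_err = np.linalg.norm(openpose - mu, axis=-1)                  # ||op_i - mu_i||
    pred_err = np.linalg.norm(prediction - mu, axis=-1)              # ||prediction - mu_i||
    return spread.mean(), op_err.mean(), pred_err.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    gt = rng.uniform(0, 1, size=(12, 2))                    # 12 hypothetical neck/arm keypoints
    humans = gt + rng.normal(scale=0.05, size=(3, 12, 2))   # 3 annotators
    print(annotation_agreement(humans, gt + 0.1, gt + 0.2))
```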
7.2. Learning Individual Gesture Dictionaries

Gesture unit segmentation We use an unsupervised method for building a dictionary of an individual's gestures. We segment motion sequences into gesture units, propose an appropriate descriptor and similarity metric, and then cluster the gestures of an individual.

A gesture unit is a sequence of gestures that starts from a rest position and returns to a rest position only after the last gesture [23]. While [34] observed that most of their subjects usually perform one gesture at a time, a study of an 18-minute video dataset of TV speakers reported that their gestures were often strung together in a sequence [25]. We treat each gesture unit – from rest position to rest position – as an atomic segment.

We use an unsupervised approach to the temporal segmentation of gesture units based on prediction error (by contrast, [32] use a supervised approach). Given a motion sequence of keypoints (Section 3) from time t_0 to t_T, we try to predict the t_{T+1} pose. A low prediction error may signal that the speaker is at rest, or that they are in the middle of a gesture that the model has frequently seen during training. Since speakers spend most of the time in rest position [23], a high prediction error may indicate that a new gesture has begun. We segment gesture units at points of high prediction error (without defining a rest position per person). An example of a segmented gesture unit is displayed in Figure 6. We train a segmentation model per subject and do not expect it to generalize across speakers.

Figure 6: A segmented gesture unit.

Dictionary learning We use the first 5 principal components of the keypoints computed over all static frames as a gesture unit descriptor. This reduces the dimensionality while capturing 93% of the variance. We use dynamic time warping [41] as our distance metric to account for temporal variations in the execution of similar gestures. Since this is not a Euclidean norm, we must compute the distance between each pair of datapoints. We precompute a distance matrix for a randomly chosen sample of 1,000 training gesture units and use it to hierarchically cluster the datapoints.

Individual styles of gesture These clusters represent an unsupervised definition of the typical gestures that an individual performs. For each dictionary element cluster we define the central point as the point that is closest on average to all datapoints in the cluster. We sort the gesture units in each cluster by their distance to the central point and pick the most central ones for display. We visualize some examples of the dictionary of gestures we learn for Jon Stewart in Figure 7.

Figure 7: Individual styles of gesture. Examples from Jon Stewart's gesture dictionary (the panels shown correspond to clusters 3, 4 and 9).
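The dictionary-learning step above combines a PCA descriptor, a dynamic-time-warping (DTW) distance, and hierarchical clustering over a precomputed pairwise distance matrix. The sketch below wires those pieces together with a small hand-rolled DTW; the descriptor dimensionality, sample size, number of clusters and the "average" linkage method are placeholders rather than the authors' choices.

```python
# Gesture-dictionary clustering (Sec. 7.2) -- an illustrative pipeline, not the authors' code.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def dtw_distance(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping between two descriptor sequences."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

def cluster_gesture_units(units, n_clusters=10):
    """units: list of (T_i, D) descriptor sequences (e.g. first 5 PCA components per frame)."""
    n = len(units)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(units[i], units[j])
    Z = linkage(squareform(dist), method="average")   # hierarchical clustering on the DTW matrix
    return fcluster(Z, t=n_clusters, criterion="maxclust"), dist

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    units = [rng.normal(size=(int(rng.integers(20, 40)), 5)) for _ in range(30)]
    labels, dist = cluster_gesture_units(units)
    print(labels)
```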
Textured Neural Avatars

Aliaksandra Shysheya 1,2, Egor Zakharov 1,2, Kara-Ali Aliev 1, Renat Bashirov 1, Egor Burkov 1,2, Karim Iskakov 1, Aleksei Ivakhnenko 1, Yury Malkov 1, Igor Pasechnik 1, Dmitry Ulyanov 1,2, Alexander Vakhitov 1,2, Victor Lempitsky 1,2
1 Samsung AI Center, Moscow   2 Skolkovo Institute of Science and Technology, Moscow
arXiv:1905.08776v1 [cs.CV] 21 May 2019

Figure 1: We propose a new model for neural rendering of humans. The model is trained for a single person and can produce renderings of this person from novel viewpoints (top) or in a new body pose (bottom) unseen during training. To improve generalization, our model retains an explicit texture representation, which is learned alongside the rendering neural network.

Abstract

We present a system for learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and camera position. Our system takes the middle path between the classical graphics pipeline and the recent deep learning approaches that generate images of humans using image-to-image translation. In particular, our system estimates an explicit two-dimensional texture map of the model surface. At the same time, it abstains from explicit shape modeling in 3D. Instead, at test time, the system uses a fully-convolutional network to directly map the configuration of body feature points w.r.t. the camera to the 2D texture coordinates of individual pixels in the image frame. We show that such a system is capable of learning to generate realistic renderings while being trained on videos annotated with 3D poses and foreground masks. We also demonstrate that maintaining an explicit texture representation helps our system to achieve better generalization compared to systems that use direct image-to-image translation.

1. Introduction

Capturing and rendering the human body in all of its complexity under varying pose and imaging conditions is one of the core problems of both computer vision and computer graphics. Recently, there has been a surge of interest that involves deep convolutional networks (ConvNets) as an alternative to traditional computer graphics means.
Realistic neural rendering of body fragments, e.g. faces [37, 43, 62], eyes [24], hands [47], is now possible. Very recent works have shown the abilities of such networks to generate views of a person with a varying body pose but with a fixed camera position, and using an excessive amount of training data [1, 12, 42, 67]. In this work, we focus on the learning of neural avatars, i.e. generative deep networks that are capable of rendering views of individual people under varying body pose defined by a set of 3D positions of the body joints and under varying camera positions (Figure 1). We prefer to use body joint positions to represent the human pose, as joint positions are often easier to capture using marker-based or marker-less motion capture systems.

Generally, neural avatars can serve as an alternative to classical ("neural-free") avatars based on a standard computer graphics pipeline that estimates a user-personalized body mesh in a neutral position, performs skinning (deformation of the neutral pose), and projects the resulting 3D surface onto the image coordinates, while superimposing person-specific 2D texture. Neural avatars attempt to shortcut the multiple stages of the classical pipeline and to replace them with a single network that learns the mapping from the input (the location of body joints) to the output (the 2D image). As a part of our contribution, we demonstrate that, however appealing for its conceptual simplicity, existing pose-to-image translation networks generalize poorly to new camera views, and therefore new architectures for neural avatars are required.

Towards this end, we present a neural avatar system that does full-body rendering and combines the ideas from classical computer graphics, namely the decoupling of geometry and texture, with the use of deep convolutional neural networks. In particular, similarly to the classic pipeline, our system explicitly estimates the 2D textures of body parts. The 2D texture within the classical pipeline effectively transfers the appearance of the body fragments across camera transformations and body articulations. Keeping this component within the neural pipeline boosts generalization across such transforms. The role of the convolutional network in our approach is then confined to predicting the texture coordinates of individual pixels in the output 2D image given the body pose and the camera parameters (Figure 2). Additionally, the network predicts the body foreground/background mask.

In our experiments, we compare the performance of our textured neural avatar with a direct video-to-video translation approach [67], and show that explicit estimation of textures brings additional generalization capability and improves the realism of the generated images for new views and/or when the amount of training data is limited.

2. Related work

Our approach is closely related to a vast number of previous works, and below we discuss a small subset of these connections.

Building full-body avatars from image data has long been one of the main topics of computer vision research. Traditionally, an avatar is defined by a 3D geometric mesh of a certain neutral pose, a texture, and a skinning mechanism that transforms the mesh vertices according to pose changes. A large group of works has been devoted to body modeling from 3D scanners [51], registered multi-view sequences [53] as well as from depth and RGB-D sequences [7, 69, 74]. On the other extreme are methods that fit skinned parametric body models to single images [6, 8, 30, 35, 49, 50, 59]. Finally, research on building full-body avatars from monocular videos has started [3, 4]. Similarly to the last group of works, our work builds an avatar from a video or a set of unregistered monocular videos. The classical (computer graphics) approach to modeling human avatars requires explicit physically-plausible modeling of human skin, hair, sclera, clothing surface, as well as motion under pose changes. Despite considerable progress in reflectivity modeling [2, 18, 38, 70, 72] and better skinning/dynamic surface modeling [23, 44, 60], the computer graphics approach still requires considerable "manual" effort of designers to achieve high realism [2] and to pass the so-called uncanny valley [46], especially if real-time rendering of avatars is required.

Image synthesis using deep convolutional neural networks is a thriving area of research [20, 27] and a lot of recent effort has been directed onto synthesis of realistic human faces [15, 36, 61]. Compared to traditional computer graphics representations, deep ConvNets model data by fitting an excessive number of learnable weights to training data. Such ConvNets avoid explicit modeling of the surface geometry, surface reflectivity, or surface motion under pose changes, and therefore do not suffer from the lack of realism of the corresponding components. On the flipside, the lack of ingrained geometric or photometric models in this approach means that generalizing to new poses and in particular to new camera views may be problematic. Still, a lot of progress has been made over the last several years for the neural modeling of personalized talking head models [37, 43, 62], hair [68], hands [47]. Notably, the recent system [43] has achieved very impressive results for neural face rendering, while decomposing view-dependent texture and 3D shape modeling.

Over the last several months, several groups have presented results of neural modeling of full bodies [1, 12, 42, 67]. While the presented results are very impressive, the approaches still require a large amount of training data. They also assume that the test images are rendered with the same camera views as the training data, which in our experience
makes the task considerably simpler than modeling body appearance from an arbitrary viewpoint. In this work, we aim to expand the neural body modeling approach to tackle the latter, harder task. The work [45] uses a combination of classical and neural rendering to render the human body from new viewpoints, but does so based on depth scans and therefore with a rather different algorithmic approach.

A number of recent works warp a photo of a person to a new photorealistic image with modified gaze direction [24], modified facial expression/pose [9, 55, 64, 71], or modified body pose [5, 48, 56, 64], whereas the warping field is estimated using a deep convolutional network (while the original photo effectively serves as a texture). These approaches are however limited in their realism and/or the amount of change they can model, due to their reliance on a single photo of a given person for its input. Our approach also disentangles texture from surface geometry/motion modeling but trains from videos, therefore being able to handle a harder problem (full body multi-view setting) and to achieve higher realism.

Our system relies on the DensePose body surface parameterization (UV parameterization) similar to the one used in the classical graphics-based representation. Part of our system performs a mapping from the body pose to the surface parameters (UV coordinates) of image pixels. This makes our approach related to the DensePose approach [28] and the earlier works [29, 63] that predict UV coordinates of image pixels from the input photograph. Furthermore, our approach uses DensePose results [28] for pretraining.

Our system is related to approaches that extract textures from multi-view image collections [26, 39] or multi-view video collections [66] or a single video [52]. Our approach is also related to free-viewpoint video compression and rendering systems, e.g. [11, 16, 21, 66]. Unlike those works, ours is restricted to scenes containing a single human. At the same time, our approach aims to generalize not only to new camera views but also to new user poses unseen in

[...] a specific image location, e.g. B_i^j[x, y] denotes the scalar element in the j-th map of the stack B_i located at location (x, y), and B_i[x, y] denotes the vector of elements corresponding to all maps sampled at location (x, y).

Input and output. In general, we are interested in synthesizing images of a certain person given her/his pose. We assume that the pose for the i-th image comes in the form of 3D joint positions defined in the camera coordinate frame. As an input to the network, we then consider a map stack B_i, where each map B_i^j contains the rasterized j-th segment (bone) of the "stickman" (skeleton) projected on the camera plane. To retain the information about the third coordinate of the joints, we linearly interpolate the depth value between the joints defining the segments, and use the interpolated values to define the values in the map B_i^j corresponding to the bone pixels (the pixels not covered by the j-th bone are set to zero). Overall, the stack B_i incorporates the information about the person and the camera pose.

As an output of the whole system, we expect an RGB image (a three-channel stack) I_i and a single channel mask M_i, defining the pixels that are covered by the avatar. Below, we consider two approaches: the direct translation baseline, which directly maps B_i into {I_i, M_i}, and the textured neural avatar approach that performs such mapping indirectly using texture mapping.

In both cases, at training time, we assume that for each input frame i, the input joint locations and the "ground truth" foreground mask are estimated, and we use 3D body pose estimation and human semantic segmentation to extract them from raw video frames. At test time, given a real or synthetic background image Ĩ_i, we generate the final view by first predicting M_i and I_i from the body pose and then linearly blending the resulting avatar into an image: Î_i = I_i ⊙ M_i + Ĩ_i ⊙ (1 − M_i) (where ⊙ defines a
“location-wise” product, i.e. the RGB values at each loca-
the training videos. The work of [73] is the most related
tion are multiplied by the mask value at this location).
to ours in this group, as they warp the individual frames of
the multi-view video dataset according to the target pose to
generate new sequences. The poses that they can handle,
however, are limited by the need to have a close match in Direct translation baseline. The direct approach that we
the training set, which is a strong limitation given the com- consider as a baseline to ours is to learn an image trans-
binatorial nature of the human pose configuration space. lation network that maps the map stack Bik to the map
stacks Ii and Mi (usually the two output stacks are pro-
3. Methods duced within two branches that share the initial stage of the
processing [20]). Generally, mappings between stacks of
Notation. We use the lower index i to denote objects that maps can be implemented using fully-convolutional archi-
are specific to the i-th training or test image. We use up- tectures. Exact architectures and losses for such networks
percase notation, e.g. Bi to denote a stack of maps (a third- is an active area of research [14, 31, 33, 65]. Very recent
order tensor/three-dimensional array) corresponding to the works [1, 12, 42, 67] have used direct translation (with var-
i-th training or test image. We use the upper index to denote ious modifications) to synthesize the view of a person for
a specific map (channel) in the stack, e.g. Bij . Furthermore, a fixed camera. We use the video-to-video variant of this
we use square brackets to denote elements corresponding to approach [67] as a baseline for our method.

3
Part assignments Predicted mask Ground truth mask
Cross-entropy
loss
Input pose Generator

Perceptual
loss
Render

Part coordinates Predicted RGB Ground truth RGB

Texture stack

Figure 2: The overview of the textured neural avatar system. The input pose is defined as a stack of ”bone” rasterizations
(one bone per channel; here we show it as a skeleton image). The input is processed by the fully-convolutional network
(generator) to produce the body part assignment map stack and the body part coordinate map stack. These stacks are then
used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by
the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the
background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting
losses are backpropagated through the sampling operation into the fully-convolutional network and onto the texture, resulting
in their updates.

Textured neural avatar. The direct translation approach th body part, and the map channel Pin corresponds to the
relies on the generalization ability of ConvNets and incor- probability of the background. The coordinate maps Ci2k
porates very little domain-specific knowledge into the sys- and Ci2k+1 correspond to the pixel coordinates on the k-th
tem. As an alternative, we suggest the textured avatar ap- body part. Specifically, once the part assignments Pi and
proach, that explicitly estimates the textures of body parts, body part coordinates Ci are predicted, the image Ii at each
thus ensuring the similarity of the body surface appearance pixel (x, y) is reconstructed as a weighted combination of
under varying pose and cameras. texture elements, where the weights and texture coordinates
Following the DensePose approach [28], we subdivide are prescribed by the part assignment maps and the coordi-
the body into n=24 parts, where each part has a 2D param- nate maps correspondingly:
eterization. Each body part also has the texture map T k , n−1
X
which is a color image of a fixed pre-defined size (256×256 s(Pi , Ci , T )[x, y] = Pik [x, y]·
in our implementation). The training process for the tex- k=0
tured neural avatar estimates personalized part parameteri- T k Ci2k [x, y], Ci2k+1 [x, y] ,
 
(1)
zations and textures.
Again, following the DensePose approach, we assume where s(·, ·, ·) is the sampling function (layer) that outputs
that each pixel in an image of a person is (soft)-assigned the RGB map stack given the three input arguments. In (1),
to one of n parts or to the background and with a specific the texture maps T k are sampled at non-integer locations
location on the texture of that part (body part coordinates). (Ci2k [x, y], Ci2k+1 [x, y]) in a piecewise-differentiable man-
Unlike DensePose, where part assignments and body part ner using bilinear interpolation [32].
coordinates are induced from the image, our approach at When training the neural textured avatar, we learn a con-
test time aims to predict them based solely on the pose Bi . volutional network gφ with learnable parameters φ to trans-
late the input map stacks Bi into the body part assignments
The introduction of the body surface parameterization
and the body part coordinates. As gφ has two branches
outlined above changes the translation problem. For a
(“heads”), we denote with gφP the branch that produces the
given pose defined by Bi , the translation network now has
to predict the stack Pi of body part assignments and the body part assignments stack, and with gφC the branch that
stack Ci of body part coordinates, where Pi contains n+1 produces the body part coordinates. To learn the parameters
maps of the textured neural avatar, we optimize the loss between
Pn of knon-negative numbers that sum to identity (i.e. the generated image and the ground truth image I¯i :
k=0 Pi [x, y] = 1 for any position (x, y)), and Ci con-
tains 2n maps of real numbers between 0 and w, where w is
 
Limage (φ, T ) = dImage I¯i , s gφP (Bi ), gφC (Bi ), T (2)
the spatial size (width and height) of the texture maps T k .
The map channel Pik for k = 0, . . . , n−1 is then in- where dImage (·, ·) is a loss used to compare two images.
terpreted as the probability of the pixel to belong to the k- In our current implementation we use a simple perceptual

4
loss [25, 33, 65], which computes the maps of activations
within pretrained fixed VGG network [58] for both im-
ages and evaluates the L1-norm between the resulting maps
(Conv1,6,11,20,29 of VGG19 were used). More ad-
vanced adversarial losses [27] popular in image translation
[19, 31] can also be used here.
During the stochastic optimization, the gradient of the
loss (2) is backpropagated through (1) both into the trans-
lation network gφ and onto the texture maps T k , so that
minimizing this loss updates not only the network param-
eters but also the textures themselves. As an addition, the
learning also optimizes the mask loss that measures the dis-
crepancy between the ground truth background mask 1−M̄i
and the background mask prediction:
Figure 3: The impact of the learning on the texture (top,
shown for the same subset of maps T k ) and on the convolu-
 
Lmask (φ, T ) = dBCE 1̄ − Mi , gφP (Bi )n (3)
tional network gφC predictions (bottom, shown for the same
pair of input poses). Left part shows the starting state (af-
where dBCE is the binary cross-entropy loss, and gφP (Bi )n
ter initialization), while the right part shows the final state,
corresponds to the n-th (i.e. background) channel of the pre-
which is considerably different from the start.
dicted part assignment map stack. After backpropagation
of the weighted combination of (2) and (3), the network
parameters φ and the textures maps T k are updated. As person, and they change significantly during the end-to-end
the training progresses, the texture maps change (Figure 2), learning (Figure 3).
and so does the body part coordinate predictions, so that the
learning is free to choose the appropriate parameterization
of body part surfaces. 4. Experiments
Below, we discuss the details of the experimental vali-
Initialization of textured neural avatar. The success of dation, provide comparison with baseline approaches, and
our network depends on the initialization strategy. When show qualitative results. The project webpage1 contains
training from multiple video sequences, we use the Dense- more videos of the learned avatars.
Pose system [28] to initialize the textured neural avatar.
Specifically, we run DensePose on the training data and pre-
train gφ as a translation network between the pose stacks Bi Architecture. We input 3D pose via bone rasterizations,
and the DensePose outputs. where each bone, hand and face are drawn in separate
An alternative way that is particularly attractive when channels. We then use standard image translation archi-
training data is scarce is to initialize the avatar is through tecture [33] to perform a mapping from these bones’ ras-
transfer learning. In this case, we simply take gφ from an- terizations to texture assignments and coordinates. This ar-
other avatar trained on abundant data. The explicit decou- chitecture consists of downsampling layers, stack of resid-
pling of geometry from appearance in our method facilitates ual blocks, operating at low dimensional feature representa-
transfer learning, as the geometrical mapping provided by tions, and upsampling layers. We then split the network into
the network gφ usually does not need to change much be- two roughly equal parts: encoder and decoder, with texture
tween two people, especially if the body types are not too assignments and coordinates having separate decoders. We
dissimilar. use 4 downsampling and upsampling layers with initial 32
channels in the convolutions and 256 channels in the resid-
Once the mapping gφ has been initialized, the texture
ual blocks. The ConvNet gφ has 17 million parameters.
maps T k are initialized as follows. Each pixel in the train-
ing image is assigned to a single body part (according to the
prediction of the pretrained gφP ) and to a particular texture Datasets. We train neural avatars on several types of
pixel on the texture of the corresponding part (according datasets. First, we consider collections of multi-view videos
to the prediction of the pretrained gφC ). Then, the value of registered in time and space, where 3D pose estimates can
each texture pixel is initialized to the mean of all image pix- be obtained via triangulation of 2D poses. We use two sub-
els assigned to it (the texture pixels assigned zero pixels are sets (corresponding to two persons from the 171026 pose2
initialized to black). The initialized texture T and gφ usu-
ally produce images that are only coarsely reminding the 1 https://saic-violet.github.io/texturedavatar/

5
Figure 4: Renderings produced by multiple textured neural avatars (for all people in our study). All renderings are produced
from the new viewpoints unseen during training.

(a) User study (b) SSIM score (c) Frechet distance


Ours-v-V2V Ours-v-Direct V2V Direct Ours V2V Direct Ours
CMU1-16 0.56 0.75 0.908 0.899 0.919 6.7 7.3 8.8
CMU2-16 0.54 0.74 0.916 0.907 0.922 7.0 8.8 10.7
CMU1-6 0.50 0.92 0.905 0.896 0.914 7.7 10.7 8.9
CMU2-6 0.53 0.71 0.918 0.907 0.920 7.0 9.7 10.4

Table 1: Quantitative comparison of the three models operating on different datasets (see text for discussion).

scene) from the CMU Panoptic dataset collection [34], re- consecutive frames of the monocular RGB image sequence.
ferring to them as CMU1 and CMU2 (both subsets have ap- Then we concatenate and lift the estimated 2D poses to infer
proximately four minutes / 7,200 frames in each camera the 3D pose of the last frame by using a multi-layer percep-
view). We consider two regimes: training on 16 cameras tron model. The perceptron is trained on the CMU 3D pose
(CMU1-16 and CMU2-16) or six cameras (CMU1-6 and annotations (augmented with position of the feet joints by
CMU2-6). The evaluation is done on the hold-out cameras triangulating the output of OpenPose) in orthogonal projec-
and hold-out parts of the sequence (no overlap between train tion.
and test in terms of the cameras or body motion). For foreground segmentation we use DeepLabv3+ with
We have also captured our own multi-view sequences Xception-65 backbone [13] initially trained on PAS-
of three subjects using a rig of seven cameras, spanning CAL VOC 2012 [22] and fine-tuned on HumanParsing
approximately 30◦ . In one scenario, the training sets in- dataset [40, 41] to predict initial human body segmentation
cluded six out of seven cameras, where the duration of each masks. We additionally employ GrabCut [54] with back-
video was approximately six minutes (11,000 frames). We ground/foreground model initialized by the masks to refine
show qualitative results for the hold-out camera as well as object boundaries on the high-resolution images. Pixels
from new viewpoints. In the other scenario described below, covered by the skeleton rasterization were always added to
training was done based on a video from a single camera. the foreground mask.
Finally, we evaluate on two short monocular sequences
from [4] and a Youtube video in Figure 7. Baselines. In the multi-video training scenario, we con-
sider two other systems, against which ours is compared.
Pre-processing. Our system expects 3D human pose as First, we take the video-to-video (V2V) system [67], using
input. For non-CMU datasets, we used the OpenPose- the authors’ code with minimal modifications that lead to
compatible [10, 57] 3D pose formats, represented by improved performance. We provide it with the same input
25 body joints, 21 joints for each hand and 70 facial land- as ours, and we use images with blacked-out background
marks. For the CMU Panoptic datasets, we use the available (according to our segmentation) as desired output. On the
3D pose annotation as input (which has 19 rather than 25 CMU1-6 task, we have also evaluated a model with Dense-
body joints). To get a 3D pose for non-CMU sequences we Pose results computed on the target frame given as input
first apply the OpenPose 2D pose estimation engine to five (alongside keypoints). Despite much stronger (oracle-type)

6
GT Direct V2V Proposed GT Direct V2V Proposed

Figure 5: Comparison of the rendering quality for the Direct, V2V and proposed methods on the CMU1-6 and CMU2-6
sequences. Images from six arbitrarily chosen cameras were used for training. We generate the views onto the hold-out
cameras which were not used during training. The pose and camera in the lower right corner are in particular difficult for all
the systems.

conditioning, the performance of this model in terms of con- from a disadvantage both in the quantitative metrics and in
sidered metrics has not improved in comparison with V2V the user comparison, since it averages out lighting from dif-
that uses only body joints as input. ferent viewpoints. The more detailed quantitative compari-
The video-to-video system employs several adversarial son is presented in Table 1.
losses and an architecture different from ours. Therefore we We show more qualitative examples of our method for a
consider a more direct ablation (Direct), which has the same variety of models in Figure 4 and some qualitative compar-
network architecture that predicts RGB color and mask di- isons with baselines in Figure 6.
rectly, rather than via body part assignments/coordinates.
The Direct system is trained using the same losses and in
the same protocol as ours. Single video comparisons. We also evaluate our system
As for the single video case, two baseline systems, in a single video case. We consider the scenario, where we
against which ours is compared, were considered. On our train the model and transfer it to a new person by fitting it
own captured sequences, we compare our system against to a single video. We use single-camera videos from one
video-to-video (V2V) system [67], whereas on sequences of the cameras in our rig. We then evaluate the model (and
from [4] we provide a qualitative comparison against the V2V baseline) on a hold-out set of poses projected onto the
system of [4]. camera from the other side of the rig (around 30◦ away).
We thus demonstrate that new models can be obtained us-
ing a single monocular video. For our models, we consider
Multi-video comparison. We compare the three systems transferring from CMU1-16.
(ours, V2V, Direct) in CMU1-16, CMU2-16, CMU1-6, We thus pretrain V2V and our system on CMU1-16 and
CMU2-6. Using the hold-out sequences/motions, we then use the obtained weights of gφ as initialization for fine-
evaluated two popular metrics, namely structured self- tuning to the single video in our dataset. The texture maps
similarity (SSIM) and Frechet Inception Distance (FID) be- are initialized from scratch as described above. Evaluating
tween the results of each system and the hold-out frames on hold-out camera and motion highlighted strong advan-
(with background removed using our segmentation algo- tage of our method. In the user study on two subjects, the
rithm). Our method outperforms the other two in terms of result of our method has been preferred to V2V in 55% and
SSIM and underperforms V2V in terms of FID. Represen- 65% of the cases. We further compare our method and the
tative examples are shown in Figure 5. system of [4] on the sequences from [4]. The qualitative
We have also performed user study using a crowd- comparison is shown in Figure 7. In addition, we gener-
sourcing website, where the users were shown the results of ate an avatar from a YouTube video. In this set of exper-
ours and one of the other two systems on either side of the iments, the avatars were obtained by fine-tuning from the
ground truth image and were asked to pick a better match to same avatar (shown in Figure 6–left). Except for the con-
the middle image. In the side-by-side comparison, the re- siderable artefacts on hand parts, our system has generated
sults of our method were always preferred by the majority avatars that can generalize to new pose despite very short
of crowd-sourcing users. We note that our method suffers video input (300 frames in the case of [4]).

7
GT Proposed V2V GT Proposed V2V

Figure 6: Results comparison for our multi-view sequences using a hold-out camera. Textured Neural Avatars and the images
produced by the video-to-video (V2V) system correspond to the same viewpoint. Both systems use a video from a single
viewpoint for training. Electronic zoom-in recommended.

Figure 7: Results on external monocular sequences. Rows 1-2: avatars for sequences from [4] in an unseen pose (left – ours,
right – [4]). Row 3 – the textured avatar computed from a popular YouTube video (’PUMPED UP KICKS DUBSTEP’). In
general, our system is capable of learning avatars from monocular videos.

8
5. Summary and Discussion Automatic estimation of 3d human pose and shape from a
single image. In Proc. ECCV, pages 561–578. Springer,
We have presented textured neural avatar approach to 2016. 2
model the appearance of humans for new camera views and [9] Jie Cao, Yibo Hu, Hongwen Zhang, Ran He, and Zhenan
new body poses. Our system takes the middle path between Sun. Learning a high fidelity pose invariant model
the recent generation of methods that use ConvNets to map for high-resolution face frontalization. arXiv preprint
the pose to the image directly, and the traditional approach arXiv:1806.08472, 2018. 3
that uses geometric modeling of the surface and superim- [10] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh.
pose the personalized texture maps. This is achieved by Realtime multi-person 2d pose estimation using part affinity
learning a ConvNet that predicts texture coordinates of pix- fields. In Proc. CVPR, 2017. 6
els in the new view jointly with the texture within the end- [11] Dan Casas, Marco Volino, John Collomosse, and Adrian
Hilton. 4d video textures for interactive character appear-
to-end learning process. We demonstrate that retaining an
ance. In Computer Graphics Forum, volume 33, pages 371–
explicit shape and texture separation helps to achieve better 380. Wiley Online Library, 2014. 3
generalization than direct mapping approaches. [12] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and
Our method suffers from certain limitations. The gen- Alexei A Efros. Everybody dance now. arXiv preprint
eralization ability is still limited, as it does not generalize arXiv:1808.07371, 2018. 2, 3
well when a person is rendered at a scale that is consid- [13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian
erably different from the training set (which can be par- Schroff, and Hartwig Adam. Encoder-decoder with atrous
tially addressed by rescaling prior to rendering followed by separable convolution for semantic image segmentation. In
cropping/padding postprocessing). Furthermore, textured Proc. ECCV, 2018. 6
avatars exhibit strong artefacts in the presence of pose es- [14] Qifeng Chen and Vladlen Koltun. Photographic image syn-
timation errors on hands and faces. Finally, our method as- thesis with cascaded refinement networks. In Proc. ICCV,
sumes constancy of the surface color and ignores lighting pages 1520–1529, 2017. 3
[15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha,
effects. This can be potentially addressed by making our
Sunghun Kim, and Jaegul Choo. Stargan: Unified genera-
textures view- and lighting-dependent [17, 43]. tive adversarial networks for multi-domain image-to-image
translation. In Proc. CVPR, June 2018. 2
References [16] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Den-
nis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk,
[1] Kfir Aberman, Mingyi Shi, Jing Liao, Dani Lischinski, Bao-
and Steve Sullivan. High-quality streamable free-viewpoint
quan Chen, and Daniel Cohen-Or. Deep video-based perfor-
video. ACM Transactions on Graphics (TOG), 34(4):69,
mance cloning. arXiv preprint arXiv:1808.06847, 2018. 2,
2015. 3
3
[17] Paul E. Debevec, Yizhou Yu, and George Borshukov. Effi-
[2] Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan
cient view-dependent image-based rendering with projective
Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul De-
texture-mapping. In Rendering Techniques ’98, Proceedings
bevec. The Digital Emily project: Achieving a photorealistic
of the Eurographics Workshop in Vienna, Austria, June 29 -
digital actor. IEEE Computer Graphics and Applications,
July 1, 1998, pages 105–116, 1998. 9
30(4):20–31, 2010. 2
[18] Craig Donner, Tim Weyrich, Eugene d’Eon, Ravi Ra-
[3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian
mamoorthi, and Szymon Rusinkiewicz. A layered, heteroge-
Theobalt, and Gerard Pons-Moll. Detailed human avatars
neous reflectance model for acquiring and rendering human
from monocular video. In 2018 International Conference on
skin. In ACM Transactions on Graphics (TOG), volume 27,
3D Vision (3DV), pages 98–109. IEEE, 2018. 2
page 140. ACM, 2008. 2
[4] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian [19] Alexey Dosovitskiy and Thomas Brox. Generating images
Theobalt, and Gerard Pons-Moll. Video based reconstruction with perceptual similarity metrics based on deep networks.
of 3d people models. In Proc. CVPR, June 2018. 2, 6, 7, 8 In Proc. NIPS, pages 658–666, 2016. 5
[5] Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Frédo Du- [20] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas
rand, and John V. Guttag. Synthesizing images of humans in Brox. Learning to generate chairs with convolutional neural
unseen poses. In Proc. CVPR, pages 8340–8348, 2018. 3 networks. In Proc. CVPR, pages 1538–1546, 2015. 2, 3
[6] Alexandru O Bălan and Michael J Black. The naked truth: [21] Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh
Estimating body shape under clothing. In Proc. ECCV, pages Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir
15–29. Springer, 2008. 2 Tankovich, and Shahram Izadi. Motion2fusion: real-time
[7] Federica Bogo, Michael J Black, Matthew Loper, and Javier volumetric performance capture. ACM Transactions on
Romero. Detailed full-body reconstructions of moving peo- Graphics (TOG), 36(6):246, 2017. 3
ple from monocular RGB-D sequences. In Proc. ICCV, [22] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I.
pages 2300–2308, 2015. 2 Williams, J. Winn, and A. Zisserman. The pascal visual ob-
[8] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter ject classes challenge: A retrospective. International Journal
Gehler, Javier Romero, and Michael J Black. Keep it smpl: of Computer Vision, 111(1):98–136, Jan. 2015. 6

9
[23] Andrew Feng, Dan Casas, and Ari Shapiro. Avatar reshap- [38] Oliver Klehm, Fabrice Rousselle, Marios Papas, Derek
ing and automatic rigging using a deformable model. In Pro- Bradley, Christophe Hery, Bernd Bickel, Wojciech Jarosz,
ceedings of the 8th ACM SIGGRAPH Conference on Motion and Thabo Beeler. Recent advances in facial appearance
in Games, pages 57–64. ACM, 2015. 2 capture. In Computer Graphics Forum, volume 34, pages
[24] Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and 709–733. Wiley Online Library, 2015. 2
Victor Lempitsky. Deepwarp: Photorealistic image resynthe- [39] Victor S. Lempitsky and Denis V. Ivanov. Seamless mosaic-
sis for gaze manipulation. In Proc. ECCV, pages 311–326. ing of image-based texture maps. In Proc. CVPR, 2007. 3
Springer, 2016. 2, 3 [40] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi
[25] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Liu, Jian Dong, Liang Lin, and Shuicheng Yan. Deep human
Image style transfer using convolutional neural networks. In parsing with active template regression. Pattern Analysis and
Proc. CVPR, pages 2414–2423, 2016. 5 Machine Intelligence, IEEE Transactions on, 37(12):2402–
[26] Bastian Goldlücke and Daniel Cremers. Superresolution 2414, Dec 2015. 6
texture maps for multiview reconstruction. In Proc. ICCV, [41] Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang,
pages 1677–1684, 2009. 3 Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan. Iccv.
[27] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing 2015. 6
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and [42] Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo
Yoshua Bengio. Generative adversarial nets. In Proc. NIPS, Kim, Florian Bernard, Marc Habermann, Wenping Wang,
pages 2672–2680, 2014. 2, 5 and Christian Theobalt. Neural animation and reenactment
of human actor videos. arXiv preprint arXiv:1809.03658,
[28] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos.
2018. 2, 3
DensePose: Dense human pose estimation in the wild. In
Proc. CVPR, June 2018. 3, 4, 5 [43] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser
Sheikh. Deep appearance models for face rendering. ACM
[29] Riza Alp Güler, George Trigeorgis, Epameinondas Anton-
Transactions on Graphics (TOG), 37(4):68, 2018. 2, 9
akos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokki-
[44] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard
nos. DenseReg: Fully convolutional dense shape regression
Pons-Moll, and Michael J Black. Smpl: A skinned multi-
in-the-wild. In Proc. CVPR, volume 2, page 5, 2017. 3
person linear model. ACM Transactions on Graphics (TOG),
[30] Nils Hasler, Hanno Ackermann, Bodo Rosenhahn, Thorsten 34(6):248, 2015. 2
Thormählen, and Hans-Peter Seidel. Multilinear pose and
[45] Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel
body shape estimation of dressed subjects from image sets.
Pidlypenskyi, Jonathan Taylor, Julien P. C. Valentin, Sameh
In Proc. CVPR, pages 1823–1830. IEEE, 2010. 2
Khamis, Philip L. Davidson, Anastasia Tkach, Peter Lin-
[31] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. coln, Adarsh Kowdle, Christoph Rhemann, Dan B. Gold-
Efros. Image-to-image translation with conditional adver- man, Cem Keskin, Steven M. Seitz, Shahram Izadi, and
sarial networks. In Proc. CVPR, pages 5967–5976, 2017. 3, Sean Ryan Fanello. LookinGood: enhancing performance
5 capture with real-time neural re-rendering. ACM Trans.
[32] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Graph., 37(6):255:1–255:14, 2018. 3
Koray Kavukcuoglu. Spatial transformer networks. In Proc. [46] Masahiro Mori. The uncanny valley. Energy, 7(4):33–35,
NIPS, pages 2017–2025, 2015. 4 1970. 2
[33] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual [47] Franziska Mueller, Florian Bernard, Oleksandr Sotny-
losses for real-time style transfer and super-resolution. In chenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and
Proc. ECCV, pages 694–711, 2016. 3, 5 Christian Theobalt. GANerated hands for real-time 3d hand
[34] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei tracking from monocular RGB. In Proc. CVPR, June 2018.
Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart 2
Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and [48] Natalia Neverova, Riza Alp Güler, and Iasonas Kokkinos.
Yaser Sheikh. Panoptic studio: A massively multiview sys- Dense pose transfer. In Proc. ECCV, September 2018. 3
tem for social interaction capture. IEEE Transactions on Pat- [49] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Pe-
tern Analysis and Machine Intelligence, 2017. 6 ter V. Gehler, and Bernt Schiele. Neural body fitting: Uni-
[35] Angjoo Kanazawa, Michael J Black, David W Jacobs, and fying deep learning and model-based human pose and shape
Jitendra Malik. End-to-end recovery of human shape and estimation. Verona, Italy, 2018. 2
pose. In Proc. CVPR, 2018. 2 [50] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas
[36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Daniilidis. Learning to estimate 3d human pose and shape
Progressive growing of GANs for improved quality, stabil- from a single color image. In Proc. CVPR, June 2018. 2
ity, and variation. In International Conference on Learning [51] Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and
Representations, 2018. 2 Michael J Black. Dyna: A model of dynamic human shape in
[37] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng motion. ACM Transactions on Graphics (TOG), 34(4):120,
Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian 2015. 2
Richardt, Michael Zollhöfer, and Christian Theobalt. Deep [52] Alex Rav-Acha, Pushmeet Kohli, Carsten Rother, and An-
video portraits. arXiv preprint arXiv:1805.11714, 2018. 2 drew W. Fitzgibbon. Unwrap mosaics: a new representation

10
for video editing. ACM Trans. Graph., 27(3):17:1–17:11, [69] Alexander Weiss, David Hirshberg, and Michael J Black.
2008. 3 Home 3d body scans from noisy image and range data. In
[53] Nadia Robertini, Dan Casas, Edilson De Aguiar, and Chris- Proc. ICCV, pages 1951–1958. IEEE, 2011. 2
tian Theobalt. Multi-view performance capture of sur- [70] Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd
face details. International Journal of Computer Vision, Bickel, Craig Donner, Chien Tu, Janet McAndless, Jinho
124(1):96–113, 2017. 2 Lee, Addy Ngan, Henrik Wann Jensen, et al. Analysis of
[54] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. human faces using a measurement-based skin reflectance
”grabcut”: interactive foreground extraction using iterated model. In ACM Transactions on Graphics (TOG), vol-
graph cuts. ACM Trans. Graph., 23(3):309–314, 2004. 6 ume 25, pages 1013–1024. ACM, 2006. 2
[55] Zhixin Shu, Mihir Sahasrabudhe, Riza Alp Guler, Dimitris [71] Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman.
Samaras, Nikos Paragios, and Iasonas Kokkinos. Deform- X2face: A network for controlling face generation using im-
ing autoencoders: Unsupervised disentangling of shape and ages, audio, and pose codes. In Proc. ECCV, September
appearance. In Proc. ECCV, September 2018. 3 2018. 3
[56] Aliaksandr Siarohin, Enver Sangineto, Stphane Lathuilire, [72] Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke
and Nicu Sebe. Deformable gans for pose-based human im- Sugano, Peter Robinson, and Andreas Bulling. Rendering
age generation. In Proc. CVPR, June 2018. 3 of eyes for eye-shape registration and gaze estimation. In
[57] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Proc. ICCV, pages 3756–3764, 2015. 2
Sheikh. Hand keypoint detection in single images using mul- [73] Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gau-
tiview bootstrapping. In CVPR, 2017. 6 rav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz,
[58] Karen Simonyan and Andrew Zisserman. Very deep convo- and Christian Theobalt. Video-based characters: creating
lutional networks for large-scale image recognition. CoRR, new human performances from a multi-view video database.
abs/1409.1556, 2014. 5 ACM Transactions on Graphics (TOG), 30(4):32, 2011. 3
[59] J Starck and A Hilton. Model-based multiple view recon- [74] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai
struction of people. In Proc. ICCV, pages 915–922, 2003. Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefu-
2 sion: Real-time capture of human performances with inner
[60] Ian Stavness, C Antonio Sánchez, John Lloyd, Andrew Ho, body shapes from a single depth sensor. In Proc. CVPR,
Johnty Wang, Sidney Fels, and Danny Huang. Unified skin- pages 7287–7296. IEEE Computer Society, 2018. 2
ning of rigid and deformable models for anatomical simu-
lations. In SIGGRAPH Asia 2014 Technical Briefs, page 9.
ACM, 2014. 2
[61] Diana Sungatullina, Egor Zakharov, Dmitry Ulyanov, and
Victor Lempitsky. Image manipulation with perceptual dis-
criminators. In Proc. ECCV, September 2018. 2
[62] Supasorn Suwajanakorn, Steven M Seitz, and Ira
Kemelmacher-Shlizerman. Synthesizing Obama: learning
lip sync from audio. ACM Transactions on Graphics (TOG),
36(4):95, 2017. 2
[63] Jonathan Taylor, Jamie Shotton, Toby Sharp, and Andrew
Fitzgibbon. The vitruvian manifold: Inferring dense corre-
spondences for one-shot human pose estimation. In Proc.
CVPR, pages 103–110. IEEE, 2012. 3
[64] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan
Kautz. Mocogan: Decomposing motion and content for
video generation. In Proc. CVPR, June 2018. 3
[65] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Vic-
tor S. Lempitsky. Texture networks: Feed-forward synthesis
of textures and stylized images. In Proc. ICML, pages 1349–
1357, 2016. 3, 5
[66] Marco Volino, Dan Casas, John P Collomosse, and Adrian
Hilton. Optimal representation of multi-view video. In Proc.
BMVC, 2014. 3
[67] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu,
Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-
video synthesis. arXiv preprint arXiv:1808.06601, 2018. 2,
3, 6, 7
[68] Lingyu Wei, Liwen Hu, Vladimir Kim, Ersin Yumer, and
Hao Li. Real-time hair rendering using sequential adversarial
networks. In Proc. ECCV, September 2018. 2

11
DSFD: Dual Shot Face Detector

Jian Li† Yabiao Wang‡ Changan Wang‡ Ying Tai‡


Jianjun Qian†∗ Jian Yang†∗ Chengjie Wang‡ Jilin Li‡ Feiyue Huang‡

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education

Jiangsu Key Lab of Image and Video Understanding for Social Security

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
arXiv:1810.10220v3 [cs.CV] 6 Apr 2019


Youtu Lab, Tencent

lijiannuist@gmail.com, {csjqian, csjyang}@njust.edu.cn

{casewang, changanwang, yingtai, jasoncjwang, jerolinli, garyhuang}@tencent.com

Scale Blurry Illumination

Pose & Occlusion Reflection Makeup

Figure 1: Visual results. Our method is robust to various variations on scale, blurry, illumination, pose, occlusion, reflection and makeup.

Abstract mentation to provide better initialization for the regressor.


Since these techniques are all related to the two-stream de-
In this paper, we propose a novel face detection network sign, we name the proposed network as Dual Shot Face De-
with three novel contributions that address three key aspects tector (DSFD). Extensive experiments on popular bench-
of face detection, including better feature learning, progres- marks, WIDER FACE and FDDB, demonstrate the superi-
sive loss design and anchor assign based data augmenta- ority of DSFD over the state-of-the-art face detectors.
tion, respectively. First, we propose a Feature Enhance
Module (FEM) for enhancing the original feature maps to
extend the single shot detector to dual shot detector. Sec- 1. Introduction
ond, we adopt Progressive Anchor Loss (PAL) computed by
two different sets of anchors to effectively facilitate the fea- Face detection is a fundamental step for various facial
tures. Third, we use an Improved Anchor Matching (IAM) applications, like face alignment [26], parsing [3], recog-
by integrating novel anchor assign strategy into data aug- nition [34], and verification [6]. As the pioneering work
∗ Jianjun
for face detection, Viola-Jones [29] adopts AdaBoost algo-
Qian and Jian Yang are corresponding authors. This work
was supported by the National Science Fund of China under Grant Nos.
rithm with hand-crafted features, which are now replaced by
61876083, U1713208, and Program for Changjiang Scholars. This work deeply learned features from the convolutional neural net-
was done when Jian Li was an intern at Tencent Youtu Lab. work (CNN) [10] that achieves great progress. Although

1
the CNN based face detectors have being extensively stud- anchor sizes in the first shot, and use larger sizes in the
ied, detecting faces with high degree of variability in scale, second shot. Third, we propose Improved Anchor Match-
pose, occlusion, expression, appearance and illumination in ing (IAM), which integrates anchor partition strategy and
real-world scenarios remains a challenge. anchor-based data augmentation to better match anchors
Previous state-of-the-art face detectors can be roughly and ground truth faces, and thus provides better initializa-
divided into two categories. The first one is mainly based tion for the regressor. The three aspects are complementary
on the Region Proposal Network (RPN) adopted in Faster so that these techniques can work together to further im-
RCNN [24] and employs two stage detection schemes [30, prove the performance. Besides, since these techniques are
33, 36]. RPN is trained end-to-end and generates high- all related to two-stream design, we name the proposed net-
quality region proposals which are further refined by Fast work as Dual Shot Face Detector (DSFD). Fig. 1 shows the
R-CNN detector. The other one is Single Shot Detec- effectiveness of DSFD on various variations, especially on
tor (SSD) [20] based one-stage methods, which get rid of extreme small faces or heavily occluded faces.
RPN, and directly predict the bounding boxes and confi- In summary, the main contributions of this paper include:
dence [4, 27, 39]. Recently, one-stage face detection frame- • A novel Feature Enhance Module to utilize different
work has attracted more attention due to its higher inference level information and thus obtain more discriminability and
efficiency and straightforward system deployment. robustness features.
Despite the progress achieved by the above methods, • Auxiliary supervisions introduced in early layers via a
there are still some problems existed in three aspects: set of smaller anchors to effectively facilitate the features.
Feature learning Feature extraction part is essential for • An improved anchor matching strategy to match an-
a face detector. Currently, Feature Pyramid Network chors and ground truth faces as far as possible to provide
(FPN) [17] is widely used in state-of-the-art face detectors better initialization for the regressor.
for rich features. However, FPN just aggregates hierarchi- • Comprehensive experiments conducted on popular
cal feature maps between high and low-level output layers, benchmarks FDDB and WIDER FACE to demonstrate the
which does not consider the current layer’s information, and superiority of our proposed DSFD network compared with
the context relationship between anchors is ignored. the state-of-the-art methods.
Loss design The conventional loss functions used in object
detection include a regression loss for the face region and 2. Related work
a classification loss for identifying if a face is detected or We review the prior works from three perspectives.
not. To further address the class imbalance problem, Lin et Feature Learning Early works on face detection mainly
al. [18] propose Focal Loss to focus training on a sparse set rely on hand-crafted features, such as Harr-like fea-
of hard examples. To use all original and enhanced features, tures [29], control point set [1], edge orientation his-
Zhang et al. propose Hierarchical Loss to effectively learn tograms [13]. However, hand-crafted features design is lack
the network [37]. However, the above loss functions do not of guidance. With the great progress of deep learning, hand-
consider progressive learning ability of feature maps in both crafted features have been replaced by Convolutional Neu-
of different levels and shots. ral Networks (CNN). For example, Overfeat [25], Cascade-
Anchor matching Basically, pre-set anchors for each fea- CNN [14], MTCNN [38] adopt CNN as a sliding window
ture map are generated by regularly tiling a collection of detector on image pyramid to build feature pyramid. How-
boxes with different scales and aspect ratios on the image. ever, using an image pyramid is slow and memory ineffi-
Some works [27, 39] analyze a series of reasonable anchor cient. As the result, most two stage detectors extract fea-
scales and anchor compensation strategy to increase posi- tures on single scale. R-CNN [7, 8] obtains region propos-
tive anchors. However, such strategy ignores random sam- als by selective search [28], and then forwards each nor-
pling in data augmentation, which still causes imbalance be- malized image region through a CNN to classify. Faster
tween positive and negative anchors. R-CNN [24], R-FCN [5] employ Region Proposal Network
In this paper, we propose three novel techniques to ad- (RPN) to generate initial region proposals. Besides, ROI-
dress the above three issues, respectively. First, we intro- pooling [24] and position-sensitive RoI pooling [5] are ap-
duce a Feature Enhance Module (FEM) to enhance the dis- plied to extract features from each region.
criminability and robustness of the features, which com- More recently, some research indicates that multi-scale
bines the advantages of the FPN in PyramidBox and Re- features perform better for tiny objects. Specifically,
ceptive Field Block (RFB) in RFBNet [19]. Second, moti- SSD [20], MS-CNN [2], SSH [23], S3FD [39] predict
vated by the hierarchical loss [37] and pyramid anchor [27] boxes on multiple layers of feature hierarchy. FCN [22],
in PyramidBox, we design Progressive Anchor Loss (PAL) Hypercolumns [9], Parsenet [21] fuse multiple layer fea-
that uses progressive anchor sizes for not only different lev- tures in segmentation. FPN [15, 17], a top-down architec-
els, but also different shots. Specifically, we assign smaller ture, integrate high-level semantic information to all scales.
(a) Original Feature Shot
Input Image conv3_3 conv4_3 conv5_3 conv_fc7 conv6_2 conv7_2

First Shot PAL


(b) Feature Enhance Module

Second Shot PAL


640x640 160x160 80x80 40x40 20x20 10x10 5x5

(c) Enhanced Feature Shot


Figure 2: Our DSFD framework uses a Feature Enhance Module (b) on top of a feedforward VGG/ResNet architecture to generate the
enhanced features (c) from the original features (a), along with two loss layers named first shot PAL for the original features and second
shot PAL for the enchanted features.
Current feature map
FPN-based methods, such as FAN [31], PyramidBox [27]
N/3
achieve significant improvement on detection. However,
product N/3 concat
these methods do not consider the current layers informa-
tion. Different from the above methods that ignore the con- 1x1 N
conv N/3
text relationship between anchors, we propose a feature en-
Up feature map

hance module that incorporates multi-level dilated convolu-


tional layers to enhance the semantic of the features. 1x1
conv upsample dilation conv,kernel=3x3,rate=3
Loss Design Generally, the objective loss in detection is a
weighted sum of classification loss (e.g. softmax loss) and Figure 3: Illustration on Feature Enhance Module, in which
box regression loss (e.g. L2 loss). Girshick et al. [7] pro- the current feature map cell interactives with neighbors in current
pose smooth L1 loss to prevent exploding gradients. Lin feature maps and up feature maps.
et al. [18] discover that the class imbalance is one obsta-
cle for better performance in one stage detector, hence they 3. Dual Shot Face Detector
propose focal loss, a dynamically scaled cross entropy loss. We firstly introduce the pipeline of our proposed frame-
Besides, Wang et al. [32] design RepLoss for pedestrian de- work DSFD, and then detailly describe our feature enhance
tection, which improves performance in occlusion scenar- module in Sec. 3.2, progressive anchor loss in Sec. 3.3 and
ios. FANet [37] create a hierarchical feature pyramid and improved anchor matching in Sec. 3.4, respectively.
presents hierarchical loss for their architecture. However,
the anchors used in FANet are kept the same size in dif- 3.1. Pipeline of DSFD
ferent stages. In this work, we adaptively choose different
anchor sizes in different stages to facilitate the features. The framework of DSFD is illustrated in Fig. 2. Our
architecture uses the same extended VGG16 backbone as
Anchor Matching To make the model more robust, most PyramidBox [27] and S3FD [39], which is truncated be-
detection methods [20,35,39] do data augmentation, such as fore the classification layers and added with some aux-
color distortion, horizontal flipping, random crop and multi- iliary structures. We select conv3 3, conv4 3, conv5 3,
scale training. Zhang et al. [39] propose an anchor compen- conv fc7, conv6 2 and conv7 2 as the first shot detec-
sation strategy to make tiny faces to match enough anchors tion layers to generate six original feature maps named
during training. Wang et al. [35] propose random crop to of1 , of2 , of3 , of4 , of5 , of6 . Then, our proposed FEM trans-
generate large number of occluded faces for training. How- fers these original feature maps into six enhanced feature
ever, these methods ignore random sampling in data aug- maps named ef1 , ef2 , ef3 , ef4 , ef5 , ef6 , which have the
mentation, while ours combines anchor assign to provide same sizes as the original ones and are fed into SSD-style
better data initialization for anchor matching. head to construct the second shot detection layers. Note that
the input size of the training image is 640, which means the Table 1: The stride size, feature map size, anchor scale, ratio, and
feature map size of the lowest-level layer to highest-level number of six original/enhanced features for two shots.
Feature Stride Size Scale Ratio Number
layer is from 160 to 5. Different from S3FD and Pyramid- ef 1 (of 1) 4 160 × 160 16 (8) 1.5 : 1 25600
Box, after we utilize the receptive field enlargement in FEM ef 2 (of 2) 8 80 × 80 32 (16) 1.5 : 1 6400
and the new anchor design strategy, its unnecessary for the ef 3 (of 3) 16 40 × 40 64 (32) 1.5 : 1 1600
three sizes of stride, anchor and receptive field to satisfy ef 4 (of 4) 32 20 × 20 128 (64) 1.5 : 1 400
ef 5 (of 5) 64 10 × 10 256 (128) 1.5 : 1 100
equal-proportion interval principle. Therefore, our DSFD is ef 6 (of 6) 128 5×5 512 (256) 1.5 : 1 25
more flexible and robustness. Besides, the original and en-
hanced shots have two different losses, respectively named
First Shot progressive anchor Loss (FSL) and Second Shot vs. background), and Lloc is the smooth L1 loss between the
progressive anchor Loss (SSL). parameterizations of the predicted box ti and ground-truth
box gi using the anchor ai . When p∗i = 1 (p∗i = {0, 1}),
3.2. Feature Enhance Module the anchor ai is positive and the localization loss is acti-
vated. β is a weight to balance the effects of the two terms.
Feature Enhance Module is able to enhance original fea-
Compared to the enhanced feature maps in the same level,
tures to make them more discriminable and robust, which
the original feature maps have less semantic information for
is called FEM for short. For enhancing original neuron cell
classification but more high resolution location information
oc(i,j,l) , FEM utilizes different dimension information in-
for detection. Therefore, we believe that the original feature
cluding upper layer original neuron cell oc(i,j,l) and current
maps can detect and classify smaller faces. As the result, we
layer non-local neuron cells: nc(i−ε,j−ε,l) , nc(i−ε,j,l) , ...,
propose the First Shot multi-task Loss with a set of smaller
nc(i,j+ε,l) , nc(i+ε,j+ε,l) . Specially, the enhanced neuron
anchors as follows:
cell ec(i,j,l) can be mathematically defined as follow:
1
Σi Lconf (pi , p∗i )
LF SL (pi , p∗i , ti , gi , sai ) =
ec(i,j,l) = fconcat (fdilation (nc(i,j,l) )) Nconf
(1)
nci,j,l = fprod (oc(i,j,l) , fup (oc(i,j,l+1) )) β
+ Σi p∗i Lloc (ti , gi , sai ),
Nloc
where ci,j,l is a cell located in (i, j) coordinate of the feature (3)
maps in the l-th layer, f denotes a set of basic dilation con- where sa indicates the smaller anchors in the first shot lay-
volution, elem-wise production, up-sampling or concatena- ers, and the two shots losses can be weighted summed into
tion operations. Fig. 3 illustrates the idea of FEM, which is a whole Progressive Anchor Loss as follows:
inspired by FPN [17] and RFB [19]. Here, we first use 1×1
convolutional kernel to normalize the feature maps. Then, LP AL = LF SL (sa) + λLSSL (a). (4)
we up-sample upper feature maps to do element-wise prod-
Note that anchor size in the first shot is half of ones in the
uct with the current ones. Finally, we split the feature maps
second shot, and λ is weight factor. Detailed assignment
to three parts, followed by three sub-networks containing
on the anchor size is described in Sec. 3.4. In prediction
different numbers of dilation convolutional layers.
process, we only use the output of the second shot, which
3.3. Progressive Anchor Loss means no additional computational cost is introduced.
Different from the traditional detection loss, we design 3.4. Improved Anchor Matching
progressive anchor sizes for not only different levels, but
Current anchor matching method is bidirectional be-
also different shots in our framework. Motivated by the
tween the anchor and ground-truth face. Therefore, an-
statement in [24] that low-level features are more suitable
chor design and face sampling during augmentation are col-
for small faces, we assign smaller anchor sizes in the first
laborative to match the anchors and faces as far as pos-
shot, and use larger sizes in the second shot. First, our Sec-
sible for better initialization of the regressor. Our IAM
ond Shot anchor-based multi-task Loss function is defined
targets on addressing the contradiction between the dis-
as:
crete anchor scales and continuous face scales, in which
1 the faces are augmented by Sinput ∗ Sf ace /Sanchor (S in-
(Σi Lconf (pi , p∗i )
LSSL (pi , p∗i , ti , gi , ai ) =
Nconf dicates the spatial size) with the probability of 40% so as
β to increase the positive anchors, stabilize the training and
+ Σi p∗i Lloc (ti , gi , ai )), thus improve the results. Table 1 shows details of our an-
Nloc
(2) chor design on how each feature map cell is associated to
where Nconf and Nloc indicate the number of positive and the fixed shape anchor. We set anchor ratio 1.5:1 based
negative anchors, and the number of positive anchors re- on face scale statistics. Anchor size for the original fea-
spectively, Lconf is the softmax loss over two classes (face ture is one half of the enhanced feature. Additionally, with
3.4. Improved Anchor Matching

The current anchor matching method is bidirectional between the anchor and the ground-truth face. Therefore, anchor design and face sampling during augmentation are collaborative in matching the anchors and faces as far as possible for better initialization of the regressor. Our IAM targets the contradiction between the discrete anchor scales and continuous face scales, in which the faces are augmented by S_input * S_face / S_anchor (S indicates the spatial size) with a probability of 40% so as to increase the positive anchors, stabilize the training and thus improve the results. Table 1 shows the details of our anchor design and how each feature map cell is associated to the fixed-shape anchor. We set the anchor ratio to 1.5:1 based on face scale statistics. The anchor size for the original feature is one half of that of the enhanced feature. Additionally, with a probability of 2/5, we utilize anchor-based sampling like the data-anchor-sampling in PyramidBox, which randomly selects a face in an image, crops a sub-image containing the face, and sets the size ratio between the sub-image and the selected face to 640/rand(16, 32, 64, 128, 256, 512). For the remaining 3/5 probability, we adopt data augmentation similar to SSD [20]. In order to improve the recall rate of faces and ensure anchor classification ability simultaneously, we set the Intersection-over-Union (IoU) threshold to 0.4 when assigning anchors to their ground-truth faces.

4. Experiments

4.1. Implementation Details

First, we present the details of implementing our network. The backbone networks are initialized by the pre-trained VGG/ResNet on ImageNet. All newly added convolution layers' parameters are initialized by the 'xavier' method. We use SGD with 0.9 momentum and 0.0005 weight decay to fine-tune our DSFD model. The batch size is set to 16. The learning rate is set to 10^-3 for the first 40k steps, and we decay it to 10^-4 and 10^-5 for two further 10k-step stages.

During inference, the first shot's outputs are ignored and the second shot predicts the top 5k high-confidence detections. Non-maximum suppression is applied with a Jaccard overlap of 0.3 to produce the top 750 high-confidence bounding boxes per image. For the 4 bounding box coordinates, we round down the top-left coordinates and round up the width and height to expand the detection bounding box. The official code has been released at: https://github.com/TencentYoutuResearch/FaceDetection-DSFD.
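A minimal sketch of the inference post-processing described above (keep the second-shot detections, apply NMS at IoU 0.3, keep the top 750 boxes, and expand each box by flooring the top-left corner and ceiling the width and height). It assumes torchvision's `nms` operator and (x1, y1, x2, y2) boxes; it is only an illustration of the described steps, not the released code.

```python
# Sketch of the DSFD inference post-processing described in Sec. 4.1 (assumed details).
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, top_k=5000, nms_iou=0.3, keep=750):
    # Keep the top-5k second-shot detections by confidence.
    scores, order = scores.sort(descending=True)
    scores, order = scores[:top_k], order[:top_k]
    boxes = boxes[order]

    # Non-maximum suppression with Jaccard (IoU) overlap 0.3, then keep top 750 boxes.
    kept = nms(boxes, scores, nms_iou)[:keep]
    boxes, scores = boxes[kept], scores[kept]

    # Expand each detection: round down the top-left corner, round up width and height.
    x1, y1 = boxes[:, 0].floor(), boxes[:, 1].floor()
    w = (boxes[:, 2] - boxes[:, 0]).ceil()
    h = (boxes[:, 3] - boxes[:, 1]).ceil()
    return torch.stack([x1, y1, x1 + w, y1 + h], dim=1), scores
```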
Table 2: Effectiveness of the Feature Enhance Module on the AP performance.
Component Easy Medium Hard
FSSD+VGG16 92.6% 90.2% 79.1%
FSSD+VGG16+FEM 93.0% 91.4% 84.6%

Table 3: Effectiveness of the Progressive Anchor Loss on the AP performance.
Component Easy Medium Hard
FSSD+RES50 93.7% 92.2% 81.8%
FSSD+RES50+FEM 95.0% 94.1% 88.0%
FSSD+RES50+FEM+PAL 95.3% 94.4% 88.6%

Figure 4: The number distribution of different scales of faces compared between traditional anchor matching (Left) and our improved anchor matching (Right).

Figure 5: Comparisons of the number distribution of matched anchors for ground-truth faces between traditional anchor matching (blue line) and our improved anchor matching (red line). We actually set the IoU threshold to 0.35 for the traditional version. That means even with a higher threshold (i.e., 0.4), using our IAM, we can still achieve more matched anchors. Here, we choose a slightly higher threshold in IAM so as to better balance the number and quality of the matched faces.

4.2. Analysis on DSFD

In this subsection, we conduct extensive experiments and ablation studies on the WIDER FACE dataset to evaluate the effectiveness of several contributions of our proposed framework, including the feature enhance module, the progressive anchor loss, and the improved anchor matching. For fair comparisons, we use the same parameter settings for all the experiments, except for the specified changes to the components. All models are trained on the WIDER FACE training set and evaluated on the validation set. To better understand DSFD, we select different baselines to ablate how each component affects the final performance.

Feature Enhance Module First, we adopt the anchors designed in S3FD [39] and PyramidBox [27] and the six original feature maps generated by VGG16 to perform classification and regression, which is named Face SSD (FSSD) and serves as the baseline. We then use the VGG16-based FSSD as the baseline to which the feature enhance module is added for comparison. Table 2 shows that our feature enhance module can improve the VGG16-based FSSD from 92.6%, 90.2%, 79.1% to 93.0%, 91.4%, 84.6%.

Progressive Anchor Loss Second, we use a Res50-based FSSD as the baseline to which the progressive anchor loss is added for comparison. We use the outputs of four residual blocks in ResNet to replace the outputs of conv3_3, conv4_3, conv5_3 and conv_fc7 in VGG. Except for VGG16, we do not perform layer normalization. Table 3 shows that our progressive anchor loss can improve the Res50-based FSSD using FEM from 95.0%, 94.1%, 88.0% to 95.3%, 94.4%, 88.6%.

Improved Anchor Matching To evaluate our improved anchor matching strategy, we use a Res101-based FSSD without anchor compensation as the baseline. Table 4 shows that our improved anchor matching can improve the Res101-based FSSD using FEM from 95.8%, 95.1%, 89.7% to 96.1%, 95.2%, 90.0%. Finally, we can improve our DSFD to 96.6%, 95.7%, 90.4% with ResNet152 as the backbone.
Figure 6: Precision-recall curves on the WIDER FACE validation (Easy, Medium, Hard) and testing (Easy, Medium, Hard) subsets.
Table 4: Effectiveness of Improved Anchor Matching on the AP performance.
Component Easy Medium Hard
FSSD+RES101 95.1% 93.6% 83.7%
FSSD+RES101+FEM 95.8% 95.1% 89.7%
FSSD+RES101+FEM+IAM 96.1% 95.2% 90.0%
FSSD+RES101+FEM+IAM+PAL 96.3% 95.4% 90.1%
FSSD+RES152+FEM+IAM+PAL 96.6% 95.7% 90.4%
FSSD+RES152+FEM+IAM+PAL+LargeBS 96.4% 95.7% 91.2%

Table 5: Effectiveness of different backbones.


Component Params ACC@Top-1 Easy Medium Hard
FSSD+RES101+FEM+IAM+PAL 399M 77.44% 96.3% 95.4% 90.1%
FSSD+RES152+FEM+IAM+PAL 459M 78.42% 96.6% 95.7% 90.4%
FSSD+SE-RES101+FEM+IAM+PAL 418M 78.39% 95.7% 94.7% 88.6%
FSSD+DPN98+FEM+IAM+PAL 515M 79.22% 96.3% 95.5% 90.4%
FSSD+SE-RESNeXt101 32×4d+FEM+IAM+PAL 416M 80.19% 95.7% 94.8% 88.9%

Table 6: FEM vs. RFB on WIDER FACE.
Backbone - ResNet101 (%) Easy Medium Hard
DSFD (RFB) 96.0 94.5 87.2
DSFD (FPN) / (FPN+RFB) 96.2 / 96.2 95.1 / 95.3 89.7 / 89.9
DSFD (FEM) 96.3 95.4 90.1

Besides, Fig. 4 shows that our improved anchor matching strategy greatly increases the number of ground-truth faces that are close to the anchors, which can reduce the contradiction between the discrete anchor scales and continuous face scales. Moreover, Fig. 5 shows the number distribution of matched anchors for ground-truth faces, which indicates that our improved anchor matching can significantly increase the number of matched anchors, and the averaged number of matched anchors for different scales of faces can be improved from 6.4 to about 6.9.

Comparison with RFB Our FEM differs from RFB in two aspects. First, our FEM is based on FPN to make full use of feature information from different spatial levels, while RFB ignores this. Second, our FEM adopts stacked dilation convolutions in a multi-branch structure, which efficiently leads to larger Receptive Fields (RF) than RFB, which only uses one dilation layer in each branch, e.g., R^3 in FEM compared to R in RFB, where R indicates the RF of one dilation convolution. Tab. 6 clearly demonstrates the superiority of our FEM over RFB, even when RFB is equipped with FPN.
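As a rough illustration of the multi-branch, stacked-dilation design contrasted with RFB above, the snippet below sketches a feature-enhance-style block in PyTorch. The split into three parts and the differing number of dilated layers per branch follow the textual description; the exact channel counts, dilation rate and fusion layer are assumptions rather than the authors' implementation.

```python
# Sketch of a feature-enhance-style block: the (FPN-fused) feature map is split into
# three parts, and each part goes through a different number of stacked dilated 3x3
# convolutions before concatenation. Channel counts and dilation rate are assumed values.
import torch
import torch.nn as nn

class FeatureEnhanceBlock(nn.Module):
    def __init__(self, channels=256, dilation=3):
        super().__init__()
        c = channels // 3

        def branch(n_layers, in_c):
            layers = []
            for i in range(n_layers):
                layers += [nn.Conv2d(in_c if i == 0 else c, c, 3,
                                     padding=dilation, dilation=dilation),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        # Three sub-networks with different numbers of dilated convolution layers.
        self.branch1 = branch(1, c)
        self.branch2 = branch(2, c)
        self.branch3 = branch(3, channels - 2 * c)
        self.fuse = nn.Conv2d(3 * c, channels, 1)

    def forward(self, x):
        c = x.size(1) // 3
        x1, x2, x3 = x[:, :c], x[:, c:2 * c], x[:, 2 * c:]
        out = torch.cat([self.branch1(x1), self.branch2(x2), self.branch3(x3)], dim=1)
        return self.fuse(out)
```

The padding equals the dilation rate, so the block keeps the spatial resolution while the deeper branches accumulate progressively larger receptive fields.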
From the above analysis and results, some promising conclusions can be drawn: 1) Feature enhancement is crucial. We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 different-scale detection feature maps, and it improves the performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initial anchors and ground-truth faces for regression, which achieves improvements of 0.3%, 0.1%, 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result in the hard setting reaches 91.2% AP.

Effects of Different Backbones To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, we use the same settings except for the feature extraction network; we implement SE-ResNet101, DPN-98 and SE-ResNeXt101 32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101 32×4d got 95.7%, 94.8%, 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model and higher Top-1 ImageNet classification accuracy may not benefit face detection AP. Therefore, in our DSFD framework, better performance on classification is not necessary for better performance on detection, which is consistent with the conclusion claimed in [11, 16]. Our DSFD enjoys high inference speed, benefiting from simply using the second-shot detection results. For VGA-resolution inputs to the Res50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.

Figure 7: Comparisons with popular state-of-the-art methods on the FDDB dataset (discontinuous and continuous ROC curves). The first row shows the ROC results without additional annotations, and the second row shows the ROC results with additional annotations.

4.3. Comparisons with State-of-the-Art Methods

We evaluate the proposed DSFD on two popular face detection benchmarks, including WIDER FACE [35] and the Face Detection Data Set and Benchmark (FDDB) [12]. Our model is trained only on the training set of WIDER FACE, and then evaluated on both benchmarks without any further fine-tuning. We also follow the approach used in [31] to build an image pyramid for multi-scale testing and use a more powerful backbone similar to [4].

WIDER FACE Dataset It contains 393,703 annotated faces with large variations in scale, pose and occlusion in a total of 32,203 images. For each of the 60 event classes, 40%, 10%, 50% of the images of the database are randomly selected as the training, validation and testing sets. Besides, each subset is further divided into three levels of difficulty: 'Easy', 'Medium', 'Hard', based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all of the state-of-the-art face detectors based on the average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on the validation set, and 96.0% (Easy), 95.3% (Medium) and 90.0% (Hard) on the test set.
Fig. 8 shows more examples that demonstrate the effectiveness of DSFD in handling faces with various variations, in which the blue bounding boxes indicate that the detector confidence is above 0.8.

Figure 8: Illustration of the robustness of our DSFD to large variations in scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate that the detector confidence is above 0.8.

FDDB Dataset It contains 5,171 faces in 2,845 images taken from the Faces in the Wild data set. Since WIDER FACE has bounding box annotations while faces in FDDB are represented by ellipses, we learn a post-hoc ellipse regressor to transform the final prediction results. As shown in Fig. 7, our DSFD achieves state-of-the-art performance on both the discontinuous and continuous ROC curves, i.e. 99.1% and 86.2% when the number of false positives equals 1,000. After adding additional annotations to those unlabeled faces [39], the false positives of our model can be further reduced, outperforming all other methods.

5. Conclusions

This paper introduces a novel face detector named Dual Shot Face Detector (DSFD). In this work, we propose a novel Feature Enhance Module that utilizes information from different levels and thus obtains more discriminative and robust features. Auxiliary supervision, introduced in early layers by using smaller anchors, is adopted to effectively facilitate the features. Moreover, an improved anchor matching method is introduced to match anchors and ground-truth faces as closely as possible to provide better initialization for the regressor. Comprehensive experiments are conducted on the popular face detection benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD compared with state-of-the-art face detectors, e.g., SRN and PyramidBox.
References

[1] Yotam Abramson, Bruno Steux, and Hicham Ghorayeb. Yet even faster (yef) real-time object detection. International Journal of Intelligent Systems Technologies and Applications, 2(2-3):102–112, 2007. 2
[2] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2016. 2
[3] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1
[4] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z Li, and Xudong Zou. Selective refinement network for high performance face detection. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2019. 2, 7
[5] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016. 2
[6] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv:1801.07698v1, 2018. 1
[7] Ross Girshick. Fast r-cnn. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015. 2, 3
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014. 2
[9] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1
[11] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 7
[12] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010. 7
[13] Kobi Levi and Yair Weiss. Learning object detection from a small number of examples: the importance of good features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004. 2
[14] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[15] Jian Li, Jianjun Qian, and Jian Yang. Object detection via feature fusion based single network. In IEEE International Conference on Image Processing, 2017. 2
[16] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: A backbone network for object detection. In Proceedings of European Conference on Computer Vision, 2018. 7
[17] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 4
[18] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2, 3
[19] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In Proceedings of European Conference on Computer Vision, 2018. 2, 4
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), 2016. 2, 3, 5
[21] Wei Liu, Andrew Rabinovich, and Alexander Berg. Parsenet: Looking wider to see better. In Proceedings of International Conference on Learning Representations Workshop, 2016. 2
[22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[23] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry S Davis. Ssh: Single stage headless face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015. 2, 4
[25] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of International Conference on Learning Representations (ICLR), 2014. 2
[26] Ying Tai, Yicong Liang, Xiaoming Liu, Lei Duan, Jilin Li, Chengjie Wang, Feiyue Huang, and Yu Chen. Towards highly accurate and stable face alignment for high-resolution videos. In The AAAI Conference on Artificial Intelligence (AAAI), 2019. 1
[27] Xu Tang, Daniel K Du, Zeqiang He, and Jingtuo Liu. Pyramidbox: A context-assisted single shot face detector. In Proceedings of European Conference on Computer Vision (ECCV), 2018. 2, 3, 5
[28] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gev-
ers, and Arnold WM Smeulders. Selective search for ob-
ject recognition. International Journal of Computer Vision,
104(2):154–171, 2013. 2
[29] Paul Viola and Michael J Jones. Robust real-time face detec-
tion. International Journal of Computer Vision, 57(2):137–
154, 2004. 1, 2
[30] Hao Wang, Zhifeng Li, Xing Ji, and Yitong Wang. Face r-
cnn. arXiv preprint arXiv:1706.01061, 2017. 2
[31] Jianfeng Wang, Ye Yuan, and Gang Yu. Face attention net-
work: An effective face detector for the occluded faces.
arXiv preprint arXiv:1711.07246, 2017. 3, 7
[32] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian
Sun, and Chunhua Shen. Repulsion loss: Detecting pedes-
trians in a crowd. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2018. 3
[33] Yitong Wang, Xing Ji, Zheng Zhou, Hao Wang, and Zhifeng
Li. Detecting faces using region-based fully convolutional
networks. arXiv preprint arXiv:1709.05256, 2017. 2
[34] Jian Yang, Lei Luo, Jianjun Qian, Ying Tai, Fanlong Zhang,
and Yong Xu. Nuclear norm based matrix regression with
applications to face recognition with occlusion and illumi-
nation changes. IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI), 39(1):156–171, 2017. 1
[35] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang.
Wider face: A face detection benchmark. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2016. 3, 7
[36] Changzheng Zhang, Xiang Xu, and Dandan Tu. Face
detection using improved faster rcnn. arXiv preprint
arXiv:1802.02142, 2018. 2
[37] Jialiang Zhang, Xiongwei Wu, Jianke Zhu, and Steven CH
Hoi. Feature agglomeration networks for single stage face
detection. arXiv preprint arXiv:1712.00721, 2017. 2, 3
[38] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao.
Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Letters,
23(10):1499–1503, 2016. 2
[39] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo
Wang, and Stan Z Li. Sˆ 3fd: Single shot scale-invariant face
detector. In Proceedings of IEEE International Conference
on Computer Vision (ICCV), 2017. 2, 3, 5, 8
GANFIT: Generative Adversarial Network Fitting
for High Fidelity 3D Face Reconstruction

Baris Gecer1,2 , Stylianos Ploumpis1,2 , Irene Kotsia3 , and Stefanos Zafeiriou1,2


1 Imperial College London   2 FaceSoft.io   3 University of Middlesex
arXiv:1902.05978v2 [cs.CV] 6 Apr 2019

{b.gecer, s.ploumpis, s.zafeiriou}@imperial.ac.uk , drkotsia@gmail.com

Figure 1: The proposed deep fitting approach can reconstruct high quality texture and geometry from a single image with
precise identity recovery. The reconstructions in the figure and the rest of the paper are represented by a vector of size 700
floating points and rendered without any special effects. We would like to highlight that the depicted texture is reconstructed
by our model, and none of the features are taken directly from the image.

Abstract

In the past few years, a lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In the most recent works, differentiable renderers were employed in order to learn the relationship between the facial identity features and the parameters of a 3D morphable model for shape and texture. The texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, the quality of the facial texture reconstruction of the state-of-the-art methods is still not capable of modeling textures in high fidelity. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful generator of facial texture in UV space. Then, we revisit the original 3D Morphable Models (3DMMs) fitting approaches making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image but under a new perspective. We optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. We demonstrate excellent results in photorealistic and identity preserving 3D face reconstructions and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.1

1 Project page: https://github.com/barisgecer/ganfit

1. Introduction

Estimation of the 3D facial surface and other intrinsic components of the face from single images (e.g., albedo, etc.) is a very important problem at the intersection of computer vision and machine learning with countless applications (e.g., face recognition, face editing, virtual reality). It is now twenty years from the seminal work of Blanz and Vetter [4], which showed that it is possible to reconstruct shape and albedo by solving a non-linear optimization
problem that is constrained by linear statistical models of facial texture and shape. This statistical model of texture and shape is called a 3D Morphable Model (3DMM). Arguably the most popular publicly available 3DMM is the Basel model built from 200 people [21]. Recently, large scale statistical models of face and head shape have been made publicly available [7, 10].

For many years 3DMMs and their variants were the methods of choice for 3D face reconstruction [33, 46, 22]. Furthermore, with appropriate statistical texture models on image features such as Scale Invariant Feature Transform (SIFT) and Histogram Of Gradients (HOG), 3DMM-based methodologies can still achieve state-of-the-art performance in 3D shape estimation on images captured under unconstrained conditions [6]. Nevertheless, those methods [6] can reconstruct only the shape and not the facial texture. Another line of research in [45, 34] decouples texture and shape reconstruction. A standard linear 3DMM fitting strategy [41] is used for face reconstruction followed by a number of steps for texture completion and refinement. In these papers [34, 45], the texture looks excellent when rendered under professional renderers (e.g., Arnold); nevertheless, when the texture is overlaid on the images the quality significantly drops2.

In the past two years, a lot of work has been conducted on how to harness Deep Convolutional Neural Networks (DCNNs) for 3D shape and texture reconstruction. The first such methods either trained regression DCNNs from image to the parameters of a 3DMM [42] or used a 3DMM to synthesize images [30, 18] and formulated an image-to-image translation problem using DCNNs to estimate the depth3 [36]. The more recent unsupervised DCNN-based methods are trained to regress 3DMM parameters from identity features by making use of differentiable image formation architectures [9] and differentiable renderers [16, 40, 31].

The most recent methods such as [39, 43, 14] use both the 3DMM model, as well as additional network structures (called correctives), in order to extend the shape and texture representation. Even though the paper [39] shows that the reconstructed facial texture has indeed more details than a texture estimated from a 3DMM [42, 40], it is still unable to capture high-frequency details in texture and subsequently many identity characteristics (please see Fig. 4). Furthermore, because the method permits the reconstructions to be outside the 3DMM space, it is susceptible to outliers (e.g., glasses etc.) which are baked in shape and texture. Although rendering networks (i.e., trained by VAE [26]) generate outstanding quality textures, each network is capable of storing only up to a few individuals, who should be placed in a controlled environment to collect ~20 million images.

In this paper, we still propose to build upon the success of DCNNs but take a radically different approach for 3D shape and texture reconstruction from a single in-the-wild image. That is, instead of formulating regression methodologies or auto-encoder structures that make use of self-supervision [39, 16, 43], we revisit the optimization-based 3DMM fitting approach with the supervision of deep identity features and by using Generative Adversarial Networks (GANs) as our statistical parametric representation of the facial texture.

In particular, the novelties that this paper brings are:
- We show for the first time, to the best of our knowledge, that a large-scale high-resolution statistical reconstruction of the complete facial surface on an unwrapped UV space can be successfully used for reconstruction of arbitrary facial textures even captured in unconstrained recording conditions4.
- We formulate a novel 3DMM fitting strategy which is based on GANs and a differentiable renderer.
- We devise a novel cost function which combines various content losses on deep identity features from a face recognition network.
- We demonstrate excellent facial shape and texture reconstructions in arbitrary recording conditions that are shown to be both photorealistic and identity preserving in qualitative and quantitative experiments.

2. History of 3DMM Fitting

Our methodology naturally extends and generalizes the ideas of texture and shape 3DMMs using modern methods for representing texture with GANs, as well as defining loss functions using differentiable renderers and very powerful publicly available face recognition networks [12]. Before we define our cost function, we will briefly outline the history of 3DMM representation and fitting.

2.1. 3DMM representation

The first step is to establish dense correspondences between the training 3D facial meshes and a chosen template with fixed topology in terms of vertices and triangulation.

2 Please see the supplementary materials for a comparison with [34, 45].
3 The depth was afterwards refined by fitting a 3DMM and then changing the normals by using image features.
4 In very recent works, it was shown that it is feasible to reconstruct the non-visible parts of a UV space for facial texture completion [11] and that GANs can be used to generate novel high-resolution faces [38]. Nevertheless, our work is the first one that demonstrates that a GAN can be used as a powerful statistical texture prior and reconstruct the complete texture of arbitrary facial images.

Figure 2: Detailed overview of the proposed approach. A 3D face reconstruction is rendered by a differentiable renderer
(shown in purple). Cost functions are mainly formulated by means of identity features on a pretrained face recognition
network (shown in gray) and they are optimized by flowing the error all the way back to the latent parameters (ps , pe , pt , c, i,
shown in green) with gradient descent optimization. End-to-end differentiable architecture enables us to use computationally
cheap and reliable first order derivatives for optimization thus making it possible to employ deep networks as a generator
(i.e., statistical model) or as a cost function.

2.1.1 Texture

Traditionally, 3DMMs use a UV map for representing texture. UV maps help us to assign 3D texture data to 2D planes with universal per-pixel alignment for all textures. A commonly used UV map is built by cylindrically unwrapping the mean shape into a 2D flat space formulation, which we use to create an RGB image I_UV. Each vertex in the 3D space has a texture coordinate t_coord in the UV image plane in which the texture information is stored. A universal function exists where, for each vertex, we can sample the texture information from the UV space as T = P(I_UV, t_coord).

In order to define a statistical texture representation, all the training texture UV maps are vectorized and Principal Component Analysis (PCA) is applied. Under this model, any test texture T^0 is approximated as a linear combination of the mean texture m_t and a set of bases U_t as follows:

T(p_t) ≈ m_t + U_t p_t    (1)

where p_t are the texture parameters for the test sample T^0. In the early 3DMM studies, the statistical model of the texture was built with few faces captured in strictly controlled conditions and was used to reconstruct the test albedo of the face. Since such texture models can hardly represent faces captured in uncontrolled recording conditions (in-the-wild), it was recently proposed to use statistical models of hand-crafted features such as SIFT or HoG [6] directly from in-the-wild faces. The interested reader is referred to [5, 32] for more details on texture models used in 3DMM fitting algorithms.

The recent 3D face fitting methods [39, 43, 14] still make use of similar statistical models for the texture. Hence, they can naturally represent only the low-frequency components of the facial texture (please see Fig. 4).

2.1.2 Shape

The method of choice for building statistical models of facial or head 3D shapes is still PCA [23]. Assume that the 3D shapes in correspondence comprise N vertices, i.e. s = [x_1^T, ..., x_N^T]^T = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T. In order to represent variations in terms of both identity and expression, generally two linear models are used. The first is learned from facial scans displaying the neutral expression (i.e., representing identity variations) and the second is learned from displacement vectors (i.e., representing expression variations). Then a test facial shape S(p_{s,e}) can be written as

S(p_{s,e}) ≈ m_{s,e} + U_{s,e} p_{s,e}    (2)

where m_{s,e} is the mean shape vector, U_{s,e} ∈ R^{3N×n_{s,e}} is U_{s,e} = [U_s, U_e], where U_s are the bases that correspond to identity variations and U_e the bases that correspond to expression. Finally, p_{s,e} are the n_{s,e} shape parameters, which can be split according to the identity and expression bases: p_{s,e} = [p_s, p_e].
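To make the linear models in Eq. 1 and Eq. 2 concrete, here is a minimal NumPy sketch of evaluating a PCA texture and shape model from its parameters; the array names and toy dimensions are illustrative assumptions, not any released model.

```python
# Minimal sketch of evaluating the linear 3DMM models of Eq. 1 and Eq. 2.
# Shapes and variable names are assumptions for illustration only.
import numpy as np

def pca_texture(m_t, U_t, p_t):
    """Eq. 1: T(p_t) ~= m_t + U_t p_t, with m_t (D,), U_t (D, n_t), p_t (n_t,)."""
    return m_t + U_t @ p_t

def pca_shape(m_se, U_s, U_e, p_s, p_e):
    """Eq. 2: S(p_{s,e}) ~= m_{s,e} + [U_s, U_e] [p_s; p_e]; returns a (3N,) vector
    [x1, y1, z1, ..., xN, yN, zN] that can be reshaped to (N, 3) vertices."""
    U_se = np.concatenate([U_s, U_e], axis=1)
    p_se = np.concatenate([p_s, p_e])
    return m_se + U_se @ p_se

# Toy example with the basis sizes reported later in the paper (n_s = 158, n_e = 29):
N = 10
vertices = pca_shape(np.zeros(3 * N),
                     np.random.randn(3 * N, 158), np.random.randn(3 * N, 29),
                     np.zeros(158), np.zeros(29)).reshape(N, 3)
```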
2.2. Fitting

3D face and texture reconstruction by fitting a 3DMM is performed by solving a non-linear, energy-based cost optimization problem that recovers a set of parameters p = [p_{s,e}, p_t, p_c, p_l], where p_c are the parameters related to a camera model and p_l are the parameters related to an illumination model. The optimization can be formulated as:

min_p E(p) = ||I^0(p) − W(p)||_2^2 + Reg({p_{s,e}, p_t})    (3)

where I^0 is the test image to be fitted and W is a vector produced by a physical image formation process (i.e., rendering) controlled by p. Finally, Reg is the regularization term that is mainly related to the texture and shape parameters.

Various methods have been proposed for numerical optimization of the above cost functions [19, 2]. A notable recent approach is [6], which uses handcrafted features (i.e., H) for texture representation and simplifies the cost function as:

min_{p^r} E(p^r) = ||H(I^0(p^r)) − H(W(p^r))||_A^2 + Reg(p_{s,e})    (4)

where ||a||_A^2 = a^T A a, A is the orthogonal space to the statistical model of the texture and p^r is the set of reduced parameters p^r = {p_{s,e}, p_c}. The optimization problem in Eq. 4 is solved by the Gauss-Newton method. The main drawback of this method is that the facial texture is not reconstructed.

In this paper, we generalize the 3DMM fittings and introduce the following novelties:
- We use a GAN on high-resolution UV maps as our statistical representation of the facial texture. That way we can reconstruct textures with high-frequency details.
- Instead of other cost functions used in the literature, such as low-level l1 or l2 losses (e.g., RGB values [29], edges [33]) or hand-crafted features (e.g., SIFT [6]), we propose a novel cost function that is based on feature losses from the various layers of a publicly available face recognition embedding network [12]. Unlike others, deep identity features are very powerful at preserving the identity characteristics of the input image.
- We replace the physical image formation stage with a differentiable renderer to make use of first-order derivatives (i.e., gradient descent). Unlike its alternatives, gradient descent provides computationally cheaper and more reliable derivatives through such deep architectures (i.e., the above-mentioned texture GAN and identity DCNN).

3. Approach

We propose an optimization-based 3D face reconstruction approach from a single image that employs a high fidelity texture generation network as a statistical prior, as illustrated in Fig. 2. To this end, the reconstruction mesh is formed by the 3D morphable shape model, textured by the generator network's output UV map, and projected into a 2D image by a differentiable renderer. The distance between the rendered image and the input image is minimized in terms of a number of cost functions by updating the latent parameters of the 3DMM and the texture network with gradient descent. We mainly formulate these functions based on rich features of a face recognition network [12, 35, 28] for smoother convergence, and a landmark detection network [13] for alignment and rough shape estimation.

The following sections introduce firstly our novel texture model that employs a generator network trained with the progressive growing GAN framework. After describing the procedure for image formation with a differentiable renderer, we formulate our cost functions and the procedure for fitting our shape and texture models onto a test image.

3.1. GAN Texture Model

Although conventional PCA is powerful enough to build a decent shape and texture model, it is often unable to capture high frequency details and ends up with blurry textures due to its Gaussian nature. This becomes more apparent in texture modelling, which is a key component in 3D reconstruction for preserving identity as well as photo-realism.

GANs are shown to be very effective at capturing such details. However, they suffer at preserving the 3D coherency [17] of the target distribution when the training images are only semi-aligned. We found that a GAN trained with UV representations of real textures with per-pixel alignment avoids this problem and is able to generate realistic and coherent UVs from 99.9% of its latent space while at the same time generalizing well to unseen data.

In order to take advantage of this perfect harmony, we train a progressive growing GAN [24] to model the distribution of UV representations of 10,000 high resolution textures and use the trained generator network

G(p_t) : R^512 → R^{H×W×C}    (5)

as the texture model that replaces the 3DMM texture model in Eq. 1. While fitting with linear models, i.e. 3DMM, is as simple as a linear transformation, fitting with a generator network can be formulated as an optimization that minimizes the per-pixel Manhattan distance between the target texture in UV space I_uv and the network output G(p_t) with respect to the latent parameter p_t, i.e. min_{p_t} |G(p_t) − I_uv|.
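The latent-code fitting described above can be sketched as a small gradient-descent loop; `generator` stands in for the trained progressive GAN generator G and is an assumed interface, not the released model.

```python
# Sketch of fitting the GAN latent code p_t to a target UV texture by minimizing
# the per-pixel L1 (Manhattan) distance, as described in Sec. 3.1.
# `generator` is an assumed callable mapping a (1, 512) code to a UV image tensor.
import torch

def fit_latent(generator, target_uv, steps=200, lr=0.01):
    p_t = torch.zeros(1, 512, requires_grad=True)
    optimizer = torch.optim.Adam([p_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (generator(p_t) - target_uv).abs().mean()  # min_{p_t} |G(p_t) - I_uv|
        loss.backward()
        optimizer.step()
    return p_t.detach()
```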
3.2. Differentiable Renderer

Following [16], we employ a differentiable renderer to project the 3D reconstruction onto the 2D image plane based on a deferred shading model with given camera and illumination parameters. Since the color and normal attributes at each vertex are interpolated at the corresponding pixels with barycentric coordinates, gradients can be easily backpropagated through the renderer to the latent parameters.

A 3D textured mesh at the Cartesian origin [0, 0, 0] is projected onto the 2D image plane by a pinhole camera model with the camera standing at [x_c, y_c, z_c], directed towards [x'_c, y'_c, z'_c] and with focal length f_c. The illumination is modelled by Phong shading given 1) a direct light source at 3D coordinates [x_l, y_l, z_l] with color values [r_l, g_l, b_l], and 2) the color of the ambient lighting [r_a, g_a, b_a]. Finally, we denote the rendered image given geometry (p_{s,e}), texture (p_t), camera (p_c = [x_c, y_c, z_c, x'_c, y'_c, z'_c, f_c]) and lighting parameters (p_l = [x_l, y_l, z_l, r_l, g_l, b_l, r_a, g_a, b_a]) by the following:

I^R = R(S(p_s, p_e), P(G(p_t)), p_c, p_l)    (6)

where we construct the shape mesh by the 3DMM as given in Eq. 2 and the texture by the GAN generator network as in Eq. 5. Since our differentiable renderer supports only color vectors, we sample from our generated UV map to get a vectorized color representation as explained in Sec. 2.1.1.

Additionally, we render a secondary image with random expression, pose and illumination in order to generalize the identity-related parameters well to those variations. We sample the expression parameters from a normal distribution as p̂_e ~ N(µ = 0, σ = 0.5) and sample the camera and illumination parameters from the Gaussian distribution of the 300W-3D dataset as p̂_c ~ N(µ̂_c, σ̂_c) and p̂_l ~ N(µ̂_l, σ̂_l). This rendered image of the same identity as I^R (i.e., with the same p_s and p_t parameters) is expressed by the following:

Î^R = R(S(p_s, p̂_e), P(G(p_t)), p̂_c, p̂_l)    (7)

3.3. Cost Functions

Given an input image I^0, we optimize all of the aforementioned parameters simultaneously with gradient descent updates. In each iteration, we simply calculate the forthcoming cost terms for the current state of the 3D reconstruction, and take the derivative of the weighted error with respect to the parameters using backpropagation.

3.3.1 Identity Loss

With the availability of large scale datasets, CNNs have shown incredible performance on many face recognition benchmarks. Their strong identity features are robust to many variations, including pose, expression, illumination, age, etc. These features are shown to be quite effective at many other tasks, including novel identity synthesis [15], face normalization [9] and 3D face reconstruction [16]. In our approach, we take advantage of an off-the-shelf state-of-the-art face recognition network [12]5 in order to capture identity-related features of an input face image and optimize the latent parameters accordingly. More specifically, given a pretrained face recognition network F^n(I) : R^{H×W×C} → R^512 consisting of n convolutional filters, we calculate the cosine distance between the identity features (i.e., embeddings) of the real target image and our rendered images as follows:

L_id = 1 − ( F^n(I^0) · F^n(I^R) ) / ( ||F^n(I^0)||_2 ||F^n(I^R)||_2 )    (8)

We formulate an additional identity loss on the rendered image Î^R that is rendered with random pose, expression and lighting. This loss ensures that our reconstruction resembles the target identity under different conditions. We formulate it by replacing I^R with Î^R in Eq. 8, and it is denoted as L̂_id.

3.3.2 Content Loss

Face recognition networks are trained to remove all kinds of attributes (e.g. expression, illumination, age, pose) other than abstract identity information throughout the convolutional layers. Despite their strength, the activations in the very last layer discard some of the mid-level features that are useful for 3D reconstruction, e.g. variations that depend on age. Therefore, we found it effective to accompany the identity loss by leveraging intermediate representations in the face recognition network that are still robust to pixel-level deformations and not too abstract to miss some details. To this end, the normalized euclidean distance of intermediate activations, namely the content loss, is minimized between the input and rendered images with the following loss term:

L_con = Σ_j^n ||F^j(I^0) − F^j(I^R)||_2 / (H_{F^j} × W_{F^j} × C_{F^j})    (9)

3.3.3 Pixel Loss

While the identity and content loss terms optimize the albedo of the visible texture, lighting conditions are optimized based on the pixel value difference directly. While this cost function is relatively primitive, it is sufficient to optimize lighting parameters such as ambient colors, and the direction, distance and color of a light source. We found that optimizing illumination parameters jointly with the others helped to improve the albedo of the recovered texture. Furthermore, the pixel loss supports the identity and content losses with fine-grained texture as it operates at the highest available resolution, while images need to be downscaled to 112 × 112 before the identity and content losses. The pixel loss is defined by a pixel-level l1 loss function as:

L_pix = ||I^0 − I^R||_1    (10)

5 We empirically deduced that other face recognition networks work almost equally well and this choice is orthogonal to the proposed approach.
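As a hedged illustration of the identity loss (Eq. 8) and content loss (Eq. 9), the sketch below assumes a `face_net` module that returns both the 512-D embedding and a list of intermediate activations; it stands in for a pretrained recognition network such as ArcFace and is not the authors' implementation.

```python
# Sketch of the identity loss (Eq. 8) and content loss (Eq. 9); `face_net` is an
# assumed interface returning (embedding, list_of_intermediate_feature_maps).
import torch
import torch.nn.functional as F

def identity_and_content_loss(face_net, img_input, img_rendered):
    emb0, feats0 = face_net(img_input)
    embR, featsR = face_net(img_rendered)

    # Eq. 8: one minus the cosine similarity of the two embeddings.
    l_id = 1.0 - F.cosine_similarity(emb0, embR, dim=-1).mean()

    # Eq. 9: euclidean distance of intermediate activations, normalized by H x W x C.
    l_con = sum(torch.norm(f0 - fR, p=2) / f0[0].numel()
                for f0, fR in zip(feats0, featsR))
    return l_id, l_con
```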
Figure 3: Example fits of our approach for images from various datasets. Please note that our fitting approach is robust to occlusion (e.g., glasses), low resolution and black-and-white photos, and generalizes well across ethnicity, gender and age. The reconstructed textures are very good at capturing the high frequency details of the identities; likewise, the reconstructed geometries from the 3DMM are surprisingly good at identity preservation thanks to the identity features used, e.g. the crooked nose at bottom-left, the dull eyes at bottom-right and the chin dimple at top-left.

3.3.4 Landmark Loss

The face recognition network F is pre-trained on images that are aligned by a similarity transformation to a fixed landmark template. To be compatible with the network, we align the input and rendered images under the same settings. However, this process disregards the aspect ratio and scale of the reconstruction. Therefore, we employ a deep face alignment network [13] M(I) : R^{H×W×C} → R^{68×2} to detect the landmark locations of the input image and align the rendered geometry onto it by updating the shape, expression and camera parameters. That is, the camera parameters are optimized to align with the pose of the image I and the geometry parameters are optimized for the rough shape estimation. As a natural consequence, this alignment drastically improves the effectiveness of the pixel and content losses, which are sensitive to misalignment between the two images.

The alignment error is computed by point-to-point euclidean distances between the detected landmark locations of the input image and the 2D projection of the 3D reconstruction's landmark locations, which are available as meta-data of the shape model. Since the landmark locations of the reconstruction heavily depend on the camera parameters, this loss is a great source of information for the alignment of the reconstruction onto the input image and is formulated as follows:

L_lan = ||M(I^0) − M(I^R)||_2    (11)

3.4. Model Fitting

We first roughly align our reconstruction to the input image by optimizing the shape, expression and camera parameters by: min_{p^r} E(p^r) = λ_lan L_lan. We then simultaneously optimize all of our parameters with gradient descent and backpropagation so as to minimize the weighted combination of the above loss terms:

min_p E(p) = λ_id L_id + λ̂_id L̂_id + λ_con L_con + λ_pix L_pix + λ_lan L_lan + λ_reg Reg({p_{s,e}, p_l})    (12)

where we weight each of our loss terms with λ parameters. In order to prevent our shape and expression models and lighting parameters from exaggerating to arbitrarily bias our loss terms, we regularize those parameters by Reg({p_{s,e}, p_l}).

Fitting with Multiple Images (i.e. Video): While the proposed approach can fit a 3D reconstruction from a single image, one can take advantage of more images effectively when available, e.g. from a video recording. This often helps to improve reconstruction quality under challenging conditions, e.g. outdoor or low resolution footage. While state-of-the-art methods follow naive approaches by averaging either the reconstruction [42] or the features-to-be-regressed [16] before making a reconstruction, we utilize the power of iterative optimization by averaging the identity reconstruction parameters (p_s, p_t) after every iteration. For an image set I = {I^0, I^1, ..., I^i, ..., I^n}, we reformulate our parameters as p = [p_s, p^i_e, p_t, p^i_c, p^i_l], in which we combine the shape and texture parameters by the following:

p_s = Σ_i^n p^i_s ,   p_t = Σ_i^n p^i_t    (13)

Figure 4: Comparison of our qualitative results with other state-of-the-art methods in MoFA-Test dataset. Rows 2-5 show
comparison with textured geometry and rows 6-8 compare only shapes. The figure is best viewed in color and under zoom.
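A compact sketch of the overall fitting objective in Eq. 12 above: all loss terms are combined with balancing weights and minimized with a first-order optimizer over the latent parameters. The parameter tensors and loss callables are assumed interfaces, not the authors' code; the Adam solver and 0.01 learning rate follow the implementation details reported in Sec. 4.1.

```python
# Sketch of the weighted fitting objective of Eq. 12, minimized with gradient descent.
# `params` and `compute_losses` are assumed interfaces for illustration only.
import torch

def total_energy(losses, weights):
    """losses/weights: dicts keyed by 'id', 'id_hat', 'con', 'pix', 'lan', 'reg'."""
    return sum(weights[k] * losses[k] for k in weights)

def fit(params, compute_losses, weights, steps=300, lr=0.01):
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        energy = total_energy(compute_losses(), weights)
        energy.backward()
        optimizer.step()
    return params
```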
4. Experiments

This section demonstrates the excellent performance of the proposed approach for 3D face reconstruction and shape recovery. We verify this by qualitative results in Figures 1 and 3, qualitative comparisons with the state-of-the-art in Sec. 4.2, and a quantitative shape reconstruction experiment on a database with ground truth in Sec. 4.3.

4.1. Implementation Details

For all of our experiments, a given face image is aligned to our fixed template using 68 landmark locations detected by an hourglass 2D landmark detector [13]. For the identity features, we employ the ArcFace [12] network's pretrained models. For the generator network G, we train a progressive growing GAN [24] with around 10,000 UV maps from [7] at a resolution of 512 × 512. We use the Large Scale Face Model [7] as the 3DMM shape model with n_s = 158 and the expression model learned from the 4DFAB database [8] with n_e = 29. During the fitting process, we optimize the parameters using the Adam solver [25] with a 0.01 learning rate, and we set our balancing factors as follows: λ_id : 2.0, λ̂_id : 2.0, λ_con : 50.0, λ_pix : 1.0, λ_lan : 0.001, λ_reg : {0.05, 0.01}. The fitting converges in around 30 seconds on an Nvidia GTX 1080 TI GPU for a single image.
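The balancing factors listed above, together with the per-frame identity-parameter aggregation of Eq. 13, can be collected in a small configuration sketch. The dictionary keys and the helper name are assumptions for illustration; the paper lists two regularization weights without stating their exact mapping.

```python
# Balancing factors from Sec. 4.1 and the multi-image aggregation of Eq. 13 (sketch).
import torch

LOSS_WEIGHTS = {"id": 2.0, "id_hat": 2.0, "con": 50.0, "pix": 1.0,
                "lan": 0.001, "reg": (0.05, 0.01)}  # two regularization weights

def aggregate_identity_params(per_image_ps, per_image_pt):
    """Combine shape/texture codes fitted to frames of the same identity (the text
    describes averaging p_s and p_t after every iteration), while expression,
    camera and lighting parameters remain per-image."""
    p_s = torch.stack(per_image_ps).mean(dim=0)
    p_t = torch.stack(per_image_pt).mean(dim=0)
    return p_s, p_t
```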
Table 1: Accuracy results for the meshes on the MICC dataset using point-to-plane distance. The table reports the mean error (Mean) and the standard deviation (Std.).
Method Cooperative (Mean / Std.) Indoor (Mean / Std.) Outdoor (Mean / Std.)
Tran et al. [42] 1.93 / 0.27 2.02 / 0.25 1.86 / 0.23
Booth et al. [6] 1.82 / 0.29 1.85 / 0.22 1.63 / 0.16
Genova et al. [16] 1.50 / 0.13 1.50 / 0.11 1.48 / 0.11
Ours 0.95 / 0.107 0.94 / 0.106 0.94 / 0.106

4.2. Qualitative Comparison to the State-of-the-art

Fig. 4 compares our results with the most recent face reconstruction studies [40, 39, 16, 42, 43] on a subset of the MoFA test-set. The first four rows after the input images show a comparison of our shape and texture reconstructions to [16, 42, 39], and the last three rows show our reconstructed geometries without texture compared to [39, 43]. All in all, our method outshines all others with its high fidelity, photorealistic texture reconstructions. Both our texture and shape reconstructions manifest strong identity characteristics of the corresponding input images, from the thickness and shape of the eyebrows to wrinkles around the mouth and forehead.

4.3. 3D shape recovery on MICC dataset

We evaluate the shape reconstruction performance of our method on the MICC Florence 3D Faces dataset (MICC) [1] in Table 1. The dataset provides 3D scans of 53 subjects as well as their short video footage under three difficulty settings: 'cooperative', 'indoor' and 'outdoor'. Unlike [16, 42], which process all the frames in a video, we uniformly sample only 5 frames from each video regardless of their zoom level, and we run our method with multi-image support on these 5 frames for each video separately, as shown in Eq. 13. Each test mesh is cropped at a radius of 95mm around the tip of the nose according to [42] in order to evaluate the shape recovery of the inner facial mesh. We perform dense alignment between each predicted mesh and its corresponding ground truth mesh by implementing an iterative closest point (ICP) method [3]. As the evaluation metric, we follow [16] and measure the error by the average symmetric point-to-plane distance.

Table 1 reports the normalized point-to-plane errors in millimeters. It is evident that we have improved the absolute error compared to the other two state-of-the-art methods by 36%. Our results are shown to be consistent across all different settings, with minimal standard deviation from the mean error.

4.4. Ablation Study

Fig. 5 shows an ablation study on our method where the full model reconstructs the input face better than its variants, something that suggests that each of our components significantly contributes towards a good reconstruction. Fig. 5(c) indicates that the albedo is well disentangled from the illumination and that our model captures the light direction accurately. While Fig. 5(d-f) shows that each of the identity terms contributes to preserving identity, Fig. 5(h) demonstrates the significance of the identity features altogether. Still, the overall reconstruction utilizes pixel intensities to capture better albedo and illumination, as shown in Fig. 5(g). Finally, Fig. 5(i) shows the superiority of our textures over PCA-based ones.

Figure 5: Contributions of the components or loss terms of the proposed approach with a leave-one-out ablation study: (a) I^0, (b) I^R, (c) I^R albedo, (d) I^R \ L_id, (e) I^R \ L̂_id, (f) I^R \ L_con, (g) I^R \ L_pix, (h) I^R \ {L_id, L̂_id, L_con}, (i) I^R with T(p_t).

5. Conclusion

In this paper, we revisit optimization-based 3D face reconstruction under a new perspective, that is, we utilize the power of recent machine learning techniques such as GANs and a face recognition network as a statistical texture model and as an energy function, respectively. To the best of our knowledge, this is the first time that GANs are used for model fitting, and they have shown excellent results for high quality texture reconstruction. The proposed approach produces identity preserving, high fidelity 3D reconstructions in qualitative and quantitative experiments.

Acknowledgements: Baris Gecer is funded by the Turkish Ministry of National Education. Stefanos Zafeiriou acknowledges support by EPSRC Fellowship DEFORM (EP/S010203/1) and a Google Faculty Award.
References torealistic face images of new identities from 3d morphable
model. ECCV, 2018. 5
[1] Andrew D Bagdanov, Alberto Del Bimbo, and Iacopo Masi.
[16] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron
The florence 2d/3d hybrid face dataset. In Proceedings of the
Sarna, Daniel Vlasic, and William T Freeman. Unsupervised
2011 joint ACM workshop on Human gesture and behavior
training for 3d morphable model regression. In CVPR, 2018.
understanding, pages 79–80. ACM, 2011. 8
2, 5, 6, 7, 8, 11
[2] Anil Bas, William AP Smith, Timo Bolkart, and Stefanie
Wuhrer. Fitting a 3d morphable model to edges: A com- [17] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial
parison between hard and soft correspondences. In ACCV, networks. arXiv preprint arXiv:1701.00160, 2016. 4
2016. 4 [18] Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, and
[3] Paul J Besl and Neil D McKay. Method for registration of 3- Jianmin Zheng. Cnn-based real-time dense face reconstruc-
d shapes. In Sensor Fusion IV: Control Paradigms and Data tion with inverse-rendered photo-realistic face images. IEEE
Structures, volume 1611, pages 586–607, 1992. 8 transactions on pattern analysis and machine intelligence,
[4] Volker Blanz and Thomas Vetter. A morphable model for 2018. 2
the synthesis of 3d faces. In Proceedings of the 26th an- [19] Guosheng Hu, Fei Yan, Josef Kittler, William Christmas,
nual conference on Computer graphics and interactive tech- Chi Ho Chan, Zhenhua Feng, and Patrik Huber. Efficient 3d
niques, pages 187–194. ACM Press/Addison-Wesley Pub- morphable face model fitting. Pattern Recognition, 67:366–
lishing Co., 1999. 1 379, 2017. 4
[5] Volker Blanz and Thomas Vetter. Face recognition based [20] Gary B. Huang, Marwan Mattar, Honglak Lee, and Erik
on fitting a 3d morphable model. TPAMI, 25(9):1063–1074, Learned-Miller. Learning to align from scratch. In NIPS,
2003. 3 2012. 11
[6] James Booth, Epameinondas Antonakos, Stylianos [21] IEEE. A 3D Face Model for Pose and Illumination Invariant
Ploumpis, George Trigeorgis, Yannis Panagakis, Stefanos Face Recognition, 2009. 2
Zafeiriou, et al. 3d face morphable models in-the-wild. In [22] Luo Jiang, Juyong Zhang, Bailin Deng, Hao Li, and Lig-
CVPR, 2017. 2, 3, 4, 8 ang Liu. 3d face reconstruction with geometry details from
[7] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan a single image. IEEE Transactions on Image Processing,
Ponniah, and David Dunaway. A 3d morphable model learnt 27(10):4756–4770, 2018. 2
from 10,000 faces. In CVPR, 2016. 2, 7 [23] Ian Jolliffe. Principal component analysis. In Interna-
[8] Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos tional encyclopedia of statistical science, pages 1094–1096.
Zafeiriou. 4dfab: a large scale 4d facial expression database Springer, 2011. 3
for biometric applications. arXiv preprint arXiv:1712.01443, [24] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
2017. 7 Progressive growing of GANs for improved quality, stability,
[9] Forrester Cole, David Belanger, Dilip Krishnan, Aaron and variation. In ICLR, 2018. 4, 7
Sarna, Inbar Mosseri, and William T Freeman. Synthesiz- [25] Diederik P Kingma and Jimmy Ba. Adam: A method for
ing normalized faces from facial identity features. In CVPR, stochastic optimization. arXiv preprint arXiv:1412.6980,
2017. 2, 5 2014. 7
[10] Hang Dai, Nick Pears, William Smith, and Christian Dun-
can. A 3d morphable model of craniofacial shape and texture variation. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017. 2
[11] Jiankang Deng, Shiyang Cheng, Niannan Xue, Yuxiang Zhou, and Stefanos Zafeiriou. Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition. CVPR, 2018. 2
[12] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018. 2, 4, 5, 7
[13] Jiankang Deng, Yuxiang Zhou, Shiyang Cheng, and Stefanos Zaferiou. Cascade multi-view hourglass model for robust 3d face alignment. In Automatic Face & Gesture Recognition (FG), pages 399–403. IEEE, 2018. 4, 6, 7
[14] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. Reconstruction of personalized 3d face rigs from monocular video. ACM Transactions on Graphics (TOG), 35(3):28, 2016. 2, 3
[15] Baris Gecer, Binod Bhattarai, Josef Kittler, and Tae-Kyun Kim. Semi-supervised adversarial learning to generate pho-
[26] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018. 2
[27] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015. 11
[28] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, 2015. 4
[29] Marcel Piotraschke and Volker Blanz. Automated 3d face reconstruction from multiple images using quality measures. In CVPR, 2016. 4
[30] Elad Richardson, Matan Sela, and Ron Kimmel. 3d face reconstruction by learning from synthetic data. In 2016 Fourth International Conference on 3D Vision (3DV), pages 460–469. IEEE, 2016. 2
[31] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction from a single image. In CVPR, 2017. 2
[32] Sami Romdhani, Volker Blanz, and Thomas Vetter. Face identification by fitting a 3d morphable model using linear shape and texture error functions. In ECCV, 2002. 3
[33] Sami Romdhani and Thomas Vetter. Estimating 3d shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR, 2005. 2, 4
[34] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. Photorealistic facial texture inference using deep neural networks. In CVPR, 2017. 2, 12
[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015. 4
[36] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, 2017. 2
[37] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017. 11
[38] Ron Slossberg, Gil Shamai, and Ron Kimmel. High quality facial surface and texture synthesis via generative adversarial networks. ECCVW, 2018. 2
[39] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 hz. 2018. 2, 3, 7, 8
[40] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, 2017. 2, 8
[41] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR, pages 2387–2395, 2016. 2
[42] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. Regressing robust and discriminative 3d morphable models with a very deep neural network. In CVPR, 2017. 2, 6, 7, 8, 11
[43] Luan Tran and Xiaoming Liu. Nonlinear 3d face morphable model. In CVPR, 2018. 2, 3, 7, 8
[44] Michael J Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and Serge J Belongie. Bam! the behance artistic media dataset for recognition beyond photography. In ICCV, pages 1211–1220, 2017. 11
[45] Shuco Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics (TOG), 37(4):162, 2018. 2, 11
[46] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across large poses: A 3d solution. In CVPR, 2016. 2
Appendix A. Experiments on LFW
In order to evaluate the identity preservation capacity of the proposed method, we run two face recognition experiments on the Labelled Faces in the Wild (LFW) dataset [20]. Following [16], we feed real LFW images and rendered images of their 3D reconstructions by our method to a pretrained face recognition network, namely VGG-Face [27]. We then compute the activations at the embedding layer and measure cosine similarity between 1) real and rendered images and 2) renderings of same/different pairs.
In Fig. 6 and 7, we quantitatively show that our method is better at identity preservation and photorealism (i.e., as the pretrained network is trained on real images) than other state-of-the-art deep 3D face reconstruction approaches [16, 42].
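To make the evaluation protocol above concrete, here is a minimal sketch of the two similarity measurements, assuming some face-recognition embedder embed(image) that returns the embedding-layer activations (e.g. from a VGG-Face implementation); the function names are illustrative, not the authors' code.

# Minimal sketch of the identity-preservation check described above, assuming a
# pretrained face-recognition embedder `embed(image) -> 1-D feature vector`;
# the helper names here are illustrative, not the authors' released code.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rendering_to_photo_similarity(real_images, rendered_images, embed):
    # Similarity between each real LFW photo and the rendering of its 3D reconstruction.
    return [cosine_similarity(embed(r), embed(s)) for r, s in zip(real_images, rendered_images)]

def same_different_similarities(renderings, pairs, embed):
    # `pairs` is a list of (i, j, is_same) index triples over `renderings`; thresholding
    # the two resulting distributions separates same from different identities.
    sims = {True: [], False: []}
    for i, j, is_same in pairs:
        sims[is_same].append(cosine_similarity(embed(renderings[i]), embed(renderings[j])))
    return sims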

Figure 6: Rendering-to-photo cosine similarity on LFW. Cosine similarity distributions of rendered and real LFW images (Genova et al., Tran et al., and ours), based on activations at the embedding layer of the VGG-Face network [27]. Our method achieves more than 0.5 similarity on average, whereas [16] has 0.35 and [42] 0.16 average similarity. Camera and lighting parameters are fixed for all renderings.

Figure 7: Cosine similarity of same/different pairs on LFW. Our method successfully preserves identity, so that the distributions of cosine similarity for same and different pairs are separable by thresholding. Camera and lighting parameters are fixed for all renderings.

Appendix B. More Qualitative Results

Figures 8, 9, 10, and 11 illustrate the reconstructions of our method under different settings in comparison to the other state-of-the-art methods. Please see the figure captions for detailed explanations.

Figure 8: Our results on the BAM dataset [44] compared to [16]. Our method is robust to many image deformations and is even capable of recovering identities from paintings thanks to strong identity features.

Figure 9: Qualitative comparison with [45, 37] (input image, Shu et al., Yamaguchi et al., ours) by overlaying the reconstructions on the input images. Our method can generate high fidelity texture with accurate shape, camera and illumination fitting.

Figure 10: Qualitative comparison with [34] (input images, Saito et al., ours) by means of texture maps, whole and partial face renderings. Please note that while our method does not require any particular renderer for special effects, e.g., lighting, [34] produces these renderings with a commercial renderer called Arnold.

Figure 11: Results under more challenging conditions, i.e., strong illumination, self-occlusions and facial hair. (a) Input image. (b) Estimated fitting overlaid, including illumination estimation. (c) Overlaid fitting without illumination. (d) Pixel-wise intensity difference between (b) and (c). (e) Estimated shape mesh.
DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation,
Segmentation and Re-Identification of Clothing Images

Yuying Ge1 , Ruimao Zhang1 , Lingyun Wu2 , Xiaogang Wang1 , Xiaoou Tang1 , and Ping Luo1
1
The Chinese University of Hong Kong
2
SenseTime Research
arXiv:1901.07973v1 [cs.CV] 23 Jan 2019

Abstract

Understanding fashion images has been advanced by benchmarks with rich annotations such as DeepFashion, whose labels include clothing categories, landmarks, and consumer-commercial image pairs. However, DeepFashion has nonnegligible issues such as a single clothing item per image, sparse landmarks (4∼8 only), and no per-pixel masks, leaving a significant gap from real-world scenarios. We fill in the gap by presenting DeepFashion2 to address these issues. It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items, where each item has rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks (e.g. 39 for 'long sleeve outwear' and 15 for 'vest'), and masks. There are also 873K Commercial-Consumer clothes pairs. The annotations of DeepFashion2 are much larger than those of its counterparts, e.g. 8× those of the FashionAI Global Challenge. A strong baseline is proposed, called Match R-CNN, which builds upon Mask R-CNN to solve the above four tasks in an end-to-end manner. Extensive evaluations are conducted with different criteria in DeepFashion2. The DeepFashion2 dataset will be released at: https://github.com/switchablenorms/DeepFashion2

Figure 1. Comparisons between (a) DeepFashion and (b) DeepFashion2. (a) only has a single item per image (e.g. 'tank top', 'cardigan'), annotated with 4∼8 sparse landmarks. The bounding boxes are estimated from the labeled landmarks, making them noisy. In (b), each image has a minimum of one item and a maximum of 7 items (e.g. 'long sleeve outwear', 'vest', 'shorts', 'skirt', 'trousers'). Each item is manually labeled with bounding box, mask, dense landmarks (20 per item on average), and commercial-customer image pairs.

1. Introduction

Fashion image analysis has been an active research topic in recent years because of its huge potential in industry. With the development of fashion datasets [20, 5, 7, 3, 14, 12, 21, 1], significant progress has been achieved in this area [2, 19, 17, 18, 9, 8].

However, understanding fashion images remains a challenge in real-world applications, because of large deformations, occlusions, and discrepancies of clothes across the domains of consumer and commercial images. Some of these challenges can be rooted in the gap between recent benchmarks and the practical scenario. For example, the existing largest fashion dataset, DeepFashion [14], has its own drawbacks such as a single clothing item per image, a sparse landmark and pose definition (every clothing category shares the same definition of 4∼8 keypoints), and no per-pixel mask annotation, as shown in Fig.1(a).

To address the above drawbacks, this work presents DeepFashion2, a large-scale benchmark with comprehensive tasks and annotations for fashion image understanding. DeepFashion2 contains 491K images of 13 popular clothing categories. A full spectrum of tasks is defined on them, including clothes detection and recognition, landmark and pose estimation, segmentation, as well as verification and retrieval. All these tasks are supported by rich annotations.
For instance, DeepFashion2 has 801K clothing items in total, where each item in an image is labeled with scale, occlusion, zooming, viewpoint, bounding box, dense landmarks, and per-pixel mask, as shown in Fig.1(b). These items can be grouped into 43.8K clothing identities, where a clothing identity represents clothes that have almost the same cutting, pattern, and design. The images of the same identity are taken by both customers and commercial shopping stores. An item from the customer and an item from the commercial store form a pair. There are 873K pairs, 3.5 times more than in DeepFashion. The above thorough annotations enable the development of strong algorithms to understand fashion images.

              WTBI       DARN       DeepFashion   ModaNet     FashionAI   DeepFashion2
year          2015 [5]   2015 [7]   2016 [14]     2018 [21]   2018 [1]    now
#images       425K       182K       800K          55K         357K        491K
#categories   11         20         50            13          41          13
#bboxes       39K        7K         ×             ×           ×           801K
#landmarks    ×          ×          120K          ×           100K        801K
#masks        ×          ×          ×             119K        ×           801K
#pairs        39K        91K        251K          ×           ×           873K

Table 1. Comparisons of DeepFashion2 with the other clothes datasets. The rows represent the year and the numbers of images, categories, bounding boxes, landmarks, per-pixel masks, and consumer-to-shop pairs, respectively. Bounding boxes inferred from other annotations are not counted.

This work has three main contributions. (1) We build a large-scale fashion benchmark with comprehensive tasks and annotations to facilitate fashion image analysis. DeepFashion2 possesses the richest definitions of tasks and the largest number of labels. Its annotations are at least 3.5× those of DeepFashion [14], 6.7× those of ModaNet [21], and 8× those of FashionAI [1]. (2) A full spectrum of tasks is carefully defined on the proposed dataset. For example, to our knowledge, clothing pose estimation is presented for the first time in the literature by defining landmarks and poses of 13 categories that are more diverse and richer than human pose. (3) With DeepFashion2, we extensively evaluate Mask R-CNN [6], which is a recent advanced framework for visual perception. A novel Match R-CNN is also proposed to aggregate all the learned features from clothes categories, poses, and masks to solve clothing image retrieval in an end-to-end manner. DeepFashion2 and implementations of Match R-CNN will be released.

1.1. Related Work

Clothes Datasets. Several clothes datasets have been proposed, such as [20, 5, 7, 14, 21, 1], as summarized in Table 1. They vary in size as well as in the amount and type of annotations. For example, WTBI [5] and DARN [7] have 425K and 182K images respectively. They scraped category labels from the metadata of images collected from online shopping websites, making their labels noisy. In contrast, CCP [20], DeepFashion [14], and ModaNet [21] obtain category labels from human annotators. Moreover, different kinds of annotations are also provided in these datasets. For example, DeepFashion labels 4∼8 landmarks (keypoints) per image that are defined on the functional regions of clothes (e.g. 'collar'). The definitions of these sparse landmarks are shared across all categories, making it difficult to capture the rich variations of clothing images. Furthermore, DeepFashion does not have mask annotations. By comparison, ModaNet [21] has street images with masks (polygons) of a single person but without landmarks. Unlike existing datasets, DeepFashion2 contains 491K images and 801K instances of landmarks, masks, and bounding boxes, as well as 873K pairs. It is the most comprehensive benchmark of its kind to date.

Fashion Image Understanding. There are various tasks that analyze clothing images, such as clothes detection [2, 14], landmark prediction [15, 19, 17], clothes segmentation [18, 20, 13], and retrieval [7, 5, 14]. However, a unified benchmark and framework accounting for all these tasks is still desired. DeepFashion2 and Match R-CNN fill in this blank. We report extensive results for the above tasks with respect to different variations, including scale, occlusion, zoom-in, and viewpoint. For the task of clothes retrieval, unlike previous methods [5, 7] that performed image-level retrieval, DeepFashion2 enables instance-level retrieval of clothing items. We also present a new fashion task called clothes pose estimation, which is inspired by human pose estimation and predicts clothing landmarks and skeletons for 13 clothes categories. This task helps improve the performance of fashion image analysis in real-world applications.

2. DeepFashion2 Dataset and Benchmark

Overview. DeepFashion2 has four unique characteristics compared to existing fashion datasets. (1) Large Sample Size. It contains 491K images of 43.8K clothing identities of interest (unique garments displayed by shopping stores). On average, each identity has 12.7 items with different styles such as color and printing. DeepFashion2 contains 801K items in total. It is the largest fashion database to date. Furthermore, each item is associated with various annotations as introduced above.

(2) Versatility. DeepFashion2 is developed for multiple tasks of fashion understanding. Its rich annotations support clothes detection and classification, dense landmark and pose estimation, instance segmentation, and cross-domain instance-level clothes retrieval.

(3) Expressivity. This is mainly reflected in two aspects. First, multiple items are present in a single image, unlike DeepFashion where each image is labeled with at most one item. Second, we have 13 different definitions of landmarks and poses (skeletons) for 13 different categories.

(Figure 2: panels (1) 'short sleeve top', (2) 'shorts', (3) 'long sleeve outwear', and (4) 'long sleeve dress' with their numbered dense landmark definitions, shown for commercial and customer images under the scale, occlusion, zoom-in, and viewpoint variations.)
Figure 2. Examples of DeepFashion2. The first column shows definitions of dense landmarks and skeletons of four categories. From (1)
to (4), each row represents clothes images with different variations including ‘scale’, ‘occlusion’, ‘zoom-in’, and ‘viewpoint’. At each row,
we partition the images into two groups, the left three columns represent clothes from commercial stores, while the right three columns are
from customers. In each group, the three images indicate three levels of difficulty with respect to the corresponding variation, including (1)
‘small’, ‘moderate’, ‘large’ scale, (2) ‘slight’, ‘medium’, ‘heavy’ occlusion, (3) ‘no’, ‘medium’, ‘large’ zoom-in, (4) ‘not on human’, ‘side’,
‘back’ viewpoint. Furthermore, at each row, the items in these two groups of images are from the same clothing identity but from two
different domains, that is, commercial and customer. The items of the same identity may have different styles such as color and printing.
Each item is annotated with landmarks and masks.
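To make the per-item labels described above concrete, the record below is a hypothetical, simplified illustration of the kind of information attached to a single clothing item (category, identity, style, bounding box, landmarks with visibility, mask polygon, and the four variation levels); the field names and values are illustrative only and do not reproduce the released DeepFashion2 annotation schema.

# Hypothetical, simplified annotation record for one clothing item.
# Field names are illustrative, not the official DeepFashion2 format.
item_annotation = {
    "category": "long sleeve outwear",        # one of the 13 categories
    "identity_id": 102934,                    # clothing identity shared by commercial/consumer images
    "style": 2,                               # style (color/printing/logo) within that identity
    "bounding_box": [104, 58, 412, 630],      # manually labeled box, in pixels
    "landmarks": [[180, 95, "visible"], [210, 92, "occluded"]],  # (x, y, mode) dense landmarks
    "segmentation": [[120, 70, 400, 70, 410, 620, 110, 615]],    # polygon refined from the contour
    "scale": "moderate",                      # 'small' / 'moderate' / 'large'
    "occlusion": "slight",                    # 'slight' / 'medium' / 'heavy'
    "zoom_in": "no",                          # 'no' / 'medium' / 'large'
    "viewpoint": "frontal",                   # 'not on human' / 'frontal' / 'side or back'
    "source": "shop",                         # commercial ('shop') or consumer ('user') image
}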

There are 23 defined landmarks for each category on average. Some definitions are shown in the first column of Fig.2. These representations are different from human pose and are not presented in previous work. They facilitate learning of strong clothes features that satisfy real-world requirements.

(4) Diversity. We collect data by controlling their variations in terms of four properties including scale, occlusion, zoom-in, and viewpoint, as illustrated in Fig.2, making DeepFashion2 a challenging benchmark. For each property, each clothing item is assigned to one of three levels of difficulty. Fig.2 shows that each identity has high diversity, with its items coming from different difficulty levels.

Data Collection and Cleaning. Raw data of DeepFashion2 are collected from two sources including DeepFashion [14] and online shopping websites. In particular, images of each consumer-to-shop pair in DeepFashion are included in DeepFashion2, while the other images are removed. We further crawl a large set of images on the Internet from both commercial shopping stores and consumers. To clean up the crawled set, we first remove shop images with no corresponding consumer-taken photos. Then human annotators are asked to clean images that contain clothes with large occlusions, small scales, and low resolutions. Eventually we have 491K images of 801K items and 873K commercial-consumer pairs.

Variations. We explain the variations in DeepFashion2. Their statistics are plotted in Fig.3. (1) Scale. We divide all clothing items into three sets, according to the proportion of an item compared to the image size, including 'small' (< 10%), 'moderate' (10% ∼ 40%), and 'large' (> 40%). Fig.3(a) shows that only 50% of items have moderate scale. (2) Occlusion. An item with occlusion means that its region is occluded by hair, human body, accessories or other items.

Figure 3. (a) shows the statistics of different variations in DeepFashion2. (b) is the numbers of items of the 13 categories in DeepFashion2.
(c) shows that categories in DeepFashion [14] have ambiguity. For example, it is difficult to distinguish between ‘cardigan’ and ‘coat’, and
between ‘joggers’ and ‘sweatpants’. They result in ambiguity when labeling data. (d) Top: masks may be inaccurate when complex poses
are presented. Bottom: the masks will be refined by human.
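The caption above refers to the semi-automatic mask labeling described in Sec. 2.1 below, whose first stage rasterizes the annotated contour of an item into a per-pixel mask before human refinement. The following is a minimal sketch of that first stage under common tooling assumptions (Pillow and NumPy); it is an illustration, not the authors' annotation pipeline.

# Minimal sketch of the automatic first stage of mask labeling: rasterize an
# item's closed contour (ordered contour points) into a binary mask that
# annotators can then refine by hand.
import numpy as np
from PIL import Image, ImageDraw

def contour_to_mask(contour_xy, height, width):
    # contour_xy: list of (x, y) points tracing the item boundary in order.
    canvas = Image.new("L", (width, height), 0)
    ImageDraw.Draw(canvas).polygon([tuple(p) for p in contour_xy], outline=1, fill=1)
    return np.array(canvas, dtype=np.uint8)   # 1 inside the item, 0 elsewhere

mask = contour_to_mask([(30, 40), (200, 35), (210, 300), (25, 310)], height=400, width=256)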

Note that an item with its region outside the image does not belong to this case. Each item is categorized by the number of its landmarks that are occluded, including 'partial occlusion' (< 20% occluded keypoints), 'heavy occlusion' (> 50% occluded keypoints), and 'medium occlusion' (otherwise). More than 50% of items have medium or heavy occlusions, as summarized in Fig.3. (3) Zoom-in. An item with zoom-in means that its region is outside the image. This is categorized by the number of landmarks outside the image. We define 'no', 'large' (> 30%), and 'medium' zoom-in. We see that more than 30% of items are zoomed in. (4) Viewpoint. We divide all items into four partitions, including 7% of clothes that are not on people, 78% of clothes on people from a frontal viewpoint, and 15% of clothes on people from a side or back viewpoint.

2.1. Data Labeling

Category and Bounding Box. Human annotators are asked to draw a bounding box and assign a category label for each clothing item. DeepFashion [14] defines 50 categories, but half of them contain less than 5‰ of the images. Also, ambiguity exists between the 50 categories, making data labeling difficult, as shown in Fig.3(c). By grouping categories in DeepFashion, we derive 13 popular categories without ambiguity. The numbers of items of the 13 categories are shown in Fig.3(b).

Clothes Landmark, Contour, and Skeleton. As different categories of clothes (e.g. upper- and lower-body garments) have different deformations and appearance changes, we represent each category by defining its pose, which is a set of landmarks as well as contours and skeletons between landmarks. They capture the shapes and structures of clothes. Pose definitions are not presented in previous work and are significantly different from human pose. For each clothing item of a category, human annotators are asked to label landmarks following these instructions.

Moreover, each landmark is assigned one of two modes, 'visible' or 'occluded'. We then generate contours and skeletons automatically by connecting landmarks in a certain order. To facilitate this process, annotators are also asked to distinguish landmarks into two types, that is, contour points or junction points. The former refers to keypoints at the boundary of an item, while the latter is assigned to keypoints at junctions, e.g. the 'endpoint of a strap on a sling'. The above process controls the labeling quality, because the generated skeletons help the annotators reexamine whether the landmarks are labeled with good quality. In particular, only when the contour covers the entire item are the labeled results eligible; otherwise the keypoints are refined.

Mask. We label a per-pixel mask for each item in a semi-automatic manner with two stages. The first stage automatically generates masks from the contours. In the second stage, human annotators are asked to refine the masks, because the generated masks may not be accurate when complex human poses are presented. As shown in Fig.3(d), the mask is inaccurate when an image is taken from a side view of people crossing their legs. Such masks are refined by humans.

Style. As introduced before, we collect 43.8K different clothing identities where each identity has 13 items on average. These items are further labeled with different styles such as color, printing, and logo. Fig.2 shows that a pair of clothes with the same identity can have different styles.

2.2. Benchmarks

We build four benchmarks by using the images and labels from DeepFashion2.

For each benchmark, there are 391K images for training, 34K images for validation, and 67K images for test.

Clothes Detection. This task detects clothes in an image by predicting bounding boxes and category labels. The evaluation metrics are the bounding box's average precision AP_box, AP_box^IoU=0.50, and AP_box^IoU=0.75, following COCO [11].

Landmark Estimation. This task aims to predict landmarks for each detected clothing item in each image. Similarly, we employ the evaluation metrics used by COCO for human pose estimation, calculating the average precision for keypoints AP_pt, AP_pt^OKS=0.50, and AP_pt^OKS=0.75, where OKS indicates the object landmark similarity.

Segmentation. This task assigns a category label (including a background label) to each pixel of an item. The evaluation metric is the average precision, including AP_mask, AP_mask^IoU=0.50, and AP_mask^IoU=0.75, computed over masks.

Figure 4. Diagram of Match R-CNN, which contains three main components: a feature extraction network (FN), a perception network (PN), and a match network (MN). (The diagram shows two input images I1 and I2, ResNet-FPN and RoIAlign features feeding the class, box, landmark, and mask streams of PN, and a subtract-square-FC similarity head in MN producing matching / not-matching scores.)
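Since the detection, landmark, and segmentation benchmarks above reuse the COCO evaluation protocol, the sketch below shows one way such AP numbers can be computed with the pycocotools reference implementation, assuming the ground truth and the model detections have been exported to COCO-format JSON; the file names are placeholders, not released DeepFashion2 files.

# Minimal sketch of COCO-style evaluation (AP, AP@IoU=0.50, AP@IoU=0.75),
# assuming COCO-format ground truth and detection files; file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("validation_annotations_coco.json")        # ground-truth boxes and categories
coco_dt = coco_gt.loadRes("match_rcnn_detections.json")   # model detections with scores

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")    # "segm" for masks, "keypoints" for landmarks
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP averaged over IoU=0.50:0.95, plus AP@0.50 and AP@0.75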
Commercial-Consumer Clothes Retrieval. Given a
detected item from a consumer-taken photo, this task aims
to search the commercial images in the gallery for the items ture with lateral connections to build a pyramid of feature
that are corresponding to this detected item. This setting maps. RoIAlign extracts features from different levels of
is more realistic than DeepFashion [14], which assumes the pyramid map.
ground-truth bounding box is provided. In this task, top-k In the second stage, PN contains three streams of net-
retrieval accuracy is employed as the evaluation metric. We works including landmark estimation, clothes detection,
emphasize the retrieval performance while still consider the and mask prediction as shown in Fig.4. The extracted RoI
influence of detector. If a clothing item fails to be detected, features after the first stage are fed into three streams in
this query item is counted as missed. In particular, we have PN separately. The clothes detection stream has two hidden
more than 686K commercial-consumer clothes pairs in the fully-connected (fc) layers, one fc layer for classification,
training set. In the validation set, there are 10, 990 con- and one fc layer for bounding box regression. The stream of
sumer images with 12, 550 items as a query set, and 21, 438 landmark estimation has 8 ‘conv’ layers and 2 ‘deconv’ lay-
commercial images with 37, 183 items as a gallery set. In ers to predict landmarks. Segmentation stream has 4 ‘conv’
the test set, there are 21, 550 consumer images with 24, 402 layers, 1 ‘deconv’ layer, and another ‘conv’ layer to predict
items as queries, while 43, 608 commercial images with masks.
75, 347 items in the gallery.
In the third stage, MN contains a feature extractor and
a similarity learning network for clothes retrieval. The
3. Match R-CNN learned RoI features after the FN component are highly
We present a strong baseline model built upon Mask R- discriminative with respect to clothes category, pose, and
CNN [6] for DeepFashion2, termed Match R-CNN, which mask. They are fed into MN to obtain features vectors
is an end-to-end training framework that jointly learns for retrieval, where v1 and v2 are passed into the similar-
clothes detection, landmark estimation, instance segmenta- ity learning network to obtain the similarity score between
tion, and consumer-to-shop retrieval. The above tasks are the detected clothing items in I1 and I2 . Specifically, the
solved by using different streams and stacking a Siamese feature extractor has 4 ‘conv’ layers, one pooling layer, and
module on top of these streams to aggregate learned fea- one fc layer. The similarity learning network consists of
tures. subtraction and square operator and a fc layer, which esti-
As shown in Fig.4, Match R-CNN employs two images mates the probability of whether two clothing items match
I1 and I2 as inputs. Each image is passed through three or not.
main components including a Feature Network (FN), a Per- Loss Functions. The parameters Θ of the Match R-CNN
ception Network (PN), and a Matching Network (MN). In are optimized by minimizing five loss functions, which are
the first stage, FN contains a ResNet-FPN [10] backbone, formulated as minΘ L = λ1 Lcls + λ2 Lbox + λ3 Lpose +
a region proposal network (RPN) [16] and RoIAlign mod- λ4 Lmask + λ5 Lpair , including a cross-entropy (CE) loss
ule. An image is first fed into ResNet50 to extract features, Lcls for clothes classification, a smooth loss [4] Lbox for
which are then fed into a FPN that uses a top-down architec- bounding box regression, a CE loss Lpose for landmark es-

                  scale                       occlusion                  zoom-in                    viewpoint                        overall
                  small   moderate  large     slight  medium  heavy      no      medium  large      no wear  frontal  side or back
AP_box            0.604   0.700     0.660     0.712   0.654   0.372      0.695   0.629   0.466      0.624    0.681    0.641          0.667
AP_box^IoU=0.50   0.780   0.851     0.768     0.844   0.810   0.531      0.848   0.755   0.563      0.713    0.832    0.796          0.814
AP_box^IoU=0.75   0.717   0.809     0.744     0.812   0.768   0.433      0.806   0.718   0.525      0.688    0.791    0.744          0.773

Table 2. Clothes detection of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. The evaluation metrics are AP_box, AP_box^IoU=0.50, and AP_box^IoU=0.75. The best performance of each subset is in bold.

Figure 5. (a) shows failure cases in clothes detection while (b) shows failure cases in clothes segmentation. In (a) and (b), the missing
bounding boxes are drawn in red while the correct category labels are also in red. Inaccurate masks are also highlighted by arrows in (b).
For example, clothes fail to be detected or segmented in too small scale, too large scale, large non-rigid deformation, heavy occlusion, large
zoom-in, side or back viewpoint.

timation, a CE loss Lmask for clothes segmentation, and a CE loss Lpair for clothes retrieval. Specifically, Lcls, Lbox, Lpose, and Lmask are identical to those defined in [6]. We have

L_pair = -(1/n) \sum_{i=1}^{n} [ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) ],

where y_i = 1 indicates that the two items of a pair are matched, and y_i = 0 otherwise.

Implementations. In our experiments, each training image is resized so that its shorter edge is 800 pixels and its longer edge is no more than 1333 pixels. Each minibatch has two images per GPU and 8 GPUs are used for training. For minibatch size 16, the learning rate (LR) schedule starts at 0.02 and is decreased by a factor of 0.1 after 8 epochs and then 11 epochs, finally terminating at 12 epochs. This schedule is denoted as 1x. Mask R-CNN adopts a 2x schedule for clothes detection and segmentation, where '2x' is twice as long as 1x with the LR scaled proportionally. It adopts s1x for landmark and pose estimation, where s1x scales the 1x schedule by roughly 1.44x. Match R-CNN uses the 1x schedule for consumer-to-shop clothes retrieval. The above models are trained using SGD with a weight decay of 10^-5 and momentum of 0.9.

In our experiments, the RPN produces anchors with 3 aspect ratios on each level of the FPN pyramid. In the clothes detection stream, an RoI is considered positive if its IoU with a ground truth box is larger than 0.5 and negative otherwise. In the clothes segmentation stream, positive RoIs with a foreground label are chosen, while in the landmark estimation stream, positive RoIs with visible landmarks are selected. We define a ground truth box of interest as a clothing item whose style number is > 0 and which can constitute matching pairs. In the clothes retrieval stream, RoIs are selected if their IoU with a ground truth box of interest is larger than 0.7. If RoI features are extracted from the landmark estimation stream, RoIs with visible landmarks are also selected.

Inference. At test time, images are resized in the same way as in the training stage. The top 1000 proposals with detection probabilities are chosen for bounding box classification and regression. Then non-maximum suppression is applied to these proposals. The filtered proposals are fed into the landmark branch and the mask branch separately. For the retrieval task, each unique detected clothing item in a consumer-taken image with the highest confidence is selected as a query.

4. Experiments

We demonstrate the effectiveness of DeepFashion2 by evaluating Mask R-CNN [6] and Match R-CNN on multiple tasks including clothes detection and classification, landmark estimation, instance segmentation, and consumer-to-shop clothes retrieval.

scale occlusion zoom-in viewpoint overall
small moderate large slight medium heavy no medium large no wear frontal side or back
0.587 0.687 0.599 0.669 0.631 0.398 0.688 0.559 0.375 0.527 0.677 0.536 0.641
APpt
0.497 0.607 0.555 0.643 0.530 0.248 0.616 0.489 0.319 0.510 0.596 0.456 0.563
0.780 0.854 0.782 0.851 0.813 0.534 0.855 0.757 0.571 0.724 0.846 0.748 0.820
APOKS=0.50
pt
0.764 0.839 0.774 0.847 0.799 0.479 0.848 0.744 0.549 0.716 0.832 0.727 0.805
0.671 0.779 0.678 0.760 0.718 0.440 0.786 0.633 0.390 0.571 0.771 0.610 0.728
APOKS=0.75
pt
0.551 0.703 0.625 0.739 0.600 0.236 0.714 0.537 0.307 0.550 0.684 0.506 0.641

Table 3. Landmark estimation of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.
Results of evaluation on visible landmarks only and evaluation on both visible and occlusion landmarks are separately shown in each row.
OKS=0.50
The evaluation metrics are APpt , APpt , and APOKS=0.75
pt . The best performance of each subset is bold.
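The AP_pt numbers in Table 3 follow the COCO keypoint protocol built on the object keypoint similarity (OKS). The sketch below shows the standard OKS computation; the per-landmark falloff constants k are dataset-specific and are treated here as given inputs, since their DeepFashion2 values are not specified in this excerpt.

# Minimal sketch of COCO-style object keypoint similarity (OKS) underlying AP_pt:
# squared landmark errors are normalized by object scale and per-landmark constants,
# then averaged over labeled landmarks.
import numpy as np

def object_keypoint_similarity(pred_xy, gt_xy, visibility, area, k):
    # pred_xy, gt_xy: (N, 2) landmark coordinates; visibility: (N,) with > 0 for labeled points;
    # area: object segment/box area used as the scale; k: (N,) per-landmark falloff constants.
    d2 = np.sum((np.asarray(pred_xy, float) - np.asarray(gt_xy, float)) ** 2, axis=1)
    e = d2 / (2.0 * float(area) * np.asarray(k, float) ** 2 + np.finfo(float).eps)
    labeled = np.asarray(visibility) > 0
    return float(np.mean(np.exp(-e[labeled]))) if labeled.any() else 0.0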

To further show the large variations of DeepFashion2, the validation set is divided into three subsets according to their difficulty levels in scale, occlusion, zoom-in, and viewpoint. The settings of Mask R-CNN and Match R-CNN follow Sec.3. All models are trained on the training set and evaluated on the validation set.

The following sections, from 4.1 to 4.4, report results for the different tasks, showing that DeepFashion2 imposes significant challenges to both Mask R-CNN and Match R-CNN, which are recent state-of-the-art systems for visual perception.

4.1. Clothes Detection

Table 2 summarizes the results of clothes detection on the different difficulty subsets. We see that clothes of moderate scale, slight occlusion, no zoom-in, and frontal viewpoint have the highest detection rates. There are several observations. First, detecting clothes with small or large scale reduces detection rates. Some failure cases are provided in Fig.5(a), where an item can occupy less than 2% of the image while another occupies more than 90% of the image. Second, in Table 2, it is intuitive to see that heavy occlusion and large zoom-in degrade performance. In these two cases, large portions of the clothes are invisible, as shown in Fig.5(a). Third, it is seen in Table 2 that clothing items not on the human body also drop performance. This is because they possess large non-rigid deformations, as visualized in the failure cases of Fig.5(a). These variations are not present in previous object detection benchmarks such as COCO. Fourth, clothes with a side or back viewpoint are much more difficult to detect, as shown in Fig.5(a).

Figure 6. (a) shows results of landmark and pose estimation. (b) shows results of clothes segmentation. (c) shows queries with top-5 retrieved clothing items. The first column is the image from the customer with the bounding box predicted by the detection module, and the second to the sixth columns show the retrieval results from the store. (d) is the retrieval accuracy on the overall query validation set with (1) detected boxes and (2) ground truth boxes, for the 'class', 'pose', 'mask', 'pose+class', and 'mask+class' features. Evaluation metrics are top-1, -5, -10, -15, and -20 retrieval accuracy.

4.2. Landmark and Pose Estimation

Table 3 summarizes the results of landmark estimation. The evaluation of each subset is performed in two settings: visible landmarks only (the occluded landmarks are not evaluated), as well as both visible and occluded landmarks. As estimating occluded landmarks is more difficult than estimating visible landmarks, the second setting generally gives worse results than the first.

In general, we see that Mask R-CNN obtains an overall

scale occlusion zoom-in viewpoint overall
small moderate large slight medium heavy no medium large no wear frontal side or back
APmask 0.634 0.700 0.669 0.720 0.674 0.389 0.703 0.627 0.526 0.695 0.697 0.617 0.680
APIoU=0.50
mask 0.831 0.900 0.844 0.900 0.878 0.559 0.899 0.815 0.663 0.829 0.886 0.843 0.873
APIoU=0.75
mask 0.765 0.838 0.786 0.850 0.813 0.463 0.842 0.740 0.613 0.792 0.834 0.732 0.812

Table 4. Clothes segmentation of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.
The evaluation metrics are APmask , APIoU=0.50
mask , and APIoU=0.75
mask . The best performance of each subset is bold.

scale occlusion zoom-in viewpoint overall


small moderate large slight medium heavy no medium large no wear frontal side or back top-1 top-10 top-20
0.513 0.619 0.547 0.580 0.556 0.503 0.608 0.557 0.441 0.555 0.580 0.533 0.122 0.363 0.464
class
0.445 0.558 0.515 0.542 0.514 0.361 0.557 0.514 0.409 0.508 0.529 0.519 0.104 0.321 0.417
0.695 0.775 0.729 0.752 0.729 0.698 0.769 0.742 0.618 0.725 0.755 0.705 0.255 0.555 0.647
pose
0.619 0.695 0.688 0.704 0.668 0.559 0.700 0.693 0.572 0.682 0.690 0.654 0.234 0.495 0.589
0.641 0.705 0.663 0.688 0.656 0.645 0.708 0.670 0.556 0.650 0.690 0.653 0.187 0.471 0.573
mask
0.584 0.656 0.632 0.657 0.619 0.512 0.663 0.630 0.541 0.628 0.645 0.602 0.175 0.421 0.529
0.752 0.786 0.733 0.754 0.750 0.728 0.789 0.750 0.620 0.726 0.771 0.719 0.268 0.574 0.665
pose+class
0.691 0.730 0.705 0.725 0.706 0.605 0.746 0.709 0.582 0.699 0.723 0.684 0.244 0.522 0.617
0.679 0.738 0.685 0.711 0.695 0.651 0.742 0.699 0.569 0.677 0.719 0.678 0.214 0.510 0.607
mask+class
0.623 0.696 0.661 0.685 0.659 0.568 0.708 0.667 0.566 0.659 0.676 0.657 0.200 0.463 0.564
Table 5. Consumer-to-Shop Clothes Retrieval of Match R-CNN on different subsets of some validation consumer-taken images. Each
query item in these images has over 5 identical clothing items in validation commercial images. Results of evaluation on ground truth box
and detected box are separately shown in each row. The evaluation metrics are top-20 accuracy. The best performance of each subset is
bold.
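The top-k accuracies in Table 5 and Fig. 6(d) count a query as correct if any of its k highest-scoring gallery items shares the query's clothing identity, with undetected query items counted as misses. The sketch below illustrates that computation under those assumptions; the similarity matrix is whatever the matching network produces, and the function name is illustrative.

# Minimal sketch of top-k retrieval accuracy for consumer-to-shop retrieval:
# a query is a hit if any of its top-k gallery items shares its clothing identity;
# undetected queries stay in the denominator as misses.
import numpy as np

def top_k_accuracy(query_ids, gallery_ids, scores, k=20, detected=None):
    # scores: (num_queries, num_gallery) similarity matrix from the match network;
    # detected: optional boolean mask, False where the query item failed to be detected.
    gallery_ids = np.asarray(gallery_ids)
    hits = 0
    for q, row in enumerate(np.asarray(scores)):
        if detected is not None and not detected[q]:
            continue                                   # missed query: counted only in the denominator
        top_k = np.argsort(-row)[:k]
        hits += int(np.any(gallery_ids[top_k] == query_ids[q]))
    return hits / len(query_ids)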

AP of just 0.563, showing that clothes landmark estimation increases the accuracy. In particular, the learned features
could be even more challenging than human pose estima- from pose and class achieve better results than the other
tion in COCO. In particular, Table 3 exhibits similar trends features. When comparing learned features from pose and
as those from clothes detection. For example, the cloth- mask, we find that the former achieves better results, indi-
ing items with moderate scale, slight occlusion, no zoom- cating that landmark locations can be more robust across
in, and frontal viewpoint have better results than the others scenarios.
subsets. Moreover, heavy occlusion and zoom-in decreases As shown in Table 5, the performance declines when
performance a lot. Some results are given in Fig.6(a). small scale, heavily occluded clothing items are presented.
Clothes with large zoom-in achieved the lowest accuracy
4.3. Clothes Segmentation because only part of clothes are displayed in the image and
Table 4 summarizes the results of segmentation. The crucial distinguishable features may be missing. Compared
performance declines when segmenting clothing items with with clothes on people from frontal view, clothes from side
small and large scale, heavy occlusion, large zoom-in, side or back viewpoint perform worse due to lack of discrim-
or back viewpoint, which is consistent with those trends in inative features like patterns on the front of tops. Exam-
the previous tasks. Some results are given in Fig.6(b). Some ple queries with top-5 retrieved clothing items are shown in
failure cases are visualized in Fig.5(b). Fig.6(c).

4.4. Consumer-to-Shop Clothes Retrieval 5. Conclusions


Table 5 summarizes the results of clothes retrieval. The This work represented DeepFashion2, a large-scale fash-
retrieval accuracy is reported in Fig. 6(d), where top-1, - ion image benchmark with comprehensive tasks and an-
5, -10, and -20 retrieval accuracy are shown. We evaluate notations. DeepFashion2 contains 491K images, each of
two settings in (c.1) and (c.2), when the bounding boxes which is richly labeled with style, scale, occlusion, zoom-
are predicted by the detection module in Match R-CNN and ing, viewpoint, bounding box, dense landmarks and pose,
are provided as ground truths. Match R-CNN achieves a pixel-level masks, and pair of images of identical item from
top-20 accuracy of less than 0.7 with ground-truth bounding consumer and commercial store. We establish benchmarks
boxes provided, indicating that the retrieval benchmark is covering multiple tasks in fashion understanding, including
challenging. Furthermore, retrieval accuracy drops when clothes detection, landmark and pose estimation, clothes
using detected boxes, meaning that this is a more realistic segmentation, consumer-to-shop verification and retrieval.
setting. A novel Match R-CNN framework that builds upon Mask
In Table 5, different combinations of the learned features R-CNN is proposed to solve the above tasks in end-to-end
are also evaluated. In general, the combination of features manner. Extensive evaluations are conducted in DeepFash-

ion2. [17] W. Wang, Y. Xu, J. Shen, and S.-C. Zhu. Attentive fashion
The rich data and labels of DeepFashion2 will defi- grammar network for fashion landmark detection and cloth-
nitely facilitate the developments of algorithms to under- ing category classification. In CVPR, 2018.
stand fashion images in future work. We will focus on [18] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll
three aspects. First, more challenging tasks will be explored parsing: Retrieving similar styles to parse clothing items. In
with DeepFashion2, such as synthesizing clothing images ICCV, 2013.
by using GANs. Second, it is also interesting to explore [19] S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Un-
constrained fashion landmark detection via hierarchical re-
multi-domain learning for clothing images, because fashion
current transformer networks. In ACM Multimedia, 2017.
trends of clothes may change frequently, making variations
[20] W. Yang, P. Luo, and L. Lin. Clothing co-parsing by joint
of clothing images changed. Third, we will introduce more image segmentation and labeling. In CVPR, 2014.
evaluation metrics into DeepFashion2, such as size, run-
[21] S. Zheng, F. Yang, M. H. Kiapour, and R. Piramuthu.
time, and memory consumptions of deep models, towards Modanet: A large-scale street fashion dataset with polygon
understanding fashion images in real-world scenario. annotations. In ACM Multimedia, 2018.

References
[1] Fashionai dataset. http://fashionai.alibaba.
com/datasets/.
[2] H. Chen, A. Gallagher, and B. Girod. Describing clothing by
semantic attributes. In ECCV, 2012.
[3] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and
S. Yan. Deep domain adaptation for describing people based
on fine-grained clothing attributes. In CVPR, 2015.
[4] R. Girshick. Fast r-cnn. In ICCV, 2015.
[5] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L.
Berg. Where to buy it: Matching street clothing photos in
online shops. In ICCV, 2015.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn.
In ICCV, 2017.
[7] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain
image retrieval with a dual attribute-aware ranking network.
In ICCV, 2015.
[8] X. Ji, W. Wang, M. Zhang, and Y. Yang. Cross-domain image
retrieval with attention modeling. In ACM Multimedia, 2017.
[9] L. Liao, X. He, B. Zhao, C.-W. Ngo, and T.-S. Chua. Inter-
pretable multimodal retrieval for fashion products. In ACM
Multimedia, 2018.
[10] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and
S. J. Belongie. Feature pyramid networks for object detec-
tion. In CVPR, 2017.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In ECCV, 2014.
[12] K.-H. Liu, T.-Y. Chen, and C.-S. Chen. Mvc: A dataset for
view-invariant clothing retrieval and attribute prediction. In
ACM Multimedia, 2016.
[13] S. Liu, X. Liang, L. Liu, K. Lu, L. Lin, X. Cao, and S. Yan.
Fashion parsing with video context. IEEE Transactions on
Multimedia, 17(8):1347–1358, 2015.
[14] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion:
Powering robust clothes recognition and retrieval with rich
annotations. In CVPR, 2016.
[15] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion land-
mark detection in the wild. In ECCV, 2016.
[16] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In
NIPS, 2015.

Inverse Cooking: Recipe Generation from Food Images

Amaia Salvador1∗ Michal Drozdzal2 Xavier Giro-i-Nieto1 Adriana Romero2


1
Universitat Politecnica de Catalunya 2 Facebook AI Research
{amaia.salvador, xavier.giro}@upc.edu, {adrianars, mdrozdzal}@fb.com
arXiv:1812.06164v2 [cs.CV] 15 Jun 2019

Abstract Title: Biscuits


Ingredients:
People enjoy food photography because they appreciate Flour, butter, sugar, egg, milk, salt.
food. Behind each meal there is a story described in a com- Instructions:
plex recipe and, unfortunately, by simply looking at a food - Preheat oven to 450 degrees.
- Cream butter and sugar.
image we do not have access to its preparation process. - Add egg and milk.
Therefore, in this paper we introduce an inverse cooking - Sift flour and salt together.
system that recreates cooking recipes given food images. - Add to creamed mixture.
Our system predicts ingredients as sets by means of a novel - Roll out on floured board to 1/4
inch thickness.
architecture, modeling their dependencies without impos-
- Cut with biscuit cutter.
ing any order, and then generates cooking instructions by - Place on ungreased cookie sheet.
attending to both image and its inferred ingredients simul- - Bake for 10 minutes.
taneously. We extensively evaluate the whole system on the
large-scale Recipe1M dataset and show that (1) we improve Figure 1: Example of a generated recipe, composed of a
performance w.r.t. previous baselines for ingredient predic- title, ingredients and cooking instructions.
tion; (2) we are able to obtain high quality recipes by lever-
aging both image and ingredients; (3) our system is able to limited and, as a consequence, it is hard to know precisely
produce more compelling recipes than retrieval-based ap- what we eat. Therefore, we argue that there is a need for
proaches according to human judgment. We make code and inverse cooking systems, which are able to infer ingredients
models publicly available1 . and cooking instructions from a prepared meal.
The last few years have witnessed outstanding improve-
1. Introduction ments in visual recognition tasks such as natural image clas-
sification [47, 14], object detection [42, 41] and semantic
Food is fundamental to human existence. Not only does segmentation [27, 19]. However, when comparing to natu-
it provide us with energy—it also defines our identity and ral image understanding, food recognition poses additional
culture [10, 34]. As the old saying goes, we are what we eat, challenges, since food and its components have high intra-
and food related activities such as cooking, eating and talk- class variability and present heavy deformations that occur
ing about it take a significant portion of our daily life. Food during the cooking process. Ingredients are frequently oc-
culture has been spreading more than ever in the current cluded in a cooked dish and come in a variety of colors,
digital era, with many people sharing pictures of food they forms and textures. Further, visual ingredient detection re-
are eating across social media [31]. Querying Instagram for quires high level reasoning and prior knowledge (e.g. cake
#food leads to at least 300M posts; similarly, searching for will likely contain sugar and not salt, while croissant will
#foodie results in at least 100M posts, highlighting the un- presumably include butter). Hence, food recognition chal-
questionable value that food has in our society. Moreover, lenges current computer vision systems to go beyond the
eating patterns and cooking culture have been evolving over merely visible, and to incorporate prior knowledge to en-
time. In the past, food was mostly prepared at home, but able high-quality structured food preparation descriptions.
nowadays we frequently consume food prepared by third- Previous efforts on food understanding have mainly fo-
parties (e.g. takeaways, catering and restaurants). Thus, cused on food and ingredient categorization [1, 39, 24].
the access to detailed information about prepared food is However, a system for comprehensive visual food recog-
∗ Work done during internship at Facebook AI Research nition should not only be able to recognize the type of meal
1 https://github.com/facebookresearch/inversecooking or its ingredients, but also understand its preparation pro-
cess. Traditionally, the image-to-recipe problem has been gether with a recently held iFood challenge2 has enabled
formulated as a retrieval task [54, 3, 4, 45], where a recipe significant advancements in visual food recognition, by
is retrieved from a fixed dataset based on the image similar- providing reference benchmarks to train and compare ma-
ity score in an embedding space. The performance of such chine learning approaches. As a result, there is currently
systems highly depends on the dataset size and diversity, as a vast literature in computer vision dealing with a variety
well as on the quality of the learned embedding. Not sur- of food related tasks, with special focus in image classifi-
prisingly, these systems fail when a matching recipe for the cation [26, 39, 38, 33, 6, 24, 30, 60, 16, 17]. Subsequent
image query does not exist in the static dataset. works tackle more challenging tasks such as estimating the
An alternative to overcome the dataset constraints of re- number of calories given a food image [32], estimating food
trieval systems is to formulate the image-to-recipe problem quantities [5], predicting the list of present ingredients [3, 4]
as a conditional generation one. Therefore, in this paper, we and finding the recipe for a given image [54, 3, 4, 45, 2].
present a system that generates a cooking recipe containing Additionally, [34] provides a detailed cross-region anal-
a title, ingredients and cooking instructions directly from ysis of food recipes, considering images, attributes (e.g.
an image. Figure 1 shows an example of a generated recipe style and course) and recipe ingredients. Food related tasks
obtained with our method, which first predicts ingredients have also been considered in the natural language process-
from an image and then conditions on both the image and ing literature, where recipe generation has been studied in
the ingredients to generate the cooking instructions. To the the context of generating procedural text from either flow
best of our knowledge, our system is the first to generate graphs [13, 36, 35] or ingredients’ checklists [21].
cooking recipes directly from food images. We pose the in- Multi-label classification. Significant effort has been
struction generation problem as a sequence generation one devoted in the literature to leverage deep neural networks
conditioned on two modalities simultaneously, namely an for multi-label classification, by designing models [49, 8,
image and its predicted ingredients. We formulate the in- 56, 37, 53] and studying loss functions [12] well suited for
gredient prediction problem as a set prediction, exploiting this task. Early attempts exploit single-label classification
their underlying structure. We model ingredient dependen- models coupled with binary logistic loss [3], assuming the
cies while not penalizing for prediction order, thus revising independence among labels and dropping potentially rele-
the question of whether order matters [51]. We extensively vant information. One way of capturing label dependen-
evaluate our system on the large-scale Recipe1M dataset cies is by relying on label powersets [49]. Powersets con-
[45] that contains images, ingredients and cooking instruc- sider all possible label combinations, which makes them in-
tions, showing satisfactory results. More precisely, in a hu- tractable for large scale problems. Another expensive alter-
man evaluation study, we show that our inverse cooking sys- native consists in learning the joint probability of the labels.
tem outperforms previously introduced image-to-recipe re- To overcome this issue, probabilistic classifier chains [8]
trieval approaches by a large margin. Moreover, using a and their recurrent neural network-based [53, 37] counter-
small set of images, we show that food image-to-ingredient parts propose to decompose the joint distribution into con-
prediction is a hard task for humans and that our approach ditionals, at the expense of introducing intrinsic ordering.
is able to surpass them. Note that most of these models require to make a predic-
The contributions of this paper can be summarized as: tion for each of the potential labels. Moreover, joint input
and label embeddings [57, 25, 61] have been introduced to
– We present an inverse cooking system, which gener-
preserve correlations and predict label sets. As an alterna-
ates cooking instructions conditioned on an image and
tive, researchers have attempted to predict the cardinality of
its ingredients, exploring different attention strategies
the set of labels [43, 44]; however, assuming the indepen-
to reason about both modalities simultaneously.
dence of labels. When it comes to multi-label classification
– We exhaustively study ingredients as both a list and a
objectives, binary logistic loss [3], target distribution cross-
set, and propose a new architecture for ingredient pre-
entropy [12, 29], target distribution mean squared error [56]
diction that exploits co-dependencies among ingredi-
and ranking-based losses [12] have been investigated and
ents without imposing order.
compared. Recent results on large scale datasets outline the
– By means of a user study we show that ingredient pre-
potential of the target distribution loss [29].
diction is indeed a difficult task and demonstrate the
superiority of our proposed system against image-to- Conditional text generation. Conditional text genera-
recipe retrieval approaches. tion with auto-regressive models has been widely studied in
the literature using both text-based [48, 11, 50, 9] as well
as image-based conditionings [52, 59, 28, 20, 23, 7, 46]. In
2. Related Work neural machine translation, where the goal is to predict the
translation for a given source text into another language, dif-
Food Understanding. The introduction of large scale
food datasets, such as Food-101 [1] and Recipe1M [45], to- 2 https://www.kaggle.com/c/ifood2018
(Figure 2 example: predicted ingredients 'beef', 'onion', 'tomato', 'beans' and a generated instruction step "Add onion and cook until tender", produced by the image encoder, ingredient decoder/encoder, and instruction decoder.)
Figure 2: Recipe generation model. We extract image features eI with the image encoder, parametrized by θI . Ingredients
are predicted by θL , and encoded into ingredient embeddings eL with θe . The cooking instruction decoder, parametrized by
θR generates a recipe title and a sequence of cooking steps by attending to image embeddings eI , ingredient embeddings eL ,
and previously predicted words (r0 , ..., rt−1 ).
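As a rough illustration of the two-stage pipeline in Figure 2, the following PyTorch sketch wires an image encoder, an ingredient predictor and embedding, and a transformer decoder whose memory is the concatenation of image and ingredient tokens. The module choices (torchvision ResNet-50 features, a single linear ingredient head, nn.TransformerDecoder, fixed top-10 ingredient selection, omitted positional encodings and causal masks) are illustrative stand-ins, not the released implementation.

# Schematic sketch of the recipe-generation pipeline in Figure 2 (assumptions noted above).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class InverseCookingSketch(nn.Module):
    def __init__(self, num_ingredients=1500, vocab_size=20000, d_model=512):  # illustrative sizes
        super().__init__()
        backbone = resnet50()
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 7, 7)
        self.to_d_model = nn.Conv2d(2048, d_model, kernel_size=1)
        self.ingredient_head = nn.Linear(d_model, num_ingredients)            # predicts ingredient scores
        self.ingredient_embedding = nn.Embedding(num_ingredients, d_model)
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.instruction_decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, prev_words):
        feats = self.to_d_model(self.image_encoder(image))                    # (B, d, 7, 7)
        e_img = feats.flatten(2).transpose(1, 2)                              # (B, P, d) image tokens
        ingr_logits = self.ingredient_head(e_img.mean(dim=1))                 # (B, num_ingredients)
        top_ingr = ingr_logits.topk(k=10, dim=1).indices                      # hypothetical fixed-size selection
        e_ingr = self.ingredient_embedding(top_ingr)                          # (B, K, d) ingredient tokens
        memory = torch.cat([e_img, e_ingr], dim=1)                            # joint image+ingredient conditioning
        tgt = self.word_embedding(prev_words)                                 # (B, T, d) shifted recipe words
        out = self.instruction_decoder(tgt, memory)
        return ingr_logits, self.word_head(out)                               # ingredient and word logits

Concatenating the two embedding sets as a single memory corresponds to the "concatenated attention" conditioning discussed for the instruction decoder; the alternatives attend to each modality separately.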

ferent architecture designs have been studied, including re- 3.1. Cooking Instruction Transformer
current neural networks [48], convolutional models [11] and
attention based approaches [50]. More recently, sequence- Given an input image with associated ingredients, we
to-sequence models have been applied to more open-ended generation tasks, such as poetry [55] and story generation [23, 9]. Following neural machine translation trends, auto-regressive models have exhibited promising performance in image captioning [52, 59, 28, 20, 7, 46], where the goal is to provide a short description of the image contents, opening the doors to less constrained problems such as generating descriptive paragraphs [23] or visual storytelling [18].

3. Generating recipes from images

Generating a recipe (title, ingredients and instructions) from an image is a challenging task, which requires a simultaneous understanding of the ingredients composing the dish as well as the transformations they went through, e.g. slicing, blending or mixing with other ingredients. Instead of obtaining the recipe from an image directly, we argue that a recipe generation pipeline would benefit from an intermediate step predicting the ingredients list. The sequence of instructions would then be generated conditioned on both the image and its corresponding list of ingredients, where the interplay between image and ingredients could provide additional insights on how the latter were processed to produce the resulting dish.

Figure 2 illustrates our approach. Our recipe generation system takes a food image as an input and outputs a sequence of cooking instructions, which are generated by means of an instruction decoder that takes as input two embeddings. The first one represents visual features extracted from an image, while the second one encodes the ingredients extracted from the image. We start by introducing our transformer-based instruction decoder in Subsection 3.1. This allows us to formally review the transformer, which we then study and modify to predict ingredients in an orderless manner in Subsection 3.2. Finally, we review the optimization details in Subsection 3.3.

aim to produce a sequence of instructions R = (r1, ..., rT) (where rt denotes a word in the sequence) by means of an instruction transformer [50]. Note that the title is predicted as the first instruction. This transformer is conditioned jointly on two inputs: the image representation eI and the ingredient embedding eL. We extract the image representation with a ResNet-50 [15] encoder and obtain the ingredient embedding eL by means of a decoder architecture to predict ingredients, followed by a single embedding layer mapping each ingredient into a fixed-size vector.

The instruction decoder is composed of transformer blocks, each of them containing two attention layers followed by a linear layer [50]. The first attention layer applies self-attention over previously generated outputs, whereas the second one attends to the model conditioning in order to refine the self-attention output. The transformer model is composed of multiple transformer blocks followed by a linear layer and a softmax nonlinearity that provides a distribution over recipe words for each time step t. Figure 3a illustrates the transformer model, which traditionally is conditioned on a single modality. However, our recipe generator is conditioned on two sources: the image features eI ∈ R^(P×de) and the ingredient embeddings eL ∈ R^(K×de) (P and K denote the number of image and ingredient features, respectively, and de is the embedding dimensionality). Thus, we want our attention to reason about both modalities simultaneously, guiding the instruction generation process. To that end, we explore three different fusion strategies (depicted in Figure 3):

– Concatenated attention. This strategy first concatenates both image eI and ingredient eL embeddings over the first dimension, econcat ∈ R^((K+P)×de). Then, attention is applied over the combined embeddings.

– Independent attention. This strategy incorporates two attention layers to deal with the bi-modal conditioning. In this case, one layer attends over the image embedding eI, whereas the other attends over the
[Figure 3 diagram: (a) the transformer model [50] (embedding, positional encoding, self-attention, conditioning attention, feed-forward, linear and softmax layers), together with the three bi-modal attention variants used to condition the instruction decoder on eI and eL — (b) Concatenated, (c) Independent and (d) Sequential.]
Figure 3: Attention strategies for the instruction decoder. In our experiments, we replace the attention module in the
transformer (a), with three different attention modules (b-d) for cooking instruction generation using multiple conditions.

ingredient embeddings eL. The output of both attention layers is combined via a summation operation.

– Sequential attention. This strategy sequentially attends over the two conditioning modalities. In our design, we consider two orderings: (1) image first, where the attention is first computed over image embeddings eI and then over ingredient embeddings eL; and (2) ingredients first, where the order is flipped and we first attend over ingredient embeddings eL followed by image embeddings eI.

3.2. Ingredient Decoder

Which is the best structure to represent ingredients? On the one hand, it seems clear that ingredients are a set, since permuting them does not alter the outcome of the cooking recipe. On the other hand, we colloquially refer to ingredients as a list (e.g. list of ingredients), implying some order. Moreover, it would be reasonable to think that there is some information in the order in which humans write down the ingredients in a recipe. Therefore, in this subsection we consider both scenarios and introduce models that work either with a list of ingredients or with a set of ingredients.

A list of ingredients is a variable sized, ordered collection of unique meal constituents. More precisely, let us define a dictionary of ingredients of size N as D = {d_i}_{i=0}^{N}, from which we can obtain a list of ingredients L by selecting K elements from D: L = [l_i]_{i=0}^{K}. We encode L as a binary matrix L of dimensions K × N, with L_{i,j} = 1 if d_j ∈ D is selected and 0 otherwise (one-hot-code representation). Thus, our training data consists of M image and ingredient list pairs {(x^{(i)}, L^{(i)})}_{i=0}^{M}. In this scenario, the goal is to predict L̂ from an image x by maximizing the following objective:

\arg\max_{\theta_I, \theta_L} \sum_{i=0}^{M} \log p(\hat{L}^{(i)} = L^{(i)} \mid x^{(i)}; \theta_I, \theta_L),    (1)

where θI and θL represent the learnable parameters of the image encoder and ingredient decoder, respectively. Since L denotes a list, we can factorize p(L̂^{(i)} = L^{(i)} | x^{(i)}) into K conditionals, \sum_{k=0}^{K} \log p(\hat{L}_k^{(i)} = L_k^{(i)} \mid x^{(i)}, L_{<k}^{(i)}),[3] and parametrize p(L̂_k^{(i)} | x^{(i)}, L_{<k}^{(i)}) as a categorical distribution. In the literature, these conditionals are usually modeled with auto-regressive (recurrent) models. In our experiments, we choose the transformer model as well. It is worth mentioning that a potential drawback of this formulation is that it inherently penalizes for order, which might not necessarily be relevant for ingredients.

A set of ingredients is a variable sized, unordered collection of unique meal constituents. We can obtain a set of ingredients S by selecting K ingredients from the dictionary D: S = {s_i}_{i=0}^{K}. We represent S as a binary vector s of dimension N, where s_i = 1 if s_i ∈ S and 0 otherwise. Thus, our training data consists of M image and ingredient set pairs {(x^{(i)}, s^{(i)})}_{i=0}^{M}. In this case, the goal is to predict ŝ from an image x by maximizing the following objective:

\arg\max_{\theta_I, \theta_L} \sum_{i=0}^{M} \log p(\hat{s}^{(i)} = s^{(i)} \mid x^{(i)}; \theta_I, \theta_L).    (2)

Assuming independence among elements, we can factorize p(ŝ^{(i)} = s^{(i)} | x^{(i)}) as \sum_{j=0}^{N} \log p(\hat{s}_j^{(i)} = s_j^{(i)} \mid x^{(i)}). However, the ingredients in the set are not necessarily independent, e.g. salt and pepper frequently appear together.

To account for element dependencies in the set, we model the set as a list, i.e. as a product of conditional probabilities, by means of an auto-regressive model such as the transformer. The transformer predicts ingredients in a list-like fashion, p(L̂_k^{(i)} | x^{(i)}, L_{<k}^{(i)}), until the end-of-sequence token eos is encountered. As mentioned previously, the drawback of this approach is that such a model design penalizes for order.

[3] L_k^{(i)} denotes the k-th row of L^{(i)}, and L_{<k}^{(i)} represents all rows of L^{(i)} up to, but not including, the k-th one.
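As a rough illustration of this list-style formulation, the sketch below factorizes the ingredient probability into per-step conditionals and trains with the negative log-likelihood under teacher forcing. It is our own simplified example, not the authors' code: a GRU stands in for the transformer decoder used in the paper, and all class, parameter and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class IngredientListDecoder(nn.Module):
    """Toy auto-regressive ingredient decoder: it models p(L | x) as a
    product of per-step categorical conditionals p(l_k | x, l_<k)."""

    def __init__(self, vocab_size, img_dim=512, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + img_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, vocab_size)  # vocabulary includes the eos token

    def forward(self, image_feat, prev_ingredients):
        # image_feat: (B, img_dim); prev_ingredients: (B, K) shifted target ids
        emb = self.embed(prev_ingredients)                       # (B, K, emb_dim)
        img = image_feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.rnn(torch.cat([emb, img], dim=-1))      # condition on the image
        return self.out(hidden)                                  # per-step logits (B, K, vocab)

# Training objective (negative log-likelihood over the ingredient sequence,
# decoding stops at eos during inference):
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), targets.flatten())
```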
In order to remove the order in which ingredients are predicted, we propose to aggregate the outputs across different time-steps by means of a max pooling operation (see Figure 4). Moreover, to ensure that the ingredients in L̂^{(i)} are selected without repetition, we force the pre-activation of p(L̂_k^{(i)} | x^{(i)}, L_{<k}^{(i)}) to be −∞ for all previously selected ingredients at time-steps < k. We train this model by minimizing the binary cross-entropy between the predicted ingredients (after pooling) and the ground truth. Including the eos token in the pooling operation would result in losing the information of where the token appears. Therefore, in order to learn the stopping criterion of the ingredient prediction, we introduce an additional loss accounting for it. The eos loss is defined as the binary cross-entropy loss between the predicted eos probability at all time-steps and the ground truth (represented as a unit step function, whose value is 0 for the time-steps corresponding to ingredients and 1 otherwise). In addition to that, we incorporate a cardinality ℓ1 penalty, which we found empirically useful. At inference time, we directly sample from the transformer's output. We refer to this model as the set transformer.

[Figure 4: Set transformer (TFset). The decoder (parameters θL) emits one softmax distribution per time-step l0, ..., l4 (e.g. salt, onion, beans, rice, eos); these probabilities are pooled across time to avoid penalizing for order.]

Alternatively, we could use the target distribution p(s^{(i)} | x^{(i)}) = s^{(i)} / \sum_j s_j^{(i)} [12, 29] to model the joint distribution of set elements and train a model by minimizing the cross-entropy loss between p(s^{(i)} | x^{(i)}) and the model's output distribution p(ŝ^{(i)} | x^{(i)}). Nonetheless, it is not clear how to convert the target distribution back to the corresponding set of elements with variable cardinality. In this case, we build a feed forward network and train it with the target distribution cross-entropy loss. To recover the ingredient set, we propose to greedily sample elements from a cumulative distribution of sorted output probabilities p(ŝ^{(i)} | x^{(i)}) and stop the sampling once the sum of probabilities of selected elements is above a threshold. We refer to this model as feed forward (target distribution).

3.3. Optimization

We train our recipe transformer in two stages. In the first stage, we pre-train the image encoder and ingredient decoder as presented in Subsection 3.2. Then, in the second stage, we train the ingredient encoder and instruction decoder (following Subsection 3.1) by minimizing the negative log-likelihood and adjusting θR and θE. Note that, while training, the instruction decoder takes as input the ground truth ingredients. All transformer models are trained with teacher forcing [58] except for the set transformer.

4. Experiments

This section is devoted to the dataset and the description of implementation details, followed by an exhaustive analysis of the proposed attention strategies for the cooking instruction transformer. Further, we quantitatively compare the proposed ingredient prediction models to previously introduced baselines. Finally, a comparison of our inverse cooking system with retrieval-based models as well as a comprehensive user study is provided.

4.1. Dataset

We train and evaluate our models on the Recipe1M dataset [45], composed of 1 029 720 recipes scraped from cooking websites. The dataset contains 720 639 training, 155 036 validation and 154 045 test recipes, each containing a title, a list of ingredients, a list of cooking instructions and (optionally) an image. In our experiments, we use only the recipes containing images, and remove recipes with less than 2 ingredients or 2 instructions, resulting in 252 547 training, 54 255 validation and 54 506 test samples.

Since the dataset was obtained by scraping cooking websites, the resulting recipes are highly unstructured and frequently contain redundant or very narrowly defined cooking ingredients (e.g. olive oil, virgin olive oil and spanish olive oil are separate ingredients). Moreover, the ingredient vocabulary contains more than 400 different types of cheese, and more than 300 types of pepper. As a result, the original dataset contains 16 823 unique ingredients, which we pre-process to reduce its size and complexity. First, we merge ingredients if they share the first or last two words (e.g. bacon cheddar cheese is merged into cheddar cheese); then, we cluster the ingredients that have the same word in the first or in the last position (e.g. gorgonzola cheese or cheese blend are clustered together into the cheese category); finally, we remove plurals and discard ingredients that appear less than 10 times in the dataset. Altogether, we reduce the ingredient vocabulary from over 16k to 1 488 unique ingredients. For the cooking instructions, we tokenize the raw text, remove words that appear less than 10 times in the dataset and replace them with an unknown word token. Moreover, we add special tokens for the start and the end of the recipe as well as the end of instruction.
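To make the vocabulary reduction above concrete, here is a simplified sketch of the merging, clustering and filtering rules. It is our own illustrative code, not the authors' pre-processing script; the function names, the plural heuristic and the way rules are applied per ingredient are assumptions.

```python
from collections import Counter

def reduce_ingredient_vocab(raw_ingredients, min_count=10):
    """Simplified Recipe1M ingredient clean-up: raw_ingredients is a list of
    ingredient strings over the whole dataset (duplicates allowed); returns a
    mapping {raw_name: canonical_name}."""
    counts = Counter(raw_ingredients)

    def canonical(name):
        words = name.split()
        # Rule 1: merge ingredients sharing their last two words,
        # e.g. "bacon cheddar cheese" -> "cheddar cheese".
        if len(words) > 2 and " ".join(words[-2:]) in counts:
            words = words[-2:]
        # Rule 2: cluster ingredients that share the word in the last position
        # when that word is itself an ingredient, e.g. "gorgonzola cheese" -> "cheese".
        if len(words) > 1 and words[-1] in counts:
            words = [words[-1]]
        # Rule 3: crude plural removal.
        if words[-1].endswith("s") and words[-1][:-1] in counts:
            words[-1] = words[-1][:-1]
        return " ".join(words)

    mapping = {}
    for name in counts:
        canon = canonical(name)
        # Rule 4: discard ingredients that remain too rare.
        if counts[name] >= min_count or counts.get(canon, 0) >= min_count:
            mapping[name] = canon
    return mapping
```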
Table 1: Model selection (val). Left: recipe perplexity (ppl). Right: global ingredient IoU & F1.

  ppl (instruction decoder):            IoU / F1 (ingredient prediction):
    Independent        8.59               FFBCE            17.85 / 30.30
    Seq. img. first    8.53               FFIOU            26.25 / 41.58
    Seq. ing. first    8.61               FFDC             27.22 / 42.80
    Concatenated       8.50               FFTD             28.84 / 44.11
                                          TFlist           29.48 / 45.55
                                          TFlist + shuf.   27.86 / 43.58
                                          TFset            31.80 / 48.26

This process results in a recipe vocabulary of 23 231 unique words.

4.2. Implementation Details

We resize images to 256 pixels on their shortest side and take random crops of 224 × 224 for training, and we select the central 224 × 224 pixels for evaluation. For the instruction decoder, we use a transformer with 16 blocks and 8 multi-head attentions, each one with dimensionality 64. For the ingredient decoder, we use a transformer with 4 blocks and 2 multi-head attentions, each one with dimensionality of 256. To obtain image embeddings we use the last convolutional layer of the ResNet-50 model. Both image and ingredient embeddings are of dimension 512. We keep a maximum of 20 ingredients per recipe and truncate instructions to a maximum of 150 words. The models are trained with the Adam optimizer [22] until the early-stopping criterion is met (using a patience of 50 and monitoring the validation loss). All models are implemented with PyTorch[4] [40]. Additional implementation details are provided in the supplementary material.

4.3. Recipe Generation

In this section, we compare the proposed multi-modal attention architectures described in Section 3.1. Table 1 (left) reports the results in terms of perplexity on the validation set. We observe that independent attention exhibits the lowest results, followed by both sequential attentions. While the latter have the capability to refine the output with either ingredient or image information consecutively, independent attention can only do it in one step. This is also the case of concatenated attention, which achieves the best performance. However, concatenated attention is flexible enough to decide whether to give more focus to one modality, at the expense of the other, whereas independent attention is forced to include information from both modalities. Therefore, we use the concatenated attention model to report results on the test set. We compare it to a system going directly from image to sequence of instructions without predicting ingredients (I2R). Moreover, to assess the influence of visual features on recipe quality, we adapt our model by removing visual features and predicting instructions directly from ingredients (L2R). Our system achieves a test set perplexity of 8.51, improving over both the I2R and L2R baselines and highlighting the benefits of using both image and ingredients when generating recipes. L2R surpasses I2R with a perplexity of 8.67 vs. 9.66, demonstrating the usefulness of having access to concepts (ingredients) that are essential to the cooking instructions. Finally, we greedily sample instructions from our model and analyze the results. We notice that generated instructions have an average of 9.21 sentences containing 9 words each, whereas real, ground truth instructions have an average of 9.08 sentences of length 12.79. See the supplementary material for qualitative examples of generated recipes.

4.4. Ingredient Prediction

In this section, we compare the proposed ingredient prediction approaches to previously introduced models, with the goal of assessing whether ingredients should be treated as lists or sets. We consider models from the multilabel classification literature as baselines, and tune them for our purposes. On the one hand, we have models based on feed forward convolutional networks, which are trained to predict sets of ingredients. We experiment with several losses to train these models, namely binary cross-entropy, soft intersection over union as well as target distribution cross-entropy. Note that binary cross-entropy is the only one not taking into account dependencies among elements in the set. On the other hand, we have sequential models that predict lists, imposing order and exploiting dependencies among elements. Finally, we consider recently proposed models which couple set prediction with cardinality prediction to determine which elements to include in the set [44].

Table 1 (right) reports the results on the validation set for the state-of-the-art baselines as well as the proposed approaches. We evaluate the models in terms of Intersection over Union (IoU) and F1 score, computed for accumulated counts of TP, FN and FP over the entire dataset split (following the Pascal VOC convention). As shown in the table, the feed forward model trained with binary cross-entropy [3] (FFBCE) exhibits the lowest performance on both metrics, which could be explained by the assumed independence among ingredients. These results are already notably improved by the method that learns to predict the set cardinality (FFDC). Similarly, the performance increases when training the model with structured losses such as soft IoU (FFIOU). Our feed forward model trained with the target distribution (FFTD) and sampled by thresholding (th = 0.5) the sum of probabilities of selected ingredients outperforms all feed forward baselines, including recently proposed alternatives for set prediction such as [44] (FFDC). Note that the target distribution models dependencies among elements in a set and implicitly captures cardinality information.

[4] https://pytorch.org/
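To make this evaluation protocol concrete, the following is a minimal sketch of the dataset-level IoU and F1 computed from accumulated true/false positive counts. It is our own illustrative code (hypothetical function and variable names), not the authors' evaluation script.

```python
import numpy as np

def global_iou_f1(pred, target):
    """pred, target: binary numpy arrays of shape (num_samples, vocab_size).

    TP/FP/FN counts are accumulated over the whole split before computing
    the metrics (Pascal VOC style), rather than averaged per sample."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1
```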
Table 2: Ingredient cardinality (mean ± std).

  Model              Card. error     # pred. ingrs
  FFBCE              5.67 ± 3.10     2.37 ± 1.58
  FFDC               2.68 ± 2.07     9.18 ± 2.06
  FFIOU              2.46 ± 1.95     7.86 ± 1.72
  FFTD               3.02 ± 2.50     8.02 ± 3.24
  TFlist             2.49 ± 2.11     7.05 ± 2.77
  TFlist + shuffle   3.24 ± 2.50     5.06 ± 1.85
  TFset              2.56 ± 1.93     9.43 ± 2.35

[Figure 5: Ingredient prediction results: P@K (left) and F1 per ingredient (right).]

Table 3: Test performance against retrieval. Left: global ingredient IoU and F1 scores. Right: precision and recall of ingredients in cooking instructions.

  Model           IoU     F1             Model    Rec.    Prec.
  RI2L [45]       18.92   31.83          RIL2R    31.92   28.94
  RI2LR [45]      19.85   33.13          Ours     75.47   77.13
  FFTD (ours)     29.82   45.94
  TFset (ours)    32.11   48.61

Table 4: User studies. Left: IoU & F1 scores for ingredients obtained with retrieval [45], our approach and humans. Right: recipe success rate according to human judgment.

  Method      IoU     F1               Recipe      Success %
  Human       21.36   35.20            Real        80.33
  Retrieved   18.03   30.55            Retrieved   48.81
  Ours        32.52   49.08            Ours        55.47

Following recent literature modeling sets as lists [37], we train a transformer network to predict ingredients given an image by minimizing the negative log-likelihood loss (TFlist). Moreover, we train the same transformer by randomly shuffling the ingredients (thus, removing order from the data). Both models exhibit competitive results when compared to feed forward models, highlighting the importance of modeling dependencies among ingredients. Finally, our proposed set transformer TFset, which models ingredient co-occurrences exploiting the auto-regressive nature of the model yet satisfying order invariance, achieves the best results, emphasizing the importance of modeling dependencies while not penalizing for any given order.

The average number of ingredients per sample in Recipe1M is 7.99 ± 3.21 after pre-processing. We report the cardinality prediction errors as well as the average number of predicted ingredients for each of the tested models in Table 2. TFset is the third best method in terms of cardinality error (after FFIOU and TFlist), while being superior to all methods in terms of F1 and IoU. Further, Figure 5 (left) shows the precision score at different values of K. As observed, the plot follows similar trends as Table 1 (right), with FFTD being among the most competitive models and TFset outperforming all previous baselines for most values of K. Figure 5 (right) shows the F1 per ingredient, where the ingredients on the horizontal axis are sorted by score. Again, we see that models that exploit dependencies consistently improve the per-ingredient F1 scores, strengthening the importance of modeling ingredient co-occurrences.

4.5. Generation vs Retrieval

In this section, we compare our proposed recipe generation system with retrieval baselines, which we use to search recipes in the entire test set for fair comparison.

Ingredient prediction evaluation. We use the retrieval model in [45] as a baseline and compare it with our best ingredient prediction models, namely FFTD and TFset. The retrieval model, which we refer to as RI2LR, learns joint embeddings of images and recipes (title, ingredients and instructions). Therefore, for the ingredient prediction task, we use the image embeddings to retrieve the closest recipe and report metrics for the ingredients of the retrieved recipe. We further consider an alternative retrieval architecture, which learns joint embeddings between images and ingredient lists (ignoring title and instructions). We refer to this model as RI2L. Table 3 (left) reports the obtained results on the Recipe1M test set. The RI2LR model outperforms the RI2L one, which indicates that instructions contain complementary information that is useful when learning effective embeddings. Furthermore, both of our proposed methods outperform the retrieval baselines by a large margin (e.g. TFset outperforms the RI2LR retrieval baseline by 12.26 IoU points and 15.48 F1 score points), which demonstrates the superiority of our models. Finally, Figure 6 presents some qualitative results for image-to-ingredient prediction for our model as well as for the retrieval based system. We use blue to highlight the ingredients that are present in the ground truth annotation and red otherwise.

Recipe generation evaluation. We compare our proposed instruction decoder (which generates instructions given an image and ingredients) with a retrieval variant. For
a fair comparison, we retrain the retrieval system to find the cooking instructions given both image and ingredients. In our evaluation, we consider the ground truth ingredients as reference and compute recall and precision w.r.t. the ingredients that appear in the obtained instructions. Thus, recall computes the percentage of ingredients in the reference that appear in the output instructions, whereas precision measures the percentage of ingredients appearing in the instructions that also appear in the reference. Table 3 (right) displays the comparison between our model and the retrieval system. Results show that ingredients appearing in generated instructions have better recall and precision scores than the ingredients in retrieved instructions.

[Figure 6 image: for several test images, the ingredient lists predicted by our method ("Ours"), retrieved by the baseline ("Retrieved") and annotated in the dataset ("Real").]

Figure 6: Ingredient prediction examples. We compare the ingredients obtained with our method and the retrieval baseline. Ingredients are displayed in blue if they are present in the real sample and red otherwise. Best viewed in color.

4.6. User Studies

In this section, we quantify the quality of predicted ingredients and generated instructions with user studies. In the first study, we compare the performance of our model against human performance on the task of recipe generation (including ingredients and recipe instructions). We randomly select 15 images from the test set, and ask users to select up to 20 distinct ingredients as well as write a recipe that would correspond to the provided image. To reduce the complexity of the task for humans, we reduced the ingredient vocabulary from 1 488 to 323, by increasing the frequency threshold from 10 to 1k. We collected answers from 31 different users, altogether collecting an average of 5.5 answers for each image. For fair comparison, we re-train our best ingredient prediction model on the reduced vocabulary of ingredients. We compute the IoU and F1 ingredient scores obtained by humans, the retrieval baseline and our method. Results are included in Table 4 (left), underlining the complexity of the task. As shown in the table, humans outperform the retrieval baseline (F1 of 35.20% vs 30.55%, respectively). Furthermore, our method outperforms both the human baseline and the retrieval based system, obtaining an F1 of 49.08%. Qualitative comparisons between generated and human-written recipes (including recipes from average and expert users) are provided in the supplementary material.

The second study aims at quantifying the quality of the generated recipes (ingredients and instructions) with respect to (1) the real recipes in the dataset, and (2) the ones obtained with the retrieval baseline [45]. With this purpose, we randomly select 150 recipes with their associated images from the test set and, for each image, we collect the corresponding real recipe, the top-1 retrieved recipe and our generated recipe. We present the users with 15 image-recipe pairs (randomly chosen among the real, retrieved and generated ones), asking them to indicate whether the recipe matches the image. In the study, we collected answers from 105 different users, resulting in an average of 10 responses for each image. Table 4 (right) presents the results of this study, reporting the success rate of each recipe type. As can be observed, the success rate of generated recipes is higher than the success rate of retrieved recipes, stressing the benefits of our approach w.r.t. retrieval.

5. Conclusion

In this paper, we introduced an image-to-recipe generation system, which takes a food image and produces a recipe consisting of a title, ingredients and a sequence of cooking instructions. We first predicted sets of ingredients from food images, showing that modeling dependencies matters. Then, we explored instruction generation conditioned on images and inferred ingredients, highlighting the importance of reasoning about both modalities at the same time. Finally, user study results confirm the difficulty of the task, and demonstrate the superiority of our system against state-of-the-art image-to-recipe retrieval approaches.

6. Acknowledgements

We are grateful to Nicolas Ballas, Lluis Castrejon, Zizhao Zhang and Pascal Vincent for their fruitful comments and suggestions. We also want to express our gratitude to Joelle Pineau for her unwavering support of this project. Finally, we wish to thank everyone who anonymously participated in the user studies.

This work has been partially developed in the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund.
References [19] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana
Romero, and Yoshua Bengio. The one hundred layers
[1] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. tiramisu: Fully convolutional densenets for semantic seg-
Food-101–mining discriminative components with random mentation. In CVPR-W, 2017.
forests. In ECCV, 2014.
[20] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic align-
[2] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, ments for generating image descriptions. In CVPR, 2015.
Nicolas Thome, and Matthieu Cord. Cross-modal retrieval in
[21] Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. Globally
the cooking context: Learning semantic text-image embed-
coherent text generation with neural checklist models. In
dings. In SIGIR, 2018.
EMNLP, 2016.
[3] Jing-Jing Chen and Chong-Wah Ngo. Deep-based ingredient [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for
recognition for cooking recipe retrieval. In ACM Multimedia. stochastic optimization. CoRR, abs/1412.6980, 2014.
ACM, 2016.
[23] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li
[4] Jing-Jing Chen, Chong-Wah Ngo, and Tat-Seng Chua. Fei-Fei. A hierarchical approach for generating descriptive
Cross-modal recipe retrieval with rich food attributes. In image paragraphs. In CVPR, 2017.
ACM Multimedia. ACM, 2017.
[24] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun
[5] Mei-Yun Chen, Yung-Hsiang Yang, Chia-Ju Ho, Shih-Han Yang. Cleannet: Transfer learning for scalable image classi-
Wang, Shane-Ming Liu, Eugene Chang, Che-Hua Yeh, and fier training with label noise. In CVPR, 2018.
Ming Ouhyoung. Automatic chinese food identification and
[25] Zijia Lin, Guiguang Ding, Mingqing Hu, and Jianmin Wang.
quantity estimation. In SIGGRAPH Asia 2012 Technical
Multi-label classification via feature-aware implicit label
Briefs, 2012.
space encoding. In ICML, 2014.
[6] Xin Chen, Hua Zhou, and Liang Diao. Chinesefoodnet:
[26] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod
A large-scale image dataset for chinese food recognition.
Vokkarane, and Yunsheng Ma. Deepfood: Deep learning-
CoRR, abs/1705.02743, 2017.
based food image recognition for computer-aided dietary as-
[7] Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. To- sessment. In ICOST, 2016.
wards diverse and natural image descriptions via a condi-
[27] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
tional gan. ICCV, 2017.
convolutional networks for semantic segmentation. In
[8] Krzysztof Dembczyński, Weiwei Cheng, and Eyke CVPR, 2015.
Hüllermeier. Bayes optimal multilabel classification via
[28] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher.
probabilistic classifier chains. In ICML, 2010.
Knowing when to look: Adaptive attention via a visual sen-
[9] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical tinel for image captioning. In CVPR, 2017.
neural story generation. In ACL, 2018. [29] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan,
[10] Claude Fischler. Food, self and identity. Information (Inter- Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe,
national Social Science Council), 1988. and Laurens van der Maaten. Exploring the limits of weakly
[11] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, supervised pretraining. CoRR, abs/1805.00932, 2018.
and Yann N. Dauphin. Convolutional sequence to sequence [30] Niki Martinel, Gian Luca Foresti, and Christian Micheloni.
learning. CoRR, abs/1705.03122, 2017. Wide-slice residual networks for food recognition. In WACV,
[12] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander To- 2018.
shev, and Sergey Ioffe. Deep convolutional ranking for mul- [31] Sara McGuire. Food Photo Frenzy: Inside the Instagram
tilabel image annotation. CoRR, abs/1312.4894, 2013. Craze and Travel Trend. https://www.business.
[13] Kristian J. Hammond. CHEF: A model of case-based plan- com/articles/food-photo-frenzy-inside-
ning. In AAAI, 1986. the-instagram-craze-and-travel-trend/,
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2017. [Online; accessed Nov-2018].
Delving deep into rectifiers: Surpassing human-level perfor- [32] Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korat-
mance on imagenet classification. In CVPR, 2015. tikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama,
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. George Papandreou, Jonathan Huang, and Kevin P Murphy.
Deep residual learning for image recognition. In CVPR, Im2calories: towards an automated mobile vision food diary.
2016. In ICCV, 2015.
[16] Luis Herranz, Shuqiang Jiang, and Ruihan Xu. Modeling [33] Simon Mezgec and Barbara Koroui Seljak. Nutrinet: A deep
restaurant context for food recognition. IEEE Transactions learning food and drink image recognition system for dietary
on Multimedia, 2017. assessment. Nutrients, 9(7), 2017.
[17] Shota Horiguchi, Sosuke Amano, Makoto Ogawa, and Kiy- [34] Weiqing Min, Bing-Kun Bao, Shuhuan Mei, Yaohui Zhu,
oharu Aizawa. Personalized classifier for food image recog- Yong Rui, and Shuqiang Jiang. You are what you eat: Ex-
nition. IEEE Transactions on Multimedia, 2018. ploring rich recipe information for cross-region food analy-
[18] Qiuyuan Huang, Zhe Gan, Asli Çelikyilmaz, Dapeng Oliver sis. IEEE Transactions on Multimedia, 2018.
Wu, Jianfeng Wang, and Xiaodong He. Hierarchically struc- [35] Shinsuke Mori, Hirokuni Maeta, Tetsuro Sasada, Koichiro
tured reinforcement learning for topically coherent visual Yoshino, Atsushi Hashimoto, Takuya Funatomi, and Yoko
story generation. CoRR, abs/1805.08191, 2018. Yamakata. Flowgraph2text: Automatic sentence skeleton
compilation for procedural text generation. In INLG. The [53] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang
Association for Computer Linguistics, 2014. Huang, and Wei Xu. CNN-RNN: A unified framework for
[36] Shinsuke Mori, Hirokuni Maeta, Yoko Yamakata, and Tet- multi-label image classification. In CVPR, 2016.
suro Sasada. Flow graph corpus from recipe texts. In LREC. [54] Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord,
European Language Resources Association (ELRA), 2014. and Frederic Precioso. Recipe recognition with large multi-
[37] Jinseok Nam, Eneldo Loza Mencı́a, Hyunwoo J Kim, and modal food dataset. In ICMEW, 2015.
Johannes Fürnkranz. Maximizing subset accuracy with re- [55] Zhe Wang, Wei He, Hua Wu, Haiyang Wu, Wei Li, Haifeng
current neural networks in multi-label classification. In Wang, and Enhong Chen. Chinese poetry generation with
NeurIPS. 2017. planning based neural network. CoRR, abs/1610.09889,
[38] Chong-Wah Ngo. Deep learning for food recognition. In 2016.
SoICT, 2017. [56] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian
[39] Ferda Ofli, Yusuf Aytar, Ingmar Weber, Raggi al Hammouri, Dong, Yao Zhao, and Shuicheng Yan. CNN: single-label
and Antonio Torralba. Is saki# delicious?: The food percep- to multi-label. CoRR, abs/1406.5726, 2014.
tion gap on instagram and its relation to health. In ICWWW, [57] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie:
2017. Scaling up to large vocabulary image annotation. In IJCAI,
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory 2011.
Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al- [58] Ronald J. Williams and David Zipser. A learning algorithm
ban Desmaison, Luca Antiga, and Adam Lerer. Automatic for continually running fully recurrent neural networks. Neu-
differentiation in pytorch. In NeurIPS-W, 2017. ral Comput., 1(2), June 1989.
[41] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali [59] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron
Farhadi. You only look once: Unified, real-time object de- Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
tection. In CVPR, 2016. Bengio. Show, attend and tell: Neural image caption gen-
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. eration with visual attention. In ICML, 2015.
Faster R-CNN: towards real-time object detection with re- [60] Ruihan Xu, Luis Herranz, Shuqiang Jiang, Shuang Wang,
gion proposal networks. In NeurIPS, 2015. Xinhang Song, and Ramesh Jain. Geolocalized modeling for
[43] S Hamid Rezatofighi, Anton Milan, Ehsan Abbasnejad, An- dish recognition. IEEE Transactions on Multimedia, 2015.
thony Dick, Ian Reid, et al. Deepsetnet: Predicting sets with [61] Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-
deep neural networks. In ICCV, 2017. Chiang Frank Wang. Learning deep latent spaces for multi-
[44] S Hamid Rezatofighi, Anton Milan, Qinfeng Shi, Anthony label classification. CoRR, abs/1707.00418, 2017.
Dick, and Ian Reid. Joint learning of set cardinality and state
distribution. AAAI, 2018.
[45] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin,
Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning
cross-modal embeddings for cooking recipes and food im-
ages. CVPR, 2017.
[46] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu
Soricut. Conceptual captions: A cleaned, hypernymed, im-
age alt-text dataset for automatic image captioning. In ACL,
2018.
[47] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014.
[48] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to
sequence learning with neural networks. In NeurIPS, 2014.
[49] Grigorios Tsoumakas and Ioannis Vlahavas. Random k-
labelsets: An ensemble method for multilabel classification.
In Joost N. Kok, Jacek Koronacki, Raomon Lopez de Man-
taras, Stan Matwin, Dunja Mladenič, and Andrzej Skowron,
editors, ECML, 2007.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017.
[51] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order
matters: Sequence to sequence for sets. In ICLR, 2016.
[52] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-
mitru Erhan. Show and tell: A neural image caption gen-
erator. In CVPR, 2015.
7. Supplementary Material

This supplementary material intends to provide further details as well as qualitative results. In Section 7.1, we describe additional implementation and training details. Section 7.2 presents an analysis of our ingredient vocabulary before and after its pre-processing. Examples of generated recipes, displayed together with real ones from the dataset, are presented in Section 7.3. Section 7.4 includes screenshots of the two forms that were used to collect data for the user studies. Section 7.5 includes examples of human-written recipes compared to real and generated ones. Finally, in Section 7.6, we provide examples of generated recipes for out-of-dataset pictures taken by the authors.

7.1. Training Details

Ingredient Prediction. The feed-forward models FFBCE, FFTD and FFIOU were trained with a mini-batch size of 300, whereas FFDC was trained with a mini-batch size of 256. All of them were trained with a learning rate of 0.001. The learning rate for pre-trained ResNet layers was scaled for each model as follows: 0.01× for FFBCE, FFIOU and FFDC, and 0.1× for FFTD. The transformer list-based models TFlist were trained with mini-batch size 300 and learning rate 0.001, scaling the learning rate of ResNet layers by a factor of 0.1×. Similarly, the set transformer TFset was trained with a mini-batch size of 300 and a learning rate of 0.0001, scaling the learning rate of pre-trained ResNet layers by a factor of 1.0×. The optimization of TFset minimizes a cost function composed of three terms, namely the ingredient prediction loss Lingr, the end-of-sequence loss Leos and the cardinality penalty Lcard. We set the contribution of each term with weights 1000.0, 1.0 and 1.0, respectively. We use a label smoothing factor of 0.1 for all models trained with the BCE loss (FFBCE, FFDC, TFset), which we found experimentally useful.

Instruction Generation. We use a batch size of 256 and a learning rate of 0.001. Parameters of the image encoder module are taken from the ingredient prediction model and frozen during training for instruction generation.

All models are trained with the Adam optimizer (β1 = 0.9, β2 = 0.99 and ε = 1e-8), exponential decay of 0.99 after each epoch, dropout probability 0.3 and a maximum number of 400 epochs (if the early stopping criterion is not met). During training we randomly flip (p = 0.5), rotate (±10 degrees) and translate images (±10% of the image size on each axis) for augmentation.

7.2. Ingredient Analysis

We provide visualizations of the ingredient vocabulary used to train our models. Figure 7 displays each unique ingredient in the vocabulary before and after our pre-processing stage. The size of each ingredient word indicates its frequency in the dataset (e.g. butter and salt appear in many recipes). After filtering and clustering ingredients, the distribution changes slightly (e.g. pepper becomes the most frequent ingredient, and popular ingredients such as olive oil or vegetable oil are clustered into oil). Additionally, we illustrate the high ingredient overlap in the dataset with an example of the different types of cheese that appear as different ingredients before pre-processing.

7.3. Generated Recipes

Figure 8 shows additional examples of generated recipes obtained with our method. We also provide the real recipe for completeness. Although sometimes far from the real recipe, our system is able to generate plausible and structured recipes for the input images. Common mistakes include failures in ingredient recognition (e.g. stuffed tomatoes are confused with stuffed peppers in Figure 8b), inconsistencies between ingredients and instructions (e.g. cucumber is predicted as an ingredient but unused in Figure 8d, and meat is mentioned in the title and instructions but is not predicted as an ingredient in Figure 8e), and repetitions in ingredient enumeration (e.g. Stir in tomato sauce, tomato paste, tomato paste, ... in Figure 8c).

7.4. User Study Forms

We provide screenshots of the two forms used to collect data for the user studies. Figure 9 shows the interface used by users to select image ingredients (each ingredient was selected using a drop-down menu) and write recipes (as free-form text). Figure 10 shows the form we used to assess whether a recipe matched the provided image according to human judgment.

7.5. Human-written Recipes

In Figure 11 we show examples of recipes written by humans, which were collected using the form in Figure 9. We also display the real and generated recipes for completeness. Recipes written by humans tend to be shorter, with an average of 5.29 instructions of 9.03 words each. In contrast, our model generates recipes that contain an average of 9.21 instructions of 9 words each, which closely matches the real distribution (9.08 sentences of length 12.79).

7.6. Dine Out: A case study

We test the capabilities of our model to generalize to out-of-dataset food images. Figure 12 shows recipes obtained for food images taken by the authors at their homes or in restaurants during the weeks prior to the submission.
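To make the optimisation and augmentation settings of Section 7.1 concrete, here is a hypothetical PyTorch-style sketch. Only the numeric values (Adam betas and epsilon, decay, crop size, flip/rotation/translation ranges) come from the text above; the module used as a stand-in for the model and all names are our own assumptions.

```python
import torch
from torchvision import transforms

# Augmentation pipeline: resize shortest side to 256, random 224x224 crop,
# horizontal flip with p=0.5, +/-10 degree rotation and +/-10% translation.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
    transforms.ToTensor(),
])

model = torch.nn.Linear(512, 1488)  # placeholder for the ingredient decoder

optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.99), eps=1e-8)
# Learning rate decays exponentially by a factor of 0.99 after every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```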
Figure 7: Ingredient word clouds. The size of each ingredient word is proportional to the frequency of appearance in the
dataset. We display word clouds for ingredients before (7a) and after (7b) our pre-processing step. In 7c we show the different
types of cheese that are clustered together after pre-processing.

(a) Before pre-processing.

(b) After pre-processing.

(c) Types of cheese before pre-processing.


Figure 8: Recipe examples. We show both real and generated recipes for different test images.

(a)
(b)
(c)
(d)
(e)
(f)
Figure 9: User Study 1. Interface for writing recipes and selecting ingredients.
Figure 10: User Study 2. Recipe quality assessment form.
Figure 11: Written Recipes. Real, generated and human written recipes collected with our user study.

Ingredients Ingredients Ingredients


hamburger, beans, tomato, soup, onion, macaroni, chili, pineapple, greens, stick, tomato, vinaigrette beef, onion, garlic, salt, pepper, parsley, basil, sugar, rice,
sugar, ketchup, broth, butter, pepper broth
Real

Instructions Instructions
-Combine all ingredients and cook over medium heat until -Preheat greased grill to medium-high heat. Instructions
potatoes are just tender. -Grill fruit 3 min. -Brown the beef with spices and onion when browned add
-Turn down heat to low and simmer at least 1.5 hours. -On each side or until lightly browned on both sides. paste
-Cut fruit into 2-inch sticks; place in large salad bowl. -Bring broth to boil add tomato sauce and spices let boil 3
-Add greens, jicama and tomatoes; toss lightly. minutes add rice bring back to boil cover and let sit off heat
-Drizzle with dressing just before serving. 7 minutes
-Mix rice and sauce enjoy

Ingredients Ingredients Ingredients


chuck, tomato, water, soup greens, cheese, tomato, ranch_dressing, chicken, rice, onion, pepper, oil, salt, cheese, tomato, clove, broth
Generated

italian_dressing Instructions
-In a large skillet, heat oil over medium heat.
-Add onion and garlic and cook until onion is translucent.
Instructions Instructions -Add rice and cook until rice is lightly browned.
-Brown ground chuck in a large dutch oven. -Toss greens with chicken, tomatoes, cheese and dressing in -Add chicken broth, tomatoes, salt, pepper and cayenne
-Drain any grease and add the crushed tomatoes, onion soup large bowl. pepper.
mix and water. -Add dressing; mix lightly. -Bring to a boil.
-Simmer 5 minutes. -Reduce heat and cover.
-Add velveeta shells and cheese, mix well and serve with hot -Simmer for 20 minutes.
rolls. -Remove from heat and let stand covered for 5 minutes.
-Sprinkle with cheese and serve.

Ingredients Ingredients Ingredients


tomato, chili, salt, beef, cheese, oil, pepper tomato, spinach, sprouts, arugula, vinaigrette, zucchini rice, meat, parsley, salt, oil, cheese
Written (1)

Instructions Instructions Instructions


-Wash the tomato -Chop the tomatoes. -Cook the minced meat in a big pan. Put salt and oil into it
-Cut the tomato. -Mix all the ingredients together. -When it is cooked, add the rice and water so it can be boiled
-Put some oil in a pan -Add the vinaigrette on top. -Boil it for 20 minutes
-Add the beef - When all the water has evaporated, add some cheese and
-Add the tomato parsley to enhance the flavor.
-Add the chili and the cheese
-Add some salt and pepper

Ingredients Ingredients Ingredients


beans, seasoning, tomato, beef, spices, hot_sauce, cheese, arugula, tomato, kale, zucchini, onion, oil rice, parsley, broth, meat, salt, water
Written (2)

onion

Instructions Instructions
-Cook the beef in a frying pan till brown. Instructions -Put the sauteed meat in a casserole with hot oil.
-Dice the onions. -Wash and rinse the kale -After it is quite cooked, add the rice.
-Drain the beans. -Wash, cut and slice the onion, the yellow zucchini and the -Add the water.
-Take a slow-cooker and put the beef, tomatoes, beans and cherry tomatoes -Add the broth.
onions. -Add all the ingredients in a bowl and mix -Add salt
-Add the seasoning and spices. -Add some olive oil and your favorite vinagrette -Let it cook until the rice it is done.
-Turn the slow-cooker on medium and cook for around 6 -Add the parsley
hrs.
Figure 12: Dine Out Study. Generated recipes for food images taken by authors.

(a)
(b)

(c)
(d)
ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Jiankang Deng * Jia Guo ∗ Niannan Xue


Imperial College London InsightFace Imperial College London
j.deng16@imperial.ac.uk guojia@gmail.com n.xue15@imperial.ac.uk

Stefanos Zafeiriou
Imperial College London
arXiv:1801.07698v3 [cs.CV] 9 Feb 2019

s.zafeiriou@imperial.ac.uk

Abstract
One of the main challenges in feature learning using
Deep Convolutional Neural Networks (DCNNs) for large-
scale face recognition is the design of appropriate loss func-
tions that enhance discriminative power. Centre loss pe-
nalises the distance between the deep features and their cor-
responding class centres in the Euclidean space to achieve
intra-class compactness. SphereFace assumes that the lin-
ear transformation matrix in the last fully connected layer
can be used as a representation of the class centres in an
angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks, including a new large-scale image database with trillion-level pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pre-trained models and training logs¹, which will help reproduce the results in this paper.

∗ denotes equal contribution to this work.
¹ https://github.com/deepinsight/insightface

Figure 1. Based on the centre [18] and feature [37] normalisation, all identities are distributed on a hypersphere. To enhance intra-class compactness and inter-class discrepancy, we consider four kinds of Geodesic Distance (GDis) constraint. (A) Margin-Loss: insert a geodesic distance margin between the sample and centres. (B) Intra-Loss: decrease the geodesic distance between the sample and the corresponding centre. (C) Inter-Loss: increase the geodesic distance between different centres. (D) Triplet-Loss: insert a geodesic distance margin between triplet samples. In this paper, we propose an Additive Angular Margin Loss (ArcFace), which corresponds exactly to the geodesic distance (Arc) margin penalty in (A), to enhance the discriminative power of the face recognition model. Extensive experimental results show that the strategy of (A) is most effective.

1. Introduction

Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition [32, 33, 29, 24]. DCNNs map the face image, typically after a pose normalisation step [45], into a feature that has small intra-class and large inter-class distance.

There are two main lines of research to train DCNNs for face recognition. Those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier [33, 24, 6], and those that learn directly an embedding, such as the triplet loss [29]. Based on the large-scale training data and the elaborate DCNN architectures, both the softmax-loss-based methods [6] and the triplet-loss-based methods [29] can obtain excellent performance on face recognition. However,
both the softmax loss and the triplet loss have some drawbacks. For the softmax loss: (1) the size of the linear transformation matrix W ∈ R^{d×n} increases linearly with the number of identities n; (2) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets, especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training.

Several variants [38, 9, 46, 18, 37, 35, 7, 34, 27] have been proposed to enhance the discriminative power of the softmax loss. Wen et al. [38] pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness while the inter-class dispersion is guaranteed by the joint penalisation of the softmax loss. Nevertheless, updating the actual centres during training is extremely difficult as the number of face classes available for training has recently dramatically increased.

By observing that the weights from the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities with the centres of each face class, the works in [18, 19] proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model. Even though SphereFace [18] introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in an unstable training of the network. In order to stabilise training, they proposed a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. CosFace [37, 35] directly adds a cosine margin penalty to the target logit, which obtains better performance compared to SphereFace but admits much easier implementation and relieves the need for joint supervision from the softmax loss.

In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. The advantages of the proposed ArcFace can be summarised as follows:

Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and the arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.

Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.

Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in computational-graph-based deep learning frameworks, e.g. MxNet [8], Pytorch [25] and Tensorflow [4]. Furthermore, contrary to the works in [18, 19], ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training dataset.

Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training and the model parallel strategy can easily support many more identities.

2. Proposed Approach

2.1. ArcFace

The most widely used classification loss function, softmax loss, is presented as follows:

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{T}x_i+b_j}},    (1)

where x_i ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class. The embedding feature dimension d is set to 512 in this paper following [38, 46, 18, 37]. W_j ∈ R^d denotes the j-th column of the weight W ∈ R^{d×n} and b_j ∈ R^n is the bias term. The batch size and the class number are N and n, respectively. Traditional softmax loss is widely used in deep face recognition [24, 6]. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations [30, 48] and age gaps [22, 49]) and large-scale test scenarios (e.g. million [15, 39, 21] or trillion pairs [2]).

For simplicity, we fix the bias b_j = 0 as in [18]. Then, we transform the logit [26] as W_j^T x_i = ||W_j|| ||x_i|| cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i. Following [18, 37, 36], we fix the individual weight ||W_j|| = 1 by l2 normalisation. Following [28, 37, 36, 35], we also fix the embedding feature ||x_i|| by l2 normalisation and re-scale it to s. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight.
Figure 2. Training a DCNN for face recognition supervised by the ArcFace loss. Based on the feature xi and weight W normalisation, we
get the cos θj (logit) for each class as WjT xi . We calculate the arccosθyi and get the angle between the feature xi and the ground truth
weight Wyi . In fact, Wj provides a kind of centre for each class. Then, we add an angular margin penalty m on the target (ground truth)
angle θyi . After that, we calculate cos(θyi + m) and multiply all logits by the feature scale s. The logits then go through the softmax
function and contribute to the cross entropy loss.
Algorithm 1 The pseudo-code of ArcFace on MxNet
Input: feature scale s, margin parameter m in Eq. 3, class number n, ground-truth ID gt.
1. x = mx.symbol.L2Normalization(x, mode='instance')
2. W = mx.symbol.L2Normalization(W, mode='instance')
3. fc7 = mx.sym.FullyConnected(data=x, weight=W, no_bias=True, num_hidden=n)
4. original_target_logit = mx.sym.pick(fc7, gt, axis=1)
5. theta = mx.sym.arccos(original_target_logit)
6. marginal_target_logit = mx.sym.cos(theta + m)
7. one_hot = mx.sym.one_hot(gt, depth=n, on_value=1.0, off_value=0.0)
8. fc7 = fc7 + mx.sym.broadcast_mul(one_hot, mx.sym.expand_dims(marginal_target_logit - original_target_logit, 1))
9. fc7 = fc7 * s
Output: class-wise affinity score fc7.
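For readers more familiar with PyTorch, the following is a rough analogue of Algorithm 1. It is our own sketch, not the authors' released implementation; the function name and the default values of s and m are assumptions (the paper's recommended margin setting is discussed in the experiments).

```python
import torch
import torch.nn.functional as F

def arcface_logits(x, W, gt, s=64.0, m=0.5):
    """x: (N, d) features, W: (d, n) class weights, gt: (N,) labels."""
    x = F.normalize(x, dim=1)                     # ||x_i|| = 1, re-scaled by s below
    W = F.normalize(W, dim=0)                     # ||W_j|| = 1
    cos_theta = (x @ W).clamp(-1.0, 1.0)          # (N, n) cosine logits
    target = cos_theta.gather(1, gt.unsqueeze(1)).squeeze(1)
    theta = torch.acos(target)                    # angle to the ground-truth centre
    target_with_margin = torch.cos(theta + m)     # additive angular margin
    one_hot = F.one_hot(gt, num_classes=W.shape[1]).to(cos_theta.dtype)
    logits = cos_theta + one_hot * (target_with_margin - target).unsqueeze(1)
    return s * logits                             # feed into a standard cross-entropy
```

The scaled logits are then passed to the usual softmax cross-entropy, e.g. `F.cross_entropy(arcface_logits(x, W, labels), labels)`.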

The learned embedding features are thus distributed on a hypersphere with a radius of s.

L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j=1, j\neq y_i}^{n}e^{s\cos\theta_j}}.    (2)

As the embedding features are distributed around each feature centre on the hypersphere, we add an additive angular margin penalty m between x_i and W_{y_i} to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere, we name our method ArcFace.

L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1, j\neq y_i}^{n}e^{s\cos\theta_j}}.    (3)

We select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As illustrated in Figure 3, the softmax loss provides roughly separable feature embeddings but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.

[Figure 3 image: toy 2-D feature embeddings of 8 identities trained with (a) Softmax and (b) ArcFace.]

Figure 3. Toy examples under the softmax and ArcFace loss on 8 identities with 2D features. Dots indicate samples and lines refer to the centre direction of each identity. Based on the feature normalisation, all face features are pushed to the arc space with a fixed radius. The geodesic distance gap between closest classes becomes evident as the additive angular margin penalty is incorporated.

2.2. Comparison with SphereFace and CosFace

Numerical Similarity. In SphereFace [18, 19], ArcFace, and CosFace [37, 35], three different kinds of margin penalty are proposed, i.e. a multiplicative angular margin m1, an additive angular margin m2, and an additive cosine margin m3, respectively. From the view of numerical analysis, the different margin penalties, no matter whether added in the angle space [18] or the cosine space [37], all enforce the intra-class compactness and inter-class diversity by penalising the target logit [26].
and inter-class diversity by penalising the target logit [26]. SphereFace [18] employs an annealing optimisation strat-
In Figure 4(b), we plot the target logit curves of SphereFace, egy. To avoid divergence at the beginning of training, joint
ArcFace and CosFace under their best margin settings. We supervision from softmax is used in SphereFace to weaken
only show these target logit curves within [20◦ , 100◦ ] be- the multiplicative margin penalty. We implement a new ver-
cause the angles between Wyi and xi start from around 90◦ sion of SphereFace without the integer requirement on the
(random initialisation) and end at around 30◦ during Arc- margin by employing the arc-cosine function instead of us-
Face training as shown in Figure 4(a). Intuitively, there are ing the complex double angle formula. In our implementa-
three factors in the target logit curves that affect the perfor- tion, we find that m = 1.35 can obtain similar performance
mance, i.e. the starting point, the end point and the slope. compared to the original SphereFace without any conver-
gence difficulty.
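To make Eq. 2 and Eq. 3 from Section 2.1 concrete, the short calculation below (our own illustrative numbers, not taken from the paper) plugs a single training sample into both losses with the hyper-parameters s = 64 and m = 0.5 used later in the experiments. It shows how the additive angular margin shrinks the target logit, so that a sample the plain softmax already classifies confidently still receives a large loss until its angle falls well below the margin.

import numpy as np

s, m = 64.0, 0.5                                  # feature scale and angular margin (Section 3.1)
theta_target = np.deg2rad(75.0)                   # assumed angle between x_i and its centre W_{y_i}
theta_others = np.deg2rad([85.0, 90.0, 95.0])     # assumed angles to three other class centres

def cross_entropy(target_logit, other_logits):
    # Softmax cross-entropy of the target class given scaled cosine logits.
    logits = np.concatenate(([target_logit], other_logits))
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

other = s * np.cos(theta_others)
loss_softmax = cross_entropy(s * np.cos(theta_target), other)       # Eq. 2
loss_arcface = cross_entropy(s * np.cos(theta_target + m), other)   # Eq. 3

print(loss_softmax, loss_arcface)   # ~1.7e-5 vs ~20: the margin keeps pulling theta_target down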
Figure 4. Target logit analysis. (a) θj distributions at the start, middle and end of ArcFace training, plotted against the angle between the feature and the target centre. (b) Target logit curves, over the same angle range, for softmax (1.00, 0.00, 0.00), SphereFace (m = 4, λ = 5), SphereFace (1.35, 0.00, 0.00), ArcFace (1.00, 0.50, 0.00), CosFace (1.00, 0.00, 0.35) and the combined margin penalties CM1 (1.00, 0.30, 0.20) and CM2 (0.90, 0.40, 0.15), i.e. cos(m1 θ + m2) − m3.

By combining all of the margin penalties, we implement SphereFace, ArcFace and CosFace in a united framework with m1, m2 and m3 as the hyper-parameters.

L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}}{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}.   (4)

As shown in Figure 4(b), by combining all of the above-mentioned margins (cos(m1 θ + m2) − m3), we can easily obtain some other target logit curves which also achieve high performance.

Geometric Difference. Despite the numerical similarity between ArcFace and previous works, the proposed additive angular margin has a better geometric attribute, as the angular margin has an exact correspondence to the geodesic distance. As illustrated in Figure 5, we compare the decision boundaries under the binary classification case. The proposed ArcFace has a constant linear angular margin throughout the whole interval. By contrast, SphereFace and CosFace only have a nonlinear angular margin.

2.3. Comparison with Other Losses

Other loss functions can be designed based on the angular representation of features and weight vectors. For example, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. As shown in Figure 1, we compare with three other losses in this paper.

Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground truth centre.

L_5 = L_2 + \frac{1}{\pi N}\sum_{i=1}^{N}\theta_{y_i}.   (5)

Inter-Loss targets enhancing the inter-class discrepancy by increasing the angle/arc between different centres.

L_6 = L_2 - \frac{1}{\pi N(n-1)}\sum_{i=1}^{N}\sum_{j=1,\,j\neq y_i}^{n}\arccos(W_{y_i}^{T}W_j).   (6)

The Inter-Loss here is a special case of the Minimum Hyper-spherical Energy (MHE) method [17]. In [17], both hidden layers and output layers are regularised by MHE. In the MHE paper, a special case of loss function was also proposed by combining the SphereFace loss with the MHE loss on the last layer of the network.

Triplet-loss aims at enlarging the angle/arc margin between triplet samples. In FaceNet [29], a Euclidean margin is applied on the normalised features. Here, we employ the triplet-loss with the angular representation of our features as arccos(x_i^{pos} · x_i) + m ≤ arccos(x_i^{neg} · x_i).
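The combined framework of Eq. 4 is easy to probe numerically. The sketch below is our own illustration (the setting tuples follow the Figure 4(b) legend): it evaluates the target logit cos(m1 θ + m2) − m3 for each named configuration over a range of angles, which is essentially how the curves in Figure 4(b) are produced before scaling by s.

import numpy as np

# (m1, m2, m3) hyper-parameters of the combined margin penalty in Eq. 4
settings = {
    "Softmax":    (1.00, 0.00, 0.00),
    "SphereFace": (1.35, 0.00, 0.00),
    "ArcFace":    (1.00, 0.50, 0.00),
    "CosFace":    (1.00, 0.00, 0.35),
    "CM1":        (1.00, 0.30, 0.20),
    "CM2":        (0.90, 0.40, 0.15),
}

def target_logit(theta_deg, m1, m2, m3):
    # Combined margin penalty applied to the target angle (before scaling by s).
    theta = np.deg2rad(theta_deg)
    return np.cos(m1 * theta + m2) - m3

angles = np.arange(20, 101, 10)        # the interval [20 deg, 100 deg] shown in Figure 4(b)
for name, (m1, m2, m3) in settings.items():
    print(f"{name:10s}", np.round(target_logit(angles, m1, m2, m3), 2))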
3. Experiments
3.1. Implementation Details
Datasets. As given in Table 1, we separately employ CA-
SIA [43], VGGFace2 [6], MS1MV2 and DeepGlint-Face
(including MS1M-DeepGlint and Asian-DeepGlint) [2] as
our training data in order to conduct a fair comparison with other methods. Please note that the proposed MS1MV2 is a semi-automatic refined version of the MS-Celeb-1M dataset [10]. To the best of our knowledge, we are the first to employ ethnicity-specific annotators for large-scale face image annotations, as the boundary cases (e.g. hard samples and noisy samples) are very hard to distinguish if the annotator is not familiar with the identity. During training, we explore efficient face verification datasets (e.g. LFW [13], CFP-FP [30], AgeDB-30 [22]) to check the improvement from different settings. Besides the most widely used LFW [13] and YTF [40] datasets, we also report the performance of ArcFace on the recent large-pose and large-age datasets (e.g. CPLFW [48] and CALFW [49]). We also extensively test the proposed ArcFace on large-scale image datasets (e.g. MegaFace [15], IJB-B [39], IJB-C [21] and Trillion-Pairs [2]) and video datasets (iQIYI-VID [20]).

Datasets #Identity #Image/Video
CASIA [43] 10K 0.5M
VGGFace2 [6] 9.1K 3.3M
MS1MV2 85K 5.8M
MS1M-DeepGlint [2] 87K 3.9M
Asian-DeepGlint [2] 94K 2.83M
LFW [13] 5,749 13,233
CFP-FP [30] 500 7,000
AgeDB-30 [22] 568 16,488
CPLFW [48] 5,749 11,652
CALFW [49] 5,749 12,174
YTF [40] 1,595 3,425
MegaFace [15] 530 (P) 1M (G)
IJB-B [39] 1,845 76.8K
IJB-C [21] 3,531 148.8K
Trillion-Pairs [2] 5,749 (P) 1.58M (G)
iQIYI-VID [20] 4,934 172,835
Table 1. Face datasets for training and testing. "(P)" and "(G)" refer to the probe and gallery set, respectively.

Figure 5. Decision margins of different loss functions under the binary classification case. The dashed line represents the decision boundary, and the grey areas are the decision margins.

The minor difference in margin designs can have a "butterfly effect" on the model training. For example, the original SphereFace [18] depends on the annealing optimisation strategy described in Section 2.2 to avoid divergence at the beginning of training.

Experimental Settings. For data preprocessing, we follow the recent papers [18, 37] to generate the normalised face crops (112 × 112) by utilising five facial points. For the embedding network, we employ the widely used CNN architectures ResNet50 and ResNet100 [12, 11]. After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature. In this paper, we use ([training dataset, network structure, loss]) to facilitate understanding of the experimental settings.

We follow [37] to set the feature scale s to 64 and choose the angular margin m of ArcFace at 0.5. All experiments in this paper are implemented with MXNet [8]. We set the batch size to 512 and train models on four NVIDIA Tesla P40 (24GB) GPUs. On CASIA, the learning rate starts from 0.1 and is divided by 10 at 20K and 28K iterations; the training process is finished at 32K iterations. On MS1MV2, we divide the learning rate at 100K and 160K iterations and finish at 180K iterations. We set momentum to 0.9 and weight decay to 5e−4. During testing, we only keep the feature embedding network without the fully connected layer (160MB for ResNet50 and 250MB for ResNet100) and extract the 512-D features (8.9 ms/face for ResNet50 and 15.4 ms/face for ResNet100) for each normalised face. To get the embedding features for templates (e.g. IJB-B and IJB-C) or videos (e.g. YTF and iQIYI-VID), we simply calculate the feature centre of all images from the template or all frames from the video. Note that overlap identities between the training set and the test set are removed for strict evaluation, and we only use a single crop for all testing.

3.2. Ablation Study on Losses

In Table 2, we first explore the angular margin setting for ArcFace on the CASIA dataset with ResNet50. The best margin observed in our experiments was 0.5. Using the proposed combined margin framework in Eq. 4, it is easier to set the margins of SphereFace and CosFace, which we found to have optimal performance when set at 1.35 and 0.35, respectively. Our implementations of both SphereFace and CosFace lead to excellent performance without any difficulty in convergence. The proposed ArcFace achieves the highest verification accuracy on all three test sets. In addition, we performed extensive experiments with the combined margin framework (some of the best performance was observed for CM1 (1, 0.3, 0.2) and CM2 (0.9, 0.4, 0.15)), guided by the target logit curves in Figure 4(b). The combined margin framework led to better performance than individual SphereFace and CosFace, but was upper-bounded by the performance of ArcFace.

Besides the comparison with margin-based methods, we conduct a further comparison between ArcFace and other losses which aim at enforcing intra-class compactness (Eq. 5) and inter-class discrepancy (Eq. 6). As the baseline we have chosen the softmax loss, and we observed a performance drop on CFP-FP and AgeDB-30 after weight and feature normalisation. By combining the softmax with the intra-class loss, the performance improves on CFP-FP and AgeDB-30. However, combining the softmax with the inter-class loss only slightly improves the accuracy. The fact that Triplet-loss outperforms Norm-Softmax indicates the importance of the margin in improving the performance. However, employing the margin penalty within triplet samples is less effective than inserting the margin between samples and centres as in ArcFace. Finally, we incorporate the Intra-loss, Inter-loss and Triplet-loss into ArcFace, but no improvement is observed, which leads us to believe that ArcFace is already enforcing intra-class compactness, inter-class discrepancy and a classification margin.

To get a better understanding of ArcFace's superiority, we give the detailed angle statistics on training data (CASIA) and test data (LFW) under different losses in Table 3. We find that (1) Wj is nearly synchronised with the embedding feature centre for ArcFace (14.29°), but there is an obvious deviation (44.26°) between Wj and the em-
bedding feature centre for Norm-Softmax. Therefore, the angles between the Wj cannot absolutely represent the inter-class discrepancy on the training data. Alternatively, the embedding feature centres calculated by the trained network are more representative. (2) Intra-Loss can effectively compress intra-class variations but also brings in smaller inter-class angles. (3) Inter-Loss can slightly increase inter-class discrepancy on both W (directly) and the embedding network (indirectly), but also raises intra-class angles. (4) ArcFace already has very good intra-class compactness and inter-class discrepancy. (5) Triplet-Loss has similar intra-class compactness but inferior inter-class discrepancy compared to ArcFace. In addition, ArcFace has a more distinct margin than Triplet-Loss on the test set, as illustrated in Figure 6.

Loss Functions LFW CFP-FP AgeDB-30
ArcFace (0.4) 99.53 95.41 94.98
ArcFace (0.45) 99.46 95.47 94.93
ArcFace (0.5) 99.53 95.56 95.15
ArcFace (0.55) 99.41 95.32 95.05
SphereFace [18] 99.42 - -
SphereFace (1.35) 99.11 94.38 91.70
CosFace [37] 99.33 - -
CosFace (0.35) 99.51 95.44 94.56
CM1 (1, 0.3, 0.2) 99.48 95.12 94.38
CM2 (0.9, 0.4, 0.15) 99.50 95.24 94.86
Softmax 99.08 94.39 92.33
Norm-Softmax (NS) 98.56 89.79 88.72
NS+Intra 98.75 93.81 90.92
NS+Inter 98.68 90.67 89.50
NS+Intra+Inter 98.73 94.00 91.41
Triplet (0.35) 98.98 91.90 89.98
ArcFace+Intra 99.45 95.37 94.73
ArcFace+Inter 99.43 95.25 94.55
ArcFace+Intra+Inter 99.43 95.42 95.10
ArcFace+Triplet 99.50 95.51 94.40
Table 2. Verification results (%) of different loss functions ([CASIA, ResNet50, loss*]).

(angles in degrees) NS ArcFace IntraL InterL TripletL
W-EC 44.26 14.29 8.83 46.85 -
W-Inter 69.66 71.61 31.34 75.66 -
Intra1 50.50 38.45 17.50 52.74 41.19
Inter1 59.23 65.83 24.07 62.40 50.23
Intra2 33.97 28.05 12.94 35.38 27.42
Inter2 65.60 66.55 26.28 67.90 55.94
Table 3. The angle statistics under different losses ([CASIA, ResNet50, loss*]). Each column denotes one particular loss. "W-EC" refers to the mean of angles between Wj and the corresponding embedding feature centre. "W-Inter" refers to the mean of minimum angles between the Wj's. "Intra1" and "Intra2" refer to the mean of angles between xi and the embedding feature centre on CASIA and LFW, respectively. "Inter1" and "Inter2" refer to the mean of minimum angles between embedding feature centres on CASIA and LFW, respectively.

Method #Image LFW YTF
DeepID [32] 0.2M 99.47 93.20
Deep Face [33] 4.4M 97.35 91.4
VGG Face [24] 2.6M 98.95 97.30
FaceNet [29] 200M 99.63 95.10
Baidu [16] 1.3M 99.13 -
Center Loss [38] 0.7M 99.28 94.9
Range Loss [46] 5M 99.52 93.70
Marginal Loss [9] 3.8M 99.48 95.98
SphereFace [18] 0.5M 99.42 95.0
SphereFace+ [17] 0.5M 99.47 -
CosFace [37] 5M 99.73 97.6
MS1MV2, R100, ArcFace 5.8M 99.83 98.02
Table 4. Verification performance (%) of different methods on LFW and YTF.

3.3. Evaluation Results

Results on LFW, YTF, CALFW and CPLFW. LFW [13] and YTF [40] are the most widely used benchmarks for unconstrained face verification on images and videos. In this paper, we follow the unrestricted with labelled outside data protocol to report the performance. As reported in Table 4, ArcFace trained on MS1MV2 with ResNet100 beats the baselines (e.g. SphereFace [18] and CosFace [37]) by a significant margin on both LFW and YTF, which shows
that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features, demonstrating the effectiveness of ArcFace.

Figure 6. Angle distributions of all positive pairs and random negative pairs (∼0.5M) from LFW under (a) ArcFace and (b) Triplet-Loss. Red area indicates positive pairs while blue indicates negative pairs. All angles are represented in degrees. ([CASIA, ResNet50, loss*])

Besides the LFW and YTF datasets, we also report the performance of ArcFace on the recently introduced datasets (e.g. CPLFW [48] and CALFW [49]), which show higher pose and age variations with the same identities as LFW. Among all of the open-sourced face recognition models, the ArcFace model is evaluated as the top-ranked face recognition model, as shown in Table 5, outperforming counterparts by an obvious margin. In Figure 7, we illustrate
the angle distributions (predicted by ArcFace model trained
Method LFW CALFW CPLFW Methods Id (%) Ver (%)
HUMAN-Individual 97.27 82.32 81.21 Softmax [18] 54.85 65.92
HUMAN-Fusion 99.85 86.50 85.24 Contrastive Loss[18, 32] 65.21 78.86
Center Loss [38] 98.75 85.48 77.48 Triplet [18, 29] 64.79 78.32
SphereFace [18] 99.27 90.30 81.40 Center Loss[38] 65.49 80.14
VGGFace2 [6] 99.43 90.57 84.00 SphereFace [18] 72.729 85.561
MS1MV2, R100, ArcFace 99.82 95.45 92.08 CosFace [37] 77.11 89.88
AM-Softmax [35] 72.47 84.44
Table 5. Verification performance (%) of open-sourced face recog-
nition models on LFW, CALFW and CPLFW. SphereFace+ [17] 73.03 -
CASIA, R50, ArcFace 77.50 92.34
on MS1MV2 with ResNet100) of both positive and nega- CASIA, R50, ArcFace, R 91.75 93.69
tive pairs on LFW, CFP-FP, AgeDB-30, YTF, CPLFW and FaceNet [29] 70.49 86.47
CALFW. We can clearly find that the intra-variance due CosFace [37] 82.72 96.65
to pose and age gaps significantly increases the angles be- MS1MV2, R100, ArcFace 81.03 96.98
tween positive pairs thus making the best threshold for face MS1MV2, R100, CosFace 80.56 96.56
verification increasing and generating more confusion re- MS1MV2, R100, ArcFace, R 98.35 98.48
gions on the histogram. MS1MV2, R100, CosFace, R 97.91 97.91
250 180 200
Table 6. Face identification and verification evaluation of different
Negative Negative Negative

200
Positive 160

140
Positive

150
Positive
methods on MegaFace Challenge1 using FaceScrub as the probe
120

set. “Id” refers to the rank-1 face identification accuracy with 1M


Pair Numbers

Pair Numbers

Pair Numbers

150
100
100

distractors, and “Ver” refers to the face verification TAR at 10−6


80
100
60
50
40
50

0
20

0 0
FAR. “R” refers to data refinement on both probe set and 1M dis-
tractors. ArcFace obtains state-of-the-art performance under both
0 10 20 30 40 50 60 70 80 90 100 110 120 0 10 20 30 40 50 60 70 80 90 100 110 120 0 10 20 30 40 50 60 70 80 90 100 110 120
Angles Between Positive and Negative Pairs Angles Between Positive and Negative Pairs Angles Between Positive and Negative Pairs

(a) LFW (99.83%) (b) CFP-FP (98.37%) (c) AgeDB (98.15%) small and large protocols.
350

300
Negative
Positive
250
Negative
Positive
180

160
Negative
Positive
affects the performance. Therefore, we manually refined
200
140
250
120 the whole MegaFace dataset and report the correct perfor-
Pair Numbers

Pair Numbers

Pair Numbers

150
200 100

150

100
100
80

60
mance of ArcFace on MegaFace. On the refined MegaFace,
50
50
40

20 ArcFace still clearly outperforms CosFace and achieves the


0 0 0
0 10 20 30 40 50 60 70 80 90 100 110 120
Angles Between Positive and Negative Pairs
0 10 20 30 40 50 60 70 80 90 100 110 120
Angles Between Positive and Negative Pairs
0 10 20 30 40 50 60 70 80 90 100 110 120
Angles Between Positive and Negative Pairs
best performance on both verification and identification.
(d) YTF (98.02%) (e) CPLFW (92.08%) (f) CALFW (95.45%) Under large protocol, ArcFace surpasses FaceNet [29]
Figure 7. Angle distributions of both positive and negative pairs on by a clear margin and obtains comparable results on iden-
LFW, CFP-FP, AgeDB-30, YTF, CPLFW and CALFW. Red area tification and better results on verification compared to
indicates positive pairs while blue indicates negative pairs. All an- CosFace [37]. Since CosFace employs a private training
gles are represented in degree. ([MS1MV2, ResNet100, ArcFace]) data, we retrain CosFace on our MS1MV2 dataset with
ResNet100. Under fair comparison, ArcFace shows supe-
Results on MegaFace. The MegaFace dataset [15] includes riority over CosFace and forms an upper envelope of Cos-
1M images of 690K different individuals as the gallery set Face under both identification and verification scenarios as
and 100K photos of 530 unique individuals from FaceScrub shown in Figure 8.
[23] as the probe set. On MegaFace, there are two testing
scenarios (identification and verification) under two proto- 100 100

cols (large or small training set). The training set is defined 95


99

98

as large if it contains more than 0.5M images. For the fair


True Positive Rate (%)
Identification Rate (%)

97
90
96
comparison, we train ArcFace on CAISA and MS1MV2 85 95

under the small protocol and large protocol, respectively. 80 CASIA, ResNet50, ArcFace, Original
CASIA, ResNet50, ArcFace, Refine
94

93
CASIA, ResNet50, ArcFace, Original
CASIA, ResNet50, ArcFace, Refine
MS1MV2, ResNet100, ArcFace, Original MS1MV2, ResNet100, ArcFace, Original
In Table 6, ArcFace trained on CASIA achieves the best 75 MS1MV2, ResNet100, ArcFace, Refine
MS1MV2, ResNet100, CosFace, Original
92

91
MS1MV2, ResNet100, ArcFace, Refine
MS1MV2, ResNet100, CosFace, Original
MS1MV2, ResNet100, CosFace, Refine MS1MV2, ResNet100, CosFace, Refine

single-model identification and verification performance, 70


100 101 102 103 104 105 106
90
10-6 10-5 10-4 10-3 10-2 10-1 100

not only surpassing the strong baselines (e.g. SphereFace Rank False Positive Rate

[18] and CosFace [37]) but also outperforming other pub- (a) CMC (b) ROC
lished methods [38, 17]. Figure 8. CMC and ROC curves of different models on MegaFace.
As we observed an obvious performance gap between Results are evaluated on both original and refined MegaFace
identification and verification, we performed a thorough dataset.
manual check in the whole MegaFace dataset and found Results on IJB-B and IJB-C. The IJB-B dataset [39]
many face images with wrong labels, which significantly contains 1, 845 subjects with 21.8K still images and 55K
Method IJB-B IJB-C Method Id (@FPR=1e-3) Ver(@FPR=1e-9)
ResNet50 [6] 0.784 0.825 CASIA 26.643 21.452
SENet50 [6] 0.800 0.840 MS1MV2 80.968 78.600
ResNet50+SENet50 [6] 0.800 0.841 DeepGlint-Face 80.331 78.586
MN-v [42] 0.818 0.852 MS1MV2+Asian 84.840 (1st) 80.540
MN-vc [42] 0.831 0.862 CIGIT IRSEC 84.234 (2nd) 81.558 (1st)
ResNet50+DCN(Kpts) [41] 0.850 0.867
Table 8. Identification and verification results (%) on the Trillion-
ResNet50+DCN(Divs) [41] 0.841 0.880 Pairs dataset. ([Dataset*, ResNet100, ArcFace])
SENet50+DCN(Kpts) [41] 0.846 0.874
SENet50+DCN(Divs) [41] 0.849 0.885
VGG2, R50, ArcFace 0.898 0.921 set. Every pair between gallery and probe set is used
MS1MV2, R100, ArcFace 0.942 0.956 for evaluation (0.4 trillion pairs in total). In Table 8,
we compare the performance of ArcFace trained on dif-
Table 7. 1:1 verification TAR (@FAR=1e-4) on the IJB-B and IJB- ferent datasets. The proposed MS1MV2 dataset obvi-
C dataset.
ously boosts the performance compared to CASIA and even
frames from 7, 011 videos. In total, there are 12, 115 slightly outperforms the DeepGlint-Face dataset, which has
templates with 10, 270 genuine matches and 8M impos- a double identity number. When combining all identities
tor matches. The IJB-C dataset [39] is a further extension from MS1MV2 and Asian celebrities from DeepGlint, Arc-
of IJB-B, having 3, 531 subjects with 31.3K still images Face achieves the best identification performance 84.840%
and 117.5K frames from 11, 779 videos. In total, there (@FPR=1e-3) and comparable verification performance
are 23, 124 templates with 19, 557 genuine matches and compared to the most recent submission (CIGIT IRSEC)
15, 639K impostor matches. from the lead-board.
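The IJB results below are reported as the true accept rate (TAR) at a fixed false accept rate (FAR). For readers unfamiliar with the metric, the sketch below is our own illustration (not an official evaluation script) of how TAR@FAR=1e-4 can be computed from cosine similarities of genuine and impostor template pairs; the toy score distributions are assumptions.

import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-4):
    # Threshold chosen so that the given fraction of impostor pairs is (falsely) accepted.
    threshold = np.quantile(impostor_scores, 1.0 - far)
    return float((genuine_scores >= threshold).mean()), float(threshold)

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 20_000)       # stand-in similarities for matched identity pairs
impostor = rng.normal(0.1, 0.1, 1_000_000)   # stand-in similarities for mismatched pairs

tar, thr = tar_at_far(genuine, impostor, far=1e-4)
print(f"TAR@FAR=1e-4: {tar:.4f} at threshold {thr:.3f}")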
On the IJB-B and IJB-C datasets, we employ the VGG2 Results on iQIYI-VID. The iQIYI-VID challenge [20]
dataset as the training data and the ResNet50 as the embed- contains 565,372 video clips (training set 219,677, valida-
ding network to train ArcFace for the fair comparison with tion set 172,860, and test set 172,835) of 4934 identities
the most recent methods [6, 42, 41]. In Table 7, we compare from iQIYI variety shows, films and television dramas. The
the TAR (@FAR=1e-4) of ArcFace with the previous state- length of each video ranges from 1 to 30 seconds. This
of-the-art models [6, 42, 41]. ArcFace can obviously boost dataset supplies multi-modal cues, including face, cloth,
the performance on both IJB-B and IJB-C (about 3 ∼ 5%, voice, gait and subtitles, for character identification. The
which is a significant reduction in the error). Drawing sup- iQIYI-VID dataset employs MAP@100 as the evaluation
port from more training data (MS1MV2) and deeper neu- indicator. MAP (Mean Average Precision) refers to the
ral network (ResNet100), ArcFace can further improve the overall average accuracy rate, which is the mean of the av-
TAR (@FAR=1e-4) to 94.2% and 95.6% on IJB-B and IJB- erage accuracy rate of the corresponding videos of person
C, respectively. In Figure 9, we show the full ROC curves of ID retrieved in the test set for each person ID (as the query)
the proposed ArcFace on IJB-B and IJB-C 2 , and ArcFace in the training set.
achieves impressive performance even at FAR=1e-6 setting As shown in Table 9, ArcFace trained on combined
a new baseline. MS1MV2 and Asian datasets with ResNet100 sets a high
baseline (MAP=(79.80%)). Based on the embedding fea-
1
ROC on IJB-B
1
ROC on IJB-C ture for each training video, we train an additional three-
0.9

0.8
0.9

0.8
layer fully connected network with a classification loss to
0.7 0.7
get the customised feature descriptor on the iQIYI-VID
True Positive Rate

True Positive Rate

0.6 0.6

0.5 0.5 dataset. The MLP learned on the iQIYI-VID training set
0.4 0.4

0.3 0.3 significantly boosts the MAP by 6.60%. Drawing support


0.2 0.2

0.1
MS1MV2, ResNet100, ArcFace
VGG2, ResNet50, ArcFace
0.1
MS1MV2, ResNet100, ArcFace
VGG2, ResNet50, ArcFace
from the model ensemble and context features from the off-
0
10-6 10 -5
10 -4
10 -3

False Positive Rate


10 -2
10 -1
10 0
0
10-6 10-5 10-4 10-3
False Positive Rate
10-2 10-1 100 the-shelf object and scene classifier [1], our final result sur-
passes the runner-up by a clear margin ( 0.99%).
(a) ROC for IJB-B (b) ROC for IJB-C
Figure 9. ROC curves of 1:1 verification protocol on the IJB-B and
IJB-C dataset.
4. Conclusions

Results on Trillion-Pairs. The Trillion-Pairs dataset [2] In this paper, we proposed an Additive Angular Margin
provides 1.58M images from Flickr as the gallery set and Loss function, which can effectively enhance the discrimi-
274K images from 5.7k LFW [13] identities as the probe native power of feature embeddings learned via DCNNs for
face recognition. In the most comprehensive experiments
2 https://github.com/deepinsight/insightface/tree/master/Evaluation/IJB reported in the literature we demonstrate that our method
Method MAP(%) 25

MS1MV2+Asian, R100, ArcFace 79.80


+ MLP 86.40

GPU Memory Consumption (GB)


20

+ Ensemble 88.26
+ Context 88.65 (1st) 15

Other Participant 87.66 (2nd)


10
Table 9. MAP of our method on the iQIYI-VID test set. “MLP”
refers to a three-layer fully connected network trained on the
iQIYI-VID training data. 5
Parallel Acceleration on Feature (x) -- Data Parallel
consistently outperforms the state-of-the-art. Code and de- Parallel Acceleration on Feature (x) and Center (W)
0
tails have been released under the MIT license. 0 0.5 1 1.5 2 2.5 3
Identity Number in the Training Data 6
10

5. Appendix (a) GPU Memory

5.1. Parallel Acceleration 1000

900
Can we apply ArcFace on large-scale identities? Yes,

Training Speed (samples/second)


800
millions of identities are not a problem.
700
The concept of Centre (W ) is indispensable in ArcFace,
600
but the parameter size of Centre (W ) is proportional to the
500
number of classes. When there are millions of identities
400
in the training data, the proposed ArcFace confronts with
300
substantial training difficulties, e.g. excessive GPU mem-
200
ory consumption and massive computational cost, even at a Parallel Acceleration on Feature (x) -- Data Parallel
100
prohibitive level. Parallel Acceleration on Feature (x) and Center (W)
0
In our implementation 3 , we employ a parallel acceler- 0 0.5 1 1.5 2 2.5 3

ation strategy [44] to relieve this problem. We optimise Identity Number in the Training Data 106

our training code to easily and efficiently support million (b) Training Speed
level identities on a single machine by parallel accelera- Figure 10. Parallel acceleration on both feature x and centre W .
tion on both feature x (it known as the general data parallel Setting: ResNet 50, batch size 8*64, feature dimension 512, float
strategy) and centre W (we named it as the centre parallel point 32, GPU 8*P40 (24GB).
strategy). As shown in Figure 10, our parallel acceleration
on both feature x and centre W can significantly decrease
the GPU memory consumption and accelerate the training
speed. Even for one million identities trained on 8*1080ti score sub-matrix (batch size 512 × identity number 1M/8)
(11GB), our implementation (ResNet 50, batch size 8*64, on each GPU. The similarity score matrix goes forward to
feature dimension 512 and float point 32) can still run at calculate the ArcFace loss and the gradient. Here, we con-
800 samples per second. Compared to the approximate ac- duct a simple matrix partition on the centre matrix and the
celeration method proposed in [47], our implementation has similarity score matrix along the identity dimension, and
no performance drop. there is no communication cost on the centre and similarity
score matrix. Both the centre sub-matrix and the similarity
In Figure 11, we illustrate the main calculation steps of
score sub-matrix are only 256MB on each GPU.
the parallel acceleration by simple matrix partition, which
can be easily grasped and reproduced by beginners [3]. (3) Get gradient on centre (dW ). We transpose the fea-
(1) Get feature (x). Face embedding features are aggre- ture matrix on each GPU, and concurrently multiply the
gated into one feature matrix (batch size 8*64 × feature transposed feature matrix by the gradient sub-matrix of the
dimension 512) from 8 GPU cards. The size of the aggre- similarity score.
gated feature matrix is only 1MB, and the communication (4) Get gradient on feature (x). We concurrently multi-
cost is negligible when we transfer the feature matrix. ply the gradient sub-matrix of similarity score by the trans-
(2) Get similarity score matrix (score = xW ). We copy posed centre sub-matrix and sum up the outputs from 8
the feature matrix into each GPU, and concurrently multi- GPU cards to get the gradient on feature x.
ply the feature matrix by the centre sub-matrix (feature di-
mension 512 × identity number 1M/8) to get the similarity Considering the communication cost (MB level), our
implementation of ArcFace can be easily and efficiently
3 https://github.com/deepinsight/insightface/tree/master/recognition trained on millions of identities by clusters.
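The four matrix-partition steps above can be simulated on a single machine with NumPy, which may help make the communication pattern concrete. The sketch below is our own illustration (not the released training code): it splits the centre matrix W across a list of "devices", computes the partial similarity-score blocks independently, and only ever moves the small feature matrix between partitions. The sizes are scaled down from 1M identities so the sketch runs on a laptop.

import numpy as np

batch, dim, n_ids, n_parts = 512, 512, 100_000, 8    # assumed, reduced sizes

x = np.random.randn(batch, dim).astype(np.float32)   # step (1): aggregated feature matrix (~1MB)
W_parts = [np.random.randn(dim, n_ids // n_parts).astype(np.float32)
           for _ in range(n_parts)]                  # centre matrix W split along the identity axis

# Step (2): each partition multiplies the broadcast feature matrix by its own centre sub-matrix,
# so every partition holds a (batch x n_ids/n_parts) block of the similarity score matrix.
score_parts = [x @ W_k for W_k in W_parts]

# ... the ArcFace loss and d(score) would be computed block-wise here; placeholders below ...
dscore_parts = [np.random.randn(*s.shape).astype(np.float32) for s in score_parts]

# Step (3): the gradient on each centre sub-matrix stays local to its partition.
dW_parts = [x.T @ ds for ds in dscore_parts]

# Step (4): the gradient on the feature is the sum of the per-partition contributions.
dx = sum(ds @ W_k.T for ds, W_k in zip(dscore_parts, W_parts))

print(dx.shape, dW_parts[0].shape)   # (512, 512) and (512, 12500)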
nearest neighbour separation [5] is

E[\theta(W_j)] \to n^{-\frac{2}{d-1}}\,\Gamma\!\left(1+\tfrac{1}{d-1}\right)\left(\frac{\Gamma(\frac{d}{2})}{2\sqrt{\pi}\,(d-1)\,\Gamma(\frac{d-1}{2})}\right)^{-\frac{1}{d-1}},   (7)

where d is the space dimension, n is the identity number, and θ(Wj) = min_{1≤i,j≤n,\,i≠j} arccos(Wi, Wj). In Fig-
ure 12, we give E[θ(Wj )] in the 128-d, 256-d and 512-d
space with the class number ranging from 10K to 100M .
(a) x
The high-dimensional space is so large that E[θ(Wj )] de-
creases slowly when the class number increases exponen-
tially.
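Eq. 7 can be evaluated directly in log space to avoid overflowing the gamma function at d = 512. The snippet below is our own check of the formula as reconstructed above (it is not from the paper), so the exact constants should be treated as an assumption; it only illustrates the qualitative claim that the expected nearest-neighbour angle shrinks very slowly as the identity number grows exponentially.

import math

def expected_min_angle(n, d):
    # E[theta(W_j)] from Eq. 7, computed with log-gamma for numerical stability.
    log_c = math.lgamma(d / 2) - (math.log(2 * math.sqrt(math.pi) * (d - 1))
                                  + math.lgamma((d - 1) / 2))
    angle = (n ** (-2.0 / (d - 1))
             * math.gamma(1 + 1.0 / (d - 1))
             * math.exp(-log_c / (d - 1)))
    return math.degrees(angle)

for d in (128, 256, 512):
    print(d, [round(expected_min_angle(n, d), 1) for n in (10**4, 10**6, 10**8)])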
90

85

Minimum Angles Between Individuals


80

75

70
(b) score = xW
65

60

128-d
55 256-d
512-d
50
4 5 6 7 8
10 10 10 10 10
Random Individual Numbers

Figure 12. The high-dimensional space is so large that the mean


of the nearest angles decreases slowly when the class number in-
(c) dW = xT dscore creases exponentially.

References
[1] http://data.mxnet.io/models/. 8
[2] http://trillionpairs.deepglint.com/overview. 2, 4, 5, 8
[3] Stanford cs class cs231n: Convolutional neural networks
for visual recognition. http://cs231n.github.io/
neural-networks-case-study/. 9
[4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.
Tensorflow: Large-scale machine learning on heterogeneous
(d) dx = dscoreW T distributed systems. arXiv:1603.04467, 2016. 2
[5] J. S. Brauchart, A. B. Reznikov, E. B. Saff, I. H. Sloan,
Figure 11. Parallel calculation by simple matrix partition. Setting:
Y. G. Wang, and R. S. Womersley. Random point sets on
ResNet 50, batch size 8*64, feature dimension 512, float point
the spherehole radii, covering, and separation. Experimental
32, identity number 1 Million, GPU 8 * 1080ti (11GB). Com-
Mathematics, 2018. 10
munication cost: 1MB (feature x). Training speed: 800 sam-
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman.
ples/second.
Vggface2: A dataset for recognising faces across pose and
age. In FG, 2018. 1, 2, 4, 5, 7, 8
[7] B. Chen, W. Deng, and J. Du. Noisy softmax: improving
5.2. Feature Space Analysis the generalization ability of dcnn via postponing the early
softmax saturation. In CVPR, 2017. 2
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao,
Is the 512-d hypersphere space large enough to hold B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and effi-
large-scale identities? Theoretically, Yes. cient machine learning library for heterogeneous distributed
systems. arXiv:1512.01274, 2015. 2, 5
We assume that the identity centre Wj ’s follow a realis- [9] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep
tically spherical uniform distribution, the expectation of the face recognition. In CVPR Workshop, 2017. 2, 6
[10] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: [30] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chel-
A dataset and benchmark for large-scale face recognition. In lappa, and D. W. Jacobs. Frontal to profile face verification
ECCV, 2016. 4 in the wild. In WACV, 2016. 2, 5
[11] D. Han, J. Kim, and J. Kim. Deep pyramidal residual net- [31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
works. arXiv:1610.02915, 2016. 5 R. Salakhutdinov. Dropout: a simple way to prevent neural
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning networks from overfitting. JML, 2014. 5
for image recognition. In CVPR, 2016. 5 [32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. representation by joint identification-verification. In NIPS,
Labeled faces in the wild: A database for studying face 2014. 1, 6, 7
recognition in unconstrained environments. Technical report, [33] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface:
2007. 5, 6, 8 Closing the gap to human-level performance in face verifica-
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating tion. In CVPR, 2014. 1, 6
deep network training by reducing internal covariate shift. In [34] W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking fea-
ICML, 2015. 5 ture distribution for loss functions in image classification.
[15] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and arXiv:1803.02988, 2018. 2
E. Brossard. The megaface benchmark: 1 million faces for [35] F. Wang, W. Liu, H. Liu, and J. Cheng. Additive margin
recognition at scale. In CVPR, 2016. 2, 5, 7 softmax for face verification. IEEE Signal Processing Let-
[16] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ters, 2018. 2, 3, 7
ultimate accuracy: Face recognition via deep embedding.
[36] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Norm-
arXiv:1506.07310, 2015. 6
face: l 2 hypersphere embedding for face verification.
[17] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song. arXiv:1704.06369, 2017. 2
Learning towards minimum hyperspherical energy. In NIPS,
[37] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou,
2018. 4, 6, 7
and W. Liu. Cosface: Large margin cosine loss for deep face
[18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song.
recognition. In CVPR, 2018. 1, 2, 3, 5, 6, 7
Sphereface: Deep hypersphere embedding for face recogni-
[38] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative fea-
tion. In CVPR, 2017. 1, 2, 3, 4, 5, 6, 7
ture learning approach for deep face recognition. In ECCV,
[19] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax
2016. 2, 6, 7
loss for convolutional neural networks. In ICML, 2016. 2, 3
[20] Y. Liu, P. Shi, B. Peng, H. Yan, Y. Zhou, B. Han, Y. Zheng, [39] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C.
C. Lin, J. Jiang, and Y. Fan. iqiyi-vid: A large dataset for Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, and
multi-modal person identification. arXiv:1811.07548, 2018. K. Allen. Iarpa janus benchmark-b face dataset. In CVPR
5, 8 Workshop, 2017. 2, 5, 7, 8
[21] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, [40] L. Wolf, T. Hassner, and I. Maoz. Face recognition in un-
C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, and J. Ch- constrained videos with matched background similarity. In
eney. Iarpa janus benchmark–c: Face dataset and protocol. CVPR, 2011. 5, 6
In ICB, 2018. 2, 5 [41] W. Xie, S. Li, and A. Zisserman. Comparator networks. In
[22] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kot- ECCV, 2018. 8
sia, and S. Zafeiriou. Agedb: The first manually collected [42] W. Xie and A. Zisserman. Multicolumn networks for face
in-the-wild age database. In CVPR Workshop, 2017. 2, 5 recognition. In BMVC, 2018. 8
[23] H.-W. Ng and S. Winkler. A data-driven approach to clean- [43] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen-
ing large face datasets. In ICIP, 2014. 7 tation from scratch. arXiv:1411.7923, 2014. 4, 5
[24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face [44] D. Zhang. A distributed training solution for face recogni-
recognition. In BMVC, 2015. 1, 2, 6 tion. DeepGlint, 2018. 9
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. De- [45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detec-
Vito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Au- tion and alignment using multitask cascaded convolutional
tomatic differentiation in pytorch. In NIPS Workshop, 2017. networks. IEEE Signal Processing Letters, 2016. 1
2 [46] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss
[26] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hin- for deep face recognition with long-tail. In ICCV, 2017. 2, 6
ton. Regularizing neural networks by penalizing confident [47] X. Zhang, L. Yang, J. Yan, and D. Lin. Accelerated train-
output distributions. arXiv:1701.06548, 2017. 2, 3 ing for massive classification via dynamic class selection. In
[27] X. Qi and L. Zhang. Face recognition via centralized coor- AAAI, 2018. 9
dinate learning. arXiv:1801.05678, 2018. 2 [48] T. Zheng and W. Deng. Cross-pose lfw: A database for
[28] R. Ranjan, C. D. Castillo, and R. Chellappa. L2- studying cross-pose face recognition in unconstrained envi-
constrained softmax loss for discriminative face verification. ronments. Technical Report, 2018. 2, 5, 6
arXiv:1703.09507, 2017. 2 [49] T. Zheng, W. Deng, and J. Hu. Cross-age lfw: A database
[29] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni- for studying cross-age face recognition in unconstrained en-
fied embedding for face recognition and clustering. In CVPR, vironments. arXiv:1708.08197, 2017. 2, 5, 6
2015. 1, 4, 6, 7
Fast Online Object Tracking and Segmentation: A Unifying Approach

Qiang Wang∗ Li Zhang∗ Luca Bertinetto∗


CASIA University of Oxford FiveAI
qiang.wang@nlpr.ia.ac.cn lz@robots.ox.ac.uk luca@robots.ox.ac.uk

Weiming Hu Philip H.S. Torr


CASIA University of Oxford
arXiv:1812.05050v2 [cs.CV] 5 May 2019

wmhu@nlpr.ia.ac.cn philip.torr@eng.ox.ac.uk

Abstract

In this paper we illustrate how to perform both visual object


tracking and semi-supervised video object segmentation,
in real-time, with a single simple approach. Our method,
dubbed SiamMask, improves the offline training procedure
of popular fully-convolutional Siamese approaches for ob-
ject tracking by augmenting their loss with a binary seg-
mentation task. Once trained, SiamMask solely relies on a
single bounding box initialisation and operates online, pro-
ducing class-agnostic object segmentation masks and ro-
tated bounding boxes at 55 frames per second. Despite its
simplicity, versatility and fast speed, our strategy allows us Init Estimates
to establish a new state of the art among real-time track- Figure 1. Our method aims at the intersection between the tasks
ers on VOT-2018, while at the same time demonstrating of visual tracking and video object segmentation to achieve high
competitive performance and the best speed for the semi- practical convenience. Like conventional object trackers, it relies
supervised video object segmentation task on DAVIS-2016 on a simple bounding box initialisation (blue) and operates online.
Differently from state-of-the-art trackers such as ECO [12] (red),
and DAVIS-2017. The project website is http://www.
SiamMask (green) is able to produce binary segmentation masks,
robots.ox.ac.uk/˜qwang/SiamMask.
which can more accurately describe the target object.

reason about the current position of the object [26]. This


1. Introduction is the scenario portrayed by visual object tracking bench-
marks, which represent the target object with a simple axis-
Tracking is a fundamental task in any video applica-
aligned (e.g. [56, 52]) or rotated [26, 27] bounding box.
tion requiring some degree of reasoning about objects of
Such a simple annotation helps to keep the cost of data la-
interest, as it allows to establish object correspondences be-
belling low; what is more, it allows a user to perform a quick
tween frames [34]. It finds use in a wide range of scenarios
and simple initialisation of the target.
such as automatic surveillance, vehicle navigation, video la-
belling, human-computer interaction and activity recogni- Similar to object tracking, the task of semi-supervised
tion. Given the location of an arbitrary target of interest in video object segmentation (VOS) requires estimating the
the first frame of a video, the aim of visual object tracking position of an arbitrary target specified in the first frame
is to estimate its position in all the subsequent frames with of a video. However, in this case the object represen-
the best possible accuracy [48]. tation consists of a binary segmentation mask which ex-
For many applications, it is important that tracking can presses whether or not a pixel belongs to the target [40].
be performed online, while the video is streaming. In other Such a detailed representation is more desirable for appli-
words, the tracker should not make use of future frames to cations that require pixel-level information, like video edit-
ing [38] and rotoscoping [37]. Understandably, produc-
∗ Equal contribution. ing pixel-level estimates requires more computational re-

1
sources than a simple bounding box. As a consequence, briefly outlines some of the most relevant prior work in vi-
VOS methods have been traditionally slow, often requir- sual object tracking and semi-supervised VOS; Section 3
ing several seconds per frame (e.g. [55, 50, 39, 1]). Very describes our proposal; Section 4 evaluates it on four bench-
recently, there has been a surge of interest in faster ap- marks and illustrates several ablative studies; Section 5 con-
proaches [59, 36, 57, 8, 7, 22, 21]. However, even the fastest cludes the paper.
still cannot operate in real-time.
In this paper, we aim at narrowing the gap between ar- 2. Related Work
bitrary object tracking and VOS by proposing SiamMask, In this section, we briefly cover the most representative
a simple multi-task learning approach that can be used techniques for the two problems tackled in this paper.
to address both problems. Our method is motivated by
Visual object tracking. Arguably, until very recently,
the success of fast tracking approaches based on fully-
the most popular paradigm for tracking arbitrary objects
convolutional Siamese networks [3] trained offline on mil-
has been to train online a discriminative classifier exclu-
lions of pairs of video frames (e.g. [28, 63, 15, 60]) and by
sively from the ground-truth information provided in the
the very recent availability of YouTube-VOS [58], a large
first frame of a video (and then update it online).
video dataset with pixel-wise annotations. We aim at retain-
In the past few years, the Correlation Filter, a sim-
ing the offline trainability and online speed of these meth-
ple algorithm that allows to discriminate between the tem-
ods while at the same time significantly refining their rep-
plate of an arbitrary target and its 2D translations, rose
resentation of the target object, which is limited to a simple
to prominence as particularly fast and effective strategy
axis-aligned bounding box.
for tracking-by-detection thanks to the pioneering work of
To achieve this goal, we simultaneously train a Siamese Bolme et al. [4]. Performance of Correlation Filter-based
network on three tasks, each corresponding to a different trackers has then been notably improved with the adop-
strategy to establish correspondances between the target ob- tion of multi-channel formulations [24, 20], spatial con-
ject and candidate regions in the new frames. As in the straints [25, 13, 33, 29] and deep features (e.g. [12, 51]).
fully-convolutional approach of Bertinetto et al. [3], one Recently, a radically different approach has been intro-
task is to learn a measure of similarity between the target duced [3, 19, 49]. Instead of learning a discrimative clas-
object and multiple candidates in a sliding window fashion. sifier online, these methods train offline a similarity func-
The output is a dense response map which only indicates the tion on pairs of video frames. At test time, this function
location of the object, without providing any information can be simply evaluated on a new video, once per frame.
about its spatial extent. To refine this information, we si- In particular, evolutions of the fully-convolutional Siamese
multaneously learn two further tasks: bounding box regres- approach [3] considerably improved tracking performance
sion using a Region Proposal Network [46, 28] and class- by making use of region proposals [28], hard negative min-
agnostic binary segmentation [43]. Notably, binary labels ing [63], ensembling [15] and memory networks [60].
are only required during offline training to compute the seg- Most modern trackers, including all the ones mentioned
mentation loss and not online during segmentation/tracking. above, use a rectangular bounding box both to initialise the
In our proposed architecture, each task is represented by target and to estimate its position in the subsequent frames.
a different branch departing from a shared CNN and con- Despite its convenience, a simple rectangle often fails to
tributes towards a final loss, which sums the three outputs properly represent an object, as it is evident in the examples
together. of Figure 1. This motivated us to propose a tracker able to
Once trained, SiamMask solely relies on a single bound- produce binary segmentation masks while still only relying
ing box initialisation, operates online without updates and on a bounding box initialisation.
produces object segmentation masks and rotated bound- Interestingly, in the past it was not uncommon for track-
ing boxes at 55 frames per second. Despite its simplicity ers to produce a coarse binary mask of the target object
and fast speed, SiamMask establishes a new state-of-the-art (e.g. [11, 42]). However, to the best of our knowledge, the
on VOT-2018 for the problem of real-time object tracking. only recent tracker that, like ours, is able to operate on-
Moreover, the same method is also very competitive against line and produce a binary mask starting from a bounding
recent semi-supervised VOS approaches on DAVIS-2016 box initialisation is the superpixel-based approach of Yeo et
and DAVIS-2017, while being the fastest by a large mar- al. [61]. However, at 4 frames per seconds (fps), its fastest
gin. This result is achieved with a simple bounding box variant is significantly slower than our proposal. Further-
initialisation (as opposed to a mask) and without adopting more, when using CNN features, its speed is affected by a
costly techniques often used by VOS approaches such as 60-fold decrease, plummeting below 0.1 fps. Finally, it has
fine-tuning [35, 39, 1, 53], data augmentation [23, 30] and not demonstrated to be competitive on modern tracking or
optical flow [50, 1, 39, 30, 8]. VOS benchmarks. Similar to us, the methods of Perazzi et
The rest of this paper is organised as follows. Section 2 al. [39] and Ci et al. [10] can also start from a rectangle and

2
output per-frame masks. However, they require fine-tuning 3. Methodology
at test time, which makes them slow.
To allow online operability and fast speed, we adopt
the fully-convolutional Siamese framework [3]. Moreover,
Semi-supervised video object segmentation. Bench-
to illustrate that our approach is agnostic to the specific
marks for arbitrary object tracking (e.g. [48, 26, 56]) as-
fully-convolutional method used as a starting point (e.g. [3,
sume that trackers receive input frames in a sequential fash-
28, 63, 60, 16]), we consider the popular SiamFC [3] and
ion. This aspect is generally referred to with the attributes
SiamRPN [28] as two representative examples. We first in-
online or causal [26]. Moreover, methods are often focused
troduce them in Section 3.1 and then describe our approach
on achieving a speed that exceeds the ones of typical video
in Section 3.2.
framerates [27]. Conversely, semi-supervised VOS algo-
rithms have been traditionally more concerned with an ac- 3.1. Fully-convolutional Siamese networks
curate representation of the object of interest [38, 40].
SiamFC. Bertinetto et al. [3] propose to use, as a fun-
In order to exploit consistency between video frames,
damental building block of a tracking system, an offline-
several methods propagate the supervisory segmentation
trained fully-convolutional Siamese network that compares
mask of the first frame to the temporally adjacent ones via
an exemplar image z against a (larger) search image x to
graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In
obtain a dense response map. z and x are, respectively, a
particular, Bao et al. [1] recently proposed a very accurate
w×h crop centered on the target object and a larger crop
method that makes use of a spatio-temporal MRF in which
centered on the last estimated position of the target. The
temporal dependencies are modelled by optical flow, while
two inputs are processed by the same CNN fθ , yielding two
spatial dependencies are expressed by a CNN.
feature maps that are cross-correlated:
Another popular strategy is to process video frames in-
dependently (e.g. [35, 39, 53]), similarly to what happens gθ (z, x) = fθ (z) ? fθ (x). (1)
in most tracking approaches. For example, in OSVOS-S
In this paper, we refer to each spatial element of the re-
Maninis et al. [35] do not make use of any temporal in-
sponse map (left-hand side of Eq. 1) as response of a can-
formation. They rely on a fully-convolutional network pre-
didate window (RoW). For example, gθn (z, x), encodes a
trained for classification and then, at test time, they fine-
similarity between the examplar z and n-th candidate win-
tune it using the ground-truth mask provided in the first
dow in x. For SiamFC, the goal is for the maximum value of
frame. MaskTrack [39] instead is trained from scratch on
the response map to correspond to the target location in the
individual images, but it does exploit some form of tempo-
search area x. Instead, in order to allow each RoW to en-
rality at test time by using the latest mask prediction and
code richer information about the target object, we replace
optical flow as additional input to the network.
the simple cross-correlation of Eq. 1 with depth-wise cross-
Aiming towards the highest possible accuracy, at test correlation [2] and produce a multi-channel response map.
time VOS methods often feature computationally intensive SiamFC is trained offline on millions of video frames with
techniques such as fine-tuning [35, 39, 1, 53], data augmen- the logistic loss [3, Section 2.2], which we refer to as Lsim .
tation [23, 30] and optical flow [50, 1, 39, 30, 8]. Therefore, SiamRPN. Li et al. [28] considerably improve the perfor-
these approaches are generally characterised by low fram- mance of SiamFC by relying on a region proposal network
erates and the inability to operate online. For example, it (RPN) [46, 14], which allows to estimate the target location
is not uncommon for methods to require minutes [39, 9] or with a bounding box of variable aspect ratio. In particular,
even hours [50, 1] for videos that are just a few seconds in SiamRPN each RoW encodes a set of k anchor box pro-
long, like the ones of DAVIS. posals and corresponding object/background scores. There-
Recently, there has been an increasing interest in the fore, SiamRPN outputs box predictions in parallel with
VOS community towards faster methods [36, 57, 8, 7, 22, classification scores. The two output branches are trained
21]. To the best of our knowledge, the fastest approaches using the smooth L1 and the cross-entropy losses [28, Sec-
with a performance competitive with the state of the art tion 3.2]. In the following, we refer to them as Lbox and
are the ones of Yang et al. [59] and Wug et al. [57]. The Lscore respectively.
former uses a meta-network “modulator” to quickly adapt
3.2. SiamMask
the parameters of a segmentation network during test time,
while the latter does not use any fine-tuning and adopts an Unlike existing tracking methods that rely on low-
encoder-decoder Siamese architecture trained in multiple fidelity object representations, we argue the importance of
stages. Both these methods run below 10 frames per sec- producing per-frame binary segmentation masks. To this
ond, while we are more than six times faster and only rely aim we show that, besides similarity scores and bound-
on a bounding box initialisation. ing box coordinates, it is possible for the RoW of a fully-

3
[Figure 2 diagram: (a) three-branch variant architecture, (b) two-branch variant head. The 127×127×3 exemplar and 255×255×3 search crops are mapped by the shared backbone fθ to 15×15×256 and 31×31×256 features, depth-wise cross-correlated into a 17×17×256 map of RoWs (each 1×1×256), and decoded into per-location (63×63) masks, 4k box outputs and 2k score outputs.]
Figure 2. Schematic illustration of SiamMask variants: (a) three-branch architecture (full), (b) two-branch architecture (head). ⋆d denotes depth-wise cross correlation. For simplicity, the upsampling layer and mask refinement module are omitted here and detailed in Appendix A.
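As a concrete illustration of the depth-wise cross-correlation ⋆d used above, the snippet below (our own sketch, not the authors' code) correlates exemplar and search feature maps channel by channel with PyTorch's grouped convolution, producing the 17×17 multi-channel response map whose spatial elements are the RoWs.

import torch
import torch.nn.functional as F

def depthwise_xcorr(z_feat, x_feat):
    # z_feat: (C, 15, 15) exemplar features; x_feat: (C, 31, 31) search features.
    c = z_feat.size(0)
    kernel = z_feat.unsqueeze(1)     # (C, 1, 15, 15): each channel acts as its own filter
    search = x_feat.unsqueeze(0)     # (1, C, 31, 31)
    return F.conv2d(search, kernel, groups=c).squeeze(0)   # (C, 17, 17)

z = torch.randn(256, 15, 15)   # f_theta(z)
x = torch.randn(256, 31, 31)   # f_theta(x)
response = depthwise_xcorr(z, x)
print(response.shape)          # torch.Size([256, 17, 17]); each 1x1x256 column is a RoW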

convolutional Siamese network to also encode the informa- case this representation corresponds to one of the (17×17)
tion necessary to produce a pixel-wise binary mask. This RoWs produced by the depth-wise cross-correlation be-
can be achieved by extending existing Siamese trackers with tween fθ (z) and fθ (x). Importantly, the network hφ of
an extra branch and loss. the segmentation task is composed of two 1×1 convolu-
We predict w×h binary masks (one for each RoW) using tional layers, one with 256 and the other with 632 chan-
a simple two-layers neural network hφ with learnable pa- nels (Figure 2). This allows every pixel classifier to utilise
rameters φ. Let mn denote the predicted mask correspond- information contained in the entire RoW and thus to have
ing to the n-th RoW, a complete view of its corresponding candidate window in
x, which is critical to disambiguate between instances that
mn = hφ (gθn (z, x)). (2) look like the target (e.g. last row of Figure 4), often referred
to as distractors. With the aim of producing a more accurate
From Eq. 2 we can see that the mask prediction is a function object mask, we follow the strategy of [44], which merges
of both the image to segment x and the target object in z. low and high resolution features using multiple refinement
In this way, z can be used as a reference to guide the seg- modules made of upsampling layers and skip connections
mentation process: given a different reference image, the (see Appendix A).
network will produce a different segmentation mask for x.
Two variants. For our experiments, we augment the ar-
Loss function. During training, each RoW is labelled with chitectures of SiamFC [3] and SiamRPN [28] with our seg-
a ground-truth binary label yn ∈ {±1} and also associated mentation branch and the loss Lmask , obtaining what we
with a pixel-wise ground-truth mask cn of size w×h. Let call the two-branch and three-branch variants of SiamMask.
cij
n ∈ {±1} denote the label corresponding to pixel (i, j) of These respectively optimise the multi-task losses L2B and
the object mask in the n-th candidate RoW. The loss func- L3B , defined as:
tion Lmask (Eq. 3) for the mask prediction task is a binary
logistic regression loss over all RoWs: L2B = λ1 · Lmask + λ2 · Lsim , (4)
L3B = λ1 · Lmask + λ2 · Lscore + λ3 · Lbox . (5)
X 1 + yn X ij ij
Lmask (θ, φ) = ( log(1 + e−cn mn )). (3) We refer the reader to [3, Section 2.2] for Lsim and to [28,
n
2wh ij Section 3.2] for Lbox and Lscore . For L3B , a RoW is con-
sidered positive (yn = 1) if one of its anchor boxes has
Thus, the classification layer of hφ consists of w×h classi- IOU with the ground-truth box of at least 0.6 and negative
fiers, each indicating whether a given pixel belongs to the (yn = −1) otherwise. For L2B , we adopt the same strat-
object in the candidate window or not. Note that Lmask is egy of [3] to define positive and negative samples. We did
considered only for positive RoWs (i.e. with yn = 1). not search over the hyperparameters of Eq. 4 and Eq. 5 and
Mask representation. In contrast to semantic segmen- simply set λ1 = 32 like in [43] and λ2 = λ3 = 1. The task-
tation methods in the style of FCN [32] and Mask R- specific branches for the box and score outputs are consti-
CNN [17], which maintain explicit spatial information tuted by two 1×1 convolutional layers. Figure 2 illustrates
throughout the network, our approach follows the spirit the two variants of SiamMask.
of [43, 44] and generates masks starting from a flat- Box generation. Note that, while VOS benchmarks re-
tened representation of the object. In particular, in our quire binary masks, typical tracking benchmarks such as

4
reference to crop the next frame search region. Instead, in
the three-branch variant, we find more effective to exploit
the highest-scoring output of the box branch as reference.

4. Experiments
Figure 3. In order to generate a bounding box from a binary mask
(in yellow), we experiment with three different methods. Min-
In this section, we evaluate our approach on two related
max: the axis-aligned rectangle containing the object (red); MBR: tasks: visual object tracking (on VOT-2016 and VOT-2018)
the minimum bounding rectangle (green); Opt: the rectangle ob- and semi-supervised video object segmentation (on DAVIS-
tained via the optimisation strategy proposed in VOT-2016 [26] 2016 and DAVIS-2017). We refer to our two-branch and
(blue). three-branch variants with SiamMask-2B and SiamMask
respectively.
VOT [26, 27] require a bounding box as final representation
of the target object. We consider three different strategies 4.1. Evaluation for visual object tracking
to generate a bounding box from a binary mask (Figure 3):
(1) axis-aligned bounding rectangle (Min-max), (2) rotated Datasets and settings. We adopt two widely used bench-
minimum bounding rectangle (MBR) and (3) the optimisa- marks for the evaluation of the object tracking task: VOT-
tion strategy used for the automatic bounding box gener- 2016 [26] and VOT-2018 [27], both annotated with rotated
ation proposed in VOT-2016 [26] (Opt). We empirically bounding boxes. We use VOT-2016 to understand how dif-
evaluate these alternatives in Section 4 (Table 1). ferent types of representation affect the performance. For
this first experiment, we use mean intersection over union
3.3. Implementation details (IOU) and Average Precision (AP)@{0.5, 0.7} IOU. We
then compare against the state-of-the-art on VOT-2018, us-
Network architecture. For both our variants, we use a ing the official VOT toolkit and the Expected Average Over-
ResNet-50 [18] until the final convolutional layer of the 4-th lap (EAO), a measure that considers both accuracy and ro-
stage as our backbone fθ . In order to obtain a high spatial bustness of a tracker [27].
resolution in deeper layers, we reduce the output stride to 8
How much does the object representation matter?
by using convolutions with stride 1. Moreover, we increase
Existing tracking methods typically predict axis-aligned
the receptive field by using dilated convolutions [6]. In our
bounding boxes with a fixed [3, 20, 13, 33] or variable [28,
model, we add to the shared backbone fθ an unshared adjust
19, 63] aspect ratio. We are interested in understanding to
layer (1×1 conv with 256 outputs). For simplicity, we omit
which extent producing a per-frame binary mask can im-
it in Eq. 1. We describe the network architectures in more
prove tracking. In order to focus on representation accuracy,
detail in Appendix A.
for this experiment only we ignore the temporal aspect and
Training. Like SiamFC [3], we use examplar and search sample video frames at random. The approaches described
image patches of 127×127 and 255×255 pixels respec- in the following paragraph are tested on randomly cropped
tively. During training, we randomly jitter examplar and search patches (with random shifts within ±16 pixels and
search patches. Specifically, we consider random transla- scale deformations up to 21±0.25 ) from the sequences of
tions (up to ±8 pixels) and rescaling (of 2±1/8 and 2±1/4 VOT-2016.
for examplar and search respectively). In Table 1, we compare our three-branch variant using
The network backbone is pre-trained on the the Min-max, MBR and Opt approaches (described at the
ImageNet-1k classification task. We use SGD with a end of Section 3.2 and in Figure 3). For reference, we also
first warmup phase in which the learning rate increases report results for SiamFC and SiamRPN as representative
linearly from 10−3 to 5×10−3 for the first 5 epochs of the fixed and variable aspect-ratio approaches, together
and then descreases logarithmically until 5×10−4 for 15 with three oracles that have access to per-frame ground-
more epochs. We train all our models using COCO [31], truth information and serve as upper bounds for the dif-
ImageNet-VID [47] and YouTube-VOS [58]. ferent representation strategies. (1) The fixed aspect-ratio
Inference. During tracking, SiamMask is simply evalu- oracle uses the per-frame ground-truth area and center loca-
ated once per frame, without any adaptation. In both our tion, but fixes the aspect reatio to the one of the first frame
variants, we select the output mask using the location attain- and produces an axis-aligned bounding box. (2) The Min-
ing the maximum score in the classification branch. Then, max oracle uses the minimal enclosing rectangle of the ro-
after having applied a per-pixel sigmoid, we binarise the tated ground-truth bounding box to produce an axis-aligned
output of the mask branch at the threshold of 0.5. In the bounding box. (3) Finally, the MBR oracle uses the rotated
two-branch variant, for each video frame after the first one, minimum bounding rectangle of the ground-truth. Note that
we fit the output mask with the Min-max box and use it as (1), (2) and (3) can be considered, respectively, the per-

5
mIOU (%) mAP@0.5 IOU mAP@0.7 IOU ric, showing a significant advantage with respect to the Cor-
Fixed a.r. Oracle 73.43 90.15 62.52 relation Filter-based trackers CSRDCF [33], STRCF [29].
Min-max Oracle 77.70 88.84 65.16
This is not surprising, as SiamMask relies on a richer object
MBR Oracle 84.07 97.77 80.68
SiamFC [3] 50.48 56.42 9.28
representation, as outlined in Table 1. Interestingly, sim-
SiamRPN [63] 60.02 76.20 32.47 ilarly to us, He et al. (SA Siam R) [15] are motivated to
SiamMask-Min-max 65.05 82.99 43.09 achieve a more accurate target representation by consider-
SiamMask-MBR 67.15 85.42 50.86 ing multiple rotated and rescaled bounding boxes. However,
SiamMask-Opt 71.68 90.77 60.47 their representation is still constrained to a fixed aspect-ratio
Table 1. Performance for different bounding box representation
box.
strategies on VOT-2016. Table 3 gives further results of SiamMask with dif-
ferent box generation strategies on VOT-2018 and -2016.
SiamMask-box means the box branch of SiamMask is
formance upper bounds for the representation strategies of adopted for inference despite the mask branch has been
SiamFC, SiamRPN and SiamMask. trained. We can observe clear improvements on all evalua-
Table 1 shows that our method achieves the best mIOU, tion metrics by using the mask branch for box generation.
no matter the box generation strategy used (Figure 3). Al-
beit SiamMask-Opt offers the highest IOU and mAP, it re- 4.2. Evaluation for semi-supervised VOS
quires significant computational resources due to its slow
Our model, once trained, can also be used for the task
optimisation procedure [54]. SiamMask-MBR achieves a
of VOS to achieve competitive performance without requir-
mAP@0.5 IOU of 85.4, with a respective improvement of
ing any adaptation at test time. Importantly, differently to
+29 and +9.2 points w.r.t. the two fully-convolutional
typical VOS approaches, ours can operate online, runs in
baselines. Interestingly, the gap significantly widens when
real-time and only requires a simple bounding box initiali-
considering mAP at the higher accuracy regime of 0.7 IOU:
sation.
+41.6 and +18.4 respectively. Notably, our accuracy re-
sults are not far from the fixed aspect-ratio oracle. More- Datasets and settings. We report the performance of
over, comparing the upper bound performance represented SiamMask on DAVIS-2016 [40], DAVIS-2017 [45] and
by the oracles, it is possible to notice how, by simply chang- YouTube-VOS [58] benchmarks. For both DAVIS datasets,
ing the bounding box representation, there is a great room we use the official performance measures: the Jaccard index
for improvement (e.g. +10.6% mIOU improvement be- (J ) to express region similarity and the F-measure (F) to
tween the fixed aspect-ratio and the MBR oracles). express contour accuracy. For each measure C ∈ {J , F},
Overall, this study shows how the MBR strategy to obtain three statistics are considered: mean CM , recall CO , and
a rotated bounding box from a binary mask of the object decay CD , which informs us about the gain/loss of per-
offers a significant advantage over popular strategies that formance over time [40]. Following Xu et al. [58], for
simply report axis-aligned bounding boxes. YouTube-VOS we report the mean Jaccard index and F-
measure for both seen (JS , FS ) and unseen categories (JU ,
Results on VOT-2018 and VOT-2016. In Table 2 we
FU ). O is the average of these four measures.
compare the two variants of SiamMask with MBR strategy
and SiamMask–Opt against five recently published state- To initialise SiamMask, we extract the axis-aligned
of-the-art trackers on the VOT-2018 benchmark. Unless bounding box from the mask provided in the first frame
stated otherwise, SiamMask refers to our three-branch vari- (Min-max strategy, see Figure 3). Similarly to most VOS
ant with MBR strategy. Both variants achieve outstanding methods, in case of multiple objects in the same video
performance and run in real-time. In particular, our three- (DAVIS-2017) we simply perform multiple inferences.
branch variant significantly outperforms the very recent Results on DAVIS and YouTube-VOS. In the semi-
and top performing DaSiamRPN [63], achieving a EAO of supervised setting, VOS methods are initialised with a
0.380 while running at 55 frames per second. Even with- binary mask [38] and many of them require computa-
out box regression branch, our simpler two-branch vari- tionally intensive techniques at test time such as fine-
ant (SiamMask-2B) achieves a high EAO of 0.334, which tuning [35, 39, 1, 53], data augmentation [23, 30], infer-
is in par with SA Siam R [15] and superior to any other ence on MRF/CRF [55, 50, 36, 1] and optical flow [50, 1,
real-time method in the published literature. Finally, in 39, 30, 8]. As a consequence, it is not uncommon for VOS
SiamMask–Opt, the strategy proposed in [54] to find the op- techniques to require several minutes to process a short se-
timal rotated rectangle from a binary mask brings the best quence. Clearly, these strategies make the online applicabil-
overall performance (and a particularly high accuracy), but ity (which is our focus) impossible. For this reason, in our
comes at a significant computational cost. comparison we mainly concentrate on fast state-of-the-art
Our model is particularly strong under the accuracy met- approaches.

6
SiamMask-Opt SiamMask SiamMask-2B DaSiamRPN [63] SiamRPN [28] SA Siam R [15] CSRDCF [33] STRCF [29]
EAO ↑ 0.387 0.380 0.334 0.326 0.244 0.337 0.263 0.345
Accuracy ↑ 0.642 0.609 0.575 0.569 0.490 0.566 0.466 0.523
Robustness ↓ 0.295 0.276 0.304 0.337 0.460 0.258 0.318 0.215
Speed (fps) ↑ 5 55 60 160 200 32.4 48.9 2.9
Table 2. Comparison with the state-of-the-art under the EAO, Accuracy, and Robustness metrics on VOT-2018.

VOT-2018 VOT-2016 AN RN EAO ↑ JM↑ FM↑ Speed


EAO ↑ A ↑ R↓ EAO ↑ A ↑ R↓ Speed SiamFC 4 0.188 - - 86
SiamMask-box 0.363 0.584 0.300 0.412 0.623 0.233 76 SiamFC 4 0.251 - - 40
SiamMask 0.380 0.609 0.276 0.433 0.639 0.214 55 SiamRPN 4 0.243 - - 200
SiamMask-Opt 0.387 0.642 0.295 0.442 0.670 0.233 5 SiamRPN 4 0.359 - - 76
SiamMask-2B w/o R 4 0.326 62.3 55.6 43
Table 3. Results on VOT-2016 and VOT-2018. SiamMask w/o R 4 0.375 68.6 57.8 58
SiamMask-2B-score 4 0.265 - - 40
FT M JM↑ JO↑ JD↓ FM↑ FO↑ FD↓ Speed
SiamMask-box 4 0.363 - - 76
OnAVOS [53] 4 4 86.1 96.1 5.2 84.9 89.7 5.8 0.08 SiamMask-2B 4 0.334 67.4 63.5 60
MSK [39] 4 4 79.7 93.1 8.9 75.4 87.1 9.0 0.1 SiamMask 4 0.380 71.7 67.8 55
MSKb [39] 4 8 69.6 - - - - - 0.1
Table 7. Ablation studies on VOT-2018 and DAVIS-2016.
SFL [9] 4 4 76.1 90.6 12.1 76.0 85.5 10.4 0.1
FAVOS [8] 8 4 82.4 96.5 4.5 79.5 89.4 5.5 0.8
RGMP [57] 8 4 81.5 91.7 10.9 82.0 90.8 10.1 8 SiamMask achieves a very low decay [40] for both region
PML [7] 8 4 75.5 89.6 8.5 79.3 93.4 7.8 3.6 similarity (JD ,) and contour accuracy (FD ). This suggests
OSMN [59] 8 4 74.0 87.6 9.0 72.9 84.0 10.6 8.0 that our method is robust over time and thus it is indicated
PLM [62] 8 4 70.2 86.3 11.2 62.5 73.2 14.7 6.7 for particularly long sequences.
VPN [22] 8 4 70.2 82.3 12.4 65.5 69.0 14.4 1.6 Qualitative results of SiamMask for both VOT and
SiamMask 8 8 71.7 86.8 3.0 67.8 79.8 2.1 55 DAVIS sequences are shown in Figure 4, 9 and 10. Despite
Table 4. Results on DAVIS 2016 (validation set). FT and M respec- the high speed, SiamMask produces accurate segmentation
tively denote if the method requires fine-tuning and whether it is masks even in presence of distractors.
initialised with a mask (4) or a bounding box (8).
4.3. Further analysis
FT M JM↑ JO↑ JD↓ FM↑ FO↑ FD↓ Speed In this section, we illustrate ablation studies, failure
OnAVOS [53] 4 4 61.6 67.4 27.9 69.1 75.4 26.6 0.1 cases and timings of our methods.
OSVOS [5] 4 4 56.6 63.8 26.1 63.9 73.8 27.0 0.1 Network architecture. In Table 7, AN and RN denote
FAVOS [8] 8 4 54.6 61.1 14.1 61.8 72.3 18.0 0.8 whether we use AlexNet or ResNet-50 as the shared back-
OSMN [59] 8 4 52.5 60.9 21.5 57.1 66.1 24.3 8.0 bone fθ (Figure 2), while with “w/o R” we mean that the
SiamMask 8 8 54.3 62.8 19.3 58.5 67.5 20.9 55
method does not use the refinement strategy of Pinheiro et
Table 5. Results on DAVIS 2017 (validation set). al. [44]. From the results of Table 7, it is possible to make
several observations. (1) The first set of rows shows that,
FT M JS↑ JU ↑ FS↑ FU ↑ O↑ Speed by simply updating the architecture of fθ , it is possible to
OnAVOS [53] 4 4 60.1 46.6 62.7 51.4 55.2 0.1
achieve an important performance improvement. However,
OSVOS [5] 4 4 59.8 54.2 60.5 60.7 58.8 0.1
this comes at the cost of speed, especially for SiamRPN. (2)
OSMN [59] 8 4 60.0 40.6 60.1 44.0 51.2 8.0 SiamMask-2B and SiamMask considerably improve over
SiamMask 8 8 60.2 45.1 58.2 47.7 52.8 55 their baselines (with same fθ ) SiamFC and SiamRPN. (3)
Table 6. Results on YouTube-VOS (validation set).
Interestingly, the refinement approach of Pinheiro et al. [44]
is very important for the contour accuracy FM , but less so
for the other metrics.
Table 4, 5 and 6 show how SiamMask can be considered Multi-task training. We conducted two further experi-
as a strong baseline for online VOS. First, it is almost two ments to disentangle the effect of multi-task training. Re-
orders of magnitude faster than accurate approaches such sults are reported in Table 7. To achieve this, we modified
as OnAVOS [53] or SFL [9]. Second, it is competitive with the two variants of SiamMask during inference so that, re-
recent VOS methods that do not employ fine-tuning, while spectively, they report an axis-aligned bounding box from
being four times more efficient than the fastest ones (i.e. the score branch (SiamMask-2B-score) or the box branch
OSMN [59] and RGMP [57]). Interestingly, we note that (SiamMask-box). Therefore, despite having been trained,

7
Basketball
Nature
Car-Shadow
Dogs-Jump
Pigs

Figure 4. Qualitative results of our method for sequences belonging to both object tracking and video object segmentation benchmarks.
Basketball and Nature are from VOT-2018 [27]; Car-Shadow is from DAVIS-2016 [40]; Dogs-Jump and Pigs are from DAVIS-2017 [45].
Multiple masks are obtained from different inferences (with different initialisations).
that can be unambiguously discriminated from the fore-
ground.

5. Conclusion
In this paper we introduced SiamMask, a simple ap-
proach that enables fully-convolutional Siamese trackers to
Figure 5. Failure cases: motion blur and “non-object” instance.
produce class-agnostic binary segmentation masks of the
target object. We show how it can be applied with success
the mask branch is not used during inference. We can ob- to both tasks of visual object tracking and semi-supervised
serve how both variants obtain a modest but meaningful im- video object segmentation, showing better accuracy than
provement with respect to their counterparts (SiamFC and state-of-the-art trackers and, at the same time, the fastest
SiamRPN): from 0.251 to 0.265 EAO for the two-branch speed among VOS methods. The two variants of SiamMask
and from 0.359 to 0.363 for the three-branch on VOT2018. we proposed are initialised with a simple bounding box, op-
Timing. SiamMask operates online without any adap- erate online, run in real-time and do not require any adapta-
tation to the test sequence. On a single NVIDIA RTX tion to the test sequence. We hope that our work will inspire
2080 GPU, we measured an average speed of 55 and 60 further studies that consider the two problems of visual ob-
frames per second, respectively for the two-branch and ject tracking and video object segmentation together.
three-branch variants. Note that the highest computational Acknowledgements. This work was supported by
burden comes from the feature extractor fθ . the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC
Failure cases. Finally, we discuss two scenarios in which grant Seebibyte EP/M013774/1 and EPSRC/MURI grant
SiamMask fails: motion blur and “non-object” instance EP/N019474/1. We would also like to acknowledge the sup-
(Figure 5). Despite being different in nature, these two port of the Royal Academy of Engineering and FiveAI Ltd.
1
cases arguably arise from the complete lack of similar train- Qiang Wang is partly supported by the NSFC (Grant No.
ing samples in a training sets, which are focused on objects 61751212, 61721004 and U1636218).

8
References [16] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese
network for real-time object tracking. In IEEE Conference
[1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object seg- on Computer Vision and Pattern Recognition, 2018. 3
mentation via inference in a cnn-based higher-order spatio-
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-
temporal mrf. In IEEE Conference on Computer Vision and
cnn. In IEEE International Conference on Computer Vision,
Pattern Recognition, 2018. 2, 3, 6
2017. 4
[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
A. Vedaldi. Learning feed-forward one-shot learners. In Ad-
for image recognition. In IEEE Conference on Computer
vances in Neural Information Processing Systems, 2016. 3
Vision and Pattern Recognition, 2016. 5, 11
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and
[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100
P. H. Torr. Fully-convolutional siamese networks for object
fps with deep regression networks. In European Conference
tracking. In European Conference on Computer Vision work-
on Computer Vision, 2016. 2, 5
shops, 2016. 2, 3, 4, 5, 6
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui.
speed tracking with kernelized correlation filters. IEEE
Visual object tracking using adaptive correlation filters. In
Transactions on Pattern Analysis and Machine Intelligence,
IEEE Conference on Computer Vision and Pattern Recogni-
2015. 2, 5
tion, 2010. 2
[5] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, [21] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. Videomatch:
D. Cremers, and L. Van Gool. One-shot video object seg- Matching based video object segmentation. In European
mentation. In IEEE Conference on Computer Vision and Conference on Computer Vision, 2018. 2, 3
Pattern Recognition, 2017. 7 [22] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and networks. In IEEE Conference on Computer Vision and Pat-
A. L. Yuille. Deeplab: Semantic image segmentation with tern Recognition, 2017. 2, 3, 7
deep convolutional nets, atrous convolution, and fully con- [23] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele.
nected crfs. IEEE Transactions on Pattern Analysis and Ma- Lucid data dreaming for object tracking. In IEEE Con-
chine Intelligence, 2018. 5, 11 ference on Computer Vision and Pattern Recognition work-
[7] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blaz- shops, 2017. 2, 3, 6
ingly fast video object segmentation with pixel-wise metric [24] H. Kiani Galoogahi, T. Sim, and S. Lucey. Multi-channel
learning. In IEEE Conference on Computer Vision and Pat- correlation filters. In IEEE International Conference on
tern Recognition, 2018. 2, 3, 7 Computer Vision, 2013. 2
[8] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. [25] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filters
Fast and accurate online video object segmentation via track- with limited boundaries. In IEEE Conference on Computer
ing parts. In IEEE Conference on Computer Vision and Pat- Vision and Pattern Recognition, 2015. 2
tern Recognition, 2018. 2, 3, 6, 7 [26] M. Kristan, A. Leonardis, J. Matas, M. Felsberg,
[9] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: R. Pflugfelder, L. Čehovin, T. Vojı́r, G. Häger, A. Lukežič,
Joint learning for video object segmentation and optical G. Fernández, et al. The visual object tracking vot2016 chal-
flow. In IEEE International Conference on Computer Vision, lenge results. In European Conference on Computer Vision,
2017. 3, 7 2016. 1, 3, 5
[10] H. Ci, C. Wang, and Y. Wang. Video object segmentation by [27] M. Kristan, A. Leonardis, J. Matas, M. Felsberg,
learning location-sensitive embeddings. In European Con- R. Pfugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic,
ference on Computer Vision, 2018. 2 A. Eldesokey, G. Fernandez, and et al. The sixth visual object
[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking tracking vot-2018 challenge results. In European Conference
of non-rigid objects using mean shift. In IEEE Conference on Computer Vision workshops, 2018. 1, 3, 5, 8, 12
on Computer Vision and Pattern Recognition, 2000. 2 [28] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance
[12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: visual tracking with siamese region proposal network. In
Efficient convolution operators for tracking. In IEEE Con- IEEE Conference on Computer Vision and Pattern Recogni-
ference on Computer Vision and Pattern Recognition, 2017. tion, 2018. 2, 3, 4, 5, 7
1, 2 [29] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learn-
[13] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learn- ing spatial-temporal regularized correlation filters for visual
ing spatially regularized correlation filters for visual track- tracking. In IEEE Conference on Computer Vision and Pat-
ing. In IEEE International Conference on Computer Vision, tern Recognition, 2018. 2, 6, 7
2015. 2, 5 [30] X. Li and C. C. Loy. Video object segmentation with joint
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track re-identification and attention-aware mask propagation. In
and track to detect. In IEEE International Conference on European Conference on Computer Vision, 2018. 2, 3, 6
Computer Vision, 2017. 3 [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
[15] A. He, C. Luo, X. Tian, and W. Zeng. Towards a better match manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
in siamese network based visual object tracker. In European mon objects in context. In European Conference on Com-
Conference on Computer Vision workshops, 2018. 2, 6, 7 puter Vision, 2014. 5

9
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional survey. IEEE Transactions on Pattern Analysis and Machine
networks for semantic segmentation. In IEEE Conference on Intelligence, 2014. 1, 3
Computer Vision and Pattern Recognition, 2015. 4 [49] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance
[33] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. search for tracking. In IEEE Conference on Computer Vision
Discriminative correlation filter with channel and spatial reli- and Pattern Recognition, 2016. 2
ability. In IEEE Conference on Computer Vision and Pattern [50] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmenta-
Recognition, 2017. 2, 5, 6, 7 tion via object flow. In IEEE Conference on Computer Vision
[34] T. Makovski, G. A. Vazquez, and Y. V. Jiang. Visual learning and Pattern Recognition, 2016. 2, 3, 6
in multiple-object tracking. PLoS One, 2008. 1 [51] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and
[35] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal- P. H. S. Torr. End-to-end representation learning for correla-
Taixé, D. Cremers, and L. Van Gool. Video object segmen- tion filter based tracking. In IEEE Conference on Computer
tation without temporal information. In IEEE Transactions Vision and Pattern Recognition, 2017. 2
on Pattern Analysis and Machine Intelligence, 2017. 2, 3, 6 [52] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao,
[36] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bi- A. Vedaldi, A. Smeulders, P. H. S. Torr, and E. Gavves.
lateral space video segmentation. In IEEE Conference on Long-term tracking in the wild: A benchmark. In European
Computer Vision and Pattern Recognition, 2016. 2, 3, 6 Conference on Computer Vision, 2018. 1
[37] O. Miksik, J.-M. Pérez-Rúa, P. H. Torr, and P. Pérez. Roam: [53] P. Voigtlaender and B. Leibe. Online adaptation of convo-
a rich object appearance model with application to rotoscop- lutional neural networks for video object segmentation. In
ing. In IEEE Conference on Computer Vision and Pattern British Machine Vision Conference, 2017. 2, 3, 6, 7
Recognition, 2017. 1 [54] T. Vojir and J. Matas. Pixel-wise object segmentations for
[38] F. Perazzi. Video Object Segmentation. PhD thesis, ETH the vot 2016 dataset. Research Report CTU-CMP-2017–01,
Zurich, 2017. 1, 3, 6 Center for Machine Perception, Czech Technical University,
[39] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and Prague, Czech Republic, 2017. 6
A. Sorkine-Hornung. Learning video object segmentation [55] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. Jots: Joint
from static images. In IEEE Conference on Computer Vision online tracking and segmentation. In IEEE Conference on
and Pattern Recognition, 2017. 2, 3, 6, 7 Computer Vision and Pattern Recognition, 2015. 2, 3, 6
[40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, [56] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A
M. Gross, and A. Sorkine-Hornung. A benchmark dataset benchmark. In IEEE Conference on Computer Vision and
and evaluation methodology for video object segmentation. Pattern Recognition, 2013. 1, 3
In IEEE Conference on Computer Vision and Pattern Recog- [57] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast
nition, 2017. 1, 3, 6, 7, 8, 13 video object segmentation by reference-guided mask propa-
[41] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. gation. In IEEE Conference on Computer Vision and Pattern
Fully connected object proposals for video segmentation. In Recognition, 2018. 2, 3, 7
IEEE International Conference on Computer Vision, 2015. 3 [58] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price,
[42] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-Based S. Cohen, and T. Huang. Youtube-vos: Sequence-to-
Probabilistic Tracking. In European Conference on Com- sequence video object segmentation. In European Confer-
puter Vision, 2002. 2 ence on Computer Vision, 2018. 2, 5, 6
[43] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to seg- [59] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos.
ment object candidates. In Advances in Neural Information Efficient video object segmentation via network modulation.
Processing Systems, 2015. 2, 4 In IEEE Conference on Computer Vision and Pattern Recog-
[44] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learn- nition, June 2018. 2, 3, 7
ing to refine object segments. In European Conference on [60] T. Yang and A. B. Chan. Learning dynamic memory net-
Computer Vision, 2016. 4, 7, 11 works for object tracking. In European Conference on Com-
[45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine- puter Vision, 2018. 2, 3
Hornung, and L. Van Gool. The 2017 davis chal- [61] D. Yeo, J. Son, B. Han, and J. H. Han. Superpixel-based
lenge on video object segmentation. arXiv preprint tracking-by-segmentation using markov chains. In IEEE
arXiv:1704.00675, 2017. 6, 8, 13 Conference on Computer Vision and Pattern Recognition,
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards 2017. 2
real-time object detection with region proposal networks. In [62] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S.
Advances in Neural Information Processing Systems, 2015. Kweon. Pixel-level matching for video object segmentation
2, 3 using convolutional neural networks. In IEEE International
[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, Conference on Computer Vision, 2017. 7
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, [63] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu.
et al. Imagenet large scale visual recognition challenge. In- Distractor-aware siamese networks for visual object track-
ternational Journal of Computer Vision, 2015. 5 ing. In European Conference on Computer Vision, 2018. 2,
[48] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, 3, 5, 6, 7
A. Dehghan, and M. Shah. Visual tracking: An experimental

10
A. Architectural details block score box mask
conv5 1 × 1, 256 1 × 1, 256 1 × 1, 256
Network backbone. Table 8 illustrates the details of our conv6 1 × 1, 2k 1 × 1, 4k 1 × 1, (63 × 63)
backbone architecture (fθ in the main paper). For both vari-
ants, we use a ResNet-50 [18] until the final convolutional Table 9. Architectural details of the three-branch head. k denotes
the number of anchor boxes per RoW.
layer of the 4-th stage. In order to obtain a higher spatial
resolution in deep layers, we reduce the output stride to block score mask
8 by using convolutions with stride 1. Moreover, we in- conv5 1 × 1, 256 1 × 1, 256
crease the receptive field by using dilated convolutions [6]. conv6 1 × 1, 1 1 × 1, (63 × 63)
Specifically, we set the stride to 1 and the dilation rate to
2 in the 3×3 conv layer of conv4 1. Differently to the Table 10. Architectural details of the two-branch head.
original ResNet-50, there is no downsampling in conv4 x.
We also add to the backbone an adjust layer (a 1×1 con- 61*61*8 2x, up + 31*31*16
conv,
3*3, 16
conv,
3*3, 16 31*31*16

volutional layer with 256 output channels). Examplar and


search patches share the network’s parameters from conv1 31*31*16
to conv4 x, while the parameters of the adjust layer are
not shared. The output features of the adjust layer are then conv,
+ Element-wise
3*3, 16 sum
depth-wise cross-correlated, resulting a feature map of size ReLU
17×17. conv,
3*3, 32

Network heads. The network architecture of the branches conv,


Refinement module
3*3, 64
of both variants are shows in Table 9 and 10. The conv5
block in both variants contains a normalisation layer and 31*31*256
conv 2
ReLU non-linearity while conv6 only consists of a 1×1
convolutional layer. Figure 6. Example of a refinement module U3 .
Mask refinement module. With the aim of producing a
more accurate object mask, we follow the strategy of [44],
which merges low and high resolution features using multi-
ple refinement modules made of upsampling layers and skip
connections. Figure 6 gives an example of refinement mod- Target
ule U3 , while Figure 8 illustrates how a mask is generated Search
with stacked refinement modules.

block examplar output size search output size backbone


conv1 61×61 125×125 7×7, 64, stride 2
3×3 max pool, stride 2
 
1×1, 64
conv2 x 31×31 63×63  3×3, 64 ×3
1×1, 256
 
1×1, 128 Figure 7. Score maps from the mask branch at different locations.
conv3 x 15×15 31×31  3×3, 128 ×4
1×1, 512
 
1×1, 256 score branch to select the final output mask (using the loca-
conv4 x 15×15 31×31  3×3, 256 ×6
1×1, 1024 tion attaining the maximum score). The example of Figure 7
adjust 15×15 31×31 1×1, 256 illustrates the multiple output masks produced by the mask
xcorr 17 × 17 depth-wise branch, each corresponding to a different RoW.
Benchmark sequences. More qualitative results for VOT
Table 8. Backbone architecture. Details of each building block are
and DAVIS sequences are shown in Figure 9 and 10.
shown in square brackets.

B. Further qualitative results


Different masks at different locations. Our model gener-
ates a mask for each RoW. During inference, we rely on the

11
conv, deconv,
Sigmoid 127*127*4 𝑈$ 61*61*8 𝑈# 31*31*16 𝑈" 15*15*32 32
3*3, 1

127*127*1
mask

61*61*64 31*31*256 15*15*512 15*15*1024 15*15*256 17*17*256


adjust 1*1*256
conv 1 conv 2 conv 3 con 4
(RoW)

127*127*3

ResNet-50

31*31*256
conv 1 conv 2 conv 3 conv 4 adjust

255*255*3

Figure 8. Schematic illustration of the stacked refinement modules.


butterfly
crabs1
iceskater1
iceskater2
motocross1
singer2
soccer1

Figure 9. Further qualitative results of our method on sequences from the visual object tracking benchmark VOT-2018 [27].

12
dog
drift-straight
goat
Libby
motocross-jump
parkour
Gold-Fish

Figure 10. Further qualitative results of our method on sequences from the semi-supervised video object segmentation benchmarks DAVIS-
2016 [40] and DAVIS-2017 [45]. Multiple masks are obtained from different inferences (with different initialisations).

13
Revealing Scenes by Inverting Structure from Motion Reconstructions

Francesco Pittaluga1 Sanjeev J. Koppal1 Sing Bing Kang2 Sudipta N. Sinha2


1 2
University of Florida Microsoft Research
arXiv:1904.03303v1 [cs.CV] 5 Apr 2019

(a) SfM point cloud (top view) (b) Projected 3D points (c) Synthesized Image (d) Original Image

Figure 1: S YNTHESIZING IMAGERY FROM A S F M POINT CLOUD : From left to right: (a) Top view of a SfM reconstruction
of an indoor scene, (b) 3D points projected into a viewpoint associated with a source image, (c) the image reconstructed using
our technique, and (d) the source image. The reconstructed image is very detailed and closely resembles the source image.

Abstract workplaces, and other sensitive environments. Image-based


localization techniques allow such devices to estimate their
Many 3D vision systems localize cameras within a scene precise pose within the scene [18, 37, 23, 25]. However,
using 3D point clouds. Such point clouds are often obtained these localization methods requires persistent storage of 3D
using structure from motion (SfM), after which the images models of the scene which contains sparse 3D point clouds
are discarded to preserve privacy. In this paper, we show, reconstructed using images and SfM algorithms [38].
for the first time, that such point clouds retain enough in- SfM source images are usually discarded to safeguard
formation to reveal scene appearance and compromise pri- privacy. Surprisingly, however, we show that the SfM point
vacy. We present a privacy attack that reconstructs color cloud and the associated attributes such as color and SIFT
images of the scene from the point cloud. Our method is descriptors contain enough information to reconstruct de-
based on a cascaded U-Net that takes as input, a 2D multi- tailed comprehensible images of the scene (see Fig. 1 and
channel image of the points rendered from a specific view- Fig. 3). This suggests that the persistent point cloud storage
point containing point depth and optionally color and SIFT poses serious privacy risks that have been widely ignored
descriptors and outputs a color image of the scene from so far but will become increasingly relevant as localization
that viewpoint. Unlike previous feature inversion meth- services are adopted by a larger user community.
ods [46, 9], we deal with highly sparse and irregular 2D While privacy issues for wearable devices have been
point distributions and inputs where many point attributes studied [16], to the best of our knowledge, a systematic
are missing, namely keypoint orientation and scale, the de- analysis of privacy risk of storing 3D point cloud maps has
scriptor image source and the 3D point visibility. We evalu- never been reported. We illustrate the privacy concerns by
ate our attack algorithm on public datasets [24, 39] and an- proposing the problem of synthesizing color images from
alyze the significance of the point cloud attributes. Finally, an SfM model of a scene. We assume that the reconstructed
we show that novel views can also be generated thereby en- model contains a sparse 3D point cloud with optional at-
abling compelling virtual tours of the underlying scene. tributes such as descriptors, color, point visibility and asso-
ciated camera poses but not the source images.
1. Introduction We make the following contributions: (1) We intro-
duce the problem of inverting a sparse SfM point cloud
Emerging AR technologies on mobile devices based on and reconstructing detailed views of the scene from arbi-
ARCore [2], ARKit [3], 3D mapping APIs [1], and new trary viewpoints. This problem differs from the previously
devices such as HoloLens [15] have set the stage for de- studied single-image feature inversion problem due to the
ployment of devices with always-on cameras in our homes, need to deal with highly sparse point distributions and a
higher degree of missing information in the input, namely typical inherent sparsity of SfM point clouds. Second, the
unknown keypoint orientation and scale, unknown image SIFT keypoint scale and orientation are unknown since SfM
source of descriptors, and unknown 3D point visibilities. methods retain only the descriptors for the 3D points. Third,
(2) We present a new approach based on three neural net- each 3D point typically has only one descriptor sampled
works where the first network performs visibility estima- from an arbitrary source image whose identity is not stored
tion, the second network reconstructs the image and the either, entailing descriptors with unknown perspective dis-
third network uses an adversarial framework to further re- tortions and photometric inconsistencies. Finally, the 3D
fine the image quality. (3) We systematically analyze vari- point visibilities are also unknown and we will demonstrate
ants of the inversion attack that exploits additional attributes the importance of visibility reasoning in the paper.
that may be available, namely per-point descriptors, color
and information about the source camera poses and point Image-to-Image Translation. Various methods such as
visibility and show that even the minimalist representation Pix2Pix [19], CycleGan [50], CoGAN [27] and related un-
(descriptors only) are prone to the attack. (4) We demon- supervised approaches [7, 26, 34] use conditional adversar-
strate the need for developing privacy preserving 3D repre- ial networks to transform between 2D representations, such
sentations, since the reconstructed images reveal the scene as edge to color, label to color, and day to night images.
in great details and confirm the feasibility of the attack in a While such networks are typically dense (without holes)
wide range of scenes. We also show that novel views of the and usually low-dimensional (single channel or RGB), Con-
scene can be synthesized without any additional effort and tour2Im [5] takes sparse 2D points sampled along gradients
a compelling virtual tour of a scene can be easily generated. along with low-dimensional input features. In contrast to
our work, these approaches are trained on specific object
The three networks in our cascade are trained on 700+
categories and semantically similar images. While we use
indoor and outdoor SfM reconstructions generated from
similar building blocks to these methods (encoder-decoder
500k+ multi-view images taken from the NYU2 [39] and
networks, U-nets, adversarial loss, and perceptual loss), our
MegaDepth [24] datasets. The training data for all three
networks can generalize to arbitrary images, and are trained
networks including the visibility labels were generated au-
on large scale indoor and outdoor SfM datasets.
tomatically using COLMAP [38]. Next we compare our
approach to previous work on inverting image features Upsampling. When the input and output domains are iden-
[46, 9, 8] and discuss how the problem of inverting SfM tical, deep networks have shown excellent results on up-
models poses a unique set of challenges. sampling and superresolution tasks for images, disparity,
depth maps and active range maps [4, 28, 43, 36, 17]. How-
2. Related Work ever, prior upsampling methods typically focus on inputs
with uniform sparsity. Our approach differs due to the non-
In this section, we review existing work on inverting im- uniform spatial sampling in the input data which also hap-
age features and contrast them to inverting SfM point cloud pens to be high dimensional and noisy since the input de-
models. We then broadly discuss image-to-image transla- scriptors are from different source images and viewpoints.
tion, upsampling and interpolation, and privacy attacks.
Novel view synthesis and image-based rendering. Deep
Inverting features. The task of reconstructing images from networks can significantly improve photorealism in free
features has been explored to understand what is encoded viewpoint image-based rendering [12, 14]. Additionally,
by the features, as was done for SIFT features by Weinza- several works have also explored monocular depth estima-
epfel et al. [46], HOG features by Vondrick et al. [45] and tion and novel view synthesis using U-Nets [11, 24, 31].
bag-of-words by Kato and Harada [20]. Recent work on Our approach arguably provides similar photorealistic visu-
the topic has been primarily focused on inverting and inter- ally quality – remarkably, from sparse SfM reconstructions
preting CNN features [49, 48, 29]. Dosovitskiy and Brox instead of images. This is disappointing news from a pri-
proposed encoder-decoder CNN architectures for inverting vacy perspective but could be useful in other settings for
many different features (DB1) [9] and later incorporated ad- generating photorealistic images from 3D reconstructions.
versarial training with perceptual loss functions (DB2) [8].
While DB1 [9] showed some qualitative results on inverting CNN-based privacy attacks and defense techniques. Re-
sparse SIFT, both papers focused primarily on dense fea- cently, McPherson et al. [30] and Vasiljevic et al. [44]
tures. In contrast to these feature inversion approaches, we showed that deep models could defeat existing image obfus-
focus solely on inverting SIFT descriptors stored along with cation methods. Further more, many image transformations
SfM point clouds. While the projected 3D points on a cho- can be considered as adding noise and undoing them as de-
sen viewpoint may resemble single image SIFT features, noising, and here deep networks have been quite success-
there are some key differences. First, our input 2D point ful [47]. To defend against CNN-based attacks, attempts at
distributions can be highly sparse and irregular, due to the learning CNN-resistant transformations have shown some
promise [33, 10, 35, 13]. Concurrent to our work, Speciale depths and could learn to reason about visibility. However,
et al. [41] introduced the privacy preserving image-based in practice, we found that this approach to be inaccurate,
localization problem to address the privacy issues we have especially in regions where the input feature maps contain
brought up. They proposed a new camera pose estimation a low ratio of visible to occluded points. Qualitative exam-
technique using an obfuscated representation of the map ge- ples of these failure cases are shown in Figure 5. Therefore
ometry which can defend against our inversion attack. we explored explicit visibility estimation approaches based
on geometric reasoning as well as learning.
3. Method
VisibSparse. We explored a simple geometric method that
The input to our pipeline is a feature map generated from we refer to as V ISIB S PARSE. It is based on the “point splat-
a SfM 3D point cloud model given a specific viewpoint i.e. ting” paradigm used in computer graphics. By considering
a set of camera extrinsic parameters. We obtain this fea- only the depth channel in the input, we apply a min filter
ture map by projecting the 3D points on the image plane with a k × k kernel on the feature map to obtain a filtered
and associating the 3D point attributes (SIFT descriptor, depth map. Here, we used k = 3 based on empirical test-
color, etc.) with the discrete 2D pixel where the 3D point ing. Each entry in the feature map whose depth value is no
projects in the image. When multiple points project to the greater than 5% of the depth value in the filtered depth map
same pixel, we retain the attributes for the point closest to is retained as visible. Otherwise, the point is considered
the camera and store its depth. We train a cascade of three occluded and the associated entry in the input is removed.
encoder-decoder neural networks for visibility estimation,
VisibDense. When the camera poses for the source im-
coarse image reconstruction and the final refinement step
ages computed during SfM and the image measurements
which recovers fine details in the reconstructed image.
are stored along with the 3D point cloud, it is often possible
Visibility Estimation. Since SfM 3D point clouds are of- to exploit that data to compute a dense scene reconstruc-
ten quite sparse and the underlying geometry and topology tion. Labatut et al. [21] proposed such a method to com-
of the surfaces in the scene are unknown, it is not possi- pute a dense triangulated mesh by running space carving on
ble to easily determine which 3D points should be consid- the tetrahedral cells of the 3D Delaunay triangulation of the
ered as visible from a specific camera viewpoint just us- sparse SfM points. We used this method, implemented in
ing z-buffering. This is because a sufficient number of 3D COLMAP [38] and computed 3D point visibility based on
points may not have been reconstructed on the foreground the reconstructed mesh model using traditional z-buffering.
occluding surfaces. This produces 2D pixels in the input
VisibNet. A geometric method such as V ISIB D ENSE can-
feature maps which are associated with 3D points in the
not be used when the SfM cameras poses and image mea-
background i.e. lie on surfaces which are occluded from
surements are unavailable. We therefore propose a general
that viewpoint. Identifying and removing such points from
regression-based approach that directly predicts the visibil-
the feature maps is critical to generating high-quality im-
ity from the input feature maps, where the predictive model
ages and avoiding visual artifacts. We propose to recover
is trained using supervised learning. Specifically, we train
point visibility using a data-driven neural network-based
an encoder-decoder neural network which we refer to as
approach, which we refer to as V ISIB N ET. We also evaluate
V ISIB N ET to classify each input point as either “visible”
two geometric methods which we refer to as V ISIB S PARSE
or “occluded”. Ground truth visibility labels were gener-
and V ISIB D ENSE. Both geometric methods however re-
ated automatically by leveraging V ISIB D ENSE on all train,
quire additional information which might be unavailable.
test, and validation scenes. Using V ISIB N ET’s predictions
Coarse Image Reconstructon and Refinement. Our tech- to “cull” occluded points from the input feature maps prior
nique for image synthesis from feature maps consists of a to running C OARSE N ET significantly improves the quality
coarse image reconstruction step followed by a refinement of the reconstructed images, especially in regions where the
step. C OARSE N ET is conditioned on the input feature map input feature map contains fewer visible points compared to
and produces an RGB image of the same width and height the number of points that are actually occluded.
as the feature map. R EFINE N ET outputs the final color im-
age which has the same size, given the input feature map 3.2. Architecture
along with the image output of C OARSE N ET as its input. A sample input feature map as well as our complete net-
work architecture consisting of V ISIB N ET, C OARSE N ET,
3.1. Visibility Estimation
and R EFINE N ET is shown in Figure 2. The input to our
If we did not perform explicit visibility prediction in network is an H × W × n dimensional feature map consist-
our pipeline, some degree of implicit visibility reasoning ing of n-dimensional feature vectors with different combi-
would still be carried out by the image synthesis network nations of depth, color, and SIFT features at each 2D loca-
C OARSE N ET. In theory, this network has access to the input tion. Except for the number of input/output channels in the
nD
Input Tensor
=
z RGB SIFT descriptor encoder decoder conv. layers

nD Input VisibNet Visibility CoarseNet RefineNet


RGB image RGB image
Map (output)

Figure 2: N ETWORK A RCHITECTURE : Our network has three sub-networks – V ISIB N ET, C OARSE N ET and R EFINE N ET.
The upper left shows that the input to our network is a multi-dimensional nD array. The paper explores network variants where
the inputs are different subsets of depth, color and SIFT descriptors. The three sub-networks have similar architectures. They
are U-Nets with encoder and decoder layers with symmetric skip connections. The extra layers at the end of the decoder
layers (marked in orange) are there to help with high-dimensional inputs. See the text and supplementary material for details.

first/final layers, each sub-network has the same architec- where V : RH×W ×N → RW ×H×1 denotes a differen-
ture consisting of U-Nets with a series of encoder-decoder tiable function representing V ISIB N ET, with learnable pa-
layers with skip connections. Compared to conventional U- rameters, Ux ∈ RH×W ×1 denotes the ground-truth visibil-
Nets, our network has a few extra convolutional layers at ity map for feature map Fx , and the summation is carried
the end of the decoder layers. These extra layers facilitate out over the set of M non-zero spatial locations in Fx .
propagation of information from the low-level features, par- C OARSE N ET was trained next, using a combination of
ticularly the information extracted from SIFT descriptors, an L1 pixel loss and an L2 perceptual loss (as in [22, 8])
via the skip connections to a larger pixel area in the out- over the outputs of layers relu1 1, relu2 2, and relu3 3 of
put, while also helping to attenuate visual artifacts resulting VGG16 [40] pre-trained for image classification on the Im-
from the highly sparse and irregular distribution of these ageNet [6] dataset. The weights of V ISIB N ET remained
features. We use nearest neighbor upsampling followed by fixed while C OARSE N ET was being trained using the loss
standard convolutions instead of transposed convolutions as
3
the latter are known to produce artifacts [32]. X
LC = ||C(Fx ) − x||1 + α ||φi (C(Fx )) − φi (x)||22 , (2)
3.3. Optimization i=1

We separately train the sub-networks in our architecture, where C : RH×W ×N → RH×W ×3 denotes a differentiable
V ISIB N ET, C OARSE N ET, and R EFINE N ET. Batch normal- function representing C OARSE N ET, with learnable param-
H W
ization was used in every layer, except the final one in each eters, and φ1 : RH×W ×3 → R 2 × 2 ×64 , φ2 : RH×W ×3 →
H W H W
network. We applied Xavier initialization and projections R4 4× ×128
, and φ3 : R H×W ×3
→ R 8 × 8 ×256 denote
were generated on-the-fly to facilitate data augmentation the layers relu1 1, relu2 2, and relu2 2, respectively, of the
during training and novel view generation after training. pre-trained VGG16 network.
V ISIB N ET was trained first to classify feature map points R EFINE N ET was trained last using a combination of an
as either visible or occluded, using ground-truth visibility L1 pixel loss, the same L2 perceptual loss as C OARSE N ET,
masks generated automatically by running V ISIB D ENSE for and an adversarial loss. While training R EFINE N ET, the
all train, test, and validation samples. Given training pairs weights of V ISIB N ET and C OARSE N ET remained fixed.
of input feature maps Fx ∈ RH×W ×N and target source For adversarial training, we used a conditional discrimi-
images x ∈ RH×W ×3 , V ISIB N ET’s objective is nator whose goal was to distinguish between real source
M
X   images used to generate the SfM models and images syn-
LV (x) = − Ux log (V (Fx ) + 1)/2 + thesized by R EFINE N ET. The discriminator trained using
i=1
(1) cross-entropy loss similar to Eq. (1). Additionally, to sta-

(1 − Ux )log (1 − V (Fx ))/2 i , bilize adversarial training, φ1 (R(Fx ))1 , φ2 (R(Fx ))1 , and
Desc. Inp. Feat. MAE SSIM Inp. Feat. Accuracy
Src. D O S 20% 60% 100% 20% 60% 100%
Data
z D C 20% 60% 100%
Si X X X .126 .105 .101 .539 .605 .631 X × × .948 .948 .946
Si X X × .133 .111 .105 .499 .568 .597 X × X .938 .943 .941
Si X × X .129 .107 .102 .507 .574 .599 MD
X X × .949 .951 .948
Si X × × .131 .113 .109 .477 .550 .578 X X X .952 .952 .950
M X × × .147 .128 .123 .443 .499 .524 X × × .892 .907 .908
X × X .897 .908 .910
Table 1: I NVERTING S INGLE I MAGE S IFT F EATURES : NYU
X X × .895 .907 .909
The top four rows compare networks designed for differ- X X X .906 .916 .917
ent subsets of single image (Si) inputs: descriptor (D), key-
point orientation (O) and scale (S). Test error (MAE) and
Table 2: E VALUATION OF V ISIB N ET: We trained four ver-
accuracy (SSIM) obtained when 20%, 60% and all the SIFT
sion of V ISIB N ET, each with a different set of input at-
features are used. Lower MAE and higher SSIM values are
tributes, namely, z (depth), D (SIFT) and C (color) to eval-
better. The last row is for when the descriptors originate
uate their relative importance. Ground truth labels were ob-
from multiple (M) different and unknown source images.
tained with VisibDense. The table reports mean classifica-
tion accuracy on the test set for the NYU and MD datasets.
φ3 (R(Fx ))1 were concatenated before the first, second, and The results show that V ISIB N ET achieves accuracy greater
third convolutional layers of the discriminator as done in than 93.8% and 89.2% on MD and NYU respectively and is
[42]. R EFINE N ET denoted as R() has the following loss. not very sensitive to sparsity levels and input attributes.

3
X partitioned the scenes into training, validation, and testing
LR =||R(Fx ) − x||1 + α ||φi (R(Fx )) − φi (x)||22 sets with 441, 80, and 139 scenes respectively. All images
i=1
(3) of one scene were included only in one of the three groups.
+ β[log(D(x)) + log(1 − D(R(Fx )))]. We report results using both the average mean absolute error
(MAE), where color values are scaled to the range [0,1].
Here, the two functions, R : RH×W ×N +3 → RH×W ×3 and average structured similarity (SSIM). Note that lower
and D : RH×W ×N +3 → R denote differentiable functions MAE and higher SSIM values indicate better results.
representing R EFINE N ET and the discriminator, respec-
Inverting Single Image SIFT Features. Consider the sin-
tively, with learnable parameters. We trained R EFINE N ET
gle image scenario, with trivial visibility estimation and
to minimize LR by applying alternating gradient updates
identical input to [9]. We performed an ablation study in
to R EFINE N ET and the discriminator. The gradients were
this scenario, measuring the effect of inverting features with
computed on mini-batches of training data, with different
unknown keypoint scale, orientation, and multiple unknown
batches used to update R EFINE N ET and the discriminator.
image sources. Four variants of C OARSE N ET were trained,
then tested at three sparsity levels. The results are shown
4. Experimental Results in Table 1 and Figure 4. Table 1 reports MAE and SSIM
We now report a systematic evaluation of our method. across a combined MD and NYU dataset. The sparsity per-
Some of our results are qualitatively summarized in Fig. centage refers to how many randomly selected features were
3, demonstrating robustness to various challenges, namely, retained in the input, and our method handles a wide range
missing information in the point clouds, effectiveness of our of sparsity reasonably well. From the examples in Figure 4,
visibility estimation, and the sparse and irregular distribu- we observe that the networks are surprisingly robust at in-
tion of input samples over a large variety of scenes. verting features with unknown orientation and scale; while
the accuracy drops a bit as expected, the reconstructed im-
Dataset. We use the MegaDepth [24] and NYU [39] ages are still recognizable. Finally, we quantify the effect
datasets in our experiments. MegaDepth (MD) is an In- of unknown and different image sources for the SIFT fea-
ternet image dataset with ∼150k images of 196 landmark tures. The last row of Table 1 shows that indeed the feature
Inverting Single Image SIFT Features. Consider the single image scenario, with trivial visibility estimation and identical input to [9]. We performed an ablation study in this scenario, measuring the effect of inverting features with unknown keypoint scale, orientation, and multiple unknown image sources. Four variants of CoarseNet were trained, then tested at three sparsity levels. The results are shown in Table 1 and Figure 4. Table 1 reports MAE and SSIM across a combined MD and NYU dataset. The sparsity percentage refers to how many randomly selected features were retained in the input, and our method handles a wide range of sparsity reasonably well. From the examples in Figure 4, we observe that the networks are surprisingly robust at inverting features with unknown orientation and scale; while the accuracy drops a bit as expected, the reconstructed images are still recognizable. Finally, we quantify the effect of unknown and different image sources for the SIFT features. The last row of Table 1 shows that the feature inversion problem indeed becomes harder, but the results are still remarkably good. Having demonstrated that our work solves a harder problem than previously tackled, we now report results on inverting SfM points and their features.

4.1. Visibility Estimation

We first independently evaluate the performance of the proposed VisibNet model and compare it to the geometric
Figure 3: Q UALITATIVE R ESULTS : Each result is a 3 × 1 set of square images, showing point clouds (with occluded points
in red), image reconstruction and original. The first four columns (top and bottom) show results from the MegaDepth dataset
(internet scenes) and the last four columns (top and bottom) show results from indoor NYU scenes. Sparsity: Our network
handles a large variety in input sparsity (density decreases from left to right). In addition, perspective projection accentuates
the spatially-varying density differences, and the MegaDepth outdoor scenes have concentrated points in the input whereas
NYU indoor scenes have far fewer samples. Further, the input points are non-homogeneous, with large holes, which our method
gracefully fills in. Visual effects: For the first four columns (MD scenes) our results give the pleasing effect of uniform
illumination (see top of first column). Since our method relies on SfM, moving objects are not recovered. Scene diversity:
The fourth column is an aerial photograph, an unusual category that is still recovered well. For the last four columns (NYU
scenes), despite lower sparsity, we can recover textures in common household scenes such as bathrooms, classrooms and
bedrooms. The variety shows that our method does not learn object categories and works on any scene. Visibility: All scenes
benefit from visibility prediction using V ISIB N ET which for example was crucial for the bell example (lower 2nd column).

methods VisibSparse and VisibDense. We trained four variants of VisibNet designed for different subsets of input attributes to classify points in the input feature map as "visible" or "occluded". We report classification accuracy separately on the MD and NYU test sets even though the network was trained on the combined training set (see Table 2). We observe that VisibNet is largely insensitive to scene type, sparsity levels, and choice of input attributes such as depth, color, and descriptors. The VisibNet variant designed for depth only has 94.8% and 89.2% mean classification accuracy on the MD and NYU test sets, respectively, even when only 20% of the input samples were used to simulate sparse inputs. Table 3 shows that when points predicted as occluded by VisibNet are removed from the input to CoarseNet, we observe a consistent improvement compared to CoarseNet carrying both the burdens of visibility estimation and image synthesis (denoted as Implicit in the table). While the improvement may not seem numerically large, in Figure 5 we show insets where visual artifacts (bookshelf above, building below) are removed.
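The gain in Table 3 comes from simply discarding feature-map samples that VisibNet classifies as occluded before they reach CoarseNet. A minimal sketch of that filtering step (tensor names and the 0.5 threshold are illustrative, not the released implementation):

```python
import torch

def filter_occluded(feature_map, visib_logits, threshold=0.5):
    """feature_map: B x C x H x W sparse input; visib_logits: B x 1 x H x W output of VisibNet."""
    visible = (torch.sigmoid(visib_logits) > threshold).float()
    return feature_map * visible  # samples predicted as occluded are zeroed out for CoarseNet
```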
(a) Input (b) SIFT (c) SIFT + s (d) SIFT + o (e) SIFT + s + o (f) Original

Figure 4: I NVERTING S IFT F EATURES IN A S INGLE I MAGE : (a) 2D keypoint locations. Results obtained with (b) only
descriptor, (c) descriptor and keypoint scale, (d) descriptor and keypoint orientation, (e) descriptor, scale and orientation. (f)
Original image. Results from using only descriptors (2nd column) are only slightly worse than the baseline (5th column).

(a) Input (b) Pred. (VisibNet) (c) Implicit (d) VisibNet (e) VisibDense (f) Original

Figure 5: I MPORTANCE OF VISIBILITY ESTIMATION : Examples showing (a) input 2D point projections (in blue), (b) pre-
dicted visibility from V ISIB N ET – occluded (red) and visible (blue) points, (c–e) results from I MPLICIT (no explicit visibility
estimation), V ISIB N ET (uses a CNN) and V ISIB D ENSE (uses z-buffering and dense models), and (f) the original image.

4.2. Relative Significance of Point Attributes

We trained four variants of CoarseNet, each with a different set of the available SfM point attributes. The goal here is to measure the relative importance of each of the attributes. This information could be used to decide which optional attributes should be removed when storing an SfM model to enhance privacy. We report reconstruction error on the test set for both indoor (NYU) and outdoor (MD) scenes for various sparsity levels in Table 4 and show qualitative evaluation on the test set in Figure 6. The results indicate that our approach is largely invariant to sparsity and capable of capturing very fine details even when the input feature map contains just depth, although, not surprisingly, color and SIFT descriptors significantly improve visual quality.

4.3. Significance of RefineNet

In Figure 7 we qualitatively compare two scenes where the feature maps had only depth and descriptors (left) and where they had all the attributes (right). For privacy preservation, these results are sobering. While Table 4 showed that CoarseNet struggles when color is dropped (suggesting an easy solution of removing color for privacy), Figure 7 (left) unfortunately shows that RefineNet recovers plausible colors and improves the results considerably. Of course, RefineNet trained on all features also does better than CoarseNet, although less dramatically (Figure 7, right).
Data | Visibility Est. | MAE 20% | MAE 60% | MAE 100% | SSIM 20% | SSIM 60% | SSIM 100%
MD   | Implicit        | .201 | .197 | .195 | .412 | .436 | .445
MD   | VisibSparse     | .202 | .197 | .196 | .408 | .432 | .440
MD   | VisibNet        | .201 | .196 | .195 | .415 | .440 | .448
MD   | VisibDense      | .201 | .196 | .195 | .417 | .442 | .451
NYU  | Implicit        | .121 | .100 | .094 | .541 | .580 | .592
NYU  | VisibSparse     | .122 | .100 | .094 | .539 | .579 | .592
NYU  | VisibNet        | .120 | .098 | .092 | .543 | .583 | .595
NYU  | VisibDense      | .120 | .097 | .090 | .545 | .587 | .600

Table 3: Importance of Visibility Estimation: Both sub-tables show results obtained using Implicit (i.e., no explicit occlusion reasoning, where the burden of visibility estimation implicitly falls on CoarseNet), VisibNet, and the geometric methods VisibSparse and VisibDense. Lower MAE and higher SSIM values are better.

Figure 7: Importance of RefineNet: (Top row) CoarseNet results. (Bottom row) RefineNet results. (Left) Networks use depth and descriptors (z + D). (Right) Networks use depth, descriptor and color (z + D + C).

Data | Input features | MAE 20% | MAE 60% | MAE 100% | SSIM 20% | SSIM 60% | SSIM 100%
MD   | z              | .258 | .254 | .253 | .264 | .254 | .250
MD   | z + C          | .210 | .204 | .202 | .378 | .394 | .403
MD   | z + D          | .228 | .223 | .221 | .410 | .430 | .438
MD   | z + D + C      | .201 | .196 | .195 | .414 | .439 | .448
NYU  | z              | .295 | .290 | .289 | .244 | .209 | .197
NYU  | z + C          | .148 | .121 | .111 | .491 | .528 | .546
NYU  | z + D          | .207 | .179 | .171 | .493 | .528 | .539
NYU  | z + D + C      | .121 | .099 | .093 | .542 | .582 | .594

Table 4: Effect of Point Attributes: Performance of four networks designed for different sets of input attributes – z (depth), D (SIFT) and C (color) – on MD and NYU. Input sparsity is simulated by applying random dropout to input samples during training and testing.
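The random dropout used to simulate sparsity amounts to keeping a random fraction of the valid samples in the input feature map; a small sketch under that convention (variable names are illustrative):

```python
import torch

def simulate_sparsity(feature_map, valid_mask, keep_ratio=0.2):
    """Randomly retain keep_ratio of the valid samples.
    feature_map: B x C x H x W; valid_mask: B x 1 x H x W float mask of occupied locations."""
    keep = (torch.rand_like(valid_mask) < keep_ratio).float() * valid_mask
    return feature_map * keep, keep
```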
Figure 6: Effect of Point Attributes: Results obtained with different attributes. Left to right: depth [z], depth + SIFT [z + D], depth + color [z + C], depth + SIFT + color [z + D + C], and the original image (see Table 4).

Figure 8: Novel View Synthesis: Synthesized images from virtual viewpoints in two NYU scenes [39] help to interpret the cluttered scenes (see supplementary video).

4.4. Novel View Synthesis

Our technique can be used to easily generate realistic novel views of the scene. While quantitatively evaluating such results is more difficult (in contrast to our experiments, where aligned real camera images are available), we show qualitative results in Figure 8 and generate virtual tours based on the synthesized novel views (see the video in the supplementary material). Such novel-view-based virtual tours can make scene interpretation easier for an attacker even when the images contain some artifacts.

5. Conclusion

In this paper, we introduced a new problem, that of inverting a sparse SfM point cloud and reconstructing color images of the underlying scene. We demonstrated that surprisingly high quality images can be reconstructed from the limited amount of information stored along with sparse 3D point cloud models. Our work highlights the privacy and security risks associated with storing 3D point clouds and the necessity of developing privacy preserving point cloud representations and camera localization techniques, where the persistent scene model data cannot easily be inverted to reveal the appearance of the underlying scene. This was also the primary goal in concurrent work on privacy preserving camera pose estimation [41], which proposes a defense against the type of attacks investigated in our paper. Another interesting avenue of future work would be to explore privacy preserving features for recovering correspondences between images and 3D models.
References

[1] 6D.AI. http://6d.ai/, 2018.
[2] ARCore. developers.google.com/ar/, 2018.
[3] ARKit. developer.apple.com/arkit/, 2018.
[4] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from RGB and sparse sensing. In ECCV, pages 167–182, 2018.
[5] T. Dekel, C. Gan, D. Krishnan, C. Liu, and W. T. Freeman. Smart, sparse contours to represent and edit images. In CVPR, 2018.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[9] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, pages 4829–4837, 2016.
[10] H. Edwards and A. Storkey. Censoring representations with an adversary. In ICLR, 2016.
[11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world's imagery. In CVPR, pages 5515–5524, 2016.
[13] J. Hamm. Minimax filter: Learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704–4734, 2017.
[14] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (SIGGRAPH Asia Conference Proceedings), 37(6), November 2018.
[15] Hololens. https://www.microsoft.com/en-us/hololens, 2016.
[16] J. Hong. Considering privacy issues in the context of Google Glass. Commun. ACM, 56(11):10–11, 2013.
[17] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In ECCV, 2016.
[18] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, pages 2599–2606, 2009.
[19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
[20] H. Kato and T. Harada. Image reconstruction from bag-of-visual-words. In CVPR, pages 955–962, 2014.
[21] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. In ICCV, pages 1–8, 2007.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[23] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose estimation using 3D point clouds. In ECCV, pages 15–29. Springer, 2012.
[24] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
[25] H. Lim, S. N. Sinha, M. F. Cohen, M. Uyttendaele, and H. J. Kim. Real-time monocular image-based 6-DoF localization. The International Journal of Robotics Research, 34(4-5):476–492, 2015.
[26] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[27] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[28] J. Lu and D. Forsyth. Sparse depth super resolution. In CVPR, pages 2245–2253, 2015.
[29] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015.
[30] R. McPherson, R. Shokri, and V. Shmatikov. Defeating image obfuscation with deep learning. arXiv preprint arXiv:1609.00408, 2016.
[31] M. Moukari, S. Picard, L. Simoni, and F. Jurie. Deep multi-scale architectures for monocular depth estimation. In ICIP, pages 2940–2944, 2018.
[32] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[33] F. Pittaluga, S. Koppal, and A. Chakrabarti. Learning privacy preserving encodings through adversarial training. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 791–799, 2019.
[34] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In CVPR, pages 8808–8816, 2018.
[35] N. Raval, A. Machanavajjhala, and L. P. Cox. Protecting visual secrets using adversarial nets. In CV-COPS 2017, CVPR Workshop, pages 1329–1332, 2017.
[36] G. Riegler, M. Rüther, and H. Bischof. ATGV-Net: Accurate depth super-resolution. In ECCV, pages 268–284, 2016.
[37] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2D-to-3D matching. In ICCV, pages 667–674, 2011.
[38] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.
[39] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] P. Speciale, J. L. Schönberger, S. B. Kang, S. N. Sinha, and M. Pollefeys. Privacy preserving image-based localization. arXiv preprint arXiv:1903.05572, 2019.
[42] D. Sungatullina, E. Zakharov, D. Ulyanov, and V. Lempitsky. Image manipulation with perceptual discriminators. In ECCV, pages 579–595, 2018.
[43] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), pages 11–20, 2017.
[44] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760, 2016.
[45] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. Hoggles: Visualizing object detection features. In CVPR, pages 1–8, 2013.
[46] P. Weinzaepfel, H. Jégou, and P. Pérez. Reconstructing an image from its local descriptors. In CVPR, pages 337–344, 2011.
[47] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pages 1790–1798, 2014.
[48] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In ICML Workshop on Deep Learning, 2015.
[49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[50] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, pages 2223–2232, 2017.
Supplementary Material:
Revealing Scenes by Inverting Structure from Motion Reconstructions

Francesco Pittaluga1 Sanjeev J. Koppal1 Sing Bing Kang2 Sudipta N. Sinha2


1 University of Florida    2 Microsoft Research

A. Implementation Details

In the supplementary material we describe our network architecture and the training procedure in more detail.

A.1. Architecture

Our network architecture consists of three sub-networks – VisibNet, CoarseNet and RefineNet. The input to our network is an H × W × n dimensional feature map, where at each 2D location in the feature map at which there is a valid sample, we have one n-dimensional feature vector. This feature vector is obtained by concatenating different subsets of the depth, color, and SIFT features which are associated with each 3D point in the SfM point cloud. Except for the number of input/output channels in the first/final layers, each sub-network has the same architecture, that of a U-Net with an encoder and a decoder and with skip connections between the layers in the encoder and decoder networks at identical depths. In contrast to conventional U-Nets, where the decoder directly generates the output, in our network the output of the decoder is passed through three convolutional layers in sequence to obtain the final output.

More specifically, the architecture of the encoder is CE256 - CE256 - CE256 - CE512 - CE512 - CE512, where CEN denotes a convolutional layer with N kernels of size 4 × 4 and stride equal to 2, followed by an addition of a bias, batch normalization, and a ReLU operation.

The architecture of the decoder is CD512 - CD512 - CD512 - CD256 - CD256 - CD256 - C128 - C64 - C32 - C3, where CDN denotes nearest neighbor upsampling by a factor of 2 followed by a convolutional layer with N kernels of size 3 × 3 and a stride equal to 1, followed by an addition of a bias, batch normalization, and a ReLU operation. CN layers are the same as CDN layers but without the upsampling operation. In the final layer of the decoder, the ReLU is replaced with a tanh non-linearity. In RefineNet, all ReLU operations in the decoder are replaced by leaky ReLU operations. In VisibNet, the final layer of the decoder has 1 kernel instead of 3.

Our discriminator used for adversarial training of RefineNet has the following architecture – CA256 - CA256 - CA256 - CA512 - CA512 - CA512 - FC1024 - FC1024 - FC1024 - FC2, where CAN denotes a convolutional layer with N kernels of size 3 × 3 and stride equal to 1, followed by a 2 × 2 max pooling operation, followed by an addition of a bias, batch normalization, and a leaky ReLU operation. FCN denotes a fully connected layer with N nodes followed by an addition of a bias and a leaky ReLU operation. In the final layer, the leaky ReLU is replaced by a softmax function.
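A minimal PyTorch rendering of the CE_N and CD_N building blocks described above (padding values and module factoring are assumptions made for this sketch; this is not the released implementation):

```python
import torch.nn as nn

def CE(in_ch, out_ch):
    # CE_N: 4x4 convolution with stride 2, bias, batch normalization, ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=True),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def CD(in_ch, out_ch):
    # CD_N: nearest-neighbor 2x upsampling, then 3x3 convolution with stride 1, bias,
    # batch normalization, ReLU.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=True),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))
```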
A.2. Optimization

We used the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 1e−8 and a learning rate of 0.0001 for training all networks. Images with resolution 256 × 256 pixels were used as input to the network during training. However, the trained network was used to process images at a resolution of 512 × 512 pixels. During training, we resized each image such that the smaller dimension of the resized image was randomly assigned to either 296, 394, or 512 pixels, after which we applied a random 2D crop to the resized image to obtain a 256 × 256 image, which was the actual input to our network. We used Xavier initialization for all the parameters of our network.
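The stated optimizer settings and the resize-then-crop augmentation translate roughly into the sketch below; the resize targets follow the text, while everything else (function names, PIL-image inputs) is illustrative rather than the authors' training code.

```python
import random
import torch
import torchvision.transforms.functional as TF

def make_optimizer(net):
    # Adam with beta1=0.9, beta2=0.999, eps=1e-8 and learning rate 1e-4, as stated above.
    return torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

def augment(img):
    # Resize a PIL image so its smaller side is 296, 394, or 512 px, then take a random 256x256 crop.
    target = random.choice([296, 394, 512])
    w, h = img.size
    scale = target / min(w, h)
    img = TF.resize(img, [int(round(h * scale)), int(round(w * scale))])
    w, h = img.size
    top, left = random.randint(0, h - 256), random.randint(0, w - 256)
    return TF.crop(img, top, left, 256, 256)
```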
B. Additional Results

We now present qualitative results to show that our network is robust to 2D input which is very sparse. Figure A1 shows two example results. Three images are synthesized using randomly selected 20%, 60% and 100% of all the projected 3D points for the scenes. Despite the high simulated 2D sparsity in the input, the output images are quite interpretable. Figure A2 shows some failure examples.

Supplementary Video. Finally, we encourage the reader to view the supplementary video, which makes it easier to visualize the qualitative results shown in the main paper. For two scenes, where the SfM camera poses are available, we show that we can reconstruct the source video by running our method on a frame-by-frame basis with the camera poses for the source images. Finally, we show results of synthesizing images from novel camera viewpoints. Such results can be used to create virtual tours of the scene, thus making it easier to reveal the appearance, layout, and geometry of the scene.
(a) Sparse Input (20%) (b) Using 20% (c) Using 60% (d) Using all (e) Original Images

Figure A1: E VALUATING ROBUSTNESS TO S PARSITY: Two sets of images synthesized using our complete pipeline, by
running V ISIB N ET, C OARSE N ET and R EFINE N ET. From left to right: (a) Simulated sparse inputs to our networks. Here,
only 20% of the 3D points in the respective SfM models were used. Images synthesized using our method with (b) 20% of
the points, (c) 60% of the points, (d) all the points and (e) the original source images. Even when the inputs are extremely
sparse, most of the contents of the synthesized images can be easily recognized.


Figure A2: FAILURE EXAMPLES: (a) Dense points on the building in the background overwhelm a few sparse points in the
foreground on the base of the statue. V ISIB N ET in this case incorrectly predicts that the building is visible and this causes the
base of the statue to disappear completely in the synthesized image. (b) A similar artifact for a different scene. (c) Parallel
straight lines are sometimes poorly handled, such as the lines on the vertical pillars of the monument. (d) The complex
occlusions in the architectural structure produce artifacts where the occluded surfaces and the occluders are fused into each
other. (e) Straight lines are often reconstructed as curved or bent. (f–g) Low sample density in the input, common in indoor
scenes results in blurry and wavy edges. (h) Finally, spurious 3D points may cause our method to hallucinate structures such
as the dark line on the wall which is not actually there.
Semantic Image Synthesis with Spatially-Adaptive Normalization

Taesung Park1,2∗ Ming-Yu Liu2 Ting-Chun Wang2 Jun-Yan Zhu2,3


1 UC Berkeley    2 NVIDIA    3 MIT CSAIL

arXiv:1903.07291v2 [cs.CV] 5 Nov 2019

[Figure 1: the label maps in this example contain classes such as sky, cloud, tree, mountain, sea, and grass.]

Figure 1: Our model allows user control over both semantics and style when synthesizing an image. The semantics (e.g., the existence of a tree) are controlled via a label map (the top row), while the style is controlled via the reference style image (the leftmost column). Please visit our website for interactive image synthesis demos.

Abstract

We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to "wash away" semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantics and style. Code is available at https://github.com/NVlabs/SPADE.

(∗ Taesung Park contributed to the work during his NVIDIA internship.)

1. Introduction

Conditional image synthesis refers to the task of generating photorealistic images conditioned on certain input data. Seminal work computes the output image by stitching pieces from a single image (e.g., Image Analogies [16]) or using an image collection [7, 14, 23, 30, 35]. Recent methods directly learn the mapping using neural networks [3, 6, 22, 47, 48, 54, 55, 56]. The latter methods are faster and require no external database of images.

We are interested in a specific form of conditional image synthesis, which is converting a semantic segmentation mask to a photorealistic image. This form has a wide range of applications such as content generation and image editing [6, 22, 48]. We refer to this form as semantic image synthesis. In this paper, we show that the conventional network architecture [22, 48], which is built by stacking convolutional, normalization, and nonlinearity layers, is at best
suboptimal because their normalization layers tend to "wash away" information contained in the input semantic masks. To address the issue, we propose spatially-adaptive normalization, a conditional normalization layer that modulates the activations using input semantic layouts through a spatially-adaptive, learned transformation and can effectively propagate the semantic information throughout the network.

We conduct experiments on several challenging datasets including the COCO-Stuff [4, 32], the ADE20K [58], and the Cityscapes [9]. We show that with the help of our spatially-adaptive normalization layer, a compact network can synthesize significantly better results compared to several state-of-the-art methods. Additionally, an extensive ablation study demonstrates the effectiveness of the proposed normalization layer against several variants for the semantic image synthesis task. Finally, our method supports multi-modal and style-guided image synthesis, enabling controllable, diverse outputs, as shown in Figure 1. Also, please see our SIGGRAPH 2019 Real-Time Live demo and try our online demo by yourself.

2. Related Work

Deep generative models can learn to synthesize images. Recent methods include generative adversarial networks (GANs) [13] and variational autoencoders (VAEs) [28]. Our work is built on GANs but aims for the conditional image synthesis task. GANs consist of a generator and a discriminator, where the goal of the generator is to produce realistic images so that the discriminator cannot tell the synthesized images apart from the real ones.

Conditional image synthesis exists in many forms that differ in the type of input data. For example, class-conditional models [3, 36, 37, 39, 41] learn to synthesize images given category labels. Researchers have explored various models for generating images based on text [18, 44, 52, 55]. Another widely-used form is image-to-image translation based on a type of conditional GANs [20, 22, 24, 25, 33, 57, 59, 60], where both input and output are images. Compared to earlier non-parametric methods [7, 16, 23], learning-based methods typically run faster during test time and produce more realistic results. In this work, we focus on converting segmentation masks to photorealistic images. We assume the training dataset contains registered segmentation masks and images. With the proposed spatially-adaptive normalization, our compact network achieves better results compared to leading methods.

Unconditional normalization layers have been an important component in modern deep networks and can be found in various classifiers, including the Local Response Normalization in the AlexNet [29] and the Batch Normalization (BatchNorm) in the Inception-v2 network [21]. Other popular normalization layers include the Instance Normalization (InstanceNorm) [46], the Layer Normalization [2], the Group Normalization [50], and the Weight Normalization [45]. We label these normalization layers as unconditional, as they do not depend on external data, in contrast to the conditional normalization layers discussed below.

Conditional normalization layers include the Conditional Batch Normalization (Conditional BatchNorm) [11] and Adaptive Instance Normalization (AdaIN) [19]. Both were first used in the style transfer task and later adopted in various vision tasks [3, 8, 10, 20, 26, 36, 39, 42, 49, 54]. Different from the earlier normalization techniques, conditional normalization layers require external data and generally operate as follows. First, layer activations are normalized to zero mean and unit deviation. Then the normalized activations are denormalized by modulating the activation using a learned affine transformation whose parameters are inferred from external data. For style transfer tasks [11, 19], the affine parameters are used to control the global style of the output, and hence are uniform across spatial coordinates. In contrast, our proposed normalization layer applies a spatially-varying affine transformation, making it suitable for image synthesis from semantic masks.

Wang et al. proposed a closely related method for image super-resolution [49]. Both methods are built on spatially-adaptive modulation layers that condition on semantic inputs. While they aim to incorporate semantic information into super-resolution, our goal is to design a generator for style and semantics disentanglement. We focus on providing the semantic information in the context of modulating normalized activations. We use semantic maps in different scales, which enables coarse-to-fine generation. The reader is encouraged to review their work for more details.

3. Semantic Image Synthesis

Let m ∈ L^{H×W} be a semantic segmentation mask, where L is a set of integers denoting the semantic labels, and H and W are the image height and width. Each entry in m denotes the semantic label of a pixel. We aim to learn a mapping function that can convert an input segmentation mask m to a photorealistic image.

Spatially-adaptive denormalization. Let h^i denote the activations of the i-th layer of a deep convolutional network for a batch of N samples. Let C^i be the number of channels in the layer. Let H^i and W^i be the height and width of the activation map in the layer. We propose a new conditional normalization method called the SPatially-Adaptive (DE)normalization¹ (SPADE). Similar to the Batch Normalization [21], the activation is normalized in the channel-wise manner and then modulated with learned scale and bias. Figure 2 illustrates the SPADE design.

¹ Conditional normalization [11, 19] uses external data to denormalize the normalized activations; i.e., the denormalization part is conditional.
Figure 2: In the SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters γ and β. Unlike prior conditional normalization methods, γ and β are not vectors, but tensors with spatial dimensions. The produced γ and β are multiplied and added to the normalized activation element-wise.
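To make Figure 2 concrete, here is a minimal PyTorch-style sketch of the layer it depicts; Eq. (1)–(3) in the text below give the exact definition. The hidden width, kernel sizes, and module names are assumptions for this sketch and not the released SPADE code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, num_channels, num_labels, hidden=128):
        super().__init__()
        # Parameter-free normalization: per-channel batch statistics, no learned affine terms.
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        # Two-layer conv net mapping the (one-hot) segmentation mask to gamma and beta.
        self.shared = nn.Sequential(nn.Conv2d(num_labels, hidden, kernel_size=3, padding=1),
                                    nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_channels, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        # x: N x C x H x W activations; segmap: N x L x H0 x W0 one-hot label map.
        # Resize the mask to the activation resolution (each generator block works at a different scale).
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        # Spatially-varying denormalization: gamma and beta are tensors with spatial dimensions.
        return self.gamma(h) * self.norm(x) + self.beta(h)
```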
Figure 3: Comparing results given uniform segmentation maps: while the SPADE generator produces plausible textures, the pix2pixHD generator [48] produces two identical outputs due to the loss of the semantic information after the normalization layer.
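The collapse that Figure 3 illustrates is easy to reproduce: convolving a one-label (constant) mask gives a spatially constant map, and instance normalization sends any constant map to zero, whichever label was given. A tiny sketch of the argument (replicate padding keeps the convolution output exactly constant):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode='replicate')
norm = nn.InstanceNorm2d(8)

sky = torch.full((1, 1, 64, 64), 1.0)    # a mask filled entirely with label "sky"
grass = torch.full((1, 1, 64, 64), 2.0)  # a mask filled entirely with label "grass"

# A constant input stays constant after convolution, and instance normalization maps any
# constant map to zeros, so the two labels become indistinguishable after normalization.
out_sky, out_grass = norm(conv(sky)), norm(conv(grass))
print(out_sky.abs().max().item(), out_grass.abs().max().item())  # both ~0
```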
The activation value at site (n ∈ N, c ∈ C^i, y ∈ H^i, x ∈ W^i) is

\gamma^{i}_{c,y,x}(\mathbf{m}) \, \frac{h^{i}_{n,c,y,x} - \mu^{i}_{c}}{\sigma^{i}_{c}} + \beta^{i}_{c,y,x}(\mathbf{m})    (1)

where h^i_{n,c,y,x} is the activation at the site before normalization, and µ^i_c and σ^i_c are the mean and standard deviation of the activations in channel c:

\mu^{i}_{c} = \frac{1}{N H^{i} W^{i}} \sum_{n,y,x} h^{i}_{n,c,y,x}    (2)

\sigma^{i}_{c} = \sqrt{ \frac{1}{N H^{i} W^{i}} \sum_{n,y,x} \left( h^{i}_{n,c,y,x} \right)^{2} - \left( \mu^{i}_{c} \right)^{2} }.    (3)

The variables γ^i_{c,y,x}(m) and β^i_{c,y,x}(m) in (1) are the learned modulation parameters of the normalization layer. In contrast to the BatchNorm [21], they depend on the input segmentation mask and vary with respect to the location (y, x). We use the symbols γ^i_{c,y,x} and β^i_{c,y,x} to denote the functions that convert m to the scaling and bias values at the site (c, y, x) in the i-th activation map. We implement the functions γ^i_{c,y,x} and β^i_{c,y,x} using a simple two-layer convolutional network, whose design is in the appendix.

In fact, SPADE is related to, and is a generalization of, several existing normalization layers. First, by replacing the segmentation mask m with the image class label and making the modulation parameters spatially-invariant (i.e., γ^i_{c,y1,x1} ≡ γ^i_{c,y2,x2} and β^i_{c,y1,x1} ≡ β^i_{c,y2,x2} for any y1, y2 ∈ {1, 2, ..., H^i} and x1, x2 ∈ {1, 2, ..., W^i}), we arrive at the form of the Conditional BatchNorm [11]. Indeed, for any spatially-invariant conditional data, our method reduces to the Conditional BatchNorm. Similarly, we can arrive at the AdaIN [19] by replacing m with a real image, making the modulation parameters spatially-invariant, and setting N = 1. As the modulation parameters are adaptive to the input segmentation mask, the proposed SPADE is better suited for semantic image synthesis.

SPADE generator. With the SPADE, there is no need to feed the segmentation map to the first layer of the generator, since the learned modulation parameters have encoded enough information about the label layout. Therefore, we discard the encoder part of the generator, which is commonly used in recent architectures [22, 48]. This simplification results in a more lightweight network. Furthermore, similarly to existing class-conditional generators [36, 39, 54], the new generator can take a random vector as input, enabling a simple and natural way for multi-modal synthesis [20, 60].

Figure 4 illustrates our generator architecture, which employs several ResNet blocks [15] with upsampling layers. The modulation parameters of all the normalization layers are learned using the SPADE. Since each residual block operates at a different scale, we downsample the semantic mask to match the spatial resolution.

We train the generator with the same multi-scale discriminator and loss function used in pix2pixHD [48], except that we replace the least-squares loss term [34] with the hinge loss term [31, 38, 54]. We test several ResNet-based discriminators used in recent unconditional GANs [1, 36, 39] but observe similar results at the cost of a higher GPU memory requirement. Adding the SPADE to the discriminator also yields a similar performance. For the loss function, we observe that removing any loss term in the pix2pixHD loss function leads to degraded generation results.

Why does the SPADE work better? A short answer is that it can better preserve semantic information against common normalization layers. Specifically, while normalization layers such as the InstanceNorm [46] are essential pieces in almost all the state-of-the-art conditional image synthesis models [48], they tend to wash away semantic information when applied to uniform or flat segmentation masks.

Let us consider a simple module that first applies convolution to a segmentation mask and then normalization. Furthermore, let us assume that a segmentation mask with a single label is given as input to the module (e.g., all the
[Figure 4 diagram: (left) a SPADE residual block built from SPADE, ReLU, and 3×3 convolution layers; (right) the generator, a series of SPADE ResBlks with upsampling, driven by a random vector.]

Figure 4: In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations.
(left) Structure of one residual block with the SPADE. (right) The generator contains a series of the SPADE residual blocks
with upsampling layers. Our architecture achieves better performance with a smaller number of parameters by removing the
downsampling layers of leading image-to-image translation networks such as the pix2pixHD model [48].

pixels have the same label, such as sky or grass). Under this setting, the convolution outputs are again uniform, with different labels having different uniform values. Now, after we apply InstanceNorm to the output, the normalized activation will become all zeros no matter what input semantic label is given. Therefore, semantic information is totally lost. This limitation applies to a wide range of generator architectures, including pix2pixHD and its variant that concatenates the semantic mask at all intermediate layers, as long as a network applies convolution and then normalization to the semantic mask. In Figure 3, we empirically show this is precisely the case for pix2pixHD. Because a segmentation mask consists of a few uniform regions in general, the issue of information loss emerges when applying normalization.

In contrast, the segmentation mask in the SPADE generator is fed through spatially-adaptive modulation without normalization. Only activations from the previous layer are normalized. Hence, the SPADE generator can better preserve semantic information. It enjoys the benefit of normalization without losing the semantic input information.

Multi-modal synthesis. By using a random vector as the input of the generator, our architecture provides a simple way for multi-modal synthesis [20, 60]. Namely, one can attach an encoder that processes a real image into a random vector, which will then be fed to the generator. The encoder and generator form a VAE [28], in which the encoder tries to capture the style of the image, while the generator combines the encoded style and the segmentation mask information via the SPADEs to reconstruct the original image. The encoder also serves as a style guidance network at test time to capture the style of target images, as used in Figure 1. For training, we add a KL-Divergence loss term [28].

4. Experiments

Implementation details. We apply the Spectral Norm [38] to all the layers in both the generator and the discriminator. The learning rates for the generator and discriminator are 0.0001 and 0.0004, respectively [17]. We use the ADAM solver [27] with β1 = 0 and β2 = 0.999. All the experiments are conducted on an NVIDIA DGX1 with 8 32GB V100 GPUs. We use synchronized BatchNorm, i.e., the batch statistics are collected from all the GPUs.

Datasets. We conduct experiments on several datasets.
• COCO-Stuff [4] is derived from the COCO dataset [32]. It has 118,000 training images and 5,000 validation images captured from diverse scenes. It has 182 semantic classes. Due to its vast diversity, existing image synthesis models perform poorly on this dataset.
• ADE20K [58] consists of 20,210 training and 2,000 validation images. Similarly to COCO, the dataset contains challenging scenes with 150 semantic classes.
• ADE20K-outdoor is a subset of the ADE20K dataset that only contains outdoor scenes, used in Qi et al. [43].
• The Cityscapes dataset [9] contains street scene images captured in German cities. The training and validation set sizes are 3,000 and 500, respectively. Recent work has achieved photorealistic semantic image synthesis results [43, 47] on the Cityscapes dataset.
• Flickr Landscapes. We collect 41,000 photos from Flickr and use 1,000 samples for the validation set. To avoid expensive manual annotation, we use a well-trained DeepLabV2 [5] to compute input segmentation masks.
We train the competing semantic image synthesis methods on the same training set and report their results on the same validation set for each dataset.

Performance metrics. We adopt the evaluation protocol from previous work [6, 48]. Specifically, we run a semantic segmentation model on the synthesized images and compare how well the predicted segmentation mask matches the ground truth input. Intuitively, if the output images are realistic, a well-trained semantic segmentation model should be able to predict the ground truth label. For measuring the segmentation accuracy, we use both the mean Intersection-
Label Ground Truth CRN [6] pix2pixHD [48] Ours

Figure 5: Visual comparison of semantic image synthesis results on the COCO-Stuff dataset. Our method successfully
synthesizes realistic details from semantic labels.
Label Ground Truth CRN [6] SIMS [43] pix2pixHD [48] Ours

Figure 6: Visual comparison of semantic image synthesis results on the ADE20K outdoor and Cityscapes datasets. Our
method produces realistic images while respecting the spatial semantic layout at the same time.
Method | COCO-Stuff (mIoU / accu / FID) | ADE20K (mIoU / accu / FID) | ADE20K-outdoor (mIoU / accu / FID) | Cityscapes (mIoU / accu / FID)
CRN [6] | 23.7 / 40.4 / 70.4 | 22.4 / 68.8 / 73.3 | 16.5 / 68.6 / 99.0 | 52.4 / 77.1 / 104.7
SIMS [43] | N/A / N/A / N/A | N/A / N/A / N/A | 13.1 / 74.7 / 67.7 | 47.2 / 75.5 / 49.7
pix2pixHD [48] | 14.6 / 45.8 / 111.5 | 20.3 / 69.2 / 81.8 | 17.4 / 71.6 / 97.8 | 58.3 / 81.4 / 95.0
Ours | 37.4 / 67.9 / 22.6 | 38.5 / 79.9 / 33.9 | 30.8 / 82.9 / 63.3 | 62.3 / 81.9 / 71.8

Table 1: Our method outperforms the current leading methods in semantic segmentation (mIoU and accu) and FID [17] scores on all the benchmark datasets. For the mIoU and accu, higher is better. For the FID, lower is better.

over-Union (mIoU) and the pixel accuracy (accu). We use the state-of-the-art segmentation networks for each dataset: DeepLabV2 [5, 40] for COCO-Stuff, UperNet101 [51] for ADE20K, and DRN-D-105 [53] for Cityscapes. In addition to the mIoU and the accu segmentation performance metrics, we use the Fréchet Inception Distance (FID) [17] to measure the distance between the distribution of synthesized results and the distribution of real images.
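The segmentation-based metrics can be computed from a confusion matrix accumulated over the validation set; the minimal sketch below is not tied to any particular segmentation network, and for simplicity it scores classes absent from the set as zero IoU.

```python
import numpy as np

def confusion(pred, gt, num_classes):
    # pred, gt: integer label arrays of the same shape.
    idx = gt.astype(np.int64) * num_classes + pred.astype(np.int64)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_accuracy(conf):
    # conf[i, j]: pixels with ground-truth class i predicted as class j, summed over all images.
    accu = np.diag(conf).sum() / conf.sum()                      # pixel accuracy
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    iou = np.diag(conf) / np.maximum(union, 1)
    return iou.mean(), accu                                      # (mIoU, accu)
```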
Baselines. We compare our method with three leading semantic image synthesis models: the pix2pixHD model [48], the cascaded refinement network (CRN) [6], and the semi-parametric image synthesis method (SIMS) [43]. The pix2pixHD is the current state-of-the-art GAN-based conditional image synthesis framework. The CRN uses a deep network that repeatedly refines the output from low to high resolution, while the SIMS takes a semi-parametric approach that composites real segments from a training set and refines the boundaries. Both the CRN and SIMS are mainly trained using an image reconstruction loss. For a fair comparison, we train the CRN and pix2pixHD models using the implementations provided by the authors. As image synthesis using the SIMS requires many queries to the training
Figure 7: Semantic image synthesis results on the Flickr Landscapes dataset. The images were generated from the semantic layouts of photographs on the Flickr website.

dataset, it is computationally prohibitive for a large dataset such as the COCO-Stuff and the full ADE20K. Therefore, we use the results provided by the authors when available.

Quantitative comparisons. As shown in Table 1, our method outperforms the current state-of-the-art methods by a large margin on all the datasets. For the COCO-Stuff, our method achieves an mIoU score of 35.2, which is about 1.5 times better than the previous leading method. Our FID is also 2.2 times better than the previous leading method. We note that the SIMS model produces a lower FID score but has poor segmentation performance on the Cityscapes dataset. This is because the SIMS synthesizes an image by first stitching image patches from the training dataset. As it uses real image patches, the resulting image distribution can better match the distribution of real images. However, because there is no guarantee that a perfect query (e.g., a person in a particular pose) exists in the dataset, it tends to copy objects that do not match the input segments.

Qualitative results. In Figures 5 and 6, we provide qualitative comparisons of the competing methods. We find that our method produces results with much better visual quality and fewer visible artifacts, especially for diverse scenes in the COCO-Stuff and ADE20K datasets. When the training dataset size is small, the SIMS model also renders images with good visual quality. However, the depicted content often deviates from the input segmentation mask (e.g., the shape of the swimming pool in the second row of Figure 6).

In Figures 7 and 8, we show more example results from the Flickr Landscapes and COCO-Stuff datasets. The proposed method can generate diverse scenes with high image fidelity. More results are included in the appendix.

Human evaluation. We use the Amazon Mechanical Turk (AMT) to compare the perceived visual fidelity of our method against existing approaches. Specifically, we give the AMT workers an input segmentation mask and two synthesis outputs from different methods and ask them to choose the output image that looks more like a corresponding image of the segmentation mask. The workers are given unlimited time to make the selection. For each comparison, we randomly generate 500 questions for each dataset, and each question is answered by 5 different workers. For quality control, only workers with a lifetime task approval rate greater than 98% can participate in our study.

Dataset | Ours vs. CRN | Ours vs. pix2pixHD | Ours vs. SIMS
COCO-Stuff | 79.76 | 86.64 | N/A
ADE20K | 76.66 | 83.74 | N/A
ADE20K-outdoor | 66.04 | 79.34 | 85.70
Cityscapes | 63.60 | 53.64 | 51.52

Table 2: User preference study. The numbers indicate the percentage of users who favor the results of the proposed method over those of the competing method.
Figure 8: Semantic image synthesis results on COCO-Stuff. Our method successfully generates realistic images in diverse
scenes ranging from animals to sports activities.
Method | #param | COCO-Stuff | ADE20K | Cityscapes
decoder w/ SPADE (Ours) | 96M | 35.2 | 38.5 | 62.3
compact decoder w/ SPADE | 61M | 35.2 | 38.0 | 62.5
decoder w/ Concat | 79M | 31.9 | 33.6 | 61.1
pix2pixHD++ w/ SPADE | 237M | 34.4 | 39.0 | 62.2
pix2pixHD++ w/ Concat | 195M | 32.9 | 38.9 | 57.1
pix2pixHD++ | 183M | 32.7 | 38.3 | 58.8
compact pix2pixHD++ | 103M | 31.6 | 37.3 | 57.6
pix2pixHD [48] | 183M | 14.6 | 20.3 | 58.3

Table 3: The mIoU scores are boosted when the SPADE is used, for both the decoder architecture (Figure 4) and the encoder-decoder architecture of pix2pixHD++ (our improved baseline over pix2pixHD [48]). On the other hand, simply concatenating the semantic input at every layer fails to do so. Moreover, our compact model with smaller depth at all layers outperforms all the baselines.

Method | COCO-Stuff | ADE20K | Cityscapes
segmap input | 35.2 | 38.5 | 62.3
random input | 35.3 | 38.3 | 61.6
kernel size 5x5 | 35.0 | 39.3 | 61.8
kernel size 3x3 | 35.2 | 38.5 | 62.3
kernel size 1x1 | 32.7 | 35.9 | 59.9
#params 141M | 35.3 | 38.3 | 62.5
#params 96M | 35.2 | 38.5 | 62.3
#params 61M | 35.2 | 38.0 | 62.5
Sync BatchNorm | 35.0 | 39.3 | 61.8
BatchNorm | 33.7 | 37.9 | 61.8
InstanceNorm | 33.9 | 37.4 | 58.7

Table 4: The SPADE generator works with different configurations. We change the input of the generator, the convolutional kernel size acting on the segmentation map, the capacity of the network, and the parameter-free normalization method. The settings used in the paper are boldfaced.

Table 2 shows the evaluation results. We find that users strongly favor our results on all the datasets, especially on the challenging COCO-Stuff and ADE20K datasets. For the Cityscapes, even when all the competing methods achieve high image fidelity, users still prefer our results.

Effectiveness of the SPADE. For quantifying the importance of the SPADE, we introduce a strong baseline called pix2pixHD++, which combines all the techniques we find useful for enhancing the performance of pix2pixHD except the SPADE. We also train models that receive the segmentation mask input at all the intermediate layers via feature concatenation in the channel direction, which is termed pix2pixHD++ w/ Concat. Finally, the model that combines the strong baseline with the SPADE is denoted as pix2pixHD++ w/ SPADE.

As shown in Table 3, the architectures with the proposed SPADE consistently outperform their counterparts, in both the decoder-style architecture described in Figure 4 and the more traditional encoder-decoder architecture used in the pix2pixHD. We also find that concatenating segmentation masks at all intermediate layers, a reasonable alternative to the SPADE, does not achieve the same performance as SPADE. Furthermore, the decoder-style SPADE generator works better than the strong baselines even with a smaller number of parameters.
Figure 9: Our model attains multimodal synthesis capability when trained with the image encoder. During deployment,
by using different random noise, our model synthesizes outputs with diverse appearances but all having the same semantic
layouts depicted in the input mask. For reference, the ground truth image is shown inside the input segmentation mask.
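The multimodal sampling in Figure 9 relies on the VAE-style image encoder described earlier: at training time the encoder predicts a mean and log-variance, the generator receives a reparameterized sample, and a KL penalty keeps the latent close to a standard normal. A hedged sketch of those two pieces (function names and the reduction choices are illustrative, not the released training code):

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I); z is fed to the SPADE generator as its input vector.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_loss(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions and averaged over the batch.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
```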

Variations of SPADE generator. Table 4 reports the performance of several variations of our generator. First, we compare two types of input to the generator, where one is random noise while the other is the downsampled segmentation map. We find that both of the variants render similar performance and conclude that the modulation by SPADE alone provides sufficient signal about the input mask. Second, we vary the type of parameter-free normalization layers before applying the modulation parameters. We observe that the SPADE works reliably across different normalization methods. Next, we vary the convolutional kernel size acting on the label map, and find that a kernel size of 1x1 hurts performance, likely because it prohibits utilizing the context of the label. Lastly, we modify the capacity of the generator by changing the number of convolutional filters. We present more variations and ablations in the appendix.

Multi-modal synthesis. In Figure 9, we show the multimodal image synthesis results on the Flickr Landscapes dataset. For the same input segmentation mask, we sample different noise inputs to achieve different outputs. More results are included in the appendix.

Semantic manipulation and guided image synthesis. In Figure 1, we show an application where a user draws different segmentation masks, and our model renders the corresponding landscape images. Moreover, our model allows users to choose an external style image to control the global appearance of the output image. We achieve this by replacing the input noise with the embedding vector of the style image computed by the image encoder.

5. Conclusion

We have proposed the spatially-adaptive normalization, which utilizes the input semantic layout while performing the affine transformation in the normalization layers. The proposed normalization leads to the first semantic image synthesis model that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes. We further demonstrate its application to multi-modal synthesis and guided image synthesis.

Acknowledgments. We thank Alexei A. Efros, Bryan Catanzaro, Andrew Tao, and Jan Kautz for insightful advice. We thank Chris Hebert, Gavriil Klimov, and Brad Nemire for their help in constructing the demo apps. Taesung Park contributed to the work during his internship at NVIDIA. His Ph.D. is supported by a Samsung Scholarship.
References converge to a local Nash equilibrium. In Advances in Neural
Information Processing Systems, 2017. 4, 5, 13
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gen-
[18] S. Hong, D. Yang, J. Choi, and H. Lee. Inferring seman-
erative adversarial networks. In International Conference on
tic layout for hierarchical text-to-image synthesis. In IEEE
Machine Learning (ICML), 2017. 3
Conference on Computer Vision and Pattern Recognition
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. (CVPR), 2018. 2
arXiv preprint arXiv:1607.06450, 2016. 2
[19] X. Huang and S. Belongie. Arbitrary style transfer in real-
[3] A. Brock, J. Donahue, and K. Simonyan. Large scale gan
time with adaptive instance normalization. In IEEE Inter-
training for high fidelity natural image synthesis. In Inter-
national Conference on Computer Vision (ICCV), 2017. 2,
national Conference on Learning Representations (ICLR),
3
2019. 1, 2
[20] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal
[4] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and
unsupervised image-to-image translation. European Confer-
stuff classes in context. In IEEE Conference on Computer
ence on Computer Vision (ECCV), 2018. 2, 3, 4
Vision and Pattern Recognition (CVPR), 2018. 2, 4
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
deep network training by reducing internal covariate shift.
A. L. Yuille. Deeplab: Semantic image segmentation with
In International Conference on Machine Learning (ICML),
deep convolutional nets, atrous convolution, and fully con-
2015. 2, 3
nected crfs. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence (TPAMI), 40(4):834–848, 2018. 4, 5 [22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-
image translation with conditional adversarial networks. In
[6] Q. Chen and V. Koltun. Photographic image synthesis with
IEEE Conference on Computer Vision and Pattern Recogni-
cascaded refinement networks. In IEEE International Con-
tion (CVPR), 2017. 1, 2, 3, 11, 12
ference on Computer Vision (ICCV), 2017. 1, 4, 5, 13, 14,
15, 16, 17, 18 [23] M. Johnson, G. J. Brostow, J. Shotton, O. Arandjelovic,
V. Kwatra, and R. Cipolla. Semantic photo synthesis. In
[7] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu.
Computer Graphics Forum, volume 25, pages 407–413,
Sketch2photo: internet image montage. ACM Transactions
2006. 1, 2
on Graphics (TOG), 28(5):124, 2009. 1, 2
[8] T. Chen, M. Lucic, N. Houlsby, and S. Gelly. On self mod- [24] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning
ulation for generative adversarial networks. In International Conference on Learning Representations, 2019. 2
[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 4
[10] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, 2017. 2
[11] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In International Conference on Learning Representations (ICLR), 2016. 2, 3
[12] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010. 12, 13
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014. 2
[14] J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM SIGGRAPH, 2007. 1
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 3
[16] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. 2001. 1, 2
[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[24] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 2
[25] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Manipulating attributes of natural scenes via hallucination. arXiv preprint arXiv:1808.07413, 2018. 2
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. 4
[28] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014. 2, 4, 11, 12
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012. 2
[30] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. In ACM Transactions on Graphics (TOG), volume 26, page 3. ACM, 2007. 1
[31] J. H. Lim and J. C. Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017. 3, 11
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 2, 4
[33] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017. 2
[34] X. Mao, Q. Li, H. Xie, Y. R. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 3, 11
[35] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Photosketch: A sketch based image query and compositing system. In ACM SIGGRAPH 2009 Talk Program, 2009. 1
[36] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), 2018. 2, 3, 11
[37] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
[38] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018. 3, 4, 11
[39] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning Representations (ICLR), 2018. 2, 3, 11
[40] K. Nakashima. Deeplab-pytorch. https://github.com/kazuto1011/deeplab-pytorch, 2018. 5
[41] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), 2017. 2
[42] E. Perez, H. De Vries, F. Strub, V. Dumoulin, and A. Courville. Learning visual reasoning without strong priors. In International Conference on Machine Learning (ICML), 2017. 2
[43] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 4, 5, 13, 17, 18
[44] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), 2016. 2
[45] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016. 2
[46] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. 2, 3
[47] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems, 2018. 1, 4
[48] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1, 3, 4, 5, 7, 11, 12, 13, 14, 15, 16, 17, 18
[49] X. Wang, K. Yu, C. Dong, and C. Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 606–615, 2018. 2
[50] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018. 2
[51] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In European Conference on Computer Vision (ECCV), 2018. 5
[52] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
[53] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5
[54] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning (ICML), 2019. 1, 2, 3, 11
[55] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 1, 2
[56] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018. 1
[57] B. Zhao, L. Meng, W. Yin, and L. Sigal. Image generation from layout. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
[58] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 4
[59] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 2
[60] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, 2017. 2, 3, 4
A. Additional Implementation Details

Generator. The architecture of the generator consists of a series of the proposed SPADE ResBlks with nearest neighbor upsampling. We train our network using 8 GPUs simultaneously and use the synchronized version of the BatchNorm. We apply the Spectral Norm [38] to all the convolutional layers in the generator. The architectures of the proposed SPADE and SPADE ResBlk are given in Figure 10 and Figure 11, respectively. The architecture of the generator is shown in Figure 12.

Figure 10: SPADE Design. The term 3x3-Conv-k denotes a 3-by-3 convolutional layer with k convolutional filters. The segmentation map is resized to match the resolution of the corresponding feature map using nearest-neighbor downsampling. [Diagram elements: Sync Batch Norm applied to the incoming features; the segmentation map goes through Resize (order=0), 3x3-Conv-128 with ReLU, and two 3x3-Conv-k branches that produce the modulation parameters.]

Figure 11: SPADE ResBlk(k). The residual block design largely follows that in Mescheder et al. [36] and Miyato et al. [39]. We note that for the case that the number of channels before and after the residual block is different, the skip connection is also learned (dashed box in the figure). [Diagram elements: SPADE, ReLU, and 3x3-Conv-k blocks arranged as a residual block with a learned skip connection.]
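The SPADE layer of Figure 10 and the residual block of Figure 11 can be summarized in a few dozen lines. The PyTorch-style sketch below is our own minimal illustration, not the released implementation: module and argument names (SPADE, SPADEResBlk, label_channels, hidden) are hypothetical, a plain BatchNorm2d stands in for the synchronized BatchNorm used across 8 GPUs, and the Spectral Norm that the paper applies to the generator convolutions is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal sketch of the SPADE layer in Figure 10 (hypothetical names)."""

    def __init__(self, num_features, label_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization of the incoming activations; the paper
        # uses a synchronized BatchNorm, approximated here by BatchNorm2d.
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # Shared 3x3-Conv-128 + ReLU applied to the resized segmentation map.
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Two 3x3-Conv-k branches producing the spatial modulation maps.
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        normalized = self.norm(x)
        # Nearest-neighbor resize of the segmentation map (Resize, order=0).
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        actv = self.shared(segmap)
        # Spatially-adaptive modulation: gamma(s) * normalized + beta(s).
        return self.gamma(actv) * normalized + self.beta(actv)


class SPADEResBlk(nn.Module):
    """Sketch of the SPADE residual block in Figure 11 (hypothetical names)."""

    def __init__(self, in_ch, out_ch, label_channels):
        super().__init__()
        # Intermediate width chosen as min(in, out); an assumption of this sketch.
        mid = min(in_ch, out_ch)
        self.norm_0 = SPADE(in_ch, label_channels)
        self.conv_0 = nn.Conv2d(in_ch, mid, kernel_size=3, padding=1)
        self.norm_1 = SPADE(mid, label_channels)
        self.conv_1 = nn.Conv2d(mid, out_ch, kernel_size=3, padding=1)
        # When the channel count changes, the skip connection is also learned
        # (the dashed box in Figure 11).
        self.learned_skip = in_ch != out_ch
        if self.learned_skip:
            self.norm_s = SPADE(in_ch, label_channels)
            self.conv_s = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        skip = self.conv_s(F.relu(self.norm_s(x, segmap))) if self.learned_skip else x
        h = self.conv_0(F.relu(self.norm_0(x, segmap)))
        h = self.conv_1(F.relu(self.norm_1(h, segmap)))
        return skip + h

Public implementations often apply the modulation as (1 + γ) · x̂ + β for stability; the sketch keeps the γ · x̂ + β form shown in the figure for readability.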
Figure 12: SPADE Generator. Different from prior image generators [22, 48], the semantic segmentation mask is passed to the generator through the proposed SPADE ResBlks in Figure 11. [Diagram: Linear(256, 16384) → Reshape(1024, 4, 4) → SPADE ResBlk(1024), Upsample(2) → SPADE ResBlk(1024), Upsample(2) → SPADE ResBlk(1024), Upsample(2) → SPADE ResBlk(512), Upsample(2) → SPADE ResBlk(256), Upsample(2) → SPADE ResBlk(128), Upsample(2) → SPADE ResBlk(64), Upsample(2) → 3x3Conv-3, Tanh.]
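Continuing the same sketch (and reusing its imports and the SPADEResBlk class), Figure 12 maps onto a simple stack of these blocks. The code below transcribes the figure literally, with seven SPADE ResBlks each followed by a ×2 nearest-neighbor upsampling starting from a 4×4 map; in practice the starting resolution and the number of upsampling stages would be chosen to match the target image size, and names such as SPADEGenerator and z_dim are again hypothetical.

class SPADEGenerator(nn.Module):
    """Sketch of the generator stack in Figure 12 (hypothetical names)."""

    def __init__(self, label_channels, z_dim=256):
        super().__init__()
        self.fc = nn.Linear(z_dim, 16384)          # Linear(256, 16384)
        widths = [1024, 1024, 1024, 512, 256, 128, 64]
        blocks, in_ch = [], 1024                   # Reshape(1024, 4, 4)
        for out_ch in widths:
            blocks.append(SPADEResBlk(in_ch, out_ch, label_channels))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        self.to_rgb = nn.Conv2d(64, 3, kernel_size=3, padding=1)  # 3x3Conv-3

    def forward(self, z, segmap):
        h = self.fc(z).view(-1, 1024, 4, 4)
        for blk in self.blocks:
            h = blk(h, segmap)
            h = F.interpolate(h, scale_factor=2, mode="nearest")  # Upsample(2)
        return torch.tanh(self.to_rgb(h))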
Discriminator. The architecture of the discriminator follows the one used in the pix2pixHD method [48], which uses a multi-scale design with InstanceNorm (IN). The only difference is that we apply the Spectral Norm to all the convolutional layers of the discriminator. The details of the discriminator architecture are shown in Figure 13.

Figure 13: Our discriminator design largely follows that in the pix2pixHD [48]. It takes the concatenation of the segmentation map and the image as input. It is based on the PatchGAN [22]. Hence, the last layer of the discriminator is a convolutional layer. [Diagram: Concat → 4x4-↓2-Conv-64, LReLU → 4x4-↓2-Conv-128, IN, LReLU → 4x4-↓2-Conv-256, IN, LReLU → 4x4-Conv-512, IN, LReLU → 4x4-Conv-1.]
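A single scale of the discriminator in Figure 13 is equally compact. The sketch below approximates, rather than reproduces, the pix2pixHD design it follows: the padding values and the leaky-ReLU slope are assumptions, and the multi-scale aspect (running the same network on downsampled copies of the input) is only noted in a comment.

import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of one scale of the Figure 13 discriminator (hypothetical names)."""

    def __init__(self, label_channels):
        super().__init__()
        sn = nn.utils.spectral_norm   # Spectral Norm on every conv, as in the text
        in_ch = label_channels + 3    # Concat of segmentation map and RGB image
        self.net = nn.Sequential(
            sn(nn.Conv2d(in_ch, 64, 4, stride=2, padding=1)), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(64, 128, 4, stride=2, padding=1)), nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(128, 256, 4, stride=2, padding=1)), nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(256, 512, 4, stride=1, padding=1)), nn.InstanceNorm2d(512), nn.LeakyReLU(0.2, True),
            sn(nn.Conv2d(512, 1, 4, stride=1, padding=1)),   # PatchGAN logits
        )
        # A pix2pixHD-style multi-scale discriminator would run several such
        # networks on progressively downsampled copies of the concatenated input.

    def forward(self, segmap, image):
        return self.net(torch.cat([segmap, image], dim=1))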
Image Encoder. The image encoder consists of 6 stride-2 convolutional layers followed by two linear layers to produce the mean and variance of the output distribution, as shown in Figure 14.

Figure 14: The image encoder consists of a series of convolutional layers with stride 2 followed by two linear layers that output a mean vector µ and a variance vector σ. [Diagram: 3x3-↓2-Conv-64, IN, LReLU → 3x3-↓2-Conv-128, IN, LReLU → 3x3-↓2-Conv-256, IN, LReLU → 3x3-↓2-Conv-512, IN, LReLU → 3x3-↓2-Conv-512, IN, LReLU → 3x3-↓2-Conv-512, IN, LReLU → Reshape(8192, 1, 1) → Linear(256), Linear(256) → µ, σ².]
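The image encoder of Figure 14 follows the same pattern. In this sketch the class and attribute names are invented, the leaky-ReLU slope is an assumption, and the second linear head is treated as a log-variance, which is how a "variance vector" is usually parameterized in practice.

import torch.nn as nn

class ImageEncoder(nn.Module):
    """Sketch of the image encoder in Figure 14 (hypothetical names)."""

    def __init__(self, z_dim=256):
        super().__init__()
        widths = [64, 128, 256, 512, 512, 512]
        layers, in_ch = [], 3
        for out_ch in widths:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.InstanceNorm2d(out_ch),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # For a 256x256 input, six stride-2 convs leave a 4x4 map: 512 * 4 * 4 = 8192.
        self.fc_mu = nn.Linear(8192, z_dim)
        self.fc_var = nn.Linear(8192, z_dim)

    def forward(self, image):
        h = self.features(image).flatten(1)    # Reshape(8192, 1, 1)
        return self.fc_mu(h), self.fc_var(h)   # mean and (log-)variance vectors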
Learning objective. We use the learning objective function in the pix2pixHD work [48] except that we replace its LSGAN loss [34] term with the Hinge loss term [31, 38, 54]. We use the same weighting among the loss terms in the objective function as that in the pix2pixHD work.

When training the proposed framework with the image encoder for multi-modal synthesis and style-guided image synthesis, we include a KL Divergence loss:

    L_KLD = D_KL( q(z|x) || p(z) ),

where the prior distribution p(z) is a standard Gaussian distribution and the variational distribution q is fully determined by a mean vector and a variance vector [28]. We use the reparameterization trick [28] for back-propagating the gradient from the generator to the image encoder. The weight for the KL Divergence loss is 0.05.
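The two loss ingredients discussed above, the Hinge adversarial term and the weighted KL term, follow directly from their definitions. The snippet below is a schematic sketch: the 0.05 weight and the overall structure come from the text, while the function names and the assumption that the encoder outputs a log-variance are ours, and the perceptual and feature-matching terms inherited from pix2pixHD are only indicated in a comment.

import torch
import torch.nn.functional as F

def kl_loss(mu, logvar):
    # L_KLD = D_KL(q(z|x) || p(z)) for a diagonal Gaussian q and a standard Gaussian prior p.
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sampling differentiable [28].
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def hinge_loss_d(real_logits, fake_logits):
    # Hinge objective for the discriminator (the term that replaces the LSGAN loss).
    return torch.mean(F.relu(1.0 - real_logits)) + torch.mean(F.relu(1.0 + fake_logits))

def hinge_loss_g(fake_logits):
    # Hinge objective for the generator.
    return -torch.mean(fake_logits)

# Schematic generator loss (0.05 is the KL weight stated above; the perceptual and
# GAN feature-matching terms inherited from pix2pixHD are omitted here):
# loss_G = hinge_loss_g(fake_logits) + 0.05 * kl_loss(mu, logvar) + ...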
In Figure 15, we overview the training data flow. The image encoder encodes a real image to a mean vector and a variance vector. They are used to compute the noise input to the generator via the reparameterization trick [28]. The generator also takes the segmentation mask of the input image as input through the proposed SPADE ResBlks. The discriminator takes the concatenation of the segmentation mask and the output image from the generator as input and aims to classify it as fake.

Figure 15: The image encoder encodes a real image to a latent representation for generating a mean vector and a variance vector. They are used to compute the noise input to the generator via the reparameterization trick [28]. The generator also takes the segmentation mask of the input image as input via the proposed SPADE ResBlks. The discriminator takes the concatenation of the segmentation mask and the output image from the generator as input and aims to classify it as fake. [Diagram: real image → Image Encoder → noise; noise + segmentation mask → Generator; segmentation mask + generated image → Concat → Discriminator.]

Training details. We perform 200 epochs of training on the Cityscapes and ADE20K datasets, 100 epochs of training on the COCO-Stuff dataset, and 50 epochs of training on the Flickr Landscapes dataset. The image sizes are 256 × 256, except for Cityscapes at 512 × 256. We linearly decay the learning rate to 0 from epoch 100 to 200 for the Cityscapes and ADE20K datasets. The batch size is 32. We initialize the network weights using the Glorot initialization [12].
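The training details above (linear learning-rate decay after epoch 100 and Glorot initialization) translate into a few lines of standard PyTorch. The optimizer choice and the base learning rate in the usage comment are placeholders, not values reported in the paper.

import torch
import torch.nn as nn

def init_weights_glorot(module):
    # Glorot/Xavier initialization [12] for convolutional and linear layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def linear_decay(epoch, total_epochs=200, decay_start=100):
    # Constant learning rate up to epoch 100, then linear decay to 0 at epoch 200
    # (the Cityscapes / ADE20K schedule described above).
    if epoch < decay_start:
        return 1.0
    return max(0.0, (total_epochs - epoch) / float(total_epochs - decay_start))

# Hypothetical usage:
# model.apply(init_weights_glorot)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder learning rate
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
# scheduler.step() is then called once per epoch.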
B. Additional Ablation Study

Method                                                  COCO-Stuff   ADE20K   Cityscapes
Ours                                                    35.2         38.5     62.3
Ours w/o Perceptual loss                                24.7         30.1     57.4
Ours w/o GAN feature matching loss                      33.2         38.0     62.2
Ours w/ a deeper discriminator                          34.9         38.3     60.9
pix2pixHD++ w/ SPADE                                    34.4         39.0     62.2
pix2pixHD++                                             32.7         38.3     58.8
pix2pixHD++ w/o Sync BatchNorm                          27.4         31.8     51.1
pix2pixHD++ w/o Sync BatchNorm and w/o Spectral Norm    26.0         31.9     52.3
pix2pixHD [48]                                          14.6         20.3     58.3

Table 5: Additional ablation study results using the mIoU metric: the table shows that both the perceptual loss and GAN feature matching loss terms are important. Making the discriminator deeper does not lead to a performance boost. The table also shows that all the components (Synchronized BatchNorm, Spectral Norm, TTUR, the Hinge loss objective, and the SPADE) used in the proposed method help our strong baseline, pix2pixHD++.

Table 5 provides additional ablation study results analyzing the contribution of individual components in the proposed method. We first find that both the perceptual loss and the GAN feature matching loss inherited from the learning objective function of the pix2pixHD [48] are important. Removing either of them leads to a performance drop. We also find that increasing the depth of the discriminator by inserting one more convolutional layer at the top of the pix2pixHD discriminator does not improve the results.

In Table 5, we also analyze the effectiveness of each component used in our strong baseline, the pix2pixHD++ method, derived from the pix2pixHD method. We found that the Spectral Norm, synchronized BatchNorm, TTUR [17], and the hinge loss objective all contribute to the performance boost. Adding the SPADE to the strong baseline further improves the performance. Note that the pix2pixHD++ w/o Sync BatchNorm and w/o Spectral Norm variant still differs from the pix2pixHD in that it uses the hinge loss objective, TTUR, a large batch size, and the Glorot initialization [12].

C. Additional Results

In Figures 16, 17, and 18, we show additional synthesis results from the proposed method on the COCO-Stuff and ADE20K datasets with comparisons to those from the CRN [6] and pix2pixHD [48] methods.

In Figures 19 and 20, we show additional synthesis results from the proposed method on the ADE20K-outdoor and Cityscapes datasets with comparisons to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods.

In Figure 21, we show additional multi-modal synthesis results from the proposed method. By sampling different z from a standard multivariate Gaussian distribution, we synthesize images of diverse appearances.

In the accompanying video, we demonstrate our semantic image synthesis interface. We show how a user can create photorealistic landscape images by painting semantic labels on a canvas. We also show how a user can synthesize images of diverse appearances for the same semantic segmentation mask as well as transfer the appearance of a provided style image to the synthesized one.
Figure 16: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the COCO-Stuff dataset. [Image grid columns: Label, Ground Truth, CRN, pix2pixHD, Ours.]
Figure 17: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the COCO-Stuff dataset. [Image grid columns: Label, Ground Truth, CRN, pix2pixHD, Ours.]
Figure 18: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the ADE20K dataset. [Image grid columns: Label, Ground Truth, CRN, pix2pixHD, Ours.]
Figure 19: Additional results with comparison to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods on the ADE20K-outdoor dataset. [Image grid columns: Label, Ground Truth, CRN, SIMS, pix2pixHD, Ours.]
Figure 20: Additional results with comparison to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods on the Cityscapes dataset. [Image grid: for each example, Label, Ground Truth, and Ours on one row; CRN, SIMS, and pix2pixHD on the other.]
Figure 21: Additional multi-modal synthesis results on the Flickr Landscapes Dataset. By sampling latent vectors from a standard Gaussian distribution, we synthesize images of diverse appearances. [Image grid columns: Label, Ground Truth, Multi-modal results.]
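Figure 21 is the visual counterpart of the KL and reparameterization setup in Appendix A: at test time the segmentation mask is held fixed and only the latent vector is redrawn. A hypothetical usage sketch, assuming a generator with the interface of the Appendix A listings:

import torch

def sample_diverse_images(generator, segmap, num_samples=5, z_dim=256):
    # Fix the segmentation mask and redraw z ~ N(0, I) to obtain diverse outputs,
    # as in Figure 21 (generator is assumed to follow the Appendix A sketch).
    generator.eval()
    with torch.no_grad():
        return [generator(torch.randn(1, z_dim), segmap) for _ in range(num_samples)]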
