Figure 1: Speech-to-gesture translation example. In this paper, we study the connection between conversational gesture and speech.
Here, we show the result of our model that predicts gesture from audio. From the bottom upward: the input audio, arm and hand pose
predicted by our model, and video frames synthesized from pose predictions using [10]. (See http://people.eecs.berkeley.edu/~shiry/speech2gesture for video results.)
Almaram Angelica Kubinec Covach Kagan
Figure 2: Speaker-specific gesture dataset. We show a representative video frame for each speaker in our dataset. Below each one is a
heatmap depicting the frequency that their arms and hands appear in different spatial locations (using the skeletal representation of gestures
shown in Figure 1). This visualization reveals the speaker’s resting pose, and how they tend to move—for example, Angelica tends to keep
her hands folded, whereas Kubinec frequently points towards the screen with his left hand. Note that some speakers, like Kagan, Conan
and Ellen, alternate between sitting and standing and thus the distribution of their arm positions is bimodal.
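A heatmap of the kind described in the Figure 2 caption could be computed roughly as follows. This is a minimal sketch, not the authors' code: the function name `keypoint_heatmap`, the `(T, K, 2)` array layout, and the bin count are assumptions made for illustration.

```python
import numpy as np


def keypoint_heatmap(keypoints, frame_size, bins=(32, 32)):
    """Relative frequency of keypoint positions over a video.

    keypoints:  array of shape (T, K, 2) with (x, y) pixel coordinates
                for T frames and K arm/hand keypoints (assumed layout).
    frame_size: (width, height) of the video frame in pixels.
    bins:       (rows, cols) resolution of the heatmap.
    """
    xs = keypoints[..., 0].ravel()
    ys = keypoints[..., 1].ravel()
    width, height = frame_size
    # Rows index y and columns index x, matching image conventions.
    counts, _, _ = np.histogram2d(
        ys, xs, bins=bins, range=[[0, height], [0, width]]
    )
    total = counts.sum()
    return counts / total if total > 0 else counts
```

A bimodal speaker (one who alternates between sitting and standing) would show two separated high-frequency regions in the returned array.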
... (Figure 1 bottom), we generate a corresponding motion of the speaker’s arms and hands which matches the style of the speaker, despite the fact that we have never seen or heard this person say this utterance in training (Figure 1 middle). We then use an existing video synthesis method to visualize what the speaker might have looked like when saying these words (Figure 1 top).
To generate motion from speech, we must learn a mapping between audio and pose. While this can be formulated as translation, in practice there are two inherent challenges to using the natural pairing of audio-visual data in this setting. First, gesture and speech are asynchronous, as gesture can appear before, after or during the corresponding utterance [4]. Second, this is a multimodal prediction task as speakers may perform different gestures while saying the same thing on different occasions. Moreover, acquiring human annotations for large amounts of video is infeasible. We therefore need to get a training signal from pseudo ground truth of 2D human pose detections on unlabeled video.
Nevertheless, we are able to translate speech to gesture in an end-to-end fashion from the raw audio to a sequence of poses. To overcome the asynchronicity issue we use a large temporal context (both past and future) for prediction. Temporal context also allows for smooth gesture prediction despite the noisy automatically-annotated pseudo ground truth. Due to multimodality, we do not expect our predicted motion to be the same as the ground truth. However, as this is the only training signal we have, we still use automatic pose detections for learning through regression. To avoid regressing to the mean of all modes, we apply an adversarial discriminator [19] to our predicted motion. This ensures that we produce motion that is “real” with respect to the current speaker.
Gesture is idiosyncratic [34], as different speakers tend to use different styles of motion (see Figure 2). It is therefore important to learn a personalized gesture model for each speaker. To address this, we present a large, 144-hour person-specific video dataset of 10 speakers that we make publicly available.¹ We deliberately pick a set of speakers for which we can find hours of clean single-speaker footage. Our speakers come from a diverse set of backgrounds: television show hosts, university lecturers and televangelists. They span at least three religions and discuss a large range of topics from commentary on current affairs through the philosophy of death, chemistry and the history of rock music, to readings in the Bible and the Qur’an.
¹ http://people.eecs.berkeley.edu/~shiry/speech2gesture
2. Related Work
Conversational Gestures McNeill [34] divides gestures into several classes: emblematics have specific conventional meanings (e.g. “thumbs up!”); iconics convey physical shapes or direction of movements; metaphorics describe abstract content using concrete motion; deictics are pointing gestures, and beats are repetitive, fast hand motions that provide a temporal framing to speech.
Many psychologists have studied questions related to co-speech gestures [34, 23] (see [46] for a review). This vast body of research has mostly relied on studying a small number of individual subjects using recorded choreographed story retelling in lab settings. Analysis in these studies was a manual process. Our goal, instead, is to study conversational gestures in the wild using a data-driven approach.
Conditioning gesture prediction on speech is arguably an ambiguous task, since gesture and speech may not be synchronous. While McNeill [34] suggests that gesture and speech originate from a common source and thus should co-occur in time according to well-defined rules, Kendon [23] suggests that gesture starts before the corresponding utterance. Others even argue that the temporal relationships between speech and gesture are not yet clear and that gesture can appear before, after or during an utterance [4].
Sign language and emblematic gesture recognition There has been a great deal of computer vision work geared towards recognizing sign language gestures from video. This includes methods that use video transcripts as a weak source of supervision [3], as well as recent methods based on CNNs [37, 26] and RNNs [13]. There has also been work that recognizes emblematic hand and face gestures [17, 14], head gestures [35], and co-speech gestures [38]. By contrast, our goal is to predict co-speech gestures from audio.
Conversational agents Researchers have proposed a number of methods for generating plausible gestures, particularly for applications with conversational agents [8]. In early work, Cassell et al. [7] proposed a system that guided arm/hand motions based on manually defined rules. Subsequent rule-based systems [27] proposed new ways of expressing gestures via annotations.
More closely related to our approach are methods that learn gestures from speech and text, without requiring an author to hand-specify rules. Notably, [9] synthesized gestures using natural language processing of spoken text, and Neff [36] proposed a system for making person-specific gestures. Levine et al. [30] learned to map acoustic prosody features to motion using an HMM. Later work [29] extended this approach to use reinforcement learning and speech recognition, combined acoustic analysis with text [33], created hybrid rule-based systems [40], and used restricted Boltzmann machines for inference [11]. Since the goal of these methods is to generate motions for virtual agents, they use lab-recorded audio, text, and motion capture. This allows them to use simplifying assumptions that present challenges for in-the-wild video analysis like ours: e.g., [30] requires precise 3D pose and assumes that motions occur on syllable boundaries, and [11] assumes that gestures are initiated by an upward motion of the wrist. In contrast with these methods, our approach does not explicitly use any text or language information during training (it learns gestures from raw audio-visual correspondences), nor does it use hand-defined gesture categories: arm/hand pose are predicted directly from audio.
Visualizing predicted gestures One of the most common ways of visualizing gestures is to use them to animate a 3D avatar [45, 29, 20]. Since our work studies personalized gestures for in-the-wild videos, where 3D data is not available, we use a data-driven synthesis approach inspired by Bregler et al. [2]. To do this, we employ the pose-to-video method of Chan et al. [10], which uses a conditional generative adversarial network (GAN) to synthesize videos of human bodies from pose.
Sound and vision Aytar et al. [1] use the synchronization of visual and audio signals in natural phenomena to learn sound representations from unlabeled in-the-wild videos. To do this, they transfer knowledge from trained discriminative models in the visual domain to the audio domain. Synchronization of audio and visual features can also be used for synthesis. Langlois et al. [28] try to optimize for synchronous events by generating rigid-body animations of objects falling or tumbling that temporally match an input sound wave of the desired sequence of contact events with the ground plane. More recently, Shlizerman et al. [42] animated the hands of a 3D avatar according to input music. However, their focus was on music performance, rather than gestures, and consequently the space of possible motions was limited (e.g., the zig-zag motion of a violin bow). Moreover, while music is uniquely defined by the motion that generates it (and is synchronous with it), gestures are neither unique to, nor synchronous with speech utterances.
Several works have focused on the specific task of synthesizing videos of faces speaking, given audio input. Chung et al. [12] generate an image of a talking face from a still image of the speaker and an input speech segment by learning a joint embedding of the face and audio. Similarly, [44] synthesizes videos of Obama saying novel words by using a recurrent neural network to map speech audio to mouth shapes and then embedding the synthesized lips in ground truth facial video. While both methods enable the creation of fake content by generating faces saying words taken from a different person, we focus on single-person models that are optimized for animating same-speaker utterances. Most importantly, generating gesture, rather than
lip motion, from speech is more involved as gestures are asynchronous with speech, multimodal and person-specific.
3. A Speaker-Specific Gesture Dataset
We introduce a large 144-hour video dataset specifically tailored to studying speech and gesture of individual speakers in a data-driven fashion. As shown in Figure 2, our dataset contains in-the-wild videos of 10 gesturing speakers that were originally recorded for television shows or university lectures. We collect several hours of video per speaker, so that we can individually model each one. We chose speakers that cover a wide range of topics and gesturing styles. Our dataset contains: 5 talk show hosts, 3 lecturers and 2 televangelists. Details about data collection and processing as well as an analysis of the individual styles of gestures can be found in the supplementary material.
Gesture representation and annotation We represent the speakers’ pose over time using a temporal stack of 2D skeletal keypoints, which we obtain using OpenPose [5]. From the complete set of keypoints detected by OpenPose, we use the 49 points corresponding to the neck, shoulders, elbows, wrists and hands to represent gestures. Together with the video footage, we provide the skeletal keypoints for each frame of the data at 15 fps. Note, however, that these are not ground truth annotations, but a proxy for the ground truth from a state-of-the-art pose detection system.
Quality of dataset annotations All ground truth, whether from human observers or otherwise, has associated error. The pseudo ground truth we collect using automatic pose detection may have much larger error than human annotations, but it enables us to train on much larger amounts of data. Still, we must estimate whether the accuracy of the pseudo ground truth is good enough to support our quantitative conclusions. We compare the automatic pose detections to labels obtained from human observers on a subset of our training data and find that the pseudo ground truth is close to human labels and that the error in the pseudo ground truth is small enough for our task. The full experiment is detailed in our supplementary material.
4. Method
Given raw audio of speech, our goal is to generate the speaker’s corresponding arm and hand gesture motion. We approach this task in two stages. First, since the only signal we have for training are corresponding audio and pose detection sequences, we learn a mapping from speech to gesture using L1 regression to temporal stacks of 2D keypoints. Second, to avoid regressing to the mean of all possible modes of gesture, we employ an adversarial discriminator that ensures that the motion we produce is plausible with respect to the typical motion of the speaker.
[Figure 3 diagram. Labels: Audio (axes: Frequency, Time); G; G(t1), ..., G(tT); L1 regression loss; D; Real or Fake Motion Sequence?]
Figure 3: Speech to gesture translation model. A convolutional audio encoder downsamples the 2D spectrogram and transforms it to a 1D signal. The translation model, G, then predicts a corresponding temporal stack of 2D poses. L1 regression to the ground truth poses provides a training signal, while an adversarial discriminator, D, ensures that the predicted motion is both temporally coherent and in the style of the speaker.
4.1. Speech-to-Gesture Translation
Any realistic gesture motion must be temporally coherent and smooth. We accomplish smoothness by learning an audio encoding which is a representation of the whole utterance, taking into account the full temporal extent of the input speech, s, and predicting the whole temporal sequence of corresponding poses, p, at once (rather than recurrently).
Our fully convolutional network consists of an audio encoder followed by a 1D UNet [39, 22] translation architecture, as shown in Figure 3. The audio encoder takes a 2D log-mel spectrogram as input, and downsamples it through a series of convolutions, resulting in a 1D signal with the same sampling rate as our video (15 Hz). The UNet translation architecture then learns to map this signal to a temporal stack of pose vectors (see Section 3 for details of our gesture representation) via an L1 regression loss:
$$\mathcal{L}_{L_1}(G) = \mathbb{E}_{s,p}\left[\lVert p - G(s) \rVert_1\right]. \tag{1}$$
We use a UNet architecture for translation since its bottleneck provides the network with past and future temporal context, while the skip connections allow for high frequency temporal information to flow through, enabling prediction of fast motion.
4.2. Predicting Plausible Motion
While L1 regression to keypoints is the only way we can extract a training signal from our data, it suffers from the known issue of regression to the mean which produces overly smooth motion. This can be seen in our supplementary video results. To combat the issue and ensure that we produce realistic motion, we add an adversarial discriminator [22, 10] D, conditioned on the difference of the predicted sequence of poses. That is, the input to the discriminator is the vector $m = [p_2 - p_1, \ldots, p_T - p_{T-1}]$ where $p_i$ are 2D pose keypoints and T is the temporal extent of the input audio and predicted pose sequence. The discriminator D tries
to maximize the following objective while the generator G (translation architecture, Section 4.1) tries to minimize it:
$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{m}\left[\log D(m)\right] + \mathbb{E}_{s}\left[\log\left(1 - D(G(s))\right)\right], \tag{2}$$
where s is the input audio speech segment and m is the motion derivative of the predicted stack of poses. Thus, the generator learns to produce real-seeming speaker motion while the discriminator learns to classify whether a given motion sequence is real. Our full objective is therefore:
$$\min_{G} \max_{D} \; \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{L_1}(G). \tag{3}$$
4.3. Implementation Details
We obtain translation invariance by subtracting (per frame) the neck keypoint location from all other keypoints in our pseudo ground truth gesture representation (Section 3). We then normalize each keypoint (e.g. left wrist) across all frames by subtracting the per-speaker mean and dividing by the standard deviation. During training, we take as input spectrograms corresponding to about 4 seconds of audio and predict 64 pose vectors, which correspond to about 4 seconds at a 15 Hz frame rate. At test time we can run our network on arbitrary audio durations. We optimize using Adam [24] with a batch size of 32 and a learning rate of $10^{-4}$. We train for 300K/90K iterations with and without an adversarial loss, respectively, and select the best performing model on the validation set.
5. Experiments
We show that our method produces motion that quantitatively outperforms several baselines, as well as a previous method that we adapt to the problem.
5.1. Setup
We describe our experimental setup including our baselines for comparison and evaluation metric.
5.1.1 Baselines
We compare our method to several other models.
Always predict the median pose Speakers spend most of their time in rest position [23], so predicting the speaker’s median pose can be a high-quality baseline. For a visualization of each speaker’s rest position, see Figure 2.
Predict a randomly chosen gesture In this baseline, we randomly select a different gesture sequence (which does not correspond to the input utterance) from the training set of the same speaker, and use this as our prediction. While we would not expect this method to perform well quantitatively, there is reason to think it would generate qualitatively appealing motion: these are real speaker gestures, and the only way to tell they are fake is to evaluate how well they correspond to the audio.
Nearest neighbors Instead of selecting a completely random gesture sequence from the same speaker, we can use audio as a similarity cue. For an input audio track, we find its nearest neighbor for the speaker using pretrained audio features, and transfer its corresponding motion. To represent the audio, we use the state-of-the-art VGGish feature embedding [21] pretrained on AudioSet [18], and use cosine distance on normalized features.
RNN-based model [42] We further compare our motion prediction to an RNN architecture proposed by Shlizerman et al. Similar to us, Shlizerman et al. predict arm and hand motion from audio in a 2D skeletal keypoint space. However, while our model is a convolutional neural network with log-mel spectrogram input, theirs uses a 1-layer LSTM model that takes MFCC features (a low-dimensional, hand-crafted audio feature representation) as input. We evaluated both feature types and found that for [42], MFCC features outperform the log-mel spectrogram features on all speakers. We therefore use their original MFCC features in our experiments. For consistency with our own model, instead of measuring L2 distance on PCA features, as they do, we add an extra hidden layer and use L1 distance.
Ours, no GAN Finally, as an ablation, we compare our full model to the prediction of the translation architecture alone, without the adversarial discriminator.
5.1.2 Evaluation Metrics
Our main quantitative evaluation metric is the L1 regression loss of the different models in comparison. We additionally report results according to the percent of correct keypoints (PCK) [47], a widely accepted metric for pose detection. Here, a predicted keypoint is defined as correct if it falls within α max(h, w) pixels of the ground truth keypoint, where h and w are the height and width of the person bounding box, respectively.
We note that PCK was designed for localizing object parts, whereas we use it here for a cross-modal prediction task (predicting pose from audio). First, unlike L1, PCK is not linear and correctness scores fall to zero outside a hard threshold. Since our goal is not to predict the ground truth motion but rather to use it as a training signal, L1 is more suited to measuring how we perform on average. Second, PCK is sensitive to large gesture motion as the correctness radius depends on the width of the span of the speaker’s arms. While [47] suggest α = 0.1 for data with full people and α = 0.2 for data where only half the person is visible, we take an average over α = 0.1, 0.2 and show the full results in the supplementary.
5.2. Quantitative Evaluation
We compare the results of our method to the baselines using our quantitative metrics. To assess whether our results are perceptually convincing, we conduct a user study. Finally, we ask whether the gestures we predict are person-specific and whether the input speech is indeed a better predictor of motion than the initial pose of the gesture.
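The two evaluation metrics of Section 5.1.2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' evaluation code: the function names, the `(T, K, 2)` array layout, the averaging convention for L1, and the use of the reference keypoints' extent as a stand-in for the person bounding box are all assumptions.

```python
import numpy as np


def l1_loss(pred, gt):
    """Mean absolute error between predicted and reference keypoints.

    pred, gt: arrays of shape (T, K, 2), i.e. T frames of K 2D keypoints.
    Averaging over all coordinates is an assumed normalization.
    """
    return float(np.mean(np.abs(pred - gt)))


def pck(pred, gt, alpha):
    """Percent of correct keypoints for one sequence.

    A keypoint is correct when its distance to the reference keypoint is
    at most alpha * max(h, w). Here h and w are taken from the per-frame
    extent of the reference keypoints (an assumed proxy for the person
    bounding box).
    """
    h = gt[..., 1].max(axis=1) - gt[..., 1].min(axis=1)  # shape (T,)
    w = gt[..., 0].max(axis=1) - gt[..., 0].min(axis=1)  # shape (T,)
    radius = alpha * np.maximum(h, w)                     # shape (T,)
    dist = np.linalg.norm(pred - gt, axis=-1)             # shape (T, K)
    return float(100.0 * np.mean(dist <= radius[:, None]))


def mean_pck(pred, gt, alphas=(0.1, 0.2)):
    """Average PCK over several thresholds, as in the reported results."""
    return float(np.mean([pck(pred, gt, a) for a in alphas]))
```

The hard threshold in `pck` makes the non-linearity concrete: a keypoint just outside the radius contributes nothing, whereas `l1_loss` degrades smoothly with distance.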
Table 1: Quantitative results for the speech to gesture translation task using L1 loss (lower is better) on the test set. The rightmost column is the average PCK value (higher is better) over all speakers and α = 0.1, 0.2 (see full results in supplementary).
Median          12.1 ± 2.8    6.7 ± 2.0   34.0 ± 4.2   25.8 ± 3.9
Random          34.2 ± 4.0   29.1 ± 3.7   40.9 ± 4.6   34.3 ± 4.4
NN [21]         36.9 ± 3.9   26.4 ± 3.8   43.5 ± 4.5   33.3 ± 4.4
RNN [42]        18.2 ± 3.2   10.0 ± 2.5   37.5 ± 4.6   19.4 ± 3.6
Ours, no GAN    25.0 ± 3.8   19.8 ± 3.4   36.1 ± 4.3   33.1 ± 4.2
Ours, GAN       35.4 ± 4.0   27.8 ± 3.9   33.2 ± 4.4   22.0 ± 4.0
Table 2: Human study results for the speech to gesture translation task on 4 and 12-second video clips of two speakers: one dynamic (Oliver) and one relatively stationary (Meyers). As a metric for comparison, we use the percentage of times participants were fooled by the generated motions and picked them as real over the ground truth motion in a two-alternative forced choice. We found that humans were not sensitive to the alignment of speech and gesture. For the dynamic speaker, gestures with realistic motion, whether randomly selected from another video of the same speaker or generated by our GAN-based model, fooled humans at equal rates (no statistically significant difference between the bolded numbers). Since the stationary speaker is usually at rest position, real unaligned motion sequences look more realistic as they do not suffer from prediction noise like the generated ones.
Group   Model                            Avg. L1   Avg. PCK
Pred.   Predict the median pose          0.73      38.11
Pred.   Predict the input initial pose   0.53      60.50
Input   Speech input                     0.67      44.62
Input   Initial pose input               0.49      61.24
Input   Speech & initial pose input      0.47      62.39
Table 3: How much information does sound provide once we know the initial pose of the speaker? We see that the initial pose of the gesture sequence is a good predictor for the rest of the 4-second motion sequence (second to last row), but that adding audio improves the prediction (last row). We use both average L1 loss (lower is better) and average PCK over all speakers and α = 0.1, 0.2 (higher is better) as metrics of comparison. We compare two baselines and three conditions of inputs.
... pose, a model that simply repeats the input initial ground-truth pose as its prediction. Speech input, our model. Initial pose input, a variation of our model in which the audio input is ablated and the network predicts the future pose from only an initial ground-truth pose input, and Speech & initial pose input, where we condition the prediction on both the speech and the initial pose.
Table 3 displays the results of the comparison for our model trained without the adversarial discriminator (no GAN). When comparing the Initial pose input and Speech & initial pose input conditions, we find that the addition of speech significantly improves accuracy when we average the loss across all speakers ($p < 10^{-3}$ using a two-sided t-test). Interestingly, we find that most of the gains come from a small number of speakers (e.g. Oliver) who make large motions during speech.
5.3. Qualitative Results
We qualitatively compare our speech to gesture translation results to the baselines and the ground truth gesture sequences in Figure 5. Please refer to our supplementary video results which better convey temporal information.
Figure 5: Speech to gesture translation qualitative results. We show the input audio spectrogram and the predicted poses overlaid on the ground-truth video for Dr. Kubinec (lecturer) and Conan O’Brien (show host). See our supplementary material for more results.
6. Conclusion
Humans communicate through both sight and sound, yet the connection between these modalities remains unclear [23]. In this paper, we proposed the task of predicting person-specific gestures from “in-the-wild” speech as a computational means of studying the connections between these communication channels. We created a large person-specific video dataset and used it to train a model for predicting gestures from speech. Our model outperforms other methods in an experimental evaluation.
Despite its strong performance on these tasks, our model has limitations that can be addressed by incorporating insights from other work. For instance, using audio as input has its benefits compared to using textual transcriptions as audio is a rich representation that contains information about prosody, intonation, rhythm, tone and more. However, audio does not directly encode high-level language semantics that may allow us to predict certain types of gesture (e.g. metaphorics), nor does it separate the speaker’s speech from other sounds (e.g. audience laughter). Additionally, we treat pose estimations as though they were ground truth, which introduces a significant amount of noise, particularly on the speakers’ fingers.
We see our work as a step toward a computational analysis of conversational gesture, one that opens three possible directions for further research. The first is in using gestures as a representation for video analysis: co-speech hand and arm motion make a natural target for video prediction tasks. The second is using in-the-wild gestures as a way of training conversational agents: we presented one way of visualizing gesture predictions, based on GANs [10], but, following classic work [8], these predictions could also be used to drive the motions of virtual agents. Finally, our method is one of only a handful of initial attempts to predict motion from audio. This cross-modal translation task is fertile ground for further research.
Acknowledgements: This work was supported, in part, by the AWS Cloud Credits for Research and the DARPA MediFor programs, and the UC Berkeley Center for Long-Term Cybersecurity. Special thanks to Alyosha Efros, the bestest advisor, and to Tinghui Zhou for his dreams of late-night talk show stardom.
References
[1] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
[2] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Computer Graphics and Interactive Techniques, SIGGRAPH, pages 353–360. ACM, 1997.
[3] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching tv (using weakly aligned subtitles). In Computer Vision and Pattern Recognition (CVPR), pages 2961–2968. IEEE, 2009.
[4] B. Butterworth and U. Hadar. Gesture, speech, and computational stages: A reply to McNeill. Psychological Review,
96:168–74, Feb. 1989. 2, 3 Now: Introducing the Virtual Human Toolkit. In 13th In-
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- ternational Conference on Intelligent Virtual Agents, Edin-
person 2d pose estimation using part affinity fields. In Com- burgh, UK, Aug. 2013. 3
puter Vision and Pattern Recognition (CVPR). IEEE, 2017. [21] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke,
4 A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous,
[6] J. Cassell, D. McNeill, and K.-E. McCullough. Speech- B. Seybold, M. Slaney, R. Weiss, and K. Wilson. CNN ar-
gesture mismatches: Evidence for one underlying represen- chitectures for large-scale audio classification. In Interna-
tation of linguistic and nonlinguistic information. Pragmat- tional Conference on Acoustics, Speech and Signal Process-
ics and Cognition, 7(1):1–34, 1999. 1 ing. 2017. 5, 7
[7] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, [22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image
B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. translation with conditional adversarial networks. In Com-
Animated conversation: Rule-based generation of facial ex- puter Vision and Pattern Recognition (CVPR), 2017. 4
pression, gesture & spoken intonation for multiple conversa- [23] A. Kendon. Gesture: Visible Action as Utterance. Cam-
tional agents. In Computer Graphics and Interactive Tech- bridge University Press, 2004. 1, 3, 5, 7, 10, 11
niques, SIGGRAPH, pages 413–420. ACM, 1994. 3 [24] D. P. Kingma and J. Ba. Adam: A method for stochastic
[8] J. Cassell, J. Sullivan, E. Churchill, and S. Prevost. Embod- optimization. CoRR, abs/1412.6980, 2014. 5
ied conversational agents. MIT press, 2000. 3, 8 [25] M. Kipp, M. Neff, K. H. Kipp, and I. Albrecht. Towards
[9] J. Cassell, H. H. Vilhjálmsson, and T. Bickmore. Beat: the natural gesture synthesis: Evaluating gesture units in a data-
behavior expression animation toolkit. In Life-Like Charac- driven approach to gesture synthesis. In C. Pelachaud, J.-C.
ters, pages 163–185. Springer, 2004. 3 Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé,
editors, Intelligent Virtual Agents, pages 15–28, Berlin, Hei-
[10] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody
delberg, 2007. Springer Berlin Heidelberg. 10
Dance Now. ArXiv e-prints, Aug. 2018. 1, 3, 4, 8
[26] O. Koller, H. Ney, and R. Bowden. Deep hand: How to train
[11] C.-C. Chiu and S. Marsella. How to train your avatar:
a cnn on 1 million hand images when your data is contin-
A data driven approach to gesture generation. In Interna-
uous and weakly labelled. In Computer Vision and Pattern
tional Workshop on Intelligent Virtual Agents, pages 127–
Recognition (CVPR), pages 3793–3802. IEEE, 2016. 3
140. Springer, 2011. 3
[27] S. Kopp, B. Krenn, S. Marsella, A. N. Marshall,
[12] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that?
C. Pelachaud, H. Pirker, K. R. Thórisson, and
In British Machine Vision Conference, 2017. 3
H. Vilhjálmsson. Towards a common framework for
[13] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and multimodal generation: The behavior markup language. In
R. Bowden. Neural sign language translation. In Computer Vision and Pattern Recognition (CVPR). IEEE, June 2018. 3
[14] T. J. Darrell, I. A. Essa, and A. P. Pentland. Task-specific gesture analysis in real-time using interpolated views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1236–1242, Dec. 1996. 3
[15] J. P. de Ruiter, A. Bangerter, and P. Dings. The interplay between gesture and speech in the production of referring expressions: Investigating the tradeoff hypothesis. Topics in Cognitive Science, 4(2):232–248, Mar. 2012. 1
[16] D. F. Fouhey, W.-c. Kuo, A. A. Efros, and J. Malik. From lifestyle vlogs to everyday interactions. arXiv preprint arXiv:1712.02310, 2017. 10
[17] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. In Workshop on Automatic Face and Gesture Recognition. IEEE, June 1995. 3
[18] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In International Conference on Acoustics, Speech and Signal Processing, pages 776–780, Mar. 2017. 5
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014. 2
[20] A. Hartholt, D. Traum, S. C. Marsella, A. Shapiro, G. Stratou, A. Leuski, L.-P. Morency, and J. Gratch. All Together
International workshop on intelligent virtual agents, pages 205–217. Springer, 2006. 3
[28] T. R. Langlois and D. L. James. Inverse-foley animation: Synchronizing rigid-body motions to sound. ACM Transactions on Graphics, 33(4):41:1–41:11, July 2014. 3
[29] S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun. Gesture controllers. In ACM Transactions on Graphics, volume 29, page 124. ACM, 2010. 3
[30] S. Levine, C. Theobalt, and V. Koltun. Real-time prosody-driven synthesis of body language. In ACM Transactions on Graphics, volume 28, page 172. ACM, 2009. 3
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), Zürich, 2014. Oral. 10
[32] R. C. B. Madeo, S. M. Peres, and C. A. de Moraes Lima. Gesture phase segmentation using support vector machines. Expert Systems with Applications, 56:100–115, 2016. 11
[33] S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. Virtual character performance from speech. In Symposium on Computer Animation, SCA, pages 25–35. ACM, 2013. 3
[34] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, 1992. 1, 2, 3, 10
[35] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007. 3
[36] M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics, 27(1):5:1–5:24, Mar. 2008. 3
[37] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Asian Conference on Computer Vision, pages 538–552. Springer, 2014. 3
[38] F. Quek, D. McNeill, R. Bryll, S. Duncan, X.-F. Ma, C. Kirbas, K. E. McCullough, and R. Ansari. Multimodal human discourse: gesture and speech. ACM Transactions on Computer-Human Interaction (TOCHI), 9(3):171–193, 2002. 3
[39] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. 4
[40] N. Sadoughi and C. Busso. Retrieving target gestures toward speech driven animation with meaningful behaviors. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, pages 115–122. ACM, 2015. 3
[41] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, Feb. 1978. 11
[42] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman. Audio to body dynamics. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 3, 5, 7
[43] W. C. So, S. Kita, and S. Goldin-Meadow. Using the hands to identify who does what to whom: Gesture and speech go hand-in-hand. Cognitive Science, 33(1):115–125, Feb. 2009. 1
[44] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4):95:1–95:13, July 2017. 3
[45] M. Thiebaux, S. Marsella, A. N. Marshall, and M. Kallmann. Smartbody: Behavior realization for embodied conversational agents. In International Joint Conference on Autonomous Agents and Multiagent Systems, volume 1, pages 151–158. International Foundation for Autonomous Agents and Multiagent Systems, 2008. 3
[46] P. Wagner, Z. Malisz, and S. Kopp. Gesture and speech in interaction: An overview. Speech Communication, 57:209–232, 2014. 3
[47] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, Dec. 2013. 5
Figure 6: A segmented gesture unit.
7. Appendix
7.1. Dataset
Data collection and processing We collected internet videos by querying YouTube for each speaker, and de-duplicated the data using the approach of [16]. We then used out-of-the-box face recognition and pose detection systems to split each video into intervals in which only the subject appears in frame and all detected keypoints are visible. Our dataset consists of 60,000 such intervals with an average length of 8.7 seconds and a standard deviation of 11.3 seconds. In total, there are 144 hours of video. We split the data into 80% train, 10% validation, and 10% test sets, such that each source video only appears in one set.
Quality of dataset annotations We estimate whether the accuracy of the pseudo ground truth is good enough to support our quantitative conclusions via the following experiment. We took a 200-frame subset of the pseudo ground truth used for training and had it labeled by 3 human observers with neck and arm keypoints. We quantified the consensus between annotators via $\sigma_i$, a standard deviation per keypoint-type $i$, as is typical in COCO [31] evaluation. We also computed $\|op_i - \mu_i\|$, the distance between the OpenPose detection and the mean of the annotations, and $\|prediction - \mu_i\|$, the distance between our audio-to-motion prediction and the annotation mean. We found that the pseudo ground truth is close to human labels, since $0.14 = E[\|op_i - \mu_i\|] \approx E[\sigma_i] = 0.06$; and that the error in the pseudo ground truth is small enough for our task, since $0.25 = \|prediction - \mu_i\| \gg \sigma_i = 0.06$. Note that this is a lower bound on the prediction error since it is computed on training data samples.
7.2. Learning Individual Gesture Dictionaries
Gesture unit segmentation We use an unsupervised method for building a dictionary of an individual’s gestures. We segment motion sequences into gesture units, propose an appropriate descriptor and similarity metric and then cluster the gestures of an individual.
A gesture unit is a sequence of gestures that starts from a rest position and returns to a rest position only after the last gesture [23]. While [34] observed that most of their subjects usually perform one gesture at a time, a study of an 18-minute video dataset of TV speakers reported that their gestures were often strung together in a sequence [25]. We treat each gesture unit, from rest position to rest position, as an atomic segment.
Individual styles of gesture These clusters represent an unsupervised definition of the typical gestures that an individual performs. For each dictionary element cluster we
Aliaksandra Shysheya 1,2 Egor Zakharov 1,2 Kara-Ali Aliev 1 Renat Bashirov 1
Egor Burkov 1,2 Karim Iskakov 1 Aleksei Ivakhnenko 1 Yury Malkov 1
Igor Pasechnik 1 Dmitry Ulyanov 1,2 Alexander Vakhitov 1,2 Victor Lempitsky 1,2
1 Samsung AI Center, Moscow    2 Skolkovo Institute of Science and Technology, Moscow
arXiv:1905.08776v1 [cs.CV] 21 May 2019
Figure 1: We propose a new model for neural rendering of humans. The model is trained for a single person and can produce
renderings of this person from novel viewpoints (top) or in the new body pose (bottom) unseen during training. To improve
generalization, our model retains explicit texture representation, which is learned alongside the rendering neural network.
neural rendering of body fragments e.g. faces [37, 43, 62], eyes [24], hands [47] is now possible. Very recent works have shown the abilities of such networks to generate views of a person with a varying body pose but with a fixed camera position, and using an excessive amount of training data [1, 12, 42, 67]. In this work, we focus on the learning of neural avatars, i.e. generative deep networks that are capable of rendering views of individual people under varying body pose defined by a set of 3D positions of the body joints and under varying camera positions (Figure 1). We prefer to use body joint positions to represent the human pose, as joint positions are often easier to capture using marker-based or marker-less motion capture systems.
Generally, neural avatars can serve as an alternative to classical (“neural-free”) avatars based on a standard computer graphics pipeline that estimates a user-personalized body mesh in a neutral position, performs skinning (deformation of the neutral pose), and projects the resulting 3D surface onto the image coordinates, while superimposing person-specific 2D texture. Neural avatars attempt to shortcut the multiple stages of the classical pipeline and to replace them with a single network that learns the mapping from the input (the location of body joints) to the output (the 2D image). As a part of our contribution, we demonstrate that, however appealing for its conceptual simplicity, existing pose-to-image translation networks generalize poorly to new camera views, and therefore new architectures for neural avatars are required.
Towards this end, we present a neural avatar system that does full-body rendering and combines the ideas from the classical computer graphics, namely the decoupling of geometry and texture, with the use of deep convolutional neural networks. In particular, similarly to the classic pipeline, our system explicitly estimates the 2D textures of body parts. The 2D texture within the classical pipeline effectively transfers the appearance of the body fragments across camera transformations and body articulations. Keeping this component within the neural pipeline boosts generalization across such transforms. The role of the convolutional network in our approach is then confined to predicting the texture coordinates of individual pixels in the output 2D image given the body pose and the camera parameters (Figure 2). Additionally, the network predicts the body foreground/background mask.
In our experiments, we compare the performance of our textured neural avatar with a direct video-to-video translation approach [67], and show that explicit estimation of textures brings additional generalization capability and improves the realism of the generated images for new views and/or when the amount of training data is limited.
2. Related work
Our approach is closely related to a vast number of previous works, and below we discuss a small subset of these connections.
Building full-body avatars from image data has long been one of the main topics of computer vision research. Traditionally, an avatar is defined by a 3D geometric mesh of a certain neutral pose, a texture, and a skinning mechanism that transforms the mesh vertices according to pose changes. A large group of works has been devoted to body modeling from 3D scanners [51], registered multi-view sequences [53] as well as from depth and RGB-D sequences [7, 69, 74]. On the other extreme are methods that fit skinned parametric body models to single images [6, 8, 30, 35, 49, 50, 59]. Finally, research on building full-body avatars from monocular videos has started [3, 4]. Similarly to the last group of works, our work builds an avatar from a video or a set of unregistered monocular videos. The classical (computer graphics) approach to modeling human avatars requires explicit physically-plausible modeling of human skin, hair, sclera, clothing surface, as well as motion under pose changes. Despite considerable progress in reflectivity modeling [2, 18, 38, 70, 72] and better skinning/dynamic surface modeling [23, 44, 60], the computer graphics approach still requires considerable “manual” effort of designers to achieve high realism [2] and to pass the so-called uncanny valley [46], especially if real-time rendering of avatars is required.
Image synthesis using deep convolutional neural networks is a thriving area of research [20, 27] and a lot of recent effort has been directed onto synthesis of realistic human faces [15, 36, 61]. Compared to traditional computer graphics representations, deep ConvNets model data by fitting an excessive number of learnable weights to training data. Such ConvNets avoid explicit modeling of the surface geometry, surface reflectivity, or surface motion under pose changes, and therefore do not suffer from the lack of realism of the corresponding components. On the flipside, the lack of ingrained geometric or photometric models in this approach means that generalizing to new poses and in particular to new camera views may be problematic. Still a lot of progress has been made over the last several years for the neural modeling of personalized talking head models [37, 43, 62], hair [68], hands [47]. Notably, the recent system [43] has achieved very impressive results for neural face rendering, while decomposing view-dependent texture and 3D shape modeling.
Over the last several months, several groups have presented results of neural modeling of full bodies [1, 12, 42, 67]. While the presented results are very impressive, the approaches still require a large amount of training data. They also assume that the test images are rendered with the same camera views as the training data, which in our experience
makes the task considerably simpler than modeling body appearance from an arbitrary viewpoint. In this work, we aim to expand the neural body modeling approach to tackle the latter, harder task. The work [45] uses a combination of classical and neural rendering to render human body from new viewpoints, but does so based on depth scans and therefore with a rather different algorithmic approach.
A number of recent works warp a photo of a person to a new photorealistic image with modified gaze direction [24], modified facial expression/pose [9, 55, 64, 71], or modified body pose [5, 48, 56, 64], where the warping field is estimated using a deep convolutional network (while the original photo effectively serves as a texture). These approaches are however limited in their realism and/or the amount of change they can model, due to their reliance on a single photo of a given person as input. Our approach also disentangles texture from surface geometry/motion modeling but trains from videos, and is therefore able to handle a harder problem (the full-body multi-view setting) and to achieve higher realism.
Our system relies on the DensePose body surface parameterization (UV parameterization) similar to the one used in the classical graphics-based representation. Part of our system performs a mapping from the body pose to the surface parameters (UV coordinates) of image pixels. This makes our approach related to the DensePose approach [28] and the earlier works [29, 63] that predict UV coordinates of image pixels from the input photograph. Furthermore, our approach uses DensePose results [28] for pretraining.
Our system is related to approaches that extract textures from multi-view image collections [26, 39] or multi-view video collections [66] or a single video [52]. Our approach is also related to free-viewpoint video compression and rendering systems, e.g. [11, 16, 21, 66]. Unlike those works, ours is restricted to scenes containing a single human. At the same time, our approach aims to generalize not only to new camera views but also to new user poses unseen in the training videos. The work of [73] is the most related to ours in this group, as they warp the individual frames of the multi-view video dataset according to the target pose to generate new sequences. The poses that they can handle, however, are limited by the need to have a close match in the training set, which is a strong limitation given the combinatorial nature of the human pose configuration space.
3. Methods
Notation. We use the lower index $i$ to denote objects that are specific to the $i$-th training or test image. We use uppercase notation, e.g. $B_i$, to denote a stack of maps (a third-order tensor/three-dimensional array) corresponding to the $i$-th training or test image. We use the upper index to denote a specific map (channel) in the stack, e.g. $B_i^j$. Furthermore, we use square brackets to denote elements corresponding to a specific image location, e.g. $B_i^j[x, y]$ denotes the scalar element in the $j$-th map of the stack $B_i$ located at location $(x, y)$, and $B_i[x, y]$ denotes the vector of elements corresponding to all maps sampled at location $(x, y)$.
Input and output. In general, we are interested in synthesizing images of a certain person given her/his pose. We assume that the pose for the $i$-th image comes in the form of 3D joint positions defined in the camera coordinate frame. As an input to the network, we then consider a map stack $B_i$, where each map $B_i^j$ contains the rasterized $j$-th segment (bone) of the “stickman” (skeleton) projected on the camera plane. To retain the information about the third coordinate of the joints, we linearly interpolate the depth value between the joints defining the segments, and use the interpolated values to define the values in the map $B_i^j$ corresponding to the bone pixels (the pixels not covered by the $j$-th bone are set to zero). Overall, the stack $B_i$ incorporates the information about the person and the camera pose.
As an output of the whole system, we expect an RGB image (a three-channel stack) $I_i$ and a single channel mask $M_i$, defining the pixels that are covered by the avatar. Below, we consider two approaches: the direct translation baseline, which directly maps $B_i$ into $\{I_i, M_i\}$, and the textured neural avatar approach that performs such mapping indirectly using texture mapping.
In both cases, at training time, we assume that for each input frame $i$, the input joint locations and the “ground truth” foreground mask are estimated, and we use 3D body pose estimation and human semantic segmentation to extract them from raw video frames. At test time, given a real or synthetic background image $\tilde{I}_i$, we generate the final view by first predicting $M_i$ and $I_i$ from the body pose and then linearly blending the resulting avatar into an image: $\hat{I}_i = I_i \odot M_i + \tilde{I}_i \odot (1 - M_i)$ (where $\odot$ defines a “location-wise” product, i.e. the RGB values at each location are multiplied by the mask value at this location).
Direct translation baseline. The direct approach that we consider as a baseline to ours is to learn an image translation network that maps the map stack $B_i$ to the map stacks $I_i$ and $M_i$ (usually the two output stacks are produced within two branches that share the initial stage of the processing [20]). Generally, mappings between stacks of maps can be implemented using fully-convolutional architectures. Exact architectures and losses for such networks are an active area of research [14, 31, 33, 65]. Very recent works [1, 12, 42, 67] have used direct translation (with various modifications) to synthesize the view of a person for a fixed camera. We use the video-to-video variant of this approach [67] as a baseline for our method.
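The input rasterization and the test-time blending described under "Input and output" can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names `rasterize_bone` and `composite`, the one-pixel line thickness, and the nested-list image layout are all assumptions made for the example, and joints are assumed to be already projected to pixel coordinates.

```python
def rasterize_bone(joint_a, joint_b, height, width):
    """Draw one bone into a height x width map (hypothetical helper).

    joint_a, joint_b: (x, y, z) with x, y in pixel coordinates and z the depth.
    Bone pixels receive the linearly interpolated depth; all others stay 0.
    """
    bone_map = [[0.0] * width for _ in range(height)]
    (xa, ya, za), (xb, yb, zb) = joint_a, joint_b
    # One sample per pixel along the longer axis (assumed 1-pixel-thick line).
    steps = max(1, int(round(max(abs(xb - xa), abs(yb - ya)))))
    for s in range(steps + 1):
        t = s / steps
        col = int(round(xa + t * (xb - xa)))
        row = int(round(ya + t * (yb - ya)))
        if 0 <= row < height and 0 <= col < width:
            bone_map[row][col] = za + t * (zb - za)
    return bone_map


def build_pose_stack(joints, bones, height, width):
    """Stack B_i: one rasterized map per bone (pair of joint indices)."""
    return [rasterize_bone(joints[a], joints[b], height, width) for a, b in bones]


def composite(avatar_rgb, mask, background_rgb):
    """Location-wise blend: I * M + background * (1 - M), per pixel and channel."""
    return [
        [tuple(m * a + (1.0 - m) * b for a, b in zip(fg, bg))
         for fg, m, bg in zip(fg_row, m_row, bg_row)]
        for fg_row, m_row, bg_row in zip(avatar_rgb, mask, background_rgb)
    ]
```

A real pipeline would more likely draw thicker bones and operate on arrays rather than nested lists; the sketch only makes the depth interpolation and the mask-weighted blend explicit.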
[Figure 2 diagram labels: Input pose, Generator, Part assignments, Texture stack, Render, Predicted mask, Ground truth mask, Cross-entropy loss, Perceptual loss]
Figure 2: The overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations
(one bone per channel; here we show it as a skeleton image). The input is processed by the fully-convolutional network
(generator) to produce the body part assignment map stack and the body part coordinate map stack. These stacks are then
used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by
the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the
background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting
losses are backpropagated through the sampling operation into the fully-convolutional network and onto the texture, resulting
in their updates.
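The mask comparison mentioned in the caption (formalized later in the text as the mask loss (3)) is a binary cross-entropy between the ground-truth background mask and the predicted background channel. The sketch below is an assumption-laden illustration: the function name `mask_bce`, the mean reduction over pixels, and the clamping constant `eps` are choices made here and are not specified in the paper.

```python
import math


def mask_bce(gt_foreground, pred_background, eps=1e-7):
    """Binary cross-entropy between 1 - M (ground-truth background) and the
    predicted background probability, averaged over pixels (assumed reduction).

    gt_foreground: rows of ground-truth foreground mask values in [0, 1].
    pred_background: rows of predicted background probabilities in [0, 1].
    """
    total, count = 0.0, 0
    for gt_row, pred_row in zip(gt_foreground, pred_background):
        for m, p in zip(gt_row, pred_row):
            target = 1.0 - m                      # background indicator
            p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
            count += 1
    return total / count
```

An uninformative prediction of 0.5 everywhere yields a loss of log 2 per pixel regardless of the mask, which is a convenient sanity check when wiring such a loss into training.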
Textured neural avatar. The direct translation approach relies on the generalization ability of ConvNets and incorporates very little domain-specific knowledge into the system. As an alternative, we suggest the textured avatar approach, which explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras.
Following the DensePose approach [28], we subdivide the body into $n=24$ parts, where each part has a 2D parameterization. Each body part also has the texture map $T^k$, which is a color image of a fixed pre-defined size (256×256 in our implementation). The training process for the textured neural avatar estimates personalized part parameterizations and textures.
Again, following the DensePose approach, we assume that each pixel in an image of a person is (soft)-assigned to one of $n$ parts or to the background and with a specific location on the texture of that part (body part coordinates). Unlike DensePose, where part assignments and body part coordinates are induced from the image, our approach at test time aims to predict them based solely on the pose $B_i$.
The introduction of the body surface parameterization outlined above changes the translation problem. For a given pose defined by $B_i$, the translation network now has to predict the stack $P_i$ of body part assignments and the stack $C_i$ of body part coordinates, where $P_i$ contains $n+1$ maps of non-negative numbers that sum to one (i.e. $\sum_{k=0}^{n} P_i^k[x, y] = 1$ for any position $(x, y)$), and $C_i$ contains $2n$ maps of real numbers between 0 and $w$, where $w$ is the spatial size (width and height) of the texture maps $T^k$.
The map channel $P_i^k$ for $k = 0, \ldots, n-1$ is then interpreted as the probability of the pixel to belong to the $k$-th body part, and the map channel $P_i^n$ corresponds to the probability of the background. The coordinate maps $C_i^{2k}$ and $C_i^{2k+1}$ correspond to the pixel coordinates on the $k$-th body part. Specifically, once the part assignments $P_i$ and body part coordinates $C_i$ are predicted, the image $I_i$ at each pixel $(x, y)$ is reconstructed as a weighted combination of texture elements, where the weights and texture coordinates are prescribed by the part assignment maps and the coordinate maps correspondingly:
$$s(P_i, C_i, T)[x, y] = \sum_{k=0}^{n-1} P_i^k[x, y] \cdot T^k\big[C_i^{2k}[x, y],\, C_i^{2k+1}[x, y]\big] \qquad (1)$$
where $s(\cdot, \cdot, \cdot)$ is the sampling function (layer) that outputs the RGB map stack given the three input arguments. In (1), the texture maps $T^k$ are sampled at non-integer locations $(C_i^{2k}[x, y], C_i^{2k+1}[x, y])$ in a piecewise-differentiable manner using bilinear interpolation [32].
When training the neural textured avatar, we learn a convolutional network $g_\phi$ with learnable parameters $\phi$ to translate the input map stacks $B_i$ into the body part assignments and the body part coordinates. As $g_\phi$ has two branches (“heads”), we denote with $g_\phi^P$ the branch that produces the body part assignments stack, and with $g_\phi^C$ the branch that produces the body part coordinates. To learn the parameters of the textured neural avatar, we optimize the loss between the generated image and the ground truth image $\bar{I}_i$:
$$\mathcal{L}_{\text{image}}(\phi, T) = d_{\text{Image}}\Big(\bar{I}_i,\; s\big(g_\phi^P(B_i),\, g_\phi^C(B_i),\, T\big)\Big) \qquad (2)$$
where $d_{\text{Image}}(\cdot, \cdot)$ is a loss used to compare two images. In our current implementation we use a simple perceptual loss [25, 33, 65], which computes the maps of activations within a pretrained fixed VGG network [58] for both images and evaluates the L1-norm between the resulting maps (Conv1,6,11,20,29 of VGG19 were used). More advanced adversarial losses [27] popular in image translation [19, 31] can also be used here.
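To make the sampling function in (1) concrete, the sketch below evaluates it at a single output pixel with bilinear interpolation. It is an illustration under stated assumptions rather than the authors' layer: the names `bilinear_sample` and `sample_pixel` are hypothetical, textures are nested lists of RGB tuples indexed `texture[row][col]`, the first coordinate of each pair is treated as the column, and coordinates are clamped to the valid index range.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample a square RGB texture at real-valued (u, v) = (column, row)."""
    size = len(texture)
    u = min(max(u, 0.0), size - 1.0)   # clamp to the valid range (assumption)
    v = min(max(v, 0.0), size - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, size - 1), min(v0 + 1, size - 1)
    fu, fv = u - u0, v - v0

    def mix(a, b, t):
        return tuple((1.0 - t) * x + t * y for x, y in zip(a, b))

    top = mix(texture[v0][u0], texture[v0][u1], fu)
    bottom = mix(texture[v1][u0], texture[v1][u1], fu)
    return mix(top, bottom, fv)


def sample_pixel(part_probs, coords, textures):
    """Eq. (1) at one pixel.

    part_probs: n + 1 weights P^0..P^n (the last one is the background).
    coords: 2n values, coords[2k] and coords[2k + 1] index into textures[k].
    textures: n square RGB textures.
    """
    rgb = [0.0, 0.0, 0.0]
    for k, texture in enumerate(textures):
        color = bilinear_sample(texture, coords[2 * k], coords[2 * k + 1])
        for c in range(3):
            rgb[c] += part_probs[k] * color[c]
    return tuple(rgb)
```

Note that the background weight does not contribute any color, matching the sum over $k = 0, \ldots, n-1$ in (1). In a deep learning framework the same operation would normally be expressed with a differentiable sampling primitive so that gradients reach both the coordinates and the textures.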
During the stochastic optimization, the gradient of the loss (2) is backpropagated through (1) both into the translation network $g_\phi$ and onto the texture maps $T^k$, so that minimizing this loss updates not only the network parameters but also the textures themselves. In addition, the learning optimizes the mask loss that measures the discrepancy between the ground truth background mask $1 - \bar{M}_i$ and the background mask prediction:
$$\mathcal{L}_{\text{mask}}(\phi, T) = d_{\text{BCE}}\big(1 - \bar{M}_i,\; g_\phi^P(B_i)^n\big) \qquad (3)$$
where $d_{\text{BCE}}$ is the binary cross-entropy loss, and $g_\phi^P(B_i)^n$ corresponds to the $n$-th (i.e. background) channel of the predicted part assignment map stack. After backpropagation of the weighted combination of (2) and (3), the network parameters $\phi$ and the texture maps $T^k$ are updated. As the training progresses, the texture maps change (Figure 2), and so do the body part coordinate predictions, so that the learning is free to choose the appropriate parameterization of body part surfaces.
Initialization of textured neural avatar. The success of our network depends on the initialization strategy. When training from multiple video sequences, we use the DensePose system [28] to initialize the textured neural avatar. Specifically, we run DensePose on the training data and pretrain $g_\phi$ as a translation network between the pose stacks $B_i$ and the DensePose outputs.
An alternative that is particularly attractive when training data is scarce is to initialize the avatar through transfer learning. In this case, we simply take $g_\phi$ from another avatar trained on abundant data. The explicit decoupling of geometry from appearance in our method facilitates transfer learning, as the geometrical mapping provided by the network $g_\phi$ usually does not need to change much between two people, especially if the body types are not too dissimilar.
Once the mapping $g_\phi$ has been initialized, the texture maps $T^k$ are initialized as follows. Each pixel in the training image is assigned to a single body part (according to the prediction of the pretrained $g_\phi^P$) and to a particular texture pixel on the texture of the corresponding part (according to the prediction of the pretrained $g_\phi^C$). Then, the value of each texture pixel is initialized to the mean of all image pixels assigned to it (the texture pixels assigned zero pixels are initialized to black). The initialized texture $T$ and $g_\phi$ usually produce images that only coarsely resemble the person, and they change significantly during the end-to-end learning (Figure 3).
Figure 3: The impact of the learning on the texture (top, shown for the same subset of maps $T^k$) and on the convolutional network $g_\phi^C$ predictions (bottom, shown for the same pair of input poses). Left part shows the starting state (after initialization), while the right part shows the final state, which is considerably different from the start.
4. Experiments
Below, we discuss the details of the experimental validation, provide comparison with baseline approaches, and show qualitative results. The project webpage (https://saic-violet.github.io/texturedavatar/) contains more videos of the learned avatars.
Architecture. We input 3D pose via bone rasterizations, where each bone, hand and face are drawn in separate channels. We then use a standard image translation architecture [33] to perform a mapping from these bones’ rasterizations to texture assignments and coordinates. This architecture consists of downsampling layers, a stack of residual blocks operating at low dimensional feature representations, and upsampling layers. We then split the network into two roughly equal parts: encoder and decoder, with texture assignments and coordinates having separate decoders. We use 4 downsampling and upsampling layers with initial 32 channels in the convolutions and 256 channels in the residual blocks. The ConvNet $g_\phi$ has 17 million parameters.
Datasets. We train neural avatars on several types of datasets. First, we consider collections of multi-view videos registered in time and space, where 3D pose estimates can be obtained via triangulation of 2D poses. We use two subsets (corresponding to two persons from the 171026_pose2
Figure 4: Renderings produced by multiple textured neural avatars (for all people in our study). All renderings are produced
from the new viewpoints unseen during training.
Table 1: Quantitative comparison of the three models operating on different datasets (see text for discussion).
scene) from the CMU Panoptic dataset collection [34], referring to them as CMU1 and CMU2 (both subsets have approximately four minutes / 7,200 frames in each camera view). We consider two regimes: training on 16 cameras (CMU1-16 and CMU2-16) or six cameras (CMU1-6 and CMU2-6). The evaluation is done on the hold-out cameras and hold-out parts of the sequence (no overlap between train and test in terms of the cameras or body motion).
We have also captured our own multi-view sequences of three subjects using a rig of seven cameras, spanning approximately 30°. In one scenario, the training sets included six out of seven cameras, where the duration of each video was approximately six minutes (11,000 frames). We show qualitative results for the hold-out camera as well as from new viewpoints. In the other scenario described below, training was done based on a video from a single camera. Finally, we evaluate on two short monocular sequences from [4] and a YouTube video in Figure 7.
Pre-processing. Our system expects 3D human pose as input. For non-CMU datasets, we used the OpenPose-compatible [10, 57] 3D pose formats, represented by 25 body joints, 21 joints for each hand and 70 facial landmarks. For the CMU Panoptic datasets, we use the available 3D pose annotation as input (which has 19 rather than 25 body joints). To get a 3D pose for non-CMU sequences we first apply the OpenPose 2D pose estimation engine to five consecutive frames of the monocular RGB image sequence. Then we concatenate and lift the estimated 2D poses to infer the 3D pose of the last frame by using a multi-layer perceptron model. The perceptron is trained on the CMU 3D pose annotations (augmented with position of the feet joints by triangulating the output of OpenPose) in orthogonal projection.
For foreground segmentation we use DeepLabv3+ with Xception-65 backbone [13] initially trained on PASCAL VOC 2012 [22] and fine-tuned on the HumanParsing dataset [40, 41] to predict initial human body segmentation masks. We additionally employ GrabCut [54] with the background/foreground model initialized by the masks to refine object boundaries on the high-resolution images. Pixels covered by the skeleton rasterization were always added to the foreground mask.
Baselines. In the multi-video training scenario, we consider two other systems, against which ours is compared. First, we take the video-to-video (V2V) system [67], using the authors’ code with minimal modifications that lead to improved performance. We provide it with the same input as ours, and we use images with blacked-out background (according to our segmentation) as desired output. On the CMU1-6 task, we have also evaluated a model with DensePose results computed on the target frame given as input (alongside keypoints). Despite much stronger (oracle-type)
[Figure 5 column labels, shown twice: GT, Direct, V2V, Proposed]
Figure 5: Comparison of the rendering quality for the Direct, V2V and proposed methods on the CMU1-6 and CMU2-6
sequences. Images from six arbitrarily chosen cameras were used for training. We generate the views onto the hold-out
cameras which were not used during training. The pose and camera in the lower right corner are particularly difficult for all
the systems.
conditioning, the performance of this model in terms of the considered metrics has not improved in comparison with V2V, which uses only body joints as input.
The video-to-video system employs several adversarial losses and an architecture different from ours. Therefore we consider a more direct ablation (Direct), which has the same network architecture but predicts RGB color and mask directly, rather than via body part assignments/coordinates. The Direct system is trained using the same losses and the same protocol as ours.
For the single video case, we considered two baseline systems against which ours is compared. On our own captured sequences, we compare our system against the video-to-video (V2V) system [67], whereas on sequences from [4] we provide a qualitative comparison against the system of [4].
Multi-video comparison. We compare the three systems (ours, V2V, Direct) on CMU1-16, CMU2-16, CMU1-6 and CMU2-6. Using the hold-out sequences/motions, we evaluated two popular metrics, namely structural similarity (SSIM) and Fréchet Inception Distance (FID), between the results of each system and the hold-out frames (with background removed using our segmentation algorithm). Our method outperforms the other two in terms of SSIM and underperforms V2V in terms of FID. Representative examples are shown in Figure 5.
We have also performed a user study using a crowd-sourcing website, where the users were shown the results of ours and one of the other two systems on either side of the ground truth image and were asked to pick the better match to the middle image. In the side-by-side comparison, the results of our method were always preferred by the majority of crowd-sourcing users. We note that our method suffers from a disadvantage both in the quantitative metrics and in the user comparison, since it averages out lighting from different viewpoints. A more detailed quantitative comparison is presented in Table 1.
We show more qualitative examples of our method for a variety of models in Figure 4 and some qualitative comparisons with baselines in Figure 6.
Single video comparisons. We also evaluate our system in the single video case. We consider the scenario where we train the model and transfer it to a new person by fitting it to a single video. We use single-camera videos from one of the cameras in our rig. We then evaluate the model (and the V2V baseline) on a hold-out set of poses projected onto the camera from the other side of the rig (around 30° away). We thus demonstrate that new models can be obtained using a single monocular video. For our models, we consider transferring from CMU1-16.
We thus pretrain V2V and our system on CMU1-16 and use the obtained weights of g_φ as initialization for fine-tuning to the single video in our dataset. The texture maps are initialized from scratch as described above. Evaluating on the hold-out camera and motion highlighted a strong advantage of our method. In the user study on two subjects, the result of our method was preferred to V2V in 55% and 65% of the cases. We further compare our method and the system of [4] on the sequences from [4]. The qualitative comparison is shown in Figure 7. In addition, we generate an avatar from a YouTube video. In this set of experiments, the avatars were obtained by fine-tuning from the same avatar (shown in Figure 6–left). Except for the considerable artefacts on hand parts, our system has generated avatars that can generalize to new poses despite very short video input (300 frames in the case of [4]).
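As an illustration of the SSIM metric used in the multi-video comparison, the following is a minimal sketch and not the evaluation code behind the reported numbers. It assumes grayscale images scaled to [0, 1] and computes a single global SSIM value over the whole image, whereas standard implementations average the same expression over local windows. The function name `global_ssim` and the constants `k1`, `k2` (the commonly used defaults) are assumptions made for this example.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed over two whole images of equal shape.

    Simplification: no sliding window, so this is coarser than the
    usual windowed SSIM.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilising constants (hypothetical defaults for this sketch).
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Under these assumptions, identical images score 1 and the score drops as structure diverges. To mimic the background removal described above, one could multiply both images by the same foreground mask before calling the function.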
Figure 6 (panels, left to right: GT, Proposed, V2V; shown for two subjects): Results comparison for our multi-view sequences using a hold-out camera. Textured Neural Avatars and the images produced by the video-to-video (V2V) system correspond to the same viewpoint. Both systems use a video from a single viewpoint for training. Electronic zoom-in recommended.
Figure 7: Results on external monocular sequences. Rows 1-2: avatars for sequences from [4] in an unseen pose (left – ours,
right – [4]). Row 3 – the textured avatar computed from a popular YouTube video (’PUMPED UP KICKS DUBSTEP’). In
general, our system is capable of learning avatars from monocular videos.
5. Summary and Discussion
We have presented the textured neural avatar approach to model the appearance of humans for new camera views and new body poses. Our system takes the middle path between the recent generation of methods that use ConvNets to map the pose to the image directly, and the traditional approach that uses geometric modeling of the surface and superimposes the personalized texture maps. This is achieved by learning a ConvNet that predicts texture coordinates of pixels in the new view jointly with the texture within the end-to-end learning process. We demonstrate that retaining an explicit shape and texture separation helps to achieve better generalization than direct mapping approaches.
Our method suffers from certain limitations. The generalization ability is still limited, as it does not generalize well when a person is rendered at a scale that is considerably different from the training set (which can be partially addressed by rescaling prior to rendering followed by cropping/padding postprocessing). Furthermore, textured avatars exhibit strong artefacts in the presence of pose estimation errors on hands and faces. Finally, our method assumes constancy of the surface color and ignores lighting effects. This can be potentially addressed by making our textures view- and lighting-dependent [17, 43].
References
[1] Kfir Aberman, Mingyi Shi, Jing Liao, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Deep video-based performance cloning. arXiv preprint arXiv:1808.06847, 2018. 2, 3
[2] Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul Debevec. The Digital Emily project: Achieving a photorealistic digital actor. IEEE Computer Graphics and Applications, 30(4):20–31, 2010. 2
[3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pages 98–109. IEEE, 2018. 2
[4] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3d people models. In Proc. CVPR, June 2018. 2, 6, 7, 8
[5] Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Frédo Durand, and John V. Guttag. Synthesizing images of humans in unseen poses. In Proc. CVPR, pages 8340–8348, 2018. 3
[6] Alexandru O Bălan and Michael J Black. The naked truth: Estimating body shape under clothing. In Proc. ECCV, pages 15–29. Springer, 2008. 2
[7] Federica Bogo, Michael J Black, Matthew Loper, and Javier Romero. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In Proc. ICCV, pages 2300–2308, 2015. 2
[8] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In Proc. ECCV, pages 561–578. Springer, 2016. 2
[9] Jie Cao, Yibo Hu, Hongwen Zhang, Ran He, and Zhenan Sun. Learning a high fidelity pose invariant model for high-resolution face frontalization. arXiv preprint arXiv:1806.08472, 2018. 3
[10] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proc. CVPR, 2017. 6
[11] Dan Casas, Marco Volino, John Collomosse, and Adrian Hilton. 4d video textures for interactive character appearance. In Computer Graphics Forum, volume 33, pages 371–380. Wiley Online Library, 2014. 3
[12] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018. 2, 3
[13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. ECCV, 2018. 6
[14] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proc. ICCV, pages 1520–1529, 2017. 3
[15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proc. CVPR, June 2018. 2
[16] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4):69, 2015. 3
[17] Paul E. Debevec, Yizhou Yu, and George Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In Rendering Techniques ’98, Proceedings of the Eurographics Workshop in Vienna, Austria, June 29 - July 1, 1998, pages 105–116, 1998. 9
[18] Craig Donner, Tim Weyrich, Eugene d’Eon, Ravi Ramamoorthi, and Szymon Rusinkiewicz. A layered, heterogeneous reflectance model for acquiring and rendering human skin. In ACM Transactions on Graphics (TOG), volume 27, page 140. ACM, 2008. 2
[19] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Proc. NIPS, pages 658–666, 2016. 5
[20] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proc. CVPR, pages 1538–1546, 2015. 2, 3
[21] Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. Motion2fusion: real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 36(6):246, 2017. 3
[22] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015. 6
[23] Andrew Feng, Dan Casas, and Ari Shapiro. Avatar reshaping and automatic rigging using a deformable model. In Proceedings of the 8th ACM SIGGRAPH Conference on Motion in Games, pages 57–64. ACM, 2015. 2
[24] Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In Proc. ECCV, pages 311–326. Springer, 2016. 2, 3
[25] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. CVPR, pages 2414–2423, 2016. 5
[26] Bastian Goldlücke and Daniel Cremers. Superresolution texture maps for multiview reconstruction. In Proc. ICCV, pages 1677–1684, 2009. 3
[27] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672–2680, 2014. 2, 5
[28] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In Proc. CVPR, June 2018. 3, 4, 5
[29] Riza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In Proc. CVPR, volume 2, page 5, 2017. 3
[30] Nils Hasler, Hanno Ackermann, Bodo Rosenhahn, Thorsten Thormählen, and Hans-Peter Seidel. Multilinear pose and body shape estimation of dressed subjects from image sets. In Proc. CVPR, pages 1823–1830. IEEE, 2010. 2
[31] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, pages 5967–5976, 2017. 3, 5
[32] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Proc. NIPS, pages 2017–2025, 2015. 4
[33] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, pages 694–711, 2016. 3, 5
[34] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 6
[35] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proc. CVPR, 2018. 2
[36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. 2
[37] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. arXiv preprint arXiv:1805.11714, 2018. 2
[38] Oliver Klehm, Fabrice Rousselle, Marios Papas, Derek Bradley, Christophe Hery, Bernd Bickel, Wojciech Jarosz, and Thabo Beeler. Recent advances in facial appearance capture. In Computer Graphics Forum, volume 34, pages 709–733. Wiley Online Library, 2015. 2
[39] Victor S. Lempitsky and Denis V. Ivanov. Seamless mosaicing of image-based texture maps. In Proc. CVPR, 2007. 3
[40] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, and Shuicheng Yan. Deep human parsing with active template regression. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(12):2402–2414, Dec 2015. 6
[41] Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan. Iccv. 2015. 6
[42] Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. Neural animation and reenactment of human actor videos. arXiv preprint arXiv:1809.03658, 2018. 2, 3
[43] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018. 2, 9
[44] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015. 2
[45] Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien P. C. Valentin, Sameh Khamis, Philip L. Davidson, Anastasia Tkach, Peter Lincoln, Adarsh Kowdle, Christoph Rhemann, Dan B. Goldman, Cem Keskin, Steven M. Seitz, Shahram Izadi, and Sean Ryan Fanello. LookinGood: enhancing performance capture with real-time neural re-rendering. ACM Trans. Graph., 37(6):255:1–255:14, 2018. 3
[46] Masahiro Mori. The uncanny valley. Energy, 7(4):33–35, 1970. 2
[47] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. GANerated hands for real-time 3d hand tracking from monocular RGB. In Proc. CVPR, June 2018. 2
[48] Natalia Neverova, Riza Alp Güler, and Iasonas Kokkinos. Dense pose transfer. In Proc. ECCV, September 2018. 3
[49] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model-based human pose and shape estimation. Verona, Italy, 2018. 2
[50] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proc. CVPR, June 2018. 2
[51] Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34(4):120, 2015. 2
[52] Alex Rav-Acha, Pushmeet Kohli, Carsten Rother, and Andrew W. Fitzgibbon. Unwrap mosaics: a new representation for video editing. ACM Trans. Graph., 27(3):17:1–17:11, 2008. 3
[53] Nadia Robertini, Dan Casas, Edilson De Aguiar, and Christian Theobalt. Multi-view performance capture of surface details. International Journal of Computer Vision, 124(1):96–113, 2017. 2
[54] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "grabcut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., 23(3):309–314, 2004. 6
[55] Zhixin Shu, Mihir Sahasrabudhe, Riza Alp Guler, Dimitris Samaras, Nikos Paragios, and Iasonas Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proc. ECCV, September 2018. 3
[56] Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. Deformable gans for pose-based human image generation. In Proc. CVPR, June 2018. 3
[57] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017. 6
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 5
[59] J Starck and A Hilton. Model-based multiple view reconstruction of people. In Proc. ICCV, pages 915–922, 2003. 2
[60] Ian Stavness, C Antonio Sánchez, John Lloyd, Andrew Ho, Johnty Wang, Sidney Fels, and Danny Huang. Unified skinning of rigid and deformable models for anatomical simulations. In SIGGRAPH Asia 2014 Technical Briefs, page 9. ACM, 2014. 2
[61] Diana Sungatullina, Egor Zakharov, Dmitry Ulyanov, and Victor Lempitsky. Image manipulation with perceptual discriminators. In Proc. ECCV, September 2018. 2
[62] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017. 2
[63] Jonathan Taylor, Jamie Shotton, Toby Sharp, and Andrew Fitzgibbon. The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In Proc. CVPR, pages 103–110. IEEE, 2012. 3
[64] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proc. CVPR, June 2018. 3
[65] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proc. ICML, pages 1349–1357, 2016. 3, 5
[66] Marco Volino, Dan Casas, John P Collomosse, and Adrian Hilton. Optimal representation of multi-view video. In Proc. BMVC, 2014. 3
[67] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018. 2, 3, 6, 7
[68] Lingyu Wei, Liwen Hu, Vladimir Kim, Ersin Yumer, and Hao Li. Real-time hair rendering using sequential adversarial networks. In Proc. ECCV, September 2018. 2
[69] Alexander Weiss, David Hirshberg, and Michael J Black. Home 3d body scans from noisy image and range data. In Proc. ICCV, pages 1951–1958. IEEE, 2011. 2
[70] Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner, Chien Tu, Janet McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, et al. Analysis of human faces using a measurement-based skin reflectance model. In ACM Transactions on Graphics (TOG), volume 25, pages 1013–1024. ACM, 2006. 2
[71] Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proc. ECCV, September 2018. 3
[72] Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In Proc. ICCV, pages 3756–3764, 2015. 2
[73] Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gaurav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz, and Christian Theobalt. Video-based characters: creating new human performances from a multi-view video database. ACM Transactions on Graphics (TOG), 30(4):32, 2011. 3
[74] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In Proc. CVPR, pages 7287–7296. IEEE Computer Society, 2018. 2
DSFD: Dual Shot Face Detector
‡ Youtu Lab, Tencent
† lijiannuist@gmail.com, {csjqian, csjyang}@njust.edu.cn
‡ {casewang, changanwang, yingtai, jasoncjwang, jerolinli, garyhuang}@tencent.com
Figure 1: Visual results. Our method is robust to large variations in scale, blur, illumination, pose, occlusion, reflection and makeup.
the CNN based face detectors have been extensively studied, detecting faces with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination in real-world scenarios remains a challenge.
Previous state-of-the-art face detectors can be roughly divided into two categories. The first one is mainly based on the Region Proposal Network (RPN) adopted in Faster RCNN [24] and employs two-stage detection schemes [30, 33, 36]. RPN is trained end-to-end and generates high-quality region proposals which are further refined by the Fast R-CNN detector. The other one consists of Single Shot Detector (SSD) [20] based one-stage methods, which get rid of RPN and directly predict the bounding boxes and confidence [4, 27, 39]. Recently, the one-stage face detection framework has attracted more attention due to its higher inference efficiency and straightforward system deployment.
Despite the progress achieved by the above methods, problems remain in three aspects:
Feature learning. The feature extraction part is essential for a face detector. Currently, Feature Pyramid Network (FPN) [17] is widely used in state-of-the-art face detectors for rich features. However, FPN just aggregates hierarchical feature maps between high and low-level output layers, which does not consider the current layer’s information, and the context relationship between anchors is ignored.
Loss design. The conventional loss functions used in object detection include a regression loss for the face region and a classification loss for identifying whether a face is detected or not. To further address the class imbalance problem, Lin et al. [18] propose Focal Loss to focus training on a sparse set of hard examples. To use all original and enhanced features, Zhang et al. propose Hierarchical Loss to effectively learn the network [37]. However, the above loss functions do not consider the progressive learning ability of feature maps across both different levels and different shots.
Anchor matching. Basically, pre-set anchors for each feature map are generated by regularly tiling a collection of boxes with different scales and aspect ratios on the image. Some works [27, 39] analyze a series of reasonable anchor scales and an anchor compensation strategy to increase positive anchors. However, such a strategy ignores random sampling in data augmentation, which still causes imbalance between positive and negative anchors.
In this paper, we propose three novel techniques to address the above three issues, respectively. First, we introduce a Feature Enhance Module (FEM) to enhance the discriminability and robustness of the features, which combines the advantages of the FPN in PyramidBox and the Receptive Field Block (RFB) in RFBNet [19]. Second, motivated by the hierarchical loss [37] and pyramid anchor [27] in PyramidBox, we design Progressive Anchor Loss (PAL), which uses progressive anchor sizes not only for different levels but also for different shots. Specifically, we assign smaller anchor sizes in the first shot, and use larger sizes in the second shot. Third, we propose Improved Anchor Matching (IAM), which integrates an anchor partition strategy and anchor-based data augmentation to better match anchors and ground truth faces, and thus provides better initialization for the regressor. The three aspects are complementary, so these techniques can work together to further improve the performance. Besides, since these techniques are all related to the two-stream design, we name the proposed network Dual Shot Face Detector (DSFD). Fig. 1 shows the effectiveness of DSFD under various variations, especially on extremely small faces or heavily occluded faces.
In summary, the main contributions of this paper include:
• A novel Feature Enhance Module to utilize information from different levels and thus obtain more discriminative and robust features.
• Auxiliary supervisions introduced in early layers via a set of smaller anchors to effectively facilitate the features.
• An improved anchor matching strategy to match anchors and ground truth faces as far as possible to provide better initialization for the regressor.
• Comprehensive experiments conducted on the popular benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD network compared with the state-of-the-art methods.
2. Related work
We review the prior works from three perspectives.
Feature Learning. Early works on face detection mainly rely on hand-crafted features, such as Haar-like features [29], control point set [1], and edge orientation histograms [13]. However, the design of hand-crafted features lacks guidance. With the great progress of deep learning, hand-crafted features have been replaced by Convolutional Neural Networks (CNN). For example, Overfeat [25], Cascade-CNN [14] and MTCNN [38] adopt a CNN as a sliding window detector on an image pyramid to build a feature pyramid. However, using an image pyramid is slow and memory inefficient. As a result, most two-stage detectors extract features on a single scale. R-CNN [7, 8] obtains region proposals by selective search [28], and then forwards each normalized image region through a CNN for classification. Faster R-CNN [24] and R-FCN [5] employ a Region Proposal Network (RPN) to generate initial region proposals. Besides, ROI-pooling [24] and position-sensitive RoI pooling [5] are applied to extract features from each region.
More recently, some research indicates that multi-scale features perform better for tiny objects. Specifically, SSD [20], MS-CNN [2], SSH [23] and S3FD [39] predict boxes on multiple layers of the feature hierarchy. FCN [22], Hypercolumns [9] and Parsenet [21] fuse multiple layer features in segmentation. FPN [15, 17], a top-down architecture, integrates high-level semantic information into all scales.
[Figure residue: panel (a) Original Feature Shot; layers: Input Image, conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2, conv7_2]
Table 6: FEM vs. RFB on WIDER FACE.
Backbone - ResNet101 (%)    Easy          Medium        Hard
DSFD (RFB)                  96.0          94.5          87.2
DSFD (FPN) / (FPN+RFB)      96.2 / 96.2   95.1 / 95.3   89.7 / 89.9
DSFD (FEM)                  96.3          95.4          90.1
Besides, Fig. 4 shows that our improved anchor matching strategy greatly increases the number of ground truth faces that are close to an anchor, which can reduce the contradiction between the discrete anchor scales and continuous face scales. Moreover, Fig. 5 shows the distribution of the number of matched anchors for ground truth faces, which indicates that our improved anchor matching can significantly increase the number of matched anchors, and the average number of matched anchors for different scales of faces can be improved from 6.4 to about 6.9.
Comparison with RFB. Our FEM differs from RFB in two aspects. First, our FEM is based on FPN to make full use of feature information from different spatial levels, which RFB ignores. Second, our FEM adopts stacked dilation convolutions in a multi-branch structure, which efficiently leads to larger Receptive Fields (RF) than RFB, which only uses one dilation layer in each branch, e.g., R^3 in FEM compared to R in RFB, where R indicates the RF of one dilation convolution. Tab. 6 clearly demonstrates the superiority of our FEM over RFB, even when RFB is equipped with FPN.
From the above analysis and results, some promising conclusions can be drawn: 1) Feature enhancement is crucial. We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 detection feature maps of different scales, and it improves the performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initial anchors and ground-truth faces for regressing anchors to faces, which achieves improvements of 0.3%, 0.1% and 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result in the hard setting reaches 91.2% AP.
[Figure residue: panel titles "Discontinuous ROC curves" and "Continuous ROC curves"]
Effects of Different Backbones. To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, we use the same setting except for the feature extraction network: we implement SE-ResNet101, DPN-98 and SE-ResNeXt101 32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101 32×4d achieves 95.7%, 94.8% and 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model and higher Top-1 ImageNet classification accuracy may not benefit face detection AP. Therefore, in our DSFD framework, better performance on classification is not necessary for better performance on detection, which is consistent with the conclusion claimed in [11, 16]. Our DSFD enjoys high inference speed, benefiting from simply using the second shot detection results. For VGA resolution inputs to Res50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.
4.3. Comparisons with State-of-the-Art Methods
We evaluate the proposed DSFD on two popular face detection benchmarks, WIDER FACE [35] and Face Detection Data Set and Benchmark (FDDB) [12]. Our model is trained using only the training set of WIDER FACE, and then evaluated on both benchmarks without any further fine-tuning. We also follow an approach similar to [31] to build the image pyramid for multi-scale testing and use a more powerful backbone, similar to [4].
WIDER FACE Dataset. It contains 393,703 annotated faces with large variations in scale, pose and occlusion in a total of 32,203 images. For each of the 60 event classes, 40%, 10% and 50% of the images in the database are randomly selected as training, validation and testing sets. Besides, each subset is further divided into three levels of difficulty, ’Easy’, ’Medium’ and ’Hard’, based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all of the state-of-the-art face detectors based on the average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on validation set, and 96.0% (Easy), 95.3% (Medium) and
Figure 8 (visible panel titles: Scale, Pose, Occlusion, Blurry): Illustration of the robustness of our DSFD to large variations in scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate that the detector confidence is above 0.8.
Figure 1: The proposed deep fitting approach can reconstruct high quality texture and geometry from a single image with precise identity recovery. The reconstructions in the figure and the rest of the paper are represented by a vector of 700 floating-point values and rendered without any special effects. We would like to highlight that the depicted texture is reconstructed by our model and none of the features are taken directly from the image.
Abstract
In the past few years, a lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In the most recent works, differentiable renderers were employed in order to learn the relationship between the facial identity features and the parameters of a 3D morphable model for shape and texture. The texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, the state-of-the-art methods are still not capable of reconstructing facial textures in high fidelity. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful generator of facial texture in UV space. Then, we revisit the original 3D Morphable Models (3DMMs) fitting approaches making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image but under a new perspective. We optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. We demonstrate excellent results in photorealistic and identity preserving 3D face reconstructions and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.¹
¹ Project page: https://github.com/barisgecer/ganfit
1. Introduction
Estimation of the 3D facial surface and other intrinsic components of the face from single images (e.g., albedo, etc.) is a very important problem at the intersection of computer vision and machine learning with countless applications (e.g., face recognition, face editing, virtual reality). It is now twenty years since the seminal work of Blanz and Vetter [4], which showed that it is possible to reconstruct shape and albedo by solving a non-linear optimization problem that is constrained by linear statistical models of facial texture and shape. This statistical model of texture and shape is called a 3D Morphable Model (3DMM). Arguably the most popular publicly available 3DMM is the Basel model built from 200 people [21]. Recently, large scale statistical models of face and head shape have been made publicly available [7, 10].
For many years 3DMMs and their variants were the methods of choice for 3D face reconstruction [33, 46, 22]. Furthermore, with appropriate statistical texture models on image features such as Scale Invariant Feature Transform (SIFT) and Histogram Of Gradients (HOG), 3DMM-based methodologies can still achieve state-of-the-art performance in 3D shape estimation on images captured under unconstrained conditions [6]. Nevertheless, those methods [6] can reconstruct only the shape and not the facial texture. Another line of research in [45, 34] decouples texture and shape reconstruction. A standard linear 3DMM fitting strategy [41] is used for face reconstruction followed by a
controlled environment to collect ∼20 million images.
In this paper, we still propose to build upon the success of DCNNs but take a radically different approach for 3D shape and texture reconstruction from a single in-the-wild image. That is, instead of formulating regression methodologies or auto-encoder structures that make use of self-supervision [39, 16, 43], we revisit the optimization-based 3DMM fitting approach using the supervision of deep identity features and using Generative Adversarial Networks (GANs) as our statistical parametric representation of the facial texture.
In particular, the novelties that this paper brings are:
• We show for the first time, to the best of our knowledge, that a large-scale high-resolution statistical reconstruction of the complete facial surface on an unwrapped UV space can be successfully used for reconstruction of arbitrary facial textures even captured in unconstrained recording conditions⁴.
number of steps for texture completion and refinement. In • We formulate a novel 3DMM fitting strategy which is
these papers [34, 45], the texture looks excellent when ren- based on GANs and a differentiable renderer.
dered under professional renderers (e.g., Arnold), neverthe-
• We devise a novel cost function which combines vari-
less when the texture is overlaid on the images the quality
ous content losses on deep identity features from a face
significantly drops 2 .
recognition network.
In the past two years, a lot of work has been con-
ducted on how to harness Deep Convolutional Neural Net- • We demonstrate excellent facial shape and texture re-
works (DCNNs) for 3D shape and texture reconstruction. constructions in arbitrary recording conditions that are
The first such methods either trained regression DCNNs shown to be both photorealistic and identity preserving
from image to the parameters of a 3DMM [42] or used in qualitative and quantitative experiments.
a 3DMM to synthesize images [30, 18] and formulate an
image-to-image translation problem using DCNNs to es- 2. History of 3DMM Fitting
timate the depth3 [36]. The more recent unsupervised Our methodology naturally extends and generalizes the
DCNN-based methods are trained to regress 3DMM param- ideas of texture and shape 3DMM using modern methods
eters from identity features by making use of differentiable for representing texture using GANs, as well as defines loss
image formation architectures [9] and differentiable render- functions using differentiable renderers and very powerful
ers [16, 40, 31]. publicly available face recognition networks [12]. Before
The most recent methods such as [39, 43, 14] use both we define our cost function, we will briefly outline the his-
the 3DMM model, as well as additional network structures tory of 3DMM representation and fitting.
(called correctives) in order to extend the shape and texture
representation. Even though the paper [39] shows that the 2.1. 3DMM representation
reconstructed facial texture has indeed more details than a The first step is to establish dense correspondences be-
texture estimated from a 3DMM [42, 40], it is still unable to tween the training 3D facial meshes and a chosen template
capture high-frequency details in texture and subsequently with fixed topology in terms of vertices and triangulation.
many identity characteristics (please see the Fig. 4). Fur-
thermore, because the method permits the reconstructions
2.1.1 Texture
to be outside the 3DMM space, it is susceptible to outliers
(e.g., glasses etc.) which are baked in shape and texture. Al- Traditionally 3DMMs use a UV map for representing tex-
though rendering networks (i.e. trained by VAE [26]) gen- ture. UV maps help us to assign 3D texture data into 2D
erates outstanding quality textures, each network is capable 4 In the very recent works, it was shown that it is feasible to reconstruct
of storing up to few individuals whom should be placed in a the non-visible parts a UV space for facial texture completion[11] and that
GANs can be used to generate novel high-resolution faces[38]. Neverthe-
2 Please see the supplementary materials for a comparison with [34, 45].
less, our work is the first one that demonstrates that a GAN can be used
3 The depth was afterwards refined by fitting a 3DMM and then chang- as powerful statistical texture prior and reconstruct the complete texture of
ing the normals by using image features. arbitrary facial images.
[Figure 2 diagram. Components: PCA Shape Model (ps), Expression Blend Shapes (pe), Texture GAN (Sec. 3.1, pt), Camera and Lighting Parameters (c, i), Sampling, Coloured mesh, Differentiable Renderer (Sec. 3.2; Eq. 6, Eq. 7, with a second rendering under random pe, c, i), Face Recognition CNN (Sec. 3.3.1), Landmark Detector (Sec. 3.3.4), Input Image.]
Figure 2: Detailed overview of the proposed approach. A 3D face reconstruction is rendered by a differentiable renderer
(shown in purple). Cost functions are mainly formulated by means of identity features on a pretrained face recognition
network (shown in gray) and they are optimized by flowing the error all the way back to the latent parameters (ps , pe , pt , c, i,
shown in green) with gradient descent optimization. End-to-end differentiable architecture enables us to use computationally
cheap and reliable first order derivatives for optimization thus making it possible to employ deep networks as a generator
(i.e., statistical model) or as a cost function.
planes with universal per-pixel alignment for all textures. A commonly used UV map is built by cylindrical unwrapping of the mean shape into a 2D flat space formulation, which we use to create an RGB image $I_{UV}$. Each vertex in the 3D space has a texture coordinate $t_{coord}$ in the UV image plane in which the texture information is stored. A universal function exists, where for each vertex we can sample the texture information from the UV space as $T = \mathcal{P}(I_{UV}, t_{coord})$.
In order to define a statistical texture representation, all the training texture UV maps are vectorized and Principal Component Analysis (PCA) is applied. Under this model any test texture $T^0$ is approximated as a linear combination of the mean texture $m_t$ and a set of bases $U_t$ as follows:
$$T(p_t) \approx m_t + U_t p_t \qquad (1)$$
where $p_t$ are the texture parameters for the test sample $T^0$. In the early 3DMM studies, the statistical model of the texture was built with few faces captured in strictly controlled conditions and was used to reconstruct the test albedo of the face. Such texture models can hardly represent faces captured in uncontrolled recording conditions (in-the-wild). Recently it was proposed to use statistical models of hand-crafted features such as SIFT or HoG [6] learned directly from in-the-wild faces. The interested reader is referred to [5, 32] for more details on texture models used in 3DMM fitting algorithms.
The recent 3D face fitting methods [39, 43, 14] still make use of similar statistical models for the texture. Hence, they can naturally represent only the low-frequency components of the facial texture (please see Fig. 4).

2.1.2 Shape
The method of choice for building statistical models of facial or head 3D shapes is still PCA [23]. Assume that the 3D shapes in correspondence comprise $N$ vertices, i.e. $s = [x_1^T, \ldots, x_N^T]^T = [x_1, y_1, z_1, \ldots, x_N, y_N, z_N]^T$. In order to represent both variations in terms of identity and expression, generally two linear models are used. The first is learned from facial scans displaying the neutral expression (i.e., representing identity variations) and the second is learned from displacement vectors (i.e., representing expression variations). Then a test facial shape $S(p_{s,e})$ can be written as
$$S(p_{s,e}) \approx m_{s,e} + U_{s,e} p_{s,e} \qquad (2)$$
where $m_{s,e}$ is the mean shape vector and $U_{s,e} = [U_s, U_e] \in \mathbb{R}^{3N \times n_{s,e}}$, where $U_s$ are the bases that correspond to identity variations and $U_e$ the bases that correspond to expression. Finally, $p_{s,e}$ are the $n_{s,e}$ shape parameters, which can be split accordingly into the identity and expression parts: $p_{s,e} = [p_s, p_e]$.
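As a small illustration of the linear shape model in Eq. 2, the following NumPy sketch evaluates a shape from stacked identity and expression bases. The mean and bases here are random placeholders with toy dimensions, and the function name `linear_shape` is hypothetical; a real model would load PCA bases learned from registered scans.

```python
import numpy as np

def linear_shape(mean, U_id, U_exp, p_id, p_exp):
    """Evaluate S(p_{s,e}) = m_{s,e} + U_{s,e} p_{s,e} with U_{s,e} = [U_s, U_e]."""
    U = np.hstack([U_id, U_exp])        # (3N, n_s + n_e)
    p = np.concatenate([p_id, p_exp])   # (n_s + n_e,)
    return mean + U @ p                 # (3N,)

# Toy sizes and random bases, for illustration only.
rng = np.random.default_rng(0)
N, n_s, n_e = 4, 3, 2
mean = rng.normal(size=3 * N)
U_id = rng.normal(size=(3 * N, n_s))
U_exp = rng.normal(size=(3 * N, n_e))

# Zero parameters reproduce the mean shape.
neutral = linear_shape(mean, U_id, U_exp, np.zeros(n_s), np.zeros(n_e))

# Non-zero parameters move the shape along the identity and expression bases.
p_id, p_exp = rng.normal(size=n_s), rng.normal(size=n_e)
shape = linear_shape(mean, U_id, U_exp, p_id, p_exp)
vertices = shape.reshape(N, 3)          # each row is [x_i, y_i, z_i]
```

Because the model is linear, the stacked form is equivalent to adding the identity and expression offsets separately, which is why the parameters can be split as $[p_s, p_e]$.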
2.2. Fitting
3D face and texture reconstruction by fitting a 3DMM is performed by solving a non-linear energy based cost optimization problem that recovers a set of parameters $p = [p_{s,e}, p_t, p_c, p_l]$, where $p_c$ are the parameters related to a camera model and $p_l$ are the parameters related to an illumination model. The optimization can be formulated as:
$$\min_{p} E(p) = \|I^0(p) - W(p)\|_2^2 + \mathrm{Reg}(\{p_{s,e}, p_t\}) \qquad (3)$$
where $I^0$ is the test image to be fitted and $W$ is a vector produced by a physical image formation process (i.e., rendering) controlled by $p$. Finally, $\mathrm{Reg}$ is the regularization term that is mainly related to texture and shape parameters.
Various methods have been proposed for numerical optimization of the above cost functions [19, 2]. A notable recent approach is [6], which uses handcrafted features (i.e., $\mathcal{H}$) for texture representation and simplifies the cost function as:
$$\min_{p_r} E(p_r) = \|\mathcal{H}(I^0(p_r)) - \mathcal{H}(W(p_r))\|_A^2 + \mathrm{Reg}(p_{s,e}) \qquad (4)$$
where $\|a\|_A^2 = a^T A a$, $A$ is the orthogonal space to the statistical model of the texture and $p_r$ is the set of reduced parameters $p_r = \{p_{s,e}, p_c\}$. The optimization problem in Eq. 4 is solved by the Gauss-Newton method. The main drawback of this method is that the facial texture is not reconstructed.
In this paper, we generalize the 3DMM fittings and introduce the following novelties:
• We use a GAN on high-resolution UV maps as our statistical representation of the facial texture. That way we can reconstruct textures with high-frequency details.
• Instead of other cost functions used in the literature such as low-level $\ell_1$ or $\ell_2$ loss (e.g., RGB values [29], edges [33]) or hand-crafted features (e.g., SIFT [6]), we propose a novel cost function that is based on feature loss from the various layers of a publicly available face recognition embedding network [12]. Unlike others, deep identity features are very powerful at preserving identity characteristics of the input image.
• We replace the physical image formation stage with a differentiable renderer to make use of first order derivatives (i.e., gradient descent). Unlike its alternatives, gradient descent provides computationally cheaper and more reliable derivatives through such deep architectures (i.e., the above-mentioned texture GAN and identity DCNN).

3. Approach
We propose an optimization-based 3D face reconstruction approach from a single image that employs a high fidelity texture generation network as statistical prior, as illustrated in Fig. 2. To this end, the reconstruction mesh is formed by a 3D morphable shape model; textured by the generator network's output UV map; and projected into a 2D image by a differentiable renderer. The distance between the rendered image and the input image is minimized in terms of a number of cost functions by updating the latent parameters of the 3DMM and the texture network with gradient descent. We mainly formulate these functions based on rich features of a face recognition network [12, 35, 28] for smoother convergence and a landmark detection network [13] for alignment and rough shape estimation.
The following sections first introduce our novel texture model that employs a generator network trained by the progressive growing GAN framework. After describing the procedure for image formation with a differentiable renderer, we formulate our cost functions and the procedure for fitting our shape and texture models onto a test image.

3.1. GAN Texture Model
Although conventional PCA is powerful enough to build a decent shape and texture model, it is often unable to capture high frequency details and ends up having blurry textures due to its Gaussian nature. This becomes more apparent in texture modelling, which is a key component in 3D reconstruction to preserve identity as well as photo-realism.
GANs are shown to be very effective at capturing such details. However, they struggle to preserve the 3D coherency [17] of the target distribution when the training images are semi-aligned. We found that a GAN trained with UV representation of real textures with per pixel alignment avoids this problem and is able to generate realistic and coherent UVs from 99.9% of its latent space while at the same time generalizing well to unseen data.
In order to take advantage of this perfect harmony, we train a progressive growing GAN [24] to model the distribution of UV representations of 10,000 high resolution textures and use the trained generator network
$$\mathcal{G}(p_t) : \mathbb{R}^{512} \rightarrow \mathbb{R}^{H \times W \times C} \qquad (5)$$
as texture model that replaces the 3DMM texture model in Eq. 1.
While fitting with linear models, i.e. 3DMM, is as simple as a linear transformation, fitting with a generator network can be formulated as an optimization that minimizes the per-pixel Manhattan distance between the target texture in UV space $I_{uv}$ and the network output $\mathcal{G}(p_t)$ with respect to the latent parameter $p_t$, i.e. $\min_{p_t} |\mathcal{G}(p_t) - I_{uv}|$.
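The latent fitting described in Sec. 3.1, $\min_{p_t} |\mathcal{G}(p_t) - I_{uv}|$, can be sketched with a toy generator. The sketch below is only an illustration under stated assumptions: `generator` is a hypothetical stand-in (a fixed random linear map followed by tanh), not a progressive growing GAN, and the update is plain subgradient descent with a fixed step size rather than the optimizer used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_pixels = 8, 64
W = rng.normal(size=(n_pixels, latent_dim)) / np.sqrt(latent_dim)

def generator(p):
    """Toy stand-in for G(p_t): a fixed linear map followed by tanh."""
    return np.tanh(W @ p)

def l1_loss(p, target):
    """Per-pixel Manhattan distance |G(p) - target| summed over pixels."""
    return np.abs(generator(p) - target).sum()

def l1_grad(p, target):
    """Subgradient of the L1 loss with respect to the latent vector p."""
    out = generator(p)
    # d|r|/dr = sign(r) and d tanh(z)/dz = 1 - tanh(z)^2
    return W.T @ (np.sign(out - target) * (1.0 - out ** 2))

# Build a target texture from a hidden latent so a good fit exists.
p_true = rng.normal(size=latent_dim)
target = generator(p_true)

p = np.zeros(latent_dim)
initial = l1_loss(p, target)
best = initial
for step in range(500):
    p = p - 0.01 * l1_grad(p, target)
    best = min(best, l1_loss(p, target))
```

The same loop structure applies when the generator is a deep network, with the hand-written subgradient replaced by automatic differentiation.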
3.2. Differentiable Renderer
Following [16], we employ a differentiable renderer to project the 3D reconstruction into a 2D image plane based on a deferred shading model with given camera and illumination parameters. Since color and normal attributes at each vertex are interpolated at the corresponding pixels with barycentric coordinates, gradients can be easily backpropagated through the renderer to the latent parameters.
A 3D textured mesh at the center of the Cartesian origin $[0, 0, 0]$ is projected onto the 2D image plane by a pinhole camera model with the camera standing at $[x_c, y_c, z_c]$, directed towards $[x'_c, y'_c, z'_c]$ and with the focal length $f_c$. The illumination is modelled by Phong shading given 1) a direct light source at 3D coordinates $[x_l, y_l, z_l]$ with color values $[r_l, g_l, b_l]$, and 2) the color of ambient lighting $[r_a, g_a, b_a]$.
Finally, we denote the rendered image given geometry ($p_{s,e}$), texture ($p_t$), camera ($p_c = [x_c, y_c, z_c, x'_c, y'_c, z'_c, f_c]$) and lighting parameters ($p_l = [x_l, y_l, z_l, r_l, g_l, b_l, r_a, g_a, b_a]$) by the following:

age etc. These features are shown to be quite effective at many other tasks including novel identity synthesizing [15], face normalization [9] and 3D face reconstruction [16]. In our approach, we take advantage of an off-the-shelf state-of-the-art face recognition network [12] (see footnote 5) in order to capture identity related features of an input face image and optimize the latent parameters accordingly. More specifically, given a pretrained face recognition network $\mathcal{F}^n(I) : \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{512}$ consisting of $n$ convolutional filters, we calculate the cosine distance between the identity features (i.e., embeddings) of the real target image and our rendered images as follows:
$$\mathcal{L}_{id} = 1 - \frac{\mathcal{F}^n(I^0) \cdot \mathcal{F}^n(I^R)}{\|\mathcal{F}^n(I^0)\|_2 \, \|\mathcal{F}^n(I^R)\|_2} \qquad (8)$$
We formulate an additional identity loss on the rendered image $\hat{I}^R$ that is rendered with random pose, expression and lighting. This loss ensures that our reconstruction resembles the target identity under different conditions. We formulate it by replacing $I^R$ by $\hat{I}^R$ in Eq. 8 and it is denoted as $\hat{\mathcal{L}}_{id}$.
Footnote 5: ... almost equally well and this choice is orthogonal to the proposed approach.
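The identity loss of Eq. 8 is a cosine distance between two embedding vectors. A minimal NumPy sketch follows; the function name `identity_loss` and the small `eps` guard against zero-norm inputs are assumptions added for illustration, and in the paper the embeddings come from a pretrained face recognition network rather than hand-written vectors.

```python
import numpy as np

def identity_loss(emb_target, emb_render, eps=1e-12):
    """L_id = 1 - <a, b> / (||a||_2 ||b||_2) for two embedding vectors."""
    a = np.asarray(emb_target, dtype=float)
    b = np.asarray(emb_render, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / max(denom, eps)

# Parallel embeddings give a loss near 0, orthogonal ones near 1,
# and opposite ones near 2; the loss ignores the embedding magnitudes.
same = identity_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = identity_loss([1.0, 0.0], [0.0, 3.0])
opposite = identity_loss([1.0, 2.0], [-1.0, -2.0])
```

The additional loss $\hat{\mathcal{L}}_{id}$ would reuse the same function, with the second argument computed from the image rendered under random pose, expression and lighting.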
Figure 3: Example fits of our approach for images from various datasets. Please note that our fitting approach is robust to occlusion (e.g., glasses), low resolution and black-and-white photos, and generalizes well across ethnicity, gender and age. The reconstructed textures are very good at capturing high frequency details of the identities; likewise, the reconstructed geometries from 3DMM are surprisingly good at identity preservation thanks to the identity features used, e.g. crooked nose at bottom-left, dull eyes at bottom-right and chin dimple at top-left.
downscaled to 112 × 112 before identity and content loss. The pixel loss is defined by the pixel level $\ell_1$ loss function as:
$$\mathcal{L}_{pix} = \|I^0 - I^R\|_1 \qquad (10)$$

3.3.4 Landmark Loss
The face recognition network $\mathcal{F}$ is pre-trained on images that are aligned by similarity transformation to a fixed landmark template. To be compatible with the network, we align the input and rendered images under the same settings. However, this process disregards the aspect ratio and scale of the reconstruction. Therefore, we employ a deep face alignment network [13] $\mathcal{M}(I) : \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{68 \times 2}$ to detect landmark locations of the input image and align the rendered geometry onto it by updating the shape, expression and camera parameters. That is, camera parameters are optimized to align with the pose of image $I$ and geometry parameters are optimized for the rough shape estimation. As a natural consequence, this alignment drastically improves the effectiveness of the pixel and content loss, which are sensitive to misalignment between the two images.
The alignment error is measured by point-to-point Euclidean distances between the detected landmark locations of the input image and the 2D projection of the 3D reconstruction landmark locations, which are available as meta-data of the shape model. Since landmark locations of the reconstruction heavily depend on camera parameters, this loss is a great source of information for the alignment of the reconstruction onto the input image and is formulated as follows:
$$\mathcal{L}_{lan} = \|\mathcal{M}(I^0) - \mathcal{M}(I^R)\|_2 \qquad (11)$$

3.4. Model Fitting
We first roughly align our reconstruction to the input image by optimizing shape, expression and camera parameters by: $\min_{p_r} E(p_r) = \lambda_{lan} \mathcal{L}_{lan}$. We then simultaneously optimize all of our parameters with gradient descent and backpropagation so as to minimize a weighted combination of the above loss terms:
$$\min_{p} E(p) = \lambda_{id} \mathcal{L}_{id} + \hat{\lambda}_{id} \hat{\mathcal{L}}_{id} + \lambda_{con} \mathcal{L}_{con} + \lambda_{pix} \mathcal{L}_{pix} + \lambda_{lan} \mathcal{L}_{lan} + \lambda_{reg} \mathrm{Reg}(\{p_{s,e}, p_l\}) \qquad (12)$$
where we weight each of our loss terms with $\lambda$ parameters. In order to prevent our shape and expression models and lighting parameters from exaggerating to arbitrarily bias our loss terms, we regularize those parameters by $\mathrm{Reg}(\{p_{s,e}, p_l\})$.
Fitting with Multiple Images (i.e. Video): While the proposed approach can fit a 3D reconstruction from a single image, one can take advantage of more images effectively when available, e.g. from a video recording. This often helps to improve reconstruction quality under challenging conditions, e.g. outdoor, low resolution. While state-of-the-art methods follow naive approaches by averaging either the reconstruction [42] or features-to-be-regressed [16]
[Figure 4 image. Rows, top to bottom: Input Images; Ours; Genova [16]; A.T. Tran et al. [42]; Tewari et al. [39]; Ours (geometry); Tewari et al. [39] (geometry); L. Tran et al. [43] (geometry).]
Figure 4: Comparison of our qualitative results with other state-of-the-art methods on the MoFA-Test dataset. Rows 2-5 show comparison with textured geometry and rows 6-8 compare only shapes. The figure is best viewed in color and under zoom.
before making a reconstruction, we utilize the power of iterative optimization by averaging identity reconstruction parameters ($p_s$, $p_t$) after every iteration. For an image set $\mathcal{I} = \{I^0, I^1, \ldots, I^i, \ldots, I^{n_i}\}$, we reformulate our parameters as $p = [p_s, p_e^i, p_t, p_c^i, p_l^i]$, in which we average shape and texture parameters by the following:
$$p_s = \sum_{i}^{n} p_s^i, \qquad p_t = \sum_{i}^{n} p_t^i \qquad (13)$$

4. Experiments
This section demonstrates the excellent performance of the proposed approach for 3D face reconstruction and shape recovery. We verify this by qualitative results in Figures 1 and 3, qualitative comparisons with the state-of-the-art in Sec. 4.2 and a quantitative shape reconstruction experiment on a database with ground truth in Sec. 4.3.

4.1. Implementation Details
For all of our experiments, a given face image is aligned to our fixed template using 68 landmark locations detected by an hourglass 2D landmark detection network [13]. For the identity features, we employ the pretrained models of the ArcFace [12] network. For the generator network $\mathcal{G}$, we train a progressive growing GAN [24] with around 10,000 UV maps from [7] at the resolution of 512 × 512. We use the Large Scale Face Model [7] for the 3DMM shape model with $n_s = 158$ and the expression model learned from the 4DFAB database [8] with $n_e = 29$. During the fitting process, we optimize parameters using the Adam solver [25] with a learning rate of 0.01. We set our balancing factors as follows: $\lambda_{id} = 2.0$, $\hat{\lambda}_{id} = 2.0$, $\lambda_{con} = 50.0$, $\lambda_{pix} = 1.0$, $\lambda_{lan} = 0.001$, $\lambda_{reg} = \{0.05, 0.01\}$. The fitting converges in around 30 seconds on an Nvidia GTX 1080 Ti GPU for a single image.
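To make the weighting in Eq. 12 concrete, the sketch below combines already-computed loss values with the balancing factors listed in the implementation details. Two points are assumptions rather than facts from the text: the source gives two regularization weights {0.05, 0.01} without stating which parameter group each applies to, so the assignment to shape/expression and lighting is a guess, and the loss values passed in are made-up numbers for illustration.

```python
# Balancing factors as listed in Sec. 4.1.
WEIGHTS = {"id": 2.0, "id_hat": 2.0, "con": 50.0, "pix": 1.0, "lan": 0.001}

# Assumed assignment of the two regularization weights (not stated in the text).
REG_WEIGHTS = {"shape_expr": 0.05, "lighting": 0.01}

def total_energy(losses, reg_terms):
    """Weighted sum of the loss terms plus the weighted regularizers (Eq. 12)."""
    data = sum(WEIGHTS[name] * losses[name] for name in WEIGHTS)
    reg = sum(REG_WEIGHTS[name] * reg_terms[name] for name in REG_WEIGHTS)
    return data + reg

# Hypothetical per-term values for a single optimization step.
losses = {"id": 0.5, "id_hat": 0.25, "con": 0.01, "pix": 0.2, "lan": 100.0}
reg_terms = {"shape_expr": 2.0, "lighting": 1.0}
energy = total_energy(losses, reg_terms)
```

The small landmark weight compensates for the landmark loss being measured in pixel units, which are much larger in magnitude than the cosine-based identity terms.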
                      Cooperative       Indoor            Outdoor
Method                Mean    Std.      Mean    Std.      Mean    Std.
Tran et al. [42]      1.93    0.27      2.02    0.25      1.86    0.23
Booth et al. [6]      1.82    0.29      1.85    0.22      1.63    0.16
Genova et al. [16]    1.50    0.13      1.50    0.11      1.48    0.11
Ours                  0.95    0.107     0.94    0.106     0.94    0.106
Table 1: Accuracy results for the meshes on the MICC Dataset using point-to-plane distance. The table reports the mean error (Mean) and the standard deviation (Std.).
[Figure panels: (a) I^0, (b) I^R, (c) I^R albedo]
cosine similarity between 1) real and rendered images and 2) renderings of same/different pairs.
In Fig. 6 and 7, we have quantitatively shown that our method is better at identity preservation and photorealism (i.e., as the pretrained network is trained by real images) than other state-of-the-art deep 3D face reconstruction approaches [16, 42].
[Plot: Rendering-to-photo cosine similarity on LFW. Legend: Genova et al., Tran et al., Ours.]
[Plot legend: Genova et al. same, Genova et al. different, Ours same, Ours different.]
Figure 9: Qualitative comparison with [45, 37] by overlaying the reconstructions on the input images. Our method can generate high fidelity texture with accurate shape, camera and illumination fitting.
Figure 10: Qualitative comparison with [34] by means of texture maps, whole and partial face renderings. Please note that
while our method does not require any particular renderer for special effects, e.g., lighting, [34] produce these renderings
with a commercial renderer called Arnold.
[Figure 11 panels: (a) I^0, (b) I^R, (c) I^R_alb., (d) I^R − I^R_alb., (e) S]
Figure 11: Results under more challenging conditions, i.e. strong illuminations, self-occlusions and facial hair. (a) Input image. (b) Estimated fitting overlaid, including illumination estimation. (c) Overlaid fitting without illumination. (d) Pixel-wise intensity difference of (b) to (c). (e) Estimated shape mesh.
DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images
Yuying Ge (1), Ruimao Zhang (1), Lingyun Wu (2), Xiaogang Wang (1), Xiaoou Tang (1), and Ping Luo (1)
(1) The Chinese University of Hong Kong
(2) SenseTime Research
arXiv:1901.07973v1 [cs.CV] 23 Jan 2019
Abstract
Understanding fashion images has been advanced by benchmarks with rich annotations such as DeepFashion, whose labels include clothing categories, landmarks, and consumer-commercial image pairs. However, DeepFash-
narios. We fill in the gap by presenting DeepFashion2 to address these issues. It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items where
[Figure 1 image. Panel (a): DeepFashion. Category labels shown: tank top, cardigan, vest, vest top, shorts, skirt, trousers, long sleeve outwear.]
tions. For instance, DeepFashion2 has 801K clothing items in total, where each item in an image is labeled with scale, occlusion, zooming, viewpoint, bounding box, dense landmarks, and per-pixel mask, as shown in Fig.1(b). These items can be grouped into 43.8K clothing identities, where a clothing identity represents the clothes that have almost the same cutting, pattern, and design. The images of the same identity are taken by both customers and commercial shopping stores. An item from the customer and an item from the commercial store form a pair. There are 873K pairs, which is 3.5 times more than DeepFashion. The above thorough annotations enable developments of strong algorithms to understand fashion images.

              WTBI       DARN       DeepFashion   ModaNet     FashionAI   DeepFashion2
year          2015 [5]   2015 [7]   2016 [14]     2018 [21]   2018 [1]    now
#images       425K       182K       800K          55K         357K        491K
#categories   11         20         50            13          41          13
#bboxes       39K        7K         ×             ×           ×           801K
#landmarks    ×          ×          120K          ×           100K        801K
#masks        ×          ×          ×             119K        ×           801K
#pairs        39K        91K        251K          ×           ×           873K
Table 1. Comparisons of DeepFashion2 with the other clothes datasets. The rows represent number of images, bounding boxes, landmarks, per-pixel masks, and consumer-to-shop pairs respectively. Bounding boxes inferred from other annotations are not counted.

This work has three main contributions. (1) We build a large-scale fashion benchmark with comprehensive tasks and annotations, to facilitate fashion image analysis. DeepFashion2 possesses the richest definitions of tasks and the largest number of labels. Its annotations are at least 3.5× of DeepFashion [14], 6.7× of ModaNet [21], and 8× of FashionAI [1]. (2) A full spectrum of tasks is carefully defined on the proposed dataset. For example, to our knowledge, clothing pose estimation is presented for the first time in the literature by defining landmarks and poses of 13 categories that are more diverse and fruitful than human pose. (3) With DeepFashion2, we extensively evaluate Mask R-CNN [6], a recent advanced framework for visual perception. A novel Match R-CNN is also proposed to aggregate all the learned features from clothes categories, poses, and masks to solve clothing image retrieval in an end-to-end manner. DeepFashion2 and implementations of Match R-CNN will be released.

1.1. Related Work
Clothes Datasets. Several clothes datasets have been proposed such as [20, 5, 7, 14, 21, 1], as summarized in Table 1. They vary in size as well as amount and type of annotations. For example, WTBI [5] and DARN [7] have 425K and 182K images respectively. They scraped category labels from metadata of the collected images from online shopping websites, making their labels noisy. In contrast, CCP [20], DeepFashion [14], and ModaNet [21] obtain category labels from human annotators. Moreover, different kinds of annotations are also provided in these datasets. For example, DeepFashion labels 4∼8 landmarks (keypoints) per image that are defined on the functional regions of clothes (e.g. ‘collar’). The definitions of these sparse landmarks are shared across all categories, making it difficult for them to capture rich variations of clothing images. Furthermore, DeepFashion does not have mask annotations. By comparison, ModaNet [21] has street images with masks (polygons) of a single person but without landmarks. Unlike existing datasets, DeepFashion2 contains 491K images and 801K instances of landmarks, masks, and bounding boxes, as well as 873K pairs. It is the most comprehensive benchmark of its kind to date.
Fashion Image Understanding. There are various tasks that analyze clothing images such as clothes detection [2, 14], landmark prediction [15, 19, 17], clothes segmentation [18, 20, 13], and retrieval [7, 5, 14]. However, a unified benchmark and framework to account for all these tasks is still desired. DeepFashion2 and Match R-CNN fill in this blank. We report extensive results for the above tasks with respect to different variations, including scale, occlusion, zoom-in, and viewpoint. For the task of clothes retrieval, unlike previous methods [5, 7] that performed image-level retrieval, DeepFashion2 enables instance-level retrieval of clothing items. We also present a new fashion task called clothes pose estimation, which is inspired by human pose estimation to predict clothing landmarks and skeletons for 13 clothes categories. This task helps improve performance of fashion image analysis in real-world applications.

2. DeepFashion2 Dataset and Benchmark
Overview. DeepFashion2 has four unique characteristics compared to existing fashion datasets. (1) Large Sample Size. It contains 491K images of 43.8K clothing identities of interest (unique garment displayed by shopping stores). On average, each identity has 12.7 items with different styles such as color and printing. DeepFashion2 contains 801K items in total. It is the largest fashion database to date. Furthermore, each item is associated with various annotations as introduced above.
(2) Versatility. DeepFashion2 is developed for multiple tasks of fashion understanding. Its rich annotations support clothes detection and classification, dense landmark and pose estimation, instance segmentation, and cross-domain instance-level clothes retrieval.
(3) Expressivity. This is mainly reflected in two aspects. First, multiple items are present in a single image, unlike DeepFashion where each image is labeled with at most one item. Second, we have 13 different definitions of landmarks and poses (skeletons) for 13 different categories. There is
[Figure 2 image. Column groups: Commercial, Customer. Rows: (1) Scale, (2) Occlusion, (3) Zoom-in, (4) Viewpoint. Numbered landmark definitions are shown for: Short sleeve top, Shorts, Long sleeve outwear, Long sleeve dress.]
Figure 2. Examples of DeepFashion2. The first column shows definitions of dense landmarks and skeletons of four categories. From (1)
to (4), each row represents clothes images with different variations including ‘scale’, ‘occlusion’, ‘zoom-in’, and ‘viewpoint’. At each row,
we partition the images into two groups, the left three columns represent clothes from commercial stores, while the right three columns are
from customers. In each group, the three images indicate three levels of difficulty with respect to the corresponding variation, including (1)
‘small’, ‘moderate’, ‘large’ scale, (2) ‘slight’, ‘medium’, ‘heavy’ occlusion, (3) ‘no’, ‘medium’, ‘large’ zoom-in, (4) ‘not on human’, ‘side’,
‘back’ viewpoint. Furthermore, at each row, the items in these two groups of images are from the same clothing identity but from two
different domains, that is, commercial and customer. The items of the same identity may have different styles such as color and printing.
Each item is annotated with landmarks and masks.
23 defined landmarks for each category on average. Some definitions are shown in the first column of Fig.2. These representations are different from human pose and are not presented in previous work. They facilitate learning of strong clothes features that satisfy real-world requirements.
(4) Diversity. We collect data by controlling their variations in terms of four properties including scale, occlusion, zoom-in, and viewpoint as illustrated in Fig.2, making DeepFashion2 a challenging benchmark. For each property, each clothing item is assigned to one of three levels of difficulty. Fig.2 shows that each identity has high diversity where its items are from different difficulties.
Data Collection and Cleaning. Raw data of DeepFashion2 are collected from two sources including DeepFashion [14] and online shopping websites. In particular, images of each consumer-to-shop pair in DeepFashion are included in DeepFashion2, while the other images are removed. We further crawl a large set of images on the Internet from both commercial shopping stores and consumers. To clean up the crawled set, we first remove shop images with no corresponding consumer-taken photos. Then human annotators are asked to clean images that contain clothes with large occlusions, small scales, and low resolutions. Eventually we have 491K images of 801K items and 873K commercial-consumer pairs.
Variations. We explain the variations in DeepFashion2. Their statistics are plotted in Fig.3. (1) Scale. We divide all clothing items into three sets, according to the proportion of an item compared to the image size, including ‘small’ (< 10%), ‘moderate’ (10% ∼ 40%), and ‘large’ (> 40%). Fig.3(a) shows that only 50% of items have moderate scale. (2) Occlusion. An item with occlusion means that its region is occluded by hair, human body, accessory or other items. Note that an item with its region outside the im-
(a) small slight no no wear
(d)
7%
moderate 6% medium 12% medium frontal
7%
24% 26% large heavy large 8% side
47% 21% back
47%
67%
50% 78%
150000
100000
50000
1000
Figure 3. (a) shows the statistics of different variations in DeepFashion2. (b) is the numbers of items of the 13 categories in DeepFashion2.
(c) shows that categories in DeepFashion [14] have ambiguity. For example, it is difficult to distinguish between ‘cardigan’ and ‘coat’, and
between ‘joggers’ and ‘sweatpants’. They result in ambiguity when labeling data. (d) Top: masks may be inaccurate when complex poses
are presented. Bottom: the masks will be refined by human.
age does not belong to this case. Each item is categorized landmarks following these instructions.
by the number of its landmarks that are occluded, includ- Moreover, each landmark is assigned one of the two
ing ‘partial occlusion’(< 20% occluded keypoints), ‘heavy modes, ‘visible’ or ‘occluded’. We then generate contours
occlusion’ (> 50% occluded keypoints), ‘medium occlu- and skeletons automatically by connecting landmarks in a
sion’ (otherwise). More than 50% items have medium or certain order. To facilitate this process, annotators are also
heavy occlusions as summarized in Fig.3. (3) Zoom-in. An asked to distinguish landmarks into two types, that is, con-
item with zoom-in means that its region is outside the im- tour point or junction point. The former one refers to key-
age. This is categorized by the number of landmarks out- points at the boundary of an item, while the latter one is
side image. We define ‘no’, ‘large’ (> 30%), and ‘medium’ assigned to keypoints in conjunction e.g. ‘endpoint of strap
zoom-in. We see that more than 30% items are zoomed in. on sling’. The above process controls the labeling quality,
(4) Viewpoint. We divide all items into four partitions in- because the generated skeletons help the annotators reex-
cluding 7% clothes that are not on people, 78% clothes on amine whether the landmarks are labeled with good quality.
people from frontal viewpoint, 15% clothes on people from In particular, only when the contour covers the entire item,
side or back viewpoint. the labeled results are eligible, otherwise keypoints will be
refined.
2.1. Data Labeling Mask. We label per-pixel mask for each item in a semi-
Category and Bounding Box. Human annotators are automatic manner with two stages. The first stage automat-
asked to draw a bounding box and assign a category label ically generates masks from the contours. In the second
for each clothing item. DeepFashion [14] defines 50 cat- stage, human annotators are asked to refine the masks, be-
egories but half of them contain less than 5‰ number of cause the generated masks may be not accurate when com-
images. Also, ambiguity exists between 50 categories mak- plex human poses are presented. As shown in Fig.3(d), the
ing data labeling difficult as shown in Fig.3(c). By grouping mark is inaccurate when an image is taken from side-view
categories in DeepFashion, we derive 13 popular categories of people crossing legs. The masks will be refined by hu-
without ambiguity. The numbers of items of 13 categories man.
are shown in Fig.3(b). Style. As introduced before, we collect 43.8K different
Clothes Landmark, Contour, and Skeleton. As differ- clothing identities where each identity has 13 items on av-
ent categories of clothes (e.g. upper- and lower-body gar- erage. These items are further labeled with different styles
ment) have different deformations and appearance changes, such as color, printing, and logo. Fig.2 shows that a pair
we represent each category by defining its pose, which is a of clothes that have the same identity could have different
set of landmarks as well as contours and skeletons between styles.
landmarks. They capture shapes and structures of clothes.
2.2. Benchmarks
Pose definitions are not presented in previous work and are
significantly different from human pose. For each clothing We build four benchmarks by using the images and la-
item of a category, human annotations are asked to label bels from DeepFashion2. For each benchmark, there are
4
391K images for training, 34K images for validation and
𝐼" FN RoIAlign 14x14 14x14 28x28
landmark
PN
67K images for test. x256 x512 x32
5
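The difficulty levels described under Variations above can be sketched as three small helpers. This is only an illustration under stated assumptions: the function names are hypothetical, the item proportion is taken to be an area ratio, ‘no’ zoom-in is assumed to mean that no landmark lies outside the image, and the handling of values exactly on a threshold is a guess, since the text does not specify it. The lowest occlusion level is named ‘slight’ as in Fig.2.

```python
def scale_level(item_area: float, image_area: float) -> str:
    """Scale level from the fraction of the image covered by the item.

    Assumption: the 'proportion of an item' is an area ratio.
    """
    ratio = item_area / image_area
    if ratio < 0.10:
        return "small"
    if ratio <= 0.40:
        return "moderate"
    return "large"


def occlusion_level(num_occluded: int, num_landmarks: int) -> str:
    """Occlusion level from the fraction of occluded landmarks."""
    ratio = num_occluded / num_landmarks
    if ratio < 0.20:
        return "slight"
    if ratio > 0.50:
        return "heavy"
    return "medium"


def zoom_in_level(num_outside: int, num_landmarks: int) -> str:
    """Zoom-in level from the fraction of landmarks outside the image.

    Assumption: 'no' zoom-in means that no landmark is outside the image.
    """
    ratio = num_outside / num_landmarks
    if ratio == 0:
        return "no"
    if ratio > 0.30:
        return "large"
    return "medium"
```

For instance, an item with 6 of its 10 landmarks occluded would fall into the ‘heavy’ occlusion subset under these assumptions.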
Metric             | scale                  | occlusion             | zoom-in              | viewpoint                      | overall
                   | small  moderate  large | slight  medium  heavy | no     medium  large | no wear  frontal  side or back |
AP_box             | 0.604  0.700     0.660 | 0.712   0.654   0.372 | 0.695  0.629   0.466 | 0.624    0.681    0.641        | 0.667
AP_box^{IoU=0.50}  | 0.780  0.851     0.768 | 0.844   0.810   0.531 | 0.848  0.755   0.563 | 0.713    0.832    0.796        | 0.814
AP_box^{IoU=0.75}  | 0.717  0.809     0.744 | 0.812   0.768   0.433 | 0.806  0.718   0.525 | 0.688    0.791    0.744        | 0.773
Table 2. Clothes detection of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. The evaluation metrics are AP_box, AP_box^{IoU=0.50}, and AP_box^{IoU=0.75}. The best performance of each subset is bold.
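The IoU thresholds in the metrics of Table 2 refer to the intersection over union between a predicted box and a ground-truth box. A minimal sketch of that overlap test, assuming boxes are given as (x1, y1, x2, y2) corner tuples; the function names are hypothetical, and this shows only the matching criterion, not the full average-precision computation.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold
```

A stricter threshold such as 0.75 accepts only tightly localized boxes, which is why the IoU=0.75 row is lower than the IoU=0.50 row in every subset.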
Figure 5. (a) shows failure cases in clothes detection while (b) shows failure cases in clothes segmentation. In (a) and (b), the missing
bounding boxes are drawn in red while the correct category labels are also in red. Inaccurate masks are also highlighted by arrows in (b).
For example, clothes fail to be detected or segmented in too small scale, too large scale, large non-rigid deformation, heavy occlusion, large
zoom-in, side or back viewpoint.
timation, a CE loss $L_{mask}$ for clothes segmentation, and a CE loss $L_{pair}$ for clothes retrieval. Specifically, $L_{cls}$, $L_{box}$, $L_{pose}$, and $L_{mask}$ are identical as defined in [6]. We have $L_{pair} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$, where $y_i = 1$ indicates the two items of a pair are matched, otherwise $y_i = 0$.
Implementations. In our experiments, each training image is resized so that its shorter edge is 800 pixels and its longer edge is no more than 1333 pixels. Each minibatch has two images in a GPU and 8 GPUs are used for training. For minibatch size 16, the learning rate (LR) schedule starts at 0.02 and is decreased by a factor of 0.1 after 8 epochs and then 11 epochs, and finally terminates at 12 epochs. This scheduler is denoted as 1x. Mask R-CNN adopts 2x schedule for clothes detection and segmentation where ‘2x’ is twice as long as 1x with the LR scaled proportionally. It then adopts s1x for landmark and pose estimation where s1x scales the 1x schedule by roughly 1.44x. Match R-CNN uses 1x schedule for consumer-to-shop clothes retrieval. The above models are trained by using SGD with a weight decay of $10^{-5}$ and momentum of 0.9.
In our experiments, the RPN produces anchors with 3 aspect ratios on each level of the FPN pyramid. In clothes detection stream, an RoI is considered positive if its IoU with a ground truth box is larger than 0.5 and negative otherwise. In clothes segmentation stream, positive RoIs with foreground label are chosen while in landmark estimation stream, positive RoIs with visible landmarks are selected. We define ground truth box of interest as clothing items whose style number is > 0 and can constitute matching pairs. In clothes retrieval stream, RoIs are selected if their IoU with a ground truth box of interest is larger than 0.7. If RoI features are extracted from landmark estimation stream, RoIs with visible landmarks are also selected.
Inference. At testing time, images are resized in the same way as the training stage. The top 1000 proposals with detection probabilities are chosen for bounding box classification and regression. Then non-maximum suppression is applied to these proposals. The filtered proposals are fed into the landmark branch and the mask branch separately. For the retrieval task, each unique detected clothing item in consumer-taken image with highest confidence is selected as query.
4. Experiments
We demonstrate the effectiveness of DeepFashion2 by evaluating Mask R-CNN [6] and Match R-CNN in multiple tasks including clothes detection and classification, landmark estimation, instance segmentation, and consumer-to-shop clothes retrieval. To further show the large variations of DeepFashion2, the validation set is divided into three subsets according to their difficulty levels in scale, occlusion, zoom-in, and viewpoint. The settings of Mask R-CNN and Match R-CNN follow Sec.3. All models are trained in the training set and evaluated in the validation set.
The following sections from 4.1 to 4.4 report results for different tasks, showing that DeepFashion2 imposes significant challenges to both Mask R-CNN and Match R-CNN, which are the recent state-of-the-art systems for visual perception.
This is because they possess large non-rigid deformations as visualized in the failure cases of Fig.5(a). These variations
4.2. Landmark and Pose Estimation
Table 3 summarizes the results of landmark estimation. The evaluation of each subset is performed in two settings, including visible landmark only (the occluded landmarks are not evaluated), as well as both visible and occluded landmarks. As estimating the occluded landmarks is more difficult than visible landmarks, the second setting generally provides worse results than the first setting.
In general, we see that Mask R-CNN obtains an overall AP of just 0.563, showing that clothes landmark estimation could be even more challenging than human pose estimation in COCO. In particular, Table 3 exhibits similar trends as those from clothes detection. For example, the clothing items with moderate scale, slight occlusion, no zoom-in, and frontal viewpoint have better results than the other subsets. Moreover, heavy occlusion and zoom-in decrease performance substantially. Some results are given in Fig.6(a).
Metric           | Landmarks          | scale                  | occlusion             | zoom-in              | viewpoint                      | overall
                 |                    | small  moderate  large | slight  medium  heavy | no     medium  large | no wear  frontal  side or back |
AP_pt            | visible only       | 0.587  0.687     0.599 | 0.669   0.631   0.398 | 0.688  0.559   0.375 | 0.527    0.677    0.536        | 0.641
AP_pt            | visible + occluded | 0.497  0.607     0.555 | 0.643   0.530   0.248 | 0.616  0.489   0.319 | 0.510    0.596    0.456        | 0.563
AP_pt^{OKS=0.50} | visible only       | 0.780  0.854     0.782 | 0.851   0.813   0.534 | 0.855  0.757   0.571 | 0.724    0.846    0.748        | 0.820
AP_pt^{OKS=0.50} | visible + occluded | 0.764  0.839     0.774 | 0.847   0.799   0.479 | 0.848  0.744   0.549 | 0.716    0.832    0.727        | 0.805
AP_pt^{OKS=0.75} | visible only       | 0.671  0.779     0.678 | 0.760   0.718   0.440 | 0.786  0.633   0.390 | 0.571    0.771    0.610        | 0.728
AP_pt^{OKS=0.75} | visible + occluded | 0.551  0.703     0.625 | 0.739   0.600   0.236 | 0.714  0.537   0.307 | 0.550    0.684    0.506        | 0.641
Table 3. Landmark estimation of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. Results of evaluation on visible landmarks only and evaluation on both visible and occluded landmarks are shown in separate rows. The evaluation metrics are AP_pt, AP_pt^{OKS=0.50}, and AP_pt^{OKS=0.75}. The best performance of each subset is bold.
Figure 6. (a) shows results of landmark and pose estimation. (b) shows results of clothes segmentation. (c) shows queries with top-5 retrieved clothing items. The first column is the image from the customer with bounding box predicted by detection module, and the second to the sixth columns show the retrieval results from the store. (d) is the retrieval accuracy of overall query validation set with (1) detected box (2) ground truth box. Evaluation metrics are top-1, -5, -10, -15, and -20 retrieval accuracy.
4.3. Clothes Segmentation
Table 4 summarizes the results of segmentation. The performance declines when segmenting clothing items with small and large scale, heavy occlusion, large zoom-in, side or back viewpoint, which is consistent with those trends in the previous tasks. Some results are given in Fig.6(b). Some failure cases are visualized in Fig.5(b).
Metric             | scale                  | occlusion             | zoom-in              | viewpoint                      | overall
                   | small  moderate  large | slight  medium  heavy | no     medium  large | no wear  frontal  side or back |
AP_mask            | 0.634  0.700     0.669 | 0.720   0.674   0.389 | 0.703  0.627   0.526 | 0.695    0.697    0.617        | 0.680
AP_mask^{IoU=0.50} | 0.831  0.900     0.844 | 0.900   0.878   0.559 | 0.899  0.815   0.663 | 0.829    0.886    0.843        | 0.873
AP_mask^{IoU=0.75} | 0.765  0.838     0.786 | 0.850   0.813   0.463 | 0.842  0.740   0.613 | 0.792    0.834    0.732        | 0.812
Table 4. Clothes segmentation of Mask R-CNN [6] on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. The evaluation metrics are AP_mask, AP_mask^{IoU=0.50}, and AP_mask^{IoU=0.75}. The best performance of each subset is bold.
increases the accuracy. In particular, the learned features from pose and class achieve better results than the other features. When comparing learned features from pose and mask, we find that the former achieves better results, indicating that landmark locations can be more robust across scenarios.
As shown in Table 5, the performance declines when small scale, heavily occluded clothing items are presented. Clothes with large zoom-in achieved the lowest accuracy because only part of the clothes is displayed in the image and crucial distinguishable features may be missing. Compared with clothes on people from frontal view, clothes from side or back viewpoint perform worse due to lack of discriminative features like patterns on the front of tops. Example queries with top-5 retrieved clothing items are shown in Fig.6(c).
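The pair loss $L_{pair}$ given with the Match R-CNN losses above is a binary cross-entropy averaged over n candidate pairs. A minimal sketch in plain Python, assuming the matching scores are already probabilities in (0, 1); the function name and the small clamping constant eps are illustrative choices added here for numerical safety, not details from the paper.

```python
import math


def pair_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy over pairs: labels y_i in {0, 1}, probs y_hat_i in (0, 1).

    eps is an assumption added here to avoid log(0); it is not part of the formula.
    """
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A confident correct prediction gives a small loss, while an uninformative score of 0.5 for every pair gives log 2 regardless of the labels.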
ion2.
The rich data and labels of DeepFashion2 will definitely facilitate the developments of algorithms to understand fashion images in future work. We will focus on three aspects. First, more challenging tasks will be explored with DeepFashion2, such as synthesizing clothing images by using GANs. Second, it is also interesting to explore multi-domain learning for clothing images, because fashion trends of clothes may change frequently, which changes the variations of clothing images. Third, we will introduce more evaluation metrics into DeepFashion2, such as size, runtime, and memory consumption of deep models, towards understanding fashion images in real-world scenarios.
References
[1] FashionAI dataset. http://fashionai.alibaba.com/datasets/.
[2] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In ECCV, 2012.
[3] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain adaptation for describing people based on fine-grained clothing attributes. In CVPR, 2015.
[4] R. Girshick. Fast R-CNN. In ICCV, 2015.
[5] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In ICCV, 2015.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[7] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In ICCV, 2015.
[8] X. Ji, W. Wang, M. Zhang, and Y. Yang. Cross-domain image retrieval with attention modeling. In ACM Multimedia, 2017.
[9] L. Liao, X. He, B. Zhao, C.-W. Ngo, and T.-S. Chua. Interpretable multimodal retrieval for fashion products. In ACM Multimedia, 2018.
[10] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[12] K.-H. Liu, T.-Y. Chen, and C.-S. Chen. MVC: A dataset for view-invariant clothing retrieval and attribute prediction. In ACM Multimedia, 2016.
[13] S. Liu, X. Liang, L. Liu, K. Lu, L. Lin, X. Cao, and S. Yan. Fashion parsing with video context. IEEE Transactions on Multimedia, 17(8):1347–1358, 2015.
[14] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.
[15] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion landmark detection in the wild. In ECCV, 2016.
[16] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[17] W. Wang, Y. Xu, J. Shen, and S.-C. Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR, 2018.
[18] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[19] S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Unconstrained fashion landmark detection via hierarchical recurrent transformer networks. In ACM Multimedia, 2017.
[20] W. Yang, P. Luo, and L. Lin. Clothing co-parsing by joint image segmentation and labeling. In CVPR, 2014.
[21] S. Zheng, F. Yang, M. H. Kiapour, and R. Piramuthu. ModaNet: A large-scale street fashion dataset with polygon annotations. In ACM Multimedia, 2018.
Inverse Cooking: Recipe Generation from Food Images
Figure 2: Recipe generation model. We extract image features $e_I$ with the image encoder, parametrized by $\theta_I$. Ingredients are predicted by $\theta_L$, and encoded into ingredient embeddings $e_L$ with $\theta_e$. The cooking instruction decoder, parametrized by $\theta_R$, generates a recipe title and a sequence of cooking steps by attending to image embeddings $e_I$, ingredient embeddings $e_L$, and previously predicted words $(r_0, ..., r_{t-1})$.
ferent architecture designs have been studied, including recurrent neural networks [48], convolutional models [11] and attention based approaches [50]. More recently, sequence-to-sequence models have been applied to more open-ended generation tasks, such as poetry [55] and story generation [23, 9]. Following neural machine translation trends, auto-regressive models have exhibited promising performance in image captioning [52, 59, 28, 20, 7, 46], where the goal is to provide a short description of the image contents, opening the doors to less constrained problems such as generating descriptive paragraphs [23] or visual storytelling [18].
3. Generating recipes from images
Generating a recipe (title, ingredients and instructions) from an image is a challenging task, which requires a simultaneous understanding of the ingredients composing the dish as well as the transformations they went through, e.g. slicing, blending or mixing with other ingredients. Instead of obtaining the recipe from an image directly, we argue that a recipe generation pipeline would benefit from an intermediate step predicting the ingredients list. The sequence of instructions would then be generated conditioned on both the image and its corresponding list of ingredients, where the interplay between image and ingredients could provide additional insights on how the latter were processed to produce the resulting dish.
Figure 2 illustrates our approach. Our recipe generation system takes a food image as an input and outputs a sequence of cooking instructions, which are generated by means of an instruction decoder that takes as input two embeddings. The first one represents visual features extracted from an image, while the second one encodes the ingredients extracted from the image. We start by introducing our transformer-based instruction decoder in Subsection 3.1. This allows us to formally review the transformer, which we then study and modify to predict ingredients in an orderless manner in Subsection 3.2. Finally, we review the optimization details in Subsection 3.3.
3.1. Cooking Instruction Transformer
Given an input image with associated ingredients, we aim to produce a sequence of instructions $R = (r_1, ..., r_T)$ (where $r_t$ denotes a word in the sequence) by means of an instruction transformer [50]. Note that the title is predicted as the first instruction. This transformer is conditioned jointly on two inputs: the image representation $e_I$ and the ingredient embedding $e_L$. We extract the image representation with a ResNet-50 [15] encoder and obtain the ingredient embedding $e_L$ by means of a decoder architecture to predict ingredients, followed by a single embedding layer mapping each ingredient into a fixed-size vector.
The instruction decoder is composed of transformer blocks, each of them containing two attention layers followed by a linear layer [50]. The first attention layer applies self-attention over previously generated outputs, whereas the second one attends to the model conditioning in order to refine the self-attention output. The transformer model is composed of multiple transformer blocks followed by a linear layer and a softmax nonlinearity that provides a distribution over recipe words for each time step $t$. Figure 3a illustrates the transformer model, which traditionally is conditioned on a single modality. However, our recipe generator is conditioned on two sources: the image features $e_I \in \mathbb{R}^{P \times d_e}$ and ingredients embeddings $e_L \in \mathbb{R}^{K \times d_e}$ ($P$ and $K$ denote the number of image and ingredient features, respectively, and $d_e$ is the embedding dimensionality). Thus, we want our attention to reason about both modalities simultaneously, guiding the instruction generation process. To that end, we explore three different fusion strategies (depicted in Figure 3):
– Concatenated attention. This strategy first concatenates both image $e_I$ and ingredients $e_L$ embeddings over the first dimension, $e_{concat} \in \mathbb{R}^{(K+P) \times d_e}$. Then, attention is applied over the combined embeddings.
– Independent attention. This strategy incorporates two attention layers to deal with the bi-modal conditioning. In this case, one layer attends over the image embedding $e_I$, whereas the other attends over the ingredient embeddings $e_L$. The output of both attention layers is combined via summation operation.
– Sequential attention. This strategy sequentially attends over the two conditioning modalities. In our design, we consider two orderings: (1) image first where the attention is first computed over image embeddings $e_I$ and then over ingredient embeddings $e_L$; and (2) ingredients first where the order is flipped and we first attend over ingredient embeddings $e_L$ followed by image embeddings $e_I$.
Figure 3: Attention strategies for the instruction decoder. Panels: (a) Transformer model [50], (b) Concatenated, (c) Independent, (d) Sequential. In our experiments, we replace the attention module in the transformer (a), with three different attention modules (b-d) for cooking instruction generation using multiple conditions.
3.2. Ingredient Decoder
Which is the best structure to represent ingredients? On the one hand, it seems clear that ingredients are a set, since permuting them does not alter the outcome of the cooking recipe. On the other hand, we colloquially refer to ingredients as a list (e.g. list of ingredients), implying some order. Moreover, it would be reasonable to think that there is some information in the order in which humans write down the ingredients in a recipe. Therefore, in this subsection we consider both scenarios and introduce models that work either with a list of ingredients or with a set of ingredients.
A list of ingredients is a variable sized, ordered collection of unique meal constituents. More precisely, let us define a dictionary of ingredients of size $N$ as $D = \{d_i\}_{i=0}^{N}$, from which we can obtain a list of ingredients $L$ by selecting $K$ elements from $D$: $L = [l_i]_{i=0}^{K}$. We encode $L$ as a binary matrix $\mathbf{L}$ of dimensions $K \times N$, with $\mathbf{L}_{i,j} = 1$ if $d_j \in D$ is selected and 0 otherwise (one-hot-code representation). Thus, our training data consists of $M$ image and ingredient list pairs $\{(x^{(i)}, \mathbf{L}^{(i)})\}_{i=0}^{M}$. In this scenario, the goal is to predict $\hat{\mathbf{L}}$ from an image $x$ by maximizing the following objective:
$$\arg\max_{\theta_I, \theta_L} \sum_{i=0}^{M} \log p(\hat{\mathbf{L}}^{(i)} = \mathbf{L}^{(i)} \mid x^{(i)}; \theta_I, \theta_L), \quad (1)$$
where $\theta_I$ and $\theta_L$ represent the learnable parameters of the image encoder and ingredient decoder, respectively. Since $\mathbf{L}$ denotes a list, we can factorize $p(\hat{\mathbf{L}}^{(i)} = \mathbf{L}^{(i)} \mid x^{(i)})$ into $K$ conditionals: $\sum_{k=0}^{K} \log p(\hat{\mathbf{L}}_k^{(i)} = \mathbf{L}_k^{(i)} \mid x^{(i)}, \mathbf{L}_{<k}^{(i)})$ (see footnote 3) and parametrize $p(\hat{\mathbf{L}}_k^{(i)} \mid x^{(i)}, \mathbf{L}_{<k}^{(i)})$ as a categorical distribution. In the literature, these conditionals are usually modeled with auto-regressive (recurrent) models. In our experiments, we choose the transformer model as well. It is worth mentioning that a potential drawback of this formulation is that it inherently penalizes for order, which might not necessarily be relevant for ingredients.
Footnote 3: $\mathbf{L}_k^{(i)}$ denotes the k-th row of $\mathbf{L}^{(i)}$ and $\mathbf{L}_{<k}^{(i)}$ represents all rows of $\mathbf{L}^{(i)}$ up to, but not including, the k-th one.
A set of ingredients is a variable sized, unordered collection of unique meal constituents. We can obtain a set of ingredients $S$ by selecting $K$ ingredients from the dictionary $D$: $S = \{s_i\}_{i=0}^{K}$. We represent $S$ as a binary vector $\mathbf{s}$ of dimension $N$, where $\mathbf{s}_i = 1$ if $s_i \in S$ and 0 otherwise. Thus, our training data consists of $M$ image and ingredient set pairs: $\{(x^{(i)}, \mathbf{s}^{(i)})\}_{i=0}^{M}$. In this case, the goal is to predict $\hat{\mathbf{s}}$ from an image $x$ by maximizing the following objective:
$$\arg\max_{\theta_I, \theta_L} \sum_{i=0}^{M} \log p(\hat{\mathbf{s}}^{(i)} = \mathbf{s}^{(i)} \mid x^{(i)}; \theta_I, \theta_L). \quad (2)$$
Assuming independence among elements, we can factorize $p(\hat{\mathbf{s}}^{(i)} = \mathbf{s}^{(i)} \mid x^{(i)})$ as $\sum_{j=0}^{N} \log p(\hat{\mathbf{s}}_j^{(i)} = \mathbf{s}_j^{(i)} \mid x^{(i)})$. However, the ingredients in the set are not necessarily independent, e.g. salt and pepper frequently appear together.
To account for element dependencies in the set, we model the set as a list, i.e. as a product of conditional probabilities, by means of an auto-regressive model such as the transformer. The transformer predicts ingredients in a list-like fashion $p(\hat{\mathbf{L}}_k^{(i)} \mid x^{(i)}, \mathbf{L}_{<k}^{(i)})$, until the end of sequence eos token is encountered. As mentioned previously, the drawback of this approach is that such model design penalizes
for order. In order to remove the order in which ingredients are predicted, we propose to aggregate the outputs across different time-steps by means of a max pooling operation (see Figure 4). Moreover, to ensure that the ingredients in $\hat{\mathbf{L}}^{(i)}$ are selected without repetition, we force the pre-activation of $p(\hat{\mathbf{L}}_k^{(i)} \mid x^{(i)}, \mathbf{L}_{<k}^{(i)})$ to be $-\infty$ for all previously selected ingredients at time-steps $< k$. We train this model by minimizing the binary cross-entropy between the predicted ingredients (after pooling) and the ground truth. Including the eos in the pooling operation would result in losing the information of where the token appears. Therefore, in order to learn the stopping criterion of the ingredient prediction, we introduce an additional loss accounting for it. The eos loss is defined as the binary cross-entropy loss between the predicted eos probability at all time-steps and the ground truth (represented as a unit step function, whose value is 0 for the time-steps corresponding to ingredients and 1 otherwise). In addition to that, we incorporate a cardinality $\ell_1$ penalty, which we found empirically useful. At inference time, we directly sample from the transformer’s output. We refer to this model as set transformer.
Figure 4: Set transformer ($TF_{set}$). Softmax probabilities are pooled across time to avoid penalizing for order. The example shows the predictions $l_0, ..., l_4$ (salt, onion, beans, rice, eos) produced by $\theta_L$ at successive time-steps and then pooled.
Alternatively, we could use target distribution $p(\mathbf{s}^{(i)} \mid x^{(i)}) = \mathbf{s}^{(i)} / \sum_j \mathbf{s}_j^{(i)}$ [12, 29] to model the joint distribution of set elements and train a model by minimizing the cross-entropy loss between $p(\mathbf{s}^{(i)} \mid x^{(i)})$ and the model’s output distribution $p(\hat{\mathbf{s}}^{(i)} \mid x^{(i)})$. Nonetheless, it is not clear how to convert the target distribution back to the corresponding set of elements with variable cardinality. In this case, we build a feed forward network and train it with the target distribution cross-entropy loss. To recover the ingredient set, we propose to greedily sample elements from a cumulative distribution of sorted output probabilities $p(\hat{\mathbf{s}}^{(i)} \mid x^{(i)})$ and stop the sampling once the sum of probabilities of selected elements is above a threshold. We refer to this model as feed forward (target distribution).
3.3. Optimization
We train our recipe transformer in two stages. In the first stage, we pre-train the image encoder and ingredients decoder as presented in Subsection 3.2. Then, in the second stage, we train the ingredient encoder and instruction decoder (following Subsection 3.1) by minimizing the negative log-likelihood and adjusting $\theta_R$ and $\theta_E$. Note that, while training, the instruction decoder takes as input the ground truth ingredients. All transformer models are trained with teacher forcing [58] except for the set transformer.
4. Experiments
This section is devoted to the dataset and the description of implementation details, followed by an exhaustive analysis of the proposed attention strategies for the cooking instruction transformer. Further, we quantitatively compare the proposed ingredient prediction models to previously introduced baselines. Finally, a comparison of our inverse cooking system with retrieval-based models as well as a comprehensive user study is provided.
4.1. Dataset
We train and evaluate our models on the Recipe1M dataset [45], composed of 1 029 720 recipes scraped from cooking websites. The dataset contains 720 639 training, 155 036 validation and 154 045 test recipes, containing a title, a list of ingredients, a list of cooking instructions and (optionally) an image. In our experiments, we use only the recipes containing images, and remove recipes with fewer than 2 ingredients or 2 instructions, resulting in 252 547 training, 54 255 validation and 54 506 test samples.
Since the dataset was obtained by scraping cooking websites, the resulting recipes are highly unstructured and frequently contain redundant or very narrowly defined cooking ingredients (e.g. olive oil, virgin olive oil and spanish olive oil are separate ingredients). Moreover, the ingredient vocabulary contains more than 400 different types of cheese, and more than 300 types of pepper. As a result, the original dataset contains 16 823 unique ingredients, which we preprocess to reduce its size and complexity. First, we merge ingredients if they share the first or last two words (e.g. bacon cheddar cheese is merged into cheddar cheese); then, we cluster the ingredients that have the same word in the first or in the last position (e.g. gorgonzola cheese or cheese blend are clustered together into the cheese category); finally we remove plurals and discard ingredients that appear fewer than 10 times in the dataset. Altogether, we reduce the ingredient vocabulary from over 16k to 1 488 unique ingredients. For the cooking instructions, we tokenize the raw text and remove words that appear fewer than 10 times in the dataset, and replace them with unknown word token. Moreover, we add special tokens for the start and the end of recipe as well
as the end of instruction. This process results in a recipe vocabulary of 23 231 unique words.
Model            ppl        Model              IoU     F1
Independent      8.59       FF_BCE             17.85   30.30
Seq. img. first  8.53       FF_IOU             26.25   41.58
Seq. ing. first  8.61       FF_DC              27.22   42.80
Concatenated     8.50       FF_TD              28.84   44.11
                            TF_list            29.48   45.55
                            TF_list + shuf.    27.86   43.58
                            TF_set             31.80   48.26
Table 1: Model selection (val). Left: Recipe perplexity (ppl). Right: Global ingredient IoU & F1.
fluence of visual features on recipe quality, we adapt our model by removing visual features and predicting instructions directly from ingredients (L2R). Our system achieves a test set perplexity of 8.51, improving both I2R and L2R baselines, and highlighting the benefits of using both image and ingredients when generating recipes. L2R surpasses I2R with a perplexity of 8.67 vs. 9.66, demonstrating the usefulness of having access to concepts (ingredients) that are essential to the cooking instructions. Finally, we greedily sample instructions from our model and analyze the results. We notice that generated instructions have an average of 9.21 sentences containing 9 words each, whereas real, ground truth instructions have an average of 9.08 sentences of length 12.79. See supplementary material for qualitative examples of generated recipes.
4.2. Implementation Details
4.4. Ingredient Prediction
We resize images to 256 pixels in their shortest side and
take random crops of 224 × 224 for training and we select In this section, we compare the proposed ingredient pre-
central 224 × 224 pixels for evaluation. For the instruc- diction approaches to previously introduced models, with
tion decoder, we use a transformer with 16 blocks and 8 the goal of assessing whether ingredients should be treated
multi-head attentions, each one with dimensionality 64. For as lists or sets. We consider models from the multilabel
the ingredient decoder, we use a transformer with 4 blocks classification literature as baselines, and tune them for our
and 2 multi-head attentions, each one with dimensionality purposes. On the one hand, we have models based on feed
of 256. To obtain image embeddings we use the last convo- forward convolutional networks, which are trained to pre-
lutional layer of ResNet-50 model. Both image and ingredi- dict sets of ingredients. We experiment with several losses
ents embedings are of dimension 512. We keep a maximum to train these models, namely binary cross-entropy, soft in-
of 20 ingredients per recipe and truncate instructions to a tersection over union as well as target distribution cross-
maximum of 150 words. The models are trained with Adam entropy. Note that binary cross-entropy is the only one not
optimizer [22] until early-stopping criteria is met (using pa- taking into account dependencies among elements in the set.
tience of 50 and monitoring validation loss). All models are On the other hand, we have sequential models that predict
implemented with PyTorch4 [40]. Additional implementa- lists, imposing order and exploiting dependencies among
tion details are provided in the supplementary material. elements. Finally, we consider recently proposed models
which couple set prediction with cardinality prediction to
4.3. Recipe Generation determine which elements to include in the set [44].
Table 1 (right) reports the results on the validation set
In this section, we compare the proposed multi-modal
for the state-of-the-art baselines as well as the proposed
attention architectures described in Section 3.1. Table 1
approaches. We evaluate the models in terms of Intersec-
(left) reports the results in terms of perplexity on the val-
tion over Union (IoU) and F1 score, computed for accumu-
idation set. We observe that independent attention exhibits
lated counts of T P , F N and F P over the entire dataset
the lowest results, followed by both sequential attentions.
split (following Pascal VOC convention). As shown in the
While the latter have the capability to refine the output with
table, the feed forward model trained with binary cross-
either ingredient or image information consecutively, inde-
entropy [3] (FFBCE ) exhibits the lowest performance on
pendent attention can only do it in one step. This is also
both metrics, which could be explained by the assumed in-
the case of concatenated attention, which achieves the best
dependence among ingredients. These results are already
performance. However, concatenated attention is flexible
notably improved by the method that learns to predict the set
enough to decide whether to give more focus to one modal-
cardinality (FFDC ). Similarly, the performance increases
ity, at the expense of the other, whereas independent atten-
when training the model with structured losses such as soft
tion is forced to include information from both modalities.
IoU (FFIOU ). Our feed forward model trained with tar-
Therefore, we use the concatenated attention model to re-
get distribution (FFT D ) and sampled by thresholding (th
port results on the test set. We compare it to a system go-
= 0.5) the sum of probabilities of selected ingredients out-
ing directly from image-to-sequence of instructions with-
performs all feed forward baselines, including recently pro-
out predicting ingredients (I2R). Moreover, to assess the in-
posed alternatives for set prediction such as [44] (FFDC ).
4 https://pytorch.org/ Note that target distribution models dependencies among
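As a minimal sketch of the evaluation protocol described in Section 4.4, the snippet below accumulates TP, FP and FN over every sample of a split and only then computes IoU and F1, rather than averaging per-sample scores. The function names and the toy ingredient sets are illustrative assumptions, not the authors' evaluation code.

```python
def accumulate_counts(pairs):
    """Sum true positives, false positives and false negatives over a split.

    `pairs` is an iterable of (predicted_ingredients, true_ingredients).
    """
    tp = fp = fn = 0
    for predicted, truth in pairs:
        predicted, truth = set(predicted), set(truth)
        tp += len(predicted & truth)   # predicted and present
        fp += len(predicted - truth)   # predicted but absent
        fn += len(truth - predicted)   # present but missed
    return tp, fp, fn


def iou_and_f1(tp, fp, fn):
    """Global IoU and F1 from accumulated counts."""
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return iou, f1


# Toy split with two recipes (hypothetical data).
split = [
    ({"salt", "sugar"}, {"salt", "flour"}),
    ({"egg"}, {"egg"}),
]
counts = accumulate_counts(split)
iou, f1 = iou_and_f1(*counts)
```

Because the counts are pooled before the ratio is taken, recipes with many ingredients weigh more than recipes with few, which is the usual consequence of this convention.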
Model             Card. error    # pred. ingrs
FF_BCE            5.67 ± 3.10    2.37 ± 1.58
FF_DC             2.68 ± 2.07    9.18 ± 2.06
FF_IOU            2.46 ± 1.95    7.86 ± 1.72
FF_TD             3.02 ± 2.50    8.02 ± 3.24
TF_list           2.49 ± 2.11    7.05 ± 2.77
TF_list + shuffle 3.24 ± 2.50    5.06 ± 1.85
TF_set            2.56 ± 1.93    9.43 ± 2.35

Table 2: Ingredient Cardinality.

Figure 5: Ingredient prediction results: P@K and F1 per ingredient.
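The number of predicted ingredients for the target distribution model depends on how the set is recovered from the output probabilities. The sketch below illustrates one reading of that step under stated assumptions: the helper names are hypothetical, and "greedily sample" is interpreted as taking ingredients in order of decreasing predicted probability until the accumulated probability exceeds the threshold (th = 0.5 in Section 4.4).

```python
def target_distribution(binary_labels):
    """Normalise a binary ingredient vector s into p(s|x) = s / sum_j s_j."""
    total = sum(binary_labels)
    return [value / total for value in binary_labels]


def sample_set(probabilities, threshold=0.5):
    """Select indices by decreasing probability until their mass exceeds threshold."""
    order = sorted(range(len(probabilities)),
                   key=lambda k: probabilities[k], reverse=True)
    selected, mass = [], 0.0
    for k in order:
        selected.append(k)
        mass += probabilities[k]
        if mass > threshold:   # stop once the selected mass is above the threshold
            break
    return selected


target = target_distribution([1, 0, 1, 0])
picked = sample_set([0.4, 0.3, 0.2, 0.1], threshold=0.5)
```

With this rule the cardinality of the predicted set is not fixed in advance: a peaked distribution yields few ingredients, a flat one yields many.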
Figure 9: User Study 1. Interface for writing recipes and selecting ingredients.
Figure 10: User Study 2. Recipe quality assessment form.
Figure 11: Written Recipes. Real, generated and human written recipes collected with our user study.
Instructions Instructions
-Combine all ingredients and cook over medium heat until -Preheat greased grill to medium-high heat. Instructions
potatoes are just tender. -Grill fruit 3 min. -Brown the beef with spices and onion when browned add
-Turn down heat to low and simmer at least 1.5 hours. -On each side or until lightly browned on both sides. paste
-Cut fruit into 2-inch sticks; place in large salad bowl. -Bring broth to boil add tomato sauce and spices let boil 3
-Add greens, jicama and tomatoes; toss lightly. minutes add rice bring back to boil cover and let sit off heat
-Drizzle with dressing just before serving. 7 minutes
-Mix rice and sauce enjoy
italian_dressing Instructions
-In a large skillet, heat oil over medium heat.
-Add onion and garlic and cook until onion is translucent.
Instructions Instructions -Add rice and cook until rice is lightly browned.
-Brown ground chuck in a large dutch oven. -Toss greens with chicken, tomatoes, cheese and dressing in -Add chicken broth, tomatoes, salt, pepper and cayenne
-Drain any grease and add the crushed tomatoes, onion soup large bowl. pepper.
mix and water. -Add dressing; mix lightly. -Bring to a boil.
-Simmer 5 minutes. -Reduce heat and cover.
-Add velveeta shells and cheese, mix well and serve with hot -Simmer for 20 minutes.
rolls. -Remove from heat and let stand covered for 5 minutes.
-Sprinkle with cheese and serve.
onion
Instructions Instructions
-Cook the beef in a frying pan till brown. Instructions -Put the sauteed meat in a casserole with hot oil.
-Dice the onions. -Wash and rinse the kale -After it is quite cooked, add the rice.
-Drain the beans. -Wash, cut and slice the onion, the yellow zucchini and the -Add the water.
-Take a slow-cooker and put the beef, tomatoes, beans and cherry tomatoes -Add the broth.
onions. -Add all the ingredients in a bowl and mix -Add salt
-Add the seasoning and spices. -Add some olive oil and your favorite vinagrette -Let it cook until the rice it is done.
-Turn the slow-cooker on medium and cook for around 6 -Add the parsley
hrs.
Figure 12: Dine Out Study. Generated recipes for food images taken by authors.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Stefanos Zafeiriou
Imperial College London
arXiv:1801.07698v3 [cs.CV] 9 Feb 2019
s.zafeiriou@imperial.ac.uk
Abstract
One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks including a new large-scale image database with trillion level of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pre-trained models and training logs1, which will help reproduce the results in this paper.

∗ denotes equal contribution to this work.
1 https://github.com/deepinsight/insightface

Figure 1. Based on the centre [18] and feature [37] normalisation, all identities are distributed on a hypersphere. To enhance intra-class compactness and inter-class discrepancy, we consider four kinds of Geodesic Distance (GDis) constraint. (A) Margin-Loss: insert a geodesic distance margin between the sample and centres. (B) Intra-Loss: decrease the geodesic distance between the sample and the corresponding centre. (C) Inter-Loss: increase the geodesic distance between different centres. (D) Triplet-Loss: insert a geodesic distance margin between triplet samples. In this paper, we propose an Additive Angular Margin Loss (ArcFace), which corresponds exactly to the geodesic distance (Arc) margin penalty in (A), to enhance the discriminative power of face recognition model. Extensive experimental results show that the strategy of (A) is most effective.

1. Introduction

Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition [32, 33, 29, 24]. DCNNs map the face image, typically after a pose normalisation step [45], into a feature that has small intra-class and large inter-class distance.
There are two main lines of research to train DCNNs for face recognition. Those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier [33, 24, 6], and those that learn directly an embedding, such as the triplet loss [29]. Based on the large-scale training data and the elaborate DCNN architectures, both the softmax-loss-based methods [6] and the triplet-loss-based methods [29] can obtain excellent performance on face recognition. However,
both the softmax loss and the triplet loss have some drawbacks. For the softmax loss: (1) the size of the linear transformation matrix $W \in \mathbb{R}^{d \times n}$ increases linearly with the identities number n; (2) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training.
Several variants [38, 9, 46, 18, 37, 35, 7, 34, 27] have been proposed to enhance the discriminative power of the softmax loss. Wen et al. [38] pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness while the inter-class dispersion is guaranteed by the joint penalisation of the softmax loss. Nevertheless, updating the actual centres during training is extremely difficult as the number of face classes available for training has recently dramatically increased.
By observing that the weights from the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities with the centres of each face class, the works in [18, 19] proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model. Even though Sphereface [18] introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in an unstable training of the network. In order to stabilise training, they proposed a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. CosFace [37, 35] directly adds cosine margin penalty to the target logit, which obtains better performance compared to SphereFace but admits much easier implementation and relieves the need for joint supervision from the softmax loss.
In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. The advantages of the proposed ArcFace can be summarised as follows:
Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.
Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.
Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in the computational-graph-based deep learning frameworks, e.g. MxNet [8], Pytorch [25] and Tensorflow [4]. Furthermore, contrary to the works in [18, 19], ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training datasets.
Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training and the model parallel strategy can easily support many more identities.

2. Proposed Approach
2.1. ArcFace

The most widely used classification loss function, softmax loss, is presented as follows:

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}, \qquad (1)$$

where $x_i \in \mathbb{R}^d$ denotes the deep feature of the i-th sample, belonging to the $y_i$-th class. The embedding feature dimension d is set to 512 in this paper following [38, 46, 18, 37]. $W_j \in \mathbb{R}^d$ denotes the j-th column of the weight $W \in \mathbb{R}^{d \times n}$ and $b_j \in \mathbb{R}^n$ is the bias term. The batch size and the class number are N and n, respectively. Traditional softmax loss is widely used in deep face recognition [24, 6]. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations [30, 48] and age gaps [22, 49]) and large-scale test scenarios (e.g. million [15, 39, 21] or trillion pairs [2]).
For simplicity, we fix the bias $b_j = 0$ as in [18]. Then, we transform the logit [26] as $W_j^T x_i = \|W_j\| \|x_i\| \cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$. Following [18, 37, 36], we fix the individual weight $\|W_j\| = 1$ by l2 normalisation. Following [28, 37, 36, 35], we also fix the embedding feature $\|x_i\|$ by l2 normalisation and re-scale it to s. The normalisation step on features and
Figure 2. Training a DCNN for face recognition supervised by the ArcFace loss. Based on the feature xi and weight W normalisation, we
get the cos θj (logit) for each class as WjT xi . We calculate the arccosθyi and get the angle between the feature xi and the ground truth
weight Wyi . In fact, Wj provides a kind of centre for each class. Then, we add an angular margin penalty m on the target (ground truth)
angle θyi . After that, we calculate cos(θyi + m) and multiply all logits by the feature scale s. The logits then go through the softmax
function and contribute to the cross entropy loss.
Algorithm 1 The Pseudo-code of ArcFace on MxNet
Input: Feature Scale s, Margin Parameter m in Eq. 3, Class Number n, Ground-Truth ID gt.
1. x = mx.symbol.L2Normalization(x, mode='instance')
2. W = mx.symbol.L2Normalization(W, mode='instance')
3. fc7 = mx.sym.FullyConnected(data=x, weight=W, no_bias=True, num_hidden=n)
4. original_target_logit = mx.sym.pick(fc7, gt, axis=1)
5. theta = mx.sym.arccos(original_target_logit)
6. marginal_target_logit = mx.sym.cos(theta + m)
7. one_hot = mx.sym.one_hot(gt, depth=n, on_value=1.0, off_value=0.0)
8. fc7 = fc7 + mx.sym.broadcast_mul(one_hot, mx.sym.expand_dims(marginal_target_logit - original_target_logit, 1))
9. fc7 = fc7 * s
Output: Class-wise affinity score fc7.
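For readers without an MxNet installation, the same steps can be traced for a single sample in plain Python. This is only an illustrative sketch, not the released implementation: the function names are hypothetical, the weight is passed as a list of per-class vectors, and the defaults s = 64 and m = 0.5 are assumed values chosen for the example.

```python
import math


def l2_normalise(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def arcface_logits(x, class_weights, gt, s=64.0, m=0.5):
    """Return scaled logits for one sample, with the margin added on class `gt`.

    x             : feature vector of one sample
    class_weights : list of per-class weight vectors (one centre per class)
    gt            : index of the ground-truth class
    """
    x = l2_normalise(x)                                  # step 1
    logits = []
    for j, w in enumerate(class_weights):
        w = l2_normalise(w)                              # step 2
        cos_theta = sum(a * b for a, b in zip(x, w))     # step 3: cosine logit
        cos_theta = max(-1.0, min(1.0, cos_theta))       # guard acos against rounding
        if j == gt:
            theta = math.acos(cos_theta)                 # step 5
            cos_theta = math.cos(theta + m)              # step 6: additive angular margin
        logits.append(s * cos_theta)                     # step 9: feature re-scale
    return logits


# Feature aligned with class 0 and orthogonal to class 1.
example = arcface_logits([3.0, 0.0], [[1.0, 0.0], [0.0, 2.0]], gt=0)
```

Only the target logit changes: it drops from s to s·cos(m), which is the penalty the softmax then has to overcome during training.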
weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of s.

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}}. \qquad (2)$$

while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.
lar representation of features and weight-vectors. For examples, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. As shown in Figure 1, we compare with three other losses in this paper.
Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground truth centre.

Figure 4. Target logit analysis. (a) θj distributions from start to end during ArcFace training (x-axis: Angle between the Feature and Target Center; y-axis: Numbers). (b) Target logit curves for softmax, SphereFace, ArcFace, CosFace and combined margin penalty (cos(m1 θ + m2) − m3). Curves in (b), labelled with (m1, m2, m3): Softmax (1.00, 0.00, 0.00); SphereFace (m=4, λ=5); SphereFace (1.35, 0.00, 0.00); ArcFace (1.00, 0.50, 0.00); CosFace (1.00, 0.00, 0.35); CM1 (1.00, 0.30, 0.20); CM2 (0.90, 0.40, 0.15).
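The curves in Figure 4(b) all follow the combined margin form cos(m1 θ + m2) − m3 given in the caption. A small sketch, assuming only that form and the (m1, m2, m3) settings listed there, shows how the target logit can be evaluated for each setting; the function name and the chosen angle are illustrative.

```python
import math


def target_logit(theta_deg, m1=1.0, m2=0.0, m3=0.0):
    """Combined margin target logit cos(m1 * theta + m2) - m3, theta in degrees."""
    return math.cos(m1 * math.radians(theta_deg) + m2) - m3


# (m1, m2, m3) settings taken from the Figure 4 caption.
settings = {
    "Softmax": (1.00, 0.00, 0.00),
    "ArcFace": (1.00, 0.50, 0.00),
    "CosFace": (1.00, 0.00, 0.35),
    "CM1": (1.00, 0.30, 0.20),
}

# Target logit of each setting at an angle of 60 degrees.
at_60 = {name: target_logit(60.0, *params) for name, params in settings.items()}
```

At any fixed angle the margin settings give a lower target logit than plain softmax, so the feature must move closer to its centre to reach the same score.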
that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features,

[Figure: histograms of the angles between positive and negative pairs. x-axis: Angles Between Positive and Negative Pairs; y-axis: Pair Numbers. Panels: (a) LFW (99.83%), (b) CFP-FP (98.37%), (c) AgeDB (98.15%).]

methods on MegaFace Challenge1 using FaceScrub as the probe
FAR. “R” refers to data refinement on both probe set and 1M distractors. ArcFace obtains state-of-the-art performance under both small and large protocols.
comparison, we train ArcFace on CASIA and MS1MV2 under the small protocol and large protocol, respectively. In Table 6, ArcFace trained on CASIA achieves the best
not only surpassing the strong baselines (e.g. SphereFace [18] and CosFace [37]) but also outperforming other published methods [38, 17].
As we observed an obvious performance gap between identification and verification, we performed a thorough manual check in the whole MegaFace dataset and found many face images with wrong labels, which significantly affects the performance. Therefore, we manually refined the whole MegaFace dataset and report the correct performance of ArcFace on MegaFace. On the refined MegaFace,

Figure 8. CMC and ROC curves of different models on MegaFace. Results are evaluated on both original and refined MegaFace dataset. Panels: (a) CMC (x-axis: Rank), (b) ROC (x-axis: False Positive Rate). Curves: CASIA, ResNet50, ArcFace, Original; CASIA, ResNet50, ArcFace, Refine; MS1MV2, ResNet100, ArcFace, Original; MS1MV2, ResNet100, ArcFace, Refine; MS1MV2, ResNet100, CosFace, Original; MS1MV2, ResNet100, CosFace, Refine.

Results on IJB-B and IJB-C. The IJB-B dataset [39] contains 1,845 subjects with 21.8K still images and 55K
frames from 7,011 videos. In total, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches. The IJB-C dataset [39] is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. In total, there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches.
On the IJB-B and IJB-C datasets, we employ the VGG2 dataset as the training data and the ResNet50 as the embedding network to train ArcFace for the fair comparison with the most recent methods [6, 42, 41]. In Table 7, we compare the TAR (@FAR=1e-4) of ArcFace with the previous state-of-the-art models [6, 42, 41]. ArcFace can obviously boost the performance on both IJB-B and IJB-C (about 3 ∼ 5%, which is a significant reduction in the error). Drawing support from more training data (MS1MV2) and deeper neural network (ResNet100), ArcFace can further improve the TAR (@FAR=1e-4) to 94.2% and 95.6% on IJB-B and IJB-C, respectively. In Figure 9, we show the full ROC curves of the proposed ArcFace on IJB-B and IJB-C 2, and ArcFace achieves impressive performance even at FAR=1e-6 setting a new baseline.

Method                     IJB-B   IJB-C
ResNet50 [6]               0.784   0.825
SENet50 [6]                0.800   0.840
ResNet50+SENet50 [6]       0.800   0.841
MN-v [42]                  0.818   0.852
MN-vc [42]                 0.831   0.862
ResNet50+DCN(Kpts) [41]    0.850   0.867
ResNet50+DCN(Divs) [41]    0.841   0.880
SENet50+DCN(Kpts) [41]     0.846   0.874
SENet50+DCN(Divs) [41]     0.849   0.885
VGG2, R50, ArcFace         0.898   0.921
MS1MV2, R100, ArcFace      0.942   0.956

Table 7. 1:1 verification TAR (@FAR=1e-4) on the IJB-B and IJB-C dataset.

[Figure 9: ROC on IJB-B and ROC on IJB-C. y-axis: True Positive Rate. Curves: MS1MV2, ResNet100, ArcFace; VGG2, ResNet50, ArcFace.]

Results on Trillion-Pairs. The Trillion-Pairs dataset [2] provides 1.58M images from Flickr as the gallery set and 274K images from 5.7k LFW [13] identities as the probe set. Every pair between gallery and probe set is used for evaluation (0.4 trillion pairs in total). In Table 8, we compare the performance of ArcFace trained on different datasets. The proposed MS1MV2 dataset obviously boosts the performance compared to CASIA and even slightly outperforms the DeepGlint-Face dataset, which has a double identity number. When combining all identities from MS1MV2 and Asian celebrities from DeepGlint, ArcFace achieves the best identification performance 84.840% (@FPR=1e-3) and comparable verification performance compared to the most recent submission (CIGIT IRSEC) from the leaderboard.

Method            Id (@FPR=1e-3)   Ver (@FPR=1e-9)
CASIA             26.643           21.452
MS1MV2            80.968           78.600
DeepGlint-Face    80.331           78.586
MS1MV2+Asian      84.840 (1st)     80.540
CIGIT IRSEC       84.234 (2nd)     81.558 (1st)

Table 8. Identification and verification results (%) on the Trillion-Pairs dataset. ([Dataset*, ResNet100, ArcFace])

Results on iQIYI-VID. The iQIYI-VID challenge [20] contains 565,372 video clips (training set 219,677, validation set 172,860, and test set 172,835) of 4934 identities from iQIYI variety shows, films and television dramas. The length of each video ranges from 1 to 30 seconds. This dataset supplies multi-modal cues, including face, cloth, voice, gait and subtitles, for character identification. The iQIYI-VID dataset employs MAP@100 as the evaluation indicator. MAP (Mean Average Precision) refers to the overall average accuracy rate, which is the mean of the average accuracy rate of the corresponding videos of person ID retrieved in the test set for each person ID (as the query) in the training set.
As shown in Table 9, ArcFace trained on combined MS1MV2 and Asian datasets with ResNet100 sets a high baseline (MAP=(79.80%)). Based on the embedding feature for each training video, we train an additional three-layer fully connected network with a classification loss to get the customised feature descriptor on the iQIYI-VID dataset. The MLP learned on the iQIYI-VID training set
from the model ensemble and context features from the off-

Method        MAP(%)
+ Ensemble    88.26
+ Context     88.65 (1st)

In this paper, we proposed an Additive Angular Margin Loss function, which can effectively enhance the discriminative power of feature embeddings learned via DCNNs for face recognition. In the most comprehensive experiments reported in the literature we demonstrate that our method

2 https://github.com/deepinsight/insightface/tree/master/Evaluation/IJB
Can we apply ArcFace on large-scale identities? Yes,
ation strategy [44] to relieve this problem. We optimise our training code to easily and efficiently support million level identities on a single machine by parallel acceleration on both feature x (known as the general data parallel strategy) and centre W (which we name the centre parallel strategy). As shown in Figure 10, our parallel acceleration on both feature x and centre W can significantly decrease the GPU memory consumption and accelerate the training speed. Even for one million identities trained on 8*1080ti (11GB), our implementation (ResNet 50, batch size 8*64, feature dimension 512 and float point 32) can still run at 800 samples per second. Compared to the approximate acceleration method proposed in [47], our implementation has no performance drop.

Figure 10. Parallel acceleration on both feature x and centre W. Panel (b): Training Speed (x-axis: Identity Number in the Training Data). Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, GPU 8*P40 (24GB).

3 https://github.com/deepinsight/insightface/tree/master/recognition

In Figure 11, we illustrate the main calculation steps of the parallel acceleration by simple matrix partition, which can be easily grasped and reproduced by beginners [3].
(1) Get feature (x). Face embedding features are aggregated into one feature matrix (batch size 8*64 × feature dimension 512) from 8 GPU cards. The size of the aggregated feature matrix is only 1MB, and the communication cost is negligible when we transfer the feature matrix.
(2) Get similarity score matrix (score = xW). We copy the feature matrix into each GPU, and concurrently multiply the feature matrix by the centre sub-matrix (feature dimension 512 × identity number 1M/8) to get the similarity score sub-matrix (batch size 512 × identity number 1M/8) on each GPU. The similarity score matrix goes forward to calculate the ArcFace loss and the gradient. Here, we conduct a simple matrix partition on the centre matrix and the similarity score matrix along the identity dimension, and there is no communication cost on the centre and similarity score matrix. Both the centre sub-matrix and the similarity score sub-matrix are only 256MB on each GPU.
(3) Get gradient on centre (dW). We transpose the feature matrix on each GPU, and concurrently multiply the transposed feature matrix by the gradient sub-matrix of the similarity score.
(4) Get gradient on feature (x). We concurrently multiply the gradient sub-matrix of similarity score by the transposed centre sub-matrix and sum up the outputs from 8 GPU cards to get the gradient on feature x.
Considering the communication cost (MB level), our implementation of ArcFace can be easily and efficiently trained on millions of identities by clusters.
Figure 11. Parallel calculation by simple matrix partition. Panels: (a) x, (b) score = xW, (d) dx = dscore W^T. Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, identity number 1 Million, GPU 8 * 1080ti (11GB). Communication cost: 1MB (feature x). Training speed: 800 samples/second.

5.2. Feature Space Analysis

Is the 512-d hypersphere space large enough to hold large-scale identities? Theoretically, Yes.

We assume that the identity centre Wj’s follow a realistically spherical uniform distribution, the expectation of the nearest neighbour separation [5] is

$$E[\theta(W_j)] \to n^{-\frac{2}{d-1}}\,\Gamma\left(1 + \frac{1}{d-1}\right)\left(\frac{\Gamma(\frac{d}{2})}{2\sqrt{\pi}\,(d-1)\,\Gamma(\frac{d-1}{2})}\right)^{-\frac{1}{d-1}}, \qquad (7)$$

where d is the space dimension, n is the identity number, and $\theta(W_j) = \min_{1 \le i,j \le n, i \ne j} \arccos(W_i, W_j)\ \forall i, j$. In Figure 12, we give $E[\theta(W_j)]$ in the 128-d, 256-d and 512-d space with the class number ranging from 10K to 100M. The high-dimensional space is so large that $E[\theta(W_j)]$ decreases slowly when the class number increases exponentially.

[Figure 12: curves for the 128-d, 256-d and 512-d spaces. Axis label: Random Individual Numbers, from 10^4 to 10^8.]

References
[1] http://data.mxnet.io/models/. 8
[2] http://trillionpairs.deepglint.com/overview. 2, 4, 5, 8
[3] Stanford cs class cs231n: Convolutional neural networks for visual recognition. http://cs231n.github.io/neural-networks-case-study/. 9
[4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016. 2
[5] J. S. Brauchart, A. B. Reznikov, E. B. Saff, I. H. Sloan, Y. G. Wang, and R. S. Womersley. Random point sets on the sphere - hole radii, covering, and separation. Experimental Mathematics, 2018. 10
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In FG, 2018. 1, 2, 4, 5, 7, 8
[7] B. Chen, W. Deng, and J. Du. Noisy softmax: improving the generalization ability of dcnn via postponing the early softmax saturation. In CVPR, 2017. 2
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015. 2, 5
[9] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In CVPR Workshop, 2017. 2, 6
[10] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016. 4
[11] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. arXiv:1610.02915, 2016. 5
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 5
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007. 5, 6, 8
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 5
[15] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016. 2, 5, 7
[16] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015. 6
[17] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song. Learning towards minimum hyperspherical energy. In NIPS, 2018. 4, 6, 7
[18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017. 1, 2, 3, 4, 5, 6, 7
[19] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016. 2, 3
[20] Y. Liu, P. Shi, B. Peng, H. Yan, Y. Zhou, B. Han, Y. Zheng, C. Lin, J. Jiang, and Y. Fan. iqiyi-vid: A large dataset for multi-modal person identification. arXiv:1811.07548, 2018. 5, 8
[21] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, and J. Cheney. Iarpa janus benchmark–c: Face dataset and protocol. In ICB, 2018. 2, 5
[22] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected in-the-wild age database. In CVPR Workshop, 2017. 2, 5
[23] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In ICIP, 2014. 7
[24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015. 1, 2, 6
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017. 2
[26] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017. 2, 3
[27] X. Qi and L. Zhang. Face recognition via centralized coordinate learning. arXiv:1801.05678, 2018. 2
[28] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507, 2017. 2
[29] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015. 1, 4, 6, 7
[30] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016. 2, 5
[31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014. 5
[32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014. 1, 6, 7
[33] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014. 1, 6
[34] W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking feature distribution for loss functions in image classification. arXiv:1803.02988, 2018. 2
[35] F. Wang, W. Liu, H. Liu, and J. Cheng. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018. 2, 3, 7
[36] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: l2 hypersphere embedding for face verification. arXiv:1704.06369, 2017. 2
[37] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In CVPR, 2018. 1, 2, 3, 5, 6, 7
[38] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016. 2, 6, 7
[39] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, and K. Allen. Iarpa janus benchmark-b face dataset. In CVPR Workshop, 2017. 2, 5, 7, 8
[40] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, 2011. 5, 6
[41] W. Xie, S. Li, and A. Zisserman. Comparator networks. In ECCV, 2018. 8
[42] W. Xie and A. Zisserman. Multicolumn networks for face recognition. In BMVC, 2018. 8
[43] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014. 4, 5
[44] D. Zhang. A distributed training solution for face recognition. DeepGlint, 2018. 9
[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 2016. 1
[46] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. In ICCV, 2017. 2, 6
[47] X. Zhang, L. Yang, J. Yan, and D. Lin. Accelerated training for massive classification via dynamic class selection. In AAAI, 2018. 9
[48] T. Zheng and W. Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Technical Report, 2018. 2, 5, 6
[49] T. Zheng, W. Deng, and J. Hu. Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv:1708.08197, 2017. 2, 5, 6
Fast Online Object Tracking and Segmentation: A Unifying Approach
wmhu@nlpr.ia.ac.cn philip.torr@eng.ox.ac.uk
Abstract
sources than a simple bounding box. As a consequence, VOS methods have been traditionally slow, often requiring several seconds per frame (e.g. [55, 50, 39, 1]). Very recently, there has been a surge of interest in faster approaches [59, 36, 57, 8, 7, 22, 21]. However, even the fastest still cannot operate in real-time.

In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems. Our method is motivated by the success of fast tracking approaches based on fully-convolutional Siamese networks [3] trained offline on millions of pairs of video frames (e.g. [28, 63, 15, 60]) and by the very recent availability of YouTube-VOS [58], a large video dataset with pixel-wise annotations. We aim at retaining the offline trainability and online speed of these methods while at the same time significantly refining their representation of the target object, which is limited to a simple axis-aligned bounding box.

To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondences between the target object and candidate regions in the new frames. As in the fully-convolutional approach of Bertinetto et al. [3], one task is to learn a measure of similarity between the target object and multiple candidates in a sliding window fashion. The output is a dense response map which only indicates the location of the object, without providing any information about its spatial extent. To refine this information, we simultaneously learn two further tasks: bounding box regression using a Region Proposal Network [46, 28] and class-agnostic binary segmentation [43]. Notably, binary labels are only required during offline training to compute the segmentation loss and not online during segmentation/tracking. In our proposed architecture, each task is represented by a different branch departing from a shared CNN and contributes towards a final loss, which sums the three outputs together.

Once trained, SiamMask solely relies on a single bounding box initialisation, operates online without updates and produces object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity and fast speed, SiamMask establishes a new state-of-the-art on VOT-2018 for the problem of real-time object tracking. Moreover, the same method is also very competitive against recent semi-supervised VOS approaches on DAVIS-2016 and DAVIS-2017, while being the fastest by a large margin. This result is achieved with a simple bounding box initialisation (as opposed to a mask) and without adopting costly techniques often used by VOS approaches such as fine-tuning [35, 39, 1, 53], data augmentation [23, 30] and optical flow [50, 1, 39, 30, 8].

The rest of this paper is organised as follows. Section 2 briefly outlines some of the most relevant prior work in visual object tracking and semi-supervised VOS; Section 3 describes our proposal; Section 4 evaluates it on four benchmarks and illustrates several ablative studies; Section 5 concludes the paper.

2. Related Work

In this section, we briefly cover the most representative techniques for the two problems tackled in this paper.

Visual object tracking. Arguably, until very recently, the most popular paradigm for tracking arbitrary objects has been to train online a discriminative classifier exclusively from the ground-truth information provided in the first frame of a video (and then update it online).

In the past few years, the Correlation Filter, a simple algorithm that discriminates between the template of an arbitrary target and its 2D translations, rose to prominence as a particularly fast and effective strategy for tracking-by-detection thanks to the pioneering work of Bolme et al. [4]. Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51]).

Recently, a radically different approach has been introduced [3, 19, 49]. Instead of learning a discriminative classifier online, these methods train offline a similarity function on pairs of video frames. At test time, this function can be simply evaluated on a new video, once per frame. In particular, evolutions of the fully-convolutional Siamese approach [3] considerably improved tracking performance by making use of region proposals [28], hard negative mining [63], ensembling [15] and memory networks [60].

Most modern trackers, including all the ones mentioned above, use a rectangular bounding box both to initialise the target and to estimate its position in the subsequent frames. Despite its convenience, a simple rectangle often fails to properly represent an object, as is evident in the examples of Figure 1. This motivated us to propose a tracker able to produce binary segmentation masks while still only relying on a bounding box initialisation.

Interestingly, in the past it was not uncommon for trackers to produce a coarse binary mask of the target object (e.g. [11, 42]). However, to the best of our knowledge, the only recent tracker that, like ours, is able to operate online and produce a binary mask starting from a bounding box initialisation is the superpixel-based approach of Yeo et al. [61]. However, at 4 frames per second (fps), its fastest variant is significantly slower than our proposal. Furthermore, when using CNN features, its speed is affected by a 60-fold decrease, plummeting below 0.1 fps. Finally, it has not been demonstrated to be competitive on modern tracking or VOS benchmarks. Similar to us, the methods of Perazzi et al. [39] and Ci et al. [10] can also start from a rectangle and
output per-frame masks. However, they require fine-tuning at test time, which makes them slow.

Semi-supervised video object segmentation. Benchmarks for arbitrary object tracking (e.g. [48, 26, 56]) assume that trackers receive input frames in a sequential fashion. This aspect is generally referred to with the attributes online or causal [26]. Moreover, methods are often focused on achieving a speed that exceeds typical video framerates [27]. Conversely, semi-supervised VOS algorithms have been traditionally more concerned with an accurate representation of the object of interest [38, 40].

In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN.

Another popular strategy is to process video frames independently (e.g. [35, 39, 53]), similarly to what happens in most tracking approaches. For example, in OSVOS-S Maninis et al. [35] do not make use of any temporal information. They rely on a fully-convolutional network pre-trained for classification and then, at test time, they fine-tune it using the ground-truth mask provided in the first frame. MaskTrack [39] instead is trained from scratch on individual images, but it does exploit some form of temporality at test time by using the latest mask prediction and optical flow as additional input to the network.

Aiming towards the highest possible accuracy, at test time VOS methods often feature computationally intensive techniques such as fine-tuning [35, 39, 1, 53], data augmentation [23, 30] and optical flow [50, 1, 39, 30, 8]. Therefore, these approaches are generally characterised by low framerates and the inability to operate online. For example, it is not uncommon for methods to require minutes [39, 9] or even hours [50, 1] for videos that are just a few seconds long, like the ones of DAVIS.

Recently, there has been an increasing interest in the VOS community towards faster methods [36, 57, 8, 7, 22, 21]. To the best of our knowledge, the fastest approaches with a performance competitive with the state of the art are the ones of Yang et al. [59] and Wug et al. [57]. The former uses a meta-network “modulator” to quickly adapt the parameters of a segmentation network during test time, while the latter does not use any fine-tuning and adopts an encoder-decoder Siamese architecture trained in multiple stages. Both these methods run below 10 frames per second, while we are more than six times faster and only rely on a bounding box initialisation.

3. Methodology

To allow online operability and fast speed, we adopt the fully-convolutional Siamese framework [3]. Moreover, to illustrate that our approach is agnostic to the specific fully-convolutional method used as a starting point (e.g. [3, 28, 63, 60, 16]), we consider the popular SiamFC [3] and SiamRPN [28] as two representative examples. We first introduce them in Section 3.1 and then describe our approach in Section 3.2.

3.1. Fully-convolutional Siamese networks

SiamFC. Bertinetto et al. [3] propose to use, as a fundamental building block of a tracking system, an offline-trained fully-convolutional Siamese network that compares an exemplar image $z$ against a (larger) search image $x$ to obtain a dense response map. $z$ and $x$ are, respectively, a $w \times h$ crop centered on the target object and a larger crop centered on the last estimated position of the target. The two inputs are processed by the same CNN $f_\theta$, yielding two feature maps that are cross-correlated:

$$g_\theta(z, x) = f_\theta(z) \star f_\theta(x). \qquad (1)$$

In this paper, we refer to each spatial element of the response map (left-hand side of Eq. 1) as response of a candidate window (RoW). For example, $g_\theta^n(z, x)$ encodes a similarity between the exemplar $z$ and the n-th candidate window in $x$. For SiamFC, the goal is for the maximum value of the response map to correspond to the target location in the search area $x$. Instead, in order to allow each RoW to encode richer information about the target object, we replace the simple cross-correlation of Eq. 1 with depth-wise cross-correlation [2] and produce a multi-channel response map. SiamFC is trained offline on millions of video frames with the logistic loss [3, Section 2.2], which we refer to as $\mathcal{L}_{sim}$.

SiamRPN. Li et al. [28] considerably improve the performance of SiamFC by relying on a region proposal network (RPN) [46, 14], which allows estimating the target location with a bounding box of variable aspect ratio. In particular, in SiamRPN each RoW encodes a set of $k$ anchor box proposals and corresponding object/background scores. Therefore, SiamRPN outputs box predictions in parallel with classification scores. The two output branches are trained using the smooth $L_1$ and the cross-entropy losses [28, Section 3.2]. In the following, we refer to them as $\mathcal{L}_{box}$ and $\mathcal{L}_{score}$ respectively.

3.2. SiamMask

Unlike existing tracking methods that rely on low-fidelity object representations, we argue the importance of producing per-frame binary segmentation masks. To this aim we show that, besides similarity scores and bounding box coordinates, it is possible for the RoW of a fully-
[Figure 2 diagram. Recoverable labels: inputs 127*127*3 and 255*255*3; backbone $f_\theta$; feature map 15*15*256; depth-wise cross-correlation $\star_d$ with output 17*17*256; mask branch $h_\phi$ with output 17*17*(63*63), one RoW 1*1*(63*63), mask 127*127*1; box output 17*17*4k. Panels: (a) three-branch variant architecture, (b) two-branch variant head.]
Figure 2. Schematic illustration of SiamMask variants: (a) three-branch architecture (full), (b) two-branch architecture (head). $\star_d$ denotes depth-wise cross correlation. For simplicity, upsampling layer and mask refinement module are omitted here and detailed in Appendix A.
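The depth-wise cross-correlation named in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: feature maps are plain nested lists indexed as [row][column][channel], and the function names `depthwise_xcorr` and `xcorr` are hypothetical. Each channel of the exemplar features is slid over the same channel of the search features, so every candidate window (RoW) yields one value per channel; summing those values over channels recovers the single-channel response of Eq. 1.

```python
def depthwise_xcorr(search, exemplar):
    """Depth-wise cross-correlation of two feature maps stored as [row][col][channel].

    search has shape Hx x Wx x C, exemplar has shape Hz x Wz x C.
    The result has shape (Hx-Hz+1) x (Wx-Wz+1) x C: one C-vector per candidate window.
    """
    hx, wx, c = len(search), len(search[0]), len(search[0][0])
    hz, wz = len(exemplar), len(exemplar[0])
    out = []
    for i in range(hx - hz + 1):
        row = []
        for j in range(wx - wz + 1):
            # Correlate channel k of the exemplar with channel k of the window at (i, j).
            row.append([
                sum(search[i + u][j + v][k] * exemplar[u][v][k]
                    for u in range(hz) for v in range(wz))
                for k in range(c)
            ])
        out.append(row)
    return out


def xcorr(search, exemplar):
    """Plain cross-correlation (Eq. 1): the channel-wise responses summed to one scalar."""
    return [[sum(per_channel) for per_channel in row]
            for row in depthwise_xcorr(search, exemplar)]


# Toy example: 4x4x2 search features and 2x2x2 exemplar features give a 3x3x2 response.
search = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
exemplar = [[[1.0, 2.0] for _ in range(2)] for _ in range(2)]
response = depthwise_xcorr(search, exemplar)
print(len(response), len(response[0]), len(response[0][0]))
```

With the 15*15*256 exemplar features shown in Figure 2, a 31*31*256 search feature map (an assumed size, since it is not legible in the figure residue) would give 31 - 15 + 1 = 17 windows per side, matching the 17*17*256 response.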
convolutional Siamese network to also encode the information necessary to produce a pixel-wise binary mask. This can be achieved by extending existing Siamese trackers with an extra branch and loss.

We predict $w \times h$ binary masks (one for each RoW) using a simple two-layer neural network $h_\phi$ with learnable parameters $\phi$. Let $m_n$ denote the predicted mask corresponding to the n-th RoW,

$$m_n = h_\phi(g_\theta^n(z, x)). \qquad (2)$$

From Eq. 2 we can see that the mask prediction is a function of both the image to segment $x$ and the target object in $z$. In this way, $z$ can be used as a reference to guide the segmentation process: given a different reference image, the network will produce a different segmentation mask for $x$.

Loss function. During training, each RoW is labelled with a ground-truth binary label $y_n \in \{\pm 1\}$ and also associated with a pixel-wise ground-truth mask $c_n$ of size $w \times h$. Let $c_n^{ij} \in \{\pm 1\}$ denote the label corresponding to pixel $(i, j)$ of the object mask in the n-th candidate RoW. The loss function $\mathcal{L}_{mask}$ (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:

$$\mathcal{L}_{mask}(\theta, \phi) = \sum_n \left( \frac{1 + y_n}{2wh} \sum_{ij} \log\left(1 + e^{-c_n^{ij} m_n^{ij}}\right) \right). \qquad (3)$$

Thus, the classification layer of $h_\phi$ consists of $w \times h$ classifiers, each indicating whether a given pixel belongs to the object in the candidate window or not. Note that $\mathcal{L}_{mask}$ is considered only for positive RoWs (i.e. with $y_n = 1$), because the factor $(1 + y_n)/2$ is 1 for $y_n = 1$ and 0 for $y_n = -1$.

Mask representation. In contrast to semantic segmentation methods in the style of FCN [32] and Mask R-CNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object. In particular, in our case this representation corresponds to one of the (17×17) RoWs produced by the depth-wise cross-correlation between $f_\theta(z)$ and $f_\theta(x)$. Importantly, the network $h_\phi$ of the segmentation task is composed of two 1×1 convolutional layers, one with 256 and the other with $63^2$ channels (Figure 2). This allows every pixel classifier to utilise information contained in the entire RoW and thus to have a complete view of its corresponding candidate window in $x$, which is critical to disambiguate between instances that look like the target (e.g. last row of Figure 4), often referred to as distractors. With the aim of producing a more accurate object mask, we follow the strategy of [44], which merges low and high resolution features using multiple refinement modules made of upsampling layers and skip connections (see Appendix A).

Two variants. For our experiments, we augment the architectures of SiamFC [3] and SiamRPN [28] with our segmentation branch and the loss $\mathcal{L}_{mask}$, obtaining what we call the two-branch and three-branch variants of SiamMask. These respectively optimise the multi-task losses $\mathcal{L}_{2B}$ and $\mathcal{L}_{3B}$, defined as:

$$\mathcal{L}_{2B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{sim}, \qquad (4)$$

$$\mathcal{L}_{3B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{score} + \lambda_3 \cdot \mathcal{L}_{box}. \qquad (5)$$

We refer the reader to [3, Section 2.2] for $\mathcal{L}_{sim}$ and to [28, Section 3.2] for $\mathcal{L}_{box}$ and $\mathcal{L}_{score}$. For $\mathcal{L}_{3B}$, a RoW is considered positive ($y_n = 1$) if one of its anchor boxes has IOU with the ground-truth box of at least 0.6 and negative ($y_n = -1$) otherwise. For $\mathcal{L}_{2B}$, we adopt the same strategy as [3] to define positive and negative samples. We did not search over the hyperparameters of Eq. 4 and Eq. 5 and simply set $\lambda_1 = 32$ as in [43] and $\lambda_2 = \lambda_3 = 1$. The task-specific branches for the box and score outputs are constituted by two 1×1 convolutional layers. Figure 2 illustrates the two variants of SiamMask.

Box generation. Note that, while VOS benchmarks require binary masks, typical tracking benchmarks such as
VOT [26, 27] require a bounding box as final representation of the target object. We consider three different strategies to generate a bounding box from a binary mask (Figure 3): (1) axis-aligned bounding rectangle (Min-max), (2) rotated minimum bounding rectangle (MBR) and (3) the optimisation strategy used for the automatic bounding box generation proposed in VOT-2016 [26] (Opt). We empirically evaluate these alternatives in Section 4 (Table 1).

Figure 3. In order to generate a bounding box from a binary mask (in yellow), we experiment with three different methods. Min-max: the axis-aligned rectangle containing the object (red); MBR: the minimum bounding rectangle (green); Opt: the rectangle obtained via the optimisation strategy proposed in VOT-2016 [26] (blue).

3.3. Implementation details

Network architecture. For both our variants, we use a ResNet-50 [18] until the final convolutional layer of the 4-th stage as our backbone $f_\theta$. In order to obtain a high spatial resolution in deeper layers, we reduce the output stride to 8 by using convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [6]. In our model, we add to the shared backbone $f_\theta$ an unshared adjust layer (1×1 conv with 256 outputs). For simplicity, we omit it in Eq. 1. We describe the network architectures in more detail in Appendix A.

Training. Like SiamFC [3], we use exemplar and search image patches of 127×127 and 255×255 pixels respectively. During training, we randomly jitter exemplar and search patches. Specifically, we consider random translations (up to ±8 pixels) and rescaling (of $2^{\pm 1/8}$ and $2^{\pm 1/4}$ for exemplar and search respectively).

The network backbone is pre-trained on the ImageNet-1k classification task. We use SGD with a first warmup phase in which the learning rate increases linearly from $10^{-3}$ to $5 \times 10^{-3}$ for the first 5 epochs and then decreases logarithmically until $5 \times 10^{-4}$ for 15 more epochs. We train all our models using COCO [31], ImageNet-VID [47] and YouTube-VOS [58].

Inference. During tracking, SiamMask is simply evaluated once per frame, without any adaptation. In both our variants, we select the output mask using the location attaining the maximum score in the classification branch. Then, after having applied a per-pixel sigmoid, we binarise the output of the mask branch at the threshold of 0.5. In the two-branch variant, for each video frame after the first one, we fit the output mask with the Min-max box and use it as reference to crop the next frame search region. Instead, in the three-branch variant, we find it more effective to exploit the highest-scoring output of the box branch as reference.

4. Experiments

In this section, we evaluate our approach on two related tasks: visual object tracking (on VOT-2016 and VOT-2018) and semi-supervised video object segmentation (on DAVIS-2016 and DAVIS-2017). We refer to our two-branch and three-branch variants as SiamMask-2B and SiamMask respectively.

4.1. Evaluation for visual object tracking

Datasets and settings. We adopt two widely used benchmarks for the evaluation of the object tracking task: VOT-2016 [26] and VOT-2018 [27], both annotated with rotated bounding boxes. We use VOT-2016 to understand how different types of representation affect the performance. For this first experiment, we use mean intersection over union (IOU) and Average Precision (AP)@{0.5, 0.7} IOU. We then compare against the state-of-the-art on VOT-2018, using the official VOT toolkit and the Expected Average Overlap (EAO), a measure that considers both accuracy and robustness of a tracker [27].

How much does the object representation matter? Existing tracking methods typically predict axis-aligned bounding boxes with a fixed [3, 20, 13, 33] or variable [28, 19, 63] aspect ratio. We are interested in understanding to what extent producing a per-frame binary mask can improve tracking. In order to focus on representation accuracy, for this experiment only we ignore the temporal aspect and sample video frames at random. The approaches described in the following paragraph are tested on randomly cropped search patches (with random shifts within ±16 pixels and scale deformations up to $2^{1 \pm 0.25}$) from the sequences of VOT-2016.

In Table 1, we compare our three-branch variant using the Min-max, MBR and Opt approaches (described at the end of Section 3.2 and in Figure 3). For reference, we also report results for SiamFC and SiamRPN as representative of the fixed and variable aspect-ratio approaches, together with three oracles that have access to per-frame ground-truth information and serve as upper bounds for the different representation strategies. (1) The fixed aspect-ratio oracle uses the per-frame ground-truth area and center location, but fixes the aspect ratio to the one of the first frame and produces an axis-aligned bounding box. (2) The Min-max oracle uses the minimal enclosing rectangle of the rotated ground-truth bounding box to produce an axis-aligned bounding box. (3) Finally, the MBR oracle uses the rotated minimum bounding rectangle of the ground-truth. Note that (1), (2) and (3) can be considered, respectively, the per-
mIOU (%) mAP@0.5 IOU mAP@0.7 IOU ric, showing a significant advantage with respect to the Cor-
Fixed a.r. Oracle 73.43 90.15 62.52 relation Filter-based trackers CSRDCF [33], STRCF [29].
Min-max Oracle 77.70 88.84 65.16
This is not surprising, as SiamMask relies on a richer object
MBR Oracle 84.07 97.77 80.68
SiamFC [3] 50.48 56.42 9.28
representation, as outlined in Table 1. Interestingly, sim-
SiamRPN [63] 60.02 76.20 32.47 ilarly to us, He et al. (SA Siam R) [15] are motivated to
SiamMask-Min-max 65.05 82.99 43.09 achieve a more accurate target representation by consider-
SiamMask-MBR 67.15 85.42 50.86 ing multiple rotated and rescaled bounding boxes. However,
SiamMask-Opt 71.68 90.77 60.47 their representation is still constrained to a fixed aspect-ratio
Table 1. Performance for different bounding box representation
box.
strategies on VOT-2016. Table 3 gives further results of SiamMask with dif-
ferent box generation strategies on VOT-2018 and -2016.
SiamMask-box means the box branch of SiamMask is
formance upper bounds for the representation strategies of adopted for inference despite the mask branch has been
SiamFC, SiamRPN and SiamMask. trained. We can observe clear improvements on all evalua-
Table 1 shows that our method achieves the best mIOU, tion metrics by using the mask branch for box generation.
no matter the box generation strategy used (Figure 3). Al-
beit SiamMask-Opt offers the highest IOU and mAP, it re- 4.2. Evaluation for semi-supervised VOS
quires significant computational resources due to its slow
Our model, once trained, can also be used for the task
optimisation procedure [54]. SiamMask-MBR achieves a
of VOS to achieve competitive performance without requir-
mAP@0.5 IOU of 85.4, with a respective improvement of
ing any adaptation at test time. Importantly, differently to
+29 and +9.2 points w.r.t. the two fully-convolutional
typical VOS approaches, ours can operate online, runs in
baselines. Interestingly, the gap significantly widens when
real-time and only requires a simple bounding box initiali-
considering mAP at the higher accuracy regime of 0.7 IOU:
sation.
+41.6 and +18.4 respectively. Notably, our accuracy re-
sults are not far from the fixed aspect-ratio oracle. More- Datasets and settings. We report the performance of
over, comparing the upper bound performance represented SiamMask on DAVIS-2016 [40], DAVIS-2017 [45] and
by the oracles, it is possible to notice how, by simply chang- YouTube-VOS [58] benchmarks. For both DAVIS datasets,
ing the bounding box representation, there is a great room we use the official performance measures: the Jaccard index
for improvement (e.g. +10.6% mIOU improvement be- (J ) to express region similarity and the F-measure (F) to
tween the fixed aspect-ratio and the MBR oracles). express contour accuracy. For each measure C ∈ {J , F},
Overall, this study shows how the MBR strategy to obtain three statistics are considered: mean CM , recall CO , and
a rotated bounding box from a binary mask of the object decay CD , which informs us about the gain/loss of per-
offers a significant advantage over popular strategies that formance over time [40]. Following Xu et al. [58], for
simply report axis-aligned bounding boxes. YouTube-VOS we report the mean Jaccard index and F-
measure for both seen (J_S, F_S) and unseen categories (J_U, F_U). O is the average of these four measures.
To initialise SiamMask, we extract the axis-aligned bounding box from the mask provided in the first frame
(Min-max strategy, see Figure 3). Similarly to most VOS methods, in case of multiple objects in the same video
(DAVIS-2017) we simply perform multiple inferences.
Results on DAVIS and YouTube-VOS. In the semi-supervised setting, VOS methods are initialised with a
binary mask [38] and many of them require computationally intensive techniques at test time such as
fine-tuning [35, 39, 1, 53], data augmentation [23, 30], inference on MRF/CRF [55, 50, 36, 1] and optical
flow [50, 1, 39, 30, 8]. As a consequence, it is not uncommon for VOS techniques to require several minutes
to process a short sequence. Clearly, these strategies make the online applicability (which is our focus)
impossible. For this reason, in our comparison we mainly concentrate on fast state-of-the-art approaches.
Results on VOT-2018 and VOT-2016. In Table 2 we compare the two variants of SiamMask with MBR strategy
and SiamMask-Opt against five recently published state-of-the-art trackers on the VOT-2018 benchmark.
Unless stated otherwise, SiamMask refers to our three-branch variant with MBR strategy. Both variants
achieve outstanding performance and run in real-time. In particular, our three-branch variant significantly
outperforms the very recent and top performing DaSiamRPN [63], achieving an EAO of 0.380 while running at
55 frames per second. Even without the box regression branch, our simpler two-branch variant (SiamMask-2B)
achieves a high EAO of 0.334, which is on par with SA Siam R [15] and superior to any other real-time
method in the published literature. Finally, in SiamMask-Opt, the strategy proposed in [54] to find the
optimal rotated rectangle from a binary mask brings the best overall performance (and a particularly high
accuracy), but comes at a significant computational cost.
Our model is particularly strong under the accuracy met-
               SiamMask-Opt  SiamMask  SiamMask-2B  DaSiamRPN [63]  SiamRPN [28]  SA Siam R [15]  CSRDCF [33]  STRCF [29]
EAO ↑          0.387         0.380     0.334        0.326           0.244         0.337           0.263        0.345
Accuracy ↑     0.642         0.609     0.575        0.569           0.490         0.566           0.466        0.523
Robustness ↓   0.295         0.276     0.304        0.337           0.460         0.258           0.318        0.215
Speed (fps) ↑  5             55        60           160             200           32.4            48.9         2.9
Table 2. Comparison with the state-of-the-art under the EAO, Accuracy, and Robustness metrics on VOT-2018.
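As an illustration of the Min-max initialisation mentioned in the text (the axis-aligned bounding box of the
first-frame mask), the following minimal Python sketch may help. The function name mask_to_aabb and the
list-of-rows mask representation are assumptions made for this example; they are not taken from the
SiamMask implementation.

```python
def mask_to_aabb(mask):
    """Return (x_min, y_min, x_max, y_max) of the non-zero pixels of a binary mask.

    `mask` is a list of rows, each row a list of 0/1 values (hypothetical format).
    Raises ValueError when the mask has no foreground pixel.
    """
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    # Row indices of every row containing at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        raise ValueError("mask contains no foreground pixels")
    return min(xs), min(ys), max(xs), max(ys)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
example_box = mask_to_aabb(example_mask)  # (1, 1, 3, 2)
```

With several objects in one video, the same function would simply be called once per object mask, in line with
the "multiple inferences" strategy described above.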
Figure 4 panels: Basketball, Nature, Car-Shadow, Dogs-Jump, Pigs.
Figure 4. Qualitative results of our method for sequences belonging to both object tracking and video object segmentation benchmarks.
Basketball and Nature are from VOT-2018 [27]; Car-Shadow is from DAVIS-2016 [40]; Dogs-Jump and Pigs are from DAVIS-2017 [45].
Multiple masks are obtained from different inferences (with different initialisations).
the mask branch is not used during inference. We can observe how both variants obtain a modest but
meaningful improvement with respect to their counterparts (SiamFC and SiamRPN): from 0.251 to 0.265 EAO
for the two-branch and from 0.359 to 0.363 for the three-branch on VOT-2018.
Timing. SiamMask operates online without any adaptation to the test sequence. On a single NVIDIA RTX
2080 GPU, we measured an average speed of 55 and 60 frames per second for the three-branch and
two-branch variants, respectively (see Table 2). Note that the highest computational burden comes from
the feature extractor f_θ.
Failure cases. Finally, we discuss two scenarios in which SiamMask fails: motion blur and “non-object”
instance (Figure 5). Despite being different in nature, these two cases arguably arise from the complete
lack of similar training samples in the training sets, which are focused on objects that can be
unambiguously discriminated from the foreground.
Figure 5. Failure cases: motion blur and “non-object” instance.
5. Conclusion
In this paper we introduced SiamMask, a simple approach that enables fully-convolutional Siamese trackers
to produce class-agnostic binary segmentation masks of the target object. We show how it can be applied
with success to both tasks of visual object tracking and semi-supervised video object segmentation,
showing better accuracy than state-of-the-art trackers and, at the same time, the fastest speed among VOS
methods. The two variants of SiamMask we proposed are initialised with a simple bounding box, operate
online, run in real-time and do not require any adaptation to the test sequence. We hope that our work
will inspire further studies that consider the two problems of visual object tracking and video object
segmentation together.
Acknowledgements. This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant
Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the support
of the Royal Academy of Engineering and FiveAI Ltd. Qiang Wang is partly supported by the NSFC (Grant No.
61751212, 61721004 and U1636218).
References
[1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision workshops, 2016.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[5] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[7] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[8] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[9] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In IEEE International Conference on Computer Vision, 2017.
[10] H. Ci, C. Wang, and Y. Wang. Video object segmentation by learning location-sensitive embeddings. In European Conference on Computer Vision, 2018.
[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[13] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In IEEE International Conference on Computer Vision, 2015.
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In IEEE International Conference on Computer Vision, 2017.
[15] A. He, C. Luo, X. Tian, and W. Zeng. Towards a better match in siamese network based visual object tracker. In European Conference on Computer Vision workshops, 2018.
[16] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, 2016.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[21] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. Videomatch: Matching based video object segmentation. In European Conference on Computer Vision, 2018.
[22] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[23] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. In IEEE Conference on Computer Vision and Pattern Recognition workshops, 2017.
[24] H. Kiani Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In IEEE International Conference on Computer Vision, 2013.
[25] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filters with limited boundaries. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[26] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, G. Fernández, et al. The visual object tracking vot2016 challenge results. In European Conference on Computer Vision, 2016.
[27] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking vot-2018 challenge results. In European Conference on Computer Vision workshops, 2018.
[28] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[29] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning spatial-temporal regularized correlation filters for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[30] X. Li and C. C. Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In European Conference on Computer Vision, 2018.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[33] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] T. Makovski, G. A. Vazquez, and Y. V. Jiang. Visual learning in multiple-object tracking. PLoS One, 2008.
[35] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[36] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[37] O. Miksik, J.-M. Pérez-Rúa, P. H. Torr, and P. Pérez. Roam: a rich object appearance model with application to rotoscoping. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[38] F. Perazzi. Video Object Segmentation. PhD thesis, ETH Zurich, 2017.
[39] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[41] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In IEEE International Conference on Computer Vision, 2015.
[42] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-Based Probabilistic Tracking. In European Conference on Computer Vision, 2002.
[43] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.
[44] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.
[45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[48] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[49] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[50] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[51] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[52] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. Smeulders, P. H. S. Torr, and E. Gavves. Long-term tracking in the wild: A benchmark. In European Conference on Computer Vision, 2018.
[53] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.
[54] T. Vojir and J. Matas. Pixel-wise object segmentations for the vot 2016 dataset. Research Report CTU-CMP-2017–01, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, 2017.
[55] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. Jots: Joint online tracking and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[56] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[57] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[58] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In European Conference on Computer Vision, 2018.
[59] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[60] T. Yang and A. B. Chan. Learning dynamic memory networks for object tracking. In European Conference on Computer Vision, 2018.
[61] D. Yeo, J. Son, B. Han, and J. H. Han. Superpixel-based tracking-by-segmentation using markov chains. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[62] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In IEEE International Conference on Computer Vision, 2017.
[63] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, 2018.
A. Architectural details
Network backbone. Table 8 illustrates the details of our backbone architecture (f_θ in the main paper).
For both variants, we use a ResNet-50 [18] until the final convolutional layer of the 4-th stage. In order
to obtain a higher spatial resolution in deep layers, we reduce the output stride to 8 by using
convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [6].
Specifically, we set the stride to 1 and the dilation rate to 2 in the 3×3 conv layer of conv4_1.
Differently to the original ResNet-50, there is no downsampling in conv4_x. We also add to the backbone
an adjust layer (a 1×1 con-

block   score       box         mask
conv5   1 × 1, 256  1 × 1, 256  1 × 1, 256
conv6   1 × 1, 2k   1 × 1, 4k   1 × 1, (63 × 63)
Table 9. Architectural details of the three-branch head. k denotes the number of anchor boxes per RoW.

block   score       mask
conv5   1 × 1, 256  1 × 1, 256
conv6   1 × 1, 1    1 × 1, (63 × 63)
Table 10. Architectural details of the two-branch head.

[Figure: network diagram. Recoverable labels: ResNet-50 (conv 1, conv 2, conv 3, conv 4, adjust), deconv, U2, U3, U4, conv 3*3, Sigmoid, mask; tensor sizes 255*255*3, 127*127*3, 31*31*256, 15*15*32, 31*31*16, 61*61*8, 127*127*4, 127*127*1.]
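The stride and dilation changes described for the backbone can be checked with a little arithmetic. The sketch
below is only illustrative: it assumes the usual per-stage strides of a standard ResNet-50 (2 for conv1, 2 for
the max-pooling layer, 1 for conv2_x, 2 for conv3_x and 2 for conv4_x), and the helper names output_stride and
effective_kernel are hypothetical.

```python
from math import prod


def output_stride(strides):
    """Overall downsampling factor: the product of the per-stage strides."""
    return prod(strides)


def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


# Assumed per-stage strides of a standard ResNet-50 up to the end of conv4_x.
standard = {"conv1": 2, "maxpool": 2, "conv2_x": 1, "conv3_x": 2, "conv4_x": 2}
# Variant described in the text: no downsampling in conv4_x.
modified = dict(standard, conv4_x=1)

standard_stride = output_stride(standard.values())  # 16
modified_stride = output_stride(modified.values())  # 8
dilated_extent = effective_kernel(3, 2)              # a 3x3 kernel with dilation 2 spans 5x5
```

Removing the stride in conv4_x halves the output stride from 16 to 8, and the dilation rate of 2 lets each 3×3
kernel cover a 5×5 window, which compensates for the receptive field lost by dropping the downsampling step.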
Figure 9. Further qualitative results of our method on sequences from the visual object tracking benchmark VOT-2018 [27].
Figure 10 panels: dog, drift-straight, goat, Libby, motocross-jump, parkour, Gold-Fish.
Figure 10. Further qualitative results of our method on sequences from the semi-supervised video object segmentation benchmarks DAVIS-
2016 [40] and DAVIS-2017 [45]. Multiple masks are obtained from different inferences (with different initialisations).
Revealing Scenes by Inverting Structure from Motion Reconstructions
(a) SfM point cloud (top view) (b) Projected 3D points (c) Synthesized Image (d) Original Image
Figure 1: SYNTHESIZING IMAGERY FROM A SFM POINT CLOUD: From left to right: (a) Top view of a SfM reconstruction
of an indoor scene, (b) 3D points projected into a viewpoint associated with a source image, (c) the image reconstructed using
our technique, and (d) the source image. The reconstructed image is very detailed and closely resembles the source image.
Figure 2: NETWORK ARCHITECTURE: Our network has three sub-networks – VisibNet, CoarseNet and RefineNet.
The upper left shows that the input to our network is a multi-dimensional nD array. The paper explores network variants where
the inputs are different subsets of depth, color and SIFT descriptors. The three sub-networks have similar architectures. They
are U-Nets with encoder and decoder layers with symmetric skip connections. The extra layers at the end of the decoder
layers (marked in orange) are there to help with high-dimensional inputs. See the text and supplementary material for details.
first/final layers, each sub-network has the same architecture consisting of U-Nets with a series of
encoder-decoder layers with skip connections. Compared to conventional U-Nets, our network has a few extra
convolutional layers at the end of the decoder layers. These extra layers facilitate propagation of
information from the low-level features, particularly the information extracted from SIFT descriptors, via
the skip connections to a larger pixel area in the output, while also helping to attenuate visual artifacts
resulting from the highly sparse and irregular distribution of these features. We use nearest neighbor
upsampling followed by standard convolutions instead of transposed convolutions as the latter are known to
produce artifacts [32].

3.3. Optimization

We separately train the sub-networks in our architecture, VisibNet, CoarseNet, and RefineNet. Batch
normalization was used in every layer, except the final one in each network. We applied Xavier
initialization and projections were generated on-the-fly to facilitate data augmentation during training
and novel view generation after training.

VisibNet was trained first to classify feature map points as either visible or occluded, using
ground-truth visibility masks generated automatically by running VisibDense for all train, test, and
validation samples. Given training pairs of input feature maps $F_x \in \mathbb{R}^{H \times W \times N}$
and target source images $x \in \mathbb{R}^{H \times W \times 3}$, VisibNet's objective is

$$\mathcal{L}_V(x) = -\sum_{i=1}^{M} \Big[ U_x \log\frac{V(F_x) + 1}{2} + (1 - U_x) \log\frac{1 - V(F_x)}{2} \Big]_i, \tag{1}$$

where $V : \mathbb{R}^{H \times W \times N} \to \mathbb{R}^{H \times W \times 1}$ denotes a differentiable
function representing VisibNet, with learnable parameters, $U_x \in \mathbb{R}^{H \times W \times 1}$
denotes the ground-truth visibility map for feature map $F_x$, and the summation is carried out over the
set of $M$ non-zero spatial locations in $F_x$.

CoarseNet was trained next, using a combination of an L1 pixel loss and an L2 perceptual loss (as in
[22, 8]) over the outputs of layers relu1_1, relu2_2, and relu3_3 of VGG16 [40] pre-trained for image
classification on the ImageNet [6] dataset. The weights of VisibNet remained fixed while CoarseNet was
being trained using the loss

$$\mathcal{L}_C = \|C(F_x) - x\|_1 + \alpha \sum_{i=1}^{3} \|\phi_i(C(F_x)) - \phi_i(x)\|_2^2, \tag{2}$$

where $C : \mathbb{R}^{H \times W \times N} \to \mathbb{R}^{H \times W \times 3}$ denotes a differentiable
function representing CoarseNet, with learnable parameters, and
$\phi_1 : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 64}$,
$\phi_2 : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 128}$, and
$\phi_3 : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 256}$
denote the layers relu1_1, relu2_2, and relu3_3, respectively, of the pre-trained VGG16 network.

RefineNet was trained last using a combination of an L1 pixel loss, the same L2 perceptual loss as
CoarseNet, and an adversarial loss. While training RefineNet, the weights of VisibNet and CoarseNet
remained fixed. For adversarial training, we used a conditional discriminator whose goal was to
distinguish between real source images used to generate the SfM models and images synthesized by
RefineNet. The discriminator was trained using a cross-entropy loss similar to Eq. (1). Additionally, to
stabilize adversarial training, $\phi_1(R(F_x))_1$, $\phi_2(R(F_x))_1$, and
$\phi_3(R(F_x))_1$ were concatenated before the first, second, and third convolutional layers of the
discriminator as done in [42]. RefineNet, denoted as $R(\cdot)$, has the following loss.

$$\mathcal{L}_R = \|R(F_x) - x\|_1 + \alpha \sum_{i=1}^{3} \|\phi_i(R(F_x)) - \phi_i(x)\|_2^2 + \beta \big[ \log(D(x)) + \log(1 - D(R(F_x))) \big]. \tag{3}$$

Here, the two functions, $R : \mathbb{R}^{H \times W \times (N+3)} \to \mathbb{R}^{H \times W \times 3}$ and
$D : \mathbb{R}^{H \times W \times (N+3)} \to \mathbb{R}$ denote differentiable functions representing
RefineNet and the discriminator, respectively, with learnable parameters. We trained RefineNet to minimize
$\mathcal{L}_R$ by applying alternating gradient updates to RefineNet and the discriminator. The gradients
were computed on mini-batches of training data, with different batches used to update RefineNet and the
discriminator.

4. Experimental Results

We now report a systematic evaluation of our method. Some of our results are qualitatively summarized in
Fig. 3, demonstrating robustness to various challenges, namely, missing information in the point clouds,
effectiveness of our visibility estimation, and the sparse and irregular distribution of input samples
over a large variety of scenes.

Figure 3: QUALITATIVE RESULTS: Each result is a 3 × 1 set of square images, showing point clouds (with
occluded points in red), image reconstruction and original. The first four columns (top and bottom) show
results from the MegaDepth dataset (internet scenes) and the last four columns (top and bottom) show
results from indoor NYU scenes. Sparsity: Our network handles a large variety in input sparsity (density
decreases from left to right). In addition, perspective projection accentuates the spatially-varying
density differences, and the MegaDepth outdoor scenes have concentrated points in the input whereas NYU
indoor scenes have far samples. Further, the input points are non-homogeneous, with large holes which our
method gracefully fills in. Visual effects: For the first four columns (MD scenes) our results give the
pleasing effect of uniform illumination (see top of first column). Since our method relies on SfM, moving
objects are not recovered. Scene diversity: The fourth column is an aerial photograph, an unusual category
that is still recovered well. For the last four columns (NYU scenes), despite lower sparsity, we can
recover textures in common household scenes such as bathrooms, classrooms and bedrooms. The variety shows
that our method does not learn object categories and works on any scene. Visibility: All scenes benefit
from visibility prediction using VisibNet which for example was crucial for the bell example (lower 2nd
column).

Dataset. We use the MegaDepth [24] and NYU [39] datasets in our experiments. MegaDepth (MD) is an Internet
image dataset with ∼150k images of 196 landmark scenes obtained from Flickr. NYU contains ∼400k images of
464 indoor scenes captured with the Kinect (we only used the RGB images). These datasets cover very
different scene content and image resolution, and generate very different distributions of SfM points and
camera poses. Generally, NYU scenes produce far fewer SfM points than the MD scenes.

Preprocessing. We processed the 660 scenes in MD and NYU using the SfM implementation in COLMAP [38]. We
partitioned the scenes into training, validation, and testing sets with 441, 80, and 139 scenes
respectively. All images of one scene were included only in one of the three groups. We report results
using both the average mean absolute error (MAE), where color values are scaled to the range [0,1], and
the average structural similarity (SSIM). Note that lower MAE and higher SSIM values indicate better
results.

Inverting Single Image SIFT Features. Consider the single image scenario, with trivial visibility
estimation and identical input to [9]. We performed an ablation study in this scenario, measuring the
effect of inverting features with unknown keypoint scale, orientation, and multiple unknown image sources.
Four variants of CoarseNet were trained, then tested at three sparsity levels. The results are shown in
Table 1 and Figure 4. Table 1 reports MAE and SSIM across a combined MD and NYU dataset. The sparsity
percentage refers to how many randomly selected features were retained in the input, and our method
handles a wide range of sparsity reasonably well. From the examples in Figure 4, we observe that the
networks are surprisingly robust at inverting features with unknown orientation and scale; while the
accuracy drops a bit as expected, the reconstructed images are still recognizable. Finally, we quantify
the effect of unknown and different image sources for the SIFT features. The last row of Table 1 shows
that indeed the feature inversion problem becomes harder but the results are still remarkably good. Having
demonstrated that our work solves a harder problem than previously tackled, we now report results on
inverting SfM points and their features.

Desc.  Inp. Feat.   MAE                  SSIM
Src.   D   O   S    20%    60%    100%   20%    60%    100%
Si     ✓   ✓   ✓    .126   .105   .101   .539   .605   .631
Si     ✓   ✓   ×    .133   .111   .105   .499   .568   .597
Si     ✓   ×   ✓    .129   .107   .102   .507   .574   .599
Si     ✓   ×   ×    .131   .113   .109   .477   .550   .578
M      ✓   ×   ×    .147   .128   .123   .443   .499   .524
Table 1: INVERTING SINGLE IMAGE SIFT FEATURES: The top four rows compare networks designed for different
subsets of single image (Si) inputs: descriptor (D), keypoint orientation (O) and scale (S). Test error
(MAE) and accuracy (SSIM) obtained when 20%, 60% and all the SIFT features are used. Lower MAE and higher
SSIM values are better. The last row is for when the descriptors originate from multiple (M) different and
unknown source images.

4.1. Visibility Estimation

We first independently evaluate the performance of the proposed VisibNet model and compare it to the
geometric methods VisibSparse and VisibDense. We trained four variants of VisibNet designed for different
subsets of input attributes to classify points in the input feature map as “visible” or “occluded”. We
report classification accuracy separately on the MD and NYU test sets even though the network was trained
on the combined training set (see Table 2). We observe that VisibNet is largely insensitive to scene type,
sparsity levels, and choice of input attributes such as depth, color, and descriptors. The VisibNet
variant designed for depth only has 94.8% and 89.2% mean classification accuracy on MD and NYU test sets,
respectively, even when only 20% of the input samples were used to simulate sparse inputs. Table 3 shows
that when points predicted as occluded by VisibNet are removed from the input to CoarseNet, we observe a
consistent improvement when compared to CoarseNet carrying both the burdens of visibility and image
synthesis (denoted as Implicit in the table). While the improvement may not seem numerically large, in
Figure 5 we show insets where visual artifacts (bookshelf above, building below) are removed.

       Inp. Feat.   Accuracy
Data   z   D   C    20%    60%    100%
MD     ✓   ×   ×    .948   .948   .946
MD     ✓   ×   ✓    .938   .943   .941
MD     ✓   ✓   ×    .949   .951   .948
MD     ✓   ✓   ✓    .952   .952   .950
NYU    ✓   ×   ×    .892   .907   .908
NYU    ✓   ×   ✓    .897   .908   .910
NYU    ✓   ✓   ×    .895   .907   .909
NYU    ✓   ✓   ✓    .906   .916   .917
Table 2: EVALUATION OF VISIBNET: We trained four versions of VisibNet, each with a different set of input
attributes, namely, z (depth), D (SIFT) and C (color) to evaluate their relative importance. Ground truth
labels were obtained with VisibDense. The table reports mean classification accuracy on the test set for
the NYU and MD datasets. The results show that VisibNet achieves accuracy greater than 93.8% and 89.2% on
MD and NYU respectively and is not very sensitive to sparsity levels and input attributes.
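To make the visibility objective of Eq. (1) more concrete, here is a minimal Python sketch. It assumes, as
the form of the equation suggests, that the network output V(F_x) lies in the open interval (-1, 1), so that
(V + 1)/2 can be read as the probability of a point being visible. The function name visibnet_loss and the
flat-list inputs (one entry per non-zero location of F_x) are assumptions for illustration, not the authors'
implementation.

```python
import math


def visibnet_loss(pred, target):
    """Cross-entropy-style loss of Eq. (1) over M non-zero feature-map locations.

    pred   -- hypothetical network outputs V(F_x), each strictly inside (-1, 1)
    target -- ground-truth visibility U_x, 1 for visible and 0 for occluded
    """
    total = 0.0
    for v, u in zip(pred, target):
        # (v + 1) / 2 and (1 - v) / 2 sum to one and act as class probabilities.
        total += u * math.log((v + 1) / 2) + (1 - u) * math.log((1 - v) / 2)
    return -total


uninformed = visibnet_loss([0.0], [1])    # log(2): the output sits on the decision boundary
confident = visibnet_loss([0.9], [1])     # small loss: visible point predicted as visible
mistaken = visibnet_loss([-0.9], [1])     # large loss: visible point predicted as occluded
```

The loss is smallest when the sign of the output agrees with the label, which is the behaviour the accuracy
figures in Table 2 measure after thresholding.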
(a) Input (b) SIFT (c) SIFT + s (d) SIFT + o (e) SIFT + s + o (f) Original
Figure 4: INVERTING SIFT FEATURES IN A SINGLE IMAGE: (a) 2D keypoint locations. Results obtained with (b) only
descriptor, (c) descriptor and keypoint scale, (d) descriptor and keypoint orientation, (e) descriptor, scale and orientation. (f)
Original image. Results from using only descriptors (2nd column) are only slightly worse than the baseline (5th column).
(a) Input (b) Pred. (VisibNet) (c) Implicit (d) VisibNet (e) VisibDense (f) Original
Figure 5: IMPORTANCE OF VISIBILITY ESTIMATION: Examples showing (a) input 2D point projections (in blue), (b)
predicted visibility from VisibNet – occluded (red) and visible (blue) points, (c–e) results from Implicit (no
explicit visibility estimation), VisibNet (uses a CNN) and VisibDense (uses z-buffering and dense models), and
(f) the original image.
4.2. Relative Significance of Point Attributes

We trained four variants of CoarseNet, each with a different set of the available SfM point attributes.
The goal here is to measure the relative importance of each of the attributes. This information could be
used to decide which optional attributes should be removed when storing an SfM model to enhance privacy.
We report reconstruction error on the test set for both indoor (NYU) and outdoor scenes (MD) for various
sparsity levels in Table 4 and show qualitative evaluation on the test set in Figure 6. The results
indicate that our approach is largely invariant to sparsity and capable of capturing very fine details
even when the input feature map contains just depth, although, not surprisingly, color and SIFT
descriptors significantly improve visual quality.

4.3. Significance of RefineNet

In Figure 7 we qualitatively compare two scenes where the feature maps had only depth and descriptors
(left) and when they had all the attributes (right). For privacy preservation, these results are sobering.
While Table 4 showed that CoarseNet struggles when color is dropped (suggesting an easy solution of
removing color for privacy), Figure 7 (left) unfortunately shows that RefineNet recovers plausible colors
and improves results a lot. Of course, RefineNet trained on all features also does better than CoarseNet
although less dramatically (Figure 7 (right)).
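The RefineNet objective of Eq. (3) can be sketched as a plain function of already-computed quantities. In the
sketch below the feature lists stand in for the VGG16 activations φ_i, d_real and d_fake stand in for the
discriminator outputs D(x) and D(R(F_x)), and alpha and beta are placeholder weights; all of these names and
values are assumptions for illustration, since the text does not give the weights or the implementation.
Setting beta to zero leaves the pixel and perceptual terms, which have the same form as Eq. (2).

```python
import math


def l1(a, b):
    """Sum of absolute differences between two flat sequences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def l2_squared(a, b):
    """Squared Euclidean distance between two flat sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def refinenet_loss(pred, target, feats_pred, feats_target, d_real, d_fake, alpha, beta):
    """Pixel + perceptual + adversarial terms, combined as written in Eq. (3)."""
    pixel = l1(pred, target)
    # One squared-L2 term per feature layer (three layers in the paper).
    perceptual = sum(l2_squared(fp, ft) for fp, ft in zip(feats_pred, feats_target))
    adversarial = math.log(d_real) + math.log(1.0 - d_fake)
    return pixel + alpha * perceptual + beta * adversarial


# Toy values, chosen only to exercise the function.
toy_loss = refinenet_loss(
    pred=[0.5, 0.5], target=[0.5, 1.0],
    feats_pred=[[1.0, 2.0]], feats_target=[[1.0, 0.0]],
    d_real=0.5, d_fake=0.5, alpha=0.1, beta=0.0,
)
```

In the toy call the pixel term is 0.5 and the perceptual term is 4.0, so with alpha = 0.1 and beta = 0 the
total is 0.9.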
       Visibility     MAE                  SSIM
Data   Est.           20%    60%    100%   20%    60%    100%
MD     Implicit       .201   .197   .195   .412   .436   .445
MD     VisibSparse    .202   .197   .196   .408   .432   .440
MD     VisibNet       .201   .196   .195   .415   .440   .448
MD     VisibDense     .201   .196   .195   .417   .442   .451
NYU    Implicit       .121   .100   .094   .541   .580   .592
NYU    VisibSparse    .122   .100   .094   .539   .579   .592
NYU    VisibNet       .120   .098   .092   .543   .583   .595
NYU    VisibDense     .120   .097   .090   .545   .587   .600
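For readers who want to reproduce the style of measurement reported in these tables, the following Python
sketch shows one way to compute the mean absolute error with color values scaled to [0, 1], and to simulate
a sparsity level by retaining a random fraction of the input features. The function names, the flat 8-bit
image representation and the fixed random seed are assumptions made for this example.

```python
import random


def mean_absolute_error(pred, target, max_value=255):
    """MAE between two flat sequences of 8-bit color values, after scaling to [0, 1]."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / (max_value * len(pred))


def retain_fraction(features, fraction, seed=0):
    """Keep a randomly selected fraction of the features (e.g. 0.2 for the 20% column)."""
    rng = random.Random(seed)
    keep = round(fraction * len(features))
    return rng.sample(features, keep)


example_mae = mean_absolute_error([0, 255, 128, 64], [0, 0, 128, 128])  # 319 / 1020
sparse_features = retain_fraction(list(range(10)), 0.2)                 # two of the ten features
```

SSIM is not sketched here because it requires windowed image statistics that are better taken from an
existing image-processing library.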
Table 4: EFFECT OF POINT ATTRIBUTES: Performance of four networks designed for different sets of input
attributes – z (depth), D (SIFT) and C (color), on MD and NYU. Input sparsity is simulated by applying
random dropout to input samples during training and testing.

z    z+D    z+C    z+D+C    orig
Figure 6: EFFECT OF POINT ATTRIBUTES: Results obtained with different attributes. Left to right: depth
[z], depth + SIFT [z + D], depth + color [z + C], depth + SIFT + color [z + D + C] and the original
image. (see Table 4).

4.4. Novel View Synthesis

Our technique can be used to easily generate realistic novel views of the scene. While quantitatively
evaluating such results is more difficult (in contrast to our experiments where aligned real camera images
are available), we show qualitative results in Figure 8 and generate virtual tours based on the
synthesized novel views¹. Such novel view based virtual tours can make scene interpretation easier for an
attacker even when the images contain some artifacts.

5. Conclusion

In this paper, we introduced a new problem, that of inverting a sparse SfM point cloud and reconstructing
color images of the underlying scene. We demonstrated that surprisingly high quality images can be
reconstructed from the limited amount of information stored along with sparse 3D point cloud models. Our
work highlights the privacy and security risks associated with storing 3D point clouds and the necessity
for developing privacy preserving point cloud representations and camera localization techniques, where
the persistent scene model data cannot easily be inverted to reveal the appearance of the underlying
scene. This was also the primary goal in concurrent work on privacy preserving camera pose estimation [41]
which proposes a defense against the type of attacks investigated in our paper. Another interesting avenue
of future work would be to explore privacy preserving features for recovering correspondences between
images and 3D models.

¹ see the video in the supplementary material.
References

[1] 6D.AI. http://6d.ai/, 2018.
[2] ARCore. developers.google.com/ar/, 2018.
[3] ARKit. developer.apple.com/arkit/, 2018.
[4] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from RGB and sparse sensing. In ECCV, pages 167–182, 2018.
[5] T. Dekel, C. Gan, D. Krishnan, C. Liu, and W. T. Freeman. Smart, sparse contours to represent and edit images. In CVPR, 2018.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[9] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, pages 4829–4837, 2016.
[10] H. Edwards and A. Storkey. Censoring representations with an adversary. In ICLR, 2016.
[11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, pages 5515–5524, 2016.
[13] J. Hamm. Minimax filter: learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704–4734, 2017.
[14] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (SIGGRAPH Asia Conference Proceedings), 37(6), November 2018.
[15] Hololens. https://www.microsoft.com/en-us/hololens, 2016.
[16] J. Hong. Considering privacy issues in the context of google glass. Commun. ACM, 56(11):10–11, 2013.
[17] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In ECCV, 2016.
[18] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, pages 2599–2606, 2009.
[19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
[20] H. Kato and T. Harada. Image reconstruction from bag-of-visual-words. In CVPR, pages 955–962, 2014.
[21] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. In ICCV, pages 1–8, 2007.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[23] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose estimation using 3d point clouds. In ECCV, pages 15–29. Springer, 2012.
[24] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018.
[25] H. Lim, S. N. Sinha, M. F. Cohen, M. Uyttendaele, and H. J. Kim. Real-time monocular image-based 6-dof localization. The International Journal of Robotics Research, 34(4-5):476–492, 2015.
[26] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[27] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[28] J. Lu and D. Forsyth. Sparse depth super resolution. In CVPR, pages 2245–2253, 2015.
[29] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015.
[30] R. McPherson, R. Shokri, and V. Shmatikov. Defeating image obfuscation with deep learning. arXiv preprint arXiv:1609.00408, 2016.
[31] M. Moukari, S. Picard, L. Simoni, and F. Jurie. Deep multi-scale architectures for monocular depth estimation. In ICIP, pages 2940–2944, 2018.
[32] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[33] F. Pittaluga, S. Koppal, and A. Chakrabarti. Learning privacy preserving encodings through adversarial training. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 791–799. IEEE, 2019.
[34] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In CVPR, pages 8808–8816, 2018.
[35] N. Raval, A. Machanavajjhala, and L. P. Cox. Protecting visual secrets using adversarial nets. In CV-COPS 2017, CVPR Workshop, pages 1329–1332, 2017.
[36] G. Riegler, M. Rüther, and H. Bischof. ATGV-Net: Accurate depth super-resolution. In ECCV, pages 268–284, 2016.
[37] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In ICCV, pages 667–674. IEEE, 2011.
[38] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.
[39] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] P. Speciale, J. L. Schönberger, S. B. Kang, S. N. Sinha, and M. Pollefeys. Privacy preserving image-based localization. arXiv preprint arXiv:1903.05572, 2019.
[42] D. Sungatullina, E. Zakharov, D. Ulyanov, and V. Lempitsky. Image manipulation with perceptual discriminators. In ECCV, pages 579–595, 2018.
[43] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), pages 11–20, 2017.
[44] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760, 2016.
[45] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. Hoggles: Visualizing object detection features. In CVPR, pages 1–8, 2013.
[46] P. Weinzaepfel, H. Jégou, and P. Pérez. Reconstructing an image from its local descriptors. In CVPR, pages 337–344, 2011.
[47] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pages 1790–1798, 2014.
[48] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In ICML Workshop on Deep Learning, 2015.
[49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[50] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, pages 2223–2232, 2017.
Supplementary Material:
Revealing Scenes by Inverting Structure from Motion Reconstructions
Figure A1: EVALUATING ROBUSTNESS TO SPARSITY: Two sets of images synthesized using our complete pipeline, by
running VisibNet, CoarseNet and RefineNet. From left to right: (a) Simulated sparse inputs to our networks. Here,
only 20% of the 3D points in the respective SfM models were used. Images synthesized with our method from (b) 20% of
the points, (c) 60% of the points and (d) all the points; (e) the original source images. Even when the inputs are
extremely sparse, most of the contents of the synthesized images can be easily recognized.
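The sparse inputs in Figure A1 are simulated by keeping only a fraction of the SfM points. A minimal sketch of such a subsampling step is given below. It is an illustration under stated assumptions, not the authors' code: the helper name `subsample_points`, the fixed seed, and the choice of sampling an exact fraction without replacement are all hypothetical (the paper only speaks of random dropout, which could equally be an independent per-point coin flip).

```python
import random

def subsample_points(points, keep_fraction, seed=0):
    """Keep a random subset of the points (hypothetical helper, not from the paper).

    points        -- list of per-point records, e.g. (x, y, z) tuples
    keep_fraction -- fraction in [0, 1] of points to keep, e.g. 0.2, 0.6, 1.0
    seed          -- seed for a private random generator, for repeatability
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    keep_count = round(keep_fraction * len(points))
    # Sample indices without replacement, then restore the original order.
    kept_indices = sorted(rng.sample(range(len(points)), keep_count))
    return [points[i] for i in kept_indices]

# Toy point cloud and the three sparsity levels used in the figure.
cloud = [(float(i), 0.0, 1.0) for i in range(100)]
sparse_inputs = {f: subsample_points(cloud, f) for f in (0.2, 0.6, 1.0)}
```

Sampling an exact count keeps the three sparsity levels directly comparable; a Bernoulli dropout would give the same fraction only in expectation.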
Figure A2: FAILURE EXAMPLES: (a) Dense points on the building in the background overwhelm a few sparse points in the
foreground on the base of the statue. VisibNet in this case incorrectly predicts that the building is visible and this causes the
base of the statue to disappear completely in the synthesized image. (b) A similar artifact for a different scene. (c) Parallel
straight lines are sometimes poorly handled, such as the lines on the vertical pillars of the monument. (d) The complex
occlusions in the architectural structure produce artifacts where the occluded surfaces and the occluders are fused into each
other. (e) Straight lines are often reconstructed as curved or bent. (f–g) Low sample density in the input, common in indoor
scenes, results in blurry and wavy edges. (h) Finally, spurious 3D points may cause our method to hallucinate structures such
as the dark line on the wall which is not actually there.
Semantic Image Synthesis with Spatially-Adaptive Normalization
arXiv:1903.07291v2 [cs.CV] 5 Nov 2019
[Figure 1 image; label legend: cloud, sky, tree, mountain, sea, grass]
Figure 1: Our model allows user control over both semantics and style when synthesizing an image. The semantics (e.g., the
existence of a tree) are controlled via a label map (the top row), while the style is controlled via the reference style image (the
leftmost column). Please visit our website for interactive image synthesis demos.
Abstract https://github.com/NVlabs/SPADE.
suboptimal because their normalization layers tend to “wash away” information contained in the input
semantic masks. To address the issue, we propose spatially-adaptive normalization, a conditional
normalization layer that modulates the activations using input semantic layouts through a
spatially-adaptive, learned transformation and can effectively propagate the semantic information
throughout the network.

We conduct experiments on several challenging datasets including the COCO-Stuff [4, 32], the
ADE20K [58], and the Cityscapes [9]. We show that with the help of our spatially-adaptive
normalization layer, a compact network can synthesize significantly better results compared to several
state-of-the-art methods. Additionally, an extensive ablation study demonstrates the effectiveness of
the proposed normalization layer against several variants for the semantic image synthesis task.
Finally, our method supports multi-modal and style-guided image synthesis, enabling controllable,
diverse outputs, as shown in Figure 1. Also, please see our SIGGRAPH 2019 Real-Time Live demo and
try our online demo by yourself.

2. Related Work

Deep generative models can learn to synthesize images. Recent methods include generative adversarial
networks (GANs) [13] and variational autoencoder (VAE) [28]. Our work is built on GANs but aims for
the conditional image synthesis task. The GANs consist of a generator and a discriminator where the
goal of the generator is to produce realistic images so that the discriminator cannot tell the
synthesized images apart from the real ones.

Conditional image synthesis exists in many forms that differ in the type of input data. For example,
class-conditional models [3, 36, 37, 39, 41] learn to synthesize images given category labels.
Researchers have explored various models for generating images based on text [18, 44, 52, 55]. Another
widely-used form is image-to-image translation based on a type of conditional GANs
[20, 22, 24, 25, 33, 57, 59, 60], where both input and output are images. Compared to earlier
non-parametric methods [7, 16, 23], learning-based methods typically run faster during test time and
produce more realistic results. In this work, we focus on converting segmentation masks to
photorealistic images. We assume the training dataset contains registered segmentation masks and
images. With the proposed spatially-adaptive normalization, our compact network achieves better
results compared to leading methods.

Unconditional normalization layers have been an important component in modern deep networks and can
be found in various classifiers, including the Local Response Normalization in the AlexNet [29] and
the Batch Normalization (BatchNorm) in the Inception-v2 network [21]. Other popular normalization
layers include the Instance Normalization (InstanceNorm) [46], the Layer Normalization [2], the Group
Normalization [50], and the Weight Normalization [45]. We label these normalization layers as
unconditional as they do not depend on external data in contrast to the conditional normalization
layers discussed below.

Conditional normalization layers include the Conditional Batch Normalization (Conditional
BatchNorm) [11] and Adaptive Instance Normalization (AdaIN) [19]. Both were first used in the style
transfer task and later adopted in various vision tasks [3, 8, 10, 20, 26, 36, 39, 42, 49, 54].
Different from the earlier normalization techniques, conditional normalization layers require external
data and generally operate as follows. First, layer activations are normalized to zero mean and unit
deviation. Then the normalized activations are denormalized by modulating the activation using a
learned affine transformation whose parameters are inferred from external data. For style transfer
tasks [11, 19], the affine parameters are used to control the global style of the output, and hence
are uniform across spatial coordinates. In contrast, our proposed normalization layer applies a
spatially-varying affine transformation, making it suitable for image synthesis from semantic masks.
Wang et al. proposed a closely related method for image super-resolution [49]. Both methods are built
on spatially-adaptive modulation layers that condition on semantic inputs. While they aim to
incorporate semantic information into super-resolution, our goal is to design a generator for style
and semantics disentanglement. We focus on providing the semantic information in the context of
modulating normalized activations. We use semantic maps in different scales, which enables
coarse-to-fine generation. The reader is encouraged to review their work for more details.

3. Semantic Image Synthesis

Let $m \in \mathbb{L}^{H \times W}$ be a semantic segmentation mask where $\mathbb{L}$ is a set of
integers denoting the semantic labels, and $H$ and $W$ are the image height and width. Each entry in
$m$ denotes the semantic label of a pixel. We aim to learn a mapping function that can convert an
input segmentation mask $m$ to a photorealistic image.

Spatially-adaptive denormalization. Let $h^i$ denote the activations of the $i$-th layer of a deep
convolutional network for a batch of $N$ samples. Let $C^i$ be the number of channels in the layer.
Let $H^i$ and $W^i$ be the height and width of the activation map in the layer. We propose a new
conditional normalization method called the SPatially-Adaptive (DE)normalization¹ (SPADE). Similar to
the Batch Normalization [21], the activation is normalized in the channel-wise manner and then
modulated with learned scale and bias. Figure 2 illustrates the SPADE design. The activation

¹ Conditional normalization [11, 19] uses external data to denormalize the normalized activations;
i.e., the denormalization part is conditional.
[Figure 2 diagram; extracted labels: conv, conv, β, Batch Norm, element-wise]

Figure 2: In the SPADE, the mask is first projected onto an embedding space and then convolved to
produce the modulation parameters γ and β. Unlike prior conditional normalization methods, γ and β are
not vectors, but tensors with spatial dimensions. The produced γ and β are multiplied and added to the
normalized activation element-wise.

Figure 3: Comparing results given uniform segmentation maps: while the SPADE generator produces
plausible textures, the pix2pixHD generator [48] produces two identical outputs due to the loss of the
semantic information after the normalization layer.
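The modulation described in the Figure 2 caption, and the loss of label information on uniform masks that Figure 3 shows, can be illustrated with a small pure-Python sketch for a single channel of a single sample. This is a toy under stated assumptions rather than the authors' implementation: `normalize_channel`, `spade_channel`, the per-label lookup tables `gamma` and `beta` (standing in for the small convolutional network that predicts the modulation tensors), the toy 1x1 "convolution", and the stabilizer `eps` are all hypothetical.

```python
import math

def normalize_channel(feat, eps=1e-5):
    """Zero-mean, unit-variance normalization of one channel of one sample.

    `feat` is a 2D list (H x W). With a single sample this behaves like
    InstanceNorm. `eps` is an added stabilizer (an assumption, not part of the
    paper's equations) so that a constant input does not divide by zero.
    """
    vals = [v for row in feat for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    std = math.sqrt(var + eps)
    return [[(v - mean) / std for v in row] for row in feat]

def spade_channel(feat, mask, gamma, beta, eps=1e-5):
    """Per-pixel modulation gamma(m) * normalized(feat) + beta(m) for one channel.

    `gamma` and `beta` are hypothetical per-label lookup tables; in the paper
    they are spatial tensors produced by convolutions on the mask.
    """
    normed = normalize_channel(feat, eps)
    return [[gamma[m] * h + beta[m] for h, m in zip(h_row, m_row)]
            for h_row, m_row in zip(normed, mask)]

# Two uniform masks with different labels (say 0 = sky, 1 = grass).
sky = [[0, 0], [0, 0]]
grass = [[1, 1], [1, 1]]

# Convolution followed by normalization: a toy 1x1 "convolution" maps each
# label to a constant, and normalizing a constant map yields all zeros.
toy_conv = lambda mask: [[0.5 * m + 0.25 for m in row] for row in mask]
washed_sky = normalize_channel(toy_conv(sky))
washed_grass = normalize_channel(toy_conv(grass))  # identical to washed_sky

# SPADE normalizes the previous layer's activations and injects the mask
# afterwards, so the two labels still lead to different outputs.
activations = [[1.0, 2.0], [3.0, 4.0]]
gamma = {0: 1.0, 1: 2.0}
beta = {0: 0.0, 1: 0.5}
out_sky = spade_channel(activations, sky, gamma, beta)
out_grass = spade_channel(activations, grass, gamma, beta)
```

The variance here is computed as the mean squared deviation, which equals the mean of squares minus the squared mean. The point of the toy is only the ordering of operations: the mask is never itself normalized in the SPADE path.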
value at site ($n \in N, c \in C^i, y \in H^i, x \in W^i$) is

$$\gamma^i_{c,y,x}(m) \, \frac{h^i_{n,c,y,x} - \mu^i_c}{\sigma^i_c} + \beta^i_{c,y,x}(m) \tag{1}$$

where $h^i_{n,c,y,x}$ is the activation at the site before normalization and $\mu^i_c$ and
$\sigma^i_c$ are the mean and standard deviation of the activations in channel $c$:

$$\mu^i_c = \frac{1}{N H^i W^i} \sum_{n,y,x} h^i_{n,c,y,x} \tag{2}$$

$$\sigma^i_c = \sqrt{\frac{1}{N H^i W^i} \sum_{n,y,x} \left( (h^i_{n,c,y,x})^2 - (\mu^i_c)^2 \right)}. \tag{3}$$

The variables $\gamma^i_{c,y,x}(m)$ and $\beta^i_{c,y,x}(m)$ in (1) are the learned modulation
parameters of the normalization layer. In contrast to the BatchNorm [21], they depend on the input
segmentation mask and vary with respect to the location $(y, x)$. We use the symbols
$\gamma^i_{c,y,x}$ and $\beta^i_{c,y,x}$ to denote the functions that convert $m$ to the scaling and
bias values at the site $(c, y, x)$ in the $i$-th activation map. We implement the functions
$\gamma^i_{c,y,x}$ and $\beta^i_{c,y,x}$ using a simple two-layer convolutional network, whose design
is in the appendix.

In fact, SPADE is related to, and is a generalization of, several existing normalization layers.
First, replacing the segmentation mask $m$ with the image class label and making the modulation
parameters spatially-invariant (i.e., $\gamma^i_{c,y_1,x_1} \equiv \gamma^i_{c,y_2,x_2}$ and
$\beta^i_{c,y_1,x_1} \equiv \beta^i_{c,y_2,x_2}$ for any $y_1, y_2 \in \{1, 2, ..., H^i\}$ and
$x_1, x_2 \in \{1, 2, ..., W^i\}$), we arrive at the form of the Conditional BatchNorm [11]. Indeed,
for any spatially-invariant conditional data, our method reduces to the Conditional BatchNorm.
Similarly, we can arrive at the AdaIN [19] by replacing $m$ with a real image, making the modulation
parameters spatially-invariant, and setting $N = 1$. As the modulation parameters are adaptive to the
input segmentation mask, the proposed SPADE is better suited for semantic image synthesis.

SPADE generator. With the SPADE, there is no need to feed the segmentation map to the first layer of
the generator, since the learned modulation parameters have encoded enough information about the label
layout. Therefore, we discard the encoder part of the generator, which is commonly used in recent
architectures [22, 48]. This simplification results in a more lightweight network. Furthermore,
similarly to existing class-conditional generators [36, 39, 54], the new generator can take a random
vector as input, enabling a simple and natural way for multi-modal synthesis [20, 60].

Figure 4 illustrates our generator architecture, which employs several ResNet blocks [15] with
upsampling layers. The modulation parameters of all the normalization layers are learned using the
SPADE. Since each residual block operates at a different scale, we downsample the semantic mask to
match the spatial resolution.

We train the generator with the same multi-scale discriminator and loss function used in
pix2pixHD [48] except that we replace the least squared loss term [34] with the hinge loss
term [31, 38, 54]. We test several ResNet-based discriminators used in recent unconditional
GANs [1, 36, 39] but observe similar results at the cost of a higher GPU memory requirement. Adding
the SPADE to the discriminator also yields a similar performance. For the loss function, we observe
that removing any loss term in the pix2pixHD loss function leads to degraded generation results.

Why does the SPADE work better? A short answer is that it can better preserve semantic information
against common normalization layers. Specifically, while normalization layers such as the
InstanceNorm [46] are essential pieces in almost all the state-of-the-art conditional image synthesis
models [48], they tend to wash away semantic information when applied to uniform or flat segmentation
masks.

Let us consider a simple module that first applies convolution to a segmentation mask and then
normalization. Furthermore, let us assume that a segmentation mask with a single label is given as
input to the module (e.g., all the
[Figure 4 diagram; extracted labels: pix2pixHD, 3x3 Conv, SPADE, ReLU]
Figure 4: In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations.
(left) Structure of one residual block with the SPADE. (right) The generator contains a series of the SPADE residual blocks
with upsampling layers. Our architecture achieves better performance with a smaller number of parameters by removing the
downsampling layers of leading image-to-image translation networks such as the pix2pixHD model [48].
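Because every SPADE residual block in Figure 4 works at its own resolution, the segmentation mask has to be resized once per block. The sketch below shows one plausible way to do this for an integer label map. It assumes nearest-neighbour resampling, which the text does not specify; the assumption is motivated by the fact that averaging label indices would create meaningless in-between labels. The helper names `downsample_mask` and `mask_pyramid` are hypothetical.

```python
def downsample_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D label map (list of lists of ints).

    Each output pixel copies the label of the input pixel at the matching
    relative position, so no new label values are ever created.
    """
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[(y * in_h) // out_h][(x * in_w) // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

def mask_pyramid(mask, sizes):
    """Resize one mask to every (height, width) in `sizes`, coarse or fine."""
    return [downsample_mask(mask, h, w) for h, w in sizes]

# A 4x4 mask with four 2x2 regions labelled 0, 1, 2, 3.
full_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
# One mask per block resolution, from coarse to fine.
per_block_masks = mask_pyramid(full_mask, [(1, 1), (2, 2), (4, 4)])
```

In a tensor library the same effect is usually obtained with a nearest-mode resize of the one-hot or index map; the pure-Python version is only meant to make the indexing explicit.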
pixels have the same label such as sky or grass). Under this setting, the convolution outputs are
again uniform, with different labels having different uniform values. Now, after we apply
InstanceNorm to the output, the normalized activation will become all zeros no matter what input
semantic label is given. Therefore, semantic information is totally lost. This limitation applies to
a wide range of generator architectures, including pix2pixHD and its variant that concatenates the
semantic mask at all intermediate layers, as long as a network applies convolution and then
normalization to the semantic mask. In Figure 3, we empirically show this is precisely the case for
pix2pixHD. Because a segmentation mask consists of a few uniform regions in general, the issue of
information loss emerges when applying normalization.

In contrast, the segmentation mask in the SPADE Generator is fed through spatially adaptive
modulation without normalization. Only activations from the previous layer are normalized. Hence, the
SPADE generator can better preserve semantic information. It enjoys the benefit of normalization
without losing the semantic input information.

Multi-modal synthesis. By using a random vector as the input of the generator, our architecture
provides a simple way for multi-modal synthesis [20, 60]. Namely, one can attach an encoder that
processes a real image into a random vector, which is then fed to the generator. The encoder and
generator form a VAE [28], in which the encoder tries to capture the style of the image, while the
generator combines the encoded style and the segmentation mask information via the SPADEs to
reconstruct the original image. The encoder also serves as a style guidance network at test time to
capture the style of target images, as used in Figure 1. For training, we add a KL-Divergence loss
term [28].

4. Experiments

Implementation details. We apply the Spectral Norm [38] to all the layers in both generator and
discriminator. The learning rates for the generator and discriminator are 0.0001 and 0.0004,
respectively [17]. We use the ADAM solver [27] with β1 = 0 and β2 = 0.999. All the experiments are
conducted on an NVIDIA DGX1 with 8 32GB V100 GPUs. We use synchronized BatchNorm, i.e., these
statistics are collected from all the GPUs.

Datasets. We conduct experiments on several datasets.
• COCO-Stuff [4] is derived from the COCO dataset [32]. It has 118,000 training images and 5,000
  validation images captured from diverse scenes. It has 182 semantic classes. Due to its vast
  diversity, existing image synthesis models perform poorly on this dataset.
• ADE20K [58] consists of 20,210 training and 2,000 validation images. Similarly to the COCO, the
  dataset contains challenging scenes with 150 semantic classes.
• ADE20K-outdoor is a subset of the ADE20K dataset that only contains outdoor scenes, used in
  Qi et al. [43].
• Cityscapes dataset [9] contains street scene images in German cities. The training and validation
  set sizes are 3,000 and 500, respectively. Recent work has achieved photorealistic semantic image
  synthesis results [43, 47] on the Cityscapes dataset.
• Flickr Landscapes. We collect 41,000 photos from Flickr and use 1,000 samples for the validation
  set. To avoid expensive manual annotation, we use a well-trained DeepLabV2 [5] to compute input
  segmentation masks.
We train the competing semantic image synthesis methods on the same training set and report their
results on the same validation set for each dataset.

Performance metrics. We adopt the evaluation protocol from previous work [6, 48]. Specifically, we
run a semantic segmentation model on the synthesized images and compare how well the predicted
segmentation mask matches the ground truth input. Intuitively, if the output images are realistic, a
well-trained semantic segmentation model should be able to predict the ground truth label. For
measuring the segmentation accuracy, we use both the mean Intersection-
Label Ground Truth CRN [6] pix2pixHD [48] Ours
Figure 5: Visual comparison of semantic image synthesis results on the COCO-Stuff dataset. Our method successfully
synthesizes realistic details from semantic labels.
Label Ground Truth CRN [6] SIMS [43] pix2pixHD [48] Ours
Figure 6: Visual comparison of semantic image synthesis results on the ADE20K outdoor and Cityscapes datasets. Our
method produces realistic images while respecting the spatial semantic layout at the same time.
                  COCO-Stuff            ADE20K                ADE20K-outdoor        Cityscapes
Method            mIoU   accu   FID     mIoU   accu   FID     mIoU   accu   FID     mIoU   accu   FID
CRN [6]           23.7   40.4   70.4    22.4   68.8   73.3    16.5   68.6   99.0    52.4   77.1   104.7
SIMS [43]         N/A    N/A    N/A     N/A    N/A    N/A     13.1   74.7   67.7    47.2   75.5   49.7
pix2pixHD [48]    14.6   45.8   111.5   20.3   69.2   81.8    17.4   71.6   97.8    58.3   81.4   95.0
Ours              37.4   67.9   22.6    38.5   79.9   33.9    30.8   82.9   63.3    62.3   81.9   71.8
Table 1: Our method outperforms the current leading methods in semantic segmentation (mIoU and accu) and FID [17]
scores on all the benchmark datasets. For the mIoU and accu, higher is better. For the FID, lower is better.
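The two segmentation scores reported in Table 1 can be computed from a predicted label map and the ground-truth label map as sketched below. This is a minimal illustration under assumptions, not the evaluation code used for the table: the helper `segmentation_scores` is hypothetical, it scores a single image, and it skips classes that appear in neither map when averaging, which is one common convention among several.

```python
def segmentation_scores(pred, target, num_classes):
    """Return (mIoU, pixel accuracy) for two equally sized 2D label maps.

    pred, target -- lists of lists of integer class ids in [0, num_classes)
    Classes absent from both maps are left out of the mIoU average
    (an assumed convention; other protocols accumulate over a whole dataset).
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    correct = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            pred_count[p] += 1
            target_count[t] += 1
            if p == t:
                intersection[p] += 1
                correct += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)
    miou = sum(ious) / len(ious) if ious else 0.0
    accuracy = correct / total if total else 0.0
    return miou, accuracy

# Toy example: one of four pixels is mislabelled.
target_map = [[0, 0], [1, 1]]
pred_map = [[0, 1], [1, 1]]
miou, accu = segmentation_scores(pred_map, target_map, num_classes=2)
```

For the toy maps, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, while three of four pixels are correct.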
over-Union (mIoU) and the pixel accuracy (accu). We use the state-of-the-art segmentation networks
for each dataset: DeepLabV2 [5, 40] for COCO-Stuff, UperNet101 [51] for ADE20K, and DRN-D-105 [53]
for Cityscapes. In addition to the mIoU and the accu segmentation performance metrics, we use the
Fréchet Inception Distance (FID) [17] to measure the distance between the distribution of synthesized
results and the distribution of real images.

Baselines. We compare our method with 3 leading semantic image synthesis models: the pix2pixHD
model [48], the cascaded refinement network (CRN) [6], and the semi-parametric image synthesis method
(SIMS) [43]. The pix2pixHD is the current state-of-the-art GAN-based conditional image synthesis
framework. The CRN uses a deep network that repeatedly refines the output from low to high
resolution, while the SIMS takes a semi-parametric approach that composites real segments from a
training set and refines the boundaries. Both the CRN and SIMS are mainly trained using image
reconstruction loss. For a fair comparison, we train the CRN and pix2pixHD models using the
implementations provided by the authors. As image synthesis using the SIMS requires many queries to
the training
Figure 7: Semantic image synthesis results on the Flickr Landscapes dataset. The images were generated from semantic
layout of photographs on the Flickr website.
dataset, it is computationally prohibitive for a large dataset such as the COCO-Stuff and the full
ADE20K. Therefore, we use the results provided by the authors when available.

Quantitative comparisons. As shown in Table 1, our method outperforms the current state-of-the-art
methods by a large margin in all the datasets. For the COCO-Stuff, our method achieves an mIoU score
of 35.2, which is about 1.5 times better than the previous leading method. Our FID is also 2.2 times
better than the previous leading method. We note that the SIMS model produces a lower FID score but
has poor segmentation performance on the Cityscapes dataset. This is because the SIMS synthesizes an
image by first stitching image patches from the training dataset. Because it uses real image patches,
the resulting image distribution can better match the distribution of real images. However, because
there is no guarantee that a perfect query (e.g., a person in a particular pose) exists in the
dataset, it tends to copy objects that do not match the input segments.

Qualitative results. In Figures 5 and 6, we provide qualitative comparisons of the competing methods.
We find that our method produces results with much better visual quality and fewer visible artifacts,
especially for diverse scenes in the COCO-Stuff and ADE20K datasets. When the training dataset size is
small, the SIMS model also renders images with good visual quality. However, the depicted content
often deviates from the input segmentation mask (e.g., the shape of the swimming pool in the second
row of Figure 6).

Dataset           Ours vs. CRN   Ours vs. pix2pixHD   Ours vs. SIMS
COCO-Stuff        79.76          86.64                N/A
ADE20K            76.66          83.74                N/A
ADE20K-outdoor    66.04          79.34                85.70
Cityscapes        63.60          53.64                51.52
Table 2: User preference study. The numbers indicate the percentage of users who favor the results of
the proposed method over those of the competing method.

In Figures 7 and 8, we show more example results from the Flickr Landscape and COCO-Stuff datasets.
The proposed method can generate diverse scenes with high image fidelity. More results are included
in the appendix.

Human evaluation. We use the Amazon Mechanical Turk (AMT) to compare the perceived visual fidelity of
our method against existing approaches. Specifically, we give the AMT workers an input segmentation
mask and two synthesis outputs from different methods and ask them to choose the output image that
looks more like a corresponding image of the segmentation mask. The workers are given unlimited time
to make the selection. For each comparison, we randomly generate 500 questions for each dataset, and
each question is answered by 5 different workers. For quality control, only workers with a lifetime
task approval rate greater than 98% can participate in our study.

Table 2 shows the evaluation results. We find that users
Figure 8: Semantic image synthesis results on COCO-Stuff. Our method successfully generates realistic images in diverse
scenes ranging from animals to sports activities.
Method                      #param   COCO.   ADE.   City.
decoder w/ SPADE (Ours)     96M      35.2    38.5   62.3
compact decoder w/ SPADE    61M      35.2    38.0   62.5
decoder w/ Concat           79M      31.9    33.6   61.1
pix2pixHD++ w/ SPADE        237M     34.4    39.0   62.2
pix2pixHD++ w/ Concat       195M     32.9    38.9   57.1
pix2pixHD++                 183M     32.7    38.3   58.8
compact pix2pixHD++         103M     31.6    37.3   57.6
pix2pixHD [48]              183M     14.6    20.3   58.3
Table 3: The mIoU scores are boosted when the SPADE is used, for both the decoder architecture
(Figure 4) and encoder-decoder architecture of pix2pixHD++ (our improved baseline over
pix2pixHD [48]). On the other hand, simply concatenating semantic input at every layer fails to do so.
Moreover, our compact model with smaller depth at all layers outperforms all the baselines.

Method            COCO   ADE20K   Cityscapes
segmap input      35.2   38.5     62.3
random input      35.3   38.3     61.6
kernelsize 5x5    35.0   39.3     61.8
kernelsize 3x3    35.2   38.5     62.3
kernelsize 1x1    32.7   35.9     59.9
#params 141M      35.3   38.3     62.5
#params 96M       35.2   38.5     62.3
#params 61M       35.2   38.0     62.5
Sync BatchNorm    35.0   39.3     61.8
BatchNorm         33.7   37.9     61.8
InstanceNorm      33.9   37.4     58.7
Table 4: The SPADE generator works with different configurations. We change the input of the
generator, the convolutional kernel size acting on the segmentation map, the capacity of the network,
and the parameter-free normalization method. The settings used in the paper are boldfaced.
Figure 9: Our model attains multimodal synthesis capability when trained with the image encoder. During deployment,
by using different random noise, our model synthesizes outputs with diverse appearances but all having the same semantic
layouts depicted in the input mask. For reference, the ground truth image is shown inside the input segmentation mask.
Variations of SPADE generator. Table 4 reports the performance of several variations of our
generator. First, we compare two types of input to the generator where one is the random noise
while the other is the downsampled segmentation map. We find that both of the variants render
similar performance and conclude that the modulation by SPADE alone provides sufficient signal
about the input mask. Second, we vary the type of parameter-free normalization layers before
applying the modulation parameters. We observe that the SPADE works reliably across different
normalization methods. Next, we vary the convolutional kernel size acting on the label map, and
find that kernel size of 1x1 hurts performance, likely because it prohibits utilizing the context
of the label. Lastly, we modify the capacity of the generator by changing the number of
convolutional filters. We present more variations and ablations in the appendix.
Multi-modal synthesis. In Figure 9, we show the multimodal image synthesis results on the Flickr
Landscape dataset. For the same input segmentation mask, we sample different noise inputs to
achieve different outputs. More results are included in the appendix.
Semantic manipulation and guided image synthesis. In Figure 1, we show an application where a
user draws different segmentation masks, and our model renders the corresponding landscape images.
Moreover, our model allows users to choose an external style image to control the global
appearances of the output image. We achieve it by replacing the input noise with the embedding
vector of the style image computed by the image encoder.

5. Conclusion
We have proposed the spatially-adaptive normalization, which utilizes the input semantic layout
while performing the affine transformation in the normalization layers. The proposed normalization
leads to the first semantic image synthesis model that can produce photorealistic outputs for
diverse scenes including indoor, outdoor, landscape, and street scenes. We further demonstrate its
application for multi-modal synthesis and guided image synthesis.
Acknowledgments. We thank Alexei A. Efros, Bryan Catanzaro, Andrew Tao, and Jan Kautz for
insightful advice. We thank Chris Hebert, Gavriil Klimov, and Brad Nemire for their help in
constructing the demo apps. Taesung Park contributed to the work during his internship at NVIDIA.
His Ph.D. is supported by a Samsung Scholarship.
References
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017. 3
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 2
[3] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019. 1, 2
[4] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 4
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2018. 4, 5
[6] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 1, 4, 5, 13, 14, 15, 16, 17, 18
[7] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009. 1, 2
[8] T. Chen, M. Lucic, N. Houlsby, and S. Gelly. On self modulation for generative adversarial networks. In International Conference on Learning Representations, 2019. 2
[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 4
[10] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, 2017. 2
[11] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In International Conference on Learning Representations (ICLR), 2016. 2, 3
[12] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010. 12, 13
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014. 2
[14] J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM SIGGRAPH, 2007. 1
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 3
[16] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. 2001. 1, 2
[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017. 4, 5, 13
[18] S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
[19] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision (ICCV), 2017. 2, 3
[20] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. European Conference on Computer Vision (ECCV), 2018. 2, 3, 4
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015. 2, 3
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 3, 11, 12
[23] M. Johnson, G. J. Brostow, J. Shotton, O. Arandjelovic, V. Kwatra, and R. Cipolla. Semantic photo synthesis. In Computer Graphics Forum, volume 25, pages 407–413, 2006. 1, 2
[24] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 2
[25] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Manipulating attributes of natural scenes via hallucination. arXiv preprint arXiv:1808.07413, 2018. 2
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. 4
[28] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014. 2, 4, 11, 12
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012. 2
[30] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. In ACM transactions on graphics (TOG), volume 26, page 3. ACM, 2007. 1
[31] J. H. Lim and J. C. Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017. 3, 11
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 2, 4
[33] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017. 2
[34] X. Mao, Q. Li, H. Xie, Y. R. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 3, 11
[35] T. B. Mathias Eitz, Kristian Hildebrand and M. Alexa. Photosketch: A sketch based image query and compositing system. In ACM SIGGRAPH 2009 Talk Program, 2009. 1
[36] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), 2018. 2, 3, 11
[37] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
[38] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018. 3, 4, 11
[39] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning Representations (ICLR), 2018. 2, 3, 11
[40] K. Nakashima. Deeplab-pytorch. https://github.com/kazuto1011/deeplab-pytorch, 2018. 5
[41] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), 2017. 2
[42] E. Perez, H. De Vries, F. Strub, V. Dumoulin, and A. Courville. Learning visual reasoning without strong priors. In International Conference on Machine Learning (ICML), 2017. 2
[43] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 4, 5, 13, 17, 18
[44] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), 2016. 2
[45] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016. 2
[46] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. 2, 3
[47] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems, 2018. 1, 4
[48] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1, 3, 4, 5, 7, 11, 12, 13, 14, 15, 16, 17, 18
[49] X. Wang, K. Yu, C. Dong, and C. Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 606–615, 2018. 2
[50] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018. 2
[51] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In European Conference on Computer Vision (ECCV), 2018. 5
[52] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
[53] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5
[54] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning (ICML), 2019. 1, 2, 3, 11
[55] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 1, 2
[56] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018. 1
[57] B. Zhao, L. Meng, W. Yin, and L. Sigal. Image generation from layout. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
[58] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 4
[59] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017. 2
[60] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, 2017. 2, 3, 4
A. Additional Implementation Details
Generator. The architecture of the generator consists of a series of the proposed SPADE ResBlks
with nearest neighbor upsampling. We train our network using 8 GPUs simultaneously and use the
synchronized version of the BatchNorm. We apply the Spectral Norm [38] to all the convolutional
layers in the generator. The architectures of the proposed SPADE and SPADE ResBlk are given in
Figure 10 and Figure 11, respectively. The architecture of the generator is shown in Figure 12.
Discriminator. The architecture of the discriminator follows the one used in the pix2pixHD
method [48], which uses a multi-scale design with the InstanceNorm (IN). The only difference is
that we apply the Spectral Norm to all the

[Figure 12, generator layers as recovered from the figure]
Linear(256, 16384)
Reshape(1024, 4, 4)
SPADE ResBlk(1024), Upsample(2)
SPADE ResBlk(1024), Upsample(2)
SPADE ResBlk(1024), Upsample(2)
SPADE ResBlk(512), Upsample(2)
SPADE ResBlk(256), Upsample(2)
SPADE ResBlk(128), Upsample(2)
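The generator layers listed for Figure 12 can be checked with a short shape trace. The sketch below is only bookkeeping, not a network: it assumes each SPADE ResBlk(k) outputs k channels without changing the spatial size and that Upsample(2) doubles height and width. The function name is hypothetical, and only the layers that survive in this copy of the figure are traced.

```python
def trace_generator_shapes(latent_dim=256):
    """Return (layer description, output shape) pairs for the listed generator layers."""
    shapes = []

    # Linear(256, 16384): latent vector -> flat feature vector.
    flat = 16384
    shapes.append((f"Linear({latent_dim}, {flat})", (flat,)))

    # Reshape(1024, 4, 4): the flat vector must hold exactly 1024 * 4 * 4 values.
    channels, height, width = 1024, 4, 4
    assert channels * height * width == flat
    shapes.append(("Reshape(1024, 4, 4)", (channels, height, width)))

    # Assumption: a ResBlk changes only the channel count; Upsample(2) doubles H and W.
    for out_channels in (1024, 1024, 1024, 512, 256, 128):
        channels = out_channels
        height, width = height * 2, width * 2
        shapes.append(
            (f"SPADE ResBlk({out_channels}), Upsample(2)", (channels, height, width))
        )
    return shapes


for name, shape in trace_generator_shapes():
    print(f"{name:34s} -> {shape}")
```

Six upsampling steps take the 4 × 4 seed to 256 × 256, which matches the 256 × 256 image size stated in the training details.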
and the output image from the generator as input and aims to classify that as fake.
Training details. We perform 200 epochs of training on the Cityscapes and ADE20K datasets, 100
epochs of training on the COCO-Stuff dataset, and 50 epochs of training on the Flickr Landscapes
dataset. The image sizes are 256 × 256, except the Cityscapes at 512 × 256. We linearly decay the
learning rate to 0 from epoch 100 to 200 for the Cityscapes and ADE20K datasets. The batch size
is 32. We initialize the network weights using the Glorot initialization [12].

[Figure 13, discriminator layers as recovered from the figure]
Concat
4x4-↓2-Conv-64, LReLU
4x4-↓2-Conv-128, IN, LReLU
4x4-↓2-Conv-256, IN, LReLU
4x4-Conv-512, IN, LReLU
4x4-Conv-1
Figure 13: Our discriminator design largely follows that in the pix2pixHD [48]. It takes the
concatenation of the segmentation map and the image as input. It is based on the PatchGAN [22].
Hence, the last layer of the discriminator is a convolutional layer.
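To illustrate why a convolutional last layer yields a map of per-patch scores rather than a single number, the sketch below traces spatial sizes through the layers listed for Figure 13 using the standard convolution output-size formula. The kernel sizes and strides are read from the figure (with "↓2" taken as stride 2); the padding is not stated in this section, so it is left as a parameter, and the default of 2 is an assumption. The function names are hypothetical.

```python
def conv_out(size, kernel=4, stride=1, padding=2):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# (output channels, stride) for each convolution listed in Figure 13.
LAYERS = [(64, 2), (128, 2), (256, 2), (512, 1), (1, 1)]


def trace_discriminator(size, padding=2):
    """Return (channels, spatial size) after each listed layer for a square input."""
    trace = []
    for channels, stride in LAYERS:
        size = conv_out(size, kernel=4, stride=stride, padding=padding)
        trace.append((channels, size))
    return trace


for channels, size in trace_discriminator(256):
    print(f"{channels:4d} channels, {size} x {size}")
```

Under either padding choice tried here (1 or 2), the final layer produces a one-channel map with tens of entries per side for a 256 × 256 input, so each output value judges a local patch of the input.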
B. Additional Ablation Study

Method                                                   COCO   ADE20K   Cityscapes
Ours                                                     35.2   38.5     62.3
Ours w/o Perceptual loss                                 24.7   30.1     57.4
Ours w/o GAN feature matching loss                       33.2   38.0     62.2
Ours w/ a deeper discriminator                           34.9   38.3     60.9
pix2pixHD++ w/ SPADE                                     34.4   39.0     62.2
pix2pixHD++                                              32.7   38.3     58.8
pix2pixHD++ w/o Sync BatchNorm                           27.4   31.8     51.1
pix2pixHD++ w/o Sync BatchNorm, and w/o Spectral Norm    26.0   31.9     52.3
pix2pixHD [48]                                           14.6   20.3     58.3
Table 5: Additional ablation study results using the mIoU metric: the table shows that both the
perceptual loss and GAN feature matching loss terms are important. Making the discriminator deeper
does not lead to a performance boost. The table also shows that all the components (Synchronized
BatchNorm, Spectral Norm, TTUR, the Hinge loss objective, and the SPADE) used in the proposed
method help our strong baseline, pix2pixHD++.

Table 5 provides additional ablation study results analyzing the contribution of individual
components in the proposed method. We first find that both the perceptual loss and the GAN feature
matching loss inherited from the learning objective function of the pix2pixHD [48] are important.
Removing either of them leads to a performance drop. We also find that increasing the depth of the
discriminator by inserting one more convolutional layer to the top of the pix2pixHD discriminator
does not improve the results.
In Table 5, we also analyze the effectiveness of each component used in our strong baseline, the
pix2pixHD++ method, derived from the pix2pixHD method. We found that the Spectral Norm,
synchronized BatchNorm, TTUR [17], and the hinge loss objective all contribute to the performance
boost. Adding the SPADE to the strong baseline further improves the performance. Note that the
pix2pixHD++ w/o Sync BatchNorm and w/o Spectral Norm still differs from the pix2pixHD in that it
uses the hinge loss objective, TTUR, a large batch size, and the Glorot initialization [12].

C. Additional Results
In Figures 16, 17, and 18, we show additional synthesis results from the proposed method on the
COCO-Stuff and ADE20K datasets with comparisons to those from the CRN [6] and pix2pixHD [48]
methods.
In Figures 19 and 20, we show additional synthesis results from the proposed method on the
ADE20K-outdoor and Cityscapes datasets with comparison to those from the CRN [6], SIMS [43], and
pix2pixHD [48] methods.
In Figure 21, we show additional multi-modal synthesis results from the proposed method. By
sampling different z from a standard multivariate Gaussian distribution, we synthesize images of
diverse appearances.
In the accompanying video, we demonstrate our semantic image synthesis interface. We show how a
user can create photorealistic landscape images by painting semantic labels on a canvas. We also
show how a user can synthesize images of diverse appearances for the same semantic segmentation
mask as well as transfer the appearance of a provided style image to the synthesized one.
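The size of each effect in Table 5 is easier to read as a difference against a reference row. The snippet below is a small arithmetic aid that copies the mIoU values from Table 5 and subtracts rows; the dictionary and function names are hypothetical, and the results are rounded to one decimal place to match the precision of the table.

```python
# mIoU values copied from Table 5, in the order (COCO-Stuff, ADE20K, Cityscapes).
TABLE5 = {
    "Ours": (35.2, 38.5, 62.3),
    "Ours w/o Perceptual loss": (24.7, 30.1, 57.4),
    "Ours w/o GAN feature matching loss": (33.2, 38.0, 62.2),
    "Ours w/ a deeper discriminator": (34.9, 38.3, 60.9),
    "pix2pixHD++ w/ SPADE": (34.4, 39.0, 62.2),
    "pix2pixHD++": (32.7, 38.3, 58.8),
}


def miou_difference(reference, variant):
    """Per-dataset mIoU of `reference` minus `variant`, rounded to one decimal."""
    return tuple(
        round(r - v, 1) for r, v in zip(TABLE5[reference], TABLE5[variant])
    )


for variant in ("Ours w/o Perceptual loss",
                "Ours w/o GAN feature matching loss",
                "Ours w/ a deeper discriminator"):
    print(variant, miou_difference("Ours", variant))

# Gain from adding the SPADE to the strong baseline.
print("SPADE gain over pix2pixHD++",
      miou_difference("pix2pixHD++ w/ SPADE", "pix2pixHD++"))
```

Read this way, removing the perceptual loss costs far more mIoU than removing the GAN feature matching loss on all three datasets, although both removals lower the score.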
[Figure columns: Label, Ground Truth, CRN, pix2pixHD, Ours]
Figure 16: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the COCO-Stuff
dataset.
[Figure columns: Label, Ground Truth, CRN, pix2pixHD, Ours]
Figure 17: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the COCO-Stuff
dataset.
[Figure columns: Label, Ground Truth, CRN, pix2pixHD, Ours]
Figure 18: Additional results with comparison to those from the CRN [6] and pix2pixHD [48] methods on the ADE20K
dataset.
[Figure columns: Label, Ground Truth, CRN, SIMS, pix2pixHD, Ours]
Figure 19: Additional results with comparison to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods on the
ADE20K-outdoor dataset.
[Figure columns as recovered: Label, Ground Truth, Ours]
Figure 20: Additional results with comparison to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods on the
Cityscapes dataset.
[Figure columns: Label, Ground Truth, Multi-modal results]
Figure 21: Additional multi-modal synthesis results on the Flickr Landscapes Dataset. By sampling latent vectors from a
standard Gaussian distribution, we synthesize images of diverse appearances.