A Paper

TASK 2V EC : Task Embedding for Meta-Learning
Alessandro Achille Michael Lam Rahul Tewari Avinash Ravichandran

UCLA and AWS AWS AWS AWS
achille@cs.ucla.edu michlam@amazon.com tewarir@amazon.com ravinash@amazon.com
Subhransu Maji Charless Fowlkes Stefano Soatto Pietro Perona

UMass and AWS UCI and AWS UCLA and AWS Caltech and AWS
arXiv:1902.03545v1 [cs.LG] 10 Feb 2019
smmaji@amazon.com fowlkec@amazon.com soattos@amazon.com peronapp@amazon.com
Abstract semantic similarities between tasks (Fig. 1). When other

natural distances are available, such as the taxonomical dis-
We introduce a method to provide vectorial represen- tance in biological classification, we find that the embed-
tations of visual classification tasks which can be used ding distance correlates positively with it (Fig. 2). More-
to reason about the nature of those tasks and their re- over, we introduce an asymmetric distance on tasks which
lations. Given a dataset with ground-truth labels and a correlates with the transferability between tasks.
loss function defined over those labels, we process images Computation of the embedding leverages a duality be-
through a “probe network” and compute an embedding tween network parameters (weights) and outputs (activa-
based on estimates of the Fisher information matrix asso- tions) in a deep neural network (DNN): Just as the activa-
ciated with the probe network parameters. This provides a tions of a DNN trained on a complex visual recognition task
fixed-dimensional embedding of the task that is independent are a rich representation of the input images, we show that
of details such as the number of classes and does not require the gradients of the weights relative to a task-specific loss
any understanding of the class label semantics. We demon- are a rich representation of the task itself. Specifically, given
strate that this embedding is capable of predicting task sim- a task defined by a dataset D = {(xi , yi )}N i=1 of labeled
ilarities that match our intuition about semantic and tax- samples, we feed the data through a pre-trained reference
onomic relations between different visual tasks (e.g., tasks convolutional neural network which we call “probe net-
based on classifying different types of plants are similar). work”, and compute the diagonal Fisher Information Ma-
We also demonstrate the practical value of this framework trix (FIM) of the network filter parameters to capture the
for the meta-task of selecting a pre-trained feature extractor structure of the task (Sect. 2). Since the architecture and
for a new task. We present a simple meta-learning frame- weights of the probe network are fixed, the FIM provides a
work for learning a metric on embeddings that is capable of fixed-dimensional representation of the task. We show this
predicting which feature extractors will perform well. Se- embedding encodes the “difficulty” of the task, character-
lecting a feature extractor with task embedding obtains a istics of the input domain, and which features of the probe
performance close to the best available feature extractor, network are useful to solve it (Sect. 2.1).
while costing substantially less than exhaustively training
and evaluating on all available feature extractors. Our task embedding can be used to reason about the
space of tasks and solve meta-tasks. As a motivating exam-
ple, we study the problem of selecting the best pre-trained
1. Introduction feature extractor to solve a new task. This can be particu-
larly valuable when there is insufficient data to train or fine-
The success of Deep Learning hinges in part on the fact tune a generic model, and transfer of knowledge is essen-
that models learned for one task can be used on other related tial. TASK 2 VEC depends solely on the task, and ignores
tasks. Yet, no general framework exists to describe and interactions with the model which may however play an
learn relations between tasks. We introduce the TASK 2 VEC important role. To address this, we learn a joint task and
embedding, a technique to represent tasks as elements of a model embedding, called MODEL 2 VEC, in such a way that
vector space based on the Fisher Information Matrix. The models whose embeddings are close to a task exhibit good
norm of the embedding correlates with the complexity of perfmormance on the task. We use this to select an expert
the task, while the distance between embeddings captures from a given collection, improving performance relative to
1
Actinopterygii (n) Insecta (n) Reptilia (n) Neckline (m)
Amphibia (n) Mammalia (n) Category (m) Pants (m)
Arachnida (n) Mollusca (n) Color (m) Pattern (m)
Aves (n) Plantae (n) Gender (m) Shoes (m)
Fungi (n) Protozoa (n) Material (m)
Task Embeddings Domain Embeddings
Figure 1: Task embedding across a large library of tasks (best seen magnified). (Left) T-SNE visualization of the embed-
ding of tasks extracted from the iNaturalist, CUB-200, iMaterialist datasets. Colors indicate ground-truth grouping of tasks
based on taxonomic or semantic types. Notice that the bird classification tasks extracted from CUB-200 embed near the bird
classification task from iNaturalist, even though the original datasets are different. iMaterialist is well separated from iNat-
uralist, as it entails very different tasks (clothing attributes). Notice that some tasks of similar type (such as color attributes)
cluster together but attributes of different task types may also mix when the underlying visual semantics are correlated. For
example, the tasks of jeans (clothing type), denim (material) and ripped (style) recognition are close in the task embedding.
(Right) T-SNE visualization of the domain embeddings (using mean feature activations) for the same tasks. Domain em-
bedding can distinguish iNaturalist tasks from iMaterialist tasks due to differences in the two problem domains. However,
the fashion attribute tasks on iMaterialist all share the same domain and only differ in their labels. In this case, the domain
embeddings collapse to a region without recovering any sensible structure.
fine-tuning a generic model trained on ImageNet and ob- original output distribution pw (y|x) and the perturbed one
taining close to ground-truth optimal selection. We discuss pw0 (y|x). To second-order approximation, this is
our contribution in relation to prior literature in Sect. 6, after
presenting our empirical results in Sect. 5. Ex∼p̂ KL(pw0 (y|x) k pw (y|x)) = δw · F δw + o(δw2 ),
where F is the Fisher information matrix (FIM):

2. Task Embeddings via Fisher Information
F = Ex,y∼p̂(x)pw (y|x) ∇w log pw (y|x)∇w log pw (y|x)T .

Given an observed input x (e.g., an image) and an hid-
den task variable y (e.g., a label), a deep network is a that is, the expected covariance of the scores (gradients of
family of functions pw (y|x) parametrized by weights w, the log-likelihood) with respect to the model parameters.
trained to approximate the posterior p(y|x) by minimizing The FIM is a Riemannian metric on the space of proba-
the (possibly regularized) cross entropy loss Hpw ,p̂ (y|x) = bility distributions [7], and provides a measure of the infor-
Ex,y∼p̂ [− log pw (y|x)], where p̂ is the empirical distribu- mation a particular parameter (weight or feature) contains
tion defined by the training set D = {(xi , yi )}N i=1 . It is about the joint distribution pw (x, y) = p̂(x)pw (y|x): If the
useful, especially in transfer learning, to think of the net- classification performance for a given task does not depend
work as composed of two parts: a feature extractor which strongly a parameter, the corresponding entry in the FIM
computes some representation z = φw (x) of the input data, will be small. The FIM is also related to the (Kolmogorov)
and a “head,” or classifier, which encodes the distribution complexity of a task, a property that can be used to de-
p(y|z) given the representation z. fine a computable metric of the learning distance between
Not all network weights are equally useful in predicting tasks [3]. Finally, the FIM can be interpreted as an easy-to-
the task variable: the importance, or “informative content,” compute positive semidefinite upper-bound to the Hessian
of a weight for the task can be quantified by considering a of the cross-entropy loss, and coincides with it at local min-
perturbation w0 = w + δw of the weights, and measuring ima [24]. In particular, “flat minima” correspond to weights
the average Kullbach-Leibler (KL) divergence between the that have, on average, low (Fisher) information [5, 13].
2.1. TASK 2 VEC embedding using a probe network advantage of being easy to compute by directly minimizing
the loss L(ŵ; Σ) through Stochastic Gradient Variational
While the network activations capture the information in
Bayes [18], while being less sensitive to irregularities of
the input image which are needed to infer the image label,
the loss landscape than direct computation, since the value
the FIM indicates the set of feature maps which are more
of the loss depends on the cross-entropy in a neighborhood
informative for solving the current task. Following this in-
of ŵ of size Λ−1 . As in the standard Fisher computation,
tuition, we use the FIM to represent the task itself. How-
we estimate one parameter per filter, rather than per weight,
ever, the FIMs computed on different networks are not di-
which in practice means that we constrain Λii = Λjj when-
rectly comparable. To address this, we use single “probe”
ever wi and wj belongs to the same filter. In this case, opti-
network pre-trained on ImageNet as a feature extractor and
mization of L(ŵ; Λ) can be done efficiently using the local
re-train only the classifier layer on any given task, which
reparametrization trick of [18].
usually can be done efficiently. After training is complete,
we compute the FIM for the feature extractor parameters. 2.2. Properties of the TASK 2 VEC embedding
Since the full FIM is unmanageably large for rich probe
networks based on CNNs, we make two additional approxi- The task embedding we just defined has a number of
mations. First, we only consider the diagonal entries, which useful properties. For illustrative purposes, consider a two-
implicitly assumes that correlations between different filters layer sigmoidal network for which an analytic expression
in the probe network are not important. Second, since the can be derived (see Supplementary Materials). The FIM
weights in each filter are usually not independent, we aver- of the feature extractor parameters can be written using the
age the Fisher Information for all weights in the same filter. Kronecker product as
The resulting representation thus has fixed size, equal to the F = Ex,y∼p̂(x)pw (y|x) [(y − p)2 · S ⊗ xxT ]
number of filters in the probe network. We call this embed-
ding method TASK 2 VEC. where p = pw (y = 1|x) and the matrix S = wwT zz T
(1 − z)(1 − z)T is an element-wise product of classifier
Robust Fisher computation Since the FIM is a local
weights w and first layer feature activations z. It is informa-
quantity, it is affected by the local geometry of the training
tive to compare this expression to an embedding based only
loss landscape, which is highly irregular in many deep net-
on the dataset domain statistics, such as the (non-centered)
work architectures [21], and may be too noisy when trained
covariance C0 = E xxT of the input data or the covari-
with few samples. To avoid this problem, instead of a direct
computation, we use a more robust estimator that leverages ance C1 = E zz T of the feature activations. One could
connections to variational inference. Assume we perturb take such statistics as a representative domain embedding
the weights ŵ of the network with Gaussian noise N (0, Λ) since they only depend on the marginal distribution p(x) in
with precision matrix Λ, and we want to find the optimal Λ contrast to the FIM task embedding, which depends on the
which yields a good expected error, while remaining close joint distribution p(x, y). These simple expressions high-
to an isotropic prior N (ŵ, λ2 I). That is, we want to find Λ light some important (and more general) properties of the
that minimizes: Fisher embedding we now describe.
Invariance to the label space: The task embedding does
L(ŵ; Λ) = Ew∼N (ŵ,Λ) [Hpw ,p̂ p(y|x)] not directly depend on the task labels, but only on the pre-
dicted distribution pw (y|x) of the trained model. Infor-
+ β KL(N (0, Λ) k N (0, λ2 I)), mation about the ground-truth labels y is encoded in the
weights w which are a sufficient statistic of the task [5]. In
where H is the cross-entropy loss and β controls the weight
particular, the task embedding is invariant to permutations
of the prior. Notice that for β = 1 this reduces to the Evi-
of the labels y, and has fixed dimension (number of filters
dence Lower-Bound (ELBO) commonly used in variational
of the feature extractor) regardless of the output space (e.g.,
inference. Approximating to the second order, the optimal
k-way classification with varying k).
value of Λ satisfies (see Supplementary Material):
Encoding task difficulty: As we can see from the ex-
β βλ2 pressions above, if the fit model is very confident in its pre-
Λ=F + I. dictions, E[(y − p)2 ] goes to zero. Hence, the norm of the
2N 2N
task embedding kF k? scales with the difficulty of the task
β
Therefore, 2N Λ ∼ F +o(1) can be considered as an estima- for a given feature extractor φ. Figure 2 (Right) shows that
tor of the FIM F , biased towards the prior λ2 I in the low- even for more complex models trained on real data, the FIM
data regime instead of being degenerate. In case the task is norm correlates with test performance.
trivial (the loss is constant or there are too few samples) the Encoding task domain: Data points x that are classi-
embedding will coincide with the prior λ2 I, which we will fied with high confidence, i.e., p is close to 0 or 1, will
refer to as the trivial embedding. This estimator has the have a lower contribution to the task embedding than points
3.0
60%
2.5
Avg. top-k tax. distance
Test error on task (%)

50%
40%
2.0
30%
1.5 20%
Task2Vec distance 10%

1.0 Tax. distance 0%
0 25 50 75 100 125 0.4 0.6 0.8
Size k of neighborhood L1 norm of task embedding 1e8
Figure 2: Distance between species classification tasks. (Left) Task similarity matrix ordered by hierarchical clustering.
Note that the dendrogram produced by the task similarity matches the taxonomic clusters (indicated by color bar). (Center)
For tasks extracted from iNaturalist and CUB, we compare the cosine distance between tasks to their taxonomical distance.
As the size of the task embedding neighborhood increases (measured by number of tasks in the neighborhood), we plot the
average taxonomical distance of tasks from the neighborhood center. While the task distance does not perfectly match the
taxonomical distance (whose curve is shown in orange), it shows a good correlation. Difference are both due to the fact that
taxonomically close species may need very different features to be classified, creating a mismatch between the two notions
of distance, and because for some tasks in iNaturalist too few samples are provided to compute a good embedding. (Right)
Correlation between L1 norm of the task embedding (distance from origin) and test error obtained on the task.
near the decision boundary since p(1 − p) is maximized at sification of cats than it is to classification of species of
p = 1/2. Compare this to the covariance matrix of the data, plants). In this setting, we can define
C0 , to which all data points contribute equally. Instead, in
TASK 2 VEC information on the domain is based on data near
Dtax (ta , tb ) = min d(i, j),
i∈Sa ,j∈Sb
the decision boundary (task-weighted domain embedding).
Encoding useful features for the task: The FIM de- where Sa , Sb are the sets of categories in task ta , tb and
pends on the curvature of the loss function with the diagonal d(i, j) is an ultrametric or graph distance in the taxonomy
entries capturing the sensitivity of the loss to model param- tree. Notice that this is a proper distance, and in particular
eters. Specifically, in the two-layer model one can see that, it is symmetric.
if a given feature is uncorrelated with y, the correspond- Transfer distance. We define the transfer (or fine-tuning)
ing blocks of F are zero. In contrast, a domain embedding gain from a task ta to a task tb (which we improperly call
based on feature activations of the probe network (e.g., C1 ) distance, but is not necessarily symmetric or positive) as
only reflects which features vary over the dataset without the difference in expected performance between a model
indication of whether they are relevant to the task. trained for task tb from a fixed initialization (random or pre-
trained), and the performance of a model fine-tuned for task
3. Similarity Measures on the Space of Tasks tb starting from a solution of task ta :
What metric should be used on the space of tasks? This E[`a→b ] − E[`b ]
Dft (ta → tb ) = ,
depends critically on the meta-task we are considering. As a E[`b ]
motivation, we concentrate on the meta-task of selecting the
where the expectations are taken over all trainings with the
pre-trained feature extractor from a set in order to obtain the
selected architecture, training procedure and network ini-
best performance on a new training task. There are several
tialization, `b is the final test error obtained by training on
natural metrics that may be considered for this meta-task.
task b from the chosen initialization, and `a→b is the error
In this work, we mainly consider:
obtained instead when starting from a solution to task a and
Taxonomic distance For some tasks, there is a natural no- then fine-tuning (with the selected procedure) on task tb .
tion of semantic similarity, for instance defined by sets of
3.1. Symmetric and asymmetric TASK 2 VEC metrics
categories organized in a taxonomic hierarchy where each
task is classification inside a subtree of the hierarchy (e.g., By construction, the Fisher embedding on which
we may say that classifying breeds of dogs is closer to clas- TASK 2 VEC is based captures fundamental information
about the structure of the task. We may therefore expect models trained on the same task with different performance
that the distance between two embeddings correlate posi- characteristics. To model the joint interaction between task
tively with natural metrics on the space of tasks. However, and model (i.e., architecture and training algorithm), we aim
there are two problems in using the Euclidean distance be- to learn a joint embedding of the two.
tween embeddings: the parameters of the network have dif- We consider for concreteness the problem of learning
ferent scales, and the norm of the embedding is affected by a joint embedding for model selection. In order to em-
complexity of the task and the number of samples used to bed models in the task space so that those near a task
compute the embedding. are likely to perform well on that task, we formulate the
following meta-learning problem: Given k models, their
Symmetric TASK 2 VEC distance To make the distance
MODEL 2 VEC embedding are the vectors mi = Fi + bi ,
computation robust, we propose to use the cosine distance
where Fi is the task embedding of the task used to train
between normalized embeddings:
model mi (if available, else we set it to zero), and bi is a
F Fb
a learned “model bias” that perturbs the task embedding to
dsym (Fa , Fb ) = dcos , ,
Fa + Fb Fa + Fb account for particularities of the model. We learn bi by opti-
mizing a k-way cross entropy loss to predict the best model
where dcos is the cosine distance, Fa and Fb are the two
given the task distance (see Supplementary Material):
task embeddings (i.e., the diagonal of the Fisher Informa-
tion computed on the same probe network), and the division L = E[− log p(m | dasym (t, m0 ), . . . , dasym (t, mk ))].
is element-wise. This is a symmetric distance which we ex-
pect to capture semantic similarity between two tasks. For After training, given a novel query task t, we can then pre-
example, we show in Fig. 2 that it correlates well with the dict the best model for it as the arg maxi dasym (t, mi ), that
taxonomical distance between species on iNaturalist. is, the model mi embedded closest to the query task.
On the other hand, precisely for this reason, this distance
is ill-suited for tasks such as model selection, where the (in- 5. Experiments
trinsically asymmetric) transfer distance is more relevant. We test TASK 2 VEC on a large collection of tasks and
Asymmetric TASK 2 VEC distance In a first approxima- models, related to different degrees. Our experiments aim to
tion, that does not consider either the model or the training test both qualitative properties of the embedding and its per-
procedure used, positive transfer between two tasks depends formance on meta-learning tasks. We use an off-the-shelf
both on the similarity between two tasks and on the com- ResNet-34 pretrained on ImageNet as our probe network,
plexity of the first. Indeed, pre-training on a general but which we found to give the best overall performance (see
complex task such as ImageNet often yields a better result Sect. 5.2). The collection of tasks is generated starting
than fine-tuning from a close dataset of comparable com- from the following four main datasets. iNaturalist [36]:
plexity. In our case, complexity can be measured as the dis- Each task extracted corresponds to species classification in
tance from the trivial embedding. This suggests the follow- a given taxonomical order. For instance, the “Rodentia
ing asymmetric score, again improperly called a “distance” task” is to classify species of rodents. Notice that each
despite being asymmetric and possibly negative: task is defined on a separate subset of the images in the
original dataset; that is, the domains of the tasks are dis-
dasym (ta → tb ) = dsym (ta , tb ) − αdsym (ta , t0 ), joint. CUB-200 [37]: We use the same procedure as iNat-
uralist to create tasks. In this case, all tasks are classifica-
where t0 is the trivial embedding, and α is an hyperparam-
tions inside orders of birds (the aves taxonomical class), and
eter. This has the effect of bring more complex models
have generally much less training samples than correspond-
closer. The hyper-parameter α can be selected based on
ing tasks in iNaturalist. iMaterialist [1] and DeepFashion
the meta-task. In our experiments, we found that the best
[23]: Each image in both datasets is associated with sev-
value of α (α = 0.15 when using a ResNet-34 pretrained
eral binary attributes (e.g., style attributes) and categorical
on ImageNet as the probe network) is robust to the choice
attributes (e.g., color, type of dress, material). We binarize
of meta-tasks.
the categorical attributes, and consider each attribute as a
4. MODEL 2 VEC: task/model co-embedding separate task. Notice that, in this case, all tasks share the
same domain and are naturally correlated.
By construction, the TASK 2 VEC distance ignores details In total, our collection of tasks has 1460 tasks (207
of the model and only relies on the task. If we know what iNaturalist, 25 CUB, 228 iMaterialist, 1000 DeepFashion).
task a model was trained on, we can represent the model by While a few tasks have many training examples (e.g., hun-
the embedding of that task. However, in general we may dred thousands), most have just hundreds or thousands of
not have such information (e.g., black-box models or hand- samples. This simulates the heavy-tail distribution of data
constructed feature extractors). We may also have multiple in real-world applications.
iNat+CUB error distribution and expert selection
Selected expert
ImageNet expert
80%
60%
Test Error
40%
20%
0%
[CUCUB] raup dae
B] La idae
B] B] erid e
[C An Mim ae
[CU [CUUB] serifo idae
[CUroce ulgif lidae
[CU] Pod oracliformae
B] UB] ran idae

ara eo ae
[iN [iNat] Sa iformtia
a i o s
[iN ] Co iformes
A Er les
[CU ] Ca ] Ca ingil mes
a e
[iNNat] ragal s
[iN t] As abales
[iN [i at] Oteral s
at] as Ro ta
o s
[iN at] P Nat] donaes

[iN Cary erifo ales
Le ami ta
op s
a
B] irun orm s
[CU UB Cu di es
B] icip iifo es
[CU leca iformes
[CU] Pic form s
[CU [CUB] Coormes
B] B] rvi s
[CU rog cter dae
[CU [C B] Tyodyt dae
[CU[CUB driifo idae
[iN [i lecanerizidae
at] ] E aru es
at] Nat ifo ae

[iN olumRode es
at] t] pin es
at] nu orm s
[iN Ans ncul es
[iN at] Cleop es

C iN ni a
[iN hara at] Avora
Ge ifo ra
[iN [iNa ntia rmes
[iN Per hylla es

at] cifo les
[iN [iNat qua mes
[iN Acc Picif dale
[i spa icale
a F e
pid ale
[CUB] A lariif rme
if e
[CU B] Hpodif rme
B ni e
[iN t] Rapitrif rme

[CU [CU Pass niida
ter
B] prim rdin lida
at] [ ar ter
at] ] L ma
at] dri nu
at er al
Pe ed rm
[iN B ] P rm
C ] rm
at] op rm
B ] C cu nid
Ch Vir nid
Pe mb lid
a b n
at] t] na
[ Th illi
l i
s s
B B Fr r
S r
n
l o
B] byc
I
[CUBom
T
B]
P
[CU
[ C
Figure 3: TASK 2 VEC often selects the best available experts. Violin plot of the distribution of the final test error (shaded
plot) on tasks from the CUB-200 dataset (columns) obtained by training a linear classifier over several expert feature extrac-
tors (points). Most specialized feature extractors perform similarly on a given task, and generally are similar or worse than a
generic feature extractor pre-trained on ImageNet (blue triangles). However, in some cases a carefully chosen expert, trained
on a relevant task, can greatly outperform all other experts (long whisker of the violin plot). The model selection algorithm
based on TASK 2 VEC can, without training, suggest an expert to use for the task (red cross, lower is better). TASK 2 VEC mostly
recover the optimal, or close to optimal, feature extractor to use without having to perform an expensive brute-force search
over all possibilities. Columns are ordered by norm of the task embedding: Notice tasks with lower embedding norm have
lower error and more “complex” task (task with higher embedding norm) tend to benefit more from a specialized expert.
Together with the collection of tasks, we collect several 5.1. Task Embedding Results
“expert” feature extractors. These are ResNet-34 models
Task Embedding qualitatively reflects taxonomic dis-
pre-trained on ImageNet and then fine-tuned on a specific
tance for iNaturalist For tasks extracted from the iNat-
task or collection of related tasks (see Supplementary Ma-
uralist dataset (classification of species), the taxonomical
terials for details). We also consider a “generic”expert pre-
distance between orders provides a natural metric of the se-
trained on ImageNet without any finetuning. Finally, for
mantic similarity between tasks. In Figure 2 we compare
each combination of expert feature extractor and task, we
the symmetric TASK 2 VEC distance with the taxonomical
trained a linear classifier on top of the expert in order to
distance, showing strong agreement.
solve the selected task using the expert.
In total, we trained 4,100 classifiers, 156 feature extrac- Task embedding for iMaterialist In Fig. 1 we show a
tors and 1,460 embeddings. The total effort to generate the t-SNE visualization of the embedding for iMaterialist and
final results was about 1,300 GPU hours. iNaturalist tasks. Task embedding yields interpretable re-
sults: Tasks that are correlated in the dataset, such as binary
classes corresponding to the same categorical attribute, may
end up far away from each other and close to other tasks that
are semantically more similar (e.g., the jeans category task
Meta-tasks. In Sect. 5.2, for a given task we aim to pre- is close to the ripped attribute and the denim material). This
dict, using TASK 2 VEC , which expert feature extractor will is reflected in the mixture of colors of semantically related
yield the best classification performance. In particular, we nearby tasks, showing non-trivial grouping.
formulate two model selection meta-tasks: iNat + CUB and We also compare the TASK 2 VEC embedding with a do-
Mixed. The first consists of 50 tasks and experts from iNat- main embedding baseline, which only exploits the input
uralist and CUB, and aims to test fine-grained expert selec- distribution p(x) rather than the task distribution p(x, y).
tion in a restricted domain. The second contains a mix of While some tasks are highly correlated with their domain
26 curated experts and 50 random tasks extracted from all (e.g., tasks from iNaturalist), other tasks differ only on the
datasets, and aims to test model selection between different labels (e.g., all the attribute tasks of iMaterialist, which
domains and tasks (see Supplementary Material for details). share the same clothes domain). Accordingly, the domain
Brute force fixed ImageNet finetune select the feature extractor trained on the most similar task,
ImageNet fixed Task2Vec finetune and (2) jointly embed the models and tasks, and select a
Task2Vec fixed
Error relative to brute force
model using the learned metric (see Section 4). Notice that
10% (1) does not use knowledge of the model performance on
(lower is better)
various tasks, which makes it more widely applicable but

0% requires we know what task a model was trained for and
may ignore the fact that models trained on slightly differ-
-10% ent tasks may still provide an overall better feature extrac-
tor (for example by over-fitting less to the task they were
102 103 104 trained on).
Number of samples In Table 2 we compare the overall results of the various
proposed metrics on the model selection meta-tasks. On
Figure 4: TASK 2 VEC improves results at different both the iNat+CUB and Mixed meta-tasks, the Asymmetric
dataset sizes and training conditions: Performance of TASK 2 VEC model selection is close to the ground-truth op-
model selection on a subset of 4 tasks as a function of timal, and significantly improves over both chance, and over
the number of samples available to train relative to opti- using an generic ImageNet expert. Notice that our method
mal model selection (dashed orange). Training a classifier has O(1) complexity, while searching over a collection of
on the feature extractor selected by TASK 2 VEC (solid red) is N experts is O(N ).
always better than using a generic ImageNet feature extrac-
tor (dashed red). The same holds when allowed to fine-tune Error distribution In Fig. 3 we show in detail the error
the feature extractor (blue curves). Also notice that in the distribution of the experts on multiple tasks. It is interesting
low-data regime fine-tuning the ImageNet feature extractor to notice that the classification error obtained using most ex-
is more expensive and has a worse performance than accu- perts clusters around some mean value, and little improve-
rately selecting a good fixed feature extractor. ment is observed over using a generic expert. On the other
hand, a few optimal experts can obtain a largely better per-
Probe network Top-10 All formance on the task than a generic expert. This confirms
Chance +13.95% +59.52% the importance of having access to a large collection of ex-
VGG-13 +4.82% +38.03% perts when solving a new task, especially if few training
DenseNet-121 +0.30% +10.63% data are available. But this collection can only be efficiently
ResNet-13 +0.00% +9.97% exploited if an algorithm is given to efficiently find one of
the few experts for the task, which we propose.
Table 1: Choice of probe network. Mean relative error
increase over the ground-truth optimum on the iNat+CUB Dependence on task dataset size Finding experts is es-
meta-task for different choices of the probe-network. We pecially important when the task we are interested in has
also report the performance on the top 10 tasks with more relatively few samples. In Fig. 4 we show how the perfor-
samples to show how data size affect different architectures. mance of TASK 2 VEC varies on a model selection task as the
number of samples varies. At all sample sizes TASK 2 VEC is
close to the optimum, and improves over selecting a generic
embedding recovers similar clusters on iNaturalist. How- expert (ImageNet), both when fine-tuning and when train-
ever, on iMaterialst domain embedding collapses all tasks ing only a classifier. We observe that the best choice of ex-
to a single uninformative cluster (not a single point due to perts is not affected by the dataset size, and that even with
slight noise in embedding computation). few examples TASK 2 VEC is able to find the optimal experts.
Task Embedding encodes task difficulty The scatter-
Choice of probe network In Table 1 we show that
plot in Fig. 3 compares the norm of embedding vectors vs.
DenseNet [15] and ResNet architectures [11] perform sig-
performance of the best expert (or task specific model for
nificantly better when used as probe networks to compute
cases where we have the diagonal computed). As shown
the TASK 2 VEC embedding than a VGG [32] architecture.
analytically for the two-layers model, the norm of the task
embedding correlates with the complexity of the task also
on real tasks and architectures.
6. Related Work
Task and Domain embedding. Tasks distinguished by
5.2. Model Selection
their domain can be understood simply in terms of image
Given a task, our aim is to select an expert feature extrac- statistics. Due to the bias of different datasets, sometimes a
tor that maximizes the classification performance on that benchmark task may be identified just by looking at a few
task. We propose two strategies: (1) embed the task and images [34]. The question of determining what summary
Meta-task Optimal Chance ImageNet TASK 2 VEC Asymmetric TASK 2 VEC MODEL 2 VEC
iNat + CUB 31.24 +59.52% +30.18% +42.54% +9.97% +6.81%
Mixed 22.90 +112.49% +75.73% +40.30% +29.23% +27.81%
Table 2: Model selection performance of different metrics. Average optimal error obtained on two meta-learning tasks
by exhaustive search over the best expert, and relative error increase when using cheaper model selection methods. Always
picking a fixed good general model (e.g., a model pretrained on ImageNet) performs better than picking an expert at random
(chance). However, picking an expert using the Asymmetric TASK 2 VEC distance can achieve an overall better performance
than using a general model. Notice also the improvement over the Symmetric version, especially on iNat + CUB, where
experts trained on very similar tasks may be too simple to yield good transfer, and should be avoided.
statistics are useful (analogous to our choice of probe net- neural network as a characterization of the task. Use of
work) has also been considered, for example [9] train an Fisher information for neural networks was popularized by
autoencoder that learns to extract fixed dimensional sum- Amari [6] who advocated optimization using natural gra-
mary statistics that can reproduce many different datasets dient descent which leverages the fact that the FIM is an
accurately. However, for general vision tasks which apply appropriate parameterization-independent metric on statis-
to all natural images, the domain is the same across tasks. tical models. Recent work has focused on approximates of
Taskonomy [39] explores the structure of the space of FIM appropriate in this setting (see e.g., [12, 10, 25]). FIM
tasks, focusing on the question of effective knowledge has also been proposed for various regularization schemes
transfer in a curated collection of 26 visual tasks, ranging [5, 8, 22, 27], analyze learning dynamics of deep networks
from classification to 3D reconstruction, defined on a com- [4], and to overcome catastrophic forgetting [19].
mon domain. They compute pairwise transfer distances be-
Meta-learning and Model Selection The general prob-
tween pairs of tasks and use the results to compute a di-
lem of meta-learning has a long history with much re-
rected hierarchy. Introducing novel tasks requires comput-
cent work dedicated to problems such as neural architecture
ing the pairwise distance with tasks in the library. In con-
search and hyper-parameter estimation. Closely related to
trast, we focus on a larger library of 1,460 fine-grained clas-
our problem is work on selecting from a library of classi-
sification tasks both on same and different domains, and
fiers to solve a new task [33, 2, 20]. Unlike our approach,
show that it is possible to represent tasks in a topological
these usually address the question via land-marking or ac-
space with a constant-time embedding. The large task col-
tive testing, in which a few different models are evaluated
lection and cheap embedding costs allow us to tackle new
and performance of the remainder estimated by extension.
meta-learning problems.
This can be viewed as a problem of completing a matrix
Fisher kernels Our work takes inspiration from Jaakkola defined by performance of each model on each task.
and Hausler [16]. They propose the “Fisher Kernel”, which A similar approach has been taken in computer vision for
uses the gradients of a generative model score function as a selecting a detector for a new category out of a large library
representation of similarity between data items of detectors [26, 40, 38].
K(x(1) , x(2) ) = ∇θ log P (x(1) |θ)T F −1 ∇θ log P (x(2) |θ).
7. Discussion
Here P (x|θ) is a parameterized generative model and F is
TASK 2 VEC is an efficient way to represent a task, or the
the Fisher information matrix. This provides a way to utilize
corresponding dataset, as a fixed dimensional vector. It has
generative models in the context of discriminative learning.
several appealing properties, in particular its norm corre-
Variants of the Fisher kernel have found wide use as a repre-
lates with the test error obtained on the task, and the co-
sentation of images [28, 29], and other structured data such
sine distance between embeddings correlates with natural
as protein molecules [17] and text [30]. Since the genera-
distances between tasks, when available, such as the taxo-
tive model can be learned on unlabelled data, several works
nomic distance for species classification, and the fine-tuning
have investigated the use of Fisher kernel for unsupervised
distance for transfer learning. Having a representation of
learning [14, 31]. [35] learns a metric on the Fisher kernel
tasks paves the way for a wide variety of meta-learning
representation similar to our metric learning approach. Our
tasks. In this work, we focused on selection of an expert
approach differs in that we use the FIM as a representation
feature extractor in order to solve a new task, especially
of a whole dataset (task) rather than using model gradients
when little training data is present, and showed that using
as representations of individual data items.
TASK 2 VEC to select an expert from a collection can sen-
Fisher Information for CNNs Our approach to task em- sibly improve test performance while adding only a small
bedding makes use of the Fisher Information matrix of a overhead to the training process.
Meta-learning on the space of tasks is an important step [14] A. D. Holub, M. Welling, and P. Perona. Combining gener-
toward general artificial intelligence. In this work, we in- ative models and fisher kernels for object recognition. In
troduce a way of dealing with thousands of tasks, enough to IEEE International Conference on Computer Vision, vol-
enable reconstruct a topology on the task space, and to test ume 1, pages 136–143. IEEE, 2005. 8
meta-learning solutions. The current experiments highlight [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger.
the usefulness of our methods. Even so, our collection does Densely connected convolutional networks. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
not capture the full complexity and variety of tasks that one
Recognition, 2017. 7
may encounter in real-world situations. Future work should
[16] T. Jaakkola and D. Haussler. Exploiting generative models in
further test effectiveness, robustness, and limitations of the
discriminative classifiers. In Advances in neural information
embedding on larger and more diverse collections. processing systems, pages 487–493, 1999. 8
[17] T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the
References fisher kernel method to detect remote protein homologies. In
ISMB, volume 99, pages 149–158, 1999. 8
[1] iMaterialist Challenge (Fashion) at FGVC5 workshop,
[18] D. P. Kingma, T. Salimans, and M. Welling. Variational
CVPR 2018. https://www.kaggle.com/c/
dropout and the local reparameterization trick. In Advances
imaterialist-challenge-fashion-2018.
in Neural Information Processing Systems, pages 2575–
5
2583, 2015. 3, 13
[2] S. M. Abdulrahman, P. Brazdil, J. N. van Rijn, and J. Van-
[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Des-
schoren. Speeding up algorithm selection using average
jardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho,
ranking and active testing by introducing runtime. Machine
A. Grabska-Barwinska, et al. Overcoming catastrophic for-
learning, 107(1):79–108, 2018. 8
getting in neural networks. Proceedings of the national
[3] A. Achille, G. Mbeng, G. Paolini, and S. Soatto. The dy- academy of sciences, page 201611835, 2017. 8
namic distance between learning tasks: From Kolmogorov
[20] R. Leite, P. Brazdil, and J. Vanschoren. Selecting classifi-
complexity to transfer learning via quantum physics and
cation algorithms with active testing. In International work-
the information bottleneck of the weights of deep networks.
shop on machine learning and data mining in pattern recog-
Proc. of the NIPS Workshop on Integration of Deep Learning
nition, pages 117–131. Springer, 2012. 8
Theories (ArXiv: 1810.02440), October 2018. 2
[21] H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the loss
[4] A. Achille, M. Rovere, and S. Soatto. Critical learning pe-
landscape of neural nets. arXiv preprint arXiv:1712.09913,
riods in deep neural networks. Proc. of the Intl. Conf. on
2017. 3
Learning Representations (ICLR). ArXiv:1711.08856, 2019.
[22] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes. Fisher-rao
8
metric, geometry, and complexity of neural networks. arXiv
[5] A. Achille and S. Soatto. Emergence of invariance and dis-
preprint arXiv:1711.01530, 2017. 8
entanglement in deep representations. Journal of Machine
[23] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfash-
Learning Research (ArXiv 1706.01350), 19(50):1–34, 2018.
ion: Powering robust clothes recognition and retrieval with
2, 3, 8
rich annotations. In Proceedings of the IEEE conference on
[6] S.-I. Amari. Natural gradient works efficiently in learning. computer vision and pattern recognition, pages 1096–1104,
Neural computation, 10(2):251–276, 1998. 8 2016. 5
[7] S.-I. Amari and H. Nagaoka. Methods of information geome- [24] J. Martens. New perspectives on the natural gradient method.
try, volume 191 of translations of mathematical monographs. CoRR, abs/1412.1193, 2014. 2, 12
American Mathematical Society, 13, 2000. 2
[25] J. Martens and R. Grosse. Optimizing neural networks with
[8] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger gen- kronecker-factored approximate curvature. In International
eralization bounds for deep nets via a compression approach. conference on machine learning, pages 2408–2417, 2015. 8
arXiv preprint arXiv:1802.05296, 2018. 8 [26] P. Matikainen, R. Sukthankar, and M. Hebert. Model rec-
[9] H. Edwards and A. Storkey. Towards a neural statistician. ommendation for action recognition. In Computer Vision
arXiv preprint arXiv:1606.02185, 2016. 8 and Pattern Recognition (CVPR), 2012 IEEE Conference on,
[10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta- pages 2256–2263. IEEE, 2012. 8
learning for fast adaptation of deep networks. arXiv preprint [27] Y. Mroueh and T. Sercu. Fisher gan. In Advances in Neural
arXiv:1703.03400, 2017. 8 Information Processing Systems, pages 2513–2523, 2017. 8
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- [28] F. Perronnin, J. Sánchez, and T. Mensink. Improving
ing for image recognition. In Proceedings of the IEEE con- the fisher kernel for large-scale image classification. In
ference on computer vision and pattern recognition, pages European conference on computer vision, pages 143–156.
770–778, 2016. 7 Springer, 2010. 8
[12] T. Heskes. On natural learning and pruning in multilayered [29] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Im-
perceptrons. Neural Computation, 12(4):881–901, 2000. 8 age classification with the fisher vector: Theory and practice.
[13] S. Hochreiter and J. Schmidhuber. Flat minima. Neural International journal of computer vision, 105(3):222–245,
Computation, 9(1):1–42, 1997. 2 2013. 8
[30] C. Saunders, A. Vinokourov, and J. S. Shawe-taylor. String
kernels, fisher kernels and finite state automata. In Advances
in Neural Information Processing Systems, pages 649–656,
2003. 8
[31] M. Seeger. Learning with labeled and unlabeled data. Tech-
nical Report EPFL-REPORT-161327, Institute for Adaptive
and Neural Computation, University of Edinburgh, 2000. 8
[32] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014. 7
[33] M. R. Smith, L. Mitchell, C. Giraud-Carrier, and T. Martinez.
Recommending learning algorithms and their associated hy-
perparameters. arXiv preprint arXiv:1407.1890, 2014. 8
[34] A. Torralba and A. A. Efros. Unbiased look at dataset bias.
In Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on, pages 1521–1528. IEEE, 2011. 8
[35] L. Van Der Maaten. Learning discriminative fisher kernels.
In ICML, volume 11, pages 217–224, 2011. 8
[36] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun,
A. Shepard, H. Adam, P. Perona, and S. Belongie. The inatu-
ralist species classification and detection dataset. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, 2018. 5
[37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie.
The Caltech-UCSD Birds-200-2011 Dataset. Technical Re-
port CNS-TR-2011-001, California Institute of Technology,
2011. 5
[38] Y.-X. Wang and M. Hebert. Model recommendation: Gen-
erating object detectors from few samples. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1619–1628, 2015. 8
[39] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and
S. Savarese. Taskonomy: Disentangling task transfer learn-
ing. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3712–3722, 2018. 8
[40] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh.
Predicting failures of vision systems. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 3566–3573, 2014. 8
A. Analytic FIM for two-layer model
Assume we have data points (xi , yi ), i = 1 . . . n and yi ∈ {0, 1}. Assume that a fixed feature extractor applied to data
points x yields features z = φ(x) ∈ Rd and a linear model with parameters w is trained to model the conditional distribution
pi = P (yi = 1|xi ) = σ wT φ(xi ) , where σ is the sigmoid function. The gradient of the cross-entropy loss with respect to
the linear model parameters is:
∂` 1 X
= (yi − pi )φ(xi ),
∂w N i
and the empirical estimate of the Fisher information matrix is:

h ∂` ∂` T i 1 X
F =E = Ey∼pw (y|x) φ(xi )(yi − pi )2 φ(xi )T
∂w ∂w N i
1X
= φ(xi )(1 − pi )pi φ(xi )T
n i
In general, we are also interested in the Fisher information of the parameters of the feature extractor φ(x) since this is
independent of the specifics of the output space y (e.g., for k-way classification). Consider a 2-layer network where the
feature extractor uses a sigmoid non-linearity:
p = σ(wT z) zk = σ(UkT x)
and the matrix U specifies the feature extractor parameters and w are parameters of the task-specific classifier. Taking the
gradient w.r.t. parameters we have:
∂`
= (y − p)zj
∂wj
∂`
= (y − p)wk zk (1 − zk )xj
∂Ukj
The Fisher Information Matrix (FIM) consists of blocks:

T
∂` ∂`
= (y − p)2 zi zj
∂wi ∂wj
T
∂` ∂`
= (y − p)2 zj zk (1 − zk )xi
∂Uki ∂wj
T
∂` ∂`
= (y − p)2 wk zk (1 − zk )wl zl (1 − zl )xi xj
∂Uli ∂Ukj
We focus on the FIM of the probe network parameters which is independent of the dimensionality of the output layer and
write it in matrix form as: T
∂` ∂`
= (y − p)2 (1 − zk )zk (1 − zl )zl wk wl xxT
∂Ul ∂Uk
Note that each block {l, k} consists of the same matrix (y − p)2 · xxT multiplied by a scalar Skl given as:
Skl = (1 − zk )zk (1 − zl )zl wk wl
We can thus write the whole FIM as the expectation of a Kronecker product:
F = E[(y − p)2 · S ⊗ xxT ]
where the matrix S can be written as

S = wwT zz T (1 − z)(1 − z)T
(a) Random linear + ReLU (b) Polynomial of degree three
Figure 5: Task embeddings computed for a probe network consisting of (a) 10 random linear + ReLU features and (b)
degree three polynomial features projected to 2D using t-SNE. The tasks are random binary partitions of the unit square
visualized in each icon (three tasks are visualized on the left) and cannot be distinguished based purely on the input domain
without considering target labels. Note that qualitatively similar tasks group together, with more complex tasks (requiring
complicated decision boundaries) separated from simpler tasks.
Given a task described by N training samples {(xe , ye )}, the FIM can be estimated empirically as
1 X
F = pe (1 − pe ) · Se ⊗ xe xTe
N e
Se = wwT ze zeT (1 − ze )(1 − ze )T
where we take expectation over y w.r.t. the predictive distribution y ∼ pw (y|x).

Example toy task embedding As noted in the main text, the FIM depends on the domain embedding, the particular task
and its complexity. We illustrate these properties of the task embedding using an “toy” task space illustrated in Figure 5. We
generate 64 binary classification tasks by clustering a uniform grid of points in the XY plane into k ∈ [3, 16] clusters using
k-means and assigning a half of them to one category. We consider two different feature extractors, which play the role of
“probe network”. One is a collection of polynomial functions of degree d = 3, the second is 10 random linear features of the
form max(0, ax + by + c) where a and b are sampled uniformly between [−1/2, 1/2] and c between [−1, 1].
B. Robust Fisher Computation

Consider again the loss function (parametrized with the covariance matrix Σ instead of the precision matrix Λ for conve-
nience of notation):
L(ŵ; Σ) = Ew∼N (ŵ,Σ) [Hpw ,p̂ (y|x)] + β KL(N (ŵ, Σ) k N (0, σ 2 I)).
We will make use of the fact that the Fisher Information matrix is a positive semidefinite approximation of the Hessian H of
the cross-entropy loss, and coincide with it in local minima [24]. Expanding to the second order around ŵ, we have:
1
L(ŵ; Σ) =Ew∼N (ŵ,Σ) [Hpŵ ,p̂ (y|x) + ∇w Hpŵ ,p̂ (y|x)(w − ŵ) + (w − ŵ)T H(w − ŵ)] + β KL(N (ŵ, Σ) k N (0, σ 2 I))
2
1
=Hpŵ ,p̂ (y|x) + tr(ΣH) + β KL(N (ŵ, Σ) k N (0, σ 2 I))
2
1 β ŵ2 1
=Hpŵ ,p̂ (y|x) + tr(ΣH) + [ 2 + 2 trΣ + k log σ 2 − log(|Σ|) − k]
2 2 σ σ
where in the last line used the known expression for the KL divergence of two Gaussian. Taking the
derivative with respect
−1 2 β
to Σ and setting it to zero, we obtain that the expression loss is minimized when Σ = β H + 2σ2 I , or, rewritten in term
of the precision matrices, when
2 βλ2
Λ= H+ I ,
β 2
where we have introduced the precision matrices Λ = Σ−1 and λ2 I = 1/σ 2 I.
We can then obtain an estimate of the Hessian H of the cross-entropy loss at the point ŵ, and hence of the FIM, by
minimizing the loss L(ŵ, Λ) with respect to Λ. This is a more robust approximation than the standard definition, as it
depends on the loss in a whole neighborhood of ŵ of size ∝ Λ, rather than from the derivatives of the loss at a point.
To further make the estimation more robust, and to reduce the number of parameters, we constrain Λ to be diagonal, and
constrain weights wij belonging to the same filter to have the same precision Λij . Optimization of this loss can be performed
easily using Stochastic Gradient Variational Bayes, and in particular using the local reparametrization trick of [18].
The prior precision λ2 should be picked according to the scale of the weights of each layer. In practice, since the weights
of each layer have a different scale, we found it useful to select a different λ2 for each layer, and train it together with Λ,
C. Details of the experiments

C.1. Training of experts and classifiers
Given a task, we train an expert on it by fine-tuning an off-the-shelf ResNet-34 pretrained on ImageNet1 . Fine-tuning is
performed by first fixing the weights of the network and retraining from scratch only the final classifier for 10 epochs using
Adam, and then fine-tuning all the network together with SGD for 60 epochs with weight decay 5e-4, starting from learning
rate 0.001 and decreasing it by a factor 0.1 at epochs 40.
Given an expert, we train a classifier on top of it by replacing the final classification layer and training it with Adam for
16 epochs. We use weight decay 5e-4 and learning rate 1e-4.
The tasks we train on generally have different number of samples and unbalanced classes. To limit the impact of this
imbalance on the training procedure, regardless of the total size of the dataset, in each epoch we always sample 10,000
images with replacement, uniformly between classes. In this way, all epochs have the same length and see approximately the
same number of examples for each class. We use this balanced sampling in all experiments, unless noted otherwise.
C.2. Computation of the TASK 2 VEC embedding

As the described in the main text, the TASK 2 VEC embedding is obtained by choosing a probe network, retraining the final
classifier on the given task, and then computing the Fisher Information Matrix for the weights of the probe network.
Unless specified otherwise, we use an off-the-shelf ResNet-34 pretrained on ImageNet as the probe network. The Fisher
Information Matrix is computed in a robust way minimizing the loss function L(ŵ; Λ) with respect to the precision matrix Λ,
as described before. To make computation of the embedding faster, instead of waiting for the convergence of the classifier,
we train the final classifier for 2 epochs using Adam and then we continue to train it jointly with the precision matrix Λ using
the loss L(ŵ; Λ). We constrain Λ to be positive by parametrizing it as Λ = exp(L), for some unconstrained variable L.
While for the classifier we use a low learning rate (1e-4), we found it useful to use an higher learning rate (1e-2) to train L.
C.3. Training the MODEL 2 VECembedding

As described in the main text, in the MODEL 2 VECembedding we aim to learn a vector representation mj = Fj + bj of
the j-th model in the collection, which represents both the task the model was trained on (through the TASK 2 VEC embedding
Fj ), and the particularities of the model (through the learned parameter bj ).
We learn bj by minimizing a k-way classification loss which, given a task t, aims to select the model that performs best
on the task among a collection of k models. Multiple models may perform similarly and close to optimal: to preserve this
information, instead of using a one-hot encoding for the best model, we train using soft-labels obtained as follows:
errori − mean(errori )
p̂(yi ) = Softmax − α ,
std(errori )
where errori,j is the ground-truth test error obtained by training a classifier for task i on top of the j-th model. Notice that
for α 1, the soft-label yij reduces to the one-hot encoding of the index of the best performing model. However, for lower
α’s, the vector yi contains richer information about the relative performance of the models.
1 https://pytorch.org/docs/stable/torchvision/models.html
We obtain our prediction in a similar way: Let di,j = dasym (ti , mj ), then we set our model prediction to be
p(y|di,0 , . . . , di,k ) = Softmax(−γ di ),
where the scalar γ > 0 is a learned parameter. Finally, we learn both the mj ’s and γ using a cross-entropy loss:
N
1 X
L= Ey ∼p̂ [p(yi |di,0 , . . . , di,k )],
N i=0 i
which is minimized precisely when p(y|di,0 , . . . , di,k ) = p̂(yi ).

In our experiments we set α = 20, and minimize the loss using Adam with learning rate 0.05, weight decay 0.0005, and
early stopping after 81 epochs, and report the leave-one-out error (that is, for each task we train using the ground truth of all
other tasks and test on that task alone, and report the average of the test errors obtained in this way).
D. Datasets, tasks and meta-tasks

Our two model selection meta-tasks, iNat+CUB and Mixed, are curated as follows. For iNat+CUB, we generated 50
tasks and (the same) experts from iNaturalist and CUB. The 50 tasks consist of 25 iNaturalist tasks and 25 CUB tasks to
provide a balanced mix from two datasets of the same domain. We generated the 25 iNaturalist tasks by grouping species
into orders and then choosing the top 25 orders with the most samples. The number of samples for tasks shows the heavy-tail
distribution typical of real data, with the top task having 64,100 samples (the Passeriformes order classification task), while
most tasks have around 6, 000 samples.
The 25 CUB tasks were similarly generated with 10 order tasks but additionally has 15 Passeriformes family tasks: After
grouping CUB into orders, we determined 11 usable order tasks (the only unusable order task, Gaviiformes, has only one
species so it makes no sense to train on it). However, one of the orders—Passeriformes—dominated all other orders with 134
species when compared to 3-24 species of other orders. Therefore, we decided to further subdivide the Passeriformes order
task into family tasks (i.e., grouping species into families) to provide a more balanced partition. This resulted in 15 usable
family tasks (i.e., has more than one species) out of 22 family tasks. Unlike iNaturalist, tasks from CUB have only a few
hundreds of samples and hence benefit more from carefully selecting an expert.
In the iNAT+CUB meta-task the classification tasks are the same tasks used to train the experts. To avoid trivial solu-
tions (always selecting the expert trained on the task we are trying to solve) we test in a leave-one-out fashion: given a
classficication task, we aim to select the best expert that was not trained on the same data.
For the Mixed meta-task, we chose 40 random tasks and 25 curated experts from all datasets. The 25 experts were
generated from iNaturalist, iMaterialist and DeepFashion (CUB, having fewer samples than iNaturalist, is more appropriate
as tasks). For iNaturalist, we trained 15 experts: 8 order tasks and 7 class tasks (species ordered by class), both with number
of samples greater than 10,000. For DeepFashion, we trained 3 category experts (upper-body, lower-body, full-body). For
iMaterialist, we trained 2 category experts (pants, shoes) and 5 multi-label experts by grouping attributes (color, gender,
neckline, sleeve, style). For the purposes of clustering attributes into larger groups for training experts (and color coding
the dots in Figure 1), we obtained a de-anonymized list of the iMaterialist Fashion attribute names from the FGCV contest
organizers.
The 40 random tasks were generated as follows. In order to balance tasks among all datasets, we selected 5 CUB, 15
iNaturalist, 15 iMaterialist and 5 DeepFashion tasks. Within those datasets, we randomly pick tasks with a sufficient number
of validation samples and maximum variety. For the iNaturalist tasks, we group the order tasks into class tasks, filter out
the number of validation samples less than 100 and randomly pick order tasks within each class. For the iMaterialist tasks,
we similarly group the tasks (e.g. category, style, pattern), filter out tasks with less than 1,000 validation samples and
randomly pick tasks within each group. For CUB, we randomly select 2 order tasks and 3 Passeriformes family tasks, and
for DeepFashion, we randomly select the tasks uniformly. All this ensures that we have a balanced variety of tasks.
For the data efficiency experiment, we trained on a subset of the tasks and experts in the Mixed meta-task: We picked
the Accipitriformes, Asparagales, Upper-body, Short Sleeves for the tasks, and the Color, Lepidoptera, Upper-body, Passer-
iformes, Asterales for the experts. Tasks where selected among those that have more than 30,000 training samples in order
to represent all datasets. The experts were also selected to be representative of all datasets, and contain both strong and very
weak experts (such as the Color expert).
E. Error matrices
[CUB] Procellariiformes (Aves) 15 18 18 20 16 17 19 16 22 17 22 17 18 23 24 19 19 16 22 17 17 23 23 24 20 17 20 15 18 14 30 18 15 25 26 24 26 18 19 20 19 25 18 26 19 27 24 20 22 24 [CUB] Parulidae 40 38 40 92 45 44 47 43 39 41 16 39 26 35 28 38 33 14 25 36 33 32 44 33 33

[CUB] Cuculiformes (Aves) 12 15 17 18 18 23 17 20 18 22 23 18 14 18 18 21 20 20 20 16 22 23 22 18 19 12 16 20 19 19 32 14 19 25 17 20 29 21 27 19 17 16 26 29 19 20 31 27 25 25
[CUB] Emberizidae 39 44 42 92 43 43 48 46 42 47 21 48 30 39 36 45 38 20 32 40 39 30 50 38 35
[CUB] Charadriiformes (Aves) 45 44 41 45 43 45 45 46 46 46 47 49 46 47 47 45 46 47 48 47 44 47 48 44 48 36 42 45 45 32 51 41 43 47 48 46 50 42 47 48 45 44 50 48 43 44 49 48 50 48
[CUB] Caprimulgiformes (Aves) 21 24 29 16 27 21 16 23 27 23 21 25 25 27 20 27 23 23 25 27 24 20 29 21 17 27 24 16 19 23 33 27 20 23 24 23 23 27 25 28 23 17 29 35 35 21 29 27 31 31 [CUB] Tyrannidae 34 43 44 80 41 40 44 42 42 46 26 44 38 40 39 40 43 25 37 39 41 37 44 34 39
[CUB] Pelecaniformes (Aves) 18 17 22 15 17 19 19 20 20 19 17 19 17 19 20 15 18 16 21 18 17 20 23 17 21 20 19 17 20 19 24 17 18 20 24 22 25 20 23 17 19 20 19 20 18 16 25 25 23 24 [CUB] Charadriiformes 44 47 49 89 46 45 48 46 43 47 35 46 40 42 45 42 48 37 41 42 45 32 51 40 38
[CUB] Piciformes (Aves) 13 11 15 13 10 8 11 12 14 14 13 13 14 11 15 12 16 17 13 15 16 14 15 16 16 6 9 14 11 10 13 9 12 14 14 16 14 10 12 13 12 3 15 15 15 10 16 16 15 13
16 14 16 13 12 15 12 12 12 11 14 16 17 12 13 13 15 15 13 17 14 15 15 14 14 12 11 12 12 12 20 10 8 17 14 14 12 11 14 14 10 12 14 14 15 11 17 16 15 17 [INAT] Accipitriformes 69 70 68 93 69 70 74 77 69 76 59 69 67 68 65 72 69 67 64 71 72 67 75 68 70
[CUB] Anseriformes (Aves)
[CUB] Podicipediformes (Aves) 20 17 27 16 21 21 13 16 19 16 17 24 24 20 20 18 22 20 22 17 17 18 20 17 22 17 16 24 21 13 26 17 15 18 25 20 17 15 18 17 22 17 19 25 22 18 23 20 17 17 [INAT] Caryophyllales 58 61 57 95 63 64 71 62 58 60 56 36 49 54 56 44 53 56 51 54 54 60 43 58 58
[CUB] Apodiformes (Aves) 22 27 24 22 24 27 25 21 23 27 23 29 27 25 24 22 27 27 26 27 22 27 30 25 22 18 19 24 21 22 26 24 19 27 26 24 27 31 28 22 21 22 32 22 25 24 27 26 28 28 [INAT] Coleoptera 43 45 45 95 49 51 58 52 45 47 33 39 22 37 38 42 39 33 29 39 38 42 46 43 39
[CUB] Coraciiformes (Aves) 19 21 21 19 19 20 18 20 21 19 21 23 22 26 21 22 22 30 17 20 21 19 26 22 25 17 16 20 20 15 30 18 16 29 25 19 20 19 24 21 18 21 31 25 21 24 30 24 29 21
[CUB] Icteridae (Aves) 28 31 32 27 32 31 30 31 34 30 25 30 29 28 32 32 29 29 30 28 31 34 28 30 28 20 25 30 29 26 35 26 27 35 35 28 34 29 34 33 30 28 32 36 29 28 37 36 36 36 [INAT] Rodentia 52 56 54 86 54 54 61 56 54 52 41 49 46 47 49 54 50 45 52 45 54 52 51 52 49
[CUB] Cardinalidae (Aves) 12 13 10 14 8 11 12 14 12 9 14 7 13 14 16 13 9 14 10 12 15 16 13 12 13 6 10 16 14 9 18 9 13 18 13 10 19 11 13 13 14 12 16 18 11 12 15 18 20 19 [CUB] Pelecaniformes 12 22 16 69 20 24 24 20 17 16 14 22 19 14 20 18 25 20 14 18 20 20 26 19 14
[CUB] Mimidae (Aves) 7 14 6 8 9 10 8 14 12 12 13 10 11 8 11 9 14 13 11 12 11 12 15 10 11 5 5 8 10 10 17 9 4 16 10 9 14 11 14 8 11 10 10 14 5 8 10 15 13 15 [INAT] Carnivora 51 55 53 94 57 52 59 62 60 55 46 51 44 54 48 53 53 50 47 54 50 55 57 56 54
[CUB] Parulidae (Aves) 35 37 41 36 36 36 38 42 36 40 37 37 36 23 36 36 35 39 40 40 40 38 34 38 35 14 26 36 33 33 42 33 34 42 37 40 41 29 39 36 37 30 42 42 34 33 43 45 43 41
[CUB] Emberizidae (Aves) 36 42 43 38 38 42 38 43 40 41 41 44 40 39 29 41 42 43 41 38 42 43 42 42 39 18 31 40 38 32 53 39 40 50 43 42 49 34 45 43 41 36 47 50 35 40 50 50 51 49 [INAT] Piciformes 63 66 63 96 66 64 71 71 63 63 37 63 52 52 54 64 62 35 47 58 56 60 71 58 57
[CUB] Corvidae (Aves) 29 30 32 29 29 29 29 29 33 27 30 30 32 30 33 25 32 30 30 28 27 30 31 29 30 23 26 29 32 27 35 30 32 33 30 32 35 27 33 27 25 31 31 33 28 34 35 35 34 35 [INAT] Fabales 61 65 60 95 65 62 73 64 63 62 56 37 50 59 57 45 57 55 56 60 60 56 47 59 61
[CUB] Fringillidae (Aves) 7 5 11 7 12 6 9 9 7 9 7 8 8 4 8 11 6 8 10 7 8 7 8 10 6 4 7 7 6 6 8 6 7 13 7 9 7 7 11 7 8 7 9 13 7 6 13 10 9 11 [INAT] Rosales 66 66 69 95 69 64 71 70 66 68 59 46 61 66 63 56 63 61 62 66 63 65 53 62 61
[CUB] Tyrannidae (Aves) 42 43 42 41 38 40 40 41 43 43 38 42 41 42 40 41 43 36 46 44 40 42 40 42 42 24 39 40 41 38 46 40 37 47 40 41 49 37 40 43 42 39 45 44 37 41 46 50 46 46
[CUB] Laniidae (Aves) 35 35 32 28 27 38 40 42 35 33 33 33 45 22 30 32 35 33 22 33 35 35 30 28 42 25 27 37 23 28 32 20 17 42 33 30 37 27 35 37 28 27 35 37 30 33 37 35 37 32 [INAT] Columbiformes 50 59 61 90 51 48 50 57 50 60 37 55 48 56 51 63 52 40 57 54 57 50 62 55 50
[CUB] Passeridae (Aves) 38 27 32 38 40 35 33 32 35 37 32 42 30 35 33 35 32 30 28 32 33 27 37 37 27 18 35 22 30 33 38 35 25 38 30 32 35 28 35 32 32 28 33 35 25 35 35 43 37 42 [INAT] Perciformes 48 49 52 97 50 47 57 56 44 50 38 43 35 42 45 42 45 39 36 45 44 47 47 47 47
[CUB] Hirundinidae (Aves) 27 23 29 20 21 21 19 24 27 23 25 27 24 22 24 21 25 20 24 23 23 27 23 28 22 12 19 26 29 17 32 18 18 30 25 24 28 20 29 25 24 23 28 26 23 27 34 26 28 26 [INAT] Hymenoptera 60 62 61 94 67 62 68 67 61 65 52 61 43 53 53 60 57 53 49 55 56 59 64 57 54
[CUB] Thraupidae (Aves) 10 13 7 7 13 8 13 13 13 10 5 7 10 10 5 8 8 12 12 8 13 8 15 13 3 3 5 7 5 8 13 8 10 15 7 10 8 7 8 5 7 8 12 13 7 12 10 18 12 13
[CUB] Vireonidae (Aves) 32 33 39 35 36 31 34 38 33 35 33 35 34 27 34 29 35 33 35 34 36 33 32 41 30 15 24 32 32 27 41 27 33 45 37 35 44 34 43 29 33 28 44 42 31 37 48 46 42 40 [INAT] Testudines 64 65 66 89 62 59 68 65 63 63 54 61 57 59 58 62 56 54 56 56 58 58 65 61 54
[CUB] Bombycillidae (Aves) 10 10 15 7 10 8 13 20 10 10 10 12 13 10 13 15 10 7 8 12 10 8 12 13 8 7 15 12 12 18 17 12 13 15 18 10 18 13 13 12 7 10 17 12 13 12 17 8 15 13 [INAT] Hemiptera 44 47 52 95 53 53 60 53 51 49 39 45 27 39 44 43 43 39 30 40 43 44 50 44 45
[CUB] Troglodytidae (Aves) 32 32 33 27 31 32 30 36 31 32 30 31 31 31 30 33 26 31 35 30 29 34 30 30 22 18 24 29 33 29 35 30 31 32 28 29 33 32 35 30 30 31 40 35 34 30 37 38 32 34
[INAT] Suliformes 62 72 63 88 63 70 67 73 65 62 42 63 62 55 60 50 65 48 48 57 53 58 65 53 50
[iNat] Passeriformes (Aves) 78 81 81 78 78 81 79 82 80 81 79 81 80 80 81 80 78 81 82 78 79 81 81 81 80 56 73 79 78 76 86 75 77 84 82 78 85 77 82 79 79 77 85 85 75 78 86 85 87 84
[iNat] Lepidoptera (Insecta) 61 60 62 58 60 59 61 63 60 60 60 61 60 62 60 60 60 63 62 60 60 62 62 60 60 54 31 54 56 60 62 59 60 60 55 60 60 54 62 60 59 60 62 62 60 57 61 62 63 61 [INAT] Caudata 54 52 56 89 60 60 67 60 54 56 52 60 52 45 52 54 42 51 50 49 59 50 55 55 52
[iNat] Squamata (Reptilia) 73 74 73 70 72 73 73 75 74 72 75 74 70 75 74 72 73 76 74 72 74 72 75 72 73 68 68 61 71 71 75 71 74 74 73 73 73 71 74 71 72 73 77 76 71 70 76 77 77 76 [IMAT] Male 12 10 13 30 6 9 9 10 10 11 14 11 16 14 14 14 18 13 15 10 11 8 10 11 14
[iNat] Odonata (Insecta) 71 73 73 69 70 70 73 72 70 70 72 74 71 71 73 71 71 74 72 72 71 73 71 71 72 65 59 67 45 69 72 68 71 69 70 71 70 64 69 71 70 68 67 72 71 69 74 72 73 71
62 60 61 59 65 62 63 65 64 62 65 64 59 63 65 64 60 66 63 60 61 62 65 64 65 54 59 59 62 48 70 57 57 65 66 60 66 60 69 62 63 60 68 70 61 64 69 69 69 65 [IMAT] Marbled 9 8 10 21 9 7 9 7 7 6 5 6 5 9 5 8 5 5 6 5 6 4 6 7 6
[iNat] Charadriiformes (Aves)
[iNat] Asterales (Magnoliopsida) 69 68 69 69 69 69 70 70 69 67 68 67 66 69 68 67 68 69 68 70 69 69 69 69 69 66 64 66 66 67 49 69 69 54 69 69 57 65 59 69 68 69 58 60 68 66 63 58 58 59 [DEEPFASHION] Sleeve 25 33 22 19 41 21 40 17 31 39 40 21 42 47 22 32 32 27 44 65 37 41 32 38 32
[iNat] Pelecaniformes (Aves) 58 61 55 49 54 56 54 57 58 59 62 55 58 64 60 59 57 64 57 54 57 57 63 53 59 48 50 54 60 53 67 46 51 59 56 51 62 57 64 51 47 56 64 64 54 59 67 67 67 61 [IMAT] Printed 33 22 17 66 24 40 26 27 26 32 30 28 28 28 20 34 29 27 26 35 24 35 30 35 34
[iNat] Anseriformes (Aves) 60 60 63 61 61 60 60 63 64 60 66 65 60 62 63 59 61 66 60 62 58 66 67 59 62 57 59 66 62 56 69 59 57 61 67 61 66 63 65 64 62 62 64 65 63 65 66 65 68 64
62 58 61 60 59 58 61 63 57 59 58 59 58 60 61 56 60 60 60 61 61 59 61 61 63 55 57 57 56 57 43 59 61 38 57 60 47 54 45 60 58 59 46 45 61 58 52 49 47 48 [IMAT] Prom dresses 9 9 12 28 9 10 8 7 7 8 10 12 9 12 9 10 10 11 5 8 12 6 8 11 17
[iNat] Lamiales (Magnoliopsida)
[iNat] Anura (Amphibia) 69 69 73 65 71 72 72 71 69 72 71 73 70 73 71 69 71 71 73 69 70 75 70 74 66 68 60 64 70 70 75 68 71 71 58 70 74 64 71 72 73 70 70 74 70 69 76 75 71 71 [IMAT] Reversible 17 26 13 30 18 27 17 17 16 14 18 19 22 23 18 17 20 24 20 17 21 17 24 20 25
[iNat] Accipitriformes (Aves) 69 69 69 68 73 72 68 75 74 76 70 73 68 74 69 70 70 77 73 69 71 76 71 72 71 62 67 68 68 68 77 69 68 76 66 66 74 66 71 72 68 66 74 76 71 72 71 77 73 72 [IMAT] Ruched 26 26 14 20 20 14 21 22 19 27 20 19 20 26 14 14 22 27 26 26 21 16 22 23 23
[iNat] Caryophyllales (Magnoliopsida) 60 57 59 56 58 55 60 58 55 58 55 54 57 57 59 53 58 56 57 57 56 55 57 55 54 55 53 53 56 57 44 56 57 45 54 57 42 53 48 53 56 57 47 50 56 54 50 46 48 46
[iNat] Coleoptera (Insecta) 45 42 44 41 43 42 42 46 43 43 43 43 42 40 45 42 44 45 45 40 43 43 41 40 43 34 29 40 37 41 49 41 43 44 39 44 44 29 43 42 42 40 42 46 43 38 52 48 45 46 [IMAT] Blouses 19 22 23 27 25 25 24 23 25 31 23 21 23 29 22 25 23 15 30 21 28 25 19 25 31
[iNat] Asparagales (Liliopsida) 51 50 53 48 50 51 51 54 50 50 52 48 52 52 55 50 52 54 52 51 53 45 51 50 52 44 47 49 51 51 38 48 50 38 49 53 40 48 34 51 49 52 41 42 52 48 44 44 42 44 [IMAT] Slippers 2 4 9 24 6 7 6 6 9 1 3 4 6 4 5 9 7 8 4 7 4 5 6 4 6
[iNat] Rodentia (Mammalia) 56 50 54 51 48 49 48 58 49 49 53 51 48 50 52 46 52 50 47 50 50 48 48 46 46 43 51 44 50 53 56 52 56 58 51 50 58 51 54 44 50 52 54 54 50 48 59 61 59 59 [IMAT] Bodycon 30 28 23 44 22 25 28 25 25 26 23 30 32 27 27 30 21 25 27 27 25 20 23 20 34
[iNat] Carnivora (Mammalia) 52 47 58 48 47 54 49 56 49 51 53 51 50 55 54 52 50 49 52 47 52 49 53 49 50 52 49 49 53 54 58 53 54 58 54 53 58 53 56 54 46 52 54 59 49 49 57 57 57 54
[iNat] Piciformes (Aves) 59 62 60 52 60 56 57 52 59 62 53 63 61 64 62 63 58 63 61 60 61 63 66 59 63 41 50 60 51 52 69 55 54 72 61 59 66 50 60 53 56 48 63 67 58 60 70 66 68 69 [IMAT] Bodysuits 28 28 23 49 11 16 20 14 18 19 24 17 20 14 22 8 10 24 20 18 21 15 22 16 23
[iNat] Fabales (Magnoliopsida) 61 60 59 57 61 64 65 61 60 59 58 58 59 61 59 59 60 63 61 61 60 58 63 58 60 56 54 60 58 59 49 58 61 46 58 60 49 57 48 59 60 60 43 49 62 57 50 52 50 50 [IMAT] Velour 9 12 10 36 15 10 19 13 14 11 11 14 17 11 13 15 10 12 13 16 14 11 10 13 17
[iNat] Rosales (Magnoliopsida) 65 65 65 66 64 63 71 70 64 63 63 64 65 68 64 62 62 66 65 66 68 63 69 67 66 59 61 62 67 68 51 62 66 54 62 63 57 61 62 64 62 66 59 49 64 63 56 57 58 58 [IMAT] Winter boots 4 4 4 20 4 3 4 4 2 0 4 3 4 3 4 3 4 6 5 5 4 3 5 5 5
[iNat] Columbiformes (Aves) 53 50 55 51 50 58 55 56 59 59 57 57 50 62 59 50 55 58 55 52 57 59 59 60 57 41 50 51 59 50 67 55 51 62 58 50 61 50 56 59 58 59 67 61 50 64 64 63 59 65
[iNat] Perciformes (Actinopterygii) 45 46 48 45 44 48 45 52 44 47 46 45 44 49 48 44 45 50 50 46 48 43 48 43 45 41 35 45 43 46 48 46 48 47 47 51 48 41 46 49 48 46 48 49 48 29 50 50 48 46 [IMAT] Criss cross 25 22 24 33 28 19 26 11 18 10 16 26 23 15 20 22 16 25 12 18 28 20 13 20 24
[iNat] Sapindales (Magnoliopsida) 64 68 70 67 63 64 67 64 64 66 70 65 63 67 67 69 67 68 67 66 69 64 65 65 63 64 61 61 65 65 64 67 70 59 63 67 60 62 64 65 67 64 56 61 70 57 53 60 61 62 [DEEPFASHION] Floral 14 13 14 52 29 24 25 19 30 21 26 26 16 22 18 26 16 19 19 18 28 21 19 23 30
[iNat] Ranunculales (Magnoliopsida) 50 48 50 50 49 48 53 51 49 48 50 47 46 52 53 45 51 51 50 49 52 49 50 49 50 51 48 49 51 52 35 48 50 33 50 54 38 49 37 50 49 49 39 37 54 47 42 35 39 39
[IMAT] Embroidered 40 15 32 58 48 32 43 14 30 24 33 43 12 30 36 27 43 24 29 25 16 25 25 33 33
[iNat] Gentianales (Magnoliopsida) 57 54 52 55 51 54 56 55 53 54 54 53 54 55 54 52 50 53 53 53 52 53 55 54 55 53 51 51 50 57 41 54 57 42 51 55 44 48 45 57 56 51 46 45 54 48 47 45 40 43
[iNat] Ericales (Magnoliopsida) 49 46 52 47 48 48 47 49 47 48 46 44 47 49 48 48 51 49 49 47 50 47 49 52 48 44 45 49 45 51 38 48 47 37 46 51 38 45 35 47 50 49 40 40 50 45 40 39 41 35 [IMAT] Flannel 14 15 10 40 20 7 15 11 8 9 10 13 10 15 14 8 16 11 17 13 18 19 13 14 16
[DEEPFASHION] Knit 22 36 31 80 48 20 53 46 22 23 15 25 31 53 18 34 29 41 52 31 27 31 21 28 35
[iN at] Raragopte nolio(Aves)
[ ele gif me (Av )
[CU [CUCUB]caniformes (Aves)
B] B] A Pic orm s (A es)
[C dic er rm (A s)
[CUUB] ipediformes (Aves)
Co dif me (Av )
[CU [CU raciiformes (Aves)
B] B] Ic orm s (A es)
[CUardi terid es (Aves)
[CU[CUBB] Mi alidae (Aves)
B] ] Pa mid e (A es)
[CUCUB] beri lidae (Aves)
[CU B] F Cor idae (Aves)
B] rin vid (A s)
[ CU ra lida (Av s)
[CUCUB] B] Lannidae (A es)
[CU] Hir asse iidae (Aves)
B un rid (A s)
[CU[CUB] Thr dinidae ( ves)
[CU ] B ] Vir upidae (Aves)
[iN B] Tomb onidae (Aves)
a a d a Av )
[iNt] Lepsseri ytidae (A es)
S o m (A )
[iN[iNat [iNat qua pteraes (Aves)
[iN ster arad ona a (Resecta)
at] ale riif ta pt )
[iN [iN Pel s (M orm (Inse ilia)
La ] An nif no (Av a)
at] [i [i iale eriformesopsid s)
Ca Nat Nat s (M rm (A a)
ph cci nur no (Av s)
[iN [iN Ro les cifo am al )

[ t] at al (M me ma )
[iN iNat Perc] Col es (Magno s (A lia)
at] nun da s ( rm iop a)

[iN [iNa llale itrifo (Amopsi s)
at] t] C s ( rm ph da)
at] od ale ra ( ps )
[iN [ Carnentias (Li Insecda)
[iNat] F iNat] ivora (Ma iopsi ta)
[iN ] Ra apin rme bifognol opsid s)

[iN enti ulales (M tino s (Avda)
(M gno ps a)
)
Eri les Ma oli ryg )
ca (M gno ops ii)
no psi )
ps )
[iN Asp ole Mag es ibia
at] aba Pi (M mm da
ia
ida
P ul r s es
B] Apo ifor es ves
[iN at] P rogloycillidae ( ves
at] id for e ves
h d t n s
at] ana s ( agn pte es
ag lio ida
lio da
Po ns ifo es ve
[ Em ru ae ve
[ Ty gil ae ve
B P n e ve
at] ] C ] O ma (I ve
m s o li e
ryo ] A ] A ag es ve
y p a li e
at ] S ifo um a li ve
at] at eca ag es ct
les a lio id
B] rim riifo me (Av
n a v
[DEEPFASHION] Lace 25 34 29 24 26 34 34 15 32 28 36 26 16 14 30 38 18 28 30 27 33 25 22 22 37
i
G c le Ac e si
A
[CU ap rad lifor es
l
B] Cha ucu form
[DEEPFASHION] Print 21 19 25 37 16 21 26 18 21 18 19 23 17 32 23 22 21 21 22 27 25 24 19 19 20
r
z
e
a
[CU UB] B] C llarii
C
[C [CU oce
Fu ody thes
yc s
s
Ge lor
Ne er
Sle e
e
le
nts
s
oli s
Ins a
Re ta
Ma ptilia
Lili alia
Pa mph a
rifo ia
pid es
Sq ptera
Ch Od ata
dri ata
Pe Aste es
An iform s
ifo s
es
od the
he
oe
gn Ave
an le
ser e
in
ev
sid
A sid
s
Sty
sse ib
ec
nd
Le rm
rm
rm
Pr
Co
lec ra
ckl
Pa
m
ara on
B
Sh
lot
mm
op
op
C
r-b clo
ll-b clo
ifo
ua
o
B]
A
[CU
we y
Lo r-bod
Ma
a
pe
[iN
Up
Figure 6: Meta-tasks ground-truth error matrices. (Best viewed magnified). (Left) Error matrix for the CUB+iNat
meta-task. The numbers in each cell is the test error obtained by training a classifier on a given combination of task (rows)
and expert (columns). The background color represent the Asymmetric TASK 2 VEC distance between the target task and
the task used to train the expert. Numbers in red indicate the selection made by the model selection algorithm based on
the Asymmetric TASK 2 VEC embedding. The (out-of-diagonal) optimal expert (when different from the one selected by our
algorithm), is highlighted in blue. (Right) Same as before, but for the Mixed meta-task.

A Paper

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

A Paper

Încărcat de

Drepturi de autor:

Formate disponibile

TASK 2V EC : Task Embedding for Meta-Learning

Alessandro Achille Michael Lam Rahul Tewari Avinash Ravichandran

Subhransu Maji Charless Fowlkes Stefano Soatto Pietro Perona

smmaji@amazon.com fowlkec@amazon.com soattos@amazon.com peronapp@amazon.com

Abstract semantic similarities between tasks (Fig. 1). When other

Task Embeddings Domain Embeddings

where F is the Fisher information matrix (FIM):

Avg. top-k tax. distance

Test error on task (%)

Task2Vec distance 10%

[CUroce ulgif lidae

[CU] Pod oracliformae

B] UB] ran idae

[iN [iNat] Sa iformtia

[iN at] P Nat] donaes

[CU[CUB driifo idae

at] Nat ifo ae

[iN at] Cleop es

[iN Per hylla es

[iN t] Rapitrif rme

various tasks, which makes it more widely applicable but

and the empirical estimate of the Fisher information matrix is:

The Fisher Information Matrix (FIM) consists of blocks:

Skl = (1 − zk )zk (1 − zl )zl wk wl

F = E[(y − p)2 · S ⊗ xxT ]

where the matrix S can be written as

where we take expectation over y w.r.t. the predictive distribution y ∼ pw (y|x).

B. Robust Fisher Computation

C. Details of the experiments

C.2. Computation of the TASK 2 VEC embedding

C.3. Training the MODEL 2 VECembedding

p(y|di,0 , . . . , di,k ) = Softmax(−γ di ),

which is minimized precisely when p(y|di,0 , . . . , di,k ) = p̂(yi ).

D. Datasets, tasks and meta-tasks

[CUB] Procellariiformes (Aves) 15 18 18 20 16 17 19 16 22 17 22 17 18 23 24 19 19 16 22 17 17 23 23 24 20 17 20 15 18 14 30 18 15 25 26 24 26 18 19 20 19 25 18 26 19 27 24 20 22 24 [CUB] Parulidae 40 38 40 92 45 44 47 43 39 41 16 39 26 35 28 38 33 14 25 36 33 32 44 33 33

[iN [iN Ro les cifo am al )

at] nun da s ( rm iop a)

[iN ] Ra apin rme bifognol opsid s)

B] Apo ifor es ves

[iN at] P rogloycillidae ( ves

at] id for e ves

at] ana s ( agn pte es

S-ar putea să vă placă și