5 National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
6 Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA
7 Department of Physics, University of Maryland, College Park, Maryland 20742, USA
8 Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin-Dahlem, Germany
(Dated: September 11, 2017)
Superconductivity has been the focus of enormous research effort since its discovery more than a century ago. Yet, some features of this unique phenomenon remain poorly understood; prime among these is the connection between superconductivity and chemical/structural properties of materials. To bridge the gap, several machine learning methods are developed herein to model the critical temperatures (Tc) of the 12,000+ known superconductors available via the SuperCon database. Materials are first divided into two classes based on their Tc's, above and below 10 K, and a classification model predicting this label is trained. The model uses coarse-grained features based only on the chemical compositions. It shows strong predictive power, with out-of-sample accuracy of about 92%. Separate regression models are developed to predict the values of Tc for cuprate, iron-based, and "low-Tc" compounds. These models also demonstrate good performance, with learned predictors offering important insights into the mechanisms behind superconductivity in different families of materials. To improve the accuracy and interpretability of these models, new features are incorporated using materials data from the AFLOW Online Repositories. Finally, the classification and regression models are combined into a single integrated pipeline and employed to search the entire Inorganic Crystallographic Structure Database (ICSD) for potential new superconductors. We identify about 30 non-cuprate and non-iron-based oxides as candidate materials.
descriptors were selected heuristically due to their success in classifying binary/ternary structures and predicting stable/metastable ternary quasicrystals. Recently, an investigation stumbled on this clustering problem again by observing a threshold Tc closer to Tcthres = 20 K (log Tcthres ≈ 1.3) [23]. Instead of a heuristic approach, random forests and simplex fragments were leveraged on the structural/electronic properties data from the AFLOW Online Repositories to find the optimum clustering descriptors. A classification model was developed showing good performance. Separately, a sequential learning framework was evaluated on superconducting materials, exposing the limitations of relying on random-guess (trial-and-error) approaches for breakthrough discoveries [24]. Subsequently, this study also highlights the impact machine learning can have on this particular field. Other contemporary work homes in on specific materials [25] and families of superconductors [26] (see also Ref. [27]).

Whereas previous investigations trained on several hundred compounds at most, this work considers more than 16,000 different compositions. These are extracted from the SuperCon database, which contains an exhaustive list of superconductors, including many closely-related materials (varying only by small changes in stoichiometry). The order-of-magnitude increase in training data (i) uncovers crucial subtleties in chemical composition among related compounds and (ii) exposes different superconducting mechanisms via family-specific modeling. It also enables the optimization of several steps of the process of building ML models. Large sets of independent variables can be constructed and rigorously filtered by predictive power (rather than selecting them by intuition alone). These advances are crucial to the success of our ML models in the understanding of the emergence/suppression of superconductivity with composition.

As a demonstration of the potential of ML methods in looking for novel superconductors, we combined and applied several models to search for candidates among the roughly 110,000 different compositions contained in the Inorganic Crystallographic Structure Database (ICSD). The framework highlights 30 interesting compounds with predicted Tc's above 20 K for experimental validation. Interestingly, all share a peculiar feature in their electronic band structure: one (or more) flat/nearly-flat bands just below the Fermi level. The associated large peak in the density of states (infinitely large in the limit of truly flat bands) can lead to strong electronic instability, and has been discussed recently as one possible route to high-temperature superconductivity [28, 29].

DATA AND METHODS

Superconductivity data. The success of any ML method ultimately depends on access to reliable and plentiful data. Superconductivity data used in this work is extracted from the SuperCon database [30], created and maintained by the Japanese National Institute for Materials Science. It houses information such as the Tc and reporting journal publication for superconducting materials known from experiment. Assembled within it is a uniquely exhaustive list of all reported superconductors, as well as related non-superconducting compounds. The database consists of two separate subsets: "Oxide & Metallic" (inorganic materials containing metals, alloys, cuprate high-temperature superconductors, etc.) and "Organic" (organic superconductors). Downloading the entire inorganic materials dataset and removing compounds with incompletely-specified chemical compositions leaves about 22,000 entries. In the case of multiple records for the same material, the reported material's Tc's are averaged, but only if their standard deviation is less than 5 K, and discarded otherwise. This brings the total down to about 16,400 compounds, of which around 4,000 have no critical temperature reported. Of these, roughly 5,700 compounds are cuprates and 1,500 are iron-based (about 35% and 9%, respectively), reflecting the significant research efforts invested in these two families. The remaining set of about 8,000 is a mix of various materials, including conventional phonon-driven superconductors (e.g., elemental superconductors, A15 compounds), known unconventional superconductors like the layered nitrides and heavy fermions, and many materials for which the mechanism of superconductivity is still under debate (such as bismuthates and borocarbides). The distribution of materials by Tc for the three groups is shown in Figure 2a.

There are occasional problems with the validity and consistency of some of the data. For example, the database includes some reports based on tenuous experimental evidence and only indirect signatures of superconductivity, as well as reports of inhomogeneous (surface, interfacial) and nonequilibrium phases. Even in cases of bona fide bulk superconducting phases, important relevant variables like pressure are not recorded. Though some of the obviously erroneous records were removed from the data (see Supplementary Materials), these issues were largely ignored assuming their effect on the entire dataset to be relatively modest.

A more serious problem is the use of the database itself. Training a model only on superconductors can lead to significant selection bias that may render it ineffective when applied to new materials [31]. Even if the model learns to correctly recognize factors promoting superconductivity, it may miss effects that strongly inhibit it due to the lack of exposure to the relevant negative examples. To mitigate the effect, we incorporate about 300 materials found by H. Hosono's group not to display superconductivity [32]. The presence of non-superconducting materials, along with those without Tc reported in SuperCon, leads to a conceptual problem. Surely, some of these compounds emerge as non-superconducting "end-members" from doping/pressure studies, indicating no superconducting transition was observed despite some efforts to
find one. However, a transition may still exist, albeit at experimentally difficult to reach or altogether inaccessible temperatures (for most practical purposes below 10 mK) [33]. This presents a conundrum: ignoring compounds with no reported Tc disregards a potentially important part of the dataset, while assuming Tc = 0 K prescribes an inadequate description for (at least some of) these compounds. To circumvent the problem, materials are first partitioned in two groups by their Tc, above and below a threshold temperature (Tsep), for the creation of a classification model. Compounds with no reported critical temperature can be classified in the "below-Tsep" group without the need to specify a Tc value (or assume it is zero).

Chemical and structural features. For most materials, the SuperCon database provides only the chemical composition and Tc. To convert this information into meaningful features/predictors (used interchangeably), we employ the Materials Agnostic Platform for Informatics and Exploration (Magpie) [35]. Magpie computes a set of 145 attributes for each material, including: (i) stoichiometric features (depending only on the ratio of elements and not the specific species); (ii) elemental property statistics: the mean, mean absolute deviation, range, minimum, maximum, and mode of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature); (iii) electronic structure attributes: the average fraction of electrons from the s, p, d and f valence shells among all elements present; and (iv) ionic compound features that include whether it is possible to form an ionic compound assuming all elements exhibit a single oxidation state.

The application of Magpie predictors, though appearing to lack a priori justification, expands upon past clustering approaches by Villars and Rabe [21, 22]. They show that, in the space of a few judiciously chosen heuristic predictors, materials separate and cluster according to their crystal structure and even complex properties such as high-temperature ferroelectricity and superconductivity. Similar to these features, Magpie predictors capture significant chemical information, which plays a decisive role in determining structural and physical properties of materials.

Despite the success of Magpie predictors in modeling material properties [35], interpreting their connection to superconductivity presents a serious challenge. They do not encode (at least directly) many important materials properties, particularly those pertinent to superconductivity. Incorporating features like lattice type and density of states would undoubtedly lead to significantly more powerful and interpretable models. Since such information is not generally available in SuperCon, we employ data from the AFLOW Online Repositories [17–20]. The materials database houses nearly 170 million properties calculated with the software package AFLOW [2, 36–44]. AFLOW is a high-throughput ab initio framework that manages density functional theory (DFT) calculations in accordance with the AFLOW Standard [19]. The Standard ensures that the calculations and derived properties are empirical (reproducible), reasonably well-converged, and, above all, consistent (fixed set of parameters), a particularly attractive feature for ML modeling. Many materials properties important for superconductivity have been calculated within the AFLOW framework, and are easily accessible through the AFLOW Online Repositories. It contains information for the vast majority of compounds in the ICSD [1]. Although the AFLOW Online Repositories contain calculated properties, the DFT results have been extensively validated with ICSD records [13, 23, 45–48].

Unfortunately, only a small subset of materials in SuperCon overlaps with those in the ICSD: about 800 with finite Tc, and fewer than 600 are contained within AFLOW. For these, a set of 26 predictors is incorporated from the AFLOW Online Repositories, including structural/chemical information like the lattice type, space group, volume of the unit cell, density, ratios of the lattice parameters, Bader charges and volumes, and formation energy (see Supplementary Materials). In addition, electronic properties are considered, including the density of states near the Fermi level as calculated by AFLOW. Previous investigations exposed limitations in applying ML methods to a similar dataset in isolation [23]. Instead, a framework is presented for combining models built on Magpie descriptors (large sampling, but features limited to compositional data) and AFLOW features (small sampling, but diverse and pertinent features).

Machine learning algorithms. Once we have a list of relevant predictors, various ML models can be applied to the data [49, 50]. All ML algorithms in this work are variants of the random forest method [51]. Fundamentally, the approach combines many individual decision trees, where each tree is a non-parametric supervised learning method used for modeling either categorical or numerical variables (i.e., classification or regression modeling). A tree predicts the value of a target variable by learning simple decision rules inferred from the available features (see Figure 1 for an example). The deeper the tree, the more complex the relationships it can learn, but also the greater the danger of overfitting, i.e., learning some irrelevant information or just "noise".

The random forest method creates a set of individual decision trees (hence the "forest"), each built to solve the same classification/regression problem. It then combines their results, either by voting or averaging, depending on the problem. To make the forest more robust to overfitting, individual trees in the ensemble are built from samples drawn with replacement (a bootstrap sample) from the training set. In addition, when splitting a node during the construction of a tree, the model chooses the best split of the data considering only a random subset of the features. The hyperparameters used to optimize the model are described in the Supplementary Material.

Random forest is one of the most powerful, versatile, and widely-used ML methods [52]. There are several advantages that make it especially suitable for this problem.
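The random-forest construction just described (bootstrap samples plus a random feature subset at each split) can be sketched with scikit-learn. The data and hyperparameters below are illustrative stand-ins, not those used in this work:

```python
# Minimal sketch of the random-forest setup described above (scikit-learn).
# The synthetic features stand in for Magpie-style predictors; the label
# stands in for the above/below-Tsep class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                  # stand-in predictor matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in class label

forest = RandomForestClassifier(
    n_estimators=300,      # number of trees in the ensemble
    max_features="sqrt",   # random feature subset considered at each split
    bootstrap=True,        # each tree sees a sample drawn with replacement
    oob_score=True,        # out-of-bag estimate of generalization accuracy
    random_state=0,
)
forest.fit(X, y)
print(round(forest.oob_score_, 2))
```

The out-of-bag score is a convenient by-product of bootstrapping: each tree is validated on the training points it never saw.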
[Figure 1 image: the top three levels of one such tree. Root node: std(column number) ≤ 4.16 (samples = 100.0%, proportion = [0.62, 0.38], class = Tc below 10 K); deeper splits use rules such as std(T melt) ≤ 418.92 K, avg(atomic weight) ≤ 102.81 u, avg(atomic weight) ≤ 80.01 u, std(T melt) ≤ 672.09 K, and std(electronegativity) ≤ 0.52.]
FIG. 1. Schematic of the random forest ML approach. Example of a single decision tree used to classify materials depending on whether their Tc is above or below 10 K. A tree can have many levels, but only the top three are shown. The decision rules leading to each subset are written inside individual rectangles. The subset population percentage is given by "samples", and the node color/shade represents the degree of separation, i.e., dark blue/orange illustrates a high proportion of Tc > 10 K/Tc < 10 K materials (the exact value is given by "proportion"). A random forest consists of a large number — could be hundreds or thousands — of such individual trees.
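A single tree like the one in the figure can be reproduced in miniature with scikit-learn; the data here are synthetic, so the learned split thresholds will not match those in FIG. 1:

```python
# Illustrative sketch of one decision tree, as in FIG. 1 (scikit-learn).
# Feature names mirror the figure; the data are random stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
features = ["std(column number)", "avg(atomic weight)", "std(T melt)"]
X = rng.uniform(size=(500, 3))
y = (X[:, 0] + 0.7 * X[:, 2] > 0.8).astype(int)  # stand-in above/below-10-K label

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
# Print the learned decision rules, analogous to the rectangles in the figure
print(export_text(tree, feature_names=features))
```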
First, it can learn complicated non-linear dependencies from the data. Unlike many other methods (e.g., linear regression), it does not make any assumptions about the relationship between the predictors and the target variable. Second, random forests are quite tolerant to heterogeneity in the training data. They can handle both numerical and categorical data which, furthermore, does not need extensive and potentially dangerous preprocessing, such as scaling or normalization. Even the presence of strongly correlated predictors is not a problem for model construction (unlike for many other ML algorithms). Another significant advantage of this method is that, by combining information from individual trees, it can estimate the importance of each predictor, thus making the model more interpretable. However, unlike model construction, determination of predictor importance is complicated by the presence of correlated features. To avoid this, standard feature selection procedures are employed along with a rigorous predictor elimination scheme (based on their strength and correlation with others). Overall, these methods reduce the complexity of the models and improve our ability to interpret them.

RESULTS AND DISCUSSION

Classification models. As a first step in applying ML methods to the dataset, a sequence of classification models are created, each designed to separate materials into two distinct groups depending on whether Tc is above or below some predetermined value. The temperature that separates the two groups (Tsep) is treated as an adjustable parameter of the model, though some physical considerations should guide its choice as well. Classification ultimately allows compounds with no reported Tc to be used in the training set by including them in the below-Tsep bin. Although discretizing continuous variables is not generally recommended, in this case the benefits of including compounds without Tc outweigh the potential information loss.

In order to choose the optimal value of Tsep, a series of random forest models are trained with different threshold temperatures separating the two classes. Since setting Tsep too low or too high creates strongly imbalanced classes (with many more instances in one group), it is important to compare the models using several different metrics. Focusing only on the accuracy (count of correctly-classified instances) can lead to deceptive results. Hypothetically, if 95% of the observations in the dataset are in the below-Tsep group, simply classifying all materials as such would yield a high accuracy (95%), while being trivial in any other sense. To avoid this potential pitfall, three other standard metrics for classification are considered: precision, recall, and F1 score. They are defined using the values tp, tn, fp, and fn for the count of true/false positive/negative predictions of the
FIG. 2. SuperCon dataset and classification model performance. (a) Histogram of materials categorized by Tc (bin size is 2 K; only those with finite Tc are counted). Blue, green, and red denote "low-Tc", iron-based, and cuprate superconductors, respectively. In the inset: histogram of materials categorized by ln(Tc), restricted to those with Tc > 10 K. (b) Performance of different classification models as a function of the threshold temperature (Tsep) that separates materials into two classes by Tc. Performance is measured by accuracy (gray), precision (red), recall (blue), and F1 score (purple). The scores are calculated from predictions on an independent test set, i.e., one separate from the dataset used to train the model. In the inset: the dashed red curve gives the proportion of materials in the above-Tsep set.
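The accuracy pitfall described in the text, and the reason FIG. 2b tracks several metrics at once, can be made concrete with a toy example; the 95%/5% class balance below is hypothetical, chosen to mirror the example above:

```python
# Toy illustration of why accuracy alone is deceptive on imbalanced classes:
# a trivial "everything is below Tsep" classifier gets high accuracy while
# precision, recall, and F1 expose its uselessness.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0] * 95 + [1] * 5      # 0: below Tsep, 1: above Tsep (hypothetical split)
y_trivial = [0] * 100            # classify every material as below Tsep

print(accuracy_score(y_true, y_trivial))                    # 0.95
print(precision_score(y_true, y_trivial, zero_division=0))  # 0.0
print(recall_score(y_true, y_trivial, zero_division=0))     # 0.0
print(f1_score(y_true, y_trivial, zero_division=0))         # 0.0
```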
FIG. 3. Scatter plots of 3,000 superconductors in the space of the four most important classification predictors. Blue/red represent below-Tsep/above-Tsep materials, where Tsep = 10 K. (a) Feature space of the first and second most important predictors: standard deviations of the column numbers and electronegativities (calculated over the values for the constituent elements in each compound). (b) Feature space of the third and fourth most important predictors: standard deviation of the elemental melting temperatures and average of the atomic weights.
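The composition-weighted statistics plotted here (the "avg" and "std" predictors) reduce to simple arithmetic over elemental values. A hand-rolled sketch for MgB2, using standard periodic-table data, might look like the following; in practice Magpie computes such features automatically:

```python
# Composition-weighted average and standard deviation of elemental properties,
# illustrated for MgB2. Elemental values are standard periodic-table data.
import math

composition = {"Mg": 1 / 3, "B": 2 / 3}        # atomic fractions in MgB2
column_number = {"Mg": 2, "B": 13}             # periodic-table group
atomic_weight = {"Mg": 24.305, "B": 10.811}    # in u

def avg(prop, comp):
    """Composition-weighted average of an elemental property."""
    return sum(comp[el] * prop[el] for el in comp)

def std(prop, comp):
    """Composition-weighted standard deviation of an elemental property."""
    m = avg(prop, comp)
    return math.sqrt(sum(comp[el] * (prop[el] - m) ** 2 for el in comp))

print(round(avg(atomic_weight, composition), 2))  # 15.31
print(round(std(column_number, composition), 2))  # 5.19
```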
above the highest Tc of all elements and pseudoelemental materials (solid solutions whose range of composition includes a pure element). Here, the proportion of above-Tsep materials is approximately 38% and the accuracy is about 92%, i.e., the model can correctly classify nine out of ten materials — much better than random guessing. The recall — quantifying how well all above-Tsep compounds are labeled and, thus, the most important metric when searching for new superconducting materials — is even higher. (Note that the models' metrics also depend on random factors such as the composition of the training and test sets, and their exact values can vary.)

For an understanding of what the model has learned, an analysis of the chosen predictors is needed. In the random forest method, features can be ordered by their importance, quantified via the so-called Gini importance or "mean decrease in impurity" [49, 50]. For a given feature, it is the sum of the Gini impurity [53] over the number of splits that include the feature, weighted by the number of samples it splits, and averaged over the entire forest. Due to the nature of the algorithm, the closer to the top of the tree a predictor is used, the greater the number of predictions it impacts.

Although correlations do not affect the model's ability to learn from features, they can distort importance estimates. For example, a material property with a strong effect on Tc can be shared among several correlated predictors. Since the model can access the same information through any of these variables, their relative importances are diluted across the group. To reduce the effect and limit the list of predictors to a manageable size, the backward feature elimination method is employed. The process begins with a model constructed with the full list of predictors, and iteratively removes the least significant one, rebuilding the model and recalculating importances with every iteration. (This iterative procedure is necessary since the ordering of the predictors by importance can change at each step.) Predictors are removed until the accuracy drops by no more than 2%, reducing the full list of 145 down to 5. Furthermore, two of these predictors are strongly correlated with each other, and we remove the less important one. This has a negligible impact on the model performance, yielding four predictors total (see Table 1) with an above-90% accuracy score — only slightly worse than the full model. Scatter plots of the pairs of the most important predictors are shown in Figure 3, where blue/red denotes whether the material is in the below-Tsep/above-Tsep class. Figure 3a shows a scatter plot of 3,000 compounds in the space spanned by the standard deviations of the column numbers and electronegativities calculated over the elemental values. Superconductors with Tc > 10 K tend to cluster in the upper-right corner of the plot and in a relatively thin elongated region extending to the left of it. In fact, the points in the upper-right corner represent mostly cuprate materials, which with their complicated compositions and large number of elements are likely to have high standard deviations in these variables. Figure 3b shows the same compounds projected in the space of the standard deviations of the melting temperatures and the
TABLE 1. The most relevant predictors and their importances for the classification and general regression models. "avg(x)" and "std(x)" denote the composition-weighted average and standard deviation, respectively, calculated over the vector of elemental values for each compound [35]. For the classification model, all predictor importances are quite close.

rank | classification              |      | regression (general; Tc > 10 K)    |
1    | std(column number)          | 0.26 | avg(number of unfilled orbitals)   | 0.26
2    | std(electronegativity)      | 0.26 | std(ground state volume)           | 0.18
3    | std(melting temperature)    | 0.23 | std(space group number)            | 0.17
4    | avg(atomic weight)          | 0.24 | avg(number of d unfilled orbitals) | 0.17
5    | -                           |      | std(number of d valence electrons) | 0.12
6    | -                           |      | avg(melting temperature)           | 0.10
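The backward feature elimination loop used to produce the truncated model can be sketched as follows; the synthetic data, hyperparameters, and 2% tolerance below are illustrative stand-ins for the actual pipeline:

```python
# Sketch of backward feature elimination with a random forest: repeatedly
# drop the least important predictor and refit, stopping once test accuracy
# degrades by more than a 2% tolerance. Data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only features 0 and 1 carry signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def fit_and_score(cols):
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr[:, cols], y_tr)
    return rf, rf.score(X_te[:, cols], y_te)

keep = list(range(X.shape[1]))
_, base_acc = fit_and_score(keep)
while len(keep) > 1:
    rf, _ = fit_and_score(keep)
    # drop the predictor the current model considers least important
    weakest = keep[int(np.argmin(rf.feature_importances_))]
    trial = [c for c in keep if c != weakest]
    _, acc = fit_and_score(trial)
    if acc < base_acc - 0.02:   # stop once accuracy drops by more than 2%
        break
    keep = trial

print(sorted(keep))  # the informative predictors should survive
```

Note the refit inside the loop: as the text explains, importances must be recalculated at every step because the ranking can change once a predictor is removed.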
means of the atomic weights of the elements forming each compound. The above-Tsep materials tend to cluster in areas with lower mean atomic weights — not a surprising result given the role of phonons in conventional superconductivity.

For comparison, we create another classifier based on the average number of valence electrons, metallic electronegativity differences, and orbital radii differences, i.e., the predictors used in Refs. [21, 22] to cluster materials with Tc > 10 K. A classifier built only with these three predictors is less accurate than both the full and the truncated models presented herein, but comes quite close: the full model has about 3% higher accuracy and F1 score, while the truncated model with four predictors is less than 2% more accurate. The rather small (albeit not insignificant) differences demonstrate that even on the scale of the entire SuperCon dataset, the predictors used by Villars and Rabe [21, 22] capture much of the relevant chemical information for superconductivity.

Regression models. After constructing a successful classification model, we now move to the more difficult challenge of predicting Tc. Creating a regression model may enable better understanding of the factors controlling Tc of known superconductors, while also serving as an organic part of a system for identifying potential new ones. Leveraging the same set of elemental predictors as the classification model, several regression models are presented focusing on materials with Tc > 10 K. This avoids the problem of materials with no reported Tc under the assumption that, if they were to exhibit superconductivity at all, their critical temperature would be below 10 K. Another problem is that the Tc's are unevenly distributed over the Tc axis (see Figure 2a). To avoid this, ln(Tc) is used as the target variable instead of Tc (Figure 2a inset), which creates a more uniform distribution and is also considered a best practice when the range of a target variable covers more than one order of magnitude (as in the case of Tc). Following this transformation, the dataset is parsed randomly (85%/15%) into training and test subsets (similarly performed for the classification model).

Present within the dataset are distinct families of superconductors with different driving mechanisms for superconductivity, including cuprate and iron-based high-temperature superconductors, with all others denoted "low-Tc" for brevity (no specific mechanism in this group). Surprisingly, a single regression model does reasonably well among the different families – benchmarked on the test set, the model achieves R2 ≈ 0.88 (Figure 4a). It suggests that the random forest algorithm is flexible and powerful enough to automatically separate the compounds into groups and create group-specific branches with distinct predictors (no explicit group labels were used during training and testing). As validation, three separate models are constructed, trained only on a specific family, namely the "low-Tc", cuprate, and iron-based superconductors, respectively. Benchmarking on mixed-family test sets, the models performed well on compounds belonging to their training set family while demonstrating no predictive power on the others. Figures 4b-d illustrate a cross-section of this comparison. Specifically, the model trained on "low-Tc" compounds dramatically underestimates the Tc of both high-temperature superconducting families (Figures 4b and c), even though this test set only contains compounds with Tc < 40 K. Conversely, the model trained on the cuprates tends to overestimate the Tc of "low-Tc" (Figure 4d) and iron-based (Figure 4e) superconductors. This is a clear indication that superconductors from these groups have different factors determining their Tc. Interestingly, the family-specific models do not perform better than the general regression containing all the data points: R2 for the "low-Tc" materials is about 0.85, for cuprates is just below 0.8, and for iron-based compounds is about 0.74. In fact, it is a purely geometric effect that the combined model has the highest R2. Each group of superconductors contributes mostly to a distinct temperature range, and, as a result, the combined line of predicted-vs.-measured Tc is better determined over a longer interval.

In order to reduce the number of predictors and increase the interpretability of these models without significant detriment to their performance, a backward feature elimination process is again employed. The procedure is very similar to the one described previously for the classification model, with the only difference being that the reduction is guided by R2 of the model, rather than the accuracy (the procedure stops when R2 drops by 3%). The most important predictors for the four models
TABLE 2. The most significant predictors and their importances for the three material-specific regression models. "avg(x)", "std(x)", "max(x)" and "frac(x)" denote the composition-weighted average, standard deviation, maximum, and fraction, respectively, taken over the elemental values for each compound. The l2-norm of a composition is calculated by ||x||₂ = √(Σᵢ xᵢ²), where xᵢ is the proportion of each element i in the compound.

rank | regression ("low-Tc")              |      | regression (cuprates)              |      | regression (Fe-based)   |
1    | frac(d valence electrons)          | 0.18 | avg(number of unfilled orbitals)   | 0.22 | std(column number)      | 0.17
2    | avg(number of d unfilled orbitals) | 0.14 | std(number of d valence electrons) | 0.13 | avg(ionic character)    | 0.15
3    | avg(number of valence electrons)   | 0.13 | frac(d valence electrons)          | 0.13 | std(Mendeleev number)   | 0.14
4    | frac(s valence electrons)          | 0.11 | std(ground state volume)           | 0.13 | std(covalent radius)    | 0.14
5    | avg(number of d valence electrons) | 0.09 | std(number of valence electrons)   | 0.10 | max(melting temperature)| 0.14
6    | avg(covalent radius)               | 0.09 | std(row number)                    | 0.08 | avg(Mendeleev number)   | 0.14
7    | avg(atomic weight)                 | 0.08 | ||composition||₂                   | 0.07 | ||composition||₂        | 0.11
8    | avg(Mendeleev number)              | 0.07 | std(number of s valence electrons) | 0.07 | -                       |
9    | avg(space group number)            | 0.07 | std(melting temperature)           | 0.07 | -                       |
10   | avg(number of unfilled orbitals)   | 0.06 | -                                  |      | -                       |
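The regression setup described in the text (ln(Tc) as the target, an 85%/15% random train/test split, and R2 as the benchmark) reduces to a few lines with scikit-learn; the synthetic Tc values below merely stand in for SuperCon records:

```python
# Sketch of the regression setup: ln(Tc) as target, 85%/15% split, R^2 score.
# The synthetic "Tc" values stand in for real SuperCon data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))                 # stand-in elemental predictors
tc = np.exp(2.5 + X[:, 0] + 0.5 * X[:, 1]      # toy Tc spanning orders of magnitude
            + 0.1 * rng.normal(size=2000))

# ln(Tc) evens out a target spanning more than one order of magnitude
X_tr, X_te, y_tr, y_te = train_test_split(
    X, np.log(tc), test_size=0.15, random_state=0
)
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(round(reg.score(X_te, y_te), 2))  # R^2 on the held-out 15%
```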
(one general and three family-specific) together with their importances are shown in Tables 1 and 2. Differences in important predictors across the family-specific models reflect the fact that distinct mechanisms are responsible for driving superconductivity among these groups. The list is longest for the "low-Tc" superconductors, reflecting the eclectic nature of this group. Similar to the general regression model, different branches are likely created for distinct sub-groups. Nevertheless, some important predictors have a straightforward interpretation. As illustrated in Figure 5a, low average atomic weight is a necessary (albeit not sufficient) condition for achieving high Tc among the "low-Tc" group. In fact, the maximum Tc for a given weight roughly follows 1/√mA. Mass plays a significant role in conventional superconductors through the Debye frequency of phonons, leading to the well-known formula Tc ∼ 1/√m, where m is the ionic mass. Other factors like the density of states are also important, which explains the spread in Tc for a given mA. Outlier materials clearly lying above the ∼1/√mA line include bismuthates and chloronitrates, suggesting the conventional electron-phonon mechanism is not driving superconductivity in these materials. Indeed, chloronitrates exhibit a very weak isotope effect [54], though some unconventional electron-phonon coupling could still be important for superconductivity [55]. Such findings validate the ability of ML approaches to discover meaningful patterns that encode true physical phenomena.

Similar Tc-vs.-predictor plots reveal more interesting and subtle features. A narrow cluster of materials with Tc > 20 K emerges in the context of the mean covalent radii of compounds — another important predictor for "low-Tc" superconductors. The cluster includes (left-to-right) alkali-doped C60, MgB2-related compounds, and bismuthates. The sector likely characterizes a region of strong covalent bonding and corresponding high-frequency phonon modes that enhance Tc (however, frequencies that are too high become irrelevant for superconductivity). Another interesting relation appears in the context of the average number of d valence electrons. Figure 5c illustrates a fundamental bound on Tc of all non-cuprate and non-iron-based superconductors.

A similar limit exists for cuprates based on the average number of unfilled orbitals (Figure 5d). It appears to be quite rigid — several data points found above it on inspection are actually incorrectly recorded entries in the database and were subsequently removed. The connection between Tc and the average number of unfilled orbitals [56] may offer new insight into the mechanism for superconductivity in this family. Known trends include higher Tc's for structures that (i) stabilize more than one superconducting Cu-O plane per unit cell and (ii) add more polarizable cations such as Tl3+ and Hg2+ between these planes. The connection reflects these observations, since more copper and oxygen per formula unit leads to a lower average number of unfilled orbitals (one for copper, two for oxygen). Further, the lower-Tc cuprates typically consist of Cu2-/Cu3-containing layers stabilized by the addition/substitution of hard cations, such as Ba2+ and La3+, respectively. These cations have a large number of unfilled orbitals, thus increasing the compound's average. Therefore, the ability of between-sheet cations to contribute charge to the Cu-O planes may indeed be quite important. The more polarizable the A cation, the more electron density it can contribute to the already strongly covalent Cu2+–O bond.

Including AFLOW. The models described previously demonstrate surprising accuracy and predictive power, especially considering the difference between the relevant energy scales of most Magpie predictors (typically in the range of eV) and superconductivity (meV scale). This disparity, however, hinders the interpretability of the models, i.e., the ability to extract meaningful physical correlations. Thus, it is highly desirable to create accurate ML models with features based on measurable macroscopic properties of the actual compounds (e.g.,
FIG. 4. Benchmarking of regression models predicting ln(Tc). (a) Predicted vs. measured ln(Tc) for the general regression model. The test set comprises a mix of "low-Tc", iron-based, and cuprate superconductors with Tc > 10 K. With an R2 of about 0.88, this one model can accurately predict Tc for materials in different superconducting groups. (b and c) Predictions of the regression model trained solely on "low-Tc" compounds for test sets containing cuprate and iron-based materials. (d and e) Predictions of the regression model trained solely on cuprates for test sets containing "low-Tc" and iron-based superconductors. Models trained on a single group have no predictive power for materials from other groups.
crystallographic and electronic properties) rather than composite elemental predictors. Unfortunately, only a small subset of materials in SuperCon is also included in the ICSD: about 1,500 compounds in total, only about 800 with finite Tc, and even fewer are characterized with ab initio calculations. In fact, a good portion of known superconductors are disordered (off-stoichiometric) materials and notoriously challenging to address with DFT calculations. Currently, much faster and more efficient methods are becoming available [37] for future applications.

To extract suitable features, data is incorporated from the AFLOW Online Repositories — a database of DFT calculations managed by the software package AFLOW. It contains information for the vast majority of compounds in the ICSD and about 550 superconducting materials. In Ref. 23, several ML models using a similar set of materials are presented. Though a classifier shows good accuracy, attempts to create a regression model for Tc led to disappointing results. We verify that using Magpie predictors for the superconducting compounds in the ICSD also yields an unsatisfactory regression model. The issue is not the lack of compounds per se, as models created with randomly drawn subsets from SuperCon with similar counts of compounds perform much better. In fact, the problem is the chemical sparsity of superconductors in the ICSD, i.e., the dearth of closely-related compounds (usually created by chemical substitution). This translates to compound scatter in predictor space — a challenging learning environment for the model.

The chemical sparsity in ICSD superconductors is a significant hurdle, even when both sets of predictors (i.e., Magpie and AFLOW features) are combined via feature fusion. Additionally, this approach alone neglects the majority of the 16,000 compounds available via SuperCon. Instead, we constructed separate models employing Magpie and AFLOW features, and then judiciously combined the results to improve model metrics — known as late or decision-level fusion. Specifically, two independent classification models are developed, one using the full SuperCon dataset and Magpie predictors, and another based on superconductors in the ICSD and AFLOW predictors. Such an approach can improve the recall, for example, in the case where we classify "high-Tc" superconductors as those predicted by either model to be above-Tsep. Indeed, this is the case here: separately, the models obtain recalls of 40% and 66%, respectively, while together they achieve a recall of about 76% (accounting for fluctuations with different test sets). In this way, the models' predictions complement each other constructively, such that above-Tsep materials missed by one model (but not the other) are now accurately classified.

Searching for new superconductors in the ICSD. As a final proof-of-concept demonstration, the classification and regression models described previously are integrated in one pipeline and employed to screen the entire ICSD database for candidate "high-Tc" superconductors. (Note that "high-Tc" is a simple label, the precise
FIG. 5. Scatter plots of Tc for superconducting materials in the space of significant, family-specific regression predictors. For 4,000 "low-Tc" superconductors (i.e., non-cuprate and non-iron-based), Tc is plotted vs. the (a) average atomic weight, (b) average covalent radius, and (c) average number of d valence electrons. The dashed red line in (a) is ∼ 1/√mA. Having a low average atomic weight and a low average number of d valence electrons are necessary (but not sufficient) conditions for achieving high Tc in this group. (d) Scatter plot of Tc for all known superconducting cuprates vs. the mean number of unfilled orbitals. (c and d) suggest that the values of these predictors lead to hard limits on the maximum achievable Tc.
meaning of which can be adjusted.) Similar tools power high-throughput screening workflows for materials with desired thermal conductivity and magnetocaloric properties [48, 57]. As a first step, the full set of Magpie predictors is generated for all compounds in SuperCon. A classification model similar to the one presented above is constructed, but trained only on materials in SuperCon and not in the ICSD (used as an independent test set). The model is then applied on the ICSD set to create a list of materials with predicted Tc above 10 K. Opportunities for model benchmarking are limited to those materials in both the SuperCon and ICSD datasets, though this test set is shown to be problematic. The set includes about 1,500 compounds, though Tc is reported for only about half of them. The model achieves an impressive accuracy of 0.98, which is overshadowed by the fact that 96.6% of these compounds belong to the Tc < 10 K class. The precision, recall, and F1 scores are about 0.74, 0.66, and 0.70, respectively. These metrics are lower than the estimates calculated for the general classification model, which is not unexpected given that this set cannot be considered randomly selected. Nevertheless, the performance suggests a good opportunity to identify new candidate superconductors.

Next in the pipeline, the list is fed into a random forest regression model (trained on the entire SuperCon database) to predict Tc. Filtering on the materials with Tc > 20 K, the list is further reduced to about 2,000 compounds. This count may appear daunting, but should be compared with the total number of compounds in the
[Figure 6: band structures and electronic DOS (states/eV; s, p, d, and total contributions) for (a) AlCs3Ge2O7, (b) AsBeCsO4, (c) Ge2K2ZnO6, and (d) CdPtSr3O6, over the energy window −4 to 4 eV.]
FIG. 6. DOS of four compounds identified by the ML algorithm as potential materials with Tc > 20 K. The
partial DOS contributions from s, p and d electrons and total DOS are shown in blue, green, red, and black, respectively. The
large peak just below EF is a direct consequence of the flat band(s) present in all these materials. These images were generated
automatically via AFLOW [40]. In the case of substantial overlap among k-point labels, the right-most label is offset below.
database — about 110,000. Thus, the method selects less than two percent of all materials, which, in the context of the training set (containing more than 20% with "high-Tc"), suggests that the model is not overly biased toward predicting high critical temperatures.

The vast majority of the compounds identified as candidate superconductors are cuprates, or at least compounds that contain copper and oxygen. There are also some materials clearly related to the iron-based superconductors. The remaining set has less than 40 members, and is composed of materials that are not obviously connected to any high-temperature superconducting families (see Table 3). None of them is predicted to have Tc in excess of 40 K, which is not surprising, given that no such instances exist in the training dataset. All contain oxygen — also not a surprising result, since the group of known superconductors with Tc > 20 K is dominated by oxides.

The electronic properties calculated by AFLOW offer context to the results of the search, and suggest a possible connection between these "conventional" candidates for "high-Tc" materials. Plotting the electronic structure of the potential superconductors exposes an extremely peculiar feature shared by all — one or several (nearly) flat bands just below the Fermi level. Such bands lead to a large peak in the DOS (see Figure 6) and can cause a significant enhancement in Tc. Peaks in the DOS elicited by van Hove singularities can enhance Tc if sufficiently close to EF [58–60]. Note, however, that unlike typical van Hove points, a true flat band creates a divergence in the DOS itself (as opposed to its derivatives), which in turn leads to a critical temperature that depends linearly on the pairing interaction strength, rather than the usual exponential relationship yielding lower Tc [28]. Additionally, there is significant similarity with the band structure and DOS of layered BiS2-based superconductors [61].

This band structure feature came as a surprising result of applying the ML model. It was not sought for, and, moreover, no explicit information about the electronic band structure has been included in these predictors. This is in contrast to the algorithm presented in Ref. 26, which was specifically designed to filter ICSD compounds based on several preselected electronic structure features.

While at the moment it is not clear if some (or indeed any) of these compounds are really superconducting, let alone with Tc's above 20 K, the presence of this highly unusual electronic structure feature is encouraging. Attempts to synthesize several of these compounds are already underway.

CONCLUSION

Herein, several machine learning tools are developed to study the critical temperature of superconductors.
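The two-stage ICSD screening described above — a classifier flagging candidates with Tc > 10 K, then a regressor trained on ln(Tc) filtering at 20 K — can be condensed into a few lines. A minimal sketch; the data, thresholds, and model settings are synthetic stand-ins, not the paper's actual pipeline:

```python
# Sketch of a two-stage screening pipeline: classifier gate, then regression
# filter. Synthetic features stand in for Magpie predictors; the "screening"
# set stands in for the ICSD.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 6))
tc_train = np.exp(1.5 + X_train[:, 0] + 0.3 * rng.normal(size=500))  # Kelvin
y_label = (tc_train > 10).astype(int)                 # "above 10 K" class

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_label)
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, np.log(tc_train))                    # regress on ln(Tc)

X_screen = rng.normal(size=(1000, 6))                 # candidate pool
flagged = X_screen[clf.predict(X_screen) == 1]        # stage 1: classifier
tc_pred = np.exp(reg.predict(flagged))                # stage 2: regressor
shortlist = flagged[tc_pred > 20]                     # keep Tc > 20 K
print(len(flagged), len(shortlist))
```

The design point is that the cheap classifier prunes the pool before the regressor assigns a quantitative Tc, mirroring the pipeline used for the ICSD search.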
FIG. 7. Dataset and feature set statistics. (a) Accuracy, precision, recall, and F1 score as a function of the size of the training set with a fixed test set. (b) Accuracy, precision, recall, and F1 score as a function of the number of predictors.
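Curves like those in panel (a) can be produced by retraining the same model on nested subsets of the training data and scoring a fixed test set. A minimal sketch with synthetic stand-in data:

```python
# Sketch: learning-curve style evaluation (accuracy and F1 vs training size)
# on a fixed test set. Synthetic data stands in for the SuperCon features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

scores = []
for n in [100, 300, 600, 1000, 1500]:        # growing training-set sizes
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr[:n], y_tr[:n])
    pred = clf.predict(X_te)
    scores.append((n, accuracy_score(y_te, pred), f1_score(y_te, pred)))

for n, acc, f1 in scores:
    print(n, round(acc, 2), round(f1, 2))
```

Because the test set is held fixed, the metric fluctuations at small n reflect the sensitivity to training-set composition discussed in the text.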
scikit-learn — a powerful and efficient machine learning Python library [64]. Hyperparameters of the random forest model include the number of trees in the forest, the maximum depth of each tree, the minimum number of samples required to split an internal node, and the number of features to consider when looking for the best split. To optimize the classifier and the combined/family-specific regressors, the GridSearch function in scikit-learn is employed, which generates and compares candidate models from a grid of parameter values. To reduce computational expense, models are not optimized at each step of the backward feature selection process.

Details about the classification and regression models. The most important factors that determine the model's performance are the size of the available dataset and the number of meaningful predictors. In Figure 7a, the accuracy and F1 score generally saturate rather quickly with the size of the training set, and a reasonably well-performing model can be created even with a relatively small set (several hundred compounds). Such a model, however, is susceptible to random variations in the composition of the training and test sets, and thus not very robust (accuracy and F1 score exhibit sizable variations for dataset sizes less than 10,000). So having a large dataset is helpful, but not a major factor above some (relatively modest) size. The number of predictors is another very important model parameter. In Figure 7b, the accuracy is calculated at each step of the backward feature elimination process. It quickly saturates when the number of predictors reaches 10. In fact, a model with only 5 predictors achieves almost 90% accuracy.

ML models are also constructed with the superconducting materials in the AFLOW Online Repositories. The features are built from the following properties: number of atoms, space group, density, volume, energy per atom, electronic entropy per atom, valence of the cell, scintillation attenuation length, the ratios of the unit cell's dimensions, and Bader charges and volumes. For the Bader charges and volumes (vectors), the following statistics are calculated and incorporated: the maximum, minimum, average, standard deviation, and range.

In the main text, several regression models were described, each one designed to predict the critical temperatures of materials from different superconducting groups. These models achieved an impressive R2 score, demonstrating good predictive power for each group. However, it is also important to consider the accuracy of the predictions for individual compounds (rather than on the aggregate set), especially in the context of searching for new materials. To do this, we calculate the prediction errors for about 300 materials from a test set. Specifically, we consider the difference between the logarithm of the predicted and measured critical temperature — ln(Tc^meas) − ln(Tc^pred) — normalized by the value of ln(Tc^meas) (since different groups have different Tc ranges). The models show a comparable spread of errors. The histograms of errors for the four models (combined and three group-specific) are shown in Fig. 8. The errors approximately follow a normal distribution, centered not at zero but at a small negative value. This suggests the models are marginally biased, and on average tend to slightly underestimate Tc. The variance is comparable for all models, but largest for the model trained and tested on iron-based materials, which also shows the smallest R2. Performance of this model is expected to benefit
FIG. 8. Histograms of ∆ln(Tc) · ln(Tc)^−1 for the four regression models, where ∆ln(Tc) ≡ ln(Tc^meas) − ln(Tc^pred) and ln(Tc) ≡ ln(Tc^meas).
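The normalized error defined in the caption is simple to compute directly. A small sketch; the Tc values are illustrative, not taken from the dataset:

```python
# Sketch: the normalized prediction error from Fig. 8,
# (ln Tc_meas - ln Tc_pred) / ln Tc_meas, computed elementwise.
import numpy as np

def normalized_log_error(tc_meas, tc_pred):
    """Elementwise (ln Tc_meas - ln Tc_pred) / ln Tc_meas."""
    tc_meas = np.asarray(tc_meas, dtype=float)
    tc_pred = np.asarray(tc_pred, dtype=float)
    return (np.log(tc_meas) - np.log(tc_pred)) / np.log(tc_meas)

# Illustrative measured vs predicted Tc values (Kelvin):
errors = normalized_log_error([90.0, 38.0, 10.0], [80.0, 40.0, 12.0])
print(np.round(errors, 3))
```

Dividing by ln(Tc^meas) puts the families with very different Tc ranges on a common scale, which is why the histograms of the four models can be compared directly.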
[1] G. Bergerhoff, R. Hundt, R. Sievers, and I. D. Brown, The inorganic crystal structure data base, J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).
[2] S. Curtarolo, W. Setyawan, G. L. W. Hart, M. Jahnátek, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M. J. Mehl, H. T. Stokes, D. O. Demchenko, and D. Morgan, AFLOW: An automatic framework for high-throughput materials discovery, Comput. Mater. Sci. 58, 218–226 (2012).
[3] D. D. Landis, J. Hummelshøj, S. Nestorov, J. Greeley, M. Dulak, T. Bligaard, J. K. Nørskov, and K. W. Jacobsen, The Computational Materials Repository, Comput. Sci. Eng. 14, 51–57 (2012).
[4] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD), JOM 65, 1501–1509 (2013).
[5] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, Commentary: The Materials Project: A materials genome approach to accelerating materials innovation, APL Mater. 1, 011002 (2013).
[6] A. Agrawal and A. Choudhary, Perspective: Materials informatics and big data: Realization of the "fourth paradigm" of science in materials science, APL Mater. 4, 053208 (2016).
[7] T. Lookman, F. J. Alexander, and K. Rajan, eds., A Perspective on Materials Informatics: State-of-the-Art and Challenges (Springer International Publishing, 2016), doi:10.1007/978-3-319-23871-5.
[8] A. Jain, G. Hautier, S. P. Ong, and K. A. Persson, New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships, J. Mater. Res. 31, 977–994 (2016).
[9] T. Mueller, A. G. Kusne, and R. Ramprasad, Machine Learning in Materials Science (John Wiley & Sons, Inc, 2016), pp. 186–273, doi:10.1002/9781119148739.ch4.
[10] A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka, Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids, Phys. Rev. B 89 (2014).
[11] P. V. Balachandran, J. Theiler, J. M. Rondinelli, and T. Lookman, Materials Prediction via Classification Learning, Sci. Rep. 5 (2015).
[12] G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga, R. Ramprasad, J. E. Gubernatis, and T. Lookman, Machine learning bandgaps of double perovskites, Sci. Rep. 6, 19375 (2016).
[13] O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, and A. Tropsha, Universal fragment descriptors for predicting electronic properties of inorganic crystals, Nat. Commun. 8, 15679 (2017).
[14] J. E. Hirsch, M. B. Maple, and F. Marsiglio, Superconducting Materials: Conventional, Unconventional and Undetermined, Physica C 514, 1–444 (2015).
[15] P. W. Anderson, Plasmons, Gauge Invariance, and Mass, Phys. Rev. 130, 439–442 (1963).
[16] C. W. Chu, L. Z. Deng, and B. Lv, Hole-doped cuprate high temperature superconductors, Physica C 514, 290–313 (2015). Superconducting Materials: Conventional, Unconventional and Undetermined.
[17] S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R. H. Taylor, L. J. Nelson, G. L. W. Hart, S. Sanvito, M. Buongiorno Nardelli, N. Mingo, and O. Levy, AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations, Comput. Mater. Sci. 58, 227–235 (2012).
[18] R. H. Taylor, F. Rose, C. Toher, O. Levy, K. Yang, M. Buongiorno Nardelli, and S. Curtarolo, A RESTful API for exchanging materials data in the AFLOWLIB.org consortium, Comput. Mater. Sci. 93, 178–192 (2014).
[19] C. E. Calderon, J. J. Plata, C. Toher, C. Oses, O. Levy, M. Fornari, A. Natan, M. J. Mehl, G. L. W. Hart, M. Buongiorno Nardelli, and S. Curtarolo, The AFLOW standard for high-throughput materials science calculations, Comput. Mater. Sci. 108 Part A, 233–238 (2015).
[20] F. Rose, C. Toher, E. Gossett, C. Oses, M. Buongiorno Nardelli, M. Fornari, and S. Curtarolo, AFLUX: The LUX materials search API for the AFLOW data repositories, Comput. Mater. Sci. 137, 362–370 (2017).
[21] P. Villars and J. C. Phillips, Quantum structural diagrams and high-Tc superconductivity, Phys. Rev. B 37, 2345–2348 (1988).
[22] K. M. Rabe, J. C. Phillips, P. Villars, and I. D. Brown, Global multinary structural chemistry of stable quasicrystals, high-TC ferroelectrics, and high-Tc superconductors, Phys. Rev. B 45, 7650–7676 (1992).
[23] O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha, and S. Curtarolo, Materials Cartography: Representing and Mining Materials Space Using Structural and Electronic Fingerprints, Chem. Mater. 27, 735–743 (2015).
[24] J. Ling, M. Hutchinson, E. Antono, S. Paradiso, and B. Meredig, High-Dimensional Materials and Process Optimization Using Data-Driven Experimental Design with Well-Calibrated Uncertainty Estimates, Integr. Mater. Manuf. Innov. (2017).
[25] M. Ziatdinov, A. Maksov, L. Li, A. S. Sefat, P. Maksymovych, and S. V. Kalinin, Deep data mining in a real space: separation of intertwined electronic responses in a lightly doped BaFe2As2, Nanotechnology 27, 475706 (2016).
[26] M. Klintenberg and O. Eriksson, Possible high-temperature superconductors predicted from electronic structure and data-filtering algorithms, Comput. Mater. Sci. 67, 282–286 (2013).
[27] M. R. Norman, Materials design for new superconductors, Rep. Prog. Phys. 79, 074502 (2016).
[28] N. B. Kopnin, T. T. Heikkilä, and G. E. Volovik, High-temperature surface superconductivity in topological flat-band systems, Phys. Rev. B 83, 220503 (2011).
[29] S. Peotta and P. Törmä, Superfluidity in topologically nontrivial flat bands, Nat. Commun. 6, 8944 (2015).
[30] National Institute of Materials Science, Materials Information Station, SuperCon, http://supercon.nims.go.jp/index_en.html (2011).
[31] N.B., a model suffering from selection bias can still provide valuable statistical information about known superconductors.
[32] H. Hosono, K. Tanabe, E. Takayama-Muromachi, H. Kageyama, S. Yamanaka, H. Kumakura, M. Nohara, H. Hiramatsu, and S. Fujitsu, Exploration of new superconductors and functional materials, and fabrication of superconducting tapes and wires of iron pnictides, Sci. Technol. Adv. Mater. 16, 033503 (2015).
[33] There are theoretical arguments for this — according to the Kohn-Luttinger theorem, a superconducting instability should be present as T → 0 in any fermionic metallic system with Coulomb interactions [34].
[34] W. Kohn and J. M. Luttinger, New Mechanism for Superconductivity, Phys. Rev. Lett. 15, 524–526 (1965).
[35] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, A general-purpose machine learning framework for predicting properties of inorganic materials, npj Computational Materials 2, 16028 (2016).
[36] W. Setyawan and S. Curtarolo, High-throughput electronic band structure calculations: Challenges and tools, Comput. Mater. Sci. 49, 299–312 (2010).
[37] K. Yang, C. Oses, and S. Curtarolo, Modeling Off-Stoichiometry Materials with a High-Throughput Ab-Initio Approach, Chem. Mater. 28, 6484–6492 (2016).
[38] O. Levy, M. Jahnátek, R. V. Chepulskii, G. L. W. Hart, and S. Curtarolo, Ordered Structures in Rhenium Binary Alloys from First-Principles Calculations, J. Am. Chem. Soc. 133, 158–163 (2011).
[39] O. Levy, G. L. W. Hart, and S. Curtarolo, Structure maps for hcp metals from first-principles calculations, Phys. Rev. B 81, 174106 (2010).
[40] O. Levy, R. V. Chepulskii, G. L. W. Hart, and S. Curtarolo, The New face of Rhodium Alloys: Revealing Ordered Structures from First Principles, J. Am. Chem. Soc. 132, 833–837 (2010).
[41] O. Levy, G. L. W. Hart, and S. Curtarolo, Uncovering Compounds by Synergy of Cluster Expansion and High-Throughput Methods, J. Am. Chem. Soc. 132, 4830–4833 (2010).
[42] G. L. W. Hart, S. Curtarolo, T. B. Massalski, and O. Levy, Comprehensive Search for New Phases and Compounds in Binary Alloy Systems Based on Platinum-Group Metals, Using a Computational First-Principles Approach, Phys. Rev. X 3, 041035 (2013).
[43] M. J. Mehl, D. Hicks, C. Toher, O. Levy, R. M. Hanson, G. L. W. Hart, and S. Curtarolo, The AFLOW Library of Crystallographic Prototypes: Part 1, Comput. Mater. Sci. 136, S1–S828 (2017).
[44] A. R. Supka, T. E. Lyons, L. S. I. Liyanage, P. D'Amico, R. Al Rahal Al Orabi, S. Mahatara, P. Gopal, C. Toher, D. Ceresoli, A. Calzolari, S. Curtarolo, M. Buongiorno Nardelli, and M. Fornari, AFLOWπ: A minimalist approach to high-throughput ab initio calculations including the generation of tight-binding hamiltonians, Comput. Mater. Sci. 136, 76–84 (2017).
[45] C. Toher, J. J. Plata, O. Levy, M. de Jong, M. D. Asta, M. Buongiorno Nardelli, and S. Curtarolo, High-throughput computational screening of thermal conductivity, Debye temperature, and Grüneisen parameter using a quasiharmonic Debye model, Phys. Rev. B 90, 174107 (2014).
[46] E. Perim, D. Lee, Y. Liu, C. Toher, P. Gong, Y. Li, W. N. Simmons, O. Levy, J. J. Vlassak, J. Schroers, and S. Curtarolo, Spectral descriptors for bulk metallic glasses based on the thermodynamics of competing crystalline phases, Nat. Commun. 7, 12315 (2016).
[47] C. Toher, C. Oses, J. J. Plata, D. Hicks, F. Rose, O. Levy, M. de Jong, M. D. Asta, M. Fornari, M. Buongiorno Nardelli, and S. Curtarolo, Combining the AFLOW GIBBS and Elastic Libraries to efficiently and robustly screen thermomechanical properties of solids, Phys. Rev. Materials 1, 015401 (2017).
[48] A. van Roekeghem, J. Carrete, C. Oses, S. Curtarolo, and N. Mingo, High-Throughput Computation of Thermal Conductivity of High-Temperature Solid Phases: The Case of Oxide and Fluoride Perovskites, Phys. Rev. X 6, 041061 (2016).
[49] C. Bishop, Pattern Recognition and Machine Learning (Springer-Verlag, New York, 2006).
[50] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, 2001).
[51] L. Breiman, Random Forests, Mach. Learn. 45, 5–32 (2001).
[52] R. Caruana and A. Niculescu-Mizil, An Empirical Comparison of Supervised Learning Algorithms, in Proceedings of the 23rd International Conference on Machine Learning, ICML '06 (ACM, New York, NY, USA, 2006), pp. 161–168, doi:10.1145/1143844.1143865.
[53] Gini impurity is calculated as Σ_i p_i(1 − p_i), where p_i is the probability of a randomly chosen data point from a given decision tree leaf to be in class i [49, 50].
[54] Y. Kasahara, K. Kuroki, S. Yamanaka, and Y. Taguchi, Unconventional superconductivity in electron-doped layered metal nitride halides MNX (M = Ti, Zr, Hf; X = Cl, Br, I), Physica C 514, 354–367 (2015). Superconducting Materials: Conventional, Unconventional and Undetermined.
[55] Z. P. Yin, A. Kutepov, and G. Kotliar, Correlation-Enhanced Electron-Phonon Coupling: Applications of GW and Screened Hybrid Functional to Bismuthates, Chloronitrides, and Other High-Tc Superconductors, Phys. Rev. X 3, 021011 (2013).
[56] The number of unfilled orbitals refers to the electron configuration of the substituent elements before combining to form oxides. For example, Cu has one unfilled orbital ([Ar]4s2 3d9) and Bi has three ([Xe]4f14 6s2 5d10 6p3). These values are averaged per formula unit.
[57] J. D. Bocarsly, E. E. Levin, C. A. C. Garcia, K. Schwennicke, S. D. Wilson, and R. Seshadri, A Simple Computational Proxy for Screening Magnetocaloric Compounds, Chem. Mater. 29, 1613–1622 (2017).
[58] J. Labbé, S. Barišić, and J. Friedel, Strong-Coupling Superconductivity in V3X type of Compounds, Phys. Rev. Lett. 19, 1039–1041 (1967).
[59] J. E. Hirsch and D. J. Scalapino, Enhanced Superconductivity in Quasi Two-Dimensional Systems, Phys. Rev. Lett. 56, 2732–2735 (1986).
[60] I. E. Dzyaloshinskiǐ, Maximal increase of the superconducting transition temperature due to the presence of van't Hoff singularities, JETP Lett. 46, 118 (1987).
[61] D. Yazici, I. Jeon, B. D. White, and M. B. Maple, Superconductivity in layered BiS2-based compounds, Physica C 514, 218–236 (2015). Superconducting Materials: Conventional, Unconventional and Undetermined.
[62] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O'Reilly Media, 2012).
[63] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, Magpie Software, https://bitbucket.org/wolverton/magpie (2016), doi:10.1038/npjcompumats.2016.28.
[64] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res. (2011).