
Machine learning modeling of superconducting critical temperature

Valentin Stanev,1,2 Corey Oses,3,4 A. Gilad Kusne,1,5 Efrain Rodriguez,6,2 Johnpierre Paglione,7,2 Stefano Curtarolo,3,4,8 and Ichiro Takeuchi1,2

1 Department of Materials Science and Engineering, University of Maryland, College Park, MD 20742-4111, USA
2 Center for Nanophysics and Advanced Materials, University of Maryland, College Park, Maryland 20742, USA
3 Department of Mechanical Engineering and Materials Science, Duke University, Durham, North Carolina 27708, USA
4 Center for Materials Genomics, Duke University, Durham, North Carolina 27708, USA
5 National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
6 Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA
7 Department of Physics, University of Maryland, College Park, Maryland 20742, USA
8 Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin-Dahlem, Germany

arXiv:1709.02727v1 [cond-mat.supr-con] 8 Sep 2017
(Dated: September 11, 2017)
Superconductivity has been the focus of enormous research effort since its discovery more than a century ago. Yet, some
features of this unique phenomenon remain poorly understood; prime among these is the connection between superconductivity
and chemical/structural properties of materials. To bridge the gap, several machine learning methods are developed herein to
model the critical temperatures (Tc) of the 12,000+ known superconductors available via the SuperCon database. Materials
are first divided into two classes based on their Tc ’s, above and below 10 K, and a classification model predicting this label is
trained. The model uses coarse-grained features based only on the chemical compositions. It shows strong predictive power,
with out-of-sample accuracy of about 92%. Separate regression models are developed to predict the values of Tc for cuprate,
iron-based, and “low-Tc ” compounds. These models also demonstrate good performance, with learned predictors offering
important insights into the mechanisms behind superconductivity in different families of materials. To improve the accuracy
and interpretability of these models, new features are incorporated using materials data from the AFLOW Online Repositories.
Finally, the classification and regression models are combined into a single integrated pipeline and employed to search the entire
Inorganic Crystallographic Structure Database (ICSD) for potential new superconductors. We identify about 30 non-cuprate
and non-iron-based oxides as candidate materials.

INTRODUCTION

Extensive databases covering various measured and calculated materials properties have been created over the years [1-5]. Such information can provide invaluable guidance in the discovery and design of materials with improved and novel properties. The sheer quantity of readily accessible information also makes possible, and even necessary, the use of data-driven approaches, e.g., statistical and machine learning (ML) methods [6-9]. Such algorithms can be developed/trained on the variables collected in these databases, and employed to predict macroscopic properties such as the melting temperatures of binary compounds [10], the likely crystal structure at a given composition [11], and the band gap energies [12, 13] and density of states [12] of certain classes of materials.

Superconductivity, despite being the subject of intense physics, chemistry and materials science research for more than a century, remains one of the most puzzling scientific topics [14]. It is an intrinsically quantum phenomenon caused by Bose-Einstein condensation of paired electrons, with unique properties including zero DC resistivity, the Meissner and Josephson effects, and an ever-growing list of current and potential applications. There is even a profound connection between phenomena in the superconducting state and the Higgs mechanism in particle physics [15]. However, understanding the relationship between superconducting properties and materials' chemistry/structure presents significant theoretical and experimental challenges. In particular, despite focused research efforts in the last 30 years, the mechanism responsible for high-temperature superconductivity in the cuprate family of materials remains elusive [16].

To address the problem, we develop several ML methods modeling the superconducting critical temperatures (Tc) of various materials and apply them to the complete list of reported (inorganic) superconductors [1]. In their simplest form, these methods take as input a number of predictors generated from the elemental composition of each material. Models developed with these basic features are surprisingly accurate, despite lacking information on relevant properties such as space group, electronic structure, and phonon energies. To further improve the predictive power of the models, as well as the ability to extract useful information from them, another set of features is constructed based on crystallographic and electronic information taken from the AFLOW Online Repositories [17-20].

Application of statistical methods in the context of superconductivity began in the early eighties with simple clustering methods [21, 22]. In particular, three "golden" descriptors confine the sixty known (at the time) superconductors with Tc > 10 K to three small islands in space: the averaged valence-electron numbers, orbital radii differences, and metallic electronegativity differences. Conversely, about 600 other superconductors with Tc < 10 K appear randomly dispersed in the same space. These

descriptors were selected heuristically due to their success in classifying binary/ternary structures and predicting stable/metastable ternary quasicrystals. Recently, an investigation stumbled on this clustering problem again by observing a threshold Tc closer to log(Tc^thres) ≈ 1.3 (Tc^thres = 20 K) [23]. Instead of a heuristic approach, random forests and simplex fragments were leveraged on the structural/electronic properties data from the AFLOW Online Repositories to find the optimum clustering descriptors. A classification model was developed showing good performance. Separately, a sequential learning framework was evaluated on superconducting materials, exposing the limitations of relying on random-guess (trial-and-error) approaches for breakthrough discoveries [24]. Subsequently, this study also highlights the impact machine learning can have on this particular field. Other contemporary work homes in on specific materials [25] and families of superconductors [26] (see also Ref. [27]).

Whereas previous investigations trained on several hundred compounds at most, this work considers more than 16,000 different compositions. These are extracted from the SuperCon database, which contains an exhaustive list of superconductors, including many closely-related materials (varying only by small changes in stoichiometry). The order-of-magnitude increase in training data (i) uncovers crucial subtleties in chemical composition among related compounds and (ii) exposes different superconducting mechanisms via family-specific modeling. It also enables the optimization of several steps of the process of building ML models. Large sets of independent variables can be constructed and rigorously filtered by predictive power (rather than selecting them by intuition alone). These advances are crucial to the success of our ML models in understanding the emergence/suppression of superconductivity with composition.

As a demonstration of the potential of ML methods in looking for novel superconductors, we combined and applied several models to search for candidates among the roughly 110,000 different compositions contained in the Inorganic Crystallographic Structure Database (ICSD). The framework highlights 30 interesting compounds with predicted Tc's above 20 K for experimental validation. Interestingly, all share a peculiar feature in their electronic band structure: one (or more) flat/nearly-flat bands just below the Fermi level. The associated large peak in the density of states (infinitely large in the limit of truly flat bands) can lead to strong electronic instability, and has been discussed recently as one possible route to high-temperature superconductivity [28, 29].

DATA AND METHODS

Superconductivity data. The success of any ML method ultimately depends on access to reliable and plentiful data. Superconductivity data used in this work is extracted from the SuperCon database [30], created and maintained by the Japanese National Institute for Materials Science. It houses information such as the Tc and reporting journal publication for superconducting materials known from experiment. Assembled within it is a uniquely exhaustive list of all reported superconductors, as well as related non-superconducting compounds. The database consists of two separate subsets: "Oxide & Metallic" (inorganic materials containing metals, alloys, cuprate high-temperature superconductors, etc.) and "Organic" (organic superconductors). Downloading the entire inorganic materials dataset and removing compounds with incompletely-specified chemical compositions leaves about 22,000 entries. In the case of multiple records for the same material, the reported Tc's are averaged, but only if their standard deviation is less than 5 K; they are discarded otherwise. This brings the total down to about 16,400 compounds, of which around 4,000 have no critical temperature reported. Of these, roughly 5,700 compounds are cuprates and 1,500 are iron-based (about 35% and 9%, respectively), reflecting the significant research efforts invested in these two families. The remaining set of about 8,000 is a mix of various materials, including conventional phonon-driven superconductors (e.g., elemental superconductors, A15 compounds), known unconventional superconductors like the layered nitrides and heavy fermions, and many materials for which the mechanism of superconductivity is still under debate (such as bismuthates and borocarbides). The distribution of materials by Tc for the three groups is shown in Figure 2a.

There are occasional problems with the validity and consistency of some of the data. For example, the database includes some reports based on tenuous experimental evidence and only indirect signatures of superconductivity, as well as reports of inhomogeneous (surface, interfacial) and nonequilibrium phases. Even in cases of bona fide bulk superconducting phases, important relevant variables like pressure are not recorded. Though some of the obviously erroneous records were removed from the data (see Supplementary Materials), these issues were largely ignored, assuming their effect on the entire dataset to be relatively modest.

A more serious problem is the use of the database itself. Training a model only on superconductors can lead to significant selection bias that may render it ineffective when applied to new materials [31]. Even if the model learns to correctly recognize factors promoting superconductivity, it may miss effects that strongly inhibit it due to the lack of exposure to the relevant negative examples. To mitigate the effect, we incorporate about 300 materials found by H. Hosono's group not to display superconductivity [32]. The presence of non-superconducting materials, along with those without Tc reported in SuperCon, leads to a conceptual problem. Surely, some of these compounds emerge as non-superconducting "end-members" from doping/pressure studies, indicating no superconducting transition was observed despite some efforts to

find one. However, a transition may still exist, albeit at experimentally difficult to reach or altogether inaccessible temperatures (for most practical purposes, below 10 mK) [33]. This presents a conundrum: ignoring compounds with no reported Tc disregards a potentially important part of the dataset, while assuming Tc = 0 K prescribes an inadequate description for (at least some of) these compounds. To circumvent the problem, materials are first partitioned into two groups by their Tc, above and below a threshold temperature (Tsep), for the creation of a classification model. Compounds with no reported critical temperature can be classified in the "below-Tsep" group without the need to specify a Tc value (or assume it is zero).

Chemical and structural features. For most materials, the SuperCon database provides only the chemical composition and Tc. To convert this information into meaningful features/predictors (used interchangeably), we employ the Materials Agnostic Platform for Informatics and Exploration (Magpie) [35]. Magpie computes a set of 145 attributes for each material, including: (i) stoichiometric features (depending only on the ratio of elements and not the specific species); (ii) elemental property statistics: the mean, mean absolute deviation, range, minimum, maximum, and mode of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature); (iii) electronic structure attributes: the average fraction of electrons from the s, p, d and f valence shells among all elements present; and (iv) ionic compound features that include whether it is possible to form an ionic compound assuming all elements exhibit a single oxidation state.

The application of Magpie predictors, though appearing to lack a priori justification, expands upon past clustering approaches by Villars and Rabe [21, 22]. They show that, in the space of a few judiciously chosen heuristic predictors, materials separate and cluster according to their crystal structure and even complex properties such as high-temperature ferroelectricity and superconductivity. Similar to these features, Magpie predictors capture significant chemical information, which plays a decisive role in determining structural and physical properties of materials.

Despite the success of Magpie predictors in modeling material properties [35], interpreting their connection to superconductivity presents a serious challenge. They do not encode (at least directly) many important materials properties, particularly those pertinent to superconductivity. Incorporating features like lattice type and density of states would undoubtedly lead to significantly more powerful and interpretable models. Since such information is not generally available in SuperCon, we employ data from the AFLOW Online Repositories [17-20]. This materials database houses nearly 170 million properties calculated with the software package AFLOW [2, 36-44]. AFLOW is a high-throughput ab initio framework that manages density functional theory (DFT) calculations in accordance with the AFLOW Standard [19]. The Standard ensures that the calculations and derived properties are empirical (reproducible), reasonably well-converged, and, above all, consistent (fixed set of parameters), a particularly attractive feature for ML modeling. Many materials properties important for superconductivity have been calculated within the AFLOW framework, and are easily accessible through the AFLOW Online Repositories, which contain information for the vast majority of compounds in the ICSD [1]. Although the AFLOW Online Repositories contain calculated properties, the DFT results have been extensively validated with ICSD records [13, 23, 45-48].

Unfortunately, only a small subset of materials in SuperCon overlaps with those in the ICSD: about 800 with finite Tc, and less than 600 are contained within AFLOW. For these, a set of 26 predictors is incorporated from the AFLOW Online Repositories, including structural/chemical information like the lattice type, space group, volume of the unit cell, density, ratios of the lattice parameters, Bader charges and volumes, and formation energy (see Supplementary Materials). In addition, electronic properties are considered, including the density of states near the Fermi level as calculated by AFLOW. Previous investigations exposed limitations in applying ML methods to a similar dataset in isolation [23]. Instead, a framework is presented for combining models built on Magpie descriptors (large sampling, but features limited to compositional data) and AFLOW features (small sampling, but diverse and pertinent features).

Machine learning algorithms. Once we have a list of relevant predictors, various ML models can be applied to the data [49, 50]. All ML algorithms in this work are variants of the random forest method [51]. Fundamentally, the approach combines many individual decision trees, where each tree is a non-parametric supervised learning method used for modeling either categorical or numerical variables (i.e., classification or regression modeling). A tree predicts the value of a target variable by learning simple decision rules inferred from the available features (see Figure 1 for an example). The deeper the tree, the more complex the relationships it can learn, but also the greater the danger of overfitting, i.e., learning some irrelevant information or just "noise".

The random forest method creates a set of individual decision trees (hence the "forest"), each built to solve the same classification/regression problem. It then combines their results, either by voting or averaging, depending on the problem. To make the forest more robust to overfitting, individual trees in the ensemble are built from samples drawn with replacement (a bootstrap sample) from the training set. In addition, when splitting a node during the construction of a tree, the model chooses the best split of the data considering only a random subset of the features. The hyperparameters used to optimize the model are described in the Supplementary Material.

Random forest is one of the most powerful, versatile, and widely-used ML methods [52]. There are several advantages that make it especially suitable for this problem.
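The "elemental property statistics" class of Magpie-style predictors described above can be illustrated with a short sketch. The element table below is a tiny, approximate stand-in (three properties for two elements), not Magpie's actual data files, and only a few of the statistics are computed:

```python
from statistics import pstdev

# Tiny illustrative element table (values approximate; Magpie draws on 22 properties).
ELEMENTS = {
    "Mg": {"atomic number": 12, "column": 2,  "melting T": 923.0},
    "B":  {"atomic number": 5,  "column": 13, "melting T": 2349.0},
}

def composition_features(composition):
    """Statistics over elemental property values, in the spirit of Magpie's
    elemental-property-statistics predictors (a sketch, not the full set)."""
    feats = {}
    total = sum(composition.values())
    for prop in ("atomic number", "column", "melting T"):
        values = [ELEMENTS[el][prop] for el in composition]
        weights = [n / total for n in composition.values()]
        feats[f"avg({prop})"] = sum(w * v for w, v in zip(weights, values))
        feats[f"range({prop})"] = max(values) - min(values)
        feats[f"std({prop})"] = pstdev(values)
    return feats

feats = composition_features({"Mg": 1, "B": 2})  # MgB2
print(feats["avg(atomic number)"])               # (1*12 + 2*5)/3 ≈ 7.33
```

Repeating such statistics over many elemental properties turns a bare chemical formula into a fixed-length numeric vector suitable for tree-based models.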

[Figure 1 diagram: the top three levels of one example decision tree. The root node splits on std(column number) ≤ 4.16 (samples = 100.0%, proportion = [0.62, 0.38], class = Tc below 10 K); deeper splits use std(Tmelt), avg(atomic weight), and std(electronegativity), e.g., std(Tmelt) ≤ 418.92 K, avg(atomic weight) ≤ 102.81 u, avg(atomic weight) ≤ 80.01 u, std(Tmelt) ≤ 672.09 K, avg(atomic weight) ≤ 48.40 u, and std(electronegativity) ≤ 0.52.]
FIG. 1. Schematic of the random forest ML approach. Example of a single decision tree used to classify materials
depending on whether their Tc is above or below 10 K. A tree can have many levels, but only the top three are shown. The
decision rules leading to each subset are written inside individual rectangles. The subset population percentage is given by
“samples”, and the node color/shade represents the degree of separation, i.e., dark blue/orange illustrates a high proportion
of Tc > 10 K/Tc < 10 K materials (the exact value is given by “proportion”). A random forest consists of a large number —
could be hundreds or thousands — of such individual trees.
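A forest of such trees can be trained to predict the above/below-10 K label directly. The sketch below uses scikit-learn on synthetic stand-in data; the feature matrix is a placeholder for Magpie-style predictors, not the actual SuperCon inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                  # stand-in for Magpie-style predictors
# Synthetic Tc: one feature controls a 20 K jump, plus 2 K of noise.
tc = 5.0 + 20.0 * (X[:, 0] > 0) + rng.normal(scale=2.0, size=400)
y = (tc > 10.0).astype(int)                    # class label: Tc above/below 10 K

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

acc = forest.score(X_te, y_te)
print("test accuracy:", acc)
print("feature importances:", forest.feature_importances_.round(2))
```

The `feature_importances_` attribute is the Gini-importance ranking discussed later in the text; on this toy data it correctly singles out the one informative feature.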

First, it can learn complicated non-linear dependencies from the data. Unlike many other methods (e.g., linear regression), it does not make any assumptions about the relationship between the predictors and the target variable. Second, random forests are quite tolerant to heterogeneity in the training data. They can handle both numerical and categorical data which, furthermore, do not need extensive and potentially dangerous preprocessing, such as scaling or normalization. Even the presence of strongly correlated predictors is not a problem for model construction (unlike for many other ML algorithms). Another significant advantage of this method is that, by combining information from individual trees, it can estimate the importance of each predictor, thus making the model more interpretable. However, unlike model construction, determination of predictor importance is complicated by the presence of correlated features. To avoid this, standard feature selection procedures are employed along with a rigorous predictor elimination scheme (based on their strength and correlation with others). Overall, these methods reduce the complexity of the models and improve our ability to interpret them.

RESULTS AND DISCUSSION

Classification models. As a first step in applying ML methods to the dataset, a sequence of classification models are created, each designed to separate materials into two distinct groups depending on whether Tc is above or below some predetermined value. The temperature that separates the two groups (Tsep) is treated as an adjustable parameter of the model, though some physical considerations should guide its choice as well. Classification ultimately allows compounds with no reported Tc to be used in the training set by including them in the below-Tsep bin. Although discretizing continuous variables is not generally recommended, in this case the benefits of including compounds without Tc outweigh the potential information loss.

In order to choose the optimal value of Tsep, a series of random forest models are trained with different threshold temperatures separating the two classes. Since setting Tsep too low or too high creates strongly imbalanced classes (with many more instances in one group), it is important to compare the models using several different metrics. Focusing only on the accuracy (count of correctly-classified instances) can lead to deceptive results. Hypothetically, if 95% of the observations in the dataset are in the below-Tsep group, simply classifying all materials as such would yield a high accuracy (95%), while being trivial in any other sense. To avoid this potential pitfall, three other standard metrics for classification are considered: precision, recall, and F1 score. They are defined using the values tp, tn, fp, and fn for the count of true/false positive/negative predictions of the


FIG. 2. SuperCon dataset and classification model performance. (a) Histogram of materials categorized by Tc (bin size
is 2 K, only those with finite Tc are counted). Blue, green, and red denote “low-Tc ”, iron-based, and cuprate superconductors,
respectively. In the inset: histogram of materials categorized by ln (Tc ) restricted to those with Tc > 10 K. (b) Performance of
different classification models as a function of the threshold temperature (Tsep) that separates materials into two classes by Tc.
Performance is measured by accuracy (gray), precision (red), recall (blue), and F1 score (purple). The scores are calculated
from predictions on an independent test set, i.e., one separate from the dataset used to train the model. In the inset: the
dashed red curve gives the proportion of materials in the above-Tsep set.
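The duplicate-handling rule used to assemble the dataset shown in Figure 2a (average multiple Tc reports for the same composition, discard the material if their standard deviation exceeds 5 K) can be sketched as follows; the compositions and values are made-up examples:

```python
from collections import defaultdict
from statistics import mean, stdev

# Toy stand-in for SuperCon records: (composition, reported Tc in K).
records = [("MgB2", 39.0), ("MgB2", 38.5),
           ("NbN", 16.0), ("NbN", 10.0), ("NbN", 1.0)]

grouped = defaultdict(list)
for formula, tc in records:
    grouped[formula].append(tc)

# Average duplicate reports; discard a material if its reports scatter by > 5 K.
kept = {}
for formula, tcs in grouped.items():
    spread = stdev(tcs) if len(tcs) > 1 else 0.0
    if spread <= 5.0:
        kept[formula] = mean(tcs)

print(kept)  # {'MgB2': 38.75} -- NbN dropped (std of its reports ≈ 7.5 K > 5 K)
```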

model:

accuracy ≡ (tp + tn) / (tp + tn + fp + fn),   (1)

precision ≡ tp / (tp + fp),   (2)

recall ≡ tp / (tp + fn),   (3)

F1 ≡ 2 × (precision × recall) / (precision + recall),   (4)

where positive/negative refers to above-Tsep/below-Tsep. The accuracy of a classifier is the total proportion of correctly-classified materials, while precision measures the proportion of correctly-classified above-Tsep superconductors out of all predicted above-Tsep. The recall is the proportion of correctly-classified above-Tsep materials out of all truly above-Tsep compounds. While the precision measures the probability that a material selected by the model actually has Tc > Tsep, the recall reports how sensitive the model is to above-Tsep materials. Maximizing the precision or recall would require some compromise with the other, i.e., a model that labels all materials as above-Tsep would have perfect recall but dismal precision. To quantify the trade-off between recall and precision, their harmonic mean (F1 score) is widely used to measure the performance of a classification model. With the exception of accuracy, these metrics are not symmetric with respect to the exchange of positive and negative labels.

For a realistic estimate of the performance of each model, the dataset is randomly split (85%/15%) into training and test subsets. The training set is employed to fit the model, which is then applied to the test set for subsequent benchmarking. The aforementioned metrics (Equations 1-4) calculated on the test set provide an unbiased estimate of how well the model is expected to generalize to a new (but similar) dataset. With the random forest method, similar estimates can be obtained intrinsically at the training stage. Since each tree is trained only on a bootstrapped subset of the data, the remaining subset can be used as an internal test set. These two methods for quantifying model performance usually yield very similar results.

With the procedure in place, the models' metrics are evaluated for a range of Tsep and illustrated in Figure 2b. The accuracy increases as Tsep goes from 1 K to 40 K, and the proportion of above-Tsep compounds drops from above 70% to about 15%, while the recall and F1 score generally decrease. The region between 5-15 K is especially appealing in maximizing/nearly-maximizing all benchmarking metrics while balancing the sizes of the bins. In fact, setting Tsep = 10 K is a particularly convenient choice. It is also the temperature used in Refs. [21, 22] to separate the two classes, as it is just


FIG. 3. Scatter plots of 3, 000 superconductors in the space of the four most important classification predictors.
Blue/red represent below-Tsep /above-Tsep materials, where Tsep = 10 K. (a) Feature space of the first and second most important
predictors: standard deviations of the column numbers and electronegativities (calculated over the values for the constituent
elements in each compound). (b) Feature space of the third and fourth most important predictors: standard deviation of the
elemental melting temperatures and average of the atomic weights.
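Equations 1-4 translate directly into code. The helper below computes all four scores from the confusion counts; the numbers in the example call are arbitrary illustrative counts, not results from the paper's models:

```python
# The four metrics of Equations 1-4, computed directly from the
# true/false positive/negative counts of a classifier.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. (1)
    precision = tp / (tp + fp)                      # Eq. (2)
    recall = tp / (tp + fn)                         # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    return accuracy, precision, recall, f1

# Example counts: 80 true positives, 95 true negatives,
# 5 false positives, 20 false negatives.
acc, prec, rec, f1 = classification_metrics(tp=80, tn=95, fp=5, fn=20)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
# accuracy=0.875 precision=0.941 recall=0.800 F1=0.865
```

Note the asymmetry the text mentions: swapping the positive and negative labels (i.e., exchanging tp with tn and fp with fn) changes precision, recall, and F1, but not accuracy.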

above the highest Tc of all elements and pseudoelemental materials (solid solutions whose range of composition includes a pure element). Here, the proportion of above-Tsep materials is approximately 38% and the accuracy is about 92%, i.e., the model can correctly classify nine out of ten materials — much better than random guessing. The recall — quantifying how well all above-Tsep compounds are labeled and, thus, the most important metric when searching for new superconducting materials — is even higher. (Note that the models' metrics also depend on random factors such as the composition of the training and test sets, and their exact values can vary.)

For an understanding of what the model has learned, an analysis of the chosen predictors is needed. In the random forest method, features can be ordered by their importance, quantified via the so-called Gini importance or "mean decrease in impurity" [49, 50]. For a given feature, it is the sum of the Gini impurity [53] over the number of splits that include the feature, weighted by the number of samples it splits, and averaged over the entire forest. Due to the nature of the algorithm, the closer to the top of the tree a predictor is used, the greater the number of predictions it impacts.

Although correlations do not affect the model's ability to learn from features, they can distort importance estimates. For example, a material property with a strong effect on Tc can be shared among several correlated predictors. Since the model can access the same information through any of these variables, their relative importances are diluted across the group. To reduce the effect and limit the list of predictors to a manageable size, the backward feature elimination method is employed. The process begins with a model constructed with the full list of predictors, and iteratively removes the least significant one, rebuilding the model and recalculating importances with every iteration. (This iterative procedure is necessary since the ordering of the predictors by importance can change at each step.) Predictors are removed as long as the accuracy drops by no more than 2%, reducing the full list of 145 down to 5. Furthermore, two of these predictors are strongly correlated with each other, and we remove the less important one. This has a negligible impact on the model performance, yielding four predictors total (see Table 1) with an above 90% accuracy score — only slightly worse than the full model. Scatter plots of the pairs of the most important predictors are shown in Figure 3, where blue/red denotes whether the material is in the below-Tsep/above-Tsep class. Figure 3a shows a scatter plot of 3,000 compounds in the space spanned by the standard deviations of the column numbers and electronegativities calculated over the elemental values. Superconductors with Tc > 10 K tend to cluster in the upper-right corner of the plot and in a relatively thin elongated region extending to the left of it. In fact, the points in the upper-right corner represent mostly cuprate materials, which, with their complicated compositions and large number of elements, are likely to have high standard deviations in these variables. Figure 3b shows the same compounds projected in the space of the standard deviations of the melting temperatures and the

TABLE 1. The most relevant predictors and their importances for the classification and general regression models. "avg(x)" and "std(x)" denote the composition-weighted average and standard deviation, respectively, calculated over the vector of elemental values for each compound [35]. For the classification model, all predictor importances are quite close.

rank | classification model               |      | regression model (general; Tc > 10 K) |
1    | std(column number)                 | 0.26 | avg(number of unfilled orbitals)       | 0.26
2    | std(electronegativity)             | 0.26 | std(ground state volume)               | 0.18
3    | std(melting temperature)           | 0.23 | std(space group number)                | 0.17
4    | avg(atomic weight)                 | 0.24 | avg(number of d unfilled orbitals)     | 0.17
5    | -                                  |      | std(number of d valence electrons)     | 0.12
6    | -                                  |      | avg(melting temperature)               | 0.10
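The backward feature elimination scheme described in the text can be sketched with scikit-learn. The 2% stopping tolerance follows the text; the data is synthetic (ten stand-in predictors, of which only the first two are informative) and everything else is a placeholder:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter
features = list(range(X.shape[1]))

def cv_accuracy(cols):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, cols], y, cv=3).mean()

baseline = cv_accuracy(features)
while len(features) > 1:
    # Rebuild the model and recompute importances at every step,
    # since the importance ordering can change after each removal.
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, features], y)
    weakest = features[int(np.argmin(clf.feature_importances_))]
    trial = [f for f in features if f != weakest]
    if cv_accuracy(trial) < baseline - 0.02:    # stop if accuracy drops by > 2%
        break
    features = trial

print("kept predictors:", features)
```

On the real dataset this procedure pares the 145 Magpie predictors down to a handful; here it should retain the dominant informative feature while shedding pure-noise columns.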

means of the atomic weights of the elements forming each compound. The above-Tsep materials tend to cluster in areas with lower mean atomic weights — not a surprising result given the role of phonons in conventional superconductivity.

For comparison, we create another classifier based on the average number of valence electrons, metallic electronegativity differences, and orbital radii differences, i.e., the predictors used in Refs. [21, 22] to cluster materials with Tc > 10 K. A classifier built only with these three predictors is less accurate than both the full and the truncated models presented herein, but comes quite close: the full model has about 3% higher accuracy and F1 score, while the truncated model with four predictors is less than 2% more accurate. The rather small (albeit not insignificant) differences demonstrate that even on the scale of the entire SuperCon dataset, the predictors used by Villars and Rabe [21, 22] capture much of the relevant chemical information for superconductivity.

Regression models. After constructing a successful classification model, we now move to the more difficult challenge of predicting Tc. Creating a regression model may enable better understanding of the factors controlling the Tc of known superconductors, while also serving as an organic part of a system for identifying potential new ones. Leveraging the same set of elemental predictors as the classification model, several regression models are presented focusing on materials with Tc > 10 K. This avoids the problem of materials with no reported Tc, under the assumption that, if they were to exhibit superconductivity at all, their critical temperature would be below 10 K. Another problem is that the Tc's are unevenly distributed over the Tc axis (see Figure 2a). To avoid this, ln(Tc) is used as the target variable instead of Tc (Figure 2a inset), which creates a more uniform distribution and is also considered a best practice when the range of a target variable covers more than one order of magnitude (as in the case of Tc). Following this transformation, the dataset is parsed randomly (85%/15%) into training and test subsets (as was done for the classification model).

Present within the dataset are distinct families of superconductors with different driving mechanisms for superconductivity, including cuprate and iron-based high-temperature superconductors, with all others denoted "low-Tc" for brevity (no specific mechanism is implied for this group). Surprisingly, a single regression model does reasonably well among the different families — benchmarked on the test set, the model achieves R² ≈ 0.88 (Figure 4a). This suggests that the random forest algorithm is flexible and powerful enough to automatically separate the compounds into groups and create group-specific branches with distinct predictors (no explicit group labels were used during training and testing). As validation, three separate models are constructed, each trained only on a specific family, namely the "low-Tc", cuprate, and iron-based superconductors, respectively. Benchmarked on mixed-family test sets, the models perform well on compounds belonging to their training family while demonstrating no predictive power on the others. Figures 4b-e illustrate a cross-section of this comparison. Specifically, the model trained on "low-Tc" compounds dramatically underestimates the Tc of both high-temperature superconducting families (Figures 4b and c), even though this test set only contains compounds with Tc < 40 K. Conversely, the model trained on the cuprates tends to overestimate the Tc of "low-Tc" (Figure 4d) and iron-based (Figure 4e) superconductors. This is a clear indication that superconductors from these groups have different factors determining their Tc. Interestingly, the family-specific models do not perform better than the general regression containing all the data points: R² for the "low-Tc" materials is about 0.85, for cuprates just below 0.8, and for iron-based compounds about 0.74. In fact, it is a purely geometric effect that the combined model has the highest R². Each group of superconductors contributes mostly to a distinct temperature range, and, as a result, the combined line of predicted-vs.-measured Tc is better determined over a longer interval.

In order to reduce the number of predictors and increase the interpretability of these models without significant detriment to their performance, a backward feature elimination process is again employed. The procedure is very similar to the one described previously for the classification model, with the only difference being that the reduction is guided by the R² of the model, rather than the accuracy (the procedure stops when R² drops by 3%). The most important predictors for the four models
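The regression setup described above — ln(Tc) as the target, a random 85%/15% train/test split, and a random forest regressor scored by R² — can be sketched as follows. This is a minimal reimplementation on synthetic data, not the authors' code; the fabricated predictor matrix and Tc values merely stand in for the SuperCon-derived features.

```python
# Sketch of the regression setup: log-transform Tc, split 85%/15%,
# fit a random forest, report R^2 on the held-out set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))                            # placeholder Magpie-style predictors
tc = 10.0 + 80.0 * X[:, 0] ** 2 + rng.uniform(0, 5, 1000)  # fake Tc > 10 K values
y = np.log(tc)                                             # log target evens out the distribution

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"test R^2 = {r2_score(y_te, reg.predict(X_te)):.2f}")
```

Predictions come back in log space, so a predicted critical temperature is recovered as `np.exp(reg.predict(x))`.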
TABLE 2. The most significant predictors and their importances for the three material-specific regression models. "avg(x)", "std(x)", "max(x)" and "frac(x)" denote the composition-weighted average, standard deviation, maximum, and fraction, respectively, taken over the elemental values for each compound. The l2-norm of a composition is calculated by ||x||_2 = sqrt(sum_i x_i^2), where x_i is the proportion of each element i in the compound.

rank  regression ("low-Tc")                       regression (cuprates)                       regression (Fe-based)
1     frac(d valence electrons)           0.18    avg(number of unfilled orbitals)    0.22    std(column number)          0.17
2     avg(number of d unfilled orbitals)  0.14    std(number of d valence electrons)  0.13    avg(ionic character)        0.15
3     avg(number of valence electrons)    0.13    frac(d valence electrons)           0.13    std(Mendeleev number)       0.14
4     frac(s valence electrons)           0.11    std(ground state volume)            0.13    std(covalent radius)        0.14
5     avg(number of d valence electrons)  0.09    std(number of valence electrons)    0.10    max(melting temperature)    0.14
6     avg(covalent radius)                0.09    std(row number)                     0.08    avg(Mendeleev number)       0.14
7     avg(atomic weight)                  0.08    ||composition||_2                   0.07    ||composition||_2           0.11
8     avg(Mendeleev number)               0.07    std(number of s valence electrons)  0.07    -
9     avg(space group number)             0.07    std(melting temperature)            0.07    -
10    avg(number of unfilled orbitals)    0.06    -                                           -
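The composition-weighted statistics defined in the table caption are simple to compute from a compound's atomic fractions. The sketch below illustrates the definitions with a hypothetical example (MgB2, with approximate elemental melting points as the property); it is an illustration of the caption's formulas, not the Magpie implementation.

```python
# Composition-weighted average/std/max of an elemental property, plus the
# l2-norm of the composition, as defined in the TABLE 2 caption.
import math

def weighted_stats(fractions, values):
    """avg, std, max of an elemental property, weighted by atomic fractions."""
    avg = sum(f * v for f, v in zip(fractions, values))
    var = sum(f * (v - avg) ** 2 for f, v in zip(fractions, values))
    return avg, math.sqrt(var), max(values)

def l2_norm(fractions):
    """||x||_2 = sqrt(sum_i x_i^2) over the elemental fractions x_i."""
    return math.sqrt(sum(f * f for f in fractions))

# Example: MgB2 has fractions (1/3, 2/3); use melting temperatures (K) as the property.
fracs = [1 / 3, 2 / 3]
melt = [923.0, 2349.0]          # Mg, B melting points (approximate)
avg, std, mx = weighted_stats(fracs, melt)
print(avg, std, mx, l2_norm(fracs))
```

Note that the l2-norm depends only on the stoichiometry, which is why it can distinguish, e.g., binary from ternary compounds regardless of which elements are present.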

(one general and three family-specific), together with their importances, are shown in Tables 1 and 2. Differences in important predictors across the family-specific models reflect the fact that distinct mechanisms are responsible for driving superconductivity among these groups. The list is longest for the "low-Tc" superconductors, reflecting the eclectic nature of this group. Similar to the general regression model, different branches are likely created for distinct sub-groups. Nevertheless, some important predictors have a straightforward interpretation. As illustrated in Figure 5a, low average atomic weight is a necessary (albeit not sufficient) condition for achieving high Tc among the "low-Tc" group. In fact, the maximum Tc for a given weight roughly follows 1/√mA. Mass plays a significant role in conventional superconductors through the Debye frequency of phonons, leading to the well-known formula Tc ∼ 1/√m, where m is the ionic mass. Other factors like the density of states are also important, which explains the spread in Tc for a given mA. Outlier materials clearly lying above the ∼ 1/√mA line include bismuthates and chloronitrates, suggesting the conventional electron-phonon mechanism is not driving superconductivity in these materials. Indeed, chloronitrates exhibit a very weak isotope effect [54], though some unconventional electron-phonon coupling could still be important for superconductivity [55]. Such findings validate the ability of ML approaches to discover meaningful patterns that encode true physical phenomena.

Similar Tc-vs.-predictor plots reveal more interesting and subtle features. A narrow cluster of materials with Tc > 20 K emerges in the context of the mean covalent radii of compounds — another important predictor for "low-Tc" superconductors. The cluster includes (left-to-right) alkali-doped C60, MgB2-related compounds, and bismuthates. The sector likely characterizes a region of strong covalent bonding and corresponding high-frequency phonon modes that enhance Tc (however, frequencies that are too high become irrelevant for superconductivity). Another interesting relation appears in the context of the average number of d valence electrons. Figure 5c illustrates a fundamental bound on the Tc of all non-cuprate and non-iron-based superconductors.

A similar limit exists for cuprates based on the average number of unfilled orbitals (Figure 5d). It appears to be quite rigid — several data points found above it on inspection are actually incorrectly recorded entries in the database and were subsequently removed. The connection between Tc and the average number of unfilled orbitals [56] may offer new insight into the mechanism for superconductivity in this family. Known trends include higher Tc's for structures that (i) stabilize more than one superconducting Cu-O plane per unit cell and (ii) add more polarizable cations such as Tl3+ and Hg2+ between these planes. The connection reflects these observations, since more copper and oxygen per formula unit leads to a lower average number of unfilled orbitals (one for copper, two for oxygen). Further, the lower-Tc cuprates typically consist of Cu2+-/Cu3+-containing layers stabilized by the addition/substitution of hard cations, such as Ba2+ and La3+, respectively. These cations have a large number of unfilled orbitals, thus increasing the compound's average. Therefore, the ability of between-sheet cations to contribute charge to the Cu-O planes may indeed be quite important. The more polarizable the A cation, the more electron density it can contribute to the already strongly covalent Cu2+–O bond.

Including AFLOW. The models described previously demonstrate surprising accuracy and predictive power, especially considering the difference between the relevant energy scales of most Magpie predictors (typically in the range of eV) and superconductivity (meV scale). This disparity, however, hinders the interpretability of the models, i.e., the ability to extract meaningful physical correlations. Thus, it is highly desirable to create accurate ML models with features based on measurable macroscopic properties of the actual compounds (e.g.,
FIG. 4. Benchmarking of regression models predicting ln(Tc ). (a) Predicted vs. measured ln(Tc ) for the general
regression model. The test set comprises a mix of “low-Tc ”, iron-based, and cuprate superconductors with Tc > 10 K.
With an R2 of about 0.88, this one model can accurately predict Tc for materials in different superconducting groups. (b and
c) Predictions of the regression model trained solely on “low-Tc ” compounds for test sets containing cuprate and iron-based
materials. (d and e) Predictions of the regression model trained solely on cuprates for test sets containing “low-Tc ” and
iron-based superconductors. Models trained on a single group have no predictive power for materials from other groups.
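The backward feature elimination used to prune the regression models can be sketched as below. This is an illustrative reimplementation on synthetic data, not the authors' code: hyperparameters are not re-optimized at each step, and the "drops by 3%" stopping rule is interpreted here as a 3% relative drop from the full-model R², which is one plausible reading.

```python
# Backward feature elimination guided by R^2: repeatedly drop the predictor
# whose removal hurts R^2 least; stop before R^2 falls >3% below the full model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score(X_tr, X_te, y_tr, y_te, cols):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    reg.fit(X_tr[:, cols], y_tr)
    return r2_score(y_te, reg.predict(X_te[:, cols]))

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 500)   # only two informative predictors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

cols = list(range(X.shape[1]))
full = score(X_tr, X_te, y_tr, y_te, cols)
while len(cols) > 1:
    # find the single feature whose removal costs the least R^2
    best_c, best_r2 = max(
        ((c, score(X_tr, X_te, y_tr, y_te, [k for k in cols if k != c]))
         for c in cols), key=lambda p: p[1])
    if best_r2 < full * 0.97:          # stop once R^2 would drop by more than 3%
        break
    cols.remove(best_c)
print("kept predictors:", cols)
```

On this toy problem the procedure strips the four noise columns while retaining the two informative ones, mirroring how the truncated models keep a handful of dominant predictors.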

crystallographic and electronic properties) rather than composite elemental predictors. Unfortunately, only a small subset of materials in SuperCon is also included in the ICSD: about 1,500 compounds in total, only about 800 with finite Tc, and even fewer are characterized with ab initio calculations. In fact, a good portion of known superconductors are disordered (off-stoichiometric) materials, notoriously challenging to address with DFT calculations. Currently, much faster and more efficient methods are becoming available [37] for future applications.

To extract suitable features, data is incorporated from the AFLOW Online Repositories — a database of DFT calculations managed by the software package AFLOW. It contains information for the vast majority of compounds in the ICSD and about 550 superconducting materials. In Ref. 23, several ML models using a similar set of materials are presented. Though a classifier shows good accuracy, attempts to create a regression model for Tc led to disappointing results. We verify that using Magpie predictors for the superconducting compounds in the ICSD also yields an unsatisfactory regression model. The issue is not the lack of compounds per se, as models created with randomly drawn subsets from SuperCon with similar counts of compounds perform much better. In fact, the problem is the chemical sparsity of superconductors in the ICSD, i.e., the dearth of closely-related compounds (usually created by chemical substitution). This translates to compound scatter in predictor space — a challenging learning environment for the model.

The chemical sparsity in ICSD superconductors is a significant hurdle, even when both sets of predictors (i.e., Magpie and AFLOW features) are combined via feature fusion. Additionally, this approach alone neglects the majority of the 16,000 compounds available via SuperCon. Instead, we constructed separate models employing Magpie and AFLOW features, and then judiciously combined the results to improve model metrics — known as late or decision-level fusion. Specifically, two independent classification models are developed, one using the full SuperCon dataset and Magpie predictors, and another based on superconductors in the ICSD and AFLOW predictors. Such an approach can improve the recall, for example, in the case where we classify "high-Tc" superconductors as those predicted by either model to be above-Tsep. Indeed, this is the case here: separately, the models obtain a recall of 40% and 66%, respectively, and together achieve a recall of about 76% (accounting for fluctuations with different test sets). In this way, the models' predictions complement each other in a constructive way such that above-Tsep materials missed by one model (but not the other) are now accurately classified.

Searching for new superconductors in the ICSD. As a final proof-of-concept demonstration, the classification and regression models described previously are integrated in one pipeline and employed to screen the entire ICSD database for candidate "high-Tc" superconductors. (Note that "high-Tc" is a simple label, the precise
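The decision-level fusion used above reduces to a logical OR of the two classifiers' predictions, which can only increase recall (at a possible cost to precision). A minimal illustration with toy predictions (not the actual model outputs; the toy recalls are chosen arbitrarily):

```python
# Late (decision-level) fusion: label a material "above-Tsep" if EITHER
# classifier predicts it. Toy labels/predictions for illustration only.
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0])
pred_magpie = np.array([1, 0, 1, 0, 0, 0, 0, 0])   # catches 2 of 5 positives
pred_aflow  = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # catches 3 of 5 positives
fused = pred_magpie | pred_aflow                    # element-wise OR of the two models

def recall(y, p):
    return ((y == 1) & (p == 1)).sum() / (y == 1).sum()

print(recall(y_true, pred_magpie), recall(y_true, pred_aflow), recall(y_true, fused))
```

Here the fused recall reaches 1.0 because each toy model catches positives the other misses, the same complementarity the text reports for the Magpie- and AFLOW-based classifiers.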
FIG. 5. Scatter plots of Tc for superconducting materials in the space of significant, family-specific regression predictors. For 4,000 "low-Tc" superconductors (i.e., non-cuprate and non-iron-based), Tc is plotted vs. the (a) average atomic weight, (b) average covalent radius, and (c) average number of d valence electrons. The dashed red line in (a) is ∼ 1/√mA. Having low average atomic weight and low average number of d valence electrons are necessary (but not sufficient) conditions for achieving high Tc in this group. (d) Scatter plot of Tc for all known superconducting cuprates vs. the mean number of unfilled orbitals. (c and d) suggest that the values of these predictors lead to hard limits on the maximum achievable Tc.

meaning of which can be adjusted.) Similar tools power high-throughput screening workflows for materials with desired thermal conductivity and magnetocaloric properties [48, 57]. As a first step, the full set of Magpie predictors is generated for all compounds in SuperCon. A classification model similar to the one presented above is constructed, but trained only on materials in SuperCon that are not in the ICSD (which is used as an independent test set). The model is then applied to the ICSD set to create a list of materials with predicted Tc above 10 K. Opportunities for model benchmarking are limited to those materials in both the SuperCon and ICSD datasets, though this test set is shown to be problematic. The set includes about 1,500 compounds, though Tc is reported for only about half of them. The model achieves an impressive accuracy of 0.98, which is overshadowed by the fact that 96.6% of these compounds belong to the Tc < 10 K class. The precision, recall, and F1 scores are about 0.74, 0.66, and 0.70, respectively. These metrics are lower than the estimates calculated for the general classification model, which is not unexpected given that this set cannot be considered randomly selected. Nevertheless, the performance suggests a good opportunity to identify new candidate superconductors.

Next in the pipeline, the list is fed into a random forest regression model (trained on the entire SuperCon database) to predict Tc. Filtering on the materials with Tc > 20 K, the list is further reduced to about 2,000 compounds. This count may appear daunting, but should be compared with the total number of compounds in the
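The two-stage screen just described can be expressed compactly: a classifier first flags candidates predicted above-Tsep, then a regressor trained on ln(Tc) keeps only those with predicted Tc above the chosen cut. The sketch below is a schematic of this pipeline, not the authors' code; `clf`, `reg`, and `X_icsd` are assumed to be a fitted classifier, a fitted ln(Tc) regressor, and the ICSD feature matrix.

```python
# Two-stage screening pipeline: classification filter, then regression filter.
import numpy as np

def screen(clf, reg, X_icsd, tc_cut=20.0):
    above_tsep = clf.predict(X_icsd) == 1          # stage 1: predicted Tc > 10 K
    candidates = np.flatnonzero(above_tsep)
    ln_tc = reg.predict(X_icsd[candidates])        # stage 2: regression on ln(Tc)
    keep = candidates[np.exp(ln_tc) > tc_cut]      # back-transform, keep Tc > tc_cut
    return keep                                    # row indices of surviving candidates
```

Raising `tc_cut` tightens the final list; the text's choice of 20 K reduces roughly 110,000 ICSD entries to about 2,000 candidates.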
[Figure 6 panels: band structure (energy in eV along high-symmetry k-paths) and electronic DOS (states/eV; s, p, d, and total contributions) for (a) AlCs3Ge2O7, (b) AsBeCsO4, (c) Ge2K2ZnO6, and (d) CdPtSr3O6.]

FIG. 6. DOS of four compounds identified by the ML algorithm as potential materials with Tc > 20 K. The
partial DOS contributions from s, p and d electrons and total DOS are shown in blue, green, red, and black, respectively. The
large peak just below EF is a direct consequence of the flat band(s) present in all these materials. These images were generated
automatically via AFLOW [40]. In the case of substantial overlap among k-point labels, the right-most label is offset below.
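The significance of the flat band highlighted in this figure can be framed by two standard results (quoted here for context, not derived in this paper): the BCS weak-coupling estimate, in which Tc is exponentially suppressed in the dimensionless pairing strength λ, versus the flat-band result of Ref. [28], in which the divergent DOS makes Tc linear in λ:

```latex
% BCS weak coupling (regular DOS): T_c exponentially small in \lambda
T_c \sim \omega_D \, e^{-1/\lambda}

% Flat band (divergent DOS at E_F): T_c linear in the pairing strength
T_c \propto \lambda
```

For weak coupling the linear dependence dominates the exponential one, which is why flat bands near the Fermi level are regarded as favorable for higher Tc.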

database — about 110,000. Thus, the method selects less than two percent of all materials, which, in the context of the training set (containing more than 20% with "high-Tc"), suggests that the model is not overly biased toward predicting high critical temperatures.

The vast majority of the compounds identified as candidate superconductors are cuprates, or at least compounds that contain copper and oxygen. There are also some materials clearly related to the iron-based superconductors. The remaining set has fewer than 40 members, and is composed of materials that are not obviously connected to any high-temperature superconducting families (see Table 3). None of them is predicted to have Tc in excess of 40 K, which is not surprising, given that no such instances exist in the training dataset. All contain oxygen — also not a surprising result, since the group of known superconductors with Tc > 20 K is dominated by oxides.

The electronic properties calculated by AFLOW offer context to the results of the search, and suggest a possible connection between these "conventional" candidates for "high-Tc" materials. Plotting the electronic structure of the potential superconductors exposes an extremely peculiar feature shared by all — one or several (nearly) flat bands just below the Fermi level. Such bands lead to a large peak in the DOS (see Figure 6) and can cause a significant enhancement in Tc. Peaks in the DOS elicited by van Hove singularities can enhance Tc if sufficiently close to EF [58–60]. However, note that unlike typical van Hove points, a true flat band creates a divergence in the DOS (as opposed to its derivatives), which in turn leads to a critical temperature that depends linearly on the pairing interaction strength, rather than the usual exponential relationship yielding lower Tc [28]. Additionally, there is significant similarity with the band structure and DOS of layered BiS2-based superconductors [61].

This band structure feature came as a surprising result of applying the ML model. It was not sought for, and, moreover, no explicit information about the electronic band structure has been included in the predictors. This is in contrast to the algorithm presented in Ref. 26, which was specifically designed to filter ICSD compounds based on several preselected electronic structure features.

While at the moment it is not clear if some (or indeed any) of these compounds are really superconducting, let alone with Tc's above 20 K, the presence of this highly unusual electronic structure feature is encouraging. Attempts to synthesize several of these compounds are already underway.

CONCLUSION

Herein, several machine learning tools are developed to study the critical temperature of superconductors.
The classifier shows excellent performance, with out-of-sample accuracy and F1 score of about 92%. Next, several successful random forest regression models are created to predict the value of Tc, including separate models for three material sub-groups, i.e., cuprate, iron-based, and "low-Tc" compounds. By studying the importance of predictors for each family of superconductors, insights are obtained about the physical mechanisms driving superconductivity among the different groups. With the incorporation of crystallographic-/electronic-based features from the AFLOW Online Repositories, the ML models are further improved. Finally, we combined these models into one integrated pipeline, which is employed to search the entire ICSD database for new inorganic superconductors. The model identified about 30 oxides as candidate materials. One interesting feature that unites all these materials is the presence of flat or nearly-flat bands just below the Fermi level.

In conclusion, this work demonstrates the important role ML models can play in superconductivity research. Records collected over several decades in SuperCon and other relevant databases can be consumed by ML models, generating insights and promoting better understanding of the connection between materials' chemistry/structure and superconductivity. Application of sophisticated ML algorithms has the potential to dramatically accelerate the search for candidate high-temperature superconductors.

TABLE 3. List of potential superconductors identified by the pipeline. Also shown are their ICSD numbers and symmetries. Note that for some compounds there are several entries. All of the materials are oxides.

compound             ICSD     SYM
CdK2SiO4             083229   orc
CdK2SiO4             086917   cub
Cd2IrNa3O6           404507   mcl
CdPtSr3O6            280518   hex
CdRb2SiO4            093879   orc
Ge2K166Sr4O36        100202   cub
GeK2ZnO4             069018   orc
GeK2ZnO4             085006   orc
GeK2ZnO4             085007   cub
Ge2K2ZnO6            065740   orc
GeK0.6Na1.4ZnO4      069166   orc
PtSr3ZnO6            280519   hex
KSbO2                411214   mcl
RbSbO2               411216   mcl
AlCs3Ge2O7           412140   mcl
AsRbO2               413150   orc
AgAuBa4O6            072329   orc
Au2Sr5O8             071965   orc
AsBeCsO4             074027   orc
K2SiZnO4             083227   orc
RbSeO2F              078399   cub
CsSeO2F              078400   cub
K2Si2ZnO6            079705   orc
Ca3Ge6Na6O18         067315   hex
CsSbO2               059329   mcl
AgCrO2               004149   hex
AgCrO2               025624   hex
K4Na2Tl2O6           074956   mcl
BCdCsO3              189199   cub
CsMoZnO3F3           018082   cub
KTeO2F               411068   mcl
HEu0.5Ge1.5K1.5O5    262677   orc
K0.8Li0.2Sn0.76O2    262638   hex
BiNa3Ni2O6           237391   mcl
BiCa2Na3O6           240975   orc
Ba5Br2Ru2O9          245668   hex
H2Ge3K3TbO10         193585   orc
BaGe3K4O9            100203   mcl
Ba6Ga7KZn4O2         040856   tri

ACKNOWLEDGMENTS

The authors are grateful to Daniel Samarov, Victor Galitski, Cormac Toher, Richard L. Greene and Yibin Xu for many useful discussions and suggestions. We acknowledge Stephan Rühl for ICSD. This research is supported by ONR N000141512222, ONR N00014-13-1-0635, and AFOSR No. FA 9550-14-10332. CO acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. DGF1106401. SC acknowledges support by the Alexander von Humboldt-Foundation.

SUPPLEMENTARY MATERIAL

Software and model parameters. The entire inorganic materials dataset has been downloaded from SuperCon. Each entry in the set contains fields for the chemical composition, Tc, structure, and a journal reference to the information source. Here, structural information is ignored as it is not always available. A number of entries have also been removed due to incompletely specified compositions and erroneous Tc.

Based on information from the SuperCon database, initial coarse-grained chemical features are generated using the Magpie software. As a first application of ML methods, materials are divided into two classes depending on whether their Tc is above or below 10 K. A non-parametric random forest classification model is constructed to predict the class of superconductors.

The data cleaning and processing is carried out using the Python Pandas package for data analysis [62]. The predictors are calculated using the Magpie software [63]. The machine learning models are developed using
FIG. 7. Dataset and feature set statistics. (a) Accuracy, precision, recall, and F1 score as a function of the size of the
training set with a fixed test set. (b) Accuracy, precision, recall, and F1 as a function of the number of predictors.

scikit-learn — a powerful and efficient machine learning Python library [64]. Hyperparameters of the random forest model include the number of trees in the forest, the maximum depth of each tree, the minimum number of samples required to split an internal node, and the number of features to consider when looking for the best split. To optimize the classifier and the combined/family-specific regressors, the GridSearchCV function in scikit-learn is employed, which generates and compares candidate models from a grid of parameter values. To reduce computational expense, models are not optimized at each step of the backward feature selection process.

Details about the classification and regression models. The most important factors that determine the model's performance are the size of the available dataset and the number of meaningful predictors. In Figure 7a, the accuracy and F1 score generally saturate rather quickly with the size of the training set, and a reasonably well-performing model can be created even with a relatively small set (several hundred compounds). Such a model, however, is susceptible to random variations in the composition of the training and test sets, and thus not very robust (accuracy and F1 score exhibit sizable variations for dataset sizes less than 10,000). So having a large dataset is helpful, but not a major factor above some (relatively modest) size. The number of predictors is another very important model parameter. In Figure 7b, the accuracy is calculated at each step of the backward feature elimination process. It quickly saturates when the number of predictors reaches 10. In fact, a model with only 5 predictors achieves almost 90% accuracy.

ML models are also constructed with the superconducting materials in the AFLOW Online Repositories. The features are built with the following properties: number of atoms, space group, density, volume, energy per atom, electronic entropy per atom, valence of the cell, scintillation attenuation length, the ratios of the unit cell's dimensions, and Bader charges and volumes. For the Bader charges and volumes (vectors), the following statistics are calculated and incorporated: the maximum, minimum, average, standard deviation, and range.

In the main text, several regression models were described, each one designed to predict the critical temperatures of materials from different superconducting groups. These models achieved an impressive R² score, demonstrating good predictive power for each group. However, it is also important to consider the accuracy of the predictions for individual compounds (rather than on the aggregate set), especially in the context of searching for new materials. To do this, we calculate the prediction errors for about 300 materials from a test set. Specifically, we consider the difference between the logarithms of the predicted and measured critical temperatures — ln(Tc^meas) − ln(Tc^pred) — normalized by the value of ln(Tc^meas) (since different groups have different Tc ranges). The models show comparable spreads of errors. The histograms of errors for the four models (combined and three group-specific) are shown in Fig. 8. The errors approximately follow a normal distribution, centered not at zero but at a small negative value. This suggests the models are marginally biased, and on average tend to slightly underestimate Tc. The variance is comparable for all models, but largest for the model trained and tested on iron-based materials, which also shows the smallest R². Performance of this model is expected to benefit from a larger training set.

FIG. 8. Histograms of ∆ln(Tc)/ln(Tc) for the four regression models, where ∆ln(Tc) ≡ ln(Tc^meas) − ln(Tc^pred) and ln(Tc) ≡ ln(Tc^meas).
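The hyperparameter search described above can be sketched with scikit-learn's GridSearchCV over the four random forest parameters listed in the text. The grid values and data below are illustrative placeholders, not the settings used in the paper.

```python
# Sketch of grid-search hyperparameter optimization for the random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                 # placeholder predictor matrix
y = (X[:, 0] > 0).astype(int)                 # placeholder class labels

param_grid = {
    "n_estimators": [50, 100],                # number of trees in the forest
    "max_depth": [None, 10],                  # maximum depth of each tree
    "min_samples_split": [2, 5],              # min samples to split an internal node
    "max_features": ["sqrt", None],           # features considered at each split
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

GridSearchCV fits one model per grid point per cross-validation fold, which is why the text notes that re-optimizing at every step of the feature elimination would be computationally expensive.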

1. G. Bergerhoff, R. Hundt, R. Sievers, and I. D. Brown, The inorganic crystal structure data base, J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).
2. S. Curtarolo, W. Setyawan, G. L. W. Hart, M. Jahnátek, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M. J. Mehl, H. T. Stokes, D. O. Demchenko, and D. Morgan, AFLOW: An automatic framework for high-throughput materials discovery, Comput. Mater. Sci. 58, 218–226 (2012).
3. D. D. Landis, J. Hummelshøj, S. Nestorov, J. Greeley, M. Dulak, T. Bligaard, J. K. Nørskov, and K. W. Jacobsen, The Computational Materials Repository, Comput. Sci. Eng. 14, 51–57 (2012).
4. J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD), JOM 65, 1501–1509 (2013).
5. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, Commentary: The Materials Project: A materials genome approach to accelerating materials innovation, APL Mater. 1, 011002 (2013).
6. A. Agrawal and A. Choudhary, Perspective: Materials informatics and big data: Realization of the "fourth paradigm" of science in materials science, APL Mater. 4, 053208 (2016).
7. T. Lookman, F. J. Alexander, and K. Rajan, eds., A Perspective on Materials Informatics: State-of-the-Art and Challenges (Springer International Publishing, 2016), doi:10.1007/978-3-319-23871-5.
8. A. Jain, G. Hautier, S. P. Ong, and K. A. Persson, New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships, J. Mater. Res. 31, 977–994 (2016).
9. T. Mueller, A. G. Kusne, and R. Ramprasad, Machine Learning in Materials Science (John Wiley & Sons, Inc, 2016), pp. 186–273, doi:10.1002/9781119148739.ch4.
10. A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka, Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids, Phys. Rev. B 89 (2014).
11. P. V. Balachandran, J. Theiler, J. M. Rondinelli, and T. Lookman, Materials Prediction via Classification Learning, Sci. Rep. 5 (2015).
12. G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga, R. Ramprasad, J. E. Gubernatis, and T. Lookman, Machine learning bandgaps of double perovskites, Sci. Rep. 6, 19375 (2016).
13. O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, and A. Tropsha, Universal fragment descriptors for predicting electronic properties of inorganic crystals, Nat. Commun. 8, 15679 (2017).
14. J. E. Hirsch, M. B. Maple, and F. Marsiglio, Superconducting Materials: Conventional, Unconventional and Undetermined, Physica C 514, 1–444 (2015).
15. P. W. Anderson, Plasmons, Gauge Invariance, and Mass, Phys. Rev. 130, 439–442 (1963).
16. C. W. Chu, L. Z. Deng, and B. Lv, Hole-doped cuprate high temperature superconductors, Physica C 514, 290–313 (2015). Superconducting Materials: Conventional, Unconventional and Undetermined.
17. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R. H. Taylor, L. J. Nelson, G. L. W. Hart, S. Sanvito, M. Buongiorno Nardelli, N. Mingo, and O. Levy, AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations, Comput. Mater. Sci. 58, 227–235 (2012).
18. R. H. Taylor, F. Rose, C. Toher, O. Levy, K. Yang, M. Buongiorno Nardelli, and S. Curtarolo, A RESTful API for exchanging materials data in the AFLOWLIB.org consortium, Comput. Mater. Sci. 93, 178–192 (2014).
19. C. E. Calderon, J. J. Plata, C. Toher, C. Oses, O. Levy, M. Fornari, A. Natan, M. J. Mehl, G. L. W. Hart, M. Buongiorno Nardelli, and S. Curtarolo, The AFLOW standard for high-throughput materials science calculations, Comput. Mater. Sci. 108 Part A, 233–238 (2015).
20. F. Rose, C. Toher, E. Gossett, C. Oses, M. Buongiorno Nardelli, M. Fornari, and S. Curtarolo, AFLUX: The LUX materials search API for the AFLOW data repositories, Comput. Mater. Sci. 137, 362–370 (2017).
21. P. Villars and J. C. Phillips, Quantum structural diagrams and high-Tc superconductivity, Phys. Rev. B 37, 2345–2348 (1988).
22. K. M. Rabe, J. C. Phillips, P. Villars, and I. D. Brown, Global multinary structural chemistry of stable quasicrys-
24. …Well-Calibrated Uncertainty Estimates, Integr. Mater. Manuf. Innov. (2017).
25. M. Ziatdinov, A. Maksov, L. Li, A. S. Sefat, P. Maksymovych, and S. V. Kalinin, Deep data mining in a real space: separation of intertwined electronic responses in a lightly doped BaFe2As2, Nanotechnology 27, 475706 (2016).
26. M. Klintenberg and O. Eriksson, Possible high-temperature superconductors predicted from electronic structure and data-filtering algorithms, Comput. Mater. Sci. 67, 282–286 (2013).
27. M. R. Norman, Materials design for new superconductors, Rep. Prog. Phys. 79, 074502 (2016).
28. N. B. Kopnin, T. T. Heikkilä, and G. E. Volovik, High-temperature surface superconductivity in topological flat-band systems, Phys. Rev. B 83, 220503 (2011).
29. S. Peotta and P. Törmä, Superfluidity in topologically non-trivial flat bands, Nat. Commun. 6, 8944 (2015).
30. National Institute of Materials Science, Materials Information Station, SuperCon, http://supercon.nims.go.jp/index_en.html (2011).
31. N.B., a model suffering from selection bias can still provide valuable statistical information about known superconductors.
32. H. Hosono, K. Tanabe, E. Takayama-Muromachi, H. Kageyama, S. Yamanaka, H. Kumakura, M. Nohara, H. Hiramatsu, and S. Fujitsu, Exploration of new superconductors and functional materials, and fabrication of superconducting tapes and wires of iron pnictides, Sci. Technol. Adv. Mater. 16, 033503 (2015).
33. There are theoretical arguments for this — according to the Kohn-Luttinger theorem, a superconducting instability should be present as T → 0 in any fermionic metallic system with Coulomb interactions [34].
34. W. Kohn and J. M. Luttinger, New Mechanism for Superconductivity, Phys. Rev. Lett. 15, 524–526 (1965).
35. L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, A general-purpose machine learning framework for predicting properties of inorganic materials, npj Computational Materials 2, 16028 (2016).
36. W. Setyawan and S. Curtarolo, High-throughput electronic band structure calculations: Challenges and tools, Comput. Mater. Sci. 49, 299–312 (2010).
37. K. Yang, C. Oses, and S. Curtarolo, Modeling Off-Stoichiometry Materials with a High-Throughput Ab-Initio Approach, Chem. Mater. 28, 6484–6492 (2016).
38. O. Levy, M. Jahnátek, R. V. Chepulskii, G. L. W. Hart, and S. Curtarolo, Ordered Structures in Rhenium Binary Alloys from First-Principles Calculations, J. Am. Chem. Soc. 133, 158–163 (2011).
39. O. Levy, G. L. W. Hart, and S. Curtarolo, Structure maps for hcp metals from first-principles calculations, Phys. Rev. B 81, 174106 (2010).
40. O. Levy, R. V. Chepulskii, G. L. W. Hart, and S. Cur-
tals, high-TC ferroelectrics, and high-Tc superconductors, tarolo, The New face of Rhodium Alloys: Revealing Or-
Phys. Rev. B 45, 7650–7676 (1992). dered Structures from First Principles, J. Am. Chem. Soc.
23
O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, 132, 833–837 (2010).
41
A. Tropsha, and S. Curtarolo, Materials Cartography: Rep- O. Levy, G. L. W. Hart, and S. Curtarolo, Uncovering
resenting and Mining Materials Space Using Structural Compounds by Synergy of Cluster Expansion and High-
and Electronic Fingerprints, Chem. Mater. 27, 735–743 Throughput Methods, J. Am. Chem. Soc. 132, 4830–4833
(2015). (2010).
24 42
J. Ling, M. Hutchinson, E. Antono, S. Paradiso, and G. L. W. Hart, S. Curtarolo, T. B. Massalski, and O. Levy,
B. Meredig, High-Dimensional Materials and Process Op- Comprehensive Search for New Phases and Compounds in
timization Using Data-Driven Experimental Design with Binary Alloy Systems Based on Platinum-Group Metals,
16

Using a Computational First-Principles Approach, Phys. the probability of randomly chosen data point from a given
Rev. X 3, 041035 (2013). decision tree leaf to be in class i [49, 50].
43 54
M. J. Mehl, D. Hicks, C. Toher, O. Levy, R. M. Hanson, Y. Kasahara, K. Kuroki, S. Yamanaka, and Y. Taguchi,
G. L. W. Hart, and S. Curtarolo, The AFLOW Library of Unconventional superconductivity in electron-doped layered
Crystallographic Prototypes: Part 1, Comput. Mater. Sci. metal nitride halides M NX (M = Ti, Zr, Hf; X = Cl, Br,
136, S1–S828 (2017). I), Physica C 514, 354–367 (2015). Superconducting Ma-
44
A. R. Supka, T. E. Lyons, L. S. I. Liyanage, P. D’Amico, terials: Conventional, Unconventional and Undetermined.
55
R. Al Rahal Al Orabi, S. Mahatara, P. Gopal, C. To- Z. P. Yin, A. Kutepov, and G. Kotliar, Correlation-
her, D. Ceresoli, A. Calzolari, S. Curtarolo, M. Buon- Enhanced Electron-Phonon Coupling: Applications of GW
giorno Nardelli, and M. Fornari, AFLOWπ: A minimalist and Screened Hybrid Functional to Bismuthates, Chloroni-
approach to high-throughput ab initio calculations includ- trides, and Other High-Tc Superconductors, Phys. Rev. X
ing the generation of tight-binding hamiltonians, Comput. 3, 021011 (2013).
56
Mater. Sci. 136, 76–84 (2017). The number of unfilled orbitals refers to the electron con-
45
C. Toher, J. J. Plata, O. Levy, M. de Jong, M. D. Asta, figuration of the substituent elements before combining
M. Buongiorno Nardelli, and S. Curtarolo, High-throughput to form oxides. For example, Cu has one unfilled orbital
computational screening of thermal conductivity, Debye ([Ar]4s2 3d9 ) and Bi has three ([Xe]4f 14 6s2 5d10 6p3 ). These
temperature, and Grüneisen parameter using a quasihar- values are averaged per formula unit.
57
monic Debye model, Phys. Rev. B 90, 174107 (2014). J. D. Bocarsly, E. E. Levin, C. A. C. Garcia, K. Schwen-
46
E. Perim, D. Lee, Y. Liu, C. Toher, P. Gong, Y. Li, W. N. nicke, S. D. Wilson, and R. Seshadri, A Simple Compu-
Simmons, O. Levy, J. J. Vlassak, J. Schroers, and S. Cur- tational Proxy for Screening Magnetocaloric Compounds,
tarolo, Spectral descriptors for bulk metallic glasses based Chem. Mater. 29, 1613–1622 (2017).
58
on the thermodynamics of competing crystalline phases, J. Labbé, S. Barišić, and J. Friedel, Strong-Coupling Super-
Nat. Commun. 7, 12315 (2016). conductivity in V3 X type of Compounds, Phys. Rev. Lett.
47
C. Toher, C. Oses, J. J. Plata, D. Hicks, F. Rose, 19, 1039–1041 (1967).
59
O. Levy, M. de Jong, M. D. Asta, M. Fornari, M. Buon- J. E. Hirsch and D. J. Scalapino, Enhanced Superconductiv-
giorno Nardelli, and S. Curtarolo, Combining the AFLOW ity in Quasi Two-Dimensional Systems, Phys. Rev. Lett.
GIBBS and Elastic Libraries to efficiently and robustly 56, 2732–2735 (1986).
60
screen thermomechanical properties of solids, Phys. Rev. I. E. Dzyaloshinskiǐ, Maximal increase of the superconduct-
Materials 1, 015401 (2017). ing transition temperature due to the presence of van’t Hoff
48
A. van Roekeghem, J. Carrete, C. Oses, S. Curtarolo, and singularities, JETP Lett. 46, 118 (1987).
61
N. Mingo, High-Throughput Computation of Thermal Con- D. Yazici, I. Jeon, B. D. White, and M. B. Maple, Super-
ductivity of High-Temperature Solid Phases: The Case of conductivity in layered BiS2 -based compounds, Physica C
Oxide and Fluoride Perovskites, Phys. Rev. X 6, 041061 514, 218–236 (2015). Superconducting Materials: Conven-
(2016). tional, Unconventional and Undetermined.
49 62
C. Bishop, Pattern Recognition and Machine Learning W. McKinney, Python for Data Analysis: Data Wran-
(Springer-Verlag, New York, 2006). gling with Pandas, NumPy, and IPython (O’Reilly Media,
50
T. Hastie, R. Tibshirani, and J. H. Friedman, The Ele- 2012).
63
ments of Statistical Learning: Data Mining, Inference, and L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton,
Prediction (Springer-Verlag, New York, 2001). Magpie Software, https://bitbucket.org/wolverton/magpie
51
L. Breiman, Random Forests, Mach. Learn. 45, 5–32 (2016), doi:10.1038/npjcompumats.2016.28.
64
(2001). F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
52
R. Caruana and A. Niculescu-Mizil, An Empirical Com- B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
parison of Supervised Learning Algorithms, in Proceedings R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour-
of the 23rd International Conference on Machine Learn- napeau, M. Brucher, M. Perrot, and É. Duchesnay, Scikit-
ing, ICML ’06 (ACM, New York, NY, USA, 2006), pp. learn: Machine Learning in Python, J. Mach. Learn. Res.
161–168, doi:10.1145/1143844.1143865. (2011).
53 P
Gini impurity is calculated as i pi (1 − pi ), where pi is
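The endnote on Gini impurity defines the splitting criterion used by the paper's decision-tree models. A minimal sketch of that formula in Python (the function and the class labels are illustrative, not from the paper's code):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity sum_i p_i * (1 - p_i) of a set of class labels,
    where p_i is the fraction of samples belonging to class i."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

# A pure leaf has zero impurity; an even two-class split gives 0.5.
assert gini_impurity(["low_Tc"] * 4) == 0.0
assert gini_impurity(["low_Tc", "high_Tc"] * 2) == 0.5
```

A tree-growing algorithm picks the split that most reduces this quantity, averaged over the two child leaves.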
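The endnote on unfilled orbitals describes an elemental feature averaged per formula unit. One plausible reading of that averaging is a stoichiometry-weighted mean; the sketch below assumes that reading, and the element table and example composition are illustrative, not the paper's actual feature data:

```python
# Unfilled-orbital counts taken from the electron configurations quoted
# in the endnote: Cu ([Ar]4s^2 3d^9) -> 1, Bi ([Xe]4f^14 6s^2 5d^10 6p^3) -> 3.
UNFILLED_ORBITALS = {"Cu": 1, "Bi": 3}

def mean_unfilled_orbitals(composition):
    """Stoichiometry-weighted average of unfilled orbitals per formula unit.

    composition: dict mapping element symbol -> atoms per formula unit.
    """
    total_atoms = sum(composition.values())
    weighted = sum(UNFILLED_ORBITALS[el] * n for el, n in composition.items())
    return weighted / total_atoms

# Hypothetical Bi2Cu formula unit: (2*3 + 1*1) / 3 atoms = 7/3.
assert abs(mean_unfilled_orbitals({"Bi": 2, "Cu": 1}) - 7 / 3) < 1e-12
```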