In the previous lessons, I introduced some machine learning terms like
features, labels, regression, classification, et cetera. We also provided
examples to make those concepts clearer. Given how important labels are,
you might be wondering, where do I find them? Most commonly, labels are
found in historical datasets. This points to the importance
of having a data warehouse. Let's say we are creating a machine learning
model to predict whether a jet engine will require a mechanic's inspection. The
information about the jet engine collected from flights provides the features,
and it will probably be in one database. The information about mechanic
inspections provides the label, and it will probably be in a different
database. Note that if, at training time, the two datasets
are still held in two different databases that are in silos, we cannot even
get started. That's because we need to join the two datasets based on the
engine ID and time. A data warehouse where data from across the organization
is collected in a joinable form is a prerequisite for building this machine
learning model. To create the machine
learning dataset, we need both of these datasets: the dataset of aircraft
measurements and the dataset of mechanic inspections. We need both of these
historical datasets in order to build this machine learning model. If we have
not been collecting this data, we're out of luck. Only if we have this
historical data can we create a machine learning model to predict one from
the other. Let's take another example. Let's say
we want to build a recommendation system
for products, but we don't have any historical data of customer ratings. Our
business never collected it. We don't have labeled
examples then. Are we out of luck?
Well, maybe not. There are three ways that you could address this challenge:
one, use a proxy label; two, build a labeling system; or three, use
a labeling service.
Let's look at all of them. To train a product recommendation model, you would
normally need customer ratings. But even if you don't have customer ratings,
you might be able to use a proxy. For example, the number of warranty claims
or the number of support phone calls might serve as a proxy for the customer
rating of the product. It's not ideal, but it will allow you to get started.
Another way to get a proxy is to train a machine learning
model to produce it for you. In the examples that we've seen so far, features
were things that we could observe in the world, actual measurements. But
sometimes, the output of one model can be used as the input to another model.
For example, say you want to predict the future demand of a retail product,
and you think that one great feature would be the amount of attention that
customers in stores pay to that product. But that's not something that most
retailers can
easily measure. The good news is, you don't need to be
able to actually measure exactly that thing for it to
be an input into your model. For example, you could train
a machine learning model to count the number of people in that section
of the store. Maybe you could look
at camera footage, or maybe you could
look at the number of transactions that you're ringing up in that section of the
store. Then you could use that
model's output as a proxy for how many customers
are interested in whatever items you're stocking in that section of the store.
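
To make that idea concrete, here is a minimal sketch of feeding one model's output into another model as a feature. Everything in it is an assumption for illustration: the section names, the people counts (standing in for the output of a hypothetical people-counting model), the product records, and the helper name add_attention_proxy.

```python
# Hypothetical output of an upstream people-counting model: average number of
# people detected per hour in each section of the store (made-up numbers).
people_count_by_section = {"toys": 42.0, "garden": 7.5, "electronics": 63.0}

# Hypothetical product records with features we can observe directly.
products = [
    {"product_id": "P1", "section": "toys", "price": 19.99},
    {"product_id": "P2", "section": "garden", "price": 34.50},
    {"product_id": "P3", "section": "bakery", "price": 3.25},
]


def add_attention_proxy(products, people_counts, default=0.0):
    """Return copies of the product rows with the upstream model's output
    attached as an extra feature called 'attention_proxy'."""
    rows = []
    for product in products:
        row = dict(product)  # copy so the input rows are not modified
        # Sections the upstream model never saw fall back to the default.
        row["attention_proxy"] = people_counts.get(product["section"], default)
        rows.append(row)
    return rows


# These rows would be the input to the demand prediction model.
demand_features = add_attention_proxy(products, people_count_by_section)
for row in demand_features:
    print(row)
```

The demand model never sees camera footage. It only sees a number per section, so the people-counting model could later be swapped for transaction counts without changing the shape of the demand model's input.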

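Going back to the jet engine example from earlier in this lesson, the join on engine ID and time could look roughly like the sketch below. The field names, the sample records, and the seven-day labeling window are all assumptions made for illustration. The lesson only says that the two datasets are joined on engine ID and time, not how the time match is defined.

```python
from datetime import date, timedelta

# Hypothetical feature records: measurements collected from flights.
flight_measurements = [
    {"engine_id": "E1", "flight_date": date(2023, 1, 3), "max_temp_c": 610.0},
    {"engine_id": "E1", "flight_date": date(2023, 2, 20), "max_temp_c": 655.0},
    {"engine_id": "E2", "flight_date": date(2023, 1, 5), "max_temp_c": 598.0},
]

# Hypothetical label records: mechanic inspections, held in a separate table.
inspections = [
    {"engine_id": "E1", "inspection_date": date(2023, 2, 25)},
]

# Assumption: a flight is labeled positive if the same engine was inspected
# within 7 days after that flight.
LABEL_WINDOW = timedelta(days=7)


def build_training_rows(measurements, inspections, window=LABEL_WINDOW):
    """Join features to labels on engine ID and time."""
    inspection_dates = {}
    for insp in inspections:
        inspection_dates.setdefault(insp["engine_id"], []).append(
            insp["inspection_date"]
        )

    rows = []
    for m in measurements:
        dates = inspection_dates.get(m["engine_id"], [])
        inspected_soon = any(
            timedelta(0) <= d - m["flight_date"] <= window for d in dates
        )
        rows.append({**m, "needs_inspection": int(inspected_soon)})
    return rows


training_rows = build_training_rows(flight_measurements, inspections)
for row in training_rows:
    print(row["engine_id"], row["flight_date"], row["needs_inspection"])
```

If the inspections table did not exist, or had no engine ID to match on, every row would come out unlabeled, which is the "out of luck" situation the lesson describes.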