
DATA SCIENCE

What is Data Science?

What is analytics? What is a data scientist?

Data science is deep knowledge discovery through data inference and exploration. This discipline often
involves using mathematical and algorithmic techniques to solve some of the most analytically complex business
problems, leveraging troves of raw information to uncover hidden insight that lies beneath the surface. It centers
on evidence-based analytical rigor and building robust decision capabilities.
Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It
is all about adding substantial enterprise value by learning from data.

The variety of projects that a data scientist may be engaged in is incredibly broad. Here are a few examples:

tactical optimization: improvement of marketing campaigns, business processes, etc.

predictive analytics: anticipating future demand, future events, etc.

nuanced learning: e.g. developing a deep understanding of consumer behavior

recommendation engines: e.g. Amazon product recs, Netflix movie recs

automated decision engines: e.g. automated fraud detection, and even self-driving cars

The objectives of these types of initiatives may be clear, but the problems require extensive quantitative expertise to
solve. They may require building predictive models, attribution models, segmentation models, heuristics for deep
pattern-discovery in data, etc. This demands exhaustive knowledge of all sorts of machine-learning
algorithms and sharp technical ability. As you might guess, these are not the easiest skills to pick up.

What is data science: the requisite skill set


Data science is multidisciplinary; the skill set of a data scientist lies at the intersection of 3 main competencies:

Mathematics Expertise
At the heart of deriving insight from data is the ability to view the data through a quantitative lens. There are textures,
patterns, dimensions, and correlations in data that can be expressed numerically, and discovering inference from
data becomes a brain teaser of mathematical techniques. Solutions to many business problems often involve building
analytic models that are deeply grounded in hard math theory, and being able to understand how models work is
as important as knowing the process to build them (there is real danger in building models without knowing the math).
Also, a big misconception is that data science is all about statistics. While statistics is important, it is not the only type
of mathematics that should be well-understood by a data scientist. First, there are two main branches of statistics:
classical statistics and Bayesian statistics. When most people refer to stats they are generally referring to classical
stats, but knowledge of both types is very helpful. Furthermore, many inferential techniques and machine learning
algorithms lean heavily on knowledge of linear algebra. For example, key data science processes like SVD (singular
value decomposition, used for dimension reduction / latent variable discovery) are grounded in matrix mathematics
and have much less to do with
classical statistics. Overall, data scientists should have substantial breadth and depth in their knowledge of math.
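
To make the SVD remark concrete, here is a minimal sketch, not a production routine: it approximates only the largest singular value and its vectors using power iteration, in plain Python. The `ratings` matrix, the function name `top_singular_triplet`, and the iteration count are all assumptions made for illustration; the toy matrix is deliberately rank-one so that a single latent variable describes it completely.

```python
import math

def top_singular_triplet(rows, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_rows, n_cols = len(rows), len(rows[0])
    # Arbitrary unit-length starting guess (must not be orthogonal to the answer).
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(rows[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T u, then renormalize to get the next estimate of v
        w = [sum(rows[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(rows[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Hypothetical data: each row is a viewer, each column a movie rating.
ratings = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
sigma, u, v = top_singular_triplet(ratings)

# One latent number per viewer replaces the two original columns.
latent_scores = [sigma * x for x in u]

# Rebuilding the matrix from the single component shows how much was kept.
rank_one = [[sigma * ui * vj for vj in v] for ui in u]
```

On real data the matrix would not be exactly rank-one, and a numerical library would normally be used to compute the full decomposition; the point here is only that the technique is matrix arithmetic rather than classical statistics.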

Technology and Hacking


First, let's clarify that we are not talking about hacking as in breaking into computers. We're referring to the
tech/developer subculture meaning of hacking, i.e., creativity and ingenuity in using technical skills to build things
and find clever solutions to problems.
Why is hacking ability important? Because data scientists absolutely need to leverage technology in order to wrangle
enormous data sets and work with complex algorithms, which requires using tools far more sophisticated than Excel.
Examples of such tools are SQL, SAS, and R, all of which require technical/coding ability. With these
high-performance tools, a true 'hacker' is a technical ninja, able to use ingenious problem solving ability to achieve
mastery in data exploration: piecing together unstructured information and teasing out golden nuggets of insight.
Another way to define a hacker is as a solid algorithmic thinker, that is, someone with the ability to break down messy
problems and recompose them in ways that are solvable. This is critical for good data science, especially since data
scientists work intimately within existing algorithmic frameworks and oftentimes create their own algorithms to solve
complex problems. Clarity of thinking within deeply-abstract mental maps of data dimensions and processing
capability is how challenging problems get solved.

Strong Business Acumen


It is very important to note that a data scientist is first and foremost a strategy consultant. Data science teams have
become invaluable resources within companies because by being able to learn from data in ways no one else can,
they are extraordinarily well-positioned to figure out how to add substantial business value. But this means having a
keen sense of how to dissect and approach business problems becomes as important as having a keen sense of how
to approach algorithmic problems. Ultimately, the value doesn't come from numbers; it comes from strategic thinking
based on those numbers.
Additionally, a core competency of data science is in using data to cogently tell a story. This means no data-puking;
rather, presenting a cohesive narrative of problem and solution, using data insights as supporting pillars, that leads to
guidance.
Clearly, get all the competencies right (math, technology, and business) and this is an incredibly potent
combination. There is a reason why data scientists are well paid and probably will never have to worry about job
security. Not a bad place to be, having the rarefied talents that big companies everywhere are trying to recruit.

What is a data scientist: curiosity and training


The Mindset
A defining personality trait of data scientists is that they are deep thinkers with intense intellectual curiosity. Data science
is all about being inquisitive: asking new questions, making new discoveries, and learning new things. Ask true data
scientists what drives them in their job, and they will not say "money". The real motivator is being able to use their
creativity and ingenuity to solve hard problems and constantly indulge their curiosity. Deriving insight from data is
not about getting an answer; it is about uncovering "truth" that lies hidden beneath the surface. Problem solving is not
a task, but rather an intellectually-stimulating journey to a solution. There is passion for the work, and great
satisfaction in taking on a challenge.

Training
While solid math skills are necessary, there is a glaring misconception out there that you need a Ph.D. in Statistics to
become a legitimate data scientist. That view completely misses the point that data science is multidisciplinary; years
of study in academia may not leave graduates with the right set of experience and abilities to excel, i.e. a Ph.D.
statistician may not have the nimble hacking skills or strategic business intuition to complete the trifecta.
As a matter of fact, data science is such a relatively new and rising discipline that universities have not caught up in
developing comprehensive data science degree programs, meaning that no one can really claim to have "done all
the schooling" to become a data scientist. Where does much of the training come from? The unyielding intellectual
curiosity that data scientists possess drives them to be passionate autodidacts, motivated to learn skills on their own
with deep determination (Read: where can you find people like this?).

Analytics and machine learning: how they tie to data science


There is a slew of terms closely related to data science that we hope to add some clarity around.

What is Analytics?
Analytics has risen quickly in popular business lingo over the past several years; the term is used loosely, but is
generally meant to describe critical thinking that is quantitative in nature. Technically, analytics is the "science of
analysis" or, put another way, the practice of analyzing information to make decisions.
Is "analytics" the same thing as data science? It depends on context. Sometimes it is synonymous with the definition of
data science that we have described, and sometimes it represents something else. A data scientist using raw data to
build a predictive behavior model falls into the scope of analytics. At the same time, a general business user
interpreting pre-built dashboard reports (e.g. GA) is also in the realm of analytics, but does not cross into the
specialized skill needed in data science. Analytics has come to have a fairly broad meaning, though at the end of the
day, the semantics don't matter much.

What is the difference between an analyst and a data scientist?

"Analyst" is a somewhat ambiguous term that can represent many different types of roles (marketing analyst,
operations analyst, portfolio analyst, financial analyst, etc.). Is an analyst the same as a data scientist? We've
laid out a pretty strict definition of a data scientist as an expert role with requisite talents in math,
technology, and strategy consulting. Let's just say that some analysts are definitely data-scientists-in-training. As
represented in this visual, there is a place in the middle where the distinction can blur a bit.

Here are examples of growth from analyst to veritable data scientist:

An analyst who has previously only mastered Excel learns how to dive into raw warehouse data using SQL and R

An analyst who previously only knew enough stats to report the results of an A/B test gains the expertise to build a
predictive model with latent variable analysis and cross-validation

The overall point is that moving in the direction of "data scientist" requires motivation to learn many new skills.
Many companies have actually found success cultivating their own home-grown
data scientists, by giving their analysts the resources and training to take their abilities to the next level.

What is Machine Learning?


Machine learning is a term that is closely tied to data science. Simply, it means being able to train systems or
algorithms to derive insight from a data set. The actual types of machine learning are varied, ranging from regression
models to support vector machines to neural nets, but it all centers around 'teaching' a computer to become very
good at pattern recognition. Examples of machine learning include:

predictive models that can anticipate user behavior

clustering algorithms that mine for natural similarities between different customers

classification models that can recognize and filter out spam

recommendation engines that 'learn' about preferences at an individual level

neural nets that can recognize what image patterns look like

Data scientists work intimately with machine learning techniques to build algorithms that automate elements of their
problem-solving. It is a requisite part of the data science toolset, needed to tackle some of the most complex
data-driven projects.

What is Data Munging?


Raw data can be unstructured and messy, with information from disparate data sources and mismatched records.
Data munging is a term to describe the important process of cleaning up data so that it is ready for data analysis and
use in machine learning algorithms. This requires good pattern-recognition ability and clever hacking skills in order to
merge and transform masses of raw information. Dirty data can obfuscate the 'truth' hidden in the data and
completely mislead an analysis, so any data scientist must be skillful and nimble at data munging in order to have
accurate data for deriving insight.
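
As a small illustration of munging, the sketch below merges records from two hypothetical sources whose keys and formats do not quite match. The field names (`email`, `name`, `amount`), the source names, and the cleaning rules are assumptions chosen for the example; real data usually needs many more rules than this.

```python
def normalize_email(raw):
    """Trim whitespace and lowercase so the same address matches across sources."""
    return raw.strip().lower()

def merge_sources(crm_rows, billing_rows):
    """Combine customer names and billing totals under one cleaned-up key."""
    merged = {}
    for row in crm_rows:
        key = normalize_email(row["email"])
        merged.setdefault(key, {})["name"] = row["name"].strip().title()
    for row in billing_rows:
        key = normalize_email(row["email"])
        # Amounts arrive as text such as "$1,200.50"; strip symbols before converting.
        amount = float(row["amount"].replace("$", "").replace(",", ""))
        record = merged.setdefault(key, {})
        record["total_spent"] = record.get("total_spent", 0.0) + amount
    return merged

# Hypothetical raw inputs with inconsistent casing, spacing, and number formats.
crm = [{"email": " Ann@Example.com ", "name": "ann lee"}]
billing = [
    {"email": "ann@example.com", "amount": "$1,200.50"},
    {"email": "ANN@example.com ", "amount": "99.50"},
    {"email": "bob@example.com", "amount": "10"},
]
customers = merge_sources(crm, billing)
```

Note that the record for `bob@example.com` ends up with a total but no name: mismatched records like this are exactly what has to be found and resolved before analysis.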

What Types of Questions Can Data Science Answer?


ML algorithms can be grouped into families based on the type of question they
answer. These can help guide your thinking as you are formulating your razor-sharp
question.

What is machine learning? You probably use it dozens of times a day without even
knowing it. Each time you do a web search on Google or Bing, it works so
well because their machine learning software has figured out how to rank web
pages. When Facebook or Apple's photo application recognizes your friends in your
pictures, that's also machine learning. Each time you read your email and a spam
filter saves you from having to wade through tons of spam, again, that's because
your computer has learned to distinguish spam from non-spam email. So, that's
machine learning: the science of getting computers to learn without being
explicitly programmed. One of the research projects that I'm working on is getting
robots to tidy up the house. How do you go about doing that? Well, what you can do
is have the robot watch you demonstrate the task and learn from that. The robot
can then watch what objects you pick up and where you put them, and try to do the
same thing even when you aren't there. For me, one of the reasons I'm excited
about this is the AI, or artificial intelligence, problem: building truly intelligent
machines that can do just about anything that you or I can do. Many scientists think
the best way to make progress on this is through learning algorithms called neural
networks, which mimic how the human brain works.

Is this A or B?
This family is formally known as two-class classification. It's useful for any question that has just two possible
answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like
this or can be re-phrased to fit this form. It's the simplest and most commonly asked data science question. Here are
a few typical examples.
Will this customer renew their subscription?
Is this an image of a cat or a dog?
Will this customer click on the top link?
Will this tire fail in the next thousand miles?
Does the $5 coupon or the 25% off coupon result in more return customers?
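
A minimal sketch of a two-class classifier, using the tire question above: it learns a single threshold on one feature (a "decision stump") from labeled examples. The feature (tread depth), the data values, and the function names are invented for illustration; practical classifiers use many features and more capable algorithms.

```python
def train_stump(values, labels):
    """Find the threshold with the fewest training errors.

    The resulting rule predicts True when value <= threshold.
    """
    best_threshold, best_errors = None, len(values) + 1
    for candidate in sorted(set(values)):
        errors = sum((v <= candidate) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def will_fail(tread_depth_mm, threshold):
    """Answer the yes/no question for a new, unlabeled tire."""
    return tread_depth_mm <= threshold

# Hypothetical labeled examples: tread depth and whether the tire failed.
tread_depths_mm = [1.0, 1.5, 2.0, 4.0, 5.5, 7.0]
failed_within_1000_miles = [True, True, True, False, False, False]

threshold = train_stump(tread_depths_mm, failed_within_1000_miles)
prediction_worn = will_fail(1.8, threshold)
prediction_new = will_fail(6.0, threshold)
```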

Is this A or B or C or D?

This algorithm family is called multi-class classification. Like its name implies, it answers a question that has
several (or even many) possible answers: which flavor, which person, which part, which company, which candidate.
Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few
typical examples.
Which animal is in this image?
Which aircraft is causing this radar signature?
What is the topic of this news article?
What is the mood of this tweet?
Who is the speaker in this recording?
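
One simple way to answer a multi-class question is a nearest-centroid classifier: average the labeled examples of each class, then assign a new point to the class whose average is closest. The sketch below assumes made-up animal measurements (weight in kg, height in cm) and hypothetical function names; it is one of many possible approaches, not the one the text prescribes.

```python
import math
from collections import defaultdict

def train_centroids(points, labels):
    """Average the feature vectors of each class."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, point):
    """Return the class whose centroid is nearest to the point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

# Hypothetical labeled examples: (weight_kg, height_cm).
measurements = [(4, 25), (5, 23), (30, 60), (25, 55), (500, 160), (450, 150)]
animals = ["cat", "cat", "dog", "dog", "horse", "horse"]

centroids = train_centroids(measurements, animals)
guess_small = predict(centroids, (6, 24))
guess_medium = predict(centroids, (28, 58))
guess_large = predict(centroids, (480, 155))
```

The same code works for any number of classes, which is what separates it from the two-class case.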

Is this Weird?
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are
paying close attention, you noticed that this looks like a binary classification question. It can be answered yes or no.
The difference is that binary classification assumes you have a collection of examples of both yes and no cases.
Anomaly detection doesn't. This is particularly useful when what you are looking for occurs so rarely that you haven't
had a chance to collect many examples of it, like equipment failures. It's also very helpful when there is a lot of variety
in what constitutes "not normal," as there is in credit card fraud detection. Here are some typical anomaly detection
questions.

Is this pressure reading unusual?
Is this internet message typical?
Is this combination of purchases very different from what this customer has made in the past?
Are these voltages normal for this season and time of day?
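
The key property described above, learning from normal examples only, can be sketched with a simple z-score rule: summarize the normal readings by their mean and standard deviation, then flag anything too many standard deviations away. The pressure values, the cutoff of 3, and the function names are assumptions for illustration.

```python
import statistics

def fit_normal_profile(readings):
    """Summarize what 'normal' looks like using only normal examples."""
    return statistics.mean(readings), statistics.stdev(readings)

def is_unusual(value, profile, max_z=3.0):
    """Flag a reading that lies more than max_z standard deviations from the mean."""
    mean, stdev = profile
    return abs(value - mean) / stdev > max_z

# Hypothetical pressure readings collected during normal operation.
normal_pressures = [100.1, 99.8, 100.3, 99.9, 100.0, 100.2, 99.7]
profile = fit_normal_profile(normal_pressures)

typical_reading_flagged = is_unusual(100.4, profile)
spike_flagged = is_unusual(104.0, profile)
```

No failure examples were needed to build the profile, which is why this style of method suits rare events.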

How Much / How Many?
When you are looking for a number instead of a class or category, the algorithm family to use is regression.

What will the temperature be next Tuesday?
What will my fourth-quarter sales in Portugal be?
How many kilowatts will be demanded from my wind farm 30 minutes from now?
How many new followers will I get next week?
Out of a thousand units, how many of this model of bearings will survive 10,000 hours of use?
Usually, regression algorithms give a real-valued answer; the answers can have lots of decimal places or even be
negative. For some questions, especially questions beginning "How many," negative answers may have to be
re-interpreted as zero and fractional values re-interpreted as the nearest whole number.
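
The sketch below fits a straight line by ordinary least squares and then applies the re-interpretation just described: negative predictions become zero and fractional ones are rounded. The follower counts and function names are hypothetical, and a straight line is only one of many regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

def predict_count(slope, intercept, x):
    """Turn a real-valued prediction into a sensible 'how many' answer."""
    raw = slope * x + intercept
    return max(0, round(raw))  # no negative counts, no fractional followers

# Hypothetical history: new followers gained in weeks 1 to 4.
weeks = [1, 2, 3, 4]
new_followers = [10, 8, 5, 2]

slope, intercept = fit_line(weeks, new_followers)
raw_week_6 = slope * 6 + intercept             # a negative real number
followers_week_6 = predict_count(slope, intercept, 6)
followers_week_3 = predict_count(slope, intercept, 3)
```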
Sometimes questions that look like multi-class classification questions are actually better suited to regression. For
instance, "Which news story is the most interesting to this reader?" appears to ask for a category, a single item from
the list of news stories. However, you can reformulate it to "How interesting is each story on this list to this reader?"
and give each article a numerical score. Then it is a simple thing to identify the highest-scoring article. Questions of
this type often occur as rankings or comparisons.

"Which van in my fleet needs servicing the most?" can be rephrased as "How badly does each van in my
fleet need servicing?"
"Which 5% of my customers will leave my business for a competitor in the next year?" can be rephrased as
"How likely is each of my customers to leave my business for a competitor in the next year?"
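
Once every item has a numerical score, answering the "which" question is just sorting. In this sketch the scores are hard-coded stand-ins for the output of some regression model; the van names, the values, and the function name are hypothetical.

```python
def rank_by_score(scores, top_n):
    """Return the top_n names with the highest scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical regression output: how badly each van needs servicing (0 to 1).
service_urgency = {"van_1": 0.35, "van_2": 0.90, "van_3": 0.10, "van_4": 0.62}

most_urgent = rank_by_score(service_urgency, top_n=1)
top_two = rank_by_score(service_urgency, top_n=2)
```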

Two-Class Classification as Regression
It may not come as a surprise that binary classification problems can also be reformulated as regression. (In fact,
under the hood some algorithms reformulate every binary classification as regression.) This is especially helpful when
an example can belong partly to A and partly to B, or have a chance of going either way. When an answer can be partly yes
and no, probably on but possibly off, then regression can reflect that. Questions of this type often begin with "How likely"
or "What fraction".

How likely is this user to click on my ad?
What fraction of pulls on this slot machine result in payout?
How likely is this employee to be an insider security threat?
What fraction of today's flights will depart on time?
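
Logistic regression is one common way to turn a yes/no question into a "how likely" answer. The sketch below fits it with plain gradient descent on one feature; the relevance scores, click labels, learning rate, and epoch count are all assumptions chosen for illustration rather than tuned values.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(weight * x + bias) by gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
        bias -= learning_rate * sum(residuals) / n
    return weight, bias

# Hypothetical data: ad relevance score and whether the user clicked (1) or not (0).
relevance = [0, 1, 2, 3, 4, 5]
clicked = [0, 0, 1, 0, 1, 1]

weight, bias = fit_logistic(relevance, clicked)
p_click_low = sigmoid(weight * 0 + bias)    # likelihood for an irrelevant ad
p_click_high = sigmoid(weight * 5 + bias)   # likelihood for a highly relevant ad
```

The output is a number between 0 and 1 rather than a hard yes or no, so an example that could go either way gets a score near the middle.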
As you may have gathered, the families of two-class classification, multi-class classification, anomaly detection, and
regression are all closely related. They all belong to the same extended family, supervised learning. They have a lot
in common, and often questions can be modified and posed in more than one of them. What they all share is that
they are built using a set of labeled examples (a process called training), after which they can assign a value or
category to unlabeled examples (a process called scoring).
Entirely different sets of data science questions belong in the extended algorithm families of unsupervised and
reinforcement learning.

How is this Data Organized?


Questions about how data is organized belong to unsupervised learning. There are a wide variety of techniques
that try to tease out the structure of data. One family of these performs clustering, a.k.a. chunking, grouping,
bunching, or segmentation. They seek to separate out a data set into intuitive chunks. What makes clustering
different from supervised learning is that there is no number or name that tells you what group each point belongs to,
what the groups represent, or even how many groups there should be. If supervised learning is picking out planets
from among the stars in the night sky, then clustering is inventing constellations. Clustering tries to separate out data
into natural clumps, so that a human analyst can more easily interpret it and explain it to others.
Clustering always relies on a definition of closeness or similarity, called a distance metric. The distance metric can be
any measurable quantity, such as difference in IQ, number of shared genetic base pairs, or miles-as-the-crow-flies.
Clustering questions all try to break data into more nearly uniform groups.

Which shoppers have similar tastes in produce?
Which viewers like the same kind of movies?
Which printer models fail the same way?
During which days of the week does this electrical substation have similar electrical power demands?
What is a natural way to break these documents into five topic groups?
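
A minimal k-means sketch shows how clustering leans on a distance metric (here, straight-line distance) and needs no labels. The shopper data, the choice of k = 2, and the naive "first k points" initialization are assumptions made to keep the example deterministic; real implementations choose starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=20):
    """Group points into k clusters using Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # naive, deterministic start
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical shoppers: (apples bought per month, bunches of kale per month).
shoppers = [(9, 1), (1, 8), (8, 2), (2, 9), (10, 1), (1, 10)]
groups, centers = k_means(shoppers, k=2)
```

The algorithm returns group numbers, not meanings: it is still up to the analyst to look at the groups and decide that one is "fruit buyers" and the other "greens buyers".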
Another family of unsupervised learning algorithms is called dimensionality reduction techniques. Dimensionality
reduction is another way to simplify the data, to make it easier to communicate, faster to compute with, and
easier to store.
At its core, dimensionality reduction is all about creating a shorthand for describing data points. A simple example is
GPA. A college student's academic strength is measured in dozens of classes by hundreds of exams and thousands
of assignments. Each assignment says something about how well that student understands the course material, but a
full listing of them would be way too much for any recruiter to digest. Luckily, you can create a shorthand just by
averaging all the scores together. You can get away with this massive simplification because students who do very
well on one assignment or in one class typically do well in others. By using GPA rather than the full portfolio, you do
lose richness. For instance, you wouldn't know it if the student is stronger in math than English, or if she scored better
on take-home programming assignments than on in-class quizzes. But what you gain is simplicity, which makes it a
lot easier to talk about and compare students' strengths.
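
The GPA shorthand can be written out in a few lines. The two transcripts below are invented; they are chosen so that the averages match even though the underlying profiles differ, which is the loss of richness described above.

```python
from statistics import mean

# Hypothetical grade-point scores per subject for two students.
transcripts = {
    "student_a": {"math": 4.0, "english": 3.0, "programming": 3.5},
    "student_b": {"math": 3.0, "english": 4.0, "programming": 3.5},
}

# Three numbers per student collapse into one.
gpa = {name: mean(list(grades.values())) for name, grades in transcripts.items()}

same_summary = gpa["student_a"] == gpa["student_b"]
same_detail = transcripts["student_a"] == transcripts["student_b"]
```

Both students get the same single number, yet one is stronger in math and the other in English: the summary is easy to compare but hides that difference.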
Dimensionality reduction-related questions are usually about factors that tend to vary together.

Which groups of sensors in this jet engine tend to vary with (and against) each other?
What leadership practices do successful CEOs have in common?
What are the most common patterns in gasoline price changes across the US?
What groups of words tend to occur together in this set of documents? (What are the topics they cover?)
If your goal is to summarize, simplify, condense, or distill a collection of data, dimensionality reduction and clustering
are your tools of choice.

What Should I Do Now?
A third extended family of ML algorithms focuses on taking actions. These are called reinforcement learning (RL)
algorithms. They are a little different from the supervised and unsupervised learning algorithms. A regression algorithm
might predict that the high temperature will be 98 degrees tomorrow, but it doesn't decide what to do about it. An RL
algorithm goes the next step and chooses an action, such as pre-refrigerating the upper floors of the office building
while the day is still cool.
RL algorithms were originally inspired by how the brains of rats and humans respond to punishment and rewards.
They choose actions, trying very hard to choose the action that will earn the greatest reward. You have to provide
them with a set of possible actions, and they need to get feedback after each action on whether it was good, neutral,
or a huge mistake.
Typically RL algorithms are a good fit for automated systems that have to make a lot of small decisions without a
human's guidance. Elevators, heating, cooling, and lighting systems are excellent candidates. RL was originally
developed to control robots, so anything that moves on its own, from inspection drones to vacuum cleaners, is fair
game. Questions that RL answers are always about what action should be taken, although the action is usually taken
by a machine.

Where should I place this ad on the webpage so that the viewer is most likely to click it?
Should I adjust the temperature higher, lower, or leave it where it is?
Should I vacuum the living room again or stay plugged in to my charging station?
How many shares of this stock should I buy right now?
Should I continue driving at the same speed, brake, or accelerate in response to that yellow light?
RL usually requires more effort to get working than other algorithm types because it's so tightly integrated with the
rest of the system. The upside is that most RL algorithms can start working without any data. They gather data as
they go, learning from trial and error.
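
One of the simplest RL-style methods is an epsilon-greedy "bandit": it starts with no data, tries actions, and learns from the reward after each one. The sketch below applies it to the ad-placement question; the placements, their click rates, the exploration rate, and the simulated feedback are all assumptions, and full RL problems also involve states and delayed rewards that this example leaves out.

```python
import random

def run_epsilon_greedy(click_rates, steps=2000, epsilon=0.1, seed=0):
    """Learn which action earns the most reward purely by trial and error."""
    rng = random.Random(seed)
    actions = list(click_rates)
    pulls = {a: 0 for a in actions}
    value = {a: 0.0 for a in actions}  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)            # explore: try something at random
        else:
            action = max(actions, key=value.get)    # exploit: use the best guess so far
        # Simulated feedback; in a real system this comes from the environment.
        reward = 1.0 if rng.random() < click_rates[action] else 0.0
        pulls[action] += 1
        value[action] += (reward - value[action]) / pulls[action]
    return value, pulls

# Hypothetical true click rates, unknown to the algorithm.
placements = {"sidebar": 0.1, "footer": 0.3, "top_banner": 0.8}
estimated_value, times_chosen = run_epsilon_greedy(placements)
best_placement = max(estimated_value, key=estimated_value.get)
```

The algorithm begins with every estimate at zero and gathers its own data as it goes, which mirrors the upside described above.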
