In This Whitepaper
Data science is about using data to provide insight and evidence that can lead business, government and academic
leaders to make better decisions. However, making sense of the large data sets now becoming ubiquitous is
difficult, and it is crucial to use appropriate tools that will drive smart decisions.
The beginning and end of nearly any problem in data science is a visualization: first, for understanding the shape
and structure of the raw data and, second, for communicating the final results to drive decision making. In either
case, the goal is to expose the essential properties of the data in a way that can be perceived and understood by
the human visual system.
Traditional visualization systems and techniques were designed in an era of data scarcity, but in today's Big Data
world of incredible information abundance, understanding is the key commodity. Older approaches focused
on rendering individual data points faithfully, which was appropriate for the small data sets previously available.
However, when inappropriately applied to large data sets, these techniques suffer from systematic problems
like overplotting, oversaturation, undersaturation, undersampling and underutilized dynamic range, all of which
obscure the true properties of large data sets and lead to incorrect data-driven decisions. Fortunately, Anaconda
is here to help, with datashading technology designed to solve these problems head-on.
In this paper, you'll learn why Open Data Science is the foundation for modernizing data analytics.
The Complexity of Visualizing Large Amounts of Data
Data sets of nearly any size are imaginable for Big Data: there are far more data points than can be displayed
on even the largest screen, which has only a few million pixels. And yet, data scientists must accurately convey, if not
all the data, at least the shape or scope of the Big Data, despite these
hard limitations.
Very small data sets do not have this problem. For a scatterplot with
only ten or a hundred points, it is easy to display all points, and
observers can instantly perceive an outlier off to the side of the data's
cluster. But as you increase the data set's size or sampling density,
these simple approaches quickly break down.
Let's take a deeper dive into five major plotting pitfalls and how they
cause minor inconveniences with small data sets but very serious problems with
larger ones:
1. Overplotting
2. Oversaturation
3. Undersampling
4. Undersaturation
5. Underutilized range
OVERPLOTTING. Let's consider plotting some 2D data points that
come from two separate categories, plotted as blue and red in A and
B of Figure 1. When the two categories are overlaid, whichever one is
plotted last can obscure the other almost completely, even though
either plotting order is equally valid.

OVERSATURATION. A common way to reduce overplotting is to make
the points partially transparent. With an alpha (opacity) of
0.1, full color saturation will be achieved only when 10 points overlap,
which reduces the effects of plot ordering but can make it harder to
see individual points.
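As a minimal sketch of this transparency approach (hypothetical code and data, not taken from the whitepaper), overlapping translucent points saturate once roughly 1/alpha of them coincide:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical data: two offset 2D Gaussian clusters, one per category
    rng = np.random.default_rng(0)
    blue = rng.normal((-1, 0), 1, size=(10_000, 2))
    red = rng.normal((+1, 0), 1, size=(10_000, 2))

    # alpha=0.1 means about 10 overlapping points reach full saturation;
    # beyond that, additional overlap is invisible
    plt.scatter(blue[:, 0], blue[:, 1], s=2, alpha=0.1, color='blue')
    plt.scatter(red[:, 0], red[:, 1], s=2, alpha=0.1, color='red')
    plt.show()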
In the example in Figure 2, C and D look very similar (as they should,
since the distributions are identical and both categories are equally
common in this case), but there are still a few specific regions that
reach full saturation, obscuring the density differences there.
Figure 1. Overplotting
Even worse, the saturation ceiling hides all differences above it: 10, 100 or 1,000
points overlapping will all look the same visually, for alpha=0.1.
Moreover, the appropriate alpha value still depends on the data set. If there are more points overlapping in
that particular region, a manually adjusted alpha setting that worked
well for a previous data set will systematically misrepresent the new
data set.
In Figure 5, let's look at another example: a sum of two normal
distributions slightly offset from one another, but no longer using
color to separate them into categories.
As shown in the examples in the previous sections, finding settings to
avoid overplotting and oversaturation is difficult. The dot size and alpha
parameters used in A and B (size 0.1, full alpha) of the
undersampling vs. overplotting example work fairly well for a sample
of 600 points (A), but those parameters lead to serious overplotting
issues for larger data sets, obscuring the shape and density of the
distribution (B). Switching to 10 times smaller dots with alpha 0.1 to
allow overlap (tiny dots) works well for the larger data set (D), but not
at all for the 600-point data set (C). Clearly, not all of these settings are
accurately conveying the underlying distribution: they all appear
quite different from one another, yet in each case they are plotting
samples from the same distribution. Similar problems occur for the
same size data set but with greater or lesser levels of overlap,
which depends on the dot size, because smaller dots have less overlap for the same
data set. With smaller dots, as shown in Figure 4, C and D look more
similar, as desired, but the color of the dots is now difficult to see in
all cases, because the dots are too transparent for this size.
As you can see in Figure 4, it is very difficult to find settings for the
dot size and alpha parameters that correctly reveal the data, even for
relatively small and obvious data sets like these; with larger data sets
of unknown structure, the guesswork is even worse. In any case, as
data set size increases, at some point plotting a full scatterplot like
any of these will become impractical with current plotting technology.
At this point, people often simply subsample their data, plotting only
a random fraction of the points. But such undersampling can hide or
distort important features, and these problems can occur even when
taking very large numbers of samples.
In principle, the heatmap approach can entirely avoid the first three
problems above (overplotting, oversaturation and undersampling).
Consider a simple example data set, a sum of five 2D normal
distributions, each specified by its center and standard deviation
(a code sketch of the heatmap approach follows the table below):
LOCATION    STANDARD DEVIATION
(2,2)       0.01
(2,-2)      0.1
(-2,-2)     0.5
(-2,2)      1.0
(0,0)       2.0
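Here is a minimal sketch of a heatmap for this data set (hypothetical code, not the whitepaper's own; it reconstructs the data from the table and bins it with numpy):

    import numpy as np
    import matplotlib.pyplot as plt

    # Reconstruct the five-Gaussian data set: 10,000 points per distribution
    rng = np.random.default_rng(1)
    centers = [(2, 2), (2, -2), (-2, -2), (-2, 2), (0, 0)]
    spreads = [0.01, 0.1, 0.5, 1.0, 2.0]
    pts = np.concatenate([rng.normal(c, s, size=(10_000, 2))
                          for c, s in zip(centers, spreads)])

    # Heatmap: bin the points onto a grid, counting points per bin,
    # then map counts linearly into gray values
    counts, xe, ye = np.histogram2d(pts[:, 0], pts[:, 1], bins=200)
    plt.imshow(counts.T, origin='lower', cmap='gray_r')
    plt.show()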
Even though this is still a very simple data set, it has properties
shared with many real world data sets, namely that some
areas of the space are very densely populated with points,
while others are only sparsely populated. The plot settings are then
no longer just a plotting detail (what size and transparency should I
use for the points?), which would ideally not affect our conclusions;
they amount to choosing what view of the data is desired. Let's look
at what happens when we plot this data set as a heatmap.

In plot A, there are clearly visible Gaussians, but all but the largest appear to have the
same density of points per pixel, which we know is not the case from
how the data set was constructed, plus the smallest is nearly invisible.
In addition, each of the five Gaussians has the same number of data
points (10,000), but the second largest looks like it has more than the
others.

Plot A suffers from too-coarse binning, but even B is somewhat too coarsely binned for this
data, since the very narrow spread and narrow spread Gaussians
show up identically, each mapping entirely into a single bin (the two
black pixels). Plot C does not suffer from too-coarse binning, yet it
still looks more like a plot of the very large spread distribution
alone, rather than a plot of these five distributions that have different
spreads. The culprit this time is undersaturation.

In plot C, differences in density are not visible between the five Gaussians, because all, or
nearly all, pixels end up being mapped into either the bottom end of
the visible range (light gray), or the top end (pure black, used only for
the single pixel holding the very narrow spread distribution). The
rest of the visible colors in this gray colormap are unused, conveying
no information about density. Real data sets suffer in the same way:
a dense urban core will entirely dominate the plot if using the plot settings in A,
while a rural population spread over a wide region will entirely fail to show
up; either the colormap range is pinned to the highest count value, and then
the wider spread values are obscured, as in B,
or entirely invisible, as in C.
So, let's try transforming the data from its default linear
mapping into something that makes better use of the visible range.
To avoid undersaturation, you can add an offset to ensure that low count,
but nonzero, bins are mapped into a visible color, with the remaining
intensity scale used to indicate differences in counts (Figure 9).
Such a mapping entirely avoids undersaturation, since all pixels are
either clearly zero (shown in the background color, white in this case) or a
non-background color taken from the colormap. The widest-spread
Gaussian is now clearly visible in all cases.
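A minimal sketch of such an offset mapping (hypothetical code, applied to the counts grid from the earlier heatmap sketch):

    import numpy as np

    # Reserve the bottom of the intensity range: zero-count bins stay at 0
    # (background), while any nonzero bin starts at a clearly visible level
    offset = 0.2
    scaled = counts / counts.max()
    visible = np.where(counts > 0, offset + (1 - offset) * scaled, 0.0)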
Aha! We can now see the full structure of the data set, with all five
Gaussians clearly visible in B and C and the relative spreads also
clearly visible in C. However, we still have a problem. Unlike
the solutions to the first four pitfalls, the choice of a logarithmic
transform is data set dependent: it reveals the full structure that we
know was in this data set, i.e. five Gaussians, only because that
structure happens to be spread usefully across the full and very wide
range of counts in the original data, which we knew from constructing
the example. For large data sets with truly unknown structure, is there
a way to map the data set values into a visible range that will work
across data sets?

Yes, if we think of the visualization problem in a different way. The
underlying difficulty, for this and other large data sets, is that the
values in each bin are numerically very different, ranging from 10,000
in the bin for the very narrow spread Gaussian to 0 or 1 for single data
points from the very large spread Gaussian. Given such widely varying
values, numerically mapping the data values into the visible range
linearly is clearly not going to work well. But, given that we are trying
to convey differences in density, we can instead map each count to a
gray value based on its rank relative to the other counts, i.e. perform
histogram equalization, with one gray level per equally sized group of
pixels in this case. A plot made this way, like plot C, is accurately
conveying the structure, but the specific counts are no longer directly
readable; if needed, they can be recovered for the viewer by adding a
color key mapping from the visible gray values back into counts.
Faithfully showing the full range in a single plot will not work well, but
in each case transformations like these highlight specific aspects of
the data that are needed for a decision.
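Here is a minimal sketch of that rank-based mapping (histogram equalization); datashader implements a more complete version, but hypothetical code like this conveys the idea:

    import numpy as np

    def eq_hist(counts):
        """Map each nonzero bin to its rank, so every gray level is used
        by roughly the same number of pixels regardless of count values."""
        flat = counts.ravel()
        nonzero = flat > 0
        ranks = flat[nonzero].argsort().argsort()   # 0 .. n-1, ordered by count
        out = np.zeros_like(flat, dtype=float)
        out[nonzero] = (ranks + 1) / nonzero.sum()  # map ranks into (0, 1]
        return out.reshape(counts.shape)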
2. Rasterize
3. Transfer
This process is designed to avoid the pitfalls discussed above. In our approach, however, it is only the first step
in an iterative process of tuning one's models and deciding how best to display the data, in
which the data scientist can control how the data is transformed and
visualized at each step, starting from a first plot that already
faithfully reveals the overall data set.
Figure: the datashader pipeline, in which Data becomes a Scene, Aggregate(s), an Image and finally a Plot, via the Aggregation, Transformation, Colormapping and Embedding stages.
In Figure 13, you can see each of the five underlying distributions
clearly; they have been manually labeled in the version on the right
for clarity.
Together, these techniques make it practical to visualize data
sets with millions or billions of data points so that their structure can
be revealed. The above techniques can be applied by hand, but
datashader lets you do this easily, by providing a high performance
and flexible modular visualization pipeline, making it simple to do
automatic processing such as auto-ranging and histogram
equalization. The first stage projects the data onto a rectangular grid
of bins, where each pixel of the final image corresponds to one cell in
that grid. In the projection stage, you:
1. Select which variable you want to have on the x axis and which
one for the y axis. If those variables are not already columns in
the data set, they must be computed first (see the pipeline sketch below).
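Here is a minimal sketch of that pipeline using datashader's public API (the dataframe and its x and y columns are hypothetical):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Hypothetical data: 100,000 points with x and y columns
    rng = np.random.default_rng(2)
    df = pd.DataFrame(rng.normal(size=(100_000, 2)), columns=['x', 'y'])

    # Projection + aggregation: a 400x400 grid of bins, counting points per bin
    canvas = ds.Canvas(plot_width=400, plot_height=400)
    agg = canvas.points(df, 'x', 'y', agg=ds.count())

    # Transformation + colormapping: histogram-equalize counts into colors
    img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='eq_hist')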
For instance, in Figure 15, instead of plotting all the data, we can
easily find hotspots by plotting only those bins in the 99th percentile
by count; any such transformation can be applied mechanically, whether or
not it is meaningful for the question at hand. For aggregation, it is
simplest to think about the operator as being applied per data point,
updating the bin into which that point falls. For colormapping, the
aggregate values are then mapped into colors by interpolating the red,
green and blue color channels (AD to 00 for the red channel, in this
case). The alpha (opacity) value is set to 0 for empty bins and 1 for
non-empty bins, allowing the page background to show through
wherever there is no data. You can supply any colormap you like, as
shown in Figure 16, including Bokeh palettes, matplotlib colormaps
or a list of colors using the color names from ds.colors, integer
triples or hexadecimal strings.
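For instance, any of these (hypothetical but representative) colormap choices could be passed to datashader's shading step:

    import datashader.transfer_functions as tf
    from bokeh.palettes import Viridis256
    from matplotlib import cm

    # 'agg' is an aggregate grid like the one from the pipeline sketch above
    img1 = tf.shade(agg, cmap=Viridis256)                    # Bokeh palette
    img2 = tf.shade(agg, cmap=cm.viridis)                    # matplotlib colormap
    img3 = tf.shade(agg, cmap=[(240, 240, 255), '#3333AA'])  # integer triple / hex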
AGGREGATION TYPE      EXAMPLE TRANSFORMATION
Count aggregation     agg.where(agg >= np.percentile(agg, 99))
Any aggregation       numpy.sin(agg)
Mean y aggregation
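As a runnable version of the first row (hypothetical code, reusing the agg from the pipeline sketch above):

    import numpy as np
    import datashader.transfer_functions as tf

    # Keep only the bins in the 99th percentile by count; everything else
    # becomes NaN, which shade() leaves as transparent background
    hotspots = agg.where(agg >= np.percentile(agg, 99))
    img = tf.shade(hotspots, cmap=['lightpink', 'red'], how='eq_hist')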
Zooming in might reveal that an apparently significant feature is just a
single tiny blue spot in the above plot. Such exploration is crucial for
making sense of data at this scale. Note that the raw images produced
at this stage do not show the data ranges, axis labels and so on, nor do they
offer interactivity; those come from the embedding stage, when the
image is placed into a plotting library. For categorical aggregates,
datashader will then merge all the categories present in each pixel to
yield a single visible color per pixel.
EXAMPLE 1: US CENSUS DATA. The 2010 US Census counted the
locations of the more than 300 million people in the United States.
Here, we'll focus on the subset of the data selected by the Cooper
Center, who produced a map of the population colored by race and
ethnicity (http://www.coopercenter.org/demographics/Racial-Dot-Map). Each
dot in this map corresponds to a specific person counted in the
census, located approximately at their residence. To protect privacy,
the precise locations have been randomized at the block level, so that
individual addresses are not revealed.
Plotting the x,y locations of each person, using all the default
plotting values, immediately reveals the population structure of the
USA: geography, like roads in the Midwest, and history, such as high
population density along the East coast, are all clearly visible, and
additional structures emerge at every zoom level. Coloring each
person's dot by racial category shows clear residential patterns, and
even greater levels of segregation are visible when zooming into any
major city and comparing nearby neighborhoods. Datashader can also
automatically enlarge each point when zooming in so far that data
points become sparse, so that individual people remain visible (a
categorical-plot sketch follows below).
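A sketch of such a categorical plot with datashader (the dataframe and its column names are hypothetical stand-ins for the census data):

    import datashader as ds
    import datashader.transfer_functions as tf

    # census_df: hypothetical dataframe with one row per person, position
    # columns 'easting'/'northing', and a categorical 'race' column
    canvas = ds.Canvas(plot_width=900, plot_height=525)
    agg = canvas.points(census_df, 'easting', 'northing', ds.count_cat('race'))

    # Mix a color per pixel from the per-category counts, then enlarge
    # isolated points so individuals stay visible when zoomed in
    color_key = {'w': 'aqua', 'b': 'lime', 'a': 'red',
                 'h': 'fuchsia', 'o': 'saddlebrown'}
    img = tf.dynspread(tf.shade(agg, color_key=color_key, how='eq_hist'),
                       threshold=0.5, max_px=4)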
EXAMPLE 2: NYC TAXI DATA SET. For this example, we'll use
part of the well-studied NYC taxi trip database, with the locations of
all New York City taxicab pickups and dropoffs from January 2015.
The data set contains 12 million pickup and dropoff locations. By
analogy to the US census race data, you can also treat each hour
of the day as a separate category and color each pickup by when it
occurred:
Yellow: 4 a.m.
Green: 8 a.m.
Blue: 4 p.m.
Purple: 8 p.m.
In Figure 23, there are definitely different regions of the city where
pickups happen at specific times of day, with rich structure that can
be revealed by zooming in to see local patterns and relate them to
the underlying geographical map as shown in Figure 24.
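A hypothetical sketch of that per-hour coloring (the dataframe and column names are invented; the taxi data's actual schema may differ):

    import datashader as ds
    import datashader.transfer_functions as tf

    # Keep the four hours shown in the color key above, as categories
    sel = taxi_df[taxi_df['pickup_time'].dt.hour.isin([4, 8, 16, 20])].copy()
    sel['hour'] = sel['pickup_time'].dt.hour.astype('category')

    canvas = ds.Canvas(plot_width=900, plot_height=600)
    agg = canvas.points(sel, 'pickup_x', 'pickup_y', ds.count_cat('hour'))
    img = tf.shade(agg, color_key={4: 'yellow', 8: 'green',
                                   16: 'blue', 20: 'purple'})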
The structure of the New York City area is now clearly visible, with increasing levels of detail
available by zooming in to particular regions, without needing any
specially tuned or adjusted parameters.
A. US census data, only including pixels with every race/ethnicity included
With queries like these, the centers of major cities have been selected, along with many rural areas on
the West Coast that nonetheless have more Blacks than Whites.
Alternatively, we can simply highlight the top 1% of the pixels by
population density, in this case by using a color range with 100
shades of gray and then changing the top shade to red, as shown in
Figure 26, plot C.
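A sketch of that trick in code (hypothetical; agg is a population-count aggregate like those above):

    import datashader.transfer_functions as tf

    # 100 evenly spaced shades of gray, light to dark, for 100 count levels
    grays = ['#%02x%02x%02x' % (v, v, v) for v in range(240, 40, -2)]
    grays[-1] = 'red'  # replace the darkest (top) shade: top ~1% turns red
    img = tf.shade(agg, cmap=grays, how='eq_hist')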
Nearly any such query or operation that can be expressed at the
level of pixels (locations) can be expressed similarly simply,
providing a powerful counterpart to queries that are easy to
perform at the raw data level, or to filter by criteria already
provided as columns in the data set.
C. US population density, with the 1% most dense pixels colored in red
Trajectory plots (ordered GPS data coordinates) can similarly use all
the data available even for millions or billions of points, without
downsampling and with no parameter tuning, revealing
substructure at every level of detail, as in Figure 28.
In Figure 28, using one million points, there is an overall synthetic
random-walk trajectory, but a cyclic wobble can be seen when
zooming in partially, and small local noisy values can be seen when
zooming in fully. These patterns could be very important, for
example when summing up total path length, and they are easily
discoverable interactively with datashader, because the full data set
is available, with no downsampling required; a minimal sketch of such
a plot follows below.
Figure 28 panels: Zoom level 0, Zoom level 1, Zoom level 2.
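A minimal sketch of such a trajectory plot (synthetic data; Canvas.line connects successive points rather than plotting them independently):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Synthetic random walk with a small cyclic wobble, as in Figure 28
    n = 1_000_000
    rng = np.random.default_rng(3)
    walk = rng.normal(size=(n, 2)).cumsum(axis=0)
    walk[:, 0] += 2 * np.sin(np.arange(n) / 10)      # cyclic wobble

    df = pd.DataFrame(walk, columns=['x', 'y'])
    canvas = ds.Canvas(plot_width=800, plot_height=800)
    agg = canvas.line(df, 'x', 'y', agg=ds.count())  # rasterize the connected path
    img = tf.shade(agg, how='eq_hist')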
Summary
In this paper, we have shown some of the major challenges in
presenting Big Data visualizations, the failures of traditional
approaches to overcome these challenges and how a new approach
surmounts them. This new approach is a three-step process that
optimizes the display of the data to fit how the human visual system
works, employs statistical sophistication to ensure that data is
transformed and scaled appropriately, encourages exploration and easy
iteration by providing defaults that reveal the data automatically, and
allows full customization, letting data scientists adjust every step of
the process between data and visualization.
We have also introduced the datashader library available with
Anaconda, which supports all of this functionality. Datashader uses
Python code to build visualizations and powers the plotting
capabilities of Anaconda Mosaic, which explores, visualizes and
transforms heterogeneous data and lets you make datashader plots
out-of-the-box, without the need for custom coding.
The serious limitations of traditional approaches to visualizing Big
Data are no longer an issue. The datashader library is now available
to usher in a new era of seeing the truth in your data, to help you
make smart, data-driven decisions.