The beginning and end of nearly any problem in data science is a visualization—first, for understanding the shape
and structure of the raw data and, second, for communicating the final results to drive decision making. In either
case, the goal is to expose the essential properties of the data in a way that can be perceived and understood by
the human visual system.
Traditional visualization systems and techniques were designed in an era of data scarcity, but in today’s Big Data
world of an incredible abundance of information, understanding is the key commodity. Older approaches focused
on rendering individual data points faithfully, which was appropriate for the small data sets previously available.
However, when inappropriately applied to large data sets, these techniques suffer from systematic problems
like overplotting, oversaturation, undersaturation, undersampling and underutilized dynamic range, all of which
obscure the true properties of large data sets and lead to incorrect data-driven decisions. Fortunately, Anaconda
is here to help with datashading technology that is designed to solve these problems head-on.
In this paper, you’ll learn why Open Data Science is the foundation for modernizing data analytics, and how datashading technology addresses these visualization challenges.
Visualization in the Era of Big Data

Getting It Right Is Not Always Easy

Some of the problems related to the abundance of data can be overcome simply by using more or better hardware. For instance, larger data sets can be processed in a given amount of time by increasing the amount of computer memory, CPU cores or network bandwidth. But other problems are much less tractable, such as what might be called the ‘points-per-pixel problem’—which is anything but trivially easy to solve and requires fundamentally different approaches.

The ‘points-per-pixel’ problem is having more data points than is possible to represent as pixels on a computer monitor. If your data set has hundreds of millions or billions of data points—easily imaginable for Big Data—there are far more than can be displayed on a typical high-end 1920x1080 monitor with 2 million pixels, or even on a bleeding-edge 8K monitor, which can display only 33 million pixels. And yet, data scientists must accurately convey, if not all the data, at least the shape or scope of the Big Data, despite these hard limitations.

Very small data sets do not have this problem. For a scatterplot with only ten or a hundred points, it is easy to display all points, and observers can instantly perceive an outlier off to the side of the data’s cluster. But as you increase the data set’s size or sampling density, you begin to experience difficulties. With as few as 500 data points, it is much more likely that there will be a large cluster of points that mostly overlap each other, known as ‘overplotting’, and obscure the structure of the data within the cluster. Also, as they grow, data sets can quickly approach the points-per-pixel problem, either overall or in specific dense clusters of data points.

Technical ‘solutions’ are frequently proposed to head off these issues, but too often they are misapplied. One example is downsampling, where the number of data points is algorithmically reduced, but which can result in missing important aspects of your data. Another approach is to make data points partially transparent, so that they add up rather than overplot. However, setting the amount of transparency correctly is difficult, error-prone and leaves unavoidable tradeoffs between visibility of isolated samples and overplotting of dense clusters. Neither approach properly addresses the key problem in visualization of large data sets: systematically and objectively displaying large amounts of data in a way that can be presented effectively to the human visual system.
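To make the transparency tradeoff concrete, here is a small illustrative sketch (not taken from the paper; the sample sizes, cluster locations and alpha value are arbitrary assumptions) that plots a dense cluster plus a sparse one, first opaquely and then with a fixed transparency:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# A dense cluster of 100,000 points plus a sparse cluster of 50 points
dense = rng.normal(loc=(0, 0), scale=0.1, size=(100_000, 2))
sparse = rng.normal(loc=(2, 2), scale=0.1, size=(50, 2))
pts = np.vstack([dense, sparse])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Opaque points: the dense cluster overplots into a featureless blob
ax1.scatter(pts[:, 0], pts[:, 1], s=5)
ax1.set_title("Opaque (overplotted)")

# Fixed transparency: structure emerges in the dense cluster, but the
# 50 isolated points near (2, 2) become nearly invisible at the same alpha
ax2.scatter(pts[:, 0], pts[:, 1], s=5, alpha=0.01)
ax2.set_title("alpha = 0.01 (isolated points fade out)")

plt.show()
```

Whatever alpha is chosen, one of the two clusters is misrepresented, which is the tradeoff described above.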
Figure 3. Oversaturation Due to More Overlapping Points
Figure 4. Reducing Oversaturation by Decreasing Dot Size
true is quite a difficult process of trial and error, making it very likely that important features of the data set will be missed.

To avoid undersampling large data sets, researchers often use 2D histograms visualized as heatmaps, rather than scatterplots showing individual points. A heatmap has a fixed-size grid regardless of the data set size, so that it can make use of all the data. Heatmaps effectively approximate a probability density function over the specified space, with coarser heatmaps averaging out noise or irrelevant variations to reveal an underlying distribution, and finer heatmaps representing more details in the distribution, as long as the distribution is sufficiently and densely sampled.

Let’s look at some heatmaps in Figure 6 with different numbers of bins for the same two-Gaussians distribution. As you can see, a too-coarse binning, like grid A, cannot represent this distribution faithfully, but with enough bins, like grid C, the heatmap will approximate a tiny-dot scatterplot like plot D in the undersampling example in Figure 5. For intermediate grid sizes like B, the heatmap can average out the effects of undersampling. Grid B is actually a more faithful representation of the distribution than C, given that we know this distribution is two offset 2D Gaussians, while C more faithfully represents the sampling—the individual points drawn from this distribution. Therefore, choosing a good binning grid size for a heatmap does take some expertise and knowledge of the goals of the visualization, and it is always useful to look at multiple binning-grid spacings for comparison. Still, the binning parameter is something meaningful at the data level (how coarse a view of the data is desired?) rather than just a plotting detail (what size and transparency should I use for the points?), which would need to be determined arbitrarily.

In principle, the heatmap approach can entirely avoid the first three problems above:

1. Overplotting, since multiple data points sum arithmetically into the grid cell, without obscuring one another
2. Oversaturation, because the minimum and maximum counts observed can automatically be mapped to the two ends of a visible color range
3. Undersampling, since the resulting plot size is independent of the number of data points, allowing it to use an unbounded amount of incoming data

UNDERSATURATION. Heatmaps come with their own plotting pitfalls. One rarely appreciated issue common to both heatmaps and alpha-based scatterplots is undersaturation, where large numbers of data points can be missed entirely because they are spread over many different heatmap bins or many nearly-transparent scatter points. To look at this problem, we can construct a data set combining multiple 2D Gaussians, each at a different location and with a different amount of spread (standard deviation):

LOCATION             (2,2)   (2,-2)   (-2,-2)   (-2,2)   (0,0)
STANDARD DEVIATION   0.01    0.1      0.5       1.0      2.0

Even though this is still a very simple data set, it has properties shared with many real-world data sets, namely that there are some areas of the space that will be very densely populated with points, while others are only sparsely populated. On the next page we’ll look at some scatterplots for this data in Figure 7.
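A sketch of how a data set like this could be generated and then binned into a fixed-size heatmap; the per-Gaussian sample count, the bin count, and the column names are illustrative assumptions rather than values from the paper:

```python
import numpy as np
import pandas as pd

# Five 2D Gaussians at the locations and spreads from the table above
specs = [((2, 2), 0.01), ((2, -2), 0.1), ((-2, -2), 0.5),
         ((-2, 2), 1.0), ((0, 0), 2.0)]

rng = np.random.default_rng(0)
n = 10_000  # points drawn from each Gaussian (assumed)
df = pd.concat([
    pd.DataFrame({'x': rng.normal(cx, sd, n),
                  'y': rng.normal(cy, sd, n)})
    for (cx, cy), sd in specs
], ignore_index=True)

# A fixed-size 2D histogram ("heatmap") uses every point, no matter how
# many there are, so its memory cost does not grow with the data set
counts, xedges, yedges = np.histogram2d(df['x'], df['y'], bins=100)
```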
Figure 10. Dynamic Range with a Logarithmic Transformation
Figure 11. Parameter-Free Visualization Using Rank Order Plotting
The datashader library overcomes all of the pitfalls above, both by automatically calculating appropriate parameters based on the data itself and by allowing interactive visualizations of truly large data sets with millions or billions of data points, so that their structure can be revealed. The above techniques can be applied ‘by hand’, but datashader lets you do this easily, by providing a high-performance and flexible modular visualization pipeline, making it simple to do automatic processing, such as auto-ranging and histogram equalization, to faithfully reveal the properties of the data.

The datashader library has been designed to expose the stages involved in generating a visualization. These stages can then be automated, configured, customized or replaced wherever appropriate for a data analysis task. The five main stages in a datashader pipeline are an elaboration of the three main stages above, after allowing for user control in between processing steps, as shown in Figure 12.

Figure 12 illustrates a datashader pipeline with computational steps listed across the top of the diagram, while the data structures, or objects, are listed along the bottom. Breaking up the computation into this set of stages is what gives datashader its power, because only the first couple of stages require the full data set, while the remaining stages use a fixed-size data structure regardless of the input data set, making it practical to work with even extremely large data sets.

To demonstrate, we’ll construct a synthetic data set made of the same five overlapping 2D normal distributions introduced in the undersaturation example shown previously in Figure 7. In Figure 13, you can see each of the five underlying distributions clearly; they have been manually labeled in the version on the right, for clarity.

The stages involved in these computations will be laid out one by one below, showing both how the steps are automated and how they can be customized by the user when desired.

PROJECTION. Datashader is designed to render data sets projected onto a 2D rectangular grid, eventually generating an image where each pixel corresponds to one cell in that grid. The projection stage includes several steps (a brief code sketch follows this list):

1. Select which variable you want to have on the x axis and which one for the y axis. If those variables are not already columns in your dataframe (for example, if you want to do a coordinate transformation), you’ll first need to create suitable columns mapping directly to x and y for use in the next step.
2. Choose a glyph, which determines how an incoming data point maps onto the chosen rectangular grid. There are three glyphs currently provided with the library:
   a. A Point glyph that maps the data point into the single closest grid cell
   b. A Line glyph that maps that point into every grid cell falling between this point and the next
   c. A Raster glyph that treats each point as a square in a regular grid covering a continuous space
3. Although new glyph types are somewhat difficult to create and rarely needed, you can design your own if desired, to shade a point onto a set of bins according to some kernel function or
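However the glyph is chosen, the projection and aggregation stages correspond to just a few calls in datashader. Below is a minimal sketch, assuming df is the synthetic five-Gaussian dataframe sketched earlier; the grid size, axis ranges and colormap are illustrative choices, not values from the paper:

```python
import datashader as ds
import datashader.transfer_functions as tf

# Projection: pick the x/y columns and a 400x400 rectangular grid
canvas = ds.Canvas(plot_width=400, plot_height=400,
                   x_range=(-5, 5), y_range=(-5, 5))

# Aggregation with the Point glyph: each row falls into the single closest
# grid cell, and ds.count() tallies how many points land in each cell
agg = canvas.points(df, 'x', 'y', agg=ds.count())

# Shading: histogram equalization spreads the observed counts across the
# full visible color range automatically, with no hand-tuned parameters
img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='eq_hist')
```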
Figure 14. Visualization of Various Aggregations Using Datashader (A: count aggregation, B: any aggregation, C: mean y aggregation, D: mean val aggregation)
Figure 15. Single-Line Operations Using xarray/NumPy Functions (A: agg.where(agg >= np.percentile(agg, 99)), B: numpy.sin(agg))
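Because the aggregate produced by the canvas is an xarray DataArray, the single-line operations named in Figure 15 can be written directly as NumPy-style expressions. A small sketch, reusing the agg variable from the previous example:

```python
import numpy as np
import datashader.transfer_functions as tf

# Keep only the hottest 1% of grid cells (as in Figure 15, panel A)
top_cells = agg.where(agg >= np.percentile(agg, 99))

# Apply an arbitrary NumPy function to every cell (Figure 15, panel B)
wavy = np.sin(agg)

# Either result can be shaded into an image just like the raw aggregate
img = tf.shade(top_cells)
```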
On a live server, you can zoom and pan to explore each of the different regions of this data set. For instance, if you zoom in far enough on the blue dot, you’ll see that it does indeed include 10,000 points; they are just so close together that they show up as only a
Figure 16. Examples of Colormapping Using Datashader
Figure 17. Datashader Embedded in Interactive Bokeh Visualizations
Figure 18. Visualizing US Population Density with Datashader
Figure 20. Race & Ethnicity with Datashader (zooming in to view race/ethnicity data: A: Chicago, B: NYC, C: Los Angeles, D: Chicago)
Figure 21. Plotting NYC Taxi Dropoffs with Bokeh
Figure 22. Plotting NYC Taxi Dropoffs with Datashader
Figure 23. NYC Taxi Pickup Times
Figure 24. Taxi Pickup Times Zoomed with Overlay
Dropoffs (blue) vs. pickup (red) locations

Plotted in this way, it is clear that pickups are much more likely along the main arteries (presumably where a taxi can be hailed successfully), while dropoffs are more likely along side streets. LaGuardia Airport (circled) also shows clearly segregated pickup and dropoff areas, with pickups being more widespread, presumably because those are on a lower level and thus have lower GPS accuracy due to occlusion of the satellites.
With datashader, building a plot like this is very simple, once the data has been aggregated. An aggregate is an xarray data structure (see xarray.pydata.org) and, if we create an aggregate named drops that contains the dropoff locations and one named picks that contains the pickup locations, then drops.where(drops>picks) will be a new aggregate holding all the areas with more dropoffs, and picks.where(picks>drops) will hold all those with more pickups. These can then be merged to make the plot above, in one line of datashader code (a sketch of this approach appears below). Making a plot like this in another plotting package would essentially require replicating the aggregation step of datashader, which would require far more code.

Similarly, referring back to the US census data, it only takes one line of datashader code to filter the race/ethnicity data to show only those pixels containing at least one person of every category, as in Figure 26, plot A.

Figure 26. Filtering US Census Data (A: US census data, only including pixels with every race/ethnicity included)
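Returning to the taxi example, a rough sketch of the dropoff-versus-pickup comparison described above; the dataframe and its column names, the canvas size and the colors are assumptions for illustration, while the .where() comparisons come directly from the text:

```python
import datashader as ds
import datashader.transfer_functions as tf

# taxi_df is assumed to hold projected pickup/dropoff coordinate columns
canvas = ds.Canvas(plot_width=900, plot_height=600)

# Aggregate dropoff and pickup counts onto the same grid
drops = canvas.points(taxi_df, 'dropoff_x', 'dropoff_y', agg=ds.count())
picks = canvas.points(taxi_df, 'pickup_x', 'pickup_y', agg=ds.count())

# Keep only the cells where one activity dominates, then shade each
more_drops = tf.shade(drops.where(drops > picks), cmap=['lightblue', 'blue'])
more_picks = tf.shade(picks.where(picks > drops), cmap=['mistyrose', 'red'])

# Merge the two shaded images into a single picture
img = tf.stack(more_drops, more_picks)
```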
Figure 27. Multiple Overlapping Time Series Curves
Figure 28. Zooming in on the Data (zoom levels 0, 1, and 2)