Catalog & Classify - Data Integrity

This is a collective research project providing examples and discussion of the basic building blocks of visual data representation.

In his Ph.D. dissertation, information designer Ben Fry assembled a taxonomy of standard visualization types. Look over his list and choose one to research more thoroughly. Sign up for your chosen visualization using this google doc.

Trees & Graphs

Yunjia Yuan | 19 Jan 2021

A tree is a type of data visualization that shows hierarchical data. Typically, it starts with one main branch and expands out to data pre-organized into subcategories. A graph is a related structure in which several branches connect back to one another.

Trees and graphs are well suited to showing the ranking order within a data set. They visualize the information architecture, the connections between categories, and the overall information flow, but they are poorly suited to showing the literal quantity of each data point.

Trees and graphs require the data to be categorized first, sorting out the connections between each level of data entries.

Data are most commonly translated into a varying number of colored lines, each ending in a dot. The dots' size and color, the branches' stroke weight, and the placement of the main branches can heavily affect the visualization outcome.

Pros: It can show large chunks of data in one graphic and provides a high-level understanding of how the data relate to each other.

Cons: When dealing with a large amount of data, only certain branches can be labelled for the sake of clarity. The graph can easily get cluttered if the main branches are positioned poorly.

Some good examples:

A tree graph. It is showing the data in a clear hierarchy.

A graph consisting of branches that connect back to one another. The colors here show category and quantity at the same time.

A tree graph. The expanding branches are used to visualize the data's quantity values.

A tree graph. The positions of the main branches enable clear lines to be drawn between data entries.

A graph. There is too much data and inadequate sorting and categorization. The result is messy and lacks focus.

A graph. The 3D rendering may deliver a cool visual, but the representation of the data is sacrificed, leaving an incomplete understanding of the data set.

A tree graph. Unnecessary lines and knots only cause confusion.

Source:

http://www.stefanieposavec.com/

Treemaps

Summer Tan | 19 Jan 2021

Treemaps are commonly found on data dashboards. Designers often choose them to add visual variety on a dense dashboard. However, treemaps are a complex visualization and present many obstacles to quick comprehension (which is the main requirement for any information displayed on a dashboard).

Creating a treemap involves choosing two dimensions of the data, color-coding one dimension, and defining a "tiling algorithm" for the dimension represented by area. The tiling algorithm determines how the rectangles are subdivided into rectangles of specific area (corresponding to the data). Treemaps are most legible when the sub-rectangles have an aspect ratio close to one.
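As a rough illustration of what a tiling algorithm does, here is a minimal sketch of "slice-and-dice", probably the simplest tiling: one rectangle is cut into strips whose areas are proportional to the values. The function names and the sample values are made up for this example. Note that this tiling makes no attempt to keep aspect ratios close to one, which is what more elaborate tilings (such as "squarified" layouts) try to improve.

```python
def slice_and_dice(values, x, y, width, height, horizontal=True):
    """Split one rectangle into strips whose areas are proportional to values.

    Returns a list of (x, y, width, height) tuples, one per value.
    """
    total = sum(values)
    rects = []
    offset = 0.0
    for v in values:
        share = v / total
        if horizontal:
            # Strips sit side by side; each takes its share of the width.
            rects.append((x + offset, y, width * share, height))
            offset += width * share
        else:
            # Strips are stacked; each takes its share of the height.
            rects.append((x, y + offset, width, height * share))
            offset += height * share
    return rects


def aspect_ratio(rect):
    """Long side divided by short side; 1.0 means a perfect square."""
    _, _, w, h = rect
    return max(w, h) / min(w, h)


# Hypothetical values laid out in a 100 x 100 canvas.
rects = slice_and_dice([50, 30, 20], 0, 0, 100, 100)
for r in rects:
    print(r, "aspect ratio:", round(aspect_ratio(r), 2))
```

The smallest value ends up as a thin strip with a high aspect ratio, which is exactly the legibility problem described above.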

Treemaps rely on area (and possibly color) to encode the value of a variable, and therefore, although treemaps can convey overall relationships in a large data set, they are not suited for tasks involving precise comparisons. Treemaps also should not be used if the data is not hierarchical.

3 Examples of "good" treemaps

*Treemap corresponding to the tree structure of the S&P 500 dataset depicted above.* *The color of each rectangle shows if the value of that stock is moving up or down – very bright red indicates a big shift downward, and very bright green indicates a big shift upwards. The size of each rectangle represents the market capitalization (value) of that stock, industry, or sector.*

Treemap visualization showing "Elderly population in Europe for NUTS2 regions". Colored rectangles represent the ratio of elderly people ("age group 65 and above") population. The size of each rectangle in the Treemap represents the "Total Population".

*Treemap exploring income comparison from films by studio and genre*

3 Examples of "bad" treemaps

Treemap depicting someone's personal taste on albums. The sizes of each sector are very similar, not enough visual hierarchy is presented in the graph. The colors tend to blend together, creating more confusion.

*Treemap visualization showing "Elderly population in Europe for NUTS2 regions". Lack of hierarchy.*

Source: 1. https://www.nngroup.com/articles/treemaps/

2. https://ncva.itn.liu.se/?l=en

3. https://ncva.itn.liu.se/education-geovisual-analytics/treemap?l=en

Dendrogram

Judy Dai | 19 Jan 2021

A dendrogram is a network structure. It is constituted of a root node that gives birth to several nodes connected by edges or branches. The last nodes of the hierarchy are called leaves.

Two types of dendrogram exist, resulting from two types of dataset:

1- A hierarchic dataset provides the links between nodes explicitly.
2- The result of a clustering algorithm can be visualized as a dendrogram.

Dendrograms are often used to depict the strength of clustering in a matrix. In other words, they show the hierarchical relationship between objects.

The greater the difference in height, the greater the dissimilarity. In some dendrograms, shape and color are also used to help people recognize hierarchy and clusters faster.

Pro:

-Good to allocate objects to clusters.

-The height of the branches often shows similarity/dissimilarity between two objects.

Con:

-The shape of the dendrogram does not explicitly decide the total amount of existing clusters. It cannot tell you how many clusters you should have.

-The height that shows similarity is not always true to original data.

-Once two objects are joined, they cannot be separated.

Calculation: Hierarchical Clustering Algorithms

A dendrogram is usually the visual output of hierarchical clustering. Hierarchical clustering can be performed with either a distance matrix or raw data.

Examples of linkage methods used for clustering: single linkage, complete linkage, simple average, centroid, median, etc.
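To make the clustering step more concrete, here is a small sketch of agglomerative clustering with single linkage on one-dimensional points, written in plain Python. The function name and the sample points are invented for illustration; real projects would more likely rely on a statistics library. Each merge records the height at which two clusters join, which is the value a dendrogram draws on its height axis.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points using single linkage.

    Returns the merges in order as (members_a, members_b, height) tuples,
    where height is the distance at which the two clusters were joined.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges


# Hypothetical measurements.
merges = single_linkage([1.0, 2.0, 6.0, 7.5, 20.0])
heights = [h for _, _, h in merges]
for a, b, h in merges:
    print(a, "+", b, "joined at height", h)
```

Clusters are only ever merged, never split, which matches the limitation listed above: once two objects are joined they stay together.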

Good Examples:

#1 Circular Dendrogram. Instead of using the traditional way (the distance) to indicate groupings and hierarchy, it uses different colors and shape sizes. Works well with the circular shape.

Phylogeny and tempo of diversification in the superradiation of spiny-rayed fishes. Edited by David M. Hillis, University of Texas. A good example of using color, illustrations, and a clearer indication of hierarchy in a circular dendrogram. The dashed circle at the center really helps.

*Products of Slavery and Child Labour* by Giulia De Amicis. A good example of a dendrogram based on a hierarchic dataset.

Bad Examples:

A standard dendrogram with a scale on the side. Not very appealing to me in terms of design, but it shows the hierarchy between clusters.

A phylogenetic tree of the bacterial domain. Some clusters are too small or narrow to be observed, and the hierarchy is hard to read.

Tree of Life (~3,000 species, based on rRNA sequences) by David M. Hillis, Derrick Zwickl, and Robin Gutell, University of Texas. An example of a visually appealing dendrogram that relies on a scientific dataset. However, it lacks color, so the clusters cannot be told apart at a glance when zoomed out.

Rubber Sheets & Isosurfaces

Mihir Keskar | 19 Jan 2021

Rubber sheets are a type of data visualization that functions similarly to a heat map but uses colored, three-dimensional surfaces to map four or more dimensions.

Rubber sheets represent values such as depth or altitude while encouraging comparison of heights and of the general landscape. However, this visualization type does not let the viewer read relative or specific numeric values precisely.

Typically, rubber sheet charts are used to explain and lay out geo-related information. Height and color signify two separate input values.

Rubber sheets can be compared with isosurfaces, which have similar features.

Isosurfaces are maps of data that resemble topographic maps and are commonly used to visualize temperature, weather, and ocean currents.

An isosurface represents points of a constant value (e.g. pressure, temperature, velocity, density) within a volume of space.

Below are examples of isolines...

One may extrude an isoline into an isosurface to give the viewer greater clarity about what the values mean. Adding height and depth can make value differences easier to communicate and visualize.
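The idea of "points of a constant value" can be sketched in one dimension: given evenly spaced samples, find where the data crosses a chosen iso-value by interpolating linearly between neighbouring samples. This is only a simplified sketch (the function name and the temperature readings are hypothetical); drawing real isolines or isosurfaces repeats the same interpolation along the edges of a 2D or 3D grid.

```python
def crossings(samples, iso):
    """Positions (in sample-index units) where the data crosses the iso-value.

    Uses linear interpolation between neighbouring samples. Samples that sit
    exactly on the iso-value are ignored to keep the sketch short.
    """
    positions = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        # The iso-value lies strictly between a and b when the signs differ.
        if (a - iso) * (b - iso) < 0:
            positions.append(i + (iso - a) / (b - a))
    return positions


# Hypothetical temperature readings along a line.
temperatures = [10, 14, 18, 16, 12]
print(crossings(temperatures, 15))
```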

Scatter Plots

Kevin Ebrahimoff | 19 Jan 2021

Scatter plots use dots to represent two different numerical variables on a 2D chart, but they can be graphed in 3D space as well. The first variable is independent, while the second is dependent on the first. Scatter plots allow one to observe the correlation between the variables.

Shapes, color, and scale show additional variables. The inclusion of a pie graph adds another dimension of information that isn't available on other graph types.

These graphs show the variations of data visualizations that can occur when using a 2D Scatter Plot graph.

The two graphs on the left share the strength of separating the points so that each data point is clear. The two graphs on the right show overplotting, where much of the data is lost; the bottom right graph solves some overplotting issues by using transparency.

Scatter Plot Types:

•Scatter Diagrams with no correlation: Here the points appear randomly dispersed, making it hard to draw a line through them to estimate the average.

Scatter plot with no correlation.

•Scatter Diagrams with Moderate Correlation: Here the points are more clustered together, making it easier to detect and generalize a relationship in the data set.

Scatter plot with moderate correlation.

•Scatter Diagram with Strong Correlation: Here the points are clearly clustered together with an apparent relationship between the points making for an easy estimate of the average.

Each of these correlation types is further categorized as positive or negative, depending on the slant of the average line.

If the Y value increases as the X value increases, the correlation is positive.

If the Y value decreases as the X value increases, the correlation is negative.

Negative correlation.

Explain the Scatter Diagram Method. Advantages and Disadvantages with diagram? - Sarthaks eConnect | Largest Online Education Community

Scatter plots are very good for getting a general average out of the data sets as clusterings make it easy to see trends in the data, also allowing predictions to be made.
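The direction and strength of a correlation can also be put into a number. The sketch below computes the Pearson correlation coefficient by hand: values near +1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate none. The sample data and the 0.3 / 0.7 cutoffs used for the labels are arbitrary choices made for this illustration, not standard thresholds.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def describe(r, weak=0.3, strong=0.7):
    """Label a coefficient; the cutoffs are illustrative assumptions."""
    if abs(r) < weak:
        return "no clear correlation"
    strength = "strong" if abs(r) >= strong else "moderate"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
rising = [2.1, 3.9, 6.2, 7.8, 10.1]
falling = [9.8, 8.1, 6.0, 4.2, 1.9]

print(describe(pearson_r(xs, rising)))
print(describe(pearson_r(xs, falling)))
```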

What makes scatter plots special is their ability to carry many additional variables while still providing understandable graphs. By providing a key, one can vary shape, scale, hue, saturation, luminance, and/or opacity to depict other relationships in the collected data.

A 3D scatter plot that shows the relationship between ozone level (in parts per million) and three main variables (wind speed, temperature, and solar radiation). Source: Creating 3-D Scatter Plots - MATLAB & Simulink - MathWorks Deutschland.

This 2D scatter plot shows the relationship between two main variables (sepal length and sepal width). The graph uses scale, opacity, and color to represent other variables.

Pros:

Shows the relationship between two main variables along with additional sub-information. This allows numerous data sets with matching axes to be overlaid onto each other using different plotting point variations.

Non-linear graphing allows for a wider view, showing that correlation does not always imply causation.

Many variables can be plotted using different axes, hue, saturation, lightness, shape, size, and transparency.

Easy to identify averages using trend lines.

Cons:

'Overplotting' makes it hard to decipher points when data is tightly clustered. Large data sets can be hard to visualize because of this.

Flat trend lines provide inconclusive results.

The graph does not provide precise data depictions as values are often rounded off.

Physical maps & Heat maps

Sophie Fu | 19 Jan 2021

Physical Maps

Visual objects such as bars and lines do not translate well onto maps. While easily understandable on their own, when transferred to geographical locations they are simply too small, and there are too many sections, to visibly distinguish data in such a manner.

For physical maps, the best way to display quantitative information is to vary the color intensity, the size, or both. As long as a clear legend is provided, the range of values is flexible in that it can represent various data, such as percentage of population or even aggregate income.

In all physical maps, the x and y axes represent the longitude and latitude of the earth, as each location represents a specific geographical area.

More specifically, a choropleth map is a type of physical map that uses heat mapping in order to show distinct geographical areas or regions that are colored in relation to a numeric value.

While useful to understand how territory lines can affect variables, the disadvantage is that larger territories tend to have a bigger weight on the map visual, creating an inherent bias.

Your variables need to be normalized, as raw numbers cannot be compared between regions of distinct size or population. The goal of normalization is to minimize distortions in the differences in the range of values but also to convert the dataset to a common scale. A clear legend must be provided. In choosing a continuous color palette, one must be careful to pick specific hues that do not blur into one tone, making the data variation unclear and hard to distinguish. Most frequently, there is a sequential color ramp between value and color.
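As a small worked example of normalization, the sketch below converts raw counts into a rate per 100,000 inhabitants and then rescales the rates to a common 0 to 1 range that could drive a sequential color ramp. The region names, counts, and populations are invented for illustration, and the helper names are not from any particular mapping library.

```python
# Hypothetical raw counts and populations for three regions.
regions = {
    "Region A": {"cases": 500, "population": 1_000_000},
    "Region B": {"cases": 300, "population": 200_000},
    "Region C": {"cases": 100, "population": 100_000},
}


def rate_per_100k(count, population):
    """Convert a raw count into a rate per 100,000 inhabitants."""
    return count / population * 100_000


def min_max_scale(values):
    """Rescale values to the 0..1 range (0 = lightest shade, 1 = darkest)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All regions share one value, so they all get the same shade.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


rates = {name: rate_per_100k(r["cases"], r["population"])
         for name, r in regions.items()}
shades = dict(zip(rates, min_max_scale(list(rates.values()))))

for name in regions:
    print(name, round(rates[name], 1), round(shades[name], 2))
```

Region A has the most raw cases but the lowest rate, so it would be drawn in the lightest shade once the data is normalized.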

Heat maps

While physical maps can fall under the categorization of heat maps, heat maps are not restricted to only physical locations. A heat map uses colors to create a graphical representation of data where a matrix is used to organize individual values.

The most standard heat map has two axis variables that separate the colored squares onto a grid. The axes are divided into ranges, and each cell's color indicates the value of the main variable as defined by a gradient legend that depicts the data range.
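The step of dividing the axes into ranges can be sketched as follows: each observation is dropped into the grid cell whose x and y ranges contain it, and the count per cell becomes the value that is later mapped to a color. The function names and sample points are hypothetical, and the sketch assumes every value lies within the stated axis ranges.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the bin holding value when [lo, hi] is split into n_bins equal ranges."""
    if value == hi:
        return n_bins - 1  # the upper endpoint belongs to the last bin
    return int((value - lo) / (hi - lo) * n_bins)


def heatmap_counts(points, x_range, y_range, n_cols, n_rows):
    """Count how many (x, y) points fall into each cell of the grid."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = bin_index(x, x_range[0], x_range[1], n_cols)
        row = bin_index(y, y_range[0], y_range[1], n_rows)
        grid[row][col] += 1
    return grid


# Hypothetical observations on axes that both run from 0 to 4.
points = [(0.5, 0.5), (1.5, 0.2), (1.7, 0.9), (3.9, 3.1), (4.0, 4.0)]
grid = heatmap_counts(points, (0, 4), (0, 4), n_cols=2, n_rows=2)
for row in grid:
    print(row)
```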

The variables plotted can take on either categorical or numeric values, and as a result the coloring of cells can encode all sorts of metrics, such as the frequency of a specific item, summary statistics, or even non-numeric values such as a qualitative grouping of low, medium, and high.

Heatmaps are useful for displaying hierarchical clustering because they give a general view of numerical data. Data must be normalized, as a data set with too much variation creates even more individual hues, compounding the existing difficulty of accurately telling color shades apart. Often the exact value of each cell is still labeled with a number, as it is hard to match a color hue to a distinct value.

Heatmaps can also be used to show changes in data through the passing of time. For example, a heatmap could show the temperature changes in a year across multiple cities.

The color range is carefully chosen: the lower percentages are less noticeable, and as the percentage increases, color temperature also comes into play, making the pink more noticeable than the blue. (https://flowingdata.com/2017/04/27/traffic-fatalities-when-and-where/)

Eliminates one possible source of bias in physical maps, where differing region sizes skew the visual weight. (http://www.zeit.de/feature/german-unification-a-nation-divided)

Covers the issue of color blindness (red vs. green) while also making the more prominent colors show the two extremes, leaving the national average a less noticeable shade. (https://knightlab.northwestern.edu/2016/07/18/three-tools-to-help-you-make-colorblind-friendly-graphics/)

There are too many color values in the spectrum. The range is sectioned off and values aren't taken into consideration. The circle units also aren't clearly defined and do not hold a purpose. (https://www.theguardian.com/news/datablog/2012/jul/24/danny-dorling-visualise-social-structure)

The two keys create a conflicting data display. Using opacity as a value also creates a problem with the gradient middle colors. (https://dsparks.wordpress.com/2011/10/24/isarithmic-maps-of-public-opinion-data/)

Inherent bias: the spectrum varies in both value and color temperature. The gradient is not an even spread and jumps too quickly to the extreme values. (http://nickolaylamm.com/wp-content/uploads/2014/03/love.jpg)

Line Graph

Julia Grippo | 19 Jan 2021

Describe your chosen visualization type in terms of the kinds of values it represents (e.g., fractions, integers, percentages, etc.) and the sorts of comparisons it enables or discourages.

A line graph uses lines to connect data points that show quantitative values over a specified period (these are almost always integers and whole numbers). Line graphs make use of two main axes: the "x" (horizontal) axis and the "y" (vertical) axis. The "x" axis depicts a continuous progression while the "y" axis reports values for a metric of interest across the progression of the "x" axis.

The x-axis requires a value that has a regular interval of measurement. Most commonly, this value represents something temporal, so that the chart shows observations over time. The y-axis reports the value of a second numeric variable for points that fall in each of the intervals defined by the x-axis variable.

Most commonly, line graphs are designed to be read from left to right with the lowest values appearing closest to the bottom left corner, while higher values appear closest to the top right corner.

Enables:

Line graphs are best used to emphasize changes in values for one variable (plotted on the "y" vertical axis) for continuous values of a second variable (plotted on the "x" horizontal axis). Therefore, line graphs place an emphasis on change which can be best observed through the slope of the line as the line moves either up or down on the graph. Line graphs encourage comparison over long or short periods of time. They can be used to compare changes over the same period of time for more than one group. Therefore, line graphs are useful in showing small changes that are difficult to measure with other types of graphs. Also, line graphs present a good impression of trends and changes over time.

Discourages:

However, line graphs can easily get cluttered, which makes them confusing to read if too many lines are plotted. Line graphs are best suited to data made of total figures, such as total rainfall in a month. A wide range of data is also challenging to plot on a line graph, and line graphs are mainly used to show data over time. Line graphs can also mislead if consistent scales aren't used on the axes. Lastly, line graphs are mostly used for whole numbers and integers, so it can be inconvenient to plot fractions or decimal numbers.

Explain what types of calculations need to occur to go from the raw data to the ink/pixels in the resultant chart (for instance, do you need to add up all the values then plot them based on their proportion of the whole? or find the minimum and maximum value to establish the endpoints of an axis?)

Line graphs most commonly pull their data from a two-column table corresponding to y and x axes. Before you start plotting points, you must determine what variables will be used for your x and y axes. Once this is established you can then find what will be the minimum and maximum values that will be used to establish the endpoints of your axes. Then, pick x and y values from your two-column table and begin to plot accordingly. Once enough points are plotted, you will be able to determine the slope of the line created. The slope can be determined by using the formula below. The slope of the plotted line will determine the rate of change seen in y relative to x.

m = \frac{\text{rise}}{\text{run}} = \frac{y_2 - y_1}{x_2 - x_1}
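As a quick worked example of the rise-over-run formula, the sketch below computes the slope of each segment between consecutive points of a line graph. The rainfall figures are made up for illustration.

```python
def slope(p1, p2):
    """Rate of change between two (x, y) points: rise over run."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("slope is undefined for a vertical segment")
    return (y2 - y1) / (x2 - x1)


# Hypothetical monthly rainfall totals: (month, millimetres).
rainfall = [(1, 30), (2, 45), (3, 40), (5, 60)]

# One slope per line segment between consecutive points.
segment_slopes = [slope(a, b) for a, b in zip(rainfall, rainfall[1:])]
print(segment_slopes)
```

A positive slope means the line rises over that interval, and a negative slope means it falls.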

Explain the ‘mapping’ by which numerical/categorical/etc values are converted into positions, sizes, colors, textures, etc. If the chart is primarily about using size to show values (as in, say, a bar chart), can it still use other features such as color to communicate other pieces of information? How?

Using data from your spreadsheet, you can "map" points by moving horizontally to establish your x value and then moving vertically to establish the y value where x and y intersect. This will help in creating a point, viewed as (x,y). Points on a line graph are usually visually demonstrated by use of a filled in circle. These points, (x,y), will begin to reveal a pattern. Once you have your points "mapped" you can then use a line to connect the points. This, in turn, will create your final line graph.
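When the chart is drawn on a screen, this "mapping" is typically a linear rescaling from the data range (found via the minimum and maximum values) to a pixel range. The following is a minimal sketch under that assumption. The canvas size, data, and function name are hypothetical, and the y-axis is flipped because screen coordinates usually grow downward.

```python
def linear_scale(value, domain, pixel_range):
    """Map a value from the data domain onto a pixel range, linearly."""
    d0, d1 = domain
    r0, r1 = pixel_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


# Hypothetical two-column table of (x, y) values.
data = [(0, 10), (1, 30), (2, 20)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

# The minimum and maximum of each column set the axis endpoints.
x_domain = (min(xs), max(xs))
y_domain = (min(ys), max(ys))

width, height = 400, 200  # assumed canvas size in pixels
pixels = [
    (linear_scale(x, x_domain, (0, width)),
     linear_scale(y, y_domain, (height, 0)))  # flipped: larger y is higher up
    for x, y in data
]
print(pixels)
```

Connecting the resulting pixel positions in order with straight segments produces the line.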

Line graphs can make use of color, line weight, gradients, and line patterns (such as dashed lines) to differentiate between lines. You can also change the shape and fill of the points to create more differentiation between lines. A line graph can hold multiple sets of data, therefore, it is important to use varying colors, line weights, and line styles to differentiate between your subjects.

Search the web for examples of your chart type in use. Include 3 images demonstrating ‘good’ uses and 3 more with ‘bad’ uses of this visualization type. Add a caption to each image describing what makes it good or bad.

Bad Examples:

This is a bad use of the line graph as it incorrectly uses two y-axes. This creates misleading and confusing information. When read incorrectly, the graph appears to be claiming that the lack of insurance is increasing very slightly (from ~15 percent to ~16 percent) and unemployment increases more rapidly (from ~4.5 percent to 7.5 percent).

This is another bad use of a line graph as it is misleading. It is misleading in that it is designed to show that after a small drop in unemployment, it then went up during the Obama administration. Therefore, the small change in value on the x-axis makes this chart rather misleading.

This is another bad use of a line graph because it is misleading. The chosen starting points for the axes make it seem like the graph has been plotted to look like there is exponential growth. The small change in value on the x-axis makes this chart rather misleading. Contrastingly, the overall employment trend is pretty stable at around 9%.

Good Examples:

This is a good example of a line graph because it fulfills the components for a line graph. It has defined x and y axes, a title, plotted points, and a line connecting the points. Most importantly, this line graph shows quantitative values over a specified period which is the defining factor of a line graph. This example also is a great way to show how you can use a key to differentiate between lines by using color and shape.

Bar Graph and Histogram

Youchen Zhou | 19 Jan 2021

3 Good examples:

1. (Wall Street Journal) Winners and Losers: Job Gains and Losses
Track the number of sectors gaining or losing jobs each month. Boxes are shaded based on percentage change from the previous month in each sector's payrolls.

A precise color palette that displays the information clearly at first glance. The arrangement of data points shows the flow of rising and descending. The timeline is consistent, making it easy to compare data across the years.

2. (The Verge) Google Quarterly Financials

Two groups of information grouped under the same bar graph, using colors to differentiate quarter and year. Using bright colors and greyscale makes it easier to identify a certain quarter for comparison. However, some viewers may be confused by the repeating colors for each year.

3. Expected Halloween Start and End Time

A simple and minimal design lets the viewer process the information very quickly. The method of slightly shifting overlapping colors is a good way to deal with repeating data. The small icons under each hour are a plus for clarity and quality of design.

3 Bad Examples:

1. (The New York Times Magazine) Appellate Judgeships Confirmed During First Congressional Term

No baseline for Y axis. Exaggerating information for a biased perspective.

2. (Russia Today) The number of COVID-19 cases in Russia from March 5 to March 31

The graph is trying to "flatten the curve". When comparing March 25's 458 cases to March 27's 1036 cases, the proportions are out of ratio.

3. (CNN) Politics Prediction Position

Under US, the 76% of Republicans and 25% of Democrats are close to each other, even though the bar for Democrats should be 1/3 of Republicans. The bars across countries are very inconsistent.

Hello World

Yue Hou | 19 Jan 2021

:)

Chernoff faces

Fangyi (Yiyi) Yang | 19 Jan 2021

Chernoff faces showcase multivariate data in a symbol of a human face. Every facial feature represents values of the variables by its shape, size, orientation, and placement. The concept behind using faces is that humans are good at recognizing faces and notice subtle changes or differences.

Variables should be carefully chosen because we are more attuned to certain facial features than others (e.g., eye size and eyebrow slant have been found to carry significant weight).

Pros:

It can be a quick way to display datasets, especially ones that involve emotions, such as a qualitative assessment of a campaign's performance.
It can be used to represent multi-dimensional data in a very compact manner.
If used appropriately, it can be a fun way to showcase mundane data.

Cons:

Difficult to make quantitative assessments with the faces.
Impractical for large datasets, since the faces would be too small to show the details.
Could be offensive to relate a certain type of look to something negative.

Good Examples:

Eugene Turner - Life in Los Angeles (1977)

Poor neighborhoods are represented with emaciated, scowling faces, and wealthy neighborhoods are represented by grinning ones. The face symbols are quite intuitive to read and help raise awareness of inequality issues in LA.

A fun way of storytelling that displays more details of presidential election results. It's quite humorous to give Clinton pearl earrings for the areas that have a higher amount of campaign spending.

Reporting KPI's with Chernoff Faces by Super Analytics from Kalle Heinonen

Using facial expressions to represent a team's Key Performance Indicator is symbolically appropriate and very intuitive.

Bad Examples:

The graph becomes difficult to read when visualizing a large amount of data. There is also a disconnect between the subjects (beer vs wine) and the facial features.

Why is low unemployment represented by a happy grin, and high unemployment with a frown? Why are areas with a high proportion of women in the workforce represented with angry eyes?

How can the height of someone’s face speak to their propensity for violent crime?

parallel-coordinate plots & star plots

Angela Pan | 19 Jan 2021

Linear parallel coordinates are used for multidimensional data. Each element has a specific value on each individual dimension.

Radial parallel coordinates are similar to linear parallel coordinates. Here, the dots representing values are linked together to form a shape. Usually it shows multiple records of data.

Star plots are similar to radial parallel coordinates but they only have a single record of data rather than multiple.

These charts do not require calculation when put together. Dimensions are more important to categorize here.

Values are translated into position and orientation, which form shapes when connected.
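To show how values become positions and orientations, here is a small sketch that turns one record into the vertices of a star plot: each dimension gets its own evenly spaced angle, and the value sets the distance from the center. The function name, the scaling by a per-dimension maximum, and the sample record are assumptions made for this illustration.

```python
from math import cos, sin, pi


def star_vertices(values, max_values):
    """Return one (x, y) vertex per dimension for a single data record.

    Each value is divided by its dimension's maximum so that every spoke
    has length 1.0 at most.
    """
    n = len(values)
    vertices = []
    for k, (value, max_value) in enumerate(zip(values, max_values)):
        angle = 2 * pi * k / n        # orientation of this dimension's spoke
        radius = value / max_value    # position along the spoke
        vertices.append((radius * cos(angle), radius * sin(angle)))
    return vertices


# Hypothetical record with four dimensions, each scored out of 10.
vertices = star_vertices([5, 10, 2.5, 10], [10, 10, 10, 10])
for v in vertices:
    print(round(v[0], 2), round(v[1], 2))
```

Connecting the vertices in order, and closing the loop back to the first one, gives the shape of the star.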

Good examples:

Clean and organized, but it's missing some elements needed to fully comprehend the chart.

Less good examples:

Very messy. It's difficult to pinpoint one specific data point.

It's very complicated. It's difficult to recognize which is which because every data record has the same color.

Box Plot

Olivia Zhu | 19 Jan 2021

A boxplot, or a box-and-whisker-plot, is a graph that indicates the variability or the dispersion of the data, graphically depicting groups of numerical data through their quartiles. It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

The visualization of data through a boxplot is based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

median (50th Percentile): the middle value of the dataset.

first quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.

third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.

interquartile range (IQR): the range from the 25th to the 75th percentile (Q3 - Q1).

whiskers: the two lines outside the box that extend to the highest and lowest observations

outlier: a data point that lies outside the overall pattern in a distribution (below the “minimum” and/or above the “maximum”).

“maximum”: Q3 + 1.5*IQR

“minimum”: Q1 - 1.5*IQR

In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.
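The definitions above can be tried out with Python's standard library. The sketch below is one possible way to compute the numbers behind a box plot. The sample data is invented, and the choice of the "inclusive" quantile method is an assumption, since different tools compute quartiles slightly differently. It also follows the common convention of ending each whisker at the most extreme observation that still lies within the “minimum” and “maximum” limits.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Compute the values needed to draw a box plot for a list of numbers."""
    # quantiles() sorts the data itself; n=4 yields Q1, the median, and Q3.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr  # the "minimum" defined above
    upper_limit = q3 + 1.5 * iqr  # the "maximum" defined above
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }


# Hypothetical data set with one unusually large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for key, value in summary.items():
    print(key, value)
```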

In addition to the traditional square box plots, box plots in other shapes (e.g., violin plots, bean plots, notched box plots) exist to answer different needs. Sometimes an organically shaped box plot is more effective in showing the range and tendency of the data. Adding colors or points (called jitters) representing each data point is also common.

Good examples:

The use of opposite colors enables direct visual comparison between the two data groups. Outliers are laid out clearly. The organization of the boxplot makes it easily understandable.

The highlighted jitters translate the raw data very directly and effectively.

The use of colors allows for direct comparisons between the variables. Its horizontal composition adds to the clarity. The footnote clearly explains the chart.

Bad examples: