This is a collective research project providing examples and discussion of the basic building blocks of visual data representation.
In his Ph.D. dissertation, information designer Ben Fry assembled a taxonomy of standard visualization types. Look over his list and choose one to research more thoroughly. Sign up for your chosen visualization using this Google Doc.
A tree is a type of data visualization that shows hierarchical data. Typically, the tree starts with one main branch and expands out to data pre-organized into subcategories. A graph is a related type of diagram in which several branches can connect back to one another.
Trees and graphs are well suited to showing the ranking order within a data set. They visualize the information architecture, the connections between categories, and the overall information flow, but they are not well suited to showing the literal quantity of each data point.
Trees and graphs require the data to be categorized, with the connections among each level of data entries sorted out.
Data are most commonly translated into a varying number of colored lines with a dot at the end. The dots' size and color, the branches' stroke weight, and the placement of the main branches can heavily affect the visualization outcome.
Pros: It can show a large chunk of data in one graphic and provides a high-level understanding of how the data relate to each other.
Cons: When dealing with a large amount of data, only certain branches can be labelled for the sake of clarity. The graph can easily get cluttered if the main branches are positioned poorly.
Treemaps are commonly found on data dashboards. Designers often choose them to add visual variety on a dense dashboard. However, treemaps are a complex visualization and present many obstacles to quick comprehension (which is the main requirement for any information displayed on a dashboard).
Creating a treemap involves choosing two dimensions of the data, color-coding one dimension and defining a "tiling algorithm" for the dimension represented by area. The tiling algorithm determines how the rectangles are sub-divided into rectangles of specific area (corresponding to the data). Treemaps are most legible when the sub-rectangles have an aspect ratio close to one.
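As a rough illustration of what a tiling algorithm does, here is a minimal Python sketch of a one-level "slice" layout. The function name slice_layout and the sample values are made up for this example. Real treemap tools typically apply a step like this recursively, and more elaborate tilings (such as squarified layouts) try to keep aspect ratios closer to one.

```python
def slice_layout(values, x, y, width, height):
    """Split one rectangle into strips whose areas are proportional to values.

    Returns a list of (x, y, width, height) tuples, one per value.
    """
    total = sum(values)
    rects = []
    offset = 0.0
    cut_along_width = width >= height  # cut across the longer side
    for v in values:
        if cut_along_width:
            w = width * v / total
            rects.append((x + offset, y, w, height))
            offset += w
        else:
            h = height * v / total
            rects.append((x, y + offset, width, h))
            offset += h
    return rects


# Hypothetical data: three categories inside a 100 x 60 canvas.
rects = slice_layout([50, 30, 20], 0, 0, 100, 60)
areas = [w * h for (_, _, w, h) in rects]
```

Each rectangle's area is proportional to its value, which is the property the reader relies on when comparing cells.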
Treemaps rely on area (and possibly color) to encode the value of a variable, and therefore, although treemaps can convey overall relationships in a large data set, they are not suited for tasks involving precise comparisons. Treemaps also should not be used if the data is not hierarchical.
A dendrogram is a network structure. It consists of a root node that gives rise to several nodes connected by edges or branches. The last nodes of the hierarchy are called leaves.
Two types of dendrogram exist, resulting from 2 types of dataset:
1- A hierarchical dataset provides the links between nodes explicitly.
2- The result of a clustering algorithm can be visualized as a dendrogram.
Dendrograms are often used to depict the strength of clustering in a matrix. In other words, they show the hierarchical relationship between objects.
The greater the difference in height, the greater the dissimilarity. In some dendrograms, shape and color are also used to help people recognize hierarchy and clusters faster.
Pro:
-Good for allocating objects to clusters.
-The height of the branches often shows similarity/dissimilarity between two objects.
Con:
-The shape of the dendrogram does not explicitly decide the total amount of existing clusters. It cannot tell you how many clusters you should have.
-The height that shows similarity is not always true to the original data.
-Once two objects are joined, they cannot be separated.
Calculation: Hierarchical Clustering Algorithms
A dendrogram is usually the visual output of hierarchical clustering. Hierarchical clustering can be performed with either a distance matrix or raw data.
Examples of linkage methods used in hierarchical clustering: single linkage, complete linkage, simple average, centroid, median, etc.
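To make the clustering step more concrete, here is a small Python sketch of agglomerative clustering with single linkage on one-dimensional points. It is only an illustration: the function name single_linkage and the sample points are invented for this example, and a real project would more likely rely on a statistics library.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points.

    Returns the merges in order as (cluster_a, cluster_b, height), where
    height is the distance at which the two clusters were joined. These
    heights are what a dendrogram draws on its vertical axis.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample data.
merges = single_linkage([1.0, 2.0, 6.0, 7.5, 20.0])
heights = [h for (_, _, h) in merges]
```

Note that once two clusters are merged they stay merged, which is the limitation listed in the cons above.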
Rubber sheets are a type of data visualization chart that functions similarly to a heat map but uses colored, three-dimensional surfaces to map four or more dimensions.
Rubber sheets represent values such as depth and altitude while encouraging comparison of heights, general landscape, etc. However, this visualization type is imprecise: viewers can struggle to judge relative values and cannot read off specific numeric values.
Typically, rubber sheet charts are used to explain and lay out geo-related information. Height and color signify two separate input values.
Rubber sheets can be compared with isosurfaces, which have similar features.
Isosurfaces are maps of data that resemble topographic maps and are commonly used to visualize temperature, weather, and ocean currents.
An isosurface represents points of a constant value (e.g. pressure, temperature, velocity, density) within a volume of space.
Below are examples of isolines...
One may extrude an isoline into an isosurface to give the viewer greater clarity about what the values mean. Adding height and depth can make value differences easier to communicate and visualize.
Scatter plots use dots to represent two different numerical variables in a 2D chart, but can be graphed in 3D space as well. The first variable is independent while the second is dependent on the first. Scatter plots allow one to observe the correlation between the variables.
Scatter Plot Types:
•Scatter Diagrams with No Correlation: Here the points appear randomly dispersed, making it hard to draw a line through them to estimate the average.
•Scatter Diagrams with Moderate Correlation: Here the points are more clustered together making it easier to detect and generalize a relationship between this data set.
•Scatter Diagram with Strong Correlation: Here the points are clearly clustered together with an apparent relationship between the points making for an easy estimate of the average.
Each of these correlation types is further categorized as a positive or negative correlation. This is decided by the slant of the average.
If the X value increases along with an increase in the Y value the correlation would be positive.
If the slant shows an increase in the X value and a decrease in the Y value, the correlation would turn out negative.
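One common way to put a number on the direction and strength described above is the Pearson correlation coefficient, which ranges from -1 (strong negative) through 0 (no linear correlation) to +1 (strong positive). The sketch below is a plain-Python illustration; the function name pearson_r and the sample values are chosen for this example only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y rises with x, then y falls as x rises.
rising = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
falling = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
```

A positive result corresponds to an upward slant and a negative result to a downward slant; values near zero correspond to the "no correlation" case.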
Scatter plots are very good for getting a general average out of the data sets as clusterings make it easy to see trends in the data, also allowing predictions to be made.
What makes scatter plots special is their ability to encode many additional variables while still providing understandable graphs. By providing a key, one can alter the shape, scale, hue, saturation, luminance, and/or opacity of the points to depict other relationships in the collected data.
Pros:
Shows the relationship between 2 main variables and additional sub-information. This allows numerous data sets with matching axes to be overlaid onto each other using different plotting point variations.
Non-linear graphing allows for a wider view, showing that correlation does not always imply causation.
Many variables can be plotted using different axes, hue, saturation, lightness, shapes, size, and transparencies.
Easy to identify averages using trend lines.
Cons:
'Overplotting' makes it hard to decipher points when data is tightly clustered. Large data sets can be hard to visualize because of this.
Flat trend lines provide inconclusive results.
The graph does not provide precise data depictions as values are often rounded off.
Visual objects such as bars and lines do not translate well to maps. While easily understandable on their own, when transferred to geographical locations they are simply too small, and there are too many sections, to visibly distinguish data in such a manner.
For physical maps, the best way to display quantitative information is to vary the color intensity or size, or both. As long as a clear legend is provided, the range of values is flexible in that it can represent various data such as percentage of population or even aggregate income.
In all physical maps, the x and y axes represent the longitude and latitude of the earth, respectively, as each location represents a specific geographical area.
More specifically, a choropleth map is a type of physical map that uses heat mapping in order to show distinct geographical areas or regions that are colored in relation to a numeric value.
While useful to understand how territory lines can affect variables, the disadvantage is that larger territories tend to have a bigger weight on the map visual, creating an inherent bias.
Your variables need to be normalized, as raw numbers cannot be compared between regions of distinct size or population. The goal of normalization is to minimize distortions in the differences in the range of values but also to convert the dataset to a common scale. A clear legend must be provided. In choosing a continuous color palette, one must be careful to pick specific hues that do not blur into one tone, making the data variation unclear and hard to distinguish. Most frequently, there is a sequential color ramp between value and color.
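The normalization described above can be sketched in two steps: first convert raw counts to a per-capita rate, then rescale the rates to a common 0 to 1 range that a sequential color ramp can consume. This is only an illustrative sketch; the region names, counts, populations, and function names are all hypothetical.

```python
def per_capita(counts, populations, per=100_000):
    """Convert raw counts to a rate per `per` inhabitants for each region."""
    return {region: counts[region] * per / populations[region] for region in counts}


def min_max_scale(values):
    """Rescale a dict of numbers to the range 0..1 (0 = lowest, 1 = highest)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


# Hypothetical regions: A has the most cases but also by far the most people.
counts = {"A": 500, "B": 200, "C": 50}
populations = {"A": 1_000_000, "B": 100_000, "C": 50_000}

rates = per_capita(counts, populations)  # cases per 100,000 people
shade = min_max_scale(rates)             # 0..1 position on the color ramp
```

In this made-up data, region A has the largest raw count but the lowest rate, so it would receive the lightest shade once the values are normalized.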
Heat maps
While physical maps can fall under the categorization of heat maps, heat maps are not restricted to only physical locations. A heat map uses colors to create a graphical representation of data where a matrix is used to organize individual values.
The most standard heat map has two axis variables that separate the colored squares onto a grid. The axes are divided into ranges, and each cell color indicates the value of the main variable as defined by a gradient legend that depicts the data range.
The variables plotted can take on either categorical or numeric values, and as a result the coloring of cells can be based on all sorts of metrics, such as the frequency of a specific item, summary statistics, or even non-numeric values such as a qualitative generalization of low, medium, and high.
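As an illustration of how axis ranges turn into cell values, the Python sketch below counts how many points fall into each cell of a grid. The resulting counts are what a heat map would then translate into color. The function names, bin edges, and sample points are assumptions made for this example.

```python
def find_bin(value, edges):
    """Return the index of the range containing value, or None if outside.

    Each range includes its lower edge; the last range also includes its
    upper edge so the maximum value is not dropped.
    """
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
        if i == last and value == edges[-1]:
            return i
    return None


def bin_counts(points, x_edges, y_edges):
    """Count (x, y) points per grid cell; rows follow y ranges, columns x ranges."""
    grid = [[0] * (len(x_edges) - 1) for _ in range(len(y_edges) - 1)]
    for x, y in points:
        col = find_bin(x, x_edges)
        row = find_bin(y, y_edges)
        if col is not None and row is not None:
            grid[row][col] += 1
    return grid


# Hypothetical data on a 2 x 2 grid covering 0..10 on both axes.
points = [(1, 1), (2, 8), (6, 7), (9, 9), (10, 10)]
grid = bin_counts(points, x_edges=[0, 5, 10], y_edges=[0, 5, 10])
```

The same structure works for other cell metrics: instead of adding 1 per point, a cell could accumulate a sum or an average.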
Heatmaps are useful for displaying hierarchical clustering, as they give a general view of numerical data. Data must be normalized, as a data set with too much variation creates even more individual hues, compounding the existing difficulty of telling color shades apart. Often the exact value of each cell is still labeled with a number, as it is hard to match a color hue to a distinct value.
Heatmaps can also be used to show changes in data through the passing of time. For example, a heatmap could show the temperature changes in a year across multiple cities.
Describe your chosen visualization type in terms of the kinds of values it represents (e.g., fractions, integers, percentages, etc.) and the sorts of comparisons it enables or discourages.
A line graph uses lines to connect data points that show quantitative values over a specified period (these are often whole numbers, although any numeric value can be plotted). Line graphs make use of two main axes: the "x" (horizontal) axis and the "y" (vertical) axis. The "x" axis depicts a continuous progression while the "y" axis reports values for a metric of interest across the progression of the "x" axis.
The x-axis requires a value that has a regular interval of measurement. Most commonly, this value represents something temporal, so that the chart shows observations over time. The y-axis reports the value of a second numeric variable for points that fall in each of the intervals defined by the x-axis variable.
Most commonly, line graphs are designed to be read from left to right with the lowest values appearing closest to the bottom left corner, while higher values appear closest to the top right corner.
Enables:
Line graphs are best used to emphasize changes in values for one variable (plotted on the "y" vertical axis) for continuous values of a second variable (plotted on the "x" horizontal axis). Therefore, line graphs place an emphasis on change which can be best observed through the slope of the line as the line moves either up or down on the graph. Line graphs encourage comparison over long or short periods of time. They can be used to compare changes over the same period of time for more than one group. Therefore, line graphs are useful in showing small changes that are difficult to measure with other types of graphs. Also, line graphs present a good impression of trends and changes over time.
Discourages:
However, line graphs can easily get cluttered, which makes them confusing to read if too many lines are plotted. Line graphs are most ideal for representing data made of total figures such as values of total rainfall in a month. Also, a wide range of data is challenging to plot on a line graph, and line graphs are mostly used to show data over time or another continuous progression. Line graphs can also be misleading if consistent scales aren't used on the axes. Lastly, line graphs are most often used with whole numbers; fractions and decimals can be plotted, but small differences between them are harder to read.
Explain what types of calculations need to occur to go from the raw data to the ink/pixels in the resultant chart (for instance, do you need to add up all the values then plot them based on their proportion of the whole? or find the minimum and maximum value to establish the endpoints of an axis?)
Line graphs most commonly pull their data from a two-column table corresponding to the x and y axes. Before you start plotting points, you must determine which variables will be used for your x and y axes. Once this is established, find the minimum and maximum values that will establish the endpoints of your axes. Then, pick x and y values from your two-column table and plot them accordingly. Once enough points are plotted, you will be able to determine the slope of the line created. The slope between two points (x1, y1) and (x2, y2) is (y2 - y1) / (x2 - x1). The slope of the plotted line gives the rate of change in y relative to x.
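The two calculations described here, finding the axis endpoints and computing the slope between plotted points, can be sketched in a few lines of Python. The table values and function names below are hypothetical and serve only as an illustration.

```python
def axis_range(values):
    """Return the (minimum, maximum) used as the endpoints of an axis."""
    return min(values), max(values)


def slope(p1, p2):
    """Rate of change in y relative to x between two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


# Hypothetical two-column table of (x, y) rows, e.g. month and total rainfall.
table = [(1, 10), (2, 14), (3, 13), (4, 19)]

x_min, x_max = axis_range([x for x, _ in table])
y_min, y_max = axis_range([y for _, y in table])

# Slope of each line segment connecting consecutive points.
segment_slopes = [slope(a, b) for a, b in zip(table, table[1:])]
```

A positive segment slope means the line rises between those two points, and a negative slope means it falls.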
Explain the ‘mapping’ by which numerical/categorical/etc values are converted into positions, sizes, colors, textures, etc. If the chart is primarily about using size to show values (as in, say, a bar chart), can it still use other features such as color to communicate other pieces of information? How?
Using data from your spreadsheet, you can "map" points by moving horizontally to establish your x value and then moving vertically to establish the y value where x and y intersect. This will help in creating a point, viewed as (x,y). Points on a line graph are usually visually demonstrated by use of a filled in circle. These points, (x,y), will begin to reveal a pattern. Once you have your points "mapped" you can then use a line to connect the points. This, in turn, will create your final line graph.
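When the chart is drawn on screen, moving "horizontally" and "vertically" amounts to converting each data value into a pixel position. A minimal sketch of such a linear mapping is shown below. The function name linear_scale, the chart size, and the margins are assumptions for this example. Note that the y range is reversed because pixel coordinates usually grow downward.

```python
def linear_scale(value, domain, pixel_range):
    """Map value from the data domain (d0, d1) onto the pixel range (r0, r1)."""
    d0, d1 = domain
    r0, r1 = pixel_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


# Hypothetical 500 x 400 pixel chart with 50-pixel margins.
x_domain, x_pixels = (0, 10), (50, 450)
y_domain, y_pixels = (0, 100), (350, 50)  # reversed: larger values sit higher up

data = [(0, 0), (5, 50), (10, 100)]
pixels = [
    (linear_scale(x, x_domain, x_pixels), linear_scale(y, y_domain, y_pixels))
    for x, y in data
]
```

Connecting the resulting pixel positions in order of x produces the line.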
Line graphs can make use of color, line weight, gradients, and line patterns (such as dashed lines) to differentiate between lines. You can also change the shape and fill of the points to create more differentiation between lines. A line graph can hold multiple sets of data, therefore, it is important to use varying colors, line weights, and line styles to differentiate between your subjects.
Search the web for examples of your chart type in use. Include 3 images demonstrating ‘good’ uses and 3 more with ‘bad’ uses of this visualization type. Add a caption to each image describing what makes it good or bad.
1. (Wall Street Journal) Winners and Losers: Job Gains and Losses. Track the number of sectors gaining or losing jobs each month. Boxes are shaded based on percentage change from the previous month in each sector's payrolls.
2. (The Verge) Google Quarterly Financials
3. Expected Halloween Start and End Time
3 Bad Examples:
1. (The New York Times Magazine) Appellate Judgeships Confirmed During First Congressional Term
2. (Russia Today) The number of COVID-19 cases in Russia from March 5 to March 31
Chernoff faces showcase multivariate data in the form of a human face. Every facial feature represents the value of a variable through its shape, size, orientation, and placement. The concept behind using faces is that humans are good at recognizing faces and notice subtle changes or differences.
Variables should be carefully chosen because we perceive certain facial features more keenly than others (e.g., eye size and eyebrow slant have been found to carry significant weight).
Pros:
It can be a quick way to display datasets, especially ones that involve emotions, such as a qualitative assessment of a campaign's performance.
It can be used to represent multi-dimensional data in a very compact manner.
If used appropriately, it can be a fun way to showcase mundane data.
Cons:
Difficult to make quantitative assessments with the faces.
Impractical for large datasets since the faces would be too small to see the details.
Could be offensive to relate a certain type of look to something negative.
Good Examples:
Eugene Turner - Life in Los Angeles (1977)
Poor neighborhoods are represented with emaciated, scowling faces, and wealthy neighborhoods are represented by grinning ones. The face symbols are quite intuitive to read and help raise awareness of inequality issues in LA.
A fun way of storytelling for displaying more details of presidential election results. It's quite humorous to give Clinton pearl earrings for the areas that have a higher amount of campaign spending.
Using facial expressions to represent a team's Key Performance Indicator is symbolically appropriate and very intuitive.
Bad Examples:
The graph becomes difficult to read when visualizing a large amount of data. There is also a disconnect between the subjects (beer vs wine) and the facial features.
Why is low unemployment represented by a happy grin, and high unemployment with a frown? Why are areas with a high proportion of women in the workforce represented with angry eyes?
How can the height of someone’s face speak to their propensity for violent crime?
Linear parallel coordinates are used for multidimensional data. Each element has a specific value on each individual dimension.
Radial parallel coordinates are similar to linear parallel coordinates. Here, the dots representing values are linked together to form a shape. Usually it shows multiple records of data.
Star plots are similar to radial parallel coordinates but they only have a single record of data rather than multiple.
These charts do not require calculation when put together. Dimensions are more important to categorize here.
Values are translated into position and orientation, which form shapes when connected.
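For a star plot, translating values into position and orientation can be sketched as follows: each dimension gets its own evenly spaced angle, and the value sets the distance from the center along that angle. This is an illustrative sketch only; the function name, the sample record, and the choice to scale each value by an assumed per-dimension maximum are not taken from the text above.

```python
import math


def star_plot_vertices(values, max_values, radius=1.0):
    """Return one (x, y) vertex per dimension for a single record.

    Dimension i is placed at angle 2*pi*i/n, and its value is scaled
    against max_values[i] to give the distance from the center.
    """
    n = len(values)
    vertices = []
    for i, (v, m) in enumerate(zip(values, max_values)):
        angle = 2 * math.pi * i / n
        r = radius * v / m
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices


# Hypothetical record with four dimensions, each on a 0..10 scale.
vertices = star_plot_vertices([5, 10, 2.5, 10], [10, 10, 10, 10])
```

Connecting the vertices in order, and closing the shape back to the first vertex, gives the star outline for that record.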
A boxplot, or a box-and-whisker-plot, is a graph that indicates the variability or the dispersion of the data, graphically depicting groups of numerical data through their quartiles. It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
The visualization of data through a boxplot is based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).
median (50th Percentile): the middle value of the dataset.
first quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.
third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): the range from the 25th to the 75th percentile (Q3 - Q1).
whiskers: the two lines outside the box that extend to the highest and lowest observations that are not outliers.
outlier: a data point that lies outside the overall pattern in a distribution (below the “minimum” and/or above the “maximum”).
“maximum”: Q3 + 1.5*IQR
“minimum”: Q1 - 1.5*IQR
In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.
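The definitions above can be turned into a short Python sketch. Quartiles can be computed in several slightly different ways; this version follows the description given here (Q1 and Q3 as the medians of the lower and upper halves, leaving out the middle value when the count is odd). The function names and sample data are hypothetical.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def box_stats(data):
    """Compute the values needed to draw a box plot."""
    s = sorted(data)
    n = len(s)
    q1 = median(s[: n // 2])        # median of the lower half
    q3 = median(s[(n + 1) // 2:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr      # the "minimum" in the text
    high_fence = q3 + 1.5 * iqr     # the "maximum" in the text
    inside = [v for v in s if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median(s),
        "q3": q3,
        "iqr": iqr,
        "minimum": low_fence,
        "maximum": high_fence,
        "whisker_low": inside[0],    # lowest observation that is not an outlier
        "whisker_high": inside[-1],  # highest observation that is not an outlier
        "outliers": [v for v in s if v < low_fence or v > high_fence],
    }


# Hypothetical sample with one unusually large value.
stats = box_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

In this sample the value 30 lies above the "maximum" and would be drawn as a separate outlier point rather than as part of the whisker.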
In addition to the traditional square box plots, box plots in other shapes (e.g., violin plots, bean plots, notched box plots) exist to answer different needs. Sometimes an organically shaped box plot is more effective in showing the range and tendency of the data. Adding colors or jittered points representing each data point is also common.