Graphical Analysis & Exploration

Introducing Tableau Public

In this tutorial we will work with Tableau Public, which allows you to create a wide variety of interactive charts, maps and tables and organize them into dashboards and stories that can be saved to the cloud and embedded on the web.

The free Public version of the software requires you to save your visualizations to the open web. If you have sensitive data that needs to be kept within your organization, you will need a license for the Desktop version of the software.

Tableau was developed for exploratory graphical data analysis, so it is a good tool for exploring a new dataset — filtering, sorting and summarizing/aggregating the data in different ways while experimenting with various chart types.

Although Tableau was not designed as a publication tool, the ability to embed finished dashboards and stories has also allowed newsrooms and individual journalists lacking JavaScript coding expertise to create interactive online graphics.

The data we will use today

Download the data for this session from here, unzip the folder and place it on your desktop. It contains a single file, nations.csv.

Visualize the data on neonatal mortality

Connect to the data

Launch Tableau Public, and you should see the following screen:

Under the Connect heading at top left, select Text File, navigate to the file nations.csv and Open. At this point, you can view the data, which will be labeled as follows:

  • Text: Abc
  • Numbers: #
  • Dates: calendar symbol
  • Geography: globe symbol

You can edit fields to give them the correct data type if there are any problems:

Once the data has loaded, click Sheet 1 at bottom left and you should see a screen like this:

Dimensions and measures: categorical and continuous

The fields should appear in the Data panel at left. Notice that Tableau has divided the fields into Dimensions and Measures. These broadly correspond to categorical and continuous variables. Dimensions are fields containing text or dates, while measures contain numbers.

If any field appears in the wrong place, click the small downward-pointing triangle that appears when it is highlighted and select Convert to Dimension or Convert to Measure as required.

Shelves and Show Me

Notice that the main panel contains a series of “shelves,” called Pages, Columns, Rows, Filters and so on. Tableau charts and maps are made by dragging and dropping fields from the data into these shelves.

Over to the right you should see the Show Me panel, which will highlight chart types you can make from the data currently loaded into the Columns and Rows shelves. It is your go-to resource when experimenting with different visualization possibilities. You can open and close this panel by clicking on its title bar.

Columns and rows: X and Y axes

The starting point for creating any chart or map in Tableau is to place fields into Columns and Rows, which for most charts correspond to the X and Y axes, respectively. When making maps, longitude goes in Columns and latitude in Rows. If you display the data as a table, then these labels are self-explanatory.

Some questions to ask this data

  • How has the total number of neonatal deaths changed over time, globally, regionally and nationally?
  • How has the neonatal death rate for each country changed over time?

Create new calculated variables

The data contains fields on birth and neonatal death rates, but not the total numbers of births and deaths, which must be calculated. From the top menu, select Analysis>Create Calculated Field. Fill in the dialog box as follows (just start typing a field name to select it for use in a formula):

Notice that calculated fields appear in the Data panel preceded by an = symbol.

Now create a second calculated field giving the total number of neonatal deaths:

In the second formula, we have rounded the number of neonatal deaths to the nearest thousand using -3 as the second argument to ROUND. (-2 would round to the nearest hundred, -1 to the nearest ten, 1 to one decimal place, 2 to two decimal places, and so on.)

Here we have used some simple arithmetic, but a wide variety of functions is available for manipulating data in Tableau. To see all of the available functions, click on the little grey triangle at the right of the dialog boxes above.
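The arithmetic behind these two calculated fields can also be sketched outside Tableau. Here is a minimal Python sketch of the same calculations; the function names, argument names and example numbers are illustrative assumptions, not the actual column names or values in nations.csv:

```python
def births(population, birth_rate_per_1000):
    # Total live births implied by the population and the crude
    # birth rate (births per 1,000 people).
    return population * birth_rate_per_1000 / 1000

def neonatal_deaths(population, birth_rate_per_1000, deaths_per_1000_births):
    # Neonatal deaths = births * death rate per 1,000 live births,
    # rounded to the nearest thousand, like Tableau's ROUND(x, -3).
    deaths = births(population, birth_rate_per_1000) * deaths_per_1000_births / 1000
    return round(deaths, -3)

# Illustrative numbers only: a population of 25 million with a birth
# rate of 20 per 1,000 people and 30 neonatal deaths per 1,000 births.
print(births(25_000_000, 20))               # 500000.0
print(neonatal_deaths(25_000_000, 20, 30))  # 15000.0
```

In Tableau itself, the equivalent formulas reference fields in square brackets, along the lines of `[Population] * [Birth rate] / 1000`; check the actual field names in your own data.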

Understand that Tableau’s default behaviour is to summarize/aggregate data

As we work through today’s exercise, notice that Tableau routinely summarizes or aggregates measures that are dropped into Columns and Rows, calculating a SUM or AVG (average or mean), for example.

This behaviour can be turned off by selecting Analysis from the top menu and unchecking Aggregate Measures. However, I do not recommend doing this, as it will disable some Tableau functions. Instead, if you don’t want to summarize all of the data, drop categorical variables into the Detail shelf so that any summary statistic will be calculated at the correct level for your analysis. If necessary, you can set the aggregation so it is being performed on a single data point, and therefore has no effect.

Make a series of treemaps showing neonatal deaths over time

A treemap allows us to directly compare the neonatal deaths in each country, nested by region.

Drag Country and Region onto Columns and Neonatal deaths onto Rows. Then open Show Me and select the treemap option. The initial chart should look like this:

Look at the Marks shelf and see that the size and color of the rectangles reflect the SUM of Neonatal deaths for each country, while each rectangle is labeled with Region and Country:

Now drag Region to Color to remove it from the Label and colour the rectangles by region, using Tableau’s default qualitative colour scheme for categorical data:

For a more subtle color scheme, click on Color, select Edit Colors... and at the dialog box select the Tableau Classic Medium qualitative color scheme, then click Assign Palette and OK.

(Tableau’s qualitative color schemes are well designed, so there is no need to adopt a ColorBrewer scheme. However, it is possible to edit colors individually as you wish.)

Click on Color and set transparency to 75%. (For your assignment you will create a chart with overlapping circles, which will benefit from using some transparency to allow all circles to be seen. So we are setting transparency now for consistency.)

The treemap should now look like this:

Tableau has by default aggregated Neonatal deaths using the SUM function, so what we are seeing is the number for each country added up across the years.

To see one year at a time, we need to filter by year. If you drag the existing Year variable to the Filters shelf, you will get the option to filter by a range of numbers, which isn’t what we need:

Instead, we need to be able to check individual years, and draw a treemap for each one. To do that, select Year in the Dimensions panel and Duplicate.

Select the new variable and Convert to Discrete and then Rename it Year (discrete). Now drag this new variable to Filters, select 2014, and click OK:

The treemap now displays the data for 2014:

That’s good for a snapshot of the data, but with a little tinkering, we can adapt this visualization to show change in the number of neonatal deaths over time at the national, regional and global levels.

Select Year (discrete) in the Filters shelf and Filter ... to edit the filter. Select all the years with even numbers and click OK:

Now drag Year (discrete) onto Rows and the chart should look like this:

The formatting needs work, but notice that we now have a bar chart made out of treemaps.

Extend the chart area to the right by changing from Standard to Entire View on the dropdown menu in the top ribbon:

I find it more intuitive to have the most recent year at the top, so select Year (discrete) in the Rows shelf, select Sort and fill in the dialog box so that the years are sorted in Descending order:

The chart should now look like this:

We will create a map to serve as a legend for the regions, so click on the title bar for the color legend and select Hide Card to remove it from the visualization.

To remove some clutter from the chart, select Format>Borders from the top menu, and under Sheet>Row Divider, set Pane to None. Then close the Format Borders panel.

Right-click on the Sheet 1 title for the chart and select Hide Title. Also right-click on Year (discrete) at the top left of the chart and select Hide Field Labels for Rows. Then hover just above the top bar to get a double-arrowed drag symbol and drag upwards to reduce the white space at the top. You may also want to drag the bars a little closer to the year labels.

The labels will only appear in the larger rectangles. Rather than removing them entirely, let’s just leave a label for India in 2014, to make it clear that this is the country with by far the largest number of neonatal deaths. Click on Label in the Marks shelf, and switch from All to Selected under Marks to Label. Then right-click on the rectangle for India in 2014, and select Mark Label>Always Show. The chart should now look like this:

Hover over one of the rectangles, and notice the tooltip that appears. By default, all the fields we have used to make the visualization appear in the tooltip. (If you need any more, just drag those fields onto Tooltip.) Click on Tooltip and edit as follows. (Unchecking Include command buttons disables some interactivity, giving a plain tooltip):

Save to the web

Right-click on Sheet 1 at bottom left and Rename Sheet to Treemap bar chart. Then select File>Save to Tableau Public... from the top menu. At the logon dialog box enter your Tableau Public account details, give the Workbook a suitable name and click Save. When the save is complete, a view of the visualization on Tableau’s servers will open in your default browser.

Make a map to use as a colour legend

Select Worksheet>New Worksheet from the top menu, and double-click on Country. Tableau recognizes the names of countries and states/provinces; for the U.S., it also recognizes counties. Its default map-making behaviour is to put a circle at the geographic centre, or centroid, of each area, which can be scaled and coloured to reflect values from the data:

However, we need each country to be filled with colour by region. In the Show Me tab, switch to the filled maps option, and each nation should fill with colour. Drag Region to Colour and see how the same colour scheme we used previously carries over to the map. Click on Colour, set the transparency to 75% to match the bubble chart and remove the borders. Also click on Tooltip and uncheck Show tooltip so that no tooltip appears on the legend.

We will use this map as a colour legend, so its separate colour legend is unnecessary. Click the colour legend’s title bar and select Hide Card to remove it from the visualization. Also remove the Sheet 2 title as before.

Centre the map in the view by clicking on it, holding and panning, just as you would on Google Maps. It should now look something like this:

Rename the worksheet Map legend and save to the web again.

Make a line chart showing neonatal mortality rate by country over time

To address our second question, and explore the neonatal death rate over time by country, we can use a line chart.

First, rename Neonat Mortal as Neonatal death rate (per 1,000 births). Then, open a new worksheet, drag this variable to Rows and Year to Columns. The chart should now look like this:

Tableau has aggregated the data by adding up the rates for each country in every year, which makes no sense here. So drag Country to Detail in the Marks shelf to draw one line per country:

Drag Region to Colour and set the transparency to 75%.

Now right-click on the X axis, select Edit Axis, edit the dialog box as follows and click OK:

Right-click on the X axis again, select Format, change Alignment/Default/Header/Direction to Up and use the dropdown menu to set the Font to bold. Also remove the Sheet 3 title.

The chart should now look like this:

We can also highlight the countries with the highest total number of neonatal deaths by dragging Neonatal deaths to Size. The chart should now look like this:

This line chart shows that the trend in most countries has been to reduce neonatal deaths, while some countries have had more complex trajectories. But to make comparisons between individual countries, it will be necessary to add controls to filter the chart.

Tableau’s default behaviour when data is filtered is to redraw charts to reflect the values in the filtered data. So if we want the Y axis and the line thicknesses to stay the same when the chart is filtered, we need to freeze them.

To freeze the line thicknesses, hover over the title bar for the line thickness legend, select Edit Sizes... and fill in the dialog box as follows:

Now remove this legend from the visualization, together with the colour legend. We can later add an annotation to our dashboard to explain the line thickness.

To freeze the Y axis, right-click on it, select Edit Axis..., make it Fixed and click OK:

Right-click on the Y axis again, select Format... and increase the font size to 10pt to make it easier to read.

Now drag Country to Filters, make sure All are checked, and at the dialog box, click OK:

Now we need to add a filter control to select countries to compare. On Country in the Filters shelf, select Show Filter. A default filter control, with a checkbox for each nation, will appear to the right of the chart:

This isn’t the best filter control for this visualization. To change it, click on the title bar for the filter, note the range of filter controls available, and select Multiple Values (Custom List). This allows users to select individual countries by starting to type their names. Then select Edit Title... and add some text explaining how the filter works:

Take some time to explore how this filter works.

Rename Income to Income group. Then add Region and Income group to Filters, making sure that All options are checked for each. Select Show Filter for both of these filters, and select Single Value Dropdown for the control. Reset both of these filters to All, and the chart should now look like this:

Notice that the Income group filter lists the options in alphabetical order, rather than income order, which would make more sense. To fix this, right-click on Income group in the data panel and select Default Properties>Sort. At the dialog box below, select Manual sort, edit the order as follows and click OK:

The chart should now look like this:

Finally, click on Tooltip and edit as follows:

Rename the sheet Line chart and save to the web.

Make a dashboard combining both charts

From the top menu, select Dashboard>New Dashboard. Set its Size to Automatic, so that the dashboard will fill to the size of any screen on which it is displayed:

To make a dashboard, drag charts and other elements from the left-hand panel to the dashboard area. Notice that Tableau allows you to add items including: horizontal and vertical containers, text boxes, images (useful for adding a publication’s logo), embedded web pages and blank space. These can be added Tiled, which means they cannot overlap, or Floating, which allows one element to be placed over another.

Drag Treemap bar chart from the panel at left to the main panel. The default title, from the worksheet name, isn’t very informative, so right-click on that, select Edit Title ... and change to Total deaths.

Now add Line Chart to the right of the dashboard (the gray area will show where it will appear) and edit its title to Death rates. Also add a note to explain that line widths are proportional to the total number of deaths. The dashboard should now look like this:

Notice that the Country, Region and Income group filters control only the line chart. To make them control the treemaps, too, click on each filter, open up the dropdown menu from the downward-pointing triangle, select Apply to Worksheets>Selected Worksheets... and fill in the dialog box as follows:

The filters will now control both charts.

Add Map legend for a color legend at bottom right. (You will probably need to drag the window for the last filter down to push it into position.) Hide the legend’s title, then right-click on the map and select Hide View Toolbar to remove the map controls.

We can also allow the highlighting of a country on one chart to be carried across the entire dashboard. Select Dashboard>Actions... from the top menu, and at the first dialog box select Add action>Highlight. Filling the second dialog box as follows will cause each country to be highlighted across the dashboard when it is clicked on just one of the charts:

Click OK on both dialog boxes to apply this action.

Select Dashboard>Show Title from the top menu. Right-click on it, select Edit Title... and change from the default to something more informative:

Now drag a Text box to the bottom of the dashboard and add a footnote giving source information:

The dashboard should now look like this:

Design for different devices

This dashboard works on a large screen, but not on a small phone. To see this, click the Device Preview button at top left and select Phone under Device type. In portrait orientation, this layout does not work at all:

Click the Add Phone Layout button at top right, and then click the Custom tab under Layout - Phone in the left-hand panel. You can then rearrange and, if necessary, remove elements for different devices. Here I have removed the line chart and filter controls, and changed the legend to a Floating element so that it sits in the blank space to the top right of the bar chart of treemaps.

Now save to the web once more. Once the dashboard is online, use the Share link at the bottom to obtain an embed code, which can be inserted into the HTML of any web page.

(You can also Download a static view of the graphic as a PNG image or a PDF.)

You can download the workbook for any Tableau visualization by clicking the Download Workbook link. The files (which will have the extension .twbx) will open in Tableau Public.

Having saved a Tableau visualization to the web, you can reopen it by selecting File>Open from Tableau Public... from the top menu.

Another approach to responsive design

As an alternative to using Tableau’s built-in device options, you may wish to create three different dashboards, each with a size appropriate for phones, tablets, and desktops respectively. You can then follow the instructions here to put the embed codes for each of these dashboards into a div with a separate class, and then use @media CSS rules to ensure that only the div with the correct dashboard displays, depending on the size of the device.

If you need to make a fully responsive Tableau visualization and are struggling, contact me for help!

When making responsively designed web pages, make sure to include this line of code between the <head></head> tags of your HTML:

<meta name="viewport" content="width=device-width, initial-scale=1.0">

From dashboards to stories

Tableau also allows you to create stories, which combine successive dashboards into a step-by-step narrative. Select Story>New Story from the top menu. Having already made a dashboard, you should find these simple and intuitive to create. Select New Blank Point to add a new scene to the narrative.

Assignment

  • Create this second dashboard from the data. Here are some hints:
    • Drop Year into the Pages shelf to create the control to cycle through the years.
    • You will need to change the Marks to solid circles and scale them by the total number of neonatal deaths. Having done so, you will also need to increase the size of all circles so countries with small numbers of neonatal deaths are visible. Good news: Tableau’s default behavior is to size circles correctly by area, so they will be the correct sizes, relative to one another.
    • You will need to switch to a Logarithmic X axis and alter/fix its range.
    • Format GDP per capita in dollars by clicking on it in the Data panel and selecting Default Properties>Number Format>Currency (Custom).
    • Create a single trend line for each year’s data, so that the line shifts with the circles from year to year. Do this by dragging Trend line into the chart area from the Analytics panel. You will then need to select Analysis>Trend Lines>Edit Trend Lines... and adjust the options to give a single line with the correct behavior.
    • Getting the smaller circles rendered on top of the larger ones, so their tooltips can be accessed, is tricky. To solve this, open the dropdown menu for Country in the Marks shelf, select Sort and fill in the dialog box as follows. Then drag Country so it appears at the top of the list of fields in the Marks shelf.

    This should be a challenging exercise that will help you learn how Tableau works. If you get stuck, download my visualization and study how it is put together.

  • By next week’s class, send me the url for your second dashboard. (Don’t worry about designing for different devices.)
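The hint above about circle sizing reflects a general principle worth understanding: when a circle’s area encodes a value, its radius must scale with the square root of that value. Here is a quick Python sketch of the idea; it is an illustration of the principle only, not part of the Tableau workflow:

```python
import math

def circle_radius(value, scale=1.0):
    # For a circle's area to be proportional to the value it encodes,
    # its radius must be proportional to the square root of the value.
    return scale * math.sqrt(value / math.pi)

r_small = circle_radius(100)
r_big = circle_radius(400)

# A value four times larger gets four times the area,
# but only about twice the radius.
print(r_big / r_small)
print(math.pi * r_big ** 2)  # recovers approximately the encoded value, 400
```

Scaling radius (rather than area) in proportion to the data is a classic visualization mistake, because it exaggerates differences; Tableau’s default Size behavior avoids it.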

Further reading/viewing

Tableau Public training videos

Gallery of Tableau Public visualizations: Again, you can download the workbooks to see how they were put together.

Tableau Public Knowledge Base: Useful resource with the answers to many queries about how to use the software.

Data visualization: Principles

Why visualize data? It is a good way to communicate complex information, because we are highly visual animals, evolved to spot patterns and make visual comparisons. To visualize effectively, however, it helps to understand a little about how our brains process visual information. The mantra for this week’s class is: Design for the human brain!

Visualization: encoding data using visual cues

Whenever we visualize, we are encoding data using visual cues, or “mapping” data onto variation in size, shape or color, and so on. There are various ways of doing this, as this primer illustrates:

These cues are not created equal, however. In the mid-1980s, statisticians William Cleveland and Robert McGill ran some experiments with human volunteers, measuring how accurately they were able to perceive the quantitative information encoded by different cues. This is what they found:

This perceptual hierarchy of visual cues is important. When making comparisons with continuous variables, aim to use cues near the top of the scale wherever possible.

But this doesn’t mean that everything becomes a bar chart

Length on an aligned scale may be the best option to allow people to compare numbers accurately, but that doesn’t mean the other possibilities are always to be avoided in visualization. Indeed, color hue is a good way of encoding categorical data. The human brain is particularly good at recognizing patterns and differences. This means that variations in color, shape and orientation, while poor for accurately encoding the precise value of continuous variables, can be good choices for representing categorical data.

You can also combine different visual cues in the same graphic to encode different variables. But always think about the main messages you are trying to impart, and where you can use visual cues near the top of the visual hierarchy to communicate that message most effectively.

To witness this perceptual hierarchy, look at the following visual encodings of the same simple dataset. In which of the three charts is it easiest to compare the numerical values that are encoded?

If you have spent any time reading blogs on data visualization, you will know the disdain in which pie charts are often held. It should be clear which of these two charts is easiest to read:

(Source: VizThinker)

Pie charts encode continuous variables primarily using the angles made in the center of the circle. It is certainly true that angles are harder to read accurately than aligned bars. However, note that encoding data using the area of circles — which has become a “fad” in data visualization in recent years — makes even tougher demands on your audience.

Which chart type should I use?

This is a frequently asked question, and the best answer is: Experiment with different charts, to see which works best to liberate the story in your data. Some of the visualization software — notably Tableau Public — will suggest chart types for you to try. However, it is good to have a basic framework to help you prioritize particular chart types for particular visualization tasks. Although it is far from comprehensive, and makes some specific chart suggestions that I would not personally endorse, this “chart of charts” provides a useful framework by providing four answers to the question: “What would you like to show?”

(Source: A. Abela, Extreme Presentation Method)

Last week, we covered charts to show the distribution of a single continuous variable, and to study the relationship between two continuous variables. So let’s now explore possibilities for comparison between items for a single continuous variable, and composition, or how parts make up the whole. In each case, this framework considers both a snapshot at one point in time, and how to visualize comparison and composition over time — a common task in data journalism.

I like to add a couple more answers to the question: connection, or visualizing how people, things, or organizations relate to one another; and location, which covers maps.

Simple comparisons: bars and columns

Applying the perceptual hierarchy of visual cues, bar and column charts are usually the best options for simple comparisons. Vertical columns often work well when few items are being compared, while horizontal bars may be a better option when there are many items to compare, as in this example from The Wall Street Journal, illustrating common passwords revealed by a 2013 data breach at Gawker Media.

Here I have used a bar chart to show payments for speaking about drug prescription made to doctors in California by the drug company Pfizer in the second half of 2009, using data gathered in reporting this story.

Notice how spot colour is used here as a secondary visual cue, to highlight the doctor who received the most money.

There is one sacrosanct rule with bar and column charts: Because they rely on the length of the bars to encode data, you must start the bars at zero. Failing to do this will mislead your audience. Several graphics aired by Fox News have been criticized for disobeying this rule, for example:

(Source: Fox News, via Media Matters for America)

Comparisons: change over time

Bar or column charts can also be used to illustrate change over time, but there are other possibilities, as shown in these charts showing participation in the federal government’s food stamps nutritional assistance program, from 1969 to 2014.

(Source: Peter Aldhous, from U.S. Department of Agriculture data)

Each of these charts communicates the same basic information with a subtly different emphasis. The column chart emphasizes each year as a discrete point in time, while the line chart focuses on the overall trend or trajectory. The dot-and-line chart is a compromise between these two approaches, showing the trend while also drawing attention to the value for each year. (The dot-column chart is an unusual variant of a column chart, included here to show another possible design approach.)

Multiple comparisons, including over time

When comparing very many items, or how one item has changed over time, “small multiples” provide another approach. They have been used very successfully in recent years by several news organizations. Here is a small section from a larger graphic showing the severity of drought in California in late 2013 and early 2014:

(Source: Los Angeles Times)

Small multiples are becoming more popular as more people consume news graphics on mobile devices. Unlike larger conventional graphics, they can be made to reflow easily in responsive web designs to display effectively on small screens.

If you are comparing two points in time for many items, a slope graph can be an effective choice. Slope falls about midway on the perceptual hierarchy of visual cues, but allows us to scan many items at once and note obvious differences. Here I used slope graphs to visualize data from a study examining the influence of putting house plants in hospital rooms on patients’ sense of well-being, measured before abdominal surgery, and after a period of recovery. I used thicker lines and color to highlight ratings that showed statistically significant improvements.

(Source: Peter Aldhous, from data in this research paper)

Composition: parts of the whole

This is where the much-maligned pie chart does have a role, although it is not the only option. Which of these two representations of an August 2014 poll of public opinion on President Barack Obama’s job performance makes the differences between his approval ratings for different policy areas easiest to read, the pie charts or the stacked column charts below?

These graphics involve both comparison and composition — a common situation in data journalism.

(Source: Peter Aldhous, from CBS poll data, via PollingReport.com)

In class, we’ll discuss how both of these representations of the data could have been improved.

I would suggest abandoning pie charts if there are any more than three parts to the whole, as they become very hard to read when there are many segments. ProPublica’s graphics style guide goes further, allowing pie charts with two segments only.

Recent research into how people perceive composition visualizations with just a few categories suggests that the best approach may actually be a square chart. Surprisingly, this is an example where an encoding of area seems to beat length for accuracy:

(Source: Eagereyes)

Another approach, known as a treemap, similarly uses area to encode the size of parts of the whole, and can be effective to display “nested” variables — where each part of the whole is broken down into further parts. Here The New York Times used a treemap to display President Obama’s 2012 budget request to Congress, also using color to indicate whether the proposal represented an increase (shades of green) or decrease (red) in spending:

(Source: The New York Times)

Composition: change over time

Data journalists frequently need to show how parts of the whole vary over time. Here is an example, illustrating the development of drought across the United States, which uses a stacked column format, in this case with no space between the columns.

(Source: The Upshot, The New York Times)

In the drought example, the size of the whole remains constant. Even if the size of the whole changes, this format can be used to show changes in the relative size of parts of the whole, by converting all of the values at each time interval into percentages of the total.
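The conversion to percentages of the total can be sketched as follows. The category names and counts here are made up for illustration; real numbers would come from the dataset:

```python
def to_percentages(parts):
    # Convert raw counts for one time interval into percentages of
    # that interval's total, so each stacked column sums to 100.
    total = sum(parts.values())
    return {category: 100 * count / total for category, count in parts.items()}

# Hypothetical drought-severity counts for two time intervals.
snapshots = {
    "January": {"none": 10, "moderate": 50, "severe": 40},
    "June": {"none": 5, "moderate": 45, "severe": 50},
}

for period, parts in snapshots.items():
    print(period, to_percentages(parts))
```

After this normalization, every column has the same height, so the chart shows only the changing relative composition, not any change in the size of the whole.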

Stacked column charts can also be used to simultaneously show change in composition over time and change in the size of the whole. This example is from one of my own articles, looking at change over time in the numbers of three categories of scientific research papers published in Proceedings of the National Academy of Sciences:

(Source: Nature)

Just as for simple comparisons over time, columns are not the only possibility when plotting changes in composition over time. The parts-of-the-whole equivalent of the line chart, stressing the overall trend rather than values at discrete points in time, is the stacked area chart. Again, these charts can be used to show change over time with the size of the whole held constant, or varying over time. This 2009 interactive from The New York Times used this format to reveal how Americans typically spend their day:

(Source: The New York Times)

Making connections: network graphs

The “chart of charts” we have used as a framework so far misses two of my answers to the question: “What would you like to show?” We will cover location in subsequent classes on mapping.

Journalists may be interested in exploring connection — which donors gave money to which candidate, how companies are connected through members of their boards, and so on. Network graphs can visualize these questions, and are sometimes used in news media. Here, for example, The New York Times showed connections between the national teams, players and club teams at the 2014 soccer World Cup:

(Source: The New York Times)

Complex network graphs can be very hard to read — “hairball” is a pejorative term used to describe them — so networks often need to be filtered to tell a clear story to your audience.

If you are interested in learning how to make network graphs, I have tutorials here.

Case study: Immunization in California kindergartens

Now we’ll explore a dataset at different levels of analysis, to show how different visual encodings may be needed for different visualization tasks with the same data.

This data, from the California Department of Public Health, gives numbers on immunization and enrollment at kindergartens across the state. The data is provided at the level of individual schools, but can be aggregated to look at counties, or the entire state.

When looking at change over time at the state level, the perceptual hierarchy makes a column chart a good choice:

(Source: Peter Aldhous, from California Department of Public Health data)

Notice that I’ve focused on the percentage of children with incomplete vaccination, rather than the percentage complete, for two reasons:

  • The differences between the lengths of the bars are greater, and so are easier to read.
  • More importantly, incomplete vaccination is what increases the risk of infectious disease outbreaks, which is why we care about this data.

But as for the food stamps data, a bar chart is not the only choice:

Here’s the same information presented as a line chart:

(Source: Peter Aldhous, from California Department of Public Health data)

Notice that here, I haven’t started the Y axis at zero. This would be unforgivable for a bar chart, where the length of the bar is the visual encoding, and so starting at an arbitrary value would distort the comparison between the bars. Here, however, I’m emphasizing the relative slope, to show change over time, so starting at zero is less crucial.

And here’s the data as a dot-and-line chart:

(Source: Peter Aldhous, from California Department of Public Health data)

Here, I’ve returned to a Y axis that starts at zero, so that the relative positions of the points can be compared accurately.

But what if we want to look at individual counties? When comparing a handful of counties, the dot-and-line chart, combining the visual cues of position on an aligned scale (for the yearly values) and slope (for the rate of change from year to year) works well:

(Source: Peter Aldhous, from California Department of Public Health data)

But there are 58 counties in California, and trying to compare them all using a dot-and-line chart results in chaos:

(Source: Peter Aldhous, from California Department of Public Health data)

In this case, it makes sense to drop down the perceptual hierarchy, and use the intensity of color to represent the percentage of incomplete immunization:

(Source: Peter Aldhous, from California Department of Public Health data)

This type of chart is called a heat map. It provides a quick and easy way to scan for the counties and years with the highest rates of incomplete immunization.

What if we want to visualize the data for every kindergarten on a single chart, to give an overview of how immunization rates vary across schools?

Here’s my best attempt at this:

(Source: Peter Aldhous, from California Department of Public Health data)

Here I’ve drawn a circle for every school, and used their position on an aligned scale, along the Y axis, to encode the percentage of incomplete immunization. I’ve also used the area of the circles to encode the enrollment at each kindergarten — but this is secondary to the chart’s main message, which is about the variation of immunization rates across schools.

Using color effectively

Color falls low on the perceptual hierarchy of visual cues, but as we have seen above, it is often deployed to highlight particular elements of a chart, and sometimes to encode data values. Poor choice of color schemes is a problem that bedevils many news graphics, so it is worth taking some time to consider how to use color to maximum effect.

It helps to think about colors in terms of the color wheel, which places colors that “harmonize” well together side by side, and arranges those that have strong visual contrast — blue and orange, for instance — at opposite sides of the circle:

(Source: Wikimedia Commons)

When encoding data with color, take care to fit the color scheme to your data, and the story you’re aiming to tell. Color is often used to encode the values of categorical data. Here you want to use “qualitative” color schemes, where the aim is to pick colors that will be maximally distinctive, as widely spread around the color wheel as possible:

(Source: ColorBrewer)

When using color to encode continuous data, it usually makes sense to use increasing intensity, or saturation of color to indicate larger values. These are called “sequential” color schemes:

(Source: ColorBrewer)

In some circumstances, you may have data that has positive and negative values, or which highlights deviation from a central value. Here, you should use a “diverging” color scheme, which will usually have two colors reasonably well separated on the color wheel as its end points, and cycle through a neutral color in the middle:

(Source: ColorBrewer)

Choosing color schemes is a complex science and art, but there is no need to “roll your own” for every graphic you make. Many visualization tools include suggested color palettes, and I often make use of the website from which the examples above were taken, called ColorBrewer. Originally designed for maps, but useful for charts in general, these color schemes have been rigorously tested to be maximally informative.

In class, we will take some time to play around with ColorBrewer and examine its outputs. You will notice that the colors it suggests can be displayed according to their values on three color “models”: HEX, RGB and CMYK. Here is a brief explanation of these and other common color models.

  • RGB Three values, describing a color in terms of combinations of red, green, and blue light, with each scale ranging from 0 to 255; sometimes extended to RGB(A), where A is alpha, which encodes transparency. Example: rgb(169, 104, 54).
  • HEX A six-figure “hexadecimal” encoding of RGB values, with each scale ranging from hex 00 (equivalent to 0) to hex ff (equivalent to 255); HEX values will be familiar if you have any experience with web design, as they are commonly used to denote color in HTML and CSS. Example: #a96836
  • CMYK Four values, describing a color in combinations of cyan, magenta, yellow and black, relevant to the combination of print inks. Example: cmyk(0, 0.385, 0.68, 0.337)
  • HSL Three values, describing a color in terms of hue, saturation and lightness (running from black, through the color in question, to white). Hue is the position on a blended version of the color wheel in degrees around the circle ranging from 0 to 360, where 0 is red. Saturation and lightness are given as percentages. Example: hsl(26.1, 51.6%, 43.7%)
  • HSV/B Similar to HSL, except that brightness (sometimes called value) replaces lightness, running from black to the color in question. hsv(26.1, 68.07%, 66.25%)
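
To see how these models relate in practice, here is a small sketch using Python’s standard colorsys module (note that colorsys works with values from 0 to 1, and uses the ordering HLS rather than HSL):

```python
import colorsys

def rgb_to_hex(r, g, b):
    """Convert 0-255 RGB values to a HEX string."""
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

def rgb_to_hsl(r, g, b):
    """Convert 0-255 RGB values to HSL (hue in degrees, saturation and lightness in %)."""
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)  # note: H, L, S ordering
    return round(h * 360, 1), round(s * 100, 1), round(l * 100, 1)

print(rgb_to_hex(169, 104, 54))   # '#a96836'
print(rgb_to_hsl(169, 104, 54))   # (26.1, 51.6, 43.7)
```

The printed values match the worked example above: rgb(169, 104, 54) is #a96836 in HEX and hsl(26.1, 51.6%, 43.7%) in HSL.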

Colorizer is one of several web apps for picking colors and converting values from one model to another.

Custom color schemes can also work well, but experiment to see how different colors influence your story. The following graphic from The Wall Street Journal, for instance, uses an unusual pseudo-diverging scheme to encode data — the US unemployment rate — that would typically be represented using a sequential color scheme. It has the effect of strongly highlighting periods where the jobless rate rises to around 10%, which flow like rivers of blood through the graphic. This was presumably the designer’s aim.

(Source: The Wall Street Journal)

If you intend to roll your own color scheme, try experimenting with I want hue for qualitative color schemes, the Chroma.js Color Scale Helper for sequential schemes, and this color ramp generator, in combination with Colorizer or another online color picker, for diverging schemes.

You will also notice that ColorBrewer allows you to select color schemes that are colorblind safe. Surprisingly, many news organizations persist in using color schemes that exclude a substantial minority of their audience. Red and green lie on opposite sides of the color wheel, and also can be used to suggest “good” or “go,” versus “bad” or “stop.” But about 5% of men have red-green colorblindness, also known as deuteranopia. Here, for example, is what the budget treemap from The New York Times would look like to someone with this condition:

(Source: The New York Times via Color Oracle)

Install Color Oracle to check how your charts and maps will look to people with various forms of colorblindness.

Using chart furniture, minimizing chart junk, highlighting the story

In addition to the data, encoded through the visual cues we have discussed, various items of chart furniture can help frame the story told by your data:

  • Title and subtitle These provide context for the chart.
  • Coordinate system For most charts, this is provided by the horizontal and vertical axes, giving a cartesian system defined by X and Y coordinates; for a pie chart it is provided by angles around a circle, called a polar coordinate system.
  • Scale Labeled tick marks and grid lines can help your audience read data values.
  • Labels You will usually want to label each axis. Think about other labels that may be necessary to explain the message of your graphic.
  • Legend If you use color or shape to encode data, you will often need a legend to explain this encoding.
  • Source information Usually given as a footnote. Don’t forget this!

Chart furniture can also be used to encode data, as in this example, which shows the terms of New York City’s police commissioners and mayors with reference to the time scale on the X axis:

(Source: The New York Times)

In this example, the label for the Y axis is displayed horizontally in the main chart area, rather than vertically alongside the chart. News media often do this so that readers don’t have to crane their necks to read the label. If you do this, check that it is clear to users that the label refers to scale on the Y axis.

Think carefully about how much chart furniture you really need, and make sure that the story told by your data is front and center. Think data-ink: What proportion of the ink or pixels in your chart actually encodes data, and what proportion is embellishment, adding little to your story?

Here is a nice example of a graphic that minimizes chart junk, and maximizes data-ink. Notice how the Y axis doesn’t need to be drawn, and the gridlines are an absence of ink, consisting of white lines passing through the columns:

(Source: The Upshot, The New York Times)

Contrast this with the proliferation of chart junk in the earlier misleading Fox News column chart.

Labels and spot-color highlights can be particularly useful to highlight your story, as shown in the following scatter plots, used here to show the relationship between the median salaries paid to women and men for the same jobs in 2015. In this case there is no suggestion of causation; here the scatter plot format is being used to display two distributions simultaneously — see the chart types thought-starter.

It is clear from the first, unlabeled plot that male and female salaries for the same job are strongly correlated, as we would expect, but that relationship is not very interesting. Notice also how I have used transparency to help distinguish overlapping individual points.

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

What we are interested in here is whether men and women are compensated similarly for doing the same jobs. The story in the data starts to emerge if you add a line of equal pay, with a slope of 1 (note that this isn’t a trend line, as we discussed last week). Here I have also highlighted the few jobs in which women in 2013 enjoyed a marginal pay gap over men:

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

Notice how adding another line, representing a 25% pay gap, and highlighting the jobs where the pay gap between men and women is largest, emphasizes different aspects of the story:

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

Pitfalls to avoid

If you ever decide to encode data using area, be sure to do so correctly. Hopefully it is obvious that if one unit is a square with sides of length one, then the correct way to represent a value of four is a square with sides of length two (2*2 = 4), not a square with sides of length four (4*4 = 16).

Mistakes are frequently made, however, when encoding data by the area of circles. In 2011, for instance, President Barack Obama’s State of the Union Address for the first time included an “enhanced” online version with supporting data visualizations. This included the following chart, comparing US Gross Domestic Product to that of competing nations:

(Source: The 2011 State of the Union Address: Enhanced Version)

Data-savvy bloggers were quick to point out that the data had been scaled by the radius of each circle, not its area. Because area = π * radius^2, you need to scale the radius of each circle by the square root of the data value to achieve the correct result, on the right:

(Source: Fast Fedora blog)
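
The correct scaling can be sketched in a couple of lines of Python: to make a circle’s area proportional to a data value, set its radius to the square root of the value, scaled relative to some base circle (the function name here is illustrative, not from any particular library):

```python
import math

def radius_for_value(value, base_value=1.0, base_radius=1.0):
    """Scale a circle's radius so that its AREA is proportional to the data value."""
    return base_radius * math.sqrt(value / base_value)

# A value four times the base gets twice the radius, because area ∝ radius²:
print(radius_for_value(4))  # 2.0
print(radius_for_value(9))  # 3.0
```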

Many software packages (Microsoft Excel is a notable culprit) allow users to create charts with 3-D effects. Some graphic designers produce customized charts with similar aesthetics. The problem is that it is very hard to read the data values from 3-D representations, as this example illustrates:

(Source: Good)

A good rule of thumb for data visualization is that trying to represent three dimensions on a two-dimensional printed or web page is almost always one dimension too many, except in unusual circumstances, such as these representations of Mount St. Helens in Washington State, before and after its 1980 eruption:

(Source: OriginLab)

Above all, aim for clarity and simplicity in your chart design. Clarity should trump simplicity. As Albert Einstein is reputed to have said: “Everything should be made as simple as possible, but not simpler.”

Sometimes even leading media outlets lose their way. See if you can make sense of this interactive graphic on clandestine US government agencies and their contractors:

(Source: The Washington Post)

Be true to the ‘feel’ of the data

Think about what the data represents in the real world, and use chart forms, visual encodings and color schemes that allow the audience’s senses to get close to what the data means — note again the “rivers of blood” running through The Wall Street Journal’s unemployment chart, which suggest human suffering.

The best example I know of this uses sound rather than visual cues, so strictly speaking it is “sonification” rather than visualization. In 2010, this interactive from The New York Times explored the narrow margins separating medalists from also-rans in many events at the Vancouver Winter Olympics. It visualized the results in a conventional way, but also included sound files encoding the race timings with musical notes.

(Source: The New York Times)

Our brains process music in time, but perceive charts in space. That’s why the auditory component of this interactive was the key to its success.

Break the story down into scenes

Many stories have a step-by-step narrative, and different charts may tell different parts of the story. So think about communicating such stories through a series of graphics. This is another good reason to experiment with different chart types when exploring a new dataset. Here is a nice example of this approach, examining demographic change in Brazil:

(Source: Época, via Visualopolis)

Good practice for interactives

Nowadays the primary publication medium for many news graphics is the web or apps on mobile platforms, rather than print, which opens up many possibilities for interactivity. This can greatly enhance your ability to tell a story, but it also creates new possibilities to confuse and distract your audience — think of this as interactive chart junk.

A good general approach for interactive graphics is to provide an overview first, and then allow the interested user to zoom or filter to dig deeper into the data. In such cases, the starting state for an interactive should tell a clear story: If users have to make an effort to dig into a graphic to get anything from it, few are likely to do so. Indeed, assume that much of your audience will spend only a short time interacting with the data. “How Different Groups Spend Their Day” from The New York Times is a good example of this approach.

Similarly, don’t hide labels or information essential to understanding the graphic in tooltips that are accessed only on clicks or hovers. This is where to put more detailed information for users who have sufficient interest to explore further.

Make the controls for an interactive obvious — play buttons should look like play buttons, for instance. You can include a few words of explanation, but only a very few: as far as possible, how to use the interactive should be intuitive, and built into its design.

The interactivity of the web also facilitates a scene-by-scene narrative — a device employed frequently by The New York Times‘ graphics team in recent years. With colleagues at New Scientist, I also used this approach for this interactive, exploring the likely number of Earth-like planets in our Galaxy:

(Source: New Scientist)

‘Mobile-first’ may change your approach

Increasingly, news content is being viewed on mobile devices with small screens.

At the most basic level, this means making graphics “responsive,” so that their size adjusts to screen size. But there is more to effective design for mobile than this.

We have already discussed the value of small multiples, which can be made to reflow for different screen sizes.

This interactive, exploring spending on incarceration by block in Chicago, is a nice example of organizing and displaying the same material differently for different screen sizes. Open it up on your laptop then reduce the size of your browser window to see how it behaves.

(Source: DataMade)

Again, a step-by-step narrative can be a useful device in overcoming the limitations of a small screen. This interactive, exploring school segregation by race in Florida, is a good example of this approach:

(Source: Tampa Bay Times)

Here’s an article that includes some of my thoughts on the challenge of making graphics that work effectively on mobile.

Be careful with animation

Animation in interactives can be very effective. But remember the goal of staying true to the ‘feel’ of the data. Animated images evolve over time, so animation can be particularly useful to encode data that changes over time. But again you need to think about what the human brain is able to perceive. Research has shown that people have trouble tracking more than about four points at a time. Try playing Gapminder World without the energetic audio commentary of Hans Rosling’s “200 Countries” video, and see whether the story told by the data is clear.

Animated transitions between different states of a graphic can be pleasing. But overdo it, and you’re into the realm of annoying Powerpoint presentations with items zooming into slides with distracting animation effects. It’s also possible for elegant animated transitions to “steal the show” from the story told by the data, which arguably is the case for this exploration by The New York Times of President Obama’s 2013 budget request to Congress:

(Source: The New York Times)

Sketch and experiment to find the story

One key message I’d like you to take from this class is that there are many ways of visualizing the same data. Effective graphics and interactives do not usually emerge fully formed. They usually arise through sketching and experimentation.

As you sketch and experiment with data, use the framework suggested by the chart selector thought-starter to prioritize different chart types, and always keep the perceptual hierarchy of visual cues at the front of your mind. Remember the mantra: Design for the human brain!

Also, show your experiments to friends and colleagues. If people are confused or don’t see the story, you may need to try a different approach.

Learn from the experts

Over the coming weeks and beyond, make a habit of looking for innovative graphics, especially those employing unusual chart forms, that communicate the story from data in an effective way. Work out how they use visual cues to encode data. Here are a couple of examples from The New York Times to get you started. Follow the links from the source credits to explore the interactive versions:

(Source: The New York Times)

(Source: The New York Times)

Similarly, make note of graphics that communicate less effectively, and see if you can work out why.

Further reading

Alberto Cairo: The Functional Art: An Introduction to Information Graphics and Visualization

Nathan Yau: Data Points: Visualization That Means Something

Network Analysis with Gephi

Network analysis with Gephi

Today, network analysis is being used to study a wide variety of subjects, from how networks of genes and proteins influence our health to how connections between multinational companies affect the stability of the global economy. Network graphs can also be used to great effect in journalism to explore and illustrate connections that are crucial to public policy, or directly affect the lives of ordinary people.

Do any of these phrases have resonance for you?

  • The problem with this city is that it’s run by X’s cronies.
  • Contracts in this town are all about kickbacks.
  • Follow the money!
  • It’s not what you know, it’s who you know.

If so, and you can gather relevant data, network graphs can provide a means to display the connections involved in a way that your audience (and your editors) can readily understand.

The data we will use

Download the data from this session from here, unzip the folder and place it on your desktop. It contains the following folders and files:

  • friends.csv A simple network documenting relationships among a small group of people.
  • senate_113-2013.gexf and senate-113-2014.gexf Two files with data on voting patterns in the U.S. Senate, detailing the number and percentage of times pairs of Senators voted the same way in each year.
  • senate_one_session.py Python script that will scrape a single year’s data from GovTrack.US; modified from a script written by Renzo Lucioni.
  • senate Folder containing files and code to make an interactive version of the Senate voting network.

Network analysis: the basics

At its simplest level, network analysis is very straightforward. Network graphs consist of edges (the connections) and nodes (the entities that are connected).

One important consideration is whether the network is “directed,” “undirected,” or a mixture of the two. This depends upon the nature of the connections involved. If you are documenting whether people are Facebook friends, for instance, and have no information about who made the original friend request, the connections have no obvious direction. But when considering following relationships on Twitter, there is a clear directionality to each relationship: A fan might follow Taylor Swift, for example, but she probably doesn’t follow them back.

The connections in undirected graphs are typically represented by simple lines or curves, while directed relationships are usually represented by arrows.

Here, for example, I used a directed network graph to illustrate patterns of citation of one another’s work by researchers working on a type of stem cell that later won their discoverer a Nobel prize. Notice that in some cases there are arrows going in both directions, because each researcher had frequently cited the other:

(Source: New Scientist)

Nodes and edges can each have data associated with them, which can be represented using size, color and shape.

Network algorithms and metrics

Networks can be drawn manually, with the nodes placed individually to give the most informative display. However, network theorists have devised layout algorithms to automate the production of network graphs. These can be very useful, especially when visualizing large and complex networks.

There are also a series of metrics that can quantify aspects of a network. Here are some examples, which measure the importance of nodes within a network in slightly different ways:

  • Degree is a simple count of the number of connections for each node. For directed networks, it is divided into In-degree, for the number of incoming connections, and Out-degree, for outgoing connections. (In my stem cell citation network, In-Degree was used to set the size of each node.)
  • Eigenvector centrality accounts not only for the node’s own degree, but also the degrees of the nodes to which it connects. As such, it is a measure of each node’s wider “influence” within the network. Google’s PageRank algorithm, which rates the importance of web pages according to the links they receive, is a variant of this measure.
  • Betweenness centrality essentially reveals how important each node is in providing a “bridge” between different parts of the network: It counts the number of times each node appears on the shortest path between two other nodes. It is particularly useful for highlighting the nodes that, if removed, would cause a network to fall apart.
  • Closeness centrality is a measure of how close each node is, on average, to all of the other nodes in a network. It highlights the nodes that connect to the others through a lower number of edges. The Kevin Bacon Game, in which you have to connect Bacon to other movie actors through the fewest number of movies, based on co-appearances, works because he has a high closeness centrality in this network.
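
To make two of these metrics concrete, here is a minimal pure-Python sketch computing degree and closeness centrality for a small, hypothetical friendship network, using breadth-first search to find shortest paths. (Dedicated libraries such as NetworkX provide these and many more metrics; the names and network here are invented for illustration.)

```python
from collections import deque

# A toy undirected network as an adjacency dict: Ann bridges two small groups.
graph = {
    "Ann":   ["Bob", "Carol", "Dave"],
    "Bob":   ["Ann"],
    "Carol": ["Ann"],
    "Dave":  ["Ann", "Eve"],
    "Eve":   ["Dave", "Frank"],
    "Frank": ["Eve"],
}

def degree(node):
    """Degree: a simple count of a node's connections."""
    return len(graph[node])

def shortest_path_lengths(start):
    """Breadth-first search giving the hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness(node):
    """Closeness centrality: (n - 1) divided by the sum of distances to all other nodes."""
    dist = shortest_path_lengths(node)
    return (len(graph) - 1) / sum(d for n, d in dist.items() if n != node)

print(degree("Ann"))      # 3
print(closeness("Ann"))   # 0.625 — Ann is, on average, closest to everyone
print(closeness("Frank")) # Frank, out on the periphery, scores much lower
```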

Network data formats

The most basic data needed to draw a network is an “edge list” — a list of pairs of nodes that connect within the network, which can be created in a spreadsheet with two columns, one for each member of each pair of nodes.

There are also a number of dedicated data formats used to record data about networks, which can store a variety of data about both edges and nodes. Here are two of the most common:

GEXF is a variant of XML, and is the native data format for Gephi, the network visualization software we will use today.

GraphML is another, older XML format used for storing and exchanging network graph data.

For visualizing networks online, it often makes sense to save them as JSON, which keeps file size small and works well with JavaScript visualization libraries.
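
As an illustration, a tiny network can be written to a node-link JSON structure with Python’s standard json module. The exact schema expected varies between JavaScript visualization libraries, so treat the field names here as assumptions:

```python
import json

# A minimal node-link structure; field names ("nodes", "links", "source",
# "target") are a common convention, but check your visualization library.
network = {
    "nodes": [{"id": "Ann"}, {"id": "Bob"}],
    "links": [{"source": "Ann", "target": "Bob"}],
}

with open("network.json", "w") as f:
    json.dump(network, f)
```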

Introducing Gephi

Gephi is a tool designed to draw, analyze, filter and customize the appearance of network graphs according to qualitative and quantitative variables in the data.

Gephi allows you to deploy layout algorithms, or to place nodes manually. It can calculate network metrics, and lets you use the results of these analyses to customize the appearance of your network graph.

Finally, Gephi allows you to create publication-quality vector graphics of your network visualizations, and to export your filtered and analyzed networks in data formats that can be displayed interactively online, using JavaScript visualization libraries.

Launch Gephi, and you will see a screen like this:

(You may also see an initial welcome window, allowing you to load recently used or sample data. You can close this.)

Install Gephi plugins

Gephi has a series of plugins that extend its functionality — you can browse the available options here.

We will install a plugin that we will later use to export data from Gephi as JSON. Select Tools>Plugins from the top menu, and the Plugins window should open:

In the Available Plugins tab, look for the JSONExporter plugin — you can use the Search box to find it, if necessary. Then click Install.

After installing plugins, you may be prompted to restart Gephi, which you should do.

If you do not find the plugin you are looking for, close Gephi and browse for the plugin at the Gephi marketplace, where you can download it manually. Then relaunch Gephi, select Tools>Plugins from the top menu and go to the Downloaded tab. Click the Add Plugins ... button, and navigate to where the plugin was saved on your computer — it should be in a zipped folder or have an .nbm file extension. Then click Install and follow the instructions that appear.

See here for more instructions on installing Gephi plugins.

Make a simple network graph illustrating connections between friends

Having launched Gephi, click on Data Laboratory. This is where you can view and edit raw network data. From the top menu, select File>New Project, and the screen should look like this:

Notice that you can switch between viewing Nodes and Edges, and that there are buttons to Add node and Add edge, which allow you to construct simple networks by manual data entry. Instead, we will import a simple edge list, to show how Gephi will add the nodes and draw the network from this basic data.

To do this, click the Import Spreadsheet button — which actually imports CSV files, rather than spreadsheets in .xls or .xlsx format. Your CSV file should at a minimum have two columns, headed Source and Target. By default, Gephi will create a directed network from an edge list, with arrows pointing from Source to Target nodes. If some or all of your connections are undirected, include a third column called Type and fill the rows with Undirected or Directed, as appropriate.
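
As a sketch of this format, here is how you might generate a small edge list CSV with Python’s csv module. The rows are hypothetical — the contents of the real friends.csv may differ:

```python
import csv

# Hypothetical edge list rows in the Source, Target, Type format Gephi expects.
edges = [
    ("Source", "Target", "Type"),
    ("Ann", "Bob", "Directed"),
    ("Bob", "Ann", "Directed"),
    ("Ann", "Carol", "Directed"),
]

with open("friends_example.csv", "w", newline="") as f:
    csv.writer(f).writerows(edges)
```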

At the dialog box shown below, navigate to the file friends.csv, and ensure that the data is going to be imported as an Edges table:

Click Next> and then Finish, and notice that Gephi has automatically created a Nodes table from the information in the Edges table:

In the Nodes table, click the Copy data to other column button at the bottom of the screen, select Id and click OK to copy to Label. This column can later be used to put labels on the network graph.

Now click Add column, call it Gender and keep its Type as String, because we are going to enter text values, rather than numbers. This column can later be used to color the friends according to their gender.

Having created the column, double-click on each row and enter F or M, as appropriate:

Now switch to the Edges table, and notice that each edge has been classed as Directed:

This would make sense if, for example, we were looking at dinner invitations made by the source nodes. Notice that in this network, not only has Ann invited Bob to dinner, but Bob has also invited Ann. For each of the other pairs, the invitations have not been reciprocated.

Now click Overview to go back to the main graph view, where a network graph should now be visible. You can use your mouse/trackpad to pan and zoom. On my trackpad, right-click and hold enables panning, while the double-finger swipe I would normally use to scroll enables zoom. Your settings may vary!

Note also that the left of the two sliders at bottom controls the size of the edges, and that individual nodes can be clicked and moved to position them manually. Below I have arranged the nodes so that none of the edges cross over one another. The Context panel at top right gives basic information about the network:

Click on the dark T button at bottom to call up labels for the nodes, and use the right of the two sliders to control their size. The light T button would call up edge labels, if they were set.

Turn off the labels once more, and we will next color the nodes according to the friends’ gender.

Notice that the panel at top left contains two tabs, Partition and Ranking. The former styles nodes or edges according to qualitative variables, the latter according to quantitative variables. Select Partition>Nodes, hit the Refresh button with the circling green arrows, select Gender and hit the Run button with the green “play” symbol. The nodes should now be colored by gender, and you may find that the edges also take the color of the source node:

To turn off this behavior, click this button at the bottom of the screen: (the button to its immediate left allows edge visibility to be turned on and off).

Select File>New Project and you will be given the option to save your project before closing. You can also save your work at any time by selecting File>Save or by using the usual ⌘-S or Ctrl-S shortcut.

Visualize patterns of voting in the U.S. Senate

Having learned these basics, we will now explore a more interesting network, based on voting patterns in the U.S. Senate in 2014.

Select File>Open from the top menu and navigate to the file senate-113-2014.gexf. The next dialog box will give you some information about the network being imported, in this case telling you it is an undirected network containing 101 nodes, and 5049 edges:

Once the network has imported, go to the Data Laboratory to view and examine the data for the Nodes and Edges. Notice that each edge has a column called percent_agree, which is the number of times the member of each pair of Senators voted the same way, divided by the total number of votes in the chamber in 2014, giving a number between 0 and 1:
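As a hypothetical sketch of how such a score could be derived, here is the agreement calculation for one pair of Senators, using invented vote records:

```python
# Sketch only: how a percent_agree score could be computed for one
# pair of Senators. The vote records below are invented for illustration.

votes_a = ["yea", "nay", "yea", "yea", "nay"]  # Senator A's roll-call votes
votes_b = ["yea", "nay", "nay", "yea", "nay"]  # Senator B's roll-call votes

# Count the votes on which both Senators voted the same way
same = sum(1 for a, b in zip(votes_a, votes_b) if a == b)

# Divide by the total number of votes to get a number between 0 and 1
percent_agree = same / len(votes_a)

print(percent_agree)  # 4 agreements out of 5 votes -> 0.8
```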

Click the Configuration button, and ensure that Visible graph only is checked. When we start filtering the data, this will ensure that the data tables show the filtered network, not the original.

Now return to the Overview, where we will use a layout algorithm to alter the appearance of the network. In the Layout panel at bottom left, choose the Fruchterman Reingold layout algorithm and click Run. (I know from prior experimentation that this algorithm gives a reasonable appearance for this network, but do experiment with different options if working on your own network graphs in future.) Note also that there are options to change the parameters of the algorithm, such as the “Gravity” with which connected nodes attract one another. We will simply accept the default options, but again you may want to experiment with different values for your own projects.

When the network settles down, click Stop to stabilize it. The network should look something like this:

This looks a little neater than the initial view, but is still a hairball that tells us little about the underlying dynamics of voting in the Senate. This is because almost all Senators voted the same way at least once, so each one is connected to almost all of the others.

So now we need to filter the network, so that edges are not drawn if Senators voted the same way less often. Select the Filters tab in the main panel at right, and select Attributes>Range, which gives the option to filter on percent_agree. Double-click on percent_agree, to see the following under Queries:

The range can be altered using the sliders, but we will instead double-click on the value for the bottom of the range, and manually edit it to 0.67:

This will draw edges between Senators only if they voted the same way in at least two-thirds of the votes in 2014. Hit Filter, and watch many of the edges disappear. Switch to the Data Laboratory view, and see how the Edges table has also changed. Now return to the Overview, Run the layout algorithm again, and the graph should change to look something like this:

Now the network is organized into two clusters, which are linked through only a single Senator. These are presumably Democrats and Republicans, which we can confirm by coloring the nodes by party in the Partition tab at top left:

To customize the colors, click on each square in the Partition tab, then Shift-Ctrl and click to call up the color selector:

Make Democrats blue (Hex: 0000FF), Republicans red (Hex: FF0000) and the two Independents orange (Hex: FFAA00).

Let’s also reconfigure the network so that the Democrats are on the left and the Republicans on the right. Run the layout algorithm again, and with it running, click and drag one of the outermost Democrats, until the network looks something like this:

Now is a good time to save the project, if you have not done so already.

Next we will calculate some metrics for our new, filtered network. If we are interested in highlighting the Senators who are most bipartisan, then Betweenness centrality is a good measure — remember that it highlights “bridging” nodes that prevent the network from breaking apart into isolated clusters.

Select the Statistics tab in the main right-hand panel, and then Edge Overview>Avg. Path Length>Run. Click OK at the next dialog box, close the Graph Distance Report, and go to the Data Laboratory view. Notice that new columns, including Betweenness Centrality, have appeared in the Nodes table:

Switch back to the Overview, and select the Ranking tab on the top left panel. Choose Betweenness Centrality as the ranking parameter for Nodes, select the gem-like icon, which controls the size of nodes, and select a minimum and maximum size for the range.

Click Apply, and the network should look something like this:

You may at this point want to switch on the labels, and note that Susan Collins, the Republican from Maine, was the standout bipartisan Senator in 2014.

Now switch to Preview, which is where the appearance of the network graph can be polished before exporting it as a vector graphic. Click Refresh to see the network graphic drawn with default options:

You can then customize using the panel on the left, clicking Refresh to review each change. Here I have removed the nodes’ borders, by setting their width to zero, changed the edges from the default curved to straight, and reduced edge thickness to 0.5:

Export the finished network as vector graphics and for online visualization

Export the network graph in SVG or PDF format using the button at bottom left, or by selecting File>Export>SVG/PDF/PNG file... from the top menu.

Now select File>Export>Graph file... and save as JSON (this option is available through the JSONExporter plugin we installed earlier). Make sure to select Graph>Visible only at the dialog box, so that only the filtered network is exported:
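The exported file is a plain JSON description of the visible nodes and edges. Its overall structure looks roughly like this (the field names follow the standard Sigma.js JSON format; the values here are invented):

```json
{
  "nodes": [
    {"id": "n0", "label": "Susan Collins", "x": 12.3, "y": -4.5,
     "size": 10, "color": "#ff0000"},
    {"id": "n1", "label": "Angus King", "x": 8.1, "y": 2.2,
     "size": 6, "color": "#ffaa00"}
  ],
  "edges": [
    {"id": "e0", "source": "n0", "target": "n1"}
  ]
}
```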

Now try to repeat the exercise with the 2013 data!

Introducing Sigma.js

Network graphs can be visualized using several JavaScript libraries including D3 (see here for an example). However, we will use the Sigma.js JavaScript library, which is specifically designed for the purpose, and can more easily handle large and complex networks.

Make your own Sigma.js interactive network

I have provided a basic Sigma.js template, prepared with the generous help of Alexis Jacomy, the author of Sigma.js. This is in the senate folder.

Save the JSON file you exported from Gephi in the data subfolder with the name senate.json, then open the file index.html. The resulting interactive should look like this. Notice that when you hover over each node, its label appears, and its direct connections remain highlighted, while the rest of the network is grayed out:

Open index.html in your preferred text editor, and you will see this code:

<!DOCTYPE html>
<html>

<head>

  <meta charset=utf-8 />
  <title>U.S. Senate network</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
  <script src="src/sigma.min.js"></script>
  <script src="src/sigma.parsers.json.min.js"></script>

  <style>
    body {margin:0; padding:0;}
    #sigma-container {position:absolute; top:0; bottom:0; width:100%;}
  </style>

</head>

<body>
  <div id="sigma-container"></div>

  <script>
  function init() {

    // Finds the connections of each node
    sigma.classes.graph.addMethod("neighbors", function(nodeId) {
      var k,
          neighbors = {},
          index = this.allNeighborsIndex[nodeId] || {};

      for (k in index)
        neighbors[k] = this.nodesIndex[k];

      return neighbors;
    });

   // Creates an instance of Sigma.js
    var sigInst = new sigma({
      renderers: [
        {
          container: document.getElementById("sigma-container"),
          type: "canvas"
        }
      ]
    });

    // Customizes its settings 
    sigInst.settings({
      // Drawing properties :
      defaultLabelColor: "#000",
      defaultLabelSize: 14,
      defaultLabelHoverColor: "#fff",
      labelThreshold: 11,
      defaultHoverLabelBGColor: "#888",
      defaultLabelBGColor: "#ddd",
      defaultEdgeType: "straight",

      // Graph properties :
      minNodeSize: 3,
      maxNodeSize: 10,
      minEdgeSize: 0.1,
      maxEdgeSize: 0.2,

      // Mouse properties :
      zoomMax: 20 
    });

    // Parses JSON file to fill the graph
    sigma.parsers.json(
      "data/senate.json",
      sigInst,
      function() {
        //  A little hack: in the latest Sigma.js version, the edges'
        //  colors have to be deleted manually
        sigInst.graph.edges().forEach(function(e) {
          e.color = null;
        });

        // Also, to facilitate the update of node colors, store
        // their original color under the key originalColor:
        sigInst.graph.nodes().forEach(function(n) {
          n.originalColor = n.color;
        });

        sigInst.refresh();
      }
    );


     // When a node is hovered over, check each other node to see if it
     // is connected to it. If not, set its color to gray.
     // Do the same for the edges

    var grayColor = "#ccc";
    sigInst.bind("overNode", function(e) {
      var nodeId = e.data.node.id,
          toKeep = sigInst.graph.neighbors(nodeId);
      toKeep[nodeId] = e.data.node;

      sigInst.graph.nodes().forEach(function(n) {
        if (toKeep[n.id])
          n.color = n.originalColor;
        else
          n.color = grayColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        if (e.source === nodeId || e.target === nodeId)
          e.color = null;
        else
          e.color = grayColor;
      });

    // Since the data has been modified, call the refresh method to make the colors update 
      sigInst.refresh();
    });

    // When a node is no longer being hovered over, return to original colors
    sigInst.bind("outNode", function(e) {
      sigInst.graph.nodes().forEach(function(n) {
        n.color = n.originalColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        e.color = null;
      });

      sigInst.refresh();
    });
  }

  if (document.addEventListener)
    document.addEventListener("DOMContentLoaded", init, false);
  else
    window.onload = init;
  </script>

</body>

</html>

The code has been documented to explain what each part does. Notice that the head of the web page loads the main Sigma.js script, and a second script that reads, or “parses,” the JSON data. These are in the src subfolder.

To explore Sigma.js further, download or clone its Github repository and examine the code for the examples given.

Further reading/resources

Gephi tutorials

Sigma.js wiki

Assignment 02: Scatterplot in D3

As part of this course you will be given weekly challenges to help you become more familiar with D3. In this assignment you will take data on global infectious disease ratings and democratization scores by country and plot a scatterplot.

The dataset is called disease_democ.csv and can be found in this zip file.

You will model your scatterplot on this D3 example.

Don’t forget to run the python server to load the csv file into the html document:

python -m SimpleHTTPServer 8888 &

What is data?

Before we leap into creating visualizations, charts and maps, we’ll consider the nature of data, and some basic principles that will help you to investigate datasets to find and tell stories. This is not a course in statistics, but I will introduce a few fundamental statistical concepts, which hopefully will stand you in good stead as we work to visualize data over the next few weeks — and beyond.

We’re often told that there are “lies, damned lies, and statistics.” But data visualization and statistics provide a view of the world that we can’t otherwise obtain. They give us a framework to make sense of daunting and otherwise meaningless masses of information. The “lies” that data and graphics can tell arise when people misuse statistics and visualization methods, not when they are used correctly.

The best data journalists understand that statistics and graphics go hand-in-hand. Just as numbers can be made to lie, graphics may misinform if the designer is ignorant of or abuses basic statistical principles. You don’t have to be an expert statistician to make effective charts and maps, but understanding some basic principles will help you to tell a convincing and compelling story — enlightening rather than misleading your audience.

I hope you will get hooked on the power of a statistical way of thinking. As data artist Martin Wattenberg of Google has said: “Visualization is a gateway drug to statistics.”

The data we will use today

Download the data for this session from here. Unzip the folder and place it on your desktop. It contains the following files:

mlb_salaries_2014.csv –  Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.

disease_democ.csv – Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.

gdp_pc.csv – World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.

All of these files are in CSV format, which stands for comma-separated values. These are plain text files, in which fields in the data are separated by commas, and each record is on a separate row. CSV is a common format for storing and exchanging data, and can be read by most data analysis and visualization software. Values that are intended to be treated as text, rather than numbers, are often enclosed in quote marks.

When you ask for data, requesting CSVs or other plain text files is a good idea, as just about all software that handles data can export data as text. The characters used to separate the variables, called ‘delimiters,’ may vary: a ‘.tsv’ extension, for instance, indicates that the variables are separated by tabs. More generally, text files have the extension ‘.txt’.
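As an illustration, here is how Python's built-in csv module reads such a file (the sample rows are invented):

```python
import csv
import io

# A small CSV sample: fields are separated by commas, each record is
# on its own row, and quoted values are treated as text.
raw = '''country,year,gdp_pc
"United States",2014,54629
"Germany",2014,47822
'''

# csv.reader splits each row on the delimiter (a comma by default;
# pass delimiter="\t" to read a .tsv file instead)
rows = list(csv.reader(io.StringIO(raw)))

print(rows[0])  # the header row: ['country', 'year', 'gdp_pc']
print(rows[1])  # the first record: ['United States', '2014', '54629']
```

Note that every value arrives as a string; numbers must be converted explicitly before you can do arithmetic with them.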

Types of data: categorical vs. continuous

Before analyzing a dataset, or attempting to draw a graphic, it’s important to consider what, exactly, you’re working with.

Statisticians often use the term “variable.” This simply means any measure or attribute describing a particular item, or “record,” in a dataset. For example, school students might gather data about themselves for a class project, recording their gender and eye color, and height and weight. There’s an important difference between gender and eye color, called “categorical” variables, and height and weight, termed “continuous.”

  • Categorical variables are descriptive labels given to individual records, assigning them to different groups. The simplest categorical data is dichotomous, meaning that there are just two possible groups — in an election, for instance, people either voted, or they did not. More commonly, there are multiple categories. When analyzing traffic accidents, for example, you might consider the day of the week on which each incident occurred, giving seven possible categories.
  • Continuous data is richer, consisting of numbers that can have a range of values on a sliding scale. When working with weather data, for instance, continuous variables might include temperature and amount of rainfall.

There’s a third type of data we often need to consider: date and time. Perhaps the most common task in data journalism is to consider how a variable or variables have changed over time.

Datasets will usually contain a mixture of categorical and continuous variables. Here, for example, is a small part of a spreadsheet containing data on salaries for Major League Baseball players at the opening of the 2014 season:


(Source: Peter Aldhous, data from the Lahman Baseball Database)

This is a typical data table layout, with the individual records — the players — forming the rows and the variables recorded for each player arranged in columns. Here it is easy to recognize the categorical variables of teamID and teamName because they are each entered as text. The numbers for salary, expressed in full or in millions of dollars (salary_mil), are continuous variables.

Don’t assume, however, that every number in a dataset represents a continuous variable. Text descriptions can make datasets unwieldy, so database managers often adopt simpler codes, frequently numbers, to store categorical data. You can see this in the following example, showing data on traffic accidents resulting in injury or death in Berkeley, downloaded from a database maintained by researchers on campus.

(Source: Peter Aldhous, from Transportation Injury Mapping System data)

Of the numbers seen here, only the YEAR, latitudes and longitudes (POINT_Y and POINT_X) and numbers of people KILLED or INJURED actually represent continuous variables. (Look carefully, and you will see that these numbers are justified right within each cell. The other numbers are justified left, like the text entries, because they were imported into the spreadsheet as text values.)

Like this example, many datasets are difficult to interpret without their supporting documentation. So each time you acquire a dataset, make sure you also obtain the “codebook” describing all of the variables/fields, and how they are coded. Here is [the codebook](http://paldhous.github.io/ucb/2016/dataviz/data/SWITRS_codebook.pdf) for the traffic accident data.

What shape is your data?

Particularly when data shows a time series for a single variable, it is often provided like this data on trends in international oil production by region, in “wide” format:

(Source: Peter Aldhous, from U.S. Energy Information Administration data)

Here, all of the numbers represent the same variable, and there is a column for each year. This is good for people to read, but most software for data analysis and visualization does not play well with data in this format.

So if you receive “wide” data, you will usually need to convert it to “long” format, shown here:


(Source: Peter Aldhous, from U.S. Energy Information Administration)

Notice that now there is one column for each variable, which makes it easier for computers to understand.

How to investigate data: the basic operations

There are many sophisticated statistical methods for crunching data, beyond the scope of this class. But the majority of a data journalist’s work involves the following simple operations:

Sort: Largest to smallest, oldest to newest, alphabetical etc.

Filter: Select a defined subset of the data.

Summarize/Aggregate: Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll group data into categories first, and then aggregate by group.

Join: Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.
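These four operations can be sketched on a toy dataset in Python (the player records and team names below are invented for illustration):

```python
# A toy sketch of the four basic operations on a small dataset.
players = [
    {"name": "Smith", "team": "SF", "salary": 2.5},
    {"name": "Jones", "team": "LA", "salary": 0.5},
    {"name": "Lee",   "team": "SF", "salary": 6.0},
    {"name": "Diaz",  "team": "LA", "salary": 1.5},
]

# Sort: largest salary to smallest
by_salary = sorted(players, key=lambda p: p["salary"], reverse=True)

# Filter: select a defined subset -- only the SF players
sf_only = [p for p in players if p["team"] == "SF"]

# Summarize/aggregate: group by team, then sum salaries within each group
payroll = {}
for p in players:
    payroll[p["team"]] = payroll.get(p["team"], 0) + p["salary"]

# Join: merge in entries from a second dataset on a common field
cities = {"SF": "San Francisco", "LA": "Los Angeles"}
joined = [{**p, "city": cities[p["team"]]} for p in players]

print(by_salary[0]["name"])  # Lee
print(payroll)               # {'SF': 8.5, 'LA': 2.0}
```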

We’ll return to these basic operations with data repeatedly over the coming weeks as we manipulate and visualize data.

Working with categorical data

You might imagine that there is little that you can do with categorical data alone, but it can be powerful, and can also be used to create new continuous variables.

The most basic operation with categorical data is to aggregate it by counting the number of records that fall into each category. This gives a table of “frequencies.” Often these are divided by the total number of records, and then multiplied by 100 to show them as percentages of the total.
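In Python, this kind of frequency table takes only a few lines with collections.Counter (the records below are invented, not the Census data):

```python
from collections import Counter

# An invented list of one categorical variable -- one record per person
identities = ["White", "Hispanic", "Asian", "White", "Black",
              "Hispanic", "White", "Asian", "White", "Black"]

# Aggregate by counting the records that fall into each category
freq = Counter(identities)

# Divide by the total and multiply by 100 to express as percentages
total = len(identities)
percent = {group: 100 * n / total for group, n in freq.items()}

print(freq["White"])     # 4
print(percent["White"])  # 40.0
```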

Here is an example, showing data on the racial and ethnic identities of residents of Alameda County, from the 2010 US Census:


(Source: American FactFinder, U.S. Census Bureau)

Creating frequency counts from categorical data creates a new continuous variable — what has changed is the level of analysis. In this example, the original data would consist of a huge table with a record for each person, noting their racial/ethnic identity as categorical variables; in creating the frequency table shown here, the level of analysis has shifted from the individual to the racial/ethnic group.

We can ask more interesting questions by considering two categorical variables together — as pioneering data journalist Philip Meyer showed when he collected and analyzed survey data to examine the causes of the 1967 Detroit Riot. In July of that year, one of the worst riots in U.S. history raged in the city for five days, following a police raid on an unlicensed after-hours bar. By the time calm was restored, 43 people were dead, 467 injured and more than 2,000 buildings were destroyed.

At the time, Detroit was regarded as being a leader in race relations, so local racial discrimination was not initially seen as one of the main underlying causes of what happened. One popular theory at the time was that the riots were led by black residents who had moved to Detroit from the rural South. Meyer demolished this idea by examining data on whether or not the people surveyed had rioted, and whether they were brought up in the South or the North. He combined these results into a “contingency table” or “cross-tab”:

              South   North   Total
Rioters          19      51      70
Non-rioters     218     149     367
Total           237     200     437

It certainly looks from these numbers as if Northerners were more likely to have participated in the riot. There’s a message here: sometimes a table of numbers is a perfectly acceptable way to communicate a simple story — we don’t always need fancy charts.

But Meyer’s team only interviewed a sample of people from the affected neighborhoods, not everyone who lived there. If they had taken another sample, might they have obtained different results? This is one example where some more sophisticated statistical analysis can help. For contingency tables, a method known as the chi-squared test asks the relevant question: if Southerners and Northerners were in fact equally likely to have rioted, what is the likelihood of obtaining a sample as biased as this by chance alone? In this case, the chi-squared test told Meyer that the probability was less than one in a thousand. So Meyer felt confident writing in the newspaper that Northerners were more likely to have rioted. His work won a Pulitzer Prize for the Detroit Free Press and shifted the focus of political debate about the riot to racial discrimination in policing and housing in Detroit.
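To see where a figure like Meyer's comes from, here is a sketch of the chi-squared statistic computed from his contingency table from first principles. (Turning the statistic into a probability requires a chi-squared distribution table or statistical software; the critical value for p = 0.001 with one degree of freedom is about 10.83.)

```python
# Chi-squared statistic for Meyer's 2x2 contingency table,
# computed from first principles.
observed = [[19, 51],    # rioters:     South, North
            [218, 149]]  # non-rioters: South, North

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if region and rioting were independent:
# (row total * column total) / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 1))  # -> 24.6, far above the p = 0.001 threshold
```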

Sampling and margins of error

Philip Meyer’s analysis of the Detroit riot raises a general issue: only sometimes is it possible to obtain and analyze all of the data.

There are only 30 teams in Major League Baseball, which at the start of the 2014 season had just under 750 players on their rosters. So compiling all of the data on their contracts and salaries is a manageable task.

But Meyer’s team couldn’t talk to all of the people in the riot-affected neighborhoods, and pollsters can’t ask every voter which candidate they intend to vote for in an upcoming election. Instead they take a sample. This is common in many forms of data analysis, not just opinion polling.

For a sample to be valid, it must obey a simple statistical rule: every member of the group to which you wish to generalize the results of your analysis must have an equal chance of being included.

Entire textbooks have been written on sampling methods. The simplest form is random sampling — such as when numbers are written on pieces of paper, put into a bag, shaken up, and then drawn out one by one. Opinion pollsters often generate their samples by randomly generating valid telephone numbers, and calling the households concerned.

But there are other methods, and the important thing is not that a sample was derived randomly, but that it is representative of the group from which it is drawn. In other words, sampling needs to avoid systematic bias that makes particular data points more or less likely to be included.

Be especially wary of using data from any sample that was not selected to be representative of a wider group. Media organizations frequently run informal online “polls” to engage their audience, but they tell us little about public opinion, as people who happened to visit a news website and cared enough to answer the questions posed may not be representative of the wider population.

To have a good chance of being representative, samples must also be sufficiently large. If you randomly sample ten people, for instance, chance effects mean that you may draw a sample that contains eight women and two men, or perhaps no men at all. Sample 1,000 people from the same population, however, and the proportions of men and women sampled won’t deviate so far from an even split.

This is why polls often give a “margin of error,” which is a measure of the uncertainty that arises from taking a relatively small sample. These margins of error are usually derived from a range of values that statisticians call the “95% confidence interval.” This means that if the same population were sampled repeatedly, the results would fall within this range of values 95 times out of 100.
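For a simple proportion, this margin of error can be approximated with a standard formula, sketched here in Python:

```python
import math

# Approximate 95% margin of error for a proportion p measured in a
# sample of size n: 1.96 * sqrt(p * (1 - p) / n).
def margin_of_error(p, n):
    return 1.96 * math.sqrt(p * (1 - p) / n)

# A 50/50 split is the worst case: sample 1,000 people and the
# margin of error is about +/- 3 percentage points
print(round(100 * margin_of_error(0.5, 1000), 1))  # -> 3.1

# Sample only 100 people and the uncertainty roughly triples
print(round(100 * margin_of_error(0.5, 100), 1))   # -> 9.8
```

Notice that the uncertainty shrinks with the square root of the sample size, which is why very large samples bring diminishing returns.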

When dealing with polling and survey data, look for the margins of error. Be careful not to mislead your audience by making a big deal of differences that may just be due to sampling error.

Working with continuous data: consider the distribution

When handling continuous data, there are more possibilities for aggregation than simply counting: you can add the numbers to give a total, for example, or calculate an average.

But summarizing continuous data in a single value inevitably loses a lot of information held in variation within the data. Understanding this variation may be key to working out the story the data may tell, and deciding how to analyze and visualize it. So often the first thing a good data journalist does when examining a dataset is to chart the distribution of each continuous variable. You can think of this as the “shape” of the dataset, for each variable.

Many variables, such as human height and weight, follow a “normal” distribution. If you draw a graph plotting the range of values in the data along the horizontal axis (also known as the X axis), and the number of individual data points for each value on the vertical or Y axis, a normal distribution gives a bell-shaped curve:

(Source: edited from Wikimedia Commons)

This type of chart, showing the distribution as a smoothed line, is known as a “density plot.”

In this example, the X axis is labeled with multiples of a summary statistic called the “standard deviation.” This is a measure of the spread of the data: if you extend one standard deviation either side of the average, it will cover just over 68% of the data points; two standard deviations will cover just over 95%. In simple terms, the standard deviation is a single number that summarizes whether the curve is tall and thin, or short and fat.
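We can check the 68% rule on simulated data with Python's built-in statistics module (the values here are randomly generated, not real measurements):

```python
import random
import statistics

# Simulate 10,000 draws from a normal distribution
# (mean 100, standard deviation 15 -- arbitrary illustrative values)
random.seed(1)
data = [random.gauss(100, 15) for _ in range(10000)]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Fraction of data points within one standard deviation of the mean
within_one_sd = sum(1 for x in data if abs(x - mean) <= sd) / len(data)

print(round(within_one_sd, 2))  # close to 0.68
```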

Normal distributions are so common that many statistical methods have been invented specifically to work with them. It is also possible to run tests to tell whether data deviates significantly from a normal distribution, to check whether it’s valid to use these methods.

Sometimes, however, it’s very clear just from looking at the shape of a dataset that it is not normally distributed. Here, for example, is the distribution of 2014 Major League Baseball salaries, drawn as columns in increments of $500,000. This type of chart is called a histogram:

(Source: Peter Aldhous, data from the Lahman Baseball Database)

This distribution is highly “skewed.” Almost half of the players were paid less than $1 million, while there are just a handful of players who were paid more than $20 million; the highest-paid was pitcher Zack Greinke, paid $26 million by the Los Angeles Dodgers. Knowing this distribution may influence the story you would choose to tell from the data, the summary statistics you would choose to aggregate it, and the methods you might use to visualize it.

In class, we will plot the distribution of the 2014 baseball salary data using D3.

First we’ll need to shape our data in Google Drive. Create a new spreadsheet, select Import/Upload File and navigate to the file mlb_salaries_2014.csv. The app should recognize that this is a CSV file, but if the preview of the data looks wrong, use the import options to correct things. Once the data has imported, examine the variables, then create a new row and label each column as categorical or continuous.

First we need to tell the app what goes on the X and Y axes, respectively. Right-click anywhere in the main panel and select Map x(required)>salary_mil. We are not going to plot another variable from the data on the Y axis; we just want a count of the players in each salary bin. So select Map y(required)>..count.. and click the Draw Plot button at bottom right.

You should see a blank grid, because we haven’t yet told the app what type of chart to draw. Right-click in the chart area, and select Add Layer>Univariate Geoms>histogram (univariate because we only have one variable, aggregated by a count). Click Draw plot and a chart should draw.

You will notice that the bins are wider than in the example above. Right-click on histogram in the Layers Panel at left, select binwidth>set, type 0.5 into the box and set value. Now hit Draw plot again and you should have something close to the chart above.

To save your plot click on Export PDF from the options at top left and click on the hyperlink at the next page.

Beyond the “average”: mean, median, and mode

Most people know how to calculate an average: add everything up, and divide this sum by the total number of values. Statisticians call this summary the “mean,” and for normally distributed data, it sits right on the top of the bell curve.

The mean is just one example of what statisticians call a “measure of central tendency.” The most common alternative is the “median,” which is the number that sits in the middle, when all the values are arranged in order. (If you have an even number of values, and no single number occupies the middle position, it would be the average of the two middle values.)

Notice how leading media outlets, such as The Upshot at The New York Times, often use medians, rather than means, in graphics summarizing skewed distributions, such as incomes or house prices. Here is an example from April 2014:

(Source: The Upshot, The New York Times)

Statisticians also sometimes consider the “mode,” which is the value that appears most frequently in the dataset.

For a perfect normal distribution, the mean, median and mode are all the same number. But for a skewed dataset like the baseball salaries, they may be very different — and using the mean can paint a rather misleading picture.
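A toy example in Python makes the difference concrete (the salary figures below are invented, but skewed in the same way as the MLB data):

```python
import statistics

# Invented salaries (in $ millions): many players near a league
# minimum, one star at the top -- a skewed distribution.
salaries = [0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.5, 2.0, 5.0, 26.0]

print(statistics.mean(salaries))    # 3.8  -- inflated by the one star
print(statistics.median(salaries))  # 0.75 -- a more "typical" value
print(statistics.mode(salaries))    # 0.5  -- the most frequent value
```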

Calculate mean, median and mode

Navigate in your browser to your Google Drive account, then click the NEW button at top left and select Google Sheets. Once the spreadsheet opens select File>Import… from the top menu in Google Sheets and select the Upload tab in the dialog box that appears:

Click ‘Select a file from your computer’, navigate to the file ‘mlb_salaries_2014.csv’ and click ‘Open’.

At the next dialog box click Import and the file should upload.

When the data has uploaded, drag the darker gray line at the bottom of the light gray cell at top left below row 1, so that the first row becomes a header.

Before:

After:

Select column H by clicking its gray header containing the letter, then from the top menu select Insert>Column right three times to insert three new columns into the spreadsheet, calling them mean, median, and mode.

In the first cell of the mean column enter the following formula, which calculates the mean (called average in a spreadsheet) of all of the values in column H, containing the salaries in $ millions for each player.

=average(H2:H747)

Or alternatively, to select all the values in column H without having to define their row numbers:

=average(H:H)

Now calculate the median salary:

=median(H:H)

And the mode:

=mode(H:H)

These spreadsheet formulas are, in programming terms, functions. They act on the data specified in the brackets. This will become a familiar concept as we work with code in subsequent weeks.

Across Major League Baseball at the start of the 2014 season, the mean salary was $3.99 million. But when summarizing a distribution in a single value, we usually want to give a “typical” number. Here the mean is inflated by the vast salaries paid to a handful of star players, and may be a bad choice. The median salary of $1.5 million gives a more realistic view of what a typical MLB player was paid.

The mode is less commonly used, but in this case also tells us something interesting: it was $500,000, a sum earned by 35 out of the 746 players. This was the minimum salary paid under 2014 MLB contracts, which explains why it turns up more frequently than any other number. A journalist who considered the median, mode and full range of the salary distribution may produce a richer story than one who failed to think beyond the “average.”
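These summaries are just as easy to compute in code as in a spreadsheet. Here is a minimal JavaScript sketch of all three measures, run on a small made-up salary list (in $ millions) rather than the real MLB data:

```javascript
// Minimal sketches of three measures of central tendency.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even count: average the two middle values; odd count: take the middle one.
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function mode(values) {
  const counts = new Map();
  let best = values[0];
  for (const v of values) {
    counts.set(v, (counts.get(v) || 0) + 1);
    if (counts.get(v) > counts.get(best)) best = v;
  }
  return best;
}

// A skewed toy distribution: mostly modest salaries, one star player.
const salaries = [0.5, 0.5, 0.5, 1.0, 1.5, 2.0, 25.0];
console.log(mean(salaries).toFixed(2)); // inflated by the outlier
console.log(median(salaries));          // a more "typical" value
console.log(mode(salaries));            // the most frequent value
```

As with the real data, the outlier drags the mean well above the median, while the mode picks out the most common value.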

Choosing bins for your data

Often we don’t want to summarize a variable in a single number. But that doesn’t mean we have to show the entire distribution. Frequently data journalists divide the data into groups or “bins,” to reveal how those groups differ from one another. A good example is this interactive graphic on the unemployment rate for different groups of Americans, published by The New York Times in November 2009:

(Source: The New York Times)

In its base state, the graphic shows the overall jobless rate, and how this has changed over time. The buttons along the top allow you to filter the data to examine the rate for different groups. Most of the filtering is on categorical variables, but notice that the continuous variable of age is collapsed into a categorical variable dividing people into three groups: 15-24 years old, 25-44 years old, and 45 years or older.

To produce informative graphics that tell a clear story, data journalists often need to turn a continuous variable into a categorical variable by dividing it into bins. But how do you select the range of values for each bin?

There is no simple answer to this question, as it really depends on the story you are telling. In the jobless rate example, the bins divided the population into groups of young, mid-career and older workers, revealing how young workers in particular were bearing the brunt of the Great Recession.

When binning data, it is again a good idea to look at the distribution, and experiment with different possibilities. For example, the wealth of nations, measured in terms of gross domestic product (GDP) per capita in 2014, is highly skewed, rather like the baseball salaries. Here is the distribution, drawn in increments of $2,500:

(Source: Peter Aldhous, from World Bank data)

Straight away we can see that just a tiny handful of countries had a GDP per capita of more than $50,000, but there is a larger group with values above $40,000.

The maps below reveal how setting different ranges for the bins changes the story told by the data. For the first map, I set the lower value for the top bin at $40,000, and then gave the bins equal ranges:

(Source: Peter Aldhous, from World Bank data)

This might be useful for telling a story about how high per capita wealth is still concentrated into a small number of nations, but it does a fairly poor job of distinguishing between the per capita wealth of developing countries. And for poorer people, small differences in wealth make a big difference to living conditions.

So for the second map I set the boundaries so that roughly equal numbers of countries fell into each of the five bins. Now Japan, most of Western Europe and Russia join the wealthiest bin, middle-income countries like Brazil, China, and Mexico are grouped in another bin, and there are more fine-grained distinctions between the per capita wealth of different developing countries:

(Source: Peter Aldhous, from World Bank data)

Some visualization and mapping software gives you the option of putting equal numbers of records into each bin — usually called “quantiles” (the quartiles we encountered on the box plots are one example). Note that calculated quantiles won’t usually give you nice round numbers for the boundaries between bins. So you may want to adjust the values, as I did for the second map.

You may also want to examine histograms for obvious “valleys” in the data, which may be good places for the breaks between bins.
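Once you have settled on break points, assigning each value to a bin is a one-line loop. The break points in this JavaScript sketch are purely illustrative, not the ones used for the maps above:

```javascript
// Assign a value to a bin, given ascending break points.
// Four breaks divide the data into five bins.
function binIndex(value, breaks) {
  let i = 0;
  for (const b of breaks) {
    if (value >= b) i += 1; // value clears this break: move up a bin
  }
  return i; // 0 = lowest bin, breaks.length = highest bin
}

const breaks = [2500, 7500, 15000, 40000]; // illustrative GDP per capita breaks
console.log(binIndex(1200, breaks));  // 0: poorest bin
console.log(binIndex(9000, breaks));  // 2: middle bin
console.log(binIndex(52000, breaks)); // 4: wealthiest bin
```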

Calculate quantiles

You can also calculate the boundaries between quantiles for yourself in a spreadsheet. Go back to the Google Spreadsheet with the baseball salary data, and add two more columns: quantile and quantile value.

Next we will calculate the boundaries for bins dividing the data into five quantiles, with one-fifth (0.2 in decimal) of the values in each bin.

First enter the following values into the quantile column, to reflect the division into five quantiles:

=4/5
=3/5
=2/5
=1/5

Then enter this formula into the first cell of the quantile value column:

=percentile(H:H, L2)

Copy the formula down the top four rows, and the spreadsheet should look as follows:
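If you are working in code rather than a spreadsheet, you can compute the same boundaries yourself. This JavaScript sketch mirrors the spreadsheet PERCENTILE function (linear interpolation between sorted values), applied here to toy data rather than the salaries:

```javascript
// A sketch of the spreadsheet PERCENTILE calculation:
// linear interpolation between sorted values.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = p * (sorted.length - 1); // fractional position in the sorted list
  const lo = Math.floor(rank);
  const frac = rank - lo;
  if (lo + 1 >= sorted.length) return sorted[sorted.length - 1];
  return sorted[lo] + frac * (sorted[lo + 1] - sorted[lo]);
}

// Boundaries dividing toy data into five quantile bins.
const data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
for (const p of [0.2, 0.4, 0.6, 0.8]) {
  console.log(p, percentile(data, p));
}
```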

Rounding: avoid spurious precision

Often when you run calculations on numbers, you’ll obtain precise answers that can run to many decimal places. But think about the precision with which the original numbers were measured, and don’t quote numbers that are more precise than this. When rounding numbers to the appropriate level of precision, if the next digit is four or less, round down; if it’s six or more, round up. There are various schemes for rounding if the next digit is five, and there are no further digits to go on: I’d suggest rounding to an even number, which may be up or down, as this is the international standard in computing.

To round the mean value for the baseball salary data to two decimal places, edit the formula to the following:

=round(average(H:H),2)

This formula runs the round function on the result of the average function.
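The same nesting works in code. Note that JavaScript’s Math.round rounds halves up rather than to the nearest even number, so this is a convenience sketch, not the strict round-to-even scheme suggested above:

```javascript
// Round a value to a given number of decimal places by scaling,
// rounding to an integer, then scaling back down.
function roundTo(value, places) {
  const factor = 10 ** places;
  return Math.round(value * factor) / factor;
}

// Nesting functions, as in the spreadsheet: round the mean to 2 d.p.
const salaries = [0.5, 1.5, 2.25, 3.333]; // illustrative values
const meanSalary = salaries.reduce((s, v) => s + v, 0) / salaries.length;
console.log(roundTo(meanSalary, 2));
```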

Per what? Working with rates and percentages

Often it doesn’t make much sense to consider raw numbers. There are more murders in Oakland (population from 2010 U.S. Census: 390,724) than in Orinda (2010 population: 17,643). But that’s a fairly meaningless comparison, unless we level the playing field by correcting for the size of the two cities. As in the wealth of nations example above, much of the time data journalists need to work with rates: per capita, per thousand people, and so on.

In simple terms, a rate is one number divided by another number. The key word is “per.” Per capita means “per person,” so to calculate a per capita figure you must divide the total value by the population size. But remember that most people find very small numbers hard to grasp: 0.001 and 0.0001 look similarly small at a glance, even though the first is ten times as large as the second. So when calculating rates, per capita is often not a good choice. For rare events like murders, or deaths from a particular disease, you may need to consider the rate per 1000 people, per 10,000 people, or even per 100,000 people: simply divide the numbers as before, then multiply by the “per” figure.
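The divide-then-multiply recipe looks like this in JavaScript; the event count here is made up for illustration:

```javascript
// Rate per some "per" figure: divide one number by another,
// then multiply by the per figure.
function ratePer(events, population, per) {
  return (events / population) * per;
}

// e.g. a hypothetical 100 murders in a city of 390,724 people,
// expressed per 100,000 people:
console.log(ratePer(100, 390724, 100000).toFixed(1));
```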

In addition to leveling the playing field to allow meaningful comparisons, rates can also help bring large numbers, which are again hard for most people to grasp, into perspective: it means little to most people to be told that the annual GDP of the United States is almost $17 trillion, but knowing that GDP per person is just over $50,000 is easier to comprehend.

Percentages are just a special case of rates, meaning “per hundred.” So to calculate a percentage, you divide one number by another and then multiply by 100.

Doing simple math with rates and percentages

Often you will need to calculate percentage change. The formula for this is:

(new value - old value) / old value * 100
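As a JavaScript function, the formula above becomes:

```javascript
// Percentage change from an old value to a new value.
function percentChange(oldValue, newValue) {
  return (newValue - oldValue) / oldValue * 100;
}

console.log(percentChange(200, 250)); // 25
console.log(percentChange(250, 200)); // -20
```

Note that a rise and fall between the same two values give different percentages, because the denominator is always the old value.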

Sometimes you may need to compare two rates or percentages. For example, if 50 out of 150 black mortgage applicants in a given income bracket are denied a mortgage, and 300 out of 2,400 white applicants in the same income bracket are denied a mortgage, the percentage rates of denial for the two groups are:

Black:

50 / 150 * 100 = 33.3%

White:

300 / 2,400 * 100 = 12.5%

You can divide one percentage or rate by the other, but be careful how you describe the result:

33.3 / 12.5 = 2.664

You can say from this calculation that black applicants are about 2.7 times as likely to be denied loans as whites. But even though the Associated Press style guide doesn’t make the distinction, don’t say black applicants are about 2.7 times more likely to be denied loans. Strictly speaking, more likely refers to the following calculation:

(33.3 - 12.5) / 12.5 = 1.664
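The distinction can be made explicit in code. This sketch uses the mortgage figures above; because it works with the unrounded rates, the ratio comes out as 2.67 rather than the 2.664 obtained from the rounded percentages:

```javascript
// "Times as likely": a simple ratio of the two rates.
function timesAsLikely(rateA, rateB) {
  return rateA / rateB;
}

// "Times more likely": the difference between the rates,
// relative to the comparison rate.
function timesMoreLikely(rateA, rateB) {
  return (rateA - rateB) / rateB;
}

const black = 50 / 150 * 100;   // 33.33...% denial rate
const white = 300 / 2400 * 100; // 12.5% denial rate
console.log(timesAsLikely(black, white).toFixed(2));   // "2.67"
console.log(timesMoreLikely(black, white).toFixed(2)); // "1.67"
```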

Asking questions with data

As data journalists, we want to ask questions of data. When statisticians do this, they assign probabilities to the answers to specific questions. They might ask whether variables are related to one another: for instance, do wealthier people tend to live longer? Or they might ask whether different groups are different from one another: for example, do patients given an experimental drug get better more quickly than those given the standard treatment?

When asking these questions, the most common statistical approach may seem back to front. Rather than asking whether the answer they’re interested in is likely to be true, statisticians usually instead calculate probabilities that the observed results would be obtained if the “null hypothesis” is correct.

In Philip Meyer’s analysis of the Detroit riot, the null hypothesis was that Northerners and Southerners were equally likely to have rioted. In the examples given above, the null hypotheses are that there is no relationship between wealth and lifespan, and that the new drug is just as effective as the old treatment.

The resulting probabilities are often given as p values, which are shown as decimal numbers between 0 and 1. Philip Meyer’s chi-squared result would have been written as: p < 0.001

The decimal 0.001 is the same as the fraction 1/1000, and < is the mathematical symbol for “less than.” So this means that there was a less than one in a thousand chance that the difference in riot participation between Northerners and Southerners was due to a chance sampling effect.

This would be called a “significant” result. When statisticians use this word, they don’t necessarily mean that the result has real-world consequence. It just means that the result is unlikely to be due to chance. However, if you have framed your question carefully, like Meyer did, a statistically significant result may be very consequential indeed.

There is no fixed cut-off for judging a result to be statistically significant, but as a general rule p < 0.05 is considered the minimum standard: you would expect to get the result by chance fewer than 5 times in 100. If Meyer had obtained a result only just meeting this standard, he might still have concluded that Northerners were more likely to riot, but would probably have been more cautious in how he worded his story.

When considering differences between groups, statisticians sometimes avoid p values, and instead give 95% confidence intervals, like the margins of error on opinion polls. Only if these don’t overlap would a statistician assume that the results for different groups are significantly different.

So when picking numbers from studies to use in your graphics, pay attention to p values and confidence intervals!

Relationships between variables: correlation and its pitfalls

Some of the most powerful stories that data can tell examine how one variable relates to another. This video from a BBC documentary made by Hans Rosling of the Gapminder Foundation, for example, explores the relationship between life expectancy in different countries and the nations’ wealth:


(Source: BBC/Gapminder)

Correlation refers to statistical methods that test the strength of the relationship between two variables recorded for each of the records in a dataset. Correlations can either be positive, which means that two variables tend to increase together; or negative, which means that as one variable increases in value, the other one tends to decrease.

Tests of correlation determine whether the recorded relationship between the two variables is likely to have arisen by chance — here the null hypothesis is that there is actually no relationship between the two.

Statisticians usually test for correlation because they suspect that variation in one variable causes variation in the other, but correlation cannot prove causation. For example, there is a statistically significant correlation between children’s shoe sizes and their reading test scores, but clearly having bigger feet doesn’t make a child a better reader. In reality, older children are likely both to have bigger feet and be better at reading — the causation lies elsewhere.

Here, the child’s age is a “lurking” variable. Lurking variables are a general problem in data analysis, not just in tests of correlation, and some can be hard even for experts to spot.

For example, by the early 1990s epidemiological studies suggested that women who took Hormone Replacement Therapy (HRT) after menopause were less likely to suffer from coronary heart disease. But some years later, when doctors ran clinical trials in which they gave women HRT to test this protective effect, it actually caused a statistically significant increase in heart disease. Going back to the original studies, researchers found that women who had HRT tended to be from higher socioeconomic groups, who had better diets and exercised more.

Data journalists should be very wary of falling into similar traps. While you may not be able to gather all of the necessary data and run statistical tests, take special care to think about possible lurking variables when drawing any chart that illustrates a correlation, or implies a relationship between two variables.

Scatter plots and trend lines

When testing the relationship between two variables, statisticians will usually draw a simple chart called a “scatter plot,” in which the records in a dataset are plotted as points according to their scores for each of the two variables.

Here is an example, illustrating a controversial theory claiming that the extent to which a country has developed a democratic political system is driven largely by the historical prevalence of infectious disease:

(Source: Peter Aldhous, data from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries)

As we have learned, correlation cannot prove causation. But correlations are usually run to explore relationships that are suspected to be causal. The convention when drawing scatter plots is to put the variable suspected to be the causal factor, called the “explanatory” variable, on the X axis, and the “response” variable on the Y.

When producing any chart based on the scatter plot format, it’s a good idea to follow this convention, because otherwise you are likely to confuse people who are used to viewing such graphs.

The example above also shows a straight line drawn through the points. This is known as a “trend line,” or the “line of best fit” for the data, and was calculated by a method called “linear regression.” It is a simple example of what statisticians call “fitting a model” to data.

Models are mathematical equations that allow statisticians to make predictions. The equation for this trend line is:

Y = -1.85*X + 104.45

Here X is the infectious disease prevalence score, Y is the democratization score, and 104.45 is the value at which the trend line would cross the vertical axis at X = 0. The slope of the line is -1.85, which means that when X increases by a single point, Y tends to decrease by 1.85 points. (For a trend line sloping upwards from left to right, the slope would be a positive number.)

The data used for this graph doesn’t include all of the world’s countries. But if you knew the infectious disease prevalence score for a missing nation, you could use the equation or the graph to predict its likely democratization score. To see how this works for a score of 30, multiply 30 by -1.85, then add 104.45. The answer is roughly 49, and you will get the same result if you draw a vertical line up from an infectious disease prevalence score of 30 on the horizontal axis, then a horizontal line from where it crosses the trend line across to the vertical axis.
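The trend-line equation is simple enough to turn directly into a prediction function, with the slope and intercept taken from the chart above:

```javascript
// Predict a democratization score from an infectious disease
// prevalence score, using the fitted line Y = -1.85*X + 104.45.
function predictDemocScore(infectRate) {
  return -1.85 * infectRate + 104.45;
}

console.log(predictDemocScore(30)); // roughly 49
```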

The most frequently used statistical test for correlation determines how closely the points cluster around the linear trend line, and determines the statistical significance of this relationship, given the size of the sample.

In this example there is a significant negative correlation, but that doesn’t prove that low rates of infectious disease made some countries more democratic. Not only are there possible lurking variables, but cause-and-effect could also work the other way round: more democratic societies might place greater value on their citizens’ lives, and make more effort to prevent and treat infectious diseases.
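The test referred to here is usually based on Pearson’s product-moment correlation coefficient, r. This sketch computes r itself for toy data; the accompanying significance test, which also depends on sample size, is omitted:

```javascript
// Pearson's correlation coefficient r: how closely points cluster
// around a straight line. +1 is a perfect positive correlation,
// -1 a perfect negative one, and 0 no linear relationship.
function pearsonR(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy; // co-variation of x and y
    sxx += dx * dx; // variation of x
    syy += dy * dy; // variation of y
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Toy data with a negative relationship, like the chart above:
console.log(pearsonR([1, 2, 3, 4], [10, 8, 7, 3]).toFixed(2));
```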

We will make a version of this chart in class. Import the file disease_democ.csv into the web app as before, and map infect_rate to the X axis and democ_score to the Y.
Now right-click in the main chart area and select Add layer>Bivariate Geoms>point. Click Draw plot and the points should appear on the scatter plot.

In the Layers panel, right-click on point and select size>Set>4 to increase the size of the points. Click Draw plot again.

Now we will add the trend line. Right-click back in the chart area, select Add layer>Bivariate Geoms>smooth and Draw plot. This will draw a smoothed line that meanders through the points, and plot a measure of the uncertainty around this line known as the “standard error.”

We instead want a linear trend line, without the standard error. In the Layers panel, right-click on smooth and select method>Set>lm (lm stands for “linear model”); also select se>Set>FALSE, to remove the standard error plot. Draw plot and you should have something approximating the chart above. (The scales on the axes will be different, however.)

Beyond the straight line: non-linear relationships

Relationships between variables aren’t always best described by straight lines, as we can see by looking at the Gapminder Foundation’s Wealth & Health of Nations graphic, on which Hans Rosling’s “200 Countries” video is based. This is a bubble plot, a relative of the scatter plot in which the size of the bubbles depends upon a third variable — here each country’s population:

(Source: Gapminder)

Look carefully at the X axis: income per person doesn’t increase in even steps. Instead the graph is drawn so that the distance between 400 and 4,000 equals the distance between 4,000 and 40,000.

This is because the axis has been plotted on a “logarithmic” scale. It would increase in even steps if the numbers used were not the actual incomes per person, but their common logarithms (the power to which 10 must be raised to give that number).
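You can verify the even spacing with common logarithms in JavaScript:

```javascript
// On a log scale, equal ratios get equal spacing: each tenfold
// increase in income moves the same distance along the axis.
const incomes = [400, 4000, 40000];
for (const income of incomes) {
  console.log(income, Math.log10(income).toFixed(2));
}
// The logs step up by exactly 1 each time, which is why the gap
// from 400 to 4,000 equals the gap from 4,000 to 40,000 on the axis.
```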

Logarithmic scales are often used to make graphs plotting a wide range of values easier to read. If we turn off the logarithmic scale on the Gapminder infographic, the income per person values for the poorer countries bunch up, making it hard to see the differences between them:

(Source: Gapminder)

From this version of the graphic, we can see that a line of best fit through the data would be a curve that first rises sharply, and then levels out. This is called a “logarithmic curve,” which is described by another simple equation.

A logarithmic curve is just one example of a “non-linear” mathematical relationship that statisticians can use to fit a model to data.

Assignment

  • Calculate the values needed to group nations into five quantile bins, according to the 2014 GDP per capita data in the file gdp_pc.csv.
  • Create the infectious disease and democratization scatter plot in D3 so that the points are color-coded by a nation’s income group. Note: if your solution results in multiple trend lines, you are mapping color at the wrong point in building the chart!
  • Save the plot as a PDF file. If the points on the scatter plot do not render correctly, paste the URL for the PDF into another browser; it should work in Google Chrome.
  • Subscribe to visualization blogs, follow visualization thought leaders on Twitter, and take other steps to track developments in data viz and data journalism.
  • Send me your calculated quantile values, your scatter plot, and your initial list of visualization blogs by the start of next week’s class.

Further reading:

Sarah Cohen: Numbers in the Newsroom: Using Math and Statistics in News

Philip Meyer: Precision Journalism: A Reporter’s Introduction to Social Science Methods

Future of Education – part I

For the first part of your Future of Education Data Visualization you will be looking for data to support a story or message that you’d like to communicate. For example, if you’d like to show that the price of university education in Canada has risen consistently over the past decade, you may want to find specific data that represents this increase in tuition (see links below):

Tuition fees for degree programs, 2016/2017

Undergraduate tuition fees for full time Canadian students, by discipline, by province (Canada)

Once you’ve analyzed the data, sketch different ways of representing it (pencil and paper).

Nominal, Ordinal & Quantitative

Nominal

A nominal variable (sometimes called a categorical variable) is a qualitative variable with two or more categories that have no intrinsic ordering. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories, but you cannot clearly order the categories. If the variable has a clear ordering, then it is an ordinal variable, as described below.

Ordinal

An ordinal variable is similar to a nominal variable, but its categories have a clear ordering. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high. Now consider a variable like educational experience (with values such as elementary school graduate, high school graduate, some college and college graduate). These also can be ordered as elementary school, high school, some college, and college graduate. Even though we can order these from lowest to highest, the spacing between the values may not be the same across the levels of the variable. Say we assign scores 1, 2, 3 and 4 to these four levels of educational experience and compare the difference between categories one and two with the difference between categories two and three, or between categories three and four. The difference between categories one and two (elementary and high school) is probably much bigger than the difference between categories two and three (high school and some college). In this example, we can order people by level of educational experience, but the size of the difference between categories is inconsistent. If the categories were equally spaced, the variable would be an interval variable.

D3 – 01 | Hello World

Step 0 – Preparing the HTML

Base HTML code that you can copy into a file is shown below. Alternatively, create an HTML skeleton file with your editor (e.g. Brackets, Sublime or TextMate) and make sure to link to the D3.js library. The JavaScript code in the snippets should go within the script tag! Enter the code snippets one by one and test after each; don’t enter all the code at once.

<html>
  <head>
    <script src="http://d3js.org/d3.v3.min.js"></script>
  </head>

  <body>
    <script>
    </script>
  </body>
</html>

Step 1 – Selecting and appending

Concepts:

  • Method Chaining
  • Selecting

Step 1.1: Appending the div

d3.select("body")
    .append("div")
    .style("border", "1px black solid")
    .text("hello world");

In English:

  1. Selecting the “<body>” element in the HTML,
  2. we create and append a “<div>” to it ,
  3. with the CSS style rule of a solid, black, 1px-thick border,
  4. and containing the text “hello world”.

Step 1.2: Event Binding

d3.selectAll("div")
    .style("background-color", "pink")
    .style("font-size", "24px")
    .attr("class", "d3div")
    .on("click", function() {alert("You clicked a div")})

In English:

  1. Selecting all the <div> elements in the document with .selectAll() (the related method .select() chooses only the first element matching the selector),
  2. we color the background of the <div> pink in the CSS
  3. and change the font size of the <div> to 24px in the CSS.
  4. We add to the <div> tag a “class” attribute and name this new class “d3div”.
  5. We bind an event handler to the <div> tag selected in the first line. The event is the action of a “click”, and when that action is executed (when someone clicks the <div>), a function is called. In this example, that function has only one line: alert that will display “You clicked a div”.

Task 2 – The <svg> element

This task is completely separate from Task 1. You can keep the base HTML file, but your script area will look entirely different! Remember to save the different files separately for future reference.

Concepts:

  • svg element
  • DOM order – no z-scale!
  • coordinate attributes

Step 0:

Add the svg element in the HTML file, within the body.

<svg style="width:500px; height:500px; border:1px lightgray solid;">

</svg>       

Step 1:

d3.select("svg")
    .append("line")
    .attr({"x1":20, "y1":20, "x2":250, "y2":250})
    .style("stroke","red")
    .style("stroke-width","4px");

Select the “<svg>” tag added in the HTML file and append a line with starting point (20,20) and ending point (250,250). The line is 4px wide and red.

Step 2:

d3.select("svg")
    .append("text")
    .attr({"x": 20, "y": 20})
    .text("HELLO");

Select the <svg> tag again and append “text” to it at the coordinate (20, 20) with the text content “HELLO”. (Note: Hit inspect element to see what your HTML looks like after the script has been run!)

Step 3:

d3.select("svg")
    .append("circle")
    .attr({"r": 20, "cx": 20, "cy": 20})
    .style("fill","purple");
    
d3.select("svg")
    .append("circle")
    .attr({"r": 100, "cx": 400, "cy": 400})
    .style("fill", "lightblue");

Now, select the <svg> again and this time, append a circle with the specified attributes (cx and cy denote center coordinates). Select it again and append the other circle with its own attributes.

Step 4:

d3.select("svg")
    .append("text")
    .attr({"x": 400, "y": 400})
    .text("WORLD");

Now, append the final text “WORLD” to the <svg>, positioned at the point (400,400).

Food for thought: Why do the elements overlap the way they do? What determines what goes on top and what goes below?

Task 3 – Transitions

To add transitions (the D3.js equivalent of animations), we’ll need to change our base code up a bit. To start off with, make a copy of your previous file, and then replace the <script> code with the one shown below.

Step 1:

d3.select("svg")
    .append("circle")
    .attr({"r": 20, "cx": 20, "cy": 20})
    .style("fill","red");
            
d3.select("svg")
    .append("text")
    .attr("id","a")
    .attr({"x": 20, "y": 20})
    .style("opacity", 0)
    .text("HELLO WORLD");          

d3.select("svg")
    .append("circle")
    .attr({"r": 100, "cx": 400, "cy": 400})
    .style("fill","lightblue");
            
d3.select("svg")
    .append("text")
    .attr("id","b")
    .attr({"x": 400, "y": 400})
    .style("opacity", 0)
    .text("Uh, hi.");

Take a look at the code above.

  • Why is the code broken up in the way it is?
  • Find all the added instructions (those that were not in the previous code). Why don’t you see all elements?
  • Can you see the purpose of the added instructions, with respect to the transitions we’ll add in the next step?

Step 2:

Add the following code:

d3.select("#a")
    .transition()
    .delay(1000)
    .style("opacity", 1);

Time to add transitions with the .transition() method. .delay(milliseconds) determines the amount of time to elapse before the transition is applied. We usually delay animations by a little bit because instantaneous changes can be jarring for the user. This particular transition above waits a second (1000 milliseconds) before it transitions the opacity of #a (one of the text fields) from 0 (transparent, therefore invisible) to 1 (full opacity).

Step 3:

d3.select("#b")
    .transition()
    .delay(2000)
    .style("opacity", .75);
d3.selectAll("circle")
    .transition()
    .duration(2000)
    .attr("cy", 200);

What do you think the above code does?

Here are some things you can try:

  • Have 4 circles on the canvas, each moving in different directions.
  • Have each of the circles move at different times.
  • Make it look like there is only one circle when the page first loads, but reveal through transitions that there are actually many layers.

Assignment 00: One Good Data Visualisation

Find an example of a data visualisation that represents at least two of the principles outlined in the Principles of Information Design PDF.
Write some brief notes about:

  • Why you chose this visualisation
  • What makes it a good example in your opinion
  • Which principles it adheres to
  • Whether the visualisation is exploratory or explanatory

Post your example with notes in the thread below.