Getting data from a Google Spreadsheet

When you need a quick and simple way to store, edit and retrieve data collectively online, Google Spreadsheets are an easy-to-use choice.

Let’s say you create a simple table and don’t want to build a complicated database backend to write, edit and store your data – but you want to conveniently consume this data as JSON. Google Drive helps you with that.

Step 1: Create a spreadsheet and publish it to the web

Sticking with the table example, you might create a spreadsheet like this:

You can then publish the document to the web by choosing “File” > “Publish to the web”.
In the lower half of the publishing dialog, you can see a link to your data – but it only gives us HTML (try changing the output from “html” to “json” – it doesn’t work).

Copy the key=… part (in my example it is 0AtMEoZDi5-pedElCS1lrVnp0Yk1vbFdPaUlOc3F3a2c) and substitute it for “PUT-KEY-HERE” in this URL: https://spreadsheets.google.com/feeds/list/PUT-KEY-HERE/od6/public/values?alt=json-in-script&callback=

For my example the URL is https://spreadsheets.google.com/feeds/list/0AtMEoZDi5-pedElCS1lrVnp0Yk1vbFdPaUlOc3F3a2c/od6/public/values?alt=json-in-script&callback=. It does not work directly in the browser, but if you append something – say “x” – as the callback name, it displays your data as JSONP.

Alternatively, you can get it as pure JSON (but you may need to run it through a CORS proxy, for example cors.io) with https://spreadsheets.google.com/feeds/list/PUT-KEY-HERE/od6/public/values?alt=json

Now your visualization can retrieve the data and use it however you wish.
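For example, here is a minimal sketch of how the JSON could be consumed in the browser, assuming the legacy list-feed format (rows arrive as feed.entry, and cell values sit under gsx$-prefixed keys named after your lowercased column headers; the city and visitors columns here are hypothetical):

var key = "PUT-KEY-HERE"; // your spreadsheet key from the publishing dialog
var url = "https://spreadsheets.google.com/feeds/list/" + key + "/od6/public/values?alt=json";

fetch(url) // may need to go through a CORS proxy, as noted above
  .then(function(response) { return response.json(); })
  .then(function(json) {
    // each sheet row is one entry; a "City" column becomes gsx$city.$t
    var rows = json.feed.entry.map(function(entry) {
      return {
        city: entry.gsx$city.$t,
        visitors: +entry.gsx$visitors.$t // coerce numeric strings to numbers
      };
    });
    console.log(rows);
  });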

Creating a Custom D3 Visualization

This course barely scratches the surface of what D3 is capable of. A more complicated example will show how you can make custom visualizations with D3 that are impossible with standard charting libraries. This is where D3 really shines. The goal is to make a chart of day lengths throughout the year, like this one drawn by a child from M&M child care.

This chart has several interesting aspects worth noting. It is centred on noon: the length of each day is traced by two lines on either side of the central axis, forming a shape that is coloured in, while the blue background represents night. There are four labeled axes: top and bottom, left and right. A visualization like this is unlikely to come from a pre-packaged library.

Planning the Visualization

To plan the attack, a diagram of the desired visualization is helpful.

This brings several points to the foreground:

  • The visualization will be inset with a margin of 40 pixels, leaving room to draw the axes. Using transform will make the math for this simple.
  • Axes will go above and below and left and right of the graphic. Since the axes are all time-related, it makes sense to use time scales.
  • Helpfully, the y axis counts up the same way as the SVG coordinate system so you do not need to reverse the axis.
  • The daylight section will be yellow: it is what is left of a rect after svg:path shapes for the sunrise and sunset times are drawn over it. d3.svg.area makes this easy.
  • The points on the path need to be connected into a smooth curve. There are different kinds of interpolation for this.

New Concepts

For this example, some new concepts will be introduced:

  • Time scales, an extension of linear scales tailored for time values.
  • The svg:g element which groups other elements like a div and can be transformed as a group.
  • Drawing complex shapes with svg:path, which is complicated, but made easy with the d3.svg.area function.
  • Using d3.range to create an array of numbers.
  • Using scale.ticks to generate tick marks for the axes.
  • Using SVG translation to simplify positioning.
  • Styling text with CSS.
  • Adjusting pixel positions to avoid anti-aliasing.

Digression: Anti-aliasing

Anti-aliasing helps make text and curved lines look smoother, but can make straight lines appear fuzzy. In the original interactive demo, viewed in a WebKit-based browser, the lines on the left appeared fuzzy while those on the right were crisp.

Anti-aliasing often happens when drawing straight lines with D3. If a 1-pixel line is positioned exactly on a pixel boundary, it will straddle two pixels and become fuzzy. To avoid this, add or subtract 0.5 to the pixel position. The example below uses this technique to prevent fuzziness.
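For instance, a 1-pixel horizontal rule could be positioned like this (a minimal sketch; svg, y and value stand in for your own selection, scale and datum):

svg.append("svg:line").
  attr("x1", 0).
  attr("x2", width).
  // round to a whole pixel, then offset by half a pixel so the 1px stroke
  // fills exactly one row of pixels instead of straddling two
  attr("y1", d3.round(y(value)) + 0.5).
  attr("y2", d3.round(y(value)) + 0.5).
  attr("stroke", "gray");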

However, rendering differs between SVG layout engines. In the same demo, Firefox 5 rendered the vertical lines fuzzy when positioned exactly on a pixel, while the horizontal lines were crisp.

Setting the SVG shape-rendering property to crispEdges disables anti-aliasing for some or all lines. However, this may have unwanted side-effects if the visualization has curved lines, making them look jagged.

svg {
  shape-rendering: crispEdges;
}

A better solution may be to target the crispEdges property at certain (vertical or horizontal) lines with a CSS class.

line.crisp {
  shape-rendering: crispEdges;
}

Using this solution without pixel munging seems to work better across different browsers, but you should test it yourself and see what works best for your audience.

Getting the Data

With the graphic planned, the first step is to collect the data. You can find sunrise and sunset times for cities around the world at timeanddate.com. I could not find a source of sunrise and sunset times in a computer-readable format and didn’t feel like writing a scraper, so I selected a few days throughout the year for the chosen location (Minneapolis, Minnesota).

The data format is like this:

{
 date: new Date(2011, 4, 15), // day of year
 sunrise: [7, 51],            // sunrise time as array of hours and minutes
 sunset: [16, 42]             // likewise, sunset time (24 hour clock)
}

Sunrise and sunset times are stored as an array of hour and minute values. These are used to construct a Date to represent the time of day when plotting.
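Concretely, each hour/minute pair is mapped onto a single arbitrary reference day (January 1 here) so that one time scale can position every value; a one-line sketch, where d is a row of the data:

var sunriseTime = new Date(2011, 0, 1, d.sunrise[0], d.sunrise[1]); // 7:51 am for the row above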

Building the Visualization

The JavaScript code for the visualization is below, with comments inline.
Here’s my CodePen:

var width = 700;
var height = 525;
var padding = 40;

// the vertical axis is a time scale that runs from 00:00 - 23:59
// the horizontal axis is a time scale that runs from 2011-01-01 to 2011-12-31

var y = d3.time.scale().domain([new Date(2011, 0, 1), new Date(2011, 0, 1, 23, 59)]).range([0, height]);
var x = d3.time.scale().domain([new Date(2011, 0, 1), new Date(2011, 11, 31)]).range([0, width]);

var monthNames = ["Jan", "Feb", "Mar", "April", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Sunrise and sunset times for dates in 2011. I have picked the 1st
// and 15th day of every month, plus other important dates like equinoxes
// and solstices and dates around the standard time/DST transition.

var data = [
  {date: new Date(2011, 0, 1), sunrise: [7, 51], sunset: [16, 42]},
  {date: new Date(2011, 0, 15), sunrise: [7, 48], sunset: [16, 58]},
  {date: new Date(2011, 1, 1), sunrise: [7, 33], sunset: [17, 21]},
  {date: new Date(2011, 1, 15), sunrise: [7, 14], sunset: [17, 41]},
  {date: new Date(2011, 2, 1), sunrise: [6, 51], sunset: [18, 0]},
  {date: new Date(2011, 2, 12), sunrise: [6, 32], sunset: [18, 15]}, // dst - 1 day
  {date: new Date(2011, 2, 13), sunrise: [7, 30], sunset: [19, 16]}, // dst
  {date: new Date(2011, 2, 14), sunrise: [7, 28], sunset: [19, 18]}, // dst + 1 day
  {date: new Date(2011, 2, 15), sunrise: [7, 26], sunset: [19, 19]},
  {date: new Date(2011, 2, 20), sunrise: [7, 17], sunset: [19, 25]}, // equinox
  {date: new Date(2011, 3, 1), sunrise: [6, 54], sunset: [19, 41]},
  {date: new Date(2011, 3, 15), sunrise: [6, 29], sunset: [19, 58]},
  {date: new Date(2011, 4, 1), sunrise: [6, 3], sunset: [20, 18]},
  {date: new Date(2011, 4, 15), sunrise: [5, 44], sunset: [20, 35]},
  {date: new Date(2011, 5, 1), sunrise: [5, 30], sunset: [20, 52]},
  {date: new Date(2011, 5, 15), sunrise: [5, 26], sunset: [21, 1]},
  {date: new Date(2011, 5, 21), sunrise: [5, 26], sunset: [21, 3]}, // solstice
  {date: new Date(2011, 6, 1), sunrise: [5, 30], sunset: [21, 3]},
  {date: new Date(2011, 6, 15), sunrise: [5, 41], sunset: [20, 57]},
  {date: new Date(2011, 7, 1), sunrise: [5, 58], sunset: [20, 40]},
  {date: new Date(2011, 7, 15), sunrise: [6, 15], sunset: [20, 20]},
  {date: new Date(2011, 8, 1), sunrise: [6, 35], sunset: [19, 51]},
  {date: new Date(2011, 8, 15), sunrise: [6, 51], sunset: [19, 24]},
  {date: new Date(2011, 8, 23), sunrise: [7, 1], sunset: [19, 9]}, // equinox
  {date: new Date(2011, 9, 1), sunrise: [7, 11], sunset: [18, 54]},
  {date: new Date(2011, 9, 15), sunrise: [7, 28], sunset: [18, 29]},
  {date: new Date(2011, 10, 1), sunrise: [7, 51], sunset: [18, 2]},
  {date: new Date(2011, 10, 5), sunrise: [7, 57], sunset: [17, 56]}, // last day of dst
  {date: new Date(2011, 10, 6), sunrise: [6, 58], sunset: [16, 55]}, // standard time
  {date: new Date(2011, 10, 7), sunrise: [6, 59], sunset: [16, 54]}, // standard time + 1
  {date: new Date(2011, 10, 15), sunrise: [7, 10], sunset: [16, 44]},
  {date: new Date(2011, 11, 1), sunrise: [7, 31], sunset: [16, 33]},
  {date: new Date(2011, 11, 15), sunrise: [7, 44], sunset: [16, 32]},
  {date: new Date(2011, 11, 22), sunrise: [7, 49], sunset: [16, 35]}, // solstice
  {date: new Date(2011, 11, 31), sunrise: [7, 51], sunset: [16, 41]}
];

function yAxisLabel(d) {
  if (d == 12) { return "noon"; }
  if (d < 12) { return d; }
  return (d - 12);
}

// The labels along the x axis will be positioned on the 15th of the
// month

function midMonthDates() {
  return d3.range(0, 12).map(function(i) { return new Date(2011, i, 15) });
}

var dayLength = d3.select("#day-length").
  append("svg:svg").
  attr("width", width + padding * 2).
  attr("height", height + padding * 2);

// create a group to hold the axis-related elements
var axisGroup = dayLength.append("svg:g").
  attr("transform", "translate("+padding+","+padding+")");

// draw the x and y tick marks. Since they are behind the visualization, they
// can be drawn all the way across it. Because the g element has been
// translated, they stick out the left side by going negative.

axisGroup.selectAll(".yTicks").
  data(d3.range(5, 22)).
  enter().append("svg:line").
  attr("x1", -5).
  // Round and add 0.5 to fix anti-aliasing effects (see above)
  attr("y1", function(d) { return d3.round(y(new Date(2011, 0, 1, d))) + 0.5; }).
  attr("x2", width+5).
  attr("y2", function(d) { return d3.round(y(new Date(2011, 0, 1, d))) + 0.5; }).
  attr("stroke", "lightgray").
  attr("class", "yTicks");

axisGroup.selectAll(".xTicks").
  data(midMonthDates).
  enter().append("svg:line").
  attr("x1", x).
  attr("y1", -5).
  attr("x2", x).
  attr("y2", height+5).
  attr("stroke", "lightgray").
  attr("class", "yTicks");

// draw the text for the labels. Since it is the same on top and
// bottom, there is probably a cleaner way to do this by copying the
// result and translating it to the opposite side

axisGroup.selectAll("text.xAxisTop").
  data(midMonthDates).
  enter().
  append("svg:text").
  text(function(d, i) { return monthNames[i]; }).
  attr("x", x).
  attr("y", -8).
  attr("text-anchor", "middle").
  attr("class", "axis xAxisTop");

axisGroup.selectAll("text.xAxisBottom").
  data(midMonthDates).
  enter().
  append("svg:text").
  text(function(d, i) { return monthNames[i]; }).
  attr("x", x).
  attr("y", height+15).
  attr("text-anchor", "middle").
  attr("class", "xAxisBottom");

axisGroup.selectAll("text.yAxisLeft").
  data(d3.range(5, 22)).
  enter().
  append("svg:text").
  text(yAxisLabel).
  attr("x", -7).
  attr("y", function(d) { return y(new Date(2011, 0, 1, d)); }).
  attr("dy", "3").
  attr("class", "yAxisLeft").
  attr("text-anchor", "end");

axisGroup.selectAll("text.yAxisRight").
  data(d3.range(5, 22)).
  enter().
  append("svg:text").
  text(yAxisLabel).
  attr("x", width+7).
  attr("y", function(d) { return y(new Date(2011, 0, 1, d)); }).
  attr("dy", "3").
  attr("class", "yAxisRight").
  attr("text-anchor", "start");

// create a group for the sunrise and sunset paths

var lineGroup = dayLength.append("svg:g").
  attr("transform", "translate("+ padding + ", " + padding + ")");

// draw the background. The part of this that remains uncovered will
// represent the daylight hours.

lineGroup.append("svg:rect").
  attr("x", 0).
  attr("y", 0).
  attr("height", height).
  attr("width", width).
  attr("fill", "lightyellow");

// The meat of the visualization is surprisingly simple. sunriseLine
// and sunsetLine are areas (closed svg:path elements) that use the date
// for the x coordinate and sunrise and sunset (respectively) for the y
// coordinate. The sunrise shape is anchored at the top of the chart, and
// sunset area is anchored at the bottom of the chart.

var sunriseLine = d3.svg.area().
  x(function(d) { return x(d.date); }).
  y1(function(d) { return y(new Date(2011, 0, 1, d.sunrise[0], d.sunrise[1])); }).
  interpolate("linear");

lineGroup.
  append("svg:path").
  attr("d", sunriseLine(data)).
  attr("fill", "steelblue");

var sunsetLine = d3.svg.area().
  x(function(d) { return x(d.date); }).
  y0(height).
  y1(function(d) { return y(new Date(2011, 0, 1, d.sunset[0], d.sunset[1])); }).
  interpolate("linear");

lineGroup.append("svg:path").
  attr("d", sunsetLine(data)).
  attr("fill", "steelblue");

// finally, draw a line representing 12:00 across the entire
// visualization

lineGroup.append("svg:line").
  attr("x1", 0).
  attr("y1", d3.round(y(new Date(2011, 0, 1, 12))) + 0.5).
  attr("x2", width).
  attr("y2", d3.round(y(new Date(2011, 0, 1, 12))) + 0.5).
  attr("stroke", "lightgray");

Using CSS helps remove duplication from the text formatting attributes:

div#day-length text {
  fill: gray;
  font-family: Helvetica, sans-serif;
  font-size: 10px;
}

And the final result:

[Rendered chart: month labels (Jan–Dec) run across the top and bottom axes, and hour labels (5 through 11, noon, 1 through 9) run down the left and right axes.]

Next Steps

Now that the visualization is complete, think about what could be done to improve it. You could label the solstices and equinoxes with exact times, plot sunset and sunrise times for different locations against each other, or make a label that shows exact times for the day the user is mousing over…

The possibilities are endless.

Debugging D3

Lest you assume these examples were coded straight through, let me assure you this was not the case. There was a significant amount of trial-and-error coding needed to get them working.

One of the advantages of D3 using SVG as its native graphical representation is that it is easier to debug with the WebKit Inspector or Firebug.

SVG in the WebKit Inspector

This is incredibly useful. You can try it on this page to see how the SVG code for these examples is generated.

Another technique that has been helpful for me is to use console.log in positioning functions. This lets you inspect the datum and index of an element as well as the calculated position for the x or y value. For example:

attr("x", function(d, i) {
  console.log(d);
  console.log(i);

  return x(d.whatever);
})

Further Reading

Now you know about as much about D3 as I do. Obviously, this barely scratches the surface. For more tutorials and resources about D3, read the following articles:

D3 Refresh

This article aims to give you a high-level overview of D3’s capabilities. In each example you’ll be able to see the input data, the transformation and the output document. Rather than explaining what every function does, I’ll show you the code and you should be able to get a rough understanding of how things work. I’ll only dig into details for the most important concepts: Scales and Selections.

William Playfair invented the bar, line and area charts in 1786 and the pie chart in 1801. Today, these are still the primary ways most data sets are presented. These charts are excellent, but D3 gives you the tools and the flexibility to make unique data visualizations for the web; your creativity is the only limiting factor.

A Bar Chart

Although we want to go beyond William Playfair’s charts, we’ll begin by making the humble bar chart with HTML – one of the easiest ways to understand how D3 transforms data into a document. Here’s what that looks like:

See code in CodePen

d3.select('#chart')
  .selectAll("div")
  .data([4, 8, 15, 16, 23, 42])
  .enter()
  .append("div")
  .style("height", (d)=> d + "px")

The selectAll function returns a D3 “selection”: an array of elements that get created when we call enter and append a div for each data point.

This code maps the input data [4, 8, 15, 16, 23, 42] to this output HTML.

<div id="chart">
  <div style="height: 4px;"></div>
  <div style="height: 8px;"></div>
  <div style="height: 15px;"></div>
  <div style="height: 16px;"></div>
  <div style="height: 23px;"></div>
  <div style="height: 42px;"></div>
</div>

All of the style properties that don’t change can go in the CSS.

#chart div {
  display: inline-block;
  background: #4285F4;
  width: 20px;
  margin-right: 3px;
}

GitHub’s Contribution Chart

With a few lines of extra code we can convert the bar chart above into a contribution chart similar to GitHub’s.

A GitHub-style contribution chart

See code in CodePen

Rather than setting a height based on the data’s value we can set a background-color instead.

const colorMap = d3.interpolateRgb(
  d3.rgb('#d6e685'),
  d3.rgb('#1e6823')
)

d3.select('#chart')
  .selectAll("div")
  .data([.2, .4, 0, 0, .13, .92])
  .enter()
  .append("div")
  .style("background-color", (d)=> {
    return d == 0 ? '#eee' : colorMap(d)
  })

The colorMap function takes an input value between 0 and 1 and returns a colour along the gradient between the two colours we provide. Interpolation is a key tool in graphics programming and animation; we’ll see more examples of it later.
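For instance (the exact output format varies between D3 versions, but the idea holds):

colorMap(0)    // "#d6e685" – the first colour
colorMap(0.5)  // a green midway along the gradient
colorMap(1)    // "#1e6823" – the second colour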

An SVG Primer

Much of D3’s power comes from the fact that it works with SVG, which contains tags for drawing 2D graphics like circles, polygons, paths and text.

<svg width="200" height="200">
  <circle fill="#3E5693" cx="50" cy="120" r="20" />
  <text x="100" y="100">Hello SVG!</text>
  <path d="M100,10L150,70L50,70Z" fill="#BEDBC3" stroke="#539E91" stroke-width="3">
</svg>

The code above draws:

  • A circle at 50,120 with a radius of 20
  • The text “Hello SVG!” at 100,100
  • A triangle with a 3px border; the d attribute contains the following instructions:
    • Move to 100,10
    • Line to 150,70
    • Line to 50,70
    • Close path (Z)

<path> is the most powerful element in SVG.

Circles

Labeled circles showing sales by time of day

See code in CodePen

The data sets in the previous examples have been simple arrays of numbers; D3 can work with more complex types too.

const data = [{
  label: "7am",
  sales: 20
},{
  label: "8am",
  sales: 12
}, {
  label: "9am",
  sales: 8
}, {
  label: "10am",
  sales: 27
}]

For each point of data we will append a <g> (group) element to the #chart and append <circle> and <text> elements to each, with properties from our objects.

const g = d3.select('#chart')
  .selectAll("g")
  .data(data)
  .enter()
  .append('g')
g.append("circle")
  .attr('cy', 40)
  .attr('cx', (d, i)=> (i+1) * 50)
  .attr('r', (d)=> d.sales)
g.append("text")
  .attr('y', 90)
  .attr('x', (d, i)=> (i+1) * 50)
  .text((d)=> d.label)

The variable g holds a d3 “selection” containing an array of <g> nodes, operations like append() append a new element to each item in the selection.

This code maps the input data into this SVG document – can you see how it works?

<svg height="100" width="250" id="chart">
  <g>
    <circle cy="40" cx="50" r="20"/>
    <text y="90" x="50">7am</text>
  </g>
  <g>
    <circle cy="40" cx="100" r="12"/>
    <text y="90" x="100">8am</text>
  </g>
  <g>
    <circle cy="40" cx="150" r="8"/>
    <text y="90" x="150">9am</text>
  </g>
  <g>
    <circle cy="40" cx="200" r="27"/>
    <text y="90" x="200">10am</text>
  </g>
</svg>

Line Chart

A basic line chart

See the CodePen

Drawing a line chart in SVG is quite simple; we want to transform data like this:

const data = [
  { x: 0, y: 30 },
  { x: 50, y: 20 },
  { x: 100, y: 40 },
  { x: 150, y: 80 },
  { x: 200, y: 95 }
]

Into this document:

<svg id="chart" height="100" width="200">
  <path stroke-width="2" d="M0,70L50,80L100,60L150,20L200,5">
</svg>

Note: The y values are subtracted from the height of the chart (100) because we want a y value of 100 to be at the top of the svg (0 from the top).

Given it’s only a single path element, we could do it ourselves with code like this:

const path = "M" + data.map((d)=> {
  return d.x + ',' + (100 - d.y);
}).join('L');
const line = `<path stroke-width="2" d="${ path }"/>`;
document.querySelector('#chart').innerHTML = line;

D3 has path-generating functions to make this much simpler, though. Here’s what it looks like:

const line = d3.svg.line()
  .x((d)=> d.x)
  .y((d)=> 100 - d.y)
  .interpolate("linear")

d3.select('#chart')
  .append("path")
  .attr('stroke-width', 2)
  .attr('d', line(data))

Much better! The interpolate function has a few different ways it can draw the line through the x, y coordinates. See how it looks with “linear”, “step-before”, “basis” and “cardinal”.

A linear-style line chart
A step-before-style line chart
A basis-style line chart

Scales

Scales are functions that map an input domain to an output range.

See the CodePen

In the examples we’ve looked at so far we’ve been able to get away with using “magic numbers” to position things within the chart’s bounds. When the data is dynamic, you need to do some math to scale the data appropriately.

Imagine we want to render a line chart that is 500px wide and 200px high with the following data:

const data = [
  { x: 0, y: 30 },
  { x: 25, y: 15 },
  { x: 50, y: 20 }
]

Ideally we want the y axis values to go from 0 to 30 (max y value) and the x axis values to go from 0 to 50 (max x value) so that the data takes up the full dimensions of the chart.

We can use d3.max to find the max values in our data set and create scales for transforming our x, y input values into x, y output coordinates for our SVG paths.

const width = 500;
const height = 200;
const xMax = d3.max(data, (d)=> d.x)
const yMax = d3.max(data, (d)=> d.y)

const xScale = d3.scale.linear()
  .domain([0, xMax]) // input domain
  .range([0, width]) // output range

const yScale = d3.scale.linear()
  .domain([0, yMax]) // input domain
  .range([height, 0]) // output range

These scales are similar to the colour interpolation function we created earlier; they are simply functions that map input values to a value somewhere on the output range.

xScale(0) -> 0
xScale(10) -> 100
xScale(50) -> 500

They also work with values outside of the input domain:

xScale(-10) -> -100
xScale(60) -> 600

We can use these scales in our line generating function like this:

const line = d3.svg.line()
  .x((d)=> xScale(d.x))
  .y((d)=> yScale(d.y))
  .interpolate("linear")

Another thing you can easily do with scales is to specify padding around the output range:

const padding = 20;
const xScale = d3.scale.linear()
  .domain([0, xMax])
  .range([padding, width - padding])

const yScale = d3.scale.linear()
  .domain([0, yMax])
  .range([height - padding, padding])

Now we can render a dynamic data set and our line chart will always fit inside our 500px / 200px bounds with 20px padding on all sides.

Linear scales are the most common type, but there are others, like pow for exponential scales and ordinal scales for representing non-numeric data such as names or categories. In addition to Quantitative Scales and Ordinal Scales there are also Time Scales for mapping date ranges.

For example, we can create a scale that maps my lifespan to a number between 0 and 500:

const life = d3.time.scale()
  .domain([new Date(1986, 1, 18), new Date()])
  .range([0, 500])

// At which point between 0 and 500 was my 18th birthday?
life(new Date(2004, 1, 18))
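Since the upper end of the domain is the moment the code runs, the answer drifts over time: if this were run at the start of 2015, for example, the domain would span about 29 years, so the 18-year mark would map to roughly 18/29 × 500 ≈ 310.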

If you’d like to go further with this, try the Animated Flight Visualization below.

Animated Flight Visualization

So far we’ve only looked at static lifeless graphics with a few rollovers for additional information. Let’s make an animated visualization that shows the active flights over time between Melbourne and Sydney in Australia.

See the Pen D3 – scales by Haig Armen (@haigarmen) on CodePen.

The SVG document for this type of graphic is made up of text, lines and circles.

<svg id="chart" width="600" height="500">
  <text class="time" x="300" y="50" text-anchor="middle">6:00</text>
  <text class="origin-text" x="90" y="75" text-anchor="end">MEL</text>
  <text class="dest-text" x="510" y="75" text-anchor="start">SYD</text>
  <circle class="origin-dot" r="5" cx="100" cy="75" />
  <circle class="dest-dot" r="5" cx="500" cy="75" />
  <line class="origin-dest-line" x1="110" y1="75" x2="490" y2="75" />

  <!-- for each flight in the current time -->
  <g class="flight">
    <text class="flight-id" x="160" y="100">JQ 500</text>
    <line class="flight-line" x1="100" y1="100" x2="150" y2="100" />
    <circle class="flight-dot" cx="150" cy="100" r="5" />
  </g>

</svg>

The dynamic parts are the time and the elements within the flight group. The data might look something like this:

let data = [
  { departs: '06:00 am', arrives: '07:25 am', id: 'Jetstar 500' },
  { departs: '06:00 am', arrives: '07:25 am', id: 'Qantas 400' },
  { departs: '06:00 am', arrives: '07:25 am', id: 'Virgin 803' }
]

To get an x position for a dynamic time we’ll need to create a time scale for each flight that maps its departure and arrival times to an x position on our chart. We can loop through our data at the start adding Date objects and scales so they’re easier to work with. Moment.js helps a lot here with date parsing and manipulation.

data.forEach((d)=> {
  d.departureDate = moment(d.departs, "hh:mm a").toDate();
  d.arrivalDate = moment(d.arrives, "hh:mm a").toDate();
  d.xScale = d3.time.scale()
    .domain([d.departureDate, d.arrivalDate])
    .range([100, 500])
});

We can now pass our changing Date to xScale() to get an x coordinate for each flight.

Render Loop

Departure and arrival times are rounded to 5 minutes, so we can step through our data in 5-minute increments from the first departure to the last arrival.

let now = moment(data[0].departs, "hh:mm a");
const end = moment(data[data.length - 1].arrives, "hh:mm a");

const loop = function() {
  const time = now.toDate();

  // Filter data set to active flights in the current time
  const currentData = data.filter((d)=> {
    return d.departureDate <= time && time <= d.arrivalDate
  });

  render(currentData, time);

  if (now <= end) {
    // Increment 5m and call loop again in 500ms
    now = now.add(5, 'minutes');
    setTimeout(loop, 500);
  }
}
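Once render (defined below) exists, a single call kicks off the animation:

loop();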

Enter, Update and Exit

D3 allows you to specify transformations and transitions of elements when:

  • New data points come in (Enter)
  • Existing data points change (Update)
  • Existing data points are removed (Exit)

const render = function(data, time) {
  // render the time
  d3.select('.time')
    .text(moment(time).format("hh:mm a"))

  // Make a d3 selection and apply our data set
  const flight = d3.select('#chart')
    .selectAll('g.flight')
    .data(data, (d)=> d.id)

  // Enter new nodes for any data point with an id not in the DOM
  const newFlight = flight.enter()
    .append("g")
    .attr('class', 'flight')

  const xPoint = (d)=> d.xScale(time);
  const yPoint = (d, i)=> 100 + i * 25;

  newFlight.append("circle")
    .attr('class',"flight-dot")
    .attr('cx', xPoint)
    .attr('cy', yPoint)
    .attr('r', "5")

  // Update existing nodes in selection with id's that are in the data
  flight.select('.flight-dot')
    .attr('cx', xPoint)
    .attr('cy', yPoint)

  // Exit old nodes in selection with id's that are not in the data
  const oldFlight = flight.exit()
    .remove()
}

Transitions

The code above renders a frame every 500ms with a 5 minute time increment:

  • It updates the time
  • Creates a new flight group with a circle for every flight
  • Updates the x/y coordinates of current flights
  • Removes the flight groups when they’ve arrived

This works, but what we really want is a smooth transition between each of these frames. We can achieve this by creating a transition on any D3 selection and providing a duration and easing function before setting attributes or style properties.

For example, let’s fade in the opacity of entering flight groups.

const newFlight = flight.enter()
  .append("g")
  .attr('class', 'flight')
  .attr('opacity', 0)

newFlight.transition()
  .duration(500)
  .attr('opacity', 1)

Let’s fade out exiting flight groups.

flight.exit()
  .transition()
  .duration(500)
  .attr('opacity', 0)
  .remove()

Add a smooth transition between the x and y points.

flight.select('.flight-dot')
  .transition()
  .duration(500)
  .ease('linear')
  .attr('cx', xPoint)
  .attr('cy', yPoint)

We can also transition the time between the 5 minute increments so that it displays every minute rather than every five minutes using the tween function.

const inFiveMinutes = moment(time).add(5, 'minutes').toDate();
const i = d3.interpolate(time, inFiveMinutes);
d3.select('.time')
  .transition()
  .duration(500)
  .ease('linear')
  .tween("text", ()=> {
    return function(t) {
      this.textContent = moment(i(t)).format("hh:mm a");
    };
  });

t is a progress value between 0 and 1 for the transition.

Acquiring, cleaning, and formatting data

Not so many years ago, data was hard to obtain. Often data journalists would have to painstakingly compile their own datasets from paper records, or make specific requests for electronic databases using freedom of information laws.

The Internet has changed the game. While those methods may still be needed, many government databases can now be queried online, and the results of those searches downloaded. Other public datasets can be downloaded in their entirety.

For data journalists, the main problem today is usually not finding relevant data, but working out whether it can be trusted, spotting and correcting errors and inconsistencies, and getting it into the right format for analysis and visualization.

In this class, we will cover some tips and tricks for finding the data you need online and getting it onto your computer, and learn how to recognize and clean “dirty” data. We will also review some common data formats, and how to convert from one to another.

The data we will use today

Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following files:

  • oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
  • ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
  • urls.xls A spreadsheet that we’ll use in webscraping.

Data portals

Life is much easier if you can find everything you need in one place. The main effort to centralize access to data by the U.S. federal government is Data.gov. You can search for data from the home page, or follow the Data and Topics links from the top menu.

Be warned, however, that Data.gov is a work in progress, and does not contain all of the U.S. government’s data. Some of the most useful datasets are still only available on the websites of individual federal agencies. FedStats has links to agencies with data collections.

As a data journalist, it is worth familiarizing yourself with the main federal government agencies that have responsibility for the beats you are interested in, and the datasets they maintain.

Other data portals at various levels of government are emerging. The City and County of San Francisco, for example, was at the forefront of the Open Data movement, establishing DataSF in 2009.

If you need to make comparisons between nations, the World Bank probably has what you need. Its World Development Indicators catalog contains data for more than 7,000 different measures, compiled by the bank and other UN agencies.

You can navigate the site using the search box or the topic links to the right. When you click on a particular indicator, you are sent to a page that gives options to download the dataset from a link near the top right. The data in some cases goes back as far as 1960, and is listed both by individual country and summarized by regions and income groups. We have already worked with some of this data in Week 3.

Other useful sources of data for international comparisons are Gapminder and the UN Statistical Division. For health data in particular, try the Organization for Economic Co-operation and Development and the World Health Organization.

Search for data on the web

Often, however, your starting point in searching for data will be Google. Simply combining a few keywords with “data” or “database” is frequently enough to find what you need, but it can be worth focusing your queries using Google’s advanced search:

(Source: Google)

The options to search by site or domain and by file type can be very useful when looking for data. For example, the U.S. Geological Survey is the best source of data on earthquakes and seismic risk, so when searching for this information, specifying the domain usgs.gov would be a good idea. You can make the domains as narrow or broad as you like: .gov, for instance, would search a wide range of U.S. government sites, while .edu would search the sites of all academic institutions using that top-level domain; journalism.berkeley.edu would search the web pages of the Berkeley J-School only.

The file type search offers a drop-down menu, with the options including Excel spreadsheets, and Google Earth KML and KMZ files. These are common data formats, but you are not limited to those on the menu. In a regular Google search, type a space after your search terms followed by filetype:xxx, where xxx is the suffix for the file type in question. For example, dbf will look for database tables in this format. Combining file type and domain searches can be a good way to find data an agency has posted online — some of which may not otherwise be readily accessible.

One common data format doesn’t show up in file-type searches. Geographical data is often made available as “shapefiles,” a format we will explore in our later mapping classes. Because they consist of multiple files that are usually stored in compressed folders, shapefiles can’t readily be searched using a file type suffix, but they can usually be found by adding the terms “shapefile” or “GIS data” to a regular Google search.

Search online databases

Many important public databases can be searched online, and some offer options to download the results of your queries. Most of these databases give a simple search box, but it’s always worth looking for the advanced search page, which will offer more options to customize your search. Here, for example, is the advanced search page for ClinicalTrials.gov, a database of tests of experimental drugs and other medical treatments taking place in the U.S. and beyond:

(Source: ClinicalTrials.gov)

When you start working with a new online database, take some time to familiarize yourself with how its searches work: Read the Help or FAQs, and then run test searches to see what results you obtain. Here, for example, is the “How To” section of ClinicalTrials.gov.

Many online databases can be searched using Boolean logic, using the operators AND, OR and NOT to link search terms together. AND will return only data including both search terms; OR will return data containing either term; NOT will return data containing the first term but not the second.

So find out how a particular database uses Boolean logic — and the default settings that it will use if you list search terms without any Boolean operators.

Putting search terms in quote marks often searches for a specific phrase. For example, searching for “heart attack” on ClinicalTrials.gov will give only results in which those two words appear together; leaving out the quote marks will include any trial in which both words appear.

Also find out whether the database allows “wildcards,” symbols such as * or % that can be dropped into your search to obtain results with variations on a word or number.

Look for download options — and know when you are hitting the wall

Having run a search on an online database, you will usually want to download the results, so look for the download links or buttons.

A common problem with online databases, however, is that they may impose limits on the number of results that are returned on each search. And even when a search returns everything, there may be a limit on how many of those results can be downloaded to your own computer.

If broad searches on a database keep returning the same number of results, that is a sign that you are probably running up against a search limit, and any download will not contain the complete set of data that you are interested in. However, you may be able to work out ways of searching to obtain all of the data in chunks.

Download the entire database

Downloading an entire database, where this is allowed, frees you from the often-limited options given on an online advanced search form: You can then upload the data into your own database, and query it in any way that you want.

So always look for ways to grab all of the data. One trick is to run a search on just the database’s wildcard character, or with the query boxes left blank. If you do the latter at ClinicalTrials.gov, for instance, your search will return all of the trials in the database, which can then be downloaded using the options at the bottom of the results page.

Other databases have an online search form, but also have a separate link from which the data can be downloaded in its entirety, usually as a text file or series of text files. One example is the U.S. Food and Drug Administration’s Bioresearch Monitoring Information System (BMIS), which lists doctors and other researchers involved in testing experimental drugs. It can be searched online here, but can also be downloaded in full from here.

Note that large text files are again often stored in compressed folders, so may be invisible to a Google search by file type.

Where there’s a government form, there’s usually a database

The BMIS database also illustrates another useful tip when looking for data. It is compiled from information supplied in this government form:

(Source: Food and Drug Administration)

Wherever a government agency collects information using paper or electronic forms, this information is likely to be entered into an electronic database. Even if it is not available online, you can often obtain the database in its entirety (minus any redactions that may be required by law) through a public records request.

Ask for what you don’t find

That leads to another general tip: If you don’t find what you’re looking for, speak to government officials, academic experts and other sources who should know about what data exists, and ask whether they can provide it for you. I have often obtained data, including for this animated map of cicada swarms, simply by asking for it (and, of course, promising proper attribution):

(Source: New Scientist)

Automate downloads of multiple data files

Often data doesn’t reside in a single searchable database, but instead exists online as a series of separate files. In such cases, clicking on each link is tedious and time-consuming. But you can automate the process using the DownThemAll! Firefox add-on.

To illustrate, go to Gapminder’s data catalog, and select All indicators. The webpage now includes links to more than 500 downloadable spreadsheets.

At the DownThemAll! dialog box, you can choose where to save the files, and filter the links to select just the files you want. In this case, unchecking all the boxes and Fast Filtering using the term xls will correctly identify the spreadsheet downloads:

Extract data from tables on the web

On other occasions, data may exist in tables on the web. Copying and pasting data from web tables can be tricky, but the Table2Clipboard Firefox add-on can simplify the process.

Before using the add-on, select Tools>Table2Clipboard and choose the following options under the CSV tab:

This will ensure that each row in the extracted data is put on a new line, and each column is separated by a tab.

To illustrate what Table2Clipboard does, go to the Women’s Tennis Association singles rankings page, right-click anywhere in the table and select Table2Clipboard>Copy whole table:

You can now paste the data into an empty text file, or into a spreadsheet.

Manipulate urls to expose the data you need

As you search for data using web query forms, make a habit of looking at what happens to the url. Often the urls will contain patterns detailing the search you have run, and it will be possible to alter the data provided by manipulating the url. This can be quicker than filling in search forms. In some cases it may even reveal more data than the search form alone would provide, overriding controls on the number of records displayed.

To illustrate how this works, go to the ISRCTN registry, one of the main international registries of clinical trials. Find the Advanced Search and search for breast cancer under condition:

When the data is returned, note the url:

http://www.isrctn.com/search?q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Notice how the url changes if you select 100 under Show results:

http://www.isrctn.com/search?pageSize=100&sort=&page=1&q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Now change the page size in the url to 500:

http://www.isrctn.com/search?pageSize=500&sort=&page=1&q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Having done so, all of the registered clinical trials involving breast cancer should now be displayed on a single page. We could now use DownThemAll! to download all of the individual web pages describing each of these trials, or we could use this url as the starting point to scrape data from each of those pages.

Scrape data from the web

Sometimes you will need to compile your own dataset from information that is not available for easy download, but is instead spread across a series of webpages, or in a database that imposes limits on the amount of data that can be downloaded from any search, or doesn’t include a download button. This is where web scraping comes in.

Using programming languages such as Python or R, it is possible to write scripts that will pull data down from many webpages, or query web search forms to download an entire database piece by piece. The idea behind web scraping is to identify the patterns you would need to follow if collecting the data manually, then automate the process and write the results to a data file. That often means experimenting to reveal the most efficient way of exposing all of the data you require.

Teaching the programming skills needed for webscraping is beyond the scope of this class — see the Further reading links for resources, if you are interested in learning to scrape by coding.

However, software is starting to emerge that allows non-programmers to scrape data from the web. These include OutWit Hub and the Windows-only Helium Scraper. In today’s class, we will use Import.io.

To demonstrate webscraping, we will download data on disciplinary actions against doctors in the state of New York.

Navigate to this page, which is the start of the list. Then click on the Next Page link, and see that the url changes to the following:

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=30

Notice that the first entry on this list is actually the last entry on the previous one, so this url is the next page with no duplicates:

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=31

Experiment with different numbers at the end of the url until you find the end of the list. As of writing, this url exposed the end of the list, revealing that there were 7420 disciplinary actions in the database.

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=7420

Click on the link for the last doctor’s name, and notice that data on each disciplinary action, plus a link to the official documentation as a PDF, are on separate web pages. So we need to cycle through all of these pages to grab data on every disciplinary action.

The first step is to cycle through the entire list, grabbing all of the urls for the individual pages.

The best way of doing this in Import.io is to set up a scrape from all of the urls that define the list. In a Google spreadsheet, copy the base url down about 250 rows, then put the numbers that define the first three pages in the first three cells of the next column:

Select those three cells, then move the cursor to the bottom right-hand corner until it becomes a cross, and double-click to copy the pattern down the entire column.

In the first cell in the third column, type the following formula:

=concatenate(A1,B1)

Hit return, and copy this formula down the column to give the urls we will use to scrape the list. To save time in class, I have already made this spreadsheet, urls.xls for you to use. It is in the folder scraping.

Open Import.io, and you should see a screen like this:

Click the pink New button to start setting up a scraper, and at the dialog box select Start Extractor:

You can close the tutorial video that appears by clicking the OK, got it! button. Enter the url for the first page of the list in the search box, and then move the slider to ON.

Type name in the box at top left, replacing the default my_column, and click on the first link under Physician Name. At the dialog box that appears, tell Import.io that your table will contain Many rows.

Import.io will now grab the text and links in a column called name. This is all we need for the first phase of the scrape, so click the DONE button and select a name for the API, such as ny_doctors, and click PUBLISH.

At the next window, select Bulk Extract under How would you like to use this API? and paste into the box the urls from the spreadsheet:

Click Save URLS and then Run queries and the scrape should begin. It will take a couple of minutes to process all of the urls. If any fail, click on the warning message to retry.

Now click on the Export button and select HTML to export as a web page, which should look like this. Save it on your desktop, and open it in a browser. The column name now contains all the urls we need for the second stage of the scrape:

Click the New button to set up the second phase of the scrape, and again Start Extractor. Enter the second url from your HTML table, to select a named doctor, rather than a practice. Call the column first name, click on the doctor’s first name, and this time tell Import.io that your scrape will have Just one row — because each of the pages we are about to scrape contains data on one disciplinary action.

Click + NEW COLUMN and repeat for the doctor’s last name and the other fields in the data. Make sure to click on the empty box for License Restrictions, so the scrape does grab this data where it exists, and the link to the PDF document. When you are done, the screen should look like this:

Click DONE, select a name for the API, such as ny_orders, and click PUBLISH.

Again select Bulk Extract, and paste into the box the entire column of urls from your HTML table. You can do this in Firefox using Table2Clipboard, with its Select column option. Remember to delete the column header name from the list of urls before clicking Save URLs and Run queries.

Once the scrape has completed, click the Export button and select Spreadsheet to export as a CSV file.

Use application programming interfaces (APIs)

Websites like the ISRCTN clinical trials registry are not expressly designed to be searched by manipulating their urls, but some organizations make their data available through APIs that can be queried by constructing a url in a similar way. This allows websites and apps to call in specific chunks of data as required, and work with it “on the fly.”

To see how this works, go to the U.S. Geological Survey’s Earthquake Archive Search & URL Builder, where we will search for all earthquakes with a magnitude of 6 or greater that occurred within 6,000 kilometres of the geographic center of the contiguous United States, which this site tells us lies at a latitude of 39.828175 degrees and a longitude of -98.5795 degrees. We will initially use the Output Options to ask for the data in a format called GeoJSON (a variant of JSON). Enter 1900-01-01T00:00:00 under Start in the Date & Time boxes so that we obtain all recorded earthquakes from the beginning of 1900 onward. The search form should look like this:

(Source: U.S. Geological Survey)

You should receive a quantity of data at the following url:

http://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=1900-01-01T00:00:00&latitude=39.828175&longitude=-98.5795&maxradiuskm=6000&minmagnitude=6&orderby=time

See what happens if you append -asc to the end of that url: This should sort the earthquakes from oldest to newest, rather than the default of newest to oldest. Here is the full documentation for querying the earthquake API by manipulating these urls.

Now remove the -asc and replace geojson in the url with csv. The data should now download in CSV format.

PDFs: the bane of data journalism

Some organizations persist in making data available as PDFs, rather than text files, spreadsheets or databases. This makes the data hard to extract. While you should always ask for data in a more friendly format — ideally a CSV or other simple text file — as a data journalist you are at some point likely to find yourself needing to pull data out of a PDF.

For digital PDFs, Tabula is a useful data extraction tool — however it will not work with PDFs created by scanning the original document, which have to be interpreted using Optical Character Recognition (OCR) software (which is, for example, included in Adobe Acrobat).

Also useful is the online service Cometdocs. While it is a commercial tool, members of Investigative Reporters and Editors can obtain a free account. Cometdocs can read scanned PDFs, however its accuracy will vary depending on how well the OCR works on the document in question.

Can I trust this data?

Having identified a possible source of data for your project, you need to ask: Is it reliable, accurate and useful? If you rush into analysis or visualization without considering this question, your hard work may be undermined by the maxim: “Garbage In, Garbage Out.”

The best rule of thumb in determining the reliability of a dataset is to find out whether it has been used for analysis before, and by whom. If a dataset was put together for an academic study, or is actively curated so it can be made available for experts to analyze, you can be reasonably confident that it is as complete and accurate as it can be — the U.S. Geological Survey’s earthquake data is a good example.

While in general you might be more trusting of data downloaded from a .gov or .edu domain than something found elsewhere on the web, don’t simply assume that it is reliable and accurate. Be especially wary of databases that are compiled from forms submitted to government agencies, such as the Bioresearch Monitoring Information System (BMIS) database mentioned earlier.

Government agencies may be required by law to maintain databases such as BMIS, but that doesn’t mean that the information contained in them is wholly reliable. First, forms may not always be submitted, making the data incomplete. Second, information may be entered by hand from the forms into the database — and not surprisingly, mistakes are made.

So before using any dataset, do some background research to find out how it was put together, and whether it has been rigorously checked for errors. If possible, try to speak to the people responsible for managing the database, and any academics or other analysts who have used the data. They will be your best guide to a dataset’s strengths and weaknesses.

Even for well-curated data, make a point of speaking with experts who compile it or use it, and ask them about the data’s quirks and limitations. From talking with experts on hurricanes, for example, I know not to place too much trust in data on North Atlantic storms prior to about 1990, before satellite monitoring was well developed — even though the data available from NOAA goes back to 1851.

Always ask probing questions of a dataset before putting your trust in it. Is this data complete? Is it up-to-date? If it comes from a survey, was it based on a representative sample of people who are relevant to your project? Remember that the first dataset you find online may not be the most relevant or reliable.

Recognize dirty data

In an ideal world, every dataset we find would have been lovingly curated, allowing us to start analyzing and visualizing without worrying about its accuracy.

In practice, however, often the best available data has some flaws, which may need to be corrected as far as is possible. So before starting to work with a new dataset, load it into a spreadsheet or database and take a look for common errors. Here, for example, is a sample of records from the BMIS database, with names including non-alphabetical characters — which are clearly errors:

(Source: Peter Aldhous, from Bioresearch Information Monitoring System data)

Look for glitches in the alignment of columns, which may cause data to appear in the wrong field.

For people’s names, look for variations in spelling, format, initials and accents, which may cause the same person to appear in multiple guises. Similar glitches may affect addresses, and any other information entered as text.

Some fields offer obvious checks: if you see a zip code with fewer than 5 digits, for instance, you know it must be wrong.

Dates can also be entered incorrectly, so it’s worth scanning for those that fall outside the timeframe that should be covered by the data.

Also scan numbers in fields that represent continuous variables for any obvious outliers. These values are worth checking out. Are they correct, or did someone misplace a decimal point or enter a number in the wrong units?

Other common problems are white spaces before and after some entries, which may need to be stripped out.

At all stages of your work, pay attention to zeros. Is each one actually supposed to represent zero, or should the cell in fact be empty, or “null”? Take particular care when exporting data from one software tool and importing to another, and check how nulls have been handled.

Clean and process data with Open Refine

Checking and cleaning “dirty” data, and processing data into the format you need, can be the most labor intensive part of many data journalism projects. However, Open Refine (formerly Google Refine) can streamline the task — and also create a reproducible script to quickly repeat the process on data that must be cleaned and processed in the same way.

When you launch Open Refine, it opens in your web browser. However, any data you load into the program will remain on your computer — it does not get posted online.

The opening screen should look like this:

Reshape data from wide to long format

Click the Browse button and navigate to the file oil_production.csv. Click Next>>, and check that the data looks correct:

Open Refine should recognize that the data is in a CSV file, but if not you can use the panel at bottom to specify the correct file type and format for the data. When you are satisfied that the data has been read correctly, click the Create Project >> button at top right. The screen should now look like this:

As you can see, the data is in wide format, with values for oil production by region organized in columns, one for each year. To convert this to long format, click on the small downward-pointing triangle for the first of these year columns, and select Transpose>Transpose cells across columns into rows.

Fill in the dialog box as below, making sure that From Column and To Column are highlighted correctly, that the Key column and Value column have been given appropriate names, and that Fill down in other columns is checked. (Failing to check this box will mean that each region name appears only once in the reshaped data, rather than being copied down to appear next to the corresponding data for year and oil production.)

Click Transpose and then the 50 rows link, to see the first 50 rows of the reshaped data:

Click the Export button at top right and you will see options to export the data in a variety of file types, including Comma-separated value and Excel spreadsheet.

Clean and process dirty data

Click the Google Refine logo at top left to return to the opening screen. Create a new project from the file ucb_stanford_2014.csv.

Entries recognized as numbers or dates will be green, those treated as text strings will be black:

Again, each field/column has a button with a downward-pointing triangle. Click on these buttons and you get the option to create “facets” for the column, which provide a powerful way to edit and clean data.

Click on the button for the field Recipient City, and select Facet>Text facet. A summary of the various entries now appears in the panel to the left:

The numbers next to each entry show how many records there are for each value.

We can edit entries individually: select Veterans Bureau Hospi, which is clearly not a city, click on the Edit link, and change it to Unknown. (If cleaning this data for a real project, we would need to check with an external source to find the actual city for this entry.)

Another problem is that we have a mixture of cases, with some entries in Title or Proper Case and some in UPPERCASE. We can fix this in the field itself: click its button again and select Edit cells>common transforms>To titlecase.

Now notice that we apparently have duplicate entries for Berkeley, Palo Alto and Stanford. This is the result of trailing white space after the city names for some entries. Select Edit cells>common transforms>Trim leading and trailing whitespace and notice how the problem resolves:

Having cleaned this field, close the facet by clicking the cross at top left.

Now create a text facet for the field Recipient:

What a mess! The only possibilities are Stanford or Berkeley, yet there are multiple variants of each, many including Board of Trustees for Stanford and Regents of the University of California for Berkeley.

First, manually edit Interuniveristy Center for Japanese Language to Stanford, which is where this center is based.

We could continue editing manually, but to illustrate Open Refine’s editing functions, click on the Cluster button. Here you can experiment with different clustering algorithms to edit entries that may be variants of the same thing. Select key collision and metaphone3, then start checking the clusters and renaming them as Berkeley or Stanford as appropriate:

Click Merge Selected & Close and the facet can then be quickly edited manually:

Often we may need to convert fields to text, numbers or dates. For example, click on the button for Award Date and select Edit cells>common transforms>To date and see that it changes from a string of text to a date in standard format.

Notice the field Award amount, which is a value in dollars. Negative values are given in brackets. Because of these symbols, the field is being recognized as a string of text rather than a number, so to fix this problem we have to remove the symbols.

Select Edit column>Add column based on this column... and fill in the dialog box as follows:

Here value refers to the value in the original column, and replace is a function that replaces characters in the value. We can run several replace operations by “chaining” them together. This is a concept we’ll meet again in subsequent weeks, when we work with the D3 JavaScript library and R.

Here we are replacing the “$” symbols, the commas separating thousands, and the closing brackets with nothing; we are replacing the opening brackets with a hyphen to designate negative numbers.
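
Chained together in GREL, those replacements look something like the expression below (the exact expression to type may differ slightly depending on your data, but the logic is the same):

value.replace("$", "").replace(",", "").replace(")", "").replace("(", "-")

Each replace passes its result to the next, so a value like (1,500) becomes -1500.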

Click OK and the new column will be created. Note that it is still being treated as text, but that can be corrected by selecting Edit cells>common transforms>To number.

This is just one example of many data transformation functions that can be accessed using Open Refine’s expression language, called GREL. Learning these functions can make Open Refine into a very powerful data processing tool. Study the “Further reading” links for more.

Open Refine’s facets can also be used to inspect columns containing numbers. Select Facet>Numeric facet for the new field. This will create a histogram showing the distribution of numbers in the field:

We can then use the slider controls to filter the data, which is good for examining possible outliers at the top or bottom of the range. Notice that here a small number of grants have negative values, while there is one grant with a value of more than $3 billion from the National Science Foundation. This might need to be checked out to ensure that it is not an error.

While most of the data processing we have explored could also be done in a spreadsheet, the big advantage of Open Refine is that we can extract a “pipeline” for processing data to use when we obtain data in the same format in future.

Select Undo / Redo at top left. Notice that clicking on one of the steps detailed at left will transform the data back to that stage in our processing. This means you don’t need to worry about making mistakes, as it’s always possible to revert to an earlier state, before the error, and pick up from there.

Return to the final step, then click the Extract button. At the dialog box, check only those operations that you will want to perform in future (typically generic transformations on fields/columns, and not correcting errors for individual entries). Here I have unchecked all of the corrections in the text facets, and selected just those operations that I know I will want to repeat if I obtain data from this source again:

This will generate JSON in the right-hand panel that can be copied into a blank text file and saved.

To process similar data in future, click the Apply button on the Undo / Redo tab, paste in the text from this file, and click Perform Operations. The data will then be processed automatically.

When you are finished cleaning and processing your data, click the Export button at top right to export as a CSV file or in other formats.

Open Refine is a very powerful tool that will reward efforts to explore its wide range of functions for manipulating data. See the “Further reading” for more.

Standardize names with Mr People

For processing names from a string of text into a standardized format with multiple fields, you may wish to experiment with Mr People, a web app made by Matt Ericson, a member of the graphics team at The New York Times.

(Source: Mr People)

It takes a simple list of names and turns them into separate fields for title, first name, last name, middle name and suffix.

Mr People can save you time, but it is not infallible — it may give errors with Spanish family names, for instance, or if people have multiple titles or suffixes, such as “MD, PhD.” So always check the results before moving on to further analysis and visualization.

Correct for inflation (and cost of living)

A common task in data journalism and visualization is to compare currency values over time. When doing so, it usually makes sense to show the values after correcting for inflation — for example in constant 2014 dollars for a time series ending in 2014. Some data sources, such as the World Bank, provide some data both in raw form and in a given year’s constant dollars.

So pay attention to whether currency values have already been corrected for inflation, or whether you will need to do so yourself. When correcting for inflation in the United States, the most widely used approach is based on the Consumer Price Index, or CPI, which tracks prices paid by urban consumers for a representative basket of goods and services. Use this online calculator to obtain conversions.

If, for example, you need to convert a column of data in a spreadsheet from 2010 dollars into today’s values, fill in the calculator like this:

A dollar today is worth the same as 0.9 dollars in 2010.

So to convert today’s values into 2010 dollars, use the following formula:

2016 value * 0.9

And to convert the 2010 values to today’s values, divide rather than multiply:

2010 value / 0.9

Alternatively, fill in the calculator the other way round, and multiply as before.

Convert 2010 to today’s values:

2010 value * 1.11
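
If you need to convert a whole column, it may be easier to encode the conversion factor in a short script than to paste values into the calculator one by one. A minimal sketch in Python, using the 0.9 factor from the calculator above:

# $1 today buys what $0.90 bought in 2010 (factor from the CPI calculator)
FACTOR = 0.9

def to_2010_dollars(todays_value):
    # Convert today's values into constant 2010 dollars
    return todays_value * FACTOR

def to_todays_dollars(value_2010):
    # Convert 2010 values into today's dollars
    return value_2010 / FACTOR

print(to_todays_dollars(100))  # 100 2010-dollars is about 111 of today's dollars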

For comparing currency values across nations, regions or cities, you may also need to correct for the cost of living — or differences in what a dollar can buy in different places. For World Bank indicators, look for the phrase “purchasing power parity,” or PPP, for data that includes this correction. PPP conversion factors for nations over time are given here.

Understand common data formats, and convert between them

Until now, we have used data in text files, mostly in CSV format.

Text files are great for transferring data from one software application to another during analysis and visualization, but other formats that are easier for machines to read are typically used when transferring data between computers online. If you are involved in web development or designing online interactive graphics, you are likely to encounter these formats.

JSON, or JavaScript Object Notation, which we have already encountered today, is a data format often used by APIs. JSON treats data as a series of “objects,” which begin and end with curly brackets. Each object in turn contains a series of name-value pairs. There is a colon between the name and value in each pair, and the pairs are separated by commas.

Here, for example, are the first few rows of the infectious disease and democracy data from week 1, converted to JSON:

[{"country":"Bahrain","income_group":"High income: non-OECD","democ_score":45.6,"infect_rate":23},
{"country":"Bahamas, The","income_group":"High income: non-OECD","democ_score":48.4,"infect_rate":24},
{"country":"Qatar","income_group":"High income: non-OECD","democ_score":50.4,"infect_rate":24},
{"country":"Latvia","income_group":"High income: non-OECD","democ_score":52.8,"infect_rate":25},
{"country":"Barbados","income_group":"High income: non-OECD","democ_score":46,"infect_rate":26}]

XML, or Extensible Markup Language, is another format often used to move data around online. For example, the RSS feeds through which you can subscribe to content from blogs and websites using a reader such as Feedly are formatted in XML.

In XML, data is structured by enclosing values within “tags,” similar to those used to code different elements on a web page in HTML. Here is that same data in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<rows>
  <row country="Bahrain" income_group="High income: non-OECD" democ_score="45.6" infect_rate="23" ></row>
  <row country="Bahamas, The" income_group="High income: non-OECD" democ_score="48.4" infect_rate="24" ></row>
  <row country="Qatar" income_group="High income: non-OECD" democ_score="50.4" infect_rate="24" ></row>
  <row country="Latvia" income_group="High income: non-OECD" democ_score="52.8" infect_rate="25" ></row>
  <row country="Barbados" income_group="High income: non-OECD" democ_score="46" infect_rate="26" ></row>
</rows>
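
XML can likewise be read in code by walking the tags. A sketch using Python’s built-in ElementTree parser, assuming the data above is saved as rows.xml:

import xml.etree.ElementTree as ET

tree = ET.parse("rows.xml")

# The attributes of each <row> tag form one record
for row in tree.getroot().findall("row"):
    print(row.get("country"), row.get("infect_rate"))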

Mr Data Converter is a web app made by Shan Carter of the graphics team at The New York Times that makes it easy to convert data from a spreadsheet or delimited text file to JSON or XML.

Copy the data from a CSV or tab-delimited text file and paste it into the top box, select the output you want, and it will appear at the bottom. You will generally want to select the Properties variants of JSON or XML.

You can then copy and paste this output into a text editor, and save the file with the appropriate extension (.xml, .json).

(Source: Mr Data Converter)

To convert data from JSON or XML into text files, you can use Open Refine. First create a new project and import your JSON or XML file. Use the Export button and select Tab-separated value or Comma-separated value to export as a text file.

Assignment

  • Grab the data for the top 100 ranked women’s singles tennis players.
  • Use Open Refine to process this data as follows:
  • Create new columns for First Name and Last Name. Hint: First create a copy of the Player column with a new name using Edit Column>Add column based on this column.... Then look under Edit column for an option to split this new column into two; you will also need to rename the resulting columns.
    • Convert the birth dates for the players to standard date/time format.
    • Create a new column for the Previous Rank with the square brackets removed, converted to numbers. Hint: First copy the old column as above; this time you can delete the old column when you are done.
  • Extract the operations to process this data, and save in a file with the extension .json.
  • Now go back to the WTA site and grab the singles rankings for all U.S. players for the first ranking of 2016 (made on January 4). Hint: Make sure you hit Search after adjusting the menus.
  • Process this data in Open Refine using your extracted JSON, then export the processed data as a CSV file.
  • Send me your JSON and CSV files.

Further reading

Paul Bradshaw. Scraping For Journalists

Dan Nguyen. The Bastards Book of Ruby
I use R or Python rather than Ruby, but this book provides a good introduction to the practice of web scraping using code, and using your browser’s web inspector to plan your scraping approach.

Hadley Wickham’s rvest package
This is the R package I usually use for web scraping.

Open Refine Wiki

Open Refine Documentation

Open Refine Recipes

Using GitHub

In this week’s class we will learn the basics of version control, so that you can work on your final projects in a clean folder with a single set of files, but can save snapshots of versions of your work at each point and return to them if necessary.

This avoids the hell of having to search through multiple versions of similar files. That, as Ben Welsh of the Los Angeles Times explains in this video, legendary in data journalism circles as “Ben’s rant,” is nihilism!

Version control was invented for programmers working on complex coding projects. But it is good practice for any project — even if all you are managing are versions of a simple website, or a series of spreadsheets.

This tutorial borrows from the Workflow and GitHub lesson in Jeremy Rue’s Advanced Coding Interactives class — see the further reading links below.

Introducing Git, GitHub and GitHub Desktop

The version control software we will use is called Git. It is installed automatically when you install and configure GitHub Desktop. GitHub Desktop is a point-and-click GUI that allows you to manage version control for local versions of projects on your own computer, and sync them remotely with GitHub. GitHub is a social network, based on Git, that allows developers to view and share one another’s code, and collaborate on projects.

Even if you are working on a project alone, it is worth regularly syncing to GitHub. Not only does this provide a backup copy of the entire project history in the event of a problem with your local version, but GitHub also allows you to host websites. This means you can go straight from a project you are developing to a published website. If you don’t already have a personal portfolio website, you can host one for free on GitHub.

The files we will use today

Download the files for this session from here, unzip the folder and place it on your desktop. It contains the following folders and files:

  • index.html index2.html Two simple webpages, which we will edit and publish on GitHub.
  • css fonts js Folders with files to run the Bootstrap web framework.

Some terminology

  • repository or repo Think of this as a folder for a project. A repository contains all of the project files, and stores each file’s revision history. Repositories on GitHub can have multiple collaborators and can be either public or private.
  • clone Copy a repository from GitHub to your local computer.
  • master This is the main version of your repository, created automatically when you make a new repository.
  • branch A version of your repository separate from the master branch. As you switch back and forth between branches, the files on your computer are automatically modified to reflect those changes. Branches are used commonly when multiple collaborators are working on different aspects of a project.
  • pull request Proposed changes to a repository submitted by a collaborator who has been working on a branch.
  • merge Taking the changes from one branch and applying them to another. This is often done after a pull request.
  • push or sync Submitting your latest commits to the remote repository on GitHub, and syncing any changes from there back to your computer.
  • gh-pages A special branch that is published on the web. This is how you host websites on GitHub. Even if a repository is private, its published version will be visible to anyone who has the url.
  • fork Split off a separate version of a repository. You can fork anyone’s code on GitHub to make your own version of their repo.

Here is a more extended GitHub glossary.
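
We will do everything through GitHub Desktop’s point-and-click interface, but for reference, here is roughly how this terminology maps onto plain Git commands in a terminal (the repository and branch names here are examples):

git clone https://github.com/username/my-first-repo.git   # copy a remote repo to your computer
git branch test-branch                                    # create a new branch
git checkout test-branch                                  # switch to that branch
git add index.html                                        # stage a changed file
git commit -m "Add a simple webpage"                      # record the change locally
git push origin test-branch                               # sync the branch to GitHub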

Create and secure your GitHub account

Navigate to GitHub and sign up:

Choose your plan. If you want to be able to create private repositories, which cannot be viewed by others on the web, you will need to upgrade to a paid account. But for now select a free account and click Continue:

At the next screen, click the skip this step link:

Now view your profile by clicking on the icon at top right and selecting Your profile. This is your page on GitHub. Click Edit profile to see the following:

Here you can add your personal details, and a profile picture. For now just add the name you want to display on GitHub. Fill in the rest in your own time after class.

You should have been sent a confirmation email to the address you used to sign up. Click on the verification link to verify this address on GitHub.

Back on the GitHub website, click on the Emails link in the panel at left. If you wish, you can add another email to use on GitHub, which will need to be verified as well. If you don’t wish to display your email on GitHub check the Keep my email address private box.

Now click on the Security link. I strongly recommend that you click on Set up two-factor authentication to set up this important security feature for your account. It will require you to enter a six-digit code sent to your phone each time you sign on from an unfamiliar device or location.

At the next screen, click Set up using SMS. Then enter your phone number, send a code to your phone and enter it where shown:

At the next screen click Download and print recovery codes. These will allow you to get back into your account if you lose your phone. Do print them out, keep them somewhere safe, and delete the file.

Open and authenticate GitHub desktop

Open the GitHub Desktop app. At the opening screen, click Continue:

Then add your GitHub login details:

You will then be sent a new two-factor authentication code which you will need to enter:

At the next screen, enter your name and email address if they do not automatically appear, click Install Command Line Tools, and then Continue:

Then click Done at the Find local repositories screen, as you don’t have local repositories to add.

The following screen should now open:

Your workspace contains one repo, which is an automated GitHub tutorial. Complete this in your own time if you wish. It will repeat many of the steps we will explore today.

Your first repository

Create a new repository on GitHub

Back on the GitHub website, go to your profile page, click on the Repositories tab and click New:

Fill in the next screen as follows, giving the repository a name and initializing it with a README file. Then click Create repository:

You should now see the page for this repo:

Notice that there is an Initial commit with a code consisting of a series of letters and numbers. There will be a code for each commit you make from now on.

Clone to GitHub desktop

Click on Clone or download and select Open in Desktop:

You should now be sent to the GitHub Desktop app, where you will be asked where on your computer you wish to clone the repo folder. Choose a location and click Clone:

Now you should be able to see the repo in the GitHub Desktop app:

You should also be able to find the folder you just cloned in the location you specified:

It contains a single file called README.md. This is a simple text file written in a language called Markdown, which we will explore shortly. You use this file to
write an extended description for your project, which will be displayed on the repo’s page on GitHub.

Make a simple change to the project

Add the file index.html to the project folder on your computer. Notice that you now have 1 Uncommitted Change in GitHub Desktop.

Click on that tab, and you should see the following screen:

GitHub Desktop highlights additions from your last commit in green, and deletions in red.

Commit that change, sync with GitHub

Write a summary and description for your commit, then click Commit to master:

Back in the History tab, you should now see two commits:

So far you have committed the change on your local computer, but you haven’t pushed it to GitHub. To do that, click the Sync button at top right.

Go to the project page on the GitHub website, refresh your browser if necessary, and see that there are now two commits, and the index.html file is in the remote repo:

Make a new branch, make a pull request, and merge with master

Back in GitHub Desktop, click on the new branch button at top left, and create a new branch called test-branch:

You can now switch between your two branches using the button to the immediate right of the new branch button, which will display either master or test-branch. Do pay close attention to which branch you are working in!

Here I am working in the test branch, having made the edit below:

While in the test-branch on GitHub Desktop, open index.html in a text editor. Delete <p>I'm a paragraph</p> and replace it with the following:

<h2>Hello again!</h2>
<p>I'm a new paragraph</p>

Save the file, then return to GitHub Desktop to view the changes in the test-branch.

Now switch to the master branch and look at the file index.html in your text editor. It should have reverted to the earlier version, because you haven’t merged the change in test-branch with master.

Switch back to test-branch in GitHub Desktop, and commit the change as before with an appropriate summary and description:

Click the Pull Request button at top right and then Send Pull Request:

You should now be able to see the pull request on GitHub:

Click Compare & pull request to see the following screen:

If another collaborator had made this pull request, you might merge this into master online and then sync your local version of the repo with the remote to incorporate it.

However, you made this pull request, so Close pull request and return to GitHub Desktop. In the master branch, click Compare at top left. Select test-branch and then click Update from test-branch. This should merge the changes from test-branch into master:

Make sure you are in the master branch on GitHub Desktop, then view the file in your text editor to confirm that it is now the version you edited in test-branch.

Make a gh-pages branch, and publish to the web

In your master branch on GitHub Desktop, make a branch called gh-pages:

In this branch, click the Publish button at top right.

Now go to the GitHub repo page, refresh your browser if necessary, and notice that this branch now exists there:

Now go to the url https://[username].github.io/my-first-repo/, where [username] is your GitHub user name, and the webpage index.html should be online:

Introducing Markdown, Haroopad, and Bootstrap

Markdown provides a simple way to write text documents that can easily be converted to HTML, without having to worry about writing all of the tags to produce a properly formatted web page.

Haroopad is a Markdown editor that we will use to edit the README.md for our repos, and also author some text to add to a simple webpage.

Bootstrap is a web framework that allows you to create responsively designed websites that work well on all devices, from phones to desktop computers. It was originally developed by Twitter.

(I used Bootstrap to make this class website, writing the class notes in Markdown using Haroopad.)

Edit your README, make some more changes to your repo, commit and sync with GitHub

Open README.md in Haroopad. The Markdown code is shown on the left, and how it renders as HTML on the right:

Now edit to the following:

# My first repository

### This is the repo we have been using to learn GitHub.

Here is some text. Notice that it doesn't have the # used to denote various weights of HTML headings (You can use up to six #).

And here is a [link](http://severnc6.github.io/my-first-repo) to the `gh-pages` website for this repo.

*Here* is some italic text.

**Here** is some bold text.

And here is a list:
- first item
- second item
 - sub-item
- third item

This should display in Haroopad like this:

See here for a more comprehensive guide to using Markdown.

Save README.md in Haroopad and close it.

With GitHub Desktop in the master branch, delete index.html from your repo, and copy into the repo the file index2.html and the folders js, css, and fonts. Rename index2.html to index.html.

You now have a template Bootstrap page with a navigation bar at the top. Open in a browser, and it should look like this:

The links in the dropdown menu are currently to pages that don’t exist, and the email link will send a message to me. Open in a text editor to view the code for the page:

Open a new file in Haroopad, edit to add the following and save into your repo as index-text.md:

# A Bootstrap webpage

### It has a subheading

And also some text.

From the top menu in Haroopad, select File>Export...>HTML and notice that it has been saved as a webpage in your repo on your computer.

We just want to take the text from the web page and copy it into our index.html page. To do this, select File>Export...>Plain HTML from the top menu in Haroopad, open index.html in your text editor, position your cursor immediately below the <div class="container"> tag, and press ⌘-V to paste in the HTML for the text we wrote in Haroopad.

Save index.html and view in your browser.

In GitHub Desktop, view the uncommitted changes, Commit to master and Sync to GitHub.

Now switch to the gh-pages branch, Update from master and Sync:

Both the master and gh-pages branches should now be updated on GitHub:

Follow the link we included in the README, and you’ll be sent to the hosted webpage, at https://[username].github.io/my-first-repo/, where [username] is your GitHub user name.

Next steps with Bootstrap

W3Schools has a tutorial here, and Jeremy Rue has a tutorial here. The key to responsive design with Bootstrap is its grid system, which allows up to 12 columns across a page. This section of the W3schools tutorial explains how to use the grid system to customize layouts for different devices.

This site helps you customize a Bootstrap navigation bar.

There are various sites on the web that provide customized Bootstrap themes — some free, some not. Search “Bootstrap themes” to find them. A theme is a customized version of Bootstrap that can be used as a starting point for your own website. Jeremy Rue has also created a suggested portfolio theme.

Assignment

  • Create a repository on GitHub to host your final project and clone to your computer so you can manage the project in GitHub Desktop.
  • In Markdown, write a pitch for your final project in a file called project-pitch.md. Your final project accounts for 45% of your grade for this class, so it’s important that you get off to a good start with a substantial and thoughtful pitch. You will also be graded separately on this pitch assignment.
    • Explain the goals of your project.
    • Detail the data sources you intend to use, and explain how you intend to search for data if you have not identified them.
    • Identify the questions you wish to address.
    • Building from these questions, provide an initial outline of how you intend to visualize the data, describing the charts/maps you are considering.
  • Using Haroopad, save the Markdown document as an HTML file with the same name.
  • Create a gh-pages branch for your repository and publish it on GitHub. View the webpage created at http://[username].github.io/[project]/project-pitch.html where [username] is your GitHub user name and [project] is the name of your project repository.
  • Share that url with me so I can read your project pitch and provide feedback.

Further reading

Workflow and Github
Lesson from Jeremy Rue’s Advanced Coding Interactives class.

Getting Started with GitHub Desktop

Getting Started with GitHub Pages
This explains how you can create web pages automatically from GitHub. However, I recommend authoring them locally, as we covered in class.

Git Reference Manual

Getting started with Bootstrap

W3Schools Bootstrap tutorial

Using Bootstrap Framework For Building Websites
Lesson from Jeremy Rue’s Intro to Multimedia Web Skills class.

Graphical Analysis & Exploration

Introducing Tableau Public

In this tutorial we will work with Tableau Public, which allows you to create a wide variety of interactive charts, maps and tables and organize them into dashboards and stories that can be saved to the cloud and embedded on the web.

The free Public version of the software requires you to save your visualizations to the open web. If you have sensitive data that needs to be kept within your organization, you will need a license for the Desktop version of the software.

Tableau was developed for exploratory graphical data analysis, so it is a good tool for exploring a new dataset — filtering, sorting and summarizing/aggregating the data in different ways while experimenting with various chart types.

Although Tableau was not designed as a publication tool, the ability to embed finished dashboards and stories has also allowed newsrooms and individual journalists lacking JavaScript coding expertise to create interactive online graphics.

The data we will use today

Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following file:

  • nations.csv Data on neonatal mortality and related indicators for the world’s nations, over time.

Visualize the data on neonatal mortality

Connect to the data

Launch Tableau Public, and you should see the following screen:

Under the Connect heading at top left, select Text File, navigate to the file nations.csv and Open. At this point, you can view the data, which will be labeled as follows:

  • Text: Abc
  • Numbers: #
  • Dates: calendar symbol
  • Geography: globe symbol

You can edit fields to give them the correct data type if there are any problems:

Once the data has loaded, click Sheet 1 at bottom left and you should see a screen like this:

Dimensions and measures: categorical and continuous

The fields should appear in the Data panel at left. Notice that Tableau has divided the fields into Dimensions and Measures. These broadly correspond to categorical and continuous variables. Dimensions are fields containing text or dates, while measures contain numbers.

If any field appears in the wrong place, click the small downward-pointing triangle that appears when it is highlighted and select Convert to Dimension or Convert to Measure as required.

Shelves and Show Me

Notice that the main panel contains a series of “shelves,” called Pages, Columns, Rows, Filters and so on. Tableau charts and maps are made by dragging and dropping fields from the data into these shelves.

Over to the right you should see the Show Me panel, which will highlight chart types you can make from the data currently loaded into the Columns and Rows shelves. It is your go-to resource when experimenting with different visualization possibilities. You can open and close this panel by clicking on its title bar.

Columns and rows: X and Y axes

The starting point for creating any chart or map in Tableau is to place fields into Columns and Rows, which for most charts correspond to the X and Y axes, respectively. When making maps, longitude goes in Columns and latitude in Rows. If you display the data as a table, then these labels are self-explanatory.

Some questions to ask this data

  • How has the total number of neonatal deaths changed over time, globally, regionally and nationally?
  • How has the neonatal death rate for each country changed over time?

Create new calculated variables

The data contains fields on birth and neonatal death rates, but not the total numbers of births and deaths, which must be calculated. From the top menu, select Analysis>Create Calculated Field. Fill in the dialog box as follows (just start typing a field name to select it for use in a formula):

Notice that calculated fields appear in the Data panel preceded by an = symbol.

Now create a second calculated field giving the total number of neonatal deaths:

In the second formula, we have rounded the number of neonatal deaths to the nearest thousand using -3. (-2 would round to the nearest hundred, -1 to the nearest ten, 1 to one decimal place, 2 to two decimal places, and so on.)
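
To make the rounding digit concrete, here is how ROUND behaves for some illustrative values:

ROUND(123456, -3) returns 123000
ROUND(123456, -2) returns 123500
ROUND(123456.789, 1) returns 123456.8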

Here we have used only simple arithmetic, but it’s possible to use a wide variety of functions to manipulate data in Tableau in many ways. To see all of the available functions, click on the little grey triangle at the right of the dialog boxes above.

Understand that Tableau’s default behaviour is to summarize/aggregate data

As we work through today’s exercise, notice that Tableau routinely summarizes or aggregates measures that are dropped into Columns and Rows, calculating a SUM or AVG (average or mean), for example.

This behaviour can be turned off by selecting Analysis from the top menu and unchecking Aggregate Measures. However, I do not recommend doing this, as it will disable some Tableau functions. Instead, if you don’t want to summarize all of the data, drop categorical variables into the Detail shelf so that any summary statistic will be calculated at the correct level for your analysis. If necessary, you can set the aggregation so it is being performed on a single data point, and therefore has no effect.

Make a series of treemaps showing neonatal deaths over time

A treemap allows us to directly compare the neonatal deaths in each country, nested by region.

Drag Country and Region onto Columns and Neonatal deaths onto Rows. Then open Show Me and select the treemap option. The initial chart should look like this:

Look at the Marks shelf and see that the size and color of the rectangles reflect the SUM of Neonatal deaths for each country, while each rectangle is labeled with Region and Country:

Now drag Region to Color to remove it from the Label and colour the rectangles by region, using Tableau’s default qualitative colour scheme for categorical data:

For a more subtle color scheme, click on Color, select Edit Colors... and at the dialog box select the Tableau Classic Medium qualitative color scheme, then click Assign Palette and OK.

(Tableau’s qualitative color schemes are well designed, so there is no need to adopt a ColorBrewer scheme. However, it is possible to edit colors individually as you wish.)

Click on Color and set transparency to 75%. (For your assignment you will create a chart with overlapping circles, which will benefit from using some transparency to allow all circles to be seen. So we are setting transparency now for consistency.)

The treemap should now look like this:

Tableau has by default aggregated Neonatal deaths using the SUM function, so what we are seeing is the number for each country added up across the years.

To see one year at a time, we need to filter by year. If you drag the existing Year variable to the Filters shelf, you will get the option to filter by a range of numbers, which isn’t what we need:

Instead, we need to be able to check individual years, and draw a treemap for each one. To do that, select Year in the Dimensions panel and Duplicate.

Select the new variable and Convert to Discrete and then Rename it Year (discrete). Now drag this new variable to Filters, select 2014, and click OK:

The treemap now displays the data for 2014:

That’s good for a snapshot of the data, but with a little tinkering, we can adapt this visualization to show change in the number of neonatal deaths over time at the national, regional and global levels.

Select Year (discrete) in the Filters shelf and Filter ... to edit the filter. Select all the years with even numbers and click OK:

Now drag Year (discrete) onto Rows and the chart should look like this:

The formatting needs work, but notice that we now have a bar chart made out of treemaps.

Extend the chart area to the right by changing from Standard to Entire View on the dropdown menu in the top ribbon:

I find it more intuitive to have the most recent year at the top, so select Year (discrete) in the Rows shelf, select Sort and fill in the dialog box so that the years are sorted in Descending order:

The chart should now look like this:

We will create a map to serve as a legend for the regions, so click on the title bar for the color legend and select Hide Card to remove it from the visualization.

To remove some clutter from the chart, select Format>Borders from the top menu, and under Sheet>Row Divider, set Pane to None. Then close the Format Borders panel.

Right-click on the Sheet 1 title for the chart and select Hide Title. Also right-click on Year (discrete) at the top left of the chart and select Hide Field Labels for Rows. Then hover just above the top bar to get a double-arrowed drag symbol and drag upwards to reduce the white space at the top. You may also want to drag the bars a little closer to the year labels.

The labels will only appear in the larger rectangles. Rather than removing them entirely, let’s just leave a label for India in 2014, to make it clear that this is the country with by far the largest number of neonatal deaths. Click on Label in the Marks shelf, and switch from All to Selected under Marks to Label. Then right-click on the rectangle for India in 2014, and select Mark Label>Always Show. The chart should now look like this:

Hover over one of the rectangles, and notice the tooltip that appears. By default, all the fields we have used to make the visualization appear in the tooltip. (If you need any more, just drag those fields onto Tooltip.) Click on Tooltip and edit as follows. (Unchecking Include command buttons disables some interactivity, giving a plain tooltip):

Save to the web

Right-click on Sheet 1 at bottom left and Rename Sheet to Treemap bar chart. Then select File>Save to Tableau Public... from the top menu. At the logon dialog box enter your Tableau Public account details, give the Workbook a suitable name and click Save. When the save is complete, a view of the visualization on Tableau’s servers will open in your default browser.

Make a map to use as a colour legend

Select Worksheet>New Worksheet from the top menu, and double-click on Country. Tableau recognizes the names of countries and states/provinces; for the U.S., it also recognizes counties. Its default map-making behaviour is to put a circle at the geographic centre, or centroid, of each area, which can be scaled and coloured to reflect values from the data:

However, we need each country to be filled with colour by region. In the Show Me tab, switch to the filled maps option, and each nation should fill with colour. Drag Region to Colour and see how the same colour scheme we used previously carries over to the map. Click on Colour, set the transparency to 75% to match the other charts and remove the borders. Also click on Tooltip and uncheck Show tooltip so that no tooltip appears on the legend.

We will use this map as a colour legend, so its separate colour legend is unnecessary. Click the colour legend’s title bar and select Hide Card to remove it from the visualization. Also remove the Sheet 2 title as before.

Centre the map in the view by clicking on it, holding and panning, just as you would on Google Maps. It should now look something like this:

Rename the worksheet Map legend and save to the web again.

Make a line chart showing neonatal mortality rate by country over time

To address our second question, and explore the neonatal death rate over time by country, we can use a line chart.

First, rename Neonat Mortal as Neonatal death rate (per 1,000 births). Then, open a new worksheet, drag this variable to Rows and Year to Columns. The chart should now look like this:

Tableau has aggregated the data by adding up the rates for each country in every year, which makes no sense here. So drag Country to Detail in the Marks shelf to draw one line per country:

Drag Region to Colour and set the transparency to 75%.

Now right-click on the X axis, select Edit Axis, edit the dialog box as follows and click OK:

Right-click on the X axis again, select Format, change Alignment/Default/Header/Direction to Up and use the dropdown menu to set the Font to bold. Also remove the Sheet 3 title.

The chart should now look like this:

We can also highlight the countries with the highest total number of neonatal deaths by dragging Neonatal deaths to Size. The chart should now look like this:

This line chart shows that the trend in most countries has been to reduce neonatal deaths, while some countries have had more complex trajectories. But to make comparisons between individual countries, it will be necessary to add controls to filter the chart.

Tableau’s default behaviour when data is filtered is to redraw charts to reflect the values in the filtered data. So if we want the Y axis and the line thicknesses to stay the same when the chart is filtered, we need to freeze them.

To freeze the line thicknesses, hover over the title bar for the line thickness legend, select Edit Sizes... and fill in the dialog box as follows:

Now remove this legend from the visualization, together with the colour legend. We can later add an annotation to our dashboard to explain the line thickness.

To freeze the Y axis, right-click on it, select Edit Axis..., make it Fixed and click OK:

Right-click on the Y axis again, select Format... and increase the font size to 10pt to make it easier to read.

Now drag Country to Filters, make sure All are checked, and at the dialog box, click OK:

Now we need to add a filter control to select countries to compare. On Country in the Filters shelf, select Show Filter. A default filter control, with a checkbox for each nation, will appear to the right of the chart:

This isn’t the best filter control for this visualization. To change it, click on the title bar for the filter, note the range of filter controls available, and select Multiple Values (Custom List). This allows users to select individual countries by starting to type their names. Then select Edit Title... and add some text explaining how the filter works:

Take some time to explore how this filter works.

Rename Income to Income group. Then add Region and Income group to Filters, making sure that All options are checked for each. Select Show Filter for both of these filters, and select Single Value Dropdown for the control. Reset both of these filters to All, and the chart should now look like this:

Notice that the Income group filter lists the options in alphabetical order, rather than income order, which would make more sense. To fix this, right-click on Income group in the data panel and select Default Properties>Sort. At the dialog box below, select Manual sort, edit the order as follows and click OK:

The chart should now look like this:

Finally, click on Tooltip and edit as follows:

Rename the sheet Line chart and save to the web.

Make a dashboard combining both charts

From the top menu, select Dashboard>New Dashboard. Set its Size to Automatic, so that the dashboard will fill to the size of any screen on which it is displayed:

To make a dashboard, drag charts and other elements from the left-hand panel to the dashboard area. Notice that Tableau allows you to add items including horizontal and vertical containers, text boxes, images (useful for adding a publication’s logo), embedded web pages and blank space. These can be added Tiled, which means they cannot overlap, or Floating, which allows one element to be placed over another.

Drag Treemap bar chart from the panel at left to the main panel. The default title, from the worksheet name, isn’t very informative, so right-click on that, select Edit Title ... and change to Total deaths.

Now add Line chart to the right of the dashboard (the gray area will show where it will appear) and edit its title to Death rates. Also add a note to explain that line widths are proportional to the total number of deaths. The dashboard should now look like this:

Notice that the Country, Region and Income group filters control only the line chart. To make them control the treemaps, too, click on each filter, open the dropdown menu from the downward-pointing triangle, select Apply to Worksheets>Selected Worksheets... and fill in the dialog box as follows:

The filters will now control both charts.

Add Map legend for a color legend at bottom right. (You will probably need to drag the window for the last filter down to push it into position.) Hide the legend’s title, then right-click on the map and select Hide View Toolbar to remove the map controls.

We can also allow the highlighting of a country on one chart to be carried across the entire dashboard. Select Dashboard>Actions... from the top menu, and at the first dialog box select Add action>Highlight. Filling the second dialog box as follows will cause each country to be highlighted across the dashboard when it is clicked on just one of the charts:

Click OK on both dialog boxes to apply this action.

Select Dashboard>Show Title from the top menu. Right-click on it, select Edit Title... and change from the default to something more informative:

Now drag a Text box to the bottom of the dashboard and add a footnote giving source information:

The dashboard should now look like this:

Design for different devices

This dashboard works on a large screen, but not on a small phone. To see this, click the Device Preview button at top left and select Phone under Device type. In portrait orientation, this layout does not work at all:

Click the Add Phone Layout button at top right, and then click the Custom tab under Layout - Phone in the left-hand panel. You can then rearrange and if necessary remove elements for different devices. Here I have removed the line chart and filter controls, and changed the legend to a Floating element so that it sits in the blank space at the top right of the bar chart of treemaps.

Now save to the web once more. Once the dashboard is online, use the Share link at the bottom to obtain an embed code, which can be inserted into the HTML of any web page.

(You can also Download a static view of the graphic as a PNG image or a PDF.)

You can download the workbook for any Tableau visualization by clicking the Download Workbook link. The files (which will have the extension .twbx) will open in Tableau Public.

Having saved a Tableau visualization to the web, you can reopen it by selecting File>Open from Tableau Public... from the top menu.

Another approach to responsive design

As an alternative to using Tableau’s built-in device options, you may wish to create three different dashboards, each with a size appropriate for phones, tablets, and desktops respectively. You can then follow the instructions here to put the embed codes for each of these dashboards into a div with a separate class, and then use @media CSS rules to ensure that only the div with the correct dashboard displays, depending on the size of the device.
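
A minimal sketch of that approach (the class names and breakpoints here are illustrative; each div would hold the embed code for one dashboard):

<style>
  .dash-phone, .dash-tablet, .dash-desktop { display: none; }
  @media (max-width: 500px) { .dash-phone { display: block; } }
  @media (min-width: 501px) and (max-width: 1024px) { .dash-tablet { display: block; } }
  @media (min-width: 1025px) { .dash-desktop { display: block; } }
</style>
<div class="dash-phone"><!-- embed code for the phone dashboard --></div>
<div class="dash-tablet"><!-- embed code for the tablet dashboard --></div>
<div class="dash-desktop"><!-- embed code for the desktop dashboard --></div>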

If you need to make a fully responsive Tableau visualization and are struggling, contact me for help!

When making responsively designed web pages, make sure to include this line of code between the <head></head> tags of your HTML:

<meta name="viewport" content="width=device-width, initial-scale=1.0">

From dashboards to stories

Tableau also allows you to create stories, which combine successive dashboards into a step-by-step narrative. Select Story>New Story from the top menu. Having already made a dashboard, you should find these simple and intuitive to create. Select New Blank Point to add a new scene to the narrative.

Assignment

  • Create this second dashboard from the data. Here are some hints:
    • Drop Year into the Pages shelf to create the control to cycle through the years.
    • You will need to change the Marks to solid circles and scale them by the total number of neonatal deaths. Having done so, you will also need to increase the size of all circles so countries with small numbers of neonatal deaths are visible. Good news: Tableau’s default behavior is to size circles correctly by area, so they will be the correct sizes, relative to one another.
    • You will need to switch to a Logarithmic X axis and alter/fix its range.
    • Format GDP per capita in dollars by clicking on it in the Data panel and selecting Default Properties>Number Format>Currency (Custom).
    • Create a single trend line for each year’s data, so that the line shifts with the circles from year to year. Do this by dragging Trend line into the chart area from the Analytics panel. You will then need to select Analysis>Trend Lines>Edit Trend Lines... and adjust the options to give a single line with the correct behavior.
    • Getting the smaller circles rendered on top of the larger ones, so their tooltips can be accessed, is tricky. To solve this, open the dropdown menu for Country in the Marks shelf, select Sort and fill in the dialog box as follows. Then drag Country so it appears at the top of the list of fields in the Marks shelf.

    This should be a challenging exercise that will help you learn how Tableau works. If you get stuck, download my visualization and study how it is put together.

  • By next week’s class, send me the url for your second dashboard. (Don’t worry about designing for different devices.)

Further reading/viewing

Tableau Public training videos

Gallery of Tableau Public visualizations: Again, you can download the workbooks to see how they were put together.

Tableau Public Knowledge Base: Useful resource with the answers to many queries about how to use the software.

Network Analysis with Gephi

Today, network analysis is being used to study a wide variety of subjects, from how networks of genes and proteins influence our health to how connections between multinational companies affect the stability of the global economy. Network graphs can also be used to great effect in journalism to explore and illustrate connections that are crucial to public policy, or directly affect the lives of ordinary people.

Do any of these phrases have resonance for you?

  • The problem with this city is that it’s run by X’s cronies.
  • Contracts in this town are all about kickbacks.
  • Follow the money!
  • It’s not what you know, it’s who you know.

If so, and you can gather relevant data, network graphs can provide a means to display the connections involved in a way that your audience (and your editors) can readily understand.

The data we will use

Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following folders and files:

  • friends.csv A simple network documenting relationships among a small group of people.
  • senate_113-2013.gexf senate-113-2014.gexf Two files with data on voting patterns in the U.S. Senate, detailing the number and percentage of times pairs of Senators voted the same way in each year.
  • senate_one_session.py Python script that will scrape a single year’s data from GovTrack.US; modified from a script written by Renzo Lucioni.
  • senate Folder containing files and code to make an interactive version of the Senate voting network.

Network analysis: the basics

At its simplest level, network analysis is very straightforward. Network graphs consist of edges (the connections) and nodes (the entities that are connected).

One important consideration is whether the network is “directed,” “undirected,” or a mixture of the two. This depends upon the nature of the connections involved. If you are documenting whether people are Facebook friends, for instance, and have no information about who made the original friend request, the connections have no obvious direction. But when considering following relationships on Twitter, there is a clear directionality to each relationship: A fan might follow Taylor Swift, for example, but she probably doesn’t follow them back.

The connections in undirected graphs are typically represented by simple lines or curves, while directed relationships are usually represented by arrows.

Here, for example, I used a directed network graph to illustrate patterns of citation of one another’s work by researchers working on a type of stem cell that later won its discoverer a Nobel prize. Notice that in some cases there are arrows going in both directions, because each researcher had frequently cited the other:

(Source: New Scientist)

Nodes and edges can each have data associated with them, which can be represented using size, color and shape.

Network algorithms and metrics

Networks can be drawn manually, with the nodes placed individually to give the most informative display. However, network theorists have devised layout algorithms to automate the production of network graphs. These can be very useful, especially when visualizing large and complex networks.

There are also a series of metrics that can quantify aspects of a network. Here are some examples, which measure the importance of nodes within a network in slightly different ways:

  • Degree is a simple count of the number of connections for each node. For directed networks, it is divided into In-degree, for the number of incoming connections, and Out-degree, for outgoing connections. (In my stem cell citation network, In-Degree was used to set the size of each node.)
  • Eigenvector centrality accounts not only for the node’s own degree, but also the degrees of the nodes to which it connects. As such, it is a measure of each node’s wider “influence” within the network. Google’s PageRank algorithm, which rates the importance of web pages according to the links they receive, is a variant of this measure.
  • Betweenness centrality essentially reveals how important each node is in providing a “bridge” between different parts of the network: It counts the number of times each node appears on the shortest path between two other nodes. It is particularly useful for highlighting the nodes that, if removed, would cause a network to fall apart.
  • Closeness centrality is a measure of how close each node is, on average, to all of the other nodes in a network. It highlights the nodes that connect to the others through a smaller number of edges. The Kevin Bacon Game, in which you have to connect Bacon to other movie actors through the fewest movies, based on co-appearances, works because he has a high closeness centrality in this network.
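
For a sense of how these metrics behave in practice, here is a sketch using Python and the networkx library (an outside suggestion, not something we will use in class) to calculate each measure for a toy network:

import networkx as nx

# A toy undirected network: a triangle of friends plus one hanger-on
G = nx.Graph([("Ann", "Bob"), ("Bob", "Carol"), ("Carol", "Ann"), ("Carol", "Dave")])

print(nx.degree_centrality(G))       # normalized count of connections per node
print(nx.eigenvector_centrality(G))  # weights connections by neighbors' importance
print(nx.betweenness_centrality(G))  # how often a node bridges shortest paths
print(nx.closeness_centrality(G))    # average nearness to all other nodes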

Network data formats

The most basic data needed to draw a network is an “edge list” — a list of pairs of nodes that connect within the network, which can be created in a spreadsheet with two columns, one for each member of each pair of nodes.

There are also a number of dedicated data formats used to record data about networks, which can store a variety of data about both edges and nodes. Here are two of the most common:

GEXF is a variant of XML, and is the native data format for Gephi, the network visualization software we will use today.

GraphML is another, older XML format used for storing and exchanging network graph data.

For visualizing networks online, it often makes sense to save them as JSON, which keeps file size small and works well with JavaScript visualization libraries.

Introducing Gephi

Gephi is a tool designed to draw, analyze, filter and customize the appearance of network graphs according to qualitative and quantitative variables in the data.

Gephi allows you to deploy layout algorithms, or to place nodes manually. It can calculate network metrics, and lets you use the results of these analyses to customize the appearance of your network graph.

Finally, Gephi allows you to create publication-quality vector graphics of your network visualizations, and to export your filtered and analyzed networks in data formats that can be displayed interactively online, using JavaScript visualization libraries.

Launch Gephi, and you will see a screen like this:

(You may also see an initial welcome window, allowing you to load recently used or sample data. You can close this.)

Install Gephi plugins

Gephi has a series of plugins that extend its functionality — you can browse the available options here.

We will install a plugin that we will later use to export data from Gephi as JSON. Select Tools>Plugins from the top menu, and the Plugins window should open:

In the Available Plugins tab, look for the JSONExporter plugin — you can use the Search box to find it, if necessary. Then click Install.

After installing plugins, you may be prompted to restart Gephi, which you should do.

If you do not find the plugin you are looking for, close Gephi and browse for the plugin at the Gephi marketplace, where you can download it manually. Then relaunch Gephi, select Tools>Plugins from the top menu and go to the Downloaded tab. Click the Add Plugins ... button, and navigate to where the plugin was saved on your computer — it should be in a zipped folder or have an .nbm file extension. Then click Install and follow the instructions that appear.

See here for more instructions on installing Gephi plugins.

Make a simple network graph illustrating connections between friends

Having launched Gephi, click on Data Laboratory. This is where you can view and edit raw network data. From the top menu, select File>New Project, and the screen should look like this:

Notice that you can switch between viewing Nodes and Edges, and that there are buttons to Add node and Add edge, which allow you to construct simple networks by manual data entry. Instead, we will import a simple edge list, to show how Gephi will add the nodes and draw the network from this basic data.

To do this, click the Import Spreadsheet button — which actually imports CSV files, rather than spreadsheets in .xls or .xlsx format. Your CSV file should at a minimum have two columns, headed Source and Target. By default, Gephi will create a directed network from an edge list, with arrows pointing from Source to Target nodes. If some or all of your connections are undirected, include a third column called Type and fill the rows with Undirected or Directed, as appropriate.
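
A file mixing directed and undirected connections might therefore look like this (illustrative rows, not necessarily the contents of friends.csv):

Source,Target,Type
Ann,Bob,Directed
Bob,Carol,Undirected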

At the dialog box shown below, navigate to the file friends.csv, and ensure that the data is going to be imported as an Edges table:

Click Next> and then Finish, and notice that Gephi has automatically created a Nodes table from the information in the Edges table:

In the Nodes table, click the Copy data to other column button at the bottom of the screen, select Id and click OK to copy to Label. This column can later be used to put labels on the network graph.

Now click Add column, call it Gender and keep its Type as String, because we are going to enter text values, rather than numbers. This column can later be used to color the friends according to their gender.

Having created the column, double-click on each row and enter F or M, as appropriate:

Now switch to the Edges table, and notice that each edge has been classed as Directed:

This would make sense if, for example, we were looking at dinner invitations made by the source nodes. Notice that in this network, not only has Ann invited Bob to dinner, but Bob has also invited Ann. For each of the other pairs, the invitations have not been reciprocated.

Now click Overview to go back to the main graph view, where a network graph should now be visible. You can use your mouse/trackpad to pan and zoom. On my trackpad, right-click and hold enables panning, while the double-finger swipe I would normally use to scroll enables zoom. Your settings may vary!

Note also that the left of the two sliders at bottom controls the size of the edges, and that individual nodes can be clicked and moved to position them manually. Below I have arranged the nodes so that none of the edges cross over one another. The Context panel at top right gives basic information about the network:

Click on the dark T button at bottom to call up labels for the nodes, and use the right of the two sliders to control their size. The light T button would call up edge labels, if they were set.

Turn off the labels once more, and we will next color the nodes according to the friends’ gender.

Notice that the panel at top left contains two tabs, Partition and Ranking. The former is used to style nodes or edges according to qualitative variables, the latter according to quantitative variables. Select Partition>Nodes, hit the Refresh button with the circling green arrows, select Gender and hit the Run button with the green “play” symbol. The nodes should now be colored by gender, and you may find that the edges also take the color of the source node:

To turn off this behavior, click this button at the bottom of the screen: (the button to its immediate left allows edge visibility to be turned on and off).

Select File>New Project and you will be given the option to save your project before closing. You can also save your work at any time by selecting File>Save or by using the usual ⌘-S or Ctrl-S shortcut.

Visualize patterns of voting in the U.S. Senate

Having learned these basics, we will now explore a more interesting network, based on voting patterns in the U.S. Senate in 2014.

Select File>Open from the top menu and navigate to the file senate-113-2014.gexf. The next dialog box will give you some information about the network being imported, in this case telling you that it is an undirected network containing 101 nodes and 5,049 edges:

Once the network has imported, go to the Data Laboratory to view and examine the data for the Nodes and Edges. Notice that each edge has a column called percent_agree, which is the number of times each pair of Senators voted the same way, divided by the total number of votes in the chamber in 2014, giving a number between 0 and 1 (hypothetically, a pair agreeing on 500 of 657 votes would score 500/657 ≈ 0.76):

Click the Configuration button, and ensure that Visible graph only is checked. When we start filtering the data, this will ensure that the data tables show the filtered network, not the original.

Now return to the Overview, where we will use a layout algorithm to alter the appearance of the network. In the Layout panel at bottom left, choose the Fruchterman Reingold layout algorithm and click Run. (I know from prior experimentation that this algorithm gives a reasonable appearance for this network, but do experiment with different options if working on your own network graphs in future.) Note also that there are options to change the parameters of the algorithm, such as the “Gravity” with which connected nodes attract one another. We will simply accept the default options, but again you may want to experiment with different values for your own projects.

When the network settles down, click Stop to stabilize it. The network should look something like this:

This looks a little neater than the initial view, but is still a hairball that tells us little about the underlying dynamics of voting in the Senate. This is because almost all Senators voted the same way at least once, so each one is connected to almost all of the others.

So now we need to filter the network, so that edges are not drawn if Senators voted the same way less often. Select the Filters tab in the main panel at right, and select Attributes>Range, which gives the option to filter on percent_agree. Double-click on percent_agree, to see the following under Queries:

The range can be altered using the sliders, but we will instead double-click on the value for the bottom of the range, and manually edit it to 0.67:

This will draw edges between Senators only if they voted the same way in at least two-thirds of the votes in 2014. Hit Filter, and watch many of the edges disappear. Switch to the Data Laboratory view, and see how the Edges table has also changed. Now return to the Overview, Run the layout algorithm again, and the graph should change to look something like this:

Now the network is organized into two clusters, which are linked through only a single Senator. These are presumably Democrats and Republicans, which we can confirm by coloring the nodes by party in the Partition tab at top left:

To customize the colors, click on each square in the Partition tab, then Shift-Ctrl and click to call up the color selector:

Make Democrats blue (Hex: 0000FF), Republicans red (Hex: FF0000) and the two Independents orange (Hex: FFAA00).

Let’s also reconfigure the network so that the Democrats are on the left and the Republicans on the right. Run the layout algorithm again, and with it running, click and drag one of the outermost Democrats, until the network looks something like this:

Now is a good time to save the project, if you have not done so already.

Next we will calculate some metrics for our new, filtered network. If we are interested in highlighting the Senators who are most bipartisan, then Betweenness centrality is a good measure — remember that it highlights “bridging” nodes that prevent the network from breaking apart into isolated clusters.

Select the Statistics tab in the main right-hand panel, and then Edge Overview>Avg. Path Length>Run. Click OK at the next dialog box, close the Graph Distance Report, and go to the Data Laboratory view. Notice that new columns, including Betweenness Centrality, have appeared in the Nodes table:

Switch back to the Overview, and select the Ranking tab on the top left panel. Choose Betweenness Centrality as the ranking parameter for Nodes, select the gem-like icon, which controls the size of nodes, and select a minimum and maximum size for the range.

Click Apply, and the network should look something like this:

You may at this point want to switch on the labels, and note that Susan Collins, the Republican from Maine, was the standout bipartisan Senator in 2014.

Now switch to Preview, which is where the appearance of the network graph can be polished before exporting it as a vector graphic. Click Refresh to see the network graphic drawn with default options:

You can then customize using the panel on the left, clicking Refresh to review each change. Here I have removed the nodes’ borders, by setting their width to zero, changed the edges from the default curved to straight, and reduced edge thickness to 0.5:

Export the finished network as vector graphics and for online visualization

Export the network graph in SVG or PDF format using the button at bottom left, or by selecting File>Export>SVG/PDF/PNG file... from the top menu.

Now select File>Export>Graph file... and save as JSON (this option is available through the JSONExporter plugin we installed earlier). Make sure to select Graph>Visible only at the dialog box, so that only the filtered network is exported:

Now try to repeat the exercise with the 2013 data!

Introducing Sigma.js

Network graphs can be visualized using several JavaScript libraries including D3 (see here for an example). However, we will use the Sigma.js JavaScript library, which is specifically designed for the purpose, and can more easily handle large and complex networks.

Make your own Sigma.js interactive network

I have provided a basic Sigma.js template, prepared with the generous help of Alexis Jacomy, the author of Sigma.js. This is in the senate folder.

Save the JSON file you exported from Gephi in the data subfolder with the name senate.json, then open the file index.html. (The page loads the JSON with an HTTP request, so if your browser blocks requests to local files and the page stays blank, serve the folder from a simple local web server instead.) The resulting interactive should look like this. Notice that when you hover over each node, its label appears, and its direct connections remain highlighted, while the rest of the network is grayed out:

Open index.html in your preferred text editor, and you will see this code:

<!DOCTYPE html>
<html>

<head>

  <meta charset=utf-8 />
  <title>U.S. Senate network</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
  <script src="src/sigma.min.js"></script>
  <script src="src/sigma.parsers.json.min.js"></script>

  <style>
    body {margin:0; padding:0;}
    #sigma-container {position:absolute; top:0; bottom:0; width:100%;}
  </style>

</head>

<body>
  <div id="sigma-container"></div>

  <script>
  function init() {

    // Finds the connections of each node
    sigma.classes.graph.addMethod("neighbors", function(nodeId) {
      var k,
          neighbors = {},
          index = this.allNeighborsIndex[nodeId] || {};

      for (k in index)
        neighbors[k] = this.nodesIndex[k];

      return neighbors;
    });

    // Creates an instance of Sigma.js
    var sigInst = new sigma({
      renderers: [
        {
          container: document.getElementById("sigma-container"),
          type: "canvas"
        }
      ]
    });

    // Customizes its settings 
    sigInst.settings({
      // Drawing properties :
      defaultLabelColor: "#000",
      defaultLabelSize: 14,
      defaultLabelHoverColor: "#fff",
      labelThreshold: 11,
      defaultHoverLabelBGColor: "#888",
      defaultLabelBGColor: "#ddd",
      defaultEdgeType: "straight",

      // Graph properties :
      minNodeSize: 3,
      maxNodeSize: 10,
      minEdgeSize: 0.1,
      maxEdgeSize: 0.2,

      // Mouse properties :
      zoomMax: 20 
    });

    // Parses JSON file to fill the graph
    sigma.parsers.json(
      "data/senate.json",
      sigInst,
      function() {
        //  A little hack here:
        //  in this version of Sigma.js, the edges' colors have to be deleted manually
        sigInst.graph.edges().forEach(function(e) {
          e.color = null;
        });

        // Also, to facilitate the update of node colors, store
        // their original color under the key originalColor:
        sigInst.graph.nodes().forEach(function(n) {
          n.originalColor = n.color;
        });

        sigInst.refresh();
      }
    );


    // When a node is hovered over, check each node to see whether it is connected.
    // If not, set its color to gray. Do the same for the edges.

    var grayColor = "#ccc";
    sigInst.bind("overNode", function(e) {
      var nodeId = e.data.node.id,
          toKeep = sigInst.graph.neighbors(nodeId);
      toKeep[nodeId] = e.data.node;

      sigInst.graph.nodes().forEach(function(n) {
        if (toKeep[n.id])
          n.color = n.originalColor;
        else
          n.color = grayColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        if (e.source === nodeId || e.target === nodeId)
          e.color = null;
        else
          e.color = grayColor;
      });

      // Since the data has been modified, call the refresh method to make the colors update
      sigInst.refresh();
    });

    // When a node is no longer being hovered over, return to original colors
    sigInst.bind("outNode", function(e) {
      sigInst.graph.nodes().forEach(function(n) {
        n.color = n.originalColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        e.color = null;
      });

      sigInst.refresh();
    });
  }

  if (document.addEventListener)
    document.addEventListener("DOMContentLoaded", init, false);
  else
    window.onload = init;
  </script>

</body>

</html>

The code has been documented to explain what each part does. Notice that the head of the web page loads the main Sigma.js script, and a second script that reads, or “parses,” the JSON data. These are in the src subfolder.

To explore Sigma.js further, download or clone its Github repository and examine the code for the examples given.

Further reading/resources

Gephi tutorials

Sigma.js wiki

D3 – 01 | Hello World

Step 0 – Preparing the HTML

Base HTML code that you can copy into a file. Alternatively, create an HTML skeleton file with your editor (e.g. Brackets, Sublime or TextMate) and make sure to link to the D3.js library. The JavaScript code in the snippets should go within the script tag! Enter the code snippets one by one and test as you go; don’t enter all the code at once.

<html>
  <head>
    <script src="http://d3js.org/d3.v3.min.js"></script>
  </head>

  <body>
    <script>
    </script>
  </body>
</html>

Step 1 – Selecting and appending

Concepts:

  • Method Chaining
  • Selecting

Step 1.1: Appending the div

d3.select("body")
    .append("div")
    .style("border", "1px black solid")
    .text("hello world");

In English:

  1. Selecting the “<body>” element in the HTML,
  2. we create and append a “<div>” to it,
  3. with the CSS style rule of a solid, black, 1px-thick border,
  4. and containing the text “hello world”.
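
Each of these methods returns the selection it was called on, which is what makes the chained form possible. The equivalent without chaining (a sketch) makes this explicit:

var div = d3.select("body").append("div"); // append() returns the new selection
div.style("border", "1px black solid");
div.text("hello world");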

Step 1.2: Event Binding

d3.selectAll("div")
    .style("background-color", "pink")
    .style("font-size", "24px")
    .attr("class", "d3div")
    .on("click", function() {alert("You clicked a div")})

In English:

  1. Selecting the first <div> of the document (there is a second method, .selectAll(), which selects all elements matching the selector),
  2. we color the background of the <div> pink in the CSS
  3. and change the font size of the <div> to 24px in the CSS.
  4. We add to the <div> tag a “class” attribute and name this new class “d3div”.
  5. We bind an event handler to the <div> tag selected in the first line. The event is the action of a “click”, and when that action is executed (when someone clicks the <div>), a function is called. In this example, that function has only one line: an alert that displays “You clicked a div”.
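
Inside a D3 event handler, this refers to the DOM element that received the event, so the handler can restyle the element that was actually clicked. A small variant (a sketch, not part of the exercise):

d3.select("div")
    .on("click", function() {
      // "this" is the clicked <div>
      d3.select(this).style("background-color", "yellow");
    });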

Task 2 – The <svg> element

This task is completely separate from Task 1. You can keep the base HTML file, but your script area will look entirely different! Remember to save the different files separately for future reference.

Concepts:

  • svg element
  • DOM order determines stacking – there is no z-index!
  • coordinate attributes

Step 0:

Add the svg element in the HTML file, within the body.

<svg style="width:500px; height:500px; border:1px lightgray solid;">

</svg>       

Step 1:

d3.select("svg")
    .append("line")
    .attr({"x1":20, "y1":20, "x2":250, "y2":250})
    .style("stroke","red")
    .style("stroke-width","4px");

Select the “<svg>” tag added in the HTML file and append a line with starting point (20,20) and ending point (250,250). The line is 4px wide and red.

Step 2:

d3.select("svg")
    .append("text")
    .attr({"x": 20, "y": 20})
    .text("HELLO");

Select the <svg> tag again and append “text” to it at the coordinate (20, 20) with the text content “HELLO”. (Note: Hit inspect element to see what your HTML looks like after the script has been run!)

Step 3:

d3.select("svg")
    .append("circle")
    .attr({"r": 20, "cx": 20, "cy": 20})
    .style("fill","purple");
    
d3.select("svg")
    .append("circle")
    .attr({"r": 100, "cx": 400, "cy": 400})
    .style("fill", "lightblue");

Now, select the <svg> again and this time, append a circle with the specified attributes (cx and cy denote center coordinates). Select it again and append the other circle with its own attributes.

Step 4:

d3.select("svg")
    .append("text")
    .attr({"x": 400, "y": 400})
    .text("WORLD");

Now, append the final text “WORLD” to the <svg>, positioned at the point (400,400).

Food for thought: Why do the elements overlap the way they do? What determines what goes on top and what goes below?

Task 3 – Transitions

To add transitions (the D3.js equivalent of animations), we’ll need to change our base code up a bit. To start off with, make a copy of your previous file, and then replace the <script> code with the one shown below.

Step 1:

d3.select("svg")
    .append("circle")
    .attr({"r": 20, "cx": 20, "cy": 20})
    .style("fill","red");
            
d3.select("svg")
    .append("text")
    .attr("id","a")
    .attr({"x": 20, "y": 20})
    .style("opacity", 0)
    .text("HELLO WORLD");          

d3.select("svg")
    .append("circle")
    .attr({"r": 100, "cx": 400, "cy": 400})
    .style("fill","lightblue");
            
d3.select("svg")
    .append("text")
    .attr("id","b")
    .attr({"x": 400, "y": 400})
    .style("opacity", 0)
    .text("Uh, hi.");

Take a look at the code above.

  • Why is the code broken up in the way it is?
  • Find all the added instructions (those that were not in the previous code). Why don’t you see all the elements?
  • Can you see the purpose of the added instructions, with respect to the transitions we’ll add in the next step?

Step 2:

Add the following code:

d3.select("#a")
    .transition()
    .delay(1000)
    .style("opacity", 1);

Time to add transitions with the .transition() method. .delay(milliseconds) determines the amount of time to elapse before the transition is applied. We usually delay animations by a little bit because instantaneous changes can be jarring for the user. This particular transition above waits a second (1000 milliseconds) before it transitions the opacity of #a (one of the text fields) from 0 (transparent, therefore invisible) to 1 (full opacity).
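
Transitions can also be given an explicit length with .duration(milliseconds) (the default in D3 v3 is 250ms) and a pacing curve with .ease(). A variant of the snippet above, as a sketch using built-in D3 v3 easing names:

d3.select("#a")
    .transition()
    .delay(1000)          // wait one second...
    .duration(1500)       // ...then fade in over 1.5 seconds
    .ease("linear")       // at a constant rate ("cubic-in-out" is the default)
    .style("opacity", 1);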

Step 3:

d3.select("#b")
    .transition()
    .delay(2000)
    .style("opacity", .75);
d3.selectAll("circle")
    .transition()
    .duration(2000)
    .attr("cy", 200);

What do you think the above code does?

Here are some things you can try:

  • Have 4 circles on the canvas, each moving in different directions.
  • Have each of the circles move at different times.
  • Make it look like there is only one circle when the page first loads, but reveal through transitions that there are actually many layers.
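
As a starting point for the first two suggestions, here is one possible sketch, assuming the same 500px-square <svg> as before (the directions and timings are arbitrary):

// Four circles leaving the center in different directions,
// each starting half a second after the previous one
var directions = [
  {dx: -150, dy: 0},    // left
  {dx: 150,  dy: 0},    // right
  {dx: 0,    dy: -150}, // up
  {dx: 0,    dy: 150}   // down
];

directions.forEach(function(d, i) {
  d3.select("svg")
      .append("circle")
      .attr({"r": 15, "cx": 250, "cy": 250})
      .style("fill", "steelblue")
      .transition()
      .delay(i * 500)
      .duration(1000)
      .attr("cx", 250 + d.dx)
      .attr("cy", 250 + d.dy);
});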