D3 Refresh

This article aims to give you a high-level overview of D3’s capabilities. In each example you’ll be able to see the input data, the transformation and the output document. Rather than explaining what every function does, I’ll show you the code and you should be able to get a rough understanding of how things work. I’ll only dig into details for the most important concepts, Scales and Selections.

William Playfair invented the bar, line and area charts in 1786 and the pie chart in 1801. Today, these are still the primary ways that most data sets are presented. These charts are excellent, but D3 gives you the tools and the flexibility to make unique data visualizations for the web; your creativity is the only limiting factor.

A Bar Chart

Although we want to get to more than William Playfair’s charts, we’ll begin by making the humble bar chart with HTML – one of the easiest ways to understand how D3 transforms data into a document. Here’s what that looks like:

See code in CodePen

d3.select('#chart')
  .selectAll("div")
  .data([4, 8, 15, 16, 23, 42])
  .enter()
  .append("div")
  .style("height", (d)=> d + "px")

The selectAll function returns a D3 “selection”: here an empty one, since #chart contains no divs yet. Calling data() binds our array to it, enter() returns a placeholder for each data point without a matching element, and append() creates a div for each of them.

This code maps the input data [4, 8, 15, 16, 23, 42] to this output HTML.

<div id="chart">
  <div style="height: 4px;"></div>
  <div style="height: 8px;"></div>
  <div style="height: 15px;"></div>
  <div style="height: 16px;"></div>
  <div style="height: 23px;"></div>
  <div style="height: 42px;"></div>
</div>

All of the style properties that don’t change can go in the CSS.

#chart div {
  display: inline-block;
  background: #4285F4;
  width: 20px;
  margin-right: 3px;
}

GitHub’s Contribution Chart

With a few extra lines of code we can convert the bar chart above into a contribution chart similar to GitHub’s.

A GitHub-style contribution chart

See code in CodePen

Rather than setting a height based on the data’s value, we can set a background-color instead.

const colorMap = d3.interpolateRgb(
  d3.rgb('#d6e685'),
  d3.rgb('#1e6823')
)

d3.select('#chart')
  .selectAll("div")
  .data([.2, .4, 0, 0, .13, .92])
  .enter()
  .append("div")
  .style("background-color", (d)=> {
    return d == 0 ? '#eee' : colorMap(d)
  })

The colorMap function takes an input value between 0 and 1 and returns a colour along the gradient between the two colours we provide. Interpolation is a key tool in graphics programming and animation; we’ll see more examples of it later.
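
Under the hood, RGB interpolation is just a linear blend of each colour channel. Here’s a minimal sketch of the idea (an illustration, not D3’s actual implementation):

```javascript
// Illustrative sketch of RGB interpolation (not D3's implementation).
// Blends each channel linearly between two hex colours for t in [0, 1].
function interpolateRgb(a, b) {
  const parse = (hex) => [
    parseInt(hex.slice(1, 3), 16),
    parseInt(hex.slice(3, 5), 16),
    parseInt(hex.slice(5, 7), 16)
  ];
  const [ar, ag, ab] = parse(a);
  const [br, bg, bb] = parse(b);
  return (t) => {
    const channel = (x, y) => Math.round(x + (y - x) * t);
    return `rgb(${channel(ar, br)}, ${channel(ag, bg)}, ${channel(ab, bb)})`;
  };
}

const colorMap = interpolateRgb('#d6e685', '#1e6823');
colorMap(0);   // "rgb(214, 230, 133)" – the light green
colorMap(1);   // "rgb(30, 104, 35)"  – the dark green
```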

An SVG Primer

Much of D3’s power comes from the fact that it works with SVG, which contains tags for drawing 2D graphics like circles, polygons, paths and text.

<svg width="200" height="200">
  <circle fill="#3E5693" cx="50" cy="120" r="20" />
  <text x="100" y="100">Hello SVG!</text>
  <path d="M100,10L150,70L50,70Z" fill="#BEDBC3" stroke="#539E91" stroke-width="3" />
</svg>

The code above draws:

  • A circle at 50,120 with a radius of 20
  • The text “Hello SVG!” at 100,100
  • A triangle with a 3px border; the d attribute contains the following instructions:
    • Move to 100,10
    • Line to 150,70
    • Line to 50,70
    • Close path (Z)

<path> is the most powerful element in SVG.
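
A path’s d attribute is just a string of these instructions, so it’s easy to build one from a list of points. A small sketch that reconstructs the triangle above:

```javascript
// Illustrative: build the triangle's path data from a list of points.
// "M" moves to the first point, "L" draws lines between the rest,
// and "Z" closes the shape back to the start.
const points = [[100, 10], [150, 70], [50, 70]];
const d = "M" + points.map((p) => p.join(',')).join('L') + "Z";
d; // "M100,10L150,70L50,70Z"
```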

Circles

Labeled circles showing sales by time of day

See code in CodePen

The data sets in the previous examples have been simple arrays of numbers; D3 can work with more complex types too.

const data = [{
  label: "7am",
  sales: 20
},{
  label: "8am",
  sales: 12
}, {
  label: "9am",
  sales: 8
}, {
  label: "10am",
  sales: 27
}]

For each data point we will append a <g> (group) element to #chart, then append <circle> and <text> elements to each group with properties from our objects.

const g = d3.select('#chart')
  .selectAll("g")
  .data(data)
  .enter()
  .append('g')
g.append("circle")
  .attr('cy', 40)
  .attr('cx', (d, i)=> (i+1) * 50)
  .attr('r', (d)=> d.sales)
g.append("text")
  .attr('y', 90)
  .attr('x', (d, i)=> (i+1) * 50)
  .text((d)=> d.label)

The variable g holds a D3 “selection” containing an array of <g> nodes; operations like append() act on every item in the selection, appending a new child element to each.

This code maps the input data into this SVG document. Can you see how it works?

<svg height="100" width="250" id="chart">
  <g>
    <circle cy="40" cx="50" r="20"/>
    <text y="90" x="50">7am</text>
  </g>
  <g>
    <circle cy="40" cx="100" r="12"/>
    <text y="90" x="100">8am</text>
  </g>
  <g>
    <circle cy="40" cx="150" r="8"/>
    <text y="90" x="150">9am</text>
  </g>
  <g>
    <circle cy="40" cx="200" r="27"/>
    <text y="90" x="200">10am</text>
  </g>
</svg>

Line Chart

A basic line chart

See the CodePen

Drawing a line chart in SVG is quite simple; we want to transform data like this:

const data = [
  { x: 0, y: 30 },
  { x: 50, y: 20 },
  { x: 100, y: 40 },
  { x: 150, y: 80 },
  { x: 200, y: 95 }
]

Into this document:

<svg id="chart" height="100" width="200">
  <path stroke-width="2" d="M0,70L50,80L100,60L150,20L200,5" />
</svg>

Note: The y values are subtracted from the height of the chart (100) because SVG’s y axis points down: we want a y value of 100 to be at the top of the SVG (0 from the top).

Given it’s only a single path element, we could do it ourselves with code like this:

const path = "M" + data.map((d)=> {
  return d.x + ',' + (100 - d.y);
}).join('L');
const line = `<path stroke-width="2" d="${ path }"/>`;
document.querySelector('#chart').innerHTML = line;

D3 has path-generating functions to make this much simpler, though. Here’s what it looks like:

const line = d3.svg.line()
  .x((d)=> d.x)
  .y((d)=> 100 - d.y)
  .interpolate("linear")

d3.select('#chart')
  .append("path")
  .attr('stroke-width', 2)
  .attr('d', line(data))

Much better! The interpolate function also supports a few different ways of drawing the line through the x, y coordinates. See how it looks with “linear”, “step-before”, “basis” and “cardinal”.

A linear-style line chart
A step-before-style line chart
A basis-style line chart

Scales

Scales are functions that map an input domain to an output range.

See the CodePen

In the examples we’ve looked at so far we’ve been able to get away with using “magic numbers” to position things within the chart’s bounds. When the data is dynamic, you need to do some math to scale the data appropriately.

Imagine we want to render a line chart that is 500px wide and 200px high with the following data:

const data = [
  { x: 0, y: 30 },
  { x: 25, y: 15 },
  { x: 50, y: 20 }
]

Ideally we want the y axis values to go from 0 to 30 (max y value) and the x axis values to go from 0 to 50 (max x value) so that the data takes up the full dimensions of the chart.

We can use d3.max to find the max values in our data set and create scales for transforming our x, y input values into x, y output coordinates for our SVG paths.

const width = 500;
const height = 200;
const xMax = d3.max(data, (d)=> d.x)
const yMax = d3.max(data, (d)=> d.y)

const xScale = d3.scale.linear()
  .domain([0, xMax]) // input domain
  .range([0, width]) // output range

const yScale = d3.scale.linear()
  .domain([0, yMax]) // input domain
  .range([height, 0]) // output range

These scales are similar to the colour interpolation function we created earlier: they are simply functions that map input values to a position on the output range.

xScale(0) -> 0
xScale(10) -> 100
xScale(50) -> 500

They work with values outside of the input domain as well:

xScale(-10) -> -100
xScale(60) -> 600
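
There’s no magic here: a linear scale is just the equation of a straight line through the two endpoints. A minimal sketch of what d3.scale.linear computes (illustrative, not D3’s source):

```javascript
// Illustrative sketch of a linear scale.
// Maps a value from the domain [d0, d1] to the range [r0, r1]
// proportionally, extrapolating for values outside the domain.
function linearScale([d0, d1], [r0, r1]) {
  return (v) => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
}

const xScale = linearScale([0, 50], [0, 500]);
xScale(10);   // 100
xScale(-10);  // -100 – values outside the domain extrapolate
```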

We can use these scales in our line generating function like this:

const line = d3.svg.line()
  .x((d)=> xScale(d.x))
  .y((d)=> yScale(d.y))
  .interpolate("linear")

Another thing you can easily do with scales is to specify padding around the output range:

const padding = 20;
const xScale = d3.scale.linear()
  .domain([0, xMax])
  .range([padding, width - padding])

const yScale = d3.scale.linear()
  .domain([0, yMax])
  .range([height - padding, padding])

Now we can render a dynamic data set and our line chart will always fit inside our 500px / 200px bounds with 20px padding on all sides.

Linear scales are the most common type, but there are others: pow for exponential scales, and ordinal scales for representing non-numeric data like names or categories. In addition to Quantitative Scales and Ordinal Scales there are also Time Scales for mapping date ranges.

For example, we can create a scale that maps my lifespan to a number between 0 and 500:

const life = d3.time.scale()
  .domain([new Date(1986, 1, 18), new Date()])
  .range([0, 500])

// At which point between 0 and 500 was my 18th birthday?
life(new Date(2004, 1, 18))
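
A time scale works because dates are just millisecond timestamps underneath, so it reduces to a linear scale over numbers. An illustrative sketch (not D3’s implementation):

```javascript
// Illustrative: a time scale is a linear scale over millisecond timestamps.
function timeScale([start, end], [r0, r1]) {
  const t0 = start.getTime(), t1 = end.getTime();
  return (date) => r0 + ((date.getTime() - t0) / (t1 - t0)) * (r1 - r0);
}

// Map a one-hour window onto [0, 500]
const scale = timeScale(
  [new Date(2004, 1, 18, 6, 0), new Date(2004, 1, 18, 7, 0)],
  [0, 500]
);
scale(new Date(2004, 1, 18, 6, 30));  // 250 – halfway through the hour
```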

If you’d like to go further with this, try the Animated Flight Visualization.

Animated Flight Visualization

So far we’ve only looked at static lifeless graphics with a few rollovers for additional information. Let’s make an animated visualization that shows the active flights over time between Melbourne and Sydney in Australia.

See the Pen D3 – scales by Haig Armen (@haigarmen) on CodePen.

The SVG document for this type of graphic is made up of text, lines and circles.

<svg id="chart" width="600" height="500">
  <text class="time" x="300" y="50" text-anchor="middle">6:00</text>
  <text class="origin-text" x="90" y="75" text-anchor="end">MEL</text>
  <text class="dest-text" x="510" y="75" text-anchor="start">SYD</text>
  <circle class="origin-dot" r="5" cx="100" cy="75" />
  <circle class="dest-dot" r="5" cx="500" cy="75" />
  <line class="origin-dest-line" x1="110" y1="75" x2="490" y2="75" />

  <!-- for each flight in the current time -->
  <g class="flight">
    <text class="flight-id" x="160" y="100">JQ 500</text>
    <line class="flight-line" x1="100" y1="100" x2="150" y2="100" />
    <circle class="flight-dot" cx="150" cy="100" r="5" />
  </g>

</svg>

The dynamic parts are the time and the elements within the flight group. The data might look something like this:

let data = [
  { departs: '06:00 am', arrives: '07:25 am', id: 'Jetstar 500' },
  { departs: '06:00 am', arrives: '07:25 am', id: 'Qantas 400' },
  { departs: '06:00 am', arrives: '07:25 am', id: 'Virgin 803' }
]

To get an x position for a dynamic time we’ll need to create a time scale for each flight that maps its departure and arrival times to an x position on our chart. We can loop through our data at the start adding Date objects and scales so they’re easier to work with. Moment.js helps a lot here with date parsing and manipulation.

data.forEach((d)=> {
  d.departureDate = moment(d.departs, "hh:mm a").toDate();
  d.arrivalDate = moment(d.arrives, "hh:mm a").toDate();
  d.xScale = d3.time.scale()
    .domain([d.departureDate, d.arrivalDate])
    .range([100, 500])
});

We can now pass a changing Date to each flight’s xScale to get an x coordinate for that flight.

Render Loop

Departure and arrival times are rounded to 5 minutes so we can step through our data in 5m increments from the first departure to the last arrival.

let now = moment(data[0].departs, "hh:mm a");
const end = moment(data[data.length - 1].arrives, "hh:mm a");

const loop = function() {
  const time = now.toDate();

  // Filter data set to active flights in the current time
  const currentData = data.filter((d)=> {
    return d.departureDate <= time && time <= d.arrivalDate
  });

  render(currentData, time);

  if (now <= end) {
    // Increment 5m and call loop again in 500ms
    now = now.add(5, 'minutes');
    setTimeout(loop, 500);
  }
}

Enter, Update and Exit

D3 allows you to specify transformations and transitions of elements when:

  • New data points come in (Enter)
  • Existing data points change (Update)
  • Existing data points are removed (Exit)

const render = function(data, time) {
  // render the time
  d3.select('.time')
    .text(moment(time).format("hh:mm a"))

  // Make a d3 selection and apply our data set
  const flight = d3.select('#chart')
    .selectAll('g.flight')
    .data(data, (d)=> d.id)

  // Enter new nodes for any data point with an id not in the DOM
  const newFlight = flight.enter()
    .append("g")
    .attr('class', 'flight')

  const xPoint = (d)=> d.xScale(time);
  const yPoint = (d, i)=> 100 + i * 25;

  newFlight.append("circle")
    .attr('class',"flight-dot")
    .attr('cx', xPoint)
    .attr('cy', yPoint)
    .attr('r', "5")

  // Update existing nodes in selection with id's that are in the data
  flight.select('.flight-dot')
    .attr('cx', xPoint)
    .attr('cy', yPoint)

  // Exit old nodes in selection with id's that are not in the data
  flight.exit()
    .remove()
}

Transitions

The code above renders a frame every 500ms with a 5 minute time increment:

  • It updates the time
  • Creates a new flight group with a circle for every flight
  • Updates the x/y coordinates of current flights
  • Removes the flight groups when they’ve arrived

This works, but what we really want is a smooth transition between each of these frames. We can achieve this by creating a transition on any D3 selection and providing a duration and easing function before setting attributes or style properties.

For example, let’s fade in the opacity of entering flight groups.

const newFlight = flight.enter()
  .append("g")
  .attr('class', 'flight')
  .attr('opacity', 0)

newFlight.transition()
  .duration(500)
  .attr('opacity', 1)

Let’s fade out exiting flight groups.

flight.exit()
  .transition()
  .duration(500)
  .attr('opacity', 0)
  .remove()

Add a smooth transition between the x and y points.

flight.select('.flight-dot')
  .transition()
  .duration(500)
  .ease('linear')
  .attr('cx', xPoint)
  .attr('cy', yPoint)

Using the tween function, we can also transition the time between the 5 minute increments so that the clock displays every minute rather than jumping every five minutes.

const inFiveMinutes = moment(time).add(5, 'minutes').toDate();
const i = d3.interpolate(time, inFiveMinutes);
d3.select('.time')
  .transition()
  .duration(500)
  .ease('linear')
  .tween("text", ()=> {
    return function(t) {
      this.textContent = moment(i(t)).format("hh:mm a");
    };
  });

t is a progress value between 0 and 1 for the transition.
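
The interpolator driving this tween is simple: for two dates, it returns a date t of the way between them. An illustrative sketch of what d3.interpolate does for dates (not D3’s source):

```javascript
// Illustrative: interpolate between two dates.
// For a progress value t in [0, 1], returns a Date t of the way
// from a to b (dates are just millisecond timestamps underneath).
function interpolateDate(a, b) {
  const t0 = a.getTime(), t1 = b.getTime();
  return (t) => new Date(t0 + (t1 - t0) * t);
}

const i = interpolateDate(
  new Date(2004, 1, 18, 6, 0),
  new Date(2004, 1, 18, 6, 5)
);
i(0.4).getMinutes();  // 2 – 40% of the way through the five-minute step
```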

D3 Mega List

Here is an updated list of D3 examples, sorted alphabetically. Most of the examples in this list come from this Excel list, but I also added some updates and my own examples to push the list over 2,000. Examples are really helpful when doing any kind of development, so I am hoping that this big list of D3 examples will be a valuable resource. Bookmark it and share it with others. Here is the huge list of D3 demos:

  1. “Elbow” Dendrogram
  2. 113th U.S. Congressional Districts
  3. 20 years of the english premier football league
  4. 20000 points in random motion
  5. 2012 NFL Conference Champs
  6. 2012-2013 NBA Salary Breakdown
  7. 2013 Country Population World Map in D3.js
  8. 25 great circles
  9. 2D Matrix Decomposition
  10. 300 Outings
  11. 3D bar chart with D3.js and x3dom
  12. 3D scatter plot using d3 and x3dom
  13. 401k Fees Vary Widely for Similar Companies (Scatter)
  14. 512 Paths to the White House
  15. 582781
  16. 619560
  17. 777029
  18. 779986
  19. 7th Grade Graphs with D3
  20. 9-Patch Quilt Generator
  21. 908051
  22. 908382
  23. 913077
  24. A Bar Chart
  25. A Bar Chart, Part 1
  26. A Bar Chart, Part 2
  27. A Chicago Divided by Killings
  28. A Christmas Carol
  29. A CoffeeScript console for d3.js visualization
  30. A fun, difficult introduction to d3
  31. A JSNetworkX example
  32. A KoExtensions example: #d3js KnockoutJS, RavenDB, WebAPI, Bootstrap
  33. A line chart plotting unit sales, colored by price for d3 data visualisations
  34. A map of translations of Othello into German
  35. A marimekko chart showing SKUs grouped by owner and brand.
  36. A matrix chart where each point is replaced with a marimekko
  37. A Migration of Unmarried Men
  38. A physics model of a physics model
  39. A Race to Entitlement
  40. A Radius Follows the Mouse
  41. A sea of tweets: what are italians saying about the election
  42. A simpler variation of Kepler’s Tally
  43. A Slice of Canadian Life
  44. A sprintf-like function using d3,js
  45. A statistical model for blood pressure in patients with hypertension
  46. A Visit From The Goon Squad – Interactive Character Map
  47. Abusing The Force Talk
  48. AC Milan vs Juventus
  49. Across U.S. Companies, Tax Rates Vary Greatly
  50. Adaptive Resampling
  51. Adaptive Resampling
  52. Addepar
  53. Advanced object constancy
  54. Advanced visualizations with D3.js and Kartograph
  55. Adventures in D3
  56. AFL Brownlow Medalists
  57. AFL Brownlow Medalists
  58. Aid Explorer
  59. Air pollution
  60. Airbnb vs Hotels: A Price Comparison
  61. Airocean World
  62. Airocean World (Dymaxion) map
  63. Airy’s Minimum Error
  64. Airy’s Minimum Error
  65. Aitoff
  66. Aitoff Graticule
  67. Alaska Albers
  68. Albers Equal-Area Conic
  69. Albers Projection
  70. Albers Siberia
  71. Albers Tiles
  72. Albers USA
  73. Albers USA Projection
  74. Albers with Resampling
  75. Albers without Resampling
  76. AlbersUSA + PR
  77. All the Medalists: Men’s 100-Meter Freestyle
  78. Alpha-shapes aka concave hulls
  79. Alternative D3.js documentation
  80. American Forces in Afghanistan and Iraq
  81. Among the Oscar Contenders, a Host of Connections
  82. An inlet to Tributary
  83. An introduction to d3.js video with synced visualisation
  84. An overview of the Hong Kong budget in 2013-14
  85. Analog clock
  86. Andrew Berls, Visualizing your bash history with d3.js
  87. Angel List compensation scatterplot
  88. AngularJS + D3.js = Radian
  89. Animated Bézier Curves
  90. Animated Bubble Chart of Gates Educational Donations
  91. Animated bubble charts for school data analysis
  92. Animated Clipped textPath
  93. Animated Quasicrystals
  94. Animated Sankey (alluvial) diagram
  95. Animated Spirographs
  96. Animated textPath
  97. Animated tree
  98. Animated Trigonometry
  99. Annual traffic entering from station to Paris
  100. Antimeridian Cutting
  101. Antipodes
  102. antulik’s Gists
  103. Apollonian Gasket
  104. Apple logo with gradient
  105. Arc Deduplication
  106. Arc Tween (Clock)
  107. Arc Tween Commented Example
  108. Arcs Around
  109. Area chart
  110. Area Chart
  111. Area Choropleth
  112. Area Transition
  113. Area with Missing Data
  114. Argentina Census
  115. Arlington Visual Budget
  116. Armadillo Projection
  117. Arnold’s Cat Map
  118. Array Subclassing Test
  119. Arrows are Beautiful
  120. Article-Level Metrics over time
  121. Asia Lambert Conic Conformal
  122. At the Democratic Convention the Words Being Used
  123. At the National Conventions the Words They Used
  124. Atlantis
  125. Atlas zur Landtagswahl Bayern 2013
  126. AttrTween, Transitions and MV* in Reusable D3
  127. Audio Spectrum Analyzer
  128. August Projection
  129. Autofocus
  130. Autoforking
  131. Automatic floating labels using d3 force-layout
  132. Automatic Projection Tiles
  133. Automatically sizing text
  134. Axis Component
  135. Axis Examples
  136. Axis Styling
  137. Azimuth and Distance from London
  138. Azimuthal Equidistant
  139. Azimuthal Equidistant
  140. Azimuthal Projections
  141. Baby Names in England & Wales
  142. Backbone-D3: Simple visualisations of Backbone collections via D3.js
  143. Baker Dinomic
  144. bar + sum: d3.js & angular.js
  145. bar + sum: d3.js & backbone.js
  146. bar + sum: d3.js & ember.js
  147. bar + sum: reusable d3.js
  148. bar + sum: vanilla d3.js
  149. Bar Chart
  150. Bar chart code generator and online editor
  151. Bar Chart I
  152. Bar Chart II
  153. Bar Chart II
  154. Bar Chart II
  155. Bar Chart II
  156. Bar Chart III
  157. Bar Chart III
  158. Bar Chart III
  159. Bar Chart with Negative Values
  160. barStack (flex layout)
  161. Bart particles
  162. Bartholomew’s Regional Projection
  163. Base64.js
  164. Baseball 2012 Predictions based on past 6 years
  165. Basic Gantt Chart
  166. Basic Reusable Slopegraph
  167. Bathymetry of Lake Michigan
  168. Bay Area d3 User Group
  169. Bay Area earthquake responses by zip code
  170. BBEdit Preferences
  171. Bearcart
  172. Beautiful Spiral Things
  173. Beautiful visualizations with D3.js
  174. Beer taxes in your state – CNNMoney
  175. Beeswarm plot
  176. Behind the Australian Financial Review’s Budget Explorer
  177. Berghaus Star
  178. Better force layout node selection
  179. Bharat Bhole
  180. Bibly v2: Visualizing word distribution within the KJV bible
  181. Bieber Fever Meter with HTML5 Web Socket d3.js and Pusher
  182. Big Money in Tax Breaks
  183. Biham-Middleton-Levine Traffic Model
  184. Bilevel Partition
  185. Binify + D3 = Gorgeous honeycomb maps
  186. Binned Line Chart
  187. BioVis Project: Identification of Mutations that Affect Protein Function
  188. BIS Derivative Data
  189. Bitdeli: Custom analytics with Python and GitHub
  190. Bitly link Co-occurrence
  191. Bivariate Area Chart
  192. Bivariate Hexbin Map
  193. Bl.ocks RSS
  194. Blobular
  195. Blocky Counties
  196. Blocky Counties
  197. Bloom Filters
  198. Blur/fade effect
  199. Boeing 777 Descent Profiles, SFO
  200. Boggs Eumorphic
  201. Bonne Projection
  202. Boomstick motion
  203. Boomstick motion coffee
  204. Boston d3.js User Group
  205. Boulder County Wildfires
  206. Bounded Force Layout
  207. Box plot
  208. Bracket Layout
  209. Briesemeister
  210. Bromley
  211. Browser usage plurality
  212. Brush
  213. Brush as Slider
  214. Brush Handles
  215. Brush Snapping
  216. Brush Snapping II
  217. Brush Transitions
  218. Brushable Network
  219. Brushable Network II
  220. Bubble Chart
  221. Bubble My Page Visualization
  222. Bubbles
  223. Bubbles generator using a simplex noise
  224. Build world clocks
  225. Build Your Own Graph!
  226. Building a lightweight, flexible D3.js dashboard
  227. Building a tree diagram
  228. Building a UML editor in JS
  229. Building Cubic Hamiltonian Graphs from LCF Notation
  230. BulleT (a variant of mbostock’s Bullet Charts)
  231. Bullet chart variant
  232. Bullet Charts
  233. Bump Chart with rCharts and Rickshaw
  234. Caged/d3-tip
  235. Calculating quadtree bounding boxes and displaying them in leaflet
  236. Calendar View
  237. Calendar View
  238. California Population Density
  239. Calkin-Wilf Tree
  240. Calroc
  241. Calroc: Web as Theater
  242. Can people localize sounds with one functional ear?
  243. Can’t we all get along?
  244. Can’t we all get along?
  245. Candlestick charts
  246. Canvas Geometric Zooming
  247. Canvas Semantic Zooming
  248. Canvas Swarm
  249. Canvas Voronoi
  250. Canvas with d3 and Underscore
  251. Capturing Listeners
  252. Capturing Mousemove
  253. Caravaggio’s Bacco (1597)
  254. Carotid-Kundalini Fractal Explorer
  255. Carotid-Kundalini Fractal Explorer
  256. CartoDB + D3 Bubble Map
  257. CartoDB makes D3 maps a breeze
  258. Cartogram.js: Continuous Area Cartograms
  259. Case-Sensitivity and SVG-in-HTML
  260. Cassini
  261. Cellular automata
  262. Cellular automata
  263. Central Limit Theorem Visualized in D3
  264. CFCLTWiki
  265. Chained Transitions
  266. Chained Transitions
  267. Chamberlin Trimetric
  268. Chamberlin Trimetric Projection
  269. Changes in Employment and Salary by Industry
  270. Chart Wheel Visualization
  271. Chart.io: The Easiest Business Dashboard You’ll Ever Use
  272. Chartbuilder
  273. Chernoff faces
  274. Chernoff faces Fisheye Geodesic grid Hive plot Horizon chart Sankey diagram
  275. Chicago Lobbyists
  276. Chicago Ward Remap Outlines
  277. CHIPMOD
  278. Chord Diagram
  279. Chord diagram with Dex
  280. Chord Diagram: Dependencies Between Classes
  281. Chord diagram: Fade on Hover
  282. Chord diagram: Updating data
  283. Chord Layout Transitions
  284. Choropleth
  285. Choropleth classification systems
  286. Choropleth with interactive parameters for NYC data visualization
  287. Christchurch Earthquakes
  288. christophermanning’s bl.ocks
  289. Chroma + Phi (ϕ)
  290. Chrome Circle Precision Bug
  291. Chrome Circle Precision Bug
  292. Circle Packing
  293. Circle Packing with Zero Values
  294. Circle Packing Zero Values
  295. Circle-bound D3 force layout
  296. Circle-Circle Intersection
  297. Circle-Polygon Intersection
  298. Circles
  299. Circular heat chart
  300. Circular key scale
  301. Circular Layout
  302. Circular Layout (Arc)
  303. Circular Layout (Recursive)
  304. Circular Layout (Slider)
  305. Circular Segment
  306. Circular tree comparing the src directory for three versions of d3
  307. Classements par étape – Tour de France 2012
  308. Clean Up for Natural Earth GeoJSON
  309. click-to-center
  310. click-to-center via transform
  311. Click-to-Recenter Brush
  312. Click-to-Recenter Brush II
  313. Click-to-Zoom via Transform
  314. click-to-zoom via transform
  315. Clickme: Render JavaScript visualizations using R objects
  316. Clinical trials in Multiple Sclerosis
  317. Close Votes – visualizing voting similarities for the Dutch 2012 national elections
  318. Closest Point on Path
  319. Closest Point on Path II
  320. Closest Point to Segment
  321. Cluster Dendrogram
  322. Cluster Dendrogram
  323. Cluster Dendrogram II
  324. Cluster Force Layout IV
  325. Clustered Force Layout
  326. Clustered Force Layout
  327. Clustered Force Layout III
  328. Co-Authors Chords
  329. CodeFlower Source code visualization
  330. CoderDojo – Intro to D3.js
  331. Coffee Flavour Wheel
  332. Collapsible Force Layout
  333. Collapsible Force Layout
  334. Collapsible Force Layout
  335. Collapsible Indented Tree
  336. Collapsible Tree
  337. Collapsible tree
  338. Collapsible Tree Layout
  339. Collapsible tree with labels
  340. Collatz Graph: All Numbers Lead to One
  341. Collective.js.d3 Integrates D3.js in Plone
  342. Collider – a d3.js game
  343. Collignon Projection
  344. Collision Detection
  345. Collision Detection
  346. Collision Detection (Canvas)
  347. Collpase/expand nodes of a tree
  348. Collusion FireFox Addon
  349. Colony – Visualising Javascript projects and their dependencies
  350. Color Brewer
  351. Color scheme sunburst
  352. Color via Clipping
  353. Color: a color matching game
  354. Combinatorial Necklaces and Bracelets
  355. Combining D3 and Ember to Build Interactive Maps
  356. Comic Book Narrative Charts
  357. Commented bar chart code
  358. Comparing the same surveys by different polling organizations (polish)
  359. Comparison of MS trials baseline characteristics
  360. Complete Graphs
  361. Composite Map Projection
  362. Composition of Church Membership by State: 1890
  363. Computationally Endowed
  364. Concentric Circles Emanating
  365. Concurrent Transitions
  366. Concurrent Transitions II
  367. Confidence interval in poll surveys
  368. Congressional Network Analysis
  369. Connections in time
  370. Constrained Zoom
  371. Constraint relaxation 1
  372. Constraint relaxation 1
  373. Constraint relaxation 2
  374. Contextual Pie Menu in AngularJS
  375. Contour Plot
  376. Converting dynamic SVG to PNG with node.js, d3 and Imagemagick
  377. Convex Hull
  378. Conway’s game of life with JS and D3.js
  379. Conway’s Game of life as a scrolling background (broken link)
  380. Conway’s game of life in D3.js
  381. Coordinated visualizations for Consumer Packaged Goods
  382. Coordinated Visualizations: An introduction to crossfilter.js
  383. Copper: Wrapper around python packages with D3.js viz
  384. Cost of living
  385. Cost of Living – Parallel Coordinates
  386. Costa Rica shaded relief
  387. Counting Weekdays
  388. Countries and Capitals D3 Demo
  389. Countries and Capitals with D3 and Natural Earth
  390. Countries by Area
  391. County Circles
  392. CPI Interactive index, with Angular.js, bootstrap and d3.js
  393. Craig Retroazimuthal
  394. Craig Retroazimuthal
  395. Craster Parabolic
  396. Crayola Colour Chronology
  397. Create a JavaScript bar chart with D3
  398. Create any map of the world in SVG
  399. Creating a Polar Area Diagram
  400. Creating Animated Bubble Charts in D3
  401. Creating Animations and Transitions With D3
  402. Creating Basic Charts using d3.js
  403. Creating Reusable D3, MVC, and Events
  404. Creating Thumbnails with GraphicsMagick
  405. Crime in Mexico
  406. Cross-linked Mouseover
  407. Crossfilter.js
  408. CS6964: Information Visualization
  409. CSS3 Modal Button
  410. CSSdeck: Repulsion example
  411. CSSOM/SVG Test
  412. CSV Syntax Definition
  413. csv2tsv
  414. Cube Metrics Client (Node.js + WebSockets)
  415. Cube Realtime Map
  416. Cube: Time Series Data Collection & Analysis
  417. Cubism.js: Time Series Visualization
  418. Current Article Popularity Trends on Hacker News
  419. Current rainfall, weather and buoy information for Ventura County and nearby counties
  420. Curved Links
  421. Curved textPath
  422. Custom Axis
  423. Custom Cartesian Projection
  424. Custom Easing
  425. Custom Multi Scale Time Format Axis
  426. Custom Path and Area Generator
  427. Custom Projection
  428. Cylindrical Equal-Area
  429. D#.js and Hawaii Open Data
  430. D3 and Custom Data Attributes
  431. D3 and the Power of Projections : MapBrief
  432. D3 and WordPress
  433. D3 Arc Diagram
  434. D3 Bookmarklet
  435. D3 Chart Builder
  436. D3 concept browser
  437. D3 Conceptually
  438. D3 Dependencies
  439. D3 Dorling cartogram with rectangular states
  440. D3 examples
  441. D3 Examples on Heroku
  442. D3 flights
  443. D3 for Mere Mortals
  444. D3 Geo Boilerplate: Responsive, Zoom Limits, TopoJson, and Tooltips
  445. D3 GeoJSON and TopoJSON Online Renderer with Drag and Drop
  446. D3 Globe with Natural Earth Image wrapped around using Canvas
  447. D3 graph plugin
  448. D3 graphics in a Pergola SVG UI
  449. D3 heatmap using Backbone.js and CoffeeScript
  450. D3 Hello World
  451. D3 js slides
  452. D3 line chart for Angularjs
  453. D3 linked view with a hexagonal cartogram
  454. d3 meta-visualization
  455. D3 node focus
  456. d3 O’Clock: Building a Virtual Analog Clock with d3.js, Part I
  457. d3 pie plugin
  458. D3 PJAX
  459. d3 rendered with RaphaelJS for IE Compatibility
  460. D3 selection transform syntax
  461. d3 several time scales
  462. D3 Show Reel
  463. D3 Slopegraph I
  464. D3 Slopegraph II
  465. D3 tag at Empire5
  466. D3 tag at Exploring Data
  467. D3 Time Zone World Map
  468. D3 Treemap with Title Headers
  469. D3 Tutorials
  470. D3 Waveform Live demo
  471. D3 with HTML: divs as datavis
  472. d3 workshop
  473. D3 Workshop Slides
  474. D3 World Map Game
  475. D3 World Map that Zooms to each Country on Click
  476. D3 World Map with Country Tooltips and Colors
  477. D3 World Map with local image tiles
  478. D3 World Map with Smooth Mouse Zooming
  479. D3 World Map with Zoom, ToolTips, and Data Points
  480. D3 World Maps: Tooltips, Zooming, and Queue
  481. D3-Builder
  482. d3-comparator: sort arrays of objects by multiple dimensions
  483. D3-curvy/
  484. D3-plugins
  485. D3-tip on a bar chart
  486. D3-tree
  487. d3-tree-heatmap
  488. D3, Conceptually
  489. D3: Data-Driven Documents
  490. d3.bayarea( ) Celebrating 1024 members!
  491. d3.chart Choropleths
  492. d3.chart.tooltips
  493. d3.create + selection.adopt
  494. d3.geo.path + Canvas
  495. d3.geo.path and d3.behavior.zoom
  496. d3.geo.tile
  497. d3.geo.tile
  498. d3.geo.tiler
  499. D3.java script by Vienno – Keenjar
  500. D3.js and a little bit of ClosureScript
  501. D3.js and Excel
  502. D3.js and GWT proof-of-concept
  503. D3.js and Meteor to generate SVG
  504. D3.js and MongoDB
  505. D3.js and vega.js plots in the IP notebook
  506. D3.js and X-Requested-With Header
  507. D3.js crash course
  508. D3.js Docco documentation
  509. D3.js Documentation Generator for Dash and HTML
  510. D3.js experiments in the console
  511. d3.js for Attacker Reports
  512. D3.js force diagram from Excel
  513. D3.js force diagrams straight from Excel
  514. D3.js force diagrams with markers straight from Excel
  515. D3.js Geo fun
  516. D3.js graphs for RHQ
  517. D3.js Lessons: Create a Basic Column Chart
  518. D3.js Meta Tutorial
  519. D3.js nested data
  520. d3.js on Veengle
  521. D3.js Playground
  522. D3.js playground
  523. D3.js Premiership Season
  524. D3.js Presentation
  525. D3.js Slider Examples
  526. D3.js Sublime2 snippets
  527. D3.js tag at Frakturmedia
  528. d3.js tag at Monkeyologist
  529. D3.js tag on The JavaDude Weblog
  530. D3.js talk at Github
  531. D3.js talk from Iowa City Feb 2013 Iowa JS Meetup
  532. D3.js Tips and Tricks
  533. D3.js tree with drag nea logic
  534. D3.js tutorial on Codecademy
  535. d3.js video tutorial
  536. D3.js, elasticsearch, bordeaux open data
  537. D3.js, Data Visualisation in the Browser
  538. D3.js: Data-Driven Delight
  539. d3.micromaps
  540. d3.nest
  541. d3.phylogram
  542. d3.sticker plugin
  543. d3.time.format localization
  544. d3.time.scale nice
  545. d3.tsv
  546. d34raphael
  547. DAG as force graph
  548. Dagre: Directed graph rendering
  549. Daily data return rates for seismic networks in the EarthScope USArray
  550. Dance.js: D3 with Backbone and Data.js
  551. Dangle
  552. Dashifyr
  553. Datachart plugin
  554. Data Science Venn Diagram
  555. Data Stories #22: NYT Graphics and D3
  556. Data Story
  557. Data Visualization at MinnPost
  558. Data Visualization Libraries Based on D3.JS
  559. Data Visualization Using D3.js
  560. Data visualization with D3.js and python
  561. Data Visualization with D3.js, slides and video
  562. Data-Driven Documents, Defined, Resources, Data Driven Journalism
  563. Datadog
  564. DataFart
  565. Dataflow programming with D3 and Blockly
  566. DataMaps: Interactive maps for data visualizations.
  567. Datameer Smart Analytics
  568. Datawrapper: An open source tool to create embeddable charts
  569. Date Ticks
  570. DataViz for Everyone: Responsive Maps With D3
  571. David Foster Wallace’s ‘Infinite Jest’
  572. DC Code Browser
  573. DC government
  574. Dc.js
  575. De Maastricht au traité budgétaire : les oui et les non de 39 personnalités politiques
  576. De Maastricht au traité budgétaire : les oui et les non de 39 personnalités politiques
  577. Deadly Tornado Outbreak – April 25-28, 2011
  578. Decomposing an image from canvas to SVG
  579. Delaunay Force Mesh
  580. Delaunay Triangulation
  581. Delta-flora for IntelliJ analyze project source code history
  582. Density map of homicides in Monterrey
  583. Dependo: force directed graph of JavaScript dependencies
  584. Description: A little language for d3js
  585. Design process of The Electoral Map
  586. Designing a Reusable Line Chart in D3JS
  587. Detecting Duplicates in O(1) Space and O(n) Time
  588. Dex Motion Chart Demo
  589. DexCharts: A new reusable charting library for D3.js
  590. Diagram of Patients and Symptoms
  591. Dial examples
  592. Difference Chart
  593. Diffusion-limited aggregation
  594. Dimensional Changes in Wood
  595. Dimple Pong
  596. Dimple.js: An oo API for business analytics powered by d3.
  597. Direct Flights with Connections
  598. Directed Graph Editor
  599. Directly render and serve d3 visualizations from a nodejs server.
  600. Disc
  601. Dispatching Events
  602. Dissecting a Trailer: The Parts of the Film That Make the Cut
  603. Distances from North Korea
  604. DOM-to-Canvas using D3
  605. Donut Chart
  606. Donut Multiples
  607. Donut Transitions
  608. Dorling World Map
  609. Dot Append video tutorials
  610. Dot Enter video tutorials
  611. Dot enter( ) stage left
  612. Dot plot with jittering
  613. Dots
  614. Double Cordiform
  615. Downton Ipsum ~ A Downton Abbey-inspired lorem ipsum text generator
  616. Drag + Zoom
  617. Drag and Drop Container Divs
  618. Drag and Drop SVG Geography Game with D3.js
  619. Drag and resize a D3.js chart with JqueryUI
  620. Drag Multiples
  621. Drag rectangle
  622. Draggable Network
  623. Draggable Network II
  624. Draw tangent on a line on mouseover
  625. Drawing Chemical Structures with Force Layout
  626. Drawing Hexagon Mesh with contour using TopoJSON
  627. Driving from Thailand to the Netherlands
  628. Drop shadow example
  629. Drought and Deluge in the Lower 48
  630. Drought during Month
  631. Drought Extends Crops Wither
  632. DRY Bar Chart
  633. Dual scale line chart
  634. DViz: a declarative data visualization library
  635. Dymo
  636. Dynamic Distance Cartogram for ORBIS
  637. Dynamic Hexbin
  638. Dynamic Simplification
  639. Dynamic Simplification II
  640. Dynamic Simplification III
  641. Dynamic Simplification IV
  642. Dynamic Visualization LEGO
  643. Dynamic-Graphs: charting lib for real-time data
  644. Dynamics of Swedish politics
  645. Earthquakes
  646. Easy infographics with D3.js
  647. Eckert I Projection
  648. Eckert II Projection
  649. Eckert III Projection
  650. Eckert IV Projection
  651. Eckert V Projection
  652. Eckert VI Projection
  653. Eckert–Greifendorff
  654. eCommerce API Wheel for eBay
  655. Economic performance of the Amsterdam Metro Area by sector and year
  656. Edge labels
  657. Eisenlohr Projection
  658. El Patrón de los Números Primos
  659. Elastic collisions
  660. Elbow Dendrogram
  661. Election 2012 Social Dashboard (interactive Twitter visualization)
  662. Electro 2013: The magnetic force between political candidates and objectives
  663. Elezioni 2013 – I risultati del voto per la Camera dei deputati
  664. Embed D3.js Animations in Slidify
  665. Embedly Blog, Visualizing discussions on Reddit with a D3 network and Embedly
  666. Ember Table
  667. Ember Timetree
  668. English Football Tickets: Value For Money
  669. Enumerating vertex induced connected subgraphs
  670. Epicyclic Gearing
  671. Epicyclical Gears
  672. EPSG:2163 Coordinates
  673. Equidistant Conic Projection
  674. Equirectangular (Plate Carrée)
  675. Error bars reusable component
  676. Eurozone crisis: more than debt
  677. Events in the Game of Thrones
  678. Every ColorBrewer Scale
  679. Every known drone strike and victim in Pakistan
  680. Example of interactive MDS visualisation
  681. Example of map with routes in Gunma
  682. Exit, Update, Enter
  683. Exit, Update, Enter II
  684. Exoplanets
  685. Exoplanets: an interactive version of XKCD 1071
  686. Expandable Menu
  687. Exploration of the Google PageRank Algorithm
  688. Explore Analytics: cloud-based data analytics and visualization
  689. Exploring d3.js with data from my runs to plot my heart rate
  690. Exploring Health Care Cost and Quality
  691. Exploring Reusability with D3.js
  692. Explosions
  693. Export to SVG/PNG/PDF server-side using Perl
  694. Extending the D3 Zoomable Sunburst with Labels
  695. Extent Ticks
  696. External SVG
  697. Extradition Treaties
  698. Eyedropper
  699. Eyedropper
  700. F1 Championship Points as a d3.js Powered Sankey Diagram
  701. Facebook IPO
  702. Facebook Mutual Friends
  703. Facebook Open Graph with Angular
  704. Faces
  705. Factorisation Diagrams
  706. Fahey
  707. Fahrradunfälle in Deutschland
  708. Fancy Markers
  709. Fancy Markers (No Gradient)
  710. Farid Rener CV
  711. Fast Multidimensional Filtering for Coordinated Views
  712. Fast Pointing
  713. Faster pan/zoom on big TopoJSON of Iceland
  714. Faux-3D Arcs
  715. Faux-3d Shaded Globe
  716. Feltronifier
  717. Fibonacci Numbers
  718. Fill-Rule Evenodd
  719. Filling Geometric Objects
  720. Financial visualization of top tech companies
  721. Fineo: an app based on Sankey diagrams
  722. Finite State Stream
  723. First steps in data visualisation using d3.js
  724. Fisheye Distortion
  725. Fisheye Grid
  726. Fixed-width Histogram Irwin-Hall distribution
  727. Fixed-width Histogram of Durations log-normal distribution
  728. Flat-Polar Parabolic
  729. Flat-Polar Quartic
  730. Flat-Polar Sinusoidal
  731. Floating Landmasses
  732. Floor Plan Map
  733. Flow – Straight, Arrows
  734. Flows of refugees between the world countries in 2008
  735. Focus+Context via Brushing
  736. Focusable Maps
  737. Football passes
  738. For Example
  739. For Protovis Users
  740. Force Directed States of America
  741. Force Editor + Pan/Zoom
  742. Force layout big
  743. Force Layout from Adjacency List
  744. Force Layout from CSV
  745. Force Layout from List
  746. Force layout graph with colour-coded node neighbours
  747. Force Layout Multiples (Independent)
  748. Force layout on composite objects
  749. Force Layout with Canvas
  750. Force Layout with Mouseover Labels
  751. Force Layout with Tooltips
  752. Force-based label placement
  753. Force-Based Label Placement
  754. Force-Directed Graph
  755. Force-Directed Graph
  756. Force-Directed Graph with Mouseover
  757. Force-Directed Graphs: Playing around with D3.js
  758. Force-Directed Layout from XML
  759. Force-directed layout with custom Forces
  760. Force-directed layout with drag and drop
  761. Force-directed layout with from Matrix Market format
  762. Force-directed layout with images and Labels
  763. Force-directed layout with multi Foci and Convex Hulls
  764. Force-directed layout with multiple Foci
  765. Force-directed layout with symbols
  766. Force-directed lollipop chart
  767. Force-Directed Parallel Coordinates
  768. Force-directed Splitting
  769. Force-Directed States
  770. Force-Directed SVG Icons
  771. Force-Directed Tree
  772. ForceEdgeBundling on US airline routes
  773. ForceLayoutEditor
  774. Forecast of Mexican 2012 presidential election
  775. Foreign aid, corruption and internet use
  776. Formula 1 Lap Chart
  777. Forrst, Visualizing US Foreign Aid with D3.js
  778. Foucaut’s Stereographic Equivalent
  779. Four Ways to Slice Obama’s 2013 Budget Proposal
  780. France – Data Explorer
  781. From Random Polygon to Ellipse
  782. From tree to cluster and radial projection
  783. Fuzzy Counties
  784. Fuzzy Link-Bot
  785. G3plot-1
  786. Gall Stereographic
  787. Gall–Peters
  788. Game of life
  789. GAMEPREZ Developer Kit
  790. Gantt Chart plugin
  791. Gantt Chart, example 3
  792. Gauge
  793. Gaussian Primes
  794. General Update Pattern I
  795. General Update Pattern II
  796. General Update Pattern III
  797. GeoDash
  798. Geodesic Grid
  799. Geodesic Rainbow
  800. Geographic Bounding Boxes
  801. Geographic Bounding Boxes
  802. Geographic Clipping
  803. GeoJOIN
  804. GeoJSON Transforms
  805. Geometry daily #129
  806. GeoMobilité – Application cartographique de la mobilité
  807. Get dirty with data using d3.js
  808. getBBox
  809. Getting Started with D3
  810. ggplot2 + d3 = r2d3
  811. ggplot2-Style Axis
  812. Gilbert’s Two-World Perspective
  813. Ginzburg IV
  814. Ginzburg IX
  815. Ginzburg V
  816. Ginzburg VI
  817. Ginzburg VIII
  818. Giraffe : A Graphite Dashboard with a long neck
  819. Girko’s Circular Law
  820. Girls Lead in Science Exam, but Not in the United States
  821. Gist API Latency
  822. Git-backed Node Blob Server
  823. GitHub visualization
  824. Github Visualizer
  825. gka’s blocks
  826. Glimpse.js: a new chart library on top of D3.js
  827. Global Life Expectancy
  828. Global Oil Production & Consumption since 1965
  829. Global Surface Temperature: 500 … 2009
  830. Globe rendered with WebGL and Three.js.
  831. Glucose heatmap over hours of day
  832. Glucose with panning
  833. Gnomonic
  834. Gnomonic Butterfly
  835. Goode Homolosine
  836. Google calendar like display
  837. Google Flu Trends
  838. Google Hurdles
  839. Google Maps + D3
  840. GOV.UK’s web traffic
  841. Gradient Along Stroke
  842. Gradient Bump
  843. Gradient Encoding
  844. Graph diagram of gene ontology
  845. Graph of my current interests and aspirations
  846. Graph Rollup
  847. Graphicbaseball: 2012 Batters
  848. Graphicbaseball: 2012 Pitchers
  849. Graphs
  850. Gravity balls
  851. Gray Earth
  852. Great Arc
  853. Great Circle Arc Intersections
  854. Great-Circle Distance
  855. Grid layout
  856. Gringorten Equal-Area
  857. Grouped Bar Chart
  858. Grouped Bar Chart
  859. GSA-Leased Opportunity Dashboard
  860. Gun homicides in America 2010
  861. Gun ownership versus gun violence
  862. Guts of EnergyPlus Source Code Visualized with d3.js
  863. Guyou Projection
  864. Hacker News statistics using PhantomJS
  865. Hacker Notes, d3 tag
  866. Hamiltonian Graph
  867. Hammer
  868. Hammer Retroazimuthal
  869. Hamming Quilt
  870. Haphazard collection of examples for a book
  871. HarvardX Research: worldwide student enrollment
  872. Hashing Points
  873. Hata’s tree-like set (with slider)
  874. Hatnote Listen to Wikipedia
  875. HEALPix
  876. Health and Wealth of Nations
  877. Healthvis R package – one line D3 graphics with R
  878. Heatmap
  879. Heatmap and 2D Histogram
  880. Heatmap of gene expression with hierarchical clustering
  881. Heatmap with Canvas
  882. Heavily annotated scatterplot
  883. Hedonometer: Daily Happiness Averages for Twitter
  884. Heightmap
  885. Hell is Other People: Scott Made This
  886. Hello WebGL
  887. Hello WebGL II
  888. Hello WebGL III
  889. Hello WebGL IV
  890. herrstucki on bl.ocks
  891. Hexagonal Binning
  892. Hexagonal Binning (Area)
  893. Hexagonal cartogram of Asian economies and potential shifts in manufacturing
  894. Hexagonal Grids
  895. Hexbin Edits on OpenStreetMap
  896. Hierarchical Bar Chart
  897. Hierarchical Bars
  898. Hierarchical classification
  899. Hierarchical Edge Bundling
  900. Hierarchical Edge Bundling
  901. Hierarchical Edge Bundling
  902. Hierarchical Edge Bundling
  903. Hierarchical Edge Bundling
  904. Hilbert Curve
  905. Hilbert Stocks
  906. Hilbert Stocks
  907. Hilbert Tiles
  908. Hill Eucyclic
  909. Histogram
  910. Histogram (Redirect)
  911. Histogram Chart
  912. Histogram Generator with D3
  913. Historical UK Maps
  914. History of the WWE Title
  915. Hive Plot
  916. Hive Plot (Areas)
  917. Hive Plot (Links)
  918. Hive Plot for Student Systems
  919. Hobo–Dyer
  920. Home energy consumption
  921. Horizon Chart
  922. Horse Exports/Imports in the EU
  923. Hotspots
  924. House Hunting All Day, Every Day – Trulia Insights
  925. How does Quartz create visualizations so quickly on breaking news?
  926. How educated are world leaders?
  927. How Obama Won Re-election
  928. How selectAll Works
  929. How Selections Work
  930. How the Chicago Public School District Compares
  931. How to Animate Transitions Between Multiple Charts
  932. How to convert to D3js JSON format
  933. How to design a dashboard using d3.js
  934. How to Embed Open Spending Visualizations
  935. How to get a significant correlation value by moving just one point around
  936. How to Make an Interactive Network Visualization
  937. How to Make Choropleth Maps in D3
  938. How to visualise funnel data from Google Analytics
  939. HTML Overlay with pageX / pageY
  940. HTML5 input type nodes
  941. http://nowherenearithaca.blogspot.com/2012/06/annotating-d3-example-with-docco.html
  942. Hypercube Edges in Orthogonal Projection
  943. Hypercube with Parallel Coordinates
  944. Iceland Topography
  945. Icelandic population pyramid
  946. Icequake
  947. Icicle
  948. Icosahedron
  949. Icosahedron
  950. iD Architecture: Map Rendering and Other UI
  951. iD: a friendly editor for OpenStreetMap
  952. IDH des communes du Nord-Pas de Calais.
  953. iLearning – D3.js Basic for iPad
  954. Image Markers
  955. Image Processing
  956. Image tiles with float: left
  957. Immersion: a people-centric view of your email life
  958. Income diff. between male and female dominated occupations 1
  959. Income diff. between male and female dominated occupations 2
  960. Increased Border Enforcement, With Varying Results
  961. Increased Border Enforcement, With Varying Results – Interactive Graphic – NYTimes.com
  962. Indented tree layout
  963. Indian Village Components
  964. Indo-Europeans
  965. Inequality and NY Subway
  966. Inequality in America
  967. Infinite Plasma Fractal
  968. Infinite Queue
  969. Infro
  970. Infro.js: Filtering Tabular Data
  971. Infro.js: Nutrient Dataset
  972. Inkscape-s3-server
  973. Input Value Interpolation
  974. Inspired by geometry daily
  975. Instant interactive visualization with d3 + ggplot2
  976. Integrating D3 with a CouchDB database 1
  977. Integrating D3 with a CouchDB database 2
  978. Integrating D3 with a CouchDB database 3
  979. Integrating D3 with a CouchDB database 4
  980. Interactive azimuthal projection simulating a 3D earth with stars
  981. Interactive Data Visualization for the Web
  982. Interactive Data Visualization for the Web: read online
  983. Interactive Gnomonic
  984. Interactive Line Graph
  985. Interactive Line Graph
  986. Interactive MDS visualisation
  987. Interactive Orthographic
  988. Interactive Publication History
  989. Interactive Stereographic
  990. Interactive Streamgraph
  991. Interactive visual breakpoint detection on SegAnnDB
  992. Interpolating with d3.tween
  993. Interrupted Boggs Eumorphic
  994. Interrupted Goode Homolosine
  995. Interrupted Goode Raster
  996. Interrupted Mollweide
  997. Interrupted Sinu-Mollweide
  998. Interrupted Sinusoidal
  999. Interrupted Transverse Mercator
  1000. Intro to d3
  1001. Introducing Contributions on GitHub
  1002. Introduction
  1003. Introduction to D3
  1004. Introduction to D3
  1005. Introduction to D3.js
  1006. Introduction to d3.js and data-driven visualizations
  1007. Introduction to Network Analysis and Representation
  1008. IPython-Notebook with D3.js
  1009. Irish Horse Breeding Data
  1010. IRL Trnspttr
  1011. Irregular Histogram (Lollipop)
  1012. Is Barack Obama the President? (Balloon charts)
  1013. iTunes Music Library Artist/Genre Graph
  1014. Jan Willem Tulp portfolio
  1015. Japanese Government Bonds Rates
  1016. Japanese Government Bonds Yield Curve
  1017. Javascript and MapReduce
  1018. Javascript Idioms in D3.js
  1019. Jerome Cukier » Selections in d3 – the long story
  1020. Jérôme Cukier portfolio
  1021. JezzBall
  1022. Jim Vallandingham portfolio
  1023. Job Flow
  1024. Jobs by state
  1025. johan’s blocks
  1026. JSNetworkX: A port of the NetworkX graph lib to JS
  1027. Jsplotlib
  1028. Junction Finding
  1029. Just Enough SVG
  1030. K-means
  1031. Kaleidoscope
  1032. Kaprekar Routine
  1033. Kavrayskiy VII Projection
  1034. Kentucky Population Density
  1035. Kentucky Population Density
  1036. Kepler’s Tally of Planets
  1037. Kernel Density Estimation
  1038. Kind of 3D with D3
  1039. Kindred Britain
  1040. Know Huddle – Correlation
  1041. Koalas to the Max
  1042. L*a*b* and HCL color spaces
  1043. La Nuit Blanche
  1044. Labeled points
  1045. Labeling in OpenStreetMap’s iD Editor
  1046. Lagrange Projection
  1047. Lambert Azimuthal Equal-Area
  1048. Lambert Conformal Conic Projection
  1049. Language Network
  1050. Lantern
  1051. Larrivée Projection
  1052. Laskowski Tri-Optimal
  1053. Last Chart! – See the Music
  1054. Latest Earthquakes
  1055. Lazy Scale Domain
  1056. LDA Topic Arcs: The DaVinci Code
  1057. LDAviz
  1058. Leaflet + D3js: Hexbin
  1059. Leaflet Template
  1060. leaflet.d3
  1061. Leap Motion D3.js Demo
  1062. Leap motion map tests
  1063. Learn how to make Data Visualizations with D3.js
  1064. Learning D3, Speaker Deck
  1065. Left-Aligned Ticks
  1066. Left-Aligned Ticks
  1067. Legend
  1068. Leibniz Spiral
  1069. Lepracursor
  1070. Les Misérables Co-occurrence Matrix
  1071. Let’s Make a Map
  1072. Letter Frequency
  1073. Liberal Revolution of 1820 in Lisbon
  1074. Library for visualizing Go games
  1075. License Usage Dashboard
  1076. Life expectancy 1960-2009 choropleth
  1077. Life expectancy 1960-2009 panel chart
  1078. Life expectancy 1960-2009 slopegraph
  1079. Limaçon as envelope of circles around a circle
  1080. Line Chart
  1081. Line Chart with tooltips
  1082. Line chart with zoom, pan, and axis rescale
  1083. Line Interpolation
  1084. Line Intersection Brushing
  1085. Line Simplification
  1086. Line Tension
  1087. Line Transition
  1088. Line Transition (Broken)
  1089. Linear Gradient
  1090. Linear Programming
  1091. Lines with Rounded Turns
  1092. Linked Jazz network graph
  1093. List of all the Gists from Mike Bostock
  1094. Littrow
  1095. Live coding based on Bret Victor’s Inventing on Principle talk
  1096. Loading a thumbnail into Gist for bl.ocks.org d3 graphs
  1097. Loading Adobe Photoshop ASE color palette
  1098. Lobster Catch Analyst
  1099. Log Axis
  1100. Log Axis with Zero
  1101. London D3.js Meetup #2
  1102. London d3.js Meetup #5
  1103. London d3.js User Group
  1104. London Olympics Perceptions – Donuts to Chord Diagram Transition
  1105. Long Scroll
  1106. Lorenz System
  1107. Lorenz Toy
  1108. Loupe
  1109. Loximuthal
  1110. Loxodrome
  1111. Made with D3.js
  1112. Major League Baseball Home Runs 1995-2010
  1113. Make a bubble chart using d3.js demo
  1114. Making maps with d3.js
  1115. Mandel for Controller Bulldog Budget
  1116. Manipulating data like a boss with d3
  1117. Manual Axis Interpolation
  1118. Map Direct Flights with D3
  1119. Map from GeoJSON data with zoom/pan
  1120. Map of all M2.5+ earthquakes of the last 24h.
  1121. Map of COMIPEMS Scores
  1122. Map of Germany using D3.js and Simplify.js
  1123. Map of Italiens
  1124. Map of pro sports teams by territory
  1125. Map Projection Distortions
  1126. Map Projection Transitions
  1127. Map with faux-3D globe
  1128. Map Zooming
  1129. Map Zooming II
  1130. Map Zooming III
  1131. Mapbox: add vector features to your map with D3
  1132. Mapping Hate Crimes in Iran
  1133. Mapping the Melting Pot
  1134. Mapping Tours with D3 and SeatGeek
  1135. Maps and sound
  1136. Maps Garage: Exploring Map Data with Crossfilter
  1137. Marey’s Trains
  1138. Marey’s Trains II
  1139. Margin Convention
  1140. Marimekko Chart
  1141. Marimekko, Mekko or Mosaic Chart
  1142. Markov processes
  1143. Marmoset chimerism dotplot
  1144. Masking with external svg elements
  1145. MathBox animation vs d3.js enter/exit
  1146. MathJax label
  1147. Matrix Layout
  1148. Maurer No. 73
  1149. Men’s 100m Olympic champions
  1150. Mercator
  1151. Mercator Projection
  1152. Merge Sort
  1153. Merging States
  1154. Merging States II
  1155. Meshu turns your places into beautiful objects.
  1156. Messing around with D3.js and hierarchical data
  1157. Metaevil
  1158. meteor-deployments
  1159. Metrica
  1160. Metropolitan Unemployment
  1161. Mexican Presidential Election 2012
  1162. mgrafeeds
  1163. Mike Bostock portfolio
  1164. Milky Way
  1165. Miller Projection
  1166. Minecraft Overviewer
  1167. Minimalist example of reusable D3.js plugin
  1168. Miniviz
  1169. Minute: record of all of my keystrokes
  1170. Mirrored Easing
  1171. Misc. Examples
  1172. Miscellaneous utilities for D3.js
  1173. Mitchell’s Best-Candidate
  1174. Mitchell’s Best-Candidate 1
  1175. Mitchell’s Best-Candidate 2
  1176. Mitchell’s Best-Candidate 3
  1177. MLB Hall of Fame Voting Trajectories
  1178. MN Giving Day 2012
  1179. Mobile Patent Lawsuits
  1180. Mobile Patent Suits
  1181. Modal Logic Playground
  1182. Modifying a Force Layout
  1183. Moiré Patterns
  1184. Molecule
  1185. Mollweide
  1186. Mollweide Hemispheres
  1187. Mollweide Watercolour
  1188. Monday-based Calendar
  1189. Money Wins Elections
  1190. Monotone Interpolation Bug
  1191. Monotone Line Interpolation
  1192. Monte Carlo simulation of bifurcations in the logistic map
  1193. Month Axis
  1194. More Data Visualization Libraries Based on D3.JS
  1195. More Introduction to D3
  1196. Morley’s trisector theorem
  1197. Morphogenesis Simulation
  1198. Most simple d3.js stack bar chart from matrix
  1199. Mouseenter
  1200. mousewheel-zoom + click-to-center
  1201. Movie color analysis with XBMC, Boblight, Java and D3.js
  1202. Moving Histogram
  1203. Moving Squares
  1204. Mower game
  1205. Muerte Materna en Argentina
  1206. Multi-Foci Force Layout
  1207. Multi-Foci Force Layout
  1208. Multi-Line Voronoi
  1209. Multi-Series Line Chart
  1210. Multi-series Line Chart with Long Format Data (columns instead of rows)
  1211. Multi-Series Line to Stacked Area Chart Transition
  1212. Multi-Value Maps
  1213. Multiline chart with brushing and mouseover
  1214. Multiline with zoomooz
  1215. Multiple Area charts and a brush tool
  1216. Multiple area charts with d3.js
  1217. Multiple Leap Motions over WebSockets – YouTube
  1218. Multiple Lines grid
  1219. Multiple time-series with object constancy
  1220. Multiple visualization from the Société Typographique de Neuchâtel
  1221. My Force Directed Graph
  1222. Natural Earth
  1223. Natural Log Scale
  1224. NCAA 2012 March Madness Power Rankings
  1225. Negative stacked bar chart
  1226. Nell–Hammer Projection
  1227. Nested Selections
  1228. Network of World Merchandise Trade
  1229. Neuroscience and brain stimulation publication counts
  1230. New Jersey Blocks
  1231. New Jersey State Plane
  1232. New York Block Groups
  1233. New Zealand Earthquakes Pattern of Life
  1234. Newton’s balls
  1235. Newton’s cradle
  1236. NFL salaries by team and position
  1237. Nick Jaffe’s Polymap
  1238. No Antimeridian Cutting
  1239. Nodal is a fun way to view your GitHub network graph
  1240. Node + MySQL + JSON
  1241. Node-Link Tree
  1242. Non-Computed Style Tween
  1243. Non-contiguous Cartogram
  1244. Non-Contiguous Cartogram
  1245. Non-contiguous cartogram of seats allocated in the canadian House of Commons
  1246. Noob on JSON : Data for d3.js documents
  1247. Normalized Stacked Bar Chart
  1248. Number of heat stroke
  1249. Number of unique rectangle-free 4-colourings for an nxm grid
  1250. Nutrient Database Explorer
  1251. NVD3
  1252. NVD3 for BI
  1253. nvd3.py
  1254. NY Times Strikeouts Graphic, recreated using rCharts and PolychartJS
  1255. NYC Bike Share
  1256. NYC D3.js
  1257. Obesity map
  1258. Object Constancy
  1259. Object constancy with multiple sets of time-series
  1260. Ocean
  1261. OECD Health Government Spending and Obesity Rates (nvd3)
  1262. offsetX / offsetY
  1263. offsetX / offsetY
  1264. Ohio State Plane (N)
  1265. Old Visualizations Made New Again
  1266. Oliver Rolle / Logarithmic Line Chart
  1267. Olympic Medal Rivalry
  1268. OMG Particles!
  1269. One Path for All Links
  1270. One System, Every Kepler Planet
  1271. One-Way Markers
  1272. Online GeoJSON and TopoJSON renderer
  1273. Open Knowledge Festival Hashtag Graph Visualization
  1274. OpenBudget
  1275. OPHZ Zooming
  1276. ORBIS v2
  1277. Order
  1278. Ordinal Axis
  1279. Ordinal Brushing
  1280. Ordinal Tick Filtering
  1281. Ordinal Tick Filtering
  1282. Orthographic
  1283. Orthographic Clipping
  1284. Orthographic Grid
  1285. Orthographic Projection
  1286. Orthographic Shading
  1287. Orthographic to Equirectangular
  1288. Over the Decades How States Have Shifted
  1289. Pack Test
  1290. Pack Test
  1291. Pair Contribution and Selection
  1292. Pale Dawn
  1293. Pan+Zoom
  1294. Papa
  1295. Parallel Coordinates
  1296. Parallel Coordinates
  1297. Parallel Coordinates
  1298. Parallel Coordinates
  1299. Parallel Coordinates
  1300. Parallel coordinates with fisheye distortion
  1301. Parallel Lines and Football using Dex and D3.js
  1302. Parallel Sets
  1303. Paris Transilien
  1304. Partition Layout (Zoomable Icicle)
  1305. Path and Transform Transitions
  1306. Path from function 2
  1307. Path from function 3
  1308. Path Tween
  1309. path_from_function_2
  1310. Path_from_function_2
  1311. Pedigree Tree
  1312. Peirce Quincuncial
  1313. Peirce Quincuncial
  1314. Percent women in city councils
  1315. Percentile line chart of gene expression microarrays
  1316. Percolation model
  1317. Periodic table
  1318. Periodic table
  1319. Perlin circles
  1320. Perlin Ink
  1321. Perlin Landscape
  1322. Perlin Worms
  1323. Peter Cook Web Developer
  1324. Ph.D. Thesis Progress
  1325. PhD in the Bundestag
  1326. Phylogenetic Tree of Life
  1327. Pictograms
  1328. Pie Chart
  1329. Pie Chart Update I
  1330. Pie Chart Update II
  1331. Pie chart update III
  1332. Pie chart update IV
  1333. Pie Chart Update, III
  1334. Pie Chart Update, IV
  1335. Pie Chart Update, V
  1336. Pie Chart Updating with Text
  1337. Pie Multiples
  1338. Pie Multiples with Nesting
  1339. Pimp my Tribe
  1340. Pixymaps (Dragging)
  1341. Pixymaps (Scrolling)
  1342. Placename patterns
  1343. Places in the Game of Thrones
  1344. Plan du métro interactif
  1345. Plan interactif du métro
  1346. Planarity
  1347. Planck-cl
  1348. Plant Hardiness Zones
  1349. Plants
  1350. Plot.io (swallowed by Platfora)
  1351. Plotly: create graphics, analyze with Python, annotate and share
  1352. Plotsk: A python/coffeescript/d3.js-based library for plotting data in a web browser
  1353. Plotting library for python based on D3.js
  1354. Población de Argentina, Experimento D3.js
  1355. Poincaré Disc
  1356. Point-Along-Path Interpolation
  1357. Point-Along-Path Interpolation
  1358. Polar Azimuthal Equal-area
  1359. Polar Plot
  1360. PolarClock
  1361. Polls on the 2012 U.S. Election
  1362. Polybrush.js
  1363. Polychart: A browser-based platform for exploring data and creating charts
  1364. Polyconic Projection
  1365. Polygonal Lasso Selection
  1366. Polylinear Time Scale
  1367. Polymaps / Andrew Mager
  1368. Polymaps / Andrew Mager
  1369. Polymaps / Andrew Mager
  1370. Polymaps / Andrew Mager
  1371. Polymaps / Andrew Mager
  1372. Polymaps / CSS Hover
  1373. Polymaps / Heatmap
  1374. Polymaps / JSONP Queue
  1375. Polymaps / Procedural Perlin
  1376. Polymaps + D3
  1377. Polymaps + D3 Part 2
  1378. Polymaps bad projection example
  1379. polymaps.appspot.com
  1380. Poor Anti-Aliasing in SVG #1
  1381. Poor Anti-Aliasing in SVG #2
  1382. Population Choropleth
  1383. Population of the cantons and of the 10 largest cities of Switzerland
  1384. Population Pyramid
  1385. Portfolio
  1386. Portrait in Chinese ascii: Chris Viau
  1387. Portrait in Chinese ascii: EJFox
  1388. Predsjednik Republike Srpske
  1389. Presentation on Visualizing Data in D3.js and mapping tools at NetTuesday
  1390. Price Changes: animated dimple.js chart
  1391. Processing Fixed-Width Data
  1392. Profils des cyclistes
  1393. Programmatic Pan+Zoom
  1394. Progress Events
  1395. Project Groups – IS428: Visual Analytics for Business Intelligence
  1396. Project to Bounding Box
  1397. Projected Choropleth
  1398. Projected TopoJSON
  1399. Projection Contexts
  1400. Projection Transitions
  1401. Proof of Pythagoras’s Theorem
  1402. Proportion of Foreign Born in Large Cities: 1900
  1403. Prose-only Blocks
  1404. Prototype Chart Template (WIP)
  1405. Prototype: d3.geo
  1406. Protovis / David Karr
  1407. Protovis / Nelson Minar
  1408. Protovis / Quomo Pete
  1409. Pseudo-Demers Cartogram
  1410. Pseudo-Dorling Cartogram
  1411. Psi man
  1412. Public Interest Evaluation Project
  1413. Pushing D3.js commands to the browser from iPython
  1414. Pyramid charts: demographic transition in the US
  1415. Python-NVD3
  1416. Q-Q Plots
  1417. Quadratic Koch Island Simplification
  1418. Quadtree
  1419. Quadtree Madness Round 2
  1420. Quartic Authalic
  1421. Quartile plots
  1422. Quartile plots with outliers
  1423. Queue.js Demo
  1424. Quick Charting with D3js
  1425. Quick scatterplot tutorial for d3.js
  1426. Quicksort
  1427. Radar chart
  1428. Radial Arc Diagram
  1429. Rainbow Colors
  1430. Rainbow showing how to use mask and clipPath
  1431. Rainbow Worm
  1432. Rainbows are Harmful
  1433. Raindrops
  1434. Rainflow
  1435. Rakie & Jake
  1436. Random Arboretum
  1437. Random Points on a Sphere
  1438. Random Walk in Configuration Space
  1439. Range Transition
  1440. Raster & Vector Zoom
  1441. Raster Reprojection
  1442. Rbspd3
  1443. rCharts Custom, Cancer, Fantasy Football, and Three Level Mixed Effects Logistic Regression
  1444. rCharts: R interface for NVD3, Polycharts, MorrisJs and soon Rickshaw, DexCharts, Dc.js
  1445. Reactive Charts with D3.js and Reactive.js
  1446. Read File or HTTP
  1447. Real time sales
  1448. Real-time sentiment analysis of Obama 2012 victory speech
  1449. Really cool wordpress theme
  1450. Realtime Visualizations w/ D3 and Backbone
  1451. Realtime webserver stats
  1452. Recettear Item Data
  1453. Rectangular Polyconic
  1454. Rectilinear Grid
  1455. Reddit Insight
  1456. Rega: Experimental Ruby Vega generator
  1457. Reingold–Tilford Tree
  1458. Reingold–Tilford Tree (Redirect)
  1459. Relations of football players participating in Euro 2012
  1460. Remittance flows
  1461. Remittances
  1462. Render Geographic Information in 3D With Three.js and D3.js
  1463. Render server-side using Phantomjs
  1464. Rendering Tests
  1465. Reorderable Stacked Bar Chart
  1466. Replicating a New York Times d3.js Chart with Tableau
  1467. Reports for Simple
  1468. Reprojected Raster Tiles
  1469. Republic of Ireland – Data Explorer
  1470. Resampling Comparison
  1471. Resampling Comparison
  1472. Resizable Force Layout
  1473. Resizable Markers
  1474. Responsive D3
  1475. Responsive SVG resizing without re-rendering
  1476. Responsive TopoJSON Sizing
  1477. Retrofit Analysis Report
  1478. Reusable D3 With The Queen, Prince Charles, a Corgi and Pie Charts
  1479. Reusable D3.js, Part 1: Using AttrTween, Transitions and MV*
  1480. Reusable D3.js, Part 2: Using AttrTween, Transitions and MV*
  1481. Reusable Interdependent Interactive Histograms
  1482. Reusable Pie Charts
  1483. Reusable text rotation
  1484. Reveal animation on a tree with a clip path
  1485. Reverse Geocoding Plug-in using an offline canvas
  1486. Rhodonea Curve
  1487. RHQ – Project Documentation Editor
  1488. Rickshaw: JavaScript toolkit for creating interactive real-time graphs
  1489. Ring Cutting
  1490. Ring Extraction
  1491. Rivers of the U.S.A.
  1492. Robinson Projection
  1493. Romanian parliamentarian bubble chart. In Romanian
  1494. Rotated Axis Labels
  1495. Rotating Cluster Layout
  1496. Rotating Equirectangular
  1497. Rotating Icosahedron
  1498. Rotating Orthographic
  1499. Rotating Orthographic
  1500. Rotating Orthographic
  1501. Rotating Transverse
  1502. Rotating Transverse Mercator
  1503. Rotating Transverse Mercator
  1504. Rotating Voronoi
  1505. Rotating Winkel Tripel
  1506. Rounded Rectangle
  1507. Rounded Rectangles
  1508. Route Probability Exploration with Parallel Coordinates
  1509. Running Away Balloons – simple game
  1510. sammyt/see
  1511. San Francisco Contours
  1512. San Francisco Movies (Beta Version)
  1513. Sankey Diagram
  1514. Sankey diagram with cycles
  1515. Sankey diagram with horizontal and vertical node movement
  1516. Sankey Diagram with Overlap
  1517. Sankey diagrams from Excel
  1518. Sankey Diagrams of Local Economic Flows
  1519. Sankey from Excel, inherited cell colors for links
  1520. Sankey Interpolation
  1521. Sankey your Google Spreadsheet Data
  1522. saraquigley bl.ocks
  1523. SAS ANALYSIS
  1524. SAS and D3.js: a macro to draw scatter plot
  1525. SAS and D3.js: map to display US cities murder rates
  1526. Satellite Projection
  1527. Satellite Projection Test
  1528. Satellite Raster
  1529. Save SVG as PNG
  1530. Scale-Dependent Sampling
  1531. Scatterize
  1532. Scatterplot
  1533. Scatterplot and Heatmap
  1534. Scatterplot for K-Means clustering visualization
  1535. Scatterplot Matrix
  1536. Scatterplot Matrix
  1537. Scatterplot Matrix Brushing
  1538. Scatterplot with Multiple Series
  1539. Scatterplot with Shapes
  1540. Schelling’s segregation model
  1541. School Absenteeism
  1542. SCION simulation environment
  1543. Scott Murray tutorials in Japanese
  1544. See-Through Globe
  1545. See-Through Globe II
  1546. Segmented Lines
  1547. Selectable elements
  1548. Selecties EK 2012
  1549. selection.order
  1550. Self-Immolation In Tibet
  1551. Self-Organising Maps
  1552. Sensitivity/Specificity Plot
  1553. Sequential Tiles
  1554. Series of D3.js video tutorials
  1555. Set Partitions
  1556. Seven years of SSLC in Karnataka
  1557. Shape of My Library — Comics
  1558. Shape Tweening
  1559. Shared Data
  1560. SHEETSEE.JS: Fill up Websites with Stuff from Google Spreadsheet
  1561. Shiboronoi
  1562. Shiny and R adaptation of Mike Bostock’s d3 Brushable Scatterplot
  1563. Shiny R and D3.js
  1564. Simple Bar Graph in Angular Directive with d3.js and Prototype.js
  1565. Simple D3.js Bar Chart Webcast
  1566. Simple Dashboard Example
  1567. Simple example using Vega, D3, and Jstat
  1568. Simple HTML data tables
  1569. Simple Junctions
  1570. Simple Radar Chart
  1571. Simple Reusable Bar Chart
  1572. Simple scatterplot
  1573. Simple table
  1574. Simple-map-d3
  1575. Simplex Noise Code 39 Barcode
  1576. Simplex Noise Dots
  1577. Simplifying and cleaning Shapefiles.
  1578. Sinu-Mollweide
  1579. Sinusoidal
  1580. Skillpedia: an open encyclopedia for skills
  1581. Sky Open Source, Behavioral Database
  1582. Skybox
  1583. SKYFALL. Meteorite falls map.
  1584. Slides and live code from the GAFFTA d3 intro workshop
  1585. Slippy map + extent indicator
  1586. SlopeGraph
  1587. Slopegraph lines in SVG and Canvas
  1588. Slopegraphs
  1589. Small Multiples
  1590. Small Multiples with Details on Demand
  1591. Smoke charts
  1592. Smooth Scrolling
  1593. Smooth Slider
  1594. SnakeViz: An In-Browser Python Profile Viewer
  1595. Snowden’s Route
  1596. Snowflake Simplification
  1597. Snowflakes
  1598. Snowflakes with D3
  1599. Social trust vs ease of doing business
  1600. Social web use in 2009
  1601. SOCR Violin Chart
  1602. Solar Terminator
  1603. SOM Animation
  1604. Sortable Bar Chart
  1605. Sortable Bar Chart
  1606. Sortable Table with Bars
  1607. Sorting Visualisations
  1608. Sparkline Directive for Angular with d3.js
  1609. Sparklines
  1610. SPARQLy GUIs: Linked Data and Semantic Web technologies
  1611. Spermatozoa
  1612. Sphere Spirals
  1613. Spherical Mercator
  1614. Spilhaus Maps
  1615. Spinny Globe
  1616. Spiral experiment
  1617. Spiral for John Hunter
  1618. Splay Tree animation with Dart D3.js and local storage
  1619. Spline Editor
  1620. Spline Transition
  1621. Split line game
  1622. Square Circle Spiral Illusion
  1623. Squares ↔ Hexagons
  1624. SRTM Tile Grabber: downloading elevation data
  1625. Stacked and grouped bar chart
  1626. Stacked Area Chart
  1627. Stacked Area via Nest
  1628. Stacked Bar Chart
  1629. Stacked Bar Chart
  1630. Stacked bar chart from a structure description of an R table
  1631. Stacked layout with time axis
  1632. Stacked Radial Area
  1633. Stacked-to-Grouped Bars
  1634. Stacked-to-multiple bar charts
  1635. Stage rankings – Tour de France 2013
  1636. Startseite – NZZ.ch
  1637. Startup Salary & Equity Compensation
  1638. Stat 221
  1639. Stat 221
  1640. Static Force Layout
  1641. SteamGraphs and Dex
  1642. Step by Step-Road Accidents in cities by years 2010
  1643. Steps Walked per Day
  1644. Stereographic
  1645. Sticky Force-Directed Graph
  1646. Stitching States from Counties
  1647. Stowers Group Collaboration Network
  1648. Strange attractor
  1649. Strata 2013 D3 Tutorial, Speaker Deck
  1650. Streamgraph
  1651. Streamgraph
  1652. Streamgraph
  1653. Streamgraph realtime streaming mouse coordinates
  1654. Streams
  1655. Streams
  1656. Street Extent Visualization Using #d3js and CartoDB
  1657. Strikeouts Are Still Soaring
  1658. Stripe Gross Volume with D3.js
  1659. Stroke Dash Interpolation
  1660. stroke-dasharray
  1661. Students’ seating habits
  1662. style.setProperty
  1663. Subsecond Ticks
  1664. SugarForge: SolCRM by AlineaSol: Project Info
  1665. Summer Olympics Home Ground Advantage
  1666. Sunburst
  1667. Sunburst Layout with Labels
  1668. Sunburst with Distortion
  1669. Sunflower Phyllotaxis
  1670. Sunlight Heatmap
  1671. Sunny side of the Earth, for any date and time
  1672. Superformula Explorer
  1673. Superformula Tweening
  1674. Superscript Format
  1675. Superscript Format II
  1676. SVG feGaussianBlur
  1677. SVG foreignObject Example
  1678. SVG Geometric Zooming
  1679. SVG Group Element and D3.js
  1680. SVG Open Keynote Slides
  1681. SVG Path Cleaning
  1682. SVG Patterns
  1683. SVG resize to container
  1684. SVG Semantic Zooming
  1685. SVG Swarm
  1686. SVG to Canvas
  1687. SVG to Canvas to PNG using Canvg
  1688. Swimlane
  1689. Swiss Cantons
  1690. Swiss Topography
  1691. Symbol Map
  1692. Symbol Map
  1693. Table of Progress
  1694. Table Sorting
  1695. Table-driven plot
  1696. Tag Cloud
  1697. TAGSExplorer: Visualising Twitter graphs from a Google Spreadsheet
  1698. Talk at JS.geo 2013
  1699. Tampa Bay Rays Streamgraph
  1700. Telostats: Public bike stations in Tel Aviv
  1701. Templating ala Mustache with Chernoff faces example
  1702. Test Env
  1703. Tetris
  1704. Tetris
  1705. Text on arc path
  1706. TGI Models
  1707. The Amazing Pie
  1708. The Beautiful Table: fancy bar chart of football statistics
  1709. The business of Bond
  1710. The Concept Map
  1711. The d3 Community: How to Get Involved
  1712. The Diabetes Dashboard
  1713. The electoral map: building path to victory
  1714. The Euro Debt Crisis
  1715. The first commented line is your dabblet’s title
  1716. The first thing that should be shown in any Trigonometry class
  1717. The Gist to Clone All Gists
  1718. The Holy Bible Visualization
  1719. The last slice of PIE
  1720. The Music of Graphs
  1721. The open source card report
  1722. The Polya process
  1723. The Polyconic Projection
  1724. The Polyglots Project
  1725. The Quest for the Graphical Web
  1726. The Senate Social Network
  1727. The Sentinal project
  1728. The Story of The US Told In 141 Maps
  1729. The Sun’s View of the Earth
  1730. The Wealth & Health of Nations
  1731. Thinking with Joins
  1732. Threat Report
  1733. Three Little Circles
  1734. Three-Axis Rotation
  1735. Threshold Choropleth
  1736. Threshold Encoding
  1737. Threshold Key
  1738. Time Bubble Lines
  1739. Time Series
  1740. Timeline
  1741. Timeline
  1742. Timeline of earthquake in Christchurch 2010
  1743. Times
  1744. Tissot’s Indicatrix
  1745. Tmcw’s bl.ocks
  1746. tnightingale bl.ocks
  1747. Tobler World-in-a-Square
  1748. Tooltips for D3.js visualizations
  1749. TopoJSON Examples
  1750. TopoJSON Layers
  1751. TopoJSON Parallax
  1752. TopoJSON Points
  1753. TopoJSON vectors on raster image tiles, with zoom and pan
  1754. Topology-Preserving Geometry Simplification
  1755. Towards Reusable Charts
  1756. TradeArc – Arc Diagram of Offseason NHL Trades
  1757. Traffix jitsu
  1758. Transform Interpolation
  1759. Transform Transitions
  1760. Transition End
  1761. Transition Example
  1762. Transition from a streamgraph to multiple area charts
  1763. Transition Speed Test
  1764. TransportView
  1765. Transverse Mercator
  1766. Transversing Equirectangular
  1767. Tree Layout from CSV
  1768. Tree layout mods
  1769. Tree Layout Orientations
  1770. Treemap
  1771. Treemap
  1772. Treemap Padding
  1773. Tributary
  1774. Tributary, optical_illusion_001_motion2
  1775. Tributary, simple globe canvas
  1776. Trisul Network Analytic
  1777. TruliaTrends
  1778. TruliaTrends
  1779. Try D3 Now
  1780. Trying out D3’s geographic features
  1781. Tübingen
  1782. Tufte’s slope graphs
  1783. Tweening Polygons
  1784. Tweitgeist: Live Top Hashtags on Twitter
  1785. Twitter Activity During Hurricane Sandy
  1786. Twitter Influencer Visualization
  1787. Twitter SVG Logo
  1788. Two Point Equidistant
  1789. Two Point Equidistant
  1790. Two Tables, Understanding D3 Selections
  1791. U.S. Airports
  1792. U.S. Counties TopoJSON
  1793. U.S. Counties TopoJSON Mesh
  1794. U.S. Land TopoJSON
  1795. U.S. Population Pyramid
  1796. U.S. Rivers
  1797. U.S. State Mesh
  1798. U.S. States TopoJSON
  1799. U.S. TopoJSON
  1800. U.S. TopoJSON
  1801. U.S. Urban Areas
  1802. Uber Rides by Neighborhood
  1803. UK University Statistics
  1804. UK Wind
  1805. UMLS (Unified Medical Language System) Visualizer
  1806. UN Global Pulse 2010 Visualization
  1807. Underscore’s Equivalents in D3
  1808. Understanding the D3 Parallel Plot Example
  1809. Unemployment ranked with horizontal bars
  1810. Unit circle animation
  1811. United Kingdom Peace Index
  1812. University of Washington Departments
  1813. Unknown Pleasures
  1814. Untitled-2
  1815. Update-Only Transition
  1816. Upside-Down Text
  1817. Urban bus races
  1818. Urban Water Explorer
  1819. US Budget
  1820. US Census Visualization
  1821. US Elections 2012 / Twitter
  1822. US energy consumption since 1775
  1823. US History in Maps
  1824. US, CA, MX and PR
  1825. Use Inkscape shapes in D3.js tree diagram
  1826. Use the Force! Slides
  1827. Use the Force! Video
  1828. Using and Abusing the force
  1829. Using d3 visualization for fraud detection and trending
  1830. Using D3, backbone and tornado to visualize histograms of a csv file
  1831. Using D3.js to Brute Force the Pirate Puzzle – Azundo Design
  1832. Using Inkscape with d3
  1833. Using Plunker for development and hosting your D3.js creations
  1834. Using Selections in D3 to Make Data-Driven Visualizations
  1835. Using SMASH for custom D3.js builds
  1836. Using SVG and canvas on the same force-directed layout
  1837. Using SVG Gradients and Filters With d3.js
  1838. Using the D3.js Visualization Library with AngularJS
  1839. UT1 – UTC
  1840. uvCharts
  1841. Van der Grinten II
  1842. Van der Grinten III
  1843. Van der Grinten IV
  1844. Van der Grinten Projection
  1845. Van Wijk and Nuij Zooming
  1846. van Wijk Smooth Zooming
  1847. Variable-width Histogram
  1848. Various visualisations especially with d3.geo
  1849. Vector Tiles
  1850. Vector Tiles
  1851. Vector Tiles
  1852. Vega for time series chart with shaded blocks
  1853. Vegetable Nutrition w/ Parallel Coordinates
  1854. Vélib network visualization
  1855. Venn diagram
  1856. Venn Diagram using Clipping
  1857. Venn Diagram using Opacity
  1858. Venn Diagrams with 3+ circles
  1859. Vertical Bullet Charts
  1860. Vertical Bullet Charts
  1861. Very limited in-progress attempt to hook d3.js up to three.js
  1862. Veteran Survival Data
  1863. Video tutorials in Japanese
  1864. Viewing OpenLearn Mindmaps Using d3.js
  1865. Viewing Relations, Attributes, and Entities in RDF
  1866. VIM keymap
  1867. Violin: Instrumenting JavaScript
  1868. Violin/Box plots
  1869. Virginia Counties
  1870. Visual Hacker News
  1871. Visual Search
  1872. Visual Sedimentation
  1873. Visual Sedimentation Tweet
  1874. Visual Storytelling with D3: An Introduction to Data Visualization in JS
  1875. Visual.ly Meetup Recap: Introductory D3 Workshop
  1876. Visual.ly tagged D3.js
  1877. Visualising a real-time DataSift feed with Node and D3.js
  1878. Visualising Change in Presidential Vote
  1879. Visualising ConAir Data With Cubism.js Arduino TempoDB Sinatra
  1880. Visualising New Zealand’s Stolen Vehicle Database Part1
  1881. Visualising New Zealand’s Stolen Vehicle Database Part2
  1882. Visualization of Beijing Air Pollution
  1883. Visualization of music suggestion
  1884. Visualize online conversion journeys
  1885. Visualize with d3js: Bring life to your data
  1886. Visualize Words on My Blog Using D3.js
  1887. Visualizing a network with Cypher and d3.js
  1888. Visualizing a newborn’s feeding and diaper activity
  1889. Visualizing book production – Tools of Change for Publishing
  1890. Visualizing Data with Web Standards Slides
  1891. Visualizing Data with Web Standards Video
  1892. Visualizing document similarity over time
  1893. Visualizing Facebook Friends With D3.js
  1894. Visualizing Flight Options
  1895. Visualizing Hospital Price Data
  1896. Visualizing my entire website as a network
  1897. Visualizing NetworkX graphs in the browser using D3
  1898. Visualizing NFL Draft History
  1899. Visualizing opinions around the world (zoomable world map and interactive pie chart)
  1900. Visualizing San Francisco Home Price Ranges
  1901. Visualizing Swiss politicians on Twitter using D3.js
  1902. Visualizing the iOS App Store
  1903. Visualizing the News through Metro Maps
  1904. Visualizing The Racial Divide
  1905. Visualizing U.S. Births and Deaths in Real-Time
  1906. VizWiz: Displaying time-series data
  1907. VLS&STATS making off « @comeetie :: blog
  1908. Von der EEG Umlage befreite Unternehmen
  1909. Voronoi Arc Map
  1910. Voronoi Boids: Voroboids
  1911. Voronoi Clipping
  1912. Voronoi Diagram
  1913. Voronoi Diagram with Force Directed Nodes and Delaunay Links
  1914. Voronoi Labels
  1915. Voronoi Lookup
  1916. Voronoi paint
  1917. Voronoi Picking
  1918. Voronoi Tesselation
  1919. Voronoi Tessellation (Redirect)
  1920. Voronoi Tessellation (Redirect)
  1921. Voronoi Test (N=2)
  1922. Voronoi tests
  1923. Voynich Manuscript Voyager
  1924. VVVV viewer
  1925. W3C Validation Errors
  1926. Wagner IV
  1927. Wagner VI Projection
  1928. Wagner VII
  1929. Walmart locations
  1930. Waterfall chart of Tendulkar’s ODI career
  1931. Waterman Butterfly
  1932. Waterman Butterfly Map
  1933. Wave
  1934. We Love France: transition between the Hexagon and a heart
  1935. We’re In The Money: How Much Do The Movies We Love Make?
  1936. Weather of the World
  1937. Web reporting with D3js and R using RStudio Shiny
  1938. Web Traffic as flying bubbles
  1939. Web-Based Visualization Part 1: The D3.js Key Concept
  1940. Webplatform dancing logo
  1941. WebPlatform.org SVG Logo
  1942. Website Graph Navigation
  1943. Weeknd3
  1944. What do countries look like?
  1945. What Do You Work For?
  1946. What makes us happy
  1947. What Size Am I? Finding dresses that fit
  1948. When is Easter?
  1949. Which career should I invest in?
  1950. White House Petition Choropleth
  1951. Who are Rennes Metropolis inhabitants?
  1952. Who do they serve
  1953. Who Voted for Rick Santorum and Mitt Romney
  1954. Why are people shooting up our schools?
  1955. Wiechel
  1956. Wikistalker
  1957. Wimbledon
  1958. Wimbledon 2013 Player bubbles
  1959. Wind
  1960. Wind History
  1961. Winkel Tripel Graticule
  1962. Winkel Tripel Projection
  1963. Wood Grain
  1964. Word Cloud
  1965. Word Frequency Bubble Clouds
  1966. Word Tree
  1967. Word with map image tiles clipped to the font
  1968. Word wrap in SVG using foreignObject
  1969. World Bank Global Development Sprint
  1970. World Bank Global Development Sprint
  1971. World Boundaries TopoJSON
  1972. World Map
  1973. World Map with Country Descriptions
  1974. World Population Density with D3 and Hammer Projection
  1975. World Tour
  1976. World Wide Women’s Rights
  1977. WorldBank Contract Awards
  1978. Wrapping Long Labels
  1979. X-Value Mouseover
  1980. x3dom event test
  1981. xCharts: a D3-based library for building custom charts and graphs
  1982. XKCD-style plots
  1983. Your Tax-paid Tweets
  1984. Zensus 2011 Atlas
  1985. Zero Ticks
  1986. Zip Codes
  1987. Zipdecode
  1988. zipdecode
  1989. ZJONSSON’s bl.ocks
  1990. zmaril/d3 » src › core › nest.js
  1991. Zoom a map to a Feature Bounding Box
  1992. Zoom Center
  1993. Zoom Transitions
  1994. Zoomable Area
  1995. Zoomable Area Chart
  1996. Zoomable Circle Packing
  1997. Zoomable map
  1998. Zoomable Pack Layout
  1999. Zoomable Partition Layout
  2000. Zoomable Sunburst
  2001. Zoomable Treemap
  2002. Zoomdata
  2003. ZUI in D3.js

Also, here are my d3 examples. Please share with others and enjoy.

Acquiring, cleaning, and formatting data

Not so many years ago, data was hard to obtain. Often data journalists would have to painstakingly compile their own datasets from paper records, or make specific requests for electronic databases using freedom of information laws.

The Internet has changed the game. While those methods may still be needed, many government databases can now be queried online, and the results of those searches downloaded. Other public datasets can be downloaded in their entirety.

For data journalists, the main problem today is usually not finding relevant data, but working out whether it can be trusted, spotting and correcting errors and inconsistencies, and getting it into the right format for analysis and visualization.

In this class, we will cover some tips and tricks for finding the data you need online, getting it onto your computer, and recognizing and cleaning “dirty” data. We will also review some common data formats, and how to convert from one to another.

The data we will use today

Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following files:

  • oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
  • ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
  • urls.xls A spreadsheet that we’ll use in webscraping.
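Once the files are unzipped, you can inspect them in a spreadsheet, or load them with a few lines of Python. The sketch below uses only the standard library `csv` module; the column names shown in the comment are illustrative, so check the actual headers in `oil_production.csv` before relying on them.

```python
import csv


def load_rows(source):
    """Read a CSV from a file path (or an open file-like object) into a list of dicts,
    one dict per row, keyed by the column headers."""
    if isinstance(source, str):
        with open(source, newline="") as f:
            return list(csv.DictReader(f))
    return list(csv.DictReader(source))


def numeric_column(rows, field):
    """Pull one column out as floats, skipping blank cells."""
    return [float(r[field]) for r in rows if (r.get(field) or "").strip()]


# Example usage (assumes the unzipped folder is your working directory,
# and that the file has a column named "2014" -- verify against the real headers):
# rows = load_rows("oil_production.csv")
# production_2014 = numeric_column(rows, "2014")
```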

Data portals

Life is much easier if you can find everything you need in one place. The main effort to centralize access to data by the U.S. federal government is Data.gov. You can search for data from the home page, or follow the Data and Topics links from the top menu.

Be warned, however, that Data.gov is a work in progress, and does not contain all of the U.S. government’s data. Some of the most useful datasets are still only available on the websites of individual federal agencies. FedStats has links to agencies with data collections.

As a data journalist, it is worth familiarizing yourself with the main federal government agencies that have responsibility for the beats you are interested in, and the datasets they maintain. Here are some examples of agencies with useful data:

Other data portals at various levels of government are emerging. The City and County of San Francisco, for example, was at the forefront of the Open Data movement, establishing DataSF in 2009.

If you need to make comparisons between nations, the World Bank probably has what you need. Its World Development Indicators catalog contains data for more than 7,000 different measures, compiled by the bank and other UN agencies.

You can navigate the site using the search box or using the topics links to the right. When you click on a particular indicator, you are sent to a page that gives options to download the dataset from a link near the top right of the page. The data in some cases goes back as far as 1960, and is listed both by individual country and summarized by regions and income groups. We have already worked with some of this data in Week 3.

Other useful sources of data for international comparisons are Gapminder and the UN Statistical Division. For health data in particular, try the Organization for Economic Co-operation and Development and the World Health Organization.

Search for data on the web

Often, however, your starting point in searching for data will be Google. Simply combining a few keywords in a Google search with “data” or “database” is frequently enough to find what you need, but it can be worth focusing your queries using Google’s advanced search:

(Source: Google)

The options to search by site or domain and by file type can be very useful when looking for data. For example, the U.S. Geological Survey is the best source of data on earthquakes and seismic risk, so when searching for this information, specifying the domain usgs.gov would be a good idea. You can make the domains as narrow or broad as you like: .gov, for instance, would search a wide range of U.S. government sites, while .edu would search the sites of all academic institutions using that top-level domain; journalism.berkeley.edu would search the web pages of the Berkeley J-School only.

The file type search offers a drop-down menu, with the options including Excel spreadsheets, and Google Earth KML and KMZ files. These are common data formats, but you are not limited to those on the menu. In a regular Google search, type a space after your search terms followed by filetype:xxx, where xxx is the suffix for the file type in question. For example, filetype:dbf will look for database tables in that format. Combining file type and domain searches can be a good way to find data an agency has posted online — some of which may not otherwise be readily accessible.
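Put together, such queries might look like the following (these are illustrative examples, not guaranteed to return results):

```
earthquake hazard data site:usgs.gov filetype:xls
"graduation rates" site:.edu filetype:dbf
```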

One common data format doesn’t show up in file-type searches. Geographical data is often made available as “shapefiles,” a format we will explore in our later mapping classes. Because they consist of multiple files that are usually stored in compressed folders, shapefiles can’t readily be searched using a file type suffix, but they can usually be found by adding the terms “shapefile” or “GIS data” to a regular Google search.

Search online databases

Many important public databases can be searched online, and some offer options to download the results of your queries. Most of these databases give a simple search box, but it’s always worth looking for the advanced search page, which will offer more options to customize your search. Here, for example, is the advanced search page for ClinicalTrials.gov, a database of tests of experimental drugs and other medical treatments taking place in the U.S. and beyond:

(Source: ClinicalTrials.gov)

When you start working with a new online database, take some time to familiarize yourself with how its searches work: Read the Help or FAQs, and then run test searches to see what results you obtain. Here, for example, is the “How To” section of ClinicalTrials.gov.

Many online databases can be searched using Boolean logic, using the operators AND, OR and NOT to link search terms together. AND will return only data including both search terms; OR will return data containing either term; NOT will return data containing the first term but not the second.

So find out how a particular database uses Boolean logic — and the default settings that it will use if you list search terms without any Boolean operators.

Putting search terms in quote marks often searches for a specific phrase. For example, searching for “heart attack” on ClinicalTrials.gov will return only results in which those two words appear together; leaving out the quote marks will include any trial in which both words appear.

Also find out whether the database allows “wildcards,” symbols such as * or % that can be dropped into your search to obtain results with variations on a word or number.

Look for download options — and know when you are hitting the wall

Having run a search on an online database, you will usually want to download the results, so look for the download links or buttons.

A common problem with online databases, however, is that they may impose limits on the number of results that are returned on each search. And even when a search returns everything, there may be a limit on how many of those results can be downloaded to your own computer.

If broad searches on a database keep returning the same number of results, that is a sign that you are probably running up against a search limit, and any download will not contain the complete set of data that you are interested in. However, you may be able to work out ways of searching to obtain all of the data in chunks.

Download the entire database

Downloading an entire database, where this is allowed, frees you from the often-limited options given on an online advanced search form: You can then upload the data into your own database, and query it in any way that you want.

So always look for ways to grab all of the data. One trick is to run a search on just the database’s wildcard character, or with the query boxes left blank. If you do the latter at ClinicalTrials.gov, for instance, your search will return all of the trials in the database, which can then be downloaded using the options at the bottom of the results page.

Other databases have an online search form, but also a separate link from which the data can be downloaded in its entirety, usually as a text file or series of text files. One example is the U.S. Food and Drug Administration’s Bioresearch Monitoring Information System (BMIS), which lists doctors and other researchers involved in testing experimental drugs. It can be searched online here, but can also be downloaded in full from here.

Note that large text files are again often stored in compressed folders, so may be invisible to a Google search by file type.

Where there’s a government form, there’s usually a database

The BMIS database also illustrates another useful tip when looking for data. It is compiled from information supplied in this government form:

(Source: Food and Drug Administration)

Wherever a government agency collects information using paper or electronic forms, this information is likely to be entered into an electronic database. Even if it is not available online, you can often obtain the database in its entirety (minus any redactions that may be required by law) through a public records request.

Ask for what you don’t find

That leads to another general tip: If you don’t find what you’re looking for, speak to government officials, academic experts and other sources who should know about what data exists, and ask whether they can provide it for you. I have often obtained data, including for this animated map of cicada swarms, simply by asking for it (and, of course, promising proper attribution):

(Source: New Scientist)

Automate downloads of multiple data files

Often data doesn’t reside in a single searchable database, but instead exists online as a series of separate files. In such cases, clicking on each link is tedious and time-consuming. But you can automate the process using the DownThemAll! Firefox add-on.

To illustrate, go to Gapminder’s data catalog, and select All indicators. The webpage now includes links to more than 500 downloadable spreadsheets.

Launch DownThemAll! from the Firefox menu. At the dialog box, you can choose where to save the files, and filter the links to select just the files you want. In this case, unchecking all the boxes and Fast Filtering using the term xls will correctly identify the spreadsheet downloads:
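If you would rather script this step than use a browser add-on, the same idea can be sketched in a few lines of Python using only the standard library. The `fetch` hook below is there so the function can be exercised without a live network connection; in real use you would omit it and let `urllib` do the downloading.

```python
import os
import urllib.request


def download_all(urls, dest_dir, fetch=None):
    """Save the content of each URL into dest_dir, naming each file
    after the last segment of its URL path. Returns the saved paths."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in urls:
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths
```

Be polite when doing this for real: pause between requests, and check the site’s terms of use first.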

Extract data from tables on the web

On other occasions, data may exist in tables on the web. Copying and pasting data from web tables can be tricky, but the Table2Clipboard Firefox add-on can simplify the process.

Before using the add-on, select Tools>Table2Clipboard and choose the following options under the CSV tab:

This will ensure that each row in the extracted data is put on a new line, and each column is separated by a tab.

To illustrate what Table2Clipboard does, go to the Women’s Tennis Association singles rankings page, right-click anywhere in the table and select Table2Clipboard>Copy whole table:

You can now paste the data into an empty text file, or into a spreadsheet.
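Table extraction can also be scripted. As a rough sketch of what the add-on is doing under the hood, the following uses Python’s built-in `html.parser` to collect the rows of a table as lists of cell text; real pages are messier, and a dedicated parsing library would be more robust.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of an HTML table as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = None      # cells of the row being built, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def table_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows
```

The output maps naturally onto the tab-separated text that Table2Clipboard produces: one list per row, one string per column.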

Manipulate URLs to expose the data you need

As you search for data using web query forms, make a habit of looking at what happens to the URL. Often it will contain patterns detailing the search you have run, and it will be possible to alter the data displayed by manipulating the URL. This can be quicker than filling in search forms, and in some cases it may even reveal more data than the search form alone would provide, overriding controls on the number of records displayed.

To illustrate how this works, go to the ISRCTN registry, one of the main international registries of clinical trials. Find the Advanced Search and search for breast cancer under condition:

When the data is returned, note the URL:

http://www.isrctn.com/search?q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Notice how the url changes if you select 100 under Show results:

http://www.isrctn.com/search?pageSize=100&sort=&page=1&q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Now change the page size in the url to 500:

http://www.isrctn.com/search?pageSize=500&sort=&page=1&q=&filters=condition%3Abreast+cancer&searchType=advanced-search

Having done so, all of the registered clinical trials involving breast cancer should now be displayed on a single page. We could now use DownThemAll! to download all of the individual web pages describing each of these trials, or we could use this url as the starting point to scrape data from each of those pages.
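Building such urls in code makes the pattern explicit. Here is a minimal Python sketch using only the standard library; the parameter names are copied from the urls above, and the page size of 500 is simply our choice:

```python
from urllib.parse import urlencode

# Build the ISRCTN search url shown above, with the page size raised to 500.
# The parameter names (pageSize, q, filters, etc.) come from the urls the
# site itself generates; only the pageSize value is ours.
base = "http://www.isrctn.com/search"

params = {
    "pageSize": 500,
    "sort": "",
    "page": 1,
    "q": "",
    "filters": "condition:breast cancer",
    "searchType": "advanced-search",
}

url = base + "?" + urlencode(params)
print(url)
```

Note that urlencode takes care of the encoding for us, turning the space into `+` and the colon into `%3A`, just as in the url copied from the browser.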

Scrape data from the web

Sometimes you will need to compile your own dataset from information that is not available for easy download, but is instead spread across a series of webpages, or in a database that imposes limits on the amount of data that can be downloaded from any search, or doesn’t include a download button. This is where web scraping comes in.

Using programming languages such as Python or R, it is possible to write scripts that will pull data down from many webpages, or query web search forms to download an entire database piece by piece. The idea behind web scraping is to identify the patterns you would need to follow if collecting the data manually, then automate the process and write the results to a data file. That often means experimenting to reveal the most efficient way of exposing all of the data you require.
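To make that idea concrete, here is a minimal link extractor written with nothing but Python's standard library. Real projects usually lean on libraries such as Beautiful Soup (Python) or rvest (R), and the HTML fragment below is made up for illustration, but the principle is the same: find the repeating pattern in a page's HTML and pull out the pieces you need.

```python
from html.parser import HTMLParser

# Collect the href of every <a> tag encountered while parsing.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A made-up fragment standing in for a fetched web page.
sample_html = '<p><a href="/doc/1">Dr. A</a> <a href="/doc/2">Dr. B</a></p>'

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)
```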

Teaching the programming skills needed for web scraping is beyond the scope of this class — see the Further reading links for resources if you are interested in learning to scrape by coding.

However, software is starting to emerge that allows non-programmers to scrape data from the web. These include OutWit Hub and the Windows-only Helium Scraper. In today’s class, we will use Import.io.

To demonstrate web scraping, we will download data on disciplinary actions against doctors in the state of New York.

Navigate to this page, which is the start of the list. Then click on the Next Page link, and see that the url changes to the following:

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=30

Notice that the first entry on this list is actually the last entry on the previous one, so this url is the next page with no duplicates:

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=31

Experiment with different numbers at the end of the url until you find the end of the list. At the time of writing, this url exposed the end of the list, revealing that there were 7420 disciplinary actions in the database.

http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start=7420
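In class we will build this list of page urls in a spreadsheet, but as a sketch, the same list can be generated in a few lines of Python. Each page shows 30 entries, so stepping the Start value by 30 from 1 gives non-overlapping pages, and 7420 (the total found by experimenting above) tells us where to stop:

```python
# Generate the url for every page of the list: Start=1, 31, 61, ... so that
# consecutive pages do not share an entry, stopping at the 7420th record.
base = "http://w3.health.state.ny.us/opmc/factions.nsf/byphysician?OpenView&Start="

urls = [base + str(start) for start in range(1, 7421, 30)]

print(urls[0])    # first page
print(urls[-1])   # last page
print(len(urls))  # number of pages to fetch
```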

Click on the link for the last doctor’s name, and notice that data on each disciplinary action, plus a link to the official documentation as a PDF, are on separate web pages. So we need to cycle through all of these pages to grab data on every disciplinary action.

The first step is to cycle through the entire list, grabbing all of the urls for the individual pages.

The best way of doing this in Import.io is to set up a scrape from all of the urls that define the list. In a Google spreadsheet, copy the base url down about 250 rows, then put the numbers that define the first three pages in the first three cells of the next column:

Select those three cells, then move the cursor to the bottom right-hand corner until it becomes a cross, and double-click to copy the pattern down the entire column.

In the first cell in the third column, type the following formula:

=concatenate(A1,B1)

Hit return, and copy this formula down the column to give the urls we will use to scrape the list. To save time in class, I have already made this spreadsheet, urls.xls for you to use. It is in the folder scraping.

Open Import.io, and you should see a screen like this:

Click the pink New button to start setting up a scraper, and at the dialog box select Start Extractor:

You can close the tutorial video that appears by clicking the OK, got it! button. Enter the url for the first page of the list in the search box, and then move the slider to ON.

Write name in the box at top left, replacing the default my_column, and click on the first link under Physician Name. At the dialog box that appears, tell Import.io that your table will contain Many rows.

Import.io will now grab the text and links in a column called name. This is all we need for the first phase of the scrape, so click the DONE button and select a name for the API, such as ny_doctors, and click PUBLISH.

At the next window, select Bulk Extract under How would you like to use this API? and paste into the box the urls from the spreadsheet:

Click Save URLs and then Run queries and the scrape should begin. It will take a couple of minutes to process all of the urls. If any fail, click on the warning message to retry.

Now click on the Export button and select HTML to export as web page, which should look like this. Save it on your desktop, and open in a browser. The column name now contains all the urls we need for the second stage of the scrape:

Click the New button to set up the second phase of the scrape, and again Start Extractor. Enter the second url from your HTML table, to select a named doctor, rather than a practice. Call the column first name, click on the doctor’s first name, and this time tell Import.io that your scrape will have Just one row — because each of the pages we are about to scrape contains data on one disciplinary action.

Click + NEW COLUMN and repeat for the doctor’s last name and the other fields in the data. Make sure to click on the empty box for License Restrictions, so the scrape does grab this data where it exists, and the link to the PDF document. When you are done, the screen should look like this:

Click DONE, select a name for the API, such as ny_orders, and click PUBLISH.

Again select Bulk Extract, and paste into the box the entire column of urls from your html table. You can do this in Firefox using Table2Clipboard, using its Select column option. Remember to delete the column header name from the list of urls before clicking Save URLs and Run queries.

Once the scrape has completed, click the Export button and select Spreadsheet to export as a CSV file.

Use application programming interfaces (APIs)

Websites like the ISRCTN clinical trials registry are not expressly designed to be searched by manipulating their urls, but some organizations make their data available through APIs that can be queried by constructing a url in a similar way. This allows websites and apps to call in specific chunks of data as required, and work with it “on the fly.”

To see how this works, go to the U.S. Geological Survey’s Earthquake Archive Search & URL Builder, where we will search for all earthquakes with a magnitude of 6 or greater that occurred within 6,000 kilometres of the geographic center of the contiguous United States, which this site tells us lies at a latitude of 39.828175 degrees and a longitude of -98.5795 degrees. We will initially use the Output Options to ask for the data in a format called GeoJSON (a variant of JSON). Enter 1900-01-01T00:00:00 under Start for the Date & Time boxes so that we obtain all recorded earthquakes from the beginning of 1900 onward. The search form should look like this:

(Source: U.S. Geological Survey)

You should receive a quantity of data at the following url:

http://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=1900-01-01T00:00:00&latitude=39.828175&longitude=-98.5795&maxradiuskm=6000&minmagnitude=6&orderby=time

See what happens if you append -asc to the end of that url: this should sort the earthquakes from oldest to newest, rather than the default of newest to oldest. Here is the full documentation for querying the earthquake API by manipulating these urls.

Now remove the -asc and replace geojson in the url with csv. The data should now download in CSV format.
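The same query can be assembled in code, which is how a website or app would typically call the API. This sketch rebuilds the urls above from their parts; the endpoint and parameter names are taken from the urls we just constructed by hand:

```python
from urllib.parse import urlencode

# Rebuild the USGS earthquake query shown above. Switching output format
# is just a change of path suffix; switching sort order is just orderby.
endpoint = "http://earthquake.usgs.gov/fdsnws/event/1/query"

params = {
    "starttime": "1900-01-01T00:00:00",
    "latitude": 39.828175,
    "longitude": -98.5795,
    "maxradiuskm": 6000,
    "minmagnitude": 6,
    "orderby": "time-asc",  # oldest first; "time" gives newest first
}

geojson_url = endpoint + ".geojson?" + urlencode(params)
csv_url = endpoint + ".csv?" + urlencode(params)
print(csv_url)
```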

PDFs: the bane of data journalism

Some organizations persist in making data available as PDFs, rather than text files, spreadsheets or databases. This makes the data hard to extract. While you should always ask for data in a more friendly format — ideally a CSV or other simple text file — as a data journalist you are at some point likely to find yourself needing to pull data out of a PDF.

For digital PDFs, Tabula is a useful data extraction tool — however it will not work with PDFs created by scanning the original document, which have to be interpreted using Optical Character Recognition (OCR) software (which is, for example, included in Adobe Acrobat).

Also useful is the online service Cometdocs. While it is a commercial tool, members of Investigative Reporters and Editors can obtain a free account. Cometdocs can read scanned PDFs, however its accuracy will vary depending on how well the OCR works on the document in question.

Can I trust this data?

Having identified a possible source of data for your project, you need to ask: Is it reliable, accurate and useful? If you rush into analysis or visualization without considering this question, your hard work may be undermined by the maxim: “Garbage In, Garbage Out.”

The best rule of thumb in determining the reliability of a dataset is find out whether it has been used for analysis before, and by whom. If a dataset was put together for an academic study, or is actively curated so it can be made available for experts to analyze, you can be reasonably confident that it is as complete and accurate as it can be — the U.S. Geological Survey’s earthquake data is a good example.

While in general you might be more trusting of data downloaded from a .gov or .edu domain than something found elsewhere on the web, don’t simply assume that it is reliable and accurate. Be especially wary of databases that are compiled from forms submitted to government agencies, such as the Bioresearch Monitoring Information System (BMIS) database mentioned earlier.

Government agencies may be required by law to maintain databases such as BMIS, but that doesn’t mean that the information contained in them is wholly reliable. First, forms may not always be submitted, making the data incomplete. Second, information may be entered by hand from the forms into the database — and not surprisingly, mistakes are made.

So before using any dataset, do some background research to find out how it was put together, and whether it has been rigorously checked for errors. If possible, try to speak to the people responsible for managing the database, and any academics or other analysts who have used the data. They will be your best guide to a dataset’s strengths and weaknesses.

Even for well-curated data, make a point of speaking with experts who compile it or use it, and ask them about the data’s quirks and limitations. From talking with experts on hurricanes, for example, I know not to place too much trust in data on North Atlantic storms prior to about 1990, before satellite monitoring was well developed — even though the data available from NOAA goes back to 1851.

Always ask probing questions of a dataset before putting your trust in it. Is this data complete? Is it up-to-date? If it comes from a survey, was it based on a representative sample of people who are relevant to your project? Remember that the first dataset you find online may not be the most relevant or reliable.

Recognize dirty data

In an ideal world, every dataset we find would have been lovingly curated, allowing us to start analyzing and visualizing without worrying about its accuracy.

In practice, however, often the best available data has some flaws, which may need to be corrected as far as is possible. So before starting to work with a new dataset, load it into a spreadsheet or database and take a look for common errors. Here, for example, is a sample of records from the BMIS database, with names including non-alphabetical characters — which are clearly errors:

(Source: Peter Aldhous, from Bioresearch Information Monitoring System data)

Look for glitches in the alignment of columns, which may cause data to appear in the wrong field.

For people’s names, look for variations in spelling, format, initials and accents, which may cause the same person to appear in multiple guises. Similar glitches may affect addresses, and any other information entered as text.

Some fields offer some obvious checks: if you see a zip code with fewer than 5 digits, for instance, you know it must be wrong.

Dates can also be entered incorrectly, so it’s worth scanning for those that fall outside the timeframe that should be covered by the data.

Also scan numbers in fields that represent continuous variables for any obvious outliers. These values are worth checking out. Are they correct, or did someone misplace a decimal point or enter a number in the wrong units?

Other common problems are white spaces before and after some entries, which may need to be stripped out.

At all stages of your work, pay attention to zeros. Is each one actually supposed to represent zero, or should the cell in fact be empty, or “null”? Take particular care when exporting data from one software tool and importing to another, and check how nulls have been handled.
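Several of these checks are easy to automate. Here is a sketch over a small made-up sample; the field names, the 1990 cut-off date, and the records themselves are all invented for illustration:

```python
from datetime import date

# Made-up records standing in for rows loaded from a real dataset.
records = [
    {"city": " Berkeley ", "zip": "94720", "inspected": date(2014, 5, 1)},
    {"city": "Stanford",   "zip": "9430",  "inspected": date(1014, 5, 1)},
]

for r in records:
    # Strip stray white space before comparing text values.
    r["city"] = r["city"].strip()

    # A zip code with fewer than 5 digits must be wrong.
    if len(r["zip"]) < 5:
        print("bad zip:", r["zip"])

    # Flag dates outside the timeframe the data is supposed to cover.
    if not date(1990, 1, 1) <= r["inspected"] <= date.today():
        print("suspect date:", r["inspected"])
```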

Clean and process data with Open Refine

Checking and cleaning “dirty” data, and processing data into the format you need, can be the most labor intensive part of many data journalism projects. However, Open Refine (formerly Google Refine) can streamline the task — and also create a reproducible script to quickly repeat the process on data that must be cleaned and processed in the same way.

When you launch Open Refine, it opens in your web browser. However, any data you load into the program will remain on your computer — it does not get posted online.

The opening screen should look like this:

Reshape data from wide to long format

Click the Browse button and navigate to the file oil_production.csv. Click Next>>, and check that data looks correct:

Open Refine should recognize that the data is in a CSV file, but if not you can use the panel at bottom to specify the correct file type and format for the data. When you are satisfied that the data has been read correctly, click the Create Project >> button at top right. The screen should now look like this:

As you can see, the data is in wide format, with values for oil production by region organized in columns, one for each year. To convert this to long format, click on the small downward-pointing triangle for the first of these year columns, and select Transpose>Transpose cells across columns into rows.

Fill in the dialog box as below, making sure that From Column and To Column are highlighted correctly, that the Key column and Value column have been given appropriate names, and that Fill down in other columns is checked. (Failing to check this box will mean that each region name appears only once in the reshaped data, rather than being copied down to appear next to the corresponding data for year and oil production.)

Click Transpose and then the 50 rows link, to see the first 50 rows of the reshaped data:

Click the Export button at top right and you will see options to export the data in a variety of file types, including Comma-separated value and Excel spreadsheet.

Clean and process dirty data

Click the Google Refine logo at top left to return to the opening screen. Create a new project from the file ucb_stanford_2014.csv.

Entries recognized as numbers or dates will be green, those treated as text strings will be black:

Again, each field/column has a button with a downward-pointing triangle. Click on these buttons and you get the option to create “facets” for the column, which provide a powerful way to edit and clean data.

Click on the button for the field Recipient City, and select Facet>Text facet. A summary of the various entries now appears in the panel to the left:

The numbers next to each entry show how many records there are for each value.

We can edit entries individually: select Veterans Bureau Hospi, which is clearly not a city, click on the Edit link, and change it to Unknown. (If cleaning this data for a real project, we would need to check with an external source to get the actual city for this entry.)

Another problem is that we have a mixture of cases, with some entries in Title or Proper Case, and some in UPPERCASE. We can fix this back in the field itself. Click its button again and select Edit cells>common transforms>To titlecase.

Now notice that we apparently have duplicate entries for Berkeley, Palo Alto and Stanford. This is the result of trailing white space after the city names for some entries. Select Edit cells>common transforms>Trim leading and trailing whitespace and notice how the problem resolves:

Having cleaned this field, close the facet by clicking the cross at top left.

Now create a text facet for the field Recipient:

What a mess! The only possibilities are Stanford or Berkeley, yet there are multiple variants of each, many including Board of Trustees for Stanford and Regents of for UC Berkeley.

First, manually edit Interuniveristy Center for Japanese Language to Stanford, which is where this center is based.

We could continue editing manually, but to illustrate Open Refine’s editing functions, click on the Cluster button. Here you can experiment with different clustering algorithms to edit entries that may be variants of the same thing. Select key collision and metaphone3, then start checking the clusters and renaming them as Berkeley or Stanford as appropriate:

Click Merge Selected & Close and the facet can then be quickly edited manually:

Often we may need to convert fields to text, numbers or dates. For example, click on the button for Award Date and select Edit cells>common transforms>To date and see that it changes from a string of text to a date in standard format.

Notice the field Award amount, which is a value in dollars. Negative values are given in brackets. Because of these symbols, the field is being recognized as a string of text, rather than a number. So to fix this problem, we have to remove the symbols.

Select Edit column>Add column based on this column... and fill in the dialog box as follows:

Here value refers to the value in the original column, and replace is a function that replaces characters in the value. We can run several replace operations by “chaining” them together. This is a concept we’ll meet again in subsequent weeks, when we work with the D3 JavaScript library and R.

Here we are replacing the “$” symbols, the commas separating thousands, and the closing brackets with nothing; we are replacing the opening brackets with a hyphen to designate negative numbers.
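The same chain of replacements can be sketched in Python, which may make the idea clearer (the dollar value below is made up; the real GREL expression goes in the dialog box above):

```python
# Chained replaces, mirroring the GREL approach: strip "$", thousands
# commas and ")", and turn the opening bracket into a minus sign.
raw = "($1,234,567)"  # made-up accounting-style negative value

cleaned = (raw.replace("$", "")
              .replace(",", "")
              .replace(")", "")
              .replace("(", "-"))

amount = float(cleaned)
print(amount)  # -1234567.0
```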

Click OK and the new column will be created. Note that it is still being treated as text, but that can be corrected by selecting Edit cells>common transforms>To number.

This is just one example of many data transformation functions that can be accessed using Open Refine’s expression language, called GREL. Learning these functions can make Open Refine into a very powerful data processing tool. Study the “Further reading” links for more.

Open Refine’s facets can also be used to inspect columns containing numbers. Select Facet>Numeric facet for the new field. This will create a histogram showing the distribution of numbers in the field:

We can then use the slider controls to filter the data, which is good for examining possible outliers at the top or bottom of the range. Notice that here a small number of grants have negative values, while there is one grant with a value of more than $3 billion from the National Science Foundation. This might need to be checked out to ensure that it is not an error.

While most of the data processing we have explored could also be done in a spreadsheet, the big advantage of Open Refine is that we can extract a “pipeline” for processing data to use when we obtain data in the same format in future.

Select Undo / Redo at top left. Notice that clicking on one of the steps detailed at left will transform the data back to that stage in our processing. This means you don’t need to worry about making mistakes, as it’s always possible to revert to an earlier state, before the error, and pick up from there.

Return to the final step, then click the Extract button. At the dialog box, check only those operations that you will want to perform in future (typically generic transformations on fields/columns, and not correcting errors for individual entries). Here I have unchecked all of the corrections in the text facets, and selected just those operations that I know I will want to repeat if I obtain data from this source again:

This will generate JSON in the right hand panel that can be copied into a blank text file and saved.

To process similar data in future, click the Apply button on the Undo / Redo tab, paste in the text from this file, and click Perform Operations. The data will then be processed automatically.

When you are finished cleaning and processing your data, click the Export button at top right to export as a CSV file or in other formats.

Open Refine is a very powerful tool that will reward efforts to explore its wide range of functions for manipulating data. See the “Further reading” for more.

Standardize names with Mr People

For processing names from a string of text into a standardized format with multiple fields, you may wish to experiment with Mr People, a web app made by Matt Ericson, a member of the graphics team at The New York Times.

(Source: Mr People)

It takes a simple list of names and turns them into separate fields for title, first name, last name, middle name and suffix.

Mr People can save you time, but it is not infallible — it may give errors with Spanish family names, for instance, or if people have multiple titles or suffixes, such as “MD, PhD.” So always check the results before moving on to further analysis and visualization.

Correct for inflation (and cost of living)

A common task in data journalism and visualization is to compare currency values over time. When doing so, it usually makes sense to show the values after correcting for inflation — for example in constant 2014 dollars for a time series ending in 2014. Some data sources, such as the World Bank, provide some data both in raw form and in a given year’s constant dollars.

So pay attention to whether currency values have already been corrected for inflation, or whether you will need to do so yourself. When correcting for inflation in the United States, the most widely-used method is the Consumer Price Index, or CPI, which is based on prices paid by urban consumers for a representative basket of goods and services. Use this online calculator to obtain conversions.

If, for example, you need to convert a column of data in a spreadsheet from 2010 dollars into today’s values, fill in the calculator like this:

A dollar today is worth the same as 0.9 dollars in 2010.

So to convert today’s values into 2010 dollars, use the following formula:

2016 value * 0.9

And to convert the 2010 values to today’s values, divide rather than multiply:

2010 value / 0.9

Alternatively, fill in the calculator the other way round, and multiply as before.

Convert 2010 to today’s values:

2010 value * 1.11
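Both conversions can be captured in a small function, using the 0.9 factor taken from the calculator above (the factor will differ for other year pairs):

```python
# 2010 dollars per today-dollar, from the CPI calculator above.
FACTOR = 0.9

def to_2010_dollars(todays_value):
    # Today's values -> 2010 dollars: multiply by the factor.
    return todays_value * FACTOR

def to_todays_dollars(value_2010):
    # 2010 values -> today's dollars: divide by the factor.
    return value_2010 / FACTOR

print(to_2010_dollars(100))   # 90.0
print(to_todays_dollars(90))  # ≈ 100.0
```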

For comparing currency values across nations, regions or cities, you may also need to correct for the cost of living — or differences in what a dollar can buy in different places. For World Bank indicators, look for the phrase “purchasing power parity,” or PPP, for data that includes this correction. PPP conversion factors for nations over time are given here.

Understand common data formats, and convert between them

Until now, we have used data in text files, mostly in CSV format.

Text files are great for transferring data from one software application to another during analysis and visualization, but other formats that are easier for machines to read are typically used when transferring data between computers online. If you are involved in web development or designing online interactive graphics, you are likely to encounter these formats.

JSON, or JavaScript Object Notation, which we have already encountered today, is a data format often used by APIs. JSON treats data as a series of “objects,” which begin and end with curly brackets. Each object in turn contains a series of name-value pairs. There is a colon between the name and value in each pair, and the pairs are separated by commas.

Here, for example, are the first few rows of the infectious disease and democracy data from week 1, converted to JSON:

[{"country":"Bahrain","income_group":"High income: non-OECD","democ_score":45.6,"infect_rate":23},
{"country":"Bahamas, The","income_group":"High income: non-OECD","democ_score":48.4,"infect_rate":24},
{"country":"Qatar","income_group":"High income: non-OECD","democ_score":50.4,"infect_rate":24},
{"country":"Latvia","income_group":"High income: non-OECD","democ_score":52.8,"infect_rate":25},
{"country":"Barbados","income_group":"High income: non-OECD","democ_score":46,"infect_rate":26}]

XML, or Extensible Markup Language, is another format often used to move data around online. For example, the RSS feeds through which you can subscribe to content from blogs and websites using a reader such as Feedly are formatted in XML.

In XML, data is structured by enclosing values within “tags,” similar to those used to code different elements on a web page in HTML. Here is that same data in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<rows>
  <row country="Bahrain" income_group="High income: non-OECD" democ_score="45.6" infect_rate="23" ></row>
  <row country="Bahamas, The" income_group="High income: non-OECD" democ_score="48.4" infect_rate="24" ></row>
  <row country="Qatar" income_group="High income: non-OECD" democ_score="50.4" infect_rate="24" ></row>
  <row country="Latvia" income_group="High income: non-OECD" democ_score="52.8" infect_rate="25" ></row>
  <row country="Barbados" income_group="High income: non-OECD" democ_score="46" infect_rate="26" ></row>
</rows>

Mr Data Converter is a web app made by Shan Carter of the graphics team at The New York Times that makes it easy to convert data from a spreadsheet or delimited text file to JSON or XML.

Copy the data from a CSV or tab-delimited text file and paste it into the top box, select the output you want, and it will appear at the bottom. You will generally want to select the Properties variants of JSON or XML.

You can then copy and paste this output into a text editor, and save the file with the appropriate extension (.xml, .json).

(Source: Mr Data Converter)

To convert data from JSON or XML into text files, you can use Open Refine. First create a new project and import your JSON or XML file. Use the Export button and select Tab-separated value or Comma-separated value to export as a text file.
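The core of what a converter like Mr Data Converter does can be sketched in a few lines of Python using only the standard library. The sample rows are the same ones shown above; note that csv.DictReader leaves numeric fields as strings unless you convert them:

```python
import csv
import io
import json

# Read delimited text with csv.DictReader, then write it out as JSON.
csv_text = """country,income_group,democ_score,infect_rate
Bahrain,High income: non-OECD,45.6,23
Qatar,High income: non-OECD,50.4,24"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=1)
print(json_text)
```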

Assignment

  • Grab the data for the top 100 ranked women’s singles tennis players.
  • Use Open Refine to process this data as follows:
    • Create new columns for First Name and Last Name. Hint: First create a copy of the Player column with a new name using Edit Column>Add column based on this column.... Then look under Edit column for an option to split this new column into two; you will also need to rename the resulting columns.
    • Convert the birth dates for the players to standard date/time format.
    • Create a new column for the Previous Rank with the square brackets removed, converted to numbers. Hint: First copy the old column as above; this time you can delete the old column when you are done.
  • Extract the operations to process this data, and save in a file with the extension .json.
  • Now go back to the WTA site and grab the singles rankings for all U.S. players for the first ranking of 2016 (made on January 4). Hint: Make sure you hit Search after adjusting the menus.
  • Process this data in Open Refine using your extracted JSON, then export the processed data as a CSV file.
  • Send me your JSON and CSV files.

Further reading

Paul Bradshaw. Scraping For Journalists

Dan Nguyen. The Bastards Book of Ruby
I use R or Python rather than Ruby, but this book provides a good introduction to the practice of web scraping using code, and using your browser’s web inspector to plan your scraping approach.

Hadley Wickham’s rvest package
This is the R package I usually use for web scraping.

Open Refine Wiki

Open Refine Documentation

Open Refine Recipes

Using GitHub

In this week’s class we will learn the basics of version control, so that you can work on your final projects in a clean folder with a single set of files, but can save snapshots of versions of your work at each point and return to them if necessary.

This avoids the hell of having to search through multiple versions of similar files. That, as Ben Welsh of the Los Angeles Times explains in this video, legendary in data journalism circles as “Ben’s rant,” is nihilism!

Version control was invented for programmers working on complex coding projects. But it is good practice for any project — even if all you are managing are versions of a simple website, or a series of spreadsheets.

This tutorial borrows from the Workflow and GitHub lesson in Jeremy Rue’s Advanced Coding Interactives class — see the further reading links below.

Introducing Git, GitHub and GitHub Desktop

The version control software we will use is called Git. It is installed automatically when you install and configure GitHub Desktop. GitHub Desktop is a point-and-click GUI that allows you to manage version control for local versions of projects on your own computer, and sync them remotely with GitHub. GitHub is a social network, based on Git, that allows developers to view and share one another’s code, and collaborate on projects.

Even if you are working on a project alone, it is worth regularly syncing to GitHub. Not only does this provide a backup copy of the entire project history in the event of a problem with your local version, but GitHub also allows you to host websites. This means you can go straight from a project you are developing to a published website. If you don’t already have a personal portfolio website, you can host one for free on GitHub.

The files we will use today

Download the files for this session from here, unzip the folder and place it on your desktop. It contains the following folders and files:

  • index.html, index2.html Two simple webpages, which we will edit and publish on GitHub.
  • css, fonts, js Folders with files to run the Bootstrap web framework.

Some terminology

  • repository or repo Think of this as a folder for a project. A repository contains all of the project files, and stores each file’s revision history. Repositories on GitHub can have multiple collaborators and can be either public or private.
  • clone Copy a repository from GitHub to your local computer.
  • master This is the main version of your repository, created automatically when you make a new repository.
  • branch A version of your repository separate from the master branch. As you switch back and forth between branches, the files on your computer are automatically modified to reflect those changes. Branches are used commonly when multiple collaborators are working on different aspects of a project.
  • pull request Proposed changes to a repository submitted by a collaborator who has been working on a branch.
  • merge Taking the changes from one branch and applying them to another. This is often done after a pull request.
  • push or sync Submitting your latest commits to the remote repository on GitHub, and syncing any changes from there back to your computer.
  • gh-pages A special branch that is published on the web. This is how you host websites on GitHub. Even if a repository is private, its published version will be visible to anyone who has the url.
  • fork Split off a separate version of a repository. You can fork anyone’s code on GitHub to make your own version of their repo.

Here is a more extended GitHub glossary.
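GitHub Desktop wraps these operations in a point-and-click interface, but each term also corresponds to a git command. Here is a minimal sketch that runs entirely offline; the /tmp paths, names and file contents are just examples:

```shell
rm -rf /tmp/demo-origin /tmp/demo-clone

# Stand-in for a repository on GitHub, created with one commit.
git -c init.defaultBranch=master init --quiet /tmp/demo-origin
cd /tmp/demo-origin
git config user.email "you@example.com" && git config user.name "Your Name"
echo "# Demo" > README.md
git add README.md && git commit --quiet -m "Initial commit"

# clone: copy the repository to another location on your computer.
git clone --quiet /tmp/demo-origin /tmp/demo-clone
cd /tmp/demo-clone
git config user.email "you@example.com" && git config user.name "Your Name"

# branch: start a line of development separate from master.
git checkout --quiet -b test-branch
echo "More text" >> README.md
git commit --quiet -am "Update README"

# merge: apply test-branch's changes back to master.
git checkout --quiet master
git merge --quiet test-branch
git log --oneline    # both commits now appear on master
```

Nothing here touches the network; pushing to a real remote is covered when we sync with GitHub later in the session.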

Create and secure your GitHub account

Navigate to GitHub and sign up:

Choose your plan. If you want to be able to create private repositories, which cannot be viewed by others on the web, you will need to upgrade to a paid account. But for now select a free account and click Continue:

At the next screen, click the skip this step link:

Now view your profile by clicking on the icon at top right and selecting Your profile. This is your page on GitHub. Click Edit profile to see the following:

Here you can add your personal details, and a profile picture. For now just add the name you want to display on GitHub. Fill in the rest in your own time after class.

You should have been sent a confirmation email to the address you used to sign up. Click on the verification link to verify this address on GitHub.

Back on the GitHub website, click on the Emails link in the panel at left. If you wish, you can add another email to use on GitHub, which will need to be verified as well. If you don’t wish to display your email on GitHub check the Keep my email address private box.

Now click on the Security link. I strongly recommend that you click on Set up two-factor authentication to set up this important security feature for your account. It will require you to enter a six-digit code sent to your phone each time you sign on from an unfamiliar device or location.

At the next screen, click Set up using SMS. Then enter your phone number, send a code to your phone and enter it where shown:

At the next screen click Download and print recovery codes. These will allow you to get back into your account if you lose your phone. Do print them out, keep them somewhere safe, and delete the file.

Open and authenticate GitHub desktop

Open the GitHub Desktop app. At the opening screen, click Continue:

Then add your GitHub login details:

You will then be sent a new two-factor authentication code which you will need to enter:

At the next screen, enter your name and email address if they do not automatically appear, click Install Command Line Tools, and then Continue:

Then click Done at the Find local repositories screen, as you don’t have local repositories to add.

The following screen should now open:

Your workspace contains one repo, which is an automated GitHub tutorial. Complete this in your own time if you wish. It will repeat many of the steps we will explore today.

Your first repository

Create a new repository on GitHub

Back on the GitHub website, go to your profile page, click on the Repositories tab and click New:

Fill in the next screen as follows, giving the repository a name and initializing it with a README file. Then click Create repository:

You should now see the page for this repo:

Notice that there is an Initial commit with a code (the commit hash) consisting of a series of letters and numbers. There will be a code for each commit you make from now on.

Clone to GitHub desktop

Click on Clone or download and select Open in Desktop:

You should now be sent to the GitHub Desktop app, where you will be asked where on your computer you wish to clone the repo folder. Choose a location and click Clone:

Now you should be able to see the repo in the GitHub Desktop app:

You should also be able to find the folder you just cloned in the location you specified:

It contains a single file called README.md. This is a simple text file written in a language called Markdown, which we will explore shortly. You use this file to
write an extended description for your project, which will be displayed on the repo’s page on GitHub.

Make a simple change to the project

Add the file index.html to the project folder on your computer. Notice that you now have 1 Uncommitted Change in GitHub Desktop.

Click on that tab, and you should see the following screen:

GitHub Desktop highlights additions from your last commit in green, and deletions in red.

Commit that change, sync with GitHub

Write a summary and description for your commit, then click Commit to master:

Back in the History tab, you should now see two commits:

So far you have committed the change on your local computer, but you haven’t pushed it to GitHub. To do that, click the Sync button at top right.

Go to the project page on the GitHub website, refresh your browser if necessary, and see that there are now two commits, and the index.html file is in the remote repo:
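Behind the scenes, Commit to master and Sync correspond to git’s commit and push. Here is a runnable sketch that uses a local bare repository as a stand-in for the remote repo on GitHub (all paths and contents are hypothetical):

```shell
rm -rf /tmp/sync-demo /tmp/sync-remote.git

# A working repository with one file staged and committed.
git -c init.defaultBranch=master init --quiet /tmp/sync-demo
cd /tmp/sync-demo
git config user.email "you@example.com" && git config user.name "Your Name"
echo "<h1>Hello</h1>" > index.html
git add index.html
git commit --quiet -m "Add index.html"   # like "Commit to master"

# A bare repository standing in for the remote repo on GitHub.
git init --quiet --bare /tmp/sync-remote.git
git remote add origin /tmp/sync-remote.git

# Push the commit to the remote: this is what the Sync button does.
git push --quiet origin master
```

With a real GitHub repo, origin would be the repository’s URL rather than a local path, but the commands are the same.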

Make a new branch, make a pull request, and merge with master

Back in GitHub Desktop, click on the new branch button at top left, and create a new branch called test-branch:

You can now switch between your two branches using the button to the immediate right of the new branch button, which will display either master or test-branch. Do pay close attention to which branch you are working in!

Here I am working in the test branch, having made the edit below:

While in the test-branch on Github Desktop, open index.html in a text editor. Delete <p>I'm a paragraph</p> and replace it with the following:

<h2>Hello again!</h2>
<p>I'm a new paragraph</p>

Save the file, then return to GitHub Desktop to view the changes in the test-branch.

Now switch to the master branch and look at the file index.html in your text editor. It should have reverted to the earlier version, because you haven’t merged the change in test-branch with master.

Switch back to test-branch in GitHub Desktop, and commit the change as before with an appropriate summary and description:

Click the Pull Request button at top right and then Send Pull Request:

You should now be able to see the pull request on GitHub:

Click Compare & pull request to see the following screen:

If another collaborator had made this pull request, you might merge this into master online and then sync your local version of the repo with the remote to incorporate it.

However, you made this pull request, so Close pull request and return to GitHub Desktop. In the master branch, click Compare at top left. Select test-branch and then click Update from test-branch. This should merge the changes from test-branch into master:

Make sure you are in the master branch on GitHub Desktop, then view the file in your text editor to confirm that it is now the version you edited in test-branch.
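The branch-and-merge sequence above can be reproduced on the command line. Notice how the file on disk changes as you switch branches, just as it did in your text editor (paths and contents here are illustrative):

```shell
rm -rf /tmp/branch-demo
git -c init.defaultBranch=master init --quiet /tmp/branch-demo
cd /tmp/branch-demo
git config user.email "you@example.com" && git config user.name "Your Name"

echo "<p>I'm a paragraph</p>" > index.html
git add index.html && git commit --quiet -m "Initial version"

# Edit the file on a new branch.
git checkout --quiet -b test-branch
echo "<p>I'm a new paragraph</p>" > index.html
git commit --quiet -am "Edit paragraph"

# Switching back to master reverts the file on disk...
git checkout --quiet master
cat index.html    # <p>I'm a paragraph</p>

# ...until test-branch is merged into master.
git merge --quiet test-branch
cat index.html    # <p>I'm a new paragraph</p>
```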

Make a gh-pages branch, and publish to the web

In your master branch on GitHub Desktop, make a branch called gh-pages:

In this branch, click the Publish button at top right.

Now go to the GitHub repo page, refresh your browser if necessary, and notice that this branch now exists there:

Now go to the url https://[username].github.io/my-first-repo/, where [username] is your GitHub user name, and the webpage index.html should be online:

Introducing Markdown, Haroopad, and Bootstrap

Markdown provides a simple way to write text documents that can easily be converted to HTML, without having to worry about writing all of the tags to produce a properly formatted web page.

Haroopad is a Markdown editor that we will use to edit the README.md for our repos, and also author some text to add to a simple webpage.

Bootstrap is a web framework that allows you to create responsively designed websites that work well on all devices, from phones to desktop computers. It was originally developed by Twitter.

(I used Bootstrap to make this class website, writing the class notes in Markdown using Haroopad.)

Edit your README, make some more changes to your repo, commit and sync with GitHub

Open README.md in Haroopad. The Markdown code is shown on the left, and how it renders as HTML on the right:

Now edit to the following:

# My first repository

### This is the repo we have been using to learn GitHub.

Here is some text. Notice that it doesn't have the # used to denote various weights of HTML headings (You can use up to six #).

And here is a [link](http://severnc6.github.io/my-first-repo) to the `gh-pages` website for this repo.

*Here* is some italic text.

**Here** is some bold text.

And here is a list:
- first item
- second item
  - sub-item
- third item

This should display in Haroopad like this:

See here for a more comprehensive guide to using Markdown.

Save README.md in Haroopad and close it.

With GitHub Desktop in the master branch, delete index.html from your repo, and copy into the repo the file index2.html and the folders js, css, and fonts. Rename index2.html to index.html.

You now have a template Bootstrap page with a navigation bar at the top. Open in a browser, and it should look like this:

The links in the dropdown menu are currently to pages that don’t exist, and the email link will send a message to me. Open in a text editor to view the code for the page:

Open a new file in Haroopad, edit to add the following and save into your repo as index-text.md:

# A Bootstrap webpage

### It has a subheading

And also some text.

From the top menu in Haroopad, select File>Export...>HTML and notice that it has been saved as a webpage in the repo folder on your computer.

We just want to take the text from the web page and copy it into our index.html page. To do this, select File>Export...>Plain HTML from the top menu in Haroopad, open index.html in your text editor, position your cursor immediately below the <div class="container"> tag, and press ⌘-V to paste in the HTML for the text we wrote in Haroopad.

Save index.html and view in your browser.

In GitHub Desktop, view the uncommitted changes, Commit to master and Sync to GitHub.

Now switch to the gh-pages branch, Update from master and Sync:

Both the master and gh-pages branches should now be updated on GitHub:

Follow the link we included in the README, and you’ll be sent to the hosted webpage, at https://[username].github.io/my-first-repo/, where [username] is your GitHub user name.

Next steps with Bootstrap

W3Schools has a tutorial here, and Jeremy Rue has a tutorial here. The key to responsive design with Bootstrap is its grid system, which allows up to 12 columns across a page. This section of the W3schools tutorial explains how to use the grid system to customize layouts for different devices.

This site helps you customize a Bootstrap navigation bar.

There are various sites on the web that provide customized Bootstrap themes — some free, some not. Search “Bootstrap themes” to find them. A theme is a customized version of Bootstrap that can be used as a starting point for your own website. Jeremy Rue has also created a suggested portfolio theme.

Assignment

  • Create a repository on GitHub to host your final project and clone to your computer so you can manage the project in GitHub Desktop.
  • In Markdown, write a pitch for your final project in a file called project-pitch.md. Your final project accounts for 45% of your grade for this class, so it’s important that you get off to a good start with a substantial and thoughtful pitch. You will also be graded separately on this pitch assignment.
    • Explain the goals of your project.
    • Detail the data sources you intend to use, and explain how you intend to search for data if you have not identified them.
    • Identify the questions you wish to address.
    • Building from these questions, provide an initial outline of how you intend to visualize the data, describing the charts/maps you are considering.
  • Using Haroopad, save the Markdown document as an HTML file with the same name.
  • Create a gh-pages branch for your repository and publish it on GitHub. View the webpage created at http://[username].github.io/[project]/project-pitch.html where [username] is your GitHub user name and [project] is the name of your project repository.
  • Share that url with me so I can read your project pitch and provide feedback.

Further reading

Workflow and Github
Lesson from Jeremy Rue’s Advanced Coding Interactives class.

Getting Started with GitHub Desktop

Getting Started with GitHub Pages
This explains how you can create web pages automatically from GitHub. However, I recommend authoring them locally, as we covered in class.

Git Reference Manual

Getting started with Bootstrap

W3Schools Bootstrap tutorial

Using Bootstrap Framework For Building Websites
Lesson from Jeremy Rue’s Intro to Multimedia Web Skills class.

Graphical Analysis & Exploration

Introducing Tableau Public

In this tutorial we will work with Tableau Public, which allows you to create a wide variety of interactive charts, maps and tables and organize them into dashboards and stories that can be saved to the cloud and embedded on the web.

The free Public version of the software requires you to save your visualizations to the open web. If you have sensitive data that needs to be kept within your organization, you will need a license for the Desktop version of the software.

Tableau was developed for exploratory graphical data analysis, so it is a good tool for exploring a new dataset — filtering, sorting and summarizing/aggregating the data in different ways while experimenting with various chart types.

Although Tableau was not designed as a publication tool, the ability to embed finished dashboards and stories has also allowed newsrooms and individual journalists lacking JavaScript coding expertise to create interactive online graphics.

The data we will use today

Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following file:

Visualize the data on neonatal mortality

Connect to the data

Launch Tableau Public, and you should see the following screen:

Under the Connect heading at top left, select Text File, navigate to the file nations.csv and Open. At this point, you can view the data, which will be labeled as follows:

  • Text: Abc
  • Numbers: #
  • Dates: calendar symbol
  • Geography: globe symbol

You can edit fields to give them the correct data type if there are any problems:

Once the data has loaded, click Sheet 1 at bottom left and you should see a screen like this:

Dimensions and measures: categorical and continuous

The fields should appear in the Data panel at left. Notice that Tableau has divided the fields into Dimensions and Measures. These broadly correspond to categorical and continuous variables. Dimensions are fields containing text or dates, while measures contain numbers.

If any field appears in the wrong place, click the small downward-pointing triangle that appears when it is highlighted and select Convert to Dimension or Convert to Measure as required.

Shelves and Show Me

Notice that the main panel contains a series of “shelves,” called Pages, Columns, Rows, Filters and so on. Tableau charts and maps are made by dragging and dropping fields from the data into these shelves.

Over to the right you should see the Show Me panel, which will highlight chart types you can make from the data currently loaded into the Columns and Rows shelves. It is your go-to resource when experimenting with different visualization possibilities. You can open and close this panel by clicking on its title bar.

Columns and rows: X and Y axes

The starting point for creating any chart or map in Tableau is to place fields into Columns and Rows, which for most charts correspond to the X and Y axes, respectively. When making maps, longitude goes in Columns and latitude in Rows. If you display the data as a table, then these labels are self-explanatory.

Some questions to ask this data

  • How has the total number of neonatal deaths changed over time, globally, regionally and nationally?
  • How has the neonatal death rate for each country changed over time?

Create new calculated variables

The data contains fields on birth and neonatal death rates, but not the total numbers of births and deaths, which must be calculated. From the top menu, select Analysis>Create Calculated Field. Fill in the dialog box as follows (just start typing a field name to select it for use in a formula):

Notice that calculated fields appear in the Data panel preceded by an = symbol.

Now create a second calculated field giving the total number of neonatal deaths:

In the second formula, we have rounded the number of neonatal deaths to the nearest thousand using -3 (-2 would round to the nearest hundred, -1 to the nearest ten, 1 to one decimal place, 2 to two decimal places, and so on).
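As an illustration of the same idea, rounding to the nearest thousand can be done with plain integer arithmetic, here sketched in shell (the value 8765 is just an example, and this simple form assumes positive numbers):

```shell
# Round a positive value to the nearest thousand,
# the same effect as Tableau's ROUND(x, -3).
x=8765
echo $(( (x + 500) / 1000 * 1000 ))   # prints 9000
```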

Here we have performed some simple arithmetic, but it’s possible to use a wide variety of functions to manipulate data in Tableau. To see all of the available functions, click on the little grey triangle at the right of the dialog boxes above.

Understand that Tableau’s default behaviour is to summarize/aggregate data

As we work through today’s exercise, notice that Tableau routinely summarizes or aggregates measures that are dropped into Columns and Rows, calculating a SUM or AVG (average or mean), for example.

This behaviour can be turned off by selecting Analysis from the top menu and unchecking Aggregate Measures. However, I do not recommend doing this, as it will disable some Tableau functions. Instead, if you don’t want to summarize all of the data, drop categorical variables into the Detail shelf so that any summary statistic will be calculated at the correct level for your analysis. If necessary, you can set the aggregation so it is being performed on a single data point, and therefore has no effect.

Make a series of treemaps showing neonatal deaths over time

A treemap allows us to directly compare the neonatal deaths in each country, nested by region.

Drag Country and Region onto Columns and Neonatal deaths onto Rows. Then open Show Me and select the treemap option. The initial chart should look like this:

Look at the Marks shelf and see that the size and color of the rectangles reflect the SUM of Neonatal deaths for each country, while each rectangle is labeled with Region and Country:

Now drag Region to Color to remove it from the Label and colour the rectangles by region, using Tableau’s default qualitative colour scheme for categorical data:

For a more subtle color scheme, click on Color, select Edit Colors... and at the dialog box select the Tableau Classic Medium qualitative color scheme, then click Assign Palette and OK.

(Tableau’s qualitative color schemes are well designed, so there is no need to adopt a ColorBrewer scheme. However, it is possible to edit colors individually as you wish.)

Click on Color and set transparency to 75%. (For your assignment you will create a chart with overlapping circles, which will benefit from using some transparency to allow all circles to be seen. So we are setting transparency now for consistency.)

The treemap should now look like this:

Tableau has by default aggregated Neonatal deaths using the SUM function, so what we are seeing is the number for each country added up across the years.

To see one year at a time, we need to filter by year. If you drag the existing Year variable to the Filters shelf, you will get the option to filter by a range of numbers, which isn’t what we need:

Instead, we need to be able to check individual years, and draw a treemap for each one. To do that, select Year in the Dimensions panel and Duplicate.

Select the new variable and Convert to Discrete and then Rename it Year (discrete). Now drag this new variable to Filters, select 2014, and click OK:

The treemap now displays the data for 2014:

That’s good for a snapshot of the data, but with a little tinkering, we can adapt this visualization to show change in the number of neonatal deaths over time at the national, regional and global levels.

Select Year (discrete) in the Filters shelf and Filter ... to edit the filter. Select all the years with even numbers and click OK:

Now drag Year (discrete) onto Rows and the chart should look like this:

The formatting needs work, but notice that we now have a bar chart made out of treemaps.

Extend the chart area to the right by changing from Standard to Entire View on the dropdown menu in the top ribbon:

I find it more intuitive to have the most recent year at the top, so select Year (discrete) in the Rows shelf, select Sort and fill in the dialog box so that the years are sorted in Descending order:

The chart should now look like this:

We will create a map to serve as a legend for the regions, so click on the title bar for the color legend and select Hide Card to remove it from the visualization.

To remove some clutter from the chart, select Format>Borders from the top menu, and under Sheet>Row Divider, set Pane to None. Then close the Format Borders panel.

Right-click on the Sheet 1 title for the chart and select Hide Title. Also right-click on Year (discrete) at the top left of the chart and select Hide Field Labels for Rows. Then hover just above the top bar to get a double-arrowed drag symbol and drag upwards to reduce the white space at the top. You may also want to drag the bars a little closer to the year labels.

The labels will only appear in the larger rectangles. Rather than removing them entirely, let’s just leave a label for India in 2014, to make it clear that this is the country with by far the largest number of neonatal deaths. Click on Label in the Marks shelf, and switch from All to Selected under Marks to Label. Then right-click on the rectangle for India in 2014, and select Mark Label>Always Show. The chart should now look like this:

Hover over one of the rectangles, and notice the tooltip that appears. By default, all the fields we have used to make the visualization appear in the tooltip. (If you need any more, just drag those fields onto Tooltip.) Click on Tooltip and edit as follows. (Unchecking Include command buttons disables some interactivity, giving a plain tooltip):

Save to the web

Right-click on Sheet 1 at bottom left and Rename Sheet to Treemap bar chart. Then select File>Save to Tableau Public... from the top menu. At the logon dialog box enter your Tableau Public account details, give the Workbook a suitable name and click Save. When the save is complete, a view of the visualization on Tableau’s servers will open in your default browser.

Make a map to use as a colour legend

Select Worksheet>New Worksheet from the top menu, and double-click on Country. Tableau recognizes the names of countries and states/provinces; for the U.S., it also recognizes counties. Its default map-making behaviour is to put a circle at the geographic centre, or centroid, of each area, which can be scaled and coloured to reflect values from the data:

However, we need each country to be filled with colour by region. In the Show Me tab, switch to the filled maps option, and each nation should fill with colour. Drag Region to Colour and see how the same colour scheme we used previously carries over to the map. Click on Colour, set the transparency to 75% to match the bubble chart and remove the borders. Also click on Tooltip and uncheck Show tooltip so that no tooltip appears on the legend.

We will use this map as a colour legend, so its separate colour legend is unnecessary. Click the colour legend’s title bar and select Hide Card to remove it from the visualization. Also remove the Sheet 2 title as before.

Centre the map in the view by clicking on it, holding and panning, just as you would on Google Maps. It should now look something like this:

Rename the worksheet Map legend and save to the web again.

Make a line chart showing neonatal mortality rate by country over time

To address our second question, and explore the neonatal death rate over time by country, we can use a line chart.

First, rename Neonat Mortal as Neonatal death rate (per 1,000 births). Then, open a new worksheet, drag this variable to Rows and Year to Columns. The chart should now look like this:

Tableau has aggregated the data by adding up the rates for each country in every year, which makes no sense here. So drag Country to Detail in the Marks shelf to draw one line per country:

Drag region to Colour and set the transparency to 75%.

Now right-click on the X axis, select Edit Axis, edit the dialog box as follows and click OK:

Right-click on the X axis again, select Format, change Alignment/Default/Header/Direction to Up and use the dropdown menu set the Font to bold. Also remove the Sheet 3 title.

The chart should now look like this:

We can also highlight the countries with the highest total number of neonatal deaths by dragging Neonatal deaths to Size. The chart should now look like this:

This line chart shows that the trend in most countries has been to reduce neonatal deaths, while some countries have had more complex trajectories. But to make comparisons between individual countries, it will be necessary to add controls to filter the chart.

Tableau’s default behaviour when data is filtered is to redraw charts to reflect the values in the filtered data. So if we want the Y axis and the line thicknesses to stay the same when the chart is filtered, we need to freeze them.

To freeze the line thicknesses, hover over the title bar for the line thickness legend, select Edit Sizes... and fill in the dialog box as follows:

Now remove this legend from the visualization, together with the colour legend. We can later add an annotation to our dashboard to explain the line thickness.

To freeze the Y axis, right-click on it, select Edit Axis..., make it Fixed and click OK:

Right-click on the Y axis again, select Format... and increase the font size to 10pt to make it easier to read.

Now drag Country to Filters, make sure All are checked, and at the dialog box, click OK:

Now we need to add a filter control to select countries to compare. On Country in the Filters shelf, select Show Filter. A default filter control, with a checkbox for each nation, will appear to the right of the chart:

This isn’t the best filter control for this visualization. To change it, click on the title bar for the filter, note the range of filter controls available, and select Multiple Values (Custom List). This allows users to select individual countries by starting to type their names. Then select Edit Title... and add some text explaining how the filter works:

Take some time to explore how this filter works.

Rename Income to Income group. Then add Region and Income group to Filters, making sure that All options are checked for each. Select Show Filter for both of these filters, and select Single Value Dropdown for the control. Reset both of these filters to All, and the chart should now look like this:

Notice that the Income group filter lists the options in alphabetical order, rather than income order, which would make more sense. To fix this, right-click on Income group in the data panel and select Default Properties>Sort. At the dialog box below, select Manual sort, edit the order as follows and click OK:

The chart should now look like this:

Finally, click on Tooltip and edit as follows:

Rename the sheet Line chart and save to the web.

Make a dashboard combining both charts

From the top menu, select Dashboard>New Dashboard. Set its Size to Automatic, so that the dashboard will fill to the size of any screen on which it is displayed:

To make a dashboard, drag charts and other elements from the left-hand panel to the dashboard area. Notice that Tableau allows you to add items including: horizontal and vertical containers, text boxes, images (useful for adding a publication’s logo), embedded web pages and blank space. These can be added Tiled, which means they cannot overlap, or Floating, which allows one element to be placed over another.

Drag Treemap bar chart from the panel at left to the main panel. The default title, from the worksheet name, isn’t very informative, so right-click on that, select Edit Title ... and change to Total deaths.

Now add Line Chart to the right of the dashboard (the gray area will show where it will appear) and edit its title to Death rates. Also add a note to explain that line widths are proportional to the total number of deaths. The dashboard should now look like this:

Notice that the Country, Region and Income group filters control only the line chart. To make them control the treemaps, too, click on each filter, open the dropdown menu from the downward-pointing triangle, select Apply to Worksheets>Selected Worksheets... and fill in the dialog box as follows:

The filters will now control both charts.

Add Map legend for a color legend at bottom right. (You will probably need to drag the window for the last filter down to push it into position.) Hide the legend’s title, then right-click on the map and select Hide View Toolbar to remove the map controls.

We can also allow the highlighting of a country on one chart to be carried across the entire dashboard. Select Dashboard>Actions... from the top menu, and at the first dialog box select Add action>Highlight. Filling the second dialog box as follows will cause each country to be highlighted across the dashboard when it is clicked on just one of the charts:

Click OK on both dialog boxes to apply this action.

Select Dashboard>Show Title from the top menu. Right-click on it, select Edit Title... and change from the default to something more informative:

Now drag a Text box to the bottom of the dashboard and add a footnote giving source information:

The dashboard should now look like this:

Design for different devices

This dashboard works on a large screen, but not on a small phone. To see this, click the Device Preview button at top left and select Phone under Device type. In portrait orientation, this layout does not work at all:

Click the Add Phone Layout button at top right, and then click the Custom tab under Layout - Phone in the left-hand panel. You can then rearrange and if necessary remove elements for different devices. Here I have removed the line chart and filter controls, and changed the legend to a Floating element so that it sits in the blank space to the top right of the bar chart of treemaps.

Now save to the web once more. Once the dashboard is online, use the Share link at the bottom to obtain an embed code, which can be inserted into the HTML of any web page.

(You can also Download a static view of the graphic as a PNG image or a PDF.)

You can download the workbook for any Tableau visualization by clicking the Download Workbook link. The files (which will have the extension .twbx) will open in Tableau Public.

Having saved a Tableau visualization to the web, you can reopen it by selecting File>Open from Tableau Public... from the top menu.

Another approach to responsive design

As an alternative to using Tableau’s built-in device options, you may wish to create three different dashboards, each with a size appropriate for phones, tablets, and desktops respectively. You can then follow the instructions here to put the embed codes for each of these dashboards into a div with a separate class, and then use @media CSS rules to ensure that only the div with the correct dashboard displays, depending on the size of the device.

If you need to make a fully responsive Tableau visualization and are struggling, contact me for help!

When making responsively designed web pages, make sure to include this line of code between the <head></head> tags of your HTML:

<meta name="viewport" content="width=device-width, initial-scale=1.0">
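The approach of embedding three dashboards and showing only one at a time might use CSS rules like these. This is a sketch only: the class names are hypothetical, and the breakpoints shown follow Bootstrap 3’s conventions, so adjust them to your own layout:

```css
/* Show only the dashboard div that matches the screen width. */
.dashboard-phone   { display: none; }
.dashboard-tablet  { display: none; }
.dashboard-desktop { display: block; }

@media (max-width: 767px) {
  .dashboard-phone   { display: block; }
  .dashboard-desktop { display: none; }
}

@media (min-width: 768px) and (max-width: 991px) {
  .dashboard-tablet  { display: block; }
  .dashboard-desktop { display: none; }
}
```

Each embed code would go inside a div with the corresponding class.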

From dashboards to stories

Tableau also allows you to create stories, which combine successive dashboards into a step-by-step narrative. Select Story>New Story from the top menu. Having already made a dashboard, you should find these simple and intuitive to create. Select New Blank Point to add a new scene to the narrative.

Assignment

  • Create this second dashboard from the data. Here are some hints:
    • Drop Year into the Pages shelf to create the control to cycle through the years.
    • You will need to change the Marks to solid circles and scale them by the total number of neonatal deaths. Having done so, you will also need to increase the size of all circles so countries with small numbers of neonatal deaths are visible. Good news: Tableau’s default behavior is to size circles correctly by area, so they will be the correct sizes, relative to one another.
    • You will need to switch to a Logarithmic X axis and alter/fix its range.
    • Format GDP per capita in dollars by clicking on it in the Data panel and selecting Default Properties>Number Format>Currency (Custom).
    • Create a single trend line for each year’s data, so that the line shifts with the circles from year to year. Do this by dragging Trend line into the chart area from the Analytics panel. You will then need to select Analysis>Trend Lines>Edit Trend Lines... and adjust the options to give a single line with the correct behavior.
    • Getting the smaller circles rendered on top of the larger ones, so their tooltips can be accessed, is tricky. To solve this, open the dropdown menu for Country in the Marks shelf, select Sort and fill in the dialog box as follows. Now drag Country so it appears at the top of the list of fields in the Marks shelf.

    This should be a challenging exercise that will help you learn how Tableau works. If you get stuck, download my visualization and study how it is put together.

  • By next week’s class, send me the URL for your second dashboard. (Don’t worry about designing for different devices.)

Further reading/viewing

Tableau Public training videos

Gallery of Tableau Public visualizations: Again, you can download the workbooks to see how they were put together.

Tableau Public Knowledge Base: Useful resource with the answers to many queries about how to use the software.

Data visualization: Principles

Why visualize data? It is a good way to communicate complex information, because we are highly visual animals, evolved to spot patterns and make visual comparisons. To visualize effectively, however, it helps to understand a little about how our brains process visual information. The mantra for this week’s class is: Design for the human brain!

Visualization: encoding data using visual cues

Whenever we visualize, we are encoding data using visual cues, or “mapping” data onto variation in size, shape or color, and so on. There are various ways of doing this, as this primer illustrates:

These cues are not created equal, however. In the mid-1980s, statisticians William Cleveland and Robert McGill ran some experiments with human volunteers, measuring how accurately they were able to perceive the quantitative information encoded by different cues. This is what they found:

This perceptual hierarchy of visual cues is important. When making comparisons with continuous variables, aim to use cues near the top of the scale wherever possible.

But this doesn’t mean that everything becomes a bar chart

Length on an aligned scale may be the best option to allow people to compare numbers accurately, but that doesn’t mean the other possibilities are always to be avoided in visualization. Indeed, color hue is a good way of encoding categorical data. The human brain is particularly good at recognizing patterns and differences. This means that variations in color, shape and orientation, while poor for accurately encoding the precise value of continuous variables, can be good choices for representing categorical data.

You can also combine different visual cues in the same graphic to encode different variables. But always think about the main messages you are trying to impart, and where you can use visual cues near the top of the visual hierarchy to communicate that message most effectively.

To witness this perceptual hierarchy, look at the following visual encodings of the same simple dataset. In which of the three charts is it easiest to compare the numerical values that are encoded?

If you have spent any time reading blogs on data visualization, you will know the disdain in which pie charts are often held. It should be clear which of these two charts is easiest to read:

(Source: VizThinker)

Pie charts encode continuous variables primarily using the angles made in the center of the circle. It is certainly true that angles are harder to read accurately than aligned bars. However, note that encoding data using the area of circles — which has become a “fad” in data visualization in recent years — makes even tougher demands on your audience.

Which chart type should I use?

This is a frequently asked question, and the best answer is: Experiment with different charts, to see which works best to liberate the story in your data. Some of the visualization software — notably Tableau Public — will suggest chart types for you to try. However, it is good to have a basic framework to help you prioritize particular chart types for particular visualization tasks. Although it is far from comprehensive, and makes some specific chart suggestions that I would not personally endorse, this “chart of charts” provides a useful framework by providing four answers to the question: “What would you like to show?”

(Source: A. Abela, Extreme Presentation Method)

Last week, we covered charts to show the distribution of a single continuous variable, and to study the relationship between two continuous variables. So let’s now explore possibilities for comparison between items for a single continuous variable, and composition, or how parts make up the whole. In each case, this framework considers both a snapshot at one point in time, and how to visualize comparison and composition over time — a common task in data journalism.

I like to add a couple more answers to the question: connection, or visualizing how people, things, or organizations relate to one another; and location, which covers maps.

Simple comparisons: bars and columns

Applying the perceptual hierarchy of visual cues, bar and column charts are usually the best options for simple comparisons. Vertical columns often work well when few items are being compared, while horizontal bars may be a better option when there are many items to compare, as in this example from The Wall Street Journal, illustrating common passwords revealed by a 2013 data breach at Gawker Media.

Here I have used a bar chart to show payments for speaking about drug prescription made to doctors in California by the drug company Pfizer in the second half of 2009, using data gathered in reporting this story.

Notice how spot color is used here as a secondary visual cue, to highlight the doctor who received the most money.

There is one sacrosanct rule with bar and column charts: Because they rely on the length of the bars to encode data, you must start the bars at zero. Failing to do this will mislead your audience. Several graphics aired by Fox News have been criticized for disobeying this rule, for example:

(Source: Fox News, via Media Matters for America)

Comparisons: change over time

Bar or column charts can also be used to illustrate change over time, but there are other possibilities, as shown in these charts showing participation in the federal government’s food stamps nutritional assistance program, from 1969 to 2014.

(Source: Peter Aldhous, from U.S. Department of Agriculture data)

Each of these charts communicates the same basic information with a subtly different emphasis. The column chart emphasizes each year as a discrete point in time, while the line chart focuses on the overall trend or trajectory. The dot-and-line chart is a compromise between these two approaches, showing the trend while also drawing attention to the value for each year. (The dot-column chart is an unusual variant of a column chart, included here to show another possible design approach.)

Multiple comparisons, including over time

When comparing very many items, or how one item has changed over time, “small multiples” provide another approach. This technique has been used very successfully in recent years by several news organizations. Here is a small section from a larger graphic showing the severity of drought in California in late 2013 and early 2014:

(Source: Los Angeles Times)

Small multiples are becoming more popular as more people consume news graphics on mobile devices. Unlike larger conventional graphics, they can be made to reflow easily in responsive web designs to display effectively on small screens.

If you are comparing two points in time for many items, a slope graph can be an effective choice. Slope falls about midway on the perceptual hierarchy of visual cues, but allows us to scan many items at once and note obvious differences. Here I used slope graphs to visualize data from a study examining the influence of putting house plants in hospital rooms on patients’ sense of well-being, measured before abdominal surgery, and after a period of recovery. I used thicker lines and color to highlight ratings that showed statistically significant improvements.

(Source: Peter Aldhous, from data in this research paper)

Composition: parts of the whole

This is where the much-maligned pie chart does have a role, although it is not the only option. Which of these two representations of an August 2014 poll of public opinion on President Barack Obama’s job performance makes the differences between his approval ratings for different policy areas easiest to read, the pie charts or the stacked column charts below?

These graphics involve both comparison and composition — a common situation in data journalism.

(Source: Peter Aldhous, from CBS poll data, via PollingReport.com)

In class, we’ll discuss how both of these representations of the data could have been improved.

I would suggest abandoning pie charts if there are any more than three parts to the whole, as they become very hard to read when there are many segments. ProPublica’s graphics style guide goes further, allowing pie charts with two segments only.

Recent research into how people perceive composition visualizations with just a few categories suggests that the best approach may actually be a square chart. Surprisingly, this is an example where an encoding of area seems to beat length for accuracy:

(Source: Eagereyes)

Another approach, known as a treemap, similarly uses area to encode the size of parts of the whole, and can be effective to display “nested” variables — where each part of the whole is broken down into further parts. Here The New York Times used a treemap to display President Obama’s 2012 budget request to Congress, also using color to indicate whether the proposal represented an increase (shades of green) or decrease (red) in spending:

(Source: The New York Times)

Composition: change over time

Data journalists frequently need to show how parts of the whole vary over time. Here is an example, illustrating the development of drought across the United States, which uses a stacked columns format, in this case with no space between the columns.

(Source: The Upshot, The New York Times)

In the drought example, the size of the whole remains constant. Even if the size of the whole changes, this format can be used to show changes in the relative size of parts of the whole, by converting all of the values at each time interval into percentages of the total.

Stacked column charts can also be used to simultaneously show change in composition over time and change in the size of the whole. This example is from one of my own articles, looking at change over time in the numbers of three categories of scientific research papers published in Proceedings of the National Academy of Sciences:

(Source: Nature)

Just as for simple comparisons over time, columns are not the only possibility when plotting changes in composition over time. The parts-of-the-whole equivalent of the line chart, stressing the overall trend rather than values at discrete points in time, is the stacked area chart. Again, these charts can be used to show change over time with the size of the whole held constant, or varying over time. This 2009 interactive from The New York Times used this format to reveal how Americans typically spend their day:

(Source: The New York Times)

Making connections: network graphs

The chart types thought-starter we have used as a framework so far misses two of my answers to the question: “What would you like to show?” We will cover location in subsequent classes on mapping.

Journalists may be interested in exploring connection — which donors gave money to which candidate, how companies are connected through members of their boards, and so on. Network graphs can visualize these questions, and are sometimes used in news media. Here, for example, The New York Times showed connections between the national teams, players and club teams at the 2014 soccer World Cup:

(Source: The New York Times)

Complex network graphs can be very hard to read — “hairball” is a pejorative term used to describe them — so networks often need to be filtered to tell a clear story to your audience.

If you are interested in learning how to make network graphs, I have tutorials here.

Case study: Immunization in California kindergartens

Now we’ll explore a dataset at different levels of analysis, to show how different visual encodings may be needed for different visualization tasks with the same data.

This data, from the California Department of Public Health, gives numbers on immunization and enrollment at kindergartens across the state. The data is provided at the level of individual schools, but can be aggregated to look at counties, or the entire state.

When looking at change over time at the state level, the perceptual hierarchy makes a column chart a good choice:

(Source: Peter Aldhous, from California Department of Public Health data)

Notice that I’ve focused on the percentage of children with incomplete vaccination, rather than the percentage complete, for two reasons:

  • The differences between the lengths of the bars are greater, and so are easier to read.
  • More importantly, incomplete vaccination is what increases the risk of infectious disease outbreaks, which is why we care about this data.

But as for the food stamps data, a bar chart is not the only choice:

Here’s the same information presented as a line chart:

(Source: Peter Aldhous, from California Department of Public Health data)

Notice that here, I haven’t started the Y axis at zero. This would be unforgivable for a bar chart, where the length of the bar is the visual encoding, and so starting at an arbitrary value would distort the comparison between the bars. Here, however, I’m emphasizing the relative slope, to show change over time, so starting at zero is less crucial.

And here’s the data as a dot-and-line chart:

(Source: Peter Aldhous, from California Department of Public Health data)

Here, I’ve returned to a Y axis that starts at zero, so that the relative positions of the points can be compared accurately.

But what if we want to look at individual counties? When comparing a handful of counties, the dot-and-line chart, combining the visual cues of position on an aligned scale (for the yearly values) and slope (for the rate of change from year to year) works well:

(Source: Peter Aldhous, from California Department of Public Health data)

But there are 58 counties in California, and trying to compare them all using a dot-and-line chart results in chaos:

(Source: Peter Aldhous, from California Department of Public Health data)

In this case, it makes sense to drop down the perceptual hierarchy, and use the intensity of color to represent the percentage of incomplete immunization:

(Source: Peter Aldhous, from California Department of Public Health data)

This type of chart is called a heat map. It provides a quick and easy way to scan for the counties and years with the highest rates of incomplete immunization.
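
The idea behind a heat map's encoding — mapping each value onto the intensity of a single color — can be sketched in a few lines of Python. The white-to-blue endpoints here are an arbitrary choice for illustration:

```python
def value_to_color(value, vmin, vmax):
    """Map a value onto a white-to-blue sequential scale, returning a HEX color."""
    t = (value - vmin) / (vmax - vmin)          # normalize the value to 0-1
    start, end = (255, 255, 255), (8, 81, 156)  # white to an arbitrary dark blue
    rgb = tuple(round(s + t * (e - s)) for s, e in zip(start, end))
    return '#{:02x}{:02x}{:02x}'.format(*rgb)

print(value_to_color(0, 0, 100))    # '#ffffff' -- lowest value: white
print(value_to_color(100, 0, 100))  # '#08519c' -- highest value: full blue
print(value_to_color(50, 0, 100))   # midpoint: an intermediate blue
```

In practice you would usually reach for a tested sequential scheme, such as one from ColorBrewer (discussed below), rather than a hand-rolled interpolation.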

What if we want to visualize the data for every kindergarten on a single chart, to give an overview of how immunization rates vary across schools?

Here’s my best attempt at this:

(Source: Peter Aldhous, from California Department of Public Health data)

Here I’ve drawn a circle for every school, and used their position on an aligned scale, along the Y axis, to encode the percentage of incomplete immunization. I’ve also used the area of the circles to encode the enrollment at each kindergarten — but this is secondary to the chart’s main message, which is about the variation of immunization rates across schools.

Using color effectively

Color falls low on the perceptual hierarchy of visual cues, but as we have seen above, it is often deployed to highlight particular elements of a chart, and sometimes to encode data values. Poor choice of color schemes is a problem that bedevils many news graphics, so it is worth taking some time to consider how to use color to maximum effect.

It helps to think about colors in terms of the color wheel, which places colors that “harmonize” well together side by side, and arranges those that have strong visual contrast — blue and orange, for instance — at opposite sides of the circle:

(Source: Wikimedia Commons)

When encoding data with color, take care to fit the color scheme to your data, and the story you’re aiming to tell. Color is often used to encode the values of categorical data. Here you want to use “qualitative” color schemes, where the aim is to pick colors that will be maximally distinctive, as widely spread around the color wheel as possible:

(Source: ColorBrewer)

When using color to encode continuous data, it usually makes sense to use increasing intensity, or saturation of color to indicate larger values. These are called “sequential” color schemes:

(Source: ColorBrewer)

In some circumstances, you may have data that has positive and negative values, or which highlights deviation from a central value. Here, you should use a “diverging” color scheme, which will usually have two colors reasonably well separated on the color wheel as its end points, and cycle through a neutral color in the middle:

(Source: ColorBrewer)

Choosing color schemes is a complex science and art, but there is no need to “roll your own” for every graphic you make. Many visualization tools include suggested color palettes, and I often make use of the website from which the examples above were taken, called ColorBrewer. Originally designed for maps, but useful for charts in general, these color schemes have been rigorously tested to be maximally informative.

In class, we will take some time to play around with ColorBrewer and examine its outputs. You will notice that the colors it suggests can be displayed according to their values on three color “models”: HEX, RGB and CMYK. Here is a brief explanation of these and other common color models.

  • RGB Three values, describing a color in terms of combinations of red, green, and blue light, with each scale ranging from 0 to 255; sometimes extended to RGB(A), where A is alpha, which encodes transparency. Example: rgb(169, 104, 54).
  • HEX A six-figure “hexadecimal” encoding of RGB values, with each scale ranging from hex 00 (equivalent to 0) to hex ff (equivalent to 255); HEX values will be familiar if you have any experience with web design, as they are commonly used to denote color in HTML and CSS. Example: #a96836
  • CMYK Four values, describing a color in combinations of cyan, magenta, yellow and black, relevant to the combination of print inks. Example: cmyk(0, 0.385, 0.68, 0.337)
  • HSL Three values, describing a color in terms of hue, saturation and lightness (running from black, through the color in question, to white). Hue is the position on a blended version of the color wheel in degrees around the circle ranging from 0 to 360, where 0 is red. Saturation and lightness are given as percentages. Example: hsl(26.1, 51.6%, 43.7%)
  • HSV/B Similar to HSL, except that brightness (sometimes called value) replaces lightness, running from black to the color in question. Example: hsv(26.1, 68.07%, 66.25%)

Colorizer is one of several web apps for picking colors and converting values from one model to another.
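
For a rough sense of how these models relate, here is a small Python sketch using the standard library's colorsys module (note that colorsys uses the HLS ordering, with all values on 0–1 scales). It reproduces the example color given above:

```python
import colorsys

def hex_to_rgb(hex_color):
    """Convert a HEX string like '#a96836' to an (r, g, b) tuple of 0-255 integers."""
    h = hex_color.lstrip('#')
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hsl(r, g, b):
    """Convert 0-255 RGB values to (hue in degrees, saturation %, lightness %)."""
    # colorsys works on 0-1 floats and returns hue, lightness, saturation (HLS order)
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    return round(h * 360, 1), round(s * 100, 1), round(l * 100, 1)

print(hex_to_rgb('#a96836'))     # (169, 104, 54)
print(rgb_to_hsl(169, 104, 54))  # (26.1, 51.6, 43.7)
```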

Custom color schemes can also work well, but experiment to see how different colors influence your story. The following graphic from The Wall Street Journal, for instance, uses an unusual pseudo-diverging scheme to encode data — the US unemployment rate — that would typically be represented using a sequential color scheme. It has the effect of strongly highlighting periods where the jobless rate rises to around 10%, which flow like rivers of blood through the graphic. This was presumably the designer’s aim.

(Source: The Wall Street Journal)

If you intend to roll your own color scheme, try experimenting with I want hue for qualitative color schemes, the Chroma.js Color Scale Helper for sequential schemes, and this color ramp generator, in combination with Colorizer or another online color picker, for diverging schemes.

You will also notice that ColorBrewer allows you to select color schemes that are colorblind safe. Surprisingly, many news organizations persist in using color schemes that exclude a substantial minority of their audience. Red and green lie on opposite sides of the color wheel, and also can be used to suggest “good” or “go,” versus “bad” or “stop.” But about 5% of men have red-green colorblindness, also known as deuteranopia. Here, for example, is what the budget treemap from The New York Times would look like to someone with this condition:

(Source: The New York Times via Color Oracle)

Install Color Oracle to check how your charts and maps will look to people with various forms of colorblindness.

Using chart furniture, minimizing chart junk, highlighting the story

In addition to the data, encoded through the visual cues we have discussed, various items of chart furniture can help frame the story told by your data:

  • Title and subtitle These provide context for the chart.
  • Coordinate system For most charts, this is provided by the horizontal and vertical axes, giving a Cartesian system defined by X and Y coordinates; for a pie chart it is provided by angles around a circle, called a polar coordinate system.
  • Scale Labeled tick marks and grid lines can help your audience read data values.
  • Labels You will usually want to label each axis. Think about other labels that may be necessary to explain the message of your graphic.
  • Legend If you use color or shape to encode data, you will often need a legend to explain this encoding.
  • Source information Usually given as a footnote. Don’t forget this!

Chart furniture can also be used to encode data, as in this example, which shows the terms of New York City’s police commissioners and mayors with reference to the time scale on the X axis:

(Source: The New York Times)

In this example, the label for the Y axis is displayed horizontally in the main chart area, rather than vertically alongside the chart. News media often do this so that readers don’t have to crane their necks to read the label. If you do this, check that it is clear to users that the label refers to scale on the Y axis.

Think carefully about how much chart furniture you really need, and make sure that the story told by your data is front and center. Think data-ink: What proportion of the ink or pixels in your chart is actually encoding data, and what proportion is embellishment, adding little to your story?

Here is a nice example of a graphic that minimizes chart junk, and maximizes data-ink. Notice how the Y axis doesn’t need to be drawn, and the gridlines are an absence of ink, consisting of white lines passing through the columns:

(Source: The Upshot, The New York Times)

Contrast this with the proliferation of chart junk in the earlier misleading Fox News column chart.

Labels and spot-color highlights can be particularly useful to highlight your story, as shown in the following scatter plots, used here to show the relationship between the median salaries paid to women and men for the same jobs in 2015. In this case there is no suggestion of causation; here the scatter plot format is being used to display two distributions simultaneously — see the chart types thought-starter.

It is clear from the first, unlabeled plot that male and female salaries for the same job are strongly correlated, as we would expect, but that relationship is not very interesting. Notice also how I have used transparency to help distinguish overlapping individual points.

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

What we are interested in here is whether men and women are compensated similarly for doing the same jobs. The story in the data starts to emerge if you add a line of equal pay, with a slope of 1 (note that this isn’t a trend line, as we discussed last week). Here I have also highlighted the few jobs in which women enjoyed a marginal pay gap over men:

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

Notice how adding another line, representing a 25% pay gap, and highlighting the jobs where the pay gap between men and women is largest, emphasizes different aspects of the story:

(Source: Peter Aldhous, from Bureau of Labor Statistics data)

Pitfalls to avoid

If you ever decide to encode data using area, be sure to do so correctly. Hopefully it is obvious that if one unit is a square with sides of length one, then the correct way to represent a value of four is a square with sides of length two (2*2 = 4), not a square with sides of length four (4*4 = 16).

Mistakes are frequently made, however, when encoding data by the area of circles. In 2011, for instance, President Barack Obama’s State of the Union Address for the first time included an “enhanced” online version with supporting data visualizations. This included the following chart, comparing US Gross Domestic Product to that of competing nations:

(Source: The 2011 State of the Union Address: Enhanced Version)

Data-savvy bloggers were quick to point out that the data had been scaled by the radius of each circle, not its area. Because area = π * radius^2, the radius should instead be scaled in proportion to the square root of each data value, which gives the correct result, shown on the right:

(Source: Fast Fedora blog)
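
The underlying arithmetic, for both the squares and the circles discussed above, can be checked in a few lines: the linear dimension (side or radius) should be proportional to the square root of the data value, so that area is proportional to the value itself.

```python
import math

def side_for_value(value):
    """Side length of a square whose area equals the data value."""
    return math.sqrt(value)

def radius_for_value(value):
    """Radius of a circle whose area equals the data value."""
    return math.sqrt(value / math.pi)

# A value of 4 should yield a shape with 4x the area of a value of 1
for f in (side_for_value, radius_for_value):
    ratio = f(4) / f(1)
    assert math.isclose(ratio, 2)       # linear dimension doubles...
    assert math.isclose(ratio ** 2, 4)  # ...so area quadruples, matching the data

# The State of the Union mistake: making the radius itself proportional to the
# value, so a value 4x larger is drawn with 16x the area
print((4 / 1) ** 2)  # 16.0
```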

Many software packages (Microsoft Excel is a notable culprit) allow users to create charts with 3-D effects. Some graphic designers produce customized charts with similar aesthetics. The problem is that it is very hard to read the data values from 3-D representations, as this example illustrates:

(Source: Good)

A good rule of thumb for data visualization is that trying to represent three dimensions on a two dimensional printed or web page is almost always one dimension too many, except in unusual circumstances, such as these representations of Mount St. Helens in Washington State, before and after its 1980 eruption:

(Source: OriginLab)

Above all, aim for clarity and simplicity in your chart design. Clarity should trump simplicity. As Albert Einstein is reputed to have said: “Everything should be made as simple as possible, but not simpler.”

Sometimes even leading media outlets lose their way. See if you can make sense of this interactive graphic on clandestine US government agencies and their contractors:

(Source: The Washington Post)

Be true to the ‘feel’ of the data

Think about what the data represents in the real world, and use chart forms, visual encodings and color schemes that allow the audience’s senses to get close to what the data means — note again the “rivers of blood” running through The Wall Street Journal’s unemployment chart, which suggest human suffering.

The best example I know of this uses sound rather than visual cues, so strictly speaking it is “sonification” rather than visualization. In 2010, this interactive from The New York Times explored the narrow margins separating medalists from also-rans in many events at the Vancouver Winter Olympics. It visualized the results in a conventional way, but also included sound files encoding the race timings with musical notes.

(Source: The New York Times)

Our brains process music in time, but perceive charts in space. That’s why the auditory component of this interactive was the key to its success.

Break the story down into scenes

Many stories have a step-by-step narrative, and different charts may tell different parts of the story. So think about communicating such stories through a series of graphics. This is another good reason to experiment with different chart types when exploring a new dataset. Here is a nice example of this approach, examining demographic change in Brazil:

(Source: Época, via Visualopolis)

Good practice for interactives

Nowadays the primary publication medium for many news graphics is the web or apps on mobile platforms, rather than print, which opens up many possibilities for interactivity. This can greatly enhance your ability to tell a story, but it also creates new possibilities to confuse and distract your audience — think of this as interactive chart junk.

A good general approach for interactive graphics is to provide an overview first, and then allow the interested user to zoom or filter to dig deeper into the data. In such cases, the starting state for an interactive should tell a clear story: If users have to make an effort to dig into a graphic to get anything from it, few are likely to do so. Indeed, assume that much of your audience will spend only a short time interacting with the data. “How Different Groups Spend Their Day” from The New York Times is a good example of this approach.

Similarly, don’t hide labels or information essential to understanding the graphic in tooltips that are accessed only on clicks or hovers. This is where to put more detailed information for users who have sufficient interest to explore further.

Make the controls for an interactive obvious — play buttons should look like play buttons, for instance. You can include a few words of explanation, but only a very few: as far as possible, how to use the interactive should be intuitive, and built into its design.

The interactivity of the web also facilitates a scene-by-scene narrative — a device employed frequently by The New York Times’ graphics team in recent years. With colleagues at New Scientist, I also used this approach for this interactive, exploring the likely number of Earth-like planets in our Galaxy:

(Source: New Scientist)

‘Mobile-first’ may change your approach

Increasingly, news content is being viewed on mobile devices with small screens.

At the most basic level, this means making graphics “responsive,” so that their size adjusts to screen size. But there is more to effective design for mobile than this.

We have already discussed the value of small multiples, which can be made to reflow for different screen sizes.

This interactive, exploring spending on incarceration by block in Chicago, is a nice example of organizing and displaying the same material differently for different screen sizes. Open it up on your laptop, then reduce the size of your browser window to see how it behaves.

(Source: DataMade)

Again, a step-by-step narrative can be a useful device in overcoming the limitations of a small screen. This interactive, exploring school segregation by race in Florida, is a good example of this approach:

(Source: Tampa Bay Times)

Here’s an article that includes some of my thoughts on the challenge of making graphics that work effectively on mobile.

Be careful with animation

Animation in interactives can be very effective. But remember the goal of staying true to the ‘feel’ of the data. Animated images evolve over time, so animation can be particularly useful to encode data that changes over time. But again you need to think about what the human brain is able to perceive. Research has shown that people have trouble tracking more than about four points at a time. Try playing Gapminder World without the energetic audio commentary of Hans Rosling’s “200 Countries” video, and see whether the story told by the data is clear.

Animated transitions between different states of a graphic can be pleasing. But overdo it, and you’re into the realm of annoying Powerpoint presentations with items zooming into slides with distracting animation effects. It’s also possible for elegant animated transitions to “steal the show” from the story told by the data, which arguably is the case for this exploration by The New York Times of President Obama’s 2013 budget request to Congress:

(Source: The New York Times)

Sketch and experiment to find the story

One key message I’d like you to take from this class is that there are many ways of visualizing the same data. Effective graphics and interactives do not usually emerge fully formed. They usually arise through sketching and experimentation.

As you sketch and experiment with data, use the framework suggested by the chart selector thought-starter to prioritize different chart types, and always keep the perceptual hierarchy of visual cues at the front of your mind. Remember the mantra: Design for the human brain!

Also, show your experiments to friends and colleagues. If people are confused or don’t see the story, you may need to try a different approach.

Learn from the experts

Over the coming weeks and beyond, make a habit of looking for innovative graphics, especially those employing unusual chart forms, that communicate the story from data in an effective way. Work out how they use visual cues to encode data. Here are a couple of examples from The New York Times to get you started. Follow the links from the source credits to explore the interactive versions:

(Source: The New York Times)

(Source: The New York Times)

Similarly, make note of graphics that communicate less effectively, and see if you can work out why.

Further reading

Alberto Cairo: The Functional Art: An Introduction to Information Graphics and Visualization

Nathan Yau: Data Points: Visualization That Means Something

Network Analysis with Gephi

Network analysis with Gephi

Today, network analysis is being used to study a wide variety of subjects, from how networks of genes and proteins influence our health to how connections between multinational companies affect the stability of the global economy. Network graphs can also be used to great effect in journalism to explore and illustrate connections that are crucial to public policy, or directly affect the lives of ordinary people.

Do any of these phrases have resonance for you?

  • The problem with this city is that it’s run by X’s cronies.
  • Contracts in this town are all about kickbacks.
  • Follow the money!
  • It’s not what you know, it’s who you know.

If so, and you can gather relevant data, network graphs can provide a means to display the connections involved in a way that your audience (and your editors) can readily understand.

The data we will use

Download the data from this session from here, unzip the folder and place it on your desktop. It contains the following folders and files:

  • friends.csv A simple network documenting relationships among a small group of people.
  • senate_113-2013.gexf and senate-113-2014.gexf Two files with data on voting patterns in the U.S. Senate, detailing the number and percentage of times pairs of Senators voted the same way in each year.
  • senate_one_session.py Python script that will scrape a single year’s data from GovTrack.US; modified from a script written by Renzo Lucioni.
  • senate Folder containing files and code to make an interactive version of the Senate voting network.

Network analysis: the basics

At its simplest level, network analysis is very straightforward. Network graphs consist of edges (the connections) and nodes (the entities that are connected).

One important consideration is whether the network is “directed,” “undirected,” or a mixture of the two. This depends upon the nature of the connections involved. If you are documenting whether people are Facebook friends, for instance, and have no information about who made the original friend request, the connections have no obvious direction. But when considering following relationships on Twitter, there is a clear directionality to each relationship: A fan might follow Taylor Swift, for example, but she probably doesn’t follow them back.

The connections in undirected graphs are typically represented by simple lines or curves, while directed relationships are usually represented by arrows.

Here, for example, I used a directed network graph to illustrate patterns of citation of one another’s work by researchers working on a type of stem cell that later won their discoverer a Nobel prize. Notice that in some cases there are arrows going in both directions, because each researcher had frequently cited the other:

(Source: New Scientist)

Nodes and edges can each have data associated with them, which can be represented using size, color and shape.

Network algorithms and metrics

Networks can be drawn manually, with the nodes placed individually to give the most informative display. However, network theorists have devised layout algorithms to automate the production of network graphs. These can be very useful, especially when visualizing large and complex networks.

There are also a series of metrics that can quantify aspects of a network. Here are some examples, which measure the importance of nodes within a network in slightly different ways:

  • Degree is a simple count of the number of connections for each node. For directed networks, it is divided into In-degree, for the number of incoming connections, and Out-degree, for outgoing connections. (In my stem cell citation network, In-degree was used to set the size of each node.)
  • Eigenvector centrality accounts not only for the node’s own degree, but also the degrees of the nodes to which it connects. As such, it is a measure of each node’s wider “influence” within the network. Google’s PageRank algorithm, which rates the importance of web pages according to the links they receive, is a variant of this measure.
  • Betweenness centrality essentially reveals how important each node is in providing a “bridge” between different parts of the network: It counts the number of times each node appears on the shortest path between two other nodes. It is particularly useful for highlighting the nodes that, if removed, would cause a network to fall apart.
  • Closeness centrality is a measure of how close each node is, on average, to all of the other nodes in a network. It highlights the nodes that connect to the others through the smallest number of edges. The Kevin Bacon Game, in which you have to connect Bacon to other movie actors through the smallest number of movies, based on co-appearances, works because he has a high closeness centrality in this network.
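The simplest of these measures, degree, can be computed directly from an edge list. Here is a minimal Python sketch, using invented node names; libraries such as NetworkX provide ready-made implementations of the more sophisticated centrality measures, so you rarely need to code those by hand:

```python
from collections import Counter

# A tiny directed edge list: (source, target) pairs.
# The names are invented, purely for illustration.
edges = [("Ann", "Bob"), ("Bob", "Ann"), ("Carol", "Ann"), ("Dan", "Ann")]

# Out-degree counts outgoing connections; in-degree counts incoming ones
out_degree = Counter(source for source, _ in edges)
in_degree = Counter(target for _, target in edges)

# Total degree is the sum of the two
degree = {node: in_degree[node] + out_degree[node]
          for node in set(in_degree) | set(out_degree)}

print(degree["Ann"])  # 4: one outgoing connection plus three incoming
```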

Network data formats

The most basic data needed to draw a network is an “edge list” — a list of pairs of nodes that connect within the network, which can be created in a spreadsheet with two columns, one for each member of each pair of nodes.

There are also a number of dedicated data formats used to record data about networks, which can store a variety of data about both edges and nodes. Here are two of the most common:

GEXF is a variant of XML, and is the native data format for Gephi, the network visualization software we will use today.

GraphML is another, older XML format used for storing and exchanging network graph data.

For visualizing networks online, it often makes sense to save them as JSON, which keeps file size small and works well with JavaScript visualization libraries.
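The exact schema depends on the library, but a network saved as JSON typically contains an array of nodes and an array of edges. The field names below follow the convention read by Sigma.js’s JSON parser; treat this as an illustrative sketch rather than a fixed specification:

```json
{
  "nodes": [
    {"id": "n0", "label": "Ann", "x": 0, "y": 0, "size": 3},
    {"id": "n1", "label": "Bob", "x": 1, "y": 1, "size": 3}
  ],
  "edges": [
    {"id": "e0", "source": "n0", "target": "n1"}
  ]
}
```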

Introducing Gephi

Gephi is a tool designed to draw, analyze, filter and customize the appearance of network graphs according to qualitative and quantitative variables in the data.

Gephi allows you to deploy layout algorithms, or to place nodes manually. It can calculate network metrics, and lets you use the results of these analyses to customize the appearance of your network graph.

Finally, Gephi allows you to create publication-quality vector graphics of your network visualizations, and to export your filtered and analyzed networks in data formats that can be displayed interactively online, using JavaScript visualization libraries.

Launch Gephi, and you will see a screen like this:

(You may also see an initial welcome window, allowing you to load recently used or sample data. You can close this.)

Install Gephi plugins

Gephi has a series of plugins that extend its functionality — you can browse the available options here.

We will install a plugin that we will later use to export data from Gephi as JSON. Select Tools>Plugins from the top menu, and the Plugins window should open:

In the Available Plugins tab, look for the JSONExporter plugin — you can use the Search box to find it, if necessary. Then click Install.

After installing plugins, you may be prompted to restart Gephi, which you should do.

If you do not find the plugin you are looking for, close Gephi and browse for the plugin at the Gephi marketplace, where you can download it manually. Then relaunch Gephi, select Tools>Plugins from the top menu and go to the Downloaded tab. Click the Add Plugins ... button, and navigate to where the plugin was saved on your computer — it should be in a zipped folder or have an .nbm file extension. Then click Install and follow the instructions that appear.

See here for more instructions on installing Gephi plugins.

Make a simple network graph illustrating connections between friends

Having launched Gephi, click on Data Laboratory. This is where you can view and edit raw network data. From the top menu, select File>New Project, and the screen should look like this:

Notice that you can switch between viewing Nodes and Edges, and that there are buttons to Add node and Add edge, which allow you to construct simple networks by manual data entry. Instead, we will import a simple edge list, to show how Gephi will add the nodes and draw the network from this basic data.

To do this, click the Import Spreadsheet button — which actually imports CSV files, rather than spreadsheets in .xls or .xlsx format. Your CSV file should at a minimum have two columns, headed Source and Target. By default, Gephi will create a directed network from an edge list, with arrows pointing from Source to Target nodes. If some or all of your connections are undirected, include a third column called Type and fill the rows with Undirected or Directed, as appropriate.
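If you prefer to generate such a file with a script rather than a spreadsheet, a few lines of Python will do; the names and the output filename here are invented for illustration:

```python
import csv

# Hypothetical edge list rows: Gephi needs Source and Target columns;
# the Type column is optional (Directed or Undirected for each edge)
rows = [
    ("Source", "Target", "Type"),
    ("Ann", "Bob", "Directed"),
    ("Bob", "Ann", "Directed"),
    ("Carol", "Dan", "Undirected"),
]

# Write the edge list as a CSV file ready for Gephi's Import Spreadsheet
with open("friends_edges.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```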

At the dialog box shown below, navigate to the file friends.csv, and ensure that the data is going to be imported as an Edges table:

Click Next> and then Finish, and notice that Gephi has automatically created a Nodes table from the information in the Edges table:

In the Nodes table, click the Copy data to other column button at the bottom of the screen, select Id and click OK to copy to Label. This column can later be used to put labels on the network graph.

Now click Add column, call it Gender and keep its Type as String, because we are going to enter text values, rather than numbers. This column can later be used to color the friends according to their gender.

Having created the column, double-click on each row and enter F or M, as appropriate:

Now switch to the Edges table, and notice that each edge has been classed as Directed:

This would make sense if, for example, we were looking at dinner invitations made by the source nodes. Notice that in this network, not only has Ann invited Bob to dinner, but Bob has also invited Ann. For each of the other pairs, the invitations have not been reciprocated.

Now click Overview to go back to the main graph view, where a network graph should now be visible. You can use your mouse/trackpad to pan and zoom. On my trackpad, right-click and hold enables panning, while the double-finger swipe I would normally use to scroll enables zoom. Your settings may vary!

Note also that the left of the two sliders at bottom controls the size of the edges, and that individual nodes can be clicked and moved to position them manually. Below I have arranged the nodes so that none of the edges cross over one another. The Context panel at top right gives basic information about the network:

Click on the dark T button at bottom to call up labels for the nodes, and use the right of the two sliders to control their size. The light T button would call up edge labels, if they were set.

Turn off the labels once more, and we will next color the nodes according to the friends’ gender.

Notice that the panel at top left contains two tabs, Partition and Ranking. The former styles nodes or edges according to qualitative variables, the latter according to quantitative variables. Select Partition>Nodes, hit the Refresh button with the circling green arrows, select Gender and hit the Run button with the green “play” symbol. The nodes should now be colored by gender, and you may find that the edges also take the color of the source node:

To turn off this behavior, click this button at the bottom of the screen: (the button to its immediate left allows edge visibility to be turned on and off).

Select File>New Project and you will be given the option to save your project before closing. You can also save your work at any time by selecting File>Save or by using the usual ⌘-S or Ctrl-S shortcut.

Visualize patterns of voting in the U.S. Senate

Having learned these basics, we will now explore a more interesting network, based on voting patterns in the U.S. Senate in 2014.

Select File>Open from the top menu and navigate to the file senate-113-2014.gexf. The next dialog box will give you some information about the network being imported, in this case telling you it is an undirected network containing 101 nodes and 5,049 edges:

Once the network has been imported, go to the Data Laboratory to view and examine the data for the Nodes and Edges. Notice that each edge has a column called percent_agree, which is the number of times the members of each pair of Senators voted the same way, divided by the total number of votes in the chamber in 2014, giving a number between 0 and 1:

Click the Configuration button, and ensure that Visible graph only is checked. When we start filtering the data, this will ensure that the data tables show the filtered network, not the original.

Now return to the Overview, where we will use a layout algorithm to alter the appearance of the network. In the Layout panel at bottom left, choose the Fruchterman Reingold layout algorithm and click Run. (I know from prior experimentation that this algorithm gives a reasonable appearance for this network, but do experiment with different options if working on your own network graphs in future.) Note also that there are options to change the parameters of the algorithm, such as the “Gravity” with which connected nodes attract one another. We will simply accept the default options, but again you may want to experiment with different values for your own projects.

When the network settles down, click Stop to stabilize it. The network should look something like this:

This looks a little neater than the initial view, but is still a hairball that tells us little about the underlying dynamics of voting in the Senate. This is because almost all Senators voted the same way at least once, so each one is connected to almost all of the others.

So now we need to filter the network, so that edges are not drawn if Senators voted the same way less often. Select the Filters tab in the main panel at right, and select Attributes>Range, which gives the option to filter on percent_agree. Double-click on percent_agree, to see the following under Queries:

The range can be altered using the sliders, but we will instead double-click on the value for the bottom of the range, and manually edit it to 0.67:

This will draw edges between Senators only if they voted the same way in at least two-thirds of the votes in 2014. Hit Filter, and watch many of the edges disappear. Switch to the Data Laboratory view, and see how the Edges table has also changed. Now return to the Overview, Run the layout algorithm again, and the graph should change to look something like this:

Now the network is organized into two clusters, which are linked through only a single Senator. These are presumably Democrats and Republicans, which we can confirm by coloring the nodes by party in the Partition tab at top left:

To customize the colors, click on each square in the Partition tab, then Shift-Ctrl and click to call up the color selector:

Make Democrats blue (Hex: 0000FF), Republicans red (Hex: FF0000) and the two Independents orange (Hex: FFAA00).

Let’s also reconfigure the network so that the Democrats are on the left and the Republicans on the right. Run the layout algorithm again, and with it running, click and drag one of the outermost Democrats, until the network looks something like this:

Now is a good time to save the project, if you have not done so already.

Next we will calculate some metrics for our new, filtered network. If we are interested in highlighting the Senators who are most bipartisan, then Betweenness centrality is a good measure — remember that it highlights “bridging” nodes that prevent the network from breaking apart into isolated clusters.

Select the Statistics tab in the main right-hand panel, and then Edge Overview>Avg. Path Length>Run. Click OK at the next dialog box, close the Graph Distance Report, and go to the Data Laboratory view. Notice that new columns, including Betweenness Centrality, have appeared in the Nodes table:

Switch back to the Overview, and select the Ranking tab on the top left panel. Choose Betweenness Centrality as the ranking parameter for Nodes, select the gem-like icon, which controls the size of nodes, and select a minimum and maximum size for the range.

Click Apply, and the network should look something like this:

You may at this point want to switch on the labels, and note that Susan Collins, the Republican from Maine, was the standout bipartisan Senator in 2014.

Now switch to Preview, which is where the appearance of the network graph can be polished before exporting it as a vector graphic. Click Refresh to see the network graphic drawn with default options:

You can then customize using the panel on the left, clicking Refresh to review each change. Here I have removed the nodes’ borders, by setting their width to zero, changed the edges from the default curved to straight, and reduced edge thickness to 0.5:

Export the finished network as vector graphics and for online visualization

Export the network graph in SVG or PDF format using the button at bottom left, or by selecting File>Export>SVG/PDF/PNG file... from the top menu.

Now select File>Export>Graph file... and save as JSON (this option is available through the JSONExporter plugin we installed earlier). Make sure to select Graph>Visible only at the dialog box, so that only the filtered network is exported:

Now try to repeat the exercise with the 2013 data!

Introducing Sigma.js

Network graphs can be visualized using several JavaScript libraries including D3 (see here for an example). However, we will use the Sigma.js JavaScript library, which is specifically designed for the purpose, and can more easily handle large and complex networks.

Make your own Sigma.js interactive network

I have provided a basic Sigma.js template, prepared with the generous help of Alexis Jacomy, the author of Sigma.js. This is in the senate folder.

Save the JSON file you exported from Gephi in the data subfolder with the name senate.json, then open the file index.html. The resulting interactive should look like this. Notice that when you hover over each node, its label appears, and its direct connections remain highlighted, while the rest of the network is grayed out:

Open index.html in your preferred text editor, and you will see this code:

<!DOCTYPE html>
<html>

<head>

  <meta charset=utf-8 />
  <title>U.S. Senate network</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
  <script src="src/sigma.min.js"></script>
  <script src="src/sigma.parsers.json.min.js"></script>

  <style>
    body {margin:0; padding:0;}
    #sigma-container {position:absolute; top:0; bottom:0; width:100%;}
  </style>

</head>

<body>
  <div id="sigma-container"></div>

  <script>
  function init() {

    // Finds the connections of each node
    sigma.classes.graph.addMethod("neighbors", function(nodeId) {
      var k,
          neighbors = {},
          index = this.allNeighborsIndex[nodeId] || {};

      for (k in index)
        neighbors[k] = this.nodesIndex[k];

      return neighbors;
    });

   // Creates an instance of Sigma.js
    var sigInst = new sigma({
      renderers: [
        {
          container: document.getElementById("sigma-container"),
          type: "canvas"
        }
      ]
    });

    // Customizes its settings 
    sigInst.settings({
      // Drawing properties :
      defaultLabelColor: "#000",
      defaultLabelSize: 14,
      defaultLabelHoverColor: "#fff",
      labelThreshold: 11,
      defaultHoverLabelBGColor: "#888",
      defaultLabelBGColor: "#ddd",
      defaultEdgeType: "straight",

      // Graph properties :
      minNodeSize: 3,
      maxNodeSize: 10,
      minEdgeSize: 0.1,
      maxEdgeSize: 0.2,

      // Mouse properties :
      zoomMax: 20 
    });

    // Parses JSON file to fill the graph
    sigma.parsers.json(
      "data/senate.json",
      sigInst,
      function() {
        //  Little hack here:
        //  In the latest Sigma.js version we have to delete the edges' colors manually
        sigInst.graph.edges().forEach(function(e) {
          e.color = null;
        });

        // Also, to facilitate the update of node colors, store
        // their original color under the key originalColor:
        sigInst.graph.nodes().forEach(function(n) {
          n.originalColor = n.color;
        });

        sigInst.refresh();
      }
    );


     // When a node is hovered over, check each node to see if it is connected. If not, set its color to gray
     // Do the same for the edges

    var grayColor = "#ccc";
    sigInst.bind("overNode", function(e) {
      var nodeId = e.data.node.id,
          toKeep = sigInst.graph.neighbors(nodeId);
      toKeep[nodeId] = e.data.node;

      sigInst.graph.nodes().forEach(function(n) {
        if (toKeep[n.id])
          n.color = n.originalColor;
        else
          n.color = grayColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        if (e.source === nodeId || e.target === nodeId)
          e.color = null;
        else
          e.color = grayColor;
      });

    // Since the data has been modified, call the refresh method to make the colors update 
      sigInst.refresh();
    });

    // When a node is no longer being hovered over, return to original colors
    sigInst.bind("outNode", function(e) {
      sigInst.graph.nodes().forEach(function(n) {
        n.color = n.originalColor;
      });

      sigInst.graph.edges().forEach(function(e) {
        e.color = null;
      });

      sigInst.refresh();
    });
  }

  if (document.addEventListener)
    document.addEventListener("DOMContentLoaded", init, false);
  else
    window.onload = init;
  </script>

</body>

</html>

The code has been documented to explain what each part does. Notice that the head of the web page loads the main Sigma.js script, and a second script that reads, or “parses,” the JSON data. These are in the src subfolder.

To explore Sigma.js further, download or clone its Github repository and examine the code for the examples given.

Further reading/resources

Gephi tutorials

Sigma.js wiki

Assignment 02: Scatterplot in D3

As part of this course you will be given weekly challenges to help you get more familiar with D3. In this assignment you will take the data from Global Infectious Disease Ratings & Democratization Scores of Country to plot a scatterplot.

The dataset is called disease_democ.csv and can be found in this zip file.

You will model your scatterplot on this D3 example.

Don’t forget to run the Python server to load the CSV file into the HTML document:

python -m SimpleHTTPServer 8888 &

(That command is for Python 2; with Python 3, the equivalent is python -m http.server 8888 &.)

What is data?

Before we leap into creating visualizations, charts and maps, we’ll consider the nature of data, and some basic principles that will help you to investigate datasets to find and tell stories. This is not a course in statistics, but I will introduce a few fundamental statistical concepts, which hopefully will stand you in good stead as we work to visualize data over the next few weeks — and beyond.

We’re often told that there are “lies, damned lies, and statistics.” But data visualization and statistics provide a view of the world that we can’t otherwise obtain. They give us a framework to make sense of daunting and otherwise meaningless masses of information. The “lies” that data and graphics can tell arise when people misuse statistics and visualization methods, not when they are used correctly.

The best data journalists understand that statistics and graphics go hand-in-hand. Just as numbers can be made to lie, graphics may misinform if the designer is ignorant of or abuses basic statistical principles. You don’t have to be an expert statistician to make effective charts and maps, but understanding some basic principles will help you to tell a convincing and compelling story — enlightening rather than misleading your audience.

I hope you will get hooked on the power of a statistical way of thinking. As data artist Martin Wattenberg of Google has said: “Visualization is a gateway drug to statistics.”

The data we will use today

Download the data for this session from here. Unzip the folder and place it on your desktop. It contains the following files:

mlb_salaries_2014.csv –  Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.

disease_democ.csv – Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.

gdp_pc.csv – World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.

All of these files are in CSV format, which stands for comma-separated values. These are plain text files, in which fields in the data are separated by commas, and each record is on a separate row. CSV is a common format for storing and exchanging data, and can be read by most data analysis and visualization software. Values that are intended to be treated as text, rather than numbers, are often enclosed in quote marks.
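To see how CSV parsing works in practice, here is a minimal sketch using Python’s standard csv module; the column names and values are invented, and in real use you would open the file itself rather than an in-memory string:

```python
import csv
import io

# Simulate a small CSV file in memory; the fields and values are
# invented for illustration. Note the quoted text value.
text = 'country,democ_score,infect_rate\n"United States",37.6,26\nCanada,39.0,19\n'

# DictReader uses the header row to label each field in each record
reader = csv.DictReader(io.StringIO(text))
rows = list(reader)

print(rows[0]["country"])            # quoted text value, read as text
print(float(rows[1]["democ_score"]))  # numbers arrive as strings; convert them
```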

When you ask for data, requesting CSVs or other plain text files is a good idea, as just about all software that handles data can export data as text. The characters used to separate the variables, called ‘delimiters,’ may vary. A ‘.tsv’ extension, for instance, indicates that the variables are separated by tabs. More generally, text files have the extension ‘.txt’.

Types of data: categorical vs. continuous

Before analyzing a dataset, or attempting to draw a graphic, it’s important to consider what, exactly, you’re working with.

Statisticians often use the term “variable.” This simply means any measure or attribute describing a particular item, or “record,” in a dataset. For example, school students might gather data about themselves for a class project, recording their gender and eye color, and height and weight. There’s an important difference between gender and eye color, called “categorical” variables, and height and weight, termed “continuous.”

  • Categorical variables are descriptive labels given to individual records, assigning them to different groups. The simplest categorical data is dichotomous, meaning that there are just two possible groups — in an election, for instance, people either voted, or they did not. More commonly, there are multiple categories. When analyzing traffic accidents, for example, you might consider the day of the week on which each incident occurred, giving seven possible categories.
  • Continuous data is richer, consisting of numbers that can have a range of values on a sliding scale. When working with weather data, for instance, continuous variables might include temperature and amount of rainfall.

There’s a third type of data we often need to consider: date and time. Perhaps the most common task in data journalism is to consider how a variable or variables have changed over time.

Datasets will usually contain a mixture of categorical and continuous variables. Here, for example, is a small part of a spreadsheet containing data on salaries for Major League Baseball players at the opening of the 2014 season:


(Source: Peter Aldhous, data from Lahman Baseball Database data)

This is a typical data table layout, with the individual records — the players — forming the rows and the variables recorded for each player arranged in columns. Here it is easy to recognize the categorical variables of teamID and teamName because they are each entered as text. The numbers for salary, expressed in full or in millions of dollars (salary_mil), are continuous variables.

Don’t assume, however, that every number in a dataset represents a continuous variable. Text descriptions can make datasets unwieldy, so database managers often adopt simpler codes, which are often numbers, to store categorical data. You can see this in the following example, showing data on traffic accidents resulting in injury or death in Berkeley, downloaded from a database maintained by researchers on campus.

(Source: Peter Aldhous, from Transportation Injury Mapping System data)

Of the numbers seen here, only the YEAR, latitudes and longitudes (POINT_Y and POINT_X) and numbers of people KILLED or INJURED actually represent continuous variables. (Look carefully, and you will see that these numbers are justified right within each cell. The other numbers are justified left, like the text entries, because they were imported into the spreadsheet as text values.)

Like this example, many datasets are difficult to interpret without their supporting documentation. So each time you acquire a dataset, if necessary make sure you also obtain the “codebook” describing all of the variables/fields, and how they are coded. Here is the codebook (http://paldhous.github.io/ucb/2016/dataviz/data/SWITRS_codebook.pdf) for the traffic accident data.

What shape is your data?

Particularly when data shows a time series for a single variable, it is often provided in “wide” format, like this data on trends in international oil production by region:

(Source: Peter Aldhous, from U.S. Energy Information Administration data)

Here, all of the numbers represent the same variable, and there is a column for each year. This is good for people to read, but most software for data analysis and visualization does not play well with data in this format.

So if you receive “wide” data, you will usually need to convert it to “long” format, shown here:


(Source: Peter Aldhous, from U.S. Energy Information Administration)

Notice that now there is one column for each variable, which makes it easier for computers to understand.
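The wide-to-long conversion can be done in a spreadsheet, but it is easy to script. A pure-Python sketch, with invented production figures standing in for the real data:

```python
# Wide format: one row per region, one column per year (values invented)
wide = [
    {"region": "North America", "2000": 10.2, "2001": 10.5},
    {"region": "Middle East", "2000": 21.7, "2001": 22.1},
]

# Long format: one row per region-year combination,
# with a single column holding all the production values
long_rows = [
    {"region": row["region"], "year": year, "production": value}
    for row in wide
    for year, value in row.items()
    if year != "region"
]

print(long_rows[0])  # {'region': 'North America', 'year': '2000', 'production': 10.2}
```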

How to investigate data: the basic operations

There are many sophisticated statistical methods for crunching data, beyond the scope of this class. But the majority of a data journalist’s work involves the following simple operations:

Sort: Largest to smallest, oldest to newest, alphabetical etc.

Filter: Select a defined subset of the data.

Summarize/Aggregate: Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll group data into categories first, and then aggregate by group.

Join: Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.
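To make these four operations concrete, here is a Python sketch applying each one to a tiny, invented salary table:

```python
# Invented records: each dict is one player
players = [
    {"name": "A", "team": "SF", "salary": 2.0},
    {"name": "B", "team": "SF", "salary": 1.0},
    {"name": "C", "team": "LA", "salary": 3.0},
]

# Sort: largest salary to smallest
by_salary = sorted(players, key=lambda p: p["salary"], reverse=True)

# Filter: select a defined subset, here one team
giants = [p for p in players if p["team"] == "SF"]

# Summarize/aggregate: total salary by team
totals = {}
for p in players:
    totals[p["team"]] = totals.get(p["team"], 0) + p["salary"]

# Join: merge with another table on the common team field
team_names = {"SF": "Giants", "LA": "Dodgers"}
joined = [{**p, "team_name": team_names[p["team"]]} for p in players]

print(by_salary[0]["name"])  # C, the highest-paid player
```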

We’ll return to these basic operations with data repeatedly over the coming weeks as we manipulate and visualize data.

Working with categorical data

You might imagine that there is little that you can do with categorical data alone, but it can be powerful, and can also be used to create new continuous variables.

The most basic operation with categorical data is to aggregate it by counting the number of records that fall into each category. This gives a table of “frequencies.” Often these are divided by the total number of records, and then multiplied by 100 to show them as percentages of the total.
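Counting frequencies and converting them to percentages looks like this in code; the sample responses below are invented, not the Census data:

```javascript
// Count records per category, then convert counts to percentages.
// The responses are invented for illustration.
const responses = ["White", "Hispanic", "White", "Asian", "Hispanic", "White"];

const counts = {};
for (const r of responses) counts[r] = (counts[r] || 0) + 1;

const percentages = {};
for (const [category, n] of Object.entries(counts)) {
  percentages[category] = n / responses.length * 100;
}
```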

Here is an example, showing data on the racial and ethnic identities of residents of Alameda County, from the 2010 US Census:


(Source: American FactFinder, U.S. Census Bureau)

Creating frequency counts from categorical data creates a new continuous variable — what has changed is the level of analysis. In this example, the original data would consist of a huge table with a record for each person, noting their racial/ethnic identity as categorical variables; in creating the frequency table shown here, the level of analysis has shifted from the individual to the racial/ethnic group.

We can ask more interesting questions by considering two categorical variables together — as pioneering data journalist Philip Meyer showed when he collected and analyzed survey data to examine the causes of the 1967 Detroit Riot. In July of that year, one of the worst riots in U.S. history raged in the city for five days, following a police raid on an unlicensed after-hours bar. By the time calm was restored, 43 people were dead, 467 injured and more than 2,000 buildings were destroyed.

At the time, Detroit was regarded as being a leader in race relations, so local racial discrimination was not initially seen as one of the main underlying causes of what happened. One popular theory at the time was that the riots were led by black residents who had moved to Detroit from the rural South. Meyer demolished this idea by examining data on whether or not the people surveyed had rioted, and whether they were brought up in the South or the North. He combined these results into a “contingency table” or “cross-tab”:

|             | South | North | Total |
|-------------|-------|-------|-------|
| Rioters     | 19    | 51    | 70    |
| Non-rioters | 218   | 149   | 367   |
| Total       | 237   | 200   | 437   |

It certainly looks from these numbers as if Northerners were more likely to have participated in the riot. There’s a message here: sometimes a table of numbers is a perfectly acceptable way to communicate a simple story — we don’t always need fancy charts.

But Meyer’s team only interviewed a sample of people from the affected neighborhoods, not everyone who lived there. If they had taken another sample, might they have obtained different results? This is one example where some more sophisticated statistical analysis can help. For contingency tables, a method known as the chi-squared test asks the relevant question: if Southerners and Northerners were in fact equally likely to have rioted, what is the likelihood of obtaining a sample as biased as this by chance alone? In this case, the chi-squared test told Meyer that the probability was less than one in a thousand. So Meyer felt confident writing in the newspaper that Northerners were more likely to have rioted. His work won a Pulitzer Prize for the Detroit Free Press and shifted the focus of political debate about the riot to racial discrimination in policing and housing in Detroit.
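The chi-squared statistic for Meyer’s table can be computed directly from the observed counts. This is a minimal sketch; the critical value of 10.83 for p < 0.001 with one degree of freedom comes from standard statistical tables:

```javascript
// Chi-squared statistic for Meyer's 2x2 contingency table.
// Rows: rioters / non-rioters; columns: South / North.
const observed = [
  [19, 51],
  [218, 149]
];

const rowTotals = observed.map(row => row[0] + row[1]);  // [70, 367]
const colTotals = [observed[0][0] + observed[1][0],       // 237
                   observed[0][1] + observed[1][1]];      // 200
const n = rowTotals[0] + rowTotals[1];                    // 437

let chi2 = 0;
for (let i = 0; i < 2; i++) {
  for (let j = 0; j < 2; j++) {
    // Count expected in this cell if the null hypothesis were true
    const expected = rowTotals[i] * colTotals[j] / n;
    chi2 += (observed[i][j] - expected) ** 2 / expected;
  }
}
// chi2 is about 24.6, well above 10.83, the critical value for
// p < 0.001 with one degree of freedom
```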

Sampling and margins of error

Philip Meyer’s analysis of the Detroit riot raises a general issue: only sometimes is it possible to obtain and analyze all of the data.

There are only 30 teams in Major League Baseball, which at the start of the 2014 season had just under 750 players on their rosters. So compiling all of the data on their contracts and salaries is a manageable task.

But Meyer’s team couldn’t talk to all of the people in the riot-affected neighborhoods, and pollsters can’t ask every voter which candidate they intend to vote for in an upcoming election. Instead they take a sample. This is common in many forms of data analysis, not just opinion polling.

For a sample to be valid, it must obey a simple statistical rule: every member of the group to which you wish to generalize the results of your analysis must have an equal chance of being included.

Entire textbooks have been written on sampling methods. The simplest form is random sampling — such as when numbers are written on pieces of paper, put into a bag, shaken up, and then drawn out one by one. Opinion pollsters often generate their samples by randomly generating valid telephone numbers, and calling the households concerned.

But there are other methods, and the important thing is not that a sample was derived randomly, but that it is representative of the group from which it is drawn. In other words, sampling needs to avoid systematic bias that makes particular data points more or less likely to be included.

Be especially wary of using data from any sample that was not selected to be representative of a wider group. Media organizations frequently run informal online “polls” to engage their audience, but they tell us little about public opinion, as people who happened to visit a news website and cared enough to answer the questions posed may not be representative of the wider population.

To have a good chance of being representative, samples must also be sufficiently large. If you randomly sample ten people, for instance, chance effects mean that you may draw a sample that contains eight women and two men, or perhaps no men at all. Sample 1,000 people from the same population, however, and the proportions of men and women sampled won’t deviate so far from an even split.

This is why polls often give a “margin of error,” which is a measure of the uncertainty that arises from taking a relatively small sample. These margins of error are usually derived from a range of values that statisticians call the “95% confidence interval.” This means that if the same population were sampled repeatedly, the results would fall within this range of values 95 times out of 100.
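The standard formula behind those margins of error, for a sample proportion, multiplies the standard error by 1.96 (the number of standard deviations covering 95% of a normal distribution). A sketch, with an invented poll result:

```javascript
// 95% margin of error for a sample proportion, using the standard
// normal approximation: 1.96 * sqrt(p * (1 - p) / n).
function marginOfError(p, n) {
  return 1.96 * Math.sqrt(p * (1 - p) / n);
}

// An invented poll of 1,000 people finding 52% support has a margin of
// error of roughly +/- 3 percentage points
const moe = marginOfError(0.52, 1000) * 100;
```

Note how the formula shrinks as the sample size n grows, which is why larger samples give tighter margins of error.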

When dealing with polling and survey data, look for the margins of error. Be careful not to mislead your audience by making a big deal of differences that may just be due to sampling error.

Working with continuous data: consider the distribution

When handling continuous data, there are more possibilities for aggregation than simply counting: you can add the numbers to give a total, for example, or calculate an average.

But summarizing continuous data in a single value inevitably loses a lot of information held in variation within the data. Understanding this variation may be key to working out the story the data may tell, and deciding how to analyze and visualize it. So often the first thing a good data journalist does when examining a dataset is to chart the distribution of each continuous variable. You can think of this as the “shape” of the dataset, for each variable.

Many variables, such as human height and weight, follow a “normal” distribution. If you draw a graph plotting the range of values in the data along the horizontal axis (also known as the X axis), and the number of individual data points for each value on the vertical or Y axis, a normal distribution gives a bell-shaped curve:

(Source: edited from Wikimedia Commons)

This type of chart, showing the distribution as a smoothed line, is known as a “density plot.”

In this example, the X axis is labeled with multiples of a summary statistic called the “standard deviation.” This is a measure of the spread of the data: if you extend one standard deviation either side of the average, it will cover just over 68% of the data points; two standard deviations will cover just over 95%. In simple terms, the standard deviation is a single number that summarizes whether the curve is tall and thin, or short and fat.
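Computing a mean and standard deviation by hand is straightforward; here is a sketch using a small invented sample:

```javascript
// Mean and (population) standard deviation of a small invented sample.
const values = [2, 4, 4, 4, 5, 5, 7, 9];

const mean = values.reduce((sum, v) => sum + v, 0) / values.length;

// Standard deviation: square root of the mean squared deviation from the mean
const sd = Math.sqrt(
  values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length
);
```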

Normal distributions are so common that many statistical methods have been invented specifically to work with them. It is also possible to run tests to tell whether data deviates significantly from a normal distribution, to check whether it’s valid to use these methods.

Sometimes, however, it’s very clear just from looking at the shape of a dataset that it is not normally distributed. Here, for example, is the distribution of 2014 Major League Baseball salaries, drawn as columns in increments of $500,000. This type of chart is called a histogram:

(Source: Peter Aldhous, data from the Lahman Baseball Database)

This distribution is highly “skewed.” Almost half of the players were paid less than $1 million, while just a handful of players were paid more than $20 million; the highest-paid was pitcher Zack Greinke, paid $26 million by the Los Angeles Dodgers. Knowing this distribution may influence the story you would choose to tell from the data, the summary statistics you would choose to aggregate it, and the methods you might use to visualize it.
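The binning behind a histogram like this can be sketched in a few lines. The salary values below are invented, not the real 2014 data:

```javascript
// Bin salaries (in $ millions, invented values) into $500,000 increments.
const salaries = [0.5, 0.5, 0.7, 1.2, 1.4, 2.6, 10.0, 26.0];
const binWidth = 0.5;

const bins = {};
for (const s of salaries) {
  const lower = Math.floor(s / binWidth) * binWidth;  // lower edge of the bin
  bins[lower] = (bins[lower] || 0) + 1;
}
// bins maps each bin's lower edge to the count of players in it
```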

In class, we will plot the distribution of the 2014 baseball salary data using D3.

First we’ll need to shape our data in Google Drive. Create a new spreadsheet, select Import/Upload File, and navigate to the file mlb_salaries_2014.csv. The app should recognize that this is a CSV file, but if the preview of the data looks wrong, use the import options to correct things. Once the data has imported, examine the variables, create a new row, and label the categorical and continuous data.

Next we need to tell the app what goes on the X and Y axes, respectively. Right-click anywhere in the main panel and select Map x(required)>salary_mil. We are not going to plot another variable from the data on the Y axis; we just want a count of the players in each salary bin. So select Map y(required)>..count.. and click the Draw Plot button at bottom right.

You should see a blank grid, because we haven’t yet told the app what type of chart to draw. Right-click in the chart area, and select Add Layer>Univariate Geoms>histogram (univariate because we only have one variable, aggregated by a count). Click Draw plot and a chart should draw.

You will notice that the bins are wider than in the example above. Right-click on histogram in the Layers Panel at left, select binwidth>set, type 0.5 into the box and set value. Now hit Draw plot again and you should have something close to the chart above.

To save your plot, click Export PDF from the options at top left and click the hyperlink on the next page.

Beyond the “average”: mean, median, and mode

Most people know how to calculate an average: add everything up, and divide this sum by the total number of values. Statisticians call this summary the “mean,” and for normally distributed data, it sits right on the top of the bell curve.

The mean is just one example of what statisticians call a “measure of central tendency.” The most common alternative is the “median,” which is the number that sits in the middle, when all the values are arranged in order. (If you have an even number of values, and no single number occupies the middle position, it would be the average of the two middle values.)

Notice how leading media outlets, such as The Upshot at The New York Times, often use medians, rather than means, in graphics summarizing skewed distributions, such as incomes or house prices. Here is an example from April 2014:

(Source: The Upshot, The New York Times)

Statisticians also sometimes consider the “mode,” which is the value that appears most frequently in the dataset.

For a perfect normal distribution, the mean, median and mode are all the same number. But for a skewed dataset like the baseball salaries, they may be very different — and using the mean can paint a rather misleading picture.

Calculate mean, median and mode

Navigate in your browser to your Google Drive account, then click the NEW button at top left and select Google Sheets. Once the spreadsheet opens select File>Import… from the top menu in Google Sheets and select the Upload tab in the dialog box that appears:

Click ‘Select a file from your computer’, navigate to the file ‘mlb_salaries_2014.csv’ and click ‘Open’.

At the next dialog box click Import and the file should upload.

When the data has uploaded, drag the darker gray line at the bottom of the light gray cell at top left below row 1, so that the first row becomes a header.

Before:

After:

Select column H by clicking its gray header containing the letter, then from the top menu select Insert>Column right three times to insert three new columns into the spreadsheet, calling them mean, median, and mode.

In the first cell of the mean column enter the following formula, which calculates the mean (called average in a spreadsheet) of all of the values in column H, containing the salaries in $ millions for each player.

=average(H2:H747)

Or alternatively, to select all the values in column H without having to define their row numbers:

=average(H:H)

Now calculate the median salary:

=median(H:H)

And the mode:

=mode(H:H)

These spreadsheet formulas are, in programming terms, functions. They act on the data specified in the brackets. This will become a familiar concept as we work with code in subsequent weeks.
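For comparison, here are minimal JavaScript equivalents of those three spreadsheet functions, applied to an invented, skewed salary list:

```javascript
// JavaScript equivalents of the spreadsheet's average, median and mode.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even count: average the two middle values; odd count: take the middle one
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function mode(values) {
  const counts = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  // Return the value with the highest count
  return Number(Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b)));
}

// An invented, skewed "salary" list: one star inflates the mean
const salaries = [0.5, 0.5, 0.5, 1.5, 2.0, 4.0, 26.0];
```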

Across Major League Baseball at the start of the 2014 season, the mean salary was $3.99 million. But when summarizing a distribution in a single value, we usually want to give a “typical” number. Here the mean is inflated by the vast salaries paid to a handful of star players, and may be a bad choice. The median salary of $1.5 million gives a more realistic view of what a typical MLB player was paid.

The mode is less commonly used, but in this case also tells us something interesting: it was $500,000, a sum earned by 35 out of the 746 players. This was the minimum salary paid under 2014 MLB contracts, which explains why it turns up more frequently than any other number. A journalist who considered the median, mode and full range of the salary distribution may produce a richer story than one who failed to think beyond the “average.”

Choosing bins for your data

Often we don’t want to summarize a variable in a single number. But that doesn’t mean we have to show the entire distribution. Frequently data journalists divide the data into groups or “bins,” to reveal how those groups differ from one another. A good example is this interactive graphic on the unemployment rate for different groups of Americans, published by The New York Times in November 2009:

(Source: The New York Times)

In its base state, the graphic shows the overall jobless rate, and how this has changed over time. The buttons along the top allow you to filter the data to examine the rate for different groups. Most of the filtering is on categorical variables, but notice that the continuous variable of age is collapsed into a categorical variable dividing people into three groups: 15-24 years old, 25-44 years old, and 45 years or older.

To produce informative graphics that tell a clear story, data journalists often need to turn a continuous variable into a categorical variable by dividing it into bins. But how do you select the range of values for each bin?

There is no simple answer to this question, as it really depends on the story you are telling. In the jobless rate example, the bins divided the population into groups of young, mid-career and older workers, revealing how young workers in particular were bearing the brunt of the Great Recession.

When binning data, it is again a good idea to look at the distribution, and experiment with different possibilities. For example, the wealth of nations, measured in terms of gross domestic product (GDP) per capita in 2014, has a highly skewed distribution, rather like the baseball salaries. Here it is drawn in increments of $2,500:

(Source: Peter Aldhous, from World Bank data)

Straight away we can see that just a tiny handful of countries had a GDP per capita of more than $50,000, but there is a larger group with values above $40,000.

The maps below reveal how setting different ranges for the bins changes the story told by the data. For the first map, I set the lower value for the top bin at $40,000, and then gave the bins equal ranges:

(Source: Peter Aldhous, from World Bank data)

This might be useful for telling a story about how high per capita wealth is still concentrated into a small number of nations, but it does a fairly poor job of distinguishing between the per capita wealth of developing countries. And for poorer people, small differences in wealth make a big difference to living conditions.

So for the second map I set the boundaries so that roughly equal numbers of countries fell into each of the five bins. Now Japan, most of Western Europe and Russia join the wealthiest bin, middle-income countries like Brazil, China, and Mexico are grouped in another bin, and there are more fine-grained distinctions between the per capita wealth of different developing countries:

(Source: Peter Aldhous, from World Bank data)

Some visualization and mapping software gives you the option of putting equal numbers of records into each bin — usually called “quantiles” (the quartiles we encountered on the box plots are one example). Note that calculated quantiles won’t usually give you nice round numbers for the boundaries between bins. So you may want to adjust the values, as I did for the second map.

You may also want to examine histograms for obvious “valleys” in the data, which may be good places for the breaks between bins.

Calculate quantiles

You can also calculate the boundaries between quantiles for yourself in a spreadsheet. Go back to the Google Spreadsheet with the baseball salary data, and add two more columns: quantile and quantile value.

Next we will calculate the boundaries for bins dividing the data into five quantiles, with one-fifth (0.2 in decimal) of the values in each bin.

First enter the following values into the quantile column, to reflect the division into five quantiles:

=4/5
=3/5
=2/5
=1/5

Then enter this formula into the first cell of the quantile value column:

=percentile(H:H, L2)

Copy the formula down the top four rows, and the spreadsheet should look as follows:
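The spreadsheet’s percentile() function interpolates linearly between sorted values. Here is a sketch of the same calculation, assuming that linear-interpolation definition, applied to an invented dataset:

```javascript
// Linear-interpolation percentile, matching the spreadsheet's percentile().
function percentile(values, p) {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = p * (sorted.length - 1);  // fractional position in the sorted list
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  // Interpolate between the two neighboring values (weight is 0 at exact ranks)
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

// Boundaries dividing an invented dataset into five quantile bins
const data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11];
const boundaries = [0.2, 0.4, 0.6, 0.8].map(q => percentile(data, q));
```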

Rounding: avoid spurious precision

Often when you run calculations on numbers, you’ll obtain precise answers that can run to many decimal places. But think about the precision with which the original numbers were measured, and don’t quote numbers that are more precise than this. When rounding numbers to the appropriate level of precision, if the next digit is four or less, round down; if it’s six or more, round up. There are various schemes for rounding if the next digit is five, and there are no further digits to go on: I’d suggest rounding to an even number, which may be up or down, as this is the international standard in computing.

To round the mean value for the baseball salary data to two decimal places, edit the formula to the following:

=round(average(H:H),2)

This formula runs the round function on the result of the average function.
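Note that JavaScript’s built-in Math.round always rounds ties upward (Math.round(2.5) is 3), not to even. A sketch of round-half-to-even for non-negative numbers, which is an assumption-laden illustration rather than a library function:

```javascript
// Round-half-to-even ("banker's rounding") for non-negative numbers.
// Caveat: floating-point representation can make apparent ties like 1.005
// slightly less than .5, so exact tie handling only applies to clean ties.
function roundHalfEven(x, digits) {
  const factor = 10 ** digits;
  const scaled = x * factor;
  const rounded = Math.round(scaled);  // JS rounds an exact .5 upward
  const isTie = Math.abs(scaled - Math.trunc(scaled)) === 0.5;
  // On a tie, Math.round chose the upper neighbor; if that's odd, take the even one below
  return (isTie && rounded % 2 !== 0 ? rounded - 1 : rounded) / factor;
}
```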

Per what? Working with rates and percentages

Often it doesn’t make much sense to consider raw numbers. There are more murders in Oakland (population from 2010 U.S. Census: 390,724) than in Orinda (2010 population: 17,643). But that’s a fairly meaningless comparison, unless we level the playing field by correcting for the size of the two cities. As in the wealth of nations example above, much of the time data journalists need to work with rates: per capita, per thousand people, and so on.

In simple terms, a rate is one number divided by another number. The key word is “per.” Per capita means “per person,” so to calculate a per capita figure you must divide the total value by the population size. But remember that most people find very small numbers hard to grasp: 0.001 and 0.0001 look similarly small at a glance, even though the first is ten times as large as the second. So when calculating rates, per capita is often not a good choice. For rare events like murders, or deaths from a particular disease, you may need to consider the rate per 1000 people, per 10,000 people, or even per 100,000 people: simply divide the numbers as before, then multiply by the “per” figure.

In addition to leveling the playing field to allow meaningful comparisons, rates can also help bring large numbers, which are again hard for most people to grasp, into perspective: it means little to most people to be told that the annual GDP of the United States is almost $17 trillion, but knowing that GDP per person is just over $50,000 is easier to comprehend.

Percentages are just a special case of rates, meaning “per hundred.” So to calculate a percentage, you divide one number by another and then multiply by 100.

Doing simple math with rates and percentages

Often you will need to calculate percentage change. The formula for this is:

(new value - old value) / old value * 100

Sometimes you may need to compare two rates or percentages. For example, if 50 out of 150 black mortgage applicants in a given income bracket are denied a mortgage, and 300 out of 2,400 white applicants in the same income bracket are denied a mortgage, the percentage rates of denial for the two groups are:

Black:

50 / 150 * 100 = 33.3%

White:

300 / 2,400 * 100 = 12.5%

You can divide one percentage or rate by the other, but be careful how you describe the result:

33.3 / 12.5 = 2.664

You can say from this calculation that black applicants are about 2.7 times as likely to be denied loans as whites. But even though the Associated Press style guide doesn’t make the distinction, don’t say black applicants are about 2.7 times more likely to be denied loans. Strictly speaking, “more likely” refers to the following calculation:

(33.3 - 12.5) / 12.5 = 1.664
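The percentage-change formula and the denial-rate comparison above can be written out as code:

```javascript
// Percentage change, plus the mortgage-denial comparison from the text.
function percentChange(oldValue, newValue) {
  return (newValue - oldValue) / oldValue * 100;
}

const blackRate = 50 / 150 * 100;    // about 33.3%
const whiteRate = 300 / 2400 * 100;  // 12.5%

const timesAsLikely = blackRate / whiteRate;                  // about 2.7
const timesMoreLikely = (blackRate - whiteRate) / whiteRate;  // about 1.7
```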

Asking questions with data

As data journalists, we want to ask questions of data. When statisticians do this, they assign probabilities to the answers to specific questions. They might ask whether variables are related to one another: for instance, do wealthier people tend to live longer? Or they might ask whether different groups are different from one another: for example, do patients given an experimental drug get better more quickly than those given the standard treatment?

When asking these questions, the most common statistical approach may seem back to front. Rather than asking whether the answer they’re interested in is likely to be true, statisticians usually instead calculate probabilities that the observed results would be obtained if the “null hypothesis” is correct.

In Philip Meyer’s analysis of the Detroit riot, the null hypothesis was that Northerners and Southerners were equally likely to have rioted. In the examples given above, the null hypotheses are that there is no relationship between wealth and lifespan, and that the new drug is just as effective as the old treatment.

The resulting probabilities are often given as p values, which are shown as decimal numbers between 0 and 1. Philip Meyer’s chi-squared result would have been written as: p < 0.001

The decimal 0.001 is the same as the fraction 1/1000, and < is the mathematical symbol for “less than.” So this means there was less than a one in a thousand chance that the difference in riot participation between Northerners and Southerners was caused by a chance sampling effect.

This would be called a “significant” result. When statisticians use this word, they don’t necessarily mean that the result has real-world consequence. It just means that the result is unlikely to be due to chance. However, if you have framed your question carefully, like Meyer did, a statistically significant result may be very consequential indeed.

There is no fixed cut-off for judging a result to be statistically significant. But as a general rule, p < 0.05 is considered the minimum standard. This means you would expect to get a result like this by chance fewer than 5 times in 100. If Meyer had obtained a result only just meeting this standard, he might still have concluded that Northerners were more likely to riot, but would probably have been more cautious in how he worded his story.

When considering differences between groups, statisticians sometimes avoid p values, and instead give 95% confidence intervals, like the margins of error on opinion polls. Only if these don’t overlap would a statistician assume that the results for different groups are significantly different.

So when picking numbers from studies to use in your graphics, pay attention to p values and confidence intervals!

Relationships between variables: correlation and its pitfalls

Some of the most powerful stories that data can tell examine how one variable relates to another. This video from a BBC documentary made by Hans Rosling of the Gapminder Foundation, for example, explores the relationship between life expectancy in different countries and the nations’ wealth:


(Source: BBC/Gapminder)

Correlation refers to statistical methods that test the strength of the relationship between two variables recorded for each of the records in a dataset. Correlations can either be positive, which means that two variables tend to increase together; or negative, which means that as one variable increases in value, the other one tends to decrease.

Tests of correlation determine whether the recorded relationship between the two variables is likely to have arisen by chance — here the null hypothesis is that there is actually no relationship between the two.

Statisticians usually test for correlation because they suspect that variation in one variable causes variation in the other, but correlation cannot prove causation. For example, there is a statistically significant correlation between children’s shoe sizes and their reading test scores, but clearly having bigger feet doesn’t make a child a better reader. In reality, older children are likely both to have bigger feet and be better at reading — the causation lies elsewhere.

Here, the child’s age is a “lurking” variable. Lurking variables are a general problem in data analysis, not just in tests of correlation, and some can be hard even for experts to spot.

For example, by the early 1990s epidemiological studies suggested that women who took Hormone Replacement Therapy (HRT) after menopause were less likely to suffer from coronary heart disease. But some years later, when doctors ran clinical trials in which they gave women HRT to test this protective effect, it actually caused a statistically significant increase in heart disease. Going back to the original studies, researchers found that women who had HRT tended to be from higher socioeconomic groups, who had better diets and exercised more.

Data journalists should be very wary of falling into similar traps. While you may not be able to gather all of the necessary data and run statistical tests, take special care to think about possible lurking variables when drawing any chart that illustrates a correlation, or implies a relationship between two variables.

Scatter plots and trend lines

When testing the relationship between two variables, statisticians will usually draw a simple chart called a “scatter plot,” in which the records in a dataset are plotted as points according to their scores for each of the two variables.

Here is an example, illustrating a controversial theory claiming that the extent to which a country has developed a democratic political system is driven largely by the historical prevalence of infectious disease:

(Source: Peter Aldhous, data from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries)

As we have learned, correlation cannot prove causation. But correlations are usually run to explore relationships that are suspected to be causal. The convention when drawing scatter plots is to put the variable suspected to be the causal factor, called the “explanatory” variable, on the X axis, and the “response” variable on the Y.

When producing any chart based on the scatter plot format, it’s a good idea to follow this convention, because otherwise you are likely to confuse people who are used to viewing such graphs.

The example above also shows a straight line drawn through the points. This is known as a “trend line,” or the “line of best fit” for the data, and was calculated by a method called “linear regression.” It is a simple example of what statisticians call “fitting a model” to data.

Models are mathematical equations that allow statisticians to make predictions. The equation for this trend line is:

Y = -1.85*X + 104.45

Here X is the infectious disease prevalence score, Y is the democratization score, and 104.45 is the value at which the trend line would cross the vertical axis at X = 0. The slope of the line is -1.85, which means that when X increases by a single point, Y tends to decrease by 1.85 points. (For a trend line sloping upwards from left to right, the slope would be a positive number.)

The data used for this graph doesn’t include all of the world’s countries. But if you knew the infectious disease prevalence score for a missing nation, you could use the equation or the graph to predict its likely democratization score. To see how this works, multiply 30 by -1.85, then add 104.45. The answer is roughly 49, and you will get the same result if you draw a vertical line up from the horizontal axis at an infectious disease prevalence score of 30, and then draw a horizontal line from where this crosses the trend line to the vertical axis.
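The prediction step can be written directly from the model equation (the function name is ours, for illustration):

```javascript
// The fitted model from the scatter plot, used for prediction.
// x is the infectious disease prevalence score; the return value is the
// predicted democratization score.
function predictDemocScore(x) {
  return -1.85 * x + 104.45;
}
// For a prevalence score of 30 the model predicts roughly 49;
// at x = 0 it returns the intercept, 104.45
```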

The most frequently used statistical test for correlation determines how closely the points cluster around the linear trend line, and determines the statistical significance of this relationship, given the size of the sample.

In this example there is a significant negative correlation, but that doesn’t prove that low rates of infectious disease made some countries more democratic. Not only are there possible lurking variables, but cause-and-effect could also work the other way round: more democratic societies might place greater value on their citizens’ lives, and make more effort to prevent and treat infectious diseases.

We will make a version of this chart in class. Import the file disease_democ.csv into the web app as before, and map infect_rate to the X axis and democ_score to the Y.
Now right-click in the main chart area and select Add layer>Bivariate Geoms>point. Click Draw plot and the points should appear on the scatter plot.

In the Layers panel, right-click on point and select size>Set>4 to increase the size of the points. Click Draw plot again.

Now we will add the trend line. Right-click back in the chart area, select Add layer>Bivariate Geoms>smooth and Draw plot. This will draw a smoothed line that meanders through the points, and plot a measure of the uncertainty around this line known as the “standard error.”

We instead want a linear trend line, without the standard error. In the Layers panel, right-click on smooth and select method>Set>lm (lm stands for “linear model”); also select se>Set>FALSE, to remove the standard error plot. Draw plot and you should have something approximating the chart above. (The scales on the axes will be different, however.)

Beyond the straight line: non-linear relationships

Relationships between variables aren’t always best described by straight lines, as we can see by looking at the Gapminder Foundation’s Wealth & Health of Nations graphic, on which Hans Rosling’s “200 Countries” video is based. This is a bubble plot, a relative of the scatter plot in which the size of the bubbles depends upon a third variable — here each country’s population:

(Source: Gapminder)

Look carefully at the X axis: income per person doesn’t increase in even steps. Instead the graph is drawn so that the distance between $400 and $4,000 equals the distance between $4,000 and $40,000.

This is because the axis has been plotted on a “logarithmic” scale. It would increase in even steps if the numbers used were not the actual incomes per person, but their common logarithms (how many times 10 would have to be multiplied by itself to give that number).
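You can check the equal-distance property of a log scale directly: equal ratios map to equal distances.

```javascript
// On a log scale, each tenfold increase covers the same distance:
// 400 -> 4,000 and 4,000 -> 40,000 are each one log10 step apart.
const step1 = Math.log10(4000) - Math.log10(400);
const step2 = Math.log10(40000) - Math.log10(4000);
// both steps equal 1, since each interval is a tenfold increase
```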

Logarithmic scales are often used to make graphs plotting a wide range of values easier to read. If we turn off the logarithmic scale on the Gapminder infographic, the income per person values for the poorer countries bunch up, making it hard to see the differences between them:

(Source: Gapminder)

From this version of the graphic, we can see that a line of best fit through the data would be a curve that first rises sharply, and then levels out. This is called a “logarithmic curve,” which is described by another simple equation.

A logarithmic curve is just one example of a “non-linear” mathematical relationship that statisticians can use to fit a model to data.

Assignment

  • Calculate the values needed to group nations into five quantile bins, according to the 2014 GDP per capita data in the file gdp_pc.csv.
  • Create the infectious disease and democratization scatter plot in D3 so that the points are color-coded by a nation’s income group. Note, if your solution results in multiple trend lines, you are mapping color at the wrong point in building the chart!
  • Save the plot as a PDF file. If the points on the scatter plot do not render correctly, paste the url for the PDF into another browser; it should work in Google Chrome.
  • Subscribe to visualization blogs, follow visualization thought leaders on Twitter, and take other steps to track developments in data viz and data journalism.
  • Send me your calculated quantile values, your scatter plot, and your initial list of visualization blogs by the start of next week’s class.

Further reading:

Sarah Cohen: Numbers in the Newsroom: Using Math and Statistics in News

Philip Meyer: Precision Journalism: A Reporter’s Introduction to Social Science Methods