Making beautiful experiment visualizations with Matplotlib

Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my Python workflow.

It might take a couple seconds to visually parse this visualization at first glance. I don’t think that’s because it’s complicated per se, but rather because the viz itself contains so much information. After you are used to the format, it’s hard to think of a way to convey a higher density of decision-making-relevant information in such a small space. There are a few things that make this a particularly good visualization for the result of an experiment.

Why it is awesome

It frames many “tests” within the context of a single experiment

The terms experiment and test are often used interchangeably across product teams, no doubt in part due to the terminology around A/B testing. But in the context of a single experiment (in which we experiment by trying something new) we may perform a number of different statistical tests. While each individual test has its own confidence level, we must be careful to adjust our claims of confidence on the experiment level, else we fall victim to the multiple comparisons problem.
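To put a rough number on the problem, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests. It assumes the tests are independent, which real metrics rarely are, so treat it as an illustration rather than an exact figure; the function name is my own.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == '__main__':
    # Compare a single test against experiments with 4 and 12 tests
    for n_tests in (1, 4, 12):
        rate = family_wise_error_rate(alpha=0.05, n_tests=n_tests)
        print(f'{n_tests:>2} tests: {rate:.1%} chance of a false positive')
```

With twelve tests at α=0.05, the chance of at least one spurious "win" is already close to a coin flip under this assumption.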

Even if you don’t apply any sort of quantitative correction—to guarantee some global family-wise error rate (FWER) or false discovery rate (FDR)—having all the tests shown together adds useful context for the reader. Suppose you hear the following statement during a company all-hands:

We saw a significant increase in viewing hours for the Action genre in position four.

This statement agrees with the above example plot, but it isn’t particularly insightful. Should we prefer the Action genre for this position over other genres? Or is this the ideal position for that genre across all possible positions? Perhaps both? Small verbal descriptions of specific outcomes from experiments like this tend to get taken out of context. When this happens, their utility decreases, and their risk of being “misused” increases. Unfortunately I have observed that these sorts of “snippets” are frequently used as ammunition by some decision-makers to support their a priori preferred choice.

It emphasizes intervals over point estimates (and p-values)

The past few years have seen significant backlash (pun intended) against the use and misuse of p-values in academia. Today’s social scientists are all familiar with publication bias and the replication crisis. Yet when an A/B test is presented in a tech company boardroom, the first question is still often “Is this result significant?”

The Netflix visualization replaces the role of p-values with a visual depiction of a confidence interval, whose colour changes depending on whether or not it includes zero. Additionally, although point estimates are shown within each interval, they are visually de-emphasised within the overall context of the visualization. I’m guessing that Netflix removed x-axis labels to avoid sharing confidential data, but even with those included, the format nudges people toward statements such as “we expect somewhere between a 1-2% improvement” rather than “we expect a 1.27% improvement”. Reporting two decimals of precision when our confidence interval is 100x as wide as the last reported digit is superfluous and gives us a false sense of confidence in our results.
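The interval logic behind this kind of plot can be sketched in a few lines. This is a minimal example that assumes a normal approximation and uses the standard library's NormalDist instead of scipy; the function names and the numbers in the usage lines are made up for illustration.

```python
from statistics import NormalDist
from typing import Tuple


def confidence_interval(uplift: float, std_err: float, alpha: float = 0.05) -> Tuple[float, float]:
    """Two-sided normal-approximation interval around a point estimate."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return uplift - z * std_err, uplift + z * std_err


def classify(interval: Tuple[float, float]) -> str:
    """Label an interval by whether it sits above, below, or across zero."""
    low, high = interval
    if low > 0:
        return 'positive'
    if high < 0:
        return 'negative'
    return 'inconclusive'


def describe(interval: Tuple[float, float]) -> str:
    """Report the range at whole-percent precision instead of a point estimate."""
    low, high = interval
    return f'between {low:.0%} and {high:.0%}'


if __name__ == '__main__':
    interval = confidence_interval(uplift=0.015, std_err=0.003, alpha=0.05)
    print(classify(interval), describe(interval))
```

Rounding to whole percentages is a deliberate choice here: it keeps the reported precision in line with the width of the interval.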

The contextual info “stays together” in a single shareable image

All of the above properties of a good experiment visualization could also be fulfilled by a nicely designed Tableau dashboard. But what should you do after the experiment ends, and you want to share or save the result for later? Your company’s dashboards are always changing after all, so you can’t guarantee the data will be there a year from now if you want to reference it. So you take a screenshot.

Detailed dashboards are difficult to archive or share

Well this is unfortunate. In order to capture the key parts of the result, you’ve had to take a nearly fullscreen grab of the dashboard. You can throw this in a slide deck somewhere, but you can’t expect anyone to read it. And if they do, you can’t expect them to reach the same conclusion as you did. In contrast, Netflix’s visualization outputs a story. Better yet, it’s a story contained in a single copy-paste-able, shareable PNG file. This ensures that the nuance of your analysis does not get lost in transit as it is shared over Slack and email.

Rolling our own visualization function

Unfortunately I have not been able to surface any sort of open source libraries under the name “Netflix Vizkit”, so I decided to recreate my own version using Matplotlib. The function takes as input a pandas dataframe with either a single or multilevel index, and three columns: uplift, std_err, and alpha. If you are running a large number of tests, it would be prudent to first run your dataframe through your procedure of choice to correct for multiple comparisons. I’ll skip that for the purposes of this example.
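If you do want a correction, one simple option is Bonferroni: divide each test's alpha by the number of tests, which widens every interval. The sketch below assumes the dataframe layout described above (an alpha column, one row per test); bonferroni_adjust is a hypothetical helper, not part of the plotting code, and the result values are invented.

```python
import pandas as pd


def bonferroni_adjust(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose per-test alpha is divided by the number of tests.

    Bonferroni keeps the family-wise error rate at or below the original alpha,
    at the cost of wider intervals for every row.
    """
    adjusted = df.copy()
    adjusted['alpha'] = adjusted['alpha'] / len(adjusted)
    return adjusted


if __name__ == '__main__':
    # Made-up results for three tests, each originally run at alpha = 0.06
    results = pd.DataFrame(
        {'uplift': [0.012, -0.004, 0.020],
         'std_err': [0.004, 0.005, 0.006],
         'alpha': 0.06},
        index=['Open rate', 'Click rate', 'Purchase rate'])
    print(bonferroni_adjust(results))
```

Bonferroni is conservative; procedures such as Holm or Benjamini-Hochberg are less so, but they work on p-values rather than a per-test alpha, so they would need a different input than this dataframe carries.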

For this example, I’ve populated a dataframe with fake results corresponding to an email campaign in which we tested three variants and measured four different conversion rates for each. You could also pass in a dataframe with a single level of index; you’ll just get everything plotted on one axis instead of four separate axes.
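As a rough illustration of the expected input, here is a small hand-built dataframe with a two-level index and the three required columns, plus a single-level slice of it. The metric names, group names, and numbers are invented for the sketch.

```python
import pandas as pd

# Outer level becomes one axis per metric, inner level one bar per variant
index = pd.MultiIndex.from_product(
    [['Open rate', 'Click rate'], ['Group A', 'Group B']],
    names=['metric', 'dimension'])

two_level = pd.DataFrame(
    {'uplift': [0.010, -0.003, 0.004, 0.012],   # relative uplift, made-up values
     'std_err': [0.004, 0.005, 0.003, 0.004],   # standard error of each uplift
     'alpha': 0.10},                            # per-test significance level
    index=index)

# Selecting one metric drops the outer level, leaving a single-level index
single_level = two_level.xs('Open rate', level='metric')

if __name__ == '__main__':
    print(two_level)
    print(single_level)
```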

plot_experiment_results(
  df=example_data,
  title='Example email campaign (α=0.10)',
  sample_size=123456,
  combine_axes=False)

There are a couple additional parameters in there to add context to the plot, including a title and sample size context line. Remember, we want our output to stand by itself as a record of the outcome of the experiment! This function generates the plot below.

If you want to more closely match the Netflix plot, you can pass the parameter combine_axes=True to merge groups together into a single axis. I found this a bit less easy to visually parse, so I usually leave them separate.
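Since the whole point is a single shareable image, you will probably want to write the figure to disk. The plotting function returns nothing, so one option is to grab the current figure afterwards. This is a minimal sketch with a stand-in bar instead of real results; save_current_figure and the file name are my own, and the Agg backend is only selected so the script also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt


def save_current_figure(path: str, dpi: int = 200) -> str:
    """Save the most recently created figure as an image file and return its path."""
    fig = plt.gcf()
    # bbox_inches='tight' expands the saved area to include text placed outside the axes
    fig.savefig(path, dpi=dpi, bbox_inches='tight', facecolor='white')
    return path


# Stand-in figure; in practice, call the plotting function first
fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(0, 0.02, left=0.005)
out_path = save_current_figure(os.path.join(tempfile.mkdtemp(), 'experiment_result.png'))
print(out_path)
```

The tight bounding box matters here because the sample size and timestamp labels sit below the last axis and could otherwise be cropped.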

Full code for the example

	import datetime as dt
	from typing import Tuple
	import numpy as np
	import pandas as pd
	import matplotlib.pyplot as plt
	import matplotlib.ticker as plticker
	from scipy.stats import norm, uniform  # uniform generates the fake example data


	plt.rcParams['figure.facecolor'] = 'white'


	def get_colors(interval_start: float, interval_end: float) -> Tuple[str, str]:
	    """ Determine chart colors based on overlap of interval with zero. """
	    if interval_start > 0:
	        return 'darkseagreen', 'darkgreen'
	    elif interval_end < 0:
	        return 'darksalmon', 'darkred'
	    else:
	        return 'lightgray', 'gray'


	def plot_single_group(ax, sub_df: pd.DataFrame) -> Tuple[float, float]:
	    """ Plot each row of a DataFrame on the same mpl axis object. """

	    ytick_labels = []
	    x_min, x_max = 0, 0

	    # Iterate over each row in group, reversing order since mpl plots from bottom up
	    for j, (dim, row) in enumerate(sub_df.iloc[::-1].iterrows()):
	        if isinstance(dim, tuple):
	            dim = dim[1]

	        # Calculate z-score for each test based on test-specific correction factor
	        z = norm(0, 1).ppf(1 - row.alpha / 2)
	        interval_start = row.uplift - (z * row.std_err)
	        interval_end = row.uplift + (z * row.std_err)

	        # Conditional coloring based on significance of result
	        fill_color, edge_color = get_colors(interval_start, interval_end)

	        # Two half-width bars; the shared edge in the middle marks the point estimate
	        ax.barh(j, [z * row.std_err, z * row.std_err],
	                left=[interval_start, interval_start + z * row.std_err],
	                height=0.8,
	                color=fill_color,
	                edgecolor=edge_color,
	                linewidth=0.8,
	                zorder=3)

	        ytick_labels.append(dim)
	        x_min = min(x_min, interval_start - 0.01)
	        x_max = max(x_max, interval_end + 0.01)

	    # Axis-specific formatting

	    ax.xaxis.grid(True, alpha=0.4)
	    ax.xaxis.set_ticks_position('none')
	    ax.axvline(0.00, color='black', linewidth=1.1, zorder=2)
	    ax.yaxis.tick_right()
	    ax.set_yticks(np.arange(len(sub_df)))
	    ax.set_yticklabels(ytick_labels)
	    y_min, y_max = ax.get_ylim()
	    ax.set_ylim(y_min-0.4, y_max+0.4)
	    ax.yaxis.set_ticks_position('none')

	    return x_min, x_max


	def plot_experiment_results(df: pd.DataFrame, title: str = None, sample_size: int = None, combine_axes: bool = False) -> None:
	    """ Plot a (possibly MultiIndex) DataFrame on one or more matplotlib axes.

	    Args:
	        df (pd.DataFrame): DataFrame with a single or two-level index representing dimensions or KPIs, and following cols: uplift, std_err, alpha
	        title (str): Title displayed above plot
	        sample_size (int): Used to add contextual information to bottom corner of plot
	        combine_axes (bool): If true and input df has multiindex, collapse axes together into one visible axis.

	    """

	    plt.rcParams['figure.facecolor'] = 'white'

	    n_levels = len(df.index.names)
	    if n_levels > 2:
	        raise ValueError('df index must have at most two levels')
	    elif n_levels == 2:
	        plt_rows = df.index.get_level_values(0).nunique()
	    else:
	        plt_rows = 1

	    # Make an axis for each group of MultiIndex DataFrame input
	    fig, axes = plt.subplots(nrows=plt_rows,
	                             ncols=1,
	                             sharex=True,
	                             figsize=(6, 0.5 * df.shape[0] + 0.2), dpi=100)
	    axes = np.atleast_1d(axes)  # plt.subplots returns a bare Axes when nrows == 1

	    if n_levels == 1:
	        ax = axes[0]
	        x_min, x_max = plot_single_group(ax, df)

	    if n_levels == 2:
	        # Iterate over top-level groupings of index
	        x_mins, x_maxs = [], []
	        for i, (group, results) in enumerate(df.groupby(level=0, sort=False)):
	            ax = axes[i]
	            a, b = plot_single_group(ax, results)
	            x_mins.append(a)
	            x_maxs.append(b)
	            ax.set_ylabel(group)

	        x_min = min(x_mins)
	        x_max = max(x_maxs)
	        ax = axes[-1]  # set variable back to final axis for downstream formatting functions

	    # Merging only makes sense when there is more than one axis
	    if combine_axes and len(axes) > 1:
	        fig.subplots_adjust(hspace=0)
	        axes[0].spines['bottom'].set_visible(False)
	        axes[-1].spines['top'].set_visible(False)
	        for axis in axes[1:-1]:
	            axis.spines['bottom'].set_visible(False)
	            axis.spines['top'].set_visible(False)

	    ax.set_xlim(x_min, x_max)
	    x_tick_width = (1 + np.floor((x_max - x_min)/0.10)) / 100
	    loc = plticker.MultipleLocator(base=x_tick_width)  # this locator puts ticks at regular intervals
	    ax.xaxis.set_major_locator(loc)
	    # Use a formatter rather than fixed labels so the labels always track the tick positions
	    ax.xaxis.set_major_formatter(plticker.PercentFormatter(xmax=1, decimals=0))
	    ax.set_xlabel('Uplift (relative)')

	    # Add title, sample size, and timestamp labels to plot
	    if title:
	        fig.text(0.5, 0.95 - 0.025 * n_levels, title, size='x-large', horizontalalignment='center')

	    vertical_offset = - (0.1 + 0.2 * n_levels)
	    timestamp_text = dt.datetime.now().strftime('Analyzed: %Y-%m-%d')
	    fig.text(1, vertical_offset,
	             timestamp_text,
	             size='small', color='grey',
	             ha='right', wrap=True, transform=ax.transAxes)

	    if sample_size:
	        sample_size_text = f'Sample size: {int(sample_size/1000)}K'
	        fig.text(0, vertical_offset,
	                 sample_size_text,
	                 size='small', color='grey',
	                 ha='left', wrap=True, transform=ax.transAxes)


	if __name__ == '__main__':
	    np.random.seed(24)

	    # Fake results: three variants (groups) measured on four conversion rates
	    example_data = (pd.DataFrame({'uplift': (uniform().rvs(12) - 0.50) / 30,
	                                  'std_err': uniform(0, 0.01).rvs(12),
	                                  'dimension': ['Group A'] * 4 + ['Group B'] * 4 + ['Group C'] * 4,
	                                  'metric': ['Received rate', 'Open rate', 'Click rate', 'Purchase rate'] * 3,
	                                  'alpha': 0.10})  # set to match the α stated in the plot title
	                    .set_index(['metric', 'dimension'])
	                    .sort_index()
	                    .reindex(['Received rate', 'Open rate', 'Click rate', 'Purchase rate'], level=0))

	    plot_experiment_results(df=example_data,
	                            title='Example email campaign (α=0.10)',
	                            sample_size=123456,
	                            combine_axes=False)
