Geoff Ruddock

8 Big Ideas from Scott Page's “The Model Thinker”

Jan 10, 2020

I recently finished reading Scott E. Page’s wonderful book The Model Thinker. As a data scientist, I have a technical interest in models, particularly in the space of statistics and machine learning. As a general thinker, I am a big fan of Shane Parrish’s mental models concept, in which he champions developing an understanding of a wide breadth of models across disciplines to aid in general decision-making. A majority of the mental models on Farnam Street come from more of a psychology or behavioural economics background. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. He touches on 50+ models in the book, but here is a quick summary of a few big ideas which resonated with me.

What makes for a good model?

A good model is parsimonious

While describing different high-level types of models in the first chapter, the author references a joke I was not familiar with. The original joke1 pokes fun at physicists for making unrealistic simplifying assumptions in their models, such as that of a cow being a perfect sphere.

Milk production at a dairy farm was low, so the farmer wrote to the local university, asking for help from academia. A multidisciplinary team of professors was assembled, headed by a theoretical physicist, and two weeks of intensive on-site investigation took place.

The scholars then returned to the university, notebooks crammed with data, where the task of writing the report was left to the team leader. Shortly thereafter the physicist returned to the farm, saying to the farmer, “I have the solution, but it works only in the case of spherical cows in a vacuum”.

But as the author points out, sometimes these amusingly extreme simplifications actually yield surprisingly usable rough results.

The spherical cow is a favorite classroom example of the analogy approach: to make an estimate of the amount of leather in a cowhide, we assume a spherical cow. We do so because the integral tables in the back of calculus textbooks include tan(x) and cos(x) but not cow(x).

There is no model which is a perfect representation of reality. A model with perfect accuracy would be like a 1:1 scale map, which is clearly not practical to use. So when we select a model, we are implicitly selecting some factors to include and others to exclude. Effective models include the important factors—and are therefore accurate—while excluding the less important ones—and are therefore simple and hence useful to us.

A good model knows its purpose

In the second chapter, Why Model, the author categorizes seven overarching uses for models:

  1. Reason: to identify conditions and deduce logical implications
  2. Explain: to provide (testable) explanations for empirical phenomena
  3. Design: to choose features of institutions, policies, and rules
  4. Communicate: to relate knowledge and understandings
  5. Act: to guide policy changes and strategic actions
  6. Predict: to make numerical and categorical predictions of future and unknown phenomena
  7. Explore: to investigate possibilities and hypotheticals

It seems self-evident that models are used for a wide variety of purposes, but what is worth noting here is how the success criteria for each potential use-case could differ. This implies that anyone setting out to apply a model to solve a problem would be wise to carefully and honestly consider the core underlying purpose, in order to ensure success is actually achievable.

Interpretability vs. predictive power

There is a trope in data science that much of machine learning is merely glorified applied statistics2, but there is a genuine underlying tension between two paradigms: success as interpretability and success as predictive power.

Relevant xkcd comic: https://xkcd.com/1838/

Traditional statistics focuses on building models with explanatory power. A good model is not just accurate, but also interpretable and easy to incorporate into qualitative decision-making. Case in point: using the Python package statsmodels to fit a linear model gives you a full R-style summary of fit out of the box.

The more recent focus in pure ML arenas is around having good predictive power with less consideration given to our qualitative understanding of the inner workings of the models themselves. For example, take the scikit-learn approach to linear models, which does not even give you an easy way to visualize p-values out of the box.
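For contrast, the same toy regression in scikit-learn (again a sketch with invented data) gives back point estimates and nothing else:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same simulated relationship: y = 1 + 2x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # point estimates only: no p-values, no confidence intervals
```

The API is optimized for prediction pipelines, not for statistical inference.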

I’m not advocating one paradigm over the other, but it is important to honestly consider what success would look like for whatever project/decision/goal you are seeking to apply a model to solve. It’s easy to pay lip service to pure predictive power, but will you and your team feel comfortable with a powerful algorithm whose decisions you don’t understand?

Many models thinking

The third chapter is an appeal to adopt what the author calls “many models thinking”, for which he lays out a theorem I was not familiar with.

Condorcet Jury Theorem – Each of an odd number of people (models) classifies an unknown state of the world as either true or false. Each classifies independently from one another, and classifies correctly with a probability $p>\frac{1}{2}$.

Theorem: A majority vote classifies correctly with higher probability than any person (model), and as the number of people (models) becomes large, the accuracy of the majority vote approaches 100%.
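The theorem is easy to verify with a quick Monte Carlo simulation (my own sketch, not from the book), assuming independent voters who are each correct with probability $p = 0.6$:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.6            # each voter is independently correct with probability p > 1/2
trials = 10_000

def majority_accuracy(n_voters):
    """Fraction of trials in which a majority of independent voters is correct."""
    correct = rng.random((trials, n_voters)) < p
    return (correct.sum(axis=1) > n_voters / 2).mean()

for n in (1, 11, 101):
    print(n, majority_accuracy(n))  # accuracy rises toward 1 as voters are added
```

With 101 voters the majority vote is right well over 95% of the time, even though each individual voter is only right 60% of the time.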

Everything is a remix

This immediately brings to mind the idea of ensemble learning in ML. Just substitute “weak learners” for a single vote, and “strong learner” for majority vote in the above theorem. I was surprised to discover that the Condorcet Jury Theorem was expressed in 1785, over 230 years ago. It is humbling to observe instances where seemingly modern techniques3 are actually a remix of much older concepts from other fields. There are no truly new ideas.

The devil is in the details

The author points out that in reality we don’t see our prediction accuracy go to 100% as we increase the number of models or inputs into a majority vote. The reason is usually that one of the assumptions in the above theorem is violated:

  • Weak learners must each have some signal. If $p=\tfrac{1}{2}$, then we cannot improve predictions by averaging together pure noise.
  • The votes must be independent. If multiple votes are perfectly dependent, then they really only count for one vote. If they have some moderate level of correlation, then their absolute number is overstated.

In real-world collective decision-making, it is plausible that both of these assumptions are violated. Votes are certainly not independent, and it is conceivable that some voters have negative signal—their predictions are wrong more often than pure chance would allow.

Related concepts: wisdom of the crowd, prediction markets

Adaptive systems

Systems which are able to respond to feedback pose additional challenges for quantifying the accuracy of our models and of the predictions we generate from them.

The Lucas Critique states that changes in a policy or the environment likely produce behavioural responses by those affected. Models estimated with data on past human behaviours will therefore not be accurate. Models must take into account the fact that people respond to policy and environmental changes.

See also: why your KPIs suck

This brings to mind Goodhart’s Law, which tells us that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. This is a key challenge faced by anyone who has tried to design KPIs for an organization.

Power-law (long-tail) distributions

Besides the Normal distribution, Power-law distributions are one of the most important statistical distributions to understand. Whereas aggregates of independent things tend to follow the normal distribution via the central limit theorem, aggregates of dependent things—particularly when feedback loops are involved—follow power-law distributions.

Power-law distributions – In a power-law distribution, the probability of an event is inversely related to its size: the larger the event, the less likely it occurs.

$$ p(x) = C x^{-a} $$

They are sometimes difficult to grasp intuitively, though, which can cause problems when we attempt to use heuristics to gauge things like risk while subconsciously assuming a normal distribution.

Contemplating a power-law distribution of human heights reveals how much power-law distributions differ from normal distributions. If human heights were distributed by a power law similar to that of city populations, and if we calibrate the mean height at 5 feet 9 inches, then the United States would include one person the height of the Empire State Building, over 10,000 people taller than giraffes, and 180 million people less than 7 inches tall.
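A quick simulation (my own sketch, not from the book) makes the difference in tails concrete, comparing a normal sample to a Pareto-distributed one:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

normal = rng.normal(loc=100, scale=15, size=n)  # thin-tailed, height-like
power = rng.pareto(1.5, size=n) + 1             # heavy-tailed Pareto (minimum value 1)

# Largest observation relative to the median of each sample
print(normal.max() / np.median(normal))  # a small multiple of the typical value
print(power.max() / np.median(power))    # hundreds or thousands of times the typical value
```

In the normal sample the maximum is never far from the median; in the power-law sample a single extreme observation dwarfs everything else.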

Power-laws arise due to “Preferential Attachment”

The author presents a couple of potential causal mechanisms which explain how power-law distributions arise. The most compelling is the preferential attachment model, which states that entities grow at rates proportional to their current size. In other words: the rich get richer. He gives a compelling example about a music download experiment:

In the music lab experiments, college students could sample and download songs. In the first treatment, subjects did not know what songs others downloaded, and the distributions of downloads had a shorter tail—no song received more than two hundred downloads and only one song received fewer than thirty. In a second treatment, students knew what others downloaded. The tail of the distribution grew: one song received more than three hundred downloads. Perhaps more telling, over half received fewer than thirty. The tail became longer. Social influence increased inequality. This inequality is not a concern if social influence leads people to download better songs. However, correlations between downloads in the two treatments were not strong. If we interpret the number of downloads of a song in the first treatment as a proxy for the song’s quality, social influence did not result in people downloading better songs. The big winners were not random, but they were not the best.

As our world becomes more interconnected and feedback loops multiply, we should expect to see more long tails arise in situations where they have not arisen historically.
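A toy preferential-attachment simulation (my own construction; the song and download counts are arbitrary) shows how “rich get richer” dynamics concentrate downloads on a few winners:

```python
import numpy as np

rng = np.random.default_rng(0)
downloads = np.ones(100)  # 100 songs, each seeded with one download

# Each new download picks a song with probability proportional
# to its current download count (preferential attachment).
for _ in range(10_000):
    song = rng.choice(100, p=downloads / downloads.sum())
    downloads[song] += 1

downloads.sort()
print(downloads[-1], downloads.mean())  # top song far exceeds the average
```

Starting from perfectly equal songs, the feedback loop alone manufactures a long tail.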

See also: Black swan theory, the Matthew effect.

Concavity and convexity

Concave and convex functions are another math concept which is much more profound when considered from an economic perspective. I will admit to invoking [Jensen’s inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) in math proofs without truly reflecting on how it influences human decision-making.

Convexity implies risk-taking

Convex functions have an increasing slope: the function’s value increases by a larger amount as we increase a variable’s value. The number of possible pairs of people is a convex function of the group size. A group of three people includes three unique pairs. A group of four people includes six unique pairs, and a group of five includes ten unique pairs. Each increase in group size increases the number of pairs by a larger amount. Similarly, each time a chef adds a new spice to his repertoire, he increases the number of spice combinations by a larger amount.
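The pairs example can be checked directly: the number of unique pairs in a group of $n$ people is $\binom{n}{2}$, which grows by an increasing increment as the group expands:

```python
from math import comb

# Unique pairs as a function of group size: n choose 2
for n in range(2, 7):
    print(n, comb(n, 2))
# Each additional person adds (n - 1) new pairs, so the increment itself keeps growing.
```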

Concavity implies diversity

Concave functions with positive slopes exhibit diminishing returns: the added value of each extra thing diminishes as we have more of that thing. Our utility or value from almost all goods exhibits diminishing returns. The more leisure, money, ice cream, or even time spent with loved ones, the less we value having more of it. Evidence for this can be found in the fact that the more we consume of just about anything, including chocolate, the less we enjoy it and the less we are willing to pay for it.
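Log utility is the textbook example of a concave value function (my own illustration, not from the book): each extra unit of a good adds less value than the one before it.

```python
import math

# Marginal utility of one extra unit, under log utility, at increasing wealth levels
gains = [math.log(x + 1) - math.log(x) for x in (1, 10, 100)]
print(gains)  # the same +1 unit is worth less and less as wealth grows
```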

See also: Range by David Epstein, Specialization is for insects.

Markov models

Markov models describe sequential systems that follow the Markov property, which states that the probability of future states depends only on the current state, not the entire sequence of states that preceded it. Essentially: given the present, the past and future are conditionally independent. $$ P(\text{Tomorrow}|\text{Today, Yesterday, 2 days ago…, Day 1}) = P(\text{Tomorrow}|\text{Today}) $$ The Markov property sounds like a gross over-simplification of reality, but it can yield surprisingly useful results, because it allows our models to capture a compromise between “full independence” and “complete dependence”, which is often impractical to model at all.

It can be shown that recurrent Markov chains satisfying certain conditions are guaranteed to converge to a long-run stationary distribution, which reflects the long-run proportion of time spent in each state if the chain is run indefinitely.

Perron-Frobenius Theorem – A Markov process converges to a unique statistical equilibrium provided it satisfies four conditions:

  1. Finite set of states: $ S = \{ 1, 2, \ldots, K \} $
  2. Fixed transition rule
  3. Ergodicity (state accessibility): The system can get from any state to any other through a series of transitions.
  4. Non-cyclic: The system does not produce a deterministic cycle through a sequence of states.

The unique statistical equilibrium implies that long-run distributions of outcomes cannot depend on the initial state or on the path of events. In other words, initial conditions do not matter, and history does not matter. Nor can interventions that change the state matter. As time marches on, a process that satisfies the assumptions inexorably heads to its unique statistical equilibrium and then stays there.
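A small simulation illustrates the theorem (the two-state chain and its transition probabilities are hypothetical): regardless of the initial state, repeatedly applying a fixed transition rule converges to the same stationary distribution.

```python
import numpy as np

# Hypothetical two-state chain, e.g. employed / unemployed
P = np.array([[0.9, 0.1],   # employed:   P(stay employed), P(become unemployed)
              [0.5, 0.5]])  # unemployed: P(find a job),    P(stay unemployed)

# Start from two opposite initial distributions
for start in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    dist = start
    for _ in range(50):
        dist = dist @ P     # one step of the Markov process
    print(dist)             # both runs converge to the same equilibrium
```

Here the equilibrium is $(\frac{5}{6}, \frac{1}{6})$ whether everyone starts employed or unemployed; only changing the transition matrix itself would move it.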

Besides being the foundation of MCMC, this has interesting implications from a sociological perspective.

The takeaway from the theorem should not be that history cannot matter but that if history does matter, one of the model’s assumptions must be violated. Two assumptions—the finite number of states and no simple cycle—almost always hold. Ergodicity can be violated, as when allies go to war and cannot transition back to an alliance. Such examples notwithstanding, ergodicity generally holds as well. The forces that create social inequality have proven immune to policy interventions. In Markov models interventions that change families’ states—such as special programs for underperforming students or a one-day food drive—can provide temporary boosts. They cannot change the long-run equilibrium. In contrast, interventions that provide resources and training that improve people’s ability to keep jobs, and therefore change their probabilities of moving from employed to unemployed, could change long-run outcomes. At a minimum, the model gives us a terminology—the distinction between states and transition probabilities—along with a logic to see the value of changing structural forces rather than the current state.

This has a powerful implication for anyone attempting to alter the long-run state of a complex system. Rather than directly manipulating the states themselves, we should adopt second-order thinking and consider how we can modify the transition probabilities between states such that our desired end state arises naturally. If your goal is to declutter your messy bedroom, you can set aside a weekend to go full Marie Kondo on your wardrobe, but unless you implement systemic changes which influence the rate of accumulation of junk, you will find yourself back in the same state a year later.

Systems dynamics models

Systems dynamics models give us a vocabulary for describing the behaviour of complex systems:

  • Sources produce inputs into the system.
  • Sinks absorb outputs.
  • Stocks keep track of levels of variables.
  • Flows capture feedbacks between levels of stocks.

A great place to learn more about this approach is Donella H. Meadows’ book Thinking in Systems: A Primer.

Long-run stability of systems

Feedback loops imply that some systems are not stable in the long-run.

“The basic logic of feedbacks is straightforward: positive feedbacks reinforce actions, negative feedbacks dampen them. A system with only positive feedbacks will either blow up or collapse. A system with only negative feedbacks will either stabilize or cycle. A system with both positive feedbacks and negative feedbacks has the potential to produce complexity.”

Reasoning about effects

Feedback loops make it difficult to reason about the effect of small changes to long-run equilibrium.

“The direct effect of increasing the growth rate of hares is more hares. The indirect effect, more foxes, implies fewer hares. These two effects cancel out. Nonintuitive findings such as these are a hallmark of systems dynamics models. Our intuition fails because we latch onto direct effects and fail to think through the entire logical chain. Even if the direct effect of increasing (or decreasing) a rate or flow may be to increase (or decrease) a stock, the presence of systems effects in the form of positive and negative feedbacks means that other stocks will also change values, so the net effect of a change in a rate or flow may be reduced, canceled, or even reversed.”

Modelling human behaviour with adaptive rules

This book touches on game theory in a number of chapters. The most interesting section to me was a description of a problem with no dominant pure strategy, yet when individual actors adopt diverse adaptive rules, the system naturally reaches a collectively efficient outcome.

El Farol Bar problem – El Farol is a nightclub in Santa Fe, New Mexico that features dancing every Tuesday night. Each week, a population of 100 potential dancers decide whether to go dance at El Farol or stay home. All 100 people like to dance, but they do not want to go if the club is too crowded. Each person earns a payoff of zero from staying home, a payoff of 1 from attending if 60 or fewer people attend, and a payoff of -1 from attending when more than 60 people attend.

Simulations of this type of model find that if individuals possess a large ensemble of rules, then approximately 60 people attend each week: coordination emerges without any central planner. In other words, the system of adaptive rules self-organizes into nearly efficient outcomes.

There is a feedback cycle between micro-level and macro-level rules. The decision of whether or not to attend (micro) influences the level of over-attendance (macro), which in turn influences the individual decisions in the next time period.

If the rules people apply produce a crowded El Farol four weeks in a row, then rules that tell people to attend less often will produce higher payoffs. As people switch to those rules, fewer people will attend. The micro-level rules produce a macro-level phenomenon (over-attendance) that feeds back to the micro-level rules.
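The feedback loop can be sketched in a toy simulation (my own construction; the adaptive rule here is far simpler than the rule ensembles Page describes). Each person attends with a personal probability that is nudged down after crowded weeks and up after quiet ones:

```python
import numpy as np

rng = np.random.default_rng(7)
n_people, threshold, weeks = 100, 60, 200

p_attend = rng.uniform(0, 1, n_people)  # each person's propensity to attend
attendance = []
for _ in range(weeks):
    going = rng.random(n_people) < p_attend
    n = going.sum()
    attendance.append(n)
    if n > threshold:
        p_attend[going] *= 0.95  # attendees dial back after a crowded night
    else:
        # stay-at-homes nudge their propensity up after an uncrowded night
        p_attend[~going] = np.minimum(p_attend[~going] * 1.05 + 0.01, 1.0)

print(np.mean(attendance[-50:]))  # hovers near the 60-person threshold
```

No one coordinates, yet average attendance self-organizes toward the capacity of the bar, mirroring the simulation results the author reports.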


  1. https://en.wikipedia.org/wiki/Spherical_cow ↩︎

  2. https://statweb.stanford.edu/~tibs/stat315a/glossary.pdf ↩︎

  3. https://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf ↩︎