<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Geoff Ruddock</title><link>https://geoffruddock.com/</link><description>Recent content on Geoff Ruddock</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thursday, 27 Nov 2025 22:39:07 +0000</lastBuildDate><atom:link href="https://geoffruddock.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Convert clipboard HTML contents to Markdown with Alfred</title><link>https://geoffruddock.com/google-docs-to-markdown-with-alfred/</link><pubDate>Friday, 14 Jan 2022</pubDate><guid>https://geoffruddock.com/google-docs-to-markdown-with-alfred/</guid><description>&lt;h2 id="goal">Goal &lt;a class="anchor" href="#goal">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>All of my personal notes are written in markdown. I use &lt;a href="https://obsidian.md/" target="_blank">Obsidian&lt;/a> to manage them, but the specific tool is not relevant for the purposes of this post.&lt;/p>
&lt;p>When referencing things in Google Docs, I find myself generally &lt;em>linking&lt;/em> to the doc, rather than copy/pasting, because the default copy/paste output is poorly formatted, and is tedious to correct. This is sub-optimal, because then I cannot surface the contents via search in Obsidian, unless the text of the URL matches my keywords.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="goal">Goal <a class="anchor" href="#goal">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>All of my personal notes are written in markdown. I use <a href="https://obsidian.md/" target="_blank">Obsidian</a> to manage them, but the specific tool is not relevant for the purposes of this post.</p>
<p>When referencing things in Google Docs, I find myself generally <em>linking</em> to the doc, rather than copy/pasting, because the default copy/paste output is poorly formatted, and is tedious to correct. This is sub-optimal, because then I cannot surface the contents via search in Obsidian, unless the text of the URL matches my keywords.</p>
<p>So the goal here is to make it trivially easy to copy/paste from Google Docs into markdown. Hopefully that leads me to do it <em>more frequently</em>, which in turn yields more useful results when searching my vault.</p>
<blockquote>
<p>Easier to copy/paste → save more content → better search results → less time searching for things</p></blockquote>
<p><strong>Alternatives</strong>:</p>
<ul>
<li>If you&rsquo;re happy with copy/pasting into another window to convert, you may consider running a local instance of <a href="https://github.com/Mr0grog/google-docs-to-markdown" target="_blank">google-docs-to-markdown</a>, or—if you are not converting any sensitive data—even just using <a href="https://mr0grog.github.io/google-docs-to-markdown/" target="_blank">the demo web applet</a>.</li>
<li>If you have administrative access to your Google Workspace, you may consider installing the <a href="https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607" target="_blank">Docs to Markdown</a> add-on.</li>
</ul>
<h2 id="getting-html-contents-from-clipboard">Getting HTML contents from clipboard <a class="anchor" href="#getting-html-contents-from-clipboard">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>You may have wondered at some point: why does copied text appear differently depending on which app I paste it into? For example, text copied from a Google Doc will appear identical when pasted into another Google Doc, but will render as plain text when pasted into a barebones text editor.</p>
<p>It&rsquo;s worth understanding <a href="https://whynothugo.nl/journal/2022/10/21/how-the-clipboard-works/" target="_blank">how the clipboard works</a>. Most relevant for us:</p>
<ol>
<li>When the copy command is invoked, the active application can offer a variety of potential formats, including HTML, RTF, and plain text.</li>
<li>When the paste command is invoked, the active application chooses which format to receive.</li>
</ol>
<p>If you&rsquo;re on a Mac, you can use the free <a href="https://langui.net/clipboard-viewer/" target="_blank">Clipboard Viewer</a> application to inspect the formats offered by the application you are copying from.</p>
<p><img src="clipboard_viewer.png" alt="Using Clipboard Viewer to inspect text copied from a Google Doc"></p>
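<p>The same HTML flavor can also be read programmatically. Below is a minimal sketch, assuming macOS and its built-in <code>osascript</code> tool, which prints the HTML flavor of the clipboard as a hex-encoded <code>«data HTML…»</code> literal. The function names <code>decode_osascript_html</code> and <code>get_clipboard_html</code> are hypothetical helpers for illustration, and the exact output format is an assumption worth checking on your own machine.</p>

```python
import re
import subprocess


def decode_osascript_html(raw: str) -> str:
    """Decode the «data HTML<hex>» literal printed by osascript into text."""
    match = re.search(r'«data HTML([0-9A-Fa-f]*)»', raw)
    if match is None:
        raise ValueError('no HTML flavor found in osascript output')
    # The payload is the clipboard HTML as hex-encoded bytes (assumed UTF-8)
    return bytes.fromhex(match.group(1)).decode('utf-8')


def get_clipboard_html() -> str:
    """Ask macOS for the HTML flavor of the clipboard (macOS only)."""
    result = subprocess.run(
        ['osascript', '-e', 'the clipboard as «class HTML»'],
        capture_output=True, text=True, check=True,
    )
    return decode_osascript_html(result.stdout)
```

<p>If the clipboard holds no HTML flavor, <code>osascript</code> exits with an error and <code>check=True</code> surfaces it as a <code>CalledProcessError</code>, so a caller could fall back to plain text in that case.</p>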
<h2 id="convert-html-to-markdown">Convert HTML to Markdown <a class="anchor" href="#convert-html-to-markdown">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Now that we have the HTML contents of our clipboard, we need to convert it to markdown.</p>
<h3 id="example-html">Example HTML <a class="anchor" href="#example-html">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Let&rsquo;s start with the optimistic scenario in which we have <em>perfectly</em> structured HTML, notably:</p>
<ol>
<li>Text formatting is represented in <a href="https://web.dev/learn/html/semantic-html/" target="_blank">semantic elements</a> such as: <code>strong</code> (bold), <code>em</code> (italics), or <code>del</code> (strikethrough).</li>
<li>Nested lists are wrapped inside an <code>li</code> tag, not placed directly below the parent list.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">HTML</span><span class="p">,</span> <span class="n">display</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">example_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;h4&gt;Example text&lt;/h4&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;h5&gt;Text&lt;/h5&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;This text is &lt;strong&gt;bold&lt;/strong&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;This text is &lt;em&gt;italicized&lt;/em&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;This text is &lt;del&gt;strikethrough&lt;/del&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;Some &lt;code&gt;func = lambda x: print(x)&lt;/code&gt; inline code&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A link: &lt;a href=&#34;https://en.wikipedia.org/wiki/Main_Page&#34;&gt;Wikipedia&lt;/a&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;h5&gt;Nested lists&lt;/h5&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;ul &gt; ul&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ul&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;ul &gt; ol&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;B1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;&lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;B1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;B1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ol&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;B2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;ol &gt; ul&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ul&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;ol &gt; ol&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;&lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ol&gt;&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="n">example_html</span><span class="p">))</span>
</span></span></code></pre></div><h4>Example text</h4>
<h5>Text</h5>
<ul>
    <li>This text is <strong>bold</strong></li>
    <li>This text is <em>italicized</em></li>
    <li>This text is <del>strikethrough</del></li>
    <li>Some <code>func = lambda x: print(x)</code> inline code</li>
    <li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li>
</ul>
<h5>Nested lists</h5>
<p>ul > ul</p>
<ul>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ul>
<p>ul > ol</p>
<ul>
    <li>B1</li>
    <li><ol>
        <li>B1a</li>
        <li>B1b</li>
    </ol></li>
    <li>B2</li>
</ul>
<p>ol > ul</p>
<ol>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ol>
<p>ol > ol</p>
<ol>
    <li>A1</li>
    <li><ol>
        <li>A1a</li>
        <li>A1b</li>
    </ol></li>
    <li>A2</li>
</ol>
<h3 id="markdownify">Markdownify <a class="anchor" href="#markdownify">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>In this idealized scenario, the <a href="https://github.com/matthewwithanm/python-markdownify" target="_blank">markdownify</a> library converts our HTML reasonably well out-of-the-box.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">markdownify</span> <span class="kn">import</span> <span class="n">markdownify</span> <span class="k">as</span> <span class="n">md</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">md</span><span class="p">(</span><span class="n">example_html</span><span class="p">,</span> <span class="n">heading_style</span><span class="o">=</span><span class="s1">&#39;ATX&#39;</span><span class="p">,</span> <span class="n">bullets</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">))</span>
</span></span></code></pre></div><pre><code>#### Example text


##### Text


- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)


##### Nested lists


ul &gt; ul


- A1
- - A1a
	- A1b
- A2


ul &gt; ol


- B1
- 1. B1a
	2. B1b
- B2


ol &gt; ul


1. A1
2. - A1a
	- A1b
3. A2


ol &gt; ol


1. A1
2. 1. A1a
	2. A1b
3. A2
</code></pre>
<p>But there are a few issues:</p>
<ol>
<li>The first child of each nested list gets a double list marker instead of proper indentation</li>
<li>Ordered list numbering is off: the wrapper item around a nested list consumes a number, so the parent list jumps from 1 to 3</li>
<li>There are unnecessary double empty lines</li>
<li>Output uses tab characters (<code>\t</code>) instead of four spaces</li>
</ol>
<h3 id="fix-list-indentation">Fix list indentation <a class="anchor" href="#fix-list-indentation">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We can fix everything except the numbering with a few simple regular expressions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">strip</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">md_text</span> <span class="o">=</span> <span class="n">md</span><span class="p">(</span><span class="n">example_html</span><span class="p">,</span> <span class="n">heading_style</span><span class="o">=</span><span class="s1">&#39;ATX&#39;</span><span class="p">,</span> <span class="n">bullets</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="n">find_replace_pairs</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="s1">&#39;- - &#39;</span><span class="p">,</span>          <span class="s1">&#39;</span><span class="se">\t</span><span class="s1">- &#39;</span><span class="p">),</span>    <span class="c1"># fix indent on first child of ul &gt; ul</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;- (\d\.)&#39;</span><span class="p">,</span>     <span class="sa">r</span><span class="s1">&#39;\t\1&#39;</span><span class="p">),</span>   <span class="c1"># fix indent on first child of ul &gt; ol</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\d\. - &#39;</span><span class="p">,</span>      <span class="sa">r</span><span class="s1">&#39;\t- &#39;</span><span class="p">),</span>   <span class="c1"># fix indent on first child of ol &gt; ul</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\d\. (\d)\.&#39;</span><span class="p">,</span>  <span class="sa">r</span><span class="s1">&#39;\t\1.&#39;</span><span class="p">),</span>  <span class="c1"># fix indent on first child of ol &gt; ol</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="s1">&#39;</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">,</span>            <span class="s1">&#39;    &#39;</span><span class="p">),</span>    <span class="c1"># replace tabs with four spaces</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="s1">&#39;</span><span class="se">\n\n</span><span class="s1">&#39;</span><span class="p">,</span>          <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">),</span>      <span class="c1"># remove extra line breaks</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">find_replace_pairs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">md_text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">md_text</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">MULTILINE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">md_text</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul &gt; ul

- A1
    - A1a
    - A1b
- A2

ul &gt; ol

- B1
    1. B1a
    2. B1b
- B2

ol &gt; ul

1. A1
    - A1a
    - A1b
3. A2

ol &gt; ol

1. A1
    1. A1a
    2. A1b
3. A2
</code></pre>
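<p>One caveat: the patterns above are not anchored, so a sequence like <code>- - </code> in the middle of a sentence would also be rewritten. Here is a hedged variant that anchors each pattern to the start of a line and accepts multi-digit numbers. It assumes deeper nesting levels show up as leading tabs, as in the markdownify output above; <code>ANCHORED_PAIRS</code> and <code>fix_list_indentation</code> are hypothetical names for this sketch.</p>

```python
import re

# Each pattern is anchored with ^ (per line, via re.MULTILINE) and keeps any
# leading tabs, so only real list markers at the start of a line are touched.
ANCHORED_PAIRS = [
    (r'^(\t*)- - ',            r'\1\t- '),    # first child of ul > ul
    (r'^(\t*)- (\d+\.) ',      r'\1\t\2 '),   # first child of ul > ol
    (r'^(\t*)\d+\. - ',        r'\1\t- '),    # first child of ol > ul
    (r'^(\t*)\d+\. (\d+\.) ',  r'\1\t\2 '),   # first child of ol > ol
    (r'\t',                    '    '),       # replace tabs with four spaces
    (r'\n{3,}',                '\n\n'),       # collapse runs of blank lines
]


def fix_list_indentation(md_text: str) -> str:
    """Apply the anchored find/replace pairs in order."""
    for pattern, repl in ANCHORED_PAIRS:
        md_text = re.sub(pattern, repl, md_text, flags=re.MULTILINE)
    return md_text
```

<p>With this variant a line such as <code>See a - - b</code> passes through unchanged, while <code>- - A1a</code> at the start of a line is still indented.</p>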
<h3 id="re-number-ordered-lists">Re-number ordered lists <a class="anchor" href="#re-number-ordered-lists">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>To fix the numbering of ordered list items, we&rsquo;ll write a function:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">renumber_list</span><span class="p">(</span><span class="n">md_text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Replace ordered list markers with the correct number. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># Track how many previous items have appeared, at each level of indentation</span>
</span></span><span class="line"><span class="cl">    <span class="n">prev_items_at_level</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="n">md_text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">lines</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">        <span class="c1"># If line is a list item (either ordered or unordered) …</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="k">match</span> <span class="o">:=</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;(\s*)([-\d])&#39;</span><span class="p">,</span> <span class="n">line</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># Infer level based on number of leading spaces</span>
</span></span><span class="line"><span class="cl">            <span class="n">level</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="k">match</span><span class="o">.</span><span class="n">groups</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">            <span class="c1"># If line is an ordered list item …</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;(\s*)(\d)&#39;</span><span class="p">,</span> <span class="n">line</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">                <span class="c1"># Replace marker, update counter</span>
</span></span><span class="line"><span class="cl">                <span class="n">marker</span> <span class="o">=</span> <span class="n">prev_items_at_level</span><span class="p">[</span><span class="n">level</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">                <span class="n">lines</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^(\s*)\d+&#39;</span><span class="p">,</span> <span class="sa">rf</span><span class="s1">&#39;\g&lt;1&gt;</span><span class="si">{</span><span class="n">marker</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">prev_items_at_level</span><span class="p">[</span><span class="n">level</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">                
</span></span><span class="line"><span class="cl">            <span class="c1"># Reset counters for deeper levels</span>
</span></span><span class="line"><span class="cl">            <span class="n">prev_items_at_level</span><span class="p">[</span><span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">:]</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">prev_items_at_level</span><span class="p">)</span> <span class="o">-</span> <span class="n">level</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># If not a list item, reset all counters</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">prev_items_at_level</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">renumber_list</span><span class="p">(</span><span class="n">md_text</span><span class="p">))</span>
</span></span></code></pre></div><pre><code>#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul &gt; ul

- A1
    - A1a
    - A1b
- A2

ul &gt; ol

- B1
    1. B1a
    2. B1b
- B2

ol &gt; ul

1. A1
    - A1a
    - A1b
2. A2

ol &gt; ol

1. A1
    1. A1a
    2. A1b
2. A2
</code></pre>
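<p>One thing the function above does not guard against is a prose line that merely starts with a digit or a hyphen (say, a paragraph beginning with a year): it would be treated as a list item. A stricter variant, sketched here under the assumption that every list marker is followed by a space and that nesting uses four-space indents, matches the full marker (<code>-</code> or <code>N.</code>) instead. <code>ITEM</code> and <code>renumber_strict</code> are hypothetical names for this sketch.</p>

```python
import re

# A list item is optional indentation, then "-" or "<number>.", then a space
ITEM = re.compile(r'( *)(-|\d+\.) ')


def renumber_strict(md_text: str, indent: int = 4) -> str:
    """Renumber ordered list items, ignoring prose that starts with a digit."""
    counters = {}  # indentation level -> number of ordered items seen so far
    out = []
    for line in md_text.split('\n'):
        m = ITEM.match(line)
        if m is None:
            # Not a list item: any list in progress has ended
            counters.clear()
            out.append(line)
            continue
        level = len(m.group(1)) // indent
        # Returning to a shallower level ends any deeper nested lists
        for deeper in [k for k in counters if k > level]:
            del counters[deeper]
        if m.group(2) != '-':
            counters[level] = counters.get(level, 0) + 1
            line = f'{m.group(1)}{counters[level]}. {line[m.end():]}'
        out.append(line)
    return '\n'.join(out)
```

<p>Like the original, this resets all counters on any non-list line, including blank ones, so two ordered lists separated by a paragraph are numbered independently.</p>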
<h3 id="all-together-html2md">All together: html2md <a class="anchor" href="#all-together-html2md">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">unicodedata</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">html2md</span><span class="p">(</span><span class="n">html</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Convert HTML to markdown, then fix list indentation, spacing, and numbering. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">md_text</span> <span class="o">=</span> <span class="n">md</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="n">heading_style</span><span class="o">=</span><span class="s1">&#39;ATX&#39;</span><span class="p">,</span> <span class="n">bullets</span><span class="o">=</span><span class="s1">&#39;-&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">find_replace_pairs</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="s1">&#39;- - &#39;</span><span class="p">,</span>          <span class="s1">&#39;</span><span class="se">\t</span><span class="s1">- &#39;</span><span class="p">),</span>    <span class="c1"># fix indent on first child of ul &gt; ul</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;- (\d\.)&#39;</span><span class="p">,</span>     <span class="sa">r</span><span class="s1">&#39;\t\1&#39;</span><span class="p">),</span>   <span class="c1"># fix indent on first child of ul &gt; ol</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\d\. - &#39;</span><span class="p">,</span>      <span class="sa">r</span><span class="s1">&#39;\t- &#39;</span><span class="p">),</span>   <span class="c1"># fix indent on first child of ol &gt; ul</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\d\. (\d)\.&#39;</span><span class="p">,</span>  <span class="sa">r</span><span class="s1">&#39;\t\1.&#39;</span><span class="p">),</span>  <span class="c1"># fix indent on first child of ol &gt; ol</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="s1">&#39;</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">,</span>            <span class="s1">&#39;    &#39;</span><span class="p">),</span>    <span class="c1"># replace tabs with four spaces</span>
</span></span><span class="line"><span class="cl">        <span class="p">(</span><span class="s1">&#39;</span><span class="se">\n\n\n</span><span class="s1">&#39;</span><span class="p">,</span>          <span class="s1">&#39;</span><span class="se">\n\n</span><span class="s1">&#39;</span><span class="p">),</span>  <span class="c1"># remove extra line breaks</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># normalize non-breaking spaces (&amp;nbsp;) and strip the stray &#39;Â&#39; left behind by mis-decoded UTF-8</span>
</span></span><span class="line"><span class="cl">    <span class="n">md_text</span> <span class="o">=</span> <span class="n">md_text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="sa">u</span><span class="s1">&#39;</span><span class="se">\xa0</span><span class="s1">&#39;</span><span class="p">,</span> <span class="sa">u</span><span class="s1">&#39; &#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;Â&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">find_replace_pairs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">md_text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">md_text</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">MULTILINE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">renumber_list</span><span class="p">(</span><span class="n">md_text</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">example_html</span><span class="p">))</span>
</span></span></code></pre></div><pre><code>#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul &gt; ul

- A1
    - A1a
    - A1b
- A2

ul &gt; ol

- B1
    1. B1a
    2. B1b
- B2

ol &gt; ul

1. A1
    - A1a
    - A1b
2. A2

ol &gt; ol

1. A1
    1. A1a
    2. A1b
2. A2
</code></pre>
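<p>One caveat about the <code>remove extra line breaks</code> pair: <code>re.sub</code> makes a single left-to-right pass, so a literal three-newline pattern shortens a longer run of blank lines without fully collapsing it. A minimal sketch of a stricter variant, assuming a hypothetical helper named <code>collapse_blank_lines</code> (not part of the script above), could use a quantifier instead:</p>

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Collapse any run of three or more newlines into exactly two."""
    # Hypothetical helper for illustration only; the script above uses a
    # literal three-newline pattern inside find_replace_pairs.
    return re.sub(r'\n{3,}', '\n\n', text)


sample = 'a\n\n\n\n\nb'                          # five consecutive newlines
single_pass = re.sub('\n\n\n', '\n\n', sample)   # literal pattern, single pass
collapsed = collapse_blank_lines(sample)         # quantified pattern
```

<p>With five consecutive newlines, the literal pattern leaves four behind, while the quantified pattern leaves exactly two. For the example documents in this post the difference probably does not matter, since the output rarely contains such long runs.</p>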
<h2 id="clean-up-html-of-google-docs">Clean up HTML of Google Docs <a class="anchor" href="#clean-up-html-of-google-docs">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>When copying from a Google Doc, the HTML structure is not quite as pristine as above, so we&rsquo;ll need to perform some pre-processing steps on the HTML before converting it. To help with this, we&rsquo;ll import <a href="https://en.wikipedia.org/wiki/Beautiful_Soup_%28HTML_parser%29" target="_blank">Beautiful Soup</a>, a Python package for parsing HTML. Then we&rsquo;ll write a few transformation functions, loosely inspired by the <a href="https://github.com/Mr0grog/google-docs-to-markdown/blob/main/lib/fix-google-html.js" target="_blank">google-docs-to-markdown</a> JavaScript library.</p>
<h3 id="default-output">Default output <a class="anchor" href="#default-output">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The default output is a mess:</p>
<ul>
<li>All of the text formatting (besides the link) is missing.</li>
<li>The nested lists are not properly indented, and have some empty lines.</li>
<li>The entire chunk of text is formatted as bold (wrapped in <code>**</code>).</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">gdoc_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;meta charset=&#39;utf-8&#39;&gt;&lt;meta charset=&#34;utf-8&#34;&gt;&lt;b style=&#34;font-weight:normal;&#34; id=&#34;docs-internal-guid-e98e0af3-7fff-7df9-91ea-0ffad7c3607d&#34;&gt;&lt;h1 dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:20pt;margin-bottom:6pt;&#34;&gt;&lt;span style=&#34;font-size:20pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;Example text&lt;/span&gt;&lt;/h1&gt;&lt;h3 dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:16pt;margin-bottom:4pt;&#34;&gt;&lt;span style=&#34;font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;Text&lt;/span&gt;&lt;/h3&gt;&lt;ul style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;This text is &lt;/span&gt;&lt;span 
style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;bold&lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;This text is &lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;italicized&lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; 
aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;This text is &lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:line-through;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;strikethrough&lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;Some &lt;/span&gt;&lt;span style=&#34;font-size:11pt;font-family:&#39;Courier New&#39;;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;func = lambda x: print(x)&lt;/span&gt;&lt;span 
style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt; inline code.&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;A &lt;/span&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Main_Page&#34; style=&#34;text-decoration:none;&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;link&lt;/span&gt;&lt;/a&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt; with text.&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3 dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:16pt;margin-bottom:4pt;&#34;&gt;&lt;span 
style=&#34;font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;Nested lists&lt;/span&gt;&lt;/h3&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:5pt;&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;ul &amp;gt; ul&lt;/span&gt;&lt;/p&gt;&lt;ul style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;A1&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;ul style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;</span><span class="s2">&#34; role=&#34;presentation&#34;&gt;&lt;span 
style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;A1a&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;A1b&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:5pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;A2&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:5pt;&#34;&gt;&lt;span 
style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;ul &amp;gt; ol&lt;/span&gt;&lt;/p&gt;&lt;ul style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;B1&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;ol style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;B1a&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; 
style=&#34;list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;B1b&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:5pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;B2&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:5pt;&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;ol &amp;gt; ul&lt;/span&gt;&lt;/p&gt;&lt;ol style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; 
style=&#34;list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:11pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;C1&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;ul style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;C1a&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span 
style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;C1b&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:5pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;C2&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:12pt;margin-bottom:12pt;&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;ol &amp;gt; ol&lt;/span&gt;&lt;/p&gt;&lt;ol style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:12pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span 
style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;D1&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;ol style=&#34;margin-top:0;margin-bottom:0;padding-inline-start:48px;</span><span class="s2">&#34;&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;D1a&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;li dir=&#34;ltr&#34; style=&#34;list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;2&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:0pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;D1b&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;li dir=&#34;ltr&#34; 
style=&#34;list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;&#34; aria-level=&#34;1&#34;&gt;&lt;p dir=&#34;ltr&#34; style=&#34;line-height:1.38;margin-top:0pt;margin-bottom:12pt;&#34; role=&#34;presentation&#34;&gt;&lt;span style=&#34;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;&#34;&gt;D2&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">gdoc_html</span><span class="p">))</span>
</span></span></code></pre></div><pre><code>**# Example text

### Text

- This text is bold.
- This text is italicized.
- This text is strikethrough.
- Some func = lambda x: print(x) inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

ul &gt; ul

- A1
- A1a
- A1b

- A2

ul &gt; ol

- B1
1. B1a
2. B1b

- B2

ol &gt; ul

1. C1
- C1a
- C1b

1. C2

ol &gt; ol

1. D1
2. D1a
3. D1b

1. D2**
</code></pre>
<h3 id="inline-styles-to-semantic-tags">Inline styles to semantic tags <a class="anchor" href="#inline-styles-to-semantic-tags">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The text formatting is lost because Google Docs does not use proper semantic tags. Instead, it puts text inside a <code>span</code> and styles it with inline CSS.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">bold_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;span style=&#34;font-weight:400;&#34;&gt;This text is &lt;/span&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;span style=&#34;font-weight:700;&#34;&gt;bold&lt;/span&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;span style=&#34;font-weight:400;&#34;&gt;.&lt;/span&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span></code></pre></div><p>We can fix this by using regex to search for elements which contain the relevant styles, then wrapping those elements in the proper semantic tag.</p>
<p>While we&rsquo;re at it, we can also parse inline code. Although Google Docs does not natively support inline code, we can fake it by treating any text using the font <em>Courier New</em> (the most common monospace font) as intended to be code.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">parse_gdoc_inline_styles</span><span class="p">(</span><span class="n">soup</span><span class="p">:</span> <span class="n">BeautifulSoup</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; GDocs uses inline styles on spans instead of semantic HTML tags. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Don&#39;t convert inline styles inside headings</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">(</span><span class="s1">&#39;h1&#39;</span><span class="p">,</span> <span class="s1">&#39;h2&#39;</span><span class="p">,</span> <span class="s1">&#39;h3&#39;</span><span class="p">,</span> <span class="s1">&#39;h4&#39;</span><span class="p">,</span> <span class="s1">&#39;h5&#39;</span><span class="p">,</span> <span class="s1">&#39;h6&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">        <span class="n">style</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;style&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">style</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;font-weight:\s?700&#39;</span><span class="p">,</span> <span class="n">style</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="n">_</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;strong&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;font-style:\s?italic&#39;</span><span class="p">,</span> <span class="n">style</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="n">_</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;em&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;text-decoration:\s?line-through&#39;</span><span class="p">,</span> <span class="n">style</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="n">_</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;del&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;font-family:\s?&#39;Courier New&#39;&#34;</span><span class="p">,</span> <span class="n">style</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="n">_</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;code&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">bold_html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">parse_gdoc_inline_styles</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">strip</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">))))</span>
</span></span></code></pre></div><pre><code>This text is **bold**.
</code></pre>
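<p>The strikethrough and inline-code checks above are plain regex searches against the inline <code>style</code> attribute, so they can be tried out without Beautiful Soup. The sketch below is only an illustration: <code>STYLE_TO_TAG</code> and <code>tags_for_style</code> are hypothetical names, and the sample style string is made up rather than copied from a real Google Doc.</p>

```python
import re

# (pattern, tag) pairs mirroring the strikethrough and inline-code checks above.
# STYLE_TO_TAG and tags_for_style are hypothetical names used only for this sketch.
STYLE_TO_TAG = [
    (r'text-decoration:\s?line-through', 'del'),
    (r"font-family:\s?'Courier New'", 'code'),
]


def tags_for_style(style: str) -> list:
    """Return the tags an element with this inline style would be wrapped in."""
    return [tag for pattern, tag in STYLE_TO_TAG if re.search(pattern, style)]


# Made-up style string, loosely shaped like what a span might carry
sample_style = "font-size:11pt;font-family:'Courier New';text-decoration:line-through"
print(tags_for_style(sample_style))  # ['del', 'code']
```

<p>Keeping the patterns in one list makes it easy to see which styles are recognised, at the cost of being slightly less explicit than a chain of <code>if</code> statements.</p>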
<h3 id="wrap-naked-ul-elements">Wrap naked <code>ul</code> elements <a class="anchor" href="#wrap-naked-ul-elements">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The second issue is that Google Docs places nested lists (<code>ul</code>, <code>ol</code> elements) directly inside the parent list, rather than wrapping them in an <code>li</code> tag as our markdown converter expects.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">nested_list_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;ul &gt; ul&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;ul &gt; ol&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;B1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;B1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;B1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;B2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;ol &gt; ul&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;li&gt;ol &gt; ol&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A1&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1a&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">            &lt;li&gt;A1b&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">        &lt;li&gt;A2&lt;/li&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">    &lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;/ul&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># display(HTML(nested_list_html))</span>
</span></span></code></pre></div><p>Again, we can use Beautiful Soup to find these instances and manually enclose the child list in an <code>li</code> element.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">:</span> <span class="n">BeautifulSoup</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; GDocs does not wrap nested lists in an &lt;li&gt; tag. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;ul&#39;</span><span class="p">,</span> <span class="s1">&#39;ol&#39;</span><span class="p">]</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;ul&#39;</span><span class="p">,</span> <span class="s1">&#39;ol&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">            <span class="n">tag</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;li&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">nested_list_html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">html</span> <span class="o">=</span> <span class="n">strip</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">html</span><span class="p">))</span>
</span></span></code></pre></div><pre><code>- ul &gt; ul
    - A1
        - A1a
        - A1b
    - A2
- ul &gt; ol
    - B1
        1. B1a
        2. B1b
    - B2
- ol &gt; ul
    1. A1
        - A1a
        - A1b
    2. A2
- ol &gt; ol
    1. A1
        1. A1a
        2. A1b
    2. A2
</code></pre>
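<p>As a sanity check, the &ldquo;naked list&rdquo; condition can also be detected with nothing but the standard library. This is a rough sketch, not part of the workflow: <code>NakedListCounter</code> and <code>count_naked_lists</code> are hypothetical names, and the two HTML snippets are trimmed-down versions of the example above. It tracks a stack of open tags and counts every list whose direct parent is another list.</p>

```python
from html.parser import HTMLParser

LIST_TAGS = {'ul', 'ol'}
# Void elements never get a closing tag, so they are kept off the stack
VOID_TAGS = {'br', 'hr', 'img', 'meta', 'input', 'link'}


class NakedListCounter(HTMLParser):
    """Count lists whose direct parent is another list (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.naked = 0

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS and self.stack and self.stack[-1] in LIST_TAGS:
            self.naked += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def count_naked_lists(html: str) -> int:
    parser = NakedListCounter()
    parser.feed(html)
    parser.close()
    return parser.naked


before = '<ul><li>A1</li><ul><li>A1a</li></ul><li>A2</li></ul>'
after = '<ul><li>A1</li><li><ul><li>A1a</li></ul></li><li>A2</li></ul>'
print(count_naked_lists(before), count_naked_lists(after))  # 1 0
```

<p>Running this on the clipboard HTML before and after <code>wrap_naked_lists</code> should show the count dropping to zero.</p>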
<h3 id="strip-unnecessary-tags">Strip unnecessary tags <a class="anchor" href="#strip-unnecessary-tags">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Finally, we remove a number of unnecessary tags and attributes, including:</p>
<ol>
<li>The outer <code>b</code> element that Google Docs seems to wrap the entire document in.</li>
<li>Any unnecessary layers of wrapping, including <code>p</code> and <code>span</code>.</li>
<li>Nested <code>code</code> tags, which are not a Google Docs-specific issue, but which another common collaboration tool seems to produce.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">strip_unnecessary_tags</span><span class="p">(</span><span class="n">soup</span><span class="p">:</span> <span class="n">BeautifulSoup</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Remove various unnecessary tags, for easier debugging. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">tag</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">soup</span><span class="p">()):</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># gdocs seems to wrap the entire HTML contents in a &lt;b&gt; tag for some reason</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="mi">3</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;b&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">tag</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># gdocs also includes 1-2 meta tags at the top of the content</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;meta&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">tag</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># strip attributes, mostly just to make debugging easier</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">attribute</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;class&#39;</span><span class="p">,</span> <span class="s1">&#39;value&#39;</span><span class="p">,</span> <span class="s1">&#39;rel&#39;</span><span class="p">,</span> <span class="s1">&#39;target&#39;</span><span class="p">,</span> <span class="s1">&#39;dir&#39;</span><span class="p">,</span> <span class="s1">&#39;aria-level&#39;</span><span class="p">,</span> <span class="s1">&#39;role&#39;</span><span class="p">,</span> <span class="s1">&#39;id&#39;</span><span class="p">,</span> <span class="s1">&#39;style&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">            <span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="n">attribute</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># unwrap nested text formatting</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;strong&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">]</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;strong&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tag</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;em&#39;</span><span class="p">,</span> <span class="s1">&#39;i&#39;</span><span class="p">]</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;em&#39;</span><span class="p">,</span> <span class="s1">&#39;i&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tag</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>  
</span></span></code></pre></div><h3 id="combined-prep_gdoc_html">Combined: <code>prep_gdoc_html</code> <a class="anchor" href="#combined-prep_gdoc_html">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Putting everything together, our output looks much better!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">prep_gdoc_html</span><span class="p">(</span><span class="n">html</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; One function to combine all preprocessing steps. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">parse_gdoc_inline_styles</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">strip_unnecessary_tags</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">debug</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="kn">from</span> <span class="nn">bs4.formatter</span> <span class="kn">import</span> <span class="n">HTMLFormatter</span>
</span></span><span class="line"><span class="cl">        <span class="n">formatter</span> <span class="o">=</span> <span class="n">HTMLFormatter</span><span class="p">(</span><span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">formatter</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">final_html</span> <span class="o">=</span> <span class="n">prep_gdoc_html</span><span class="p">(</span><span class="n">gdoc_html</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">final_html</span><span class="p">))</span>
</span></span></code></pre></div><pre><code># Example text

### Text

- This text is **bold**.
- This text is *italicized*.
- This text is ~~strikethrough~~.
- Some `func = lambda x: print(x)` inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

**ul &gt; ul**

- A1
    - A1a
    - A1b
- A2

**ul &gt; ol**

- B1
    1. B1a
    2. B1b
- B2

**ol &gt; ul**

1. C1
    - C1a
    - C1b
2. C2

**ol &gt; ol**

1. D1
    1. D1a
    2. D1b
2. D2
</code></pre>
<h2 id="clean-up-html-of-quip-docs">Clean up HTML of Quip docs <a class="anchor" href="#clean-up-html-of-quip-docs">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://quip.com/" target="_blank">Quip</a> is another real-time collaboration tool whose contents you may find yourself wanting to copy into markdown.</p>
<p>A regular copy/paste from a Quip document generally looks better than one from a Google Doc. This is because Quip <em>attempts</em> to copy markdown into the plain text contents of the clipboard. It works well on lists, but fails to encode headings, italics, strikethrough, or inline code.</p>
<p><img src="quip_plain_text_contents@2x.png" alt="Quip plain text clipboard contents"></p>
<h3 id="default-quip-output">Default Quip output <a class="anchor" href="#default-quip-output">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A few things are broken:</p>
<ol>
<li>Document-level lines of text are not wrapped in block elements, so the output lacks the blank lines that should separate them from subsequent elements (headings, lists).</li>
<li>The inline code gets wrapped in double backticks, but it should be a single pair.</li>
</ol>
<p>Note: Quip does not support mixed (ol/ul) list nesting, so those examples are excluded from the test text.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">quip_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;meta charset=&#39;utf-8&#39;&gt;&lt;h1&gt;HTML to Markdown&lt;/h1&gt;&lt;h2&gt;Text&lt;/h2&gt;A single line&lt;br&gt;&lt;br&gt;A line split&lt;br&gt;with a line break&lt;br&gt;&lt;br&gt;Another single line&lt;br&gt;&lt;h2&gt;Text formatting&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;This text is &lt;b&gt;bold&lt;/b&gt;&lt;/li&gt;&lt;li&gt;This text is &lt;i&gt;italicized&lt;/i&gt;&lt;/li&gt;&lt;li&gt;This text is &lt;span style=&#34;text-decoration: line-through&#34;&gt;strikethrough&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Some &lt;code&gt;&lt;code&gt;func = lambda x: print(x)&lt;/code&gt;&lt;/code&gt; inline code&lt;/li&gt;&lt;li&gt;A link: &lt;a href=&#34;https://en.wikipedia.org/wiki/Main_Page&#34;&gt;Wikipedia&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;Nested list&lt;/h2&gt;&lt;b&gt;ul &amp;gt; ul&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;A1&lt;/li&gt;&lt;ul&gt;&lt;li&gt;A1a&lt;/li&gt;&lt;ul&gt;&lt;li&gt;A1a1&lt;/li&gt;&lt;li&gt;A1a2&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;A1b&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;A2&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;b&gt;ol &amp;gt; ol&lt;/b&gt;&lt;br&gt;&lt;ol&gt;&lt;li style=&#34;list-style-type:decimal&#34;&gt;D1&lt;/li&gt;&lt;ol&gt;&lt;li style=&#34;list-style-type:lower-alpha&#34;&gt;D1a&lt;/li&gt;&lt;ol&gt;&lt;li style=&#34;list-style-type:lower-roman&#34;&gt;D1a1&lt;/li&gt;&lt;li style=&#34;list-style-type:lower-roman&#34;&gt;D1a2&lt;/li&gt;&lt;/ol&gt;&lt;li style=&#34;list-style-type:lower-alpha&#34;&gt;D1b&lt;/li&gt;&lt;/ol&gt;&lt;li style=&#34;list-style-type:decimal&#34;&gt;D2&lt;/li&gt;&lt;/ol&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#print(html2md(quip_html))</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">prep_gdoc_html</span><span class="p">(</span><span class="n">quip_html</span><span class="p">)))</span>
</span></span></code></pre></div><pre><code># HTML to Markdown

## Text

A single line  
  
A line split  
with a line break  
  
Another single line  
## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul &gt; ul**  
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

  
**ol &gt; ol**  
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
</code></pre>
<h3 id="wrap-top-level-text-elements">Wrap top-level text elements <a class="anchor" href="#wrap-top-level-text-elements">
    <i class="fas fa-hashtag anchor-link"></i>
    
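</a></h3><p>The code below wraps each top-level line of text in a <code>p</code> element, and works reasonably well. It doesn&rsquo;t properly handle hard line breaks that should stay within a single paragraph (a single <code>br</code> is treated the same as a double one), but I can&rsquo;t think of an elegant solution to this off the top of my head.</p>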
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">wrap_top_level_text</span><span class="p">(</span><span class="n">soup</span><span class="p">:</span> <span class="n">BeautifulSoup</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
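</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Quip leaves top-level text unwrapped; wrap it in &lt;p&gt; tags and drop the separating &lt;br&gt; tags. &#34;&#34;&#34;</span>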
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;br&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">prev_tag</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span>
</span></span><span class="line"><span class="cl">        <span class="n">next_tag</span> <span class="o">=</span> <span class="n">tag</span><span class="o">.</span><span class="n">next_sibling</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Only proceed if br is at the root document level, and previous tag exists</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;[document]&#39;</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># print(f&#39;Prev: {tag.previous_sibling}&#39;)</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># print(f&#39;Next: {tag.next_sibling}&#39;)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span><span class="o">.</span><span class="n">name</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;b&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s1">&#39;p&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">                <span class="c1"># If next tag is also a br, remove that too</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">next_sibling</span> <span class="ow">and</span> <span class="n">tag</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;br&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="n">tag</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">                    
</span></span><span class="line"><span class="cl">                <span class="n">tag</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">                
</span></span><span class="line"><span class="cl">            <span class="c1"># Remove unnecessary trailing br after block-level elements</span>
</span></span><span class="line"><span class="cl">            <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">previous_sibling</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;ul&#39;</span><span class="p">,</span> <span class="s1">&#39;ol&#39;</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tag</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">                
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">debug</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">pprint</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl"><span class="n">preproc_html</span> <span class="o">=</span> <span class="n">prep_gdoc_html</span><span class="p">(</span><span class="n">quip_html</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">preproc_html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">wrap_top_level_text</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">html</span> <span class="o">=</span> <span class="n">strip</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">html</span><span class="p">))</span>
</span></span></code></pre></div><pre><code># HTML to Markdown

## Text

A single line

A line split

with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul &gt; ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol &gt; ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
</code></pre>
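<p>Note that &ldquo;A line split&rdquo; and &ldquo;with a line break&rdquo; end up as separate paragraphs in the output above. One possible way to keep them together is to treat two or more consecutive <code>br</code> tags as a paragraph boundary and leave a single <code>br</code> in place as a hard line break. The sketch below is a rough illustration only: <code>wrap_text_run</code> is a hypothetical helper, it uses plain regex rather than Beautiful Soup, and it assumes it is handed a flat run of top-level text with no block elements inside.</p>

```python
import re

BR = r'<br\s*/?>'


def wrap_text_run(fragment: str) -> str:
    """Wrap a flat run of text in <p> tags, keeping single <br> tags as hard breaks.

    Hypothetical helper: assumes the fragment contains no block-level elements.
    """
    # Two or more consecutive <br> tags mark a paragraph boundary
    paragraphs = re.split(rf'(?:{BR}\s*){{2,}}', fragment)
    wrapped = []
    for para in paragraphs:
        # A lone <br> at the very start or end of a paragraph carries no meaning
        para = re.sub(rf'^(?:\s*{BR})+|(?:{BR}\s*)+$', '', para).strip()
        if para:
            wrapped.append(f'<p>{para}</p>')
    return ''.join(wrapped)


# The top-level text run from the Quip example above
text_run = 'A single line<br><br>A line split<br>with a line break<br><br>Another single line<br>'
print(wrap_text_run(text_run))
# <p>A single line</p><p>A line split<br>with a line break</p><p>Another single line</p>
```

<p>The single <code>br</code> inside the second paragraph survives, so the markdown converter can turn it into a hard line break instead of a paragraph split. Matching is case-sensitive here, which is an assumption about how the clipboard HTML is written.</p>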
<h3 id="unwrap-nested-code-blocks">Unwrap nested code blocks <a class="anchor" href="#unwrap-nested-code-blocks">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">nested_code_html</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">&lt;p&gt;Some &lt;code&gt;&lt;code&gt;func = lambda x: print(x)&lt;/code&gt;&lt;/code&gt; inline code.&lt;/p&gt;
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">unwrap_nested_code_blocks</span><span class="p">(</span><span class="n">soup</span><span class="p">:</span> <span class="n">BeautifulSoup</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Quip wraps inline code twice, which confuses markdownify. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;code&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;code&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">tag</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">nested_code_html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">unwrap_nested_code_blocks</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Before: </span><span class="si">{</span><span class="n">html2md</span><span class="p">(</span><span class="n">nested_code_html</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;After: </span><span class="si">{</span><span class="n">html2md</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">))</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>Before: Some ``func = lambda x: print(x)`` inline code.
After: Some `func = lambda x: print(x)` inline code.
</code></pre>
<h3 id="combined-prep_quip_html">Combined: <code>prep_quip_html</code> <a class="anchor" href="#combined-prep_quip_html">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">prep_quip_html</span><span class="p">(</span><span class="n">html</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; One function to combine all preprocessing steps. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">parse_gdoc_inline_styles</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">wrap_top_level_text</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">unwrap_nested_code_blocks</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">strip_unnecessary_tags</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">debug</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="kn">from</span> <span class="nn">bs4.formatter</span> <span class="kn">import</span> <span class="n">HTMLFormatter</span>
</span></span><span class="line"><span class="cl">        <span class="n">formatter</span> <span class="o">=</span> <span class="n">HTMLFormatter</span><span class="p">(</span><span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">formatter</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">html2md</span><span class="p">(</span><span class="n">prep_quip_html</span><span class="p">(</span><span class="n">quip_html</span><span class="p">)))</span>
</span></span></code></pre></div><pre><code># HTML to Markdown

## Text

A single line

A line split

with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul &gt; ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol &gt; ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
</code></pre>
<h2 id="combined-prep_any_html">Combined: <code>prep_any_html</code> <a class="anchor" href="#combined-prep_any_html">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Now let&rsquo;s combine everything into a single generic pre-processing function, and double-check that the outputs match those of the app-specific functions that we checked above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">prep_any_html</span><span class="p">(</span><span class="n">html</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; One function to combine all preprocessing steps. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">parse_gdoc_inline_styles</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">wrap_naked_lists</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">wrap_top_level_text</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">unwrap_nested_code_blocks</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">strip_unnecessary_tags</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">debug</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="kn">from</span> <span class="nn">bs4.formatter</span> <span class="kn">import</span> <span class="n">HTMLFormatter</span>
</span></span><span class="line"><span class="cl">        <span class="n">formatter</span> <span class="o">=</span> <span class="n">HTMLFormatter</span><span class="p">(</span><span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">formatter</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">assert</span> <span class="n">prep_any_html</span><span class="p">(</span><span class="n">quip_html</span><span class="p">)</span> <span class="o">==</span> <span class="n">prep_quip_html</span><span class="p">(</span><span class="n">quip_html</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">assert</span> <span class="n">prep_any_html</span><span class="p">(</span><span class="n">gdoc_html</span><span class="p">)</span> <span class="o">==</span> <span class="n">prep_gdoc_html</span><span class="p">(</span><span class="n">gdoc_html</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="creating-an-alfred-workflow">Creating an Alfred workflow <a class="anchor" href="#creating-an-alfred-workflow">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The actual Alfred workflow is relatively straightforward.</p>
<p><strong>Invoke</strong></p>
<p>Invoke the workflow—either using Alfred&rsquo;s <a href="https://www.alfredapp.com/universal-actions/" target="_blank">universal actions</a> after highlighting the text, or using the <code>h2m</code> keyword trigger.</p>
<p><strong>Get clipboard contents as HTML</strong></p>
<p>The workflow first reads the clipboard contents as HTML by running this shell command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">osascript -e <span class="s1">&#39;the clipboard as «class HTML»&#39;</span> <span class="p">|</span> perl -ne <span class="s1">&#39;print chr foreach unpack(&#34;C*&#34;,pack(&#34;H*&#34;,substr($_,11,-3)))&#39;</span> <span class="p">|</span> cat
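</span></span></code></pre></div><p>AppleScript prints the clipboard contents as a hex string wrapped in <code>«data HTML…»</code>, so the <code>perl</code> command strips that wrapper (the <code>substr</code> call) and decodes the hex back into raw bytes.</p>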
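<p>For anyone who would rather not decode the <code>perl</code> one-liner, here is a rough Python equivalent of the decoding step. It is only a sketch: <code>decode_clipboard_html</code> is a hypothetical helper (it is not part of the workflow), and it assumes the <code>osascript</code> output has the form <code>«data HTML…»</code> with a hex payload that decodes to UTF-8 text.</p>

```python
import re


def decode_clipboard_html(raw: str) -> str:
    """Decode osascript's hex-wrapped clipboard output into an HTML string.

    Hypothetical helper: assumes `raw` looks like «data HTML3C703E...»
    and that the payload bytes are UTF-8 encoded.
    """
    match = re.search(r"«data HTML([0-9A-Fa-f]*)»", raw)
    if match is None:
        raise ValueError("input does not look like «data HTML…» output")
    # bytes.fromhex turns each pair of hex digits back into one byte
    return bytes.fromhex(match.group(1)).decode("utf-8")


# Hand-made sample: the hex below is the UTF-8 encoding of "<p>hi</p>"
sample = "«data HTML3C703E68693C2F703E»\n"
print(decode_clipboard_html(sample))
```

<p>Matching on the wrapper text rather than on fixed byte offsets is a design choice for readability; the shell version does the same job with <code>substr</code>.</p>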
<p><strong>Convert and copy</strong></p>
<p>Then it runs the clipboard contents through <code>html2markdown.py</code>—a script which contains all of the logic we built out earlier—and copies the output markdown back to the clipboard.</p>
<p><img src="alfred_workflow@2x.png" alt="Screenshot of Alfred workflow"></p>

      ]]></content:encoded></item><item><title>Scraping PNG icons for emoji with Python</title><link>https://geoffruddock.com/python-emoji-to-png/</link><pubDate>Sunday, 31 Oct 2021</pubDate><guid>https://geoffruddock.com/python-emoji-to-png/</guid><description>&lt;h2 id="motivation">Motivation &lt;a class="anchor" href="#motivation">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>I put together an emoji search Alfred workflow which uses &lt;a href="https://github.com/sindresorhus/alfy" target="_blank">alfy&lt;/a> to filter &lt;a href="https://github.com/github/gemoji/blob/master/db/emoji.json" target="_blank">this JSON file of emoji&lt;/a>.&lt;/p>
&lt;p>There are plenty of existing emoji Alfred workflows around, but I wanted one that allowed me to edit the aliases for individual emoji.&lt;/p>
&lt;p>The one missing piece was to have the workflow display &lt;em>the emoji itself&lt;/em> as the icon for each result. The Alfred &lt;a href="https://www.alfredapp.com/help/workflows/inputs/script-filter/json/" target="_blank">Script Filter JSON Format&lt;/a> includes an &lt;code>icon&lt;/code> field, but it expects the path to an actual icon file on disk.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="motivation">Motivation <a class="anchor" href="#motivation">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I put together an emoji search Alfred workflow which uses <a href="https://github.com/sindresorhus/alfy" target="_blank">alfy</a> to filter <a href="https://github.com/github/gemoji/blob/master/db/emoji.json" target="_blank">this JSON file of emoji</a>.</p>
<p>There are plenty of existing emoji Alfred workflows around, but I wanted one that allowed me to edit the aliases for individual emoji.</p>
<p>The one missing piece was to have the workflow display <em>the emoji itself</em> as the icon for each result. The Alfred <a href="https://www.alfredapp.com/help/workflows/inputs/script-filter/json/" target="_blank">Script Filter JSON Format</a> includes an <code>icon</code> field, but it expects the path to an actual icon file on disk.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/python-emoji-to-png/alfred_workflow_without_emoji_hu_8e7e7611de0b808.jpg 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/python-emoji-to-png/alfred_workflow_without_emoji.jpg"
                
    
            
                alt="This simply will not do!" width="500"/> <figcaption>
                <p>This simply will not do!</p>
            </figcaption>
    </figure>
<h2 id="read-list-of-emoji">Read list of emoji <a class="anchor" href="#read-list-of-emoji">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>First, let&rsquo;s read in the <a href="https://github.com/github/gemoji/blob/master/db/emoji.json" target="_blank"><code>emoji.json</code></a> file mentioned above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;emoji.json&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">emoji_json</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Number of emoji: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">emoji_json</span><span class="p">)</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>Number of emoji: 1812
</code></pre>
<p>Here are the first 100 emoji contained inside.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">emoji_json</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">100</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">],</span> <span class="n">end</span><span class="o">=</span><span class="s1">&#39; &#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>😀 😃 😄 😁 😆 😅 🤣 😂 🙂 🙃 😉 😊 😇 🥰 😍 🤩 😘 😗 ☺️ 😚 😙 🥲 😋 😛 😜 🤪 😝 🤑 🤗 🤭 🤫 🤔 🤐 🤨 😐 😑 😶 😶‍🌫️ 😏 😒 🙄 😬 😮‍💨 🤥 😌 😔 😪 🤤 😴 😷 🤒 🤕 🤢 🤮 🤧 🥵 🥶 🥴 😵 😵‍💫 🤯 🤠 🥳 🥸 😎 🤓 🧐 😕 😟 🙁 ☹️ 😮 😯 😲 😳 🥺 😦 😧 😨 😰 😥 😢 😭 😱 😖 😣 😞 😓 😩 😫 🥱 😤 😡 😠 🤬 😈 👿 💀 ☠️ 💩 
</code></pre>
<hr>
<h2 id="easy-mode-twitter-emoji">Easy mode: Twitter emoji <a class="anchor" href="#easy-mode-twitter-emoji">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I found the <a href="https://github.com/glasnt/emojificate" target="_blank">emojificate</a> library, which makes straightforward use of the <a href="https://twemoji.twitter.com/" target="_blank">Twemoji</a> CDN to fetch various sizes of Twitter-style emoji.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Image</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_png_url</span><span class="p">(</span><span class="n">char</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Pulled from: https://github.com/glasnt/emojificate/blob/latest/emojificate/filter.py&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">cdn_fmt</span> <span class="o">=</span> <span class="s2">&#34;https://twemoji.maxcdn.com/v/latest/72x72/</span><span class="si">{codepoint}</span><span class="s2">.png&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">codepoint</span><span class="p">(</span><span class="n">codes</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># See https://github.com/twitter/twemoji/issues/419#issuecomment-637360325</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="s2">&#34;200d&#34;</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">codes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="s2">&#34;-&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">codes</span> <span class="k">if</span> <span class="n">c</span> <span class="o">!=</span> <span class="s2">&#34;fe0f&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;-&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">cdn_fmt</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">codepoint</span><span class="o">=</span><span class="n">codepoint</span><span class="p">([</span><span class="s2">&#34;</span><span class="si">{cp:x}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">cp</span><span class="o">=</span><span class="nb">ord</span><span class="p">(</span><span class="n">c</span><span class="p">))</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">char</span><span class="p">]))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">url</span> <span class="o">=</span> <span class="n">get_png_url</span><span class="p">(</span><span class="s1">&#39;🐿️&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">Image</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</span></span></code></pre></div><p><img src="./index_7_0.png" alt="png"></p>
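<p>The filename rule inside <code>get_png_url</code> is easier to see in isolation. The sketch below restates just that rule without building a URL or touching the network; <code>twemoji_codepoints</code> is a hypothetical name used only for illustration.</p>

```python
def twemoji_codepoints(char: str) -> str:
    """Return the hyphen-joined hex codepoints used in a Twemoji filename.

    Restates the rule from get_png_url: drop the variation selector (fe0f)
    unless the sequence contains a zero-width joiner (200d).
    """
    codes = [f"{ord(c):x}" for c in char]
    if "200d" not in codes:
        codes = [c for c in codes if c != "fe0f"]
    return "-".join(codes)


# Chipmunk followed by a variation selector: the selector is dropped
print(twemoji_codepoints("\U0001F43F\uFE0F"))

# "Face in clouds" is a ZWJ sequence, so every codepoint is kept
print(twemoji_codepoints("\U0001F636\u200D\U0001F32B\uFE0F"))
```

<p>The first call prints <code>1f43f</code> and the second prints <code>1f636-200d-1f32b-fe0f</code>.</p>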
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">tqdm.notebook</span> <span class="kn">import</span> <span class="n">tqdm</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">download_png</span><span class="p">(</span><span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Download a specific png file to disk.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;twitter-icons/</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s1">.png&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">img_data</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl">        <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">img_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">emoji_json</span><span class="p">,</span> <span class="n">total</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">emoji_json</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">    <span class="n">fp</span> <span class="o">=</span> <span class="n">e</span><span class="p">[</span><span class="s1">&#39;description&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">,</span> <span class="s1">&#39;-&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">url</span> <span class="o">=</span> <span class="n">get_png_url</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">download_png</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">fp</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clear_output</span><span class="p">()</span>
</span></span></code></pre></div>
    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/python-emoji-to-png/alfred_workflow_with_twitter_emoji_hu_b3e6106b9790fc45.jpg 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/python-emoji-to-png/alfred_workflow_with_twitter_emoji.jpg"
                
    
            
                alt="That was easy!" width="500"/> <figcaption>
                <p>That was easy!</p>
            </figcaption>
    </figure>
<hr>
<h2 id="hard-mode-scraping-from-unicodeorg">Hard mode: scraping from unicode.org <a class="anchor" href="#hard-mode-scraping-from-unicodeorg">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The above works perfectly for Twitter emoji, but what if we want the <em>Apple</em> emoji?</p>
<p>Inspired by this Stack Overflow question—<a href="https://stackoverflow.com/questions/53721028/programmatically-get-a-png-for-a-unicode-emoji/53722098#53722098" target="_blank">Programmatically get a PNG for a unicode emoji</a>—we could also scrape icons from this page: <a href="https://unicode.org/emoji/charts/full-emoji-list.html" target="_blank">https://unicode.org/emoji/charts/full-emoji-list.html</a>.</p>
<h3 id="fetch-html">Fetch HTML <a class="anchor" href="#fetch-html">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">emoji_page_html</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;https://unicode.org/emoji/charts/full-emoji-list.html&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">text</span>
</span></span></code></pre></div><h3 id="strip-variation-selectors">Strip variation selectors <a class="anchor" href="#strip-variation-selectors">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>One small gotcha here, which would otherwise break our regex matches, is that some emoji are optionally followed by an invisible <a href="https://unicode-table.com/en/FE0F/" target="_blank">variation selector</a> character. This is meant to specify that the character should be rendered as an <em>emoji</em> rather than as a plain text glyph, but it also seems to be appended to many emoji which don&rsquo;t have an obvious text representation, such as the chipmunk 🐿️.</p>
<p>We&rsquo;ll strip these (trailing) characters from our <code>emoji.json</code> inputs, and write our regex to optionally match them, if present in the unicode table.</p>
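<p>To make the invisible character visible, here is a small standalone sketch that prints the codepoints of an emoji before and after stripping. <code>strip_trailing_selector</code> is a stand-in name that mirrors the regex used in this section, and the sample emoji are written as escapes so the selector cannot get lost in copy/paste.</p>

```python
import re


def strip_trailing_selector(emoji: str) -> str:
    """Drop one trailing variation selector (U+FE00 to U+FE0F), if present."""
    return re.sub('[\ufe00-\ufe0f]$', '', emoji)


def codepoints(s: str) -> list:
    """Hex codepoints of each character, for inspection."""
    return [f'{ord(c):x}' for c in s]


chipmunk = '\U0001F43F\uFE0F'  # chipmunk followed by variation selector-16
print(codepoints(chipmunk))
print(codepoints(strip_trailing_selector(chipmunk)))

# Only a trailing selector is removed: "heart on fire" carries its selector
# in the middle of the sequence, so the string comes back unchanged.
heart_on_fire = '\u2764\uFE0F\u200D\U0001F525'
print(strip_trailing_selector(heart_on_fire) == heart_on_fire)
```

<p>The chipmunk goes from two codepoints (<code>1f43f</code>, <code>fe0f</code>) to one, while the last line prints <code>True</code>.</p>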
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">strip_variation_selectors</span><span class="p">(</span><span class="n">emoji</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">u</span><span class="s1">&#39;[</span><span class="se">\ufe00</span><span class="s1">-</span><span class="se">\ufe0f</span><span class="s1">]$&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">emoji</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">emoji_df</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">emoji_json</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">emoji</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">strip_variation_selectors</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">name</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;description&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">,</span> <span class="s1">&#39;-&#39;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">        <span class="n">length</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;emoji&#39;</span><span class="p">,</span> <span class="s1">&#39;length&#39;</span><span class="p">]]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">emoji_df</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>1    1320
2     258
3     190
4      13
5      13
7      12
8       3
6       3
Name: length, dtype: int64
</code></pre>
<h3 id="extract-using-regex">Extract using regex <a class="anchor" href="#extract-using-regex">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">extract_emoji_from_html</span><span class="p">(</span><span class="n">emoji</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">version</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#html_search_string = r&#34;&lt;img alt=&#39;{}&#39; class=&#39;imga&#39; src=&#39;data:image\/png;base64,([^&#39;]+)&#39;&gt;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">html_search_string</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&#34;&lt;img alt=&#39;</span><span class="si">{}</span><span class="s2">(?:[\ufe00-\ufe0f])?&#39;(?: title=&#39;.+&#39;)? class=&#39;imga&#39; src=&#39;data:image\/png;base64,([^&#39;]+)&#39;&gt;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">matchlist</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">html_search_string</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">emoji</span><span class="p">),</span> <span class="n">emoji_page_html</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">matchlist</span><span class="p">[</span><span class="n">version</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">emoji_b64</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">df</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">emoji_df</span><span class="p">[[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;emoji&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">iterrows</span><span class="p">(),</span> <span class="n">total</span><span class="o">=</span><span class="n">emoji_df</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="p">,</span> <span class="n">emoji</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">emoji_b64</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">extract_emoji_from_html</span><span class="p">(</span><span class="n">emoji</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clear_output</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nb">len</span><span class="p">(</span><span class="n">emoji_b64</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>1811
</code></pre>
<h3 id="check-results">Check results <a class="anchor" href="#check-results">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">is_found</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;is_found&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span> <span class="n">index</span><span class="o">=</span><span class="n">emoji_b64</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">joined</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">emoji_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">),</span> <span class="n">is_found</span><span class="p">,</span> <span class="n">left_index</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">right_index</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#joined.groupby(&#39;length&#39;)[&#39;is_found&#39;].mean()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">joined</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;is_found&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>emoji</th>
      <th>length</th>
      <th>is_found</th>
    </tr>
    <tr>
      <th>name</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>keycap:-*</th>
      <td>*️⃣</td>
      <td>3</td>
      <td>0.0</td>
    </tr>
  </tbody>
</table>
</div>
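<p>A plausible explanation for the single miss, not confirmed in the original notebook: the keycap emoji begins with a literal <code>*</code>, which is a regex quantifier, so interpolating it into the pattern unescaped changes what the pattern means. The sketch below uses a simplified pattern and a made-up HTML snippet (both are assumptions for illustration) to show the difference that <code>re.escape</code> makes.</p>

```python
import re

# Simplified stand-in for the search pattern; the real page markup is assumed.
PATTERN = r"<img alt='{}' class='imga' src='data:image/png;base64,([^']+)'>"

# Keycap asterisk: '*' + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP
keycap = '*\ufe0f\u20e3'

# Hypothetical HTML fragment with placeholder base64 data
html = "<img alt='*\ufe0f\u20e3' class='imga' src='data:image/png;base64,AAAA'>"

# Unescaped: the '*' is read as a quantifier on the preceding quote
unescaped = re.findall(PATTERN.format(keycap), html)

# Escaped: the '*' is matched literally
escaped = re.findall(PATTERN.format(re.escape(keycap)), html)

print(unescaped)
print(escaped)
```

<p>If this is indeed the cause, passing <code>re.escape(emoji)</code> to <code>format</code> in the extraction function would likely recover the missing image.</p>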
<h3 id="write-to-files">Write to files <a class="anchor" href="#write-to-files">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">base64</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">img_data</span> <span class="ow">in</span> <span class="n">emoji_b64</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">b64</span> <span class="o">=</span> <span class="n">base64</span><span class="o">.</span><span class="n">b64decode</span><span class="p">(</span><span class="n">img_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;apple-icons/</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s1">.png&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">b64</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl"><span class="n">clear_output</span><span class="p">()</span>
</span></span></code></pre></div>
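<p>The loop above assumes that the <code>apple-icons</code> directory already exists and that every name is a valid file name. A slightly more defensive variant might look like the sketch below; the helper <code>write_png</code>, the signature check, and the slash replacement are my own assumptions rather than part of the original workflow.</p>

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first eight bytes of every PNG file

def write_png(name: str, img_data: str, out_dir: Path) -> Path:
    """Decode base64 image data and write it to out_dir/<name>.png."""
    raw = base64.b64decode(img_data)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError(f'{name!r} does not decode to PNG data')
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    safe_name = name.replace('/', '-')          # avoid accidental subdirectories
    path = out_dir / f'{safe_name}.png'
    path.write_bytes(raw)
    return path
```

<p>Calling <code>write_png(name, img_data, Path('apple-icons'))</code> inside the loop would replace the explicit <code>open</code> and <code>write</code> calls.</p>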
    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/python-emoji-to-png/alfred_workflow_with_apple_emoji_hu_52e8fa23d2992a11.jpg 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/python-emoji-to-png/alfred_workflow_with_apple_emoji.jpg"
                
    
            
                alt="Success!" width="500"/> <figcaption>
                <p>Success!</p>
            </figcaption>
    </figure>
<hr>
<h2 id="appendix-multi-character-emoji">Appendix: multi-character emoji <a class="anchor" href="#appendix-multi-character-emoji">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">emoji_df</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">emoji_json</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">name</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;description&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">,</span> <span class="s1">&#39;-&#39;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">        <span class="n">split</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;‍&#39;</span><span class="p">,</span> <span class="s1">&#39;ZWJ&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;️&#39;</span><span class="p">,</span> <span class="s1">&#39;VS&#39;</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)]),</span>
</span></span><span class="line"><span class="cl">        <span class="n">length</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;emoji&#39;</span><span class="p">,</span> <span class="s1">&#39;length&#39;</span><span class="p">,</span> <span class="s1">&#39;split&#39;</span><span class="p">]]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">emoji_df</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</span></span></code></pre></div><p>What is going on here?!</p>
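<p>The short answer is that Python&rsquo;s <code>len()</code> counts Unicode code points rather than rendered glyphs, so one visible emoji can have a length greater than one. The sketch below is not from the original notebook: the helper <code>describe_code_points</code> and the three sample emoji are illustrative assumptions, written as escape sequences so that the invisible characters show up in the source.</p>

```python
import unicodedata

def describe_code_points(text: str) -> list:
    """Return the official Unicode name of each code point in `text`."""
    return [unicodedata.name(ch, 'UNKNOWN') for ch in text]

# Hypothetical samples covering the three cases discussed in this appendix
samples = {
    'red heart': '\u2764\ufe0f',                       # base + variation selector
    'flag: Canada': '\U0001F1E8\U0001F1E6',            # two regional indicators
    'man technologist': '\U0001F468\u200d\U0001F4BB',  # emoji + ZWJ + emoji
}

for label, emoji in samples.items():
    print(label, len(emoji), describe_code_points(emoji))
```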
<h3 id="length-two">Length-two <a class="anchor" href="#length-two">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Most (but not all) of these emoji are unchanged by stripping the trailing variation selector character.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">emoji_df</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">&#39;split&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">==</span> <span class="s1">&#39;VS&#39;</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">stripped</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;️&#39;</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>Besides trailing variation selectors, some length-two emoji are <a href="https://emojipedia.org/emoji-flag-sequence/" target="_blank">emoji flag sequences</a>, which are made up of two &ldquo;regional indicator&rdquo; characters.</p>
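<p>As a rough illustration of how a flag sequence encodes a country, each regional indicator maps to one letter of a two-letter country code. The helper name <code>flag_to_country_code</code> is hypothetical; the only fact relied on is that the regional indicator block starts at U+1F1E6 for the letter A.</p>

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag_to_country_code(flag: str) -> str:
    """Map each regional indicator in a flag sequence to its ASCII letter."""
    letters = []
    for ch in flag:
        offset = ord(ch) - REGIONAL_INDICATOR_A
        if not 0 <= offset < 26:
            raise ValueError(f'{ch!r} is not a regional indicator')
        letters.append(chr(ord('A') + offset))
    return ''.join(letters)

print(flag_to_country_code('\U0001F1E8\U0001F1E6'))  # prints CA
```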
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">emoji_df</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">2</span><span class="p">,</span> <span class="p">:]</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">y</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span> <span class="o">!=</span> <span class="s1">&#39;️&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><h3 id="length-three">Length-three <a class="anchor" href="#length-three">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Most length-three emoji are created <a href="https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/" target="_blank">by joining multiple emoji together</a> using a <a href="https://en.wikipedia.org/wiki/Zero-width_joiner" target="_blank">zero-width joiner</a> character.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">emoji_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">3</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="length-four">Length-four <a class="anchor" href="#length-four">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Emoji of length four seem to be composites of two other emoji, a ZWJ, and a seemingly unnecessary variation selector.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">emoji_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">stripped</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;️&#39;</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><h3 id="length-five">Length-five <a class="anchor" href="#length-five">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Length-five emoji seem to be some combination of:</p>
<ol>
<li>Sequences of two emoji, including two unnecessary variation selector characters.</li>
<li>Sequences of three emoji, joined by two ZWJ.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">emoji_df</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;length&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">5</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">stripped</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;emoji&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;️&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">length_stripped</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;stripped&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" target="_blank">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a></li>
</ul>

      ]]></content:encoded></item><item><title>How to batch modify dates of daily journal files</title><link>https://geoffruddock.com/batch-modify-dates-daily-journal/</link><pubDate>Thursday, 16 Sep 2021</pubDate><guid>https://geoffruddock.com/batch-modify-dates-daily-journal/</guid><description>&lt;h2 id="goal">Goal &lt;a class="anchor" href="#goal">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>I&amp;rsquo;ve been using &lt;a href="https://obsidian.md/" target="_blank">Obsidian&lt;/a> as the primary hub for personal notes for the past year. My daily notes act as a sort of &lt;a href="https://malcolmocean.com/2017/11/captains-log-ultra-simple-tech-for-self-reflection/" target="_blank">captain&amp;rsquo;s log&lt;/a>, and have superseded my use of a dedicated journaling app. So I exported all my journal entries to markdown, and added them to my Obsidian vault as daily notes.&lt;/p>
&lt;p>In the process of migrating journal entries between apps over the years&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, I must have messed up some metadata at some point, because I just realized today that all my entries before a certain point in time were wrong by one day. So I wrote a small script (below) to batch correct these files.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="goal">Goal <a class="anchor" href="#goal">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I&rsquo;ve been using <a href="https://obsidian.md/" target="_blank">Obsidian</a> as the primary hub for personal notes for the past year. My daily notes act as a sort of <a href="https://malcolmocean.com/2017/11/captains-log-ultra-simple-tech-for-self-reflection/" target="_blank">captain&rsquo;s log</a>, and have superseded my use of a dedicated journaling app. So I exported all my journal entries to markdown, and added them to my Obsidian vault as daily notes.</p>
<p>In the process of migrating journal entries between apps over the years<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, I must have messed up some metadata at some point, because I just realized today that all my entries before a certain point in time were wrong by one day. So I wrote a small script (below) to batch correct these files.</p>
<h2 id="setup">Setup <a class="anchor" href="#setup">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>To help write and debug the script, I put a handful of dummy files in a <code>staging</code> directory, each following the <code>YYYY-MM-DD.md</code> naming convention.</p>
<p>We&rsquo;ll use <code>pathlib.Path(…).glob('*.md')</code> to get a list of markdown files in this directory.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="nn">dt</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;staging&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">all_md_files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;*.md&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Print the contents of each file</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">all_md_files</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span> <span class="o">+</span> <span class="s1">&#39;=&#39;</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;=&#39;</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="o">+</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">open_file</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">open_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</span></span></code></pre></div><pre><code>=============
2013-12-31.md
=============

# 2013-12-31

This file should become 2014-01-01.

=============
2014-01-01.md
=============

# 2014-01-01

This file should become 2014-01-02.

=============
2014-01-03.md
=============

# 2014-01-03

This file should remain as 2014-01-03.
</code></pre>
<p>Note that these files contain the date in their names, but also as an <code>h1</code> heading within the file itself, so we&rsquo;ll need to change both.</p>
<h2 id="the-script">The script <a class="anchor" href="#the-script">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>My original issue only affected files up to a certain date, so let&rsquo;s filter the list of markdown files.</p>
<p>Because we are <em>incrementing</em> the dates of files, we&rsquo;ll want to work through the list in reverse order. Before making yesterday today, we must make today tomorrow—else there will be a conflict.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">selected_files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">([</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">all_md_files</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">stem</span> <span class="o">&lt;=</span> <span class="s1">&#39;2014-01-02&#39;</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">selected_files</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>2014-01-01.md
2013-12-31.md
</code></pre>
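<p>To see why the order matters, here is a small dry-run sketch that works purely on filename stems and touches nothing on disk. The helper names <code>plan_renames</code> and <code>first_conflict</code> are made up for illustration and are not part of the script below.</p>

```python
import datetime as dt


def plan_renames(stems, cutoff):
    """Return (old, new) stem pairs, newest first, for stems on or before cutoff."""
    selected = sorted((s for s in stems if s <= cutoff), reverse=True)
    plan = []
    for old in selected:
        new = (dt.date.fromisoformat(old) + dt.timedelta(days=1)).isoformat()
        plan.append((old, new))
    return plan


def first_conflict(stems, plan):
    """Simulate the renames in order; return the first pair that would hit an existing name."""
    existing = set(stems)
    for old, new in plan:
        if new in existing:
            return (old, new)
        existing.remove(old)
        existing.add(new)
    return None


stems = ['2013-12-31', '2014-01-01', '2014-01-03']
plan = plan_renames(stems, '2014-01-02')

print(first_conflict(stems, plan))        # None: newest-first is safe
print(first_conflict(stems, plan[::-1]))  # ('2013-12-31', '2014-01-01'): oldest-first collides
```

<p>On Unix, <code>Path.rename</code> silently replaces an existing file at the target, so a collision like the second one would mean losing a journal entry rather than getting an error.</p>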
<p>The actual work to be done here is relatively simple:</p>
<ol>
<li>Convert string to datetime, increment by 1d, convert back to string.</li>
<li>Replace references to the previous date string with the new one, inside file contents.</li>
<li>Rename the file itself.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">replace_in_file</span><span class="p">(</span><span class="n">fp</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">old</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">new</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Replace &#39;old&#39; strings with &#39;new&#39; strings in a given file (fp) &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fp</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">open_file</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">old_contents</span> <span class="o">=</span> <span class="n">open_file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">new_contents</span> <span class="o">=</span> <span class="n">old_contents</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">old</span><span class="p">,</span> <span class="n">new</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fp</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">open_file</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">open_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">new_contents</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">selected_files</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">old_dt</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">stem</span><span class="p">,</span> <span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_dt</span> <span class="o">=</span> <span class="p">(</span><span class="n">old_dt</span> <span class="o">+</span> <span class="n">dt</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_ds</span> <span class="o">=</span> <span class="n">new_dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">replace_in_file</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">f</span><span class="o">.</span><span class="n">stem</span><span class="p">,</span> <span class="n">new_ds</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">f</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">parent</span> <span class="o">/</span> <span class="p">(</span><span class="n">new_ds</span> <span class="o">+</span> <span class="s1">&#39;.md&#39;</span><span class="p">))</span>
</span></span></code></pre></div><h2 id="check-results">Check results <a class="anchor" href="#check-results">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Visually inspecting the files in the <code>staging</code> directory, we can see the final result matches what we hoped to achieve.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">&#39;*.md&#39;</span><span class="p">))):</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span> <span class="o">+</span> <span class="s1">&#39;=&#39;</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;=&#39;</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="o">+</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">open_file</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">open_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</span></span></code></pre></div><pre><code>=============
2014-01-01.md
=============

# 2014-01-01

This file should become 2014-01-01.

=============
2014-01-02.md
=============

# 2014-01-02

This file should become 2014-01-02.

=============
2014-01-03.md
=============

# 2014-01-03

This file should remain as 2014-01-03.
</code></pre>
<p>If you&rsquo;re running a script that modifies your files in-place like this, be sure to have recent, working backups before you start, in case something goes wrong!</p>
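<p>If you don&rsquo;t already have a backup routine, a minimal sketch of a one-off copy might look like the following. The <code>backup_dir</code> helper and the <code>_backup</code> suffix are my own assumptions for illustration, not something the script above depends on.</p>

```python
import shutil
from pathlib import Path


def backup_dir(src, suffix='_backup'):
    """Copy the directory `src` to a sibling directory and return the new path."""
    src = Path(src)
    dest = src.with_name(src.name + suffix)
    # copytree raises FileExistsError if dest already exists,
    # so an earlier backup is never overwritten by accident
    shutil.copytree(src, dest)
    return dest


# Example (run before the rename loop):
# backup_dir('staging')  # creates staging_backup next to staging
```

<p>This only protects against mistakes in the script itself; it is not a substitute for a real backup stored somewhere else.</p>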
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I started journaling with <a href="http://ohlife.com/index.php" target="_blank">OhLife</a>, before it was shut down in 2014, replaced it with <a href="https://dabble.me/" target="_blank">Dabble.Me</a>, then most recently ported everything over to <a href="https://dayoneapp.com/" target="_blank">Day One</a> in 2019.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Soundproofing a Synology NAS</title><link>https://geoffruddock.com/soundproof-synology/</link><pubDate>Sunday, 14 Mar 2021</pubDate><guid>https://geoffruddock.com/soundproof-synology/</guid><description>&lt;p>After years of using SSDs, I had almost entirely forgotten how &lt;em>annoying&lt;/em> the sound of actual, spinning hard disk platters is. That is, until I bought a &lt;a href="https://www.amazon.co.uk/Synology-DS420-Bay-NAS-Enclosure/dp/B088V71WGH" target="_blank">Synology DS420+ NAS&lt;/a> earlier this year to set up as a home media server. (Lockdown projects, yay!)&lt;/p>
&lt;p>The usual advice here is to simply place your NAS somewhere out of earshot. But if you want to connect your NAS via a 1 Gbps ethernet hookup, your choices may be more constrained. In my case, there was only one suitable location—in my home office, on the shelf behind my desk.&lt;/p></description><content:encoded><![CDATA[
        <p>After years of using SSDs, I had almost entirely forgotten how <em>annoying</em> the sound of actual, spinning hard disk platters is. That is, until I bought a <a href="https://www.amazon.co.uk/Synology-DS420-Bay-NAS-Enclosure/dp/B088V71WGH" target="_blank">Synology DS420+ NAS</a> earlier this year to set up as a home media server. (Lockdown projects, yay!)</p>
<p>The usual advice here is to simply place your NAS somewhere out of earshot. But if you want to connect your NAS via a 1 Gbps ethernet hookup, your choices may be more constrained.  In my case, there was only one suitable location—in my home office, on the shelf behind my desk.</p>
<h2 id="a-prelude-on-acoustics">A prelude on acoustics <a class="anchor" href="#a-prelude-on-acoustics">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I don&rsquo;t have much prior knowledge of acoustics, beyond knowing that <a href="https://en.wikipedia.org/wiki/Decibel" target="_blank">decibels</a> are measured on a $\log_{10}$ scale. To be honest, I&rsquo;m still kind of talking out of my ass here, but I realized in retrospect that thinking through the nature of the problem before jumping to solutions would have saved me some wasted effort.</p>
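<p>For reference, this is the textbook definition rather than anything I measured. Sound pressure level is</p>
<p>$$L_p = 20 \log_{10}\left(\frac{p}{p_0}\right) \text{ dB}$$</p>
<p>where $p$ is the measured sound pressure and $p_0$ is a reference pressure, conventionally 20 &micro;Pa in air. So a tenfold increase in sound pressure adds 20 dB, while halving the pressure removes only about 6 dB.</p>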
<h3 id="source-of-the-sound">Source of the sound <a class="anchor" href="#source-of-the-sound">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>As far as I can tell, there are three fundamental sources of sound coming from my NAS:</p>
<ol>
<li><strong>Spinning disks</strong> – A constant hum that is present whenever the drive is powered up—which, for a NAS, is probably all the time. Drives that spin at 5400 rpm are generally—but not always—quieter than comparable disks that spin at 7200 rpm.</li>
<li><strong>Read/write operations</strong> – The quintessential &ldquo;hard drive sound&rdquo;, which occurs when the drive is actually being <em>used</em>, and so can vary minute-to-minute.</li>
<li><strong>Cooling fan</strong> – A relatively constant hum, but whose pitch and volume can vary, based on how much cooling your NAS decides it needs.</li>
</ol>
<h3 id="travel-path-of-the-sound">Travel path of the sound <a class="anchor" href="#travel-path-of-the-sound">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>While all sound is <em>technically</em> just vibrations being transmitted through the air, there are a few different &ldquo;paths&rdquo; it can take, which have practical implications for how you should approach noise reduction:</p>
<ol>
<li>Between disk drive and NAS unit</li>
<li>Between NAS unit and hard surface</li>
<li>Directly through the air</li>
</ol>
<h2 id="isolate-drives-from-enclosure">Isolate drives from enclosure <a class="anchor" href="#isolate-drives-from-enclosure">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The most common recommendation that comes up for &ldquo;<em>Noisy Synology NAS</em>&rdquo; is to attach some sort of velcro/felt/foam to the drive sleds, along the lines of <a href="https://www.youtube.com/watch?v=z-aTvVo59tY" target="_blank">this YouTube video</a>. Although I found the stock sleds to be reasonably snug, adding some velcro/felt/foam padding removes any remaining vertical wiggle room within which the drive itself could vibrate.</p>
<p>I initially tried attaching foam velcro dots to the drive sleds themselves, but didn&rsquo;t notice any major improvement. Unsure of whether the flaw was in the solution or my implementation, I picked up a strip of <a href="https://www.amazon.co.uk/gp/product/B07L6KBH7G" target="_blank">single-sided 1 mm foam tape</a>, and tried again, this time taping the metal on the enclosure itself. This provided a nice snug fit for the drive sleds, but once again did not result in any noticeable difference in sound.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/sled_velcro_hu_932444c07c91c823.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/sled_velcro_hu_8ba548697d338c0d.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/sled_velcro_hu_9c58342ec4ad2026.jpg 1200w,
                
                       https://geoffruddock.com/soundproof-synology/sled_velcro_hu_d36e3058c3fa138a.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/sled_velcro_hu_8ba548697d338c0d.jpg"
                
    
            
                alt="I had some velcro dots lying around."/> <figcaption>
                <p>I had some velcro dots lying around.</p>
            </figcaption>
    </figure>


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/sled_foam_hu_5fa57f5423097ea1.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/sled_foam_hu_2c11a29946a03e56.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/sled_foam_hu_cfd56b422f0d7880.jpg 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/sled_foam_hu_2c11a29946a03e56.jpg"
                
    
            
                alt="But 1 mm foam tape works even better."/> <figcaption>
                <p>But 1 mm foam tape works even better.</p>
            </figcaption>
    </figure>


        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 400px
    }

</style>
<p>Why didn&rsquo;t this help much? I suspect that a particular combination of particular hard drives spinning at a particular speed may cause a <em>resonance</em> problem for some people. If your NAS is generally quiet, but occasionally gets extremely noisy, this may be your problem. If your NAS is <em>consistently</em> loud, you can probably skip this entirely.</p>
<h2 id="isolate-enclosure-from-surface">Isolate enclosure from surface <a class="anchor" href="#isolate-enclosure-from-surface">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Even with foam padding, the hard drives are physically still connected to the NAS board, and so will transmit some vibrations through to the enclosure itself. So my next attempt was to reduce the amount of vibration being transmitted from the NAS itself to the surface it sits on. My NAS is sitting on an IKEA Kallax shelf, and I could clearly feel the read/write vibrations being transmitted into the shelf by feeling from the cubby below.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/shelf_basic_hu_3c8cd48d14109535.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_basic_hu_9d94279c3608e259.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_basic_hu_8f374e302f9e1626.jpg 1200w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_basic_hu_910740ffcca4fa80.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/shelf_basic_hu_9d94279c3608e259.jpg"
                
    
            
                alt="You can feel the vibrations propagating through the shelving unit." width="500"/> <figcaption>
                <p>You can feel the vibrations propagating through the shelving unit.</p>
            </figcaption>
    </figure>
<p>I picked up a set of three-layer EVA anti-vibration pads recommended in <a href="https://www.howtogeek.com/346082/how-to-get-rid-of-vibration-and-noise-in-your-nas/" target="_blank">this How-To Geek article</a>. They <em>kind of</em> helped, but I could still clearly feel the write operations through the shelf. In retrospect, these pads seem to be designed for much heavier equipment, including washing machines and air compressors. So they are probably much denser than would be optimal for a relatively lightweight NAS.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/shelf_eva_hu_c18f4af38b7657e2.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_eva_hu_26350c1e4e8530c0.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_eva_hu_d25923682de90d99.jpg 1200w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_eva_hu_3db8e1836e6dbe31.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/shelf_eva_hu_26350c1e4e8530c0.jpg"
                
    
            
                alt="These EVA foam feet didn&rsquo;t help much." width="500"/> <figcaption>
                <p>These EVA foam feet didn&rsquo;t help much.</p>
            </figcaption>
    </figure>
<p>What worked better was a few layers of much softer foam. I had some spare acoustic foam (originally purchased for step #3) that did the trick, but I suppose you could also just pick up a solid foam block from an arts and crafts store. Alternatively, you could perhaps create some sort of <a href="https://www.reddit.com/r/buildapc/comments/oxfj3/what_to_do_with_those_525_bays/c3kucfo" target="_blank">makeshift hammock</a> to suspend the NAS in the air.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/shelf_foam_hu_222f56fc92c3433f.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_foam_hu_8ed509980a63d7f2.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_foam_hu_b975060d6ee6152b.jpg 1200w,
                
                       https://geoffruddock.com/soundproof-synology/shelf_foam_hu_d22375bb14136358.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/shelf_foam_hu_8ed509980a63d7f2.jpg"
                
    
            
                alt="A thick bed of soft foam worked best." width="500"/> <figcaption>
                <p>A thick bed of soft foam worked best.</p>
            </figcaption>
    </figure>
<h2 id="dampen-sound-travel-through-air">Dampen sound travel through air <a class="anchor" href="#dampen-sound-travel-through-air">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If your NAS is not rattling, and the vibration is not being transmitted to the surface it is sitting on, the only real remaining source of noise transmission is through the air itself.</p>
<h3 id="acoustic-foam-panels">Acoustic foam panels <a class="anchor" href="#acoustic-foam-panels">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I bought a set of <a href="https://www.aliexpress.com/item/1005001507043816.html" target="_blank">30 cm acoustic foam panels</a> from AliExpress. I have no clue whether these panels have the ideal acoustic properties, but they were cheap. I installed a <a href="https://www.ikea.com/gb/en/p/kallax-insert-with-door-white-20278167/" target="_blank">KALLAX door insert</a> and then affixed the panels to the inside surfaces.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/final_foam_hu_9bd0ad3c7beeb7b1.jpg 480w,
                
                       https://geoffruddock.com/soundproof-synology/final_foam_hu_5b98bf81293af145.jpg 800w,
                
                       https://geoffruddock.com/soundproof-synology/final_foam_hu_4e0a4af6451883b.jpg 1200w,
                
                       https://geoffruddock.com/soundproof-synology/final_foam_hu_c19d645bbd1ec05d.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/final_foam_hu_5b98bf81293af145.jpg"
                
    
            
                alt="My end state."/> <figcaption>
                <p>My end state.</p>
            </figcaption>
    </figure>
<h3 id="beware-of-heat">Beware of heat <a class="anchor" href="#beware-of-heat">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>This worked very well, but introduced a different problem: heat. Even though it wasn&rsquo;t a perfect seal, things got a bit toasty inside the cubby with the door closed. The NAS still worked, but I was uneasy about the long-term implications for drive lifespan. I subsequently cut out a square hole in the back of the insert and installed an <a href="https://www.amazon.co.uk/gp/product/B06XRCDZDH" target="_blank">80 mm USB fan</a>. This entirely solved the heat issue, but reintroduced some noise. It&rsquo;s a reasonable trade though, because I find the constant sound from a fan to be much less intrusive than the intermittent sound of disk I/O operations.</p>
<h2 id="epilogue">Epilogue <a class="anchor" href="#epilogue">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="what-id-do-differently">What I&rsquo;d do differently <a class="anchor" href="#what-id-do-differently">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If I were starting from scratch, would I take the same approach? Probably not.</p>
<p>My NAS is now quiet enough, but it&rsquo;s not particularly modular—moving the NAS would require moving the entire shelf with the soundproof insert, or disassembling the insert and installing it in another KALLAX unit.</p>
<p>I may embark on building a custom soundproofed box at some point, as a potential future woodworking project. Something along the lines of <a href="https://www.youtube.com/watch?v=jXVSNzr3p70" target="_blank">this custom NAS box</a>, but perhaps employing more surface area of insulation material, as done in <a href="https://www.youtube.com/watch?v=d0KsCrtu3jg" target="_blank">this PC build</a>.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/soundproof-synology/youtube_screenshot_hu_173bc8f86870b2a0.png 480w,
                
                       https://geoffruddock.com/soundproof-synology/youtube_screenshot_hu_15c899cd9adb8d94.png 800w,
                
                       https://geoffruddock.com/soundproof-synology/youtube_screenshot_hu_65aa8c9056d25008.png 1200w,
                
                       https://geoffruddock.com/soundproof-synology/youtube_screenshot_hu_24125382561fe997.png 1500w,
                '
    
                
                
                src="https://geoffruddock.com/soundproof-synology/youtube_screenshot_hu_15c899cd9adb8d94.png"
                
    
            
                alt="Firmly on my &ldquo;someday/maybe&rdquo; project list, for now at least." width="500"/> <figcaption>
                <p>Firmly on my &ldquo;someday/maybe&rdquo; project list, for now at least.</p>
            </figcaption>
    </figure>
<h3 id="an-afterword-on-acoustics-testing">An afterword on acoustics testing <a class="anchor" href="#an-afterword-on-acoustics-testing">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I initially intended to measure and compare the noise levels between each approach I tried, but abandoned this mid-way through when I realized that the free &ldquo;noise meter&rdquo; app I downloaded from Google Play wasn&rsquo;t doing a great job of capturing my own perceptual sense of each approach.</p>
<p>Two thoughts on why measuring and comparing <em>average dB</em> didn&rsquo;t prove to be as useful as I expected:</p>
<ol>
<li><strong>It is extremely sensitive to ambient factors</strong> – Before taking measurements, I put my laptop to sleep, closed the door, and made sure to breathe quietly and barely move. But even with these precautions, some ambient noise snuck in, including traffic from outside my window, or the upstairs neighbour doing laundry.</li>
<li><strong>Quality of noise matters, not just quantity</strong> – Our perception of sound is not just a function of <em>quantity</em> but also <em>quality</em>. In retrospect, what I found annoying about the disk drives was not the absolute volume level, but the intermittent and &ldquo;clicky&rdquo; nature of the I/O operations. Because these sounds are not constant, they contribute much less to the measurements of &ldquo;average dB&rdquo; than they do to my own subjective perception of overall noise level.</li>
</ol>
<p>So if I were to try this again, I would explore measuring something like the <em>80th percentile</em> of noise level, rather than the average. This would make it more robust against &ldquo;outlier&rdquo; ambient noises, and also more accurately capture the influence of intermittent I/O operations.</p>
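<p>As a rough sketch of the difference, here is a comparison of the mean and the 80th percentile on some made-up readings. The numbers are invented for illustration and are not measurements from my NAS; <code>summarize</code> is a hypothetical helper.</p>

```python
import statistics


def summarize(readings_db):
    """Return (mean, 80th percentile) for a list of dB readings."""
    mean = statistics.fmean(readings_db)
    # quantiles(n=10) returns the 9 decile cut points; index 7 is the 80th percentile
    p80 = statistics.quantiles(readings_db, n=10, method='inclusive')[7]
    return mean, p80


# Made-up readings: a quiet 30 dB baseline with intermittent 45 dB disk activity
baseline = [30.0] * 70 + [45.0] * 30
# Same session, but one quiet sample is replaced by an 80 dB ambient spike
with_spike = [30.0] * 69 + [45.0] * 30 + [80.0]

print(summarize(baseline))    # (34.5, 45.0)
print(summarize(with_spike))  # (35.0, 45.0)
```

<p>In this toy example the mean sits close to the quiet baseline and shifts when a single loud sample sneaks in, while the 80th percentile lands on the disk activity level and does not move. That only holds if the I/O noise is present in more than a fifth of the samples, so the right percentile would depend on how busy the drives are.</p>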

      ]]></content:encoded></item><item><title>Turn on your thermostat before an alarm with Tasker (Android)</title><link>https://geoffruddock.com/tado-thermostat-with-tasker/</link><pubDate>Thursday, 03 Dec 2020</pubDate><guid>https://geoffruddock.com/tado-thermostat-with-tasker/</guid><description>&lt;p>My main Black Friday purchase this year was a &lt;a href="https://www.tado.com/all-en/smart-thermostat" target="_blank">Tado°&lt;/a> system (thermostat + smart radiator valves), which I acquired with the goal in mind of regulating the temperature in my bedroom by:&lt;/p>
&lt;ol>
&lt;li>Turning the heat down early enough to be consistently cold at night.&lt;/li>
&lt;li>Turning the heat up in the morning to make it easier to wake up. Ideally 1h before.&lt;/li>
&lt;/ol>
&lt;p>The first part is easy, but the second part can&amp;rsquo;t quite be achieved, at least out-of-the-box. If your daily sleep schedule is consistent, you can just set a corresponding heating schedule in the Tado app. But my own wake-up time varies between 7a-9a, and I found it cumbersome to change the &amp;ldquo;Smart Schedule&amp;rdquo; every evening in anticipation of when I would wake up the following morning.&lt;/p></description><content:encoded><![CDATA[
        <p>My main Black Friday purchase this year was a <a href="https://www.tado.com/all-en/smart-thermostat" target="_blank">Tado°</a> system (thermostat + smart radiator valves), which I acquired with the goal in mind of regulating the temperature in my bedroom by:</p>
<ol>
<li>Turning the heat down early enough to be consistently cold at night.</li>
<li>Turning the heat up in the morning to make it easier to wake up. Ideally 1h before.</li>
</ol>
<p>The first part is easy, but the second part can&rsquo;t quite be achieved, at least out-of-the-box. If your daily sleep schedule is consistent, you can just set a corresponding heating schedule in the Tado app. But my own wake-up time varies between 7a-9a, and I found it cumbersome to change the &ldquo;Smart Schedule&rdquo; every evening in anticipation of when I would wake up the following morning.</p>
<h3 id="why-ifttt-isnt-enough">Why IFTTT isn&rsquo;t enough <a class="anchor" href="#why-ifttt-isnt-enough">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Tado&rsquo;s <a href="https://ifttt.com/tado_heating" target="_blank">IFTTT integration</a> is a good start, but it isn&rsquo;t perfect. It is relatively straightforward to create a recipe that turns on our heating when our Android alarm goes off. But ideally it would trigger some amount of time <em>ahead</em> of the alarm, so that the room has time to actually rise to temperature.</p>
<p>IFTTT doesn&rsquo;t have any capability for this sort of pre-trigger logic. I toyed with triggering based on a &ldquo;dummy&rdquo; alarm set 1h ahead, but ultimately realized that actions trigger on the alarm being <em>dismissed</em> rather than going off.</p>
<h2 id="how-to">How-to <a class="anchor" href="#how-to">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There are two things we need to do, which we can achieve using a combination of IFTTT and <a href="https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm&amp;hl=en&amp;gl=US" target="_blank">Tasker</a>:</p>
<ol>
<li>Calculate <em>when</em> to trigger the heat, based on the next alarm set.</li>
<li>Actually trigger the Tado° thermostat to turn on at that time.</li>
</ol>
<h3 id="calculate-when-to-trigger">Calculate when to trigger <a class="anchor" href="#calculate-when-to-trigger">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We&rsquo;ll use the <a href="https://play.google.com/store/apps/details?id=com.joaomgcd.autoalarm" target="_blank">AutoAlarm plugin</a> (Google Play) to let Tasker see the time of the next alarm we have set. These instructions are heavily based on <a href="https://community.sharptools.io/t/how-to-control-smart-home-x-minutes-before-android-alarm-clock/35" target="_blank">this forum post</a>, which includes a <a href="https://www.youtube.com/watch?v=lnq5imRkvU8" target="_blank">video walk-through</a> that is useful if you&rsquo;re not familiar with the Tasker interface.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_1_hu_ba608f05267231a3.png 480w,
                
                       https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_1_hu_fdb080e24c194785.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_1_hu_fdb080e24c194785.png"
                
    
            
                alt="Set the to/from window based on when you normally wake up." width="400"/> <figcaption>
                <p>Set the to/from window based on when you normally wake up.</p>
            </figcaption>
    </figure>
<ol>
<li>Create a new time-based profile (<code>Tasker → Profiles → (+) → Time</code>)
<ul>
<li>From/to → if you use other Android alarms throughout the day (besides waking up), you&rsquo;ll want to set a window here so the profile only runs around the time of your wake-up alarm. I used 5a-10a.</li>
<li>Every → I set it to run every 5 minutes, but I suppose you could run it less frequently to minimize the effect on battery life. This is less of a concern for me personally, because my phone is always plugged in overnight.</li>
</ul>
</li>
<li>Add the following actions to your profile
<ul>
<li><code>Plugins → AutoAlarm</code> → this will run the plugin, exposing a <code>%seconds</code> variable which contains the number of seconds until our next alarm.</li>
<li>Set a variable <code>%minsBefore</code> to the number of minutes ahead of your alarm that you want your heat to turn on. I used 60 here, but adjust accordingly.</li>
<li>Set a variable <code>%TriggerHeatAtSec</code> that calculates the formula <code>round(%TIMES + %seconds - (%minsBefore * 60))</code>. Make sure the variable name starts with an uppercase character, so that it is available in &ldquo;global scope&rdquo; for later use by a <em>different</em> profile. If you&rsquo;re trying to understand this formula: <code>%TIMES</code> is a built-in Tasker variable that returns the current time in UNIX format (number of seconds since January 1970, don&rsquo;t ask).</li>
<li>If you use other alarms throughout the day, also set the variable <code>%NextAlarmHour</code> so it can be used as a condition later.</li>
</ul>
</li>
</ol>
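<p>The arithmetic behind <code>%TriggerHeatAtSec</code> can be sketched in Python. This is only an illustration of the formula, not something Tasker runs; the function and argument names are hypothetical stand-ins for <code>%TIMES</code>, <code>%seconds</code> and <code>%minsBefore</code>.</p>

```python
import time


def trigger_heat_at_sec(now_unix: float, seconds_until_alarm: float, mins_before: int = 60) -> int:
    """Return the UNIX timestamp at which the heating should switch on."""
    # now + time until alarm = alarm time; subtract the lead time in seconds.
    return round(now_unix + seconds_until_alarm - mins_before * 60)


# Example: the next alarm is 8 hours away and we want heat 60 minutes before it.
now = time.time()
trigger_at = trigger_heat_at_sec(now, seconds_until_alarm=8 * 3600, mins_before=60)
print(time.strftime("%H:%M", time.localtime(trigger_at)))
```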
<p>☠️ There are some reports of this method sometimes <a href="https://www.reddit.com/r/tasker/comments/di6iiu/wrong_time_in_autoalarm/" target="_blank">not working</a>, possibly due to <a href="https://www.reddit.com/r/tasker/comments/dex1td/tasker_autoalarm_issue/" target="_blank">other apps using alarms silently</a> in the background. If it&rsquo;s acting funky, double-check that you have the <em>Reliable Alarms</em> option disabled in Tasker. This option sets background alarms in the built-in Android clock app to ensure that the Tasker app does not get killed due to battery-saving settings. But it interferes with what we&rsquo;re trying to do here. If this doesn&rsquo;t help, you may have a <em>different</em> app that is causing the interference. You can try debugging with the <a href="https://play.google.com/store/apps/details?id=com.balda.clocktask&amp;hl=en_IE&amp;gl=US" target="_blank">ClockTask</a> plugin, which has a variable that tells you <em>which</em> app set the next alarm.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/tado-thermostat-with-tasker/tasker_edit_1_hu_98c6ac1d146b0371.png 480w,
                
                       https://geoffruddock.com/tado-thermostat-with-tasker/tasker_edit_1_hu_9bd19292d78a0a38.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/tado-thermostat-with-tasker/tasker_edit_1_hu_9bd19292d78a0a38.png"
                
    
            
                alt="This profile calculates the trigger time for the thermostat." width="400"/> <figcaption>
                <p>This profile calculates the trigger time for the thermostat.</p>
            </figcaption>
    </figure>
<h3 id="actually-trigger-the-thermostat">Actually trigger the thermostat <a class="anchor" href="#actually-trigger-the-thermostat">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Now we need to trigger our thermostat to kick in at the previously calculated time. We could <em>probably</em> do this entirely via Tasker using <a href="https://shkspr.mobi/blog/2019/02/tado-api-guide-updated-for-2019/" target="_blank">the tado API</a>, but using IFTTT relieves us of the burden of dealing with OAuth tokens, etc.</p>
<ol>
<li>Connect your Tado account to IFTTT</li>
<li>Enable the <a href="https://ifttt.com/maker_webhooks" target="_blank">IFTTT Webhooks channel</a>.</li>
<li>Create an applet that triggers your thermostat based on a webhook event. <a href="https://ifttt.com/applets/dpewPDKi" target="_blank">Here</a> is mine.</li>
<li>Go back to the <a href="https://ifttt.com/maker_webhooks" target="_blank">webhooks channel page</a> and click <em>Documentation</em> in the top-right to get to a page where you can test out your configuration. Fill in the name of your event and click <em>Test It</em>. After a few seconds, check your Tado app (or listen to your radiator valve) and you should notice that it was triggered successfully.

       
       
       
       
       
       
       
       <figure>
           <img loading="lazy"
               
                   sizes="(min-width: 35em) 1200px, 100vw"
                     
                   srcset='
                   
                          https://geoffruddock.com/tado-thermostat-with-tasker/ifttt_hu_f3673bd5aecb7dec.png 480w,
                   
                          
                   
                          
                   
                          
                   '
       
                   
                   
                   src="https://geoffruddock.com/tado-thermostat-with-tasker/ifttt.png"
                   
       
               
                   alt="Screenshot of IFTTT webhook configuration" width="500"/> 
       </figure></li>
<li>Go back to Tasker and create a second profile that runs from/until <code>%TriggerHeatAtSec</code> (once).</li>
<li>To prevent the logic from triggering for random alarms later in the day, you can set an additional constraint to the trigger logic: <code>%NextAlarmHour &lt; 13</code> (to only turn on heating for alarms before 1p).

       
       
       
       
       
       
       
       <figure>
           <img loading="lazy"
               
                   sizes="(min-width: 35em) 1200px, 100vw"
                     
                   srcset='
                   
                          https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_2_hu_b55d63d2d1f57749.png 480w,
                   
                          https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_2_hu_70e8206e3239f9b9.png 800w,
                   
                          
                   
                          
                   '
       
                   
                   
                   src="https://geoffruddock.com/tado-thermostat-with-tasker/tasker_profile_2_hu_70e8206e3239f9b9.png"
                   
       
               
                   alt="This profile actually does the triggering." width="400"/> <figcaption>
                   <p>This profile actually does the triggering.</p>
               </figcaption>
       </figure></li>
<li>Add a single task that uses <em>HTTP Request</em> to make a <em>GET</em> request to the webhooks URL we tested earlier, which will look something like <code>https://maker.ifttt.com/trigger/&lt;event_name&gt;/with/key/&lt;your_key&gt;</code>.</li>
</ol>
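<p>For reference, the request Tasker sends in the last step could be sketched in Python as below. The URL pattern is the one shown above; <code>heat_on</code> and <code>YOUR_KEY</code> are placeholders, and the helper names are hypothetical.</p>

```python
from urllib.parse import quote
from urllib.request import urlopen


def webhook_url(event_name: str, key: str) -> str:
    """Build the IFTTT Webhooks trigger URL for a given event and key."""
    # quote() guards against characters that are not safe in a URL path.
    return (
        "https://maker.ifttt.com/trigger/"
        f"{quote(event_name, safe='')}/with/key/{quote(key, safe='')}"
    )


def trigger_heating(event_name: str, key: str) -> int:
    """Send the GET request and return the HTTP status code."""
    with urlopen(webhook_url(event_name, key), timeout=10) as response:
        return response.status


# Placeholder values; substitute your own event name and key before calling
# trigger_heating(), which performs a real network request.
print(webhook_url("heat_on", "YOUR_KEY"))
```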

      ]]></content:encoded></item><item><title>Accidental abstract art (ft. matplotlib)</title><link>https://geoffruddock.com/accidental-abstract-art/</link><pubDate>Saturday, 10 Oct 2020</pubDate><guid>https://geoffruddock.com/accidental-abstract-art/</guid><description>&lt;p>A collection of accidental art that I have created while trying to plot something &lt;em>actually useful&lt;/em> with &lt;a href="https://geoffruddock.com/notebooks/data-viz/matplotlib/">matplotlib&lt;/a> or other tools.&lt;/p>
&lt;figure>
&lt;img loading="lazy"
sizes="(min-width: 35em) 1200px, 100vw"
srcset='
'
src="https://geoffruddock.com/accidental-abstract-art/interstellar.png"
alt="Interstellar" width="500"/> &lt;figcaption>
&lt;p>Interstellar&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img loading="lazy"
sizes="(min-width: 35em) 1200px, 100vw"
srcset='
https://geoffruddock.com/accidental-abstract-art/accidental_landscape_hu_aac944a974c32dde.png 480w,
'
src="https://geoffruddock.com/accidental-abstract-art/accidental_landscape.png"
alt="Accidental 3D render" width="500"/> &lt;figcaption>
&lt;p>Accidental 3D render&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img loading="lazy"
sizes="(min-width: 35em) 1200px, 100vw"
srcset='
https://geoffruddock.com/accidental-abstract-art/windows_98_hu_5f0518c41ef11f38.png 480w,
'
src="https://geoffruddock.com/accidental-abstract-art/windows_98.png"
alt="Windows 95" width="500"/> &lt;figcaption>
&lt;p>Windows 95&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img loading="lazy"
sizes="(min-width: 35em) 1200px, 100vw"
srcset='
'
src="https://geoffruddock.com/accidental-abstract-art/signature_of_time.png"
alt="The signature of time" width="500"/> &lt;figcaption>
&lt;p>The signature of time&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img loading="lazy"
sizes="(min-width: 35em) 1200px, 100vw"
srcset='
'
src="https://geoffruddock.com/accidental-abstract-art/dantes_inferno.png"
alt="Dante&amp;rsquo;s inferno" width="500"/> &lt;figcaption>
&lt;p>Dante&amp;rsquo;s inferno&lt;/p>
&lt;/figcaption>
&lt;/figure></description><content:encoded><![CDATA[
        <p>A collection of accidental art that I have created while trying to plot something <em>actually useful</em> with <a href="https://geoffruddock.com/notebooks/data-viz/matplotlib/">matplotlib</a> or other tools.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/accidental-abstract-art/interstellar.png"
                
    
            
                alt="Interstellar" width="500"/> <figcaption>
                <p>Interstellar</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/accidental-abstract-art/accidental_landscape_hu_aac944a974c32dde.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/accidental-abstract-art/accidental_landscape.png"
                
    
            
                alt="Accidental 3D render" width="500"/> <figcaption>
                <p>Accidental 3D render</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/accidental-abstract-art/windows_98_hu_5f0518c41ef11f38.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/accidental-abstract-art/windows_98.png"
                
    
            
                alt="Windows 95" width="500"/> <figcaption>
                <p>Windows 95</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/accidental-abstract-art/signature_of_time.png"
                
    
            
                alt="The signature of time" width="500"/> <figcaption>
                <p>The signature of time</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/accidental-abstract-art/dantes_inferno.png"
                
    
            
                alt="Dante&rsquo;s inferno" width="500"/> <figcaption>
                <p>Dante&rsquo;s inferno</p>
            </figcaption>
    </figure>

      ]]></content:encoded></item><item><title>Keep your SQL queries DRY with Jinja templating</title><link>https://geoffruddock.com/sql-jinja-templating/</link><pubDate>Wednesday, 01 Jul 2020</pubDate><guid>https://geoffruddock.com/sql-jinja-templating/</guid><description>&lt;h2 id="a-usecase-for-templating-your-sql-queries">A usecase for templating your SQL queries &lt;a class="anchor" href="#a-usecase-for-templating-your-sql-queries">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>Suppose you have a table &lt;code>raw_events&lt;/code> which contains events related to an email marketing campaign. You&amp;rsquo;d like to see the total number of each event type per day. This is a classic use-case for a &lt;a href="https://en.wikipedia.org/wiki/Pivot_table" target="_blank">pivot table&lt;/a>, but let&amp;rsquo;s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.&lt;/p>
&lt;p>The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of &lt;code>CASE WHEN&lt;/code> expressions.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="a-usecase-for-templating-your-sql-queries">A usecase for templating your SQL queries <a class="anchor" href="#a-usecase-for-templating-your-sql-queries">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Suppose you have a table <code>raw_events</code> which contains events related to an email marketing campaign. You&rsquo;d like to see the total number of each event type per day. This is a classic use-case for a <a href="https://en.wikipedia.org/wiki/Pivot_table" target="_blank">pivot table</a>, but let&rsquo;s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.</p>
<p>The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of <code>CASE WHEN</code> expressions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">date_</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">SUM</span><span class="p">(</span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="n">event_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;send&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">END</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_send</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">SUM</span><span class="p">(</span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="n">event_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;deliver&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">END</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_deliver</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">SUM</span><span class="p">(</span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="n">event_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;open&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">END</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_open</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">SUM</span><span class="p">(</span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="n">event_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;click&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">END</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_click</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">raw_events</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">ASC</span><span class="w">
</span></span></span></code></pre></div><p>While this gets the job done for our toy example, it is not particularly <em>scalable</em>. Suppose our <code>event_type</code> column had 40 possible values, instead of just four. In that case, our solution is sub-optimal on two criteria:</p>
<ol>
<li><strong>Readability</strong> – With 10x as many possible values, the <code>SELECT</code> list would grow roughly 10x longer. While it won&rsquo;t take 10x longer to read, it does impose a cognitive cost. This is exacerbated when we&rsquo;ve got a number of sub-queries in a single file.</li>
<li><strong>Maintainability</strong> – If we add new <code>event_type</code> values in the future, this query must be updated to match. This is tedious and introduces an opportunity for error.</li>
</ol>
<h3 id="can-we-do-better">Can we do better? <a class="anchor" href="#can-we-do-better">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>This is a pretty good scenario for <a href="https://jinja.palletsprojects.com/en/2.11.x/" target="_blank">jinja</a>, a Python templating library which lets us perform basic flow control (loops, conditionals) inside text templates. It is heavily used in the Flask community, but is also well suited for data analytics with SQL.</p>
<p>I&rsquo;ll avoid giving a mediocre regurgitation of jinja syntax here, and defer to their own excellent documentation instead. Let&rsquo;s skip ahead to see what our query would look like using a jinja template.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Template</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<p>In our toy example, the resulting query is not that much shorter, but it has the benefit of abstracting out a variable <code>events</code>, which contains the list of possible values for the <code>event_type</code> column. In the future, we can extend this query easily by simply appending to the <code>events</code> list.</p>
<h2 id="whitespace">Whitespace <a class="anchor" href="#whitespace">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>You may have noticed that I added some unexplained <code>-</code> characters in the blocks above to get a pretty output.</p>
<p>The default output without these characters is a bit ugly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% i</span><span class="s2">f not loop.last %}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% e</span><span class="s2">ndif %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send
    
        , 
    
    
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver
    
        , 
    
    
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open
    
        , 
    
    
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
    
    
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<p>Adding a minus sign (<code>-</code>) tells jinja to strip the whitespace before or after a block.</p>
<p>There are four possible positions for the minus sign:</p>
<ol>
<li>Start of opening block</li>
<li>End of opening block</li>
<li>Start of closing block</li>
<li>End of closing block</li>
</ol>
<p>Let&rsquo;s take a look at the effect of adding a minus sign in each position.</p>
<h4 id="start-of-opening-block">Start of opening block <a class="anchor" href="#start-of-opening-block">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Adding the minus sign to the start of the opening block strips the leading whitespace <em>outside of</em> the for-loop. Basically, it just removes the extra line inhabited by the <code>{%- for event in events %}</code> block itself.</p>
<p>😄 This removes the empty line between <code>date_</code> and the first <code>SUM</code>.</p>
<p>😢 But on its own, it does not remove the empty lines between each <code>SUM</code> statement.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<h4 id="end-of-opening-block">End of opening block <a class="anchor" href="#end-of-opening-block">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Adding the minus sign to the end of the opening block strips the leading whitespace <em>within</em> the for-loop.</p>
<p>😄 This removes the empty lines between each <code>SUM</code> statement.</p>
<p>😢 But in this template it also strips the line break in front of each subsequent <code>SUM</code> statement, so all four end up on a single line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% f</span><span class="s2">or event in events -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<h4 id="start-of-closing-block">Start of closing block <a class="anchor" href="#start-of-closing-block">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Adding the minus sign to the start of the closing block strips the <em>trailing</em> whitespace within the for-loop.</p>
<p>😄 This removes the empty lines between each <code>SUM</code> statement.</p>
<p>😢 But it leaves an empty line between the <code>date_</code> and the first <code>SUM</code> statement.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<h4 id="end-of-closing-block">End of closing block <a class="anchor" href="#end-of-closing-block">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Adding the minus sign to the end of the closing block removes the trailing whitespace outside of the for-loop.</p>
<p>😢 This looks the worst. It leaves an empty line between <code>date_</code> and the first <code>SUM</code> statement. It also strips the line break after the for-loop, which glues <code>FROM</code> onto the end of the final <code>SUM</code> statement.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">% e</span><span class="s2">ndfor -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_clickFROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
<h4 id="the-ideal-mix">The ideal mix <a class="anchor" href="#the-ideal-mix">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>By combining minus signs on the start of the opening block and the start of the closing block, we can tell jinja to strip the first leading empty line, and also the lines between each <code>SUM</code> statement.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    date_,
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- f</span><span class="s2">or event in events %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    SUM(CASE WHEN event_type = &#39;{{event}}&#39; THEN 1 END) AS num_{{event}}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- i</span><span class="s2">f not loop.last -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">        , 
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndif -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">    {</span><span class="si">%- e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM raw_events
</span></span></span><span class="line"><span class="cl"><span class="s2">GROUP BY 1
</span></span></span><span class="line"><span class="cl"><span class="s2">ORDER BY 1 ASC
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">events</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
</code></pre>
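<p>An alternative to placing minus signs by hand, which this post does not use, is to turn on jinja&rsquo;s <code>trim_blocks</code> and <code>lstrip_blocks</code> options. They remove the line break after a block tag and the indentation in front of it. The sketch below assumes an <code>Environment</code> with both options enabled, and swaps the <code>if</code> block for an inline <code>if</code> expression to place the comma. Treat it as one possible variation rather than a recommendation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from jinja2 import Environment

# trim_blocks drops the newline after a block tag;
# lstrip_blocks drops spaces and tabs in front of a block tag.
env = Environment(trim_blocks=True, lstrip_blocks=True)

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{ event }}' THEN 1 END) AS num_{{ event }}{{ "," if not loop.last else "" }}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

rendered = env.from_string(sql).render(events=['send', 'deliver', 'open', 'click'])
print(rendered)
</code></pre></div>
<p>The trade-off is that these options apply to every block tag in the template, whereas the minus sign lets you decide tag by tag.</p>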
<h2 id="correlation-matrix">Correlation matrix <a class="anchor" href="#correlation-matrix">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Let&rsquo;s look at a slightly more complicated example. Suppose we used the previous query to make a new table called <code>daily_event_counts</code>. Now we are interested in measuring the pairwise correlation between each type of event.</p>
<p>We can use the <code>CORR()</code> function to calculate each pair, but we need to tell the SQL engine which columns to use for each calculation. This is a good example of where the quick-and-dirty approach fails to scale. We have four types of events, but there are <code>4×4=16</code> pairwise correlations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;send&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;deliver&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">CORR</span><span class="p">(</span><span class="n">send</span><span class="p">,</span><span class="w"> </span><span class="n">deliver</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">corr_</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">daily_event_counts</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;send&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s1">&#39;open&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">CORR</span><span class="p">(</span><span class="n">send</span><span class="p">,</span><span class="w"> </span><span class="k">open</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">corr_</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">daily_event_counts</span><span class="w">
</span></span></span></code></pre></div><p>In reality, we are only interested in six of these pairwise correlations. The four diagonal entries will always equal one, and the matrix is symmetric, so half of the remaining twelve computations are redundant. For the sake of simplicity, let&rsquo;s ignore this for now, and proceed to calculate all sixteen.</p>
<h3 id="nested-for-loops-with-jinja">Nested for-loops with jinja <a class="anchor" href="#nested-for-loops-with-jinja">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Nested for-loops are relatively straightforward, but I will point out two changes:</p>
<ol>
<li>We want to place <code>UNION ALL</code> after all iterations except the final one. Previously we used <code>{%- if not loop.last -%}</code> to check if it was the final iteration. Since we now have a nested loop, we need to keep track of two indices. We can do this by using the block <code>{% set outer_loop = loop %}</code> to assign the outer loop to a new variable <code>outer_loop</code> before it is &ldquo;replaced&rdquo; by the inner loop.</li>
<li>We add a minus sign (<code>-</code>) on the end of the outer opening block, to avoid getting an additional empty line between iterations of the outer loop. This gives us a consistent spacing of one empty line between each <code>UNION ALL</code> statement.</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Template</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sql</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">%- f</span><span class="s2">or x in cols -%}
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">% s</span><span class="s2">et outer_loop = loop %}
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">%- f</span><span class="s2">or y in cols %}
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#39;{{x}}&#39; AS x,
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#39;{{y}}&#39; AS y,
</span></span></span><span class="line"><span class="cl"><span class="s2">    CORR({{x}}, {{y}}) AS corr_
</span></span></span><span class="line"><span class="cl"><span class="s2">FROM daily_event_counts
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">% i</span><span class="s2">f not (loop.last and outer_loop.last) %}
</span></span></span><span class="line"><span class="cl"><span class="s2">UNION ALL
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">% e</span><span class="s2">ndif %}
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">%- e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">{</span><span class="si">%- e</span><span class="s2">ndfor %}
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">Template</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">cols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;send&#39;</span><span class="p">,</span> <span class="s1">&#39;deliver&#39;</span><span class="p">,</span> <span class="s1">&#39;open&#39;</span><span class="p">,</span> <span class="s1">&#39;click&#39;</span><span class="p">]))</span>
</span></span></code></pre></div><pre><code>SELECT
    'send' AS x,
    'send' AS y,
    CORR(send, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'deliver' AS y,
    CORR(send, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'open' AS y,
    CORR(send, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'click' AS y,
    CORR(send, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'send' AS y,
    CORR(deliver, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'deliver' AS y,
    CORR(deliver, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'open' AS y,
    CORR(deliver, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'click' AS y,
    CORR(deliver, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'send' AS y,
    CORR(open, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'deliver' AS y,
    CORR(open, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'open' AS y,
    CORR(open, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'click' AS y,
    CORR(open, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'send' AS y,
    CORR(click, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'deliver' AS y,
    CORR(click, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'open' AS y,
    CORR(click, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'click' AS y,
    CORR(click, click) AS corr_
FROM daily_event_counts
</code></pre>
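<p>As noted earlier, only six of these sixteen correlations are distinct. A minimal sketch of one way to generate just those six is to build the pairs in Python with <code>itertools.combinations</code> and loop over them once. The variable name <code>pairs</code> is an assumption for this example, and a single loop means <code>loop.last</code> is enough to place <code>UNION ALL</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from itertools import combinations

from jinja2 import Template

sql = """
{%- for x, y in pairs %}
SELECT
    '{{ x }}' AS x,
    '{{ y }}' AS y,
    CORR({{ x }}, {{ y }}) AS corr_
FROM daily_event_counts
{% if not loop.last %}
UNION ALL
{% endif %}
{%- endfor %}
"""

cols = ['send', 'deliver', 'open', 'click']

# Unordered pairs only: no (x, x) diagonal and no mirrored (y, x) duplicates.
pairs = list(combinations(cols, 2))

rendered = Template(sql).render(pairs=pairs)
print(rendered)
</code></pre></div>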
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://jinja.palletsprojects.com/en/2.11.x/templates/#whitespace-control" target="_blank">Template Designer Documentation: Whitespace Control</a></li>
<li><a href="https://jinja.palletsprojects.com/en/3.0.x/tricks/#accessing-the-parent-loop" target="_blank">Tips &amp; Tricks: Accessing the parent Loop</a></li>
</ul>

      ]]></content:encoded></item><item><title>Geotagging Lightroom photos with Google Timeline data</title><link>https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/</link><pubDate>Tuesday, 16 Jun 2020</pubDate><guid>https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/</guid><description>&lt;p>Lightroom has a &lt;em>Maps&lt;/em> view, but I have never really used it before. While all my smartphone photos are automatically geotagged, the 80% of my photos shot on my dedicated camera (Sony A7Rii) lack geodata. I have historically neglected adding geotag info as I have imported photos through the years. As a COVID lockdown project, I decided to try using the location tracking data from Google Timeline to geotag my photos &amp;ldquo;automatically&amp;rdquo;.&lt;/p></description><content:encoded><![CDATA[
        <p>Lightroom has a <em>Maps</em> view, but I have never really used it before. While all my smartphone photos are automatically geotagged, the 80% of my photos shot on my dedicated camera (Sony A7Rii) lack geodata. I have historically neglected adding geotag info as I have imported photos through the years. As a COVID lockdown project, I decided to try using the location tracking data from Google Timeline to geotag my photos &ldquo;automatically&rdquo;.</p>
<h3 id="download-your-location-history-from-google-takeout">Download your location history from Google Takeout <a class="anchor" href="#download-your-location-history-from-google-takeout">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The <a href="https://takeout.google.com/" target="_blank">Google Takeout</a> tool lets you export your data across a variety of Google services. We are only interested in our location history, so click <em>Deselect all</em> and scroll down to the <em>Location History</em> section.</p>
<p><img src="google_takeout.webp" alt="Google Takeout"></p>
<p>You&rsquo;ll receive an email a few minutes later with a download link. After you unzip it, you&rsquo;ll find a large JSON file. Depending on how many years you&rsquo;ve had <em>Location History</em> enabled, the file may be quite big. Mine was ~600 MB for six years of data.</p>
<p>⚠ Note that while the Google Timeline UI displays your location history in local time, the JSON export contains timestamps stored in UTC time. This will be important later.</p>
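<p>To make the UTC versus local time difference concrete, here is a small sketch using Python&rsquo;s standard library. The timestamp and the UTC+2 offset are hypothetical values chosen for illustration, not taken from a real export.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from datetime import datetime, timedelta, timezone

# A hypothetical location fix, as it would be recorded in UTC.
utc_fix = datetime(2018, 7, 14, 12, 30, tzinfo=timezone.utc)

# The same instant shown in a UTC+2 zone (an assumed local offset).
local_zone = timezone(timedelta(hours=2))
local_fix = utc_fix.astimezone(local_zone)

print(utc_fix.isoformat())    # 2018-07-14T12:30:00+00:00
print(local_fix.isoformat())  # 2018-07-14T14:30:00+02:00
</code></pre></div>
<p>A camera clock set to local time would have recorded 14:30 for this moment, which is why the offset has to be supplied when matching photos to the tracklog.</p>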
<h3 id="convert-it-to-gpx-format-and-split-be-year">Convert it to gpx format, and split by year <a class="anchor" href="#convert-it-to-gpx-format-and-split-be-year">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Google Takeout gives us our location data in a JSON format, but most map software—including Lightroom—expects a GPX format. Luckily there is an excellent <a href="https://github.com/Scarygami/location-history-json-converter" target="_blank">location-history-json-converter</a> python script available on GitHub which solves our exact problem.</p>
<p>I recommend splitting your location data by year to ensure the files are reasonably small. If you live in a timezone which observes daylight savings time, you will need to read each year&rsquo;s file three separate times into the Lightroom plugin, setting the appropriate timezone each time. Even with a year of data, there is a 20-30 second lag when reading the file using the Lightroom plugin. This is annoying, but manageable.</p>
<p>Here is an example of the terminal commands to convert your JSON export into yearly gpx files:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl"><span class="c1"># make a working directory to keep things together</span>
</span></span><span class="line"><span class="cl">mkdir location_history <span class="o">&amp;&amp;</span> <span class="nb">cd</span> location_history
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># move your downloaded json file into our directory</span>
</span></span><span class="line"><span class="cl">mv <span class="s2">&#34;</span><span class="nv">$HOME</span><span class="s2">/Downloads/Location History.json&#34;</span> location_history.json
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># download the conversion tool</span>
</span></span><span class="line"><span class="cl">git clone https://github.com/Scarygami/location-history-json-converter.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> location-history-json-converter
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># perform the actual conversion</span>
</span></span><span class="line"><span class="cl">python location_history_json_converter.py ../location_history.json 2018.gpx <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>	-s 2018-01-01 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>	-e 2019-01-01 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>	-f gpx
</span></span></code></pre></div><h3 id="download-lightroom-plugin">Download Lightroom plugin <a class="anchor" href="#download-lightroom-plugin">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Jeffrey Friedl, the king of Lightroom plugins, has an excellent <a href="http://regex.info/blog/lightroom-goodies/gps" target="_blank">geo-encoding plugin</a> which handles most of the heavy lifting for us. It has a plethora of settings, including fuzzy matching. This is important because smartphone GPS samples less frequently than a dedicated hiking unit, so the location data will often not match our photo timestamps exactly.</p>
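<p>To give a feel for what fuzzy matching involves, here is a minimal sketch that picks the track point closest in time to a photo and rejects it if the gap exceeds a tolerance. This is not the plugin&rsquo;s actual algorithm. The function name <code>nearest_point</code>, the ten-minute default and the sample coordinates are all assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_point(track, photo_time, tolerance=timedelta(minutes=10)):
    """Return the (time, lat, lon) point closest to photo_time, or None.

    track must be sorted by time, and all times must share one timezone.
    """
    times = [point[0] for point in track]
    i = bisect_left(times, photo_time)
    # Only the neighbours just before and just after photo_time can be closest.
    candidates = track[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda point: abs(point[0] - photo_time))
    if abs(best[0] - photo_time) > tolerance:
        return None
    return best


# Hypothetical tracklog with a 20-minute gap between samples.
track = [
    (datetime(2018, 7, 14, 10, 0), 52.5, 13.4),
    (datetime(2018, 7, 14, 10, 20), 52.6, 13.5),
]

print(nearest_point(track, datetime(2018, 7, 14, 10, 4)))
print(nearest_point(track, datetime(2018, 7, 14, 12, 0)))
</code></pre></div>
<p>The first photo is four minutes from a sample and gets matched. The second is more than an hour and a half from the nearest sample, so it is left untagged.</p>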
<h3 id="fix-your-timezones">Fix your timezones <a class="anchor" href="#fix-your-timezones">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If you&rsquo;re like me, you may have been lazy about keeping your camera&rsquo;s clock up-to-date with daylight saving changes, or when traveling in a different timezone. If you geotag all your photos in a single batch, you&rsquo;ll likely notice that many photos fail, even with fuzzy matching enabled, and others are placed in an entirely wrong location.</p>
<p>Here comes the un-fun part. Crack a beer, put on some music, and spend a few hours working through your Lightroom catalog side-by-side with your Google Timeline. Here was my rough workflow:</p>
<ol>
<li>Find an image that is of an identifiable location, then cross-reference with Google Timeline to discern whether the timestamp is correct.</li>
<li>Select an entire batch of photos (same trip), and adjust their timestamps in Lightroom using <code>Menu → Metadata → Edit capture time → Shift by set number of hours</code>.</li>
<li>Mark down the <a href="https://www.timeanddate.com/" target="_blank">UTC offset</a> for the date range of photos you processed, since you&rsquo;ll need to enter this value when running the geo-encoding plugin.</li>
</ol>
<h3 id="geotag-your-photos">Geotag your photos <a class="anchor" href="#geotag-your-photos">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>You&rsquo;ll need to work in batches, one for each timezone your photos were taken in.</p>
<ol>
<li>Bring up the geocoding prompt in Lightroom from <code>Library → Plugin extras → Geoencode</code>.</li>
<li>Select the gpx tracklog file corresponding to the year of the photos you are encoding. It may take 20-30 seconds to read the file before allowing you to select options.</li>
<li>Set the UTC offset for the batch, keeping in mind daylight savings time.</li>
<li>If your photo library also contains smartphone photos, you probably don&rsquo;t want to overwrite their location info. In this case, make sure to select <em>Process only those still unmapped</em>.</li>
<li>Click <em>Geoencode images</em> and take a deep breath.</li>
<li>Pay attention to the summary prompt: if more than 30% of photos failed, you may have missed a timezone problem.</li>
</ol>
<p><img src="plugin_prompt.png" alt="Plugin prompt"></p>
<h3 id="tidy-up-metadata">Tidy up metadata <a class="anchor" href="#tidy-up-metadata">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>After running the geocoding, head over to the <em>Maps</em> view and spot-check the results. If your capture times and UTC offset are correct, most of the photo locations should be <em>reasonably accurate</em>. There will still be some weird results, simply because the raw GPS data sometimes &ldquo;jumps&rdquo; around, particularly when you are in rural areas and/or hilly terrain. You can tidy those up in a couple of ways:</p>
<ol>
<li>Right-click on image → Metadata Presets → Copy Metadata → GPS, then paste onto other image.</li>
<li>Select images → Right-click on map to set GPS.</li>
<li>Simply drag the photos from the film strip onto the location on the map where they belong.</li>
</ol>
<h2 id="reflections">Reflections <a class="anchor" href="#reflections">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="was-it-worth-the-effort">Was it worth the effort? <a class="anchor" href="#was-it-worth-the-effort">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I spent significantly longer on this little project than I expected to. Nevertheless, I think it was a worthwhile endeavour. Here are my takeaways:</p>
<ol>
<li>In the future I will be more vigilant about updating my camera clock when traveling, or when DST changes. I set a recurring Google calendar event for the latter.</li>
<li>When traveling somewhere where I do not have a data plan, I should disable roaming rather than putting my phone in airplane mode, since the latter disables GPS as well.</li>
<li>Ideally I should perform geotagging as part of my regular editing workflow, rather than doing it all at once.</li>
</ol>
<p>I haven&rsquo;t found a perfect solution for #3 though, since downloading my entire location history and converting it is a hassle, and not something I want to do after every day trip. In the future, I may write a script which uses Google APIs to schedule an export, convert to gpx, split into daily files, and drop it somewhere in Google Drive.</p>
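<p>The last two steps of such a script (splitting into daily files and writing gpx) could look roughly like the sketch below. It is an illustration only: the input is assumed to be a list of <code>(timestamp, latitude, longitude)</code> tuples with timezone-aware timestamps, the function names are hypothetical, and scheduling and parsing the Google export are not shown.</p>

```python
from collections import defaultdict
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"


def split_by_day(points):
    """Group (timestamp, lat, lon) tuples by UTC calendar day."""
    days = defaultdict(list)
    for ts, lat, lon in points:
        days[ts.astimezone(timezone.utc).date()].append((ts, lat, lon))
    return dict(days)


def to_gpx(points):
    """Serialise points as a single-track GPX 1.1 document (returned as a string)."""
    gpx = ET.Element("gpx", version="1.1", creator="daily-split-sketch", xmlns=GPX_NS)
    segment = ET.SubElement(ET.SubElement(gpx, "trk"), "trkseg")
    for ts, lat, lon in sorted(points):
        point = ET.SubElement(segment, "trkpt", lat=f"{lat:.7f}", lon=f"{lon:.7f}")
        utc = ts.astimezone(timezone.utc)
        ET.SubElement(point, "time").text = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ET.tostring(gpx, encoding="unicode")


# Made-up sample readings spanning two days.
points = [
    (datetime(2020, 6, 13, 14, 0, 10, tzinfo=timezone.utc), 52.475391, 13.401893),
    (datetime(2020, 6, 13, 14, 4, 13, tzinfo=timezone.utc), 52.497853, 13.391048),
    (datetime(2020, 6, 14, 9, 30, 0, tzinfo=timezone.utc), 52.520008, 13.404954),
]
daily = split_by_day(points)
for day, day_points in daily.items():
    # A real script would write to_gpx(day_points) to a file named after the day.
    print(day.isoformat(), len(day_points))
```

<p>Grouping by UTC day keeps the files consistent regardless of where the photos were taken; the timezone handling then stays in the geo-encoding step, where the UTC offset is entered per batch.</p>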
<h3 id="what-about-daily-exports-from-google-timeline">What about daily exports from Google Timeline? <a class="anchor" href="#what-about-daily-exports-from-google-timeline">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Google Timeline&rsquo;s web UI contains an option for exporting a single day to a KML file, but it yields less accurate data than the JSON export from Google Takeout. The JSON export contains raw GPS readings, while the KML export contains the more processed data that you see in Google Timeline, where it aggregates points together into inferred paths and journeys.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/google_timeline_screenshot_hu_9147d33f05fbf2a.png 480w,
                
                       https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/google_timeline_screenshot_hu_5a8aa583ba2d68f4.png 800w,
                
                       https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/google_timeline_screenshot_hu_9454d544875a19da.png 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/geotag-lightroom-photos-with-google-location-history/google_timeline_screenshot_hu_5a8aa583ba2d68f4.png"
                
    
            
                alt="Raw GPS data can be noisy, but arguably that&rsquo;s what we want here."/> <figcaption>
                <p>Raw GPS data can be noisy, but arguably that&rsquo;s what we want here.</p>
            </figcaption>
    </figure>
<p>In the KML file, the raw measurements are grouped by route with a start and an end timestamp. This presents two problems:</p>
<ol>
<li>Google uses a non-standard format which is not easily converted into a gpx file with GPSBabel.</li>
<li>Raw measurements are no longer associated with specific timestamps; they could be anywhere between the start and end timestamp of their grouping. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
</ol>
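<p>To make problem #2 concrete: about the best a converter could do is assume the points of a route are evenly spaced between its start and end timestamps. The sketch below does exactly that, under stated assumptions. The function name <code>spread_timestamps</code> is hypothetical, and its arguments are the text contents of a route&rsquo;s <code>&lt;coordinates&gt;</code>, <code>&lt;begin&gt;</code> and <code>&lt;end&gt;</code> elements, with the values shortened for readability.</p>

```python
from datetime import datetime


def spread_timestamps(coordinates: str, begin: str, end: str):
    """Assign evenly spaced timestamps to the points of one KML LineString.

    KML lists each point as "lon,lat,alt", separated by whitespace.
    The returned times are guesses, not measurements.
    """
    start = datetime.fromisoformat(begin.replace("Z", "+00:00"))
    stop = datetime.fromisoformat(end.replace("Z", "+00:00"))
    triples = [chunk.split(",") for chunk in coordinates.split()]
    # Avoid dividing by zero when a route contains a single point.
    step = (stop - start) / max(len(triples) - 1, 1)
    return [
        (start + i * step, float(lat), float(lon))
        for i, (lon, lat, _alt) in enumerate(triples)
    ]


track = spread_timestamps(
    "13.401893,52.475391,0 13.3918281,52.4985295,0 13.391048,52.497853,0",
    "2020-06-13T14:00:10.523Z",
    "2020-06-13T14:04:13.078Z",
)
for when, lat, lon in track:
    print(when.isoformat(), lat, lon)
```

<p>Only the first and last timestamps come from the export. Every point in between is placed by interpolation, so a photo taken mid-route could be matched to the wrong position.</p>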
<p>Here is an example of the data format. While it is theoretically possible to write a script to convert this into a valid gpx format to mitigate problem #1, I&rsquo;m not sure this would be worthwhile, since the timestamps would still be less accurate.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-xml" data-lang="xml"><span class="line"><span class="cl"><span class="nt">&lt;Placemark&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;name&gt;</span>On the subway<span class="nt">&lt;/name&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;address&gt;&lt;/address&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;description&gt;</span> On the subway from 2020-06-13T14:00:10.523Z to 2020-06-13T14:04:13.078Z. Distance 2753m <span class="nt">&lt;/description&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;LineString&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;altitudeMode&gt;</span>clampToGround<span class="nt">&lt;/altitudeMode&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;extrude&gt;</span>1<span class="nt">&lt;/extrude&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;tesselate&gt;</span>1<span class="nt">&lt;/tesselate&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;coordinates&gt;</span>13.401893,52.475390999999995,0 13.401893,52.475390999999995,0 13.3918281,52.4985295,0 13.391048022278145,52.49785324977018,0 <span class="nt">&lt;/coordinates&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;/LineString&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;TimeSpan&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;begin&gt;</span>2020-06-13T14:00:10.523Z<span class="nt">&lt;/begin&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;end&gt;</span>2020-06-13T14:04:13.078Z<span class="nt">&lt;/end&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&lt;/TimeSpan&gt;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/Placemark&gt;</span>
</span></span></code></pre></div><h3 id="reverse-geo-encoding">Reverse geo-encoding <a class="anchor" href="#reverse-geo-encoding">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Reverse geo-encoding populates fields like <em>City</em>, <em>State</em> and <em>Country</em> based on the raw GPS data. Lightroom has built-in reverse geo-encoding, which I find satisfactory. But Jeffrey&rsquo;s plugin can also perform reverse geo-encoding using Google location data. Apparently it is much more accurate. A key feature is that you can specify a <em>My Maps</em> file with custom-named locations, and the plugin will use those location names whenever possible.</p>
<blockquote>
<p>The plugin can reverse-geocode via both Google and OpenStreetMap, though in order to use Google, you must create a developer&rsquo;s API key, and enter that into the plugin in the Plugin Manager. (The egregiously-complex steps needed to create the Google API key are beyond my ability to explain as of yet, sorry.)</p></blockquote>
<p>To use this feature you need to first set up a Google Cloud Platform account, then <a href="https://developers.google.com/places/web-service/get-api-key" target="_blank">generate an API key</a>, and enable the <a href="https://console.cloud.google.com/apis/library/geocoding-backend.googleapis.com" target="_blank">Geocoding API</a>. Keep in mind that a large quantity of requests <a href="https://cloud.google.com/maps-platform/pricing" target="_blank">could cost you</a>. GCP does give you $200 in free monthly credits for maps-related APIs though, which translates to roughly 40k requests. Definitely enough for occasional use, just take care not to reverse geo-encode your entire photo library in a single month.</p>
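<p>As a back-of-the-envelope check, the two figures above imply a price of about half a cent per request. The sketch below uses only those two numbers and assumes one request per photo; the library size is a made-up example, and current pricing may differ.</p>

```python
import math

FREE_CREDIT_USD = 200.0  # monthly credit quoted above
FREE_REQUESTS = 40_000   # rough request count quoted above
PRICE_PER_REQUEST = FREE_CREDIT_USD / FREE_REQUESTS  # implied price in USD


def months_to_stay_free(photo_count: int) -> int:
    """Months needed to process a library without exceeding the monthly credit."""
    return math.ceil(photo_count / FREE_REQUESTS)


def cost_in_one_month(photo_count: int) -> float:
    """Out-of-pocket cost in USD if the whole library is processed in one month."""
    return max(photo_count - FREE_REQUESTS, 0) * PRICE_PER_REQUEST


library = 100_000  # hypothetical library size
print(months_to_stay_free(library), cost_in_one_month(library))
```

<p>Under these assumptions, spreading a large library over a few months keeps the cost at zero.</p>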
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://gyng.github.io/book/articles/geotag/geotag.html" target="_blank">Guide to geotagging photos with Google location history and exiftool</a> – A guide for geotagging using <code>exiftool</code> rather than Lightroom.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://www.photools.com/community/index.php?topic=6919.0" target="_blank">https://www.photools.com/community/index.php?topic=6919.0</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>How to learn mental models with spaced repetition</title><link>https://geoffruddock.com/mental-models-with-anki/</link><pubDate>Friday, 01 May 2020</pubDate><guid>https://geoffruddock.com/mental-models-with-anki/</guid><description>&lt;p>As a subscriber to the &lt;a href="https://fs.blog/" target="_blank">Farnam Street&lt;/a> newsletter, I enjoy reading Shane&amp;rsquo;s articles about using various mental models from other disciplines to improve our decision-making. Reading about these mental models is fun, but I am cognizant of the fact that reading about something does not equate to &lt;em>learning&lt;/em> it. The real measure of success for learning a mental model is: &lt;em>Can I reliably recall the relevant properties of this concept in a useful real-world scenario?&lt;/em>&lt;/p></description><content:encoded><![CDATA[
        <p>As a subscriber to the <a href="https://fs.blog/" target="_blank">Farnam Street</a> newsletter, I enjoy reading Shane&rsquo;s articles about using various mental models from other disciplines to improve our decision-making. Reading about these mental models is fun, but I am cognizant of the fact that reading about something does not equate to <em>learning</em> it. The real measure of success for learning a mental model is: <em>Can I reliably recall the relevant properties of this concept in a useful real-world scenario?</em></p>
<p>Since this definition of success hinges on successful recall, it is an ideal candidate for spaced repetition software. <a href="https://geoffruddock.com/three-years-of-spaced-repetition-with-anki/" target="_blank">Anki</a> is a popular spaced repetition (flashcards) app which I have used in language-learning and technical contexts over the past few years. So the question then becomes: how can we best use Anki to facilitate the process of learning mental models?</p>
<h2 id="notes-fields-and-cards-oh-my">Notes, fields, and cards, oh my! <a class="anchor" href="#notes-fields-and-cards-oh-my">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Before we continue, it will be helpful to briefly review the terminology used by Anki, as there are some subtleties here. When you are <em>reviewing</em>, you work through a <strong>deck</strong> (like a folder, usually grouped by theme) of <strong>cards</strong>, each with a front and back side. But when you are <em>creating</em> flashcards, you actually do this by choosing a <strong>note type</strong> and adding information into the <strong>fields</strong> on that note. Each note type has associated <strong>card types</strong>, which are essentially templates built with HTML and CSS to determine which fields go where.</p>
<p>The default note type is called <em>Basic</em> and has only two fields: <em>Front</em> and <em>Back</em>. This basic note type has a single card type which displays those fields on the corresponding sides of a card. This is an intuitive place to start with spaced repetition, but it is also a <a href="https://en.wikipedia.org/wiki/Skeuomorph" target="_blank">skeuomorph</a> for physical flashcards. With digital flashcards, there is no limit to the number of fields and card types we can have within a single note type.</p>
<h2 id="a-naive-approach">A naive approach <a class="anchor" href="#a-naive-approach">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>One of my early learnings from making mediocre Anki cards is that <a href="https://geoffruddock.com/three-years-of-spaced-repetition-with-anki/#memory-is-directional" target="_blank">memory is directional</a>. Just because you can name a concept doesn&rsquo;t mean you can define it, and vice versa. So for almost any concept worth learning, we will ultimately want to generate multiple cards to test both <em>recognition</em> and <em>production</em>.</p>
<p>I started off by simply creating a bunch of <em>Basic</em> note cards for each mental model I wanted to memorize. As an example, here are four cards I created when learning about the <a href="https://en.wikipedia.org/wiki/Streetlight_effect" target="_blank">Streetlight effect</a>.</p>
<table>
  <thead>
      <tr>
          <th>Front</th>
          <th>Back</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>What is the Streetlight Effect?</td>
          <td>An observational bias where a person who is searching for something looks only where it is easiest.</td>
      </tr>
      <tr>
          <td>What is the name for <em>An observational bias where a person who is searching for something looks only where it is easiest</em>?</td>
          <td>The Streetlight Effect</td>
      </tr>
      <tr>
          <td>What is the implication of the Streetlight Effect?</td>
          <td>We must be careful to focus our problem-solving efforts towards the area <em>where the solution is likely to be</em> rather than the area where we have the most data.</td>
      </tr>
      <tr>
          <td>What does this picture represent?<br /><img src="streetlight_effect.jpg" alt="img"></td>
          <td>The Streetlight Effect</td>
      </tr>
  </tbody>
</table>
<p>I did this for 40+ concepts and mental models over the course of a few months. In the process, I noticed some recurring flaws with this approach.</p>
<h3 id="flaw-1-card-creation-involves-cognitive-overhead">Flaw #1: Card creation involves cognitive overhead <a class="anchor" href="#flaw-1-card-creation-involves-cognitive-overhead">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Every time we sit down to create new cards, we need to remember all the different front–back pairings above, and manually create a card for each pair. Suppose we occasionally forget to make a particular card for a concept. When this happens, there is no easy way for us to discover its absence later.</p>
<p>The flexibility of pure Front ⟷ Back cards is useful when we are encoding unstructured information, but becomes a burden when many of our cards follow a similar schema. It took me a while to develop this schema, but now when I read about a new concept I am always subconsciously asking myself:</p>
<ul>
<li>How can I explain this concept to a five year-old?</li>
<li>What is the key implication or use-case for this concept?</li>
<li>When does this concept/tool/approach fail? When should I avoid it?</li>
</ul>
<p>Once we have an implicit schema like this, we can leverage the power of note types and card types in Anki to automate our flashcard creation.</p>
<h3 id="flaw-2-there-is-boilerplate-text">Flaw #2: There is boilerplate text <a class="anchor" href="#flaw-2-there-is-boilerplate-text">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The above example cards each include some boilerplate text on the front of the card:</p>
<ul>
<li>What is…</li>
<li>What is the name for…</li>
<li>What is the implication of…</li>
<li>What does this represent…</li>
</ul>
<p>These chunks of text are necessary to prime our brain about what specific piece of information we are asking it to recall. But they require a non-trivial amount of time to verbally process before we can then move on to the actual act of recall. And after writing them 40+ times we will inevitably have some variation in wording, which further increases the cognitive burden. Ideally our card structure itself could signal what piece of information we are being asked to recall.</p>
<h3 id="flaw-3-it-is-difficult-to-refactor">Flaw #3: It is difficult to refactor <a class="anchor" href="#flaw-3-it-is-difficult-to-refactor">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>One infrequently discussed element of spaced repetition learning is <a href="http://cognitivemedium.com/srs-mathematics" target="_blank">card refactoring</a>. This is not a concern if we are using flashcards to memorize the birthdays of US presidents, as that information will never change. Our initial encoding of the information is probably good enough, even five years from now.</p>
<p>But what if we are using spaced repetition to learn <a href="https://www.coursera.org/learn/learning-how-to-learn" target="_blank">compressible topics</a> such as mathematics? In this case, we are almost never encoding <em>raw factual information</em> in our cards, but rather a snapshot of our current understanding of the topic. As we continue to develop our understanding of a topic and establish connections across disciplines, we will frequently notice that our previous understanding of a concept was imprecise or subtly wrong.</p>
<p>When we update our understanding, we should also update our flashcards, so that they remain relevant and valuable to us. With 5-10 disconnected cards for a topic floating around our Anki deck, it can be a burden to find and update all of the cards related to a particular topic. For example, you may have some cards related to the <a href="https://en.wikipedia.org/wiki/Law_of_total_probability" target="_blank">law of total probability</a>, but you may also have a number of other cards which reference that concept. Simply searching for that text across all your decks will surface not just the concept cards, but also these other cards which mention the concept.</p>
<h2 id="a-note-template-for-mental-models">A note template for mental models <a class="anchor" href="#a-note-template-for-mental-models">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>With the above criteria in mind, I built my own &ldquo;concept&rdquo; note type to attempt to address these issues.  Here are a few screenshots to show what the cards look like on AnkiDroid. Below I explain a few of the features, and why I made particular design decisions.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/mental-models-with-anki/concept_to_implication_hu_916367f64e37b87d.jpg 480w,
                
                       https://geoffruddock.com/mental-models-with-anki/concept_to_implication_hu_e85e000a4bd1470d.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/mental-models-with-anki/concept_to_implication_hu_e85e000a4bd1470d.jpg"
                
    
            
                alt="Concept → Implication"/> <figcaption>
                <p>Concept → Implication</p>
            </figcaption>
    </figure>


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/mental-models-with-anki/definition_to_concept_hu_f9dfcab0d1b9e86.jpg 480w,
                
                       https://geoffruddock.com/mental-models-with-anki/definition_to_concept_hu_41280b2b362437d9.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/mental-models-with-anki/definition_to_concept_hu_41280b2b362437d9.jpg"
                
    
            
                alt="Definition → Concept"/> <figcaption>
                <p>Definition → Concept</p>
            </figcaption>
    </figure>


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/mental-models-with-anki/visual_to_concept_hu_7f7d512c4f48b42d.jpg 480w,
                
                       https://geoffruddock.com/mental-models-with-anki/visual_to_concept_hu_ecb77d7ed8b6b2ce.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/mental-models-with-anki/visual_to_concept_hu_ecb77d7ed8b6b2ce.jpg"
                
    
            
                alt="Visual → Concept"/> <figcaption>
                <p>Visual → Concept</p>
            </figcaption>
    </figure>


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/mental-models-with-anki/full_card_hu_3ce2b569c246344a.jpg 480w,
                
                       https://geoffruddock.com/mental-models-with-anki/full_card_hu_226d7b77fe74c603.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/mental-models-with-anki/full_card_hu_226d7b77fe74c603.jpg"
                
    
            
                alt="Full card revealed"/> <figcaption>
                <p>Full card revealed</p>
            </figcaption>
    </figure>


        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 400px
    }

</style>
<h3 id="centered-layouts-with-css-flexbox">Centered layouts with CSS Flexbox <a class="anchor" href="#centered-layouts-with-css-flexbox">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Designing card layouts on Anki 2.0 was a pain, but since version 2.1, Anki uses a new rendering engine which supports modern web development techniques such as <a href="https://css-tricks.com/snippets/css/a-guide-to-flexbox/" target="_blank">CSS Flexbox</a>. Our core design makes use of flexbox to ensure that cards are centered both horizontally and vertically on the screen. On mobile, cards typically take up the full width, but on desktop they are limited to ~800px for better readability on wide screens. The goal here is to reduce cognitive overhead when rapidly reviewing many different cards.</p>
<h3 id="blurred-answers-with-global-javascript">Blurred answers with global JavaScript <a class="anchor" href="#blurred-answers-with-global-javascript">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Rather than having six different <em>Front Side</em> and <em>Back Side</em> designs (one for each card type), I elected to use a single note layout across all cards, and then to simply blur the field I wish to recall. The back side of each card includes no new content; it just calls a <a href="https://github.com/asdfgeoff/anki-templates/blob/master/mental-models/src/_master.js" target="_blank">JavaScript function</a> which removes the blur from the recall field.</p>
<p>This removes the overhead of managing 6x2=12 different HTML templates, enforces consistency, and further reduces cognitive overhead when reviewing. Since the location of the content does not change on answer reveal, there is no time required to grok the structure of the answer. There is also an implicit encoding of <em>data type as location</em>, which I found my brain picked up on subconsciously after a few review sessions.</p>
<h3 id="creating-cards">Creating cards <a class="anchor" href="#creating-cards">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3>
    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/mental-models-with-anki/add_prompt_hu_dbdeb51e626c771b.png 480w,
                
                       https://geoffruddock.com/mental-models-with-anki/add_prompt_hu_c1f65a3a47e904bf.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/mental-models-with-anki/add_prompt_hu_c1f65a3a47e904bf.png"
                
    
            
                alt="Prompt to add a new card"/> 
    </figure>
<p>The template lives in a <a href="https://apps.ankiweb.net/docs/manual.html#note-types" target="_blank">custom note type</a> called <em>Concept</em> which contains five fields: name, description, visual, implication, drawbacks. It is okay if you do not use every field. There are six card types which use <a href="https://apps.ankiweb.net/docs/manual.html#selective-card-generation" target="_blank">selective card generation</a> to only create flashcards when the necessary fields are non-empty:</p>
<ol>
<li>Concept → Description</li>
<li>Concept → Implication</li>
<li>Concept → Drawbacks</li>
<li>Description → Concept</li>
<li>Visual → Concept</li>
<li>Implication → Concept</li>
</ol>
<p>The first three deal with <em>recognition</em>. Can we recall the specific details of the concept when directly prompted? The last three deal with <em>production</em>. Can we identify the correct concept from a description, visual representation, or implication / use-case? While the first three cards cover many academic use-cases, the final three are the ones which more directly influence our ability to recognize and apply mental models in a real-life context. So far, this approach feels like it has worked reasonably well.</p>
<p>I often add a note with only 2-3 fields, then come back and add another field or two a couple months later, when I have a more nuanced understanding of a concept. I experimented with additional fields, but found that these are the minimal set which cover 80% of my card generation needs. I&rsquo;m not just invoking the 80/20 cliché here—I checked my Anki stats and found that I have a 4:1 ratio of these <em>Concept</em> cards to my more basic <em>Q&amp;A</em> card, which I use for cards that don&rsquo;t fit the mould.</p>
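<p>The effect of selective card generation on a partially filled note can be modelled in a few lines. This is a simplified illustration of the behaviour described above, not how Anki implements it: it assumes a card exists only when both its prompt field and its answer field are non-empty, and the names <code>CARD_TYPES</code> and <code>cards_for</code> are made up for the example.</p>

```python
# Each card type maps the field shown as the prompt to the field to be recalled.
CARD_TYPES = [
    ("name", "description"),
    ("name", "implication"),
    ("name", "drawbacks"),
    ("description", "name"),
    ("visual", "name"),
    ("implication", "name"),
]


def cards_for(note: dict) -> list:
    """Return the (prompt, answer) pairs whose fields are both non-empty."""
    return [
        (front, back)
        for front, back in CARD_TYPES
        if note.get(front, "").strip() and note.get(back, "").strip()
    ]


# A note with only two of the five fields filled in.
note = {
    "name": "Streetlight effect",
    "description": "An observational bias where a person who is searching "
                   "for something looks only where it is easiest.",
    "visual": "",
    "implication": "",
    "drawbacks": "",
}
for front, back in cards_for(note):
    print(f"{front} -> {back}")
```

<p>Filling in the <em>implication</em> field later would add two more cards to the same note, which is what makes it cheap to start with 2-3 fields and extend the note as understanding improves.</p>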
<h3 id="how-to-download-it">How to download it <a class="anchor" href="#how-to-download-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>You can find the template on <a href="https://github.com/asdfgeoff/anki-templates/tree/master/mental-models" target="_blank">GitHub</a>. There is no way to import/export a <em>note type</em> by itself, so the workaround is to import the sample deck, which contains the note type, the necessary CSS and JavaScript, and also 30 example cards which showcase the template.</p>
<h2 id="known-issues">Known issues <a class="anchor" href="#known-issues">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>This template is still somewhat of a work in progress. Here are a few issues that still need to be fixed. Feel free to send a PR on GitHub if you are interested in helping improve the template.</p>
<h3 id="flashing-caused-by-mathjax">Flashing caused by MathJax <a class="anchor" href="#flashing-caused-by-mathjax">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I have recently <a href="https://geoffruddock.com/anki-math-typesetting/" target="_blank">changed all of my math expressions from LaTeX to MathJax</a> in Anki. It&rsquo;s much nicer to work with, but one disadvantage is that it causes the cards to briefly &ldquo;flash&rdquo; when displayed, as the underlying markup is being typeset in real time. Unfortunately, I found this to be <em>more</em> noticeable and annoying with this template, because the rest of the card is otherwise identical. On the basic card template, by contrast, so much of the card changes on answer reveal that the typesetting is less noticeable.</p>
<h3 id="cards-which-are-too-large-for-mobile">Cards which are too large for mobile <a class="anchor" href="#cards-which-are-too-large-for-mobile">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If your card has a lot of text, some of it may be hidden below the fold when reviewing on the mobile app. I am considering adding some shadow styling to indicate when there is scrollable content on a card. But one could argue that if your card has too much text to fit onto a single screen, you should break up the card into more atomic units anyways.</p>
<h3 id="identifying-links-across-topics">Identifying links across topics <a class="anchor" href="#identifying-links-across-topics">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Anecdotally, I feel like I recognize connections between concepts and fields more frequently than I did before. One small hiccup I have not yet solved is how to link concepts across disciplines that are closely related but perhaps not identical. For example, after reading books about behavioural psychology, the philosophy of Stoicism and cognitive behavioural therapy (CBT), I have noticed they have many parallel concepts but with different terminology and subtly different implications.</p>

      ]]></content:encoded></item><item><title>Bulk compress videos to H.265 (x265) with ffmpeg</title><link>https://geoffruddock.com/bulk-compress-videos-x265-with-ffmpeg/</link><pubDate>Tuesday, 21 Apr 2020</pubDate><guid>https://geoffruddock.com/bulk-compress-videos-x265-with-ffmpeg/</guid><description>&lt;p>Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer &lt;a href="https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding" target="_blank">H.265 HEVC codec&lt;/a>. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video.&lt;/p>
&lt;p>The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> There are some third-party apps such as &lt;a href="https://play.google.com/store/apps/details?id=com.neaststudios.ultracorder&amp;amp;hl=en" target="_blank">UltraCorder&lt;/a> which support H.265, but I&amp;rsquo;d prefer to stick with the stock camera app. I frequently use the handy &amp;ldquo;double tap power button&amp;rdquo; shortcut to quickly launch my camera app, so it is important that the app which launches is able to handle both photos and videos. This rules out using a specialty video app which supports H.265 encoding.&lt;/p></description><content:encoded><![CDATA[
        <p>Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer <a href="https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding" target="_blank">H.265 HEVC codec</a>. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video.</p>
<p>The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> There are some third-party apps such as <a href="https://play.google.com/store/apps/details?id=com.neaststudios.ultracorder&amp;hl=en" target="_blank">UltraCorder</a> which support H.265, but I&rsquo;d prefer to stick with the stock camera app. I frequently use the handy &ldquo;double tap power button&rdquo; shortcut to quickly launch  my camera app, so it is important that the app which launches is able to handle both photos and videos. This rules out using a specialty video app which supports H.265 encoding.</p>
<h2 id="converting-a-single-video">Converting a single video <a class="anchor" href="#converting-a-single-video">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>It&rsquo;s pretty easy to compress these videos using the <code>ffmpeg</code> command line tool with something like the command below.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">ffmpeg -i input.mp4 -vcodec libx265 -crf <span class="m">28</span> output.mp4
</span></span></code></pre></div><p>I tried a few different quality settings before settling on the default <code>-crf 28</code>. At this level, I cannot visually tell the difference in quality between the re-encoded and original videos.</p>
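<p>If you would rather drive the same command from Python, a minimal sketch could look like the one below. The function names <code>build_x265_command</code> and <code>compress_video</code> are hypothetical, and the sketch assumes <code>ffmpeg</code> is available on your <code>PATH</code>.</p>

```python
import subprocess
from pathlib import Path
from typing import List


def build_x265_command(src: Path, dst: Path, crf: int = 28) -> List[str]:
    """Return the ffmpeg argument list used above, with a configurable CRF."""
    return ["ffmpeg", "-i", str(src), "-vcodec", "libx265", "-crf", str(crf), str(dst)]


def compress_video(src: Path, dst: Path, crf: int = 28) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_x265_command(src, dst, crf), check=True)


if __name__ == "__main__":
    # Only prints the command; call compress_video(...) to actually encode.
    print(" ".join(build_x265_command(Path("input.mp4"), Path("output.mp4"))))
```

<p>Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.</p>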
<h2 id="converting-videos-in-bulk">Converting videos in bulk <a class="anchor" href="#converting-videos-in-bulk">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>So <code>ffmpeg</code> is great, but I wanted to batch process all of my phone videos, with the following goals in mind:</p>
<ul>
<li>Recursively search sub-directories, since I typically organize my photos and videos into a folder hierarchy.</li>
<li>Only compress videos which are not already encoded as HEVC (using <code>ffprobe</code> to check whether the codec is <code>hevc</code>).</li>
<li>Preserve the metadata (creation time, modification time) from the original video, so that chronological sort continues to work properly.</li>
</ul>
<p>So I wrote a small Python CLI tool which achieves the above goals.</p>
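<p>To make the three goals concrete, here is a rough sketch of the building blocks involved. It is an illustration under stated assumptions, not the script itself: the function names are made up, <code>ffprobe</code> is assumed to be on your <code>PATH</code>, and <code>os.utime</code> restores only the filesystem access and modification times, not the creation time.</p>

```python
import os
import subprocess
from pathlib import Path
from typing import List


def find_videos(root: Path, ext: str = "mp4", recursive: bool = True) -> List[Path]:
    """List files with the given extension, optionally descending into sub-directories."""
    pattern = f"*.{ext}"
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


def video_codec(path: Path) -> str:
    """Ask ffprobe for the codec of the first video stream, e.g. 'h264' or 'hevc'."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def needs_compression(codec: str) -> bool:
    """Skip files that are already HEVC."""
    return codec.lower() != "hevc"


def copy_timestamps(src: Path, dst: Path) -> None:
    """Copy access and modification times so chronological sorting still works."""
    stat = src.stat()
    os.utime(dst, ns=(stat.st_atime_ns, stat.st_mtime_ns))
```

<p>A batch run would then loop over <code>find_videos(...)</code>, skip any file for which <code>needs_compression(video_codec(path))</code> is false, encode the rest, and call <code>copy_timestamps</code> on each output.</p>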
<h2 id="the-script">The script <a class="anchor" href="#the-script">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Put the script somewhere convenient (such as your home folder), <code>cd</code> to your content directory, then call the script from the terminal using <code>python ~/compress_videos.py --recursive --file-ext=mp4 .</code></p>
<p>☠ <strong>Use at your own risk</strong> – I accept no responsibility for any data loss or mistakes caused by this script. It is prudent to test it first and visually inspect the output. You should do so for each different type of input file you are converting. I ran into difficulty with a couple of <code>.MTS</code> video files which were interlaced, and so required different settings to convert properly.</p>
<script src="https://gist.github.com/asdfgeoff/62b155ee4ea6b81c9175c39ec2d22e9a.js"></script>

<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://slhck.info/video/2017/02/24/crf-guide.html" target="_blank">CRF Guide (Constant Rate Factor in x264, x265 and libvpx)</a> – A good overview of the difference between constant and variable bitrate encoding, and suggestions for sensible defaults for each.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://netflixtechblog.com/a-large-scale-comparison-of-x264-x265-and-libvpx-a-sneak-peek-2e81e88f8b0f" target="_blank">A Large-Scale Comparison of x264, x265, and libvpx — a Sneak Peek</a> (Netflix Tech Blog)&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Building an AdaBoost classifier from scratch in Python</title><link>https://geoffruddock.com/adaboost-from-scratch-in-python/</link><pubDate>Friday, 20 Mar 2020</pubDate><guid>https://geoffruddock.com/adaboost-from-scratch-in-python/</guid><description>&lt;h2 id="goal">Goal &lt;a class="anchor" href="#goal">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>A few weeks ago while learning about Naive Bayes, I wrote a post about &lt;a href="https://geoffruddock.com/naive-bayes-from-scratch-with-numpy/">implementing Naive Bayes from scratch with Python&lt;/a>. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.&lt;/p>
&lt;h2 id="who-is-ada-anyways">Who is Ada, anyways? &lt;a class="anchor" href="#who-is-ada-anyways">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>&lt;a href="https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29" target="_blank">Boosting&lt;/a> refers to a family of machine learning meta-algorithms which combine the outputs of many &amp;ldquo;weak&amp;rdquo; classifiers into a powerful &amp;ldquo;committee&amp;rdquo;, where each of the weak classifiers alone may have an error rate which is only slightly better than random guessing.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="goal">Goal <a class="anchor" href="#goal">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>A few weeks ago while learning about Naive Bayes, I wrote a post about <a href="https://geoffruddock.com/naive-bayes-from-scratch-with-numpy/">implementing Naive Bayes from scratch with Python</a>. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.</p>
<h2 id="who-is-ada-anyways">Who is Ada, anyways? <a class="anchor" href="#who-is-ada-anyways">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29" target="_blank">Boosting</a> refers to a family of machine learning meta-algorithms which combine the outputs of many &ldquo;weak&rdquo; classifiers into a powerful &ldquo;committee&rdquo;, where each of the weak classifiers alone may have an error rate which is only slightly better than random guessing.</p>
<p>The name <a href="https://en.wikipedia.org/wiki/AdaBoost" target="_blank">AdaBoost</a> stands for <em>Adaptive Boosting</em>, and it refers to a particular boosting algorithm in which we fit a sequence of &ldquo;stumps&rdquo; (decision trees with a single node and two leaves) and weight their contribution to the final vote by how accurate their predictions are. After each iteration, we re-weight the dataset to assign greater importance to data points which were misclassified by the previous weak learner, so that those data points get &ldquo;special attention&rdquo; during iteration $t+1$.</p>
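<p>To make the idea of a &ldquo;stump&rdquo; concrete, here is a brute-force sketch in plain Python that searches for the single feature, threshold and polarity with the lowest weighted error. The name <code>fit_stump</code> and its return format are hypothetical, and this is only an illustrative stand-in for a depth-one decision tree, not necessarily how the stumps are fitted later in this post.</p>

```python
from typing import List, Sequence, Tuple


def fit_stump(X: Sequence[Sequence[float]],
              y: Sequence[int],
              w: Sequence[float]) -> Tuple[float, int, float, int]:
    """Return (weighted_error, feature_index, threshold, polarity) of the best stump.

    The stump predicts `polarity` when x[feature] >= threshold, else `-polarity`.
    Labels are assumed to be +1 or -1.
    """
    best = (float("inf"), 0, 0.0, 1)
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value of feature j
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds: List[int] = [polarity if row[j] >= threshold else -polarity
                                    for row in X]
                # Weighted error: total weight of the misclassified points
                err = sum(w_i for w_i, p, y_i in zip(w, preds, y) if p != y_i)
                if err < best[0]:
                    best = (err, j, threshold, polarity)
    return best


X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
w = [0.25] * 4
print(fit_stump(X, y, w))  # best split is feature 0 at threshold 3.0
```

<p>Because the error is weighted, changing <code>w</code> can change which split wins, which is exactly the lever AdaBoost pulls between iterations.</p>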
<h2 id="how-does-it-compare-to-random-forest">How does it compare to Random Forest? <a class="anchor" href="#how-does-it-compare-to-random-forest">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Random Forest</th>
          <th>AdaBoost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Depth</td>
          <td>Unlimited (a full tree)</td>
          <td>Stump (single node w/ 2 leaves)</td>
      </tr>
      <tr>
          <td>Trees grown</td>
          <td>Independently</td>
          <td>Sequentially</td>
      </tr>
      <tr>
          <td>Votes</td>
          <td>Equal</td>
          <td>Weighted</td>
      </tr>
  </tbody>
</table>
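<p>The last row of the table is easy to see with a tiny example. The predictions and weights below are made-up numbers, and ties are broken towards $+1$ as an arbitrary choice.</p>

```python
from typing import Sequence


def equal_vote(predictions: Sequence[int]) -> int:
    """Every learner counts the same (simple majority of +1 / -1 votes)."""
    return 1 if sum(predictions) >= 0 else -1


def weighted_vote(predictions: Sequence[int], weights: Sequence[float]) -> int:
    """Each learner's vote is scaled by its weight before taking the sign."""
    score = sum(a * h for a, h in zip(weights, predictions))
    return 1 if score >= 0 else -1


predictions = [1, -1, -1]        # three weak learners disagree
stump_weights = [1.5, 0.4, 0.3]  # the first learner is much more accurate

print(equal_vote(predictions))                    # -1: two votes against one
print(weighted_vote(predictions, stump_weights))  # 1: 1.5 outweighs 0.4 + 0.3
```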
<h2 id="the-adaboost-algorithm">The AdaBoost algorithm <a class="anchor" href="#the-adaboost-algorithm">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://www.cs.toronto.edu/~mbrubake/teaching/C11/Handouts/AdaBoost.pdf" target="_blank">This handout</a> gives a good overview of the algorithm, which is useful to understand before we touch any code.</p>
<p>A) Initialize sample weights uniformly as $w_i^{(1)} = \frac{1}{n}$.</p>
<p>B) For each iteration $t$:</p>
<ol>
<li>Find weak learner <code>$h_t(x)$</code> which minimizes <code>$\epsilon_t = \sum_{i=1}^n \mathbf{1}[h_t(x_i) \neq y_i] \, w_i^{(t)}$</code>.</li>
<li>Set a weight for the weak learner based on its accuracy: $\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$.</li>
<li>Increase weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$.</li>
<li>Renormalize weights, so that $\sum_{i=1}^n w_i^{(t+1)}=1$.</li>
</ol>
<p>C) Make final prediction as weighted majority vote of weak learner predictions: $H(x) = \text{sign} \Big( \sum_{t=1}^T \alpha_t h_t(x) \Big)$.</p>
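<p>Before touching the full implementation, it can help to run one iteration of step B by hand. The sketch below uses five made-up labels and predictions, with a hypothetical helper <code>adaboost_round</code>, and assumes $0 &lt; \epsilon_t &lt; 1$ so that the logarithm is defined.</p>

```python
import math
from typing import List, Sequence, Tuple


def adaboost_round(y: Sequence[int],
                   h: Sequence[int],
                   w: Sequence[float]) -> Tuple[float, float, List[float]]:
    """Apply steps B1-B4 for one weak learner with predictions h."""
    # B1: weighted error is the total weight of misclassified points
    eps = sum(w_i for w_i, y_i, h_i in zip(w, y, h) if y_i != h_i)
    # B2: weight of the weak learner
    alpha = 0.5 * math.log((1 - eps) / eps)
    # B3: y_i * h_i is +1 when correct and -1 when wrong
    unnormalized = [w_i * math.exp(-alpha * y_i * h_i)
                    for w_i, y_i, h_i in zip(w, y, h)]
    # B4: renormalize so the weights sum to 1
    total = sum(unnormalized)
    return eps, alpha, [u / total for u in unnormalized]


y = [1, 1, -1, -1, 1]    # true labels
h = [1, 1, -1, -1, -1]   # weak learner gets only the last point wrong
w = [1 / len(y)] * len(y)

eps, alpha, w_next = adaboost_round(y, h, w)
print(round(eps, 3), round(alpha, 3))   # 0.2 0.693
print([round(v, 3) for v in w_next])    # [0.125, 0.125, 0.125, 0.125, 0.5]
```

<p>After the update, the single misclassified point carries half of the total weight, so the same weak learner would now have a weighted error of exactly 0.5 and the next one is pushed to do something different.</p>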
<h2 id="getting-started">Getting started <a class="anchor" href="#getting-started">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="helper-plot-function">Helper plot function <a class="anchor" href="#helper-plot-function">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We&rsquo;re going to use the function below to visualize our data points, and optionally overlay the decision boundary of a fitted AdaBoost model. Don&rsquo;t worry if you don&rsquo;t understand everything that is happening here; it is not critical to understanding the algorithm itself.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">matplotlib</span> <span class="k">as</span> <span class="nn">mpl</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">y</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">clf</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">sample_weights</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">annotate</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">ax</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">mpl</span><span class="o">.</span><span class="n">axes</span><span class="o">.</span><span class="n">Axes</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Plot ± samples in 2D, optionally with decision boundary &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="nb">set</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span> <span class="s1">&#39;Expecting response labels to be ±1&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">ax</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">fig</span><span class="o">.</span><span class="n">set_facecolor</span><span class="p">(</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">pad</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="n">x_min</span><span class="p">,</span> <span class="n">x_max</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="o">-</span> <span class="n">pad</span><span class="p">,</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">+</span> <span class="n">pad</span>
</span></span><span class="line"><span class="cl">    <span class="n">y_min</span><span class="p">,</span> <span class="n">y_max</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="o">-</span> <span class="n">pad</span><span class="p">,</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">+</span> <span class="n">pad</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">sample_weights</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">sizes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">sample_weights</span><span class="p">)</span> <span class="o">*</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">100</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">sizes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="mi">100</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">X_pos</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">sizes_pos</span> <span class="o">=</span> <span class="n">sizes</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="n">X_pos</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="n">sizes_pos</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">&#39;+&#39;</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;red&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">X_neg</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">sizes_neg</span> <span class="o">=</span> <span class="n">sizes</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="n">X_neg</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="n">sizes_neg</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">&#39;.&#39;</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">&#39;blue&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">clf</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">plot_step</span> <span class="o">=</span> <span class="mf">0.01</span>
</span></span><span class="line"><span class="cl">        <span class="n">xx</span><span class="p">,</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">x_min</span><span class="p">,</span> <span class="n">x_max</span><span class="p">,</span> <span class="n">plot_step</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                             <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">y_min</span><span class="p">,</span> <span class="n">y_max</span><span class="p">,</span> <span class="n">plot_step</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">Z</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">c_</span><span class="p">[</span><span class="n">xx</span><span class="o">.</span><span class="n">ravel</span><span class="p">(),</span> <span class="n">yy</span><span class="o">.</span><span class="n">ravel</span><span class="p">()])</span>
</span></span><span class="line"><span class="cl">        <span class="n">Z</span> <span class="o">=</span> <span class="n">Z</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">xx</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># If all predictions are positive class, adjust color map accordingly</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">list</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">unique</span><span class="p">(</span><span class="n">Z</span><span class="p">))</span> <span class="o">==</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">            <span class="n">fill_colors</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;r&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">fill_colors</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">ax</span><span class="o">.</span><span class="n">contourf</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">,</span> <span class="n">Z</span><span class="p">,</span> <span class="n">colors</span><span class="o">=</span><span class="n">fill_colors</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">annotate</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">offset</span> <span class="o">=</span> <span class="mf">0.05</span>
</span></span><span class="line"><span class="cl">            <span class="n">ax</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;$x_</span><span class="si">{</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s1">$&#39;</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">offset</span><span class="p">,</span> <span class="n">y</span> <span class="o">-</span> <span class="n">offset</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">x_min</span><span class="o">+</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">x_max</span><span class="o">-</span><span class="mf">0.5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">y_min</span><span class="o">+</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">y_max</span><span class="o">-</span><span class="mf">0.5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s1">&#39;$x_1$&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s1">&#39;$x_2$&#39;</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="generate-a-fake-dataset">Generate a fake dataset <a class="anchor" href="#generate-a-fake-dataset">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We will generate a toy dataset using an approach similar to the <a href="https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html" target="_blank">scikit-learn documentation</a>, but with fewer data points. The key here is that we want to have two classes which are not linearly separable, since this is the ideal use case for AdaBoost.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_gaussian_quantiles</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">make_toy_dataset</span><span class="p">(</span><span class="n">n</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="n">random_seed</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Generate a toy dataset for evaluating AdaBoost classifiers &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">n_per_class</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">random_seed</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">random_seed</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">make_gaussian_quantiles</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="n">n</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">n_classes</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">*</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">make_toy_dataset</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span></code></pre></div><p><img src="./index_9_0.png" alt="png"></p>
<h3 id="benchmark-with-scikit-learn">Benchmark with scikit-learn <a class="anchor" href="#benchmark-with-scikit-learn">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Let&rsquo;s establish a benchmark for what our model&rsquo;s output should resemble by importing <code>AdaBoostClassifier</code> from scikit-learn and fitting it to our toy dataset.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">AdaBoostClassifier</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">bench</span> <span class="o">=</span> <span class="n">AdaBoostClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">algorithm</span><span class="o">=</span><span class="s1">&#39;SAMME&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">bench</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">train_err</span> <span class="o">=</span> <span class="p">(</span><span class="n">bench</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">!=</span> <span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Train error: </span><span class="si">{</span><span class="n">train_err</span><span class="si">:</span><span class="s1">.1%</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>Train error: 0.0%
</code></pre>
<p><img src="./index_11_1.png" alt="png"></p>
<p>The classifier fully fits the training dataset in 10 iterations, which is not surprising given that the data points in our toy dataset are reasonably well separated.</p>
<h2 id="rolling-our-own-adaboost-classifier">Rolling our own AdaBoost classifier <a class="anchor" href="#rolling-our-own-adaboost-classifier">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Below is the skeleton code for our AdaBoost classifier. After fitting the model, we&rsquo;ll save all the key attributes to the class, including the sample weights at each iteration, so we can inspect them later to understand what our algorithm is doing at each step.</p>
<p>The table below shows a mapping between the variable names we will use and the math notation used earlier in the description of the algorithm.</p>
<table>
  <thead>
      <tr>
          <th>Variable</th>
          <th>Math</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>sample_weights</code> with shape: (T, n)</td>
          <td>$ w_{i}^{(t)} $</td>
      </tr>
      <tr>
          <td><code>stumps</code> with shape: (T, )</td>
          <td>$ h_t(x) $</td>
      </tr>
      <tr>
          <td><code>stump_weights</code> with shape: (T, )</td>
          <td>$ \alpha_t $</td>
      </tr>
      <tr>
          <td><code>errors</code> with shape: (T, )</td>
          <td>$ \epsilon_t $</td>
      </tr>
      <tr>
          <td><code>clf.predict(X)</code></td>
          <td>$ H_t(x) $</td>
      </tr>
  </tbody>
</table>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">AdaBoost</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; AdaBoost ensemble classifier from scratch &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stumps</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stump_weights</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">errors</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">sample_weights</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_check_X_y</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34; Validate assumptions about format of input data&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">assert</span> <span class="nb">set</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span> <span class="s1">&#39;Response variable must be ±1&#39;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span>
</span></span></code></pre></div><h3 id="fitting-the-model">Fitting the model <a class="anchor" href="#fitting-the-model">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Recall our algorithm to fit the model:</p>
<ol>
<li>Find the weak learner <code>$h_t(x)$</code> which minimizes the weighted error <code>$\epsilon_t = \sum_{i=1}^n \mathbf{1}[h_t(x_i) \neq y_i] \, w_i^{(t)}$</code>.</li>
<li>Set a weight for our weak learner based on its accuracy: $\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$</li>
<li>Increase weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$. Note that $y_i h_t(x_i)$ evaluates to $+1$ when the hypothesis agrees with the label, and to $-1$ when it does not.</li>
<li>Renormalize weights, so that $\sum_{i=1}^n w_i^{(t+1)} =1$.</li>
</ol>
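<p>To make steps 2 to 4 concrete, here is a minimal sketch of a single boosting round on five points. It uses only the standard library so that it runs independently of the notebook state. The labels, the stump predictions, and the variable names are made up for illustration.</p>

```python
import math

# Hypothetical toy round: five points with uniform weights, and a stump
# that gets only the last point wrong.
y = [1, 1, -1, -1, 1]
stump_pred = [1, 1, -1, -1, -1]
weights = [1 / len(y)] * len(y)

# Step 1 (error only): total weight of the misclassified points
err = sum(w for w, yi, hi in zip(weights, y, stump_pred) if yi != hi)

# Step 2: stump weight
alpha = 0.5 * math.log((1 - err) / err)

# Step 3: shrink weights of correct points, grow weights of mistakes
new_weights = [w * math.exp(-alpha * yi * hi)
               for w, yi, hi in zip(weights, y, stump_pred)]

# Step 4: renormalize so the weights sum to one
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f'err={err:.2f}, alpha={alpha:.3f}')
print([round(w, 3) for w in new_weights])
```

<p>With these numbers the misclassified point ends up holding half of the total weight after renormalization. This is a general property of this choice of $\alpha_t$: the next weak learner sees the previous stump&rsquo;s mistakes and its correct predictions as equally heavy.</p>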
<p>The <code>fit</code> method below is essentially a 1-to-1 implementation of the four steps above, but there are a few things to note:</p>
<ul>
<li>Since the focus here is understanding the ensemble element of AdaBoost, we&rsquo;ll outsource the logic of picking each $h_t(x)$ to <code>DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)</code>.</li>
<li>We set the initial uniform sample weights outside of the for-loop and set the weights for $t+1$ within each iteration $t$, unless it is the last iteration. We are going out of our way here to save an array of sample weights on the fitted model, so that we can later visualize the sample weights at each iteration.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeClassifier</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">iters</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Fit the model using training data &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_check_X_y</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># init numpy arrays</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">sample_weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">iters</span><span class="p">,</span> <span class="n">n</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">stumps</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">iters</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">stump_weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">iters</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">errors</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">iters</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># initialize weights uniformly</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">sample_weights</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">n</span><span class="p">)</span> <span class="o">/</span> <span class="n">n</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># fit weak learner</span>
</span></span><span class="line"><span class="cl">        <span class="n">curr_sample_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sample_weights</span><span class="p">[</span><span class="n">t</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">stump</span> <span class="o">=</span> <span class="n">DecisionTreeClassifier</span><span class="p">(</span><span class="n">max_depth</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_leaf_nodes</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">stump</span> <span class="o">=</span> <span class="n">stump</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sample_weight</span><span class="o">=</span><span class="n">curr_sample_weights</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># calculate error and stump weight from weak learner prediction</span>
</span></span><span class="line"><span class="cl">        <span class="n">stump_pred</span> <span class="o">=</span> <span class="n">stump</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">err</span> <span class="o">=</span> <span class="n">curr_sample_weights</span><span class="p">[(</span><span class="n">stump_pred</span> <span class="o">!=</span> <span class="n">y</span><span class="p">)]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">stump_weight</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">((</span><span class="mi">1</span> <span class="o">-</span> <span class="n">err</span><span class="p">)</span> <span class="o">/</span> <span class="n">err</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># update sample weights</span>
</span></span><span class="line"><span class="cl">        <span class="n">new_sample_weights</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">curr_sample_weights</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">stump_weight</span> <span class="o">*</span> <span class="n">y</span> <span class="o">*</span> <span class="n">stump_pred</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="n">new_sample_weights</span> <span class="o">/=</span> <span class="n">new_sample_weights</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># If not final iteration, update sample weights for t+1</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">t</span><span class="o">+</span><span class="mi">1</span> <span class="o">&lt;</span> <span class="n">iters</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">sample_weights</span><span class="p">[</span><span class="n">t</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_sample_weights</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># save results of iteration</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stumps</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="n">stump</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stump_weights</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="n">stump_weight</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">errors</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="n">err</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="bp">self</span>
</span></span></code></pre></div><h3 id="making-predictions">Making predictions <a class="anchor" href="#making-predictions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We make a final prediction by taking a &ldquo;weighted majority vote&rdquo;, calculated as the sign (±) of the linear combination of each stump&rsquo;s prediction and its corresponding stump weight.</p>
<p><code>$$ H_T(x) = \text{sign} \Big( \sum_{t=1}^T \alpha_t h_t(x) \Big) $$</code></p>
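<p>As a small illustration of the weighted vote, the sketch below combines three hypothetical stumps on four points using plain Python. The numbers are invented; the point is that a single high-weight stump can overrule two lower-weight stumps that disagree with it.</p>

```python
# Hypothetical stump weights (alpha_t) and ±1 predictions on four points
stump_weights = [0.9, 0.4, 0.3]
stump_preds = [
    [1, 1, -1, -1],   # h_1
    [-1, 1, 1, -1],   # h_2
    [-1, 1, 1, 1],    # h_3
]

def sign(v):
    """Sign of v as -1, 0 or +1 (0 only on an exact tie)."""
    return (v > 0) - (0 > v)

# Weighted sum of the stump predictions for each point, then its sign
scores = [sum(a * h[i] for a, h in zip(stump_weights, stump_preds))
          for i in range(4)]
votes = [sign(s) for s in scores]

print(votes)  # [1, 1, -1, -1]
```

<p>For the first and third points, two of the three stumps vote against $h_1$, yet its larger weight decides the outcome. An exact tie would give a score of zero, which <code>np.sign</code> also maps to 0 rather than to ±1.</p>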
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Make predictions using already fitted model &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">stump_preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">stump</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="k">for</span> <span class="n">stump</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">stumps</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sign</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">stump_weights</span><span class="p">,</span> <span class="n">stump_preds</span><span class="p">))</span>
</span></span></code></pre></div><h3 id="performance">Performance <a class="anchor" href="#performance">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Now let&rsquo;s put everything together, and fit the model with the same parameters as our benchmark.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># assign our individually defined functions as methods of our classifier</span>
</span></span><span class="line"><span class="cl"><span class="n">AdaBoost</span><span class="o">.</span><span class="n">fit</span> <span class="o">=</span> <span class="n">fit</span>
</span></span><span class="line"><span class="cl"><span class="n">AdaBoost</span><span class="o">.</span><span class="n">predict</span> <span class="o">=</span> <span class="n">predict</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clf</span> <span class="o">=</span> <span class="n">AdaBoost</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">iters</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">clf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">train_err</span> <span class="o">=</span> <span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">!=</span> <span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Train error: </span><span class="si">{</span><span class="n">train_err</span><span class="si">:</span><span class="s1">.1%</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>Train error: 0.0%
</code></pre>
<p><img src="./index_20_1.png" alt="png"></p>
<p>Success! We&rsquo;ve achieved the exact same result as our <code>sklearn</code> benchmark. I cherry-picked this toy dataset to show the strengths of AdaBoost, but you can run this notebook yourself and see that it matches the output regardless of starting conditions.</p>
<h2 id="developing-intuition">Developing intuition <a class="anchor" href="#developing-intuition">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="visualizing-our-learner-step-by-step">Visualizing our learner step-by-step <a class="anchor" href="#visualizing-our-learner-step-by-step">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Since we saved all intermediate variables as arrays to our fitted model, we can use the function below to visualize how our ensemble learner evolves at each iteration $t$:</p>
<ul>
<li>The left column shows the &ldquo;stump&rdquo; weak learner selected, which corresponds to $h_t(x)$.</li>
<li>The right column shows the cumulative strong learner so far: $H_t(x)$.</li>
<li>The size of the data point markers reflects their relative weighting. Data points misclassified in the previous iteration will be more heavily weighted—and therefore appear larger—in the next iteration.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">truncate_adaboost</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">t</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Truncate a fitted AdaBoost up to (and including) a particular iteration &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="n">t</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;t must be a positive integer&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_clf</span> <span class="o">=</span> <span class="n">deepcopy</span><span class="p">(</span><span class="n">clf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_clf</span><span class="o">.</span><span class="n">stumps</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">stumps</span><span class="p">[:</span><span class="n">t</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_clf</span><span class="o">.</span><span class="n">stump_weights</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">stump_weights</span><span class="p">[:</span><span class="n">t</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">new_clf</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">plot_staged_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">clf</span><span class="p">,</span> <span class="n">iters</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Plot weak learner and cumulative strong learner at each iteration. &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># larger grid</span>
</span></span><span class="line"><span class="cl">    <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">iters</span><span class="o">*</span><span class="mi">3</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                             <span class="n">nrows</span><span class="o">=</span><span class="n">iters</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                             <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                             <span class="n">sharex</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                             <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">fig</span><span class="o">.</span><span class="n">set_facecolor</span><span class="p">(</span><span class="s1">&#39;white&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s1">&#39;Decision boundaries by iteration&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Plot weak learner</span>
</span></span><span class="line"><span class="cl">        <span class="n">_</span> <span class="o">=</span> <span class="n">ax1</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Weak learner at t=</span><span class="si">{</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">clf</span><span class="o">.</span><span class="n">stumps</span><span class="p">[</span><span class="n">i</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                      <span class="n">sample_weights</span><span class="o">=</span><span class="n">clf</span><span class="o">.</span><span class="n">sample_weights</span><span class="p">[</span><span class="n">i</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                      <span class="n">annotate</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Plot strong learner</span>
</span></span><span class="line"><span class="cl">        <span class="n">trunc_clf</span> <span class="o">=</span> <span class="n">truncate_adaboost</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">_</span> <span class="o">=</span> <span class="n">ax2</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Strong learner at t=</span><span class="si">{</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">plot_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">trunc_clf</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">sample_weights</span><span class="o">=</span><span class="n">clf</span><span class="o">.</span><span class="n">sample_weights</span><span class="p">[</span><span class="n">i</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                      <span class="n">annotate</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">top</span><span class="o">=</span><span class="mf">0.95</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clf</span> <span class="o">=</span> <span class="n">AdaBoost</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">iters</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">plot_staged_adaboost</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">clf</span><span class="p">)</span>
</span></span></code></pre></div><p><img src="./index_24_0.png" alt="png"></p>
<h3 id="why-do-some-iterations-have-no-decision-boundary">Why do some iterations have no decision boundary? <a class="anchor" href="#why-do-some-iterations-have-no-decision-boundary">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>You may notice that our weak learners at iterations $t=2,5,7,10$ classify all points as positive. This occurs because, given the current sample weights, the lowest error is achieved by simply predicting all data points to be positive. Note that in each of the plots above for these iterations, the negative samples are surrounded by proportionally higher-weighted positive samples.</p>
<p>There is no way to draw a linear decision boundary to correctly classify any number of negative data points without misclassifying a higher cumulative weight of positive samples. This does not stop our algorithm from converging though. All the negative points are misclassified and therefore increase in sample weight. This updating of weights allows the next iteration&rsquo;s weak learner to discover a meaningful decision boundary.</p>
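<p>A minimal sketch of this situation, using a hypothetical one-dimensional dataset and threshold stumps rather than the two-dimensional data and weak learners from this post. The points, weights and candidate thresholds are all made up for illustration: a single lightly weighted negative point sits between heavily weighted positive points, so every split that isolates it costs more weight than it saves.</p>

```python
import numpy as np

# Hypothetical 1-D data: one negative point surrounded by positives
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1, 1, -1, 1, 1])
w = np.array([0.25, 0.25, 0.10, 0.20, 0.20])  # sample weights, summing to 1

def weighted_error(pred):
    """Total weight of the misclassified points."""
    return float(w[pred != y].sum())

# Candidate weak learners: the constant predictor plus every threshold stump
candidates = {"all positive": np.ones_like(y)}
for threshold in [0.5, 1.5, 2.5, 3.5]:
    for polarity in (1, -1):
        pred = np.where(x > threshold, polarity, -polarity)
        candidates[f"x > {threshold} -> {polarity:+d}"] = pred

errors = {name: weighted_error(pred) for name, pred in candidates.items()}
best = min(errors, key=errors.get)
print(best, errors[best])
```

<p>Here the constant predictor only pays for the single negative point, while every stump misclassifies a larger total weight of positive points.</p>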
<h3 id="why-do-we-use-that-specific-formula-for-alpha_t">Why do we use that specific formula for alpha_t? <a class="anchor" href="#why-do-we-use-that-specific-formula-for-alpha_t">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Are you curious why we use this particular value for <code>$\alpha_t$</code>? We can show that the choice of <code>$\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$</code> minimizes exponential loss <code>$L_{exp}(x, y) = e^{-y \, h(x)}$</code> over the training set.</p>
<p>Ignoring the sign function, our strong learner <code>$H$</code> at iteration <code>$t$</code> is a weighted combination of weak learners <code>$h(x)$</code>. At any given iteration <code>$t$</code>, we can define <code>$H_t(x)$</code> recursively as its value at iteration <code>$t-1$</code> plus the weighted weak learner of the current iteration.</p>
<p><code>$$ \begin{aligned} H_t(x) &amp;= \sum_{i=1}^t \alpha_i h_i(x) \\ &amp;= H_{t-1}(x) + \alpha_t h_t(x) \end{aligned} $$</code></p>
<p>Our loss function applied to $H$ is the average loss across all $n$ data points. We can substitute in our recursive definition of <code>$H_t(x)$</code>, and split the exponential term using the identity $e^{a+b}=e^a e^b$.</p>
<p><code>$$ \begin{aligned} L(H_t) &amp;= \tfrac{1}{n} \sum_{i=1}^n e^{-y_i H_t(x_i)} \\ &amp;= \tfrac{1}{n} \sum_{i=1}^n e^{-y_i H_{t-1}(x_i)} e^{-y_i \alpha_t h_t(x_i)} \\ &amp;= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{e^{-y_i H_{t-1}(x_i)}} e^{-y_i \alpha_t h_t(x_i)} \\ &amp;= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{w^t_i} \; e^{-y_i \alpha_t h_t(x_i)} \\ \end{aligned} $$</code></p>
<p>Now we take the derivative of our loss function with respect to <code>$\alpha_t$</code> and set it to zero to find the parameter value at which the loss function is minimized. We can split the summation into two: cases where <code>$h_t(x_i) = y_i$</code> and cases where <code>$h_t(x_i) \neq y_i$</code>.</p>
<p><code>$$ \begin{aligned} L(H_t) &amp;= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{w^t_i \,} e^{-y_i \alpha_t h_t(x_i)} \\ \frac{\partial L}{\partial \alpha_t} = 0 &amp;= - \tfrac{1}{n} \sum_{i=1}^n w^t_i \, y_i h_t(x_i) e^{-y_i \alpha_t h_t(x_i)} \\ &amp;= - \tfrac{1}{n} \sum_{i: h_t(x_i) = y_i}^n w^t_i \, e^{-\alpha_t} + \tfrac{1}{n} \sum_{i: h_t(x_i) \neq y_i}^n w^t_i \, e^{\alpha_t} \\ \end{aligned} $$</code></p>
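<p>Finally, we recognize that the sum of weights over the misclassified points is equivalent to our error calculation discussed earlier: <code>$\sum_{i: h_t(x_i) \neq y_i} D_t(i) = \epsilon_t$</code>. With normalized weights, the correctly classified points carry the remaining weight <code>$1 - \epsilon_t$</code>. Making the substitution and then manipulating algebraically allows us to isolate <code>$\alpha_t$</code>.</p>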
<p><code>$$ \begin{aligned} 0 &amp;= (\epsilon_t-1) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} \\ (1-\epsilon_t) e^{-\alpha_t} &amp;= \epsilon_t e^{\alpha_t} \\ \frac{1 - \epsilon_t}{\epsilon_t} &amp;= \frac{e^{\alpha_t}}{e^{-\alpha_t}} = e^{2 \alpha_t} \\ \alpha_t &amp;= \frac{1}{2} \ln \bigg( \frac{1 - \epsilon_t}{\epsilon_t} \bigg) \end{aligned} $$</code></p>
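<p>As a quick sanity check on the derivation, we can compare the closed-form value against a brute-force grid search. This is only a sketch: the error rate <code>eps = 0.3</code> is a hypothetical value, and the loss is written in terms of normalized weights, so that correctly classified points carry total weight $1 - \epsilon_t$ and misclassified points carry $\epsilon_t$.</p>

```python
import numpy as np

def exp_loss(alpha, eps):
    """Weighted exponential loss as a function of alpha, for weighted error eps."""
    return (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)

eps = 0.3  # hypothetical weighted error of the current weak learner

# Closed-form minimizer from the derivation above
alpha_closed_form = 0.5 * np.log((1 - eps) / eps)

# Brute-force minimizer over a fine grid of candidate alphas
alphas = np.linspace(-2, 2, 4001)
alpha_grid = alphas[np.argmin(exp_loss(alphas, eps))]

print(alpha_closed_form, alpha_grid)
```

<p>The two values should agree up to the grid spacing. Substituting the closed form back into the loss gives $2\sqrt{\epsilon_t (1 - \epsilon_t)}$, which is below one whenever $\epsilon_t \neq 0.5$.</p>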
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html" target="_blank">sklearn.ensemble.AdaBoostClassifier</a> – Official scikit-learn documentation</li>
<li><a href="https://www.cs.toronto.edu/~mbrubake/teaching/C11/Handouts/AdaBoost.pdf" target="_blank">University of Toronto CS – AdaBoost</a> – Understandable handout PDF which lays out a pseudo-code algorithm and walks through some of the math.</li>
<li><a href="https://jeremykun.com/2015/05/18/boosting-census/" target="_blank">Weak Learning, Boosting, and the AdaBoost algorithm</a> – Discussion of AdaBoost in the context of PAC learning, along with python implementation.</li>
<li><a href="https://xavierbourretsicotte.github.io/AdaBoost.html" target="_blank">AdaBoost: Implementation and intuition</a> – Python implementation with visualization function, served partially as the inspiration for this post.</li>
</ul>

      ]]></content:encoded></item><item><title>Building a Naive Bayes classifier from scratch with NumPy</title><link>https://geoffruddock.com/naive-bayes-from-scratch-with-numpy/</link><pubDate>Monday, 16 Mar 2020</pubDate><guid>https://geoffruddock.com/naive-bayes-from-scratch-with-numpy/</guid><description>&lt;p>While learning about &lt;a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier" target="_blank">Naive Bayes classifiers&lt;/a>, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB" target="_blank">sklearn.naive_bayes.MultinomialNB&lt;/a> estimator which produces identical results on a sample dataset.&lt;/p>
&lt;p>While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency (both computational and maintenance) across a wide family of models. Our estimator of interest &lt;code>MultinomialNB&lt;/code> inherits from &lt;code>_BaseDiscreteNB&lt;/code> which itself inherits from &lt;code>_BaseNB&lt;/code> which has multiple inheritance from &lt;code>BaseEstimator&lt;/code> and &lt;code>ClassifierMixin&lt;/code>.&lt;/p></description><content:encoded><![CDATA[
        <p>While learning about <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier" target="_blank">Naive Bayes classifiers</a>, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB" target="_blank">sklearn.naive_bayes.MultinomialNB</a> estimator which produces identical results on a sample dataset.</p>
<p>While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency (both computational and maintenance) across a wide family of models. Our estimator of interest <code>MultinomialNB</code> inherits from <code>_BaseDiscreteNB</code> which itself inherits from <code>_BaseNB</code> which has multiple inheritance from <code>BaseEstimator</code> and <code>ClassifierMixin</code>.</p>
<h3 id="what-is-naive-bayes">What is Naive Bayes? <a class="anchor" href="#what-is-naive-bayes">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Naive Bayes is a simple generative (probabilistic) classification model based on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem" target="_blank">Bayes&rsquo; theorem</a>. The typical example use-case for this algorithm is classifying email messages as spam or &ldquo;ham&rdquo; (non-spam) based on the previously observed frequency of words which have appeared in known spam or ham emails in the past.</p>
<p>$$
P(\text{ spam }|\text{ text }) = \frac{P(\text{ text }|\text{ spam }) \, P(\text{ spam })}{P(\text{ text })}
$$</p>
<p>Following typical ML notation, we use $y$ to denote the &ldquo;class&rdquo; of our message, where $y=1$ for spam messages and $y=0$ for non-spam messages. We will represent our text data as an array $x$ of length $J$, with the $j^{th}$ value representing the number of times the $j^{th}$ word appears in a particular email. The value of $J$ is the number of distinct words seen across all training data. Our model then becomes</p>
<p>$$
P(y|x) = \frac{P(x|y) \, P(y)}{P(x)}
$$</p>
<h3 id="why-would-we-want-to-use-something-naive">Why would we want to use something &ldquo;naive&rdquo;? <a class="anchor" href="#why-would-we-want-to-use-something-naive">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>This classifier is &ldquo;naive&rdquo; in the sense that predictive features are assumed to be conditionally independent given their class. Naturally this is a gross simplification of reality: if an email contains the word &ldquo;sports&rdquo; we would expect it to also be more likely to contain related words like &ldquo;bet&rdquo; or &ldquo;odds&rdquo;. But this assumption allows us to calculate the joint probability simply as the product of per-word conditional likelihoods, without worrying about the correlation structure between different words. This makes the model <em>much</em> easier to fit.</p>
<blockquote>
<p>All models are wrong, some are useful. — George Box, statistician</p></blockquote>
<p>If our model were not &ldquo;naive&rdquo;, we would have to calculate the joint likelihood function as a messy product of $J$ separate conditional likelihood functions. With so many parameters to estimate, we would need a large quantity of training data to avoid overfitting.</p>
<p>$$
P(\mathbf{x} \vert y) = P(x_1 \vert y) P(x_2 \vert x_1, y) P(x_3 \vert x_1, x_2, y) \ldots P(x_J \vert x_1, x_2, \ldots, x_{J-1}, y)
$$</p>
<p>But if we assume conditional independence, this calculation becomes much simpler. We can now just multiply together the likelihoods of each word $x_j$, conditional only on the class $y$ of the message (spam or not-spam), to get the joint likelihood for the entire message.</p>
<p>$$
P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c)
$$</p>
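<p>A minimal sketch of this product in NumPy, assuming a hypothetical three-word vocabulary with made-up per-word likelihoods. Since $x_j$ is a word count, each word&rsquo;s likelihood enters the product once per occurrence, which we can express by raising the likelihoods to the power of the counts. The multinomial coefficient is left out here on the assumption that we only compare classes for the same message, in which case it is the same constant for every class.</p>

```python
import numpy as np

# Hypothetical per-word likelihoods P(x_j | y = spam) for a 3-word vocabulary
lk_word_spam = np.array([0.5, 0.3, 0.2])

# Word counts for one message: first word twice, third word once
x = np.array([2, 0, 1])

# Joint likelihood under conditional independence:
# each word contributes its likelihood once per occurrence
lk_message_spam = np.prod(lk_word_spam ** x)
print(lk_message_spam)
```

<p>Words that do not appear in the message have a count of zero, so they contribute a factor of one and drop out of the product.</p>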
<h3 id="our-toy-dataset">Our toy dataset <a class="anchor" href="#our-toy-dataset">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The function below generates a test dataset based on Chapter 3.5, Exercise 3.22 from <em>Machine Learning: A Probabilistic Perspective</em>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Callable</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">make_spam_dataset</span><span class="p">(</span><span class="n">show_X</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">Callable</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Create a small toy dataset for MultinomialNB implementation
</span></span></span><span class="line"><span class="cl"><span class="s2">    
</span></span></span><span class="line"><span class="cl"><span class="s2">    Returns:
</span></span></span><span class="line"><span class="cl"><span class="s2">        X: word count matrix
</span></span></span><span class="line"><span class="cl"><span class="s2">        y: indicator of whether or not message is spam
</span></span></span><span class="line"><span class="cl"><span class="s2">        msg_tx_func: a function to transform new test data into word count matrix
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">vocab</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;secret&#39;</span><span class="p">,</span> <span class="s1">&#39;offer&#39;</span><span class="p">,</span> <span class="s1">&#39;low&#39;</span><span class="p">,</span> <span class="s1">&#39;price&#39;</span><span class="p">,</span> <span class="s1">&#39;valued&#39;</span><span class="p">,</span> <span class="s1">&#39;customer&#39;</span><span class="p">,</span> <span class="s1">&#39;today&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;dollar&#39;</span><span class="p">,</span> <span class="s1">&#39;million&#39;</span><span class="p">,</span> <span class="s1">&#39;sports&#39;</span><span class="p">,</span> <span class="s1">&#39;is&#39;</span><span class="p">,</span> <span class="s1">&#39;for&#39;</span><span class="p">,</span> <span class="s1">&#39;play&#39;</span><span class="p">,</span> <span class="s1">&#39;healthy&#39;</span><span class="p">,</span> <span class="s1">&#39;pizza&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">spam</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;million dollar offer&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;secret offer today&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;secret is secret&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">not_spam</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;low price for valued customer&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;play secret sports today&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;sports is healthy&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s1">&#39;low price pizza&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">all_messages</span> <span class="o">=</span> <span class="n">spam</span> <span class="o">+</span> <span class="n">not_spam</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">vocabulary</span><span class="o">=</span><span class="n">vocab</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">word_counts</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">all_messages</span><span class="p">)</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">word_counts</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">vocab</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">is_spam</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">spam</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">not_spam</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">msg_tx_func</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">show_X</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">display</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">df</span><span class="o">.</span><span class="n">to_numpy</span><span class="p">(),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">is_spam</span><span class="p">),</span> <span class="n">msg_tx_func</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">tx_func</span> <span class="o">=</span> <span class="n">make_spam_dataset</span><span class="p">()</span>
</span></span></code></pre></div><div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>secret</th>
      <th>offer</th>
      <th>low</th>
      <th>price</th>
      <th>valued</th>
      <th>customer</th>
      <th>today</th>
      <th>dollar</th>
      <th>million</th>
      <th>sports</th>
      <th>is</th>
      <th>for</th>
      <th>play</th>
      <th>healthy</th>
      <th>pizza</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>5</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>6</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
  </tbody>
</table>
</div>
<h3 id="our-model">Our model <a class="anchor" href="#our-model">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Our model needs to resolve the four component &ldquo;ingredients&rdquo; necessary to make predictions on future data points. The table below describes each component, and shows the mapping between the math notation above and the variable names in our code below.</p>
<table>
  <thead>
      <tr>
          <th>Variable</th>
          <th>Math</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>prior</code></td>
          <td>$ P(y) $</td>
          <td>Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam).</td>
      </tr>
      <tr>
          <td><code>lk_word</code></td>
          <td>$P(x_j \vert y)$</td>
          <td>The likelihood of each word, conditional on message class. We are implicitly using the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution" target="_blank">multinomial distribution</a> here. Intuitively, the word conditional likelihoods are just the normalized frequency within each message class.</td>
      </tr>
      <tr>
          <td><code>lk_message</code></td>
          <td>$ P(\mathbf{x} \vert y) $</td>
          <td>The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class.</td>
      </tr>
      <tr>
          <td><code>normalize_term</code></td>
          <td>$ P(\mathbf{x}) $</td>
          <td>The likelihood of an entire message across all possible classes.</td>
      </tr>
  </tbody>
</table>
<p>We&rsquo;ve got a few additional attributes as well:</p>
<ul>
<li>The <code>alpha</code> attribute will be added to each word count, to avoid us having zero probabilities for words not seen in our training sample.</li>
<li>The <code>is_fitted_</code> attribute is a scikit-learn convention to ensure we don&rsquo;t accidentally try to make predictions on a model that has not yet been fitted.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">NaiveBayes</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; DIY binary Naive Bayes classifier based on categorical data &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34; &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">prior</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">word_counts</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">lk_word</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">is_fitted_</span> <span class="o">=</span> <span class="kc">False</span>
</span></span></code></pre></div><h2 id="fitting-the-model">Fitting the model <a class="anchor" href="#fitting-the-model">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Let&rsquo;s implement our algorithm to handle an arbitrary number of classes, even though our toy example only has two (spam/not-spam). Our fit function needs to do two things:</p>
<h3 id="calculate-prior">Calculate prior <a class="anchor" href="#calculate-prior">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>This one is easy. We split our input array into one sub-array per class in <code>X_by_class</code>, then divide the number of messages in each class by the total number of messages to arrive at our <code>prior</code>.</p>
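<p>As a small standalone sketch of the same calculation, the class priors can also be read directly off the label vector. The labels below mirror the toy dataset (three spam messages followed by four non-spam messages); using <code>np.unique</code> with <code>return_counts=True</code> is just one possible way to do this, not necessarily what the fit function in this post does.</p>

```python
import numpy as np

# Labels mirroring the toy dataset: 3 spam (1) followed by 4 not-spam (0)
y = np.array([1, 1, 1, 0, 0, 0, 0])

# Sorted class labels and how many messages fall in each class
classes, counts = np.unique(y, return_counts=True)

# Prior = share of training messages belonging to each class
prior = counts / counts.sum()
print(classes, prior)
```

<p>Note that the classes come back sorted, so the first entry of <code>prior</code> corresponds to $y=0$ (not-spam) and the second to $y=1$ (spam).</p>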
<h3 id="calculate-likelihoods">Calculate likelihoods <a class="anchor" href="#calculate-likelihoods">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>We set <code>word_counts</code> by looping over each of the sub-arrays in <code>X_by_class</code> and taking the column sums within each sub-array. Note that the numpy notation is a bit unintuitive here: <code>.sum(axis=0)</code> means that we collapse the $0^{th}$ axis (rows), leaving only columns. This gives us an array of shape <code>(c,j)</code> which counts the number of times the $j^{th}$ word appears across all emails of class $c$.</p>
<p>Finally, our likelihood function <code>lk_word</code> is simply these word counts divided by the total number of times all words appear in each class. We achieve this by taking the row sums using <code>.sum(axis=1)</code>, reshaped into a column so that the division is applied row by row.</p>
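<p>To make the axis convention concrete, here is a small sketch with made-up counts. The arrays are hypothetical and much smaller than our toy dataset; <code>keepdims=True</code> is used as an alternative to reshaping the row sums into a column.</p>

```python
import numpy as np

# Hypothetical word counts for one class: 2 messages (rows), 3 words (columns)
X_class = np.array([[1, 0, 2],
                    [0, 1, 1]])

col_sums = X_class.sum(axis=0)  # collapse rows: one total per word
row_sums = X_class.sum(axis=1)  # collapse columns: one total per message

# Hypothetical per-class word counts of shape (c, j): 2 classes, 3 words
word_counts = np.array([[1.0, 1.0, 3.0],
                        [2.0, 2.0, 4.0]])

# Divide each row by its own total so that every class's likelihoods sum to 1
lk_word = word_counts / word_counts.sum(axis=1, keepdims=True)
print(col_sums, row_sums)
print(lk_word)
```

<p>Keeping the summed axis as a column of shape <code>(c,1)</code> is what lets the division broadcast across each row of the <code>(c,j)</code> array.</p>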
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.utils.validation</span> <span class="kn">import</span> <span class="n">check_X_y</span><span class="p">,</span> <span class="n">check_array</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Fit training data for Naive Bayes classifier &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># not strictly necessary, but this ensures we have clean input</span>
</span></span><span class="line"><span class="cl">    <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">check_X_y</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># a plain list, since each class may have a different number of rows</span>
</span></span><span class="line"><span class="cl">    <span class="n">X_by_class</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="n">c</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">unique</span><span class="p">(</span><span class="n">y</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">prior</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">X_class</span><span class="p">)</span> <span class="o">/</span> <span class="n">n</span> <span class="k">for</span> <span class="n">X_class</span> <span class="ow">in</span> <span class="n">X_by_class</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">word_counts</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">sub_arr</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="k">for</span> <span class="n">sub_arr</span> <span class="ow">in</span> <span class="n">X_by_class</span><span class="p">])</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">lk_word</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">word_counts</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">word_counts</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">is_fitted_</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="bp">self</span>
</span></span></code></pre></div><h2 id="predicting-new-emails">Predicting new emails <a class="anchor" href="#predicting-new-emails">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>We can now make predictions, either on the same emails we used to train the model, or on entirely new emails never before seen by the model. We&rsquo;ll do this by first predicting probabilities for each class, then making our final prediction by taking the class with the highest probability.</p>
<p>Recall that our conditional likelihood for an entire message $\mathbf{x}$ is calculated as the product of conditional likelihoods for each word $x_j$ present in the message. Note here that if a word appears twice, its <code>lk_word</code> gets factored twice into our joint likelihood.</p>
<p>$$
P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c)
$$</p>
<p>So we loop over each message (row) in our array $X$ and calculate the conditional likelihood of each word, then multiply them all together and multiply by our class priors. At the very end, we divide each row by $P(\mathbf{x})$, which is the sum of these numerators across classes, so that the resulting probabilities sum to one.</p>
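<p>To make the arithmetic concrete, here is a minimal sketch for a single message. The likelihoods, priors, and word counts are made-up values for illustration (two classes, three vocabulary words) and do not come from the dataset used in this post.</p>

```python
import numpy as np

# Hypothetical fitted parameters: 2 classes (ham, spam) x 3 vocabulary words.
lk_word = np.array([
    [0.5, 0.3, 0.2],   # P(word | ham)
    [0.1, 0.2, 0.7],   # P(word | spam)
])
prior = np.array([0.6, 0.4])  # P(ham), P(spam)

# One message as word counts: word 0 appears twice, word 2 appears once.
x = np.array([2, 0, 1])

# Raise each word likelihood to its count, then multiply across words.
# Words with a count of zero contribute a factor of 1.
lk_message = (lk_word ** x).prod(axis=1)

# Numerator of Bayes' rule for each class, then normalize by P(x).
numerators = lk_message * prior
posterior = numerators / numerators.sum()

print(posterior)
```

<p>The ham numerator is 0.5 × 0.5 × 0.2 × 0.6 = 0.03 and the spam numerator is 0.1 × 0.1 × 0.7 × 0.4 = 0.0028, so this message is assigned to ham after normalizing.</p>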
<h3 id="what-if-about-previously-unseen-words">What about previously unseen words? <a class="anchor" href="#what-if-about-previously-unseen-words">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Suppose we have a word which has never appeared in training messages labelled as spam. Its conditional likelihood would be zero, which would take our entire joint likelihood to zero as well. This is precisely why we added <code>alpha</code> while calculating word counts in the fit function, so that this situation does not occur.</p>
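<p>A small sketch of the effect of smoothing, using made-up counts in which the third word never appears in the second class. The helper <code>word_likelihoods</code> is a hypothetical name for illustration. It mirrors the add-then-normalize step from the fit function but is not part of the classifier.</p>

```python
import numpy as np

# Hypothetical raw word counts per class (rows) and word (columns).
# The third word never appears in the second class.
raw_counts = np.array([
    [3, 1, 2],
    [4, 2, 0],
])

def word_likelihoods(counts, alpha):
    """Add alpha to every count, then normalize each row to sum to 1."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

unsmoothed = word_likelihoods(raw_counts, alpha=0)
smoothed = word_likelihoods(raw_counts, alpha=1)

print(unsmoothed[1, 2])  # likelihood without smoothing
print(smoothed[1, 2])    # likelihood with alpha=1
```

<p>Without smoothing, the unseen word has a likelihood of exactly zero, which would wipe out the whole product. With <code>alpha=1</code> the second row becomes 5, 3, 1 out of 9, so the unseen word keeps a small but non-zero likelihood of 1/9.</p>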
<h3 id="probabilistic-prediction">Probabilistic prediction <a class="anchor" href="#probabilistic-prediction">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">predict_proba</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Predict probability of class membership &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="bp">self</span><span class="o">.</span><span class="n">is_fitted_</span><span class="p">,</span> <span class="s1">&#39;Model must be fit before predicting&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="n">X</span> <span class="o">=</span> <span class="n">check_array</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># loop over each observation to calculate conditional probabilities</span>
</span></span><span class="line"><span class="cl">    <span class="n">class_numerators</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">prior</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">word_exists</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">bool</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">lk_words_present</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">lk_word</span><span class="p">[:,</span> <span class="n">word_exists</span><span class="p">]</span> <span class="o">**</span> <span class="n">x</span><span class="p">[</span><span class="n">word_exists</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">lk_message</span> <span class="o">=</span> <span class="p">(</span><span class="n">lk_words_present</span><span class="p">)</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">class_numerators</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">lk_message</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">prior</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">normalize_term</span> <span class="o">=</span> <span class="n">class_numerators</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">conditional_probas</span> <span class="o">=</span> <span class="n">class_numerators</span> <span class="o">/</span> <span class="n">normalize_term</span>
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">conditional_probas</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">0.001</span><span class="p">)</span><span class="o">.</span><span class="n">all</span><span class="p">(),</span> <span class="s1">&#39;Rows should sum to 1&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">conditional_probas</span>
</span></span></code></pre></div><h3 id="binary-prediction">Binary prediction <a class="anchor" href="#binary-prediction">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Our <code>predict_proba</code> function will return probabilities for each class, but in our toy example we really just want a binary outcome: is the message spam or not? Once we&rsquo;ve done the work to get the class probabilities, it is easy to find the index of the highest probability class using <code>np.argmax(axis=1)</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Predict class with highest probability &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="putting-it-all-together">Putting it all together <a class="anchor" href="#putting-it-all-together">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>We defined the above logic as standalone functions, so now we need to assign each of them to the relevant method of our <code>NaiveBayes</code> class. This would not be necessary if we defined everything in a single module or notebook cell.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># attach functions defined above to our classifier</span>
</span></span><span class="line"><span class="cl"><span class="c1"># this is not needed if you define the entire class in a single cell</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">NaiveBayes</span><span class="o">.</span><span class="n">fit</span> <span class="o">=</span> <span class="n">fit</span>
</span></span><span class="line"><span class="cl"><span class="n">NaiveBayes</span><span class="o">.</span><span class="n">predict_proba</span> <span class="o">=</span> <span class="n">predict_proba</span>
</span></span><span class="line"><span class="cl"><span class="n">NaiveBayes</span><span class="o">.</span><span class="n">predict</span> <span class="o">=</span> <span class="n">predict</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">preds</span> <span class="o">=</span> <span class="n">NaiveBayes</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;Accuracy: </span><span class="si">{</span><span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>Accuracy: 1.0
</code></pre>
<p>You can find a gist with the code all together <a href="https://gist.github.com/asdfgeoff/5d63704c17052e642d3ea93351dda152" target="_blank">here</a>.</p>
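<p>One practical caveat: <code>predict_proba</code> multiplies raw likelihoods, and for long messages that product can underflow to zero in floating point. A common workaround is to sum log-likelihoods instead. The sketch below is one possible log-space variant, not the code from the gist. The function name <code>predict_proba_log</code> and the example parameters are assumptions for illustration, and it relies on every entry of <code>lk_word</code> being strictly positive, which smoothing guarantees.</p>

```python
import numpy as np

def predict_proba_log(lk_word, prior, X):
    """Posterior class probabilities computed in log space."""
    # A sum of count-weighted log likelihoods replaces the product of powers.
    log_numerators = X @ np.log(lk_word).T + np.log(prior)
    # Subtract each row's maximum before exponentiating to avoid underflow.
    log_numerators = log_numerators - log_numerators.max(axis=1, keepdims=True)
    numerators = np.exp(log_numerators)
    return numerators / numerators.sum(axis=1, keepdims=True)

# Hypothetical parameters: 2 classes x 3 words, and 2 messages of word counts.
lk_word = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
prior = np.array([0.6, 0.4])
X = np.array([[2, 0, 1], [0, 1, 3]])

probas = predict_proba_log(lk_word, prior, X)
print(probas)
```

<p>For short messages this gives the same probabilities as the direct product. The difference only shows up when word counts are large enough that the direct product would round to zero for every class.</p>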
<h3 id="comparing-with-sklearn">Comparing with sklearn <a class="anchor" href="#comparing-with-sklearn">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The function below fits our model alongside <code>MultinomialNB</code> and asserts that we have similar values for our priors, likelihoods, and predictions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">MultinomialNB</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">test_against_benchmark</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Check that DIY model matches outputs from scikit-learn estimator &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">make_spam_dataset</span><span class="p">(</span><span class="n">show_X</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">bench</span> <span class="o">=</span> <span class="n">MultinomialNB</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span> <span class="o">=</span> <span class="n">NaiveBayes</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">prior</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">bench</span><span class="o">.</span><span class="n">class_log_prior_</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">0.001</span><span class="p">)</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;[✔︎] Identical prior probabilities&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">lk_word</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">bench</span><span class="o">.</span><span class="n">feature_log_prob_</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">0.001</span><span class="p">)</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;[✔︎] Identical word likelihoods&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">assert</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">/</span> <span class="n">bench</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">0.001</span><span class="p">)</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;[✔︎] Identical predictions&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">test_against_benchmark</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>[✔︎] Identical prior probabilities
[✔︎] Identical word likelihoods
[✔︎] Identical predictions
</code></pre>
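<p>A note on tolerance checks when comparing two sets of estimates: a one-sided test of the form <code>ratio - 1 &lt; tol</code> only catches values that are too large, so it is worth comparing the absolute difference instead. The sketch below uses made-up arrays to show the difference, with <code>np.allclose</code> as the two-sided check.</p>

```python
import numpy as np

expected = np.array([0.25, 0.75])
too_small = np.array([0.05, 0.15])  # deliberately wrong: five times too small

# One-sided check: the ratio minus 1 is negative, so the comparison passes.
one_sided = bool((too_small / expected - 1 < 0.001).all())

# Two-sided check: compares the absolute difference against a relative tolerance.
two_sided = np.allclose(too_small, expected, rtol=0.001)

print(one_sided, two_sided)
```

<p>Here the one-sided check passes even though the values are badly off, while the two-sided check fails as it should.</p>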
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you want to learn more about Naive Bayes, here are a few of the resources I found most helpful.</p>
<ul>
<li><a href="https://courses.cs.washington.edu/courses/cse312/18sp/lectures/naive-bayes/naivebayesnotes.pdf" target="_blank">Notes on Naive Bayes Classifiers for Spam Filtering</a> (Jonathan Lee, University of Washington) is a good entry point, as it provides a relatively succinct description of the typical spam detection example for Naive Bayes.</li>
<li><a href="http://www.cs.columbia.edu/~mcollins/em.pdf" target="_blank">The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm</a> (Michael Collins, Columbia) provides a more comprehensive walkthrough of the math behind NB, including derivation of maximum likelihood estimates.</li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB" target="_blank">sklearn.naive_bayes.MultinomialNB</a> (scikit-learn docs) is the example implementation which I tried to reproduce.</li>
<li><a href="http://kenzotakahashi.github.io/naive-bayes-from-scratch-in-python.html" target="_blank">Naive Bayes from Scratch in Python</a> (Kenzo Takahashi) is the best DIY post I&rsquo;ve seen so far, and the key inspiration for this post.</li>
</ul>

      ]]></content:encoded></item><item><title>Render LaTeX math expressions in Hugo with MathJax 3</title><link>https://geoffruddock.com/math-typesetting-in-hugo/</link><pubDate>Tuesday, 04 Feb 2020</pubDate><guid>https://geoffruddock.com/math-typesetting-in-hugo/</guid><description>&lt;p>This blog runs on &lt;a href="https://gohugo.io/" target="_blank">Hugo&lt;/a>, a publishing framework which processes markdown text files into static web assets which can be conveniently hosted on a server without a database. It is great for a number of reasons (speed, simplicity) but one area where I find it lacking is in its support for math typesetting.&lt;/p>
&lt;h2 id="the-problem">The problem &lt;a class="anchor" href="#the-problem">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>Typically, you embed a JavaScript library such as &lt;a href="https://www.mathjax.org/" target="_blank">MathJax&lt;/a> or &lt;a href="https://katex.org/" target="_blank">KaTeX&lt;/a> by adding a line of HTML to your website template. While the page is loading in a visitor&amp;rsquo;s browser, the library finds text enclosed in dollar signs, renders it as LaTeX, and replaces the contents of the page.&lt;/p></description><content:encoded><![CDATA[
        <p>This blog runs on <a href="https://gohugo.io/" target="_blank">Hugo</a>, a publishing framework which processes markdown text files into static web assets which can be conveniently hosted on a server without a database. It is great for a number of reasons (speed, simplicity) but one area where I find it lacking is in its support for math typesetting.</p>
<h2 id="the-problem">The problem <a class="anchor" href="#the-problem">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Typically, you embed a JavaScript library such as <a href="https://www.mathjax.org/" target="_blank">MathJax</a> or <a href="https://katex.org/" target="_blank">KaTeX</a> by adding a line of HTML to your website template. While the page is loading in a visitor&rsquo;s browser, the library finds text enclosed in dollar signs, renders it as LaTeX, and replaces the contents of the page.</p>
<p>The problem is that the initial page contents have already been processed by Hugo&rsquo;s markdown engine before the page even loads. The markdown parser interprets underscores (<code>_</code>) as italics, and so it removes them and wraps the enclosed text in the appropriate HTML tags. However, the underscore is frequently used in LaTeX for subscripts. E.g. <code>x_1</code> gets rendered to $ x_1 $. So if your page contains multiple underscores, your LaTeX code will be broken before the page even starts loading.</p>
<h2 id="the-typical-solution">The (typical) solution <a class="anchor" href="#the-typical-solution">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The best general approach seems to be <a href="https://web.archive.org/web/20200217094652/http://doswa.com/2011/07/20/mathjax-in-markdown.html" target="_blank">this one</a>:</p>
<ol>
<li>Configure MathJax to attempt to typeset within <code>&lt;code&gt;</code> blocks (which it skips by default)</li>
<li>Add a class <code>has-jax</code> to your CSS which undoes whatever code-specific formatting your website uses.</li>
<li>Add a pseudo-callback to MathJax which waits until typesetting is complete, then runs a piece of JavaScript to add the above class to the parent element of every MathJax element.</li>
</ol>
<p>The page above includes all the necessary code snippets to implement this for MathJax 2.x. But MathJax 2 is <a href="https://www.intmath.com/cg5/katex-mathjax-comparison.php" target="_blank">a lot slower</a> than MathJax 3 or KaTeX. I tried simply swapping out the <code>src</code> for the newer version, but this did not work, because it seems that MathJax 3 uses an entirely different syntax than 2.x.</p>
<blockquote>
<p>MathJax v3 is a complete rewrite of MathJax from the ground up, and so its internal structure is quite different from that of version 2. That means MathJax v3 is <strong>not</strong> a drop-in replacement for MathJax v2, and upgrading to version 3 takes some adjustment to your web pages. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p></blockquote>
<h2 id="adapted-for-mathjax-3">Adapted for MathJax 3 <a class="anchor" href="#adapted-for-mathjax-3">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The code below is a modification of Doswa&rsquo;s code which loads MathJax 3 instead of 2.x.</p>
<ol>
<li>
<p>Create a file in your theme directory <code>layouts/partials/mathjax_support.html</code> as the following:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-html" data-lang="html"><span class="line"><span class="cl"><span class="p">&lt;</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nx">MathJax</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nx">tex</span><span class="o">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nx">inlineMath</span><span class="o">:</span> <span class="p">[[</span><span class="s1">&#39;$&#39;</span><span class="p">,</span> <span class="s1">&#39;$&#39;</span><span class="p">],</span> <span class="p">[</span><span class="s1">&#39;\\(&#39;</span><span class="p">,</span> <span class="s1">&#39;\\)&#39;</span><span class="p">]],</span>
</span></span><span class="line"><span class="cl">      <span class="nx">displayMath</span><span class="o">:</span> <span class="p">[[</span><span class="s1">&#39;$$&#39;</span><span class="p">,</span><span class="s1">&#39;$$&#39;</span><span class="p">],</span> <span class="p">[</span><span class="s1">&#39;\\[&#39;</span><span class="p">,</span> <span class="s1">&#39;\\]&#39;</span><span class="p">]],</span>
</span></span><span class="line"><span class="cl">      <span class="nx">processEscapes</span><span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nx">processEnvironments</span><span class="o">:</span> <span class="kc">true</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="nx">options</span><span class="o">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nx">skipHtmlTags</span><span class="o">:</span> <span class="p">[</span><span class="s1">&#39;script&#39;</span><span class="p">,</span> <span class="s1">&#39;noscript&#39;</span><span class="p">,</span> <span class="s1">&#39;style&#39;</span><span class="p">,</span> <span class="s1">&#39;textarea&#39;</span><span class="p">,</span> <span class="s1">&#39;pre&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="nb">window</span><span class="p">.</span><span class="nx">addEventListener</span><span class="p">(</span><span class="s1">&#39;load&#39;</span><span class="p">,</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nb">document</span><span class="p">.</span><span class="nx">querySelectorAll</span><span class="p">(</span><span class="s2">&#34;mjx-container&#34;</span><span class="p">).</span><span class="nx">forEach</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">x</span><span class="p">){</span>
</span></span><span class="line"><span class="cl">        <span class="nx">x</span><span class="p">.</span><span class="nx">parentElement</span><span class="p">.</span><span class="nx">classList</span><span class="p">.</span><span class="nx">add</span><span class="p">(</span><span class="s1">&#39;has-jax&#39;</span><span class="p">)})</span>
</span></span><span class="line"><span class="cl">    <span class="p">});</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">&lt;/</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span><span class="line"><span class="cl"><span class="p">&lt;</span><span class="nt">script</span> <span class="na">src</span><span class="o">=</span><span class="s">&#34;https://polyfill.io/v3/polyfill.min.js?features=es6&#34;</span><span class="p">&gt;&lt;/</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span><span class="line"><span class="cl"><span class="p">&lt;</span><span class="nt">script</span> <span class="na">type</span><span class="o">=</span><span class="s">&#34;text/javascript&#34;</span> <span class="na">id</span><span class="o">=</span><span class="s">&#34;MathJax-script&#34;</span> <span class="na">async</span>
</span></span><span class="line"><span class="cl">  <span class="na">src</span><span class="o">=</span><span class="s">&#34;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&#34;</span><span class="p">&gt;&lt;/</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span></code></pre></div></li>
<li>
<p>Next, open the file <code>layouts/partials/header.html</code> and add the following line just before the closing <code>&lt;/head&gt;</code> tag:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-html" data-lang="html"><span class="line"><span class="cl">{{ if .Params.mathjax }}{{ partial &#34;mathjax_support.html&#34; . }}{{ end }}
</span></span></code></pre></div></li>
<li>
<p>Then, add the following lines to your CSS file. You may need to tinker with the contents here depending on your theme, these are just the settings which worked for me.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-css" data-lang="css"><span class="line"><span class="cl"><span class="nt">code</span><span class="p">.</span><span class="nc">has-jax</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kp">-webkit-</span><span class="n">font-smoothing</span><span class="p">:</span> <span class="n">antialiased</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">background</span><span class="p">:</span> <span class="kc">inherit</span> <span class="cp">!important</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">border</span><span class="p">:</span> <span class="kc">none</span> <span class="cp">!important</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">font-size</span><span class="p">:</span> <span class="mi">100</span><span class="kt">%</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div></li>
<li>
<p>Finally, add <code>mathjax: true</code> to the YAML frontmatter of any pages containing math markup. Alternatively, you could omit the outer <code>{{ if .Params.mathjax }} … {{ end }}</code> conditional above to load the library automatically on <em>all</em> pages. However, given that this library is quite heavy (it&rsquo;s consistently the asset that <em>Google PageSpeed Insights</em> complains the most about) and that fewer than 20% of my blog posts contain math at all, this is worth the extra effort for me.</p>
</li>
</ol>
<h2 id="other-approaches-i-considered">Other approaches I considered <a class="anchor" href="#other-approaches-i-considered">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Here are a few other solutions I looked into, but ultimately decided not to adopt as a final solution.</p>
<h3 id="manually-escape-all-problematic-characters">Manually escape all problematic characters <a class="anchor" href="#manually-escape-all-problematic-characters">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>You could manually escape all underscore or backslash characters with an additional backslash. This works if you rarely use LaTeX and just need a specific expression to render correctly, but it will quickly get annoying if your posts include multiple math expressions. Besides breaking rendering of LaTeX in your markdown editor, it also makes the raw code difficult to read.</p>
<h3 id="use-mmark-markdown-processing-engine">Use MMark markdown processing engine <a class="anchor" href="#use-mmark-markdown-processing-engine">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Hugo lets you specify which processing engine to use to convert markdown during the build process. There is one engine—MMark—which handles LaTeX well and so makes the above modifications entirely unnecessary. This was the approach <a href="https://web.archive.org/web/20180118223655/https://gohugo.io/content-management/formats/" target="_blank">previously officially recommended</a> in Hugo documentation.</p>
<p>However, according to the <a href="https://gohugo.io/content-management/formats/" target="_blank">current docs</a>, MMark is deprecated and will be removed in a future release. It may work for a while still, but it doesn&rsquo;t make sense for me to adopt a solution that is already deprecated.</p>
<h3 id="goldmark-engine-with-mathjax-extension">Goldmark engine with MathJax extension <a class="anchor" href="#goldmark-engine-with-mathjax-extension">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The new default markdown engine used by Hugo is called goldmark. There is an extension <a href="https://github.com/litao91/goldmark-mathjax" target="_blank">goldmark-mathjax</a> that seems to do exactly what we want. But as of Feb 2020, a <a href="https://github.com/gohugoio/hugo/pull/6842" target="_blank">PR to merge it into Hugo</a> was rejected for relying on unacceptable dependencies. So for the time being, this approach would require forking Hugo and modifying it to use this extension. I have no real experience with Go, so I decided to avoid this approach for now.</p>
<h3 id="katex-math-shortcode">KaTeX math shortcode <a class="anchor" href="#katex-math-shortcode">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If you are willing to use KaTeX instead of MathJax, then <a href="https://bluestnight.com/docs/midnight/users/shortcodes/math/" target="_blank">this approach</a> may be a good option. But it is cumbersome to wrap all your inline math equations in a shortcode. It is already annoying that the backtick approach breaks in-editor LaTeX rendering in most editors, but at least the raw LaTeX code is displayed in monospace text, and the backticks do not take up much screen space. For example, to render $x=1$ you would need to type <code>{{ &lt; math &gt; }}x=1{{ &lt;/math&gt; }}</code>, which makes it even more difficult to read and edit content in your markdown editor. I didn&rsquo;t find the speed difference between KaTeX and MathJax 3 to be sufficient to justify the degraded editing experience.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="http://docs.mathjax.org/en/latest/upgrading/v2.html#version-2-compatibility-example" target="_blank">MathJax docs – Upgrading from v2 to v3</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>8 Big Ideas from Scott Page's “The Model Thinker”</title><link>https://geoffruddock.com/book-summary-the-model-thinker/</link><pubDate>Friday, 10 Jan 2020</pubDate><guid>https://geoffruddock.com/book-summary-the-model-thinker/</guid><description>&lt;p>I recently finished reading Scott E. Page&amp;rsquo;s wonderful book &lt;a href="https://www.amazon.com/Model-Thinker-What-Need-Know/dp/0465094627" target="_blank">The Model Thinker&lt;/a>. As a data scientist, I have a technical interest in models, particularly in the space of statistics and machine learning. As a general thinker, I am a big fan of Shane Parrish&amp;rsquo;s &lt;a href="https://fs.blog/mental-models/" target="_blank">mental models&lt;/a> concept, in which he champions developing an understanding of a wide breadth of models across disciplines to aid in general decision-making. A majority of the mental models on Farnam Street come from more of a psychology or behavioural economics background. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. He touches on 50+ models in the book, but here is a quick summary of a few big ideas which resonated with me.&lt;/p></description><content:encoded><![CDATA[
        <p>I recently finished reading Scott E. Page&rsquo;s wonderful book <a href="https://www.amazon.com/Model-Thinker-What-Need-Know/dp/0465094627" target="_blank">The Model Thinker</a>. As a data scientist, I have a technical interest in models, particularly in the space of statistics and machine learning. As a general thinker, I am a big fan of Shane Parrish&rsquo;s <a href="https://fs.blog/mental-models/" target="_blank">mental models</a> concept, in which he champions developing an understanding of a wide breadth of models across disciplines to aid in general decision-making. A majority of the mental models on Farnam Street come from more of a psychology or behavioural economics background. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. He touches on 50+ models in the book, but here is a quick summary of a few big ideas which resonated with me.</p>
<h2 id="what-makes-for-a-good-model">What makes for a good model? <a class="anchor" href="#what-makes-for-a-good-model">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="a-good-model-is-parsimonious">A good model is parsimonious <a class="anchor" href="#a-good-model-is-parsimonious">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>While describing different high-level types of models in the first chapter, the author references a joke I was not familiar with. The original joke<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> pokes fun at physicists for making unrealistic simplifying assumptions in their model, such as that of a cow being a perfect sphere.</p>
<blockquote>
<p>Milk production at a dairy farm was low, so the farmer wrote to the local university, asking for help from academia. A multidisciplinary team of professors was assembled, headed by a theoretical physicist, and two weeks of intensive on-site investigation took place.</p>
<p>The scholars then returned to the university, notebooks crammed with data, where the task of writing the report was left to the team leader. Shortly thereafter the physicist returned to the farm, saying to the farmer, &ldquo;I have the solution, but it works only in the case of spherical cows in a vacuum&rdquo;.</p></blockquote>
<p>But as the author points out, sometimes these amusingly extreme simplifications actually yield surprisingly usable rough results.</p>
<blockquote>
<p>The spherical cow is a favorite classroom example of the analogy approach: to make an estimate of the amount of leather in a cowhide, we assume a spherical cow. We do so because the integral tables in the back of calculus textbooks include <code>tan(x)</code> and <code>cos(x)</code> but not <code>cow(x)</code>.</p></blockquote>
<p>There is no model which is a perfect representation of reality. A model with perfect accuracy would be like a 1:1 scale map, which is clearly not practical to use. So when we select a model, we are implicitly selecting some factors to include and others to exclude. Effective models include the important factors—and are therefore accurate—while excluding the less important ones—and are therefore simple and hence useful to us.</p>
<h3 id="a-good-model-knows-its-purpose">A good model knows its purpose <a class="anchor" href="#a-good-model-knows-its-purpose">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>In the second chapter, <em>Why Model</em>, the author categorizes seven overarching uses for models:</p>
<ol>
<li>Reason: to identify conditions and deduce logical implications</li>
<li>Explain: to provide (testable) explanations for empirical phenomena</li>
<li>Design: to choose features of institutions, policies, and rules</li>
<li>Communicate: to relate knowledge and understandings</li>
<li>Act: to guide policy changes and strategic actions</li>
<li>Predict: to make numerical and categorical predictions of future and unknown phenomena</li>
<li>Explore: to investigate possibilities and hypotheticals</li>
</ol>
<p>It seems self-evident that models are used for a wide variety of purposes, but what is worth noting here is how the success criteria for each potential use-case could differ. This implies that anyone setting out to apply a model to solve a problem would be wise to carefully and honestly consider the core underlying purpose, in order to ensure success is actually achievable.</p>
<h4 id="interpretability-vs-predictive-power">Interpretability vs. predictive power <a class="anchor" href="#interpretability-vs-predictive-power">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>There is a trope in data science about much of machine learning being merely glorified applied statistics, but there is definitely an underlying tension between two paradigms: success as <em>interpretability</em> and success as <em>predictive power</em>.</p>
<p>Traditional statistics focuses on building models which have <em>explanatory power</em>. A good model is not just true, but also interpretable, and easy to interface into qualitative decision-making. Case in point, using the python package <code>statsmodels</code> to fit a linear model gives you a full <a href="https://www.statsmodels.org/dev/regression.html" target="_blank">R-style summary</a> of fit out of the box.</p>
<p>The more recent focus in pure ML arenas is around having good <em>predictive power</em> with less consideration given to our qualitative understanding of the inner workings of the models themselves. For example, take the <code>scikit-learn</code> approach to linear models, which does not even give you an easy way to visualize p-values out of the box.</p>
<p>I&rsquo;m not advocating one paradigm over the other, but it is important to honestly consider what success would look like for whatever project/decision/goal you are seeking to apply a model to solve. It&rsquo;s easy to pay lip service to pure predictive power, but will you and your team feel comfortable with a powerful algorithm whose decisions you don&rsquo;t understand?</p>
<h2 id="many-models-thinking">Many models thinking <a class="anchor" href="#many-models-thinking">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The third chapter is an appeal to adopting what the author calls &ldquo;many models thinking&rdquo; for which he lays out a theorem I was not familiar with.</p>
<blockquote>
<p><strong>Condorcet Jury Theorem</strong> – Each of an odd number of people (models) classifies an unknown state of the world as either true or false. Each classifies independently from one another, and classifies correctly with a probability $p&gt;\frac{1}{2}$ .</p>
<p>Theorem: A majority vote classifies correctly with higher probability than any person (model), and as the number of people (models) becomes large, the accuracy of the majority vote approaches 100%.</p></blockquote>
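<p>To make the theorem concrete, here is a minimal sketch that computes the exact accuracy of a majority vote from the binomial distribution. The function name <code>majority_accuracy</code> and the choice of $p=0.6$ are my own illustrative assumptions, not something taken from the book.</p>

```python
from math import comb


def majority_accuracy(n_voters: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest winning majority
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Accuracy of the majority as the (odd) jury grows, with p = 0.6 per voter.
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 4))
```

<p>With a single voter the accuracy is just $p$, and it climbs towards 1 as voters are added. Rerunning the loop with $p=0.5$ keeps the majority at exactly 0.5 for every jury size.</p>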
<h4 id="everything-is-a-remix">Everything is a remix <a class="anchor" href="#everything-is-a-remix">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>This immediately brings to mind the idea of <a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank">ensemble learning</a> in ML. Just substitute &ldquo;weak learners&rdquo; for a single vote, and &ldquo;strong learner&rdquo; for majority vote in the above theorem. I was surprised to discover that the Condorcet Jury Theorem was expressed in 1785—nearly 250 years ago. It is humbling to observe instances where seemingly modern techniques<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> are actually a remix of much older concepts from other fields.  There are no truly new ideas.</p>
<h4 id="the-devil-is-in-the-details">The devil is in the details <a class="anchor" href="#the-devil-is-in-the-details">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>The author points out that in reality we don&rsquo;t see our prediction accuracy go to 100% as we increase the number of models or inputs into a majority vote. The reason is usually that one of the assumptions in the above theorem is violated:</p>
<ul>
<li>Weak learners must each have <em>some</em> signal. If $p=\tfrac{1}{2}$, then we cannot improve predictions by averaging together pure noise.</li>
<li>The votes must be independent. If multiple votes are perfectly dependent, then they really only count for one vote. If they have some moderate level of correlation, then their absolute number is overstated.</li>
</ul>
<p>In real-world collective decision-making, it is plausible that <em>both</em> of these assumptions are violated. Votes are certainly not independent, and it is conceivable that some voters have <em>negative</em> signal—their predictions are wrong more than would occur due to pure chance.</p>
<p>Related concepts: <a href="https://en.wikipedia.org/wiki/Wisdom_of_the_crowd" target="_blank">wisdom of the crowd</a>, <a href="https://www.gwern.net/Prediction-markets" target="_blank">prediction markets</a></p>
<h2 id="adaptive-systems">Adaptive systems <a class="anchor" href="#adaptive-systems">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Systems which are able to respond to feedback pose additional challenges for quantifying the accuracy of our models and the predictions we generate from them.</p>
<blockquote>
<p>The <a href="https://en.wikipedia.org/wiki/Lucas_critique" target="_blank">Lucas Critique</a> states that changes in a policy or the environment likely produce behavioural responses by those affected. Models estimated with data on past human behaviours will therefore not be accurate. Models must take into account the fact that people respond to policy and environmental changes.</p></blockquote>
<h4 id="see-also-why-your-kpis-suck">See also: why your KPIs suck <a class="anchor" href="#see-also-why-your-kpis-suck">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>This brings to mind <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law" target="_blank">Goodhart&rsquo;s Law</a>, which tells us that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. This is a key challenge faced by anyone who has tried to design KPIs for an organization.</p>
<h2 id="power-law-long-tail-distributions">Power-law (long-tail) distributions <a class="anchor" href="#power-law-long-tail-distributions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Besides the Normal distribution, Power-law distributions are one of the most important statistical distributions to understand. Whereas aggregates of independent things tend to follow the normal distribution via the central limit theorem, aggregates of <strong>dependent</strong> things—particularly when feedback loops are involved—follow power-law distributions.</p>
<blockquote>
<p><strong>Power-law distributions</strong> – In a power-law distribution, the probability of an event is inversely related to its size: the larger the event, the less likely it occurs.</p>
<p>$$ p(x) = C x^{-a} $$</p></blockquote>
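<p>A rough way to see how heavy a power-law tail is: compare the probability of exceeding some size under the density above with the same probability under a standard normal. This is only a sketch. The exponent $a=3$, the lower cutoff <code>x_min = 1</code>, and the idea of comparing raw sizes against standard deviations are illustrative assumptions of mine.</p>

```python
from math import erfc, sqrt


def power_law_tail(x: float, a: float = 3.0, x_min: float = 1.0) -> float:
    """P(X > x) when the density is p(x) = C * x**(-a) for x >= x_min, a > 1."""
    if x_min >= x:
        return 1.0
    # Integrating the normalised density from x to infinity gives this ratio.
    return (x_min / x) ** (a - 1)


def normal_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))


for size in (2, 5, 10, 20):
    print(size, power_law_tail(size), normal_tail(size))
```

<p>Under these assumptions an event of size 10 has probability 0.01 under the power law, while a 10-sigma event under the normal distribution is so rare that it is effectively never observed.</p>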
<p>They are sometimes difficult to grasp intuitively though, which can cause problems when we attempt to use heuristics to gauge things like risk while subconsciously assuming a normal distribution.</p>
<blockquote>
<p>Contemplating a power-law distribution of human heights reveals how much power-law distributions differ from normal distributions. If human heights were distributed by a power law similar to that of city populations, and if we calibrate the mean height at 5 feet 9 inches, then the United States would include one person the height of the Empire State Building, over 10,000 people taller than giraffes, and 180 million people less than 7 inches tall.</p></blockquote>
<h4 id="power-laws-arise-due-to-preferential-attachment">Power-laws arise due to &ldquo;Preferential Attachment&rdquo; <a class="anchor" href="#power-laws-arise-due-to-preferential-attachment">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>The author presents a couple of potential causal factors which explain how power-law distributions arise. The most compelling is the <em>preferential attachment</em> model, which states that entities grow at rates proportional to their current size. Aka: the rich get richer and the poor get poorer. He gives a compelling example about a music download experiment:</p>
<blockquote>
<p>In the <strong>music lab experiments</strong>, college students could sample and download songs. In the first treatment, subjects did not know what songs others downloaded, and the distributions of downloads had a shorter tail—no song received more than two hundred downloads and only one song received fewer than thirty. In a second treatment, students knew what others downloaded. The tail of the distribution grew: one song received more than three hundred downloads. Perhaps more telling, over half received fewer than thirty. The tail became longer. Social influence increased inequality. This inequality is not a concern if social influence leads people to download better songs. However, correlations between downloads in the two treatments were not strong. If we interpret the number of downloads of a song in the first treatment as a proxy for the song’s quality, social influence did not result in people downloading better songs. The big winners were not random, but they were not the best.</p></blockquote>
<p>So as our world becomes more interconnected and feedback loops multiply, we should expect to see more long tails arise in situations where they may not have done so historically.</p>
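<p>The preferential attachment mechanism is easy to simulate. The sketch below is a toy model, not a reconstruction of the music lab experiment: the number of songs, the number of downloads, and the function name <code>simulate_downloads</code> are all hypothetical choices for illustration.</p>

```python
import random


def simulate_downloads(n_songs: int, n_downloads: int, social: bool, seed: int = 0) -> list[int]:
    """Return download counts per song.

    social=False: every download picks a song uniformly at random.
    social=True: a song is picked with probability proportional to its
    current count (every song starts at 1 so that it can be picked at all).
    """
    rng = random.Random(seed)
    counts = [1] * n_songs
    songs = range(n_songs)
    for _ in range(n_downloads):
        if social:
            pick = rng.choices(songs, weights=counts)[0]
        else:
            pick = rng.randrange(n_songs)
        counts[pick] += 1
    return counts


independent = simulate_downloads(48, 5000, social=False)
influenced = simulate_downloads(48, 5000, social=True)
print("top 3 without social influence:", sorted(independent, reverse=True)[:3])
print("top 3 with social influence:   ", sorted(influenced, reverse=True)[:3])
```

<p>Every song is identical in this toy model, so any inequality in the second run comes purely from the feedback loop, not from quality.</p>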
<p>See also: <a href="https://en.wikipedia.org/wiki/Black_swan_theory" target="_blank">Black swan theory</a>, <a href="https://en.wikipedia.org/wiki/Matthew_effect" target="_blank">the Matthew effect</a>.</p>
<h2 id="concavity-and-convexity">Concavity and convexity <a class="anchor" href="#concavity-and-convexity">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Concave and convex functions are another math concept which is much more profound when considered from an economic perspective. I will admit to invoking <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality" target="_blank">Jensen&rsquo;s inequality</a> in math proofs without truly reflecting on how it influences human decision-making.</p>
<h4 id="convexity-implies-risk-taking">Convexity implies risk-taking <a class="anchor" href="#convexity-implies-risk-taking">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><blockquote>
<p>Convex functions have an increasing slope: the function’s value increases by a larger amount as we increase a variable’s value. The number of possible pairs of people is a convex function of the group size. A group of three people includes three unique pairs. A group of four people includes six unique pairs, and a group of five includes ten unique pairs. Each increase in group size increases the number of pairs by a larger amount. Similarly, each time a chef adds a new spice to his repertoire, he increases the number of spice combinations by a larger amount.</p></blockquote>
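<p>The pair-counting example in the quote can be checked in a couple of lines. The number of unique pairs in a group of $n$ people is $\binom{n}{2}$, and the increments between consecutive group sizes keep growing, which is what convexity means here.</p>

```python
from math import comb

# Number of unique pairs for each group size from 2 to 7.
pairs = {n: comb(n, 2) for n in range(2, 8)}
# Extra pairs created by each additional member.
increments = [pairs[n + 1] - pairs[n] for n in range(2, 7)]

print(pairs)
print(increments)
```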
<h4 id="concavity-implies-diversity">Concavity implies diversity <a class="anchor" href="#concavity-implies-diversity">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><blockquote>
<p>Concave functions with positive slopes exhibit <em>diminishing returns</em>: the added value of each extra thing diminishes as we have more of that thing. Our utility or value from almost all goods exhibits diminishing returns. The more leisure, money, ice cream, or even time spent with loved ones, the less we value having more of it. Evidence for this can be found in the fact that the more we consume of just about anything, including chocolate, the less we enjoy it and the less we are willing to pay for it.</p></blockquote>
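<p>Jensen&rsquo;s inequality ties the two quotes together. As a small sketch, compare a 50/50 gamble with receiving its average for sure, once under a concave payoff and once under a convex one. The square-root and squared payoff functions, and the gamble itself, are arbitrary examples I picked, not the author&rsquo;s.</p>

```python
from math import sqrt


def expected(f, outcomes, probs):
    """Expected value of f(X) for a discrete random variable X."""
    return sum(p * f(x) for x, p in zip(outcomes, probs))


outcomes, probs = [0.0, 100.0], [0.5, 0.5]      # a coin-flip gamble
mean = expected(lambda x: x, outcomes, probs)   # the sure-thing alternative


def concave(x):
    return sqrt(x)   # diminishing returns


def convex(x):
    return x ** 2    # increasing returns


# Concave payoff: the sure thing beats the gamble, so spreading bets pays off.
print(expected(concave, outcomes, probs), concave(mean))
# Convex payoff: the gamble beats the sure thing, so risk-taking pays off.
print(expected(convex, outcomes, probs), convex(mean))
```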
<p>See also: <a href="https://www.amazon.com/Range-Generalists-Triumph-Specialized-World/dp/0735214484" target="_blank">Range by David Epstein</a>, <a href="https://en.wikipedia.org/wiki/Competent_man" target="_blank">Specialization is for insects</a>.</p>
<h2 id="markov-models">Markov models <a class="anchor" href="#markov-models">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Markov models describe sequential systems that follow the Markov property, which states that the probability of future states depends only on the current state, not the entire sequence of states that preceded it. Essentially: given the present, the past and future are conditionally independent.
$$
P(\text{Tomorrow}|\text{Today, Yesterday, 2 days ago…, Day 1}) = P(\text{Tomorrow}|\text{Today})
$$
The Markov property sounds like a gross over-simplification of reality, but it can yield surprisingly useful results, because it allows our models to capture a compromise between &ldquo;full independence&rdquo; and &ldquo;complete dependence&rdquo;, which is often impractical to model at all.</p>
<p>It can be shown that recurrent Markov chains which follow specific properties are guaranteed to converge to some long-run <a href="https://en.wikipedia.org/wiki/Stationary_distribution" target="_blank">stationary distribution</a> which reflects the long-run proportion of time spent in each state if the chain is run indefinitely.</p>
<blockquote>
<p><strong>Perron-Frobenius Theorem</strong> – A Markov process converges to a unique statistical equilibrium provided it satisfies four conditions:</p>
<ol>
<li>Finite set of states: $ S = \{ 1, 2, \ldots, K \} $</li>
<li>Fixed transition rule</li>
<li>Ergodicity (state accessibility): The system can get from any state to any other through a series of transitions.</li>
<li>Non-cyclic: The system does not produce a deterministic cycle through a sequence of states.</li>
</ol>
<p>The unique statistical equilibrium implies that long-run distributions of outcomes cannot depend on the initial state or on the path of events. In other words, initial conditions do not matter, and history does not matter. Nor can interventions that change the state matter. As time marches on, a process that satisfies the assumptions inexorably heads to its unique statistical equilibrium and then stays there.</p></blockquote>
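<p>Here is a minimal sketch of the &ldquo;initial conditions do not matter&rdquo; claim. The three-state transition matrix is hypothetical; I chose it only because it satisfies the four conditions (finite, fixed, every state reachable, no deterministic cycle).</p>

```python
def step(dist, transition):
    """Advance a probability distribution over states by one step of the chain."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def long_run(dist, transition, n_steps=200):
    """Apply the transition rule repeatedly and return the final distribution."""
    for _ in range(n_steps):
        dist = step(dist, transition)
    return dist


# Hypothetical chain: row i holds the probabilities of moving out of state i.
P = [
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
]

from_first = long_run([1.0, 0.0, 0.0], P)  # start with certainty in state 0
from_last = long_run([0.0, 0.0, 1.0], P)   # start with certainty in state 2
print(from_first)
print(from_last)
```

<p>Both starting points end up at the same distribution, which for this particular matrix works out to $(3/13, 6/13, 4/13)$.</p>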
<p>Besides being the foundation of <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo" target="_blank">MCMC</a>, this has interesting implications from a sociological perspective.</p>
<blockquote>
<p>The takeaway from the theorem should not be that history cannot matter but that if history does matter, one of the model’s assumptions must be violated. Two assumptions—the finite number of states and no simple cycle—almost always hold. Ergodicity can be violated, as when allies go to war and cannot transition back to an alliance. Such examples notwithstanding, ergodicity generally holds as well. The forces that create social inequality have proven immune to policy interventions. In Markov models <strong>interventions that change families’ states</strong>—such as special programs for underperforming students or a one-day food drive—can provide temporary boosts. They <strong>cannot change the long-run equilibrium</strong>. In contrast, interventions that provide resources and training that improve people’s ability to keep jobs, and therefore change their probabilities of moving from employed to unemployed, could change long-run outcomes. At a minimum, the model gives us a terminology—the distinction between states and transition probabilities—along with a logic to see the value of changing structural forces rather than the current state.</p></blockquote>
<p>This has a powerful implication for anyone attempting to alter the long-run state of a complex system. Rather than directly manipulating the states themselves, we should adopt <a href="https://fs.blog/2016/04/second-order-thinking/" target="_blank">second-order thinking</a> and consider how we can modify the transition probabilities <em>between</em> states such that our desired end state arises naturally. If your goal is to declutter your messy bedroom, you can set aside a weekend to go full <a href="https://www.netflix.com/title/80209379" target="_blank">Marie Kondo</a> on your wardrobe, but unless you implement systemic changes which influence the rate of accumulation of junk, you will find yourself back in the same state a year later.</p>
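<p>The difference between changing states and changing transition probabilities can be sketched with a two-state employed/unemployed chain. All the numbers below are made up for illustration, and <code>employment_path</code> is a hypothetical helper.</p>

```python
def employment_path(employed_share, p_lose, p_find, n_steps):
    """Track the employed share over time when, each period, a fraction p_lose
    of the employed lose their job and a fraction p_find of the unemployed
    find one."""
    path = [employed_share]
    for _ in range(n_steps):
        employed_share = employed_share * (1 - p_lose) + (1 - employed_share) * p_find
        path.append(employed_share)
    return path


baseline = employment_path(0.80, p_lose=0.05, p_find=0.20, n_steps=200)[-1]
# Intervention on the state: put everyone into a job once, change nothing else.
state_fix = employment_path(1.00, p_lose=0.05, p_find=0.20, n_steps=200)[-1]
# Intervention on the transitions: make jobs easier to keep.
structural_fix = employment_path(0.80, p_lose=0.02, p_find=0.20, n_steps=200)[-1]

print(baseline, state_fix, structural_fix)
```

<p>The one-off boost decays back to the same long-run share of <code>p_find / (p_lose + p_find)</code>, whereas the structural change moves that share itself.</p>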
<h2 id="systems-dynamics-models">Systems dynamics models <a class="anchor" href="#systems-dynamics-models">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Systems dynamics models give us a vocabulary for describing the behaviour of complex systems:</p>
<ul>
<li><strong>Sources</strong> produce inputs into the system.</li>
<li><strong>Sinks</strong> absorb outputs.</li>
<li><strong>Stocks</strong> keep track of levels of variables.</li>
<li><strong>Flows</strong> capture feedbacks between levels of stocks.</li>
</ul>
<p>A great place to learn more about this approach is Donella H Meadows&rsquo; book <a href="https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557" target="_blank">Thinking in Systems: A Primer</a> (summary <a href="https://readingraphics.com/book-summary-thinking-in-systems/" target="_blank">here</a>).</p>
<h4 id="long-run-stability-of-systems">Long-run stability of systems <a class="anchor" href="#long-run-stability-of-systems">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Feedback loops imply that some systems are not stable in the long-run.</p>
<blockquote>
<p>&ldquo;The basic logic of feedbacks is straightforward: positive feedbacks reinforce actions, negative feedbacks dampen them. A system with only positive feedbacks will either blow up or collapse. A system with only negative feedbacks will either stabilize or cycle. A system with both positive feedbacks and negative feedbacks has the potential to produce complexity.&rdquo;</p></blockquote>
<h4 id="reasoning-about-effects">Reasoning about effects <a class="anchor" href="#reasoning-about-effects">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>Feedback loops make it difficult to reason about the effect of small changes to long-run equilibrium.</p>
<blockquote>
<p>&ldquo;The direct effect of increasing the growth rate of hares is more hares. The indirect effect, more foxes, implies fewer hares. These two effects cancel out. Nonintuitive findings such as these are a hallmark of systems dynamics models. Our intuition fails because we latch onto direct effects and fail to think through the entire logical chain. Even if the direct effect of increasing (or decreasing) a rate or flow may be to increase (or decrease) a stock, the presence of systems effects in the form of positive and negative feedbacks means that other stocks will also change values, so the net effect of a change in a rate or flow may be reduced, canceled, or even reversed.&rdquo;</p></blockquote>
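<p>The hare and fox result can be reproduced with the classic Lotka-Volterra predator-prey equations. I am assuming that this is the model behind the quote, and the parameter names and values are my own. Setting both derivatives to zero gives the non-trivial equilibrium computed below.</p>

```python
def equilibrium(hare_growth, predation, fox_gain, fox_death):
    """Non-trivial equilibrium (H, F) of the Lotka-Volterra equations

        dH/dt = hare_growth * H - predation * H * F
        dF/dt = fox_gain * H * F - fox_death * F
    """
    hares = fox_death / fox_gain    # from dF/dt = 0
    foxes = hare_growth / predation  # from dH/dt = 0
    return hares, foxes


before = equilibrium(hare_growth=1.0, predation=0.1, fox_gain=0.02, fox_death=0.5)
after = equilibrium(hare_growth=2.0, predation=0.1, fox_gain=0.02, fox_death=0.5)
print(before)
print(after)
```

<p>Doubling the hare growth rate leaves the equilibrium number of hares unchanged and doubles the number of foxes, since the hare equilibrium depends only on the fox parameters.</p>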
<h2 id="modelling-human-behaviour-with-adaptive-rules">Modelling human behaviour with adaptive rules <a class="anchor" href="#modelling-human-behaviour-with-adaptive-rules">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>This book touches on game theory in a number of chapters. The most interesting section to me was a description of a problem where there is no dominant <em>pure</em> strategy. But when individual actors adopt diverse probabilistic actions, the system naturally reaches a collectively efficient outcome.</p>
<blockquote>
<p><strong>El Farol Bar problem</strong> – El Farol is a nightclub in Santa Fe, New Mexico that features dancing every Tuesday night. Each week, a population of 100 potential dancers decide whether to go dance at El Farol or stay home. All 100 people like to dance, but they do not want to go if the club is too crowded. Each person earns a payoff of zero from staying home, a payoff of 1 from attending if 60 or fewer people attend, and a payoff of -1 from attending when more than 60 people attend.</p>
<p>Simulations of this type of model find that if individuals possess a large ensemble of rules, then approximately 60 people attend each week: coordination emerges without any central planner. In other words, the system of adaptive rules self-organizes into nearly efficient outcomes.</p></blockquote>
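<p>To see why no single pure strategy works for everyone, here is a small sketch of the payoffs described above. It is not the adaptive-rules simulation from the book. It simply assumes, for illustration, that each of the other 99 dancers attends independently with the same probability $p$, and computes what you should expect from attending.</p>

```python
from math import comb


def payoff_of_attending(p: float, others: int = 99, capacity: int = 60) -> float:
    """Expected payoff of attending when each of the other dancers
    independently attends with probability p."""
    # The night is pleasant if at most capacity - 1 of the others show up,
    # so that total attendance including you stays within capacity.
    p_not_crowded = sum(
        comb(others, k) * p**k * (1 - p) ** (others - k)
        for k in range(capacity)
    )
    return p_not_crowded * 1 + (1 - p_not_crowded) * -1


for p in (0.0, 0.5, 0.6, 0.7, 1.0):
    print(p, round(payoff_of_attending(p), 3))
```

<p>If nobody else goes, attending pays 1 and beats staying home; if everybody else goes, attending pays -1 and staying home is better. The expected payoff crosses zero somewhere between those extremes, which is roughly where the adaptive population in the quote ends up.</p>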
<p>There is a feedback cycle between micro-level and macro-level rules. The decision of whether to attend or not (micro) influences the level of over-attendance (macro) which in turn influences the individual decisions in the next time period.</p>
<blockquote>
<p>If the rules people apply produce a crowded El Farol four weeks in a row, then rules that tell people to attend less often will produce higher payoffs. As people switch to those rules, fewer people will attend. The micro-level rules produce a macro-level phenomenon (over-attendance) that feeds back to the micro-level rules.</p></blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://en.wikipedia.org/wiki/Spherical_cow" target="_blank">https://en.wikipedia.org/wiki/Spherical_cow</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p><a href="http://rob.schapire.net/papers/strengthofweak.pdf" target="_blank">http://rob.schapire.net/papers/strengthofweak.pdf</a>&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Scraping unlisted stock prices with BeautifulSoup</title><link>https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/</link><pubDate>Saturday, 14 Dec 2019</pubDate><guid>https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/</guid><description>&lt;p>After taking a course on &lt;a href="https://quantsoftware.gatech.edu/Machine_Learning_for_Trading_Course" target="_blank">Machine Learning for Trading&lt;/a>, I decided to apply some of the concepts I had learned to model my own stock trading performance. Unfortunately this was not nearly as straightforward as I expected, since my trade history included a number of stocks which no longer exist.&lt;/p>
&lt;h2 id="how-do-you-find-the-share-price-of-an-unlisted-company">How do you find the share price of an unlisted company? &lt;a class="anchor" href="#how-do-you-find-the-share-price-of-an-unlisted-company">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the &lt;a href="https://github.com/ranaroussi/yfinance" target="_blank">yfinance&lt;/a> package. But these sources generally only contain data for currently listed stocks. My trade history includes a number of &lt;a href="https://en.wikipedia.org/wiki/IShares" target="_blank">iShares ETFs&lt;/a> which no longer exist, including one in particular: &lt;code>AAIT&lt;/code>. In my case, the ticker still exists in Yahoo Finance, but the data is clearly broken.&lt;/p></description><content:encoded><![CDATA[
        <p>After taking a course on <a href="https://quantsoftware.gatech.edu/Machine_Learning_for_Trading_Course" target="_blank">Machine Learning for Trading</a>, I decided to apply some of the concepts I had learned to model my own stock trading performance. Unfortunately this was not nearly as straightforward as I expected, since my trade history included a number of stocks which no longer exist.</p>
<h2 id="how-do-you-find-the-share-price-of-an-unlisted-company">How do you find the share price of an unlisted company? <a class="anchor" href="#how-do-you-find-the-share-price-of-an-unlisted-company">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the <a href="https://github.com/ranaroussi/yfinance" target="_blank">yfinance</a> package. But these sources generally only contain data for currently listed stocks. My trade history includes a number of <a href="https://en.wikipedia.org/wiki/IShares" target="_blank">iShares ETFs</a> which no longer exist, including one in particular: <code>AAIT</code>. In my case, the ticker still exists in Yahoo Finance, but the data is clearly broken.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/yahoo_finance_aait_hu_d57621578832d88d.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/yahoo_finance_aait.png"
                
    
            
                alt="Does not seem like a random walk to me" width="400"/> <figcaption>
                <p>Does not seem like a random walk to me</p>
            </figcaption>
    </figure>
<p>There are a number of <a href="http://quandl.com/" target="_blank">paid sources</a> for historical data of unlisted companies, but I can&rsquo;t justify paying $40 for data I am just using to satisfy my own curiosity. After a bit of googling, I found some price data for <code>AAIT</code> on a 90s-styled website <a href="https://www.historicalstockprice.com/history/?a=historical&amp;ticker=AAIT&amp;month=02&amp;day=11&amp;year=2013" target="_blank">historicalstockprice.com</a>. Unfortunately it only lets you view a single day at a time, and has no option for CSV export. On the bright side, this presented a good opportunity to play around with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a> Python library.</p>
<h4 id="update">UPDATE <a class="anchor" href="#update">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>I later found that <a href="https://www.investing.com/etfs/ishares-msci-allcou-asia-info-tech-historical-data" target="_blank">investing.com</a> has data for a number of unlisted stocks, including <code>AAIT</code>. It also lets you easily download a CSV file with daily prices. So you probably want to check that source before going to all the effort of writing a scraper from scratch.</p>
<h2 id="building-a-simple-web-scraper">Building a simple web scraper <a class="anchor" href="#building-a-simple-web-scraper">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h4 id="find-the-actual-url-to-scrape">Find the actual URL to scrape <a class="anchor" href="#find-the-actual-url-to-scrape">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>The first step is to figure out the actual URL we need to scrape. Let&rsquo;s start with the <a href="https://www.historicalstockprice.com/history/?a=historical&amp;ticker=AAIT&amp;month=02&amp;day=11&amp;year=2013" target="_blank">actual webpage itself</a>.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_page_hu_3e5e5083c81ccf0e.png 480w,
                
                       https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_page_hu_9f041f36d26f45d4.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_page_hu_9f041f36d26f45d4.png"
                
    
            
                alt="Scraping was just easier in the &rsquo;90s." width="500"/> <figcaption>
                <p>Scraping was just easier in the &rsquo;90s.</p>
            </figcaption>
    </figure>
<p>But if we look at the actual source code for the page (Right-click → <em>View Page Source</em> in Google Chrome) it appears that the price data is not there. So it seems that the price data is loaded from some other API, likely using JavaScript, after the page itself loads. Disabling JavaScript and reloading the page confirms our suspicion.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_no_javascript_hu_722b5cf8cbb0b7eb.png 480w,
                
                       https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_no_javascript_hu_3f04596d87a2248b.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/hsp_no_javascript_hu_3f04596d87a2248b.png"
                
    
            
                alt="With javascript disabled, our page does not contain any price data 😭" width="500"/> <figcaption>
                <p>With javascript disabled, our page does not contain any price data 😭</p>
            </figcaption>
    </figure>
<p>After reading through the source code, it is apparent that the page loads its contents from a <a href="https://www.tickertech.net/etfchannel/cgi/?a=historical&amp;ticker=AAIT&amp;month=02&amp;day=11&amp;year=2013" target="_blank">secondary URL</a>, which is the actual URL we want to scrape.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/scraping-unlisted-stock-prices-with-beautiful-soup/ticker_tech_sub_page.png"
                
    
            
                alt="Now we&rsquo;re making progress!" width="500"/> <figcaption>
                <p>Now we&rsquo;re making progress!</p>
            </figcaption>
    </figure>
<h4 id="write-a-scraper-using-beautifulsoup">Write a scraper using BeautifulSoup <a class="anchor" href="#write-a-scraper-using-beautifulsoup">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h4><p>The URL contains parameters for ticker, year, month, and day, so we just need to loop over our date range of interest, format the URL template with the appropriate parameters, and make an HTTP request for each date.</p>
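<p>As a rough sketch of the URL-building step on its own, the snippet below uses only the standard library. It assumes the query-string parameters seen in the secondary URL above. The helper names <code>weekdays</code> and <code>build_url</code> are hypothetical, and the loop skips weekends but not market holidays.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from datetime import date, timedelta
from urllib.parse import urlencode

# Endpoint taken from the secondary URL found in the page source.
BASE_URL = 'https://www.tickertech.net/etfchannel/cgi/'


def weekdays(start, end):
    """Yield every Monday-to-Friday date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        if day.weekday() not in (5, 6):  # 5 = Saturday, 6 = Sunday
            yield day


def build_url(ticker, day):
    """Fill in the query-string parameters for one ticker and one date."""
    params = {
        'a': 'historical',
        'ticker': ticker,
        'month': day.strftime('%m'),
        'day': day.strftime('%d'),
        'year': day.strftime('%Y'),
    }
    return BASE_URL + '?' + urlencode(params)


urls = [build_url('AAIT', d) for d in weekdays(date(2013, 2, 8), date(2013, 2, 12))]
for url in urls:
    print(url)
</code></pre>
<p>Letting <code>urlencode</code> assemble the query string avoids hand-escaping the separators between parameters.</p>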
<p>We are interested in the contents of the cell under &ldquo;Close&rdquo;. Ideally the response would contain a page with CSS classes and IDs, which we could use to cleverly select the appropriate element, but in our case there are no classes or IDs. But since the page always has the exact same structure, we can just take the contents of the fifth <code>td</code> element of the second <code>table</code> element.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">tqdm.notebook</span> <span class="kn">import</span> <span class="n">tqdm</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">scrape_hsp</span><span class="p">(</span><span class="n">ticker</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">start_date</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">end_date</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34; Scrape ticker data from historicalstockprice.com &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">URL</span> <span class="o">=</span> <span class="s1">&#39;https://www.tickertech.net/etfchannel/cgi/?a=historical&amp;ticker=</span><span class="si">{TICKER}</span><span class="s1">&amp;month=</span><span class="si">{MM}</span><span class="s1">&amp;day=</span><span class="si">{DD}</span><span class="s1">&amp;year=</span><span class="si">{YYYY}</span><span class="s1">&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="n">date_range</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">bdate_range</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">prices</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">date_range</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">dt</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">date_range</span><span class="p">,</span> <span class="n">unit</span><span class="o">=</span><span class="s1">&#39;days&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">year</span><span class="p">,</span> <span class="n">month</span><span class="p">,</span> <span class="n">day</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-</span><span class="si">%d</span><span class="s1">&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">formatted_url</span> <span class="o">=</span> <span class="n">URL</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">TICKER</span><span class="o">=</span><span class="n">ticker</span><span class="p">,</span> <span class="n">MM</span><span class="o">=</span><span class="n">month</span><span class="p">,</span> <span class="n">DD</span><span class="o">=</span><span class="n">day</span><span class="p">,</span> <span class="n">YYYY</span><span class="o">=</span><span class="n">year</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">page</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">formatted_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">page</span><span class="o">.</span><span class="n">content</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">val</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;table&#39;</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;td&#39;</span><span class="p">)[</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">&#39;font&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">prices</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">dt</span><span class="p">]</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">val</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">prices</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">prices</span> <span class="o">=</span> <span class="n">scrape_hsp</span><span class="p">(</span><span class="n">ticker</span><span class="o">=</span><span class="s1">&#39;AAIT&#39;</span><span class="p">,</span> <span class="n">start_date</span><span class="o">=</span><span class="s1">&#39;2013-01-01&#39;</span><span class="p">,</span> <span class="n">end_date</span><span class="o">=</span><span class="s1">&#39;2015-08-28&#39;</span><span class="p">)</span>
</span></span></code></pre></div>
<p>This takes 7-13 minutes to run for our selected date range, which is acceptable. If we needed to scrape a much larger date range or a number of symbols, we could use the <code>multiprocessing</code> library to make concurrent requests.</p>
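<p>Here is a minimal sketch of what concurrent requests could look like. It uses threads from the standard library&rsquo;s <code>concurrent.futures</code> rather than <code>multiprocessing</code>, on the assumption that the time is spent waiting on the network rather than on computation. The <code>fetch</code> argument is a hypothetical stand-in for the request-and-parse step inside <code>scrape_hsp</code>.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from concurrent.futures import ThreadPoolExecutor


def scrape_concurrently(days, fetch, max_workers=8):
    """Call fetch(day) for several days at once and return a dict of day to result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as its input.
        results = list(pool.map(fetch, days))
    return dict(zip(days, results))


def fake_fetch(day):
    """Placeholder: a real version would request the page and parse the close."""
    return None


prices_by_day = scrape_concurrently(['2013-02-11', '2013-02-12'], fake_fetch)
print(prices_by_day)
</code></pre>
<p>Keeping <code>max_workers</code> small is probably wise here, since every worker sends requests to the same small website.</p>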
<p>When we visualize the data below, we see that we&rsquo;ve got a reasonable time series of price data!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">prices</span><span class="o">.</span><span class="n">bfill</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;AAIT – Closing price&#39;</span><span class="p">);</span>
</span></span></code></pre></div><p><img src="./index_6_0.png" alt="png"></p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://realpython.com/beautiful-soup-web-scraper-python/" target="_blank">Beautiful Soup: Build a Web Scraper With Python</a> (Real Python) – Provides a good introduction to the <code>BeautifulSoup</code> python library, which is the most popular and well-documented library for building a scraper.</p>

      ]]></content:encoded></item><item><title>A clean way to share results from a Jupyter Notebook</title><link>https://geoffruddock.com/share-results-from-jupyter-notebook/</link><pubDate>Monday, 02 Dec 2019</pubDate><guid>https://geoffruddock.com/share-results-from-jupyter-notebook/</guid><description>&lt;p>I love Jupyter notebooks. As a data scientist, notebooks are probably &lt;em>the&lt;/em> fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a &lt;a href="https://www.locallyoptimistic.com/post/the-lab-book/" target="_blank">lab notebook&lt;/a> for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.&lt;/p></description><content:encoded><![CDATA[
        <p>I love Jupyter notebooks. As a data scientist, notebooks are probably <em>the</em> fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a <a href="https://www.locallyoptimistic.com/post/the-lab-book/" target="_blank">lab notebook</a> for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.</p>
<h2 id="the-drawbacks-of-notebooks">The drawbacks of notebooks <a class="anchor" href="#the-drawbacks-of-notebooks">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>But notebooks are not perfect. They introduce <a href="https://yihui.org/en/2018/09/notebook-war/" target="_blank">a number of problems</a>, including—but not limited to:</p>
<ol>
<li><strong>Modularity</strong> – reusable chunks of code tend to remain in notebooks rather than being extracted into their own modules—or even packages—as frequently as they should.</li>
<li><strong>Best practices</strong> – non-linear execution and global state are great for prototyping, but also make it cumbersome to refactor code later, or to write automated tests.</li>
<li><strong>Version control</strong> – Even if you do extract key functionality into its own modules, it becomes hard to keep track of these changes on GitHub, because they are dwarfed by pull requests which contain ±10k lines of changes, caused by the JSON representation of raw Jupyter notebooks.</li>
</ol>
<h2 id="presenting-your-results-to-non-technical-stakeholders">Presenting your results to non-technical stakeholders <a class="anchor" href="#presenting-your-results-to-non-technical-stakeholders">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>A critical junction arises near the end of any data science project: how will you share results with the relevant stakeholders? The tool of choice in many organisations (at least my own) tends to be <em>Google Slides</em>. Unfortunately I have created more than a few slide decks whose contents consist almost entirely of matplotlib PNGs, copy–pasted directly from a Jupyter notebook. This is sub-optimal, because it causes a disconnect between <em>code</em> and <em>content</em>. Future re-runs of your notebook, perhaps with fixed or fresh data, will not automatically update the visualizations in those slides. This decoupling counteracts much of the benefit of reproducibility which the notebook format promises in the first place.</p>
<h2 id="what-stops-us-from-presenting-the-notebook-itself">What stops us from presenting the notebook itself? <a class="anchor" href="#what-stops-us-from-presenting-the-notebook-itself">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Jupyter notebooks have built-in support for Markdown and HTML, so you <em>can</em> embed rich content and largely control formatting. The main obstacle to presentation-quality notebooks seems to be managing <em>attention</em>.</p>
<ol>
<li>It&rsquo;s difficult to focus the attention of your audience on a single thing like you can with slides.</li>
<li>Although we want to keep code (input cells) for reproducibility&rsquo;s sake, showing it is <em>distracting</em>.</li>
</ol>
<p>Take, for example, the screenshot below of the HTML output of a raw Jupyter notebook. Notice that the majority of our &ldquo;above the fold&rdquo; content here is irrelevant to almost any potential audience of the notebook. Only 20% of the height is made up of details around the analysis.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_raw_hu_d64822a5b7bd9c0f.png 480w,
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_raw_hu_e2de70d0427b2311.png 800w,
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_raw_hu_a32c6b390fea3f80.png 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_raw_hu_e2de70d0427b2311.png"
                
    
            
                alt="Not something you&rsquo;d want to share with a stakeholder." width="600"/> <figcaption>
                <p>Not something you&rsquo;d want to share with a stakeholder.</p>
            </figcaption>
    </figure>
<h2 id="existing-attempts-to-solve-this-problem">Existing attempts to solve this problem <a class="anchor" href="#existing-attempts-to-solve-this-problem">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="slides">Slides <a class="anchor" href="#slides">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>One solution I&rsquo;ve seen, most frequently used to give technical talks (e.g. at JupyterCon), is slides built using the <a href="https://github.com/damianavila/RISE" target="_blank">RISE</a> extension. These definitely solve our first problem (focusing the audience&rsquo;s attention) but don&rsquo;t address the second. In fact, they seem best suited for presentations where <em>the code itself</em> is an integral part of what is being presented. I suspect that&rsquo;s why they appear so frequently in technical talks, but less frequently elsewhere.</p>
<h3 id="nbconvert-with-no-input-flag">nbconvert with --no-input flag <a class="anchor" href="#nbconvert-with-no-input-flag">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Nbconvert has a built-in flag to hide input, but unfortunately it seems to result in a poorly formatted final output, in which the output of code cells is not aligned with the markdown cells.</p>
<pre tabindex="0"><code>jupyter nbconvert my_notebook.ipynb --no-input
</code></pre>
    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/nbconvert_no_input_hu_9256ca041e719e66.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/share-results-from-jupyter-notebook/nbconvert_no_input.png"
                
    
            
                alt="Still not something you&rsquo;d want to share with a stakeholder." width="300"/> <figcaption>
                <p>Still not something you&rsquo;d want to share with a stakeholder.</p>
            </figcaption>
    </figure>
<h3 id="static-website-generator">Static website generator <a class="anchor" href="#static-website-generator">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If you don&rsquo;t need <em>slides</em> specifically, and if you are interested in building up a consistent experience for your entire team, it might be worth using a <a href="https://mikkelhartmann.dk/2019/05/14/static-website-from-jupyter-notebooks.html" target="_blank">static website generator</a> to build a sort of knowledge repo from multiple notebooks. This is less well-suited for sharing a single notebook, particularly if you don&rsquo;t feel like deploying a site to host the output.</p>
<h2 id="a-solution-using-nbconvert-templates">A solution using nbconvert templates <a class="anchor" href="#a-solution-using-nbconvert-templates">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you are primarily interested in having a clean and shareable <em>report</em> rather than slides, it is possible to achieve this with vanilla <code>nbconvert</code>, rather than adding dependencies on external packages. The best solution I found was <a href="http://damianavila.github.io/blog/posts/mimic-the-ipython-notebook-cell-execution.html" target="_blank">this nbconvert template</a> by Damian Avila, which uses jQuery to add toggle functionality, such that the code is initially hidden but can be displayed by clicking on the output of any cell.</p>
<p>It is easy to use:</p>
<ol>
<li>Download the <a href="https://gist.github.com/asdfgeoff/cbb38d2116735aaca933467c6dbb17d5#file-toggle-tpl" target="_blank">toggle.tpl</a> template file.</li>
<li>Figure out where your jupyter template directory is, by running <code>from jupyter_core.paths import jupyter_path; print(jupyter_path('nbconvert','templates'))</code></li>
<li>Copy the template file to that directory.</li>
<li>From the command line in the directory containing your notebook, run <code>jupyter nbconvert my_notebook.ipynb --template=toggle</code></li>
</ol>
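<p>Since the conversion has to be re-run whenever the notebook changes, step 4 can be wrapped in a small helper. This is only a sketch: <code>nbconvert_command</code> is a hypothetical name, and it uses just the two flags that appear in this post.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import subprocess


def nbconvert_command(notebook, template=None, hide_input=False):
    """Build the argument list for a jupyter nbconvert call."""
    command = ['jupyter', 'nbconvert', notebook]
    if template is not None:
        command.append('--template=' + template)
    if hide_input:
        command.append('--no-input')
    return command


command = nbconvert_command('my_notebook.ipynb', template='toggle')
print(' '.join(command))

# Uncomment to run the conversion (needs jupyter and the template installed):
# subprocess.run(command, check=True)
</code></pre>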
<p>Here&rsquo;s what our output looks like after using <code>nbconvert</code> with a template to hide code cells.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_folded_hu_4a6ad7b648056cfc.png 480w,
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_folded_hu_7c4d026cdb4e9c87.png 800w,
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_folded_hu_27970065a5b5cfad.png 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/share-results-from-jupyter-notebook/screenshot_folded_hu_7c4d026cdb4e9c87.png"
                
    
            
                alt="An output you can proudly share with stakeholders." width="600"/> <figcaption>
                <p>An output you can proudly share with stakeholders.</p>
            </figcaption>
    </figure>
<p>What an improvement from our first attempt! In this (somewhat contrived) example, our entire document now fits &ldquo;above the fold&rdquo;. More importantly, the audience can easily grok the structure of the document and scan it visually.</p>
<h2 id="bonus-useful-jupyter-notebook-extensions">Bonus: useful jupyter notebook extensions <a class="anchor" href="#bonus-useful-jupyter-notebook-extensions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Jupyter has a useful package called <a href="https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html" target="_blank">nbextensions</a> which provides a bunch of extended functionality to your notebooks. There are two extensions in particular which are useful for our purposes.</p>
<h3 id="previewing-final-hidden-output-from-your-notebook">Previewing final &ldquo;hidden&rdquo; output from your notebook <a class="anchor" href="#previewing-final-hidden-output-from-your-notebook">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>There are a few nbextensions related to hiding code cells, but my favourite is <a href="https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/hide_input_all/readme.html" target="_blank">Hide input all</a>, which can be used to fold all cells in your notebook in a single click. This is great for previewing what the final HTML output will look like from within the notebook itself, rather than having to run the full <code>nbconvert</code> command each time.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/hide_all_input_hu_6b03f888089dc7fe.png 480w,
                
                       https://geoffruddock.com/share-results-from-jupyter-notebook/hide_all_input_hu_c870f1c3704e71d0.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/share-results-from-jupyter-notebook/hide_all_input_hu_c870f1c3704e71d0.png"
                
    
            
                alt="Clicking a single button hides all input cells in your notebook." width="400"/> <figcaption>
                <p>Clicking a single button hides all input cells in your notebook.</p>
            </figcaption>
    </figure>
<h3 id="adding-clickable-links-to-section-headers">Adding clickable links to section headers <a class="anchor" href="#adding-clickable-links-to-section-headers">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Another great nbextension is <a href="https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/toc2/README.html" target="_blank">Table of Contents (2)</a>, which builds a dynamically-updated ToC based on the markdown headings in a notebook. This serves as a good outline during editing, useful for reviewing and revising the macro-level structure of our document. The table is rendered with clickable links in the final html output, which enables readers to navigate through a large report by jumping right to a particular section.</p>

      ]]></content:encoded></item><item><title>Can you run an A/B test with unequal sample sizes?</title><link>https://geoffruddock.com/run-ab-test-with-unequal-sample-size/</link><pubDate>Monday, 25 Nov 2019</pubDate><guid>https://geoffruddock.com/run-ab-test-with-unequal-sample-size/</guid><description>&lt;p>I got an interesting question this week from a PM, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. Most sample size calculators, including our own internal one, assume an equal split between 2+ variations, so I had to take a step back to answer this question.&lt;/p>
&lt;h2 id="tldr-yes-but-you-wouldnt-want-to">TL;DR: Yes, but you wouldn&amp;rsquo;t want to. &lt;a class="anchor" href="#tldr-yes-but-you-wouldnt-want-to">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>You can run an experiment with an unequal allocation (e.g. 10–90) as long as you don&amp;rsquo;t modify the allocation while the experiment is running. However it will be less efficient than a 50–50 allocation—either your test will have less power, or you will need to run it longer to achieve a comparable result.&lt;/p></description><content:encoded><![CDATA[
        <p>I got an interesting question this week from a PM, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. Most sample size calculators, including our own internal one, assume an equal split between 2+ variations, so I had to take a step back to answer this question.</p>
<h2 id="tldr-yes-but-you-wouldnt-want-to">TL;DR: Yes, but you wouldn&rsquo;t want to. <a class="anchor" href="#tldr-yes-but-you-wouldnt-want-to">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>You can run an experiment with an unequal allocation (e.g. 10–90) as long as you don&rsquo;t modify the allocation while the experiment is running. However it will be less efficient than a 50–50 allocation—either your test will have less power, or you will need to run it longer to achieve a comparable result.</p>
<h2 id="do-unequal-sample-sizes-bias-results">Do unequal sample sizes bias results? <a class="anchor" href="#do-unequal-sample-sizes-bias-results">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>We want our A/B test results to be an unbiased estimator of the true effect. To achieve this, we rely on <a href="https://en.wikipedia.org/wiki/Randomized_controlled_trial" target="_blank">randomized assignment</a> to &ldquo;spread out&rdquo; the influence of confounding factors equally across variations, so that they do not influence our estimate of the difference or uplift between the variations. Even if the proportion of users assigned to each variation is unequal, randomized assignment still works as long as we don&rsquo;t <em>change</em> the traffic split. You should never modify the traffic allocation mid-experiment, because this can introduce temporal bias into your results. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<h2 id="are-unequal-sample-sizes-efficient">Are unequal sample sizes efficient? <a class="anchor" href="#are-unequal-sample-sizes-efficient">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>So it is <em>possible</em> to run an experiment with a non 50–50 split, but is it <em>advisable</em>? If our goal is to achieve some predetermined <a href="https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/" target="_blank">risk profile</a> as quickly as possible—then probably not.</p>
<p>Suppose we have a 15% conversion rate, and are designing an experiment to detect a 1% absolute increase with 90% power and 90% confidence. Let&rsquo;s use the <code>pwr</code> R library below, because it supports non-equal sample sizes. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pwr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">n1</span> <span class="o">=</span> <span class="m">25000</span>
</span></span><span class="line"><span class="cl"><span class="n">n2</span> <span class="o">=</span> <span class="m">25000</span>
</span></span><span class="line"><span class="cl"><span class="n">p1</span> <span class="o">=</span> <span class="m">0.15</span>
</span></span><span class="line"><span class="cl"><span class="n">p2</span> <span class="o">=</span> <span class="m">0.16</span>
</span></span><span class="line"><span class="cl"><span class="n">h</span> <span class="o">=</span> <span class="nf">abs</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nf">asin</span><span class="p">(</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">p1</span><span class="p">))</span><span class="m">-2</span><span class="o">*</span><span class="nf">asin</span><span class="p">(</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">p2</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pwr.2p2n.test</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">n1</span><span class="o">=</span><span class="n">n1</span><span class="p">,</span> <span class="n">n2</span><span class="o">=</span><span class="n">n2</span><span class="p">,</span> <span class="n">sig.level</span><span class="o">=</span><span class="m">0.10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>             n1 = 25000
             n2 = 25000
      sig.level = 0.1
          power = 0.9257466
    alternative = two.sided
</code></pre><p>So with a 50–50 split, you need to run the experiment on 50k total users (25k per variation) to get the desired result. What happens if we use a 10–90 split instead?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">n1</span> <span class="o">=</span> <span class="m">5000</span>
</span></span><span class="line"><span class="cl"><span class="n">n2</span> <span class="o">=</span> <span class="m">45000</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pwr.2p2n.test</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">n1</span><span class="o">=</span><span class="n">n1</span><span class="p">,</span> <span class="n">n2</span><span class="o">=</span><span class="n">n2</span><span class="p">,</span> <span class="n">sig.level</span><span class="o">=</span><span class="m">0.10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>             n1 = 5000
             n2 = 45000
      sig.level = 0.1
          power = 0.5829899
    alternative = two.sided
</code></pre><p>Uh-oh! The <a href="https://en.wikipedia.org/wiki/Power_%28statistics%29" target="_blank">power</a> of your experiment (its ability to detect a true effect) falls to under 60%. Let&rsquo;s scale up our total sample size to find the point at which we achieve similar power to our initial plan.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">n1</span> <span class="o">=</span> <span class="m">5000</span> <span class="o">*</span> <span class="m">2.8</span>
</span></span><span class="line"><span class="cl"><span class="n">n2</span> <span class="o">=</span> <span class="m">45000</span> <span class="o">*</span> <span class="m">2.8</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pwr.2p2n.test</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">n1</span><span class="o">=</span><span class="n">n1</span><span class="p">,</span> <span class="n">n2</span><span class="o">=</span><span class="n">n2</span><span class="p">,</span> <span class="n">sig.level</span><span class="o">=</span><span class="m">0.10</span><span class="p">)</span>
</span></span></code></pre></div><pre tabindex="0"><code>             n1 = 14000
             n2 = 126000
      sig.level = 0.1
          power = 0.9274638
    alternative = two.sided
</code></pre><p>So a 10–90 allocation would require <strong>2.8x</strong> as many total users to reach the same outcome as a 50–50 split. We can understand why this is the case by looking at the formula for the standard error of the difference between two binomial proportions, which defines the width of our confidence intervals.
$$
SE_{\Delta} = \sqrt{\frac{p_a(1-p_a)}{n_a} + \frac{p_b(1-p_b)}{n_b}}
$$
A lower standard error means greater certainty. The overall term decreases whenever we collect samples in either variation, increasing either $ n_a $ or $ n_b $. But there are diminishing returns as $ n_i $ increases. Suppose we&rsquo;ve already collected 1000 samples in variation A, but only 100 samples in variation B. Collecting an additional 100 samples in A will only shrink A&rsquo;s term under the square root by about 9%, whereas an additional 100 samples in B would cut B&rsquo;s term in half.</p>
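<p>To see where the 2.8x comes from, here is a rough back-of-the-envelope sketch. It assumes (a simplification of mine, not something computed by <code>pwr</code>) that $ p_a \approx p_b \approx p $, that the total sample size is $ N $, and that a fraction $ w $ of users goes to variation A, so that $ n_a = wN $ and $ n_b = (1-w)N $.
$$
SE_{\Delta}^2 \approx \frac{p(1-p)}{wN} + \frac{p(1-p)}{(1-w)N} = \frac{p(1-p)}{N \, w(1-w)}
$$
Matching the standard error of a 50–50 split with total sample size $ N_{50} $ then requires
$$
\frac{N}{N_{50}} = \frac{0.5 \times 0.5}{w(1-w)} = \frac{0.25}{0.1 \times 0.9} \approx 2.78
$$
for $ w = 0.1 $, which is in line with the 2.8x multiplier used above.</p>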
<h2 id="when-do-unequal-sample-sizes-make-sense">When do unequal sample sizes make sense? <a class="anchor" href="#when-do-unequal-sample-sizes-make-sense">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you look closely at the R outputs above, you&rsquo;ll notice that while our <em>total</em> users required is 2.8x, the number of users assigned to the control group (<code>n1</code>) is actually lower—14k vs 25k. So if we have a very strong <a href="https://en.wikipedia.org/wiki/Prior_probability" target="_blank">prior belief</a> in our change—but still want to perform some perfunctory experimentation—an unequal sample size could make sense here. But it&rsquo;s a double-edged sword: if your change is <em>worse</em> than baseline, you will have ultimately exposed more users to the change than necessary to reach a conclusive result. Probably best to keep it 50–50, since your typical A/B test design involves <a href="https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/" target="_blank">enough factors to consider already</a>.</p>
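<p>The same trade-off can be sketched for other allocations. The snippet below is a rough approximation rather than a replacement for <code>pwr</code>: it assumes both variations have roughly equal variance, so that the total sample size needed to match a 50–50 split scales with $ 0.25 / (w(1-w)) $, where $ w $ is the share of traffic in control. The helper name <code>allocation_tradeoff</code> and the 50k baseline are illustrative assumptions.</p>
<pre tabindex="0"><code class="language-R" data-lang="R"># Rough sketch: how total and control-group sample sizes scale with allocation.
# Assumes roughly equal variance in both groups, so that the required total
# scales with 1 / (w * (1 - w)), where w is the share of traffic in control.
allocation_tradeoff = function(w, total_5050 = 50000) {
  multiplier = 0.25 / (w * (1 - w))  # total users relative to a 50-50 split
  total = total_5050 * multiplier
  data.frame(
    control_share = w,
    multiplier = round(multiplier, 2),
    total_users = round(total),
    control_users = round(w * total),
    variation_users = round((1 - w) * total)
  )
}

tradeoff = allocation_tradeoff(c(0.5, 0.3, 0.2, 0.1))
print(tradeoff)
</code></pre>
<p>Under this approximation, a 10–90 split needs roughly 139k users in total but only about 14k in control, in line with the <code>pwr</code> output above.</p>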
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://help.vwo.com/hc/en-us/articles/360034153494-Can-I-Change-Traffic-Distribution-while-a-Test-Is-Running-" target="_blank">Can I Change Traffic Distribution while a Test Is Running?</a> [VWO]&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p><a href="https://rpubs.com/sypark0215/223385" target="_blank">Proportional power analysis in unequal sample size</a> [RPubs]&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Planning A/B tests with a symmetric risk profile (α=β)</title><link>https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/</link><pubDate>Monday, 11 Nov 2019</pubDate><guid>https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/</guid><description>&lt;p>Here is a somewhat unconventional recommendation for the design of online experiments:&lt;/p>
&lt;p>&lt;strong>Set your default parameters for alpha (α) and beta (β) to the same value.&lt;/strong>&lt;/p>
&lt;p>This implies that you incur the same cost from a false positive as from a false negative. I am not suggesting you necessarily &lt;em>use&lt;/em> these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making. A default of symmetric risk (setting α=β) has a beneficial side effect of making experiment design easier to understand and communicate. A more parsimonious and intuitive process is more likely to actually get performed the next time someone in your org is planning an experiment.&lt;/p></description><content:encoded><![CDATA[
        <p>Here is a somewhat unconventional recommendation for the design of online experiments:</p>
<p><strong>Set your default parameters for alpha (α) and beta (β) to the same value.</strong></p>
<p>This implies that you incur the same cost from a false positive as from a false negative. I am not suggesting you necessarily <em>use</em> these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making. A default of symmetric risk (setting α=β) has a beneficial side effect of making experiment design easier to understand and communicate. A more parsimonious and intuitive process is more likely to actually get performed the next time someone in your org is planning an experiment.</p>
<h2 id="why-sample-size-calculations-actually-matter">Why sample size calculations actually matter <a class="anchor" href="#why-sample-size-calculations-actually-matter">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Performing a <a href="https://academic.oup.com/ndt/article/25/5/1388/1842407" target="_blank">sample size calculation</a> is the most important first step you can take to ensure your experiment is successful. The calculation itself acts as a forcing function<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, requiring us to ask ourselves a number of questions which reduce our chances of succumbing to common post-analysis pitfalls such as underpowered tests or the multiple comparisons problem.</p>
<ul>
<li>What is the specific metric we will use to measure success of this experiment?</li>
<li>What magnitude of effect do we expect to see? Are changes on the scale of 1% or 100%?</li>
<li>What level of risk are we willing to accept of being wrong?</li>
</ul>
<p>Unfortunately, many people consider this calculation to be optional. In many companies, there is nothing truly blocking people from starting an experiment without a plan. So in the interest of efficiency and 80/20, many teams end up embracing a de facto test-first, analyze-second strategy. Besides making us vulnerable to the post-analysis pitfalls mentioned above, this unfortunately also reduces our capacity to <em>learn</em> from experiments. The beauty of the scientific method is that when we make falsifiable hypotheses and proceed to falsify them, we are then presented with a golden opportunity to refactor our mental models of the world. We can use data to <a href="https://alearningaday.blog/2019/10/31/refining-your-gut-reaction-with-data/" target="_blank">refine our intuitions</a>. But if we don&rsquo;t actually write out a crisp hypothesis before starting the experiment, it is too easy to fall victim to <a href="https://en.wikipedia.org/wiki/Hindsight_bias" target="_blank">hindsight bias</a>, subconsciously rewriting the narrative into one which affirms our identity but denies us personal growth.</p>
<h2 id="a-very-brief-review-of-type-i--ii-errors">A very brief review of Type I &amp; II errors <a class="anchor" href="#a-very-brief-review-of-type-i--ii-errors">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Without diving too deep here, recall that there are two key parameters which correspond to the two ways we can make a mistake in the context of a statistical test:</p>
<ul>
<li>alpha (α) represents our long-run accepted risk of <em>false positives</em> (FPR).</li>
<li>beta (β) represents our long-run accepted risk of <em>false negatives</em> (FNR).</li>
<li>The <em>power</em> of a statistical test is its ability to correctly identify a true effect (1-β)</li>
</ul>
<p>I will defer to Google&rsquo;s <a href="https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative" target="_blank">ML Crash Course</a> for deeper understanding on this topic, since it provides the clearest learning example I&rsquo;ve seen using a &ldquo;boy who cried wolf&rdquo; analogy.</p>
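<p>For a more hands-on feel for what &ldquo;long-run&rdquo; means here, these error rates can also be simulated. The sketch below repeatedly runs a two-sided two-proportion z-test on synthetic data. The conversion rates, sample size, number of simulations, and the helper name <code>simulate_rejection_rate</code> are all illustrative assumptions.</p>
<pre tabindex="0"><code class="language-R" data-lang="R"># Minimal simulation sketch of long-run error rates for a two-sided
# two-proportion z-test. All numbers below are illustrative assumptions.
set.seed(42)

simulate_rejection_rate = function(p_a, p_b, n, alpha = 0.05, n_sims = 2000) {
  # Simulate conversions in each group for n_sims independent experiments
  x_a = rbinom(n_sims, n, p_a)
  x_b = rbinom(n_sims, n, p_b)
  p_pool = (x_a + x_b) / (2 * n)
  se = sqrt(p_pool * (1 - p_pool) * 2 / n)
  z = (x_b / n - x_a / n) / se
  # Share of simulated experiments in which we reject the null hypothesis
  mean(abs(z) > qnorm(1 - alpha / 2))
}

n = 20000  # users per variation
false_positive_rate = simulate_rejection_rate(0.15, 0.15, n)  # no true effect
sim_power = simulate_rejection_rate(0.15, 0.16, n)            # true effect exists
false_negative_rate = 1 - sim_power

print(c(alpha = false_positive_rate, beta = false_negative_rate, power = sim_power))
</code></pre>
<p>With no true effect, the share of rejections should hover around the chosen α. With a true effect, the share of rejections is the power, and its complement is β.</p>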
<h2 id="the-problem-with-your-typical-sample-size-calculation">The problem with your typical sample size calculation <a class="anchor" href="#the-problem-with-your-typical-sample-size-calculation">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The typical sample size calculation is a trade-off between three parameters: α, β, and the <em>minimum detectable effect</em> (MDE), which is the smallest relative change in our metric of interest which is meaningful to us.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/sample_size_calc_v2_hu_fdc78a31fe1ac2f8.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/sample_size_calc_v2.png"
                
    
            
                alt="Required sample size is a function of three input parameters." width="500"/> <figcaption>
                <p>Required sample size is a function of three input parameters.</p>
            </figcaption>
    </figure>
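<p>As a rough illustration of this three-parameter calculation, here is a sketch using <code>power.prop.test</code> from base R. The helper name <code>n_per_variation</code>, the 15% baseline, and the MDE values are assumptions chosen for illustration, and the calculation assumes an equal split between two variations.</p>
<pre tabindex="0"><code class="language-R" data-lang="R"># Sketch: required sample size per variation as a function of alpha, beta, MDE.
# n_per_variation() is a hypothetical helper; power.prop.test() ships with
# base R (stats package) and assumes two equally sized groups.
n_per_variation = function(baseline, mde_rel, alpha = 0.05, beta = 0.20) {
  result = power.prop.test(
    p1 = baseline,
    p2 = baseline * (1 + mde_rel),  # MDE expressed as a relative change
    sig.level = alpha,
    power = 1 - beta
  )
  ceiling(result$n)
}

# 15% baseline conversion, 5% relative MDE, conventional defaults
n_default = n_per_variation(0.15, 0.05)
# Same experiment, but with a less demanding 10% relative MDE
n_larger_mde = n_per_variation(0.15, 0.10)
print(c(n_default = n_default, n_larger_mde = n_larger_mde))
</code></pre>
<p>Doubling the MDE cuts the required sample size to roughly a quarter, which is why the MDE is usually the first parameter stakeholders want to negotiate.</p>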
<p>This calculation is straightforward if we have predetermined inputs and merely want to know the output. But this does not match the reality of planning an experiment in a tech company. It is more of a negotiation than a calculation, particularly when working with a non-technical stakeholder. A typical conversation around sample size might look like this:</p>
<ol>
<li>PM asks for your help planning an experiment for a new feature they are launching.</li>
<li>You calculate the required sample size based on their primary KPI and send it back.</li>
<li>PM replies asking if you accidentally meant <em>days</em> where you wrote <em>X weeks duration</em>.</li>
<li>You explain the nature of the calculation, false positives, false negatives, etc.</li>
<li>PM probes for where he or she can apply the good ol&rsquo; 80-20 rule to achieve results more quickly.</li>
</ol>
<p>This conversation can be frustrating for many analysts, but essentially what your stakeholder is trying to do here is to develop an intuition for what the marginal cost of each parameter is, so that they can discern where to compromise. This process is a bit clumsy when we&rsquo;ve got three &ldquo;knobs&rdquo; to work with. When we set α=β, we effectively eliminate one of these knobs, and turn it into a two-dimensional problem involving MDE and risk. At this point, we can summarize the required sample size at various levels of each using a data table or a 2D plot. Conceptually, we can visualize the trade-off between these three parameters in a similar fashion to the <a href="https://en.wikipedia.org/wiki/Project_management_triangle" target="_blank">project management triangle</a>.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/triangle.png"
                
    
            
                alt="Any change in one dimension requires sacrifice in one of the other two." width="500"/> <figcaption>
                <p>Any change in one dimension requires sacrifice in one of the other two.</p>
            </figcaption>
    </figure>
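<p>The two-dimensional summary described above (MDE versus a shared risk level) can be sketched as a small table. This is a minimal example using <code>power.prop.test</code> from base R. The 15% baseline and the grid of MDE and risk values are illustrative assumptions.</p>
<pre tabindex="0"><code class="language-R" data-lang="R"># Sketch: required users per variation across MDE and a shared risk level
# (alpha = beta). The grid values and the 15% baseline are illustrative.
baseline = 0.15
mde_levels = c(0.02, 0.05, 0.10)   # relative MDE
risk_levels = c(0.05, 0.10, 0.20)  # alpha = beta

# Rows correspond to MDE levels, columns to risk levels
sample_size_table = sapply(risk_levels, function(risk) {
  sapply(mde_levels, function(mde) {
    ceiling(power.prop.test(
      p1 = baseline,
      p2 = baseline * (1 + mde),
      sig.level = risk,
      power = 1 - risk
    )$n)
  })
})
dimnames(sample_size_table) = list(
  paste0("MDE=", mde_levels * 100, "%"),
  paste0("risk=", risk_levels * 100, "%")
)
print(sample_size_table)
</code></pre>
<p>Reading along a row shows how much sample size you save by accepting more risk. Reading down a column shows the saving from targeting a larger effect.</p>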
<h2 id="question-your-defaults-α005-β020">Question your defaults (α=0.05, β=0.20) <a class="anchor" href="#question-your-defaults-%ce%b1005-%ce%b2020">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The first page of Google results consists largely of mediocre sample size calculators pretending to be easy-to-use by simply hiding the α and β parameters. The better ones (including my personal favourite, <a href="https://www.evanmiller.org/ab-testing/sample-size.html" target="_blank">Evan Miller&rsquo;s Sample Size Calculator</a>) set default values and provide clear explanations as to what these parameters mean. And yet, every calculator which does display α and β (including Evan&rsquo;s and <a href="https://cxl.com/ab-test-calculator/" target="_blank">other</a> <a href="https://www.abtasty.com/sample-size-calculator/" target="_blank">ones</a>) sets its default values to α=0.05 and β=0.20.</p>
<p>If you don&rsquo;t have a particularly strong opinion on (or understanding of) what your relative ratio between these types of risk should be, it is tempting to simply go with the default options. Before you do so, allow me the opportunity to disabuse you of the notion that these are sacred numbers, unanimously agreed upon by some group of clever statisticians sitting in some room years ago.</p>
<h3 id="significance">Significance <a class="anchor" href="#significance">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>There has been a decent amount of media coverage recently around the problems with p-values<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> and their role in the social sciences replication crisis. So naturally at least a few people have asked <em>Why is 0.05 such a sacred number?</em></p>
<blockquote>
<p>The use of the 5% p-value threshold appears to have become universal in biomedical research, yet it does not seem to to be based on any clear statistical reasoning. So far as I can make out, the origin of this threshold seems to lie in a discussion of the theoretical basis of experimental design, published by the Cambridge geneticist and statistician RA Fisher in 1926.</p></blockquote>
<p>— <a href="https://www.bmj.com/rapid-response/2011/11/03/origin-5-p-value-threshold" target="_blank">Origin of the 5% p-value threshold</a> [BMJ]</p>
<p>The short answer: it isn&rsquo;t. But even though there is nothing <em>a priori</em> special about <code>p &lt; 0.05</code>, one could make a solid argument that the practice of having a generally-agreed-upon benchmark is the important part. A shared standard is valuable when we want to compare levels of evidence across different studies or research groups.</p>
<h3 id="power">Power <a class="anchor" href="#power">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The practice of planning experiments with 80% power is an equally accepted standard, but it does not seem to be discussed nearly as often. It also raises the question: why are these defaults set at a 4:1 ratio?</p>
<blockquote>
<p>Although there are no formal standards for power (sometimes referred to as π), most researchers assess the power of their tests using π = 0.80 as a standard for adequacy. This convention implies a four-to-one trade off between <em>β</em>-risk and <em>α</em>-risk. (<em>β</em> is the probability of a Type II error, and α is the probability of a Type I error; 0.2 and 0.05 are conventional values for <em>β</em> and <em>α</em>).</p></blockquote>
<p>— <a href="https://en.wikipedia.org/wiki/Power_%28statistics%29" target="_blank">Power (statistics)</a> [Wikipedia]</p>
<p>I suspect that this asymmetry in risk is at least partially due to the close connection between the development of statistics and the biomedical space. Suppose you are a statistician working for a pharma company. You are running an experiment to determine whether a potential new drug is more effective at treating a particular ailment than an existing alternative. In this context, a false negative (failing to detect that the new drug is in fact more effective) could mean aborting development and missing out on the potential profit from bringing it to market. A false positive (incorrectly concluding the new drug is more effective when it is the same or worse) could mean spending billions of dollars to bring an ineffective drug to market, then subsequently spending billions more on lawsuits and reputational damage in the decade that follows. In this hypothetical high-stakes scenario in which we face asymmetric costs, it is prudent to be extra-conservative on false positives (α) at the expense of increased false negatives (β).</p>
<h2 id="flavours-of-hypothesis-testing-fisher-vs-neymanpearson">Flavours of hypothesis testing: Fisher vs. Neyman–Pearson <a class="anchor" href="#flavours-of-hypothesis-testing-fisher-vs-neymanpearson">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>So it seems entirely plausible that particular domains (including biomedical research) require an asymmetric risk profile, in which we weigh one of false positives or false negatives more heavily than the other. But why do we so rarely see scenarios in which we weigh false negatives more heavily than false positives? While there are in fact a few such studies<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>, they are few and far between. Alexander Etz lays out a good argument for why this is the case, in his article <a href="https://alexanderetz.com/2014/12/15/question-why-do-we-settle-for-80-power-answer-were-confused/" target="_blank">Question: Why do we settle for 80% power? Answer: We’re confused.</a>:</p>
<blockquote>
<p>Why do they not adjust α and settle for α = 0.20 and β = 0.05? Why is small α a non-negotiable demand, while small β is only a flexible desideratum? A large α would seem to be scientifically unacceptable, indicating a lack of rigor, while a large β is merely undesirable, an unfortunate but sometimes unavoidable consequence of the fact that observations are expensive or that subjects eligible for the trial are hard to find and recruit. We might have to live with a large β, but good science seems to demand that α be small.</p></blockquote>
<p>A lot of the confusion around hypothesis testing seems to stem from the fact that it is a blend of two underlying philosophies: Fisherian significance testing, and Neyman–Pearson hypothesis testing. It is particularly difficult to grok for outsiders, because while these two paradigms have irreconcilable differences, they also share some similarities, and even use the same terminology of null hypotheses, alpha, etc.</p>
<p>I will defer to this <a href="https://stats.stackexchange.com/questions/23142/when-to-use-fisher-and-neyman-pearson-framework/51823#51823" target="_blank">excellent explanation of the differences</a> by StackExchange user &ldquo;gung&rdquo;:</p>
<blockquote>
<p><strong>Fisher</strong> thought that the p-value could be interpreted as <em>a continuous measure of evidence against the null hypothesis</em>. There is no particular fixed value at which the results become &lsquo;significant&rsquo;.</p>
<p>On the other hand, <strong>Neyman &amp; Pearson</strong> thought you could use the p-value as part of <em>a formalized decision making process</em>. At the end of your investigation, you have to either reject the null hypothesis, or fail to reject the null hypothesis.</p>
<p>The Fisherian and Neyman-Pearson approaches are <em>not the same</em>. The central contention of the Neyman-Pearson framework is that at the end of your study, you have to make a decision and walk away.</p></blockquote>
<p>One particularly frustrating piece of statistical terminology, &ldquo;failing to reject the null hypothesis&rdquo;, comes from the Fisherian paradigm. If you are evaluating evidence in relation to a <em>single</em> hypothesis and you do not achieve a significant result, it could be either because there is no effect to find (the null hypothesis is correct) or simply because you did not collect enough data to disprove it. Therefore in a Fisherian context, we cannot <em>accept</em> a hypothesis, we can only <em>fail to reject</em> it.</p>
<p>This paradigm is a natural match for the decentralized structure of scientific discovery in society. Hypotheses aren&rsquo;t evaluated only once, so false negatives only delay discovery, rather than eliminating it. But researchers face implicit pressure to find surprising (significant) results for their experiment. Funding for future research may depend on it. Since it is not quite as sexy to fund experiments that verify knowledge we already &ldquo;know&rdquo;<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>, it makes sense to be very conservative with false positives, at the cost of accepting more false negatives.</p>
<p>This paradigm is a poorer match for the typical decision-making context of a modern tech company running A/B tests. We are not interested in advancing the societal body of shared scientific knowledge. We just want to make optimal decisions in an environment of uncertainty. Should we launch version A or B? If we truly walk away after making the decision, then <em>failing to reject</em> the null hypothesis is tantamount to <em>accepting</em> it.</p>
<p>The Neyman–Pearson paradigm is a better fit for this scenario, because it pairs statistics with decision theory. In the NP framework, indecision is not an option. There is no option to &ldquo;collect more data&rdquo;. We plan a required sample size, collect data, make a binary decision between A and B, and then walk away. Rather than providing some continuous measure of evidence for or against a hypothesis, NP hypothesis testing arms us with the tools to confidently make decisions which minimize our long-run regret.</p>
<h2 id="unprivilege-your-null-hypothesis">Unprivilege your null hypothesis <a class="anchor" href="#unprivilege-your-null-hypothesis">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you are testing two versions of your website, which should you designate as the null hypothesis, and which as the alternative hypothesis? It is standard practice to choose a null hypothesis which reflects the &ldquo;status quo&rdquo; that you are attempting to disprove. Given the typical defaults of α=0.05 and β=0.20, this means your null hypothesis occupies a &ldquo;privileged&rdquo; position of being innocent until proven guilty<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup>. But it can be alarming to observe that the outcome (decision) from an experiment can entirely flip depending on how you frame your null hypothesis<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup>. Doesn&rsquo;t feel particularly objective, does it?</p>
<p>A fantastic side effect of setting α = β when our costs of mistakes are equal is that we can be agnostic as to what our <em>default option</em> is. We don&rsquo;t have to be as careful as to which hypothesis we designate as null. Consider the following two scenarios:</p>
<ol>
<li>You are testing the impact of a new landing page concept on a single market. You have only translated content for a single language, and you&rsquo;d like to A/B test the new concept before investing in more translations. Unless you see a <em>significant positive effect</em> in your experiment, you plan on staying with the existing system.</li>
<li>Your backend team has done some major refactoring work, and you&rsquo;d like to run an A/B test to verify that QA did not overlook any critical bugs. All things equal, you would prefer to go with the new refactored codebase, so you plan on launching the change unless you see a <em>significant negative effect</em> from the experiment.</li>
</ol>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Landing page</th>
          <th>Refactor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Default</td>
          <td>Stay with existing version</td>
          <td>Launch new version</td>
      </tr>
      <tr>
          <td>False positive</td>
          <td>Wasted resources</td>
          <td>Missed opportunity for improved conversion</td>
      </tr>
      <tr>
          <td>False negative</td>
          <td>Missed opportunity for improved conversion</td>
          <td>Worse conversion</td>
      </tr>
  </tbody>
</table>
<p>These two scenarios share a common failure mode (missed opportunity), but because our default decision differs, our risk is treated differently as well. This failure mode is denoted as β-risk in the first scenario, and α-risk in the second. If we were to use the default parameters (α=0.05, β=0.20) for both experiments, we could say &ldquo;we planned and ran both experiments the same way&rdquo; but our chance of missing an opportunity would differ by a factor of 4. If we use a symmetrical risk profile, then we do not need to pay such close attention to what our default option is, because the long-run risk of making each type of mistake is the same.</p>
<h2 id="a-pragmatic-approach-to-statistical-rigor">A pragmatic approach to statistical rigor <a class="anchor" href="#a-pragmatic-approach-to-statistical-rigor">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you are championing statistical thinking and experimentation practices in a move-fast-and-break-things environment, you need to pick your battles. For example: it&rsquo;s probably not worth kicking up a fuss about people in your org treating confidence intervals like posterior probabilities.</p>
<p>On the other hand, I would argue it is certainly worth encouraging and enabling people to perform sample size calculations as part of a pre-experiment planning process. Such a process has multiple benefits: it reduces the risk of implicit multiple comparisons<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup> which would inflate your long-run rate of false positives, and also reduces the number of underpowered tests you perform. Underpowered tests in particular can lead to a pernicious scenario in which experiment results lose credibility within the organization. Small simplifications to the planning process such as using a default of α = β can help you achieve this goal.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Although the magnitude of improvement from the &ldquo;opt-out organ donation&rdquo; study has been partially debunked, every good salesperson knows there is some power behind the <a href="https://en.wikipedia.org/wiki/Default_effect" target="_blank">default effect</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>The biggest value comes not from the output of the calculation, but rather from the questions we <a href="https://en.wikipedia.org/wiki/Behavior-shaping_constraint" target="_blank">must ask ourselves</a> during the process. &ldquo;Plans Are Worthless, But Planning Is Everything&rdquo; – Dwight D Eisenhower.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p><a href="https://www.vox.com/latest-news/2019/3/22/18275913/statistical-significance-p-values-explained" target="_blank">800 scientists say it’s time to abandon “statistical significance”</a> (Vox)&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p><a href="http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html" target="_blank">Justify Your Alpha by Minimizing or Balancing Error Rates</a> (The 20% Statistician)&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>This has changed somewhat since the <a href="https://en.wikipedia.org/wiki/Replication_crisis" target="_blank">Replication crisis</a>, but the fact this crisis occurred at all indicates there is a systemic bias towards new discoveries.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>This is analogous to the legal concept of <a href="https://en.wikipedia.org/wiki/Presumption_of_innocence" target="_blank">Presumption of innocence</a>. Privileging the null hypothesis certainly makes sense here. A criminal escaping justice is unfortunate, but an innocent citizen wrongly imprisoned is horrific.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>I found a good example in this <a href="https://en.wikipedia.org/wiki/Presumption_of_innocence" target="_blank">StackExchange question</a> which illustrates how our decision can flip depending on which hypothesis we assign to be null.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Even if you aren&rsquo;t explicitly testing multiple hypotheses, not having a clearly defined hypothesis before running your experiment leaves you vulnerable to inflated FPR via <a href="https://en.wikipedia.org/wiki/Researcher_degrees_of_freedom" target="_blank">researcher degrees of freedom</a>.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Making beautiful experiment visualizations with Matplotlib</title><link>https://geoffruddock.com/matplotlib-experiment-visualizations/</link><pubDate>Monday, 21 Oct 2019</pubDate><guid>https://geoffruddock.com/matplotlib-experiment-visualizations/</guid><description>&lt;p>Netflix recently posted an article on their tech blog titled &lt;a href="https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21" target="_blank">Reimagining Experimentation Analysis at Netflix&lt;/a>. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I&amp;rsquo;ve searched for best practices before, and the the &lt;a href="https://conversionxl.com/blog/visualize-ab-test-results/" target="_blank">only reasonable template I could find&lt;/a> is built for Excel, which doesn&amp;rsquo;t fit my python workflow.&lt;/p></description><content:encoded><![CDATA[
        <p>Netflix recently posted an article on their tech blog titled <a href="https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21" target="_blank">Reimagining Experimentation Analysis at Netflix</a>. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I&rsquo;ve searched for best practices before, and the <a href="https://conversionxl.com/blog/visualize-ab-test-results/" target="_blank">only reasonable template I could find</a> is built for Excel, which doesn&rsquo;t fit my Python workflow.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/netflix_vizkit_hu_7c7ef6fe86402d0d.png 480w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/netflix_vizkit_hu_a824f4ba897c8833.png 800w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/netflix_vizkit_hu_e6e756eb50c15269.png 1200w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/netflix_vizkit_hu_71d912d0b4d6189b.png 1500w,
                '
    
                
                
                src="https://geoffruddock.com/matplotlib-experiment-visualizations/netflix_vizkit_hu_a824f4ba897c8833.png"
                
    
            
                alt="Netflix&#39;s visualization" width="600"/> 
    </figure>
<p>It might take a couple seconds to visually parse this visualization at first glance. I don&rsquo;t think that&rsquo;s because it&rsquo;s <em>complicated</em> per se, but rather because the viz itself contains <em>so much</em> information. After you are used to the format, it&rsquo;s hard to think of a way to convey a higher density of decision-making-relevant information in such a small space. There are a few things that make this a particularly good visualization for the result of an experiment.</p>
<h2 id="why-it-is-awesome">Why it is awesome <a class="anchor" href="#why-it-is-awesome">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="it-frames-many-tests-within-the-context-of-a-single-experiment">It frames many &ldquo;tests&rdquo; within the context of a single experiment <a class="anchor" href="#it-frames-many-tests-within-the-context-of-a-single-experiment">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The terms <em>experiment</em> and <em>test</em> are often used interchangeably across product teams, no doubt in part due to the terminology around <em>A/B testing</em>. But in the context of a single experiment (in which we <em>experiment</em> by trying something new) we may perform a number of different <em>statistical tests</em>. While each individual test has its own confidence level, we must be careful to adjust our claims of confidence at the experiment level, or else we fall victim to the <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem" target="_blank">multiple comparisons</a> problem.</p>
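<p>As a rough illustration of what such an adjustment can look like, the sketch below applies Holm&rsquo;s step-down procedure, one common way of controlling the family-wise error rate. This is my own choice of procedure for the example, not something the Netflix post describes, and the p-values are made up.</p>

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per test: True where Holm's procedure rejects the null."""
    m = len(p_values)
    # Visit the tests from the smallest p-value to the largest
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - rank)
        if p_values[idx] > alpha / (m - rank):
            break  # once one test fails, every larger p-value fails too
        reject[idx] = True
    return reject


# Hypothetical p-values for four tests run within a single experiment
p_values = [0.001, 0.02, 0.03, 0.04]

unadjusted = [0.05 >= p for p in p_values]
adjusted = holm_reject(p_values, alpha=0.05)

print("Unadjusted:", unadjusted)
print("Holm:      ", adjusted)
```

<p>Taken one at a time, all four tests clear the 0.05 threshold. After the adjustment only the first one does.</p>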
<p>Even if you don&rsquo;t apply any sort of quantitative correction—to guarantee some global <a href="https://en.wikipedia.org/wiki/Family-wise_error_rate" target="_blank">family-wise error rate</a> (FWER) or <a href="https://en.wikipedia.org/wiki/False_discovery_rate" target="_blank">false discovery rate</a> (FDR)—having all the tests shown together adds useful context for the reader. Suppose you hear the following statement during a company all-hands:</p>
<blockquote>
<p><em>We saw a significant increase in viewing hours for the Action genre in position four</em>.</p></blockquote>
<p>This statement agrees with the above example plot, but it isn&rsquo;t particularly insightful. Should we prefer the Action genre for this position over other genres? Or is this the ideal position for that genre across all possible positions? Perhaps both? Small verbal descriptions of specific outcomes from experiments like this tend to get taken out of context. When this happens, their utility decreases, and their risk of being &ldquo;misused&rdquo; increases. Unfortunately I have observed that these sorts of &ldquo;snippets&rdquo; are frequently used as ammunition by some decision-makers to support their a priori preferred choice.</p>
<h3 id="it-emphasizes-intervals-over-point-estimates-and-p-values">It emphasizes intervals over point estimates (and p-values) <a class="anchor" href="#it-emphasizes-intervals-over-point-estimates-and-p-values">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The past few years have seen significant backlash (pun intended) against the use and misuse of p-values in academia. Today&rsquo;s social scientists are all familiar with <a href="https://en.wikipedia.org/wiki/Publication_bias" target="_blank">publication bias</a> and the <a href="https://en.wikipedia.org/wiki/Replication_crisis" target="_blank">replication crisis</a>. Yet when an A/B test is presented in a tech company boardroom, the first question is still often <em>Is this result significant?</em>.</p>
<p>The Netflix visualization replaces the role of p-values with a visual depiction of some confidence interval, whose colour changes depending on whether or not it includes zero. Additionally, although point estimates are shown within each interval, they are visually de-emphasised within the overall context of the visualization. I&rsquo;m guessing that Netflix removed x-axis labels to avoid sharing confidential data, but even with those included, the format limits people to making statements such as &ldquo;we expect somewhere between a 1-2% improvement&rdquo; rather than &ldquo;we expect a 1.27% improvement&rdquo;. Using two decimals of precision when our confidence interval is 100x as wide as the estimate itself is superfluous and gives us a false sense of confidence in our results.</p>
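<p>The interval logic itself is simple enough to sketch in a few lines. The example below assumes a normal approximation, so the interval is the point estimate plus or minus a z-multiple of its standard error. The function names and the numbers are illustrative and are not taken from Netflix&rsquo;s tooling.</p>

```python
from statistics import NormalDist


def confidence_interval(uplift, std_err, alpha=0.10):
    """Two-sided (1 - alpha) interval around an estimated uplift.

    Assumption: the estimate is approximately normally distributed.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return uplift - z * std_err, uplift + z * std_err


def excludes_zero(lower, upper):
    """True when the whole interval sits on one side of zero."""
    return lower > 0 or 0 > upper


# Same hypothetical point estimate, two different levels of noise
narrow = confidence_interval(uplift=0.0127, std_err=0.004)
wide = confidence_interval(uplift=0.0127, std_err=0.010)

for lower, upper in (narrow, wide):
    status = "excludes zero" if excludes_zero(lower, upper) else "includes zero"
    print(f"[{lower:+.4f}, {upper:+.4f}] {status}")
```

<p>Both cases share the same point estimate of 1.27%, yet only the first supports a claim of improvement. That is exactly the information a bare point estimate hides.</p>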
<h3 id="the-contextual-info-stays-together-in-a-single-shareable-image">The contextual info &ldquo;stays together&rdquo; in a single shareable image <a class="anchor" href="#the-contextual-info-stays-together-in-a-single-shareable-image">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>All of the above properties of a good experiment visualization could also be fulfilled by a nicely designed Tableau dashboard. But what should you do after the experiment ends, and you want to share or save the result for later? Your company&rsquo;s dashboards are always changing after all, so you can&rsquo;t guarantee the data will be there a year from now if you want to reference it. So you take a screenshot.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/tableau_blurred_hu_81d0907452ddda89.png 480w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/tableau_blurred_hu_64bf819ec272ab3e.png 800w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/tableau_blurred_hu_d6bdf8504bc8b0e8.png 1200w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/tableau_blurred_hu_2c2b43f2d921687f.png 1500w,
                '
    
                
                
                src="https://geoffruddock.com/matplotlib-experiment-visualizations/tableau_blurred_hu_64bf819ec272ab3e.png"
                
    
            
                alt="Detailed dashboards are difficult to archive or share" width="500"/> <figcaption>
                <p>Detailed dashboards are difficult to archive or share</p>
            </figcaption>
    </figure>
<p>Well this is unfortunate. In order to capture the key parts of the result, you&rsquo;ve had to take a nearly fullscreen grab of the dashboard. You can throw this in a slide deck somewhere, but you can&rsquo;t expect anyone to read it. And if they do, you can&rsquo;t expect them to reach the same conclusion as you did. In contrast, Netflix&rsquo;s visualization outputs a <em>story</em>. Better yet, it&rsquo;s a story contained in a single copy-paste-able sharable png file. This ensures that the nuance of your analysis does not get lost in transit as it is shared over Slack and email.</p>
<h2 id="rolling-our-own-visualization-function">Rolling our own visualization function <a class="anchor" href="#rolling-our-own-visualization-function">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Unfortunately I have not been able to surface any sort of open source libraries under the name &ldquo;Netflix Vizkit&rdquo;, so I decided to recreate my own version using Matplotlib. The function takes as input a pandas dataframe with either a single or multilevel index, and three columns: <code>uplift</code>, <code>std_err</code>, and <code>alpha</code>. If you are running a large number of tests, it would be prudent to first run your dataframe through your procedure of choice to correct for <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem" target="_blank">multiple comparisons</a>. I&rsquo;ll skip that for the purposes of this example.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/example_input_dataframe_hu_6cca9873918e86a6.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/matplotlib-experiment-visualizations/example_input_dataframe.png"
                
    
            
                alt="Example input dataframe" width="350"/> 
    </figure>
<p>For this example, I&rsquo;ve populated a dataframe with fake results corresponding to an email campaign in which we tested three variants and measured four different conversion rates for each. You could also pass in a dataframe with a single level of index; you&rsquo;ll just get everything plotted on one axis instead of four separate axes.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">plot_experiment_results</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">df</span><span class="o">=</span><span class="n">example_data</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">title</span><span class="o">=</span><span class="s1">&#39;Example email campaign (α=0.10)&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">sample_size</span><span class="o">=</span><span class="mi">123456</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">combine_axes</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span></code></pre></div><p>There are a couple additional parameters in there to add context to the plot, including a title and sample size context line. Remember, we want our output to stand by itself as a record of the outcome of the experiment! This function generates the plot below.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/plot_output_hu_5c5edb915f6cb051.png 480w,
                
                       https://geoffruddock.com/matplotlib-experiment-visualizations/plot_output_hu_3f21cf632c859c2d.png 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/matplotlib-experiment-visualizations/plot_output_hu_3f21cf632c859c2d.png"
                
    
            
                alt="Plot output" width="600"/> 
    </figure>
<p>If you want to more closely match the Netflix plot, you can pass the parameter <code>combine_axes=True</code> to merge groups together into a single axis. I found this a bit harder to visually parse, so I usually leave them separate.</p>
<h3 id="full-code-for-the-example">Full code for the example <a class="anchor" href="#full-code-for-the-example">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><script src="https://gist.github.com/asdfgeoff/741de7942ff81a8986023bffb963d6df.js"></script>


      ]]></content:encoded></item><item><title>Sampling from an iteratively built array in Python</title><link>https://geoffruddock.com/building-and-sampling-from-python-array/</link><pubDate>Monday, 07 Oct 2019</pubDate><guid>https://geoffruddock.com/building-and-sampling-from-python-array/</guid><description>&lt;p>While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before…&lt;/p>
&lt;h2 id="whats-the-fastest-way-to-sample-from-an-array-while-building-it">What&amp;rsquo;s the fastest way to sample from an array while building it? &lt;a class="anchor" href="#whats-the-fastest-way-to-sample-from-an-array-while-building-it">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>If you&amp;rsquo;re reading this, you should first question whether you actually &lt;em>need&lt;/em> to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using &lt;code>np.random.choice&lt;/code>, you can avoid this problem entirely.&lt;/p></description><content:encoded><![CDATA[
        <p>While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before…</p>
<h2 id="whats-the-fastest-way-to-sample-from-an-array-while-building-it">What&rsquo;s the fastest way to sample from an array while building it? <a class="anchor" href="#whats-the-fastest-way-to-sample-from-an-array-while-building-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you&rsquo;re reading this, you should first question whether you actually <em>need</em> to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using <code>np.random.choice</code>, you can avoid this problem entirely.</p>
<p>Unfortunately I could not find a clever workaround for my purposes. This arose while I was implementing the Dyna-Q reinforcement learning algorithm, which requires iteratively sampling from the set of observed state tuples after every iteration of the algorithm. These sampled tuples are then used to refine the transition matrix, with the goal of reducing the number of &ldquo;real&rdquo; iterations in which the agent must interact with its environment.</p>
<h2 id="constraints">Constraints <a class="anchor" href="#constraints">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li>Must allow for sampling from a 2D array (matrix)</li>
<li>We do not know ahead of time how many iterations are needed (until convergence)</li>
<li>Sampled values must be transposed into column vectors (although their actual use is not shown)</li>
</ul>
<h2 id="benchmarks">Benchmarks <a class="anchor" href="#benchmarks">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I explored a few possible approaches below. Each function runs 10k iterations, in which it appends a row to the &ldquo;so far&rdquo; array and then samples 200 rows from it. Note that the function does not actually <em>do</em> anything with the sampled values—that is out of the scope of this article. I simulate the random vector by generating a single random number using the built-in <code>random</code> module (which is <a href="https://geoffruddock.com/python-random-module-faster-than-numpy/">faster than numpy</a>) and duplicating it to make a row vector.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">random</span>
</span></span></code></pre></div><h3 id="approach-1-purely-built-in-python-avoid-numpy-entirely">Approach 1: Purely built-in python, avoid NumPy entirely <a class="anchor" href="#approach-1-purely-built-in-python-avoid-numpy-entirely">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>My first instinct was to attempt to write the function using as few NumPy objects as possible, since I knew from previous experience that <code>np.append()</code> has some overhead. We can represent the 2D matrix as a list of tuples, and then use the <code>zip</code> function to take the sampled rows and &ldquo;transpose&rdquo; them into pseudo column vectors.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">build_list_choices</span><span class="p">(</span><span class="n">iters</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="mi">200</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">list_obj</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">ri</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">list_obj</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">ri</span><span class="p">,</span> <span class="n">ri</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">random</span><span class="o">.</span><span class="n">choices</span><span class="p">(</span><span class="n">list_obj</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">sample_size</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="o">-</span><span class="n">r</span> <span class="mi">5</span> <span class="n">build_list_choices</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>507 ms ± 36.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
</code></pre>
<h3 id="approach-2-iteratively-append-to-numpy-array">Approach 2: Iteratively append to NumPy array <a class="anchor" href="#approach-2-iteratively-append-to-numpy-array">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I had read multiple times about the overhead of calling <code>np.append</code> repeatedly, so I wrote this mainly to benchmark the speed, rather than as a real candidate solution.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">build_arr_iteratively</span><span class="p">(</span><span class="n">iters</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="mi">200</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>  <span class="c1"># start with zero rows so no placeholder rows get sampled</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">ri</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="p">[[</span><span class="n">ri</span><span class="p">,</span> <span class="n">ri</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">a_arr</span><span class="p">,</span> <span class="n">b_arr</span> <span class="o">=</span> <span class="n">arr</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">arr</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">)]</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="o">-</span><span class="n">r</span> <span class="mi">5</span> <span class="n">build_arr_iteratively</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>472 ms ± 9.94 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
</code></pre>
<p>Surprisingly, iteratively appending to a NumPy array has very similar performance to the first approach. Reflecting on the common advice to avoid <code>np.append</code>, I suppose this is contrasted to the much faster alternative of gathering a list of rows and calling a final <code>np.array()</code> once. Unfortunately this alternative wouldn&rsquo;t work for our use-case, which requires access to the array at each iteration.</p>
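<p>For reference, the &ldquo;gather a list, convert once&rdquo; pattern mentioned above looks roughly like the sketch below. The function name is my own, and it is not part of the benchmarks in this post because the array only exists after the loop has finished.</p>

```python
import random

import numpy as np


def build_list_then_convert(iters=10000):
    """Collect rows in a plain list and convert to an array a single time."""
    rows = []
    for _ in range(iters):
        ri = random.randint(0, 100)
        rows.append((ri, ri))
    # One conversion at the very end, instead of one per iteration
    return np.array(rows)


arr = build_list_then_convert()

# Sampling is only possible once the full array has been built
a_arr, b_arr = arr[np.random.choice(len(arr), size=200)].T
```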
<h3 id="approach-3-preallocate-array-and-assign-within-iterations">Approach 3: Preallocate array and assign within iterations <a class="anchor" href="#approach-3-preallocate-array-and-assign-within-iterations">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>To avoid the overhead of <code>np.append</code>, we can preallocate the array. If we don&rsquo;t know the final size but are confident in the maximum size, we can simply instantiate the array at that maximum size and take a slice up to the $ i^{th} $ row at each iteration when sampling.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">build_arr_prealloc</span><span class="p">(</span><span class="n">iters</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="mi">200</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">iters</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">ri</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">ri</span><span class="p">,</span> <span class="n">ri</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">arr_non_zero</span> <span class="o">=</span> <span class="n">arr</span><span class="p">[:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
</span></span><span class="line"><span class="cl">        <span class="n">a_arr</span><span class="p">,</span> <span class="n">b_arr</span> <span class="o">=</span> <span class="n">arr_non_zero</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">arr_non_zero</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">)]</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="o">-</span><span class="n">r</span> <span class="mi">5</span> <span class="n">build_arr_prealloc</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>371 ms ± 7.08 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
</code></pre>
<p>We observe a modest improvement over repeatedly appending. In fact the <code>np.random.choice</code> line dominates the run-time in these functions, so the time spent purely building the array drops from ~100ms to ~20ms, a <strong>5x</strong> improvement.</p>
<h3 id="avoid-at-all-costs-iteratively-building-list-and-converting-to-array-in-each-iteration">Avoid at all costs: Iteratively building list and converting to array in each iteration <a class="anchor" href="#avoid-at-all-costs-iteratively-building-list-and-converting-to-array-in-each-iteration">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The one thing you should definitely avoid is accumulating a python list but then converting to a numpy array at each step. This takes <em>massively</em> longer than the above approaches.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">build_list_iteratively</span><span class="p">(</span><span class="n">iters</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="mi">200</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">list_obj</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">ri</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">list_obj</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">ri</span><span class="p">,</span> <span class="n">ri</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">list_obj</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">a_arr</span><span class="p">,</span> <span class="n">b_arr</span> <span class="o">=</span> <span class="n">arr</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">arr</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">)]</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="o">-</span><span class="n">r</span> <span class="mi">5</span> <span class="n">build_list_iteratively</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>14.2 s ± 307 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
</code></pre>
<h2 id="conclusion">Conclusion <a class="anchor" href="#conclusion">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li>Given the requirements described above, a purely NumPy approach and a purely built-in Python approach perform similarly.</li>
<li>Even <code>np.append()</code> is reasonable, since the sampling part dominates the overall run-time.</li>
<li>If you are confident about the maximum number of iterations you&rsquo;ll run, you can preallocate rows to the numpy array for a ~25% faster overall run-time.</li>
<li>Whatever you do, avoid calling <code>np.array()</code> during each iteration; this is by far the slowest approach.</li>
</ul>
<p>If you think you have a better approach, drop a comment below!</p>

      ]]></content:encoded></item><item><title>Building a hurdle regression estimator in scikit-learn</title><link>https://geoffruddock.com/building-a-hurdle-regression-estimator-in-scikit-learn/</link><pubDate>Monday, 16 Sep 2019</pubDate><guid>https://geoffruddock.com/building-a-hurdle-regression-estimator-in-scikit-learn/</guid><description>&lt;h2 id="what-are-hurdle-models">What are hurdle models? &lt;a class="anchor" href="#what-are-hurdle-models">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>Google explains best,&lt;/p>
&lt;blockquote>
&lt;p>The &lt;strong>hurdle model&lt;/strong> is a two-part &lt;strong>model&lt;/strong> that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a &lt;strong>hurdle&lt;/strong> is cleared.&lt;/p>&lt;/blockquote>
&lt;p>— &lt;a href="https://data.library.virginia.edu/getting-started-with-hurdle-models/" target="_blank">Getting started with hurdle models&lt;/a> [University of Virginia Library]&lt;/p>
&lt;h2 id="what-are-hurdle-models-useful-for">What are hurdle models useful for? &lt;a class="anchor" href="#what-are-hurdle-models-useful-for">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>Many statistical learning models—particularly linear models—assume some level of normality in the response variable being predicted. If we have a dataset with a heavily skewed response or one which contains extreme outliers, it is a common practice to apply something like a &lt;a href="https://en.wikipedia.org/wiki/Power_transform#Box%e2%80%93Cox_transformation" target="_blank">Box-Cox power transformation&lt;/a> before fitting.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="what-are-hurdle-models">What are hurdle models? <a class="anchor" href="#what-are-hurdle-models">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Google explains best,</p>
<blockquote>
<p>The <strong>hurdle model</strong> is a two-part <strong>model</strong> that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a <strong>hurdle</strong> is cleared.</p></blockquote>
<p>— <a href="https://data.library.virginia.edu/getting-started-with-hurdle-models/" target="_blank">Getting started with hurdle models</a> [University of Virginia Library]</p>
<h2 id="what-are-hurdle-models-useful-for">What are hurdle models useful for? <a class="anchor" href="#what-are-hurdle-models-useful-for">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Many statistical learning models—particularly linear models—assume some level of normality in the response variable being predicted. If we have a dataset with a heavily skewed response or one which contains extreme outliers, it is a common practice to apply something like a <a href="https://en.wikipedia.org/wiki/Power_transform#Box%e2%80%93Cox_transformation" target="_blank">Box-Cox power transformation</a> before fitting.</p>
<p>But what do you do if you come across a clearly multi-modal distribution like the one below? Applying a power transform here will just change the scale of the variable, it won&rsquo;t help with the fact that there is a huge spike of values at zero. The fact that it is multi-modal is a good indicator that we are <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox" target="_blank">over-aggregating</a> data which belong to two or more distinct underlying data generation processes.</p>
<p><img src="hurdle_response.png" alt="Example of a multi-modal distribution"></p>
<p>Distributions like this are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.</p>
<p>In the toy example above we have two underlying processes: Does a customer come back? If so, how many purchases does he or she make? The first is modeled as a binomial random variable (coin flip) and the second as a  $ \text{Pois}(\lambda=4) $ random variable, which represents discrete event counts.</p>
<p><img src="split_distributions.png" alt="Example of a multi-modal distribution"></p>
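<p>As a rough sketch of the toy example, the two processes can be simulated separately and multiplied together. The return probability of 0.3, the sample size and the seed are illustrative assumptions, not values from the original analysis; only $ \lambda=4 $ comes from the text above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
n_customers = 10_000                 # assumed sample size
p_return = 0.3                       # assumed probability that a customer comes back

# Process 1: does the customer come back? (one coin flip per customer)
came_back = rng.binomial(n=1, p=p_return, size=n_customers)

# Process 2: how many purchases would a returning customer make?
purchases = rng.poisson(lam=4, size=n_customers)

# Observed response: zero unless the hurdle is cleared
response = came_back * purchases

print("share of zeros:", np.mean(response == 0))
</code></pre></div>
<p>Plotting a histogram of <code>response</code> should show the same shape as the figure above: a spike at zero next to a separate hump for the returning customers.</p>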
<h2 id="how-can-i-implement-a-hurdle-model">How can I implement a hurdle model? <a class="anchor" href="#how-can-i-implement-a-hurdle-model">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>So we want to fit and predict two sub-models, and then multiply their predictions together:</p>
<ol>
<li>A classifier, trained and tested on all of our data.</li>
<li>A regressor, trained only on true positive samples, but used to make predictions on all test data.</li>
</ol>
<p>The most straightforward way to achieve this would be to just train two separate models, make predictions on the same test dataset, and multiply their predictions together before evaluating. However with this approach we lose the ability to interface our model with the rest of the scikit-learn ecosystem, including passing it into <code>GridSearchCV</code> or any of the evaluation functions such as <code>cross_val_predict</code>.</p>
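<p>For reference, here is a minimal sketch of that straightforward two-model approach on synthetic data. The choice of <code>LogisticRegression</code> and <code>LinearRegression</code>, the feature matrix and the data-generating process are all assumptions made for illustration; this is not the code from the gist embedded in this post.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (assumed): two features, roughly half of the responses are zero
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 2))
is_positive = rng.binomial(n=1, p=0.5, size=500)
y = is_positive * (10 + 2 * X[:, 0] + rng.normal(size=500))

# 1. Classifier, trained on all rows: is the response non-zero?
clf = LogisticRegression().fit(X, (y != 0).astype(int))

# 2. Regressor, trained only on rows with a non-zero response
mask = y != 0
reg = LinearRegression().fit(X[mask], y[mask])

# Combine: predicted class (0 or 1) times predicted magnitude, for every row
y_pred = clf.predict(X) * reg.predict(X)
</code></pre></div>
<p>This works, but <code>clf</code> and <code>reg</code> are two loose objects, so nothing here can be handed to <code>GridSearchCV</code> as a single model.</p>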
<p>A better approach is to implement our hurdle model as a valid scikit-learn estimator object by extending from the provided <code>BaseEstimator</code> class.</p>
<script src="https://gist.github.com/asdfgeoff/8361a66bc45bfa91fa22095dd0670d59.js"></script>

<h2 id="making-it-a-valid-scikit-learn-estimator">Making it a valid Scikit-Learn estimator <a class="anchor" href="#making-it-a-valid-scikit-learn-estimator">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The code snippet above may feel like it is longer than it needs to be. This is primarily because I tried to write it as a <a href="https://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator" target="_blank">valid scikit-learn estimator</a>, which I learned involves jumping through a few hoops so that it is compatible with other sklearn functions, including:</p>
<ol>
<li>Init variables must each be of a data type which evaluates as equal when compared with another copy of itself. This is necessary because sklearn clones estimators behind the scenes to do parallel processing in functions such as <code>GridSearchCV</code>. Primitive datatypes (e.g. <code>'yo' == 'yo'</code> and <code>42 == 42</code>) pass this test, but already-initialized estimators to use as sub-models do not. Because of this, I pass the model type as a string, then use the <code>_resolve_estimator</code> method to instantiate the actual estimator.</li>
<li>The <code>fit</code> method returns the estimator itself, to enable method chaining.</li>
<li>The attribute <code>self.is_fitted_</code> is set by the <code>.fit()</code> method and then checked by <code>.predict()</code>.</li>
<li>Any input is validated using the <code>check_array()</code> function before being fit or predicted.</li>
</ol>
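<p>The conventions listed above can be sketched with a deliberately trivial estimator. <code>MeanRegressor</code> and its <code>scale</code> parameter are hypothetical names invented for this example; the point is only the shape of <code>__init__</code>, <code>fit</code> and <code>predict</code>, not the hurdle logic itself.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Toy estimator (hypothetical): always predicts the scaled training mean."""

    def __init__(self, scale=1.0):
        # Only store plain, comparable values here; no fitted state
        self.scale = scale

    def fit(self, X, y):
        X, y = check_X_y(X, y)            # validate input before fitting
        self.mean_ = float(np.mean(y)) * self.scale
        self.is_fitted_ = True            # flag checked later by predict
        return self                       # enables method chaining

    def predict(self, X):
        check_is_fitted(self, "is_fitted_")
        X = check_array(X)                # validate input before predicting
        return np.full(X.shape[0], self.mean_)


X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)

model = MeanRegressor(scale=2.0)
fitted = model.fit(X, y)   # fit returns the estimator itself
copy = clone(model)        # clone rebuilds an unfitted copy from the init params
</code></pre></div>
<p>Because <code>scale</code> is a plain float, the cloned copy carries the same parameters as the original, which is the comparison that fails when an already-initialized sub-model is passed to <code>__init__</code>.</p>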
<p>Scikit-learn provides a <code>check_estimator</code> function which runs a battery of automated tests against your estimator. I learned most of the requirements above while attempting to pass these tests.</p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator" target="_blank">Rolling your own estimator</a> [scikit-learn docs] – Provides a good overview of how to write your own estimator</p>
<p><a href="https://github.com/NeverForged/Hurdle/blob/master/src/hurdle.py" target="_blank">Github / NeverForged / Hurdle</a> [Github] – I used this as a starting point for my code.</p>
<p><a href="http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/" target="_blank">Creating your own estimator in scikit-learn</a> – Some additional concerns w.r.t <code>GridSearchCV</code></p>

      ]]></content:encoded></item><item><title>When Python's built-in random module is faster than NumPy</title><link>https://geoffruddock.com/python-random-module-faster-than-numpy/</link><pubDate>Tuesday, 10 Sep 2019</pubDate><guid>https://geoffruddock.com/python-random-module-faster-than-numpy/</guid><description>&lt;h2 id="tldr">TL;DR &lt;a class="anchor" href="#tldr">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>If you need a single random number (or up to 5) use the built-in &lt;code>random&lt;/code> module instead of &lt;code>np.random&lt;/code>.&lt;/p>
&lt;h2 id="an-instinct-to-vectorize">An instinct to vectorize &lt;a class="anchor" href="#an-instinct-to-vectorize">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>An early learning for any aspiring pandas user is to always prefer &amp;ldquo;vectorized&amp;rdquo; operations over iteratively looping over individual values in some dataframe. These operations—which include most built-in methods—are compiled into Cython and executed at &lt;a href="https://realpython.com/fast-flexible-pandas/" target="_blank">blazing-fast speeds&lt;/a> behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="tldr">TL;DR <a class="anchor" href="#tldr">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you need a single random number (or up to 5) use the built-in <code>random</code> module instead of <code>np.random</code>.</p>
<h2 id="an-instinct-to-vectorize">An instinct to vectorize <a class="anchor" href="#an-instinct-to-vectorize">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>An early learning for any aspiring pandas user is to always prefer &ldquo;vectorized&rdquo; operations over iteratively looping over individual values in some dataframe. These operations—which include most built-in methods—are compiled into Cython and executed at <a href="https://realpython.com/fast-flexible-pandas/" target="_blank">blazing-fast speeds</a> behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.</p>
<p>But after learning to love NumPy for this reason, I was surprised to encounter a few situations where NumPy is actually <em>slower</em> than vanilla Python, particularly when generating scalar values or small arrays of random numbers using the <code>np.random</code> sub-module.</p>
<h2 id="generating-a-random-float">Generating a random float <a class="anchor" href="#generating-a-random-float">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I have written more than a few pieces of code which introduce some randomness by comparing a random float in the range <code>[0, 1)</code> to a sampling-rate argument in an if-statement. For this purpose, you should use Python&rsquo;s built-in <code>random</code> module.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">random</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span>
</span></span></code></pre></div><pre><code>69.5 ns ± 0.817 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
</code></pre>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>987 ns ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
</code></pre>
<p>Generating a single random float is <strong>over 10x faster</strong> using Python&rsquo;s built-in <code>random</code> module than using <code>np.random</code>. So if you need to generate a single random number, or fewer than 10 numbers, it is faster to simply loop over <code>random.random()</code> a few times rather than calling <code>np.random.rand()</code>.</p>
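<p>Outside of a notebook, the same comparison can be sketched with the standard-library <code>timeit</code> module. The loop count is an arbitrary assumption, and absolute timings will differ by machine and NumPy version. Note that <code>np.random.rand(0, 1)</code> requests an array of shape <code>(0, 1)</code>, which is empty; calling <code>np.random.rand()</code> with no arguments is the form that returns a single float, so that is what this sketch times.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random
import timeit

import numpy as np

n_loops = 100_000  # arbitrary; raise it for more stable numbers

t_builtin = timeit.timeit(random.random, number=n_loops)
t_numpy = timeit.timeit(np.random.rand, number=n_loops)

print(f"random.random():  {t_builtin / n_loops * 1e9:.0f} ns per call")
print(f"np.random.rand(): {t_numpy / n_loops * 1e9:.0f} ns per call")

# The two-argument form is interpreted as a shape, not a range
empty = np.random.rand(0, 1)
print(empty.shape, empty.size)
</code></pre></div>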
<h2 id="generating-a-random-integer">Generating a random integer <a class="anchor" href="#generating-a-random-integer">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The gap is smaller when generating random integers, but <code>np.random.randint()</code> is still slower than <code>random.randint()</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>5.05 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>898 ns ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
</code></pre>
<p>Generating a single random integer is <strong>5x faster</strong> using the <code>random</code> module compared to <code>np.random</code>.</p>
<h2 id="sampling-from-existing-array-or-list">Sampling from existing array or list <a class="anchor" href="#sampling-from-existing-array-or-list">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">population</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1000000</span><span class="p">))</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">population</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>48.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%</span><span class="n">timeit</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">population</span><span class="p">)</span>
</span></span></code></pre></div><pre><code>930 ns ± 6.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
</code></pre>
<p>Sampling a single value from a list executes roughly <strong>50,000x faster</strong> (930 ns versus 48.8 ms) using <code>random</code> than <code>np.random</code>.</p>
<p>This is a slightly unfair comparison, since NumPy spends most of the time converting the <code>population</code> list into an array object before sampling, but it represents a real use-case I ran across when attempting to iteratively build and sample from an array of unknown length while building a reinforcement learning algorithm.</p>
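<p>To see how much of that gap is conversion overhead, one option is to convert the list once, outside the timed call, and sample from the resulting array. This is only a sketch, with an assumed (smaller) population size and loop count; it does not help when the list changes on every iteration, which was the situation described above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random
import timeit

import numpy as np

population = list(range(100_000))      # smaller than above so the sketch runs quickly
population_arr = np.array(population)  # one-off conversion, outside the timed calls

n_loops = 20  # arbitrary

# Sampling from the list forces NumPy to convert it on every call
t_list = timeit.timeit(lambda: np.random.choice(population), number=n_loops)
# Sampling from the pre-built array skips the conversion
t_arr = timeit.timeit(lambda: np.random.choice(population_arr), number=n_loops)
# Built-in sampling works on the list directly
t_builtin = timeit.timeit(lambda: random.choice(population), number=n_loops)

print(f"np.random.choice(list):  {t_list / n_loops * 1e6:.1f} us per call")
print(f"np.random.choice(array): {t_arr / n_loops * 1e6:.1f} us per call")
print(f"random.choice(list):     {t_builtin / n_loops * 1e6:.1f} us per call")
</code></pre></div>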
<h2 id="a-note-of-caution-for-cryptography-purposes">A note of caution for cryptography purposes <a class="anchor" href="#a-note-of-caution-for-cryptography-purposes">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>It is stated in the documentation for python&rsquo;s <a href="https://docs.python.org/3/library/random.html" target="_blank">random</a> module but is worth reiterating: these are &ldquo;pseudo-random&rdquo; numbers which are good enough for most statistical purposes but should <strong>not</strong> be used for applications which require cryptographically secure random numbers.</p>
<blockquote>
<p>The pseudo-random generators of this module should not be used for security purposes. For security or cryptographic uses, see the secrets module.</p></blockquote>

      ]]></content:encoded></item><item><title>Creating a monthly + daily DAG pattern in Airflow</title><link>https://geoffruddock.com/monthly-daily-dag-pattern-airflow/</link><pubDate>Thursday, 15 Aug 2019</pubDate><guid>https://geoffruddock.com/monthly-daily-dag-pattern-airflow/</guid><description>&lt;h2 id="problem">Problem &lt;a class="anchor" href="#problem">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>You initially built a data pipeline for a project you were working on, but eventually other members of your team started using it as well. You move the logic into Airflow, so that the pipeline is updated automatically on some regular basis.&lt;/p>
&lt;p>You&amp;rsquo;d like to set &lt;code>schedule_interval&lt;/code> to daily so that the data is always fresh, but you&amp;rsquo;d also like the ability to execute relatively quick backfills. With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="problem">Problem <a class="anchor" href="#problem">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>You initially built a data pipeline for a project you were working on, but eventually other members of your team started using it as well. You move the logic into Airflow, so that the pipeline is updated automatically on some regular basis.</p>
<p>You&rsquo;d like to set <code>schedule_interval</code> to daily so that the data is always fresh, but you&rsquo;d also like the ability to execute relatively quick backfills. With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh.</p>
<h2 id="solution">Solution <a class="anchor" href="#solution">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>We want to eat our cake and have it too. We can achieve this by creating two separate DAGs—one daily and one monthly—using the same underlying logic.</p>
<p>Astronomer.io has a nice guide to <a href="https://www.astronomer.io/guides/dynamically-generating-dags/" target="_blank">dynamically generating DAGs in Airflow</a>. The key insight is that we want to wrap the DAG definition code into a <code>create_dag</code> function and then call it multiple times at the top-level of the file to actually instantiate your multiple DAGs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">create_dag</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">dag</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Declare tasks here (operators and sensors)</span>
</span></span><span class="line"><span class="cl">        
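</span></span><span class="line"><span class="cl">        <span class="c1"># Set dependencies between tasks here</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>  <span class="c1"># Placeholder so the template parses; remove once tasks are declared</span>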
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">dag</span>
</span></span></code></pre></div><p>Our parameters of interest are <code>dag_id</code>, <code>start_date</code> and <code>schedule_interval</code>, so be sure to pass those when calling <code>create_dag</code>.</p>
<p>We&rsquo;d like our monthly job to run on the first of every month, for all historical data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">dag_monthly</span> <span class="o">=</span> <span class="n">create_dag</span><span class="p">(</span><span class="n">dag_id</span><span class="o">=</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">DAG_NAME</span><span class="si">}</span><span class="s1">_monthly&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                         <span class="n">start_date</span><span class="o">=</span><span class="n">START_DATE</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                         <span class="n">schedule_interval</span><span class="o">=</span><span class="s1">&#39;0 7 1 * *&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>We&rsquo;d like our daily job to run every day, but only for the current month.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
</span></span><span class="line"><span class="cl"><span class="n">current_month_start</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m&#39;</span><span class="p">),</span> <span class="s1">&#39;%Y-%m&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dag_daily</span> <span class="o">=</span> <span class="n">create_dag</span><span class="p">(</span><span class="n">dag_id</span><span class="o">=</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">DAG_NAME</span><span class="si">}</span><span class="s1">_daily&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">start_date</span><span class="o">=</span><span class="n">current_month_start</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">schedule_interval</span><span class="o">=</span><span class="s1">&#39;0 8 * * *&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>Make sure to define both of your DAGs at the top-level of the <code>_def.py</code> file so that Airflow knows to instantiate them. They will appear as separate DAGs in the main UI, but the underlying logic is DRY since they are both defined from the same <code>create_dag</code> function.</p>
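<p>The <code>strftime</code>/<code>strptime</code> round trip used for <code>current_month_start</code> works, but the same value can be computed more directly with <code>datetime.replace</code>. A small sketch; <code>month_start</code> is a hypothetical helper name, not something provided by Airflow.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from datetime import datetime


def month_start(moment):
    """Return midnight on the first day of the month containing `moment`."""
    return moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


example = datetime(2019, 8, 15, 13, 45)
round_trip = datetime.strptime(example.strftime("%Y-%m"), "%Y-%m")

print(month_start(example))                # 2019-08-01 00:00:00
print(month_start(example) == round_trip)  # True
</code></pre></div>
<p>In the DAG file this would be called as <code>month_start(datetime.now())</code>.</p>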
<h2 id="updates">Updates <a class="anchor" href="#updates">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>[2019-09-03] – Initially I had <code>schedule_interval='0 7 2-31 * *'</code> on the daily DAG to avoid duplicate processing on the 1st day of the month. But Airflow runs jobs when the <em>next</em> schedule interval arrives (somewhat counter-intuitive), so what we actually want to do is skip the job corresponding with the <em>last day</em> of the month, rather than the first day. Unfortunately it is not possible to express this in a simple cron expression, due to the varying length of months.</p>
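<p>Although cron cannot express &ldquo;the last day of the month&rdquo;, the check itself is easy in Python using the standard-library <code>calendar</code> module. The helper name below is hypothetical, and how you wire it into the DAG (for example, a task that skips the daily run when it returns true) is left open here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import calendar
from datetime import date


def is_last_day_of_month(day):
    """True if `day` is the final calendar day of its month."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return day.day == days_in_month


print(is_last_day_of_month(date(2019, 2, 28)))  # True
print(is_last_day_of_month(date(2020, 2, 28)))  # False (2020 is a leap year)
print(is_last_day_of_month(date(2019, 8, 31)))  # True
</code></pre></div>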

      ]]></content:encoded></item><item><title>One-hot encoding + linear regression = multi-collinearity</title><link>https://geoffruddock.com/one-hot-encoding-plus-linear-regression-equals-multi-collinearity/</link><pubDate>Monday, 29 Jul 2019</pubDate><guid>https://geoffruddock.com/one-hot-encoding-plus-linear-regression-equals-multi-collinearity/</guid><description>&lt;h2 id="my-coefficients-are-bigger-than-your-coefficients">My coefficients are bigger than your coefficients &lt;a class="anchor" href="#my-coefficients-are-bigger-than-your-coefficients">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>I was attempting to fit a simple linear regression model the other day with &lt;code>sklearn.linear_model.LinearRegression&lt;/code> but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a &lt;em>crazy&lt;/em> magnitude, on the order of &lt;em>billions&lt;/em>. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="my-coefficients-are-bigger-than-your-coefficients">My coefficients are bigger than your coefficients <a class="anchor" href="#my-coefficients-are-bigger-than-your-coefficients">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I was attempting to fit a simple linear regression model the other day with <code>sklearn.linear_model.LinearRegression</code> but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a <em>crazy</em> magnitude, on the order of <em>billions</em>. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.</p>
<pre tabindex="0"><code>feature_A_1    4060461707040.634
feature_A_2    4060461707005.303
feature_A_3    4060461706988.173
feature_B_1   -2529776773226.519
feature_B_2   -2529776773214.394
feature_B_3   -2529776773206.096
feature_B_4   -2529776773204.950
feature_B_5   -2529776773203.577
feature_B_6   -2529776773201.271
feature_B_7   -2529776773195.004
Name: coef, dtype: float64
</code></pre><p>What is going on here? It turns out it was related to my use of <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html" target="_blank"><code>OneHotEncoder</code></a> in my preprocessing pipeline to convert categorical features into a numeric format suitable for linear models. The best practice when converting a categorical feature containing $ k $ values is to output only $ k-1 $ one-hot encoded features, leaving one value as the &ldquo;default&rdquo;, represented by all $ k-1 $ booleans being zero. Unfortunately I overlooked the fact that by default, <code>OneHotEncoder</code> sets the parameter <code>drop=None</code>, which causes it to output $ k $ columns. When these are used to fit a linear model with an intercept, we have perfect multicollinearity: the $ k $ columns always sum to one, which duplicates the intercept, so the coefficients are not uniquely determined and the fit can return values of unrealistic magnitude. This is known as the <a href="https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29" target="_blank">dummy variable trap</a>.</p>
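<p>To see why the coefficients can drift to such magnitudes, here is a small sketch in plain Python, using made-up toy numbers rather than the data from this post. With all $ k $ dummy columns present, any constant can be moved out of the intercept and into every dummy coefficient without changing a single prediction, so nothing in the data pins the coefficients down.</p>

```python
# Toy data (invented for illustration): one categorical feature with k = 3 levels,
# one-hot encoded into 3 columns. Every row sums to 1, just like the intercept column.
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]


def predict(intercept, coefs, row):
    """Linear model prediction: intercept + dot(coefs, row)."""
    return intercept + sum(c * x for c, x in zip(coefs, row))


# A sensible parameter set, and a shifted one that moves t from the intercept
# into every dummy coefficient.
t = 4_000_000_000_000.0
sensible = (100.0, [5.0, -2.0, 1.0])
shifted = (sensible[0] - t, [c + t for c in sensible[1]])

preds_sensible = [predict(sensible[0], sensible[1], row) for row in rows]
preds_shifted = [predict(shifted[0], shifted[1], row) for row in rows]
print(preds_sensible)
print(preds_shifted)
```

<p>Both parameter sets describe the same fitted function, which matches the pattern in the output above: within each feature the coefficients share one enormous offset and differ only by small amounts.</p>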
<h2 id="an-easy-fix">An easy fix… <a class="anchor" href="#an-easy-fix">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Since we do not want to remove the intercept, the solution is to encode our categorical features with the parameter <code>drop='first'</code> to produce only $ k-1 $ columns for each categorical feature.</p>
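<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">OneHotEncoder</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">cat_cols</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># One-hot encode the categorical columns; pass the remaining columns through unchanged</span>
</span></span><span class="line"><span class="cl"><span class="n">one_hot</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">  <span class="p">(</span><span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="s1">&#39;first&#39;</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">],</span> <span class="n">remainder</span><span class="o">=</span><span class="s1">&#39;passthrough&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">  <span class="p">(</span><span class="s1">&#39;one_hot&#39;</span><span class="p">,</span> <span class="n">one_hot</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="p">(</span><span class="s1">&#39;lin_reg&#39;</span><span class="p">,</span> <span class="n">LinearRegression</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pipeline</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>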
</span></span></code></pre></div><h2 id="but-it-doesnt-play-nicely-with-cv-pipelines">…but it doesn&rsquo;t play nicely with CV pipelines <a class="anchor" href="#but-it-doesnt-play-nicely-with-cv-pipelines">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>An additional challenge I faced was that my <code>OneHotEncoder</code> was part of a pipeline which was ultimately fed into the <code>cross_val_predict</code> function. This function splits up the dataset into a number of folds and runs the preprocessing pipeline separately for each fold. It is possible that the training dataset used in one or more of the CV folds may not include every possible value for every categorical feature. When the pipeline is subsequently applied to the test dataset in that fold, it will throw an error about an unknown value, unless you use the parameter <code>OneHotEncoder(handle_unknown='ignore')</code>.</p>
<p>Unfortunately it is not possible to simultaneously set <code>drop='first'</code> and <code>handle_unknown='ignore'</code> on <code>OneHotEncoder</code>; doing so raises the error below.</p>
<pre tabindex="0"><code>ValueError: `handle_unknown` must be &#39;error&#39; when the drop parameter is specified, as both would create categories that are all zero.
</code></pre><p>I have not found an elegant solution to this problem. If you know one, please let me know. For now, I fell back to a non-pipeline solution in which I fit <code>OneHotEncoder</code> against the entire dataset, and then make predictions against a manually-split test set.</p>
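<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.compose</span> <span class="kn">import</span> <span class="n">ColumnTransformer</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">OneHotEncoder</span><span class="p">,</span> <span class="n">PowerTransformer</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">numeric_cols</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">number</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">cat_cols</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Manually split into train and test sets (default split proportions)</span>
</span></span><span class="line"><span class="cl"><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">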
</span></span><span class="line"><span class="cl"><span class="c1"># Train the transformer on the full dataset (causes some leakage for PowerTransformer)</span>
</span></span><span class="line"><span class="cl"><span class="n">col_tx</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span><span class="n">transformers</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="s1">&#39;num&#39;</span><span class="p">,</span> <span class="n">PowerTransformer</span><span class="p">(),</span> <span class="n">numeric_cols</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">(</span><span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="n">OneHotEncoder</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="s1">&#39;first&#39;</span><span class="p">,</span> <span class="n">handle_unknown</span><span class="o">=</span><span class="s1">&#39;error&#39;</span><span class="p">),</span> <span class="n">cat_cols</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Transform training data and fit model</span>
</span></span><span class="line"><span class="cl"><span class="n">X_train_tx</span> <span class="o">=</span> <span class="n">col_tx</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_tx</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Transform test data and make predictions</span>
</span></span><span class="line"><span class="cl"><span class="n">X_test_tx</span> <span class="o">=</span> <span class="n">col_tx</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_tx</span><span class="p">)</span>
</span></span></code></pre></div>
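<p>The reasoning behind the <code>ValueError</code> above can be sketched with a toy encoder in plain Python. This is an illustration only, not scikit-learn&rsquo;s implementation; the function <code>one_hot</code> and its parameters are hypothetical. Once the first category is dropped it is represented by all zeros, which is exactly how an ignored unknown value would be represented too.</p>

```python
def one_hot(value, categories, drop_first=False, ignore_unknown=False):
    """Encode value as a list of 0/1 indicators over categories.

    Toy re-implementation for illustration only; not scikit-learn's code.
    """
    if value not in categories:
        if not ignore_unknown:
            raise ValueError(f"unknown category: {value!r}")
        row = [0] * len(categories)  # unknown values become all zeros
    else:
        row = [int(value == c) for c in categories]
    # Dropping the first column turns the first category into the all-zero baseline
    return row[1:] if drop_first else row


categories = ["red", "green", "blue"]
baseline = one_hot("red", categories, drop_first=True, ignore_unknown=True)
unknown = one_hot("purple", categories, drop_first=True, ignore_unknown=True)
print(baseline, unknown)
```

<p>Since the two encodings are identical, a linear model would silently treat every unseen value as the baseline category, which is the ambiguity the error message warns about.</p>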
      ]]></content:encoded></item><item><title>How to fix the hinge on an IKEA Friheten couch</title><link>https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/</link><pubDate>Saturday, 20 Jul 2019</pubDate><guid>https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/</guid><description>&lt;p>If there is one piece of furniture I regret buying from IKEA, it is the &lt;a href="https://www.ikea.com/us/en/catalog/products/S69216757/" target="_blank">FRIHETEN sofa bed&lt;/a> for €400. I wonder if this is where they make their margin, after selling €5 coffee tables as loss leaders to get you into the store.&lt;/p>
&lt;p>The FRIHETEN has two mechanical components that are prone to failure: a section which pulls out and &amp;ldquo;pops up&amp;rdquo; to form the sofa bed, and a chaise section which opens up to provide storage within. After two years of occasionally pulling the sofa out to vacuum, the chaise lid seat started to &amp;ldquo;slip&amp;rdquo; into the storage compartment when someone was sitting on it.&lt;/p></description><content:encoded><![CDATA[
        <p>If there is one piece of furniture I regret buying from IKEA, it is the <a href="https://www.ikea.com/us/en/catalog/products/S69216757/" target="_blank">FRIHETEN sofa bed</a> for €400. I wonder if this is where they make their margin, after selling €5 coffee tables as loss leaders to get you into the store.</p>
<p>The FRIHETEN has two mechanical components that are prone to failure: a section which pulls out and &ldquo;pops up&rdquo; to form the sofa bed, and a chaise section which opens up to provide storage within. After two years of occasionally pulling the sofa out to vacuum, the chaise lid seat started to &ldquo;slip&rdquo; into the storage compartment when someone was sitting on it.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/sagging_corner_hu_e9bd37238b9a95e0.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/sagging_corner_hu_cafabe0fecafade9.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/sagging_corner_hu_a62015461df1d941.jpg 1200w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/sagging_corner_hu_4661538b413fca05.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/sagging_corner_hu_cafabe0fecafade9.jpg"
                
    
            
                alt="Sagging corner" width="600px"/> 
    </figure>
<p>Upon closer inspection, it seems that there is a metal edge on each side of the lid near the hinge assembly, which when closed should rest on top of a metal bracket on the lower box. I noticed that my hinge arms were no longer perfectly centered, so when the top piece came down it would slip off &ldquo;into&rdquo; the storage, and that corner would dip.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_before_hu_6a0e22e55a5e2eff.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_before_hu_7afd6dde5f03f23c.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_before_hu_1325b5dc7a1d3529.jpg 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_before_hu_7afd6dde5f03f23c.jpg"
                
    
            
                alt="Not much contact surface area."/> <figcaption>
                <p>Not much contact surface area.</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_after_hu_c82f3c648db38fd3.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_after_hu_6810c93016510e9.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_after_hu_c3628af12d900465.jpg 1200w,
                
                       
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/hinge_view_after_hu_6810c93016510e9.jpg"
                
    
            
                alt="Better."/> <figcaption>
                <p>Better.</p>
            </figcaption>
    </figure>

        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 400px
    }

</style>
<p>I managed to stop it from slipping by putting a 10cm wide metal corner bracket over the connecting bracket, which adds an additional ~1cm of metal, giving the top piece a solid surface but still leaving just enough clearance for the hinge mechanism to function.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_before_hu_abde0cab8229fd0f.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_before_hu_62b8705f954e93a2.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_before_hu_cd943b46fa81a354.jpg 1200w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_before_hu_5de92265714bc7d7.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_before_hu_62b8705f954e93a2.jpg"
                
    
            
                alt="Before"/> <figcaption>
                <p>Before</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_after_hu_ba03f8f6e964508d.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_after_hu_ce06c41f40ae3900.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_after_hu_32c9bcf9df5eef50.jpg 1200w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_after_hu_b2fc82a6a07c1598.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/top_view_after_hu_ce06c41f40ae3900.jpg"
                
    
            
                alt="After"/> <figcaption>
                <p>After</p>
            </figcaption>
    </figure>

        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 300px
    }

</style>
<p>Hopefully this helps anyone who is facing a similar issue. Your couch is not garbage; you can fix it with a €2 piece of metal from your local hardware store.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/corner_bracket_hu_41095e9381657edd.jpg 480w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/corner_bracket_hu_71e7ed1f5dfa568e.jpg 800w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/corner_bracket_hu_cec9e56ed4a66f66.jpg 1200w,
                
                       https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/corner_bracket_hu_65c672958fb2a802.jpg 1500w,
                '
    
                
                
                src="https://geoffruddock.com/fix-hinge-on-ikea-friheten-couch/corner_bracket_hu_71e7ed1f5dfa568e.jpg"
                
    
            
                alt="Have fun asking where to find this at the store though." width="500"/> <figcaption>
                <p>Have fun asking where to find this at the store though.</p>
            </figcaption>
    </figure>

      ]]></content:encoded></item><item><title>Reflections on three years of spaced repetition with Anki</title><link>https://geoffruddock.com/reflections-on-three-years-of-spaced-repetition-with-anki/</link><pubDate>Monday, 17 Jun 2019</pubDate><guid>https://geoffruddock.com/reflections-on-three-years-of-spaced-repetition-with-anki/</guid><description>&lt;p>I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward.&lt;/p></description><content:encoded><![CDATA[
        <p>I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward.</p>
<h2 id="wtf-is-anki">WTF is Anki? <a class="anchor" href="#wtf-is-anki">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>For the uninitiated, Gwern provides an <a href="https://www.gwern.net/Spaced-repetition" target="_blank">excellent overview of spaced repetition</a> and its effectiveness as a learning tool. I&rsquo;ll focus the rest of this post on my own personal use-cases for the tool.</p>
<p>I first downloaded Anki during my first year of undergrad, less than 48 hours before taking my final exam for <a href="https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English" target="_blank">Latin and Greek roots in English</a>. It was the classic university student use-case: <em>how can I cram all this knowledge into my head for just long enough to pass next week&rsquo;s exam</em>? Even though Anki&rsquo;s algorithms are optimized for long-term retention, they are still the best approach for short-term cramming.</p>
<p>Since that first encounter with spaced repetition, I have used it (somewhat) more successfully to build my foreign vocabulary, internalize <a href="https://fs.blog/mental-models/" target="_blank">mental models</a> learned from other fields, and most recently to review math proofs in order to improve my retention while studying part-time for my master&rsquo;s degree.</p>
<p><img src="anki_stats.png" alt="anki_stats"></p>
<h2 id="remembering-stuff-is-hard-work">Remembering stuff is hard work <a class="anchor" href="#remembering-stuff-is-hard-work">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Reviewing flashcards for 15 minutes per day does not sound like a particularly large time investment, but it takes a disproportionate amount of mental exertion. Anki&rsquo;s algorithm attempts to schedule cards such that you are <em>just barely</em> able to successfully remember them. The <a href="https://en.wikipedia.org/wiki/Active_recall" target="_blank">active recall</a> principle tells us that it is at this point that review confers the maximum benefit in terms of long-term memory consolidation. If Anki showed you cards earlier it would be easier to remember them, but doing so would yield less benefit per review.</p>
<p>In this regard, remembering stuff is no different than any other skill acquisition. The study of <a href="https://en.wikipedia.org/wiki/Practice_%28learning_method%29#Deliberate_practice" target="_blank">deliberate practice</a> tells us that skill improvement is not proportional to <em>total practice</em>, but rather to the amount of practice conducted at the outer edge of our abilities. A musician who plays through a song they already know by heart may have fun, but he does not improve as much as a musician who spends an hour deconstructing a song he cannot yet play comfortably, practicing a single chord over and over again until it becomes muscle memory. So 15 minutes of flashcards feels unpleasant, but it gives us greater memory benefit than hours of passive consumption (watching videos, reading articles, etc.).</p>
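<p>The scheduling idea described above can be sketched in a few lines. Anki&rsquo;s scheduler is derived from SuperMemo&rsquo;s SM-2 algorithm; what follows is a simplified SM-2-style update under that assumption, not Anki&rsquo;s actual code. The function name <code>sm2_review</code> and the exact order of the updates are choices made for this illustration.</p>

```python
def sm2_review(quality, repetitions, interval_days, ease):
    """Return updated (repetitions, interval_days, ease) after one review.

    quality is a self-assessed grade from 0 (blackout) to 5 (perfect recall).
    Simplified SM-2-style sketch; not Anki's real scheduler.
    """
    if quality >= 3:
        # Successful recall: intervals grow 1 day, 6 days, then multiply by ease
        if repetitions == 0:
            interval_days = 1
        elif repetitions == 1:
            interval_days = 6
        else:
            interval_days = round(interval_days * ease)
        repetitions += 1
        # Easy answers raise the ease factor, hard ones lower it (floor of 1.3)
        penalty = (5 - quality) * (0.08 + (5 - quality) * 0.02)
        ease = max(1.3, ease + 0.1 - penalty)
    else:
        # Failed recall: start the card over with a short interval
        repetitions = 0
        interval_days = 1
    return repetitions, interval_days, ease


state = (0, 0, 2.5)  # new card with the default ease factor
intervals = []
for quality in [5, 5, 5, 5, 2]:
    state = sm2_review(quality, *state)
    intervals.append(state[1])
print(intervals)
```

<p>Each successful review pushes the next one further out, so a card tends to reappear around the time it is about to be forgotten, and a failed review pulls it back to a short interval.</p>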
<h2 id="spaced-repetition-is-not-a-replacement-for-learning">Spaced repetition is not a replacement for learning <a class="anchor" href="#spaced-repetition-is-not-a-replacement-for-learning">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Trying to remember things you&rsquo;ve already learned can be tough, but trying to remember things you never really learned in the first place is just downright frustrating. I learned this the hard way while attempting to learn frequency lists<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> of Spanish vocabulary through an imported CSV file. It feels deceptively productive to create a large number of cards, but you may be surprised at how much more difficult they are to remember than cards you create yourself<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</p>
<p>What Anki does do well is provide a mechanism for decoupling the active process of <em>learning</em> from the more routine process of <em>remembering</em>. Imagine you are learning some topic (a language, a field, a skill) and you resolve to spend three hours every weekend studying. The output of this focused effort spent studying is an increase in your <em>knowledge</em>. But you will also experience a decrease in knowledge caused by <em>forgetting</em> some proportion of your existing or just-learned knowledge. So next time you study, you may spend 95% of your time learning new material, and 5% &ldquo;refreshing&rdquo; on previously forgotten material.</p>
<p><img src="cycle_learning_forgetting.png" alt="cycle_learning_forgetting"></p>
<p>The actual percentage forgotten of course depends on the person, how they study, what they are studying, how often they study, etc. But the key insight here is that you&rsquo;ve got a <em>leaky system</em> that is not long-term stable. It just wouldn&rsquo;t make sense to say <em>I&rsquo;ve got a curiosity about physics so I will spend one Sunday per year learning it</em> because you&rsquo;d probably spend the first half of the day trying to remember what you learned a year ago. For any level of intensity, there is some level of frequency at which you reach an equilibrium: you are treading water but staying in the same place.</p>
<p>Now let&rsquo;s say you implement a practice of creating new flashcards at the end of your study sessions, and performing a 10-minute daily review regardless of whether you are studying that day or not. This effectively modifies our &ldquo;learning system&rdquo; to look like the chart below.</p>
<p><img src="cycle_with_flashcards.png" alt="cycle_with_flashcards"></p>
<p>To some extent, we are systematically countering the effect of forgetting using review. No review system is perfect, so naturally the rate of forgetting will never equal zero. A lower rate of forgetting equates to a lower &ldquo;equilibrium&rdquo; point with respect to study frequency. Without spaced repetition, studying a topic once every 3 months may feel inefficient, because you&rsquo;d need to spend a chunk of time &ldquo;refreshing your memory&rdquo; before learning new content. With spaced repetition, you will almost certainly find it much quicker to jump back into things.</p>
<p>I have found this immensely valuable while studying part-time towards my master&rsquo;s degree in analytics over the course of multiple years. Machine learning and statistics are extensions of underlying concepts in calculus, probability, and linear algebra. Having a deep understanding of fundamental concepts from these fields makes learning ML easier, but they are difficult to maintain when &ldquo;used&rdquo; so infrequently. By creating Anki cards for key math proofs, I ensure that I encounter them with some minimum frequency in a problem-solving context, which prevents me from entirely forgetting them in between courses.</p>
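<p>The &ldquo;leaky system&rdquo; idea from this section can be made concrete with a toy model. Everything here is an assumption made for illustration: a fixed fraction of knowledge is forgotten each day, every study session restores or adds a fixed number of items, and the rates are invented rather than measured. The function name <code>plateau</code> is hypothetical.</p>

```python
def plateau(items_per_session, gap_days, daily_forget_rate, sessions=200):
    """Simulate repeated study sessions and return the final knowledge level."""
    # Fraction of knowledge that survives the gap between two sessions
    retained = (1 - daily_forget_rate) ** gap_days
    knowledge = 0.0
    for _ in range(sessions):
        # Forget during the gap, then add one session's worth of material
        knowledge = knowledge * retained + items_per_session
    return knowledge


# Same weekly study habit, with an invented higher and lower daily forgetting rate
without_review = plateau(items_per_session=50, gap_days=7, daily_forget_rate=0.05)
with_review = plateau(items_per_session=50, gap_days=7, daily_forget_rate=0.01)
print(round(without_review), round(with_review))
```

<p>In this model knowledge levels off at <code>items_per_session / (1 - retained)</code>, the point where a whole session is spent replacing what was forgotten since the last one. Lowering the forgetting rate raises that plateau without any change in study frequency.</p>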
<h2 id="memory-is-directional">Memory is directional <a class="anchor" href="#memory-is-directional">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Our ability to recall particular memories is known to be <a href="https://en.wikipedia.org/wiki/Context-dependent_memory" target="_blank">dependent on the context in which we learned them</a>. We can conceptualize our memory as a graph, in which individual memories are encoded as nodes, contextual cues and relationships as edges between those nodes, and the remembering process as the task of traversing the graph to a particular node.</p>
<p>We&rsquo;ve all had the experience of having a word or concept on the <em>tip of our tongue</em> but not quite being able to remember it. We know the memory exists in our knowledge graph, but we lack sufficient connections <em>between</em> nodes to retrieve it in a timely manner.</p>
<p>Let&rsquo;s take the example of learning a piece of foreign vocabulary using a typical front ↔ back flashcard. Reviewing this flashcard in both directions would strengthen the following connections (edges) in our knowledge graph:</p>
<ol>
<li>Foreign word → English: strengthens our ability to recognize and comprehend the word when encountered in the external world (i.e. recognition)</li>
<li>English → Foreign word: strengthens our ability to produce the foreign word from &ldquo;thin air&rdquo; to express ourselves (i.e. production)</li>
</ol>
<p>Recognition is generally easier than production, and may often be a prerequisite. But training recognition alone does not translate to production, and often gives us a false sense of confidence in our knowledge. This is why the <a href="https://mattyford.com/blog/2014/1/23/the-feynman-technique-model" target="_blank">Feynman technique</a> is so powerful: being able to fully &ldquo;produce&rdquo; the knowledge (either a foreign word or a full explanation of a concept) is the holy grail of remembering.</p>
<p>We can take this concept even further. We have trained our recognition from a textual representation of the word, but could we recognize it when spoken to us in a noisy bar? We have trained production from a similar textual English word, but could we think of the foreign word just from looking at a picture of the concept, without sub-vocalizing in English first?</p>
<p>My vocabulary note type now includes multiple fields, and <a href="https://apps.ankiweb.net/docs/manual.html#selective-card-generation" target="_blank">selectively generates</a> 5-6 cards depending on which fields I fill in:</p>
<ol>
<li>Foreign language</li>
<li>Translation</li>
<li>Definition</li>
<li>Example</li>
<li>Picture</li>
<li>Synonyms</li>
<li>Audio</li>
</ol>
<p>A beneficial side effect of having a custom note type is that I have to manually create my own cards rather than being tempted by bulk import. This turns out to be a benefit because the time spent crafting the card—picking a picture, choosing the example sentence, etc.—translates to better starting retention.</p>
<h2 id="the-bottom-line">The bottom line <a class="anchor" href="#the-bottom-line">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>All things considered, active recall testing with Anki is probably the single highest ROI habit that I perform on a regular basis. 100 hours per year may sound like a big chunk of time, but I think it&rsquo;s important to consider that a lot of this was downtime to begin with. If you consider that much of this time is spent on public transit, or waiting to meet a friend who is a few minutes late, then it likely has an <em>even higher</em> ROI.</p>
<p>That said, I have learned to become somewhat more selective about what I add, and to more frequently use custom note types which enable richer content and facilitate multi-directional connections in memory. If it&rsquo;s worth adding a card at all, it is worth spending a few extra seconds to improve your chances of successfully recalling it months from now. I am also more willing to delete cards that aren&rsquo;t working for me, including poorly created cards that I frequently fail to recall.</p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://rs.io/anki-tips/" target="_blank">Anki Tips: What I Learned Making 10,000 Flashcards</a> – Inspiration for this post, includes some easy-to-digest tips for using Anki more effectively.</li>
<li><a href="http://augmentingcognition.com/ltm.html" target="_blank">Augmenting Long-term Memory</a> – An in-depth exploration of spaced repetition and its applications for both academic purposes and practical life, from the perspective of a physicist.</li>
<li><a href="https://www.gwern.net/Spaced-repetition" target="_blank">Spaced Repetition for Efficient Learning</a> – Gwern gives a great overview of the scientific literature w.r.t. spaced repetition.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Tim Ferriss recommends this approach in <a href="https://tim.blog/2009/01/20/learning-language/" target="_blank">How to Learn Any Language in 3 Months</a>. It may be sufficient for bootstrapping yourself into an immersion environment, but it&rsquo;s certainly not ideal for long-term retention without some other form of practice.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Gabriel Wyner, the author of <em>Fluent Forever</em>, is all for memorizing vocabulary, but strongly recommends generating the flashcards yourself rather than importing them, to solidify the knowledge in your head. Read about it <a href="https://blog.fluent-forever.com/base-vocabulary-list/" target="_blank">here</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Embed markdown documentation into your Airflow DAGs</title><link>https://geoffruddock.com/markdown-documentation-in-airflow-dags/</link><pubDate>Monday, 13 May 2019</pubDate><guid>https://geoffruddock.com/markdown-documentation-in-airflow-dags/</guid><description>&lt;h2 id="why-you-should-do-it">Why you should do it &lt;a class="anchor" href="#why-you-should-do-it">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>I recently discovered that Apache Airflow allows you to &lt;a href="https://airflow.apache.org/concepts.html?highlight=doc_md#documentation-notes" target="_blank">embed markdown documentation directly into the Web UI&lt;/a>. This is a very neat feature, because it enables you to locate your documentation &lt;em>as close as possible&lt;/em> to the thing itself, rather than hiding it away in some Google Doc or Confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten and never discovered by new team members.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="why-you-should-do-it">Why you should do it <a class="anchor" href="#why-you-should-do-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I recently discovered that Apache Airflow allows you to <a href="https://airflow.apache.org/concepts.html?highlight=doc_md#documentation-notes" target="_blank">embed markdown documentation directly into the Web UI</a>. This is a very neat feature, because it enables you to locate your documentation <em>as close as possible</em> to the thing itself, rather than hiding it away in some Google Doc or Confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten and never discovered by new team members.</p>
<p><img src="airflow_comments_redacted.png" alt="Screenshot of documentation in Airflow UI"></p>
<h2 id="how-to-do-it">How to do it <a class="anchor" href="#how-to-do-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>To make your markdown visible in the Web UI, simply assign a string to the <code>doc_md</code> attribute of your DAG, e.g. <code>dag.doc_md = &quot;My documentation here&quot;</code>. That said, I generally put the docs in a string variable at the top of the file, and then assign it further down in the file. This way, it serves a dual purpose of providing context to anyone editing the DAG definition file itself.</p>
<h2 id="example-code">Example code <a class="anchor" href="#example-code">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">docs</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">## DAG Name
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">#### Purpose
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">This DAG connects data from one source to another,
</span></span></span><span class="line"><span class="cl"><span class="s2">performs necessary transformations,
</span></span></span><span class="line"><span class="cl"><span class="s2">and creates a set of tables that can be used by analysts 
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">#### Outputs
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">This pipeline produces the following output tables:
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">- `table_A` – Contains useful information about ABC.
</span></span></span><span class="line"><span class="cl"><span class="s2">- `table_b` – Contains useful information about XYZ.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">#### Owner
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">For any questions or concerns, please contact 
</span></span></span><span class="line"><span class="cl"><span class="s2">[me@mycompany.com](mailto:me@mycompany.com).
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="n">DAG</span><span class="p">(</span><span class="err">…</span><span class="p">)</span> <span class="k">as</span> <span class="n">dag</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">dag</span><span class="o">.</span><span class="n">doc_md</span> <span class="o">=</span> <span class="n">docs</span>
</span></span></code></pre></div>
      ]]></content:encoded></item><item><title>Save entire webpages for reference With SingleFile</title><link>https://geoffruddock.com/save-entire-webpages-with-singlefile/</link><pubDate>Monday, 15 Apr 2019</pubDate><guid>https://geoffruddock.com/save-entire-webpages-with-singlefile/</guid><description>&lt;p>I&amp;rsquo;ve been reading through a lot of Tiago Forte&amp;rsquo;s writing on his members-only publication &lt;a href="https://praxis.fortelabs.co/" target="_blank">Praxis&lt;/a>. Since reading through his series on &lt;a href="https://praxis.fortelabs.co/progressive-summarization-a-practical-technique-for-designing-discoverable-notes-3459b257d3eb/" target="_blank">progressive summarization&lt;/a>, I have become more conscientious with regard to saving the &amp;ldquo;work-in-progress&amp;rdquo; artifacts of my thinking process to Evernote. Often this involves a link to a piece of content, a couple highlights, and a bullet point or two about key takeaways.&lt;/p>
&lt;h2 id="the-problem">The problem &lt;a class="anchor" href="#the-problem">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>It&amp;rsquo;s pretty easy to surface relevant notes using the Search function if I&amp;rsquo;ve added enough contextual info to the note, but less so if it&amp;rsquo;s just a link. So I wanted to start saving the actual raw content of key articles, particularly if they come from a members-only publication to which I may not have permanent access, and I cannot surface on Google.&lt;/p></description><content:encoded><![CDATA[
        <p>I&rsquo;ve been reading through a lot of Tiago Forte&rsquo;s writing on his members-only publication <a href="https://praxis.fortelabs.co/" target="_blank">Praxis</a>. Since reading through his series on <a href="https://praxis.fortelabs.co/progressive-summarization-a-practical-technique-for-designing-discoverable-notes-3459b257d3eb/" target="_blank">progressive summarization</a>, I have become more conscientious with regard to saving the &ldquo;work-in-progress&rdquo; artifacts of my thinking process to Evernote. Often this involves a link to a piece of content, a couple highlights, and a bullet point or two about key takeaways.</p>
<h2 id="the-problem">The problem <a class="anchor" href="#the-problem">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>It&rsquo;s pretty easy to surface relevant notes using the Search function if I&rsquo;ve added enough contextual info to the note, but less so if it&rsquo;s just a link. So I wanted to start saving the actual raw content of key articles, particularly if they come from a members-only publication to which I may not have permanent access, and I cannot surface on Google.</p>
<p>I initially tried using the Evernote web clipper to save entire articles, but quickly realized that this was cluttering up the namespace of my Evernote search. A few 10k word articles add up quickly, and soon they dwarfed the amount of content in my otherwise relatively text-sparse notes. Searching simple one- or two-word phrases related to everyday notes (e.g. shopping, home tech, etc.) would return a result set cluttered with barely relevant saved articles.</p>
<h2 id="criteria">Criteria <a class="anchor" href="#criteria">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>A suitable solution must satisfy the following criteria, in decreasing order of importance:</p>
<p><strong>Searchable</strong> – Can I perform a global search for text contained within pages without knowing which page to open?</p>
<p><strong>Portable</strong> – Can I search and open the file on different devices using some static file, or must I launch some command line tool on my laptop before opening some proprietary format or web UI for a localhost database?</p>
<p><strong>Readable</strong> – While true as-web formatting would be ideal, I would settle for being able to read the primary content (text) start-to-end. Lack of CSS and JavaScript is sometimes not just ugly, but makes the content unreadable.</p>
<h2 id="solution-singlefile">Solution: SingleFile <a class="anchor" href="#solution-singlefile">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I recently came across a neat Chrome extension called <a href="https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle?hl=en" target="_blank">SingleFile</a> which saves webpages as HTML files, but first waits for lazy-loading javascript, images and CSS to render. It doesn&rsquo;t work perfectly—it sometimes includes the blurry version of lazy-loaded photos unless you first scroll to the end of the page—but it works lightyears better than anything else I&rsquo;ve tried.</p>
<p>If you store your HTML files in a folder indexed by <a href="https://www.alfredapp.com/" target="_blank">Alfred</a>, you can instantly surface them using the <em>in</em> keyword.</p>
<h2 id="other-things-ive-tried-or-considered">Other things I&rsquo;ve tried (or considered) <a class="anchor" href="#other-things-ive-tried-or-considered">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="saving-html-locally-via-your-browser">Saving HTML locally via your browser <a class="anchor" href="#saving-html-locally-via-your-browser">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>In Chrome you can achieve this with a <em>Right-click → Save as → Complete webpage (.html)</em>. The main problem is that it doesn&rsquo;t include CSS and JavaScript not present at the initial pageload. Without this CSS, a lot of pages are impossible to decipher.</p>
<h3 id="print-friendly--pdf">Print Friendly &amp; PDF <a class="anchor" href="#print-friendly--pdf">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>This is the best Chrome extension I&rsquo;ve found up until now. It only outputs PDFs, but it lets you interactively remove superfluous components (e.g. advertisements, banner images) before saving. This extension would work well for someone who either wants a print-ready format or likes PDFs (e.g. for highlighting in Mac Preview app). [<a href="https://chrome.google.com/webstore/detail/print-friendly-pdf/ohlencieiipommannpdfcmfdpjjmeolj?hl=en" target="_blank">chrome web store</a>]</p>
<h3 id="web-recorder">Web recorder <a class="anchor" href="#web-recorder">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A tool called <a href="https://github.com/webrecorder/webrecorder" target="_blank">webrecorder.io</a> came up in a few Hacker News threads. It seems to be a comprehensive roll-your-own alternative to something like Archive.org. It&rsquo;s somewhat overkill for my purposes though, which largely amount to archiving articles for personal consumption.</p>

      ]]></content:encoded></item><item><title>Every good data analysis starts with "Why?"</title><link>https://geoffruddock.com/good-data-analysis-starts-with-why/</link><pubDate>Tuesday, 02 Apr 2019</pubDate><guid>https://geoffruddock.com/good-data-analysis-starts-with-why/</guid><description>&lt;p>In a previous life as a PM, I wrote &lt;em>a lot&lt;/em> of jira tickets. In the software development world, a &amp;ldquo;ticket&amp;rdquo; is a unit of work entered in some workflow tracking tool such as jira, but it can represent anything from a task or goal to an issue or bug. After translating a few high-level strategic projects into trees of tickets, I realized that the average ticket was pretty mediocre.&lt;/p></description><content:encoded><![CDATA[
        <p>In a previous life as a PM, I wrote <em>a lot</em> of jira tickets. In the software development world, a &ldquo;ticket&rdquo; is a unit of work entered in some workflow tracking tool such as jira, but it can represent anything from a task or goal to an issue or bug. After translating a few high-level strategic projects into trees of tickets, I realized that the average ticket was pretty mediocre.</p>
<blockquote>
<p><strong>Issue #4824</strong>: Login button is broken.</p></blockquote>
<p>This ticket provides very little contextual value. When you are on the hook for <em>results</em> rather than merely completed tickets, this is sub-optimal. A lot of methodologies such as <a href="https://en.wikipedia.org/wiki/User_story" target="_blank">user story mapping</a> invoke a structure which makes it somewhat more difficult to produce such low-value tickets, but they can still be &ldquo;gamed&rdquo;.</p>
<blockquote>
<p><strong>User story</strong>: As a user, I would like the login button to work.</p>
<p><strong>Status quo</strong>: The login button doesn&rsquo;t work.</p>
<p><strong>Acceptance criteria</strong>: The login button works.</p></blockquote>
<p>A big revelation for me was stumbling across Simon Sinek&rsquo;s TED talk <a href="https://www.youtube.com/watch?v=IPYeCltXpxw" target="_blank"><em>Start with Why</em></a>. I am now of the opinion that the humble user story, often the least carefully considered part of the ticket, is in fact the most important part of the ticket. A well-written user story, one which effectively captures and communicates <em>The Why</em>, has the potential to make or break the success of a software development effort by subtly conveying intent and purpose. I have learned first-hand that tickets which focus on a crisply defined <em>purpose</em> rather than prescribing a set of actions ultimately correlate with successful projects.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/good-data-analysis-starts-with-why/start_with_why_golden_circle_hu_fe575b58d3d9d2db.png 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/good-data-analysis-starts-with-why/start_with_why_golden_circle.png"
                
    
            
                alt="All good things come in threes." width="500"/> <figcaption>
                <p>All good things come in threes.</p>
            </figcaption>
    </figure>
<p>Before I started working as a data scientist, I did not realize that this principle is <em>just as important</em> for interfacing between analysts ↔ stakeholders as it is for the traditional PM ↔ developer relationship. Being on the receiving end of vaguely formulated requests has reinforced the fundamental importance of communicating <em>The Why</em> on projects where work spans across multiple people or teams.</p>
<h2 id="why-start-with-why">Why start with why? <a class="anchor" href="#why-start-with-why">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The modern knowledge worker works in a highly specialized environment. Specialization improves efficiency, but it comes at the cost of reduced reactivity and adaptability to change. As units of work grow beyond the span of a single agent, they <a href="https://seths.blog/2019/11/all-or-some/" target="_blank">impose a trade-off</a>. But we can hack this trade-off. In an organization with multiple actors, the question shouldn&rsquo;t be <em>Is collaboration worth it?</em> but rather <em>How can we reduce the cost of collaboration?</em> A natural place to start is in the written form of the request/job/project/task.</p>
<p>There is a common failure mode in technical support, commonly referred to as the <a href="https://en.wikipedia.org/wiki/XY_problem" target="_blank">XY Problem</a>, which manifests when a customer with some underlying goal X makes an inferential leap to a sub-problem Y but then does not communicate that inferential leap when asking for help. If you are on the receiving end of an unclear request, the <a href="https://en.wikipedia.org/wiki/Five_whys" target="_blank">Five Whys</a> technique is a useful approach to uncovering the true root cause or issue at play. If you are on the <em>dispatching end</em> of a request, then <em>Starting with Why</em> is a prophylactic technique to avoid the message being interpreted incorrectly to begin with.</p>
<h2 id="the-why-of-data-analysis">The &ldquo;why&rdquo; of data analysis <a class="anchor" href="#the-why-of-data-analysis">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Data analysis has an aura of objectivity, but in practice it frequently involves a number of subjective decisions. The sheer number of choices one must make in the course of answering any sort of interesting question with data is overwhelming. You say, &ldquo;We want to understand the behaviour of our returning customers&rdquo;, I reply &ldquo;What do you mean by customers? Define &lsquo;returning&rsquo;. And what sorts of behaviour are we interested in specifically?&rdquo; Some of these are pivotal questions which simply must be answered. But others are <em>micro-decisions</em>, each of which only marginally affects the results, but whose compound effect can influence the entire outcome of an analysis. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<blockquote>
<p>Only when you know the question will you know what the answer means.</p></blockquote>
<p>— Douglas Adams, The Hitchhiker&rsquo;s Guide to the Galaxy</p>
<p>It is critical to have a firm grasp of your root question before starting an analysis. Having a firm grasp of that question lets you make smarter sub-decisions and makes you more likely to arrive at a useful outcome. The main challenge is that the question-asker sometimes doesn&rsquo;t consider it necessary to show their full hand. Vulnerability does not come easy, after all.</p>
<h2 id="dont-be-an-sql-monkey">Don&rsquo;t be an SQL monkey <a class="anchor" href="#dont-be-an-sql-monkey">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>So given that you can&rsquo;t directly influence the clarity of thinking of your stakeholders, what is a frustrated data analyst to do? Here are a few actionable tips which I try to run through each time I am facing a vaguely defined problem, in order to maximize my chance of ultimately reaching a successful outcome.</p>
<h3 id="crisply-define-terminology">Crisply define terminology <a class="anchor" href="#crisply-define-terminology">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Make sure you are fully aligned on what the terms you are using actually <em>mean</em>. Every business has a number of phrases which serve as weasel words. Here are a few examples:</p>
<ul>
<li><em>Users</em> → Paying customers or everyone with an account? What about non-logged-in &ldquo;users&rdquo;, which are really just tracking cookies in someone&rsquo;s browser?</li>
<li><em>Retention</em> → This one is straightforward for subscription services, but more vaguely defined in a non-contractual setting. How frequently does a customer need to purchase to be considered &ldquo;active&rdquo;? What if they return and browse, but don&rsquo;t purchase?</li>
<li><em>New vs. returning</em> → In relation to the above, do we start counting when a customer first visits, when they sign up, or when they buy?</li>
</ul>
<h3 id="think-through-hypotheticals">Think through hypotheticals <a class="anchor" href="#think-through-hypotheticals">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A helpful way to draw out the meat of a decision is to ask what hypothetical action we would take for each possible outcome of the analysis. Consider drawing a <a href="https://en.wikipedia.org/wiki/Decision_tree" target="_blank">decision tree</a> (the diagram, not the machine learning model). Understanding the topology of a decision allows you to more carefully craft the analysis to inform that particular decision.</p>
<p>In the case of an A/B test: what is our default decision/action if results are inconclusive? Do we only launch if there is X% improvement in our KPIs, or do we launch as long as there is no noticeable decline? What is X%? Working through this thought exercise will sharpen your intuition around the nature of the problem, making it easier to make better <em>micro-decisions</em> such as setting <a href="https://geoffruddock.com/ab-testing-with-a-symmetric-risk-profile/" target="_blank">appropriate risk parameters</a> (α, β) for the A/B test based on the business risk of a false positive or false negative outcome.</p>
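<p>To make the role of the risk parameters concrete, here is a rough sketch (not a definitive power analysis) of how α and β drive the sample size needed per variant. It assumes a two-sided z-test comparing two conversion rates; the function name <code>sample_size_per_group</code> and the 10% and 12% conversion rates are hypothetical, chosen purely for illustration.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, beta=0.20):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    alpha: tolerated false positive rate (launching a change that does nothing).
    beta:  tolerated false negative rate (missing a real improvement).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(1 - beta)        # critical value for power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: detect a lift from 10% to 12% conversion.
baseline = sample_size_per_group(0.10, 0.12)
# A costly false positive argues for a stricter alpha...
strict_alpha = sample_size_per_group(0.10, 0.12, alpha=0.01)
# ...while a costly false negative argues for a stricter beta.
strict_beta = sample_size_per_group(0.10, 0.12, beta=0.05)
print(baseline, strict_alpha, strict_beta)
</code></pre>
<p>Tightening either parameter increases the required sample size, so deciding which error is more expensive for the business tells you where that extra traffic is best spent.</p>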
<h3 id="if-its-a-visualization-draw-a-picture">If it&rsquo;s a visualization, draw a picture <a class="anchor" href="#if-its-a-visualization-draw-a-picture">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Almost every request for a specific data visualization is <em>actually</em> a request for an artifact which your stakeholder <em>believes</em> will be useful for solving some underlying question which he or she has decided to keep hidden from you. Try to understand that underlying question.</p>
<p>But sometimes there is a need for just a good plain old chart or dashboard. In those cases, I have found it helpful to draw a picture of the output before starting. You can do this collaboratively with your stakeholder, or, if your drawing skills are as embarrassingly poor as mine, you can sketch something out ahead of time and meet to align on it.</p>
<p>Based on your <em>a priori</em> domain knowledge, it is often possible to arrive at something reasonably similar in structure and content to the final piece of dataviz. Getting this prototype in front of your stakeholder before implementing it will frequently surface follow-up questions or revisions that would otherwise have cost you time in rework. It may also help you identify gaps between the currently available data and the data required to answer the underlying question.</p>
<h2 id="more-data--more-problems">More data → more problems <a class="anchor" href="#more-data--more-problems">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There is a pervasive desire for more and faster data, particularly among product managers. But besides adding unnecessary processing complexity, real-time analytics can actually provide <em>negative</em> benefit.</p>
<p>Adam Robinson has a great little story he tells on various podcasts about a study by psychologist Paul Slovic <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> which illustrates how additional data has diminishing returns for decision quality, but not for our confidence in our decisions. Beyond some point, additional data makes us no more accurate, but it makes us <em>think</em> we are more accurate.</p>
<p>When you&rsquo;ve got 3 data points which disagree with your prior worldview, it&rsquo;s tough to avoid the cognitive dissonance. It&rsquo;s uncomfortable, but this is how scientific progress is made. But if you&rsquo;ve got 30 data points and only ½ of them disagree, it&rsquo;s a lot easier to tell yourself a story which reaffirms your worldview and sidesteps the cognitive dissonance. Unfortunately this cognitive comfort comes at the cost of a wrong decision.</p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://jvns.ca/blog/good-questions/" target="_blank">How to ask good questions</a> (Julia Evans)</li>
<li><a href="https://www.freshworks.com/freshsales-crm/resources/summary-of-start-with-why-blog/" target="_blank">A 12-Minute Summary of “Start With Why” by Simon Sinek</a></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://statmodeling.stat.columbia.edu/2012/11/01/researcher-degrees-of-freedom/" target="_blank">https://statmodeling.stat.columbia.edu/2012/11/01/researcher-degrees-of-freedom/</a>&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>I often share a snippet from <a href="https://ma.tt/2017/11/adam-robinson-on-understanding/" target="_blank">Matt Mullenweg&rsquo;s blog</a>, although I recall first hearing of this study on the Tim Ferriss show.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Calculating the bearing between coordinates in Redshift</title><link>https://geoffruddock.com/calculate-angle-between-coordinates-with-redshift-udfs/</link><pubDate>Monday, 11 Mar 2019</pubDate><guid>https://geoffruddock.com/calculate-angle-between-coordinates-with-redshift-udfs/</guid><description>&lt;p>I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However, the &lt;em>direction&lt;/em> of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.&lt;/p></description><content:encoded><![CDATA[
        <p>I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However, the <em>direction</em> of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.</p>
<h2 id="what-do-we-mean-by-direction-anyway">What do we mean by &ldquo;direction&rdquo; anyway? <a class="anchor" href="#what-do-we-mean-by-direction-anyway">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>The most intuitive interpretation of direction seemed like <a href="https://en.wikipedia.org/wiki/Bearing_%28navigation%29" target="_blank">compass bearing</a>, so I set out to find a way to convert a pair of coordinates (each a latitude and longitude) into a variable which represents degrees clockwise from true north. Unfortunately I could not find any suitable built-in functions to deal with spatial data in Redshift.</p>
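<p>Outside the warehouse, the calculation itself is short. As a rough sketch, here is the standard initial great-circle bearing (forward azimuth) formula in plain Python. This may differ from the function this post ends up using, and the name <code>initial_bearing</code> and the sample coordinates are purely illustrative.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math


def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from true north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    y = math.sin(delta_lon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon)
    # atan2 yields an angle between -180 and 180 degrees; shift it into the 0-360 range.
    return (math.degrees(math.atan2(y, x)) + 360) % 360


# Heading due east along the equator, then due north along a meridian.
print(initial_bearing(0.0, 0.0, 0.0, 1.0))
print(initial_bearing(0.0, 0.0, 1.0, 0.0))
</code></pre>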
<p>While it would not be <em>difficult</em> to spin up a Jupyter notebook, pull in some data via SQL and run each row through some function, it would not be an <em>ideal</em> approach. Keeping a small data request like this as a pure SQL query means it is easily reproducible in the future, without worrying about Python package versions, Anaconda environments, etc. Furthermore, anyone with access to the data warehouse can fetch updated data, rather than only someone comfortable with Python.</p>
<h2 id="enter-python-udfs-in-redshift">Enter Python UDFs in Redshift <a class="anchor" href="#enter-python-udfs-in-redshift">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>But all is not lost. <a href="https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/" target="_blank">Python UDFs</a> to the rescue! Redshift lets you declare user-defined functions that take some scalar inputs, run a chunk of Python code and return the output right back into SQL. Instead of declaring your function as a Python function using <code>def my_func(param)</code> syntax, you place its contents in the UDF function declaration below.</p>
<pre tabindex="0"><code>CREATE OR REPLACE FUNCTION my_func (param_a float,
                                    param_b float)
                                    RETURNS float STABLE AS
$$ &lt; python code &gt; $$ LANGUAGE plpythonu;
</code></pre><p>Trying to remember as little decade-old trigonometry knowledge as possible, I found a working function on <a href="https://gis.stackexchange.com/a/29240" target="_blank">this stackexchange question</a> and plugged it into the UDF boilerplate above. The final result looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">bearing_between_coordinates</span><span class="w"> </span><span class="p">(</span><span class="n">x_lat</span><span class="w"> </span><span class="nb">float</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                                        </span><span class="n">x_lon</span><span class="w"> </span><span class="nb">float</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                                        </span><span class="n">y_lat</span><span class="w"> </span><span class="nb">float</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                                        </span><span class="n">y_lon</span><span class="w"> </span><span class="nb">float</span><span class="p">)</span><span class="w"> </span><span class="k">RETURNS</span><span class="w"> </span><span class="nb">float</span><span class="w"> </span><span class="k">STABLE</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">import</span><span class="w"> </span><span class="n">math</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">startLat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">x_lat</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">startLong</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">x_lon</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">endLat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">y_lat</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">endLong</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">y_lon</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">dLong</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">endLong</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">startLong</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">dPhi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">tan</span><span class="p">(</span><span class="n">endLat</span><span class="o">/</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="o">+</span><span class="n">math</span><span class="p">.</span><span class="n">pi</span><span class="o">/</span><span class="mi">4</span><span class="p">.</span><span class="mi">0</span><span class="p">)</span><span class="o">/</span><span class="n">math</span><span class="p">.</span><span class="n">tan</span><span class="p">(</span><span class="n">startLat</span><span class="o">/</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="o">+</span><span class="n">math</span><span class="p">.</span><span class="n">pi</span><span class="o">/</span><span class="mi">4</span><span class="p">.</span><span class="mi">0</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="k">abs</span><span class="p">(</span><span class="n">dLong</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">pi</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="k">if</span><span class="w"> </span><span class="n">dLong</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">             </span><span class="n">dLong</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">pi</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">dLong</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">         </span><span class="k">else</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">             </span><span class="n">dLong</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">math</span><span class="p">.</span><span class="n">pi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">dLong</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">degrees</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">atan2</span><span class="p">(</span><span class="n">dLong</span><span class="p">,</span><span class="w"> </span><span class="n">dPhi</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">360</span><span class="p">.</span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="o">%</span><span class="w"> </span><span class="mi">360</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w"> </span><span class="k">LANGUAGE</span><span class="w"> </span><span class="n">plpythonu</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><p>Execute this once in your database console, then you can use it within an existing query, for example,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">bearing_between_coordinates</span><span class="p">(</span><span class="n">x_lat</span><span class="p">,</span><span class="w"> </span><span class="n">x_lon</span><span class="p">,</span><span class="w"> </span><span class="n">y_lat</span><span class="p">,</span><span class="w"> </span><span class="n">y_lon</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">bearing</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">lat_lon_coords</span><span class="w">
</span></span></span></code></pre></div><h2 id="or-you-can-stitch-together-trig-functions-in-redshift">Or you can stitch together trig functions in Redshift <a class="anchor" href="#or-you-can-stitch-together-trig-functions-in-redshift">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Update: A Python UDF may be overkill. I realized after writing the above that I can compute the bearing directly with the <a href="https://docs.aws.amazon.com/redshift/latest/dg/Math_functions.html" target="_blank">built-in trigonometric functions</a> in Redshift. This results in the &ldquo;almost one-liner&rdquo; below. I opted to use a CTE to convert the inputs to radians, rather than embedding the conversions in the select, to make that behemoth <em>slightly</em> less unreadable.</p>
<p>There is definitely a trade-off on interpretability though. This SQL code does a poor job of projecting <a href="https://en.wikipedia.org/wiki/Intentional_programming" target="_blank">intent</a> compared to a defined function. Rather than reading the first line of the function declaration, you need to read all the way through to the final alias <code>bearing_degrees</code> to understand why we are chaining together a bunch of trig functions anyway.</p>
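<p>As a sanity check before running either version in Redshift, the sketch below re-implements both formulas in plain Python so they can be compared locally. The function names <code>rhumb_bearing</code> and <code>initial_bearing</code> and the test coordinates are my own illustrative choices, not anything Redshift provides. One caveat worth knowing: the UDF follows a rhumb-line (constant-heading) formula, while the query below uses the great-circle initial-bearing formula, so the two agree on a due-north leg but can differ on long diagonal routes.</p>

```python
import math


def rhumb_bearing(x_lat, x_lon, y_lat, y_lon):
    """Constant-heading bearing in degrees, mirroring the Python UDF body."""
    start_lat, end_lat = math.radians(x_lat), math.radians(y_lat)
    d_long = math.radians(y_lon) - math.radians(x_lon)
    d_phi = math.log(math.tan(end_lat / 2.0 + math.pi / 4.0)
                     / math.tan(start_lat / 2.0 + math.pi / 4.0))
    # Take the shorter way around the globe when crossing the antimeridian.
    if abs(d_long) > math.pi:
        d_long = d_long - 2.0 * math.pi if d_long > 0.0 else d_long + 2.0 * math.pi
    return (math.degrees(math.atan2(d_long, d_phi)) + 360.0) % 360.0


def initial_bearing(x_lat, x_lon, y_lat, y_lon):
    """Great-circle initial bearing in degrees, mirroring the SQL query
    (without its rounding to two decimal places)."""
    lat1, lat2 = math.radians(x_lat), math.radians(y_lat)
    d_lon = math.radians(y_lon - x_lon)
    y = math.sin(d_lon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(d_lon))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


if __name__ == "__main__":
    # Hypothetical test points: a long diagonal leg, then a due-north leg.
    print(rhumb_bearing(0.0, 0.0, 45.0, 90.0))
    print(initial_bearing(0.0, 0.0, 45.0, 90.0))
    print(rhumb_bearing(10.0, 20.0, 30.0, 20.0))
    print(initial_bearing(10.0, 20.0, 30.0, 20.0))
```

<p>If the numbers coming out of Redshift disagree with the matching function here by more than rounding, the column names or the degree-to-radian conversion are the first things I would check.</p>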
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">coords_as_radians</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="n">RADIANS</span><span class="p">(</span><span class="n">x_lat</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dep_lat</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="w"> </span><span class="n">RADIANS</span><span class="p">(</span><span class="n">x_lon</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dep_lon</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="w"> </span><span class="n">RADIANS</span><span class="p">(</span><span class="n">y_lat</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">arr_lat</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">,</span><span class="w"> </span><span class="n">RADIANS</span><span class="p">(</span><span class="n">y_lon</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">arr_lon</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">raw_coordinates</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">(</span><span class="n">DEGREES</span><span class="p">(</span><span class="n">ATAN2</span><span class="p">(</span><span class="n">SIN</span><span class="p">(</span><span class="n">arr_lon</span><span class="o">-</span><span class="n">dep_lon</span><span class="p">)</span><span class="o">*</span><span class="n">COS</span><span class="p">(</span><span class="n">arr_lat</span><span class="p">),</span><span class="w"> </span><span class="n">COS</span><span class="p">(</span><span class="n">dep_lat</span><span class="p">)</span><span class="o">*</span><span class="n">SIN</span><span class="p">(</span><span class="n">arr_lat</span><span class="p">)</span><span class="o">-</span><span class="n">SIN</span><span class="p">(</span><span class="n">dep_lat</span><span class="p">)</span><span class="o">*</span><span class="n">COS</span><span class="p">(</span><span class="n">arr_lat</span><span class="p">)</span><span class="o">*</span><span class="n">COS</span><span class="p">(</span><span class="n">arr_lon</span><span class="o">-</span><span class="n">dep_lon</span><span class="p">)))</span><span class="o">+</span><span class="mi">360</span><span class="p">)::</span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">%</span><span class="w"> </span><span class="mi">360</span><span class="p">.</span><span class="mi">00</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">bearing_degrees</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">coords_as_radians</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://www.movable-type.co.uk/scripts/latlong.html" target="_blank">Calculate distance, bearing and more between Latitude/Longitude points</a></li>
</ul>

      ]]></content:encoded></item><item><title>DIY insulated sous-vide container from a cooler</title><link>https://geoffruddock.com/diy-cooler-sous-vide-container/</link><pubDate>Friday, 01 Mar 2019</pubDate><guid>https://geoffruddock.com/diy-cooler-sous-vide-container/</guid><description>&lt;p>Last year I built a &lt;a href="https://geoffruddock.com/diy-ikea-sous-vide-container/">DIY insulated sous-vide container&lt;/a> using $10 of IKEA parts. It worked pretty well, using 60% less electricity than an uninsulated container. But it was a bit of an eye-sore, and I got tired of leaving a mess of towels out on my kitchen counter.&lt;/p>
&lt;p>Can we do better? I did some research on sous-vide cooler hacks and was impressed by the build described in &lt;a href="https://www.chowhound.com/post/sous-vide-cooler-hacks-1064726" target="_blank">this Chowhound thread&lt;/a>. So I set out with those instructions, but made a few modifications along the way. The main change I made was to drill the hole in the back of the lid rather than the front, so that it can be opened without removing the sous vide unit.&lt;/p></description><content:encoded><![CDATA[
        <p>Last year I built a <a href="/diy-ikea-sous-vide-container/">DIY insulated sous-vide container</a> using $10 of IKEA parts. It worked pretty well, using 60% less electricity than an uninsulated container. But it was a bit of an eye-sore, and I got tired of leaving a mess of towels out on my kitchen counter.</p>
<p>Can we do better? I did some research on sous-vide cooler hacks and was impressed by the build described in <a href="https://www.chowhound.com/post/sous-vide-cooler-hacks-1064726" target="_blank">this Chowhound thread</a>. So I set out with those instructions, but made a few modifications along the way. The main change I made was to drill the hole in the back of the lid rather than the front, so that it can be opened without removing the sous vide unit.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/completed_closed_hu_36aa15b0ab88f519.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/completed_closed_hu_ba47f4d0d659f63a.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/completed_closed_hu_ba47f4d0d659f63a.jpg"
                
    
            
                alt="Finished cooler, lid closed"/> 
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/completed_open_hu_ff945e0b7c3c94a5.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/completed_open_hu_e3e85bacc7e0ac7b.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/completed_open_hu_e3e85bacc7e0ac7b.jpg"
                
    
            
                alt="Finished cooler, lid open"/> 
    </figure>

        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 500px
    }

</style>
<h2 id="necessary-supplies">Necessary supplies <a class="anchor" href="#necessary-supplies">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><table>
  <thead>
      <tr>
          <th>Item</th>
          <th>Notes</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.amazon.com/Igloo-Cooler-Legend-12-Red/dp/B000BOB5L8" target="_blank">Igloo Legend 12 cooler</a></td>
          <td>This size is perfect for weeknight cooks. It is shallow enough to only need 4.5L to fill, but wide enough to be opened without removing the sous vide device from the lid every time.</td>
          <td>$21</td>
      </tr>
      <tr>
          <td>Spray foam insulation</td>
          <td>We want something with good thermal properties that comes in a can with a spray nozzle, so we can spray it into tight spaces.</td>
          <td>$8</td>
      </tr>
      <tr>
          <td>Silicone caulk</td>
          <td>We really don&rsquo;t need much, so just get a small container.</td>
          <td>$4</td>
      </tr>
      <tr>
          <td><a href="https://www.amazon.com/uxcell-Rings-Nitrile-Rubber-Diameter/dp/B07HRT7JT5" target="_blank">60mm x 3.5mm o-rings</a></td>
          <td>These are for outside the lid, to adjust how deeply the sous vide unit sits.</td>
          <td>$6</td>
      </tr>
      <tr>
          <td><a href="https://www.amazon.com/uxcell-Metric-Nitrile-Rubber-Thickness/dp/B008IBKYUY" target="_blank">40mm x 2mm o-rings</a></td>
          <td>These are for inside the lid, to keep the unit snugly in place when the lid is opened.</td>
          <td>$4</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>$43</strong></td>
      </tr>
  </tbody>
</table>
<p>We&rsquo;ll also need the following tools:</p>
<ul>
<li>A reasonably powerful power drill</li>
<li><a href="https://www.amazon.com/DEWALT-D180038-8-Inch-Standard-Bi-Metal/dp/B00005LEZD" target="_blank">60mm hole saw bit</a> – You&rsquo;ll want this to match the diameter of your sous-vide unit as closely as possible so that it fits snugly. My Anova unit (original version) needed a ~62mm hole, so I used a 60mm bit and sanded the hole out until the unit fit. If you are using a newer version, check the diameter of your unit.</li>
</ul>
<h2 id="how-to-build-it">How to build it <a class="anchor" href="#how-to-build-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="fill-the-cooler-lid-with-insulating-foam">Fill the cooler lid with insulating foam <a class="anchor" href="#fill-the-cooler-lid-with-insulating-foam">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Drink coolers are designed to keep their contents <em>cold</em> rather than hot. So it would make sense for the cooler to have better insulation around the sides and bottom than the top lid. But since <em>heat rises</em>, we care disproportionately about the thermal performance of the top lid. The top lid of the Igloo Legend 12 cooler is hollow, so we can reduce heat loss even further by filling it with spray insulation foam.</p>
<p>Here&rsquo;s how:</p>
<ol>
<li>Drill a bunch of small holes on the underside of the lid, just slightly larger than the diameter of the foam insulation hose nozzle. We want to use multiple holes, since the foam will expand inside the lid, and may cause it to deform if we spray it all into one corner. Better to distribute it evenly throughout the lid.</li>
<li>Lay out a few sheets of newspaper on the ground. This will get messy.</li>
<li>Distribute the spray foam inside the lid, reaching as deeply into the corners as possible. Leave some extra space near each hole. The foam will expand greatly over the course of 24 hours, and you want it to expand from the corners towards the holes so that all the air escapes. Err on the side of <em>less</em>, because you can apply another round after 24 hours if the foam has not expanded to entirely fill the lid.</li>
<li>Wait 24 hours, then break off the bulbs of hardened foam which are protruding from the holes we drilled earlier.</li>
</ol>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/inside_lid_hu_961262228af5d8ea.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/inside_lid_hu_16fa25ad4f2397c8.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/inside_lid_hu_16fa25ad4f2397c8.jpg"
                
    
            
                alt="Inside lid" width="800"/> 
    </figure>
<h3 id="drill-a-hole-for-the-sous-vide-unit">Drill a hole for the sous vide unit <a class="anchor" href="#drill-a-hole-for-the-sous-vide-unit">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><ol>
<li>Pick a spot on the lid you want to drill. I suggest somewhere near the back hinge, so that you can open the lid fully without removing the sous vide unit. The Igloo cooler I was using has a natural spot for the hole.</li>
<li>Measure the diameter of your sous vide unit, and drill a hole using an appropriately sized hole saw bit. Err on the side of smaller for a snug fit. You can use sandpaper to slightly expand the hole after drilling it.</li>
</ol>
<h3 id="seal-off-holes-with-silicone">Seal off holes with silicone <a class="anchor" href="#seal-off-holes-with-silicone">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Now we want to apply silicone caulk to all the holes in the lid, so that moisture does not get inside during use.</p>
<ol>
<li>Carve out a bit of foam insulation around each of the holes.</li>
<li>Apply a bit of silicone caulk to each hole.</li>
<li>Do the same around the rim of the main hole.</li>
<li>Use a credit card or another flat surface to smooth the caulk.</li>
<li>Wait 12 hours and touch up if necessary. The rim of the main hole took me 2-3 applications until I was confident it would be waterproof.</li>
</ol>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/main_hole_hu_c1002e1923b397a3.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/main_hole_hu_f76b0cdc6ea3e88a.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/main_hole_hu_f76b0cdc6ea3e88a.jpg"
                
    
            
                alt="There is some room for improvement here." width="400"/> <figcaption>
                <p>There is some room for improvement here.</p>
            </figcaption>
    </figure>
<h3 id="insert-sous-vide-with-o-rings">Insert sous vide (with o-rings) <a class="anchor" href="#insert-sous-vide-with-o-rings">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Now you can insert the sous vide unit into the main hole. Add o-rings to the top of the unit until it sits high enough that the stem clears the edge of the cooler when the lid is opened.</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/outer_orings_hu_913bcf512902d3b3.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/outer_orings_hu_bbf99060851c9536.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/outer_orings_hu_bbf99060851c9536.jpg"
                
    
            
                alt="The exact o-ring size doesn&rsquo;t matter much."/> <figcaption>
                <p>The exact o-ring size doesn&rsquo;t matter much.</p>
            </figcaption>
    </figure>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/edge_clearance_hu_349d93993bd66c86.jpg 480w,
                
                       https://geoffruddock.com/diy-cooler-sous-vide-container/edge_clearance_hu_754d1719c787d709.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-cooler-sous-vide-container/edge_clearance_hu_754d1719c787d709.jpg"
                
    
            
                alt="Make sure it clears the edge when the lid is opened."/> <figcaption>
                <p>Make sure it clears the edge when the lid is opened.</p>
            </figcaption>
    </figure>

        
    </div>
</div>

<h3 id="mark-the-fill-line">Mark the fill line <a class="anchor" href="#mark-the-fill-line">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Measure the depth of the cooler, subtract the offset from the o-rings to the fill line on your unit, and mark a line inside the cooler using a permanent marker and a ruler.</p>
<h2 id="energy-efficiency">Energy efficiency <a class="anchor" href="#energy-efficiency">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I ran a series of tests using a <a href="https://www.amazon.com/TP-Link-HS100-Required-Google-Assistant/dp/B0178IC734" target="_blank">TP-Link Kasa Smart Plug</a> to measure energy expenditure. For each test, I brought 4.5L of water up to 66°C and then started measuring after the water reached temperature.</p>
<p>This is the same temperature as the <a href="https://geoffruddock.com/diy-cooler-sous-vide-container/#energy-efficiency">previous tests</a>, but this time using only 4.5L of water instead of 7L. Although this may give our new build a slight advantage, it reflects the minimum amount of water necessary to reach the &ldquo;min&rdquo; marker on the sous vide unit, and so I think it best reflects real-life usage.</p>
<table>
  <thead>
      <tr>
          <th><strong>Hours</strong></th>
          <th><strong>Energy (kWh)</strong></th>
          <th><strong>Watts</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>23.5</td>
          <td>1.00</td>
          <td>42</td>
      </tr>
      <tr>
          <td>11</td>
          <td>0.45</td>
          <td>41</td>
      </tr>
      <tr>
          <td>12</td>
          <td>0.49</td>
          <td>41</td>
      </tr>
      <tr>
          <td>—</td>
          <td><strong>Average</strong></td>
          <td><strong>41</strong></td>
      </tr>
  </tbody>
</table>
<p>So this cooler build uses roughly 35% less electricity than the <a href="https://geoffruddock.com/diy-cooler-sous-vide-container/#energy-efficiency">previous build</a> when it was wrapped with towels, which used 63 watts. It uses about 72% less electricity than the unwrapped container, which used 148 watts. So our new build is the best of both worlds! It is the most energy efficient, and also looks better on the kitchen countertop than either of the previous options.</p>
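<p>The percentages above can be re-derived from the measured wattages. The sketch below is just that arithmetic in Python; the helper names are hypothetical, and the inputs are the figures quoted in this post (41 W for the cooler, 63 W for the towel-wrapped container, 148 W unwrapped).</p>

```python
def average_watts(energy_kwh, hours):
    """Convert total energy over a run into average power draw in watts."""
    return energy_kwh * 1000.0 / hours


def percent_saved(new_watts, old_watts):
    """Percentage reduction in power draw relative to the older setup."""
    return (1.0 - new_watts / old_watts) * 100.0


if __name__ == "__main__":
    # Rows from the table above as (hours, kWh). The kWh readings are rounded,
    # so the computed watts may differ slightly from the table.
    for hours, kwh in [(23.5, 1.00), (11, 0.45), (12, 0.49)]:
        print(hours, round(average_watts(kwh, hours), 1))

    print(round(percent_saved(41, 63)))   # versus the towel-wrapped container
    print(round(percent_saved(41, 148)))  # versus the unwrapped container
```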
<h2 id="using-it-as-a-regular-cooler">Using it as a regular cooler <a class="anchor" href="#using-it-as-a-regular-cooler">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>After a bit of trial-and-error, I found <a href="https://www.aliexpress.com/item/4001204549022.html?spm=a2g0s.9042311.0.0.2cb84c4dVlFFJv" target="_blank">this 63mm plastic plug</a> on AliExpress which fits perfectly into the hole at the top of the cooler. This is pretty useful, because then you can use the cooler as both a sous vide container and as a regular cooler when necessary.</p>

      ]]></content:encoded></item><item><title>The best way to manage dependencies between DAGs in Airflow</title><link>https://geoffruddock.com/dependencies-between-dags-in-airflow/</link><pubDate>Monday, 11 Feb 2019</pubDate><guid>https://geoffruddock.com/dependencies-between-dags-in-airflow/</guid><description>&lt;p>Airflow provides a few different sensors and operators which enable you to coordinate scheduling between different DAGs, including:&lt;/p>
&lt;ul>
&lt;li>ExternalTaskSensor&lt;/li>
&lt;li>TriggerDagRunOperator&lt;/li>
&lt;li>SubDagOperator&lt;/li>
&lt;/ul>
&lt;h2 id="which-one-is-the-best-to-use">Which one is the best to use? &lt;a class="anchor" href="#which-one-is-the-best-to-use">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>I have previously written about &lt;a href="https://geoffruddock.com/how-to-use-external-task-sensor-in-airflow/">how to use ExternalTaskSensor in Airflow&lt;/a> but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.&lt;/p></description><content:encoded><![CDATA[
        <p>Airflow provides a few different sensors and operators which enable you to coordinate scheduling between different DAGs, including:</p>
<ul>
<li>ExternalTaskSensor</li>
<li>TriggerDagRunOperator</li>
<li>SubDagOperator</li>
</ul>
<h2 id="which-one-is-the-best-to-use">Which one is the best to use? <a class="anchor" href="#which-one-is-the-best-to-use">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I have previously written about <a href="https://geoffruddock.com/how-to-use-external-task-sensor-in-airflow/">how to use ExternalTaskSensor in Airflow</a> but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.</p>
<h2 id="use-cases">Use cases <a class="anchor" href="#use-cases">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="i-need-the-ability-to-sometimes-run-dag_b-independent-of-dag_a-but-i-want-to-share-state-history-between-them">I need the ability to sometimes run <code>dag_B</code> independent of <code>dag_A</code>, but I want to share state (history) between them. <a class="anchor" href="#i-need-the-ability-to-sometimes-run-dag_b-independent-of-dag_a-but-i-want-to-share-state-history-between-them">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Using <code>SubDagOperator</code> creates a tidy parent–child relationship between your DAGs. The sub-DAGs will not appear in the top-level UI of Airflow, but rather nested within the parent DAG, accessible via a <em>Zoom into Sub DAG</em> button. This is a nice feature if those DAGs are <em>always</em> run together. However, if you sometimes need to run the sub-DAG alone, you will need to initialize it as its own top-level DAG, which will not share state with the sub-DAG.</p>
<p>In this scenario, you are better off using either <code>ExternalTaskSensor</code> or <code>TriggerDagRunOperator</code>.</p>
<h3 id="my-local-development-or-test-environment-uses-sqlite-rather-than-a-postgres-db">My local development or test environment uses SQLite rather than a Postgres DB. <a class="anchor" href="#my-local-development-or-test-environment-uses-sqlite-rather-than-a-postgres-db">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>SQLite does not support concurrent write operations, so it forces Airflow to use the <code>SequentialExecutor</code>, meaning only one task can be active at any given time. An <code>ExternalTaskSensor</code> occupies that single worker slot while it waits for the upstream task, so the upstream task never gets a chance to run and <a href="https://issues.apache.org/jira/browse/AIRFLOW-47" target="_blank">your Airflow will be deadlocked</a>.</p>
<p>In this case, it is preferable to use <a href="http://airflow.apache.org/concepts.html#subdags" target="_blank"><code>SubDagOperator</code></a>, since these tasks can be run with only a single worker. Astronomer.io has some good documentation on <a href="https://www.astronomer.io/guides/subdags/" target="_blank">how to use sub-DAGs in Airflow</a>.</p>
<h3 id="i-want-dag_b-to-sometimes-run-depending-on-some-conditional-logic">I want <code>dag_B</code> to sometimes run depending on some conditional logic <a class="anchor" href="#i-want-dag_b-to-sometimes-run-depending-on-some-conditional-logic">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If you want to include conditional logic, you can pass a Python function to <code>TriggerDagRunOperator</code> that decides whether the target DAG is actually triggered (if at all).</p>
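<p>A minimal sketch of such a function is below. It assumes the Airflow 1.10-era convention in which the operator's <code>python_callable</code> receives <code>(context, dag_run_obj)</code> and returns the object to trigger the run or <code>None</code> to skip it; later Airflow versions may not accept this argument. The <code>should_trigger</code> flag and the payload contents are hypothetical, and a plain stand-in object replaces Airflow's own so the snippet runs without Airflow installed.</p>

```python
from types import SimpleNamespace


def conditionally_trigger(context, dag_run_obj):
    """Return dag_run_obj to trigger the target DAG, or None to skip it."""
    # 'should_trigger' is a hypothetical flag read from the task's params.
    if context["params"].get("should_trigger"):
        # Optional payload made available to the triggered DAG run.
        dag_run_obj.payload = {"reason": "condition met"}
        return dag_run_obj
    return None


# Stand-in for the object Airflow would pass in (an assumption for illustration).
fake_run = SimpleNamespace(run_id="manual__example", payload=None)
print(conditionally_trigger({"params": {"should_trigger": True}}, fake_run) is fake_run)
print(conditionally_trigger({"params": {"should_trigger": False}}, fake_run))
```

<p>In a real DAG, a function like this would be handed to the operator as <code>python_callable=conditionally_trigger</code>, alongside the ID of the DAG to trigger.</p>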

      ]]></content:encoded></item><item><title>Set dependencies between Airflow DAGs with ExternalTaskSensor</title><link>https://geoffruddock.com/how-to-use-external-task-sensor-in-airflow/</link><pubDate>Monday, 21 Jan 2019</pubDate><guid>https://geoffruddock.com/how-to-use-external-task-sensor-in-airflow/</guid><description>&lt;h2 id="problem">Problem &lt;a class="anchor" href="#problem">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>You are an analyst/data engineer/data scientist building a data processing pipeline in Airflow. Last week you wrote a job that performs all the necessary processing to build your &lt;code>sales&lt;/code> table in the database. This week, you are building a &lt;code>customers&lt;/code> table that aggregates data from your previous &lt;code>sales&lt;/code> table.&lt;/p>
&lt;p>Should you add the necessary &lt;code>customers&lt;/code> logic as a new task on the existing DAG, or should you create an entirely new DAG? Since the dependency is only in one direction (tomorrow&amp;rsquo;s &lt;code>sales&lt;/code> data does not depend on today&amp;rsquo;s &lt;code>customers&lt;/code> data) you decide to decouple into two separate DAGs.&lt;/p></description><content:encoded><![CDATA[
        <h2 id="problem">Problem <a class="anchor" href="#problem">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>You are an analyst/data engineer/data scientist building a data processing pipeline in Airflow. Last week you wrote a job that performs all the necessary processing to build your <code>sales</code> table in the database. This week, you are building a <code>customers</code> table that aggregates data from your previous <code>sales</code> table.</p>
<p>Should you add the necessary <code>customers</code> logic as a new task on the existing DAG, or should you create an entirely new DAG? Since the dependency is only in one direction (tomorrow&rsquo;s <code>sales</code> data does not depend on today&rsquo;s <code>customers</code> data) you decide to decouple into two separate DAGs.</p>
<p>But how can you make sure your new DAG waits until the necessary <code>sales</code> data is loaded before starting? Airflow offers rich options for specifying intra-DAG scheduling and dependencies, but it is not immediately obvious how to do so for inter-DAG dependencies.</p>
<p>The duct-tape fix here is to schedule <code>customers</code> to run enough minutes or hours after <code>sales</code> that we can be reasonably confident <code>sales</code> has finished. We can do better though.</p>
<h2 id="solution">Solution <a class="anchor" href="#solution">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Airflow provides an out-of-the-box sensor called <a href="https://airflow.apache.org/docs/apache-airflow/1.10.3/_api/airflow/sensors/external_task_sensor/index.html" target="_blank">ExternalTaskSensor</a> that we can use to model this &ldquo;one-way dependency&rdquo; between two DAGs. Here&rsquo;s what we need to do:</p>
<ol>
<li>Configure <code>dag_A</code> and <code>dag_B</code> to have the same <code>start_date</code> and <code>schedule_interval</code> parameters.</li>
<li>Create an instance of <code>ExternalTaskSensor</code> in <code>dag_B</code> pointing towards a specific task of <code>dag_A</code> and set it as an upstream dependency of the first task(s) in your pipeline.</li>
<li>Initiate dagruns for both DAGs at roughly the same time. <code>dag_B</code> itself will start, but your task sensor will wait until the run of <code>dag_A</code> with the same execution date finishes before allowing the actual tasks to start.</li>
</ol>
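<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">airflow.operators.python_operator</span> <span class="kn">import</span> <span class="n">PythonOperator</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">airflow.sensors.external_task_sensor</span> <span class="kn">import</span> <span class="n">ExternalTaskSensor</span>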
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">&#39;dag_B&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">dag</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="n">wait_for_dag_A</span> <span class="o">=</span> <span class="n">ExternalTaskSensor</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">task_id</span><span class="o">=</span><span class="s1">&#39;wait_for_dag_A&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">external_dag_id</span><span class="o">=</span><span class="s1">&#39;dag_A&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">external_task_id</span><span class="o">=</span><span class="s1">&#39;final_task&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="n">main_task</span> <span class="o">=</span> <span class="n">PythonOperator</span><span class="p">(</span><span class="err">…</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">wait_for_dag_A</span> <span class="o">&gt;&gt;</span> <span class="n">main_task</span>
</span></span></code></pre></div><p>Note: This requires tasks to run in parallel, which is not possible with <code>SequentialExecutor</code>, often the default for a barebones Airflow installation. That setup stores its metadata in an SQLite database, and SQLite does not support concurrent writes. Switching to <code>LocalExecutor</code> enables parallel tasks, but requires a database server (e.g. Postgres) to function.</p>
<p><strong>Update:</strong> I explore some different, possibly better-suited approaches to this problem <a href="https://geoffruddock.com/dependencies-between-dags-in-airflow/">here</a> including <code>SubDagOperator</code> and <code>TriggerDagRunOperator</code>.</p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://www.mikulskibartosz.name/using-sensors-in-airflow/" target="_blank">Dependencies between DAGs: How to wait until another DAG finishes in Airflow?</a> [Bartosz Mikulski]</p>

]]></content:encoded></item><item><title>Thoughts on Blitzstein's Probability course (Harvard Stat 110)</title><link>https://geoffruddock.com/blitzstein-probability-course/</link><pubDate>Friday, 21 Dec 2018</pubDate><guid>https://geoffruddock.com/blitzstein-probability-course/</guid><description>&lt;p>One textbook which is &lt;a href="https://news.ycombinator.com/item?id=18425031" target="_blank">frequently&lt;/a> &lt;a href="https://news.ycombinator.com/item?id=17474646" target="_blank">recommended&lt;/a> on &lt;a href="https://news.ycombinator.com/item?id=12703032" target="_blank">Hacker News&lt;/a> threads about self-study math material is Blitzstein and Hwang&amp;rsquo;s &lt;a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918/" target="_blank">&lt;em>Introduction to Probability&lt;/em>&lt;/a>. Having just recently finished the book, I realized that this is the first textbook I have truly worked through &lt;em>end-to-end&lt;/em> while studying a topic outside a school course. Here are some thoughts on what the book does well, and my (minor) grievances.&lt;/p>
&lt;h2 id="the-good">The good &lt;a class="anchor" href="#the-good">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>There are a few characteristics that make this book particularly attractive for self-study.&lt;/p></description><content:encoded><![CDATA[
        <p>One textbook which is <a href="https://news.ycombinator.com/item?id=18425031" target="_blank">frequently</a> <a href="https://news.ycombinator.com/item?id=17474646" target="_blank">recommended</a> on <a href="https://news.ycombinator.com/item?id=12703032" target="_blank">Hacker News</a> threads about self-study math material is Blitzstein and Hwang&rsquo;s <a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918/" target="_blank"><em>Introduction to Probability</em></a>. Having just recently finished the book, I realized that this is the first textbook I have truly worked through <em>end-to-end</em> while studying a topic outside a school course. Here are some thoughts on what the book does well, and my (minor) grievances.</p>
<h2 id="the-good">The good <a class="anchor" href="#the-good">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There are a few characteristics that make this book particularly attractive for self-study.</p>
<h3 id="access-to-material">Access to material <a class="anchor" href="#access-to-material">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>To start, the book itself is available for free (digital version) and is accompanied by 34 hours of video lectures and a detailed solutions manual covering 8-10 of the exercises at the end of each chapter.</p>
<ul>
<li>You can get a free digital copy of the textbook at <a href="http://probabilitybook.net/" target="_blank">http://probabilitybook.net</a></li>
<li>The YouTube playlist for course lectures is at <a href="https://goo.gl/i7njSb" target="_blank">https://goo.gl/i7njSb</a></li>
<li>There is now an accompanying <a href="https://www.edx.org/course/introduction-to-probability" target="_blank">edX course</a>, although I did not complete this myself.</li>
<li>There is also a useful and thorough <a href="https://www.wzchen.com/probability-cheatsheet/" target="_blank">probability cheatsheet</a> compiled by a past student.</li>
<li>Googling the specific phrasing of many exercises often lands you on a StackOverflow question with discussion around that exact problem pulled from the book.</li>
</ul>
<h3 id="lots-of-exercises-with-solutions">Lots of exercises (with solutions!) <a class="anchor" href="#lots-of-exercises-with-solutions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I did not excel at math during undergrad, and I came to the incorrect conclusion that perhaps I am just not a “math person”. It took me a couple of years and a few hours of <a href="https://www.youtube.com/c/3blue1brown" target="_blank">3Blue1Brown</a> videos to break this mindset, and to realize that much of my earlier difficulty was with learning material which abstracts away concepts too quickly, and which lacks a clear relationship between theory and practical application. So the bar I set for a self-study textbook is quite a bit higher than what the average textbook offers.</p>
<p>Blitzstein&rsquo;s book contains ~600 exercises, and the <a href="https://projects.iq.harvard.edu/stat110/strategic-practice-problems" target="_blank">selected solutions</a> include detailed answers to ~100 of them. I found the number of officially-solved exercises in each chapter to be sufficient to build a deep intuitive understanding of the material. If you want to go even further, you can find many of the non-officially-solved questions answered somewhere on Chegg or Stack Overflow.</p>
<h3 id="focuses-on-building-intuition">Focuses on building intuition <a class="anchor" href="#focuses-on-building-intuition">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A course in statistics will necessarily involve math, but Blitzstein does a good job of prioritizing the role of intuition whenever possible. He frequently employs &ldquo;story proofs&rdquo; to prove concepts or identities using verbal reasoning, rather than formal mathematical proofs. As far as I can tell, this is an approach the authors themselves have pioneered, as I can&rsquo;t find many references to the concept outside this book.</p>
<blockquote>
<p>A story proof is a proof by interpretation. For counting problems, this often means counting the same thing in two different ways, rather than doing tedious algebra. A story proof often avoids messy calculations and goes further than an algebraic proof toward explaining why the result is true. The word “story” has several meanings, some more mathematical than others, but a story proof (in the sense in which we’re using the term) is a fully valid mathematical proof. Here are some examples of story proofs, which also serve as further examples of counting.</p></blockquote>
<p>One example of a powerful story proof is that of Vandermonde&rsquo;s identity, which is an identity used in a few important proofs later in the book.</p>
<blockquote>
<p>Example 1.5.3 (Vandermonde’s identity). A famous relationship between binomial coefficients, called Vandermonde’s identity, says that
$$
{m+n \choose k} = \sum_{j=0}^k {m \choose j} {n \choose k-j}
$$
This identity will come up several times in this book. Trying to prove it with a brute force expansion of all the binomial coefficients would be a nightmare. But a story proves the result elegantly and makes it clear why the identity holds.</p>
<p>Story proof: Consider a group of $m$ men and $n$ women, from which a committee of size $k$ will be chosen. There are ${m+n \choose k}$ possibilities. If there are $j$ men in the committee, then there must be $k-j$ women in the committee. The right-hand side of Vandermonde’s identity sums up the cases for $j$.</p></blockquote>
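<p>The identity can also be sanity-checked numerically. The snippet below is an illustrative sketch, not something taken from the book: the function name <code>vandermonde_rhs</code> and the small ranges tested are arbitrary choices, and it assumes Python 3.8+ for <code>math.comb</code>.</p>

```python
from math import comb


def vandermonde_rhs(m, n, k):
    """Sum over j, the number of men on the committee, of C(m, j) * C(n, k - j)."""
    return sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))


# Compare both sides of the identity for a range of small group sizes.
for m in range(6):
    for n in range(6):
        for k in range(m + n + 1):
            assert vandermonde_rhs(m, n, k) == comb(m + n, k)

print(vandermonde_rhs(5, 7, 4), comb(12, 4))  # prints: 495 495
```

<p>A check like this is no substitute for the proof, but it mirrors the story: each term in the sum counts the committees with exactly $j$ men.</p>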
<p>I find this approach very compelling, because it reduces the &ldquo;barriers to entry&rdquo; of mathematical proofs, letting you use them to test your knowledge without understanding a bunch of math symbols like $\exists, \forall, \in$. It is easy to employ the <a href="https://doist.com/blog/feynman-technique/" target="_blank">Feynman Technique</a> by creating an <a href="https://geoffruddock.com/reflections-on-three-years-of-spaced-repetition-with-anki/">Anki card</a> for the most useful story proofs, and then periodically being prompted to explain the story proof.</p>
<h3 id="clear-relationships-between-related-concepts">Clear relationships between related concepts <a class="anchor" href="#clear-relationships-between-related-concepts">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>At the end of each chapter, the authors reflect on how newly introduced concepts relate to those from previous chapters. Spoiler alert: most probability distributions are related to each other when either <em>conditioning</em> on some event, or when taking the <em>limit</em> as $n \to \infty$. As Professor Blitzstein is fond of saying in the video lectures: &ldquo;<em>Conditioning is the soul of statistics</em>&rdquo;. The book incrementally builds the flowchart below, which we see in its complete form at the end of Chapter 10.</p>
<p><img src="distribution_flowchart.png" alt="Flowchart of probability distributions"></p>
<h3 id="does-not-assume-prior-knowledge">Does not assume prior knowledge <a class="anchor" href="#does-not-assume-prior-knowledge">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A common challenge with using material for self-study is that one&rsquo;s own existing knowledge may not precisely match the known prerequisites of students taking the course which the textbook was written for. Rather than “assuming some calculus knowledge”, university instructors have the luxury of knowing the exact content of prerequisite courses in their own departments, and so can confidently skip reasoning steps which seem too basic to spell out explicitly.</p>
<p>Blitzstein &amp; Hwang clearly go out of their way to decouple the course material from knowledge dependencies as much as possible. You will never hear the phrase <em>It is trivial to prove…</em> or read <em>The proof of this theorem is left as an exercise…</em> in this course. Whenever there is an unavoidable dependency on prior knowledge, the authors make explicit note of this fact, and reference the math appendix. The math appendix itself does a good job of cherry-picking useful prerequisite concepts (such as properties of functions, factorial and gamma functions, Taylor series, and geometric series) and building intuition around them. For example: understanding how to apply a change-of-variables transformation to a multi-dimensional probability density function requires the concept of the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant" target="_blank">Jacobian matrix</a>, which itself requires a bit of multivariate calculus to understand fully. My calculus was too rusty to follow the proof, but since I knew exactly what I was missing, it was straightforward to brush up with a few Khan Academy videos within an hour before continuing with the chapter.</p>
<p>Although the authors do not include a mindmap of concepts in the book, I found this <a href="https://metacademy.org/graphs/concepts/central_limit_theorem#focus=j28yq33k&amp;mode=explore" target="_blank">Metacademy</a> DAG for the Central Limit Theorem (which is presented near the end of the book) to be a good approximation of how earlier concepts build up to concepts in the later chapters. Except for multiple integrals, there is very little dependency on prior math knowledge overall.</p>
<p><img src="concepts_dag.png" alt="DAG of concepts"></p>
<h2 id="the-bad">The bad <a class="anchor" href="#the-bad">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="non-standard-notation-for-some-distributions">Non-standard notation for some distributions <a class="anchor" href="#non-standard-notation-for-some-distributions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>A matter of slight annoyance is that there are a few instances where the authors use their own parameterizations of distributions, which differ from the standard notation found outside of the book. For example, the notation for the <a href="https://en.wikipedia.org/wiki/Gamma_distribution" target="_blank">Gamma distribution</a> and its corresponding PDF is given as:</p>
<p>$$
\begin{aligned}
Y &amp;\sim \text{Gamma}(a, \lambda) \\\
f(y) &amp;= \frac{1}{\Gamma(a)}(\lambda y)^a e^{-\lambda y} \frac{1}{y}
\end{aligned}
$$</p>
<p>Outside of the book, there are two typical parameterizations of the Gamma distribution: shape–scale or shape–rate. The shape–rate parameterization most closely matches the one we find in the book.</p>
<p>$$
\begin{aligned}
X &amp;\sim \text{Ga}(\alpha, \beta) \\\
f(x) &amp;= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}
\end{aligned}
$$</p>
<p>Why are they different? Well, I&rsquo;m sure that it is because the authors think their parameterization is more intuitive. If Wikipedia can&rsquo;t even agree on a single standard parameterization, why not introduce a third? I have mixed feelings here, because their choice of parameter $\lambda$ instead of $\beta$ actually <em>is</em> more intuitive, as it makes it more obvious how the Gamma distribution is closely related to the Exponential distribution, which shares the same rate parameter $\lambda$. But the form of the PDF makes it a bit tricky when referencing outside material alongside the textbook itself.</p>
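<p>To line the two forms up, it helps to expand the book&rsquo;s PDF. Reading $a$ as the shape $\alpha$ and $\lambda$ as the rate $\beta$ (this mapping is an inference from the two definitions above, not a quote from the book), a short rearrangement suggests they describe the same density:</p>
<p>$$
\begin{aligned}
f(y) &amp;= \frac{1}{\Gamma(a)}(\lambda y)^a e^{-\lambda y} \frac{1}{y} \\\
&amp;= \frac{\lambda^a}{\Gamma(a)} y^{a-1} e^{-\lambda y}
\end{aligned}
$$</p>
<p>So the difference is purely in how the terms are grouped: the book keeps $\lambda y$ together, while the shape–rate form pulls $\lambda^a$ out front as the normalizing constant.</p>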
<p>Another example of non-standard notation is around the presentation of the <a href="https://en.wikipedia.org/wiki/Geometric_distribution" target="_blank">Geometric distribution</a>. Outside of the course, this can refer either to the distribution of the number of Bernoulli <em>trials</em> up to and including the first success (with support $\{1, 2, 3, \ldots\}$), or to the number of <em>failures</em> before the first success (with support $\{0, 1, 2, \ldots\}$). Blitzstein refers to the former as the <em>First Success</em> distribution, and the latter as the <em>Geometric</em> distribution. It is difficult to complain about not adhering to a common definition when the common definition itself is ambiguous. But this is something to keep in mind when referencing outside resources.</p>
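<p>The relationship between the two conventions is easy to see in code. This is a minimal sketch using only plain Python; the function names <code>geometric_pmf</code> and <code>first_success_pmf</code> are made up for illustration, and $p = 0.25$ is an arbitrary choice.</p>

```python
def geometric_pmf(k, p):
    """P(X = k) where X counts failures before the first success; k = 0, 1, 2, ..."""
    return (1 - p) ** k * p


def first_success_pmf(k, p):
    """P(Y = k) where Y counts trials including the first success; k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p


p = 0.25
# Y = X + 1, so shifting the support by one maps one PMF onto the other.
for k in range(10):
    assert first_success_pmf(k + 1, p) == geometric_pmf(k, p)

# The means differ by exactly one: (1 - p) / p versus 1 / p.
print((1 - p) / p, 1 / p)  # prints: 3.0 4.0
```

<p>When an outside resource quotes a mean or a PMF for "the" Geometric distribution, checking which support it uses tells you which of the two functions it corresponds to.</p>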
<h2 id="the-ugly">The ugly <a class="anchor" href="#the-ugly">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There is no ugly. I struggled to even think of the above complaint about non-standard notation. This book is a gem, and I highly recommend it to anyone considering self-studying probability. In an ever-evolving data science field, it is difficult to predict which model or framework will be in vogue next year. But I think it&rsquo;s a solid bet that probability will continue to play a central role in the field, and so there is a high ROI on investing the time to develop a solid understanding of the fundamental concepts this book covers.</p>

]]></content:encoded></item><item><title>Abridged: David Foster Wallace's “This Is Water”</title><link>https://geoffruddock.com/this-is-water-abridged/</link><pubDate>Thursday, 18 Oct 2018</pubDate><guid>https://geoffruddock.com/this-is-water-abridged/</guid><description>&lt;p>&lt;a href="https://en.wikipedia.org/wiki/This_Is_Water" target="_blank">&lt;em>This is Water&lt;/em>&lt;/a> is a 22-minute commencement speech given by David Foster Wallace at Kenyon College in 2005, which was later adapted into a short book. It is difficult to overstate how powerful it is. Even after listening to this speech countless times, it never fails to send a shiver down my spine.&lt;/p>
&lt;blockquote>
&lt;p>If you’re automatically sure that you know what reality is, and you are operating on your default setting, then you, like me, probably won’t consider possibilities that aren’t annoying and miserable. But if you really learn how to pay attention, then you will know there are other options. It will actually be within your power to experience a crowded, hot, slow, consumer-hell type situation as not only meaningful, but sacred, on fire with the same force that made the stars: love, fellowship, the mystical oneness of all things deep down.&lt;/p></description><content:encoded><![CDATA[
        <p><a href="https://en.wikipedia.org/wiki/This_Is_Water" target="_blank"><em>This is Water</em></a> is a 22-minute commencement speech given by David Foster Wallace at Kenyon College in 2005, which was later adapted into a short book. It is difficult to overstate how powerful it is. Even after listening to this speech countless times, it never fails to send a shiver down my spine.</p>
<blockquote>
<p>If you’re automatically sure that you know what reality is, and you are operating on your default setting, then you, like me, probably won’t consider possibilities that aren’t annoying and miserable. But if you really learn how to pay attention, then you will know there are other options. It will actually be within your power to experience a crowded, hot, slow, consumer-hell type situation as not only meaningful, but sacred, on fire with the same force that made the stars: love, fellowship, the mystical oneness of all things deep down.</p></blockquote>
<p>I originally discovered the speech through <a href="https://fs.blog/2012/04/david-foster-wallace-this-is-water/" target="_blank">an article on Farnam Street</a>, which includes audio and a full text transcript of the speech. The original version is a bit long though. It also includes some parts tailored to the graduating class which I find detract from the overall power of the primary message. I found <a href="https://jamesclear.com/great-speeches/this-is-water-by-david-foster-wallace" target="_blank">this abridged transcript</a> from James Clear, which is better, but it covers only the text, not the audio itself.</p>
<p>In the interest of periodically forcing myself out of this default mode of thinking, I created a recurring calendar event to re-listen to the speech every six months. The transcript is good, but it doesn&rsquo;t quite &ldquo;click&rdquo; as well as when I listen to the audio version, delivered by DFW himself. So I pulled the original audio into Audacity and edited it to roughly match the abridged transcript. You can download this abridged audio version <a href="this_is_water.mp3">here</a>.</p>

      ]]></content:encoded></item><item><title>DIY insulated sous-vide container with $10 of IKEA parts</title><link>https://geoffruddock.com/diy-ikea-sous-vide-container/</link><pubDate>Saturday, 01 Sep 2018</pubDate><guid>https://geoffruddock.com/diy-ikea-sous-vide-container/</guid><description>&lt;p>Since acquiring an Anova sous-vide cooker, it has become an essential component of my weekly cooking routine. Their &lt;a href="https://anovaculinary.com/anova-precision-cooker/" target="_blank">marketing materials&lt;/a> show the device being used in any large pot you probably already have. This is fine for occasional use, but since I use the device frequently I started looking for a dedicated vessel.&lt;/p>
&lt;p>A dedicated vessel also lets you cook a larger quantity of food, or something awkwardly large like a rack of ribs. You can buy a &lt;a href="https://anovaculinary.com/en-ca/products/anova-precision-16l-container" target="_blank">pre-built container&lt;/a> but it costs $70 and is not insulated. So I decided to build a simple dedicated container that was semi-insulated, so that it would be energy efficient when cooking ribs for 48 hours.&lt;/p></description><content:encoded><![CDATA[
        <p>Since acquiring an Anova sous-vide cooker, it has become an essential component of my weekly cooking routine. Their <a href="https://anovaculinary.com/anova-precision-cooker/" target="_blank">marketing materials</a> show the device being used in any large pot you probably already have.  This is fine for occasional use, but since I use the device frequently I started looking for a dedicated vessel.</p>
<p>A dedicated vessel also lets you cook a larger quantity of food, or something awkwardly large like a rack of ribs. You can buy a <a href="https://anovaculinary.com/en-ca/products/anova-precision-16l-container" target="_blank">pre-built container</a> but it costs $70 and is not insulated. So I decided to build a simple dedicated container that was semi-insulated, so that it would be energy efficient when cooking ribs for 48 hours.</p>
<h2 id="what-youll-need">What you&rsquo;ll need <a class="anchor" href="#what-youll-need">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="consumable-supplies">Consumable supplies <a class="anchor" href="#consumable-supplies">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><table>
  <thead>
      <tr>
          <th>Item</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.ikea.com/us/en/p/samla-box-with-lid-clear-s39885645/" target="_blank">SAMLA box</a> (with lid) from IKEA</td>
          <td>$3</td>
      </tr>
      <tr>
          <td><a href="https://www.ikea.com/us/en/p/vitmossa-throw-gray-90304889/" target="_blank">Cheap blanket</a> (x2)</td>
          <td>$4</td>
      </tr>
      <tr>
          <td><a href="https://www.amazon.com/Photography-Backdrop-Woodworking-Backdrops-Background/dp/B07XLLT3NF" target="_blank">Small plastic clamps</a> (x2)</td>
          <td>$2</td>
      </tr>
      <tr>
          <td>Hot glue</td>
          <td>$1</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>$10</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="tools">Tools <a class="anchor" href="#tools">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><ul>
<li>A reasonably powerful power drill</li>
<li>60mm hole saw bit – You&rsquo;ll want this to match the diameter of your sous-vide unit as closely as possible so that it fits in snugly. My Anova unit (original version) needed a ~62mm hole, so I used a 60mm bit and sanded the hole out until it fit. If you are using a newer version, check the diameter of your unit.</li>
<li>Hacksaw – To make cuts between the circular hole and the edge of the plastic lid.</li>
<li>Sandpaper – To smooth any uneven plastic edges.</li>
</ul>
<h2 id="how-to-build-it">How to build it <a class="anchor" href="#how-to-build-it">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="glue-a-support-to-the-box">Glue a support to the box <a class="anchor" href="#glue-a-support-to-the-box">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The lip of the SAMLA box is a bit flimsy, so you&rsquo;ll probably want to glue something on to reinforce it. I used a translucent plastic bag clip that was nearby in the kitchen. You&rsquo;ll want to do this first, since this affects the position of the clamp, which determines where you should drill the hole in the lid.</p>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/clamp_hu_939e5572b8e1d68a.jpg 480w,
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/clamp_hu_c919e0d6cca25937.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-ikea-sous-vide-container/clamp_hu_c919e0d6cca25937.jpg"
                
    
            
                alt="Glue a support for the clamp to attach to" width="400"/> 
    </figure>
<h3 id="cut-a-hole-in-the-lid">Cut a hole in the lid <a class="anchor" href="#cut-a-hole-in-the-lid">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><ol>
<li>Attach the sous-vide clamp to the container and mark the center of the hole on the plastic lid.</li>
<li>Attach the lid firmly to the box, and use the hole saw attachment on a drill to make a cut at the marked point.</li>
<li>Use the hacksaw to extend the cuts from both sides of the hole to the edge of the lid.</li>
<li>Use sandpaper to slightly enlarge the hole until it fits your Anova device, and also to smooth off any rough edges from the hacksaw.</li>
</ol>

    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/lid_hu_87b3cc9274599c79.jpg 480w,
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/lid_hu_aa99ca016c27315f.jpg 800w,
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-ikea-sous-vide-container/lid_hu_aa99ca016c27315f.jpg"
                
    
            
                alt="Cut a hole in the lid" width="400"/> 
    </figure>
<h3 id="wrap-with-towels">Wrap with towels <a class="anchor" href="#wrap-with-towels">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><ol>
<li>Fold the towels into long thin strips.</li>
<li>Wrap one snugly around the sides and use plastic clamps to attach it at the back.</li>
<li>Fold the other in such a way that it covers the entire lid, tucking it around the sous-vide clamp.</li>
</ol>
<p>I tried two ways of folding the top blanket, shown below. A simple square fold (left) looks nicer, but it leaves the lid around the sous-vide unit itself uninsulated. Folding a long thin towel (right) to cover the entire lid looks a bit messier, but is more energy efficient (by ~10 watts).</p>
<div id="multi-fig-outer">
    <div id="multi-fig-inner">
        


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/wrapped_1_hu_53be60a93706e90d.jpg 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-ikea-sous-vide-container/wrapped_1.jpg"
                
    
            
                alt="Easier"/> <figcaption>
                <p>Easier</p>
            </figcaption>
    </figure>


    
    
    
    
    
    
    
    <figure>
        <img loading="lazy"
            
                sizes="(min-width: 35em) 1200px, 100vw"
                  
                srcset='
                
                       https://geoffruddock.com/diy-ikea-sous-vide-container/wrapped_2_hu_fb5837cbadcab527.jpg 480w,
                
                       
                
                       
                
                       
                '
    
                
                
                src="https://geoffruddock.com/diy-ikea-sous-vide-container/wrapped_2.jpg"
                
    
            
                alt="More efficient (slightly)"/> <figcaption>
                <p>More efficient (slightly)</p>
            </figcaption>
    </figure>


        
    </div>
</div>

<style>

    #multi-fig-outer {
        text-align: center;
    }

    #multi-fig-inner {
        display: inline-block;
    }

    #multi-fig-inner > figure {
        display: inline-block;
        width: auto;
        margin: 0;
    }

    #multi-fig-inner > figure > img {
        max-height: 500px
    }

</style>
<h3 id="for-even-larger-cooks">For even larger cooks <a class="anchor" href="#for-even-larger-cooks">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If you are cooking something even larger (or a large quantity for a dinner party) you can also buy the larger <a href="https://www.ikea.com/us/en/p/samla-box-clear-80102976/" target="_blank">6-gallon SAMLA box</a> which uses the same size lid.</p>
<h3 id="update-2020-05-01">Update [2020-05-01]: <a class="anchor" href="#update-2020-05-01">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If I were doing this again from scratch, I would probably opt for the newer <a href="https://www.ikea.com/gb/en/p/ikea-365-food-container-with-lid-rectangular-plastic-s79276760/" target="_blank">IKEA 365+</a> tupperware containers. The lid is more securely attached than the SAMLA, so you could do just the hole saw cut and let the sous-vide unit hang down from the lid. This would save you from doing the hacksaw cut to accommodate the Anova clamp, which was annoying to do and probably results in worse energy efficiency.</p>
<h2 id="energy-efficiency">Energy efficiency <a class="anchor" href="#energy-efficiency">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I ran a series of tests using a <a href="https://www.amazon.com/TP-Link-HS100-Required-Google-Assistant/dp/B0178IC734" target="_blank">TP-Link Kasa Smart Plug</a> to measure energy expenditure. For each test, I brought 7L of water up to 66°C and then started measuring after the water reached temperature.</p>
<p>Wrapping the container in blankets is a bit annoying (and ugly) but it uses less than half the energy of the unwrapped container. If you are doing long multi-day cooks, it is definitely worth making sure your container is insulated. For example, at the average draw measured below, a 48-hour cook without towels would take about 7.1 kWh (148 W × 48 h), whereas it would only take about 3.0 kWh (63 W × 48 h) with the towels.</p>
<h3 id="ikea-energy-tests-bare">IKEA energy tests (bare) <a class="anchor" href="#ikea-energy-tests-bare">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><table>
  <thead>
      <tr>
          <th><strong>Hours</strong></th>
          <th><strong>Energy (kWh)</strong></th>
          <th><strong>Watts</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>13</td>
          <td>1.93</td>
          <td>148</td>
      </tr>
      <tr>
          <td>12</td>
          <td>1.75</td>
          <td>146</td>
      </tr>
      <tr>
          <td>9</td>
          <td>1.36</td>
          <td>151</td>
      </tr>
      <tr>
          <td>—</td>
          <td><strong>Average</strong></td>
          <td><strong>148</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="ikea-energy-tests-wrapped">IKEA energy tests (wrapped) <a class="anchor" href="#ikea-energy-tests-wrapped">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><table>
  <thead>
      <tr>
          <th><strong>Hours</strong></th>
          <th><strong>Energy (kWh)</strong></th>
          <th><strong>Watts</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>9.5</td>
          <td>0.66</td>
          <td>70</td>
      </tr>
      <tr>
          <td>7.5</td>
          <td>0.43</td>
          <td>57</td>
      </tr>
      <tr>
          <td>6</td>
          <td>0.38</td>
          <td>63</td>
      </tr>
      <tr>
          <td>—</td>
          <td><strong>Average</strong></td>
          <td><strong>63</strong></td>
      </tr>
  </tbody>
</table>

      ]]></content:encoded></item><item><title>Redshift function of the week: RATIO_TO_REPORT</title><link>https://geoffruddock.com/ratio-to-report-window-function-in-redshift/</link><pubDate>Sunday, 10 Jun 2018</pubDate><guid>https://geoffruddock.com/ratio-to-report-window-function-in-redshift/</guid><description>&lt;p>A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Quite often this arises as an intermediate need while working towards some other primary task.&lt;/p>
&lt;p>Let&amp;rsquo;s work with a simple example: you&amp;rsquo;d like to get a rough sense of how many of your company&amp;rsquo;s orders come from each country. So you write the following query,&lt;/p></description><content:encoded><![CDATA[
        <p>A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Quite often this arises as an intermediate need while working towards some other primary task.</p>
<p>Let&rsquo;s work with a simple example: you&rsquo;d like to get a rough sense of how many of your company&rsquo;s orders come from each country. So you write the following query,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">country</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
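</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w">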
</span></span></span></code></pre></div><p>And you get the following result back in your SQL client,</p>
<table>
  <thead>
      <tr>
          <th>country</th>
          <th>num_orders</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USA</td>
          <td>21264505</td>
      </tr>
      <tr>
          <td>Canada</td>
          <td>6408593</td>
      </tr>
      <tr>
          <td>Mexico</td>
          <td>2208305</td>
      </tr>
  </tbody>
</table>
<p>This is kind of difficult to read. You can immediately discern that USA has the most orders, but it&rsquo;s tough to eyeball proportions here without counting digits and performing some rough mental math. So you refine your query to tell you what you are actually interested in: the relative proportion of orders between countries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">country_totals</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="n">country</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">country</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="n">num_orders</span><span class="w"> </span><span class="p">::</span><span class="nb">NUMERIC</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">num_orders</span><span class="p">)</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">()</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pct_orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">country_totals</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span></code></pre></div><table>
  <thead>
      <tr>
          <th>country</th>
          <th>pct_orders</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USA</td>
          <td>0.71169602953464</td>
      </tr>
      <tr>
          <td>Canada</td>
          <td>0.21443048342874</td>
      </tr>
      <tr>
          <td>Mexico</td>
          <td>0.07395810292746</td>
      </tr>
  </tbody>
</table>
<p>This is better. It answers your question, and you can go back to your main task. But is it optimal? It took you 12 lines to answer a relatively simple question. Hopefully you didn&rsquo;t write these from scratch, but even if you pasted a snippet, it&rsquo;s not particularly easy to read if you or someone else needs to refer back to this query in the future.</p>
<p>Can we do better?</p>
<h2 id="enter-the-ratio_to_report-function">Enter the RATIO_TO_REPORT function <a class="anchor" href="#enter-the-ratio_to_report-function">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Amazon Redshift provides an off-the-shelf window function called <a href="https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RATIO_TO_REPORT.html" target="_blank">ratio_to_report</a> which basically solves what we are trying to accomplish. You can use it as follows,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">country</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="n">RATIO_TO_REPORT</span><span class="p">(</span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">))</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">()</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pct_orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span></code></pre></div><p>We can reason through this query as follows: the <code>GROUP BY</code> operation totals orders for each country, and then the <code>RATIO_TO_REPORT</code> function is called on the already-grouped rows, dividing each by their grand total. Running this query gives us the exact same output as the previous query, but with half the lines of code, and it is easier to read.</p>
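<p>To make that reasoning concrete, here is a sketch of the same calculation spelled out by hand, assuming the <code>orders</code> table has one row per order. Because window functions are evaluated after <code>GROUP BY</code>, <code>SUM(COUNT(*)) OVER ()</code> adds up the per-country counts into a grand total, which is the division that <code>RATIO_TO_REPORT</code> performs for us. The number of decimal places returned may differ depending on the cast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hand-written equivalent of RATIO_TO_REPORT(COUNT(*)) OVER ()
SELECT
    country
  , COUNT(*) ::NUMERIC / SUM(COUNT(*)) OVER () AS pct_orders
FROM orders
GROUP BY 1
ORDER BY 2 DESC
</code></pre></div><p>This form is also a reasonable fallback on SQL engines that do not provide <code>RATIO_TO_REPORT</code>.</p>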
<p>As a final step, we can clean up the output even further by rounding the percentage to a meaningful precision, so that our eyes don&rsquo;t spend an extra 100ms parsing the fact that it is a fraction when we look at the table.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">country</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="n">ROUND</span><span class="p">(</span><span class="n">RATIO_TO_REPORT</span><span class="p">(</span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">))</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(),</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">pct_orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span></code></pre></div><table>
  <thead>
      <tr>
          <th>country</th>
          <th>pct_orders</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USA</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Canada</td>
          <td>0.21</td>
      </tr>
      <tr>
          <td>Mexico</td>
          <td>0.07</td>
      </tr>
  </tbody>
</table>
<p>Boom! We&rsquo;ve arrived at a short query which gives us a clean result that answers our underlying question. This is a trivial scenario, but the sort that one encounters daily, and so taking the optimal approach pays off in long-run efficiency.</p>
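<p>The same function also accepts a <code>PARTITION BY</code> clause inside <code>OVER</code>, which is handy when you want proportions within each group rather than of the grand total. As a sketch, assuming the <code>orders</code> table also had a <code>device</code> column (a hypothetical column, not part of the example above), this would return each device&rsquo;s share of orders within its country:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- "device" is a hypothetical column used only for illustration
SELECT
    country
  , device
  , ROUND(RATIO_TO_REPORT(COUNT(*)) OVER (PARTITION BY country), 2) AS pct_orders_in_country
FROM orders
GROUP BY 1, 2
ORDER BY 1, 3 DESC
</code></pre></div>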
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li>
<p><a href="https://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_RATIO_TO_REPORT_WF.html" target="_blank">RATIO_TO_REPORT Window Function</a> – AWS documentation. Not particularly easy to read, but still a primary resource.</p>
</li>
<li>
<p><a href="https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/" target="_blank">SQL queries don&rsquo;t start with SELECT</a> – It&rsquo;s useful to understand the SQL &ldquo;order of operations&rdquo; when working with queries which combine both <code>GROUP BY</code> aggregation and window functions.</p>
</li>
<li>
<p><a href="https://www.periscopedata.com/blog/calculating-proportional-values" target="_blank">Calculating Proportional Values in SQL</a> – Implementation details for different SQL engines.</p>
</li>
</ul>

      ]]></content:encoded></item><item><title>The hidden costs of poor data quality</title><link>https://geoffruddock.com/hidden-costs-of-poor-data-quality/</link><pubDate>Wednesday, 02 Aug 2017</pubDate><guid>https://geoffruddock.com/hidden-costs-of-poor-data-quality/</guid><description>&lt;p>The phrase &amp;ldquo;data quality&amp;rdquo; is frequently, and often ambiguously, thrown around in many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement.&lt;/p>
&lt;p>We&amp;rsquo;d all love 100% accuracy, but in the era of &lt;a href="https://xkcd.com/1428/" target="_blank">moving fast and breaking things&lt;/a>, don&amp;rsquo;t we &lt;em>want&lt;/em> to sacrifice a little accuracy in the name of speed? After all, isn&amp;rsquo;t it often better to &lt;a href="https://fs.blog/2018/04/reversible-irreversible-decisions/" target="_blank">make fast decisions with imperfect information&lt;/a> and adjust course if necessary at a later point?&lt;/p></description><content:encoded><![CDATA[
        <p>The phrase &ldquo;data quality&rdquo; is frequently, and often ambiguously, thrown around in many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement.</p>
<p>We&rsquo;d all love 100% accuracy, but in the era of <a href="https://xkcd.com/1428/" target="_blank">moving fast and breaking things</a>, don&rsquo;t we <em>want</em> to sacrifice a little accuracy in the name of speed? After all, isn&rsquo;t it often better to <a href="https://fs.blog/2018/04/reversible-irreversible-decisions/" target="_blank">make fast decisions with imperfect information</a> and adjust course if necessary at a later point?</p>
<p>There is certainly a trade-off at play here. The optimal level of quality to aim for is likely less than 100%. But it&rsquo;s probably higher than you think. The speed vs. accuracy trade-off needs to be calculated with all costs considered, not just the most visible and direct ones. For this reason, let&rsquo;s ignore the obvious costs of low-quality data, which are incorrect decisions and lost opportunities to apply machine learning models due to an unacceptable signal-to-noise ratio. Instead, let&rsquo;s focus on the lesser-considered costs of poor data quality.</p>
<h2 id="it-reduces-the-standard-for-derivative-works">It reduces the standard for derivative works <a class="anchor" href="#it-reduces-the-standard-for-derivative-works">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://en.wikipedia.org/wiki/Broken_windows_theory" target="_blank">Broken windows theory</a> is a pop-science concept originating from criminology which states that visible signs of disorder encourage further disorder. When participants in a system are acclimatized to a <em>slightly broken</em> status quo, it becomes normative to produce more <em>slightly broken</em> output.</p>
<p>I suspect this concept applies to data quality as well, in two ways: psychological and practical. Ambient quality issues degrade the shared standard for quality within a team, and the &ldquo;definition of done&rdquo; will slip. For an analysis on a tight timeline, 90% accuracy may be accepted instead of spending the time investigating the root cause of the inaccuracy.</p>
<p>From a practical standpoint, it becomes more difficult to implement data quality checks or perform sanity tests when nothing is black or white, but rather shades of gray. For example, if a tracking event should <em>always</em> have a session-ID associated with it, then it becomes easy to detect when a problem arises, with something as simple as <code>COUNT(*) WHERE session_id IS NULL</code>. But when tracking events <em>usually but not always</em> have a session-ID associated with them, you can&rsquo;t apply simple rules, nor can you identify if a specific data transformation/aggregation/analysis introduces meaningfully more null values. Soon you are in the domain of <a href="https://en.wikipedia.org/wiki/Anomaly_detection" target="_blank">anomaly detection</a>, which is an endeavour in and of itself.</p>
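<p>As a minimal sketch of the two situations, assuming a hypothetical <code>tracking_events</code> table with <code>session_id</code> and <code>event_date</code> columns: the first query is the simple rule, which should always return zero. The second is what you are left with once nulls are &ldquo;normal&rdquo;, a daily rate that someone has to eyeball or feed into an alerting threshold.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Strict rule: any non-zero result signals a tracking problem.
-- (tracking_events is a hypothetical table name.)
SELECT COUNT(*) AS num_missing_session_id
FROM tracking_events
WHERE session_id IS NULL;

-- Fuzzy rule: share of events without a session-ID per day.
-- (event_date is a hypothetical column.)
SELECT
    event_date
  , AVG(CASE WHEN session_id IS NULL THEN 1.0 ELSE 0.0 END) AS pct_missing_session_id
FROM tracking_events
GROUP BY 1
ORDER BY 1;
</code></pre></div>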
<h2 id="it-takes-dramatically-longer-to-reach-a-final-result">It takes dramatically longer to reach a final result <a class="anchor" href="#it-takes-dramatically-longer-to-reach-a-final-result">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet" target="_blank">Anscombe&rsquo;s Quartet</a> is a famous example of why it&rsquo;s not enough to analyze a dataset using descriptive statistics alone. You need to look at the distribution of the underlying data. But what if you <em>do</em> visualize the data during the initial analysis, and now you want to build a simple dashboard for ongoing monitoring? Particularly when working with large datasets, it is much more convenient to apply a server-side aggregate function in SQL such as <code>AVG()</code> than to pull the raw data, filter for outliers, and then perform client-side aggregation in software such as Tableau.</p>
<p>So you draw up a simple query based on <code>AVG()</code> and feed it into a dashboard. This works, until a deployment next month starts sending negative values to the database due to an incorrectly configured timezone on one of the backend servers. This goes unnoticed until an analyst a year from now realizes that the average time-to-purchase is being understated by more than 50%. Oops!</p>
<p>Well, what if we write a more &ldquo;defensive query&rdquo; in the first place, to prevent such an incident from occurring? For example…</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">GREATEST</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">time_to_purchase</span><span class="p">))</span><span class="w">
</span></span></span></code></pre></div><p>It is possible to write such queries in response to specific (and known) quality issues, but doing so for multiple types of potential quality problems leads to clunky and slow queries peppered with <code>SELECT DISTINCT</code> statements that not only take longer to execute, but are more difficult to write, read, and modify in the future.</p>
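<p>To illustrate how quickly this grows, here is a sketch of handling just this one issue a little more carefully, assuming a hypothetical <code>purchases</code> table. Clamping negative values to zero still drags the average down, so this version leaves them out of the average (<code>AVG</code> ignores <code>NULL</code>) and counts them separately so the problem stays visible.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- purchases is a hypothetical table name
SELECT
    -- Negative values become NULL and are skipped by AVG
    AVG(CASE WHEN time_to_purchase &gt;= 0 THEN time_to_purchase END) AS avg_time_to_purchase
    -- Keep a count of the rows that were excluded
  , SUM(CASE WHEN time_to_purchase &lt; 0 THEN 1 ELSE 0 END) AS num_negative_rows
FROM purchases
</code></pre></div><p>That is already two expressions for one metric and one known failure mode.</p>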
<h2 id="it-accumulates-multiplicatively-not-additively">It accumulates multiplicatively, not additively <a class="anchor" href="#it-accumulates-multiplicatively-not-additively">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>A common question posed to analysts is &ldquo;how accurate is this data&rdquo;? This is deceptively difficult to answer. Non-technical stakeholders often conceive of error as an additive quantity, but in reality the effect is often multiplicative.</p>
<p>For example, let&rsquo;s say we are trying to calculate the average number of searches per visit on our website. We&rsquo;ve got two tables for this: <code>visits</code> and <code>searches</code>. To make things interesting, there was a technical bug a few months ago, in which some visitors were assigned a session-ID of <code>undefined</code>.</p>
<p>So we write a query,</p>
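<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">num_searches</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">session_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">search_id</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">num_searches</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="n">visits</span><span class="w"> </span><span class="n">v</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">JOIN</span><span class="w"> </span><span class="n">searches</span><span class="w"> </span><span class="n">s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">ON</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">session_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="p">.</span><span class="n">session_id</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">session_id</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">per_visit</span><span class="w">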
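</span></span></span></code></pre></div><p>The problem here is that every visit with the <code>undefined</code> session-ID matches every search with that same ID, so SQL inadvertently performs a <a href="https://www.w3resource.com/sql/joins/cross-join.php" target="_blank">cross-join</a> on these rows. If 100 visits and 300 searches share the <code>undefined</code> ID, the join produces 30,000 rows for them, giving you a wildly inaccurate result.</p>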
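<p>A cheap way to catch this before it skews a result is to check whether the join key is actually unique on the side where it should be. A sketch, assuming <code>session_id</code> is meant to identify exactly one row in <code>visits</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Any row returned here is a session_id shared by several visits,
-- which will multiply rows when joined against searches.
SELECT session_id, COUNT(*) AS num_visits
FROM visits
GROUP BY 1
HAVING COUNT(*) &gt; 1
ORDER BY 2 DESC
</code></pre></div><p>If the assumption holds, the query returns no rows. With the bug described above, <code>undefined</code> would show up at the top of the list.</p>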
<h2 id="it-undermines-confidence-in-the-analytics-team">It undermines confidence in the analytics team <a class="anchor" href="#it-undermines-confidence-in-the-analytics-team">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Your team has collected and transformed the data, built the model, and performed the analysis. Now it&rsquo;s time to communicate the results to stakeholders.</p>
<p>But often the data presents inconvenient truths, which stakeholders may be reluctant to accept. In order to reduce cognitive dissonance, people often engage in <a href="https://en.wikipedia.org/wiki/Motivated_reasoning" target="_blank">motivated reasoning</a>, questioning the quality of the data, and whether we can actually trust the results.</p>
<p>So even if you&rsquo;ve already paid the inflated costs associated with reaching a meaningful and accurate insight in the face of poor data quality, you may yet face a more difficult task: evangelizing for controversial actions or outcomes based on those results.</p>

      ]]></content:encoded></item><item><title>Essential productivity apps for Mac users</title><link>https://geoffruddock.com/essential-productivity-apps-for-mac-users/</link><pubDate>Monday, 05 Jun 2017</pubDate><guid>https://geoffruddock.com/essential-productivity-apps-for-mac-users/</guid><description>&lt;p>Once a year I try to reevaluate my &amp;ldquo;personal tech stack&amp;rdquo; to see if I am using fundamental tools as effectively as possible. Not just bigger tools such as todo lists, calendars, and note-taking, but also the smaller utility apps that get used so frequently they blend into our daily work routine. Our fluency with the tools we use every day is the foundation of personal productivity&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> , so it makes sense to optimize even small interactions&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> such as switching between windows. With that in mind, here are three key Mac apps that make me a &lt;em>tiny bit more efficient&lt;/em> but do so &lt;em>very frequently&lt;/em>.&lt;/p></description><content:encoded><![CDATA[
        <p>Once a year I try to reevaluate my &ldquo;personal tech stack&rdquo; to see if I am using fundamental tools as effectively as possible. Not just bigger tools such as todo lists, calendars, and note-taking, but also the smaller utility apps that get used so frequently they blend into our daily work routine. Our fluency with the tools we use every day is the foundation of personal productivity<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> , so it makes sense to optimize even small interactions<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> such as switching between windows. With that in mind, here are three key Mac apps that make me a <em>tiny bit more efficient</em> but do so <em>very frequently</em>.</p>
<h2 id="alfred-launcher">Alfred (launcher) <a class="anchor" href="#alfred-launcher">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Alfred is a super-charged replacement for the built-in Spotlight search bar. It gives you nuanced control over search options such as indexing, fuzzy matching, etc. But its real power comes from three other features…</p>
<p><img src="alfred_features.png" alt="List of alfred features"></p>
<p><strong>Clipboard history</strong> – Being able to go back and re-paste copied items from your recent history reduces the number of times you need to switch between windows when copying multiple chunks of content (e.g. title, description, link) from one application or document to another. This incurs fewer context-switching costs on your brain. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p><strong>Text expansion</strong> – I primarily use this to encode symbols that I use frequently, such as Greek letters (α, β, μ, σ, λ). Python 3 allows these Unicode letters to be used in variable names. If you are writing a function which exactly mirrors a mathematical expression, it can be convenient to write it using the actual Greek characters themselves, rather than their Latin names.</p>
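<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># with latin names</span>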
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">log_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">sigma</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">*</span> <span class="n">sigma</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># with actual greek characters</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">log_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">μ</span><span class="p">,</span> <span class="n">σ</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">μ</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">σ</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">*</span> <span class="n">σ</span><span class="p">)</span>
</span></span></code></pre></div><p>You can download my snippets collection of Greek characters <a href="/math_symbols.alfredsnippets">here</a>. I leave out the characters which are confusingly similar to regular Latin characters (e.g. uppercase alpha and beta).</p>
<p>I also use snippets for arrows (▲, ▼, ←, ↔, →), fractions (½, ⅓, ⅒), and short pieces of frequently used SQL code or LaTeX. Alfred lets you specify where to place the cursor after expanding your snippet, which is useful for boilerplate code snippets.</p>
<p><strong>Workflows</strong> – Alfred provides a drag-and-drop GUI for creating pseudo-programming recipes that you can plumb together to achieve surprisingly complex tasks. There are downloadable recipes which provide deep integration into your apps, such as allowing you to search for notes in Evernote from within the Alfred search bar and open them directly via deep-link. That said, my most common workflows are relatively simple:</p>
<ul>
<li>Search Amazon in multiple countries for a product</li>
<li>Type a German word and search multiple dictionary websites and Google Images. Useful for creating flashcards in a foreign language.</li>
<li>Run a short Python script which takes LaTeX-formatted markup from my clipboard (usually copied from Typora) and uses regex to change it to the syntax required by Anki.</li>
<li>Wrap a highlighted word with relevant HTML tags using hotkeys, such as <code>Cmd+Shift+B</code> to <code>&lt;b&gt;bold something&lt;/b&gt;</code> or <code>Cmd+Shift+I</code> to <code>&lt;i&gt;italicize something&lt;/i&gt;</code>.</li>
</ul>
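<p>As a rough illustration of the LaTeX-to-Anki workflow in the list above, here is a minimal sketch of what such a script might look like. It assumes the clipboard text uses <code>$…$</code> and <code>$$…$$</code> delimiters and that the target is Anki&rsquo;s <code>[$]…[/$]</code> and <code>[$$]…[/$$]</code> tags. The function name <code>typora_to_anki</code> is hypothetical, and reading from and writing to the clipboard is left out.</p>

```python
import re

# Try display math ($$...$$) first, then inline math ($...$).
MATH_PATTERN = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def _swap_delimiters(match):
    """Return the matched math wrapped in Anki's LaTeX tags."""
    display, inline = match.groups()
    if display is not None:
        return "[$$]" + display + "[/$$]"
    return "[$]" + inline + "[/$]"


def typora_to_anki(text):
    """Rewrite $-delimited math into Anki's tag syntax (hypothetical helper)."""
    return MATH_PATTERN.sub(_swap_delimiters, text)


print(typora_to_anki("Let $x$ be real, then $$x^2 + 1$$ is positive."))
```

<p>A single pattern with two alternatives avoids having to run two passes and worry about the second pass matching the dollar signs emitted by the first.</p>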
<h2 id="lightshot-screenshots">Lightshot (screenshots) <a class="anchor" href="#lightshot-screenshots">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I make heavy use of screenshots as a communication medium in a work context. As remote work increasingly becomes the norm, we must adapt our communication styles to match. Certain things are just easier to <em>show</em> to someone than to <em>describe</em> verbally. But with less opportunity to meet face-to-face it becomes more difficult to do so. A good screenshot tool reduces the friction of communicating visually in an asynchronous context. I&rsquo;m sure there are multiple good screenshot tools out there, but I like <a href="https://app.prntscr.com/en/index.html" target="_blank">Lightshot</a> because it allows me to snip a selection of my screen, apply some basic annotations, and copy to my clipboard within the span of 1-2 seconds. I can then immediately paste the image into a Slack window or email thread.</p>
<p><img src="lightshot_example.png" alt="Example of using Lightshot"></p>
<h2 id="magnet-window-management">Magnet (window management) <a class="anchor" href="#magnet-window-management">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><a href="https://magnet.crowdcafe.com/" target="_blank">Magnet</a> lets you snap windows to one half of your screen by dragging them to any edge or corner, or using a set of hotkeys. This is similar to the built-in window management functionality in Windows, which I sorely missed when I switched operating systems five years ago.</p>
<p>It&rsquo;s worth spending a few minutes familiarizing yourself with the hotkeys, because you&rsquo;ll use them hundreds of times per week. Being nimble with window management lets you make more effective use of your monitor space, and minimizes the literal <a href="https://blog.rescuetime.com/context-switching/" target="_blank">context-switching cost</a> you incur on your brain when switching your gaze between monitors or even between Mac&rsquo;s built-in Spaces.</p>
<p><img src="magnet.png" alt="Hero image for Magnet tool"></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>See Tiago Forte, <a href="https://praxis.fortelabs.co/the-digital-productivity-pyramid/" target="_blank">The Digital Productivity Pyramid</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Relevant xkcd: <a href="https://xkcd.com/1205/" target="_blank">Is it worth the time?</a>&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p><a href="https://en.wikipedia.org/wiki/Task_switching_%28psychology%29" target="_blank">https://en.wikipedia.org/wiki/Task_switching_(psychology)</a>&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>

      ]]></content:encoded></item><item><title>Jupyter Notebooks for Interactive SQL Exploration</title><link>https://geoffruddock.com/interactive-sql-exploration-with-jupyter-notebooks/</link><pubDate>Sunday, 16 Apr 2017</pubDate><guid>https://geoffruddock.com/interactive-sql-exploration-with-jupyter-notebooks/</guid><description>&lt;p>I&amp;rsquo;m always hesitant to tell people that I work as a data scientist. Partially because it&amp;rsquo;s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title &amp;ldquo;scientist&amp;rdquo; to describe work which does not necessarily involve the scientific method.&lt;/p>
&lt;blockquote>
&lt;p>Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study. To call oneself a “data scientist” makes no sense. One cannot study data in general. One can only study data about something in particular.&lt;/p></description><content:encoded><![CDATA[
        <p>I&rsquo;m always hesitant to tell people that I work as a data scientist. Partially because it&rsquo;s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title &ldquo;scientist&rdquo; to describe work which does not necessarily involve the scientific method.</p>
<blockquote>
<p>Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study. To call oneself a “data scientist” makes no sense. One cannot study data in general. One can only study data about something in particular.</p></blockquote>
<div style="text-align: right">— <a href="https://www.perceptualedge.com/blog/?p=2560">There Is No Science of Data</a></div><br>
<p>So it&rsquo;s always nice when I find an opportunity to borrow a concept or practice from actual science and apply it in my day-to-day. One of my favourites is the practice of <a href="https://www.sciencemag.org/careers/2019/09/how-keep-lab-notebook" target="_blank">keeping a lab notebook</a> with commentary and supplementary details around the meandering path taken towards a final result.</p>
<p>Jupyter notebooks and R Markdown are two common tools that make it easy to intermingle code and analysis (as markdown) in a way that allows you to elucidate your thought process along a particular path.</p>
<p>But I have always felt a bit frustrated that there is not a similar tool for SQL. I try to get out of SQL and into python as soon as possible, but sometimes it is inevitable. On occasion, while writing a query to pull a starting dataset for some sort of analysis in pandas, I find myself troubleshooting something like missing or duplicate records in SQL. Usually this involves executing a sequence of simple queries against various tables in the database to narrow down the source of the problem, often using the output of one as input into another query. For example:</p>
<ol>
<li>Pull a single order-ID which is missing from my dataset</li>
<li>Query the orders table for that order-ID to find the corresponding customer-ID</li>
<li>Query the customers table for that customer-ID to find device data</li>
<li>…</li>
</ol>
<h2 id="what-i-did-before">What I did before <a class="anchor" href="#what-i-did-before">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>I generally prefer writing SQL queries in my IDE (PyCharm) which provides a number of useful features including auto-completion of column and table names, along with warnings that appear for typos, etc.</p>
<p>Usually I will add comments above queries as I go along using the <code>-- comment</code> syntax and at the end of the chain of queries I may copy/paste everything into a .sql file to save somewhere in case I need to run through that specific chain of troubleshooting steps again.</p>
<h2 id="enter-jupyter-w-sql-magics">Enter Jupyter w/ SQL magics <a class="anchor" href="#enter-jupyter-w-sql-magics">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>There is a neat Jupyter extension called <a href="https://github.com/catherinedevlin/ipython-sql" target="_blank">ipython-sql</a> that adds an <code>%%sql</code> magic command to your Jupyter notebooks. <a href="https://jakevdp.github.io/PythonDataScienceHandbook/01.03-magic-commands.html" target="_blank">Magic commands</a> are special non-Python commands starting with <code>%</code> which, when run from a notebook cell, add some sort of additional functionality.</p>
<p>Prefixing a code cell with <code>%%sql</code> will let you execute the SQL code below against your database, and return the result below. It even applies syntax highlighting to your SQL, making it more readable.</p>
<h2 id="how-to-use">How to use <a class="anchor" href="#how-to-use">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>First thing we need to do is install the extension,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">! pip install ipython-sql
</span></span></code></pre></div><p>Next, we need to load the extension and create a connection with your database,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">os</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Use reload_ext instead of load_ext to avoid a message on re-running the cell.</span>
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">reload_ext</span> <span class="n">sql</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Return a pandas DataFrame instead of an SQL ResultSet.</span>
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">config</span> <span class="n">SqlMagic</span><span class="o">.</span><span class="n">autopandas</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Provide the SQLAlchemy connection URL template for your database.</span>
</span></span><span class="line"><span class="cl"><span class="n">redshift_str_template</span> <span class="o">=</span> <span class="s1">&#39;postgresql://</span><span class="si">{user}</span><span class="s1">:</span><span class="si">{pwd}</span><span class="s1">@</span><span class="si">{host}</span><span class="s1">:</span><span class="si">{port}</span><span class="s1">/</span><span class="si">{db}</span><span class="s1">&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Fill in the string with your credentials, stored in environment variables.</span>
</span></span><span class="line"><span class="cl"><span class="n">connect_str</span> <span class="o">=</span> <span class="n">redshift_str_template</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">user</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;REDSHIFT_USERNAME&#39;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">pwd</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;REDSHIFT_PASSWORD&#39;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">host</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;REDSHIFT_HOST&#39;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">port</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;REDSHIFT_PORT&#39;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">db</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;REDSHIFT_DB&#39;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Open a connection to your database</span>
</span></span><span class="line"><span class="cl"><span class="o">%</span><span class="n">sql</span> <span class="err">$</span><span class="n">connect_str</span>
</span></span></code></pre></div><p>The code above assumes that you are using Amazon Redshift as a database, and that your credentials are stored in environment variables. If this is not the case, you can replace the <code>os.environ[]</code> calls with strings, but be careful not to commit your notebook to a shared repository with plaintext credentials.</p>
<p>Now we can run an SQL query in our notebook,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%%</span><span class="n">sql</span>
</span></span><span class="line"><span class="cl"><span class="n">SELECT</span> <span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">FROM</span> <span class="n">sales</span>
</span></span><span class="line"><span class="cl"><span class="n">WHERE</span> <span class="n">ts</span> <span class="o">&gt;</span> <span class="n">CURRENT_DATE</span> <span class="o">-</span> <span class="n">interval</span> <span class="s1">&#39;7 days&#39;</span>
</span></span></code></pre></div><p>Another cool feature is the ability to save the output to a variable,</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%%</span><span class="n">sql</span> <span class="n">num_sales</span> <span class="o">&lt;&lt;</span>
</span></span><span class="line"><span class="cl"><span class="n">SELECT</span> <span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">FROM</span> <span class="n">sales</span>
</span></span><span class="line"><span class="cl"><span class="n">WHERE</span> <span class="n">ts</span> <span class="o">&gt;</span> <span class="n">CURRENT_DATE</span> <span class="o">-</span> <span class="n">interval</span> <span class="s1">&#39;7 days&#39;</span>
</span></span></code></pre></div><p>It works in reverse too, so you can feed a Python variable such as <code>N_DAYS = 7</code> back into a query by referencing it with a leading <code>:</code> in your SQL. Here the parameter is subtracted from the date directly, rather than embedded inside a quoted interval literal.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">%%</span><span class="n">sql</span> <span class="n">num_sales</span> <span class="o">&lt;&lt;</span>
</span></span><span class="line"><span class="cl"><span class="n">SELECT</span> <span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">FROM</span> <span class="n">sales</span>
</span></span><span class="line"><span class="cl"><span class="n">WHERE</span> <span class="n">ts</span> <span class="o">&gt;</span> <span class="n">CURRENT_DATE</span> <span class="o">-</span> <span class="p">:</span><span class="n">N_DAYS</span>
</span></span></code></pre></div><p>Using these two features together, it is possible to write a notebook which performs a sequence of debugging steps, with each query taking a dynamic value from the previous output. You can then save this notebook, and easily re-run the same troubleshooting steps on fresh data when the problem arises in the future.</p>
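<p>To make the idea of chained queries concrete outside of a notebook, here is a minimal sketch using Python&rsquo;s built-in <code>sqlite3</code> module and an in-memory database. The <code>orders</code> and <code>customers</code> tables, their columns, and the sample rows are all hypothetical stand-ins for a real warehouse. In a notebook, each query would instead live in its own <code>%%sql</code> cell.</p>

```python
import sqlite3

# Hypothetical schema and rows, standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, device TEXT);
    INSERT INTO orders VALUES (101, 7), (102, 8);
    INSERT INTO customers VALUES (7, 'ios'), (8, 'android');
    """
)

# Step 1: start from an order ID that is missing from the dataset.
missing_order_id = 102

# Step 2: look up the customer for that order, using a named bind parameter.
(customer_id,) = conn.execute(
    "SELECT customer_id FROM orders WHERE order_id = :order_id",
    {"order_id": missing_order_id},
).fetchone()

# Step 3: feed the previous result into the next query.
(device,) = conn.execute(
    "SELECT device FROM customers WHERE customer_id = :customer_id",
    {"customer_id": customer_id},
).fetchone()

print(customer_id, device)
```

<p>Each step only depends on the value produced by the step before it, which is what makes the sequence easy to re-run later with a different starting order-ID.</p>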
<h2 id="couldnt-i-achieve-the-same-thing-with-jinja-and-psycopg2">Couldn&rsquo;t I achieve the same thing with jinja and psycopg2? <a class="anchor" href="#couldnt-i-achieve-the-same-thing-with-jinja-and-psycopg2">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Theoretically we could write queries into string variables in a Jupyter notebook and run them using <code>psycopg2</code> or pandas, but this always felt too clunky to be usable. The above approach almost entirely removes the friction and boilerplate code, while also giving us the benefit of syntax highlighting.</p>
<h2 id="further-reading">Further reading <a class="anchor" href="#further-reading">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><ul>
<li><a href="https://towardsdatascience.com/jupyter-magics-with-sql-921370099589" target="_blank">https://towardsdatascience.com/jupyter-magics-with-sql-921370099589</a></li>
<li><a href="https://github.com/catherinedevlin/ipython-sql" target="_blank">https://github.com/catherinedevlin/ipython-sql</a></li>
</ul>

      ]]></content:encoded></item><item><title>Typesetting math equations with Anki</title><link>https://geoffruddock.com/anki-math-typesetting/</link><pubDate>Monday, 27 Mar 2017</pubDate><guid>https://geoffruddock.com/anki-math-typesetting/</guid><description>&lt;p>Anki is a tool I use daily to &lt;a href="https://geoffruddock.com/reflections-on-three-years-of-spaced-repetition-with-anki/">remember things better&lt;/a>. Below are the things I have learned about typesetting math equations in Anki using both MathJax and raw LaTeX. Hopefully these notes can save you some time.&lt;/p>
&lt;h2 id="update-2020-04-17">Update [2020-04-17] &lt;a class="anchor" href="#update-2020-04-17">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>Anki 2.1+ now has &lt;a href="https://docs.ankiweb.net/#/math" target="_blank">built-in support&lt;/a> for MathJax. This is now the best approach to math typesetting, since it removes the dependency on LaTeX being installed on your computer. Besides being a pain in the ass to configure, this also required a bunch of configurations that you had to keep track of if you regularly use multiple computers with Anki. As a bonus, the MathJax syntax is cleaner, and you can now edit expressions on AnkiDroid and they will render immediately.&lt;/p></description><content:encoded><![CDATA[
        <p>Anki is a tool I use daily to <a href="https://geoffruddock.com/reflections-on-three-years-of-spaced-repetition-with-anki/">remember things better</a>. Below are the things I have learned about typesetting math equations in Anki using both MathJax and raw LaTeX. Hopefully these notes can save you some time.</p>
<h2 id="update-2020-04-17">Update [2020-04-17] <a class="anchor" href="#update-2020-04-17">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Anki 2.1+ now has <a href="https://docs.ankiweb.net/#/math" target="_blank">built-in support</a> for MathJax. This is now the best approach to math typesetting, since it removes the dependency on LaTeX being installed on your computer. Besides being a pain in the ass to configure, this also required a bunch of configurations that you had to keep track of if you regularly use multiple computers with Anki. As a bonus, the MathJax syntax is cleaner, and you can now edit expressions on AnkiDroid and they will render immediately.</p>
<h2 id="how-to-convert-existing-latex-in-anki-to-mathjax">How to convert existing LaTeX in Anki to MathJax <a class="anchor" href="#how-to-convert-existing-latex-in-anki-to-mathjax">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>If you have already been using a full installation of LaTeX and have a bunch of Anki cards that you want to convert to MathJax, the process is relatively easy.</p>
<p>First, make sure you back up your entire Anki database. In the card browser, select the cards you want to convert and go to <em>Menu → Edit → Find and Replace</em>. Make sure <em>Treat input as regular expression</em> is checked, and then run the following input/output pairs. It&rsquo;s a good idea to test on a couple of cards first.</p>
<table>
  <thead>
      <tr>
          <th>Find</th>
          <th>Replace</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>\[\$\]</code></td>
          <td><code>\\\(</code></td>
      </tr>
      <tr>
          <td><code>\[/\$\]</code></td>
          <td><code>\\\)</code></td>
      </tr>
      <tr>
          <td><code>\[\$\$\]</code></td>
          <td><code>\\\[</code></td>
      </tr>
      <tr>
          <td><code>\[/\$\$\]</code></td>
          <td><code>\\\]</code></td>
      </tr>
      <tr>
          <td><code>\[latex\]</code></td>
          <td><code>\\\[</code></td>
      </tr>
      <tr>
          <td><code>\[/latex\]</code></td>
          <td><code>\\\]</code></td>
      </tr>
  </tbody>
</table>
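<p>Since the Find and Replace dialog has no preview, it can help to rehearse the substitutions on a sample string first. The sketch below applies the intended mapping (<code>[$]</code> to <code>\(</code>, <code>[$$]</code> to <code>\[</code>, <code>[latex]</code> to <code>\[</code>, and their closing counterparts) using Python&rsquo;s <code>re</code> module. The <code>PAIRS</code> list and <code>to_mathjax</code> helper are hypothetical names, and the escaping shown is what <code>re.sub</code> expects, which may differ from what the Anki dialog expects in your version.</p>

```python
import re

# (pattern, replacement) pairs written for Python's re.sub.
# In a replacement string, a doubled backslash emits one literal backslash.
PAIRS = [
    (r"\[\$\]", r"\\("),
    (r"\[/\$\]", r"\\)"),
    (r"\[\$\$\]", r"\\["),
    (r"\[/\$\$\]", r"\\]"),
    (r"\[latex\]", r"\\["),
    (r"\[/latex\]", r"\\]"),
]


def to_mathjax(field):
    """Apply each substitution in turn to the text of one field."""
    for pattern, replacement in PAIRS:
        field = re.sub(pattern, replacement, field)
    return field


sample = "Inline [$]x^2[/$] and display [$$]\\sum_i x_i[/$$]"
print(to_mathjax(sample))
```

<p>The patterns do not overlap, so the order in which the pairs run does not matter.</p>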
<p>You could probably combine these into a lesser number of more complex regular expressions, but I didn&rsquo;t want to mess around too much, since Anki&rsquo;s preview-less and undo-less Find &amp; Replace tool made me somewhat nervous.</p>
<p>Depending on what sort of syntax you used, you may need to convert or remove some additional strings which are not recognized by MathJax. For me, this included replacing the <code>align*</code> environment with <code>aligned</code> and removing in-equation tags (using regex: <code>\\tag{\d}</code>).</p>
<p>After running the above pairs, review a few cards to ensure everything looks okay. Then run <em>Tools → Check Media</em> and sync. Both operations may take a while (30-60 seconds) depending on how many LaTeX expressions you had previously. I had 3010 rendered LaTeX images which took up a total of 32.6 MB.</p>
<h2 id="using-latex-with-anki">Using LaTeX with Anki <a class="anchor" href="#using-latex-with-anki">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="hazards-">Hazards ☠ <a class="anchor" href="#hazards-">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Anki&rsquo;s <a href="https://docs.ankiweb.net/#/math?id=latex" target="_blank">official documentation on LaTeX support</a> is excellent, and deserves a careful readthrough, since it covers many typical problems. I will highlight two problems I did not pay enough attention to when starting:</p>
<ul>
<li>You can only put LaTeX tags inside fields, not inside the card template itself. Otherwise it breaks the logic used by the &ldquo;update media references&rdquo; process.</li>
<li>If you are using cloze deletion and have a nested LaTeX expression, put a space between the closing curly brackets (<code>} }</code>) so that Anki does not confuse the LaTeX braces with the end of the cloze.</li>
</ul>
<h3 id="understand-the-difference-between-inline-and-display-equations">Understand the difference between inline and display equations <a class="anchor" href="#understand-the-difference-between-inline-and-display-equations">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>LaTeX has two different ways to render math equations: <code>inline</code> and <code>display</code>.</p>
<p>Their primary difference in full LaTeX documents is that <code>inline</code> equations are smaller and do not cause a line break, so they can be used within a flowing paragraph, while <code>display</code> equations appear as larger, centered equations with a line break before and after. Since Anki renders LaTeX figures individually as PNG files and inserts them into your template, this spacing does not apply to us.</p>
<p>The secondary difference is that <code>inline</code> has tighter formatting on a variety of symbols, most notably on summations, integrals, etc.</p>
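<p>As a generic illustration (not taken from a particular card), here is the same expression written in both modes using standard LaTeX delimiters. Inside an Anki field, the <code>[$]…[/$]</code> and <code>[$$]…[/$$]</code> tags play the same roles.</p>

```latex
% Inline: limits sit beside the summation sign and the fraction is compact.
The mean is \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \).

% Display: limits sit above and below the sign and the fraction is full size.
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
```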
<h2 id="tweaks">Tweaks <a class="anchor" href="#tweaks">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="outputting-high-resolution-png-files">Outputting high-resolution PNG files <a class="anchor" href="#outputting-high-resolution-png-files">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I was not satisfied with the default rendering settings, which were generating images with noticeable aliasing. The following settings render equations at 800 DPI with a transparent background and medium compression. The files are not much bigger in the end, due to the compression.</p>
<ol>
<li>
<p>Install the <a href="https://ankiweb.net/shared/info/937148547" target="_blank">Edit LaTeX build process</a> addon.</p>
</li>
<li>
<p>Open <code>latex_build_process.py</code> and modify it as follows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">newLaTeX</span> <span class="o">=</span> \
</span></span><span class="line"><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">[</span><span class="s2">&#34;latex&#34;</span><span class="p">,</span> <span class="s2">&#34;-interaction=nonstopmode&#34;</span><span class="p">,</span> <span class="s2">&#34;tmp.tex&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="p">[</span><span class="s2">&#34;dvipng&#34;</span><span class="p">,</span> <span class="s2">&#34;-D&#34;</span><span class="p">,</span> <span class="s2">&#34;800&#34;</span><span class="p">,</span> <span class="s2">&#34;-T&#34;</span><span class="p">,</span> <span class="s2">&#34;tight&#34;</span><span class="p">,</span> <span class="s2">&#34;-bg&#34;</span><span class="p">,</span> <span class="s2">&#34;Transparent&#34;</span><span class="p">,</span> <span class="s2">&#34;tmp.dvi&#34;</span><span class="p">,</span> <span class="s2">&#34;-z&#34;</span><span class="p">,</span> <span class="s2">&#34;6&#34;</span><span class="p">,</span> <span class="s2">&#34;-o&#34;</span><span class="p">,</span> <span class="s2">&#34;tmp.png&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># make the changes</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">anki.latex</span>
</span></span><span class="line"><span class="cl"><span class="n">anki</span><span class="o">.</span><span class="n">latex</span><span class="o">.</span><span class="n">latexCmds</span> <span class="o">=</span> <span class="n">newLaTeX</span>
</span></span></code></pre></div></li>
<li>
<p>In your card template CSS, put <code>.latex { zoom: 14%; }</code> to return the images to a reasonable size.</p>
</li>
</ol>
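<p>The zoom value follows from simple scaling: an image rendered at 800 DPI and displayed at 14% ends up roughly the size of a 112 DPI render. Here is a minimal sketch of that arithmetic, in case you pick a different DPI. The 112 DPI target is an assumption chosen for illustration, and <code>css_zoom_percent</code> is a hypothetical helper, not part of Anki or the addon.</p>

```python
def css_zoom_percent(render_dpi: int, target_dpi: int) -> float:
    """Return the CSS zoom percentage that scales an image rendered at
    render_dpi down to the size it would have at target_dpi."""
    if render_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return 100.0 * target_dpi / render_dpi


# 800 matches the dvipng "-D 800" setting above; 112 is an assumed target DPI
zoom = css_zoom_percent(800, 112)
print(f".latex {{ zoom: {zoom:.0f}%; }}")
```

<p>If you change the <code>-D</code> value in the build process, rescale the zoom percentage by the same ratio.</p>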
<h3 id="center-align-rendered-latex-images">Center-align rendered LaTeX images <a class="anchor" href="#center-align-rendered-latex-images">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>If your inline LaTeX equations seem not to be aligned with the surrounding text, you can add the following to your card CSS.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-css" data-lang="css"><span class="line"><span class="cl"><span class="nt">img</span><span class="o">[</span><span class="nt">src</span><span class="o">*=</span><span class="s2">&#34;latex&#34;</span><span class="o">]</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">vertical-align</span><span class="p">:</span> <span class="kc">middle</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h3 id="making-display-equations-larger-than-inline-equations">Making display equations larger than inline equations <a class="anchor" href="#making-display-equations-larger-than-inline-equations">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Annoyingly, there is no way to automatically display PNG files rendered from display math larger than those rendered from inline math. The best way I have found to achieve this is to add a conditional field to the template and run a snippet of JavaScript to modify the CSS on the fly.</p>
<ol>
<li>
<p>Add a field to your note template. I used the name <code>_latex_displaymath</code> and set the text size to 10, so that it takes up minimal space in the Anki browser.</p>
</li>
<li>
<p>Add the following to your note CSS:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-css" data-lang="css"><span class="line"><span class="cl"><span class="p">.</span><span class="nc">display-math</span> <span class="p">.</span><span class="nc">latex</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">zoom</span><span class="p">:</span> <span class="mi">80</span><span class="kt">%</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">display</span><span class="p">:</span> <span class="kc">block</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">margin</span><span class="p">:</span> <span class="mi">0</span> <span class="kc">auto</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">padding</span><span class="p">:</span> <span class="mi">30</span><span class="kt">px</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div></li>
<li>
<p>Add the following to your card HTML. It assumes the answer content is wrapped in an element with <code>id="answer"</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-html" data-lang="html"><span class="line"><span class="cl"><span class="p">&lt;</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kd">var</span> <span class="nx">displayMath</span> <span class="o">=</span> <span class="s2">&#34;{{_latex_displaymath}}&#34;</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="nx">displayMath</span><span class="p">.</span><span class="nx">length</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s2">&#34;answer&#34;</span><span class="p">).</span><span class="nx">classList</span><span class="p">.</span><span class="nx">add</span><span class="p">(</span><span class="s1">&#39;display-math&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">&lt;/</span><span class="nt">script</span><span class="p">&gt;</span>
</span></span></code></pre></div></li>
</ol>

      ]]></content:encoded></item><item><title>Test your product assumptions with GA Intelligence Alerts</title><link>https://geoffruddock.com/test-your-product-assumptions/</link><pubDate>Sunday, 17 Jul 2016</pubDate><guid>https://geoffruddock.com/test-your-product-assumptions/</guid><description>&lt;p>A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions. Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly.&lt;/p>
&lt;p>More knowledge is good, right? On one hand, past experience calibrates our sense of &lt;a href="https://en.wikipedia.org/wiki/Prior_probability" target="_blank">prior probability&lt;/a>, which allows us to make better decisions in noisy contexts. This &amp;ldquo;prior&amp;rdquo; knowledge which we acquire has a dark side though. When we encode certain data points as truths into our mental models, our perception of the world becomes static. We can become overconfident in our knowledge of how things work, and be caught off-guard when our assumptions about how the world works are no longer true.&lt;/p></description><content:encoded><![CDATA[
        <p>A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions. Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly.</p>
<p>More knowledge is good, right? On one hand, past experience calibrates our sense of <a href="https://en.wikipedia.org/wiki/Prior_probability" target="_blank">prior probability</a>, which allows us to make better decisions in noisy contexts. This &ldquo;prior&rdquo; knowledge which we acquire has a dark side though. When we encode certain data points as truths into our mental models, our perception of the world becomes static. We can become overconfident in our knowledge of how things work, and be caught off-guard when our assumptions about how the world works are no longer true.</p>
<blockquote>
<p>“In the beginner’s mind there are many possibilities, but in the expert’s there are few”</p>
<p>― <strong>Shunryu Suzuki</strong></p></blockquote>
<p>So wouldn&rsquo;t it be great if there were a way to be notified when our acquired mental models diverge from reality?</p>
<p>In software development there are entire methodologies such as <a href="https://en.wikipedia.org/wiki/Test-driven_development" target="_blank">Test-driven development</a> (TDD) which revolve around explicitly formulating and testing assumptions at each stage in the development process. One of my favourite Python statements is <code>assert</code>, which lets you state a condition you expect to evaluate to <code>True</code>, and have Python raise an exception when it does not.</p>
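<p>As a small illustration, the sketch below encodes one assumption (that there is always at least one session) directly in the code. The function name and the numbers are hypothetical.</p>

```python
def conversion_rate(orders: int, sessions: int) -> float:
    """Share of sessions that ended in an order."""
    # Encoded assumption: we never compute a rate for a period with no sessions
    assert sessions > 0, "expected at least one session"
    return orders / sessions


print(conversion_rate(30, 1000))

# When the assumption stops holding, Python raises an AssertionError
try:
    conversion_rate(0, 0)
except AssertionError as exc:
    print(f"assumption violated: {exc}")
```

<p>Note that assertions are skipped when Python runs with the <code>-O</code> flag, so they suit sanity checks rather than input validation.</p>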
<p>But if you are working with data in an analytics tool rather than in Python, how can you achieve this?</p>
<p><a href="https://support.google.com/analytics/answer/1033021?hl=en" target="_blank">Intelligence Alerts</a> are a neat feature in Google Analytics which allow you to specify a metric and dimension combination, and then configure an alert, evaluated on a daily, weekly, or monthly basis, that fires when that metric changes by more than an absolute value or a percentage.</p>
<img src="create_alert.png" width=600 alt="How to create a GA alert">
<p>You can use this tool to codify your assumptions about user behaviour, and then get alerted if they change. Set notification thresholds calibrated to your perceived lower bound on normal usage behaviour. When a niche but important feature breaks silently a few months from now, you will be the first to know.</p>
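<p>To make the logic concrete, here is a rough sketch of such a rule in plain Python. It is not the Google Analytics API: the <code>Alert</code> class, the metric name, and the thresholds are all hypothetical, and only mirror the idea of a floor on normal usage plus a maximum tolerated drop.</p>

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert rule: fire when a metric falls below a floor, or
    drops by more than a given percentage versus the previous period."""
    metric: str
    min_value: float      # perceived lower bound on normal usage
    max_drop_pct: float   # largest tolerated period-over-period drop, in percent

    def triggered(self, previous: float, current: float) -> bool:
        if current < self.min_value:
            return True
        if previous > 0:
            drop_pct = 100.0 * (previous - current) / previous
            return drop_pct > self.max_drop_pct
        return False


alert = Alert(metric="saved_search_created", min_value=20, max_drop_pct=50)
print(alert.triggered(previous=40, current=35))  # small dip, above the floor
print(alert.triggered(previous=40, current=12))  # below the floor
```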

      ]]></content:encoded></item><item><title>Book review: Remote Research (user research)</title><link>https://geoffruddock.com/book-review-remote-research/</link><pubDate>Tuesday, 07 Jun 2016</pubDate><guid>https://geoffruddock.com/book-review-remote-research/</guid><description>&lt;p>This is a brief review of the book &lt;a href="http://amzn.to/1RVY9Uc" target="_blank">Remote Research&lt;/a>, and a summary of points that resonated with me.&lt;/p>
&lt;h2 id="key-concepts">Key Concepts &lt;a class="anchor" href="#key-concepts">
&lt;i class="fas fa-hashtag anchor-link">&lt;/i>
&lt;/a>&lt;/h2>&lt;p>&lt;strong>Moderated research&lt;/strong> – Real-time interaction with a user. It is time-expensive, but the greater “texture” of the interaction makes it easier to discover unanticipated insights.&lt;/p>
&lt;blockquote>
&lt;p>“Moderated research allows you to gather in-depth qualitative feedback: behavior, tone-of-voice, task and time context, and so on. Moderators can probe at new subjects as they arise over the course of a session, which makes the scope of the research more flexible and enables the researcher to explore behaviors that were unforeseen during the planning phases of the study. Researchers should pay close attention to these “emerging topics,” since they often identify issues that were overlooked during the planning of the study.”&lt;/p></description><content:encoded><![CDATA[
        <p>This is a brief review of the book <a href="http://amzn.to/1RVY9Uc" target="_blank">Remote Research</a>, and a summary of points that resonated with me.</p>
<h2 id="key-concepts">Key Concepts <a class="anchor" href="#key-concepts">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><strong>Moderated research</strong> – Real-time interaction with a user. It is time-expensive, but the greater “texture” of the interaction makes it easier to discover unanticipated insights.</p>
<blockquote>
<p>“Moderated research allows you to gather in-depth qualitative feedback: behavior, tone-of-voice, task and time context, and so on. Moderators can probe at new subjects as they arise over the course of a session, which makes the scope of the research more flexible and enables the researcher to explore behaviors that were unforeseen during the planning phases of the study. Researchers should pay close attention to these “emerging topics,” since they often identify issues that were overlooked during the planning of the study.”</p></blockquote>
<p><strong>Automated research</strong> – The data collection process is set up a priori and the research is conducted asynchronously, without your involvement.</p>
<blockquote>
<p>“Automated research is nearly always quantitative and is good at addressing more specific questions (“What percentage of users can successfully log in?” “How long does it take for users to find the product they’re looking for?”), or measuring how users perform on a few simple tasks over a large sample. If all you need is raw performance data, and not why users behave the way they do, then automated testing is for you.”</p></blockquote>
<p><strong>Starting an interaction</strong> – The quality of your data in a moderated study is influenced by the consistency and quality of your participant on-boarding process.</p>
<blockquote>
<p>“Establish the users’ expectations about what will happen during the study and what kind of mindset they should have entering the study. The most important things to establish are that you want the participants to use the interface like they normally would … And let them know you’d also like them to think aloud while they’re on the site … It’s also nice to set users at ease by reassuring them that you had nothing to do with the design of the interface, so they can be completely honest:”</p></blockquote>
<p><strong>Time-Aware Research</strong> – Using live recruitment in a moderated study leads to richer and more authentic interactions with participants that occur in their native environment.</p>
<blockquote>
<p>“Remote research is more appropriate when you want to watch people performing real tasks, rather than tasks you assign to them. The soul of remote research is that it lets you conduct what we call Time-Aware Research (TAR).”</p></blockquote>
<h2 id="execution-tips">Execution Tips <a class="anchor" href="#execution-tips">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><strong>Progress from high to low variability</strong> – Start the session with undirected natural tasks, which gives the participant space to surprise you. Finish by running through any tasks the user did not complete naturally, this time in a structured manner.</p>
<p><strong>Timestamp your notes</strong> – Record timestamps as “time since session start” rather than as absolute times, to make them easier to review later.</p>
<p><strong>Cross-reference “control” metrics with your analytics</strong> – Double-check that your research is not biased due to a flaw in the design or structure of the study.</p>
<blockquote>
<p>“If there’s a discrepancy between your study findings and the Web site’s analytics (“80% of study participants clicked on the green button, but only 40% of our general Web audience does”), it could mean that the task design was flawed, the target audience of the study differs from that of the main audience, or that there’s an unforeseen issue altogether.”</p></blockquote>
<p><strong>Ask open-ended questions</strong> – Remain neutral to avoid influencing the responses from participants.</p>
<p>“So, tell me what you’re looking at … What’s going through your mind right now? … What do you want to do from here? … When did you decide to leave the site/exit the program? … What brought you to this page?”</p>
<h2 id="thoughts">Thoughts <a class="anchor" href="#thoughts">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p>Remote Research lays out a comprehensive framework for starting to conduct research studies at your company, and is useful for beginners or for filling in the gaps in your mental model. However, it seems more targeted towards large companies with established UX practices than towards startups. If you are executing alone, perhaps as a one-man UX team, you may still feel a gap between theory and execution. The tools section of the book seems dated, which is understandable, but it would be great to see some more tactical information on conducting remote research on the cheap. Two tricks that I have used at work myself are:</p>
<ul>
<li>Run tests from <a href="https://www.google.com/analytics/tag-manager/" target="_blank">Google Tag Manager</a> – Aligning with the owner of the tracking platform (often the Product team) is a quicker way to get the necessary code live than doing it in-house with IT.</li>
<li>Use a general session recording tool – Using a tool such as <a href="http://www.inspectlet.com/" target="_blank">Inspectlet</a>, you can record most or all user interactions and then filter the recordings down afterwards. This allows you to <a href="http://geoffruddock.com/user-research-when-you-cant-reach-your-users/" target="_blank">observe a very specific behaviour chain</a> that may not occur frequently enough on your site to target users live.</li>
</ul>

      ]]></content:encoded></item><item><title>Book review: Web Form Design</title><link>https://geoffruddock.com/book-review-web-form-design/</link><pubDate>Wednesday, 11 May 2016</pubDate><guid>https://geoffruddock.com/book-review-web-form-design/</guid><description>&lt;p>I finished reading &lt;a href="http://www.amazon.com/Web-Form-Design-Luke-Wroblewski-ebook/dp/B004VFUP2I" target="_blank">Web Form Design&lt;/a> recently on the recommendation of a mentor. The author makes a good case about web forms being a high leverage area to invest design efforts. The combination of forms being mandatory, complex, and not particularly sexy, results in an experience that is often the worst part of a user’s interaction with your product. He then breaks down the form into the building blocks of Labels, Input Fields, and Actions, then lays out best practices for each. Here are a few snippets from the book that resonated with me.&lt;/p></description><content:encoded><![CDATA[
        <p>I finished reading <a href="http://www.amazon.com/Web-Form-Design-Luke-Wroblewski-ebook/dp/B004VFUP2I" target="_blank">Web Form Design</a> recently on the recommendation of a mentor. The author makes a good case about web forms being a high leverage area to invest design efforts. The combination of forms being mandatory, complex, and not particularly sexy, results in an experience that is often the worst part of a user’s interaction with your product. He then breaks down the form into the building blocks of Labels, Input Fields, and Actions, then lays out best practices for each. Here are a few snippets from the book that resonated with me.</p>
<h3 id="labels">Labels <a class="anchor" href="#labels">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><strong>Top-aligned labels</strong> – “The results of live site testing across several different geographies have also supported <em>top-aligned labels as the quickest way to get people through forms</em>. These studies also had higher completion rates (over 10 percent higher) than the left-aligned versions of forms they were tested against… One of the reasons top-aligned forms are completed quickly may be because <em>they only require a single eye fixation to take in both input label and input field.</em> [50ms compared to 240ms for right-aligned and 500ms for left-aligned labels] … Top-aligned labels, however, do <em>take up additional vertical real estate</em>.”</p>
<p><strong>Right-aligned labels</strong> – “The resulting left rag of the labels in a right-aligned layout reduces the effectiveness of a quick scan to see what information the form requires … That said, in cases where you want to <em>minimize the amount of vertical screen</em> space your form uses, right-aligned labels can provide fast completion times.”</p>
<p><strong>Left-aligned labels</strong> – “Left-aligning input field labels <em>makes scanning the information required by a form easier</em>. People can simply inspect the left column of labels up and down without being interrupted by input fields… Unfortunately, a few long labels often extend the distance between labels and inputs and, as a result, completion times may suffer. People have to “jump” from column to column in order to find the right association of input field and input label before entering data. The reason <em>left-aligned forms are the slowest of the three options to complete</em> may be because of the number of eye fixations they require to parse.”</p>
<p><strong>Inside-aligned labels</strong> – “<em>In cases where screen real estate is at a premium</em>, combining labels and input fields into a single user interface element may be appropriate… Because labels within fields need to go away when people are entering their answer into an input field, <em>the context for the answer is gone</em>. As such, labels within inputs <em>aren’t a good solution for long forms</em>… It’s also generally a good rule not to use labels within inputs for non-obvious questions. That is, questions that may require people to reference the label while answering.”</p>
<h3 id="input-fields">Input Fields <a class="anchor" href="#input-fields">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><strong>Tabbing behaviour</strong> – “Web form designers should consider what the experience will be like for the <em>large numbers of people who move between input fields using the Tab key</em>, and they should design accordingly.”</p>
<p><strong>Radio buttons</strong> – “Allow people to select exactly one choice from two or more always visible and mutually exclusive options. Because radio buttons are mutually exclusive, they should have a default value selected (more on this later). It’s also a good idea to <em>make sure both the radio button and its label can be selected</em> to activate a radio button selection.”</p>
<p><strong>Input switching</strong> – “[Sequential] basic text boxes … lead users to <em>skip back and forth between their mouse and keyboard</em> … in order to complete the interaction.”</p>
<p><strong>Length of input fields</strong> – “<em>The way we display input fields can produce valuable clues on how they should be filled in</em>… In the eBay Express example … the size of the zip code input matches the size of an actual zip code in the United States: 5 digits. The size of the phone number text boxes match the number of digits in a standard phone number in the United States. The rest of the text boxes are a consistent length that provides enough room for a complete answer.”</p>
<p><strong>Required/optional fields</strong> – “If most of the inputs on a form are optional, <em>indicate the few that are required</em>. … When indicating what form fields are either required or optional, text is the most clear. However, the * symbol is relatively well understood to mean required.”</p>
<h3 id="actions">Actions <a class="anchor" href="#actions">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><strong>Secondary actions</strong> – “When you <em>reduce the visual prominence of secondary actions</em>, it minimizes the risk for potential errors and further directs people toward a successful outcome.”</p>
<p><strong>Success vs. Error messages</strong> – “The key difference between error and success messages, however, is that error messages cannot be ignored or dismissed—they must be addressed. <em>Success messages</em>, on the other hand, <em>should never block people’s progress</em>—they should encourage more of it.”</p>
<p><strong>Animating success messages</strong> – “Because human beings are instinctively drawn to motion—we had to avoid sabertoothed tigers somehow—animated messages that transition off a page can let people know their actions have been successful. The most common transitions utilized for this are fades, dissolves, or roll-ups.”</p>
<p><strong>Effective in-line validation</strong> – “Inline confirmation works best for questions with potentially high error rates or specific formatting requirements… When validating people’s answers inline, <em>do so after they have finished</em> providing an answer, not during the process.”</p>

      ]]></content:encoded></item><item><title>Tracking: Organizational Challenges</title><link>https://geoffruddock.com/lessons-learned-from-tracking/</link><pubDate>Friday, 12 Feb 2016</pubDate><guid>https://geoffruddock.com/lessons-learned-from-tracking/</guid><description>&lt;p>There are plenty of technical guides online about &lt;a href="https://www.simoahava.com/" target="_blank">tracking user behaviour using GTM&lt;/a>. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking.&lt;/p>
&lt;p>One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization. Stakeholders get nervous when you propose changes to tracking, even if tracking currently sucks.&lt;/p></description><content:encoded><![CDATA[
        <p>There are plenty of technical guides online about <a href="https://www.simoahava.com/" target="_blank">tracking user behaviour using GTM</a>. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking.</p>
<p>One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization. Stakeholders get nervous when you propose changes to tracking, even if tracking currently sucks.</p>
<p>As a product manager, my primary interest in tracking is to feed higher-quality data into the product decisions my team makes. Being “data-driven” is chic, but having reliable and relevant data is not a given. It requires some strategic forethought to track the right things and track them properly.</p>
<p>The first thing I did was consolidate all the country-specific containers into a single global container in GTM. Our application is nearly identical between countries, so this was easy from a technical perspective. We removed outdated tags, replaced country-specific IDs with lookup tables, and updated triggers to match. The second major change was to change how we name events, so that they communicate user behavior in a more transparent way.</p>
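<p>The lookup-table idea can be sketched in a few lines of Python. This only illustrates the pattern and is not GTM configuration: the country codes, property IDs, and the <code>property_id</code> helper are all made up. The point is that a single tag definition resolves its ID from a table, instead of one container per country hard-coding it.</p>

```python
# Hypothetical property IDs, for illustration only
PROPERTY_ID_BY_COUNTRY = {
    "ph": "UA-0000000-1",
    "mx": "UA-0000000-2",
    "id": "UA-0000000-3",
}


def property_id(country_code: str) -> str:
    """Resolve the analytics property for a country from the shared table."""
    key = country_code.strip().lower()
    if key not in PROPERTY_ID_BY_COUNTRY:
        # Fail loudly rather than silently sending hits to the wrong property
        raise LookupError(f"no property configured for country {country_code!r}")
    return PROPERTY_ID_BY_COUNTRY[key]


print(property_id("PH"))
```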
<p>A few lessons learned from the process:</p>
<h3 id="reports-are-fragile">Reports are fragile <a class="anchor" href="#reports-are-fragile">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Tracking data feeds into many teams’ reports, some of which you may not be aware of. These reports can be quite fragile to changes made to the tracking layer. Even worse than breaking a report is subtly undermining some of its underlying assumptions, which reduces the accuracy and usefulness of that report without anyone realizing it.</p>
<p>The best way to mitigate this risk is to coordinate tightly with BI. Sit down and trace all the “customers” of tracking data to get a better sense of how changes will impact various teams and reports. It is especially important to be aware of which reports are consumed by external stakeholders such as investors. These reports often process the data down to a single number in a spreadsheet cell, without any context around it. For example, inserting a GA event could impact the “bounce rate” calculation on that page.</p>
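<p>The bounce rate example can be made concrete with a toy calculation. The sketch assumes the Universal Analytics-style definition, where a bounce is a session containing exactly one interaction hit. The session data and the <code>bounce_rate</code> helper are hypothetical.</p>

```python
def bounce_rate(hits_per_session: list[int]) -> float:
    """Fraction of sessions with exactly one interaction hit.

    hits_per_session holds the number of interaction hits in each session.
    """
    if not hits_per_session:
        raise ValueError("need at least one session")
    bounces = sum(1 for hits in hits_per_session if hits == 1)
    return bounces / len(hits_per_session)


# Four hypothetical sessions: three single-pageview visits and one two-page visit
before = [1, 1, 1, 2]
# Same visits, but a newly added interaction event now fires for two of the
# single-page visitors, so those sessions no longer count as bounces
after = [2, 2, 1, 2]

print(bounce_rate(before), bounce_rate(after))
```

<p>Nothing about user behaviour changed between the two lists, yet the reported number moves substantially. A stakeholder who only sees the final figure cannot tell the difference.</p>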
<h3 id="people-are-overly-confident-in-their-data">People are overly confident in their data <a class="anchor" href="#people-are-overly-confident-in-their-data">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Making decisions on real-world data is not as clean-cut as a case study in business school, and it is always good practice to question the source and validity of the data you are using to make a decision. Unfortunately some decision-makers can lose sight of this. Prepare for some push-back against your proposed fixes or improvements to tracking, as this implies that prior decisions were made with flawed data. Data is never infallible, but this can be an uncomfortable reality for some managers.</p>
<h3 id="decouple-tracking-from-kpi-definition">Decouple tracking from KPI definition <a class="anchor" href="#decouple-tracking-from-kpi-definition">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>The ideal tracking event crisply describes the nature of the user interaction without commenting on the value to the business. Event names such as “Unique Lead” or “Customer Intent” are opaque and give no visibility into what exactly those actions are, or why they are important to the business. It is better to push the task of KPI definition “up the stack” to management, so that the people who are ultimately consuming the tracking data will be better-equipped to make decisions on it.</p>
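<p>One way to picture this separation is the sketch below. The events are named after what the user did, and the KPI is defined elsewhere as a set of those events. All event names and the <code>count_kpi</code> helper are hypothetical.</p>

```python
from collections import Counter

# Tracking layer: events describe the interaction itself
event_log = [
    "phone_number_revealed",
    "listing_viewed",
    "contact_form_submitted",
    "phone_number_revealed",
]

# KPI layer: owned by whoever consumes the data, and free to change
# without touching the tracking implementation
LEAD_EVENTS = {"phone_number_revealed", "contact_form_submitted"}


def count_kpi(events, kpi_events):
    """Count how many logged events belong to the given KPI definition."""
    counts = Counter(events)
    return sum(counts[name] for name in kpi_events)


print(count_kpi(event_log, LEAD_EVENTS))
```

<p>If management later decides that only form submissions count as leads, the KPI set changes while the historical event data stays valid.</p>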

      ]]></content:encoded></item><item><title>The Best of Seth Godin for Product Managers</title><link>https://geoffruddock.com/seth-godin-for-product-managers/</link><pubDate>Friday, 10 Jul 2015</pubDate><guid>https://geoffruddock.com/seth-godin-for-product-managers/</guid><description>&lt;p>One of the consistent &lt;em>must-reads&lt;/em> that has remained in my RSS feed over the years is Seth Godin’s blog. Seth consistently puts out a stream of incredibly wise thoughts. I have found that some of his posts resonate with me even more when I re-read them at a later point in my life/career. Here are some of my favourite Seth Godin posts, as they relate to the role of Product Manager.&lt;/p></description><content:encoded><![CDATA[
        <p>One of the consistent <em>must-reads</em> that has remained in my RSS feed over the years is Seth Godin’s blog. Seth consistently puts out a stream of incredibly wise thoughts. I have found that some of his posts resonate with me even more when I re-read them at a later point in my life/career. Here are some of my favourite Seth Godin posts, as they relate to the role of Product Manager.</p>
<p><a href="https://seths.blog/2015/01/please-go-away/" target="_blank">Please, go away</a> – Being out-of-touch with customers hurts every part of an organization, but especially the product team. Sometimes it requires a conscious effort to correct for this. You may receive surprisingly strong push-back from some people on your efforts.</p>
<p><a href="https://seths.blog/2014/07/project-management-for-work-that-matters/" target="_blank">Project management for work that matters</a> – Ten very good pieces of advice for the project management parts of a PM’s job.</p>
<p><a href="https://seths.blog/2007/01/really_bad_powe/" target="_blank">Really Bad Powerpoint</a> – One of Seth’s longer blog posts. A good philosophical guide to using PowerPoint effectively. I try to stay away from PowerPoint as much as possible, but sometimes it is necessary, especially for interacting with stakeholders.</p>
<p><a href="https://seths.blog/2014/03/not-even-one-note/" target="_blank">Not even one note</a> – <em>Why</em> it is important to choose <em>better</em> features over <em>more</em> features. He also talks about <a href="https://seths.blog/2014/05/no-is-essential/" target="_blank"><em>how</em></a> to make that choice.</p>
<p><a href="https://seths.blog/2014/11/inventing-a-tribe/" target="_blank">Inventing a tribe</a> – Building a successful product vision does not have to involve creating something totally new and revolutionary from scratch. It is far more likely that it will involve connecting and empowering the people that already share a vision with you.</p>
<p><a href="https://seths.blog/2006/07/how_to_live_hap/" target="_blank">How to live happily with a great designer</a> – Some tips for working effectively with designers.</p>
<p><a href="https://seths.blog/2005/08/two_kinds_of_wr/" target="_blank">Two kinds of writing</a> – As a PM you will be interacting with totally different groups of people on a daily basis. It is important to adjust your writing and communication style to each audience. You will want to use a different approach when dealing with customers, engineers, marketing, or stakeholders.</p>
<p><a href="https://seths.blog/2015/05/why-do-you-do-it-this-way/" target="_blank">Why do you do it this way?</a> – A good way to test some of the underlying product decisions made in the past. Asking <em>why</em> three times is a great way to uncover the philosophy of a team.</p>
<p><a href="https://seths.blog/2015/06/marketing-to-the-organization/" target="_blank">Marketing to the organization</a> – Product managers lead without positional authority, so it becomes important to approach things at a <em>meta</em> level, thinking about what you can do internally to give a product or project the best chance of succeeding.</p>
<p><a href="https://seths.blog/2015/01/doing-calculus-with-roman-numerals/" target="_blank">Doing calculus with Roman numerals</a> – As a non-technical PM, it is especially important to be relentlessly curious and to ask many questions about the technical side. Not to make your job easier, but to open up a level of performance that is not possible without understanding the tools being used around you.</p>

      ]]></content:encoded></item><item><title>Reading books for long-term value</title><link>https://geoffruddock.com/reading-books-for-long-term-value/</link><pubDate>Wednesday, 08 Jul 2015</pubDate><guid>https://geoffruddock.com/reading-books-for-long-term-value/</guid><description>&lt;p>For a while now, my Pocket reading list has been growing at a faster rate than I have been consuming it. Recently this problem has crept into my offline reading as well, and now my &lt;a href="http://www.goodreads.com/" target="_blank">GoodReads&lt;/a> list is growing hopelessly long.&lt;/p>
&lt;p>Initially I approached this as a &lt;em>quantity&lt;/em> problem, and started looking into speed-reading as a method of consuming more information. There is a neat tool called &lt;a href="http://www.spritzinc.com/" target="_blank">Spritz&lt;/a> that controls for eye movement to help you learn. But it turned out the problem was about &lt;em>quality&lt;/em> of reading, rather than &lt;em>quantity&lt;/em> of material. This manifested itself in a disappointing recall of key arguments and theses of books I had read more than a year or two before.&lt;/p></description><content:encoded><![CDATA[
        <p>For a while now, my Pocket reading list has been growing at a faster rate than I have been consuming it. Recently this problem has crept into my offline reading as well, and now my <a href="http://www.goodreads.com/" target="_blank">GoodReads</a> list is growing hopelessly long.</p>
<p>Initially I approached this as a <em>quantity</em> problem, and started looking into speed-reading as a method of consuming more information. There is a neat tool called <a href="http://www.spritzinc.com/" target="_blank">Spritz</a> that controls for eye movement to help you learn. But it turned out the problem was about <em>quality</em> of reading, rather than <em>quantity</em> of material. This manifested itself in a disappointing recall of key arguments and theses of books I had read more than a year or two before.</p>
<p>Part of the problem was that I considered the primary goal of reading to be <em>acquiring information</em>. The issue with this approach is that if the raw data is not synthesized, you won’t remember it for as long. I now consider the primary goal of reading to be <em>rewiring parts of my cognitive process</em> based on the information in the book.</p>
<p>Here are a couple of the systems I have put into place to derive more long-term value out of my reading:</p>
<h2 id="before">Before <a class="anchor" href="#before">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="read-summaries">Read summaries <a class="anchor" href="#read-summaries">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>In an effort to reduce the <em>input</em> side of my reading list problem, I have begun heavily vetting the recommendations or discoveries that I place into my reading list. For anything non-fiction, I check <a href="http://geoffruddock.com/blinkist-daily/" target="_blank">Blinkist</a> to see if there is already a summary available. For other genres, I like to check <a href="http://www.brainpickings.org/" target="_blank">Maria Popova’s Brain Pickings</a> to see if she has written on that book before. Reading through a summary like this will give you a better sense of whether you should commit to reading the full book. And if you <em>do</em> proceed to read the book, you begin with a rough mental framework that makes it much easier to absorb the arguments and theses into your mental model.</p>
<h2 id="during">During <a class="anchor" href="#during">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="use-an-e-reader">Use an e-reader <a class="anchor" href="#use-an-e-reader">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>Buying an Amazon Kindle has been a huge help. Besides the whole “thousand books in your pocket” thing, I find the highlighting feature to be incredibly valuable. I have never been much of a highlighter / markup-er of printed media, but I am well aware of the benefits for cognitively absorbing material. Kindle’s highlighting feature lets you collect snippets from a book and export them as a text file.</p>
<h3 id="read-deliberately">Read deliberately <a class="anchor" href="#read-deliberately">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><a href="http://www.farnamstreetblog.com/2014/05/remembering-what-you-read/" target="_blank">Shane Parrish of Farnam Street</a> has written extensively on the subject of learning, reading, and self-improvement. He has some pieces of good advice that ultimately add up to the act of <em>reading deliberately</em>. Take a second before you begin to think about the author, the context, and your existing knowledge on the subject. While reading, mentally summarize arguments periodically, and try to abstract at a higher level. After you put down a book, spend a couple of minutes in silence, contemplating what you’ve just learned, and attempting to synthesize it into your existing mental framework.</p>
<h2 id="after">After <a class="anchor" href="#after">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><h3 id="write-a-book-summary">Write a book summary <a class="anchor" href="#write-a-book-summary">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>There is a reason that <a href="http://www.gatesnotes.com/Books" target="_blank">Bill Gates publishes book reviews</a>, and it’s not because he has nothing better to do with his time. Writing these reviews will encourage you to read at the analytical level required to summarize effectively. I usually start by sorting through all of my Kindle highlights from a book, then organizing them into thematic groups, and trying to build a structured opinion on the work. Making a value judgement in your summary will force you to go a step further in your reading, to do the work of synthesizing the material and forming an argument.</p>
<h3 id="mindmapping">Mindmapping <a class="anchor" href="#mindmapping">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>I also find it useful to push one level above individual books, and to make a conscious effort to integrate the new book into my <a href="http://www.farnamstreetblog.com/mental-models/" target="_blank">mental frameworks of knowledge</a>. Mindmapping is a good tool for this, as it <a href="http://www.asianefficiency.com/mind-mapping/" target="_blank">helps you visualize and form connections</a> between pieces of material without the need to traverse the information in a linear fashion. Another option is to collect key passages into your <a href="http://thoughtcatalog.com/ryan-holiday/2013/08/how-and-why-to-keep-a-commonplace-book/" target="_blank">commonplace book</a>.</p>
<p>Adding these additional layers to my reading &ldquo;stack&rdquo; definitely slows down my rate of consumption, but I think it is well worth the increase in comprehension, synthesis, and long-term retention.</p>

      ]]></content:encoded></item><item><title>How to conduct user research when you can’t reach your users</title><link>https://geoffruddock.com/user-research-when-you-cant-reach-your-users/</link><pubDate>Saturday, 04 Jul 2015</pubDate><guid>https://geoffruddock.com/user-research-when-you-cant-reach-your-users/</guid><description>&lt;p>If you are a product manager, you have almost certainly heard about &lt;a href="http://radar.oreilly.com/2015/03/7-user-research-myths-and-mistakes.html" target="_blank">the importance of conducting user research&lt;/a> before. Quantitative data can point to &lt;em>where&lt;/em> a problem exists, but nothing beats qualitative research for learning &lt;em>why&lt;/em> that problem occurs. Large datasets can obscure individual usage patterns, making it hard to “get into the user’s head”. User research helps you understand the &lt;a href="http://www.nngroup.com/articles/mental-models/" target="_blank">conceptual models of your users&lt;/a> and to build personas around them.&lt;/p></description><content:encoded><![CDATA[
        <p>If you are a product manager, you have almost certainly heard about <a href="http://radar.oreilly.com/2015/03/7-user-research-myths-and-mistakes.html" target="_blank">the importance of conducting user research</a> before. Quantitative data can point to <em>where</em> a problem exists, but nothing beats qualitative research for learning <em>why</em> that problem occurs. Large datasets can obscure individual usage patterns, making it hard to “get into the user’s head”. User research helps you understand the <a href="http://www.nngroup.com/articles/mental-models/" target="_blank">conceptual models of your users</a> and to build personas around them.</p>
<p>Normal user research methods involve getting users into a room and watching them interact with your product. But what do you do if you can’t reach your users as easily? What if your users are in different countries, or speak different languages? These factors certainly make user research more difficult, but also simultaneously make it <em>even more important</em>.</p>
<p>One solution I’ve been playing with recently is a combination of <a href="https://www.olark.com/" target="_blank">Olark live chat</a> and <a href="http://www.inspectlet.com/" target="_blank">Inspectlet</a>. Inspectlet is a tool that records the cursor movements, clicks and scrolls of your users, and then rebuilds them into a video of the user’s session. At first it almost seems as if you are “spying” on users, though in fact the videos are all assembled post-hoc. Inspectlet is, of course, not as interactive as true user testing, but it does allow you to get surprising insights on user behaviour.</p>
<p>What is really powerful is when you combine these two together. Olark is primarily a live-chat tool, but when you are offline it reverts to a feedback box, placed on a targeted part of your website or product. Here is how I chain the two tools together:</p>
<ul>
<li>Place the Olark feedback box on a specifically targeted element of your website where you expect there will be user frustration. Olark’s premium plan offers targeting, or you can <a href="http://www.google.com/tagmanager/" target="_blank">roll your own DIY targeting by firing the Olark tag through Google Tag Manager</a>.</li>
<li>After some time, read through the responses Olark sends to your email. If you are tracking foreign-language users, you can translate most messages right from within Google Chrome.</li>
<li>When you find a user response that interests you, grab the IP address from the message and filter for that IP in Inspectlet. Unless your product has massive traction already, you’ll probably find a single session that matches that IP address.</li>
<li>Watch the user session to learn the process the user went through before leaving the corresponding piece of feedback.</li>
</ul>
<p>This combination is the most effective solution I have found so far to bridge the user research gap on hard-to-reach users. However, I wouldn’t say this is a replacement for conducting real user research. If you can, nothing beats an in-person session.</p>

      ]]></content:encoded></item><item><title>Reconciling contradictory advice</title><link>https://geoffruddock.com/contradicting-advice/</link><pubDate>Friday, 26 Jun 2015</pubDate><guid>https://geoffruddock.com/contradicting-advice/</guid><description>&lt;p>One of the problems with abstracted tidbits of advice is that they lose much of their meaning when divorced from their context. The correct decision can be heavily weighted by the nuances of the specific scenario. As a result, you often receive seemingly conflicting pieces of advice. The easy example is with contradicting proverbs, which are &lt;a href="http://www.tipfortat.com/" target="_blank">humorously documented here&lt;/a>. But the contradictions also occur in more serious advice given around technology, business strategy, and product development. Here are a couple I have been thinking about recently.&lt;/p></description><content:encoded><![CDATA[
        <p>One of the problems with abstracted tidbits of advice is that they lose much of their meaning when divorced from their context. The correct decision can be heavily weighted by the nuances of the specific scenario. As a result, you often receive seemingly conflicting pieces of advice. The easy example is with contradicting proverbs, which are <a href="http://www.tipfortat.com/" target="_blank">humorously documented here</a>. But the contradictions also occur in more serious advice given around technology, business strategy, and product development. Here are a couple I have been thinking about recently.</p>
<h2 id="breadth-vs-depth">Breadth vs. depth <a class="anchor" href="#breadth-vs-depth">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><strong>Should you strive to be well-rounded (full-stack?) or should you focus on your strengths?</strong></p>
<p>This can be viewed as a version of the classic <a href="http://fourhourworkweek.com/2007/09/14/the-top-5-reasons-to-be-a-jack-of-all-trades/" target="_blank">generalist–specialist dichotomy</a>. But it is more interesting when applied to the &ldquo;micro&rdquo; skill level rather than &ldquo;macro&rdquo; level career advice. When it comes to your skills and capabilities, should you focus on your strengths, or invest the time to round-out your weaker skills? This is loosely related to the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit" target="_blank">multi-armed bandit problem</a>, and to the concept of <a href="http://52weeksofux.com/post/694598769/the-local-maximum" target="_blank">local maxima</a>. What is the optimal mix of breadth and depth?</p>
<h2 id="perfectionism-vs-mediocrity">Perfectionism vs. mediocrity <a class="anchor" href="#perfectionism-vs-mediocrity">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><strong>Should you apply the 80/20 rule, or should you focus on the details?</strong></p>
<p><a href="http://blog.ellenchisa.com/2011/12/15/9-the-hard-part-of-the-80-20-rule/" target="_blank">Ellen Chisa</a> pointed out this contradiction on her blog, specifically in the context of product development. It ties into the concept of <em>Minimum Viable Product (MVP)</em> which is unfortunately often cited as an excuse to cut corners and ship half-baked products into the market. 80/20 style prioritization lets you achieve more output with fixed time/money. But it makes an implicit assumption that you are optimizing for raw efficiency. What if that is not true?</p>
<p>Imagine you are playing <em>Super Mario</em> for a moment. If you get 95% through a level but then die, you start again from the beginning. You are rewarded not for your average performance, but for the number of <em>absolute wins</em> you achieve. You can fail at that 95% over and over, and walk away with a 90% average but without making any real progress to the next level. In the context of product development, you are not optimizing for <em>average happiness of a user</em> but rather <em>number of users happy enough to sign-up / buy</em>. In this sense, users are a fungible unit of success.</p>
<p>If you spread your resources out with the 80/20 rule, you could launch 5x the number of features, but at an 80% quality level. This could get you 5x the exposure, or perhaps 5x the engagement, but it does not necessarily lead to 5x the sales / conversions. Imagine a user has some intrinsic standard for how well a solution must fit their needs to sign-up or buy. If this &ldquo;bar&rdquo; falls above 80%, then you might lose all your 5x users to a bunch of niche competitors that serve their specific needs at a theoretical 90% level.</p>
<p>It may make more sense to focus your resources on developing something at a 95-100% level but with only 20% of the scope. This involves saying no to 80% of opportunities/features. As a result, you might get objectively fewer users into the start of your funnel. But assuming that your product is well-executed—that you didn’t waste these theoretical resources—then you should have a far higher conversion than in the 80/20 scenario.</p>
<h2 id="moving-fast-vs-patience">Moving fast vs. patience <a class="anchor" href="#moving-fast-vs-patience">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h2><p><strong>Is it better to have the time lead of being first-to-market or the lower risk of being a close second?</strong></p>
<p>Using the &ldquo;first mover advantage&rdquo; is a classic business school strategy. It is completely logical in industries such as telecom or social networks, where customers are locked-in and there are strong network effects at play. Yet many first-mover activities center around creating a market, and the resulting gains are not always defensible by a single company. Competitors can get a &ldquo;free ride&rdquo; on your push for regulatory change or established supply chains. When does it make sense to be a trailblazer, and <a href="http://insight.kellogg.northwestern.edu/article/the_second_mover_advantage" target="_blank">when does it make sense to tuck yourself into the slipstream of the current leader</a>?</p>

      ]]></content:encoded></item><item><title>The Wirecutter: on trust, and satisficing</title><link>https://geoffruddock.com/the-wirecutter-on-trust-and-satisficing/</link><pubDate>Sunday, 21 Jun 2015</pubDate><guid>https://geoffruddock.com/the-wirecutter-on-trust-and-satisficing/</guid><description>&lt;p>I am a big fan of the consumer editorial site &lt;a href="http://thewirecutter.com/" target="_blank">The Wirecutter&lt;/a>. They earned a position in my stack of newsletter subscriptions for their help with simplifying tech purchasing decisions.&lt;/p>
&lt;p>In his book &lt;a href="https://en.wikipedia.org/wiki/The_Paradox_of_Choice" target="_blank">The Paradox of Choice: Why More is Less&lt;/a>, Barry Schwartz lays out a dichotomy of people’s decision-making behaviour. Some people are &lt;em>maximizers&lt;/em>—those who strive to make the optimal decision. Others are &lt;em>satisficers&lt;/em>—those who make a decision as soon as it meets their criteria. Mr. Schwartz’s thesis is that satisficers are happier than maximizers in the long-run. Although their average decision is less optimal, it requires much less effort. Maximizer-behaviour is useful for high-stakes irreversible decisions, but most decisions are not like that. It is difficult to be a maximizer with the sheer volume of smaller decisions we face on a daily basis.&lt;/p></description><content:encoded><![CDATA[
        <p>I am a big fan of the consumer editorial site <a href="http://thewirecutter.com/" target="_blank">The Wirecutter</a>. They earned a position in my stack of newsletter subscriptions for their help with simplifying tech purchasing decisions.</p>
<p>In his book <a href="https://en.wikipedia.org/wiki/The_Paradox_of_Choice" target="_blank">The Paradox of Choice: Why More is Less</a>, Barry Schwartz lays out a dichotomy of people’s decision-making behaviour. Some people are <em>maximizers</em>—those who strive to make the optimal decision. Others are <em>satisficers</em>—those who make a decision as soon as it meets their criteria. Mr. Schwartz’s thesis is that satisficers are happier than maximizers in the long-run. Although their average decision is less optimal, it requires much less effort. Maximizer-behaviour is useful for high-stakes irreversible decisions, but most decisions are not like that. It is difficult to be a maximizer with the sheer volume of smaller decisions we face on a daily basis.</p>
<p>One example that can be surprisingly taxing is deciding what TV, camera, charger, BBQ, or washing machine to buy. You might have strong preferences about some of these, but it is more than likely that you are not familiar with most of the above product categories. Making a truly informed decision requires that you first familiarize yourself with the offerings in the market. Then you must prioritize your own requirements and analyze each option, before coming to a decision. If you make the wrong decision, you will be reminded of it every time you use the product over the next few years.</p>
<p>Previously, I never trusted any single review enough to consider it more than a single data-point. Look up a review on Engadget, Gizmodo, The Verge, and CNET, and they often all offer conflicting opinions on the same product. But The Wirecutter is different.</p>
<p>First, the reviews are centred around <em>user problems</em> (Which X should I buy?) rather than <em>tech solutions</em> (Review of the new Z 2.0). The editor aggregates reviews from across the web on a select group of options and reports the results. This serves as a “one-stop” source of information instead of as a single data-point.</p>
<p>Second, each review leads with a summary of the recommendation and a link to buy on Amazon. But underneath this summary is a comprehensive breakdown of the logic behind that decision. There are sections such as <em>Why you should trust us</em>, <em>Flaws but not deal-breakers</em> as well as alternative recommendations based on niche use-cases.</p>
<p>On my first couple visits to The Wirecutter, I read the entire page—in classic maximizer behaviour. But after making a few purchasing decisions based on their advice, I have developed a great deal of trust in the editorial team from The Wirecutter. Now I often only skim the review—and if it is a less critical decision, I will simply buy their top recommendation without much extra thought. In a sense, it has allowed me to outsource the burden of “maximizing” tech purchasing decisions to a trusted third-party.</p>
<p>The ultimate test of trust in tech decisions is to ask yourself: “Would I recommend this to my mother?”. If you recommend the wrong product, you might find yourself fixing it or providing support for your next few Thanksgiving Dinners. For me, The Wirecutter has passed this test. Whenever Mom asks for advice on something I have no familiarity with (“Which dashcam should I buy?”), I just link her to The Wirecutter.</p>

      ]]></content:encoded></item><item><title>Problem Spaces</title><link>https://geoffruddock.com/problem-spaces/</link><pubDate>Monday, 18 May 2015</pubDate><guid>https://geoffruddock.com/problem-spaces/</guid><description>&lt;p>A &lt;a href="http://paulgraham.com/startupideas.html" target="_blank">common thread of startup advice&lt;/a> is to avoid thinking about &lt;em>ideas&lt;/em> and to instead think about &lt;em>problems&lt;/em> that need to be solved. Switching to a problem-seeking mindset feels a little unnatural at first, but is ultimately a more productive way to approach the ideation process. Time spent thinking through a specific solution can quickly spiral into day-dreaming (“Wouldn’t it be cool if?…” or “Also we could do…”) which is at best a waste of time, and at worst can distract you from finding the core essence of a product.&lt;/p></description><content:encoded><![CDATA[
        <p>A <a href="http://paulgraham.com/startupideas.html" target="_blank">common thread of startup advice</a> is to avoid thinking about <em>ideas</em> and to instead think about <em>problems</em> that need to be solved. Switching to a problem-seeking mindset feels a little unnatural at first, but is ultimately a more productive way to approach the ideation process. Time spent thinking through a specific solution can quickly spiral into day-dreaming (“Wouldn’t it be cool if?…” or “Also we could do…”) which is at best a waste of time, and at worst can distract you from finding the core essence of a product.</p>
<p>Lately I’ve been making more of an effort to focus on problems that need solving, instead of ideas. I’ve noticed some common threads between loosely-related problems, which have crystallized into <em>problem spaces</em> that I find myself thinking about repeatedly. These problem spaces encompass a few related problems that could be solved in a variety of totally unrelated ways. Here are a few that have been on my mind recently.</p>
<h3 id="preservingfriendships">Preserving friendships <a class="anchor" href="#preservingfriendships">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>100 years ago, people had to make a conscious effort to stay in touch with friends, especially over long distances. Today, who we interact with on a daily basis is largely decided by social media algorithms. An unintended consequence of this switch is that it is <a href="http://thelede.blogs.nytimes.com/2008/04/22/a-simple-bff-strategy-confirmed-by-scientists" target="_blank">remarkably easy to fall out of touch with certain friends</a>, especially after moving to a new city, a new country, or a new stage of life. If we allow Facebook to curate our social interactions, we risk falling out of touch with those who slip between the cracks of the news feed algorithm. How can I mitigate this, to ensure that 5 or 10 years from now I am still closely in touch with important people in my life?</p>
<h3 id="individualizedtraveladvice">Individualized travel advice <a class="anchor" href="#individualizedtraveladvice">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p>There is no shortage of services that aggregate travel advice and recommendations, such as TripAdvisor, WikiTravel, or Yelp. These solutions are definitely more responsive and granular than published travel guides, but they still fall short of providing individually tailored or curated advice. I have had a couple of disappointing experiences with these, specifically when visiting a destination where I am far from the target demographic, and where the popular recommendations do not appeal to me. Similar to how curated email newsletters are replacing aggregated news—at least for myself—I think there is room to apply a more curated approach to travel recommendations. The tricky part seems to be finding and picking a trusted curator for a “disposable” source of information. I can subscribe to 10 newsletters and then pick the best one in a month, and this is still worthwhile if I come out of it with a trusted source that I will read for years. But I can’t justify this same level of trial and investment to find a good source of information on a city I will be in for a single weekend.</p>
<h3 id="semi-socialphotosharing">Semi-social photo sharing <a class="anchor" href="#semi-socialphotosharing">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><a href="http://www.engadget.com/2013/08/23/do-we-really-need-yet-another-photo-sharing-app-no-we-do-not/" target="_blank">There is a cliche</a> about the amount of effort entrepreneurs spend on photo-sharing apps, which proposes that there are much more important and worthy problems to be solved. Nevertheless, I think the experience of sharing photos is still ripe for innovation. Up until a couple years ago, the entire space was focused on social, ignoring the entire spectrum of situations where I may want to share a photo but not to <em>my entire social network</em>. Snapchat changed this, introducing <em>ephemeral messaging</em> that addresses the more personal and/or frivolous end of the spectrum. But there is still a big space in between—where I want to share <em>some</em> photos with <em>some</em> people but it may not be a conscious effort, and it doesn’t need to be ephemeral. <em>Instagram Direct</em> is interesting in this regard, because it allows you to address the some people/some photos part, albeit in a very conscious fashion. But ultimately I still find myself with an offline library of my photos that simply don’t end up being shared, but that friends love to flip through on my phone. I wonder if this space could benefit from machine learning – if an app could figure out who I was at the bar with last night, and then suggest sharing with a selective list of people who may care about my slightly blurry, definitely not Instagram-worthy pictures.</p>
<p>I may expand on these problem spaces here in more detail in the future, as I continue to think about them. If you are thinking about similar areas, drop me a line and let’s talk.</p>

      ]]></content:encoded></item><item><title>Tool of the week: Blinkist Daily</title><link>https://geoffruddock.com/blinkist-daily/</link><pubDate>Saturday, 09 May 2015</pubDate><guid>https://geoffruddock.com/blinkist-daily/</guid><description>&lt;p>I’m a big fan of &lt;a href="https://www.blinkist.com" target="_blank">Blinkist&lt;/a>, which is a subscription service that provides really well-written summaries of popular non-fiction books. These aren’t the SparkNotes you remember from your high school days—each summary is split into thematic bites, and the information is presented in a form that is already partially synthesized.&lt;/p>
&lt;p>Each day Blinkist offers free access to one of their new summaries through &lt;a href="https://www.blinkist.com/daily" target="_blank">Blinkist Daily&lt;/a>. I find that the curation of books they use for Blinkist Daily is very high-quality, and I can usually find at least 2 summaries per week that I am interested in. It’s a similar model to &lt;a href="https://www.creativelive.com/" target="_blank">Creative Live&lt;/a>, where the initial &lt;em>live&lt;/em> screening/viewing is free, but you can pay for access to the catalog of old content.&lt;/p></description><content:encoded><![CDATA[
        <p>I’m a big fan of <a href="https://www.blinkist.com" target="_blank">Blinkist</a>, which is a subscription service that provides really well-written summaries of popular non-fiction books. These aren’t the SparkNotes you remember from your high school days—each summary is split into thematic bites, and the information is presented in a form that is already partially synthesized.</p>
<p>Each day Blinkist offers free access to one of their new summaries through <a href="https://www.blinkist.com/daily" target="_blank">Blinkist Daily</a>. I find that the curation of books they use for Blinkist Daily is very high-quality, and I can usually find at least 2 summaries per week that I am interested in. It’s a similar model to <a href="https://www.creativelive.com/" target="_blank">Creative Live</a>, where the initial <em>live</em> screening/viewing is free, but you can pay for access to the catalog of old content.</p>
<p>So I found myself reading 2-3 blinks per week through Blinkist Daily. Eventually I picked up an annual subscription to the core Blinkist service, which lets you push summaries to your Kindle. What is interesting is that I have found myself using the service <em>less</em> now that I am paying for it than when I was mooching off the free 24-hour summaries from Blinkist Daily.</p>
<p>In some perverse way, having unlimited access to their entire library of information at my fingertips reduces my usage of the service. I don’t know if this is necessarily something wrong with the core product as much as it is something brilliant about <em>Blinkist Daily</em>. Curating a single summary per day and offering it for a fixed period of time simultaneously reduces the <em>decision fatigue</em> of choosing what to learn, and also introduces an element of scarcity in the form of a hard <em>deadline</em> at which point the summary disappears forever.</p>

      ]]></content:encoded></item><item><title>Resources on Product Management</title><link>https://geoffruddock.com/resources-on-product-management/</link><pubDate>Sunday, 12 Apr 2015</pubDate><guid>https://geoffruddock.com/resources-on-product-management/</guid><description>&lt;p>When I started as a Product Manager last year, I knew I had a lot to learn. I scoured through the internet, reading everything I could find on Product Management and how to succeed starting out as a non-technical PM. I have compiled a list of some of the most useful things I have read, partially so that I can revisit them myself from time-to-time.&lt;/p>
&lt;p>Some of these articles are not strictly product-related—many of them involve design, project management, and elements of software development. The PM role varies greatly between companies, and often involves stepping in to fill whatever necessary gaps exist in order to ship a successful product. &lt;/p></description><content:encoded><![CDATA[
        <p>When I started as a Product Manager last year, I knew I had a lot to learn. I scoured through the internet, reading everything I could find on Product Management and how to succeed starting out as a non-technical PM. I have compiled a list of some of the most useful things I have read, partially so that I can revisit them myself from time-to-time.</p>
<p>Some of these articles are not strictly product-related—many of them involve design, project management, and elements of software development. The PM role varies greatly between companies, and often involves stepping in to fill whatever necessary gaps exist in order to ship a successful product. </p>
<h3 id="product">Product <a class="anchor" href="#product">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><blockquote>
<p> Good product managers crisply define the target, the “what” (as opposed to the how) and manage the delivery of the “what.”</p></blockquote>
<p><a href="https://a16z.com/2012/06/15/good-product-managerbad-product-manager/">Good Product Manager / Bad Product Manager</a> – A note by Ben Horowitz that is worth a re-read every few months.</p>
<blockquote>
<p>I believe great taste can be developed but not in a linear manner that is predictable or time bound. The best, and perhaps only way, to develop great taste is to be interdisciplinary and to gather a large variety of life experiences to draw upon. This is why Steve Jobs’ focus on the intersection of technology and liberal arts has always made a lot of sense to me.</p></blockquote>
<p><a href="http://bubba.vc/2014/12/08/the-three-skills-of-a-great-pm/">The Three Skills of a Great PM</a> – Some core skills of an effective Product Manager, distilled into 3 semi-quantified “rates.”</p>
<blockquote>
<p>Product management may be the one job that the organization would get along fine without (at least for a good while). Without engineers, nothing would get built. Without sales people, nothing is sold. Without designers, the product looks like crap. But in a world without PMs, everyone simply fills in the gap and goes on with their lives. It’s important to remember that – as a PM, you’re expendable. Now, in the long run great product management usually makes the difference between winning and losing, but you have to prove it.</p></blockquote>
<p><a href="https://www.kennethnorton.com/essays/productmanager.html">How to Hire a Product Manager</a> – An essay by Ken Norton from Google Ventures. Although it is ostensibly written for people looking to hire a PM, it gives great insight into the role for someone <em>trying to get hired</em>.</p>
<h3 id="design">Design <a class="anchor" href="#design">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><blockquote>
<p>There is your product and then there is the experience someone has using your product. It’s easy to see the difference from afar, but to the person using your product they are one in the same. This cannot be understated. Every interaction with your product/service/company matters and becomes part of the product experience.</p></blockquote>
<p><a href="http://bokardo.com/archives/experience-product/">The experience is the product</a> – Joshua Porter on the inseparability of the product itself and the experience that surrounds it. </p>
<p><a href="http://usabilitypost.com/2010/11/17/the-design-of-everyday-things/">Book review: The Design of Everyday Things</a> – A good summary of one of the classic books on Design by Don Norman. This summary convinced me to read the whole book. </p>
<blockquote>
<p>The single easiest way to see things through the eyes of your new user is to simply watch your user interacting with your product for the first time and talk to her about the experience. Don’t try to do this without help from your users. You know way too much.</p></blockquote>
<p><a href="http://usersknow.blogspot.com/2012/06/you-know-too-much.html">You Know Too Much</a> – Laura Klein on why it is so difficult to keep users’ conceptual models in our heads, and why it is essential to watch users interacting with our product.</p>
<blockquote>
<p>Learning about your customer is the single most important part of your startup. If you’re outsourcing that to a person who isn’t directly responsible for making critical product decisions, then you are making a horrible mistake.</p></blockquote>
<p><a href="http://usersknow.blogspot.com/2012/11/startups-shouldnt-hire-user-researchers.html">Startups Shouldn’t Hire User Researchers</a> – Laura Klein on why user research should be a responsibility of PMs.</p>
<blockquote>
<p>The fact is, understanding what your users like and don’t like about your product doesn’t mean giving up on your vision. You don’t have to make every single change suggested by your users. You don’t have to sacrifice a coherent design to the whims of a focus group.</p></blockquote>
<p><a href="http://usersknow.blogspot.com/2009/09/6-stupid-excuses-for-not-getting.html">6 Stupid Excuses for Not Getting Feedback</a> – There is a good chance that you recognize at least one of these excuses. </p>
<blockquote>
<p>One of the main reasons I like the thinking aloud method of user testing is that it gives us insights into a user’s mental model. When users verbalize what they think, believe, and predict while they use your design, you can piece together much of their mental model.</p></blockquote>
<p><a href="http://www.nngroup.com/articles/mental-models/">Mental Models</a> – Jakob Nielsen explains how good design needs to consider the mental models that users carry while interacting with your product. Good design is always a moving target.</p>
<h3 id="execution--shipping">Execution / Shipping <a class="anchor" href="#execution--shipping">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><p><a href="http://sivers.org/multiply">Ideas are just a multiplier of execution</a> – People generally overestimate the value of ideas and strategy, while underestimating the critical importance of execution. Business students in case study discussions will spend 70 minutes fiercely debating high-level strategic issues, then round off the final 10 minutes on operations and execution by saying “Then we’ll hire an engineering team and build out the product.” </p>
<blockquote>
<p>There is no later for your customers. The only thing that matters is what they’re using right now. They don’t give a shit about your roadmap, your brilliant feature pipeline, or your vision of a better future. They’re trying to get work done right now and they only know what you’ve already delivered. So build a discipline around your launches, knowing that your temporary, let’s get this out quickly and iterate later release is the current reality for your customers. Build up your attention to detail and force yourself to treat every launch like it is your final launch. Imagine that you’ll never be able to deploy something after this…have you done your best work?</p></blockquote>
<p><a href="http://bokardo.com/archives/later/">There is no later for your customers</a> – “Just Ship It” is not an excuse to release a sub-par, unfinished product to the world. The concept of <em>Minimum Viable Product</em> is also wrongly applied in this regard as an excuse to ship something half-baked. </p>
<blockquote>
<p>This is one of the reasons why B2B applications often get away with being so awful and hard to use. If a product helps me do my job better and makes me more money, it’s solving a big problem for me. I’ll put up with a few missing features or a less than stellar experience.</p></blockquote>
<p><a href="http://usersknow.blogspot.com/2013/10/how-bad-can-i-make-my-product.html">How Bad Can I Make My Product?</a> – A good litmus test for determining approximately how much you should be sacrificing release quality and <em>polish</em> for speed.</p>
<blockquote>
<p>So what makes an idea guy an idea guy? Usually it’s the simple fact that they don’t have any other skills to bring to the startup.</p></blockquote>
<p><a href="https://web.archive.org/web/20170809095131/http://www.tonywright.com/2007/5-reasons-you-dont-want-to-partner-with-an-idea-guy/">5 Reasons you don’t want to partner with an “Idea Guy”</a> – People unfamiliar with the industry often equate Product with being the <em>Idea Guy</em>. Ensure that you do not fall into this trap as a non-technical PM: always focus on delivering tangible value through analytics, testing, and pursuing a deep understanding of the customer.</p>
<h3 id="software-development">Software Development <a class="anchor" href="#software-development">
    <i class="fas fa-hashtag anchor-link"></i>
    
</a></h3><blockquote>
<p>The work of implementing a feature initially is often a tiny fraction of the work to support that feature over the lifetime of a product, and yes, we can “just” code any logic someone dreams up. What might take two weeks right now adds a marginal cost to every engineering project we’ll take on in this product in the future. In fact, I’d argue that the initial time spent implementing a feature is one of the least interesting data points to consider when weighing the cost and benefit of a feature.</p></blockquote>
<p><a href="http://firstround.com/review/The-one-cost-engineers-and-product-managers-dont-consider/">The One Cost Engineers and Product Managers Don’t Consider</a> – Without acutely understanding the compounding effect of complexity costs on engineering resources, the product organization can find it increasingly difficult to ship new features. </p>
<blockquote>
<p>The key is to understand that the root cause of all this grief about commitments is when these commitments are made. They are made too early. They are made before we know if we can actually deliver on this obligation, and even more important, if what we deliver will actually solve the problem for the customer.</p></blockquote>
<p><a href="http://www.svpg.com/managing-commitments-in-an-agile-team/">Managing Commitments in an Agile Team</a> – Making meaningful estimates on software development projects is a long-standing problem that has had entire books written about it. Understand how you can manage expectations with stakeholders and work with engineering to avoid the all-too-typical disappointments of time and cost overruns on poorly made estimates.</p>

      ]]></content:encoded></item><item><title>Thoughts on managing recurring tasks</title><link>https://geoffruddock.com/managing-recurring-tasks/</link><pubDate>Wednesday, 17 Sep 2014</pubDate><guid>https://geoffruddock.com/managing-recurring-tasks/</guid><description>&lt;p>Most people use some combination of a calendar and todo list to organize their lives, whether it be a paper organizer or one of the myriad task list apps that pop up every day in the App Store. Personally I use a combination of Google Calendar and Todoist. Working together, these two do a pretty good job of keeping me organized. That said, the one type of task I have found awkward to manage are those tasks that you’d like to complete on a regular basis, but aren’t particularly time sensitive. Stuff like changing your bed sheets, backing up your computer, or cleaning up your itunes library.&lt;/p></description><content:encoded><![CDATA[
        <p>Most people use some combination of a calendar and todo list to organize their lives, whether it be a paper organizer or one of the myriad task list apps that pop up every day in the App Store. Personally I use a combination of Google Calendar and Todoist. Working together, these two do a pretty good job of keeping me organized. That said, the one type of task I have found awkward to manage is the kind that you’d like to complete on a regular basis, but that isn’t particularly time sensitive. Stuff like changing your bed sheets, backing up your computer, or cleaning up your iTunes library.</p>
<p><strong>They don’t belong on your todo list.</strong> It doesn’t make sense to clutter up your todo list with an endless stream of recurring tasks that aren’t relevant to your day-to-day goals. <a href="http://lifehacker.com/5853732/take-a-more-realistic-approach-to-your-to+do-list-with-the-3-%252B-2-rule">A cluttered todo list reduces your effectiveness</a>, so you should be striving to keep it as clean as possible.</p>
<p><strong>Neither do they belong in your calendar.</strong> These tasks don’t need to be done on a specific day or at a specific time. Treating these tasks as calendar events just clutters up your calendar with events that you probably won’t respect, and makes it more likely you’ll lose track of something important.</p>
<p><strong>The solution:</strong> Augment your organizational system with an app specifically designed for recurring tasks. The two best such apps are <a href="https://appadvice.com/review/keep-those-pesky-recurring-tasks-in-check-with-radar">Radar</a> (iOS, $1.99) and <a href="https://play.google.com/store/apps/details?id=com.ugglynoodle.regularly&hl=en">Regularly</a> (Android, free). Radar is specifically designed to handle those recurring tasks that don’t quite fit into either your calendar or your todo list. For Android users, Regularly offers similar functionality, although it isn’t quite as aesthetically pleasing.</p>
<p>These apps let you add recurring tasks and specify how often you want to do them, measured in number of days, weeks, or months. Then they keep you on track with a list of upcoming tasks and push notifications when they are due.</p>
<p>What is great about using these apps is that you aren’t imposing false deadlines on tasks that are in reality quite flexible. If you don’t feel like dusting out your PC today, you can just do it tomorrow, but the app will still make sure you get to it every six months. When “Call Mom” pops up, you don’t need to do it immediately, but you know to plan on doing it at some point over the next couple of days. Radar and Regularly really start to shine when you begin to add a bunch of tasks with longer horizons, such as checking your stock portfolio or changing your air filter. I have a list of around 30 semi-regular chores and tasks, so every week when I check the app I just do a couple of things to stay on the ball.</p>

      ]]></content:encoded></item></channel></rss>