Geoff Ruddock

Convert clipboard HTML contents to Markdown with Alfred

Friday, 14 Jan 2022

Goal

All of my personal notes are written in markdown. I use Obsidian to manage them, but the specific tool is not relevant for the purposes of this post.

When referencing things in Google Docs, I find myself generally linking to the doc, rather than copy/pasting, because the default copy/paste output is poorly formatted, and is tedious to correct. This is sub-optimal, because then I cannot surface the contents via search in Obsidian, unless the text of the URL matches my keywords.

So the goal here is to make it trivially easy to copy/paste from GDocs into a markdown format, hopefully resulting in me doing that more frequently, resulting in more useful results when searching my vault.

Easier to copy/paste → save more content → better search results → less time searching for things

Alternatives:

If you’re happy with copy/pasting into another window to convert, you may consider running a local instance of google-docs-to-markdown, or—if you are not converting any sensitive data—even just using the demo web applet.
If you have administrative access to your Google Workspace, you may consider installing the Docs to Markdown add-on.

Getting HTML contents from clipboard

You may have wondered at some point: why does copied text appear differently depending on which app I paste it in? For example, text copied from a Google Doc will appear identical when pasted into another Google Doc, but will render as plain text when pasted into a barebones text editor.

It’s worth understanding how the clipboard works. Most relevant for us:

When the copy command is invoked, the active application can offer a variety of potential formats, incl. HTML, RTF, and plain text.
When the paste command is invoked, the active application chooses which format to receive.

If you’re using a Mac, you can use the free Clipboard Viewer application to inspect the different formats that are offered by the application from which you are copying.

Convert HTML to Markdown

Now that we have the HTML contents of our clipboard, we need to convert it to markdown.

Example HTML

Let’s start with the optimistic scenario in which we have perfectly structured HTML, notably:

Text formatting is represented in semantic elements such as: strong (bold), em (italics), or del (strikethrough).
Nested lists are wrapped inside an li tag, not placed directly below the parent list.

from IPython.display import HTML, display

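example_html = """
<h4>Example text</h4>
<h5>Text</h5>
<ul>
    <li>This text is <strong>bold</strong></li>
    <li>This text is <em>italicized</em></li>
    <li>This text is <del>strikethrough</del></li>
    <li>Some <code>func = lambda x: print(x)</code> inline code</li>
    <li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li>
</ul>
<h5>Nested lists</h5>
<p>ul &gt; ul</p>
<ul>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ul>
<p>ul &gt; ol</p>
<ul>
    <li>B1</li>
    <li><ol>
        <li>B1a</li>
        <li>B1b</li>
    </ol></li>
    <li>B2</li>
</ul>
<p>ol &gt; ul</p>
<ol>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ol>
<p>ol &gt; ol</p>
<ol>
    <li>A1</li>
    <li><ol>
        <li>A1a</li>
        <li>A1b</li>
    </ol></li>
    <li>A2</li>
</ol>
"""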

display(HTML(example_html))

Example text

Text

This text is bold
This text is italicized
This text is ~~strikethrough~~
Some func = lambda x: print(x) inline code
A link: Wikipedia

Nested lists

ul > ul

A1
- A1a
- A1b
A2

ul > ol

B1
1. B1a
2. B1b
B2

ol > ul

A1
- A1a
- A1b
A2

ol > ol

A1
1. A1a
2. A1b
A2

Markdownify

In this idealized scenario, the markdownify library converts our HTML reasonably well out-of-the-box.

from markdownify import markdownify as md

print(md(example_html, heading_style='ATX', bullets='-'))

#### Example text


##### Text


- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)


##### Nested lists


ul > ul


- A1
- - A1a
	- A1b
- A2


ul > ol


- B1
- 1. B1a
	2. B1b
- B2


ol > ul


1. A1
2. - A1a
	- A1b
3. A2


ol > ol


1. A1
2. 1. A1a
	2. A1b
3. A2

But there are a few issues:

The first child of nested lists gets a double list marker instead of proper indentation
Ordered list numbers do not properly reset at each level
There are unnecessary double empty lines
Output uses tab characters (\t) instead of four spaces.

Fix list indentation

We can fix everything except for the wrong numbering using a few simple regular expressions.

import re

def strip(x: str) -> str:
    return ''.join(x.split('\n'))

md_text = md(example_html, heading_style='ATX', bullets='-')
    
find_replace_pairs = [
    ('- - ',          '\t- '),    # fix indent on first child of ul > ul
    (r'- (\d\.)',     r'\t\1'),   # fix indent on first child of ul > ol
    (r'\d\. - ',      r'\t- '),   # fix indent on first child of ol > ul
    (r'\d\. (\d)\.',  r'\t\1.'),  # fix indent on first child of ol > ol
    ('\t',            '    '),    # replace tabs with four spaces
    ('\n\n',          '\n'),      # remove extra line breaks
]

for f, r in find_replace_pairs:
    md_text = re.sub(f, r, md_text, flags=re.MULTILINE)

print(md_text)

#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
3. A2

ol > ol

1. A1
    1. A1a
    2. A1b
3. A2

Re-number ordered lists

To fix the numbering of ordered list items, we’ll write a function:

def renumber_list(md_text: str) -> str:
    """ Replace ordered list markers with the correct number. """
    
    # Track how many previous items have appeared, at each level of indentation
    prev_items_at_level = [0] * 10

    lines = md_text.split('\n')
    for idx, line in enumerate(lines):
            
        # If line is a list item (either ordered or unordered) …
        if match := re.match(r'(\s*)([-\d])', line):

            # Infer level based on number of leading spaces
            level = len(match.group(1)) // 4

            # If line is an ordered list item …
            if re.match(r'\s*\d+\.', line):

                # Replace the whole (possibly multi-digit) marker, update counter
                marker = prev_items_at_level[level] + 1
                lines[idx] = re.sub(r'^(\s*)\d+', rf'\g<1>{marker}', line)
                prev_items_at_level[level] += 1
                
            # Reset counters for deeper levels
            prev_items_at_level[level+1:] = [0] * (len(prev_items_at_level) - level - 1)

        # If not a list item, reset all counters
        else:
            prev_items_at_level = [0] * 10
            
    return '\n'.join(lines)
    
print(renumber_list(md_text))

#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
2. A2

ol > ol

1. A1
    1. A1a
    2. A1b
2. A2

All together: html2md

import unicodedata

def html2md(html: str) -> str:
    """ Convert HTML to markdown, then fix list indentation, spacing, and numbering. """
    md_text = md(html, heading_style='ATX', bullets='-')
    
    find_replace_pairs = [
        ('- - ',          '\t- '),    # fix indent on first child of ul > ul
        (r'- (\d\.)',     r'\t\1'),   # fix indent on first child of ul > ol
        (r'\d\. - ',      r'\t- '),   # fix indent on first child of ol > ul
        (r'\d\. (\d)\.',  r'\t\1.'),  # fix indent on first child of ol > ol
        ('\t',            '    '),    # replace tabs with four spaces
        ('\n\n\n',          '\n\n'),  # remove extra line breaks
    ]
    
    # some websites mis-encode non-breaking spaces (&nbsp;), which shows up as a stray 'Â'
    md_text = md_text.replace(u'\xa0', u' ').replace('Â', '')

    for f, r in find_replace_pairs:
        md_text = re.sub(f, r, md_text, flags=re.MULTILINE)
        
    return renumber_list(md_text).strip()

print(html2md(example_html))

#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
2. A2

ol > ol

1. A1
    1. A1a
    2. A1b
2. A2
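The `Â` cleanup inside `html2md` is easier to follow with a concrete case. A non-breaking space (U+00A0) is the two bytes `C2 A0` in UTF-8; if those bytes are later decoded as Latin-1, they come out as `Â` followed by a non-breaking space. The sketch below reproduces that round trip. The names `mojibake` and `clean_nbsp` are hypothetical helpers for illustration, not part of the original script.

```python
NBSP = '\xa0'  # the character that &nbsp; decodes to


def mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being mis-decoded as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def clean_nbsp(text: str) -> str:
    """Hypothetical helper: the same two replacements html2md applies."""
    return text.replace('\xa0', ' ').replace('Â', '')


garbled = mojibake(f'10{NBSP}km')
print(repr(garbled))              # '10Â\xa0km'
print(repr(clean_nbsp(garbled)))  # '10 km'
```

One caveat with this approach: it also removes any legitimate `Â` in the text, which is probably acceptable for English notes but may not be for other languages.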

Clean up HTML of Google Docs

When copying from a Google Doc, the HTML structure is not quite as pristine as above, so we’ll need to perform some pre-processing steps on the HTML before converting it. To help with this, we’ll import Beautiful Soup, a Python package for parsing HTML. Then we’ll write a few transformation functions, loosely inspired by the google-docs-to-markdown JavaScript library.

Default output

The default output is a mess:

All of the text formatting (besides the link) is missing.
The nested lists are not properly indented, and have some empty lines.
The entire chunk of text is formatted as bold (wrapped in **).

gdoc_html = """
Example text
Text
This text is bold.
This text is italicized.
This text is strikethrough.
Some func = lambda x: print(x) inline code.
A link with text.
Nested lists
ul > ul
A1
A1a
A1b
A2
ul > ol
B1
B1a
B1b
B2
ol > ul
C1
C1a
C1b
C2
ol > ol
D1
D1a
D1b
D2



"""

print(html2md((gdoc_html)))

**# Example text

### Text

- This text is bold.
- This text is italicized.
- This text is strikethrough.
- Some func = lambda x: print(x) inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

ul > ul

- A1
- A1a
- A1b

- A2

ul > ol

- B1
1. B1a
2. B1b

- B2

ol > ul

1. C1
- C1a
- C1b

1. C2

ol > ol

1. D1
2. D1a
3. D1b

1. D2**

Inline styles to semantic tags

The lack of text formatting is caused by the fact that Google Docs does not use proper semantic tags, but rather puts text inside a span and styles it with inline CSS.

bold_html = """

    This text is 
    bold
    .

"""

We can fix this by using regex to search for elements which contain the relevant styles, then wrapping those elements in the proper semantic tag.

While we’re at it, we can also parse inline code. Although Google Docs does not natively support inline code, we can fake it by treating any text using the font Courier New (the most common monospace font) as intended to be code.

from bs4 import BeautifulSoup

def parse_gdoc_inline_styles(soup: BeautifulSoup) -> None:
    """ GDocs uses inline styles on spans instead of semantic HTML tags. """
    
    for tag in soup():
        
        # Don't convert inline styles inside headings
        if tag.parent.name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
            continue
            
        style = tag.get('style')
        if style:
            if re.search(r'font-weight:\s?700', style):
                _ = tag.wrap(soup.new_tag('strong'))
            if re.search(r'font-style:\s?italic', style):
                _ = tag.wrap(soup.new_tag('em'))
            if re.search(r'text-decoration:\s?line-through', style):
                _ = tag.wrap(soup.new_tag('del'))
            if re.search(r"font-family:\s?'Courier New'", style):
                _ = tag.wrap(soup.new_tag('code'))

soup = BeautifulSoup(bold_html, 'html.parser')
parse_gdoc_inline_styles(soup)

print(html2md(strip(str(soup))))

This text is **bold**.
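As an alternative to matching the raw `style` string with regex, the declarations can be split into a dictionary first and then looked up by property name. The sketch below uses only the standard library; `parse_style` and `semantic_tags` are hypothetical names, and treating `font-weight: bold` as equivalent to `700` is an assumption that goes beyond what the function above checks.

```python
def parse_style(style: str) -> dict:
    """Split an inline style attribute into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue  # skip empty fragments such as a trailing ';'
        prop, value = part.split(':', 1)
        declarations[prop.strip().lower()] = value.strip()
    return declarations


def semantic_tags(style: str) -> list:
    """Return the semantic tag names implied by a GDocs-style inline style."""
    css = parse_style(style)
    tags = []
    if css.get('font-weight') in ('700', 'bold'):  # 'bold' is an assumption
        tags.append('strong')
    if css.get('font-style') == 'italic':
        tags.append('em')
    if 'line-through' in css.get('text-decoration', ''):
        tags.append('del')
    if 'courier new' in css.get('font-family', '').lower():
        tags.append('code')
    return tags


print(semantic_tags("font-weight:700; font-style: italic"))
print(semantic_tags("font-family:'Courier New',monospace; font-weight:400"))
```

The first call should report `strong` and `em`, the second only `code`. The returned names could then be passed to `soup.new_tag(...)` in the same way as above.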

Wrap naked `ul` elements

The second issue is that Google Docs drops nested lists (ul, ol elements) directly inside the parent list, without wrapping them in an li tag, as our markdown converter expects.

nested_list_html = """

    ul > ul
    
        A1
        
            A1a
            A1b
        
        A2
    
    ul > ol
    
        B1
        
            B1a
            B1b
        
        B2
    
    ol > ul
    
        A1
        
            A1a
            A1b
        
        A2
    
    ol > ol
    
        A1
        
            A1a
            A1b
        
        A2
    

"""

# display(HTML(nested_list_html))

Again, we can use Beautiful Soup to find these instances and manually enclose the child list in an li element.

def wrap_naked_lists(soup: BeautifulSoup) -> None:
    """ GDocs does not wrap nested lists in an <li> tag. """
    for tag in soup():
        if tag.name in ['ul', 'ol'] and tag.parent.name in ['ul', 'ol']:
            tag.wrap(soup.new_tag('li'))

soup = BeautifulSoup(nested_list_html, 'html.parser')
wrap_naked_lists(soup)
html = strip(str(soup))

print(html2md(html))

- ul > ul
    - A1
        - A1a
        - A1b
    - A2
- ul > ol
    - B1
        1. B1a
        2. B1b
    - B2
- ol > ul
    1. A1
        - A1a
        - A1b
    2. A2
- ol > ol
    1. A1
        1. A1a
        2. A1b
    2. A2

Strip unnecessary tags

Finally, we remove a bunch of unnecessary tags, incl.

The outer b element that the Google Doc seems to be weirdly wrapped in.
Any unnecessary layers of wrapping, incl. p, span.
Nested code tags, which are not a gdoc-specific issue, but which another common collaboration tool seems to produce.

def strip_unnecessary_tags(soup: BeautifulSoup) -> None:
    """ Remove various unnecessary tags, for easier debugging. """
    for i, tag in enumerate(soup()):
        
        # gdocs seems to wrap the entire HTML contents in a <b> tag for some reason
        if i <= 3 and tag.name == 'b':
            tag.unwrap()
        
        # gdocs also includes 1-2 meta tags at the top of the content
        if tag.name == 'meta':
            tag.extract()
        
        # strip attributes, mostly just to make debugging easier
        for attribute in ['class', 'value', 'rel', 'target', 'dir', 'aria-level', 'role', 'id', 'style']:
            del tag[attribute]

        # unwrap nested text formatting
        if tag.parent:
            if tag.name in ['strong', 'b'] and tag.parent.name in ['strong', 'b']:
                tag.unwrap()
            if tag.name in ['em', 'i'] and tag.parent.name in ['em', 'i']:
                tag.unwrap()

Combined: `prep_gdoc_html`

Putting everything together, our output looks much better!

def prep_gdoc_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')

    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    strip_unnecessary_tags(soup)

    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)

final_html = prep_gdoc_html(gdoc_html)
print(html2md(final_html))

# Example text

### Text

- This text is **bold**.
- This text is *italicized*.
- This text is ~~strikethrough~~.
- Some `func = lambda x: print(x)` inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

**ul > ul**

- A1
    - A1a
    - A1b
- A2

**ul > ol**

- B1
    1. B1a
    2. B1b
- B2

**ol > ul**

1. C1
    - C1a
    - C1b
2. C2

**ol > ol**

1. D1
    1. D1a
    2. D1b
2. D2

Clean up HTML of Quip docs
Quip is another real-time collaboration tool you may find yourself wanting to copy markdown from.

A regular copy/paste from a Quip document generally looks better than one from a Google Doc. This is because Quip attempts to copy markdown into the plain text contents of the clipboard. It works well on lists, but fails to encode headings, italics, strikethrough, or inline code.

Default quip output
A few things are broken:

Document-level lines of text are not wrapped in block elements, which results in the lack of appropriate spacing between these lines and subsequent elements (headings, lists).

The inline code gets wrapped in double backticks, but it should be a single pair.

Note: Quip does not support mixed (ol/ul) list nesting, so those examples are excluded from the test text.

quip_html = """
HTML to Markdown
Text
A single line
A line split with a line break
Another single line
Text formatting
This text is bold
This text is italicized
This text is strikethrough
Some func = lambda x: print(x) inline code
A link: Wikipedia
Nested list
ul > ul
A1
A1a
A1a1
A1a2
A1b
A2
ol > ol
D1
D1a
D1a1
D1a2
D1b
D2
""".strip()

#print(html2md(quip_html))
print(html2md(prep_gdoc_html(quip_html)))

# HTML to Markdown

## Text

A single line
A line split with a line break
Another single line
## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol > ol**
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2

Wrap top-level text elements
The code below works reasonably well. It doesn’t properly handle hard linebreaks that should not be split into separate paragraphs, but I can’t think of an elegant solution to this off the top of my head.

from pprint import pprint

def wrap_top_level_text(soup: BeautifulSoup, debug=False) -> None:
    """ Wrap bare document-level text in <p> tags, using <br> tags as separators. """
    for tag in soup.find_all('br'):

        # Only proceed if br is at the root document level, and previous tag exists
        if tag.parent and tag.parent.name == '[document]' and tag.previous_sibling:
            # print(f'Prev: {tag.previous_sibling}')
            # print(f'Next: {tag.next_sibling}')

            if tag.previous_sibling.name is None or tag.previous_sibling.name in ['b']:
                tag.previous_sibling.wrap(soup.new_tag('p'))

                # If next tag is also a br, remove that too
                if tag.next_sibling and tag.next_sibling.name == 'br':
                    tag.next_sibling.extract()

                tag.extract()

            # Remove unnecessary trailing br after block-level elements
            elif tag.previous_sibling.name in ['ul', 'ol']:
                tag.extract()

    if debug:
        pprint(str(soup))

preproc_html = prep_gdoc_html(quip_html)
soup = BeautifulSoup(preproc_html, 'html.parser')
wrap_naked_lists(soup)
wrap_top_level_text(soup)
html = strip(str(soup))

print(html2md(html))

# HTML to Markdown

## Text

A single line

A line split with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol > ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2

Unwrap nested code blocks
nested_code_html = """
Some <code><code>func = lambda x: print(x)</code></code> inline code.
"""

def unwrap_nested_code_blocks(soup: BeautifulSoup) -> None:
    """ Quip wraps inline code twice, which confuses markdownify. """
    for tag in soup.find_all('code'):
        if tag.parent.name == 'code':
            tag.unwrap()

soup = BeautifulSoup(nested_code_html, 'html.parser')
unwrap_nested_code_blocks(soup)

print(f'Before: {html2md(nested_code_html)}')
print(f'After: {html2md(str(soup))}')

Before: Some ``func = lambda x: print(x)`` inline code.
After: Some `func = lambda x: print(x)` inline code.

Combined: prep_quip_html
def prep_quip_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')

    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)

    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)

print(html2md(prep_quip_html(quip_html)))

# HTML to Markdown

## Text

A single line

A line split with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol > ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2

Combined: prep_any_html
Now let’s combine everything into a single generic pre-processing function, and double-check that the outputs match those of the app-specific functions that we checked above.

def prep_any_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')

    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)

    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)

assert prep_any_html(quip_html) == prep_quip_html(quip_html)
assert prep_any_html(gdoc_html) == prep_gdoc_html(gdoc_html)

Creating an Alfred workflow
The actual Alfred workflow is relatively straightforward.

Invoke

Invoke the workflow—either using Alfred’s universal actions after highlighting the text, or using the h2m keyword trigger.

Get clipboard contents as HTML

It gets the actual clipboard contents by running this shell command:

osascript -e 'the clipboard as «class HTML»' | perl -ne 'print chr foreach unpack("C*",pack("H*",substr($_,11,-3)))' | cat
Contents are stored in hex, so the perl command converts to ASCII.
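For readers who would rather do the decoding step in Python than perl, here is a minimal sketch. It assumes osascript prints the clipboard as a literal of the form `«data HTML<hex digits>»`, which is what the offsets in the perl one-liner imply, and that the bytes are UTF-8 encoded. `decode_applescript_html` is a hypothetical helper name.

```python
import re


def decode_applescript_html(raw: str) -> str:
    """Decode the «data HTML…» literal printed by osascript into text."""
    match = re.search(r'«data HTML([0-9A-Fa-f]*)»', raw)
    if match is None:
        raise ValueError('no «data HTML…» literal found')
    # bytes.fromhex turns each pair of hex digits into one byte.
    # UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
    return bytes.fromhex(match.group(1)).decode('utf-8', errors='replace')


sample = '«data HTML3C623E68693C2F623E»\n'
print(decode_applescript_html(sample))  # <b>hi</b>
```

Matching on the delimiters rather than on fixed character offsets avoids depending on how many bytes `«` occupies in a given encoding.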

Convert and copy

Then it runs the clipboard contents through html2markdown.py—a script which contains all of the logic we built out earlier—and copies the output markdown back to the clipboard.

Scraping PNG icons for emoji with Python

Sunday, 31 Oct 2021

Motivation
I put together an emoji search Alfred workflow which uses alfy to filter this JSON file of emoji.

There are plenty of existing emoji Alfred workflows around, but I wanted one that allowed me to edit the aliases for individual emoji.

The one missing piece was to have the workflow display the emoji itself as the icon for each result. The Alfred Script Filter JSON Format includes an icon field, but it expects the path to an actual icon file on disk.

This simply will not do!
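To make the target concrete, here is a rough sketch of the Script Filter JSON that the workflow would need to emit once the icons exist on disk. The `emoji` and `description` keys match the fields used from emoji.json later in this post; the `icons` directory and the function name `script_filter_items` are assumptions for illustration.

```python
import json


def script_filter_items(emoji_list, icon_dir='icons'):
    """Build an Alfred Script Filter payload with one PNG icon per emoji."""
    items = []
    for e in emoji_list:
        name = e['description'].replace(' ', '-')
        items.append({
            'title': e['emoji'],
            'subtitle': e['description'],
            'arg': e['emoji'],                            # value passed to the next workflow step
            'icon': {'path': f'{icon_dir}/{name}.png'},   # must be a file on disk
        })
    return {'items': items}


sample = [{'emoji': '😀', 'description': 'grinning face'}]
print(json.dumps(script_filter_items(sample), ensure_ascii=False, indent=2))
```

The file naming here (spaces replaced by hyphens) is the same convention used when the icons are downloaded below.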

Read list of emoji
First, let’s read in the emoji.json file mentioned above.

import json

with open('emoji.json', 'r') as f:
    emoji_json = json.loads(f.read())

print(f'Number of emoji: {len(emoji_json)}')

Number of emoji: 1812

Here are the first 100 emoji contained inside.

for e in emoji_json[0:100]:
    print(e['emoji'], end=' ')
😀 😃 😄 😁 😆 😅 🤣 😂 🙂 🙃 😉 😊 😇 🥰 😍 🤩 😘 😗 ☺️ 😚 😙 🥲 😋 😛 😜 🤪 😝 🤑 🤗 🤭 🤫 🤔 🤐 🤨 😐 😑 😶 😶‍🌫️ 😏 😒 🙄 😬 😮‍💨 🤥 😌 😔 😪 🤤 😴 😷 🤒 🤕 🤢 🤮 🤧 🥵 🥶 🥴 😵 😵‍💫 🤯 🤠 🥳 🥸 😎 🤓 🧐 😕 😟 🙁 ☹️ 😮 😯 😲 😳 🥺 😦 😧 😨 😰 😥 😢 😭 😱 😖 😣 😞 😓 😩 😫 🥱 😤 😡 😠 🤬 😈 👿 💀 ☠️ 💩

Easy mode: Twitter emoji
I found the emojificate library, which makes straightforward use of the Twemoji CDN to fetch various sizes of Twitter-style emoji.

from IPython.display import Image

def get_png_url(char: str) -> str:
    """ Pulled from: https://github.com/glasnt/emojificate/blob/latest/emojificate/filter.py"""
    cdn_fmt = "https://twemoji.maxcdn.com/v/latest/72x72/{codepoint}.png"

    def codepoint(codes):
        # See https://github.com/twitter/twemoji/issues/419#issuecomment-637360325
        if "200d" not in codes:
            return "-".join([c for c in codes if c != "fe0f"])
        return "-".join(codes)

    return cdn_fmt.format(codepoint=codepoint(["{cp:x}".format(cp=ord(c)) for c in char]))

url = get_png_url('🐿️')
Image(url)

import requests
from tqdm.notebook import tqdm
from IPython.display import clear_output

def download_png(url: str, name: str) -> None:
    """Download a specific png file to disk."""
    with open(f'twitter-icons/{name}.png', 'wb') as f:
        img_data = requests.get(url).content
        f.write(img_data)

for e in tqdm(emoji_json, total=len(emoji_json)):
    fp = e['description'].replace(' ', '-')
    url = get_png_url(e['emoji'])
    download_png(url, fp)

clear_output()

That was easy!

Hard mode: scraping from unicode.org
The above works perfectly for Twitter emoji, but what if we want the Apple emoji?

Inspired by this StackOverflow question—Programmatically get a PNG for a unicode emoji—we could also scrape icons from this page: https://unicode.org/emoji/charts/full-emoji-list.html.

Fetch HTML
emoji_page_html = requests.get('https://unicode.org/emoji/charts/full-emoji-list.html').text
Strip variation selectors
One small gotcha here, which will otherwise mess with our regex matches, is that some emoji are optionally followed by an invisible variation selector character. This is meant to specify that the character should be rendered as a colorful emoji rather than as a text-style glyph, but it seems to be appended to many emoji which don’t have an obvious text representation, such as the chipmunk 🐿️.

We’ll strip these (trailing) characters from our emoji.json inputs, and write our regex to optionally match them, if present in the unicode table.

import re
import pandas as pd

def strip_variation_selectors(emoji: str) -> str:
    return re.sub(u'[\ufe00-\ufe0f]$', '', emoji)

emoji_df = (
    pd.DataFrame(emoji_json)
    .assign(emoji=lambda x: x['emoji'].apply(strip_variation_selectors))
    .assign(
        name=lambda x: x['description'].apply(lambda x: x.replace(' ', '-')),
        length=lambda x: x['emoji'].apply(len),
        split=lambda x: x['emoji'].apply(list)
    )
    .loc[:, ['name', 'emoji', 'length']]
)

emoji_df['length'].value_counts()

1    1320
2     258
3     190
4      13
5      13
7      12
8       3
6       3
Name: length, dtype: int64

Extract using regex
def extract_emoji_from_html(emoji: str, version=0) -> str:
    #html_search_string = r""
    html_search_string = r"{}(?:[\ufe00-\ufe0f])?'(?: title='.+')? class='imga' src='data:image\/png;base64,([^']+)'>"
    matchlist = re.findall(html_search_string.format(emoji), emoji_page_html)
    return matchlist[version]

emoji_b64 = {}

for _, df in tqdm(emoji_df[['name', 'emoji']].iterrows(), total=emoji_df.shape[0]):
    name, emoji = df['name'], df['emoji']
    try:
        emoji_b64[name] = extract_emoji_from_html(emoji)
    except IndexError:
        pass

clear_output()
len(emoji_b64)

1811

Check results
is_found = pd.DataFrame({'is_found': 1}, index=emoji_b64.keys()).sort_index()

joined = (
    pd.merge(emoji_df.set_index('name'), is_found, left_index=True, right_index=True, how='left')
    .fillna(0)
)

#joined.groupby('length')['is_found'].mean()
joined.loc[lambda x: x['is_found'] == 0]

name         emoji   length   is_found
keycap:-*    *️⃣      3        0.0

Write to files
import base64

for name, img_data in emoji_b64.items():
    b64 = base64.b64decode(img_data)
    with open(f'apple-icons/{name}.png', 'wb') as f:
        f.write(b64)

clear_output()

Success!
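One cheap sanity check before writing files like this is to confirm that each decoded payload actually starts with the 8-byte PNG signature. The sketch below is an optional extra, not part of the original notebook; `decode_png` is a hypothetical helper, and the sample payload is fabricated in place rather than taken from the scraped page.

```python
import base64

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def decode_png(b64_data: str) -> bytes:
    """Decode base64 image data, rejecting anything that is not a PNG."""
    raw = base64.b64decode(b64_data)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError('decoded data is not a PNG')
    return raw


# Stand-in payload: a real signature followed by filler bytes
fake_png = base64.b64encode(PNG_SIGNATURE + b'rest of file').decode('ascii')
print(len(decode_png(fake_png)))  # 20
```

This only checks the header, so it catches a regex that captured the wrong attribute, but not a truncated image.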

Appendix: multi-character emoji
import pandas as pd

emoji_df = (
    pd.DataFrame(emoji_json)
    .assign(
        name=lambda x: x['description'].apply(lambda x: x.replace(' ', '-')),
        split=lambda x: x['emoji'].apply(lambda x: [x.replace('\u200d', 'ZWJ').replace('\ufe0f', 'VS') for x in list(x)]),
        length=lambda x: x['emoji'].apply(len)
    )
    .loc[:, ['name', 'emoji', 'length', 'split']]
)

emoji_df['length'].value_counts()
What is going on here?!

Length-two
Most (but not all) of these emoji are unchanged by stripping the trailing variation selector character.

(
    emoji_df
    .loc[lambda x: (x['length'] == 2) & (x['split'].apply(lambda x: x[-1]) == 'VS')]
    .assign(stripped=lambda x: x['emoji'].apply(lambda y: y.strip('\ufe0f')))
    .sample(10)
)
Besides trailing variation selectors, some length-two emoji are emoji flag sequences, which are made up of two “regional indicator” characters.

(
    emoji_df
    .loc[lambda x: x['length'] == 2, :]
    .loc[lambda x: x['emoji'].apply(lambda y: list(y)[1]) != '\ufe0f']
    .sample(5)
)
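The flag case is easy to reproduce by hand: each letter of a two-letter country code maps to one regional indicator symbol, starting at U+1F1E6 for "A". A minimal sketch, with `flag_emoji` as a hypothetical helper name:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # 🇦, the regional indicator for "A"


def flag_emoji(country_code: str) -> str:
    """Build a flag sequence from a two-letter country code such as 'CA'."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError('expected two ASCII letters')
    # Offset each letter from 'A' into the regional indicator block
    return ''.join(chr(REGIONAL_INDICATOR_A + ord(c) - ord('A')) for c in code)


flag = flag_emoji('CA')
print(flag, len(flag), [f'U+{ord(c):X}' for c in flag])
```

The result has length two with no variation selector or joiner involved, which matches the rows selected by the filter above.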
Length-three
Most length-three emojis are created by joining multiple emojis together using a zero-width joiner character.

emoji_df.loc[lambda x: x['length'] == 3, :].head(30).sample(10)
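The same structure can be built in reverse. As a small illustration, joining "man" and "laptop" with a zero-width joiner (U+200D) gives the "man technologist" emoji:

```python
ZWJ = '\u200d'  # zero-width joiner

man, laptop = '\U0001F468', '\U0001F4BB'  # 👨 and 💻
technologist = man + ZWJ + laptop         # 👨‍💻

print(technologist, len(technologist))    # one glyph on most platforms, length 3
print(technologist.split(ZWJ))            # the two component emoji
```

Whether the sequence renders as a single glyph depends on the font; platforms that lack the combined image fall back to showing the components side by side.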
Length-four
Emoji of length four seem to be composites of two other emoji, a ZWJ, and a seemingly unnecessary variation selector.

(
    emoji_df.loc[lambda x: x['length'] == 4]
    .assign(stripped=lambda x: x['emoji'].apply(lambda y: y.strip('\ufe0f')))
)
Length-five
Length five emoji seem to be some combination of:

Sequences of two emoji, incl. two unnecessary variation selector characters.

Sequences of three emoji, joined by two ZWJ.

(
    emoji_df
    .loc[lambda x: x['length'] == 5]
    .assign(stripped=lambda x: x['emoji'].str.replace('\ufe0f', ''))
    .assign(length_stripped=lambda x: x['stripped'].apply(len))
)
Further reading

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to batch modify dates of daily journal files

Thursday, 16 Sep 2021

Goal
I’ve been using Obsidian as the primary hub for personal notes for the past year. My daily notes act as a sort of captain’s log, and have superseded my use of a dedicated journaling app. So I exported all my journal entries to markdown, and added them to my Obsidian vault as daily notes.

In the process of migrating journal entries between apps over the years¹, I must have messed up some metadata at some point, because I just realized today that all my entries before a certain point in time were wrong by one day. So I wrote a small script (below) to batch correct these files.

Setup
To help write and debug the script, I put a handful of dummy files in a staging directory, each following the YYYY-MM-DD.md naming convention.

We’ll use pathlib.Path(…).glob('*.md') to get a list of markdown files in this directory.

import datetime as dt
from pathlib import Path

p = Path('staging')
all_md_files = sorted(list(p.glob('*.md')))

# Print the contents of each file
for f in all_md_files:
    print('\n' + '=' * len(f.name))
    print(f.name)
    print('=' * len(f.name) + '\n')
    with open(f, 'r') as open_file:
        print(open_file.read())

=============
2013-12-31.md
=============

# 2013-12-31

This file should become 2014-01-01.

=============
2014-01-01.md
=============

# 2014-01-01

This file should become 2014-01-02.

=============
2014-01-03.md
=============

# 2014-01-03

This file should remain as 2014-01-03.

Note that these files contain the date in their names, but also as an h1 heading within the file itself, so we’ll need to change both.

The script
My original issue only affected files up to a certain date, so let’s filter the list of markdown files.

Because we are incrementing the dates of files, we’ll want to work through the list in reverse order. Before making yesterday today, we must make today tomorrow—else there will be a conflict.

selected_files = sorted([f for f in all_md_files if f.stem <= '2014-01-02'], reverse=True)

for f in selected_files:
    print(f.name)

2014-01-01.md
2013-12-31.md

The actual work to be done here is relatively simple:

Convert string to datetime, increment by 1d, convert back to string.

Replace references to the previous date string with the new one, inside file contents.

Rename the file itself.

def replace_in_file(fp: str, old: str, new: str) -> None:
    """ Replace 'old' strings with 'new' strings in a given file (fp) """
    with open(fp, 'r') as open_file:
        old_contents = open_file.read()

    new_contents = old_contents.replace(old, new)

    with open(fp, 'w') as open_file:
        open_file.write(new_contents)

for f in selected_files:
    old_dt = dt.datetime.strptime(f.stem, '%Y-%m-%d')
    new_dt = (old_dt + dt.timedelta(days=1))
    new_ds = new_dt.strftime('%Y-%m-%d')

    replace_in_file(f, f.stem, new_ds)
    f.rename(f.parent / (new_ds + '.md'))
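Before letting a loop like this touch real files, it can be reassuring to compute the rename plan as plain data and check it for collisions. The sketch below works on file stems only and never touches the disk; `plan_renames` is a hypothetical helper, not something the original script uses.

```python
import datetime as dt


def plan_renames(stems, cutoff):
    """Return (old, new) stem pairs, latest first, for stems <= cutoff."""
    existing = set(stems)
    plan = []
    # Work backwards in time so each target name is free before it is needed
    for stem in sorted((s for s in stems if s <= cutoff), reverse=True):
        new = (dt.date.fromisoformat(stem) + dt.timedelta(days=1)).isoformat()
        if new in existing:
            raise FileExistsError(f'{stem} -> {new} would overwrite an existing note')
        existing.discard(stem)
        existing.add(new)
        plan.append((stem, new))
    return plan


print(plan_renames(['2013-12-31', '2014-01-01', '2014-01-03'], cutoff='2014-01-02'))
```

For the three staging files this should yield the same two renames, in the same order, that the loop above performs. If the cutoff is chosen so that the newest shifted file would land on an untouched one, the function raises instead of silently overwriting.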
Check results
Visually inspecting the files in the staging directory, we can see the final result matches what we hoped to achieve.

for f in sorted(list(p.glob('*.md'))):
    print('\n' + '=' * len(f.name))
    print(f.name)
    print('=' * len(f.name) + '\n')
    with open(f, 'r') as open_file:
        print(open_file.read())

=============
2014-01-01.md
=============

# 2014-01-01

This file should become 2014-01-01.

=============
2014-01-02.md
=============

# 2014-01-02

This file should become 2014-01-02.

=============
2014-01-03.md
=============

# 2014-01-03

This file should remain as 2014-01-03.

If you’re running a script that modifies your files in-place like this, be sure to have recent, working backups before you start, in case something goes wrong!

¹ I started journaling with OhLife, switched to Dabble.Me after OhLife shut down in 2014, then most recently ported everything over to Day One in 2019.

Soundproofing a Synology NAS

Sunday, 14 Mar 2021

After years of using SSDs, I had almost entirely forgotten how annoying the sound of actual, spinning hard disk platters is. That is, until I bought a Synology DS420+ NAS earlier this year to set up as a home media server. (Lockdown projects, yay!)

The usual advice here is to simply place your NAS somewhere out of earshot. But if you want to connect your NAS via a 1 Gbps ethernet hookup, your choices may be more constrained. In my case, there was only one suitable location—in my home office, on the shelf behind my desk.

A prelude on acoustics
I don’t have much prior knowledge of acoustics, beyond knowing that decibels are measured on a $\log_{10}$ scale. To be honest, I’m still kind of talking out of my ass here, but I realized in retrospect that thinking through the nature of the problem before jumping to solutions would have saved me some wasted effort.

Source of the sound
As far as I can tell, there are three fundamental sources of sound coming from my NAS:

Spinning disks – A constant hum that is present whenever the drive is powered up—which, for a NAS, is probably all the time. Drives that spin at 5400 rpm are generally—but not always—quieter than comparable disks that spin at 7200 rpm.

Read/write operations – The quintessential “hard drive sound”, which occurs when the drive is actually being used, and so can vary minute-to-minute.

Cooling fan – A relatively constant hum, but whose pitch and volume can vary, based on how much cooling your NAS decides it needs.

Travel path of the sound
While all sound is technically just vibrations being transmitted through the air, there are a few different “paths” it can take, which has practical implications for how you should approach noise reduction:

Between disk drive and NAS unit

Between NAS unit and hard surface

Directly through the air

Isolate drives from enclosure
The most common recommendation that comes up for “Noisy Synology NAS” is to attach some sort of velcro/felt/foam to the drive sleds, along the lines of this youtube video. Although I found the stock sleds to be reasonably snug, adding some velcro/felt/foam padding removes any remaining vertical wiggle room within which the drive itself could vibrate.

I initially tried attaching foam velcro dots to the drive sleds themselves, but didn’t notice any major improvement. Unsure of whether the flaw was in the solution or my implementation, I picked up a strip of single-sided 1 mm foam tape, and tried again, this time taping the metal on the enclosure itself. This provided a nice snug fit for the drive sleds, but once again did not result in any noticeable difference in sound.

I had some velcro dots lying around.

But 1 mm foam tape works even better.

Why didn’t this help much? I suspect that a particular combination of particular hard drives spinning at a particular speed may cause a resonance problem for some people. If your NAS is generally quiet, but occasionally gets extremely noisy, this may be your problem. If your NAS is consistently loud, you can probably skip this entirely.

Isolate enclosure from surface
Even with foam padding, the hard drives are physically still connected to the NAS board, and so will transmit some vibrations through to the enclosure itself. So my next attempt was to reduce the amount of vibrations being transmitted from the NAS itself to the surface it sits on. My NAS is sitting on an IKEA Kallax shelf, and I could clearly feel the read/write vibrations being transmitted into the shelf by feeling from the cubby below.

You can feel the vibrations propagating through the shelving unit.

I picked up a set of three-layer EVA anti-vibration pads recommended in this How-To Geek article. They kind of helped, but I could still clearly feel the write operations through the shelf. In retrospect, these pads seem to be designed for much heavier equipment, incl. washing machines and air compressors. So they are probably much more dense than would be optimal for a relatively lightweight NAS.

These EVA foam feet didn’t help much.

What worked better was a few layers of much softer foam. I had some spare acoustic foam (originally purchased for step #3) that did the trick, but I suppose you could also just pick up a solid foam block from an arts and crafts store. Alternatively, you could perhaps create some sort of makeshift hammock to suspend the NAS in the air.

A thick bed of soft foam worked best.

Dampen sound travel through air
If your NAS is not rattling, and the vibration is not being transmitted to the surface it is sitting on, the only real remaining source of noise transmission is through the air itself.

Acoustic foam panels
I bought a set of 30 cm acoustic foam panels from AliExpress. I have no clue whether these panels have the ideal acoustic properties, but they were cheap. I installed a KALLAX door insert and then affixed the panels to the inside surfaces.

My end state.

Beware of heat
This worked very well, but introduced a different problem—heat. Even though it wasn’t a perfect seal, things got a bit toasty inside the cubby with the door closed. The NAS still worked, but I was uneasy about the long-term implications on drive lifespan. I subsequently cut out a square hole in the back of the insert and installed an 80 mm USB fan. This entirely solved the heat issue, but reintroduced some noise. It’s a reasonable trade though, because I find the constant sound from a fan to be much less intrusive than the intermittent sound of disk I/O operations.

Epilogue
What I’d do differently
If I were starting from scratch, would I take the same approach? Probably not.

My NAS is now quiet enough, but it’s not particularly modular—moving the NAS would require moving the entire shelf with the soundproof insert, or disassembling the insert and installing it in another KALLAX unit.

I may embark on building a custom soundproofed box at some point, as a potential future woodworking project. Something along the lines of this custom NAS box, but perhaps employing more surface area of insulation material, as done in this PC build.

Firmly on my “someday/maybe” project list, for now at least.

An afterword on acoustics testing
I initially intended to measure and compare the noise levels between each approach I tried, but abandoned this mid-way through when I realized that the free “noise meter” app I downloaded from Google Play wasn’t doing a great job of capturing my own perceptual sense of each approach.

Two thoughts on why measuring and comparing average dB didn’t prove to be as useful as I expected:

It is extremely sensitive to ambient factors – Before taking measurements, I put my laptop to sleep, closed the door, and made sure to breathe quietly and barely move. But even with these precautions, some ambient noise snuck in, including traffic from outside my window, or the upstairs neighbour doing laundry.

Quality of noise matters, not just quantity – Our perception of sound is not just a function of quantity but also quality. In retrospect, what I found annoying about the disk drives was not the absolute volume level, but the intermittent and “clicky” nature of the I/O operations. Because these sounds are not constant, they contribute much less to the measurements of “average dB” than they do to my own subjective perception of overall noise level.

So if I were to try this again, I would explore measuring something like the 80th percentile of noise level, rather than the average. This would make it more robust against “outlier” ambient noises, and also more accurately capture the influence of intermittent I/O operations.
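To make the average-versus-percentile idea concrete, here is a minimal sketch. The readings are made-up numbers (not measurements from my NAS), the "average" is assumed to be a plain arithmetic mean of the dB readings, and nearest_rank_percentile is a hypothetical helper using the simple nearest-rank definition of a percentile.

```python
from statistics import mean


def nearest_rank_percentile(readings, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method."""
    ordered = sorted(readings)
    # Ceiling division in integer arithmetic gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical samples in dB, one reading per fixed interval.
idle = [30] * 19 + [70]             # steady hum plus one ambient spike
busy = [30] * 14 + [45] * 5 + [70]  # same, but with intermittent disk I/O

for label, readings in [("idle", idle), ("busy", busy)]:
    print(label, "mean:", mean(readings), "p80:", nearest_rank_percentile(readings, 80))
```

With these invented numbers, the mean only moves from 32 to 35.75 dB when the disk is busy, while the 80th percentile jumps from 30 to 45 dB, and the single 70 dB spike does not affect the percentile at all. Whether real recordings behave this way depends on how often the I/O noise actually occurs.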

Turn on your thermostat before an alarm with Tasker (Android)

Thursday, 03 Dec 2020

My main Black Friday purchase this year was a Tado° system (thermostat + smart radiator valves), which I acquired with the goal in mind of regulating the temperature in my bedroom by:

Turning the heat down early enough to be consistently cold at night.

Turning the heat up in the morning to make it easier to wake up. Ideally 1h before.

The first part is easy, but the second part can’t quite be achieved, at least out-of-the-box. If your daily sleep schedule is consistent, you can just set a corresponding heating schedule in the Tado app. But my own wake-up time varies between 7a-9a, and I found it cumbersome to change the “Smart Schedule” every evening in anticipation of when I would wake up the following morning.

Why IFTTT isn’t enough
Tado’s IFTTT integration is a good start, but it isn’t perfect. It is relatively straight-forward to create a recipe that turns on our heating when our Android alarm goes off. But ideally it would trigger some amount of time ahead of the alarm, so that the room has time to actually rise to temperature.

IFTTT doesn’t have any capability for this sort of pre-trigger logic. I toyed with triggering based on a “dummy” alarm set 1h ahead, but ultimately realized that actions trigger on the alarm being dismissed rather than going off.

How-to
There are two things we need to do, which we can achieve using a combination of IFTTT and Tasker:

Calculate when to trigger the heat, based on the next alarm set.

Actually trigger the Tado° thermostat to turn on at that time.

Calculate when to trigger
We’ll use the AutoAlarm plugin (Google Play) to enable Tasker to see the time of the next alarm we have set. These instructions are heavily based on this forum post, which includes a video walk-through that is useful if you’re not familiar with the Tasker interface.

Set the to/from window based on when you normally wake up.

Create a new time-based profile (Tasker → Profiles → (+) → Time)

From/to → if you use other android alarms throughout the day (besides for waking up) you’ll want to set a window here to only run around when your wake-up alarm should be. I used 5a-10a.

Every → I set it to run every 5 minutes, but I suppose you could run it less frequently to minimize effect on battery. This is less of a concern for me personally, because my phone is always plugged in overnight.

Add the following actions to your profile

Plugins → AutoAlarm → this will run the plugin, exposing a %seconds variable which contains the number of seconds until our next alarm.

Set a variable %minsBefore to the number of minutes ahead of your alarm that you want your heat to turn on. I used 60 here, but adjust accordingly.

Set a variable %TriggerHeatAtSec that calculates the formula round(%TIMES + %seconds - (%minsBefore * 60)). Make sure the variable name starts with an uppercase character, so that it is available in “global scope” for later use by a different profile. If you’re trying to understand this formula: %TIMES is a built-in Tasker variable that returns the current time in UNIX format (number of seconds since January 1970, don’t ask).

If you use other alarms throughout the day, also set the variable %NextAlarmHour so it can be used as a condition later.
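If the formula for %TriggerHeatAtSec looks opaque, here is the same arithmetic as a small Python sketch. The function and argument names are mine (they mirror %TIMES, %seconds and %minsBefore), and the sample values are arbitrary.

```python
def trigger_heat_at_sec(now_epoch: float, seconds_until_alarm: float, mins_before: int = 60) -> int:
    """Mirror of round(%TIMES + %seconds - (%minsBefore * 60)).

    now_epoch            -> %TIMES (current UNIX time, in seconds)
    seconds_until_alarm  -> %seconds (as exposed by AutoAlarm)
    mins_before          -> %minsBefore
    """
    return round(now_epoch + seconds_until_alarm - mins_before * 60)


# Arbitrary example: the alarm is 3 hours away and we want heat 60 minutes before it.
now = 1_606_953_600
trigger = trigger_heat_at_sec(now, seconds_until_alarm=3 * 3600, mins_before=60)
print(trigger - now)  # seconds from "now" until the heat should turn on
```

In this example the trigger lands 7200 seconds (two hours) after the current time, i.e. one hour before the alarm. Note that if the alarm is less than %minsBefore minutes away, the result is a time in the past.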

☠️ There are some reports of this method sometimes not working, possibly due to other apps using alarms silently in the background. If it’s acting funky, double-check that you have the Reliable Alarms option disabled in Tasker. This option sets background alarms in the built-in Android clock app to ensure that the Tasker app does not get killed due to battery-saving settings. But it interferes with what we’re trying to do here. If this doesn’t help, you may have a different app that is causing the interference. You can try debugging with the ClockTask plugin, which has a variable that tells you which app set the next alarm.

This profile calculates the trigger time for the thermostat.

Actually trigger the thermostat
Now we need to trigger our thermostat to kick in at the previously calculated time. We could probably do this entirely via Tasker using the tado API, but using IFTTT relieves us of the burden of dealing with OAuth tokens, etc.

Connect your Tado account to IFTTT

Enable the IFTTT Webhooks channel.

Create an applet that triggers your thermostat based on a webhook event. Here is mine.

Go back to the webhooks channel page and click Documentation in the top-right to get to a page where you can test out your configuration. Fill in the name of your event and click Test It. After a few seconds, check your Tado app (or listen to your radiator valve) and you should notice that it was triggered successfully.

Go back to Tasker and create a second profile that runs from/until %TriggerHeatAtSec (once).

To prevent the logic from triggering for random alarms later in the day, you can set an additional constraint to the trigger logic: %NextAlarmHour < 13 (to only turn on heating for alarms before 1p).

This profile actually does the triggering.

Add a single task that uses HTTP Request to make a GET request to the webhooks URL we tested earlier, which will look something like https://maker.ifttt.com/trigger/{event}/with/key/{key}, where {event} is the name of your event and {key} is your personal Webhooks key.
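If you want to test the webhook from a computer rather than from Tasker, the request is simple enough to sketch in Python. The event name and key below are placeholders, and both function names are hypothetical; the only real piece is the URL pattern from the Webhooks documentation page.

```python
from urllib.parse import quote
from urllib.request import urlopen


def ifttt_webhook_url(event: str, key: str) -> str:
    """Build the Webhooks trigger URL for a given event name and personal key."""
    return (
        "https://maker.ifttt.com/trigger/"
        f"{quote(event, safe='')}/with/key/{quote(key, safe='')}"
    )


def trigger_webhook(event: str, key: str) -> int:
    """Send the GET request and return the HTTP status code (needs network access)."""
    with urlopen(ifttt_webhook_url(event, key), timeout=10) as response:
        return response.status


# Placeholder values; substitute your own event name and key.
print(ifttt_webhook_url("heat_on", "YOUR_KEY"))
```

Only the URL builder runs here. Calling trigger_webhook with your real values should have the same effect as the Test It button.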

Accidental abstract art (ft. matplotlib)

Saturday, 10 Oct 2020

A collection of accidental art that I have created while trying to plot something actually useful with matplotlib or other tools.

Interstellar

Accidental 3D render

Windows 95

The signature of time

Dante’s inferno

Keep your SQL queries DRY with Jinja templating

Wednesday, 01 Jul 2020

A use case for templating your SQL queries
Suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use-case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.

The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

While this gets the job done for our toy example, it is not particularly scalable. Suppose our event_type column had 40 possible values, instead of just four. In that case, our solution is sub-optimal on two criteria:

Readability – With 10x as many possible values, our query would be ~10x as many lines as before. While it won’t take 10x longer to read, it does impose a cognitive cost to read. This is exacerbated when we’ve got a number of sub-queries in a single file.

Maintainability – If we add new event_types in the future, this query must be updated to match. This is tedious, and introduces an opportunity for error.

Can we do better?
This is a pretty good scenario to use jinja, a Python templating library which lets us perform basic flow control (loops, conditionals) inside of text templates. It is heavily used among the Flask community, but is also well suited for data analytics with SQL.

I’ll avoid giving a mediocre regurgitation of jinja syntax here, and defer to their own excellent documentation instead. Let’s skip ahead to see what our query would look like using a jinja template.

from jinja2 import Template

sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

In our toy example, the resulting query is not that much shorter, but it has the benefit of abstracting out a variable events, which contains the list of possible values for the event_type column. In the future, we can extend this query easily by simply appending to the events list.

Whitespace
You may have noticed that I added some unexplained - characters in the blocks above to get a pretty output.

The default output without these characters is a bit ugly.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {% if not loop.last %}
        ,
    {% endif %}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,

    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send

        ,


    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver

        ,


    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open

        ,


    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click


FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Adding a minus sign (-) tells jinja to strip the whitespace before or after a block.

There are four possible positions for the minus sign:

Start of opening block

End of opening block

Start of closing block

End of closing block

Let’s take a look at the effect of adding a minus sign in each position.

Start of opening block
Adding the minus sign to the start of the opening block strips the leading whitespace outside of the for-loop. Basically, it just removes the extra line inhabited by the {%- for event in events %} block itself.

😄 This removes the empty line between date_ and the first SUM.

😢 But it does not remove the empty lines between each SUM statement.

sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

End of opening block
Adding the minus sign to the end of the opening block strips the leading whitespace within the for-loop.

😄 This removes the empty lines between each SUM statement.

😢 But it leaves an empty line between the final statement and the SQL outside of the for-loop.

sql = """
SELECT
    date_,
    {% for event in events -%}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Start of closing block
Adding the minus sign to the start of the closing block strips the trailing whitespace within the for-loop.

😄 This removes the empty lines between each SUM statement.

😢 But it leaves an empty line between the date_ and the first SUM statement.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,

    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

End of closing block
Adding the minus sign to the end of the closing block removes the trailing whitespace outside of the for-loop.

😢 This looks the worst. It leaves an empty line between each SUM statement. While it removes the final empty line between the for-loop and the rest of the untemplated SQL, it pulls FROM onto the incorrect level of indentation.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {% endfor -%}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,

    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_clickFROM raw_events
GROUP BY 1
ORDER BY 1 ASC

The ideal mix
By combining minus signs on the start of the opening block and the start of the ending block, we can tell jinja to strip the first leading empty line, and also the lines between each SUM statement.

sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Correlation matrix
Let’s look at a slightly more complicated example. Suppose we used the previous query to make a new table called daily_event_counts. Now we are interested in measuring the pairwise correlation between each type of event.

We can use the CORR() function to calculate each pair, but we need to tell the SQL engine which columns to use for each calculation. This is a good example of where the quick-and-dirty approach fails to scale. We have four types of events, but there are 4×4=16 pairwise correlations.

SELECT 'send' AS x, 'deliver' AS y, CORR(send, deliver) AS corr_ FROM daily_event_counts
UNION ALL
SELECT 'send' AS x, 'open' AS y, CORR(send, open) AS corr_ FROM daily_event_counts
In reality, we are only interested in six of these pairwise correlations. The four diagonals will just equal one, and the matrix is symmetric, so half the computations are redundant. For the sake of simplicity, let’s ignore this for now, and proceed to calculate all sixteen.
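For reference, if you did want only the six unique pairs, one option is to build the list of pairs in Python before templating. The sketch below uses plain string formatting rather than jinja to stay dependency-free; unique_pair_sql is a hypothetical helper, and the column names simply follow the example above.

```python
from itertools import combinations


def unique_pair_sql(cols, table="daily_event_counts"):
    """Build one SELECT per unordered pair of distinct columns, joined by UNION ALL."""
    selects = [
        f"SELECT '{x}' AS x, '{y}' AS y, CORR({x}, {y}) AS corr_ FROM {table}"
        for x, y in combinations(cols, 2)  # skips the diagonal and mirrored pairs
    ]
    return "\nUNION ALL\n".join(selects)


cols = ['send', 'deliver', 'open', 'click']
pair_sql = unique_pair_sql(cols)
print(pair_sql)
```

The same list of pairs from combinations() could just as well be passed into a jinja template and looped over with a single for-loop.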

Nested for-loops with jinja
Nested for-loops are relatively straightforward, but I will point out two changes:

We want to place UNION ALL after all iterations except the final one. Previously we used {%- if not loop.last -%} to check if it was the final iteration. Since we now have a nested loop, we need to keep track of two indices. We can do this by using the block {% set outer_loop = loop %} to assign the outer loop to a new variable outer_loop before it is “replaced” by the inner loop.

We add a minus sign (-) on the end of the outer opening block, to avoid getting an additional empty line between iterations of the outer loop. This gives us a consistent spacing of one empty line between each UNION ALL statement.

from jinja2 import Template

sql = """
{%- for x in cols -%}
{% set outer_loop = loop %}
{%- for y in cols %}
SELECT '{{x}}' AS x, '{{y}}' AS y, CORR({{x}}, {{y}}) AS corr_ FROM daily_event_counts
{% if not (loop.last and outer_loop.last) %}
UNION ALL
{% endif %}
{%- endfor %}
{%- endfor %}
"""

print(Template(sql).render(cols=['send', 'deliver', 'open', 'click']))

SELECT 'send' AS x, 'send' AS y, CORR(send, send) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'send' AS x, 'deliver' AS y, CORR(send, deliver) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'send' AS x, 'open' AS y, CORR(send, open) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'send' AS x, 'click' AS y, CORR(send, click) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'deliver' AS x, 'send' AS y, CORR(deliver, send) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'deliver' AS x, 'deliver' AS y, CORR(deliver, deliver) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'deliver' AS x, 'open' AS y, CORR(deliver, open) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'deliver' AS x, 'click' AS y, CORR(deliver, click) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'open' AS x, 'send' AS y, CORR(open, send) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'open' AS x, 'deliver' AS y, CORR(open, deliver) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'open' AS x, 'open' AS y, CORR(open, open) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'open' AS x, 'click' AS y, CORR(open, click) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'click' AS x, 'send' AS y, CORR(click, send) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'click' AS x, 'deliver' AS y, CORR(click, deliver) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'click' AS x, 'open' AS y, CORR(click, open) AS corr_ FROM daily_event_counts

UNION ALL

SELECT 'click' AS x, 'click' AS y, CORR(click, click) AS corr_ FROM daily_event_counts

Further reading

Template Designer Documentation: Whitespace Control

Tips & Tricks: Accessing the parent Loop

Geotagging Lightroom photos with Google Timeline data

Tuesday, 16 Jun 2020

Lightroom has a Maps view, but I have never really used it before. While all my smartphone photos are automatically geotagged, the 80% of my photos shot on my dedicated camera (Sony A7Rii) lack geodata. I have historically neglected adding geotag info as I have imported photos through the years. As a COVID lockdown project, I decided to try using the location tracking data from Google Timeline to geotag my photos “automatically”.

Download your location history from Google Takeout
The Google Takeout tool lets you export your data across a variety of Google services. We are only interested in our location history, so click Deselect all and scroll down to the Location History section.

You’ll receive an email a few minutes later with a download link. After you unzip it, you’ll find a large JSON file. Depending on how many years you’ve had Location History enabled, the file may be quite big. Mine was ~600 MB for six years of data.

⚠ Note that while the Google Timeline UI displays your location history in local time, the JSON export contains timestamps stored in UTC time. This will be important later.

Convert it to gpx format, and split by year
Google Takeout gives us our location data in a JSON format, but most map software—including Lightroom—expects a GPX format. Luckily there is an excellent location-history-json-converter python script available on GitHub which solves our exact problem.

I recommend splitting your location data by year to ensure the files are reasonably small. If you live in a timezone which observes daylight savings time, you will need to read each year’s file three separate times into the Lightroom plugin, setting the appropriate timezone each time. Even with a year of data, there is a 20-30 second lag when reading the file using the Lightroom plugin. This is annoying, but manageable.

Here is an example of the terminal commands to convert your JSON export into yearly gpx files:

# make a working directory to keep things together
mkdir location_history && cd location_history

# move your downloaded json file into our directory
# (the tilde stays outside the quotes so that the shell expands it)
mv ~/"Downloads/Location History.json" location_history.json

# download the conversion tool
git clone https://github.com/Scarygami/location-history-json-converter.git
cd location-history-json-converter

# perform the actual conversion
python location_history_json_converter.py ../location_history.json 2018.gpx \
    -s 2018-01-01 \
    -e 2019-01-01 \
    -f gpx
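Since the conversion has to be repeated once per year, it may be easier to generate the commands than to edit them by hand. Here is a small Python sketch that builds the argument list for each year, reusing the script name and the -s, -e and -f flags from the command above. The year range and the yearly_commands helper are hypothetical.

```python
import shlex


def yearly_commands(json_path, first_year, last_year):
    """Return one converter command (as an argument list) per calendar year."""
    commands = []
    for year in range(first_year, last_year + 1):
        commands.append([
            "python", "location_history_json_converter.py",
            json_path, f"{year}.gpx",
            "-s", f"{year}-01-01",      # start date of the slice
            "-e", f"{year + 1}-01-01",  # end date of the slice
            "-f", "gpx",
        ])
    return commands


cmds = yearly_commands("../location_history.json", 2018, 2019)
for cmd in cmds:
    # Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

This only prints the commands. To actually run them, call subprocess.run(cmd, check=True) from inside the location-history-json-converter directory.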
Download Lightroom plugin
Jeffrey Friedl, the king of Lightroom plugins, has an excellent geo-encoding plugin which handles most of the heavy lifting for us. It has a plethora of settings, including fuzzy matching. This is important because smartphone GPS samples less frequently than a dedicated hiking unit, so the location data frequently will not match our photo timestamps exactly.

Fix your timezones
If you’re like me, you may have been lazy with keeping your camera’s clock up-to-date with daylight savings changes, or when traveling in a different timezone. If you geotag all your photos in a single batch, you’ll likely notice that many photos fail, even with fuzzy matching enabled, and others are placed in an entirely wrong location.

Here comes the un-fun part. Crack a beer, put on some music, and spend a few hours working through your Lightroom catalog side-by-side with your Google Timeline. Here was my rough workflow:

Find an image that is of an identifiable location, then cross-reference with Google Timeline to discern whether the timestamp is correct.

Select an entire batch of photos (same trip), and adjust their timestamps in Lightroom using Menu → Metadata → Edit capture time → Shift by set number of hours.

Mark down the UTC offset for the date range of photos you processed, since you’ll need to enter this value when running the geo-encoding plugin.
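To sanity-check an offset before committing to it, it helps to convert a capture time to UTC by hand and compare it with the raw timestamps in the JSON export. A minimal sketch, assuming the camera clock shows local time and the UTC offset is known (camera_time_to_utc and the sample values are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def camera_time_to_utc(capture_time: datetime, utc_offset_hours: float) -> datetime:
    """Interpret a naive camera timestamp as local time at the given UTC offset."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return capture_time.replace(tzinfo=local_tz).astimezone(timezone.utc)


# Same wall-clock time, different offsets (e.g. summer vs. winter in a UTC+1 zone).
summer = camera_time_to_utc(datetime(2020, 6, 13, 16, 2), utc_offset_hours=2)
winter = camera_time_to_utc(datetime(2020, 1, 15, 16, 2), utc_offset_hours=1)
print(summer.isoformat())
print(winter.isoformat())
```

The same wall-clock time maps to a different UTC hour depending on the offset, which is why each batch needs the offset that was in effect when the photos were taken.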

Geotag your photos
You’ll need to work in batches, one for each timezone your photos were taken in.

Bring up the geocoding prompt in Lightroom from Library → Plugin extras → Geoencode.

Select the gpx tracklog file corresponding to the year of the photos you are encoding. It may take 20-30 seconds to read the file before allowing you to select options.

Set the UTC offset for the batch, keeping in mind daylight savings time.

If your photo library also contains smartphone photos, you probably don’t want to overwrite their location info. In this case, make sure to select Process only those still unmapped.

Click Geoencode images and take a deep breath.

Pay attention to the summary prompt → If more than 30% of photos failed, you may have missed a timezone problem.

Tidy up metadata
After running the geocoding, head over to the Maps view and spot-check the results. If your capture times and UTC offset are correct, most of the photo locations should be reasonably accurate. There will still be some weird results, simply because the raw GPS data sometimes “jumps” around, particularly when you are in rural areas and/or hilly terrain. You can tidy those up in a couple of ways:

Right-click on image → Metadata Presets → Copy Metadata → GPS, then paste onto other image.

Select images → Right-click on map to set GPS.

Simply drag the photos from the film strip onto location on the map where they belong.

Reflections
Was it worth the effort?
I spent significantly longer on this little project than I expected to. Nevertheless, I think it was a worthwhile endeavour. Here are my takeaways:

In the future I will be more vigilant about updating my camera clock when traveling, or when DST changes. I set a recurring Google calendar event for the latter.

When traveling somewhere where I do not have a data plan, I should disable roaming rather than putting my phone in airplane mode, since the latter disables GPS as well.

Ideally I should perform geotagging as part of my regular editing workflow, rather than doing it all at once.

I haven’t found a perfect solution for #3 though, since downloading my entire location history and converting it is a hassle, and not something I want to do after every day trip. In the future, I may write a script which uses Google APIs to schedule an export, convert to gpx, split into daily files, and drop it somewhere in Google Drive.

What about daily exports from Google Timeline?
Google Timeline’s web UI contains an option for exporting a single day to a KML file, but yields less accurate data than the JSON export from Google Takeout. The JSON export contains raw GPS readings, while the KML export contains the more processed data that you see in Google Timeline, where it aggregates points together into inferred paths and journeys.

Raw GPS data can be noisy, but arguably that’s what we want here.

In the KML file, the raw measurements are grouped by route with a start and an end timestamp. This presents two problems:

1. Google uses a non-standard format which is not easily converted into a gpx file with GPSBabel.

2. Raw measurements are no longer associated with specific timestamps; they could be anywhere between the start and end timestamp of their grouping. ¹

Here is an example of the data format. While it is theoretically possible to write a script to convert this into a valid gpx format to mitigate problem #1, I’m not sure this would be worthwhile, since the timestamps would still be less accurate.

On the subway On the subway from 2020-06-13T14:00:10.523Z to 2020-06-13T14:04:13.078Z. Distance 2753m clampToGround 1 1 13.401893,52.475390999999995,0 13.401893,52.475390999999995,0 13.3918281,52.4985295,0 13.391048022278145,52.49785324977018,0 2020-06-13T14:00:10.523Z 2020-06-13T14:04:13.078Z
Reverse geo-encoding
Reverse geo-encoding populates fields like City, State and Country based on the raw GPS data. Lightroom has built-in reverse geo-encoding, which I find satisfactory. But Jeffrey’s plugin can also perform reverse geo-encoding using Google location data. Apparently it is much more accurate. A key feature is that you can specify a My Maps file with custom-named locations, and the plugin will use those location names whenever possible.

The plugin can reverse-geocode via both Google and OpenStreetMap, though in order to use Google, you must create a developer’s API key, and enter that into the plugin in the Plugin Manager. (The egregiously-complex steps needed to create the Google API key are beyond my ability to explain as of yet, sorry.)

To use this feature you need to first set up a Google Cloud Platform account, then generate an API key, and enable the Geocoding API. Keep in mind that a large quantity of requests could cost you. GCP does give you $200 in free monthly credits for maps-related APIs though, which translates to roughly 40k requests. Definitely enough for occasional use, just take care not to reverse geo-encode your entire photo library in a single month.
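
As a rough sanity check on those numbers, here is a small back-of-the-envelope calculation. The per-request price is not quoted from Google's price list; it is simply what the $200 credit and the ~40k figure above imply. The one-request-per-photo assumption and the library size are made up for illustration.

```python
# Back-of-the-envelope estimate. Figures come from the paragraph above,
# not from Google's current price list (check that before relying on this).
MONTHLY_CREDIT_USD = 200
FREE_REQUESTS_PER_MONTH = 40_000

# Price of a single request implied by the two figures above
implied_price = MONTHLY_CREDIT_USD / FREE_REQUESTS_PER_MONTH


def months_to_stay_free(n_photos: int) -> int:
    """Months to spread n_photos over without exceeding the monthly credit.

    Assumes one reverse-geocoding request per photo.
    """
    return -(-n_photos // FREE_REQUESTS_PER_MONTH)  # ceiling division


def cost_if_done_at_once(n_photos: int) -> float:
    """Out-of-pocket cost (USD) of geo-encoding everything in a single month."""
    billable = max(0, n_photos - FREE_REQUESTS_PER_MONTH)
    return billable * implied_price


library_size = 100_000  # hypothetical library size
print(f"Implied price per request: ${implied_price:.3f}")
print(f"Months needed to stay free: {months_to_stay_free(library_size)}")
print(f"Cost if done in one month: ${cost_if_done_at_once(library_size):.2f}")
```

Under these assumptions, a large library is cheapest to process in monthly batches rather than in one go.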

Further reading
Guide to geotagging photos with Google location history and exiftool – A guide for geotagging using exiftool rather than Lightroom.

https://www.photools.com/community/index.php?topic=6919.0 ↩︎

How to learn mental models with spaced repetition

Friday, 01 May 2020

As a subscriber to the Farnam Street newsletter, I enjoy reading Shane’s articles about using various mental models from other disciplines to improve our decision-making. Reading about these mental models is fun, but I am cognizant of the fact that reading about something does not equate to learning it. The real measure of success for learning a mental model is: Can I reliably recall the relevant properties of this concept in a useful real-world scenario?

Since this definition of success hinges on successful recall, it is an ideal candidate for spaced repetition software. Anki is a popular spaced repetition (flashcards) app which I have used in language-learning and technical contexts over the past few years. So the question then becomes: how can we best use Anki to facilitate the process of learning mental models?

Notes, fields, and cards, oh my!
Before we continue, it will be helpful to briefly review the terminology used by Anki, as there are some subtleties here. When you are reviewing, you work through a deck (like a folder, usually grouped by theme) of cards, each with a front and back side. But when you are creating flashcards, you actually do this by choosing a note type and adding information into the fields on that note. Each note type has associated card types, which are essentially templates built with HTML and CSS to determine which fields go where.

The default note type is called Basic and has only two fields: Front and Back. This basic note type has a single card type which displays those fields on the corresponding sides of a card. This is an intuitive place to start with spaced repetition, but it is also a skeuomorph for physical flashcards. With digital flashcards, there is no limit to the number of fields and card types we can have within a single note type.

A naive approach
One of my early learnings from making mediocre Anki cards is that memory is directional. Just because you can name a concept doesn’t mean you can define it, and vice versa. So for almost any concept worth learning, we will want to ultimately generate multiple cards to test both recognition and production.

I started off by simply creating a bunch of Basic note cards for each mental model I wanted to memorize. As an example, here are four cards I created when learning about the Streetlight effect.

Front Back

What is the Streetlight Effect? An observational bias where a person who is searching for something looks only where it is easiest.

What is the name for an observational bias where a person who is searching for something looks only where it is easiest? The Streetlight Effect

What is the implication of the Streetlight Effect? We must be careful to focus our problem-solving efforts towards the area where the solution is likely to be rather than the area where we have the most data.

What does this picture represent?
The Streetlight Effect

I did this for 40+ concepts and mental models over the course of a few months. In the process, I noticed some recurring flaws with this approach.

Flaw #1: Card creation involves cognitive overhead
Every time we sit down to create new cards, we need to remember all the different front–back pairings above, and manually create a card for each pair. Suppose we occasionally forget to make a particular card for a concept. When this happens, there is no easy way for us to discover its absence later.

The flexibility of pure Front ⟷ Back cards is useful when we are encoding unstructured information, but becomes a burden when many of our cards follow a similar schema. It took me a while to develop this schema, but now when I read about a new concept I am always subconsciously asking myself:

How can I explain this concept to a five year-old?

What is the key implication or use-case for this concept?

When does this concept/tool/approach fail? When should I avoid it?

Once we have an implicit schema like this, we can leverage the power of note types and card types in Anki to automate our flashcard creation.

Flaw #2: There is boilerplate text
The above example cards each include some boilerplate text on the front of the card:

What is…

What is the name for…

What is the implication of…

What does this represent…

These chunks of text are necessary to prime our brain about what specific piece of information we are asking it to recall. But they require a non-trivial amount of time to verbally process before we can then move on to the actual act of recall. And after writing them 40+ times we will inevitably have some variation in wording, which further increases the cognitive burden. Ideally our card structure itself could signal what piece of information we are being asked to recall.

Flaw #3: It is difficult to refactor
One infrequently discussed element of spaced repetition learning is card refactoring. This is not a concern if we are using flashcards to memorize the birthdays of US presidents, as that information will never change. Our initial encoding of the information is probably good enough, even five years from now.

But what if we are using spaced repetition to learn compressible topics such as mathematics? In this case, we are almost never encoding raw factual information in our cards, but rather a snapshot of our current understanding of the topic. As we continue to develop our understanding in a topic and establish connections across disciplines, we will frequently notice that our previous understanding of a concept was imprecise or subtly wrong.

When we update our understanding, we should also update our flashcards, so that they remain relevant and valuable to us. With 5-10 disconnected cards for a topic floating around our Anki deck, it can be a burden to find and update all of the cards related to a particular topic. For example, you may have some cards related to the law of total probability, but you may also have a number of other cards which reference that concept. Simply searching for that text across all your decks will surface not just the concept cards, but also these other cards which mention the concept.

A note template for mental models
With the above criteria in mind, I built my own “concept” note type to attempt to address these issues. Here are a few screenshots to show what the cards look like on AnkiDroid. Below I explain a few of the features, and why I made particular design decisions.

Concept → Implication

Definition → Concept

Visual → Concept

Full card revealed

Centered layouts with CSS Flexbox
Designing card layouts on Anki 2.0 was a pain, but since version 2.1, Anki uses a new rendering engine which supports modern web development techniques such as CSS Flexbox. Our core design makes use of flexbox to ensure that cards are centered both horizontally and vertically on the screen. On mobile, cards typically take up the full width, but on desktop they are limited to ~800px for better readability on wide screens. The goal here is to reduce cognitive overhead when rapidly reviewing many different cards.

Blurred answers with global JavaScript
Rather than having six different Front Side and Back Side designs (one for each card type), I elected to use a single note layout across all cards, and then to simply blur the field I wish to recall. The back side of each card includes no new content; it just calls a JavaScript function which removes the blur from the recall field.

This reduces the overhead of managing 6x2=12 different HTML templates, enforces consistency, and further reduces cognitive overhead when reviewing. Since the location of the content does not change on answer reveal, there is no time required to grok the structure of the answer. There is also an implicit encoding of data type as location, which I found my brain picked up on subconsciously after a few review sessions.

Creating cards

The template lives in a custom note type called Concept which contains five fields: name, description, visual, implication, drawbacks. It is okay if you do not use every field. There are six card types which use selective card generation to only create flashcards when the necessary fields are non-empty:

Concept → Description

Concept → Implication

Concept → Drawbacks

Description → Concept

Visual → Concept

Implication → Concept

The first three deal with recognition: can we recall the specific details of the concept when directly prompted? The second three deal with production: can we identify the correct concept from a description, visual representation, or implication / use-case? While the first three cards cover many academic use-cases, the final three are the ones which more directly influence our ability to recognize and apply mental models in a real-life context. So far, this approach feels like it has worked reasonably well.
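
To make the selective generation rule concrete, here is a small Python model of it. This is only an illustration of the idea: Anki decides which cards to generate from the card templates themselves, and the `CARD_TYPES` mapping and `cards_for_note` helper are hypothetical names, not part of the template or of Anki.

```python
# Illustrative model only: Anki derives this from the card templates,
# not from a Python mapping like this one.

# card type -> (prompt field, recalled field)
CARD_TYPES = {
    "Concept → Description": ("name", "description"),
    "Concept → Implication": ("name", "implication"),
    "Concept → Drawbacks": ("name", "drawbacks"),
    "Description → Concept": ("description", "name"),
    "Visual → Concept": ("visual", "name"),
    "Implication → Concept": ("implication", "name"),
}


def cards_for_note(note: dict) -> list:
    """Return the card types whose prompt and answer fields are both non-empty."""
    def filled(field: str) -> bool:
        return bool(note.get(field, "").strip())

    return [card for card, (front, back) in CARD_TYPES.items()
            if filled(front) and filled(back)]


note = {
    "name": "Streetlight Effect",
    "description": "An observational bias where a person who is searching "
                   "for something looks only where it is easiest.",
    "implication": "",  # not filled in yet
}
print(cards_for_note(note))
# ['Concept → Description', 'Description → Concept']
```

Filling in the implication field later would add two more cards to the same note, without touching the existing ones.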

I often add a note with only 2-3 fields, then come back and add another field or two a couple months later, when I have a more nuanced understanding of a concept. I experimented with additional fields, but found that these are the minimal set which cover 80% of my card generation needs. I’m not just invoking the 80/20 cliché here—I checked my Anki stats and found that I have a 4:1 ratio of these Concept cards to my more basic Q&A card, which I use for cards that don’t fit the mould.

How to download it
You can find the template on GitHub. There is no way to import/export a note type by itself, so the workaround is to import the sample deck, which contains the note type, the necessary CSS and JavaScript, and also 30 example cards which showcase the template.

Known issues
This template is still somewhat of a work-in-progress. Here are a couple of issues that still need to be fixed. Feel free to send a PR on GitHub if you are interested in helping improve the template.

Flashing caused by MathJax
I have recently changed all of my math expressions from LaTeX to MathJax in Anki. It’s much nicer to work with, but one disadvantage is that it causes the cards to briefly “flash” when displayed, as the underlying markup is being typeset in real-time. Unfortunately I found this to be more noticeable and annoying using this template, because the rest of the card is otherwise identical, whereas on the basic card template so much of the card changes on answer reveal that the typesetting is less noticeable.

Cards which are too large for mobile
If your card has a lot of text, some of it may be hidden below the fold when reviewing on the mobile app. I am considering adding some shadow styling to indicate when there is scrollable content on a card. But one could argue that if your card has too much text to fit onto a single screen, you should break up the card into more atomic units anyways.

Identifying links across topics
Anecdotally, I feel like I recognize connections between concepts and fields more frequently than I did before. One small hiccup I have not yet solved is how to link concepts across disciplines that are closely-related but perhaps not identical. For example, after reading books about behavioural psychology, the philosophy of Stoicism and cognitive behavioural therapy (CBT), I have noticed they have many parallel concepts but with different terminology and subtly different implications.

Bulk compress videos to H.265 (x265) with ffmpeg

Tuesday, 21 Apr 2020

Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer H.265 HEVC codec. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video.

The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.¹ There are some third-party apps such as UltraCorder which support H.265, but I’d prefer to stick with the stock camera app. I frequently use the handy “double tap power button” shortcut to quickly launch my camera app, so it is important that the app which launches is able to handle both photos and videos. This rules out using a specialty video app which supports H.265 encoding.

Converting a single video
It’s pretty easy to compress these videos using the ffmpeg command line tool with something like the command below.

ffmpeg -i input.mp4 -vcodec libx265 -crf 28 output.mp4
I tried a few different quality settings before settling on the default -crf 28 . With this level, I cannot visually tell the difference in quality between compressed and uncompressed videos.

Converting videos in bulk
So ffmpeg is great, but I wanted to batch process all of my phone videos, with the following goals in mind:

Recursively search sub-directories, since I typically organize my photos and videos into a folder hierarchy.

Only compress videos which are not already compressed (by using ffprobe to detect if encoding is hevc)

Preserve the metadata (creation time, modification time) from the original video, so that chronological sort continues to work properly.

So I wrote a small python CLI tool which achieves the above goals.

The script
Put the script somewhere convenient (such as your home folder), cd to your content directory, then call the script from the terminal using python ~/compress_videos.py --recursive --file-ext=mp4 .

☠ Use at your own risk – I accept no responsibility for any data loss or mistakes caused by this script. It is a prudent idea to test it first and visually inspect the output. You should do so for each different input file you are converting. I ran into difficulty with a couple .MTS video files which were interlaced, and so required different settings to convert properly.
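
For a feel of how the three goals above can be approached, here is a minimal sketch. It is not the full script: the function names and the `_x265` output suffix are hypothetical, and it assumes `ffmpeg` and `ffprobe` are on your PATH. It writes a new file next to each original and never deletes anything.

```python
import os
import subprocess
from pathlib import Path


def find_videos(root: Path, ext: str = "mp4", recursive: bool = True) -> list:
    """List files under root with the given extension (case-insensitive)."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates
                  if p.is_file() and p.suffix.lower() == f".{ext.lower()}")


def video_codec(path: Path) -> str:
    """Ask ffprobe for the codec name of the first video stream (e.g. 'hevc')."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


def build_ffmpeg_cmd(src: Path, dst: Path, crf: int = 28) -> list:
    """Build the ffmpeg call; -map_metadata 0 copies the input's metadata tags."""
    return ["ffmpeg", "-i", str(src), "-map_metadata", "0",
            "-vcodec", "libx265", "-crf", str(crf), str(dst)]


def compress(src: Path, crf: int = 28) -> Path:
    """Compress src to H.265 next to the original, keeping its file timestamps."""
    dst = src.with_name(f"{src.stem}_x265{src.suffix}")  # hypothetical naming
    subprocess.run(build_ffmpeg_cmd(src, dst, crf), check=True)
    stat = src.stat()
    os.utime(dst, (stat.st_atime, stat.st_mtime))  # carry over access/modify times
    return dst


def compress_all(root: Path, ext: str = "mp4", recursive: bool = True) -> None:
    """Compress every video under root that is not already HEVC-encoded."""
    for video in find_videos(root, ext, recursive):
        if video_codec(video) != "hevc":
            compress(video)

# Example (not run here): compress_all(Path.cwd(), ext="mp4", recursive=True)
```

Note that `os.utime` only restores access and modification times. The creation time stored inside the video container is carried over by `-map_metadata 0`, but the filesystem creation date of the new file is not changed by this sketch.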

Further reading
CRF Guide (Constant Rate Factor in x264, x265 and libvpx) – A good overview of the difference between constant and variable bitrate encoding, and suggestions for sensible defaults for each.

A Large-Scale Comparison of x264, x265, and libvpx — a Sneak Peek (Netflix Tech Blog) ↩︎

Building an AdaBoost classifier from scratch in Python

Friday, 20 Mar 2020

Goal
A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.

Who is Ada, anyways?
Boosting refers to a family of machine learning meta-algorithms which combine the outputs of many “weak” classifiers into a powerful “committee”, where each of the weak classifiers alone may have an error rate which is only slightly better than random guessing.

The name AdaBoost stands for Adaptive Boosting, and it refers to a particular boosting algorithm in which we fit a sequence of “stumps” (decision trees with a single node and two leaves) and weight their contribution to the final vote by how accurate their predictions are. After each iteration, we re-weight the dataset to assign greater importance to data points which were misclassified by the previous weak learner, so that those data points get “special attention” during iteration $t+1$.

How does it compare to Random Forest?

Property Random Forest AdaBoost

Depth Unlimited (a full tree) Stump (single node w/ 2 leaves)

Trees grown Independently Sequentially

Votes Equal Weighted

The AdaBoost algorithm
This handout gives a good overview of the algorithm, which is useful to understand before we touch any code.

A) Initialize sample weights uniformly as $w_i^{(1)} = \frac{1}{n}$.

B) For each iteration $t$:

Find weak learner $h_t(x)$ which minimizes $\epsilon_t = \sum_{i=1}^n \mathbf{1}[h_t(x_i) \neq y_i] \, w_i^{(t)}$ .

We set a weight for our weak learner based on its accuracy: $\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$

Increase weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$.

Renormalize weights, so that $\sum_{i=1}^n w_i^{(t+1)}=1$.

C) Make final prediction as weighted majority vote of weak learner predictions: $H(x) = \text{sign} \Big( \sum_{t=1}^T \alpha_t h_t(x) \Big)$.
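
Before touching the real implementation, it can help to run step B once by hand. The sketch below uses made-up labels and predictions for five points (one of them misclassified) and plain Python rather than NumPy, so every number is easy to verify.

```python
import math

# Toy setup (made-up numbers): 5 points, the stump gets only the last one wrong
y = [+1, +1, -1, -1, +1]        # true labels
h_pred = [+1, +1, -1, -1, -1]   # hypothetical weak learner predictions
n = len(y)

# A) uniform initial weights
w = [1 / n] * n

# B) weighted error of the weak learner
eps = sum(w_i for w_i, y_i, h_i in zip(w, y, h_pred) if h_i != y_i)

# B) weight of the weak learner
alpha = 0.5 * math.log((1 - eps) / eps)

# B) up-weight misclassified points, down-weight correctly classified ones
w_new = [w_i * math.exp(-alpha * y_i * h_i)
         for w_i, y_i, h_i in zip(w, y, h_pred)]

# B) renormalize so that the weights sum to 1
total = sum(w_new)
w_new = [w_i / total for w_i in w_new]

print(f"eps = {eps:.2f}, alpha = {alpha:.3f}")
print([round(w_i, 3) for w_i in w_new])
# eps = 0.20, alpha = 0.693
# [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update, the single misclassified point carries half of the total weight, which is what forces the next weak learner to pay attention to it.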

Getting started
Helper plot function
We’re going to use the function below to visualize our data points, and optionally overlay the decision boundary of a fitted AdaBoost model. Don’t worry if you don’t understand everything that is happening here, it is not critical to understanding the algorithm itself.

    from typing import Optional

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib as mpl


    def plot_adaboost(X: np.ndarray,
                      y: np.ndarray,
                      clf=None,
                      sample_weights: Optional[np.ndarray] = None,
                      annotate: bool = False,
                      ax: Optional[mpl.axes.Axes] = None) -> None:
        """ Plot ± samples in 2D, optionally with decision boundary """

        assert set(y) == {-1, 1}, 'Expecting response labels to be ±1'

        # create a new figure only if no axes were passed in
        if ax is None:
            fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
            fig.set_facecolor('white')

        pad = 1
        x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
        y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad

        # marker size is proportional to sample weight (uniform by default)
        if sample_weights is not None:
            sizes = np.array(sample_weights) * X.shape[0] * 100
        else:
            sizes = np.ones(shape=X.shape[0]) * 100

        X_pos = X[y == 1]
        sizes_pos = sizes[y == 1]
        ax.scatter(*X_pos.T, s=sizes_pos, marker='+', color='red')

        X_neg = X[y == -1]
        sizes_neg = sizes[y == -1]
        ax.scatter(*X_neg.T, s=sizes_neg, marker='.', c='blue')

        if clf is not None:
            plot_step = 0.01
            xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                                 np.arange(y_min, y_max, plot_step))

            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)

            # If all predictions are positive class, adjust color map accordingly
            if list(np.unique(Z)) == [1]:
                fill_colors = ['r']
            else:
                fill_colors = ['b', 'r']

            ax.contourf(xx, yy, Z, colors=fill_colors, alpha=0.2)

        if annotate:
            # label each point x_1, x_2, ... (braces keep multi-digit subscripts intact)
            for i, (x1, x2) in enumerate(X):
                offset = 0.05
                ax.annotate(f'$x_{{{i + 1}}}$', (x1 + offset, x2 - offset))

        ax.set_xlim(x_min+0.5, x_max-0.5)
        ax.set_ylim(y_min+0.5, y_max-0.5)
        ax.set_xlabel('$x_1$')
        ax.set_ylabel('$x_2$')

Generate a fake dataset
We will generate a toy dataset using a similar approach to the sklearn documentation, but using fewer data points. The key here is that we want to have two classes which are not linearly separable, since this is the ideal use-case for AdaBoost.

    from typing import Optional

    from sklearn.datasets import make_gaussian_quantiles
    from sklearn.model_selection import train_test_split


    def make_toy_dataset(n: int = 100, random_seed: Optional[int] = None):
        """ Generate a toy dataset for evaluating AdaBoost classifiers """

        # seed the global NumPy RNG, which make_gaussian_quantiles falls back on
        if random_seed is not None:
            np.random.seed(random_seed)

        X, y = make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=2)

        # map class labels {0, 1} to {-1, +1}
        return X, y*2-1


    X, y = make_toy_dataset(n=10, random_seed=10)
    plot_adaboost(X, y)

Benchmark with scikit-learn
Let’s establish a benchmark for what our model’s output should resemble by importing AdaBoostClassifier from scikit-learn and fitting it to our toy dataset.

    from sklearn.ensemble import AdaBoostClassifier

    bench = AdaBoostClassifier(n_estimators=10, algorithm='SAMME').fit(X, y)
    plot_adaboost(X, y, bench)

    train_err = (bench.predict(X) != y).mean()
    print(f'Train error: {train_err:.1%}')
Train error: 0.0%

The classifier fully fits the training dataset in 10 iterations, which is not surprising given that the data points in our toy dataset are reasonably well separated.

Rolling our own AdaBoost classifier
Below is the skeleton code for our AdaBoost classifier. After fitting the model, we’ll save all the key attributes to the class (including sample weights at each iteration) so we can inspect them later to understand what our algorithm is doing at each step.

The table below shows a mapping between the variable names we will use and the math notation used earlier in the description of the algorithm.

Variable Math

sample_weights with shape: (T, n) $ w_{i}^{(t)} $

stumps with shape: (T, ) $ h_t(x) $

stump_weights with shape (T, ) $ \alpha_t $

errors with shape: (T, ) $ \epsilon_t $

clf.predict(X) $ H_t(x) $

    class AdaBoost:
        """ AdaBoost ensemble classifier from scratch """

        def __init__(self):
            self.stumps = None
            self.stump_weights = None
            self.errors = None
            self.sample_weights = None

        def _check_X_y(self, X, y):
            """ Validate assumptions about format of input data"""
            assert set(y) == {-1, 1}, 'Response variable must be ±1'
            return X, y

Fitting the model
Recall our algorithm to fit the model:

Find weak learner $h_t(x)$ which minimizes $\epsilon_t = \sum_{i=1}^n \mathbf{1}[h_t(x_i) \neq y_i] \, w_i^{(t)}$.

We set a weight for our weak learner based on its accuracy: $\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$

Increase weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$. Note that $y_i h_t(x_i)$ will evaluate to $+1$ when hypothesis agrees with label, and $-1$ when it does not agree.

Renormalize weights, so that $\sum_{i=1}^n w_i^{(t+1)} =1$.

The code below is essentially a 1-to-1 implementation of the above, but there are a few things to note:

Since the focus here is understanding the ensemble element of AdaBoost, we’ll outsource the logic of picking each $h_t(x)$ to DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2).

We set the initial uniform sample weights outside of the for-loop and set the weights for $t+1$ within each iteration $t$, unless it is the last iteration. We are going out of our way here to save an array of sample weights on the fitted model, so that we can later visualize the sample weights at each iteration.

    from sklearn.tree import DecisionTreeClassifier


    def fit(self, X: np.ndarray, y: np.ndarray, iters: int):
        """ Fit the model using training data """

        X, y = self._check_X_y(X, y)
        n = X.shape[0]

        # init numpy arrays
        self.sample_weights = np.zeros(shape=(iters, n))
        self.stumps = np.zeros(shape=iters, dtype=object)
        self.stump_weights = np.zeros(shape=iters)
        self.errors = np.zeros(shape=iters)

        # initialize weights uniformly
        self.sample_weights[0] = np.ones(shape=n) / n

        for t in range(iters):
            # fit weak learner
            curr_sample_weights = self.sample_weights[t]
            stump = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
            stump = stump.fit(X, y, sample_weight=curr_sample_weights)

            # calculate error and stump weight from weak learner prediction
            # (weights already sum to 1, so no division by n is needed)
            stump_pred = stump.predict(X)
            err = curr_sample_weights[(stump_pred != y)].sum()
            stump_weight = np.log((1 - err) / err) / 2

            # update sample weights
            new_sample_weights = (
                curr_sample_weights * np.exp(-stump_weight * y * stump_pred)
            )
            new_sample_weights /= new_sample_weights.sum()

            # If not final iteration, update sample weights for t+1
            if t+1 < iters:
                self.sample_weights[t+1] = new_sample_weights

            # save results of iteration
            self.stumps[t] = stump
            self.stump_weights[t] = stump_weight
            self.errors[t] = err

        return self

Making predictions
We make a final prediction by taking a “weighted majority vote”, calculated as the sign (±) of the linear combination of each stump’s prediction and its corresponding stump weight.

$$ H(x) = \text{sign} \Big( \sum_{t=1}^T \alpha_t h_t(x) \Big) $$

    def predict(self, X):
        """ Make predictions using already fitted model """
        stump_preds = np.array([stump.predict(X) for stump in self.stumps])
        return np.sign(np.dot(self.stump_weights, stump_preds))

Performance
Now let’s put everything together, and fit the model with the same parameters as our benchmark.

    # assign our individually defined functions as methods of our classifier
    AdaBoost.fit = fit
    AdaBoost.predict = predict

    clf = AdaBoost().fit(X, y, iters=10)
    plot_adaboost(X, y, clf)

    train_err = (clf.predict(X) != y).mean()
    print(f'Train error: {train_err:.1%}')
Train error: 0.0%

Success! We’ve achieved the exact same result as our sklearn benchmark. I cherry-picked this toy dataset to show the strengths of AdaBoost, but you can run this notebook yourself and see that it matches the output regardless of starting conditions.

Developing intuition
Visualizing our learner step-by-step
Since we saved all intermediate variables as arrays to our fitted model, we can use the function below to visualize how our ensemble learner evolves at each iteration $t$:

The left column shows the “stump” weak learner selected, which corresponds to $h_t(x)$.

The right column shows the cumulative strong learner so far: $H_t(x)$.

The size of the data point markers reflects their relative weighting. Data points misclassified in the previous iteration will be more heavily weighted—and therefore appear larger—in the next iteration.

    from copy import deepcopy


    def truncate_adaboost(clf, t: int):
        """ Truncate a fitted AdaBoost up to (and including) a particular iteration """
        assert t > 0, 't must be a positive integer'
        new_clf = deepcopy(clf)
        new_clf.stumps = clf.stumps[:t]
        new_clf.stump_weights = clf.stump_weights[:t]
        return new_clf


    def plot_staged_adaboost(X, y, clf, iters=10):
        """ Plot weak learner and cumulative strong learner at each iteration. """

        # larger grid: one row per iteration, weak learner left, strong learner right
        fig, axes = plt.subplots(figsize=(8, iters*3),
                                 nrows=iters,
                                 ncols=2,
                                 sharex=True,
                                 dpi=100)

        fig.set_facecolor('white')

        _ = fig.suptitle('Decision boundaries by iteration')
        for i in range(iters):
            ax1, ax2 = axes[i]

            # Plot weak learner
            _ = ax1.set_title(f'Weak learner at t={i + 1}')
            plot_adaboost(X, y, clf.stumps[i],
                          sample_weights=clf.sample_weights[i],
                          annotate=False, ax=ax1)

            # Plot strong learner
            trunc_clf = truncate_adaboost(clf, t=i + 1)
            _ = ax2.set_title(f'Strong learner at t={i + 1}')
            plot_adaboost(X, y, trunc_clf,
                          sample_weights=clf.sample_weights[i],
                          annotate=False, ax=ax2)

        plt.tight_layout()
        plt.subplots_adjust(top=0.95)
        plt.show()


    clf = AdaBoost().fit(X, y, iters=10)
    plot_staged_adaboost(X, y, clf)

Why do some iterations have no decision boundary?
You may notice that our weak learners at iterations $t=2,5,7,10$ classify all points as positive. This occurs because given the current sample weights, the lowest error is achieved by simply predicting all data points to be positive. Note that in each of the plots above for these iterations, the negative samples are surrounded by proportionally higher-weighted positive samples.

There is no way to draw a linear decision boundary to correctly classify any number of negative data points without misclassifying a higher cumulative weight of positive samples. This does not stop our algorithm from converging though. All the negative points are misclassified and therefore increase in sample weight. This updating of weights allows the next iteration’s weak learner to discover a meaningful decision boundary.

Why do we use that specific formula for alpha_t?
Are you curious why we use this particular value for $\alpha_t$? We can show that the choice of $\alpha_t = \frac{1}{2} \ln \Big( \frac{1-\epsilon_t}{\epsilon_t} \Big)$ minimizes exponential loss $L_{exp}(x, y) = e^{-y \, h(x)}$ over the training set.

Ignoring the sign function, our strong learner $H$ at iteration $t$ is a weighted combination of weak learners $h(x)$ . At any given iteration $t$ , we can define $H_t(x)$ recursively as its value at iteration $t-1$ plus the weighted weak learner of the current iteration.

$$ \begin{aligned} H_t(x) &= \sum_{i=1}^t \alpha_i h_i(x) \\ &= H_{t-1}(x) + \alpha_t h_t(x) \end{aligned} $$

Our loss function applied to $H$ is the average loss across all $n$ data points. We can substitute in our recursive definition of $H_t(x)$ , and split the exponential term using the identity $e^{a+b}=e^a e^b$.

$$ \begin{aligned} L(H_t) &= \tfrac{1}{n} \sum_{i=1}^n e^{-y_i H_t(x_i)} \\ &= \tfrac{1}{n} \sum_{i=1}^n e^{-y_i H_{t-1}(x_i)} e^{-y_i \alpha_t h_t(x_i)} \\ &= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{e^{-y_i H_{t-1}(x_i)}} e^{-y_i \alpha_t h_t(x_i)} \\ &= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{w^t_i} \; e^{-y_i \alpha_t h_t(x_i)} \\ \end{aligned} $$

Now we take the derivative of our loss function with respect to $\alpha_t$ and set it to zero to find the parameter value at which the loss function is minimized. We can split the summation into two: cases where $h_t(x_i) = y_i$ and cases where $h_t(x_i) \neq y_i$ .

$$ \begin{aligned} L(H_t) &= \tfrac{1}{n} \sum_{i=1}^n \color{lightgrey}{w^t_i \,} e^{-y_i \alpha_t h_t(x_i)} \\ \frac{\partial L}{\partial \alpha_t} = 0 &= - \tfrac{1}{n} \sum_{i=1}^n w^t_i \, y_i h_t(x_i) e^{-y_i \alpha_t h_t(x_i)} \\ &= - \tfrac{1}{n} \sum_{i: h_t(x_i) = y_i} w^t_i \, e^{-\alpha_t} + \tfrac{1}{n} \sum_{i: h_t(x_i) \neq y_i} w^t_i \, e^{\alpha_t} \\ \end{aligned} $$

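Finally, we recognize that the sum of weights over the misclassified points is exactly the error defined earlier, $\sum_{i: h_t(x_i) \neq y_i} w^t_i = \epsilon_t$, and since the weights are normalized to sum to one, the correctly classified points contribute $1 - \epsilon_t$. Making the substitution (and multiplying through by $n$) and then manipulating algebraically allows us to isolate $\alpha_t$.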

$$ \begin{aligned} 0 &= (\epsilon_t-1) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} \\ (1-\epsilon_t) e^{-\alpha_t} &= \epsilon_t e^{\alpha_t} \\ \frac{1 - \epsilon_t}{\epsilon_t} &= \frac{e^{\alpha_t}}{e^{-\alpha_t}} = e^{2 \alpha_t} \\ \alpha_t &= \frac{1}{2} \ln \bigg( \frac{1 - \epsilon_t}{\epsilon_t} \bigg) \end{aligned} $$
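
As a quick numerical sanity check of this result, the sketch below compares the closed-form $\alpha_t$ against a brute-force grid search over the same per-round loss, $(1-\epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$. The error value `eps = 0.3` and the grid range are arbitrary choices for illustration.

```python
import math


def exp_loss(alpha: float, eps: float) -> float:
    """Weighted exponential loss of one round, for weights that sum to 1."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)


eps = 0.3  # hypothetical weighted error of the weak learner

# closed-form minimizer from the derivation
alpha_closed = 0.5 * math.log((1 - eps) / eps)

# brute-force minimizer over a fine grid of candidate alphas in [0, 3]
grid = [i / 10_000 for i in range(30_001)]
alpha_grid = min(grid, key=lambda a: exp_loss(a, eps))

print(f"closed form: {alpha_closed:.4f}, grid search: {alpha_grid:.4f}")
```

The two values should agree to within the grid spacing, and trying other values of `eps` below 0.5 should give the same agreement.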

Further reading

sklearn.ensemble.AdaBoostClassifier – Official scikit-learn documentation

University of Toronto CS – AdaBoost – Understandable handout PDF which lays out a pseudo-code algorithm and walks through some of the math.

Weak Learning, Boosting, and the AdaBoost algorithm – Discussion of AdaBoost in the context of PAC learning, along with python implementation.

AdaBoost: Implementation and intuition – Python implementation with visualization function, served partially as the inspiration for this post.

Building a Naive Bayes classifier from scratch with NumPy

Monday, 16 Mar 2020

While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.

While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency (both computational and maintenance) across a wide family of models. Our estimator of interest, MultinomialNB, inherits from _BaseDiscreteNB, which itself inherits from _BaseNB, which has multiple inheritance from BaseEstimator and ClassifierMixin.

What is Naive Bayes?
Naive Bayes is a simple generative (probabilistic) classification model based on Bayes’ theorem. The typical example use-case for this algorithm is classifying email messages as spam or “ham” (non-spam) based on the previously observed frequency of words which have appeared in known spam or ham emails in the past.

$$ P(\text{ spam }|\text{ text }) = \frac{P(\text{ text }|\text{ spam }) \, P(\text{ spam })}{P(\text{ text })} $$

Following typical ML notation, we use $y$ to denote the “class” of our message, where $y=1$ for spam messages and $y=0$ for non-spam messages. We will represent our text data as an array $x$ of length $J$, with the $j^{th}$ value representing the number of times the $j^{th}$ word appears in a particular email. The value of $J$ is the number of distinct words seen across all training data. Our model then becomes

$$ P(y|x) = \frac{P(x|y) \, P(y)}{P(x)} $$

Why would we want to use something “naive”?
This classifier is “naive” in the sense that predictive features are assumed to be conditionally independent given their class. Naturally this is a gross simplification of reality: if an email contains the word “sports” we would expect it to also be more likely to contain related words like “bet” or “odds”. But this assumption allows us to calculate the joint likelihood of a message simply as the product of per-word likelihoods, without worrying about the correlation structure between different words. This makes the model much easier to fit.

All models are wrong, some are useful. — George Box, statistician

If our model were not “naive”, we would have to calculate the joint likelihood function as a messy product of $J$ separate conditional likelihood functions, each conditioned on all of the preceding words. With so many parameters to estimate, we would need a large quantity of training data to avoid overfitting.

$$ P(\mathbf{x} \vert y) = P(x_1 \vert y) P(x_2 \vert x_1, y) P(x_3 \vert x_1, x_2, y) \ldots P(x_J \vert x_1, x_2, \ldots, x_{J-1}, y) $$

But if we assume conditional independence, this calculation becomes much simpler. We can now just multiply together the likelihoods of each word $x_j$, conditional only on the class $y$ of the message (whether or not it is spam), to get the joint likelihood for the entire message.

$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$
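To make the factorization concrete, here is a minimal sketch using made-up per-word likelihoods for a three-word vocabulary. The numbers and variable names are illustrative assumptions, not estimates from any dataset. A word that appears more than once contributes its likelihood once per occurrence, which is expressed here as a power.

```python
import numpy as np

# Hypothetical per-word likelihoods P(x_j | y=c); rows are classes, columns are words
lk_word = np.array([
    [0.5, 0.3, 0.2],   # class 0
    [0.1, 0.2, 0.7],   # class 1
])

# Word counts for one message: word 0 appears twice, word 2 once
x = np.array([2, 0, 1])

# Under conditional independence the joint likelihood is a product over words,
# with each word's likelihood raised to the number of times it appears
lk_message = (lk_word ** x).prod(axis=1)
print(lk_message)   # one joint likelihood per class
```

For class 0 this is $0.5^2 \times 0.2 = 0.05$, and for class 1 it is $0.1^2 \times 0.7 = 0.007$, so this message is far more likely under class 0.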

Our toy dataset
The function below generates a test dataset based on Chapter 3.5, Exercise 3.22 from Machine Learning: A Probabilistic Perspective.

    import numpy as np
    import pandas as pd
    from typing import Callable, Tuple
    from IPython.display import display
    from sklearn.feature_extraction.text import CountVectorizer


    def make_spam_dataset(show_X=True) -> Tuple[np.ndarray, np.ndarray, Callable]:
        """ Create a small toy dataset for MultinomialNB implementation

        Returns:
            X: word count matrix
            y: indicator of whether or not message is spam
            msg_tx_func: a function to transform new test data into word count matrix
        """

        vocab = [
            'secret', 'offer', 'low', 'price', 'valued', 'customer', 'today',
            'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza'
        ]

        spam = [
            'million dollar offer',
            'secret offer today',
            'secret is secret'
        ]

        not_spam = [
            'low price for valued customer',
            'play secret sports today',
            'sports is healthy',
            'low price pizza'
        ]

        all_messages = spam + not_spam

        # passing a fixed vocabulary keeps the column order stable
        vectorizer = CountVectorizer(vocabulary=vocab)
        word_counts = vectorizer.fit_transform(all_messages).toarray()
        df = pd.DataFrame(word_counts, columns=vocab)
        is_spam = [1] * len(spam) + [0] * len(not_spam)
        msg_tx_func = lambda x: vectorizer.transform(x).toarray()

        if show_X:
            display(df)

        return df.to_numpy(), np.array(is_spam), msg_tx_func


    X, y, tx_func = make_spam_dataset()

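| | secret | offer | low | price | valued | customer | today | dollar | million | sports | is | for | play | healthy | pizza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |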

Our model
Our model needs to resolve the three component “ingredients” necessary to make predictions on future data points. The table below describes each component, and shows the mapping between math notation above and variable names in our code below.

| Variable | Math | Description |
|---|---|---|
| prior | $P(y)$ | Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam). |
| lk_word | $P(x_j \vert y)$ | The likelihood of each word, conditional on message class. We are implicitly using the multinomial distribution here. Intuitively, the word conditional likelihoods are just the normalized frequency within each message class. |
| lk_message | $P(\mathbf{x} \vert y)$ | The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class. |
| normalize_term | $P(\mathbf{x})$ | The likelihood of an entire message across all possible classes. |

We’ve got a few additional attributes as well:

The alpha attribute will be added to each word count, to avoid us having zero probabilities for words not seen in our training sample.

The is_fitted_ attribute is a scikit-learn convention to ensure we don’t accidentally try to make predictions on a model that has not yet been fitted.

    class NaiveBayes(object):
        """ DIY binary Naive Bayes classifier based on categorical data """

        def __init__(self, alpha=1.0):
            """ alpha: smoothing constant added to every word count """
            self.prior = None
            self.word_counts = None
            self.lk_word = None
            self.alpha = alpha
            self.is_fitted_ = False

Fitting the model
Let’s implement our algorithm to handle an arbitrary number of classes, even though our toy example only has two (spam/not-spam). Our fit function needs to do two things:

Calculate prior
This one is easy. We split our input array into one sub-array per class in X_by_class, then divide the number of messages in each sub-array by the total number of messages to arrive at our prior.

Calculate likelihoods
We set word_counts by looping over each of the sub-arrays in X_by_class and taking the column sums within each sub-array, then adding alpha to every count. Note that the numpy notation is a bit unintuitive here: .sum(axis=0) means that we collapse the $0^{th}$ axis (rows), leaving one value per column. This gives us an array of shape (c,j) which counts the number of times the $j^{th}$ word appears across all emails of class $c$.

Finally, our likelihood function lk_word is simply these word counts divided by the total number of times all words appear in each class. We achieve this by taking row sums using .sum(axis=1).
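Since the axis convention is easy to mix up, here is a small standalone sketch on a made-up count matrix (the arrays and the second class's counts are invented for illustration). Column sums use axis=0 and give one total per word; row sums use axis=1 and give one total per class, which is what we divide by to normalize.

```python
import numpy as np

# Toy word-count matrix for one class: 2 messages (rows) x 3 words (columns)
X_class = np.array([
    [1, 0, 2],
    [0, 1, 1],
])

# axis=0 collapses the rows: one total per word (column sums)
word_counts = X_class.sum(axis=0)
print(word_counts)          # [1 1 3]

# Stack counts for two classes into shape (c, j); the second row is made up
counts = np.array([word_counts, [2, 2, 0]]) + 1.0   # +1.0 plays the role of alpha

# axis=1 collapses the columns: one total per class (row sums)
row_totals = counts.sum(axis=1).reshape(-1, 1)
lk_word = counts / row_totals
print(lk_word.sum(axis=1))  # each row of likelihoods sums to 1
```

The reshape(-1, 1) turns the row totals into a column vector so that broadcasting divides each row by its own total.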

    from sklearn.utils.validation import check_X_y, check_array


    def fit(self, X: np.ndarray, y: np.ndarray):
        """ Fit training data for Naive Bayes classifier """

        # not strictly necessary, but this ensures we have clean input
        X, y = check_X_y(X, y)
        n = X.shape[0]

        # a plain list, since classes may contain different numbers of messages
        X_by_class = [X[y == c] for c in np.unique(y)]
        self.prior = np.array([len(X_class) / n for X_class in X_by_class])

        # column sums per class, smoothed by alpha, then normalized within each class
        self.word_counts = np.array([sub_arr.sum(axis=0) for sub_arr in X_by_class]) + self.alpha
        self.lk_word = self.word_counts / self.word_counts.sum(axis=1).reshape(-1, 1)

        self.is_fitted_ = True
        return self

Predicting new emails
We can now make predictions, either on the same emails we used to train the model, or on entirely new emails never before seen by the model. We’ll do this by first predicting probabilities for each class, then making our final prediction by taking the class with the highest probability.

Recall that our conditional likelihood for an entire message $\bf{x}$ is calculated as the product of conditional likelihoods for each word $x_j$ present in the message. Note here that if a word appears twice, its lk_word gets factored twice into our joint likelihood.

$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$

So we loop over each message (row) in our array $X$ and calculate individual conditional likelihoods, then multiply them all together and multiply by our class priors. At the very end, we divide everything by $P(x)$ so that we have valid probabilities.

What about previously unseen words?
Suppose we have a word which has never appeared in training messages labelled as spam. Its conditional likelihood would be zero, which would take our entire joint likelihood to zero as well. This is precisely why we added alpha while calculating word counts in the fit function, so that this situation does not occur.
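The effect of smoothing is easy to see on a tiny made-up example. The counts below are hypothetical and `word_likelihoods` is an illustrative helper, not part of the classifier. With alpha set to zero, a single unseen word wipes out the whole message likelihood; with alpha set to one, it merely makes the message less likely.

```python
import numpy as np

# Hypothetical raw counts for the spam class over a 3-word vocabulary;
# the third word never appeared in any spam training message
raw_counts = np.array([4.0, 2.0, 0.0])

def word_likelihoods(counts: np.ndarray, alpha: float) -> np.ndarray:
    """Normalized word frequencies after adding alpha to every count."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum()

x = np.array([1, 0, 1])  # new message containing the unseen word

for alpha in (0.0, 1.0):
    lk_message = (word_likelihoods(raw_counts, alpha) ** x).prod()
    print(f"alpha={alpha}: message likelihood = {lk_message:.4f}")
```

With alpha=1 the smoothed likelihoods are 5/9, 3/9 and 1/9, so the message likelihood is 5/81 rather than zero.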

Probabilistic prediction
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """ Predict probability of class membership """

        assert self.is_fitted_, 'Model must be fit before predicting'
        X = check_array(X)

        # loop over each observation to calculate conditional probabilities
        class_numerators = np.zeros(shape=(X.shape[0], self.prior.shape[0]))
        for i, x in enumerate(X):
            word_exists = x.astype(bool)
            # raise each present word's likelihood to the number of times it appears
            lk_words_present = self.lk_word[:, word_exists] ** x[word_exists]
            lk_message = (lk_words_present).prod(axis=1)
            class_numerators[i] = lk_message * self.prior

        normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
        conditional_probas = class_numerators / normalize_term
        # absolute value, so that rows summing to less than 1 also fail the check
        assert (np.abs(conditional_probas.sum(axis=1) - 1) < 0.001).all(), 'Rows should sum to 1'
        return conditional_probas

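One caveat with multiplying many small likelihoods is floating-point underflow on long messages. A common workaround is to sum logs instead of multiplying probabilities, which fits with scikit-learn exposing its fitted parameters as log values (feature_log_prob_ and class_log_prior_). The sketch below is a swapped-in alternative to the product-based loop, not the method used in this post; `predict_proba_log` and its inputs are hypothetical.

```python
import numpy as np

def predict_proba_log(X: np.ndarray, lk_word: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Class probabilities computed in log space (hypothetical helper)."""
    # log P(x|y=c) + log P(y=c), via a matrix product of counts and log-likelihoods
    log_numerators = X @ np.log(lk_word).T + np.log(prior)
    # subtract the row max before exponentiating (log-sum-exp trick)
    log_numerators -= log_numerators.max(axis=1, keepdims=True)
    numerators = np.exp(log_numerators)
    return numerators / numerators.sum(axis=1, keepdims=True)

# Made-up parameters for two classes and three words
lk_word = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
prior = np.array([0.4, 0.6])
# the second message is long enough that a raw product of likelihoods underflows to zero
X = np.array([[2, 0, 1], [0, 500, 500]])

print(predict_proba_log(X, lk_word, prior))
```

For short messages this gives the same probabilities as the product form; for the long message it still returns valid probabilities where the raw product would produce 0/0.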
Binary prediction
Our predict_proba function will return probabilities for each class, but in our toy example we really just want a binary outcome: is the message spam or not? Once we’ve done the work to get the class probabilities, it is easy to find the index of the highest probability class using np.argmax(axis=1).

    def predict(self, X: np.ndarray) -> np.ndarray:
        """ Predict class with highest probability """
        return self.predict_proba(X).argmax(axis=1)

Putting it all together
We defined the above logic as standalone functions, so now we need to assign each of them to the relevant method of our NaiveBayes class. This would not be necessary if we defined everything in a single module or notebook cell.

    # attach functions defined above to our classifier
    # this is not needed if you define the entire class in a single cell
    NaiveBayes.fit = fit
    NaiveBayes.predict_proba = predict_proba
    NaiveBayes.predict = predict

    preds = NaiveBayes().fit(X, y).predict(X)
    print(f'Accuracy: {(preds == y).mean()}')

Accuracy: 1.0

You can find a gist with the code all together here.

Comparing with sklearn
The function below fits our model alongside MultinomialNB and asserts that we have similar values for our priors, likelihoods, and predictions.

    from sklearn.naive_bayes import MultinomialNB


    def test_against_benchmark():
        """ Check that DIY model matches outputs from scikit-learn estimator """

        X, y, _ = make_spam_dataset(show_X=False)
        bench = MultinomialNB().fit(X, y)
        model = NaiveBayes(alpha=1).fit(X, y)

        # compare ratios to 1 in absolute value, so deviations in either direction fail
        assert (np.abs(model.prior / np.exp(bench.class_log_prior_) - 1) < 0.001).all()
        print('[✔︎] Identical prior probabilities')

        assert (np.abs(model.lk_word / np.exp(bench.feature_log_prob_) - 1) < 0.001).all()
        print('[✔︎] Identical word likelihoods')

        assert (np.abs(model.predict_proba(X) / bench.predict_proba(X) - 1) < 0.001).all()
        print('[✔︎] Identical predictions')


    test_against_benchmark()

[✔︎] Identical prior probabilities
[✔︎] Identical word likelihoods
[✔︎] Identical predictions

Further reading
If you want to learn more about Naive Bayes, here are a few of the resources I found most helpful.

Notes on Naive Bayes Classifiers for Spam Filtering (Jonathan Lee, University of Washington) is a good entry point, as it provides a relatively succinct description of the typical spam detection example for Naive Bayes.

The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm (Michael Collins, Columbia) provides a more comprehensive walkthrough of the math behind NB, including derivation of maximum likelihood estimates.

sklearn.naive_bayes.MultinomialNB (scikit-learn docs) is the example implementation which I tried to reproduce.

Naive Bayes from Scratch in Python (Kenzo Takahashi) is the best DIY post I’ve seen so far, and the key inspiration for this post.

Render LaTeX math expressions in Hugo with MathJax 3

Tuesday, 04 Feb 2020

This blog runs on Hugo, a publishing framework which processes markdown text files into static web assets which can be conveniently hosted on a server without a database. It is great for a number of reasons (speed, simplicity) but one area where I find it lacking is in its support for math typesetting.

The problem
Typically, you embed a javascript library such as MathJax or KaTeX by adding a line of HTML to your website template. While the page is loading in a visitor’s browser, the library finds text enclosed in dollar signs, renders it as LaTeX, and replaces the contents of the page.

The problem is that the initial page contents have already been processed by Hugo’s markdown engine before the page even loads. The markdown parser interprets underscores (_) as italics, and so it removes them and wraps the enclosed text in the appropriate HTML tags. However the underscore is frequently used in LaTeX for subscript. E.g. x_1 gets rendered to $ x_1 $. So if your page contains multiple underscores, your LaTeX code will be broken before the page even starts loading.
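To illustrate the failure mode, here is a deliberately crude stand-in for a markdown parser: a single regex that treats any pair of underscores as italics. This is a toy of my own, not how Hugo's markdown engine is implemented (real parsers apply more nuanced emphasis rules), but it shows how subscripts can be consumed before MathJax ever sees them.

```python
import re

def naive_emphasis(text: str) -> str:
    """Toy stand-in for a markdown parser: treat _..._ as italics."""
    return re.sub(r"_(.+?)_", r"<em>\1</em>", text)

latex = r"$x_1 + x_2$"
print(naive_emphasis(latex))   # $x<em>1 + x</em>2$
```

Once the underscores have been replaced by HTML tags, the text between the dollar signs is no longer valid LaTeX.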

The (typical) solution
The best general approach seems to be this one:

Configure MathJax to attempt to typeset within <code> blocks (which it skips by default).
Add a class has-jax to your CSS which undoes whatever code-specific formatting your website uses.
Add a pseudo-callback to MathJax which waits until typesetting is complete, then runs a piece of javascript to add the above class to the parent element of every MathJax element.

The page above includes all the necessary code snippets to implement this for MathJax 2.x. But MathJax 2 is a lot slower than MathJax 3 or KaTeX. I tried simply swapping out the src for the newer version, but this did not work, because it seems that MathJax 3 uses an entirely different configuration syntax from 2.x.

MathJax v3 is a complete rewrite of MathJax from the ground up, and so its internal structure is quite different from that of version 2. That means MathJax v3 is not a drop-in replacement for MathJax v2, and upgrading to version 3 takes some adjustment to your web pages. ¹

Adapted for MathJax 3
The code below is a modification of Doswa’s code which loads MathJax 3 instead of 2.x. Create a file in your theme directory layouts/partials/mathjax_support.html as the following:

    <script>
    MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\$', '\$']],
        displayMath: [['$$','$$'], ['\\[', '\\]']],
        processEscapes: true,
        processEnvironments: true
      },
      options: {
        skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre']
      }
    };

    window.addEventListener('load', (event) => {
      document.querySelectorAll("mjx-container").forEach(function(x){
        // classList.add keeps any classes the parent element already has
        x.parentElement.classList.add('has-jax')
      })
    });
    </script>

    <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
    <script type="text/javascript" id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>

Next, open the file layouts/partials/header.html and add the following line just before the closing </head> tag:

    {{ if .Params.mathjax }}{{ partial "mathjax_support.html" . }}{{ end }}

Then, add the following lines to your CSS file. You may need to tinker with the contents here depending on your theme; these are just the settings which worked for me.

    code.has-jax {
        -webkit-font-smoothing: antialiased;
        background: inherit !important;
        border: none !important;
        font-size: 100%;
    }

Finally, add mathjax: true to the YAML frontmatter of any pages containing math markup.

Alternatively, you could omit the outer {{ if .Params.mathjax }} … {{ end }} conditional above to load the library automatically on all pages. However, given that this library is quite heavy (it’s consistently the asset that Google PageSpeed Insights complains the most about) and that only <20% of my blog posts contain math at all, the conditional is worth the extra effort for me.

Other approaches I considered
Here are a few other solutions I looked into, but ultimately decided not to adopt as a final solution.

Manually escape all problematic characters
You could manually escape all underscore or backslash characters with an additional backslash. This works if you rarely use LaTeX and just need a specific expression to render correctly, but it will quickly get annoying if your posts include multiple math expressions. Besides breaking rendering of LaTeX in your markdown editor, it also makes the raw code difficult to read.

Use MMark markdown processing engine
Hugo lets you specify which processing engine to use to convert markdown during the build process. There is one engine, MMark, which handles LaTeX well and so makes the above modifications entirely unnecessary. This was the approach previously officially recommended in Hugo documentation. However, according to the current docs, MMark is deprecated and will be removed in a future release. It may work for a while still, but it doesn’t make sense for me to adopt a solution that is already deprecated.

Goldmark engine with MathJax extension
The new default markdown engine used by Hugo is called goldmark. There is an extension goldmark-mathjax that seems to do exactly what we want. But as of Feb 2020, a PR to merge it into Hugo was rejected for relying on unacceptable dependencies. So for the time being, this approach would require forking Hugo and modifying it to use this extension. I have no real experience with Go, so I decided to avoid this approach for now.

KaTeX math shortcode
If you are willing to use KaTeX instead of MathJax, then this approach may be a good option. But it is cumbersome to wrap all your inline math equations in a shortcode. It is already annoying that the backtick approach breaks in-editor LaTeX rendering in most editors, but at least the raw LaTeX code is displayed in monospace text, and the backticks do not take up much screen space. For example, to render $x=1$ you would need to type {{ < math > }}x=1{{ < /math > }}, which makes it even more difficult to read and edit content in your markdown editor. I didn’t find the speed difference between KaTeX and MathJax 3 to be sufficient to justify the degraded editing experience.

¹ MathJax docs – Upgrading from v2 to v3

8 Big Ideas from Scott Page's “The Model Thinker”

Friday, 10 Jan 2020

I recently finished reading Scott E. Page’s wonderful book The Model Thinker. As a data scientist, I have a technical interest in models, particularly in the space of statistics and machine learning. As a general thinker, I am a big fan of Shane Parrish’s mental models concept, in which he champions developing an understanding of a wide breadth of models across disciplines to aid in general decision-making.

A majority of the mental models on Farnam Street come from more of a psychology or behavioural economics background. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. The author touches on 50+ models in the book, but here is a quick summary of a few big ideas which resonated with me.

What makes for a good model?

A good model is parsimonious
While describing different high-level types of models in the first chapter, the author references a joke I was not familiar with. The original joke¹ pokes fun at physicists for making unrealistic simplifying assumptions in their model, such as that of a cow being a perfect sphere.

Milk production at a dairy farm was low, so the farmer wrote to the local university, asking for help from academia. A multidisciplinary team of professors was assembled, headed by a theoretical physicist, and two weeks of intensive on-site investigation took place. The scholars then returned to the university, notebooks crammed with data, where the task of writing the report was left to the team leader. Shortly thereafter the physicist returned to the farm, saying to the farmer, “I have the solution, but it works only in the case of spherical cows in a vacuum”.

But as the author points out, sometimes these amusingly extreme simplifications actually yield surprisingly usable rough results.

The spherical cow is a favorite classroom example of the analogy approach: to make an estimate of the amount of leather in a cowhide, we assume a spherical cow. We do so because the integral tables in the back of calculus textbooks include tan(x) and cos(x) but not cow(x).

There is no model which is a perfect representation of reality. A model with perfect accuracy would be like a 1:1 scale map, which is clearly not practical to use. So when we select a model, we are implicitly selecting some factors to include and others to exclude. Effective models include the important factors (and are therefore accurate) while excluding the less important ones (and are therefore simple and hence useful to us).

A good model knows its purpose
In the second chapter, Why Model, the author categorizes seven overarching uses for models:

Reason: to identify conditions and deduce logical implications
Explain: to provide (testable) explanations for empirical phenomena
Design: to choose features of institutions, policies, and rules
Communicate: to relate knowledge and understandings
Act: to guide policy changes and strategic actions
Predict: to make numerical and categorical predictions of future and unknown phenomena
Explore: to investigate possibilities and hypotheticals

It seems self-evident that models are used for a wide variety of purposes, but what is worth noting here is how the success criteria for each potential use-case could differ. This implies that anyone setting out to apply a model to solve a problem would be wise to carefully and honestly consider the core underlying purpose, in order to ensure success is actually achievable.

Interpretability vs. predictive power
There is a trope in data science about much of machine learning being merely glorified applied statistics, but there is definitely an underlying tension between two paradigms: success as interpretability and success as predictive power.

Traditional statistics focuses on building models which have explanatory power. A good model is not just true, but also interpretable, and easy to integrate into qualitative decision-making. Case in point, using the python package statsmodels to fit a linear model gives you a full R-style summary of fit out of the box.

The more recent focus in pure ML arenas is around having good predictive power, with less consideration given to our qualitative understanding of the inner workings of the models themselves. For example, take the scikit-learn approach to linear models, which does not even give you an easy way to view p-values out of the box.

I’m not advocating one paradigm over the other, but it is important to honestly consider what success would look like for whatever project/decision/goal you are seeking to apply a model to. It’s easy to pay lip service to pure predictive power, but will you and your team feel comfortable with a powerful algorithm whose decisions you don’t understand?

Many models thinking
The third chapter is an appeal to adopting what the author calls “many models thinking”, for which he lays out a theorem I was not familiar with.

Condorcet Jury Theorem – Each of an odd number of people (models) classifies an unknown state of the world as either true or false. Each classifies independently from one another, and classifies correctly with a probability $p>\frac{1}{2}$ . Theorem: A majority vote classifies correctly with higher probability than any person (model), and as the number of people (models) becomes large, the accuracy of the majority vote approaches 100%.

Everything is a remix
This immediately brings to mind the idea of ensemble learning in ML. Just substitute “weak learner” for a single vote, and “strong learner” for the majority vote in the above theorem. I was surprised to discover that the Condorcet Jury Theorem was expressed in 1785, nearly 250 years ago.
It is humbling to observe instances where seemingly modern techniques² are actually a remix of much older concepts from other fields. There are no truly new ideas.

The devil is in the details
The author points out that in reality we don’t see our prediction accuracy go to 100% as we increase the number of models or inputs into a majority vote. The reason is usually that one of the assumptions in the above theorem is violated:

Weak learners must each have some signal. If $p=\tfrac{1}{2}$, then we cannot improve predictions by averaging together pure noise.
The votes must be independent. If multiple votes are perfectly dependent, then they really only count for one vote. If they have some moderate level of correlation, then their raw count overstates the amount of independent information.

In real-world collective decision-making, it is plausible that both of these assumptions are violated. Votes are certainly not independent, and it is conceivable that some voters have negative signal: their predictions are wrong more often than would occur due to pure chance.

Related concepts: wisdom of the crowd, prediction markets

Adaptive systems
Systems which are able to respond to feedback pose additional challenges for quantifying the accuracy of our models and the predictions we generate from them.

The Lucas Critique states that changes in a policy or the environment likely produce behavioural responses by those affected. Models estimated with data on past human behaviours will therefore not be accurate. Models must take into account the fact that people respond to policy and environmental changes.

See also: why your KPIs suck
This brings to mind Goodhart’s Law, which tells us that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. This is a key challenge faced by anyone who has tried to design KPIs for an organization.

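Coming back briefly to the Condorcet Jury Theorem from the “Many models thinking” section, the claim can be checked directly under its own assumptions (independent voters, each correct with the same probability $p$). The sketch below computes the exact probability that a majority is correct using the binomial distribution; `majority_accuracy` is a name I made up for illustration.

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """P(majority of n independent voters is correct), for odd n."""
    k_min = n // 2 + 1   # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# accuracy of the majority vote grows with the number of voters when p > 0.5
for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 4))

# with no signal (p = 0.5), adding voters does not help
print(majority_accuracy(101, 0.5))
```

This only covers the idealized case. Correlated votes, as discussed above, would need a different model.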
Power-law (long-tail) distributions
Besides the Normal distribution, power-law distributions are one of the most important statistical distributions to understand. Whereas aggregates of independent things tend to follow the normal distribution via the central limit theorem, aggregates of dependent things (particularly when feedback loops are involved) follow power-law distributions.

Power-law distributions – In a power-law distribution, the probability of an event is inversely related to its size: the larger the event, the less likely it occurs.

$$ p(x) = C x^{-a} $$

They are sometimes difficult to grasp intuitively though, which can cause problems when we attempt to use heuristics to gauge things like risk while subconsciously assuming a normal distribution.

Contemplating a power-law distribution of human heights reveals how much power-law distributions differ from normal distributions. If human heights were distributed by a power law similar to that of city populations, and if we calibrate the mean height at 5 feet 9 inches, then the United States would include one person the height of the Empire State Building, over 10,000 people taller than giraffes, and 180 million people less than 7 inches tall.

Power-laws arise due to “Preferential Attachment”
The author presents a couple of potential causal factors which explain how power-law distributions arise. The most compelling is the preferential attachment model, which states that entities grow at rates proportional to their current size. Aka: the rich get richer and the poor get poorer. He gives a compelling example about a music download experiment:

In the music lab experiments, college students could sample and download songs. In the first treatment, subjects did not know what songs others downloaded, and the distributions of downloads had a shorter tail: no song received more than two hundred downloads and only one song received fewer than thirty. In a second treatment, students knew what others downloaded.
The tail of the distribution grew: one song received more than three hundred downloads. Perhaps more telling, over half received fewer than thirty. The tail became longer. Social influence increased inequality. This inequality is not a concern if social influence leads people to download better songs. However, correlations between downloads in the two treatments were not strong. If we interpret the number of downloads of a song in the first treatment as a proxy for the song’s quality, social influence did not result in people downloading better songs. The big winners were not random, but they were not the best.

So as our world becomes more interconnected and feedback loops multiply, we should expect to see more long tails arise in situations where they may not have done so historically.

See also: Black swan theory, the Matthew effect.

Concavity and convexity
Concave and convex functions are another math concept which is much more profound when considered from an economic perspective. I will admit to invoking [Jensen’s inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) in math proofs without truly reflecting on how it influences human decision-making.

Convexity implies risk-taking
Convex functions have an increasing slope: the function’s value increases by a larger amount as we increase a variable’s value. The number of possible pairs of people is a convex function of the group size. A group of three people includes three unique pairs. A group of four people includes six unique pairs, and a group of five includes ten unique pairs. Each increase in group size increases the number of pairs by a larger amount. Similarly, each time a chef adds a new spice to his repertoire, he increases the number of spice combinations by a larger amount.

Concavity implies diversity
Concave functions with positive slopes exhibit diminishing returns: the added value of each extra thing diminishes as we have more of that thing.
Our utility or value from almost all goods exhibits diminishing returns. The more leisure, money, ice cream, or even time spent with loved ones, the less we value having more of it. Evidence for this can be found in the fact that the more we consume of just about anything, including chocolate, the less we enjoy it and the less we are willing to pay for it.

See also: Range by David Epstein, Specialization is for insects.

Markov models
Markov models describe sequential systems that follow the Markov property, which states that the probability of future states depends only on the current state, not the entire sequence of states that preceded it. Essentially: given the present, the past and future are conditionally independent.

$$ P(\text{Tomorrow}|\text{Today, Yesterday, 2 days ago…, Day 1}) = P(\text{Tomorrow}|\text{Today}) $$

The Markov property sounds like a gross over-simplification of reality, but it can yield surprisingly useful results, because it allows our models to capture a compromise between “full independence” and “complete dependence”, which is often impractical to model at all.

It can be shown that recurrent Markov chains which follow specific properties are guaranteed to converge to some long-run stationary distribution which reflects the long-run proportion of time spent in each state if the chain is run indefinitely.

Perron-Frobenius Theorem – A Markov process converges to a unique statistical equilibrium provided it satisfies four conditions:

Finite set of states: $ S = \{ 1, 2, \ldots, K \} $
Fixed transition rule
Ergodicity (state accessibility): The system can get from any state to any other through a series of transitions.
Non-cyclic: The system does not produce a deterministic cycle through a sequence of states.

The unique statistical equilibrium implies that long-run distributions of outcomes cannot depend on the initial state or on the path of events. In other words, initial conditions do not matter, and history does not matter.
Nor can interventions that change the state matter. As time marches on, a process that satisfies the assumptions inexorably heads to its unique statistical equilibrium and then stays there.

Besides being the foundation of MCMC, this has interesting implications from a sociological perspective.

The takeaway from the theorem should not be that history cannot matter but that if history does matter, one of the model’s assumptions must be violated. Two assumptions (the finite number of states and no simple cycle) almost always hold. Ergodicity can be violated, as when allies go to war and cannot transition back to an alliance. Such examples notwithstanding, ergodicity generally holds as well.

The forces that create social inequality have proven immune to policy interventions. In Markov models interventions that change families’ states, such as special programs for underperforming students or a one-day food drive, can provide temporary boosts. They cannot change the long-run equilibrium. In contrast, interventions that provide resources and training that improve people’s ability to keep jobs, and therefore change their probabilities of moving from employed to unemployed, could change long-run outcomes. At a minimum, the model gives us a terminology (the distinction between states and transition probabilities) along with a logic to see the value of changing structural forces rather than the current state.

This has a powerful implication for anyone attempting to alter the long-run state of a complex system. Rather than directly manipulating the states themselves, we should adopt second-order thinking and consider how we can modify the transition probabilities between states such that our desired end state arises naturally.

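The distinction between changing states and changing transition probabilities can be seen in a small simulation. The two-state chain below (employed/unemployed) and its transition probabilities are made up for illustration, and `long_run_distribution` is a hypothetical helper that simply applies the transition matrix many times.

```python
import numpy as np

def long_run_distribution(P: np.ndarray, start: np.ndarray, steps: int = 500) -> np.ndarray:
    """Repeatedly apply the transition matrix P to a starting distribution."""
    dist = start.astype(float)
    for _ in range(steps):
        dist = dist @ P
    return dist

# Hypothetical two-state chain: state 0 = employed, state 1 = unemployed
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Changing the initial state does not change where the chain ends up
print(long_run_distribution(P, np.array([1, 0])))
print(long_run_distribution(P, np.array([0, 1])))

# Changing a transition probability (better job retention) does
P_new = np.array([[0.95, 0.05],
                  [0.5, 0.5]])
print(long_run_distribution(P_new, np.array([0, 1])))
```

Both starting points converge to the same equilibrium of 5/6 employed, while the modified transition matrix shifts the equilibrium to 10/11 employed.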
If your goal is to declutter your messy bedroom, you can set aside a weekend to go full Marie Kondo on your wardrobe, but unless you implement systemic changes which influence the rate of accumulation of junk, you will find yourself back in the same state a year later. Systems dynamics models Systems dynamics models give us a vocabulary for describing the behaviour of complex systems: Sources produce inputs into the system. Sinks absorb outputs. Stocks keep track of levels of variables. Flows capture feedbacks between levels of stocks. A great place to learn more about this approach is Donella H Meadows’ book Thinking in Systems: A Primer (summary here). Long-run stability of systems Feedback loops imply that some systems are not stable in the long-run. “The basic logic of feedbacks is straightforward: positive feedbacks reinforce actions, negative feedbacks dampen them. A system with only positive feedbacks will either blow up or collapse. A system with only negative feedbacks will either stabilize or cycle. A system with both positive feedbacks and negative feedbacks has the potential to produce complexity.” Reasoning about effects Feedback loops make it difficult to reason about the effect of small changes to long-run equilibrium. “The direct effect of increasing the growth rate of hares is more hares. The indirect effect, more foxes, implies fewer hares. These two effects cancel out. Nonintuitive findings such as these are a hallmark of systems dynamics models. Our intuition fails because we latch onto direct effects and fail to think through the entire logical chain. 
Even if the direct effect of increasing (or decreasing) a rate or flow may be to increase (or decrease) a stock, the presence of systems effects in the form of positive and negative feedbacks means that other stocks will also change values, so the net effect of a change in a rate or flow may be reduced, canceled, or even reversed.”

Modelling human behaviour with adaptive rules

This book touches on game theory in a number of chapters. The most interesting section to me was a description of a problem where there is no dominant pure strategy. But when individual actors adopt diverse probabilistic actions, the system naturally reaches a collectively efficient outcome.

El Farol Bar problem – El Farol is a nightclub in Santa Fe, New Mexico that features dancing every Tuesday night. Each week, a population of 100 potential dancers decide whether to go dance at El Farol or stay home. All 100 people like to dance, but they do not want to go if the club is too crowded. Each person earns a payoff of zero from staying home, a payoff of 1 from attending if 60 or fewer people attend, and a payoff of -1 from attending when more than 60 people attend.

Simulations of this type of model find that if individuals possess a large ensemble of rules, then approximately 60 people attend each week: coordination emerges without any central planner. In other words, the system of adaptive rules self-organizes into nearly efficient outcomes.

There is a feedback cycle between micro-level and macro-level rules. The decision of whether to attend or not (micro) influences the level of over-attendance (macro) which in turn influences the individual decisions in the next time period.

If the rules people apply produce a crowded El Farol four weeks in a row, then rules that tell people to attend less often will produce higher payoffs. As people switch to those rules, fewer people will attend.
The micro-level rules produce a macro-level phenomenon (over-attendance) that feeds back to the micro-level rules. https://en.wikipedia.org/wiki/Spherical_cow ↩︎ http://rob.schapire.net/papers/strengthofweak.pdf ↩︎ Scraping unlisted stock prices with BeautifulSoup Saturday, 14 Dec 2019 After taking a course on Machine Learning for Trading, I decided to apply some of the concepts I had learned to model my own stock trading performance. Unfortunately this was not nearly as straightforward as I expected, since my trade history included a number of stocks which no longer exist. How do you find the share price of an unlisted company? There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the yfinance package. But these sources generally only contain data for currently listed stocks. My trade history includes a number of iShares ETFs which no longer exist, including one in particular: AAIT. In my case, the ticker still exists in Yahoo Finance, but the data is clearly broken. Does not seem like a random walk to me There are a number of paid sources for historical data of unlisted companies, but I can’t justify paying $40 for data I am just using to scratch my own curiosity. After a bit of googling, I found some price data for AAIT on a 90s-styled website historicalstockprice.com. Unfortunately it only lets you view a single day at a time, and has no option for csv export. On the bright side, this presented a good opportunity to play around with the BeautifulSoup python library. UPDATE I later found that investing.com has data for a number of unlisted stocks, including AAIT. It also lets you easily download a CSV file with daily prices. So you probably want to check that source before going to all the effort of writing a scraper from scratch. Building a simple web scraper Find the actual URL to scrape The first step is to figure out the actual URL we need to scrape. 
Let’s start with the actual webpage itself. Scraping was just easier in the ’90s. But if we look at the actual source code for the page (Right-click → View Page Source in Google Chrome) it appears that the price data is not there. So it seems that the price data is loaded from some other API—likely using javascript—after the page itself loads. If we disable javascript and reload the page, it confirms our suspicions. With javascript disabled, our page does not contain any price data 😭 After reading through the source code, it is apparent that the page loads its contents from a secondary URL, which is the actual URL we want to scrape. Now we’re making progress! Write a scraper using BeautifulSoup The URL contains parameters for ticker, year, month, and date, so we just need to loop over our date range of interest, format the URL template with the appropriate parameters, and make an API call. We are interested in the contents of the cell under “Close”. Ideally the response would contain a page with CSS classes and IDs, which we could use to cleverly select the appropriate element, but in our case there are no classes or IDs. But since the page always has the exact same structure, we can just take the contents of the fifth td element of the second table element. 
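Before the full scraper, the positional selection can be seen in isolation. This is a minimal sketch on a hard-coded snippet: the HTML, the `CellCollector` class, and the prices are all hypothetical, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. It assumes tables are not nested.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <table> it appears in."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of cell strings per <table>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "td" and self.tables:
            self.tables[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text nested inside e.g. <font> still arrives here while the cell is open.
        if self._in_cell:
            self.tables[-1][-1] += data.strip()


# Hypothetical page: no classes or IDs, so position is all we can rely on.
SAMPLE_HTML = """
<table><tr><td>Site navigation</td></tr></table>
<table>
  <tr>
    <td>2015-08-28</td><td>30.10</td><td>30.50</td><td>29.90</td>
    <td><font>30.25</font></td>
  </tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE_HTML)

# Second table (index 1), fifth cell (index 4).
close_price = float(parser.tables[1][4])
print(close_price)
```

Selecting by position is brittle: if the site adds a table or reorders a column, the indices silently point at the wrong cell, so it is worth sanity-checking the scraped values.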
```python
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import requests
from bs4 import BeautifulSoup


def scrape_hsp(ticker: str, start_date: str, end_date: str) -> pd.Series:
    """ Scrape ticker data from historicalstockprice.com """

    URL = 'https://www.tickertech.net/etfchannel/cgi/?a=historical&ticker={TICKER}&month={MM}&day={DD}&year={YYYY}'

    # One request per business day in the range
    date_range = pd.bdate_range(start_date, end_date)
    prices = pd.Series(index=date_range, dtype=float)

    for dt in tqdm(date_range, unit='days'):
        year, month, day = dt.strftime('%Y-%m-%d').split('-')
        formatted_url = URL.format(TICKER=ticker, MM=month, DD=day, YYYY=year)
        page = requests.get(formatted_url)
        soup = BeautifulSoup(page.content, 'html.parser')

        try:
            # Closing price: fifth <td> of the second <table>
            val = soup.find_all('table')[1].find_all('td')[4].find('font').contents[0]
            prices.loc[dt] = float(val)
        except (IndexError, AttributeError, ValueError):
            # No usable price for this day (e.g. market holiday); leave it as NaN
            continue

    return prices


prices = scrape_hsp(ticker='AAIT', start_date='2013-01-01', end_date='2015-08-28')
```

This takes 7-13 minutes to run for our selected date range, which is acceptable. If we needed to scrape a much larger date range or a number of symbols, we could use the multiprocessing library to make concurrent requests.

When we visualize the data below, we see that we’ve got a reasonable time series of price data!

```python
prices.bfill().plot()
plt.title('AAIT – Closing price');
```

Further reading

Beautiful Soup: Build a Web Scraper With Python (Real Python) – Provides a good introduction to the BeautifulSoup python library, which is the most popular and well-documented library for building a scraper.

A clean way to share results from a Jupyter Notebook

Monday, 02 Dec 2019

I love jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily workflow.
They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.

The drawbacks of notebooks

But notebooks are not perfect. They introduce a number of problems, including but not limited to:

Modularity – reusable chunks of code tend to remain in notebooks rather than being extracted into their own modules (or even packages) as frequently as they should.
Best practices – non-linear execution and global state are great for prototyping, but also make it cumbersome to refactor code later, or to write automated tests.
Version control – Even if you do extract key functionality into its own modules, it becomes hard to keep track of these changes in GitHub, because they are dwarfed by pull requests which contain ±10k lines of code, caused by the JSON representation of raw jupyter notebooks.

Presenting your results to non-technical stakeholders

A critical junction arises near the end of any data science project: how will you share results with the relevant stakeholders? The tool of choice in many organisations, at least my own, tends to be Google Slides. Unfortunately I have created more than a few slide decks whose contents almost entirely consist of matplotlib pngs, copy–pasted directly from a jupyter notebook.

This is sub-optimal, because it causes a disconnect between code and content. Future re-runs of your notebook, perhaps with fixed or fresh data, will not automatically update the visualizations in those slides. This decoupling counteracts much of the benefit of reproducibility which the notebook format promises in the first place.

What stops us from presenting the notebook itself?

Jupyter notebooks have built-in support for Markdown and HTML, so you can embed rich content and largely control formatting.
The main obstacle to presentation-quality notebooks seems to be managing attention. It’s difficult to focus the attention of your audience on a single thing like you can with slides. Although we want to keep code (input cells) for reproducibility’s sake, showing it is distracting.

Take for example, the screenshot below of an HTML output of a raw Jupyter notebook. Notice that the majority of our “above the fold” content here is irrelevant to almost any potential audience of the notebook. Only 20% of the height is made up of details around the analysis.

Not something you’d want to share with a stakeholder.

Existing attempts to solve this problem

Slides

One solution I’ve seen, most frequently used to give technical talks (e.g. at JupyterCon), is slides built using the RISE extension. These definitely solve our first problem (focusing the audience’s attention) but don’t address the second. In fact, they seem best suited for presentations where the code itself is an integral part of what is being presented. I suspect that’s why it appears so frequently in technical talks, but less frequently elsewhere.

nbconvert with --no-input flag

Nbconvert has a built-in flag to hide input, but unfortunately it seems to result in a poorly formatted final output, in which the output of code cells is not aligned with the markdown cells.

```
jupyter nbconvert my_notebook.ipynb --no-input
```

Still not something you’d want to share with a stakeholder.

Static website generator

If you don’t need slides specifically, and if you are interested in building up a consistent experience for your entire team, it might be worth using a static website generator to build a sort of knowledge repo from multiple notebooks. This is less well-suited for sharing a single notebook, particularly if you don’t feel like deploying a site to host the output.
A solution using nbconvert templates

If you are primarily interested in having a clean and shareable report rather than slides, it is possible to achieve this with vanilla nbconvert, rather than adding dependencies on external packages. The best solution I found was this nbconvert template by Damian Avila, which uses jQuery to add toggle functionality, such that the code is initially hidden but can be displayed by clicking on the output of any cell.

It is easy to use:

Download the toggle.tpl template file.
Figure out where your jupyter template directory is, by running:

```python
from jupyter_core.paths import jupyter_path; print(jupyter_path('nbconvert','templates'))
```

Copy the template file to that directory.
From the command line in the directory containing your notebook, run:

```
jupyter nbconvert my_notebook.ipynb --template=toggle
```

Here’s what our output looks like after using nbconvert with a template to hide code cells.

An output you can proudly share with stakeholders.

What an improvement from our first attempt! In this (somewhat contrived) example, our entire document now fits “above the fold”. More importantly, the audience can easily grok the structure of the document and scan it visually.

Bonus: useful jupyter notebook extensions

Jupyter has a useful package called nbextensions which provides a bunch of extended functionality to your notebooks. There are two extensions in particular which are useful for our purposes.

Previewing final “hidden” output from your notebook

There are a few nbextensions related to hiding code cells, but my favourite is Hide input all, which can be used to fold all cells in your notebook in a single click. This is great for previewing what the final html output will look like from within the notebook itself, rather than having to run the full nbconvert command each time.

Clicking a single button hides all input cells in your notebook.
Adding clickable links to section headers

Another great nbextension is Table of Contents (2), which builds a dynamically-updated ToC based on the markdown headings in a notebook. This serves as a good outline during editing, useful for reviewing and revising the macro-level structure of our document. The table is rendered with clickable links in the final html output, which enables readers to navigate through a large report by jumping right to a particular section.

Can you run an A/B test with unequal sample sizes?

Monday, 25 Nov 2019

I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. Most sample size calculators, including our own internal one, assume an equal split between 2+ variations, so I had to take a step back to answer this question.

TL;DR: Yes, but you wouldn’t want to. You can run an experiment with an unequal allocation (e.g. 10–90) as long as you don’t modify the allocation while the experiment is running. However it will be less efficient than a 50–50 allocation: either your test will have less power, or you will need to run it longer to achieve a comparable result.

Do unequal sample sizes bias results?

We want our A/B test results to be an unbiased estimator of the true effect. To achieve this, we rely on randomized assignment to “spread out” the influence of confounding factors equally across variations, so that they do not influence our relative comparison of the difference, or uplift, between the variations.

Even if the proportion of users assigned to each variation is unequal, randomized assignment still works as long as we don’t change the traffic split. You should never modify the traffic allocation mid-experiment, because this can introduce temporal bias into your results. ¹

Are unequal sample sizes efficient?

So it is possible to run an experiment with a non 50–50 split, but is it advisable?
If our goal is to achieve some predetermined risk profile as quickly as possible, then probably not.

Suppose we have a 15% conversion rate, and are designing an experiment to detect a 1 percentage point absolute increase (15% to 16%) with 90% power and 90% confidence. Let’s use the pwr R library below, because it supports non-equal sample sizes. ²

```r
library(pwr)

n1 = 25000
n2 = 25000
p1 = 0.15
p2 = 0.16

h = abs(2*asin(sqrt(p1))-2*asin(sqrt(p2)))
pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
```

```
n1 = 25000
n2 = 25000
sig.level = 0.1
power = 0.9257466
alternative = two.sided
```

So with a 50–50 split, you need to run the experiment on 50k total users (25k per variation) to get the desired result. What happens if we use a 10–90 split instead?

```r
n1 = 5000
n2 = 45000
pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
```

```
n1 = 5000
n2 = 45000
sig.level = 0.1
power = 0.5829899
alternative = two.sided
```

Uh-oh! The power of your experiment (its ability to detect a true effect) falls to under 60%. Let’s scale up our total sample size to find the point at which we achieve a similar power as our initial plan.

```r
n1 = 5000 * 2.8
n2 = 45000 * 2.8
pwr.2p2n.test(h, n1=n1, n2=n2, sig.level=0.10)
```

```
n1 = 14000
n2 = 126000
sig.level = 0.1
power = 0.9274638
alternative = two.sided
```

So a 10–90 allocation would require 2.8x as many total users to reach a similar outcome as a 50–50 split.

We can understand why this is the case by looking at the formula for the standard error of the difference between two binomial proportions, which defines the width of our confidence intervals.

$$ SE_{\Delta} = \sqrt{\frac{p_a(1-p_a)}{n_a} + \frac{p_b(1-p_b)}{n_b}} $$

A lower standard error equals greater certainty. The overall term will decrease whenever we collect samples in either variation, increasing either $ n_a $ or $ n_b $. But there are diminishing returns as $ n_i $ increases. Suppose we’ve already collected 1000 samples in variation A, but only 100 samples in variation B.
Collecting an additional 100 samples in A shrinks A’s share of the term under the square root by only about 9% (its denominator grows from 1000 to 1100), whereas an additional 100 samples in B would cut B’s share in half (its denominator grows from 100 to 200).

When do unequal sample sizes make sense?

If you look closely at the R outputs above, you’ll notice that while our total users required is 2.8x, the number of users assigned to the control group (n1) is actually lower: 14k vs 25k. So if we have a very strong prior belief in our change, but still want to perform some perfunctory experimentation, an unequal sample size could make sense here. But it’s a double-edged sword: if your change is worse than baseline, you will have ultimately exposed more users to the change than necessary to reach a conclusive result.

Probably best to keep it 50–50, since your typical A/B test design involves enough factors to consider already.

¹ Can I Change Traffic Distribution while a Test Is Running? [VWO]
² Proportional power analysis in unequal sample size [RPubs]

Planning A/B tests with a symmetric risk profile (α=β)

Monday, 11 Nov 2019

Here is a somewhat unconventional recommendation for the design of online experiments: Set your default parameters for alpha (α) and beta (β) to the same value. This implies that you incur equal cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices¹, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making.

A default of symmetric risk (setting α=β) has a beneficial side effect of making experiment design easier to understand and communicate. A more parsimonious and intuitive process is more likely to actually get performed the next time someone in your org is planning an experiment.
Why sample size calculations actually matter

Performing a sample size calculation is the most important first step you can take to ensure your experiment is successful. The calculation itself acts as a forcing function², requiring us to ask ourselves a number of questions which reduce our chances of succumbing to common post-analysis pitfalls such as underpowered tests or the multiple comparisons problem.

What is the specific metric we will use to measure success of this experiment?
What magnitude of effect do we expect to see? Are changes on the scale of 1% or 100%?
What level of risk are we willing to accept of being wrong?

Unfortunately, many people consider this calculation to be optional. In many companies, there is nothing truly blocking people from starting an experiment without a plan. So in the interest of efficiency and 80/20, many teams end up embracing a de facto test-first, analyze-second strategy.

Besides making us vulnerable to the post-analysis pitfalls mentioned above, this unfortunately also reduces our capacity to learn from experiments. The beauty of the scientific method is that when we make falsifiable hypotheses and proceed to falsify them, we are then presented with a golden opportunity to refactor our mental models of the world. We can use data to refine our intuitions. But if we don’t actually write out a crisp hypothesis before starting the experiment, it is too easy to fall victim to hindsight bias, subconsciously rewriting the narrative into one which affirms our identity but denies us personal growth.

A very brief review of Type I & II errors

Without diving too deep here, recall that there are two key parameters which correspond to the two ways we can make a mistake in the context of a statistical test:

alpha (α) represents our long-run accepted risk of false positives (FPR).
beta (β) represents our long-run accepted risk of false negatives (FNR).
The power of a statistical test is its ability to correctly identify a true effect (1-β).

I will defer to Google’s ML Crash Course for deeper understanding on this topic, since it provides the clearest learning example I’ve seen using a “boy who cried wolf” analogy.

The problem with your typical sample size calculation

The typical sample size calculation is a trade-off between three parameters: α, β, and the minimum detectable effect (MDE), which is the smallest relative change in our metric of interest which is meaningful to us.

Required sample size is a function of three input parameters.

This calculation is straightforward if we have predetermined inputs and merely want to know the output. But this does not match the reality of planning an experiment in a tech company. It is more of a negotiation than a calculation, particularly when working with a non-technical stakeholder.

A typical conversation around sample size might look like this:

PM asks for your help planning an experiment for a new feature they are launching.
You calculate the required sample size based on their primary KPI and send it back.
PM replies asking if you accidentally meant days where you wrote X weeks duration.
You explain the nature of the calculation, false positives, false negatives, etc.
PM probes for where he or she can apply the good ol’ 80-20 rule to achieve results more quickly.

This conversation can be frustrating for many analysts, but essentially what your stakeholder is trying to do here is to develop an intuition for what the marginal cost of each parameter is, so that they can discern where to compromise. This is a process which is a bit clumsy when we’ve got three “knobs” to work with.

When we set α=β, we effectively eliminate one of these knobs, and turn it into a two-dimensional problem involving MDE and risk. At this point, we can summarize the required sample size at various levels of each using a data table or a 2D plot.
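As a rough illustration of such a data table, here is a minimal Python sketch using only the standard library. The function name, the 15% baseline conversion rate, and the grid of risk and MDE values are all assumptions chosen for illustration. The formula is the common normal approximation for a two-sided test of two proportions, so the numbers may differ slightly from a dedicated calculator.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, mde_rel, alpha, beta):
    """Approximate users needed per variation (two-sided two-proportion z-test)."""
    p_new = p_base * (1 + mde_rel)                  # MDE is a relative change
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided false positive risk
    z_beta = NormalDist().inv_cdf(1 - beta)         # false negative risk
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)


BASELINE_RATE = 0.15               # hypothetical baseline conversion rate
RISK_LEVELS = [0.05, 0.10, 0.20]   # symmetric risk: alpha = beta = risk
MDES = [0.02, 0.05, 0.10]          # relative minimum detectable effects

# With alpha tied to beta, the table only needs two dimensions: risk and MDE.
table = {
    (risk, mde): sample_size_per_group(BASELINE_RATE, mde, alpha=risk, beta=risk)
    for risk in RISK_LEVELS
    for mde in MDES
}

print("risk".ljust(6) + "".join(f"{'MDE ' + format(mde, '.0%'):>10}" for mde in MDES))
for risk in RISK_LEVELS:
    row = "".join(f"{table[(risk, mde)]:>10,}" for mde in MDES)
    print(f"{risk:<6.2f}{row}")
```

Reading across a row shows how much sample a smaller MDE costs, and reading down a column shows the price of accepting less risk. That is the two-knob conversation a stakeholder can follow.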
Conceptually, we can visualize the trade-off between these three parameters in a similar fashion to the project management triangle.

Any change in one dimension requires sacrifice in one of the other two.

Question your defaults (α=0.05, β=0.20)

The first page of Google results consists largely of mediocre sample size calculators pretending to be easy-to-use by simply hiding α and β parameters. The better ones, including my personal favourite, Evan Miller’s Sample Size Calculator, set default values and provide clear explanations as to what these parameters mean. And yet, every calculator which does display α and β, including Evan’s, sets its default values to α=0.05 and β=0.20.

If you don’t have a particularly strong opinion (or understanding) of what your relative ratio between these types of risk should be, it is tempting to simply go with the default options. Before you do so, allow me the opportunity to disabuse you of the notion that these are sacred numbers, unanimously agreed upon by some group of clever statisticians sitting in some room years ago.

Significance

There has been a decent amount of media coverage recently around the problems with p-values³ and their role in the social sciences replication crisis. So naturally at least a few people have asked: Why is 0.05 such a sacred number?

The use of the 5% p-value threshold appears to have become universal in biomedical research, yet it does not seem to be based on any clear statistical reasoning. So far as I can make out, the origin of this threshold seems to lie in a discussion of the theoretical basis of experimental design, published by the Cambridge geneticist and statistician RA Fisher in 1926.
Source: Origin of the 5% p-value threshold [BMJ]

The short answer: it isn’t. But even though there is nothing a priori special about p < 0.05, one could make a solid argument that the practice of having a generally-agreed-upon benchmark is the important part.
A shared standard is valuable when we want to compare levels of evidence across different studies or research groups.

Power

The practice of planning experiments with 80% power is an equally accepted standard, but it does not seem to be discussed nearly as often. It also raises the question: why are these defaults set at a 4:1 ratio?

Although there are no formal standards for power (sometimes referred to as π), most researchers assess the power of their tests using π = 0.80 as a standard for adequacy. This convention implies a four-to-one trade off between β-risk and α-risk. (β is the probability of a Type II error, and α is the probability of a Type I error; 0.2 and 0.05 are conventional values for β and α).
Source: Power (statistics) [Wikipedia]

I suspect that this asymmetry in risk is at least partially due to the close connection between the development of statistics and the biomedical space.

Suppose you are a statistician working for a pharma company. You are running an experiment to determine whether a potential new drug is more effective at treating a particular ailment than an existing alternative. In this context, a false negative (failing to detect that the new drug is in fact more effective) could mean aborting development and missing out on the potential profit from bringing it to market. A false positive (incorrectly concluding the new drug is more effective when it is the same or worse) could mean spending billions of dollars to bring an ineffective drug to market, then subsequently spending billions more on lawsuits and reputational damage in the decade that follows.

In this hypothetical high-stakes scenario in which we face asymmetric costs, it is prudent to be extra-conservative on false positives (α) at the expense of increased false negatives (β).

Flavours of hypothesis testing: Fisher vs. Neyman–Pearson

So it seems entirely plausible that particular domains, including biomedical research, require an asymmetric risk profile, in which we value one of false positives or negatives more heavily than the other. But why do we never see scenarios in which we value false negatives more highly than false positives? While there are in fact a few such studies⁴, they are few and far between.

Alexander Etz lays out a good argument for why this is the case, in his article Question: Why do we settle for 80% power? Answer: We’re confused.:

Why do they not adjust α and settle for α = 0.20 and β = 0.05? Why is small α a non-negotiable demand, while small β is only a flexible desideratum? A large α would seem to be scientifically unacceptable, indicating a lack of rigor, while a large β is merely undesirable, an unfortunate but sometimes unavoidable consequence of the fact that observations are expensive or that subjects eligible for the trial are hard to find and recruit. We might have to live with a large β, but good science seems to demand that α be small.

A lot of the confusion around hypothesis testing seems to stem from the fact that it is a blend of two underlying philosophies: Fisherian significance testing, and Neyman–Pearson hypothesis testing. It is particularly difficult to grok for outsiders, because while these two paradigms have irreconcilable differences, they also share some similarities, and even use the same terminology of null hypotheses, alpha, etc.

I will defer to this excellent explanation of the differences by StackExchange user “gong”:

Fisher thought that the p-value could be interpreted as a continuous measure of evidence against the null hypothesis. There is no particular fixed value at which the results become ‘significant’. On the other hand, Neyman & Pearson thought you could use the p-value as part of a formalized decision making process.
At the end of your investigation, you have to either reject the null hypothesis, or fail to reject the null hypothesis. The Fisherian and Neyman-Pearson approaches are not the same. The central contention of the Neyman-Pearson framework is that at the end of your study, you have to make a decision and walk away.

One particularly frustrating piece of statistical terminology, “failing to reject the null hypothesis”, comes from the Fisherian paradigm. If you are evaluating evidence in relation to a single hypothesis and you do not achieve a significant result, it could be either because such a result is not possible (the null hypothesis is correct) or simply because you did not collect enough data to disprove it. Therefore in a Fisherian context, we cannot accept a hypothesis, we can only fail to reject it.

This paradigm is a natural match for the decentralized structure of scientific discovery in society. Hypotheses aren’t evaluated only once, so false negatives only delay discovery, rather than eliminating it. But researchers face implicit pressure to find surprising (significant) results for their experiment. Funding for future research may depend on it. Since it is not quite as sexy to fund experiments that verify knowledge we already “know”⁵, it makes sense to be very conservative with false positives, at the cost of accepting more false negatives.

This paradigm is a poorer match for the typical decision-making context in a modern tech company in which A/B testing is being performed. We are not interested in advancing the societal body of shared scientific knowledge. We just want to make optimal decisions in an environment of uncertainty. Should we launch version A or B? If we truly walk away after making the decision, then failing to reject the null hypothesis is tantamount to accepting it.

The Neyman–Pearson paradigm is a better fit for this scenario, because it pairs statistics with decision theory. In the NP framework, indecision is not an option.
There is no option to “collect more data”. We plan a required sample size, collect data, make a binary decision between A and B, and then walk away. Rather than providing some continuous measure of evidence for or against a hypothesis, NP hypothesis testing arms us with the tools to confidently make decisions which minimize our long-run regret.

Unprivilege your null hypothesis

If you are testing two versions of your website, which should you designate as the null hypothesis, and which as the alternative hypothesis?

It is standard practice to choose a null hypothesis which reflects the “status quo” that you are attempting to disprove. Given the typical defaults of α=0.05 and β=0.20, this means your null hypothesis occupies a “privileged” position of being innocent until proven guilty⁶. But it can be alarming to observe that the outcome (decision) from an experiment can entirely flip depending on how you frame your null hypothesis⁷. Doesn’t feel particularly objective, does it?

A fantastic side effect of setting α = β when our costs of mistakes are equal is that we can be agnostic as to what our default option is. We don’t have to be as careful as to which hypothesis we designate as null.

Consider the following two scenarios:

You are testing the impact of a new landing page concept on a single market. You have only translated content for a single language, and you’d like to A/B test the new concept before investing in more translations. Unless you see a significant positive effect in your experiment, you plan on staying with the existing system.
Your backend team has done some major refactoring work, and you’d like to run an A/B test to verify that QA did not overlook any critical bugs. All things equal, you would prefer to go with the new refactored codebase, so you plan on launching the change unless you see a significant negative effect from the experiment.
| | Landing page | Refactor |
| --- | --- | --- |
| Default | Stay with existing version | Launch new version |
| False positive | Wasted resources | Missed opportunity for improved conversion |
| False negative | Missed opportunity for improved conversion | Worse conversion |

These two scenarios share a common failure mode (missed opportunity) but because our default decision differs, our risk is treated differently as well. This failure mode is denoted as β-risk in the first scenario, and α-risk in the second. If we were to use the default parameters (α=0.05, β=0.20) for both experiments, we could say “we planned and ran both experiments the same way” but our chance of missing an opportunity would differ by a factor of 4.

If we use a symmetrical risk profile, then we do not need to pay such close attention to which our default options are, because the long-run risk of making each type of mistake is the same.

A pragmatic approach to statistical rigor

If you are championing statistical thinking and experimentation practices in a move-fast-and-break-things environment, you need to pick your battles. For example: it’s probably not worth kicking up a fuss about people in your org treating confidence intervals like posterior probabilities.

On the other hand, I would argue it is certainly worth encouraging and enabling people to perform sample size calculations as part of a pre-experiment planning process. Such a process has multiple benefits: it reduces the risk of implicit multiple comparisons⁸ which would inflate your long-run rate of false positives, and also reduces the number of underpowered tests you perform. Underpowered tests in particular can lead to a pernicious scenario in which experiment results lose credibility within the organization. Small simplifications to the planning process such as using a default of α = β can help you achieve this goal.
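To make the sample size planning step concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sided two-proportion z-test with the usual normal approximation, which is my choice for illustration and not necessarily the calculation you would use in practice. The function name `sample_size_per_group`, the 10% and 11% conversion rates, and the symmetric setting of α = β = 0.10 are all hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, p_variant, alpha, beta):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Uses the normal approximation; treat the result as a planning estimate.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the false positive rate
    z_beta = z.inv_cdf(1 - beta)        # quantile for the desired power (1 - beta)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    effect = p_variant - p_base
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical experiment: 10% baseline conversion, hoping to detect a lift to 11%
conventional = sample_size_per_group(0.10, 0.11, alpha=0.05, beta=0.20)
symmetric = sample_size_per_group(0.10, 0.11, alpha=0.10, beta=0.10)
print(f"alpha=0.05, beta=0.20 -> {conventional} users per group")
print(f"alpha=0.10, beta=0.10 -> {symmetric} users per group")
```

With these particular inputs the symmetric profile asks for a somewhat larger sample than the conventional defaults. The sketch is useful for seeing that trade-off explicitly before an experiment starts.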
1. Although the magnitude of improvement from the “opt-out organ donation” study has been partially debunked, every good salesperson knows there is some power behind the default effect.
2. The biggest value comes not from the output of the calculation, but rather from the questions we must ask ourselves during the process. “Plans Are Worthless, But Planning Is Everything” – Dwight D Eisenhower.
3. 800 scientists say it’s time to abandon “statistical significance” (Vox)
4. Justify Your Alpha by Minimizing or Balancing Error Rates (The 20% Statistician)
5. This has changed somewhat since the Replication crisis, but the fact this crisis occurred at all indicates there is a systemic bias towards new discoveries.
6. This is analogous to the legal concept of Presumption of innocence. Privileging the null hypothesis certainly makes sense here. A criminal escaping justice is unfortunate, but an innocent citizen wrongly imprisoned is horrific.
7. I found a good example in this StackExchange question which illustrates how our decision can flip depending on which hypothesis we assign to be null.
8. Even if you aren’t explicitly testing multiple hypotheses, not having a clearly defined hypothesis before running your experiment leaves you vulnerable to inflated FPR via researcher degrees of freedom.

Making beautiful experiment visualizations with Matplotlib

Monday, 21 Oct 2019

Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow.

It might take a couple seconds to visually parse this visualization at first glance.
I don’t think that’s because it’s complicated per se, but rather because the viz itself contains so much information. After you are used to the format, it’s hard to think of a way to convey a higher density of decision-making-relevant information in such a small space. There are a few things that make this a particularly good visualization for the result of an experiment.

Why it is awesome

It frames many “tests” within the context of a single experiment

The terms experiment and test are often used interchangeably across product teams, no doubt in part due to the terminology around A/B testing. But in the context of a single experiment (in which we experiment by trying something new) we may perform a number of different statistical tests. While each individual test has its own confidence level, we must be careful to adjust our claims of confidence on the experiment level, else we fall victim to the multiple comparisons problem.

Even if you don’t apply any sort of quantitative correction, to guarantee some global family-wise error rate (FWER) or false discovery rate (FDR), having all the tests shown together adds useful context for the reader. Suppose you hear the following statement during a company all-hands:

“We saw a significant increase in viewing hours for the Action genre in position four.”

This statement agrees with the above example plot, but it isn’t particularly insightful. Should we prefer the Action genre for this position over other genres? Or is this the ideal position for that genre across all possible positions? Perhaps both?

Small verbal descriptions of specific outcomes from experiments like this tend to get taken out of context. When this happens, their utility decreases, and their risk of being “misused” increases. Unfortunately I have observed that these sorts of “snippets” are frequently used as ammunition by some decision-makers to support their a priori preferred choice.
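For readers who do want a quantitative correction, here is a minimal sketch of one common option for controlling the family-wise error rate, Holm's step-down procedure. This is my choice for illustration, not something the Netflix post or the plot prescribes. The function name `holm_reject` and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared against alpha / (m - rank)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


# Hypothetical p-values from four tests within a single experiment
p_values = [0.001, 0.04, 0.03, 0.20]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_reject(p_values, alpha=0.05)
print(uncorrected)
print(corrected)
```

In this made-up example three tests look significant in isolation, but only the strongest one survives the correction. A result like that is hard to see when a single test is quoted on its own.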
It emphasizes intervals over point estimates (and p-values)

The past few years have seen significant backlash (pun intended) against the use and misuse of p-values in academia. Today’s social scientists are all familiar with publication bias and the replication crisis. Yet when an A/B test is presented in a tech company boardroom, the first question is still often “Is this result significant?”.

The Netflix visualization replaces the role of p-values with a visual depiction of some confidence interval, whose colour changes depending on whether or not it includes zero. Additionally, although point estimates are shown within each interval, they are visually de-emphasised within the overall context of the visualization.

I’m guessing that Netflix removed x-axis labels to avoid sharing confidential data, but even with those included, it limits people to making statements such as “we expect somewhere between a 1-2% improvement” rather than “we expect a 1.27% improvement”. Using two decimals of precision when our confidence interval is 100x as wide as the precision we are reporting is superfluous and gives us a false sense of confidence in our results.

The contextual info “stays together” in a single shareable image

All of the above properties of a good experiment visualization could also be fulfilled by a nicely designed Tableau dashboard. But what should you do after the experiment ends, and you want to share or save the result for later? Your company’s dashboards are always changing after all, so you can’t guarantee the data will be there a year from now if you want to reference it. So you take a screenshot.

Detailed dashboards are difficult to archive or share

Well this is unfortunate. In order to capture the key parts of the result, you’ve had to take a nearly fullscreen grab of the dashboard. You can throw this in a slide deck somewhere, but you can’t expect anyone to read it. And if they do, you can’t expect them to reach the same conclusion as you did.
In contrast, Netflix’s visualization outputs a story. Better yet, it’s a story contained in a single copy-paste-able sharable png file. This ensures that the nuance of your analysis does not get lost in transit as it is shared over Slack and email.

Rolling our own visualization function

Unfortunately I have not been able to surface any sort of open source libraries under the name “Netflix Vizkit”, so I decided to recreate my own version using Matplotlib.

The function takes as input a pandas dataframe with either a single or multilevel index, and three columns: uplift, std_err, and alpha. If you are running a large number of tests, it would be prudent to first run your dataframe through your procedure of choice to correct for multiple comparisons. I’ll skip that for the purposes of this example.

For this example, I’ve populated a dataframe with fake results corresponding to an email campaign in which we tested three variants and measured four different conversion rates for each. You could also pass in a dataframe with a single level of index, you’ll just get everything plotted on one axis instead of four separate axes.

    plot_experiment_results(
        df=example_data,
        title='Example email campaign (α=0.10)',
        sample_size=123456,
        combine_axes=False)

There are a couple additional parameters in there to add context to the plot, including a title and sample size context line. Remember, we want our output to stand by itself as a record of the outcome of the experiment! This function generates the plot below.

If you want to more closely match the Netflix plot, you can pass the parameter combine_axes=True to merge groups together into a single axis. I found this a bit less easy to visually parse, so I usually leave them separate.
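To show how the three input columns could translate into what gets drawn, here is a minimal sketch that turns an uplift, std_err, and alpha into an interval and checks whether it includes zero. It assumes a two-sided normal-approximation interval. That is an assumption on my part for illustration, and the helper names `confidence_interval` and `excludes_zero` are hypothetical rather than part of the plotting function.

```python
from statistics import NormalDist


def confidence_interval(uplift, std_err, alpha):
    """Two-sided normal-approximation interval at confidence level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return uplift - z * std_err, uplift + z * std_err


def excludes_zero(uplift, std_err, alpha):
    """True when the interval lies entirely above or entirely below zero."""
    lower, upper = confidence_interval(uplift, std_err, alpha)
    return lower > 0 or upper < 0


# Hypothetical rows: (uplift, std_err, alpha)
rows = [(0.02, 0.01, 0.10), (0.01, 0.01, 0.10), (-0.03, 0.01, 0.10)]
for uplift, std_err, alpha in rows:
    lower, upper = confidence_interval(uplift, std_err, alpha)
    flag = "excludes zero" if excludes_zero(uplift, std_err, alpha) else "includes zero"
    print(f"uplift={uplift:+.2f}: [{lower:+.4f}, {upper:+.4f}] {flag}")
```

A check like `excludes_zero` is the kind of rule that could drive the colour of each interval in the plot.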
Full code for the example

Sampling from an iteratively built array in Python

Monday, 07 Oct 2019

While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before…

What’s the fastest way to sample from an array while building it?

If you’re reading this, you should first question whether you actually need to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using np.random.choice, you can avoid this problem entirely.

Unfortunately I could not find a clever workaround for my purposes. This arose while I was implementing the Dyna-Q reinforcement learning algorithm, which requires iteratively sampling from the set of observed state tuples after every iteration of the algorithm. These sampled tuples are then used to refine the transition matrix, with the goal of reducing the number of “real” iterations in which the agent must interact with its environment.

Constraints

Must allow for sampling from a 2D array (matrix)
We do not know ahead of time how many iterations are needed (until convergence)
Sampled values must be transposed into column vectors (although their actual use is not shown)

Benchmarks

I explored a few possible approaches below. Each function runs 10k iterations, in which it appends a row to the “so far” array and then samples 200 rows from it. Note that the function does not actually do anything with the sampled values, since that is out of the scope of this article. I simulate the random vector by generating a single random number using the built-in random module (which is faster than numpy) and duplicating it to make a row vector.

    import numpy as np
    import random

Approach 1: Purely built-in python, avoid NumPy entirely

My first instinct was to attempt to write the function using as few NumPy objects as possible, since I knew from previous experience that np.append() has some overhead.
We can represent the 2D matrix as a list of tuples, and then use the zip function to take the sampled rows and “transpose” them into pseudo column vectors.

    def build_list_choices(iters=10000, sample_size=200):
        list_obj = []
        for i in range(iters):
            ri = random.randint(0, 100)
            list_obj.append((ri, ri))
            a, b = zip(*random.choices(list_obj, k=sample_size))

    %timeit -n 5 -r 5 build_list_choices()
    507 ms ± 36.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Approach 2: Iteratively append to NumPy array

I had read multiple times about the overhead of calling np.append repeatedly, so I wrote this mainly to benchmark the speed, rather than as a real candidate solution.

    def build_arr_iteratively(iters=10000, sample_size=200):
        arr = np.zeros(shape=(2, 2))
        for _ in range(iters):
            ri = random.randint(0, 100)
            arr = np.append(arr, [[ri, ri]], axis=0)
            a_arr, b_arr = arr[np.random.choice(len(arr), size=sample_size)].T

    %timeit -n 5 -r 5 build_arr_iteratively()
    472 ms ± 9.94 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Surprisingly, iteratively appending to a NumPy array has very similar performance to the first approach. Reflecting on the common advice to avoid np.append, I suppose this is contrasted to the much faster alternative of gathering a list of rows and calling a final np.array() once. Unfortunately this alternative wouldn’t work for our use-case, which requires access to the array at each iteration.

Approach 3: Preallocate array and assign within iterations

To avoid the overhead of np.append, we can preallocate size in the array. If we don’t know the final size but are confident in the maximum size, we can simply instantiate the array at that maximum size and take a slice up to the $ i^{th} $ row at each iteration when sampling.
    def build_arr_prealloc(iters=10000, sample_size=200):
        arr = np.zeros((iters, 2))
        for i in range(iters):
            ri = random.randint(0, 100)
            arr[i] = [ri, ri]
            arr_non_zero = arr[:i+1, :]
            a_arr, b_arr = arr_non_zero[np.random.choice(len(arr_non_zero), size=sample_size)].T

    %timeit -n 5 -r 5 build_arr_prealloc()
    371 ms ± 7.08 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

We observe a modest improvement over repeatedly appending. In fact the np.random.choice line dominates the run-time in these functions, so the time spent purely building the array drops from ~100ms to ~20ms, a 5x improvement.

Avoid at all costs: Iteratively building list and converting to array in each iteration

The one thing you should definitely avoid is accumulating a python list but then converting to a numpy array at each step. This takes massively longer than the above approaches.

    def build_list_iteratively(iters=10000, sample_size=200):
        list_obj = []
        for _ in range(iters):
            ri = random.randint(0, 100)
            list_obj.append((ri, ri))
            arr = np.array(list_obj)
            a_arr, b_arr = arr[np.random.choice(len(arr), size=sample_size)].T

    %timeit -n 5 -r 5 build_list_iteratively()
    14.2 s ± 307 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Conclusion

Given the requirements specified above, both a purely numpy and purely built-in python approach to the problem yield similar results. Even np.append() is reasonable, since the sampling part dominates the overall run-time.
If you are confident about the maximum number of iterations you’ll run, you can preallocate rows to the numpy array for a ~25% faster overall run-time.
Whatever you do, avoid calling np.array() during each iteration; this is by far the slowest approach.

If you think you have a better approach, drop a comment below!

Building a hurdle regression estimator in scikit-learn

Monday, 16 Sep 2019

What are hurdle models?
Google explains best,

“The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared.” – Getting started with hurdle models [University of Virginia Library]

What are hurdle models useful for?

Many statistical learning models, particularly linear models, assume some level of normality in the response variable being predicted. If we have a dataset with a heavily skewed response or one which contains extreme outliers, it is a common practice to apply something like a Box-Cox power transformation before fitting.

But what do you do if you come across a clearly multi-modal distribution like the one below? Applying a power transform here will just change the scale of the variable, it won’t help with the fact that there is a huge spike of values at zero. The fact that it is multi-modal is a good indicator that we are over-aggregating data which belong to two or more distinct underlying data generation processes.

Distributions like this are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but then the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.

In the toy example above we have two underlying processes:

Does a customer come back?
If so, how many purchases does he or she make?

The first is modeled as a binomial random variable (coin flip) and the second as a $ \text{Pois}(\lambda=4) $ random variable, which represents discrete event counts.

How can I implement a hurdle model?

So we want to fit and predict two sub-models, and then multiply their predictions together:

A classifier, trained and tested on all of our data.
A regressor, trained only on true positive samples, but used to make predictions on all test data.

The most straightforward way to achieve this would be to just train two separate models, make predictions on the same test dataset, and multiply their predictions together before evaluating. However with this approach we lose the ability to interface our model with the rest of the scikit-learn ecosystem, including passing it into GridSearchCV or any of the evaluation functions such as cross_val_predict.

A better approach is to implement our hurdle model as a valid scikit-learn estimator object by extending from the provided BaseEstimator class.

Making it a valid Scikit-Learn estimator

The code snippet above may feel like it is longer than it needs to be. This is primarily because I tried to write it as a valid scikit-learn estimator, which I learned involves jumping through a few hoops so that it is compatible with other sklearn functions, including:

Init variables must each be of a data type which evaluates as equal when compared with another copy of itself. This is necessary because sklearn clones estimators behind the scenes to do parallel processing in functions such as GridSearchCV. Primitive datatypes (e.g. 'yo' == 'yo' and 42 == 42) pass this test, but already-initialized estimators to use as sub-models do not. Because of this, I pass model type as a string, then use the _resolve_estimator method to instantiate the actual estimator.
The fit method returns the estimator itself, to enable method chaining.
The attribute self.is_fitted_ is set by the .fit() method and then checked by .predict().
Any input is validated using the check_array() function before being fit or predicted.

Scikit-learn provides a check_estimator function which runs a battery of automated tests against your estimator. I learned most of these requirements above while attempting to pass these tests.
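As a minimal illustration of the “multiply the two predictions together” idea, here is a featureless sketch in plain Python. It is not the scikit-learn estimator discussed in this post. The “classifier” is just the share of non-zero responses, the “regressor” is just the mean of the non-zero responses, and the names `fit_hurdle` and `predict_hurdle` and the toy data are hypothetical.

```python
def fit_hurdle(y):
    """Estimate the two components of an intercept-only hurdle model."""
    positives = [value for value in y if value > 0]
    p_positive = len(positives) / len(y)             # stand-in for the classifier
    mean_positive = sum(positives) / len(positives)  # stand-in for the regressor
    return p_positive, mean_positive


def predict_hurdle(p_positive, mean_positive):
    """Expected response = P(clears the hurdle) * E[response | cleared it]."""
    return p_positive * mean_positive


# Toy purchase counts: most customers never come back, some buy a few times
purchases = [0, 0, 0, 3, 5, 0, 4, 0, 0, 4]
p_positive, mean_positive = fit_hurdle(purchases)
prediction = predict_hurdle(p_positive, mean_positive)
print(p_positive, mean_positive, prediction)
```

With no features, the product collapses to the overall mean of the data, which is a handy sanity check. The two-part structure only pays off when each part is a real model conditioned on features.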
Further reading

Rolling your own estimator [scikit-learn docs] – Provides a good overview of how to write your own estimator
Github / NeverForged / Hurdle [GitHub] – I used this as a starting point for my code.
Creating your own estimator in scikit-learn – Some additional concerns w.r.t GridSearchCV

When Python’s built-in random module is faster than NumPy

Tuesday, 10 Sep 2019

TL;DR If you need a single random number (or up to 5) use the built-in random module instead of np.random.

An instinct to vectorize

An early learning for any aspiring pandas user is to always prefer “vectorized” operations over iteratively looping over individual values in some dataframe. These operations, which include most built-in methods, are compiled into Cython and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.

But after learning to love NumPy for this reason, I was surprised to encounter a few situations where NumPy is actually slower than vanilla python, particularly when generating scalar values or small arrays of random numbers using the np.random sub-module.

Generating a random float

I have written more than a few pieces of code which introduce some randomness by comparing a random float in the range [0, 1) to the sampling rate argument in an if-statement. For this purpose, you should use python’s built-in random module.

    import numpy as np
    import random

    %timeit random.random()
    69.5 ns ± 0.817 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    %timeit np.random.rand(0, 1)
    987 ns ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note that np.random.rand(0, 1) as timed here returns an empty array of shape (0, 1); the scalar equivalent of random.random() is np.random.rand() with no arguments.

Generating a single random float is more than 10x faster using Python’s built-in random module than using np.random.
So if you need to generate a single random number (or fewer than 10 numbers) it is faster to simply loop over random.random() a few times rather than calling np.random.rand().

Generating a random integer

Generating random integers with np.random is not quite as slow relative to the built-in module, but np.random.randint() is still slower than random.randint().

    %timeit np.random.randint(0, 100)
    5.05 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    %timeit random.randint(0, 100)
    898 ns ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Generating a single random integer is 5x faster using the random module compared to np.random.

Sampling from existing array or list

    population = list(range(1000000))

    %timeit np.random.choice(population)
    48.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit random.choice(population)
    930 ns ± 6.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Sampling a single value from a list executes roughly 50,000x faster (930 ns versus 48.8 ms) using random than np.random. This is a slightly unfair comparison, since NumPy spends most of the time converting the population list into an array object before sampling, but it represents a real use-case I ran across when attempting to iteratively build and sample from an array of unknown length while building a reinforcement learning algorithm.

A note of caution for cryptography purposes

It is stated in the documentation for python’s random module but is worth reiterating: these are “pseudo-random” numbers which are good enough for most statistical purposes but should not be used for applications which require cryptographically secure random numbers.

“The pseudo-random generators of this module should not be used for security purposes. For security or cryptographic uses, see the secrets module.”

Creating a monthly + daily DAG pattern in Airflow

Thursday, 15 Aug 2019

Problem

You initially built a data pipeline for a project you were working on, but eventually other members of your team started using it as well.
You move the logic into Airflow, so that the pipeline is updated automatically on some regular basis. You’d like to set schedule_interval to daily so that the data is always fresh, but you’d also like the ability to execute relatively quick backfills. With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh.

Solution

We want to eat our cake and have it too. We can achieve this by creating two separate DAGs, one daily and one monthly, using the same underlying logic.

Astronomer.io has a nice guide to dynamically generating DAGs in Airflow. The key insight is that we want to wrap the DAG definition code into a create_dag function and then call it multiple times at the top-level of the file to actually instantiate your multiple DAGs.

    from airflow import DAG

    def create_dag(*args, **kwargs):
        dag = DAG(*args, **kwargs)
        with dag:
            # Declare tasks here (operators and sensors)
            # Set dependencies between tasks here
            pass  # placeholder so the block is valid Python until tasks are added
        return dag

Our parameters of interest are dag_id, start_date and schedule_interval, so be sure to include those on your create_dag function.

We’d like our monthly job to run on the first of every month, for all historical data.

    dag_monthly = create_dag(dag_id=f'{DAG_NAME}_monthly',
                             start_date=START_DATE,
                             schedule_interval='0 7 1 * *')

We’d like our daily job to only run for the current month, but daily.

    from datetime import datetime

    current_month_start = datetime.strptime(datetime.now().strftime('%Y-%m'), '%Y-%m')
    dag_daily = create_dag(dag_id=f'{DAG_NAME}_daily',
                           start_date=current_month_start,
                           schedule_interval='0 8 * * *')

Make sure to define both of your DAGs at the top-level of the _def.py file so that Airflow knows to instantiate them. They will appear as separate DAGs in the main UI, but the underlying logic is DRY since they are both defined from the same create_dag function.
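The current_month_start expression round-trips the current time through a string to truncate it to the first of the month. As a small aside, here is a sketch of an equivalent way to get the same value with datetime.replace, checked against the string-based version. Both helper names are hypothetical, and both work on naive (timezone-unaware) datetimes.

```python
from datetime import datetime


def month_start_via_strings(moment):
    """Truncate to the first of the month by formatting and re-parsing."""
    return datetime.strptime(moment.strftime('%Y-%m'), '%Y-%m')


def month_start(moment):
    """Truncate to the first of the month by resetting the smaller fields."""
    return moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


# Arbitrary example timestamp in the middle of a month
example = datetime(2019, 8, 15, 13, 45, 12, 345)
print(month_start_via_strings(example), month_start(example))
```

Either form can be passed as start_date. The replace version simply avoids the string round trip.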
Updates

[2019-09-03] – Initially I had schedule_interval='0 7 2-31 * *' on the daily dag to avoid duplicate processing on the 1st day of the month. But Airflow runs jobs when the next schedule interval arrives (somewhat counter-intuitive) so what we actually want to do is skip the job corresponding with the last day of the month, rather than the first day. Unfortunately it is not possible to express this in a simple cron expression, due to the varying length of months.

One-hot encoding + linear regression = multi-collinearity

Monday, 29 Jul 2019

My coefficients are bigger than your coefficients

I was attempting to fit a simple linear regression model the other day with sklearn.linear_model.LinearRegression but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of trillions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.

    feature_A_1    4060461707040.634
    feature_A_2    4060461707005.303
    feature_A_3    4060461706988.173
    feature_B_1   -2529776773226.519
    feature_B_2   -2529776773214.394
    feature_B_3   -2529776773206.096
    feature_B_4   -2529776773204.950
    feature_B_5   -2529776773203.577
    feature_B_6   -2529776773201.271
    feature_B_7   -2529776773195.004
    Name: coef, dtype: float64

What is going on here?

It turns out it was related to my use of OneHotEncoder in my preprocessing pipeline to convert categorical features into a numeric format suitable for linear models. The best practice to convert a categorical feature containing $ k $ values is to output only $ k-1 $ one-hot encoded features, leaving one of them as the “default” value when all other $ k-1 $ booleans are zero. Unfortunately I overlooked the fact that by default, OneHotEncoder sets the parameter drop=None which in turn causes it to output $ k $ output columns.
When then used to fit a linear model with intercept, this results in a situation where we have perfect multicollinearity, and so the model overfits the data using unrealistic coefficients. This is known as the dummy variable trap.

An easy fix…

Since we do not want to remove the intercept, the solution is to encode our categorical features with the parameter drop='first' to produce only $ k-1 $ columns for each categorical feature.

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    cat_cols = X.select_dtypes('category').dtypes.index.values.tolist()

    pipeline = Pipeline([
        # One-hot encode only the categorical columns, dropping the first level of each;
        # passing the remaining columns through unchanged is an assumption here
        ('one_hot', ColumnTransformer(
            [('cat', OneHotEncoder(drop='first'), cat_cols)],
            remainder='passthrough')),
        ('lin_reg', LinearRegression()),
    ])
    pipeline.fit(X, y)

…but it doesn’t play nicely with CV pipelines

An additional challenge I faced was that my OneHotEncoder was part of a pipeline which was ultimately fed into the cross_val_predict function. This function splits up the dataset into a number of folds and runs the preprocessing pipeline separately for each fold. It is possible that the training dataset used in one or more of the CV folds may not include every possible value for every categorical feature. When the pipeline is subsequently applied to the test dataset in that fold, it will throw an error about an unknown value, unless you use the parameter OneHotEncoder(handle_unknown='ignore').

Unfortunately it is not possible to simultaneously set drop='first' and handle_unknown='ignore' on OneHotEncoder, else you get the error below.

    ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.

I have not found an elegant solution to this problem. If you know one, please let me know. For now, I fell back to a non-pipeline solution in which I fit OneHotEncoder against the entire dataset, and then make predictions against a manually-split test set.
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import OneHotEncoder, PowerTransformer

    numeric_cols = X.select_dtypes(np.number).dtypes.index.values.tolist()
    cat_cols = X.select_dtypes('category').dtypes.index.values.tolist()

    # Train the transformer on the full dataset (causes some leakage for PowerTransformer)
    col_tx = ColumnTransformer(transformers=[
        ('num', PowerTransformer(), numeric_cols),
        ('cat', OneHotEncoder(drop='first', handle_unknown='error'), cat_cols)
    ]).fit(X)

    # Transform training data and fit model
    X_train_tx = col_tx.transform(X_train)
    model = LinearRegression()
    model.fit(X_train_tx, y_train)

    # Transform test data and make predictions
    X_test_tx = col_tx.transform(X_test)
    preds = model.predict(X_test_tx)

How to fix the hinge on an IKEA Friheten couch

Saturday, 20 Jul 2019

If there is one piece of furniture I regret buying from IKEA, it is the FRIHETEN sofa bed for €400. I wonder if this is where they make their margin, after selling €5 coffee tables as loss leaders to get you into the store.

The FRIHETEN has two mechanical components that are prone to failure: a section which pulls out and “pops up” to form the sofa bed, and a chaise section which opens up to provide storage within.

After two years of occasionally pulling the sofa out to vacuum, the chaise lid seat started to “slip” into the storage compartment when someone was sitting on it. Upon closer inspection, it seems that there is a metal edge on each side of the lid near the hinge assembly, which when closed should rest on top of a metal bracket on the lower box. I noticed that my hinge arms were no longer perfectly centered, so when the top piece came down it would slip off “into” the storage, and that corner would dip.

Not much contact surface area.

Better.

I managed to stop it from slipping by putting a 10cm wide metal corner bracket over the connecting bracket, which adds an additional ~1cm of metal, giving the top piece a solid surface but still leaving just enough clearance for the hinge mechanism to function.
Before

After

Hopefully this helps anyone who is facing a similar issue. Your couch is not garbage, you can fix it with a €2 piece of metal from your local hardware store. Have fun asking where to find this at the store though.

Reflections on three years of spaced repetition with Anki

Monday, 17 Jun 2019

I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward.

WTF is Anki?

For the uninitiated, Gwern provides an excellent overview of spaced repetition and its effectiveness as a learning tool. I’ll focus the rest of this post on my own personal use-cases for the tool.

I first downloaded Anki during my first year of undergrad, less than 48 hours before taking my final exam for Latin and Greek roots in English. It was the classic university student use-case: how can I cram all this knowledge into my head for just long enough to pass next week’s exam? Even though Anki’s algorithms are optimized for long-term retention, they are still the best approach for short-term cramming.

Since that first encounter with spaced repetition, I have used it (somewhat) more successfully to build my foreign vocabulary, internalize mental models learned from other fields, and most recently to review math proofs in order to improve my retention while studying for my master’s degree part-time.

Remembering stuff is hard work

Reviewing flashcards for 15 minutes per day does not sound like a particularly large time investment, but it takes a disproportionate amount of mental exertion. Anki’s algorithm attempts to schedule cards such that you are just barely able to successfully remember them.
The active recall principle tells us that it is at this point that review confers the maximum benefit in terms of long-term memory consolidation. If Anki showed you cards earlier it would be easier to remember them, but doing so would yield less benefit per review.

In this regard, remembering stuff is no different than any other skill acquisition. The study of deliberate practice tells us that skill improvement is not proportional to total practice, but rather to the amount of practice conducted at the outer edge of our abilities. A musician who plays through a song they already know by heart may have fun, but they do not improve as much as a musician who spends an hour deconstructing a song they cannot yet play comfortably, practicing a single chord over and over again until it becomes muscle memory.

So 15 minutes of flashcards feels unpleasant, but it gives us greater memory benefit than hours of passive consumption (watching videos, reading articles, etc.).

Spaced repetition is not a replacement for learning

Trying to remember things you’ve already learned can be tough, but trying to remember things you never really learned in the first place is just downright frustrating. I learned this the hard way while attempting to learn frequency lists¹ of Spanish vocabulary through an imported csv file. It feels deceptively productive to create a large number of cards, but you may be surprised at how much more difficult they are to remember than cards you create yourself².

What Anki does do well is provide a mechanism for decoupling the active process of learning from the more routine process of remembering. Imagine you are learning some topic (a language, a field, a skill) and you resolve to spend three hours every weekend studying. The output of this focused effort spent studying is an increase in your knowledge. But you will also experience a decrease in knowledge caused by forgetting some proportion of your existing or just-learned knowledge.
So next time you study, you may spend 95% of your time learning new material, and 5% “refreshing” on previously forgotten material. The actual percentage forgotten of course depends on the person, how they study, what they are studying, how often they study, etc. But the key insight here is that you’ve got a leaky system that is not long-term stable. It just wouldn’t make sense to say “I’ve got a curiosity about physics so I will spend one Sunday per year learning it”, because you’d probably spend the first half of the day trying to remember what you learned a year ago. For any level of intensity, there is some level of frequency at which you reach an equilibrium: you are treading water but staying in the same place.

Now let’s say you implement a practice of creating new flashcards at the end of your study sessions, and performing a 10-minute daily review regardless of whether you are studying that day or not. This effectively modifies our “learning system” to look like the chart below. To some extent, we are systematically countering the effect of forgetting using review. No review system is perfect, so naturally the rate of forgetting will never equal zero.

A lower rate of forgetting equates to a lower “equilibrium” point with respect to study frequency. Without spaced repetition, studying a topic once every 3 months may feel inefficient, because you’d need to spend a chunk of time “refreshing your memory” before learning new content. With spaced repetition, you will almost certainly find it much quicker to jump back into things.

I have found this immensely valuable while studying part-time towards my master’s degree in analytics over the course of multiple years. Machine learning and statistics are extensions of underlying concepts in calculus, probability, and linear algebra. Having a deep understanding of fundamental concepts from these fields makes learning ML easier, but they are difficult to maintain when “used” so infrequently.
By creating Anki cards for key math proofs, I ensure that I encounter them with some minimum frequency in a problem-solving context, which prevents me from entirely forgetting them in between courses.

Memory is directional

Our ability to recall particular memories is known to be dependent on the context in which we learned them. We can conceptualize our memory as a graph, in which individual memories are encoded as nodes, contextual cues and relationships as edges between those nodes, and the remembering process as the task of traversing the graph to a particular node. We’ve all had the experience of having a word or concept on the tip of our tongue but not quite being able to remember it. We know the memory exists in our knowledge graph, but we lack sufficient connections between nodes to retrieve it in a timely manner.

Let’s take the example of learning a piece of foreign vocabulary using a typical front ↔ back flashcard. Reviewing this flashcard in both directions would strengthen the following connections (edges) in our knowledge graph:

Foreign word → English: strengthens our ability to recognize and comprehend the word when encountered in the external world (i.e. recognition)
English → Foreign word: strengthens our ability to produce the foreign word from “thin air” to express ourselves (i.e. production)

Recognition is generally easier than production, and may often be a prerequisite. But training recognition alone does not translate to production, and often gives us a false sense of confidence in our knowledge. This is why the Feynman technique is so powerful: being able to fully “produce” the knowledge (either a foreign word or a full explanation of a concept) is the holy grail of remembering.

We can take this concept even further. We have trained our recognition from a textual representation of the word, but could we recognize it when spoken to us in a noisy bar?
We have trained production from a similar textual English word, but could we think of the foreign word just from looking at a picture of the concept, without sub-vocalizing in English first?

My vocabulary note type now includes multiple fields, and selectively generates 5-6 cards depending on which fields I fill in:

Foreign language
Translation
Definition
Example
Picture
Synonyms
Audio

A beneficial side effect of having a custom note type is that I have to manually create my own cards rather than being tempted by bulk import. This turns out to be a benefit because the time spent crafting the card (picking a picture, choosing the example sentence, etc.) translates to better starting retention.

The bottom line

All things considered, active recall testing with Anki is probably the single highest ROI habit that I perform on a regular basis. 100 hours per year may sound like a big chunk of time, but I think it’s important to consider that a lot of this was downtime to begin with. If you consider that much of this time is spent on public transit, or waiting to meet a friend who is a few minutes late, then it likely has an even higher ROI.

That said, I have learned to become somewhat more selective about what I add, and to more frequently use custom note types which enable richer content and facilitate multi-directional connections in memory. If it’s worth adding a card at all, it is worth spending a few extra seconds to improve your chances of successfully recalling it months from now. I am also more willing to delete cards that aren’t working for me, including poorly created cards that I frequently fail to recall.

Further reading

Anki Tips: What I Learned Making 10,000 Flashcards – Inspiration for this post, includes some easy-to-digest tips for using Anki more effectively.
Augmenting Long-term Memory – An in-depth exploration of spaced repetition and its applications for both academic purposes and practical life, from the perspective of a physicist.
Spaced Repetition for Efficient Learning – Gwern gives a great overview of the scientific literature w.r.t. spaced repetition.

¹ Tim Ferriss recommends this approach in How to Learn Any Language in 3 Months. It may be sufficient for bootstrapping yourself into an immersion environment, but it’s certainly not ideal for long-term retention without some other form of practice.
² Gabriel Wyner, the author of Fluent Forever, is all for memorizing vocabulary, but strongly recommends generating the flashcards yourself rather than importing them, to solidify the knowledge in your head. Read about it here.

Embed markdown documentation into your Airflow DAGs

Monday, 13 May 2019

Why you should do it

I recently discovered that Apache Airflow allows you to embed markdown documentation directly into the Web UI. This is a very neat feature, because it enables you to locate your documentation as close as possible to the thing itself, rather than hiding it away in some Google Doc or Confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten about and undiscovered by new team members.

How to do it

To make your markdown visible in the Web UI, simply assign the string variable to the doc_md attribute of your DAG, e.g. dag.doc_md = "My documentation here".

That said, I generally put the docs in a string variable at the top of the file, and then assign it later down in the file. This way, it serves a dual purpose of providing context to anyone editing the DAG definition file itself.

Example code

    docs = """
    ## DAG Name

    #### Purpose

    This DAG connects data from one source to another, performs necessary transformations, and creates a set of tables that can be used by analysts.

    #### Outputs

    This pipeline produces the following output tables:

    - `table_A` – Contains useful information about ABC.
    - `table_B` – Contains useful information about XYZ.

    #### Owner

    For any questions or concerns, please contact [me@mycompany.com](mailto:me@mycompany.com).
    """

    with DAG(…) as dag:
        dag.doc_md = docs

Save entire webpages for reference with SingleFile

Monday, 15 Apr 2019

I’ve been reading through a lot of Tiago Forte’s writing on his members-only publication Praxis. Since reading through his series on progressive summarization, I have become more conscientious with regards to saving the “work-in-progress” artifacts of my thinking process to Evernote. Often this involves a link to a piece of content, a couple highlights, and a bullet point or two about key takeaways.

The problem

It’s pretty easy to surface relevant notes using the Search function if I’ve added enough contextual info to the note, but less so if it’s just a link. So I wanted to start saving the actual raw content of key articles, particularly if they come from a members-only publication to which I may not have permanent access, and which I cannot surface on Google.

I initially tried using the Evernote web clipper to save entire articles, but quickly realized that this was cluttering up the namespace of my Evernote search. A few 10k word articles add up quickly, and soon they dwarfed the amount of content in my otherwise relatively text-sparse notes. Searching simple one or two-word phrases related to everyday notes (e.g. shopping, home tech, etc.) would return a result set cluttered with barely relevant saved articles.

Criteria

A suitable solution would satisfy the following criteria, in decreasing order of importance:

Searchable – Can I perform a global search for text contained within pages without knowing which page to open?
Portable – Can I search and open the file on different devices using some static file, or must I launch some command line tool on my laptop before opening some proprietary format or web UI for a localhost database?
Readable – While true as-web formatting would be ideal, I would settle for being able to read the primary content (text) start-to-end. Lack of CSS and JavaScript is sometimes not just ugly, but makes the content unreadable.

Solution: SingleFile

I recently came across a neat Chrome extension called SingleFile which saves webpages as HTML files, but first waits for lazy-loading JavaScript, images and CSS to render. It doesn’t work perfectly (it sometimes includes the blurry version of lazy-loaded photos unless you first scroll to the end of the page), but it works lightyears better than anything else I’ve tried.

If you store your HTML files in a folder indexed by Alfred, you can instantly surface them using the in keyword.

Other things I’ve tried (or considered)

Saving HTML locally via your browser

In Chrome you can achieve this with a Right-click → Save as → Webpage, Complete. The main problem is that it doesn’t include CSS and JavaScript not present at the initial pageload. Without this CSS, a lot of pages are impossible to decipher.

Print Friendly & PDF

This is the best Chrome extension I’ve found up until now. It only outputs PDFs, but it lets you interactively remove superfluous components (e.g. advertisements, banner images) before saving. This extension would work well for someone who either wants a print-ready format or likes PDFs (e.g. for highlighting in Mac Preview app). [chrome web store]

Web recorder

A tool called webrecorder.io came up in a few Hacker News threads. It seems to be a comprehensive roll-your-own alternative to something like Archive.org. It’s somewhat overkill for my purposes though, which largely amount to archiving articles for personal consumption.

Every good data analysis starts with "Why?"

Tuesday, 02 Apr 2019

In a previous life as a PM, I wrote a lot of Jira tickets.
In the software development world, a “ticket” is a unit of work entered in some workflow tracking tool such as Jira, but it can represent anything from a task or goal to an issue or bug. After translating a few high-level strategic projects into trees of tickets, I realized that the average ticket was pretty mediocre.

Issue #4824: Login button is broken.

This ticket provides very little contextual value. When you are on the hook for results rather than merely completed tickets, this is sub-optimal. A lot of methodologies such as user story mapping impose a structure which makes it somewhat more difficult to produce such low-value tickets, but they can still be “gamed”.

User story: As a user, I would like the login button to work.
Status quo: The login button doesn’t work.
Acceptance criteria: The login button doesn’t work.

A big revelation for me was stumbling across Simon Sinek’s TED talk Start with Why. I am now of the opinion that the humble user story, often the least carefully considered part of the ticket, is in fact the most important part of the ticket. A well-written user story, one which effectively captures and communicates The Why, has the potential to make or break the success of a software development effort by subtly conveying intent and purpose. I have learned first-hand that tickets which focus on a crisply defined purpose rather than prescribing a set of actions ultimately correlate with successful projects.

All good things come in threes.

Before I started working as a data scientist, I did not realize that this principle is just as important for interfacing between analysts ↔ stakeholders as it is for the traditional PM ↔ developer relationship. Being on the receiving end of vaguely formulated requests has reinforced the fundamental importance of communicating The Why on projects where work spans across multiple people or teams.

Why start with why?

The modern knowledge worker works in a highly specialized environment.
Specialization improves efficiency, but it comes at the cost of less reactivity and adaptability to change. As units of work grow beyond the span of a single agent, they impose a trade-off. But we can hack this trade-off. In an organization with multiple actors, the question shouldn’t be “Is collaboration worth it?” but rather “How can we reduce the cost of collaboration?” A natural place to start is in the written form of the request/job/project/task.

There is a common failure mode in technical support, commonly referred to as the XY Problem, which manifests when a customer with some underlying goal X makes an inferential leap to a sub-problem Y but then does not communicate that inferential leap when asking for help.

If you are on the receiving end of an unclear request, the Five Whys technique is a useful approach to uncovering the true root cause or issue at play. If you are on the dispatching end of a request, then Starting with Why is a prophylactic technique to avoid the message being interpreted incorrectly to begin with.

The “why” of data analysis

Data analysis has an aura of objectivity, but in practice it frequently involves a number of subjective decisions. The sheer number of choices one must make in the course of answering any sort of interesting question with data is overwhelming. You say, “We want to understand the behaviour of our returning customers”, I reply “What do you mean by customers? Define ‘returning’. And what sorts of behaviour are we interested in specifically?”

Some of these are pivotal questions which simply must be answered. But others are micro-decisions, each of which only marginally affects the results, but whose compound effect can influence the entire outcome of an analysis. ¹

Only when you know the question will you know what the answer means.
– Douglas Adams, The Hitchhiker’s Guide to the Galaxy

It is critical to have a firm grasp of your root question before starting an analysis.
Having a firm grasp of that question lets you make smarter sub-decisions and makes you more likely to arrive at a useful outcome. The main challenge is that sometimes the question-asker doesn’t consider it necessary to show their full hand. Vulnerability does not come easy, after all.

Don’t be an SQL monkey

So given that you can’t directly influence the clarity of thinking of your stakeholders, what is a frustrated data analyst to do? Here are a few actionable tips which I try to run through each time I am facing a vaguely defined problem, in order to maximize my chance of ultimately reaching a successful outcome.

Crisply define terminology

Make sure you are fully aligned on what the terms you are using actually mean. Every business has a number of phrases which serve as weasel words. Here are a few examples:

Users → Paying customers or everyone with an account? What about non-logged-in “users”, which are really just tracking cookies in someone’s browser?
Retention → This one is straightforward for subscription services, but more vaguely defined in a non-contractual setting. How frequently does a customer need to purchase to be considered “active”? What if they return and browse, but don’t purchase?
New vs. returning → In relation to the above, do we start counting when a customer first visits, when they sign up, or when they buy?

Think through hypotheticals

A helpful way to draw out the meat of a decision is to ask what hypothetical action we would take for each possible outcome of the analysis. Consider drawing a decision tree (the diagram, not the machine learning model). Understanding the topology of a decision allows you to more carefully craft the analysis to inform that particular decision.

In the case of an A/B test: what is our default decision/action if results are inconclusive? Do we only launch if there is X% improvement in our KPIs, or do we launch as long as there is no noticeable decline? What is X%?
Working through this thought exercise will sharpen your intuition around the nature of the problem, making it easier to make better micro-decisions such as setting appropriate risk parameters (α, β) for the A/B test based on the business risk of a false positive or false negative outcome.

If it’s a visualization, draw a picture

Almost every request for a specific data visualization is actually a request for an artifact which your stakeholder believes will be useful for solving some underlying question which he or she has decided to keep hidden from you. Try to understand that underlying question.

But sometimes there is a need for just a good plain old chart or dashboard. In those cases, I have found it helpful to draw a picture of the output before starting. You can do this collaboratively with your stakeholder, or, if your drawing skills are as embarrassingly poor as mine, you can sketch something out ahead and meet to align on it.

Based on your a priori domain knowledge, it is often possible to arrive at something reasonably similar in structure and content to the final piece of dataviz. Getting this prototype in front of your stakeholder before implementing it will frequently surface follow-up questions or revisions that would otherwise have cost you time in rework. It may also help you identify gaps between the currently available data and the data required to answer the underlying question.

More data → more problems

There is a pervasive desire for more and faster data, particularly among product managers. But besides adding unnecessary processing complexity, real-time analytics can actually provide negative benefit.

Adam Robinson has a great little story he tells on various podcasts about a study by psychologist Paul Slovic ² which illustrates how additional data has diminishing returns for decision quality, but not for our confidence in our decisions.
Beyond some point, additional data makes us no more accurate, but it makes us think we are more accurate.

When you’ve got 3 data points which disagree with your prior worldview, it’s tough to avoid the cognitive dissonance. It’s uncomfortable, but this is how scientific progress is made. But if you’ve got 30 data points and only ½ of them disagree, it’s a lot easier to tell yourself a story which reaffirms your worldview and sidesteps the cognitive dissonance. Unfortunately this cognitive comfort comes at the cost of a wrong decision.

Further reading

How to ask good questions (Julia Evans)
A 12-Minute Summary of “Start With Why” by Simon Sinek

¹ https://statmodeling.stat.columbia.edu/2012/11/01/researcher-degrees-of-freedom/
² I often share a snippet from Matt Mullenweg’s blog, although I recall first hearing of this study on the Tim Ferriss show.

Calculating the bearing between coordinates in Redshift

Monday, 11 Mar 2019

I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However the direction of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.

What do we mean by “direction” anyway?

The most intuitive interpretation of direction seemed like compass bearing, so I set out to find a way to convert a pair of spatial coordinates (latitude and longitude) into a variable which represents degrees right of true north.

Unfortunately I could not find any suitable built-in functions to deal with spatial data in Redshift. While it would not be difficult to spin up a Jupyter notebook, pull in some data via SQL and run each row through some function, it would not be an ideal approach.
Keeping a small data request like this as a pure SQL query means it is easily reproducible in the future, without worrying about Python package versions, Anaconda environments, etc. Furthermore, anyone with access to the data warehouse can fetch updated data, rather than only someone comfortable with Python.

Enter Python UDFs in Redshift

But all is not lost. Python UDFs to the rescue! Redshift lets you declare user-defined functions that take some scalar inputs, run a chunk of Python code and return the output right back into SQL. Instead of declaring your function as a Python function using def my_func(param) syntax, you place its contents in the UDF function declaration below.

    CREATE OR REPLACE FUNCTION my_func (param_a float, param_b float)
    RETURNS float
    STABLE
    AS $$
        < python code >
    $$ LANGUAGE plpythonu;

Trying to remember as little decade-old trigonometry knowledge as possible, I found a working function on this stackexchange question and plugged it into the UDF boilerplate above. The final result looks like this:

    CREATE OR REPLACE FUNCTION bearing_between_coordinates (x_lat float, x_lon float, y_lat float, y_lon float)
    RETURNS float
    STABLE
    AS $$
        import math

        # Convert all inputs from degrees to radians
        startLat = math.radians(x_lat)
        startLong = math.radians(x_lon)
        endLat = math.radians(y_lat)
        endLong = math.radians(y_lon)

        dLong = endLong - startLong
        dPhi = math.log(math.tan(endLat/2.0+math.pi/4.0)/math.tan(startLat/2.0+math.pi/4.0))

        # Take the shorter way around when the longitude difference exceeds 180 degrees
        if abs(dLong) > math.pi:
            if dLong > 0.0:
                dLong = -(2.0 * math.pi - dLong)
            else:
                dLong = (2.0 * math.pi + dLong)

        # Normalize the result to the range [0, 360)
        return (math.degrees(math.atan2(dLong, dPhi)) + 360.0) % 360.0
    $$ LANGUAGE plpythonu;

Execute this once in your database console, then you can use it within an existing query, for example:

    SELECT bearing_between_coordinates(x_lat, x_lon, y_lat, y_lon) AS bearing
    FROM lat_lon_coords

Or you can stitch together trig functions in Redshift

Update: A Python UDF may be overkill.
I realized after writing the above that I can compute a bearing using only built-in trigonometric functions in Redshift. This results in the “almost one-liner” below. Strictly speaking, this expression is the great-circle initial bearing, whereas the UDF above computes the rhumb-line bearing, so the two can differ noticeably on long trips. I opted to use a CTE to convert inputs to radians rather than embedding in the select to make that behemoth slightly less unreadable.

There is definitely a trade-off on interpretability though. This SQL code does a poor job of projecting intent compared to a defined function. Rather than reading the first line of the function declaration, you need to read all the way through to the final alias bearing_degrees to understand why we are chaining together a bunch of trig functions anyway.

    WITH coords_as_radians AS (
        SELECT
            RADIANS(x_lat) AS x_lat
            , RADIANS(x_lon) AS x_lon
            , RADIANS(y_lat) AS y_lat
            , RADIANS(y_lon) AS y_lon
        FROM raw_coordinates
    )
    SELECT
        (DEGREES(ATAN2(SIN(y_lon-x_lon)*COS(y_lat), COS(x_lat)*SIN(y_lat)-SIN(x_lat)*COS(y_lat)*COS(y_lon-x_lon)))+360)::DECIMAL(18, 2) % 360.00 AS bearing_degrees
    FROM coords_as_radians
    ;

Further reading

Calculate distance, bearing and more between Latitude/Longitude points

DIY insulated sous-vide container from a cooler

Friday, 01 Mar 2019

Last year I built a DIY insulated sous-vide container using $10 of IKEA parts. It worked pretty well, using 60% less electricity than an uninsulated container. But it was a bit of an eyesore, and I got tired of leaving a mess of towels out on my kitchen counter.

Can we do better?

I did some research on sous-vide cooler hacks and was impressed by the build described in this Chowhound thread. So I set out with those instructions, but made a few modifications along the way. The main change I made was to drill the hole in the back of the lid rather than the front, so that it can be opened without removing the sous vide unit.

Necessary supplies

| Item | Notes | Cost |
| --- | --- | --- |
| Igloo Legend 12 cooler | This size is perfect for weeknight cooks. It is shallow enough to only need 4.5L to fill, but wide enough to be opened without removing the sous vide device from the lid every time. | $21 |
| Spray foam insulation | We want something with good thermal properties, and which comes in a can with a spray nozzle, so we can spray it into tight spaces. | $8 |
| Silicone caulk | We really don’t need much, so just get a small container. | $4 |
| 60mm x 3.5mm o-rings | These are for outside the lid, to adjust how deeply the sous vide unit sits. | $6 |
| 40mm x 2mm o-rings | These are for inside the lid, to keep the unit snugly in place when the lid is opened. | $4 |
| Total | | $43 |

We’ll also need the following tools:

A reasonably powerful power drill
60mm hole saw bit – You’ll want this to match the diameter of your sous-vide unit as closely as possible so that it fits in snugly. My Anova unit (original version) needed a ~62mm hole, so I used a 60mm bit and sanded it down until it fit. If you are using a newer version, check the diameter of your unit.

How to build it

Fill the cooler lid with insulating foam

Drink coolers are designed to keep their contents cold rather than hot. So it would make sense for the cooler to have better insulation around the sides and bottom than the top lid. But since heat rises, we care disproportionately about the thermal performance of the top lid. The top lid of the Igloo Legend 12 cooler is hollow, so we can reduce heat loss even further by filling it with spray insulation foam. Here’s how:

Drill a bunch of small holes on the underside of the lid, just slightly larger than the diameter of the foam insulation hose nozzle. We want to use multiple holes, since the foam will expand inside the lid, and may cause it to deform if we spray it all into one corner. Better to distribute it evenly throughout the lid.
Lay out a few sheets of newspaper on the ground. This will get messy.
Distribute the spray foam inside the lid as deeply into corners as possible. Leave some extra space near each hole.
The foam will expand greatly over the course of 24 hours, so you want it to expand from the corners towards the holes, so that all the air escapes. Err on the side of less, because you can apply another round after 24 hours if the foam has not expanded to entirely fill the lid.
Wait 24 hours, then break off the bulbs of hardened foam which are protruding from the holes we drilled earlier.

Drill a hole for the sous vide unit

Pick a spot on the lid you want to drill. I suggest somewhere near the back hinge, so that you can open the lid fully without removing the sous vide unit.

The Igloo cooler I was using has a natural spot for the hole.

Measure the diameter of your sous vide unit, and drill a hole using the appropriately sized hole saw bit. Err on the side of smaller for a snug fit. You can use sandpaper to slightly expand the hole after drilling it.

Seal off holes with silicone

Now we want to apply silicone caulk to all the holes in the lid, so that moisture does not get inside during use.

Carve out a bit of foam insulation around each of the holes.
Apply a bit of silicone caulk to each hole.
Do the same around the rim of the main hole.
Use a credit card or another flat surface to smooth the caulk.
Wait 12 hours and touch up if necessary. The rim of the main hole took me 2-3 applications until I was confident it would be waterproof.

There is some room for improvement here.

Insert sous vide (with o-rings)

Now you can insert the sous vide unit into the main hole. Add o-rings to the top of the unit until it sits high enough that the stem clears the edge of the cooler when the lid is opened.

The exact o-ring size doesn’t matter much.

Make sure it clears the edge when the lid is opened.

Mark the fill line

Measure the depth of the cooler, subtract the offset from the o-rings to the fill line on your unit, and mark a line inside the cooler using a permanent marker and a ruler.
Energy efficiency

I ran a series of tests using a TP-Link Kasa Smart Plug to measure energy expenditure. For each test, I brought 4.5L of water up to 66°C and then started measuring after the water reached temperature. This is the same temperature as in the previous tests, but this time using only 4.5L of water instead of 7L. Although this may give our new build a slight advantage, it reflects the minimum amount of water necessary to reach the “min” marker on the sous vide unit, and so I think it best reflects real-life usage.

| Hours | Energy (kWh) | Watts |
| --- | --- | --- |
| 23.5 | 1.00 | 42 |
| 11 | 0.45 | 41 |
| 12 | 0.49 | 41 |
| Average | | 41 |

So this cooler build uses about 35% less electricity than the previous build when it was wrapped with towels, which used 63 watts. It uses about 72% less electricity than the unwrapped container, which used 148 watts.

So our new build is the best of both worlds! It is the most energy efficient, and also looks better on the kitchen countertop than either of the previous options.

Using it as a regular cooler

After a bit of trial-and-error, I found this 63mm plastic plug on AliExpress which fits perfectly into the hole at the top of the cooler. This is pretty useful, because then you can use the cooler as both a sous vide container and as a regular cooler when necessary.

The best way to manage dependencies between DAGs in Airflow

Monday, 11 Feb 2019

Airflow provides a few different sensors and operators which enable you to coordinate scheduling between different DAGs, including:

ExternalTaskSensor
TriggerDagRunOperator
SubDagOperator

Which one is the best to use? I have previously written about how to use ExternalTaskSensor in Airflow but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.

Use cases

I need the ability to sometimes run dag_B independent of dag_A, but I want to share state (history) between them.
Using SubDagOperator creates a tidy parent–child relationship between your DAGs. The sub-DAGs will not appear in the top-level UI of Airflow, but rather nested within the parent DAG, accessible via a Zoom into Sub DAG button. This is a nice feature if those DAGs are always run together. However, if you need to sometimes run the sub-DAG alone, you will need to initialize it as its own top-level DAG, which will not share state with the sub-DAG. In this scenario, you are better off using either ExternalTaskSensor or TriggerDagRunOperator.

My local development or test environment uses SQLite rather than a Postgres DB.

SQLite does not support concurrent write operations, so it forces Airflow to use the SequentialExecutor, meaning only one task can be active at any given time. An ExternalTaskSensor will occupy that single worker slot while “waiting” for the upstream task, so the upstream task can never run and your Airflow will be deadlocked. In this case, it is preferable to use SubDagOperator, since these tasks can be run with only a single worker.

Astronomer.io has some good documentation on how to use sub-DAGs in Airflow.

I want dag_B to sometimes run depending on some conditional logic

If you want to include conditional logic, you can feed a Python function to TriggerDagRunOperator which determines which DAG is actually triggered (if at all).

Set dependencies between Airflow DAGs with ExternalTaskSensor

Monday, 21 Jan 2019

Problem

You are an analyst/data engineer/data scientist building a data processing pipeline in Airflow. Last week you wrote a job that performs all the necessary processing to build your sales table in the database. This week, you are building a customers table that aggregates data from your previous sales table.

Should you add the necessary customers logic as a new task on the existing DAG, or should you create an entirely new DAG?
Since the dependency is only in one direction (tomorrow’s sales data does not depend on today’s customers data) you decide to decouple into two separate DAGs. But how can you make sure your new DAG waits until the necessary sales data is loaded before starting?

Airflow offers rich options for specifying intra-DAG scheduling and dependencies, but it is not immediately obvious how to do so for inter-DAG dependencies. The duct-tape fix here is to schedule customers to run some sufficient number of minutes/hours later than sales that we can be reasonably confident it finished. We can do better though.

Solution

Airflow provides an out-of-the-box sensor called ExternalTaskSensor that we can use to model this “one-way dependency” between two DAGs. Here’s what we need to do:

Configure dag_A and dag_B to have the same start_date and schedule_interval parameters.
Instantiate an instance of ExternalTaskSensor in dag_B pointing towards a specific task of dag_A, and set it as an upstream dependency of the first task(s) in your pipeline.
Initiate dagruns for both DAGs at roughly the same time. dag_B itself will start, but your task sensor will wait until the corresponding date run of dag_A finishes before allowing the actual tasks to start.

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.sensors.external_task_sensor import ExternalTaskSensor

    with DAG('dag_B') as dag:
        # Sensor that waits for 'final_task' of the matching dag_A run
        wait_for_dag_A = ExternalTaskSensor(
            task_id='wait_for_dag_A',
            external_dag_id='dag_A',
            external_task_id='final_task')

        main_task = PythonOperator(…)

        # The real work only starts once the sensor succeeds
        wait_for_dag_A >> main_task

Note: This requires tasks to run in parallel, which is not possible when Airflow is using SequentialExecutor, which is often the default for a barebones Airflow installation. This executor uses an SQLite database to store metadata, and SQLite does not support parallel IO. Using LocalExecutor will enable parallel operations, but requires an actual database (e.g. Postgres) to function.
Update: I explore some different, possibly better-suited approaches to this problem here including SubDagOperator and TriggerDagRunOperator. Further reading Dependencies between DAGs: How to wait until another DAG finishes in Airflow? [Bartosz Mikulski] Thoughts on Blitzstein's Probability course (Harvard Stat 110) Friday, 21 Dec 2018 One textbook which is frequently recommended on Hacker News threads about self-study math material is Blitzstein and Hwang’s An Introduction to Probability. Having just recently finished the book, I realized that this is the first textbook I have truly worked through end-to-end while studying a topic outside a school course. Here are some thoughts on what the book does well, and my (minor) grievances. The good There are a few characteristics that make this book particularly attractive for self-study. Access to material To start, the book itself is available for free (digital version) and is accompanied by 34 hours of video lectures and a detailed solutions manual for 8-10 exercises of those provided at the end of each chapter. You can get a free digital copy of the textbook at http://probabilitybook.net The YouTube playlist for course lectures is at https://goo.gl/i7njSb There is now an accompanying edX course, although I did not complete this myself. There is also a useful and thorough probability cheatsheet compiled by a past student. Googling the specific phrasing of many exercises often lands you on a StackOverflow question with discussion around that exact problem pulled from the book. Lots of exercises (with solutions!) I did not excel at math during undergrad, and I came to the incorrect conclusion that perhaps I am just not a “math person”. It took me a couple of years and a few hours of 3Blue1Brown videos to break this mindset, and to realize that much of my earlier difficulty was with learning material which abstracts away concepts too quickly, and which lacks a clear relationship between theory and practical application. 
So my quality standard for a textbook for self-study is quite a bit higher than for the average textbook. Blitzstein’s book contains ~600 exercises, and the selected solutions include detailed answers to ~100 of them. I found the number of officially-solved exercises in each chapter to be sufficient to build a deep intuitive understanding of the material. If you want to go even further, you can find many of the non-officially-solved questions answered somewhere on Chegg or StackOverflow.

Focuses on building intuition

A course in statistics will necessarily involve math, but Blitzstein does a good job of prioritizing the role of intuition whenever possible. He frequently employs “story proofs” to prove concepts or identities using verbal reasoning, rather than formal mathematical proofs. As far as I can tell, this is an approach the authors themselves have pioneered, as I can’t find many references to the concept outside this book.

A story proof is a proof by interpretation. For counting problems, this often means counting the same thing in two different ways, rather than doing tedious algebra. A story proof often avoids messy calculations and goes further than an algebraic proof toward explaining why the result is true. The word “story” has several meanings, some more mathematical than others, but a story proof (in the sense in which we’re using the term) is a fully valid mathematical proof. Here are some examples of story proofs, which also serve as further examples of counting.

One example of a powerful story proof is that of Vandermonde’s identity, which is an identity used in a few important proofs later in the book.

Example 1.5.3 (Vandermonde’s identity). A famous relationship between binomial coefficients, called Vandermonde’s identity, says that

$$ {m+n \choose k} = \sum_{j=0}^k {m \choose j} {n \choose k-j} $$

This identity will come up several times in this book.
Trying to prove it with a brute force expansion of all the binomial coefficients would be a nightmare. But a story proves the result elegantly and makes it clear why the identity holds. Story proof : Consider a group of $m$ men and $n$ women, from which a committee of size $k$ will be chosen. There are ${m+n \choose k}$ possibilities. If there are $j$ men in the committee, then there must be $k-j$ women in the committee. The right-hand side of Vandermonde’s identity sums up the cases for $j$. I find this approach very compelling, because it reduces the “barriers to entry” of mathematical proofs, letting you use them to test your knowledge without understanding a bunch of math symbols like $\exists, \forall, \in$. It is easy to employ the Feynman Technique by creating an Anki card for the most useful story proofs, and then periodically being prompted to explain the story proof. Clear relationships between related concepts At the end of each chapter, the authors reflect on how newly introduced concepts relate to those from previous chapters. Spoiler alert: most probability distributions are related to each other when either conditioning on some event, or when taking the limit as $n \to \infty$. As Professor Blitzstein is fond of saying in the video lectures: “Conditioning is the soul of statistics”. The book incrementally builds the flowchart below, which we see in its complete form at the end of Chapter 10. Does not assume prior knowledge A common challenge with using material for self-study is that one’s own existing knowledge may not precisely match the known prerequisites of students taking the course which the textbook was written for. Rather than “assuming some calculus knowledge”, university instructors have the luxury of knowing the exact content of prerequisite courses in their own departments, and so can confidently skip reasoning steps which seem too basic to spell out explicitly. 
Blitzstein & Hwang clearly go out of their way to decouple the course material from knowledge dependencies as much as possible. You will never hear the phrase “It is trivial to prove…” or read “The proof of this theorem is left as an exercise…” in this course. Whenever there is an unavoidable dependency on prior knowledge, the authors make explicit note of this fact, and reference the math appendix. The math appendix itself does a good job of cherry-picking useful prerequisite concepts (such as properties of functions, factorial and gamma functions, Taylor series, and geometric series) and building intuition around them.

For example: understanding how to apply a change-of-variables transformation to a multi-dimensional probability density function requires the concept of the Jacobian matrix, which itself requires a bit of multivariate calculus to understand fully. My calculus was a bit too rusty to do the proof, but since I knew exactly what I was missing, it was straightforward to brush up with a few Khan Academy videos within an hour before continuing with the chapter.

Although the authors do not include a mindmap of concepts in the book, I found this Metacademy DAG for Central Limit Theorem (which is presented near the end of the book) to be a good approximation of how earlier concepts build up to concepts in the later chapters. Except for multiple integrals, there is overall very little dependency on prior math knowledge.

The bad

Non-standard notation for some distributions

A matter of slight annoyance is that there are a few instances where the authors create their own parameterization of distributions which differ from the standard notation found outside of the book.
For example, the notation for the Gamma distribution and its corresponding PDF is given as:

$$ \begin{aligned} Y &\sim \text{Gamma}(a, \lambda) \\ f(y) &= \frac{1}{\Gamma(a)}(\lambda y)^a e^{-\lambda y} \frac{1}{y} \end{aligned} $$

Outside of the book, there are two typical parameterizations of the Gamma distribution: shape–scale or shape–rate. The shape–rate parameterization most closely matches the one we find in the book.

$$ \begin{aligned} X &\sim \text{Ga}(\alpha, \beta) \\ f(x) &= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \end{aligned} $$

Why are they different? Well I’m sure that it is because the authors think their parameterization is more intuitive. If Wikipedia can’t even agree on a single standard parameterization, why not introduce a third? (Algebraically the two PDFs above are the same function once you set $a = \alpha$ and $\lambda = \beta$; only the symbols and the way the expression is factored differ.)

I have mixed feelings here, because their choice of parameter $\lambda$ instead of $\beta$ actually is more intuitive, as it makes it more obvious how the Gamma distribution is closely related to the Exponential distribution, which shares the same rate parameter $\lambda$. But the form of the PDF makes it a bit tricky when referencing outside material alongside the textbook itself.

Another example of non-standard notation is around the presentation of the Geometric distribution. Outside of the course, this can refer to either the distribution of the number of Bernoulli trials up to and including the first success (with support $\{1, 2, 3, \ldots\}$), or it can refer to the number of failures before the first success (with support $\{0, 1, 2, \ldots\}$). Blitzstein refers to the former as the First Success distribution, and the latter as the Geometric distribution.

It is difficult to complain about not adhering to a common definition when the common definition itself is ambiguous. But this is something to keep in mind when referencing outside resources.

The ugly

There is no ugly. I struggled to even think of the above complaint about non-standard notation.
This book is a gem, and I highly recommend it to anyone considering self-studying probability. In an ever-evolving data science field, it is difficult to predict which model or framework will be in vogue next year. But I think it’s a solid bet that probability will continue to play a central role in the field, and so there is a high ROI on investing the time to develop a solid understanding of the fundamental concepts this book covers.

Abridged: David Foster Wallace's “This Is Water”

Thursday, 18 Oct 2018

This is Water is a 22-minute commencement speech given by David Foster Wallace at Kenyon College in 2005 which was later adapted into a short book. It is difficult to overstate how powerful it is. Even after listening to this speech countless times, it never fails to send a shiver down my spine.

If you’re automatically sure that you know what reality is, and you are operating on your default setting, then you, like me, probably won’t consider possibilities that aren’t annoying and miserable. But if you really learn how to pay attention, then you will know there are other options. It will actually be within your power to experience a crowded, hot, slow, consumer-hell type situation as not only meaningful, but sacred, on fire with the same force that made the stars: love, fellowship, the mystical oneness of all things deep down.

I originally discovered the speech through an article on Farnam Street, which includes audio and a full text transcript of the speech. The original version is a bit long though. It also includes some parts tailored to the graduating class which I find detract from the overall power of the primary message. I found this abridged transcript from James Clear, which is better, but it is text only; there is no abridged version of the audio itself.

In the interest of periodically forcing myself out of this default mode of thinking, I created a recurring calendar event to re-listen to the speech every six months.
The transcript is good, but it doesn’t quite “click” as well as when I listen to the audio version, delivered by DFW himself. So I pulled the original audio into Audacity and edited it to roughly match the abridged transcript. You can download this abridged audio version here.

DIY insulated sous-vide container with $10 of IKEA parts

Saturday, 01 Sep 2018

Since acquiring an Anova sous-vide cooker, it has become an essential component of my weekly cooking routine. Their marketing materials show the device being used in any large pot you probably already have. This is fine for occasional use, but since I use the device frequently I started looking for a dedicated vessel. A dedicated vessel also lets you cook a larger quantity of food, or something awkwardly large like a rack of ribs.

You can buy a pre-built container but it costs $70 and is not insulated. So I decided to build a simple dedicated container that was semi-insulated, so that it would be energy efficient when cooking ribs for 48 hours.

What you’ll need

Consumable supplies

| Item | Cost |
|---|---|
| SAMLA box (with lid) from IKEA | $3 |
| Cheap blanket (x2) | $4 |
| Small plastic clamps (x2) | $2 |
| Hot glue | $1 |
| Total | $10 |

Tools

A reasonably powerful power drill
60mm hole saw bit – You’ll want this to match the diameter of your sous-vide unit as closely as possible so that it fits in snugly. My Anova unit (original version) needed a ~62mm hole, so I used a 60mm bit and sanded the hole out until the unit fit. If you are using a newer version, check the diameter of your unit.
Hacksaw – To make cuts between the circular hole and the edge of the plastic lid.
Sandpaper – To smooth any uneven plastic edges.

How to build it

Glue a support to the box

The lip of the SAMLA box is a bit flimsy, so you’ll probably want to glue something on to reinforce it. I used a translucent plastic bag clip that was nearby in the kitchen. You’ll want to do this first, since this affects the position of the clamp, which determines where you should drill the hole in the lid.
Cut a hole in the lid

Attach the sous-vide clamp to the container and mark the center of the hole on the plastic lid. Attach the lid firmly to the box, and use the hole saw attachment on a drill to make a cut at the marked point. Use the hacksaw to extend the cuts from both sides of the hole to the edge of the lid. Use sandpaper to slightly enlarge the hole until it fits your Anova device, and also to smooth off any rough edges from the hacksaw.

Wrap with towels

Fold the towels into long thin strips. Wrap one snugly around the sides and use plastic clamps to attach it at the back. Fold the other in such a way that it covers the entire lid, tucking it around the sous-vide clamp.

I tried the two ways shown below to fold the top blanket. A simple square fold (left) looks nicer, but it leaves the lid around the sous-vide unit itself uninsulated. Folding a long thin towel (right) to cover the entire lid looks a bit messier, but is more energy efficient (~10 watts).

[Images: square fold, captioned “Easier”; long thin fold, captioned “More efficient (slightly)”]

For even larger cooks

If you are cooking something even larger (or a large quantity for a dinner party) you can also buy the larger 6-gallon SAMLA box which uses the same size lid.

Update [2020-05-01]: If I were doing this again from scratch, I would probably opt for the newer IKEA 365+ tupperware containers. The lid is more securely attached than the SAMLA, so you could do just the hole saw cut and let the sous-vide unit hang down from the lid. This would save you from doing the hacksaw cut to accommodate the Anova clamp, which was annoying to do and probably results in worse energy efficiency.

Energy efficiency

I ran a series of tests using a TP-Link Kasa Smart Plug to measure energy expenditure. For each test, I brought 7L of water up to 66°C and then started measuring after the water reached temperature.

Wrapping the container in blankets is a bit annoying (and ugly) but it uses less than half the energy it uses when unwrapped.
If you are doing long multi-day cooks, it is definitely worth making sure your container is insulated. For example, a 48-hour cook without towels would take roughly 7.2 kWh, whereas it would only take roughly 2.9 kWh with the towels.

IKEA energy tests (bare)

| Hours | Energy (kWh) | Watts |
|---|---|---|
| 13 | 1.93 | 148 |
| 12 | 1.75 | 146 |
| 9 | 1.36 | 151 |
| Average | | 148 |

IKEA energy tests (wrapped)

| Hours | Energy (kWh) | Watts |
|---|---|---|
| 9.5 | 0.66 | 70 |
| 7.5 | 0.43 | 57 |
| 6 | 0.38 | 63 |
| Average | | 63 |

Redshift function of the week: RATIO_TO_REPORT

Sunday, 10 Jun 2018

A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event (such as visits, searches, or purchases) split by a single dimension (such as country, device, or marketing channel). Quite often this arises as an intermediate need while working towards some other primary task.

Let’s work with a simple example: you’d like to get a rough sense of how many of your company’s orders come from each country. So you write the following query,

    SELECT country, COUNT(*) AS num_orders
    FROM orders
    GROUP BY 1
    ORDER BY 2 DESC

And you get the following result back in your SQL client,

| country | num_orders |
|---|---|
| USA | 21264505 |
| Canada | 6408593 |
| Mexico | 2208305 |

This is kind of difficult to read. You can immediately discern that USA has the most orders, but it’s tough to eyeball proportions here without counting digits and performing some rough mental math. So you refine your query to tell you what you are actually interested in: the relative proportion of orders between countries.

    WITH country_totals AS (
        SELECT
            country
            , COUNT(*) AS num_orders
        FROM orders
        GROUP BY 1
        ORDER BY 2 DESC
    )
    SELECT
        country
        , num_orders::NUMERIC / SUM(num_orders) OVER () AS pct_orders
    FROM country_totals

| country | pct_orders |
|---|---|
| USA | 0.71169602953464 |
| Canada | 0.21443048342874 |
| Mexico | 0.07395810292746 |

This is better. It answers your question, and you can go back to your main task. But is it optimal? It took you 12 lines to answer a relatively simple question.
Hopefully you didn’t write these from scratch, but even if you pasted a snippet, it’s not particularly easy to read if you or someone else needs to refer back to this query in the future. Can we do better?

Enter the RATIO_TO_REPORT function

Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. You can use it as follows,

    SELECT country
        , RATIO_TO_REPORT(COUNT(*)) OVER () AS pct_orders
    FROM orders
    GROUP BY 1
    ORDER BY 2 DESC

We can reason through this query as follows: the GROUP BY operation totals orders for each country, and then the RATIO_TO_REPORT function is called on the already-grouped rows, dividing each by their grand total. Running this query gives us the exact same output as the previous query, but with half the lines of code, and in a more readable form.

As a final step, we can clean up the output even further by rounding the percentage to a meaningful precision, so that our eyes don’t spend an extra 100ms parsing the fact that it is a fraction when we look at the table.

    SELECT country
        , ROUND(RATIO_TO_REPORT(COUNT(*)) OVER (), 2) AS pct_orders
    FROM orders
    GROUP BY 1
    ORDER BY 2 DESC

| country | pct_orders |
|---|---|
| USA | 0.71 |
| Canada | 0.21 |
| Mexico | 0.07 |

Boom! We’ve arrived at a short query which gives us a clean result that answers our underlying question. This is a trivial scenario, but the sort that one encounters daily, and so taking the optimal approach pays off in long-run efficiency.

Further reading

RATIO_TO_REPORT Window Function – AWS documentation. Not particularly easy to read, but still a primary resource.
SQL queries don’t start with SELECT – It’s useful to understand the SQL “order of operations” when working with queries which combine both groupby aggregation and window functions.
Calculating Proportional Values in SQL – Implementation details for different SQL engines.
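For engines that lack RATIO_TO_REPORT, the same quantity can be approximated by dividing each group's count by a grand total from a scalar subquery. The following is only a minimal sketch, run against an in-memory SQLite database through Python's standard sqlite3 module; the ten toy rows are made up for illustration and are not the order counts from the post.

```python
import sqlite3

# Hypothetical toy data standing in for the `orders` table from the post.
rows = [("USA",)] * 7 + [("Canada",)] * 2 + [("Mexico",)] * 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT)")
conn.executemany("INSERT INTO orders (country) VALUES (?)", rows)

# Each country's count divided by the grand total, which is the quantity
# RATIO_TO_REPORT(COUNT(*)) OVER () returns on Redshift.
query = """
    SELECT country
         , ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM orders), 2) AS pct_orders
    FROM orders
    GROUP BY country
    ORDER BY pct_orders DESC
"""
result = conn.execute(query).fetchall()
for country, pct_orders in result:
    print(country, pct_orders)
```

The multiplication by 1.0 forces floating-point division, since SQLite would otherwise perform integer division on the two counts. The scalar subquery is less elegant than the window function, but it should work even on engines without window-function support.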
The hidden costs of poor data quality

Wednesday, 02 Aug 2017

The phrase “data quality” is frequently, and often ambiguously, thrown around many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement. We’d all love 100% accuracy, but in the era of moving fast and breaking things, don’t we want to sacrifice a little accuracy in the name of speed? After all, isn’t it often better to make fast decisions with imperfect information and adjust course if necessary at a later point?

There is certainly a trade-off at play here. The optimal level of quality to aim for is likely less than 100%. But it’s probably higher than you think. The speed vs. accuracy trade-off needs to be calculated with all costs considered, not just the most visible and direct ones.

For this reason, let’s ignore the obvious cost of low-quality data, which is incorrect decisions and lost opportunities to apply machine learning models due to an unacceptable signal-to-noise ratio. Instead, let’s focus on the lesser-considered costs of poor data quality.

It reduces the standard for derivative works

Broken windows theory is a pop-science concept originating from criminology which states that visible signs of disorder encourage further disorder. When participants in a system are acclimatized to a slightly broken status quo, it becomes normative to produce more slightly broken output.

I suspect this concept applies to data quality as well, in two ways: psychological and practical.

Ambient quality issues degrade the shared standard for quality within a team, and the “definition of done” will slip. For an analysis on a tight timeline, 90% accuracy may be accepted instead of spending the time investigating the root cause of the inaccuracy.

From a practical standpoint, it becomes more difficult to implement data quality checks or perform sanity tests when nothing is black or white, but rather shades of gray.
For example, if a tracking event should always have a session-ID associated with it, then it becomes easy to detect when a problem arises, with something as simple as COUNT(*) WHERE session_id IS NULL. But when tracking events usually but not always have a session-ID associated with them, you can’t apply simple rules, nor can you identify if a specific data transformation/aggregation/analysis introduces meaningfully more null values. Soon you are in the domain of anomaly detection, which is an endeavour in and of itself.

It takes dramatically longer to reach a final result

Anscombe’s Quartet is a famous example of why it’s not enough to analyze a dataset using descriptive statistics alone. You need to look at the distribution of the underlying data.

But what if you do visualize the data during the initial analysis, and now you want to build a simple dashboard for ongoing monitoring? Particularly when working with large datasets, it is much more convenient to apply a server-side aggregate function in SQL such as AVG() than to pull the raw data, filter for outliers, and then perform client-side aggregation in software such as Tableau.

So you draw up a simple query based on AVG() and feed it into a dashboard. This works, until a deployment next month starts sending negative values to the database due to an incorrectly configured timezone on one of the backend servers. This goes unnoticed until an analyst a year from now realizes that the average time-to-purchase is being understated by more than 50%. Oops!

Well, what if we write a more “defensive query” in the first place, to prevent such an incident from occurring?
For example…

    SELECT AVG(GREATEST(0, time_to_purchase))

It is possible to write such queries in response to specific (and known) quality issues, but doing so for multiple types of potential quality problems leads to clunky and slow queries peppered with SELECT DISTINCT statements that not only take longer to execute, but are more difficult to write, read, and modify in the future.

It accumulates multiplicatively, not additively

A common question posed to analysts is “how accurate is this data”? This is deceptively difficult to answer. Non-technical stakeholders often conceive of error as an additive quantity, but in reality the effect is often multiplicative.

For example, let’s say we are trying to calculate the average number of searches per visit on our website. We’ve got two tables for this: visits and searches. To make things interesting, there was a technical bug a few months ago, in which some visitors were assigned a session-ID of undefined. So we write a query,

    SELECT AVG(num_searches)
    FROM (
        SELECT v.session_id, COUNT(s.search_id) AS num_searches
        FROM visits v
        JOIN searches s ON v.session_id = s.session_id
        GROUP BY v.session_id
    ) per_session

The problem here is that the join inadvertently behaves like a cross join on these undefined rows: every visit with the undefined session-ID is matched against every search with that same ID, giving you a wildly inaccurate result…

It undermines confidence in the analytics team

Your team has collected and transformed the data, built the model, and performed the analysis. Now it’s time to communicate the results to stakeholders. But often the data presents inconvenient truths, which stakeholders may be reluctant to accept. In order to reduce cognitive dissonance, people often engage in motivated reasoning, questioning the quality of the data, and whether we can actually trust the results.

So even if you’ve already paid the inflated costs associated with reaching a meaningful and accurate insight in the face of poor data quality, you may yet face a more difficult time evangelizing for controversial actions or outcomes based on those results.
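To make the multiplicative effect from the visits/searches example concrete, here is a small sketch using Python's standard sqlite3 module. The table contents are invented for illustration: every visit performs exactly two searches, but three visits share the literal session ID 'undefined'. The per-session counts are computed in a subquery before averaging, and the `where` placeholder is just a convenience of this sketch for toggling the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (session_id TEXT);
    CREATE TABLE searches (search_id INTEGER, session_id TEXT);
""")

# Hypothetical data: two healthy visits with 2 searches each, plus three
# buggy visits that all share the session ID 'undefined' (2 searches each).
visits = [("a",), ("b",)] + [("undefined",)] * 3
searches = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
searches += [(i, "undefined") for i in range(5, 11)]  # 6 searches in total
conn.executemany("INSERT INTO visits VALUES (?)", visits)
conn.executemany("INSERT INTO searches VALUES (?, ?)", searches)

avg_sql = """
    SELECT AVG(num_searches)
    FROM (
        SELECT v.session_id, COUNT(s.search_id) AS num_searches
        FROM visits v
        JOIN searches s ON v.session_id = s.session_id
        {where}
        GROUP BY v.session_id
    )
"""
# Naive version: 3 'undefined' visits x 6 'undefined' searches = 18 joined rows.
naive = conn.execute(avg_sql.format(where="")).fetchone()[0]
# Defensive version: drop the corrupted session ID before aggregating.
clean = conn.execute(
    avg_sql.format(where="WHERE v.session_id <> 'undefined'")
).fetchone()[0]
print(naive, clean)
```

In this toy setup the true figure is 2 searches per visit, yet the naive query reports more than 7, because the corrupted rows are multiplied against each other rather than simply added to the total.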
Essential productivity apps for Mac users

Monday, 05 Jun 2017

Once a year I try to reevaluate my “personal tech stack” to see if I am using fundamental tools as effectively as possible. Not just bigger tools such as todo lists, calendars, and note-taking, but also the smaller utility apps that get used so frequently they blend into our daily work routine.

Our fluency with the tools we use every day is the foundation of personal productivity¹, so it makes sense to optimize even small interactions² such as switching between windows. With that in mind, here are three key Mac apps that make me a tiny bit more efficient but do so very frequently.

Alfred (launcher)

Alfred is a super-charged replacement for the built-in Spotlight search bar. It gives you nuanced control over search options such as indexing, fuzzy matching, etc. The real power comes from three other features though…

Clipboard history – Being able to go back and re-paste copied items from your recent history reduces the number of times you need to switch between windows when copying multiple chunks of content (e.g. title, description, link) from one application or document to another. This thereby incurs fewer context-switching costs on your brain.³

Text expansion – I primarily use this to encode symbols that I use frequently, such as Greek letters (α, β, μ, σ, λ). Python allows most (all?) of these Unicode characters to be used as variable names. If you are writing a function which exactly mirrors a mathematical expression, it can be convenient to write it using the actual Greek characters themselves, rather than their Latin names.

    import numpy as np

    # with latin names
    def log_likelihood(x, mu, sigma):
        return (-(x-mu)**2 / (2*sigma**2)) - np.log(np.sqrt(2 * np.pi) * sigma)

    # with actual greek characters
    def log_likelihood(x, μ, σ):
        return (-(x-μ)**2 / (2*σ**2)) - np.log(np.sqrt(2 * np.pi) * σ)

You can download my snippets collection of Greek characters here.
I include only the characters which are not confusingly similar to regular Latin characters (e.g. uppercase alpha, beta).

I also use snippets for arrows (▲, ▼, ←, ↔, →), fractions (½, ⅓, ⅒) or little snippets of frequently used SQL code or LaTeX. Alfred lets you specify where to place the cursor after expanding your snippet, which is useful for boilerplate code snippets.

Workflows – Alfred provides a drag-and-drop GUI for creating pseudo-programming recipes that you can plumb together to achieve surprisingly complex tasks. There are downloadable recipes which provide deep integration into your apps, such as allowing you to search for notes in Evernote from within the Alfred search bar and open them directly via deep-link. That said, my most common workflows are relatively simple:

Search Amazon in multiple countries for a product.
Type a German word and search multiple dictionary websites and Google Images. Useful for creating flashcards in a foreign language.
Run a short Python script which takes LaTeX formatted markup from my clipboard (usually copied from Typora) and uses regex to change to the syntax required by Anki.
A bunch of hotkeys to wrap a highlighted word with relevant HTML tags, such as Cmd+Shift+B to bold something or Cmd+Shift+I to italicize something.

Lightshot (screenshots)

I make heavy use of screenshots as a communication medium in a work context. As remote work increasingly becomes the norm, we must adapt our communication styles to match. Certain things are just easier to show to someone than to describe verbally. But with less opportunity to meet face-to-face it becomes more difficult to do so. A good screenshot tool reduces the friction to communicating visually in an asynchronous context.

I’m sure there are multiple good screenshot tools out there, but I like Lightshot because it allows me to snip a selection of my screen, apply some basic annotations, and copy to my clipboard within the span of 1-2 seconds.
I can then immediately paste the image into a Slack window or email thread.

Magnet (window management)

Magnet lets you snap windows to one half of your screen by dragging them to any edge or corner, or using a set of hotkeys. This is similar to the built-in window management functionality in Windows, which I sorely missed when I switched operating systems five years ago.

It’s worth spending a few minutes familiarizing yourself with the hotkeys, because you’ll use them hundreds of times per week. Being nimble with window management lets you make more effective use of your monitor space, and minimizes the literal context-switching cost you incur on your brain when switching your gaze between monitors or even between Mac’s built-in Spaces.

¹ See Tiago Forte, The Digital Productivity Pyramid.
² Relevant xkcd: Is it worth the time?
³ https://en.wikipedia.org/wiki/Task_switching_(psychology)

Jupyter Notebooks for Interactive SQL Exploration

Sunday, 16 Apr 2017

I’m always hesitant to tell people that I work as a data scientist. Partially because it’s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title “scientist” to describe work which does not necessarily involve the scientific method.

“Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study. To call oneself a “data scientist” makes no sense. One cannot study data in general. One can only study data about something in particular.” – There Is No Science of Data

So it's always nice when I find an opportunity to borrow a concept or practice from actual science and apply it in my day-to-day. One of my favourites is the practice of [keeping a lab notebook](https://www.sciencemag.org/careers/2019/09/how-keep-lab-notebook) with commentary and supplementary details around the meandering path taken towards a final result.
Jupyter notebooks and R Markdown are two common tools that make it easy to intermingle code and analysis (as markdown) in a way that allows you to elucidate your thought process along a particular path. But I have always felt a bit frustrated that there is not a similar tool for SQL. I try to get out of SQL and into python as soon as possible, but sometimes it is inevitable. On occasion, while writing a query to pull a starting dataset for some sort of analysis in pandas, I find myself troubleshooting something like missing or duplicate records in SQL. Usually this involves executing a sequence of simple queries against various tables in the database to narrow down the source of the problem, often using the output of one as input into another query. For example: Pull a single order-ID which is missing from my dataset Query the orders table for that order-ID to find the corresponding customer-ID Query the customers table for that customer-ID to find device data … What I did before I generally prefer writing SQL queries in my IDE (PyCharm) which provides a number of useful features including auto-completion of column and table names, along with warnings that appear for typos, etc. Usually I will add comments above queries as I go along using the -- comment syntax and at the end of the chain of queries I may copy/paste everything into a .sql file to save somewhere in case I need to run through that specific chain of troubleshooting steps again. Enter Jupyter w/ SQL magics There is a neat jupyter extension called ipython-sql that adds an %%sql magic command to your jupyter notebooks. Magic commands are special non-python commands starting with the % which, when run from a notebook cell, add some sort of additional functionality. Prefixing a code cell with %%sql will let you execute the SQL code below against your database, and return the result below. It even applies syntax highlighting to your SQL, making it more readable. 
How to use

First thing we need to do is install the extension,

    ! pip install ipython-sql

Next, we need to load the extension and create a connection with your database,

    import os

    # Use reload_ext instead of load_ext to avoid message on re-running cell.
    %reload_ext sql

    # Return a pandas DataFrame instead of an SQL ResultSet.
    %config SqlMagic.autopandas = True

    # Provide the SQLAlchemy-style connection string template for your database.
    redshift_str_template = 'postgresql://{user}:{pwd}@{host}:{port}/{db}'

    # Fill in the string with your credentials, stored in environment variables.
    connect_str = redshift_str_template.format(
        user=os.environ['REDSHIFT_USERNAME'],
        pwd=os.environ['REDSHIFT_PASSWORD'],
        host=os.environ['REDSHIFT_HOST'],
        port=os.environ['REDSHIFT_PORT'],
        db=os.environ['REDSHIFT_DB'])

    # Open a connection to your database
    %sql $connect_str

The code above assumes that you are using Amazon Redshift as a database, and that your credentials are stored in environment variables. If this is not the case, you can replace the os.environ[] calls with strings, but be careful not to commit your notebook to a shared repository with plaintext credentials.

Now we can run an SQL query in our notebook,

    %%sql
    SELECT COUNT(*)
    FROM sales
    WHERE ts > CURRENT_DATE - interval '7 days'

Another cool feature is the ability to save the output to a variable,

    %%sql num_sales << SELECT COUNT(*)
    FROM sales
    WHERE ts > CURRENT_DATE - interval '7 days'

It works in reverse too, so you can feed a python variable such as N_DAYS = 7 back into a query by referencing it with a leading : in your SQL.

    %%sql num_sales << SELECT COUNT(*)
    FROM sales
    WHERE ts > CURRENT_DATE - interval ':N_DAYS days'

Using these two features together, it is possible to write a notebook which performs a sequence of debugging steps, with each query taking a dynamic value from the previous output. You can then save this notebook, and easily re-run the same troubleshooting steps on fresh data when the problem arises in the future.
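The chained troubleshooting pattern described above can also be sketched outside a notebook. The snippet below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database instead of the %%sql magic, and the orders and customers tables, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory schema and rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, device TEXT);
    INSERT INTO orders VALUES (1001, 7), (1002, 8);
    INSERT INTO customers VALUES (7, 'ios'), (8, 'android');
""")


def fetch_one(query, params):
    """Run a parameterised query and return the first column of the first row."""
    row = conn.execute(query, params).fetchone()
    return None if row is None else row[0]


# Step 1: start from a single order ID that looks wrong in the dataset.
order_id = 1001

# Step 2: feed it into the next query to find the corresponding customer.
customer_id = fetch_one(
    "SELECT customer_id FROM orders WHERE order_id = :order_id",
    {"order_id": order_id},
)

# Step 3: feed that result into a query for the customer's device data.
device = fetch_one(
    "SELECT device FROM customers WHERE customer_id = :customer_id",
    {"customer_id": customer_id},
)

print(order_id, customer_id, device)
```

Each step binds the previous result as a named parameter, which is the same idea as passing a python variable back into a %%sql cell.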
Couldn’t I achieve the same thing with jinja and psycopg2?

Theoretically we could write queries into string variables in a Jupyter notebook and run them using psycopg2 or pandas, but this always felt too clunky to be usable. The above approach almost entirely removes the friction and boilerplate code, while also giving us the benefit of syntax highlighting.

Further reading

https://towardsdatascience.com/jupyter-magics-with-sql-921370099589
https://github.com/catherinedevlin/ipython-sql

Typesetting math equations with Anki

Monday, 27 Mar 2017

Anki is a tool I use daily to remember things better. Below are the things I have learned about typesetting math equations in Anki using both MathJax and raw LaTeX. Hopefully these notes can save you some time.

Update [2020-04-17]

Anki 2.1+ now has built-in support for MathJax. This is now the best approach to math typesetting, since it removes the dependency on LaTeX being installed on your computer. Besides being a pain in the ass to configure, LaTeX also required a bunch of settings that you had to keep track of if you regularly use Anki on multiple computers. As a bonus, the MathJax syntax is cleaner, and you can now edit expressions on AnkiDroid and they will render immediately.

How to convert existing LaTeX in Anki to MathJax

If you have already been using a full installation of LaTeX and have a bunch of Anki cards that you want to convert to MathJax, the process is relatively easy. First, make sure you back up your entire Anki database. In the card browser, select the cards you want to convert and go to Menu → Edit → Find and Replace. Make sure Treat input as regular expression is checked, and then run the following input/output pairs. It’s a good idea to test on a couple cards first.
Find                    Replace
\\[\\$\\]               \\$
\\[\\/\\$\\]            \\$
\\[\$\$\]               \\\[
\\[\\/\\\$\\\$\\]       \\\]
\\[latex\\]             \\\[
\\[\\/latex\\]          \\\]

You could probably combine these into fewer, more complex regular expressions, but I didn’t want to mess around too much, since Anki’s preview-less and undo-less Find & Replace tool made me somewhat nervous.

Depending on what sort of syntax you used, you may need to convert or remove some additional strings which are not recognized by MathJax. For me, this included replacing the align* environment with aligned and removing in-equation tags (using regex: \\tag{\d}).

After running the above pairs, review a few cards to ensure everything looks okay. Then run Tools → Check Media and sync. Both operations may take a while (30-60 seconds) depending on how many LaTeX expressions you had previously. I had 3010 rendered LaTeX images which took up a total of 32.6 MB.

Using LaTeX with Anki

Hazards ☠

Anki’s [official documentation on LaTeX support](

You can only put LaTeX tags inside fields, not inside the card template itself. Otherwise it breaks the logic used by the “update media references” process.
If you are using cloze deletion and have a nested LaTeX expression, put a space between curly brackets } } so that Anki does not confuse them with the }} that closes a cloze.

Understand the difference between inline and display equations

LaTeX has two different ways to render math equations: inline and display. Their primary difference in full LaTeX documents is that inline equations are smaller and do not cause a line break, so they can be used within a flowing paragraph, while display equations appear as a larger, centered equation with a line break before and after. Since Anki renders LaTeX figures individually as png files and inserts them into your template, this spacing does not apply to us. The secondary difference is that inline has tighter formatting on a variety of symbols, most notably on summations, integrals, etc.
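To make the inline versus display difference concrete, here is a small LaTeX sketch of the same expression typeset both ways. The formula itself is just an arbitrary example; in Anki's legacy LaTeX syntax the first form would sit inside [$]…[/$] tags and the second inside [$$]…[/$$] tags.

```latex
% Inline: the summation limits are set beside the operator to keep the line compact.
The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for $n$ observations.

% Display: the limits sit above and below a full-size operator, on a centered line.
\[
  \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
```

The tighter inline layout is the reason a rendered inline image can look cramped when it is used on its own line in a card.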
Tweaks

Outputting high-resolution PNG files

I was not satisfied with the default rendering settings, which were generating images with noticeable aliasing. The following settings render equations at 800 DPI with a transparent background and medium compression. The files are not much bigger in the end, due to the compression.

Install the Edit LaTeX build process addon.
Open latex_build_process.py and modify it as follows:

    newLaTeX = [
        ["latex", "-interaction=nonstopmode", "tmp.tex"],
        ["dvipng", "-D", "800", "-T", "tight", "-bg", "Transparent",
         "tmp.dvi", "-z", "6", "-o", "tmp.png"],
    ]

    # make the changes
    import anki.latex
    anki.latex.latexCmds = newLaTeX

In your card template CSS, put .latex { zoom: 14%; } to return the images to a reasonable size.

Center-align rendered LaTeX images

If your inline LaTeX equations seem not to be aligned with the surrounding text, you can add the following to your card CSS.

    img[src*="latex"] { vertical-align: middle; }

Making display equations larger than inline equations

Annoyingly, there is no way to automatically display png files rendered from display math as larger than inline math. So the best way I have found to achieve this is to add a conditional field to the template and run a snippet of javascript code to modify the CSS on the fly.

Add a field to your note template. I used the name _latex_displaymath and set the text size to 10, so that it takes up minimal space in the Anki browser.
Add the following to your note CSS

    .display-math .latex { zoom: 80%; display: block; margin: 0 auto; padding: 30px; }

Add the following to your card HTML

    <script>
    var displayMath = "{{_latex_displaymath}}";
    if (displayMath.length > 0) {
        document.getElementById("answer").classList.add('display-math');
    }
    </script>

Test your product assumptions with GA Intelligence Alerts

Sunday, 17 Jul 2016

A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions.
Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly.

More knowledge is good, right? On one hand, past experience calibrates our sense of prior probability, which allows us to make better decisions in noisy contexts. This “prior” knowledge which we acquire has a dark side though. When we encode certain data points as truths into our mental models, our perception of the world becomes static. We can become overconfident in our knowledge of how things work, and be caught off-guard when our assumptions about how the world works are no longer true.

“In the beginner’s mind there are many possibilities, but in the expert’s there are few” – Shunryu Suzuki

So wouldn’t it be great if there were a way to be notified when our acquired mental models diverge from reality? In software development there are entire methodologies such as Test-driven development (TDD) which revolve around explicitly formulating and testing assumptions at each stage in the development process. One of my favourite python statements is assert, which lets you specify a condition you expect to evaluate to True, and raises an exception when that is not the case.

But if you are working with data in an analytics tool rather than in Python, how can you achieve this? Intelligence Alerts are a neat feature in Google Analytics which allow you to specify a metric and dimension combination, and then to configure an alert on a daily/weekly/monthly basis when that metric changes by either an absolute value or percentage change.

You can use this tool to codify your assumptions about user behaviour, and then get alerted if they change. Set notification thresholds calibrated to your perceived lower bound on normal usage behaviour. When a niche but important feature breaks silently a few months from now, you will be the first to know.
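The same idea can be sketched in Python for readers who think in terms of assert. This is a rough analogue of a percentage-change alert, not how Google Analytics implements it; the function names, the metric name, the 30% threshold, and the weekly counts are all made up for illustration.

```python
def pct_change(previous, current):
    """Relative change from previous to current, as a fraction."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous


def check_assumption(metric, previous, current, max_drop=0.30):
    """Return an alert message if the metric fell by more than max_drop, else None."""
    change = pct_change(previous, current)
    if change < -max_drop:
        return f"ALERT: {metric} changed by {change:.0%}"
    return None


# Hypothetical weekly counts for a niche feature.
print(check_assumption("pdf_export_clicks", previous=200, current=120))
print(check_assumption("pdf_export_clicks", previous=200, current=190))
```

The threshold plays the role of the "perceived lower bound on normal usage": small week-to-week noise passes silently, while a large drop produces a message.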
Book review: Remote Research (user research) Tuesday, 07 Jun 2016 This is a brief review of the book Remote Research, and a summary of points that resonated with me. Key Concepts Moderated research – Real-time interaction with a user that is time-expensive, but is easier to discover unanticipated insights due to the greater “texture” of the interaction. “Moderated research allows you to gather in-depth qualitative feedback: behavior, tone-of-voice, task and time context, and so on. Moderators can probe at new subjects as they arise over the course of a session, which makes the scope of the research more flexible and enables the researcher to explore behaviors that were unforeseen during the planning phases of the study. Researchers should pay close attention to these “emerging topics,” since they often identify issues that were overlooked during the planning of the study.” Automated research – Data collection process is set up a priori and the research is conducted asynchronously, without your involvement. “Automated research is nearly always quantitative and is good at addressing more specific questions (“What percentage of users can successfully log in?” “How long does it take for users to find the product they’re looking for?”), or measuring how users perform on a few simple tasks over a large sample. If all you need is raw performance data, and not why users behave the way they do, then automated testing is for you.” Starting an interaction – The quality of your data in a moderated study is influenced by the consistency and quality of your participant on-boarding process. “Establish the users’ expectations about what will happen during the study and what kind of mindset they should have entering the study. 
The most important things to establish are that you want the participants to use the interface like they normally would … And let them know you’d also like them to think aloud while they’re on the site … It’s also nice to set users at ease by reassuring them that you had nothing to do with the design of the interface, so they can be completely honest:” Time Aware Research – Using live recruitment in a moderated study leads to richer and more authentic interactions with participants that occur in their native environment. “Remote research is more appropriate when you want to watch people performing real tasks, rather than tasks you assign to them. The soul of remote research is that it lets you conduct what we call Time-Aware Research (TAR).” Execution Tips Progress from high to low variability – Start the session with undirected natural tasks, which gives the participant space to surprise you. Finish by running through any tasks the user did not complete naturally, this time in a structured manner. Timestamp your notes – make timestamps based on “time since session start” instead of absolute times, to make them easier to review later. Cross-reference “control” metrics with your analytics – Double-check that your research is not biased due to a flaw in the design or structure of the study. “If there’s a discrepancy between your study findings and the Web site’s analytics (“80% of study participants clicked on the green button, but only 40% of our general Web audience does”), it could mean that the task design was flawed, the target audience of the study differs from that of the main audience, or that there’s an unforeseen issue altogether.” Ask open-ended questions – Remain neutral to avoid influencing the responses from participants. “So, tell me what you’re looking at … What’s going through your mind right now? … What do you want to do from here? … When did you decide to leave the site/exit the program? 
… What brought you to this page?” Thoughts Remote Research lays out a comprehensive framework for starting to conduct research studies at your company, and is useful for beginners or for filling in the gaps in your mental model. However it seems more targeted towards large companies with established UX practices than towards startups. If you are executing alone—perhaps as a one-man UX team—you may still feel a gap between theory and execution. The tools section of the book seems dated, which is understandable, however it would be great to see some more tactical information on conducting remote research on the cheap. Two tricks that I have used at work myself are: Running tests from Google Tag Manager – Aligning with the owner of the tracking platform (often Product team) is a quicker way to get the necessary code live than doing it in-house with IT. Use a general session recording tool – Using a tool such as Inspectlet, you can record most or all user interactions and then filter the recordings down afterwards. This allows you to observe a very specific behaviour chain that may not occur frequently enough on your site to target users live. Book review: Web Form Design Wednesday, 11 May 2016 I finished reading Web Form Design recently on the recommendation of a mentor. The author makes a good case about web forms being a high leverage area to invest design efforts. The combination of forms being mandatory, complex, and not particularly sexy, results in an experience that is often the worst part of a user’s interaction with your product. He then breaks down the form into the building blocks of Labels, Input Fields, and Actions, then lays out best practices for each. Here are a few snippets from the book that resonated with me. Labels Top-aligned labels – “The results of live site testing across several different geographies have also supported top-aligned labels as the quickest way to get people through forms. 
These studies also had higher completion rates (over 10 percent higher) than the left-aligned versions of forms they were tested against… One of the reasons top-aligned forms are completed quickly may be because they only require a single eye fixation to take in both input label and input field. [50ms compared to 240ms for right-aligned and 500ms for left-aligned labels] … Top-aligned labels, however, do take up additional vertical real estate.”

Right-aligned labels – “The resulting left rag of the labels in a right-aligned layout reduces the effectiveness of a quick scan to see what information the form requires … That said, in cases where you want to minimize the amount of vertical screen space your form uses, right-aligned labels can provide fast completion times.”

Left-aligned labels – “Left-aligning input field labels makes scanning the information required by a form easier. People can simply inspect the left column of labels up and down without being interrupted by input fields… Unfortunately, a few long labels often extend the distance between labels and inputs and, as a result, completion times may suffer. People have to “jump” from column to column in order to find the right association of input field and input label before entering data. The reason left-aligned forms are the slowest of the three options to complete may be because of the number of eye fixations they require to parse.”

Inside-aligned labels – “In cases where screen real estate is at a premium, combining labels and input fields into a single user interface element may be appropriate… Because labels within fields need to go away when people are entering their answer into an input field, the context for the answer is gone. As such, labels within inputs aren’t a good solution for long forms… It’s also generally a good rule not to use labels within inputs for non-obvious questions. That is, questions that may require people to reference the label while answering.”
Input Fields

Tabbing behaviour – “Web form designers should consider what the experience will be like for the large numbers of people who move between input fields using the Tab key, and they should design accordingly.”

Radio buttons – “Allow people to select exactly one choice from two or more always visible and mutually exclusive options. Because radio buttons are mutually exclusive, they should have a default value selected (more on this later). It’s also a good idea to make sure both the radio button and its label can be selected to activate a radio button selection.”

Input switching – “[Sequential] basic text boxes … lead users to skip back and forth between their mouse and keyboard … in order to complete the interaction.”

Length of input fields – “The way we display input fields can produce valuable clues on how they should be filled in… In the eBay Express example … the size of the zip code input matches the size of an actual zip code in the United States: 5 digits. The size of the phone number text boxes match the number of digits in a standard phone number in the United States. The rest of the text boxes are a consistent length that provides enough room for a complete answer.”

Required/optional fields – “If most of the inputs on a form are optional, indicate the few that are required. … When indicating what form fields are either required or optional, text is the most clear. However, the * symbol is relatively well understood to mean required.”

Actions

Secondary actions – “When you reduce the visual prominence of secondary actions, it minimizes the risk for potential errors and further directs people toward a successful outcome.”

Success vs. Error messages – “The key difference between error and success messages, however, is that error messages cannot be ignored or dismissed – they must be addressed. Success messages, on the other hand, should never block people’s progress – they should encourage more of it.”
Animating success messages – “Because human beings are instinctively drawn to motion – we had to avoid sabertoothed tigers somehow – animated messages that transition off a page can let people know their actions have been successful. The most common transitions utilized for this are fades, dissolves, or roll-ups.”

Effective in-line validation – “Inline confirmation works best for questions with potentially high error rates or specific formatting requirements… When validating people’s answers inline, do so after they have finished providing an answer, not during the process.”

Tracking: Organizational Challenges

Friday, 12 Feb 2016

There are plenty of technical guides online about tracking user behaviour using GTM. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking.

One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization. Stakeholders get nervous when you propose changes to tracking, even if tracking currently sucks.

As a product manager, my primary interest in tracking is to feed higher-quality data into the product decisions my team makes. Being “data-driven” is chic, but having reliable and relevant data is not a given. It requires some strategic forethought to track the right things and track them properly.

The first thing I did was consolidate all the country-specific containers into a single global container in GTM. Our application is nearly identical between countries, so this was easy from a technical perspective. We removed outdated tags, replaced country-specific IDs with lookup tables, and updated triggers to match. The second major change was to rename our events so that they communicate user behavior in a more transparent way.
A few lessons learned from the process: Reports are fragile Tracking data feeds into many teams’ reports—some of which you may not be aware of. These reports can be quite fragile to changes made to the tracking layer. Even worse than breaking a report, is to subtly impact some of its underlying assumptions, reducing the accuracy and usefulness of that report without anyone realizing it. The best way to mitigate this risk is to coordinate tightly with BI. Sit down and trace all the “customers” of tracking data to get a better sense of how changes will impact various teams and reports. It is especially important to be aware of which reports are consumed by external stakeholders such as investors. These reports often process the data down to a single number in a spreadsheet cell, without any context around it. For example, inserting a GA event could impact the “bounce rate” calculation on that page. People are overly confident in their data Making decisions on real-world data is not as clean-cut as a case study in business school, and it is always good practice to question the source and validity of the data you are using to make a decision. Unfortunately some decision-makers can lose sight of this. Prepare for some push-back against your proposed fixes or improvements to tracking, as this implies that prior decisions were made with flawed data. Data is never infallible, but this can be an uncomfortable reality for some managers. Decouple tracking from KPI definition The ideal tracking event crisply describes the nature of the user interaction without commenting on the value to the business. Event names such as “Unique Lead” or “Customer Intent” are opaque and give no visibility into what exactly those actions are, or why they are important to the business. It is better to push the task of KPI definition “up the stack” to management, so that the people who are ultimately consuming the tracking data will be better-equipped to make decisions on it. 
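One way to picture the decoupling described above is as two separate layers: events that only describe what the user did, and a KPI definition that maps those events to business meaning. The sketch below is hypothetical throughout; the event names, the "lead" definition, and the helper function are invented for illustration and are not taken from any real tracking setup.

```python
# Tracking layer: event names describe the interaction and nothing more.
events = [
    {"user": "u1", "name": "click_show_phone_number"},
    {"user": "u1", "name": "submit_contact_form"},
    {"user": "u2", "name": "click_show_phone_number"},
    {"user": "u3", "name": "view_listing"},
]

# KPI layer: the business decides which interactions count towards a KPI.
# Changing this mapping does not require touching the tracking code.
KPI_DEFINITIONS = {
    "lead": {"click_show_phone_number", "submit_contact_form"},
}


def unique_users_for_kpi(events, kpi):
    """Count distinct users with at least one event belonging to the KPI."""
    names = KPI_DEFINITIONS[kpi]
    return len({e["user"] for e in events if e["name"] in names})


print(unique_users_for_kpi(events, "lead"))
```

Because the raw events stay descriptive, anyone reading a report can trace a KPI back to the concrete interactions behind it.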
The Best of Seth Godin for Product Managers Friday, 10 Jul 2015 One of the consistent must-reads that has remained in my RSS feed over the years is Seth Godin’s blog. Seth consistently puts out a stream of incredibly wise thoughts. I have found that some of his posts resonate with me even more when I re-read them at a later point in my life/career. Here are some of my favourite Seth Godin posts, as they relate to the role of Product Manager. Please, go away – Being out-of-touch with customers hurts every part of an organization, but especially the product team. Sometimes it requires a conscious effort to correct for this. You may receive surprisingly strong push-back from some people on your efforts. Project management for work that matters – Ten very good pieces of advice for the project mgmt. parts of a PM’s job. Really Bad Powerpoint – One of Seth’s longer blog posts. A good philosophical guide to using powerpoint effectively. I try to stay away from powerpoint as much as possible, but sometimes it is necessary, especially for interacting with stakeholders. Not even one note – Why it is important to choose better features over more features. He also talks about how to make that choice. Inventing a tribe – Building a successful product vision does not have to involve creating something totally new and revolutionary from scratch. It is far more likely that it will involve connecting and empowering the people that already share a vision with you. How to live happily with a great designer – Some tips for working effectively with designers. Two kinds of writing – As a PM you will be interacting with totally different groups of people on a daily basis. It is important to adjust your writing and communication style to each audience. You will want to use a different approach when dealing with customers, engineers, marketing, or stakeholders. Why do you do it this way? – A good way to test some of the underlying product decisions made in the past. 
Asking why three times is a great way to uncover the philosophy of a team.

Marketing to the organization – Product managers lead without positional authority, so it becomes important to approach things at a meta level, thinking about what you can do internally to give a product or project the best chance of succeeding.

Doing calculus with Roman numerals – As a non-technical PM, it is especially important to be relentlessly curious and to ask many questions about the technical side. Not to make your job easier, but to open up a level of performance that is not possible without understanding the tools being used around you.

Reading books for long-term value

Wednesday, 08 Jul 2015

For a while now, my Pocket reading list has been growing at a faster rate than I have been consuming it. Recently this problem has crept into my offline reading as well, and now my GoodReads list is growing hopelessly long.

Initially I approached this as a quantity problem, and started looking into speed-reading as a method of consuming more information. There is a neat tool called Spritz that controls for eye movement to help you learn. But it turned out the problem was about quality of reading, rather than quantity of material. This manifested itself in a disappointing recall of key arguments and theses of books I had read more than a year or two before.

Part of the problem was that I considered the primary goal of reading to be acquiring information. The issue with this approach is that if the raw data is not synthesized, you won’t remember it for as long. I now consider the primary goal of reading to be rewiring parts of my cognitive process based on the information in the book. Here are a couple of the systems I have put into place to derive more long-term value out of my reading:

Before

Read summaries

In an effort to reduce the input side of my reading list problem, I have begun heavily vetting the recommendations or discoveries that I place into my reading list.
Anything non-fiction gets checked for in Blinkist to see if there is already a summary available. For other genres, I like to check Maria Popova’s Brain Pickings to see if she has written on that book before. Reading through a summary like this will give you a better sense of whether you should commit to reading the full book. And if you do proceed to read the book, you begin with a rough mental framework that makes it much easier to absorb the arguments and theses into your mental model.

During

Use an e-reader

Buying an Amazon Kindle has been a huge help. Besides the whole “thousand books in your pocket” thing, I find the highlighting feature to be incredibly valuable. I have never been much of a highlighter / markup-er of printed media, but I am well aware of the benefits for cognitively absorbing material. Kindle’s highlights feature lets you collect snippets from a book and export them as a text file.

Read deliberately

Shane Parrish of Farnam Street has written extensively on the subject of learning, reading, and self-improvement. He has some pieces of good advice that ultimately add up to the act of reading deliberately. Take a second before you begin to think about the author, the context, and your existing knowledge on the subject. While reading, mentally summarize arguments periodically, and try to abstract at a higher level. After you put down a book, spend a couple minutes in silence, contemplating what you’ve just learned, and attempting to synthesize it into your existing mental framework.

After

Write a book summary

There is a reason that Bill Gates publishes book reviews, and it’s not because he has nothing better to do with his time. Writing these reviews will encourage you to read at the analytical level required to summarize effectively. I usually start by sorting through all of my kindle highlights from a book, then organizing them into thematic groups, and trying to build a structured opinion on the work.
Making a value judgement in your summary will force you to go a step further in your reading, to do the work of synthesizing the material and forming an argument.

Mindmapping

I also find it useful to push one level above individual books, and to make a conscious effort to integrate the new book into my mental frameworks of knowledge. Mindmapping is a good tool for this, as it helps you visualize and form connections between pieces of material without the need to traverse the information in a linear fashion. Another option is to collect key passages into your commonplace book.

Adding these additional layers to my reading “stack” definitely slows down my rate of consumption, but I think it is well worth the increase in comprehension, synthesis, and long-term retention.

How to conduct user research when you can’t reach your users

Saturday, 04 Jul 2015

If you are a product manager, you have almost certainly heard before about the importance of conducting user research. Quantitative data can point to where a problem exists, but nothing beats qualitative research for learning why that problem occurs. Large datasets can obscure individual usage patterns, making it hard to “get into the user’s head”. User research helps you understand the conceptual models of your users and to build personas around them.

Normal user research methods involve getting users into a room and watching them interact with your product. But what do you do if you can’t reach your users as easily? What if your users are in different countries, or speak different languages? These factors certainly make user research more difficult, but also simultaneously make it even more important.

One solution I’ve been playing with recently is a combination of Olark live chat and Inspectlet. Inspectlet is a tool that records the cursor movements, clicks and scrolls of your users, and then rebuilds them into a video of the user’s session.
At first it almost seems as if you are “spying” on users, though in fact the videos are all assembled post-hoc. Inspectlet is, of course, not as interactive as true user testing, but it does allow you to get surprising insights on user behaviour.

What is really powerful is when you combine these two together. Olark is primarily a live-chat tool, but when you are offline it reverts to a feedback box, placed on a targeted part of your website or product.

Here is how I chain the two tools together:

Place the Olark feedback box on a specifically targeted element of your website where you expect there will be user frustration. Olark’s premium plan offers targeting, or you can roll your own DIY targeting by firing the Olark tag through Google Tag Manager.
After some time, read through the responses Olark sends to your email. If you are tracking foreign-language users, you can translate most messages right from within Google Chrome.
When you find a user response that interests you, grab the IP address from the message and filter for that IP in Inspectlet. Unless your product has massive traction already, you’ll probably find a single session that matches that IP address.
Watch the user session to learn the process the user went through before leaving the corresponding piece of feedback.

This combination is the most effective solution I have found so far to bridge the user research gap on hard-to-reach users. However, I wouldn’t say this is a replacement for conducting real user research. If you can, nothing beats an in-person session.

Reconciling contradictory advice

Friday, 26 Jun 2015

One of the problems with abstracted tidbits of advice is that they lose much of their meaning when divorced from their context. The correct decision can be heavily weighted by the nuances of the specific scenario. As a result, you often receive seemingly conflicting pieces of advice. The easy example is with contradicting proverbs, which are humorously documented here.
But the contradictions also occur in more serious advice given around technology, business strategy, and product development. Here are a couple I have been thinking about recently.

Breadth vs. depth

Should you strive to be well-rounded (full-stack?) or should you focus on your strengths?

This can be viewed as a version of the classic generalist–specialist dichotomy. But it is more interesting when applied to the “micro” skill level rather than “macro” level career advice. When it comes to your skills and capabilities, should you focus on your strengths, or invest the time to round out your weaker skills? This is loosely related to the multi-armed bandit problem, and to the concept of local maxima. What is the optimal mix of breadth and depth?

Perfectionism vs. mediocrity

Should you apply the 80/20 rule, or should you focus on the details?

Ellen Chisa pointed out this contradiction on her blog, specifically in the context of product development. It ties into the concept of Minimum Viable Product (MVP), which is unfortunately often cited as an excuse to cut corners and ship half-baked products into the market. 80/20 style prioritization lets you achieve more output with fixed time/money. But it makes an implicit assumption that you are optimizing for raw efficiency. What if that is not true?

Imagine you are playing Super Mario for a moment. If you get 95% through a level but then die, you start again from the beginning. You are rewarded not for your average performance, but for the number of absolute wins you achieve. You can fail at that 95% over and over, and walk away with a 90% average but without making any real progress to the next level.

In the context of product development, you are not optimizing for the average happiness of a user but rather for the number of users happy enough to sign up or buy. In this sense, users are a fungible unit of success. If you spread your resources out with the 80/20 rule, you could launch 5x the number of features, but at an 80% quality level.
This could get you 5x the exposure, or perhaps 5x the engagement, but it does not necessarily lead to 5x the sales / conversions.

Imagine a user has some intrinsic standard for how well a solution must fit their needs to sign-up or buy. If this “bar” falls above 80%, then you might lose all your 5x users to a bunch of niche competitors that serve their specific needs at a theoretical 90% level.

It may make more sense to focus your resources on developing something at a 95-100% level but with only 20% of the scope. This involves saying no to 80% of opportunities/features. As a result, you might get objectively fewer users into the start of your funnel. But assuming that your product is well-executed – that you didn’t waste these theoretical resources – then you should have a far higher conversion than in the 80/20 scenario.

Moving fast vs. patience

Is it better to have the time lead of being first-to-market or the lower risk of being a close second?

Using the “first mover advantage” is a classic business school strategy. It is completely logical in industries such as telecom or social networks, where customers are locked-in and there are strong network effects at play. Yet many first-mover activities center around creating a market, and are not always defensible to a specific company. Competitors can get a “free ride” on your push for regulatory change or established supply chains. When does it make sense to be a trailblazer, and when does it make sense to tuck yourself into the slipstream of the current leader?

The Wirecutter: on trust, and satisficing

Sunday, 21 Jun 2015

I am a big fan of the consumer editorial site The Wirecutter. They earned a position in my stack of newsletter subscriptions for their help with simplifying tech purchasing decisions.

In his book The Paradox of Choice: Why More is Less, Barry Schwartz lays out a dichotomy of people’s decision-making behaviour. Some people are maximizers – those who strive to make the optimal decision.
Others are satisficers – those who make a decision as soon as it meets their criteria. Mr. Schwartz’s thesis is that satisficers are happier than maximizers in the long-run. Although their average decision is less optimal, it requires much less effort.

Maximizer-behaviour is useful for high-stakes irreversible decisions, but most decisions are not like that. It is difficult to be a maximizer with the sheer volume of smaller decisions we face on a daily basis. One example that can be surprisingly taxing is deciding what TV, camera, charger, BBQ, or washing machine to buy. You might have strong preferences about some of these, but it is more than likely that you are not familiar with most of the above product categories.

Making a truly informed decision requires that you first familiarize yourself with the offerings in the market. Then you must prioritize your own requirements and analyze each option, before coming to a decision. If you make the wrong decision, you will be reminded of it every time you use the product over the next few years.

Previously I never trusted a single review enough to consider it more than a single data-point. Look up a review on Engadget, Gizmodo, The Verge, and CNET, and they often all offer conflicting opinions on the same product. But The Wirecutter is different.

First, the reviews are centred around user problems (Which X should I buy?) rather than tech solutions (Review of the new Z 2.0). The editor aggregates reviews from across the web on a select group of options and reports the results. This serves as a “one-stop” source of information instead of as a single data-point.

Second, each review leads with a summary of the recommendation and a link to buy on Amazon. But underneath this summary is a comprehensive breakdown of the logic behind that decision. There are sections such as Why you should trust us, Flaws but not deal-breakers as well as alternative recommendations based on niche use-cases.
On my first couple visits to The Wirecutter, I read the entire page – in classic maximizer behaviour. But after making a few purchasing decisions based on their advice, I have developed a great deal of trust in the editorial team from The Wirecutter. Now I often only skim the review – and if it is a less critical decision, I will simply buy their top recommendation without much extra thought. In a sense, it has allowed me to outsource the burden of “maximizing” tech purchasing decisions to a trusted third-party.

The ultimate test of trust in tech decisions is to ask yourself: “Would I recommend this to my mother?”. If you recommend the wrong product, you might find yourself fixing it or providing support for your next few Thanksgiving Dinners. For me, The Wirecutter has passed this test. Whenever Mom asks for advice on something I have no familiarity with (“Which dashcam should I buy?”), I just link her to The Wirecutter.

Problem Spaces

Monday, 18 May 2015

A common thread of startup advice is to avoid thinking about ideas and to instead think about problems that need to be solved. Switching to a problem-seeking mindset feels a little unnatural at first, but is ultimately a more productive way to approach the ideation process. Time spent thinking through a specific solution can quickly spiral into day-dreaming (“Wouldn’t it be cool if?…” or “Also we could do…”) which is at best a waste of time, and at worst can distract you from finding the core essence of a product.

Lately I’ve been making more of an effort to focus on problems that need solving, instead of ideas. I’ve noticed some common threads between loosely-related problems, which have crystallized into problem spaces that I find myself thinking about repeatedly. These problem spaces encompass a few related problems that could be solved in a variety of totally unrelated ways. Here are a few that have been on my mind recently.
Preserving friendships

100 years ago, people had to make a conscious effort to stay in touch with friends, especially over long distances. Today, the decision of who we interact with on a daily basis is largely made by social media algorithms. An unintended consequence of this switch is that it is remarkably easy to fall out of touch with certain friends, especially after moving to a new city, a new country, or a new stage of life.

If we allow Facebook to curate our social interactions, we risk falling out of touch with those who slip between the cracks of the news feed algorithm. How can I mitigate this, to ensure that 5 or 10 years from now I am still closely in touch with important people in my life?

Individualized travel advice

There is no shortage of services that aggregate travel advice and recommendations, such as TripAdvisor, WikiTravel, or Yelp. These solutions are definitely more responsive and granular than published travel guides, but they still fall short of providing individually tailored or curated advice. I have had a couple disappointing experiences with these, specifically when visiting a destination where I am far from the target demographic, and where the popular recommendations do not appeal to me.

Similar to how curated email newsletters are replacing aggregated news – at least for myself – I think there is room to apply a more curated approach to travel recommendations. The tricky part seems to be finding and picking a trusted curator for a “disposable” source of information. I can subscribe to 10 newsletters and then pick the best one in a month, and this is still worthwhile if I come out of it with a trusted source that I will read for years. But I can’t justify this same level of trial and investment to find a good source of information on a city I will be in for a single weekend.
Semi-social photo sharing

There is a cliche about the amount of effort entrepreneurs spend on photo-sharing apps, which proposes that there are much more important and worthy problems to be solved. Nevertheless, I think the experience of sharing photos is still ripe for innovation.

Up until a couple years ago, the entire space was focused on social, ignoring the entire spectrum of situations where I may want to share a photo but not to my entire social network. Snapchat changed this, introducing ephemeral messaging that addresses the more personal and/or frivolous end of the spectrum. But there is still a big space in between – where I want to share some photos with some people but it may not be a conscious effort, and it doesn’t need to be ephemeral.

Instagram Direct is interesting in this regard, because it allows you to address the some people/some photos part, albeit in a very conscious fashion. But ultimately I still find myself with an offline library of my photos that simply don’t end up being shared, but that friends love to flip through on my phone. I wonder if this space could benefit from machine learning – if an app could figure out who I was at the bar with last night, and then suggest sharing with a selective list of people who may care about my slightly blurry, definitely not Instagram-worthy pictures.

I may expand on these problem spaces here in more detail in the future, as I continue to think about them. If you are thinking about similar areas, drop me a line and let’s talk.

Tool of the week: Blinkist Daily

Saturday, 09 May 2015

I’m a big fan of Blinkist, which is a subscription service that provides really well-written summaries of popular non-fiction books. These aren’t the SparkNotes you remember from your high school days – each summary is split into thematic bites, and the information is presented in a form that is already partially synthesized. Each day Blinkist offers free access to one of their new summaries through Blinkist Daily.
I find that the curation of books they use for Blinkist Daily is very high-quality, and I can usually find at least 2 summaries per week that I am interested in. It’s a similar model to Creative Live, where the initial live screening/viewing is free, but you can pay for access to the catalog of old content.

So I found myself reading 2-3 blinks per week through Blinkist Daily. Eventually I picked up an annual subscription to the core Blinkist service, which lets you push summaries to your Kindle. What is interesting is that I have found myself using the service less now that I am paying for it than when I was mooching off the free 24-hour summaries from Blinkist Daily. In some perverse way, having unlimited access to their entire library of information at my fingertips reduces my usage of the service.

I don’t know if this is necessarily something wrong with the core product as much as it is something brilliant about Blinkist Daily. Curating a single summary per day and offering it for a fixed period of time simultaneously reduces the decision fatigue of choosing what to learn, and also introduces an element of scarcity in the form of a hard deadline at which point the summary disappears forever.

Resources on Product Management

Sunday, 12 Apr 2015

When I started as a Product Manager last year, I knew I had a lot to learn. I scoured the internet, reading everything I could find on Product Management and how to succeed starting out as a non-technical PM. I have compiled a list of some of the most useful things I have read, partially so that I can revisit them myself from time-to-time.

Some of these articles are not strictly product-related – many of them involve design, project management, and elements of software development. The PM role varies greatly between companies, and often involves stepping in to fill whatever necessary gaps exist in order to ship a successful product.
Product

Good product managers crisply define the target, the “what” (as opposed to the how) and manage the delivery of the “what.”

Good Product Manager / Bad Product Manager – A note by Ben Horowitz that is worth a re-read every few months.

believe great taste can be developed but not in a linear manner that is predictable or time bound. The best, and perhaps only way, to develop great taste is to be interdisciplinary and to gather a large variety of life experiences to draw upon. This is why Steve Jobs’ focus on the intersection of technology and liberal arts has always made a lot of sense to me.

The Three Skills of a Great PM – Some core skills of an effective Product Manager, distilled into 3 semi-quantified “rates”

Product management may be the one job that the organization would get along fine without (at least for a good while). Without engineers, nothing would get built. Without sales people, nothing is sold. Without designers, the product looks like crap. But in a world without PMs, everyone simply fills in the gap and goes on with their lives. It’s important to remember that – as a PM, you’re expendable. Now, in the long run great product management usually makes the difference between winning and losing, but you have to prove it.

How to Hire a Product Manager – An essay by Ken Norton from Google Ventures. Although it is ostensibly written for people looking to hire a PM, it gives a great outlook into the role for someone trying to get hired.

Design

There is your product and then there is the experience someone has using your product. It’s easy to see the difference from afar, but to the person using your product they are one in the same. This cannot be understated. Every interaction with your product/service/company matters and becomes part of the product experience.

The experience is the product – Joshua Porter on the inseparability of the product itself and the experience that surrounds it.
Book review: The Design of Everyday Things – A good summary of one of the classic books on Design by Don Norman. This summary convinced me to read the whole book.

The single easiest way to see things through the eyes of your new user is to simply watch your user interacting with your product for the first time and talk to her about the experience. Don’t try to do this without help from your users. You know way too much.

You Know Too Much – Laura Klein on why it is so difficult to keep users’ conceptual models in our heads, and why it is essential to watch users interacting with our product.

Learning about your customer is the single most important part of your startup. If you’re outsourcing that to a person who isn’t directly responsible for making critical product decisions, then you are making a horrible mistake.

Startups Shouldn’t Hire User Researchers – Laura Klein on why user research should be a responsibility of PMs

The fact is, understanding what your users like and don’t like about your product doesn’t mean giving up on your vision. You don’t have to make every single change suggested by your users. You don’t have to sacrifice a coherent design to the whims of a focus group.

6 Stupid Excuses for Not Getting Feedback – There is a good chance that you recognize at least one of these excuses.

One of the main reasons I like the thinking aloud method of user testing is that it gives us insights into a user’s mental model. When users verbalize what they think, believe, and predict while they use your design, you can piece together much of their mental model.

Mental Models – Jakob Nielsen explains how good design needs to consider the mental models that users carry while interacting with your product.

Good design is always a moving target.

Execution / Shipping

Ideas are just a multiplier of execution – People generally overestimate the value of ideas and strategy, while underestimating the critical importance of execution.
Business students in case study discussions will spend 70 minutes fiercely debating high-level strategic issues, then round off the final 10 minutes on operations and execution by saying “Then we’ll hire an engineering team and build out the product.”

There is no later for your customers. The only thing that matters is what they’re using right now. They don’t give a shit about your roadmap, your brilliant feature pipeline, or your vision of a better future. They’re trying to get work done right now and they only know what you’ve already delivered. So build a discipline around your launches, knowing that your temporary, let’s get this out quickly and iterate later release is the current reality for your customers. Build up your attention to detail and force yourself to treat every launch like it is your final launch. Imagine that you’ll never be able to deploy something after this…have you done your best work?

There is no later for your customers – “Just Ship It” is not an excuse to release a sub-par, unfinished product to the world. The concept of Minimum Viable Product is also wrongly applied in this regard as an excuse to ship something half-baked.

This is one of the reasons why B2B applications often get away with being so awful and hard to use. If a product helps me do my job better and makes me more money, it’s solving a big problem for me. I’ll put up with a few missing features or a less than stellar experience.

How Bad Can I Make My Product? – A good litmus test for determining approximately how much you should be sacrificing release quality and polish for speed.

So what makes an idea guy an idea guy? Usually it’s the simple fact that they don’t have any other skills to bring to the startup.

5 Reasons you don’t want to partner with an “Idea Guy” – People unfamiliar with the industry often equate Product to being the Idea Guy.
Ensure that you do not fall into this trap as a non-technical PM – always focus on delivering tangible value through analytics, testing, and pursuing a deep understanding of the customer.

Software Development

The work of implementing a feature initially is often a tiny fraction of the work to support that feature over the lifetime of a product, and yes, we can “just” code any logic someone dreams up. What might take two weeks right now adds a marginal cost to every engineering project we’ll take on in this product in the future. In fact, I’d argue that the initial time spent implementing a feature is one of the least interesting data points to consider when weighing the cost and benefit of a feature.

The One Cost Engineers and Product Managers Don’t Consider – Without acutely understanding the compounding effect of complexity costs on engineering resources, the product organization can find it increasingly difficult to ship new features.

The key is to understand that the root cause of all this grief about commitments is when these commitments are made. They are made too early. They are made before we know if we can actually deliver on this obligation, and even more important, if what we deliver will actually solve the problem for the customer.

Managing Commitments in an Agile Team – Making meaningful estimates on software development projects is a long-standing problem that has had entire books written about it. Understand how you can manage expectations with stakeholders and work with engineering to avoid the all-too-typical disappointments of time and cost overruns on poorly made estimates.

Thoughts on managing recurring tasks

Wednesday, 17 Sep 2014

Most people use some combination of a calendar and todo list to organize their lives, whether it be a paper organizer or one of the myriad task list apps that pop up every day in the App Store. Personally I use a combination of Google Calendar and Todoist.
Working together, these two do a pretty good job of keeping me organized.

That said, the one type of task I have found awkward to manage is the kind that you’d like to complete on a regular basis, but that isn’t particularly time sensitive. Stuff like changing your bed sheets, backing up your computer, or cleaning up your iTunes library.

They don’t belong on your todo list. It doesn’t make sense to clutter up your todo list with an endless stream of recurring tasks that aren’t relevant to your day-to-day goals. A cluttered todo list reduces your effectiveness, so you should be striving to keep it as clean as possible.

Neither do they belong in your calendar. These tasks don’t need to be done on a specific day or at a specific time. Treating these tasks as calendar events just clutters up your calendar with events that you probably won’t respect, and makes it more likely you’ll lose track of something important.

The solution: Augment your organizational system with an app specifically designed for recurring tasks. The two best such apps are Radar (iOS, $1.99) and Regularly (Android, Free).

Radar is an iOS app that is specifically designed to handle those recurring tasks that don’t quite fit into either your calendar or your todo list. For Android users, Regularly is similar, although it isn’t quite as aesthetically pleasing.

These apps let you add recurring tasks and specify how often you want to do them, measured in number of days, weeks, or months. Then they keep you on track with a list of upcoming tasks and push notifications when they are due.

What is great about using Radar is that you aren’t imposing false deadlines on tasks that are in reality quite flexible. If you don’t feel like dusting out your PC today, you can just do it tomorrow. But Regularly will make sure you do it every six months. When “Call Mom” pops up, you don’t need to immediately do it, but you know to plan on doing it at some point over the next couple days.
Radar/Regularly really starts to shine when you begin to add a bunch of tasks with longer horizons, such as checking your stock portfolio or changing your air filter. I have a list of around 30 semi-regular chores and tasks, so every week that I check the app I just do a couple things to stay on the ball.

Find	Replace
`\\[\\$\\]`	`\\\(`
`\\[\\/\\$\\]`	`\\\)`
`\\[\$\$\]`	`\\\[`
`\\[\\/\\\$\\\$\\]`	`\\\]`
`\\[latex\\]`	`\\\[`
`\\[\\/latex\\]`	`\\\]`
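The find/replace pairs above can be applied in a single pass with Python’s `re` module. This is a minimal sketch, assuming the patterns are meant to match the literal delimiters `[$]`, `[/$]`, `[$$]`, `[/$$]`, `[latex]` and `[/latex]`, and that the extra backslashes in the table are an artifact of how it was rendered. The function name `convert_math_delimiters` is hypothetical.

```python
import re

# (pattern, replacement) pairs mirroring the table above.
# Assumption: each pattern matches one literal bracketed delimiter.
RULES = [
    (re.compile(r"\[\$\]"), "\\("),       # [$]      -> \(
    (re.compile(r"\[/\$\]"), "\\)"),      # [/$]     -> \)
    (re.compile(r"\[\$\$\]"), "\\["),     # [$$]     -> \[
    (re.compile(r"\[/\$\$\]"), "\\]"),    # [/$$]    -> \]
    (re.compile(r"\[latex\]"), "\\["),    # [latex]  -> \[
    (re.compile(r"\[/latex\]"), "\\]"),   # [/latex] -> \]
]


def convert_math_delimiters(text: str) -> str:
    """Rewrite bracketed math delimiters as backslash-style delimiters."""
    for pattern, replacement in RULES:
        # A function replacement is used so that the backslash in the
        # replacement is inserted literally, not parsed as a template escape.
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text


print(convert_math_delimiters("Inline [$]x^2[/$] and display [$$]y[/$$]"))
```

Because the inline pattern requires a closing bracket directly after a single `$`, it does not accidentally match inside `[$$]`, so the order of the rules does not matter here.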

Front	Back
What is the Streetlight Effect?	An observational bias where a person who is searching for something looks only where it is easiest.
What is the name for an observational bias where a person who is searching for something looks only where it is easiest?	The Streetlight Effect
What is the implication of the Streetlight Effect?	We must be careful to focus our problem-solving efforts towards the area where the solution is likely to be rather than the area where we have the most data.
What does this picture represent?	The Streetlight Effect

Property	Random Forest	AdaBoost
Depth	Unlimited (a full tree)	Stump (single node w/ 2 leaves)
Trees grown	Independently	Sequentially
Votes	Equal	Weighted

Variable	Math
`sample_weights` with shape: (T, n)	$ w_{i}^{(t)} $
`stumps` with shape: (T, )	$ h_t(x) $
`stump_weights` with shape (T, )	$ \alpha_t $
`errors` with shape: (T, )	$ \epsilon_t $
`clf.predict(X)`	$ H_t(x) $
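To make the mapping between variable names and symbols concrete, here is a minimal pure-Python sketch of AdaBoost with decision stumps. It assumes one-dimensional features and labels in {-1, +1}, and it uses nested lists where the table mentions array shapes. The helpers `fit_stump`, `stump_predict`, `adaboost_fit` and `adaboost_predict` are hypothetical names for illustration, not the original implementation.

```python
import math


def fit_stump(X, y, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump."""
    xs = sorted(set(X))
    # Candidate thresholds: one below the minimum, plus midpoints between neighbours.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            preds = [polarity if x > threshold else -polarity for x in X]
            err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best


def stump_predict(stump, x):
    threshold, polarity = stump
    return polarity if x > threshold else -polarity


def adaboost_fit(X, y, T):
    n = len(X)
    sample_weights = [[0.0] * n for _ in range(T)]  # w_i^(t), shape (T, n)
    stumps, stump_weights, errors = [], [], []      # h_t, alpha_t, epsilon_t
    sample_weights[0] = [1.0 / n] * n               # start with uniform weights
    for t in range(T):
        w = sample_weights[t]
        threshold, polarity, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((threshold, polarity))
        stump_weights.append(alpha)
        errors.append(err)
        if t + 1 < T:
            # Up-weight misclassified samples, down-weight correct ones, renormalize.
            new_w = [wi * math.exp(-alpha * yi * stump_predict(stumps[t], xi))
                     for wi, xi, yi in zip(w, X, y)]
            total = sum(new_w)
            sample_weights[t + 1] = [wi / total for wi in new_w]
    return sample_weights, stumps, stump_weights, errors


def adaboost_predict(stumps, stump_weights, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(s, x) for s, a in zip(stumps, stump_weights))
    return 1 if score >= 0 else -1


X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]  # not separable by any single stump
sample_weights, stumps, stump_weights, errors = adaboost_fit(X, y, T=3)
predictions = [adaboost_predict(stumps, stump_weights, x) for x in X]
```

No single stump can separate this toy dataset, but the weighted vote of three stumps can, which is the point of growing the stumps sequentially on re-weighted samples.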

	secret	offer	low	price	valued	customer	today	dollar	million	sports	is	for	play	healthy	pizza
0	0	1	0	0	0	0	0	1	1	0	0	0	0	0	0
1	1	1	0	0	0	0	1	0	0	0	0	0	0	0	0
2	2	0	0	0	0	0	0	0	0	0	1	0	0	0	0
3	0	0	1	1	1	1	0	0	0	0	0	1	0	0	0
4	1	0	0	0	0	0	1	0	0	1	0	0	1	0	0
5	0	0	0	0	0	0	0	0	0	1	1	0	0	1	0
6	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1

Variable	Math	Description
`prior`	$ P(y) $	Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam).
`lk_word`	$P(x_i \vert y)$	The likelihood of each word, conditional on message class. We are implicitly using the multinomial distribution here. Intuitively, the word conditional likelihoods are just the normalized frequency within each message class.
`lk_message`	$ P(\mathbf{x} \vert y) $	The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class.
`normalize_term`	$ P(\mathbf{x}) $	The likelihood of an entire message across all possible classes.
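The four quantities in the table combine through Bayes’ rule, which the following pure-Python sketch illustrates. The toy corpus and its spam/ham labels are hypothetical, the functions `fit` and `posterior` are illustrative names, and no smoothing is applied, so a word never seen in a class gives that class zero likelihood. The multinomial coefficient is omitted from `lk_message` because it is the same for every class and cancels after normalization.

```python
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs; the labels are assumptions.
messages = [
    (["secret", "offer", "today"], "spam"),
    (["million", "dollar", "offer"], "spam"),
    (["sports", "is", "healthy"], "ham"),
    (["play", "sports", "today"], "ham"),
]


def fit(messages):
    """Estimate P(y) and P(x_i | y) from labelled, tokenized messages."""
    classes = sorted({label for _, label in messages})
    # prior: share of messages belonging to each class.
    prior = {c: sum(1 for _, label in messages if label == c) / len(messages)
             for c in classes}
    # lk_word: normalized word frequency within each class.
    lk_word = {}
    for c in classes:
        counts = Counter(w for tokens, label in messages if label == c for w in tokens)
        total = sum(counts.values())
        lk_word[c] = {w: n / total for w, n in counts.items()}
    return prior, lk_word


def posterior(tokens, prior, lk_word):
    """Return P(y | x) for each class using Bayes' rule."""
    joint = {}
    for c in prior:
        lk_message = 1.0
        for w in tokens:
            lk_message *= lk_word[c].get(w, 0.0)  # unseen word -> zero likelihood
        joint[c] = prior[c] * lk_message
    normalize_term = sum(joint.values())          # P(x), summed over classes
    if normalize_term == 0:
        raise ValueError("message has zero likelihood under every class")
    return {c: joint[c] / normalize_term for c in joint}


prior, lk_word = fit(messages)
print(posterior(["secret", "offer"], prior, lk_word))
```

In practice a smoothing term is usually added to the word counts so that a single unseen word does not zero out an entire class.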

	Landing page	Refactor
Default	Stay with existing version	Launch new version
False positive	Wasted resources	Missed opportunity for improved conversion
False negative	Missed opportunity for improved conversion	Worse conversion

Item	Notes	Cost
Igloo Legend 12 cooler	This size is perfect for weeknight cooks. It is shallow enough to only need 4.5L to fill, but wide enough to be opened without removing the sous vide device from the lid every time.	$21
Spray foam insulation	We want something with good thermal properties, and which comes in a can with a spray nozzle, so we can spray it into tight spaces.	$8
Silicone caulk	We really don’t need much, so just get a small container.	$4
60mm x 3.5mm o-rings	These are for outside the lid, to adjust how deeply the sous vide unit sits.	$6
40mm x 2mm o-rings	These are for inside the lid, to keep the unit snugly in place when the lid is opened.	$4
Total		$43

Item	Cost
SAMLA box (with lid) from IKEA	$3
Cheap blanket (x2)	$4
Small plastic clamps (x2)	$2
Hot glue	$1
Total	$10

country	num_orders
USA	0.71169602953464
Canada	0.21443048342874
Mexico	0.07395810292746

Hours	Energy (kWh)	Watts
23.5	1.00	42
11	0.45	41
12	0.49	41
—	Average	41
