Geoff Ruddock

Convert clipboard HTML contents to Markdown with Alfred

Goal

All of my personal notes are written in markdown. I use Obsidian to manage them, but the specific tool is not relevant for the purposes of this post.

When referencing things in Google Docs, I find myself generally linking to the doc, rather than copy/pasting, because the default copy/paste output is poorly formatted, and is tedious to correct. This is sub-optimal, because then I cannot surface the contents via search in Obsidian, unless the text of the URL matches my keywords.

So the goal here is to make it trivially easy to copy/paste from Google Docs into Markdown. If it is easy, I will hopefully do it more often, which in turn means more useful results when searching my vault.

Easier to copy/paste → save more content → better search results → less time searching for things

Alternatives:

Getting HTML contents from clipboard

You may have wondered at some point: why does copied text appear differently depending on which app I paste it into? For example, text copied from a Google Doc appears identical when pasted into another Google Doc, but renders as plain text when pasted into a barebones text editor.

It’s worth understanding how the clipboard works. Most relevant for us:

  1. When the copy command is invoked, the active application can offer a variety of potential formats, incl. HTML, RTF, and plain text.
  2. When the paste command is invoked, the active application chooses which format to receive.

If you’re on a Mac, you can use the free Clipboard Viewer application to inspect the different formats offered by the application you are copying from.

Using Clipboard Viewer to inspect text copied from a Google Doc
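To make the "multiple formats" idea concrete, here is a minimal sketch of pulling the HTML flavor off the macOS clipboard from Python. It assumes that `osascript -e 'the clipboard as «class HTML»'` prints the data as a hex literal of the form `«data HTML3C…»` and that the payload is UTF-8. The helper names `decode_osascript_html` and `get_clipboard_html` are my own, not part of any library. Only the decoding step is exercised below, on a hand-made sample, so the snippet runs on any OS.

```python
import re
import subprocess


def decode_osascript_html(raw: str) -> str:
    """Decode the «data HTML…» hex literal printed by osascript into text."""
    match = re.search(r'«data HTML([0-9A-Fa-f]*)»', raw)
    if match is None:
        raise ValueError('no HTML flavor found in osascript output')
    # Assumption: the clipboard HTML is UTF-8 encoded
    return bytes.fromhex(match.group(1)).decode('utf-8')


def get_clipboard_html() -> str:
    """macOS only: ask AppleScript for the clipboard's HTML representation."""
    result = subprocess.run(
        ['osascript', '-e', 'the clipboard as «class HTML»'],
        capture_output=True, text=True, check=True,
    )
    return decode_osascript_html(result.stdout)


# Offline demonstration: the hex below spells "<b>hi</b>"
sample = '«data HTML3C623E68693C2F623E»'
print(decode_osascript_html(sample))
```

On a Mac, calling `get_clipboard_html()` right after copying from a Google Doc should return the same markup that Clipboard Viewer shows. If the clipboard holds no HTML flavor, `osascript` exits with an error and `subprocess.run` raises.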

Convert HTML to Markdown

Now that we have the HTML contents of our clipboard, we need to convert it to markdown.

Example HTML

Let’s start with the optimistic scenario in which we have perfectly structured HTML, notably:

  1. Text formatting is represented in semantic elements such as: strong (bold), em (italics), or del (strikethrough).
  2. Nested lists are wrapped inside an li tag, not placed directly below the parent list.

from IPython.display import HTML, display

example_html = """
<h4>Example text</h4>
<h5>Text</h5>
<ul>
    <li>This text is <strong>bold</strong></li>
    <li>This text is <em>italicized</em></li>
    <li>This text is <del>strikethrough</del></li>
    <li>Some <code>func = lambda x: print(x)</code> inline code</li>
    <li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li>
</ul>
<h5>Nested lists</h5>
<p>ul > ul</p>
<ul>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ul>
<p>ul > ol</p>
<ul>
    <li>B1</li>
    <li><ol>
        <li>B1a</li>
        <li>B1b</li>
    </ol></li>
    <li>B2</li>
</ul>
<p>ol > ul</p>
<ol>
    <li>A1</li>
    <li><ul>
        <li>A1a</li>
        <li>A1b</li>
    </ul></li>
    <li>A2</li>
</ol>
<p>ol > ol</p>
<ol>
    <li>A1</li>
    <li><ol>
        <li>A1a</li>
        <li>A1b</li>
    </ol></li>
    <li>A2</li>
</ol>
"""

display(HTML(example_html))

Example text

Text
Nested lists

ul > ul

ul > ol

ol > ul

  1. A1
    • A1a
    • A1b
  2. A2

ol > ol

  1. A1
    1. A1a
    2. A1b
  2. A2

Markdownify

In this idealized scenario, the markdownify library converts our HTML reasonably well out-of-the-box.

from markdownify import markdownify as md

print(md(example_html, heading_style='ATX', bullets='-'))
#### Example text


##### Text


- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)


##### Nested lists


ul > ul


- A1
- - A1a
	- A1b
- A2


ul > ol


- B1
- 1. B1a
	2. B1b
- B2


ol > ul


1. A1
2. - A1a
	- A1b
3. A2


ol > ol


1. A1
2. 1. A1a
	2. A1b
3. A2

But there are a few issues:

  1. The first child of nested lists gets a double list marker instead of proper indentation
  2. Ordered list numbers do not properly reset at each level
  3. There are unnecessary double empty lines
  4. Output uses tab characters (\t) instead of four spaces.

Fix list indentation

We can fix everything except for the wrong numbering using a few simple regular expressions.

import re

def strip(x: str) -> str:
    return ''.join(x.split('\n'))

md_text = md(example_html, heading_style='ATX', bullets='-')
    
find_replace_pairs = [
    ('- - ',          '\t- '),    # fix indent on first child of ul > ul
    (r'- (\d\.)',     r'\t\1'),   # fix indent on first child of ul > ol
    (r'\d\. - ',      r'\t- '),   # fix indent on first child of ol > ul
    (r'\d\. (\d)\.',  r'\t\1.'),  # fix indent on first child of ol > ol
    ('\t',            '    '),    # replace tabs with four spaces
    ('\n\n',          '\n'),      # remove extra line breaks
]

for f, r in find_replace_pairs:
    md_text = re.sub(f, r, md_text, flags=re.MULTILINE)

print(md_text)
#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
3. A2

ol > ol

1. A1
    1. A1a
    2. A1b
3. A2
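One caveat with the substitutions above: the patterns are not anchored, so a sequence like `- - ` in the middle of a sentence would also be rewritten, and `flags=re.MULTILINE` has no effect without `^` or `$`. The following is a hedged variant, not what the rest of this post uses, which anchors each pattern to the start of a line while still allowing leading indentation. The names `ANCHORED_PAIRS` and `fix_first_child_indent` are made up for illustration, and the sample string is hand-written to mimic the markdownify output rather than produced by it.

```python
import re

# Same idea as the pairs above, but anchored with ^ so that only markers at
# the start of a line (after optional indentation) are rewritten.
ANCHORED_PAIRS = [
    (r'^([ \t]*)- - ',          r'\1\t- '),   # first child of ul > ul
    (r'^([ \t]*)- (\d+\.)',     r'\1\t\2'),   # first child of ul > ol
    (r'^([ \t]*)\d+\. - ',      r'\1\t- '),   # first child of ol > ul
    (r'^([ \t]*)\d+\. (\d+\.)', r'\1\t\2'),   # first child of ol > ol
    (r'\t',                     '    '),      # tabs -> four spaces
]


def fix_first_child_indent(md_text: str) -> str:
    """Apply the anchored substitutions in order."""
    for pattern, repl in ANCHORED_PAIRS:
        md_text = re.sub(pattern, repl, md_text, flags=re.MULTILINE)
    return md_text


sample = "- A1\n- - A1a\n\t- A1b\n- A2 - - not a list marker\n"
print(fix_first_child_indent(sample))
```

The mid-line `- - ` in the last item is left alone, while the first child of the nested list is indented as before. Using `\d+` instead of `\d` also covers lists with ten or more items.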

Re-number ordered lists

To fix the numbering of ordered list items, we’ll write a function:

def renumber_list(md_text: str) -> str:
    """ Replace ordered list markers with the correct number. """

    # Track how many previous items have appeared, at each level of indentation
    prev_items_at_level = [0] * 10

    lines = md_text.split('\n')
    for idx, line in enumerate(lines):

        # If line is a list item (either ordered or unordered) …
        if match := re.match(r'(\s*)([-\d])', line):

            # Infer level based on number of leading spaces (four per level)
            level = len(match.group(1)) // 4

            # If line is an ordered list item …
            if re.match(r'\s*\d', line):

                # Update counter, then replace the whole (possibly multi-digit) number
                prev_items_at_level[level] += 1
                marker = prev_items_at_level[level]
                lines[idx] = re.sub(r'^(\s*)\d+', rf'\g<1>{marker}', line)

            # Reset counters for deeper levels
            prev_items_at_level[level+1:] = [0] * (len(prev_items_at_level) - level - 1)

        # If not a list item, reset all counters
        else:
            prev_items_at_level = [0] * 10

    return '\n'.join(lines)

print(renumber_list(md_text))
#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
2. A2

ol > ol

1. A1
    1. A1a
    2. A1b
2. A2

All together: html2md

import unicodedata

def html2md(html: str) -> str:
    """ Convert HTML to Markdown, then fix list indentation and numbering. """
    md_text = md(html, heading_style='ATX', bullets='-')
    
    find_replace_pairs = [
        ('- - ',          '\t- '),    # fix indent on first child of ul > ul
        (r'- (\d\.)',     r'\t\1'),   # fix indent on first child of ul > ol
        (r'\d\. - ',      r'\t- '),   # fix indent on first child of ol > ul
        (r'\d\. (\d)\.',  r'\t\1.'),  # fix indent on first child of ol > ol
        ('\t',            '    '),    # replace tabs with four spaces
        ('\n\n\n',        '\n\n'),    # collapse extra blank lines
    ]
    
    # Replace non-breaking spaces, and drop the stray 'Â' that appears
    # when a UTF-8 encoded &nbsp; has been decoded as Latin-1
    md_text = md_text.replace('\xa0', ' ').replace('Â', '')

    for f, r in find_replace_pairs:
        md_text = re.sub(f, r, md_text, flags=re.MULTILINE)
        
    return renumber_list(md_text).strip()

print(html2md(example_html))
#### Example text

##### Text

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

##### Nested lists

ul > ul

- A1
    - A1a
    - A1b
- A2

ul > ol

- B1
    1. B1a
    2. B1b
- B2

ol > ul

1. A1
    - A1a
    - A1b
2. A2

ol > ol

1. A1
    1. A1a
    2. A1b
2. A2
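A quick aside on the `replace('Â', '')` step inside html2md, since it looks arbitrary. The no-break space U+00A0 is two bytes in UTF-8 (`C2 A0`). If those bytes are decoded as Latin-1 somewhere along the way, the first byte shows up as `Â`. The snippet below reproduces that using only the standard library. `clean_nbsp` is a hypothetical helper that mirrors the replace chain, not a separate function from this post.

```python
# U+00A0 (no-break space) encoded as UTF-8, then wrongly decoded as Latin-1
nbsp = '\xa0'
utf8_bytes = nbsp.encode('utf-8')
misread = utf8_bytes.decode('latin-1')

print(utf8_bytes)      # b'\xc2\xa0'
print(repr(misread))   # 'Â\xa0'


def clean_nbsp(text: str) -> str:
    """Mirror of the replace chain used in html2md."""
    return text.replace('\xa0', ' ').replace('Â', '')


print(repr(clean_nbsp('a' + misread + 'b')))   # 'a b'
```

Note the trade-off: this also removes any legitimate `Â` in the text, which matters if you copy content in languages that use that letter.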

Clean up HTML of Google Docs

When copying from a Google Doc, the HTML structure is not quite as pristine as above, so we’ll need to perform some pre-processing steps on the HTML before converting it. To help with this, we’ll import Beautiful Soup, a Python package for parsing HTML. Then we’ll write a few transformation functions, loosely inspired by the google-docs-to-markdown JavaScript library.

Default output

The default output is a mess:

gdoc_html = """
<meta charset='utf-8'><meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-e98e0af3-7fff-7df9-91ea-0ffad7c3607d"><h1 dir="ltr" style="line-height:1.38;margin-top:20pt;margin-bottom:6pt;"><span style="font-size:20pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Example text</span></h1><h3 dir="ltr" style="line-height:1.38;margin-top:16pt;margin-bottom:4pt;"><span style="font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Text</span></h3><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">bold</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" 
style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">italicized</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:line-through;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">strikethrough</span><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Some </span><span style="font-size:11pt;font-family:'Courier New';color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">func = lambda x: print(x)</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> inline code.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A </span><a href="https://en.wikipedia.org/wiki/Main_Page" 
style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">link</span></a><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> with text.</span></p></li></ul><h3 dir="ltr" style="line-height:1.38;margin-top:16pt;margin-bottom:4pt;"><span style="font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Nested lists</span></h3><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ul &gt; ul</span></p><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1</span></p></li><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li 
dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1a</span></p></li><li dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1b</span></p></li></ul><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A2</span></p></li></ul><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span 
style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ul &gt; ol</span></p><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1</span></p></li><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1a</span></p></li><li dir="ltr" style="list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1b</span></p></li></ol><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B2</span></p></li></ul><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ol &gt; ul</span></p><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1</span></p></li><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" 
style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1a</span></p></li><li dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1b</span></p></li></ul><li dir="ltr" style="list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C2</span></p></li></ol><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ol &gt; ol</span></p><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1</span></p></li><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1a</span></p></li><li dir="ltr" style="list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1b</span></p></li></ol><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:12pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D2</span></p></li></ol><br /><br /></b>
"""

print(html2md(gdoc_html))
**# Example text

### Text

- This text is bold.
- This text is italicized.
- This text is strikethrough.
- Some func = lambda x: print(x) inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

ul > ul

- A1
- A1a
- A1b

- A2

ul > ol

- B1
1. B1a
2. B1b

- B2

ol > ul

1. C1
- C1a
- C1b

1. C2

ol > ol

1. D1
2. D1a
3. D1b

1. D2**

Inline styles to semantic tags

The lack of text formatting is caused by the fact that Google Docs does not use proper semantic tags, but rather puts text inside a span and styles it with inline CSS.

bold_html = """
<p>
    <span style="font-weight:400;">This text is </span>
    <span style="font-weight:700;">bold</span>
    <span style="font-weight:400;">.</span>
</p>
"""

We can fix this by using regex to search for elements which contain the relevant styles, then wrapping those elements in the proper semantic tag.

While we’re at it, we can also parse inline code. Although Google Docs does not natively support inline code, we can fake it by treating any text using the font Courier New (the most common monospace font) as intended to be code.

from bs4 import BeautifulSoup

def parse_gdoc_inline_styles(soup: BeautifulSoup) -> None:
    """ GDocs uses inline styles on spans instead of semantic HTML tags. """
    
    for tag in soup():
        
        # Don't convert inline styles inside headings
        if tag.parent.name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
            continue
            
        style = tag.get('style')
        if style:
            if re.search(r'font-weight:\s?700', style):
                _ = tag.wrap(soup.new_tag('strong'))
            if re.search(r'font-style:\s?italic', style):
                _ = tag.wrap(soup.new_tag('em'))
            if re.search(r'text-decoration:\s?line-through', style):
                _ = tag.wrap(soup.new_tag('del'))
            if re.search(r"font-family:\s?'Courier New'", style):
                _ = tag.wrap(soup.new_tag('code'))

soup = BeautifulSoup(bold_html, 'html.parser')
parse_gdoc_inline_styles(soup)

print(html2md(strip(str(soup))))
This text is **bold**.
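To see exactly which style strings the regular expressions above do and do not catch, the checks can be pulled out into a small pure function that needs no HTML parser. This is only a sketch for experimenting: `STYLE_PATTERNS` and `semantic_tags_for` are names I made up, while the patterns themselves are copied from the function above.

```python
import re
from typing import List

# (pattern, semantic tag) pairs, copied from parse_gdoc_inline_styles
STYLE_PATTERNS = [
    (r'font-weight:\s?700',              'strong'),
    (r'font-style:\s?italic',            'em'),
    (r'text-decoration:\s?line-through', 'del'),
    (r"font-family:\s?'Courier New'",    'code'),
]


def semantic_tags_for(style: str) -> List[str]:
    """Return the semantic tags implied by an inline style string."""
    return [tag for pattern, tag in STYLE_PATTERNS if re.search(pattern, style)]


print(semantic_tags_for('font-weight:700;font-style:italic;'))   # ['strong', 'em']
print(semantic_tags_for('font-weight:400;font-style:normal;'))   # []
print(semantic_tags_for('font-weight:bold;'))                    # []
```

The last call shows a limitation: the patterns are tuned to what Google Docs emits (numeric weights, single-quoted font names), so `font-weight:bold` from another source would be missed.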

Wrap naked ul elements

The second issue is that Google Docs drops nested lists (ul, ol elements) directly inside the parent list, without wrapping them in an li tag, as our markdown converter expects.

nested_list_html = """
<ul>
    <li>ul > ul</li>
    <ul>
        <li>A1</li>
        <ul>
            <li>A1a</li>
            <li>A1b</li>
        </ul>
        <li>A2</li>
    </ul>
    <li>ul > ol</li>
    <ul>
        <li>B1</li>
        <ol>
            <li>B1a</li>
            <li>B1b</li>
        </ol>
        <li>B2</li>
    </ul>
    <li>ol > ul</li>
    <ol>
        <li>A1</li>
        <ul>
            <li>A1a</li>
            <li>A1b</li>
        </ul>
        <li>A2</li>
    </ol>
    <li>ol > ol</li>
    <ol>
        <li>A1</li>
        <ol>
            <li>A1a</li>
            <li>A1b</li>
        </ol>
        <li>A2</li>
    </ol>
</ul>
"""

# display(HTML(nested_list_html))

Again, we can use Beautiful Soup to find these instances and manually enclose the child list in an li element.

def wrap_naked_lists(soup: BeautifulSoup) -> None:
    """ GDocs does not wrap nested lists in an <li> tag. """
    for tag in soup():
        if tag.name in ['ul', 'ol'] and tag.parent.name in ['ul', 'ol']:
            tag.wrap(soup.new_tag('li'))

soup = BeautifulSoup(nested_list_html, 'html.parser')
wrap_naked_lists(soup)
html = strip(str(soup))

print(html2md(html))
- ul > ul
    - A1
        - A1a
        - A1b
    - A2
- ul > ol
    - B1
        1. B1a
        2. B1b
    - B2
- ol > ul
    1. A1
        - A1a
        - A1b
    2. A2
- ol > ol
    1. A1
        1. A1a
        2. A1b
    2. A2

Strip unnecessary tags

Finally, we remove a bunch of unnecessary tags, including:

  1. The outer b element that Google Docs wraps the whole document in, for some reason.
  2. Any unnecessary layers of wrapping, incl. p, span.
  3. Nested code tags, which are not a gdoc-specific issue, but which another common collaboration tool seems to produce.

def strip_unnecessary_tags(soup: BeautifulSoup) -> None:
    """ Remove various unnecessary tags, for easier debugging. """
    for i, tag in enumerate(soup()):
        
        # gdocs seems to wrap the entire HTML contents in a <b> tag for some reason
        if i <= 3 and tag.name == 'b':
            tag.unwrap()
        
        # gdocs also includes 1-2 meta tags at the top of the content
        if tag.name == 'meta':
            tag.extract()
        
        # strip attributes, mostly just to make debugging easier
        for attribute in ['class', 'value', 'rel', 'target', 'dir', 'aria-level', 'role', 'id', 'style']:
            del tag[attribute]

        # unwrap nested text formatting
        if tag.parent:
            if tag.name in ['strong', 'b'] and tag.parent.name in ['strong', 'b']:
                tag.unwrap()
            if tag.name in ['em', 'i'] and tag.parent.name in ['em', 'i']:
                tag.unwrap()  
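
The function above relies on two different Beautiful Soup removal methods: unwrap() removes a tag but leaves its children in place, while extract() removes the tag together with everything inside it. A minimal sketch of the difference, using a made-up snippet rather than real GDocs output:

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; real GDocs clipboard contents are far noisier
html = '<meta charset="utf-8"><b><p>Hello <span>world</span></p></b>'
soup = BeautifulSoup(html, 'html.parser')

soup.meta.extract()   # drops the tag entirely
soup.b.unwrap()       # drops the <b> wrapper, keeps the <p> inside it
soup.span.unwrap()    # drops the <span> wrapper, keeps its text

cleaned = str(soup)
print(cleaned)  # <p>Hello world</p>
```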

Combined: prep_gdoc_html

Putting everything together, our output looks much better!

def prep_gdoc_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    
    soup = BeautifulSoup(html, 'html.parser')
    
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    strip_unnecessary_tags(soup)
    
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)


final_html = prep_gdoc_html(gdoc_html)

print(html2md(final_html))
# Example text

### Text

- This text is **bold**.
- This text is *italicized*.
- This text is ~~strikethrough~~.
- Some `func = lambda x: print(x)` inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.

### Nested lists

**ul > ul**

- A1
    - A1a
    - A1b
- A2

**ul > ol**

- B1
    1. B1a
    2. B1b
- B2

**ol > ul**

1. C1
    - C1a
    - C1b
2. C2

**ol > ol**

1. D1
    1. D1a
    2. D1b
2. D2

Clean up HTML of Quip docs

Quip is another real-time collaboration tool whose content you may find yourself wanting to copy as markdown.

A regular copy/paste from a Quip document generally looks better than one from a Google Doc. This is because Quip attempts to write markdown into the plain text contents of the clipboard. It works well on lists, but fails to encode headings, italics, strikethrough, or inline code.

Quip plain text clipboard contents

Default quip output

A few things are broken:

  1. Document-level lines of text are not wrapped in block elements, which results in the lack of appropriate spacing between these lines and subsequent elements (headings, lists).
  2. The inline code gets wrapped in double backticks, but it should be a single pair.
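
To make the first problem concrete, here is a small sketch that lists the node types sitting at the root of the parsed document. The fragment is a shortened, made-up example in the same shape as Quip's clipboard HTML. Bare text nodes sit directly next to br tags and block elements, with no p around them:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative fragment; not copied from a real Quip document
fragment = '<h2>Text</h2>A single line<br><br>Another single line<br><ul><li>item</li></ul>'
soup = BeautifulSoup(fragment, 'html.parser')

# Tags report their name; anything else at the root is a bare text node
top_level = [node.name if isinstance(node, Tag) else 'text' for node in soup.contents]
print(top_level)
# ['h2', 'text', 'br', 'br', 'text', 'br', 'ul']
```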

Note: Quip does not support mixed (ol/ul) list nesting, so those examples are excluded from the test text.

quip_html = """
<meta charset='utf-8'><h1>HTML to Markdown</h1><h2>Text</h2>A single line<br><br>A line split<br>with a line break<br><br>Another single line<br><h2>Text formatting</h2><ul><li>This text is <b>bold</b></li><li>This text is <i>italicized</i></li><li>This text is <span style="text-decoration: line-through">strikethrough</span></li><li>Some <code><code>func = lambda x: print(x)</code></code> inline code</li><li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li></ul><h2>Nested list</h2><b>ul &gt; ul</b><br><ul><li>A1</li><ul><li>A1a</li><ul><li>A1a1</li><li>A1a2</li></ul><li>A1b</li></ul><li>A2</li></ul><br><b>ol &gt; ol</b><br><ol><li style="list-style-type:decimal">D1</li><ol><li style="list-style-type:lower-alpha">D1a</li><ol><li style="list-style-type:lower-roman">D1a1</li><li style="list-style-type:lower-roman">D1a2</li></ol><li style="list-style-type:lower-alpha">D1b</li></ol><li style="list-style-type:decimal">D2</li></ol>
""".strip()

#print(html2md(quip_html))
print(html2md(prep_gdoc_html(quip_html)))
# HTML to Markdown

## Text

A single line  
  
A line split  
with a line break  
  
Another single line  
## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**  
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

  
**ol > ol**  
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2

Wrap top-level text elements

The code below works reasonably well. It doesn’t properly handle hard linebreaks that should not be split into separate paragraphs, but I can’t think of an elegant solution to this off the top of my head.

def wrap_top_level_text(soup: BeautifulSoup, debug=False) -> None:
    """ Quip leaves document-level text unwrapped; wrap each br-terminated line in a <p>. """
    
    for tag in soup.find_all('br'):

        # Only proceed if br is at the root document level, and previous tag exists
        if tag.parent and tag.parent.name == '[document]' and tag.previous_sibling:

            prev_tag = tag.previous_sibling
            next_tag = tag.next_sibling
            
            # Wrap bare text (name is None) or a bold pseudo-heading in a paragraph
            if prev_tag.name is None or prev_tag.name in ['b']:
                prev_tag.wrap(soup.new_tag('p'))

                # If next tag is also a br, remove that too
                if next_tag and next_tag.name == 'br':
                    next_tag.extract()
                    
                tag.extract()
                
            # Remove unnecessary trailing br after block-level elements
            elif prev_tag.name in ['ul', 'ol']:
                tag.extract()
                
    if debug:
        from pprint import pprint
        pprint(str(soup))

        
preproc_html = prep_gdoc_html(quip_html)
soup = BeautifulSoup(preproc_html, 'html.parser')

wrap_naked_lists(soup)
wrap_top_level_text(soup)
html = strip(str(soup))

print(html2md(html))
# HTML to Markdown

## Text

A single line

A line split

with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol > ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
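
The hard-linebreak limitation is visible in the output above, where "A line split" and "with a line break" end up as separate paragraphs. One possible way around it, offered only as a sketch and not as part of the original workflow, is to collect runs of root-level inline nodes and start a new paragraph only on a double br, so that a single br survives inside its paragraph. The names wrap_top_level_runs and BLOCK_TAGS are hypothetical, the sample string is made up, and whitespace-only text nodes at the root are not handled:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical set of tags that end a run of inline content.
# meta is included so it is never pulled into a paragraph.
BLOCK_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'p',
              'pre', 'blockquote', 'table', 'meta'}


def is_br(node) -> bool:
    return isinstance(node, Tag) and node.name == 'br'


def wrap_top_level_runs(soup: BeautifulSoup) -> None:
    """ Wrap runs of root-level inline nodes in <p>, splitting only on a double <br>. """
    run = []

    def flush() -> None:
        # Drop <br> tags at either edge of the run; they only marked the break
        while run and is_br(run[0]):
            run.pop(0).extract()
        while run and is_br(run[-1]):
            run.pop().extract()
        if run:
            paragraph = soup.new_tag('p')
            run[0].insert_before(paragraph)
            for node in run:
                paragraph.append(node)  # append() moves the node out of the root
        run.clear()

    # Iterate over a snapshot, since flush() rearranges the root's children
    for node in list(soup.contents):
        if isinstance(node, Tag) and node.name in BLOCK_TAGS:
            flush()
        elif is_br(node) and run and is_br(run[-1]):
            # Second consecutive <br>: treat it as a paragraph boundary
            run.append(node)
            flush()
        else:
            run.append(node)
    flush()


html = ('A single line<br><br>A line split<br>with a line break<br><br>'
        'Another single line<br><h2>Next</h2>')
soup = BeautifulSoup(html, 'html.parser')
wrap_top_level_runs(soup)
result = str(soup)
print(result)
# <p>A single line</p><p>A line split<br/>with a line break</p><p>Another single line</p><h2>Next</h2>
```

Whether the surviving br then renders as a markdown hard line break depends on the HTML-to-markdown converter, so this would need checking against real Quip output before replacing wrap_top_level_text.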

Unwrap nested code blocks

nested_code_html = """
<p>Some <code><code>func = lambda x: print(x)</code></code> inline code.</p>
"""

def unwrap_nested_code_blocks(soup: BeautifulSoup) -> None:
    """ Quip wraps inline code twice, which confuses markdownify. """
    
    for tag in soup.find_all('code'):
        if tag.parent.name == 'code':
            tag.unwrap()
            
soup = BeautifulSoup(nested_code_html, 'html.parser')
unwrap_nested_code_blocks(soup)

print(f'Before: {html2md(nested_code_html)}')
print(f'After: {html2md(str(soup))}')
Before: Some ``func = lambda x: print(x)`` inline code.
After: Some `func = lambda x: print(x)` inline code.

Combined: prep_quip_html

def prep_quip_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    
    soup = BeautifulSoup(html, 'html.parser')
    
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)
    
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)


print(html2md(prep_quip_html(quip_html)))
# HTML to Markdown

## Text

A single line

A line split

with a line break

Another single line

## Text formatting

- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)

## Nested list

**ul > ul**

- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2

**ol > ol**

1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2

Combined: prep_any_html

Now let’s combine everything into a single generic pre-processing function, and double-check that the outputs match those of the app-specific functions that we checked above.

def prep_any_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    
    soup = BeautifulSoup(html, 'html.parser')
    
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)
    
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))

    return str(soup)

assert prep_any_html(quip_html) == prep_quip_html(quip_html)
assert prep_any_html(gdoc_html) == prep_gdoc_html(gdoc_html)

Creating an Alfred workflow

The actual Alfred workflow is relatively straightforward.

Invoke

Invoke the workflow—either using Alfred’s universal actions after highlighting the text, or using the h2m keyword trigger.

Get clipboard contents as HTML

The workflow then reads the HTML contents of the clipboard by running this shell command:

osascript -e 'the clipboard as «class HTML»' | perl -ne 'print chr foreach unpack("C*",pack("H*",substr($_,11,-3)))' | cat

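osascript prints the HTML data as a string of hex digits wrapped in «data HTML…», so the perl command strips that wrapper and converts the hex digits back into the original characters.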
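
For readers who do not read perl, the decoding step can be sketched in Python. This assumes osascript's output has the form «data HTML…» followed by a newline, which is what the substr offsets in the perl one-liner imply, and that the decoded bytes are UTF-8. The function name is hypothetical, and the sample is built by hand rather than taken from a real clipboard:

```python
def decode_osascript_html(raw: str) -> str:
    """ Turn osascript's '«data HTML3C62...»' output into an HTML string. """
    prefix, suffix = '«data HTML', '»'
    payload = raw.strip()
    if not (payload.startswith(prefix) and payload.endswith(suffix)):
        raise ValueError('unexpected osascript output; clipboard may not hold HTML')
    hex_digits = payload[len(prefix):-len(suffix)]
    # Assumption: the clipboard bytes are UTF-8 encoded
    return bytes.fromhex(hex_digits).decode('utf-8')


# Build a sample in the same shape instead of calling osascript
sample = '«data HTML' + '<b>hi</b>'.encode('utf-8').hex().upper() + '»\n'
decoded = decode_osascript_html(sample)
print(decoded)  # <b>hi</b>
```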

Convert and copy

Finally, the workflow runs the decoded HTML through html2markdown.py, a script which contains all of the logic we built out earlier, and copies the resulting markdown back to the clipboard.

Screenshot of Alfred workflow

