Goal
All of my personal notes are written in markdown. I use Obsidian to manage them, but the specific tool is not relevant for the purposes of this post.
When referencing things in Google Docs, I find myself generally linking to the doc, rather than copy/pasting, because the default copy/paste output is poorly formatted, and is tedious to correct. This is sub-optimal, because then I cannot surface the contents via search in Obsidian, unless the text of the URL matches my keywords.
So the goal here is to make it trivially easy to copy/paste from GDocs into markdown. Hopefully that gets me to do it more often, which in turn means more useful results when searching my vault.
Easier to copy/paste → save more content → better search results → less time searching for things
Alternatives:
- If you’re happy with copy/pasting into another window to convert, you may consider running a local instance of google-docs-to-markdown, or—if you are not converting any sensitive data—even just using the demo web applet.
- If you have administrative access to your Google Workspace, you may consider installing the Docs to Markdown add-on.
Getting HTML contents from clipboard
You may have wondered at some point: why does copied text appear differently depending on which app I paste it in? For example, text copied from a Google Doc will appear identical when pasted into another Google Doc, but will render as plain text when pasted into a barebones text editor.
It’s worth understanding how the clipboard works. Most relevant for us:
- When the copy command is invoked, the active application can offer a variety of potential formats, incl. HTML, RTF, and plain text.
- When the paste command is invoked, the active application chooses which format to receive.
If you’re using a Mac, you can use the free Clipboard Viewer application to inspect the different formats that are offered by the application from which you are copying.
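To make the copy/paste negotiation described above concrete, here is a minimal sketch in plain Python. It does not touch the real system clipboard: the `clipboard` dictionary, its format names, and the `paste` function are all hypothetical stand-ins for what the operating system does.

```python
from typing import Dict, List, Optional

# Hypothetical model of one clipboard entry: format name -> payload.
# The copying application offers all of these at once.
clipboard: Dict[str, str] = {
    "text/html": "<b>bold</b> text",
    "text/rtf": r"{\rtf1 {\b bold} text}",
    "text/plain": "bold text",
}

def paste(offered: Dict[str, str], preferred: List[str]) -> Optional[str]:
    """Return the payload of the first preferred format that was offered."""
    for fmt in preferred:
        if fmt in offered:
            return offered[fmt]
    return None

# A rich-text editor asks for HTML first; a barebones editor only takes plain text.
rich_paste = paste(clipboard, ["text/html", "text/rtf", "text/plain"])
plain_paste = paste(clipboard, ["text/plain"])
print(rich_paste)
print(plain_paste)
```

The same copied content yields different results because the pasting application, not the copying one, decides which format to read. For our purposes, we want the HTML variant.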
Convert HTML to Markdown
Now that we have the HTML contents of our clipboard, we need to convert it to markdown.
Example HTML
Let’s start with the optimistic scenario in which we have perfectly structured HTML, notably:
- Text formatting is represented in semantic elements such as strong (bold), em (italics), or del (strikethrough).
- Nested lists are wrapped inside an li tag, not placed directly below the parent list.
from IPython.display import HTML, display
example_html = """
<h4>Example text</h4>
<h5>Text</h5>
<ul>
<li>This text is <strong>bold</strong></li>
<li>This text is <em>italicized</em></li>
<li>This text is <del>strikethrough</del></li>
<li>Some <code>func = lambda x: print(x)</code> inline code</li>
<li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li>
</ul>
<h5>Nested lists</h5>
<p>ul > ul</p>
<ul>
<li>A1</li>
<li><ul>
<li>A1a</li>
<li>A1b</li>
</ul></li>
<li>A2</li>
</ul>
<p>ul > ol</p>
<ul>
<li>B1</li>
<li><ol>
<li>B1a</li>
<li>B1b</li>
</ol></li>
<li>B2</li>
</ul>
<p>ol > ul</p>
<ol>
<li>A1</li>
<li><ul>
<li>A1a</li>
<li>A1b</li>
</ul></li>
<li>A2</li>
</ol>
<p>ol > ol</p>
<ol>
<li>A1</li>
<li><ol>
<li>A1a</li>
<li>A1b</li>
</ol></li>
<li>A2</li>
</ol>
"""
display(HTML(example_html))
Example text
Text
- This text is bold
- This text is italicized
- This text is strikethrough
- Some func = lambda x: print(x) inline code
- A link: Wikipedia
Nested lists
ul > ul
- A1
- A1a
- A1b
- A2
ul > ol
- B1
- B1a
- B1b
- B2
ol > ul
- A1
- A1a
- A1b
- A2
ol > ol
- A1
- A1a
- A1b
- A2
Markdownify
In this idealized scenario, the markdownify library converts our HTML reasonably well out-of-the-box.
from markdownify import markdownify as md
print(md(example_html, heading_style='ATX', bullets='-'))
#### Example text
##### Text
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
##### Nested lists
ul > ul
- A1
- - A1a
	- A1b
- A2
ul > ol
- B1
- 1. B1a
	2. B1b
- B2
ol > ul
1. A1
2. - A1a
	- A1b
3. A2
ol > ol
1. A1
2. 1. A1a
	2. A1b
3. A2
But there are a few issues:
- The first child of nested lists gets a double list marker instead of proper indentation
- Ordered list numbers do not properly reset at each level
- There are unnecessary double empty lines
- Output uses tab characters (\t) instead of four spaces.
Fix list indentation
We can fix everything except the wrong numbering with a few simple regular expressions.
import re
def strip(x: str) -> str:
    """ Remove all line breaks from a string. """
    return ''.join(x.split('\n'))
md_text = md(example_html, heading_style='ATX', bullets='-')
find_replace_pairs = [
    ('- - ', '\t- '), # fix indent on first child of ul > ul
    (r'- (\d\.)', r'\t\1'), # fix indent on first child of ul > ol
    (r'\d\. - ', r'\t- '), # fix indent on first child of ol > ul
    (r'\d\. (\d)\.', r'\t\1.'), # fix indent on first child of ol > ol
    ('\t', '    '), # replace tabs with four spaces
    ('\n\n', '\n'), # remove extra line breaks
]
for f, r in find_replace_pairs:
    md_text = re.sub(f, r, md_text, flags=re.MULTILINE)
print(md_text)
#### Example text
##### Text
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
##### Nested lists
ul > ul
- A1
    - A1a
    - A1b
- A2
ul > ol
- B1
    1. B1a
    2. B1b
- B2
ol > ul
1. A1
    - A1a
    - A1b
3. A2
ol > ol
1. A1
    1. A1a
    2. A1b
3. A2
Re-number ordered lists
To fix the numbering of ordered list items, we’ll write a function:
def renumber_list(md_text: str) -> str:
    """ Replace ordered list markers with the correct number. """
    # Track how many previous items have appeared, at each level of indentation
    prev_items_at_level = [0] * 10
    lines = md_text.split('\n')
    for idx, line in enumerate(lines):
        # If line is a list item (either ordered or unordered) …
        if match := re.match(r'(\s*)([-\d])', line):
            # Infer level based on number of leading spaces (four per level)
            level = len(match.group(1)) // 4
            # If line is an ordered list item …
            if re.match(r'\s*\d+\.', line):
                # Replace the whole marker (also multi-digit ones), update counter
                marker = prev_items_at_level[level] + 1
                lines[idx] = re.sub(r'^(\s*)\d+', rf'\g<1>{marker}', line)
                prev_items_at_level[level] += 1
            # Reset counters for deeper levels
            prev_items_at_level[level+1:] = [0] * (len(prev_items_at_level) - level - 1)
        # If not a list item, reset all counters
        else:
            prev_items_at_level = [0] * 10
    return '\n'.join(lines)
print(renumber_list(md_text))
#### Example text
##### Text
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
##### Nested lists
ul > ul
- A1
    - A1a
    - A1b
- A2
ul > ol
- B1
    1. B1a
    2. B1b
- B2
ol > ul
1. A1
    - A1a
    - A1b
2. A2
ol > ol
1. A1
    1. A1a
    2. A1b
2. A2
All together: html2md
import unicodedata
def html2md(html: str) -> str:
    """ Convert HTML to markdown, then fix list indentation and numbering. """
    md_text = md(html, heading_style='ATX', bullets='-')
    find_replace_pairs = [
        ('- - ', '\t- '), # fix indent on first child of ul > ul
        (r'- (\d\.)', r'\t\1'), # fix indent on first child of ul > ol
        (r'\d\. - ', r'\t- '), # fix indent on first child of ol > ul
        (r'\d\. (\d)\.', r'\t\1.'), # fix indent on first child of ol > ol
        ('\t', '    '), # replace tabs with four spaces
        ('\n\n\n', '\n\n'), # remove extra line breaks
    ]
    # some websites wrongly encode non-breaking spaces, leaving a stray 'Â' behind
    md_text = md_text.replace(u'\xa0', u' ').replace('Â', '')
    for f, r in find_replace_pairs:
        md_text = re.sub(f, r, md_text, flags=re.MULTILINE)
    return renumber_list(md_text).strip()
print(html2md(example_html))
#### Example text
##### Text
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
##### Nested lists
ul > ul
- A1
    - A1a
    - A1b
- A2
ul > ol
- B1
    1. B1a
    2. B1b
- B2
ol > ul
1. A1
    - A1a
    - A1b
2. A2
ol > ol
1. A1
    1. A1a
    2. A1b
2. A2
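The replace chain in html2md that handles '\xa0' and 'Â' can look arbitrary. A small sketch of where the stray 'Â' likely comes from: a non-breaking space encoded as UTF-8 and then decoded as Latin-1. The `clean` helper below is a hypothetical name that simply mirrors that replace chain.

```python
nbsp = "\u00a0"  # non-breaking space, the same character as '\xa0'

# UTF-8 encodes a non-breaking space as the two bytes 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Decoding those bytes as Latin-1 yields 'Â' followed by a non-breaking space.
garbled = utf8_bytes.decode("latin-1")

def clean(text: str) -> str:
    """Mirror the replace chain used in html2md (hypothetical helper)."""
    return text.replace("\xa0", " ").replace("Â", "")

print(repr(garbled))
print(repr(clean("A" + garbled + "B")))
```

One caveat of this approach: it also removes any legitimate 'Â' in the text, so it is best suited to content where that letter is not expected.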
Clean up HTML of Google Docs
When copying from a Google Doc, the HTML structure is not quite as pristine as above, so we’ll need to perform some pre-processing steps on the HTML before converting it. To help with this, we’ll import Beautiful Soup, a Python package for parsing HTML. Then we’ll write a few transformation functions, loosely inspired by the google-docs-to-markdown JavaScript library.
Default output
The default output is a mess:
- All of the text formatting (besides the link) is missing.
- The nested lists are not properly indented, and have some empty lines.
- The entire chunk of text is formatted as bold (wrapped in **).
gdoc_html = """
<meta charset='utf-8'><meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-e98e0af3-7fff-7df9-91ea-0ffad7c3607d"><h1 dir="ltr" style="line-height:1.38;margin-top:20pt;margin-bottom:6pt;"><span style="font-size:20pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Example text</span></h1><h3 dir="ltr" style="line-height:1.38;margin-top:16pt;margin-bottom:4pt;"><span style="font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Text</span></h3><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">bold</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" 
style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">italicized</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This text is </span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:line-through;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">strikethrough</span><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Some </span><span style="font-size:11pt;font-family:'Courier New';color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">func = lambda x: print(x)</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> inline code.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A </span><a href="https://en.wikipedia.org/wiki/Main_Page" 
style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">link</span></a><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> with text.</span></p></li></ul><h3 dir="ltr" style="line-height:1.38;margin-top:16pt;margin-bottom:4pt;"><span style="font-size:13.999999999999998pt;font-family:Arial;color:#434343;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Nested lists</span></h3><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ul > ul</span></p><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1</span></p></li><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li 
dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1a</span></p></li><li dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A1b</span></p></li></ul><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">A2</span></p></li></ul><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span 
style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ul > ol</span></p><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1</span></p></li><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1a</span></p></li><li dir="ltr" style="list-style-type:lower-alpha;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B1b</span></p></li></ol><li dir="ltr" style="list-style-type:disc;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">B2</span></p></li></ul><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:5pt;"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ol > ul</span></p><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:11pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1</span></p></li><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" 
style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1a</span></p></li><li dir="ltr" style="list-style-type:circle;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C1b</span></p></li></ul><li dir="ltr" style="list-style-type:decimal;font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:5pt;" role="presentation"><span style="font-size:10.5pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">C2</span></p></li></ol><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt;"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">ol > ol</span></p><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1</span></p></li><ol style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1a</span></p></li><li dir="ltr" style="list-style-type:lower-alpha;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="2"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D1b</span></p></li></ol><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:12pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">D2</span></p></li></ol><br /><br /></b>
"""
print(html2md((gdoc_html)))
**# Example text
### Text
- This text is bold.
- This text is italicized.
- This text is strikethrough.
- Some func = lambda x: print(x) inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.
### Nested lists
ul > ul
- A1
- A1a
- A1b
- A2
ul > ol
- B1
1. B1a
2. B1b
- B2
ol > ul
1. C1
- C1a
- C1b
1. C2
ol > ol
1. D1
2. D1a
3. D1b
1. D2**
Inline styles to semantic tags
The lack of text formatting is caused by the fact that Google Docs does not use proper semantic tags, but rather puts text inside a span and styles it with inline CSS.
bold_html = """
<p>
<span style="font-weight:400;">This text is </span>
<span style="font-weight:700;">bold</span>
<span style="font-weight:400;">.</span>
</p>
"""
We can fix this by using regex to search for elements which contain the relevant styles, then wrapping those elements in the proper semantic tag.
While we’re at it, we can also parse inline code. Although Google Docs does not natively support inline code, we can fake it by treating any text using the font Courier New (the most common monospace font) as intended to be code.
from bs4 import BeautifulSoup
def parse_gdoc_inline_styles(soup: BeautifulSoup) -> None:
    """ GDocs uses inline styles on spans instead of semantic HTML tags. """
    for tag in soup():
        # Don't convert inline styles inside headings
        if tag.parent.name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
            continue
        style = tag.get('style')
        if style:
            if re.search(r'font-weight:\s?700', style):
                _ = tag.wrap(soup.new_tag('strong'))
            if re.search(r'font-style:\s?italic', style):
                _ = tag.wrap(soup.new_tag('em'))
            if re.search(r'text-decoration:\s?line-through', style):
                _ = tag.wrap(soup.new_tag('del'))
            if re.search(r"font-family:\s?'Courier New'", style):
                _ = tag.wrap(soup.new_tag('code'))
soup = BeautifulSoup(bold_html, 'html.parser')
parse_gdoc_inline_styles(soup)
print(html2md(strip(str(soup))))
This text is **bold**.
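To check the style patterns in isolation, without Beautiful Soup, the same four regular expressions can be put in a lookup table. This is only a sketch: `STYLE_RULES` and `tags_for_style` are hypothetical names, and the patterns are copied from the function above.

```python
import re

# (pattern, semantic tag) pairs, in the same order as the checks above.
STYLE_RULES = [
    (r"font-weight:\s?700", "strong"),
    (r"font-style:\s?italic", "em"),
    (r"text-decoration:\s?line-through", "del"),
    (r"font-family:\s?'Courier New'", "code"),
]

def tags_for_style(style: str) -> list:
    """Return the semantic tags implied by an inline style string."""
    return [tag for pattern, tag in STYLE_RULES if re.search(pattern, style)]

print(tags_for_style("font-weight:700;font-style:italic;"))
print(tags_for_style("font-weight:400;font-style:normal;text-decoration:none;"))
```

Note that these patterns only match when the value directly follows the colon (optionally after one space), so a style such as text-decoration: underline line-through would not be detected.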
Wrap naked ul elements
The second issue is that Google Docs drops nested lists (ul, ol elements) directly inside the parent list, without wrapping them in an li tag, as our markdown converter expects.
nested_list_html = """
<ul>
<li>ul > ul</li>
<ul>
<li>A1</li>
<ul>
<li>A1a</li>
<li>A1b</li>
</ul>
<li>A2</li>
</ul>
<li>ul > ol</li>
<ul>
<li>B1</li>
<ol>
<li>B1a</li>
<li>B1b</li>
</ol>
<li>B2</li>
</ul>
<li>ol > ul</li>
<ol>
<li>A1</li>
<ul>
<li>A1a</li>
<li>A1b</li>
</ul>
<li>A2</li>
</ol>
<li>ol > ol</li>
<ol>
<li>A1</li>
<ol>
<li>A1a</li>
<li>A1b</li>
</ol>
<li>A2</li>
</ol>
</ul>
"""
# display(HTML(nested_list_html))
Again, we can use Beautiful Soup to find these instances and manually enclose the child list in an li element.
def wrap_naked_lists(soup: BeautifulSoup) -> None:
    """ GDocs does not wrap nested lists in an <li> tag. """
    for tag in soup():
        if tag.name in ['ul', 'ol'] and tag.parent.name in ['ul', 'ol']:
            tag.wrap(soup.new_tag('li'))
soup = BeautifulSoup(nested_list_html, 'html.parser')
wrap_naked_lists(soup)
html = strip(str(soup))
print(html2md(html))
- ul > ul
    - A1
        - A1a
        - A1b
    - A2
- ul > ol
    - B1
        1. B1a
        2. B1b
    - B2
- ol > ul
    1. A1
        - A1a
        - A1b
    2. A2
- ol > ol
    1. A1
        1. A1a
        2. A1b
    2. A2
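As a sanity check, it can help to count how many naked lists a chunk of HTML contains before and after wrapping. Here is a minimal sketch using only the standard library's html.parser. `NakedListCounter` and `count_naked_lists` are hypothetical names, and the sketch assumes every li is explicitly closed, as in the examples above.

```python
from html.parser import HTMLParser

class NakedListCounter(HTMLParser):
    """Count <ul>/<ol> elements whose closest open list-related tag is a list."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open ul/ol/li tags, innermost last
        self.naked = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol", "li"):
            # A list opened directly inside another list (no li in between) is naked
            if tag != "li" and self.stack and self.stack[-1] in ("ul", "ol"):
                self.naked += 1
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol", "li") and self.stack and self.stack[-1] == tag:
            self.stack.pop()

def count_naked_lists(html: str) -> int:
    parser = NakedListCounter()
    parser.feed(html)
    return parser.naked

print(count_naked_lists("<ul><li>A1</li><ul><li>A1a</li></ul><li>A2</li></ul>"))
print(count_naked_lists("<ul><li>A1</li><li><ul><li>A1a</li></ul></li></ul>"))
```

The first input mimics the Google Docs structure, the second the wrapped structure our converter expects. Only ul, ol, and li tags are tracked, so other wrappers between two lists are ignored.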
Strip unnecessary tags
Finally, we remove a bunch of unnecessary tags, incl.
- The outer b element that the Google Doc seems to be weirdly wrapped in.
- Any unnecessary layers of wrapping, incl. p, span.
- Nested code tags, which is not a gdoc-specific issue, but which another common collaboration tool seems to produce.
def strip_unnecessary_tags(soup: BeautifulSoup) -> None:
    """ Remove various unnecessary tags, for easier debugging. """
    for i, tag in enumerate(soup()):
        # gdocs seems to wrap the entire HTML contents in a <b> tag for some reason
        if i <= 3 and tag.name == 'b':
            tag.unwrap()
        # gdocs also includes 1-2 meta tags at the top of the content
        if tag.name == 'meta':
            tag.extract()
        # strip attributes, mostly just to make debugging easier
        for attribute in ['class', 'value', 'rel', 'target', 'dir', 'aria-level', 'role', 'id', 'style']:
            del tag[attribute]
        # unwrap nested text formatting
        if tag.parent:
            if tag.name in ['strong', 'b'] and tag.parent.name in ['strong', 'b']:
                tag.unwrap()
            if tag.name in ['em', 'i'] and tag.parent.name in ['em', 'i']:
                tag.unwrap()
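The function as shown here does not include a step for the nested code tags mentioned in the list above. One possible way to handle them, sketched with plain regular expressions on the HTML string rather than Beautiful Soup: `collapse_nested_code` is a hypothetical helper, and it assumes the duplicated tags carry no attributes and always open and close together.

```python
import re

def collapse_nested_code(html: str) -> str:
    """Collapse directly nested <code><code>...</code></code> into one <code>."""
    previous = None
    # Repeat until nothing changes, so deeper nesting is collapsed too
    while previous != html:
        previous = html
        html = re.sub(r"<code>\s*<code>", "<code>", html)
        html = re.sub(r"</code>\s*</code>", "</code>", html)
    return html

print(collapse_nested_code("Some <code><code>f(x)</code></code> inline code"))
```

Two separate, adjacent code elements are left untouched, since the patterns only match an opening tag followed by another opening tag (or a closing tag followed by another closing tag).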
Combined: prep_gdoc_html
Putting everything together, our output looks much better!
def prep_gdoc_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    strip_unnecessary_tags(soup)
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))
    return str(soup)
final_html = prep_gdoc_html(gdoc_html)
print(html2md(final_html))
# Example text
### Text
- This text is **bold**.
- This text is *italicized*.
- This text is ~~strikethrough~~.
- Some `func = lambda x: print(x)` inline code.
- A [link](https://en.wikipedia.org/wiki/Main_Page) with text.
### Nested lists
**ul > ul**
- A1
    - A1a
    - A1b
- A2
**ul > ol**
- B1
    1. B1a
    2. B1b
- B2
**ol > ul**
1. C1
    - C1a
    - C1b
2. C2
**ol > ol**
1. D1
    1. D1a
    2. D1b
2. D2
Clean up HTML of Quip docs
Quip is another real-time collaboration tool you may find yourself wanting to copy markdown from.
A regular copy/paste from a quip document generally looks better than one from a google doc. This is because quip attempts to copy markdown into the plain text contents of the clipboard. It works well on lists, but fails to encode headings, italics, strikethrough, or inline code.
Default quip output
A few things are broken:
- Document-level lines of text are not wrapped in block elements, which results in the lack of appropriate spacing between these lines and subsequent elements (headings, lists).
- The inline code gets wrapped in double backticks, but it should be a single pair.
Note: quip does not support mixed (ol/ul) list nesting, so those examples are excluded from the test text.
quip_html = """
<meta charset='utf-8'><h1>HTML to Markdown</h1><h2>Text</h2>A single line<br><br>A line split<br>with a line break<br><br>Another single line<br><h2>Text formatting</h2><ul><li>This text is <b>bold</b></li><li>This text is <i>italicized</i></li><li>This text is <span style="text-decoration: line-through">strikethrough</span></li><li>Some <code><code>func = lambda x: print(x)</code></code> inline code</li><li>A link: <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a></li></ul><h2>Nested list</h2><b>ul > ul</b><br><ul><li>A1</li><ul><li>A1a</li><ul><li>A1a1</li><li>A1a2</li></ul><li>A1b</li></ul><li>A2</li></ul><br><b>ol > ol</b><br><ol><li style="list-style-type:decimal">D1</li><ol><li style="list-style-type:lower-alpha">D1a</li><ol><li style="list-style-type:lower-roman">D1a1</li><li style="list-style-type:lower-roman">D1a2</li></ol><li style="list-style-type:lower-alpha">D1b</li></ol><li style="list-style-type:decimal">D2</li></ol>
""".strip()
#print(html2md(quip_html))
print(html2md(prep_gdoc_html(quip_html)))
# HTML to Markdown
## Text
A single line
A line split
with a line break
Another single line
## Text formatting
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
## Nested list
**ul > ul**
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2
**ol > ol**
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
Wrap top-level text elements
The code below works reasonably well. It doesn’t properly handle hard linebreaks that should not be split into separate paragraphs, but I can’t think of an elegant solution to this off the top of my head.
def wrap_top_level_text(soup: BeautifulSoup, debug=False) -> None:
    """ Wrap root-level text (or bold text) that precedes a <br> in a <p> tag, and drop the <br>. """
    for tag in soup.find_all('br'):
        # Only proceed if br is at the root document level, and previous tag exists
        if tag.parent and tag.parent.name == '[document]' and tag.previous_sibling:
            # print(f'Prev: {tag.previous_sibling}')
            # print(f'Next: {tag.next_sibling}')
            if tag.previous_sibling.name is None or tag.previous_sibling.name in ['b']:
                tag.previous_sibling.wrap(soup.new_tag('p'))
                # If next tag is also a br, remove that too
                if tag.next_sibling and tag.next_sibling.name == 'br':
                    tag.next_sibling.extract()
                tag.extract()
            # Remove unnecessary trailing br after block-level elements
            elif tag.previous_sibling.name in ['ul', 'ol']:
                tag.extract()
    if debug:
        from pprint import pprint
        pprint(str(soup))
preproc_html = prep_gdoc_html(quip_html)
soup = BeautifulSoup(preproc_html, 'html.parser')
wrap_naked_lists(soup)
wrap_top_level_text(soup)
html = strip(str(soup))
print(html2md(html))
# HTML to Markdown
## Text
A single line
A line split
with a line break
Another single line
## Text formatting
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some ``func = lambda x: print(x)`` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
## Nested list
**ul > ul**
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2
**ol > ol**
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
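One possible way to handle the hard line breaks mentioned above is to treat two or more consecutive `<br>` tags as a paragraph boundary, and to leave a single `<br>` in place as a hard break inside its paragraph. The sketch below is only an illustration of that idea: it works on a plain string of root-level text rather than on the soup, and `split_paragraphs` and `wrap_paragraphs` are hypothetical helpers, not part of the code above.

```python
import re

# Two or more <br> tags in a row are assumed to mark a paragraph boundary.
PARAGRAPH_BREAK = re.compile(r'(?:<br\s*/?>\s*){2,}', re.IGNORECASE)
# Any run of <br> tags left dangling at the end of a chunk.
TRAILING_BREAK = re.compile(r'(?:<br\s*/?>\s*)+$', re.IGNORECASE)


def split_paragraphs(fragment: str) -> list[str]:
    """Split root-level text into paragraphs, keeping single <br> tags intact."""
    paragraphs = []
    for chunk in PARAGRAPH_BREAK.split(fragment):
        chunk = TRAILING_BREAK.sub('', chunk).strip()
        if chunk:
            paragraphs.append(chunk)
    return paragraphs


def wrap_paragraphs(fragment: str) -> str:
    """Wrap each paragraph in <p>; a single <br> stays inside its paragraph."""
    return ''.join(f'<p>{p}</p>' for p in split_paragraphs(fragment))


text = ('A single line<br><br>A line split<br>with a line break'
        '<br><br>Another single line<br>')
print(wrap_paragraphs(text))
```

With this approach the "A line split / with a line break" pair ends up in one paragraph instead of two. It would still need to be adapted to run on sibling nodes in the soup, since the root level of a real document also contains headings and lists between the text runs.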
Unwrap nested code blocks
nested_code_html = """
<p>Some <code><code>func = lambda x: print(x)</code></code> inline code.</p>
"""
def unwrap_nested_code_blocks(soup: BeautifulSoup) -> None:
    """ Quip wraps inline code twice, which confuses markdownify. """
    for tag in soup.find_all('code'):
        if tag.parent.name == 'code':
            tag.unwrap()
soup = BeautifulSoup(nested_code_html, 'html.parser')
unwrap_nested_code_blocks(soup)
print(f'Before: {html2md(nested_code_html)}')
print(f'After: {html2md(str(soup))}')
Before: Some ``func = lambda x: print(x)`` inline code.
After: Some `func = lambda x: print(x)` inline code.
Combined: prep_quip_html
def prep_quip_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))
    return str(soup)
print(html2md(prep_quip_html(quip_html)))
# HTML to Markdown
## Text
A single line
A line split
with a line break
Another single line
## Text formatting
- This text is **bold**
- This text is *italicized*
- This text is ~~strikethrough~~
- Some `func = lambda x: print(x)` inline code
- A link: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
## Nested list
**ul > ul**
- A1
    - A1a
        - A1a1
        - A1a2
    - A1b
- A2
**ol > ol**
1. D1
    1. D1a
        1. D1a1
        2. D1a2
    2. D1b
2. D2
Combined: prep_any_html
Now let’s combine everything into a single generic pre-processing function, and double-check that the outputs match those of the app-specific functions that we checked above.
def prep_any_html(html: str, debug=False) -> str:
    """ One function to combine all preprocessing steps. """
    soup = BeautifulSoup(html, 'html.parser')
    parse_gdoc_inline_styles(soup)
    wrap_naked_lists(soup)
    wrap_top_level_text(soup)
    unwrap_nested_code_blocks(soup)
    strip_unnecessary_tags(soup)
    if debug:
        from bs4.formatter import HTMLFormatter
        formatter = HTMLFormatter(indent=4)
        print(soup.prettify(formatter=formatter))
    return str(soup)
assert prep_any_html(quip_html) == prep_quip_html(quip_html)
assert prep_any_html(gdoc_html) == prep_gdoc_html(gdoc_html)
Creating an Alfred workflow
The actual Alfred workflow is relatively straightforward.
Invoke
Invoke the workflow, either using Alfred’s universal actions after highlighting the text, or using the `h2m` keyword trigger.
Get clipboard contents as HTML
It gets the actual clipboard contents by running this shell command:
osascript -e 'the clipboard as «class HTML»' | perl -ne 'print chr foreach unpack("C*",pack("H*",substr($_,11,-3)))' | cat
The contents come back hex-encoded, so the `perl` command decodes the hex back into text.
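For reference, the same decoding step can be sketched in Python. This assumes that `osascript` prints the clipboard as `«data HTML<hex digits>»`, which is what the `substr` offsets in the perl one-liner imply, and that the decoded bytes are UTF-8. The function names `decode_osascript_html` and `read_clipboard_html` are made up for this example.

```python
import subprocess

# Assumed wrapper that osascript prints around the hex payload.
PREFIX = '«data HTML'
SUFFIX = '»'


def decode_osascript_html(raw: str) -> str:
    """Turn the hex dump printed by osascript back into an HTML string."""
    payload = raw.strip()
    if not (payload.startswith(PREFIX) and payload.endswith(SUFFIX)):
        raise ValueError('clipboard output does not look like HTML data')
    hex_digits = payload[len(PREFIX):-len(SUFFIX)]
    # Assumption: the clipboard HTML is UTF-8 encoded.
    return bytes.fromhex(hex_digits).decode('utf-8', errors='replace')


def read_clipboard_html() -> str:
    """macOS only: ask osascript for the HTML flavor of the clipboard."""
    result = subprocess.run(
        ['osascript', '-e', 'the clipboard as «class HTML»'],
        capture_output=True, text=True, check=True,
    )
    return decode_osascript_html(result.stdout)


# Round-trip check with a hand-built sample, so this runs on any platform.
sample = PREFIX + '<b>hi</b>'.encode('utf-8').hex().upper() + SUFFIX + '\n'
print(decode_osascript_html(sample))
```

Unlike the byte-by-byte perl version, this decodes multi-byte UTF-8 characters as whole characters, which may matter for documents with non-ASCII text.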
Convert and copy
Then it runs the clipboard contents through `html2markdown.py`, a script which contains all of the logic we built out earlier, and copies the output markdown back to the clipboard.
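As a rough idea of what the entry point of such a script could look like, here is a minimal sketch. It is not the actual `html2markdown.py`: `convert_and_copy`, `copy_to_clipboard`, and `main` are hypothetical names, the converter is passed in as a parameter, and copying relies on the macOS `pbcopy` command.

```python
import subprocess
import sys
from typing import Callable


def copy_to_clipboard(text: str) -> None:
    """macOS only: pipe the text into pbcopy."""
    subprocess.run(['pbcopy'], input=text.encode('utf-8'), check=True)


def convert_and_copy(
    html: str,
    convert: Callable[[str], str],
    copy: Callable[[str], None] = copy_to_clipboard,
) -> str:
    """Convert HTML to markdown, copy the result, and return it."""
    markdown = convert(html).strip() + '\n'
    copy(markdown)
    return markdown


def main(convert: Callable[[str], str]) -> None:
    """Read HTML from stdin (e.g. piped in from the shell command above)."""
    convert_and_copy(sys.stdin.read(), convert)


# Demo with stand-ins for the converter and the clipboard:
copied = []
convert_and_copy('<b>hi</b>', convert=lambda html: '**hi**', copy=copied.append)
print(copied)
```

In the real script, the converter passed to `main` would presumably be something like `lambda html: html2md(prep_any_html(html))`, using the functions built earlier in this post.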