Semicolony notes
A short demo of markdown with emphasis, inline code, and a link.
Things to remember
- The 80% case is heading + paragraph + list.
- Code blocks with triple backticks survive round-tripping.
- Tables are GFM, not strict CommonMark.
Production tip — keep the source in markdown, render to HTML at request time, and never let a CMS re-edit the rendered HTML.
```go
package main

import "fmt"

func main() {
	fmt.Println("hello, study hall")
}
```
| Feature | CommonMark | GFM |
|---|---|---|
| Tables | — | ✓ |
| Strike | — | ✓ |
| Task list | — | ✓ |
Three stages, two passes.
A markdown parser is, almost without exception, a three-stage pipeline: a tokenizer that walks the source character by character and emits typed tokens, a block parser that groups those tokens into block-level structures (paragraphs, lists, fenced code, blockquotes, ATX and Setext headings), and an inline parser that runs in a second pass over the text content of each block to resolve emphasis, links, code spans, and autolinks. The block pass is greedy and line-oriented; the inline pass is context-sensitive and uses an emphasis-resolution algorithm whose canonical form was specified by John MacFarlane in the CommonMark spec, which first appeared in 2014 and reached version 0.30 in June 2021, with further point releases since.
Most modern parsers — marked, markdown-it, cmark, pulldown-cmark, commonmark.js — follow this two-pass shape because the original Gruber 2004 reference implementation in Perl was a tangle of regular expressions that nobody could reason about, and its ambiguities are exactly what CommonMark was created to nail down.
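The two-pass shape can be sketched in a few lines. This is a toy under loud assumptions, not how any of the named parsers work: it knows only ATX headings, fenced code, and paragraphs, and its inline pass uses simple regular expressions instead of the CommonMark delimiter algorithm. The names `parse_blocks`, `render_inline`, and `render` are made up for illustration.

```python
import html
import re

FENCE = "`" * 3  # the triple-backtick fence marker


def parse_blocks(source):
    """Pass 1: line-oriented block pass. Returns a list of (kind, text)."""
    blocks = []
    paragraph = []
    lines = source.split("\n")

    def flush():
        # Close the paragraph being accumulated, if any.
        if paragraph:
            blocks.append(("paragraph", " ".join(paragraph)))
            paragraph.clear()

    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(FENCE):
            flush()
            body = []
            i += 1
            # Everything up to the closing fence is literal code.
            while i < len(lines) and not lines[i].startswith(FENCE):
                body.append(lines[i])
                i += 1
            blocks.append(("code", "\n".join(body)))
        elif re.match(r"#{1,6} ", line):
            flush()
            level = len(line) - len(line.lstrip("#"))
            blocks.append((f"h{level}", line[level:].strip()))
        elif not line.strip():
            flush()
        else:
            paragraph.append(line.strip())
        i += 1
    flush()
    return blocks


def render_inline(text):
    """Pass 2: inline pass over one block's text."""
    out = []
    # Split on code spans first so emphasis markers inside them stay literal.
    for idx, part in enumerate(re.split(r"`([^`]+)`", text)):
        escaped = html.escape(part, quote=False)
        if idx % 2 == 1:
            out.append(f"<code>{escaped}</code>")
        else:
            escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
            escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
            out.append(escaped)
    return "".join(out)


def render(source):
    parts = []
    for kind, text in parse_blocks(source):
        if kind == "code":
            # Code blocks skip the inline pass entirely.
            parts.append(f"<pre><code>{html.escape(text, quote=False)}</code></pre>")
        else:
            tag = "p" if kind == "paragraph" else kind
            parts.append(f"<{tag}>{render_inline(text)}</{tag}>")
    return "\n".join(parts)
```

The point of the split is visible in the code-block branch: the block pass decides what is code before the inline pass ever sees an asterisk.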
The reason edge cases differ across implementations is that the underspecified parts of Gruber's original document leave dozens of judgement calls to the parser author. Lazy continuation lines are the canonical example: in CommonMark, a paragraph inside a blockquote can continue on a line that has no > prefix at all, and a paragraph inside a list item can likewise continue on an unindented line, but laziness applies only to paragraph continuation text and never carries across a blank line. Tab handling is another swamp: where whitespace defines block structure, CommonMark treats a tab as advancing to the next column that is a multiple of four, not as a literal four spaces.
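The tab-stop rule is easy to get wrong, so here is the column arithmetic as a minimal sketch. The helper `expand_tabs` is a hypothetical name. A conforming parser does not rewrite the source like this; it only measures indentation as if the tabs had been expanded.

```python
TAB_STOP = 4


def expand_tabs(line, tab_stop=TAB_STOP):
    """Replace each tab with spaces up to the next multiple of tab_stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next tab stop, so its width depends
            # on the column it starts in (1 to tab_stop spaces).
            width = tab_stop - (column % tab_stop)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

Two spaces followed by a tab therefore indent to column four, not column six, which is the difference between a tab stop and "a literal four spaces". Python's built-in `str.expandtabs(4)` applies the same rule.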
HTML embedding is the gnarliest area in practice. CommonMark defines seven distinct types of raw-HTML block, distinguished by the opening tag and the conditions under which the block ends — type 1 ends on closing script/pre/style; type 6 (block-level tags like div) ends on a blank line; type 7 (generic tags) cannot interrupt a paragraph at all. The result is that "embed some raw HTML in your markdown" works reliably for the simple cases and produces wildly divergent output across marked, markdown-it, and pandoc for anything involving partial tags, comments spanning blank lines, or processing instructions.
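As a rough illustration of how the seven start conditions partition a line, the sketch below classifies the first line of a candidate HTML block. It is deliberately simplified: `BLOCK_TAGS` is only a small subset of the spec's type 6 tag list, the type 7 pattern approximates "a complete open or closing tag alone on the line" without validating attributes, and the function names are made up.

```python
import re

RAW_TEXT_TAGS = ("script", "pre", "style", "textarea")
TYPE1 = re.compile(r"<(script|pre|style|textarea)(\s|>|$)", re.IGNORECASE)
# Subset of the block-level tag names that trigger type 6 (assumption: abbreviated).
BLOCK_TAGS = ("address", "article", "aside", "blockquote", "details",
              "div", "h1", "p", "section", "table", "ul")
TYPE6 = re.compile(r"</?(%s)(\s|/?>|$)" % "|".join(BLOCK_TAGS), re.IGNORECASE)
# Approximation of a complete tag followed only by whitespace.
TYPE7 = re.compile(r"(</?)([A-Za-z][A-Za-z0-9-]*)(\s[^<>]*)?/?>\s*$")


def html_block_type(line):
    """Return the HTML block type (1 to 7) a line would start, or None."""
    indent = len(line) - len(line.lstrip(" "))
    if indent > 3:
        return None  # four or more spaces means indented code instead
    s = line[indent:]
    if TYPE1.match(s):
        return 1
    if s.startswith("<!--"):
        return 2
    if s.startswith("<?"):
        return 3
    if s.startswith("<![CDATA["):
        return 5
    if re.match(r"<![A-Za-z]", s):
        return 4
    if TYPE6.match(s):
        return 6
    m = TYPE7.match(s)
    if m and not (m.group(1) == "<" and m.group(2).lower() in RAW_TEXT_TAGS):
        return 7
    return None


def ends_on_blank_line(block_type):
    # Types 1 to 5 end on their own closing marker; 6 and 7 end on a blank line.
    return block_type in (6, 7)


def can_interrupt_paragraph(block_type):
    return block_type is not None and block_type != 7
```

Even this reduced version shows why output diverges: `<span>text` starts no HTML block at all, while `<div>text` starts one that runs to the next blank line.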
CommonMark vs GFM vs the long tail.
CommonMark is the strict baseline: paragraphs, ATX/Setext headings, blockquotes, ordered and unordered lists, indented and fenced code blocks, inline code, emphasis, links, images, autolinks (only the angle-bracket form), hard breaks, and raw HTML. Everything else is an extension. GitHub Flavoured Markdown, formalised in 2017 as a strict superset on top of CommonMark 0.28, adds tables, task list items, strikethrough with ~~, autolinking of bare URLs, and a "disallowed raw HTML" filter that neutralises a fixed set of dangerous tags such as script and iframe while leaving other raw HTML alone.
MultiMarkdown, Fletcher Penney's 2005 dialect that predates CommonMark, was among the first to add tables, footnotes, definition lists, citations, math via MathJax, and metadata blocks; many of its ideas were absorbed downstream. Kramdown, the Ruby parser used by Jekyll, supports inline attribute lists which let authors smuggle CSS classes into rendered HTML, plus its own footnote and definition-list syntax. Pandoc is the maximalist: it accepts most of the above plus pipe-and-grid tables, fenced divs, bracketed spans, raw blocks for any output format, and a citation system backed by CSL.
| Feature | CommonMark | GFM | MultiMarkdown | kramdown | Pandoc |
|---|---|---|---|---|---|
| Tables (pipe) | no | yes | yes | yes | yes |
| Task lists | no | yes | no | yes | yes |
| Footnotes | no | no | yes | yes | yes |
| Strikethrough | no | yes | no | yes | yes |
| Autolink bare URLs | no | yes | partial | yes | yes |
| Definition lists | no | no | yes | yes | yes |
| Inline math | no | no | yes | yes | yes |
| Inline attributes | no | no | partial | yes | yes |
Code-block language hints are a quiet point of divergence. CommonMark treats the info string after the opening fence as essentially opaque: the first word is typically the language name, but the spec mandates no particular treatment, so it is renderers, not parsers, that decide whether a fenced block triggers Prism, Shiki, or Highlight.js. The examples in the CommonMark and GFM specs render that first word as a class of the form language-xxx on the inner code element (language-typescript for a typescript fence), which is the convention almost every static site generator now follows.
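The language-class convention amounts to "take the first word of the info string and prefix it". A minimal sketch, with hypothetical helper names `code_class` and `render_fenced`, and ignoring details such as the trailing newline a real renderer emits inside the code element:

```python
import html


def code_class(info_string):
    """Return the class for a fence's info string, or None if it is empty."""
    words = info_string.strip().split()
    if not words:
        return None
    # Only the first word names the language; the rest is renderer-specific.
    return "language-" + words[0]


def render_fenced(info_string, body):
    cls = code_class(info_string)
    attr = f' class="{html.escape(cls)}"' if cls else ""
    return f"<pre><code{attr}>{html.escape(body, quote=False)}</code></pre>"
```

Anything after the first word (line ranges, titles, highlight hints) is where renderers diverge, because nothing in the baseline spec says what it means.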
A document written for Pandoc with citations and fenced divs will render as garbled text on GitHub, and a GFM task list will render as literal [ ] characters in a strict CommonMark renderer. If you publish to multiple targets, you either constrain yourself to the CommonMark intersection or you preprocess.
HTML → Markdown is a lifting operation.
Markdown to HTML is a lowering operation: the source language is small, the target is large, and every markdown construct has a well-defined HTML expansion. HTML to markdown is a lifting operation across a lossy boundary, and that asymmetry is why turndown (Dom Christie's library, originally released 2014, currently 7.x) and similar tools like pandoc --from html --to markdown produce output that round-trips imperfectly. Markdown has no syntax for class, id, style, data-attributes, ARIA attributes, aside, figure, figcaption, details, summary, sup, sub, kbd, mark, abbr, or any of the form elements.
When turndown encounters them, it has three choices: drop the wrapper and keep the children, emit raw HTML pass-through, or use a configured rule to map the element to a markdown approximation. The default behaviour is a compromise. Inline styles and class attributes are dropped by default, because there is nowhere to put them. Nested structures that have no markdown equivalent (a table inside a blockquote, a list inside a paragraph, a div with display logic) get flattened or emitted as raw HTML, which then re-parses on the next markdown pass and may or may not survive intact. Tables with colspan or rowspan cannot be expressed in GFM pipe tables at all, so the only faithful fallback is inline HTML.
The deeper tradeoff is between round-trippability and human-readability. A round-trippable converter would emit raw HTML aggressively to preserve every attribute and structural nuance, producing output that re-renders to something pixel-identical but reads like HTML in a markdown wrapper. A human-readable converter aggressively normalises — converting strong and b both to bold, em and i both to emphasis, collapsing whitespace, dropping decorative wrappers — and produces clean source that a human can edit but whose re-rendered HTML differs subtly from the original. turndown defaults toward the readable end and exposes keep and addRule hooks for callers who need to preserve specific elements.
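The three choices (drop the wrapper, pass through raw HTML, map to markdown) can be shown with a small converter built on Python's standard `html.parser`. This is a sketch of the general idea, not turndown's implementation; the `WRAPPERS` and `KEEP` tables and the `html_to_markdown` function are invented for illustration, and only inline elements are handled.

```python
from html.parser import HTMLParser

# Tags with a direct markdown spelling; b/strong and i/em are normalised together.
WRAPPERS = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
# Tags passed through as raw HTML (a stand-in for a "keep" rule).
KEEP = {"kbd", "sub", "sup"}


class Lifter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.hrefs = []  # stack of link targets for open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in WRAPPERS:
            self.out.append(WRAPPERS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")
        elif tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes are dropped even here
        # Any other tag: drop the wrapper and keep the children.

    def handle_endtag(self, tag):
        if tag in WRAPPERS:
            self.out.append(WRAPPERS[tag])
        elif tag == "a" and self.hrefs:
            self.out.append(f"]({self.hrefs.pop()})")
        elif tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(fragment):
    lifter = Lifter()
    lifter.feed(fragment)
    lifter.close()  # flush any buffered trailing text
    return "".join(lifter.out)
```

Feeding it `<span class="x" style="color:red"><b>bold</b></span>` yields `**bold**`: the span, its class, and its style are gone, and nothing in the output records whether the source used `b` or `strong`. That is the lossy boundary in miniature.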
User markdown is user JavaScript.
Markdown's design permits raw HTML pass-through, which means any user-submitted markdown is also potentially user-submitted HTML, which means it is potentially user-submitted JavaScript. A script tag typed into a markdown comment box renders as a working script tag unless something between the parser and the DOM strips it. Stored XSS via markdown is one of the most common vulnerability classes in user-generated-content sites, and the marked library alone has shipped advisories including CVE-2017-1000427 (XSS through data: URIs in 0.3.6 and earlier), CVE-2022-21680 and CVE-2022-21681 (ReDoS in block and inline grammars, fixed in 4.0.10 in January 2022), and a string of XSS findings in the 0.x series related to autolink and image-title parsing.
CommonMark-conforming parsers are not immune — there have been documented bypasses involving HTML comments containing -- sequences and processing instructions that confuse downstream sanitisers. The defence stack has converged on a clear pattern. Render the markdown to HTML, then run the HTML through a structural sanitiser that operates on a parsed DOM, not on string regexes. DOMPurify (Cure53, currently 3.x, originally 2014) parses the HTML into a document, walks the tree, and applies an allow-list of tags, attributes, and URL schemes; its config is declarative and its threat model includes mutation-XSS.
sanitize-html (Apostrophe, 2013) is a Node-side equivalent built on htmlparser2. Python's bleach plays the same role for Django and Flask stacks, though it has been in maintenance mode since 2023 and the ecosystem has been migrating to nh3 (Rust-backed ammonia bindings). The pattern is: render then sanitise, never the reverse.
A naive pre-render strip that removes script tags from the input misses links with javascript: URLs, image URLs with data: payloads, and any attribute injection through reference-link titles. Sanitising the rendered DOM is the only place where the full attack surface is visible.
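To make the allow-list idea concrete, here is a sketch that filters rendered HTML by tag, attribute, and URL scheme. It is for illustration only and is not a substitute for DOMPurify, sanitize-html, or nh3: it uses the event stream of Python's `html.parser` as a stand-in for a spec-compliant parsed DOM, and the allow-lists are assumptions chosen for the example.

```python
import html
import re
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "em", "strong", "code", "pre", "a", "img", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}, "img": {"src", "alt", "title"}}
URL_ATTRS = {"href", "src"}
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}  # "" means a relative URL
VOID_TAGS = {"img"}
DROP_WITH_CONTENT = {"script", "style"}
SCHEME = re.compile(r"([a-zA-Z][a-zA-Z0-9+.-]*):")


def url_is_safe(url):
    # Strip whitespace and control characters first, since browsers
    # tolerate them inside a scheme name.
    cleaned = "".join(ch for ch in url if ch > " ")
    match = SCHEME.match(cleaned)
    scheme = match.group(1).lower() if match else ""
    return scheme in ALLOWED_SCHEMES


class AllowListSanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, onerror, style, and so on
            if name in URL_ATTRS and not url_is_safe(value):
                continue  # drops javascript: and data: URLs
            kept.append(f' {name}="{html.escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def sanitise(rendered_html):
    sanitiser = AllowListSanitiser()
    sanitiser.feed(rendered_html)
    sanitiser.close()
    return "".join(sanitiser.out)
```

Because the filter runs on rendered output, it sees the `href` that a reference-style link eventually produced, which a strip applied to the markdown source never would.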
Speed costs at live-preview scale.
Parser cost matters once documents get past about 50 KB of source or once you are rendering a live preview on every keystroke. marked is fast, typically 5–15× faster than unified/remark on equivalent input, because its lexer produces a lightweight token list that the renderer turns straight into HTML strings, with no full AST allocation, no visitor pattern, and no multi-stage plugin pipeline. The cost is that transformation between parse and render is limited to token-level hooks: there is no document tree to query or rewrite, so what marked gives you is essentially HTML, take it or leave it. markdown-it sits in the middle, exposing a token stream that plugins can mutate but stopping short of a full AST.
unified and remark build an mdast tree, optionally convert it to hast (HTML AST), optionally run rehype plugins, and finally serialise. Each stage allocates, and the visitor traversals add constant-factor overhead, but the model is the only one that lets you write a plugin that, say, rewrites every relative link, extracts a table of contents, or transforms math nodes into KaTeX output, in a composable way. For a static site generator that runs once per deploy, this is the right tradeoff; for a live editor preview at 60fps on a 200 KB document, it is the wrong one.
Incremental rendering — re-parsing only the block that changed since the last keystroke — is the standard live-preview optimisation. Because markdown's block grammar is line-oriented and most block boundaries are stable across edits, you can keep a cache of block-level tokens keyed by source line range and re-tokenize only the affected range. ProseMirror, CodeMirror 6, and Lexical all do versions of this. Memory bounds matter on the server too: a malicious 10 MB markdown document with deeply nested blockquotes can blow up an unbounded recursive parser, which is why production parsers cap nesting depth.
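Both ideas in this paragraph fit in a short sketch. The cache below is keyed on block text, not on source line range, which is a simpler variant of the technique described; it also assumes blocks are separated by blank lines, which breaks for fenced code containing blank lines. `IncrementalRenderer`, the `render_block` callback, and the `MAX_NESTING` value are all hypothetical.

```python
MAX_NESTING = 32  # assumed limit, for illustration only


class IncrementalRenderer:
    def __init__(self, render_block):
        self.render_block = render_block  # caller-supplied per-block renderer
        self.cache = {}
        self.render_calls = 0  # counts cache misses, for demonstration

    def render(self, source):
        blocks = [b for b in source.split("\n\n") if b.strip()]
        fresh = {}
        out = []
        for block in blocks:
            if block in fresh:
                rendered = fresh[block]
            elif block in self.cache:
                rendered = self.cache[block]
            else:
                rendered = self.render_block(block)
                self.render_calls += 1
            fresh[block] = rendered
            out.append(rendered)
        self.cache = fresh  # forget blocks that no longer exist
        return "\n".join(out)


def blockquote_depth(line):
    """Count leading > markers iteratively, so depth cannot exhaust the stack."""
    depth = 0
    rest = line.lstrip(" ")
    while rest.startswith(">"):
        depth += 1
        rest = rest[1:].lstrip(" ")
    return depth


def within_nesting_limit(source, limit=MAX_NESTING):
    return all(blockquote_depth(line) <= limit for line in source.split("\n"))
```

With a three-block document, editing the middle block costs one render call and the other two come from the cache. The depth check is meant to run before parsing, so an over-nested document can be rejected before it reaches any recursive code.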
A practical rule for picking a parser: if you control the input and just need fast HTML output, use marked. If you need to transform documents — rewrite relative links, generate tables of contents, extract front-matter metadata, swap math nodes for KaTeX, lint heading levels — use unified with remark and rehype. If you need GitHub-exact rendering of comments or PR descriptions, use cmark-gfm directly via its native bindings — that is the actual parser GitHub runs in production.
Why so much of the web is now markdown source.
GitHub READMEs (2008), Reddit comments (2010), Stack Overflow answers (2008), Slack messages, Discord, Notion blocks, Substack drafts, Hugo and Jekyll site sources: all use markdown variants because the source is human-editable, the output is renderable, and the parsing rules are forgiving. The trade-off is that "markdown" is a genus, not a species: CommonMark, GFM, MultiMarkdown, kramdown, and Pandoc-flavoured markdown all diverge on edge cases. If you're shipping a markdown-in / HTML-out pipeline, pin a parser version and write fixture tests against expected output. Don't trust the eyeball check.
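A fixture test for such a pipeline can be very small. In the sketch below, `stand_in_render` is a placeholder for whatever call the pinned parser exposes, and the fixture names and contents are invented; the useful part is the shape of `run_fixtures`, which reports every mismatch instead of stopping at the first.

```python
import html


def stand_in_render(source):
    """Hypothetical renderer; swap in the pinned parser's render call."""
    return "<p>" + html.escape(source, quote=False) + "</p>"


# Each fixture maps a name to (markdown source, expected HTML).
FIXTURES = {
    "plain paragraph": ("hello", "<p>hello</p>"),
    "angle bracket is escaped": ("a < b", "<p>a &lt; b</p>"),
}


def run_fixtures(render, fixtures):
    """Return (name, expected, actual) for every fixture that fails."""
    failures = []
    for name, (source, expected) in fixtures.items():
        actual = render(source)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures
```

Run against a renderer that forgets to escape, the second fixture fails and the report carries both strings, which is the kind of difference an eyeball check in a browser would not show.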
Markdown allows raw HTML pass-through by default. If you accept user-submitted markdown and render it on someone else's page, run the resulting HTML through DOMPurify or equivalent. <script> tags in markdown render as live <script> tags in HTML — that's the path to stored XSS.