
Regex test.

JavaScript / ECMAScript regex flavour: g i m s u y flags, capture groups, named groups, replace mode. Highlights every match in your text and lists capture groups for each. Compiled in your browser — no server round-trip.

Example run (flags: gi; matches: 2; status: ok)

Test text
Reach Ada at ada@analytical.engine
or Charles at charles@babbage.org.
Spam: foo@bar (no TLD).

Matches & capture groups
# | Match | Pos | Groups
1 | ada@analytical.engine | 13–34 | $1=ada · $2=analytical.engine
2 | charles@babbage.org | 49–68 | $1=charles · $2=babbage.org
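
A minimal sketch of how a match table like the one above can be produced with String.prototype.matchAll. The pattern here is an assumption: the page does not show the one that was used, so this is just one simple expression that yields the same two matches and groups on the sample text.

```javascript
// Hypothetical pattern: the tester's actual pattern is not shown on the page.
const pattern = /(\w+)@(\w+\.\w+)/gi;

const sampleText = [
  "Reach Ada at ada@analytical.engine",
  "or Charles at charles@babbage.org.",
  "Spam: foo@bar (no TLD).",
].join("\n");

// matchAll requires the g flag and yields one result object per match.
const rows = [...sampleText.matchAll(pattern)].map((m, i) => ({
  n: i + 1,
  match: m[0],
  start: m.index,
  end: m.index + m[0].length, // end offset is exclusive
  groups: m.slice(1),         // $1, $2, ...
}));

for (const r of rows) {
  console.log(`${r.n} ${r.match} ${r.start}-${r.end} ${r.groups.join(" · ")}`);
}
```

The end offset is computed from the match length because a plain match result only carries the start index; the d flag would expose both ends directly.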

Not all engines speak the same language.

When you paste a pattern into the tester on this page, it runs through V8's Irregexp (or JavaScriptCore's YARR if you happen to be on Safari) and obeys the ECMAScript specification: currently ES2024, with named capture groups, Unicode property escapes (\p{Script=Greek}), the d flag for match indices, and the v flag for set notation. That is one regex flavour among at least half a dozen that engineers run into every week, and the differences are not cosmetic. They change which patterns compile, which patterns match, and, most importantly, which patterns can hang your production server.
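
A short sketch of the ECMAScript features just listed. The sample strings and variable names are made up for illustration, and the v-flag pattern is built through the RegExp constructor on the assumption that some runtimes may not support it yet.

```javascript
// Named capture groups.
const iso = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const { year, month, day } = "released 2024-06-30".match(iso).groups;

// Unicode property escapes need the u (or v) flag.
const greekWords = "alpha αλφα beta βητα".match(/\p{Script=Greek}+/gu);

// The d flag adds .indices: [start, end) offsets for the match and each group.
const withIndices = /(?<word>b\w+)/d.exec("foo bar");
const [start, end] = withIndices.indices.groups.word;

// The v flag enables set notation such as subtraction with "--".
// Built via the constructor so that an older runtime throws here instead of
// failing to parse the whole file.
let greekNotAlpha = null;
try {
  greekNotAlpha = new RegExp("[\\p{Script=Greek}--[α]]", "v");
} catch {
  // v flag not supported in this runtime; greekNotAlpha stays null
}

console.log(year, month, day, greekWords, start, end);
```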

POSIX gave us two grammars: Basic Regular Expressions (BRE), still lurking inside default grep and sed, where (, ), +, and ? are literal characters (\( and \) group when backslash-escaped; \+ and \? are GNU extensions rather than POSIX); and Extended Regular Expressions (ERE), used by egrep, awk, and grep -E, where those metacharacters are special by default. Both flavours are anchored in the 1980s. Neither supports lookaround, non-greedy quantifiers, or named captures.

PCRE — Perl-Compatible Regular Expressions — is what most engineers actually mean when they say "real regex." Originally written by Philip Hazel in 1997 to give Exim a Perl-style engine in C, the original library (now called PCRE1) was frozen at 8.45 in 2021. The successor, PCRE2, started in 2015 with a cleaner API, better Unicode handling, and JIT compilation through SLJIT. The migration has been slow and painful: nginx moved its default to PCRE2 in 1.21.5, PHP switched in 7.3, and a long tail of distros only flipped during 2023–2024.

Then there is RE2, Russ Cox's 2010 library born out of Google's frustration with code-search regexes taking down servers. RE2 compiles every pattern to a non-deterministic finite automaton and executes it in time linear in the length of the input, with no exceptions. To make that guarantee, RE2 deliberately refuses to support backreferences (\1, \2) and lookaround. Backreferences take a pattern outside the regular languages in the formal sense; lookaround stays regular in theory but is hard to implement without backtracking or a blow-up in automaton states. Cox would rather refuse to compile your pattern than let it ReDoS your fleet. Go's regexp package, Cloudflare's WAF, and the rules engine inside many cloud providers all use RE2 or a port.

Engine | Backreferences | Lookbehind | Atomic groups | Time complexity
POSIX ERE | no | no | no | implementation-defined
PCRE2 10.x | yes | variable (bounded) since 10.43 | yes | exponential worst case
Oniguruma 6.x | yes | variable | yes | exponential worst case
RE2 / Go regexp | no (refuses) | no (refuses) | no | O(n) guaranteed
ECMAScript 2024 | yes | variable since ES2018 | no | exponential worst case
Rust regex 1.x | no (refuses) | no (refuses) | no | O(nm) guaranteed

Name the engine

A pattern that works in your editor's "find" dialog may not work in your CI's grep, your application's runtime, or your gateway's WAF. When sharing a regex across systems, name the engine. "JavaScript regex" and "PCRE2 regex" are not interchangeable.

Subtler than they look.

^ and $ are the tip of the iceberg. By default in JavaScript, ^ matches start-of-string and $ matches end-of-string; with the m flag they additionally match just after and just before each line break. The s (dotall) flag, which arrived in ES2018, makes . match \n as well; older codebases still write [\s\S] out of habit.
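
The effect of the m and s flags can be seen in a few lines. This is only an illustrative sketch; the two-line sample string is made up.

```javascript
const twoLines = "first line\nsecond line";

// Without m, ^ and $ anchor only at the ends of the whole string.
const noM = twoLines.match(/^\w+/g);    // ["first"]
// With m, they also match after and before each line break.
const withM = twoLines.match(/^\w+/gm); // ["first", "second"]

// Without s, . stops at a line break; with s it crosses it.
const dotDefault = /first.*second/.test(twoLines); // false
const dotAll = /first.*second/s.test(twoLines);    // true
// The pre-ES2018 idiom that older code still uses.
const legacy = /first[\s\S]*second/.test(twoLines); // true

console.log(noM, withM, dotDefault, dotAll, legacy);
```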

Word boundaries are subtler. \b matches the zero-width position between a \w character and a non-\w character; \B matches everywhere \b does not. Crucially, in JavaScript, \w is [A-Za-z0-9_] (ASCII only), and the u and v flags do not make \w or \b Unicode-aware. So \bcafé\b fails on "un café noir": é is not a \w character, so there is no boundary between it and the space that follows. For boundaries a French speaker would expect, you have to spell them out with lookarounds over \p{L} and \p{N} under the u or v flag. PCRE2 has the (*UCP) modifier; Python's re has re.UNICODE (default in Python 3); Go's regexp documents the ASCII-only behaviour explicitly.
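
A small sketch of the ASCII-only \b behaviour, using café as the test word, plus one possible workaround. The wholeWord helper is hypothetical, and its definition of "word character" (letter, mark, digit, or underscore) is an assumption you may want to adjust.

```javascript
const sentence = "un café noir";

// ASCII-only \b: there is no boundary between "é" and " ", so this fails.
const asciiBoundary = /\bcafé\b/u.test(sentence); // false

// Assumed definition of a word character for this sketch.
const W = "[\\p{L}\\p{M}\\p{N}_]";

// Hypothetical helper: emulate a Unicode-aware \b with lookarounds.
function wholeWord(word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`(?<!${W})${escaped}(?!${W})`, "u");
}

const unicodeBoundary = wholeWord("café").test(sentence);     // true
const insideLongerWord = wholeWord("café").test("cafétéria"); // false

console.log(asciiBoundary, unicodeBoundary, insideLongerWord);
```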

Lookahead (?=…) and negative lookahead (?!…) have been part of JavaScript since ES3. Lookbehind (?<=…) and negative lookbehind (?<!…) arrived in ES2018, and unlike most engines, V8 has supported variable-width lookbehind from day one: (?<=\$\d+\.)\d+ works in JS, whereas older PCRE rejects an unbounded quantifier inside a lookbehind (it only accepts alternatives that each have a fixed length, such as (?<=foo|foobar)). Atomic groups (?>…) and possessive quantifiers (a++, a*+, a?+) are the two big features JavaScript still lacks. Both prevent the engine from giving back characters it has matched, and both are essential for writing safe patterns in PCRE2.
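
As a quick illustration of variable-width lookbehind in JavaScript, here is a sketch with made-up sample strings. The first pattern has an unbounded \d+ inside the lookbehind; the second uses a negative lookbehind.

```javascript
// Variable-width lookbehind: the \d+ inside (?<=…) has no fixed length.
const cents = "total: $1234.56".match(/(?<=\$\d+\.)\d+/); // ["56"]

// Negative lookbehind: numbers that are not part of a "$" amount.
// The class also excludes a preceding digit, otherwise the "0" of "$10"
// would match on its own.
const plainNumbers = "$10 and 20".match(/(?<![$\d])\d+/g); // ["20"]

console.log(cents[0], plainNumbers);
```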

How a regex takes down a fleet.

The textbook example is (a+)+b against the input aaaaaaaaaaaaaaaaaaaa (twenty as, no b). The outer + and the inner + can each split the run of as in many ways, and a backtracking engine will try every combination before deciding the match fails. The number of attempts is exponential in the length of the input — at twenty characters, on the order of 2²⁰, about a million steps. At forty, a trillion. Your CPU pegs at 100% and the request never returns.

Stack Overflow's 34-minute outage on 20 July 2016 was a close relative of this. A regex used to trim whitespace from the ends of lines met a post containing roughly 20,000 consecutive whitespace characters. The backtracking there was quadratic rather than exponential, but at that length it was enough to pin every web server in the pool once the post appeared on the home page. The post-mortem is worth reading in full. The takeaway is that any regex in a hot path, on user-controlled input, with nested quantifiers or alternation overlap, is a latent denial-of-service waiting for the right input.

The structural problem is ambiguity. (a+)+ has many ways to partition a string of as into one-or-more groups of one-or-more as. (a|a)*, (.*)*, (\w+\s?)+ — all the same shape. Any time the regex grammar gives the engine a choice, and the choices overlap, backtracking has to enumerate them.
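
To make the ambiguity concrete, the sketch below counts the ways (a+)+ can carve a run of n characters into one or more non-empty groups. It is a counting model written for illustration, not a trace of any real engine, and countSplits is a hypothetical name.

```javascript
// Count the ways (a+)+ can split a run of n "a"s into one or more non-empty
// groups. Each split is a path a backtracking engine may have to try before
// it can report that no "b" follows.
function countSplits(n) {
  if (n === 0) return 1; // nothing left: one complete split
  let ways = 0;
  // The next group takes k characters; the rest is split recursively.
  for (let k = 1; k <= n; k++) {
    ways += countSplits(n - k);
  }
  return ways;
}

for (const n of [5, 10, 20]) {
  console.log(n, countSplits(n)); // 16, 512, 524288: doubles with each extra "a"
}
```

The count is 2^(n-1), which is where the exponential growth in the number of attempts comes from.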

The fixes are well known. Possessive quantifiers ((a+)++b in PCRE) tell the engine never to give back. Atomic groups ((?>a+)+b) do the same thing. Rewriting the pattern to be unambiguous — a+b instead of (a+)+b — eliminates the choice altogether. Switching to an automaton-based engine — RE2, Hyperscan, the Rust regex crate, Go's regexp — makes the question moot.
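
JavaScript has neither possessive quantifiers nor atomic groups, so the first two fixes are not directly available there. One commonly used workaround, sketched below as an assumption rather than a recommendation, leans on the fact that a completed lookahead is never re-entered on backtracking: capture inside the lookahead, then consume the capture with a backreference.

```javascript
// (?=(a+))\1 behaves like the atomic group (?>a+): the lookahead captures
// the longest run, \1 consumes exactly that text, and the engine cannot
// backtrack into a lookahead that has already matched.
const atomicStyle = /(?:(?=(a+))\1)+b/;

// The unambiguous rewrite from the text.
const rewritten = /a+b/;

const hostile = "a".repeat(40); // forty "a"s, no "b"

const atomicOnHostile = atomicStyle.test(hostile);  // false, returns promptly
const rewrittenOnHostile = rewritten.test(hostile); // false
const atomicOnMatch = atomicStyle.test("aaab");     // true

console.log(atomicOnHostile, rewrittenOnHostile, atomicOnMatch);
```

The original (a+)+b is deliberately not run on the forty-character input here, since by the arithmetic above it would need on the order of a trillion steps.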

To find these patterns before they find you, run eslint-plugin-redos (built on the recheck analyser, which combines automaton-based static analysis with fuzzing), safe-regex (a lighter-weight star-height heuristic), or Snyk's regex scanner in CI.

Don't reach for regex first.

The single most common misuse of regex is email validation. RFC 5322 is complicated — comments, quoted local parts, IP-literal domains — and the regex that purports to fully implement it is 6,000 characters long and still wrong about internationalised domain names. Use a real library: validator.js's isEmail, or just send the confirmation email and see if it bounces. The HTML5 input-type spec defines a deliberately permissive regex; copying that one is fine for client-side hints.

URL parsing is the same story. Use new URL(input): it throws on invalid input, exposes .hostname, .pathname, and .searchParams, and handles IDN, IPv6 brackets, and percent-encoding correctly. Regex on URLs goes wrong the moment someone pastes a value with embedded credentials or odd encodings. JSON: JSON.parse. Always. There is no regex shortcut. HTML: do not. Browsers ship a real parser; on the server, use parse5, cheerio, or node-html-parser. Dates: Date.parse is specified only for the ISO 8601-based date-time format; RFC 2822-style strings are accepted by the major engines, but the result is implementation-defined. The new Temporal API covers everything else.
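
A minimal sketch of the new URL(input) approach. The parseUrl helper, the field selection, and the example URL are all hypothetical; the point is only that the parser, not a pattern, decides what is valid.

```javascript
// Hypothetical helper: returns the parts of interest, or null if the input
// is not a valid absolute URL.
function parseUrl(input) {
  let url;
  try {
    url = new URL(input); // throws TypeError on invalid input
  } catch {
    return null;
  }
  return {
    hostname: url.hostname,
    pathname: url.pathname,
    username: url.username,
    tags: url.searchParams.getAll("tag"),
  };
}

const good = parseUrl("https://ada:secret@example.org:8443/a%20b?tag=x&tag=y");
const bad = parseUrl("not a url");

console.log(good, bad);
```

Returning null rather than rethrowing is a design choice for this sketch; code that needs to report why the input was rejected would keep the exception.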

Where regex shines: extracting fields from log lines with a known shape, validating that an identifier matches [A-Za-z_][A-Za-z0-9_]*, replacing tabs with spaces, normalising whitespace, splitting CSV when you control the producer.
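
Three of those uses, sketched in a few lines. The log format is an assumption made up for the example ("timestamp LEVEL message"), and the helper names are hypothetical.

```javascript
// Assumed log shape for illustration: "<timestamp> <LEVEL> <message>".
const LOG_LINE = /^(?<ts>\S+) (?<level>[A-Z]+) (?<msg>.*)$/;

function parseLogLine(line) {
  const m = LOG_LINE.exec(line);
  return m ? { ...m.groups } : null; // null when the line has another shape
}

// Identifier check from the text, anchored so the whole string must match.
const isIdentifier = (s) => /^[A-Za-z_][A-Za-z0-9_]*$/.test(s);

// Collapse runs of whitespace to single spaces and trim the ends.
const normaliseWhitespace = (s) => s.replace(/\s+/g, " ").trim();

console.log(parseLogLine("2024-05-01T12:00:00Z WARN disk at 91%"));
console.log(isIdentifier("_tmp1"), isIdentifier("1abc"));
console.log(normaliseWhitespace("  a \t b\n c "));
```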

When the grammar isn't regular.

When the input has nesting — balanced parentheses, JSON, S-expressions, code — a regular language cannot describe it, full stop. Regex is mathematically incapable, no matter how clever the pattern. That is when you reach for a parser.
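
The missing ingredient is easy to see in code. The sketch below checks balanced parentheses with an explicit depth counter; that unbounded counter is exactly the state a finite automaton does not have. The function name is hypothetical.

```javascript
// Balanced-parentheses check using a depth counter.
function isBalanced(input) {
  let depth = 0;
  for (const ch of input) {
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closed more than were opened
    }
  }
  return depth === 0; // everything that was opened has been closed
}

console.log(isBalanced("(a (b c) (d (e)))")); // true
console.log(isBalanced("(a (b c)"));          // false
console.log(isBalanced(")("));                // false
```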

Parser combinators (parsec in Haskell, nom in Rust, parsimmon and arcsecond in JS) compose small parsers into bigger ones, and give you good error messages for free. PEG grammars (Peggy, formerly PEG.js; parsimonious in Python; pest in Rust) let you write the grammar in a clean DSL and generate a parser. Tree-sitter, originally built for Atom and now used by Neovim, Helix, and GitHub's code search, parses real programming languages incrementally and is fast enough to run on every keystroke. Earley parsers (nearley.js, lark in Earley mode) handle ambiguous grammars where a single input has multiple valid parses.

The rule of thumb: if your input grammar is recursive — if a thing can contain another thing of the same kind — regex is the wrong tool, and the right tool is one of the above. Use the regex tester on this page for what regex is good at, and reach for a parser the moment your pattern starts growing escape hatches.
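
For a sense of scale, here is a hand-written recursive-descent parser for nested lists. It is not one of the libraries named above, just a sketch of the division of labour: a regex tokenises the flat part, and recursion handles "a thing containing another thing of the same kind". The grammar and the parseSexpr name are assumptions made for the example.

```javascript
// Grammar assumed for illustration:
//   expr := atom | "(" expr* ")"
//   atom := one or more characters other than whitespace and parentheses
function parseSexpr(source) {
  // Regex does the flat part (tokenising); recursion handles the nesting.
  const tokens = source.match(/[()]|[^\s()]+/g) ?? [];
  let pos = 0;

  function parseExpr() {
    const tok = tokens[pos++];
    if (tok === undefined) throw new SyntaxError("unexpected end of input");
    if (tok === ")") throw new SyntaxError("unexpected )");
    if (tok !== "(") return tok; // atom
    const list = [];
    while (tokens[pos] !== ")") {
      if (pos >= tokens.length) throw new SyntaxError("missing )");
      list.push(parseExpr()); // a list can contain lists: recursion
    }
    pos++; // consume ")"
    return list;
  }

  const result = parseExpr();
  if (pos !== tokens.length) throw new SyntaxError("trailing input");
  return result;
}

console.log(JSON.stringify(parseSexpr("(add 1 (mul 2 3))")));
// ["add","1",["mul","2","3"]]
```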

A famously misused hammer.

Regular expressions parse regular languages — those describable by a finite-state machine. HTML, JSON, source code, balanced parentheses are not regular; trying to parse them with regex eventually breaks. The famous Stack Overflow answer "you can't parse HTML with regex" is technically correct: HTML can nest arbitrarily, and finite-state machines can't count nesting depth. Use a real parser. For finding short fixed structures inside text — emails, IDs, dates, log fields — regex is the right tool and is wonderfully fast.

ReDoS — production timeout culprit

A 2016 Stack Overflow outage was caused by a single regex with catastrophic backtracking on a long whitespace string. Linters like eslint-plugin-redos and safe-regex2 can flag dangerous patterns before they ship.
