URL encode.
Percent-encode strings for safe use in URLs, query parameters, and headers — or decode them back. Toggle between component (escapes / ? #) and full-URI (preserves them) modes. Local only.
name%3DAda%20Lovelace%26note%3Dhello%2Fworld%3F
Why some characters survive.
RFC 3986 splits the ASCII printable range into three classes. Unreserved characters — letters, digits, and - _ . ~ — never need escaping; they're always safe. Reserved characters — : / ? # [ ] @ ! $ & ' ( ) * + , ; = — have meaning in the URL grammar; whether they're escaped depends on where they appear in the URL. Everything else (spaces, Unicode, control codes) must be percent-encoded as the UTF-8 byte sequence in hex, with each byte rendered as %XX.
| Class | Characters | Encoded? |
|---|---|---|
| Unreserved | A–Z a–z 0–9 - _ . ~ | never |
| Reserved (gen-delims) | : / ? # [ ] @ | only if used as data, not delimiter |
| Reserved (sub-delims) | ! $ & ' ( ) * + , ; = | only if used as data, not delimiter |
| Other ASCII | space < > " { } \ ^ ` \| | always |
| Non-ASCII (Unicode) | U+0080 and above | always (UTF-8 byte sequence) |
| Control codes | U+0000–U+001F U+007F | always |
The "depends on where it appears" rule is what makes URL encoding interesting. A literal / is fine as a path separator but must be encoded when it is data inside a single path segment, and a literal & or = must be encoded inside a query value, where it would otherwise be read as a delimiter. A literal ? is fine inside a fragment but separates path from query when it appears earlier. JavaScript's encodeURI and encodeURIComponent encode for two different positions: full-URI mode preserves the structural characters, while component mode escapes every reserved character except ! ' ( ) *, which it leaves alone because they were unreserved in the older RFC 2396.
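The "UTF-8 byte sequence in hex" rule can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: percentEncode is a hypothetical helper that escapes everything outside the unreserved set, which is stricter than either built-in encoder.

```javascript
// Unreserved characters per RFC 3986: letters, digits, and - _ . ~
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

// Hypothetical helper: encode text as UTF-8, then emit each byte either
// as itself (if unreserved) or as %XX in uppercase hex.
function percentEncode(text) {
  let out = '';
  for (const byte of new TextEncoder().encode(text)) {
    const ch = String.fromCharCode(byte);
    out += UNRESERVED.test(ch)
      ? ch
      : '%' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

console.log(percentEncode('a b/c')); // a%20b%2Fc
console.log(percentEncode('é'));     // %C3%A9 (two UTF-8 bytes)
console.log(percentEncode('name=Ada Lovelace&note=hello/world?'));
// name%3DAda%20Lovelace%26note%3Dhello%2Fworld%3F
```

Because it escapes every reserved character, a helper like this is only suitable for a single component, never for a whole URL.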
Two specs, one address bar.
URLs have two competing standards. RFC 3986 (2005) defines URIs as a static grammar — straightforward to parse, strict about what's allowed. The WHATWG URL Living Standard (2014–today) defines URLs as what browsers actually accept and how they normalise them — much more lenient, with elaborate handling of edge cases like backslashes, missing schemes, IDN hostnames, and malformed inputs. The browser's URL object follows WHATWG; curl, server-side libraries, and most non-browser tools follow RFC 3986.
The two specs disagree on a surprising number of small things. Backslashes in paths: WHATWG converts them to forward slashes, RFC rejects. Tabs and newlines inside a URL: WHATWG strips them silently, RFC rejects. Missing scheme: the WHATWG parser resolves the input against a base URL when one is supplied and fails otherwise, while RFC 3986 treats it as a relative reference. Trailing dots in hostnames, leading zeros in IPv4 octets, square brackets around IPv6 addresses, and percent-encoded path separators are all known divergence points.
Practical consequence: a URL accepted by Chrome can be rejected by your backend's URL parser, and vice versa. The WHATWG spec includes a normalisation step ("URL parsing then serialisation") that all browsers implement; running an untrusted URL through that step before sending it to a strict backend often resolves the discrepancy. Some shops standardise on the whatwg-url npm package or, in Python, urllib3.util.parse_url (from the third-party urllib3 package) for cross-language consistency.
| Behaviour | RFC 3986 | WHATWG URL |
|---|---|---|
| Backslash in path | error | converts to forward slash |
| Embedded tab/newline | error | strips silently |
| Leading whitespace | error | trims |
| IPv4 with leading zeros | not a valid IPv4 literal (falls through to a registered name) | parsed as octal |
| Empty host | error for http(s) | error |
| Percent-encoded slash in path | preserved | preserved by the parser, though servers and proxies often decode it |
| IDN hostnames | not specified | Punycode (xn--) |
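The "parse then serialise" normalisation step can be tried directly with the WHATWG URL class, which is available in browsers and in Node. A minimal sketch, where normaliseUrl is a hypothetical helper name:

```javascript
// Hypothetical helper: run an input through the WHATWG parser and
// serialiser. Returns null when the parser rejects the input.
function normaliseUrl(input) {
  try {
    return new URL(input).href;
  } catch {
    return null;
  }
}

// Backslashes become forward slashes for http(s) URLs.
console.log(normaliseUrl('https://example.com\\docs\\intro'));
// https://example.com/docs/intro

// Surrounding whitespace is trimmed, scheme and host are lowercased,
// and dot segments are resolved.
console.log(normaliseUrl(' HTTPS://Example.COM/a/../b\t'));
// https://example.com/b

// No scheme and no base URL: the parser fails.
console.log(normaliseUrl('example.com/path')); // null
```

Passing the serialised form downstream, rather than the raw input, means the strict backend sees the same URL the lenient parser understood.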
Picking the right encode function.
JavaScript ships three built-in encoders that look similar and behave differently. encodeURI escapes only what's needed to keep a string from breaking the URL grammar, so it leaves reserved characters such as : / ? # & = unencoded. encodeURIComponent escapes those too, sparing only the unreserved set plus ! ' ( ) *. escape is a deprecated holdover from ES1 that handles non-ASCII via %uXXXX sequences nobody else accepts; never use it.
```javascript
// Encoding a query value
const tag = "Q&A: rock + roll";
encodeURI(tag);          // "Q&A:%20rock%20+%20roll" ← & survives = WRONG
encodeURIComponent(tag); // "Q%26A%3A%20rock%20%2B%20roll" ← right

// Building a URL with the URLSearchParams API (preferred)
const u = new URL('https://api.example.com/search');
u.searchParams.set('q', tag);
u.searchParams.set('lang', 'en');
u.toString();
// "https://api.example.com/search?q=Q%26A%3A+rock+%2B+roll&lang=en"
```

The URLSearchParams API is the right tool for query strings: it handles encoding correctly, applies the same rules to every key and value, and encodes a space as + (the application/x-www-form-urlencoded convention) rather than %20. For path segments, apply encodeURIComponent to each segment and then join with /. For fragments, encode the same way as a query value.
Other languages have their own asymmetries. Python 3 has urllib.parse.quote (safe defaults to /) and quote_plus (encodes space as + for forms). Go has url.QueryEscape and url.PathEscape with different reserved-character sets. Java's URLEncoder.encode only produces form encoding (application/x-www-form-urlencoded, with space as +), not general-purpose URL encoding. Always check the docs; default behaviour rarely matches RFC 3986 exactly.
Where + became space.
Two specs handle space differently. RFC 3986 says space is %20, full stop. The older application/x-www-form-urlencoded spec (still used by HTML form submissions and most query strings) says space is +. Both occur in the wild. Browsers send form data as +; fetch with URLSearchParams emits +; curl --data-urlencode emits %20. Decoders generally treat + as a literal plus unless they're parsing a form body — that's why this tool emits %20.
Where this bites: a search query for "C++" sent through a form encoder becomes C%2B%2B; sent through an encoder that leaves + alone (encodeURI, for example) it stays C++. If the backend then decodes that value as form data, the result is "C" followed by two spaces, because the literal pluses became spaces. The fix on the receiving side is to be explicit about which convention applies: application/x-www-form-urlencoded bodies go through a form decoder such as Python's parse_qs; URL query strings read through URL.searchParams follow the same +-as-space convention; raw URL path segments go through decodeURIComponent, which doesn't translate +.
A second wrinkle: + inside a URL path (not query) is always a literal plus. Only when used inside a query string parsed as form-encoded does it represent a space. Implementations that uniformly decode + as space in all positions break filenames containing pluses (/files/C++%20notes.pdf comes out as "C", three spaces, then "notes.pdf").
To see the difference, build a URL with URLSearchParams, send it, and decode it server-side with parse_qs (Python) or URLSearchParams (Node/Deno). Both treat + as space, so the value round-trips. Now decode the same query with a path-style decoder such as decodeURIComponent: the + stays a literal plus and the space is lost. The right answer is to pick one convention per pipeline and stick with it.
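The round-trip mismatch is easy to reproduce in one process. The sketch below stands in for the client and both kinds of server-side decoder; the host name and the sample value are illustrative only.

```javascript
// Client side: build the query with the form-style encoder.
const url = new URL('https://example.com/search');
url.searchParams.set('q', 'C++ notes');
console.log(url.search); // ?q=C%2B%2B+notes

// Form-style decoder: + is read back as a space, so the value round-trips.
const formDecoded = new URLSearchParams(url.search).get('q');
console.log(formDecoded); // C++ notes

// Path-style decoder: + is left alone, so the space comes back as a plus.
const rawValue = url.search.slice('?q='.length);
const pathDecoded = decodeURIComponent(rawValue);
console.log(pathDecoded); // C+++notes
```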
Five bugs that show up in every codebase eventually.
First, double-encoding. A space goes ' ' → %20 → %2520 when a layer doesn't realise the input is already encoded. If you see %25XX sequences in URLs hitting your backend, that's the smoking gun. The fix is to track the encoding state at every boundary: write a comment next to every URL string saying whether it's encoded or decoded, and never re-encode without explicitly decoding first.
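A quick way to see the failure, plus a rough detector for logs. looksDoubleEncoded is a hypothetical helper and only a heuristic: a value that legitimately contains the text "%20" also encodes to %2520, so a match is a prompt to investigate, not proof.

```javascript
const raw = 'Ada Lovelace';
const once = encodeURIComponent(raw);   // correct: one layer of encoding
const twice = encodeURIComponent(once); // bug: an already-encoded value encoded again

console.log(once);  // Ada%20Lovelace
console.log(twice); // Ada%2520Lovelace

// Heuristic: %25 followed by two hex digits is what an encoded %XX
// sequence looks like after a second pass through an encoder.
function looksDoubleEncoded(value) {
  return /%25[0-9A-Fa-f]{2}/.test(value);
}

console.log(looksDoubleEncoded(once));  // false
console.log(looksDoubleEncoded(twice)); // true
```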
Second, encoding the wrong thing. Some libraries pass paths through query encoders (escaping / in the process), then attempt to use the result as a path. The encoded slashes confuse routing: frameworks such as Express, Rails, and Django can fail to match a route when a path segment contains %2F, even though it's technically valid. Web servers add their own handling. Apache rejects paths containing %2F by default (AllowEncodedSlashes is Off), and nginx decodes percent-encoded characters when it normalises the URI for location matching.
Third, encoding before joining. Building a URL by concatenating an already-encoded base with a raw path-segment leaks the un-encoded segment into the URL. The pattern that works is: parse the base URL into a structured object (new URL(base)), append the segment to .pathname using path-segment encoding, and let the URL object emit the joined string.
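The parse, append, serialise pattern might look like the following sketch. appendSegment is a hypothetical helper, and it assumes the segment is a single path component whose slashes are data.

```javascript
// Hypothetical helper: append one raw segment to a base URL's path.
function appendSegment(base, segment) {
  const url = new URL(base);
  // Make sure exactly one slash separates the existing path and the new segment.
  const prefix = url.pathname.endsWith('/') ? url.pathname : url.pathname + '/';
  // encodeURIComponent escapes /, ?, # and & so they cannot change the URL's structure.
  url.pathname = prefix + encodeURIComponent(segment);
  return url.toString();
}

console.log(appendSegment('https://example.com/files', 'Q&A notes/2024.pdf'));
// https://example.com/files/Q%26A%20notes%2F2024.pdf

console.log(appendSegment('https://example.com/files/?v=1', 'report#1.pdf'));
// https://example.com/files/report%231.pdf?v=1
```

One caveat: encodeURIComponent leaves dots alone, so a segment of ".." would still be resolved as a parent reference by the URL object and needs rejecting separately.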
Fourth, IDN hostname surprises. The hostname münchen.de is valid as Unicode but transmitted as Punycode xn--mnchen-3ya.de. Some backends accept both; some accept only the Punycode form; some accept only the Unicode form. The browser handles the conversion transparently in the address bar, but not every server-side URL parser does: the WHATWG algorithm, and therefore the URL class in modern Node, converts to Punycode automatically, while older parsers may not.
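With a WHATWG-conformant URL class the conversion is visible on the parsed object. A small check, assuming a current Node or browser runtime; the path is illustrative:

```javascript
// The parser converts the Unicode hostname to its Punycode (ASCII) form.
const url = new URL('https://münchen.de/karte');

console.log(url.hostname); // xn--mnchen-3ya.de
console.log(url.href);     // https://xn--mnchen-3ya.de/karte
```

Comparing or allow-listing hostnames after this step means both spellings of the same domain end up as the same string.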
Fifth, log redaction. URLs in access logs often contain query parameters with sensitive data such as API keys, OAuth tokens, and session IDs. Even if the URL is HTTPS, log files are usually plaintext. Either strip query strings before logging, or use POST bodies for sensitive parameters. Exposure of sensitive data in this way has featured in successive editions of the OWASP Top 10 because it keeps happening.
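One way to redact before logging is to rewrite the sensitive parameters on a parsed copy of the URL. This is a sketch under assumptions: redactForLog and the parameter names in SENSITIVE_PARAMS are hypothetical and would need to match whatever the application actually uses.

```javascript
// Assumed list of parameter names to hide; adjust per application.
const SENSITIVE_PARAMS = new Set(['api_key', 'access_token', 'token', 'session']);

// Hypothetical helper: return a copy of the URL that is safe to write to a log.
function redactForLog(rawUrl) {
  const url = new URL(rawUrl);
  // Copy the keys first so the collection is not mutated while iterating.
  for (const key of [...url.searchParams.keys()]) {
    if (SENSITIVE_PARAMS.has(key.toLowerCase())) {
      url.searchParams.set(key, 'REDACTED');
    }
  }
  return url.toString();
}

console.log(redactForLog('https://api.example.com/v1/items?api_key=abc123&page=2'));
// https://api.example.com/v1/items?api_key=REDACTED&page=2
```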
| Symptom | Likely cause |
|---|---|
| %25XX appearing in URLs | double encoding somewhere upstream |
| Spaces becoming pluses unexpectedly | form-encoder used on non-form data |
| Special chars surviving into a database | encoder applied too late, after framework parsed URL |
| Slashes in path-segments breaking routing | middleware decodes %2F too eagerly |
| Unicode hostname unreachable from CLI | tool doesn't IDNA-encode automatically |
| Query parameter values stripped silently | NUL bytes or control codes truncating |
Percent-encoding shows up elsewhere too.
The percent-encoding mechanism is older than RFC 3986. RFC 1738 (1994) standardised it for "Uniform Resource Locators"; later specifications extended the same encoding to other contexts. application/x-www-form-urlencoded bodies use it. mailto: URIs encode message bodies and headers with it (a ?subject=Hello%20World after the address). Content-Disposition headers use a related but distinct encoding for extended parameters (filename*=UTF-8''…), whose rules are set out in RFC 5987.
In practice, you'll see percent-encoding used outside URLs in three common places. (1) Form submissions: application/x-www-form-urlencoded POST bodies use the same encoding as URL query strings, including the +-as-space convention. (2) Cookie values: RFC 6265 doesn't strictly require encoding, but most cookie-handling libraries percent-encode anything containing reserved or non-ASCII characters. (3) HTTP headers: certain headers (Content-Disposition, Link, RFC 5987-encoded parameters) use related but slightly different schemes.
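As an illustration of the "related but slightly different" header schemes, here is a sketch of encoding a filename for an RFC 5987 extended parameter. encodeRfc5987 is a hypothetical helper: it starts from encodeURIComponent and additionally escapes ' ( ) *, which that function leaves alone but the RFC 5987 value grammar does not allow unescaped. It encodes a few characters the grammar would permit, which is harmless.

```javascript
// Hypothetical helper: percent-encode a value for use after UTF-8'' in a header.
function encodeRfc5987(value) {
  return encodeURIComponent(value).replace(
    /['()*]/g,
    (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase()
  );
}

const filename = 'naïve file (1).txt';
const header = "attachment; filename*=UTF-8''" + encodeRfc5987(filename);

console.log(header);
// attachment; filename*=UTF-8''na%C3%AFve%20file%20%281%29.txt
```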
A side note: the % character itself is its own percent-encoding (%25). So when you encode a URL that already contains percent-encoded sequences, every % in the input becomes %25 in the output — that's why double-encoding produces %2520 for what was originally a space. The encoder doesn't know it's looking at already-encoded data; it just sees the literal % as a character that needs escaping.
Server-Side Request Forgery (SSRF) attacks often exploit URL-parsing differences between layers. A library that allow-lists hostnames may parse a URL one way, but the HTTP client that fetches it parses differently — letting the attacker reach an internal IP. The mitigations are: parse once with the WHATWG URL object, normalise to a canonical form, allow-list against the normalised hostname, then fetch using the same canonical URL. Never re-parse between the check and the fetch.
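The "parse once, check, then fetch the same object" sequence could be sketched like this. ALLOWED_HOSTS and checkedUrl are hypothetical names, and a production check would typically also cover ports, redirects, and DNS resolution, which this sketch does not attempt.

```javascript
// Assumed allow-list; in practice this comes from configuration.
const ALLOWED_HOSTS = new Set(['api.example.com']);

// Hypothetical helper: parse once and return the parsed URL only if it passes.
function checkedUrl(input) {
  const url = new URL(input); // throws on input the parser rejects
  if (url.protocol !== 'https:') {
    throw new Error('scheme not allowed: ' + url.protocol);
  }
  // url.hostname is already normalised (lowercased, Punycode, userinfo removed).
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error('host not allowed: ' + url.hostname);
  }
  return url;
}

// The caller fetches the returned object, not the original string:
//   const response = await fetch(checkedUrl(userInput));

console.log(checkedUrl('https://API.example.com/v1').href); // https://api.example.com/v1
```

An input such as https://api.example.com@169.254.169.254/ is rejected here because the parser treats the text before @ as userinfo and reports 169.254.169.254 as the hostname.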
URL encoding sits on the attack surface.
URL parameters cross trust boundaries — from browser to load balancer to application server to database. Every layer that re-encodes or decodes is a potential injection point. Three classes of attack come up regularly. Open redirects: if a parameter like ?next=/dashboard is used to redirect after login, an attacker can supply ?next=https://evil.com or ?next=//evil.com and intercept the user. The fix is to validate that next is a relative path (starts with / but not //) before redirecting.
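The relative-path check for the redirect parameter can be sketched as below. safeRedirectTarget and the placeholder origin are hypothetical; the idea is to resolve the value the way a browser would and accept it only if it stays on the same origin, which also catches the backslash variant /\evil.com that a plain string check would miss.

```javascript
// Placeholder origin used only for resolving relative values.
const DUMMY_ORIGIN = 'https://app.invalid';

// Hypothetical helper: return a safe same-site path, or the fallback.
function safeRedirectTarget(next, fallback = '/') {
  if (typeof next !== 'string' || !next.startsWith('/')) return fallback;
  let url;
  try {
    url = new URL(next, DUMMY_ORIGIN);
  } catch {
    return fallback;
  }
  // "//evil.com" and "/\evil.com" both resolve to a different origin.
  if (url.origin !== DUMMY_ORIGIN) return fallback;
  return url.pathname + url.search + url.hash;
}

console.log(safeRedirectTarget('/dashboard?tab=1#top')); // /dashboard?tab=1#top
console.log(safeRedirectTarget('https://evil.com'));     // /
console.log(safeRedirectTarget('//evil.com'));           // /
console.log(safeRedirectTarget('/\\evil.com'));          // /
```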
Path traversal: ..%2F..%2F..%2Fetc%2Fpasswd survives a tolerant URL decoder and reaches a file-serving backend that doesn't normalise. The fix is to canonicalize the path before any file access — resolve .. segments, reject paths that escape the document root, decode all percent-encoded characters once and only once. Most modern frameworks ship correct path handling; problems show up in custom file-serving middleware or in S3-compatible object stores misconfigured to allow path-style access.
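The "decode once, canonicalise, then check" order can be sketched without any library. resolveInsideRoot is a hypothetical helper that assumes POSIX-style paths and a root given without a trailing slash requirement; in Node, path.resolve followed by a prefix check against the root is a common alternative.

```javascript
// Hypothetical helper: map a request path onto a file path under root,
// throwing if the request tries to climb above it.
function resolveInsideRoot(root, requestPath) {
  // Decode exactly once; malformed sequences throw a URIError.
  const decoded = decodeURIComponent(requestPath);
  if (decoded.includes('\0') || decoded.includes('\\')) {
    throw new Error('illegal character in path');
  }
  const segments = [];
  for (const segment of decoded.split('/')) {
    if (segment === '' || segment === '.') continue; // skip empty and current-dir segments
    if (segment === '..') {
      if (segments.length === 0) throw new Error('path escapes document root');
      segments.pop();
      continue;
    }
    segments.push(segment);
  }
  return [root.replace(/\/+$/, ''), ...segments].join('/');
}

console.log(resolveInsideRoot('/srv/www', '/css/site.css')); // /srv/www/css/site.css
console.log(resolveInsideRoot('/srv/www', '/a/../b.txt'));   // /srv/www/b.txt
// resolveInsideRoot('/srv/www', '..%2F..%2F..%2Fetc%2Fpasswd') throws
```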
XSS via reflected URL parameters: a search box that echoes ?q= back into the HTML is a classic vector. The encoded form ?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E decodes to <script>alert(1)</script>. Defence is HTML-escaping when the value lands in HTML, JavaScript-escaping when it lands in inline JS, URL-escaping when it lands in another URL. Each context has its own escape function — generic "escape" doesn't exist.
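The per-context rule can be shown with the payload above. escapeHtml is a hypothetical minimal helper for HTML text and quoted attribute values; most template engines provide an equivalent, and it is not a substitute for JavaScript-context escaping.

```javascript
// Hypothetical helper: escape the five characters that are significant in HTML.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, (c) => entities[c]);
}

// The decoded parameter is the raw script tag.
const q = new URLSearchParams('?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E').get('q');
console.log(q); // <script>alert(1)</script>

// HTML context: escape for HTML.
console.log(escapeHtml(q)); // &lt;script&gt;alert(1)&lt;/script&gt;

// URL context: escape for a URL component instead.
console.log('/search?q=' + encodeURIComponent(q));
// /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E
```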
A useful diagnostic in production: log the encoded and decoded forms of every URL parameter at the boundary, with a flag noting whether the value was structurally valid for its target context. The logs balloon, but they catch encoding-mismatch bugs before users do. Most teams don't go this far; the ones that do find issues in middleware they didn't write.