URL Parser Guide: Anatomy, Encoding Rules, and the JavaScript URL API
A URL looks simple until it breaks in production. Wrong percent-encoding in a path segment, a fragment silently dropped by a reverse proxy, a query string with spaces encoded as + instead of %20 — these are real bugs. Use the URL Parser to inspect any URL component by component as you follow along. For query strings specifically, the Query String Parser and URL Encoder / Decoder are also useful companions.
URL Anatomy
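RFC 3986 defines the generic URI syntax, which URLs follow. It names five top-level components: scheme, authority, path, query, and fragment. The authority splits further into userinfo, host, and port, giving seven parts in total. Here is a full example with all of them present: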
https://user:pass@host:8080/path/to/page?query=value#fragment
───── ───────────── ──── ──── ───────────── ─────────── ────────
scheme userinfo host port path query fragment
Together, user:pass@host:8080 is called the authority. It combines optional userinfo, the host, and an optional port. The authority always follows the double slash (//) after the scheme.
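As a quick check of the authority breakdown, the sketch below parses the example URL with the built-in URL class and reads the authority parts back. The variable names `example` and `authorityParts` are illustrative only.

```javascript
// Split the authority of the example URL into its parts.
const example = new URL('https://user:pass@host:8080/path/to/page?query=value#fragment');

const authorityParts = {
  username: example.username, // "user"
  password: example.password, // "pass"
  hostname: example.hostname, // "host"
  port: example.port,         // "8080" (always a string)
  host: example.host,         // "host:8080" (hostname plus port, without userinfo)
};

console.log(authorityParts);
```

Note that `host` and `hostname` differ: only `host` includes the port, and neither includes the userinfo.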
Component-by-Component Breakdown
| Component | Example | Notes |
|---|---|---|
| Scheme | https | Case-insensitive. Followed by :. Common: http, https, ftp, mailto, file. |
| Userinfo | user:pass | Deprecated for HTTP/HTTPS. Credentials in URLs are a security risk — avoid. Followed by @. |
| Host | host | Case-insensitive for DNS names. IPv6 addresses must be wrapped in brackets: [::1]. |
| Port | 8080 | Omit when using the default port for the scheme (443 for HTTPS, 80 for HTTP). Preceded by :. |
| Path | /path/to/page | Case-sensitive on most servers. Segments separated by /. Slashes within a segment must be percent-encoded. |
| Query | query=value | Preceded by ?. Key-value pairs separated by &. No standardized structure beyond that — servers interpret it. |
| Fragment | fragment | Preceded by #. Never sent to the server. Handled entirely by the browser or client. |
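Several notes in the table can be observed directly with the JavaScript URL class. The following is a small sketch, using made-up example URLs, of how scheme and host case, default ports, and IPv6 brackets come out after parsing.

```javascript
// Scheme and host are lowercased; the default port is dropped.
const normalized = new URL('HTTPS://Example.COM:443/path/to/page');
console.log(normalized.href); // "https://example.com/path/to/page"
console.log(normalized.port); // "" (443 is the default for https)

// A non-default port is kept.
const custom = new URL('https://example.com:8443/');
console.log(custom.port); // "8443"

// IPv6 hosts keep their brackets in both hostname and host.
const loopback = new URL('http://[::1]:3000/');
console.log(loopback.hostname); // "[::1]"
console.log(loopback.host);     // "[::1]:3000"
```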
Reserved vs Unreserved Characters (RFC 3986)
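RFC 3986 defines two character sets, unreserved and reserved, and splits the reserved set into general delimiters and sub-delimiters. Together with everything outside those sets, that gives four groups that determine whether a character needs to be percent-encoded: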
| Category | Characters | Encoding rule |
|---|---|---|
| Unreserved | A–Z a–z 0–9 - _ . ~ | No encoding needed. A percent-encoded unreserved character is equivalent to the literal one, but it makes the URL longer and harder to compare. |
| Reserved (general delimiters) | : / ? # [ ] @ | These delimit URL components. They must be percent-encoded when used as data, not as delimiters. |
| Reserved (sub-delimiters) | ! $ & ' ( ) * + , ; = | Allowed in some positions without encoding; must be encoded in others (e.g., = in a query value). |
| Everything else | Spaces, non-ASCII, control chars | Must always be percent-encoded: convert each byte to %XX (uppercase hex). |
Percent-encoding writes each byte as % followed by two hex digits, uppercase by convention. Non-ASCII characters are converted to UTF-8 bytes first. A space is %20, and a euro sign is %E2%82%AC (three UTF-8 bytes).
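The byte-level rule can be sketched in a few lines. The helper `percentEncodeBytes` below is hypothetical and deliberately naive: it encodes every byte, including unreserved characters, purely to show the mechanics. In real code, `encodeURIComponent` or the URL API is the better choice.

```javascript
// Illustration only: encode EVERY byte of the UTF-8 form as %XX.
function percentEncodeBytes(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(
    bytes,
    (byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeBytes(' '));  // "%20"
console.log(percentEncodeBytes('€'));  // "%E2%82%AC"

// The built-in encoder produces the same bytes for these two inputs.
console.log(encodeURIComponent('€'));  // "%E2%82%AC"
```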
Encoding Rules Differ Per Component
This is where most encoding bugs come from. The set of characters that must be encoded differs per URL component, because the same character can be a delimiter in one component and plain data in another. A slash separates segments in a path but is ordinary data in a query string or fragment.
| Component | Characters that may stay literal | Characters that must be encoded |
|---|---|---|
| Path segment | Unreserved + ! $ & ' ( ) * + , ; = : @ | / (slash, because it would split the segment), ?, #, spaces, non-ASCII |
| Query key or value | Unreserved + ! $ ' ( ) * , ; : @ / ? | =, & (they delimit pairs), + (form decoders read it as a space), #, spaces, non-ASCII |
| Fragment | Unreserved + sub-delimiters + : @ / ? | #, spaces, and non-ASCII (browsers are lenient here, but the RFC requires encoding) |
Key insight: a path segment can contain = and & without encoding, because those characters have no special meaning in paths. But a query value containing & must be encoded as %26 or the parser will split the pair at that character.
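The per-component difference shows up when the same raw value is placed in a path and in a query. This sketch uses a made-up host and made-up values; it relies only on the standard URL class and `encodeURIComponent`.

```javascript
const url = new URL('https://example.com/');

// In a path, = and & are left as they are.
url.pathname = '/tags/a=b&c';
console.log(url.pathname); // "/tags/a=b&c"

// In a query value, the same characters are encoded so they cannot split the pair.
url.searchParams.set('filter', 'a=b&c');
console.log(url.search); // "?filter=a%3Db%26c"

// A slash that is data inside ONE path segment has to be encoded by hand.
const segment = encodeURIComponent('reports/2024 q1');
console.log(segment); // "reports%2F2024%20q1"

const fileUrl = new URL(`/files/${segment}`, 'https://example.com');
console.log(fileUrl.pathname); // "/files/reports%2F2024%20q1"
```

`encodeURIComponent` is stricter than a path segment needs (it also encodes = and &), which is harmless and keeps the helper usable for both paths and queries.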
JavaScript URL API
Modern JavaScript provides a built-in URL class that correctly handles parsing, construction, and encoding. Prefer it over manual string manipulation.
// Parse a URL
const url = new URL('https://host:8080/path/to/page?query=value#fragment');
console.log(url.protocol); // "https:"
console.log(url.hostname); // "host"
console.log(url.port); // "8080"
console.log(url.pathname); // "/path/to/page"
console.log(url.search); // "?query=value"
console.log(url.hash); // "#fragment"
// searchParams gives you a structured view of the query string
console.log(url.searchParams.get('query')); // "value"
// Build a URL safely — no manual encoding needed
const built = new URL('https://api.example.com/search');
built.searchParams.set('q', 'hello world & more');
built.searchParams.set('page', '2');
console.log(built.toString());
// "https://api.example.com/search?q=hello+world+%26+more&page=2"
Relative URL pitfall: new URL('/path') throws a TypeError because a relative URL has no base. You must provide a base:
// Wrong — throws TypeError: Failed to construct 'URL'
const bad = new URL('/api/users');
// Correct — provide a base
const good = new URL('/api/users', 'https://example.com');
console.log(good.toString()); // "https://example.com/api/users"
// Or use the current page's URL in the browser
const relative = new URL('/api/users', window.location.href);
URLSearchParams is also available standalone for working with query strings without a full URL:
const params = new URLSearchParams('a=1&b=hello+world&c=%E2%82%AC');
console.log(params.get('a')); // "1"
console.log(params.get('b')); // "hello world" (+ decoded as space)
console.log(params.get('c')); // "€" (percent-decoded)
// Iterate all pairs
for (const [key, value] of params) {
console.log(key, value);
}
// Serialize back
params.append('d', 'new value');
console.log(params.toString()); // "a=1&b=hello+world&c=%E2%82%AC&d=new+value"
Common Bugs
Double-encoding
Double-encoding happens when you percent-encode a string that has already been percent-encoded. A space becomes %20 after one pass; encoding again turns the % into %25, giving %2520 — which decodes to the literal string %20, not a space.
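The round trip can be traced step by step. This is a minimal sketch with illustrative variable names; it also shows that the URL API cannot tell a pre-encoded value from raw text and will encode the percent sign again.

```javascript
const once = encodeURIComponent('hello world');
const twice = encodeURIComponent(once);
console.log(once);  // "hello%20world"
console.log(twice); // "hello%2520world"

// One decode pass only undoes the outer layer.
console.log(decodeURIComponent(twice));                     // "hello%20world"
console.log(decodeURIComponent(decodeURIComponent(twice))); // "hello world"

// Passing a pre-encoded value to searchParams double-encodes it too.
const search = new URL('https://example.com/search');
search.searchParams.set('q', once);
console.log(search.search);                // "?q=hello%2520world"
console.log(search.searchParams.get('q')); // "hello%20world"
```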
// Bug: encoding an already-encoded string
const encoded = encodeURIComponent('%20');
console.log(encoded); // "%2520" — double-encoded!
// Fix: only encode raw user input, never pre-encoded values
const raw = ' ';
const correct = encodeURIComponent(raw);
console.log(correct); // "%20"
// Using URL API avoids this entirely — set raw values, it encodes once
const url = new URL('https://example.com/search');
url.searchParams.set('q', 'hello world'); // raw, no pre-encoding
console.log(url.toString()); // "...?q=hello+world"
Fragments are not sent to the server
The fragment (everything after #) is a client-side concept. Browsers strip it before sending the HTTP request. This means:
- Server-side code never sees the fragment, so it cannot read, validate, or log anything placed there. Client-side scripts and browser history can still see it, so it is not a safe place for secrets either.
- Fragments are also stripped from the Referer header when navigating between pages. This is intentional privacy protection.
- Single-page applications that use hash-based routing (#/route) rely on this behavior: the server always receives the same path regardless of the in-app route.
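The split between what the server receives and what stays in the client can be sketched as follows. The helpers `requestTarget` and `hashRoute` and the app URL are hypothetical; they only approximate what a browser and a hash-based router do.

```javascript
// What ends up in the HTTP request line: path and query, never the fragment.
function requestTarget(input) {
  const url = new URL(input);
  return url.pathname + url.search;
}

// What a hash-based router would read on the client side.
function hashRoute(input) {
  const { hash } = new URL(input);
  return hash.startsWith('#') ? hash.slice(1) : '/';
}

const page = 'https://app.example.com/index.html?lang=en#/settings/profile';
console.log(requestTarget(page)); // "/index.html?lang=en"
console.log(hashRoute(page));     // "/settings/profile"

// Without a fragment the sketch falls back to the root route.
console.log(hashRoute('https://app.example.com/index.html')); // "/"
```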
Host is case-insensitive, path is not
EXAMPLE.COM and example.com refer to the same host. DNS is case-insensitive. But /Page and /page are different paths on case-sensitive filesystems (Linux servers). This distinction matters for canonical URL normalization and SEO — pick one case convention and enforce it with redirects.
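One way to enforce a case convention is to compute a canonical URL and redirect when the request differs from it. The sketch below assumes a lowercase-path policy, which is only one possible convention; `canonicalUrl` and `redirectTarget` are hypothetical helpers, not part of any framework.

```javascript
const a = new URL('https://EXAMPLE.COM/Page');
const b = new URL('https://example.com/page');
console.log(a.hostname === b.hostname); // true  (host is lowercased on parse)
console.log(a.pathname === b.pathname); // false (path case is preserved)

// Assumed policy: lowercase paths are canonical.
function canonicalUrl(url) {
  const copy = new URL(url.href);
  copy.pathname = copy.pathname.toLowerCase();
  return copy;
}

// Returns the URL to redirect to, or null when the path is already canonical.
function redirectTarget(input) {
  const url = new URL(input);
  const canonical = canonicalUrl(url);
  return canonical.href === url.href ? null : canonical.href;
}

console.log(redirectTarget('https://EXAMPLE.COM/Page')); // "https://example.com/page"
console.log(redirectTarget('https://example.com/page')); // null
```

Lowercasing also changes the hex digits of any percent-encoded bytes in the path, which is equivalent under RFC 3986 but worth knowing if paths are compared byte for byte elsewhere.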
application/x-www-form-urlencoded vs RFC 3986
There are two encoding schemes that look similar but differ in one important way: how they encode a space character.
| Scheme | Space encoded as | Where used |
|---|---|---|
| RFC 3986 | %20 | URL paths, canonical URLs, HTTP headers, new URL() |
| application/x-www-form-urlencoded | + | HTML form submissions, URLSearchParams, query strings from forms |
URLSearchParams uses application/x-www-form-urlencoded, which is why params.toString() produces q=hello+world. If you copy that query string into a URL path segment and decode it with an RFC 3986 decoder, the + will remain a literal plus sign, not a space.
The safe rule: use new URL() and searchParams together and let the platform handle encoding. Only decode manually when you know which scheme was used to encode.
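The difference between the two schemes is easy to reproduce. In this sketch, `decodeFormComponent` is a hypothetical helper for the case where a form-encoded value has to be decoded by hand; everything else is standard JavaScript.

```javascript
// Encoding: the two schemes disagree on the space character.
console.log(encodeURIComponent('hello world'));                   // "hello%20world"
console.log(new URLSearchParams({ q: 'hello world' }).toString()); // "q=hello+world"

// Decoding: an RFC 3986 style decoder leaves + alone.
console.log(decodeURIComponent('hello+world'));                   // "hello+world"
console.log(new URLSearchParams('q=hello+world').get('q'));       // "hello world"

// Manual form decoding: turn + into space BEFORE percent-decoding.
function decodeFormComponent(value) {
  return decodeURIComponent(value.replace(/\+/g, ' '));
}
console.log(decodeFormComponent('1%2B1+%3D+2')); // "1+1 = 2"

// A literal plus sign survives because the form encoder writes it as %2B.
console.log(new URLSearchParams({ expr: '1+1' }).toString()); // "expr=1%2B1"
```

The order in `decodeFormComponent` matters: replacing + after percent-decoding would also turn a decoded %2B into a space.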
Internationalized Domain Names (IDN)
Domain names are limited to ASCII letters, digits, and hyphens by the DNS protocol. To support non-ASCII domain names (e.g., münchen.de), the IDNA standard (Internationalizing Domain Names in Applications) uses Punycode encoding. The non-ASCII label is converted to an ASCII-compatible encoding (ACE) prefixed with xn--:
münchen.de → xn--mnchen-3ya.de
中文.com → xn--fiq228c.com
Browsers display the Unicode form in the address bar but send the Punycode form in HTTP requests. When you parse a URL containing an IDN with new URL(), the hostname property returns the Punycode form.
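The Punycode conversion can be observed with the URL class, assuming a runtime with IDNA support (current browsers and recent Node.js releases). The path `/stadt` is made up for the example.

```javascript
// "\u00FC" is the letter ü, written as an escape to keep the source ASCII-only.
const idn = new URL('https://m\u00FCnchen.de/stadt');
console.log(idn.hostname); // "xn--mnchen-3ya.de"
console.log(idn.href);     // "https://xn--mnchen-3ya.de/stadt"

// Parsing the ASCII form directly yields the same host.
const ascii = new URL('https://xn--mnchen-3ya.de/stadt');
console.log(idn.hostname === ascii.hostname); // true
```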
Homograph attacks exploit IDN by registering domains that look visually identical to legitimate ones using characters from different Unicode scripts. For example, the Cyrillic letter "а" (U+0430) is visually identical to the Latin "a" (U+0061). A domain like pаypal.com (with a Cyrillic "а") renders identically in many fonts to paypal.com. Browsers mitigate this by showing the Punycode form for mixed-script domains.
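A rough idea of mixed-script detection is sketched below. This is a simplified heuristic that only looks for Latin and Cyrillic letters in the same label; real browsers apply more elaborate rules, and the helper names are hypothetical. It expects the Unicode form of the host, not the Punycode form returned by new URL().

```javascript
const LATIN = /\p{Script=Latin}/u;
const CYRILLIC = /\p{Script=Cyrillic}/u;

// True when a single label contains letters from both scripts.
function mixesLatinAndCyrillic(label) {
  return LATIN.test(label) && CYRILLIC.test(label);
}

// Check every dot-separated label of a Unicode hostname.
function hasSuspiciousLabel(unicodeHostname) {
  return unicodeHostname.split('.').some(mixesLatinAndCyrillic);
}

// "\u0430" is the Cyrillic letter that looks like a Latin "a".
console.log(hasSuspiciousLabel('p\u0430ypal.com')); // true
console.log(hasSuspiciousLabel('paypal.com'));      // false
```

A label written entirely in Cyrillic passes this check, so the heuristic catches only the mixed-script case described above.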
URL Parsing in Python, Go, and Java
The URL API patterns differ across languages, but the underlying RFC 3986 structure is the same.
Python — urllib.parse
from urllib.parse import urlparse, urlencode, parse_qs, quote
result = urlparse('https://user:pass@host:8080/path?q=hello+world#frag')
print(result.scheme) # 'https'
print(result.netloc) # 'user:pass@host:8080'
print(result.hostname) # 'host'
print(result.port) # 8080
print(result.path) # '/path'
print(result.query) # 'q=hello+world'
print(result.fragment) # 'frag'
# Parse query string into a dict
params = parse_qs(result.query)
print(params) # {'q': ['hello world']}
# Encode a path segment — use quote() not quote_plus()
safe_segment = quote('hello world/page', safe='')
print(safe_segment) # 'hello%20world%2Fpage'
Go — net/url
package main
import (
"fmt"
"net/url"
)
func main() {
u, err := url.Parse("https://host:8080/path?q=hello+world#frag")
if err != nil {
panic(err)
}
fmt.Println(u.Scheme) // "https"
fmt.Println(u.Host) // "host:8080"
fmt.Println(u.Path) // "/path"
fmt.Println(u.Fragment) // "frag"
// Query params — handles + and %20 both as space
q := u.Query()
fmt.Println(q.Get("q")) // "hello world"
// Build a URL safely
params := url.Values{}
params.Set("q", "hello world & more")
u.RawQuery = params.Encode()
fmt.Println(u.String()) // "https://host:8080/path?q=hello+world+%26+more#frag"
}
Java — java.net.URI
import java.net.URI;
import java.net.URISyntaxException;
public class URLExample {
public static void main(String[] args) throws URISyntaxException {
URI uri = new URI("https://host:8080/path?q=hello%20world#frag");
System.out.println(uri.getScheme()); // "https"
System.out.println(uri.getHost()); // "host"
System.out.println(uri.getPort()); // 8080
System.out.println(uri.getPath()); // "/path"
System.out.println(uri.getQuery()); // "q=hello world" (decoded)
System.out.println(uri.getRawQuery()); // "q=hello%20world" (raw)
System.out.println(uri.getFragment()); // "frag"
// Note: java.net.URL's equals() and hashCode() may resolve hostnames over the network; prefer URI for parsing
}
}
In Java, prefer java.net.URI over java.net.URL for parsing. URL's equals() and hashCode() methods may resolve hostnames through DNS, so comparing URLs or using them as map keys can block on network lookups. URI compares by syntax only and avoids that surprise.
Inspect any URL with the URL Parser to see each component highlighted. For query strings, try the Query String Parser. To percent-encode or decode individual values, use the URL Encoder / Decoder.
For a deeper look at encoding edge cases including surrogate pairs and non-BMP characters, see URL Encoding Edge Cases. If you work with multi-step web workflows, the Web Payload Workflow guide covers how encoding interacts with HTTP bodies and headers end-to-end.