
URL Parser Guide: Anatomy, Encoding Rules, and the JavaScript URL API

10 min read

A URL looks simple until it breaks in production. Wrong percent-encoding in a path segment, a fragment silently dropped by a reverse proxy, a query string with spaces encoded as + instead of %20 — these are real bugs. Use the URL Parser to inspect any URL component by component as you follow along. For query strings specifically, the Query String Parser and URL Encoder / Decoder are also useful companions.

URL Anatomy

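RFC 3986 defines the generic URI syntax. At the top level a URL has five components: scheme, authority, path, query, and fragment. Splitting the authority into userinfo, host, and port gives the seven parts labeled in this example, which has every part present: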

https://user:pass@host:8080/path/to/page?query=value#fragment

  scheme    https
  userinfo  user:pass
  host      host
  port      8080
  path      /path/to/page
  query     query=value
  fragment  fragment

Together, user:pass@host:8080 is called the authority. It combines optional userinfo, the host, and an optional port. The authority always follows the double slash (//) after the scheme.

Component-by-Component Breakdown

| Component | Example | Notes |
| --- | --- | --- |
| Scheme | https | Case-insensitive. Followed by :. Common: http, https, ftp, mailto, file. |
| Userinfo | user:pass | Deprecated for HTTP/HTTPS. Credentials in URLs are a security risk; avoid them. Followed by @. |
| Host | host | Case-insensitive for DNS names. IPv6 addresses must be wrapped in brackets: [::1]. |
| Port | 8080 | Omit when using the default port for the scheme (443 for HTTPS, 80 for HTTP). Preceded by :. |
| Path | /path/to/page | Case-sensitive on most servers. Segments separated by /. Slashes within a segment must be percent-encoded. |
| Query | query=value | Preceded by ?. Key-value pairs separated by &. No standardized structure beyond that; servers interpret it. |
| Fragment | fragment | Preceded by #. Never sent to the server. Handled entirely by the browser or client. |
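
As a quick check of the authority rows in the table, the sketch below reads them through the standard URL class. The hosts and credentials are placeholders, not values from a real service.

```javascript
// Placeholder URL containing every authority part.
const full = new URL('https://user:pass@host:8080/path/to/page');
console.log(full.username); // "user"
console.log(full.password); // "pass"
console.log(full.host);     // "host:8080" (hostname plus port)

// A default port is dropped during parsing.
const defaultPort = new URL('https://example.com:443/');
console.log(defaultPort.port); // ""
console.log(defaultPort.host); // "example.com"

// IPv6 literals keep their brackets in hostname.
const ipv6 = new URL('http://[::1]:3000/');
console.log(ipv6.hostname); // "[::1]"
console.log(ipv6.port);     // "3000"
```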

Reserved vs Unreserved Characters (RFC 3986)

RFC 3986 defines reserved and unreserved character sets. Together with everything outside those sets, that gives four groups, which determine whether a character needs to be percent-encoded:

| Category | Characters | Encoding rule |
| --- | --- | --- |
| Unreserved | A–Z a–z 0–9 - _ . ~ | Do not encode these. The percent-encoded form is equivalent, but it is unnecessary and makes the URL longer. |
| Reserved (general delimiters) | : / ? # [ ] @ | These delimit URL components. Percent-encode them when they appear as data rather than as delimiters. |
| Reserved (sub-delimiters) | ! $ & ' ( ) * + , ; = | Allowed unencoded in some positions; must be encoded in others (e.g., & inside a query value). |
| Everything else | Spaces, non-ASCII, control chars | Always percent-encode: convert each UTF-8 byte to %XX (uppercase hex). |

Percent-encoding takes the byte value and formats it as % followed by two uppercase hex digits. A space is %20, a euro sign is %E2%82%AC (three UTF-8 bytes).
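
To make the byte-level rule concrete, here is a minimal sketch. The helper name percentEncodeEveryByte is made up for illustration; it encodes every byte, including unreserved ones, which a real encoder would leave alone.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeEveryByte(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(
    bytes,
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeEveryByte(' ')); // "%20"
console.log(percentEncodeEveryByte('€')); // "%E2%82%AC"

// The built-in encoder produces the same bytes but skips unreserved characters.
console.log(encodeURIComponent('€'));         // "%E2%82%AC"
console.log(encodeURIComponent('a-b_c.d~e')); // "a-b_c.d~e"
```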

Encoding Rules Differ Per Component

This is where most encoding bugs come from. The set of characters that must be encoded is different in each URL component. A slash separates segments in a path, but it is ordinary data in a query string or a fragment.

| Component | Characters that may stay literal | Characters that must be encoded |
| --- | --- | --- |
| Path segment | Unreserved, plus ! $ & ' ( ) * + , ; = : @ | / (it would split the segment), ?, #, spaces, non-ASCII |
| Query key or value | Unreserved, plus ! $ ' ( ) * , ; : @ / ? | =, & (they delimit pairs), #, spaces, non-ASCII; also + (as %2B), because form decoders read a literal + as a space |
| Fragment | Unreserved, plus all sub-delimiters and : @ / ? | Spaces and non-ASCII (browsers are lenient here, but RFC 3986 requires encoding) |

Key insight: a path segment can contain = and & without encoding, because those characters have no special meaning in paths. But a query value containing & must be encoded as %26 or the parser will split the pair at that character.
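
The difference is easy to see with the URL class. This is a small sketch using placeholder hosts and a hypothetical user-supplied file name; encodeURIComponent is used for the path segment because it encodes the slash.

```javascript
// A slash inside a path segment must be encoded, or it splits the segment.
const fileName = 'reports/2024 q1'; // hypothetical user-supplied name
const fileUrl = new URL('https://example.com/files/' + encodeURIComponent(fileName));
console.log(fileUrl.pathname); // "/files/reports%2F2024%20q1"

// "=" and "&" are plain data in a path.
const pathUrl = new URL('https://example.com/k=v&x/page');
console.log(pathUrl.pathname); // "/k=v&x/page"

// In a query, a raw "&" ends the value early; "%26" does not.
const rawAmp = new URL('https://example.com/?q=a&b&c=d');
console.log(rawAmp.searchParams.get('q')); // "a"
console.log(rawAmp.searchParams.has('b')); // true

const encodedAmp = new URL('https://example.com/?q=a%26b&c=d');
console.log(encodedAmp.searchParams.get('q')); // "a&b"
```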

JavaScript URL API

Modern JavaScript provides a built-in URL class that correctly handles parsing, construction, and encoding. Prefer it over manual string manipulation.

// Parse a URL
const url = new URL('https://host:8080/path/to/page?query=value#fragment');

console.log(url.protocol);  // "https:"
console.log(url.hostname);  // "host"
console.log(url.port);      // "8080"
console.log(url.pathname);  // "/path/to/page"
console.log(url.search);    // "?query=value"
console.log(url.hash);      // "#fragment"

// searchParams gives you a structured view of the query string
console.log(url.searchParams.get('query'));  // "value"

// Build a URL safely — no manual encoding needed
const built = new URL('https://api.example.com/search');
built.searchParams.set('q', 'hello world & more');
built.searchParams.set('page', '2');
console.log(built.toString());
// "https://api.example.com/search?q=hello+world+%26+more&page=2"

Relative URL pitfall: new URL('/path') throws a TypeError because a relative URL has no base. You must provide a base:

// Wrong: throws a TypeError (the exact message varies by runtime)
const bad = new URL('/api/users');

// Correct — provide a base
const good = new URL('/api/users', 'https://example.com');
console.log(good.toString());  // "https://example.com/api/users"

// Or use the current page's URL in the browser
const relative = new URL('/api/users', window.location.href);
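
When the input may or may not be absolute, one option is to wrap the constructor. The sketch below uses a hypothetical helper, parseUrlOrNull, and also shows how relative references resolve against a base.

```javascript
// Hypothetical helper: return a URL, or null when the input cannot be parsed.
function parseUrlOrNull(input, base) {
  try {
    return new URL(input, base);
  } catch {
    return null; // new URL() threw a TypeError
  }
}

console.log(parseUrlOrNull('/api/users')); // null (no base)
console.log(parseUrlOrNull('/api/users', 'https://example.com').href);
// "https://example.com/api/users"

// Relative references resolve against the base path.
console.log(parseUrlOrNull('../x?y=1', 'https://example.com/a/b/c').href);
// "https://example.com/a/x?y=1"

// An absolute input ignores the base entirely.
console.log(parseUrlOrNull('https://other.example/z', 'https://example.com').href);
// "https://other.example/z"
```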

URLSearchParams is also available standalone for working with query strings without a full URL:

const params = new URLSearchParams('a=1&b=hello+world&c=%E2%82%AC');

console.log(params.get('a'));   // "1"
console.log(params.get('b'));   // "hello world"  (+ decoded as space)
console.log(params.get('c'));   // "€"            (percent-decoded)

// Iterate all pairs
for (const [key, value] of params) {
  console.log(key, value);
}

// Serialize back
params.append('d', 'new value');
console.log(params.toString());  // "a=1&b=hello+world&c=%E2%82%AC&d=new+value"
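
A query string may repeat a key, and get() only returns the first value. The following sketch, with made-up keys, shows the methods that deal with repeats.

```javascript
const filters = new URLSearchParams('tag=js&tag=url&page=1');
console.log(filters.get('tag'));    // "js" (first match only)
console.log(filters.getAll('tag')); // ["js", "url"]
console.log(filters.has('page'));   // true

// set() replaces every existing value for the key; append() adds another.
filters.set('tag', 'api');
filters.append('page', '2');
console.log(filters.toString()); // "tag=api&page=1&page=2"
```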

Common Bugs

Double-encoding

Double-encoding happens when you percent-encode a string that has already been percent-encoded. A space becomes %20 after one pass; encoding again turns the % into %25, giving %2520 — which decodes to the literal string %20, not a space.

// Bug: encoding an already-encoded string
const encoded = encodeURIComponent('%20');
console.log(encoded);  // "%2520" — double-encoded!

// Fix: only encode raw user input, never pre-encoded values
const raw = ' ';
const correct = encodeURIComponent(raw);
console.log(correct);  // "%20"

// Using URL API avoids this entirely — set raw values, it encodes once
const url = new URL('https://example.com/search');
url.searchParams.set('q', 'hello world');  // raw, no pre-encoding
console.log(url.toString());  // "...?q=hello+world"
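
The round trip below illustrates why a double-encoded value needs two decode passes, and that searchParams.set() will happily double-encode if it is handed a value that was already encoded. The variable names are illustrative only.

```javascript
const once = encodeURIComponent('hello world'); // "hello%20world"
const twice = encodeURIComponent(once);         // "hello%2520world"

console.log(decodeURIComponent(twice));                     // "hello%20world"
console.log(decodeURIComponent(decodeURIComponent(twice))); // "hello world"

// searchParams.set() expects raw text, so a pre-encoded value is encoded again.
const search = new URL('https://example.com/search');
search.searchParams.set('q', once);
console.log(search.search); // "?q=hello%2520world"
```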

Fragments are not sent to the server

The fragment (everything after #) is a client-side concept. Browsers strip it before sending the HTTP request. This means:

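  • Server-side code never sees the fragment, so do not put anything there that the server needs. It is not a safe place for secrets either: the fragment stays in browser history and is readable by any script on the page.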
  • Fragments are also stripped from the Referer header when navigating between pages. This is intentional privacy protection.
  • Single-page applications that use hash-based routing (#/route) rely on this behavior — the server always receives the same path regardless of the in-app route.
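
On the client, the fragment is available as url.hash. The sketch below, using a placeholder hash-routed URL, separates it from the part a server would actually receive.

```javascript
const pageUrl = new URL('https://example.com/app?tab=1#/settings/profile');
console.log(pageUrl.hash); // "#/settings/profile"

// Approximate what goes on the request line: path plus query, no fragment.
const requestTarget = pageUrl.pathname + pageUrl.search;
console.log(requestTarget); // "/app?tab=1"

// Clearing hash gives a fragment-free URL, e.g. for logging or cache keys.
const withoutFragment = new URL(pageUrl.href);
withoutFragment.hash = '';
console.log(withoutFragment.href); // "https://example.com/app?tab=1"
```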

Host is case-insensitive, path is not

EXAMPLE.COM and example.com refer to the same host. DNS is case-insensitive. But /Page and /page are different paths on case-sensitive filesystems (Linux servers). This distinction matters for canonical URL normalization and SEO — pick one case convention and enforce it with redirects.
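
The URL class applies the case-insensitive part of this for you and leaves the path alone. The canonicalize function below is a hypothetical example of a lowercase-path policy; that policy is a choice for your own site, not something the parser enforces.

```javascript
const mixed = new URL('HTTPS://EXAMPLE.COM/Page?Q=1');
console.log(mixed.protocol); // "https:"
console.log(mixed.hostname); // "example.com"
console.log(mixed.pathname); // "/Page" (unchanged)
console.log(mixed.href);     // "https://example.com/Page?Q=1"

// Hypothetical site policy: lowercase every path before comparing or redirecting.
function canonicalize(input) {
  const u = new URL(input);
  u.pathname = u.pathname.toLowerCase();
  return u.href;
}

console.log(canonicalize('https://EXAMPLE.com/Page')); // "https://example.com/page"
```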

application/x-www-form-urlencoded vs RFC 3986

There are two encoding schemes that look similar but differ in one important way: how they encode a space character.

| Scheme | Space encoded as | Where used |
| --- | --- | --- |
| RFC 3986 | %20 | URL paths, canonical URLs, HTTP headers, new URL() |
| application/x-www-form-urlencoded | + | HTML form submissions, URLSearchParams, query strings from forms |

URLSearchParams uses application/x-www-form-urlencoded, which is why params.toString() produces q=hello+world. If you copy that query string into a URL path segment and decode it with an RFC 3986 decoder, the + stays a literal plus sign instead of becoming a space.

The safe rule: use new URL() and searchParams together and let the platform handle encoding. Only decode manually when you know which scheme was used to encode.
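
The mismatch between the two schemes can be reproduced in a few lines. This sketch uses only built-in functions; the keys q and expr are arbitrary.

```javascript
console.log(encodeURIComponent('hello world'));                    // "hello%20world"
console.log(new URLSearchParams({ q: 'hello world' }).toString()); // "q=hello+world"

// decodeURIComponent does not treat "+" as a space...
console.log(decodeURIComponent('hello+world'));             // "hello+world"
// ...but URLSearchParams does.
console.log(new URLSearchParams('q=hello+world').get('q')); // "hello world"

// A literal plus survives the round trip only as %2B.
const sum = new URLSearchParams({ expr: '1+1' });
console.log(sum.toString());                                  // "expr=1%2B1"
console.log(new URLSearchParams(sum.toString()).get('expr')); // "1+1"
```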

Internationalized Domain Names (IDN)

Domain names are limited to ASCII letters, digits, and hyphens by the DNS protocol. To support non-ASCII domain names (e.g., münchen.de), the IDNA standard (Internationalizing Domain Names in Applications) uses Punycode encoding. The non-ASCII label is converted to an ASCII-compatible encoding (ACE) prefixed with xn--:

münchen.de → xn--mnchen-3ya.de
中文.com → xn--fiq228c.com

Browsers display the Unicode form in the address bar but send the Punycode form in HTTP requests. When you parse a URL containing an IDN with new URL(), the hostname property returns the Punycode form.

Homograph attacks exploit IDN by registering domains that look visually identical to legitimate ones using characters from different Unicode scripts. For example, the Cyrillic letter "а" (U+0430) is visually identical to the Latin "a" (U+0061). A domain like pаypal.com (with a Cyrillic "а") renders identically in many fonts to paypal.com. Browsers mitigate this by showing the Punycode form for mixed-script domains.
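
A rough way to surface such hostnames in your own code is to look for the xn-- prefix after parsing. The helper hasPunycodeLabel below is hypothetical and only a first-pass signal: plenty of legitimate domains are internationalized, so a match means "look closer", not "malicious". It assumes a runtime whose URL parser performs IDNA conversion, as browsers and standard Node.js builds do.

```javascript
const idn = new URL('https://münchen.de/');
console.log(idn.hostname); // "xn--mnchen-3ya.de"

// Hypothetical check: does any label of the hostname use the ACE prefix?
function hasPunycodeLabel(input) {
  const { hostname } = new URL(input);
  return hostname.split('.').some((label) => label.startsWith('xn--'));
}

console.log(hasPunycodeLabel('https://paypal.com/'));      // false
console.log(hasPunycodeLabel('https://p\u0430ypal.com/')); // true (U+0430 is Cyrillic)
```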

URL Parsing in Python, Go, and Java

The URL API patterns differ across languages, but the underlying RFC 3986 structure is the same.

Python — urllib.parse

from urllib.parse import urlparse, parse_qs, quote

result = urlparse('https://user:pass@host:8080/path?q=hello+world#frag')
print(result.scheme)    # 'https'
print(result.netloc)    # 'user:pass@host:8080'
print(result.hostname)  # 'host'
print(result.port)      # 8080
print(result.path)      # '/path'
print(result.query)     # 'q=hello+world'
print(result.fragment)  # 'frag'

# Parse query string into a dict
params = parse_qs(result.query)
print(params)  # {'q': ['hello world']}

# Encode a path segment — use quote() not quote_plus()
safe_segment = quote('hello world/page', safe='')
print(safe_segment)  # 'hello%20world%2Fpage'

Go — net/url

package main

import (
    "fmt"
    "net/url"
)

func main() {
    u, err := url.Parse("https://host:8080/path?q=hello+world#frag")
    if err != nil {
        panic(err)
    }

    fmt.Println(u.Scheme)   // "https"
    fmt.Println(u.Host)     // "host:8080"
    fmt.Println(u.Path)     // "/path"
    fmt.Println(u.Fragment) // "frag"

    // Query params — handles + and %20 both as space
    q := u.Query()
    fmt.Println(q.Get("q"))  // "hello world"

    // Build a URL safely
    params := url.Values{}
    params.Set("q", "hello world & more")
    u.RawQuery = params.Encode()
    fmt.Println(u.String())  // "https://host:8080/path?q=hello+world+%26+more#frag" (the parsed fragment is still attached)
}

Java — java.net.URI

import java.net.URI;
import java.net.URISyntaxException;

public class URLExample {
    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("https://host:8080/path?q=hello%20world#frag");

        System.out.println(uri.getScheme());    // "https"
        System.out.println(uri.getHost());      // "host"
        System.out.println(uri.getPort());      // 8080
        System.out.println(uri.getPath());      // "/path"
        System.out.println(uri.getQuery());     // "q=hello world"  (decoded)
        System.out.println(uri.getRawQuery());  // "q=hello%20world" (raw)
        System.out.println(uri.getFragment());  // "frag"

        // Note: java.net.URL's equals() and hashCode() may resolve host names; prefer URI for parsing
    }
}

In Java, prefer java.net.URI over java.net.URL for parsing. The URL class's equals() and hashCode() methods can trigger DNS lookups to compare hosts, which is slow and can give different answers depending on the network. URI is a pure parser and never touches the network.


Inspect any URL with the URL Parser to see each component highlighted. For query strings, try the Query String Parser. To percent-encode or decode individual values, use the URL Encoder / Decoder.

For a deeper look at encoding edge cases including surrogate pairs and non-BMP characters, see URL Encoding Edge Cases. If you work with multi-step web workflows, the Web Payload Workflow guide covers how encoding interacts with HTTP bodies and headers end-to-end.