
Slugify Guide: Unicode Normalization, Transliteration, and Collision Handling

8 min read

A URL slug is the human-readable segment at the end of a path: /posts/my-great-article. Good slugs are lowercase ASCII, use hyphens as word separators, carry no consecutive hyphens, and stay under 100 characters. Getting there from arbitrary user input — accented titles, Cyrillic headings, emoji-laden blog posts — requires a deliberate pipeline. Try the Slugify Tool to follow along with the examples below.

What Makes a Good Slug

Before touching code, agree on the invariants. A production-grade slug must be:

  • Lowercase ASCII — browsers and CDNs treat paths case-sensitively on most systems; sticking to lowercase removes the ambiguity entirely.
  • Hyphens only — underscores are valid in URLs but look like spaces in underlined links; hyphens are the universal convention and are treated as word separators by Google.
  • No consecutive hyphens: my--article is ugly and signals a naive implementation. Collapse every run of hyphens to a single one.
  • No leading or trailing hyphens — strip them after collapsing.
  • 100 characters or fewer — longer slugs get truncated in browser tabs, social previews, and some CMS database columns.
  • Readable — the slug should hint at the content. A slug like x4k9b2m tells users and crawlers nothing.
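Most of these invariants can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named isValidSlug; readability is a human judgment and is not covered.

```javascript
// Hypothetical helper: lowercase alphanumeric words joined by single hyphens,
// with no leading or trailing hyphen.
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function isValidSlug(slug, maxLen = 100) {
  return slug.length > 0 && slug.length <= maxLen && SLUG_PATTERN.test(slug);
}

isValidSlug("my-great-article"); // true
isValidSlug("my--article");      // false (consecutive hyphens)
isValidSlug("-my-article");      // false (leading hyphen)
isValidSlug("My-Article");       // false (uppercase)
```

A check like this is useful as a test assertion on whatever slug generator you end up with, rather than as the generator itself.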

The Unicode Problem

The first failure most developers hit: titles with diacritics. "naïve résumé" contains characters outside ASCII. A naive .toLowerCase().replace(/\s+/g, "-") produces naïve-résumé — percent-encoded in the browser as na%C3%AFve-r%C3%A9sum%C3%A9. That is valid but ugly, and breaks copy-pasted URLs in some older clients.

The correct approach is NFD normalization followed by combining-mark removal:

// NFD decomposes composed characters into base + combining marks
// U+0300 to U+036F is the Unicode "Combining Diacritical Marks" block
function removeDiacritics(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

removeDiacritics("naïve résumé"); // "naive resume"

NFD (Canonical Decomposition) splits a precomposed character like ï (U+00EF) into a base letter i (U+0069) plus a combining diaeresis (U+0308). The regex then strips the combining marks in the range U+0300 to U+036F, leaving only base letters. NFC (the form normalize() uses when called without an argument) would not help here because it recomposes the base letter and the mark into a single character.
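To see the decomposition directly, you can print the code points before and after normalizing. This is a small sketch; codePoints is a hypothetical helper name. It also shows the limit of the technique: letters such as ß and ø have no canonical decomposition, so NFD leaves them untouched.

```javascript
// List the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

codePoints("\u00EF");                  // ["U+00EF"]            (ï, precomposed)
codePoints("\u00EF".normalize("NFD")); // ["U+0069", "U+0308"]  (i + combining diaeresis)

// No canonical decomposition: these pass through NFD unchanged
codePoints("\u00DF".normalize("NFD")); // ["U+00DF"]  (ß)
codePoints("\u00F8".normalize("NFD")); // ["U+00F8"]  (ø)
```

Characters in that second group need an explicit replacement table, which is one reason to prefer a transliteration library over a hand-rolled regex.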

See also: URL Encoding Edge Cases and URL Parser Guide for how the browser handles non-ASCII path segments.

Transliteration Beyond Latin

NFD only helps with Latin-script diacritics. Cyrillic, Greek, Arabic, and CJK characters have no Latin base letter to fall back on — they need transliteration (script-to-script mapping).

| Script | Example input | Transliterated |
| --- | --- | --- |
| Cyrillic | Привет мир | privet-mir |
| Greek | Ελληνικά | ellinika |
| German umlauts | Über Straße | uber-strasse |
| Chinese | 你好世界 | ni-hao-shi-jie (pinyin) |
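To make "script-to-script mapping" concrete, here is a toy sketch. The table and the transliterate function are hypothetical and deliberately partial: they cover only the seven Cyrillic letters in the example above, whereas a real table needs an entry for every letter of every supported script.

```javascript
// Deliberately partial, hypothetical mapping: just enough Cyrillic for the example.
const CYRILLIC_TO_LATIN = new Map([
  ["п", "p"], ["р", "r"], ["и", "i"], ["в", "v"],
  ["е", "e"], ["т", "t"], ["м", "m"],
]);

function transliterate(text, table = CYRILLIC_TO_LATIN) {
  // Lowercase first so the table only needs lowercase keys;
  // characters without an entry pass through unchanged.
  return Array.from(text.toLowerCase(), (ch) => table.get(ch) ?? ch).join("");
}

transliterate("Привет мир").replace(/\s+/g, "-"); // "privet-mir"
```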

Handling this yourself is tedious. Reach for a library instead. Three good options across ecosystems:

  • JavaScript: @sindresorhus/slugify. Opinionated; handles most Latin-extended characters and common symbol replacements.
  • Python: python-slugify. Builds on text-unidecode by default (Unidecode is an optional extra) for broad transliteration coverage including Cyrillic and CJK.
  • Ruby: Stringex. The oldest and most battle-tested option in the Rails ecosystem, with locale-aware transliteration tables.

npm install @sindresorhus/slugify
import slugify from '@sindresorhus/slugify';

slugify('Ülrich von Straßen');   // 'uelrich-von-strassen' (German umlauts expand to ae/oe/ue by default)
slugify('naïve résumé');         // 'naive-resume'
slugify('Hello World! (2026)');  // 'hello-world-2026'

Stop-Word Removal Tradeoffs

Some teams strip common words ("the", "a", "and", "of", "in") from slugs before generation. The argument: /blog/best-guide-to-javascript is shorter and cleaner than /blog/the-best-guide-to-javascript.

The counterarguments are stronger than they look:

  • Stop-word lists are language-specific. A word that is filler in one language can carry meaning in another: "die" is an article in German but a verb in English.
  • Removing words can create collisions. "The Best Guide" and "Best Guide" both produce best-guide.
  • Google treats hyphens as word separators and reads the full slug. Removing "the" gives minimal SEO benefit and fragments your URL history when you change the rule later.

Recommendation: skip stop-word removal unless you are building a very constrained system (e.g., product SKU slugs with a fixed vocabulary). A short slug beats a "perfect" slug — but shorter usually means truncating at a word boundary, not stripping words from the middle.
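The collision risk is easy to reproduce. The sketch below assumes a hypothetical English-only stop-word list and a simplified slug function (ASCII input only); it is meant to illustrate the failure mode, not to be a recommended implementation.

```javascript
// Hypothetical English-only stop-word list, for illustration.
const STOP_WORDS = new Set(["the", "a", "and", "of", "in"]);

function slugWithoutStopWords(title) {
  return title
    .toLowerCase()
    .split(/[^a-z0-9]+/) // split on anything that is not a letter or digit
    .filter((word) => word && !STOP_WORDS.has(word))
    .join("-");
}

slugWithoutStopWords("The Best Guide"); // "best-guide"
slugWithoutStopWords("Best Guide");     // "best-guide" (different title, same slug)
slugWithoutStopWords("A and the");      // ""           (nothing left at all)
```

The last call shows a second hazard: a title made up entirely of stop words collapses to an empty slug.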

Collision Handling

Two posts titled "My Great Article" produce the same slug. Your database unique constraint will catch this — but your application needs a strategy for resolving it before inserting.

The sequential counter approach is the most common:

async function uniqueSlug(base, checkExists) {
  if (!(await checkExists(base))) return base;
  let n = 2;
  while (await checkExists(`${base}-${n}`)) n++;
  return `${base}-${n}`;
}

// "my-great-article"    → already taken
// "my-great-article-2"  → already taken
// "my-great-article-3"  → available, use this
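Checking and then inserting is racy: two concurrent requests can both see a slug as free. One possible way around this, sketched below, is to attempt the insert and let the unique constraint arbitrate. The names are assumptions: insert(slug) is a hypothetical callback that performs the INSERT and rejects with an error whose code is "23505" (the Postgres SQLSTATE for unique_violation) when the slug is taken. Adjust the check to whatever your driver reports.

```javascript
// Sketch: try successive candidates and retry only on a unique violation.
async function insertWithUniqueSlug(base, insert, maxAttempts = 50) {
  for (let n = 1; n <= maxAttempts; n++) {
    const candidate = n === 1 ? base : `${base}-${n}`;
    try {
      await insert(candidate); // hypothetical callback, see lead-in
      return candidate;
    } catch (err) {
      if (err.code !== "23505") throw err; // unrelated failure: do not swallow it
    }
  }
  throw new Error(`No free slug for "${base}" after ${maxAttempts} attempts`);
}
```

The attempt cap keeps a misbehaving callback from looping forever; the value 50 is arbitrary.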

The sequential counter is predictable but leaks information (visitors can enumerate all posts with the same title). A short random suffix avoids this:

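import { randomInt } from 'crypto';

// Lowercase alphanumerics only: base64url would add uppercase letters, "-" and "_"
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz0123456789';

function randomSuffix(length = 4) {
  return Array.from({ length }, () => ALPHABET[randomInt(ALPHABET.length)]).join('');
}

// "my-great-article-x4k9"
// "my-great-article-b2mq"

Random suffixes are non-guessable and stay short (4 characters from a 36-symbol lowercase alphabet give 36^4, about 1.68 million, combinations), but they are opaque. The space is small enough that you should still check for collisions before inserting. Prefer sequential counters for public content and random suffixes for private or user-generated content where enumeration matters.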

Length Strategies

Truncating at a hard character limit (slug.slice(0, 100)) risks cutting in the middle of a word: my-great-artic. Always truncate at a word boundary:

function truncateAtWord(slug, maxLen = 100) {
  if (slug.length <= maxLen) return slug;
  // If the cut lands exactly on a hyphen, the last word is complete: keep it
  if (slug[maxLen] === '-') return slug.slice(0, maxLen);
  const truncated = slug.slice(0, maxLen);
  const lastHyphen = truncated.lastIndexOf('-');
  return lastHyphen > 0 ? truncated.slice(0, lastHyphen) : truncated;
}

truncateAtWord('the-quick-brown-fox-jumps-over-the-lazy-dog', 30);
// "the-quick-brown-fox-jumps-over"

For content that must be stable under title edits, store a deterministic hash suffix alongside the slug: my-great-article-a1b2c3. The hash is derived from a stable ID (e.g., database row ID or UUID), not the title, so it never changes even if the title does — and you still have an opaque uniqueness guarantee.
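A minimal sketch of the idea follows. It uses a small inline FNV-1a hash so the example has no dependencies; that choice is an assumption made for illustration, and a cryptographic hash of the ID would work the same way. The function names are hypothetical.

```javascript
// 32-bit FNV-1a over the characters of the ID string (non-cryptographic).
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// The suffix depends only on the stable ID, never on the title.
function slugWithStableSuffix(titleSlug, id, suffixLen = 6) {
  const suffix = fnv1a(String(id)).toString(16).padStart(8, "0").slice(0, suffixLen);
  return `${titleSlug}-${suffix}`;
}

const before = slugWithStableSuffix("my-great-article", 42);
const after = slugWithStableSuffix("my-even-better-article", 42);
// Both end in the same 6-character suffix because the ID did not change.
```

A truncated hash can still collide across different IDs, so keep the unique constraint in place.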

Slugs vs IDs in URLs

There are three common URL patterns for content:

| Pattern | Example | Lookup key |
| --- | --- | --- |
| Slug-only | /posts/my-great-article | slug column (must be unique) |
| ID-first | /posts/42-my-great-article | numeric prefix (slug is decorative) |
| ID-only | /posts/42 | numeric ID |

The ID-first pattern (/posts/42-my-great-article) is widely used; Stack Overflow, for example, uses a close variant that puts the ID in its own path segment before the slug. The server extracts the numeric prefix, ignores the slug entirely for routing, and redirects if the slug portion is stale. This makes renaming a title zero-risk: old links with the old slug still work because the ID is the real key.
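The routing logic can be sketched in a few lines. Everything here is hypothetical: parseIdFirst, resolveIdFirst, the /posts/ prefix, and the findSlugById callback stand in for your router and data layer, and the returned objects are plain descriptions of the response rather than a real framework API.

```javascript
// Split "42-my-great-article" into its numeric ID and decorative slug.
function parseIdFirst(segment) {
  const match = /^(\d+)(?:-(.*))?$/.exec(segment);
  if (!match) return null;
  return { id: Number(match[1]), slug: match[2] ?? "" };
}

// findSlugById(id) is a hypothetical lookup returning the post's current slug,
// or undefined when no such post exists.
function resolveIdFirst(segment, findSlugById) {
  const parsed = parseIdFirst(segment);
  if (!parsed) return { status: 404 };
  const currentSlug = findSlugById(parsed.id);
  if (currentSlug === undefined) return { status: 404 };
  if (parsed.slug !== currentSlug) {
    // Stale or missing slug: redirect to the canonical URL
    return { status: 301, location: `/posts/${parsed.id}-${currentSlug}` };
  }
  return { status: 200, id: parsed.id };
}
```

Only the ID is used for the lookup; the slug is compared afterwards purely to decide whether a redirect is needed.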

Slug-only URLs are cleaner and better for SEO (no numeric noise), but they require that slugs never change — or that you maintain a historical slug redirect table.

Changing Slugs

If you choose slug-only URLs and allow title edits, you must handle slug changes. Never silently swap the slug — that breaks every external link pointing at the old URL.

The correct strategy:

  • Store a slug_history table (or array column) alongside the current slug.
  • When the slug changes, append the old slug to the history and issue a permanent 301 redirect from the old path to the new one.
  • Canonicalize in your <head> to the current slug so crawlers update their index.

-- Minimal slug history table
CREATE TABLE slug_redirects (
  old_slug   TEXT NOT NULL,
  post_id    BIGINT NOT NULL REFERENCES posts(id),
  created_at TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (old_slug)
);
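The lookup order at request time might look like the sketch below. The two Map objects are in-memory stand-ins for the posts table and the slug_redirects table, and resolveSlug is a hypothetical name; a real implementation would run two queries instead.

```javascript
// In-memory stand-ins for the two tables.
const currentSlugs = new Map([[7, "my-renamed-article"]]); // post_id → current slug
const slugRedirects = new Map([["my-great-article", 7]]);  // old_slug → post_id

function resolveSlug(slug) {
  // 1. Current slug: serve the post
  for (const [postId, current] of currentSlugs) {
    if (current === slug) return { status: 200, postId };
  }
  // 2. Historical slug: redirect to wherever the post lives now
  const postId = slugRedirects.get(slug);
  if (postId !== undefined && currentSlugs.has(postId)) {
    return { status: 301, location: `/posts/${currentSlugs.get(postId)}` };
  }
  // 3. Unknown slug
  return { status: 404 };
}
```

Because the history row points at the post ID rather than at the next slug, a post renamed several times still redirects in a single hop.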

Code Examples

Custom slugify in JavaScript (no library)

function slugify(input, maxLen = 100) {
  return input
    .normalize("NFD")                     // decompose diacritics
    .replace(/[\u0300-\u036f]/g, "")      // strip combining marks
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s_-]/g, "")        // keep only alphanumerics, whitespace, underscores, hyphens
    .replace(/[\s_]+/g, "-")              // whitespace/underscores → hyphens
    .replace(/-{2,}/g, "-")               // collapse consecutive hyphens
    .replace(/^-+|-+$/g, "")              // strip leading/trailing hyphens
    .slice(0, maxLen)
    .replace(/-+$/, "");                  // re-strip trailing hyphen after slice
}

Library-based (recommended for production)

import slugify from '@sindresorhus/slugify';

const slug = slugify(title, {
  separator: '-',
  lowercase: true,
  decamelize: false,
  customReplacements: [['&', 'and']],
});

Postgres generated column

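-- Requires the unaccent extension for diacritic stripping
CREATE EXTENSION IF NOT EXISTS unaccent;

-- unaccent() is only STABLE, and generated columns require IMMUTABLE expressions,
-- so wrap it. Assumes the extension is installed in the public schema.
CREATE FUNCTION immutable_unaccent(text) RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
  AS $$ SELECT public.unaccent('public.unaccent', $1) $$;

CREATE TABLE posts (
  id      BIGSERIAL PRIMARY KEY,
  title   TEXT NOT NULL,
  slug    TEXT GENERATED ALWAYS AS (
    lower(
      regexp_replace(
        regexp_replace(
          immutable_unaccent(title),
          '[^a-zA-Z0-9\s]', '', 'g'
        ),
        '\s+', '-', 'g'
      )
    )
  ) STORED
);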

The Postgres approach is useful for batch imports or migrations but lacks collision handling — you will still need application-level uniqueness logic on top.

Pitfalls

  • Slugging percent-encoded input: If your input arrives as a URL string (e.g., copy-pasted from a browser), decode it first with decodeURIComponent() before slugifying. Otherwise %20 becomes 20 in your slug.
  • Whitespace from pasted text: Titles pasted from Word or Google Docs often contain non-breaking spaces (U+00A0), thin spaces (U+2009), or zero-width spaces (U+200B). Normalize all whitespace variants to regular spaces before processing.
  • Emoji in titles: Emoji are stripped by the [^a-z0-9] regex, which can leave consecutive hyphens if they appeared between words. Always collapse hyphens after stripping.
  • Empty slug after stripping: A title like "???" produces an empty string. Fall back to a UUID or timestamp-based slug rather than inserting an empty value.
  • Case-insensitive collision detection: Run your checkExists query with a case-insensitive comparison (for example lower(slug) = lower($1) in Postgres, or a citext column) so "My-Article" and "my-article" are treated as the same slug. ILIKE also works, but it treats _ and % in the input as wildcards.
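Several of these pitfalls can be handled in a small pre-processing step that runs before slugifying. The sketch below is one possible shape; prepareTitle and slugOrFallback are hypothetical names. Note that the JavaScript \s class already matches U+00A0 and U+2009 but not the zero-width space U+200B, which is why it is listed explicitly.

```javascript
// Hypothetical pre-processing step to run before slugifying.
function prepareTitle(raw) {
  let text = raw;
  try {
    text = decodeURIComponent(raw); // "%20" → " "
  } catch {
    // Malformed escape (e.g. a literal "100% real"): keep the raw text
  }
  return text
    .replace(/[\s\u200B]+/g, " ") // all whitespace variants → one regular space
    .trim();
}

// Fall back to a caller-supplied value (UUID, timestamp, ...) when nothing is left.
function slugOrFallback(slug, fallback) {
  return slug.length > 0 ? slug : fallback;
}

prepareTitle("my%20great%20title");         // "my great title"
prepareTitle("Hello\u00A0World\u200Bhere"); // "Hello World here"
slugOrFallback("", "post-123");             // "post-123"
```

Only apply the decoding step when the input is known to come from a URL; decoding an ordinary title that happens to contain a valid escape sequence would change it.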

Generate and preview slugs from any title — including Unicode, Cyrillic, and emoji input — with the Slugify Tool. Related reading: URL Parser Guide and URL Encoding Edge Cases.