
Text to Unicode: Code Points, Escape Sequences, and Encoding Explained

9 min read

Unicode is the universal character standard behind every modern text system, from plain ASCII letters to emoji, Chinese ideographs, and mathematical symbols. But "Unicode" is often used loosely to mean several different things: the code point number assigned to a character, the UTF-8 or UTF-16 byte representation on disk, and the various escape notations used in source code and markup. This guide untangles all three layers and shows you how to convert text to Unicode code points in every common format. Use the Text to Unicode converter to run any of these conversions instantly in your browser.

Code Points, Code Units, and Bytes

Unicode assigns every character a unique integer called a code point, written in U+ notation — for example, U+0041 is the Latin capital letter A, and U+1F600 is the grinning face emoji. There are over 1.1 million possible code points (U+0000 to U+10FFFF), though only around 150,000 are currently assigned.

A code point is an abstract number. To store or transmit it, an encoding maps it to one or more code units — fixed-size chunks of bits. The two encodings you will encounter most often are:

  • UTF-8: variable-width encoding using 1–4 bytes per code point. Code points U+0000–U+007F map to a single byte (identical to ASCII). Higher code points use 2, 3, or 4 bytes. This is the dominant encoding on the web and in files.
  • UTF-16: variable-width encoding using one or two 16-bit code units per code point. Code points U+0000–U+FFFF (the Basic Multilingual Plane) map to a single 16-bit unit. Code points above U+FFFF require two code units called a surrogate pair. JavaScript exposes strings as sequences of UTF-16 code units, whatever representation the engine uses internally.

The distinction between code points and code units is the source of many off-by-one bugs in string handling, especially with emoji.
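
To make the three layers concrete, the sketch below reports the code point, the UTF-8 bytes, and the UTF-16 code unit count for a single character. The helper name describeEncodings is illustrative only, and the code assumes a runtime that provides the standard TextEncoder global (modern browsers and Node.js).

```javascript
// Hypothetical helper: summarise how one character is represented.
function describeEncodings(ch) {
  const codePoint = ch.codePointAt(0);
  return {
    // Abstract number, in U+ notation
    codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    // UTF-8 bytes as two-digit hex strings (TextEncoder always emits UTF-8)
    utf8Bytes: Array.from(new TextEncoder().encode(ch), (b) =>
      b.toString(16).toUpperCase().padStart(2, "0")
    ),
    // .length counts UTF-16 code units
    utf16Units: ch.length,
  };
}

for (const ch of ["A", "\u00E9", "\u20AC", "\u{1F600}"]) {
  console.log(ch, describeEncodings(ch));
}
```

A takes one byte (41), é takes two (C3 A9), € takes three (E2 82 AC), and the emoji takes four (F0 9F 98 80). Only the emoji needs two UTF-16 code units.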

Escape Syntaxes Compared

Different languages and formats have their own notation for encoding a Unicode character as an escape sequence in source code or markup. The table below summarises the main ones:

| Syntax | Example (é = U+00E9 unless noted) | Used in |
|---|---|---|
| U+XXXX | U+00E9 | Unicode standard notation, documentation |
| \uXXXX | \u00E9 | JavaScript, Java, JSON, C#, Python (str) |
| \u{XXXXX} | \u{1F600} (😀, U+1F600) | JavaScript ES6+, Rust, Swift, Ruby |
| &#decimal; | &#233; | HTML, XML numeric character reference |
| &#xHEX; | &#xE9; | HTML, XML hex character reference |
| 0xXXXX | 0x00E9 | C, C++, Python (integer literals) |
| \xXX | \xE9 | C strings, Python bytes, Perl, PHP |

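A single \uXXXX escape is limited to the Basic Multilingual Plane (U+0000 to U+FFFF) because it always expects exactly four hex digits. In UTF-16-based languages such as JavaScript and Java, a supplementary character can only be written in this form as two escapes, one per surrogate. The brace form \u{XXXXX} takes a variable number of hex digits and covers the full Unicode range up to U+10FFFF, so prefer it, where your language supports it, whenever you work with emoji or other supplementary characters.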
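
As a quick check of how these notations relate inside JavaScript itself, the snippet below writes the same character several ways and compares the results. The variable names are illustrative. Note that in JavaScript string literals \xXX denotes a code point between U+0000 and U+00FF, not a raw byte as it does in C strings or Python bytes.

```javascript
// Four ways to produce é (U+00E9) in JavaScript.
const eAcute = {
  fourDigit: "\u00E9",
  brace: "\u{E9}",
  hexEscape: "\xE9", // a code point here, not a byte
  fromNumber: String.fromCodePoint(0x00e9),
};
const allSame = Object.values(eAcute).every((s) => s === eAcute.fourDigit);
console.log(allSame); // true

// A supplementary character: brace form vs. an explicit surrogate pair.
const braceEmoji = "\u{1F600}";
const pairEmoji = "\uD83D\uDE00";
console.log(braceEmoji === pairEmoji); // true
```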

JavaScript String Internals and Code Points

JavaScript strings are sequences of UTF-16 code units. For most characters this is invisible — a letter like A (U+0041) occupies a single code unit and "A".length === 1. The complexity appears with code points above U+FFFF, which require two UTF-16 code units — a surrogate pair.

The grinning face emoji at U+1F600 is a good test case. Its UTF-16 surrogate pair is the high surrogate U+D83D followed by the low surrogate U+DE00:

// The emoji occupies two UTF-16 code units
const emoji = "\uD83D\uDE00";   // same as "\u{1F600}"
console.log(emoji.length);        // 2  — code unit count, not character count

// .length counts code units, not code points
const face = "\u{1F600}";
console.log(face.length);         // 2
console.log([...face].length);    // 1  — spread operator uses code points

// codePointAt vs charCodeAt
console.log(face.codePointAt(0).toString(16)); // "1f600"
console.log(face.charCodeAt(0).toString(16));  // "d83d"  — only the high surrogate

// Iterating by code point (ES6+)
for (const char of "Hello \u{1F600}") {
  console.log(char, char.codePointAt(0).toString(16));
}
// H 48 / e 65 / l 6c / l 6c / o 6f / (space) 20 / 😀 1f600

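The key takeaway: use spread syntax [...str] or for...of when you need to iterate by code point, and use codePointAt() instead of charCodeAt() when you need the actual code point number. A code point is still not always one user-perceived character (some emoji, such as flags, are sequences of several code points), but iterating by code point at least never splits a surrogate pair. The Text Analyzer tool can show you the code point breakdown of any string, including surrogate pairs.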
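
One place this matters in practice is truncation. The sketch below contrasts a naive slice(), which counts code units, with a code-point-aware version. The helper truncateByCodePoints is a hypothetical name, not a built-in.

```javascript
// Hypothetical helper: keep at most maxCodePoints code points.
function truncateByCodePoints(str, maxCodePoints) {
  // Array.from iterates a string by code point, like spread syntax
  return Array.from(str).slice(0, maxCodePoints).join("");
}

const text = "ab\u{1F600}cd";

// Naive slice counts code units and cuts the emoji in half.
const naive = text.slice(0, 3);
console.log(naive.length);                     // 3
console.log(naive.charCodeAt(2).toString(16)); // "d83d", a lone high surrogate

// Code-point-aware truncation keeps the emoji intact.
const safe = truncateByCodePoints(text, 3);
console.log(safe === "ab\u{1F600}");           // true
console.log(safe.length);                      // 4 code units, 3 code points
```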

Converting Text to Unicode Escapes in JavaScript

The workflow for converting a string to its Unicode code point list, then to various escape formats, follows a consistent pattern: spread the string to respect surrogate pairs, call codePointAt(0) on each character, and format the resulting number as needed.

// Get all code points from a string (handles emoji correctly)
function getCodePoints(str) {
  return [...str].map((char) => char.codePointAt(0));
}

// Format as U+ notation
function toUPlus(str) {
  return getCodePoints(str)
    .map((cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");
}

// Format as \uXXXX or \u{XXXXX} escapes
function toJsEscape(str) {
  return getCodePoints(str)
    .map((cp) => {
      if (cp <= 0xffff) return "\\u" + cp.toString(16).padStart(4, "0");
      return "\\u{" + cp.toString(16) + "}";
    })
    .join("");
}

// Format as HTML decimal entities
function toHtmlEntities(str) {
  return getCodePoints(str)
    .map((cp) => "&#" + cp + ";")
    .join("");
}

// Examples
console.log(toUPlus("Héllo"));          // U+0048 U+00E9 U+006C U+006C U+006F
console.log(toJsEscape("é"));           // \u00e9
console.log(toJsEscape("\u{1F600}")); // \u{1f600}
console.log(toHtmlEntities("é"));       // &#233;

// Reverse: code point back to character
console.log(String.fromCodePoint(0x1f600)); // 😀
console.log(String.fromCodePoint(72, 101, 108, 108, 111)); // Hello

HTML Numeric Character References

HTML supports two forms of numeric character reference, both referring directly to the Unicode code point — no encoding table needed. They work for any valid Unicode code point, including emoji:

<!-- Decimal: &#[decimal code point]; -->
<p>Caf&#233;  <!-- é -->
<p>&#128512; <!-- 😀 U+1F600 -->

<!-- Hexadecimal: &#x[hex code point]; -->
<p>Caf&#xE9;  <!-- é -->
<p>&#x1F600; <!-- 😀 -->

Named HTML entities like &eacute; are a convenience shorthand for frequently used characters. Under the hood they map to the same code point: &eacute; is just a memorable alias for &#233;. For characters without a named entity (most emoji, for instance), the numeric form is the only character reference available, although you can always type the character directly when the document is served as UTF-8.
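
If you need the hexadecimal form, or want to leave plain ASCII untouched, a small variation on the decimal converter is enough. The following is a minimal sketch with a hypothetical helper name, escapeNonAscii. It does not escape <, >, or &, so it is not a substitute for proper HTML escaping.

```javascript
// Hypothetical helper: replace every non-ASCII code point with a hex
// numeric character reference. ASCII (including <, >, &) is left as-is.
function escapeNonAscii(str) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return cp > 0x7f ? "&#x" + cp.toString(16).toUpperCase() + ";" : ch;
  }).join("");
}

console.log(escapeNonAscii("Caf\u00E9 \u{1F600}")); // Caf&#xE9; &#x1F600;
```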

Surrogate Pairs and the Supplementary Planes

UTF-16 descends from UCS-2, a fixed-width 16-bit encoding designed when Unicode was expected to fit in 65,536 code points. When Unicode expanded beyond U+FFFF, a range of code points (U+D800 to U+DFFF) was reserved as surrogates: they carry no character meaning on their own and exist solely to form pairs that encode supplementary code points.

The formula for decomposing a code point cp above U+FFFF into its surrogate pair is:

function toSurrogatePair(cp) {
  // Only needed for code points above U+FFFF
  if (cp <= 0xffff) return { single: cp.toString(16).toUpperCase() };
  const adjusted = cp - 0x10000;
  const high = 0xd800 + (adjusted >> 10);  // high surrogate: D800–DBFF
  const low  = 0xdc00 + (adjusted & 0x3ff); // low  surrogate: DC00–DFFF
  // Surrogates are always four hex digits, so no padding is needed
  const highHex = high.toString(16).toUpperCase();
  const lowHex  = low.toString(16).toUpperCase();
  return {
    high: highHex,
    low:  lowHex,
    jsEscape: "\\u" + highHex + "\\u" + lowHex,
  };
}

const pair = toSurrogatePair(0x1f600);
console.log(pair.high, pair.low); // D83D DE00
console.log(pair.jsEscape);       // \uD83D\uDE00

// The modern way — no manual math needed
console.log("\u{1F600}");   // 😀
console.log(String.fromCodePoint(0x1f600)); // 😀
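
The decomposition can also be run in reverse. The sketch below recombines a high and a low surrogate into the original code point. The function name fromSurrogatePair is illustrative.

```javascript
// Hypothetical helper: combine a high and low surrogate into a code point.
function fromSurrogatePair(high, low) {
  if (high < 0xd800 || high > 0xdbff) throw new RangeError("not a high surrogate");
  if (low < 0xdc00 || low > 0xdfff) throw new RangeError("not a low surrogate");
  // Each surrogate carries 10 bits; add back the 0x10000 offset
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const recombined = fromSurrogatePair(0xd83d, 0xde00);
console.log(recombined.toString(16));                             // "1f600"
console.log(String.fromCodePoint(recombined) === "\uD83D\uDE00"); // true
```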

Lone surrogates (a high or low surrogate without its partner) are technically invalid Unicode. They can appear in JavaScript strings because JS uses UTF-16 internally and does not validate surrogate pairing. This is a common source of encoding errors when serialising such strings to UTF-8 (e.g., for JSON or a network request) — some libraries will throw, others will emit the replacement character U+FFFD.
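
The snippet below illustrates both behaviours with two standard APIs, and adds one possible regex-based check for lone surrogates. The name hasLoneSurrogate is hypothetical, and the regex assumes an engine with lookbehind support (current browsers and Node.js).

```javascript
const lone = "\uD83D"; // a high surrogate with no low surrogate after it

// TextEncoder (UTF-8) substitutes U+FFFD, whose UTF-8 bytes are EF BF BD.
const loneBytes = Array.from(new TextEncoder().encode(lone), (b) => b.toString(16));
console.log(loneBytes.join(" ")); // ef bf bd

// encodeURIComponent rejects the input and throws a URIError.
let threwUriError = false;
try {
  encodeURIComponent(lone);
} catch (err) {
  threwUriError = err instanceof URIError;
}
console.log(threwUriError); // true

// A high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one, is a lone surrogate.
const hasLoneSurrogate = (s) =>
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(s);

console.log(hasLoneSurrogate(lone));        // true
console.log(hasLoneSurrogate("\u{1F600}")); // false
```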

Mojibake: When Encoding Goes Wrong

Mojibake (文字化け) is the garbled text that appears when a byte sequence is decoded with the wrong encoding. A classic example: the string café encoded as UTF-8 contains the byte sequence 0xC3 0xA9 for the é. If that byte sequence is decoded as Latin-1 (ISO-8859-1) instead, 0xC3 is Ã and 0xA9 is ©, producing cafÃ©, which is instantly recognisable mojibake.

Common mojibake patterns and their causes:

  • Ã© for é: UTF-8 content read as Latin-1 or Windows-1252 (very common in legacy databases and email)
  • â€™ for ’: UTF-8 curly quotes read as Windows-1252
  • ?? for any non-ASCII: Unicode content saved to an ASCII-only channel that replaces unknowns with question marks
  • � (U+FFFD): the replacement character, inserted by decoders when they encounter an invalid byte sequence, for example Windows-1252 smart-quote bytes read as UTF-8

The fix is always to identify the actual encoding of the bytes and decode them correctly. The Text to Unicode converter lets you inspect the raw code points in any pasted text, which is the first step in diagnosing mojibake.
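
The café example can be reproduced, and reversed, in a few lines. This is a minimal sketch for the UTF-8-read-as-Latin-1 case only, with hypothetical helper names. Windows-1252 mojibake differs for bytes 0x80 to 0x9F and would need a lookup table. The code assumes the standard TextEncoder and TextDecoder globals.

```javascript
// Simulate the bug: encode as UTF-8, then decode each byte as Latin-1.
// Latin-1 maps byte value N directly to code point U+00NN.
function misreadUtf8AsLatin1(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Undo it: turn each character back into one byte, then decode as UTF-8.
function repairLatin1Mojibake(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 0xff) throw new RangeError("not Latin-1 mojibake");
    return code;
  });
  // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const garbled = misreadUtf8AsLatin1("caf\u00E9");
console.log(garbled);                       // cafÃ©
console.log(repairLatin1Mojibake(garbled)); // café
```

Because the repair step throws on input that is not valid UTF-8 once converted back to bytes, it is reasonably safe to try on text you merely suspect is garbled.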

Quick Reference: Text to Unicode in JavaScript

The most commonly needed one-liners for working with Unicode code points in JavaScript:

// Character → code point number
"A".codePointAt(0)                // 65
"\u{1F600}".codePointAt(0)      // 128512

// Code point number → character
String.fromCodePoint(65)          // "A"
String.fromCodePoint(128512)      // "😀"

// String → array of characters (respects surrogate pairs)
[..."Hello \u{1F600}"]           // ["H","e","l","l","o"," ","😀"]

// String → U+ code point list
[..."café"].map(c =>
  "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4,"0")
)
// ["U+0063","U+0061","U+0066","U+00E9"]

// String → \uXXXX escape (BMP only; use \u{} for emoji)
"\\u" + "é".charCodeAt(0).toString(16).padStart(4, "0")   // the six characters \u00e9

// Detect if string has non-BMP characters (matches any surrogate code unit, paired or lone)
const hasSurrogates = (s) => /[\uD800-\uDFFF]/.test(s);
hasSurrogates("Hello");            // false
hasSurrogates("\u{1F600}");      // true

For the reverse direction — parsing U+XXXX notation or \uXXXX escape strings back into text — see the related String Escaping guide, which covers unescape operations for JSON, HTML, and URL encodings in detail.
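
For simple cases, the reverse direction fits in a few lines. The sketch below is not taken from that guide. Both helper names are hypothetical, and the escape parser handles only \uXXXX and \u{...} sequences, not the other escapes a full JavaScript or JSON parser would.

```javascript
// Hypothetical helper: "U+0048 U+00E9" -> "Hé"
function fromUPlus(notation) {
  const codePoints = notation.trim().split(/\s+/).map((token) => {
    const match = /^U\+([0-9A-F]{4,6})$/i.exec(token);
    if (!match) throw new SyntaxError("Not a U+ code point: " + token);
    return parseInt(match[1], 16);
  });
  // String.fromCodePoint throws a RangeError above 0x10FFFF
  return String.fromCodePoint(...codePoints);
}

// Hypothetical helper: decode \uXXXX and \u{X...} escapes found in text.
function unescapeUnicode(text) {
  return text.replace(
    /\\u\{([0-9A-F]{1,6})\}|\\u([0-9A-F]{4})/gi,
    (_, braced, fourDigit) =>
      String.fromCodePoint(parseInt(braced || fourDigit, 16))
  );
}

console.log(fromUPlus("U+0048 U+00E9 U+006C U+006C U+006F")); // Héllo
console.log(unescapeUnicode("caf\\u00e9 \\ud83d\\ude00"));    // café 😀
```

Two adjacent \uXXXX surrogate escapes decode to two code units that sit next to each other in the result, so they recombine into the supplementary character without any extra work.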


Understanding Unicode code points and their escape forms is essential for any developer working with internationalisation, emoji, binary protocols, or data interchange formats. The Text to Unicode converter handles all the formats described here — U+ notation, JavaScript escapes, HTML entities, and raw hex — directly in your browser with no server round-trip.