
Text to Binary Guide: How Character Encoding Works, Bit by Bit

9 min read

Every character you type — a letter, a digit, a space, an emoji — is ultimately stored and transmitted as a sequence of 0s and 1s. "Text to binary" is the process of tracing that chain: character → Unicode code point → bytes → bits. Understanding how this works helps you debug encoding bugs, read protocol specifications, and make sense of the binary blobs that appear in network captures and file dumps. Follow along with the Text to Binary converter as you work through the examples below.

The Chain: Character → Code Point → Bytes → Bits

Text-to-binary conversion is not a single step. It is a pipeline with three distinct stages, and each stage requires a specific agreement between the sender and the receiver.

  • Stage 1 — Character to code point. Unicode assigns every character a unique integer called a code point, written as U+XXXX. The letter A is U+0041 (decimal 65). The letter é is U+00E9 (decimal 233). The emoji 😀 is U+1F600 (decimal 128512).
  • Stage 2 — Code point to bytes. An encoding scheme (UTF-8, UTF-16, etc.) defines how to serialize a code point as one or more bytes. This step is where encodings diverge: the same code point produces different byte sequences depending on which encoding you choose.
  • Stage 3 — Bytes to bits. Each byte is an integer from 0 to 255. Writing it in base 2 with 8 digits gives you the binary representation, padded with leading zeros to fill all 8 positions.

The result is a binary string. Without agreeing on the encoding used in Stage 2, the binary string is meaningless — you cannot reverse it into text unless you know which encoding to apply.
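The three stages can be traced for a single character with a few lines of JavaScript. This is a minimal sketch, assuming UTF-8 for Stage 2; the helper name traceCharacter is made up for illustration and is not part of any library.

```javascript
// Trace the first character of a string through the three stages.
// Assumption: UTF-8 is the encoding agreed on for Stage 2.
function traceCharacter(text) {
  // Stage 1: character -> code point (an integer)
  const codePoint = text.codePointAt(0);
  const char = String.fromCodePoint(codePoint);

  // Stage 2: code point -> bytes (UTF-8 via the built-in TextEncoder)
  const bytes = Array.from(new TextEncoder().encode(char));

  // Stage 3: bytes -> bits (base 2, padded to 8 digits)
  const bits = bytes.map(b => b.toString(2).padStart(8, '0'));

  return {
    codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
    bytes: bytes.map(b => '0x' + b.toString(16).toUpperCase().padStart(2, '0')),
    bits,
  };
}

traceCharacter('A');
// => { codePoint: 'U+0041', bytes: ['0x41'], bits: ['01000001'] }
```

Swapping the encoder in Stage 2 changes the bytes and bits but leaves the code point from Stage 1 untouched.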

ASCII: The Simple Case

ASCII covers the 128 characters of U+0000 through U+007F. Because 128 = 2^7, every ASCII character fits in 7 bits, and therefore in a single 8-bit byte with the high bit set to 0. This makes ASCII a special case where Stage 2 and Stage 3 collapse into one operation: the code point value is the byte value is the 8-bit binary pattern.

The letter A (U+0041, decimal 65) in binary is 01000001. Let us verify: 64 + 1 = 65. The bit pattern 0·1·0·0·0·0·0·1 represents 0×128 + 1×64 + 0×32 + 0×16 + 0×8 + 0×4 + 0×2 + 1×1 = 65. Correct.

Character   Code Point   Decimal   Binary (8 bits)
A           U+0041       65        01000001
a           U+0061       97        01100001
Z           U+005A       90        01011010
0           U+0030       48        00110000
Space       U+0020       32        00100000
!           U+0021       33        00100001

Notice that uppercase A (01000001) and lowercase a (01100001) differ by exactly one bit — bit 5 (counting from 0 on the right). That is why XOR-ing an ASCII letter with 00100000 (32) toggles its case.
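The case-toggle trick can be sketched as follows. The function name toggleAsciiCase is hypothetical, and the sketch deliberately touches only the ASCII letters A–Z and a–z, since XOR with 32 is not a valid case mapping for other characters.

```javascript
// Flip bit 5 (value 32) of every ASCII letter; leave everything else alone.
function toggleAsciiCase(text) {
  return Array.from(text, ch => {
    const code = ch.charCodeAt(0);
    const isUpper = code >= 65 && code <= 90;   // 'A'..'Z'
    const isLower = code >= 97 && code <= 122;  // 'a'..'z'
    return (isUpper || isLower)
      ? String.fromCharCode(code ^ 0b00100000)  // XOR toggles bit 5
      : ch;
  }).join('');
}

toggleAsciiCase('Hello, World!');
// => 'hELLO, wORLD!'
```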

UTF-8: Variable-Width Encoding

UTF-8 is the dominant encoding on the web and in modern file systems. It is backward compatible with ASCII: any byte below 0x80 is a single-byte character identical to its ASCII counterpart. For code points above U+007F, UTF-8 uses sequences of 2, 3, or 4 bytes, all with the high bit set.

The encoding rules are mechanical. The number of bytes needed depends on the code point range:

Code point range       Bytes   Byte pattern
U+0000 – U+007F        1       0xxxxxxx
U+0080 – U+07FF        2       110xxxxx 10xxxxxx
U+0800 – U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Take the accented letter é (U+00E9, decimal 233). It falls in the 2-byte range. In binary, 233 is 11101001. The pattern 110xxxxx 10xxxxxx has 11 payload slots (5 + 6), so pad the value to 11 bits: 00011101001. The top five bits (00011) fill the first byte and the low six bits (101001) fill the second, which yields 11000011 10101001, the byte pair 0xC3 0xA9. So the binary representation of é under UTF-8 is two 8-bit groups: 11000011 10101001.

Emoji use 4 bytes. The grinning face 😀 (U+1F600, decimal 128512) encodes as 0xF0 0x9F 0x98 0x80 in UTF-8, which in binary is four groups: 11110000 10011111 10011000 10000000.
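The byte patterns in the table translate directly into shifts and masks. The following is an illustrative sketch of that bit distribution, not a replacement for TextEncoder; the function name encodeCodePointUtf8 is an assumption made for this example.

```javascript
// Encode one Unicode code point as an array of UTF-8 byte values.
function encodeCodePointUtf8(cp) {
  const isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF || isSurrogate) {
    throw new RangeError('Not an encodable code point: ' + cp);
  }
  if (cp <= 0x7F) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6),                      // 110xxxxx
            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                     // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),             // 10xxxxxx
            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  }
  return [0xF0 | (cp >> 18),                       // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),              // 10xxxxxx
          0x80 | ((cp >> 6) & 0x3F),               // 10xxxxxx
          0x80 | (cp & 0x3F)];                     // 10xxxxxx
}

encodeCodePointUtf8(0xE9);
// => [0xC3, 0xA9]

encodeCodePointUtf8(0x1F600);
// => [0xF0, 0x9F, 0x98, 0x80]
```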

UTF-16: 16-Bit Code Units

UTF-16 encodes most characters as a single 16-bit code unit (2 bytes). Characters in the Basic Multilingual Plane (U+0000–U+FFFF) map directly to a single code unit. Characters above U+FFFF use a surrogate pair — two 16-bit units. The letter A in UTF-16 is 0x0041, which as 16 bits is 00000000 01000001.
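The surrogate-pair arithmetic is short enough to write out. In this sketch (the name toUtf16CodeUnits is hypothetical), a code point above U+FFFF has 0x10000 subtracted, and the remaining 20 bits are split into two 10-bit halves that are added to 0xD800 and 0xDC00.

```javascript
// Convert one code point into its UTF-16 code unit(s).
function toUtf16CodeUnits(cp) {
  if (cp <= 0xFFFF) {
    return [cp];                      // BMP: one 16-bit code unit
  }
  const offset = cp - 0x10000;        // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // low 10 bits -> low surrogate
  return [high, low];
}

toUtf16CodeUnits(0x41);
// => [0x0041]

toUtf16CodeUnits(0x1F600);
// => [0xD83D, 0xDE00]
```

JavaScript strings are sequences of UTF-16 code units, so the result can be checked against charCodeAt on the same character.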

This is why byte order matters for UTF-16: the two bytes of each code unit can be arranged big-endian (00 41) or little-endian (41 00). UTF-16BE and UTF-16LE differ only in this order. A Byte Order Mark (BOM, U+FEFF) at the start of a file signals which variant is in use: it appears as the bytes FE FF in big-endian order and FF FE in little-endian order. UTF-8 has no byte order ambiguity because its code unit is a single byte, and the leading bit patterns of each byte mark where a multi-byte sequence starts and how long it is.
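To see both byte orders side by side, a DataView can write each 16-bit code unit with an explicit endianness flag. This is a small sketch; utf16Bytes is a made-up helper name, and it writes the string's code units as-is without adding a BOM.

```javascript
// Serialize a string's UTF-16 code units as bytes in the chosen byte order.
function utf16Bytes(text, littleEndian) {
  const buffer = new ArrayBuffer(text.length * 2);   // 2 bytes per code unit
  const view = new DataView(buffer);
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), littleEndian);
  }
  return Array.from(new Uint8Array(buffer));
}

utf16Bytes('A', false);
// => [0x00, 0x41]  (UTF-16BE)

utf16Bytes('A', true);
// => [0x41, 0x00]  (UTF-16LE)
```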

8-Bit Grouping and Separators

A raw binary string like 0100000101000010 is hard to read. The standard convention is to group bits into 8-bit bytes and separate them with a space: 01000001 01000010 is immediately recognizable as two bytes, A and B. Some tools use a different separator — a hyphen, a comma, or no separator at all — and some omit leading zeros.
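Regrouping a raw bit string is a matter of slicing it every 8 characters. The sketch below assumes any character other than 0 or 1 is an existing separator and can be discarded; the name formatBits and its separator parameter are illustrative only.

```javascript
// Regroup a bit string into 8-bit bytes joined by the chosen separator.
function formatBits(bits, separator = ' ') {
  const clean = bits.replace(/[^01]/g, '');   // drop existing separators
  if (clean.length % 8 !== 0) {
    throw new RangeError('Bit count is not a multiple of 8: ' + clean.length);
  }
  const groups = clean.match(/[01]{8}/g) || [];  // null for empty input
  return groups.join(separator);
}

formatBits('0100000101000010');
// => '01000001 01000010'

formatBits('01000001 01000010', '-');
// => '01000001-01000010'
```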

The Text to Binary converter lets you choose your separator and toggle leading zeros so the output matches whatever format your target system expects.

Why "Binary" Is Ambiguous Without an Encoding

If someone hands you the binary string 11000011 10101001 and asks "what text is this?", you cannot answer without knowing the encoding:

  • Under UTF-8, 0xC3 0xA9 is the letter é (U+00E9).
  • Under Latin-1 (ISO 8859-1), 0xC3 is à and 0xA9 is © — two separate characters.
  • Under UTF-16BE, the two bytes form the code unit 0xC3A9, which is the character U+C3A9 — a completely different code point.

The binary representation is only meaningful when paired with its encoding. This is why modern systems embed encoding declarations — the charset attribute in HTML, the Content-Type header in HTTP, the BOM in text files. See the Hex ↔ ASCII tool to explore how the same bytes look across different encodings, and the Hex to ASCII guide for a deeper look at hexadecimal byte representation.
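The three readings of 0xC3 0xA9 listed above can be reproduced in a few lines. In this sketch only UTF-8 goes through TextDecoder; Latin-1 and UTF-16BE are decoded by hand so the example does not depend on which encoding labels a given runtime supports. The function name interpretBytes is hypothetical, and the UTF-16BE branch assumes an even number of bytes with no surrogate handling.

```javascript
// Interpret the same bytes under three different encodings.
function interpretBytes(bytes) {
  // UTF-8: use the built-in decoder.
  const utf8 = new TextDecoder('utf-8').decode(bytes);

  // Latin-1: every byte value is the code point of one character.
  const latin1 = Array.from(bytes, b => String.fromCharCode(b)).join('');

  // UTF-16BE: combine each pair of bytes, high byte first.
  let utf16be = '';
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    utf16be += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }

  return { utf8, latin1, utf16be };
}

interpretBytes(new Uint8Array([0xC3, 0xA9]));
// => { utf8: 'é', latin1: 'Ã©', utf16be: '\uC3A9' }
```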

Converting Binary Back to Text

Reversing the process requires the same agreement on encoding. The steps are:

  • Group into 8-bit chunks. Split the binary string on whitespace (or every 8 characters if there is no separator). Each chunk is one byte.
  • Parse each chunk as base-2. 01000001 → 64 + 1 = 65.
  • Collect the byte values into a byte array. You now have raw bytes.
  • Decode the byte array with the agreed encoding. Pass the bytes through a UTF-8 decoder (or whichever encoding was used) to recover the original text.
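The steps above can be sketched as follows, including the case where the input has no separator. The names binaryToBytes and decodeBinary are assumptions for this example, UTF-8 is assumed as the agreed encoding, and separated groups of fewer than 8 bits are accepted on the assumption that leading zeros were omitted.

```javascript
// Steps 1-3: split into chunks, parse each as base 2, collect the bytes.
function binaryToBytes(binaryStr) {
  const trimmed = binaryStr.trim();
  let chunks;
  if (/\s/.test(trimmed)) {
    chunks = trimmed.split(/\s+/);            // separated input
  } else {
    if (trimmed.length % 8 !== 0) {
      throw new RangeError('Unseparated input must be a multiple of 8 bits');
    }
    chunks = trimmed.match(/.{8}/g) || [];    // slice every 8 characters
  }
  const values = chunks.map(chunk => {
    if (!/^[01]{1,8}$/.test(chunk)) {
      throw new SyntaxError('Invalid binary group: ' + chunk);
    }
    return parseInt(chunk, 2);
  });
  return new Uint8Array(values);
}

// Step 4: decode the bytes with the agreed encoding (UTF-8 assumed here).
function decodeBinary(binaryStr) {
  return new TextDecoder('utf-8').decode(binaryToBytes(binaryStr));
}

decodeBinary('0100100001101001');
// => 'Hi'
```

Validating each chunk matters because parseInt returns NaN for input that is not binary, and a Uint8Array silently stores NaN as 0.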

JavaScript Examples

The browser provides TextEncoder (text → UTF-8 bytes) and TextDecoder (bytes → text) as built-ins. Using them avoids manual bit manipulation and handles all of Unicode correctly.

// Text → binary string (UTF-8, space-separated bytes)
function textToBinary(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  return Array.from(bytes)
    .map(b => b.toString(2).padStart(8, '0'))   // each byte as 8 bits
    .join(' ');                                   // space between bytes
}

textToBinary('A');
// => '01000001'

textToBinary('Hi');
// => '01001000 01101001'

textToBinary('é');
// => '11000011 10101001'  (2 UTF-8 bytes)

textToBinary('😀');
// => '11110000 10011111 10011000 10000000'  (4 UTF-8 bytes)

// Binary string → text (UTF-8 decode)
function binaryToText(binaryStr) {
  const bytes = binaryStr
    .trim()
    .split(/\s+/)                          // split on whitespace
    .map(chunk => parseInt(chunk, 2));     // parse each 8-bit group as base-2
  return new TextDecoder('utf-8').decode(new Uint8Array(bytes));
}

binaryToText('01001000 01100101 01101100 01101100 01101111');
// => 'Hello'

binaryToText('11000011 10101001');
// => 'é'

The key lines are b.toString(2).padStart(8, '0') (byte to 8-bit binary string) and parseInt(chunk, 2) (8-bit binary string to byte value). The TextEncoder / TextDecoder pair handles the UTF-8 encoding and decoding transparently.

Practical Uses

  • Education. Visualizing the binary representation of characters is the standard way to teach binary numbers and encoding fundamentals. Showing that A = 01000001 and a = 01100001 makes bit toggling tangible.
  • Protocol and bit-field debugging. Many network protocols pack flags and values into individual bits of a byte. Viewing a byte in binary lets you see which flags are set without doing mental arithmetic.
  • CTF and puzzle challenges. Capture-the-flag competitions regularly encode messages in binary. Recognizing the pattern of 8-bit groups and decoding them to ASCII is a basic skill.
  • Encoding verification. When a character produces more binary groups than expected, that indicates a multi-byte encoding such as UTF-8 rather than a single-byte one such as ASCII or Latin-1.

Example: decoding a CTF binary message

01001000 01100101 01101100 01101100 01101111 00100001

Parse each group as base 2 and map the value to its ASCII character:
  01001000 = 72 H
  01100101 = 101 e
  01101100 = 108 l
  01101100 = 108 l
  01101111 = 111 o
  00100001 = 33 !

Result: Hello!
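For the protocol and bit-field debugging use listed above, viewing a byte in binary pairs naturally with masking individual flags. The flag names and bit positions below are invented purely for illustration and do not correspond to any real protocol; describeFlags is likewise a hypothetical helper.

```javascript
// Hypothetical flag layout, made up for this example only.
const FLAGS = {
  ACK:       0b00000001,
  RETRY:     0b00000010,
  ENCRYPTED: 0b00010000,
  URGENT:    0b10000000,
};

// Show a byte as 8 bits and list which of the hypothetical flags are set.
function describeFlags(byte) {
  const bits = byte.toString(2).padStart(8, '0');
  const set = Object.keys(FLAGS).filter(name => (byte & FLAGS[name]) !== 0);
  return { bits, set };
}

describeFlags(0x91);
// => { bits: '10010001', set: ['ACK', 'ENCRYPTED', 'URGENT'] }
```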

Convert any text to its binary representation — and reverse binary back to readable text — instantly in your browser with the Text to Binary converter. No data leaves your machine.