Monday, October 6, 2025

Character Sets and Character Encodings

1. What is a Character Set?

Character Set - Informal definition

A character set is just a collection of symbols (letters, digits, punctuation, emojis, etc.) that a computer system knows about, together with the numbers (code points) assigned to them.

Character Set - Formal definition

A character set (more precisely, a coded character set) is a mapping between a finite set of abstract characters and a set of unique code points (integers).

It defines:

  • Character repertoire = the set of characters included.
  • Code space = the range of assignable numeric values.
  • Mapping function = assigns each character exactly one code point.

Formally:

f : Character ↔ CodePoint ∈ CodeSpace

Here the code point is the single integer value that uniquely identifies one abstract character.
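As a small illustration of this mapping, the sketch below uses Python's built-in ord() and chr(), which expose the Unicode coded character set. The sample characters and the name code_points are chosen here only for the example.

```python
# Sample characters, written as escapes so the source file's own encoding does not matter.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀

# Character -> code point (the mapping f described above).
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # chr() is the inverse direction: code point -> character.
    assert chr(cp) == ch
    print(f"{ch} -> {cp} (U+{cp:04X})")
```

Each character maps to exactly one integer and back again, which is what the two-way arrow in the formal definition expresses.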

2. What is an Encoding System?

The term “encoding system” is not formally defined in most standards (including Unicode), but we can still use it informally. In Unicode terminology, a coded character set assigns numbers to characters, while an encoding scheme specifies how those numbers are serialized into bytes. An encoding system, in this informal sense, covers both aspects: not only the assignment of numbers but also the rules for representing them as bits and bytes.


Encoding System – Informal Definition

An encoding system is any method of converting characters from a character set into a sequence of bits/bytes for storage or transmission.

In other words:

  • It answers “How do I turn this character into actual data in memory or on disk?”
  • Examples people usually mean when they say "encoding system": ASCII, ISO-8859-1, UTF-8, UTF-16, and, if they’re being loose, Base64 (which is a binary-to-text encoding rather than a character encoding).
  • It’s a general/computer science term, not a standard term in the Unicode spec.
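To make "turning characters into actual data" concrete, here is a minimal sketch using Python's str.encode() and bytes.decode(). The sample text and variable names are assumptions made for the illustration.

```python
text = "A\u00e9"  # "Aé": the same two characters in every case below

# The same characters become different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes.hex())    # hexadecimal dump of the UTF-8 bytes
print(latin1_bytes.hex())  # hexadecimal dump of the ISO-8859-1 bytes

# Decoding with the encoding that produced the bytes restores the text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("iso-8859-1") == text

# Decoding with a different encoding silently produces different characters.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)
```

The last step shows why bytes alone are not enough: a reader also has to know which encoding was used to write them.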

3. Examples: ASCII, ISO-8859 and Unicode

ASCII

  • Definition: ASCII (American Standard Code for Information Interchange) is a character set.
  • Code space of 0–127 (7 bits); character repertoire limited to control characters, basic Latin letters, digits and symbols.
  • Encoding: The standard ASCII numeric codes are directly used as binary representations, so it also serves as a simple encoding, but strictly speaking, ASCII is primarily a character set.

ISO-8859

  • Definition: ISO-8859 is a family of character sets; its best-known part is ISO-8859-1 (Latin-1).
  • Code space 0–255 (8 bits); character repertoire: the lower half (0–127) matches ASCII, and the upper half adds characters for a specific group of languages (Western European alphabets in ISO-8859-1).
  • Encoding: ISO-8859 maps each character to one byte, so in practice it is also an encoding system, but the term "ISO-8859" usually refers to the character set.

Unicode

  • Definition: Unicode is a character set.
  • Code space of 1,114,112 possible code points (U+0000 to U+10FFFF); character repertoire covers virtually all written scripts and symbols.
  • Encoding: Unicode by itself only assigns code points; to store or transmit text you need one of its encodings, such as UTF-8, UTF-16 or UTF-32.
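The difference in repertoire between these character sets can be observed directly. The sketch below tries to encode the euro sign with several Python codecs. The helper try_encode is hypothetical, and ISO-8859-15 is added only as a second member of the ISO-8859 family for comparison.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec's repertoire lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

euro = "\u20ac"  # EURO SIGN

codecs_to_try = ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")
results = {codec: try_encode(euro, codec) for codec in codecs_to_try}

for codec, encoded in results.items():
    if encoded is None:
        print(f"{codec}: not in repertoire")
    else:
        print(f"{codec}: {encoded.hex()}")
```

ASCII and ISO-8859-1 have no euro sign, ISO-8859-15 stores it in a single byte, and UTF-8 needs three bytes for the same character.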

4. The Unicode Character Set

Unicode is a universal standard designed to represent text from almost all the world’s writing systems in a consistent way. It assigns a unique integer, called a code point, to every character it covers: letters, numbers, symbols, punctuation, emojis and more.

4.1 Unicode Code Points

In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the Latin letter A. Code points are grouped into 17 planes of 65,536 code points each. The first plane, called the Basic Multilingual Plane (BMP), holds the “classic” Unicode characters with code points U+0000 to U+FFFF. The sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters.
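Because each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below shows this along with the U+ notation. The helper names plane_of and u_notation are made up for this example.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane
PLANE_COUNT = 17      # plane 0 (the BMP) plus 16 supplementary planes

def plane_of(ch):
    """Return the plane number (0..16) of a single character."""
    return ord(ch) >> 16

def u_notation(ch):
    """Format a character's code point as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "\u20ac", "\U0001F600"]:  # A, €, 😀
    print(u_notation(ch), "is in plane", plane_of(ch))

# 17 planes of 65,536 code points give the full code space U+0000..U+10FFFF.
assert PLANE_SIZE * PLANE_COUNT == 0x10FFFF + 1
```

Characters in plane 0 belong to the BMP; anything with a non-zero plane number is a supplementary character.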

4.2 UTF (Unicode Transformation Format)

What is UTF? Let's get into some of the formal terminology used by the Unicode Consortium.

Short answer: UTFs (UTF-8, UTF-16, UTF-32) are encoding forms — and when applied to byte streams, they become encoding schemes.

Let’s break this down clearly:

4.3 Key Terms (per Unicode Standard)

  • Character Set (Repertoire): The list of characters (e.g., Unicode code points U+0000 … U+10FFFF).

  • Encoding Form:
    How Unicode scalar values (code points excluding the surrogate range U+D800 to U+DFFF) are represented as one or more code units (e.g., UTF-8 uses 8-bit units, UTF-16 uses 16-bit units).

  • Encoding Scheme: How the code units (from an encoding form) are serialized into bytes for storage or transmission (important when code units are more than 8 bits).
    Examples:

    • UTF-8 → encoding form and encoding scheme are the same, since code units are already 8-bit.
    • UTF-16 → needs an encoding scheme: UTF-16BE (big-endian), UTF-16LE (little-endian), or UTF-16 with BOM.
    • UTF-32 → similarly has UTF-32BE, UTF-32LE, or UTF-32 with BOM.

Note: Encoding Form → from code points to code units; Encoding Scheme → from code units to bytes
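The two steps in the note can be separated in code. This sketch first obtains the UTF-16 code units of a string (the encoding form) and then serializes them in either byte order (the encoding scheme). The helpers utf16_code_units and serialize are hypothetical names introduced for the illustration.

```python
import codecs
import struct

def utf16_code_units(text):
    """Encoding form: code points -> list of 16-bit code units."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

def serialize(code_units, byteorder):
    """Encoding scheme: 16-bit code units -> bytes in the given byte order."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in code_units)

units = utf16_code_units("A")
big_endian = serialize(units, "big")        # same bytes as "A".encode("utf-16-be")
little_endian = serialize(units, "little")  # same bytes as "A".encode("utf-16-le")

# Python's plain "utf-16" codec writes a byte order mark first, then uses
# the machine's native byte order, so the result depends on the platform.
with_bom = "A".encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

The code units are identical in both cases; only the order of the bytes inside each unit changes.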


Here’s the compact hierarchy that the Unicode Standard lays out, step by step, so you can see where encoding scheme fits in:

  1. Abstract Character → The conceptual unit (e.g., the letter “A”, the emoji 😀).

  2. Code Point → The numeric value assigned in Unicode (e.g., U+0041 for "A", U+1F600 for 😀).

  3. Code Unit → A fixed-size piece used by a given encoding form (8, 16, or 32 bits).

    • UTF-8: 8-bit code units
    • UTF-16: 16-bit code units
    • UTF-32: 32-bit code units
  4. Encoding Form → Maps code points to sequences of one or more code units.

    • Example: U+1F600 😀 → 0xD83D 0xDE00 (UTF-16) or 0xF0 0x9F 0x98 0x80 (UTF-8).
  5. Encoding Scheme → Defines how the code units are serialized into bytes for storage/transmission.

    • UTF-8: scheme = form (already byte-oriented).
    • UTF-16: UTF-16BE (big-endian), UTF-16LE (little-endian), or UTF-16 with BOM.
    • UTF-32: UTF-32BE, UTF-32LE, or UTF-32 with BOM.
  6. Byte Stream → The actual sequence of bytes stored on disk, transmitted over the wire, etc.
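Step 4 is where supplementary characters such as U+1F600 turn into two UTF-16 code units (a surrogate pair). The sketch below works through that arithmetic by hand. The function name to_utf16_units is an assumption for this example, and it expects a valid Unicode scalar value as input without checking for one.

```python
def to_utf16_units(cp):
    """Map one code point to its UTF-16 code units (assumes a valid scalar value)."""
    if cp < 0x10000:
        # BMP code points fit in a single 16-bit code unit.
        return [cp]
    # Supplementary code points: subtract 0x10000, leaving a 20-bit offset,
    # then split it into a high and a low 10-bit half.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

units = to_utf16_units(0x1F600)
print([f"0x{u:04X}" for u in units])

# Cross-check against Python's own UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
assert b"".join(u.to_bytes(2, "big") for u in units) == expected
```

For U+1F600 this gives the pair 0xD83D 0xDE00, matching the UTF-16 example in step 4.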


Example (Character → Code Point → Encoding Form → Encoding Scheme)

Character: "A"

  1. Code Point: U+0041
  2. UTF-8 (Encoding Form): 1 code unit = 0x41
    Encoding Scheme: stored as 1 byte 41
  3. UTF-16 (Encoding Form): 1 code unit = 0x0041
    Encoding Scheme:
    • UTF-16BE → 00 41
    • UTF-16LE → 41 00
  4. UTF-32 (Encoding Form): 1 code unit = 0x00000041
    Encoding Scheme:
    • UTF-32BE → 00 00 00 41
    • UTF-32LE → 41 00 00 00
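The byte sequences in this example can be reproduced with Python's built-in codecs. This is only a quick check, assuming Python 3.8 or later for the separator argument of bytes.hex(); the variable names are illustrative.

```python
schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

# Serialize the single character "A" under each encoding scheme
# and show the resulting bytes as space-separated hex pairs.
serialized = {name: "A".encode(name).hex(" ") for name in schemes}

for name, hex_bytes in serialized.items():
    print(f"{name:10} -> {hex_bytes}")
```

The printed bytes should line up with the table above: one byte for UTF-8, two for UTF-16 and four for UTF-32, with the byte order flipped between the BE and LE variants.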
