Simple 1D image tokenizer from the paper "An Image is Worth 32 Tokens for Reconstruction and Generation"
A neural architecture for encoding images into sequences of discrete tokens, enabling efficient image compression and representation learning through the use of Vision Transformers and Vector Quantization.
The Image Tokenizer converts images into sequences of discrete tokens using a two-stage process:
- Vision Transformer (ViT) for learning spatial relationships
- Vector Quantization (VQ) for discretization
This architecture enables efficient image compression, learned discrete representations, and interpretable latent spaces suitable for downstream tasks.
The image tokenizer processes input images in four stages:
- Patch embedding and tokenization
- Transformer-based contextual encoding
- Vector quantization
- Discrete token generation
The Vision Transformer processes images through several key stages:
- Patch Embedding:
  - Input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into $N = \frac{HW}{P^2}$ patches
  - Each patch $x_p \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$
  - Result: sequence of patch embeddings $z_0 \in \mathbb{R}^{N \times D}$
- Position Encoding:
  - Learned position embeddings $E_{pos} \in \mathbb{R}^{N \times D}$ added to patch embeddings
  - Input sequence: $z_0 + E_{pos}$
- Transformer Encoding:
  - $L$ layers of multi-head self-attention (MSA) and MLP blocks, with the first layer taking $z_0 + E_{pos}$ as input
  - Layer $l$ computation: $$z'_l = \text{MSA}(\text{LN}(z_{l-1})) + z_{l-1}$$ $$z_l = \text{MLP}(\text{LN}(z'_l)) + z'_l$$
  - Output: contextual representations $z_L \in \mathbb{R}^{N \times D}$
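The patch embedding and position encoding stages can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the function names `patchify` and `embed_patches`, the random projection `w_proj`, the random position table `e_pos`, and the small demo dimension `D = 32` are all assumptions chosen to keep the example light.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N = HW / P^2 flattened patches of length P^2 * C."""
    h, w, c = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by the patch size")
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

def embed_patches(image, patch_size, w_proj, e_pos):
    """Project each flattened patch to dimension D, then add position embeddings."""
    patches = patchify(image, patch_size)  # (N, P^2 * C)
    z0 = patches @ w_proj                  # (N, D)
    return z0 + e_pos                      # (N, D)

# Hypothetical demo values: random image and random (untrained) parameters.
rng = np.random.default_rng(0)
H, W, C, P, D = 256, 256, 3, 64, 32
image = rng.uniform(-1.0, 1.0, size=(H, W, C))
w_proj = rng.normal(scale=0.02, size=(P * P * C, D))
e_pos = rng.normal(scale=0.02, size=((H // P) * (W // P), D))

encoder_input = embed_patches(image, P, w_proj, e_pos)
print(encoder_input.shape)  # (16, 32)
```

In a trained model `w_proj` and `e_pos` would be learned parameters; here they only demonstrate the shapes involved.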
The VQ layer maps continuous latent vectors to discrete tokens:
- Codebook:
  - Contains $K$ embedding vectors: $\{e_k\}_{k=1}^K$ where $e_k \in \mathbb{R}^D$
  - Learned during training via the codebook loss; straight-through gradient estimation carries gradients past the non-differentiable quantization step to the encoder
- Quantization Process:
  - For each input vector $z_i$, find nearest codebook vector: $$k(i) = \arg\min_k \|z_i - e_k\|_2$$
  - Replace with selected codebook vector: $$z_q^i = e_{k(i)}$$
- Training Objectives:
  - Codebook loss: $\|sg(z) - e\|_2^2$
  - Commitment loss: $\beta\|z - sg(e)\|_2^2$
  - Where $sg()$ is the stop-gradient operator
- Token Generation:
  - Each quantized vector replaced by codebook index
  - Final output: sequence of $N$ discrete tokens $\{k(i)\}_{i=1}^N$
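The nearest-codebook lookup and token generation steps can be illustrated with a short NumPy sketch. The function name `quantize` and the toy codebook below are assumptions made for the example; a real codebook would be learned.

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each row of z (N, D) to its nearest codebook row (K, D).

    Returns the token indices k(i) and the quantized vectors z_q.
    """
    # Squared L2 distances via ||z||^2 - 2 z.e + ||e||^2, shape (N, K).
    # The argmin of the squared distance equals the argmin of the distance.
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

# Toy codebook (an assumption): K = 8 entries in D = 4, row k is [k, k, k, k].
codebook = np.arange(8.0)[:, None] * np.ones((1, 4))
# Three encoder outputs sitting close to entries 3, 0 and 5.
z = codebook[[3, 0, 5]] + 0.1

tokens, z_q = quantize(z, codebook)
print(tokens)  # [3 0 5]
```

The `tokens` array is the discrete output sequence; `z_q` is what a decoder would consume.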
For an input image with dimensions $H \times W$:
- Patch size $P$ results in $N = \frac{HW}{P^2}$ patches
- Each patch produces one embedding in final sequence
- Example: 256×256 image with 64×64 patches yields 16 embeddings
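The sequence-length arithmetic is easy to check directly. The helper name `num_patches` is hypothetical, and the divisibility check is an assumption about how non-divisible sizes should be handled.

```python
def num_patches(height: int, width: int, patch_size: int) -> int:
    """Return N = HW / P^2 for an image that tiles evenly into P x P patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

print(num_patches(256, 256, 64))  # 16
```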
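Multi-head attention is computed per head as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q, K, V \in \mathbb{R}^{N \times D}$ are query, key, value matrices
- $d_k$ is the head dimension; attention scores are scaled by $1/\sqrt{d_k}$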
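A minimal NumPy sketch of multi-head self-attention with the $1/\sqrt{d_k}$ scaling is given below. The weight matrices `w_q`, `w_k`, `w_v`, `w_o` and the small demo sizes are assumptions for illustration; the sketch omits biases, dropout, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(z, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over a sequence z of shape (N, D)."""
    n, d = z.shape
    if d % num_heads:
        raise ValueError("model dimension must be divisible by num_heads")
    d_k = d // num_heads  # head dimension

    def split(x):
        # (N, D) -> (heads, N, d_k)
        return x.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(z @ w_q), split(z @ w_k), split(z @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, N, N)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    heads = weights @ v                               # (heads, N, d_k)
    merged = heads.transpose(1, 0, 2).reshape(n, d)   # concatenate heads
    return merged @ w_o, weights

# Hypothetical demo sizes: N = 16 tokens, D = 64, 4 heads.
rng = np.random.default_rng(0)
N, D, HEADS = 16, 64, 4
z = rng.normal(size=(N, D))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(D, D)) for _ in range(4))

out, attn = multi_head_attention(z, w_q, w_k, w_v, w_o, HEADS)
print(out.shape, attn.shape)  # (16, 64) (4, 16, 16)
```

The `(heads, N, N)` score tensor is where the quadratic cost in sequence length comes from.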
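The quantization operation:
$$z_q = e_k, \quad k = \arg\min_j \|z - e_j\|_2$$
Total quantization loss (sum of the codebook and commitment terms):
$$\mathcal{L}_{VQ} = \|sg(z) - e\|_2^2 + \beta\|z - sg(e)\|_2^2$$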
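To make the codebook and commitment terms concrete, the sketch below evaluates their forward values in NumPy. Two assumptions are worth flagging: the function name `vq_loss_terms` is hypothetical, and the squared norms are averaged over the $N$ vectors, which is an implementation choice the text does not specify. NumPy has no automatic differentiation, so the stop-gradient operator is not modelled here; in the forward pass it is the identity, which is why the two terms differ only by the factor $\beta$.

```python
import numpy as np

def vq_loss_terms(z, e, beta=0.25):
    """Forward values of the codebook and commitment losses.

    z: encoder outputs, shape (N, D)
    e: selected codebook vectors e_{k(i)}, shape (N, D)
    """
    # Mean over vectors of the squared L2 norm (averaging is an assumption).
    sq_dist = np.mean(np.sum((z - e) ** 2, axis=1))
    codebook_loss = sq_dist            # gradient would flow to e only
    commitment_loss = beta * sq_dist   # gradient would flow to z only
    return codebook_loss, commitment_loss, codebook_loss + commitment_loss

z = np.array([[0.0, 0.0], [1.0, 1.0]])
e = np.array([[1.0, 0.0], [1.0, 3.0]])
codebook_loss, commitment_loss, total = vq_loss_terms(z, e)
print(codebook_loss, commitment_loss, total)  # 2.5 0.625 3.125
```

In an autodiff framework the same expressions would wrap `z` or `e` in the framework's stop-gradient call, so that each term updates only the codebook or only the encoder.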
Typical hyperparameters:
- Image size: 256×256
- Patch size: 64×64
- Model dimension: 1024
- Number of heads: 16
- Number of layers: 12
- Codebook size: 8192
- $\beta$ (commitment cost): 0.25
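One way to keep these values together is a small configuration object. The class name `TokenizerConfig` and its field names are hypothetical, not taken from the repository; the derived properties simply restate the formulas used elsewhere in this document.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical container for the typical hyperparameters listed above."""
    image_size: int = 256
    patch_size: int = 64
    model_dim: int = 1024
    num_heads: int = 16
    num_layers: int = 12
    codebook_size: int = 8192
    commitment_cost: float = 0.25

    @property
    def num_tokens(self) -> int:
        # N = HW / P^2 for a square image
        return (self.image_size // self.patch_size) ** 2

    @property
    def head_dim(self) -> int:
        # d_k = D / number of heads
        return self.model_dim // self.num_heads

    @property
    def bits_per_token(self) -> float:
        # log2(K) bits identify one codebook entry
        return math.log2(self.codebook_size)

cfg = TokenizerConfig()
print(cfg.num_tokens, cfg.head_dim, cfg.bits_per_token)  # 16 64 13.0
```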
Input:
- RGB images: $\mathbb{R}^{H \times W \times 3}$
- Normalized to [-1, 1] range
Output:
- Sequence of discrete tokens: $\{0, \dots, K-1\}^N$
- Token sequence length = $\frac{HW}{P^2}$
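The input normalization can be sketched as below. The text only states the target range, so the exact mapping `x / 127.5 - 1` from 8-bit pixel values is an assumption, as are the helper names `normalize_image` and `denormalize_image`.

```python
import numpy as np

def normalize_image(image_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 RGB values in [0, 255] to float32 values in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0

def denormalize_image(x: np.ndarray) -> np.ndarray:
    """Inverse mapping from [-1, 1] back to uint8 values in [0, 255]."""
    return np.clip(np.rint((x + 1.0) * 127.5), 0, 255).astype(np.uint8)

# A synthetic 16x16 RGB ramp covering every 8-bit value.
ramp = np.arange(256, dtype=np.uint8).reshape(16, 16, 1).repeat(3, axis=2)
x = normalize_image(ramp)
print(x.min(), x.max())  # -1.0 1.0
```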
- Compression Rate:
  - Input: $H \times W \times 3$ bytes
  - Output: $\frac{HW}{P^2} \times \log_2(K)$ bits
  - Compression ratio: $\frac{24 P^2}{\log_2(K)}$, roughly 7,560:1 for the typical hyperparameters above ($P = 64$, $K = 8192$)
- Computational Complexity:
  - Attention: $O(N^2D)$ per layer
  - Vector Quantization: $O(NKD)$
- Memory Usage:
  - Codebook: $O(KD)$ parameters
  - Transformer: $O(L D^2)$ parameters
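These quantities can be evaluated for the typical hyperparameters with a few lines of arithmetic. The sketch assumes 8 bits per colour channel for the raw input and treats the big-O expressions as plain products with constants dropped, so the cost figures are order-of-magnitude estimates only. The function name `compression_ratio` is hypothetical.

```python
import math

def compression_ratio(height, width, patch_size, codebook_size, channels=3):
    """Ratio of raw 8-bit input size to token-stream size, both in bits."""
    input_bits = height * width * channels * 8
    num_tokens = (height // patch_size) * (width // patch_size)
    output_bits = num_tokens * math.log2(codebook_size)
    return input_bits / output_bits

# Typical hyperparameters from this document.
H, W, P, D, K = 256, 256, 64, 1024, 8192
N = (H // P) * (W // P)

ratio = compression_ratio(H, W, P, K)
attention_cost = N * N * D   # order of N^2 * D operations per layer
vq_cost = N * K * D          # order of N * K * D for the codebook search
codebook_params = K * D      # K vectors of dimension D

print(f"compression ratio ~ {ratio:.0f}:1")
print(attention_cost, vq_cost, codebook_params)
```

With only 16 tokens, the codebook search term is far larger than the per-layer attention term, since $K$ is much greater than $N$ in this configuration.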


