1D Tokenizer

A simple 1D image tokenizer based on the paper "An Image is Worth 32 Tokens for Reconstruction and Generation".

Image Tokenizer

A neural architecture for encoding images into sequences of discrete tokens, enabling efficient image compression and representation learning through the use of Vision Transformers and Vector Quantization.

Overview

The Image Tokenizer converts images into sequences of discrete tokens using a two-stage process:

Vision Transformer (ViT) for learning spatial relationships
Vector Quantization (VQ) for discretization

This architecture enables efficient image compression, learned discrete representations, and interpretable latent spaces suitable for downstream tasks.

Architecture Details

Image Tokenizer Pipeline

The image tokenizer processes input images $x \in \mathbb{R}^{H \times W \times C}$ through the following stages:

Patch embedding and tokenization
Transformer-based contextual encoding
Vector quantization
Discrete token generation

Vision Transformer (ViT)

The Vision Transformer processes images through several key stages:

Patch Embedding:
- Input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into $N = \frac{HW}{P^2}$ patches
- Each patch $x_p \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$
- Result: sequence of patch embeddings $z_0 \in \mathbb{R}^{N \times D}$
Position Encoding:
- Learned position embeddings $E_{pos} \in \mathbb{R}^{N \times D}$ added to patch embeddings
- Input sequence: $z_0 + E_{pos}$
Transformer Encoding:
- $L$ layers of multi-head self-attention and MLP blocks
- Layer $l$ computation: $$z'_l = \text{MSA}(\text{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \text{MLP}(\text{LN}(z'_l)) + z'_l$$
- Output: contextual representations $z_L \in \mathbb{R}^{N \times D}$
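
The patch-splitting step can be illustrated with a dependency-free sketch. This is not the repository's implementation: `patchify` is a hypothetical helper, the image is a nested list rather than a tensor, and the learned projection to dimension $D$ is omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Returns N = (H * W) / P^2 patches in row-major order, each of length P^2 * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Toy 4 x 4 image with C = 3 channels; pixel (r, c) holds [r, c, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each of length 2 * 2 * 3 = 12
```

In a real model each flattened patch would then be multiplied by a learned $(P^2 \cdot C) \times D$ projection matrix before the position embeddings are added.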

Vector Quantization (VQ)

The VQ layer maps continuous latent vectors to discrete tokens:

Codebook:
- Contains $K$ embedding vectors: $\{e_k\}_{k=1}^K$ where $e_k \in \mathbb{R}^D$
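- Learned during training via the codebook loss; straight-through gradient estimation passes gradients through the non-differentiable quantization step back to the encoder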
Quantization Process:
- For each input vector $z_i$, find nearest codebook vector: $$k(i) = \arg\min_k \|z_i - e_k\|_2$$
- Replace with selected codebook vector: $$z_q^i = e_{k(i)}$$
Training Objectives:
- Codebook loss: $\|\text{sg}(z) - e\|_2^2$
- Commitment loss: $\beta\|z - \text{sg}(e)\|_2^2$
- Where $\text{sg}(\cdot)$ is the stop-gradient operator
Token Generation:
- Each quantized vector replaced by codebook index
- Final output: sequence of $N$ discrete tokens $\{k(i)\}_{i=1}^N$
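
The nearest-neighbor lookup and token generation described above can be sketched in plain Python. The names `squared_distance` and `quantize` are hypothetical, and the toy codebook uses $K = 3$, $D = 2$; a practical implementation would vectorize this with a tensor library.

```python
def squared_distance(u, v):
    """Squared Euclidean distance; its argmin matches the argmin of the L2 norm."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Returns (indices, quantized), where indices are the discrete tokens k(i)
    and quantized holds the selected codebook vectors e_{k(i)}.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]  # K = 3 entries, D = 2
latents = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]  # N = 3 encoder outputs
tokens, z_q = quantize(latents, codebook)
print(tokens)  # [1, 2, 0]
```

The exhaustive search over the codebook costs $O(NKD)$ operations for $N$ latent vectors, $K$ codebook entries, and dimension $D$.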

Mathematical Framework

Image Processing

For an input image with dimensions $H \times W$:

Patch size $P$ results in $N = \frac{HW}{P^2}$ patches
Each patch produces one embedding in final sequence
Example: 256×256 image with 64×64 patches yields 16 embeddings
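
As a quick check of the arithmetic, the patch count can be computed directly. `num_patches` is a hypothetical helper, and it assumes the image dimensions are exact multiples of the patch size.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping P x P patches in an H x W image: N = HW / P^2."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


print(num_patches(256, 256, 64))  # 16
print(num_patches(256, 256, 16))  # 256
```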

Attention Mechanism

Each attention head computes scaled dot-product attention: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

$Q, K, V \in \mathbb{R}^{N \times d_k}$ are the per-head query, key, and value matrices
$d_k$ is the head dimension, and $\sqrt{d_k}$ is the scaling factor
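
The attention formula can be traced by hand with a small, dependency-free sketch for a single head. The function names are hypothetical, matrices are lists of rows, and the learned query, key, and value projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])  # head dimension, taken from the key vectors
    scale = 1.0 / math.sqrt(d_k)
    output = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)
```

Because every query scores all $N$ keys, the cost grows quadratically with the sequence length.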

Vector Quantization

The quantization operation $q(z)$ is defined as: $$q(z) = e_k \text{ where } k = \arg\min_j \|z - e_j\|_2$$

Total loss: $$\mathcal{L} = \mathcal{L}_\text{reconstruction} + \|\text{sg}(z) - e\|_2^2 + \beta\|z - \text{sg}(e)\|_2^2$$
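
The two quantization terms in the loss are numerically the same squared distance, apart from the factor $\beta$; the stop-gradient only decides which side receives gradients. A minimal sketch of the forward values, using a hypothetical helper `vq_loss_terms` and omitting the reconstruction term:

```python
def vq_loss_terms(z, e, beta=0.25):
    """Forward values of the codebook and commitment losses for one vector pair.

    The stop-gradient operator does not change these values. In a framework
    with automatic differentiation, the codebook loss would update only e and
    the commitment loss would update only z.
    """
    squared_dist = sum((a - b) ** 2 for a, b in zip(z, e))
    codebook_loss = squared_dist
    commitment_loss = beta * squared_dist
    return codebook_loss, commitment_loss


z = [0.9, 1.2]  # encoder output
e = [1.0, 1.0]  # selected codebook vector
codebook_loss, commitment_loss = vq_loss_terms(z, e)
print(codebook_loss, commitment_loss)  # approximately 0.05 and 0.0125
```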

Model Configuration

Typical hyperparameters:

Image size: 256×256
Patch size: 64×64
Model dimension: 1024
Number of heads: 16
Number of layers: 12
Codebook size: 8192
$\beta$ (commitment cost): 0.25
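
These values could be grouped in a small configuration object that also exposes the derived quantities. The class and field names below are hypothetical and are not taken from the repository's config files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical container for the hyperparameters listed above."""

    image_size: int = 256
    patch_size: int = 64
    model_dim: int = 1024
    num_heads: int = 16
    num_layers: int = 12
    codebook_size: int = 8192
    commitment_cost: float = 0.25  # beta

    @property
    def num_tokens(self) -> int:
        """Sequence length N = (image_size / patch_size)^2."""
        return (self.image_size // self.patch_size) ** 2

    @property
    def head_dim(self) -> int:
        """Per-head dimension d_k = model_dim / num_heads."""
        return self.model_dim // self.num_heads


config = TokenizerConfig()
print(config.num_tokens, config.head_dim)  # 16 64
```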

Input-Output Specifications

Input:

RGB images: $\mathbb{R}^{H \times W \times 3}$
Normalized to [-1, 1] range

Output:

Sequence of discrete tokens: $\{0, \dots, K-1\}^N$
Token sequence length = $\frac{HW}{P^2}$
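
The document does not state how inputs are normalized. One common convention, assumed here, maps 8-bit channel values in [0, 255] linearly onto [-1, 1]; both helper names are hypothetical.

```python
def normalize_pixel(value):
    """Map an 8-bit channel value in [0, 255] to [-1, 1] (assumed convention)."""
    return value / 127.5 - 1.0


def denormalize_pixel(value):
    """Inverse mapping from [-1, 1] back to an integer in [0, 255]."""
    return int(round((value + 1.0) * 127.5))


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```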

Performance Characteristics

Compression Rate:
- Input: $H \times W \times 3$ bytes
- Output: $\frac{HW}{P^2} \times \log_2(K)$ bits
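- Example: with the typical configuration (256×256 RGB image, $P = 64$, $K = 8192$), the input is 1,572,864 bits and the output is $16 \times 13 = 208$ bits, a ratio of about 7,562:1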
Computational Complexity:
- Attention: $O(N^2D)$ per layer
- Vector Quantization: $O(NKD)$
Memory Usage:
- Codebook: $O(KD)$ parameters
- Transformer: $O(L D^2)$ parameters
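
The compression formulas can be evaluated for the typical hyperparameters listed under Model Configuration. This sketch assumes 8 bits per input channel and rounds the bits per token up to a whole number; `compression_ratio` is a hypothetical helper.

```python
import math


def compression_ratio(height, width, patch_size, codebook_size,
                      channels=3, bits_per_channel=8):
    """Return (input_bits, output_bits, ratio) for one image."""
    input_bits = height * width * channels * bits_per_channel
    num_tokens = (height // patch_size) * (width // patch_size)
    bits_per_token = math.ceil(math.log2(codebook_size))
    output_bits = num_tokens * bits_per_token
    return input_bits, output_bits, input_bits / output_bits


input_bits, output_bits, ratio = compression_ratio(256, 256, 64, 8192)
print(input_bits, output_bits, round(ratio))  # 1572864 208 7562
```

The ratio counts only the token indices; the codebook and decoder weights needed for reconstruction are shared across all images and are not included.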
