BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen¹,³ · Hanchen Xia¹ · Peng Tu¹ · Haojun Shi¹ · Liwei Zhang¹ · Weihao Yuan⁴ · Siyu Zhu¹,²,³,†

¹Shanghai Academy of AI for Science · ²Shanghai Innovation Institute · ³Fudan University · ⁴Nanjing University

🚀 Quick Start

git clone https://github.com/fudan-generative-vision/Bard-VL.git
cd Bard-VL

conda create -n bard-vl python=3.12 -y
conda activate bard-vl

pip install -r requirements.txt

⚙️ Training

1. Data Preparation

To prepare the training data, run:

bash tools/preprocessing.sh

The script downloads the public source datasets, runs the local converters, and builds the mixed metadata set that the training configs reference as datasets/mixed-8192-17M.

Additional preprocessing notes are in tools/preprocessing/README.md.

Training sample format

The training pipeline expects each sample to contain a messages field in chat format. A minimal multimodal example is shown below.

{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "FineVision-processed/chart2text/images/example_chart.jpg"
        },
        {
          "type": "text",
          "text": "Please clarify the meaning conveyed by this graph."
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The graph shows ..."
        }
      ]
    }
  ]
}
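A quick structural check can catch malformed samples before a long training run. The sketch below is a hypothetical helper, not part of the repository: `validate_sample`, `ALLOWED_ROLES`, and `ALLOWED_TYPES` are names introduced here, and the allowed roles and content types are only those visible in the example above. The real data loader may accept more.

```python
from typing import Any

# Roles and content types seen in the example above (assumption: the real
# loader may accept additional values, such as other media types).
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_TYPES = {"text", "image"}


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one training sample (empty if none)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems: list[str] = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}]: must be an object")
            continue
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role {role!r}")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"messages[{i}]: 'content' must be a list")
            continue
        for j, part in enumerate(content):
            kind = part.get("type") if isinstance(part, dict) else None
            if kind not in ALLOWED_TYPES:
                problems.append(f"messages[{i}].content[{j}]: unexpected type {kind!r}")
            elif not isinstance(part.get(kind), str):
                # A "text" part carries a "text" key; an "image" part carries "image".
                problems.append(f"messages[{i}].content[{j}]: missing {kind!r} string")
    return problems
```

Running it over each record of a metadata file and printing the non-empty results is usually enough to spot a missing key or a misspelled role.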

2. Stage 1: Progressive Block Merging

The PBM configs live under examples/vlm_finetune/bard_vl/, and matching launch wrappers are available under scripts/.

| Variant | Config |
| --- | --- |
| `B4-Mask-2B-Instruct` | `examples/vlm_finetune/bard_vl/bard_vl_b4_mask_2b_instruct.yaml` |
| `B4-Mask-4B-Instruct` | `examples/vlm_finetune/bard_vl/bard_vl_b4_mask_4b_instruct.yaml` |
| `B4-Mask-8B-Instruct` | `examples/vlm_finetune/bard_vl/bard_vl_b4_mask_8b_instruct.yaml` |

Download the base checkpoints with:

bash tools/download_weights.sh

Launch training from the repository root. Example:

bash scripts/bard_vl_b4_mask_4b_instruct.sh

Checkpoints are written under the configured experiment directory, for example:

exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_19999/

3. Export Checkpoint

The distillation configs expect Hugging Face-style model directories such as pretrained_models/Bard-VL-B4-Mask-4B-Instruct. To convert a PBM training checkpoint into that layout, run tools/consolidate_checkpoint.py:

python3 tools/consolidate_checkpoint.py \
  --dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_39999/model \
  --output-dir pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --source-model-dir pretrained_models/Qwen3-VL-Bard-4B-Instruct
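When exporting several checkpoints, it can help to assemble the command programmatically. The following is a minimal sketch under stated assumptions: `build_export_command` and `CHECKPOINT_RE` are hypothetical names, the `epoch_*_step_*` naming and the `model` subdirectory are inferred from the example paths in this README, and only the three flags shown above are used.

```python
import re
from pathlib import Path

# Directory naming inferred from the example paths in this README (assumption).
CHECKPOINT_RE = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def build_export_command(checkpoint_dir: str, output_dir: str, source_model_dir: str) -> list[str]:
    """Build the argv for tools/consolidate_checkpoint.py from a PBM checkpoint directory."""
    checkpoint = Path(checkpoint_dir)
    if CHECKPOINT_RE.match(checkpoint.name) is None:
        raise ValueError(f"not an epoch_*_step_* directory: {checkpoint}")
    return [
        "python3", "tools/consolidate_checkpoint.py",
        # The example above points --dcp-dir at the "model" subdirectory.
        "--dcp-dir", (checkpoint / "model").as_posix(),
        "--output-dir", output_dir,
        "--source-model-dir", source_model_dir,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.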

4. Stage 2: Stage-Wise Distillation

The SWD configs live under examples/distillation/:

| Variant | Config |
| --- | --- |
| `B4-Mask-2B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b4_mask_2b.yaml` |
| `B8-Mask-2B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b8_mask_2b.yaml` |
| `B16-Mask-2B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b16_mask_2b.yaml` |
| `B8-Mask-4B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b8_mask_4b.yaml` |
| `B16-Mask-4B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b16_mask_4b.yaml` |
| `B32-Mask-4B-Distil-Instruct` | `examples/distillation/bard_vl_kd_diffusion_b32_mask_4b.yaml` |

The repository includes launch wrappers such as scripts/bard_vl_kd_diffusion_b4_mask_2b.sh. Edit the environment-specific lines in those wrappers before running them.

bash scripts/bard_vl_kd_diffusion_b4_mask_2b.sh

SWD checkpoints are already saved as ready-to-load Hugging Face-style model directories under exps/<project>/step-*.
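To pick the most recent SWD checkpoint for inference or evaluation, a small helper along these lines may be convenient. This is a sketch, not repository code: `latest_step_dir` is a hypothetical name, and it assumes the part after `step-` is a plain integer step count, which this README does not confirm.

```python
from pathlib import Path
from typing import Optional


def latest_step_dir(project_dir: str) -> Optional[Path]:
    """Return the step-* subdirectory with the largest numeric suffix, or None."""
    candidates = []
    for path in Path(project_dir).glob("step-*"):
        suffix = path.name[len("step-"):]
        # Assumption: the suffix is a plain integer; other names are skipped.
        if path.is_dir() and suffix.isdigit():
            candidates.append((int(suffix), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing the suffixes as integers avoids the lexicographic trap where `step-900` would sort after `step-1000`.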

The shell launchers under scripts/ and the evaluation examples under eval/lmms-eval/examples/models/ contain conda activation commands, filesystem paths, and environment variables from the original training environment. Treat them as templates and adapt those lines to your own setup.

🎬 Inference

inference.py contains minimal examples for image and video understanding. Edit the messages list inside the script to select the modality and prompt you want to test.

python3 inference.py \
  --model_id pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --block_size 4 \
  --denoising_steps 4 \
  --confidence_threshold 0.6

Sample local assets are available under assets/.
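As an illustration of what an edited messages list might look like, the sketch below builds a single-turn request in the same chat format as the training samples. `build_messages` is a hypothetical helper, the asset path in the usage note is a placeholder, and the `"video"` content key is an assumption modeled on the `"image"` entry. Check inference.py for the exact structure it expects.

```python
from typing import Any, Optional


def build_messages(
    prompt: str,
    media_path: Optional[str] = None,
    media_type: str = "image",
) -> list[dict[str, Any]]:
    """Build a single-turn chat with an optional image or video part before the prompt."""
    content: list[dict[str, Any]] = []
    if media_path is not None:
        if media_type not in ("image", "video"):
            raise ValueError(f"unsupported media type: {media_type!r}")
        # "image" parts follow the training sample format shown earlier;
        # the "video" key is an assumption, not confirmed by this README.
        content.append({"type": media_type, media_type: media_path})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

For example, `build_messages("Describe this image.", "assets/example.jpg")` yields one user message with an image part followed by a text part, while omitting `media_path` gives a text-only prompt.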

📊 Evaluation

The repository vendors an LMMS-Eval setup under eval/lmms-eval/. The Bard-VL model wrapper is implemented in eval/lmms-eval/lmms_eval/models/simple/bard_vl.py.

A clean single-node evaluation example is:

cd eval/lmms-eval
export PYTHONPATH=../..:$PYTHONPATH

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
  --model bard_vl \
  --model_args "pretrained=../../pretrained_models/Bard-VL-B4-Mask-4B-Instruct,max_pixels=4194304,block_size=4,attn_implementation=sdpa,remasking_strategy=low_confidence_static,confidence_threshold=1.0,interleave_visuals=False" \
  --tasks mme,realworldqa,mmstar,ai2d,chartqa \
  --batch_size 1 \
  --output_path eval_results/bard_vl_b4_mask_4b
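The `--model_args` value is a single comma-separated string of `key=value` pairs, which is easy to mistype when sweeping settings. A minimal sketch for assembling it from a dictionary follows. `format_model_args` is a hypothetical helper, and the keys and values are copied from the command above.

```python
def format_model_args(args: dict[str, object]) -> str:
    """Join key/value pairs into the comma-separated string passed to --model_args."""
    parts = []
    for key, value in args.items():
        text = str(value)
        # Commas and "=" in keys would be ambiguous in this flat encoding.
        if "," in text or "," in key or "=" in key:
            raise ValueError(f"cannot encode {key}={text!r} in a comma-separated list")
        parts.append(f"{key}={text}")
    return ",".join(parts)


model_args = format_model_args({
    "pretrained": "../../pretrained_models/Bard-VL-B4-Mask-4B-Instruct",
    "max_pixels": 2048 * 2048,  # 4194304, the value used in the command above
    "block_size": 4,
    "attn_implementation": "sdpa",
    "remasking_strategy": "low_confidence_static",
    "confidence_threshold": 1.0,
    "interleave_visuals": False,
})
```

Keeping the settings in a dictionary makes it straightforward to vary one value, such as `block_size`, across runs while leaving the rest unchanged.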

Alternatively, run the bundled bash script for multi-node evaluation:

cd eval/lmms-eval
bash examples/models/bard_vl.sh

🗂️ Repository Layout

| Path | Purpose |
| --- | --- |
| `inference.py` | Minimal generation example for image, video, and text inputs |
| `examples/vlm_finetune/bard_vl/` | Stage-1 PBM training configs |
| `examples/distillation/` | Stage-2 SWD configs |
| `train/` | Stage-2 model distillation training code |
| `tools/preprocessing.sh` | Public dataset download and preprocessing entrypoint |
| `tools/consolidate_checkpoint.py` | Export a training checkpoint to a Hugging Face-style model directory |
| `tools/README.md` | Usage notes for the utilities under `tools/` |
| `eval/lmms-eval/` | Evaluation harness with a Bard-VL model wrapper |
| `scripts/` | Local launch wrappers with machine-specific settings |

📝 Citation

If you find Bard-VL useful in your research, please cite our paper:

@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Chen, Baoyou and Xia, Hanchen and Tu, Peng and Shi, Haojun and Mu, Shan and Yuan, Weihao and Zhu, Siyu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}

🙏 Acknowledgements

This repository builds on NVIDIA NeMo AutoModel.
