BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Baoyou Chen1,3 · Hanchen Xia1 · Peng Tu1 · Haojun Shi1 · Liwei Zhang1 · Weihao Yuan4 · Siyu Zhu1,2,3,†
1Shanghai Academy of AI for Science · 2Shanghai Innovation Institute · 3Fudan University · 4Nanjing University
git clone https://github.com/fudan-generative-vision/Bard-VL.git
cd Bard-VL
conda create -n bard-vl python=3.12 -y
conda activate bard-vl
pip install -r requirements.txt

To prepare the training data, run:
bash tools/preprocessing.sh

This downloads the public source datasets, runs the local converters, and prepares the mixed metadata set used by the training configs that point to datasets/mixed-8192-17M.
Additional preprocessing notes are in tools/preprocessing/README.md.
Training sample format
The training pipeline expects each sample to contain a messages field in chat format. A minimal multimodal example is shown below.
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "FineVision-processed/chart2text/images/example_chart.jpg"
        },
        {
          "type": "text",
          "text": "Please clarify the meaning conveyed by this graph."
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The graph shows ..."
        }
      ]
    }
  ]
}

The PBM configs live under examples/vlm_finetune/bard_vl/, and matching launch wrappers are available under scripts/.
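Before launching training, it can help to sanity-check converted samples against the chat format shown earlier. The sketch below is not part of the repository: validate_sample is a hypothetical helper, the set of accepted roles is taken only from the example, and the rule that each content part stores its payload under a key named after its type is an assumption inferred from the "text" and "image" parts.

```python
# Assumption: only the roles that appear in the README example are accepted.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample):
    """Return a list of problems found in one training sample (empty if none)."""
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"messages[{i}]: 'content' must be a list of parts")
            continue
        for j, part in enumerate(content):
            kind = part.get("type") if isinstance(part, dict) else None
            # Assumption: the payload lives under a key named after the part type,
            # as with "text" and "image" in the example sample.
            if not isinstance(kind, str) or kind not in part:
                problems.append(f"messages[{i}].content[{j}]: missing type or payload")
    return problems


sample = {
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Describe the chart."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "The graph shows ..."}]},
    ]
}
print(validate_sample(sample))  # prints [] for a well-formed sample
```

The function only checks structure; it does not verify that image paths exist or that the conversation ends with an assistant turn.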
Download the base checkpoints with:
bash tools/download_weights.sh

Launch training from the repository root. Example:
bash scripts/bard_vl_b4_mask_4b_instruct.sh

Checkpoints are written under the configured experiment directory, for example:
exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_19999/
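To pick up the most recent checkpoint programmatically, a small helper along these lines may be useful. It assumes checkpoint directories follow the epoch_<E>_step_<S> naming seen in the example path; that pattern is inferred from a single example, and latest_checkpoint is a hypothetical name rather than a repository utility.

```python
import re
from pathlib import Path

# Assumption: directory names look like "epoch_0_step_19999".
CKPT_PATTERN = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def latest_checkpoint(exp_dir):
    """Return the checkpoint directory with the highest (epoch, step), or None."""
    best_key, best_path = None, None
    for child in Path(exp_dir).iterdir():
        match = CKPT_PATTERN.match(child.name)
        if not (child.is_dir() and match):
            continue  # skip logs, configs, and anything else in the experiment dir
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_path = key, child
    return best_path
```

Comparing the parsed integers rather than the raw names matters because step_9999 sorts after step_19999 lexicographically.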
The distillation configs expect Hugging Face-style model directories such as pretrained_models/Bard-VL-B4-Mask-4B-Instruct. If your PBM training produced a checkpoint directory, export it with tools/consolidate_checkpoint.py:
python3 tools/consolidate_checkpoint.py \
  --dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_39999/model \
  --output-dir pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --source-model-dir pretrained_models/Qwen3-VL-Bard-4B-Instruct

The SWD configs live under examples/distillation/.
The repository includes launch wrappers such as scripts/bard_vl_kd_diffusion_b4_mask_2b.sh. Edit the environment-specific lines in those wrappers before running them.
bash scripts/bard_vl_kd_diffusion_b4_mask_2b.sh

SWD checkpoints are already saved as ready-to-load Hugging Face-style model directories under exps/<project>/step-*.
The shell launchers under scripts/ and the evaluation examples under eval/lmms-eval/examples/models/ include local conda activation, filesystem paths, and environment variables from the original training environment, so they should be treated as templates.
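One way to see what needs editing in a launcher is to scan it for lines that look machine-specific. The sketch below is a rough heuristic, not a repository tool: find_machine_specific_lines and its three patterns are illustrative assumptions, and it will miss some cases and may flag harmless ones.

```python
import re

# Heuristic patterns only; adjust them to your own environment.
PATTERNS = {
    "conda activation": re.compile(r"\bconda\s+activate\b"),
    "exported variable": re.compile(r"^\s*export\s+\w+="),
    # A "/" that starts a path, i.e. not preceded by a name, ".", "$", "}", "/" or ":".
    "absolute path": re.compile(r"(?<![\w.$}/:])/[\w.-]+(?:/[\w.-]+)*"),
}


def find_machine_specific_lines(script_text):
    """Return (line_number, label, line) tuples for lines worth reviewing."""
    findings = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore comments and the shebang line
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label, line.strip()))
    return findings


if __name__ == "__main__":
    import sys

    # Usage: python3 scan_launcher.py scripts/bard_vl_b4_mask_4b_instruct.sh
    with open(sys.argv[1], encoding="utf-8") as handle:
        for lineno, label, line in find_machine_specific_lines(handle.read()):
            print(f"{lineno}: [{label}] {line}")
```

Relative paths such as tools/download_weights.sh and variable-based paths such as $HOME/data are deliberately not flagged, since they usually carry over between machines.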
inference.py contains minimal examples for image and video understanding. Edit the messages list inside the script to select the modality and prompt you want to test.
python3 inference.py \
  --model_id pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --block_size 4 \
  --denoising_steps 4 \
  --confidence_threshold 0.6

Sample local assets are available under assets/.
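This README does not spell out how --block_size, --denoising_steps, and --confidence_threshold interact. As a rough mental model, the toy below shows one common scheme used by block-wise masked-diffusion decoders: all masked positions in a block are predicted in parallel, predictions at or above the threshold are committed, at least one token is committed per step, and the final step commits whatever remains. This is an assumption about the general technique, not a description of inference.py; toy_predict is a hypothetical stand-in for the model and returns random confidences.

```python
import random

MASK = None  # marker for a position that has not been decoded yet


def toy_predict(position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    return f"tok{position}", rng.random()


def decode_block(block_size, denoising_steps, confidence_threshold, rng):
    """Decode one block; return (tokens, number_of_steps_actually_used)."""
    if denoising_steps < 1:
        raise ValueError("denoising_steps must be at least 1")
    block = [MASK] * block_size
    steps_used = 0
    for step in range(denoising_steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break  # every position was committed early
        steps_used += 1
        proposals = {i: toy_predict(i, rng) for i in masked}
        last_step = step == denoising_steps - 1
        accepted = [
            i for i in masked
            if last_step or proposals[i][1] >= confidence_threshold
        ]
        if not accepted:
            # Guarantee progress: commit the single most confident proposal.
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            block[i] = proposals[i][0]
    return block, steps_used


rng = random.Random(0)
for threshold in (0.0, 0.6, 1.0):
    tokens, steps = decode_block(4, 4, threshold, rng)
    print(threshold, steps, tokens)
```

In this toy, a threshold of 0.0 commits the whole block in a single step, while a threshold of 1.0 is never reached by the random confidences, so one token is committed per step until the last step. Intermediate values trade the number of steps against how confident each committed token must be.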
The repository vendors an LMMS-Eval setup under eval/lmms-eval/. The Bard-VL model wrapper is implemented in eval/lmms-eval/lmms_eval/models/simple/bard_vl.py.
A clean single-node evaluation example is:
cd eval/lmms-eval
export PYTHONPATH=../..:$PYTHONPATH
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
  --model bard_vl \
  --model_args "pretrained=../../pretrained_models/Bard-VL-B4-Mask-4B-Instruct,max_pixels=4194304,block_size=4,attn_implementation=sdpa,remasking_strategy=low_confidence_static,confidence_threshold=1.0,interleave_visuals=False" \
  --tasks mme,realworldqa,mmstar,ai2d,chartqa \
  --batch_size 1 \
  --output_path eval_results/bard_vl_b4_mask_4b

Alternatively, run the provided bash script for multi-node evaluation:
cd eval/lmms-eval
bash examples/models/bard_vl.sh

| Path | Purpose |
|---|---|
| inference.py | Minimal generation example for image, video, and text inputs |
| examples/vlm_finetune/bard_vl/ | Stage-1 PBM training configs |
| examples/distillation/ | Stage-2 SWD configs |
| train/ | Stage-2 model distillation training code |
| tools/preprocessing.sh | Public dataset download and preprocessing entrypoint |
| tools/consolidate_checkpoint.py | Export a training checkpoint to a Hugging Face-style model directory |
| tools/README.md | Usage notes for the utilities under tools/ |
| eval/lmms-eval/ | Evaluation harness with a Bard-VL model wrapper |
| scripts/ | Local launch wrappers with machine-specific settings |

If you find Bard-VL useful in your research, please cite our paper:
@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Chen, Baoyou and Xia, Hanchen and Tu, Peng and Shi, Haojun and Mu, Shan and Yuan, Weihao and Zhu, Siyu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}

This repository builds on top of NVIDIA NeMo AutoModel.