fudan-generative-vision/Bard-VL

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen1,3 · Hanchen Xia1 · Peng Tu1 · Haojun Shi1 · Liwei Zhang1 · Weihao Yuan4 · Siyu Zhu1,2,3,†

1Shanghai Academy of AI for Science   ·   2Shanghai Innovation Institute   ·   3Fudan University   ·   4Nanjing University

🚀 Quick Start

git clone https://github.com/fudan-generative-vision/Bard-VL.git
cd Bard-VL
conda create -n bard-vl python=3.12 -y
conda activate bard-vl

pip install -r requirements.txt

⚙️ Training

1. Data Preparation

To prepare the training data, run:

bash tools/preprocessing.sh

The script downloads the public source datasets, runs the local converters, and prepares the mixed metadata set that the training configs reference as datasets/mixed-8192-17M.

Additional preprocessing notes are in tools/preprocessing/README.md.

Training sample format

The training pipeline expects each sample to contain a messages field in chat format. A minimal multimodal example is shown below.

{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "FineVision-processed/chart2text/images/example_chart.jpg"
        },
        {
          "type": "text",
          "text": "Please clarify the meaning conveyed by this graph."
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The graph shows ..."
        }
      ]
    }
  ]
}
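
Before launching a long run, it can help to sanity-check samples against this format. The following is a minimal sketch, not part of the repository: `validate_sample` is a hypothetical helper, and both the allowed role set and the rule that each content part stores its payload under a key matching its `type` are inferred from the example above rather than taken from the training code.

```python
# Hypothetical helper: checks a sample against the chat format shown above.
# The role set and the "payload key equals type" rule are assumptions
# inferred from the example, not rules confirmed by the training pipeline.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one sample (empty if none)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, list):
            problems.append(f"message {i}: 'content' must be a list of parts")
            continue
        for j, part in enumerate(content):
            kind = part.get("type")
            # e.g. {"type": "text", "text": ...} or {"type": "image", "image": ...}
            if kind not in part:
                problems.append(
                    f"message {i}, part {j}: missing payload for type {kind!r}"
                )
    return problems


sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "example_chart.jpg"},
            {"type": "text", "text": "Describe this chart."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "The graph shows ..."},
        ]},
    ]
}
print(validate_sample(sample))
```

An empty list means the sample matches the structure of the example; it does not guarantee that the referenced image file exists.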

2. Stage 1: Progressive Block Merging

The PBM configs live under examples/vlm_finetune/bard_vl/, and matching launch wrappers are available under scripts/.

| Variant | Config |
| --- | --- |
| B4-Mask-2B-Instruct | examples/vlm_finetune/bard_vl/bard_vl_b4_mask_2b_instruct.yaml |
| B4-Mask-4B-Instruct | examples/vlm_finetune/bard_vl/bard_vl_b4_mask_4b_instruct.yaml |
| B4-Mask-8B-Instruct | examples/vlm_finetune/bard_vl/bard_vl_b4_mask_8b_instruct.yaml |

Download the base checkpoints with:

bash tools/download_weights.sh

Launch training from the repository root. Example:

bash scripts/bard_vl_b4_mask_4b_instruct.sh

Checkpoints are written under the configured experiment directory, for example:

exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_19999/
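
When an experiment directory holds several checkpoints, the most recent one has to be picked by numeric step, because a plain string sort would rank step_9999 after step_19999. The sketch below assumes every checkpoint directory follows the epoch_<E>_step_<S> pattern of the example above; the helper names are hypothetical and not part of the repository.

```python
import re

# Assumed naming pattern, based on the example directory name above.
CKPT_PATTERN = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def parse_checkpoint_name(name):
    """Return (epoch, step) as integers, or None if the name does not match."""
    match = CKPT_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def latest_checkpoint_name(names):
    """Return the name with the highest (epoch, step), ignoring other entries."""
    parsed = [(parse_checkpoint_name(n), n) for n in names]
    parsed = [(key, n) for key, n in parsed if key is not None]
    return max(parsed)[1] if parsed else None


names = ["epoch_0_step_9999", "epoch_0_step_19999", "logs"]
print(latest_checkpoint_name(names))
```

To apply this to a real experiment directory, pass the entry names from `os.listdir("exps/bard_vl_b4_mask_4b_instruct")`.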

3. Export Checkpoint

The distillation configs expect HuggingFace-style model directories such as pretrained_models/Bard-VL-B4-Mask-4B-Instruct. If your PBM training produced a checkpoint directory, export it with tools/consolidate_checkpoint.py:

python3 tools/consolidate_checkpoint.py \
  --dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_39999/model \
  --output-dir pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --source-model-dir pretrained_models/Qwen3-VL-Bard-4B-Instruct

4. Stage 2: Stage-Wise Distillation

The SWD configs live under examples/distillation/:

| Variant | Config |
| --- | --- |
| B4-Mask-2B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b4_mask_2b.yaml |
| B8-Mask-2B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b8_mask_2b.yaml |
| B16-Mask-2B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b16_mask_2b.yaml |
| B8-Mask-4B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b8_mask_4b.yaml |
| B16-Mask-4B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b16_mask_4b.yaml |
| B32-Mask-4B-Distil-Instruct | examples/distillation/bard_vl_kd_diffusion_b32_mask_4b.yaml |
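
The variant names and config file names in the table follow a regular pattern (block size, then model size). As a convenience for scripting sweeps, the mapping can be sketched as below. `swd_config_path` is a hypothetical helper, not a repository utility, and it only accepts the six combinations listed in the table.

```python
import re

# (block size, model size in billions) pairs listed in the table above.
SWD_VARIANTS = {(4, 2), (8, 2), (16, 2), (8, 4), (16, 4), (32, 4)}
VARIANT_PATTERN = re.compile(r"^B(\d+)-Mask-(\d+)B-Distil-Instruct$")


def swd_config_path(variant: str) -> str:
    """Map a variant name from the table to its distillation config path."""
    match = VARIANT_PATTERN.match(variant)
    if match is None:
        raise ValueError(f"unrecognized variant name: {variant!r}")
    block, size = int(match.group(1)), int(match.group(2))
    if (block, size) not in SWD_VARIANTS:
        raise ValueError(f"no listed config for variant: {variant!r}")
    return f"examples/distillation/bard_vl_kd_diffusion_b{block}_mask_{size}b.yaml"


print(swd_config_path("B8-Mask-4B-Distil-Instruct"))
```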

The repository includes launch wrappers such as scripts/bard_vl_kd_diffusion_b4_mask_2b.sh. Edit the environment-specific lines in those wrappers before running them.

bash scripts/bard_vl_kd_diffusion_b4_mask_2b.sh

SWD checkpoints are already saved as ready-to-load Hugging Face-style model directories under exps/<project>/step-*.

The shell launchers under scripts/ and the evaluation examples under eval/lmms-eval/examples/models/ contain conda activation commands, filesystem paths, and environment variables from the original training environment. Treat them as templates and adapt them to your own setup before running.

🎬 Inference

inference.py contains minimal examples for image and video understanding. Edit the messages list inside the script to select the modality and prompt you want to test.

python3 inference.py \
  --model_id pretrained_models/Bard-VL-B4-Mask-4B-Instruct \
  --block_size 4 \
  --denoising_steps 4 \
  --confidence_threshold 0.6
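
This README does not spell out how the three decoding flags interact. As a rough intuition only, the sketch below illustrates generic confidence-thresholded block decoding, a common scheme in block-diffusion language models. It is not taken from inference.py, the function names are hypothetical, and the assumption that the flags map onto this scheme is ours. Here a block starts fully masked, each denoising step commits every masked position whose confidence reaches the threshold, and the single most confident position is committed when none qualifies so that every step makes progress.

```python
def select_positions_to_unmask(confidences, masked, threshold):
    """Pick masked positions whose confidence reaches the threshold.

    Falls back to the single most confident masked position when none
    qualifies, so each step commits at least one token.
    """
    ordered = sorted(masked)
    chosen = [i for i in ordered if confidences[i] >= threshold]
    if not chosen and ordered:
        chosen = [max(ordered, key=lambda i: confidences[i])]
    return chosen


def unmask_schedule(confidences, block_size, threshold, max_steps):
    """Return the positions committed at each step for one block.

    Simplification: confidences are held fixed across steps. A real model
    would recompute them after every step.
    """
    masked = set(range(block_size))
    schedule = []
    for _ in range(max_steps):
        if not masked:
            break
        chosen = select_positions_to_unmask(confidences, masked, threshold)
        schedule.append(chosen)
        masked.difference_update(chosen)
    return schedule


toy_confidences = [0.9, 0.3, 0.7, 0.5]
print(unmask_schedule(toy_confidences, block_size=4, threshold=0.6, max_steps=4))
```

Under this scheme, a lower threshold commits more tokens per step and finishes a block in fewer steps, while a threshold that no confidence reaches degenerates to one token per step.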

Sample local assets are available under assets/.

📊 Evaluation

The repository vendors an LMMS-Eval setup under eval/lmms-eval/. The Bard-VL model wrapper is implemented in eval/lmms-eval/lmms_eval/models/simple/bard_vl.py.

A clean single-node evaluation example is:

cd eval/lmms-eval
export PYTHONPATH=../..:$PYTHONPATH

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
  --model bard_vl \
  --model_args "pretrained=../../pretrained_models/Bard-VL-B4-Mask-4B-Instruct,max_pixels=4194304,block_size=4,attn_implementation=sdpa,remasking_strategy=low_confidence_static,confidence_threshold=1.0,interleave_visuals=False" \
  --tasks mme,realworldqa,mmstar,ai2d,chartqa \
  --batch_size 1 \
  --output_path eval_results/bard_vl_b4_mask_4b
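
The --model_args value is a single comma-separated list of key=value pairs, which is easy to mistype by hand. A minimal sketch for assembling it from a dictionary follows. `format_model_args` is a hypothetical helper, and the keys and values are copied from the command above.

```python
def format_model_args(args: dict) -> str:
    """Join key=value pairs with commas, as passed to --model_args."""
    return ",".join(f"{key}={value}" for key, value in args.items())


model_args = {
    "pretrained": "../../pretrained_models/Bard-VL-B4-Mask-4B-Instruct",
    "max_pixels": 4194304,  # 4194304 == 2048 * 2048
    "block_size": 4,
    "attn_implementation": "sdpa",
    "remasking_strategy": "low_confidence_static",
    "confidence_threshold": 1.0,
    "interleave_visuals": False,
}

print(format_model_args(model_args))
```

Keeping the arguments in a dictionary makes it simpler to vary one setting, such as block_size, across several evaluation runs.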

For multi-node evaluation, use the provided bash script instead:

cd eval/lmms-eval
bash examples/models/bard_vl.sh

🗂️ Repository Layout

| Path | Purpose |
| --- | --- |
| inference.py | Minimal generation example for image, video, and text inputs |
| examples/vlm_finetune/bard_vl/ | Stage-1 PBM training configs |
| examples/distillation/ | Stage-2 SWD configs |
| train/ | Stage-2 model distillation training code |
| tools/preprocessing.sh | Public dataset download and preprocessing entrypoint |
| tools/consolidate_checkpoint.py | Export a training checkpoint to a Hugging Face-style model directory |
| tools/README.md | Usage notes for the utilities under tools/ |
| eval/lmms-eval/ | Evaluation harness with a Bard-VL model wrapper |
| scripts/ | Local launch wrappers with machine-specific settings |

📝 Citation

If you find Bard-VL useful in your research, please cite our paper:

@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Chen, Baoyou and Xia, Hanchen and Tu, Peng and Shi, Haojun and Mu, Shan and Yuan, Weihao and Zhu, Siyu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}

🙏 Acknowledgements

This repository builds on top of NVIDIA NeMo AutoModel.
