← return

Which Side Should We Scale? Revisiting Parameter Balance in Vision-Language Models

📂GitHub Repository

This project investigates the critical question of parameter allocation in dual-encoder vision-language models under limited budgets. We constructed a dual-tower architecture combining a pre-trained Vision Transformer (ViT) with a linear projection layer, and a text encoder based on averaged token embeddings from GPT-2. By systematically evaluating models with different scales of vision and language encoders, we found that maintaining a balanced parameter distribution between modalities is more important than the total parameter count in our experimental setup.

Contributions:

  • Conducted a literature review on vision-language dual-encoder architectures
  • Designed and implemented a series of dual-tower models with varying vision-text parameter allocations
  • Performed systematic experiments and analyzed the impact of modality balance on model performance