Dual-encoder architectures are central to efficient image-text retrieval. This work investigates a core question within this framework: under a limited parameter budget, which modality, vision or language, should be scaled preferentially? We construct a dual-tower model composed of a pre-trained ViT and a text encoder built on averaged GPT-2 token embeddings. Systematically testing combinations of encoders at different scales, we find that how parameters are balanced across the two modalities matters more than the total parameter count. For instance, a balanced configuration with ~660M parameters outperforms two larger, imbalanced counterparts with ~760M and ~860M parameters. Furthermore, these two imbalanced models, skewed towards vision and language respectively, show no significant difference in performance. This further suggests that, in our simplified framework, a "balanced" strategy that prevents either modality from becoming a bottleneck is a better design principle than simply favoring one modality over the other. Our study provides empirical guidance on parameter allocation for VLM design under resource constraints.
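For concreteness, the following is a minimal sketch of the dual-tower setup described above. The specific checkpoints ("google/vit-base-patch16-224", "gpt2"), the 512-dimensional shared embedding space, the linear projection heads, and the choice to mean-pool GPT-2's final hidden states (rather than its input embedding table) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, GPT2Model


class DualTowerRetriever(nn.Module):
    """Sketch of a ViT/GPT-2 dual encoder for image-text retrieval (assumed details)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Vision tower: pre-trained ViT; the [CLS] token summarizes the image.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Language tower: GPT-2; token representations are mean-pooled into one vector.
        self.language = GPT2Model.from_pretrained("gpt2")
        # Linear projections map both towers into a shared embedding space.
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.language.config.hidden_size, embed_dim)

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        cls = self.vision(pixel_values=pixel_values).last_hidden_state[:, 0]
        return F.normalize(self.vision_proj(cls), dim=-1)

    def encode_text(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.language(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Average token representations, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(self.text_proj(pooled), dim=-1)

    def forward(self, pixel_values, input_ids, attention_mask) -> torch.Tensor:
        # Cosine-similarity matrix between all images and texts in the batch;
        # retrieval ranks candidates by these scores.
        img = self.encode_image(pixel_values)
        txt = self.encode_text(input_ids, attention_mask)
        return img @ txt.T
```

Scaling either tower then amounts to swapping in a larger ViT or GPT-2 checkpoint while keeping this interface fixed, which is how the balanced and imbalanced configurations in the study can be compared under a common retrieval objective.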
We would like to express our sincere gratitude to Yifan Hou for his patient guidance and continuous support throughout this project. His feedback during meetings and technical discussions was instrumental in shaping our work.
We also thank Prof. Mrinmaya Sachan for designing and teaching this excellent course, which gave us the opportunity to explore and build this project.