A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Document type: Foreign-language journal article

First author: Xing, Jialu

Authors: Xing, Jialu; Liu, Jianping; Sun, Lulu; Chen, Xi; Gu, Xunxun; Wang, Yingfei; Wang, Jian

Author affiliations:

Keywords: Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter

Journal: COMPUTERS & GRAPHICS-UK (Impact factor: 2.5; Five-year impact factor: 2.2)

ISSN: 0097-8493

Year/Volume: 2024, Vol. 119

Pages:

Indexed in: SCI

Abstract: The Vision-Language Model (VLM) is a popular research field at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, Efficient Fine-Tuning (EFT), a family of very low-cost tuning methods, has greatly alleviated this problem, and, driven by it, a new fine-tuning paradigm has developed. Since Prompt and Adapter are the most widely used methods in the vision-language field, this review focuses on analysing the progress of their application. First, we review the VLM research paradigm based on differences in pre-training and fine-tuning methods. Next, we categorize Prompt into 3 types (7 subtypes) of usage patterns based on the modal information involved, and categorize Adapter into 2 types of usage patterns based on whether it plays a role in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT, and propose possible future research directions.
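For readers unfamiliar with the two techniques the survey covers, the following is a minimal, hedged sketch (not taken from the paper) of prompt tuning and a bottleneck adapter in PyTorch. All names here (SoftPrompt, Adapter, bottleneck_dim, n_prompt_tokens) are illustrative assumptions, not the survey's API; the common idea is that the pre-trained backbone stays frozen and only a small number of new parameters are trained.

```python
# Illustrative sketch of the two EFT techniques surveyed (hypothetical names).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> non-linearity -> up-project + residual."""
    def __init__(self, d_model: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))

class SoftPrompt(nn.Module):
    """Prompt tuning: learnable vectors prepended to the token sequence."""
    def __init__(self, d_model: int, n_prompt_tokens: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Share one prompt set across the batch and prepend it.
        p = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([p, tokens], dim=1)

# The pre-trained backbone is frozen; only prompt/adapter parameters train.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for param in backbone.parameters():
    param.requires_grad = False

prompt, adapter = SoftPrompt(512), Adapter(512)
x = torch.randn(2, 16, 512)           # (batch, tokens, dim) dummy embeddings
out = adapter(backbone(prompt(x)))    # prompts prepended, backbone frozen, adapter tuned
print(out.shape)                      # torch.Size([2, 24, 512])
```

In both cases the trainable parameter count is a small fraction of the backbone's, which is what makes these methods "efficient" in the sense the abstract describes.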

Classification:
