EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Text content
We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves...
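The pretext task in the abstract (reconstructing masked-out target features from the visible patches) can be illustrated with a small NumPy sketch. This is only a sketch under stated assumptions, not the authors' implementation: the negative cosine similarity loss, the uniform random masking, the 0.4 mask ratio, the array sizes, and the names `random_mask` and `masked_feature_loss` are all illustrative choices that do not come from the text above. Random arrays stand in for the image-text aligned target features and for the ViT's predictions.

```python
import numpy as np


def random_mask(num_patches, ratio, rng):
    """Return a boolean mask over patches; True marks a masked-out patch.

    Assumption: patches are masked uniformly at random (illustrative only).
    """
    num_masked = int(round(num_patches * ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_feature_loss(pred, target, mask):
    """Mean negative cosine similarity, computed on masked patches only.

    pred, target: arrays of shape (num_patches, dim)
    mask: boolean array of shape (num_patches,), True = masked out
    Assumption: negative cosine similarity is used here as one plausible
    feature-regression loss; the text above does not name the loss.
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = (pred_n * target_n).sum(axis=-1)
    return float(-cos[mask].mean())


rng = np.random.default_rng(0)
num_patches, dim = 196, 16  # hypothetical sizes, e.g. a 14 x 14 patch grid

# Stand-in for target features produced by a frozen image-text aligned encoder.
target = rng.normal(size=(num_patches, dim))
mask = random_mask(num_patches, 0.4, rng)

# A perfect reconstruction gives a loss close to -1, the opposite vector close to +1.
perfect = masked_feature_loss(target, target, mask)
opposite = masked_feature_loss(-target, target, mask)
```

Only the masked positions contribute to the loss, so the model is scored on features it could not see directly and has to infer them from the visible patches.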
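Overall description
This is a screenshot of an academic paper titled "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale". The page contains the paper title, the author list with affiliations, a GitHub link to the code and models, the abstract, and a diagram labeled "Figure 1". The diagram shows the EVA workflow: scaling up MIM pre-training (using 30 million images and 150 epochs) to go from CLIP to EVA (one billion parameters), and transferring EVA to downstream tasks such as image classification, video action classification, and object detection.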
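Source notes
The image content comes from an academic paper on computer vision and machine learning. The authors are affiliated with institutions including the Beijing Academy of Artificial Intelligence, Huazhong University of Science and Technology, Zhejiang University, and Beijing Institute of Technology. The page provides a GitHub link to the code and models: https://github.com/baaivision/EVA. Papers of this kind are typically presented at academic conferences or posted on preprint platforms such as arXiv, and may later be included in academic journals.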