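Paper link: An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

ViT: treat an image as a sequence of patch tokens, rather than as a pixel grid or a convolutional feature map, and then apply a standard Transformer Encoder directly for global modeling.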
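To make the "image as a patch token sequence" idea concrete, here is a minimal NumPy sketch of the patch-splitting step only. The function name `patchify`, the 224x224x3 input size, and the channels-last layout are assumptions chosen for illustration. The learned linear projection, class token, and position embeddings that come before the Transformer Encoder are deliberately left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (left to right, top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size != 0 or w % patch_size != 0:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C): split both spatial axes into blocks.
    blocks = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # (grid_h, grid_w, patch, patch, C): bring the two grid axes together.
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into one vector.
    return blocks.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Hypothetical input: one 224x224 RGB image, split into 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows is one token: a 16x16x3 patch flattened to 768 values. In a full model, these rows would be linearly projected to the model width before entering the encoder, so self-attention can relate any patch to any other patch from the first layer onward.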