CVPR – the Conference on Computer Vision and Pattern Recognition, the world's leading computer vision conference – was filled with researchers and professionals discussing recent accomplishments in the field and looking ahead to the future of the discipline and of AI in general.
In this piece, we analyse the themes and highlights of CVPR 2023 – both a perspective on the conference and a prediction of the issues that will dominate the computer vision landscape in the coming year.
Vision Transformers on the rise
The transformer architecture has driven significant advances across AI research, repeatedly pushing the state of the art. More recently it has arrived in computer vision in the form of the vision transformer (ViT), which splits an image into a grid of patches and treats each patch like a token in a text sequence, allowing the same architecture to be used for vision tasks.
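The patchification step is simple to sketch. Below is a minimal NumPy illustration (not code from any of the papers discussed) of how a 224×224 RGB image becomes a sequence of patch "tokens"; in a real ViT a learned linear projection and position embeddings would follow:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into a sequence of flattened patches.

    Each patch plays the role a token plays in NLP: the resulting
    (H/P * W/P) patches form the "sentence" the transformer attends over.
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dim 768 (16*16*3).
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

This is why the "same architecture" claim holds: once the image is a sequence of vectors, the downstream transformer layers are unchanged from the NLP setting.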
We observed a wave of new vision transformer work at CVPR 2023, with researchers analysing its biases, pruning it, pretraining it, distilling it, reverse distilling it, and applying it to new tasks.
Some of the vision transformer papers featured at CVPR 2023:
- OneFormer: One Transformer To Rule Universal Image Segmentation
- Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
- SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Foundational Models for Computer Vision
General pre-trained models have proven to be versatile multi-task learners, removing the need for a separate, often time-consuming, fine-tuning run for every new machine learning problem. In NLP, language models trained to predict the next token have become that foundation, with performance scaling with model size. No comparable model or loss objective has yet emerged in the computer vision research community to serve as a foundation model for CV tasks.
As we heard during the Tuesday and Wednesday keynotes, there is frequently a "do more with less" mentality in academic AI research. Academic researchers recognise that they cannot compete head-on with industry research labs that have access to massive computational resources for training broad models.
That said, we noticed multiple research labs at CVPR working on foundation models, mostly at the interface of language and images.
Some of the general pre-trained computer vision models that drew heavy attention at CVPR:
- Grounding DINO: zero-shot object detection, multi-modal
- SAM: zero-shot segmentation, image only
- Multi-modal GPT-4
- Florence: general task, multi-modal
- OWL-ViT: zero-shot object detection, multi-modal
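At a high level, the multi-modal zero-shot detectors in this list score image regions against text embeddings of arbitrary class names, so no fixed label set is baked into the model. The following is a toy sketch of that matching step only, with random vectors standing in for the real transformer image and text encoders used by models like OWL-ViT and Grounding DINO:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize embeddings so the dot product is cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real encoders: in an actual model these embeddings would
# come from trained image and text towers, not random draws.
class_names = ["cat", "dog", "car"]
text_embeddings = normalize(rng.normal(size=(3, 512)))

# One embedding per candidate detection box produced by the image side.
region_embeddings = normalize(rng.normal(size=(5, 512)))

# Open-vocabulary classification: cosine similarity between each region
# and every text prompt. Swapping in new class names at inference time
# requires no retraining -- that is the "zero-shot" property.
similarity = region_embeddings @ text_embeddings.T  # shape (5, 3)
labels = [class_names[i] for i in similarity.argmax(axis=1)]
print(labels)
```

The design choice worth noting is that the label set lives entirely in the text prompts, which is why these models can detect categories never seen as explicit detection labels during training.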
Machine Learning Strategies
Although general models dominated the conversation in the conference halls, the majority of CVPR 2023 research focused on more conventional computer vision techniques and tasks.
Research on problems like tracking, pose estimation, and NeRFs advanced with new methods and procedures.
Researchers also improved training routines through work on machine learning theory, general methodologies, and empirical findings:
- Soft Augmentation for Image Classification
- FFCV: Accelerating Training by Removing Data Bottlenecks
- The Role of Pre-training Data in Transfer Learning
Conclusion
CVPR 2023 highlighted many of the year's significant developments in our field. As multi-modal models promise a new foundation and continued technological advances, computer vision is entering a new stage of industry adoption.
The whole list of CVPR 2023 research papers is available here: