Theory

🦿 Vision Transformers perform exceptionally well on vision tasks, often surpassing 👁️ Convolutional Neural Networks as the state-of-the-art model.

ConvNeXt uses standard CNN techniques to mimic the strategies that allowed transformers to become the new cutting-edge models. In doing so, it finds that convolutions are still competitive and necessary, and shows that a heavily modified 🪜 ResNet can compete with the top transformers.

Model

Starting with ResNet-50, we apply multiple transformer-inspired changes to create ConvNeXt.

  1. Change the optimizer to AdamW and include modern data augmentation and regularization schemes (see the training sketch after this list).
  2. Modify the number of blocks per stage to match Swin Transformers, changing from (3, 4, 6, 3) to (3, 3, 9, 3).
  3. Replace the stem (input processing) block with a "patchify" layer: a 4-by-4 convolution with stride 4.
  4. Use depth-wise convolutions (group convolutions from ResNeXt with the number of groups equal to the number of input channels) together with 1-by-1 convolutions, so that spatial mixing and channel mixing are separated much as self-attention and MLP layers separate them in a transformer (see the block sketch after this list).
  5. Use inverted bottlenecks in each ResNeXt block, inspired by the large hidden dimension in transformers' MLP block.
  6. Increase the depth-wise kernel size to 7-by-7 to mimic Swin's window size, and move the depth-wise convolution up to the start of the block for computational efficiency.
  7. Replace ReLU with GeLU, reduce the number of activation functions and normalization layers, and replace batch normalization with layer normalization to match transformers.
  8. Separate out the downsampling layers, using 2-by-2 convolutions with stride 2, akin to the patch merging layer in the Swin Transformer (see the model sketch after this list).
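
The steps above map onto a small amount of code. Below is a minimal PyTorch sketch of a ConvNeXt-style block covering steps 4–7 (large-kernel depth-wise convolution, a single LayerNorm, inverted bottleneck, GELU). It omits details such as layer scale and stochastic depth, and the class and argument names are illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Step 6: large-kernel depth-wise convolution placed first (7x7, groups=dim).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Step 7: a single LayerNorm instead of several BatchNorms.
        self.norm = nn.LayerNorm(dim)
        # Step 5: inverted bottleneck -- expand 4x, then project back.
        # The 1x1 convolutions are written as Linear layers over the channel dim.
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()                    # Step 7: GELU replaces ReLU.
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)                      # (N, C, H, W)
        x = x.permute(0, 2, 3, 1)               # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return residual + x                     # residual connection from ResNet
```

Writing the 1-by-1 convolutions as Linear layers over a channels-last tensor keeps the whole inverted bottleneck in one memory layout, mirroring how the MLP in a transformer block is usually written.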
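
The patchify stem, separate downsampling layers, and (3, 3, 9, 3) stage depths from steps 2, 3, and 8 can then be assembled around that block. This sketch assumes the ConvNeXtBlock defined above; the channel widths (96, 192, 384, 768) follow the ConvNeXt-T configuration, and the LayerNorm2d helper is an illustrative convenience rather than part of the paper.

```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of an (N, C, H, W) tensor."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)


def build_convnext(num_classes: int = 1000,
                   depths=(3, 3, 9, 3),          # step 2: blocks per stage
                   dims=(96, 192, 384, 768)) -> nn.Module:
    layers = []
    # Step 3: "patchify" stem -- non-overlapping 4x4 convolution with stride 4.
    layers += [nn.Conv2d(3, dims[0], kernel_size=4, stride=4), LayerNorm2d(dims[0])]
    for i, (depth, dim) in enumerate(zip(depths, dims)):
        if i > 0:
            # Step 8: separate downsampling layer -- 2x2 convolution with stride 2,
            # akin to Swin's patch merging.
            layers += [LayerNorm2d(dims[i - 1]),
                       nn.Conv2d(dims[i - 1], dim, kernel_size=2, stride=2)]
        layers += [ConvNeXtBlock(dim) for _ in range(depth)]
    # Classification head: global average pooling, LayerNorm, linear classifier.
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.LayerNorm(dims[-1]), nn.Linear(dims[-1], num_classes)]
    return nn.Sequential(*layers)


model = build_convnext()
logits = model(torch.randn(1, 3, 224, 224))      # -> torch.Size([1, 1000])
```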
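
Finally, step 1 changes the training recipe rather than the architecture. The sketch below shows the flavour of it with AdamW, cosine decay, and label smoothing; the hyperparameter values are assumptions for illustration, and the rest of the augmentation stack (Mixup, CutMix, RandAugment, stochastic depth) lives in the data pipeline and blocks rather than in this snippet.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = build_convnext()                              # from the sketch above
# Step 1: AdamW with weight decay instead of SGD (values are assumptions).
optimizer = AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=300)   # cosine decay over 300 epochs

# One of the modern regularizers, label smoothing, shown directly in the loss.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```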