| Squeeze-and-Excitation Networks | |
|---|---|
| Name | Squeeze-and-Excitation Networks |
| Introduced | 2017 |
| Authors | Jie Hu, Li Shen, Gang Sun |
| Venue | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 |
| Keywords | convolutional neural networks, channel attention, deep learning |
Squeeze-and-Excitation Networks are a family of neural network modules, introduced in 2017, that augment convolutional architectures by explicitly modeling inter-channel relationships. The module, commonly abbreviated SE, was proposed by researchers affiliated with Momenta and the University of Oxford; an SE-equipped model won the ILSVRC 2017 image-classification challenge, and SE blocks demonstrated consistent improvements on benchmarks such as ImageNet and COCO. The approach strongly influenced subsequent work on attention mechanisms in convolutional network design.
Squeeze-and-Excitation Networks arose from efforts to improve the representational capacity of convolutional backbones such as VGGNet, ResNet, DenseNet, Inception, and MobileNet. The design draws on earlier attention and gating mechanisms, including the multiplicative gates of Long Short-Term Memory networks and attention in sequence models such as the Transformer. The SE module distills these ideas from the literature on channel-wise feature recalibration into a lightweight, architecture-agnostic plug-in.
The SE block comprises a global "squeeze" operation followed by an "excitation" gating mechanism applied to channel descriptors, and it integrates seamlessly into residual blocks from the ResNet family and the inverted residuals of MobileNetV2. The squeeze step applies global average pooling, collapsing each H × W feature map to a single per-channel statistic. The excitation step feeds this channel descriptor through a small two-layer perceptron: a bottleneck fully connected layer that reduces the channel dimension by a reduction ratio r (typically 16), a ReLU nonlinearity, a second fully connected layer that restores the original channel count, and a sigmoid producing per-channel gates in (0, 1). The input feature map is then rescaled channel-wise by these gates. For C channels the block adds roughly 2C²/r parameters, a small fraction of a typical convolutional stage, and it remains fully compatible with standard stochastic-gradient-descent training.
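The squeeze-excitation-scale pipeline can be sketched in a few lines of NumPy. This is a minimal single-example sketch, not the authors' reference code: the function names and explicit weight arguments are illustrative, and a real implementation would learn the weights inside a framework such as PyTorch or TensorFlow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation gate to one feature map (sketch).

    x:  (C, H, W) feature map
    w1: (C//r, C), b1: (C//r,)  -- bottleneck FC layer of the excitation MLP
    w2: (C, C//r), b2: (C,)     -- expansion FC layer back to C channels
    """
    # Squeeze: global average pooling -> one descriptor per channel
    z = x.mean(axis=(1, 2))                              # shape (C,)
    # Excitation: bottleneck MLP with ReLU, sigmoid gates in (0, 1)
    s = sigmoid(w2 @ np.maximum(w1 @ z + b1, 0.0) + b2)  # shape (C,)
    # Scale: recalibrate each channel by its gate
    return x * s[:, None, None]
```

Because every gate lies in (0, 1), the block can only attenuate channels relative to the input; the network learns which channels to emphasize by keeping their gates near 1.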
Several variants extend the SE idea: combined channel-spatial attention modules such as CBAM; efficient channel attention in ECA-Net, which replaces the bottleneck MLP with a one-dimensional convolution; selective-kernel attention in SKNet; frequency-domain adaptations such as FcaNet; and lightweight adaptations for edge devices. SE-style gating has also been adopted directly in mobile and scaled architectures, notably MobileNetV3 and EfficientNet, and related channel-attention ideas appear in vision transformers.
Practical training recipes for SE-augmented networks follow standard large-scale image-classification protocols: He initialization for convolutional weights, regularization via Batch Normalization (with Dropout and weight decay where appropriate), and step or cosine learning-rate schedules with SGD plus momentum. Implementations are available in mainstream frameworks including TensorFlow, PyTorch, MXNet, and JAX, and SE-equipped backbones are widely deployed in production vision systems.
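One common ingredient of such recipes, a cosine learning-rate schedule with linear warmup, can be sketched in pure Python. This is a generic illustration of the schedule shape, not a recipe from the SE paper; the function name and default values are assumptions for the example.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=0.1, warmup=5):
    """Cosine-decay learning rate with linear warmup (illustrative sketch)."""
    if epoch < warmup:
        # Linear warmup: ramp from base_lr/warmup up to base_lr
        return base_lr * (epoch + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining epochs
    t = (epoch - warmup) / max(1, total_epochs - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

In practice the value returned for each epoch would be assigned to the optimizer's learning rate before that epoch's SGD updates.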
SE modules yielded consistent top-1 and top-5 accuracy improvements on ImageNet classification in the original paper (an SE-based ensemble won the ILSVRC 2017 classification task with a top-5 error of 2.251%), and the gains have been widely replicated in follow-up studies. Improvements carry over to detection and segmentation benchmarks such as MS COCO and PASCAL VOC. Comparative analyses of FLOPs, latency, and parameter counts situate SE-equipped backbones among state-of-the-art convolutional models alongside architectures such as EfficientNet, RegNet, and HRNet, with EfficientNet itself incorporating SE blocks.
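The parameter-count side of the trade-off is easy to make concrete: each SE block adds two fully connected layers totaling roughly 2C²/r weights plus biases. A quick sketch (the function name is assumed for illustration):

```python
def se_extra_params(channels: int, reduction: int = 16) -> int:
    """Parameters added by one SE block: two FC layers plus their biases."""
    hidden = channels // reduction
    fc1 = channels * hidden + hidden    # descriptor -> bottleneck
    fc2 = hidden * channels + channels  # bottleneck -> per-channel gates
    return fc1 + fc2

# A 256-channel stage with the default reduction ratio r = 16 adds
# 256*16 + 16 + 16*256 + 256 = 8464 parameters.
print(se_extra_params(256))  # 8464
```

Against the millions of parameters in a typical ResNet stage, this overhead is small, which is why the debate around SE blocks centers on latency rather than model size.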
SE modules have been applied across domains: in medical-imaging pipelines for classification and segmentation; in perception stacks for autonomous driving; in remote-sensing and satellite-image analysis; and in multimedia and content-understanding systems. SE ideas also appear in speech and audio models and in multimodal systems, where channel attention applies equally well to spectrogram and embedding features.
Criticisms include marginal gains on some tasks; increased inference latency, since the global pooling and gating introduce a serialization point that can be disproportionately costly on mobile and embedded accelerators; and difficulty integrating SE blocks into tightly constrained hardware pipelines. Some researchers also note that larger architectural or data-scale changes, such as those demonstrated by Vision Transformers, can overshadow per-module gains, fueling debate about the cost-benefit balance of channel attention in production systems.