| Squeeze-and-Excitation Networks | |
|---|---|
| Name | Squeeze-and-Excitation Networks |
| Introduced | 2017 |
| Authors | Jie Hu, Li Shen, Gang Sun |
| Venue | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 |
| Keywords | convolutional neural networks, channel attention, deep learning |
Squeeze-and-Excitation Networks are a family of neural network modules, introduced in 2017, that augment convolutional architectures by explicitly modeling inter-channel relationships. The module, commonly abbreviated SE, was proposed by researchers affiliated with Momenta and the University of Oxford; an SE-equipped model won the ILSVRC 2017 image-classification challenge, and SE blocks demonstrated consistent improvements on benchmarks such as ImageNet and COCO. The approach strongly influenced subsequent work on attention mechanisms in convolutional network design.
Squeeze-and-Excitation Networks arose from efforts to improve the representational capacity of convolutional backbones such as VGGNet, ResNet, DenseNet, Inception, and MobileNet. The design draws on earlier attention and gating mechanisms, including the multiplicative gates of Long Short-Term Memory networks and attention in sequence models such as the Transformer. The SE module distills these ideas from the literature on channel-wise feature recalibration into a lightweight, architecture-agnostic plug-in.
The SE block comprises a global "squeeze" operation followed by an "excitation" gating mechanism applied to channel descriptors, and it integrates seamlessly into residual blocks from the ResNet family and the inverted residuals of MobileNetV2. The squeeze step applies global average pooling, collapsing each H × W feature map to a single per-channel statistic. The excitation step feeds this channel descriptor through a small two-layer perceptron: a bottleneck fully connected layer that reduces the channel dimension by a reduction ratio r (typically 16), a ReLU nonlinearity, a second fully connected layer that restores the original channel count, and a sigmoid producing per-channel gates in (0, 1). The input feature map is then rescaled channel-wise by these gates. For C channels the block adds roughly 2C²/r parameters, a small fraction of a typical convolutional stage, and it remains fully compatible with standard stochastic-gradient-descent training.
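The squeeze-excitation-scale pipeline can be sketched in a few lines of NumPy. This is a minimal single-example sketch, not the authors' reference code: the function names and explicit weight arguments are illustrative, and a real implementation would learn the weights inside a framework such as PyTorch or TensorFlow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation gate to one feature map (sketch).

    x:  (C, H, W) feature map
    w1: (C//r, C), b1: (C//r,)  -- bottleneck FC layer of the excitation MLP
    w2: (C, C//r), b2: (C,)     -- expansion FC layer back to C channels
    """
    # Squeeze: global average pooling -> one descriptor per channel
    z = x.mean(axis=(1, 2))                              # shape (C,)
    # Excitation: bottleneck MLP with ReLU, sigmoid gates in (0, 1)
    s = sigmoid(w2 @ np.maximum(w1 @ z + b1, 0.0) + b2)  # shape (C,)
    # Scale: recalibrate each channel by its gate
    return x * s[:, None, None]
```

Because every gate lies in (0, 1), the block can only attenuate channels relative to the input; the network learns which channels to emphasize by keeping their gates near 1.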
Several variants extend the SE idea: combined channel-spatial attention modules such as CBAM; efficient channel attention in ECA-Net, which replaces the bottleneck MLP with a one-dimensional convolution; selective-kernel attention in SKNet; frequency-domain adaptations such as FcaNet; and lightweight adaptations for edge devices. SE-style gating has also been adopted directly in mobile and scaled architectures, notably MobileNetV3 and EfficientNet, and related channel-attention ideas appear in vision transformers.
Practical training recipes for SE-augmented networks follow standard large-scale image-classification protocols: He initialization for convolutional weights, regularization via Batch Normalization (with Dropout and weight decay where appropriate), and step or cosine learning-rate schedules with SGD plus momentum. Implementations are available in mainstream frameworks including TensorFlow, PyTorch, MXNet, and JAX, and SE-equipped backbones are widely deployed in production vision systems.
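One common ingredient of such recipes, a cosine learning-rate schedule with linear warmup, can be sketched in pure Python. This is a generic illustration of the schedule shape, not a recipe from the SE paper; the function name and default values are assumptions for the example.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=0.1, warmup=5):
    """Cosine-decay learning rate with linear warmup (illustrative sketch)."""
    if epoch < warmup:
        # Linear warmup: ramp from base_lr/warmup up to base_lr
        return base_lr * (epoch + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining epochs
    t = (epoch - warmup) / max(1, total_epochs - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

In practice the value returned for each epoch would be assigned to the optimizer's learning rate before that epoch's SGD updates.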
SE modules yielded consistent top-1 and top-5 accuracy improvements on ImageNet classification in the original paper (an SE-based ensemble won the ILSVRC 2017 classification task with a top-5 error of 2.251%), and the gains have been widely replicated in follow-up studies. Improvements carry over to detection and segmentation benchmarks such as MS COCO and PASCAL VOC. Comparative analyses of FLOPs, latency, and parameter counts situate SE-equipped backbones among state-of-the-art convolutional models alongside architectures such as EfficientNet, RegNet, and HRNet, with EfficientNet itself incorporating SE blocks.
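The parameter-count side of the trade-off is easy to make concrete: each SE block adds two fully connected layers totaling roughly 2C²/r weights plus biases. A quick sketch (the function name is assumed for illustration):

```python
def se_extra_params(channels: int, reduction: int = 16) -> int:
    """Parameters added by one SE block: two FC layers plus their biases."""
    hidden = channels // reduction
    fc1 = channels * hidden + hidden    # descriptor -> bottleneck
    fc2 = hidden * channels + channels  # bottleneck -> per-channel gates
    return fc1 + fc2

# A 256-channel stage with the default reduction ratio r = 16 adds
# 256*16 + 16 + 16*256 + 256 = 8464 parameters.
print(se_extra_params(256))  # 8464
```

Against the millions of parameters in a typical ResNet stage, this overhead is small, which is why the debate around SE blocks centers on latency rather than model size.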
SE modules have been applied across domains: in medical-imaging pipelines for classification and segmentation; in perception stacks for autonomous driving; in remote-sensing and satellite-image analysis; and in multimedia and content-understanding systems. SE ideas also appear in speech and audio models and in multimodal systems, where channel attention applies equally well to spectrogram and embedding features.
Criticisms include marginal gains on some tasks; increased inference latency, since the global pooling and gating introduce a serialization point that can be disproportionately costly on mobile and embedded accelerators; and difficulty integrating SE blocks into tightly constrained hardware pipelines. Some researchers also note that larger architectural or data-scale changes, such as those demonstrated by Vision Transformers, can overshadow per-module gains, fueling debate about the cost-benefit balance of channel attention in production systems.