LLMpedia: The first transparent, open encyclopedia generated by LLMs

Perceiver

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: LLMC Hop 6
Expansion Funnel: Raw 81 → Dedup 0 → NER 0 → Enqueued 0
Perceiver
Name: Perceiver
Developer: DeepMind
First release: 2021
Type: Multimodal neural network

The Perceiver is a neural network architecture introduced by DeepMind in 2021 that generalizes attention-based models to high-dimensional, multimodal inputs such as images, audio, video, and point clouds. It was proposed as a scalable alternative to Transformer-based models such as BERT and ViT for tasks spanning ImageNet classification, audio recognition, and multimodal fusion. The design connects ideas from the Transformer (machine learning model), the Convolutional neural network, and cross-attention mechanisms to enable flexible processing of diverse data modalities.

Introduction

The Perceiver emerged amid growing interest in architectures that can process heterogeneous inputs from datasets such as ImageNet, YouTube-8M, LibriSpeech, COCO (dataset), and the Waymo Open Dataset. It addressed bottlenecks identified in models like GPT-2, GPT-3, BERT, and BigGAN by employing asymmetric attention and latent arrays inspired by concepts from the Sparse Transformer, the Set Transformer, and prior work at DeepMind and Google Research. Early demonstrations included benchmarks involving ResNet and EfficientNet, and comparisons with Swin Transformer and DeiT on vision tasks.

Architecture

The Perceiver's core is a fixed-size latent array that interacts with the inputs via cross-attention; this mirrors cross-modal designs used in systems like CLIP, ALIGN (model), and DALL·E. The architecture replaces full self-attention across the inputs (as in the Transformer (machine learning model)) with iterative cross-attention from the latent array to the inputs, followed by self-attention within the latent space, echoing concepts from the Neural Turing Machine and Memory Networks. Because the latent array is small and fixed in size, the cross-attention cost grows only linearly with the input length, while the quadratic self-attention cost is confined to the latents. Key components include encoder cross-attention, latent self-attention blocks reminiscent of Transformer XL and Reformer (machine learning), and decoder heads for tasks similar to those in T5, BART, and RoBERTa. The design scales to large inputs from datasets like Kinetics-600, AudioSet, and the Open Images Dataset while avoiding the quadratic attention cost of standard Transformers such as the Vision Transformer, a bottleneck that kernel-based approximations like Performer (kernel attention) address by different means.
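The latent-bottleneck encoder described above can be sketched in a few lines. The following is a simplified single-head illustration (residual scaling, layer normalization, multi-head splitting, and positional encodings are omitted; all weights and sizes are illustrative, not the published configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head scaled dot-product attention.
    Queries come from q_in; keys and values come from kv_in."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def perceiver_encode(inputs, latents, params, depth=2):
    """Sketch of the Perceiver encoder loop: each block cross-attends
    the small latent array to the (large) input array, then applies
    self-attention among the latents only."""
    z = latents
    for _ in range(depth):
        # Cross-attention: cost O(M*N), linear in input length N.
        z = z + attention(z, inputs, *params["cross"])
        # Latent self-attention: cost O(M^2), independent of N.
        z = z + attention(z, z, *params["self"])
    return z

rng = np.random.default_rng(0)
N, M, D = 1024, 32, 64          # input tokens, latents, channels
inputs = rng.standard_normal((N, D))
latents = rng.standard_normal((M, D))
params = {
    "cross": [rng.standard_normal((D, D)) * 0.1 for _ in range(3)],
    "self":  [rng.standard_normal((D, D)) * 0.1 for _ in range(3)],
}
z = perceiver_encode(inputs, latents, params)
print(z.shape)  # (32, 64): the latent size is fixed regardless of N
```

Note that doubling `N` only doubles the cross-attention cost; the self-attention cost, which dominates standard Transformers, stays constant because it operates on the `M` latents alone.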

Training and Datasets

Perceiver has been trained on diverse corpora and benchmarks including ImageNet, AudioSet, LibriSpeech, COCO (dataset), Kinetics-600, and multimodal collections used in evaluation of CLIP and ALIGN (model). Training regimes leverage optimizers and schedules common in work from Google Research and DeepMind, including techniques from Adam (optimization algorithm), LAMB (optimizer), and learning rate strategies used in BERT and GPT-2 training. Data augmentation and preprocessing pipelines incorporate methods from AutoAugment, Mixup (data augmentation), and SpecAugment for audio. Evaluation commonly uses metrics from ImageNet Large Scale Visual Recognition Challenge, GLUE, and speech benchmarks like LibriSpeech test-clean.
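One of the augmentation techniques mentioned above, Mixup, can be illustrated with a minimal sketch. The `alpha` value, array shapes, and two-class labels here are illustrative, not a published training configuration:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup data augmentation: a convex combination of two training
    examples and their one-hot labels, with lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(42)
xa, xb = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
ya = np.array([1.0, 0.0])   # one-hot label, class 0
yb = np.array([0.0, 1.0])   # one-hot label, class 1
x_mix, y_mix = mixup(xa, ya, xb, yb, rng=rng)
print(y_mix)  # soft label; entries still sum to 1
```

Because the mixed label is a valid probability distribution, the usual cross-entropy loss applies unchanged; the model is simply trained against soft targets.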

Variants and Extensions

Subsequent work produced variants and extensions, most notably Perceiver IO, which adds a query-based decoder so that structured outputs of arbitrary size can be read out of the latent array. Other extensions combine the backbone with adapters from Adapter (machine learning) and hierarchical designs paralleling Swin Transformer and HRNet. Further work pairs Perceiver-like latent arrays with generative and discriminative decoders inspired by VQ-VAE, PixelCNN, StyleGAN2, and Diffusion models (machine learning), enabling tasks akin to DALL·E 2-style image generation and WaveNet audio synthesis. Research groups at DeepMind, Google Research, OpenAI, and academic labs have experimented with hybrid models merging Perceiver principles with architectures like SqueezeNet, MobileNetV3, and EfficientNet for resource-constrained deployment on hardware from NVIDIA and Google TPU.
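The Perceiver IO read-out mechanism can be sketched similarly to the encoder: a task-specific array of output queries cross-attends to the encoded latents, so the output size is decoupled from both the input and latent sizes. A simplified single-head sketch with illustrative shapes and random weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_io_decode(queries, latents, Wq, Wk, Wv):
    """Perceiver IO-style decoder sketch: an output query array
    cross-attends to the encoded latents, so the number of outputs
    is set by the number of queries, not by the input length."""
    q = queries @ Wq              # (O, D)
    k = latents @ Wk              # (M, D)
    v = latents @ Wv              # (M, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v    # (O, D)

rng = np.random.default_rng(1)
M, D, O = 32, 64, 196             # latents, channels, output positions
latents = rng.standard_normal((M, D))
queries = rng.standard_normal((O, D))   # e.g. one query per output pixel
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = perceiver_io_decode(queries, latents, Wq, Wk, Wv)
print(out.shape)  # (196, 64)
```

Changing the task then amounts to changing the query array: one query per class logit, per output pixel, or per audio frame, all reading from the same latent representation.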

Applications

Perceiver variants have been applied to vision tasks ranging from ImageNet classification and COCO (dataset) object detection to video understanding on Kinetics-400 and YouTube-8M; to audio tasks on AudioSet and LibriSpeech; and to multimodal tasks including captioning evaluated on MS COCO and retrieval challenges using datasets similar to those used by CLIP and ALIGN (model). Uses extend to robotics and perception with the Waymo Open Dataset and KITTI (dataset), scientific data processing in CERN detectors and Human Connectome Project-style neuroimaging, and remote sensing datasets such as SpaceNet. Integrations into pipelines for Google Cloud and research platforms have enabled experimentation in domains explored by OpenAI, DeepMind, Facebook AI Research, and academic groups at Stanford University, Massachusetts Institute of Technology, University of Oxford, University of Toronto, and University of Cambridge.

Performance and Evaluation

Benchmarks report that the Perceiver and Perceiver IO achieve performance competitive with baselines like the Vision Transformer, ResNet, EfficientNet, and Swin Transformer on tasks across ImageNet, AudioSet, and Kinetics-600, at a substantially lower memory cost for long inputs. Papers from DeepMind and follow-up studies evaluated throughput on hardware such as the NVIDIA A100 and TPU v3 and measured trade-offs in FLOPs and latency relative to Transformer baselines such as BERT and RoBERTa. Standard evaluation protocols include metrics from the ImageNet Large Scale Visual Recognition Challenge, COCO mean average precision, and audio recognition scores used in AudioSet leaderboards.
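The FLOP trade-off can be made concrete with back-of-envelope arithmetic. The formulas below count only the two score-matrix multiplications per attention operation (projections and MLPs omitted), and the sizes are illustrative, not measured figures from any paper:

```python
def self_attn_flops(n, d):
    """Full self-attention over n tokens of width d:
    two (n x n x d) matmuls (q @ k.T and attn @ v)."""
    return 2 * n * n * d

def perceiver_flops(n, m, d, depth):
    """Per block: one cross-attention (two m x n x d matmuls)
    plus one latent self-attention (two m x m x d matmuls)."""
    return depth * (2 * m * n * d + 2 * m * m * d)

# Illustrative sizes: a 224x224 image treated as 50,176 pixel tokens.
n, m, d, depth = 50_176, 512, 64, 8
full = self_attn_flops(n, d)
perc = perceiver_flops(n, m, d, depth)
print(f"full self-attention: {full:.2e} FLOPs")
print(f"perceiver encoder:   {perc:.2e} FLOPs")
print(f"ratio: {full / perc:.1f}x")
```

The key point is the scaling, not the constants: full self-attention grows quadratically in `n`, while the Perceiver terms grow linearly in `n` (cross-attention) or not at all (latent self-attention).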

Limitations and Future Directions

Limitations include sensitivity to the choice of latent array size, training stability issues noted in comparisons with Transformer (machine learning model) variants, and challenges in interpretability relative to modular systems such as the Neural Turing Machine and memory-augmented networks. Future directions explored by researchers at DeepMind, Google Research, OpenAI, and universities include integrating Perceiver-like modules with pretraining regimes such as the masked language modeling used in BERT and the contrastive approaches of SimCLR and MoCo, scaling to web-scale multimodal corpora like those used by GPT-3 and CLIP, and optimizing for deployment on accelerators from NVIDIA and Google TPU. Anticipated cross-disciplinary applications include collaborations with projects at CERN, NASA, NOAA, and biomedical consortia such as the Broad Institute and the Allen Institute.

Category:Machine learning models