| ViVa | |
|---|---|
| Name | ViVa |
| Developer | Google DeepMind, University of Oxford |
| Released | October 2023 |
| Latest release version | 2.1 |
| Latest release date | May 2024 |
| Programming language | Python, JAX |
| Operating system | Linux |
| Genre | Artificial intelligence, Computer vision |
| License | Apache License 2.0 |
**ViVa** is an open-source artificial intelligence framework for advanced video understanding and generative video tasks. It integrates state-of-the-art transformer architectures with novel neural network modules to analyze and synthesize dynamic visual data. The project is a collaboration between leading AI research institutions aimed at advancing multimodal learning.
The ViVa framework is designed to unify various video-centric AI models under a single, modular codebase, facilitating research in action recognition, video captioning, and video prediction. Its core philosophy emphasizes reproducibility and extensibility, allowing researchers to build upon benchmarked implementations of models like TimeSformer and VideoMAE. The system is particularly noted for its efficient handling of the high-dimensional data inherent in video streams, a challenge historically addressed by models from Facebook AI Research and Google Research. By providing standardized data loaders and evaluation metrics for datasets such as Kinetics and Something-Something, ViVa accelerates experimental workflows across the machine learning community.
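A core chore such data loaders handle is temporal subsampling: picking a fixed-length clip from a video of arbitrary duration. ViVa's actual API is not documented here, so the sketch below is a generic illustration of uniform temporal sampling of the kind a standardized video loader typically performs; the function name is hypothetical.

```python
import numpy as np

def uniform_frame_indices(num_frames_total, clip_len=16):
    """Pick clip_len frame indices spread evenly across a video.

    Hypothetical helper illustrating the uniform temporal sampling
    commonly used when feeding fixed-length clips to video models.
    """
    return np.linspace(0, num_frames_total - 1, clip_len).round().astype(int)

# Sample a 16-frame clip from a 300-frame (~10 s at 30 fps) video.
idx = uniform_frame_indices(300, clip_len=16)
```

Evaluation pipelines often average predictions over several such clips per video, but a single uniformly sampled clip is the usual starting point.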
The ViVa project was initiated in early 2022 by a joint team from Google DeepMind and the Visual Geometry Group at the University of Oxford, building upon earlier work in image recognition systems like ResNet and vision transformers. A key inspiration was the success of large language models such as GPT-3 in processing sequential data, which researchers sought to adapt for the temporal dimension of video. The first public release, version 1.0, was announced in October 2023 at the Conference on Neural Information Processing Systems, receiving notable attention from groups at Stanford University and MIT CSAIL. Subsequent updates have incorporated advancements from other video AI projects, including NVIDIA's VideoLDM and Meta's Make-A-Video.
ViVa's architecture supports a wide array of features, chief among them spatiotemporal attention mechanisms that process video frames and optical flow data concurrently. It includes pre-trained models for zero-shot learning on novel action classes, leveraging techniques from OpenAI's CLIP. The framework also offers robust tools for video generation, enabling tasks such as future frame prediction and text-conditioned video synthesis. Additional capabilities include anomaly detection in surveillance footage, temporal action localization for scene analysis in film and television, and video question answering in the style of models trained on the TVQA dataset. Its distributed training utilities are optimized for GPU clusters and interoperate with models defined in PyTorch and TensorFlow.
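One well-known way to make spatiotemporal attention tractable is the "divided" scheme popularized by TimeSformer, which ViVa's documentation cites as a benchmarked model: attention is applied first across time (each spatial location attends over frames) and then across space (tokens within a frame attend to each other). The NumPy sketch below illustrates the idea with single-head, unprojected attention; it is a didactic simplification, not ViVa's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the second-to-last axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x):
    """x: (T, S, D) — T frames, S spatial tokens per frame, D channels."""
    # Temporal step: each of the S spatial positions attends across T frames.
    xt = x.transpose(1, 0, 2)           # (S, T, D)
    xt = attention(xt, xt, xt)
    x = xt.transpose(1, 0, 2)           # back to (T, S, D)
    # Spatial step: the S tokens within each frame attend to one another.
    return attention(x, x, x)

out = divided_space_time_attention(np.random.randn(8, 196, 64))
```

Compared with joint attention over all T·S tokens, the divided scheme reduces the attention cost from O((T·S)²) to O(T²·S + T·S²) per layer, which is what makes longer clips affordable.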
The primary applications of ViVa span both academic research and industrial deployment. In healthcare, it is used for gait analysis and surgical video analytics, with pilot studies conducted in partnership with the National Institutes of Health. Within the entertainment industry, studios utilize its scene segmentation tools for post-production editing and automated content moderation on platforms like YouTube and TikTok. Autonomous vehicle companies, including Waymo and Tesla, employ adapted versions for behavior prediction from dashboard camera feeds. Furthermore, ViVa underpins educational technology platforms that generate descriptive audio commentary for the visually impaired, and aids sports analytics firms in performance analysis from broadcast footage of events like the FIFA World Cup.
ViVa is implemented primarily in Python, utilizing the JAX library for high-performance automatic differentiation and accelerated linear algebra operations on hardware from NVIDIA and AMD. Standard models operate on video clips of 16 to 32 frames at resolutions from 224×224 to 512×512 pixels. The ViT-Base configuration contains approximately 86 million parameters, while the larger ViT-Large variant scales to over 300 million. Training typically requires datasets such as ImageNet, WebVid, and HowTo100M, and is conducted on TPU v4 pods or clusters of NVIDIA A100 GPUs. The software is distributed under the permissive Apache License 2.0, with its source code hosted on GitHub and comprehensive documentation available through Read the Docs.
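The quoted figure of roughly 86 million parameters for the ViT-Base configuration can be checked with back-of-the-envelope arithmetic. The sketch below assumes the standard ViT-Base hyperparameters (12 layers, hidden size 768, MLP size 3072, 16×16 patches on a 224×224 input); these are the usual published values, not ViVa-specific numbers.

```python
def vit_base_param_count(depth=12, dim=768, mlp_dim=3072,
                         patch=16, image=224, channels=3, num_classes=1000):
    """Approximate parameter count of a standard ViT-Base encoder."""
    tokens = (image // patch) ** 2 + 1            # 196 patch tokens + [CLS]
    patch_embed = patch * patch * channels * dim + dim
    pos_embed = tokens * dim + dim                # positions + [CLS] token
    per_block = (
        3 * (dim * dim + dim)        # Q, K, V projections
        + dim * dim + dim            # attention output projection
        + dim * mlp_dim + mlp_dim    # MLP up-projection
        + mlp_dim * dim + dim        # MLP down-projection
        + 4 * dim                    # two LayerNorms (scale + bias each)
    )
    head = dim * num_classes + num_classes        # classification head
    final_norm = 2 * dim
    return patch_embed + pos_embed + depth * per_block + final_norm + head

total = vit_base_param_count()   # on the order of 86 million
```

Most of the budget sits in the 12 transformer blocks (about 7.1 million parameters each); the embeddings and head contribute only a few percent.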
Category:Artificial intelligence Category:Computer vision Category:Free science software Category:Video software