Deep Learning from Crawled Spatio-Temporal Representations of Video (DECSTER)
1 July 2018
Applying deep learning to video without using pixel representations, by considering spatio-temporal activity information that is directly extractable from compressed video bitstreams or neuromorphic vision sensing (NVS) hardware.
Amount £840,401
Project website gow.epsrc.ukri.org
Research topics Deep Learning | Video Delivery | Activity Recognition | Scene Recognition | Object Recognition
Video has been one of the most pervasive forms of online media for some time. Several forecasts indicate that video traffic will dominate IP networks within the next five years. Yet video remains one of the least manageable elements of the big data ecosystem. This project argues that the difficulty stems primarily from the fact that all advanced computer vision and machine learning algorithms treat video as a stream of frames of picture elements (pixels). This is despite pixel-domain representations being notoriously difficult to manage in machine learning systems, mainly because of their high volume, the high redundancy between successive frames, and artifacts stemming from camera calibration under varying illumination. We propose to abandon pixel representations and instead consider spatio-temporal activity information that is directly extractable from compressed video bitstreams or neuromorphic vision sensing (NVS) hardware.
The first key outcome of the project will be the design of deep neural networks (DNNs) that ingest such activity information in order to deliver state-of-the-art classification, action recognition and retrieval results on large video datasets. This will be achieved at record-breaking speed and with accuracy comparable to that of the best DNN designs that use pixel-domain video representations and/or optical flow calculations.
The second key outcome will be to design and prototype a crawler-based bitstream parsing and analysis service, where some of the parsing and processing will be carried out by a bitstream crawler running on a remote repository, while the back-end processing will be carried out by high-performance servers in the cloud.
This will enable, for the first time, the continuous parsing of large compressed video content libraries and NVS repositories with new and improved versions of crawlers, in order to derive continuously improved semantics or to track changes and new content elements, much as search engine bots continuously crawl web content. These outcomes will pave the way for exabyte-scale video datasets to be discovered and analysed on commodity hardware.
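As a purely illustrative sketch of the crawler and back-end split described above (not the project's actual design), the following assumes a repository that exposes per-frame block motion vectors already parsed from the bitstream. The names ActivitySummary, crawl_item, crawl_repository and backend_label, the (dx, dy) motion-vector format, and the threshold rule standing in for a DNN are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Tuple

MotionVector = Tuple[float, float]  # assumed (dx, dy) of one coded block
Frame = List[MotionVector]          # all block motion vectors in one frame


@dataclass
class ActivitySummary:
    item_id: str
    crawler_version: int
    mean_magnitude_per_frame: List[float]


def crawl_item(item_id: str, frames: List[Frame], crawler_version: int) -> ActivitySummary:
    """Runs at the repository: reduce motion vectors to a compact summary."""
    means = [
        sum(hypot(dx, dy) for dx, dy in frame) / len(frame) if frame else 0.0
        for frame in frames
    ]
    return ActivitySummary(item_id, crawler_version, means)


def crawl_repository(
    repo: Dict[str, List[Frame]],
    index: Dict[str, int],
    crawler_version: int,
) -> List[ActivitySummary]:
    """Visit only items that are new or were last parsed by an older crawler."""
    summaries = []
    for item_id, frames in repo.items():
        if index.get(item_id, 0) < crawler_version:
            summaries.append(crawl_item(item_id, frames, crawler_version))
            index[item_id] = crawler_version  # remember which version parsed it
    return summaries


def backend_label(summary: ActivitySummary, threshold: float = 1.0) -> str:
    """Runs in the cloud: a threshold rule used here as a stand-in for a DNN."""
    values = summary.mean_magnitude_per_frame
    if not values:
        return "static"
    average = sum(values) / len(values)
    return "high-activity" if average >= threshold else "static"
```

In this sketch only the compact summaries leave the repository, and raising crawler_version causes every item to be revisited, which mirrors the idea of re-crawling content with improved crawlers.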