Video understanding, defined as processing a sequence of streaming images, can be a path to general intelligence. The task is difficult: it requires spatial and temporal reasoning, and even reasoning about physics, object permanence and social knowledge.
My research goal is to understand videos as well as humans do. Understanding videos is hard, and the flagship task in video understanding, action recognition, is only the very first step. Action recognition is often defined as recognizing basic human actions, ranging from something as simple as walking to more complex activities such as a three-point basketball shot or buying a Coke bottle at CVS.
Deep learning models perform best on these tasks when there are strong visual biases, such as co-located object bias (e.g. 'eating bagel'), background bias (e.g. 'swimming'), or even biases in other modalities such as audio. The accuracy of such systems drops when any additional understanding is necessary, such as temporal understanding ('delivering mail' vs. 'retrieving mail'), object permanence ('store checkout with object'), physical reasoning (trajectory prediction) and finally social understanding (e.g. predator vs. prey in the animal kingdom).
In this article, I describe some of my research along with the motivation for pursuing it, which is often missing from the research papers themselves. Back in 2016 or so, I was working on human action recognition for public safety. I found that not only did most action recognition systems overfit, they were also rather poor at fine-grained recognition. They relied heavily on other modalities such as optical flow and audio for higher accuracy, and most ActivityNet contests were won by models that used all these modalities. The systems had huge compute costs and were practically unusable in deployment (compute requirements, multi-modal inputs, etc.).
I started looking into pure RGB-based methods and decided to explore using fine-grained objects as a basis for understanding videos. The intuition was that this should be cheaper than using 3D convolutions or optical flow (at the time). We built a video network that learns spatio-temporal interactions of ROIs rather than objects (since 'object' is a rather rigid definition), using self-attention in parallel streams. This is very similar to transformers that use multi-head attention over input tokens. Our results were quite impressive: we reached a state-of-the-art 74.2% accuracy without using optical flow or 3D convolutions over the video. We later improved our features and reached 77.4% with this model [1]. The model was also able to distinguish fine-grained object categories. For example, Kinetics has several different classes that involve a basketball or a horse but carry different class labels, and our model achieved good accuracy in distinguishing these classes, making progress in reducing object bias. The model is also highly efficient, operating at 1 FPS and using only 2D convolutions.
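To make the idea concrete, here is a minimal sketch of self-attention over per-frame ROI features, in the spirit of the parallel attention streams described above. The module names and dimensions are illustrative assumptions, not those of the actual model in [1].

```python
# Minimal sketch (not the exact architecture of [1]): multi-head
# self-attention over per-frame ROI features, so the model can weigh
# pairwise ROI interactions instead of relying on 3D convolutions.
import torch
import torch.nn as nn

class ROIInteraction(nn.Module):
    def __init__(self, roi_dim=1024, heads=8):
        super().__init__()
        # Each head acts as one "parallel stream" attending over the ROIs.
        self.attn = nn.MultiheadAttention(roi_dim, heads, batch_first=True)
        self.proj = nn.Linear(roi_dim, roi_dim)

    def forward(self, roi_feats):
        # roi_feats: (batch, num_rois, roi_dim), e.g. ROI-pooled 2D-CNN features
        interactions, _ = self.attn(roi_feats, roi_feats, roi_feats)
        # Aggregate the attended ROIs into one frame-level interaction vector.
        return self.proj(interactions.mean(dim=1))

frame_rois = torch.randn(2, 10, 1024)      # 2 clips, 10 ROIs per frame
frame_repr = ROIInteraction()(frame_rois)  # (2, 1024)
```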
Using fine-grained attributes to distinguish and reason about images was quite interesting; it is a hard problem even with just images. I looked at the VQA 1.0 dataset: it was quite biased back then and used very few attributes of the objects in the image. I decided to build a reasoning dataset that uses a large number of attributes with low visual bias. Luckily, such a dataset existed in the text space: the SNLI textual entailment dataset is built on Flickr30K captions. Entailment is the task of deciding whether a hypothesis is implied by a premise. In the multi-modal setting, we pair an image (the premise) with a piece of text (the hypothesis). Since each image is used with multiple pieces of text, the visual bias is virtually eliminated. This dataset is fairly popular and the task is quite hard [2]. For example, there is an image of people getting into one of New York's yellow taxi cabs, and the hypothesis is something like "Two people are taking a cab in New York." This is easy to answer for anyone with a general understanding of New York, but it is not easy for a model to capture this information. Can BERT capture these types of associations? I don't know.
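As a rough illustration of the task setup, the sketch below pairs an image premise with a text hypothesis and classifies the pair into entailment, neutral, or contradiction. The encoders and dimensions are placeholders I am assuming for illustration, not the models evaluated in [2].

```python
# Hedged sketch of the visual entailment setup: an image premise and a
# text hypothesis are fused and classified into three labels.
import torch
import torch.nn as nn

LABELS = ["entailment", "neutral", "contradiction"]

class VisualEntailment(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(LABELS)),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: features of the premise image, txt_feat: encoded hypothesis
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

logits = VisualEntailment()(torch.randn(4, 2048), torch.randn(4, 768))
print(LABELS[logits[0].argmax().item()])
```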
Understanding actions in videos is interesting, but a popular use case is simply retrieving an action snippet from an often long video. This is a difficult task, and most papers that achieve high accuracy on video action retrieval go over the video multiple times; even efficient algorithms process the entire video at least once. I realized that the biggest cost in a real-world deployment is not these algorithms themselves but the feature extraction step. I extended the video action framework with a simple reinforcement learning algorithm that only performs feature extraction when it predicts the video may contain something interesting. It is a very simple idea that shows very good practical results [3].
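A minimal sketch of the underlying idea is below: a tiny gating policy looks at a cheap per-frame summary and decides whether the expensive feature extractor should run on that frame, and it can be trained with a REINFORCE-style reward that trades off retrieval accuracy against extraction cost. All names and dimensions are illustrative assumptions, not the exact algorithm of [3].

```python
# Hedged sketch: a small policy network decides, per frame, whether it is
# worth running the expensive feature extractor at all.
import torch
import torch.nn as nn

class ExtractionPolicy(nn.Module):
    def __init__(self, cheap_dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(cheap_dim, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, cheap_feat):
        # Probability that this frame is "interesting" enough to extract.
        return self.gate(cheap_feat).squeeze(-1)

policy = ExtractionPolicy()
cheap_feats = torch.randn(100, 64)       # cheap per-frame summaries
probs = policy(cheap_feats)
extract = torch.bernoulli(probs).bool()  # sampled extraction decisions
print(f"extracting features for {extract.sum().item()} of 100 frames")
# REINFORCE-style objective (sketch): maximize reward for correct
# retrieval minus a cost proportional to the number of extractions.
```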
Understanding actions from RGB data is interesting, but we now have a large number of new sensors that capture non-visual data. For example, some hardware sensors use WiFi signals to recover human keypoint information, and LIDAR sensors give us a mesh of objects that can be segmented or processed to obtain human keypoints and object boundaries. I started looking into this modality, again aiming to capture interactions over keypoints. My intern and I began exploring transformers for multi-person pose tracking. Using appropriate positional encodings is key to getting good performance with transformers, but that alone was not enough to solve the tracking problem using only keypoint data. We re-framed the problem as an entailment task: given a pair of poses, determine whether the two poses temporally follow one another [4]. Evidently, entailment is a powerful training method, perhaps similar to the contrastive approaches that have gained popularity recently.
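Here is a hedged sketch of the pose-entailment formulation: embed the keypoints of two poses as tokens, encode them jointly with a transformer, and classify whether the second pose temporally follows the first. The dimensions and layer counts are assumptions for illustration, not those of the model in [4].

```python
# Sketch of pose entailment: does pose B temporally follow pose A?
import torch
import torch.nn as nn

class PoseEntailment(nn.Module):
    def __init__(self, num_kpts=15, dim=64):
        super().__init__()
        self.embed = nn.Linear(2, dim)     # (x, y) keypoint -> token
        self.which = nn.Embedding(2, dim)  # marks pose A vs. pose B
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(dim, 2)       # follows / does not follow

    def forward(self, pose_a, pose_b):
        # pose_a, pose_b: (batch, num_kpts, 2) keypoint coordinates
        tokens = torch.cat([self.embed(pose_a) + self.which.weight[0],
                            self.embed(pose_b) + self.which.weight[1]], dim=1)
        encoded = self.encoder(tokens)
        return self.cls(encoded.mean(dim=1))

logits = PoseEntailment()(torch.randn(8, 15, 2), torch.randn(8, 15, 2))
```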
While the above models did reach state-of-the-art performance on specific tasks, are they truly intelligent? Can they really reason about people and objects just like we do? It is unlikely, since many papers have shown that deep learning methods often capture spurious correlations that manifest as biases. Can deep learning methods capture object permanence (i.e. re-identify object trajectories), learn the laws of physics from a few videos, or understand social relationships from video data? Can they truly understand and summarize long video snippets? We have a long way to go to address these problems. Modern networks simply do not have the mechanisms to capture this kind of reasoning, the lack of real-world reasoning datasets notwithstanding.
We did take the first steps toward object permanence using a two-stage transformer that identifies salient objects and navigates over video frames to re-identify and connect objects through time. The goal is to mimic human reasoning, so the paper uses auxiliary losses and an algorithm to track the objects. This method reached competitive accuracy on CATER [5]. Other methods that use per-frame supervision (even when objects are hidden) may reach higher performance, but we realized that per-frame supervision triggers the dataset bias, since objects often hide in a few select places in CATER.
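The sketch below only illustrates the general flavor of multi-hop attention: iteratively attending over object tokens across frames while carrying state between hops. It is an assumption-laden simplification, not the Hopper architecture from [5].

```python
# Illustrative multi-hop attention: a query repeatedly attends over
# per-frame object tokens, each hop conditioning on the previous one.
import torch
import torch.nn as nn

class MultiHopAttention(nn.Module):
    def __init__(self, dim=128, hops=3):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for _ in range(hops))

    def forward(self, query, object_tokens):
        # query: (batch, 1, dim); object_tokens: (batch, frames*objects, dim)
        state = query
        for hop in self.hops:
            attended, _ = hop(state, object_tokens, object_tokens)
            state = state + attended  # carry evidence across hops
        return state.squeeze(1)       # final localization state

out = MultiHopAttention()(torch.randn(2, 1, 128), torch.randn(2, 80, 128))
```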
There are also other networks that reach good performance with self-supervision. These are general, powerful methods, but what goes on inside them is anyone's guess. The recent emphasis on self-supervision is timely, since we are using far too much data to train these models, but will it really allow a model to reason? It seems unlikely.
What's next? Expanding to different forms of reasoning: learning physical reasoning, which I believe is crucial for tasks like robotics and self-driving, and incorporating social relationships and world knowledge (like the New York cab example above).
Relevant publications:
[1] Attend and Interact: Higher-Order Object Interactions for Video Understanding. Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. In CVPR, 2018.
[2] Visual Entailment: A Novel Task for Fine-Grained Image Understanding. Ning Xie, Farley Lai, Derek Doran, Asim Kadav. In NeurIPS Workshop on Visually-Grounded Interaction and Language (ViGIL), 2018.
[3] Tripping Through Time: Efficient Localization of Activities in Videos. Meera Hahn, Asim Kadav, James M. Rehg, Hans Peter Graf. In CVPR Workshop on Language and Vision (Spotlight), 2019. Also appears in BMVC, 2020.
[4] 15 Keypoints Is All You Need. Michael Snower, Asim Kadav, Farley Lai, Hans Peter Graf. Ranked #1 on PoseTrack.
[5] Hopper: Multi-hop Transformer for Spatiotemporal Reasoning. Honglu Zhou, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf. In ICLR, 2021.