GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

1Nanjing University,   2ShanghaiTech University,   3University of Illinois Urbana-Champaign;  
* Equal Contribution

TL;DR

We propose GLUS, which unifies the distinct challenges of Referring Video Object Segmentation ("Ref" and "VOS") into a single, simple framework for MLLMs.

Abstract

This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between 'Ref' and 'VOS': they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS models or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that Global and Local consistency can be Unified into a single video Segmentation MLLM: a set of sparse 'context frames' provides global information, while a stream of continuous 'query frames' conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving a new state of the art for MLLMs on the MeViS and Ref-Youtube-VOS benchmarks.
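To make the frame design concrete, below is a minimal, hypothetical sketch (not the released GLUS code) of how a video could be split into sparse context frames for global reasoning and a contiguous window of query frames for local tracking. The function name, parameters, and default counts are illustrative assumptions.

```python
# Minimal sketch, assuming a GLUS-style input split: sparse "context frames"
# sampled across the whole video (global reasoning) plus a continuous window
# of "query frames" (local tracking). Names and defaults are illustrative.

def build_global_local_input(num_video_frames, num_context=4, num_query=4, query_start=0):
    """Return indices of context frames (global) and query frames (local)."""
    # Context frames: evenly spaced over the full video to summarize it globally.
    stride = max(num_video_frames // num_context, 1)
    context_ids = [min(i * stride, num_video_frames - 1) for i in range(num_context)]

    # Query frames: a contiguous clip that is segmented and tracked locally.
    query_ids = [min(query_start + i, num_video_frames - 1) for i in range(num_query)]

    # Per the paper's description, the MLLM consumes both sets together with
    # the referring expression in a single sequence; this sketch only shows
    # the sampling split, not the tokenization.
    return context_ids, query_ids


# Example: a 60-frame video, with the local clip starting at frame 20.
context_ids, query_ids = build_global_local_input(60, query_start=20)
print(context_ids)  # [0, 15, 30, 45]
print(query_ids)    # [20, 21, 22, 23]
```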


GLUS: Global-Local Unified Reasoning Framework for MLLMs

We demonstrate that unifying global and local reasoning into a single MLLM for RefVOS, through the design of context and query frames, constitutes a simple yet effective baseline for MLLM-based RefVOS. We further introduce a plug-and-play object contrastive loss and self-refinement with key frame selection, enabling the MLLM to focus on the correct objects and the most relevant frames.
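As a rough illustration of the plug-and-play object contrastive loss, the sketch below uses a standard InfoNCE-style objective: the segmentation-token embedding is pulled toward the referred object's feature and pushed away from features of other (hard negative) objects in the same video. This is an assumed stand-in for the paper's actual formulation; the tensor names, shapes, and temperature are illustrative.

```python
# Hypothetical InfoNCE-style object contrastive loss, in the spirit of the
# "object contrastive learning" described above; not the exact GLUS loss.
import torch
import torch.nn.functional as F


def object_contrastive_loss(seg_embed, pos_obj_feat, neg_obj_feats, temperature=0.07):
    """
    seg_embed:     (D,)   embedding of the predicted segmentation token
    pos_obj_feat:  (D,)   mask-pooled feature of the referred (positive) object
    neg_obj_feats: (N, D) mask-pooled features of other objects (hard negatives)
    """
    seg = F.normalize(seg_embed, dim=-1)
    pos = F.normalize(pos_obj_feat, dim=-1)
    neg = F.normalize(neg_obj_feats, dim=-1)

    # Cosine similarities (positive first, then negatives), scaled by temperature.
    logits = torch.cat([(seg * pos).sum(-1, keepdim=True), neg @ seg]) / temperature

    # The positive object occupies index 0; cross-entropy makes it the most similar.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)


# Toy usage with random features (D = 256, five negative objects).
loss = object_contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(5, 256))
```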



Quantitative Results


Trained on RefVOS datasets alone, our GLUS achieves competitive performance against MLLM-based approaches trained on datasets from diverse tasks. With expanded training datasets (ED), GLUS achieves state-of-the-art performance across RefVOS benchmarks, especially on MeViS, which consists of complex video scenarios.

Method                MeViS                   Ref-Youtube-VOS
                      J&F     J       F       J&F     J       F

Methods without LLMs
URVOS                 27.8    25.7    29.9    47.2    45.2    49.1
LBDT                  29.3    27.8    30.8    49.4    48.2    50.6
MTTR                  30.0    28.8    31.2    55.3    54.0    56.6
ReferFormer           31.0    29.8    32.2    62.9    61.3    64.6
OnlineRefer           -       -       -       63.5    61.6    65.5
SOC                   -       -       -       67.3    65.3    69.3
TempCD                -       -       -       65.8    63.6    68.0
LoSh                  -       -       -       64.2    62.5    66.0
LMPM                  37.2    34.2    40.2    -       -       -
DsHmp                 46.4    43.0    49.8    67.1    65.0    69.1

Methods with LLMs
LISA-7B               37.2    35.1    39.4    53.9    53.4    54.3
LISA-13B              37.9    35.8    40.0    54.4    54.0    54.8
TrackGPT-7B           40.1    37.6    42.6    56.4    55.3    57.4
TrackGPT-13B          41.2    39.2    43.1    59.5    58.1    60.8
VideoGLAMM            45.2    48.1    48.2    -       -       -
VideoLISA-3.8B        44.4    41.3    47.6    63.7    61.7    65.7
VISA-7B               43.5    40.7    46.3    61.5    59.8    63.2
VISA-13B              44.5    41.8    47.1    63.0    61.4    64.7
ViLLa                 -       -       -       66.5    64.6    68.6
GLUS (ours)           50.3    47.5    53.2    66.6    65.0    68.3
GLUS (ours) (ED)      51.3    48.5    54.2    67.3    65.5    69.0


Qualitative Results


We provide qualitative comparisons between the previous state-of-the-art (DsHmp) and our GLUS (without expanded datasets) on videos from MeViS. Notably, these examples illustrate three challenging aspects of RefVOS: (1) Motion Understanding: RefVOS models must distinguish similar objects by their motions; (2) Global Reasoning: RefVOS models should use global reasoning to segment objects that appear only in a short video clip; (3) Vision-Language Reasoning: RefVOS models should perform unified vision-language reasoning in complex scenarios. The examples demonstrate that our GLUS effectively tackles RefVOS in challenging language-guided segmentation cases.

Each example compares the Ground Truth, the Baseline (DsHmp), and our GLUS for one referring expression:

"Elephant crushed by another elephant's trunk"
"The panda that has stayed in place with little movement"
"The panda that took a few steps to the left"
"White car move and turn left"
"Person standing behind little girl feeding rabbit"
"Plane moves slower"


Conclusions

We introduce a simple yet effective framework based on multimodal large language models (MLLMs) for referring video object segmentation (RefVOS). Named GLUS, our method establishes unified global and local reasoning in a single MLLM, addressing the distinct 'Ref' and 'VOS' challenges of RefVOS. The central design is to provide the MLLM with both global (context frames) and local (query frames) contexts. This unified global-local reasoning is further enhanced by end-to-end optimization with VOS memory modules, which improves the consistency of GLUS. Finally, GLUS introduces a plug-and-play object contrastive loss and pseudo-labeling for key frame selection, enabling the MLLM to identify the correct objects and frames within its limited context window. GLUS establishes a new state of the art on RefVOS benchmarks. We hope our baseline can inspire more systematic studies that extend MLLMs to fine-grained video understanding.
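For a concrete picture of the pseudo-labeling step, here is a hedged sketch of one plausible key-frame selection and propagation loop: score each predicted mask by its confidence, take the most confident frame as a pseudo-labeled key frame, and hand it to a VOS-style propagation module. `select_key_frame`, `refine_with_propagation`, and `vos_propagate` are hypothetical names, not APIs from the GLUS codebase.

```python
# Hypothetical sketch of key-frame selection via pseudo-label confidence,
# followed by mask propagation with an off-the-shelf VOS module.
import torch


def select_key_frame(mask_logits_per_frame):
    """mask_logits_per_frame: list of (H, W) logit tensors, one per predicted frame."""
    scores = []
    for logits in mask_logits_per_frame:
        prob = torch.sigmoid(logits)
        # Confidence: average distance of pixels from the undecided value 0.5.
        scores.append((prob - 0.5).abs().mean().item())
    return max(range(len(scores)), key=lambda i: scores[i])


def refine_with_propagation(frames, mask_logits_per_frame, vos_propagate):
    # Treat the most confident prediction as a pseudo-labeled key frame.
    key = select_key_frame(mask_logits_per_frame)
    key_mask = (torch.sigmoid(mask_logits_per_frame[key]) > 0.5).float()
    # `vos_propagate` stands in for any VOS memory module that spreads the
    # key-frame mask to the remaining frames; its signature is assumed here.
    return vos_propagate(frames, key_frame_index=key, key_mask=key_mask)
```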

Citation

Acknowledgements



The website template was borrowed from Michaël Gharbi, Ref-NeRF, and ReconFusion.