ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Daichi Yashima1,3    Shuhei Kurita2,3    Yusuke Oda3    Komei Sugiura1
1Keio University    2NII    3NII LLMC
CVPR 2026

ReMoRa eyecatch

ReMoRa preserves high-fidelity motion representation by refining motion vectors and
integrating them with sparse keyframes before streaming tokens into the LLM.

Abstract

ReMoRa tackles long-form video understanding with multimodal LLMs by operating directly on compressed video streams. Instead of decoding dense RGB sequences, we keep scene-adaptive I-frames for appearance while encoding temporal dynamics as codec motion vectors that act as a lightweight proxy for optical flow. A Refined Motion Representation (RMR) module denoises and densifies those coarse motions, and a Hierarchical Motion State Space (HMSS) module performs linear-time reasoning across group-of-picture hierarchies. This codec-aware pipeline scales to hour-long clips, improves temporal fidelity, and removes redundant computation. ReMoRa consistently surpasses contemporaneous video MLLMs, reaching 60.8 on LongVideoBench, 84.2 on NExT-QA, 72.1 on MLVU, and an average score of 69.8.

Overview of ReMoRa

ReMoRa architecture

Each video is decomposed into group-of-pictures (GOP) chunks. I-frames preserve appearance via a SigLIP encoder, while block-based motion vectors from P/B frames feed the RMR module, which denoises and densifies them and aligns them with dense optical flow. The HMSS module then performs codec-aware selective scans, summarizing each GOP locally before propagating linear-time state transitions across the entire clip.
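As a shape-level sketch of this decomposition, the snippet below walks one GOP through the pipeline. All module bodies and dimensions here are hypothetical placeholders, not the actual SigLIP, RMR, or HMSS implementations:

```python
import numpy as np

# Illustrative sizes only (not the paper's actual dimensions).
D = 64            # token dimension (placeholder)
BLOCKS = 14 * 14  # motion-vector blocks per P/B frame (placeholder)

def encode_iframe(iframe_rgb):
    """Stand-in for the SigLIP appearance encoder: RGB I-frame -> appearance tokens."""
    return np.random.randn(16, D)  # 16 tokens per I-frame (placeholder)

def refine_motion(block_mvs):
    """Stand-in for RMR: coarse block motion vectors -> refined motion tokens."""
    return np.random.randn(BLOCKS, D)

def gop_forward(gop):
    """One GOP = (I-frame, list of P/B-frame motion-vector grids)."""
    iframe, mv_frames = gop
    tokens = [encode_iframe(iframe)]
    tokens += [refine_motion(mv) for mv in mv_frames]
    return np.concatenate(tokens, axis=0)  # GOP token sequence handed to HMSS

# Toy clip: 4 GOPs, each with one 224x224 I-frame and 11 motion-vector frames.
video = [(np.zeros((224, 224, 3)), [np.zeros((BLOCKS, 2))] * 11) for _ in range(4)]
per_gop = [gop_forward(g) for g in video]
print(len(per_gop), per_gop[0].shape)  # 4 (2172, 64)
```

The point of the sketch is the data flow: appearance tokens come only from sparse I-frames, while every other frame contributes through its (much cheaper) motion-vector grid.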

  • Refined Motion Representation. Filters pretrained on optical flow lift noisy codec motion vectors into dense motion fields, providing temporal cues without decoding RGB sequences.
  • Hierarchical Motion State Space. Bidirectional Mamba scans compress local GOP structure, while a global scan maintains long-range dependencies with linear complexity.
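To see why such scans are linear in sequence length, here is a minimal NumPy sketch of a bidirectional linear recurrence with fixed scalar parameters `a` and `b`. The actual HMSS uses learned, input-dependent Mamba parameters, so this is a simplified stand-in:

```python
import numpy as np

def linear_scan(x, a=0.9, b=0.1):
    """Causal linear recurrence h_t = a*h_{t-1} + b*x_t: O(T) in sequence length."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x):
    """Forward scan plus a time-reversed backward scan, summed."""
    fwd = linear_scan(x)
    bwd = linear_scan(x[::-1])[::-1]
    return fwd + bwd

# Toy hierarchy: local bidirectional scans per GOP, then one global scan
# over per-GOP summaries (mirroring the local-then-global structure above).
gops = [np.random.randn(12, 8) for _ in range(5)]     # 5 GOPs, 12 tokens each
local = [bidirectional_scan(g) for g in gops]          # local GOP structure
summaries = np.stack([l.mean(axis=0) for l in local])  # one summary per GOP
global_state = linear_scan(summaries)                  # long-range propagation
print(global_state.shape)  # (5, 8)
```

Each token is touched a constant number of times, so cost grows linearly with clip length, unlike the quadratic cost of full attention over all frame tokens.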

Scene-aware Video Preprocessing

I-frames are codec-designated keyframes, so every GOP inherits a high-fidelity appearance anchor that coincides with abrupt scene cuts. Block motion vectors from P/B frames offer a lightweight proxy for optical flow, but they are coarse and noisy; this motivates our RMR module, which aligns these motion cues with dense optical flow supervision before downstream processing.
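To make the densification step concrete, the NumPy sketch below upsamples per-block vectors (16×16 macroblocks assumed) to a per-pixel pseudo-flow by nearest-neighbour repetition. The actual RMR module additionally denoises with optical-flow-pretrained filters; this shows only the naive densify step:

```python
import numpy as np

BLOCK = 16  # macroblock size assumed by this sketch

def densify_mvs(block_mvs):
    """Nearest-neighbour upsample: (H/16, W/16, 2) block motion vectors
    -> (H, W, 2) dense pseudo optical flow. Every pixel in a macroblock
    simply inherits that block's vector."""
    dense = np.repeat(block_mvs, BLOCK, axis=0)
    dense = np.repeat(dense, BLOCK, axis=1)
    return dense

mvs = np.zeros((14, 14, 2))        # e.g. block vectors for a 224x224 P-frame
mvs[0, 0] = (3.0, -1.0)            # one block moving right and up
flow = densify_mvs(mvs)
print(flow.shape)                  # (224, 224, 2)
print(flow[15, 15], flow[16, 16])  # inside vs. outside the moving block
```

The blocky output illustrates why raw codec vectors are a coarse proxy: motion changes only at macroblock boundaries, which is exactly the artifact the learned refinement in RMR is meant to smooth away.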


Quantitative Results

Benchmark comparison chart
VideoQA and ablation summaries

ReMoRa attains 60.8 on LongVideoBench, 84.2 on NExT-QA, 72.1 on MLVU, 64.4 on VideoMME, and 67.7 on Perception Test, yielding the top average score of 69.8 across six long-video benchmarks. For open-ended VideoQA, the model sets a new best of 60.5/3.7 (Accuracy/Score) on ActivityNet-QA and reaches 73.1/4.0 on MSVD-QA.


Qualitative Results

Qualitative comparison between ReMoRa and baseline models

On NExT-QA, ReMoRa tracks subtle human-object interactions such as “slide down the rail and check pants” or “dog retrieves the thrown object” by leveraging refined motion fields, whereas RGB-only baselines hallucinate abrupt actions. The refined motion cues keep temporal ordering intact even when question cues refer to events several seconds apart.



BibTeX

@article{yashima2026remoramultimodallargelanguage,
  title={ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding},
  author={Daichi Yashima and Shuhei Kurita and Yusuke Oda and Komei Sugiura},
  year={2026},
  eprint={2602.16412},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.16412},
}