Reasoning in the Dark: Event-RGB Adaptation for Low-Light Video MLLMs

Anonymous Author(s)
Anonymous University / Organization

Abstract

Video multimodal large language models (video MLLMs) rely on RGB appearance cues and degrade sharply in low light, where texture, color, and object boundaries are lost. Event cameras offer a complementary signal: asynchronous brightness changes that preserve motion and temporal structure under extreme illumination. Integrating event streams into language-based reasoning, however, is bottlenecked by scarce paired event-language supervision, especially under low light. We propose LITE-Event, a sample-efficient framework that fuses event and RGB streams for low-light video reasoning while keeping both visual encoders and the LLM frozen. A lightweight visual-language interface combines modality-specific projectors with gated fusion and is adapted at inference by a hypernetwork that generates LoRA-style residual updates from a small support set, allowing the model to specialize without retraining. To support training and evaluation, we release three resources: NExT-QA-Dark, a large-scale synthetic low-light RGB-event dataset for pretraining; EventLang-LowLight, a curated event-language dataset for adaptation; and RealDark-Bench, a real-world low-light RGB-event reasoning benchmark. Trained predominantly on synthetic data, LITE-Event reaches 87.1% on RealDark-Bench, outperforming the strongest enhancement-then-reason baseline by 8.0 points and surpassing both RGB-only and event-only video MLLMs.

Method

Figure 1: Overview of the LITE-Event framework. Frozen encoders extract features from the low-light RGB video and the temporal event stream, and modality-specific projectors map them into the embedding space of a large language model. A learned gating mechanism fuses the two streams into unified visual tokens, which are refined at inference time by a hypernetwork-driven, support-conditioned residual projector. The adapted representations are then aligned with the user's text query and processed by the frozen instruction-tuned LLM, enabling robust reasoning and accurate answer generation under challenging low-light conditions.
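
To make the fusion interface concrete, below is a minimal PyTorch sketch of modality-specific projectors combined by a learned gate. It assumes both encoders emit temporally aligned token sequences of equal length; the module name GatedFusionProjector, the dimensions, and the per-token sigmoid gate are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedFusionProjector(nn.Module):
    """Sketch of the visual-language interface: per-modality projectors
    plus a learned gate that mixes RGB and event tokens into unified
    visual tokens. Names and dimensions are illustrative assumptions."""

    def __init__(self, rgb_dim: int, event_dim: int, llm_dim: int):
        super().__init__()
        # Modality-specific projectors into the LLM embedding space.
        self.rgb_proj = nn.Linear(rgb_dim, llm_dim)
        self.event_proj = nn.Linear(event_dim, llm_dim)
        # Gate predicts a per-token mixing weight from both projections.
        self.gate = nn.Sequential(
            nn.Linear(2 * llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feats: torch.Tensor, event_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats: (B, T, rgb_dim); event_feats: (B, T, event_dim),
        # assumed temporally aligned to the same token count T.
        r = self.rgb_proj(rgb_feats)              # (B, T, llm_dim)
        e = self.event_proj(event_feats)          # (B, T, llm_dim)
        g = self.gate(torch.cat([r, e], dim=-1))  # (B, T, 1), in [0, 1]
        # Convex combination: g near 1 favors RGB tokens, g near 0 favors
        # event tokens (e.g. where low light destroys RGB texture).
        return g * r + (1.0 - g) * e
```

Because the gate is predicted per token, the interface can lean on event evidence exactly where RGB appearance collapses, rather than committing to one modality globally.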

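The inference-time adaptation admits a similarly compact sketch: a hypernetwork maps a pooled embedding of the support set to LoRA-style low-rank residual weights for the frozen projector, so specializing to a new low-light domain needs only a forward pass, with no gradient updates. The mean-pooling, rank, and layer sizes below are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class LoRAHyperAdapter(nn.Module):
    """Sketch of support-conditioned adaptation: a hypernetwork emits
    low-rank residual factors A and B for a frozen linear projector.
    Pooling, rank, and sizes are assumptions, not the paper's code."""

    def __init__(self, base_proj: nn.Linear, support_dim: int, rank: int = 8):
        super().__init__()
        self.base = base_proj
        for p in self.base.parameters():  # base projector stays frozen
            p.requires_grad_(False)
        out_dim, in_dim = base_proj.weight.shape
        self.in_dim, self.out_dim, self.rank = in_dim, out_dim, rank
        # One linear hypernetwork head emits both flattened LoRA factors:
        # A is (rank x in_dim), B is (out_dim x rank).
        self.hyper = nn.Linear(support_dim, rank * in_dim + out_dim * rank)
        self.scale = 1.0 / rank

    def forward(self, x: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        # support: (N, support_dim) embeddings of a small labeled support
        # set; mean-pool them into a single conditioning vector.
        z = support.mean(dim=0)
        theta = self.hyper(z)
        a = theta[: self.rank * self.in_dim].view(self.rank, self.in_dim)
        b = theta[self.rank * self.in_dim :].view(self.out_dim, self.rank)
        # Frozen base mapping plus generated residual: W x + scale * B A x.
        return self.base(x) + self.scale * ((x @ a.t()) @ b.t())
```

At test time, support could be encoder embeddings of a handful of in-domain low-light clips; swapping in a different support set re-specializes the projector while the encoders and the LLM remain untouched.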

VQA based on spatial properties

Q: Where is the chair positioned in relation to the walking subject?

Choices: A. Beside him B. Behind him C. Directly in front of him D. Carried on his back

GT: C. Directly in front of him

InternVL3+DarkIR: Behind him.
Qwen3-VL: Beside him.
EventGPT-Plus: Behind him.
LITE-Event: Directly in front of him.

VQA based on visibility properties

Q: Are the circular fixtures between the slats turned on or off?

Choices: A. They are brightly illuminated B. They are completely dark C. They are flashing rapidly D. Unclear

GT: A. They are brightly illuminated

InternVL3+DarkIR: Unclear.
Qwen3-VL: They are completely dark.
EventGPT-Plus: Unclear.
LITE-Event: They are brightly illuminated.

VQA based on counting

Q: How many people are walking in the scene?

Choices: A. One B. Two C. Three D. Four

GT: B. Two

InternVL3+DarkIR: One.
Qwen3-VL: Two.
EventGPT-Plus: Two.
LITE-Event: Two.

VQA based on OCR

Q: What is the written text on the robot?

Choices: A. Spot B. Unitree C. CyberDog D. Boston

GT: B. Unitree

InternVL3+DarkIR: CyberDog.
Qwen3-VL: Unitree.
EventGPT-Plus: Unitree.
LITE-Event: Unitree.

VQA based on action recognition

Q: What is the person riding?

Choices: A. A bicycle B. A unicycle C. An electric skateboard D. Roller skates

GT: C. An electric skateboard

InternVL3+DarkIR: An electric skateboard.
Qwen3-VL: A unicycle.
EventGPT-Plus: A unicycle.
LITE-Event: An electric skateboard.

VQA based on object properties

Q: What kind of bag is the person carrying?

Choices: A. A backpack B. A duffel bag C. A dark, rectangular briefcase D. A plastic grocery bag

GT: C. A dark, rectangular briefcase

InternVL3+DarkIR: A plastic grocery bag.
Qwen3-VL: A dark, rectangular briefcase.
EventGPT-Plus: A dark, rectangular briefcase.
LITE-Event: A dark, rectangular briefcase.

VQA based on temporal properties

Q: Which direction is the deer walking?

Choices: A. Right to left B. Left to right C. Towards the camera D. Away from the camera

GT: B. Left to right

InternVL3+DarkIR: Away from the camera.
Qwen3-VL: Left to right.
EventGPT-Plus: Towards the camera.
LITE-Event: Left to right.

Benchmark Results

Performance comparison of models on our benchmarks. LITE-Event outperforms all baselines on both multiple-choice QA and captioning tasks.

Table 1: Benchmark results on EventLang-LowLight.

Table 2: Benchmark results on RealDark-Bench.