Video multimodal large language models (video MLLMs) rely on RGB appearance cues and degrade sharply in low light, where texture, color, and object boundaries are lost. Event cameras offer a complementary signal: asynchronous brightness changes that preserve motion and temporal structure under extreme illumination. However, integrating event streams into language-based reasoning is bottlenecked by scarce paired event-language supervision, especially under low light. We propose LITE-Event, a sample-efficient framework that fuses event and RGB streams for low-light video reasoning while keeping both visual encoders and the LLM frozen. A lightweight visual-language interface combines modality-specific projectors with gated fusion and is adapted at inference by a hypernetwork that generates LoRA-style residual updates from a small support set, allowing the model to specialize without retraining. To support training and evaluation, we release three resources: NExT-QA-Dark, a large-scale synthetic low-light RGB-event dataset for pretraining; EventLang-LowLight, a curated event-language dataset for adaptation; and RealDark-Bench, a real-world low-light RGB-event reasoning benchmark. Trained predominantly on synthetic data, LITE-Event reaches 87.1% on RealDark-Bench, outperforming the strongest enhancement-then-reason baseline by 8.0 points and surpassing both RGB-only and event-only video MLLMs.
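The support-conditioned adaptation can be pictured as follows. Below is a minimal PyTorch sketch, assuming a hypernetwork that mean-pools support-set features and emits LoRA-style low-rank factors for a frozen linear projector; the module names, dimensions, and pooling scheme are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class LoRAHyperNetwork(nn.Module):
    """Generates low-rank residual factors (A, B) for a frozen linear projector."""

    def __init__(self, feat_dim: int, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        # A small MLP emits both low-rank factors from a pooled support summary vector.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.GELU(),
            nn.Linear(256, rank * (in_dim + out_dim)),
        )

    def forward(self, support_feats: torch.Tensor):
        # support_feats: (num_support, feat_dim) -> mean-pool into one context vector.
        ctx = support_feats.mean(dim=0)
        params = self.mlp(ctx)
        A = params[: self.rank * self.in_dim].view(self.rank, self.in_dim)
        B = params[self.rank * self.in_dim:].view(self.out_dim, self.rank)
        return A, B


def adapted_projection(x, frozen_proj, A, B, scale=1.0):
    """Frozen projector output plus the generated LoRA-style residual: W x + B (A x)."""
    return frozen_proj(x) + scale * (x @ A.t()) @ B.t()


# Usage: specialize the frozen projector to a small support set without retraining.
hyper = LoRAHyperNetwork(feat_dim=512, in_dim=1024, out_dim=4096)
A, B = hyper(torch.randn(16, 512))          # 16 support examples (hypothetical sizes)
frozen_proj = nn.Linear(1024, 4096).requires_grad_(False)
out = adapted_projection(torch.randn(32, 1024), frozen_proj, A, B)  # -> (32, 4096)
```

Keeping the generated update in low-rank residual form means only the tiny hypernetwork ever needs gradient updates, which matches the sample-efficiency goal stated above.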
Overview of the LITE-Event framework. Frozen encoders extract features from the low-light RGB video and the temporal event stream, and modality-specific projectors map them into the embedding space of the large language model. A learned gating mechanism fuses the two streams into unified visual tokens, which are refined at inference time by a hypernetwork-driven, support-conditioned residual projector. The adapted visual tokens are combined with the user's text query and processed by the instruction-tuned LLM, enabling robust reasoning and accurate answer generation under challenging low-light conditions.
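To make the visual-language interface concrete, here is a minimal PyTorch sketch of modality-specific projectors followed by gated fusion, assuming a per-token sigmoid gate and illustrative feature dimensions; none of these names or sizes are taken from the released code.

```python
import torch
import torch.nn as nn


class GatedFusionInterface(nn.Module):
    def __init__(self, rgb_dim: int = 1024, evt_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, llm_dim)   # RGB-specific projector
        self.evt_proj = nn.Linear(evt_dim, llm_dim)   # event-specific projector
        # Gate predicts per-token, per-channel mixing weights from both modalities.
        self.gate = nn.Sequential(nn.Linear(2 * llm_dim, llm_dim), nn.Sigmoid())

    def forward(self, rgb_feats: torch.Tensor, evt_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats: (T, rgb_dim), evt_feats: (T, evt_dim) from the frozen encoders.
        r = self.rgb_proj(rgb_feats)
        e = self.evt_proj(evt_feats)
        g = self.gate(torch.cat([r, e], dim=-1))
        # Unified visual tokens lean on whichever modality the gate trusts per channel.
        return g * r + (1.0 - g) * e


# Usage: fuse per-frame features before prepending them to the text-query embeddings.
fusion = GatedFusionInterface()
tokens = fusion(torch.randn(32, 1024), torch.randn(32, 768))  # -> (32, 4096)
```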
Hover over the section to play video; hover over the left video to switch between dark and normal light.
Q: Where is the chair positioned in relation to the walking subject?
Choices: A. Beside him B. Behind him C. Directly in front of him D. Carried on his back
GT: C. Directly in front of him
Q: Are the circular fixtures between the slats turned on or off?
Choices: A. They are brightly illuminated B. They are completely dark C. They are flashing rapidly D. Unclear
GT: A. They are brightly illuminated
Q: How many people are walking in the scene?
Choices: A. One B. Two C. Three D. Four
GT: B. Two
Q: What is the written text on the robot?
Choices: A. Spot B. Unitree C. CyberDog D. Boston
GT: B. Unitree
Q: What is the person riding?
Choices: A. A bicycle B. A unicycle C. An electric skateboard D. Roller skates
GT: C. An electric skateboard
Q: What kind of bag is the person carrying?
Choices: A. A backpack B. A duffel bag C. A dark, rectangular briefcase D. A plastic grocery bag
GT: C. A dark, rectangular briefcase
Q: Which direction is the deer walking?
Choices: A. Right to left B. Left to right C. Towards the camera D. Away from the camera
GT: B. Left to right
Performance comparison of models on our benchmarks. LITE-Event outperforms all baselines on both multiple-choice QA and captioning tasks.
Table 1: Benchmark results on EventLang-LowLight.
Table 2: Benchmark results on RealDark-Bench.