Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either subsample frames uniformly or apply keyframe selection with retrieval-style scoring from smaller vision-language models. These keyframe selection methods, however, still rely on pre-filtering before selection to keep inference cost down, and can miss the most informative moments.
We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain ones. The resulting two-stage exploration-exploitation procedure, derived from a sequential policy with theoretical guarantees, first identifies high-value temporal regions and then selects the top-scoring frames within each region.
On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
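To make the two-stage procedure concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's exact implementation: function names are hypothetical, per-frame relevance scores are assumed to lie in [0, 1] (e.g. from a small vision-language model), and a standard empirical-Bernstein bound stands in for the exact confidence radius used by FOCUS.

```python
import math

def bernstein_ucb(scores, total_pulls):
    """Empirical-Bernstein upper confidence bound for one clip (arm).

    `scores` holds query-relevance scores in [0, 1] for frames sampled
    from the clip; a high UCB means the clip is promising or still uncertain.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    log_term = math.log(max(total_pulls, 2))
    # variance-dependent term + range term (range = 1 for scores in [0, 1])
    radius = math.sqrt(2 * var * log_term / n) + 3 * log_term / n
    return mean + radius

def select_keyframes(clip_scores, num_regions, frames_per_region):
    """Two-stage selection: pick high-UCB clips, then top frames within each.

    clip_scores: list of lists; clip_scores[i][j] is the relevance score
    of frame j within clip i.
    """
    total = sum(len(c) for c in clip_scores)
    ucbs = [bernstein_ucb(c, total) for c in clip_scores]
    # Stage 1: identify high-value temporal regions by UCB.
    top_clips = sorted(range(len(clip_scores)), key=lambda i: -ucbs[i])[:num_regions]
    # Stage 2: take the top-scoring frames inside each selected region.
    selected = []
    for i in sorted(top_clips):
        order = sorted(range(len(clip_scores[i])), key=lambda j: -clip_scores[i][j])
        selected.extend((i, j) for j in order[:frames_per_region])
    return selected  # list of (clip index, frame index) pairs
```

The confidence radius keeps rarely-scored or high-variance clips in play, so selection does not collapse onto regions that merely scored well on a few sampled frames.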
First, follow the installation instructions from the AKS repository to set up the environment and dependencies.
Then install the additional requirements:
pip install -r requirements.txt

Run FOCUS keyframe extraction on LongVideoBench:
python select_keyframe.py \
--dataset_name longvideobench \
--dataset_path ./datasets/longvideobench \
--output_dir focus_blip \
--num_keyframes 64 \
--batch_size 32 \
--blip_model large

For evaluation, please follow the evaluation setup from the lmms-eval repository and use the evaluation scripts provided in the AKS repository.
FOCUS generates the following outputs:
selected_frames.json: Selected keyframe indices for each video
sampling_details.json: Detailed sampling information including:
- Coarse and fine sampling results
- Arm information and FOCUS scores
- Arm selection probabilities
- Video metadata
extraction_stats.json: Statistics about the extraction process
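A small downstream snippet for consuming these outputs is sketched below. The JSON schema is an assumption (a mapping from video ID to a list of frame indices); adjust the keys to match the actual files, and note that `indices_to_timestamps` is a hypothetical helper that assumes a constant frame rate.

```python
import json

def load_selected_frames(path):
    """Load FOCUS keyframe indices.

    Assumed schema: {video_id: [frame_index, ...]} -- check your
    selected_frames.json and adapt as needed.
    """
    with open(path) as f:
        return json.load(f)

def indices_to_timestamps(indices, fps):
    """Convert frame indices to seconds, assuming a constant frame rate."""
    return [i / fps for i in indices]
```

For example, `indices_to_timestamps([0, 30, 60], fps=30.0)` maps the selected indices of a 30 fps video to the timestamps 0 s, 1 s, and 2 s, which can then be fed to a video decoder or an MLLM's frame sampler.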
This work builds upon the excellent research from:
- AKS: Adaptive Keyframe Sampling for the evaluation framework
- lmms-eval for multimodal evaluation
