Do You See What I Am Pointing At?
Gesture-Based Egocentric Video Question Answering

CVPR 2026

¹Imperial College London    ²Huawei Noah's Ark Lab, UK

Figure 1. Illustration of EgoPointVQA. Current MLLMs struggle to ground deictic pronouns with pointing gestures.

Abstract

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to perform fine-grained spatial reasoning from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4,000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT): compact tokens derived from 3D hand keypoints, estimated by an off-the-shelf reconstruction model, that are interleaved with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art InternVL3-14B by 8.6%.

EgoPointVQA at a Glance

4,400
Total Videos
18,745
QA Pairs
6
Task Types
20
Participants
+8.6%
Over SoTA

Key Contributions

Task Taxonomy

EgoPointVQA decomposes deictic question answering into six categories, each testing distinct reasoning capabilities.

🔍
Reference
"What is it?" — Identify the object the user is pointing at.
🔢
Counting
"How many of these?" — Count identical or similar objects in the view.
📐
Spatial
"Where is this located?" — Understand relative position of the referenced object.
⏱️
Temporal
"What is the second object?" — Resolve references across sequential gestures.
🎨
Attribute
"What color is this?" — Identify properties like color, shape, or material.
💬
Feedback
"How can I use this?" — Answer about the object's function or relevance.

Figure 2. Task taxonomy and examples from EgoPointVQA.

Dataset Generation Pipeline

We generate EgoPointVQA through a three-stage automated pipeline combining synthetic and real-world egocentric videos.

Figure 3. EgoPointVQA generation pipeline from simulated and real egocentric videos.


Figure 5. EgoPointVQA statistics — task type distributions and common object word clouds.

Method: Hand Intent Tokens (HINT)

Given an egocentric video and a deictic question (e.g., "What is this?"), our goal is to generate the correct answer. Current MLLMs struggle because they fail to (1) recognize that the question is ambiguous, and (2) identify the user's pointing intent. HINT processes the video in two parallel streams: a standard visual stream and a new hand-intent stream.

Figure 6. HINT overall architecture. Vt = visual token, Kt = keypoint feature, Ht = hand intent token.

Keypoint Adapter. The adapter projects 21 distinct 3D keypoints into a single Hand Intent Token that holistically represents the hand posture. The 63-dimensional flattened vector passes through LayerNorm followed by a two-layer MLP with GeLU activation, producing a token that matches the LLM's hidden dimension. Tokens whose underlying hand detections fall below the confidence threshold are discarded.
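The adapter described above can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation: the hidden dimension (`HIDDEN_DIM`), weight initialization, and confidence threshold are placeholder assumptions, and in the actual model the weights are learned end-to-end.

```python
import numpy as np

HIDDEN_DIM = 64  # placeholder for the LLM hidden size (not the paper's value)

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

class KeypointAdapter:
    """Projects 21 3D hand keypoints (63 values) to one hand-intent token."""

    def __init__(self, hidden_dim=HIDDEN_DIM, seed=0):
        rng = np.random.default_rng(seed)
        # two-layer MLP; in practice these weights are trained with the LLM
        self.w1 = rng.normal(0.0, 0.02, (63, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, hidden_dim))
        self.b2 = np.zeros(hidden_dim)

    def __call__(self, keypoints, confidence, threshold=0.5):
        # keypoints: (21, 3) array from an off-the-shelf hand reconstruction model
        if confidence < threshold:
            return None  # low-confidence detections are discarded
        x = layer_norm(keypoints.reshape(-1))   # flatten to 63-d, normalize
        h = gelu(x @ self.w1 + self.b1)         # MLP layer 1 + GeLU
        return h @ self.w2 + self.b2            # MLP layer 2 -> hand intent token

adapter = KeypointAdapter()
kp = np.random.default_rng(1).normal(size=(21, 3))
token = adapter(kp, confidence=0.9)  # shape: (HIDDEN_DIM,)
```

The key design point is that the whole hand posture collapses into one token in the LLM's embedding space, rather than 21 separate keypoint tokens.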
Frame-Keypoint Interleaving. Hand intent tokens are interleaved with visual tokens so the LLM can jointly reason over visual content and pointing direction. This construction enables the LLM to understand deictic context and temporally anchored references, while adding less than 1% token overhead.
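The interleaving itself reduces to a simple sequence construction. A minimal sketch, assuming one hand-intent token per frame appended after that frame's visual tokens (token contents here are placeholder strings):

```python
def interleave(frame_tokens, hand_tokens):
    """Build the LLM input sequence [V_1..., H_1, V_2..., H_2, ...].

    frame_tokens: list over frames, each a list of visual tokens (V_t)
    hand_tokens:  list over frames, each a hand-intent token (H_t) or None
    """
    seq = []
    for vis, hand in zip(frame_tokens, hand_tokens):
        seq.extend(vis)          # visual tokens for frame t
        if hand is not None:     # frames without a confident hand are skipped
            seq.append(hand)     # hand-intent token anchored to frame t
    return seq

# frame 1 has a confident pointing hand, frame 2 does not
seq = interleave([["v1a", "v1b"], ["v2a"]], ["h1", None])
# -> ["v1a", "v1b", "h1", "v2a"]
```

Anchoring each hand token directly after its frame is what gives the temporal grounding; and with, say, a few hundred visual tokens per frame, one extra token per frame stays well under the 1% overhead quoted above.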

Results

Performance of different MLLMs on the EgoPointVQA test set (multiple-choice accuracy %).
| Method | Size | Refer. | Temporal | Spatial | Count | Attr. | Feed. | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Models* | | | | | | | | |
| GPT-5 | – | 75.6 | 53.6 | 62.3 | 50.0 | 56.1 | 77.8 | 62.6 |
| GPT-4o | – | 56.1 | 29.5 | 43.1 | 44.8 | 41.5 | 65.7 | 46.8 |
| *Open-Source MLLMs ≥ 32B* | | | | | | | | |
| Qwen3-VL | 32B | 63.7 | 67.9 | 65.8 | 66.7 | 63.4 | 77.2 | 67.5 |
| InternVL3 | 78B | 71.4 | 71.4 | 62.3 | 45.8 | 68.3 | 80.1 | 66.6 |
| *Open-Source MLLMs ≤ 14B* | | | | | | | | |
| InternVL3 | 8B | 66.1 | 57.5 | 63.2 | 33.3 | 51.3 | 76.8 | 58.0 |
| InternVL3 | 14B | 63.1 | 66.1 | 61.4 | 50.0 | 58.5 | 77.2 | 62.7 |
| EgoGPT | 7B | 67.3 | 46.4 | 50.9 | 47.9 | 48.8 | 74.1 | 55.9 |
| LLaVA-OneVision | 7B | 54.2 | 42.9 | 53.5 | 35.4 | 46.3 | 67.1 | 49.9 |
| *HINT (Ours)* | | | | | | | | |
| HINT (LLaVA-OV) | 7B | 60.7 | 50.0 | 56.1 | 39.6 | 48.8 | 71.1 | 54.4 |
| HINT (InternVL3-8B) | 8B | 75.0 | 66.1 | 64.9 | 35.4 | 61.0 | 79.8 | 63.7 |
| HINT (InternVL3-14B) | 14B | 73.8 | 69.6 | 64.9 | 54.2 | 63.4 | 82.5 | 68.1 |

Table 1. HINT (highlighted) consistently improves over baselines. HINT-14B achieves the best average of 68.1%.

Ablation Study

We ablate key components of HINT using InternVL3-8B as the backbone.
| SFT | Hand Intent | Refer. | Temporal | Spatial | Attr. |
|---|---|---|---|---|---|
| – | – | 66.1 | 58.9 | 63.2 | 51.3 |
| ✓ | – | 68.5 | 60.7 | 59.6 | 56.7 |
| ✓ | ✓ | 75.0 | 66.1 | 64.9 | 61.0 |

Table 2. Ablation of HINT components. Combining SFT with Hand Intent Tokens yields the largest gains.

| Hand Intent Modeling | Refer. | Temporal | Spatial |
|---|---|---|---|
| None (SFT only) | 68.5 | 60.7 | 59.6 |
| Visual Keypoints | 57.1 | 60.7 | 61.4 |
| Visual Arrow from Fingertip | 70.2 | 60.7 | 62.3 |
| 3D Keypoints in Text | 68.5 | 55.4 | 58.8 |
| 2D Keypoints in Text | 69.0 | 57.1 | 59.6 |
| HINT (Ours) | 75.0 | 66.1 | 64.9 |

Table 4. Different methods of hand intent modeling. HINT with a learned keypoint adapter outperforms all alternatives.

BibTeX

@inproceedings{choi2026egopointvqa,
  author    = {Choi, Yura and Miles, Roy and Potamias, Rolandos Alexandros and Elezi, Ismail and Deng, Jiankang and Zafeiriou, Stefanos},
  title     = {Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering},
  booktitle = {CVPR},
  year      = {2026}
}