Do You See What I Am Pointing At?
Gesture-Based Egocentric Video Question Answering

CVPR 2026

¹Imperial College London    ²Huawei Noah's Ark Lab, UK

Figure 1. Illustration of EgoPointVQA. Current MLLMs struggle to ground deictic pronouns with pointing gestures.

Abstract

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to perform fine-grained spatial reasoning from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4,000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT): compact tokens derived from 3D hand keypoints, estimated by an off-the-shelf reconstruction model, that are interleaved with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art InternVL3-14B by 8.6%.

EgoPointVQA at a Glance

4,400
Total Videos
18,745
QA Pairs
6
Task Types
20
Participants
+8.6%
Over SoTA

Key Contributions

Task Taxonomy

EgoPointVQA decomposes deictic question answering into six categories, each testing distinct reasoning capabilities.

🔍
Reference
"What is it?" — Identify the object the user is pointing at.
🔢
Counting
"How many of these?" — Count identical or similar objects in the view.
📐
Spatial
"Where is this located?" — Understand relative position of the referenced object.
⏱️
Temporal
"What is the second object?" — Resolve references across sequential gestures.
🎨
Attribute
"What color is this?" — Identify properties like color, shape, or material.
💬
Feedback
"How can I use this?" — Answer about the object's function or relevance.

Figure 2. Task taxonomy and examples from EgoPointVQA.

Dataset Generation Pipeline

We generate EgoPointVQA through a three-stage automated pipeline combining synthetic and real-world egocentric videos.

Figure 3. EgoPointVQA generation pipeline from simulated and real egocentric videos.


Figure 5. EgoPointVQA statistics — task type distributions and common object word clouds.

Method: Hand Intent Tokens (HINT)

Given an egocentric video and a deictic question (e.g., "What is this?"), our goal is to generate the correct answer. Current MLLMs struggle because they fail to (1) recognize that the question is ambiguous, and (2) identify the user's pointing intent. HINT processes the video in two parallel streams: a standard visual stream and a new hand-intent stream.

Figure 6. HINT overall architecture. Vt = visual token, Kt = keypoint feature, Ht = hand intent token.

Keypoint Adapter. The adapter projects 21 distinct 3D keypoints into a single Hand Intent Token that holistically represents the hand posture. The 63-dimensional flattened vector passes through LayerNorm followed by a two-layer MLP with GeLU activation, producing a token that matches the LLM's hidden dimension. Tokens whose underlying hand detections fall below the confidence threshold are discarded.
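The adapter described above can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation: the hidden dimension (`HIDDEN_DIM`), weight initialization, and confidence threshold are placeholder assumptions, and in the actual model the weights are learned end-to-end.

```python
import numpy as np

HIDDEN_DIM = 64  # placeholder for the LLM hidden size (not the paper's value)

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

class KeypointAdapter:
    """Projects 21 3D hand keypoints (63 values) to one hand-intent token."""

    def __init__(self, hidden_dim=HIDDEN_DIM, seed=0):
        rng = np.random.default_rng(seed)
        # two-layer MLP; in practice these weights are trained with the LLM
        self.w1 = rng.normal(0.0, 0.02, (63, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, hidden_dim))
        self.b2 = np.zeros(hidden_dim)

    def __call__(self, keypoints, confidence, threshold=0.5):
        # keypoints: (21, 3) array from an off-the-shelf hand reconstruction model
        if confidence < threshold:
            return None  # low-confidence detections are discarded
        x = layer_norm(keypoints.reshape(-1))   # flatten to 63-d, normalize
        h = gelu(x @ self.w1 + self.b1)         # MLP layer 1 + GeLU
        return h @ self.w2 + self.b2            # MLP layer 2 -> hand intent token

adapter = KeypointAdapter()
kp = np.random.default_rng(1).normal(size=(21, 3))
token = adapter(kp, confidence=0.9)  # shape: (HIDDEN_DIM,)
```

The key design point is that the whole hand posture collapses into one token in the LLM's embedding space, rather than 21 separate keypoint tokens.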
Frame-Keypoint Interleaving. Hand intent tokens are interleaved with visual tokens so the LLM can jointly reason over visual content and pointing direction. This construction enables the LLM to understand deictic context and temporally anchored references, while adding less than 1% token overhead.
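The interleaving itself reduces to a simple sequence construction. A minimal sketch, assuming one hand-intent token per frame appended after that frame's visual tokens (token contents here are placeholder strings):

```python
def interleave(frame_tokens, hand_tokens):
    """Build the LLM input sequence [V_1..., H_1, V_2..., H_2, ...].

    frame_tokens: list over frames, each a list of visual tokens (V_t)
    hand_tokens:  list over frames, each a hand-intent token (H_t) or None
    """
    seq = []
    for vis, hand in zip(frame_tokens, hand_tokens):
        seq.extend(vis)          # visual tokens for frame t
        if hand is not None:     # frames without a confident hand are skipped
            seq.append(hand)     # hand-intent token anchored to frame t
    return seq

# frame 1 has a confident pointing hand, frame 2 does not
seq = interleave([["v1a", "v1b"], ["v2a"]], ["h1", None])
# -> ["v1a", "v1b", "h1", "v2a"]
```

Anchoring each hand token directly after its frame is what gives the temporal grounding; and with, say, a few hundred visual tokens per frame, one extra token per frame stays well under the 1% overhead quoted above.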

Results

Performance of different MLLMs on the EgoPointVQA test set (multiple-choice accuracy %).
| Method | Size | Refer. | Temporal | Spatial | Count | Attr. | Feed. | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Models* | | | | | | | | |
| GPT-5 | – | 75.6 | 53.6 | 62.3 | 50.0 | 56.1 | 77.8 | 62.6 |
| GPT-4o | – | 56.1 | 29.5 | 43.1 | 44.8 | 41.5 | 65.7 | 46.8 |
| *Open-Source MLLMs ≥ 32B* | | | | | | | | |
| Qwen3-VL | 32B | 63.7 | 67.9 | 65.8 | 66.7 | 63.4 | 77.2 | 67.5 |
| InternVL3 | 78B | 71.4 | 71.4 | 62.3 | 45.8 | 68.3 | 80.1 | 66.6 |
| *Open-Source MLLMs ≤ 14B* | | | | | | | | |
| InternVL3 | 8B | 66.1 | 57.5 | 63.2 | 33.3 | 51.3 | 76.8 | 58.0 |
| InternVL3 | 14B | 63.1 | 66.1 | 61.4 | 50.0 | 58.5 | 77.2 | 62.7 |
| EgoGPT | 7B | 67.3 | 46.4 | 50.9 | 47.9 | 48.8 | 74.1 | 55.9 |
| LLaVA-OneVision | 7B | 54.2 | 42.9 | 53.5 | 35.4 | 46.3 | 67.1 | 49.9 |
| *HINT (Ours)* | | | | | | | | |
| HINT (LLaVA-OV) | 7B | 60.7 | 50.0 | 56.1 | 39.6 | 48.8 | 71.1 | 54.4 |
| HINT (InternVL3-8B) | 8B | 75.0 | 66.1 | 64.9 | 35.4 | 61.0 | 79.8 | 63.7 |
| HINT (InternVL3-14B) | 14B | 73.8 | 69.6 | 64.9 | 54.2 | 63.4 | 82.5 | 68.1 |

Table 1. HINT (highlighted) consistently improves over baselines. HINT-14B achieves the best average of 68.1%.

Ablation Study

We ablate key components of HINT using InternVL3-8B as the backbone.
| SFT | Hand Intent | Refer. | Temporal | Spatial | Attr. |
|---|---|---|---|---|---|
| – | – | 66.1 | 58.9 | 63.2 | 51.3 |
| ✓ | – | 68.5 | 60.7 | 59.6 | 56.7 |
| ✓ | ✓ | 75.0 | 66.1 | 64.9 | 61.0 |

Table 2. Ablation of HINT components. Combining SFT with Hand Intent Tokens yields the largest gains.

| Hand Intent Modeling | Refer. | Temporal | Spatial |
|---|---|---|---|
| None (SFT only) | 68.5 | 60.7 | 59.6 |
| Visual Keypoints | 57.1 | 60.7 | 61.4 |
| Visual Arrow from Fingertip | 70.2 | 60.7 | 62.3 |
| 3D Keypoints in Text | 68.5 | 55.4 | 58.8 |
| 2D Keypoints in Text | 69.0 | 57.1 | 59.6 |
| HINT (Ours) | 75.0 | 66.1 | 64.9 |

Table 4. Different methods of hand intent modeling. HINT with a learned keypoint adapter outperforms all alternatives.

BibTeX

@inproceedings{choi2026egopointvqa,
  author    = {Choi, Yura and Miles, Roy and Potamias, Rolandos Alexandros and Elezi, Ismail and Deng, Jiankang and Zafeiriou, Stefanos},
  title     = {Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering},
  booktitle = {CVPR},
  year      = {2026}
}