Rea2Seg

Abstract

Framework

Rea²Seg

Attention-driven candidate mask generation and reasoning-based mask selection

Benchmark

ReasonSeg-SGDR

Multi-dimensional evaluation of perception, grounding, and reasoning

Dataset

Rea²Seg-16K

16K reasoning segmentation samples with chain-of-thought annotations

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal Large Language Models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework, Rea²Seg, for mask generation and selection.

Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign a score to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection.
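The two stages described above can be sketched in a few lines, under loud assumptions: the page does not say how attention maps are turned into masks, so this sketch thresholds the map and splits it into connected regions, and it uses a placeholder `score_fn` where the framework would use an MLLM reasoning over the question and candidates. The names `candidate_masks`, `select_mask`, `threshold`, and `top_k` are hypothetical and are not taken from the authors' code.

```python
from collections import deque
from typing import Callable, List

import numpy as np


def candidate_masks(attention: np.ndarray, threshold: float = 0.5) -> List[np.ndarray]:
    """Stage 1 (assumed): threshold the attention map, then split it
    into 4-connected regions, one boolean mask per region."""
    active = attention >= threshold
    seen = np.zeros_like(active, dtype=bool)
    height, width = active.shape
    masks: List[np.ndarray] = []
    for r in range(height):
        for c in range(width):
            if not active[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill collects one connected region.
            region = np.zeros_like(active, dtype=bool)
            queue = deque([(r, c)])
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                region[y, x] = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and active[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            masks.append(region)
    return masks


def select_mask(
    masks: List[np.ndarray],
    score_fn: Callable[[np.ndarray], float],
    top_k: int = 1,
) -> List[np.ndarray]:
    """Stage 2: score every candidate, rerank, and keep the top_k."""
    scores = [score_fn(m) for m in masks]
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    return [masks[i] for i in order[:top_k]]


# Toy attention map with two separate high-attention regions.
attention = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.7],
])
candidates = candidate_masks(attention)

# Placeholder scorer (mask area). In the framework this role is played
# by an MLLM that reasons over the question and the candidate masks.
best = select_mask(candidates, score_fn=lambda m: float(m.sum()))[0]
```

Separating discovery from selection means the scorer only has to discriminate among a handful of candidates; it never has to produce pixels itself.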

We also observe that a large portion of questions in existing benchmarks focus on commonsense reasoning and do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark, ReasonSeg-SGDR, which comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask annotations.

In addition, we collect a 16K-sample training dataset, Rea²Seg-16K, for reasoning segmentation with detailed chain-of-thought annotations. We further convert this dataset into a mask-scoring dataset to enhance MLLMs' ability to jointly interpret multimodal queries and candidate masks, and to assign scores through reasoning.
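One plausible way to picture the conversion into a mask-scoring dataset is shown below. This is only an illustration: the page does not state how target scores are derived, so the sketch assumes each candidate is labeled with its IoU against the ground-truth mask. The record fields (`query`, `candidate_index`, `target_score`) and the helper names are hypothetical.

```python
from typing import Dict, List

import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def to_scoring_records(
    query: str, gt_mask: np.ndarray, candidates: List[np.ndarray]
) -> List[Dict[str, object]]:
    """Turn one segmentation sample into one scoring record per candidate.

    Assumption: the target score is the candidate's IoU with the ground truth.
    """
    return [
        {
            "query": query,
            "candidate_index": i,
            "target_score": round(mask_iou(c, gt_mask), 3),
        }
        for i, c in enumerate(candidates)
    ]


# Toy sample: a 2x2 ground-truth square and two candidate masks.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True

exact = gt.copy()                      # perfect candidate
shifted = np.zeros((4, 4), dtype=bool)
shifted[0:2, 1:3] = True               # overlaps the ground truth by half

records = to_scoring_records("the object nearest the door", gt, [exact, shifted])
```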

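Experimental results on ReasonSeg-SGDR and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework: on ReasonSeg-SGDR, Rea²Seg (Top 1) attains the highest average gIoU and cIoU among the compared methods in both the LLaVA and Qwen-VL comparisons (50.6/45.9 and 56.6/48.2, respectively).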

Method Overview

ReasonSeg-SGDR Overview

Summary of ReasonSeg-SGDR

Categories	Component Types	# Samples
Discriminative	Camouflage, Occlusion, Thin	279
Geometric	Relative size, Relative position, Ordinal relation, Shared attributes	396
Spatial	Spatial relation, Relative distance, Counting	270
Multi-step	Visual detail, OCR, Part-instance Relation, Spatial Relation, Comparative, Commonsense	245

Rea²Seg-16K Dataset Samples

Performance

Comparison on ReasonSeg-SGDR benchmark (LLaVA).
Method	Disc. gIoU	Disc. cIoU	Geo. gIoU	Geo. cIoU	Spatial gIoU	Spatial cIoU	Multi. gIoU	Multi. cIoU	Avg. gIoU	Avg. cIoU
LISA	26.9	16.9	25.1	19.3	28.4	24.1	28.4	19.2	27.2	19.9
CoReS	34.3	39.0	25.3	17.9	38.6	28.9	34.4	22.4	33.2	27.1
SESAME	21.7	7.3	25.5	21.9	36.0	28.6	33.3	18.7	29.1	19.1
READ	27.6	15.2	29.3	22.7	46.5	36.9	34.8	21.3	34.6	24.0
Rea²Seg (Top 1)	44.5	56.1	60.0	49.9	56.4	48.8	41.6	28.7	50.6	45.9
Rea²Seg (Top 3)	51.9	62.3	64.9	53.7	65.6	57.7	51.1	40.7	58.4	53.6

Comparison on ReasonSeg-SGDR benchmark (Qwen-VL).
Method	Disc. gIoU	Disc. cIoU	Geo. gIoU	Geo. cIoU	Spatial gIoU	Spatial cIoU	Multi. gIoU	Multi. cIoU	Avg. gIoU	Avg. cIoU
LENS-CoT	45.9	34.5	46.7	30.8	49.3	39.8	40.8	24.0	45.7	32.3
Seg-Zero	45.5	44.0	44.0	48.2	62.0	51.3	51.6	42.2	50.7	46.4
VisionReasoner	43.8	43.3	32.7	33.3	47.8	43.4	42.9	36.9	41.8	39.2
GPT-5-mini	51.9	60.2	68.3	29.8	54.7	35.9	43.7	21.7	54.7	36.9
Rea²Seg (Top 1)	48.9	48.8	73.1	61.8	58.8	49.8	45.4	32.2	56.6	48.2
Rea²Seg (Top 3)	56.8	59.4	78.6	56.0	67.6	62.1	55.0	43.4	64.5	55.2

Comparison on ReasonSeg benchmark (LLaVA).
Method	ReasonSeg-Val gIoU	ReasonSeg-Val cIoU	ReasonSeg-Test gIoU	ReasonSeg-Test cIoU
LISA	53.6	52.3	48.7	48.8
LISA (ft)	61.3	62.9	55.6	56.9
READ (ft)	59.8	67.6	58.5	58.6
Rea²Seg (Top 1)	61.8	56.0	60.3	59.2
Rea²Seg (Top 3)	67.5	60.2	65.5	64.8

Comparison on ReasonSeg benchmark (Qwen-VL).
Method	ReasonSeg-Val gIoU	ReasonSeg-Val cIoU	ReasonSeg-Test gIoU	ReasonSeg-Test cIoU
RSVP	58.6	48.5	56.6	51.6
LENS (ft)	62.1	64.9	57.2	58.0
COPRS	61.3	60.6	57.8	52.7
Seg-R1	60.8	56.2	55.3	46.6
Seg-Zero	62.6	62.0	57.5	52.0
SAM-R1	64.0	55.8	60.2	54.3
Rea²Seg (Top 1)	64.0	65.6	62.1	62.3
Rea²Seg (Top 3)	68.4	70.0	66.6	65.5
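For readers comparing the gIoU and cIoU columns above, the sketch below shows how the two metrics differ. It assumes the tables follow the usual ReasonSeg convention: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union over all images. The function name `giou_ciou` and the choice to count an empty prediction on an empty target as IoU 1.0 are assumptions of this sketch.

```python
from typing import List, Tuple

import numpy as np


def giou_ciou(preds: List[np.ndarray], gts: List[np.ndarray]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) as fractions in [0, 1] for paired boolean masks."""
    inters = [float(np.logical_and(p, g).sum()) for p, g in zip(preds, gts)]
    unions = [float(np.logical_or(p, g).sum()) for p, g in zip(preds, gts)]

    # gIoU: average the IoU computed separately for each image.
    # Assumed convention: an empty union counts as a perfect match.
    per_image = [i / u if u > 0 else 1.0 for i, u in zip(inters, unions)]
    giou = sum(per_image) / len(per_image)

    # cIoU: pool pixels across the whole dataset before dividing,
    # so large objects weigh more than small ones.
    total_union = sum(unions)
    ciou = sum(inters) / total_union if total_union > 0 else 1.0
    return giou, ciou


# Image 1: a one-pixel object, predicted exactly (IoU = 1.0).
p1 = np.zeros((2, 4), dtype=bool)
p1[0, 0] = True
g1 = p1.copy()

# Image 2: an eight-pixel object, half of it predicted (IoU = 0.5).
p2 = np.zeros((2, 4), dtype=bool)
p2[0, :] = True
g2 = np.ones((2, 4), dtype=bool)

giou, ciou = giou_ciou([p1, p2], [g1, g2])
# giou = (1.0 + 0.5) / 2 = 0.75, while ciou = (1 + 4) / (1 + 8) = 5/9
```

Because cIoU pools pixels, it is dominated by large objects, which is one reason a method can rank differently under the two metrics in the tables.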

BibTeX

If you find this work useful, please cite:

@misc{gao2026reasontwicesegmentationcandidate,
  title         = {Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning},
  author        = {Xinyan Gao and Haoran Hao and Xiangyu Yue},
  year          = {2026},
  eprint        = {2606.09303},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.09303},
}