Rea2Seg

Segmentation via Candidate Discovery and Comparative Reasoning

Xinyan Gao1,* Haoran Hao2,* Xiangyu Yue1,†

1 MMLab, The Chinese University of Hong Kong

2 Nanjing University

* Equal contribution. † Corresponding author.

Abstract

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal Large Language Models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework, Rea2Seg, for mask generation and selection.

Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign a score to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection.
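The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `masks_from_attention`, `rerank`, and `toy_score` are hypothetical names, plain thresholding stands in for candidate discovery from a segmentation MLLM's attention maps, and a toy area-based function stands in for the MLLM scorer.

```python
from typing import Callable, List, Sequence, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1


def masks_from_attention(attn: Sequence[Sequence[float]],
                         thresholds: Sequence[float]) -> List[Mask]:
    """Hypothetical stage 1: threshold one attention map at several levels.

    Assumption: thresholding is only a stand-in for however the real
    method turns attention maps into candidate masks.
    """
    candidates: List[Mask] = []
    for t in thresholds:
        mask = [[1 if v >= t else 0 for v in row] for row in attn]
        # Skip empty masks and exact duplicates of earlier candidates.
        if any(any(row) for row in mask) and mask not in candidates:
            candidates.append(mask)
    return candidates


def rerank(candidates: List[Mask],
           score_fn: Callable[[Mask], float]) -> List[Tuple[float, Mask]]:
    """Stage 2: score every candidate, then sort from best to worst."""
    scored = [(score_fn(m), m) for m in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(mask: Mask) -> float:
    """Placeholder for the MLLM scorer: prefers masks with 3 foreground pixels."""
    area = sum(map(sum, mask))
    return -abs(area - 3)


attn = [[0.1, 0.2, 0.9],
        [0.1, 0.8, 0.9],
        [0.0, 0.1, 0.3]]
candidates = masks_from_attention(attn, thresholds=[0.25, 0.5, 0.85])
ranked = rerank(candidates, toy_score)
best_score, best_mask = ranked[0]  # the highest-scoring candidate is the output
```

Separating discovery from selection means the scorer only has to compare a small set of masks rather than generate one, which is the reformulation the paragraph above describes.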

We also observe that a large portion of questions in existing benchmarks focus on commonsense reasoning and do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark, ReasonSeg-SGDR, which comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask annotations.

In addition, we collect a 16K-sample training dataset, Rea2Seg-16K, for reasoning segmentation with detailed chain-of-thought annotations. We further convert this dataset into a mask-scoring dataset to enhance MLLMs' ability to jointly interpret multimodal queries and candidate masks, and to assign scores through reasoning.
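As a rough illustration of what converting a segmentation sample into mask-scoring records could look like, the sketch below pairs one query with several candidate masks and attaches a target score to each. The labeling rule is an assumption: the text does not say how scores are derived, so this sketch uses the candidate's IoU with the ground-truth mask. The record fields (`query`, `candidate_id`, `target_score`), the function names, and the example query are hypothetical, and the chain-of-thought annotations are omitted.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask stored as rows of 0/1


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def to_scoring_records(query: str, gt: Mask,
                       candidates: List[Mask]) -> List[Dict]:
    """Turn one segmentation sample into per-candidate scoring records.

    Assumption: the target score is the candidate's IoU with the ground
    truth, rounded to one decimal. The source does not state this rule.
    """
    return [
        {"query": query, "candidate_id": i,
         "target_score": round(iou(c, gt), 1)}
        for i, c in enumerate(candidates)
    ]


gt = [[0, 1, 1],
      [0, 1, 1]]
candidates = [
    [[0, 1, 1], [0, 1, 1]],  # exact match with the ground truth
    [[0, 0, 1], [0, 0, 1]],  # covers half of the target
    [[1, 0, 0], [1, 0, 0]],  # disjoint from the target
]
records = to_scoring_records("the taller of the two cups", gt, candidates)
```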

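Experimental results on ReasonSeg-SGDR and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework. In the LLaVA comparison on ReasonSeg-SGDR, for example, Rea2Seg (Top 1) reaches 50.6 average gIoU against 34.6 for the strongest baseline, READ.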

Method Overview

[Figure: Rea2Seg method overview]

ReasonSeg-SGDR Overview

[Figure: ReasonSeg-SGDR benchmark overview]

Summary of ReasonSeg-SGDR

Categories      Component Types                                                                         # Samples
Discriminative  Camouflage, Occlusion, Thin                                                             279
Geometric       Relative size, Relative position, Ordinal relation, Shared attributes                   396
Spatial         Spatial relation, Relative distance, Counting                                           270
Multi-step      Visual detail, OCR, Part-instance relation, Spatial relation, Comparative, Commonsense  245

Rea2Seg-16K Dataset Samples

[Figure: Rea2Seg-16K dataset samples]

Performance

Comparison on ReasonSeg-SGDR benchmark (LLaVA).
Method            Disc.        Geo.         Spatial      Multi.       Avg.
                  gIoU  cIoU   gIoU  cIoU   gIoU  cIoU   gIoU  cIoU   gIoU  cIoU
LISA              26.9  16.9   25.1  19.3   28.4  24.1   28.4  19.2   27.2  19.9
CoReS             34.3  39.0   25.3  17.9   38.6  28.9   34.4  22.4   33.2  27.1
SESAME            21.7   7.3   25.5  21.9   36.0  28.6   33.3  18.7   29.1  19.1
READ              27.6  15.2   29.3  22.7   46.5  36.9   34.8  21.3   34.6  24.0
Rea2Seg (Top 1)   44.5  56.1   60.0  49.9   56.4  48.8   41.6  28.7   50.6  45.9
Rea2Seg (Top 3)   51.9  62.3   64.9  53.7   65.6  57.7   51.1  40.7   58.4  53.6

Comparison on ReasonSeg-SGDR benchmark (Qwen-VL).
Method            Disc.        Geo.         Spatial      Multi.       Avg.
                  gIoU  cIoU   gIoU  cIoU   gIoU  cIoU   gIoU  cIoU   gIoU  cIoU
LENS-CoT          45.9  34.5   46.7  30.8   49.3  39.8   40.8  24.0   45.7  32.3
Seg-Zero          45.5  44.0   44.0  48.2   62.0  51.3   51.6  42.2   50.7  46.4
VisionReasoner    43.8  43.3   32.7  33.3   47.8  43.4   42.9  36.9   41.8  39.2
GPT-5-mini        51.9  60.2   68.3  29.8   54.7  35.9   43.7  21.7   54.7  36.9
Rea2Seg (Top 1)   48.9  48.8   73.1  61.8   58.8  49.8   45.4  32.2   56.6  48.2
Rea2Seg (Top 3)   56.8  59.4   78.6  56.0   67.6  62.1   55.0  43.4   64.5  55.2

Comparison on ReasonSeg benchmark (LLaVA).
Method            ReasonSeg-Val    ReasonSeg-Test
                  gIoU  cIoU       gIoU  cIoU
LISA              53.6  52.3       48.7  48.8
LISA (ft)         61.3  62.9       55.6  56.9
READ (ft)         59.8  67.6       58.5  58.6
Rea2Seg (Top 1)   61.8  56.0       60.3  59.2
Rea2Seg (Top 3)   67.5  60.2       65.5  64.8

Comparison on ReasonSeg benchmark (Qwen-VL).
Method            ReasonSeg-Val    ReasonSeg-Test
                  gIoU  cIoU       gIoU  cIoU
RSVP              58.6  48.5       56.6  51.6
LENS (ft)         62.1  64.9       57.2  58.0
COPRS             61.3  60.6       57.8  52.7
Seg-R1            60.8  56.2       55.3  46.6
Seg-Zero          62.6  62.0       57.5  52.0
SAM-R1            64.0  55.8       60.2  54.3
Rea2Seg (Top 1)   64.0  65.6       62.1  62.3
Rea2Seg (Top 3)   68.4  70.0       66.6  65.5
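
The tables report gIoU and cIoU without defining them. The sketch below assumes the usual convention in reasoning-segmentation work: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union over the whole set. The function names and the handling of empty masks are assumptions rather than details from this page.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(p & g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    union = sum(p | g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    return inter, union


def giou_ciou(pairs: List[Tuple[Mask, Mask]]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) for a list of (prediction, ground truth) pairs."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        # Assumed convention: empty prediction on an empty target counts as 1.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


pairs = [
    ([[1, 0]], [[1, 0]]),              # small object, perfect: IoU 1/1
    ([[1, 1, 0, 0]], [[1, 1, 1, 1]]),  # larger object, half covered: IoU 2/4
]
giou, ciou = giou_ciou(pairs)  # gIoU = (1.0 + 0.5) / 2, cIoU = 3 / 5
```

Under these definitions cIoU weights large objects more heavily than gIoU does, which is one reason the two columns can rank methods differently.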

BibTeX

If you find this work useful, please cite:

@misc{gao2026reasontwicesegmentationcandidate,
  title         = {Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning},
  author        = {Xinyan Gao and Haoran Hao and Xiangyu Yue},
  year          = {2026},
  eprint        = {2606.09303},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.09303},
}