Hierarchical Mask Composition for High-Resolution Text-Guided Image Editing

Authors

  • Sanjay Kumar Adhikari, Department of Information Technology, Far Western University, Mahendranagar–Bhasi Road, Kanchanpur 10400, Nepal
  • Prerana Shrestha, Department of Computer Applications, Madan Bhandari Memorial College, Bhaktapur–Tokha Road, Kathmandu 44600, Nepal

Abstract

Text-guided image editing has become central to interactive visual content creation, as natural language offers a flexible interface for specifying semantic modifications. High-resolution editing, however, remains challenging because edits must remain spatially coherent, respect object boundaries, and preserve global structure while responding to localized textual instructions. Existing approaches often rely on a single mask or uniform conditioning over the image, which can lead to spatial bleeding of edits, loss of fine-scale detail, or inconsistent behavior across resolutions. This work introduces a hierarchical mask composition framework for text-guided image editing that decomposes the image plane into a tree of overlapping and nested regions, each associated with distinct textual attributes, editing strengths, or diffusion schedules. The framework constructs a hierarchy of masks ranging from coarse semantic partitions down to fine-grained structures, and composes them in a consistent way to control how local edits propagate across scales. By coupling this hierarchical representation with text-conditioned generative models, the approach enables localized edits at high resolution while maintaining compatibility with latent-space diffusion backbones. The study analyzes the algebraic properties of the composition operator, the numerical behavior of gradient-based optimization of soft masks, and the interaction between hierarchical masking and multi-scale feature representations. Empirical observations on diverse editing tasks indicate that hierarchical mask composition can provide finer spatial control, improved boundary fidelity, and more predictable edit locality compared to single-layer masking strategies, particularly when images are edited at substantially higher resolutions than those used during model pretraining.
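The hierarchy of nested soft masks described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the class and function names (`MaskNode`, `compose`), the multiplicative parent–child gating rule, and the max-based merge of sibling regions are all assumptions chosen to show how a tree of masks can yield a single edit-strength map in which fine edits stay confined to their parent regions.

```python
# Hypothetical sketch of hierarchical mask composition: soft masks in
# [0, 1] arranged in a tree, composed into one edit-strength map.
# The paper's actual composition operator may differ.
import numpy as np

class MaskNode:
    """A node in the mask hierarchy: a soft mask plus a per-region edit strength."""
    def __init__(self, mask, strength=1.0, children=None):
        self.mask = np.clip(mask, 0.0, 1.0)  # soft mask in [0, 1]
        self.strength = strength             # how strongly this region is edited
        self.children = children or []

def compose(node, parent_mask=None):
    """Recursively compose the tree into a single edit-strength map.

    A child's effective mask is gated by its parent's effective mask
    (multiplicative nesting), so fine-grained edits cannot bleed outside
    the coarser region that contains them.
    """
    eff = node.mask if parent_mask is None else parent_mask * node.mask
    out = node.strength * eff
    for child in node.children:
        # Finer regions override coarser ones where they overlap.
        out = np.maximum(out, compose(child, parent_mask=eff))
    return out

# Toy 4x4 example: a weak coarse edit with a stronger nested fine edit.
coarse = MaskNode(np.full((4, 4), 0.5), strength=0.4)
fine_mask = np.zeros((4, 4)); fine_mask[1:3, 1:3] = 1.0
coarse.children.append(MaskNode(fine_mask, strength=0.9))
edit_map = compose(coarse)
# Outside the fine region: 0.4 * 0.5 = 0.2; inside: max(0.2, 0.9 * 0.5) = 0.45.
```

In this sketch the edit-strength map could then modulate per-pixel (or per-latent) conditioning in a diffusion backbone, with the gating rule guaranteeing the edit-locality property the abstract emphasizes.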

Published

2025-08-04