Hierarchical Mask Composition for High-Resolution Text-Guided Image Editing
Abstract
Text-guided image editing has become central to interactive visual content creation, as natural language offers a flexible interface for specifying semantic modifications. High-resolution editing, however, remains challenging because edits must stay spatially coherent, respect object boundaries, and preserve global structure while responding to localized textual instructions. Existing approaches often rely on a single mask or uniform conditioning over the image, which can lead to spatial bleeding of edits, loss of fine-scale detail, or inconsistent behavior across resolutions. This work introduces a hierarchical mask composition framework for text-guided image editing that decomposes the image plane into a tree of overlapping and nested regions, each associated with distinct textual attributes, editing strengths, or diffusion schedules. The framework constructs a hierarchy of masks ranging from coarse semantic partitions down to fine-grained structures and composes them consistently across levels to control how local edits propagate across scales. By coupling this hierarchical representation with text-conditioned generative models, the approach enables localized edits at high resolution while maintaining compatibility with latent-space diffusion backbones. The study analyzes the algebraic properties of the composition operator, the numerical behavior of gradient-based optimization of soft masks, and the interaction between hierarchical masking and multi-scale feature representations. Empirical observations on diverse editing tasks indicate that hierarchical mask composition can provide finer spatial control, improved boundary fidelity, and more predictable edit locality than single-layer masking strategies, particularly when images are edited at substantially higher resolutions than those used during model pretraining.
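To make the idea of a mask tree concrete, the following minimal Python sketch illustrates one plausible reading of hierarchical mask composition; it is not the paper's implementation, and the names (MaskNode, compose_effective_masks) and the multiplicative gating rule are assumptions introduced here for illustration only.

    import numpy as np

    class MaskNode:
        """Hypothetical node in a mask hierarchy: a soft mask in [0, 1] plus
        a text prompt and edit strength for its region, with nested children."""
        def __init__(self, mask, prompt=None, strength=1.0, children=None):
            self.mask = np.clip(mask, 0.0, 1.0)   # soft mask of shape (H, W)
            self.prompt = prompt                  # textual attribute for this region
            self.strength = strength              # per-region editing strength
            self.children = children or []

    def compose_effective_masks(node, parent_mask=None):
        """One possible composition rule (an assumption, not the paper's operator):
        a node's effective mask is the product of its own mask and all ancestor
        masks, so a child's edit cannot leak outside its parent's region."""
        parent_mask = np.ones_like(node.mask) if parent_mask is None else parent_mask
        effective = parent_mask * node.mask
        out = [(node.prompt, node.strength, effective)]
        for child in node.children:
            out.extend(compose_effective_masks(child, effective))
        return out

    # Usage sketch: a coarse "sky" region with a nested "clouds" sub-region.
    H, W = 64, 64
    sky = np.zeros((H, W)); sky[:32, :] = 1.0
    clouds = np.zeros((H, W)); clouds[8:16, 20:44] = 1.0

    root = MaskNode(np.ones((H, W)), prompt="base image", strength=0.0,
                    children=[MaskNode(sky, "dramatic sunset sky", 0.8,
                                       children=[MaskNode(clouds, "wispy cirrus clouds", 0.5)])])

    for prompt, strength, mask in compose_effective_masks(root):
        print(prompt, strength, mask.sum())

Under this sketch, each region receives its own prompt and strength, and the recursive product keeps nested edits confined to their ancestors' support, which is one way the coarse-to-fine propagation described above could be realized.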