Hierarchical Mask Composition for High-Resolution Text-Guided Image Editing

Authors

  • Sanjay Kumar Adhikari, Department of Information Technology, Far Western University, Mahendranagar–Bhasi Road, Kanchanpur 10400, Nepal
  • Prerana Shrestha, Department of Computer Applications, Madan Bhandari Memorial College, Bhaktapur–Tokha Road, Kathmandu 44600, Nepal

Abstract

Text-guided image editing has become central to interactive visual content creation, as natural language offers a flexible interface for specifying semantic modifications. High-resolution editing, however, remains challenging because edits must remain spatially coherent, respect object boundaries, and preserve global structure while responding to localized textual instructions. Existing approaches often rely on a single mask or uniform conditioning over the image, which can lead to spatial bleeding of edits, loss of fine-scale detail, or inconsistent behavior across resolutions. This work introduces a hierarchical mask composition framework for text-guided image editing that decomposes the image plane into a tree of overlapping and nested regions, each associated with distinct textual attributes, editing strengths, or diffusion schedules. The framework constructs a hierarchy of masks ranging from coarse semantic partitions down to fine-grained structures, and composes them in a consistent way to control how local edits propagate across scales. By coupling this hierarchical representation with text-conditioned generative models, the approach enables localized edits at high resolution while maintaining compatibility with latent-space diffusion backbones. The study analyzes the algebraic properties of the composition operator, the numerical behavior of gradient-based optimization of soft masks, and the interaction between hierarchical masking and multi-scale feature representations. Empirical observations on diverse editing tasks indicate that hierarchical mask composition can provide finer spatial control, improved boundary fidelity, and more predictable edit locality compared to single-layer masking strategies, particularly when images are edited at substantially higher resolutions than those used during model pretraining.
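The hierarchy of nested soft masks described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the class and function names (`MaskNode`, `compose`), the multiplicative parent–child gating rule, and the max-based merge of sibling regions are all assumptions chosen to show how a tree of masks can yield a single edit-strength map in which fine edits stay confined to their parent regions.

```python
# Hypothetical sketch of hierarchical mask composition: soft masks in
# [0, 1] arranged in a tree, composed into one edit-strength map.
# The paper's actual composition operator may differ.
import numpy as np

class MaskNode:
    """A node in the mask hierarchy: a soft mask plus a per-region edit strength."""
    def __init__(self, mask, strength=1.0, children=None):
        self.mask = np.clip(mask, 0.0, 1.0)  # soft mask in [0, 1]
        self.strength = strength             # how strongly this region is edited
        self.children = children or []

def compose(node, parent_mask=None):
    """Recursively compose the tree into a single edit-strength map.

    A child's effective mask is gated by its parent's effective mask
    (multiplicative nesting), so fine-grained edits cannot bleed outside
    the coarser region that contains them.
    """
    eff = node.mask if parent_mask is None else parent_mask * node.mask
    out = node.strength * eff
    for child in node.children:
        # Finer regions override coarser ones where they overlap.
        out = np.maximum(out, compose(child, parent_mask=eff))
    return out

# Toy 4x4 example: a weak coarse edit with a stronger nested fine edit.
coarse = MaskNode(np.full((4, 4), 0.5), strength=0.4)
fine_mask = np.zeros((4, 4)); fine_mask[1:3, 1:3] = 1.0
coarse.children.append(MaskNode(fine_mask, strength=0.9))
edit_map = compose(coarse)
# Outside the fine region: 0.4 * 0.5 = 0.2; inside: max(0.2, 0.9 * 0.5) = 0.45.
```

In this sketch the edit-strength map could then modulate per-pixel (or per-latent) conditioning in a diffusion backbone, with the gating rule guaranteeing the edit-locality property the abstract emphasizes.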

Published

2025-08-04