In ICCV 2025

UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint

ETH Zürich, TU Munich, Google
UIP2P teaser

Unsupervised InstructPix2Pix. Our approach applies more precise and coherent edits while better preserving scene structure. UIP2P surpasses its supervised counterpart, IP2P, which is trained on a synthetic dataset, demonstrating superior performance on both real (a, b) and synthetic (c, d) images.

Abstract

We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited images, and edit instructions. These triplets are typically generated either by existing editing methods—introducing biases—or through human annotations, which are costly and limit generalization. Our approach addresses these challenges by introducing a novel editing mechanism called Edit Reversibility Constraint (ERC), which applies forward and reverse edits in one training step and enforces alignment in image, text, and attention spaces. This allows us to bypass the need for ground-truth edited images and, for the first time, unlock training on datasets comprising either real image-caption pairs or image-caption-instruction triplets. We empirically show that our approach performs better across a broader range of edits, with high fidelity and precision. By eliminating the need for pre-existing datasets of triplets, reducing biases associated with current methods, and proposing ERC, our work represents a significant step toward scaling instruction-based image editing.
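To make the ERC objective described above more concrete, the following is a minimal sketch of how its alignment terms could be combined, assuming a PyTorch-style setup. The specific loss terms, CLIP-based features, attention maps, and weights are illustrative assumptions, not the exact formulation used in the paper.

import torch.nn.functional as F

def erc_loss(x, x_rec, clip_img_fwd, clip_txt_fwd, attn_fwd, attn_rev,
             w_img=1.0, w_txt=1.0, w_attn=1.0):
    """Combine alignment terms in image, text, and attention spaces (illustrative).

    x            : original images
    x_rec        : images after the forward edit followed by the reverse edit
    clip_img_fwd : CLIP embedding of the forward-edited images (assumed input)
    clip_txt_fwd : CLIP embedding of the forward instruction (assumed input)
    attn_fwd/rev : cross-attention maps from the forward and reverse edit passes
    """
    # Image space: the reverse edit should reconstruct the original image.
    l_img = F.l1_loss(x_rec, x)
    # Text space: the forward edit should agree with its instruction in CLIP space.
    l_txt = 1.0 - F.cosine_similarity(clip_img_fwd, clip_txt_fwd, dim=-1).mean()
    # Attention space: forward and reverse edits should attend to the same regions.
    l_attn = F.mse_loss(attn_fwd, attn_rev)
    return w_img * l_img + w_txt * l_txt + w_attn * l_attn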

Dataset Issues

Dataset Issues

Supervision biases in the InstructPix2Pix and HQ-Edit datasets. Each example shows an input image and its corresponding ground-truth edited image for the given edit instruction. InstructPix2Pix employs Prompt-to-Prompt, while HQ-Edit uses DALL-E 3 and GPT-4V. (a & d) Attribute-entangled edits: Modifying a specific feature, such as clothing or hair color, unintentionally alters surrounding textures or elements. (b & e) Scene-entangled edits: Transforming objects, like turning a cottage into a castle or removing an element, affects unintended parts of the scene. (c & f) Global changes: Edits like converting an image to black and white or changing the time of day introduce widespread scene modifications, often compromising visual preservation.

Method Overview

UIP2P framework

Overview of the UIP2P training framework. The model learns instruction-based image editing by applying forward and reverse instructions. Starting with an input image and a forward instruction, a shared Edit Model generates an edited image. A reverse instruction is then applied to reconstruct the original image, enforcing the Edit Reversibility Constraint (ERC).
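Below is a rough sketch of one ERC training step as described in the caption above, reusing the illustrative erc_loss from the abstract section. The names edit_model, clip_image, and clip_text are placeholder callables assumed for illustration, not the released implementation.

def training_step(edit_model, clip_image, clip_text, optimizer,
                  x, fwd_instruction, rev_instruction):
    # Forward edit: the shared model applies the forward instruction to the input.
    x_fwd, attn_fwd = edit_model(x, fwd_instruction)
    # Reverse edit: the same model undoes the edit on its own output.
    x_rec, attn_rev = edit_model(x_fwd, rev_instruction)
    # Enforce the Edit Reversibility Constraint within a single training step.
    loss = erc_loss(x, x_rec,
                    clip_img_fwd=clip_image(x_fwd),
                    clip_txt_fwd=clip_text(fwd_instruction),
                    attn_fwd=attn_fwd, attn_rev=attn_rev)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()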

Qualitative Results

Qualitative Results 1
Qualitative Results 2

Qualitative Examples. UIP2P is shown across various tasks and datasets and compared to InstructPix2Pix, MagicBrush, HIVE, MGIE, and SmartEdit. Our method produces comparable or superior results, accurately applying the requested edits while preserving visual consistency. Red circles and arrows highlight severe failures introduced during editing.

Ethics Statement

Advancements in localized image editing technology offer substantial opportunities to enhance creative expression and improve accessibility within digital media and virtual reality environments. Nonetheless, these developments also raise important ethical challenges, particularly concerning the misuse of such technology to create misleading content, such as deepfakes (Korshunov and Marcel, 2018), and its potential effect on employment in the image editing industry. Moreover, as highlighted by Kenthapadi et al. (2023), these technologies require a thorough and careful discussion of their ethical use to avoid possible misuse. We believe that our method could help reduce some of the biases present in previous datasets, though it will still be affected by biases inherent in models such as CLIP. Ethical frameworks should prioritize encouraging responsible usage, developing clear guidelines to prevent misuse, and promoting fairness and transparency, particularly in sensitive contexts like journalism. Effectively addressing these concerns is crucial to amplifying the positive benefits of the technology while minimizing associated risks. In addition, our user study follows strict anonymity rules to protect the privacy of participants.

BibTeX

@misc{simsar2024uip2p,
    title={UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint}, 
    author={Enis Simsar and Alessio Tonioni and Yongqin Xian and Thomas Hofmann and Federico Tombari},
    year={2024},
    eprint={2412.15216},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}