Coherent Zero-Shot Visual Instruction Generation

Anonymous Authors

Visual instruction generation

[Teaser figure]

Abstract

Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle these issues, capitalizing on advances in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing while maintaining consistency and accuracy throughout the instruction sequence. We validate our approach on multi-step instructions, comparing text alignment and consistency against several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.

More visual instructions

[Interactive gallery: additional visual instructions generated by our method (Ours), shown as Steps 1-4 for each selectable sample.]

Consistent Image Generation

[Interactive comparison: for each selectable sample and baseline, images from the Baseline and from Ours across Steps 1-4.]

Concatenate Actions and States

[Interactive comparison: for each selectable sample and baseline, images from the Baseline and from Ours across Steps 1-4.]

The framework

[Framework overview figure]

Our framework operates in two distinct phases. In the first phase, we use an LLM (e.g., GPT-4) to generate the scene state after each step in the list of instructions; these generated states guide the image generation in the next stage. We also ask the LLM to produce a similarity matrix between states, where each row indicates how visually similar the current step is to every other step. This matrix guides the generation process: when two steps have high state similarity, we wish to maintain as much consistency as possible between them, whereas a low state similarity indicates that the performed action changes the scene substantially, and blindly encouraging consistency across such steps can hurt the quality of the visualized instruction image. In the second phase, we replace the model's standard self-attention with a shared attention layer that allows queries from one image to access keys and values from the other images in the same instruction set. We refine this sharing mechanism with standard attention masking, controlled by the similarity matrix, to finely tune the interaction between visual elements.
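
As a concrete illustration of the second phase, the sketch below shows one way such similarity-masked shared attention could be written in PyTorch. This is our own minimal sketch under stated assumptions, not the released implementation: the tensor layout, the thresholding rule, the `threshold` value, and the function name `shared_masked_attention` are illustrative choices.

```python
import torch
import torch.nn.functional as F


def shared_masked_attention(q, k, v, similarity, threshold=0.5):
    """Shared attention across the N images of one instruction set (sketch).

    q, k, v:     (N, heads, tokens, dim) per-step projections taken from a
                 self-attention layer of the diffusion model.
    similarity:  (N, N) state-similarity matrix produced by the LLM, values
                 in [0, 1]; row i scores how visually close step i should
                 stay to every other step.
    threshold:   steps whose similarity falls below this value are masked
                 out, so dissimilar states are not forced to stay consistent
                 (this hard-thresholding rule is an assumption of the sketch).
    """
    n, h, t, d = q.shape

    # Concatenate keys and values from all steps so the queries of each image
    # can attend to every image in the set: (N, heads, N*tokens, dim).
    k_all = k.transpose(0, 1).reshape(h, n * t, d).expand(n, h, n * t, d)
    v_all = v.transpose(0, 1).reshape(h, n * t, d).expand(n, h, n * t, d)

    # Build an additive attention mask from the similarity matrix: queries of
    # step i may only read the tokens of step j when sim(i, j) is high enough.
    allow = similarity >= threshold
    allow = allow | torch.eye(n, dtype=torch.bool, device=similarity.device)  # keep self-attention
    mask = torch.zeros(n, n, dtype=q.dtype, device=q.device)
    mask[~allow] = float("-inf")
    mask = mask.repeat_interleave(t, dim=1)   # (N, N*tokens): one entry per shared token
    mask = mask[:, None, None, :]             # broadcast over heads and query tokens

    # Standard scaled dot-product attention over the shared keys and values.
    return F.scaled_dot_product_attention(q, k_all, v_all, attn_mask=mask)


# Toy usage: 4 instruction steps, 8 heads, 64 latent tokens, 40-dim heads.
q = k = v = torch.randn(4, 8, 64, 40)
sim = torch.tensor([[1.0, 0.8, 0.3, 0.2],
                    [0.8, 1.0, 0.7, 0.3],
                    [0.3, 0.7, 1.0, 0.6],
                    [0.2, 0.3, 0.6, 1.0]])
out = shared_masked_attention(q, k, v, sim)   # -> (4, 8, 64, 40)
```

In this sketch the mask simply blocks attention between steps whose similarity falls below the threshold; softer variants, such as scaling the attention logits by the similarity values themselves, would fit the same interface.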