Multimodal Learning Relevance: 8/10

LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
arXiv: 2603.17965v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

LaDe proposes a new latent diffusion framework for generating and decomposing editable multi-layered graphic media designs.

Key Contributions

  • Proposes LaDe, a new latent diffusion framework
  • Supports three tasks: text-to-image generation, text-to-layers design generation, and layer decomposition
  • Uses an LLM-based prompt expander and a diffusion Transformer with 4D RoPE positional encoding

Methodology

An LLM performs prompt expansion, a diffusion Transformer generates the multi-layer design, and an RGBA VAE decodes the layers. Conditional sampling supports multiple tasks.
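The digest does not spell out how the 4D RoPE works; a common way to extend rotary position embeddings to several axes is to split the head dimension into one chunk per axis and rotate each chunk by that axis's position. A minimal NumPy sketch under that assumption (the axis names `layer`, `t`, `y`, `x` and the equal-chunk split are illustrative, not the paper's exact design):

```python
import numpy as np

def rope_1d(x, pos, theta=10000.0):
    """Standard 1-D rotary embedding over the last (even-sized) dim of x."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)   # (d/2,) inverse frequencies
    ang = pos[..., None] * freqs                  # (..., d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # rotate each feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_4d(x, layer, t, y, xpos):
    """Split the head dim into 4 chunks, one rotated per positional axis."""
    assert x.shape[-1] % 8 == 0, "need head dim divisible by 8 (4 axes x pairs)"
    chunks = np.split(x, 4, axis=-1)
    positions = [layer, t, y, xpos]
    return np.concatenate(
        [rope_1d(c, p) for c, p in zip(chunks, positions)], axis=-1
    )
```

Because each chunk is only rotated, token norms are preserved and attention scores become relative in every axis, which is the usual motivation for RoPE-style encodings.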

Original Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
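The abstract's "conditioning on layer samples during training" suggests an inpainting-style scheme: latents chosen as conditions stay clean while the rest are noised, so one model covers generation and decomposition depending on which latents are held fixed. A hedged NumPy sketch (the rectified-flow interpolation and the `cond_mask` layout are assumptions, not the paper's stated formulation):

```python
import numpy as np

def build_denoiser_input(latents, cond_mask, noise, t):
    """latents: (L, D) per-layer latents; cond_mask: (L,) bool; t in [0, 1].

    Conditioned latents are passed through clean; the others are
    interpolated toward noise at timestep t (rectified-flow style,
    an assumption here).
    """
    noisy = (1.0 - t) * latents + t * noise
    return np.where(cond_mask[:, None], latents, noisy)

# Illustrative task selection via the mask:
#   text-to-layers: cond_mask all False  -> every layer is generated
#   decomposition:  cond_mask True only on the composite-image latent
#                   -> layers are generated conditioned on the image
```

One design appeal of this setup is that a single trained denoiser serves all three tasks; only the sampling-time mask changes.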

Tags

Diffusion Models  Multi-Layer Image Generation  Text-to-Image  Image Decomposition  LLM

arXiv Categories

cs.CV