ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Abstract

Despite significant advances in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, which forces users into labor-intensive attempts to obtain the desired images. This paper proposes Automatic T2I generation, which aims to automate these steps so that users can simply describe their needs in a freestyle chat. To study this problem systematically, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of Automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. In extensive evaluation of step-wise accuracy and image quality, ChatGen-Evo significantly improves performance over various baselines. Our evaluation also uncovers valuable insights for advancing Automatic T2I.

Benchmark

We introduce ChatGenBench, a benchmark specifically designed for Automatic T2I. It includes a comprehensive step-by-step trail that enables step-wise evaluation, and it supports multimodal or historical user inputs.

Rank	Model	Prompt Score	Selection Acc	Argument Acc
1	ChatGen-Evo-2B	0.247	0.328	0.537
2	ChatGen-Base-8B	0.208	0.264	0.509
3	ChatGen-Base-4B	0.197	0.230	0.490
4	ChatGen-Base-2B	0.184	0.206	0.384
5	Baseline	0.026	-	-
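
The page does not say how the step-wise columns are computed. As a minimal sketch, assuming Selection Acc is the fraction of samples where the predicted model exactly matches the reference model, it could be computed as below. The function name `exact_match_accuracy` and the model names are hypothetical and are not taken from ChatGenBench.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the reference.

    Assumption: exact string match is the criterion; the benchmark's
    real scoring rule may differ.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up model names, for illustration only.
predicted_models = ["model-a", "model-b", "model-c", "model-d"]
reference_models = ["model-a", "model-x", "model-c", "model-y"]
accuracy = exact_match_accuracy(predicted_models, reference_models)
```

Under this assumption, two matches out of four samples give an accuracy of 0.5.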

Rank	Model	FID Score	CLIP Score	HPSv2	Image Reward	Unified Score
1	ChatGen-Evo-2B	19.1	72.9	25.1	8.9	65.9
2	ChatGen-Base-8B	20.8	70.7	23.9	4.0	60.7
3	ChatGen-Base-4B	21.3	69.9	23.5	2.4	59.0
4	ChatGen-Base-2B	20.7	70.0	23.4	1.5	58.7
5	Baseline	32.7	64.6	20.2	-34.6	37.3

Methods

We present ChatGen-Evo, which adopts a multi-stage evolution strategy. By decomposing the task into distinct stages, ChatGen-Evo enables the model to progressively acquire essential Automatic T2I capabilities.
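
To make the idea of stage-wise decomposition concrete, here is a minimal sketch that splits the task into the three steps named in the abstract: crafting a prompt, selecting a model, and configuring arguments. It is not the ChatGen-Evo training procedure, which this page does not detail. The stage functions use placeholder rules, and every name, model identifier, and argument key is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GenerationPlan:
    """What a T2I backend would need; all field names are hypothetical."""
    user_request: str
    prompt: str = ""
    model_name: str = ""
    arguments: Dict[str, float] = field(default_factory=dict)


Stage = Callable[[GenerationPlan], GenerationPlan]


def write_prompt(plan: GenerationPlan) -> GenerationPlan:
    # Placeholder: only normalizes whitespace; a real system would
    # rewrite the request with a language model.
    plan.prompt = " ".join(plan.user_request.split())
    return plan


def select_model(plan: GenerationPlan) -> GenerationPlan:
    # Placeholder keyword rule with made-up model names.
    is_anime = "anime" in plan.prompt.lower()
    plan.model_name = "anime-model" if is_anime else "general-model"
    return plan


def configure_arguments(plan: GenerationPlan) -> GenerationPlan:
    # Placeholder defaults; the keys are illustrative, not from the paper.
    plan.arguments = {"steps": 30, "guidance_scale": 7.5}
    return plan


def run_stages(user_request: str, stages: List[Stage]) -> GenerationPlan:
    """Apply each stage in order, passing the partially filled plan along."""
    plan = GenerationPlan(user_request=user_request)
    for stage in stages:
        plan = stage(plan)
    return plan


plan = run_stages(
    "  an anime   girl under cherry blossoms ",
    [write_prompt, select_model, configure_arguments],
)
```

Because each stage fills in exactly one field, each output can be checked separately, which mirrors the step-wise evaluation the benchmark describes.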

Visualizations

BibTeX

@article{jia2024chatgen, 
  title={ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting}, 
  author={Jia, Chengyou and Xia, Changliang and Dang, Zhuohang and Wu, Weijia and Qian, Hangwei and Luo, Minnan}, 
  journal={arXiv preprint arXiv:2411.17176}, 
  year={2024}
}