Interactive Demo
Watch OneHOI generate and edit scenes step by step.
Abstract
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as ⟨person, action, object⟩ triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layouts but fails to integrate mixed conditions such as HOI and object-only entities, while HOI editing modifies interactions via text yet struggles to decouple pose from physical contact and to scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention that enforces interaction topology, and HOI RoPE that disentangles multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K dataset alongside HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing.
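To make the conditioning concrete, below is a minimal, illustrative sketch (not the released implementation) of how ⟨person, action, object⟩ triplets could be encoded into role- and instance-aware conditioning tokens with layout grounding and modality dropout; the vocabularies, dimensions, and module names are assumptions, not details taken from the paper.

# Minimal sketch (not the authors' code): role- and instance-aware HOI tokens
# with optional layout grounding and modality dropout. All sizes and names
# below are illustrative assumptions.
import torch
import torch.nn as nn

class HOITokenEncoder(nn.Module):
    """Encode <person, action, object> triplets into conditioning tokens."""

    def __init__(self, num_categories=600, num_actions=120,
                 max_instances=8, dim=256, p_drop=0.1):
        super().__init__()
        self.category_emb = nn.Embedding(num_categories, dim)  # person/object classes
        self.action_emb = nn.Embedding(num_actions, dim)        # verbs
        self.role_emb = nn.Embedding(3, dim)                     # 0=person, 1=action, 2=object
        self.instance_emb = nn.Embedding(max_instances, dim)     # ties the tokens of one HOI together
        self.box_proj = nn.Linear(4, dim)                         # optional layout (x, y, w, h)
        self.p_drop = p_drop                                       # modality dropout rate

    def forward(self, person_ids, action_ids, object_ids, boxes=None):
        # person_ids / action_ids / object_ids: (B, N) index tensors, one column per HOI
        # boxes: optional (B, N, 3, 4) layout for the person, union, and object regions
        B, N = action_ids.shape
        roles = torch.arange(3, device=action_ids.device)
        instances = torch.arange(N, device=action_ids.device)

        # Content embeddings for each role of each triplet -> (B, N, 3, D)
        content = torch.stack([
            self.category_emb(person_ids),
            self.action_emb(action_ids),
            self.category_emb(object_ids),
        ], dim=2)

        # Role embeddings mark who is subject/verb/object; instance embeddings
        # keep the three tokens of one interaction grouped in multi-HOI scenes.
        tokens = (content
                  + self.role_emb(roles).view(1, 1, 3, -1)
                  + self.instance_emb(instances).view(1, N, 1, -1))

        if boxes is not None:
            box_feat = self.box_proj(boxes)  # (B, N, 3, D) spatial grounding
            if self.training:
                # Modality dropout: randomly hide the layout for some samples
                # so the model also learns layout-free generation.
                keep = (torch.rand(B, 1, 1, 1, device=tokens.device) > self.p_drop).float()
                box_feat = box_feat * keep
            tokens = tokens + box_feat

        return tokens.flatten(1, 2)  # (B, N*3, D) conditioning tokens for the DiT

if __name__ == "__main__":
    enc = HOITokenEncoder()
    person = torch.zeros(2, 4, dtype=torch.long)   # "person" class id
    action = torch.randint(0, 120, (2, 4))
    obj = torch.randint(0, 600, (2, 4))
    boxes = torch.rand(2, 4, 3, 4)
    print(enc(person, action, obj, boxes).shape)    # torch.Size([2, 12, 256])

In this sketch, the flattened token sequence would simply be concatenated to (or cross-attended by) the diffusion transformer's input; the structured attention and HOI RoPE described in the abstract would operate on top of these tokens.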
BibTeX
If you use our work in your research, please cite:
@inproceedings{hoe2026onehoi,
  title={OneHOI: Unifying Human-Object Interaction Generation and Editing},
  author={Hoe, Jiun Tian and Hu, Weipeng and Jiang, Xudong and Tan, Yap-Peng and Chan, Chee Seng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}