Pi-HOC: Pairwise 3D Human-Object Contact Estimation

1Carnegie Mellon University 2National Robotics Engineering Center (NREC)

Abstract

Understanding real-world human--object interactions in images is an inherently many-to-many problem, where disentangling fine-grained and concurrent physical contacts is particularly challenging. Existing semantic contact estimation methods are either limited to single-human settings or require object geometry (e.g., meshes) in addition to the input image. The current state-of-the-art method leverages a powerful VLM for category-level semantics, but it still struggles in multi-human scenes and scales poorly at inference time. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction across all human--object pairs. Given an input image, Pi-HOC detects human and object instances, enumerates all human--object pairs, and represents each pair with a dedicated human--object (HO) token. An InteractionFormer jointly refines HO tokens and image patch features to produce interaction-aware pair representations. A SAM-based contact decoder then predicts dense contact on SMPL human meshes for each pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. These results establish Pi-HOC as an efficient and scalable solution for dense semantic contact reasoning in complex scenes. We further show that the predicted contacts improve SAM-3D image-to-mesh reconstruction through a test-time optimization procedure and enable referential contact prediction from language queries without additional training.

Pi-HOC Architecture

[Figure: Pi-HOC architecture overview]
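The pairwise design above can be sketched in a few lines. This is a minimal, illustrative sketch (the function and variable names are ours, not the authors' API): given per-instance embeddings for the detected humans and objects, it enumerates every human--object pair and builds one HO token per pair by concatenating the two instance embeddings, so the whole scene is processed in a single batch.

```python
# Illustrative sketch of Pi-HOC's pair-enumeration step.
# Names (make_ho_tokens, human_feats, object_feats) are assumptions for
# illustration, not the authors' actual implementation.
import numpy as np

def make_ho_tokens(human_feats, object_feats):
    """Build one token per human-object pair.

    human_feats:  (H, D) array of human instance embeddings
    object_feats: (O, D) array of object instance embeddings
    returns:      (H * O, 2 * D) array of HO tokens
    """
    H, _ = human_feats.shape
    O, _ = object_feats.shape
    h = np.repeat(human_feats, O, axis=0)   # h0,h0,h0, h1,h1,h1, ...
    o = np.tile(object_feats, (H, 1))       # o0,o1,o2, o0,o1,o2, ...
    return np.concatenate([h, o], axis=1)   # one row per (human, object) pair

humans = np.random.randn(2, 8)   # 2 detected humans, 8-dim embeddings
objects = np.random.randn(3, 8)  # 3 detected objects
tokens = make_ho_tokens(humans, objects)
print(tokens.shape)  # (6, 16): all 2x3 pairs in one forward pass
```

In the actual model these HO tokens would then be refined jointly with image patch features by the InteractionFormer; the sketch only shows why inference cost grows with a single batched pass rather than one VLM call per pair.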

Quantitative Results

Pi-HOC achieves state-of-the-art performance on semantic contact estimation. We evaluate the method on the MMHOI and DAMON datasets.

[Figure: quantitative results on MMHOI and DAMON]

Pi-HOC consistently outperforms InteractVLM across benchmarks, achieving ~17% higher precision on MMHOI and notable gains on DAMON, while maintaining significantly higher inference speed. Unlike InteractVLM, Pi-HOC scales efficiently with increasing human–object pairs, demonstrating a superior balance between accuracy and computational efficiency.

Qualitative Results

a) Multi-Human, Multi-Object Scenarios

[Figure: multi-human, multi-object results]

b) Single-Human, Multi-Object Scenarios

[Figure: single-human, multi-object results]

SAM-3D Refinement Using Pi-HOC Contacts

[Figure: comparison before and after refinement]
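To make the refinement idea concrete, here is a deliberately simplified sketch of contact-guided test-time optimization. The paper's actual objective, parameterization, and optimizer are not specified here; this toy version (our own construction, with illustrative names) only optimizes a 3D object translation so that the reconstructed object surface moves toward the human-mesh vertices predicted to be in contact.

```python
# Toy sketch of contact-guided test-time refinement (illustrative only;
# not the paper's actual objective or optimizer).
import numpy as np

def refine_object_translation(contact_verts, object_pts, steps=200, lr=0.1):
    """Gradient descent on a translation t minimizing the mean squared
    distance from each predicted contact vertex to its nearest object point.

    contact_verts: (C, 3) human-mesh vertices predicted to be in contact
    object_pts:    (P, 3) points on the reconstructed object surface
    returns:       (3,) translation applied to the object
    """
    t = np.zeros(3)
    for _ in range(steps):
        moved = object_pts + t
        # nearest object point for each contact vertex
        diff = contact_verts[:, None, :] - moved[None, :, :]      # (C, P, 3)
        idx = np.argmin((diff ** 2).sum(-1), axis=1)
        resid = contact_verts - moved[idx]                        # (C, 3)
        # gradient of mean ||resid||^2 w.r.t. t is -2 * mean(resid)
        t += lr * 2.0 * resid.mean(axis=0)
    return t

# Usage: an object whose true contact points sit at a small known offset.
obj = np.array([[0., 0., 0.], [5., 0., 0.], [0., 5., 0.]])
offset = np.array([0.1, -0.05, 0.2])
contacts = obj + offset
t = refine_object_translation(contacts, obj)
print(np.round(t, 3))  # recovers a translation close to the offset
```

A full refinement would optimize pose and shape jointly with collision and reprojection terms; the sketch only conveys how dense pairwise contacts supply a direct 3D alignment signal.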


Referential Contact - No Retraining or Extra Training Data Required 🔥

[Figure: referential contact results]