Accepted by ACM Transactions on Graphics
Figure: Given an initial scene, our Haisor agent leverages a scene graph representation, Dueling Double DQN, Monte Carlo Tree Search, and motion planning to output a sequence of actions. Executing these actions optimizes the scene toward high rationality and human affordance. For rationality, we expect the overall furniture layout to be realistic and free of collisions between furniture. For human affordance, we expect the free space for human activity to be as large as possible, and the movable parts of furniture to be operable by people without being blocked.
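As a high-level illustration, the loop below sketches how such an action sequence could be applied: the Q-network guides an MCTS search that proposes one action at a time until no further improving action is found. The `scene_graph`, `mcts`, and `q_network` interfaces are hypothetical stand-ins for the components named above, not the actual implementation.

```python
# Hypothetical sketch of the optimization loop: MCTS, guided by the Q-network,
# proposes one furniture action at a time until no improving action remains.
def optimize_scene(scene_graph, q_network, mcts, max_steps=50):
    history = []
    for _ in range(max_steps):
        action = mcts.search(scene_graph, q_network)   # best action from the search
        if action is None:                             # no improving action found
            break
        scene_graph = scene_graph.apply(action)        # move/rotate a furniture object
        history.append(action)
    return scene_graph, history
```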
Abstract
3D scene synthesis facilitates and benefits many real-world applications. Most scene generators focus on making indoor scenes plausible by learning from training data and leveraging extra constraints such as adjacency and symmetry. Although the generated 3D scenes are mostly plausible with visually realistic layouts, they can be functionally unsuitable for human users to navigate and interact with furniture. Our key observation is that human activity plays a critical role and sufficient free space is essential for human-scene interactions. This is exactly where many existing synthesized scenes fail: the seemingly correct layouts are often not fit for living. To tackle this, we present Haisor, a human-aware optimization framework for 3D indoor scene arrangement via reinforcement learning, which automatically finds an action sequence that optimizes the indoor scene layout. Based on a hierarchical scene graph representation, an optimal action sequence is predicted and performed via Deep Q-Learning with Monte Carlo Tree Search (MCTS), where MCTS is the key component for finding the optimal solution over long action sequences and a large action space. Multiple human-aware rewards serve as our core criteria of human-scene interaction, guiding the reinforcement learning agent to identify the next smart action. Our framework is optimized end-to-end given indoor scenes with part-level furniture layouts, including part mobility information. Furthermore, our methodology is extensible and allows different reward designs to achieve personalized indoor scene synthesis. Extensive experiments demonstrate that our approach optimizes the layout of 3D indoor scenes in a human-aware manner, producing results that are more realistic and plausible than those of state-of-the-art generators, and that our agent produces superior action sequences, outperforming alternative baselines.
Paper
Haisor: Human-Aware Indoor Scene Optimization via Deep Reinforcement Learning
Methodology
Figure: Overview of the Haisor agent. We use a two-layer scene graph as our representation. The Q-network is a message-passing graph convolutional network (GCN) on the two-layer graph. In the network, part-level node features are fed into a GCN on the part-level graph, aggregated in the object-level graph, and concatenated with the object-level features. The concatenated features are fed into a GCN on the object-level graph, which produces the final per-object features. These features are the input to two MLP (multi-layer perceptron) branches, following the structure of the Dueling Q-network [Wang et al. 2016]. The resulting Q-values are used by the Monte Carlo Tree Search to output the final action.
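The dueling combination of the two MLP branches follows Wang et al. [2016]. The snippet below is a minimal PyTorch sketch of how such a head could be applied to the per-object GCN features; the dimensions `feat_dim` and `num_actions` are assumptions, not the exact architecture used in the paper.

```python
# Minimal sketch (PyTorch, hypothetical dimensions) of a dueling Q-head
# applied to per-object features produced by the object-level GCN.
import torch
import torch.nn as nn

class DuelingQHead(nn.Module):
    def __init__(self, feat_dim=256, num_actions=8):
        super().__init__()
        # Value branch: one scalar per object (state value).
        self.value = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Advantage branch: one score per candidate action of that object.
        self.advantage = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_actions))

    def forward(self, obj_feats):                # (num_objects, feat_dim)
        v = self.value(obj_feats)                # (num_objects, 1)
        a = self.advantage(obj_feats)            # (num_objects, num_actions)
        # Dueling combination [Wang et al. 2016]: Q = V + (A - mean(A)).
        return v + a - a.mean(dim=-1, keepdim=True)
```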
Figure: Types of movable parts. We define two types of movable parts: hinge and slider. Hinge parts are defined by four parameters r_p, r_d, r_u, r_b, and slider parts are defined by three parameters s_d, s_u, s_b. These parameters are combined with the bounding box parameters to form the node data of the part-level scene graph.
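For illustration, the snippet below sketches how these parameters might be packed into part-level node features. The interpretation of r_p, r_d, r_u, r_b, s_d, s_u, s_b as pivot point, axis/direction, and motion limits is an assumption made for this sketch, as is the exact feature layout.

```python
# Illustrative sketch of the part-level node data described above.
# Assumed meanings: r_p = hinge pivot point, r_d = rotation axis,
# r_u / r_b (and s_u / s_b for sliders) = upper / lower motion limits,
# s_d = sliding direction. These interpretations are not confirmed by the text.
from dataclasses import dataclass
import numpy as np

@dataclass
class HingePart:
    bbox: np.ndarray      # part bounding-box parameters
    r_p: np.ndarray       # pivot point of the hinge, shape (3,)
    r_d: np.ndarray       # unit rotation axis, shape (3,)
    r_u: float            # assumed upper rotation limit
    r_b: float            # assumed lower rotation limit

@dataclass
class SliderPart:
    bbox: np.ndarray      # part bounding-box parameters
    s_d: np.ndarray       # unit sliding direction, shape (3,)
    s_u: float            # assumed upper translation limit
    s_b: float            # assumed lower translation limit

def part_node_feature(part) -> np.ndarray:
    """Concatenate bounding-box and mobility parameters into one node vector."""
    if isinstance(part, HingePart):
        mobility = np.concatenate([part.r_p, part.r_d, [part.r_u, part.r_b]])
    else:
        mobility = np.concatenate([part.s_d, [part.s_u, part.s_b]])
    return np.concatenate([part.bbox.ravel(), mobility])
```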
Figure: Human-object interaction. We use a motion planning pipeline to simulate a human manipulating a movable part of a piece of furniture. The process has three steps: we estimate the initial standing point of the human from the sizes of the furniture and the human, and attempt to place the human model at that point. If the placement succeeds, we attempt to move the human's end-effector to a grasping point on the movable part. If that movement succeeds, we attempt to move the end-effector along the Cartesian path of the movable part (an arc for hinges, a line segment for sliders).
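The sketch below outlines this three-step check. All interfaces here (`planner.place_human`, `part.grasp_point`, `furniture.front_center`, and so on) are hypothetical placeholders for the motion-planning backend and scene data, not an actual API.

```python
# Hedged sketch of the three-step manipulation check described above.
# `planner`, `part`, `furniture`, and `human` expose hypothetical interfaces.
def can_manipulate(part, furniture, human, planner) -> bool:
    """Return True if the human can plausibly operate the movable part."""
    # Step 1: estimate an initial standing point from the furniture and human
    # sizes, then try to place the human model there without collision.
    stand_point = furniture.front_center() + human.radius() * furniture.front_dir()
    if not planner.place_human(human, stand_point):
        return False

    # Step 2: try to move the human's end-effector to a grasping point
    # on the movable part (e.g., a handle position).
    if not planner.move_end_effector(human, part.grasp_point()):
        return False

    # Step 3: follow the part's Cartesian path with the end-effector
    # (an arc for hinge parts, a line segment for slider parts).
    return planner.follow_cartesian_path(human, part.cartesian_path())
```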
Figure: Free space for human activity. We use uniform grid sampling to compute the free space for human activity. First, we divide the 2D bounding box of the floor into 900 grid cells and determine which cells lie inside the floor. For each cell, we attempt to place a human mesh at its center; if the placement is collision-free, the cell is labeled valid. Finally, we compute the largest connected component of the valid cells and use its size to measure the free space for human activity.
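A minimal sketch of this computation is given below, assuming a 30 x 30 grid (900 cells), hypothetical `inside_floor` and `human_fits` checks standing in for the floor test and the human-mesh collision test, and SciPy's connected-component labeling; normalizing the largest component by the total cell count is an assumption of this sketch.

```python
# Minimal sketch of the free-space estimate described above. `inside_floor`
# and `human_fits` are hypothetical callables: floor-containment and
# collision tests for a human mesh placed at a given cell center.
import numpy as np
from scipy import ndimage

def free_space_ratio(floor_bbox, inside_floor, human_fits, res=30):
    """floor_bbox: (xmin, ymin, xmax, ymax); grid is res x res = 900 cells."""
    xmin, ymin, xmax, ymax = floor_bbox
    # Cell centers of a uniform res x res grid over the floor bounding box.
    xs = np.linspace(xmin, xmax, res, endpoint=False) + (xmax - xmin) / (2 * res)
    ys = np.linspace(ymin, ymax, res, endpoint=False) + (ymax - ymin) / (2 * res)

    valid = np.zeros((res, res), dtype=bool)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # A cell is valid if it lies on the floor and the human mesh
            # placed at its center is collision-free.
            valid[i, j] = inside_floor(x, y) and human_fits(x, y)

    # Largest 4-connected component of valid cells measures usable free space.
    labels, num = ndimage.label(valid)
    if num == 0:
        return 0.0
    largest = max(np.sum(labels == k) for k in range(1, num + 1))
    return largest / float(res * res)
```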
Results
Figure: Comparison of scene optimization. We compare the optimization results of our method, a random agent, a heuristic agent, and Sync2Gen-Opt [Yang et al. 2021]. The results show that our method simultaneously better accounts for room rationality, collisions between furniture, and the human affordance of the room. For each scene, the first row shows zoomed-in views of two regions, each marked by a colored rectangle in the second row; the second row shows a top-down view of the whole scene. In each row, left to right: input scene, random optimization, heuristic optimization, Sync2Gen-Opt, and our method.
Figure: An example of a learned smart strategy on a more complex scene. a) The initial scene contains multiple collisions. b) The agent moves the coffee table to the left and moves two stools around it, forming the first table-and-chairs region. c) The agent moves four chairs toward the dining table, forming the second region. d) The agent moves the cabinets toward the wall and the stand toward the sofa to increase the human affordance scores.
Figure: An example of optimizing human affordance while preserving scene rationality. The layout of the initial scene is generally rational, but two objects with movable parts, a drawer and a hinged part, cannot be properly manipulated by a human. Here we show that our agent can optimize this type of scene without losing its original layout.
BibTeX