Scene Graph to Image Generation with Contextualized Object Layout Refinement
Authors
Abstract
Generating high-quality images from scene graphs, that is, graphs that describe multiple entities in complex relations, is a challenging task that has attracted substantial interest recently. Prior work trained such models using supervised learning, where the goal is to reproduce the exact target image layout for each scene graph, relying on predicting object locations and shapes independently and in parallel. However, scene graphs are underspecified, and thus the same scene graph often occurs with many target images in the training data. This leads to generated images with high inter-object overlap, empty areas, blurry objects, and overall compromised quality. In this work, we propose a method that alleviates these issues by generating all object layouts together and reducing the reliance on such supervision. Our model predicts layouts directly from embeddings (without predicting intermediate boxes) by gradually upsampling, refining, and contextualizing object layouts. It is trained with a novel adversarial loss that optimizes the interaction between object pairs. This improves coverage and removes overlaps, while maintaining sensible contours and respecting object relations. We empirically show on the COCO-STUFF dataset that our proposed approach substantially improves the quality of generated layouts as well as the overall image quality. Our evaluation shows that we improve layout coverage by almost 20 points and drop object overlap to negligible amounts. This leads to better image generation, relation fulfillment, and object quality.
1. Introduction
Synthesizing images from natural language descriptions has received substantial attention recently [1, 2, 3, 4], as it has wide applicability for content generation. However, it has been shown that models that accept textual descriptions as their input fail to produce images containing multiple detailed objects in complex relations [1, 2, 4]. Thus, scene graphs [5], i.e. graphs where nodes correspond to entities and edges describe relations between them, were proposed [6] as an intermediate representation of the desired image. This approach has been widely adopted [7, 8, 9] for this task.
When generating images from scene graphs (SG), there are three main desiderata: (i) Photo-realism: the image should look natural with salient objects, (ii) Correctness: the image should contain the objects and relations specified in the SG, and (iii) Diversity: because an SG is an underspecified representation compatible with many output images, a model should reflect that in its output distribution. Current models for SG-to-image generation invariably include a supervised learning objective at training time. Specifically, given an SG and an image, they predict for each object separately the exact location and shape from the gold semantic layout in order to reproduce the ground-truth image. Although this can achieve correctness for simple geometric relations, it inevitably results in poor-quality image layouts due to the underspecification of the SG. In particular, many distinct images can be represented by the same SG, and thus maximum-likelihood-based techniques result in a blurry average of object shapes and positions across possible images. Such generations are likely to exhibit low resolution, low coverage, and high inter-object overlap. Moreover, due to the strict specification of the prediction task, true diversity is inherently impossible.
In this work we propose two main technical contributions to solve these issues: (i) To address the diversity issue, we reduce the dependence on supervised losses and shift towards adversarial ones. In particular, rather than predicting the box and mask of each object according to the target image, we use an adversarial network as a discriminator. It ensures that the generated object layout is truthful to the required object class in both position and shape, and that the relation between every pair of objects is sensible and obeys the constraints dictated by the SG. (ii) To address the quality issue, we introduce a novel method to perform high-resolution layout generation. It incorporates the ability of Graph Convolution Networks (GCNs) [10] to work on variable-shaped structured graphs and contextualizes the state of all objects with CNN-based generators. Using this layout refinement network, we fuse predicted object layouts such that each remains true to its class and respects its dictated relations, while maintaining high coverage and few overlaps. We stack multiple copies of this block and present the Contextualized Objects Layout Refiner (COLoR): the first model to generate layouts directly from SGs without any intermediate steps such as boxes and masks.
2. Background
We now describe the architecture of prior work, SG2Im [6], which we build upon, and formally define the task. During training, the available information for every image $I$ is a segmentation mask, assigning each pixel in the image to a unique object and its class. This can be used to generate multiple SGs for each image by computing geometric relations between objects and randomly sampling from all possible edges in the complete graph. In addition, the segmentation mask is used to infer the layouts, masks, and bounding boxes for all objects in the image, as defined below.

Layout generation. An image layout is a mapping of each pixel in the image to a specific object. Given an object in the layout, we define an object layout $l \in [0,1]^{H \times W}$ as a map over the image indicating the pixels that belong to the object; higher values signify a stronger presence of the object. Ideally (as is the case in the annotated layouts), $l$ is binary. We then define the object box $b \in [0,1]^4$ as the minimal axis-aligned bounding box, in relative coordinates, of all active pixels in $l$. Finally, cropping $l$ using $b$ and projecting it into $[0,1]^{W_m \times W_m}$, where $W_m < \min(H, W)$, we get the object mask $m \in [0,1]^{W_m \times W_m}$, which describes the object's shape, with higher-valued pixels corresponding to the presence of the object at that pixel. We note that although $b$ and $m$ are derived uniquely from $l$, it is possible to approximate $l$ by performing the inverse projection of $m$ into $[0,1]^{H \times W}$ according to $b$.

Scene Graph to Image. We build on SG2Im [6], which includes the following steps: (a) The SG is augmented with an additional dummy node that is connected to all other nodes through an outgoing dummy relation, ensuring graph connectivity. (b) Every node and edge in the SG is replaced by a learned embedding $v \in \mathbb{R}^d$ based on its class. (c) The graph is fed to a GCN, which produces a new embedding $\tilde{v}$ for each object (node) in the SG. (d) The embeddings $\tilde{v}_1, \ldots, \tilde{v}_n$ are fed into the layout predictor, consisting of two separate decoders: one predicts a bounding-box location $\hat{b}_i$ and the other predicts a mask $\hat{m}_i$. These are used to compute $\hat{l}_i$ as explained above. The embedding $\tilde{v}_i$ is multiplied element-wise with $\hat{l}_i$, producing an extended layout $\hat{\ell}_i \in [0,1]^{H \times W \times d}$. (e) The extended layouts $\hat{\ell}_1, \ldots, \hat{\ell}_n$ are summed element-wise to produce a coarse image layout in $[0,1]^{H \times W \times d}$. (f) This coarse layout is fed, along with $z$ random noise channels, into a Cascaded Refinement Network (CRN) [11], predicting the final image $\hat{I}$.
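As an illustration of the layout, box, and mask definitions above, the following is a minimal PyTorch sketch (not the authors' code; the helper names and the nearest-neighbor interpolation are our own choices) of deriving $(b, m)$ from a binary object layout $l$, and of approximating $l$ back via the inverse projection:

```python
import torch
import torch.nn.functional as F

def layout_to_box_and_mask(l, mask_size=16):
    """Derive the relative bounding box b and the object mask m
    from a (roughly binary) object layout l of shape (H, W)."""
    H, W = l.shape
    ys, xs = torch.nonzero(l > 0.5, as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    # Box in relative coordinates (x0, y0, x1, y1) in [0, 1]^4.
    b = torch.tensor([x0 / W, y0 / H, x1 / W, y1 / H])
    # Crop the layout with the box and project the crop to mask_size x mask_size.
    crop = l[y0:y1, x0:x1].unsqueeze(0).unsqueeze(0)
    m = F.interpolate(crop, size=(mask_size, mask_size), mode="nearest")[0, 0]
    return b, m

def box_and_mask_to_layout(b, m, H, W):
    """Approximate the object layout l by warping the mask m back into the
    image plane according to the box b (the inverse projection)."""
    x0, y0, x1, y1 = (b * torch.tensor([W, H, W, H], dtype=b.dtype)).round().long().tolist()
    l = torch.zeros(H, W)
    l[y0:y1, x0:x1] = F.interpolate(
        m.unsqueeze(0).unsqueeze(0), size=(y1 - y0, x1 - x0), mode="nearest"
    )[0, 0]
    return l
```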
In [6], the model is trained with six loss functions: three adversarial loss functions that evaluate object realism, the ability to correctly classify objects, and image similarity to real images. The other three use strong supervision and force the model to predict boxes and masks similar to those derived from the ground-truth image $I$. These are:
$$\mathcal{L}_{box} = \sum_{i=1}^{n} \lVert b_i - \hat{b}_i \rVert_2, \qquad \mathcal{L}_{pix} = \lVert I - \hat{I} \rVert_1, \qquad \mathcal{L}_{mask} = \sum_{i=1}^{n} \sum_{h,w}^{W_m, W_m} \mathrm{BCE}(m_{i,h,w}, \hat{m}_{i,h,w}) \tag{1}$$

where $\hat{b}_i$, $\hat{m}_i$, and $\hat{I}$ are the predicted box, mask, and image, respectively, and $\mathrm{BCE}(p, \hat{p}) = -p \log(\hat{p}) - (1 - p)\log(1 - \hat{p})$.
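For concreteness, a minimal PyTorch sketch of these supervised terms (our own rendering of Eq. (1), not SG2Im's implementation; the tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def supervised_losses(b, b_hat, m, m_hat, img, img_hat):
    """b, b_hat: (n, 4) ground-truth and predicted boxes;
    m, m_hat: (n, Wm, Wm) masks in [0, 1];
    img, img_hat: (3, H, W) ground-truth and predicted images."""
    l_box = (b - b_hat).norm(dim=-1).sum()                       # sum_i ||b_i - b_hat_i||_2
    l_mask = F.binary_cross_entropy(m_hat, m, reduction="sum")   # per-pixel BCE, summed
    l_pix = (img - img_hat).abs().sum()                          # ||I - I_hat||_1
    return l_box, l_mask, l_pix
```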
3. Method
One major drawback of using the aforementioned supervised loss functions is the underlying assumption that for every SG there exists (in the dataset) at most one corresponding image layout. However, in COCO-STUFF [12], which is commonly used for this task, this is far from true: 73% of the images contain a (multi)set of objects that is shared with many other images in the data and may result in identical SGs. Further, over 25% of the SGs match multiple different layouts, and almost 10% of the SGs describe over 10 different layouts. Thus, a model that maximizes the likelihood of a layout given an SG will be pushed towards predicting the mean of the bounding boxes in the layouts that occur in the training data and, similarly, the average mask. Because object locations and shapes vary substantially across images, the model eventually ignores the context and predicts a general location for each object with no distinct shape, as can be seen in Figure 2.
To overcome this difficulty, we remove most of the loss terms that are applied with respect to the exact ground-truth layout (§2) and reduce the weight of the rest. Instead, we add adversarial loss functions that encourage the model to generate photo-realistic images that respect the original SG, without forcing it to learn a single SG-to-image mapping. The proposed discriminators are applied to the predicted object layouts $\hat{l}_i$. We find that some strong supervision remains beneficial to cope with cold-start issues.

Pairwise Layout Discriminator. The main source of training signal in our method is an adversarial network that teaches the generator to be spatially aware, creating objects without overlap that respect the relations set by the SG. It follows the AC-GAN [13] adversarial loss pattern. Given a pair of neighboring objects in the SG, the discriminator accepts their object layouts $l_i, l_j \in [0,1]^{H \times W}$ and their class labels $c_i, c_j \in C$. It predicts whether this pair comes from a real or a generated layout and classifies the relation between the two. Since low-quality layouts are easily recognized as fake, it also improves the quality of each object layout individually. In particular, the discriminator $D_l$ performs a mapping
$D_l : ([0,1]^{H \times W} \times \{0,1\}^{|C|})^2 \to [0,1] \times [0,1]^{|R|}$. Let $(\hat{y}, \hat{r}) = D_l((l_i, c_i), (l_j, c_j))$ be the prediction of the discriminator (real vs. fake and relation prediction) on its input, let $r \in \{0,1\}^{|R|}$ be the true relation, and let $y = \mathbb{1}_{\mathrm{real}}$.
Equation (2), defining the pairwise adversarial loss $\mathcal{L}_{D_l}$ in terms of $(\hat{y}, \hat{r})$ and $(y, r)$, was not extracted; please refer to the original document.
where the discriminator trains on real and fake pairs, and the generator minimizes the loss over generated pairs only.
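As an illustration only, a minimal PyTorch sketch of a pairwise AC-GAN-style discriminator with this input/output signature; the architecture (channel widths, pooling, and how class labels are injected) is our assumption and is not taken from the paper:

```python
import torch
import torch.nn as nn

class PairwiseLayoutDiscriminator(nn.Module):
    """Takes two object layouts with their one-hot class labels and predicts
    (i) a real/fake score for the pair and (ii) relation logits over |R|."""

    def __init__(self, num_classes, num_relations, hidden=64):
        super().__init__()
        # The input stacks the two layouts plus their class labels broadcast
        # over the spatial dimensions: 2 + 2 * num_classes channels.
        in_ch = 2 + 2 * num_classes
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.real_fake = nn.Linear(hidden * 2, 1)
        self.relation = nn.Linear(hidden * 2, num_relations)

    def forward(self, l_i, l_j, c_i, c_j):
        # l_i, l_j: (B, H, W) layouts; c_i, c_j: (B, num_classes) one-hot labels.
        B, H, W = l_i.shape
        cls = torch.cat([c_i, c_j], dim=1)[:, :, None, None].expand(-1, -1, H, W)
        x = torch.cat([l_i[:, None], l_j[:, None], cls], dim=1)
        h = self.backbone(x)
        y_hat = torch.sigmoid(self.real_fake(h))   # real vs. fake probability
        r_hat = self.relation(h)                   # relation logits
        return y_hat, r_hat
```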
Losses. To complement our discriminator and encourage the model to assign objects to every pixel of the image while refraining from overlap, we introduce the Layout Coverage Regularization $\mathcal{L}_{reg} = \mathcal{L}_{coverage} + \lambda \cdot \mathcal{L}_{overlap}$. Given $\hat{l}_1, \ldots, \hat{l}_n \in [0,1]^{H \times W}$, we define the summed image layout $\bar{L} = \sum_{i}^{n} \hat{l}_i \in [0,n]^{H \times W}$, which gives the following definitions:

$$\mathcal{L}_{coverage} = \sum_{h,w}^{H,W} \mathbb{1}[\bar{L}_{h,w} \le 1] \cdot \left(1 - \bar{L}_{h,w}\right) \tag{3}$$

$$\mathcal{L}_{overlap} = \sum_{h,w}^{H,W} \mathbb{1}[\bar{L}_{h,w} > 1] \cdot \left(\bar{L}_{h,w} - 1\right) \tag{4}$$
The loss reaches 0 if the layout weights at every pixel sum to exactly 1, and grows as coverage drops or overlap increases. Since $\mathcal{L}_{overlap}$ is unbounded and often grows larger than $\mathcal{L}_{coverage}$, we suppress its contribution by setting $\lambda = 0.4$, which we found achieves the best tradeoff between the two terms.
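For concreteness, a direct PyTorch rendering of Eqs. (3)-(4) (a sketch, not the authors' code):

```python
import torch

def layout_coverage_regularization(layouts, lam=0.4):
    """layouts: (n, H, W) predicted object layouts in [0, 1].
    Returns L_reg = L_coverage + lam * L_overlap."""
    summed = layouts.sum(dim=0)                              # L_bar in [0, n]^{H x W}
    under = (summed <= 1.0).float()
    l_coverage = ((1.0 - summed) * under).sum()              # penalize uncovered mass
    l_overlap = ((summed - 1.0) * (1.0 - under)).sum()       # penalize overlapping mass
    return l_coverage + lam * l_overlap
```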
In addition, we found that when tasked with generating layouts using only the losses in equations (2) and (4), the generator fails to learn how to create coherent layouts, and the discriminator falls back to classifying fake layouts based on spurious artifacts. We attribute this to a cold-start problem and mitigate it by adding, with a small weight, an Object Layout loss defined on each predicted object layout $\hat{l}_i$ and the corresponding ground-truth layout $l_i$:
$$\mathcal{L}_{layout} = \sum_{i}^{n} \lVert l_i - \hat{l}_i \rVert_1.$$
Mapping embeddings to layouts directly. In prior work, boxes $b_i$ and masks $m_i$ were decoded from the object embeddings $\tilde{v}_i$ in parallel. Hence, the embeddings $\tilde{v}_i$ computed by the GCN must encode all the information about the location and shape of each object, including avoiding inter-object overlap and maintaining high coverage of the layout. Furthermore, since the dimension of the mask is $W_m \times W_m$, which is then warped into $H \times W$ with $W_m \ll \min(H, W)$, we can expect a drop in resolution (i.e., very coarse shapes). To mitigate these issues while remaining agnostic to the SG size and structure, we propose an adaptation of the GCN technique that improves the object-layout generation process by contextualizing object layouts on each other. We name this module the Layout Refinement Network (LRN). Formally, given intermediate object representations $\hat{v}^t_1, \ldots, \hat{v}^t_n \in \mathbb{R}^{C_t \times H_t \times W_t}$, we describe a model that predicts the next representations down the line:
$$\hat{v}^{t+1}_1, \ldots, \hat{v}^{t+1}_n = \mathrm{LRN}^t\left(\hat{v}^t_1, \ldots, \hat{v}^t_n\right), \qquad \hat{v}^{t+1}_i \in \mathbb{R}^{C_{t+1} \times H_{t+1} \times W_{t+1}}. \tag{5}$$
First, each representation $\hat{v}^t_i$, $i \in [1, n]$, is passed through a decoder $U$ that applies a transposed-convolution layer to upsample the representation by a factor of two, followed by a batch-normalization layer and a ReLU activation: $q^t_i = U(\hat{v}^t_i)$. Each pair of upsampled representations $(q^t_i, q^t_j)$, $i \neq j \in [1, n]$, is then passed through a graph convolution layer. Due to the dimensionality of the representations ($C_t \times H_t \times W_t$), the traditional dense layers of the graph convolution are replaced with 2D-convolutional layers: $q^t_{i,j}, q^t_{j,i} = \mathrm{GCL}^t(q^t_i, q^t_j)$. Summing the initial and pairwise representations produces a new representation that is contextualized on all objects in the scene:

$$\hat{v}^{t+1}_i = q^t_i + \frac{1}{n-1}\left(\sum_{j \neq i} q^t_{i,j} + \sum_{j \neq i} q^t_{j,i}\right).$$
In intermediate stages, the residual sum is followed by a ReLU activation; in the final stage, a sigmoid is applied to create the object layout. Stacking $T$ LRN blocks (where $T = \log_2 H - 1$ due to the behavior of transposed convolutions in the first stage), we skip box and mask predictions entirely. Instead, our model (depicted in Figure 1) generates the layout directly from the object embeddings; samples are shown in Figure 3. Given embeddings $\tilde{v}^1_i$ of size $1 \times 1$ and depth $K = 2^T$, we stack $T$ layers of upsampling, each of which reduces the depth by a factor of two, followed by an LRN. The output of the model is a set of object layouts $\hat{l}_1, \ldots, \hat{l}_n \in [0,1]^{H \times W}$. In each stage, the LRN passes information between the layouts, which results in a coherent layout exhibiting extremely high coverage and negligible overlap. We name our model the Contextualized Objects Layout Refiner (COLoR). To reduce computational constraints and allow for diversity in the generation, we sample a random subset of all possible layout pairs in each layer.

Table 1. Layout quality evaluation with Coverage, Overlap, Decisiveness, and Geometric-Relation-Score. Image quality evaluation with FID and Perceptual Diversity.
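To make the stage computation concrete, below is a minimal PyTorch sketch of one LRN block as described above (transposed-convolution upsampling, a convolutional pairwise graph-convolution layer, and the residual aggregation). Layer widths, the exact GCL parameterization, and the pair-sampling strategy are our assumptions, and the final block would use a single output channel to produce an object layout:

```python
import torch
import torch.nn as nn

class LayoutRefinementBlock(nn.Module):
    """One LRN stage: upsample every object representation by 2x, exchange
    information between object pairs with a convolutional graph layer, then
    aggregate the pairwise messages back into each object (residual sum)."""

    def __init__(self, in_ch, out_ch, final=False):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Convolutional stand-in for the dense layers of a graph convolution:
        # a concatenated pair (q_i, q_j) goes in, two directed messages come out.
        self.pair_conv = nn.Conv2d(2 * out_ch, 2 * out_ch, 3, padding=1)
        self.final = final

    def forward(self, v):                           # v: (n, C_t, H_t, W_t)
        q = self.upsample(v)                        # (n, C_{t+1}, 2*H_t, 2*W_t)
        n = q.shape[0]
        new_reps = []
        for i in range(n):
            msgs = []
            for j in range(n):
                if i == j:
                    continue
                pair = torch.cat([q[i], q[j]], dim=0).unsqueeze(0)
                q_ij, q_ji = self.pair_conv(pair)[0].chunk(2, dim=0)
                msgs.append(q_ij + q_ji)            # both directed messages for object i
            if msgs:
                out_i = q[i] + torch.stack(msgs).mean(dim=0)   # 1/(n-1) * sum of messages
            else:
                out_i = q[i]
            new_reps.append(out_i)
        out = torch.stack(new_reps)
        # ReLU in intermediate stages, sigmoid in the final stage (layout output).
        return torch.sigmoid(out) if self.final else torch.relu(out)
```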
4. Experiments
We train our model to generate 128x128 layouts and use a pretrained SPADE [14] model to predict images from them. We show that it creates high-quality layouts, respects the SG's constraints, and results in overall higher-quality images compared to prior work. We follow the setup in [6, 7], using the COCO-STUFF dataset [12]. This dataset contains a subset of the images in COCO [15] with 91 additional stuff categories. We compare our model against a pretrained model of [7] and a model of [6] trained to produce images of the same resolution. We evaluate layout quality, diversity, adherence to SGs, and the quality of the predicted images.

Layout generation. To evaluate the quality of the generated layouts, we measure their average coverage, overlap, and decisiveness. Coverage ranges between 0 and 1, where higher values are better, as they mean the layout does not contain empty spots. Overlap measures whether multiple objects occupy the same pixel, which we wish to avoid. Decisiveness evaluates how decisive the generator is in deciding each pixel's pertinence. Formally, given predicted layouts $\hat{l}_1, \ldots, \hat{l}_n \in [0,1]^{H \times W}$, we threshold the layouts, $\hat{l}^t_{i,h,w} = \mathbb{1}[\hat{l}_{i,h,w} \ge t]$ with $t = 0.5$, to get $\hat{l}^t_1, \ldots, \hat{l}^t_n \in \{0,1\}^{H \times W}$, and compute the measures over these thresholded layouts (the exact formulas were not extracted; please refer to the original document). Finally, to evaluate compliance with the relation constraints, we define the Geometric-Relation-Score (GRS). Given a pair of predicted object layouts $\hat{l}_i, \hat{l}_j$, we compute the minimal axis-aligned bounding rectangle that contains all pixels with values above 0.5 and project it to $W_m \times W_m$ to get a mask. We then use the same heuristic that was used in the construction of the dataset to infer the relations between objects in the predicted layouts, and define the geometric relation score as the accuracy of these predictions.

Image generation. To evaluate the quality of the generated images, we use the common FID score. We augment this evaluation with the diversity measure suggested by [7], which relies on the Perceptual Similarity measure [16]: multiple images are generated from the same SG, and we measure the average distance between every pair, where a larger distance is correlated with higher diversity.

Results. As depicted in Table 1, our method outperforms prior work on all layout-generation benchmarks by large margins. It predicts layouts that have both high coverage and low overlap. In addition, the layouts are decisive and fulfill the geometric relations specified by the SG. The image generation quality of our model is also preferable to the baselines according to both the FID measure and the diversity score; our model achieves an FID of 95.8 and a diversity score of 0.13. Our model additionally shows more diversity than the baselines when generating multiple images from the same SG. It should be noted that one of the ablations ($-\mathcal{L}_{layout}$) scored very high on the diversity benchmark due to its failure to generate reasonable layouts consistently, which means that diversity on its own is not sufficient and should always be evaluated in conjunction with an image-quality benchmark.

Ablations. We study the contributions of $\mathcal{L}_{D_l}$ and $\mathcal{L}_{layout}$ by training COLoR models without them. As can be seen in Table 1, removing the pairwise layout discriminator $D_l$ impairs the GRS, and removing $\mathcal{L}_{layout}$ hurts the layout coverage.
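Since the exact metric formulas are not reproduced above, the following is only a plausible formalization of coverage and overlap over the thresholded layouts (the function name and both definitions are our assumptions; decisiveness is omitted):

```python
import torch

def coverage_and_overlap(layouts, t=0.5):
    """layouts: (n, H, W) predicted object layouts in [0, 1].
    Coverage: fraction of pixels claimed by at least one thresholded layout.
    Overlap: fraction of pixels claimed by more than one thresholded layout.
    (Assumed formalizations; see the original paper for the exact definitions.)"""
    binarized = (layouts >= t).float()      # l^t in {0, 1}^{n x H x W}
    claims = binarized.sum(dim=0)           # number of objects claiming each pixel
    coverage = (claims >= 1).float().mean().item()
    overlap = (claims > 1).float().mean().item()
    return coverage, overlap
```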
To show that the improvements in final image quality are due to improved layouts, and not to SPADE's superiority over prior work's layout-to-image networks, we evaluate the image-generation quality of SG2Im [6] and Grid2Im [7] layouts by replacing their layout-to-image modules (CRN [11] and Pix2Pix [17], respectively) with the SPADE [14] model we use. We find that the FID score depends heavily on the type of generator and not necessarily on the quality of the layout, which was the main focus of this work: both SG2Im and Grid2Im scored differently with their own generators compared to SPADE, with SG2Im's score improving while Grid2Im's suffers.
5. Conclusions
We presented a new technique for training a model to directly predict object layouts from an abstract scene description while attending to all objects simultaneously. Our method achieves sizable improvements in layout quality compared to prior work, resulting in accurate and photo-realistic images.