[1910.11124] Enforcing Reasoning in Visual Commonsense Reasoning
Although we provide the predicted answer rather than the correct answer as input to the rationale prediction module (so performance is expected to decline), we show through experiments that our model still performs competitively against the current state of the art.
Abstract: The task of Visual Commonsense Reasoning is extremely challenging in the
sense that the model must not only answer a question given an image, but also
learn to reason. The baselines introduced for this task are quite limiting
because two networks are trained to predict answers and rationales separately.
The question and image are used as input to train the answer prediction
network, while the question, image, and correct answer are used as input to the
rationale prediction network. Because the rationale is conditioned on the
correct answer, this setup assumes that the Visual Question Answering task can
be solved without any error, which is over-ambitious. Moreover, such an
approach makes answer and rationale prediction two completely independent VQA
tasks, rendering the cognition task meaningless. In this paper, we seek to
address these issues by proposing an end-to-end trainable model which considers
both answers and their reasons jointly. Specifically, we first predict the
answer to the question and then use the chosen answer to predict the rationale.
However, a trivial design of such a model becomes non-differentiable, which
makes it difficult to train. We address this issue with four approaches:
softmax, Gumbel-softmax, reinforcement-learning-based sampling, and direct
cross entropy against all pairs of answers and rationales. We demonstrate
through experiments that our model performs competitively against the current
state of the art. We conclude with an analysis of the presented approaches and
discuss avenues for further work.
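As a rough illustration of how the predicted answer can be passed to the rationale module without breaking differentiability, the sketch below uses the Gumbel-softmax relaxation, one of the four approaches named in the abstract. The function name, tensor shapes, and the use of PyTorch's F.gumbel_softmax are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_answer_gumbel(answer_logits, answer_reprs, tau=1.0, hard=True):
    """Differentiably pick the predicted answer to feed into the QA->R module.

    answer_logits: (batch, 4) Q->A scores for the four candidate answers.
    answer_reprs:  (batch, 4, dim) representation of each candidate answer.
    Returns a (batch, dim) representation of the chosen answer.
    """
    # Gumbel-softmax: softmax((logits + Gumbel noise g) / temperature tau).
    # hard=True applies the straight-through trick: the forward pass is one-hot,
    # the backward pass uses the soft probabilities, so gradients still flow.
    weights = F.gumbel_softmax(answer_logits, tau=tau, hard=hard)  # (batch, 4)
    return torch.einsum("ba,bad->bd", weights, answer_reprs)

# Toy usage with random tensors (batch size and dim are arbitrary here).
logits = torch.randn(2, 4, requires_grad=True)
reprs = torch.randn(2, 4, 512)
chosen = select_answer_gumbel(logits, reprs)  # (2, 512), differentiable w.r.t. logits
```

With hard=True the rationale module always sees a single discrete answer at inference time, while training gradients still reach the Q->A scores through the soft relaxation.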
Figure 1. Comparison of our approach against the VCR baseline. Top row: baseline approach by Zellers et al. [18]. Bottom row: our approach. Q: question, Ac: correct answer, Ap: predicted answer, R: predicted rationale, Q->AR: both answer and rationale prediction given the question.
Figure 2. The Visual Commonsense Reasoning task, Zellers et al. [18].
Figure 3. Our approach. Q: question, a_i: i-th answer, Ap_i: predicted probability for answer a_i, AR: answer representation, R: predicted rationale, ARp_i: predicted probability for an answer-rationale combination (4 answers x 4 rationales = 16 combinations), I: image, τ: temperature, g: sample drawn from the Gumbel distribution.
Figure 4. The left column is the softmax model and the right column is the Gumbel-softmax model. The top row shows the Q->A loss and the bottom row the QA->R loss. The blue line denotes the model trained with Q->A loss : QA->R loss = 1:1; the orange/red lines denote the model trained with Q->A loss : QA->R loss = 1:4. As the curves show, weighting the QA->R loss four times more yields a slight improvement in the QA->R module but a significant decline in Q->A performance.
Figure 5. Qualitative results: green denotes the correct answer. In these two examples the model (Gumbel-softmax) made the correct predictions.
Figure 6. Qualitative results: green denotes the correct answer, red denotes wrong predictions. In this example the model (Gumbel-softmax) did not make the correct predictions.
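The direct cross-entropy approach mentioned in the abstract, and reflected in the Figure 3 caption (ARp_i over 4 x 4 = 16 combinations), scores every answer-rationale pair jointly and applies a single loss against the pair that combines the correct answer with the correct rationale. A minimal sketch, assuming the pair scores are already computed as a (batch, 4, 4) tensor; the function name and shapes are hypothetical:

```python
import torch
import torch.nn.functional as F

def joint_pair_loss(pair_logits, answer_label, rationale_label):
    """Single cross-entropy loss over all 4 x 4 = 16 answer-rationale pairs.

    pair_logits:     (batch, 4, 4) score for every (answer_i, rationale_j) pair.
    answer_label:    (batch,) index of the correct answer.
    rationale_label: (batch,) index of the correct rationale.
    The positive class is the one pair combining the correct answer with the
    correct rationale; the remaining 15 pairs act as negatives.
    """
    batch = pair_logits.size(0)
    flat = pair_logits.reshape(batch, 16)
    target = answer_label * 4 + rationale_label  # flat index of the correct pair
    return F.cross_entropy(flat, target)

# Toy usage: 2 examples, random pair scores, ground-truth answer/rationale indices.
scores = torch.randn(2, 4, 4, requires_grad=True)
loss = joint_pair_loss(scores, torch.tensor([1, 3]), torch.tensor([0, 2]))
loss.backward()
```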