Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention

By Jinkyu Kim and John Canny

Ambar Prajapati
Nov 10, 2020


A Report by Ambar Prajapati

Original Research paper available here:

Deep neural perception and control networks have become a key component of self-driving vehicles. The models used in these networks need to be explainable: they should provide an easy-to-interpret rationale for their behavior, so that passengers, insurance companies, law enforcement, developers, and others can understand what triggered a particular action by the autonomous vehicle.

This research paper explores the use of visual explanations. The explanations take the form of image regions, highlighted in real time, that influence the network’s output (steering control).


There are two stages to achieving visual explanations.

In the first stage, a convolutional network with a visual attention model is trained end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network’s output. Some of these regions truly influence the output, whereas others do not.

In the second stage, a causal filtering step is applied to determine which input regions actually influence the output.

Together, the two stages produce more succinct visual explanations and expose the network’s behavior more accurately.


The paper demonstrates the effectiveness of this approach on three datasets totaling 16 hours of driving (described in more detail later in this report).

It shows that

1. Training with attention does not degrade the performance of the end-to-end network.

2. The network causally cues on a variety of features that humans also use while driving.

To achieve this, the attention model must not hide regions that may be important for driving control. It therefore also has to attend to foliage and other candidate triggers, so that the network can determine that they are not street signs or other vehicles.

There are three steps to this process:

1. Encoder: convolutional feature extraction
2. Coarse-grained decoder by visual attention mechanism
3. Fine-grained decoder: causal visual saliency detection and refinement of attention map.

The figure below shows an overview of this approach.

The model predicts steering angle commands from a raw input image stream in an end-to-end manner. In addition, it generates a heat map of attention, which visualizes where and what the model sees. To achieve this, the images are first encoded with a CNN and then decoded into a heat map of attention, which is also used to control the vehicle.
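To make the idea of an attention heat map concrete, here is a minimal sketch in plain Python. It assumes a softmax over per-location relevance scores, which is a common way to build attention weights; the function names, the toy features, and the scores are all hypothetical and are not taken from the paper. In a real model the scores would come from a learned layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weight each location's feature vector by its attention weight.

    features: list of L feature vectors, one per spatial location
    scores:   list of L unnormalized relevance scores (hypothetical input)
    Returns (weights, weighted_features).
    """
    weights = softmax(scores)
    weighted = [[w * x for x in vec] for w, vec in zip(weights, features)]
    return weights, weighted

# Toy example: a 2x2 feature grid flattened to 4 locations, 3 channels each.
features = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [3.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scores = [0.1, 0.1, 2.0, 0.1]  # location 2 is scored as most relevant
weights, weighted = attend(features, scores)

# Reshape the weights back to the 2x2 grid: this is the attention heat map.
heat_map = [weights[0:2], weights[2:4]]
```

The weights sum to one across locations, so the heat map can be read as the share of attention each image region receives.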

The attention network’s output is post-processed as follows:

  1. Cluster the attention network output into “blobs”.
  2. For each blob in turn, set its attention weights to zero and measure the effect on the end-to-end network output.
  3. Retain the blobs that have a causal impact on the network output.
  4. Remove the blobs that have no impact from the visual map presented to the user.
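The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: blobs are found by thresholding the attention map and grouping 4-connected cells, which stands in for whatever clustering the paper actually uses, and `predict`, `attn_threshold`, and `effect_threshold` are hypothetical names and values.

```python
def find_blobs(mask):
    """Group 4-connected True cells of a 2D boolean grid into blobs."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            stack, blob = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            blobs.append(blob)
    return blobs

def causal_filter(attention, predict, attn_threshold=0.1, effect_threshold=0.05):
    """Keep only the blobs whose removal changes the model output."""
    mask = [[a > attn_threshold for a in row] for row in attention]
    baseline = predict(attention)
    refined = [[0.0] * len(row) for row in attention]
    for blob in find_blobs(mask):
        masked = [row[:] for row in attention]
        for y, x in blob:
            masked[y][x] = 0.0          # step 2: zero out this blob
        if abs(predict(masked) - baseline) > effect_threshold:
            for y, x in blob:           # step 3: blob is causal, keep it
                refined[y][x] = attention[y][x]
    return refined                      # step 4: non-causal blobs stay zero

# Hypothetical stand-in for the driving network: only the left column matters.
def toy_predict(attn):
    return sum(row[0] for row in attn)

attention = [[0.4, 0.0, 0.3],
             [0.5, 0.0, 0.0],
             [0.0, 0.0, 0.2]]
refined = causal_filter(attention, toy_predict)
```

In this toy case the blob in the left column survives because zeroing it changes the output, while the two blobs on the right are dropped as spurious.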


The convolutional feature cubes are obtained by training the 5-layer CNN with hidden variables to predict the measured inverse turning radius. The cubes are then fed through the coarse-grained decoder and finally through the fine-grained decoder.

The model is trained with three different penalty coefficients, λ ∈ {0, 10, 20}; a larger λ encourages the model to pay attention to wider parts of the image.
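To illustrate how a penalty coefficient can widen attention, here is a small sketch of a penalized loss. It assumes an L1 prediction error plus a squared coverage penalty in the style of doubly stochastic attention regularization; the exact penalty term used in the paper may differ, and the function name and toy numbers are hypothetical.

```python
def penalized_loss(targets, predictions, attention, lam):
    """L1 prediction error plus lam times an attention-coverage penalty.

    attention[t][i] is the attention weight at time step t and location i.
    The penalty is zero when every location receives a total weight of 1
    over the sequence, and grows as attention concentrates on few locations.
    """
    l1 = sum(abs(u - u_hat) for u, u_hat in zip(targets, predictions))
    n_locations = len(attention[0])
    penalty = sum(
        (1.0 - sum(step[i] for step in attention)) ** 2
        for i in range(n_locations)
    )
    return l1 + lam * penalty

targets = [0.10, 0.20]        # measured values (toy numbers)
predictions = [0.10, 0.25]    # model outputs (toy numbers)

spread = [[0.5, 0.5], [0.5, 0.5]]   # attention spread evenly over locations
peaked = [[1.0, 0.0], [1.0, 0.0]]   # attention stuck on one location

loss_spread = penalized_loss(targets, predictions, spread, lam=10)
loss_peaked = penalized_loss(targets, predictions, peaked, lam=10)
```

With λ = 0 both attention patterns give the same loss; with λ > 0 the peaked pattern is penalized, which is one way a larger coefficient can push the model toward wider attention.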

The figure below shows an input raw image, an attention network output with spurious attention sources, and a refined attention heat map.

Details of the Experimental Datasets

The three large-scale datasets mentioned earlier contain over 1,200,000 frames (≈16 hours) collected from the Hyundai Center of Excellence (HCE) in Integrated Vehicle Safety Systems and Control at Berkeley. These datasets contain video clips with sensor measurements for the vehicle’s velocity, acceleration, steering angle, GPS location, and gyroscope angles, which makes them well suited to self-driving studies.


The model provides a better way to understand its decision logic by visualizing where and what the model sees to control a vehicle.

The paper observes that the model does pay attention to road elements such as lane markings, guardrails, and vehicles ahead, which human drivers also rely on.