Red Means Stop: Classifying Traffic Lights for Driverless Vehicles

Swarit Dholakia
7 min read · May 2, 2019


Waymo’s driverless vehicle that knows that red means stop (which even some people can’t seem to understand 🤨)

Self-Driving Cars

Autonomous vehicles, as publicized as they are, still keep to private testing routes and limited deployments to make sure they are at least as safe as human drivers (hopefully safer) before they go on our roads.

Among the numerous components that make a self-driving vehicle operate on roads (more on that here), a crucial one is the machine's ability to be aware of road conditions and interpret meaning from what it sees.

Computer perception systems employ a series of sensors to 'see' the world (more on that here). Vision cameras record specific details in the environment and interpret them with machine learning, while depth-based sensors capture the shape and distance of surrounding objects through point-cloud maps.

A vision camera, like the one on your phone, is very versatile for self-driving car perception because it's good at seeing details that depth sensors can't: it's the only way for a car to know whether the object on the road is a dog or a box on the ground (letting it decide whether it's safe to drive through or not).

a camera identifying unique details of the road scene — taken from Udacity’s Medium blog

Of the many specific features a camera extracts to understand what's going on on the road (anything from the brake lights of the vehicle ahead to the posted speed on signs), one of the most important things to recognize for any driver, let alone a computer driver, is the colour of the traffic light at the intersection the vehicle is about to enter: you gotta' know whether it's green or red for your intended route to prevent a horrific incident.

Traffic Light Detection and Classification

Cameras can record road scenes as they happen, and even convey that information to a computer to process in real time, but the largest obstacle in interpreting road conditions is selecting only the data we're looking for from a picture and knowing where to find it in a complex, dynamic environment: cars aren't going to be useful if they can only drive where nothing's happening.

Convolutional Neural Networks

To identify, select, and classify the objects in a scene we're interested in, a convolutional neural network (CNN) is employed to analyze images and video feeds in real time.

A deep learning approach to object detection and recognition, CNNs are extremely versatile, respond in real time to what they're looking at, and can take in higher-resolution data; these are very important characteristics for autonomous vehicles.

CNNs use convolutional layers to filter inputs to extract useful information.

Convolution is a core part of a CNN. It refers to the mathematical combination of two functions to produce a third function; it merges two sets of information (which in our case are an image and a filter).

In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (they’re the same thing) to make a feature map.

taken from freeCodeCamp's Medium guide on CNNs

We perform a convolution by sliding the filter over the input. At every location, the overlapping pixel values are multiplied element-wise by the filter and summed, producing one value of the feature map.
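
To make this concrete, here is a minimal NumPy sketch (not from the original project) of sliding a 3x3 filter over a small grayscale image and summing the element-wise products into a feature map:

import numpy as np

def convolve2d(image, kernel):
    # slide the kernel over the image and build a feature map;
    # at each position, multiply the overlapping pixels element-wise
    # by the kernel and sum them into a single output value
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)

    return feature_map

# example: a simple vertical-edge filter on a random "image"
image = np.random.rand(8, 8)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
print(convolve2d(image, edge_filter).shape)  # (6, 6)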

Convolutional networks are made up of an input layer, an output layer, and hidden layers. A convolutional network differs from a regular neural network in that the neurons in its layers are arranged in three dimensions (width, height, and depth). The hidden layers are a combination of convolution layers, pooling layers, normalization layers, and fully connected layers. CNNs stack multiple convolutions to filter the input to greater levels of detail.

CNNs improve their ability to detect even unusually placed objects by using pooling layers, which reduce memory consumption and so allow more convolutional layers to be stacked.
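
To illustrate that layer arrangement, here is a minimal Keras sketch (an illustration only, not the classifier built later in this post) that stacks convolution, pooling, and fully connected layers for 32x32 RGB crops with three output classes:

import tensorflow as tf

# a small CNN for 32x32 RGB crops and 3 classes (red, yellow, green)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),           # pooling shrinks the feature maps
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),   # fully connected layer
    tf.keras.layers.Dense(3, activation='softmax')  # one score per light colour
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])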

regular neural networks on the left (notice, 2D); CNNs on the right (with a third dimension)

Region-CNN

With the method we talked about above, a sliding window searches every position within a given image. However, objects appear with different aspect ratios and scales depending on their size and distance from the camera. Running a conventional CNN at every location would make recognition and classification extremely slow, because a fixed-size window only matches objects of one apparent size, relative to the viewing perspective (read more on this here).

An R-CNN forces the CNN to focus on a single region at a time, which minimizes interference because only a single object of interest is expected to dominate a given region. The regions in the R-CNN are found by a selective search algorithm and then resized so they are all the same size before being fed to a CNN for classification and bounding-box regression.
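
Here is a rough sketch of that pipeline using the selective search implementation that ships with opencv-contrib; classify_crop is a hypothetical stand-in for the CNN classifier:

import cv2

def propose_and_classify(image, classify_crop, max_regions=200):
    # generate region proposals with selective search
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()  # list of (x, y, w, h) proposals

    results = []
    for (x, y, w, h) in rects[:max_regions]:
        crop = image[y:y + h, x:x + w]
        crop = cv2.resize(crop, (224, 224))  # equal-sized inputs for the CNN
        results.append(((x, y, w, h), classify_crop(crop)))
    return results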

Implementation

This seemed like an interesting problem, so I built my own R-CNN traffic light classifier by replicating the Udacity Intro to Self-Driving Cars Nanodegree project.

Dataset Prep

This traffic light dataset consists of 904 red traffic light images, 536 green traffic light images and 44 yellow traffic light images.

To convert the labels to a numeric format, we now make a list in the order [red value, yellow value, green value], called a one-hot encoded label.

A red light will have the label [1, 0, 0], yellow will be [0, 1, 0], and green will be [0, 0, 1].

We then make all our images the same size so that they can be sent through the same pipeline of classification steps. After that, we standardize the outputs by creating an array of zeros representing each class of traffic light (red, yellow, green) and setting the index of the expected class to 1.
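
A minimal sketch of that resizing step using OpenCV (the 32x32 target size is an assumption for illustration):

import cv2

def standardize_input(image):
    # resize an RGB traffic light crop to a fixed size so every
    # image goes through the same pipeline of classification steps
    return cv2.resize(image, (32, 32))  # 32x32 is an assumed target size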

Through the code below, we can identify the colour of the traffic light based on the returned one-hot encoded label. Having standardized our inputs and outputs and cleaned our dataset, we now need to produce an array of two zeros and a 1 at the index of the image's class. In other words, an image of a red light gets a 1 at index position 0, resulting in the one-hot encoding [1, 0, 0].

def one_hot_encode(label):
    # create a one-hot encoded label that works for all classes of traffic lights
    # order: [red, yellow, green]
    one_hot_encoded = [0, 0, 0]

    # check whether the colour is red, yellow, or green
    if label == 'red':
        one_hot_encoded[0] = 1
    elif label == 'yellow':
        one_hot_encoded[1] = 1
    elif label == 'green':
        one_hot_encoded[2] = 1

    return one_hot_encoded

We will now create features that help distinguish and classify the three types of traffic light images.

We're going to use the HSV colour space to help us identify the three different classes of a traffic light.

an end-to-end brightness feature construction pipeline for a sample traffic light image
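
For reference, a minimal sketch (not code from the project itself) of pulling the HSV channels out of a standardized RGB crop with OpenCV:

import cv2

def to_hsv_channels(rgb_image):
    # convert an RGB crop to HSV and return the separate channels
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    h = hsv[:, :, 0]  # hue: which colour
    s = hsv[:, :, 1]  # saturation: how vivid
    v = hsv[:, :, 2]  # value: how bright
    return h, s, v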

The following function constructs our first feature: given the standardized image, it one-hot encodes the light based on where the mean of the brightness feature falls along the height of the traffic light. We measure brightness per row along the height of the light; the row of mean (peak) brightness can then be characterized as red, yellow, or green depending on where it sits in the image (nearer the top = red, nearer the bottom = green).

def hot_encode_height(mean):
    # bin ranges along the height of the crop (rows 0-32):
    # red    = 0, 15
    # yellow = 10, 23
    # green  = 20, 32
    one_hot_encoded = [0, 0, 0]

    if mean >= 0 and mean < 15:
        one_hot_encoded[0] = 1
    if mean >= 10 and mean < 23:
        one_hot_encoded[1] = 1
    if mean >= 20 and mean <= 32:
        one_hot_encoded[2] = 1

    return one_hot_encoded

Below, another feature multiplies the hue and value histograms together to reinforce (or call into question) the value filter. We then filter the image by hue, which tells us whether any red, yellow, or green pixels have been found, and which colour is most dominant in the image.

import numpy as np  # used for the channel sums below


def estimate_value(rgb_image):
    # brightness feature along the height of the light
    feature = feature_value(rgb_image)

    # get the mean (peak) row and a bimodal boolean
    max_list2 = max_idx_rank(feature)
    mean = max_list2[0]
    bimodal = is_bimodal(max_list2, feature)

    one_hot_encoded = hot_encode_height(mean)
    return one_hot_encoded, bimodal


def estimate_hueXvalue(rgb_image):
    # hue x value feature to reinforce or question the value filter
    feature = feature_valueXHue(rgb_image)

    # get the mean (peak) row and a bimodal boolean
    max_list2 = max_idx_rank(feature)
    mean = max_list2[0]
    bimodal = is_bimodal(max_list2, feature)

    one_hot_encoded = hot_encode_height(mean)
    return one_hot_encoded, bimodal


def estimate_color(rgb_image):
    # hue-filtered colour feature: which colour dominates the image?
    feature = feature_rgb(rgb_image)

    one_hot_encoded = [0, 0, 0]
    # sum the channels representing each light colour
    red_sum = np.sum(feature[:, :, 0])
    green_sum = np.sum(feature[:, :, 1])
    yellow_sum = np.sum(feature[:, :, 2])

    # one-hot encode the colour with the greatest sum
    if red_sum > (yellow_sum + green_sum):
        one_hot_encoded[0] = 1
    if yellow_sum > (green_sum + red_sum):
        one_hot_encoded[1] = 1
    if green_sum > (yellow_sum + red_sum):
        one_hot_encoded[2] = 1

    return one_hot_encoded
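
The helpers these snippets rely on (feature_value, feature_valueXHue, feature_rgb, max_idx_rank, is_bimodal) aren't reproduced here; the full versions are in the GitHub repo linked below. As a rough illustration of the idea, a brightness-per-row feature along the height of a standardized crop might look like this (a sketch, not the exact implementation):

import cv2
import numpy as np

def feature_value(rgb_image):
    # sum the HSV value (brightness) channel across each row,
    # giving one brightness score per row of the standardized crop
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    return np.sum(hsv[:, :, 2], axis=1)  # one value per row, top row first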

One of our core goals was to never classify a red light as a green light, as this would create a major driving risk for an autonomous vehicle.
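
One simple way to encode that constraint when combining the features (a sketch of the idea, not the exact logic in the repo) is to return green only when the features agree, and fall back to red otherwise:

def safe_classify(rgb_image):
    # combine the features, but never let an ambiguous case come out
    # as green; default to red when the features disagree
    height_label, _ = estimate_value(rgb_image)
    color_label = estimate_color(rgb_image)

    if height_label == color_label:
        return height_label
    # disagreement: treat the light as red so the car stops
    return [1, 0, 0]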

Check out my GitHub repo here.

Some awesome articles you should read if you’re super interested in autonomous vehicles:

  1. https://medium.com/@swaritd/designing-for-trust-in-self-driving-cars-4bef4187a545 (full disclosure: shameless plug)
  2. https://medium.com/@swaritd/how-autonomous-vehicles-work-1a54364463c6 (full disclosure: shameless plug)
  3. https://www.wired.com/story/the-know-it-alls-how-do-self-driving-cars-see/
  4. https://www.wired.com/story/guide-self-driving-cars/
  5. https://medium.com/@swaritd/reinventing-the-wheel-with-the-driverless-car-41b0ce2b1c29 (full disclosure: another shameless plug)

Liked this article? AWESOME! Show your appreciation down below 👏👏

  1. Follow me on Medium
  2. Connect with me on LinkedIn
  3. Reach out at dholakia.swarit@gmail.com to say hi!

I’d love to chat about autonomous vehicles or any cool exponential technology!
