April, 2012

Apr 12

The new Audi A3 webspecial with webcam gesture control

Audi A3 webspecial

In March 2012 Audi released a new version of its extremely successful Audi A3. The brief for Razorfish was to motivate potential customers to discover the A3's interior and to carry the simplicity and intuitiveness of its interior design over to the web. So the concept team came up with the idea of creating a web special that could be experienced entirely through gestures.

But how could that be realized? Full-scale webcam-based controls for websites have not yet left the realm of experimental microsites. And unlike Microsoft's Kinect platform, Flash does not provide a software framework for recognizing gestures via the webcam. Kinect is also equipped with multiple cameras and a depth sensor that precisely capture a person's spatial movements.

The aim was to create an experience that is equally engaging whether it is controlled by mouse or by hand through the webcam. So the gestures would be based on the movement of the mouse cursor. Consequently, we had to transform the hand movements captured by the camera into a hand cursor.

The hand becomes the cursor

Researching creative developers' experiments with webcam-based user interaction in Flash turned up three approaches that could be suitable for our undertaking.

1st approach: Object Tracking

Object Tracking example

The open source OpenCV (Open Source Computer Vision) framework for C / C++ provides various algorithms for image processing. Some of them, those concerned with object detection within images, were partly ported to Flash by the Japanese developer Ohtsuka Masakazu in 2008. Though this technology is extremely promising, we soon had to realize that the code base available in Flash was too limited to support commercial development within our time and budget.

2nd approach: Color Tracking

Colour Tracking example

With a colour filter it is possible to determine the location of a person's hands within the image based on their skin colour. We had good results at first. Of course, we needed to filter out the person's head and to handle different people's skin colours dynamically. That could be resolved by using object tracking to locate the face and sample the person's skin colour from it. But the approach was still unstable with backgrounds that resembled the person's skin colour or under changing light conditions.

3rd approach: Motion Tracking

Motion Tracking example

This method compares successive camera images and determines the areas that changed, which represent potential areas of movement. In the first prototype we implemented the rather simple swipe gesture. The approach worked fine, performed well and was quite stable under different light and background conditions, but it was still too imprecise and jumpy to support more complex gestures.
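The comparison of successive frames can be sketched roughly as follows. This is a minimal illustration in Python, not the Flash code used in the project: frames are assumed to be plain grayscale grids, and the names `motion_centroid`, `threshold` and `min_pixels` as well as their default values are hypothetical.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # grayscale image: rows of 0-255 brightness values


def motion_centroid(prev: Frame, curr: Frame, threshold: int = 30,
                    min_pixels: int = 4) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centre of all pixels that changed between two frames.

    Returns None when fewer than min_pixels changed, i.e. nothing moved.
    The threshold and min_pixels defaults are assumptions for this sketch.
    """
    sum_x = sum_y = count = 0
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (a, b) in enumerate(zip(row_prev, row_curr)):
            if abs(a - b) > threshold:  # brightness changed noticeably
                sum_x += x
                sum_y += y
                count += 1
    if count < min_pixels:
        return None
    return (sum_x / count, sum_y / count)


def blank_frame(width: int, height: int) -> Frame:
    """Create an all-black frame (helper for the usage example)."""
    return [[0] * width for _ in range(height)]


# Usage: a bright 2x2 "hand" appears in an otherwise black image.
previous = blank_frame(8, 6)
current = blank_frame(8, 6)
for row in (2, 3):
    for col in (4, 5):
        current[row][col] = 255
print(motion_centroid(previous, current))  # (4.5, 2.5)
```

Taking the centre of all changed pixels as the cursor position is the simplest possible choice, which also hints at why a raw result of this kind tends to be jumpy.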

The solution

Advanced Motion Tracking

Because the first approach was not feasible and the second was insufficient on its own, the way to go was to tweak the motion detection and maybe combine it with the others. Not an easy task. The enhancements, which evolved over time as we gained a deeper understanding, can be classified into detection, interpolation and performance optimization. Here are some examples, grouped in that order:


  • Merging the results of several motion detections (three were optimal) added a lot of stability and accuracy. This handles, for example, the flicker of a light bulb (invisible to a person but not to a camera!) or duplicate camera frames (the frame rate drops in lower light due to increased exposure time).
  • Two motion detection loops with different settings: the first optimized to handle faster movement, the second optimized to still capture very slow movement.


  • In what we call triangular interpolation, the last three detected coordinates are used to calculate the centroid (balance point) of the triangle they form, which becomes the result point. This smooths the detected movement a lot and reduces undesired jitter.
  • With a Bézier curve interpolation, the last 10 result points are used as control points to calculate a Bézier curve. The new result point is set at 70 percent along that curve. While this adds a little delay, it further smooths the movement.
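The two interpolation steps can be sketched like this, again in Python rather than AS3. The class and function names are hypothetical, the curve is evaluated with de Casteljau's algorithm, and the sketch assumes that the Bézier control points are the outputs of the triangular step and that the Bézier output is not fed back into the history; the original implementation may differ on both counts.

```python
from collections import deque
from typing import Deque, List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    """Balance point of the given points (of the triangle, for three points)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def bezier_point(control: List[Point], t: float) -> Point:
    """Evaluate a Bezier curve at parameter t with de Casteljau's algorithm."""
    pts = list(control)
    while len(pts) > 1:
        # Repeatedly interpolate between neighbours until one point is left.
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]


class CursorSmoother:
    """Chains both steps: centroid of the last 3 detections, then a Bezier
    curve through the last `history` centroids, sampled at t."""

    def __init__(self, history: int = 10, t: float = 0.7) -> None:
        self.raw: Deque[Point] = deque(maxlen=3)
        self.results: Deque[Point] = deque(maxlen=history)
        self.t = t

    def update(self, detected: Point) -> Point:
        self.raw.append(detected)
        # Assumption: the centroids are the "result points" used as controls.
        self.results.append(centroid(list(self.raw)))
        return bezier_point(list(self.results), self.t)


# Usage: a jittery detection sequence produces a much calmer cursor path.
smoother = CursorSmoother()
for detection in [(10.0, 10.0), (14.0, 9.0), (11.0, 12.0), (15.0, 10.0)]:
    print(smoother.update(detection))
```

Sampling at 70 percent rather than at the end of the curve trades a little responsiveness for stability, which matches the small delay mentioned above.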


  • Because the detection loop runs on every frame, it is crucial that it uses minimal computational resources so that it does not impede other parts of the application, such as smoothly playing animations.
  • Furthermore, garbage collection, which is accompanied by frame drops and a collapse of detection quality, should be minimized.
  • The best way to handle both is the consistent use of object pools. Reusing a BitmapData object, for example, is 10x faster than creating a new one!
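The object pool pattern itself is simple and can be sketched as below. This is a generic Python illustration with a hypothetical `BufferPool` class that uses `bytearray` as a stand-in for Flash's BitmapData; the 10x figure above refers to Flash and is not something this sketch measures.

```python
from typing import List


class BufferPool:
    """Hands out reusable pixel buffers instead of allocating one per frame."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._zeros = bytes(size)        # template used to clear buffers
        self._free: List[bytearray] = []
        self.created = 0                 # how many buffers were ever allocated

    def acquire(self) -> bytearray:
        """Return a cleared buffer, allocating only if the pool is empty."""
        if self._free:
            return self._free.pop()
        self.created += 1
        return bytearray(self.size)

    def release(self, buf: bytearray) -> None:
        """Clear the buffer in place and put it back for later reuse."""
        buf[:] = self._zeros
        self._free.append(buf)


# Usage: 100 simulated frames need only a single allocation.
pool = BufferPool(size=320 * 240)
for _frame in range(100):
    pixels = pool.acquire()
    pixels[0] = 255                      # ... per-frame work on the buffer ...
    pool.release(pixels)
print(pool.created)  # 1
```

Because released buffers are kept alive by the pool, nothing becomes garbage during the detection loop, which is the point of the technique.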

In the end, the system became so accurate that we no longer needed to combine it with the other approaches (and deal with their disadvantages). The job was done!

The cursor becomes a gesture

This was the comparatively easy part. Simple movement-based gestures like swiping are recognized when the cursor moves in a specific direction or over a designated shape. The hold gesture, the equivalent of a mouse click, is triggered by holding the cursor over a designated shape for a while.
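A hold gesture of this kind could be detected along the following lines. This is a hedged Python sketch: the `Rect` and `HoldGesture` names, the rectangular target and the 1.5 second default are assumptions made for illustration, not details from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


class HoldGesture:
    """Fires once when the cursor stays inside the target long enough."""

    def __init__(self, target: Rect, hold_seconds: float = 1.5) -> None:
        self.target = target
        self.hold_seconds = hold_seconds   # assumed value, tune to taste
        self._entered_at: Optional[float] = None
        self._fired = False

    def update(self, px: float, py: float, now: float) -> bool:
        """Feed the cursor position once per frame; True means 'click'."""
        if not self.target.contains(px, py):
            # Leaving the shape resets the timer and re-arms the gesture.
            self._entered_at = None
            self._fired = False
            return False
        if self._entered_at is None:
            self._entered_at = now
        if not self._fired and now - self._entered_at >= self.hold_seconds:
            self._fired = True
            return True
        return False


# Usage: the cursor rests on a button for one second.
button = HoldGesture(Rect(0, 0, 10, 10), hold_seconds=1.0)
print(button.update(5, 5, now=0.0))  # False
print(button.update(5, 5, now=1.0))  # True
```

Firing only once per visit prevents a resting hand from triggering repeated clicks.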

Form-based gestures are recognized by comparison with a previously captured form. In the case of the rotation gesture this is a circle. To reduce false detections, we also applied some noise filters. The comparison is based on the $1 gesture recognizer algorithm; a good AS3 implementation can be found at http://www.betriebsraum.de/.
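The idea of comparing a drawn path with a stored template can be illustrated with a reduced sketch. It keeps the resampling, scaling and translation steps known from the $1 recognizer but deliberately omits its rotation search and scales uniformly, so it is not a full implementation of that algorithm. All function names and the choice of 32 sample points are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(pts: List[Point]) -> float:
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def resample(pts: List[Point], n: int = 32) -> List[Point]:
    """Return n points spaced evenly along the drawn path."""
    interval = path_length(pts) / (n - 1)
    out = [pts[0]]
    acc = 0.0          # distance walked since the last emitted point
    prev = pts[0]
    i = 1
    while i < len(pts) and len(out) < n:
        cur = pts[i]
        d = math.dist(prev, cur)
        if d > 0 and acc + d >= interval:
            t = (interval - acc) / d
            q = (prev[0] + t * (cur[0] - prev[0]),
                 prev[1] + t * (cur[1] - prev[1]))
            out.append(q)
            prev = q   # continue from the new point on the same segment
            acc = 0.0
        else:
            acc += d
            prev = cur
            i += 1
    while len(out) < n:  # rounding can leave the last point missing
        out.append(pts[-1])
    return out


def normalize(pts: List[Point], n: int = 32) -> List[Point]:
    """Resample, move the centroid to the origin, scale to unit size."""
    pts = resample(pts, n)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    size = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    cx, cy = sum(xs) / n, sum(ys) / n
    return [((x - cx) / size, (y - cy) / size) for x, y in pts]


def shape_distance(a: List[Point], b: List[Point], n: int = 32) -> float:
    """Average point-to-point distance; smaller means more similar."""
    return sum(math.dist(p, q)
               for p, q in zip(normalize(a, n), normalize(b, n))) / n


def circle(cx: float, cy: float, r: float, steps: int = 40) -> List[Point]:
    return [(cx + r * math.cos(2 * math.pi * k / steps),
             cy + r * math.sin(2 * math.pi * k / steps))
            for k in range(steps + 1)]


# Usage: a large, shifted circle matches the template; a straight line does not.
template = circle(0, 0, 1)
drawn = circle(200, 150, 80)
line = [(float(x), 0.0) for x in range(50)]
print(shape_distance(template, drawn) < shape_distance(template, line))  # True
```

A recognizer would accept the gesture when the distance falls below some tuned limit. Without the rotation step, this sketch only matches circles that start at the same angle and run in the same direction as the template.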