What will the most important sensors in a robocar be? The sensors drive what is known as the "perception" system, which is central to driving the car. The perception system's job is to spot all the important things on or near the road: other vehicles, pedestrians, debris and, in some cases, road features like signs and lane markings.
(Also driven by the sensors is the localization system, whose job it is to figure out very accurately where the car is on the road.)
The perception system has to detect all the obstacles, and attempt to identify them. It needs to measure their speed and direction and predict where they are going. It's a very challenging problem.
Two key ways the perception system can go wrong are called false negatives (blindness) and false positives (ghost objects). A false negative is failing to detect an obstacle. That can be catastrophic if it persists long enough that the vehicle can no longer safely avoid the obstacle. A good system will almost never get a false negative. It may occasionally take a little extra time to fully understand an obstacle, and an obstacle may even blip out for brief flashes, but a persistent failure can mean a crash. I really mean almost never, like one in many millions.
The other error is a false positive, where the system sees an obstacle that isn't really there. This will cause the vehicle to jab the brakes or swerve. That is annoying to the occupants, and possibly even injurious if they aren't wearing seat-belts. It can also cause accidents if the vehicle is being followed too closely, swerves dangerously or brakes too hard. Usually these jabs end up safe, but if they are too frequent, users will give up on the system.
Related to the above is a misclassification. This can mean mistaking a cyclist for a pedestrian, or mistaking two motorcycles for a car. Even without identification, you know not to hit the obstacle, but you might incorrectly predict where it is going or how best to react to it.
A different class of error is complete failure. A sensor or its software components may shut down or malfunction in an obvious way. Surprisingly, this can be tolerated more frequently than blindness, because the system will know the sensor has failed, and will not accept its data. It will either rely on redundant sensors, or quickly move to pull off the road using other sensors if that's not enough. This can't be too frequent or people will stop trusting the system, though.
There are many important robocar sensors, but for primary perception, the two most researched and debated are LIDAR and cameras.
LIDAR is a light-based RADAR. The sensor sends out short pulses of invisible laser light, and times how long it takes to see the reflection. From this you learn both the brightness of the target, and how far away it is, with good accuracy.
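To make the timing concrete, here is a minimal sketch of the time-of-flight arithmetic, in Python. The 200-nanosecond round trip is an illustrative assumption, not a spec from any particular unit.

```python
# Minimal time-of-flight sketch: distance from a single LIDAR pulse.
# The 200 ns round-trip time below is purely illustrative.

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def distance_from_round_trip(round_trip_seconds: float) -> float:
    """The pulse travels out and back, so divide the path length by two."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

if __name__ == "__main__":
    round_trip = 200e-9  # 200 nanoseconds, an assumed example value
    print(f"{distance_from_round_trip(round_trip):.1f} m")  # roughly 30 m
```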
However, LIDAR has disadvantages too, most notably its cost, which is discussed at length below.
Camera systems follow the human model. One or more cameras view the scene, and software tries to do what humans do -- intuit a 3D world and understand it from the 2D image.
But cameras have a few downsides, and the first is a deal-breaker: computer vision is not yet reliable enough to be trusted with the driving on its own.
Computer vision, however, keeps improving. Many hold out hope it will overcome that big downside "some day soon."
Camera processing can be divided into two rough categories known as "machine vision" and "computer vision." Machine vision usually refers to simpler, localized analysis of digital images. This includes things like finding features and edges, detecting motion and motion parallax, and using parallax on stereo (binocular) images to estimate distance. These techniques are reasonably well established and many are well understood. Some machine vision problems are harder, but on track to solution, like detecting and reading signs.
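As an illustration of how well-established these building blocks are, the stereo distance estimate reduces to a one-line formula once the cameras are calibrated. A minimal sketch follows; the focal length, baseline and disparity values are assumptions chosen for illustration.

```python
# Stereo depth sketch: depth falls out of similar triangles once you know the
# camera focal length and the baseline (separation) between the two cameras.

def stereo_depth(focal_length_px: float, baseline_m: float,
                 disparity_px: float) -> float:
    """Depth = f * B / d. Small disparities mean large (and noisy) depths."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# Assumed example numbers: 1400-pixel focal length, 30 cm baseline.
for disparity in (70.0, 14.0, 7.0):
    depth = stereo_depth(1400, 0.30, disparity)
    print(f"disparity {disparity:5.1f} px -> {depth:6.1f} m")
# 6 m, 30 m, 60 m -- a one-pixel error matters far more at long range,
# which is why stereo distance degrades quickly beyond a few dozen metres.
```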
Computer vision refers to a harder set of problems, more akin to the abilities of humans, which involve understanding an image. This means things like segmenting an image and recognizing objects. A human can be shown a picture of a human in almost any setting, under any lighting, and quickly identify that it is a human, and even judge how far away they are. We can even discern their direction of attention and activity. Algorithms are getting better at this but are not yet at a sufficient level.
Some problems have reached the borderline zone. Machine vision tools can look for features, and do it independently of scale and rotation. This allows some detection of other cars, pedestrians, edges of the road and road markings.
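For instance, a detector like ORB in OpenCV finds keypoints that can be matched between frames despite changes in scale and rotation. A minimal sketch, where the image path is a placeholder and not a file from this article:

```python
# Sketch of scale- and rotation-tolerant feature detection using OpenCV's ORB.
# "road_frame.png" is a placeholder path -- substitute any road image.
import cv2

img = cv2.imread("road_frame.png", cv2.IMREAD_GRAYSCALE)
if img is None:
    raise SystemExit("put any road image at road_frame.png to run this sketch")

orb = cv2.ORB_create(nfeatures=500)               # detector + binary descriptor
keypoints, descriptors = orb.detectAndCompute(img, None)
print(f"found {len(keypoints)} keypoints")
# Matching these descriptors between successive frames (e.g. with a
# Hamming-distance brute-force matcher) is the basis for tracking features
# such as road markings and car outlines as the vehicle moves.
```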
The reliable general identification problem is one that many believe will be solved eventually, but it's much harder to predict when. Driving requires that the system "never miss" anything that is in a position to be a safety concern. Particularly difficult are stationary obstacles far enough away that stereo does not work and motion parallax (the way things move against the background and in relation to other objects as you yourself move) is also limited. (An object you are heading straight for, like a pedestrian or stalled car on the road, will present very little motion parallax.)
The other problem for vision systems is the variability of lighting and shadow. Objects can be lit from any direction. They can also have the sun behind them. Often shadows will cross an object. In that case, high dynamic range (HDR) techniques are needed even to see detail in both zones, and the shadow borders dwarf any actual features of the object when it comes to contrast.
There is a special type of camera, known as long-wave infrared (LWIR) or "thermal," which senses emitted rather than reflected light. There are still "shadows," in that things in the sun are warmer than things in the shade, but there are no moving shadows. Thermal images are monochrome but work equally well day or night -- in fact they are better at night. They can see better in fog and certain other weather conditions. They can be very good at spotting living creatures, though not when the ground is at human body temperature. Unfortunately thermal cameras are very expensive, and decent resolution is extremely expensive. They also must be mounted externally, as LWIR does not pass through glass. At present, nobody reports using these cameras, but some are investigating them.
There is some potential in "hyperspectral" imaging, where you have cameras working in many colour bands, including infrared and ultraviolet. With such images it can sometimes be much easier to identify certain types of objects.
Humans can usually convert the two-dimensional images our eyes see into a 3-D model of the world, though we do it much better when we examine a scene over time and watch the motion parallax. Computers currently do a modest job on static images and only sometimes make use of motion to help. Humans make use of stereo but can also drive just fine with one eye closed or missing.
In contrast, LIDAR is able to make a complete 3-D map of a scene from a single static sweep. Multiple sweeps can improve what it sees -- and help it judge velocities.
Much of the excitement today in computer vision is around convolutional neural networks, in particular those trained with the techniques known as "deep learning," which seem to mimic many of the capabilities of biological brains. Many think this is where the breakthrough will come. Deep learning works with a large training set -- and to a limited extent can even do things without special training -- to help it understand the world and even decide what to do. People have built robots that, after being guided through terrain and training deep learning networks on the guided paths, are able to learn how to move in similar environments.
This is exciting work, but the extreme accuracy needed for robocars is still some distance away. Also of concern is that when Deep Learning works, we don't strictly know why it works, just that it does. You can add training to it to fix its mistakes, but you can't even be sure why that fixed things. The same flaw can be attributed to human brains to a degree, but humans can tell you why they acted.
There are different ways to look at this from a legal standpoint. Machine learning in general may hurt you, because you can't explain how it works, or it may help you, in that you simply applied best practices with a good safety record, and no specific mistake was made that can be ruled negligent.
Machine learning tends to get better the more training data you give it, and so big efforts are underway to generate huge amounts of such data. Even so, neural networks are not able to recognize things they have never seen, or seen something similar to.
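For readers who have not seen one, here is a toy-sized sketch of the kind of convolutional network these systems train, written in PyTorch. Random tensors stand in for the huge labelled datasets discussed above; real perception networks are vastly larger.

```python
# Toy convolutional classifier sketch in PyTorch. Real perception networks are
# far larger and are trained on millions of labelled images, not random data.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 4),   # e.g. car / pedestrian / cyclist / background
)

images = torch.randn(8, 3, 64, 64)      # a pretend batch of camera crops
labels = torch.randint(0, 4, (8,))      # pretend ground-truth classes
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                          # one gradient step of "training"
print(f"toy loss: {loss.item():.3f}")
```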
Stereo vision can provide distance data over short ranges, like 60 m. New research in 3-camera systems claims to extract distance out to as much as a kilometer. These systems are not yet in production, but if they deliver on this promise they will be popular, though they will still face a few of vision's problems.
The most important other sensor is radar. Radar has some fantastic advantages. First, it sees through fog, no matter how thick, just fine when all the optical sensors fail. Second, it sees other cars quite well, and each radar hit returns not just a distance but also how fast the obstacle is moving toward or away from you, thanks to the Doppler effect. That's even more than what LIDAR gives -- a single radar capture shows all the moving obstacles and their speeds.
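The Doppler measurement is straightforward physics. A minimal sketch, assuming a 77 GHz automotive radar (a common band) and an illustrative frequency shift:

```python
# Doppler radar sketch: radial (closing) speed from the frequency shift of the
# return. 77 GHz is a common automotive band; the shift value is illustrative.

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def radial_speed(doppler_shift_hz: float, carrier_hz: float) -> float:
    """v = f_doppler * c / (2 * f_carrier); the factor of 2 is the round trip."""
    return doppler_shift_hz * SPEED_OF_LIGHT / (2.0 * carrier_hz)

shift = 5_100.0     # Hz, assumed example shift
carrier = 77e9      # 77 GHz carrier
v = radial_speed(shift, carrier)
print(f"{v:.1f} m/s ({v * 3.6:.0f} km/h) closing speed")  # about 36 km/h
```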
Radar is able to do things like bounce off the road under a car or truck in front of you and tell you what the invisible vehicle in front of the truck is doing -- that's a neat trick.
Radar today offers much less resolution. There are experimental high-resolution radars, but they need lots of radio spectrum (bandwidth) -- more than the regulators have allocated for this use. Radar has a hard time telling you whether a target is in your lane or not, or whether it's on an overpass or on the road ahead of you.
Non-moving objects also return radar signals, but that creates a problem. The ground, signs, fenceposts -- they all return signals saying they are fixed objects. So while a stalled car also gives a radar return, you can't reliably tell it apart from a sign at the side of the road or a stalled car on the shoulder. Most automotive radars simply ignore returns from fixed objects, which is one reason that, for a long time, automatic cruise controls did not work in stop-and-go traffic.
New research is generating higher resolution radar, and also learning how to identify classes of objects from patterns in the returns from them. Digital phased array radars can sweep a scene and bring resolution to around one degree. That's not quite enough but getting better.
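To see why roughly one degree is "not quite enough," consider how wide a one-degree cell is at typical following distances, compared with a lane of roughly 3.5 m (the lane width here is my assumption, not a figure from the text):

```python
# Lateral width of a one-degree angular cell at various ranges, compared to a
# typical ~3.5 m lane. Illustrates why ~1 degree of radar resolution struggles
# to place a target in a specific lane at highway distances.
import math

def lateral_spread_m(range_m: float, beam_deg: float) -> float:
    return range_m * math.tan(math.radians(beam_deg))

for rng in (30, 60, 100, 150):
    spread = lateral_spread_m(rng, 1.0)
    print(f"at {rng:3d} m, a 1-degree cell spans ~{spread:.1f} m laterally")
# At 100-150 m a single cell covers a large fraction of a lane, so telling
# "my lane" from "next lane" (or road from overpass) gets hard.
```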
This research is coming to fruition in the early 2020s and we may see radars that become central sensors.
When you have more than one sensor, you want to combine all the data so you can figure out that the car you see on the radar is the same as the one you're seeing in LIDAR or on camera. This improves the quality of your data, but can also hurt it. The fusion is not 100% reliable. What do you do if your radar suggests there is a car in front of you but the camera says not, or vice versa? You must decide which to believe. If you believe the wrong one, you might make an error. If you believe any obstacle report, you can reduce your blindness (which is very important) but you now add together the ghost obstacles of both sensors. Sometimes you get the best of both worlds and sometimes the worst.
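A toy sketch of that trade-off: if you accept any obstacle reported by either sensor you minimise blindness but sum the ghosts, while demanding agreement does the reverse. The detection lists below are made-up placeholders.

```python
# Toy sensor-fusion sketch: union vs. agreement of two detection lists.
# The obstacle names are made-up placeholders; real fusion works on raw data
# and must first associate detections that refer to the same physical object.

radar_detections  = {"car_ahead", "ghost_from_overpass"}
camera_detections = {"car_ahead", "pedestrian_right"}

union = radar_detections | camera_detections      # fewest misses, most ghosts
agreement = radar_detections & camera_detections  # fewest ghosts, most misses

print("believe either sensor:", union)
print("require both sensors: ", agreement)
```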
In spite of this, because all sensors have different limitations, good sensor fusion remains a high priority for most robotics teams.
Sensor fusion can be done without complicating matters if each sensor is better at a particular problem or a particular region of examination. Then you trust that sensor in its region of best operation.
(It should be noted that good fusion is done considering the raw data from all sensors; you don't just decide that the radar data has an object and the vision data does not. Nonetheless many items will appear much more clearly in one set of sensor data than another.)
Both cameras and LIDAR can be used for localization (figuring out where you are on the map.) LIDAR again has the advantage of being independent of external lighting and making use of full 3-D, but the challenges of doing localization with a camera are less than doing full object perception with one.
The localization process starts by using tools like GPS, inertial motion detection and wheel encoders to figure out roughly where you are, and then looking at the scene and comparing it to known maps and images of the area to figure out exactly where you are. The GPS and other tools are not nearly good enough to drive (and GPS fails in many areas) but advanced localization is well up to the job.
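Here is a heavily simplified one-dimensional sketch of that two-stage idea, with invented landmark positions: the GPS fix narrows the search to a window, and matching observed landmarks against the stored map pins the position down.

```python
# Toy localization sketch (1-D): a rough GPS fix narrows the search window,
# then matching observed landmark ranges against a stored map refines it.
# All numbers are invented for illustration.

map_landmarks = [12.0, 47.5, 63.2, 90.1]   # known positions along the road (m)
observed_ranges = [7.5, 23.2]               # landmarks seen ahead of the car (m)
gps_estimate = 38.0                         # rough fix, metres along the road
gps_window = 10.0                           # assume GPS is good to +/- 10 m

def match_error(candidate_position: float) -> float:
    """Sum of distances from each observation to its nearest map landmark."""
    error = 0.0
    for r in observed_ranges:
        predicted = candidate_position + r
        error += min(abs(predicted - lm) for lm in map_landmarks)
    return error

candidates = [gps_estimate - gps_window + 0.1 * i
              for i in range(int(2 * gps_window / 0.1) + 1)]
best = min(candidates, key=match_error)
print(f"refined position: {best:.1f} m (GPS said {gps_estimate:.1f} m)")  # ~40.0 m
```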
Advanced radars can potentially be used for localization too.
Both technologies have advantages and disadvantages, so which is the right choice?
Robocars are barely commercial yet, and won't be in real production until 2020 or later, according to the optimistic projections. The right question is actually which technology you can reliably bet will be best in the future.
LIDAR's main problem was cost. Some think it's ridiculous that many car teams used Velodyne's $75,000 LIDAR, or even their smaller 32-laser unit at half that cost. That is much more than the cost of a car. However, this is akin to somebody in 1982, noting that a 5 megabyte disk drive costs $3,000, predicting that disk drives are an untenably expensive technology for storing large amounts of data. The reality is that electronics technologies drop immensely in price as they get made in consumer volumes, and with time and Moore's law, they drop even further.
If a market of millions of high-resolution LIDARs emerges, the cost of these instruments will drop a great deal; not to be quite as cheap as cameras, but to a reasonable cost, less than a few thousand dollars, and eventually well under a thousand. Cameras will also drop, but only due to Moore's law, as they are already made in high volumes. There are known projects to develop cheaper LIDARs at both top-tier automotive suppliers and small startups.
Of course, LIDAR systems will always supplement themselves with limited camera use, to see things like traffic lights which can't be seen by the LIDAR.
Computer vision will also get better. Machine vision's progress is more assured, because some of it will come from faster and more specialized machine vision processors, and general-purpose computing electronics will keep getting cheaper. In particular, future chips will be able to handle higher resolution images for more accuracy, and search against more patterns and feature lists.
The Israeli company MobilEye, now part of Intel and perhaps the leader in machine vision systems for driving, builds its product around a custom ASIC -- a processor optimized for machine vision primitives. They have gone through four generations of the chip to make it more capable and higher performance.
Computer vision improvement will come from this too, but it also requires entirely new algorithms. It's not the case that if you give computer vision a supercomputer, it suddenly becomes reliable. If it were, then a Moore's law prediction could tell you when it will be commercially viable in cars and robots. Instead, algorithmic breakthroughs are needed.
This is not to say that such breakthroughs won't come. There are large numbers of people working on them, and there's lots of money to be made from them. And there are computer vision experts on several notable robocar teams, including VisLab in Parma, Chinese teams building on Dickmanns' pioneering work, the Mercedes 6D project and the above-noted MobilEye, all of whom hold much optimism.
For other teams, the fact that cheap LIDAR can be predicted as highly probable while the date of effective vision is uncertain demands one answer -- to base systems on LIDAR. If the cheap vision systems become available, it will be possible to re-purpose much of the work to interface with the camera systems.
The high LIDAR cost has caused a lot of people to make statements to the press that their vision system will solve the problem of robocars being far too expensive. The reality is that the cars will become affordable at the time of their release no matter which technology is used.
Machine vision systems make more sense in the immediate-term technology, namely the "super cruise" cars which can stay in their lane and keep pace with other cars, but require constant human supervision. The human supervision (which they enforce with various tools, like making the driver touch the wheel frequently) eliminates many of the problems caused by the too-frequent errors today's vision systems will make. If such a system can no longer find the lane markers, it keeps driving straight, following other cars, and sounds a loud alarm to get the driver to take over.
These driver-supervised systems can't add a lot to the cost of a car, so cameras and radar are the only practical answer today. This should not lead one to the conclusion that the cameras are good enough for unsupervised operation. They may work 99% of the time, but the difference between 99% and 99.9999% is not just under 1%. It's really a factor of 10,000. In other words, getting to unsupervised driving requires a system whose error rate is 10,000 times lower, not one percentage point better.
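The arithmetic, spelled out (the 99% and 99.9999% figures are the same illustrative ones used above, not measurements of any real system):

```python
# Spelling out the 99% vs 99.9999% comparison from the text.
# These are illustrative reliability figures, not measurements.

supervised   = 0.99       # "works 99% of the time"
unsupervised = 0.999999   # the kind of reliability unsupervised driving needs

error_supervised   = 1 - supervised      # 1 in 100 situations handled wrongly
error_unsupervised = 1 - unsupervised    # 1 in 1,000,000 handled wrongly

print(f"error rates: {error_supervised:.6f} vs {error_unsupervised:.6f}")
print(f"required improvement factor: {error_supervised / error_unsupervised:,.0f}x")
# -> 10,000x fewer errors, even though the headline percentages differ by <1%.
```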
The focus of almost all self-driving efforts today is on safety. Many teams have made cars that can navigate some decent subset of roads well, but the challenge is to improve the cars to a level of safety suitable for deployment to the public. As such, it is unlikely anybody is going to compromise safety to save only modest amounts of money. If we are confident LIDARs will drop to 10-20% of the cost of a car, this simply is not an amount of money that would drive people to make anything but the highest-safety choice, at least on the first cars to be released. Later, when the market is robust, there will be price competition, and the classic safety-vs-cost trade-offs will be discussed, and in some cases, on lower priced vehicles, systems that are slightly less safe but save lots of money may make sense.
This won't be because of regulations, but rather because of liability. People might be willing to say in court that they chose a lesser system because it saved $100,000, but not because it saved $3,000 on the early vehicles.
Because using cameras requires algorithmic breakthroughs, it is very hard to predict just when they might be good enough for driving.
The human driving record is both horrible and fairly good. On average, there is an accident of any sort every 250,000 miles. Fatalities occur every 100 million miles on average, every 180 million miles on the highway. That puts the highway rate at around 3 million hours of driving between fatal accidents, though only about 6,000 hours between minor accidents. Many drivers never cause an accident in their lives, others cause several.
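A rough back-of-the-envelope conversion of those mileage figures into hours of driving, using assumed average speeds (the 60 mph highway and 40 mph overall averages are my assumptions, made only for the conversion):

```python
# Back-of-the-envelope: miles-between-incidents converted to hours of driving.
# The average speeds are assumptions used only to make the conversion.

miles_per_minor_accident   = 250_000
miles_per_highway_fatality = 180_000_000

avg_speed_overall_mph = 40   # assumed
avg_speed_highway_mph = 60   # assumed

hours_per_minor = miles_per_minor_accident / avg_speed_overall_mph
hours_per_fatal = miles_per_highway_fatality / avg_speed_highway_mph

print(f"~{hours_per_minor:,.0f} hours between minor accidents")     # ~6,250
print(f"~{hours_per_fatal:,.0f} hours between highway fatalities")  # ~3,000,000
```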
Not every perception blindness will cause an accident, of course. In fact, humans look away from the road very frequently for short stretches and only rarely crash from it. (Though 80% of the crashes are linked to not looking.) As such it's hard to establish truly good metrics for reliability here. Digital systems see the world "frame by frame" though they also analyze how things change over time. If the perception system fails to see something in one frame but sees it the next, it's almost never going to be a problem -- this is akin to a human "blinking." If a system is blind to something for an extended time, the risk of that causing a safety incident goes up.
How long you have to perceive things depends on speed. If an obstacle appears ahead of you on the road at the limits of your perception range, you must be able to stop. While swerving may be a fallback plan, you can not always swerve, so your system must be able to stop. This means reliable detection well before the stopping distance for your speed and road condition. If the roads are wet or icy that can be fairly far.
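A minimal sketch of that stopping-distance requirement; the reaction time and friction coefficients are textbook-style assumptions, not measured values for any vehicle.

```python
# Stopping-distance sketch: perception/reaction distance plus braking distance.
# Reaction time and friction coefficients are illustrative assumptions.

G = 9.81  # m/s^2

def stopping_distance_m(speed_kmh: float, reaction_s: float, friction: float) -> float:
    v = speed_kmh / 3.6                     # convert to m/s
    reaction_distance = v * reaction_s      # distance covered before braking starts
    braking_distance = v * v / (2 * friction * G)
    return reaction_distance + braking_distance

for surface, mu in (("dry", 0.8), ("wet", 0.5), ("icy", 0.15)):
    d = stopping_distance_m(100, reaction_s=0.5, friction=mu)
    print(f"100 km/h on {surface} road: detect obstacles at least ~{d:.0f} m out")
# Roughly 63 m dry, 93 m wet, and well over 250 m on ice.
```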
LIDAR's big attraction is that, at least for objects of decent size like pedestrians, cars, cyclists and large animals, it's always going to get laser returns saying something is there. The system may not be able to figure out what it is, but it will know it is there, and get more and more sure the closer you get to it. If something of size is blocking the road in front of you, you must stop, no matter what it is -- though there are some exceptions like birds and blowing debris. Within a certain range of distances and sizes, LIDAR is very close to 100%, and that's important.
Vision systems can be better at figuring out what something is. Even LIDAR based vehicles will use cameras for specialized identification of things like birds, traffic cones, road debris, traffic lights and more.
Vision systems also have an advantage when attempting to drive on unknown roads. Many of the systems being developed avoid driving on unknown roads; having a detailed map of the road is hugely valuable, and the cost of making it scales well. Still, driving on unknown roads (or roads changed by construction) remains important.
If we could make a working vision system but it required an hour of supercomputer time per frame, we could predict when those computing resources might get affordable. That's not enough information to predict, though, since cheap resources actually enable and drive research that doesn't get done when you need to buy supercomputer time. In many cases the breakthrough is waiting upon enough computing power and other tools being in so many labs that somebody comes up with the answer.
I predict that cameras will always be present, and that their role will increase over time, but the LIDARs will not go away for a long time. Today, cameras are required for seeing lights, like traffic signals and turn signals. In the future they will get used in areas of easy sensor fusion, such as looking at things beyond the range of other sensors (at least during the day), reading signs, and possibly even doing face and gaze detection on other people on the road, to see where they are looking and measure their intent. Some of that will require either pan/tilt long-range cameras or arrays of cheap cameras, which is something Moore's law continues to bring us.
There is a different story if your robot is only going to go at very low speeds, for example delivery robots or "NEV" golf-cart style vehicles that don't go more than 40 km/h. At low speeds, you don't need to see as far, and so stereo vision can help you see the world in true 3D. It still has the problem of dealing with whatever the current natural illumination is.
At night at low speeds, or indoors, you can use "structured light" cameras (like the one found in the Microsoft Kinect). They also see the world in 3-D and are not affected by variations in illumination -- because they operate in places not illuminated by the sun.
While some teams still hope to do it all with cameras and radar, I believe that the use of both LIDAR and cameras will win the day. As such the question has become more about which one will dominate and provide the bulk of the value. This again depends on how quickly the two trends happen -- cheaper LIDAR, and better and cheaper vision.
Most teams, using higher end LIDARs, are relying on those and planning for them to get cheap. This will happen, but they will still be more expensive than cameras. The more resolution your LIDAR has, the more reliably it can identify everything. At the same time, cameras will still be good at learning more about all the things in the environment, and necessary for things like traffic lights, turn signals and distant objects that LIDAR doesn't see. The LIDAR offers a path to the top level of safety. If there is an obstacle in front of you, you will always know with the LIDAR, even if you don't know what it is. If you don't know, you'll stop. The vision systems, if 99.9% accurate at identifying what the obstacle is, will help the vehicle perform very well because it will know reliably what things are there, where they are going and how far away they are. One time in 1,000 it will just get a bit more conservative than it has to.
A classic example would be things like birds or blowing trash which appear in your way but which you should not slow down for. The LIDAR will see them, and the camera will almost always give additional information about what to do. Rarely, a car might brake for a bird -- and more rarely as time goes on. This works long before the vision system reaches the 99.99999% you want from it on its own.
Some will argue that a less expensive approach is to use a low resolution and low cost LIDAR, together with the best computer vision. Here, the LIDAR gives you assurance that you won't miss an obstacle, but you'll always need the vision to tell you what it is. Still, the one time in 10,000 or 100,000 that the vision doesn't properly identify the obstacle, the worst case is that the car gets too conservative and brakes for something it need not brake for.
Quite simply, the first consumer robocars, which should arrive around 2017 to 2019, will be the most experimental and least refined. Safety is the overwhelming goal in all efforts towards these cars. Nobody wants to sacrifice safety in any way to save the few hundred dollars that LIDARs are forecast to cost at that time. Only when one sensor can completely supplant the other will there be a question of selecting just one to save money. Fusing them does present challenges, so if there is to be only one it is going to be the LIDAR for a few years.
Later, in the 2020s, if cameras become capable of the required perception reliability on their own or combined with other sensors like radar, it would make sense to use only those systems to save money, as cameras will likely remain cheaper than LIDARs for the foreseeable future.
In lower speed applications (under 25 km/h) it can be possible to use just cameras because long range is not needed.
By the time this question settles, the vision will be very good and the LIDARs very cheap. So who knows the final answer?
Tesla famously deprecates LIDAR, calling it a "crutch" and "lame." Their strategy is to do it all with 8 cameras and radar. Their thesis is as follows: Doing real self-driving requires superb computer vision. So superb, they argue, that once you have it, LIDAR tells you nothing extra. This high level computer vision they hope for is able to detect obstacles at very near 100% reliability and know their distance in all situations.
If you have such computer vision, Tesla believes, you wasted your time developing with LIDAR, which no longer offers any advantage. Most other teams believe that LIDAR's advantages are too strong, and the day when vision is this good is too far away and too hard to predict. They feel they can make a safe, working robocar sooner with a combination of LIDAR and "very good" computer vision instead of near-perfect computer vision.
Time will tell.
For those not familiar with LIDAR, you may wish to start with Wikipedia. The Velodyne popular with many robocar teams is a big heavy cylinder, with 64 lasers and sensors. The whole cylinder spins 10 times/second and scans the world around it. It shoots out pulses of 905nm light -- this is invisible near-infrared -- and times the return of those pulses. Velodyne also makes a smaller unit with 32 lasers, and a $10,000 unit with 16 lasers.
The LIDAR pulses are very bright, because the sensor must see the reflected pulse even in sunlight. They are very brief, however, and the unit spins so that pulses are not repeatedly shot at the same place, keeping it safe for people's eyes.
The cost of these units comes from several factors. The lasers are industrial units with high power and performance specifications, as are the receptors. These are made in moderate, industrial volumes, not the consumer volumes of things like the lasers in DVDs. The units are heavy and have moving parts, and are built in small volumes, not through bulk manufacturing.
LIDARs are already getting cheap. A new 16-laser Velodyne is under $10,000. Quanergy, to whom I am an advisor, is offering 8-laser units around $1,000 and promises a new type of LIDAR with arbitrary resolution by 2018 for considerably less. Valeo/IBEO promises a $250 LIDAR with 4 lasers but a limited field of view. Velodyne claims that in very large quantities they could make an excellent unit for $300. The costs are going to continue to drop.
There are some other LIDAR designs of note as well, beyond the spinning mechanical units described above.
Comments on this article can be left at the blog post.