3D Computer Vision: Past, Present, and Future


Uploaded by GoogleTechTalks on 23.02.2012

Transcript:
>> SEITZ: I somewhat foolishly agreed to give a talk on the history of 3D computer vision
for a workshop in Italy a week ago, not realizing how much time this would take. So, basically,
you know, I'm very familiar with a lot of computer vision, in 3D computer vision in
particular but really trying to give a history of this problem, you know, goes back decades
and so I had to consult a bunch of people to really get this right. So, I immediately
sent email to a number of the pioneers, you know, many of you recognize on this list like
Schreiner and Berthold Horn and Andrew Zisserman and so forth. Basically, you know, all the
people you think of about doing pioneering work in computer vision to try to get their
input on what were--are the breakthroughs in computer vision over the years in the 3D
domain. You want to close the door? >> Yeah.
>> SEITZ: And so, of course, everyone sent me different lists and, you know, all of
the structure-from-motion guys sent me structure-from-motion references and all the, you know,
photometric guys like Horn and Woodham sent me, you know, shading references, so everyone
had a different interpretation of what is a landmark or a breakthrough, so I had to kind
of cull from this some subset of things. And I think every one of these people will be horrified
by the talk I'm about to give because it only covers a small subset of the references that
they gave me for what they think are the pioneering references. So, if you want to cover all 3D
vision, it's actually a very broad area, so I had to select only a small number of things
from each subfield. And so, because I'm not going to cover all of these stuff, you know,
many of these emails I got from these folks were just terrific based--you know, really
detailed histories over the last several decades so I'm also going to post online their responses.
So, what does Berthold Horn think were the canonical references in 3D vision and so forth.
Okay. So, in addition to these individuals who gave me some information, I also--there's
lots of good information online in the papers and so forth and here's a--here are a few
links if you're interested to know more about the history of photogrammetry or bundle adjustment
and so forth to some good materials online. All right, so disclaimer, I've already said
just a little bit but this list is very incomplete for sure. It's also somewhat biased because
it's my own interpretation of the highlights in 3D computer vision over the years. Although,
it's informed by a lot--by, you know, experts by readings and just, you know, my experience
over the years. I tended to select for high impact result--results like the things which
I think had the biggest impact either in practice or on the research community and so there's
fewer kind of theoretical results although there are a few in this list. And of course,
I'm omitting a lot of really important results just for--to fit this in to an hour. Okay.
So, with that, let me jump into it. So, first of all, a bit of so-called pre-history. So,
here are a few really important people over the years, the centuries up to now. So, Leonardo
Da Vinci is not really the--he's by no means the first one to know about perspective but
he articulated it especially well so I love this quote: perspective is nothing else than
the seeing of an object behind a sheet of glass, smooth and quite transparent, on the surface
of which all things may be marked that are behind this glass. So, he had a concept of a flat
image plane. All things transmit their images to the eye by pyramidal lines--so, rays--and
these pyramidal lines are cut by said glass. The nearer to the eye these are intersected,
the smaller the image of their cause will appear. So, you know, really beautiful description
of perspective and had, you know, all the key ideas at that time. Okay? So, another
really important figure in 3D is Lambert and you probably know him best through Lambert's
Law, which really models how light gets reflected after it hits a surface
and gets scattered equally in all directions for matte surfaces. And really, you know, this has
been the bane of computer vision algorithms because almost everything assumes Lambertian
reflectance and therefore breaks down for anything that's shiny or specular or non-Lambertian.
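Lambert's law itself is one line; here is a minimal sketch of Lambertian image formation (all values are made up for illustration):

```python
import numpy as np

# Lambert's law: for a matte (Lambertian) surface, reflected intensity
# depends only on the angle between the surface normal and the light
# direction, not on the viewing direction:  I = albedo * max(0, n . l).

def lambertian_intensity(normal, light, albedo=1.0):
    n = normal / np.linalg.norm(normal)
    l = light / np.linalg.norm(light)
    return albedo * max(0.0, float(n @ l))

# Head-on light gives full brightness; light at 60 degrees gives about half;
# light from behind the surface gives zero.
print(lambertian_intensity(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])))             # 1.0
print(lambertian_intensity(np.array([0.0, 0.0, 1.0]), np.array([0.0, np.sqrt(3) / 2, 0.5])))  # ~0.5
```

The "bane" he mentions is visible right in the formula: real shiny surfaces add view-dependent terms that this model simply omits.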
So, we really have Lambert to blame for all these troubles in the field. But also, it
turns out he was the first person who proposed inverse projection. So basically, figuring
out where the camera was from one or more images, and that's also known as camera resectioning
in the photogrammetry literature. Next on this list is Gauss, and Gauss is, you know, hugely
important for all sorts of things, but one big element among those is least squares and,
of course, least squares is really fundamental to almost all of our algorithms these days,
both linear algorithms as well as iterative nonlinear algorithms, and so it's been central
to photogrammetry as well. Now, Wheatstone is maybe a bit more controversial to sort
of put in the same line as these other three but he was the first to really articulate
the principles of stereo and basically the idea that stereo consists of horizontal parallax
between two images--between retinal projections. And it's kind of surprising that this wasn't,
you know, known until the 1800s and it--you know, maybe it was. I mean, there's some evidence
that Leonardo knew about stereo although he didn't explain it as parallax. But what's
interesting about Wheatstone's work is that he was able to basically prove it by creating
images--synthetic images and putting them in a stereoscope. He invented the stereoscope
which then you could look at and see a fused 3D image. So basically, that's a proof
of this parallax concept. And, of course, stereo has been central to a lot of major
advances in 3D vision. Okay. So, in parallel with computer vision and actually predating
a lot of computer vision is photogrammetry. So, photogrammetry is basically the science
of measurement from optics, from some optical observation. I'm sort of sidestepping
the word image because photogrammetry in some sense predated photography. So the mathematical
foundations go back to the early 1800s with the invention of projective geometry,
terms--you know, things like the horopter, multi-view relations and so forth. And I'll
cover some of these later in this talk in the context of other computer vision breakthroughs
but the mathematics goes back a long ways. Practical use, people are interested in photogrammetry
as part of map making and [INDISTINCT] and basically the idea is that if you're in an
imperialist country, you want to make sure that you can map the boundaries of the nations
that you're conquering and so forth, so people have long been interested in map making, and
one of the--maybe the biggest breakthrough in photogrammetry is this
survey called the Great Arc of India. So, when the British were surveying
India, this turned out to be a huge undertaking. So, the idea was basically to create a 3D
map of India. And--so they took these huge theodolites that weighed, literally, a ton
at a time, and hauled them through the jungles to try to take their measurements from different
places, and they were able to show that Mount Everest was actually the highest mountain
on earth. It was thought before then that the [INDISTINCT] was higher, but--so this was
one of--one of the things that photogrammetry has to claim as a big advance. Anyway, so
the basic--the way this was done was taking measurements from these different theodolite
triangulations and kind of putting all these triangulations into a big manual bundle adjustment
so they had all these measurements between pairs of, you know, sensors if you will and
they had to kind of bundle them all together by writing up the equations and solving for
the relative positions of objects on paper. So--but this was an example of bundle adjustment
in the 1800s. Of course, this was advanced with the invention of photography
and around the same time and they start putting these things together in the late 1800s. Okay.
So now, the next big event--you know, I'm skipping a lot of big events here actually
in photogrammetry but when the computers came about, computers that were big enough to do
large least squares computations, photogrammetrists immediately saw the potential to use this
to--for photogrammetry applications. So, in particular, someone named Duane Brown pioneered
the use of computers for bundle adjustment and, you know, basically triangulation efforts
in the 1950s, and so you can really date computer-based bundle adjustment to the 1950s,
although at the time they weren't doing any image analysis; they were just using the computers
to solve the equations, not to extract features or do any measurements on the images themselves.
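The flavor of those computations--writing down observation equations and solving them by least squares--can be sketched with a toy 2D triangulation. The station positions and noise-free bearings below are made up for illustration, not historical data:

```python
import numpy as np

# Toy triangulation in the spirit of early computational photogrammetry:
# each survey station at position s_i measures a bearing (unit direction d_i)
# to an unknown point P.  P must lie on each sight line, i.e.
#   n_i . (P - s_i) = 0,  where n_i is perpendicular to d_i.
# Stacking these constraints gives a small linear least-squares problem.

stations = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, -8.0]])
target = np.array([4.0, 6.0])   # hidden ground truth, used only to synthesize bearings

A, b = [], []
for s in stations:
    d = target - s
    d = d / np.linalg.norm(d)          # the "measured" bearing (noise-free here)
    n = np.array([-d[1], d[0]])        # normal to the sight line
    A.append(n)
    b.append(n @ s)

P, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
print(P)  # recovers approximately [4, 6]
```

With real, noisy bearings the system is overdetermined and inconsistent, which is exactly where least squares earns its keep.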
Okay? So, I called that all pre-history because there was no computer vision, there's no analysis
of the images using computers. So let me talk about what were the first uses of computer
vision for 3D measurement. And so, as far as I could tell, the first example of 3D computer
vision is stereo, and this is work by this crazy inventor named Gilbert Hobrough, and basically,
he was--at the time, the way photogrammetry worked, map making worked, is you had operators
who would look at a pair of images and find corresponding points by hand and then specify
the displacements and then feed this into the computer to solve for the positions of
these points. And so, Hobrough observed, well, wouldn't it be great if you could automate this
manual correspondence task, because that was really the bottleneck. And so he invented
this machine to do so and so this is really an analog stereo machine, analog hardware
and it's pretty--it's pretty incredible. So, I wasn't able to find the original paper on this
but the patents are online so here's an image from the patent. So, what you're seeing here
is basically a table with two image slides on them, transparent images, and there is a
CRT down below which is scanning basically a spot across both images in parallel. And
so it's illuminating a particular pixel so to speak in each image and the intensity of
that pixel is being picked up by a sensor which is measuring intensity in both images
of the spot and then there's a correlator which compares those intensities and decides
whether it's the same or not. If they're the same then it knows the disparity based on
the relative positions of the spot and say--and so there's a match that gives you the disparity
and the depth. If they're not, based on the difference in disparity, it will shift the
beam in one of the images. Okay? So--and it will do this over and over again. So, an analog
implementation of stereo, it's automated, and he had a series of innovations on top of
this and basically invented [INDISTINCT] stereo on a pyramid, you know, back in the '50s and
for those of you know a lot about computer vision, you'll know what I'm talking about
but--yeah, the--there's--this is rediscovered in computer vision basically, you know, three
decades later, but pretty amazing stuff. All right. So, I'd say that was not very well
known--almost no one I've talked to in computer vision has ever heard of the previous
slide. This one is very well known. So, Larry Roberts, his PhD thesis in 1963 at MIT: he
invented what's known as Blocks World. So, in this phase of computer vision, you basically take
an image of a couple of blocks and then, based on extracting where the edges are, which he
did using a gradient operator on a computer--one of the first implementations of computing
gradients on computers--he was able to extract edges and lines and then, based on the relative
orientations of these lines, he was able to figure out their configuration in 3D using
a line labeling technique and came up with a way to render the thing from a new viewpoint
as you're seeing in the far right using presumably some kind of vector graphics. Anyway, so this
is--this is actually an incredible achievement in 1963 to do all these stuff, to implement
the whole thing and have it work even if it only works for two blocks. But--so, Larry
Roberts is often considered the founder--the father of computer vision, you know, this
title has been thrown around. So I went to his Wikipedia page to sort of see
what is said about him and actually, his Wikipedia page doesn't even mention computer vision.
So it turns out after his [INDISTINCT] he stopped doing computer vision all together
or maybe he was disgusted by it, and instead he started working on networks and he was
one of the, you know, one of the--he was the founder of the ARPANET and is considered one
of the forefathers of the Internet. So, that's what his Wikipedia page talks about instead of
the computer vision. But anyway, this was a pretty impressive breakthrough and, you know, one
of the first results in 3D computer vision for sure. So, by the way, if people have questions,
feel free to interrupt. You know, there may be some people in the audience who are aware
of things I'm missing and, you know, any feedback is great. Okay. So, this really fueled--Larry
Robert's work really fueled essentially a generation or at least a decade of work on
Blocks World Modeling. And the culmination of this or one of the best-known examples
was in the late '60s, early '70s, there was this demo system at MIT where they tried
to produce a robot which would basically take pictures of--in real time of blocks, and then
based on that construct a plan for building a structure that it just took a picture of
from another set of blocks. So, basically, you have a set of blocks stacked in a particular
way, the robot tries to reproduce that exact stacking configuration from another set of
blocks by figuring out the structure of the blocks as well as planning it, making a plan
to navigate the robot. And this is incredibly ambitious because it basically involves solving
the computer vision problem, solving the planning problem, solving the AI problem in some sense,
at least the manipulation problem, and it turned out that most of this was doable. The
weakest link was actually low-level edge finding in images. So kind of a disappointing weak
link and this really inspired people to work on edge detection which they did for many
years after that. So this really inspired kind of a dive into low-level computer vision
and, you know, all the kind of Canny and similar results came out of this. All right. So, fast
forward, actually this is around the same time actually. So, Berthold Horn, vision pioneer,
so he did his thesis at MIT and in 1970 he published a thesis and this was on Shape from
Shading, and this was one of the first methods that said--showed how from a single image
you can infer the shape of an object. And basically, the idea, as you can somewhat see
in this slide--depending on the projector, which is not so good with orange--but there
are these curves superimposed on the image and these are so-called characteristic curves
or characteristic strips where if you know the depth of the scene along one point on
the strip, you can figure out the depth along any other point on the strip. So, you can't
figure out the absolute depth of this--of the strip. But if you know one point, then
you can integrate out the depth along every other point. And he also showed that if you
have a network of these strips and you have certain boundary conditions then you can actually
integrate out the surface of the scene. So you can really recover the shape, in other
words the surface orientation and the mesh of the face from a single image assuming--making
lots of assumptions about constant reflectance and so forth. So this was--this was really
a key result. It turns out the astrophysicists were also doing similar things around the
same time around--slightly earlier, around mid 1960s, there's a field called--known as
photoclinometry. And they were--their interest was they wanted to determine the shape of
planets from an image of that planet. And so there's similar work. It's
typically for a profile, so 1D instead of full images. And Horn is really credited as
one of the first to do kind of really 2D shape from shading. Now, for each of these landmarks
I'm talking about you can sort of trace some of the origins back many years before. So
Ernst Mach actually formulated some of the basic equations behind this, the image irradiance
equation almost a hundred years before. But he actually concluded that inverting this
equation, recovering the shape was impossible. So, it's kind of cool at least that Horn was
able to prove him wrong although I guess it took a hundred years to do it.
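The 1D "profile" flavor he mentions (photoclinometry-style) can be sketched under strong assumptions: a Lambertian surface with unit albedo and light along the viewing direction, so the image determines the slope magnitude, and one known depth anchors the integration. Everything below is synthetic:

```python
import numpy as np

# Toy 1D shape-from-shading (photoclinometry-style) under strong assumptions:
# Lambertian surface, unit albedo, light along the viewing direction, so
#   I(x) = cos(theta) = 1 / sqrt(1 + z'(x)^2).
# Then |z'| = sqrt(1/I^2 - 1); with a known starting depth and an assumed
# slope sign, the depth profile integrates out along the strip.

x = np.linspace(0.0, 1.0, 200)
z_true = 0.5 * x**2              # true profile, slope z' = x >= 0
dz = x                           # analytic slope
I = 1.0 / np.sqrt(1.0 + dz**2)   # synthesized image intensities

slope = np.sqrt(1.0 / I**2 - 1.0)   # recovered |z'|; assume the sign is +
# trapezoid-rule integration, anchored at the known depth z(0) = 0
z_rec = np.concatenate([[0.0], np.cumsum(0.5 * (slope[1:] + slope[:-1]) * np.diff(x))])

print(np.max(np.abs(z_rec - z_true)))  # tiny integration error
```

The fragility is also visible here: flip the assumed slope sign, perturb the albedo, or move the light, and the recovered profile is wrong, which is the ill-posedness the next question raises.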
>> Well, but I mean, then wasn't he just saying that the problem was ill-posed and Horn
regularized on top of it? >> SEITZ: Well, so in some sense, yes. But
he also thought that it was too unstable to even--I'd have to actually have a look at the original
quotes of Mach, so, you know, I'm not sure. But the implication was that Mach didn't think
it was feasible because you wouldn't be able to get boundary conditions
or whatever. I don't know. >> Okay.
>> SEITZ: It's a good question. This claim may be too strong to include. But he
had a quote saying it's basically impossible. All right. So, another really amazing result
from the '70s. So this is Bruce Baumgart's PhD thesis in 1974. He was the first to do
shape from silhouettes. So the idea is that you have some scene and you have a set of
photos and you're somehow able to extract the silhouette of the object, the boundary
of the object, separate the foreground from the background. In this case, you see--this
is an image of a doll. They're pretty low quality. These are scanned from his thesis.
And so, from this doll, here's a silhouette of the doll. Here's a silhouette of the doll
from a different viewpoint. You can see it's actually missing the head and this points
to the difficulty of actually getting good silhouettes. And in this image it's also missing
the head. And so the basic idea is that if you know the camera positions for all these
different views, then you can basically back project the images. So, this is another great
slide from his thesis. And so the idea is that you know where the camera is, you know
the image, basically this image plane sitting in front of the camera. If you trace a ray
from the center of the camera through every point on the silhouette, that gives you a
conical volume in which the scene must lie in 3D. And so from each view you get a different
volume which constrains the scene, and the intersection of those volumes gives you an
approximation of the scene. And so this is a figure which impressively
shows that kind of intersection, where you see the silhouettes at the ends and the back-projected
intersection. So, here are the results that he got. They look a little bit crude, but they're
actually pretty nice. So here's some--here's the back side of the doll. You can see the
head is missing, again, because it was missing in the silhouette. And that--this shows the
limitations of shape from silhouette methods: if you miss a part of the object
in any view, it disappears in all the views. So this figure--so each row is a different
view, and the two columns correspond to a stereo pair. So if you're able to fuse these
images so that when people can do stereoscopic fusion, you might be able to better see the
3D structure from the image on the far right. It's kind of cool actually that in this
thesis he came up with the idea of doing a stereo reproduction like this. So, just
think of how much he would have had to do on a 1974 computer in
order to do this. He had to do the edge detection, he had to do the boundary following, he had
to do the back projection in 3D, and he had to intersect the polyhedral models from the back projections.
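The cone-intersection idea is easiest to see in a modern voxel-carving form: a voxel survives only if it projects inside the silhouette in every view. This sketch uses toy orthographic views of a synthetic sphere, not Baumgart's actual polyhedral machinery:

```python
import numpy as np

# Shape from silhouettes as voxel carving: keep a voxel only if its
# projection lands inside the silhouette in every view.  The "cameras"
# here are orthographic projections along the three axes, and the
# silhouettes are synthesized from a sphere.

n = 32
coords = (np.arange(n) + 0.5) / n - 0.5            # voxel centers in [-0.5, 0.5)
X, Y, Z = np.meshgrid(coords, coords, coords, indexing="ij")
sphere = X**2 + Y**2 + Z**2 <= 0.4**2              # ground-truth object

# Orthographic silhouettes along the axes: any() = "something occludes this ray".
sils = [sphere.any(axis=a) for a in range(3)]

# Carve: intersect the three back-projected silhouette volumes.
carved = np.ones((n, n, n), dtype=bool)
carved &= sils[0][np.newaxis, :, :]    # view along x sees the (y, z) silhouette
carved &= sils[1][:, np.newaxis, :]    # view along y sees the (x, z) silhouette
carved &= sils[2][:, :, np.newaxis]    # view along z sees the (x, y) silhouette

print(sphere.sum(), carved.sum())      # hull contains the object but is larger
```

The visual hull always contains the object but is generally larger, and--as with the doll's missing head--anything absent from one silhouette is carved away everywhere.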
Just that last part could have been a PhD, I think, in the 1970s, and he had to invent
all these winged-edge data structures, some of which are still in use today, to do the geometric
modeling. So, a very impressive piece of work. And actually, his thesis is online, so if
you go to this link at the bottom. I tried to include links as much as I could to information
online if you're interested in digging up more, these blue links on the slides. And
I'll post these slides as well. Okay. So here's another example, I think, which looks a little
bit better, images of a horse, and you can see the horse model which looks a bit more
like a horse. All right. So, I'm not going to go into details on these just because I'm
probably not going to have enough time. But there's a lot of follow-on work on shape from
silhouettes, and if you're interested, here's more where you can read what the vision community
has done, which builds on this basic stuff. All right. So, I talked about shape from shading,
Horn's work. But to this day actually, the--really the community has not been able to successfully
make--get good results out of shape from shading. Maybe as Mach predicted, it's very unstable
and it's hard to get accurate results. You have to make very strong assumptions to make
it work. And really, the only reason for including Horn's work as a landmark is--in my mind,
is that it led to photometric stereo, which is this slide. So the key idea is that if--instead
you just take one--instead of taking just one image, if you take say, three images of
the same scene from the same viewpoint under different illuminations, those three images
give you enough cues that you can reliably extract shape. And this works really well.
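A minimal sketch of the Lambertian special case (Woodham's formulation is more general): with three known, linearly independent light directions, each pixel gives three linear equations I = L g with g = albedo·n, so a single matrix inverse recovers the normal and albedo per pixel. Synthetic data, no shadows or clamping:

```python
import numpy as np

# Photometric stereo, Lambertian case: three images, same viewpoint,
# three known light directions.  Per pixel, I_k = albedo * (n . l_k),
# i.e. I = L g with g = albedo * n, so g = inv(L) I, albedo = |g|, n = g/|g|.

rng = np.random.default_rng(0)
H = W = 16
normals = rng.normal(size=(H, W, 3))
normals[..., 2] = np.abs(normals[..., 2])           # make normals face the camera
normals /= np.linalg.norm(normals, axis=2, keepdims=True)
albedo = rng.uniform(0.5, 1.0, size=(H, W))

L = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.6, 0.8]])                     # known unit light directions
images = (normals @ L.T) * albedo[..., None]        # synthesized intensities (no shadowing)

g = images @ np.linalg.inv(L).T                     # solve I = L g at every pixel
rec_albedo = np.linalg.norm(g, axis=2)
rec_normals = g / rec_albedo[..., None]

print(np.max(np.abs(rec_normals - normals)))        # ~0
```

With real images you would use more than three lights and solve in a least-squares sense, and--per the point below--a non-Lambertian but known reflectance just changes the per-pixel equations, not the idea.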
So this is Woodham's PhD thesis. Woodham was a--Bob Woodham was a student of Berthold Horn's,
and so--and this has an interesting history as well. So, the--so, Woodham described his
photometric stereo approach in his thesis, but actually he wasn't the one who implemented
it. The first implementation was by a master's student William Silver in 1980, also at MIT.
And this--you can see by the result--so this is a reconstruction
of, like, basically an egg. It's not actually a real egg, it's a wooden egg. But if--if
you compare this model to all the other models that I've shown up to this point, the details
are, you know, spectacular. And if--on the right is a profile of the egg in black, and
there's a dotted line which you probably can't see very well. But the dotted line is the
ground truth, the correct answer, and they're basically identical. So, incredibly accurate
models. So, one of the--people in computer vision--most people I think know about photometric
stereo and Bob Woodham's work in particular. But I think a misconception is that people
think it only works for Lambertian scenes. And in fact they showed--the very first work
in his thesis and in Silver's implementation showed that you can use this for any kind
of scene, any kind of BRDF as long as it's not translucent, has to be an opaque surface.
But it can be shiny or it can be made out of whatever you want, as long as you have--as
you know the form of that reflectance function. Okay. So--and in fact, in Silver's implementation,
what he did was he measured the reflectance function of wood basically by having a reference
object whose shape was known and taking measurements under different illuminations, and used this
to reconstruct other wooden shapes. Okay. So as long as you know the reflectance function,
you can use this method. It doesn't require Lambertian. All right. So, photometric stereo,
I mean, works so well. It basically--it has inspired a lot of subsequent work in the community
and here are some of the highlights. I mean, basically each one of these landmarks has
like a series of arguably landmarks after that. You know so people who work on photometric
stereo would say, "Here are the landmarks of photometric stereo." So I tried to do that
for a couple of these subfields. But I'm not going to go into details just because I don't
have time. All right. So, okay. So, the next major landmark is something called the essential
matrix. And this was basically one of the first algorithms for recovering a scene from
two projections under perspective. And basically--this is work by Longuet-Higgins
in Nature. It's actually cool that there is a Nature paper entitled A Computer Algorithm
For... So, basically, he came up with the observation that if you have points in correspondence
between two images, they're related by a 3x3 matrix. And in particular, there's a 3x3 matrix
where if you take points represented in homogeneous coordinates XYZ--sorry, XY1--from the first
image and you multiply them by this 3x3 matrix, you get the corresponding epipolar line in
the other image. In other words, a line on which its correspondence must occur in the other
image. And this mapping from points to lines can be represented by these rank-2 3x3 matrices.
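That mapping is easy to check numerically with a made-up camera motion: build E = [t]×R, project a 3D point into both views, and verify the second projection lies on the epipolar line E x:

```python
import numpy as np

# The essential matrix maps a point x (homogeneous, XY1) in image 1 to its
# epipolar line l' = E x in image 2, on which the correspondence must lie:
#   x'^T E x = 0,  with E = [t]x R for relative motion (R, t).

def skew(t):
    """Cross-product matrix: skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

theta = 0.1                                   # made-up rotation about the y axis
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.0])                 # made-up translation
E = skew(t) @ R

P = np.array([0.3, -0.2, 4.0])                # a 3D point in camera-1 coordinates
x1 = P / P[2]                                 # its projection in image 1 (XY1)
P2 = R @ P + t                                # same point in camera-2 coordinates
x2 = P2 / P2[2]                               # its projection in image 2

line = E @ x1                                 # epipolar line in image 2
print(x2 @ line)                              # ~0: x2 lies on the line
```

The rank-2 property he mentions falls out of the construction, since the cross-product matrix [t]× is itself rank 2.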
And you can also compute camera matrices from this matrix and so forth. And so he came up
with a very--a linear algorithm for recovering this matrix from multiple projections of points--sorry,
two images, multiple points. And this was a very, very influential paper and inspired, like, a
whole area of computer vision now known as multi-view geometry, with books by Hartley
and Zisserman and Faugeras, and sort of this explosion of interest in this area in
the 1990s. Now, it turns out that the theory or the math here can be traced back to mathematicians
notably Chasles from the 1860s. So that back in the 1860s, they actually figured out that
there is a 3x3 rank 2 matrix which maps points in one image to lines in the other image.
Okay. So this was known. They didn't have an algorithm to, you know, recover these
points because, you know, the mathematicians weren't looking at algorithms. But it's interesting
to note that a lot of fundamentals were known a long time ago. All right. So--and then there's
been a lot of subsequent work since then, stuff like the fundamental matrix. The essential
matrix wasn't grand enough. You had to have another matrix which is even more grand. The
fundamental matrix is basically the uncalibrated version of this; then the trifocal tensor for
three views, and so forth. All right. So, the next major breakthrough, and this is an interesting
one, I think it's probably not too controversial for this audience but it might be controversial
for some of the others, you know, is this really--is this really computer vision? So,
this is algorithm called Marching Cubes by Lorensen and Cline published in SIGGRAPH in
1987. Basically, this was an algorithm to go from a volume to a surface. A very simple
algorithm on the surface--you know, at some level--but it actually involves a careful specification
to get it right. So, the way it works is, you assume you have a volume which implicitly
represents the surface. So, typically, these are signed distance functions, so positive
might mean outside, negative inside, and the zeros are the surface. So, you start
at a voxel in this volume which contains the surface, and the algorithm basically tells you how to
march in order to stay on the surface. So, you start at a seed point on the surface, then you
start marching over the surface. And every time you march to another voxel which is on
the surface, you create a polygon, okay. And you have to figure out which polygon to create
based on where the sub-voxel iso-function would intersect that voxel. Okay? So, you
do this at sub-voxel precision to create, you know, different sorts of polygons. And
there's basically a table which tells you, for every possible sign configuration at a
voxel's corners, the type of polygon to create and which sides of the voxel it intersects.
And to get this table right--in fact, they didn't get the table quite right in the original
publication. So, if you implement this, make sure you look at the updates to this table
to get it right. But--so--and there's some earlier work. There's a, you know, 1970s paper
and so forth where they tried similar things, but I think this is a big landmark,
I mean, something which is still in widespread use today. So, it's pretty in common in our
field as you all know, to find an algorithm which is still used, you know, more than 20
year--more than 20 years on, 25 years later and this is an example so, an important landmark.
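The table lookup at the heart of the algorithm is worth seeing concretely: the inside/outside signs at a cell's eight corners form an 8-bit case index into the polygon table. This sketch computes only the index on a synthetic sphere (the corrected 256-entry polygon table itself is omitted):

```python
import numpy as np

# The core bookkeeping of Marching Cubes: for each cell of the volume, the
# inside/outside signs of its 8 corners form an 8-bit case index (0..255)
# used to look up which polygons to emit.  Index 0 (all outside) and 255
# (all inside) emit nothing; everything else is a surface-crossing cell.

n = 24
c = (np.arange(n) + 0.5) / n - 0.5
X, Y, Z = np.meshgrid(c, c, c, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 0.35      # signed distance to a sphere

inside = sdf < 0.0
corners = [(0,0,0),(1,0,0),(1,1,0),(0,1,0),(0,0,1),(1,0,1),(1,1,1),(0,1,1)]

crossing = 0
for i in range(n - 1):
    for j in range(n - 1):
        for k in range(n - 1):
            index = 0
            for bit, (di, dj, dk) in enumerate(corners):
                if inside[i + di, j + dj, k + dk]:
                    index |= 1 << bit
            if 0 < index < 255:
                crossing += 1

print(crossing)   # number of cells the surface passes through
```

A full implementation would then place polygon vertices at sub-voxel positions along the crossed edges, using the (corrected) lookup table keyed by this index.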
Okay. So, when I started--I actually started working
in computer vision in the 1990s, and so really, one of the things which really inspired me
was this work by Tomasi and Kanade on structure-from-motion, and the basic idea is you're getting a set
of points over an image sequence, and if you know the correspondence of all these points,
you could just put the point coordinates in rows of a matrix. Let's call it W. So,
the first row of the matrix would be the X coordinate of the first point in all the images.
The second row might be the X coordinate of the second point in all the images, and then
the bottom half of the matrix will be the Y coordinates.
So, it turns out that if you have this matrix W, you can express
that matrix as a product of the motion of the camera and the shape of the scene. And
so the entries of the motion matrix are literally the camera parameters and the orthographic
projection. And the entries of the shape matrix are the 3D positions of those points. Okay.
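The factorization can be checked numerically in a few lines: stack orthographic projections of tracked points into a measurement matrix, observe it has rank 3, and split it with an SVD. (The convention below is the common one with frames in rows; translations are omitted so no centering step is needed, and all data is synthetic.)

```python
import numpy as np

# Tomasi-Kanade factorization sketch: under orthography, the measurement
# matrix of tracked points factors as W = M S, where M stacks the camera
# rows (motion) and S holds the 3D points (shape), so rank(W) <= 3.

rng = np.random.default_rng(1)
P, F = 40, 12                                  # number of points and frames
S = rng.normal(size=(3, P))                    # 3D shape, centered at the origin

rows = []
for _ in range(F):
    # random rotation via QR; its first two rows act as an orthographic camera
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    rows.append(R[:2] @ S)                     # x and y image coordinates
W = np.vstack(rows)                            # 2F x P measurement matrix

print(np.linalg.matrix_rank(W))                # 3

# Rank-3 factorization via SVD: motion and shape up to a 3x3 ambiguity
U, s, Vt = np.linalg.svd(W, full_matrices=False)
M_hat = U[:, :3] * s[:3]
S_hat = Vt[:3]
print(np.max(np.abs(M_hat @ S_hat - W)))       # ~0
```

The SVD only determines the factors up to an invertible 3x3 matrix; recovering metric motion and shape requires the additional orthonormality constraints from the paper.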
So this is basically a one-line solution to the structure-from-motion problem--being
able to recover 3D from a sequence of images. So, extremely elegant, very, very cool, highly
influential. It turns out to be optimal under the right assumptions and an affine projection
model, which is a generalization of orthographic, and it inspired all sorts of follow-on work. These
are just a few extensions to, you know, multiple bodies, optical flow that knows about 3D structure,
non-rigid shapes and so forth. Now, it turns out that this hasn't really been picked up
in modern structure-from-motion, bundle adjustment, 3D reconstruction pipelines because it only
works for affine projection, and it turns out not to work so well with outliers, and there are
other caveats, but it's very cool that you can do this, and I think it got a lot of people
interested in computer vision. All right. So, the next landmark on my list, Iterative Closest
Points is--this was an idea which was simultaneously invented by three different authors. So, I
think this is evidence that the idea was kind of in the air, that so many people came up
with it at the same time but the basic idea is that suppose you have overlapping range
scans of an object and you want to piece them together into a 3D model. You first have to
align the range scans and ICP is a way of doing that and the basic idea of how it works,
you have this green range scan and this red range scan over here, I guess you can't see
the green at all. It must be the green pixels are knocked out of the projector or something
like that. But, you have a correspondence between--so, the key thing is that if you
knew the correspondence of points on the green curve and the red curve, then, you could find
the alignment between them. There's a sort of linear way to recover the alignment of
the--of the two curves. But, you don't know the correspondence. So, the trick is for every
point on the green curve--you have an estimate of its current rotation
and translation relative to the red curve--you find the closest point to it
on the red curve, and then you assume that that's the right correspondence and you solve
for the alignment, and then you get a better-aligned pair of curves. And then, it's not
perfect, because the correspondence was wrong, but now it's closer, so you then re-solve for
the putative correspondences and then iterate the procedure over and over again. And this
actually works quite well. And again, this is one of those examples that, you know, 20
years later it's in widespread use. All right. So, another really important result, from 1995,
is work by Shree Nayar's group on Depth from Defocus. So, here you're seeing a real-time
range scan system, one of the first of its kind, and what you're seeing is
on the left, there is a cup pouring milk into another cup. And on the right is a depth
map computed in real time. And again, this is, you know, 15 years old or more
and basically, what it's doing is it's taking two images with different focus blur. So,
the two images have different focus: one is out of focus in a different way than
the other one because they're using different aperture sizes--no,
I think it's actually different focus settings. And based on the relative blur you can figure
out the depth of the scene. Okay. So, Nayar's
group wasn't the very first to invent this idea of depth from defocus, but this was the
first, I think, really compelling system that showed how it could be practical. And another
innovation here was--typically when you take images with different focus settings, and
you can try this with your own camera, when you change the focus, the
magnification of the image changes a little bit. And so, to make
this work reliably, they had to come up with a way to avoid this, so
they used a notion called telecentric optics, which allowed two different
camera paths with two different focus settings to produce images with the exact same
magnification. And so, Nayar also contributed this idea to the computer vision community
and it made a big difference. So, this implementation, it ran at 30 hertz accurate to, you know,
0.3% error, very impressive again for the 1990s. All right. So, 1996, this was work
by Paul Debevec and collaborators at Berkeley. This is a famous video that Paul made. There
he is running. And he's holding a model of the Campanile. There's actually sound on
this, of bells ringing, the Campanile's bells ringing. So, this is the model and now you're
seeing the rendering of his reconstruction and the Berkeley campus modeled in 3D. You
can think of this as kind of a predecessor to Google Earth and similar products, you
know, that tried to do really high-accuracy 3D modeling. And I think it's inspired
a lot of efforts in this area. So, here we're going back and forth between the real footage
and the 3D model. I feel dizzy watching this. But it's so cool. So, I think this video really
showed people what was possible in terms of really highly realistic
3D modeling from photos. So, definitely a breakthrough landmark result in the field. In the next images
you can see the real and the synthetic one right next to one another.
So, here are the input photos of the tower and the photos of the environment
at the Berkeley campus. And here's the model that
they recovered. And you could see that the models actually are not very complicated.
For example, buildings on the bottom are just these boxes, but when you texture-map them, it
looks really good. All right. So, this paper is important for a number of reasons. So,
it didn't actually coin the term image-based modeling--
that came from Leonard McMillan's work--but it introduced the concept of image-based modeling
in the graphics community. And there is, you know, some people argue that this work didn't
actually have a lot of novelty. It was just an impressive system. But actually there were
some really key innovations. So, for example, view-dependent texture mapping, this is really
the first use of it, which is now used routinely in Street View, for example. And it was really
the foundation for the Utopia work here. Model-based stereo, the idea of recovering
crude geometric models and then recovering height fields on top of
them--these guys came up with that and published it for the first time. And, of course, it
inspired a lot of interesting products that have come out of it. All right. So, same year,
range scan merging, also known as VRIP. This is like a Tomasi-Kanade-like elegant
result. A really cool method--so basically the idea is that you've somehow managed to align
your range scans that now you want to fuse into a 3D model. So, how do you create 3D
model from the aligned range scans? You want to create one consistent watertight mesh from
all these partial scans. And the key idea of this work is that there's basically
a one-line solution to it. If you represent the surface not as a mesh but as
the level set of a 3D volume--the surface is where the volume is zero, and basically
every point of the volume stores a signed distance to the scan--then you represent every scan as a
signed distance function in the volume, and if you simply add up all the signed distance
functions, one from each scan, and take the zero level set of that, that gives you the optimal surface
in a least-squares sense. Okay. So, a very elegant result and this basically was the
method of choice, still the method of choice for a lot of people. More recently, competing
methods like Poisson surface reconstruction have come out and there are people
trying to switch to those, but for more than 10 years this has been really the standard
range scan merging method. Okay. So, of course, I have to include my own PhD work
here. That's why I have to give this talk as opposed to someone else. So, I can inject
my own piece of work. So, 1997--well, not just me but Faugeras as well--came up with
multi-view stereo. And actually--arguably, multi-view stereo dates back further: photogrammetrists
had similar methods for doing multi-view stereo decades previously. But in computer
vision, these were kind of the main methods which came out in the late '90s for reconstructing
high quality 3D models from a set of multiple photos, not just a pair of images. And
the key idea in this work--both the space carving work that I participated in with
Kutulakos and the level set work of [INDISTINCT] and Faugeras--is basically to reconstruct the
3D shape directly instead of matching images--doing it in 3D space--with a visibility model.
And it turns out that, you know, figuring out which points are visible in which images
is difficult. But if you do it in scene space right away, you can solve this problem. And
these were basically the first methods that described how to solve this visibility problem
in a principled way with provable convergence properties and so forth. Okay. So these were
important results in 3D computer vision. So, of course, you're all familiar with graph
cuts, or at least the vision folks are. The first paper that introduced graph
cuts to computer vision, and stereo in particular, was the Roy & Cox paper, an ICCV paper from 1998. And
this is a really cool paper. I mean, the algorithms are cool but
more importantly I think the lessons learned from this paper are really cool. So,
on the left you see window-based matching. These are depth maps which give the depths
of the points in the image. So basically, on the left is more or less the state-of-the-art
algorithm at the time, from '98--although not quite, there were other, better algorithms
they could've used, but it's sort of the simple method--and on the right
is the graph cut result which is, you know, much higher quality. And up until this point,
people had thought that stereo was a really hard problem: to get the right correspondences
between a pair of images, you had to find a better and better
way of comparing images--describing the region of pixels around the point of interest
that you're trying to recover a correspondence for. So, by coming up with better ways of describing
pixels, using edges and primitives and other things like that, you could get better matches.
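As a concrete illustration of the baseline he's describing, simple window-based matching can be sketched in a few lines: for each pixel in the left image, compare a small window against right-image windows at each candidate disparity and keep the one with the lowest sum of squared differences. This is a minimal generic sketch (the function and parameter names are illustrative, not code from any of the papers mentioned):

```python
import numpy as np

def window_ssd_disparity(left, right, max_disp, radius=2):
    """Brute-force window-based stereo: for each left-image pixel,
    pick the disparity whose right-image window matches best (lowest
    sum of squared differences over a (2*radius+1)^2 window)."""
    h, w = left.shape
    L = np.pad(left.astype(float), radius, mode="edge")
    R = np.pad(right.astype(float), radius, mode="edge")
    disp = np.zeros((h, w), dtype=int)
    k = 2 * radius + 1
    for y in range(h):
        for x in range(w):
            wl = L[y:y + k, x:x + k]          # window around (y, x)
            best, best_d = np.inf, 0
            for d in range(min(max_disp, x) + 1):
                wr = R[y:y + k, x - d:x - d + k]
                ssd = np.sum((wl - wr) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp
```

Each pixel is decided independently here, which is exactly why such results look noisy next to a global method like graph cuts, which adds a smoothness term between neighboring pixels and optimizes the whole depth map jointly.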
Now, the graph cut work basically said: forget about all of that. Take the simplest possible
way of comparing images--basically taking the difference of a pixel in the left image and
a pixel in the right image, brain dead--but use a really good optimization algorithm,
and it could beat all this other stuff, just knock it out of the water. Okay. So this work
really showed the power of optimization and ushered in the so-called graph cuts era, with
computer vision applications to all sorts of different problems, but in particular, it has
been a really big development for stereo. Now, it also had its roots in the 1980s, in this
work by Baker and Binford. So I think people of my generation or older will say
it was a really important paper in stereo, but these days, you know, I think it's
kind of a footnote more than anything else. But they were basically the
first to propose global optimization for stereo, though they did it a line at a time using dynamic
programming. And for, you know, basically 15 years, no one
was able to figure out how to extend this 1D optimization to 2D, because the dynamic
programming solution didn't extend; graph cuts was basically the extension to 2D. [INDISTINCT]
mentioned there was also other work around the same time that came up
with different formulations of graph cuts for stereo problems. All right. So 1998,
Marc Pollefeys and his colleagues published a paper at ICCV on doing 3D reconstruction
from basically an uncalibrated video camera. The idea is you kind of fly
around, in this case, or you walk around the scene with a video camera and you get out a
3D reconstruction. You don't have to know anything at all about the video camera and
none of the parameters, no calibration, nothing. So, this is very inspirational work and,
you know, the same way that the Tomasi and Kanade work was influential, I
think this was also very influential. It's also a combination of many research advances.
The paper itself was actually on a relatively narrow topic, self-calibration: finding
intrinsic camera parameters from a sequence of images. It didn't talk at all about
stereo, texture mapping or anything. But this was one of the key pieces that enabled them
to put this whole system together and also the community--the vision community as a whole
had been working through the '90s on solving this self-calibration bundle adjustment problem
and this is really one of the culminations of that work. It showed you could put it all
together in a really compelling system. So it's a big milestone for the field. All right.
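One building block that any such reconstruction pipeline needs once camera matrices are known--whether from calibration or self-calibration--is triangulation: intersecting the viewing rays of a matched point. A minimal linear (DLT-style) sketch for illustration, not Pollefeys' actual algorithm; names are mine:

```python
import numpy as np

def triangulate(Ps, xs):
    """Linear triangulation: each 3x4 camera matrix P and observed
    point (u, v) contributes two rows, u*P[2]-P[0] and v*P[2]-P[1];
    the homogeneous 3D point is the right null vector of the stack."""
    A = []
    for P, (u, v) in zip(Ps, xs):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                 # smallest singular vector
    return X[:3] / X[3]        # dehomogenize
```

With exact observations from two or more cameras this recovers the point exactly; with noisy data it gives the algebraic least-squares solution, which real systems then refine with bundle adjustment.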
So, 1999, Blanz & Vetter published this amazing SIGGRAPH paper where they showed that from
a single photo you can create a 3D reconstruction that looks like this. So basically
a laser-scan-quality 3D model of the person from a single photo. So harking back to the
shape from shading results, you know, there's no way that shape from shading would be able
to do this. So, this is pretty incredible. Now, what it's doing is not using just image
information. It's also using a database of 3D models. So what they did was they scanned
200 faces, 200 people, and they have basically a 200-dimensional vector space of scans; through linear
combinations of these scans they can produce any other person who looks anything like
any of these 200 people. So that's basically their model. So instead of having
to reconstruct the whole face, all they have to do is find the 200 coefficients that give the
best combination of these input people for this output person. And in fact they're able
to compress it a little bit more using principal components and I think they only recover
around 100 coefficients. Anyway, there's similar work around the same time on Active
Appearance Models, but I think these results are really unrivaled even now. I don't
think there are better results for 3D reconstruction of faces. But if you look closely you'll
notice that, you know, the details may not exactly line up, maybe Tom Hanks's nose is
not quite as long and it sort of depends on your input space and how well the new photo
lies in the space of input photos. But it gives a very compelling argument for using
so-called model-based approaches for 3D reconstruction, which use strong priors on the type of geometry
you're trying to reconstruct. So 1999 was also a big year in other ways. There's the Digital
Michelangelo Project, from Marc Levoy's group at Stanford and many other collaborators.
And basically they set out to scan a bunch of statues and other things, you know, in Italy,
and the centerpiece was Michelangelo's David. And this was a huge scanning effort,
probably the biggest scanning project ever undertaken in terms of resources, time and
planning. It basically took a team of a dozen--I think there were 20 people involved in total--
working, scanning all night long. So, first of all, they had access to the statue
only at night because it's displayed in a museum. I think it's at [INDISTINCT] and the
museum is closed at night, so they were allowed to scan all night long and had to stop in
the morning. And so that's basically what they did: they scanned all night long for
a month to get this model, with, you know, roughly 20 people involved.
It's a huge, huge amount of work but the result is beautiful. They captured the
sculpture, basically almost every single nook and cranny, at a quarter of a millimeter precision.
And so you can zoom in--on the right you're seeing--you're zooming into the eye of the
statue and you can zoom in further to this ridge and see the details of the mesh.
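Aligning all those overlapping scans into one model is exactly the ICP problem from earlier in the talk. A minimal 2D sketch of the loop, using brute-force nearest neighbors and the closed-form least-squares rigid fit via SVD--a generic textbook version, not the Stanford pipeline:

```python
import numpy as np

def best_rigid(src, dst):
    """Least-squares rotation and translation mapping src onto dst
    (Kabsch / orthogonal Procrustes via SVD of the cross-covariance)."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(src, dst, iters=20):
    """Iterative Closest Point: alternate nearest-neighbor
    correspondences with the closed-form rigid alignment."""
    cur = src.copy()
    for _ in range(iters):
        # assume the closest dst point is the true correspondence
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        R, t = best_rigid(cur, matched)
        cur = cur @ R.T + t
    return cur
```

Production systems replace the brute-force nearest-neighbor search with a k-d tree and add outlier rejection, but the alternation is the same.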
And there was also other scanning work, like the Pietà project, around
the same time. But this really shows, you know, what you can do with 3D laser scanning
technology and also how much work it is to use it. Okay. Same year, so camera tracking--I
was looking for the canonical reference on bundle adjustment. It's actually really hard
to find a canonical reference on computer vision-based bundle adjustment, and so I went with
this. This is really the first time practical bundle adjustment systems were
used in an application. So basically, for almost every special
effects scene these days, there's a combination of real footage and computer graphics, and
doing that right involves figuring out precisely where the camera is in the shots, so you
can render the computer graphics in the same place. And so this process is called Match
Move, and it requires bundle adjusting, or reconstructing the positions of all the cameras in all the
images. And so the first truly automated methods were developed in the late '90s for doing
this, for tracking features in the images and doing the 3D analysis. All right. And so there were
some pioneering commercial systems--in particular, MatchMover came out of INRIA [INDISTINCT]
led by [INDISTINCT], and Boujou grew out of Oxford, Fitzgibbon [INDISTINCT] and so forth.
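The geometric core of match move--recovering a camera from known 3D-2D correspondences--can be illustrated with textbook DLT camera resectioning. This is a generic sketch, not what MatchMover or Boujou actually implemented; in practice a linear estimate like this is just the starting point that bundle adjustment then refines:

```python
import numpy as np

def resect_camera(X, x):
    """DLT resectioning: recover the 3x4 projection matrix P from
    six or more known 3D points X and their 2D projections x, as the
    null vector of the stacked linear constraints
    P0.Xh - u*(P2.Xh) = 0 and P1.Xh - v*(P2.Xh) = 0."""
    A = []
    for Xw, (u, v) in zip(X, x):
        p = np.append(np.asarray(Xw, float), 1.0)   # homogeneous point
        A.append(np.concatenate([p, np.zeros(4), -u * p]))
        A.append(np.concatenate([np.zeros(4), p, -v * p]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)   # defined only up to scale
```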
There was also concurrent work in photogrammetry around the same time. Let's see, I'm running
a little bit low on time, so I'm just going to skip over some of these. There are some important benchmarks
for stereo that came about in 2001. Some really elegant work for doing stereo with
non-Lambertian surfaces in 2002. 2003 is the year I'd like to say stereo equals laser scan;
this is the first passive computer vision method--no laser scan--that
could produce 3D geometry that looked as good as a laser scan and arguably is almost as
good. And this was Carlos Hernandez's work. And Carlos, as you know, is at Google now.
So, a very cool work which, you know, inspired a lot of subsequent work that I'm
just going to kind of skip over, including Joshua's work and others in the community. 2006,
first work on reconstructing 3D information from photos on the Internet. So this was
the Photo Tourism work, which led to Microsoft Photosynth. And the really key ingredients that
made this work were SIFT, these reliable feature points, and the progress in
photogrammetry over the years, along with similar work by other authors. 2006, this
was a really cool effort of creating 3D models of
a city by just driving a car around. And so it's kind of like Street View but just
using raw computer vision measurements instead of LiDAR. In 2008, there's a great paper
by Zebedin et al., from Bischof's group. And there they show basically an automated version
of [INDISTINCT] Façade. So, from aerial imagery, it creates this really, really impressive
3D model of a city completely automatically. 2009, Sameer did his Rome-in-a-Day work. And
this is sort of city-scale structure from motion from image collections. And sorry I don't
have more time to talk about these. And, of course, 2011 is Kinect. So body pose from
a single depth image, 200 frames per second, the fastest selling consumer electronics device in history.
In the first three months, I think, it doubled the sales that the iPad had when
it came out. So not bad for a computer vision system. Of course, there are other things in it besides
computer vision but we'd like to take credit for it. All right. So in 2013, there is this
fantastic work on basically Digital Michelangelo from a few photos. So instead of having to
scan all night long, this group was able to just take a few photos and create a model
of Digital Michelangelo that has the quality of the Levoy group's model that took a month. In
2015, there's this wonderful project on creating a public repository of the world's geometry
where basically they scan everything--people, places, scenes, things--and put it all up
on a publicly hosted website. 2015, this work, which I'm sure you all remember, on how
to reconstruct yourself from your photo collections. So, it takes all of your photos,
your poses, expressions, body shapes over time, and creates a perfect 3D model
of you and all the ways that you move. 2020--it took a lot longer to solve this Inverse
CAD problem. So the idea is to create what an architect would call a 3D model: not
just a point cloud but a 3D model which captures the salient features
of the scene--walls, floors, a small number of polygons--from rough point clouds, fills in
the gaps, and is easy to edit and modify. 2020, we finally solve the Visual Turing test. So this
is creating a 3D computer vision system which creates models which are truly indistinguishable
from reality where you're able to move however you want through the scene and it still looks
perfect. So it looks just like being there. And finally, 2030, computers becoming human.
That's it. Thank you, guys. >> Thank you very much for bringing us together.