Uploaded by GoogleTechTalks on 23.02.2012

Transcript:

>> SEITZ: I somewhat foolishly agreed to give a talk on the history of 3D computer vision

for a workshop in Italy a week ago, not realizing how much time this would take. So, basically,

you know, I'm very familiar with a lot of computer vision, in 3D computer vision in

particular but really trying to give a history of this problem, you know, goes back decades

and so I had to consult a bunch of people to really get this right. So, I immediately

sent email to a number of the pioneers, you know, many of you recognize on this list like

Schreiner and Berthold Horn and Andrew Zisserman and so forth. Basically, you know, all the

people you think of as doing pioneering work in computer vision, to try to get their

input on what were--are the breakthroughs in computer vision over the years in the 3D

domain. You want to close the door? >> Yeah.

>> SEITZ: And so, of course, everyone sent me different lists and, you know, all of the structure-from-motion guys sent me structure-from-motion references and, you know, the photometric guys like Horn and Woodham sent me shading references, so everyone

had a different interpretation of what is a landmark or a breakthrough so I had to kind

of cull some subset of things from this. And I think every one of these people will be horrified

by the talk I'm about to give because it only covers a small subset of the references that

they gave me for what they think are the pioneering references. So, if you want to cover all 3D

vision, it's actually a very broad area, so I had to select only a small number of things from each subfield. And so, because I'm not going to cover all of this stuff, you know, many of these emails I got from these folks were just terrific--you know, really

detailed histories over the last several decades so I'm also going to post online their responses.

So, what does Berthold Horn think were the canonical references in 3D vision and so forth.

Okay. So, in addition to these individuals who gave me some information, there's also lots of good information online in the papers and so forth, and here are a few links if you're interested in knowing more about the history of photogrammetry or bundle adjustment and so forth--some good materials online. All right, so disclaimer, I've already said

just a little bit but this list is very incomplete for sure. It's also somewhat biased because

it's my own interpretation of the highlights in 3D computer vision over the years. Although,

it's informed by, you know, experts, by readings, and just, you know, my experience over the years. I tended to select for high-impact results--like the things which

I think had the biggest impact either in practice or on the research community and so there's

fewer kind of theoretical results although there are a few in this list. And of course,

I'm omitting a lot of really important results just to fit this into an hour. Okay.

So, with that, let me jump into it. So, first of all, a bit of so-called pre-history. So,

here are a few really important people over the years, the centuries up to now. So, Leonardo

Da Vinci is not really the--he's by no means the first one to know about perspective, but he articulated it especially well, so I love this quote: perspective is nothing else than the seeing of an object behind a sheet of glass, smooth and quite transparent, on the surface of which all things may be marked that are behind this glass. So, he had a concept of a flat image plane. All things transmit their images to the eye by pyramidal lines--so, rays--and these pyramidal lines are cut by the said glass. The nearer to the eye these are intersected, the smaller will the image of their cause appear. So, you know, really beautiful description

of perspective and had, you know, all the key ideas at that time. Okay? So, another

really important figure in 3D is Lambert and you probably know him best through Lambert's

Law, which really models how light gets reflected after it hits a surface and gets scattered in all directions for matte surfaces. And really, you know, this has been the bane of computer vision algorithms, because almost everything assumes Lambertian reflectance and therefore breaks down for anything that's shiny or specular or non-Lambertian.

So, we really have Lambert to blame for all these troubles in the field. But also, it

turns out he was the first person who proposed inverse projection. So basically, figuring

out where the camera was from one or more images, and that's also known as camera resectioning in the photogrammetry literature. Next on this list is Gauss, and Gauss is, you know, hugely

important for all sorts of things, but one big element among those is least squares and, of course, least squares is really fundamental to almost all of our algorithms these days, both linear algorithms as well as iterative nonlinear algorithms, and so it's been central

to photogrammetry as well. Now, Wheatstone is maybe a bit more controversial to sort

of put in the same line as these other three but he was the first to really articulate

the principles of stereo, and basically the idea that stereo consists of horizontal parallax between two images--between retinal projections. And it's kind of surprising that this wasn't, you know, known until the 1800s, and--you know, maybe it was. I mean, there's some evidence

that Leonardo knew about stereo although he didn't explain it as parallax. But what's

interesting about Wheatstone's work is that he was able to basically prove it by creating

images--synthetic images and putting them in a stereoscope. He invented the stereoscope

which then you could look at and see a fused 3D image. So basically, that's a proof

of this parallax concept. And, of course, stereo has been central to a lot of major

advances in 3D vision. Okay. So, in parallel with computer vision and actually predating

a lot of computer vision is photogrammetry. So, photogrammetry is basically the science

of measurement from optics--from some optical observation. I'm sort of sidestepping the word image, because photogrammetry in some sense predated photography. So the mathematical foundations go back to the early 1800s, so [INDISTINCT] invention of projective geometry, terms--you know, things like the horopter, multi-view relations and so forth. And I'll cover some of these later in this talk in the context of other computer vision breakthroughs

but the mathematics goes back a long ways. Practical use, people are interested in photogrammetry

as part of map making and [INDISTINCT] and basically the idea is that if you're in an

imperialist country, you want to make sure that you can map the boundaries of the nations

that you're conquering and so forth so people have long been interested in map making and

one of the great--maybe the biggest--breakthroughs in photogrammetry is this survey called the Great Arc of India. So when the British were surveying India, this turned out to be a huge undertaking. So, the idea was basically to create a 3D map of India. And so they took these huge theodolites--they weighed a huge amount at the time--and hauled them through the jungles [INDISTINCT] to try to take their measurements from different

places and they were able to show that the--that Mount Everest was actually the highest mountain

on earth so it was thought before then that the [INDISTINCT] was higher but--so this was

one of--one of the things that photogrammetry has to claim as a big advance. Anyway, so

the basic--the way this was done was taking measurements from these different theodolite triangulations and kind of putting all these triangulations into a big manual bundle adjustment,

so they had all these measurements between pairs of, you know, sensors if you will and

they had to kind of bundle them all together by writing up the equations and solving for

the relative positions of objects on paper. So--but this was an example of bundle adjustment

in the 1800s. Of course, this was advanced with the invention of photography around the same time, and they started putting these things together in the late 1800s. Okay.

So now, the next big event--you know, I'm skipping a lot of big events here actually

in photogrammetry--but when computers came about, computers that were big enough to do large least-squares computations, photogrammetrists immediately saw the potential to use this for photogrammetry applications. So, in particular, someone named Duane Brown pioneered

the use of computers for bundle adjustment and, you know, basically triangulation efforts

in the 1950s, and so you can really date computer-based bundle adjustment back to the 1950s,

although at the time they weren't doing any image analysis, they were just using the computers

to solve the equations, not to extract features or do any measurements on the images themselves.

Okay? So, I called that all pre-history because there was no computer vision, there's no analysis

of the images using computers. So let me talk about what were the first uses of computer

vision for 3D measurement. And so, as far as I could tell, the first example of 3D computer

vision is stereo, and this is work by this crazy inventor named Gilbert Hobrough. Basically, at the time, the way photogrammetry worked--the way map making worked--is you had operators

who would look at a pair of images and find corresponding points by hand and then specify

the displacements and then feed this into the computer to solve for the positions of

these points. And so, Hobrough observed: well, wouldn't it be great if you could automate this manual correspondence task, because that was really the bottleneck. And so he invented

this machine to do so and so this is really an analog stereo machine, analog hardware

and it's pretty--it's pretty incredible. So, I wasn't able to find original paper on this

but the patents are online so here's an image from the patent. So, what you're seeing here

is basically a table with two image slides on it--transparent images--and there is a

CRT down below which is scanning basically a spot across both images in parallel. And

so it's illuminating a particular pixel so to speak in each image and the intensity of

that pixel is being picked up by a sensor which is measuring intensity in both images

of the spot and then there's a correlator which compares those intensities and decides

whether it's the same or not. If they're the same then it knows the disparity based on

the relative positions of the spot and say--and so there's a match that gives you the disparity

and the depth. If they're not the same, then based on the difference, it will shift the beam in one of the images. Okay? And it will do this over and over again.
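For concreteness, here is what that correlation step looks like re-expressed digitally--a minimal sketch, not Hobrough's analog circuit; the window size, disparity range, and function name are illustrative assumptions.

```python
import numpy as np

# A digital analogue of the scanning-spot correlator: for a pixel in the left
# scanline, slide a small window along the right scanline, compare intensities,
# and take the shift with the best match as the disparity.

def disparity_1d(left_row, right_row, x, half=3, max_disp=32):
    """Disparity at column x of a rectified scanline pair, by window matching."""
    patch = left_row[x - half : x + half + 1].astype(float)
    best_d, best_err = 0, np.inf
    for d in range(0, max_disp + 1):
        xr = x - d
        if xr - half < 0:
            break
        candidate = right_row[xr - half : xr + half + 1].astype(float)
        err = np.sum((patch - candidate) ** 2)   # sum of squared differences
        if err < best_err:
            best_d, best_err = d, err
    return best_d
```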

So, an analog implementation of stereo--it's automated--and he had a series of innovations on top of this and basically invented [INDISTINCT] stereo on a pyramid, you know, back in the '50s, and for those of you who know a lot about computer vision, you'll know what I'm talking about. But yeah, this was rediscovered in computer vision basically, you know, three decades later--but pretty amazing stuff. All right. So, I'd say that was not very well known--[INDISTINCT] almost no one I've talked to in computer vision has ever heard of the previous

slide. This one is very well known. So, Larry Roberts, in his PhD thesis in 1963 at MIT, invented what's known as Blocks World, this phase of computer vision where you basically take an image of a couple of blocks, and then based on extracting where the edges are--which he did using a gradient operator on a computer, one of the first implementations of computing gradients on a computer--he was able to extract edges and lines, and then based on the relative

orientations of these lines, he was able to figure out their configuration in 3D using

a line labeling technique and came up with a way to render the thing from a new viewpoint

as you're seeing in the far right using presumably some kind of vector graphics. Anyway, so this

is--this is actually an incredible achievement in 1963, to do all this stuff, to implement

the whole thing and have it work even if it only works for two blocks. But--so, Larry

Roberts is often considered the founder--the father of computer vision; you know, this title [INDISTINCT] been thrown around. So I went to his Wikipedia page to sort of see

what is said about him and actually, his Wikipedia page doesn't even mention computer vision.

So it turns out after his [INDISTINCT] he stopped doing computer vision altogether--or maybe he was disgusted by it--and instead he started working on networks, and he was

one of the, you know, one of the--he was the founder of the ARPANET and is considered one

of the forefathers of the Internet. So, that's what his Wikipedia page talks about instead of computer vision. But anyway, this was a pretty impressive breakthrough and, you know, one

of the first results in 3D computer vision for sure. So, by the way, if people have questions,

feel free to interrupt. You know, there may be some people in the audience who are aware

of things I'm missing and, you know, any feedback is great. Okay. So, this really fueled--Larry

Robert's work really fueled essentially a generation or at least a decade of work on

Blocks World Modeling. And the culmination of this or one of the best-known examples

was in the late '60s, early '70s: there was this demo of a system at MIT where they tried

to produce a robot which would basically take pictures of--in real time of blocks, and then

based on that construct a plan for building a structure that it just took a picture of

from another set of blocks. So, basically, you have a set of blocks stacked in a particular

way, the robot tries to reproduce that exact stacking configuration from another set of

blocks by figuring out the structure of the blocks as well as planning it, making a plan

to navigate the robot. And this is incredibly ambitious because it basically involves solving

the computer vision problem, solving the planning problem, solving the AI problem in some sense,

at least the manipulation problem, and it turned out that most of this was doable. The

weakest link was actually low-level edge finding in images. So kind of a disappointing weak

link and this really inspired people to work on edge detection which they did for many

years after that. So this really inspired kind of a dive into low-level computer vision

and, you know, all the kind of Canny and similar results came out of this. All right. So, fast

forward, actually this is around the same time actually. So, Berthold Horn, vision pioneer,

so he did his thesis at MIT and in 1970 he published a thesis and this was on Shape from

Shading, and this was one of the first methods that said--showed how from a single image

you can infer the shape of an object. And basically, the idea, as you can somewhat see

in this slide if you're depending on the projector which is not so good on orange, but there

are these curves superimposed on the image and these are so-called characteristic curves

or characteristic strips where if you know the depth of the scene along one point on

the strip, you can figure out the depth along any other point on the strip. So, you can't

figure out the absolute depth of this--of the strip. But if you know one point, then

you can integrate out the depth along every other point. And he also showed that if you

have a network of these strips and you have certain boundary conditions then you can actually

integrate out the surface of the scene. So you can really recover the shape--in other words, the surface orientation and the mesh of the surface--from a single image, making lots of assumptions about constant reflectance and so forth. So this was really a key result.
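To make the setup concrete, here is a minimal sketch of the forward model that shape from shading inverts--the Lambertian reflectance map for a surface gradient under a distant light. This is only the simplest case Horn treated; the characteristic-strip integration itself is not shown, and the function name and parameterization are illustrative assumptions.

```python
import numpy as np

# Lambertian reflectance map R(p, q): predicted image brightness for a surface
# patch with gradient (p, q) = (dz/dx, dz/dy), lit by a distant source whose
# direction is parameterized by (ps, qs). Shape from shading is the inverse
# problem: recover (p, q), and by integration the depth, from the image I(x, y).

def reflectance_map(p, q, ps, qs):
    num = 1.0 + p * ps + q * qs
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return np.clip(num / den, 0.0, None)   # clamp self-shadowed (negative) values to zero
```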

It turns out the astrophysicists were also doing similar things around the

same time around--slightly earlier, around mid 1960s, there's a field called--known as

photoclinometry. And they were--their interest was they wanted to determine the shape of

planets from an image of that planet. And so there's similar work, but it's typically for a profile--so 1D instead of full images. And Horn is really credited as one of the first to do really 2D shape from shading. Now, for each of these landmarks

I'm talking about you can sort of trace some of the origins back many years before. So

Ernst Mach actually formulated some of the basic equations behind this--the image irradiance equation--almost a hundred years before. But he actually concluded that inverting this

equation, recovering the shape was impossible. So, it's kind of cool at least that Horn was

able to prove him wrong although I guess it took a hundred years to do it.

>> Well, but I mean, then wasn't he just saying that the problem was ill-posed and Horn regularized on top of it? >> SEITZ: Well, so in some sense, yes. But

he also thought that it was too unstable to even--I'd have to actually have a look at the original quotes of Mach, so, you know, I'm not sure. But the implication was that Mach didn't think it was feasible because you wouldn't be able to get boundary conditions

or whatever. I don't know. >> Okay.

>> SEITZ: It's a good question. It--this may be too strong to include that one. But he

had a quote saying it's basically impossible. All right. So, another really amazing result

from the '70s. So this is Bruce Baumgart's PhD thesis in 1974. He was the first to do shape from silhouettes. So the idea is that you have some scene and you have a set of

photos and you're somehow able to extract the silhouette of the object, the boundary

of the object, separate the foreground from the background. In this case, you see--this

is an image of a doll. They're pretty low quality. These are scanned from his thesis.

And so, from this doll, here's a silhouette of the doll. Here's a silhouette of the doll

from a different viewpoint. You can see it's actually missing the head and this points

to the difficulty of actually getting good silhouettes. And in this image it's also missing

the head. And so the basic idea is that if you know the camera positions for all these

different views, then you can basically back project the images. So, this is another great

slide from his thesis. And so the idea is that you know where the camera is, you know

the image, basically the image plane sitting in front of the camera. If you trace a ray

from the center of the camera through every point on the silhouette, that gives you a

conical volume in which the scene must lie in 3D. And so from each view you get a different volume which constrains the scene, and the intersection of those volumes gives you an approximation of the scene.
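A minimal sketch of the same idea in its simplest modern form--voxel carving against silhouettes--is below. Baumgart's own system intersected polyhedral silhouette cones rather than carving voxels, and the `project` camera-model function here is an assumed placeholder.

```python
import numpy as np

# Shape from silhouettes as voxel carving: a candidate 3D point survives only
# if it projects inside the silhouette in every view.

def carve(voxels, silhouettes, project):
    """voxels: (N, 3) candidate 3D points; silhouettes: list of boolean (H, W) masks;
    project(view_index, X) -> (u, v) pixel coordinates (assumed known calibration)."""
    keep = np.ones(len(voxels), dtype=bool)
    for view, sil in enumerate(silhouettes):
        H, W = sil.shape
        for j, X in enumerate(voxels):
            if not keep[j]:
                continue
            u, v = project(view, X)
            inside = 0 <= int(v) < H and 0 <= int(u) < W and sil[int(v), int(u)]
            keep[j] = keep[j] and bool(inside)
    return voxels[keep]
```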

And this is a figure which impressively shows that kind of intersection, where you see the silhouettes at the end and the back-projected

intersection. So, here's the results that he got. They look a little bit crude but they're

actually pretty nice. So here's some--here's the back side of the doll. You can see the

head is missing, again, because it was missing in the silhouette. And that--this shows the

limitations of shape from silhouette methods. If you--or if you miss a part of the object

in any view, it disappears in all the views. So this figure--so each row is a different

view, and the two columns correspond to a stereo pair. So if you're able to fuse these

images so that when people can do stereoscopic fusion, you might be able to better see the

3D structure from the image on the far right. Well, it's kind of cool actually that in this thesis he came up with the idea of doing a stereo reproduction like this. So, just think of how much you would have had to do on a 1974 computer in

order to do this. He had to do the edge detection, he had to do the boundary following, he had

to do the back projection into 3D, and he had to intersect the polyhedral models from the back projections. Just that last part could have been a PhD, I think, in the 1970s, and he had to invent all these winged-edge data structures, some of which are still in use today, to do the geometric modeling. So, a very impressive piece of work. And actually, his thesis is online, so if

you go to this link at the bottom. I tried to include links as much as I could to information

online if you're interested in digging up more, these blue links on the slides. And

I'll post these slides as well. Okay. So here's another example, I think, which looks a little

bit better, images of a horse, and you can see the horse model which looks a bit more

like a horse. All right. So, I'm not going to go into details on these just because I'm

probably not going to have enough time. But there's a lot of follow-on work on shape from

silhouettes, and if you're interested, here's more where you can read what the vision community

has done, which builds on this basic stuff. All right. So, I talked about shape from shading,

Horn's work. But to this day actually, the community has not really been able to successfully get good results out of shape from shading. Maybe, as Mach predicted, it's very unstable

and it's hard to get accurate results. You have to make very strong assumptions to make

it work. And really, the only reason for including Horn's work as a landmark is--in my mind,

is that it led to photometric stereo, which is this slide. So the key idea is that instead of taking just one image, if you take, say, three images of the same scene from the same viewpoint under different illuminations, those three images give you enough cues that you can reliably extract shape. And this works really well.
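For concreteness, here is a minimal numerical sketch of the classical Lambertian special case of photometric stereo. This is only a simplification for illustration--as discussed below, the original formulation handles general calibrated reflectance maps, not just Lambertian surfaces--and the function name and array conventions are assumptions.

```python
import numpy as np

# Lambertian photometric stereo: three images of a fixed scene and viewpoint,
# each lit by a known distant light direction, give a 3x3 linear system per
# pixel whose solution is albedo times the surface normal.

def photometric_stereo(images, light_dirs):
    """images: list of 3 arrays (H, W); light_dirs: (3, 3), one unit light direction per row."""
    H, W = images[0].shape
    I = np.stack([im.reshape(-1).astype(float) for im in images])   # (3, H*W) intensities
    L_inv = np.linalg.inv(np.asarray(light_dirs, dtype=float))      # invert the lighting matrix
    G = L_inv @ I                                                    # G = albedo * normal, per pixel
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)                           # unit surface normals
    return normals.reshape(3, H, W), albedo.reshape(H, W)
```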

So this is Woodham's PhD thesis. Woodham was a--Bob Woodham was a student of Berthold Horn's,

and so--and this has an interesting history as well. So, the--so, Woodham described his

photometric stereo approach in his thesis, but actually he wasn't the one who implemented

it. The first implementation was by a master's student William Silver in 1980, also at MIT.

And you can see by the result that, you know--so this is a reconstruction of, like, basically an egg. It's not actually a real egg, it's a wooden egg. But if

you compare this model to all the other models that I've shown up to this point, the details

are, you know, spectacular. And if--on the right is a profile of the egg in black, and

there's a dotted line which you probably can't see very well. But the dotted line is the

ground truth, the correct answer, and they're basically identical. So, incredibly accurate

models. So, one of the--people in computer vision--most people I think know about photometric

stereo and Bob Woodham's work in particular. But I think a misconception is that people

think it only works for Lambertian scenes. And in fact they showed--the very first work

in his thesis and in Silver's implementation showed that you can use this for any kind

of scene, any kind of BRDF as long as it's not translucent, has to be an opaque surface.

But it can be shiny or it can be made out of whatever you want, as long as you have--as

you know the form of that reflectance function. Okay. So--and in fact, in Silver's implementation,

what he did was he measured the reflectance function of wood basically by having a reference

object whose shape was known and taking measurements under different illuminations, and used this to reconstruct other wooden shapes. Okay. So as long as you know the reflectance function,

you can use this method. It doesn't require Lambertian. All right. So, photometric stereo,

I mean, works so well. It basically--it has inspired a lot of subsequent work in the community

and here are some of the highlights. I mean, basically each one of these landmarks has

like a series of arguably landmarks after that. You know so people who work on photometric

stereo would say, "Here are the landmarks of photometric stereo." So I tried to do that

for a couple of these subfields. But I'm not going to go into details just because I don't

have time. All right. So, okay. So, the next major landmark is something called the essential

matrix. And this was basically one of the first algorithms for recovering a scene from

two projections under perspective. And basically--this is work by Longuet-Higgins, in Nature; it's actually cool that there is a Nature paper entitled A Computer Algorithm For... So, basically, he came up with the observation that if you have points in correspondence

between two images, they're related by a 3x3 matrix. And in particular, there's a 3x3 matrix where if you take points represented in homogeneous coordinates--XYZ--sorry, XY1--from the first image and you multiply by this 3x3 matrix, you get the corresponding epipolar line in the other image. In other words, a line on which its correspondence must lie in the other image. And this mapping from points to lines can be represented by this rank-2 3x3 matrix.
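A minimal sketch of that relation is below--just the epipolar constraint itself, assuming an already-estimated essential matrix E; the linear algorithm the talk describes for recovering E from point correspondences is not shown, and the function names are illustrative.

```python
import numpy as np

# Epipolar relation: a point (x, y, 1) in one image, mapped through the 3x3
# essential matrix E, gives the epipolar line in the other image, and a true
# correspondence satisfies x2^T E x1 = 0 (in calibrated image coordinates).

def epipolar_line(E, x1):
    """Line coefficients (a, b, c) with a*u + b*v + c = 0 in image 2."""
    p1 = np.array([x1[0], x1[1], 1.0])
    return E @ p1

def epipolar_residual(E, x1, x2):
    """How far a putative correspondence x2 is from satisfying the constraint."""
    p2 = np.array([x2[0], x2[1], 1.0])
    return float(p2 @ epipolar_line(E, x1))
```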

And you can also compute camera matrices from this matrix and so forth. And so he came up with a linear algorithm for recovering this matrix from two images of multiple points. And this was a very, very influential paper and inspired a

whole area of computer vision now known as multi-view geometry with books by Hartley

and Zisserman and Paul Gerard and sort of this explosion of interest in this area in

the 1990s. Now, it turns out that the theory or the math here can be traced back to mathematicians

notably Chasles from the 1860s. So back in the 1860s, they actually figured out that

there is a 3x3 rank 2 matrix which maps points in one image to lines in the other image.

Okay. So this was known. They didn't have an algorithm to, you know, recover these points because, you know, mathematicians weren't looking for algorithms. But it's interesting

to note that a lot of fundamentals were known a long time ago. All right. So--and then there's

been a lot of subsequent work since then, stuff like the fundamental matrix. The essential

matrix wasn't grand enough. You had to have another matrix which is even more grand. The

fundamental matrix is basically the uncalibrated version of this, the trifocal tensors, the

three view, and so forth. All right. So, the next major breakthrough, and this is an interesting

one, I think it's probably not too controversial for this audience but it might be controversial

for some of the others, you know, is this really--is this really computer vision? So,

this is algorithm called Marching Cubes by Lorensen and Cline published in SIGGRAPH in

1987. Basically, this was an algorithm to go from a volume to a surface. Very simple

algorithm on the surface, you know, at some level, but it actually involves a fair amount of care to get it right. So, the way it works is, you assume you have a volume which implicitly represents the surface. So, typically, these are signed distance functions, so positive might mean outside, negative inside, and the zero set is the surface. So, you start at a voxel in this volume which contains the surface, and a table basically tells you how to march in order to stay on the surface. So, you start at a seed point on the surface, then you start marching over the surface. And every time you march to another voxel which is on

the surface, you create a polygon, okay. And you have to figure out which polygons to create based on where the sub-voxel iso-surface would intersect that voxel. Okay? So, you do this at sub-voxel positions to create, you know, different sorts of polygons. And it's basically a table which tells you, for every possible inside/outside configuration of the voxel's corners, the type of polygons to create and which sides of the voxel they intersect.
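Here is a minimal sketch of that per-cube bookkeeping--just how the corner signs index the lookup table and how crossings are interpolated to sub-voxel positions. The actual 256-entry triangle table (the part that had to be corrected after the original publication) is omitted, and the function names are illustrative.

```python
import numpy as np

# Per-cube step of Marching Cubes: the inside/outside pattern of the 8 corners
# forms an 8-bit index into the triangle configuration table, and surface
# crossings along edges are placed by linear interpolation of the field values.

def cube_case_index(corner_values, iso=0.0):
    """corner_values: 8 scalar samples of the signed field at the cube corners."""
    index = 0
    for i, v in enumerate(corner_values):
        if v < iso:                 # corner is "inside" the surface
            index |= (1 << i)
    return index                    # 0..255, used to look up the triangle table

def interpolate_crossing(p0, p1, v0, v1, iso=0.0):
    """Sub-voxel position where the iso-surface crosses the edge p0-p1 (v0 != v1 assumed)."""
    t = (iso - v0) / (v1 - v0)
    return np.asarray(p0, dtype=float) + t * (np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float))
```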

And to get this table right--in fact, they didn't get the table quite right in the original

publication. So, if you implement this, make sure you look at the updates to this table

to get it right. But--so--and there's some earlier work. There's a, you know, 1970s paper

and so forth where they tried the similar things but I think this is a big landmark,

I mean, something which is still in widespread use today. So, it's pretty in common in our

field as you all know, to find an algorithm which is still used, you know, more than 20

year--more than 20 years on, 25 years later and this is an example so, an important landmark.

Okay. So, now I'm sort of catching up to when I started working--I actually started working in computer vision in the 1990s--and so really, one of the things which really inspired me

was this work by Tomasi and Kanade on structure-from-motion and the basic idea is you're getting a set

of points over an image sequence, and if you know the correspondence of all these points, you could just put the point coordinates in rows of a matrix--let's call it W. So, the first row of the matrix would be the X coordinate of the first point in all the images. The second row might be the X coordinate of the second point in all the images, and then the bottom half of the matrix will be the Y coordinates. So, it turns out that if you have this matrix W, you can express

that matrix as a product of the motion of the camera and the shape of the scene. And so the entries of the motion matrix are literally the camera parameters under orthographic projection. And the entries of the shape matrix are the 3D positions of those points. Okay.
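For concreteness, here is a minimal numerical sketch of that factorization step. It only shows the rank-3 SVD split of the measurement matrix and, as an assumption for brevity, omits the metric upgrade that enforces orthonormal camera rows in the full Tomasi-Kanade method.

```python
import numpy as np

# Factorization under orthography: W is the 2F x P measurement matrix with the
# x-coordinates of P tracked points over F frames stacked above the
# y-coordinates. After subtracting the per-row centroid, W is (ideally) rank 3
# and factors into motion (cameras) and shape (3D points).

def factorize(W):
    W = W - W.mean(axis=1, keepdims=True)            # register to the point centroid
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])                    # 2F x 3 motion matrix
    S = np.sqrt(s[:3])[:, None] * Vt[:3]             # 3 x P shape, up to an affine ambiguity
    return M, S                                      # M @ S approximates W
```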

So this is basically a one-line solution to this structure-from-motion problem--being able to recover 3D from a sequence of images. So, extremely elegant, very, very cool, highly influential. It turns out to be optimal under the right assumptions and an affine projection model, which is a generalization of orthographic, and it inspired all sorts of follow-on work. These

are just a few extensions to, you know, multiple bodies, optical flow that knows about 3D structure,

non-rigid shapes and so forth. Now, it turns out that this hasn't really been picked up

in modern structure-from-motion, bundle adjustment, 3D reconstruction problems, because it only works for affine projection and it turns out not to work so well with outliers, and there are also other caveats. But it's very cool that you can do this, and I think it got a lot of people

interested in computer vision. All right. So, next landmark on my list, Iterative Closest

Points is--this was an idea which was simultaneously invented by three different authors. So, I

think this is evidence that the idea was kind of in the air, that so many people came up

with it at the same time but the basic idea is that suppose you have overlapping range

scans of an object and you want to piece them together into a 3D model. You first have to

align the range scans and ICP is a way of doing that and the basic idea of how it works,

you have this green range scan and this red range scan over here, I guess you can't see

the green at all. It must be the green pixels are knocked out of the projector or something

like that. But, you have a correspondence between--so, the key thing is that if you

knew the correspondence of points on the green curve and the red curve, then, you could find

the alignment between them. There's a sort of linear way to recover the alignment of

the--of the two curves. But, you don't know the correspondence. So, the trick is for every

point on the green curve, you find it. And you have an estimate of its current rotation

relative and translation relative to the red curve. You find the closest point to that

on the red curve and then you assume that that's the right correspondence and you solve

for the alignments and then you get better aligned pair of curves. And then, it's not

perfect because the correspondence was wrong but now it's closer so you then re-solve for

the putative correspondences and then, iterates the procedure over and over again. And this

actually works quite well. And again, this is one of those examples that, you know, 20

years later it's in widespread use. All right. So, another really important result, 1995
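Here is a minimal sketch of that loop in its simplest point-to-point form--nearest-neighbor matching followed by a closed-form rigid fit, iterated. Real ICP variants add outlier rejection, point-to-plane error, and acceleration structures; those are omitted here, and the function names are illustrative.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping point set P onto Q (both (N, 3))."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def icp(source, target, iters=20):
    """Align `source` points to `target` points by iterating match-then-fit."""
    src = source.copy()
    for _ in range(iters):
        # nearest-neighbor correspondences (brute force, for clarity only)
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        matches = target[d.argmin(axis=1)]
        R, t = best_rigid_transform(src, matches)
        src = src @ R.T + t
    return src
```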

All right. So, another really important result, from 1995, is work by Shree Nayar's group on Depth from Defocus. So, here you're seeing a real-time range scanning system, one of the first of its kind, and what you're seeing is

on the left, there is a cup pouring milk into another cup. And on the right is a depth map computed in real time. And again, this is, you know, 15 years old or more,

and basically, what it's doing is it's taking two images with different focus blur. So, the two images have different focus: one is out of focus in a different way than the other because they're using different aperture sizes--no, I think it's actually different focus settings. And based on the relative blur you can figure out the depth of the scene. Okay.
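To make "relative blur gives depth" concrete, here is a minimal sketch of the thin-lens forward model that depth-from-defocus methods invert. This is only the geometric relationship, not Nayar's system: the per-pixel blur measurement and the inversion for depth from two settings are omitted, and the function name and units are assumptions.

```python
# Thin-lens blur model: a point at depth Z, viewed through a lens of focal
# length f and aperture diameter A with the sensor at distance s behind the
# lens, images as a blur circle of diameter c(Z). Two captures with two known
# settings give two blur values per pixel, and their relative blur constrains Z.

def blur_diameter(Z, f, A, s):
    """Blur-circle diameter for a scene point at depth Z (all lengths in meters)."""
    return A * s * abs(1.0 / f - 1.0 / Z - 1.0 / s)
```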

So, Nayar's group wasn't the very first to invent this idea of depth from defocus, but this was the first, I think, really compelling system that showed how it could be practical. And another

innovation here was--typically when you take images with different focus settings--you can try it with your own camera--when you change the focus, the magnification of the image actually changes a little bit. And so, to make this work really reliably, they had to come up with a way to avoid this, so they used a notion called telecentric optics, which allowed you to have two different camera paths with two different focus settings which produced images with the exact same

magnification. And so, now, Nayar also contributed this idea to the computer vision community

and it made a big difference. So, this implementation, it ran at 30 hertz accurate to, you know,

.3% error, very impressive again from the 1990s. All right. So, 1996, this was work

by Paul Debevec and collaborators at Berkeley. This is a famous video that Paul made. There

he is running. And he's holding a model of the Campanile--there's actually sound on this, of, like, bells ringing, the Campanile bells ringing. So, this is the model, and now you're

seeing the rendering of his reconstruction and the Berkeley campus modeled in 3D. You

can think of this as kind of a predecessor to Google Earth and similar products, you

know, that tried to do really high accuracy in 3D modeling. And I think it's inspired

a lot of--a lot of efforts in this area. So, going back and forth from the real footage

to the 3D model. I feel dizzy watching this. But it's so cool. So, I think this video really

showed, I think, people what was possible in terms of realistic, really highly realistic

3D modeling from photos. So definitely a breakthrough, landmark result in the field. So, in the next images you'll see--yes, you can see the real and the synthetic one right next to one another, especially in the model in that case. So, here are the input photos of the tower and some

of the--and the photos of the environment at Berkeley campus. And here's the model that

they recovered. And you could see that the models actually are not very complicated.

For example, the buildings on the bottom are just these boxes, but when you texture-map it, it looks really good. All right. So, this paper is important for a number of reasons. So, it really--well, it introduced--it didn't coin the term image-based modeling; that came from Leonard McMillan's work. But it introduced the concept of image-based modeling

in the graphics community. And there is, you know, some people argue that this work didn't

actually have a lot of novelty. It was just an impressive system. But actually there were

some really key innovations. So, for example, view-dependent texture mapping, this is really

the first use of it, which is now used routinely in Street View, for example. And was really

the foundation for the Utopia work here. Then model-based stereo--the idea of refining crude geometric models by recovering height fields on top of them--these guys came up with that and published it for the first time. And, of course, it inspired a lot of interesting products that have come out of it. All right. So, same year,

range scan merging, also known as VRIP. This is a Tomasi-Kanade-like elegant

result. Really cool method--so basically the idea is that you somehow managed to align your range scans and now you want to fuse them into a 3D model. So, how do you create a 3D model from the aligned range scans? You want to create one consistent watertight mesh from

all these partial scans. And the key idea here of this work is that there's basically a one-line solution to it again. So, instead of representing the surface directly, you represent it as the level set of a 3D volume, where the surface is the zero set and every point of the volume stores a distance to the scan. And you represent every scan as a signed distance function in the volume. Then if you simply add up all the signed distance functions, one from each scan, and take the zero level set of that, that gives you the optimal surface in a least-squares sense. Okay.
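A minimal sketch of that merging step is below--a weighted average of per-scan signed distance fields on a shared voxel grid, whose zero level set is the fused surface (extracted with, e.g., marching cubes). Details such as distance truncation, weighting along the line of sight, and hole filling are omitted, and the function name is illustrative.

```python
import numpy as np

# Volumetric merging of range scans: each scan contributes a signed distance
# field and a weight on the same voxel grid; the merged field is the weighted
# average, and the surface is its zero level set.

def merge_scans(sdf_list, weight_list):
    """sdf_list, weight_list: lists of float arrays of identical shape (voxel grids)."""
    num = np.zeros_like(sdf_list[0], dtype=float)
    den = np.zeros_like(sdf_list[0], dtype=float)
    for d, w in zip(sdf_list, weight_list):
        num += w * d
        den += w
    merged = np.where(den > 0, num / np.maximum(den, 1e-8), np.inf)  # inf marks unobserved voxels
    return merged      # fused surface = { x : merged(x) == 0 }
```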

So, a very elegant result, and this basically was the method of choice--still the method of choice for a lot of people. More recently, competing methods like Poisson surface reconstruction have come out and there are people that are

trying to switch to those, but for more than 10 years this has been really the standard

range scan merging method. Okay. So, of course, I have to include my own PhD work

here. That's why I get to give this talk as opposed to someone else, so I can inject

my own piece of work. So, 1997--well, not just me but Faugeras as well--came up with multi-view stereo. And actually--arguably, multi-view stereo dates back further; photogrammetrists had some similar methods for doing multi-view stereo decades previously. But in computer vision, these were kind of the main methods which came out in the late '90s for reconstructing

high quality 3D models from a set of multiple photos, not just a pair of images. And some

of the key ideas in this work--both the space carving work that I participated in with Kiriakos Kutulakos and the level set work of Keriven and Faugeras--is basically to reconstruct the 3D shape directly instead of matching images, doing it in 3D scene space, with a visibility model.

And it turns out that, you know, figuring out which point is visible in which images is difficult. But if you do it in scene space right away, you can solve this problem. And these were basically the first methods to describe how to solve this visibility problem in a principled way, with provable convergence properties and so forth. Okay.
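A minimal sketch of the test at the heart of these scene-space methods--photo-consistency--is below. It only checks color agreement across the views assumed to see a point; the plane-sweep ordering that space carving uses to resolve visibility in a principled way is not shown, and `project` and the threshold are illustrative assumptions.

```python
import numpy as np

# Photo-consistency check for a candidate scene point: project it into the
# images that (by the current visibility estimate) can see it and require the
# observed colors to agree within a threshold.

def photo_consistent(X, images, cameras, project, thresh=20.0):
    """X: 3D point; images: list of (H, W, 3) arrays; cameras: per-view parameters;
    project(cam, X) -> (u, v) pixel coordinates (assumed known calibration)."""
    samples = []
    for img, cam in zip(images, cameras):
        u, v = project(cam, X)
        if 0 <= int(v) < img.shape[0] and 0 <= int(u) < img.shape[1]:
            samples.append(img[int(v), int(u)].astype(float))
    if len(samples) < 2:
        return True                        # unconstrained point: cannot be carved
    return float(np.max(np.std(np.stack(samples), axis=0))) < thresh
```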

So these were important results in 3D computer vision. So, of course, you're all familiar with graph cuts--or at least the vision folks are--and the first paper that introduced graph cuts to computer vision, to stereo in particular, was the Roy & Cox paper, the ICCV paper from 1998. And

this is basically a really cool paper. I mean, the algorithms are cool, but more importantly I think the lessons learned in this paper are really cool. So,

on the left you see window-based matching. These are depth maps, which give the depth of each point in the image. So basically, on the left is more or less the state-of-the-art

algorithm at the time, from '98--although not quite, there were other, better algorithms they could've used, but it's sort of the simple method--and on the right

is the graph cut result which is, you know, much higher quality. And up until this point,

people had thought that stereo was a really hard problem--to get the right correspondence between a pair of images. In order to really solve this, you had to get a better and better

way of describing--of comparing images, describing the region of pixels around the point of interest

that you're trying to recover a correspondence for. So by coming up with better ways of describing

pixels using edges and primitives and other things like that you can get better matches.

Now, the graph cut work basically said: forget about all of that. Take the simplest possible way of comparing images--basically taking the difference of a pixel in the left image and a pixel in the right image--brain dead, but use a really good optimization algorithm, and you could beat all this other stuff, just knock it out of the water. Okay.
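For concreteness, here is a minimal sketch of the flavor of energy that graph-cut-era stereo methods minimize globally: a brain-dead per-pixel data term plus a smoothness term on neighboring disparities. Only the energy is shown--the max-flow/min-cut graph construction that actually minimizes it, which is the contribution of that line of work, is omitted, and the names and weighting are illustrative.

```python
import numpy as np

def stereo_energy(left, right, disp, lam=10.0):
    """Energy of a disparity labeling: left, right are (H, W) grayscale images,
    disp is an (H, W) integer disparity map, lam weights the smoothness term."""
    H, W = left.shape
    cols = np.clip(np.arange(W)[None, :] - disp, 0, W - 1)      # matched column in the right image
    data = np.abs(left.astype(float) -
                  right[np.arange(H)[:, None], cols].astype(float)).sum()
    smooth = np.abs(np.diff(disp, axis=0)).sum() + np.abs(np.diff(disp, axis=1)).sum()
    return data + lam * smooth
```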

So this work really showed the power of optimization and ushered in the so-called graph cuts era in computer vision, with applications to all sorts of different problems, but in particular it has been a really big development for stereo. Now, it also had its roots in the 1980s, in this work by Baker and Binford. So I think people in my generation or older will say

it was a really important paper in stereo but these days almost, you know, I think it's

kind of a footnote more than anything else. But they proposed--they were basically the

first to propose global optimization for stereo, but they did it a line at a time using dynamic programming. But for, you know, basically 25 years or whatever--yeah, 15 years--no one

was able to figure out how to extend it--this 1D optimization to 2D because the dynamic

programming solution didn't extend; graph cuts was basically the extension to 2D. [INDISTINCT] mentioned there was other work around the same time that came up with different formulations of graph cuts to do stereo problems. All right. So 1998,

Marc Pollefeys and his colleagues published a paper at ICCV on doing 3D reconstruction from basically a video camera, completely uncalibrated. The idea is you kind of fly

around, in this case, or you walk around the scene with a video camera and you get out a

3D reconstruction. You don't have to know anything at all about the video camera and

none of the parameters, no calibration, nothing. So, this is very inspirational work and it

was really the, you know, the same way that Tomasi and Kanade work was influential. I

think this was very--also very influential. It's also a combination of many research advances.

The paper itself was on--actually a relatively narrow topic of self-calibration, finding

intrinsic camera parameters from a sequence of images. It didn't at all talk about the

stereo, texture mapping or anything. But this is one of the key pieces that enabled them

to put this whole system together and also the community--the vision community as a whole

had been working through the '90s on solving this self-calibration bundle adjustment problem

and this is really one of the culminations of that work. It showed you could put it all

together in a really compelling system. So it's a big milestone for the field. All right.

So, 1999, Blanz & Vetter. SIGGRAPH published this amazing paper where they showed from

a single photo you can create a recon--3D reconstruction that looks like this. So basically

a laser scan a quality 3D model of the person from a single photo. So harking back to the

shape of shading results, you know, there's no way that shape of shadings can be able

to do this. So, this is pretty incredible. Now, what it's doing is not using just image

information. It's also using a database of 3D models. So what they did was they scanned 200 faces--200 people--and they have basically a vector space of 200 scans; then through linear combinations of these scans they can produce any other person who looks anything like any of these 200 people. So that would basically be their model. So instead of having to reconstruct the whole face, all they have to do is find 200 coefficients that are the

best combination of these input people for this output person. And in fact, they're able to compress it a little bit more using principal components, and I think they only recover around 100 coefficients.
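A minimal sketch of that morphable-model idea--a face as the mean shape plus a small number of principal-component coefficients--is below. The hard part of Blanz & Vetter, fitting those coefficients to a single photograph through an analysis-by-synthesis rendering loop (and doing the same for texture), is not shown; the function names and data layout are assumptions.

```python
import numpy as np

# Morphable face model: learn a low-dimensional linear basis from scanned faces,
# so any new face is the mean plus a coefficient vector times the components.

def build_model(scans, k=100):
    """scans: (N, 3V) array, each row a scanned face (V vertices, flattened xyz)."""
    mean = scans.mean(axis=0)
    U, s, Vt = np.linalg.svd(scans - mean, full_matrices=False)
    return mean, Vt[:k]                      # mean face and k principal components

def synthesize(mean, components, coeffs):
    """New face shape from k model coefficients."""
    return mean + coeffs @ components
```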

Anyway, there's similar work from around the same time on Active Appearance Models, but I think these results are really unrivaled even now. I don't

think there's better results in 3D reconstruction from faces. But if you look closely you'll

notice that, you know, the details may not exactly line up, maybe Tom Hanks's nose is

not quite as long and it sort of depends on your input space and how well the new photo

lies in the space of input photos. But it gives a very compelling argument for using

so-called model-based approaches for 3D reconstruction which used strong priors on the type of geometry

you're trying to reconstruct. So 1999 was also a big year in other ways. There's the Digital Michelangelo Project, from Marc Levoy's group at Stanford and many other collaborators. And basically they set out to scan a bunch of statues and other things, you know, in Italy, and the centerpiece was Michelangelo's David. And this was a huge scanning effort,

probably the biggest scanning ever undertaken in terms of amount of resources and time and

planning. It basically took a team of a dozen--I think there were 20 people involved total,

working, scanning all night long. So first of all, they had access to the statue only at night, because it's displayed in a museum--I think it's at [INDISTINCT]--and the museum is closed at night, so they were allowed to scan all night long and had to stop in the morning. And so that's basically what they did: they scanned all night long for

a month to get this model and it was, you know, roughly, you know, 20 people involved.

It's a huge, huge amount of work but the result is beautiful, they have--they captured the

sculpture basically almost every single nook and cranny at a quarter of a millimeter precision.

And so you can zoom in--on the right you're seeing--you're zooming into the eye of the

statue and you can zoom in further to this like ridge and see the details of the mesh.

And there's also other scanning work, like the PHR projects, around the same time. But this really shows, you know, what you can do with 3D laser scanning technology and also how much work it is to use it. Okay. Same year, so, camera tracking--I

was looking for the canonical reference on bundle adjustment. It's actually really hard

to find a canonical reference on computer-based bundle adjustment, and so I went with

this. This is really the first time practical bundle adjustment systems were used in an application. So basically, for almost every special effects scene these days, there's a combination of real footage and computer graphics, and doing that right involves figuring out precisely where the camera is in the shots, so you can render the computer graphics in the same place. And so this process is called Match Move, and it requires bundle adjustment--reconstructing the positions of all the cameras in all the images.
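For concreteness, here is a minimal sketch of the objective that bundle adjustment minimizes--the sum of squared reprojection errors over all cameras and all points. Real systems minimize this jointly over cameras and points with sparse nonlinear least squares (e.g., Levenberg-Marquardt); that machinery is omitted, and `project` and the data layout are assumptions.

```python
import numpy as np

def reprojection_error(cameras, points, observations, project):
    """Sum of squared reprojection errors.
    observations: list of (cam_idx, pt_idx, (u, v)) image measurements;
    project(camera, X) -> predicted (u, v) under the chosen camera model."""
    total = 0.0
    for ci, pi, uv in observations:
        pred = project(cameras[ci], points[pi])
        total += float(np.sum((np.asarray(pred, dtype=float) - np.asarray(uv, dtype=float)) ** 2))
    return total
```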

And so the first truly automated methods were developed in the late '90s for doing this, for feature tracking in the images and doing the 3D analysis. All right. And so there's

some pioneering commercial systems in particular MatchMover came out of INRIA [INDISTINCT]

led by [INDISTINCT] and Boujou grew out of Oxford Fitzgibbon [INDISTINCT] and so forth.

There's also concurrent work in photogrammetry around the same time. Let's see, so I'm running a little bit low on time. I'm just going to skip over some of these. There are some important benchmarks

for stereo that came about in 2001. Some really elegant work for doing stereo with non-Lambertian surfaces in 2002. 2003 is the year I'd like to say stereo equals laser scan,

so this is the first stereo--passive computer vision--method, no laser scanner, that could produce 3D geometry that looked as good as a laser scan and arguably is almost as

good. And this was Carlos Hernandez's work. And Carlos as you know, he's at Google now.

So, very cool work which, you know, inspired a lot of subsequent work, which I'm going to just kind of skip over, including Joshua's work and others in the community. 2006,

first work on reconstructing 3D information from photos on the Internet. So this is the Photo Tourism work, which led to Microsoft Photosynth. And the really key ingredients to make this work were SIFT--these reliable feature points--and the progress in photogrammetry over the years, along with similar work by other authors. 2006, this

is important--there's a really cool effort of creating 3D models of a city by just driving a car around. And so it's kind of like Street View LiDAR, but just using raw computer vision measurements instead of LiDAR. In 2008, there's a great paper

by Zebedin et al., from Bischof's group. And there they show basically an automated version of [INDISTINCT] Façade. So, from aerial imagery, it creates this really, really impressive 3D model of a city completely automatically. 2009, Sameer did his Rome-in-a-Day work. And

this is sort of city-scale structure from motion from image collections. And sorry I don't

have more time to talk about these. And, of course, 2011 is Kinect. So body pose from

a single depth image, 200 frames per second, fastest selling electronics device in history.

In the first three months, I think, they had double the sales of the iPad when it came out. So not bad for a computer vision system. Of course, it has other things in it besides computer vision, but we'd like to take credit for it. All right. So in 2013, there is this

fantastic work on, basically, Digital Michelangelo from a few photos. So instead of having to scan all night long, this group was able to just take a few photos and create a model of the David that has the quality of the one the Levoy group took a month to make. In

2015, there's this wonderful project on creating a public repository of the world's geometry

where basically they scan everything--people, places and scenes, things--and put it all up on a publicly hosted website. 2015, this work, which I'm sure you all remember, on how you reconstruct yourself from your photo collections. So, it took all of your photos--your poses, expressions, body shapes over time--and created a perfect 3D model of you and all the ways that you move. 2020, it took a lot longer to solve this Inverse

CAD problem. So the idea is to create what an architect would call a 3D model--not just a point cloud but an accurate 3D model which captures the salient features of the scene, walls, floors, a small number of polygons--from rough point clouds; it fills in the gaps, and is easy to edit and modify. 2020, we finally solve the Visual Turing test. So this

is creating a 3D computer vision system which creates models which are truly indistinguishable

from reality where you're able to move however you want through the scene and it still looks

perfect. So it looks just like being there. And finally, 2030: the computer becomes human. That's it. Thank you, guys. >> Thank you very much for bringing us together.

for workshop in Italy a week ago not realizing how much time this would take. So, basically,

you know, I'm very familiar with a lot of computer vision, in 3D computer vision in

particular but really trying to give a history of this problem, you know, goes back decades

and so I had to consult a bunch of people to really get this right. So, I immediately

sent email to a number of the pioneers, you know, many of you recognize on this list like

Schreiner and Berthold Horn and Andrew Zisserman and so forth. Basically, you know, all the

people you think of about doing pioneering work in computer vision to try to get their

input on what were--are the breakthroughs in computer vision over the years in the 3D

domain. You want to close the door? >> Yeah.

>> SEITZ: And so, of course, everyone send you different lists and, you know, all of

the structure motion guys sent me structure motion references and all the, you know, for

the metric guys like Horn and Woodham sent me, you know, shading references so everyone

had a different interpretation of what is a landmark or a breakthrough so I had to kind

of call from this some subset of things. And I think everyone of these people will be horrified

by the talk I'm about to give because it only covers a small subset of the references that

they gave me for what they think are the pioneering references. So, if you want to cover all 3D

vision, it's actually very broad area on so I had to select only a small number of things

from each subfield. And so, because I'm not going to cover all of these stuff, you know,

many of these emails I got from these folks were just terrific based--you know, really

detailed histories over the last several decades so I'm also going to post online their responses.

So, what does Berthold Horn think were the chronicle references in 3D vision and so forth.

Okay. So, in addition to these individuals who gave me some information, I also-there's

lots of good information online in the papers and so forth and here's a--here are a few

links if you're interested to know more about the history of photogrammetry or bundle adjustment

and so forth to some good materials online. All right, so disclaimer, I've already said

just a little bit but this list is very incomplete for sure. It's also somewhat biased because

it's my own interpretation of the highlights in 3D computer vision over the years. Although,

it's informed by a lot--by, you know, experts by readings and just, you know, my experience

over the years. I tended to select for high impact result--results like the things which

I think had the biggest impact either in practice or on the research community and so there's

fewer kind of theoretical results although there are a few in this list. And of course,

I'm omitting a lot of really important results just for--to fit this in to an hour. Okay.

So, with that, let me jump into it. So, first of all, a bit of so-called pre-history. So,

here are a few really important people over the years, the centuries up to now. So, Leonardo

Da Vinci is not really the--he's by no means the first one to know about perspective but

he articulated it especially well so I love this quote, perspective is nothing else then

to seeing of an object behind a sheet of glass, smooth and quite transparent. On the surface

of which, all things may be marked or behind this glass. So, he had a concept of a flat

image plain. All things transmit their images to the eye by pyramidal lines--so rays and

these pyramidal lines are cut by said glass. The [INDISTINCT] of the eye, these are intersected,

the smaller the image of they possible appear. So, you know, really beautiful description

of perspective and had, you know, all the key ideas at that time. Okay? So, another

really important figure in 3D is Lambert and you probably know him best through Lambert's

Law which is--which is really model how light gets reflected off of it after hits the surface

and it gets scattered in all directions per map surfaces and really, you know, this has

been the vain of computer vision algorithms because almost everything assumes Lambertian

reflectance and therefore breaks down for anything that's shiny or specular or non-Lambertian.

So, we really have Lambert to blame for all these troubles in the field. But also, it

turns out he was the first person who proposed inverse projection. So basically, figuring

out where the camera was from one or more images and that's also known as camera resectioning

photogrammetry and literature. Next on this list is Galls and Galls is, you know, hugely

important for all sorts of things but one big element among those is Lee Squares and,

of course, Lee Squares is really fundamental to almost all of our algorithms these days

both many algorithms as well as intuitive nonlinear algorithms and so it's been central

to photogrammetry as well. Now, Wheatstone is maybe a bit more controversial to sort

of put in the same line as these other three but he was the first to really articulate

the principles of stereo and basically the idea that stereo consist of horizontal parallax

between two images--between retinal projections. And it's kind of surprising that this wasn't,

you know, known and so in the 1800s and it--you know, maybe it was. I mean, there's some evidence

that Leonardo knew about stereo although he didn't explain it as parallax. But what's

interesting about Wheatstone's work is that he was able to basically prove it by creating

images--synthetic images and putting them in a stereoscope. He invented the stereoscope

which you could then look through and see a fused 3D image. So basically, that's a proof

of this parallax concept. And, of course, stereo has been central to a lot of major

advances in 3D vision. Okay. So, in parallel with computer vision and actually predating

a lot of computer vision is photogrammetry. So, photogrammetry is basically the science

of measurement from optics--from some optical observation. I'm sort of [INDISTINCT] sidestepping the word image because photogrammetry in some sense predated photography. So the mathematical foundations go back to the early 1800s, so [INDISTINCT] the invention of projective geometry, terms--you know, things like the horopter, multi-view relations and so forth. And I'll cover some of these later in this talk in the context of other computer vision breakthroughs

but the mathematics goes back a long ways. Practical use, people are interested in photogrammetry

as part of map making and [INDISTINCT] and basically the idea is that if you're in an

imperialist country, you want to make sure that you can map the boundaries of nation

that you're conquering and so forth so people have long been interested in map making and

one of the--one of the great, maybe the biggest breakthrough in photogrammetry is this survey called the Great Arc of India that the British did--so when the British were surveying

India, this turned out to be a huge undertaking. So, the idea was basically to create a 3D

map of India. And--so they took these huge theodolites that weighed, literally, a ton at a time, through the jungles and [INDISTINCT] to try to take their measurements from different

places and they were able to show that the--that Mount Everest was actually the highest mountain

on earth so it was thought before then that the [INDISTINCT] was higher but--so this was

one of--one of the things that photogrammetry has to claim as a big advance. Anyway, so

the basic--the way this was done was taking measurements from these different theodolite triangulations and kind of putting all these triangulations into a big manual bundle adjustment

so they had all these measurements between pairs of, you know, sensors if you will and

they had to kind of bundle them all together by writing up the equations and solving for

the relative positions of objects on paper. So--but this was an example of bundle adjustment

in the 1800s. Of course, this was advanced by the invention of photography around the same time, and they started putting these things together in the late 1800s. Okay.

So now, the next big event--you know, I'm skipping a lot of big events here actually

in photogrammetry but when the computers came about, computers that were big enough to do

large least-squares computations, photogrammetrists immediately saw the potential to use this--to use computers for photogrammetry applications. So, in particular, someone named Duane Brown pioneered

the use of computers for bundle adjustment and, you know, basically triangulation efforts

in the 1950s, and so you can really date computer-based bundle adjustment back to the 1950s,

although at the time they weren't doing any image analysis, they were just using the computers

to solve the equations, not to extract features or do any measurements on the images themselves.
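The core computation, then and now, is a big least-squares problem over the unknown positions; in modern notation (a generic statement, not Brown's exact formulation) bundle adjustment is roughly

$$
\min_{\{R_j,\,t_j\},\ \{X_i\}} \;\sum_{i,j} \bigl\| \, x_{ij} \;-\; \pi\!\left(R_j X_i + t_j\right) \bigr\|^2 ,
$$

where $x_{ij}$ is the observed measurement of point $i$ from station or camera $j$, $R_j, t_j$ are the unknown orientations and positions, and $\pi$ is the projection into the measurement space.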

Okay? So, I called that all pre-history because there was no computer vision, there's no analysis

of the images using computers. So let me talk about what were the first uses of computer

vision for 3D measurement. And so, as far as I could tell, the first example of 3D computer

vision is stereo, and this is work by this crazy inventor named Gilbert Hobrough. And basically, he was--at the time, the way photogrammetry worked, map making worked, is you had operators

who would look at a pair of images and find corresponding points by hand and then specify

the displacements and then feed this into the computer to solve for the positions of

these points. And so, Hobrough observed, well, wouldn't it be great if you could automate this manual correspondence task, because that was really the bottleneck. And so he invented

this machine to do so and so this is really an analog stereo machine, analog hardware

and it's pretty--it's pretty incredible. So, I wasn't able to find original paper on this

but the patents are online so here's an image from the patent. So, what you're seeing here

is basically a table with two image slides on it, transparencies, and there is a

CRT down below which is scanning basically a spot across both images in parallel. And

so it's illuminating a particular pixel so to speak in each image and the intensity of

that pixel is being picked up by a sensor which is measuring intensity in both images

of the spot and then there's a correlator which compares those intensities and decides

whether it's the same or not. If they're the same then it knows the disparity based on

the relative positions of the spot and say--and so there's a match that gives you the disparity

and the depth. If they're not, then based on the difference, it will shift the spot in one of the images and try again. Okay? So--and it will do this over and over again.
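Not the analog machine, of course, but here is a rough digital sketch of the same correlate-and-shift idea--assuming two hypothetical rectified grayscale arrays `left` and `right`, so that correspondences lie on the same scanline:

```python
import numpy as np

def block_match_disparity(left, right, window=5, max_disp=64):
    """Brute-force disparity search: compare a small window around each left-image
    pixel with shifted windows in the right image and keep the best match --
    an (enormously simplified) digital analogue of Hobrough's correlator."""
    h, w = left.shape
    half = window // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
            best_d, best_err = 0, np.inf
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                err = np.sum((patch - cand) ** 2)   # dissimilarity of the two spots
                if err < best_err:
                    best_err, best_d = err, d
            disp[y, x] = best_d                     # disparity, which gives the depth
    return disp
```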

So, it's an analog implementation of stereo, it's automated, and he had a series of innovations on top of

this and basically invented [INDISTINCT] stereo on a pyramid, you know, back in the '50s and

for those of you who know a lot about computer vision, you'll know what I'm talking about, but--yeah, this was rediscovered in computer vision basically, you know, three decades later. But pretty amazing stuff. All right. So, I'd say that was not very well known--[INDISTINCT] almost no one I've talked to in computer vision has ever heard of the previous

slide. This one is very well known. So, Larry Roberts, his PhD thesis in 1963 at MIT, he

invented what's known as the Blocks World, this phase of computer vision where you basically take an image of a couple of blocks, and then based on extracting where the edges are--which he did using a gradient operator on a computer, one of the first implementations of computing gradients on computers--he was able to extract edges and lines and then based on the relative

orientations of these lines, he was able to figure out their configuration in 3D using

a line labeling technique and came up with a way to render the thing from a new viewpoint

as you're seeing in the far right using presumably some kind of vector graphics. Anyway, so this

is--this is actually an incredible achievement in 1963, to do all this stuff, to implement

the whole thing and have it work even if it only works for two blocks. But--so, Larry

Roberts is often considered the founder--the father of computer vision, you know, this

title [INDISTINCT] been thrown around. So I went to his Wikipedia page to sort of see

what is said about him and actually, his Wikipedia page doesn't even mention computer vision.

So it turns out after his [INDISTINCT] he stopped doing computer vision altogether--or maybe he was disgusted by it--and instead he started working on networks, and he was

one of the, you know, one of the--he was the founder of the ARPANET and is considered one

of the forefathers of the Internet. So, that's what his Wikipedia page talks about instead of the computer vision. But anyway, this was a pretty impressive breakthrough and, you know, one

of the first results in 3D computer vision for sure. So, by the way, if people have questions,

feel free to interrupt. You know, there may be some people in the audience who are aware

of things I'm missing and, you know, any feedback is great. Okay. So, this really fueled--Larry Roberts' work really fueled essentially a generation, or at least a decade, of work on

Blocks World Modeling. And the culmination of this or one of the best-known examples

was in the late '60s, early '70s, there's this demo of a system at MIT where they tried

to produce a robot which would basically take pictures of--in real time of blocks, and then

based on that construct a plan for building a structure that it just took a picture of

from another set of blocks. So, basically, you have a set of blocks stacked in a particular

way, the robot tries to reproduce that exact stacking configuration from another set of

blocks by figuring out the structure of the blocks as well as planning it, making a plan

to navigate the robot. And this is incredibly ambitious because it basically involves solving

the computer vision problem, solving the planning problem, solving the AI problem in some sense,

at least the manipulation problem, and it turned out that most of this was doable. The

weakest link was actually low-level edge finding in images. So kind of a disappointing weak

link and this really inspired people to work on edge detection which they did for many

years after that. So this really inspired kind of a dive into low-level computer vision

and, you know, all the kind of Canny and similar results came out of this. All right. So, fast

forward, actually this is around the same time actually. So, Berthold Horn, vision pioneer,

so he did his thesis at MIT and in 1970 he published a thesis and this was on Shape from

Shading, and this was one of the first methods that said--showed how from a single image

you can infer the shape of an object. And basically, the idea, as you can somewhat see

in this slide, depending on the projector, which is not so good on orange, but there

are these curves superimposed on the image and these are so-called characteristic curves

or characteristic strips where if you know the depth of the scene along one point on

the strip, you can figure out the depth along any other point on the strip. So, you can't

figure out the absolute depth of this--of the strip. But if you know one point, then

you can integrate out the depth along every other point. And he also showed that if you

have a network of these strips and you have certain boundary conditions then you can actually

integrate out the surface of the scene. So you can really recover the shape, in other

words the surface orientation and the mesh of the face from a single image assuming--making

lots of assumptions about constant reflectance and so forth. So this was--this was really

a key result. It turns out the astrophysicists were also doing similar things around the

same time around--slightly earlier, around mid 1960s, there's a field called--known as

photoclinometry. And they were--their interest was they wanted to determine the shape of

planets from an image of that planet. And so there's similar work. It's typically for profiles, so 1D instead of full images. And Horn is really credited as one of the first to do really 2D shape from shading. Now, for each of these landmarks

I'm talking about you can sort of trace some of the origins back many years before. So

Ernst Mach actually formulated some of the basic equations behind this, the image irradiance equation, almost a hundred years before. But he actually concluded that inverting this

equation, recovering the shape was impossible. So, it's kind of cool at least that Horn was

able to prove him wrong although I guess it took a hundred years to do it.
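For reference, the relationship being inverted here--written in the usual modern notation, not quoting Horn or Mach--is the image irradiance equation, which for a Lambertian surface under a single distant light is

$$
I(x,y) \;=\; R(p,q) \;=\; \rho\,\frac{1 + p\,p_s + q\,q_s}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}},
\qquad (p,q) = \Bigl(\tfrac{\partial z}{\partial x},\,\tfrac{\partial z}{\partial y}\Bigr),
$$

with $(p_s, q_s)$ the light direction and $\rho$ the albedo: one equation per pixel but two unknowns per pixel, which is why characteristic strips, boundary conditions, or additional images are needed.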

>> Well, but I mean, then wasn't he just saying that the problem was ill-posed and Horn

regularized on top of it? >> SEITZ: Well, so in some sense, yes. But

he also thought that it was too unstable to even--I'd have to actually have a look at the original quotes from Mach, so, you know, I'm not sure. But the implication was that Mach didn't think it was feasible because you wouldn't be able to get boundary conditions

or whatever. I don't know. >> Okay.

>> SEITZ: It's a good question. It--this may be too strong to include that one. But he

had a quote saying it's basically impossible. All right. So, another really amazing result

from the '70s. So this is Bruce Baumgart's PhD thesis in 1974. He was the first to do

shape from silhouettes. So the idea is that you have some scene and you have a set of

photos and you're somehow able to extract the silhouette of the object, the boundary

of the object, separate the foreground from the background. In this case, you see--this

is an image of a doll. They're pretty low quality. These are scanned from his thesis.

And so, from this doll, here's a silhouette of the doll. Here's a silhouette of the doll

from a different viewpoint. You can see it's actually missing the head and this points

to the difficulty of actually getting good silhouettes. And in this image it's also missing

the head. And so the basic idea is that if you know the camera positions for all these

different views, then you can basically back project the images. So, this is another great

slide from his thesis. And so the idea is that you know where the camera is, you know

the image, basically this image plane sitting in front of the camera. If you trace a ray

from the center of the camera through every point on the silhouette, that gives you a

conical volume in which the scene must lie in 3D. And so from each view you get a different volume which constrains the scene, and the intersection of those volumes gives you an approximation of the scene.
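Baumgart intersected polyhedral cones, but the same back-project-and-intersect idea is easiest to sketch today as voxel carving. A minimal sketch, assuming you already have binary silhouette masks and 3x4 camera projection matrices (hypothetical inputs):

```python
import numpy as np

def carve_visual_hull(masks, cameras, grid_points):
    """Keep the candidate voxel centers whose projections fall inside every silhouette.
    masks:       list of HxW boolean silhouette images (True = object)
    cameras:     list of 3x4 projection matrices, one per image
    grid_points: Nx3 array of candidate voxel centers in world coordinates"""
    keep = np.ones(len(grid_points), dtype=bool)
    homog = np.hstack([grid_points, np.ones((len(grid_points), 1))])   # Nx4
    for mask, P in zip(masks, cameras):
        proj = homog @ P.T                     # Nx3 homogeneous image points
        z = proj[:, 2]
        u = proj[:, 0] / z
        v = proj[:, 1] / z
        h, w = mask.shape
        inside = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        ui = np.clip(np.round(u).astype(int), 0, w - 1)
        vi = np.clip(np.round(v).astype(int), 0, h - 1)
        keep &= inside & mask[vi, ui]          # must fall inside this silhouette too
    return grid_points[keep]
```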

And so this is the idea. And here's a figure which impressively shows that kind of intersection, where you see the silhouettes at the end and the back-projected

intersection. So, here's the results that he got. They look a little bit crude but they're

actually pretty nice. So here's some--here's the back side of the doll. You can see the

head is missing, again, because it was missing in the silhouette. And that--this shows the

limitations of shape from silhouette methods. If you--or if you miss a part of the object

in any view, it disappears in all the views. So this figure--so each row is a different

view, and the two columns correspond to a stereo pair. So if you're able to fuse these

images so that when people can do stereoscopic fusion, you might be able to better see the

3D structure from the image on the far right. Well, it's kind of cool actually that in this thesis he came up with the idea of doing a stereo reproduction like this. So, just think of how much you would have had to do on a 1974 computer in order to do this. He had to do the edge detection, he had to do the boundary following, he had

to do the back projection into 3D, and he had to intersect the polyhedral models to do the back projection.

Just that last part could have been a PhD, I think, in the 1970s, and he had to invent

all these winged-edge data structures, some of which are still in use today, to do the geometrical modeling. So, a very impressive piece of work. And actually, his thesis is online, so if

you go to this link at the bottom. I tried to include links as much as I could to information

online if you're interested in digging up more, these blue links on the slides. And

I'll post these slides as well. Okay. So here's another example, I think, which looks a little

bit better, images of a horse, and you can see the horse model which looks a bit more

like a horse. All right. So, I'm not going to go into details on these just because I'm

probably not going to have enough time. But there's a lot of follow-on work on shape from

silhouettes, and if you're interested, here's more where you can read what the vision community

has done, which builds on this basic stuff. All right. So, I talked about shape from shading,

Horn's work. But to this day actually, the--really the community has not been able to successfully

make--get good results out of shape from shading. Maybe as Mach predicted, it's very unstable

and it's hard to get accurate results. You have to make very strong assumptions to make

it work. And really, the only reason for including Horn's work as a landmark is--in my mind,

is that it led to photometric stereo, which is this slide. So the key idea is that if--instead

you just take one--instead of taking just one image, if you take say, three images of

the same scene from the same viewpoint under different illuminations, those three images

give you enough cues that you can reliably extract shape. And this works really well.
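The Lambertian special case is the easiest one to write down (as he explains in a moment, the original formulation also handles known non-Lambertian reflectance via a calibrated reference object): with three or more images under known distant lights, every pixel gives a small linear system for the albedo-scaled normal. A minimal sketch, assuming grayscale images and known unit light directions:

```python
import numpy as np

def photometric_stereo(images, lights):
    """Lambertian photometric stereo.
    images: list of K HxW grayscale images, same viewpoint, different lighting
    lights: Kx3 array of unit light directions
    Returns per-pixel albedo and unit surface normals."""
    h, w = images[0].shape
    I = np.stack([im.reshape(-1) for im in images], axis=0).astype(np.float64)  # K x N
    L = np.asarray(lights, dtype=np.float64)                                    # K x 3
    # Per pixel: I = L @ (albedo * n); solve for all pixels at once in least squares.
    G, *_ = np.linalg.lstsq(L, I, rcond=None)                                   # 3 x N
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-12)
    return albedo.reshape(h, w), normals.T.reshape(h, w, 3)
```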

So this is Woodham's PhD thesis. Woodham was a--Bob Woodham was a student of Berthold Horn's,

and so--and this has an interesting history as well. So, the--so, Woodham described his

photometric stereo approach in his thesis, but actually he wasn't the one who implemented

it. The first implementation was by a master's student William Silver in 1980, also at MIT.

And this--you can see by the result that, you know--so this is a reconstruction

of like basically an egg. It's not actually a real egg, it's a wooden egg. But, if the--if

you compare this model to all the other models that I've shown up to this point, the details

are, you know, spectacular. And if--on the right is a profile of the egg in black, and

there's a dotted line which you probably can't see very well. But the dotted line is the

ground truth, the correct answer, and they're basically identical. So, incredibly accurate

models. So, one of the--people in computer vision--most people I think know about photometric

stereo and Bob Woodham's work in particular. But I think a misconception is that people

think it only works for Lambertian scenes. And in fact they showed--the very first work

in his thesis and in Silver's implementation showed that you can use this for any kind

of scene, any kind of BRDF as long as it's not translucent, has to be an opaque surface.

But it can be shiny or it can be made out of whatever you want, as long as you have--as

you know the form of that reflectance function. Okay. So--and in fact, in Silver's implementation,

what he did was he measured the reflectance function of wood basically by having a reference

object whose shape was known and taking measurements under different illuminations, and used this

to reconstruct other wooden shapes. Okay. So as long as you know the reflectance function,

you can use this method. It doesn't require Lambertian. All right. So, photometric stereo,

I mean, works so well. It basically--it has inspired a lot of subsequent work in the community

and here are some of the highlights. I mean, basically each one of these landmarks has

like a series of arguably landmarks after that. You know so people who work on photometric

stereo would say, "Here are the landmarks of photometric stereo." So I tried to do that

for a couple of these subfields. But I'm not going to go into details just because I don't

have time. All right. So, okay. So, the next major landmark is something called the essential

matrix. And this was basically one of the first algorithms for recovering a scene from

two projections under perspective. And basically--this is work by Longuet-Higgins, published in Nature. It's actually cool that there's a Nature paper entitled A Computer Algorithm For... So, basically, he came up with the observation that if you have points in correspondence

between two images, they're related by a 3x3 matrix. And in particular, there's a 3x3 matrix

where if you take points represented in homogeneous coordinates XYZ--sorry, XY1--from the first

image and you multiply it by this 3x3 matrix, you get the corresponding epipolar line in

the other image. In other words, a line on which its correspondence must occur in the other image. And this mapping from points to lines can be represented by these rank-2 3x3 matrices.
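In the usual modern notation (not the notation of the original paper), the constraint is

$$
\tilde{\mathbf{x}}'^{\top} E\, \tilde{\mathbf{x}} \;=\; 0,
\qquad \tilde{\mathbf{x}} = (x,\, y,\, 1)^{\top},
$$

so $E\tilde{\mathbf{x}}$ is the epipolar line in the second image, and each point correspondence contributes one linear equation in the nine entries of $E$--which is why eight correspondences (up to scale) are enough for the linear algorithm described next.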

And you can also compute camera matrices from this matrix and so forth. And so he came up

with a very--a linear algorithm for recovering this matrix from multiple projections of points--sorry,

two images, multiple points. And this was a very, very influential paper and inspired like a

whole area of computer vision now known as multi-view geometry with books by Hartley

and Zisserman and Faugeras and sort of this explosion of interest in this area in

the 1990s. Now, it turns out that the theory or the math here can be traced back to mathematicians

notably Chasles from the 1860s. So back in the 1860s, they actually figured out that

there is a 3x3 rank 2 matrix which maps points in one image to lines in the other image.

Okay. So this was known. They didn't have an algorithm that would, you know, recover these points because, you know, the mathematicians didn't work on algorithms. But it's interesting

to note that a lot of fundamentals were known a long time ago. All right. So--and then there's

been a lot of subsequent work since then, stuff like the fundamental matrix. The essential

matrix wasn't grand enough. You had to have another matrix which is even more grand. The

fundamental matrix is basically the uncalibrated version of this; the trifocal tensor is the three-view version; and so forth. All right. So, the next major breakthrough, and this is an interesting

one, I think it's probably not too controversial for this audience but it might be controversial

for some of the others, you know, is this really--is this really computer vision? So,

this is an algorithm called Marching Cubes by Lorensen and Cline, published at SIGGRAPH in

1987. Basically, this was an algorithm to go from a volume to a surface. Very simple

algorithm on the surface--you know, at some level--but it actually involves a fair amount of careful specification to get it right. So, the way it works is, you assume you have a volume which implicitly

represents the surface. So, typically, these are signed distance functions, so positive might mean outside, negative inside, and the zeros are the surface. So, you start at a voxel in this volume which contains the surface, and the table basically tells you how to march in order to stay on the surface. So, you start at a seed point on the surface, then you start marching over the surface. And every time you march to another voxel which is on

the surface, you create a polygon, okay. And you have to figure out which polygon to create

based on where the sub-voxel iso-function would intersect that voxel. Okay? So, you

do this at sub-voxel positions to create, you know, different sorts of polygons. And there's basically a table which tells you, for every possible configuration of inside and outside corners of a voxel, the type of polygons to create and which sides of the voxel they intersect.
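A minimal sketch of the two core pieces--the corner classification and the sub-voxel edge interpolation--with the 256-entry triangle table itself left as a placeholder (take the published, corrected tables rather than retyping them):

```python
import numpy as np

def corner_index(corner_values, iso=0.0):
    """One bit per voxel corner, set if that corner is inside the surface
    (value below the iso-level); the resulting 8-bit index selects a case."""
    idx = 0
    for bit, v in enumerate(corner_values):   # 8 samples of the volume
        if v < iso:
            idx |= 1 << bit
    return idx

def edge_vertex(p0, p1, v0, v1, iso=0.0):
    """Sub-voxel point where the iso-surface crosses the edge p0-p1, by linear
    interpolation of the corner values (only called on edges whose endpoints
    straddle the iso-level, so v0 != v1)."""
    t = (iso - v0) / (v1 - v0)
    return np.asarray(p0, dtype=float) + t * (np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float))

# Placeholder: the real table maps each of the 256 corner configurations to the
# triangles to emit over the 12 voxel edges (use the published, corrected table).
TRI_TABLE = [[] for _ in range(256)]
```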

And to get this table right--in fact, they didn't get the table quite right in the original

publication. So, if you implement this, make sure you look at the updates to this table

to get it right. But--so--and there's some earlier work. There's a, you know, 1970s paper

and so forth where they tried similar things, but I think this is a big landmark,

I mean, something which is still in widespread use today. So, it's pretty uncommon in our

field as you all know, to find an algorithm which is still used, you know, more than 20

year--more than 20 years on, 25 years later and this is an example so, an important landmark.

Okay. So, when I started--I actually started working

in computer vision in the 1990s and so really, one of the things which really inspired me

was this work by Tomasi and Kanade on structure-from-motion and the basic idea is you're getting a set

of points tracked over an image sequence, and if you know the correspondence of all these points,

you could just put them, the point coordinates in rows of a matrix. Let's call it W. So,

the first row of the matrix would be the X coordinate of the first point in all the images. The second row might be the X coordinate of the second point in all the images, and then you have a series of Y's. The bottom half of the matrix

will be the Y coordinates. So, it turns out that if you have this matrix W, you can express

that matrix as a product of the motion of the camera and the shape of the scene. And

so the entries of the motion matrix are literally the camera parameters under orthographic projection. And the entries of the shape matrix are the 3D positions of those points. Okay.
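A rough numpy sketch of the factorization step, using the standard frames-by-points layout (the transpose of the layout he just described works the same way); the metric-upgrade step that resolves the remaining affine ambiguity is omitted:

```python
import numpy as np

def factor_measurements(W):
    """W is 2F x P: the x image coordinates of P tracked points over F frames,
    stacked on top of the y coordinates. Returns an affine motion matrix (2F x 3)
    and a shape matrix (3 x P), each defined up to a common 3x3 ambiguity."""
    W_centered = W - W.mean(axis=1, keepdims=True)     # subtract per-frame centroid
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    S3 = np.diag(np.sqrt(s[:3]))                       # keep the rank-3 part
    motion = U[:, :3] @ S3
    shape = S3 @ Vt[:3, :]
    return motion, shape
```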

So this is basically a one line solution to this structure-from-motion problem. Being

able to recover 3D from a sequence of images. So, extremely elegant, very, very cool, highly influential. It turns out to be optimal under the right assumptions and an affine projection model, which is a generalization of orthographic, and it inspired all sorts of follow-on work. These

are just a few extensions to, you know, multiple bodies, optical flow that knows about 3D structure,

non-rigid shapes and so forth. Now, it turns out that this hasn't really been picked up

in modern structure-from-motion, bundle adjustment, 3D reconstruction problems because it only works for affine projection and it turns out not to work so well with outliers, and there are also other caveats. But it's very cool that you can do this and I think it got a lot of people interested in computer vision. All right. So, next landmark on my list, Iterative Closest

Points is--this was an idea which was simultaneously invented by three different authors. So, I

think this is evidence that the idea was kind of in the air, that so many people came up

with it at the same time but the basic idea is that suppose you have overlapping range

scans of an object and you want to piece them together into a 3D model. You first have to

align the range scans and ICP is a way of doing that and the basic idea of how it works,

you have this green range scan and this red range scan over here, I guess you can't see

the green at all. It must be the green pixels are knocked out of the projector or something

like that. But, you have a correspondence between--so, the key thing is that if you

knew the correspondence of points on the green curve and the red curve, then, you could find

the alignment between them. There's a sort of linear way to recover the alignment of

the--of the two curves. But, you don't know the correspondence. So, the trick is for every

point on the green curve--and you have an estimate of its current rotation and translation relative to the red curve--you find the closest point to that

on the red curve and then you assume that that's the right correspondence and you solve

for the alignments and then you get better aligned pair of curves. And then, it's not

perfect because the correspondence was wrong but now it's closer so you then re-solve for

the putative correspondences and then iterate the procedure over and over again. And this actually works quite well.
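A minimal sketch of that loop, assuming two Nx3 and Mx3 point arrays, a brute-force closest-point search, and the standard SVD solution for the rigid alignment at each step:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (corresponding rows), via the usual SVD / Procrustes solution."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, iters=30):
    """Iterated closest point: guess correspondences by nearest neighbor,
    solve for the rigid alignment, apply it, and repeat."""
    current = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        d2 = ((current[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matches = dst[d2.argmin(axis=1)]     # closest dst point for each src point
        R, t = best_rigid_transform(current, matches)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```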

And again, this is one of those examples where, you know, 20 years later it's in widespread use. All right. So, another really important result, from 1995,

is work by Shree Nayar's group on Depth from Defocus. So, here you're seeing a real-time

range scan system. One of the first of its type--of its kind and what you're seeing is

on the left, there is a cup pouring milk into another cup. And on the right is a--is a depth

map computed in real time. And again, this is, you know, this is 15 years old or more

and basically, what it's doing is it's taking two images with different focus blur. So,

the two images are--have different focus. One is out of focus in a different way than

the other one because they're using different aperture sizes and there's a different--no,

I think it's actually focus settings. And based on the relative blur, you can figure out the depth of the scene.
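The textbook thin-lens relation behind this (a simplification, not the paper's exact sensor model): a point at depth $u$ comes into focus at distance $v$ with $1/f = 1/u + 1/v$, and if the sensor actually sits at distance $s$ behind a lens of aperture diameter $A$, the blur-circle diameter is roughly

$$
b \;=\; A\,\frac{|\,v - s\,|}{v},
$$

so two images taken with different, known focus settings give two blur measurements, which is enough to solve for the depth $u$.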

Okay. So, Nayar's group wasn't the very first to invent this idea of depth from defocus, but this was the

first, I think, really compelling system that showed how it could be practical. And another

innovation here was--typically when you take images with different focus settings--and you can try it with your own camera--when you change the focus, the image actually--the

magnification of the image changes a little bit. And so, they came up with a way to make

this really work reliably: they had to come up with a way to avoid this, so they used a notion called telecentric optics which allowed you to have two different camera paths with two different focus settings which produced images with the exact same

magnification. And so, now, Nayar also contributed this idea to the computer vision community

and it made a big difference. So, this implementation, it ran at 30 hertz accurate to, you know,

.3% error, very impressive again from the 1990s. All right. So, 1996, this was work

by Paul Debevec and collaborators at Berkeley. This is a famous video that Paul made. There

he is running. And he's holding a model of the Campanile--there's actually sound on this of, like, bells ringing, the Campanile bells ringing. So, this is the model and now you're

seeing the rendering of his reconstruction and the Berkeley campus modeled in 3D. You

can think of this as kind of a predecessor to Google Earth and similar products, you

know, that tried to do really high accuracy in 3D modeling. And I think it's inspired

a lot of--a lot of efforts in this area. So, going back and forth from the real footage

to the 3D model. I feel dizzy watching this. But it's so cool. So, I think this video really

showed, I think, people what was possible in terms of realistic, really highly realistic

3D modeling from photos. So definitely a breakthrough, landmark result in the field. So, in the next images you'll see--yes, you can see the real and the synthetic one right next to one another, especially in the model in that case. So, here are the input photos of the tower and some

of the--and the photos of the environment at Berkeley campus. And here's the model that

they recovered. And you could see that the models actually are not very complicated.

For example, the buildings on the bottom are just these boxes, but when you texture map it, it

looks really good. All right. So, this paper is important for a number of reasons. So,

it really--well, it introduced--it didn't coin the term image-based modeling; that came from Leonard McMillan's work. But it introduced the concept of image-based modeling

in the graphics community. And there is, you know, some people argue that this work didn't

actually have a lot of novelty. It was just an impressive system. But actually there were

some really key innovations. So, for example, view-dependent texture mapping, this is really

the first use of it, which is now used routinely in Street View, for example. And was really

the foundation for the Utopia work here. The model-based stereo, the idea of recovering

stereo--of recovering geometric--crude geometric models and then recovering height fields on top of

them, these guys came up with that and published that for the first time. And, of course, it

inspired a lot of interesting products that have come out of it. All right. So, same year,

range scan merging, also known as VRIP--this is, like Tomasi-Kanade, an elegant result. Really cool method--so basically the idea is that you've somehow managed to align

your range scans that now you want to fuse into a 3D model. So, how do you create 3D

model from the aligned range scans? You want to create one consistent watertight mesh from

all these partial scans. And the key idea here of this work is that there's basically

a one-line solution to it again. So, if you represent the surface not as a mesh but as the level set of a 3D volume--where the surface is at zero and, basically, every point of the volume stores its distance to the scan--and you represent every scan as this

signed distance function in the volume. Then if you simply add up all the signed distance functions, one from each scan, and take the zero level set of the result, that gives you the optimal surface in a least-squares sense.
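Schematically--the published method actually uses per-scan confidence weights rather than a plain sum--the fused field and the extracted surface are

$$
D(\mathbf{x}) \;=\; \frac{\sum_i w_i(\mathbf{x})\, d_i(\mathbf{x})}{\sum_i w_i(\mathbf{x})},
\qquad
\text{surface} \;=\; \{\mathbf{x} \,:\, D(\mathbf{x}) = 0\},
$$

where $d_i$ is the signed distance to scan $i$; the zero level set is then pulled out with something like the Marching Cubes algorithm from earlier in the talk.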

Okay. So, a very elegant result, and this was basically the method of choice--still the method of choice for a lot of people. More recently, competing methods like Poisson surface reconstruction have come out and there are people that are

trying to switch to those, but for more than 10 years this has been really the standard

range scan merging method. Okay. So, of course, I have to include my own PhD work

here. That's why I have to give this talk as opposed to someone else. So, I can inject

my own piece of work. So, 1997--well, not just me but Faugeras as well--came up with multi-view stereo. And these actually--arguably, multi-view stereo dates back further. Photogrammetrists

had some similar methods for doing multi-view stereo in a decade previously. But in computer

vision, these were kind of the main methods which came out in the late '90s for reconstructing

high quality 3D models from a set of multiple photos, not just a pair of images. And some

of the key ideas in this work, both the space carving work that I participated in with Kiriakos Kutulakos and the level set work of [INDISTINCT] and Faugeras, is basically to reconstruct the

3D shape directly instead of matching images, doing it in 3D space, also a visibility model.

And it turns out that, you know, figuring out which point is visible in which images

is difficult. But if you do it in scene space right away, you can solve this problem. And

these were basically the first methods that described how to solve this visibility problem

in a principled way with provable convergence properties and so forth. Okay. So these were

important results in 3D in computer vision. So, of course, you're all familiar with graph

cuts or at least vision folks and they're really--the first paper that introduced graph

cuts to computer vision, stereo in particular, was the Roy & Cox paper--ICCV paper from 1998. And

this basically--this is a really cool paper. Not as--I mean, the algorithms are cool but

more importantly I think that some lessons learned on these paper is really cool. So,

on the left you see window-based matching. These are depth maps which correspond to depths

of points in the image. So basically, on the left is more or less the state-of-the-art

algorithm at the time, from '98, although it's not quite--there were other, better algorithms

they could've used. But it's the sort of kind of the simple method from--and on the right

is the graph cut result which is, you know, much higher quality. And up until this point,

people had thought that stereo was a really hard problem--to get the right correspondences between a pair of images. In order to really solve this, you had to get a better and better

way of describing--of comparing images, describing the region of pixels around the point of interest

that you're trying to recover a correspondence for. So by coming up with better ways of describing

pixels using edges and primitives and other things like that you can get better matches.

Now, graph cut work basically said forget about all of that. Take the simplest possible

way of comparing images, basically taking the difference of a pixel in the left image and a pixel in the right image--brain dead--but use a really good optimization algorithm.
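The kind of energy involved--a generic sketch, not the exact formulation of that paper--is

$$
E(d) \;=\; \sum_{p} \bigl|\, I_L(p) - I_R(p - d_p) \,\bigr|
\;+\; \lambda \sum_{(p,q)\ \text{neighbors}} V\!\bigl(d_p,\, d_q\bigr),
$$

a brain-dead per-pixel data term plus a smoothness term over neighboring pixels, with the whole 2D problem handed to a min-cut / max-flow solver rather than being optimized one scanline at a time.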

It could beat all this other stuff--just knock it out of the water. Okay. So this work

really showed the power of optimization and ushered in the so-called graph cuts era, with computer vision applications to all sorts of different problems, but in particular, it has

been a really big development for stereo. Now, it also had its roots in the 1980s. This

work by Baker and Binford--so I think people in my generation or older will say it was a really important paper in stereo, but these days, you know, I think it's

kind of a footnote more than anything else. But they proposed--they were basically the

first to propose global optimization for stereo, but they did it one line at a time using dynamic programming. And for, you know, basically 25 years or whatever--yeah, 15 years--no one

was able to figure out how to extend it--this 1D optimization to 2D because the dynamic

programming solution didn't extend, graph cuts was basically the extension to 2D. [INDISTINCT]

mentioned there is another work around the same time, notably, Roy & Cox who came up

with different formulations of graph cuts to do stereo problems. All right. So 1998,

Marc Pollefeys and his colleagues published a paper at ICCV on doing 3D reconstruction from basically a video camera, from uncalibrated video. The idea is you kind of fly

around in this case or you walk around the scenes with video camera and you get out a

3D reconstruction. You don't have to know anything at all about the video camera and

none of the parameters, no calibration, nothing. So, this is very inspirational work and it

was really the, you know, the same way that Tomasi and Kanade work was influential. I

think this was very--also very influential. It's also a combination of many research advances.

The paper itself was on--actually a relatively narrow topic of self-calibration, finding

intrinsic camera parameters from a sequence of images. It didn't at all talk about the

stereo, texture mapping or anything. But this is one of the key pieces that enabled them

to put this whole system together and also the community--the vision community as a whole

had been working through the '90s on solving this self-calibration bundle adjustment problem

and this is really one of the culminations of that work. It showed you could put it all

together in a really compelling system. So it's a big milestone for the field. All right.

So, 1999, Blanz & Vetter at SIGGRAPH published this amazing paper where they showed that from a single photo you can create a 3D reconstruction that looks like this. So basically

a laser-scan-quality 3D model of the person from a single photo. So harking back to the shape from shading results, you know, there's no way that shape from shading would be able

to do this. So, this is pretty incredible. Now, what it's doing is not using just image

information. It's also using a database of 3D models. So what they did was they scanned 200 faces, 200 people, and they have this basically 200-scan vector space, and then through linear combinations of these scans they can produce any other person who looks anything like

any of these 200 people. So that would be basically their model. So instead of having

to reconstruct the whole face, all they have to do is find 200 coefficients that are the

best combination of these input people for this output person. And in fact they're able to compress it a little bit more using principal components, and I think they only recover around 100 coefficients.
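Schematically--paraphrasing rather than quoting the paper--shape and texture are linear combinations over the example scans' principal components,

$$
S(\boldsymbol{\alpha}) \;=\; \bar{S} + \sum_i \alpha_i\, \mathbf{s}_i,
\qquad
T(\boldsymbol{\beta}) \;=\; \bar{T} + \sum_i \beta_i\, \mathbf{t}_i,
$$

and the coefficients, together with pose and lighting, are found by minimizing the difference between the input photo and a rendering of $S(\boldsymbol{\alpha})$, $T(\boldsymbol{\beta})$, with a prior that keeps the solution near the span of the scanned faces.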

Anyway, so there's similar work around the same time on Active Appearance Models, but I think these results are really unrivaled even now. I don't

think there's better results in 3D reconstruction from faces. But if you look closely you'll

notice that, you know, the details may not exactly line up, maybe Tom Hanks's nose is

not quite as long and it sort of depends on your input space and how well the new photo

lies in the space of input photos. But it gives a very compelling argument for using

so-called model-based approaches for 3D reconstruction which used strong priors on the type of geometry

you're trying to reconstruct. So 1999 was also a big year in other ways. There's Digital

Michelangelo Project. So Marc Levoy's group at Stanford and many other collaborators.

And basically they set out to scan a bunch of statues and other things, you know, in Italy, and the centerpiece was Michelangelo's David. And this was a huge scanning effort,

probably the biggest scanning ever undertaken in terms of amount of resources and time and

planning. It basically took a team of a dozen--I think there are 20 people involved total,

working, scanning all night long. So they--first of all they had--they had access to the statue

only at night because it's displayed in a museum. I think it's at [INDISTINCT] and the

museum is closed at night so they're allowed to scan all night long and had to stop in

the morning. And so that's basically what they did, they scanned all night long for

a month to get this model and it was, you know, roughly, you know, 20 people involved.

It's a huge, huge amount of work but the result is beautiful, they have--they captured the

sculpture basically almost every single nook and cranny at a quarter of a millimeter precision.

And so you can zoom in--on the right you're seeing--you're zooming into the eye of the

statue and you can zoom in further to this like ridge and see the details of the mesh.

And so--there's also other scanning work, like the PHR projects, around the same time. But this really shows, you know, what you can do with 3D laser scanning

technology and also how much work it is to use it. Okay. Same year, so camera tracking--I

was looking for the canonical reference on bundle adjustment. It's actually really hard

to find a canonical reference on the computer based bundle adjustment and so I went with

this. This is--this is really the first time practical bundle adjustment systems were used in an application. So--basically for almost every special

effects scene these days, there's a combination of real footage and computer graphics and

doing that right involves figuring out where the camera is, precisely, in the shots, so you can render the computer graphics in the same place. And so this process is called Match

Move and it requires bundle adjusting, or reconstructing the positions of all the cameras in all the

images. And so the first truly automated methods were developed in the late '90s for doing this, for feature tracking in the images and doing the 3D analysis. All right. And so there's

some pioneering commercial systems in particular MatchMover came out of INRIA [INDISTINCT]

led by [INDISTINCT] and Boujou grew out of Oxford Fitzgibbon [INDISTINCT] and so forth.

There's also concurrent work in photogrammetry around the same time. Let's see, so I'm running

a little bit low on time. I'm just going to skip over some of these. There's some important benchmarks

for stereo that came about in 2001. Some really elegant work for doing stereo with non-Lambertian surfaces in 2002. 2003 is the year I'd like to say stereo equals laser scan,

so this is the first stereo computer vision--passive computer vision--method, no laser scanner, that could produce 3D geometry that looked as good as a laser scan and arguably is almost as

good. And this was Carlos Hernandez's work. And Carlos as you know, he's at Google now.

So, a very cool work which, you know, inspired a lot of subsequent work, which I'm trying to just kind of skip over, including Joshua's work and others in the community. 2006,

first work on reconstructing 3D information from photos on the Internet. So this is the Photo Tourism work which led to Microsoft Photosynth. And really the key ingredients to

make this work were SIFT, these reliable feature points and the photograph--the progress in

photogrammetry over the years kind of with similar work by other authors. 2006, this

is important--there's a really cool effort of creating 3D models of

a city by just driving a car around. And so it's kind of like Street View LiDAR but just

using raw computer vision measurements instead of LiDAR. This--in 2008, there's a great paper

by Zebedin et al., from Bischof's group. And there they show basically an automated version

of [INDISTINCT] façade. So, from aerial imagery, it creates this really, really impressive

3D model of a city completely automatically. 2009, Sameer did his Rome-in-a-Day work. And this is sort of city-scale structure-from-motion from image collections. And sorry I don't

have more time to talk about these. And, of course, 2011 is Kinect. So body pose from

a single depth image, 200 frames per second, fastest selling electronics device in history.

So they had in the first three months, I think, they had doubled the sales of the iPad when

it came out. So not bad for a computer vision system. Of course, it has other things besides

computer vision but we'd like to take credit for it. All right. So in 2013, there is this

fantastic work on basically, Digital Michelangelo from a few photos. So instead of having to

scan all night long, this group was able to just take a few photos and create a model

of Digital Michelangelo that has the quality of the Levoy group's model that took a month. In

2015, there's this wonderful project on creating a public repository of the world's geometry

where basically they scan everything, people, places and scenes, things and put it all up

in a publicly hosted website. 2015, this work, which I'm sure you all remember, on how

do you reconstruct yourself from your photo collections. So, it took all of your photos

of your poses, expressions, body shapes over time and creates a 3D model, a perfect 3D model

of you and all the ways that you move. 2020, it took a lot longer to solve this Inverse

CAD problem. So the idea is to create what an architect would call a 3D model from not

just a point cloud but an accurate 3D model which captures the salient features of the scene--walls, floors, a small number of polygons--from rough point clouds, fills in the gaps, easy to edit and modify. 2020, we finally solve the Visual Turing test. So this

is creating a 3D computer vision system which creates models which are truly indistinguishable

from reality where you're able to move however you want through the scene and it still looks

perfect. So it looks just like being there. And finally, 2030, the computer being human.

That's it. Thank you, guys. >> Thank you very much for bringing us together.