DMS: Software Tool Infrastructure


Uploaded by GoogleTechTalks on 28.07.2010

Transcript:
>>
Ira is the principal founder of a company called Semantic Designs in Austin. And he's been working for over a decade on tools to analyze and maintain large software systems. So, I invited him over to give us a talk about the stuff that he's been doing, to see if any of his ideas or tools could be applicable to software systems at Google. So, Ira. >> BAXTER: Hi. Good morning. I'm Ira Baxter, CEO/CTO, chief bottle washer of a small company called Semantic Designs in Austin, Texas.
We're very interested in dealing with large software systems, all right? We all see that
software systems exist. Okay, if you look at the ones that run large corporations you
discover they're actually very big. And they're not getting smaller, they're growing. And
so, the problem is kind of as indicated by this galaxy. You know, you're at some point here where you can see this very large piece of software
and it's enormous and it's getting bigger. And the question is, "How are we going to
deal with that over the long run?" So, I'm going to talk to you today about tools that
we have been designing and building for about 15 years now. Okay, we're actually some 15 years into a 5-year technical plan. Okay, to give you some
idea of what we think these tools might look like and what kinds of things you can do with
them. So, Semantic Designs, we think our goal is to change the economics of how software
is built, at least for large-scale software. Okay, if you built things by hand the way
the Egyptians did, it's going to take you 30 years to build a pyramid with 30,000 slaves.
By anybody's standards, that's truly expensive. If you want to change the architecture of
the pyramids, that's just not a good answer. Okay, that's not going to work. You need to
do something different about it. The method is obvious. Bring machinery to the table.
Okay, you need automation to make this kind of thing work. So, we say, "All right, bring
automation and what--and what would it look like?" Okay, the company was founded in 1995,
okay, up off a core idea called DMS, which I'll talk about a lot during this presentation.
Okay, and since then, we've been trying to enhance this engine, put features in it, and
facilities to help us deal with these large systems. We're small about 10 people. Okay,
but half of the staff are PhD level computer scientists with various backgrounds in operating
systems, software engineering, AI, theorem proving, so on and so forth. All right. So,
it's an interesting crop, they're a good crowd to work with. Okay. Now, a lot of companies,
a lot of investors ask us, "What do you specialize as a company? You specialize in software,
what kind? They want to know if we specialize in banking software or embedded software or
medical software." No, no, we specialize in software, okay. In big pieces of stuff, because
there's a whole lot of machinery required to deal with just the software in terms of
its scale. So, we're not specialists in any problem area. We're specialists with software systems in general. All right. Now, the tools we apply, and the activities we apply them to, are broad-spectrum as a consequence. Okay. They go from embedded systems, to avionics, to military software, to large-scale banking; it's all across the map. It's kind of surprisingly broad and you'll see a whole bunch of examples I'll go through here today. And we use these tools to carry off activities like architecture extraction. We re-architect software, changing its shape, restructuring it, right? Translating it from one language to another and
carrying out things like code generation. That's a very high-level view. So, we see
software everywhere. Interesting systems have oceans of code. People--it's easy to find
million-line systems. All right, when you find them, what you discover is they're million-line systems but they're coded in five languages, they're all connected together
in complicated ways. And they're not only running on this machine but they're spread
across other machines and they communicate through distributed methods and mailing files
and doing FTP, everything you can imagine. So, the challenges when you face these kinds of systems are: how do you build this in a timely way? This is a problem software engineering still doesn't know how to do very well. All right. How do you understand the artifact that you
built, if you're going to make any kind of changes to it? Almost every organization that
has a big piece of software will tell you that nobody here knows what the software does
because all the people that built it left or are dead, right? So, how do--how do you
deal with a thing you don't understand? What do you do? All right. How do you deal with
the quality of this stuff? If this thing is running your corporation and it's broken,
your corporation's going to hurt, right? This is a pretty strong theme for upper management.
Software quality needs to be good with your large systems or it's not good for the company.
How do you make small changes, the kind of change that people do every day? How do I hammer
on some little piece of it? Okay. And how do I make massive changes? How do I reshape
the system to meet the needs of the future? How do I switch from the mainframe-based system,
say, to an Internet-based system or go from one data model to another kind of data model.
And take two data models from two different kinds of system and merge them together so
I can get a system which is smart across two kinds of data. All right. If you want to make all these changes of any kind, you need automated tools to help. So, how do we
do that? Well, we fundamentally offer two kinds of support activities and tools to help
people carry off these kinds of--deal with these kinds of issues. The first one is analysis,
okay? Help organizations with large amounts of software kind of get it under control.
Help them understand how the pieces are hooked together; how information is flowing from one part to another, all right? And the second thing is, help them understand what the quality
levels are. Help measure; is this code good for some definition of good? Is it good because
you've tested it well? Is it good because it's structured well? Is it good because it
has a good architecture? Is it good because it's layered well? So, analysis is pretty
big. Most of our customers when they first come talk to us want to know something about
what they have. Tell me about what I have. But the question's why, all right? Analysis
is an interesting topic but nobody does it unless they are afraid of the answer. When
you go to a doctor, okay, and you're feeling ill, you want an analysis. "Doctor, doctor,
tell me why I'm ill. Tell me what disease I have." And he can give you a very long complicated definition of your disease, a very long complicated name, and he can describe it to you, okay. And if at that point you leave the doctor's office, the visit's really unsatisfactory. Nobody really wants to know the name of the disease; what they want is to put a label on it so that you can do something about it. It's the cure that matters, not the
analysis. So, analysis is done to support change. So, the last thing that we do and
I think the thing that makes us really, really unique, okay, is carrying out mass change
to large systems, and use analysis to drive change. Okay. So, do code generation, okay, from some kind of a specification, where the analysis might require, say, domain-specific analysis up front to invent a DSL so you can do code generation from that. All
right. Restructuring, take your system and re-architect it, okay, to make it more useful.
Modernization: rip out some kind of underlying piece of technology which this piece of software depends upon and replace it; remove green screens and replace them, okay, with HTML; remove a hierarchical database, replace it with a relational database. Switch from
a single threaded system to a multi-threaded system. These are all massive changes and
they're not the kind of things that are easy to do by hand. Okay. The last one is, like,
migration, okay, converting from one system to another and change--changing platforms
while you're at it. So, I'm going to give you some examples of analysis and change that
we've done in various kinds of tests, some very quick ones. Okay. And let me talk about
the technology we use to do that. So, on the analysis side, okay, here's a customer that
came to us with an IBM mainframe system. Well, they kept the mainframe in Australia. And
they said, "We have 10 million lines of COBOL," okay, "3,000 JCL scripts," okay, "6,000 input-output
screens, 500 databases and we don't know how any of it's tied together." So, when they
want to do simple impact analysis, we want to change the way interest rates are paid
to our customer base. They, first of all, struggle to find a place to start. Well, we think this database holds interest rates. The next question is what other parts of the software talk to that database, so they can think about modifying it. It's not that they can actually make the modification yet, or map out the architecture; but what pieces touch it? So they can actually start thinking about the impact analysis. If you cannot do this
on a very large scale, you cannot schedule things well, you cannot time it well, you
don't know what your different levels are, it's going to hurt you with quality, you're
going to get hurt when you install it because you didn't think it through, because you didn't
find everything. Just finding all the pieces and talking about how they're connected is
a really hard problem. So, one of the things we did for them was we built them a custom
analyzer that reads COBOL at the 10 million lines scale. That's 8,000 COBOL programs.
It reads all the JCL, reads a peculiar thing called a Hogan DB. This particular application
system is called Hogan. It's a banking system that you can buy commercially from CSC. And
the Hogan system happens to be an architecture that makes it really hard to see how anything
works. It contains a bunch of metadata that says, "All these programs talk to each other
through the way this metadata predicts." So, in order for them to see what's happening,
you have to understand how the COBOL programs talk to ledger, how they talk to the databases
and how they're directed by the metadata in the Hogan database. And so, we built them
a custom tool that will read all of that stuff and give them an answer. And I'll show what
that looks like in a little bit, okay, to give you a sense of it. The main point is
it's 10 million lines of code. It's bigger than you are--if you try and pick it up it'll
squash you, all right? So, you need machinery to do that. So, that's the analysis side,
okay? They are struggling just to understand what they have, and their next question's going
to be, how do we make this thing better? That's the next step. This is a more interesting
example. Okay. You're not supposed to be able to see that airplane. All right. That's the
B2 Stealth Bomber. Okay. B2s are made out of essentially stealth materials and software,
all right? There are about 150 microprocessors in the airplane. Okay. It's a 30-year-old design at this point, which is using the best microprocessors you could get in 1975, right? It has 16-bit CPUs that actually do floating point in the hardware. I didn't know you could get CPUs like that in 1975. But if you're in the Air Force and you're building this thing, you could get that kind of stuff. So, people wrote software to run this airplane, all right? So, there's software to run the
ailerons, there's software to run the engines, there's software to manage the airplane, fly
to Baghdad, drop a bomb, come home. The pilot doesn't do anything except for say no, right?
If the plane goes down some path or somebody says, "Abort the mission," he says, "No, go
around." The mission software put--basically runs the airplane. The Air Force thinks this
is a really interesting device for, sort of, force projection. Okay. And they would like
to integrate this device into their "battle sphere." It's amazing what kind of terms you
come up with when you go into these different arenas. They want to integrate it into their battle sphere, and all they have to do to do that is to take the software, which is running in an airplane and was designed in 1975 and didn't know anything about the Internet, okay, or radio links or communications, and replace it with something they can enhance. So, this
software's written in something called JOVIAL, Jules' Own Version of the International Algebraic Language. Designed by Jules Schwartz. It was the best embedded programming language you could get in 1973. That was before anybody had ever heard of C, right? Let's see the
next slide. The good news was they had 1.2 million lines of code that ran the airplane,
did the mission software, all written in this thing, it all works, it runs the airplane,
everything's fine. The bad news, the people who wrote it are retired or dead. The development
tools with which it was built ran on [INDISTINCT]. So, the development tools are software for a machine you can't buy. You can't even find them anymore. Okay. So, this thing was a disaster of the first magnitude in terms of dealing with it. They had to get out of JOVIAL
and if you like, anything but JOVIAL would have been their goal but they were happy to
go to C. Here's a complication. This is Black Code. Top secret, we, SD, aren't allowed to
know what's in it or see it. "Can you please convert this to C without making any mistakes?"
Ugh. The answer is yes. Okay. All you need is enough automation and you can do that.
Okay. So, it's 100% automated conversion. It's now flying in the B2s that are being
used around. We're very happy with this particular thing. Now, I have to kill you since you've
seen the airplane. Okay. So, we've seen an analysis example and a transformation example,
okay, to give you an idea of the kind of things that we do. So, let's talk about how we do
that, okay? Now, if you had to build each one of those things, the analysis engine, the B2 converter, and you had to build it from scratch, you'd simply die, okay? There's just too much machinery involved in doing that. We have a very simple insight, and the very simple insight says that when you build large complicated tools for dealing with software, much of the infrastructure they have is the same as in every other tool you built. Okay. You need parsing, you need analysis, you need this
and that. You're going to hear that theme a lot in this talk. So, we did a simple thing.
We built an engine that has a lot of shared infrastructure. Think of this as an operating
system supporting application programs, which are software engineering tools. So, that's
what DMS is. It has this long complicated name, which is another hour of talk as to
how it got this long complicated name, okay? So, its official name is DMS Software Reengineering
Toolkit; we call it DMS for short. Okay. So, what is this? It's not a tool that does something directly. It never does anything by itself. Okay. What it does is manufacture the tool that does what we need. So, you can think of this tool as being something that takes in a bunch of descriptions, a bunch of raw materials describing what has to be processed, and spits out a device for carrying off the specific task at hand. So, what we actually do is we take DMS, we manufacture a tool with DMS, and then we apply the manufactured tool to the problem task. Okay. And the good thing is we can use the raw
materials for DMS over and over again. Now, there's a list of applications over here on
the left. Okay. It talks about various kinds of things we've done. Formatters, migration
tools, test coverage tools, code generation from C++ and I'm not going to describe those
in any detail here. What I want you to do is see that that list is long and the tools look really different. Okay. And you wouldn't have guessed, if I hadn't probably preloaded the question, that those all might be highly related, but they are, okay? You people are related to fruit flies by your chromosomes; you share about 85% of them. It takes a certain amount of stuff just to be alive, okay? And that's the observation. Eighty-five percent of those things are the same. In fact, our experience is more like 95%, and that's the win here.
So, we'll talk about what's in this engine to give people some idea how we can do this.
It's a very simple idea. There are two things we put together. We put together compiler technology that the compiler engineers have been building for the last 50 years, okay? If you're going to deal with large pieces of software, you ignore what the compiler guys do at your peril. It's real simple. You need this machinery. So, we've taken a lot of this machinery and integrated it into DMS. Okay. Now, that machinery is mostly compiler stuff.
Compilers take in source code, they analyze it, they look for special cases, they do code
generation from it, they spit out something else. That's what they do. Okay. Now, if you
specialize it to a particular input language and a particular code generation style and
a particular kind of set of optimizations and a particular kind of binary output, you
get a particular compiler like GCC, okay, that's a nice thought. Okay, it's pretty useful.
There's lots of other compilers built around that. The difference between what we do and
those kinds of compilers is we've generalized the daylights out of the boxes, right? So,
you need to parse the input languages, whatever languages you need to read; you need to read them from the external textual form that people deal with into an internal form, which is essentially compiler data structures, which I've symbolized by abstract syntax trees up here, right? And you'd like to parameterize that by language definition so you could feed
it lots and lots of different languages and have it process them all. You need an analysis
engine because you need to ask questions about your software. If you're going to remediate
a problem, first of all, you need to know where. So you need analysis machinery, and
you can set up general kinds of analysis, okay, you configure them to ask the questions
that you want. All right. Now, most compilers will emit error messages. In this diagram,
you see them coming out over in the right called analysis results. So you might have
the tool just focus on analysis and print a report. And we'll see examples of that further
on. Well, but more interestingly, what you want in an analysis tool is to drive a signal from these analysis results, if I can find it, this arc here from analyze to transform. This is the key interesting arc, okay: analyze to find my problem, and use the analysis to drive a change, to cause an effect on my system. And we feed that to a transform engine. What the transform engine does is map compiler data structures
to compiler data structures. Okay. If you look at the way compilers work internally,
that's what they do. This is the middle five chapters in any compiler book you'll see,
how to map from a high-level form to a low-level form. That's what it is. But you
can generalize that, okay, with the notion called Program Transformations, which we do
and I'll talk about that in the next slide. Then there's the formatting, which in our
case, is mostly about taking the compiler data structures and converting them back to source code. Because what programmers want is to control their system: put my code in, have something happen to it, have it come back out, have it mostly read like my code with improvements in it. So our formatters basically will generate source code. So, they're kind of funny compared to the kinds of outputs you have with a regular
compiler. Now, the way we deal with this is we parameterize these boxes, okay, with the
various kinds of things that it takes to do that. And if you'll see down here--it looks
like we have a damaged arrow in this diagram. There's supposed to be an arrow from this
list of language definitions to this rule compiler, and I don't know why it's damaged.
All right. In any case, the rule compiler accepts two things. It accepts language definitions
in the form of what you might think of as BNF augmented with a bunch of other things
about how to analyze particular properties of that language. And it accepts a tool definition,
okay, which is what we want the tool to do. What kind of analysis we want it to perform,
how we want it to use that analysis to drive the transformational change. All right. So,
the stuff in the gray box there stays constant in the same way your operating system stays
constant. These pieces are just there. We feed in these two things and wiggle them in order to get the effect that we're looking for. And that's the principal value we think
we bring to the table. Here's a piece of machinery that lots of organizations could use, okay,
by wiggling these inputs for their particular tasks. All right. So, what makes this interesting?
Well, there's the fact that the tool could provide some kind of understanding. Help me
understand the source code: beyond low-level parsing, the compiler data structures make it possible to do algorithmic analysis over the code and get some kind of answers, okay, and we actually do deep information flow analysis. If you really want to understand what happens in the system, you have to know when information flows in, and where does it go, and what happens to it? That's a flow analysis question. So, fundamentally, flow analysis is central, right. All right. The core of this is the transformation engine; I'll talk
about that in more detail in just a moment. Okay. And the other issue that makes this
tool interesting is not just this sort of theory picture. Okay, it's a nice picture, that's cool, the kind you could find in an academic paper. Okay, I wrote one like that about 25 years ago. Okay, but rather that it actually works on real systems, it works at scale. We've actually dealt with systems that have millions of lines of code, multiple languages, mixed languages. There's a computational layer underneath, a parallel processing engine, to support the size of the computations we're doing, because doing symbolic computation at the million and 10-million-line scales is expensive. There's a parallel processing system underneath DMS. That's all I'm going to say about it today, but I'm happy to talk about it offline. Right. And it's been in active use in our hands for over a decade, okay, by
a small but very clever team, I think. Okay. So, let's talk about program transformations.
This is not going to be my day. I need to advance this slide to the point we can see
it all. I don't know why. I'm not getting it projected the way I want. All right. Let's
start here. There was an idea back in 1969 that says, "It would be really cool if we
could synthesize programs from scratch." All right. And what this guy said was, "You imagine
something," that's the requirements bubble over there on the left, "And I'll write down
some formal specification for it in my formal language." Whatever that was; dozens of formal languages have been proposed. And then we'll get some
magic engine that takes that formal language spec and generates code at the backside and
produces the program, okay. That idea has both been famously a failure and a success,
okay. The high-level version of it, really high-level requirements and specification
has been a complete disaster. I don't know if anybody can really do this very well. The
low-level version of it has been a gorgeous success to the point where everybody forgets
about it. They're called compilers, right? Everybody has one, everybody uses one, nobody
thinks about it, it just works. But if you push on the high-level version of it, okay, what you realize is that going from a spec to that final program in a single step, like, look, it takes in two million lines of spec and it produces 10 million lines of program and it does that in one step, seems pretty hard. Really hard to believe. The only hope you have in doing something like this is to say, why don't we do this in stages? Let's somehow incrementally convert that specification into the final implementation by applying a whole bunch of knowledge, which we'll call rules, that maps that spec to the code. And in the lower
part of the slide, I'm going to show you a little simple version of that. As you can
imagine down here in the lower left, you can see we have a little specification. It's written
in algebra, because this way I don't have to explain the spec language to you. I assume you all have had algebra, right? So, that's a specification, and what we want to do is compile it to a more efficient program. Okay, compilation means more efficient for some definition of more efficient. Most people think more efficient means compiled to machine instructions; that's one definition. Another one is it just takes less time to run. That's a version of compiled, all right? Because that's what everybody is really
looking for. So, on this slide, we're going to do the second one, it takes less time to
run. We're going to do that by basically simplifying this program, applying optimizations to it.
And the optimizations we apply are all the ones you learned in 9th grade. So, this first step goes from our simple spec here to the second spec, and it applies the distributive law as you learned it in the 9th grade. And then we use the unit multiplier to get rid of one times Y and convert it to Y. And then we apply a few more of these very simple 9th-grade transformations you know about to get to our final program. This is a compiled program in the sense that it's better. It's smaller; if you build a dumb interpreter to interpret the left one and the right one, the right one will run faster. So, it's compiled. So, this achieves the idea. It doesn't achieve the sort of scale that you want, think 20 million transformational steps, but I couldn't get those on a slide. So, I tried to keep it simple. But this is the sort of idea.
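Here's a tiny sketch of that idea in Python; this is not DMS, just a toy rewriter over expression trees written as nested tuples, and the simplify helper is our own invention for illustration:

    # Toy rewrite system: expressions are nested tuples like ('*', x, y);
    # rules are applied bottom-up until nothing changes.
    def simplify(e):
        if not isinstance(e, tuple):
            return e                                  # a variable or a constant
        op, a, b = e[0], simplify(e[1]), simplify(e[2])
        if op == '*' and isinstance(b, tuple) and b[0] == '+':
            # distributive law: x * (y + z)  ->  x*y + x*z
            return simplify(('+', ('*', a, b[1]), ('*', a, b[2])))
        if op == '*' and 1 in (a, b):
            return b if a == 1 else a                 # unit multiplier: 1*y -> y
        if isinstance(a, int) and isinstance(b, int):
            return a * b if op == '*' else a + b      # constant folding
        return (op, a, b)

    # x * (y + 1) becomes x*y + x: fewer operations for a dumb interpreter.
    print(simplify(('*', 'x', ('+', 'y', 1))))        # ('+', ('*', 'x', 'y'), 'x')

We want to take this analogue and apply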
it to software where the thing over there on the left is a formal spec or a piece of
code, or a piece of object code, any kind of formal document, and the rules are key pieces of knowledge which legitimately tell you how you can manipulate this program
to make it change. So, the issue is, how can you give it different language definitions,
how can you feed it a lot of rules? All right, well, what does that really look like? This
is what transformations look like when you go from the algebra world to the programming
world. All right, so this is the transformation that maps from the source domain in COBOL.
Right, that's some kind of COBOL program, and what I'd like to do, is I'd like to convert
this COBOL program into CSharp. Okay, we can talk about whether this is a good idea or
not, but it's certainly a popular idea with lots of large mainframe guys right now, all right? So, there's lots of interest in that, and being, well, a money-grabbing company, we help people
to do this kind of thing. So, how do you get from COBOL to CSharp? Well, the answer is
convert each of the constructs in COBOL into a corresponding construct in CSharp, which will give you equivalence. And so, here's a rule that does a piece of that. This rule
takes care of the special case in COBOL where you could say add some variable to some other
variable. That's a COBOL phrase, it's legal, for some source expression, some target expression, and some target variable, okay. That "add v1 to v2" in COBOL rewrites to the slot called v2 in some object, okay? It's updated by the value that you computed from v1 in whatever object it came from. So, hiding behind this thing is an analysis: that "\object" thing is metacode that says I need to go off and do an analysis to figure out where we decided to put this object in the target system. So, it's hiding an analysis there. There's a second analysis there that says, "You can only do this if v2 is actually represented in the target system as an integer." It doesn't work if it's a decimal number, because CSharp doesn't do decimal add with plus. It won't do that, okay? So, what we see in this particular transformation
is syntax-directed on the source. If you see this piece of syntax here, replace it by that piece of syntax there. That's mathematics. It's called equational equivalence, A=B. Replace this by that, okay; do some analysis to figure out what to replace it with, and add some conditions, like "if represented as integer." If you take that
transformation and you apply it to the "Before" piece of COBOL down there at the bottom, and
you run it with some other transforms that are allowed to actually do the object look-ups,
and things like that. You can get out something like Invoice.ShipTotal +=, that should say
+=, Order.WidgetCount. So, there's a transformation. All right. Now, we write those transformations
as source text, because we're engineers and we don't read compiler data structures easily. As engineers, we want to work with text. But the tool can't work with text. It has to work with compiler data structures, and so the way it handles this kind of stuff is it takes those rewrite rules and it converts them into the same kind of compiler data structures as the actual source code, so it can match them all up. And that's how it works internally. So, it maps from that structure to that structure. Okay.
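In Python pseudocode, not DMS's actual rule language, the shape of that rule might look something like this; the symbol table and its fields here are hypothetical stand-ins for the object-placement analysis:

    from dataclasses import dataclass

    @dataclass
    class Add:                   # the matched COBOL phrase: ADD source TO target.
        source: str
        target: str

    def qualify(var, table):
        return f"{table[var]['object']}.{var}"        # the \object lookup

    def rewrite_add(node, table):
        # "add v1 to v2" -> "obj.v2 += obj2.v1;", only if v2 maps to an integer
        if table[node.target]['type'] != 'int':
            return None          # condition fails; some other rule must handle it
        return f"{qualify(node.target, table)} += {qualify(node.source, table)};"

    table = {'ShipTotal':   {'object': 'Invoice', 'type': 'int'},
             'WidgetCount': {'object': 'Order',   'type': 'int'}}
    print(rewrite_add(Add('WidgetCount', 'ShipTotal'), table))
    # Invoice.ShipTotal += Order.WidgetCount;

Now, if I give you one rule, you're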
not a mathematician, okay. X + 0 won't get you a job, okay, at Johns Hopkins, right? So, if you want to be really good at mathematics, you have to take a lot of courses. You have to learn the various kinds of mathematical systems there are, the various ways to write down formulations, the various ways to manipulate those things; there's a bunch of rules. And if you're going to be really good at it, you're going to spend a long
time doing this. And if you do, you could do spectacular things with mathematics. DMS
is the same way. If you give it one rule, you get a cute example like the one I had
in our previous slide. But it's not really impressive by itself. You get three rules,
nah, you don't get much. You get 50 rules, you start to get to a place where you can
do things like building test coverage tools and we'll see some examples of that down the
road. And you give it several thousand rules and you could do spectacular things. You can
move the pyramids. Okay. In particular, you can move the B2, all right? It's just a matter
of scale. So, number of rules matters. Now, we talked about the transformational aspect
of the tool, okay. It's supported by a bunch of analysis engines. What I said before was
that flow analysis is important. How does information flow throughout your system? We
don't do any magic here. Okay, what we do is we implement the compiler technology you
can find in the Stanford Computer Science Bookstore. Go to the back wall, there's a
bunch of compiler books, pick up the first book on the left, implement that, put that
one away and pick up the second book, implement that, put that one away, repeat. We're about
halfway down the wall, right? Fundamentally, what they say is you need to figure out how
information flows throughout your system, okay? So, computing flow graphs, information
flows, use-definition chains, definition-use chains, points-to analysis, all these
things are interesting. So, this is an example of a flow graph, okay, decorated with data
flows, okay. For that little tiny spot up there at the very top, which I'm sure you
can't read unless your eyesight's far better than mine, okay? That's a Fibonacci program.
All right? This is all the stuff--oops. This graph that you get is all the information
flows that happen inside that Fibonacci program. My first comment is, "Man, no wonder programming
is hard." Okay, you have to understand all these kinds of relationships in some sense
in order to believe this program is right, all right? If you can collect these kinds of information flows, then you can discover that some event here in your code, at this point, can have an impact downstream on there, whereas that event doesn't have any downstream impact on this part. And being able to simply separate the part of the program which is important for the task you care about is fundamental, just for focusing your attention. Right. So, you need the flow analysis just to help find out where things happen, to help separate things that are irrelevant to you from things that are not irrelevant to you, and for understanding how information flows around the entire system.
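The simplest piece of that is the def-use relationship. Here's our own toy version in Python of computing def-use chains over straight-line code; real analyzers do the same thing across branches, loops, and procedure calls:

    # Each statement is (variable-it-defines, [variables-it-uses]).
    code = [("a", ["n"]),        # 0: a = n
            ("b", ["a"]),        # 1: b = a
            ("a", ["a", "b"]),   # 2: a = a + b
            ("r", ["a"])]        # 3: r = a

    last_def = {}                # variable -> statement that last defined it
    for i, (defined, used) in enumerate(code):
        for v in used:
            if v in last_def:
                print(f"stmt {last_def[v]} defines {v!r}, used at stmt {i}")
        last_def[defined] = i    # record the new definition after the uses

So, this machinery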
is built in the DMS, okay, and it's been applied to a number of different languages. It's a
generic subsystem, where we connect a different language by providing some extra information to the language definition: if you have this language definition, here's how you feed the flow analysis machinery. Here's another analyzer; this is range analysis. How
big is this value? Okay, why would I want to know this? Well, I want to know if I'm
getting overflow, I want to know if I, you know, exceed my storage demand on some array,
I want to know how much storage to allocate for this particular variable. I'd like to
know all kind of things, right? So, this analysis is done using abstract interpretation over
symbolic range constraints. The symbolic range constraint is a very short formula, okay,
involving two constants--I'm sorry, three constants, A, B and C, and two program variables
X and Y, that says I can discover this particular property is true of this pair of variables
at some point in the code. And we have examples of that up here. Okay, there's a small program.
Okay, going into the program, we know nothing at all, okay. Coming out of this first conditional, what we know is minus I is less than or equal to minus four; when you write it down, that's a complicated way to say I is greater than or equal to four. It shouldn't surprise you, given the conditional, right? And likewise, I is less than or equal to three if you come out this way on the branch. If you take these range constraints and you propagate them through the various kinds of statements that occur here, you end up collecting more information as you go through the program. Think of this as a symbolic simulation of what's happening in the code. What do I know about the answers? In this case, we've got a fork here, coming off this way, collecting information, and we have a fork coming out this way. And down here we have a join, where these two sets of facts come together. And so, all we know at this point is the intersection of those two sets of facts. We take that intersection and we end up down here with this particular fact, minus K is less than or equal to minus three, and that might not seem very exciting until you think about what it really means: K must be at least three, okay, and there's no upper bound on K. What that means is that this array access down here is going to fail, because there has to be an upper bound on K. So in essence, this kind of analysis helps you find things like subscript faults, all right? So it's just an example.
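Here's the flavor of that in a few lines of Python; this is a deliberately crude interval version of it, where DMS actually carries symbolic constraints relating pairs of variables:

    INF = float("inf")

    def join(r1, r2):
        # at a control-flow join, keep only what BOTH paths guarantee
        return (min(r1[0], r2[0]), max(r1[1], r2[1]))

    # if i >= 4: k = i     else: k = 3
    k_then = (4, INF)            # true branch: i is in [4, +inf], so k is too
    k_else = (3, 3)              # false branch: k is exactly 3
    k = join(k_then, k_else)

    print(k)                     # (3, inf): k is at least 3 but has no upper
                                 # bound, so a[k] cannot be proven in bounds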
All right, so that's an analysis. So, what's in the engine? Well, it's a pile of gears.
It's a toolkit. You screw them together the way you want to get the output that you're
looking for. It contains parsers, it contains pretty printers, which are anti-parsers, that's
all they are, just inverse functions. Okay, it contains a surprising number of mature
front ends for tough languages, okay. There are about four or five C++ parsers on the
planet. We have one of them, all right, because of the machinery we built underneath it. So, C++, C, Python, Fortran, Ada, ABAP, okay? We have a large number of well-tested front ends that we've applied in serious commercial contexts, all right? And it takes a long time to get the details on those things right; we've spent about the last five years busily tuning our C++ front end to try and deal with all the various weird peccadilloes that show up when you deal with C++ and you deal with the dialects that come with C++. Okay, Visual Studio's C++ is not the same. Okay, there's GCC C++. There are different sorts of dialects, and your reasoning turns out to be different depending on which compiler you're using, right? There's a part down here called the Procedural API. This
is what you find in the compiler books. I have a tree, I have a symbol table, I have
a flow graph, and here's a bunch of APIs I can use to manipulate those things. That's
the very bottom of our system; it shouldn't surprise anybody, okay? It's what you do if
you need a compiler. We try not to use it, okay? What we try to use instead are these various kinds of analyzers, we've talked about them: control flow, data flow, symbolic range analysis. And we try to use the rewrite engine, okay, which is just a pattern-matching language: if you see this, replace it by that. Okay, either from one language to another, or from the same language to itself for optimization purposes, right. And at the bottom of the stack (I said I was only going to mention it once; I lied) there's a parallel programming language.
The main point is it's a big pile of gears; you screw them together to get the effect you're looking
for. So, this is kind of another view of it, okay? So, we got that compiler front end.
It actually reads these various definitions, dumps them into a set of internal databases
and they drive essentially the evaluators and the transformers. And then, we have a
set of separate subsystems down here that actually carry off this analysis in the abstract, and can be tied to a particular programming language
by the description that we gave it on that input over there. Okay. So, we're going to
take and look at some other applications of DMS, just to give us some sense of the kind
of things we've done with it. Okay. We found that, well, a lot of people want to have really good information about their system and they want to have it now. But sometimes you can't get really good information fast. So, maybe it'd be okay to get pretty good information really fast, right? So, we built a thing called the Search Engine. This is a tool for fishing around in large bodies of text. Okay, it doesn't have any really deep knowledge of the text; in fact, basically, what it does is take the source code and lex it according to the lexemes of the language. And when you've got 15 different languages, you can lex them all according to the specific language. Then you can build a query language over those lexemes.
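A rough sketch of the idea in Python, ours and heavily simplified, with a made-up token-pattern query:

    import re

    # classify tokens with named groups: identifiers, numbers, operators
    TOKEN = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[-+*/=();])")

    def lex(src):
        return [(m.lastgroup, m.group()) for m in TOKEN.finditer(src)]

    def search(tokens, pattern):
        # pattern is a list of (kind, value); a value of None matches anything
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            if all(k == tokens[i+j][0] and v in (None, tokens[i+j][1])
                   for j, (k, v) in enumerate(pattern)):
                yield i

    toks = lex("total = total + rate ; x=x+1")
    query = [("id", None), ("op", "="), ("id", None), ("op", "+")]
    print(list(search(toks, query)))   # [0, 6]: both assignments match,
                                       # whatever the names and spacing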
And now, you can search for things. This is a search in a language called, I think this is Natural? This is Natural. This is a 4GL, a programming language for business stuff, okay, in the 1970s and '80s. Okay. And what we wanted to find out was, how many input screens does this thing have, okay, across this 1.3 million lines of code? Because a customer came to us and said, "We think we have...," I forgot what they said. "We think we have something like 1,600 input screens," right, "in our application. And we'd like you to convert it. Now, please give us a fixed-price bid for this based upon 1,600 screens." Would you trust them? Right? The answer's no. Go check it out for yourself, okay? We found 1,112 of these things, okay? These are not the input screens they were talking about, but they're input screens. There are a thousand screens that they didn't tell us about. So, this is really good info for us. So, this is good information
really fast. So, this is a search across a system. You can kind of think of this as Google for code. Okay, it's only internal as opposed to external. I think you guys already have a tool that does something like this; this is used for fishing around internally. So, this is okay information, but really fast. We talked about the mainframe problem, the guy with
the 10 million lines of COBOL. Okay, what they wanted essentially was a picture like
this and this, in fact, is exactly the picture we gave them, okay? So, this is a COBOL program
here in the center. Okay. And it reads this database, and it reads that database, and it writes
to this flat file, okay, and it's controlled by this piece of JCL. So basically it says,
you point me to a module and I'll tell you how it's connected to its neighbors, okay?
And of course they've got a, you know, search mechanism over here so they can figure
out which components they want to look at. This allows them to look at their 18,000 components,
one small subgraph at a time and get some sense. If I make a change to this program,
who do I have to worry about? So, it's a very simple answer, both in concept and in terms of delivery; a very hard answer to get, because you have to read all this stuff and get all
the details right, do this flow analysis across these 10 million lines of code. All right.
Here's a second kind of analysis done with DMS, okay. People have clones in their code. Anybody here not clone any code? No hands. Okay. It's no surprise. Everybody knows they have clones. They don't know how much. Okay, it's a bar bet I'll take with anybody: you've got 10% or more clones in your code. Anybody? I'll take the bet, right. All right. The reason we know that is we have been running this tool for detecting clones for the last 10 years over everything you can imagine, from Python to Fortran to Visual Basic 6, okay, and it comes out 10% to 20% or worse on everything we see. Okay, the Sun guys are the only guys that beat it; it was 9.87%, okay, with the JDK. Somebody there is actually working hard.
I'm impressed. So, here's a tool for locating duplicated code. Somebody wrote something that was useful and they cloned it someplace else. The good news about that is it's software reuse; it makes them more effective. The bad news about it is its [INDISTINCT], okay: whatever you cloned has some kind of problem in it in the future. Maybe it's not actually wrong, okay, but you can change your mind about the architecture; if it contains an architectural decision and you replicated it, you're in trouble. There's another way to think about
it. Imagine your source code base is 20% redundant. What that means is if you pick a random line
of code in it somewhere, there's a 20% chance that someplace else in that system that same
line of code exists. And say you decide you have to change a random line of code and improve it. That means there's a 20% chance that someplace else in this system there's another line of code that you should fix. How many of you know where that other line is? So, what you'd like to have is a tool for locating where all these clones are, telling you where they are and showing them to you. All right.
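The core idea fits in a few lines; here's our own miniature token-window version in Python (DMS actually matches abstract syntax trees, which is how it catches near-miss clones too):

    from collections import defaultdict

    def normalize(tokens):
        # erase identifier names so renamed copies still match
        return tuple("ID" if t.isidentifier() else t for t in tokens)

    def find_clones(files, window=6):
        seen = defaultdict(list)         # window shape -> [(file, offset)]
        for fname, toks in files.items():
            for i in range(len(toks) - window + 1):
                seen[normalize(toks[i:i+window])].append((fname, i))
        return [v for v in seen.values() if len(v) > 1]

    files = {"a.py": ["total", "=", "total", "+", "rate", "*", "days", ";"],
             "b.py": ["sum",   "=", "sum",   "+", "fee",  "*", "days", ";"]}
    print(find_clones(files))            # the same shapes occur in both files

So, here's an example from Python.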
Okay, this is the Biopython distribution, something like that. And people can see down here if they look; they can see it says 11.7% redundant, okay, across something like 202,000 lines of code. Okay, this is a small system from our point of view; these sorts of things get bigger, and this number goes up. So, this is just a summary,
The work summary report from the tool, okay. As a sample of a clone, I didn't take one
from Python; I took one from PHP, just to show that we do different languages. We have a PHP front end for DMS. So, this is a cloned piece of code. Okay, you can see it's occurred in three places. Okay, it's in three separate files. Okay, these clones happen to be exactly the same, one hundred percent identical copy-paste, okay. And most of them are not, okay; most of them are slightly different, and [INDISTINCT] how to detect that. Okay. This kind of tool works on any language that DMS can process. And if we run into a language that DMS can't process for this, well, we define the language to it and then this runs. We had
a customer in Austin that said, "I think I got clones in my code, and I'd like to manage
them." And they came to us and they worked with us for a while. And after they did this
exercise, they drew a size curve of their system, and there's a size curve, and it went up and it went up, and I'm not very surprised by this, systems get bigger. And then it did something that's shocking in the software world. It went down. Okay. And guess what that inflection point is. They started using the clone detector, all right? So, here's a way to manage your system size. Here's a way to cut your engineering cost on large-scale systems. Go find the clones, go manage the code, control the clones. Make sure you're not producing clones, okay? He was very happy with this as a tool. All right. So, that's
kind of how things are similar in the big. All right, it's also true, if you'd like to
see how things are similar in the small. Everybody uses the diff tool, okay? Take two files, one of which I've edited, and show me how they're different. Now, what diff does
is that it shows you how they're different on a line-by-line basis. Nobody I know of
uses a line editor. So, it seems like really the wrong answer, okay? If you think about
the way people work with code, what they do is they say, here's an expression or a statement or a block, and I need to do something to this. I need to move it, I need to modify it, I need to change it, I need to delete it, I need to insert it, I need to copy it. They think like that. At least I think like that; I don't know about the rest of the world. And so a tool that told you the differences between two pieces of code in that vocabulary seems like it would be much more effective: you changed this variable, you moved this block of code; not "this line's different, somehow." Okay, so an engine that we built called the
Smart Differencer uses the same machinery the clone detector does, in the small: look for two things that are the same. Okay. And instead of showing you what's the same, it shows you what's not the same. It shows you the complement, just the other answer. Right. And that gives you essentially a Levenshtein difference between the abstract syntax trees: what's the smallest set of changes you might make to get from one tree to the other? And you can think about it as being a plausible editing path: how did he modify this program to get to that? And it comes out and hands you things like this: a block of code got moved. One of its more interesting aspects is it says you renamed all the variables in this scope this way, as opposed to "I edited 47 places." Okay. It's completely insensitive to, you
know, the formatting and the comments, so if you change the formatting of the text,
it doesn't get fooled by any of that stuff. So this is a really useful tool in the small.
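Here's a crude sketch of reporting differences in that vocabulary, our toy version in Python; the real tool computes a minimal edit script, while this one just walks two trees written as (label, children...) tuples:

    def diff(old, new, path="root"):
        ops = []
        if old[0] != new[0]:
            ops.append(f"rename {old[0]} -> {new[0]} at {path}")
        for i, (o, n) in enumerate(zip(old[1:], new[1:])):
            ops += diff(o, n, f"{path}.{i}")
        for extra in new[len(old):]:
            ops.append(f"insert {extra[0]} at {path}")   # child only in new
        for gone in old[len(new):]:
            ops.append(f"delete {gone[0]} at {path}")    # child only in old
        return ops

    before = ("block", ("assign", ("var",), ("lit",)), ("call",))
    after  = ("block", ("assign", ("var",), ("lit",)), ("call",), ("return",))
    print(diff(before, after))   # ['insert return at root'], not "line 3 changed"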
Okay. Here's kind of a quick example. Okay, this is done in COBOL. All right. So we look
over here on the left, we have an original piece of code. We look over on the right, we have a modified piece of code. And what you'll see is that the term TERM-NAME got renamed consistently to TERM-TITLE throughout this piece of code, and essentially the differencer painted up these two screens so that you know what the differences are and you can see them. If you're a code reviewer and you have to see how your code got changed, this is a much, much easier way to see what actually happened. Okay. Now, we talked about migrations. Okay, and so we used this sort of nice scene we've seen from nature shows, about this guy leading these birds on migration, and then we have kind of the stealth bomber up here in the corner--whoop. All right, so we've talked about migration, I think I've
made that point, right? So if you need to move from one system to another, change platforms,
change languages; this does all that. All right. Now, that was transformation as opposed to
analysis. You can use transformation to support analysis, okay? In particular you can use
transformation to instrument your code to collect data as it runs to help you decide
how good it is. So a fundamental question before I ship my code is did I test it? Everybody
says, "You should do unit testing," we all buy that, right? You all write unit test.
But do you know how much of your code you tested? Okay. If you tested five lines of
your code with your thousand unit test, I don't want to ship it, I should be afraid.
So you'd like to have a tool that essentially says, "Tell me what part of my code got executed
by my test?" The test all pass and most of my code got executed I might feel comfortable
about shipping it. The test all pass and hardly any of the code got executed I should be scared
to death. So I have no idea what that line of code that I never executed does, I just
don't know, I have no test. So what test coverage tool does, it instruments the code to collect
execution information, I got here. It does that massively over the entire system, you
run the code, it collects all the probes, it displays in a nice format, paints them
up on a nice picture like this. So the red part here is a piece of the C++ code that
did not get executed, meaning red for stop and green means the stuff got executed, green
for go. All right, you can use that kind of instrumentation to do not only test coverage
but profiling, both counting and timing profiling, okay, and the style's the same: instrument the program, in whatever language, with the kinds of probes you need for this particular language. So the transformations vary a little bit, but the style doesn't change at all. Okay.
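In miniature, the transformation is just: rewrite statement S into "probe; S". Here's a deliberately naive Python-on-Python sketch of that, ours and not the product:

    hits = set()
    def __hit(i):
        hits.add(i)

    source = ["x = 2", "if x > 3:", "    x = x + 1", "print(x)"]

    instrumented = []
    for i, line in enumerate(source):
        if line.rstrip().endswith(":"):
            instrumented.append(line)    # this toy version skips block headers
        else:
            indent = line[:len(line) - len(line.lstrip())]
            instrumented.append(f"{indent}__hit({i}); {line.lstrip()}")

    exec("\n".join(instrumented))        # run the instrumented program
    for i, line in enumerate(source):
        print("green" if i in hits else "red  ", line)  # the branch body never ran

Some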
of our customers ship intellectual property to each other. They want the receiver to use it, but they don't want them to understand it, and they don't want them to ship it onward. So, what they actually want to do is take the code which is shipped and scramble it in a way that makes it impossible for a human engineer to know what's going on. Oops. So here's a piece of Verilog; it's a chip design. Some guy says, "I designed this chip, and I will sell it to you, and I want you to integrate it into your chip. But I don't want you to ship it to a third party." So what they'll do, instead of shipping that thing, is they'll ship you something like this. This is code obfuscation: take the code and scramble it, change all the names, remove the formatting, remove all the stuff that provides cues for people. Okay. So that they can ship this and have some kind of technical support for intellectual property safety.
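The renaming half of that is easy to picture. Here's a toy sketch in Python over a scrap of Verilog, with our own made-up naming scheme:

    import re

    KEYWORDS = {"module", "input", "output", "assign", "endmodule"}
    names = {}

    def opaque(m):
        word = m.group()
        if word in KEYWORDS:
            return word                         # keep the language's keywords
        if word not in names:
            names[word] = f"n{len(names):04d}"  # same name every time it appears
        return names[word]

    src = """module adder (input a, input b, output s);
      assign s = a + b;
    endmodule"""

    scrambled = re.sub(r"[A-Za-z_]\w*", opaque, src)
    print(" ".join(scrambled.split()))          # squeeze out the formatting too
    # module n0000 (input n0001, input n0002, output n0003); assign n0003 = ...

All right, there's a lot of other applications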
of DMS that we've done, okay. I'm not going to go through the detail; I'll be leaving the slides behind so people can look at them, they can take a look at the slides in general. Okay. But they cover the spectrum from embedded systems, okay, to SOA, to other migrations, okay, to generating vector machine code, okay, for parallel machines. These are all done with DMS, basically by carrying out transformations on source code. We get continually asked, "Can you do other new things?" People show up and ask us all kinds of questions; to some we say, "Yeah, we think so"; some we can't do yet. These are the sorts of things that
we're talking about with people in sort of serious ways at the moment. Okay. Migrating
applications for mainframes, we have a bid in to the University of California at Santa
Barbara, to migrate their student information system off of Natural in a mainframe into
CSharp, all right. A particular task that they want done, right? Controlling development cost by minimizing the software base. Okay. We have a customer that's got 55
million lines of Java code, they're pretty sure half of it's dead, they don't know which
half. Help them find it; help them remove it, that's a big task to do by hand. Okay.
Multi-core CPUs: we're talking to, let's say, a big packet-switching vendor. Okay.
They've built themselves a packet switching piece of software which they make a lot of
money off of. And it was designed around a single-threaded execution model. Suddenly
they have 32 CPUs and their code's not ready for it. What now, brown cow? Okay. You need to analyze at scale to understand where you can break it apart and what you might do to parallelize the thing, right? Well, the U.S. Navy came to us a while ago and said, we're buying software from Czechoslovakia because we hate being [INDISTINCT]. Okay. And the good news is it's cheap, and the bad news is we don't trust it. So now the question is, can you find malicious software? That's not an easy task to do, but you need to build tools to analyze this stuff and help them poke around, so on and so forth. We're currently talking to a university that
wants to take C code, okay, for high speed image processing algorithms. Pick out the
inner loops and convert these inner loops into FPGAs, so they can build basically a codesign system, okay, in which they can have low-speed C code and high-speed FPGA for the core of the algorithms. That whole trick is just basically driven off the flow analysis stuff; that graph you saw for the Fibonacci is perfect for what they want
to do. So I got invited out here after a conversation with some other folks at Google. Okay. And
so I thought I'd pitch one possible Google application of this. Okay. And as I understand it, there's, you know, a build process, a construction process, that says I have a very large library of support utilities, I have a large number of applications, and there's a dependency network between them. And what we'd like to have is a build process that says, okay, when I touch one of these things on which something else is dependent, okay, all of the things downstream get compiled; that's good. We all call it make, or whatever it is you happen to do. Okay. And so here we have a model in which, if you change Z, we'd like to recompile Y and Q and G and then A. Okay. That guarantees that when somebody modifies Z, A gets updated. The problem with it is it's granular. Okay. It imposes this cost on almost everybody's world; the dependency is based not upon the details of Z, but rather Z itself as a thing. Okay? Y depends upon Z as a whole, so if anything in Z changes, you rebuild Y, even if Y didn't use the thing in Z that changed. So here we have an example.
We look inside, what you see is Z has some subcomponents; Z1 and Z2 and Y depends really
on Z1, okay? M depends upon Z2. Now, if I change Z2 the question is do I want to rebuild
A? The answer's no. I really don't want to do that. And the secret to doing that is essentially
something like the Smart Differencer. What you want to understand is, what's the difference between the old Z and the new Z, so you can say, "Hey, this is where changes got made; here's Z2." And then you want to look up the dependence relationships between Z and Y and Z and M and say, "Does Y actually use Z2? No. You don't have to rebuild this." So you could build a smarter make by looking at a more fine grain, okay, in the code, okay, and looking at dependencies. It doesn't change the fundamental model of how you build things. But change the details of how you do it, and it cuts the cost radically; well, we think so, anyway, especially if you do this at the scale I think you guys are doing. Okay.
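Here's a sketch of the bookkeeping, our own toy model in Python; the names Z.Z1 and Z.Z2 and the deps table are hypothetical:

    # each target lists the fine-grained pieces (or targets) it actually uses
    deps = {"Y": {"Z.Z1"}, "Q": {"Z.Z1"}, "M": {"Z.Z2"}, "A": {"Y"}}

    def dirty(changed):
        # transitively collect the targets whose used pieces changed
        out, frontier = set(), {changed}
        while frontier:
            frontier = {t for t, uses in deps.items()
                        if (uses & frontier) and t not in out}
            out |= frontier
        return out

    print(dirty("Z.Z2"))   # {'M'}: A is untouched, since Y never used Z2
    print(dirty("Z.Z1"))   # Y and Q rebuild, and then A, because A uses Y

All right. So we've talked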
about a piece of machinery, okay; we haven't talked about sort of the ideas behind it and the drivers behind it, okay? We think that not only is the machinery useful, but our perspective as a company is useful. Okay, we've been doing this for 15 years. We think about this differently than most people do at this stage of the game, because we ask, "What kind of hammer can I hit it with?" Not, how can I do it with a team of people, okay. And it takes a while to get used to that idea, because it's amazing how many different things you can hit with this kind of hammer if you think about it hard. Okay. With the engine, we bring the idea that you need semantically precise analysis as opposed to people-powered analysis; people are good, but they're not good at scale. All right. That you want to deal with very large artifacts with a lot of automated engines, okay? That you should build an engine, okay, one uniform piece of machinery to carry off this kind of work, and you can actually reuse that engine; you can amortize its cost across lots and lots and lots of different kinds of tools. Okay, it's surprising how many different kinds of tools, right. And it's
also interesting that by using these tools we manage to carry off some tasks that most
people consider to be almost impossible. Most of our customers have come to us after they've
tried to do it themselves and failed. Okay, in the B-2 exercise, they tried to do it internally by hand twice before they came to us. And they basically said they were going to give us the business because we looked weird, right? We didn't look like we'd do it the way other people did it. And the way other people did it didn't work. Right? It was really a strange
reason to get a piece of business, but it worked out for everybody. Right. All right, my point is we do hard things with this kind of machinery, just sort of very surprising things, things that people would normally think are impossible. All right. In practice, we're a small company.
Okay. And this only works when we find a partner that has a specific task of his own, okay? And he wants to bring something to the table. So we actually work with the integrators, or the other guy on the other side of the wall, to carry off the task. So we usually provide them lots of advice: here's a piece of machinery, we might build some pieces of it, they might build some pieces of it, we might go through lots of training. It's an interactive process,
okay, with our partners. All right. So, lessons. One, software is bigger than you are. Worse, it's getting bigger, all right; it's expanding at the speed of light. Can you catch it if you just run at light speed way out to it? Are you going to catch the edge of the galaxy? Okay.
The key technology for dealing with it is really simple, really simple. You've got to be able to parse stuff. Okay, you have to be able to analyze it. You have to worry about data flows at scale, all kinds of data flows. You kind of fundamentally implement all that and make it available to build tools on top of it. You need to be able to carry out transformation for change. You don't
want to go to the doctor's office for analysis; you want to go there for a cure. Tie analysis to change. If there's anything you walk away from this talk with, I think it's: tie analysis to change. That's what tools are really good at, right? You're always going to need a custom
tool, because your situation is different than everybody else's. You grew up the way you did, your perspective is the way it is, you have a bunch of historical baggage.
You need a tool that deals with the world at your end. That means you need a custom
tool and you're never going to find it lying in the gutter, okay? You'll have to build
it. All right, so I think this is another insight. It's hard to build. You don't want
to build all that infrastructure by hand. It's way too expensive. I will scream if I hear another guy at the Working Conference on Reverse Engineering saying, "I'd like to redesign the system, and if I just had a parser..." I'll scream if I hear that
again, right? They're always there. They don't seem to understand that even if you have a parser, okay, it's like climbing the Himalayas--it's like climbing Mt. Everest. The first 8,000
feet is easy, anybody can do it with a backpack. The last 19,000 feet requires a completely
different technology to get up to the peak, completely different game. Right? It looks
easy. Yeah, first 8,000 feet is easy. Any clod can do it but that doesn't matter. So
the infrastructure is expensive to build, it's hard, it has to scale. You need generic analysis; you need robust language front ends, and all the real stuff has to fit together. Okay. The
bottom line here is DMS is an engine for doing this. The reason I go on and proselytize is because there are no other engines on the planet that look anything like this at this kind of scale, that I know of. There are some research systems called Stratego and TXL, which you can download for free, and they tell the same stories, but they don't apply them like this. Okay? They aren't going off and fighting the battles that people have, the real battles, you know: can you really parse C++? Can you really do 10 million lines of code? Can you do this flow analysis? Can you do a call graph of this entire system? They don't do those kinds of things. It takes an investment in the machinery--15 years--to get here. Okay.
I'm open for questions. I understand there's an audience here, okay, with mutes, so you may have to turn your mic on if you want to ask your question. And we have a mic here
for people to step up and ask if they wish. >> I was curious when you were talking about
looking for duplicate sections in the code or, like, copy-pasted code. Are you able to
do that even if it's not identical code, if it's just similar in structure or maybe close
to the same code when you look at it? >> BAXTER: Yeah. We detect what we call "exact
clones" and what are called "near-miss clones." Now, what we don't detect are semantically
equivalent clones. It doesn't detect two pieces of code which compute the same thing but do it in radically different ways; it detects pieces of code that people have literally copied and modified, okay? It's a really weird perspective to realize that what makes clones findable is the
fact that somebody stole them and the act of stealing them made them visible. Okay.
But so it's not finding semantically equivalent things, it's finding syntactically equivalent
things that somebody has stolen. So, the smart part of the clone detector, oddly enough, isn't
the clone detector, it's the guy that stole the code. He made it visible; he went and
identified this blob of code, this thing with its formless boundary. He said, "No, no. Here's
the line. Here's the part which is good. Pull it out. Set it over here." And the moment
he does that, now you can see the boundaries, and they're drawn--they're drawn black, all right?
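A toy illustration of the near-miss idea: if identifiers are normalized away before comparing, copied-and-modified code still matches. This is only a sketch with a crude tokenizer and made-up fragments, not how SD's clone detector actually works over ASTs:

```python
import re
from difflib import SequenceMatcher

def normalize(code):
    """Crudely tokenize and rename identifiers to one placeholder, so a
    consistent rename doesn't hide a copied-and-modified fragment."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    keywords = {"if", "else", "for", "while", "return", "int", "float"}
    return ["ID" if re.match(r"[A-Za-z_]", t) and t not in keywords else t
            for t in tokens]

def similarity(frag_a, frag_b):
    """1.0 for exact clones; high values for near-miss clones."""
    return SequenceMatcher(None, normalize(frag_a), normalize(frag_b)).ratio()

a = "for (int i = 0; i < n; i++) total += price[i];"
b = "for (int j = 0; j < count; j++) sum += cost[j];"
print(similarity(a, b))   # ~1.0: a near-miss clone despite the renames
```

>> [INDISTINCT]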
>> BAXTER: Microphone. >> The example of smart differencing that you showed--it didn't seem very different from a line differencer. So, I guess that's
probably because you chose to show a simple example. But in cases of semantic differences,
have you thought about how to show that visually? Because...?
>> BAXTER: Well--so let's talk about the example for a moment, okay? I mean, you're right.
If you were to do this line by line, a pure line differencer might actually detect this. Okay. The question is, what did it say? Okay. A line differencer wouldn't show--wouldn't draw those green patches. It would simply say that line 10900 was different and line 11030 was different. What our tool says is, "The variable 'term name' has been renamed 'term title' throughout this block of code." That's all it says; it says it once. So it really isn't a line differencer. Okay, you can't see it in this particular
one. Okay, go back and ask the other part of the question again.
>> The question that I was getting to is, if you're finding differences that show up in an abstract syntax tree implementation, how do you actually
surface that back so that it makes sense to people?
>> BAXTER: We don't--we don't show it. You never show an abstract syntax tree to a person.
That doesn't work. Okay. That's the reason for the pretty printers. What the pretty printers
do is allow us to take an arbitrary internal data structure and convert it back to source
code. And so the kind of thing you see coming out of the smart differencer, if you tell it to, is, "I moved this block of code from here to there." And it'll actually
show the block of code. It's this block of code. And it might start in the middle of
the line and end in the middle of the line because it's pulling out the abstract syntax
tree, not the lines. All right. So it shows you the source text the way you would see
this as a programmer. I can only put up one slide here and have time, right? So...
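A toy picture of what "diff the trees, then pretty-print" buys you: walking two ASTs in parallel lets a consistent rename be collected and reported once, instead of once per line. The tuple-shaped ASTs and the reporting format below are invented for illustration; they aren't the Smart Differencer's internals:

```python
def diff(old, new, renames):
    """old/new: nested tuples (op, children...), identifier leaves as
    strings. Collect identifier substitutions; report anything else
    as a structural edit."""
    if isinstance(old, str) and isinstance(new, str):
        if old != new:
            renames.setdefault(old, set()).add(new)
        return []
    if (not isinstance(old, tuple) or not isinstance(new, tuple)
            or old[0] != new[0] or len(old) != len(new)):
        return [("replace", old, new)]
    edits = []
    for o, n in zip(old[1:], new[1:]):
        edits += diff(o, n, renames)
    return edits

old = ("block", ("assign", "term_name", ("call", "read", "term_name")),
                ("print", "term_name"))
new = ("block", ("assign", "term_title", ("call", "read", "term_title")),
                ("print", "term_title"))
renames = {}
diff(old, new, renames)
for var, targets in renames.items():
    if len(targets) == 1:             # consistent substitution: say it once
        print(f"{var} renamed to {targets.pop()} throughout this block")
```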
>> So, what's the largest system you've ever worked with? You're talking about, you know, applying this to Google. I don't know how many lines of code we have; maybe 50, maybe 100 million, things like that. Have you done anything like that?
>> BAXTER: Well, I think the right way to ask the question is, what's the largest single system we've worked with? Okay. I mean, my suspicion is that you don't have a single monolithic 50 million line system. You probably have a lot of five to ten million line systems. Okay, you may prove me wrong, okay? We did have--well, the packet-switching customer had 35 million lines of C code as a single system. Okay. And they stretched us to the hilt. And one
of the problems there was to do what's called "points-to analysis," where what you want to know is, for each pointer in the system, what could it point to? This is what's called a "may analysis": this may point to X, it may point to Y, okay? And we essentially had to do that analysis across all 35 million lines of code. Okay, the university papers weren't good enough. Okay. We actually built something to do this; it ran for, I think it was six and a half days on an 8-core machine, okay? And used 90 gigabytes of RAM to compute this answer. Probably not big by your standards, it was big for ours--okay, to compute this answer. That's the biggest thing we've dealt with as a single monolith.
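For a sense of what a "may" points-to analysis computes, here is a tiny flow-insensitive sketch in the Andersen style, iterating constraints to a fixed point. Only two constraint forms are handled and all names are invented; the real analysis he describes handles far more, at a scale nothing like this:

```python
def may_point_to(constraints):
    """constraints: list of tuples
         ("addr", p, x)   meaning  p = &x
         ("copy", p, q)   meaning  p = q
       Returns pts[p] = the set of objects p may point to."""
    pts, changed = {}, True
    while changed:                        # iterate to a fixed point
        changed = False
        for kind, lhs, rhs in constraints:
            before = set(pts.get(lhs, set()))
            if kind == "addr":
                pts.setdefault(lhs, set()).add(rhs)
            else:                         # "copy"
                pts.setdefault(lhs, set()).update(pts.get(rhs, set()))
            if pts[lhs] != before:
                changed = True
    return pts

cs = [("addr", "p", "x"), ("addr", "q", "y"), ("copy", "p", "q")]
print(may_point_to(cs))   # p may point to x or y; q may point to y
```

I had a conversation,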
an eye-opening conversation with the CIO at Metropolitan Life back in 1995 when I first
started the business. Okay. At that time we had no technology at all and so we walked
around proselytizing, okay? So we went to visit these guys because well, they seemed
like interesting customers. And I asked this guy, "So how much code do you have?" And he
says, "I have a hundred million lines of Cobol," all right? I said, "Oh." He points out the
window--we're in New York--he points out the window at this building, three buildings away,
"See that building over there?" He says, "It's full of Cobol programmers. Okay. I need to
control that." I said, "What's the problem?" He says, "It isn't the hundred million lines
of code that makes me crazy. It's growing at five million lines a year." That was 1995. God knows where that man is now. He's probably retired. But his successor's got 200 million lines of COBOL. We went out to visit the Social Security Administration about two months ago.
They do have two hundred million lines of COBOL. Okay. And that's just the mainframe part; then there's the external-facing part, where they've been doing all this distributed stuff in C# and Java. Two hundred million lines of COBOL, it's breathtaking. I will say that we're probably
not very good at this. We're probably better than anybody else on the planet. Okay, it's a tough chase, and one of our struggles as a technology company is to try and make sure that we stay on that curve, that we follow people up as their systems get big, right?
Because otherwise we're all headed for collapse, okay? Black holes happen, okay? When the amount
of mass in there gets to be too big, pfft, then you're dead. You don't want to be there when you get a Schwarzschild-radius effect, right? Anybody else?