A JVM Does That?

Uploaded by GoogleTechTalks on 05.04.2011

Our speaker today is awesome. How awesome is he? He's so awesome that if you were a
car, he'd be a helicopter. He's so awesome that if you were a plane, he'd be a rocket
ship. He's so awesome that if you were a software engineer--if you were a software engineer,
he'd be the chief JVM architect at Azul Systems. Ladies and gentlemen, Cliff Click.
>> CLICK: Thank you. Well, that was a little more of an intro than I usually get. At least,
I have somebody fooled. All right. So, this is me rambling about what goes on inside the
JVM, having worked on them for almost 15 years, a long time. And I'm still amazed
at what goes on inside the JVM. And there's a huge pile of services and people have been
adding to [INDISTINCT] to JVM that the last time we're sure everyone's right here. Many
of these services are sort of volunteered to the JVM engineering teams by naive changes
in specs which turn into like horrible wrong disasters of engineering efforts to make them
do something useful. So, some of the services involved: high quality GC, and this is like
one that's in everyone's mind because there's so much engineering effort put into trying
to get GC pauses to be reasonable on these ever larger and larger heaps. But really,
the GC that's in there now is pretty amazing compared to what was available 10 and 20 years
ago. The actual total allocation costs are quite reasonable compared to the alternatives
and the GCs are parallel and concurrent and incremental and yada, yada, yada and they have
all kinds of cool features about them. There's high quality machine code generation going
on in there. HotSpot has two different JITs. Most of the VMs out there have several different
JITs in motion at once. There's management of the JIT'd code, there's profiling and the
combination of these things turns into--it gives you a bytecode cost model. And that
is sort of the key bit of transparency that's not one that people often think about. A bytecode
cost model tells you something like "a getfield instruction is typically the cost of
an X86 load," right? And that actually only works because the JIT wraps its head around
what it means to do a getfield, removes all the weird corner cases, optimizes away
the null checks and the range checks and everything else on the planet and in the end, the bytecodes
that you get actually have a cost model that's somewhat related to the underlying hardware
processor cost model. And there's a uniform threading and memory model, and this just basically
tells you the meaning of multithreaded programs. What it means to have two threads, where they
read and write the same shared memory and whether that's correct or not, or what--how
the data flows and when--and how and why. It also gives you type safety and that's--you
know, crucial for large programs. Your C programs do not get that and suffer from weird bugs
in, you know, that kind of a way. You get to do things like load code, and when
I say load code, I don't really mean just loading your code and running it but loading
it and running it using all the same services that are already there, meaning it has the
same bytecode cost model despite the fact the code was not available when the JVM started.
You have a long running server, you download code over the web, it's just as fast as if
you had it locally on the spot and that's because you optimize and re-JIT on the fly.
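As a rough illustration of loading code by name at run time, here is a minimal Java sketch. The class name LoadByName and the helper newList are made up for this example; Class.forName, asSubclass, getDeclaredConstructor and newInstance are standard reflection calls. The point is only that the class is chosen by a string at run time, after which it is ordinary code to the JVM.

```java
import java.util.List;

final class LoadByName {
    // Loads a List implementation by its fully qualified name and instantiates it.
    // The name is only known at run time; the caller still programs against List.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static List<String> newList(String className) {
        try {
            Class<? extends List> type = Class.forName(className).asSubclass(List.class);
            return (List<String>) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not load " + className, e);
        }
    }

    public static void main(String[] args) {
        List<String> list = newList("java.util.ArrayList");
        list.add("loaded at run time");
        System.out.println(list.getClass().getName() + " " + list);
    }
}
```

A real server would more likely go through a custom ClassLoader to pull bytes off the network; this sketch sticks to classes already on the class path.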
High quality time access, we'll talk about that because it led to one of the more horrible
things inside of the JVMs. There's a lot of introspection services, you know, JVMTI, DIPI
agents, reflection, yada, yada. Access to a huge library, access to stuff in the
OS, threads, scheduling, priorities, things like that. Those are all brought through the
JVM and available to your Java program. So, where did this all come from? So, most of
these services were incrementally added over time. HotSpot, you know, 1.0, or Java
virtual machine 1.0, didn't have all this stuff. But the language, the JVM and
the underlying hardware have all co-evolved over the last decade, right? So the incremental
additions were like finalizers, things that weren't there originally and were added by people
who wanted them. Same with the Java memory model: we started seeing lots and lots of multi-core
systems, so we had to know what it meant to have two different threads on this--running
concurrently to communicate. Sixty-four bit was added when it became clear that we were
going to have lots of machines with 64-bit capabilities. Support for high core count
and that means different underlying algorithms in use in the JVM itself for concurrency,
for scalable concurrency. If you want to run a garbage collection on a 100 gigabyte heap,
you're not going to run it in a single thread, it will take too long. So now you need a garbage
collection that scales well to all the cores you have available, right? And now the real
question is not which services came, but why did they come? They came because the services
provide an illusion. A powerful abstraction. It changes--it puts a dividing line between
where the JVM engineers innovate and where the writers of Java code innovate, right?
So people who are writing Java focus on adding value above the JVM layer, and people who
are implementing JVMs provide services below the JVM layer, and that separation is similar
to the separation between X86 and the programmers who write software on X86. It's a great place
to draw the dividing line between who's doing which piece of work, right? Now if I look
at this set of services though, I see that they've been very ad hoc. They've been growing
over time as people requested or needed. Some of the services are unique to Java or unique
to the Java virtual machine. But a lot of the services sort of overlap with prior existing
services, but there was some stupid requirement that wasn't quite right and so the actual
service, you know, like priorities within Java and priorities as they exist in the OS
are not quite the same thing and they don't do the same thing, and they have slightly
different jobs and you can't end up using the same priority mechanisms. And I'll have
some more discussion on that. But there's a lot of these services that fall in that
camp. They look like they're similar to what's been done before. There's some stupid
thing that's different, and we had to, you know, replicate that service the hard way.
Okay. So that was my intro. Let's talk a bit about the illusions that we have, the abstractions
that we have. And the big one is obviously garbage collection. [INDISTINCT] It's the
illusion of an infinite heap, right? Just allocate memory using new and "poof," you
got some memory, right? Don't bother to figure out when you're done with it, track its life
and whatever. GC will figure out what's alive and what's dead and do the right thing, recycle
the dead memory. It's just so much easier to program than malloc and free that it lets
you write code in an entirely different way. You can remove a huge class of bugs from
your mental thinking process, and that means you can write code quicker to get the same
job done. Quicker time to market, usually. But, you know, on the other hand, you might
take that quicker time to market and turn it into more time to add more features or more time to
add more performance work or more time to do whatever. It's just a lot easier to use
GC than to go without it. And then it turns out that it enables certain kinds of concurrent
algorithms that you simply can't do without it. It's just too hard to track liveness on
a bunch of these interesting concurrent algorithms, and in particular, the JDK5 locks, the Java
[INDISTINCT] locking mechanisms, do allocation in a way that requires lifetime management
that's not really feasible without using GC. You know, fundamentally those locks require
GC to function. Okay, more on GC. There's this huge stride we made in the last decade.
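To make the earlier point about concurrent algorithms leaning on GC concrete, here is a minimal sketch of a lock-free (Treiber) stack in Java. This is a swapped-in textbook example, not the JDK 5 lock implementation the talk refers to, and the class name TreiberStack is made up. Every push allocates a fresh node and nothing ever frees one: the collector reclaims a popped node only after every thread that might still be looking at it has dropped its reference, which is exactly the lifetime tracking that is hard to do by hand with malloc and free.

```java
import java.util.concurrent.atomic.AtomicReference;

final class TreiberStack<T> {
    // Immutable node; once unlinked it is simply left for the GC.
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T item) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(item, oldHead);   // fresh allocation on every attempt
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Returns null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;
            // Safe to read oldHead.next: the node cannot be recycled while we hold it.
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.item;
    }
}
```

Without a collector, the read of oldHead.next in pop could touch memory another thread has already freed and reused, so the same algorithm needs extra machinery such as hazard pointers or version counters.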
We went from having a GC that was good enough to use in production in a lot of scenarios,
fairly reliable but didn't scale well--hugely well and still had all kinds of weird corner
cases, and still had issues with very large heaps or with pauses, [INDISTINCT] whatever.
And people have been adding more and more and more features to the GCs over time. So,
you know, it's robust, it's parallel, concurrent. But there's still too many GC tuning flags,
at least I claim there are. This is a major point of differentiation among different people
who are working with different GC algorithms, it's a source of active development. Throughput
of the different GC algorithms might vary by as much as 30%, where the very concurrent,
very low pause time GCs have higher overheads than the sort of stop-the-world ones. So,
here's six orders of magnitude in pause times. Azul will sell you a machine that'll
take hundreds of gigabytes of heap with a ten millisecond max pause. IBM will sell you machines
that have, you know, hundreds of megabytes, not gigabytes, of heap but with microsecond
pauses. This is good enough to do hard real time stuff. And I, in fact, know that they
are doing hard real time stuff, like avionics control systems and the like. And the stock
GCs start having serious pause issues when you head for the tens of gigs of heap.
And then, you know, CMS and G1 fit in there and there's other GC algorithms floating around.
So there's that illusion. Let's move on here. So, bytecodes are fast. Nice illusion. Bytecodes
in fact, suck as a way to describe programs. They're rather horrible but we're sort of
stuck with them now. The main win of bytecodes though is that they hide the CPU details. It's
a way to describe a program without knowing what the actual annoying hardware is. Ten
years ago, this was interesting as a differentiator between X86 and SPARC, or X86 and Power, or
X86 versus something else. These days, it's the low power side. It's X86 versus ARM chips
or MIPS, or whatever goes in an iPad, iPhone, yada, yada thing, right? So, really what we're
doing with the bytecodes is that we're expecting them to be fast, but if I do sort of the naive
interpretation of the full semantics of getfield or getstatic, they're actually pretty
slow because there's a lot of work that goes on there, and JIT'ing brings back the expected
cost model, lets me have bytecodes and kind of believe, "Oh that'll turn into something
that's reasonably efficient," and it actually only works because the JIT does all kinds
of magic tricks. So eventually, the bytecodes have to be JITed, right? And, you know, the
JITs themselves are--these days, are very high quality optimizing compilers in their
own right. They're giant complex things. Twenty years ago, an optimizing compiler that's now
inside of HotSpot would be considered as one of the most complicated programs on the planet.
These days, it's just a piece, part of a JVM, right? And the other kind of thing that's
interesting here from a historical perspective is it brings sort of gcc -O2, that level of
optimization, to the masses. Every time you guys run HotSpot -server, you're running an optimizing
compiler under the hood. So, it probably is the most executed optimizing compiler on the
planet now, or at least, it's going to be really close to, you know, GCC. But why can't I
just pick up GCC directly? Why do I have to, you know, write my own optimizing compiler?
Well, this is one of those services where it didn't work to use the existing services that are
out there. There were lots of compilers 20 years ago, 10 years ago. Lots of optimizing compilers
out there, why wouldn't any of them work? Because they didn't track pointers for garbage
collection already. They didn't do the right thing there; they didn't do the right thing
for the Java memory model which has strong requirements on how you do your ordering of
loads and stores and fences. I'll talk more about that in a minute. And then there's just
some new code patterns that needed optimizing. Things that weren't happening in C and Fortran
programs were happening in Java programs and needed optimizations, things like class hierarchy
analysis or, you know, null pointer check elimination. JIT'ing requires profiling, at least a minimal
amount of profiling because you don't want to JIT everything, right? Profiling allows
you to focus your code generation efforts on the small pieces of hot code, which happens
to be the case for all modern programs, right? But profiling also allows you to do better code
generation than sort of static compilation because you know something about the code
you're looking at. You know what methods are hot so you know what to inline, where to do
your heavyweight optimizations, throw that--throw the work. The big effort from the compiler
gets put down in the right places. And then just, you know, general profiling turns out
to be--lets you do better things with code layout for branch prediction, or [INDISTINCT]
generation or instruction scheduling, things like that. There's a bunch of stuff here.
And basically, you know, everyone gets to use profiled code. If I look at sort of
static optimizing compilers for the old SPECint, SPECfp style stuff, profiling those
machines, I mean, those programs was worth 15% or more, sort of raw speed up over non-profiled
but fully optimized code. So profiling has a really interesting, real benefit to the
performance of your code. Another illusion we have is that virtual calls are fast. C++
has virtual calls, kind of introduced the notion of it. But they're avoided by default
because you have to ask for them, and avoided because they're slow: they require multiple levels
of indirection including, in the end, an indirect jump through a [INDISTINCT] table. Java sort
of embraces them by default. Everything's virtual in Java unless you say otherwise;
you add the final keyword to sort of negate [INDISTINCT], say it's not virtual, right? Well
that--that makes them common, and if you make them common, you have to make them fast or
they're unusable. So, mostly they do get fast. The big ticket item here is class hierarchy
analysis which tells you that, in fact, while multiple targets might be possible, in practice,
right now there's only one. And if I know what the one is, I can make it a static call
instead of a virtual call. And since it is a static call, I can inline and do all the
optimizations I want to do. But that might change over time, so if new classes get loaded, I
might have to re-run my CHA and maybe re-JIT my code and, you know, and that all works.
It's in there and does the right thing. And then finally, if you can't use CHA, it turns
out in practice, most call sites that are labeled as being virtual by the programmer
are in fact static in a dynamic sense. Only one class of object ever hits that call site
for the this pointer, it's the same class every time, so you use a simple inline cache. It's
a classic key-value pair inline in the code: you test the key against the class. If it's the
expected class, you've already done the lookup for the target method and you've encoded it as
a call instruction at, like, the hardware X86 machine code level. And that makes a virtual
call that hits an inline cache basically one clock cycle longer than a static call. It's
extremely cheap. But when that fails you go back to being slow. Turns out in practice,
this is almost never an issue. The number of times you have to do the load, load, jump-register
sequence to get the virtual call table lookup is so vanishingly small that it makes no difference
on any real program. Partial programs are fast, right? This is the whole--I didn't
have the whole program at startup time, this is the open-world model versus the closed-world
model, right? You can call Class.forName and get new code loaded in and then that runs
as fast as the code that was there all along because you optimized it, and you may have
to, you know, de-optimize old code and re-profile and re-JIT to get there. But, that
all is in there, that all works. You can add code on the fly and it's just as fast as if
it had been there from the start. A consistent memory model. So this is a--this is an interesting
one these days, although 10 years ago it was highly unknown to people or not thought about.
Wasn't important in people's minds. All machines out there have different memory models. And
the rules on visibility vary very widely from machine to machine, even within
generations of the same kind of machine. So you might think an X86, which has a very conservative
memory model, has the same sort of memory model across generations. Turns
out within the same generation, there's some variation; turns out it varies across motherboards
as to the timing of when loads and stores cross back and forth. So the rules here are, in
fact, quite variant. It's hard to say what they really are. Power and MIPS have a more aggressive
memory model and Azul [INDISTINCT] are actually very aggressive, allowing wide reorderings
at the hardware level. Now, we map these very out-of-order, widely varying load and store
timing issues back to the Java memory model by doing something--oh, okay, that's my
next line, okay. So, really what we have to do, we have to match the Java memory model despite
the fact that we have all these very different memory models from underlying hardware. If
you just execute loads and stores without thinking about the order in which they happen,
you'll get programs whose very meaning depends on the hardware you run on. If I ran it on
this X86 today and that MIPS chip, you know, tomorrow, the meaning of the program will be
different because the underlying memory model is different. So, given different memory models--actually
none of them match the Java memory model. They all have to be, you know, the gap has
to be bridged by the JVM, but it has to keep that gap bridged while keeping the bytecode cost
model. Normal loads and stores have to remain fast and we do it by some combination of code
scheduling, fences, very careful placement of your locks and your compare-and-swap operations.
It requires close cooperation from the JIT, requires the JIT engineer to have detailed hardware
knowledge to do the right thing. To know when you can insert an X86 MFENCE or SFENCE or
a locked add to the top of the stack, whatever the hell it is, today or tomorrow or next
day, whatever the right way to do it is on X86 or SPARC or, you know, ARM or whatever. Okay, another
one. Consistent threading models. There's lots of very different OSs that HotSpot or
JVMs in general run on. You know, Linux, Solaris, and AIX sort of fall in the
camp of, like, pthreads works; you kind of port to pthreads and it's all good. But also
you run on other things, cellphones, iPads or maybe on some thousand CPU, you know, mega
server somewhere and these guys have very different OSs. But Java just makes, you know,
new Thread do the right thing, and immediately synchronized, wait, notify, join all just work.
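The primitives just listed can be seen together in a small hand-off between two threads. This is a minimal sketch with made-up names (HandoffDemo, run); it uses only new Thread, synchronized, wait, notifyAll and join, and is expected to behave the same on any OS the JVM runs on.

```java
final class HandoffDemo {
    private final Object lock = new Object();
    private boolean ready;   // guarded by lock
    private int value;       // guarded by lock

    // Starts a consumer thread, hands it one value, and returns what the consumer saw.
    static int run() {
        HandoffDemo demo = new HandoffDemo();
        int[] seen = new int[1];

        Thread consumer = new Thread(() -> {
            synchronized (demo.lock) {
                while (!demo.ready) {            // loop guards against spurious wakeups
                    try {
                        demo.lock.wait();        // releases the lock while sleeping
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                seen[0] = demo.value;
            }
        });
        consumer.start();

        synchronized (demo.lock) {
            demo.value = 7;
            demo.ready = true;
            demo.lock.notifyAll();               // wake the consumer if it is waiting
        }

        try {
            consumer.join();                     // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The producer may win the race and set ready before the consumer ever waits; the while loop covers that case as well as the case where the consumer gets there first.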
Your multi-threaded program does the right thing across all these [INDISTINCT] and that
requires a lot of work under the hood. Here's an illusion, locks are fast. Okay. So, obviously
not all locks are fast. If we have a contended lock, then obviously somebody has
to block and go to sleep in the OS. So that's fine. But then we might expect fairness from
the OS. So if I have a thousand runnable threads all trying to take the same lock, I might
hope for, "Well, if I'm running on you know, [INDISTINCT] machine, most of my threads are
asleep, they're getting woken and taking turns." I'd like to expect fairness from the OS and
in fact that's not the case. Most OSes, I might as well say all the ones I know about, do not hand
you fair locks. So, you get thread starvation. It's very easily testable: write a simple
Java program, launch a bunch of runnables and you can watch thread starvation. Especially
if you have one thread whose job it is to shut all the other ones down. After he
goes to sleep, he'll never get another cycle and he won't, therefore, shut the program
down. So we fake fair locks in the JVM itself, and that requires a huge amount of engineering
work to get the right kind of thing going on with fair locks. Now if you move away from
contended locks and go to the uncontended locks, they're very cheap these days. And
in fact they're getting cheaper. Biased locking gets turned on. We can expect uncontended
locks to be in the handful of clock cycle range. So this is sort of amazingly fast compared
to, you know, HotSpot 1.0 10 years ago where locks were to be avoided at all cost because
they were so expensive. These days, you know, if they're not contended, they're pretty damn
cheap. And why do they get optimized? You know, here's a different question instead
of "are they optimized": why? Because they're so common. People don't know how to write
programs or how to write concurrent programs. Just in general. So, the industry as a whole
is settling in on this, just "add locks until it works" kind of mentality. The least common
denominator: I don't know why it's busted; I got some [INDISTINCT], I'm going to throw
another lock in. Throw another lock in, "Oh, the bugs went away. I'm done." Right? So locks
become common. Uncontended locks become common because they are defensively programmed in,
right? And so the JVMs optimize them. And we get this interesting concurrent programming
style of "add a gazillion locks until it works." And as an industry, we've learned a lot about
concurrent programming, but what I claim we have learned is that this sucks
as a programming style and we're looking for a better model and we don't know what it is
yet. But it's the model we have right now. Okay. Change topics. Quick time access.
System.currentTimeMillis. The reason it's in the slides is because it turned into one of the grossest
hacks floating around inside of HotSpot. And in general, I've come to discover today that
Google has replicated the hack over again. So, under some particular interesting benchmark,
not to mention JBB or anything, yes, in a winning score it will be called, you know,
it will be called billions of times a second. So it's not millions, it's billions of times
a second. But it is actually fairly common in all sorts of apps. People time stamp
a lot of things a lot of the time. They like to fill the time stamps out. They do it by
calling currentTimeMillis. But, and here's this little eye chart here, what this little
piece of stupid math says is, if one thread thinks his currentTimeMillis is less than
another thread's by even one millisecond, which, if they're just comparing millisecond counts,
they might actually be different by no more than a nanosecond in time, but they just happen
to be at the edge case where the milliseconds flip, then this thread thinks all his recent
writes are completely visible to that thread. Okay? So what that really means is, if
one guy goes and populates some data structure and reads currentTimeMillis and writes
it to memory, and a second guy reads it and says, "Oh, it's off by one millisecond,"
even though there might be a nanosecond of real time difference, the millisecond count value is
one off, then he believes that all the writes that this guy did are visible to that
guy and he immediately can rely on them being there when in fact they haven't had time to cross
the chip in the caches yet. You know, the [INDISTINCT] so fast. And so he picks up
uninitialized data, and real, large, you know, three-letter-company-acronym Java app
servers will crash if you don't provide this guarantee.
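For contrast, here is a minimal sketch of how application code can get the visibility it wants from the Java memory model itself rather than from the timing of the clock. It assumes the application is free to declare its timestamp field volatile; the names StampedRecord, publish, readIfPublished and demo are made up for illustration. A volatile write of the stamp after the data is filled in, paired with a volatile read of the stamp before the data is used, gives the reader a happens-before edge to the writer's earlier stores.

```java
final class StampedRecord {
    private int payload;                // plain field, written before the stamp
    private volatile long stampMillis;  // 0 means "nothing published yet"

    void publish(int value) {
        payload = value;                            // 1. fill in the data
        stampMillis = System.currentTimeMillis();   // 2. volatile store publishes it
    }

    // Returns the payload once a stamp is visible, or -1 if nothing is published yet.
    int readIfPublished() {
        // The volatile load comes first; if it sees the stamp, the payload store
        // that preceded the stamp in the writer is guaranteed to be visible too.
        return stampMillis != 0 ? payload : -1;
    }

    // Publishes from one thread and spins in the caller until the record is visible.
    static int demo() {
        StampedRecord record = new StampedRecord();
        Thread writer = new Thread(() -> record.publish(42));
        writer.start();
        int seen;
        while ((seen = record.readIfPublished()) == -1) {
            Thread.yield();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch does not help code that compares two independently taken timestamps, which is the pattern the talk describes; that case still depends on the JVM-side guarantee.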
>> [INDISTINCT]
>> CLICK: It's just something you have to
do. It is definitely not in the spec. Okay. So then, you know, this situation kind of
stinks and Intel says, "Oh, we can fix it. We'll add this TSC register. This nice thing
that will count ticks on a running core and will give you--you can convert it into time."
And obviously when we convert to time, we just load the thing, add some base and scale
it and "poof," it's going to give you milliseconds, right? But it turns out the value is not coherent
across all CPUs. Now, new on Nehalem, it's coherent within a socket but across sockets it's still
not coherent. So it varies, you know, it goes faster and slower and faster and slower
on some CPUs, so you get current CPUs doing this, right? It's all over the map.
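The load, scale and add-a-base arithmetic is easy to write down, and so is the way it goes wrong. The sketch below is plain Java with invented numbers, not a real TSC read: it assumes a counter that ticks exactly 2^31 times per second so the conversion is exact, and it models two CPUs whose counters disagree by a quarter of a second. The names TscClockSketch and toMillis are hypothetical.

```java
final class TscClockSketch {
    // Assumed calibration: 2^31 ticks per second, so millis = ticks * 1000 / 2^31.
    static final long SCALE = 1000L;
    static final int SHIFT = 31;

    // Scale, shift, add a base. Overflows once ticks exceeds Long.MAX_VALUE / 1000.
    static long toMillis(long ticks, long baseMillis) {
        return baseMillis + ((ticks * SCALE) >>> SHIFT);
    }

    public static void main(String[] args) {
        long cpu0Ticks = 10L << SHIFT;             // CPU 0: ten seconds of ticks
        long cpu1Ticks = cpu0Ticks - (1L << 29);   // CPU 1: same instant, counter lags by 2^29 ticks

        long first = toMillis(cpu0Ticks, 0);       // thread reads the clock on CPU 0
        long second = toMillis(cpu1Ticks, 0);      // then migrates and reads it on CPU 1

        // The second reading is smaller than the first: time appears to run backwards.
        System.out.println(first + " then " + second);
    }
}
```

With these numbers the first reading is 10000 ms and the second is 9750 ms, so a thread that migrates between the two CPUs sees the clock jump backwards even though the conversion itself is correct.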
And that means that if you just read the TSC, shift, scale, add, you get issues where this
doesn't work. So it's not sufficient for running large data programs that way. Right now
HotSpot does currentTimeMillis by doing basically an inline version of the fastest
flavor of Linux gettimeofday [INDISTINCT], which is mostly some sort of user-mode atomic
structure. You read some big [INDISTINCT] that the kernel's fixing behind your back in
a read-only page from the user side, and he reads--the user reads it once and then reads it a second
time to confirm it hasn't changed between reads, and if they get an atomic read of the field then
they use that to shift, scale, add and come up with, you know, currentTimeMillis. It
gives you a nice quality time and it involves a number of memory fences and reads of memory
that are long enough that loads and stores from different CPUs settle out, and so you
get the atomicity guarantee that you need across machines. If I read currentTimeMillis in
one thread, and read it in another, and they, you know, differ by a
millisecond, they will agree that all the loads and stores from one are visible on the
other. Okay. But a better way to do this, or a faster way anyhow, is to do some sort of plain
load instruction, where the kernel updates the current time millis in some user-mode page.
That's just actually the millisecond count, I just have to read it. And that means a thousand
times a second I want to tick that one word up in the kernel. Well, that's pretty damn
cheap actually. That's worth 10% on JBB. And if you turn on -XX:+AggressiveOpts
on HotSpot, it does this. But then it crashes all those app servers. But it's
10% on JBB. So pretty--I guess it's a pretty gross situation. And then hypervisors came
along and said, "Oh, we can fix TSC. We'll make it idealized. We'll make it monotonically,
uniformly ticking across the CPUs by intercepting the reads of the TSC register and doing the
right thing under the hood in the hypervisor and handing you back the right answer." But that
means access to the TSC register is like a hundred X slower; it's just as fast as doing this.
So, you know, it's like, "No, they didn't actually fix the problem. It's not this fast,
it's that fast." But it's secretly that fast because it looks like it just reads some
hardware register but it's not. Okay. Moving on. So, here's some illusions we think we
have but don't actually or would like to have. We will start with some "like-to-haves" here.
Infinite Stack. Everyone who likes these functional languages, like Clojure or whatever, would
love to have tail calls. Tail call optimizations. And that gives you the illusion of an Infinite
Stack. Running-code-is-data. That's closures. That's coming actually in Project Lambda or
Coin, or some combination of the two is going to hand you closures, which is essentially
code-is-data. Capital Integer is as cheap as lowercase int. And that's useful
for--again some of these functional languages, where you'd like to take primitive ints so that
you can throw virtual calls on them to do, I don't know what--I don't know why. But that turns into
auto-boxing. In javac, if you use--if you use a capital Integer to get started somewhere
and then you do lowercase int-like things like, [INDISTINCT] equals zero, i less than n,
i++. Well, the i++ turns into an auto-boxing. That is, it becomes a capital Integer and
then you have to allocate an object for the capital Integer and you have to get the value
out of it, add one and make a new capital Integer and yada, yada. It's like ten times
slower if you head down this path. And it's hidden from you that it happens that way
because javac does it--makes the conversions automatic. And now people would like the JIT to
undo that under the hood. So that's--some piece of work has been done by Sun before
Oracle took them over, I don't think it's turned on yet. And I know that it doesn't
work very well when it is turned on because I've looked at it. BigInteger is as cheap as
little int; that's, you know, a silent overflow to infinite precision integers. That--to
do that efficiently, you want to do something called tagged integer math which requires close
cooperation from the JITs and the runtime system. But I think the number of consumers
of this is very small, although some of the new languages want to go down that route,
they're going to make all the users of that language pay some hefty penalty where I think
the number of people that need infinite precision math is very small, and I know who they are.
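The semantics of silent overflow to infinite precision can be emulated at the library level, which also shows where the cost comes from. This is a minimal sketch, not tagged integer math: a real implementation would keep small values unboxed with a tag bit and have the JIT emit an overflow check inline. The class name PromotingMath is made up; Math.addExact and BigInteger are standard.

```java
import java.math.BigInteger;

final class PromotingMath {
    // Adds two longs, promoting to BigInteger only when the fast path overflows.
    // Every result is boxed (Long or BigInteger), which is part of the penalty
    // that all users of such a language end up paying.
    static Number add(long a, long b) {
        try {
            return Math.addExact(a, b);   // fast path: fits in 64 bits
        } catch (ArithmeticException overflow) {
            return BigInteger.valueOf(a).add(BigInteger.valueOf(b));   // slow path
        }
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));
        System.out.println(add(Long.MAX_VALUE, 1));
    }
}
```

The caller never sees the overflow; it only sees that the result type quietly changed from Long to BigInteger.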
So they can just go ask for it. So I think this is barking up the wrong tree, but I'm
not a language designer so I'll quit now.
>> [INDISTINCT]
>> CLICK: Yes, it does. Very interesting. Okay. Other illusions we'd like to have. Atomic
Multi-Address Update which is another funny name for Software Transactional Memory. A
bunch of people are playing games here. The JVM and the JIT could conspire to do interesting
things there. If this becomes the programming, the concurrent programming style of choice, I
don't believe it will because it has some fundamental flaws, but I don't know. If it
turns out that people like Software Transactional Memory above all other options, then there's
things the JIT could do here, the JVM can do. And then fast alternate call bindings. You
know, this is invokedynamic. This is like, you know, JRuby has different call
semantics than Java does. And invokevirtual doesn't do the right thing. So you want to
do some sort of different thing there but you still want it to be as efficient as
an invokevirtual which in the end turns into, like, an inline cache kind of thing. And that's
what invokedynamic is about. So, it's not here yet but it's probably coming. So here's
the illusions we think we have, right? That's the big one. A mass of code is maintainable.
HotSpot's approaching 15 years old and large chunks of the code are very fragile, or honestly,
very fluffy per line of code. There's a lot of code per unit of work done, right? And
so you get a very slow new-feature rate-of-change. Things don't get added to HotSpot very often;
like, invokedynamic is the biggest newest thing to get added in a long, long, long, long time.
We've been busy rewriting lots of it and that was the fun part of going away from a big
corporation to a start up and I--in fact we have rewritten lots of subsystems to become,
you know faster, simpler, lighter and it's just sort of a--you know, the takeaway from
this is, it is possible to fix HotSpot, to give it a new paint job and get rid of a lot
of the really crufty stuff. It just hasn't been happening at Sun, and who knows what Oracle
will do now. Okay, more illusions I think we have. Thread priorities; people think we
have thread priorities because obviously the OSs have thread priorities and they work,
right? Well, it turns out, they don't. Linux has, for instance, thread priorities but you
can't raise your priority without being root, and everyone starts at the default
of max, which means that you can't lift your priority without being root. Well, that means
if you want to have a concurrent GC which wants to provide sort of hard real time guarantees
on your GC performance, it needs cycles, and if you have a lot of running mutator threads--they're
competing with those GC threads because everyone's running at the same priority unless you lift
the priority of the GC threads above the mutator threads but that means you have to run as
root and that's an [INDISTINCT] nightmare--every JVM running as root in order to get priorities
for...so bad news. You get priority inversion, you get a bunch of issues. Another one here
is--next slide probably--this one, you get priorities that are relative to the entire
machine but you need priorities that are relative within a JVM as well as across the machine; it's
a different notion of priority. And then obviously, you know, we think we might have
write-once-run-anywhere, but scale matters. Stuff you do for small scale low-power, low-battery machines
is very different from what you might expect if you're running on a, you know, quad-socket,
8-core Nehalem thing. You need--you know, one matters for memory size, footprint and
power consumption and one says I've got tons and tons of threads, I've got a giant heap, I've
got to have a giant I/O thing; they have different programming styles. Okay, Finalizers, people
think Finalizers are useful--somebody does, because I see them all the time--but they're
not, though you're led to thinking that they are. And the issue
is, of course, that it's the way to reclaim OS resources, but there are no timeliness guarantees
given. The code will eventually run, but eventually could be, you know, death of the universe
kind of thing, right? And the obvious example here is something like Tomcat, which churns
through file handles as part of servicing web requests, and you can trigger an out-of-file-handles
situation which requires a full GC cycle before the Finalizers reclaim the
OS resources, the file handles. And in the olden days when heaps were small, you
ran full GC cycles fairly often and your file handles got returned in a timely fashion.
Now that the newer machines have more memory, you churn through file handles faster than you run
GC cycles, and so you run out of file handles. And so there is a backdoor hook in the
out-of-OS-file-handles resource thing, to tell the GC to go run an emergency GC cycle,
running finalizers as best it can, in the hope that then you can go re-ask the
OS, "Can you give me a file handle now?" And maybe you can. You know, right--but do we
really want to program our systems this way? Every time you add a new OS resource, you
have to add a back door to the GC and say go run finalizers now, just in case something
comes back out of it. It's sort of the wrong way to manage resources. Soft Refs, Phantom Refs,
Weak Refs--and maybe Weak Ref is useful--and this is basically another case of using GC
to manage a user-level resource. In this case, usually user mode caches. And the idea, of
course, is you have a cache, it holds on to a lot of stuff that you might need, sort
of optimistically, and then if you're low on memory, you can throw stuff
out of the cache and, you know, reclaim a bunch of memory, right? And so what happens,
and I've seen this happen on a number of occasions: you get some sort of low memory situation
which is going to run a GC cycle. Well, the GC cycle flushes your Soft Refs. The GC has
no clue what a Soft Ref points to in user land; it's just an opaque Soft Ref thing.
So he has to choose some. He picks some random set of Soft Refs to clear out and that causes
your cache to flush. Well, then you're now missing in your cache, because the user app was
using the stuff in the cache. It was doing what a cache does and saving you from work.
So you have to do all the work that you would've saved had you hit in the cache, and all that
work causes you to do some more allocation. And since you're low on memory, you do another
GC cycle which flushes your cache again. And so what happens is, you have some sort of
server which is behaving nicely, and as you add load, the load ramps up, the server's throughput
matches and ramps up with it until you hit some threshold of out of memory; you're getting
close to the memory edge. And then of course, the load never comes to you in some uniform,
easily scaling way. The load is bubbling up and down as it ramps up, and you get a little
bubble at the wrong time, during a GC cycle, and suddenly you have, you know, a low memory
situation and you're trying to hit in your cache at the same time. But your cache flushes
and you hit the cycle where your cache flushes, you redo all your work, [INDISTINCT] GC again,
and you get into this endless GC cycle where the server's doing GC constantly and every
time it does a GC, all the load has to do all the work the hard way because the cache
isn't functioning anymore, it's been stripped clean. And since it does all the work the
hard way, his throughput sucks and he fills his memory up with all the work to refill
the cache, but then the cache gets wiped clean again. And the throughput on the server crashes
and remains low until you take the load off long enough for GC to catch up and quit flushing
the cache, and then you start applying load again gradually. And of course at this point,
people will just think the server has crashed and they reboot it, which is sort of another
way to get the same thing done. So I claim that the failure mode in this scenario is,
you're asking GC to make a decision about your application-level performance, your application-level
behavior, without that GC having the slightest clue what the application-level behavior is.
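One way to keep the eviction decision in the application rather than in the GC is an explicitly bounded cache. The sketch below is an illustration only, not something shown in the talk: `LruCache` and `maxEntries` are hypothetical names, and it relies on the standard `LinkedHashMap` access-order mode and `removeEldestEntry` hook to get least-recently-used eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded LRU cache: the application, not the GC, picks the victim.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked by LinkedHashMap after each insertion; true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

Because the bound is a fixed entry count, memory pressure elsewhere in the heap never empties the cache wholesale. `LinkedHashMap` is not thread-safe, so a server would need external synchronization or a concurrent cache library in its place.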
He doesn't understand things like, this is an LRU cache and I want to do some sort of
least-recently-used notion on the things, the Soft Refs, he's going to wipe out or not. He just
has a pile of Soft Refs he has to wipe out, sort of blindly, and hope. Okay. So let's go
sort through all the different things I've got here. Here's the summary of services:
GC, JIT'ing, Java Memory Model, Thread Management, and some sort of fast time. JIT'ing includes,
you know, hiding CPU details and the hardware memory model from people. Services that are
provided below the JVM now: everything like Threads and Context Switching from the OS, Priorities,
I/O, file systems, stuff like that. Then there's a set of services above the JVM,
application-level stuff: Threadpools and Worklists, Transactions or Crypto models or your user mode
caches in the JDBC driver caching [INDISTINCT] what. Models of concurrent programming, right? And
then, you know, some of these new languages, alternative languages, might have different
rules on dispatching or what it means to have an integer or concurrency. So, I claim that
I would like to see fast quality time moving into the OS because, you know, with the TSC
register I can get something very fast, assuming the hypervisor doesn't start kicking in on me, but
it's not quality enough. And I can call gettimeofday, but it turns out that that's not
as fast as I might like or could get if I moved the thing into the OS. All you have to
do is have the OS tick a memory word a thousand times a second on some user mode read-only
page. And this will give me something that's coherent on a clock-cycle
basis; all the CPUs will cache it, it's all going to hit in cache, it'd be cheap, it'd be
coherent, whatever, easy to do. Thread Priorities. The OS gives me thread priorities at the process
level but I want to have priorities where, if I have a high priority JVM that's doing
some sort of critical performance, latency performance thing like a web service, I want
that to be able to starve out some other JVMs I might have which are running some sort of
batch oriented background thing, you know. Somebody's doing statistics gathering across
a large DB--data mining, that guy needs to be lower priority than the guy that's trying
to make money by having you click on a webpage, right? But I can't specify that without doing
it, sort of, at the entire process level, and then I don't get thread priorities within a process,
and so I can't distinguish GC threads, which need to have cycles ahead of the mutators
or else the concurrent GC won't actually be concurrent. It will get starved out, right?
And the mutators end up blocking for a GC cycle. And the same story goes for JIT'ing threads.
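For reference, this is roughly what priorities look like from the Java side. The sketch below uses only the standard `Thread.setPriority` API; `PriorityHintDemo` and `newThreadWithPriority` are hypothetical names. The value set here is a request recorded on the `Thread` object, and whether the OS scheduler honors it depends on the platform, JVM settings, and privileges, which is the complaint being made.

```java
// Sketch: a Java-level priority is a request to the platform scheduler, not a guarantee.
final class PriorityHintDemo {
    // Builds (but does not start) a daemon thread carrying the requested priority.
    static Thread newThreadWithPriority(String name, int priority, Runnable work) {
        Thread t = new Thread(work, name);
        t.setDaemon(true);
        // Throws IllegalArgumentException outside Thread.MIN_PRIORITY..Thread.MAX_PRIORITY.
        t.setPriority(priority);
        return t;
    }
}
```

A background worker could be created with `Thread.MIN_PRIORITY` and a latency-sensitive one with `Thread.MAX_PRIORITY`, but nothing in the API promises that the second will actually be scheduled ahead of the first.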
The JIT needs to get cycles eventually or you would always run interpreted, and this
was actually a failure mode in an early benchmark where a thousand runnable threads would
compete at identical priority with the JIT thread, and the JIT thread would starve, getting
1/1000 of a CPU, and the entire benchmark would run interpreted unless the compiler, the
JIT thread, is given a higher priority than the runnables to go ahead and, you know, get his job
done. Right now, Azul Systems is faking thread priorities with duty-cycle style locks and
blocks. We're getting the right set of thread priority games done, but we're faking what
the OS should be doing for us, right? And we need it because we're trying to sell a low-pause
concurrent collector. We need the right kind of thread priority games. But I claim this
should just belong in the OS. The OS is already doing thread priority games. It already does
process priorities and context switches, so it could give me thread priorities without
requiring everyone be "root," right? And then, you know, the other failure mode of everyone
being "root" is--if I take all the mutators and voluntarily lower their priority, then
I don't compete well against, say, a MySQL that's also running on the same box. They
don't get cycles against the non-Java processes that aren't playing the same game of
being polite. Okay, alternative concurrency models. These are the things that people are
looking for other ways to write large concurrent programs, other than using locks. The JVM provides
you thread management and locks, but there are some of these new concurrency ideas like Actors and
message passing in Scala and Erlang, or software transactional memory out of Clojure, or Fork/Join
[INDISTINCT] Fork/Join model. These are things where the JVM itself is just sort of too big and too
slow to move fast. I think these experiments should be above the JVM, and they are now,
and they should stay there until we get some consensus on the right way to do concurrency,
and then maybe there are some interesting pieces that should go into the JVM at that point.
Like some sort of specific kind of STM behavior or some sort of magic, fast park and unpark,
or seek leak barriers, I don't know what it would be. But this is something where I want
to see people keep doing stuff above a JVM. So, Fixnums. Fixnums are one of these funny
things. I'll rail on it one more time and quit. They sort of belong in the JVM implementation,
not in the high-level language implementation, because the JVM can do good things there.
The problem is that the obvious translation to infinite precision math at the Java level
is fairly inefficient. We'd rather have some kind of tagged integer math, and the JIT can
do magic things with tagged integer math to make it really cheap. But then I'll throw my 2
bits out and do a Bill Gates here and say 64 bits ought to be enough for anybody.
You do the Apple program, no, if you need more than 64 bits and don't make everyone
else pay for it. But I'm not a language designer so I'll quit railing on the language designers.
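To make the Fixnum point concrete, here is a minimal sketch of what a language implementer can do at the Java level today: take a 64-bit fast path and fall back to `BigInteger` only on overflow. `FixnumAdd` is a hypothetical name, and this is not the tagged-integer JIT trick described above. It uses only `Math.addExact` and `java.math.BigInteger`, and it still boxes every result, which is part of the inefficiency being complained about.

```java
import java.math.BigInteger;

// Sketch of a library-level Fixnum add: 64-bit fast path, BigInteger on overflow.
final class FixnumAdd {
    // Returns a Long when the sum fits in 64 bits, otherwise a BigInteger.
    static Number add(long a, long b) {
        try {
            // Fast path: Math.addExact throws ArithmeticException on signed overflow.
            return Math.addExact(a, b);
        } catch (ArithmeticException overflow) {
            // Slow path: only callers that actually overflow pay for arbitrary precision.
            return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
        }
    }
}
```

The design choice mirrors the argument in the talk: the common case stays on machine integers, and only the rare overflow pays the arbitrary-precision cost.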
I gave this talk first to the JAVA [INDISTINCT] summit people, who are all Java language
engineers or JVM power users. New language people. Things to keep in the JVM. So GC,
JIT'ing and the Java Memory Model are all sort of intertwined in an intricate way, but
the JIT by itself really needs to be below the App unless you want to be in the, you know,
the business of doing X86 code generation. But it definitely belongs above the OS and
has been above the OS for the past 50 years. So, you know, it works well there as a stand-alone
server process like GCC. So keep it, you know, in between. That puts it in the JVM.
GC also requires deep hooks into the JIT'ing process because the JIT really has to
track pointers and know what they mean and where they go in order to do the right things
with the right GC invariants, so it makes sense to keep that below the application as well.
The Java Memory Model requires deep hooks in the JIT because it has to do all kinds of
games involving loads and stores and code selection and code scheduling. So again, it makes
sense below the app. These things are kind of all tied together in an intricate way
and that sort of defines what the JVM is. Some of the alternative concurrency
models being exposed by the new programming models people are working on might enable
faster, cheaper hardware with weaker memory models, but in order to do that, you're still
going to need close JIT cooperation, which keeps it still, you know, within the
JVM. OS Resource Lifetime Management. I claim that Finalizers are not the right way to do
that. Move this outside the JVM; make the application do some sort of lifetime management
with reference counting or arenas or some other sort of lifetime management. There are
a lot of different techniques here. But don't burden the GC with the knowledge that more of
resource X can be had if only finalizers would run right now, right? And same thing for, sort
of, Weak, Soft and Phantom Refs. There are some interesting, very rare use cases for some
of these, but for most of them, you know, I don't think GCs should be changing application-level
semantics. If I have a concurrent GC and it's running at a high cycle rate on you
and you make a Soft or Phantom Ref, cycle by cycle it could go to zero. You could make
it one cycle, then the next cycle it could be nulled, and is that a useful programming model?
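The hazard described here is that a `SoftReference` can come back empty on any later read. A minimal sketch of defensive usage follows; it is an illustration, not code from the talk, and `SoftValue`, `recompute`, and `simulateGcClear` are hypothetical names. It uses `Reference.clear()` to stand in for the collector clearing the reference, since real clearing cannot be triggered deterministically.

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Sketch: every read of a SoftReference must tolerate null and be able to rebuild the value.
final class SoftValue<T> {
    private final Supplier<T> recompute;   // hypothetical way to rebuild the value
    private SoftReference<T> ref;

    SoftValue(Supplier<T> recompute) {
        this.recompute = recompute;
        this.ref = new SoftReference<>(recompute.get());
    }

    T get() {
        T value = ref.get();               // may be null after any GC cycle
        if (value == null) {
            value = recompute.get();       // do the work again, the hard way
            ref = new SoftReference<>(value);
        }
        return value;                      // caller now holds a strong reference
    }

    // Stand-in for the collector clearing the reference.
    void simulateGcClear() {
        ref.clear();
    }
}
```

Code that skips the null check is the kind of program that crashes under an eager concurrent collector. This class is not thread-safe as written.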
And I do have programs that will crash when that happens--it happens fairly eagerly,
actually, with a good concurrent GC. So, summary: move some of these things that are [INDISTINCT]
now into the OS: Thread Priorities, Fast Time access. Resource Management, you know,
should get moved out; Fixnums, Tail Calls, Closures, moved in. Azul Systems is doing some
different stuff--this is now going to be Azul Systems specifics, my one or two slides
of that. We do very aggressive virtual-memory to physical-memory remappings. More or less,
we are in user mode controlling the virtual to physical mapping, and that requires terabytes/second
remapping rates, which we can't get out of [INDISTINCT] map and then [INDISTINCT] friends,
so we need OS hacks to do that right. This is still going to be safe across processes--you
don't break process safety--but, you know, if you screw up within a process, then the
JVM's got a bug and it crashes--that happens--so
this is something that we have to fix. But if we do this, we get this really nice GC
which, you know, handles sort of memory of any size without the sort of load max cost. Hardware
Performance Counters--Intel sort of screwed it up here in that they didn't make it cheap
to get at the Hardware Performance Counters--you have to go to the kernel, which means you have to
do a kernel context switch in and out. But the JVM is a natural consumer of Hardware Performance
Counters--I wish to hell it was cheaper to get at them. You can take Perf Counter
data, remap it back to bytecodes and from there into, like, Java lines of code or whatever,
and it's all up to the JVM to do that mapping, if I could only get that damn data out of the
kernel in an efficient way. So Azul is doing Thread Priorities by hand--we'd like to see
that go into the OS, and Fast Time--and we'd like to see Hardware Perf Counters come out to
user land because I want to screw with them and look at them, and same with Virtual-to-Physical
Memory Mappings. So, yes, summary--there's work to do, and that is it, and I have a few
minutes to take questions. >> You mentioned a lot of [INDISTINCT] code
optimizations [INDISTINCT] >> CLICK: Right, right, okay--so--there's
a lot of code optimizations, there's a lot of data optimizations floating around. HotSpot
right now doesn't do a lot of data optimizations, things sort of like value objects. There
has been a push for a long time to bring in the notion of value objects where, for instance,
if you had an array of points, you wouldn't have to have a Java object header for every
point of the billion points you wanted to draw in your 3D rendering scene, right? I don't
have a good answer for that--I have my stock [INDISTINCT] answer, which is, you know, turn
an array of structs into a struct of arrays--it kind of gets rid of the overhead that way,
but it obviously makes it less convenient to manipulate and use those things. Azul Systems
went to a one-word object header a while ago--that was a little bit of a data layout change, but
that's sort of been it. We're looking at changing the layout of Strings--well, we changed
it once already but we might change it again to be a more dense format--sort of a it file.
No wholesale interesting data layout optimizations going on.
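The array-of-structs to struct-of-arrays rewrite mentioned in this answer can be sketched as follows. The class names `Point3` and `PointCloud` and the helper methods are assumptions made for illustration; the talk gives no code. The first layout allocates one object, with its own header, per point. The second keeps three primitive arrays, so the header cost no longer grows with the number of points.

```java
// Array-of-structs element: one heap object (with its own header) per point.
final class Point3 {
    final float x, y, z;

    Point3(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

// Struct-of-arrays: three primitive arrays, so only three headers for any number of points.
final class PointCloud {
    final float[] xs, ys, zs;

    PointCloud(int n) {
        xs = new float[n];
        ys = new float[n];
        zs = new float[n];
    }

    // Copies an array of Point3 objects into the column-wise layout.
    static PointCloud fromPoints(Point3[] points) {
        PointCloud cloud = new PointCloud(points.length);
        for (int i = 0; i < points.length; i++) {
            cloud.xs[i] = points[i].x;
            cloud.ys[i] = points[i].y;
            cloud.zs[i] = points[i].z;
        }
        return cloud;
    }

    int size() {
        return xs.length;
    }

    // The inconvenience: callers pass an index around instead of holding a Point3.
    float lengthSquared(int i) {
        return xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i];
    }
}
```

The trade-off is the one stated above: the overhead goes away, but every operation on a point now takes an index into the cloud rather than an object reference.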
>> [INDISTINCT] >> CLICK: Escape analysis is not enabled--that's
a different--that's not a data layout game. >> [INDISTINCT]
>> CLICK: They're removing the whole object itself--they're trying to registerize the
object, and it's not turned on because it has--I assume it has bugs--bug issues. So let me
go to this guy, then you. Yeah, go ahead. >> [INDISTINCT]
>> CLICK: Oh God. Okay, can I talk at more length about Software Transactional Memory? Yes, I
can. Should I? Probably not in this context. I think it's the wrong way to do programming.
I have a large set of slides--I have a full, you know, 45-minute talk on why it doesn't
work--that I'm not going to do right now because I have a hard deadline, and so let's just say
that I think it's the wrong way to do it. If you Google me and Software Transactional Memory or
look on my website, you'll eventually find, you know, me blogging about the evils of STM.
Now I'm not Josh, the one behind. >> [INDISTINCT]
>> CLICK: Multi-cast is still useful for power users, I will agree to that.
>> [INDISTINCT] >> CLICK: Why am I comfortable enabling users?
>> [INDISTINCT] >> CLICK: Okay so that has two different answers,
one is I like enabling users because it's how I get, you know, my good feelings for
myself in life--I'm helping other people fix what they claim is their problem.
And the other one is--at Azul Systems, we fell over everyone else's broken concurrent
programs because our memory model was much more out of order than an X86, and programs
that would run on an X86 but had data races that wouldn't appear on an X86, or would appear
rarely on an X86--those data races would appear frequently and quickly on an Azul box.
So something that would crash once a week in production on X86 would crash in 10 minutes
on an Azul box, and at first people claimed our hardware was broken, and eventually they
figured out that no, in fact, their programs were broken, and they were happy after a while
to go fix their programs and happy that we discovered their bugs, because those had been
eating their lunch in production for, you know, years on end. But it was a very painful
path to go down, to go tell people that, "Your program is busted and I'm going to make it
obvious by making my new product dysfunctional on your application." So I think that it's
not a good business model, though maybe it would be a wonderful, you know, theoretical, "I love
the world" model. >> [INDISTINCT]
>> CLICK: At the time the Java bytecodes came out, there were other formats--lots of other
formats being explored for how to pass around program semantics. In particular, right after
Java became sort of semi-popular, Microsoft put out a bunch of research work on a different
alternative format that was clearly superior in a number of ways. Since that time, people
have done a lot of work on Java class files that's sort of semi kind of backwards [INDISTINCT]
you know [INDISTINCT] big one miss--that makes the format a lot denser, for instance, but still
the act of turning bytecodes into JIT'ed code has a lot of hoops to run through--with
a little forethought in the design, you know, if somebody who knew how to write a compiler
had talked to the designer of bytecodes--he's not here today--things would have been different
and a lot easier. I can name you a bunch of tricks. Here's one: if you have a full register
format instead of the stack format for your bytecodes, you can pre-register-allocate it so that it
has a correct register allocation for a 7-register CPU, and then double that, a 15-register CPU
and a 31-register CPU--8, 16 and 32 register machines, subtract one for
those--and then the JIT'ing process could've been as cheap as: take the bytecodes,
unpack them, unfold the register allocation flavor of choice, and you wouldn't have
to do anything more than stupid code generation. It gets you an uber fast
JIT for the first-tier JIT. As it stands now, you know, JIT'ing that first tier of Java
bytecodes is substantially slower than some of the alternatives that are out there, for,
you know, no real good reason. >> [INDISTINCT]
>> CLICK: It sounds like an ad for Azul Systems. >> [INDISTINCT]
>> CLICK: We changed GC algorithms to one that is independent of the size of your heap. So
we have, you know, people in production with 300 gigabyte heaps with, you know, max pause
times on the order of 10 millis, and then we have our X86 stuff, with which we have seen people do,
I think, 100 gigabytes with max pauses of, you know, a handful of millis. We changed GC
algorithms. We have a better GC story, and that's sort of, you know, half our business model.
So it is a solvable problem; it doesn't help unless you buy our gear, but, you know, you
can talk to SOP. >> [INDISTINCT]
>> CLICK: Because it's too big and complicated and it crashes, and if a JVM crashes but leaves
your box up, that's a more pleasant situation than when your box hard-core crashes and you have
to go power cycle it. >> CLICK: It's true but how often...
>> [INDISTINCT] >> CLICK: How often does your operating system
crash here? Really, how often does your operating system go down versus a Java Virtual Machine
go down? >> [INDISTINCT]
>> CLICK: So, okay, if you're on an embedded handheld device, the OS and the JVM are sort
of, you know, you might imagine them being basically one and the same, and then yes, whether
or not you're within the kernel or outside the kernel is sort of moot. At Azul Systems,
we had our micro kernel OS, more or less micro kernel style, but we kept
it as a kernel to get process protection across JVMs, in case a JVM crashed, and after a while
the kernel became very stable and the machines would essentially be up until, you know, they got
power cycled, but the JVMs were undergoing rapid, you know, evolution and they had bugs and
they would crash, and it would be nice to not bring the whole box down. You know, if
you got [INDISTINCT] box and it's got 20 JVMs running on it--if one of them's got a bug,
you don't want to bring everyone else down. >> [INDISTINCT]
>> CLICK: Deoptimization is the act of taking an optimized, running piece of Java code--
so Java code's [INDISTINCT] code, instruction selection, [INDISTINCT] everything on the planet--
and then discovering that deep in your inner loop, you had a virtual call that you made
static and inlined and unrolled and replicated and yada, yada, yada, and somebody loaded
a subclass that now made that virtual call truly virtual, and you have to JIT different
code. Furthermore, right after they loaded the class, they made a
new object and sort of published it in a global, which [INDISTINCT] immediately picked up, and
the next round through that entire heavily optimized unrolled loop had better pick up
the new virtual call when it picks up the new object. And to make that all work
correctly is sort of, you know, a semi-hat trick there. I can discuss it at more length
but I will probably be out of time if I do just this. It's basically something I figured out while
I was a grad student at Rice. I put it in the first version of the server compiler in HotSpot
[INDISTINCT] many years ago and the idea has sort of spread around. Really, you need to track the
Java Virtual Machine abstract state in the compiler and be able to rebuild it. At some
point in that loop, we're going to acknowledge and recognize the loading
of another class, and get that all to work right, and then once you acknowledge it, you
have to take your stack frame, which is this heavily optimized thing, and unpack it and
rebuild it as, for instance, interpreter frames, which [INDISTINCT] like it later we optimized--I
don't know. But that whole job is very machine specific because you're rebuilding actual
frames on the stack, and it's very JIT'ed-code specific. It's a bunch of things--a
lot of things are all tied up together in a tight little wad there. Well, I am going
to call that quits because I have a hard deadline and I want to be out the door shortly, so
thank you very much for your time and attention.