George Baklarz talks big data and best-in-class compression with DB2 10


Uploaded by developerworks on 21.09.2012

Transcript:
[ MUSIC ]
LANINGHAM: Welcome to the developerWorks podcast.
I'm Scott Laningham.
My guest this time is George Baklarz, who is program director at Worldwide DB2 Sales.
He's here to talk about DB2 10, IBM's flagship data management product,
and how it addresses the many issues that are spiraling around big data
which gets bigger and bigger every day.
George, nice to see you.
Thanks for joining.
BAKLARZ: Thanks a lot, Scott.
Glad to be here.
LANINGHAM: You know that big data story; the numbers around it are pretty astounding.
And I know, you know, people love to read that stuff on the news
and we'll hear it in different places.
But I thought I'd just mention a few little facts just because they're so interesting.
I Googled, you know, "how much data is created in a minute" on the web,
and this site Visual News had a posting from a little bit earlier in the year with some
of these stats: every minute YouTube users upload 48 hours of video, every minute.
Facebook users share 684,000 pieces of content.
Instagram, 3,600 new photos.
Tumblr, almost 28,000 new posts published.
571 new websites created every minute -- that one was amazing.
And the mobile web grows by approximately 217 new users every minute.
Clearly everybody needs help dealing with the data, right?
BAKLARZ: I think part of the problem is, is all of this stuff useful?
LANINGHAM: Right, well, that's true.
So part of the data story is sorting through all the stuff that isn't of any use, right?
BAKLARZ: Absolutely, yes.
You know, and that's where DB2 can help a lot.
I mean, you know, if you look in the past, DB2, you know, 10 years ago people thought of it
as an OLTP database, right, for transactions.
And things have changed dramatically since then, so we can now be used
for Big Data, for instance, right?
Things like storing images, text, using XML, XQuery, using NoSQL, RDF,
all these new technologies now are integrated into DB2.
So we're no longer just an OLTP database.
LANINGHAM: Okay, now, let's talk about 10, DB2 10 for a minute.
How long has DB2 10 been out?
BAKLARZ: We announced and shipped in April of this year,
so it's been out, you know, about seven months.
LANINGHAM: Great response so far?
What's the response been like?
BAKLARZ: Oh, absolutely.
The thing is that depending on who you are, you know, whether you're a developer, a DBA,
or an end user, there are different features in the release that people love.
So the thing is, there's been a really broad spectrum in terms of what people like in it.
Right? So it's not one feature.
There always seems to be a number of features people really like in this release.
LANINGHAM: You mentioned developers, and this is developerWorks.
What's your favorite developer feature in DB2 10?
BAKLARZ: Probably the best one I've seen is something called Time Travel Query.
It's like going back in time.
So for instance, you can do queries that say, what did the data look like a year ago?
Right? So you could do things like going back in time, doing analysis based
on a period of time for sales or salaries.
For instance, a typical query could be, tell me what this individual's salary was last year
at this time, and I'm able to do queries like that.
So we can keep track of the data as it changes over time,
so lots of different applications that use this.
In fact, developers would have had to do this on their own before, with their own programming.
A lot of coding.
So we eliminate a lot of that tedious work, and now it's built right into DB2.
It's a really neat feature.
LANINGHAM: To save snapshots, time-based snapshots like that,
what's the technology involved to do that?
I mean, to a layman like me it sounds like just mounds and mounds and mounds
of data to have all those pictures.
Or is it simplified in some way?
BAKLARZ: Yes, well, absolutely, we keep tons of data, right?
Which is good for a couple of things -- our disk drive division loves that.
[ LAUGHTER ]
What you're doing basically is think of it as a shadow table, right?
We have one table that contains your base data.
So today's salaries and employee information, for instance.
And when we make changes to those -- so we update somebody's record, we delete it,
for instance -- we'll keep a copy of that old version in another table, a shadow copy.
So, when someone says, show me the information from a year ago, we go to the shadow copy
or the current copy and figure out what's the most...or,
what's the version of the data at that time.
So, it's like I said, going back in time.
It's quite a neat technology, yes.
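The shadow-table mechanism George describes can be sketched in a few lines of Python. This is a toy illustration of the idea, not DB2's actual implementation, and all the names here are made up:

```python
class TemporalTable:
    """Toy version of the shadow-table idea: `base` holds the current row
    per key, and every update pushes the old version, stamped with its
    validity window, into a `shadow` (history) list."""

    def __init__(self):
        self.base = {}      # key -> (row, start_ts)
        self.shadow = []    # (key, row, start_ts, end_ts)

    def upsert(self, key, row, ts):
        if key in self.base:
            old_row, start = self.base[key]
            # The superseded version moves into the shadow table.
            self.shadow.append((key, old_row, start, ts))
        self.base[key] = (row, ts)

    def as_of(self, key, ts):
        """Answer 'what did this row look like at time ts?'"""
        if key in self.base:
            row, start = self.base[key]
            if start <= ts:
                return row
        for k, row, start, end in self.shadow:
            if k == key and start <= ts < end:
                return row
        return None

t = TemporalTable()
t.upsert("emp1", {"salary": 50000}, ts=2011)
t.upsert("emp1", {"salary": 55000}, ts=2012)   # old salary goes to the shadow
print(t.as_of("emp1", 2011))  # -> {'salary': 50000}
print(t.as_of("emp1", 2012))  # -> {'salary': 55000}
```

In DB2 itself this is expressed declaratively with system-period temporal tables and `FOR SYSTEM_TIME AS OF` queries; the point of the sketch is just the base-plus-shadow lookup George walks through.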
LANINGHAM: Very cool.
Now, does compression...or, where does compression come in to this whole picture?
And I know that DB2's compression, we've touted it as best in class for a long time.
And talk about compression in general and what makes DB2's approach so special.
BAKLARZ: Yes, well, compression is very important for many of our customers.
We actually introduced it about, oh, I'd say about eight years ago.
So, compression has gone through a number of iterations.
So, compression as we know it in DB2 today compresses the data that's in tables,
compresses indexes, compresses XML data, compresses log files.
So it's throughout the entire system.
So what we try to do is we look for patterns in the data, and we take those patterns out
and place them into a dictionary.
So basically it's a dictionary type compression approach.
When customers first started with compression in DB2,
they were seeing around 50 percent compression on their base data.
In Version 10, we introduced something called adaptive compression,
so it's kind of a new technology on top of our original table-level,
dictionary-based approach, where we now look for patterns that are very localized --
within individual pages in the database.
So now they're seeing 70 percent, 75 percent type compression on their data.
So the benefits are huge.
I mean, you've saved tons of space on disk, right?
But what people tend to forget is that it's not just what I save on disk;
it means that I get more data in memory,
which means I can use my memory more effectively.
I can get twice the amount of data in there.
My transactions go faster.
So customers actually have seen improved performance as well as reduction in disk space.
So that's why we call it leading edge.
It's not just what you get on disk space, but you also get all these benefits in terms
of memory usage, and logging is reduced, and all sorts of great benefits with compression.
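The dictionary approach George outlines can be sketched roughly like this. This is a toy model of pattern-dictionary compression over string column values, not DB2's actual algorithm; all function names are illustrative:

```python
from collections import Counter

def build_dictionary(rows, min_count=2, max_entries=255):
    """Find the most frequent patterns (here, whole string column values)
    and assign each a short integer code."""
    counts = Counter(value for row in rows for value in row)
    common = [v for v, c in counts.most_common(max_entries) if c >= min_count]
    return {value: code for code, value in enumerate(common)}

def compress(rows, dictionary):
    """Replace dictionary hits with their short code; pass the rest through."""
    return [[dictionary.get(v, v) for v in row] for row in rows]

def decompress(rows, dictionary):
    """Invert the mapping. Toy assumption: original values are strings,
    so any int we see must be a dictionary code."""
    reverse = {code: value for value, code in dictionary.items()}
    return [[reverse[v] if isinstance(v, int) else v for v in row]
            for row in rows]

rows = [["Toronto", "Canada"], ["Markham", "Canada"], ["Toronto", "Canada"]]
d = build_dictionary(rows)
packed = compress(rows, d)         # repeated values shrink to one-byte codes
assert decompress(packed, d) == rows
```

The adaptive compression he mentions is, in this picture, like building additional small dictionaries per page instead of one global one, so very localized repetition also gets captured.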
LANINGHAM: Is Watson helping us with our compression R&D?
Just curious.
BAKLARZ: Well, absolutely.
When you take a look at the amount of data that we have to look at in Watson, right,
compression is obviously a big benefit here because we can reduce the number of,
let's call it "floors of disk space" that we need in order
to feed it all this information, right?
LANINGHAM: So Watson's a great trial zone to practice all this in and continue to tweak it
and continue to innovate around compression then.
BAKLARZ: Oh, absolutely.
I mean, there is still more compression innovation that we're looking at, so you know,
people have always looked at compression as being a type of compression for transactions,
because it has to be efficient, very fast, you don't want to get
in the way of your transaction speed...
Versus compression techniques for warehousing, where we have millions and billions of rows
and perhaps we use what I call a static type of compression, which doesn't change.
It can be very effective, very efficient, but when I try to update it, it's very expensive.
So trying to merge those two worlds, find better ways
of merging what I call OLTP type compression with business analytics compression,
make those two work together better.
So you'll see more stuff from us, but Version 10, pretty good for customers.
LANINGHAM: Where does hierarchical storage management come in to this picture,
and what's going on with DB2's approach around that?
BAKLARZ: Hierarchical storage, it's an interesting concept.
Today we see a lot of customers wanting to take advantage of things like SSD drives, right?
Very high performance, fast access.
But in order to use that effectively in DB2
or any database, you want to place the hot data carefully, right?
And this is where you get hierarchical storage from.
You want to place the hot data on the best performing, fastest drives.
So what you want to do is say that, all right, anybody who's looking at data
that may be the month of October, or say, fourth quarter, that we're going to place that on
to an SSD drive so they get absolutely the best performance.
And older stuff -- third quarter, second quarter --
I'm going to put it on SAS drives or SATA drives, you know, move it into slower storage.
LANINGHAM: I see, yes.
BAKLARZ: Okay?
So what hierarchical storage is in DB2,
it allows the administrator, the DBA, to say, this hot data?
I'm going to directly place it on SSD drives.
The colder stuff I'm going to put on slower drives.
And I could put a policy in place to migrate that data seamlessly, automatically online.
So that means once my old data becomes colder, right?
So when my new first quarter 2013 data comes in, I can migrate that stuff
onto slower drives without anybody knowing.
Right? So I can effectively manage the heat of my data using this technique.
LANINGHAM: So the logic to do all that is built in to 10, it just does it for you.
BAKLARZ: Yes.
Sure. I mean, some DBAs don't trust automation in DB2, so they can do this manually.
But yes, you can automate all of this stuff.
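The age-based placement policy George describes can be sketched like this. It's a simplified illustration of the multi-temperature idea, with invented tier and partition names, not DB2's actual policy engine:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    """One range of table data, e.g. a quarter of sales records."""
    name: str
    age_in_quarters: int
    tier: str = "ssd"

def apply_policy(partitions):
    """Toy multi-temperature policy: the newest quarter stays hot on SSD,
    the next two quarters go to SAS, everything older goes to SATA."""
    for p in partitions:
        if p.age_in_quarters == 0:
            p.tier = "ssd"
        elif p.age_in_quarters <= 2:
            p.tier = "sas"
        else:
            p.tier = "sata"
    return partitions

parts = [Partition("q4_2012", 0), Partition("q3_2012", 1), Partition("q1_2012", 3)]
for p in apply_policy(parts):
    print(p.name, "->", p.tier)
# q4_2012 -> ssd
# q3_2012 -> sas
# q1_2012 -> sata
```

In DB2 the DBA expresses this through storage groups mapped to different drive classes; the sketch just shows the shape of the rule: data heat decides its tier, and re-running the policy as data ages migrates it down.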
LANINGHAM: As a last thought, I wanted to ask you about PureScale too, DB2 PureScale.
What is it and where does it fit in to this story?
BAKLARZ: Okay, yes, great question.
PureScale is a component of DB2.
It is now integrated actually into DB2 10.
PureScale is our continuous availability feature within the product.
By continuous availability, I mean that I can create what I call multiple copies
of DB2 running -- we call those "members" --
and have each one of those members act as one big database.
So in the event that one of these members happens to fail through, let's say, a power failure
or some kind of human failure -- which, by the way, tends to happen more often
than physical failures -- the other members will continue to run
and the user will not see a failure.
They'll continue to run their transaction.
Perhaps the entire complex may run slightly slower,
but it doesn't matter, your system does not go down.
We call that continuous availability, and PureScale gives you that capability.
It is now a key part of Version 10 -- not a separate installation,
just a feature that a customer can install.
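The member model George describes can be sketched as a toy cluster: losing a member reduces capacity, not availability. This is illustrative only; it says nothing about how PureScale actually coordinates its members:

```python
import random

class Cluster:
    """Toy model of the 'members' idea: any available member can serve a
    transaction, so one member failing degrades capacity, not availability."""

    def __init__(self, member_names):
        self.up = set(member_names)

    def fail(self, name):
        # e.g. a power failure or human error takes one member down
        self.up.discard(name)

    def run_transaction(self, txn):
        if not self.up:
            raise RuntimeError("no members available: outage")
        member = random.choice(sorted(self.up))
        return f"{txn} handled by {member}"

cluster = Cluster(["member1", "member2", "member3"])
print(cluster.run_transaction("txn-1"))
cluster.fail("member2")                  # one copy of DB2 goes away...
print(cluster.run_transaction("txn-2"))  # ...users are still served
```

The "complex may run slightly slower" point falls out naturally: fewer members share the same transaction load, but the user never sees a failure.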
LANINGHAM: All right.
George, you'll be at Information on Demand 2012 the end of October in Las Vegas.
What have you got planned for the conference?
What are you going to be into there?
BAKLARZ: Well, I'm attending a lot of sessions.
I'll be on the showroom floor.
One thing I want people to know, if you want any more details on Version 10,
we've got a little Flash book on Version 10 features.
We're going to be at both of the IM keynote speeches, on Monday and Tuesday.
There will be books available for everybody, and we'll have signing sessions at the bookstore
as well, on both Monday and Tuesday at lunchtime.
LANINGHAM: Great, and we have this chart up so people can look at the things
that are going on around this topic at IOD.
Again, George Baklarz has been talking with us.
George, again, program director, DB2 Worldwide Sales.
Thanks for taking time for this.
BAKLARZ: Thanks a lot, Scott.
LANINGHAM: Be sure and register to attend IOD 2012 in Las Vegas which will be October 21
through 25 at ibm.com/events/informationondemand.
Follow us on Facebook at facebook.com/informationondemand,
and on Twitter at twitter.com/ibm_iod or at the hashtag ibmiod.
This is the developerWorks podcast.
I'm Scott Laningham.
Talk to you next time.
[ MUSIC ]