Supersymmetry, Extra Dimensions and the Origin of Mass


Uploaded by Google on 25.07.2007

Transcript:

And it's my great privilege to introduce Professor Marjorie
Shapiro from Berkeley.
She is the chair of the physics department at
Berkeley, and she also does research in high energy
physics at Lawrence Berkeley Laboratory,
at Fermilab, and at CERN.
And today she's going to tell us more about the latest
trends in high energy physics and its unique requirement in
terms of computing.
So here's with Professor Marjorie Shapiro.
Thanks.

PROFESSOR MARJORIE SHAPIRO: Thanks very much.
I really am pleased to be here, and I'd like to thank
Jay and Matt for inviting me.
I feel a little presumptuous giving a talk that has
anything to do with computing here, and that about the only
thing I can say in my defense is that in the old days,
particle physics really was one of the drivers for high
performance computing.
I'm afraid that really isn't true anymore.
So I'm actually hoping that during the Q&A, I'll learn
some stuff from you where you can say, why the hell are you
doing it this way?
There are much better ways that we can show you how to do
the computing that we want to do.
So the title of my talk is Supersymmetry, Extra Dimensions
and the Origin of Mass: Exploring the Nature of the
Universe Using Petascale Data Analysis.
And what I really want to do today is, I want to spend some
time explaining to you why those of us who are working at
the Large Hadron Collider are so excited about the
possibilities for the next few years, what we might see, and
what the relevance is to problems of interest to
anybody who cares about science.
And then, since this is Google, I really have to say
something about computing.
So I'll try and emphasize some of the aspects of our
experiment that make it challenging from the computing
point of view.
In particular, the fact that we have very high data rates
in an online environment, and that we collect very large
quantities of data that's going to be looked at by
people all over the world.
So this is a small enough room that, please, if you have
questions, just yell them out in the middle.
I'm much happier giving an interactive talk than having
you guys just sit there like sticks.
So don't feel embarrassed about
interrupting me in the middle.
So let me just start by talking about something that
really isn't Particle Physics, but is becoming more and more
in common with Particle Physics, and that's
Astrophysics.
When I was a graduate student, Particle Physics and
Astrophysics were totally separate fields.
They had almost nothing to do with each other.
And in fact, those of us who do Particle Physics used to
laugh about Astrophysicists because we always said that
the uncertainties there were in the exponent.
And in the old days, that's really the way it worked.
And one of the amazing things is that's changed so much in
the past 10 or 15 years.
At this point, you can really use the universe as a
laboratory, and there are lots of amazing experiments out
there that are trying to understand about the
fundamental forces and the nature of matter by doing
experiments where they use the universe as a laboratory.
It's an area of very high precision, astrophysics
observations, with very large amounts of data and very
precise measurements.
There's lots of discussion about what the systematic
uncertainties are, and you can see the quality of some of
this data by these images that really look back into early
time and tell you what the universe looked like.
So why is that relevant for this talk?
Well, I wanted to point out that things happen in the
other direction as well. We can use the universe as a
laboratory to tell us about interactions that are
important on Earth.
But we can also use the laboratory to tell us about
things that matter in terms of the universe.
And when I talk about laboratory, that really can
have a bunch of different meanings.
There are a number of non-accelerator experiments.
For example, at the top here, I've shown an image from
SuperKamiokande which is an underground experiment that
uses water Cherenkov detectors to look at neutrinos.
And at the bottom, I've shown an experiment from just down
the road at SLAC that looks at matter asymmetries using
the BaBar detector. So both accelerator and
non-accelerator based experiments can really tell us
a lot of things that are important for the fundamental
interactions that matter for the universe.
Now in fact, if you want to describe the universe, you
need to describe it in terms of the fundamental forces and
interactions and the fundamental particles.
And it turns out, which forces and
interactions are relevant is a function of what time in the
universe you're talking about.
The reason for that is the universe has undergone a
number of phase transitions, so the particles that are
relevant depend on what the temperature of the universe
is, which depends on what the age of the universe is.
So if you're talking about the very early universe, you have
to talk about super strings.
Go a little bit later, and a little bit later means 10 to
the minus 35 seconds, so it's still pretty early.
That's called the Grand Unified Era.
Then there was a period of inflation, followed by the period
where the particles that we know now begin to form:
the Electro-weak Era.
It's in this Electro-weak Era where particles gain their
masses, and then, as you can see, you go on to longer
periods of time where galaxy formation begins and the
universe as we know it takes shape.
So if we want to have a theory of cosmology that goes as far
back to the beginning as possible, we need to
understand the particle physics that goes into it
because that's what's going to determine whether we can model
the evolution of the universe.
And in particular, this Electro-weak Era is one that's
extremely important for the experiments that I'm going to
talk about today.
And it's possible, by doing precise measurements, we can
also learn something about the Grand Unified era as well.
OK, so the next generation of experiments plays a very
special role in this.
And the reason is because of the energy of those
experiments.
I pointed out on the previous slide that 10 to the
minus 10 seconds was the Electro-weak period.
And in fact, the next generation of experiments has
enough energy that it can produce interactions at energies
comparable to what happened during that
Electro-weak Era.
So it's a chance for us really to look back in time and
understand what those interactions
look like, in detail.
So the higher energy reproduces the conditions of
the early universe.
And in particular, by looking at our theories, we know
something has to happen at these energies.
If you look at the models that we have now of how particle
physics works, if you assume that there's nothing except
the particles we've already seen, it turns out that the
equations that govern them, which give very good
predictions for what's happening, start to break down
once you get to high energy.
And in fact, they break down so badly that they
produce infinities.
They produce cross sections or measurements predicting rates
that are unphysical.
So we all know something else has to come into the theory
that's going to cancel that breakdown.
In what's called the standard model, the description of
Particle Physics that we know now, there's a single extra
particle called the Higgs Boson that gets rid of those
infinities and makes the theory well behaved all the
way back to the beginning of the universe.
We don't know that that's the right theory, but independent
of whether that one is right or not, something has to
happen when we get to the TeV energy scale.
Because if nothing happened, then our
equations would break down.
And so, at the very least, we have to find that our
predictions aren't going to work at those energies.
So in fact, there are lots of theories that explain what
happens at the TeV energy scale.
But theories are great.
The real question is, can you confront them with data and
can you tell which ones are correct or not?
So the real goal of this next generation of accelerator
experiments is to try and distinguish between the many
possible models of high energy collisions that are relevant
for the early universe, and decide which ones are correct
and which ones are beautiful, but really are only fantasies.
So what are the things we might find when we look at
these high energy collisions?
Well, within the standard model, the mass of all
particles is actually not just a God-given thing.
It's something that's generated dynamically, and
it's generated by the interaction of a specific
particle with all the other particles.
That specific particle is called the Higgs, and it's the
absorption and radiation of these Higgses that give all of
our other particles their mass.
So if that model is correct, we ought to be able to see
that Higgs Boson.
And in fact, it's predicted to have a mass in the range that
would be accessible by these new experiments.
The second thing we might find is a new symmetry of nature.
There's a model called Supersymmetry, where every
particle we know about has a partner, and those partners we
haven't seen because they're too heavy.
One of the interesting things about this theory is, not only
would it give us employment for the next 30 years for
particle physicists, because we have a whole raft of new
particles to measure, it also would explain something that's
really important to the astrophysicists.
Astrophysicists always talk about dark matter.
We know, from gravitational measurements, that there's
something out in the universe that we can't see through
telescopes, but we don't know what that is.
In Supersymmetric models, the lightest supersymmetric
particle is stable and it interacts very weakly with
other things.
So it's a very good candidate for dark matter.
So if we saw this in an accelerator, the next question
people would ask is, do the properties of what we see in
the accelerator explain what we would see for the dark
matter in the universe?
And finally another wacky idea, but one that's really
being touted by a lot of famous theorists, is the idea
of extra space time dimensions.
The string theorists believe there are 10 or 11 dimensions.
We only have four, three space and one time dimensions in the
world that we know.
And what they argue is, the other dimensions are just
really tiny so we don't see them.
And there are a number of models where some of those space time
dimensions actually are big enough that when we get to
high energies, such as what you will have at the LHC, we
would actually see those extra dimensions.
That would be absolutely
revolutionary if that was found.
And in fact, one of the consequences is it would make
gravity become strong at the scale of the LHC.
And you'd have to talk about quantum gravity, which is
something that most of us have always thought that you
wouldn't be able to ever experimentally address for
probably hundreds of years.
And again, these are only some of the possibilities.
Whenever you reach a new energy regime, you don't know
what you're going to find.
And the important thing is to keep your mind open, to look
at as many models as possible, and to make sure you design
your experiments and your analysis of the data so that
no matter what's there, you have a chance of seeing it.
OK, so what's our next machine?
It's called the Large Hadron Collider, LHC for short.
Its energy is 14 TeV. It's protons colliding with protons, so
there's two rings.
They go around in opposite directions and they bash into
each other.
The energy of this machine is seven times the highest energy
currently available.
Right now, the highest energy is 2 TeV. That's at the
Fermilab collider in Illinois.
The intensity is very high, a factor of five higher than the
current intensity at the Tevatron in the first year,
and going up an order of magnitude more than that after
three years.
So what I've shown on the right hand side here is a
picture of the tunnel that the magnets for the LHC go into.
So since we're accelerating protons, what we do is we
inject them.
We have big magnets that force them around in a circle, and
they collide where the experiments are.
Now the scale of this experiment is pretty amazing.
It's actually being built in Geneva, Switzerland, because
there's an existing tunnel from an earlier accelerator.
And the civil construction, it turns out, is one of the major
costs of building a machine like this, so using an
existing tunnel was a major plus.
The tunnel is right on the Swiss
French border near Geneva.
And in fact, half of the machine is in France, half the
machine is in Switzerland, and the circumference is 27
kilometers.
So you can see a picture drawn where the ring goes, with the
towns and the fields all around it.
The machine itself is underground because it uses
the earth to shield the people in the area from any radiation
that you would get from the accelerator otherwise.
OK, so what are the challenges of working at the LHC?
Well, first of all, high energy collisions require very
complex detectors.
Because the energy of the particles produced is very
high, you need a really big detector in order to capture
all that energy.
So that happens in two ways.
Some particles you capture.
You measure their energy in what's called the calorimeter.
You do that by stopping the particles and seeing how much
energy they deposit.
If something's got really high energy, you need a very deep
thing to capture it.
The other thing is, with charged particles, you measure
their momentum by having them curve in a magnetic field.
The higher their momentum is, the less they curve in the
magnetic field.
And so you need a big lever arm to get good enough
measurements to measure their momentum.
So both of those things need really big detectors.
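As a rough illustration of that lever-arm argument, here is a minimal Python sketch using the textbook relation between transverse momentum, magnetic field, and radius of curvature; the field and tracker length below are illustrative numbers, not ATLAS design values.

    # Why high-momentum tracks need a long lever arm.
    # Textbook relation for a charged particle in a solenoid:
    #   pT [GeV] ~ 0.3 * B [T] * R [m], with R the radius of curvature.
    # What the tracker actually measures over a length L is the sagitta,
    #   s ~ L^2 / (8 R): the stiffer the track, the smaller the sagitta.

    def sagitta_microns(pt_gev, b_tesla=2.0, lever_arm_m=1.0):
        radius_m = pt_gev / (0.3 * b_tesla)
        return 1e6 * lever_arm_m**2 / (8.0 * radius_m)

    for pt in (10.0, 100.0, 1000.0):  # GeV
        print(f"pT = {pt:6.0f} GeV  ->  sagitta = {sagitta_microns(pt):8.1f} microns")

With these toy numbers, a 1 TeV track bends by only tens of microns over a one-meter lever arm, which is why the tracker has to be both large and very precisely measured.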
The second thing is, in these high energy collisions lots of
particles come out, typically one or two hundred.
And if you want to distinguish one or two hundred particles,
you need very fine segmentation in order to be
able to do the measurements.
So these large detectors have very large numbers of
electronic channels, and so the data rates are quite high.
The second challenge is that the processes we care about, in
particular the Higgs particle I just told you
about, are very rare.
So in order to see a very rare process, you have to produce
lots of collisions because only one in a very small
number of them produces the process you care about.
So we need very high intensity beams.
Now the problem is, something gets produced every time you
collide these guys.
So even though the stuff we care about is very rare,
there's lots of other stuff happening.
So we have to try and pick out the few events we care about
from a bunch of debris that are real interactions, but they're
just sort of boring things that we're not
very interested in.
Now we always talk about that being the needle in a haystack
problem, but recently there was an article in the New
Yorker, and they said that they really thought that the
right way to describe it wasn't a needle in a haystack,
it was a needle in a needle factory.
And I think that's a really good comment because the
events we care about don't look all that different from
the ones we don't.
So one of the real challenges, from the analysis
point of view, is to be able to tell which are the ones we
care about.
Make sure we keep the ones we care about, while throwing out
the vast majority of the ones that we don't.
OK, so in terms of detectors for the LHC, in fact, there
are four detectors.
There are two large, general purpose detectors, that are
trying to do the kind of physics that I was talking
about before.
Their goals are similar.
The design trade-offs are, one chooses one thing,
one chooses the other.
I would say it's like Google and Yahoo.
And I'm on Atlas, so I think Atlas is Google, but you guys
can choose yourself.
There are two other detectors that have
very special purposes.
One is designed to study B decays, the same sort
of thing they do at Stanford.
The other is designed to look at heavy ion collisions.
There's going to be one month of heavy ion running per year
at the LHC.
I'm only going to concentrate on Atlas because that's my
experiment, and so it's the one I know the most about.
Most of what I'm saying would be pretty true for CMS.
Although the actual details may differ, the overview
really isn't that different.
So the first thing is these experiments are big.
So on the left hand side, what I've done is, I've taken a
drawing of the Atlas detector, which I'll tell you a little
bit more about later, and I've superimposed it on a drawing
of a five story office building where we have our
offices at CERN.
And so that sets the sense of scale.
It's a five-story-tall detector, so it's pretty
phenomenally large compared to the detectors we've known in
the past.
It's also way underground, and so the pieces are
built on the surface.
They're brought down in a shaft and are assembled here.
And that assembly process is in the middle right now.
And in fact, if you go to the Atlas web page--
I don't have to give you the address, you
can just Google it.
You'll find we have a webcam in the collision hall and you
can watch, in real time, what's happening as the
experiment is being put together.
Second thing is Atlas is a complex detector.
It has lots of different pieces.
Again to set the scale, we have our
canonical person over here.
There's another one over there.
So you can see it's really quite a large detector.
The detector is built like an onion, with different pieces
of it that serve different functions.
If you look in the horizontal direction, this is
where the beam goes.
So there's a proton beam going this way, a proton beam going
that way, and they collide right in the
middle of the detector.
The things on the inside are to detect charged particles.
They're inside a magnetic field, and that magnetic field
is a solenoidal magnet that looks kind of like a Coke can.
And that's what those red lines are there.
Then outside the magnet, there's what's called
calorimeters.
They measure the energy of the particles by stopping them and
seeing how much is deposited.
And all the stuff on the outside are to detect one
species of particles called muons.
Muons are kind of like electrons, only heavier.
And one of their features is they go through a lot of
material without interacting.
So if you look at what comes at the end, then what you'll
get is a muon.
AUDIENCE: How do these other detectors that you were
talking about share the space?
PROFESSOR MARJORIE SHAPIRO: There's four collision points.
Each one has its own hall, and the beam is forced to collide
in four different places.
In fact, one of them, at least, runs at a different
time because it's only heavy ions.
But these are just too massive to move in and out, so each
one is separately built on the beam line.
And in fact we have a problem when we're running, we can't
service the detector because when the beam's on, it's a high
radiation area.
So while it's not as bad as being in space, it has some
similar problems, in that access is quite
limited during running.

OK, so an experiment of this size needs a big team.
And what's shown on the right is a picture that was taken at
one of our collaboration meetings of people in our
office building who are members of the experiment.
And it's hard to see all their faces, but typically we have a
group of about 2,000 physicists and engineers.
And because of the complexity of these detectors, the
engineers really are full collaborators in terms of
deciding how the detector is going to operate.
You can see that we have worldwide collaboration.
The areas where we have collaborators are in yellow.
About the only continent that's really
not covered is Africa.
And we don't do too well in south Asia because India is on
the CMS experiment, not on Atlas.
But it's a real challenge to find a meeting time that works
for everyone.
It's pretty much impossible and unfortunately the poor
Japanese tend to get screwed.
Because you can find a time that works for the U.S. and
for Europe, but it doesn't work for
Asia at the same time.
OK so how does this experiment work?
As I said, we have this detector and what we do is
record the particles that are produced in the collision.
We record the time, the location, the momentum, the
energy, and the charge of the particles.
And this onion of detectors allows us to tell the
properties of those particles in terms of those things.
And we infer the characteristics of the
interaction from those properties.
We use highly specialized custom electronics and data
acquisition systems. That's the sort of messy picture on the
left of what the data acquisition system looks like
in one of our test beams. And on the right are pictures of some
of the custom electronics we use.
And almost all of our electronics is custom ASICs that
are designed by engineers and are produced solely for us
just because they need to be very fast. They need to be
very radiation hard, and they don't really have too many
industry applications.
We do use the semiconductor industry for doing our
production runs of the electronics and for some of
our detector elements.
So we work with the same companies that produce chips
for computers, although we send them our own masks and
designs for how to produce electronics.
So here's a schematic view of how the detection works.
Right at the middle, again, is where the collision occurs.
After that we have a bunch of what are
called tracking detectors.
Energy is deposited through ionization, and that's
detected on individual elements in that detector.
The charged particles curve because this thing here is a
magnetic field, and so they curve in that magnetic field.
After they exit the tracking detectors, they enter what's
called the calorimeter.
And that calorimeter is divided into a front section that's
optimized for finding photons and electrons, and a back segment
that's optimized for finding charged particles that
aren't photons and electrons.
Those are called hadrons, and then this big section at the
back is for looking for muons.
There are some particles that don't interact at all.
Neutrinos have such a small interaction rate that they
escape without us seeing them.
We can infer their existence because, when we look at the
energy in the event, we see an imbalance.
So if we see a big bunch of energy on one side with
nothing recoiling against it, that tells us something must
have escaped without being measured.
And so, in fact, we also have sensitivity to the neutrinos
by saying it's minus whatever else we didn't find.
That, of course, makes an assumption.
You can't tell a neutrino from another
non-interacting particle.
And that's an issue for models like supersymmetry, which also
have heavier non-interacting particles.
The signature for us would look the same.
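To make the energy-imbalance idea concrete, here is a minimal Python sketch of inferring missing transverse energy as the negative vector sum of everything that was measured; the object list is purely hypothetical.

    import math

    # Missing transverse energy: sum the measured transverse momentum
    # vectors and take the negative of that sum as the "missing" recoil.
    visible = [          # (pT in GeV, phi in radians) of measured objects
        (120.0, 0.3),    # a jet, say
        (45.0, 2.9),     # an electron
        (30.0, -2.0),    # another jet
    ]

    px = sum(pt * math.cos(phi) for pt, phi in visible)
    py = sum(pt * math.sin(phi) for pt, phi in visible)

    met = math.hypot(px, py)            # size of the imbalance
    met_phi = math.atan2(-py, -px)      # direction something escaped toward

    print(f"missing ET = {met:.1f} GeV at phi = {met_phi:.2f}")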

So one of the issues that we have that's different from
what you guys have, is that we have to work in real time.
The collisions at the LHC happen every 25 nanoseconds,
and there's no way that you can record every 25
nanoseconds.
And even if you could, you wouldn't be able to handle the
amount of data that's produced.
Something happens every crossing, so you can't say,
well, I'm only going to record when something happens.
What happens may not be very interesting, but something
always happens.
So what we need to do is, we need to find some way of
deciding what the interesting events are in real time, and
selecting those events and throwing out all the others.
So that's what our trigger system does.
Now because it has to work at high rate, and because the
decisions are difficult, we divide the trigger system into
three levels.
The reason is that the first level has to take the full
rate of the LHC every 25 nanoseconds.
So it has to accept every event, look at it very, very
quickly and say yes or no.
I think this might be interesting.
That's done with special purpose hardware.
Now you might say, how can you make a decision in 25
nanoseconds?
We don't.
We have a very long pipeline, so we keep all of the events in
memory while we're deciding.
And the length of the pipeline tells us how many events we
can pass through the system while
we're making the decision.
But in the end--
yes.
AUDIENCE: How about how much data are we talking
about for one event?
PROFESSOR MARJORIE SHAPIRO: I'll show you
that in a few minutes.
It's big, because there's lots of channels.
But I should comment that even here, as part of the readout,
we do do zero suppression.
Because there's so many channels in the detector, that
if we read every one out we'd have a real problem even after
the triggers.
So I'll show the rates in a couple minutes.
OK, then the second level trigger, unlike the
specialized electronics of the first level, is just PC-based.
It's running fast algorithms, but it's standard C++ code.
The thing that makes it special is the fact that it
still doesn't see the whole event.
What it does is, it looks at everything in a road, sort of
a wedge around each of the trigger objects from the first
level trigger, and it tries to make a fast decision based on
the other information in that road.
And finally in the third level, which is also PC-based,
you build the whole event, you see the whole event, and you
can make more complicated decisions where you basically
have everything about the event in order to decide.
And you can correlate information.
I see this on this side of the detector, that on that side of
the detector.
So the rates are, the Level 1 trigger takes the 40 megahertz
from the machine, accepts 100 kilohertz.
Level 2 takes 100 kilohertz, accepts three kilohertz.
Level 3 has an accept rate of about 200 Hertz, and that 200
Hertz is actually cached on disk, then written to magnetic
tape for archiving, and sent offline for further
processing.
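Just to spell out the arithmetic behind those numbers, here is a small Python check of the rejection factor at each stage; the rates are the ones quoted above.

    # Trigger rejection implied by the quoted rates.
    rates_hz = {
        "collisions": 40e6,   # 40 MHz bunch crossings
        "Level 1":    100e3,  # 100 kHz accepted
        "Level 2":    3e3,    # 3 kHz accepted
        "Level 3":    200.0,  # 200 Hz written out
    }

    stages = list(rates_hz.items())
    for (_, rate_in), (name, rate_out) in zip(stages, stages[1:]):
        print(f"{name}: keeps 1 in {rate_in / rate_out:,.0f} of its input")

    overall = rates_hz["collisions"] / rates_hz["Level 3"]
    print(f"overall: keeps 1 in {overall:,.0f} crossings")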
OK, so just to show you this in more detail, schematically,
in the Level 1 trigger, the kind of information you can
use is you can say, do I have a muon?
If I have a bunch of hits in the muon system that line up
to form a stub or track, that would be called a trigger.
There's a magnetic field here, so you can make a momentum cut
by saying it has to bend less than a certain amount.
Because the more it bends, the lower the momentum.
And we have a calorimeter trigger that looks for clumps
of energy in contiguous cells and gives you a
trigger based on that.
There are additional triggers that do things like, say, take
one out of every 10,000 events.
That's called a pre-scaled trigger,
the one out of 10,000.
If you just take one out of every 10,000 events with no other
requirement, we call that a minimum bias trigger.
So by taking these loose additional triggers without a
lot of requirements, we can try and understand what kind
of bias we're taking in the data that we collect by
comparing it to an unbiased sample.
Now of course, if something's very rare, you'll never see it
in the unbiased sample.
So in fact, what we do is, with all of these triggers, we
build them up by having a pre-scale of one in every N events
at low momentum.
So you take one in every 100 events at 6 GeV, one in
every 10 events at 10 GeV, and every event
above 20 GeV.
That way you can use the lower energy pre-scaled events to
understand the efficiency for the higher energy.
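As a toy illustration of that kind of prescaled menu, here is a minimal Python sketch; the thresholds follow the 6, 10, and 20 GeV example above, but the prescale factors and the random prescaling are purely illustrative (real prescalers typically count events rather than roll dice).

    import random

    # Toy prescaled trigger menu: keep every event above the highest
    # threshold, and only one in N below it, so the lower-threshold
    # samples can be used to measure the efficiency of the higher one.
    MENU = [             # (threshold in GeV, prescale N -> keep 1 in N)
        (20.0, 1),       # every event above 20 GeV
        (10.0, 10),      # 1 in 10 between 10 and 20 GeV
        (6.0, 100),      # 1 in 100 between 6 and 10 GeV
    ]

    def passes_menu(et_gev):
        for threshold, prescale in MENU:        # highest threshold first
            if et_gev >= threshold:
                return random.randrange(prescale) == 0
        return False                            # below every threshold

    accepted = sum(passes_menu(random.uniform(0.0, 40.0)) for _ in range(100000))
    print(f"accepted {accepted} of 100000 toy events")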
And so then you take these triggers from Level 1.
We typically run with a menu of, on order, 50 or 60
different triggers in parallel.
Because unlike an astrophysics experiment, where you point
the telescope somewhere else, and that's that guy's data, we
all take our data together.
And we divide up the bandwidth in the online trigger, so that
everybody's triggers are mixed together while
they're being taken.
That makes you less sensitive to systematics, and also means
if the machine breaks, you don't have somebody who has no
data at all.
Everyone suffers the pain at the same time.
And for rare processes, it means that you get them for
the whole time.
So we take the data here.
We have data that's added to the DAQ system that says what
triggered, and what were the
characteristics of those triggers.
And you pass it on to the Level 2
region of interest trigger.
And again, that looks at slices around each of those
trigger objects.
And it can ask more complicated questions.
So for example, for this muon, it can look in the tracking
detector and say, did I find a track?
We don't have a Level 1 track trigger, because, at the
moment, we don't have a hardware processor that can
look quickly enough for patterns of
hits to find tracks.
A lot of people are talking about the possibility of
adding a track trigger as an upgrade to the experiment.
It would really help our capabilities, but it's hard to
have one that works with a high enough bandwidth for what
we need at the LHC.
OK, then the Level 3 trigger puts together the whole event,
instead of just these regions of interest. It has
everything.
And the code run in Level 3 looks pretty much like the
code run offline, but with looser selections, selections
that are less sensitive to whether we have the final
alignment, whether we've calibrated our detector
perfectly.
But there's a lot of movement of code that gets developed in
the offline and gets moved into a Level 3 trigger, once
it's proved to be robust. Needless to say, this has to
be very robust code since it's running in real time.
It's run on a Linux farm, and it scales pretty well.
Because, basically, you just keep adding boxes as long as
you have enough network bandwidth.
OK, so how about the offline reconstruction?
Well, once we've passed through Level 3,
we put it on tape.
That's just for safety purposes.
We try and avoid actually having to go to tape as much
as possible because the tape is a nightmare.
It's not a technology that's ever been made to scale very
well, but unfortunately it's the only really cheap and safe
way to keep large amounts of data.
And then we perform a common processing for the whole
collaboration.
So I would view this as similar to what you guys do
when you do your pre-caching of important
information here at Google.
You know people can make a lot of queries, and you don't know
what queries they're going to make.
But you know that if you have to answer all those queries in
real time, you're never going to make it.
So what happens is, we do a general set of feature
extraction, calibration, pattern recognition.
And we write those results out, and those become the
starting point for the queries that all of the scientists are
going to do.
Now because the data volumes are so large, we have a
hierarchy of data storage.
And this is one of the areas that I would say it's not
clear how well it's going to work.
This hierarchy is based on the assumption that most people
will be able to work with the higher
levels of the hierarchy.
Very few people are going to have to go back
to the earlier stages.
Because we just can't afford to handle every collaborator
going back and redoing all the raw data.
We don't have the CPU.
We don't have the bandwidth.
So the raw data, which we call byte stream because it's just
a bunch of ones and zeroes, is archived as raw data.
ESD stands for Event Summary Data.
This is the result of the reconstruction along with
calibrated hits.
So if you have the ESDs, you can pretty much do most of the
reconstruction over again.
It doesn't have individual cells in the detector.
It's clumped together neighboring cells to try and
save some space and do the calibrations.
But it's pretty close to raw data, plus all the results of
the reconstruction.
Then Analysis Object Data is a summary of those things that
are in the ESD that you would actually use
to talk about physics.
So it has tracks, electrons, muons, missing energy, the
basic concepts that physicists use.
And then there's a very tiny thing called a tag, which is
just a summary of high level objects.
It's really meant to just do fast queries, so you can
navigate through the other data quickly.
The tag is something that I would say the
jury's still out.
We don't understand how useful it's going to be.
The idea behind a tag is you can say, give me all the
events with three electrons and missing ET more than 20.
And then when you navigate through the data, you don't
have to look at the events that don't meet the query.
Unfortunately the main issue for us is getting the files
that hold those data delivered to the user.
And so these kind of queries are very effective if the
fraction of events you're keeping is small compared to
one event per file.
But how useful it actually is when you have, on average,
more than one event per file isn't so clear.
So I would say it's an open question how useful the tags
are going to be.
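Here is a minimal sketch of what a tag-based pre-selection amounts to; the field names, cuts, and file names are hypothetical, and the last line echoes the caveat above that the payoff depends on how many files you still have to deliver.

    # Toy "tag" records: a tiny summary per event, used to decide which
    # events (and which files) are worth reading at all.
    tags = [
        {"run": 1, "event": 17, "n_electrons": 3, "missing_et": 35.0, "file": "egamma_001"},
        {"run": 1, "event": 42, "n_electrons": 1, "missing_et": 8.0,  "file": "egamma_001"},
        {"run": 1, "event": 99, "n_electrons": 4, "missing_et": 22.0, "file": "egamma_002"},
    ]

    # "Give me all the events with three electrons and missing ET above 20."
    selected = [t for t in tags if t["n_electrons"] >= 3 and t["missing_et"] > 20.0]

    files_needed = {t["file"] for t in selected}
    print(f"{len(selected)} events selected, {len(files_needed)} files to deliver")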
OK, so here's the answer to Matt's question about sizes.
Our raw data is 1.6 megabytes per event.
So these are big events, lots of channels.
And that's just stored as packed byte streams with 32
bits per channel, and an identifier
that's usually 32 bits.
It gives you the channel numbers
since we suppress zeroes.
The ESDs are about half a megabyte.
And these are target sizes.
I should say, right now, we're close to a factor of two off
on the ESD size.
And it's a big issue because we don't have enough disk
space if we don't get that factor of two down.
And it's going to cause real problems if we have to go back
to tape to read them.
AOD is 100 kilobytes and the tags are about one kilobyte.
We have lots of simulated data, because the only way you
can understand the complicated detector like this is by
simulating it.
And the simulated data is a bit bigger because we keep the
truth information as well as the reconstructed information.
The time for reconstruction, in units that, at least to
me, are random, is 15 kSI2K seconds per event, and
significantly larger for the simulation.
And again, we operate about 200 days a year, and the event
rate after the trigger is 200 Hertz.
So we typically take 2 times 10 to the 9th events per year.
So we are talking about petascale data samples and
quite large CPU usage as well.
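Putting the quoted numbers together gives the petabyte scale directly; here is the arithmetic as a small Python snippet, using the target event sizes and the two billion events per year from above.

    # Yearly data volume implied by the numbers above (real data only,
    # before adding the even larger simulated samples).
    EVENTS_PER_YEAR = 2e9
    sizes_mb = {"raw": 1.6, "ESD": 0.5, "AOD": 0.1, "tag": 0.001}

    total_pb = 0.0
    for name, size_mb in sizes_mb.items():
        petabytes = EVENTS_PER_YEAR * size_mb / 1e9   # 1 PB = 1e9 MB
        total_pb += petabytes
        print(f"{name:4s}: {petabytes:5.2f} PB per year")
    print(f"total: {total_pb:5.2f} PB per year")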
OK, so how do physicists work with this data?
Well, I've already told you the bulk reconstruction is
done once for everybody.
We probably won't get it right the first time, so our
computing model says that we reconstruct all of the data
once every year.
That means you need more CPU as you've collected more data,
since you go back many years worth.
But Moore's law pretty much helps you with that since,
typically, you only keep your machines about three years.
And the process data is the starting
point for the analysis.
In order to try and make data access more efficient, we
stream the data according to the
trigger and physics channel.
So we put all the electrons in one set of files, all the muon
triggers in another set of files, all the missing ET
triggers in another set of files.
And right now our default is an inclusive streaming model,
which means the same event appears on more than one file
if it satisfies more than one trigger.
And this is, basically, just physical placement of data to
make it more efficient to access it.
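A toy sketch of what inclusive streaming means in practice follows; the stream names and thresholds are made up, but the point is that an event firing two triggers is written to two files.

    # Inclusive streaming: write each event to every stream whose trigger
    # it satisfied, so one physics group reads a compact set of files.
    STREAMS = {
        "egamma":     lambda ev: ev["electron_pt"] > 25.0,
        "muon":       lambda ev: ev["muon_pt"] > 20.0,
        "missing_et": lambda ev: ev["missing_et"] > 30.0,
    }

    def streams_for(event):
        return [name for name, passed in STREAMS.items() if passed(event)]

    event = {"electron_pt": 40.0, "muon_pt": 22.0, "missing_et": 5.0}
    print(streams_for(event))   # -> ['egamma', 'muon']: this event lands in two files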
We distribute the data to multiple sites.
And then we have infrastructure, which is quite
complicated and is probably the most fragile part of the
system, to allow distributed analysis and data mining.
OK, so here's a picture that says what I've just said.
The data itself, from the detector, would be a petabyte
per second if you read out every event, which of course
you can't do.
We read out about 100 megabytes per second through
our data acquisition system, send them to a
Linux farm at CERN.
Then the results of that get sent out to various what are
called Tier 1 centers.
They're large computing centers that everyone on the
collaboration can use.
They're distributed internationally for a number
of reasons, one of which is funding agencies like to have
boxes in their own country.
So it's easy to get money if you have a funding agency in
your country.
And it also makes you less sensitive to any kind of
infrastructure problems. Yes?
AUDIENCE: Do you use compression
techniques for the data?
MARJORIE SHAPIRO: Gzip type compression.
So we do, but nothing much fancier than Gzip.

If we could find an algorithm that worked significantly
better, it would save us a lot.
Because at the moment, data volume is our
biggest cost driver.
The reason is we want to keep all of the analysis data on
disk, because the latency for tape access is a disaster.
And the rate of failure on tapes is too high.
But if you look at the cost of disk, it's so high
that with a doubling of our data size, we could basically get
rid of all our CPU, and we still
couldn't pay for the disk.
So fancy compression, if someone could tell us
something that was better than Gzip would definitely,
definitely help.
One thing we do to try and make that more efficient is we
stream each object with a separate I/O streamer, and so
we Gzip the objects separately.
And that makes Gzip do a little bit better on it.
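The mechanics of that per-object compression can be sketched with Python's zlib (the same DEFLATE algorithm as gzip); the payloads below are synthetic stand-ins, so the byte counts only illustrate the idea that similar data grouped together is what the compressor exploits.

    import zlib

    # Serialize "tracks" and "cells" as separate streams versus one
    # interleaved stream, and compare the compressed sizes.
    tracks = "".join(f"track {i:05d} pt {1.5 * i:9.2f}\n" for i in range(5000)).encode()
    cells = "".join(f"cell {i:05d} e {0.01 * i:9.3f}\n" for i in range(5000)).encode()
    per_object = len(zlib.compress(tracks)) + len(zlib.compress(cells))

    interleaved = "".join(
        f"track {i:05d} pt {1.5 * i:9.2f}\ncell {i:05d} e {0.01 * i:9.3f}\n"
        for i in range(5000)
    ).encode()
    mixed = len(zlib.compress(interleaved))

    print(f"per-object streams: {per_object} bytes, interleaved stream: {mixed} bytes")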
OK, so once we have these Tier 1 centers that analyze the
data, then it gets passed to what are called Tier 2's.
These are mainly universities and they are the place where
the users mainly do their own analysis.
You can also do your own analysis on your laptop or
your home machine.
We call that a Tier 3, but I always joke that, because we
have two laptops and a desktop at home, I'd probably be
classified as a Tier 3 by the Atlas computing model.
So I guess the only other comment on the slide that I
would say is important is, even though we're a 2,000
person collaboration, it's a very broad physics program.
So, typically, about 10 people work together on an analysis.
For high profile analyses, you'll have parallel efforts
looking at the same data.
But it's very hard to work with more than 10 people in a
tight collaboration.
And so we still have a model where it's a few students,
post-docs, and a couple of faculty members working
closely together, and then comparing results.
OK, so how do physicists analyze this data?
Well, in spite of the fact that I've shown you a lot of
event pictures, event visualization really isn't a
primary way of doing analysis.
The main thing it's used for is debugging.
If you find that your pattern recognition isn't working,
looking at an event picture and being able to
interactively run the pattern recognition might help you
understand why it isn't working well.
But almost no analysis is done by viewing single events.
Typically what we do is, we do statistical analyses on
ensembles of events.
And we compare the observed rate for a given process to
what the combination of theory and detector simulation
predicts that rate should be.
We search for deviations from the prediction, and then the
characteristics of those deviations tell us if there
are hints of new physics.
So for example, this happens to be a simulation of a model
called Technicolor that has particles with well defined
masses in it, and so what's shown here, in yellow, is what
you would get in a model that didn't have Technicolor.
And these various white peaks show, for different masses of
the technoparticle, what you might see in
your detector.
So what you're looking for is this excess over the standard
model signal.
And so it points out the fact, first of all, that simulation
is very important because that defines a baseline of what you
expect to see.
And it also says that we're going to have to do
statistical analyses and we're going to have to do fits.
I didn't show a fit on this, but you're going to have to
ask how many events do I have above that peak?
And that's going to involve techniques where we try and
fit a shape for the background and
subtract it to get the signal.
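Here is a hedged sketch of that kind of fit, done with numpy and scipy on generated toy data rather than anything real; the functional forms (exponential background plus Gaussian peak) are just illustrative.

    import numpy as np
    from scipy.optimize import curve_fit

    # Toy mass spectrum: a falling background with a small peak on top.
    rng = np.random.default_rng(1)
    mass = np.concatenate([rng.exponential(60.0, 50000) + 80.0,   # background
                           rng.normal(120.0, 2.0, 300)])          # small signal
    counts, edges = np.histogram(mass, bins=100, range=(100.0, 160.0))
    centers = 0.5 * (edges[:-1] + edges[1:])

    def model(x, norm, slope, amp, mean, sigma):
        return norm * np.exp(-x / slope) + amp * np.exp(-0.5 * ((x - mean) / sigma) ** 2)

    p0 = [counts[0] * np.exp(centers[0] / 60.0), 60.0, 30.0, 120.0, 2.0]
    popt, _ = curve_fit(model, centers, counts, p0=p0)

    bin_width = edges[1] - edges[0]
    peak_yield = popt[2] * popt[4] * np.sqrt(2.0 * np.pi) / bin_width
    print(f"fitted events in the peak: {peak_yield:.0f}")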
OK, so some comments on the software.
The lifetime of the experiment is 10 or 20 years.
That's longer than most OS's last, so we have to ask
ourselves how do we keep our software robust with such a
long lifetime?
And the answer is, we just have to plan for change.
It's not going to be possible to keep a given version.
It's also longer than the term of many developers, so there's
a premium for having the code maintainable, documented, and
accessible to others.
So, in that sense, we're closer to industry than most
academics are.
Because we can't deal with one graduate student typing in a
corner and no one else knows how to run the code.
It just isn't a model that scales to an
experiment this size.
The code is shared by several thousand people, so robustness
and documentation is key.
We use CVS for code management.
We have a homegrown system for configuration management on
top of that, which I would say is marginally satisfactory.
But I don't know anyone who has a good system for
configuration management.
I know there are some commercial products, but they
don't necessarily scale and they
tend to be very expensive.
So we basically have something that's built on make files and
scripts that runs in conjunction with CVS. We do
releases of major and minor release versions and bug fixes
to the releases, so all of that's pretty
familiar to all of you.
OK, the use patterns are likely to change with time, so
we need flexible code.
And the input parameters for the reconstruction and
analysis improve as we learn more, so we try and separate
our data and our metadata.
So alignment constants, calibration constants,
anything that we think we can make better by looking at the
data, we keep in a relational database.
At the moment, we have a system where we use Oracle at
CERN, and MySQL at the other institutions.
In retrospect, we might have been better off with MySQL
everywhere.
But there's a tendency for people who aren't experts to
think, if you pay money, it must be a better product.

One of the big issues for us is needing to know where the
data is and how it was processed.
So having access to metadata about the data processing is
important, and about the location of that data, because
that's the starting point for all of the users.
OK, so what's our chosen software architecture?
We have a multipurpose C++ framework.
It's got well defined abstract interfaces, plug in
components, that satisfy those interfaces, so we have
services, algorithms, tools.
And each of them has a defined interface, and the code
that the users write can inherit from one or more of
those interfaces.
The whole thing is done with dynamic loading, so the
framework is brought up, you specify what libraries you
want to load at runtime.
And that allows you to add your own code, trivially, to
the existing system.
And we use Python bindings for run time configuration, so
much of the code that's written has
parameters of the algorithm.
So, for example, if you're looking at a clump of energy,
you say, I only want to look at a clump within a cone of a
given radius.
The size of that radius is a parameter of the algorithm,
and can be set at run time via Python script.
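The flavor of that run-time configuration can be sketched as follows; the class and property names here are hypothetical stand-ins, not the actual ATLAS configuration API.

    # A hypothetical algorithm exposing named parameters that a Python
    # "job options" script overrides at run time.
    class ConeClusterAlg:
        def __init__(self):
            self.properties = {"ConeRadius": 0.7, "MinEnergyGeV": 1.0}

        def configure(self, **overrides):
            for key, value in overrides.items():
                if key not in self.properties:
                    raise KeyError(f"unknown property {key!r}")
                self.properties[key] = value

        def execute(self, event):
            radius = self.properties["ConeRadius"]
            # ... cluster cells within a cone of this radius ...
            return f"clustering with cone radius {radius}"

    # What a job options script would do before the job runs:
    alg = ConeClusterAlg()
    alg.configure(ConeRadius=0.4)
    print(alg.execute(event=None))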
The data objects have a persistent transient
separation.
Originally, we tried to have our data persistency automated
where we had a parser that read the header files and
wrote the classes to write the data out.
We found the performance on that was truly awful.
In fact, brute force, having somebody who knows what's in
the data actually write a streamer that defines the data
was a major plus.
We've gained factors of three or four in I/O performance by
customizing the streamers.
And one of the advantages of the persistent transient
separation is it makes schema evolution much easier.
So you can change your transient representation and
deal with the fact that you have old data just by putting
an if/then statement in your streamer code.
It's low tech but it works.
And anyone who's ever tried to do schema evolution in a
relational database knows that it's not an easy problem
without some kind of separation.
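A minimal sketch of the persistent/transient split and the if/then style of schema evolution might look like this; the real ATLAS converters are hand-written C++, and this class layout is invented for illustration.

    # Transient form: what the analysis code works with.
    class Track:
        def __init__(self, pt, eta, phi):
            self.pt, self.eta, self.phi = pt, eta, phi

    # Persistent form: a plain versioned record written by the "streamer".
    def write_track(track):
        return {"version": 2, "pt": track.pt, "eta": track.eta, "phi": track.phi}

    def read_track(record):
        if record["version"] == 1:
            # old data stored curvature instead of pt: convert on the fly
            return Track(1.0 / record["curvature"], record["eta"], record["phi"])
        return Track(record["pt"], record["eta"], record["phi"])

    old_record = {"version": 1, "curvature": 0.02, "eta": 1.1, "phi": 0.4}
    print(read_track(old_record).pt)   # -> 50.0, old data still readable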
We have lots of calibration metadata.
We handle that with an interval of validity service.
C++ handles that with callbacks.
We encapsulated the interface to the database so you can
have multiple implementations of a database: Oracle, MySQL,
text files, or even data stuck on the front of a file that
you have in your directory.
And the code doesn't have to choose the implementation
until run time.
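The interval-of-validity idea itself is simple; here is a hypothetical sketch in Python, with made-up run ranges and constants, just to show what the lookup looks like regardless of whether the backend is Oracle, MySQL, or a text file.

    # Calibration constants keyed by the run range they are valid for.
    CALIBRATIONS = [
        # (first_run, last_run, pedestal)
        (1, 1000, 40.2),
        (1001, 2000, 40.9),
        (2001, None, 41.3),   # open-ended interval
    ]

    def pedestal_for(run):
        for first, last, value in CALIBRATIONS:
            if run >= first and (last is None or run <= last):
                return value
        raise LookupError(f"no calibration valid for run {run}")

    print(pedestal_for(run=1500))   # -> 40.9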
And then for the actual statistics, making of plots,
we use a CERN developed package called Root.
It's a C++ based package and it has a PyROOT, Python to
Root interface, that allows you, again, to do bindings to the
same data objects that we use within the framework.
So this is something that for us is new.
In the past, we always had summary data that had very
different data objects, and we've changed our mind
completely on that.
And that's partially because of the power of being able to use
C++ for things like collections, and using STL to
do all the good things STL is used for.
People just didn't want to go back to having flat lists of
numbers with no tools.
And so we've really found we're much better off sticking to a
very object oriented C++ type of analysis, even when you're
doing interactive analysis from the prompt.
And certainly when you're doing scripts that are going
to run over lots of events.
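For what the prompt-level work looks like, here is a minimal PyROOT example in the histogram-and-fit style described above; it assumes a ROOT installation, and the "data" are just random numbers standing in for a Z-like peak.

    import random
    import ROOT

    hist = ROOT.TH1F("mass", "dielectron mass;m(ee) [GeV];events", 60, 60.0, 120.0)
    for _ in range(10000):
        hist.Fill(random.gauss(91.2, 3.0))   # toy peak, not real data

    hist.Fit("gaus", "Q")                    # quiet Gaussian fit
    fit = hist.GetFunction("gaus")
    print("fitted mean:", fit.GetParameter(1), "width:", fit.GetParameter(2))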
OK, so I'm just going to finish with some examples of
what some analyses would look like with this data.
And one of the things that I really want to emphasize is we
don't know which are going to be the most exciting things.
So the flexibility and being able to change is a very
important part of what we do.
So one example.
I've already told you about the standard model Higgs.
It's the thing that gives mass to all the other particles and
it turns out, in this theory, the theory does not tell you
the mass of the Higgs, but it tells you exactly how it
decays if you know its mass.
And so it's plotted here as a function of the mass of the
Higgs, what its probabilities of decaying into
different channels are.
So what that means is that you can develop a strategy for
looking for it as a function of mass.
And there's been a lot of work on both Atlas and CMS in
devising those strategies because it
is such a rare process.
So here's some examples from our technical design report of
how one searches for the Higgs, and some of the
difficulties one should highlight.
So if the Higgs is light, it turns out it likes to decay
to two photons. Not all that often, it only does
it one out of 1,000 times.
But photons, it turns out, you can identify cleanly and
measure their energy very well.
So even though it's a low rate, it's easier to see in
this channel than a lot of other channels.
Now easy is a relative word.
If you look at this histogram here, that
little blip is the Higgs.
That straight line is the background from real photons
in other events.
So again, this isn't the case where you can look at an event
picture and say, there's my Higgs.
You have to do a very careful statistical analysis, and it's
a case where you really care about your calibration.
Suppose, as a function of time, your energy calibration
changes, then this peak would get wider.
Given how small this peak is compared to the background, if
it got wider, it would disappear pretty fast.
So this is an example of why we really have to be able to
remake the metadata, make sure that as we improve our
calibrations, we can reprocess the data multiple times.
It's also an example of why we need good
robust fitting tools.
This shows the excess, once we remove the smooth background
with that dotted line.
This is a dicey measurement.
So this is one where we want to have multiple channels
because you can't do it in only one
channel and believe it.
Here's a second example where Higgs goes to a particle
called a b quark and a b antiquark.
And here, there are two issues.
One, there aren't a lot of events, but also, again, the
shape of this background is something we're going to have
to measure from data.
So all the work in this measurement isn't going to be
looking for that peak itself, it's going to be all of the
things you have to do to convince yourself that you
know that that dotted line has the right shape.
So there's going to be lots of going back to simulation,
looking at control channels that also include a b bbar pair
and saying, can I predict what the shape is?
It's an iterative process.
This is a reasonably easy one.
Higgs goes to ZZ.
There's not a lot of events here, but the peak is big
compared to the background.
So that's a nice one.
Yes?
AUDIENCE: What do you actually see in the ZZ
[UNINTELLIGIBLE]?
MARJORIE SHAPIRO: Only the ZZ going to leptons, so either
electrons or muons.
So we see both the electron and the muon channels.

So this one is called the gold-plated channel.
It's easy to see, and this, in fact, is one event from that
peak, just showing what it looks like on a detector
simulation.
So I think one of the things I want to emphasize here is that
if you ask how do you do this analysis?
It ain't SQL queries.
The way physicists do analysis isn't sitting there and
looking in a relational database.
It really is coding mathematical formulas with
complicated objects and the relationships between them.
Is this a photon or an electron?
Where is it in the detector, and do I need to do some kind of
shimming of the calibration?
And so our whole model really involves people writing code
to access C++ objects.
At an earlier stage, people proposed putting all of our
data into a database, as opposed to just
keeping it in files.
And that was a point where Objectivity was viewed as a
potential solution because it was an
object oriented database.
What we found was, first of all, there were real scaling
problems, that these object oriented databases had a hard
time handling the size data sets that we had.
Schema evolution was a nightmare, as it
always is in databases.
And once you were using C++, the advantages of having a
database over having a bunch of files where you said, these
files have all the electrons, these files have all the
muons, really was not apparent.
So we've pretty much gotten to the mode where we say metadata
goes into a relational database.
Data goes into C++ objects that we try and read as fast
as we possibly can, and try to navigate to as fast as we can.
OK, here's the second example.
Supersymmetry.
In Supersymmetry, every particle has a new partner,
so fermions, which are spin one half particles, have spin
zero partners.
Bosons, which are spin one particles, have
spin one half partners.
It's a theoretically favored extension to the standard
model, both because string theory requires it, and also
because it would provide a dark matter candidate.
And it gives you a large number of
particles to look for.
Now fortunately, there are a lot of variants in this model.
So it's not as if you can say there's one
model and that's it.
So you really need to have a very broad
based search for it.
Many of the models involve missing energy.
Things like neutrinos don't interact in the detector, so
there's lots of emphasis on missing
energy in these searches.
And an example of how one might first find
supersymmetry, here's a plot where the green shows the
prediction if there's no supersymmetry.
And this is a log scale.
The red shows, or purple, shows what the supersymmetry
signal would look like in a variable of just the sum of
the momentum of the four highest momentum
objects in the event.
So it's sort of a simple how much energy do I have coming
out variable.
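The variable itself is trivial to compute; a minimal sketch, with a hypothetical list of object momenta, is:

    # Scalar sum of the transverse momenta of the four leading objects.
    object_pts = [310.0, 220.0, 95.0, 60.0, 42.0, 18.0]   # GeV, hypothetical event
    meff = sum(sorted(object_pts, reverse=True)[:4])
    print(f"sum of the four leading pTs = {meff:.0f} GeV")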
So the good news about this is you have orders of magnitude
more events than you would predict from theory.
The bad news is you've got a lump of events and no peak.
So if you saw this, you'd know pretty fast that something new
was happening.
But it's not enough just by saying this to be able to say,
I know this is definitely supersymmetry.
So this would be the first stage in what's then going to
turn out to be a long detective story to try and
characterize what the new physics is.
You say, gee, we're getting all these events where we
didn't expect any.
What are they?
And so this plot says that we're getting
all these extra events.
The what are they is very model dependent because the
supersymmetric particles decay into each other, and depending
on the exact model, there's different decay chains.
So once you see an excess, what you need to do is start
making plots of various quantities.
So here's a model where the supersymmetric particle decays
into an electron and a positron and missing energy,
so you can select events that have an electron, a positron,
and missing energy, and some amount
of hadronic energy.
And then look at the invariant mass of the two electrons and
you see a sharp peak.
That sharp peak tells you something about the mass of
this guy that you missed, the difference actually between
these two masses.
So depending on the model, you'll get different signals.
But you basically, at that point, are going to have to be
a detective.
So again, you require a rather flexible system because you
don't know what you're going to see.
Then the third example is extra dimensions.
And here I'm just showing an event display.
At first blush, a lot of the signals look similar to
supersymmetry.
And so one of the exercises going on now is, if you saw a
signal, how would you be able to tell it was extra
dimensions, not supersymmetry?
And that's a long discussion that's going on in both the
experimental and the theoretical community.
We're just hoping that we have that as a problem, because it
would be great to be arguing about what the source of the
new physics is, rather than asking why didn't we see any?
OK, so this is just a picture of a simulated mini black hole
produced at the LHC.
OK, so finally, let me just conclude.
The LHC will provide access to conditions not seen since the
early universe.
Analysis of LHC data has the potential to change how we
view the world.
But the LHC analyses require finesse and care, substantial
computing resources.
And I haven't talked much about it here, but there's
also of course, sociological challenges when you have 2,000
collaborators, all of whom feel they have
ownership of the data.
And you have to understand how the collaboration as
a whole decides to say, yes, we've measured something, in that
kind of an environment.
So turn-on is next summer, and I hope you guys will be reading
in the Chronicle about exciting things happening once
we turn on.
That's it.

Yes?
AUDIENCE: The whole end-to-end trigger, if I understand
correctly, throws out everything except, like, one
in 40 million events?
MARJORIE SHAPIRO: It's a one in 40 million divided by 200.
It's 40 megahertz in and 200 Hertz out, so yes, most of the
data it throws out.
AUDIENCE: How can you be sure that you're not
throwing out something?
MARJORIE SHAPIRO: It's a big question.
So fortunately the machine will
turn on at lower intensity.
So at the very beginning, when the machine turns on with
lower intensity, we can write 200 Hertz still, so we can
keep more events.
But even there, we're going to have to do a lot of
triggering.
So one of the ideas with these pre-scale triggers, as you
say, let's take every 10,000th event, even if we don't know
anything, just randomly.
You can look at those random events and try and say, is
there anything we don't understand there?
But it's always a problem, and it's particularly bad at the
LHC because there's such a big gap in energy.
When the Tevatron turned on, there was only about a factor
of two and a half difference in energy between the Tevatron
and the machine below.
And so it wasn't so hard to extrapolate, at least for the
known physics.
The jump at the LHC is a factor of seven difference,
so the question of how well you can extrapolate the
uninteresting physics is a big one.
But the one thing that saves us a little bit is, almost all
the exciting physics involves very high mass scales.
And when things that are very massive decay, they tend to
produce things with lots of energy.
So most of the triggers say, keep things that
have lots of energy.
Throw out things--
if the two protons just go past each other and tickle
each other, most of the energy just goes down the beam pipe.
If you're looking in here, you see a few tracks.
You don't see very much.
So that's the basic idea of the triggers, but
it's always a problem.
Yes?
AUDIENCE: So if the Higgs exists, why
haven't they seen it at the Tevatron?
MARJORIE SHAPIRO: Because protons
are composite particles.
They are made of quarks.
If you ask how likely you are to get a collision of a quark
in one proton and a quark in the one going the other way with
enough energy to produce a Higgs, that's small.
And also the Higgs production rate itself is small.
In principle, with enough time, the Tevatron
could see the Higgs.
And in fact, people who aren't on LHC experiments are really
busting their tails to try and do as much analysis as
possible of the data.
Because there is a chance they might be able to
see it at the Tevatron.
It depends critically on what the mass is.
If it's at the lower end of the range of what's possible,
they might see it.
Yes?
AUDIENCE: Is the data that you have suitable for some sort of
frequency extraction [UNINTELLIGIBLE]?
Like a [UNINTELLIGIBLE]?
MARJORIE SHAPIRO: So some analyses, yes.
It's very dependent on the kind of
analysis that you're doing.
So there are some analyses where you might do a frequency
measurement.
But as a general technique on the data, no.
Because what you want to do is you want to take all of these
data and you want to extract features that correspond to
known particles.
So you want to look for trajectories.
You want to look for clumps of energy.
So it tends to be a first stage pattern recognition
that's feature extraction.
You either look for a bunch of hits that are consistent with
coming from a curved line, or a bunch of hits that are
coming from a clump of energy.
And then the refining comes in: doing some kind of fit to
those hits to get the best measurement of the curvature,
or calibrating the clump of energy to get the best
estimate of what the deposited energy was.
So in some of the higher-level analyses, you can use a
Fourier analysis as part of it.
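To make the two feature-extraction steps concrete, here is a rough, self-contained illustration, not the ATLAS reconstruction itself: summing a clump of energy around a seed cell, and fitting hits to a curve to estimate curvature. The cell values, window size, and the quadratic fit are illustrative assumptions.

    import numpy as np

    def clump_energy(cells, seed_index, window=1):
        """Sum the energy in a small window of cells around the seed cell."""
        lo = max(0, seed_index - window)
        hi = min(len(cells), seed_index + window + 1)
        return sum(cells[lo:hi])

    def fitted_curvature(xs, ys):
        """Least-squares fit y = a + b*x + c*x**2; 2*c approximates the curvature."""
        c, b, a = np.polyfit(xs, ys, deg=2)  # coefficients, highest power first
        return 2.0 * c

    cells = [0.1, 0.2, 5.0, 7.5, 4.0, 0.3]       # toy calorimeter strip, in GeV
    print(clump_energy(cells, seed_index=3))      # -> 16.5

    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    ys = 0.05 * xs**2 + 0.1 * xs                  # toy curved trajectory
    print(fitted_curvature(xs, ys))               # -> ~0.1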
Yes?
AUDIENCE: So you talked earlier about zero suppression.
I assume that there's always some background
noise on these detectors.
You have to have a squelch threshold?
Is that how it works?
MARJORIE SHAPIRO: That's right.
So you have some threshold that's usually set at some
number of sigma above the noise level.
And what you do is very detector dependent.
So for example, our pixel detector, the pixels are
basically-- it's a piece of silicon that's etched into
little rectangles.
And the rectangles are 50 microns by 500 microns.
There's a zillion of them.
For those, you have to have very low noise per channel, and
set the threshold high enough that you only read out 10 to
the minus fifth of the channels if there's only noise.
Because otherwise you're swamped by data.
On the other hand, it turns out the calorimeter integrates
for a long time, and so it sees the previous crossing.
And there they keep the threshold low enough that they do, in
fact, read out all the channels, because they use it
to correct for any underflow or overflow from
the previous crossing.
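As a toy sketch of the zero-suppression scheme just described, assuming per-channel noise sigmas are known from calibration (in reality this happens in the front-end electronics, not in offline Python):

    def zero_suppress(readings, noise_sigmas, n_sigma=5.0):
        """Keep only (channel, value) pairs that are n_sigma above that
        channel's noise level; everything else is treated as noise and dropped."""
        kept = []
        for channel, (value, sigma) in enumerate(zip(readings, noise_sigmas)):
            if value > n_sigma * sigma:
                kept.append((channel, value))
        return kept

    readings     = [0.3, 0.1, 12.0, 0.2, 7.5]     # toy ADC counts
    noise_sigmas = [0.2, 0.2, 0.2, 0.2, 0.2]
    print(zero_suppress(readings, noise_sigmas))  # -> [(2, 12.0), (4, 7.5)]

The threshold in sigma is the knob: the pixels set it high enough that noise-only occupancy is around 10 to the minus fifth, while the calorimeter keeps it low and reads everything out.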
Yes.
AUDIENCE: [UNINTELLIGIBLE]
I'm curious.
What is the policy regarding sharing the
data with the public?
Is the code available?
MARJORIE SHAPIRO: The code is available, but the data isn't.
So all of the code is available from LXR, but the
data itself is not available unless you are a member of the
collaboration.
AUDIENCE: The second thing is, as you mentioned, you use
[UNINTELLIGIBLE],
kind of like a generic compressor.
If you made the data structure public,
[UNINTELLIGIBLE]?

MARJORIE SHAPIRO: Sure.
Send me an email.
mdshapiro@lbl.gov, and I'll point you at our data.
If you can come up with a better compression algorithm,
we would be thrilled.

AUDIENCE: So is the detector cryogenically cooled?
MARJORIE SHAPIRO: Parts of it are.
The calorimeter is liquid argon, so that's
cryogenically cooled.
The tracking detector is only kept at moderate temperatures,
so like minus 10 degrees, just to keep the electronics cool.
The accelerator itself is at liquid helium temperatures.
That's superconducting.
AUDIENCE: Is it liquid helium?
MARJORIE SHAPIRO: Yeah, the accelerator's liquid helium,
because it has superconducting magnets.
AUDIENCE: And one further question: wouldn't you get
natural Higgs production from cosmic ray interactions in
the upper atmosphere?
MARJORIE SHAPIRO: The rate's very, very low.
There's not enough.
You just wouldn't get enough events to be able to see it.
Yes?
AUDIENCE: With regard to data compression, it would seem
that for any particular experiment, most of the data
would be noise.
And so, from his point of view, he would benefit from a
lossy compression.
MARJORIE SHAPIRO: That's right.
And in fact, there is lossy compression in some objects.
One of the advantages, even though it's a pain to have
handwritten converters for each object, is that you can
do lossy compression.

You're not limited to 32 bits per floating point.
So there are parts of the detector where we are doing
lossy compression, and non-linear lossy compression,
to take into account the dynamic range,
because some of the electronics itself is
non-linear.
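Here is a toy example of non-linear lossy compression in the spirit of what she describes: store the square root of the energy in 16 bits, giving finer resolution at low energies and a large dynamic range. The square-root encoding and the 4000 GeV full scale are illustrative assumptions, not the actual ATLAS converters.

    import math

    E_MAX = 4000.0  # assumed full scale, in GeV

    def compress(energy_gev):
        # Non-linear encoding: square root before quantizing to 16 bits.
        frac = math.sqrt(max(0.0, min(energy_gev, E_MAX)) / E_MAX)
        return int(round(frac * 65535))

    def decompress(code):
        frac = code / 65535.0
        return (frac ** 2) * E_MAX

    for e in (0.5, 50.0, 3000.0):
        print(e, decompress(compress(e)))  # small, energy-dependent loss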
And the other thing, in terms of most of the information
being noise, different analyses care about different
parts of the detector.
So by having separate I/O streams for each type of
object and accessing them on demand, even if you open a
file, you only read objects of the type that you care about.
At the moment, we don't have good ways of selecting events,
but I think it's clear that the next step is also going to
be to use some kind of a fast query to say, I want events
137, 53, and 48 in this file, and then only read those
events, as a starting point.
Because the I/O performance is just not going to be
acceptable if you only do things generically.
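A minimal sketch of that "read only the events you ask for" idea, using a toy file with one JSON record per line and a separate index of byte offsets; the real system would do this inside its I/O framework, and the file name and record layout here are made up.

    import json

    def build_index(path):
        """Map event_number -> byte offset of its record in the file."""
        index, offset = {}, 0
        with open(path, "rb") as f:
            for line in f:
                event = json.loads(line)
                index[event["event_number"]] = offset
                offset += len(line)
        return index

    def read_events(path, index, wanted):
        """Seek straight to the requested events instead of scanning the file."""
        events = []
        with open(path, "rb") as f:
            for number in wanted:
                f.seek(index[number])
                events.append(json.loads(f.readline()))
        return events

    # Toy usage: write a few fake events, then pull out just three of them.
    with open("toy_events.jsonl", "w") as f:
        for n in range(200):
            f.write(json.dumps({"event_number": n, "energy": 0.1 * n}) + "\n")
    index = build_index("toy_events.jsonl")
    picked = read_events("toy_events.jsonl", index, [137, 53, 48])
    print([e["event_number"] for e in picked])  # -> [137, 53, 48]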

Yes?
AUDIENCE: Can you say something about how you handle
failure and data corruption?
MARJORIE SHAPIRO: Yeah, failure and data corruption is
a big problem.
Handling it during the reconstruction is not so bad,
because the reconstruction is happening on one site with
professionals monitoring the system.

The log files are parsed, not only for core dumps but
also for severe errors being signaled, assuming that the code
developers signal severe errors.
And so that, one can handle pretty well.
And you can just say this file is not going to be included
because it hasn't been processed properly and
reprocess it.
Just rerun it, if possible, or declare that part of the data
not usable.
It's much harder when you have people looking at the end data
because these files are delivered from a data delivery
system, and making sure that Joe Blow user, if 10% of the
files don't make it to his analysis, realizes that the 10%
didn't make it, is much harder.
Because Joe Blow user will get it wrong unless you force him
to get it right.
So what we're talking about now is, when he creates his
high level summary data, having it automatically put
metadata in the file that says which data it read.
And that, I think, handles everything except core dumps,
where he gets a file that only had some fraction of
the input file read.
So you still depend on the user to deal with code that
just drops out and core dumps; it's those partially read
files that are the real issue.
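As a rough sketch of that provenance idea, assuming a summary-building step that records which delivered input files were actually read, so a later check can flag the missing fraction; the function and file names are illustrative, not the actual data delivery system.

    def make_summary(delivered_files, read_file):
        """Build a summary and record exactly which inputs were read."""
        summary = {"results": [], "inputs_read": []}
        for path in delivered_files:
            data = read_file(path)
            if data is None:                 # delivery or read failure
                continue
            summary["results"].extend(data)
            summary["inputs_read"].append(path)
        return summary

    def check_completeness(summary, expected_files):
        """Return the missing files and the fraction of the dataset they represent."""
        missing = sorted(set(expected_files) - set(summary["inputs_read"]))
        return missing, len(missing) / len(expected_files)

    # Toy usage with a fake reader that "loses" one file out of ten.
    expected = [f"dataset/file_{i}.root" for i in range(10)]
    reader = lambda p: None if p.endswith("file_7.root") else [1, 2, 3]
    summary = make_summary(expected, reader)
    print(check_completeness(summary, expected))  # -> (['dataset/file_7.root'], 0.1)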
Yes?
AUDIENCE: So is the reason that you don't make the data
available to the public because of proprietary
interest in the data?
Or is it just because it's too expensive to serve that data?
And could it be possible to make a small amount of the
data available?
MARJORIE SHAPIRO: So there's a little bit of both.
So I think making all of the data accessible would be a
huge problem in terms of data volume and the cost of the
service would be big.
Making a small amount available is something that's
been talked about, and in fact, I think the
collaboration would be quite interested in it.
Because there's been lots of discussion about using it as a
tool for teaching in colleges and high schools.
Part of the problem is that, although our system is fairly
well documented for someone who is on the experiment, it's
not really adequate for a random user.
But I think it would be an interesting project if
somebody wanted to spend, say, six months figuring out
how to make 1% of the data, or a selected set of
data, available to the public to use.
AUDIENCE: I think if somebody wanted to develop that
compression algorithm that you want, it would be helpful to
have the data.
MARJORIE SHAPIRO: I think there would be no problem
giving out some amount of data for those kinds of purposes.
It would have to be approved by the collaboration because
it is proprietary.
But I think that the main thing they want to prevent is
people writing science papers before they do.
They don't want to see a paper on finding the Higgs before
the collaboration's decided it's found it.
Now this is a sociological issue.
If you look in astrophysics collaborations, there, all the
data is published.
And in fact, NASA requires the data to be
made available publicly.
And it's quite normal for non-collaborators to write
papers analyzing data that they've glommed off the web.
So it's a culture thing, and it may change with time,
depending on what the funding agencies require.
Yes?
AUDIENCE: The solution to that is to corrupt the data.
Seriously.
You add noise, or particularly bad
noise, to the data.
And then you can hand it to somebody who's just playing
with compression algorithms. He's not
going to be able to publish.
MARJORIE SHAPIRO: For someone playing with compression
algorithms, you can also give data that's noncontroversial.
But this idea of corrupting the data is actually
something that we talk about for another reason.
One of the problems when you do these analyses is there's
this tendency to see what you want in the data.
And so we're having lots of discussions about, how do you
do blinded analysis, where the people doing the analysis
don't really get to know what the answer is
until they do it?
It turns out it's pretty easy to do blinded analysis if you
know what it is that you're measuring
right from the beginning.
It's very hard to do blind analysis when you have no idea
what you're looking for.
So it's a continuing discussion but we haven't
found a good way to do overall blinding.
Although I think for some analyses, in particular
analyses where you're doing a precision measurement of a
quantity, there will be some attempt to blind the data by
adding a random offset that depends on a random number
that only one person in the experiment knows.
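A minimal sketch of that blinding scheme: a hidden offset derived from a secret seed, known only to one person, is added to the measured quantity and removed only after the analysis is frozen. The seed, scale, and example value are all made up for illustration.

    import random

    def hidden_offset(secret_seed, scale=1.0):
        rng = random.Random(secret_seed)    # reproducible only with the secret
        return rng.uniform(-scale, scale)

    def blind(measured_value, secret_seed):
        return measured_value + hidden_offset(secret_seed)

    def unblind(blinded_value, secret_seed):
        return blinded_value - hidden_offset(secret_seed)

    true_measurement = 80.40                # some mass in GeV, say
    blinded = blind(true_measurement, secret_seed="kept-by-one-person")
    print(blinded)                                               # what the analyzers see
    print(unblind(blinded, secret_seed="kept-by-one-person"))    # -> 80.4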

AUDIENCE: So Higgs [UNINTELLIGIBLE]
so many of them out there.
And do you vary your trigger online?
MARJORIE SHAPIRO: We do all the triggers in parallel.
So what we do is we divide up the bandwidth.
One of the biggest issues in the experiment is
making sure you're in on the discussions for how the
bandwidth gets divided, because that's what
determines the physics.
You have to be there fighting for your
physics in the bandwidth.
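One way to picture the bandwidth division is as a prescale calculation: each trigger is allocated a share of the total output rate, and its prescale is chosen so its expected output stays within that share. This is only a toy sketch with invented rates and shares, not how the ATLAS trigger menu is actually configured.

    import math

    TOTAL_BUDGET_HZ = 200.0

    def prescales(trigger_rates_hz, shares):
        """trigger_rates_hz: expected raw accept rate per trigger.
        shares: fraction of the output budget allocated to each trigger."""
        result = {}
        for name, raw_rate in trigger_rates_hz.items():
            allowed = shares[name] * TOTAL_BUDGET_HZ
            result[name] = max(1, math.ceil(raw_rate / allowed))
        return result

    rates  = {"high_energy_jet": 500.0, "muon": 300.0, "minimum_bias": 40_000_000.0}
    shares = {"high_energy_jet": 0.5,   "muon": 0.4,   "minimum_bias": 0.1}
    print(prescales(rates, shares))
    # -> {'high_energy_jet': 5, 'muon': 4, 'minimum_bias': 2000000}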
Yes?
AUDIENCE: So you had engineers building your machines.
Do you have software engineers building your code?
MARJORIE SHAPIRO: We do have software engineers but not
nearly enough.
And this is partially a problem with funding agencies.
It's much more difficult to get funding agencies to pay
for software engineers than to pay for hardware engineers.
We probably, on ATLAS, have on the order of 25 or 30 software
engineers, and we ought to have more like 150.
The exception to that--
I only counted the offline.
It's fairly accepted that data acquisition systems need
software professionals.
Because everyone understands that real
time systems are hard.
There's a tendency for funding agencies still to think
anybody can write software.
And it's only been in the last decade that they even asked
software projects to present.
We have meetings every year with the Department of
Energy, who funds us.
We're supposed to present our progress,
cost, budget, and schedule.
It's only been in the last 10 years that software has been
included in those meetings.
As painful as those meetings are, it's
a major step forward.
Because it's the first recognition that the software
is as important a part of the engineering as the hardware.

Yes?
AUDIENCE: When the system is running, how much power does it
really consume?
MARJORIE SHAPIRO: I don't know the answer to that, but it's
quite large.
And in fact, one of the reasons we only run 200 days a
year is that CERN has an arrangement with the French
power companies that they turn off during the winter.
Because, unlike the US, where air conditioning is the major
load, so SLAC turns off in the summer, in France people
have electric heat.
So winter is when power rates go up, and so CERN has an
arrangement.
When the power grid gets to a certain utilization, they have
to shut off.
And in exchange for that they get better power rates.
AUDIENCE: Are you able to harness any of the energy you
create by--?
MARJORIE SHAPIRO: No, we use a lot more than we give out,
unfortunately.

Mm-hmm?
AUDIENCE: How difficult is your simulation problem?
[UNINTELLIGIBLE]?
MARJORIE SHAPIRO: The simulation problem is very
difficult, and there are two parts to it.
The first part is called event generation,
and that's where you try to model the underlying physics.
And there the issue is, we don't really know the
underlying physics very well.
So typically there are four or five packages that are all
supposed to be giving us the same physics answer.
We run all of them and look at differences between
them to get some sense of how well the calculation is under
control, and then they have parameters that you can vary
to try and see how much it changes.
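A rough illustration of that "run several generators and compare" strategy: treat the spread between packages, and between parameter settings of the same package, as a measure of how well the calculation is under control. The generator names and cross-section numbers below are purely illustrative.

    predictions_pb = {
        "generator_A": 105.0,
        "generator_B": 98.0,
        "generator_C": 112.0,
        "generator_A_varied_params": 101.0,
    }

    central = sum(predictions_pb.values()) / len(predictions_pb)
    spread = max(predictions_pb.values()) - min(predictions_pb.values())
    print(f"central prediction ~ {central:.1f} pb, "
          f"envelope ~ +/- {spread / 2:.1f} pb from generator differences")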
The second part is the detector simulation.
That depends on how well you can model the underlying
physics of particles going through matter.
There's a program called GEANT4
that was developed at CERN, which is what we use as the
underlying engine for doing the simulation.
It's been well tested in a lot of regions.
It doesn't do so well at very, very low energies or
at very, very high energies.
At very high energies, it's hard to test because you don't
have anything to compare it to.
And at very low energies, it turns out it's just a very
difficult problem.
But the same code is being used by some radiation
physicists designing treatment plans for patients.
And I'd worry a lot more for them than I would for us.
But it's a hard problem and the [UNINTELLIGIBLE]
simulation has gone through many revisions, and there are
discrepancies with the data, in some cases.
Yes?
AUDIENCE: In terms of radiation, is there some kind
of activation going on?
MARJORIE SHAPIRO: Yes, the
equipment does become activated.
Yes, and I don't know what the rules will be at CERN, but at
Fermilab, after they turn off the machine, you have to wait
15 minutes before you go in, because it's mostly short-lived
activation.
And then they have to do a survey with a Geiger counter
before anyone's allowed to work.
My guess is, at CERN, it will be longer.
The magnets themselves, on the machine,
will be highly activated.
And so if the machines are ever decommissioned, they will
have to be treated as radioactive waste and disposed
of like any other radioactive waste.

OK, it looks like we're done.
AUDIENCE: Thanks.
MARJORIE SHAPIRO: Thanks a lot.