Day 6
I. Last Time:
Hw #2 Posted
A. Analytical Performance and the Clock
II. New Stuff:
A. Analytical Performance Continued:
1. ISA vs. Implementation
a. ISA: Instruction Set Architecture
The ISA defines the specific format of 1s and 0s (the machine code) that a processor can read.
E.g. an Intel and an AMD processor may have the same ISA.
They can both read and run the same program.
b. Implementation: The internal details of how the processor
accomplishes reading/running a program.
The previous example changed just the implementation.
(In different ISAs, a number means different instructions:
0x77 may mean ADD to one CPU and MULT to another)
Ex: Intel x86 (may) take 4 clocks for an ADD
AMD x86 may only take 3.
If they are both run at the same clock speed,
the AMD will be faster. But if the Intel can
be run 33% faster than the AMD, they'll be the same...
The Intel and AMD processors have vastly different implementations
The ONLY time we can even begin to use clock speed as a measure of
performance is when comparing two machines with the same ISA AND the
same Implementation AND all other components in the systems are
identical. (This last part is due to Amdahl's Law...)
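As a rough check of the ADD example above: the time for one instruction is its cycle count times the clock period. The sketch below assumes hypothetical clock speeds of 300 MHz and 400 MHz (not from the notes), picked only because 400 is about 33% higher than 300.

```python
def add_time_ns(cycles_per_add, clock_mhz):
    """Time for one ADD in ns: cycles * clock period (1000 / MHz gives ns)."""
    return cycles_per_add * 1000 / clock_mhz

# Clock speeds here are hypothetical, chosen only for illustration.
intel_same_clock = add_time_ns(4, 300)    # 4-cycle ADD at 300 MHz
amd_same_clock = add_time_ns(3, 300)      # 3-cycle ADD at 300 MHz
intel_faster_clock = add_time_ns(4, 400)  # same 4-cycle ADD, clock ~33% higher

print(f"Same clock: Intel {intel_same_clock:.2f} ns, AMD {amd_same_clock:.2f} ns")
print(f"Intel at +33%: {intel_faster_clock:.2f} ns, AMD {amd_same_clock:.2f} ns")
```

At the same clock the 3-cycle ADD wins; with the higher clock the two come out even, which is why clock speed alone says little across implementations.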
B. Typical Analytical Calculations:
1. Typically we'll perform calculations to answer one of a few questions:
a. How long will the program/set of instructions take to execute?
b. If we improve/change X, how long will this program take?
(Ex: if Add takes 4 cycles instead of 7)
c. How can we make Machine A behave like Machine B?
(Ex: What clock speed would we need to run A at to perform like B?)
2. We're often given some of the following:
a. A table of instructions and the number of clock cycles
required by each
b. A clock speed or period
c. An Average CPI (Clocks/Instruction) for a specific program
d. A piece of code
e. A number of instructions in a piece of code and a table of the "code mix"
3. Find a way to express how long the execution will take:
Either in clock cycles or in actual time
a. Find how long each instruction takes and add them up
b. Find the "instruction mix" percentages of a program and add them up
c. Use a known "Average CPI" and the number of instructions
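The three approaches above can be sketched as small functions. This is only an illustration: the function names are made up here, and the numbers are borrowed from Example C (1,000,000 memory instructions at 6 clocks, 2,000,000 math instructions at 3 clocks).

```python
def cycles_from_counts(counts, cycles_per_type):
    """Method a: add up (instruction count * cycles) for each instruction type."""
    return sum(counts[t] * cycles_per_type[t] for t in counts)

def cycles_from_mix(total_instructions, mix, cycles_per_type):
    """Method b: weight each type's cycles by its fraction of the instruction mix."""
    avg_cpi = sum(mix[t] * cycles_per_type[t] for t in mix)
    return total_instructions * avg_cpi

def cycles_from_avg_cpi(total_instructions, avg_cpi):
    """Method c: a known average CPI times the number of instructions."""
    return total_instructions * avg_cpi

cycles_per_type = {"mem": 6, "math": 3}
counts = {"mem": 1_000_000, "math": 2_000_000}
mix = {"mem": 1 / 3, "math": 2 / 3}   # same program, expressed as fractions
total = sum(counts.values())

cycles_a = cycles_from_counts(counts, cycles_per_type)
cycles_b = cycles_from_mix(total, mix, cycles_per_type)
cycles_c = cycles_from_avg_cpi(total, 4.0)  # average CPI of this mix is 4.0

print(cycles_a, round(cycles_b), round(cycles_c))
```

All three describe the same program, so they should agree on the cycle count; dividing by the clock frequency (or multiplying by the period) turns cycles into actual time.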
Examples:
A. A program requires 1,250,000 cycles on a 50MHz machine.
How long does it take?
1 cycle = 1/50,000,000 s = 20ns
20ns*1,250,000 = 0.025s
0.025s
B. A program has an average CPI of 2.4 for its 20,000,000 instructions.
It runs on a processor with a 10ns clock.
What's the Frequency of the clock?
100MHz
How long will the program take?
2.4*20,000,000 = 48,000,000 cycles
48,000,000*10ns = 0.48s
How long will the average instruction take?
2.4*10ns or 0.48s/20,000,000 = 24ns
C. A program consists of 1,000,000 Memory Insts @ 6 clocks each and
2,000,000 math insts @ 3 clocks each
What does the clock frequency have to be to complete in 1s?
In 0.7s?
6*1,000,000+3*2,000,000 = 12,000,000 cycles
1s: 12,000,000/1 = 12 MHz
0.7s: 12,000,000/.7 = ~17.14 MHz
What will the clock speeds need to be if we improve the Mem Insts
to only require 5 clocks?
1,000,000*5+2,000,000*3 = 11,000,000 cycles
1s: 11 MHz
0.7s: 11,000,000/.7 = ~15.71 MHz
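Examples A through C can be re-checked with a few lines of Python. This is just a sketch of the arithmetic; the helper names (exec_time_s, required_clock_mhz, cycles_c) are invented for this illustration.

```python
def exec_time_s(cycles, clock_hz):
    """Execution time in seconds = cycles / clock frequency."""
    return cycles / clock_hz

def required_clock_mhz(cycles, deadline_s):
    """Clock frequency (MHz) needed to finish `cycles` within `deadline_s`."""
    return cycles / deadline_s / 1e6

# Example A: 1,250,000 cycles on a 50 MHz machine
time_a = exec_time_s(1_250_000, 50e6)

# Example B: CPI of 2.4, 20,000,000 instructions, 10 ns clock
period_b_ns = 10
clock_b_mhz = 1000 / period_b_ns          # a 10 ns period is 100 MHz
cycles_b = 2.4 * 20_000_000
time_b = cycles_b * period_b_ns * 1e-9
avg_inst_b_ns = 2.4 * period_b_ns

# Example C: 1,000,000 memory insts (6 or 5 clocks) + 2,000,000 math insts @ 3
def cycles_c(mem_cpi):
    return 1_000_000 * mem_cpi + 2_000_000 * 3

print(f"A: {time_a:.3f} s")
print(f"B: {clock_b_mhz:.0f} MHz, {time_b:.2f} s, {avg_inst_b_ns:.0f} ns/inst")
for mem_cpi in (6, 5):
    c = cycles_c(mem_cpi)
    print(f"C (mem @ {mem_cpi}): {required_clock_mhz(c, 1.0):.2f} MHz for 1 s, "
          f"{required_clock_mhz(c, 0.7):.2f} MHz for 0.7 s")
```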
C. Amdahl's Law: The Bottleneck Problem / Law of Diminishing Returns
The basic Idea: You can't go any faster than your slowest part
This often happens on the interstate...
Cruising @ 80 and hit a construction zone.
The slowest part/component will dictate the highest possible speed.
This governs both Hardware and Software
1. Amdahl: IBM engineer who went on to form Amdahl Corp.
2. VERY important for people who are:
doing parallel programming,
working on non-parallel (but speed-critical) programs,
trying to improve hardware design.
3. Related to the concepts mentioned when discussing profiling
4. Ex: Pg 75
Exe time after improvement =
(Exe time affected / Amount of improvement) + Exe time unaffected
E.g. a program takes 100s total. 80s is spent in a multiply.
If only the multiply is improved and its speed is doubled,
how long will it take:
80/2+20=60s
What if we need the program to complete in 40s,
how much faster would mult need to be?
4x faster than orig.
What if we need results in 20s? How much faster will mult need to be?
If the multiply was instant (0 time) the fastest the prog. could
exe would be 20s.
5. Corollary to Amdahl's Law: Make the common case fast!
I.e. design for typical/frequent usage.
Ex: A supercomputer should be good at floating point ops
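The multiply example above can be worked through in code. A minimal sketch of the Amdahl's Law formula from item 4; the function names are invented here, and returning None for an unreachable target is just one way to flag that case.

```python
def time_after_improvement(affected_s, unaffected_s, speedup):
    """Amdahl's Law: only the affected part shrinks by the speedup factor."""
    return affected_s / speedup + unaffected_s

def required_speedup(affected_s, unaffected_s, target_s):
    """Speedup the affected part needs to hit target_s, or None if unreachable."""
    budget = target_s - unaffected_s   # time left over for the affected part
    if budget <= 0:
        return None                    # even an instant multiply is too slow
    return affected_s / budget

# 100 s program, 80 s of it in multiply, 20 s unaffected
doubled = time_after_improvement(80, 20, 2)
need_40 = required_speedup(80, 20, 40)
need_20 = required_speedup(80, 20, 20)

print(doubled, need_40, need_20)
```

The 20s target comes back as None because the unaffected 20s is a hard floor, which is the "diminishing returns" part of the law.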
D. Experimental Techniques: Benchmarking
Q. What is benchmarking? Who's run a benchmark of some sort?
A. A benchmark is a program (or suite of programs) used
specifically to measure performance.
Categories of Benchmarks: Synthetic and Non-Synthetic
Synthetic - "Artificial" benchmark. I.e. a program written solely
to measure performance.
Usually a small ASM program. Or at least a small program
Non-Synthetic - An actual application that is really used
When/Why Synthetic:
Can test very specific/hardware oriented feature
Can be used early in Hardware/Processor Design (when no actual hardware exists yet)
Why NOT Synthetic:
Easily fooled/spoofed
(A simple change to processor/compiler may drastically change
perf. of benchmark without any advantage to normal progs.)
Common Examples of Synthetic: Norton (old)
How do synthetic benchmarks report their results?
Often MIPS and MFLOPS are used as performance measures.
MIPS - Count of the millions of instructions executed per second
MFLOPS - Count of the millions of Floating Point Operations per second
These are only useful when comparing identical ISAs. Why?
Because different machines may do different quantities of "work" for
different instructions.
Ex: A RISC Machine (like MIPS) may require 200-300 instructions to
perform a memory copy.
A CISC Machine (like Intel) may require only a single instruction
to do the same work.
Just comparing the number of instructions done in a given time doesn't
really measure the WORK done in that time.
If both machines complete the copy in the same time, we'd say they have
the same performance for the copy...
But the RISC machine would have a much higher MIPS rating.
Really, EXE time is probably the "best" measure.
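To see why MIPS misleads across ISAs, take the memory-copy example above. The sketch assumes hypothetical values: 250 RISC instructions (within the 200-300 range mentioned), 1 CISC instruction, and both machines finishing the copy in the same 1 microsecond.

```python
def mips(instruction_count, exec_time_s):
    """Millions of instructions executed per second."""
    return instruction_count / (exec_time_s * 1e6)

# Hypothetical: both machines finish the same copy in the same time.
copy_time_s = 1e-6
risc_mips = mips(250, copy_time_s)  # many simple instructions
cisc_mips = mips(1, copy_time_s)    # one complex instruction

print(f"RISC: {risc_mips:.0f} MIPS, CISC: {cisc_mips:.0f} MIPS, "
      f"same execution time: {copy_time_s} s")
```

The MIPS ratings differ by a factor of 250 even though the user sees identical performance, so execution time is the number to compare.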
Why Non-Synthetic:
Can measure the actual performance that will be seen by the end user
Problems with Non-Synthetic:
1. Application Mix/Usage - Needs to be same as in benchmark
We want to test apps that are as close to what the machine
will be used for as possible.
I.e. 3D performance shouldn't be measured if we're only
going to use the machine for spreadsheets!
2. "Ranking" in suites - How are different results combined?
Look at page 32 of the SPEC handout
How do we really combine these results to determine
what is best?
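How results are combined can change the ranking. The sketch below uses made-up speed ratios (reference time / measured time, larger is better) for two hypothetical machines and compares the arithmetic mean with the geometric mean, which is the kind of mean SPEC uses for its overall ratings.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-benchmark ratios for two machines (not real results).
machine_a = [2.0, 8.0]   # great on one benchmark, weak on the other
machine_b = [4.0, 4.0]   # even across both

print("arithmetic:", arithmetic_mean(machine_a), arithmetic_mean(machine_b))
print("geometric: ", round(geometric_mean(machine_a), 6),
      round(geometric_mean(machine_b), 6))
```

With the arithmetic mean machine A looks better; with the geometric mean the two tie. Neither choice is "right", which is exactly the ranking problem.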
Common Examples of Non-Synthetic:
Winstone, Quake, WinBench, SPEC and SPEC 2000, etc.
E. Other Poor Metrics (Besides MIPS and MFLOPS):
Peak MIPS - WORSE than MIPS. Just a measure of the number of times the
fastest instruction can execute per second.
Relative MIPS - Based on a workload and a common machine, so a little better.
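A rough comparison of the three MIPS flavors. The formulas are assumptions based on the usual textbook definitions (peak MIPS from the fastest instruction's cycle count, relative MIPS scaled from a reference machine's time and rating), and every number below is hypothetical except the 100 MHz clock and CPI of 2.4 reused from Example B.

```python
def native_mips(clock_mhz, avg_cpi):
    """MIPS for a real workload: clock rate divided by average CPI."""
    return clock_mhz / avg_cpi

def peak_mips(clock_mhz, fastest_inst_cycles):
    """Peak MIPS: as if only the fastest instruction were ever executed."""
    return clock_mhz / fastest_inst_cycles

def relative_mips(ref_time_s, measured_time_s, ref_mips):
    """Relative MIPS: reference rating scaled by the speedup on one workload."""
    return ref_time_s / measured_time_s * ref_mips

native = native_mips(100, 2.4)       # 100 MHz, average CPI of 2.4
peak = peak_mips(100, 1)             # assumes the fastest instruction is 1 cycle
relative = relative_mips(100.0, 20.0, 1.0)  # hypothetical 1-MIPS reference machine

print(f"native {native:.1f}, peak {peak:.1f}, relative {relative:.1f}")
```

Peak MIPS ignores the workload entirely, while relative MIPS is really just an execution-time ratio in disguise, which is why it is the least bad of the three.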
F. SPEC - Standard Performance Evaluation Corporation (originally the System Performance Evaluation Cooperative)
1. What's Measured: Int and Float Performance
Int: "Normal" Tasks
Compression: gzip
Compilation: gcc
chess, perl, combinatorial optimization, databases,
logic simulation, etc...
Floating Point: Number crunching
Image Processing/Neural Networks (Matrix Manip), Fluid Dynamics,
Primality Test, 3D Graphics, Finite Elements/Simulation,
Nuclear Physics, etc...
2. What's used to measure these?
REAL Apps (Non-Synthetic Benchmark) for "realistic" performance
Must be compute (rather than I/O) bound.
Must be "unique" compared to other benchmark members.
Must do meaningful/useful work.
3. Who's behind SPEC?
For the most part: Industry (HP, IBM, SGI, Compaq, Sun, Intel, etc.)
4. What are the problems in developing SPEC
a. Selecting a program. MUST be REALLY portable
18 platforms, some 32-bit, some 64-bit
11 types of Unix, and 2 versions of Windows NT
multiple compilers
b. Programming Languages & Compilers
C: 11 in int, 4 in FP
C++: 1 in int
F77: 6 in FP
F90: 4 in FP
Why Fortran in FP?
Why so little C++?
c. Problems with FP
Numerical variability - subtle implementation differences
will actually give different results.
d. Vendor self interest:
results are usually confidential
voted for by diverse sub-committee
e. Misc:
Compilers: Optimization/differences = Big Impact
181.mcf, 252.eon (pg32) 500 MHz beats 533 MHz 3 times.
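The numerical variability noted under "Problems with FP" above shows up even on one machine: floating-point addition is not associative, so a compiler or library that reorders a sum can change the low-order bits. A minimal sketch, assuming IEEE 754 double precision (what Python floats use on common platforms); the tolerance is an arbitrary illustrative choice.

```python
import math

# Same three numbers, summed in two different orders.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

exactly_equal = left_first == right_first
close_enough = math.isclose(left_first, right_first, rel_tol=1e-9)

print(repr(left_first), repr(right_first))
print("exactly equal:", exactly_equal)
print("equal within tolerance:", close_enough)
```

This is why a benchmark that validates FP output generally has to compare results within a tolerance rather than bit for bit.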
III. Next Time:
1. Continue Experimental Performance
2. SPEC Handout
3. Hw#2 Assigned