Day 6
I. Last Time:
Hw #2 Posted
A. Analytical Performance and the Clock

II. New Stuff:   
A. Analytical Performance Continued:
       1. ISA vs. Implementation
          a. ISA: Instruction Set Architecture
             The set of instructions, and their encoding as a specific format
             of 1s and 0s, that a processor is able to read and execute.
             I.e. an Intel and an AMD processor may have the same ISA:
                  they can both read and run the same program.
          b. Implementation: The internal details of how the processor 
             accomplishes reading/running a program. 

          The previous example changed just the implementation.
          (In different ISAs the same number means different instructions:
           0x77 may mean ADD to one CPU and MULT to another.)

             Ex: An Intel x86 may take 4 clocks for an ADD;
                 an AMD x86 may take only 3.
                 If they are both run at the same clock speed,
                 the AMD will be faster. But if the Intel can
                 be clocked 33% faster than the AMD (4/3 the clock rate),
                 they'll take the same time...

          The Intel and AMD processors have vastly different implementations.
          The ONLY time we can even begin to use clock speed as a measure of
          performance is when comparing two machines with the same ISA AND the
          same implementation AND all other components in the systems are
          identical. (This last part is due to Amdahl's Law...)
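
          The Intel/AMD ADD comparison can be checked numerically. This is
          only a minimal sketch: the function name and the 300 MHz and 400 MHz
          clock rates are assumptions picked for illustration; the 4-clock and
          3-clock ADD costs come from the example.

```python
def add_time_ns(cycles_per_add: int, clock_mhz: float) -> float:
    """Time for one ADD in nanoseconds: cycle count times clock period."""
    period_ns = 1000.0 / clock_mhz  # a 1 MHz clock has a 1000 ns period
    return cycles_per_add * period_ns


# Cycle counts are from the example; the clock rates are assumed values.
amd_ns = add_time_ns(3, 300.0)
intel_same_clock_ns = add_time_ns(4, 300.0)
intel_faster_clock_ns = add_time_ns(4, 400.0)  # 4/3 of 300 MHz

print(f"AMD   @ 300 MHz: {amd_ns:.2f} ns per ADD")
print(f"Intel @ 300 MHz: {intel_same_clock_ns:.2f} ns per ADD")
print(f"Intel @ 400 MHz: {intel_faster_clock_ns:.2f} ns per ADD")
```

          At equal clocks the 3-cycle ADD wins; with the 4-cycle part clocked
          at 4/3 the rate, the two times match. Clock speed alone says nothing
          until the cycles per instruction are known too.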

    B. Typical Analytical Calculations:
       1. Typically we'll perform calculations to answer one of a few questions:
          a. How long will the program/set of instructions take to execute?
          b. If we improve/change X, how long will this program take? 
             (Ex: if Add takes 4 cycles instead of 7)
          c. How can we make Machine A perform the same as Machine B?
             (Ex: What clock speed would we need to run B at to match A?)
       2. We're often given some of the following:
          a. A table of instructions and the number of clock cycles 
             required by each
          b. A clock speed or period
          c. An Average CPI (Clocks/Instruction) for a specific program
          d. A piece of code
          e. The number of instructions in a piece of code and a table of the
             "instruction mix" (the share of instructions of each type)
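       3. Find a way to express how long the execution will take,
          either in clock cycles or in actual time:
          a. Find how many cycles each instruction takes and add them up
          b. Weight each instruction type's cycle count by its "instruction
             mix" percentage and add these up to get an average CPI, then
             multiply by the number of instructions
          c. Use a known "Average CPI" and the number of instructions
             (cycles = CPI * instructions; time = cycles * clock period)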
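
          The three approaches in item 3 should agree with each other. A
          minimal sketch follows; the instruction classes, counts, cycle
          costs, and the 10 ns clock period are all made-up values for
          illustration, not figures from the course.

```python
# Hypothetical program: every number below is an assumed value.
cycles_per_class = {"alu": 1, "mem": 4, "branch": 2}
counts = {"alu": 500_000, "mem": 300_000, "branch": 200_000}
total_instructions = sum(counts.values())

# (a) Add up the cycles of every instruction.
cycles_a = sum(counts[c] * cycles_per_class[c] for c in counts)

# (b) Weight each class's cycle cost by its share of the instruction mix
#     to get an average CPI, then scale by the instruction count.
mix = {c: counts[c] / total_instructions for c in counts}
average_cpi = sum(mix[c] * cycles_per_class[c] for c in mix)
cycles_b = average_cpi * total_instructions

# (c) Use a known average CPI directly (here, the value derived in (b)).
cycles_c = 2.1 * total_instructions

# Convert cycles to seconds with the assumed 10 ns clock period.
clock_period_s = 10e-9
time_s = cycles_a * clock_period_s

print(f"(a) {cycles_a} cycles, (b) {cycles_b:.0f} cycles, (c) {cycles_c:.0f} cycles")
print(f"time = {time_s:.3f} s")
```

          Which form to use depends only on what the problem hands you:
          a full listing, a mix table, or a single average CPI.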
       Examples:
          A. A program requires 1,250,000 cycles on a 50 MHz machine.
             How long does it take?
                 1 cycle = 1/50,000,000 s = 20 ns
                 20 ns * 1,250,000 = 0.025 s

          B. A program has an average CPI of 2.4 for its 20,000,000 instructions.
             It runs on a processor with a 10 ns clock.
             What's the frequency of the clock?
                 1/10 ns = 100 MHz
             How long will the program take?
                 2.4 * 20,000,000 = 48,000,000 cycles
                 48,000,000 * 10 ns = 0.48 s
             How long will the average instruction take?
                 2.4 * 10 ns = 24 ns (or 0.48 s / 20,000,000 = 24 ns)

          C. A program consists of 1,000,000 memory insts @ 6 clocks each and
                                   2,000,000 math insts @ 3 clocks each.
             What does the clock frequency have to be to complete in 1 s?
             What about in 0.7 s?
                 6*1,000,000 + 3*2,000,000 = 12,000,000 cycles
                 1 s:   12,000,000 / 1   = 12 MHz
                 0.7 s: 12,000,000 / 0.7 = ~17.14 MHz
             What will the clock speeds need to be if we improve the memory
             insts to only require 5 clocks?
                 5*1,000,000 + 3*2,000,000 = 11,000,000 cycles
                 1 s:   11,000,000 / 1   = 11 MHz
                 0.7 s: 11,000,000 / 0.7 = ~15.71 MHz
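
          Examples A through C can be reproduced with two small helpers. This
          is a sketch only: the function and variable names are assumptions
          introduced here; the cycle counts, CPI, and clock values are the
          ones given in the examples.

```python
def exec_time_s(cycles: float, clock_hz: float) -> float:
    """Execution time in seconds = cycles / clock frequency."""
    return cycles / clock_hz


def required_clock_hz(cycles: float, target_s: float) -> float:
    """Clock frequency needed to finish `cycles` in `target_s` seconds."""
    return cycles / target_s


# Example A: 1,250,000 cycles at 50 MHz.
time_a = exec_time_s(1_250_000, 50e6)
print(f"A: {time_a:.3f} s")

# Example B: CPI 2.4, 20,000,000 instructions, 10 ns clock.
clock_b_hz = 1 / 10e-9
cycles_b = 2.4 * 20_000_000
time_b = exec_time_s(cycles_b, clock_b_hz)
avg_instruction_ns = 2.4 * 10  # CPI times the clock period in ns
print(f"B: {clock_b_hz / 1e6:.0f} MHz, {time_b:.2f} s, {avg_instruction_ns:.0f} ns/inst")

# Example C: 1,000,000 memory insts and 2,000,000 math insts.
cycles_c = 6 * 1_000_000 + 3 * 2_000_000
cycles_c_improved = 5 * 1_000_000 + 3 * 2_000_000
for label, cycles in (("6-clock mem", cycles_c), ("5-clock mem", cycles_c_improved)):
    for target_s in (1.0, 0.7):
        mhz = required_clock_hz(cycles, target_s) / 1e6
        print(f"C: {label}, {target_s} s target: {mhz:.2f} MHz")
```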

    C. Amdahl's Law: The Bottleneck Problem / Law of Diminishing Returns
       The basic idea: You can't go any faster than your slowest part
           This often happens on the interstate...
           Cruising @ 80 and hit a construction zone.
       The slowest part/component will dictate the highest possible speed. 
       This governs both Hardware and Software
       1. Amdahl: IBM engineer who went on to form Amdahl Corp.
       2. VERY important for people:
          a. Doing parallel programming
          b. Working on non-parallel (but speed-critical) programs
          c. Trying to improve hardware design
       3. Related to the concepts mentioned when discussing profiling
       4. Ex: Pg 75
          Exe time after improvement =
              Exe time of affected part / Amount of improvement
              + Exe time of unaffected part
          I.e. a program takes 100s total. 80s is spent in multiplies.
               If only the multiply is improved and its speed is doubled,
               how long will the program take?
                   80/2 + 20 = 60s
          What if we need the program to complete in 40s?
          How much faster would mult need to be?
               80/n + 20 = 40, so n = 4: 4x faster than the original.
          What if we need results in 20s? How much faster will mult need to be?
               It can't be done with any finite improvement: even if the
               multiply were instant (0 time), the fastest the program could
               execute would be 20s.
       5. Corollary to Amdahl's Law: Make the common case fast!
          I.e. design for typical/frequent usage. 
          Ex: A supercomputer should be good at floating point ops
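
          The multiply example under item 4 can be checked with a short
          sketch of the formula. The function names are assumptions
          introduced here, and returning infinity for an unreachable target
          is just one possible convention.

```python
import math


def time_after_improvement(affected_s: float, unaffected_s: float, speedup: float) -> float:
    """Amdahl's Law: only the affected part shrinks by the speedup factor."""
    return affected_s / speedup + unaffected_s


def required_speedup(affected_s: float, unaffected_s: float, target_s: float) -> float:
    """Speedup of the affected part needed to hit target_s.

    Returns math.inf when no finite speedup can reach the target, i.e. when
    the target is at or below the time of the unaffected part.
    """
    if target_s <= unaffected_s:
        return math.inf
    return affected_s / (target_s - unaffected_s)


# 100s program: 80s in multiplies, 20s in everything else.
doubled = time_after_improvement(80, 20, 2)
for_40s = required_speedup(80, 20, 40)
for_20s = required_speedup(80, 20, 20)

print(f"mult 2x faster: {doubled:.0f}s")
print(f"speedup needed for 40s: {for_40s:.0f}x")
print(f"speedup needed for 20s: {for_20s}")
```

          The unaffected 20s sets a floor on the total, which is why the
          last case has no finite answer.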

    D. Experimental Techniques: Benchmarking
       Q. What is benchmarking? Who's run a benchmark of some sort?
       A. A benchmark is a program (or suite of programs) used 
          specifically to measure performance.

       Categories of Benchmarks: Synthetic and Non-Synthetic
          Synthetic - "Artificial" benchmark. I.e. a program written solely
                      to measure performance.
                      Usually a small assembly program, or at least a small
                      program.
          Non-Synthetic - An actual application that is really used

        When/Why Synthetic:
           Can test very specific/hardware-oriented features
           Can be used early in hardware/processor design (even when no
           actual hardware exists yet)
        Why NOT Synthetic:
           Easily fooled/spoofed
           (A simple change to the processor/compiler may drastically change
            the performance of the benchmark without any advantage to normal
            programs.)

        Common Examples of Synthetic: Norton (old)

      How do synthetic benchmarks report their results?    
        Often MIPS and MFLOPS are used as performance measures. 
        MIPS - Count of the millions of instructions executed per second
        MFLOPS - Count of the millions of Floating Point Operations per second

        These are only useful when comparing identical ISAs. Why?
        Because different machines may do different quantities of "work"
        per instruction.
        Ex: A RISC Machine (like MIPS) may require 200-300 instructions to 
            perform a memory copy.
            A CISC Machine (like Intel) may require only a single instruction 
            to do the same work.
        Just comparing the number of instructions done in a given time doesn't
        really measure the WORK done in that time.

        If both machines complete the copy in the same time, we'd say they have
        the same performance for the copy...

        But the RISC machine would have a much higher MIPS rating.
      
        Really, EXE time is probably the "best" measure.
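
        The memory-copy comparison can be put into numbers. A minimal sketch:
        the 250-instruction count (picked from the 200-300 range above) and
        the 2 microsecond copy time are assumed values for illustration only.

```python
def mips(instructions: int, time_us: float) -> float:
    """Instructions per microsecond equals millions of instructions per second."""
    return instructions / time_us


copy_time_us = 2.0       # assumed: both machines finish the copy in 2 us
risc_instructions = 250  # assumed, within the 200-300 range
cisc_instructions = 1    # a single instruction does the whole copy

risc_mips = mips(risc_instructions, copy_time_us)
cisc_mips = mips(cisc_instructions, copy_time_us)

print(f"RISC: {risc_mips} MIPS, CISC: {cisc_mips} MIPS, same copy time")
```

        Same work, same time, yet the MIPS ratings differ by a factor of 250.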

        Why Non-Synthetic:
           Can measure the actual performance that will be seen by the end user
        Problems with Non-Synthetic:
            1. Application Mix/Usage - Needs to be same as in benchmark
               We want to test apps that are as close to what the machine
               will be used for as possible. 
               I.e. 3D performance shouldn't be measured if we're only
                    going to use the machine for spreadsheets!
            2. "Ranking" in suites - How are different results combined?
                Look at page 32 of the SPEC handout
                How do we really combine these results to determine 
                what is best?
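
            How results are combined can change which machine "wins". SPEC
            reports a geometric mean of per-program ratios against a
            reference machine; the sketch below contrasts that with simply
            totaling execution time. All times are made-up values and the
            function names are assumptions introduced here.

```python
from statistics import geometric_mean

# Hypothetical two-program suite; all times are in seconds.
reference_times = [10.0, 100.0]
machine_x_times = [1.0, 100.0]
machine_y_times = [5.0, 50.0]


def total_time(times):
    """Total execution time across the suite (lower is better)."""
    return sum(times)


def geomean_of_ratios(times, reference):
    """Geometric mean of reference/measured time ratios (higher is better)."""
    return geometric_mean([r / t for r, t in zip(reference, times)])


x_total, y_total = total_time(machine_x_times), total_time(machine_y_times)
x_score = geomean_of_ratios(machine_x_times, reference_times)
y_score = geomean_of_ratios(machine_y_times, reference_times)

print(f"Total time:        X = {x_total:.0f}s, Y = {y_total:.0f}s")
print(f"Geomean of ratios: X = {x_score:.2f}, Y = {y_score:.2f}")
```

            With these numbers Y has the lower total time while X has the
            higher geometric-mean score, so the two summaries rank the
            machines in opposite order.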
        Common Examples of Non-Synthetic:
            Winstone, Quake, WinBench, SPEC and SPEC 2000, etc.

    E. Other Poor Metrics (Besides MIPS and MFLOPS):
       Peak MIPS - WORSE than MIPS. Just a measure of the number of times the 
                   fastest instruction can execute per second. 
       Relative MIPS - Based on a workload and a common machine, so a little better.

    F. SPEC - System Performance Evaluation Cooperative
              (now the Standard Performance Evaluation Corporation)
       1. What's Measured: Int and Float Performance
          Int: "Normal" Tasks
               Compression: gzip
               Compilation: gcc
               chess, perl, combinatorial optimization, databases, 
               logic simulation, etc...
          Floating Point: Number crunching
               Image Processing/Neural Networks (Matrix Manip), Fluid Dynamics, 
               Primality Test, 3D Graphics, Finite Elements/Simulation, 
               Nuclear Physics, etc... 
      2. What's used to measure these?
         REAL Apps (Non-Synthetic Benchmark) for "realistic" performance
         Must be compute (rather than I/O) bound.
         Must be "unique" compared to other benchmark members.
         Must do meaningful/useful work.
      3. Who's behind SPEC?
         For the most part: Industry (HP, IBM, SGI, Compaq, Sun, Intel, etc.)
      4. What are the problems in developing SPEC?
         a. Selecting a program. MUST be REALLY portable
            18 Platforms, some 32-bit, some 64-bit
            11 types of unix, and 2 windows NTs
            multiple compilers
         b. Programming Languages & Compilers
            C:   11 in int, 4 in FP
            C++: 1 in int
            F77: 6 in FP
            F90: 4 in FP
            Why Fortran in FP?
            Why so little C++?
         c. Problems with FP
            Numerical variability - subtle implementation differences
                                    will actually give different results.
         d. Vendor self interest:
            results are usually confidential
            voted for by diverse sub-committee

         e. Misc:
               Compilers: Optimization/differences = Big Impact
               181.mcf, 252.eon (pg 32): 500 MHz beats 533 MHz 3 times.

III. Next Time:
    1. Continue Experimental Performance
    2. SPEC Handout
    3. Hw #2 Assigned