
Performance Analysis on Intel i7-920

This is the output of my performance-measuring script when executed on an i7-920 together with MaxEnt. I used my C wrapper and linked it together with the Fortran MaxEnt. The compiler used was gcc/gfortran 4.6.3 with the switches -O3 -march=corei7. The test involved a single Green's function using an unsymmetric kernel.

The output of the script first contains some raw numbers and, towards the end, some interpreted text and numbers. The script collects the numbers over multiple runs of the program; that's one reason why the numbers might not fully match up.

So if you want to see the bottlenecks, there you go :-)

 

Let’s begin with the raw data
ARITH_CYCLES_DIV_BUSY = 57896700000.000000
BR_INST_EXEC_ANY = 403132000000.000000
BR_INST_EXEC_DIRECT_NEAR_CALL = 5092530000.000000
BR_INST_EXEC_INDIRECT_NEAR_CALL = 102616000.000000
BR_INST_EXEC_INDIRECT_NON_CALL = 3847590000.000000
BR_INST_EXEC_NEAR_CALLS = 5202310000.000000
BR_INST_EXEC_NON_CALLS = 397338000000.000000
BR_INST_EXEC_RETURN_NEAR = 5116580000.000000
UNC_QMC_WRITES_FULL_ANY = 9698830000.000000
400114000000.000000 386115000000.000000 4689920000.000000 1889600000.000000
UNC_L3_LINES_IN_ANY 48120600000.000000
1935040000000.000000 2203850000.000000 64256300.000000 1857490.000000
UNC_L3_LINES_OUT_ANY 49994900000.000000
43727900.000000 770720000000.000000 275510000000.000000 3633900000.000000
MEM_LOAD_RETIRED_L3_MISS 775005000.000000
MEM_LOAD_RETIRED_L3_UNSHARED_HIT 1387540000.000000
MEM_LOAD_RETIRED_OTHER_CORE_L2_HIT_HITM 254421.000000
RESOURCE_STALLS_ANY 569792000000.000000
MEM_UNCORE_RETIRED_LOCAL_DRAM 773893000.000000
MEM_UNCORE_RETIRED_REMOTE_DRAM 0.000000
SSEX_UOPS_RETIRED_PACKED_SINGLE = 39815100000.000000
ITLB_MISS_RETIRED = 2996070.000000
UOPS_EXECUTED_CORE_ACTIVE_CYCLES = 1541580000000.000000 (this is per core, not per thread)
UOPS_ISSUED_STALLED_CYCLES 618274000000.000000
UOPS_ISSUED_ANY 3407280000000.000000
====== Summary================================================================
Runtime : 734.122000 seconds
average frequency 2.635 GHz
==============================================================================
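As a sanity check, the reported average frequency can be reproduced from the cycle count and the runtime. A minimal sketch, assuming the TotalCycles value of 1.93504e12 (from the issue-port section further down) is the unhalted cycle count the script divides by:

```python
# Sanity check: average frequency = unhalted cycles / wall-clock runtime.
# Assumption: TotalCycles (1.93504e12, from the issue-port section)
# is the cycle count the script uses here.
total_cycles = 1_935_040_000_000
runtime_s = 734.122

avg_freq_ghz = total_cycles / runtime_s / 1e9
print(f"average frequency: {avg_freq_ghz:.3f} GHz")  # ~2.636; the report truncates to 2.635
```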
====== Instruction length decoder (yes, that thing really is just there for determining the length of an instruction…) ===
Analysis not available on bloomfield CPU…
==============================================================================
====== Situation at the instruction decoder ==================================
uops delivered by the microcode sequencer: 7785830000.000000, which is .228 % of all issued uops
===============================================================================
====== Situation at the instruction queue(IQ) (I assume it’s that decoded instruction queue)
INST_QUEUE_WRITES: 292370000000.000000
cycles during which instructions are written to the IQ: 96099300000.000000
average number of instructions decoded each cycle (should be close to 4): .328
===============================================================================
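Note that the printed .328 looks like the inverse ratio (write cycles per instruction). Dividing the instruction-queue writes by the cycles spent writing gives roughly 3 instructions per cycle, which is the figure that should be compared against 4. A quick check with the numbers from the lines above:

```python
# Decoder throughput check: the ratio printed above (.328) appears inverted.
iq_writes = 292_370_000_000        # INST_QUEUE_WRITES
iq_write_cycles = 96_099_300_000   # cycles during which instructions are written to the IQ

insts_per_cycle = iq_writes / iq_write_cycles  # ~3.04, the value to compare against 4
cycles_per_inst = iq_write_cycles / iq_writes  # ~0.329, matches the printed .328
print(insts_per_cycle, cycles_per_inst)
```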
====== Situation at the Register allocation table(RAT) ========================
uops issued from the RAT to the RS: 3407280000000.000000 (terminology: from the frontend to the backend)
RAT stall cycles : 200009000000.000000 , which is 10.336 %
RAT register read stall cycles 8720780.000000 , which is .004 % of rat stalls which is 0 % of total
RAT stall cycles due to serialization 2229780000.000000 , which is 1.114 % of rat stalls which is .115 % of total
RAT stall cycles due to ROB read port stall 197772000000.000000 , which is 98.881 % of rat stalls which is 10.220 % of total
RAT stall cycles due to partial flag register stall 95746.000000 , which is 0 % of rat stalls which is 0 % of total
===============================================================================
====== UOPS decomposition at issue port ======================================
TotalCycles = 1935040000000.000000 cycles
percentage of cycles where no execution port is doing ANYTHING : 31.951 %
precisely speaking, in these cycles no uop is delivered to the RS
percentage of cycles where some execution port is doing SOMETHING : 68.048 %
precisely speaking in these cycles, uops are delivered to the RS
===============================================================================
====== Situation at the Reservation station(RS) ===============================
Any Resource stalls 569792000000.000000
cycles during which occurred the lack of a load buffer 463961000.000000 , hence .023 %
cycles during which occurred the lack of a store buffer 194276000.000000 , hence .010 %
cycles during which the RS was full 159433000000.000000 , hence 8.239 %
cycles during which the ROB was full 410497000000.000000 , hence 21.213 %
cycles spent writing to the FPCW 0.000000 , hence 0 %
cycles spent writing to the MXCSR 0.000000 , hence 0 %
cycles spent for other reasons 0.000000 , hence 0 %
number of loads dispatched from the RS: 950111000000.000000
number of loads dispatched from the RS the MOB : 255008000000.000000 , which is 26.839 %
number of loads dispatched that bypass the MOB : 652119000000.000000 , which is 68.636 %
===============================================================================
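The two dispatch categories above leave a small remainder unaccounted for. A quick check of the percentages, using the numbers from this section:

```python
# Load-dispatch decomposition at the RS: through the MOB vs. bypassing it.
loads_total = 950_111_000_000
loads_mob = 255_008_000_000
loads_bypass = 652_119_000_000

pct_mob = 100 * loads_mob / loads_total        # ~26.84 %
pct_bypass = 100 * loads_bypass / loads_total  # ~68.64 %
pct_rest = 100 - pct_mob - pct_bypass          # ~4.5 % falls into neither category
print(pct_mob, pct_bypass, pct_rest)
```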
====== Situation at the execution ports =======================================
Total number of executed uops 3372137000000.000000
uops executed on port 0 (SSE + FP Add) 549976000000.000000 , hence 16.30 % of uops
uops executed on port 1 (multiply divide) 677262000000.000000 , hence 20.08 % of uops
uops executed on port 2 (load)(per core) 788602000000.000000 , hence 23.38 % of uops
uops executed on port 3 (store)(per core) 282601000000.000000 , hence 8.38 % of uops
uops executed on port 4 (?)(per core) 318733000000.000000 , hence 9.45 % of uops
uops executed on port 5 (branch, SSE add) 754963000000.000000 , hence 22.38 % of uops
===============================================================================
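The six per-port counts add up exactly to the total, which is a good consistency check on the measurement. A sketch with the numbers from this section (the port-4 label is my assumption; on Nehalem port 3 computes store addresses and port 4 writes store data):

```python
# Consistency check: per-port uop counts should sum to the executed-uop total.
ports = {
    0: 549_976_000_000,  # SSE + FP add
    1: 677_262_000_000,  # multiply / divide
    2: 788_602_000_000,  # load
    3: 282_601_000_000,  # store address
    4: 318_733_000_000,  # store data (presumably the "?" above, on Nehalem)
    5: 754_963_000_000,  # branch, SSE add
}
total = 3_372_137_000_000

assert sum(ports.values()) == total
shares = {p: 100 * n / total for p, n in ports.items()}
print(shares)  # ports 2 and 5 carry the largest shares
```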
====== UOP decomposition at retirement ========================================
retired instructions 3102338500000.000000
issued uops: 3407280000000.000000
wasted uops: 304941500000.000000 , hence 9.829 %
===============================================================================
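The wasted-uop figure is simply issued minus retired, and the percentage appears to be taken relative to the retired count rather than the issued count. A sketch:

```python
# Wasted work from speculation: uops issued but never retired.
issued = 3_407_280_000_000
retired = 3_102_338_500_000

wasted = issued - retired
pct = 100 * wasted / retired  # note: relative to the retired count, not the issued one
print(wasted, pct)
```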
===== branching metrics =======================================================
Total Number of branch instructions executed: 403132000000.000000
contribution of direct near calls: 1.263 %
contribution of indirect near calls: .025 %
contribution of indirect non calls: .954 %
contribution of near calls: 1.290 %
contribution of non calls: 98.562 %
contribution of near returns: 1.269 %
Total number of retired branch instructions: 400114000000.000000
contribution of near calls: 1.172 %
contribution of conditionals: 96.501 %
percentage of mispredicted branches: .468 %
wasted work due to instruction starvation: 2.505 % of total cycles
Number of Instructions per near call: 596.338
===============================================================================
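The instructions-per-near-call figure appears to be derived from the *executed* near-call count in the raw data (BR_INST_EXEC_NEAR_CALLS), not the retired one. A sketch:

```python
retired_instructions = 3_102_338_500_000
near_calls_executed = 5_202_310_000  # BR_INST_EXEC_NEAR_CALLS from the raw data

insts_per_call = retired_instructions / near_calls_executed
print(f"{insts_per_call:.3f}")  # ~596.339, matching the reported 596.338 up to rounding
```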
===== Memory Subsystem ========================================================
retired Instructions that hit the DRAM : 773893000.000000 , Hence .100 % of all loads
retired Instructions that hit the L3 : 1387540000.000000 , Hence .180 % of all loads
retired Instructions that hit the L2 : 3633900000.000000 , Hence .471 % of all loads
retired Instructions that hit the L1 : 726595000000.000000 , Hence 94.274 % of all loads
First some information on the interaction with the DRAM. Note that these quantities are per Socket!
full cachelines written: 9698830000.000000 -> 591969.60 MB
partial cachelines written: 25670000.000000 -> 1566.77 MB
read events: 44791900000.000000 -> 591969.60 MB(assuming every event corresponds to one cacheline)
L1D hardware prefetch requests: 114428000000.000000
L1D hardware prefetch misses: 45385900000.000000 , which is 39.663 % of all hardware requests
L1D prefetch triggers: 114694000000.000000
Loads that hit a SSE prefetched line in-flight: 0.000000
writebacks from L1 to L2 : 11298800000.000000 -> 689624.02 MB
Store Buffer stall cycles: 93191300000.000000 , which is 4.816 % of all cycles
estimated impact of Hits to the L2 21803400000.000000 cycles, which equals 1.126 % of cycles
estimated impact of Hits to the L3 48563900000.000000 cycles, which equals 2.509 % of cycles
number of retired loads that miss the L3: 775005000.000000
estimated impact of an L3 miss local dram: 174125925000.000000 cycles, which equals 8.998 % of cycles
estimated impact of an L3 miss remote DRAM Hit: 0 cycles, which equals 0 % of cycles
estimated impact of an L3 miss remote Cache Hit : 0 cycles, which equals 0 % of cycles
estimated impact of an L3 miss, but data from somewhere(?) Hit : 222400000.000000 cycles, which equals .011 % of cycles
estimated total impact of L3 misses : 174348325000.000000 cycles, which equals 9.000 % of cycles
now we estimate the bandwidth to the L3
L2_LINES_IN_ANY: 40931800000.000000 -> 2498278.80 MB -> 3403.08 MB/s
L2_LINES_OUT_DEMAND_DIRTY 2029880000.000000 -> 123894.04 MB -> 168.76 MB/s
Hence total L2 <-> L3 bandwidth:
2622172.85 MB -> 3571.84 MB/s
replaced L1D cachelines, hence reads: 47011800000.000000 -> 2869372.558 MB -> 3908.577 MB/s
evicted L1D cachelines : 11505000000.000000 -> 702209.472 MB -> 956.529 MB/s, compare to L2 writebacks
===============================================================================
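The MB figures in the memory section are cacheline-event counts multiplied by 64 bytes and divided by 2^20, so strictly they are MiB. Redoing the conversion also shows that the read-events line repeats the write figure: 44.79e9 read events correspond to roughly 2.7e6 MiB, not 591969.60 MB, which looks like a copy-paste in the script's output. A sketch:

```python
CACHELINE = 64  # bytes per cacheline on the i7-920

def lines_to_mib(n_lines):
    """Convert a cacheline-event count to mebibytes (the script labels these MB)."""
    return n_lines * CACHELINE / 2**20

writes_mib = lines_to_mib(9_698_830_000)   # full cachelines written -> ~591969.60
reads_mib = lines_to_mib(44_791_900_000)   # read events -> ~2.73e6, not the printed 591969.60
print(writes_mib, reads_mib)
```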
estimated impact of Hits to the other core’s L2 (blindly modified for bloomfield) 15265260.000000 cycles, which equals 0 % of cycles
estimated impact of Hits to the other core’s L2(modified data) 19081575.000000 cycles, which equals 0 % of cycles
L1 DTLB Miss impact: 77134750000.000000 cycles, which equals 3.986 % of cycles
counted stalled cycles due to load ops: 321884721835.000000 cycles, which equals 16.634 % of cycles
Cycles spent for div and sqrt: 57896700000.000000 cycles, which equals 2.992 % of cycles
TotalCounted stalled cycles: 379781421835.000000 cycles, which equals 19.626 % of cycles
Contribution of L3 misses to the total counted stalls : 45.907 %
Contribution of L2 Hits to the total counted stalls : 5.741 %
Contribution of L1 DTLB misses to the total counted stalls : 20.310 %
Contribution of L3 unshared hits to the total counted stalls : 12.787 %
Contribution of L2 other core Hits to the total counted stalls : .004 %
Contribution of L2 other core Hits and found modified data to the total counted stalls : .005 %
Contribution of divs and sqrts to the total counted stalls: 15.244 %
L2 Instruction fetch misses: 43727900.000000 , which is 40.49 % of the total Ifetches
L1 ITLB miss impact 65012150.000000 cycles, which equals .003 % of cycles
ITLB Miss rate: 0 %
branch instructions: 400114000000.000000 which is 12.897 %
load instructions: 770720000000.000000 which is 24.843 %
store instructions: 275510000000.000000 which is 8.880 %
other instructions: 1655994500000.000000 which is 53.378 %
Packed(SSE) instructions: 420036100000.000000 which is 13.539 %
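The instruction-mix percentages are taken relative to the retired-instruction total, and the "other" line is just the remainder. A sketch with the numbers above:

```python
# Instruction-mix breakdown relative to retired instructions.
retired = 3_102_338_500_000
branches = 400_114_000_000
loads = 770_720_000_000
stores = 275_510_000_000

other = retired - branches - loads - stores
assert other == 1_655_994_500_000  # matches the 'other instructions' line

for name, n in [("branch", branches), ("load", loads), ("store", stores), ("other", other)]:
    print(f"{name}: {100 * n / retired:.3f} %")
```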
===== FLOP counting =====================================================
Retired Double precision packed operations 380221000000.000000
Retired Double precision scalar operations 1419910000000.000000
Retired Single precision packed operations 39815100000.000000
Now follows an estimation of the flop rate as done in some Intel manual…
Executed X87 uops: 2547610000.000000
Executed packed FP uops : 124254000000.000000
Executed scalar FP uops : 700024000000.000000
Hence 826825.6 MFLOPS
Hence .4272 flops per cycle.
Therefore 1126.278 MFlops/s
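The FLOP estimate simply sums the three executed-FP-uop counts, counting a packed uop as a single operation even though it works on multiple elements, so it understates the true operation count. The derived figures then follow from the cycle count and the runtime. A sketch:

```python
x87 = 2_547_610_000
packed = 124_254_000_000   # each packed uop counted as one op in this estimate
scalar = 700_024_000_000
total_cycles = 1_935_040_000_000
runtime_s = 734.122

mflop = (x87 + packed + scalar) / 1e6          # ~826825.6 "MFLOPS" (a count, not a rate)
flops_per_cycle = mflop * 1e6 / total_cycles   # ~0.4273
mflop_per_s = mflop / runtime_s                # ~1126.3 MFlop/s
print(mflop, flops_per_cycle, mflop_per_s)
```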
===== Further Motivation ================================================
CPI: .6237 hence IPC: 1.6032
Load to store ratio: 2.797
Note: all processors from Intel since at least the Core2 reach CPI=0.25….
Possible improvement: 2.494
cycles after improvement: 775878107457.89
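The closing numbers follow from the cycle and retired-instruction counts: the "possible improvement" is the measured CPI divided by the assumed best case of 0.25, and the projected cycle count divides the total cycles by that (truncated) factor. A sketch:

```python
total_cycles = 1_935_040_000_000
retired = 3_102_338_500_000
loads = 770_720_000_000
stores = 275_510_000_000

cpi = total_cycles / retired         # ~0.6237
ipc = 1 / cpi                        # ~1.6032
load_store_ratio = loads / stores    # ~2.797
improvement = cpi / 0.25             # ~2.4949; the report truncates this to 2.494
cycles_after = total_cycles / 2.494  # reproduces the reported 775878107457.89
print(cpi, ipc, improvement, cycles_after)
```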
