This is the output of my performance-measuring script when run on an i7-920 together with MaxEnt. I used my C wrapper and linked it against the Fortran MaxEnt. The compiler was gcc/gfortran 4.6.3 with the switches -O3 -march=corei7. The test involved a single Green’s function using an unsymmetric kernel.

The output of the script first contains some raw numbers and, towards the end, some interpreted text and numbers. The script collects the numbers over multiple runs of the program; that’s one reason why the numbers might not fully match up.

So if you want to see the bottlenecks, there you go 🙂

Let’s begin with the raw data:

ARITH_CYCLES_DIV_BUSY = 57896700000.000000

BR_INST_EXEC_ANY = 403132000000.000000

BR_INST_EXEC_DIRECT_NEAR_CALL = 5092530000.000000

BR_INST_EXEC_INDIRECT_NEAR_CALL = 102616000.000000

BR_INST_EXEC_INDIRECT_NON_CALL = 3847590000.000000

BR_INST_EXEC_NEAR_CALLS = 5202310000.000000

BR_INST_EXEC_NON_CALLS = 397338000000.000000

BR_INST_EXEC_RETURN_NEAR = 5116580000.000000

UNC_QMC_WRITES_FULL_ANY = 9698830000.000000

400114000000.000000 386115000000.000000 4689920000.000000 1889600000.000000

UNC_L3_LINES_IN_ANY 48120600000.000000

1935040000000.000000 2203850000.000000 64256300.000000 1857490.000000

UNC_L3_LINES_OUT_ANY 49994900000.000000

43727900.000000 770720000000.000000 275510000000.000000 3633900000.000000

MEM_LOAD_RETIRED_L3_MISS 775005000.000000

MEM_LOAD_RETIRED_L3_UNSHARED_HIT 1387540000.000000

MEM_LOAD_RETIRED_OTHER_CORE_L2_HIT_HITM 254421.000000

RESOURCE_STALLS_ANY 569792000000.000000

MEM_UNCORE_RETIRED_LOCAL_DRAM 773893000.000000

MEM_UNCORE_RETIRED_REMOTE_DRAM 0.000000

SSEX_UOPS_RETIRED_PACKED_SINGLE = 39815100000.000000

ITLB_MISS_RETIRED = 2996070.000000

UOPS_EXECUTED_CORE_ACTIVE_CYCLES = 1541580000000.000000 (this is per core, not per thread)

UOPS_ISSUED_STALLED_CYCLES 618274000000.000000

UOPS_ISSUED_ANY 3407280000000.000000

====== Summary================================================================

Runtime : 734.122000 seconds

average frequency 2.635 GHz

==============================================================================
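As a quick cross-check (my addition, not part of the script’s output): the average frequency follows from the total cycle count reported further down divided by the runtime. A minimal Python sketch, with the numbers copied verbatim from the output:

```python
# Cross-check of the reported average frequency.
total_cycles = 1_935_040_000_000   # TotalCycles from the output below
runtime_s = 734.122                # Runtime in seconds

avg_freq_ghz = total_cycles / runtime_s / 1e9
print(f"average frequency {avg_freq_ghz:.3f} GHz")
# prints roughly 2.636; the script truncates to 2.635
```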

====== Instruction length decoder (yes, that thing really is just there for determining the length of an instruction…) ===

Analysis not available on bloomfield CPU…

==============================================================================

====== Situation at the instruction decoder ==================================

uops delivered by the microcode sequencer: 7785830000.000000, which is .228 % of all issued uops

===============================================================================

====== Situation at the instruction queue (IQ) (I assume it’s the decoded instruction queue)

INST_QUEUE_WRITES: 292370000000.000000

cycles during which instructions are written to the IQ: 96099300000.000000

average number of instructions decoded each cycle (should be close to 4): 3.042

===============================================================================

====== Situation at the Register allocation table(RAT) ========================

uops issued from the RAT to the RS 3407280000000.000000 terminology: from the frontend to the backend

RAT stall cycles : 200009000000.000000 , which is 10.336 %

RAT register read stall cycles 8720780.000000 , which is .004 % of rat stalls which is 0 % of total

RAT stall cycles due to serialization 2229780000.000000 , which is 1.114 % of rat stalls which is .115 % of total

RAT stall cycles due to ROB read port stall 197772000000.000000 , which is 98.881 % of rat stalls which is 10.220 % of total

RAT stall cycles due to partial flag register stall 95746.000000 , which is 0 % of rat stalls which is 0 % of total

===============================================================================
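The RAT percentages can be reproduced directly from the raw counters; a small illustrative recomputation (numbers copied from the output above):

```python
# Reproduce the RAT stall percentages from the raw counters.
total_cycles   = 1_935_040_000_000
rat_stalls     =   200_009_000_000
rob_read_stall =   197_772_000_000  # ROB read port stall cycles

pct_rat_of_total = 100 * rat_stalls / total_cycles      # ~10.34 %
pct_rob_of_rat   = 100 * rob_read_stall / rat_stalls    # ~98.88 %
pct_rob_of_total = 100 * rob_read_stall / total_cycles  # ~10.22 %
```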

====== UOPS decomposition at issue port ======================================

TotalCycles = 1935040000000.000000 cycles

percentage of cycles where no execution port is doing ANYTHING : 31.951 %

precisely speaking, in these cycles no uop is delivered to the RS

percentage of cycles where some execution port is doing SOMETHING : 68.048 %

precisely speaking, in these cycles uops are delivered to the RS

===============================================================================

====== Situation at the Reservation station(RS) ===============================

Any Resource stalls 569792000000.000000

cycles during which the lack of a load buffer occurred 463961000.000000 , hence .023 %

cycles during which the lack of a store buffer occurred 194276000.000000 , hence .010 %

cycles during which the RS was full 159433000000.000000 , hence 8.239 %

cycles during which the ROB was full 410497000000.000000 , hence 21.213 %

cycles spent writing to the FPCW 0.000000 , hence 0 %

cycles spent writing to the MXCSR 0.000000 , hence 0 %

cycles spent for other reasons 0.000000 , hence 0 %

number of loads dispatched from the RS: 950111000000.000000

number of loads dispatched from the RS to the MOB : 255008000000.000000 , which is 26.839 %

number of loads dispatched that bypass the MOB : 652119000000.000000 , which is 68.636 %

===============================================================================
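The MOB split is just the two dispatch counters over the total; an illustrative recomputation:

```python
# Split of loads dispatched from the RS (numbers from the output above).
loads_total  = 950_111_000_000
loads_mob    = 255_008_000_000  # routed through the MOB
loads_bypass = 652_119_000_000  # bypass the MOB

pct_mob    = 100 * loads_mob / loads_total     # ~26.84 %
pct_bypass = 100 * loads_bypass / loads_total  # ~68.64 %
# Note the two shares don't add up to 100 %; the remainder is not
# broken out in the script's output.
```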

====== Situation at the execution ports =======================================

Total number of executed uops 3372137000000.000000

uops executed on port 0 (SSE + FP Add) 549976000000.000000 , hence 16.30 % of uops

uops executed on port 1 (multiply divide) 677262000000.000000 , hence 20.08 % of uops

uops executed on port 2 (load)(per core) 788602000000.000000 , hence 23.38 % of uops

uops executed on port 3 (store)(per core) 282601000000.000000 , hence 8.38 % of uops

uops executed on port 4 (store data)(per core) 318733000000.000000 , hence 9.45 % of uops

uops executed on port 5 (branch, SSE add) 754963000000.000000 , hence 22.38 % of uops

===============================================================================
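A consistency check on this table: the six per-port counts sum exactly to the reported total (on Nehalem, port 3 carries store addresses and port 4 store data). Sketched in Python:

```python
# Per-port executed uops, copied from the output above.
ports = {
    0: 549_976_000_000,  # SSE + FP add
    1: 677_262_000_000,  # multiply/divide
    2: 788_602_000_000,  # load
    3: 282_601_000_000,  # store address
    4: 318_733_000_000,  # store data
    5: 754_963_000_000,  # branch, SSE add
}
total = sum(ports.values())
assert total == 3_372_137_000_000  # matches "Total number of executed uops"
shares = {p: 100 * n / total for p, n in ports.items()}
```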

====== UOP decomposition at retirement ========================================

retired instructions 3102338500000.000000

issued uops: 3407280000000.000000

wasted uops: 304941500000.000000 , hence 9.829 %

===============================================================================
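The wasted-work figure is simply issued minus retired, taken relative to the retired count (note the script compares issued uops against retired instructions):

```python
issued  = 3_407_280_000_000  # UOPS_ISSUED_ANY
retired = 3_102_338_500_000  # retired instructions, as the script uses them

wasted = issued - retired        # 304_941_500_000
pct    = 100 * wasted / retired  # ~9.83 %
```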

===== branching metrics =======================================================

Total Number of branch instructions executed: 403132000000.000000

contribution of direct near calls: 1.263 %

contribution of indirect near calls: .025 %

contribution of indirect non calls: .954 %

contribution of near calls: 1.290 %

contribution of non calls: 98.562 %

contribution of near returns: 1.269 %

Total number of retired branch instructions: 400114000000.000000

contribution of near calls: 1.172 %

contribution of conditionals: 96.501 %

percentage of mispredicted branches: .468 %

wasted work due to instruction starvation: 2.505 % of total cycles

Number of Instructions per near call: 596.338

===============================================================================
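Two of these metrics recomputed from the raw counters, just to make the arithmetic explicit (the instructions-per-call figure divides retired instructions by executed near calls):

```python
branches_exec = 403_132_000_000    # BR_INST_EXEC_ANY
direct_calls  =   5_092_530_000    # BR_INST_EXEC_DIRECT_NEAR_CALL
near_calls    =   5_202_310_000    # BR_INST_EXEC_NEAR_CALLS
retired_instr = 3_102_338_500_000  # retired instructions

pct_direct     = 100 * direct_calls / branches_exec  # ~1.263 %
instr_per_call = retired_instr / near_calls          # ~596.3
```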

===== Memory Subsystem ========================================================

retired Instructions that hit the DRAM : 773893000.000000 , Hence .100 % of all loads

retired Instructions that hit the L3 : 1387540000.000000 , Hence .180 % of all loads

retired Instructions that hit the L2 : 3633900000.000000 , Hence .471 % of all loads

retired Instructions that hit the L1 : 726595000000.000000 , Hence 94.274 % of all loads
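These hit rates are each counter divided by the total number of load instructions (770720000000, listed in the instruction mix further down); for instance:

```python
# Hit rates relative to all load instructions (numbers from the output).
loads_total = 770_720_000_000  # total load instructions
l1_hits     = 726_595_000_000
dram_hits   =     773_893_000  # MEM_UNCORE_RETIRED_LOCAL_DRAM

pct_l1   = 100 * l1_hits / loads_total    # ~94.27 %
pct_dram = 100 * dram_hits / loads_total  # ~0.10 %
```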

First some information on the interaction with the DRAM. Note that these quantities are per Socket!

full cachelines written: 9698830000.000000 -> 591969.60 MB

partial cachelines written: 25670000.000000 -> 1566.77 MB

read events: 44791900000.000000 -> 2733880.62 MB (assuming every event corresponds to one cacheline)

L1D hardware prefetch requests: 114428000000.000000

L1D hardware prefetch misses: 45385900000.000000 , which is 39.663 % of all hardware requests

L1D prefetch triggers: 114694000000.000000

Loads that hit a SSE prefetched line in-flight: 0.000000

writebacks from L1 to L2 : 11298800000.000000 -> 689624.02 MB

Store Buffer stall cycles: 93191300000.000000 , which is 4.816 % of all cycles

estimated impact of Hits to the L2 21803400000.000000 cycles, which equals 1.126 % of cycles

estimated impact of Hits to the L3 48563900000.000000 cycles, which equals 2.509 % of cycles

number of retired loads that miss the L3: 775005000.000000

estimated impact of an L3 miss local dram: 174125925000.000000 cycles, which equals 8.998 % of cycles

estimated impact of an L3 miss remote DRAM Hit: 0 cycles, which equals 0 % of cycles

estimated impact of an L3 miss remote Cache Hit : 0 cycles, which equals 0 % of cycles

estimated impact of an L3 miss, but data from somewhere(?) Hit : 222400000.000000 cycles, which equals .011 % of cycles

estimated total impact of L3 misses : 174348325000.000000 cycles, which equals 9.000 % of cycles

now we estimate the bandwidth to the L3

L2_LINES_IN_ANY: 40931800000.000000 -> 2498278.80 MB -> 3403.08 MB/s

L2_LINES_OUT_DEMAND_DIRTY 2029880000.000000 -> 123894.04 MB -> 168.76 MB/s

Hence total L2 <-> L3 bandwidth:

2622172.85 MB -> 3571.84 MB/s

replaced L1D cachelines, hence reads: 47011800000.000000 -> 2869372.558 MB -> 3908.577 MB/s

evicted L1D cachelines : 11505000000.000000 -> 702209.472 MB -> 956.529 MB/s, compare to L2 writebacks

===============================================================================
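The MB and MB/s figures in this section all come from the same conversion: cacheline count times 64 bytes, divided by 2^20, then by the runtime. A sketch using L2_LINES_IN_ANY:

```python
# Cacheline counts -> MB and MB/s, as the script converts them.
CACHELINE_BYTES = 64
MB = 1024 * 1024
runtime_s = 734.122

l2_lines_in = 40_931_800_000                     # L2_LINES_IN_ANY
mb_in   = l2_lines_in * CACHELINE_BYTES / MB     # ~2498278.8 MB
rate_in = mb_in / runtime_s                      # ~3403.1 MB/s
```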

estimated impact of Hits to the other core’s L2 (blindly modified for bloomfield) 15265260.000000 cycles, which equals 0 % of cycles

estimated impact of Hits to the other core’s L2(modified data) 19081575.000000 cycles, which equals 0 % of cycles

L1 DTLB Miss impact: 77134750000.000000 cycles, which equals 3.986 % of cycles

counted stalled cycles due to load ops: 321884721835.000000 cycles, which equals 16.634 % of cycles

Cycles spent for div and sqrt: 57896700000.000000 cycles, which equals 2.992 % of cycles

TotalCounted stalled cycles: 379781421835.000000 cycles, which equals 19.626 % of cycles

Contribution of L3 misses to the total counted stalls : 45.907 %

Contribution of L2 Hits to the total counted stalls : 5.741 %

Contribution of L1 DTLB misses to the total counted stalls : 20.310 %

Contribution of L3 unshared hits to the total counted stalls : 12.787 %

Contribution of L2 other core Hits to the total counted stalls : .004 %

Contribution of L2 other core Hits and found modified data to the total counted stalls : .005 %

Contribution of divs and sqrts to the total counted stalls: 15.244 %

L2 Instruction fetch misses: 43727900.000000 , which is 40.49 % of the total Ifetches

L1 ITLB miss impact 65012150.000000 cycles, which equals .003 % of cycles

ITLB Miss rate: 0 %

branch instructions: 400114000000.000000 which is 12.897 %

load instructions: 770720000000.000000 which is 24.843 %

store instructions: 275510000000.000000 which is 8.880 %

other instructions: 1655994500000.000000 which is 53.378 %

Packed(SSE) instructions: 420036100000.000000 which is 13.539 %

===== FLOP counting =====================================================

Retired Double precision packed operations 380221000000.000000

Retired Double precision scalar operations 1419910000000.000000

Retired Single precision packed operations 39815100000.000000

Now follows an estimation of the flop rate as done in some Intel manual…

Executed X87 uops: 2547610000.000000

Executed packed FP uops : 124254000000.000000

Executed scalar FP uops : 700024000000.000000

Hence 826825.6 MFLOP in total.

Hence .4272 flops per cycle.

Therefore 1126.278 MFlops/s
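The estimate apparently just sums the three executed-uop counts (each packed uop counted as one operation, so SIMD width is not factored in) and divides by cycles and by runtime:

```python
# FLOP estimate as the script computes it (numbers from the output).
x87    =   2_547_610_000   # executed X87 uops
packed = 124_254_000_000   # executed packed FP uops
scalar = 700_024_000_000   # executed scalar FP uops
total_cycles = 1_935_040_000_000
runtime_s    = 734.122

flops     = x87 + packed + scalar    # 826_825_610_000 -> 826825.6 MFLOP
per_cycle = flops / total_cycles     # ~0.427 flops per cycle
mflops_s  = flops / 1e6 / runtime_s  # ~1126.3 MFlop/s
```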

===== Further Motivation ================================================

CPI: .6237 hence IPC: 1.6032

Load to store ratio: 2.797

Note: all Intel processors since at least the Core 2 can reach a CPI of 0.25…

Possible improvement: 2.494

cycles after improvement: 775878107457.89
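The projection above can be reproduced as follows (the script seems to truncate the improvement factor to three decimals before dividing, which is why the exact quotient matches 2.494 rather than 2.495):

```python
# CPI-based improvement projection (numbers from the output above).
total_cycles = 1_935_040_000_000
retired      = 3_102_338_500_000

cpi = total_cycles / retired  # ~0.6237
ipc = retired / total_cycles  # ~1.6032
best_cpi = 0.25               # the "CPI=0.25" target quoted above

improvement  = int(cpi / best_cpi * 1000) / 1000  # truncated -> 2.494
cycles_after = total_cycles / improvement         # ~7.7588e11 cycles
```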