Saturday, April 25, 2009

Solaris Kernel Analysis with MDB

Solaris kernel analysis with MDB dee-commands (dcmds). I have captured the output
of various dcmds; the resulting text file is large.

Thursday, April 16, 2009

Extended Dispatch Tables - Performance Improvement

Time Sharing Scheduling Class (TS) - ts_dptbl (dispatcher parameter table)

Intro:

The dispatcher controls the allocation of CPU resources to processes, and the scheduler supports multiple scheduling classes. Each class defines its own scheduling policy and the priority queues on which ready-to-run processes are linked. The scheduling classes configured on a Solaris system can be listed with:

dispadmin -l

CONFIGURED CLASSES
==================

SYS (System Class)
TS (Time Sharing)
FX (Fixed Priority)
IA (Interactive)

Processes in the time-sharing class which are running in user mode or in kernel mode before going to sleep are scheduled according to the parameters in a time-sharing dispatcher parameter table (ts_dptbl).

Except on Starcat systems (E25K and E15K), the standard dispatch table is used:

dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000
# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
200 0 50 0 50 # 0
200 0 50 0 50 # 1
200 0 50 0 50 # 2
200 0 50 0 50 # 3
200 0 50 0 50 # 4
200 0 50 0 50 # 5
200 0 50 0 50 # 6
200 0 50 0 50 # 7
200 0 50 0 50 # 8
200 0 50 0 50 # 9
160 0 51 0 51 # 10
160 1 51 0 51 # 11
160 2 51 0 51 # 12
160 3 51 0 51 # 13
160 4 51 0 51 # 14
160 5 51 0 51 # 15
160 6 51 0 51 # 16
160 7 51 0 51 # 17
160 8 51 0 51 # 18
160 9 51 0 51 # 19
120 10 52 0 52 # 20

[... Only a portion of table shown ...]

Starcat systems and the OPL DC2 and DC3 systems use the extended dispatch table, i.e. you will find
ts_dispatch_extended = 1 set on them.

# dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
400 0 1 2 40 # 0
380 0 2 2 40 # 1
380 1 3 2 40 # 2
380 1 4 2 40 # 3
380 2 5 2 40 # 4
380 2 6 2 40 # 5
380 3 7 2 40 # 6
380 3 8 2 40 # 7
380 4 9 2 40 # 8
380 4 10 2 40 # 9
380 5 11 2 40 # 10
380 5 12 2 40 # 11
380 6 13 2 40 # 12
380 6 14 2 40 # 13
380 7 15 2 40 # 14
380 7 16 2 40 # 15
380 8 17 2 40 # 16
380 8 18 2 40 # 17
380 9 19 2 40 # 18
380 9 20 2 40 # 19
360 10 21 2 40 # 20
360 11 22 2 40 # 21
360 12 23 2 40 # 22



/*
 * time-sharing dispatcher parameter table entry
 */
typedef struct tsdpent {
        pri_t   ts_globpri;     /* global (class independent) priority */
        int     ts_quantum;     /* time quantum given to procs at this level */
        pri_t   ts_tqexp;       /* ts_umdpri assigned when proc at this level */
                                /* exceeds its time quantum */
        pri_t   ts_slpret;      /* ts_umdpri assigned when proc at this level */
                                /* returns to user mode after sleeping */
        short   ts_maxwait;     /* bumped to ts_lwait if more than ts_maxwait */
                                /* secs elapse before receiving full quantum */
        short   ts_lwait;       /* ts_umdpri assigned if ts_dispwait exceeds */
                                /* ts_maxwait */
} tsdpent_t;


As the two dispatch tables above show, the main differences are:
- ts_quantum, the time quantum allocated to a process at that level,
- ts_slpret, the priority at which a process is placed when it returns to user mode after sleeping.

ts_slpret is there to help boost IO bound and interactive threads.

For example, let's say your thread is at priority 13 and you've burned through your time slice. Your next priority (ts_tqexp) is 6 with ts_dispatch_extended enabled and 3 otherwise. At priority 6 in the extended table your next quantum will still be 380 msecs. The ts_maxwait column is an anti-starvation mechanism: if you've languished on the ready queue for more than ts_maxwait seconds, you get a boost to ts_lwait.
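The priority-13 example can be checked mechanically. The following is only a minimal sketch in Python: the rows are copied from the two dispadmin listings above, while the names Row, STANDARD, EXTENDED, after_quantum_expiry and after_sleep are made up for illustration. It looks up table columns only and ignores everything else the kernel does (user priority adjustments, for instance).

```python
from collections import namedtuple

# One row of ts_dptbl, in the column order printed by dispadmin.
Row = namedtuple("Row", "ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait")

# A few rows copied from the two listings above, keyed by priority level.
STANDARD = {
    3: Row(200, 0, 50, 0, 50),
    13: Row(160, 3, 51, 0, 51),
}
EXTENDED = {
    6: Row(380, 3, 7, 2, 40),
    13: Row(380, 6, 14, 2, 40),
}


def after_quantum_expiry(table, pri):
    """Return (new priority, quantum in ms there) once the time slice is used up."""
    new_pri = table[pri].ts_tqexp
    return new_pri, table[new_pri].ts_quantum


def after_sleep(table, pri):
    """Return the priority assigned on return to user mode after sleeping."""
    return table[pri].ts_slpret


if __name__ == "__main__":
    print("extended:", after_quantum_expiry(EXTENDED, 13))
    print("standard:", after_quantum_expiry(STANDARD, 13))
    print("extended, after sleep:", after_sleep(EXTENDED, 13))
```

With RES=1000 the quantum values are in milliseconds, so the extended table both drops the thread less far and hands it a longer slice at the new level.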

Experiments:

The following benchmarks were run on OPL-FF2 to gather the performance data needed to help decide the optimal setting for the default Time Sharing dispatch table:

- SPECjbb2005 benchmark, a SPEC.org Java workload.
- TPC-E benchmark, the latest of the TPC.org OLTP workloads.

SPECjbb2005 experiment

OPL-FF2 System Configuration

1) Clock rate at 2.15 GHz,
2) L2 cache latency of 30 Pclocks
3) Memory access latency during SPECjbb2005 runs is 327.4ns

OPL-DC1 System Configuration

1) Clock rate at 2.28 GHz,
2) L2 cache latency of 24 Pclocks
3) Memory access latency during SPECjbb2005 runs is 375ns

Benchmark and JVM version

  • Benchmark SPECjbb2005 1.07
  • Java HotSpot(TM) Server VM (build 1.6.0-rc-b99, mixed mode)

Performance data



Platform   ts_dispatch_extended = 0            ts_dispatch_extended = 1
           (standard or Serengeti like)        (current)
OPL-FF2    bops = 171250, bops/JVM = 21406     bops = 170490, bops/JVM = 21311
OPL-DC1    bops = 308676, bops/JVM = 19292     bops = 308046, bops/JVM = 19253


OPL-DC3 experiment

OPL-DC3 System Configuration
1) 32 CPU Chip, 64 Cores.
2) Clock rate 2.27 GHz.
3) 512 GB Memory.
4) System clock frequency 792 MHz.
5) Memory latency measured during SPECjbb2005 runs at 553 ns.

Benchmark and JVM version
  • Benchmark SPECjbb2005 1.07
  • Java HotSpot(TM) Server VM (build 1.6.0-rc-b87, mixed mode)

Performance Results

      
CPU Chip   # Cores/Strands   Throughput jdk1.6.0                Throughput jdk1.6.0,
                                                                processor-set and lgrp_mem_pset_aware=1
32         64C/128S          -                                  bops=535630, bops/JVM=16738 (32)
32         64C/128S          bops=283317, bops/JVM=35415 (8)    bops=510639, bops/JVM=63830 (8)
28         56C/112S          bops=266043, bops/JVM=38006 (7)    bops=447450, bops/JVM=63921 (7)
24         48C/96S           bops=244480, bops/JVM=40747 (6)    bops=383632, bops/JVM=63939 (6)


SPECjbb2005 experiment results:

The OPL-FF2 and OPL-DC1 systems look quite immune to the dispatch table setting. Memory latency of 327 and 375 nsecs does not cause any perceivable contention.
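To put a number on how small the gap is, here is a short Python sketch that computes the relative change in bops between the two settings. The figures are the FF2 and DC1 results reported above; the names RESULTS and relative_change_pct are made up for illustration.

```python
# (bops with ts_dispatch_extended = 0, bops with ts_dispatch_extended = 1),
# copied from the SPECjbb2005 performance data above.
RESULTS = {
    "OPL-FF2": (171250, 170490),
    "OPL-DC1": (308676, 308046),
}


def relative_change_pct(standard, extended):
    """Percentage change in bops when moving from the standard to the extended table."""
    return (extended - standard) / standard * 100.0


if __name__ == "__main__":
    for platform, (standard, extended) in RESULTS.items():
        print(f"{platform}: {relative_change_pct(standard, extended):+.2f}%")
```

On both platforms the difference stays under half a percent.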

The tuned SPECjbb2005 run scores about 1.12 times the untuned run on OPL-FF2 and 1.23 times on OPL-DC1.

The OPL-DC2/DC3 systems, with an average memory latency of 550 nsecs during SPECjbb2005 runs, hit only 604511 bops even with the latest JVM performance improvements and tuning.

The extended table is recommended for DC2 and DC3. A higher ts_quantum compensates for the high memory latency and for cycles wasted on cache misses in high-CPI workloads.

TPC-E experiment


OPL-FF2 System Configuration

  • FF2 system with 2 System Boards and 128 GB available memory
  • 4 x 2 GHz SPARC64-VI sockets used for the experiment; other CPUs taken offline with psradm -f
  • 32 GB RAM used
  • VxVM used for DB datafiles
  • Oracle version
    • 10gR2 FCS ( 10.2.0.1)
    • 26 GB SGA , 4K block size
  • TPC-E scale and versions
    • 78000 customer database
    • V3.13 EGenLoader and kit
    • V3.14 plsql and tpce-3.14-20060508 driver
    • 0.7 08_02 Faban driver harness

Performance results

tpce.3g TS (serengeti dptbl ) run2 , 80K DB 156.12
tpce.3e TS (FF2 dptbl ) run2 , 80K DB 155.64


What we know so far: use ts_dispatch_extended for OPL-DC2/DC3.

Tuesday, February 3, 2009

CPU Performance Counters: - Front & Back Ends - Part 1

You are probably already aware that performance numbers are critical for both SW and HW.
SW in the sense that you have added, modified, or deleted code, so how is the performance after
the change?

HW in the sense that a new processor is being introduced, a DIMM upgrade is planned,
or on the IO side a new PLX switch is being introduced or, for that matter, its firmware is being
updated.

Oh, add another variable: virtualization! You have sliced and diced the existing HW to
run 'x' number of Logical Domains (LDOMS, in Sun's terminology).

Performance numbers are critical to everyone involved in product development and support
(execs, team members, marketing, support, etc.).

So crunching the numbers in the best possible way is critical. From a SW point of view
there are many applications and benchmarks that could be run. It depends on what you
plan to use the server for (web server, database, HPC applications).

I'll cover what is available to run in a different posting, but for now assume that
we are running something and the logs provide a number or an average value
(e.g. in the case of a web server, the number of active connections that requested some content, such as a file get).
Repeat your experiments 'n' times and you have an average number.

This is not all! I want to open the hood and monitor things (counters) while the application
is running. So here come the CPU performance counters: front end (commands) and back end (module).

Front End:

OpenSolaris provides front-end CLIs that collect the data, such as cpustat (cpustat.c).

cpustat monitors system behavior using CPU performance counters;
/usr/sbin/cpustat -h lists the counters exported by the system's CPU.

Back End:

The CLI calls into a back-end module called pcbe (Performance Counter Back End).
Each CPU-specific back-end module is written to the interfaces described in cpc_pcbe.h.

In the onnv (Nevada) source base these are the relevant files:
  • uts/common/sys/cpc_pcbe.h ---> Interfaces
  • uts/intel/pcbe/core_pcbe.c ---> Intel Family 6 Models 15, 23, 26 & 28
  • uts/intel/pcbe/opteron_pcbe.c ---> AMD Opteron and AMD Athlon 64 processors
  • uts/intel/pcbe/p123_pcbe.c ---> Pentiums I, II, and III
  • uts/intel/pcbe/p4_pcbe.c ---> Pentium 4
  • uts/sun4u/pcbe/opl_pcbe.c ---> Fujitsu's SPARC64 VI & VII
  • uts/sun4u/pcbe/us234_pcbe.c ---> UltraSPARC-II, III, IV
  • uts/sun4v/pcbe/niagara2_pcbe.c --> UltraSPARC T2 & T2+ Processors
  • uts/sun4v/pcbe/niagara_pcbe.c ---> Niagara
  • uts/sun4v/pcbe/rock_pcbe.c ---> Rock CPU

I worked on opl_pcbe.c. The events to be monitored are selected in the PCR register,
and the event counts accumulate in the PIC registers; both registers
exist per thread. Some events are counted per chip,
and such events increment the PICs of all threads (e.g. cycle_counts).

 * Performance Control Register (PCR)
 *
 * +----------+-----+-----+------+----+
 * |    0     | OVF |  0  | OVR0 | 0  |
 * +----------+-----+-----+------+----+
 *  63     48  47:32 31:27   26    25
 *
 * +----+----+-----+----+-----+---+-----+-----+----+----+----+
 * | NC |  0 | SC  | 0  | SU  | 0 | SL  |ULRO | UT | ST |PRIV|
 * +----+----+-----+----+-----+---+-----+-----+----+----+----+
 *  24:22 21  20:18  17  16:11 10   9:4    3    2    1    0
 *
 * ULRO and OVRO bits should be on upon accessing pcr unless
 * those fields need to be updated.
 * Turn off these bits when updating SU/SL or OVF field
 * (during initialization, etc.).
 *
 *
 * Performance Instrumentation Counter (PIC)
 * Four PICs are implemented in SPARC64 VI and VII,
 * each PIC is accessed using PCR.SC as a select field.
 *
 * +------------------------+--------------------------+
 * |          PICU          |           PICL           |
 * +------------------------+--------------------------+
 *  63                    32 31                        0
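
The bit layout in that comment maps directly onto shift-and-mask code. The following Python sketch pulls the named fields out of a 64-bit PCR value and splits a PIC into its upper and lower halves. Field names and bit ranges are taken from the comment above (the box labelled OVR0 is treated as the OVRO bit the text mentions); PCR_FIELDS, extract, decode_pcr and split_pic are illustrative names, not identifiers from opl_pcbe.c, and the sketch says nothing about what each field means.

```python
# (high bit, low bit) of each PCR field, as drawn in the comment above.
PCR_FIELDS = {
    "OVF": (47, 32),
    "OVRO": (26, 26),
    "NC": (24, 22),
    "SC": (20, 18),
    "SU": (16, 11),
    "SL": (9, 4),
    "ULRO": (3, 3),
    "UT": (2, 2),
    "ST": (1, 1),
    "PRIV": (0, 0),
}


def extract(value, hi, lo):
    """Return bits hi..lo (inclusive) of value, right-aligned."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode_pcr(pcr):
    """Split a 64-bit PCR value into a dict of its named fields."""
    return {name: extract(pcr, hi, lo) for name, (hi, lo) in PCR_FIELDS.items()}


def split_pic(pic):
    """Return (PICU, PICL): the upper and lower 32-bit counters of a PIC."""
    return extract(pic, 63, 32), extract(pic, 31, 0)


if __name__ == "__main__":
    # Arbitrary test pattern, not a value read from hardware.
    sample = (2 << 18) | (5 << 11) | (3 << 4) | (1 << 2)
    print(decode_pcr(sample))
    print(split_pic(0x0000000A0000000B))
```

Keeping the layout in one table means a change to the diagram only needs a change to PCR_FIELDS.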

Sample script to monitor the events.

#!/bin/ksh

#cpustat -c pic0=cycle_counts,cycle_counts,cycle_counts,cycle_counts,cycle_counts,cycle_counts,cycle_counts,cycle_counts 5 5


# event specification syntax:
# [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

while :
do
cpustat \
-c pic0=cycle_counts,pic1=cycle_counts,pic7=cycle_counts \
-c pic0=instruction_counts,pic1=instruction_counts,pic7=op_stv_wait \
-c pic0=op_stv_wait,pic1=instruction_flow_counts,pic7=load_store_instructions \
-c pic0=load_store_instructions,pic1=iwr_empty,pic7=branch_instructions \
-c pic0=branch_instructions,pic1=op_stv_wait,pic7=floating_instructions \
-c pic0=floating_instructions,pic1=load_store_instructions,pic7=impdep2_instructions \
-c pic0=impdep2_instructions,pic1=branch_instructions,pic7=prefetch_instructions \
-c pic0=prefetch_instructions,pic1=floating_instructions,pic7=regwin_intlk \
-c pic0=flush_rs,pic1=impdep2_instructions,pic7=rs1 \
-c pic0=2iid_use,pic1=prefetch_instructions,pic7=trap_IMMU_miss \
-c pic0=trap_int_vector,pic1=rs1,pic7=jbus_odrbus2_busy \
-c pic0=ts_by_sxmiss,pic1=1iid_use \
-c pic0=active_cycle_count,pic1=trap_all,pic7=1endop \
-c pic0=op_stv_wait_sxmiss,pic1=thread_switch_all,pic7=op_stv_wait_sxmiss_ex \
-c pic1=active_cycle_count,pic7=if_wait_all \
-c pic0=swpf_fail_all,pic1=act_thread_suspend,pic7=dvp_count_dm \
-c pic0=sx_miss_wait_pf,pic1=cse_window_empty,pic7=sx_miss_count_dm_opsh \
-c pic0=jbus_cpi_count,pic1=inh_cmit_gpr_2write,pic7=jbus_odrbus2_busy \
-c pic0=jbus_reqbus1_busy,pic1=swpf_success_all,pic7=instruction_counts 1 10

sleep 5
done

exit

event1: cycle_counts instruction_counts instruction_flow_counts
iwr_empty op_stv_wait load_store_instructions
branch_instructions floating_instructions
impdep2_instructions prefetch_instructions rs1 1iid_use
trap_all thread_switch_all active_cycle_count
act_thread_suspend cse_window_empty inh_cmit_gpr_2write
swpf_success_all sx_miss_wait_dm jbus_bi_count
lost_softpf_pfp_full jbus_reqbus0_busy

event2: cycle_counts instruction_counts op_stv_wait
load_store_instructions branch_instructions
floating_instructions impdep2_instructions
prefetch_instructions 4iid_use flush_rs trap_spill
ts_by_timer active_cycle_count 0iid_use
op_stv_wait_nc_pend 0endop write_op_uTLB sx_miss_count_pf
jbus_cpd_count snres_64 jbus_reqbus3_busy

event3: cycle_counts instruction_counts op_stv_wait
load_store_instructions branch_instructions
floating_instructions impdep2_instructions
prefetch_instructions 3iid_use trap_int_level
ts_by_data_arrive active_cycle_count op_stv_wait_nc_pend
op_stv_wait_sxmiss_ex eu_comp_wait write_if_uTLB
sx_miss_count_dm jbus_cpb_count snres_256
lost_softpf_by_abort jbus_reqbus2_busy

event4: cycle_counts instruction_counts op_stv_wait
load_store_instructions branch_instructions
floating_instructions impdep2_instructions
prefetch_instructions sync_intlk trap_trap_inst ts_by_if
active_cycle_count cse_window_empty_sp_full fl_comp_wait
op_r_iu_req_mi_go sx_read_count_pf jbus_orderbus_busy
sx_miss_count_dm_if jbus_odrbus1_busy

event5: cycle_counts instruction_counts instruction_flow_counts
iwr_empty op_stv_wait load_store_instructions
branch_instructions floating_instructions
impdep2_instructions prefetch_instructions trap_fill
ts_by_intr active_cycle_count flush_rs
cse_window_empty_sp_full op_stv_wait_ex 3endop
if_r_iu_req_mi_go swpf_lbs_hit sx_read_count_dm
jbus_reqbus_busy sx_btc_count jbus_odrbus0_busy

event6: cycle_counts instruction_counts op_stv_wait
load_store_instructions branch_instructions
floating_instructions impdep2_instructions
prefetch_instructions trap_DMMU_miss ts_by_suspend
ts_by_other active_cycle_count decall_intlk
cse_window_empty_sp_full 2endop op_stv_wait_sxmiss
op_wait_all dvp_count_pf sx_miss_count_dm_opex
jbus_odrbus3_busy

event7: cycle_counts instruction_counts op_stv_wait
load_store_instructions branch_instructions
floating_instructions impdep2_instructions
prefetch_instructions regwin_intlk rs1 trap_IMMU_miss
ts_by_spinloop active_cycle_count cse_window_empty_sp_full
1endop op_stv_wait_sxmiss_ex if_wait_all dvp_count_dm
sx_miss_count_dm_opsh jbus_odrbus2_busy

attributes: nouser sys

See the "SPARC64 VI extensions" for descriptions of these events.

Tuesday, January 27, 2009

Hardware Descriptor - M4000/M5000 (OPL)

The gory hardware details of the OPL systems, M4000 (FF1), M5000 (FF2), and the
DC1-3 (Data Center) machines, are described in the hardware descriptor.

#define HWD_SBS_PER_DOMAIN              32  /* System boards per domain */
#define HWD_CPUS_PER_CORE               4   /* Strands per physical core */
#define HWD_CORES_PER_CPU_CHIP          4   /* Cores per processor chip */
#define HWD_CPU_CHIPS_PER_CMU           4   /* Processor chips per CMU */
#define HWD_SCS_PER_CMU                 4   /* System controllers per CMU */
#define HWD_DIMMS_PER_CMU               32  /* Memory DIMMs per CMU */
#define HWD_IOCS_PER_IOU                2   /* Oberon chips per I/O unit */
#define HWD_PCI_CHANNELS_PER_IOC        2   /* PCI channels per Oberon chip */
#define HWD_LEAVES_PER_PCI_CHANNEL      2   /* Leaves per PCI channel */
#define HWD_PCI_CHANNELS_PER_SB         4   /* PCI channels per system board */
#define HWD_CMU_CHANNEL                 4   /* CMU channel number */
#define HWD_IO_BOATS_PER_IOU            6   /* I/O boats per I/O unit */
#define HWD_BANKS_PER_CMU               8   /* Memory banks per CMU */
#define HWD_MAX_MEM_CHUNKS              8   /* Chunks per board */
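
These constants multiply out to the per-CMU and per-IOU limits the descriptor can describe. A small Python sketch of that arithmetic follows. The values are copied from the defines above; the function name derived_limits is made up, and the DIMMs-per-bank figure assumes DIMMs are spread evenly across banks, which the header excerpt does not state.

```python
# Constants copied from the opl_hwdesc.h excerpt above.
HWD_CPUS_PER_CORE = 4            # strands per physical core
HWD_CORES_PER_CPU_CHIP = 4       # cores per processor chip
HWD_CPU_CHIPS_PER_CMU = 4        # processor chips per CMU
HWD_DIMMS_PER_CMU = 32           # memory DIMMs per CMU
HWD_BANKS_PER_CMU = 8            # memory banks per CMU
HWD_IOCS_PER_IOU = 2             # Oberon chips per I/O unit
HWD_PCI_CHANNELS_PER_IOC = 2     # PCI channels per Oberon chip
HWD_LEAVES_PER_PCI_CHANNEL = 2   # leaves per PCI channel


def derived_limits():
    """Upper bounds implied by the constants (not what a given box has installed)."""
    strands_per_chip = HWD_CPUS_PER_CORE * HWD_CORES_PER_CPU_CHIP
    pci_channels_per_iou = HWD_IOCS_PER_IOU * HWD_PCI_CHANNELS_PER_IOC
    return {
        "strands_per_chip": strands_per_chip,
        "strands_per_cmu": strands_per_chip * HWD_CPU_CHIPS_PER_CMU,
        # Assumption: DIMMs are divided evenly among the banks.
        "dimms_per_bank": HWD_DIMMS_PER_CMU // HWD_BANKS_PER_CMU,
        "pci_channels_per_iou": pci_channels_per_iou,
        "pci_leaves_per_iou": pci_channels_per_iou * HWD_LEAVES_PER_PCI_CHANNEL,
    }


if __name__ == "__main__":
    for name, value in derived_limits().items():
        print(f"{name} = {value}")
```

These are sizing limits for the descriptor's arrays; an actual system reports fewer when chips, DIMMs, or I/O units are not populated.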

See opl_hwdesc.h for the full header.

Is there a way to get at this information? Yes, of course, via mdb.
Dump the structures with mdb's ::print dcmd:

hwd_header_t
hwd_domain_info_t
hwd_cpu_t
hwd_core_t
hwd_cpu_chip_t
hwd_sc_t
hwd_bank_t
hwd_dimm_t
hwd_memory_t

Then come a whole lot of other structures; you dump the memory contents and
piece together the entire hardware descriptor.

Wouldn't it be easier if there were a command that did the job for you?
Here you go: an mdb dcmd (read as dee-command), oplhwd.


The dcmd I wrote is in opl_hwd.c.

> ::help oplhwd

NAME
oplhwd - print hardware descriptor information on OPL

SYNOPSIS
[ addr ] ::oplhwd [ -b NUM ] [ -sdihomkrcp ] [ -a ] [ -v NUM ]

DESCRIPTION

-b NUM list oplhwd entry for a board
-s list oplhwd entry with SB Status
-d list oplhwd entry with Domain Info
-i list oplhwd entry with SB Info
-h list oplhwd entry with Chips details
-o list oplhwd entry with Core details
-m list oplhwd entry with Memory Information
-k list oplhwd entry with Memory Bank Information
-r list oplhwd entry with SC Information
-c list oplhwd entry with CMU channels
-p list oplhwd entry with PCI channels
-a list oplhwd entry with all the above information
-v NUM list oplhwd entry in verbose mode

Check out the entire output at oplhwd-log.


Welcome

Thanks for landing here. Welcome!
I work at Sun Microsystems, in the Systems group, for
NPE (New Product Engineering).

I'm a development engineer and technical lead on
many enterprise servers, some of which include:

Mid-range: SunFire E4900, E6800,
Sun SPARC Enterprise M4000/M5000 (OPL),
High-end: SunFire E20K, E25K
and SunFire Classic's E4500.

Here in my blog you will find interesting information.
So, are you ready?