Saturday, April 25, 2009

Solaris Kernel Analysis with MDB

Solaris kernel analysis with MDB dcmds ("dee-commands"). I have captured the output
of various dcmds; the resulting text file is large.

Thursday, April 16, 2009

Extended Dispatch Tables - Performance Improvement

Time Sharing Scheduling Class (TS) - ts_dptbl (dispatcher parameter table)

Intro:

The dispatcher controls allocation of the CPU resource to processes, and the scheduler supports multiple scheduling classes. Each class defines its own scheduling policies and the priority queues on which ready-to-run processes are linked. The scheduling classes configured on a Solaris system can be listed with:

dispadmin -l

CONFIGURED CLASSES
==================

SYS (System Class)
TS (Time Sharing)
FX (Fixed Priority)
IA (Interactive)

Processes in the time-sharing class which are running in user mode or in kernel mode before going to sleep are scheduled according to the parameters in a time-sharing dispatcher parameter table (ts_dptbl).

Except on Starcat systems (E25K and E15K), the standard dispatch table is used.

dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000
# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
200 0 50 0 50 # 0
200 0 50 0 50 # 1
200 0 50 0 50 # 2
200 0 50 0 50 # 3
200 0 50 0 50 # 4
200 0 50 0 50 # 5
200 0 50 0 50 # 6
200 0 50 0 50 # 7
200 0 50 0 50 # 8
200 0 50 0 50 # 9
160 0 51 0 51 # 10
160 1 51 0 51 # 11
160 2 51 0 51 # 12
160 3 51 0 51 # 13
160 4 51 0 51 # 14
160 5 51 0 51 # 15
160 6 51 0 51 # 16
160 7 51 0 51 # 17
160 8 51 0 51 # 18
160 9 51 0 51 # 19
120 10 52 0 52 # 20

[... Only a portion of table shown ...]
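
The RES line gives the resolution of the ts_quantum column: quanta are expressed in units of 1/RES seconds, so with RES=1000 the values above are milliseconds. A minimal C sketch of that conversion follows. The helper name quantum_to_usec is hypothetical and is not part of Solaris.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a Solaris interface): convert a ts_quantum
 * value, given in units of 1/res seconds, into microseconds.
 * Returns -1 if the resolution is not positive.
 */
static int64_t
quantum_to_usec(int quantum, int res)
{
	if (res <= 0)
		return (-1);
	return ((int64_t)quantum * 1000000 / res);
}
```

For example, quantum_to_usec(200, 1000) gives 200000, i.e. a 200 ms quantum for priority 0 in the standard table.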

Starcat systems and the OPL-DC2 and OPL-DC3 systems use the extended dispatch table, i.e. you will find
ts_dispatch_extended = 1 set on them.

# dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
400 0 1 2 40 # 0
380 0 2 2 40 # 1
380 1 3 2 40 # 2
380 1 4 2 40 # 3
380 2 5 2 40 # 4
380 2 6 2 40 # 5
380 3 7 2 40 # 6
380 3 8 2 40 # 7
380 4 9 2 40 # 8
380 4 10 2 40 # 9
380 5 11 2 40 # 10
380 5 12 2 40 # 11
380 6 13 2 40 # 12
380 6 14 2 40 # 13
380 7 15 2 40 # 14
380 7 16 2 40 # 15
380 8 17 2 40 # 16
380 8 18 2 40 # 17
380 9 19 2 40 # 18
380 9 20 2 40 # 19
360 10 21 2 40 # 20
360 11 22 2 40 # 21
360 12 23 2 40 # 22



/*
 * time-sharing dispatcher parameter table entry
 */
typedef struct tsdpent {
	pri_t	ts_globpri;	/* global (class independent) priority */
	int	ts_quantum;	/* time quantum given to procs at this level */
	pri_t	ts_tqexp;	/* ts_umdpri assigned when proc at this level */
				/* exceeds its time quantum */
	pri_t	ts_slpret;	/* ts_umdpri assigned when proc at this level */
				/* returns to user mode after sleeping */
	short	ts_maxwait;	/* bumped to ts_lwait if more than ts_maxwait */
				/* secs elapse before receiving full quantum */
	short	ts_lwait;	/* ts_umdpri assigned if ts_dispwait exceeds */
				/* ts_maxwait */
} tsdpent_t;


As can be seen from the two dispatch tables above, the main differences are:
- ts_quantum, the time quantum allocated to a process,
- ts_slpret, the priority at which the process is placed when it returns to user mode after sleeping.

ts_slpret is there to help boost IO bound and interactive threads.

For example, let's say your thread is at priority 13 and you've burned through your time slice. Your next priority is 6 with ts_dispatch_extended enabled and 3 otherwise. At priority 6 your next quantum will still be 380 msecs. The ts_maxwait column is an anti-starvation mechanism: if you've languished on the ready queue for longer than ts_maxwait seconds, you get a boost to ts_lwait.
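
The transitions described here can be sketched in C as plain table lookups. This is only an illustration under stated assumptions: ts_row_t, ts_next_pri, ts_is_starved and the event names are hypothetical, the rows are copied from priorities 0 to 13 of the two tables above, and the sketch ignores any other adjustments the real scheduler applies.

```c
/* Trimmed stand-in for tsdpent_t: only the columns dispadmin prints. */
typedef struct ts_row {
	int	quantum;	/* ts_quantum */
	int	tqexp;		/* ts_tqexp: priority after quantum expiry */
	int	slpret;		/* ts_slpret: priority on return from sleep */
	int	maxwait;	/* ts_maxwait: seconds before the boost */
	int	lwait;		/* ts_lwait: priority given by the boost */
} ts_row_t;

#define	TS_NROWS	14	/* priorities 0..13 only */

/* Standard table, rows 0..13 as listed above. */
static const ts_row_t ts_standard[TS_NROWS] = {
	{200, 0, 50, 0, 50}, {200, 0, 50, 0, 50}, {200, 0, 50, 0, 50},
	{200, 0, 50, 0, 50}, {200, 0, 50, 0, 50}, {200, 0, 50, 0, 50},
	{200, 0, 50, 0, 50}, {200, 0, 50, 0, 50}, {200, 0, 50, 0, 50},
	{200, 0, 50, 0, 50}, {160, 0, 51, 0, 51}, {160, 1, 51, 0, 51},
	{160, 2, 51, 0, 51}, {160, 3, 51, 0, 51}
};

/* Extended table, rows 0..13 as listed above. */
static const ts_row_t ts_extended[TS_NROWS] = {
	{400, 0, 1, 2, 40}, {380, 0, 2, 2, 40}, {380, 1, 3, 2, 40},
	{380, 1, 4, 2, 40}, {380, 2, 5, 2, 40}, {380, 2, 6, 2, 40},
	{380, 3, 7, 2, 40}, {380, 3, 8, 2, 40}, {380, 4, 9, 2, 40},
	{380, 4, 10, 2, 40}, {380, 5, 11, 2, 40}, {380, 5, 12, 2, 40},
	{380, 6, 13, 2, 40}, {380, 6, 14, 2, 40}
};

/* Hypothetical event names for the three table-driven transitions. */
enum ts_event { TS_QUANTUM_EXPIRED, TS_SLEEP_RETURN, TS_STARVED };

/* Look up the next priority for a thread at `pri`; -1 if out of range. */
static int
ts_next_pri(const ts_row_t *tbl, int pri, enum ts_event ev)
{
	if (pri < 0 || pri >= TS_NROWS)
		return (-1);
	switch (ev) {
	case TS_QUANTUM_EXPIRED:
		return (tbl[pri].tqexp);
	case TS_SLEEP_RETURN:
		return (tbl[pri].slpret);
	case TS_STARVED:
		return (tbl[pri].lwait);
	}
	return (-1);
}

/* Nonzero once the wait (in seconds) exceeds the row's ts_maxwait. */
static int
ts_is_starved(const ts_row_t *row, int dispwait)
{
	return (dispwait > row->maxwait);
}
```

With these rows, ts_next_pri(ts_extended, 13, TS_QUANTUM_EXPIRED) gives 6 and the same call on ts_standard gives 3. The sleep-return lookup shows the other main difference: from priority 5 the standard table jumps to 50, while the extended table moves up only to 6.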

Experiments:

The following benchmarks were run on OPL-FF2 to gather the performance data needed to decide on the optimal default Time Sharing dispatch table:

- SPECjbb2005 benchmark, a SPEC.org Java workload.
- TPC-E benchmark, the latest of the TPC.org OLTP workloads.

SPECjbb2005 experiment

OPL-FF2 System Configuration

1) Clock rate at 2.15 GHz,
2) L2 cache latency of 30 Pclocks
3) Memory access latency during SPECjbb2005 runs is 327.4ns

OPL-DC1 System Configuration

1) Clock rate at 2.28 GHz,
2) L2 cache latency of 24 Pclocks
3) Memory access latency during SPECjbb2005 runs is 375ns

Benchmark and JVM version

  • Benchmark SPECjbb2005 1.07
  • Java HotSpot(TM) Server VM (build 1.6.0-rc-b99, mixed mode)

Performance data



Platform   ts_dispatch_extended = 0           ts_dispatch_extended = 1
           (standard or Serengeti like)       (current)
OPL-FF2    bops = 171250, bops/JVM = 21406    bops = 170490, bops/JVM = 21311
OPL-DC1    bops = 308676, bops/JVM = 19292    bops = 308046, bops/JVM = 19253
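
To put the two settings side by side, the relative change between the columns can be computed directly. The sketch below is a hypothetical helper, not part of any benchmark kit; it assumes the first bops figure for each platform belongs to ts_dispatch_extended = 0 and the second to ts_dispatch_extended = 1.

```c
/*
 * Hypothetical helper: relative change of `other` versus `base`,
 * in percent. `base` must be nonzero.
 */
static double
pct_change(double base, double other)
{
	return ((other - base) / base * 100.0);
}
```

Applied to the bops figures above, pct_change(171250, 170490) and pct_change(308676, 308046) both come out as a drop of less than half a percent.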


OPL-DC3 experiment

OPL-DC3 System Configuration
1) 32 CPU Chip, 64 Cores.
2) Clock rate 2.27 GHz.
3) 512 GB Memory.
4) System clock frequency 792 MHz.
5) Memory latency measured during SPECjbb2005 runs at 553 ns.

Benchmark and JVM version
  • Benchmark SPECjbb2005 1.07
  • Java HotSpot(TM) Server VM (build 1.6.0-rc-b87, mixed mode)

Performance Results

      
CPU    # Cores/    Throughput                          Throughput
Chip   Strands     jdk1.6.0                            jdk1.6.0, processor-set and
                                                       lgrp_mem_pset_aware=1
32     64C/128S                                        bops=535630, bops/JVM=16738 (32)
32     64C/128S    bops=283317, bops/JVM=35415 (8)     bops=510639, bops/JVM=63830 (8)
28     56C/112S    bops=266043, bops/JVM=38006 (7)     bops=447450, bops/JVM=63921 (7)
24     48C/96S     bops=244480, bops/JVM=40747 (6)     bops=383632, bops/JVM=63939 (6)
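
The figure in parentheses appears to be the number of JVM instances, since dividing bops by it reproduces the bops/JVM value. A small C sketch of that check, and of the gain from processor sets with lgrp_mem_pset_aware=1, follows. Both helper names are hypothetical.

```c
/* Hypothetical helper: bops per JVM, rounded to the nearest integer. */
static long
bops_per_jvm(long bops, long njvm)
{
	if (njvm <= 0)
		return (-1);
	return ((bops + njvm / 2) / njvm);
}

/* Hypothetical helper: ratio of tuned to baseline throughput. */
static double
throughput_ratio(long baseline, long tuned)
{
	return ((double)tuned / (double)baseline);
}
```

For the 32-chip row with 8 JVMs, bops_per_jvm(283317, 8) gives 35415, and throughput_ratio(283317, 510639) is roughly 1.8.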


SPECjbb2005 experiment results:

The OPL-FF2 and OPL-DC1 systems look quite immune to the dispatch table setting. Memory latency at 327 and 375 nsecs does not cause any perceivable contention.

SPECjbb2005 tuned run is about 1.12 times the un-tuned run for OPL-FF2 and 1.23 for OPL-DC1.

The OPL-DC2/DC3 systems, with an average memory latency of 550 nsecs during SPECjbb2005 runs, only hit 604511 bops even with the latest JVM performance improvements and tuning.

It is recommended to use the extended table for DC2 and DC3. A higher ts_quantum would compensate for the high memory latency and the cycles wasted on cache misses in high-CPI workloads.

TPC-E experiment


OPL-FF2 System Configuration

  • FF2 system with 2 System Boards and 128 GB available memory
  • 4 x 2 GHz SPARC64-VI sockets used for the experiment; the other CPUs were taken offline with psradm -f
  • 32 GB RAM used
  • VxVM used for DB datafiles
  • Oracle version
    • 10gR2 FCS ( 10.2.0.1)
    • 26 GB SGA , 4K block size
  • TPC-E scale and versions
    • 78000 customer database
    • V3.13 EGenLoader and kit
    • V3.14 plsql and tpce-3.14-20060508 driver
    • 0.7 08_02 Faban driver harness

Performance results

tpce.3g TS (serengeti dptbl ) run2 , 80K DB 156.12
tpce.3e TS (FF2 dptbl ) run2 , 80K DB 155.64


What we know so far is that ts_dispatch_extended should be used for OPL-DC2/DC3.