Solaris Kernel Analysis with MDB dee-commands (dcmds). I have captured the output
for various dcmds; the resulting text file is big.
Saturday, April 25, 2009
Thursday, April 16, 2009
Extended Dispatch Tables - Performance Improvement
Time Sharing Scheduling Class (TS) - ts_dptbl (dispatcher parameter table)
Intro: The dispatcher controls allocation of the CPU resource to processes, and the scheduler supports multiple scheduling classes. Each class defines its own scheduling policies and the priority queues on which ready-to-run processes are linked. The scheduling classes supported by Solaris:
dispadmin -l
CONFIGURED CLASSES
==================
SYS (System Class)
TS (Time Sharing)
FX (Fixed Priority)
IA (Interactive)
Processes in the time-sharing class which are running in user mode or in kernel mode before going to sleep are scheduled according to the parameters in a time-sharing dispatcher parameter table (ts_dptbl).
All systems except Starcat (E25K and E15K) use the standard dispatch table:
dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000
# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
200 0 50 0 50 # 0
200 0 50 0 50 # 1
200 0 50 0 50 # 2
200 0 50 0 50 # 3
200 0 50 0 50 # 4
200 0 50 0 50 # 5
200 0 50 0 50 # 6
200 0 50 0 50 # 7
200 0 50 0 50 # 8
200 0 50 0 50 # 9
160 0 51 0 51 # 10
160 1 51 0 51 # 11
160 2 51 0 51 # 12
160 3 51 0 51 # 13
160 4 51 0 51 # 14
160 5 51 0 51 # 15
160 6 51 0 51 # 16
160 7 51 0 51 # 17
160 8 51 0 51 # 18
160 9 51 0 51 # 19
120 10 52 0 52 # 20
[... Only a portion of table shown ...]
Starcat and OPL (DC2 and DC3) systems use the extended dispatch table, i.e. you will find
ts_dispatch_extended = 1 set.
dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000
# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
400 0 1 2 40 # 0
380 0 2 2 40 # 1
380 1 3 2 40 # 2
380 1 4 2 40 # 3
380 2 5 2 40 # 4
380 2 6 2 40 # 5
380 3 7 2 40 # 6
380 3 8 2 40 # 7
380 4 9 2 40 # 8
380 4 10 2 40 # 9
380 5 11 2 40 # 10
380 5 12 2 40 # 11
380 6 13 2 40 # 12
380 6 14 2 40 # 13
380 7 15 2 40 # 14
380 7 16 2 40 # 15
380 8 17 2 40 # 16
380 8 18 2 40 # 17
380 9 19 2 40 # 18
380 9 20 2 40 # 19
360 10 21 2 40 # 20
360 11 22 2 40 # 21
360 12 23 2 40 # 22
/*
 * time-sharing dispatcher parameter table entry
 */
typedef struct tsdpent {
    pri_t ts_globpri;   /* global (class independent) priority */
    int   ts_quantum;   /* time quantum given to procs at this level */
    pri_t ts_tqexp;     /* ts_umdpri assigned when proc at this level */
                        /* exceeds its time quantum */
    pri_t ts_slpret;    /* ts_umdpri assigned when proc at this level */
                        /* returns to user mode after sleeping */
    short ts_maxwait;   /* bumped to ts_lwait if more than ts_maxwait */
                        /* secs elapse before receiving full quantum */
    short ts_lwait;     /* ts_umdpri assigned if ts_dispwait exceeds */
                        /* ts_maxwait */
} tsdpent_t;
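To compare the two tables programmatically, the dispadmin output can be read into records whose fields mirror tsdpent_t. The following is only a sketch: the TsRow class and parse_ts_table function are hypothetical names, not part of any Solaris tool, and ts_globpri is left out because dispadmin does not print it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsRow:
    """One row of `dispadmin -c TS -g` output (field names follow tsdpent_t)."""
    level: int        # priority level taken from the trailing "# N" comment
    ts_quantum: int   # time quantum, in units of 1/RES seconds
    ts_tqexp: int
    ts_slpret: int
    ts_maxwait: int
    ts_lwait: int


def parse_ts_table(text):
    """Return (res, rows) parsed from dispadmin-style text."""
    res = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("RES="):
            res = int(line[4:])
            continue
        if not line or line.startswith("#"):
            continue  # header comments and blank lines
        values, _, level = line.partition("#")
        parts = values.split()
        # A table row has exactly five numeric columns and a numeric level.
        if len(parts) != 5 or not all(p.isdigit() for p in parts):
            continue
        if not level.strip().isdigit():
            continue
        rows.append(TsRow(int(level), *(int(p) for p in parts)))
    return res, rows


# First rows of the extended table shown above.
SAMPLE = """\
# Time Sharing Dispatcher Configuration
RES=1000
# ts_quantum ts_tqexp ts_slpret ts_maxwait ts_lwait PRIORITY LEVEL
400 0 1 2 40 # 0
380 0 2 2 40 # 1
380 1 3 2 40 # 2
"""

res, rows = parse_ts_table(SAMPLE)
for row in rows:
    # With RES=1000 the quantum is already in milliseconds.
    print(row.level, row.ts_quantum * 1000 // res, "ms")
```

Lines that are not table rows, such as the "[... Only a portion of table shown ...]" marker, are skipped rather than treated as errors.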
As can be seen from the two dispatch tables above, the main differences are:
- ts_quantum, the time quantum allocated to a process,
- ts_slpret, the priority at which the process is placed when it returns to user mode after sleeping.
For example, let's say your thread is at priority 13 and you've burned through your time slice. Your next priority (ts_tqexp) is 6 with ts_dispatch_extended enabled and 3 otherwise. At priority 6 your next quantum will still be 380 msecs. The ts_maxwait column is an anti-starvation mechanism: if you've languished on the ready queue for more than ts_maxwait seconds, you get a boost to ts_lwait.
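The priority-13 example can be replayed as a small table lookup. This is a simplified model for illustration, not the kernel's code: the function names are made up here, and only the rows the example touches are copied from the two tables above.

```python
# Column positions in each row tuple below.
QUANTUM, TQEXP, SLPRET, MAXWAIT, LWAIT = range(5)

# Rows copied from the standard table (levels 3 and 13).
STANDARD = {
    3: (200, 0, 50, 0, 50),
    13: (160, 3, 51, 0, 51),
}

# Rows copied from the extended table (levels 6 and 13).
EXTENDED = {
    6: (380, 3, 7, 2, 40),
    13: (380, 6, 14, 2, 40),
}


def on_quantum_expiry(table, pri):
    """Priority assigned after the thread uses up its whole quantum."""
    return table[pri][TQEXP]


def on_sleep_return(table, pri):
    """Priority assigned when the thread returns to user mode after sleeping."""
    return table[pri][SLPRET]


def on_long_wait(table, pri, waited_secs):
    """Anti-starvation boost: ts_lwait once the wait exceeds ts_maxwait seconds."""
    row = table[pri]
    return row[LWAIT] if waited_secs > row[MAXWAIT] else pri


for name, table in (("standard", STANDARD), ("extended", EXTENDED)):
    nxt = on_quantum_expiry(table, 13)
    print(name, "13 ->", nxt, "quantum", table[nxt][QUANTUM], "ms")
```

With these rows the loop prints 13 -> 3 with a 200 ms quantum for the standard table and 13 -> 6 with a 380 ms quantum for the extended one, matching the example in the text.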
Experiments:
The following benchmarks were run on OPL-FF2 to gather the performance data needed to help decide the optimal setting for the default Time Sharing dispatch table:
- SPECjbb2005 benchmark, a SPEC.org Java workload.
- TPC-E benchmark, the latest of the TPC.org OLTP workloads.
SPECjbb2005 experiment
OPL-FF2 System Configuration
1) Clock rate at 2.15 GHz
2) L2 cache latency of 30 Pclocks
3) Memory access latency during SPECjbb2005 runs is 327.4 ns
OPL-DC1 System Configuration
1) Clock rate at 2.28 GHz
2) L2 cache latency of 24 Pclocks
3) Memory access latency during SPECjbb2005 runs is 375 ns
Benchmark and JVM version
- Benchmark SPECjbb2005 1.07
- Java HotSpot(TM) Server VM (build 1.6.0-rc-b99, mixed mode)
Performance data
| Platform | ts_dispatch_extended = 0 (standard or Serengeti like) | ts_dispatch_extended = 1 (current) |
| OPL-FF2 | bops = 171250 bops/JVM = 21406 | bops = 170490 bops/JVM = 21311 |
| OPL-DC1 | bops = 308676 bops/JVM = 19292 | bops = 308046 bops/JVM = 19253 |
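To put the two columns in perspective, the relative change from the standard to the extended table can be computed directly from the bops figures above. A minimal sketch; the percent_change helper is just an illustrative name.

```python
# (bops with ts_dispatch_extended = 0, bops with ts_dispatch_extended = 1)
RESULTS = {
    "OPL-FF2": (171250, 170490),
    "OPL-DC1": (308676, 308046),
}


def percent_change(standard, extended):
    """Change in throughput, as a percentage of the standard-table result."""
    return (extended - standard) / standard * 100.0


for platform, (standard, extended) in RESULTS.items():
    print(f"{platform}: {percent_change(standard, extended):+.2f}%")
```

On both platforms the difference works out to less than half a percent.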
OPL-DC3 experiment
OPL-DC3 System Configuration
1) 32 CPU chips, 64 cores.
2) Clock rate 2.27 GHz.
3) 512 GB memory.
4) System clock frequency 792 MHz.
5) Memory latency measured during SPECjbb2005 runs at 553 ns.
Benchmark and JVM version
- Benchmark SPECjbb2005 1.07
- Java HotSpot(TM) Server VM (build 1.6.0-rc-b87, mixed mode)
Performance Results
| CPU Chip | # Cores/Strands | Throughput jdk1.6.0 | Throughput jdk1.6.0 processor-set and lgrp_mem_pset_aware=1 |
| 32 | 64C/128S | bops=535630, bops/JVM=16738 (32) | |
| 32 | 64C/128S | bops=283317, bops/JVM=35415 (8) | bops=510639, bops/JVM=63830 (8) |
| 28 | 56C/112S | bops=266043, bops/JVM=38006 (7) | bops=447450, bops/JVM=63921 (7) |
| 24 | 48C/96S | bops=244480, bops/JVM=40747 (6) | bops=383632, bops/JVM=63939 (6) |
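The gain from processor sets with lgrp_mem_pset_aware=1 can be read off the table as a ratio. The sketch below assumes the number in parentheses is the JVM count, which is consistent with bops/JVM being bops divided by that number.

```python
# (CPU chips, JVMs, bops without processor sets, bops with processor sets)
RUNS = [
    (32, 8, 283317, 510639),
    (28, 7, 266043, 447450),
    (24, 6, 244480, 383632),
]


def speedup(baseline, tuned):
    """Throughput ratio of the processor-set run over the plain run."""
    return tuned / baseline


for chips, jvms, base, pset in RUNS:
    print(f"{chips} chips, {jvms} JVMs: x{speedup(base, pset):.2f}, "
          f"{round(pset / jvms)} bops/JVM with processor sets")
```

Per-JVM throughput with processor sets stays close to 63,900 bops in all three rows, while the plain runs lose per-JVM throughput as chips are added.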
SPECjbb2005 experiment results:
The OPL-FF2 and OPL-DC1 systems look quite immune to the dispatch table setting. Memory latency at 327 and 375 nsecs does not cause any perceivable contention. The SPECjbb2005 tuned run is about 1.12 times the un-tuned run for OPL-FF2 and 1.23 times for OPL-DC1.
The OPL-DC2/DC3 systems, with an average memory latency of 550 nsecs during SPECjbb2005 runs, only hit 604511 bops even with the latest JVM performance improvements and tuning.
It is recommended to use the extended table for DC2 and DC3. A higher ts_quantum would compensate for high memory latency and for cycles wasted on cache misses in high-CPI workloads.
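One rough way to see why latency weighs more heavily on DC3 is to express each memory access in CPU clock cycles, using the clock rates and latencies from the system configurations above. This is back-of-envelope arithmetic rather than a measurement from the runs, and the function name is illustrative.

```python
# (CPU clock rate in GHz, memory latency during SPECjbb2005 in ns)
SYSTEMS = {
    "OPL-FF2": (2.15, 327.4),
    "OPL-DC1": (2.28, 375.0),
    "OPL-DC3": (2.27, 553.0),
}


def latency_in_cycles(clock_ghz, latency_ns):
    """GHz is cycles per nanosecond, so the product is a cycle count."""
    return clock_ghz * latency_ns


for name, (clock_ghz, latency_ns) in SYSTEMS.items():
    cycles = latency_in_cycles(clock_ghz, latency_ns)
    print(f"{name}: about {cycles:.0f} cycles per memory access")
```

By this estimate a memory access costs roughly 700 cycles on FF2, 855 on DC1 and 1255 on DC3.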
TPC-E experiment
OPL-FF2 System Configuration
- FF2 system with 2 System Boards and 128 GB available memory
- 4 x 2 GHz SPARC64-VI sockets used for the experiment; other CPUs taken offline with psradm -f
- 32 GB RAM used
- VxVM used for DB datafiles
- Oracle version
  - 10gR2 FCS (10.2.0.1)
  - 26 GB SGA, 4K block size
- TPC-E scale and versions
  - 78000 customer database
  - V3.13 EGenLoader and kit
  - V3.14 plsql and tpce-3.14-20060508 driver
  - 0.7 08_02 Faban driver harness
Performance results
| tpce.3g | TS (Serengeti dptbl), run2, 80K DB | 156.12 |
| tpce.3e | TS (FF2 dptbl), run2, 80K DB | 155.64 |
What we know so far is that ts_dispatch_extended should be used for OPL-DC2/DC3.