Short stacks are a rich source of information which can help you to do analyzing and deconstructing a process which is part of a problem in Oracle databases, and the good part is you don’t have to rely on any OS utilities like gdb, pstack etc. to extract stacks or traces. Its pretty simple and quick to generate the short stacks, you have to pass on the OSPID of the process, and issue SHORT_STACK option for ORADEBUG, and will generate a sequence of all functions involved and their current position in a calls stack. You can also use errorstack for the same, but short_stacks are fast and point-to-point!
Now one can ask – What is a ‘call stack’ ? A call stack is a list of names of methods called at run time from the beginning of a program until the execution of the current statement. A call stack is mainly intended to keep track of the point to which each active subroutine should return control when it finishes executing. Call stack acts as a tool to debug an application when the method to be traced can be called in more than one context. This forms a better alternative than adding tracing code to all methods that call the given method.
Let me run it on a real time system where we are facing issues where due to adaptive switching between Log write methods LGWR was causing excessive ‘log file sync‘ Waits
[oracle@oracleontario ~]$ ps -ef|grep lg
oracle 4402 1 0 00:25 ? 00:00:00 ora_lgwr_dixitdb
oracle 4719 3201 0 00:25 pts/1 00:00:00 grep --color=auto lg
-- Here in the example I am generating short stacks for LGWR process
SQL>
SQL> oradebug setospid 4402
Oracle pid: 19, Unix process pid: 4402, image: oracle@oracleontario.ontadomain (LGWR)
SQL> oradebug short_stack
ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-semtimedop()+10<-skgpwwait()+187
<-ksliwat()+2233<-kslwaitctx()+200<-ksarcv()+320<-ksbabs()+670<-ksbrdp()+1167<-opirip()+541<-opidrv()+581<-sou2o()+165
<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
SQL>
SQL>
-- After a gap of ~ 10-15 Mins
-- Let me generate the short stack once again just to see if there is any difference.
SQL> oradebug short_stack
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+992 <-sigacthandler()+104
<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212 <-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680
<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380
So, as you’ve seen in above code, the clear change is there in the LGWR short stacks, we did few log switches, modified a big table in the database followed by multiple commits. If you carefully look at the stack, it was function ksedsts() where it was at the time we generated the stack, and function ksdxfstk () called it, followed by ksdxcb(), sspuser() and immediate change after functions called by LGWR internal call stacks. They matched one of the known BUG which gave us a hint that we resolved after modified a undocumented parameter.
So, it can help us locating the buggy functions called by any user session or background processes and is very useful in case when you are battling with a strange looking problem in your Oracle database.
Few of the functions that I am able to recall are given below. semtimedop() is one of the function for semaphore operations also known as semop and is used to perform atomically an array of semaphore operations on the set of semaphores associated with the semaphore identifier specified by semid. ‘ksdxcb’ is a function that’s called usually when the command oradebug is being used. sighandler is the programming signal handler and is used to locate an exception. When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system function, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another. The sigaltstack()function allows a thread to define and examine the state of an alternate stack area on which signals are processed.
opidrv() is ORACLE Program Interface DRiVer (IGNORE) opiodr is ORACLE code request driver – route the current request main() is the standard executable entry point ksedst()is the KSE dump the call stack skgmstack() is the call specified function with extra STACK space rpidrv()or the RPI is theRecursive Program Interface DRiVer pfrrun() or PSDEVN is the PL/SQL Interpreter Main Instruction Loop kkxexe()or KKX is to execute plsql
Recently we were facing one performance issue with one of the SQL which references a highly volatile table in the entire database, and one of the largest too. The tables goes through massive DMLs each minute and that has lead to huge randomness in the data distribution, and expectedly few of its critical indexes has a very high (bad) clustering factor (CF).
A quick explanation on the clustering factor, it is a measure of the ordered-ness of an index in comparison to the table that it is based on. That would be a simple explanation on Oracle’s clustering factor (CF), but if you want to read more about it I would recommend to read blog posts by Richard Foote, Jonathan Lewis and few other great articles by Randolf Geist.
Okay, coming back to the case – We tried few methods (fixed BLevel, Leaf_Blocks, degree, stats recollection etc.), but none of them helped much to stabilize performance of the query, and finally we tried something that was introduced in Oracle 12.1 the ‘Attribute Clustering‘ that helped us to resolve the problem. The attribute clustering improves physical IOs for tables and its partitions. An attribute-clustered table stores data in close proximity on disk in an ordered way based on the values of a certain set of columns in the table or a set of columns in the other tables.
Attribute clustering is a user-defined table directive that provides data clustering on one or more columns in a table. The directives can be specified when the table is created or modified. There are two types of attribute clustering:
With Linear Ordering : Linear ordering stores the data according to the order of specified columns. This is the default type of clustering.
With Interleaved Ordering : It accurately determines exactly where data is located on the disk. This enabled I/O Pruning. This uses a special multidimensional clustering technique based on Z-order curve fitting.
Note: Zone mapping is a separately licensed feature.
Let’s do some test to understand the feature better. Here in the demo I would be using linear order clustering without any zonemaps. You can try try pairing the LO clustering with zone maps too. All tests I am doing it on a 19.15.0.0.0 sandbox environment.
For testing purpose, I have created a new tables with some 2000000 random values/rows inserted into it, this is to mimic a bad clustering factor (CF) scenario. I have also created an Index on the top of a column (date) and have collected the statistics.
[oracle@oracleontario ~]$ !sql
sqlplus / as sysdba
SQL*Plus: Release 19.0.0.0.0 - Production on Wed Apr 27 00:13:08 2022
Version 19.15.0.0.0
Copyright (c) 1982, 2022, Oracle. All rights reserved.
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.15.0.0.0
SQL>
SQL> create table new_test (id number, DOB date, text varchar2(40));
Table created.
SQL> insert into new_test select rownum, sysdate-trunc(dbms_random.value(0, 20000)), 'PRASHANT DIXIT'
from dual connect by level <= 2000000; 2
2000000 rows created.
SQL>
SQL> commit;
Commit complete.
SQL> create index idx_newtest on new_test(dob);
Index created.
SQL> EXEC DBMS_STATS.gather_table_stats('DIXDROID', 'NEW_TEST', estimate_percent => null, method_opt=>'FOR ALL COLUMNS SIZE 1');
PL/SQL procedure successfully completed.
-- Here you can see the extremely bad CLUSTERING FACTOR of the Index.
SQL> SELECT t.table_name, i.index_name, t.blocks, t.num_rows, i.clustering_factor FROM user_tables t, user_indexes i
WHERE t.table_name = i.table_name AND i.index_name='IDX_NEWTEST'; 2
TABLE_NAME INDEX_NAME BLOCKS NUM_ROWS CLUSTERING_FACTOR
---------- -------------------- ---------- ---------- -----------------
NEW_TEST IDX_NEWTEST 10097 2000000 1989246
SQL>
With such a bad clustering factor of the Index, let me try to run a query on the same column and pass a date range and see the execution plan.
SQL> explain plan for select * from new_test where dob between '01-JUN-2017' and '30-JUN-2017';
Explained.
SQL>
SQL> SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------
Plan hash value: 1847055510
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3100 | 86800 | 2750 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| NEW_TEST | 3100 | 86800 | 2750 (1)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("DOB">=TO_DATE(' 2017-06-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "DOB"<=TO_DATE(' 2017-06-30 00:00:00', 'syyyy-mm-dd
hh24:mi:ss'))
15 rows selected.
With a bad clustering factor, the query is going for a full table scan on NEW_TEST table, and the overall cost of the access path is very high too. Let’s alter the table using Linear attribute clustering – I won’t be using materialized zonemaps here
SQL> alter table NEW_TEST add clustering by linear order(DOB) without materialized zonemap;
Table altered.
SQL> alter table NEW_TEST move online;
Table altered.
-- Now if you check you will the improved CF of the Index.
SQL> SELECT t.table_name, i.index_name, t.blocks, t.num_rows, i.clustering_factor FROM user_tables t, user_indexes i
WHERE t.table_name = i.table_name AND i.index_name='IDX_NEWTEST';
TABLE_NAME INDEX_NAME BLOCKS NUM_ROWS CLUSTERING_FACTOR
---------- -------------------- ---------- ---------- -----------------
NEW_TEST IDX_NEWTEST 10097 2000000 9277
The clustering factor of the Index was greatly improved after altering the table column (DOB) with linear order clustering enabled. Let’s try to run the same SQL and see if there is any improvements or not.
SQL> explain plan for select * from new_test where dob between '01-JUN-2017' and '30-JUN-2017';
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------
Plan hash value: 1446839462
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3100 | 86800 | 26 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| NEW_TEST | 3100 | 86800 | 26 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_NEWTEST | 3100 | | 11 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("DOB">=TO_DATE(' 2017-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"DOB"<=TO_DATE(' 2017-06-30 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
15 rows selected.
SQL>
SQL>
And we can see the FTS was replaced with a quick INDEX RANGE SCAN on IDX_NEWTEST index (created to cover DOB column). Look at the great reduction in the cost per step and final cost of the access path and plan tree.
Recently during one of the performance taskforce on a newly migrated system, customer DBA asked me to use one of their legacy tool to get more idea about database’s performance, that one of their expert DBA written to collect performance metrics. I’d seen their previous reports collected through the same tool for other systems, and it was good. But, got a runtime exception with an error while calling the script/tool which says ‘PLAN_TABLE physical table present in user schema SYS‘. The error means the user executing it (SYS) owns the table PLAN_TABLE that is the not the Oracle seeded GTT (Global Temporary Table) plan table owned by SYS (PLAN_TABLE$ table with a PUBLIC synonym PLAN_TABLE).
This was little odd to the customer DBAs as they had never experienced this error with the tool, and now when its there, question was Shall we drop the PLAN_TABLE ? Is it risky to do that ? If we drop it, will it impact the execution plan generation or not ? Any other associated risk with drop of plan_table ?
Next when I’d queried DBA_OBJECTS, I saw the table is there in SYS schema, though this system was migrated from 12.2 to 19c, but the table should not be there as the table only by default existed in older versions of Oracle. The object creation date was coming for the time when database was upgraded. It had appeared that someone after upgrade/migration, called the utlrp.sql explicitly (maybe any old 8i/9i DBA) and that’d created the table. Now the question is – It’s safe to dropthis table ?
SQL> select owner, object_name, object_type, created from dba_objects where object_name like '%PLAN_TABLE%'
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;
OWNER OBJECT_NAME OBJECT_TYPE CREATED
---------- -------------------- ----------------------- ---------
PUBLIC PLAN_TABLE SYNONYM 17-APR-19
PUBLIC SQL_PLAN_TABLE_TYPE SYNONYM 17-APR-19
PUBLIC PLAN_TABLE_OBJECT SYNONYM 17-APR-19
SYS SQL_PLAN_TABLE_TYPE TYPE 17-APR-19
SYS PLAN_TABLE TABLE 13-MAR-22 ----->>>>> OLD PLAN_TABLE created during the UPGRADE
SYS SQL_PLAN_TABLE_TYPE TYPE 17-APR-19
SYS PLAN_TABLE_OBJECT TYPE 17-APR-19
SYS PLAN_TABLE$ TABLE 17-APR-19
SYS PLAN_TABLE_OBJECT TYPE BODY 17-APR-19
9 rows selected.
-- Look at the difference between the two, PLAN_TABLE$ is a GLOBAL TEMP TABLE and old PLAN_TABLE is not.
SQL> SELECT TABLE_NAME, owner, temporary from dba_tables where table_name like '%PLAN_TABLE%'
AND owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;
TABLE_NAME OWNER T
------------------------------ -------------------- -
PLAN_TABLE SYS N
PLAN_TABLE$ SYS Y ---> Y represents GTT
Let’s first see what’s there inside the PLAN_TABLE and what’s its purpose. Will generate few SQL execution plans will observe changes that happens in PLAN_TABLE.
-- Table columns and details
SQL> desc plan_table
Name Null? Type
----------------------------------------- -------- ----------------------------
STATEMENT_ID VARCHAR2(30)
PLAN_ID NUMBER
TIMESTAMP DATE
REMARKS VARCHAR2(4000)
OPERATION VARCHAR2(30)
OPTIONS VARCHAR2(255)
OBJECT_NODE VARCHAR2(128)
OBJECT_OWNER VARCHAR2(128)
OBJECT_NAME VARCHAR2(128)
OBJECT_ALIAS VARCHAR2(261)
OBJECT_INSTANCE NUMBER(38)
OBJECT_TYPE VARCHAR2(30)
OPTIMIZER VARCHAR2(255)
SEARCH_COLUMNS NUMBER
ID NUMBER(38)
PARENT_ID NUMBER(38)
DEPTH NUMBER(38)
POSITION NUMBER(38)
COST NUMBER(38)
CARDINALITY NUMBER(38)
BYTES NUMBER(38)
OTHER_TAG VARCHAR2(255)
PARTITION_START VARCHAR2(255)
PARTITION_STOP VARCHAR2(255)
PARTITION_ID NUMBER(38)
OTHER LONG
DISTRIBUTION VARCHAR2(30)
CPU_COST NUMBER(38)
IO_COST NUMBER(38)
TEMP_SPACE NUMBER(38)
ACCESS_PREDICATES VARCHAR2(4000)
FILTER_PREDICATES VARCHAR2(4000)
PROJECTION VARCHAR2(4000)
TIME NUMBER(38)
QBLOCK_NAME VARCHAR2(128)
OTHER_XML CLOB
-- Let me check other stats or details about the PLAN_TABLE
SQL> select index_name, table_name from dba_indexes where table_name='PLAN_TABLE'
And owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;
INDEX_NAME TABLE_NAME
-------------------------------------------------- ------------------------------
SYS_IL0000078251C00036$$ PLAN_TABLE
SQL> select table_name, owner, TABLESPACE_NAME from dba_tables where table_name like '%PLAN_TABLE%'
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;
TABLE_NAME OWNER TABLESPACE_NAME
------------------------------ ------------------------------ ------------------------------
PLAN_TABLE SYS SYSTEM
PLAN_TABLE$ SYS
SQL>
-- The OLD PLAN_TABLE is empty at the moment
SQL> select count(*) from plan_table;
COUNT(*)
----------
0
-- Lets explain a test SQL to see what happens to the OLD PLAN_TABLE
SQL> explain plan for select count(*) from bigtab;
Explained.
-- And immediately 3 rows related to the plan line ids added to it
SQL> select count(*) from plan_table;
COUNT(*)
----------
3
-- Three entries for below 3 IDs.
SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2140185107
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 69 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| BIGTAB | 72358 | 69 (0)| 00:00:01 |
---------------------------------------------------------------------
9 rows selected.
-- But the new PLAN_TABLE$ is still empty
SQL> select count(*) from PLAN_TABLE$ ;
COUNT(*)
----------
0
So, the question is – Is it safe to drop this table PLAN_TABLE ?
SQL> drop table PLAN_TABLE;
Table dropped.
SQL>
-- And the table is gone
SQL> select owner, object_name, object_type, created from dba_objects where object_name like '%PLAN_TABLE%'
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;
OWNER OBJECT_NAME OBJECT_TYPE CREATED
------------------------------ -------------------- ----------------------- ---------
PUBLIC PLAN_TABLE SYNONYM 17-APR-19
PUBLIC SQL_PLAN_TABLE_TYPE SYNONYM 17-APR-19
PUBLIC PLAN_TABLE_OBJECT SYNONYM 17-APR-19
SYS PLAN_TABLE_OBJECT TYPE BODY 17-APR-19
SYS SQL_PLAN_TABLE_TYPE TYPE 17-APR-19
SYS PLAN_TABLE_OBJECT TYPE 17-APR-19
SYS PLAN_TABLE$ TABLE 17-APR-19
SYS SQL_PLAN_TABLE_TYPE TYPE 17-APR-19
8 rows selected.
Now when the table is gone, lets check if we are still able to generate the execution plan.
SQL>
SQL> explain plan for select count(*) from bigtab;
Explained.
SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2140185107
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 69 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| BIGTAB | 72358 | 69 (0)| 00:00:01 |
---------------------------------------------------------------------
9 rows selected.
SQL> select count(*) from plan_table$;
COUNT(*)
----------
3
And yes, no issues at all. The plan now started to sit inside PLAN_TABLE$ that has a PUBLIC SYNONYM called PLAN_TABLE. So, it’s totally safe to drop the PLAN_TABLE from your schema if it still exists and Oracle has now a public synonym for the same purpose. WARNING: Don’t drop the PLAN_TABLE$ nor the PLAN_TABLE public synonym, these need to exist for the new PLAN_TABLE to work properly.
Last week we faced an interesting issue with one of the production system which was recently migrated from Oracle 12.2. to 19.15. The setup was running on a VMWare machine with limited resources. It all started when the application team started reporting slowness in their daily scheduled jobs and other ad-hoc operations, when checked at the database layer, it was all ‘enq: CR – block range reuse ckpt‘ wait event. Same can be seen in the below ORATOP output, where the BLOCKER ID 3817 is the CKPT or the checkpoint process.
The strange part was, the blocker was CKPT process and it was all for a particular SQL ID (an INSERT operation), see below in next oratop screen fix.
As far as this wait event (other classed), This comes just before you delete or truncate a table, where we need a level segment checkpoint. This is because it must maintain the consistency of the blocks there may be in the buffer memory and what’s on the disc. As per the definition, this wait event happens due to contention on blocks caused by multiple processes trying to update same blocks. This may seem issues from the application logic resulting into this concurrency bottleneck, but interestingly this was happening on a simple INSERT operation, not a DELETE or TRUNCATE.
Same can be seen in the AWR and ASH reports too! There are CBC (Cache Buffer Chains) latching and latch free events too along with ‘enq: CR – block range reuse ckpt‘, but the initial focus was to understand the event and its reasons. As far as ‘direct path read temp‘ it was happening due to couple of SELECT statement which we resolved after attaching a better plan with the SQLs.
Wait event source (from ASH)
SQL Text was quite simple, an INSERT INTO statement.
I’ve tried first with the CKPT process traces, just to see if there is anything interesting or useful caught within the process logs or traces. The trace was short and has only got some strange and obscure looking content, but which at least gave us an idea that the reuse block checkpoint of the RBR was failed due to enqueue, and its entry was failed due to abandoned parent. Still, that doesn’t helped us anything, we were unsure about the reason.
--> Info in CKPT trace file ---> XXXXX_ckpt_110528.trc.
RBR_CKPT: adding rbr ckpt failed for 65601 due to enqueue
RBR_CKPT: rbr mark the enqueue for 65601 as failed
RBR_CKPT: adding ckpt entry failed due to abandoned parent 0x1b57b4a88
RBR_CKPT: rbr mark the enqueue for 65601 as failed
There were few things logged in the alert.log, multiple deadlocks (ORA 0060), too many parse errors for one SELECT statement and some checkpoint incomplete errors (log switching was high >35)
-- deadlocks in alert log.
Errors in file /opt/u01/app/oracle/diag/rdbms/xxxxxx/xxxx/trace/xxxx_ora_73010.trc:
2022-04-19T13:41:22.489551+05:30 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
ORA-00060: Deadlock detected. See Note 60.1 at My Oracle Support for Troubleshooting ORA-60 Errors. More info in file /opt/u01/app/oracle/diag/rdbms/pwfmfsm/PWFMFSM/trace/PWFMFSM_ora_73010.trc.
-- From systemstatedumps
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
------------Blocker(s)----------- ------------Waiter(s)------------
Resource Name process session holds waits serial process session holds waits serial
TX-03AC0014-0000B33A-00000000-00000000 1562 691 X 1632 946 5441 X 56143
TX-01460020-0001A5C2-00000000-00000000 946 5441 X 56143 1562 691 X 1632
-- too many parse errors
2022-04-19T13:57:26.261176+05:30
WARNING: too many parse errors, count=2965 SQL hash=0x8ce1e2ff ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
PARSE ERROR: ospid=68036, error=942 for statement:
2022-04-19T13:57:26.272805+05:30
SELECT * FROM ( SELECT xxxxxxxxxxxxxxxxxxxxxx FROM xxxxxxxxxxxxxxxxxxxxxx AND xxxxxxxxxxxxxxxxxxxxxx ORDER BY xxxxxxxxxxxxxxxxxxxxxx ASC ) WHERE ROWNUM <= 750
Additional information: hd=0x546ba1be8 phd=0x61edab798 flg=0x20 cisid=113 sid=113 ciuid=113 uid=113 sqlid=ccmkzhy6f3srz
...Current username=xxxxxxxxxxxxxxxxxxxxxx
...Application: xxxxxxxxxxxxxxxxxxxxxx.exe Action:
-- Checkpoint incomplete
2022-04-19T15:03:16.964470+05:30 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
Thread 1 cannot allocate new log, sequence 186456
Checkpoint not complete
Current log# 11 seq# 186455 mem# 0: +ONLINE_REDO/xxxxxxxxxxxxxxxxxxxxxx/ONLINELOG/group_11.256.1087565999
2022-04-19T15:03:17.785113+05:30
But the alert log was not sufficient to give us any concrete evidences or reasons for CKPT bloking user sessions. So, next we decided to generate the HANGANALYZE and SYSTEMSTATEDUMPs to understand what’s all happening under the hood, through its wait chains. We noticed few interesting things there
Wait chain 1 where a session was waiting on ‘buffer busy waits‘ while doing the “bitmapped file space header” which talks about the space management blocks (nothing to with bitmap indexes) and was related with one SELECT statement.
Wait chain 2 where a session was found waiting on ‘enq: CR – block range reuse ckpt‘ event and was blocked by CKPT process (3817) which was further waiting on ‘control file sequential read‘
Wait chain 4 where SID 1670, was found waiting on ‘buffer busy waits‘ while doing ‘bitmapped file space header’ operations.
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 46794
process id: 2014, oracle@monkeymachine.prod.fdt.swedish.se
session id: 15
session serial #: 19322
module name: 0 (xxxx.exe)
}
is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 2 min 20 sec
timeout after: never
wait id: 10224
blocking: 0 sessions
current sql_id: 1835193535
current sql: SELECT * FROM ( SELECT xxxxx FROM task JOIN request ON xxxxx = xxxxx JOIN xxxxx ON xxxxx = xxxxx JOIN c_task_assignment_view ON xxxxx = xxxxx
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 23090
process id: 3231, oracle@monkeymachine.prod.fdt.swedish.se
session id: 261
session serial #: 39729
module name: 0 (xxxx.exe)
}
which is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 18.999227 sec ◄◄◄
timeout after: never
wait id: 47356
blocking: 25 sessions ◄◄◄
current sql_id: 0
current sql: <none>
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-read()+14<-snttread()+16<-nttfprd()+354<-nsbasic_brc()+399<-nioqrc()+438<-opikndf2()+999<-opitsk()+910<-opiino()+936<-opiodr()+1202<-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
wait history:
* time between current wait and wait #1: 0.000084 sec
1. event: 'SQL*Net message from client'
time waited: 0.008136 sec
wait id: 47355 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #1 and #2: 0.000043 sec
2. event: 'SQL*Net message to client'
time waited: 0.000002 sec
wait id: 47354 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #2 and #3: 0.000093 sec
3. event: 'SQL*Net message from client'
time waited: 2.281674 sec
wait id: 47353 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
}
and may or may not be blocked by another session.
.
.
.
Chain 2:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 10122
process id: 2850, oracle@monkeymachine.prod.fdt.swedish.se
session id: 97
session serial #: 47697
module name: 0 (xxxx.exe)
}
is waiting for 'enq: CR - block range reuse ckpt' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'name|mode'=0x43520006
p2: '2'=0x10b22
p3: '0'=0x1
time in wait: 2.018004 sec
timeout after: never
wait id: 81594
blocking: 0 sessions
current sql_id: 1335044282
current sql: <none>
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 110528
process id: 24, oracle@monkeymachine.prod.fdt.swedish.se
session id: 3817 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ CKPT background process
session serial #: 46215
}
which is waiting for 'control file sequential read' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'file#'=0x0
p2: 'block#'=0x11e
p3: 'blocks'=0x1
px1: 'disk number'=0x4
px2: 'au'=0x34
px3: 'offset'=0x98000
time in wait: 0.273981 sec
timeout after: never
wait id: 17482450
blocking: 45 sessions ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
current sql_id: 0
current sql: <none>
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2199<-kslwaitctx()+205<-ksarcv()+320<-ksbabs()+602<-ksbrdp()+1167<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
wait history:
* time between current wait and wait #1: 0.000012 sec
1. event: 'control file sequential read'
time waited: 0.005831 sec
wait id: 17482449 p1: 'file#'=0x0
p2: 'block#'=0x11
p3: 'blocks'=0x1
* time between wait #1 and #2: 0.000012 sec
2. event: 'control file sequential read'
time waited: 0.011667 sec
wait id: 17482448 p1: 'file#'=0x0
p2: 'block#'=0xf
p3: 'blocks'=0x1
* time between wait #2 and #3: 0.000017 sec
3. event: 'control file sequential read'
time waited: 0.009160 sec
wait id: 17482447 p1: 'file#'=0x0
p2: 'block#'=0x1
p3: 'blocks'=0x1
}
.
.
.
Chain 4:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 46479
process id: 1036, oracle@monkeymachine.prod.fdt.swedish.se
session id: 1670
session serial #: 6238
module name: 0 (xxxx.exe)
}
is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 18.954206 sec ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
timeout after: never
wait id: 20681
blocking: 0 sessions
current sql_id: 343919375
current sql: SELECT * FROM ( SELECT xxxxx FROM task JOIN request ON xxxxx = xxxxx JOIN xxxxx ON xxxxx = xxxxx JOIN c_task_assignment_view ON xxxxx = xxxxx
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 44958
process id: 523, oracle@monkeymachine.prod.fdt.swedish.se
session id: 4681
session serial #: 41996
module name: 0 (xxxx.exemonkeymachine.prod.fdt.swedish.se (TNS)
}
which is waiting for 'buffer busy waits' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 18.959429 sec ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
timeout after: never
wait id: 153995
blocking: 101 sessions ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
current sql_id: 343919375
current sql: INSERT INTO xx_xxx_xx(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) VALUES (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2199<-kslwaitctx()+205<-ksqcmi()+21656<-ksqcnv()+809<-ksqcov()+95<-kcbrbr_int()+2476<-kcbrbr()+47<-ktslagsp()+2672<-ktslagsp_main()+945<-kdliAllocCache()+37452<-kdliAllocBlocks()+1342<-kdliAllocChunks()+471<-kdliWriteInit()+1249<-kdliWriteV()+967<-kdli_fwritev()+904<-kdlxNXWrite()+577<-kdlx_write()+754<-kdlxdup_write1()+726<-kdlwWriteCallbackOld_pga()+1982<-kdlw_write()+1321<-kdld_write()+410<-kdl
wait history:
* time between current wait and wait #1: 0.756195 sec
1. event: 'direct path write temp'
time waited: 0.406543 sec
wait id: 153994 p1: 'file number'=0xc9
p2: 'first dba'=0xb28fc
p3: 'block cnt'=0x4
* time between wait #1 and #2: 0.000001 sec
2. event: 'ASM IO for non-blocking poll'
time waited: 0.000000 sec
wait id: 153993 p1: 'count'=0x4
p2: 'where'=0x2
p3: 'timeout'=0x0
* time between wait #2 and #3: 0.000002 sec
3. event: 'ASM IO for non-blocking poll'
time waited: 0.000001 sec
wait id: 153992 p1: 'count'=0x4
p2: 'where'=0x2
p3: 'timeout'=0x0
}
and may or may not be blocked by another session.
.
.
Though, we wanted to try couple of hidden parameters to enable fast object level truncation and checkpointing, as they had helped us a lot in the past in similar scenarios, but had to involve Oracle support and after carefully analyzing the situation, they too agreed and want us to try them as they started suspecting it as an aftermath of a known bug of 19c.
[oracle@oracleontario ~]$ !sql
sqlplus / as sysdba
SQL*Plus: Release 19.0.0.0.0 - Production on Sun Apr 24 00:24:47 2022
Version 19.15.0.0.0
Copyright (c) 1982, 2022, Oracle. All rights reserved.
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.15.0.0.0
SQL>
SQL> @hidden
Enter value for param: db_fast_obj_truncate
old 5: and a.ksppinm like '%¶m%'
new 5: and a.ksppinm like '%db_fast_obj_truncate%'
Parameter Session Value Instance Value descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_db_fast_obj_truncate TRUE TRUE enable fast object truncate
_db_fast_obj_ckpt TRUE TRUE enable fast object checkpoint
SQL> ALTER SYSTEM SET "_db_fast_obj_truncate" = false sid = '*';
System altered.
SQL>
SQL> ALTER SYSTEM SET "_db_fast_obj_ckpt" = false sid = '*';
System altered.
SQL>
And soon after setting them, we saw a drastic drop in the waits and system seemed better, much better. But it all happened during an off-peak hour, so there wasn’t much of a workload to see anything odd.
And as we suspected, the issue repeated itself, and next day during peak business hours we started seeing the same issue, same set of events back into existence. This time the ‘latch: cache buffers chains‘ was quite high and prominent which was not that much earlier.
Initially we tried to fix some of the expensive statements on ‘logical IOs’ or memory reads, but that hardly helped. The issue persisted even after setting a higher value for db_block_hash_latches and decreasing cursor_db_buffers_pinned. AWR continues to show ‘latch: cache buffers chains’ in the top ten, foreground timed events, and ‘latch free‘ in first place in the background timed events. Oracle support confirmed the behavior was due to published bug 33025005 where excessive Latch CBC waits were seen after upgrading from 12c to 19c, and suggested to apply patch 33025005 and then to set hidden parameter “_post_multiple_shared_waiters” to value FALSE (in MEMORY only to test) which disables multiple shared waiters in the database.
-- After applying patch 33025005
SQL> ALTER SYSTEM SET "_post_multiple_shared_waiters" = FALSE SCOPE = MEMORY;
System altered.
And even after applying the patch and setting the recommended undocumented parameter, the issue persisted and we were totally clueless.
And as a last resort, we tried to flush the buffer cache of the database, and bingo that crude method of purging the cache helped to drastically to reduce the load on ‘CBC Latching‘ and for ‘enq: cr block range reuse ckpt‘, and the system ran fine soon after the flush of the DB Buffer cache.
So, nothing worked for us, we changed multiple checkpointing and shared writers related parameters in the database, applied a bug fix patch (33025005), but of no use. Finally, the flush of buffer cache worked for us! Oracle support agreed that this was happening due to a new/unpublished bug (33125873 or 31844316) which is not yet fixed in 19.15 and will be included in 23.1, and they are in status 11 that means still being worked by Development so there is no fix for it yet.