Tales From A Lazy Fat DBA

It's all about Databases & their performance, troubleshooting & much more … ¯\_(ツ)_/¯

Posts Tagged ‘Database’

Are you looking for a method to stop automatic SQL quarantine without disabling the entire SQL Quarantine feature? Welcome to Oracle 21c …

Posted by FatDBA on September 23, 2022

Hi All,

Recently I was working on a 21c database for a POC where, at one point, I wanted to disable the automatic creation of SQL Quarantine configurations without disabling the entire SQL Quarantine feature. This 21.3.0.0.0 database had Resource Manager enabled, and one of the SQL statements was taking longer to complete than the directive limits on I/O and CPU time allowed. As expected, the SQL was killed with “ORA-00040: active time limit exceeded - call aborted” and its plan was quarantined. What I wanted was to stop the automatic creation of SQL Quarantine configurations for that SQL after Resource Manager terminated it.

I remember that in 19c there wasn't any way to achieve that. The behavior could only be regulated with one of two underscore parameters, _quarantine_enabled or _optimizer_quarantine_sql. Oracle 21c introduced two new parameters to control SQL Quarantine, and they solve exactly this problem.

The first one is optimizer_capture_sql_quarantine. When set to FALSE, it disables the automatic creation of SQL Quarantine configurations after Resource Manager terminates a SQL execution. It is FALSE by default.

The second parameter is optimizer_use_sql_quarantine. When set to FALSE, it disables the use of existing SQL Quarantine configurations in the database. In other words, it determines whether the optimizer considers SQL Quarantine configurations when choosing an execution plan for a SQL statement. It is TRUE by default, which allows users to manually create and use SQL Quarantine configurations.

Oracle Database 21c Enterprise Edition Release 21.0.0.0.0 - Production
Version 21.3.0.0.0

-- Default Setting
SQL> show parameter OPTIMIZER_CAPTURE_SQL_QUARANTINE

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
optimizer_capture_sql_quarantine     boolean     FALSE
SQL>
SQL>

-- Default Setting
SQL> sho parameter OPTIMIZER_USE_SQL_QUARANTINE

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
optimizer_use_sql_quarantine         boolean     TRUE
SQL>
SQL>

Hope It Helped!
Prashant Dixit

Posted in Uncategorized

FRA full of archives with ORA-38706 ORA-38708? Don't always think of a BUG … and when that proud little DBA in me missed common sense

Posted by FatDBA on September 11, 2022

Hi All,
Some time back I was working on a 19.16 database where I had to enable FLASHBACK. I was immediately kicked out with the error “ORA-38706: Cannot turn on FLASHBACK DATABASE logging. ORA-38708: not enough space for first flashback database log file”. I was about to expand the FRA, but first decided to check what was inside it. It was 99.9% full, with 270 archived log files occupying 99.39% of the allocated space. So far, everything made sense.

[oracle@fatdba ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Mon Sep 5 19:07:28 2022
Version 19.16.0.0.0

Copyright (c) 1982, 2022, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.16.0.0.0

SQL>
SQL> show parameter db_recovery_file_dest_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest_size           big integer 251G


NAME         SPACE_LIMIT_GB SPACE_AVAILABLE_GB PERCENT_FULL
------------ -------------- ------------------ ------------
+RECO                   251         .240234375         99.9


FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID
----------------------- ------------------ ------------------------- --------------- ----------
CONTROL FILE                             0                         0               1          0
REDO LOG                                 0                         0               0          0
ARCHIVED LOG                         99.39                         0             270          0
BACKUP PIECE                           .01                         0               2          0
IMAGE COPY                               0                         0               0          0
FLASHBACK LOG                            0                         0               0          0
FOREIGN ARCHIVED LOG                     0                         0               0          0
AUXILIARY DATAFILE COPY                  0                         0               0          0

Then I decided to check the space at the ASM level. I queried v$asm_diskgroup, and the results were the complete opposite of what V$RECOVERY_FILE_DEST and v$flash_recovery_area_usage showed. The RECO disk group (the FRA location) was almost entirely free, with only 0.54% used. asmcmd reported the same for the RECO disk group.

-- results from v$asm_diskgroup view
Disk Group            Sector   Block   Allocation
Name                    Size    Size    Unit Size State       Type   Total Size (MB) Used Size (MB) Pct. Used
-------------------- ------- ------- ------------ ----------- ------ --------------- -------------- ---------
DATA                     512   4,096    1,048,576 CONNECTED   EXTERN         691,197        356,322     51.55
OCRVFDG                  512   4,096    4,194,304 MOUNTED     EXTERN          25,596            100       .39
RECO                     512   4,096    1,048,576 CONNECTED   EXTERN       1,048,575          5,645       .54
                                                                     --------------- --------------
Grand Total:                                                               1,765,368        362,067



ASMCMD> lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512             512   4096  1048576    691197   334875                0          334875              0             N  DATA/
MOUNTED  EXTERN  N         512             512   4096  4194304     25596    25496                0           25496              0             N  OCRVFDG/
MOUNTED  EXTERN  N         512             512   4096  1048576   1048575  1043233                0         1043233              0             N  RECO/
ASMCMD>
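To cross-check the ASM side independently, you can derive the percentage used per disk group from the lsdg columns as (Total_MB - Free_MB) * 100 / Total_MB. Below is a minimal shell sketch. It assumes the lsdg column layout shown above: Total_MB in field 8, Free_MB in field 9 and the disk group name in the last field. Other versions may print different columns, so check the header first. asm_pct_used is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# asm_pct_used: hypothetical helper that reads `asmcmd lsdg` output on stdin
# and prints "<diskgroup> <percent used>" for every disk group row.
# Assumes the column layout shown in this post: Total_MB is field 8,
# Free_MB is field 9 and the name (with a trailing "/") is the last field.
asm_pct_used() {
  awk 'NR > 1 && NF >= 14 {
         name = $NF
         sub(/\/$/, "", name)               # "RECO/" -> "RECO"
         total = $8; free = $9
         if (total > 0)
           printf "%s %.2f\n", name, (total - free) * 100 / total
       }'
}

# Usage (as the grid owner, with the ASM environment set):
#   asmcmd lsdg | asm_pct_used
```

For the lsdg output above it prints DATA 51.55, OCRVFDG 0.39 and RECO 0.51. That is nowhere near the 99.9% reported by V$RECOVERY_FILE_DEST.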

At this point I started suspecting some kind of BUG in the database, as I know about a few FRA-related bugs in earlier Oracle versions. Before trying anything else, I executed dbms_backup_restore.refreshagedfiles, which refreshes the view. I also tried the kra_options event, which resets v$recovery_file_dest, but no luck 😦

SQL>
SQL> alter session set events 'immediate trace name kra_options level 1';

Session altered.

SQL>  execute dbms_backup_restore.refreshagedfiles;

PL/SQL procedure successfully completed.

SQL>

At that point I was very sure it was all due to a bug and was thinking of contacting Oracle Support. But just before that, the proud little DBA in me asked: am I missing something? Are there any other areas I should explore before contacting OCS?

I checked RMAN to see whether it still held details of any expired archived logs and ran ‘crosscheck archivelog all’. It identified exactly 270 older archived logs that no longer existed. So, that proud little DBA was wrong 🙂 Since the count matched v$flash_recovery_area_usage exactly, I deleted all of those expired archived logs from the RMAN repository (the control file here, as no recovery catalog was used).
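If you save the crosscheck output to a file, you can quickly compare the number of expired logs with NUMBER_OF_FILES from v$flash_recovery_area_usage by counting RMAN's validation messages. Here is a minimal shell sketch. It assumes the message text shown in the RMAN output below. count_crosscheck and the crosscheck.log file name are illustrative, not part of any Oracle tool.

```shell
#!/usr/bin/env bash
# count_crosscheck: hypothetical helper that reads saved output of
# "crosscheck archivelog all;" on stdin and counts the validation results.
# Relies on the RMAN message text used in this post:
#   "validation failed for archived log"    -> file missing, marked EXPIRED
#   "validation succeeded for archived log" -> file still present on disk
count_crosscheck() {
  awk '/validation failed for archived log/    { failed++ }
       /validation succeeded for archived log/ { ok++ }
       END { printf "expired=%d available=%d\n", failed, ok }'
}

# Usage, e.g. with the RMAN output saved to crosscheck.log:
#   count_crosscheck < crosscheck.log
```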

-- To Crosscheck all archivelog files present in the RMAN catalog.
RMAN> crosscheck archivelog all;

using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=258 device type=DISK
validation failed for archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_1.261.1109109619 RECID=1 STAMP=1109109620
validation failed for archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_2.264.1109401205 RECID=2 STAMP=1109401206
validation failed for archived log
.....
........
..........
..............
validation failed for archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_268.463.1114203831 RECID=268 STAMP=1114203829
validation failed for archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_269.469.1114207455 RECID=269 STAMP=1114207454
validation succeeded for archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_321.531.1114624193 RECID=270 STAMP=1114624193
Crosschecked 270 objects


-- Delete expired archivelog files
RMAN>
RMAN>
RMAN> delete expired archivelog all;
...
.......
270     1    270     X 31-AUG-22
        Name: +RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_268.463.1114203831 RECID=268 STAMP=1114203829
..
.......
Do you really want to delete the above objects (enter YES or NO)? YES
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_1.261.1109109619 RECID=1 STAMP=1109109620
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_2.264.1109401205 RECID=2 STAMP=1109401206
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_3.265.1109671797 RECID=3 STAMP=1109671799
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_4.266.1109955617 RECID=4 STAMP=1109955618
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_5.267.1110233333 RECID=5 STAMP=1110233334
deleted archived log
archived log file name=+RECO/FATDBA/ARCHIVELOG/2022_07_02/thread_1_seq_6.268.1110485231 RECID=6 STAMP=1110485232
deleted archived log
Deleted 270 EXPIRED objects

RMAN>
RMAN>

The FRA-related dynamic views (V$RECOVERY_FILE_DEST and v$flash_recovery_area_usage) changed immediately and now matched the real utilization at the ASM level.

[oracle@fatdba ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Mon Sep 5 19:07:28 2022
Version 19.16.0.0.0

Copyright (c) 1982, 2022, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.16.0.0.0

SQL>
SQL>
SQL> SELECT NAME,
       (SPACE_LIMIT / 1024 / 1024 / 1024) SPACE_LIMIT_GB,
         ((SPACE_LIMIT - SPACE_USED + SPACE_RECLAIMABLE) / 1024 / 1024 / 1024) AS SPACE_AVAILABLE_GB,
       ROUND((SPACE_USED - SPACE_RECLAIMABLE) / SPACE_LIMIT * 100, 1) AS PERCENT_FULL
  FROM V$RECOVERY_FILE_DEST;  

NAME         SPACE_LIMIT_GB SPACE_AVAILABLE_GB PERCENT_FULL
------------ -------------- ------------------ ------------
+RECO                   200          197.59082          1.2

SQL> select * from v$flash_recovery_area_usage;

FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID
----------------------- ------------------ ------------------------- --------------- ----------
CONTROL FILE                           .01                         0               1          0
REDO LOG                                 0                         0               0          0
ARCHIVED LOG                           .57                         0               2          0
BACKUP PIECE                           .01                       .01               2          0
IMAGE COPY                               0                         0               0          0
FLASHBACK LOG                            0                         0               0          0
FOREIGN ARCHIVED LOG                     0                         0               0          0
AUXILIARY DATAFILE COPY                  0                         0               0          0

8 rows selected.
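The SPACE_AVAILABLE_GB and PERCENT_FULL expressions in the query above count reclaimable space as free. Below is the same arithmetic as a small shell sketch, handy for sanity-checking the numbers outside SQL*Plus. fra_usage is a hypothetical helper that takes SPACE_LIMIT, SPACE_USED and SPACE_RECLAIMABLE in bytes. awk's printf rounding can differ from Oracle's ROUND in the last digit on exact halves.

```shell
#!/usr/bin/env bash
# fra_usage: hypothetical helper mirroring the V$RECOVERY_FILE_DEST query above.
# Arguments (bytes): SPACE_LIMIT SPACE_USED SPACE_RECLAIMABLE
fra_usage() {
  awk -v lim="$1" -v used="$2" -v recl="$3" 'BEGIN {
    if (lim + 0 <= 0) { print "SPACE_LIMIT must be > 0" > "/dev/stderr"; exit 1 }
    gb = 1024 * 1024 * 1024
    # SPACE_AVAILABLE_GB: reclaimable space is treated as available
    avail = (lim - used + recl) / gb
    # PERCENT_FULL: only non-reclaimable usage counts
    pct = (used - recl) * 100 / lim
    printf "limit_gb=%.2f available_gb=%.2f percent_full=%.1f\n", lim / gb, avail, pct
  }'
}
```

For example, take a 100 GB limit with 50 GB used, 10 GB of it reclaimable. `fra_usage 107374182400 53687091200 10737418240` prints limit_gb=100.00 available_gb=60.00 percent_full=40.0.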

Hope It Helped!
Prashant Dixit

Posted in Uncategorized

A new 21c dynamic view securefile_shrink, and possibly a new BUG in 21.3. I guess I have identified a BUG

Posted by FatDBA on September 1, 2022

Recently I was doing a POC on Oracle 21c where I had to shrink a SecureFiles LOB column. I was aware of a new dynamic view added in Oracle 21c, V$SECUREFILE_SHRINK, which is very useful for monitoring SecureFiles shrink operations. It gives you details such as the start time, end time and status of the operation. I shrank the required SecureFiles LOB column of a table and was interested to see what V$SECUREFILE_SHRINK had captured … but it was empty! I got ‘no rows selected’. Why is that??

Let me try to replicate the scenario and explain what I was doing that day.

-- Let me create a test scenario
-- Will create a brand new playground to test all things
[oracle@witnessalberta ~]$ !sq
sqlplus / as sysdba

SQL*Plus: Release 21.0.0.0.0 - Production on Sun Aug 28 10:50:26 2022
Version 21.3.0.0.0

Copyright (c) 1982, 2021, Oracle.  All rights reserved.


Connected to:
Oracle Database 21c Enterprise Edition Release 21.0.0.0.0 - Production
Version 21.3.0.0.0

SQL>
SQL>
SQL>
SQL> CREATE PLUGGABLE DATABASE pdb2 ADMIN USER pdb_adm IDENTIFIED BY oracle90 CREATE_FILE_DEST='/opt/oracle/oradata';

Pluggable database created.

SQL> show pdbs;

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 ORCLPDB1                       MOUNTED
         4 PDB2                           MOUNTED
SQL> ALTER PLUGGABLE DATABASE pdb2 OPEN READ WRITE;

Pluggable database altered.

SQL>  ALTER SESSION SET CONTAINER = PDB2;

Session altered.

SQL> SHOW CON_NAME;

CON_NAME
------------------------------
PDB2

SQL>
SQL> conn fatdba/oracle90@PDB2
Connected.
SQL>
SQL>
SQL>
SQL> show user
USER is "FATDBA"
SQL>
SQL>
SQL>
SQL>

-- Created a Table with a SECUREFILE LOB 
SQL> CREATE TABLE secure_file_tab (rid  NUMBER(5), bcol BLOB) LOB (bcol) STORE AS SECUREFILE bcol_lob 
(TABLESPACE users DISABLE  STORAGE IN ROW CHUNK 8192 RETENTION MIN 3600 KEEP_DUPLICATES NOCOMPRESS DECRYPT CACHE READS); 

Table created.

SQL>

SQL>
-- Inserted ~100000 identical rows (rid=101) into the table
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
insert into secure_file_tab VALUES(101, utl_raw.cast_to_raw('hello, this is the first review'));
...
......
........

SQL> select count(*) from secure_file_tab;

  COUNT(*)
----------
    100001

SQL>
SQL>
-- Let's check the size of the table.
SQL> select segment_name, bytes/1024 from dba_segments where segment_name='SECURE_FILE_TAB';

SEGMENT_NAME         BYTES/1024
-------------------- ----------
SECURE_FILE_TAB      4096

-- Deleted all rid=101 rows (100000 of the 100001) from the table
SQL> delete from secure_file_tab where RID=101;

100000 rows deleted.

SQL> COMMIT;

SQL> select count(*) from secure_file_tab;

COUNT(*)
----------
1

-- Gathered table stats
SQL> exec dbms_stats.gather_table_stats(null, 'secure_file_tab');

PL/SQL procedure successfully completed.


SQL> select segment_name, bytes/1024 from dba_segments where segment_name='SECURE_FILE_TAB';

SEGMENT_NAME         BYTES/1024
-------------------- ----------
SECURE_FILE_TAB      113



SQL> alter table secure_file_tab modify lob(bcol) (shrink space);

Table altered.

Now it's time to check whether the v$securefile_shrink or gv$securefile_shrink dynamic views captured anything about the SecureFiles shrink operation.

-- Let's check if anything was captured by the view
SQL> desc v$securefile_shrink
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 LOB_OBJD                                           NUMBER
 SHRINK_STATUS                                      VARCHAR2(40)
 START_TIME                                         TIMESTAMP(3) WITH TIME ZONE
 END_TIME                                           TIMESTAMP(3) WITH TIME ZONE
 BLOCKS_MOVED                                       NUMBER
 BLOCKS_FREED                                       NUMBER
 BLOCKS_ALLOCATED                                   NUMBER
 EXTENTS_ALLOCATED                                  NUMBER
 EXTENTS_FREED                                      NUMBER
 EXTENTS_SEALED                                     NUMBER
 CON_ID                                             NUMBER


SQL> select * from v$securefile_shrink;

no rows selected

SQL> select * from gv$securefile_shrink;

no rows selected

SQL>

Nothing!!! Why? Why? Let me try another way (MOVE LOB) and see if that populates anything in v$securefile_shrink or gv$securefile_shrink. I repopulated the same table with 100000 rows, deleted all of them, committed, regathered stats and verified the table size, exactly as in the first test.

-- Changed the LOB SHRINK technique this time.
SQL> ALTER TABLE secure_file_tab MOVE LOB(BCOL) STORE AS (TABLESPACE test);

Table altered.


-- Let's check if anything was captured by the view
SQL> select * from v$securefile_shrink;

no rows selected

SQL> select * from gv$securefile_shrink;

no rows selected

SQL>

Nothing!!! Why? Why? Let me try one more way (SHRINK SPACE CASCADE) and see if that populates anything in v$securefile_shrink or gv$securefile_shrink. Again I repopulated the table with 100000 rows, deleted all of them, committed, regathered stats and verified the table size, exactly as in the first test.

--Let me try the other way
SQL> alter table secure_file_tab MODIFY LOB(BCOL) (SHRINK SPACE CASCADE);

Table altered.

-- Let's check if anything was captured by the view
SQL> select * from v$securefile_shrink;

no rows selected

SQL> select * from gv$securefile_shrink;

no rows selected

SQL>

Nothing!!! Why? Why? This is super strange now. Finally, I decided to check with Oracle Support and asked them to try it at their end on 21c, and at their end it was getting populated. I guess I have identified a BUG in 21.3.0.0.0!! 🙂 🙂 The case is currently with the development team, who are looking into this buggy behavior on Oracle 21.5 on Red Hat Linux 7 x86_64 … Let's see how it goes from here. I will post the solution or the cause behind this behavior soon.

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting

Prerequisite check CheckActiveFilesAndExecutables failed while applying July 2022 GI Release Update 19.16.0.0.220719, and the magic of opatchauto resume

Posted by FatDBA on August 8, 2022

Hi All,

Recently, while applying the latest (July 2022) GI Release Update 19.16.0.0.220719 on GI+DB homes, I encountered an issue where GI patching failed with an (expected) error ‘oracle.opatch.opatchsdk.OPatchException: Prerequisite check “CheckActiveFilesAndExecutables” failed’. It ended with ‘OPATCHAUTO-68061: The orchestration engine failed’. Below is exactly what happened …

[root@monkeybox patches]# /test/patch/dir/grid/OPatch/opatchauto apply /patchdir/july2022/34130714

OPatchauto session is initiated  
.....
..........

Performing prepatch operations on CRS - bringing down CRS service on home /test/patch/dir/grid
Prepatch operation log file location: /testdir/app/grid/crsdata/monkeyboxcrsconfig/hapatch_xxxxxx.log
CRS service brought down successfully on home /test/dir/grid

Start applying binary patch on home /oracledir/app/oracle/product/19c/dbhome
Binary patch applied successfully on home /oracledir/app/oracle/product/19c/dbhome

Start applying binary patch on home /test/patch/dir/grid
Failed while applying binary patches on home /test/patch/dir/grid   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Execution of [OPatchAutoBinaryAction] patch action failed, check log for more details. Failures:
Patch Target : monkeybox->/test/patch/dir/grid Type[siha]
Details: [
---------------------------Patching Failed---------------------------------
Command execution failed during patching in home: /test/patch/dir/grid, host: monkeybox.
Command failed:  /test/patch/dir/grid/OPatch/opatchauto  apply /patchdir/july2022/34130714 -oh /test/patch/dir/grid -target_type has -binary -invPtrLoc /test/patch/dir/grid/oraInst.loc -jre /test/patch/dir/grid/OPatch/jre -persistresult /test/patch/dir/grid/opatchautocfg/db/sessioninfo/sessionresult_monkeybox_siha_1.ser -analyzedresult /test/patch/dir/grid/opatchautocfg/db/sessioninfo/sessionresult_analyze_monkeybox_siha_1.ser
Command failure output:
==Following patches FAILED in apply:

Patch: /patchdir/july2022/34130714/33575402
Log: /test/patch/dir/grid/cfgtoollogs/opatchauto/core/opatch/xxxx.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Prerequisite check "CheckActiveFilesAndExecutables" failed.

Patch: /patchdir/july2022/34130714/34133642
Log: /test/patch/dir/grid/cfgtoollogs/opatchauto/core/opatch/xxxxxxxxx.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Prerequisite check "CheckActiveFilesAndExecutables" failed. >>>>>>>>>>>>>

Patch: /patchdir/july2022/34130714/34139601
Log: /test/patch/dir/grid/cfgtoollogs/opatchauto/core/opatch/xxxxxxxxx.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Prerequisite check "CheckActiveFilesAndExecutables" failed. >>>>>>>>>>>>>

Patch: /patchdir/july2022/34130714/34160635
Log: /test/patch/dir/grid/cfgtoollogs/opatchauto/core/opatch/xxxxxxxxx.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Prerequisite check "CheckActiveFilesAndExecutables" failed. >>>>>>>>>>>>>

Patch: /patchdir/july2022/34130714/34318175
Log: /test/patch/dir/grid/cfgtoollogs/opatchauto/core/opatch/xxxxxxxxx.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Prerequisite check "CheckActiveFilesAndExecutables" failed. >>>>>>>>>>>>>

After fixing the cause of failure Run opatchauto resume  >>>>>>>>>>>>>

]
OPATCHAUTO-68061: The orchestration engine failed. >>>>>>>>>>>>>
OPATCHAUTO-68061: The orchestration engine failed with return code 1
OPATCHAUTO-68061: Check the log for more details.
OPatchAuto failed.

OPatchauto session completed at xxxxxxxxx
Time taken to complete the session 8 minutes, 50 seconds

This is a classic case where patching failed because a few executables/files from the home were still in use. You can verify this in the opatchauto log for the failed patch, under the standard cfgtoollogs logging directory.

[INFO]    Prerequisite check "CheckActiveFilesAndExecutables" failed.
The details are:

Following active files/executables/libs are used by ORACLE_HOME :/test/dir/grid

/test/dir/grid/lib/libclntsh.so.19.1
/test/dir/grid/lib/libasmclntsh19.so
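When several patches fail this check, the same active-file list is repeated in each patch's log. Below is a minimal shell sketch that pulls the unique file paths out of such logs. It assumes the “Following active files/executables/libs” wording shown above. active_files_from_log is a hypothetical helper, and the log path in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# active_files_from_log: hypothetical helper that prints the unique file paths
# listed under "Following active files/executables/libs are used by ORACLE_HOME"
# in one or more OPatch logs given as arguments (or stdin when none are given).
active_files_from_log() {
  awk '/Following active files\/executables\/libs/ { inlist = 1; seen = 0; next }
       inlist && /^\//                             { print; seen = 1; next }
       inlist && seen                              { inlist = 0 }' "$@" |
    LC_ALL=C sort -u
}

# Usage (log path is a placeholder); feeds the paths straight to fuser:
#   active_files_from_log /path/to/opatch.log | xargs -r /sbin/fuser
```

A list ends at the first non-path line after at least one path, so later unrelated lines that happen to start with "/" are not picked up.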

The easiest way to fix this is to find which process is holding the file open. The fuser command does exactly that: it lists the PIDs of local processes using the specified files, directories or sockets. Let's do it!

[grid@monkeybox ~]$
[grid@monkeybox ~]$ /sbin/fuser /test/dir/grid/lib/libclntsh.so.19.1
/test/dir/grid/lib/libclntsh.so.19.1: 18199m
[grid@monkeybox ~]$
[grid@monkeybox ~]$ /sbin/fuser /test/dir/grid/lib/libasmclntsh19.so
/test/dir/grid/lib/libasmclntsh19.so: 18199m
[grid@monkeybox ~]$
[grid@monkeybox ~]$
[grid@monkeybox ~]$
[grid@monkeybox ~]$ ps -ef|grep 18199
grid     18199 13587  0 09:34 pts/2    00:00:00 /test/dir/grid/perl/bin/perl -w -I /test/dir/grid/perl/lib/5.32.0 -I /test/dir/grid/perl/lib/site_perl/5.32.0 -I /test/dir/grid/lib -I /test/dir/grid/lib/asmcmd -I /test/dir/grid/rdbms/lib/asmcmd /test/dir/grid/bin/asmcmdcore
grid     29647 16974  0 10:11 pts/3    00:00:00 grep --color=auto 13610
[grid@monkeybox ~]$
[grid@monkeybox ~]$
[grid@monkeybox ~]$ kill -9 18199
[grid@monkeybox ~]$
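Before killing anything, confirm what the process actually is. Here it was an interactive asmcmd session, and the m suffix in the fuser output means the file is mapped as a shared library. When you run fuser against several files, you can reduce its output to a unique PID list. Below is a minimal sketch, assuming the "path: PID<access letter>" format shown above. psmisc fuser writes the path and access letters to stderr and only the PIDs to stdout, so capture both with 2>&1. fuser_pids is a hypothetical helper.

```shell
#!/usr/bin/env bash
# fuser_pids: hypothetical helper that turns fuser output such as
#   /test/dir/grid/lib/libclntsh.so.19.1: 18199m
# into a de-duplicated, numerically sorted list of PIDs, one per line.
fuser_pids() {
  awk '{
         line = $0
         sub(/^.*:/, "", line)          # drop the leading "path:" label, if present
         n = split(line, f, /[ \t]+/)
         for (i = 1; i <= n; i++) {
           pid = f[i]
           gsub(/[^0-9]/, "", pid)      # strip access letters such as m, c, e, f, r
           if (pid != "") print pid
         }
       }' | sort -un
}

# Usage: inspect each process before deciding to kill it
#   /sbin/fuser /test/dir/grid/lib/libclntsh.so.19.1 /test/dir/grid/lib/libasmclntsh19.so 2>&1 |
#     fuser_pids | while read -r pid; do ps -o pid,user,args -p "$pid"; done
```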

Now that we have killed the process holding those two files open (libclntsh.so.19.1 and libasmclntsh19.so), let's resume patching from where it left off. opatchauto had already patched the DB home before it failed on the GI home. Resuming continues from that point and skips the patches already applied. For this we use ‘opatchauto resume’, which resumes a previous patching session.

opatchauto is a really powerful tool that lets you resume patching even when it was interrupted midway, whether by a server crash, a reboot or even a manual CTRL+C. Besides apply and resume, the other two regular options are rollback and version.

[root@monkeybox patches]# /test/dir/grid/OPatch/opatchauto resume

OPatchauto session is initiated at xxxxxxxxx
Session log file is .....
Resuming existing session with id xxxxxx
....
.......
...............
OPatchAuto successful.

Patching is completed successfully. Please find the summary as follows:

OPatchauto session completed at xxxxxx
Time taken to complete the session 9 minutes, 12 seconds

[root@monkeybox patches]#

Hope It Helped!
Prashant Dixit

Posted in troubleshooting

Writing custom messages to Alert Log and remembering DBMS_SYSTEM and that sweet little ksdwrt

Posted by FatDBA on August 1, 2022

Hi Guys,

Today's post is a quick one about an old package (I guess it has been there since the 8i days) and a lesser-known, underutilized feature of it.
Recently I was working on an OEM task where I had to edit the metric collection regular expressions to match the requirements. The next question asked was: ‘Let's wait until that specific incident happens in the database to see if it triggers the alert or not ….’ Really? Do we have to wait that long?

No. To test such a change, you don't need any crude method like manually editing the alert log file or using UTL_FILE. There is a built-in package called dbms_system that you can use to write your own custom messages to the alert log file. The package offers a few other routines too, but this post is about one special subprogram called ‘ksdwrt’.

Here is the signature:
dbms_system.ksdwrt(dest IN BINARY_INTEGER, tst IN VARCHAR2);

And the general syntax:
execute sys.dbms_system.ksdwrt(<dest>, to_char(sysdate)||' -- ');
where <dest> can be
1 to write to the trace file
2 to write to the alert log file
3 to write to both the trace file and the alert log file.

Example:
exec dbms_system.ksdwrt(3,'ORA-04031: This is a test error message, please ignore');
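If the message itself contains a single quote, it must be doubled inside the PL/SQL string literal, which is easy to get wrong when scripting test messages. Below is a minimal sketch of a shell helper that validates dest (1, 2 or 3, as listed above) and builds the ksdwrt call. ksdwrt_sql is hypothetical. The usage comment assumes an OS-authenticated SYSDBA login, like the `sqlplus / as sysdba` sessions used elsewhere in this blog.

```shell
#!/usr/bin/env bash
# ksdwrt_sql: hypothetical helper that prints an "exec sys.dbms_system.ksdwrt"
# call for SQL*Plus, doubling single quotes so the message is a valid literal.
#   $1 = dest (1 = trace file, 2 = alert log, 3 = both)
#   $2 = message text
ksdwrt_sql() {
  local dest="$1" msg="$2"
  case "$dest" in
    1|2|3) ;;
    *) echo "dest must be 1 (trace), 2 (alert log) or 3 (both)" >&2; return 1 ;;
  esac
  local escaped
  escaped=$(printf '%s' "$msg" | sed "s/'/''/g")
  printf "exec sys.dbms_system.ksdwrt(%s, '%s');\n" "$dest" "$escaped"
}

# Usage:
#   ksdwrt_sql 2 "ORA-04031: This is a test error message, please ignore" | sqlplus -s / as sysdba
```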

Hope It Helped!
Prashant Dixit

Posted in Basics

High stats collection time for partitioned tables after upgrade to 19c

Posted by FatDBA on July 2, 2022

Recently, while working on a database upgrade from 12c to 19c (19.15), a friend of mine encountered a strange issue on the newly upgraded 19c database: gathering stats on the full database started taking a huge amount of time. Full database statistics used to take ~3 hours, but after the upgrade the same stats collection job took close to 10 hours. The stats collection script they were using was quite simple, with minimal parameters:

EXEC DBMS_STATS.GATHER_DATABASE_STATS(ESTIMATE_PERCENT=>DBMS_STATS.AUTO_SAMPLE_SIZE,degree => 8);

During the analysis he observed that the top 3-4 in-flight SQL statements during stats collection were related to index statistics. They were doing ‘Index Fast Full Scan’, all on a few of the large partitioned tables in the database. He discussed the case with me and together we tried a few things, i.e. regathering dictionary and fixed object statistics and comparing parameters between 12c and 19c, but none of them worked. At last we enabled DBMS_STATS tracing to see what was happening under the hood. That gave us a hint when set with flag 8 (trace index stats) and flag 32768 (trace approximate NDV, number of distinct values, gathering). The traces suggested that the time was surely going into index stats, specifically the NDV or number of distinct keys calculation.

But even after that we were both totally clueless, as these tables and their dependent objects had been in the system for a very long time. So, the big question was: what's new in 19c that has slowed down stats collection?

Finally we decided to contact OCS! They responded quickly, as it's a known issue with 19c. According to them, a 19c enhancement related to index stats gathering had led to the longer stats times. It was all due to unpublished Bug 33427856, an enhancement to improve the calculation of index NDK (Number of Distinct Keys). This new feature uses the approx_count_distinct function and fully scans indexes to calculate NDK. This has a significant benefit because NDK is now accurate. It also means that gathering statistics can take longer (for example, when updating global index statistics if incremental stats are used). So, in general, this is expected behavior, since DBMS_STATS is doing more work in 19c than it did in earlier versions without the enhancement.

The solution was to switch off this new 19c index-stats feature (here, a problem) by setting the fix control that disables ‘Enhance Index NDK Statistics’, 27268249:

alter system set "_fix_control"='27268249:0';

As soon as we deleted the existing statistics and regathered them, the time dropped drastically and the job completed in under 3 hours.

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting

Exceptionally high stats collection time on FIXED OBJECTS during an upgrade …

Posted by FatDBA on June 26, 2022

Someone recently asked about a situation where they were upgrading their database to 19c. As part of their upgrade plan they were trying to gather fixed object statistics, but it was running forever and they had no clue why or where it was spending the time. This being a mandatory step, they tried several times, with the same result.

About fixed object stats: it is recommended that you regather them if you do a major database or application upgrade, implement a new module, or make changes to the database configuration. For example, if you increase the SGA size, all of the X$ tables that contain information about the buffer cache and shared pool may change significantly, such as the X$ tables used in v$buffer_pool or v$shared_pool_advice.

As for the ideal fixed object stats collection time, I'd say anything between 1 and 10 minutes is normal. Anything beyond 20 minutes, let alone hours, is abnormally high and points to a problem.

So, I was asked to take a look on an ad-hoc basis. During the analysis I found a SQL doing a count(*) on unified_audit_trail, which had been running ever since they called DBMS_STATS for FIXED OBJECTS. When asked, they told me they had enabled auditing on the database some 6 months back and hadn't purged anything since. The audit trail had grown into a behemoth of ~880 million records. I immediately offered them two approaches. Either lock the unified audit table statistics (using dbms_stats.lock_table_stats), or back up the table and purge the audit records before calling the stats gathering job again. They chose the second approach, backed up the audit table and purged the audit trail. As soon as the audit table was purged, the fixed object stats collection finished in ~3 minutes.

This was the situation and what we did …

SQL> select * from dba_audit_mgmt_last_arch_ts;

AUDIT_TRAIL RAC_INSTANCE LAST_ARCHIVE_TS
-------------------- ------------ ------------------------------
STANDARD AUDIT TRAIL 0 22-MAY-22 06.00.00.000000 AM +00:00


SQL> select count(*) from aud$;

COUNT(*)
----------
885632817

BEGIN
DBMS_AUDIT_MGMT.clean_audit_trail(
audit_trail_type => DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD,
use_last_arch_timestamp => FALSE);
END;
/

SQL> select count(*) from aud$;

COUNT(*)
----------
0


SQL> SET TIMING ON
SQL> BEGIN
DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;
END;
/

Elapsed: 00:03:10.81
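When timings are captured with SET TIMING ON, as above, a tiny helper can turn the SQL*Plus `Elapsed:` line into seconds. It can then compare the result with the rule of thumb from earlier in this post: up to ~10 minutes is normal, beyond ~20 minutes is abnormal. This is only a Python sketch. The function names are hypothetical, and the "borderline" label for the 10-20 minute gap is an assumption, since the post gives no guidance for that range.

```python
import re

# SQL*Plus prints "Elapsed: HH:MI:SS.FF" after each call when SET TIMING ON is active.
ELAPSED_RE = re.compile(r"Elapsed:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")

def elapsed_seconds(line: str) -> float:
    """Convert a SQL*Plus 'Elapsed:' line into seconds."""
    match = ELAPSED_RE.search(line)
    if match is None:
        raise ValueError(f"not a SQL*Plus timing line: {line!r}")
    hours, minutes = int(match.group(1)), int(match.group(2))
    seconds = float(match.group(3))
    return hours * 3600 + minutes * 60 + seconds

def classify_fixed_stats_run(seconds: float) -> str:
    """Rule of thumb from the post: <= 10 min normal, > 20 min abnormal."""
    minutes = seconds / 60
    if minutes <= 10:
        return "normal"
    if minutes <= 20:
        return "borderline"  # assumption: the post does not cover 10-20 minutes
    return "abnormal"

if __name__ == "__main__":
    secs = elapsed_seconds("Elapsed: 00:03:10.81")
    print(f"{secs:.2f} s -> {classify_fixed_stats_run(secs)}")
```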

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting

What is that strange looking wait event ‘TCP Socket (KGAS)’ in AWR report ?

Posted by FatDBA on June 13, 2022

Hi Guys,

Recently someone shared an AWR report from a production 19c system with me. He was very worried about a strange-looking wait event called ‘TCP Socket (KGAS)’. The event showed a very high average wait time of 7863.68 ms (7.86 seconds) and was consuming around 98.0% of total DB Time.

Luckily, I had encountered something similar in the past at a customer site. The application team there was unable to send mail through a DBMS_SCHEDULER job, and the job was stuck for a long time on the “TCP Socket (KGAS)” wait event. The problem was not with the scheduler but with the underlying network or a third-party application.

So today’s post is all about this wait event: what it is, how to resolve it, etc.

KGAS is a component in the server that handles TCP/IP sockets. It is typically used on dedicated connections, e.g. by PL/SQL built-in packages such as UTL_HTTP and UTL_TCP.

A session on this wait is waiting for an external host to provide requested data over a network socket. The time that this wait event tracks does not indicate a problem, and even a long wait time is not a reason to contact Oracle Support. It naturally takes time for data to flow between hosts over a network, and for the remote side of an application to process any request made to it. An application that communicates with a remote host must wait until the data it will read has arrived.
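Here is what a socket wait looks like from the caller's side, as a small self-contained Python sketch (not Oracle code). A local "remote host" is simulated with `socket.socketpair()` and a thread, and it delays its reply. The caller simply blocks in `recv()` for that long. The delay and the HTTP-like payload are made up for illustration. Treating this as analogous to a session inside UTL_TCP/UTL_HTTP is an assumption about the general mechanism, not a description of KGAS internals.

```python
import socket
import threading
import time

def remote_peer(sock: socket.socket, delay: float, payload: bytes) -> None:
    """Stand-in for the remote host: read the request, 'work' for `delay` seconds, reply."""
    sock.recv(1024)       # receive the request
    time.sleep(delay)     # remote processing time plus network latency
    sock.sendall(payload)

def timed_request(delay: float = 0.2) -> tuple[bytes, float]:
    """Send a request and measure how long the caller blocks until the reply arrives."""
    local, remote = socket.socketpair()
    with local, remote:
        reply_payload = b"HTTP/1.1 200 OK\r\n"
        peer = threading.Thread(target=remote_peer, args=(remote, delay, reply_payload))
        peer.start()
        start = time.perf_counter()
        local.sendall(b"GET / HTTP/1.1\r\n\r\n")
        reply = local.recv(1024)  # caller sits idle here, like a session in a socket wait
        waited = time.perf_counter() - start
        peer.join()
    return reply, waited

if __name__ == "__main__":
    reply, waited = timed_request(0.2)
    print(f"got {reply!r} after waiting {waited:.3f} s")
```

The wait is set entirely by the peer's delay. Nothing on the calling side is busy, which is why the fix usually lies with the remote system or the network.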

From an application/network point of view, delays in establishing a network connection may produce unwanted delays for users. We should make sure that the application makes network calls efficiently and that the network is working well such that these delays are minimized.

From the database point of view, these waits can safely be ignored, because the wait event does not represent a database issue. It merely reports the total elapsed time for a network connection to be established or for data to arrive over the network. The database waits for the connection to be established and reports the time taken. It’s always good to check with the network team or the third-party application vendors to investigate the underlying socket.

System-wide waits are a different matter. If the time spent waiting on this event is significant, determine which sessions are showing the wait and drill into what those sessions are doing. The wait is usually related to whatever application code is executing, e.g. the part of the application that uses UTL_HTTP or similar and is experiencing waits. The following statement shows which sessions may be worth tracing:

SELECT sid, total_waits, time_waited
FROM v$session_event WHERE event='TCP Socket (KGAS)' and total_waits>0 ORDER BY 3,2;
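TIME_WAITED in V$SESSION_EVENT is reported in centiseconds (hundredths of a second). The query above sorts ascending, so the heaviest waiters come out last. The Python sketch below converts the units and ranks the rows heaviest first. The class name, function name and sample rows are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SessionEvent:
    sid: int
    total_waits: int
    time_waited_cs: int  # V$SESSION_EVENT.TIME_WAITED is reported in centiseconds

    @property
    def time_waited_s(self) -> float:
        return self.time_waited_cs / 100

    @property
    def avg_wait_ms(self) -> float:
        # centiseconds -> milliseconds is a factor of 10
        return self.time_waited_cs * 10 / self.total_waits

def worst_sessions(rows, top=5):
    """Heaviest waiters first: sort by time waited, then total waits, both descending."""
    waited = [r for r in rows if r.total_waits > 0]
    return sorted(waited, key=lambda r: (r.time_waited_cs, r.total_waits), reverse=True)[:top]

if __name__ == "__main__":
    # Hypothetical rows, as if fetched with the query above
    rows = [SessionEvent(412, 3, 150), SessionEvent(87, 12, 9436), SessionEvent(955, 0, 0)]
    for r in worst_sessions(rows):
        print(f"SID {r.sid}: {r.total_waits} waits, {r.time_waited_s:.2f} s, avg {r.avg_wait_ms:.2f} ms")
```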

To reduce these waits, or to help find the origin of the socket operations, try the following:

  • Check the current SQL/module/action in V$SESSION for sessions that are waiting on the event, at the time they are waiting. This helps identify any common area of application code behind the waits.
  • Get an ERRORSTACK level 3 dump of some sessions waiting on the event. This should help show the exact PLSQL and C call stacks invoking the socket operation if the dump is taken when the session is waiting. Customers may need assistance from Oracle Support in order to get and interpret such a dump but it can help pinpoint the relevant application code.
  • Trace the sessions incurring the waits, including wait tracing, to place the waits in the context of the code executing around them, e.g. using event 10046 level 8 or DBMS_MONITOR.SESSION_TRACE_ENABLE.
  • Use DBA_DEPENDENCIES to find any application packages which may ultimately be using UTL_HTTP or UTL_TCP underneath for some operation.
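The DBA_DEPENDENCIES tip above amounts to walking the dependency graph backwards from UTL_HTTP/UTL_TCP. In SQL this is usually a hierarchical (CONNECT BY) query. The Python sketch below shows the same breadth-first walk over rows shaped like (OWNER, NAME, REFERENCED_OWNER, REFERENCED_NAME). It includes PUBLIC-owned targets, on the assumption that application code may reference the packages through their public synonyms. The sample objects are hypothetical.

```python
from collections import defaultdict, deque

def dependents_of(targets, dependencies):
    """
    dependencies: iterable of (owner, name, referenced_owner, referenced_name),
    i.e. the shape of DBA_DEPENDENCIES rows.
    Returns every (owner, name) that reaches one of `targets` directly or indirectly.
    """
    # Reverse edges: referenced object -> objects that reference it
    referenced_by = defaultdict(set)
    for owner, name, ref_owner, ref_name in dependencies:
        referenced_by[(ref_owner, ref_name)].add((owner, name))

    found = set()
    queue = deque(targets)
    while queue:
        obj = queue.popleft()
        for dependent in referenced_by[obj]:
            if dependent not in found:  # the visited set also guards against cycles
                found.add(dependent)
                queue.append(dependent)
    return found

if __name__ == "__main__":
    deps = [
        ("APP", "MAILER", "PUBLIC", "UTL_HTTP"),      # hypothetical application objects
        ("APP", "ORDERS_API", "APP", "MAILER"),
        ("APP", "REPORTS", "SYS", "DBMS_OUTPUT"),
    ]
    targets = [("SYS", "UTL_HTTP"), ("PUBLIC", "UTL_HTTP"), ("SYS", "UTL_TCP"), ("PUBLIC", "UTL_TCP")]
    print(sorted(dependents_of(targets, deps)))
```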

Example:
Enable wait tracing in a session on a dedicated connection, run the code that makes the network call, and then check the resulting trace file for “TCP Socket (KGAS)” waits:

alter session set events '10046 trace name context forever, level 8';
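Once the trace file is written, its WAIT lines can be totalled per event; tkprof remains the standard tool for that. The Python sketch below assumes the usual 10046 line shape `WAIT #<cursor>: nam='<event>' ela= <n> ...`, with `ela` in microseconds. The function name and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# 10046 level 8 wait lines look like:
#   WAIT #<cursor>: nam='<event name>' ela= <microseconds> <p1..p3> obj#=... tim=...
WAIT_RE = re.compile(r"^WAIT #\d+: nam='(?P<event>[^']+)' ela=\s*(?P<ela>\d+)")

def summarize_waits(lines):
    """Return {event: (wait_count, total_seconds)}; ela values are in microseconds."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            bucket = totals[match.group("event")]
            bucket[0] += 1
            bucket[1] += int(match.group("ela"))
    return {event: (count, micros / 1_000_000) for event, (count, micros) in totals.items()}
```

Usage: `summarize_waits(open("<tracefile>.trc"))` returns a dict you can sort by total seconds to see how much of the session's time went to “TCP Socket (KGAS)”.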

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting

Do you really require physical PLAN_TABLE in your database ?

Posted by FatDBA on May 9, 2022

Recently, during a performance taskforce on a newly migrated system, the customer’s DBA asked me to use one of their legacy tools to get a better idea of the database’s performance. One of their expert DBAs had written it to collect performance metrics. I’d seen reports collected with the same tool on other systems, and they were good. But calling the script/tool raised a runtime exception with the error ‘PLAN_TABLE physical table present in user schema SYS‘. The error means the executing user (SYS) owns a table named PLAN_TABLE. That table is not the Oracle-seeded GTT (Global Temporary Table) plan table owned by SYS, which is the PLAN_TABLE$ table exposed through the PUBLIC synonym PLAN_TABLE.

This was a little odd to the customer DBAs, as they had never seen this error with the tool. Now that it was there, the questions were: Shall we drop the PLAN_TABLE? Is it risky to do that? If we drop it, will it impact execution plan generation? Is there any other risk associated with dropping PLAN_TABLE?

Next, when I queried DBA_OBJECTS, I saw the table in the SYS schema. This system was migrated from 12.2 to 19c, and the table should not be there, because it existed by default only in older versions of Oracle. The object’s creation date matched the time the database was upgraded. It appeared that after the upgrade/migration someone (maybe an old 8i/9i DBA) had explicitly run utlxplan.sql, and that had created the table. Now the question is: is it safe to drop this table?

SQL> select owner, object_name, object_type, created from dba_objects where object_name like '%PLAN_TABLE%' 
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;

OWNER      OBJECT_NAME          OBJECT_TYPE             CREATED
---------- -------------------- ----------------------- ---------
PUBLIC     PLAN_TABLE           SYNONYM                 17-APR-19
PUBLIC     SQL_PLAN_TABLE_TYPE  SYNONYM                 17-APR-19
PUBLIC     PLAN_TABLE_OBJECT    SYNONYM                 17-APR-19
SYS        SQL_PLAN_TABLE_TYPE  TYPE                    17-APR-19
SYS        PLAN_TABLE           TABLE                   13-MAR-22     ----->>>>> OLD PLAN_TABLE created during the UPGRADE 
SYS        SQL_PLAN_TABLE_TYPE  TYPE                    17-APR-19
SYS        PLAN_TABLE_OBJECT    TYPE                    17-APR-19
SYS        PLAN_TABLE$          TABLE                   17-APR-19
SYS        PLAN_TABLE_OBJECT    TYPE BODY               17-APR-19

9 rows selected.





-- Look at the difference between the two, PLAN_TABLE$ is a GLOBAL TEMP TABLE and old PLAN_TABLE is not.
SQL> SELECT TABLE_NAME, owner, temporary from dba_tables where table_name like '%PLAN_TABLE%' 
AND owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;

TABLE_NAME                     OWNER                T
------------------------------ -------------------- -
PLAN_TABLE                     SYS                  N
PLAN_TABLE$                    SYS                  Y     ---> Y represents GTT


Let’s first see what’s inside PLAN_TABLE and what its purpose is. I will generate a few SQL execution plans and observe the changes that happen in PLAN_TABLE.

-- Table columns and details
SQL> desc plan_table
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 STATEMENT_ID                                       VARCHAR2(30)
 PLAN_ID                                            NUMBER
 TIMESTAMP                                          DATE
 REMARKS                                            VARCHAR2(4000)
 OPERATION                                          VARCHAR2(30)
 OPTIONS                                            VARCHAR2(255)
 OBJECT_NODE                                        VARCHAR2(128)
 OBJECT_OWNER                                       VARCHAR2(128)
 OBJECT_NAME                                        VARCHAR2(128)
 OBJECT_ALIAS                                       VARCHAR2(261)
 OBJECT_INSTANCE                                    NUMBER(38)
 OBJECT_TYPE                                        VARCHAR2(30)
 OPTIMIZER                                          VARCHAR2(255)
 SEARCH_COLUMNS                                     NUMBER
 ID                                                 NUMBER(38)
 PARENT_ID                                          NUMBER(38)
 DEPTH                                              NUMBER(38)
 POSITION                                           NUMBER(38)
 COST                                               NUMBER(38)
 CARDINALITY                                        NUMBER(38)
 BYTES                                              NUMBER(38)
 OTHER_TAG                                          VARCHAR2(255)
 PARTITION_START                                    VARCHAR2(255)
 PARTITION_STOP                                     VARCHAR2(255)
 PARTITION_ID                                       NUMBER(38)
 OTHER                                              LONG
 DISTRIBUTION                                       VARCHAR2(30)
 CPU_COST                                           NUMBER(38)
 IO_COST                                            NUMBER(38)
 TEMP_SPACE                                         NUMBER(38)
 ACCESS_PREDICATES                                  VARCHAR2(4000)
 FILTER_PREDICATES                                  VARCHAR2(4000)
 PROJECTION                                         VARCHAR2(4000)
 TIME                                               NUMBER(38)
 QBLOCK_NAME                                        VARCHAR2(128)
 OTHER_XML                                          CLOB



-- Let me check other stats or details about the PLAN_TABLE
SQL> select index_name, table_name from dba_indexes where table_name='PLAN_TABLE' 
 And owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;

INDEX_NAME                                         TABLE_NAME
-------------------------------------------------- ------------------------------
SYS_IL0000078251C00036$$                           PLAN_TABLE


SQL> select table_name, owner, TABLESPACE_NAME from dba_tables where table_name like '%PLAN_TABLE%' 
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;

TABLE_NAME                     OWNER                          TABLESPACE_NAME
------------------------------ ------------------------------ ------------------------------
PLAN_TABLE                     SYS                            SYSTEM
PLAN_TABLE$                    SYS






SQL>

-- The OLD PLAN_TABLE is empty at the moment
SQL> select count(*) from plan_table;

  COUNT(*)
----------
         0


-- Let's explain a test SQL to see what happens to the OLD PLAN_TABLE
SQL> explain plan for select count(*) from bigtab;

Explained.

-- And immediately 3 rows, one per plan line ID, are added to it
SQL> select count(*) from plan_table;

  COUNT(*)
----------
         3



-- Three entries, one for each of the 3 IDs below.
SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2140185107

---------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |     1 |    69   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |        |     1 |            |          |
|   2 |   TABLE ACCESS FULL| BIGTAB | 72358 |    69   (0)| 00:00:01 |
---------------------------------------------------------------------

9 rows selected.
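The three rows are the plan lines, linked through ID and PARENT_ID, and DBMS_XPLAN indents each line under its parent. Below is a minimal Python sketch of that tree walk. PlanRow is a hypothetical type carrying only a few of the PLAN_TABLE columns described above, and the output layout is a simplification, not DBMS_XPLAN's real format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanRow:
    id: int
    parent_id: Optional[int]   # NULL (None) for the root line, ID 0
    operation: str
    options: str = ""
    object_name: str = ""

def render_plan(rows):
    """Indent each plan line one space per level below its parent."""
    children = {}
    for row in rows:
        children.setdefault(row.parent_id, []).append(row)
    lines = []

    def walk(parent_id, depth):
        for row in sorted(children.get(parent_id, []), key=lambda r: r.id):
            label = " ".join(part for part in (row.operation, row.options) if part)
            lines.append(f"{row.id:>3} {' ' * depth}{label} {row.object_name}".rstrip())
            walk(row.id, depth + 1)

    walk(None, 0)
    return lines

if __name__ == "__main__":
    rows = [PlanRow(0, None, "SELECT STATEMENT"),
            PlanRow(1, 0, "SORT", "AGGREGATE"),
            PlanRow(2, 1, "TABLE ACCESS", "FULL", "BIGTAB")]
    print("\n".join(render_plan(rows)))
```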


-- But the new PLAN_TABLE$ is still empty

SQL> select count(*) from PLAN_TABLE$ ;

  COUNT(*)
----------
         0

So the question is: is it safe to drop this PLAN_TABLE?

SQL> drop table PLAN_TABLE;

Table dropped.

SQL>

-- And the table is gone
SQL> select owner, object_name, object_type, created from dba_objects where object_name like '%PLAN_TABLE%' 
and owner not in ('SQLTXPLAIN','SQLTXADMIN') ORDER BY 1;

OWNER                          OBJECT_NAME          OBJECT_TYPE             CREATED
------------------------------ -------------------- ----------------------- ---------
PUBLIC                         PLAN_TABLE           SYNONYM                 17-APR-19
PUBLIC                         SQL_PLAN_TABLE_TYPE  SYNONYM                 17-APR-19
PUBLIC                         PLAN_TABLE_OBJECT    SYNONYM                 17-APR-19
SYS                            PLAN_TABLE_OBJECT    TYPE BODY               17-APR-19
SYS                            SQL_PLAN_TABLE_TYPE  TYPE                    17-APR-19
SYS                            PLAN_TABLE_OBJECT    TYPE                    17-APR-19
SYS                            PLAN_TABLE$          TABLE                   17-APR-19
SYS                            SQL_PLAN_TABLE_TYPE  TYPE                    17-APR-19

8 rows selected.

Now that the table is gone, let's check whether we are still able to generate an execution plan.

SQL>
SQL> explain plan for select count(*) from bigtab;

Explained.

SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2140185107

---------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |     1 |    69   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |        |     1 |            |          |
|   2 |   TABLE ACCESS FULL| BIGTAB | 72358 |    69   (0)| 00:00:01 |
---------------------------------------------------------------------

9 rows selected.


SQL> select count(*) from plan_table$;

  COUNT(*)
----------
         3

And yes, no issues at all. The plan rows now land in PLAN_TABLE$, which is exposed through the PUBLIC SYNONYM PLAN_TABLE. So it’s totally safe to drop a leftover PLAN_TABLE from your schema if it still exists, because Oracle now provides a public synonym for the same purpose.
WARNING: Don’t drop PLAN_TABLE$ or the PLAN_TABLE public synonym; both need to exist for the new PLAN_TABLE to work properly.
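Why did the plans move? An unqualified name is resolved against the current schema first, and only then through a PUBLIC synonym. So SYS's own PLAN_TABLE shadowed the public synonym until it was dropped. The Python sketch below is a deliberately simplified model of that precedence. It ignores private synonyms and other resolution rules, and it is illustrative only, not Oracle's actual algorithm.

```python
def resolve(name, current_schema, schema_objects, public_synonyms):
    """
    Simplified name resolution for an unqualified table name:
    an object in the current schema wins, otherwise a PUBLIC synonym is followed.
    schema_objects:  set of (owner, object_name)
    public_synonyms: dict of synonym_name -> (owner, object_name)
    """
    name = name.upper()
    if (current_schema, name) in schema_objects:
        return (current_schema, name)
    if name in public_synonyms:
        return public_synonyms[name]
    raise LookupError(f"ORA-00942: table or view does not exist ({name})")

if __name__ == "__main__":
    synonyms = {"PLAN_TABLE": ("SYS", "PLAN_TABLE$")}
    before = {("SYS", "PLAN_TABLE"), ("SYS", "PLAN_TABLE$")}
    after = {("SYS", "PLAN_TABLE$")}  # after DROP TABLE PLAN_TABLE
    print(resolve("plan_table", "SYS", before, synonyms))
    print(resolve("plan_table", "SYS", after, synonyms))
```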

Hope It Helped!
Prashant Dixit

Posted in Basics, troubleshooting

An interesting case of ‘enq: CR – block range reuse ckpt’, CKPT blocking user sessions …

Posted by FatDBA on May 2, 2022

Hi All,

Last week we faced an interesting issue with one of our production systems, which had recently been migrated from Oracle 12.2 to 19.15. The setup was running on a VMware machine with limited resources. It all started when the application team began reporting slowness in their daily scheduled jobs and other ad-hoc operations. When we checked at the database layer, it was all the ‘enq: CR – block range reuse ckpt‘ wait event. The same can be seen in the ORATOP output below, where BLOCKER ID 3817 is CKPT, the checkpoint process.

The strange part was that the blocker was the CKPT process, and it was all for one particular SQL ID (an INSERT operation); see the next oratop screenshot below.

As for this wait event (wait class Other): it typically appears just before you delete or truncate a table, where a segment-level checkpoint is needed. The checkpoint keeps the blocks in the buffer cache consistent with what’s on disk. By definition, this wait event happens due to contention on blocks caused by multiple processes trying to update the same blocks. That might point to an application logic issue causing a concurrency bottleneck. Interestingly, though, this was happening on a simple INSERT operation, not a DELETE or TRUNCATE.

The same can be seen in the AWR and ASH reports too! There were CBC (cache buffers chains) latch and latch free events along with ‘enq: CR – block range reuse ckpt‘, but the initial focus was on understanding the event and its causes. The ‘direct path read temp‘ waits were caused by a couple of SELECT statements, which we resolved by attaching better plans to those SQLs.

Wait event source (from ASH)

The SQL text was quite simple, an INSERT INTO statement.

 INSERT INTO xx_xxx_xx(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) VALUES (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)

I first tried the CKPT process traces, to see whether anything interesting or useful had been caught in the process logs or traces. The trace was short and contained only some strange, obscure-looking content. It did at least tell us that the RBR reuse block checkpoint had failed due to the enqueue, and that its entry had failed due to an abandoned parent. Still, that didn’t help us much, and we remained unsure of the cause.

--> Info in CKPT trace file ---> XXXXX_ckpt_110528.trc.
RBR_CKPT: adding rbr ckpt failed for 65601 due to enqueue
RBR_CKPT: rbr mark the enqueue for 65601 as failed
RBR_CKPT: adding ckpt entry failed due to abandoned parent 0x1b57b4a88
RBR_CKPT: rbr mark the enqueue for 65601 as failed

A few things were logged in the alert.log:
- multiple deadlocks (ORA-00060)
- too many parse errors (error=942, i.e. ORA-00942) for one SELECT statement
- several ‘Checkpoint not complete’ messages (log switching was high, >35)

-- deadlocks in alert log.
Errors in file /opt/u01/app/oracle/diag/rdbms/xxxxxx/xxxx/trace/xxxx_ora_73010.trc:
2022-04-19T13:41:22.489551+05:30 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ 
ORA-00060: Deadlock detected. See Note 60.1 at My Oracle Support for Troubleshooting ORA-60 Errors. More info in file /opt/u01/app/oracle/diag/rdbms/pwfmfsm/PWFMFSM/trace/PWFMFSM_ora_73010.trc.

-- From systemstatedumps
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph: 
------------Blocker(s)----------- ------------Waiter(s)------------
Resource Name process session holds waits serial process session holds waits serial
TX-03AC0014-0000B33A-00000000-00000000 1562 691 X 1632 946 5441 X 56143
TX-01460020-0001A5C2-00000000-00000000 946 5441 X 56143 1562 691 X 1632



-- too many parse errors
2022-04-19T13:57:26.261176+05:30
WARNING: too many parse errors, count=2965 SQL hash=0x8ce1e2ff ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
PARSE ERROR: ospid=68036, error=942 for statement:
2022-04-19T13:57:26.272805+05:30
SELECT * FROM ( SELECT xxxxxxxxxxxxxxxxxxxxxx FROM xxxxxxxxxxxxxxxxxxxxxx AND xxxxxxxxxxxxxxxxxxxxxx ORDER BY xxxxxxxxxxxxxxxxxxxxxx ASC ) WHERE ROWNUM <= 750
Additional information: hd=0x546ba1be8 phd=0x61edab798 flg=0x20 cisid=113 sid=113 ciuid=113 uid=113 sqlid=ccmkzhy6f3srz
...Current username=xxxxxxxxxxxxxxxxxxxxxx
...Application: xxxxxxxxxxxxxxxxxxxxxx.exe Action:


-- Checkpoint incomplete
2022-04-19T15:03:16.964470+05:30 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
Thread 1 cannot allocate new log, sequence 186456
Checkpoint not complete
Current log# 11 seq# 186455 mem# 0: +ONLINE_REDO/xxxxxxxxxxxxxxxxxxxxxx/ONLINELOG/group_11.256.1087565999
2022-04-19T15:03:17.785113+05:30
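A deadlock graph is a wait-for graph. Each row says which session holds a resource and which session is waiting for it, and a deadlock is a cycle. The Python sketch below parses the two flattened rows from the trace above and follows waiter-to-blocker links to find the cycle. It assumes the 9-field layout seen in this excerpt, with empty columns dropped. Real trace files may format the graph differently, and the function names are hypothetical.

```python
def parse_deadlock_graph(lines):
    """Turn flattened 'Deadlock graph' rows into {waiter_sid: (blocker_sid, resource)}.

    Assumed fields: resource, blocker process, blocker session, held mode, blocker serial,
    waiter process, waiter session, requested mode, waiter serial.
    """
    edges = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip headers and anything not in the assumed layout
        resource, blocker_sid, waiter_sid = fields[0], int(fields[2]), int(fields[6])
        edges[waiter_sid] = (blocker_sid, resource)
    return edges

def find_cycle(edges):
    """Follow waiter -> blocker links; return the first cycle of session ids found, or []."""
    for start in edges:
        path, seen = [], set()
        sid = start
        while sid in edges and sid not in seen:
            seen.add(sid)
            path.append(sid)
            sid = edges[sid][0]
        if sid in seen:
            return path[path.index(sid):]
    return []

if __name__ == "__main__":
    graph = [
        "TX-03AC0014-0000B33A-00000000-00000000 1562 691 X 1632 946 5441 X 56143",
        "TX-01460020-0001A5C2-00000000-00000000 946 5441 X 56143 1562 691 X 1632",
    ]
    print(find_cycle(parse_deadlock_graph(graph)))
```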

But the alert log did not give us any concrete evidence or reasons for CKPT blocking user sessions. So next we generated HANGANALYZE and SYSTEMSTATE dumps to understand, through the wait chains, what was happening under the hood. We noticed a few interesting things there:

  • Wait chain 1, where a session running a SELECT statement was waiting on ‘buffer busy waits‘ for the “bitmapped file space header” block. This refers to the space management blocks (nothing to do with bitmap indexes).
  • Wait chain 2, where a session was found waiting on the ‘enq: CR – block range reuse ckpt‘ event. It was blocked by the CKPT process (3817), which in turn was waiting on ‘control file sequential read‘.
  • Wait chain 4, where SID 1670 was found waiting on ‘buffer busy waits‘ for the ‘bitmapped file space header’ block. Its blocker, SID 4681, was running the INSERT statement, was waiting on the same block, and was blocking 101 sessions.
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 46794
process id: 2014, oracle@monkeymachine.prod.fdt.swedish.se
session id: 15
session serial #: 19322
module name: 0 (xxxx.exe)
}
is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 2 min 20 sec
timeout after: never
wait id: 10224
blocking: 0 sessions
current sql_id: 1835193535
current sql: SELECT * FROM ( SELECT xxxxx FROM task JOIN request ON xxxxx = xxxxx JOIN xxxxx ON xxxxx = xxxxx JOIN c_task_assignment_view ON xxxxx = xxxxx
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 23090
process id: 3231, oracle@monkeymachine.prod.fdt.swedish.se
session id: 261
session serial #: 39729
module name: 0 (xxxx.exe)
}
which is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄  bitmapped file space header
time in wait: 18.999227 sec ◄◄◄
timeout after: never
wait id: 47356
blocking: 25 sessions ◄◄◄
current sql_id: 0
current sql: <none>
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-read()+14<-snttread()+16<-nttfprd()+354<-nsbasic_brc()+399<-nioqrc()+438<-opikndf2()+999<-opitsk()+910<-opiino()+936<-opiodr()+1202<-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
wait history:
* time between current wait and wait #1: 0.000084 sec
1. event: 'SQL*Net message from client'
time waited: 0.008136 sec
wait id: 47355 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #1 and #2: 0.000043 sec
2. event: 'SQL*Net message to client'
time waited: 0.000002 sec
wait id: 47354 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #2 and #3: 0.000093 sec
3. event: 'SQL*Net message from client'
time waited: 2.281674 sec
wait id: 47353 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
}
and may or may not be blocked by another session.
.
.
.
Chain 2:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 10122
process id: 2850, oracle@monkeymachine.prod.fdt.swedish.se
session id: 97
session serial #: 47697
module name: 0 (xxxx.exe)
}
is waiting for 'enq: CR - block range reuse ckpt' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'name|mode'=0x43520006
p2: '2'=0x10b22
p3: '0'=0x1
time in wait: 2.018004 sec
timeout after: never
wait id: 81594
blocking: 0 sessions
current sql_id: 1335044282
current sql: <none>
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 110528
process id: 24, oracle@monkeymachine.prod.fdt.swedish.se
session id: 3817 ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄  CKPT background process
session serial #: 46215
}
which is waiting for 'control file sequential read' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'file#'=0x0
p2: 'block#'=0x11e
p3: 'blocks'=0x1
px1: 'disk number'=0x4
px2: 'au'=0x34
px3: 'offset'=0x98000
time in wait: 0.273981 sec
timeout after: never
wait id: 17482450
blocking: 45 sessions ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
current sql_id: 0
current sql: <none>
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2199<-kslwaitctx()+205<-ksarcv()+320<-ksbabs()+602<-ksbrdp()+1167<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
wait history:
* time between current wait and wait #1: 0.000012 sec
1. event: 'control file sequential read'
time waited: 0.005831 sec
wait id: 17482449 p1: 'file#'=0x0
p2: 'block#'=0x11
p3: 'blocks'=0x1
* time between wait #1 and #2: 0.000012 sec
2. event: 'control file sequential read'
time waited: 0.011667 sec
wait id: 17482448 p1: 'file#'=0x0
p2: 'block#'=0xf
p3: 'blocks'=0x1
* time between wait #2 and #3: 0.000017 sec
3. event: 'control file sequential read'
time waited: 0.009160 sec
wait id: 17482447 p1: 'file#'=0x0
p2: 'block#'=0x1
p3: 'blocks'=0x1
}
.
.
.
Chain 4:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 46479
process id: 1036, oracle@monkeymachine.prod.fdt.swedish.se
session id: 1670
session serial #: 6238
module name: 0 (xxxx.exe)
}
is waiting for 'buffer busy waits' with wait info:
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄  bitmapped file space header
time in wait: 18.954206 sec ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
timeout after: never
wait id: 20681
blocking: 0 sessions
current sql_id: 343919375
current sql: SELECT * FROM ( SELECT xxxxx FROM task JOIN request ON xxxxx = xxxxx JOIN xxxxx ON xxxxx = xxxxx JOIN c_task_assignment_view ON xxxxx = xxxxx
.
.
.
and is blocked by
=> Oracle session identified by:
{
instance: 1 (kpkpkpkpk.kpkpkpkpk)
os id: 44958
process id: 523, oracle@monkeymachine.prod.fdt.swedish.se
session id: 4681
session serial #: 41996
module name: 0 (xxxx.exemonkeymachine.prod.fdt.swedish.se (TNS)
}
which is waiting for 'buffer busy waits' with wait info: ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
{
p1: 'file#'=0xca
p2: 'block#'=0x2
p3: 'class#'=0xd ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄ bitmapped file space header
time in wait: 18.959429 sec ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
timeout after: never
wait id: 153995
blocking: 101 sessions ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
current sql_id: 343919375
current sql:  INSERT INTO xx_xxx_xx(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) VALUES (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+223<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2199<-kslwaitctx()+205<-ksqcmi()+21656<-ksqcnv()+809<-ksqcov()+95<-kcbrbr_int()+2476<-kcbrbr()+47<-ktslagsp()+2672<-ktslagsp_main()+945<-kdliAllocCache()+37452<-kdliAllocBlocks()+1342<-kdliAllocChunks()+471<-kdliWriteInit()+1249<-kdliWriteV()+967<-kdli_fwritev()+904<-kdlxNXWrite()+577<-kdlx_write()+754<-kdlxdup_write1()+726<-kdlwWriteCallbackOld_pga()+1982<-kdlw_write()+1321<-kdld_write()+410<-kdl
wait history:
* time between current wait and wait #1: 0.756195 sec
1. event: 'direct path write temp'
time waited: 0.406543 sec
wait id: 153994 p1: 'file number'=0xc9
p2: 'first dba'=0xb28fc
p3: 'block cnt'=0x4
* time between wait #1 and #2: 0.000001 sec
2. event: 'ASM IO for non-blocking poll'
time waited: 0.000000 sec
wait id: 153993 p1: 'count'=0x4
p2: 'where'=0x2
p3: 'timeout'=0x0
* time between wait #2 and #3: 0.000002 sec
3. event: 'ASM IO for non-blocking poll'
time waited: 0.000001 sec
wait id: 153992 p1: 'count'=0x4
p2: 'where'=0x2
p3: 'timeout'=0x0
}
and may or may not be blocked by another session.
.
.
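The P1 value of the CR enqueue wait above, 'name|mode'=0x43520006, packs the two-letter lock type into its high bytes and the requested lock mode into its low 16 bits. It therefore decodes to lock type CR requested in mode 6 (exclusive). The dumps also print file# and block# in hex. Below is a small Python sketch of both conversions; the function names are hypothetical.

```python
LOCK_MODES = {
    0: "none", 1: "N (null)", 2: "SS (row share)", 3: "SX (row exclusive)",
    4: "S (share)", 5: "SSX (share row exclusive)", 6: "X (exclusive)",
}

def decode_enqueue_p1(p1: int) -> tuple[str, int, str]:
    """Split an enqueue wait's P1 ('name|mode') into lock type, mode number and mode name."""
    lock_type = chr((p1 >> 24) & 0xFF) + chr((p1 >> 16) & 0xFF)  # two ASCII letters
    mode = p1 & 0xFFFF                                         # low 16 bits
    return lock_type, mode, LOCK_MODES.get(mode, "unknown")

def hex_param(value: str) -> int:
    """Convert a dump value such as '0xca' into a decimal number."""
    return int(value, 16)

if __name__ == "__main__":
    print(decode_enqueue_p1(0x43520006))             # the CR enqueue from Chain 2
    print(hex_param("0xca"), hex_param("0x2"))       # file# and block# from Chains 1 and 4
```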

We wanted to try a couple of hidden parameters that control fast object-level truncation and checkpointing, as they had helped us a lot in similar scenarios in the past. But we had to involve Oracle Support first. After carefully analyzing the situation, they agreed and wanted us to try them, as they had started suspecting the issue was an aftermath of a known 19c bug.

[oracle@oracleontario ~]$ !sql
sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sun Apr 24 00:24:47 2022
Version 19.15.0.0.0
Copyright (c) 1982, 2022, Oracle.  All rights reserved.

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.15.0.0.0

SQL>
SQL> @hidden
Enter value for param: db_fast_obj_truncate
old   5:    and a.ksppinm like '%&param%'
new   5:    and a.ksppinm like '%db_fast_obj_truncate%'

Parameter                                     Session Value             Instance Value            descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_db_fast_obj_truncate                         TRUE                      TRUE                      enable fast object truncate
_db_fast_obj_ckpt                             TRUE                      TRUE                      enable fast object checkpoint


SQL> ALTER SYSTEM SET "_db_fast_obj_truncate" = false sid = '*';

System altered.

SQL>
SQL> ALTER SYSTEM SET "_db_fast_obj_ckpt" = false sid = '*';

System altered.

SQL>

Soon after setting them, we saw a drastic drop in the waits, and the system seemed better, much better. But this all happened during off-peak hours, so there wasn’t enough workload to reveal anything odd.

And as we suspected, the issue came back. The next day, during peak business hours, we started seeing the same issue and the same set of events. This time ‘latch: cache buffers chains‘ was quite high and prominent, which it had not been earlier.

Initially we tried to fix some of the statements that were expensive in logical I/Os (memory reads), but that hardly helped. The issue persisted even after setting a higher value for _db_block_hash_latches and decreasing _cursor_db_buffers_pinned. AWR continued to show ‘latch: cache buffers chains’ in the top ten foreground timed events, and ‘latch free‘ in first place among the background timed events.
Oracle Support confirmed the behavior was due to published bug 33025005, where excessive CBC latch waits are seen after upgrading from 12c to 19c. They suggested applying patch 33025005 and then setting the hidden parameter “_post_multiple_shared_waiters” to FALSE (in MEMORY only, to test), which disables multiple shared waiters in the database.

-- After applying patch 33025005

SQL> ALTER SYSTEM SET "_post_multiple_shared_waiters" = FALSE SCOPE = MEMORY;

System altered.

And even after applying the patch and setting the recommended undocumented parameter, the issue persisted and we were totally clueless.

As a last resort, we flushed the database buffer cache, and bingo! That crude method of purging the cache drastically reduced the load from ‘CBC latching‘ and ‘enq: CR – block range reuse ckpt‘, and the system ran fine soon after the flush.

So nothing else worked for us. We changed multiple checkpointing and shared-waiter related parameters in the database and applied a bug fix patch (33025005), to no avail. In the end, flushing the buffer cache worked! Oracle Support agreed that this was happening due to a new/unpublished bug (33125873 or 31844316), which is not yet fixed in 19.15 and will be included in 23.1. Both are in status 11, which means they are still being worked on by Development, so there is no fix yet.

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting

 