Addressing Stuck Undo Segments: How to Safely Drop Problematic Undo Segments
Posted by FatDBA on October 14, 2024
Hi All,
This post discusses an intriguing issue we recently encountered on an Oracle 19.22 database following a CDB restart. After the restart, we observed a peculiar problem where sessions performing DDL commands were getting locked and hanging at the PDB level, effectively halting all DDL operations across the database.
During our analysis, we discovered that the SMON process was waiting on a latch and consuming a large amount of CPU. We also noticed that the MMON process was blocking SMON, causing additional delays. The alert log revealed multiple error messages, which further complicated the diagnosis.
This issue required a deep dive into Oracle’s background processes and system-level contention to resolve, as it was causing a significant disruption to database operations.
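Before digging into the traces, a quick look at the background processes themselves already tells you a lot. The sketch below is illustrative rather than the exact query we ran; it simply lists what SMON and MMON are waiting on and whether either is being blocked, using standard v$session columns.

-- Illustrative sketch: what are SMON/MMON doing, and is anyone blocking them?
SELECT s.sid, s.serial#, s.program, s.event, s.state,
       s.blocking_session, s.seconds_in_wait
  FROM v$session s
 WHERE s.program LIKE '%SMON%'
    OR s.program LIKE '%MMON%';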
-- Fragments from the alert log, SMON/MMON process trace files and standard diag traces.
kcbzib: encounter logical error ORA-1410, try re-reading from other mirror..
TRCMIR:kcf_reread :start:3722012:0:+DATA/CDBMONKEY/AA82C21DD440449FE053B4146E0AA55B/DATAFILE/tablespace_test_dataaa.xxx.xxxx
TRCMIR:kcf_reread :done :3722012:0:+DATA/CDBMONKEY/AA82C21DD440449FE053B4146E0AA55B/DATAFILE/tablespace_test_dataaa.xxx.xxxxx
kcbzibmlt: encounter logical error ORA-1410, try re-reading from other mirror..
---> SMON: Parallel transaction recovery tried
30317 error message received from server=1.70(P01Y) qref:0x8de103cf0 qrser:5121 qrseq:3 mh:0x97fdf9460
Parallel Transaction recovery caught exception 12801
Parallel Transaction recovery caught error 30317
*** 2024-08-19T20:38:23.297997-04:00 (PWS1E(3))
Parallel Transaction recovery caught exception 30319
Parallel Transaction recovery caught error 30319
*** 2024-08-19T20:38:50.613855-04:00 (PWS1E(3))
30317 error message received from server=1.57(P01L) qref:0x8de109fe8 qrser:11265 qrseq:3 mh:0x95fccd3c8
Parallel Transaction recovery caught exception 12801
Parallel Transaction recovery caught error 30317
Parallel Transaction recovery caught exception 30319
TEST1E(3):about to recover undo segment 98 status:6 inst:0
TEST1E(3):mark undo segment 98 as available status:6 ret:1
TEST1E(3):about to recover undo segment 46 status:6 inst:0
TEST1E(3):mark undo segment 46 as available status:6 ret:1
The logs and trace files also highlighted an issue with two specific undo segments, identified by segment numbers 98 and 46, from the UNDO tablespace. Upon further investigation, we found that both segments were in a ‘RECOVERING’ state. What was particularly concerning was that the recovery process for these segments was progressing extremely slowly, with the v$fast_start_transactions view showing an unusually high estimated recovery time.
In fact, based on the progress we monitored, it seemed like the recovery process wasn’t moving forward at all and appeared to be stuck in some kind of loop. This stagnation in recovery added to the overall system’s delay, compounding the performance issues we were already facing. It became clear that this problem was a significant bottleneck in restoring the database to normal operation.
SQL> select * from V$FAST_START_TRANSACTIONS;
USN  SLT  SEQ      STATE       UNDOBLOCKSDONE  UNDOBLOCKSTOTAL  PID  CPUTIME  PARENTUSN  PARENTSLT  PARENTSEQ  XID               PXID              RCVSERVERS  CON_ID
---  ---  -------  ----------  --------------  ---------------  ---  -------  ---------  ---------  ---------  ----------------  ----------------  ----------  ------
 46   46  2313064  RECOVERING             505         24992423   77     5586          0          0          0  10001000684B2300  0000000000000000           1       0
 98   25  1352150  RECOVERING               0           226231   78     5586          0          0          0  30001900D6A11400  0000000000000000           1       0
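Progress (or the lack of it) is easier to judge as a percentage. The sketch below is a simple derivation from the same view rather than an exact capture from our session; an ETA can be inferred by sampling it twice and watching how UNDOBLOCKSDONE moves between the samples.

-- Illustrative sketch: percentage of undo blocks already rolled back per transaction
SELECT usn, state, undoblocksdone, undoblockstotal,
       ROUND(undoblocksdone / NULLIF(undoblockstotal, 0) * 100, 2) AS pct_done
  FROM v$fast_start_transactions
 WHERE state = 'RECOVERING';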
SQL> SELECT segment_name, tablespace_name FROM dba_rollback_segs WHERE segment_id IN (98, 46);
SEGMENT_NAME TABLESPACE_NAME
------------------------------ ------------------------------
_SYSSMU46_5249279471$ UNDOTEST1
_SYSSMU98_5249279471$ UNDOTEST1
We attempted to take the segments offline and ultimately drop them, as they were associated with a materialized view (MV) refresh and a bulk insert statement. Both operations were part of an ad-hoc activity, so it was acceptable to lose them. Despite our efforts, however, the segments remained in a 'PARTLY AVAILABLE' state, leaving us with no way to drop them or take them fully offline. We were essentially stuck, unable to drop the segments or the associated tablespace, and the inability to release them further complicated our recovery efforts.
We also checked the status of those two undo segments in the base table x$ktuxe: the KTUXESTA (status) column showed the transactions as 'ACTIVE', while the KTUXECFL (flags) column showed 'DEAD', meaning the transactions had failed but were still holding resources. That gave us more confidence about what was happening under the hood.
SQL> select min(sample_time), max(sample_time), sql_id, xid, count(1) from dba_hist_active_sess_history
where xid in ('10001000684B2300','30001900D6A11400') group by sql_id, xid;
MIN(SAMPLE_TIME)           MAX(SAMPLE_TIME)           SQL_ID         XID                 COUNT(1)
-------------------------  -------------------------  -------------  ----------------  ----------
15-SEP-24 01.22.25.446 PM  15-SEP-24 05.51.22.340 PM                 30001900D6A11400        3213
15-SEP-24 10.22.46.218 AM  15-SEP-24 01.22.15.440 PM  ac5hhandj9fh1  30001980D6A11400        2158  -------------->
13-SEP-24 08.31.54.374 PM  14-SEP-24 02.53.45.723 AM  annqr822no0a1  10001090684B2300        4578  -------------->
14-SEP-24 02.53.55.731 AM  15-SEP-24 05.51.22.340 PM                 10001000684B2300       27781
SQL> select sql_id, sql_text from dba_hist_sqltext where sql_id in ('annqr822no0a1','ac5hhandj9fh1o');
SQL_ID SQL_TEXT
------------- --------------------------------------------------------------------------------
annqr822no0a1 INSERT INTO monkey.ah_ah3_xaa_131C (
ac5hhandj9fh1o /* MV_REFRESH (INS) */INSERT /*+ BYPASS_RECURSIVE_CHECK */ INTO "monkey"."test_
SQL> ALTER ROLLBACK SEGMENT "_SYSSMU46_5249279471$" offline;
Rollback segment altered.
SQL> ALTER ROLLBACK SEGMENT "_SYSSMU98_5249279471$" offline;
Rollback segment altered.
SQL> SELECT segment_name, status, tablespace_name
FROM dba_rollback_segs
WHERE segment_name IN ('_SYSSMU98_5249279471$', '_SYSSMU46_5249279471$');
SEGMENT_NAME STATUS TABLESPACE_NAME
------------------------------ ---------------- ------------------------------
_SYSSMU46_5249279471$ PARTLY AVAILABLE UNDOTEST1
_SYSSMU98_5249279471$ PARTLY AVAILABLE UNDOTEST1
SQL> SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe
WHERE ktuxesta!='INACTIVE' AND ktuxeusn=98;
KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS
---------- ---------- ---------- ---------------- ------------------------
98 25 1352150 ACTIVE DEAD
SQL> SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe
WHERE ktuxesta!='INACTIVE' AND ktuxeusn=46;
KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS
---------- ---------- ---------- ---------------- ------------------------
46 46 2313064 ACTIVE DEAD
Given that this is a critical production system, we couldn’t afford to wait for a complete recovery of the affected undo segments. To mitigate the issue, we created a new undo tablespace and designated it as the default for the database. This action enabled us to resume normal operations while the recovery of the problematic segments continued in the background.
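For reference, the switch itself is straightforward. The sketch below assumes an illustrative tablespace name (UNDOTEST2) and the +DATA diskgroup seen earlier in the traces; the size and autoextend settings are placeholders, not the values we actually used.

-- Illustrative sketch: create a fresh undo tablespace and make it the default
CREATE UNDO TABLESPACE undotest2
  DATAFILE '+DATA' SIZE 10G AUTOEXTEND ON MAXSIZE 32767M;

ALTER SYSTEM SET undo_tablespace = UNDOTEST2 SCOPE=BOTH;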
However, the underlying mystery remains: why are we unable to drop these segments in the production environment? To investigate further, we cloned the production database and set up a test instance. To our surprise, we replicated the same situation, where both segments 46 and 98 appeared again in a ‘PARTLY AVAILABLE’ state, providing no options for us to drop them.
In our exploration, we first turned to the FAST_START_PARALLEL_ROLLBACK parameter, which determines the number of processes that participate in parallel rollback, typically following an instance failure or a large manual rollback. We set this parameter to HIGH, as it significantly accelerates the rollback of large dead transactions in exactly these scenarios.
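The parameter is dynamic, so it can be changed without a bounce, and v$fast_start_servers then shows the recovery servers at work. A minimal sketch, not an exact capture from our session:

-- Illustrative sketch: raise parallel rollback and watch the recovery servers
ALTER SYSTEM SET fast_start_parallel_rollback = HIGH;

SELECT state, undoblocksdone, pid, xid
  FROM v$fast_start_servers;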
Additionally, we experimented with the undocumented parameter _OFFLINE_ROLLBACK_SEGMENTS, which instructs the instance to treat the listed rollback segments as offline at startup so that they are not used for new transactions.
Note: When dealing with hidden or undocumented parameters, it’s crucial to consult with Oracle support or rely on prior experience, as these settings can lead to unforeseen consequences in production environments.
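Before and after touching such a parameter, it is worth confirming what the instance actually sees. A commonly used sketch for this, run as SYS, reads the value straight from the x$ksppi/x$ksppcv base tables:

-- Illustrative sketch: confirm the current value of a hidden parameter (run as SYS)
SELECT i.ksppinm AS parameter, v.ksppstvl AS value
  FROM x$ksppi i, x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm = '_offline_rollback_segments';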
We ran the query below to dynamically generate the ALTER statement for the segments we needed to take offline.
SQL> select 'ALTER SYSTEM SET "_OFFLINE_ROLLBACK_SEGMENTS"='||listagg(''''||segment_name||'''',',') WITHIN GROUP (ORDER BY segment_name)||' scope=spfile;' from dba_rollback_segs
where tablespace_name='UNDOTEST1' and status ='NEEDS RECOVERY';
ALTER SYSTEM SET "_OFFLINE_ROLLBACK_SEGMENTS"='_SYSSMU46_5249279471$','_SYSSMU98_5249279471$' scope=spfile;
Shut down the database and start it up normally after setting the parameter above.
shutdown immediate;
startup;
And finally, generate the drop statements:
SQL> select 'drop rollback segment "'||segment_name||'";' from dba_rollback_segs
where tablespace_name='UNDOTEST1' and status ='NEEDS RECOVERY';
drop rollback segment "_SYSSMU98_5249279471$";
drop rollback segment "_SYSSMU46_5249279471$";
Issue the two DROP ROLLBACK SEGMENT statements above against the database, bounce the database once more, and finally drop the problematic undo tablespace. Do not forget to reset the '_OFFLINE_ROLLBACK_SEGMENTS' parameter afterwards, which requires one more bounce.
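For completeness, issuing the generated drops looks like the sketch below. The statements are exactly the ones generated above; the feedback lines are what one would typically expect from SQL*Plus rather than a capture from our run.

SQL> drop rollback segment "_SYSSMU98_5249279471$";

Rollback segment dropped.

SQL> drop rollback segment "_SYSSMU46_5249279471$";

Rollback segment dropped.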
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup;
SQL> drop tablespace UNDOTEST1;
Tablespace dropped.
SQL> Alter System reset "_OFFLINE_ROLLBACK_SEGMENTS";
System altered.
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup;
Although it was a lengthy and demanding process involving numerous experiments, the results were ultimately positive. We encountered no errors and successfully dropped the problematic segments, freeing the database from the issues that had plagued it. This experience not only resolved our immediate concerns but also provided valuable insights into managing similar challenges in the future.
Hope It Helped!
Prashant Dixit