Tales From A Lazy Fat DBA

Its all about Databases & their performance, troubleshooting & much more …. ¯\_(ツ)_/¯

  • Prashant Dixit is the 'FatDBA' ...
  • Follow me on Twitter

Why the optimizer not picking the correct Index ? Bringing chaos to order ..

Posted by FatDBA on December 13, 2021

Hi Guys,

Would like to discuss one interesting problem that I’ve recently faced while supporting one of the customer migration, where one critical SQL statement turned out to be very slow on this new infrastructure. They moved from 12.1 standalone to 12.2 2-Node RAC cluster and moved to a brand new hardware, but with similar HW and resource configurations as earlier.

As per the team, one of the core application API was behaving very slow while processing requests, and also failing due to time limits applied within the application. Hence the entire system becoming slow after they moved the workload to this new system.

Query was using multiple subqueries or nesting results that it got from them using a UNION ALL operator which returned all rows as it does not eliminate duplicate selected rows. The query was frequently waiting on User IO wait class, specially on ‘direct path read‘ and ‘db file sequential read‘ events, and took ~ 2.7 minutes to complete, but was expected to complete in less than a second. The query text was something like below ..

(select * from (select * from dix_table_A where perfor_column_12=:1 and testcas_idnu=:2 ) where rownum <= :"SYS_B_0") 
union all (select * from (select * from dix_table_A where perfor_column_12=:3 and testcas_idnu=:4 ) where rownum <= :"SYS_B_1") 
union all (select * from (select * from dix_table_A where perfor_column_12=:5 and testcas_idnu=:6 ) where rownum <= :"SYS_B_2") 
union all (select * from (select * from dix_table_A where perfor_column_12=:7 and testcas_idnu=:8 ) where rownum <= :"SYS_B_3");


SQL_ID        PLAN_HASH_VALUE        EXECS    AVG_ETIME AVG_CPU_TIME        AVG_LIO      AVG_PIO
------------- --------------- ------------ ------------ ------------ -------------- ------------
8sn7dhnash901      891893112           45      162.276      101.390   19,618,917.8  11,911,560.6


--- Execution Plan of the ill/slow SQL
-------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |               |       |       |    24 (100)|          |
|   1 |  UNION-ALL                            |               |       |       |            |          |
|   2 |   COUNT STOPKEY                       |               |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |   187 | 17952 |     6   (0)| 00:00:03 |
|   4 |     INDEX SKIP SCAN                   | I2DIX_TABLE_A |   168K|       |     2   (0)| 00:00:03 |
|   5 |   COUNT STOPKEY                       |               |       |       |            |          |
|   6 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |   187 | 17952 |     6   (0)| 00:00:03 |
|   7 |     INDEX SKIP SCAN                   | I2DIX_TABLE_A |   168K|       |     2   (0)| 00:00:03 |
|   8 |   COUNT STOPKEY                       |               |       |       |            |          |
|   9 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |   187 | 17952 |     6   (0)| 00:00:03 |
|  10 |     INDEX SKIP SCAN                   | I2DIX_TABLE_A |   168K|       |     2   (0)| 00:00:03 |
|  11 |   COUNT STOPKEY                       |               |       |       |            |          |
|  12 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |   187 | 17952 |     6   (0)| 00:00:03 |
|  13 |     INDEX SKIP SCAN                   | I2DIX_TABLE_A |   168K|       |     2   (0)| 00:00:03 |
-------------------------------------------------------------------------------------------------------

As per the SQL text, its querying PERFOR_COLUMN_12 and TESTCASE_IDNU columns in the WHERE clause, and per the execution plan (above), I2DIX_TABLE_A index is picked to prepare the plan with INDEX SKIP SCAN as the access method/path for said Index. Index skip scan means, that the leading or other columns of the index is ignored. This costs performance since Oracle has read every item of the first column, second or subsequent columns and check if the second (or third etc.) column is what you searched for. This usually is faster than a full-table scan (depends on your query), but slower than a index range scan.

I2DIX_TABLE_A index is a multi column index which has both of the referenced columns PERFOR_COLUMN_12 and TESTCASE_IDNU, but in the reverse direction, not as the leading columns.

-- I2DIX_TABLE_A Index DDL (BAD Index)
CREATE UNIQUE INDEX "DIXIT"."I2DIX_TABLE_A" ON "DIXIT"."DIX_TABLE_A" ("PRODUCT_CLASSIC_CAT", "BUILD_CLASSIC_ID", "TESTCASE_IDNU", "PERFOR_COLUMN_12");

While doing further analysis, I saw there is already one dedicated index I1DIX_TABLE_A there to cover this combination of columns PERFOR_COLUMN_12 and TESTCASE_IDNU, and in the right order too, but was ignored by the optimizer while preparing the estimations or costing.

-- I1DIX_TABLE_A Index DDL (GOOD Index)
CREATE INDEX "DIXIT"."I1DIX_TABLE_A" ON "DIXIT"."DIX_TABLE_A" ("PERFOR_COLUMN_12", "TESTCASE_IDNU");

Now, this is strange, why the CBO didn’t considered that index even when the required column set is present in the other index and in the right order, and instead it opted to go for an expensive index ? These are the other stats about both the two indexes.

INDEX_NAME           UNIQUENES     BLEVEL DEGREE   TABLE_NAME      NUM_ROWS    LEAF_BLOCKS  INI_TRANS  MAX_TRANS STATUS  LAST_ANAL
-------------------- --------- ---------- -------- --------------- ----------- ------------ ---------- --------- ------  ------------
I1DIX_TABLE_A        NONUNIQUE          4 1        DIX_TABLE_A     33457571    942391        2         255       VALID    01-DEC-21
I2DIX_TABLE_A        UNIQUE             3 1        DIX_TABLE_A     35084294    494579        2         255       VALID    01-DEC-21

All statistics looks good, I mean the row count is almost same, ITL initial and max values, status and stats collection date, but the difference is there for the branch level (BLevel) and Leaf_Blocks as they are different. Taking about the Blevel, it’s the part of the B-tree index that relates to the number of times Oracle has to narrow its search on the index while searching for a particular record, or the blevel is the number of branch levels. On the other hand, leaf_blocks represents number of leaf blocks in an index.

As we know the Blevel and Clustering Factor are important elements of an index scan cost. The index that has lower values for these is likely to be chosen. I2DIX_TABLE_A has lower values than I1DIX_TABLE_A. About the CF, It is hard to decrease clustering factor because it requires rebuilding the table.

So, lets focus to decrease the BLEVEL, will try rebuilding I1DIX_TABLE_A as that might help to decrease the Blevel.

SQL> alter index dixit.I1DIX_TABLE_A rebuild;

Index altered.


-- Index stats after rebuilding 
INDEX_NAME           UNIQUENES     BLEVEL DEGREE   TABLE_NAME      NUM_ROWS    LEAF_BLOCKS  INI_TRANS  MAX_TRANS STATUS  LAST_ANAL
-------------------- --------- ---------- -------- --------------- ----------- ------------ ---------- --------- ------  ------------
I1DIX_TABLE_A        NONUNIQUE          3 1        DIX_TABLE_A     33457571    446703        2         255       VALID    01-DEC-21
I2DIX_TABLE_A        UNIQUE             3 1        DIX_TABLE_A     35084294    494579        2         255       VALID    01-DEC-21

And Yes, both the BLEVEL and the LEAF_BLOCKS of I1DIX_TABLE_A got reset, and I’ve immediately started hearing some good news from the application team about query’s runtime improvements.
The expensive index I2DIX_TABLE_A was finally replaced by the good I1DIX_TABLE_A and that’s how the INDEX SKIP SCAN was replaced with the fast INDEX RANGE SCAN.

-------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |               |     8 |   10M |     4   (0)| 00:00:01 |
|   1 |  UNION-ALL                            |               |       |       |            |          |
|*  2 |   COUNT STOPKEY                       |               |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |     2 |   192 |     1   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | I1DIX_TABLE_A | 28707 |       |     1   (0)| 00:00:01 |
|*  5 |   COUNT STOPKEY                       |               |       |       |            |          |
|   6 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |     2 |   192 |     1   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN                  | I1DIX_TABLE_A | 28707 |       |     1   (0)| 00:00:01 |
|*  8 |   COUNT STOPKEY                       |               |       |       |            |          |
|   9 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |     2 |   192 |     1   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN                  | I1DIX_TABLE_A | 28707 |       |     1   (0)| 00:00:01 |
|* 11 |   COUNT STOPKEY                       |               |       |       |            |          |
|  12 |    TABLE ACCESS BY INDEX ROWID BATCHED| DIX_TABLE_A   |     2 |   192 |     1   (0)| 00:00:01 |
|* 13 |     INDEX RANGE SCAN                  | I1DIX_TABLE_A | 28707 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

You might be thinking that the other difference is the UNIQUENESS, though the Unique scan cost is lower than Range scan cost. But, in this case, unique scan cannot be used for any index. Unique index is NOT always used for “unique scan.” It depends on predicates. The bad index is chosen not because it is a unique index but its scan cost is lower.

Hope It Helped
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , , | 2 Comments »

And my simple & quick Oracle database monitoring script …

Posted by FatDBA on December 9, 2021

Hi All,

Sometimes when you have to support a performance problem and have to do the firefight, every second counts and you don’t want to use any of those fancy tools to see what’s going on in the database at the moment, and want something handy and quick, nothing is as friendly as querying dynamic views. What about using a small quick script to provide you what. Specially the darling views V$SQLAREA and V$SESSION where the first gives you all sorts of SQL level information and later provides me session information.

Recently while working on a shot assignment where I have to settle the performance and stablize system’s performance for a network inventory application, I quickly penned one script, which can be called as ‘Poor man’s database monitoring script‘ 🙂 to provide me a quick understanding of the database system on what all comes and stays and waits at any given point in time.

Github link : https://github.com/fatdba/scripts.git

The script is pretty basic but very handy, quick and gives you all sort of details, and I’ve played with the LAST_CALL_ET from V$SESSION to get the overall time (RUNNING_SINCE) the SQL is there in the database, rest all you can add, alter as per your need. So this is what it is …

--------------------------------------------------------------------------------------------------------------------
-- File name:   prashantpoormanscript.sql
-- Version:     V1.1 (12-08-2021) Simple View
-- Purpose:     This script can be used on any Oracle DB to know what all running and for how long and waiting
--              Also provides details on SQL and SESSION level. 
-- Author:      Prashant Dixit The Fatdba www.fatdba.com
--------------------------------------------------------------------------------------------------------------------
set linesize 400 pagesize 400
select 
x.inst_id
,x.sid
,x.serial#
,x.username
,x.sql_id
,plan_hash_value
,sqlarea.DISK_READS
,sqlarea.BUFFER_GETS
,sqlarea.ROWS_PROCESSED
,x.event
,x.osuser
,x.status
,x.BLOCKING_SESSION_STATUS
,x.BLOCKING_INSTANCE
,x.BLOCKING_SESSION
,x.process
,x.machine
,x.program
,x.module
,x.action
,TO_CHAR(x.LOGON_TIME, 'MM-DD-YYYY HH24:MI:SS') logontime
,x.LAST_CALL_ET
,x.SECONDS_IN_WAIT
,x.state
,sql_text,
ltrim(to_char(floor(x.LAST_CALL_ET/3600), '09')) || ':'
 || ltrim(to_char(floor(mod(x.LAST_CALL_ET, 3600)/60), '09')) || ':'
 || ltrim(to_char(mod(x.LAST_CALL_ET, 60), '09'))    RUNNING_SINCE
from   gv$sqlarea sqlarea
,gv$session x
where  x.sql_hash_value = sqlarea.hash_value
and    x.sql_address    = sqlarea.address
and    sql_text not like '%select x.inst_id,x.sid ,x.serial# ,x.username ,x.sql_id ,plan_hash_value ,sqlarea.DISK_READS%'
and    x.status='ACTIVE'
and x.USERNAME is not null
and x.SQL_ADDRESS    = sqlarea.ADDRESS
and x.SQL_HASH_VALUE = sqlarea.HASH_VALUE
order by RUNNING_SINCE desc;

This is how the output looks like. I have highlighted the RUNNING_SINCE column in red, just to show that the SQL was there in the database running for 3 minutes and 8 seconds and still active waiting on db file sequential reads and was called through SQL Developer, along it gets you the SQLID and PHV for all active SQLs along with other session and SQL level details that will help you to form the right approach for any of those slow sluggish SQLs.

There are hundreds of use cases for this one, specially when the application or product team doesn’t know what all gets triggered in the database in the form of SQLs whenver any of the APIs touched. That’s when you can take help of such handy SQL scripts to check what all runs on the database. You can also write a simple shell script and call it via OS utilities like watch to see it live showing you database workload .. You can write a simple shell something like this …

--------------------------------------------------------------------------------------------------------------------
-- File name:   prashantpoormanscript.sh
-- Version:     V1.1 (12-08-2021) Shell script View
-- Purpose:     This script can be used on any Oracle DB to know what all running and for how long and waiting
--              Also provides details on SQL and SESSION level. 
-- Author:      Prashant Dixit The Fatdba www.fatdba.com
--------------------------------------------------------------------------------------------------------------------
#!/bin/ksh
sqlplus /nolog << EOF
CONNECT username/Password@connection_string
set linesize 400
set pagesize 400
col ACTION for a22
col USERNAME for a9
col SQL_ID for a16
col EVENT for a20
col OSUSER for a10
col PROCESS for a8
col MACHINE for a15
col OSUSER for a8
col PROGRAM for a15
col module for a20
select x.inst_id,x.sid
,x.serial#
,x.username
,x.sql_id
,plan_hash_value as PHV
,sqlarea.DISK_READS
,sqlarea.BUFFER_GETS
,sqlarea.ROWS_PROCESSED
,x.event
,x.osuser
,x.status
,x.BLOCKING_SESSION_STATUS
,x.BLOCKING_INSTANCE
,x.BLOCKING_SESSION
,x.process
,x.machine
,x.program
,x.module
,x.action
,TO_CHAR(x.LOGON_TIME, 'MM-DD-YYYY HH24:MI:SS') logontime
,x.LAST_CALL_ET
,x.SECONDS_IN_WAIT
,x.state
,sql_text,
ltrim(to_char(floor(x.LAST_CALL_ET/3600), '09')) || ':'
 || ltrim(to_char(floor(mod(x.LAST_CALL_ET, 3600)/60), '09')) || ':'
 || ltrim(to_char(mod(x.LAST_CALL_ET, 60), '09'))    RUNNING_SINCE
from   gv\$sqlarea sqlarea
,gv\$session x
where  x.sql_hash_value = sqlarea.hash_value
and    x.sql_address    = sqlarea.address
and    sql_text not like '%select x.inst_id,x.sid ,x.serial# ,x.username ,x.sql_id ,plan_hash_value%'
and    x.status='ACTIVE'
and x.USERNAME is not null
and x.SQL_ADDRESS    = sqlarea.ADDRESS
and x.SQL_HASH_VALUE = sqlarea.HASH_VALUE
order by RUNNING_SINCE desc;

SPOOL OFF
EXIT;
EOF

Once you put all that in shell script, call that via watch utility and you’re set.
Example : call the shell every 2 seconds to refresh the output

[Fatdba@Ontacan-test7-dbmonkey mytools]$ watch -n 2 sh prashantpoormanscript.sh

But if you want a fancier representation of the code, here you go … I did some formatting and make it more easy to read but I still love the core one, unformatted and simple.

--------------------------------------------------------------------------------------------------------------------
-- File name:   prashantpoormanscript1.sql
-- Version:     V1.1 (12-08-2021) Fancy Version
-- Purpose:     This script can be used on any Oracle DB to know what all running and for how long and waiting
--              Also provides details on SQL and SESSION level. 
-- Author:      Prashant Dixit The Fatdba www.fatdba.com
--------------------------------------------------------------------------------------------------------------------
set linesize 400
set pagesize 400
col ACTION for a22
col USERNAME for a9
col SQL_ID for a16
col EVENT for a20
col OSUSER for a10
col PROCESS for a8
col MACHINE for a15
col OSUSER for a8
col PROGRAM for a15
col module for a20
col BLOCKING_INSTANCE for a20
select 
'InstID .............................................: '||x.inst_id,
'SID ................................................: '||x.sid,
'Serial .............................................: '||x.serial#,
'Username ...........................................: '||x.username,
'SQLID ..............................................: '||x.sql_id,
'PHV ................................................: '||plan_hash_value,
'DISK_READS .........................................: '||sqlarea.DISK_READS,
'BUFFER_GETS ........................................: '||sqlarea.BUFFER_GETS,
'ROWS_PROCESSED ..... ...............................: '||sqlarea.ROWS_PROCESSED,
'Event  .............................................: '||x.event,
'OSUser .............................................: '||x.osuser,
'Status .............................................: '||x.status,
'BLOCKING_SESSION_STATUS ............................: '||x.BLOCKING_SESSION_STATUS,
'BLOCKING_INSTANCE ..................................: '||x.BLOCKING_INSTANCE,
'BLOCKING_SESSION ...................................: '||x.BLOCKING_SESSION,
'PROCESS ............................................: '||x.process,
'MACHINE ............................................: '||x.machine,
'PROGRAM ............................................: '||x.program,
'MODULE .............................................: '||x.module,
'ACTION .............................................: '||x.action,
'LOGONTIME ..........................................: '||TO_CHAR(x.LOGON_TIME, 'MM-DD-YYYY HH24:MI:SS') logontime,
'LAST_CALL_ET .......................................: '||x.LAST_CALL_ET,
'SECONDS_IN_WAIT ....................................: '||x.SECONDS_IN_WAIT,
'STATE ..............................................: '||x.state,
'RUNNING_SINCE ......................................: '||ltrim(to_char(floor(x.LAST_CALL_ET/3600), '09')) || ':' || ltrim(to_char(floor(mod(x.LAST_CALL_ET, 3600)/60), '09')) || ':' || ltrim(to_char(mod(x.LAST_CALL_ET, 60), '09')) RUNNING_SINCE,
'SQLTEXT ............................................: '||sql_text
from   gv$sqlarea sqlarea
,gv$session x
where  x.sql_hash_value = sqlarea.hash_value
and    x.sql_address    = sqlarea.address
and    sql_text not like '%select x.inst_id,x.sid ,x.serial# ,x.username ,x.sql_id ,plan_hash_value%'
and    x.status='ACTIVE'
and    sql_text not like '%select :"SYS_B_00"||x.inst_id, :"SYS_B_01"||x.sid, :"SYS_B_02"||x.serial#,%'
and x.USERNAME is not null
and x.SQL_ADDRESS    = sqlarea.ADDRESS
and x.SQL_HASH_VALUE = sqlarea.HASH_VALUE
order by RUNNING_SINCE desc;

And the output will look something like this, neat and clean ..

Hope It Helped!
Prashant Dixit


Posted in Advanced, troubleshooting | Tagged: , , | 6 Comments »

SQL Tuning Advisor is now available to use with Oracle Cloud Infrastructure

Posted by FatDBA on December 5, 2021

And finally the much loved ‘SQL Tuning Advisor‘ is now available to use in Oracle cloud DB Service. The useful utility is a built-in tool to provide suggestions or recommendations about certain SQL statements and now available to use with OCI.

Read more about it!

https://blogs.oracle.com/observability/post/available-now-sql-tuning-advisor-for-oracle-cloud-databases

Hope It Helped!
Prashant Dixit

Posted in Basics | Tagged: , , | Leave a Comment »

ORA-24777: use of non-migratable database link not allowed

Posted by FatDBA on November 27, 2021

Hi gUYS,

Recently I was contacted by someone to take a look at an application problem where they caught few ORA errors in their logs, received “ORA-24777: use of non-migratable database link not allowed” which was for one of their important DBLink. They were using WebLogic as an application server and 19c as Oracle DB Version!
Okay, talking about the issue, this usually happens due to a session using shared connection (XA) using DBLink with dedicated connection. About fixing this problem, there are few ways how you can fix this issue i.e. setting dispatchers, Change or server on tnsnames.ora to SHARED etc., but would like to discuss two of the other methods that I tried to fix the problem.

First Solution:

  1. Open your Weblogic Server Admin Console and click on Services -> Data Sources -> {Choose the datasource you are using in your DB adapter} -> Connection Pool
  2. Change Driver ClassName from oracle.jdbc.xa.client.OracleXADataSource to oracle.jdbc.OracleDriver

Second Solution:
Sometimes, you don’t want to set the system wide change in your WebLogic by changing the driver class name, you can alter your DBLink and make it a SHARED DBLink. Shared database links purpose is to curb down the number of connections to the remote database by sharing same connection link between the source and the remote by multiple users sessions.

-- This was the previous DDL of the DBLink
CREATE DATABASE LINK test_dblink 
   CONNECT TO testuser IDENTIFIED BY mypassword
   USING '(DESCRIPTION=
            (ADDRESS=(PROTOCOL=TCP)(HOST=monkeyserver.ontadomain.com)(PORT=1524))
            (CONNECT_DATA=(SERVICE_NAME=TESTDB1))
          )';


-- This is what I have altered it to to make it SHARED DBLink and worked for me.
CREATE SHARED DATABASE LINK test_dblink 
   CONNECT TO testuser IDENTIFIED BY mypassword
   USING '(DESCRIPTION=
            (ADDRESS=(PROTOCOL=TCP)(HOST=monkeyserver.ontadomain.com)(PORT=1524))
            (CONNECT_DATA=(SERVICE_NAME=TESTDB1))
          )';

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , | Leave a Comment »

Materialized Views Refresh slow ? This is what you can try …

Posted by FatDBA on November 22, 2021

Hi All,

Entire last week I was looking in to one system wide performance issue where customer reported slowness in their database since they’d moved into a new infrastructure, after data migration. Though this wasn’t the production system, but kind of equally important, and the issue was impacting their acceptance tests. I was updated that few of the Materialized Views are there and their refresh times are now extended by a great margin, since they moved to this new system.

As soon I got the access on their system, I did a quick check on the database and I found a huge list of MV refresh jobs are scheduled and almost all of them are running with a huge delay, the performance was not at all consistent, sometimes they gets refreshed within few minutes and another run goes for few hours. This initially sounded little bizarre to me, until I did a comparive stuy between the previous (fast) and the new (slow) databases. What I observed is that there is more than 500% of extra data there in the business tables in this new (slow) database and with such a huge data, all those complex MV refresh queries ultimately got slowed down here in the new system.

For an example, one of the critical MV taking ~ 200 mins to refresh an is created with COMPLETE REFRESH mode.

create MATERIALIZED VIEW MV_MYTEST_BIGREFRESH
BUILD DEFERRED
REFRESH ON DEMAND
COMPLETE
AS
select abc,xyz,cast(value as ajaaj) as test ....
............
.................



SELECT * 
FROM ( SELECT OWNER, 
              MVIEW_NAME, 
              CONTAINER_NAME, 
              REFRESH_MODE, 
              LAST_REFRESH_TYPE, 
              STALENESS, 
              round((LAST_REFRESH_END_TIME-LAST_REFRESH_DATE)*24*60,2) as REFRESH_TIME_MINS 
       FROM ALL_MVIEWS 
       WHERE LAST_REFRESH_TYPE IN ('FAST','COMPLETE')
     ) 
ORDER BY REFRESH_TIME_MINS DESC;
 
OWNER   MVIEW_NAME                       CONTAINER_NAME                   REFRESH_MODE LAST_REFRESH_TYPE STALENESS           REFRESH_TIME_MINS
------- -------------------------------- -------------------------------- ------------ ----------------- -------------------------------------
PRASH   MV_MYTEST_BIGREFRESH             MV_MYTEST_BIGREFRESH             DEMAND        COMPLETE      COMPLETE                196.75

Though I did multiple optimizations in order to stabilize performance, but would like to talk about few of the refresh tuning techniques that I have applied and finally worked for me.

Atomic_Refresh with value FALSE.
What caught my eye are all those long running DELETE statements running underneath those costly MV refresh jobs. This was the very first thing I tried with those complex and long running COMPLETE MV refresh, this is the optional parameter atomic_refresh of the procedure dbms_mview.refresh and is very useful. If the parameter is set to FALSE, the materialized view is deleted with a much faster TRUNCATE command. I immediately got the gain, as the runtime was dropped ~ 60% .. At the same time I have observed a great relax in redo generation as DELETE is replaced with TRUNCATE, and hence a great drop in ‘checkpoint incomplete‘ and ‘Log File Sync‘ waits.

The parameter atomic_refresh=FALSE works only with “complete” refresh, so “truncate” is only possible with “complete“. But there is one demerit too for this method, and is that no data is visible to the users during the refresh, so, choose wisely. If this is feasible in your environment, you can use the following command for a Complete Refresh:

BEGIN
dbms_mview.refresh(‘MV_MYTEST_BIGREFRESH_JOB’, method => ‘C’, atomic_refresh => FALSE);
END;


-- See below the REFRESH_TIME_MINS got dropped by ~ 60% after using setting atomic_refres = false
SELECT * 
FROM ( SELECT OWNER, 
              MVIEW_NAME, 
              CONTAINER_NAME, 
              REFRESH_MODE, 
              LAST_REFRESH_TYPE, 
              STALENESS, 
              round((LAST_REFRESH_END_TIME-LAST_REFRESH_DATE)*24*60,2) as REFRESH_TIME_MINS 
       FROM ALL_MVIEWS 
       WHERE LAST_REFRESH_TYPE IN ('FAST','COMPLETE')
     ) 
ORDER BY REFRESH_TIME_MINS DESC;
 
OWNER   MVIEW_NAME                       CONTAINER_NAME                   REFRESH_MODE LAST_REFRESH_TYPE STALENESS           REFRESH_TIME_MINS
------- -------------------------------- -------------------------------- ------------ ----------------- -------------------------------------
PRASH   MV_MYTEST_BIGREFRESH             MV_MYTEST_BIGREFRESH             DEMAND        COMPLETE      COMPLETE                79.12

Adding PX (Parallelism) to the base tables or to the refresh command itself.
Though with atomic_refrsh FALSE, I was able to cut down a great percentage of the overall elapsed time, but still customer asked for me, maybe some 30% more …
I decided to try PX to the base tables or to the refresh statement, among other parameters, this procedure takes one parameter PARALLELISM. At the outset, it appears that the PARALLELISM parameter should invoke a parallel refresh of the materialized view. However, it does not; it simply changes the requested degree of parallelism if it already executes in parallel. You can invoke this procedure as:

BEGIN
dbms_mview.refresh(‘MV_MYTEST_BIGREFRESH_JOB’, method => ‘C’, atomic_refresh => FALSE, PARALLELISM=>4); 
END;



-- See below the REFRESH_TIME_MINS got dropped by further ~30% after using PARALLELISM to the master/base MV tables.
SELECT * 
FROM ( SELECT OWNER, 
              MVIEW_NAME, 
              CONTAINER_NAME, 
              REFRESH_MODE, 
              LAST_REFRESH_TYPE, 
              STALENESS, 
              round((LAST_REFRESH_END_TIME-LAST_REFRESH_DATE)*24*60,2) as REFRESH_TIME_MINS 
       FROM ALL_MVIEWS 
       WHERE LAST_REFRESH_TYPE IN ('FAST','COMPLETE')
     ) 
ORDER BY REFRESH_TIME_MINS DESC;
 
OWNER   MVIEW_NAME                       CONTAINER_NAME                   REFRESH_MODE LAST_REFRESH_TYPE STALENESS           REFRESH_TIME_MINS
------- -------------------------------- -------------------------------- ------------ ----------------- -------------------------------------
PRASH   MV_MYTEST_BIGREFRESH             MV_MYTEST_BIGREFRESH             DEMAND        COMPLETE      COMPLETE                 55.08
  

You can also add PX to the base tables. When the master table of the materialized view has the PARALLEL attribute set to > 1, then the creation as well as the refresh processes will be parallelized. Whether or not you specify the PARALLELISM parameter in the REFRESH clause, doesn’t matter.

ALTER TABLE DIXIT_TEST_TABLE PARALLEL (DEGREE 4);

You can also add PX hint to the MV definition too! When the materialized view is created with a PARALLEL hint, then the creation as well as the refresh processes, will be parallelized. Whether or not you specify the PARALLELISM parameter in the REFRESH clause, doesn’t matter.
This is how you can ad a parallel hint in the materialized view definition!

CREATE MATERIALIZED VIEW TEST_MV
AS
SELECT /*+ PARALLEL(DIXIT_TEST_TABLE, 4) */ MONTHLY_ID, SALARY_DATE, SUM(ANNUAL_COUNTS)
FROM DIXIT_TEST_TABLE GROUP BY MONTHLY_ID, SALARY_DATE;

Stats Collection:
This is very vital that all of the tables, Indexes which your MV refresh references via the code, should be fresh and well collected. Look of for possibilities to add column level stats i.e Histograms wherever there is a scope, specially for low cardinality columns. You can also leave this to the optimizer to decide, by using METHOD_OPT=>’FOR ALL COLUMNS SIZE AUTO’

Other techniques:

  • Try for FAST/Incremental Refresh mode. With FAST refresh, only the changes since the last refresh are applied to the materialized view. Personally I have found the FAST refresh slower than the COMPLETE refresh mode, maybe because I used atomic_refresh parameter that replaced those costly DELETEs with fast TRUNCATE … Not sure why they called it FAST 🙂 To be sure that my materialized view can be fast refresh, we can also use explain_mview procedure and check the capability_name called “REFRESH_FAST”. The MV_CAPABILITIES TABLE is created in the owner’s schema by running the following script ..
SQL> @$ORACLE_HOME/rdbms/admin/utlxmv.sql

SQL> exec dbms_mview.explain_mview('MV_MYTEST_BIGREFRESH_JOB');
PL/SQL procedure successfully completed.
 
SQL> select capability_name,possible,related_text,msgtxt from mv_capabilities_table;

CAPABILITY_NAME                POSSIBLE             RELATED_TEXT            MSGTXT
------------------------------ -------------------- ----------------------- ----------------------------------------------------------------------------
PCT                            N
REFRESH_COMPLETE               Y
REFRESH_FAST                   Y <<<<<<<<---------------
REWRITE                        Y
PCT_TABLE                      N                    ID_ID_CANADA_HISTORY_T1 relation is not a partitioned table
PCT_TABLE                      N                    ID_ID_CANADA_HISTORY_TV relation is not a partitioned table
PCT_TABLE                      N                    DWH_SITE                relation is not a partitioned table
REFRESH_FAST_AFTER_INSERT      Y
REFRESH_FAST_AFTER_ONETAB_DML  Y
REFRESH_FAST_AFTER_ANY_DML     Y
REFRESH_FAST_PCT               N                                            PCT is not possible on any of the detail tables in the materialized view
REWRITE_FULL_TEXT_MATCH        Y
REWRITE_PARTIAL_TEXT_MATCH     Y
REWRITE_GENERAL                Y
REWRITE_PCT                    N                                            general rewrite is not possible or PCT is not possible on any of the detail tables
PCT_TABLE_REWRITE              N                    ID_ID_CANADA_HISTORY_T1 relation is not a partitioned table
PCT_TABLE_REWRITE              N                    ID_ID_CANADA_HISTORY_TV relation is not a partitioned table
PCT_TABLE_REWRITE              N                    DWH_SITE                relation is not a partitioned table
 
18 rows selected.
  • Out-of-place refresh is a new option on DBMS_MVIEW. REFRESH in Oracle 12c. The complete refresh process builds a new table which seamlessly becomes the materialized view, the old table is dropped. It only works in conjunction with non-atomic refresh.
  • DBMS_ADVISOR contains a procedure called TUNE_MVIEW that is used when working with the query rewrite mechanism. DBMS_MVIEW.EXPLAIN_MVIEW and DBMS_MVIEW.EXPLAIN_REWRITE that could be used to diagnose why a materialized view wasn’t being used for query rewrite. The DBMS_ADVISOR.TUNE_MVIEW procedure has the capability to add additional aggregate columns and materialized view logs to the view definition so that it becomes fast refreshable.

Well, all after I’ve performed above three methods, I was able to reduce the MV refresh downtimes by more than 92% … 🙂

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , , | Leave a Comment »

Redis cluster node with status DISCONNECTED – [ERR] Not all 16384 slots are covered by nodes, and a crude method to deal that …

Posted by FatDBA on November 8, 2021

Hi All,

Today’s post doesn’t follow the usual problem-solution approach, but about a hit and trial that fixed my problem on Redis. This is about a strange problem which I’d encountered while doing a POC, when one of the cluster node, after reboot stopped responding and coming with status as ‘disconnected‘. This was a multi master configuration (4 master nodes) with no slaves, Redis version was 6.2.6 on RHEL7. For simulation purposes, I have invoked four different Redis server instances on the same node/machine using separate ports. One running on port 6391 is not responding.

Note: This is just a test I did on one sandbox environment and it’s might not match your problem, I recommend not to try these steps in production as they are just part of hit-and-trial that I did.

-- Four different Redis Server Instances running on same host.
[root@fatdba redis-cluster]# ps -ef|grep redis
root       3337   3177  0 23:14 pts/1    00:00:03 redis-server 127.0.0.1:6391 [cluster]
root       3763   3177  0 23:20 pts/1    00:00:01 redis-server 127.0.0.1:6392 [cluster]
root       3814   3177  0 23:21 pts/1    00:00:01 redis-server 127.0.0.1:6394 [cluster]
root       4001   3177  0 23:24 pts/1    00:00:01 redis-server 127.0.0.1:6390 [cluster]



-- one with port 6391 is with status 'disconnected'.
[root@fatdba node0]# redis-cli -a secret -p 6390 cluster nodes
bdcc3381941e141c89110bbcffe66c73bc927054 :0@0 master,noaddr - 1635737041203 1635737041203 2 disconnected 4096-8191
8751bde72e08fd1b08cc34c82cd5868475320a6a 127.0.0.1:6392@16392 master - 0 1635737634979 3 connected 8192-12287
a2b89ff5be2eafeba1ae3aea78f871d0ccc16283 127.0.0.1:6390@16390 myself,master - 0 1635737633000 1 connected 0-4095
c548fe11041b636d627cafffd23f8ad642ac4085 127.0.0.1:6394@16394 master - 0 1635737634000 4 connected 12288-16383
[root@fatdba node0]#

Cluster check command is showing details only for three healthy masters and throwing error at the end of the check '[ERR] Not all 16384 slots are covered by nodes'. I tried to remove the problematic node using CLUSTER FORGET and re-add it to the cluster using addnode but even that didn’t worked, ideally it should, but it didn’t, and it was still coming with status ‘disconnected‘.

Meanwhile the 'cluster info' command showing known nodes as 4 with cluster state as OK.

[root@fatdba redis-cluster]# redis-cli -a secret --cluster check 127.0.0.1:6394
127.0.0.1:6394 (c548fe11...) -> 0 keys | 4096 slots | 0 slaves.
127.0.0.1:6390 (a2b89ff5...) -> 0 keys | 4096 slots | 0 slaves.
127.0.0.1:6392 (8751bde7...) -> 0 keys | 4096 slots | 0 slaves.
[OK] 0 keys in 3 masters.
0.00 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:6394)
M: c548fe11041b636d627cafffd23f8ad642ac4085 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
M: a2b89ff5be2eafeba1ae3aea78f871d0ccc16283 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: 8751bde72e08fd1b08cc34c82cd5868475320a6a 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.



127.0.0.1:6391> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:4
cluster_size:4
cluster_current_epoch:4
cluster_my_epoch:1
cluster_stats_messages_ping_sent:3342
cluster_stats_messages_pong_sent:3732
cluster_stats_messages_sent:7074
cluster_stats_messages_ping_received:3732
cluster_stats_messages_pong_received:3339
cluster_stats_messages_received:7071

So, we tried all possibilities but still our problem endured …

Now, if you have carefully checked the syntax that I’ve used to connect with any of the nodes is via -a option, as there was the password set, so every time it asks me to input the passcode. Just for the trial purpose I removed the password (PassThru parameter in Redis config file) and recreated the cluster, and oddly that trick worked. Of course that is something you will never do in production setups, but can try when learning the database or if not running on production and you’ve the luxury to play around with the database – Just a crude method to deal the issue 🙂

Still it’s incomprehensible for me about how this password removal thing fixed the issue 😦 or maybe I was doing some mistake, I am perplexed and will be troubleshooting it further …

[root@fatdba node0]# redis-cli --cluster create 127.0.0.1:6390 127.0.0.1:6391 127.0.0.1:6392 127.0.0.1:6394 --cluster-replicas 0
>>> Performing hash slots allocation on 4 nodes...
Master[0] -> Slots 0 - 4095
Master[1] -> Slots 4096 - 8191
Master[2] -> Slots 8192 - 12287
Master[3] -> Slots 12288 - 16383
M: 4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391
   slots:[4096-8191] (4096 slots) master
M: 08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
M: 923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
...
>>> Performing Cluster Check (using node 127.0.0.1:6390)
M: 4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: 923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
M: ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391
   slots:[4096-8191] (4096 slots) master
M: 08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

-- Let's check once again the cluster status
127.0.0.1:6391> cluster nodes
4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390@16390 master - 0 1635742000349 1 connected 0-4095
923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394@16394 master - 0 1635741998338 4 connected 12288-16383
ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391@16391 myself,master - 0 1635741998000 2 connected 4096-8191
08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392@16392 master - 0 1635741999344 3 connected 8192-12287
127.0.0.1:6391>


127.0.0.1:6391> cluster keyslot prashant
(integer) 6942

127.0.0.1:6391> keys *
(empty array)

-- No keys, let me try to create one string type key.
127.0.0.1:6391> set prashant dixit
OK

127.0.0.1:6391> get prashant
"dixit"

Hope It Helped
Prashant Dixit

Posted in troubleshooting | Tagged: , , , | 2 Comments »

Confused about using PURGE_ALL to remove optimizer statistics from the SYSAUX ?

Posted by FatDBA on November 7, 2021

Hi All,

Recently while working in one for the project where SYSAUX was growing at an alarming rate, customer operations team decided to purge old stats from the SYSAUX as that was occupying around 50% of the total occupancies. Though MMON purges these stats regularly from the OPTSTAT tables, but due to a known bug (14373728) it failed to that. I mean ideally it should purge these statistics according to the default retention, which is 30 days in this case, but it seem it can not due to that bug. MMON performs these purge activities automatically, but it has a limit of 5 minutes to perform these activities. So if the purging takes more than 5 mins, then the activities are aborted and as a result the stats are not purged.

Due to some reasons they don’t want to apply the patch, but agreed to purge all stats from the OPTSTAT related tables within SYSAUX. But, the team was very perplexed to perform the PURGE_ALL as a big-bang (not using sysdate-x) approach, as they weren’t sure if PURGE_ALL will remove all the stats including current statistics on objects (Tables, Indexes etc.) or it only touches OPTSTAT tables and truncates them.

About optimizer stats tables (OPTSTAT in short), they contain backup information about stats collection and involves tables like WRI$_OPTSTAT_OPR, WRI$_OPTSTAT_AUX_HISTORY, WRI$_OPTSTAT_TAB_HISTORY, WRI$_OPTSTAT_IND_HISTORY, WRI$_OPTSTAT_HISTGRM_HISTORY, WRI$_OPTSTAT_HISTHEAD_HISTORY.

Even the comments from Oracle customer support has raised the confusion, as their explanation was too hazy and with lot of muddiness about using EXEC DBMS_STATS.PURGE_STATS(DBMS_STATS.PURGE_ALL). So, I decided to do a quick demo to clear all the doubts around this subject.

All below mentioned tests were performed on Oracle 21c database.

-- Here I first created a sample table and generated 10000 random rows. 
[oracle@fatdba ~]$ sqlplus / as sysdba

SQL*Plus: Release 21.0.0.0.0 - Production on Wed Nov 3 23:53:14 2021
Version 21.3.0.0.0


SQL> CREATE TABLE tab1 AS
SELECT level AS id,
       CASE
         WHEN MOD(level, 2) = 0 THEN 0
         ELSE TRUNC(DBMS_RANDOM.value(1,10))
       END AS record_type,
       'Description for ' || level AS description
FROM   dual
CONNECT BY level <= 10000;

Table created.



SQL> select count(*) from tab1;

  COUNT(*)
----------
     10000


SQL> ALTER TABLE tab1 ADD CONSTRAINT tab1_pk PRIMARY KEY (id);

Table altered.


SQL> CREATE INDEX tab1_record_type_idx ON tab1(record_type);

Index created.

SQL> EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1');

PL/SQL procedure successfully completed.


SQL> EXEC DBMS_STATS.gather_index_stats(USER, 'TAB1_RECORD_TYPE_IDX');

PL/SQL procedure successfully completed.

Okay, so we have a table ready with 10000 random rows with a primary key constraint and have created one index on record_type column and have gathered stats on the table and the index. Next, I will purge all stats from OPTSTAT tables using PURGE_ALL function

23:13:05 SQL> EXEC DBMS_STATS.PURGE_STATS(DBMS_STATS.PURGE_ALL);

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.13
23:13:09 SQL>


-- Now the same can be confirmed by querying OPTSTAT tables WRI$_OPTSTAT_IND_HISTORY and  WRI$_OPTSTAT_TAB_HISTORY for Tables and Indexes, 
-- And they all are gone, no more data is there in these tables after flush.

23:15:55 SQL> select OBJ#, flags, rowcnt, ANALYZETIME from WRI$_OPTSTAT_IND_HISTORY where obj#=77267;

no rows selected

23:16:01 SQL> select FLAGS,ROWCNT, OBJ#,ANALYZETIME from WRI$_OPTSTAT_TAB_HISTORY where obj#=77256;

no rows selected

Now, lets check if tables and index has the stats collection date coming or not, and yes they are there! means still the table has their current stats and didn’t got purged after we executed PURGE_ALL on OPTSTAT tables.

23:13:37 SQL> select index_name, index_type, last_analyzed from dba_indexes where index_name='TAB1_RECORD_TYPE_IDX';

INDEX_NAME                     INDEX_TYPE                  LAST_ANALYZED
------------------------------ --------------------------- --------------------
TAB1_RECORD_TYPE_IDX           NORMAL                      03-NOV-2021 23:12:07

Elapsed: 00:00:00.00

23:13:50 SQL> select table_name, last_analyzed from dba_tables where table_name='TAB1';

TABLE_NAME           LAST_ANALYZED
-------------------- --------------------
TAB1                 03-NOV-2021 23:08:32

Elapsed: 00:00:00.00

I am now going to recollect stats on both of them, Tables and its Index, and with that, the last_analyzed date/time is refreshed with the latest timestamp.

repeating same steps for TABLE. And one more time, I see that the TABLE HISTORY related OPTSTAT table has only previous/historical values not the latest timestamp from stats collection.
I mean it should should have ’03-NOV-2021 23:26:07′ as a latest entry in the table, but has only got ’03-NOV-2021 23:17:28′ when last stats were collected.

23:25:54 SQL> EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1');

PL/SQL procedure successfully completed.


23:26:15 SQL> select table_name, last_analyzed from dba_tables where table_name='TAB1';

TABLE_NAME           LAST_ANALYZED
-------------------- --------------------
TAB1                 03-NOV-2021 23:26:07


23:26:08 SQL> select FLAGS,ROWCNT, OBJ#,ANALYZETIME from WRI$_OPTSTAT_TAB_HISTORY where obj#=77256;

     FLAGS     ROWCNT       OBJ# ANALYZETIME
---------- ---------- ---------- --------------------
        10      10000      77256 03-NOV-2021 23:08:32
        10      10000      77256 03-NOV-2021 23:17:28 ----> previous value, not the latest timestamp of 03-NOV-2021 23:26:07









-- One more time, for Table ....

23:29:05 SQL> EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1');

PL/SQL procedure successfully completed.


23:29:11 SQL> 23:29:11 SQL>  select table_name, last_analyzed from dba_tables where table_name='TAB1';


TABLE_NAME           LAST_ANALYZED
-------------------- --------------------
TAB1                 03-NOV-2021 23:29:07


23:29:07 SQL> select FLAGS,ROWCNT, OBJ#,ANALYZETIME from WRI$_OPTSTAT_TAB_HISTORY where obj#=77256;


     FLAGS     ROWCNT       OBJ# ANALYZETIME
---------- ---------- ---------- --------------------
        10      10000      77256 03-NOV-2021 23:08:32
        10      10000      77256 03-NOV-2021 23:17:28
        10      10000      77256 03-NOV-2021 23:26:07 ----> previous value, not the latest timestamp of 03-NOV-2021 23:29:0

Let’s do this on Indexes!

23:32:26 SQL> EXEC DBMS_STATS.gather_index_stats(USER, 'TAB1_RECORD_TYPE_IDX');

PL/SQL procedure successfully completed.

23:32:34 SQL> select index_name, index_type, last_analyzed from dba_indexes where index_name='TAB1_RECORD_TYPE_IDX';

INDEX_NAME                     INDEX_TYPE                  LAST_ANALYZED
------------------------------ --------------------------- --------------------
TAB1_RECORD_TYPE_IDX           NORMAL                      03-NOV-2021 23:32:27


23:32:33 SQL> select OBJ#, flags, rowcnt, ANALYZETIME from WRI$_OPTSTAT_IND_HISTORY where obj#=77267;

      OBJ#      FLAGS     ROWCNT ANALYZETIME
---------- ---------- ---------- --------------------
     77267         10      10000 03-NOV-2021 23:12:07
     77267         10      10000 03-NOV-2021 23:31:37 ---> Has only got '03-NOV-2021 23:31:37' which is a previous timestamp/history, not the latest of 03-NOV-2021 23:32:27

So, from above tests it’s clear that the PURGE_ALL only touches and truncates WRI$_OPTSTAT_XXXX tables and as they always have the previous stats collection details, your present stats on objects will always be there, untouched and safe. So, this command is very safe to run and is best when you are thinking of doing an iterative purge using SYSDATE-X as it’s always very slow as compared to PURGE_ALL which truncates the table and doesn’t delete records from it, and the reason of its agility.

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , | Leave a Comment »

DDL stalled Golden Gate replicat : A quick damage control !

Posted by FatDBA on October 31, 2021

Hi Everyone,

Recently I was asked to check a mono-directional Integrated Golden Gate (Version 12.3.0.1.2) replication problem where the replicat was not processing the data and got struck with a huge lag of ~ 3.5 hours. Customer reported that the GG stuck due to CREATE INDEX statements running from last 3 hours and want to skip those transactions.

This being an ad-hoc request, I didn’t had any direct access to their systems and can only ask for files or outputs from their operations team. And as per reports shared, yes, there was a delay of around 3.5 hours for the replicat and was showing in RUNNING state.

-- Process stats on the target
GGSCI (cana01db66664b.prod.sdt.monkey12.se) 1> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     TORONTO       03:29:11      00:00:03


-- From replicat parameter file
--DDL Options Used --
DDL INCLUDE MAPPED OBJNAME PDIXIT.*
DDLERROR DEFAULT IGNORE
DDLERROR 942 IGNORE
DDLERROR 955 IGNORE
DDLERROR 1435 IGNORE INCLUDE OPTYPE ALTER OBJTYPE SESSION
DDLERROR 904 IGNORE
DDLERROR 1430 IGNORE

While I was examining their parameter file saw they’re using BATCHSQL and HANDLECOLLISIONS to improve performance. Updated them that the use of the HANDLECOLLISIONS parameter can also cause performance issues if there are collisions since there needs to be additional processing done with those records. Hence, It is recommended to remove this parameter, if you are “not doing your initial load” for your tables. Most of transactions are happening in normal mode, hence, asked them to remove BATCHSQL parameter too.

Next, I asked them to try with DDLOPTIONS REPORT to be set in the parameter file, as this settings causes Integrated Extract to write a step-by-step history of all DDL operations captured to it’s report file; which is very useful when troubleshooting DDL replication issues. After they made a change to the parameter file and a bounce of replicat, I checked the logs, and per logs, below DDL was already executed

--> 2021-10-20 08:19:30 INFO OGG-00489 DDL is of mapped scope, after mapping new operation [create index "PDIXIT"."IDX_TEST121" on "PDIXIT"."TAB12"(C1666600563,C737666607) (size 82)]. 

However , DDL error ignored due to ORA-04021 and as they have DDLERROR ignore parameters in parameter file.

--> 2021-10-20 08:34:32 INFO OGG-00492 DDL error ignored: error code [DEFAULT], filter [include all (default)], error text [Error code [4021], ORA-04021: timeout occurred while waiting to lock object PDIXIT.IDX_TEST121 SQL create index "PDIXIT"."IDX_TEST121" on "PDIXIT"."TAB12"(C1666600563,C737666607) /* GOLDENGATE_DDL_REPLICATION */].

Asked them to check whether the index “PDIXIT”.”IDX_TEST121″ has been created on the target. If not then they may manually need to create this index on the target, but as per them, the index statement can be ignored and they only want to move the GG further, as their entire processing system was stopped and caused a massive delay.

Sid,Ser#      USERNAME      STATUS          SQL_ID          SQL_CHILD_NUMBER SQL_HASH_VALUE Logon Time    OSUser@Machine         pid   
TERMINAL PROGRAM                            MODULE
------------- ------------- --------------- --------------- ---------------- -------------- ------------- -------------------- ------- 
-------- ----------------------------------- -----------------------------------
6781,54621    GGGTEST       ACTIVE          u52csv8kcnb34                  0      617229321 23-AUG 06:55  oracle@cana01db66664 33329  r
eplicat@cana01db66664b.prod.sdt.er OGG-RCITSA00-CANA_DAIS_GREATC
                                                                                                          b.prod.sdt.monkey12.
                                                                                                          se

SQL> select sql_fulltext from gv$sql where sql_id='u52csv8kcnb34';


SQL_FULLTEXT
--------------------------------------------------------------------------------
create index "PDIXIT"."IDX_TEST121" on "PDIXIT"."TAB12"(C1666600563,C737666607
)  /* GOLDENGATE_DDL_REPLICATION */

create index "PDIXIT"."IDX_TEST121" on "PDIXIT"."TAB12"(C1666600563,C737666607
)  /* GOLDENGATE_DDL_REPLICATION */


                                    Locking                                                         Object                 Object            Lock
 SID     SER# STATUS                  User               Os user              Os Proc               Owner                   Name             Mode
------ ------ --------------- -------------------- -------------------- -------------------- -------------------- ------------------------- ----------
  6781  54621 ACTIVE          GGGTEST              oracle               33329                SYS                  OBJ$                    Row-X (SX)
  6781  54621 ACTIVE          GGGTEST              oracle               33329                PDIXIT               TAB12                   Share
 

I immediately asked them to kill those database session which was running above DDLs, and the lock issue got resolved the moment they killed them the lag was drained in few seconds. Next we added below parameter in to the replicat to exclude CREATE INDEX statements.

DDL INCLUDE MAPPED OBJNAME PDIXIT.* exclude INSTR 'CREATE INDEX' 

Customer also wanted to remove few of the rebuild online index DDLs, since those are taking too much time and causing lag in replicat, so we tried one more parameter to exclude INDEX REBUILD ONLINE operations from the specific user to happen on the target

DDL INCLUDE MAPPED OBJNAME PDIXIT.* exclude optype alter objtype 'INDEX' instr 'rebuild online'

And told them to remove added parameters to IGNORE DDLs and revert back to original once lag is zero and go back to original parameter of

DDL INCLUDE MAPPED OBJNAME PDIXIT.*

So, that’s how we fixed the issue and the flow we adopted to handle the situation.

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , | 2 Comments »

Another 10053 trace viewer : Best of the Best …

Posted by FatDBA on October 21, 2021

Finally, I’ve got a working copy of my favorite 10053 optimizer trace viewer from one of my connection. This one was written by Sergei Romanenko. I love it because it allows direct jumps to the most important parts of the trace and also uses highlighter to improve the readability of the trace and optionally formats the final query after transformations. It’s pretty easy to use, and you can directly search the keywords within these big thumping traces. And you can also wrap your text and can also format the final query.

Click here to download!

This is how the interface looks like.

Hope It Helped!
Prashant Dixit

Posted in Uncategorized | Tagged: , , , | 5 Comments »

10053 Trace Viewer : A life savior when handling colossal optimizer traces

Posted by FatDBA on October 18, 2021

Hi Everyone,

I am sure that my last post about 10053 debug traces has sparked some interest in optimizer cost calculations and estimations 🙂 As you guys are familiar that these traces aren’t that easy to digest and interpret, as they are pretty complicated, a humongous pile of internal cryptic information’s. One of the reader asked – If there are any tool that can help to at least format the trace and its sections ? Yes, there are few and one of my favorite is the 10053 viewer, and is what I am using from last few years now (lucky that I found that great blog post by Jonathan Lewis).

Click here if want to download it!

The tool is pretty easy to use! You have to click on ‘open trace file’ button and browse the 10053 trace from the system and click on ‘show trace file’ (next button).

Now load the trace file.

Now you’ll have a drop down view to select from. Once the trace is loaded, you can access sections by using ‘+’ to expand and ‘‘-‘ to minimize the section.

Expand to get more details about any particular section.

Hope It Helped!
Prashant Dixit

Posted in Uncategorized | Tagged: , , , | Leave a Comment »

 
%d bloggers like this: