Tales From A Lazy Fat DBA

Love all databases! – It's all about performance, troubleshooting & much more …. ¯\_(ツ)_/¯


Posts Tagged ‘troubleshooting’

Another 10053 trace viewer : Best of the Best …

Posted by FatDBA on October 21, 2021

Finally, I’ve got a working copy of my favorite 10053 Oracle optimizer trace viewer from one of my connections. This one was written by Sergei Romanenko. I love it because it allows direct jumps to the most important parts of the trace, uses a highlighter to improve readability, and optionally formats the final query after transformations. It’s pretty easy to use: you can directly search for keywords within these big thumping traces, wrap the text, and format the final query.

Click here to download!

This is how the interface looks.

Hope It Helped!
Prashant Dixit


10053 Trace Viewer : A life savior when handling colossal optimizer traces

Posted by FatDBA on October 18, 2021

Hi Everyone,

I am sure my last post about 10053 debug traces sparked some interest in optimizer cost calculations and estimations 🙂 As you guys know, these traces aren’t easy to digest and interpret; they are pretty complicated, a humongous pile of cryptic internal information. One of the readers asked if there is any tool that can help to at least format the trace and its sections. Yes, there are a few, and one of my favorites is the 10053 viewer, which is what I have been using for the last few years (lucky that I found that great blog post by Jonathan Lewis).

Click here if you want to download it!

The tool is pretty easy to use! Click the ‘open trace file’ button, browse to the 10053 trace on your system, and click ‘show trace file’ (the next button).

Now load the trace file.

Now you’ll have a drop-down view to select from. Once the trace is loaded, you can access sections by using ‘+’ to expand and ‘-’ to collapse each section.

Expand to get more details about any particular section.

Hope It Helped!
Prashant Dixit


Oracle event 10046 debug traces, they really aren’t that ‘complicated’ as we think – A 10046 trace break apart!

Posted by FatDBA on October 16, 2021

Hi Everyone,

Oracle has a long list of internal debug codes, and tracing is an art, a real craft. Event 10046 is one of the most popular methods for collecting extended SQL trace information (like SQL_TRACE=TRUE) for Oracle sessions. We especially use it to determine the nature of a SQL tuning problem. By setting this event, you get detailed trace information about Oracle’s internal execution: parsing, calls, waits, and bind variables, which plays a very important role in analyzing the performance of the system. It provides a great source of details about SQLs, at different levels.

This post is all about breaking the trace into its parts and understanding some of the critical sections, to help you make sense of the SQL stats it captures. I am not going to show how to generate the traces, as the steps are pretty straightforward and are available on Metalink.
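Still, just for quick reference, the classic session-level toggle looks like this (level 12 captures both wait events and bind values):

-- enable extended SQL trace in the current session (level 12 = waits + binds)
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

-- ... run the workload you want traced ...

-- switch it off again
ALTER SESSION SET EVENTS '10046 trace name context off';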

Though there are multiple use cases for 10046, I recently used it to understand a complicated and costly PL/SQL program which calls more than 1000 different SQLs, and I was interested in finding the costliest among them and why. There are surely other ways to get details of the same kind, like SQL Profiler, SQL traces etc., but none of them provides the level of detail 10046 gives. The raw trace below was processed with TKPROF, sorted by elapsed time executing (exeela), elapsed time fetching (fchela) and elapsed time parsing (prsela).

The snippet below is from a live, sorted 10046 trace from a production system running 10.2.0.5.0 (yes, an old application). You won’t notice much difference if you generate one on a newer database version; only a few things change on the latest Oracle releases. Okay, let’s first understand a few of the keywords and column names used in the output.

TKPROF: Release 10.2.0.5.0 - Production on Fri Sep 27 03:31:42 2021

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Trace file: dixitdb_ora_28282_10046_for_spdixitM.trc
Sort options: exeela  fchela  prsela  
********************************************************************************

SELECT COUNT(*) 
FROM
 CANONTALAB.DIXIT1_SAMPLE WHERE DIXIT1_SAMPLE_NO=:B1 


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse      134      0.00       0.00          0          0          0           0
Execute    862      0.03       0.03          0          0          0           0
Fetch      862     46.59      45.52          0    1235246          0         862
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1858     46.63      45.55          0    1235246          0         862

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 296  (CANONTALAB)   (recursive depth: 3)

Rows     Row Source Operation
-------  ---------------------------------------------------
      2  SORT AGGREGATE (cr=2866 pr=0 pw=0 time=114898 us)
      2   INDEX FAST FULL SCAN PK_DIXIT1_SAMPLE (cr=2866 pr=0 pw=0 time=114888 us)(object id 125001)


Rows     Execution Plan
-------  ---------------------------------------------------
      0  SELECT STATEMENT   MODE: ALL_ROWS
      2   SORT (AGGREGATE)
      2    INDEX   MODE: ANALYZED (UNIQUE SCAN) OF 'PK_DIXIT1_SAMPLE' 
               (INDEX (UNIQUE))

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                        60       12.17         43.54
  
********************************************************************************

COUNT – Represents the number of times a SQL statement was parsed, executed, or fetched.
CPU – Total CPU time in seconds for all parse, execute, or fetch calls for the SQL.
ELAPSED – Total elapsed time in seconds for all parse, execute, or fetch calls for the SQL statement.
DISK – Total number of data blocks physically read from the datafiles on disk for all parse, execute, or fetch calls.
QUERY – Total number of buffers retrieved in consistent mode for all parse, execute, or fetch calls. Usually, buffers are retrieved in consistent mode for queries. A consistent get is where Oracle returns a block from the buffer cache, but has to check that it is the version of the block current at the time the query started; if the data needs reconstructing from rollback information, that is done to give you a consistent view.
CURRENT – Total number of buffers retrieved in current mode. Buffers are retrieved in current mode for statements such as INSERT, UPDATE, and DELETE. A db block get (a current get in TKPROF) not only gets the block as it is right now, it also stops anyone else from getting that block in current mode until we change it and release it. If someone else got there first, we wait.
ROWS – Total number of rows processed by the SQL statement, not including rows processed by its subqueries. Statistics are reported against three call types: Parse, Execute & Fetch.
PARSE – Translates the SQL statement into an execution plan, including checks for proper security authorization and for the existence of tables, columns, and other referenced objects. This is where the physical and logical transformations and optimizations happen.
EXECUTE – Actual execution of the statement by Oracle. For INSERT, UPDATE, and DELETE statements, this modifies the data. For SELECT statements, this identifies the selected rows.
FETCH – Retrieves rows returned by a query. Fetches are only performed for SELECT statements.

Okay, now that all the column names and table entries are explained, let me try to explain what those numbers represent.

It says 1,235,246 blocks were retrieved in consistent mode during the fetch operation. Since this is a SELECT statement, the blocks show up against the Fetch call; for a DML statement they would show against the Execute call. Misses in the library cache are reported per call type, and if there is no miss, the line won’t be shown at all. The 1 miss for this SQL is perfectly acceptable: when a SQL runs for the first time it has to be parsed and executed and its execution plan stored. So the parse and execute calls each record 1 miss. Notice that the parse call happened 134 times but the miss count is only 1, meaning the statement was hard parsed only once and then stored in the library cache; for the next 133 parses the cached statement was reused. So we have 1 miss and 133 hits. Similarly, the execution plan was built only once, and 861 times Oracle reused the same plan from the library cache.

Now let’s jump to the row source operations and the codes they use: cr = consistent reads, pr = physical reads, pw = physical writes, time = time taken by the step in microseconds. You might see some other codes too, e.g. cost = cost incurred by the step, size = the size of data in that step, and card = cardinality.

So, the query was found doing a UNIQUE SCAN on its primary key index PK_DIXIT1_SAMPLE in ALL_ROWS mode, which is well understood, as an equality predicate was used and the unique/primary key constraint was sufficient by itself to produce an index unique scan. Finally, the trace shows the wait event details, which are pretty straightforward: the statement waited on ‘db file sequential read‘ with a max wait time of 12.17 seconds.

With the values above, we need to make a decision whether to tune the SQL or not. Unless we have a locking problem or badly performing SQLs, we shouldn’t worry about the CPU time or the elapsed time; timings come into consideration only when we have badly performing SQLs. The important factor is the number of block visits, both query (that is, subject to read consistency) and current (that is, not subject to read consistency). Segment headers and blocks that are going to be updated are acquired in current mode, but all query and subquery processing requests data in query mode.

Hope It Helped
Prashant Dixit


Migrated to RAC and getting ‘row cache locks’ or ‘enq: SQ – contention’ ?

Posted by FatDBA on September 24, 2021

Hi Everyone,

Recently I was working on a performance issue where the customer reported frequent slowness and hangs on their newly migrated 12.2 2-node RAC cluster. I was involved after the issue was already gone, so I had to dig the history out of AWR or the DBA_HIST_XX views. I started glancing over AWR reports for the probe period (~2 hours) and saw that node 1 in particular was swamped with ‘row cache lock’ wait events, and that too with a very high average wait time of 7,973.47 ms (~8 seconds per wait). Similar waits were happening on instance 2, but far fewer compared to node 1 (take a look at the AWR snippet below).

You may also see ‘enq: SQ – contention’ alongside the ‘row cache lock’ waits in this scenario, as both show up with sequence cache contention.

Below is the snippet from AWR showing it spent ~99% of DB time on sequence loading.

While checking ‘enqueue stats’ I saw ‘SQ-Sequence Cache’ enqueues with a very high overall wait time of 545 seconds (~9 minutes).

The next target was to find the source SQL, the statements waiting on these row cache lock waits. As expected, it was a SQL that hits the sequence to generate NEXTVAL and feeds that value to another statement that inserts records into a frequently accessed application log table. You can think of the statement as something like below …

-- Generating next available value from the sequence
SELECT TEST_SEQ.NEXTVAL FROM DUAL; 
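As a side note, if you need to hunt down such statements yourself, a quick sketch against ASH does the job (assuming you are licensed for the Diagnostics Pack):

-- top SQL_IDs waiting on 'row cache lock' in the recent ASH buffer
SELECT sql_id, COUNT(*) AS samples
  FROM v$active_session_history
 WHERE event = 'row cache lock'
 GROUP BY sql_id
 ORDER BY samples DESC;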

The source being a sequence, I generated its DDL to see all its options and properties. As suspected, the sequence had the NOCACHE option, because this system was recently upgraded from a standalone 12.1 database to the new 12.2 2-node RAC cluster. The main reason for specifying NOCACHE earlier was to avoid gaps in the sequence numbers, as no values are lost when an instance terminates abnormally.

CREATE SEQUENCE  "DIXIT"."TEST_SEQ"  MINVALUE 1 MAXVALUE 9999999999999999999999999999 INCREMENT BY 1 START WITH 1673163 NOCACHE  NOORDER  NOCYCLE  NOKEEP  NOSCALE  GLOBAL ;

And we completely missed modifying the sequences per the RAC best practice of using the CACHE + NOORDER combination. With this combo, each instance caches a distinct set of numbers in the shared pool, and the sequence values will not be globally ordered.

When caching is used, the dictionary cache (the row cache) is updated only once with the new high-water mark. For example, with a cache of 20, when NEXTVAL is requested the first time, the row cache value of the sequence is increased by 20 in the dictionary cache, and LAST_NUMBER in DBA_SEQUENCES increases by the cache value, i.e. 20. The 20 extracted values, stored in the shared pool, are handed out to the sessions requesting NEXTVAL.

When no caching is used, the dictionary cache has to be updated for every NEXTVAL request, which means the row cache has to be locked and updated each time. Multiple sessions requesting NEXTVAL will hence block on a ‘row cache lock’ wait. In RAC, each instance allocates numbers by accessing the database, and cache fusion may delay sending the current seq$ block to a requesting instance if the block is busy owing to many sequence number allocations from the instance holding the current block image.

There is a caveat when you use the CACHE option: gaps in the sequence numbering occur whenever the sequence cache is lost, e.g. on a shared pool flush or an instance shutdown, exactly as in single instance databases. The same happens in RAC: any flush of any shared pool is enough to invalidate the cached values. And honestly, I don’t see any problem with a gap in the sequence, unless you are running a banking application.

Let me explain it through an example ..

-- Will create a sequence, default is to cache 20 sequence values in memory.

SQL> create sequence mytest_seq start with 1 increment by 1;

Sequence created.

SQL> select mytest_seq.nextval from dual;

  NEXTVAL
----------

         1

SQL> select mytest_seq.nextval from dual;

  NEXTVAL
----------

         2


-- The database is terminated and after startup, the next value of the sequence is selected.


SQL> select mytest_seq.nextval from dual;

  NEXTVAL
----------

        21

-- The first 20 values were in the cache, but only the first two were actually used. 
-- When the instance got terminated, sequence values 3 through 20 were lost as they were in cache. 

So we decided to use caching. Considering the average modification rate and sequence generation requests against the main table, we planned to cache 500 sequence numbers, which Oracle pre-allocates and keeps in memory for faster access.

ALTER SEQUENCE TEST_SEQ cache 500; 
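To verify the change took effect, a quick dictionary check is enough (a sketch, using the demo sequence name):

-- confirm the new cache size on the sequence
SELECT sequence_name, cache_size, last_number
  FROM dba_sequences
 WHERE sequence_name = 'TEST_SEQ';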

And yup, the issue got fixed as soon as we made sufficient sequence numbers available in the cache; no more ‘row cache lock’ waits afterwards.

Hope It Helped!
Prashant Dixit


Are you suffering from excessive ‘cursor: mutex X’ & ‘cursor: mutex S’ waits after the upgrade ?

Posted by FatDBA on September 15, 2021

Hi Everyone,

Recently, I was contacted by one of my friends, who was battling some performance issues since they moved from 12c to 19c. He was mostly strained by one particular problem on the new 19c database, where he was getting excessive concurrency-class waits on “cursor: mutex X” (> 92% of the DB time) and some “cursor: mutex S” events. This was leading to frequent database hangs.

As per the above snippet from the AWR report for the period, ‘cursor: mutex X’ was averaging more than 170 ms per wait and was responsible for more than 91% of the total DB time consumption.

Initially I thought it was a case of a classic hard parsing issue, as the ‘cursor: mutex S’ wait usually occurs when Oracle serializes the parsing of multiple SQL statements. I mean, there must be SQLs going through excessive hard parsing with too many child cursors in the library cache. So I immediately checked the ‘SQL ordered by Version Count’ section and saw one individual statement with 7,201 versions (child cursors) within a period of 2 hours.
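You don’t need an AWR report for this particular check; a sketch straight against the shared pool shows the same thing:

-- cursors with unusually high child counts
SELECT sql_id, version_count, executions
  FROM v$sqlarea
 ORDER BY version_count DESC
 FETCH FIRST 10 ROWS ONLY;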

The same was confirmed through an ASH report (see the pasted snippet below). This particular SELECT statement was waiting on both of these concurrency-class events specific to the library cache.

I drilled down further to identify the cause by querying V$SQL_SHARED_CURSOR (for the reasons why a particular child cursor was not shared with existing child cursors), and I was getting BIND_EQUIV_FAILURE as the reason. The database had ACS (Adaptive Cursor Sharing) and CFB (Cardinality Feedback) enabled, and it looked like a ‘cursor leak’ issue.
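That check is easy to repeat on your own system; a minimal sketch (substitute your own SQL_ID):

-- why is a new child cursor not shared with the existing ones?
SELECT child_number, bind_equiv_failure
  FROM v$sql_shared_cursor
 WHERE sql_id = '&sql_id';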

I also noted huge sleeps on CP-type mutexes in the functions kkscsAddChildNode & kkscsPruneChild; below is the snippet from AWR, take a look at the first two rows in red.

And just when I was about to prepare a strategy (specific plan purges etc.) to handle the situation, I thought of generating a hanganalyze dump to see whether there were any known/familiar hang chains within the stack traces. Most of the chains were running the same cursor from different processes, waiting on ‘cursor: mutex X’ with the error stack below … I mean, there were multiple unique sessions waiting for a parent cursor mutex in exclusive mode on the same cursor under the following stack.

<-kgxExclusive<-kkscsAddChildNode<-kxscod<-kkscsCompareBinds<-kkscscid_bnd_eval<-kkscsCheckCriteria<-kkscsCheckCursor<-kkscsSearchChildList<-kksfbc<-

So, we had an error stack showing wait chains running the same cursor from different processes waiting on ‘cursor: mutex X’, and with BIND_EQUIV_FAILURE=Y in V$SQL_SHARED_CURSOR and CFB & ACS enabled, it appeared this was happening due to some bug.

Oracle support confirmed my doubt. They affirmed this was all happening due to two unpublished bugs, 28889389 and 28794230. For the first one we had to apply patch 28889389, which has optimized code for the cursor mutex while searching the parent cursor for a match; for the second one, 28794230, they recommended a few alternatives, given below …

_optimizer_use_feedback=false
_optimizer_adaptive_cursor_sharing=false
_optimizer_extended_cursor_sharing_rel=none
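If you go down this route, underscore parameters of this kind are typically set as below (a sketch; touch them only under Oracle Support guidance):

-- disable cardinality feedback plus adaptive & extended cursor sharing
ALTER SYSTEM SET "_optimizer_use_feedback" = FALSE SCOPE=BOTH;
ALTER SYSTEM SET "_optimizer_adaptive_cursor_sharing" = FALSE SCOPE=BOTH;
ALTER SYSTEM SET "_optimizer_extended_cursor_sharing_rel" = 'NONE' SCOPE=BOTH;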

But even after setting the above three undocumented parameters, which disable cardinality feedback and adaptive & extended cursor sharing, we only saw a ~30% reduction in total waits. Later on, they recommended applying the optimizer bug-fix control

_fix_control='23596611:OFF'

and that completely resolved the issue.

Hope It Helped!
Prashant Dixit


My favorite 5 linux TOP features/flags ….

Posted by FatDBA on July 23, 2021

Hi Guys,

Recently I started the ‘Top 5’ series, where I share my top 5 features of any particular tool or product. Last time I did it for the SQL Developer command line (SQLcl); this time it’s about one of Linux’s favorite tools, ‘TOP‘.

So, without any particular order, below are my top 5 picks of options/flags available with the tool.

c : Shows the absolute path of each command currently executing. This is quite helpful when you want to know where a command/executable is running from. I have used this option lots of times in the many performance battles I have won 🙂 Look at the screen below, in the COMMAND column.

z : Displays running processes in red, which may help you identify running processes easily.

1 : Shows usage details per CPU core.

SHIFT+I : Toggles IRIX mode off (it is ON by default), which helps you understand why a process’s CPU % usage can be larger than the total CPU time.

E : Cycles the memory summary through KB, MB & GB; press ‘E’ each time to switch units. No need to separately call the ‘free’ command.

I know this is all about my top 5 picks, but here goes my 6th (kind of a reserve option) 🙂

V : Tree mode, to display the processes in a parent-child hierarchy.
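And if you ever need a non-interactive capture during an incident, the same tool can be scripted (a small sketch):

# batch mode: full command lines (-c), 5 iterations (-n 5), non-interactive (-b)
top -c -b -n 5 > /tmp/top_snapshot.txt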

Hope It Helped
Prashant Dixit


How to fix/handle Wraparound problem in PostgreSQL ?

Posted by FatDBA on July 21, 2021

Hi Guys,

In continuation from where I left off in my last post on simulating wraparound in PostgreSQL, this post is all about managing the mess we spilled 🙂 Though this was a synthetic/artificial test: as soon as you close or exit SESSION 1, where you opened a BEGIN block/transaction and ran select txid_current() to block autovacuum, autovacuum will kick in automatically and perform a full vacuum on all the databases in the cluster. Remember that even if you have autovacuum turned off, it will still start in emergency mode.

So, this is what we got earlier when we hit the wraparound problem and tried to execute a DML: it says something about stopping the postmaster and vacuuming the database in single-user mode. What is that? We will see next …

enterprisedb=#
enterprisedb=# SELECT datname
, age(datfrozenxid)
, current_setting('autovacuum_freeze_max_age')
FROM pg_database
 ORDER BY 2 DESC;
 
   datname    |    age     | current_setting
--------------+------------+-----------------
 postgres     | 2146483647 | 200000000
 edb          | 2146483647 | 200000000
 template1    | 2146483647 | 200000000
 template0    | 2146483647 | 200000000
 immortal     | 2146483647 | 200000000
 darkthrone   | 2146483647 | 200000000
 immortal1    | 2146483647 | 200000000
 mayhem       | 2146483647 | 200000000
 murduk       | 2146483647 | 200000000
 burzum       | 2146483647 | 200000000
 gorgoroth    | 2146483647 | 200000000
 repmgr       | 2146483647 | 200000000
 enterprisedb | 2146483647 | 200000000
(13 rows)



enterprisedb=# update pgbench_history set aid=65555 where aid=64444;
ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
enterprisedb=#
enterprisedb=#

Okay, coming back to the point: how do you handle the situation when you’ve already hit the wraparound? First, you have to manually shut down the database cluster, if it’s still up.

-bash-4.2$ pwd
/usr/edb/as11/bin

-bash-4.2$ /usr/edb/as11/bin/pg_ctl stop -D /var/lib/edb/as11/data
waiting for server to shut down.... done
server stopped
-bash-4.2$

Next, you need to run VACUUM FULL for each of the databases (one by one) through the stand-alone backend mode. For that, you call the postgres binary (if you’re on open source) or the edb-postgres utility. I am on EDB PostgreSQL 11, so I will call edb-postgres and execute VACUUM FULL.

-bash-4.2$ pwd
/usr/edb/as11/bin

-bash-4.2$
-bash-4.2$ ls *postgre*
edb-postgres
-bash-4.2$

-bash-4.2$ ./edb-postgres --single -D /var/lib/edb/as11/data enterprisedb

PostgreSQL stand-alone backend 11.11.20
backend> vacuum full;
2021-07-20 11:07:49 IST LOG:  statement: vacuum full;

backend>

It’s done for the ‘enterprisedb’ database. Next we’ll do it for the ‘postgres’ database, but this time I will use some stand-alone backend commands to make sure we are connected to the right database, and to enable runtime stats (verbose).

-bash-4.2$ ./edb-postgres --single -D /var/lib/edb/as11/data postgres

PostgreSQL stand-alone backend 11.11.20
backend>
backend> select current_database();
2021-07-20 13:45:25 IST LOG:  statement: select current_database();

         1: current_database    (typeid = 19, len = 64, typmod = -1, byval = f)
        ----
         1: current_database = "postgres"       (typeid = 19, len = 64, typmod = -1, byval = f)
        ----
backend> 


backend> SELECT set_config('log_statement_stats', 'on', false);
2021-07-20 13:48:55 IST LOG:  statement: SELECT set_config('log_statement_stats', 'on', false);

         1: set_config  (typeid = 25, len = -1, typmod = -1, byval = f)
        ----
         1: set_config = "on"   (typeid = 25, len = -1, typmod = -1, byval = f)
        ----
backend>

-- Now that statement stats logging is enabled, let's execute the VACUUM to see runtime stats
backend> vacuum full;
2021-07-20 13:49:29 IST LOG:  statement: vacuum full;

2021-07-20 13:49:30 IST LOG:  QUERY STATISTICS
2021-07-20 13:49:30 IST DETAIL:  ! system usage stats:
        !       0.557302 s user, 0.426659 s system, 120.357278 s elapsed
        !       [0.583998 s user, 0.455579 s system total]
        !       21756 kB max resident size
        !       224/402352 [240/402824] filesystem blocks in/out
        !       0/2558 [0/6201151] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       982/2 [1005/3] voluntary/involuntary context switches
2021-07-20 13:58:39 IST STATEMENT:  vacuum full;

backend>

Okay, in the same way we have to run VACUUM FULL on all the remaining databases and then restart the PostgreSQL cluster.
Once all the databases are done and the cluster is restarted, connect to a database and you will see that the age(datfrozenxid) has been reset; you are all good.

enterprisedb=# SELECT datname
, age(datfrozenxid)
, current_setting('autovacuum_freeze_max_age')
FROM pg_database
 ORDER BY 2 DESC;

   datname    | age  | current_setting
--------------+------+-----------------
 postgres     | 1369 | 200000000
 edb          | 1272 | 200000000
 template1    | 1170 | 200000000
 template0    | 1071 | 200000000
 immortal     |  974 | 200000000
 darkthrone   |  877 | 200000000
 immortal1    |  780 | 200000000
 mayhem       |  585 | 200000000
 murduk       |  487 | 200000000
 burzum       |  388 | 200000000
 gorgoroth    |  291 | 200000000
 repmgr       |  194 | 200000000
 enterprisedb |   97 | 200000000

You’re all set!
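Going forward, it is worth keeping an eye on the tables whose relfrozenxid is ageing fastest, so the freeze activity never falls this far behind again. A sketch:

-- top 10 tables by transaction ID age in the current database
SELECT relname, age(relfrozenxid) AS xid_age
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY 2 DESC
 LIMIT 10;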

Hope It Helped!
Prashant Dixit


Quarantine a SQL ? This can save from a system performance pandemic …

Posted by FatDBA on July 6, 2021

Hi Guys,

Last weekend I was asked to examine a performance problem where the customer was observing sporadic resource spikes on CPU & physical IO, and had no idea how to handle the situation. They had already identified the problematic SQL and asked me to take a look. While doing the initial investigation on the system, I found that Resource Manager was enabled and was killing the ill-behaved SQL every time it tried to breach the limits set on CPU & other resources.

I spent some time understanding the query’s behavior and performance, and found it was a scenario of a SQL plan (PHV) flip, where the query optimizer toggles between a good and a bad plan. This being a time-sensitive issue, the customer needed a quick fix before I could identify the reason behind the PHV flip, so we had to think of a technique to stop the SQL executing with the bad PHVs. This was an Oracle EE 19c system running on Exadata, so I immediately proposed a solution using the new ‘SQL Quarantine‘ feature, which prevents reuse of an execution plan that was terminated by Resource Manager due to resource limits.

Let me show you how to do that. You can quarantine a SQL by SQL ID, by SQL ID plus a specific execution plan (PHV), or by SQL text; here I will demonstrate how to quarantine a SQL and all of its execution plans.

Note: This feature is currently available only on Enterprise Edition on Engineered Systems, like Exadata and Exadata Cloud Service, but for this demo I will enable the ‘_exadata_feature_on‘ parameter on my sandbox setup.

** Please don’t touch this parameter in your production environments if you’re not running on Oracle EE on the top of Exadata or Exadata Cloud services.

[oracle@canttowinsec ~]$ !sql
sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Jul 3 23:33:41 2021
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.


SQL>
SQL> @hidden
Enter value for param: exadata_feature_on
old   5:    and a.ksppinm like '%&param%'
new   5:    and a.ksppinm like '%exadata_feature_on%'

Parameter                                     Session Value             Instance Value            descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_exadata_feature_on                           FALSE                     FALSE                     Exadata Feature On

SQL>
SQL>


SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>
SQL>


SQL> startup
ORACLE instance started.

Total System Global Area  922745216 bytes
Fixed Size                  8903040 bytes
Variable Size             843055104 bytes
Database Buffers           67108864 bytes
Redo Buffers                3678208 bytes
Database mounted.
Database opened.
SQL>
SQL>
SQL>



SQL>
SQL> @hidden
Enter value for param: exadata_feature_on
old   5:    and a.ksppinm like '%&param%'
new   5:    and a.ksppinm like '%exadata_feature_on%'

Parameter                                     Session Value             Instance Value            descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_exadata_feature_on                           TRUE                      TRUE                      Exadata Feature On

SQL>




-- For this demo I have created one table with 1000000 rows. 

SQL> select count(*) from bigtab;

  COUNT(*)
----------
   1000000

Next, check the SQL_ID, PHV and text of the SQL for which you want to set up the QUARANTINE feature, and create the quarantine configuration for it. You can set it up by SQL ID, SQL ID + PHV, or SQL text.

SQL> select sql_id, plan_hash_value, sql_text from v$sqlarea where sql_text like '%select * from bigtab%';

SQL_ID        PLAN_HASH_VALUE
------------- ---------------
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6fwqzurbc8y7k       441133017
select * from bigtab

SQL>


-- Create a quarantine configuration by SQL_ID (optionally pinned to one plan hash value)
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '&sql_id');
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '&sql_id', PLAN_HASH_VALUE => '&PLAN_HASH_VALUE');

-- Or create one by SQL text
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_TEXT(SQL_TEXT => to_clob('select count(*) from emp'));



-- This command creates a quarantine configuration for SQL ID '6fwqzurbc8y7k' and all its execution plans.
SQL>
SQL> DECLARE
quarantine_sql VARCHAR2(30);
BEGIN
quarantine_sql :=   DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '6fwqzurbc8y7k');
END;
/  

PL/SQL procedure successfully completed.

SQL>



-- Check if quarantine config/profile is created.

SQL> select NAME, SQL_TEXT, ELAPSED_TIME, cpu_time,CREATED, ENABLED from dba_sql_quarantine;


NAME                           SQL_TEXT                       ELAPSED_TIME         CPU_TIME                       CREATED                        ENA
------------------------------ ------------------------------ -------------------- ------------------------------ ------------------------------ ---
SQL_QUARANTINE_2akx3bq44wvc5   select * from bigtab           ALWAYS               ALWAYS                         04-JUL-21 12.09.25.567422 AM   YES


Next, we will use the DBMS_SQLQ.ALTER_QUARANTINE procedure to put limits on resource usage: elapsed time, CPU time, IO (MB), number of physical IO (PIO) requests, and number of logical IO (LIO) requests.

SQL>
SQL> BEGIN
    DBMS_SQLQ.ALTER_QUARANTINE(
       QUARANTINE_NAME => 'SQL_QUARANTINE_2akx3bq44wvc5',
       PARAMETER_NAME  => 'CPU_TIME',
       PARAMETER_VALUE => '5');

    DBMS_SQLQ.ALTER_QUARANTINE(
       QUARANTINE_NAME => 'SQL_QUARANTINE_2akx3bq44wvc5',
       PARAMETER_NAME  => 'ELAPSED_TIME',
       PARAMETER_VALUE => '10');
END;
/  2    3    4    5    6    7    8    9   10   11   12

PL/SQL procedure successfully completed.

SQL>


-- let's verify limits that we have set against this SQL Quarantine profile.

NAME                           SQL_TEXT                       ELAPSED_TIME         CPU_TIME                       CREATED                        ENA
------------------------------ ------------------------------ -------------------- ------------------------------ ------------------------------ ---
SQL_QUARANTINE_2akx3bq44wvc5   select * from bigtab           10                   5                              04-JUL-21 12.12.41.918609 AM   YES

Next we will setup the RESOURCE MANAGER.

-- Create a pending area.
SQL> begin
 dbms_resource_manager.create_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

-- Create a consumer group with name 'DIXITPOC_GROUP'.
SQL> begin
 dbms_resource_manager.create_consumer_group(CONSUMER_GROUP=>'DIXITPOC_GROUP',COMMENT=>'To test SQL quarantine feature for one SQL');
 end;
 /  2    3    4

PL/SQL procedure successfully completed.


-- Now bind the consumer group with the user 'DIXIT', I mean this could be any group through which you will execute the query.
SQL> begin
 dbms_resource_manager.set_consumer_group_mapping(attribute => 'ORACLE_USER',value => 'DIXIT',consumer_group =>'DIXITPOC_GROUP' );
 end;
  /  2    3    4

PL/SQL procedure successfully completed.


-- Create a resource plan with name 'POC_FOR_QUARANTINE'.
SQL> begin
dbms_resource_manager.create_plan(plan => 'POC_FOR_QUARANTINE',comment => 'Kill SQL and its other execution plans when exceeding CPU and Elapsed seconds');
end;
/  2    3    4

PL/SQL procedure successfully completed.


-- Create a plan directive by allocating resource plan 'POC_FOR_QUARANTINE' to consumer group 'DIXITPOC_GROUP'. I am setting execution limit of 5 seconds for the SQL.
SQL> begin
 dbms_resource_manager.create_plan_directive(
 plan => 'POC_FOR_QUARANTINE',
 group_or_subplan => 'DIXITPOC_GROUP',
 comment => 'Kill SQL and its other execution plans when exceeding CPU and Elapsed seconds',
 switch_group=>'CANCEL_SQL',
 switch_time => 5,
 switch_estimate=>false);
 end;
 /  2    3    4    5    6    7    8    9   10

PL/SQL procedure successfully completed.




-- Allocate full resources to rest of the sessions which are not part of this plan.
SQL> begin
 dbms_resource_manager.create_plan_directive(PLAN=> 'POC_FOR_QUARANTINE', GROUP_OR_SUBPLAN=>'OTHER_GROUPS',COMMENT=>'leave rest as is');
 end;
 /  2    3    4

PL/SQL procedure successfully completed.



-- Validate and submit pending area.
SQL> begin
 dbms_resource_manager.validate_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

SQL> begin
 dbms_resource_manager.submit_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

SQL>


-- Grant switch privilege to the DIXIT user which will switch it to 'DIXITPOC_GROUP' group.
SQL> begin
dbms_resource_manager_privs.grant_switch_consumer_group('DIXIT','DIXITPOC_GROUP',false);
end;
/  2    3    4

PL/SQL procedure successfully completed.

-- Create initial consumer group for DIXIT user for 'DIXITPOC_GROUP' group.
SQL> begin
dbms_resource_manager.set_initial_consumer_group('DIXIT','DIXITPOC_GROUP');
end;
/  2    3    4

PL/SQL procedure successfully completed.

SQL>

Next, set the resource manager plan at the database level: set RESOURCE_MANAGER_PLAN to the name of the plan, ‘POC_FOR_QUARANTINE’.

SQL> show parameter resource

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
resource_limit                       boolean     TRUE
resource_manage_goldengate           boolean     FALSE
resource_manager_cpu_allocation      integer     5
resource_manager_plan                string
SQL> 


SQL> alter system set RESOURCE_MANAGER_PLAN = 'POC_FOR_QUARANTINE';

System altered.

SQL> show parameter resource

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
resource_limit                       boolean     TRUE
resource_manage_goldengate           boolean     FALSE
resource_manager_cpu_allocation      integer     5
resource_manager_plan                string      POC_FOR_QUARANTINE
SQL>

Execute the SQL for the first time: you will get ‘active time limit exceeded’ due to Resource Manager.

SQL>
SQL> select * from bigtab;

ERROR at line 2:
ORA-00040: active time limit exceeded - call aborted

Now, if you run the SQL a second time, you will see your quarantine profile being used; it will abort the execution of the SQL.

SQL>
SQL> select * from bigtab;

ERROR at line 2:
ORA-56955: quarantined plan used
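Afterwards you can confirm it was really the quarantine that blocked the plan; in 19c, V$SQL exposes this directly (a sketch):

-- which quarantine configuration kicked in, and how many executions it avoided
SELECT sql_id, sql_quarantine, avoided_executions
  FROM v$sql
 WHERE sql_id = '6fwqzurbc8y7k';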

Hope It Helped
Prashant Dixit


Something cool was introduced in Oracle Golden Gate 12.2 …

Posted by FatDBA on June 16, 2021

Hi Guys,

Last week someone from my team contacted me about a problem he was facing with Golden Gate: he was getting an error message from the replicat process saying ‘Error mapping TEST.TABLE1 to TEST.TABLE1’, and the replicat was abending at the target end. He was using Golden Gate version 12.1.2.0. During the initial investigation, we found the table structure at the source and the target was the same except for the column order/positions, just like below.

Source
-----------------------
    TEST.TABLE1
ID Varchar2(10)
Name Varchar2(10)
Codes Numeric(10)

Target:
------------------------
    TEST.TABLE1
ID Varchar2(10)
Codes Numeric(10)
Name Varchar2(10)

As GG works with the column order and not explicitly with the column names, this led to that error at the target while doing the INSERTs into the table. We knew this is now auto-handled by GG, but we weren’t sure in which version it was fixed, so we immediately started looking at newer releases and their documentation. Luckily, we found the release notes for Golden Gate 12.2, which clearly explained that there is no longer any need to generate DEFGEN files, or to use the ASSUMETARGETDEFS or SOURCEDEFS parameters (in some cases).
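For context, before 12.2 the classic workaround for a column-order mismatch was a DEFGEN-generated definitions file referenced from the replicat parameter file. A sketch with hypothetical names (rtest1, the user, and the paths are made up for illustration):

-- replicat param file sketch: map using a DEFGEN definitions file
REPLICAT rtest1
USERID ggadmin, PASSWORD *****
SOURCEDEFS ./dirdef/table1.def
MAP TEST.TABLE1, TARGET TEST.TABLE1;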

Starting from Golden Gate version 12.2 (released in 2016), the trail files contain the metadata (TDRs, or table definition records), so GG knows table details like column order/positions, data types and column lengths before applying the changes to the table. So now, even if you add a new table, drop a column, add a new column, rename an existing column, or change a column’s datatype length, you don’t need to worry: Golden Gate will take care of it on its own.

So, about the solution: we had two options to choose from, upgrade GG from 12.1 to 12.2, or use the FORMAT RELEASE 12.2 trail parameter, with which there is no need to use any of the above mentioned parameters. We chose to upgrade GG to 12.2, and that fixed the problem!

Hope It Helped
Prashant Dixit


repmgr failover done, but where is my Standby ?

Posted by FatDBA on December 15, 2020

Hey Guys,

Last weekend I got a call from one of my friends asking about a weird issue in his master-slave (primary-standby) setup, where he uses repmgr for automatic failover and switchover operations. He said that after a failover (the server crashed) with role reversal/promotion, the old standby became the new primary (expected, all good till here), but as soon as the old primary came back online and rejoined the configuration, it started as a primary too. You can imagine something like below: both hosts become primary and there is no sign of a new standby.


[postgresdb@fatdba ~]$ repmgr -f /etc/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+-----------------------------------------
 1  | node1 | primary | * running |          | default  | host=db_node1 dbname=repmgr user=repmgr
 2  | node2 | primary |   running |          | default  | host=db_node2 dbname=repmgr user=repmgr
 

He checked the repmgr logs on the standby node and everything looked okay; the standby promotion was successful. It seems something was stopping repmgr from updating the information and bringing the crashed node (the previous primary) back as a standby. So the final situation was two primary nodes.

So, how and where to start?
A quick solution to this problem is to create an AFTER INSERT trigger on the internal EVENTS table of the repmgr database (repmgr creates its own database, named repmgr by default). The idea is to handle the “standby_promote” event: when one is recorded, a script is fired that brings the old primary back as the new standby.

Next, we need a script that automates the standby rejoin. For that purpose you can use this script (link), or write something similar. And since we need to call a bash script, i.e. write stored procedures in shell, we need the ‘plsh’ extension in the database. Let’s install the most popular extension for that, petere/plsh (link).


[postgresdb@fatdba ~]$ sudo apt-get install make gcc postgresql-server-dev-12
[postgresdb@fatdba ~]$ cd /usr/share/postgresql/12/extension/
[postgresdb@fatdba ~]$ sudo git clone https://github.com/petere/plsh.git
[postgresdb@fatdba ~]$ cd plsh
[postgresdb@fatdba ~]$ sudo make PG_CONFIG=/usr/bin/pg_config
[postgresdb@fatdba ~]$ sudo make install PG_CONFIG=/usr/bin/pg_config
 

Next, create the extension in the database.


[postgresdb@fatdba ~]$ psql -U repmgr
 

This one needs to be executed only on the primary side.


repmgr=# create extension plsh;
 

Now, let’s create the function and a trigger to invoke the script failover_promote.sh


[postgresdb@fatdba ~]$ psql
postgres=# CREATE FUNCTION failover_promote() RETURNS trigger AS $$
#!/bin/sh
/bin/bash /tmp/psqlscripts/failover_promote.sh $1 $2
$$
LANGUAGE plsh;
 

And create the AFTER INSERT trigger on the EVENTS table to call the function created above.


postgres=#  create trigger failover
after insert
on events
for each row
execute procedure failover_promote();
 

Yes, that’s it. All done!
Now you can mimic a failover scenario, e.g. by stopping the primary database, and check the repmgr logs on the standby server.
You will see the standby get promoted to the new primary, and as soon as that finishes, it will start pinging the crashed node; the moment it is back, it will be brought in as the new standby.

Hope It helped!
Prashant Dixit


 