Tales From A Lazy Fat DBA

Den of an Oracle DB Performance & Troubleshooting freak … \,,/


Archive for the ‘troubleshooting’ Category

Oracle Real Application Testing (RAT) – Part 1: What is it?

Posted by FatDBA on January 31, 2020

Hi Guys,

As promised, I am back with the first post on Oracle RAT (Real Application Testing), and there will be a couple more follow-up chapters on the same over the next few days or weeks.

Alright, recently during one of our mission-critical production database migrations we reached a point where we had to perform a load test before pushing the real-time workload onto the new system. I was asked to prepare the strategy and to pick the best possible tool to assess the performance of this new system and how it would respond to the current traffic.

I received lots of suggestions from the rest of the team, i.e. Swingbench, LoadRunner, Orion etc., but most of them come with a predefined set of supplied benchmarks; a few are customizable, but those are more about server performance and benchmarking, not the database or SQL level. Considering the notorious behavior of much of the custom code and many legacy application modules, I was leaning towards a tool that covers both the database and SQL levels, and we finally agreed on Oracle RAT.

Oracle Real Application Testing is an option that comes with Oracle Enterprise Edition. It helps you test a real-life workload against changes to the database such as database upgrades, OS upgrades, parameter changes, hardware replacement, etc. So, in short, Oracle RAT is a system stress-test tool that simulates production load. It was introduced in Oracle 11g Release 1. But yes, it's not free; it comes at additional cost and requires a license.

Two features, "Database Replay" and "SQL Performance Analyzer", help with fine-tuning the database before it goes to production.
I will cover the 'Database Replay' feature here and might cover the 'SQL Performance Analyzer' feature later.

When can you use the RAT "Database Replay" feature?
System changes
– Hardware replacement such as CPU, RAM, etc.
– Database and OS upgrades
– Storage changes (e.g. OCFS2 to ASM)
– OS changes (e.g. Windows to Linux)
Configuration changes
– Single instance to RAC
– Patch installation
– Database parameter changes

Which database versions are supported?
The workload capture process is supported on Oracle Database 10g R2 (10.2.0.4) and above. The workload replay process is supported on Oracle Database 11g R1 and above.

How to do it, and where to start?
Well, there are two different ways you can perform RAT (DB Replay) testing:
– Using Oracle Enterprise Manager (OEM): this option is entirely GUI based; you select your source and target systems and perform the stress/load testing through a series of clicks.
– Using the command line (my preferred way of doing this, yes I am 'old school'), via the DBMS_WORKLOAD_CAPTURE & DBMS_WORKLOAD_REPLAY packages.

Some high-level steps:
– Capture the workload into capture files (.rec files, which are flat files)
– Copy the files to the test system and preprocess them (to make them machine-understandable)
– Replay the files on the test system (play back the recorded files)
– Perform detailed analysis of workload capture and replay using the reports generated by Database Replay (reporting for benchmarking)

ON SOURCE System:
dbms_workload_capture.start_capture 
dbms_workload_capture.finish_capture; 
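
Expanded into a runnable sketch (a hedged example: the directory object, capture name and filter below are illustrative, not from the original run):

-- All names are hypothetical; adjust for your system
CREATE DIRECTORY rat_capture_dir AS '/home/oracle/rat/test1';

BEGIN
  -- Optional: exclude a user from the capture (filter name/value are examples)
  DBMS_WORKLOAD_CAPTURE.ADD_FILTER(
    fname      => 'skip_dbsnmp',
    fattribute => 'USER',
    fvalue     => 'DBSNMP');

  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name     => 'CAPTURE_TEST1',
    dir      => 'RAT_CAPTURE_DIR',
    duration => NULL);  -- NULL means capture until FINISH_CAPTURE is called
END;
/

-- ... let the workload run for the desired window ...

EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;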

Copy the workload files to the replay (client) system, for example: /home/oracle/rat/test1

On TARGET System:
1. dbms_workload_replay.process_capture 
2. dbms_workload_replay.initialize_replay 
3. dbms_workload_replay.prepare_replay 
4. Run the workload client to calibrate the replay. The calibration process (mode=CALIBRATE) recommends the number of client processes required to perform the replay
5. Replay the workload using the command below.
dbms_workload_replay.start_replay; 
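
Strung together, a hedged sketch of the target-side calls (directory and replay names are illustrative; PREPARE_REPLAY defaults vary slightly between versions):

BEGIN
  -- 1. Preprocess the copied capture files (one-time per target version)
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'RAT_CAPTURE_DIR');

  -- 2. Initialize the replay
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'REPLAY_TEST1',
    replay_dir  => 'RAT_CAPTURE_DIR');

  -- 3. Put the database into PREPARE mode (defaults keep commit ordering)
  DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY;
END;
/

-- 4. Start the wrc clients from the shell (mode=replay), then:

EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;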

Components: The ARCHITECTURE (Simplified)

DB REPLAY “The Big Picture”

What is a Workload Client?
Replay uses wrc clients, which are multithreaded Java clients and can be started on the same machine/host or on separate hosts.
I will cover them in more depth in future posts.


[oracle@PDIXIT:RAT]$ wrc system/XXXX mode=calibrate replaydir=/DBCapture/RAT/RAT_13DEC15_19_17
 Workload Replay Client: Release 11.2.0.4.0 - Production on Sat Dec 16 05:50:39 2015
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
 
 
Report for Workload in: /DBCapture/RAT/RAT_13DEC16_19_17
-----------------------
Recommendation:
Consider using at least 13 clients divided among 4 CPU(s)
You will need at least 168 MB of memory per client process.
If your machine(s) cannot match that number, consider using more clients.
 
Workload Characteristics:
- max concurrency: 575 sessions
- total number of sessions: 1729
 
Assumptions:
- 1 client process per 50 concurrent sessions
- 4 client process per CPU
- 256 KB of memory cache per concurrent session
- think time scale = 100
- connect time scale = 100
- synchronization = TRUE
 

Now, how to compare/benchmark?
At the end of both CAPTURE & REPLAY you need to generate a few process-specific reports.
A few of the important reports that help in benchmarking are:
AWR reports: generate AWR reports for the intervals during which either of the two processes was in progress. The begin and end snaps can be collected from DBA_WORKLOAD_CAPTURES & DBA_WORKLOAD_REPLAYS.
CAPTURE/REPLAY reports: these reports are specific to the workload capture and its replay on the target.
Capture vs Replay comparison reports.
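
A hedged SQL*Plus sketch for pulling the snap IDs and a replay report (the replay ID below is illustrative; look yours up first):

SET LONG 1000000
SET LONGCHUNKSIZE 1000000
SET PAGESIZE 0

-- Begin/end AWR snapshots bracketing each capture and replay
SELECT id, name, awr_begin_snap, awr_end_snap FROM dba_workload_captures;
SELECT id, name, awr_begin_snap, awr_end_snap FROM dba_workload_replays;

-- Replay report (HTML) for an illustrative replay id 11
SELECT DBMS_WORKLOAD_REPLAY.REPORT(replay_id => 11,
                                   format    => DBMS_WORKLOAD_REPLAY.TYPE_HTML)
FROM   dual;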

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

Disk Goes Offline after rebalance! – Is this due to a BUG?

Posted by FatDBA on January 30, 2020

Hi Everyone,

Today during an activity where we migrated the ASM storage for one of our 2-node RAC clusters (running 11gR2), we had to perform disk rebalancing to copy/mirror the contents from the existing storage to the new storage before dropping the old storage partitions, and we faced some weirdness. The disks went offline in this multi-node ASM setup, and we were initially left stranded with no idea behind this behavior, but finally we were able to locate a MetaLink page for the same issue.

Yes, this was due to a known bug, number 13476583, in the Oracle Server (RDBMS).
The problem was introduced by the fix for bug 10040921 in:
– 11.2.0.2.3 Patch Set Update
– 11.2.0.2.4 Patch Set Update
– 11.2.0.2.5 Patch Set Update
– 11.2.0.3

Problem:
When disks are dropped, a forcible diskgroup dismount is performed on the other ASM instance(s).

Workaround or Fix:
1. The problem does not cause diskgroup corruption, so in most cases the diskgroup can simply be mounted again.
2. Apply the fix. Interim patches are available here: Patch 13476583. The fix is also included in:
– 11.2.0.2.6 Patch Set Update
– 11.2.0.2 Patch 17 on Windows Platforms

Oracle Notes: 245840.1
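
If you hit the dismount before the patch is in place, the diskgroup can usually just be mounted again. A minimal sketch (DATA is an illustrative diskgroup name):

-- Check diskgroup state across all ASM instances
SELECT inst_id, name, state FROM gv$asm_diskgroup;

-- From the ASM instance where it was dismounted
ALTER DISKGROUP data MOUNT;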

Hope That Helps
Prashant Dixit

Posted in Advanced, troubleshooting

RAT Reporting Error: ORA-06502: numeric or value error: character string buffer too small

Posted by FatDBA on December 16, 2019

Hi All,

Today's topic of discussion is handling/fixing an issue that I faced while generating RAT (Real Application Testing) reports on a 10gR2 database. I know many of us are not yet aware of the tool, its purpose and functionality; very soon I will write about this great Oracle product for database load testing using real/genuine workload, which is quite helpful for forecasting your DB performance before you migrate.

Alright, coming back to the point: I was trying to generate the RAT capture report to see what all was there in the capture, its observations, highlights and the rest, and that's when we encountered an error (pasted below).



DECLARE
  l_report CLOB;
BEGIN
  l_report := DBMS_WORKLOAD_CAPTURE.report(capture_id => 81,
                                           format     => DBMS_WORKLOAD_CAPTURE.TYPE_HTML);
END;
/
DECLARE
*
ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error: character string buffer too small
ORA-06512: at "SYS.DBMS_SWRF_REPORT_INTERNAL", line 7446
ORA-06512: at "SYS.DBMS_SWRF_REPORT_INTERNAL", line 8591
ORA-06512: at "SYS.DBMS_SWRF_REPORT_INTERNAL", line 8521
ORA-06512: at "SYS.DBMS_WORKLOAD_CAPTURE", line 486
ORA-06512: at "SYS.DBMS_WORKLOAD_CAPTURE", line 1214
ORA-06512: at line 4


There are two solutions to this problem:

1. First, drop the common (shared by capture and replay) schemas and their infrastructure tables using the two scripts below. The first script (catnowrr.sql) drops the schema tables shared by capture and replay along with the capture infrastructure tables; the second, catwrr.sql, the catalog script for Workload Capture and Replay, then rebuilds all the capture- and replay-related tables.


@@?/rdbms/admin/catnowrr.sql
@@?/rdbms/admin/catwrr.sql
exec prvt_report_registry.register_clients(TRUE); --- This one registers clients 

Note: in that case you might lose all of your previous capture ID details from the system, as it simply wipes everything related to the RAT tables. Hence this is a rather crude, raw method of fixing the issue, and I recommend always checking with Oracle Support before running these scripts on your database!

2. The other approach I tried, to avoid this error, was to generate the RAT capture report from the target instead of the source, where we were getting the error.
Is that possible? Yes, you can. After further analysis I found the issue is in the 10gR2 capture-reporting code, which sometimes throws this error.

So the second way turned out to be the better approach here, as all of our previous stats and data remained untouched and nothing was wiped out; we simply ran the reporting procedure from the target (12cR2 in our case) and that's how we avoided the issue.
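
For reference, a hedged sketch of option 2 as run on the target (the directory object name is illustrative; GET_CAPTURE_INFO loads the capture metadata from the copied capture directory and returns its ID):

SET SERVEROUTPUT ON

DECLARE
  l_capture_id NUMBER;
  l_report     CLOB;
BEGIN
  -- Locate/load the capture metadata from the capture directory
  l_capture_id := DBMS_WORKLOAD_CAPTURE.GET_CAPTURE_INFO(dir => 'RAT_CAPTURE_DIR');

  l_report := DBMS_WORKLOAD_CAPTURE.REPORT(
                capture_id => l_capture_id,
                format     => DBMS_WORKLOAD_CAPTURE.TYPE_HTML);

  DBMS_OUTPUT.PUT_LINE('Report size: ' || DBMS_LOB.GETLENGTH(l_report));
END;
/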


Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

root.sh failing while installing 12cR2 on RHEL7 with “Failed to create keys in the OLR” – Does your hostname start with a number?

Posted by FatDBA on July 29, 2019

Hi Guys,

I know it's been too long since I last posted; that was due to some site authentication issues and some personal priorities. Here I am, back with new issues, all related to performance, administration, troubleshooting, optimization and other subjects.

This time I would like to share an issue that I faced while installing Oracle 12c Release 2 (yes, I still do installations, sometimes 🙂 ) on a brand-new RHEL7 box, where everything was good until I ran root.sh, which failed with a weird error that initially gave no hint about the problem.
Initially I wondered whether this qualified as a post and deserved a place here, but I spent a few days identifying the cause and hours with support, so I just want to save all that time for those of you who might be facing the same issue and searching Google 🙂

So let's get started!
This is exactly what I got when I ran the root.sh script:



[root@8811913-monkey-db1:/u011/app1/12.2.0.1/grid]# ./root.sh
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u011/app1/12.2.0.1/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u011/app1/12.2.0.1/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /u011/app1/12.2.0.1/crsdata/8811913-monkey-db1/crsconfig/roothas_2019-02-18_00-59-22AM.log
Site name (8811913-monkey-db1) is invalid.clscfg -localadd -z  [-avlookup]
                 -p property1:value1,property2:value2...

  -avlookup       - Specify if the operation is during clusterware upgrade
  -z   - Specify the site GUID for this node
  -p propertylist - list of cluster properties and its value pairs

 Adds keys in OLR for the HASD.
WARNING: Using this tool may corrupt your cluster configuration. Do not
         use unless you positively know what you are doing.

 Failed to create keys in the OLR, rc = 100, Message:


2019/02/18 00:59:28 CLSRSC-188: Failed to create keys in Oracle Local Registry
Died at /u011/app1/12.2.0.1/grid/crs/install/oraolr.pm line 552.
The command '/u011/app1/12.2.0.1/grid/perl/bin/perl -I/u011/app1/12.2.0.1/grid/perl/lib -I/u011/app1/12.2.0.1/grid/crs/install /u011/app1/12.2.0.1/grid/crs/install/roothas.pl ' execution failed


The error simply said that the script failed to 'create keys in the OLR'; these were the HASD keys it was attempting to add. I verified all the runtime logs created at the time, but they too gave no idea about the problem. That is when I engaged Oracle Support and came to know that this all happened due to a new bug (Bug 26581118 – ALLOW HOSTNAME WITH NUMERIC VALUE) that comes into the picture when your hostname starts with a numeral, on RHEL7, and is specific to Oracle 12c Release 2.

Oracle suggested a bug fix (patch number 26751067) for this issue. This is a MERGE patch and fixes both bugs 25499276 & 26581118. One more thing: you have to apply this patch before running the root.sh script.
So let me quickly show how to do that (removing all redundant and other sections).



[oracle@8811913-monkey-db1:/u011/app1/12.2.0.1/grid/OPatch]$ ./opatch napply -oh /u011/app1/12.2.0.1/grid -local 26751067/26751067/
Oracle Interim Patch Installer version 12.2.0.1.6
Copyright (c) 2019, Oracle Corporation.  All rights reserved.

...
......

Patch 26751067 successfully applied.
Log file location: /u011/app1/12.2.0.1/grid/cfgtoollogs/opatch/opatch2019-02-18_01-05-41AM_1.log

OPatch succeeded.
[oracle@8811913-monkey-db1:/u011/app1/12.2.0.1/grid/OPatch]$
[oracle@8811913-monkey-db1:/u011/app1/12.2.0.1/grid/OPatch]$


I ran root.sh after patching and it went smoothly.
BTW, in case you don't want to do all this, simply change the hostname and put a letter in front of it, i.e. 8811913 –> A8811913. That's it!

Hope It Helps!

Thanks
Prashant Dixit

Posted in troubleshooting, Uncategorized

OPatch – Error occurred during initialization of VM, Could not reserve enough space for XXXXXXKB object heap

Posted by FatDBA on February 19, 2019

Hi Guys,

Discussing a random issue that I encountered a few hours back: a problem with the new version of OPatch which, once unzipped and used, generated a weird error, discussed below.



[oracle@gunna:~/app/oracle/product/12.2.0/dbhome_1/OPatch/28822515]$ opatch prereq CheckConflictAgainstOHWithDetail -ph ./

Error occurred during initialization of VM
Could not reserve enough space for 39957221KB object heap


On the OCS download page for OPatch, the automatically selected version is 32-bit (Linux x86).
Check whether the name of the downloaded file is something like 'p6880880_122010_LINUX.zip'. If yes, then you have downloaded the 32-bit version, and a 32-bit JVM cannot reserve an object heap of that size, hence the error. Choose 'Linux x86-64' as the right version and try again.

Let’s try again.



[oracle@gunna:~/app/oracle/product/12.2.0/dbhome_1/OPatch/28822515]$ opatch prereq CheckConflictAgainstOHWithDetail -ph ./
Oracle Interim Patch Installer version 12.2.0.1.16
Copyright (c) 2018, Oracle Corporation.  All rights reserved.

PREREQ session

Oracle Home       : /home/oracle/app/oracle/product/12.2.0/dbhome_1
Central Inventory : /home/oracle/app/oraInventory
   from           : /home/oracle/app/oracle/product/12.2.0/dbhome_1/oraInst.loc
OPatch version    : 12.2.0.1.16
OUI version       : 12.2.0.1.4
Log file location : /home/oracle/app/oracle/product/12.2.0/dbhome_1/cfgtoollogs/opatch/opatch2018-12-24_00-46-02AM_1.log

Invoking prereq "checkconflictagainstohwithdetail"

Prereq "checkConflictAgainstOHWithDetail" passed.

OPatch succeeded.


All good now!

Hope It Helps
Prashant ‘Fatdba’ Dixit

Posted in troubleshooting

12c OEM Error: LongOpManager$ZombieDetection:1017

Posted by FatDBA on August 31, 2018

Hey Pals,

I am sure a few of us who have 12c EM configured on our systems have received one of those annoying alerts/incidents where the EM agent on the server repeatedly reports something about 'zombie' processes.

I recently encountered one such issue on a production system where the 12c agent was frequently sending zombie-related alerts, like the one below. So, this post is all about handling such issues and how to fix or avoid them.

Host=dixitlab.asi.dixson.corp 
Target type=Agent 
Target name=dixitlab.asi.dixson.corp:3873 
Message=Problem: java.lang.Throwable:oracle.sysman.gcagent.tmmain.execution.LongOpManager$ZombieDetection:1017 
Severity=Critical 
Problem creation time=May 5, 2018 10:22:48 AM ADST 
Last updated time=Aug 29, 2018 1:00:47 AM ADST 
Problem ID=113 
Problem key=java.lang.Throwable:oracle.sysman.gcagent.tmmain.execution.LongOpManager$ZombieDetection:1017 
Incident count=5 
Problem Status=New 
Priority=None 
Escalated=No 
Problem acknowledged By Owner=No 
Rule Name=OEM12c-Problems,Enterprise Problems 
Rule Owner=SYSMAN 

The next quick thing in such cases is always to check the agent (gcagent) logs to understand the exact error or issue with EM or the agent.
As expected, the logs had something interesting in them: I saw a DEBUG message showing that zombie detection was initiated by the agent on a repeating schedule.
This happens when an EM agent task, such as collecting metrics, runs longer than expected; the process is then marked as a zombie, and this is one of the leading causes of agent crashes or halts.

X-AGENT_PERSISTENCE_WAIT_TIME: 60
X-AGENT_PERSISTENCE_ID: https://dixitlab1.asi.dixson.corp:1830/emd/main/
2018-08-29 17:10:26,050 [31:858161EB] DEBUG - Submitting task ZombieDetector for execution
2018-08-29 17:10:26,050 [216:1AE716D8] DEBUG - Begin task ZombieDetector on Thread: GC.SysExecutor.8
2018-08-29 17:10:26,050 [216:69BEAC9D:GC.SysExecutor.8 (ZombieDetector)] DEBUG - Scheduling next ZombieDetector.Task after delay 60000 including periodShift of 0 milliseconds

There are a few ways to avoid such issues:
1. Set _zombieSuspensions=TRUE
2. Set _canceledThreadWait=900
3. Set _zombieThreadPercentThreshold=0

Another way, if you just want to avoid the alerts/incidents, is to set the 'hidden' parameter _zombieCreateIncident to FALSE in the agent configuration file (emd.properties), followed by an agent restart, as shown below.

[oracle@dixitlab config]$
[oracle@dixitlab config]$ pwd
/u01/app/oracle/new_agent12c/agent_inst/sysman/config


[oracle@dixitlab config]$ more emd.properties|grep _zombieCreateIncident
_zombieCreateIncident=false
[oracle@dixitlab config]$




[oracle@dixitlab bin]$
[oracle@dixitlab bin]$ ./emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version          : 12.1.0.5.0
OMS Version            : 12.1.0.5.0
Protocol Version       : 12.1.0.1.0
Agent Home             : /u01/app/oracle/new_agent12c/agent_inst
Agent Log Directory    : /u01/app/oracle/new_agent12c/agent_inst/sysman/log
Agent Binaries         : /u01/app/oracle/new_agent12c/core/12.1.0.5.0
Agent Process ID       : 2978
Parent Process ID      : 12860
Agent URL              : https://dixitlab.asi.dixson.corp:3873/emd/main/
Local Agent URL in NAT : https://dixitlab.asi.dixson.corp:3873/emd/main/
Repository URL         : https://dixitlab1.asi.dixson.corp:4900/empbs/upload
Started at             : 2018-08-29 01:01:08
Started by user        : oracle
Operating System       : Linux version 2.6.32-696.20.1.el6.x86_64 (amd64)
Last Reload            : (none)
Last successful upload                       : 2018-08-29 17:28:20
Last attempted upload                        : 2018-08-29 17:28:20
Total Megabytes of XML files uploaded so far : 1.03
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 52.74%
Collection Status                            : Collections enabled
Heartbeat Status                             : Ok
Last attempted heartbeat to OMS              : 2018-08-29 17:29:16
Last successful heartbeat to OMS             : 2018-08-29 17:29:16
Next scheduled heartbeat to OMS              : 2018-08-29 17:30:16

---------------------------------------------------------------



[oracle@dixitlab bin]$ ./emctl stop agent
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
Stopping agent ...
 stopped.
[oracle@dixitlab bin]$
[oracle@dixitlab bin]$ ./emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running
[oracle@dixitlab bin]$
[oracle@dixitlab bin]$
[oracle@dixitlab bin]$ ./emctl start agent
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
Starting agent .............................. started.
[oracle@dixitlab bin]$
[oracle@dixitlab bin]$
[oracle@dixitlab bin]$ ./emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version          : 12.1.0.5.0
OMS Version            : 12.1.0.5.0
Protocol Version       : 12.1.0.1.0
Agent Home             : /u01/app/oracle/new_agent12c/agent_inst
Agent Log Directory    : /u01/app/oracle/new_agent12c/agent_inst/sysman/log
Agent Binaries         : /u01/app/oracle/new_agent12c/core/12.1.0.5.0
Agent Process ID       : 2664
Parent Process ID      : 2455
Agent URL              : https://dixitlab.asi.dixson.corp:3873/emd/main/
Local Agent URL in NAT : https://dixitlab.asi.dixson.corp:3873/emd/main/
Repository URL         : https://dixitlab1.asi.dixson.corp:4900/empbs/upload
Started at             : 2018-08-29 17:33:05
Started by user        : oracle
Operating System       : Linux version 2.6.32-696.20.1.el6.x86_64 (amd64)
Last Reload            : (none)
Last successful upload                       : 2018-08-29 17:34:34
Last attempted upload                        : 2018-08-29 17:34:34
Total Megabytes of XML files uploaded so far : 0.03
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 52.71%
Collection Status                            : Collections enabled
Heartbeat Status                             : Ok
Last attempted heartbeat to OMS              : 2018-08-29 17:34:24
Last successful heartbeat to OMS             : 2018-08-29 17:34:24
Next scheduled heartbeat to OMS              : 2018-08-29 17:35:24

---------------------------------------------------------------
Agent is Running and Ready
[oracle@dixitlab bin]$


Hope It Helps
Prashant Dixit

Posted in troubleshooting

Visualize your database performance statistics using Tableau.

Posted by FatDBA on January 28, 2018

Hi Everyone,

Today's post is all about producing some interactive data visualizations of your system/database statistics using one of the most popular BI tools, Tableau.
There are lots of other tools that can be used for this kind of data analysis, i.e. Excel, MicroStrategy Analytics, Domo, QlikView, but I always find Tableau the easiest for such things, especially during my database audits, 360 health reviews and troubleshooting task forces.

Tableau offers a suite of tools that includes online, desktop and server versions. All of these versions provide an easy-to-use drag-and-drop interface that can help you quickly turn your data into business insights. Like many other data analytics and visualization tools, Tableau can connect to local or remote data of many different formats.

Okay, after that short introduction to the tool, it's time to run some tests with it.
I have divided the process into three steps, discussed below.

First: Data Collection
Collect your AWR reports in TEXT format; they will later be parsed to create a CSV file.
There are many tools/scripts available online to generate multiple AWR reports for your database. Awesome work was already done by FlashDBA (Download): you can use his script to generate batch AWR reports, and yes, in TEXT format ONLY!

Example:

[oracle@dixitlab AWR]$ ls -ltr
total 12852
-rw-r--r--. 1 oracle oinstall 225031 Jan 27 21:25 awrrpt_1_445_446.txt
-rw-r--r--. 1 oracle oinstall 255010 Jan 27 21:26 awrrpt_1_446_447.txt
-rw-r--r--. 1 oracle oinstall 253217 Jan 27 21:26 awrrpt_1_447_448.txt
-rw-r--r--. 1 oracle oinstall 253217 Jan 27 21:26 awrrpt_1_448_449.txt
-rw-r--r--. 1 oracle oinstall 244229 Jan 27 21:26 awrrpt_1_449_450.txt
........
.................
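
If you want to pick the snap range yourself before batch-generating the reports, a quick look at the standard AWR views helps (a minimal sketch):

-- List the AWR snapshots available for report generation
SELECT snap_id,
       begin_interval_time,
       end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id;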

Second: Data Folding or Compression
Next you need a tool to fold your AWR reports into a CSV. Here once again FlashDBA did a marvelous job: he wrote a fabulous script that parses your AWR text files and generates a single CSV.
You can download the script from his GitHub (Download Link).

To run the script, you pass the filename pattern of your reports and redirect the output to a CSV, as shown below.

[oracle@dixitlab AWR]$ ./awr-parser.sh awr*.txt > tunedbperftests.csv

Info : Parsing file awrrpt_1_445_446.txt at 2018-01-27 21:32:49
Info : Parsing file awrrpt_1_446_447.txt at 2018-01-27 21:32:53
Info : Parsing file awrrpt_1_447_448.txt at 2018-01-27 21:32:56
Info : Parsing file awrrpt_1_448_449.txt at 2018-01-27 21:33:01
Info : Parsing file awrrpt_1_449_450.txt at 2018-01-27 21:33:07
Info : Parsing file awrrpt_1_450_451.txt at 2018-01-27 21:33:15
Info : Parsing file awrrpt_1_451_452.txt at 2018-01-27 21:33:21
....
........
Info : Parsing file awrrpt_1_499_500.txt at 2018-01-27 21:36:56
Info : No more files found
Info :
Info : ______SUMMARY______
Info : Files found : 55
Info : Files processed : 55
Info : Processing errors : 0
Info :
Info : Completed with 0 errors
[oracle@dixitlab AWR]$

With that you are done with the parsing of the reports and have the final CSV, which we will use to play around with in Tableau.
Here are the contents of the parsed file:

Filename	Database Name	Instance Number	Instance Name	Database Version	Cluster	Hostname	Host OS	Num CPUs	Server Memory (GB)	DB Block Size	Begin Snap	Begin Time	End Snap	End Time	Elapsed Time (mins)	DB Time (mins)	Average Active Sessions	Busy Flag	Logical Reads/sec
awrrpt_1_445_446.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	445	1/26/2018 21:57	446	1/26/2018 23:00	62.24	6.33	0.1	N	2629.5
awrrpt_1_446_447.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	446	1/26/2018 23:00	447	1/27/2018 0:00	60.19	12.18	0.2	N	13973.4
awrrpt_1_447_448.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	447	1/27/2018 0:00	448	1/27/2018 1:00	60.15	13.52	0.2	N	14055.8
awrrpt_1_448_449.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	448	1/27/2018 1:00	449	1/27/2018 2:00	60.15	10.13	0.1	N	11597.4
awrrpt_1_449_450.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	449	1/27/2018 2:00	450	1/27/2018 3:00	60.16	0.03	0	N	65.4
awrrpt_1_450_451.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	450	1/27/2018 3:00	451	1/27/2018 4:00	60.12	0.02	0	N	70.3
awrrpt_1_452_453.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	452	1/27/2018 5:00	453	1/27/2018 6:00	60.13	0.69	0	N	189.8
awrrpt_1_453_454.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	453	1/27/2018 6:00	454	1/27/2018 7:00	60.13	2.88	0	N	2439.1
awrrpt_1_454_455.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	454	1/27/2018 7:00	455	1/27/2018 8:00	60.14	12.57	0.2	N	14027.3
awrrpt_1_455_456.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	455	1/27/2018 8:00	456	1/27/2018 9:00	60.14	10.11	0.1	N	13916.6
awrrpt_1_456_457.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	456	1/27/2018 9:00	457	1/27/2018 10:00	60.14	10.26	0.1	N	13941.5
awrrpt_1_457_458.txt	TUNEDB	PRIMARY	tunedb	EE	1	dixitlab.localdo	Linux x86 64-bit	1	2.77	8K	457	1/27/2018 10:00	458	1/27/2018 10:50	50.25	18.67	0.3	N	14118.9

Third: Data Representation using Tableau
Okay, so here we have the final parsed CSV of all those TEXT AWR reports, named 'tunedbperftests.csv', and we are ready to play around and learn.

Immediately after launching Tableau you will see a couple of options for data sources on the left. Choose Text as the source and browse to the CSV to load it.

Next you will see all the rows of your data source file (tunedbperftests.csv in our case).

Next click on Worksheet, your personal area to play.

Tableau then divides the data in two main types: dimensions and measures. Dimensions are usually those fields that cannot be aggregated; measures, as its name suggests, are those fields that can be measured, aggregated, or used for mathematical operations. Dimension fields are usually used for row or column headings; measures are usually used for plotting or giving values to the sizes of markers.

             

The next tab is for analytical functions, i.e. if you want to add a constant, an average/mean/median, or any reference lines to your graph/chart.

All good; now we are all set to plot our performance charts from the DB historical information that we collected in the CSV and loaded into Tableau. Let's plot average hard parses per second, average DB time, average parses, and average transactions against time (the hourly rate of BEGIN TIME as a measure).

So, using the above area graph, you have plotted the average metric usage on the database during a time period.

Next, I will visualize one of the most prominent DB wait events observed in the database during the probe (data collection) period, 'DPR' or 'direct path read', and plot a bar graph against time (hourly BEGIN TIME).

Some more stats visualizations, this time the 'top 5 waits' and their hourly frequency.

There are lots of other things you can do with your statistics; you can plot your data as squares, side-by-side circles, polygons, pie charts, Gantt bars, line graphs, area graphs, box-and-whisker plots, highlight tables and many more.

So, use your imagination, and you can visualize any of your database statistics using Tableau!

Questions are welcome. Happy reading! 🙂 🙂

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

AAS or Average Number of Active Sessions – the first thing to look at in an AWR, and its uses.

Posted by FatDBA on January 27, 2018

Hi All,

Today's post is all about answering the question 'What is the very first thing one should look for in an AWR report?'. I have been asked so many times about the first statistic I prefer to look at when troubleshooting a performance problem, so I thought of answering it by writing this post, with some real-time examples.

And the answer is 'AAS', or 'Average Number of Active Sessions': it is the first thing I always look for when reading AWR reports, because it gives you a quick idea of how busy the system is and of the workload it is handling.

Okay, so first let's understand what an 'active session' is. In simple words, it is a session currently spending time in the database (i.e. from v$session where status='ACTIVE').
Now, what exactly is AAS? It's the rate of change of DB time over clock time; the value is calculated using the standard formula DB Time / Elapsed Time.

Let's calculate the value for one of the systems.

Host Name	        Platform	       CPUs
dixitLab1.fatdba.com	Linux x86 64-bit	16

AAS in this case: 1024.72 / 60.04 => 17.07 average active sessions during the snap interval of an hour (DB time over elapsed time, both in minutes).
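
The same figure can be pulled straight from the AWR repository; below is a hedged sketch of such a query (it assumes consecutive snap IDs, and that 'DB time' is stored in microseconds, which is standard for DBA_HIST_SYS_TIME_MODEL):

-- AAS per AWR snapshot interval = delta('DB time') / elapsed seconds
SELECT e.snap_id,
       ROUND((e.value - b.value) / 1e6 /
             ((CAST(s.end_interval_time AS DATE) -
               CAST(s.begin_interval_time AS DATE)) * 86400), 2) AS aas
FROM   dba_hist_sys_time_model b
JOIN   dba_hist_sys_time_model e
  ON   e.dbid            = b.dbid
 AND   e.instance_number = b.instance_number
 AND   e.stat_name       = b.stat_name
 AND   e.snap_id         = b.snap_id + 1
JOIN   dba_hist_snapshot s
  ON   s.snap_id         = e.snap_id
 AND   s.dbid            = e.dbid
 AND   s.instance_number = e.instance_number
WHERE  e.stat_name = 'DB time'
ORDER  BY e.snap_id;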

Let's further decode the magical figures of AAS.
We always use the CPU count as a baseline when evaluating AAS. A few rules of thumb for this comparison are given below.
– If AAS is higher than the number of CPUs you have, then there may be a problem. E.g., in the example above we have an AAS of 17 against a CPU count of 16, hence we could have performance problems that need investigation.
– If the value is much higher than the number of CPUs, there is a choke point in the database.
You could also use AAS as one of the axes when plotting your graphs, lines and Manhattans, comparing it with CPU capacity to quickly pinpoint the pain areas and time slots. Let's understand its use through a scenario.

Assume that one fine day you get a call from the monitoring team: they have observed huge spikes in system resource usage, and many of the other metrics on the dashboard are red. As usual, lots of fingers and eyes start pointing towards you and the DBA team.

Now you, as the DBA, quickly generate an AWR report for that specific time frame to understand the system behavior and performance, and observe a huge workload on the database, with an AAS of 305 (for a 2-node RAC database with 128 CPUs collectively) and some huge peaks for Application-class waits (i.e. row lock contentions etc.), User I/O-classified waits (i.e. db file sequential read, read by other session etc.) and some Network-class waits (i.e. SQL*Net message from dblink) in your database.

Now you want to understand the trend of wait classes in the database over the last few days. Here you can use the DBA_HIST_ACTIVE_SESSION_HISTORY view to collect historical statistics, which you can then use to plot charts with Excel, Tableau, etc.; a query sketch follows below.
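
A hedged sketch of such a query (each row in AWR's ASH represents roughly 10 seconds of session activity, so the sample count scales to AAS):

-- Hourly AAS per wait class over the last 7 days
SELECT TRUNC(sample_time, 'HH') AS hour,
       NVL(wait_class, 'ON CPU') AS wait_class,
       ROUND(COUNT(*) * 10 / 3600, 1) AS aas
FROM   dba_hist_active_sess_history
WHERE  sample_time > SYSDATE - 7
GROUP  BY TRUNC(sample_time, 'HH'), NVL(wait_class, 'ON CPU')
ORDER  BY 1, 3 DESC;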

I collected similar stats using the ASH view and plotted a graph using a few of my data representation tools to understand this transient variation in system performance.

Here you see a sudden spike in DB wait classes (especially User I/O, Cluster, Application and Network) on March 5th, when the average number of active sessions (AAS), stacked for both RAC nodes, was around 305, which compared with the total number of CPUs (64+64=128) is extremely high.

After further investigation you understand that it's the Application-class wait 'enq: TX – row lock contention' which is the primary cause of this high system resource utilization.

The graph below represents the AAS waiting on the Application-class event 'enq: TX – row lock contention' per instance, where we can see the same thing happening: a constant level, then a sudden rise in row-locking contention.

And you have identified the major sources contributing to this row locking during the probe period, from the last 7 days up to now. You can join dba_hist_active_sess_history and dba_hist_snapshot to get this historical information; read my previous article on how to pull this past information from the AWR repository.

Now that you have narrowed down the problem and identified the problematic SQLs and their total contribution, you can start query optimization/tuning to fix the issue.
There are lots of other representations you can build using AAS as one of the graph axes, i.e. AAS on CPU versus top wait events, which I will discuss in further posts.

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

Parse CPU to Parse Elapsd % – Let's clear the perplexity!

Posted by FatDBA on January 26, 2018

Happy Republic Day!

 

Hi Mates,

I see a lot of confusion, mix-ups and perplexity among DBAs about a few of the metrics in the 'Instance Efficiency %' section of AWR reports, especially the metric "Parse CPU to Parse Elapsd %". In today's post I will try to explain this metric in detail, which should help you understand it in depth and clear the muddiness.

A few of the lines you might have read about this metric:
"we should always look for as low a number as possible for this metric …"
"try to achieve the impossible value of zero for this one"

Even some of the MetaLink notes are misleading.

Scenario:

Below is a snippet from one of my test boxes.

Note: just taking a look at the instance efficiency ratios can be very dangerous; I advise starting with the Load Profile, the top 5 waits, and so on.

Okay, so my reading of this statistic differs from what you would judge from its name: this datum signals delay/waiting in the parsing of SQL queries during the snap interval.
In our example the value is 1.37%, which means that for every CPU second spent parsing we spent about 73 (100/1.37) seconds of wall-clock time. This can happen for various reasons, i.e. latching or other contention between sessions.

The ideal value for this stat, like the rest of the ratios, is 100% (yes, that's correct!!).

Let's see how this value is calculated and what it considers while deducing that final figure of 1.37%.
It takes "parse time cpu / parse time elapsed * 100" to get the 'Parse CPU to Parse Elapsd' figure; with the statistics below, that is 398 / 29,055 * 100 ≈ 1.37%.

 

Statistic	                              Total	per Second     per Trans
------------------------------------------------------------------------------------------------
parse time cpu                                  398	      0.11	0.01
parse time elapsed                           29,055	      8.07	0.81
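
You can compute the same figure yourself; here is a minimal sketch against V$SYSSTAT (note these are instance-lifetime cumulative values, whereas the AWR ratio uses the deltas between two snapshots):

-- Parse CPU to Parse Elapsd % from cumulative instance statistics
SELECT ROUND(100 * MAX(DECODE(name, 'parse time cpu', value)) /
             NULLIF(MAX(DECODE(name, 'parse time elapsed', value)), 0), 2)
       AS parse_cpu_to_parse_elapsed_pct
FROM   v$sysstat
WHERE  name IN ('parse time cpu', 'parse time elapsed');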

 

So, in short, every drop in this metric below 100% means the database was waiting for something which slowed down the parse times.
If you want to dig deeper, trace the session using 10046 tracing at level 8 to see where the rest of the parse time is being spent; I am leaving that for readers to test.

 

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

How to troubleshoot, understand HW events & measure performance using PERF (Linux Profiler)! – Part 2

Posted by FatDBA on January 13, 2018

Repeat: I think I’ve mistakenly deleted the post, so re-posting the same.

Hey Folks,

Back with the second edition of my previous post on troubleshooting performance issues using the Linux built-in profiler named 'perf'. Many users requested that I write about a few use-case situations, covering when and how to use this tool.

So, this post is all about discussing some cases that I faced while working on a few performance tuning projects, and a few from tests that I performed on my lab systems.

Okay, coming back to the reason why we need this, perf I mean?
The answer is that sometimes the Oracle wait interface is not enough and you need to dig deeper into the system to understand the problem. At that point you have to use some third-party or built-in dedicated tools for performance investigation. There, perf might help you understand what your resource-intensive query is doing at the OS layer.
So, using the tool you can monitor your process and see what it's doing!

And in case you are using perf on a virtualized system, you might get the error "perf.data file has no samples".
To fix it, try the "-e cpu-clock" argument to collect the sample data and then interpret the file.

Okay, so one fine day you saw a spike in the server's CPU consumption via top, oratop, OEM, scheduled scripts or any other monitoring technique, and you identified the process and its other attributes.
Using the PID you reached the SID, SQL_ID, SQL_TEXT and other statistics. Now you might want to analyze and understand Oracle's execution.

 
Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  DIXIT (1:53089)
 SQL ID              :  71aa5ju8pwtf2
 SQL Execution ID    :  16777216
 Execution Started   :  01/09/2018 06:10:32
 First Refresh Time  :  01/09/2018 06:10:32
 Last Refresh Time   :  01/09/2018 06:13:03
 Duration            :  152s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus@dixitlab.localdomain (TNS V1-V3)

Global Stats
========================================================
| Elapsed |   Cpu   |    IO    | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Gets  | Reqs | Bytes |
========================================================
|     163 |     150 |      115 |     5M | 648K |   5GB |
========================================================

SQL Plan Monitoring Details (Plan Hash Value=12102956)
===================================================================================================================================================================
| Id   |      Operation       | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |       Activity Detail       | Progress |
|      |                      |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |         (# samples)         |          |
===================================================================================================================================================================
| -> 0 | SELECT STATEMENT     |      |         |       |       150 |     +2 |     1 |        0 |      |       |          |                             |          |
| -> 1 |   SORT AGGREGATE     |      |       1 |       |       150 |     +2 |     1 |        0 |      |       |          |                             |          |
| -> 2 |    TABLE ACCESS FULL | T294 |   96523 | 56397 |       151 |     +1 |     1 |      306 | 648K |   5GB |    99.34 | Cpu (69)                    |      31% |
|      |                      |      |         |       |           |        |       |          |      |       |          | db file sequential read (6) |          |
|      |                      |      |         |       |           |        |       |          |      |       |          | direct path read (126)      |          |
===================================================================================================================================================================

The above results point to high CPU time (150 seconds). Now, to look deeper into things and understand what exactly the session is doing:

[root@dixitlab ~]#  perf top -e cpu-clock -p 3505 

   PerfTop:     349 irqs/sec  kernel:42.7%  exact:  0.0% [1000Hz cpu-clock],  (target_pid: 3505)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ __________________________________________________

             1981.00 46.2% _raw_spin_unlock_irqrestore [kernel.kallsyms]
              906.00 21.1% _intel_fast_memcmp          /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
              270.00  6.3% kole_simple_string_match    /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle 
               96.00  2.2% copy_user_generic_unrolled  [kernel.kallsyms]
               96.00  2.2% kcbgtcr                     /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               54.00  1.3% __intel_new_memset          /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle 
               37.00  0.9% __intel_ssse3_rep_memcpy    /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               30.00  0.7% kghfrf                      /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               28.00  0.7% kghalf                      /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               27.00  0.6% kcbldio                     /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               24.00  0.6% kdxbrs1                     /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               24.00  0.6% kspgvc                      /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
               21.00  0.5% kksMapCursor                /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle



The results show the system was mostly busy processing kernel calls ([kernel.kallsyms]), with the function '_raw_spin_unlock_irqrestore' taking 46% of the samples. It shows up with huge overhead and a large sample count; in general, irq_restore shows up because re-enabling interrupts is costly, but this is not real CPU consumption. Rather, it is an artifact of how the tool, interrupts and the system work in some virtualized systems. I was testing it on my lab VM.

Here I would like to thank Tanel Poder for reviewing the document and highlighting this issue with the perf top command when executed inside a VM environment.

…… perf top that concluded _raw_spin_unlock_irqrestore taking lots of CPU time. I’ve been through this myself in past – and this is likely not actual CPU usage but rather how perf, OS and interrupts work in some VMs (basically measurement bias/errors by these tools). I assume that you tested in a VM? ……..

Next in the list is the function '_intel_fast_memcmp', called by Oracle, at 21.1%, which I believe is an optimized memory-compare routine.

So the conclusion of the analysis:
most of the process's CPU time was spent processing kernel calls and in fast memory comparisons.

Purpose revisited: this gives you a glimpse of the process's calls and of how a CPU-bound database process passes its time.

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting

 