Tales From A Lazy Fat DBA

Fan of Oracle DB & Performance, PostgreSQL, Cassandra & much more … \,,/

  • Likes

    • 278,767
  • Archives

  • ΰ₯

  • Categories

  • Subscribe

  • Prashant Dixit is the FatDBA

  • Follow Tales From A Lazy Fat DBA on WordPress.com
  • My Twitter Feeds

  • Disclaimer!

    FatDBA or Oracle ‘Ant’ is an independent web-blog/site.The experiences, Test cases, views, and opinions expressed in this website are my own and does not reflect the views or opinions of my employer.

    This site is independent of and does not represent Oracle Corporation in any way. Oracle does not officially sponsor, approve, or endorse this site or its content.
    Product and company names mentioned in this website may be the trademarks of their respective owners.

Posts Tagged ‘troubleshooting’

My favorite 5 linux TOP features/flags β€¦.

Posted by FatDBA on July 23, 2021

Hi Guys,

Recently I started the ‘Top 5’ series where I share my top 5 features in any particular tool or product. Last time I did for SQL Developer command line (SQLcl), this time it will be about one of Linux’s favorite tool ‘TOP‘.

So, without any particular order, below are my top 5 picks of options/flags available with the tool.

c : To get absolute path of the commands currently executing. This will be quite helpful if you want to know from where the command/executable is running. I have used this option lot of the times in multiple performance battles that I have won πŸ™‚ Look at the screen below and in to column COMMAND.

z: will display running process in RED color which may help you to identified running process easily.

1: To get usage details per CPU core.

SHIFT+i : To disable the IRIX (default mode is ON) mode, help you to understand – Why Process CPU % Usage larger than Total CPU Time ?

E: To get memory details in KB, MB & GB, press ‘E’ each time. So, no need to separately call ‘free’ command.

I know this is all about top 5 picks, but here goes my 6th top (kinda reserve option) πŸ™‚

V: Tree Mode, to display the processes in a parent child hierarchy.

Hope It Helped
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , | Leave a Comment »

How to fix/handle Wraparound problem in PostgreSQL ?

Posted by FatDBA on July 21, 2021

Hi Guys,

In continuation from where I left in my last post on simulating wraparounds in PostgreSQL, this post is all about managing the mess that we spilled πŸ™‚ Though this was a synthetic/artificial test and as soon you close or exit SESSION 1 where you’ve opened a BEGIN block/transaction and ran select txid_current() to block autovacuum, it (autovacuum) will automatically kick-in and perform a quick full vacuum on all of the databases in the cluster. You have to remember that even if you have autovacuum turned off, it will still start in emergency mode.

So, this is what we got earlier at the time we hit by the wraparound problem when we tried to execute a DML and it says something about stopping postmaster and vacuum databases in single-user mode; What is that ? – We will see that next ….

enterprisedb=#
enterprisedb=# SELECT datname
, age(datfrozenxid)
, current_setting('autovacuum_freeze_max_age')
FROM pg_database
 ORDER BY 2 DESC;
 
   datname    |    age     | current_setting
--------------+------------+-----------------
 postgres     | 2146483647 | 200000000
 edb          | 2146483647 | 200000000
 template1    | 2146483647 | 200000000
 template0    | 2146483647 | 200000000
 immortal     | 2146483647 | 200000000
 darkthrone   | 2146483647 | 200000000
 immortal1    | 2146483647 | 200000000
 mayhem       | 2146483647 | 200000000
 murduk       | 2146483647 | 200000000
 burzum       | 2146483647 | 200000000
 gorgoroth    | 2146483647 | 200000000
 repmgr       | 2146483647 | 200000000
 enterprisedb | 2146483647 | 200000000
(13 rows)



enterprisedb=# update pgbench_history set aid=65555 where aid=64444;
ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
enterprisedb=#
enterprisedb=#

Okay, coming back to the point, how to handle the situation now when you’ve already hit the wraparound. So, first thing you have to manually shut down the database cluster, if it’s still up.

-bash-4.2$ pwd
/usr/edb/as11/bin

-bash-4.2$ /usr/edb/as11/bin/pg_ctl stop -D /var/lib/edb/as11/data
waiting for server to shut down.... done
server stopped
-bash-4.2$

Next, you need to run VACUUM FULL for each of the database (one-by-one using single mode) through stand-alone backend mode. For that you have to call postgres (if you’re on open source) or edb-postgres utility. I am on EDB PostgreSQL 11 so will be calling edb-postgres and execute VACUUM FULL.

-bash-4.2$ pwd
/usr/edb/as11/bin

-bash-4.2$
-bash-4.2$ ls *postgre*
edb-postgres
-bash-4.2$

-bash-4.2$ ./edb-postgres --single -D /var/lib/edb/as11/data enterprisedb

PostgreSQL stand-alone backend 11.11.20
backend> vacuum full;
2021-07-20 11:07:49 IST LOG:  statement: vacuum full;

backend>

It’s done for ‘enterprisedb’ database. Next will do it for ‘postgresql’ database, but this time I will use some stand-alone backend related command to assure we are connected with the right database and to enable runtime stats or verbose.

-bash-4.2$ ./edb-postgres --single -D /var/lib/edb/as11/data postgres

PostgreSQL stand-alone backend 11.11.20
backend>
backend> select current_database();
2021-07-20 13:45:25 IST LOG:  statement: select current_database();

         1: current_database    (typeid = 19, len = 64, typmod = -1, byval = f)
        ----
         1: current_database = "postgres"       (typeid = 19, len = 64, typmod = -1, byval = f)
        ----
backend> 


backend> SELECT set_config('log_statement_stats', 'on', false);
2021-07-20 13:48:55 IST LOG:  statement: SELECT set_config('log_statement_stats', 'on', false);

         1: set_config  (typeid = 25, len = -1, typmod = -1, byval = f)
        ----
         1: set_config = "on"   (typeid = 25, len = -1, typmod = -1, byval = f)
        ----
backend>

-- Now when we have enabled logging of stats, let's execute the VACUUM to see runtime stats
backend> vacuum full;
2021-07-20 13:49:29 IST LOG:  statement: vacuum full;

2021-07-20 13:49:30 IST LOG:  QUERY STATISTICS
2021-07-20 13:49:30 IST DETAIL:  ! system usage stats:
        !       0.557302 s user, 0.426659 s system, 120.357278 s elapsed
        !       [0.583998 s user, 0.455579 s system total]
        !       21756 kB max resident size
        !       224/402352 [240/402824] filesystem blocks in/out
        !       0/2558 [0/6201151] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       982/2 [1005/3] voluntary/involuntary context switches
2021-07-20 13:58:39 IST STATEMENT:  vacuum full;

backend>

Okay, same way we have to run VACUUM FULL on all of the databases and restart PostgreSQL cluster.
Once all of the databases are done and after restart, connect with the database and you will see the TXID or datfrozenid has been reset and you are all good.

enterprisedb=# SELECT datname
, age(datfrozenxid)
, current_setting('autovacuum_freeze_max_age')
FROM pg_database
 ORDER BY 2 DESC;

    datname    | age | current_setting
--------------+-----+-----------------
postgres     | 1369 | 200000000
edb          | 1272 | 200000000
template1    | 1170 | 200000000
template0    | 1071 | 200000000
immortal     |  974 | 200000000
darkthrone   |  877 | 200000000
immortal1    |  780 | 200000000
mayhem       |  585 | 200000000
murduk       |  487 | 200000000
burzum       |  388 | 200000000
gorgoroth    |  291 | 200000000
repmgr       |  194 | 200000000
enterprisedb |   97 | 200000000

You’re all set!

Hope It Helped!
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , , | Leave a Comment »

Quarantine a SQL ? This can save from a system performance pandemic …

Posted by FatDBA on July 6, 2021

Hi Guys,

Last weekend I was asked to examine a performance problem where customer was observing sporadic resource spikes on CPU & PIOs (Physical IO scans), and had no idea how to handle the situation. They already have identified the problematic SQL and asked me to take a look, while doing initial investigation on the system, I found the resource manager was enabled and was killing the ill SQL every time when it tries to breach the limits set on CPU & other resources.

I spent some time understanding query behavior and performance, and I found this a scenario of SQL plan (PHV) flip where the query optimizer toggles between a good and a worst plan. This being a time sensitive issue and customer needs a quick fix before I identify the reason behind the PHV flip, we have to think of any technique on how we can stop the SQL execution with bad PHVs. This was an Oracle EE 19c system running on Exadata, I immediately proposed for a solution using new feature of ‘SQL Quarantine‘, which helps to prevent reuse of same execution plan which was terminated by resource managers due to resource limits.

Let me show you how to do that, here I will demonstrate how to quarantine a SQL & all its execution plans.

Well this is just a use case, you can also quarantine the SQL for all its PHVs or execution plans or even through the SQL Text. Here I will show how you can quarantine a SQL for all its execution plans.

Note: This feature is currently available only on Enterprise Edition on Engineered Systems, like Exadata and Exadata Cloud Service, but for this demo, I will be enabling the ‘_exadata_feature_on‘ parameter for demo purposes on my sandbox setup.

** Please don’t touch this parameter in your production environments if you’re not running on Oracle EE on the top of Exadata or Exadata Cloud services.

[oracle@canttowinsec ~]$ !sql
sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Jul 3 23:33:41 2021
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.


SQL>
SQL> @hidden
Enter value for param: exadata_feature_on
old   5:    and a.ksppinm like '%&param%'
new   5:    and a.ksppinm like '%exadata_feature_on%'

Parameter                                     Session Value             Instance Value            descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_exadata_feature_on                           FALSE                     FALSE                     Exadata Feature On

SQL>
SQL>


SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>
SQL>


SQL> startup
ORACLE instance started.

Total System Global Area  922745216 bytes
Fixed Size                  8903040 bytes
Variable Size             843055104 bytes
Database Buffers           67108864 bytes
Redo Buffers                3678208 bytes
Database mounted.
Database opened.
SQL>
SQL>
SQL>



SQL>
SQL> @hidden
Enter value for param: exadata_feature_on
old   5:    and a.ksppinm like '%&param%'
new   5:    and a.ksppinm like '%exadata_feature_on%'

Parameter                                     Session Value             Instance Value            descr
--------------------------------------------- ------------------------- ------------------------- ------------------------------------------------------------
_exadata_feature_on                           TRUE                      TRUE                      Exadata Feature On

SQL>




-- For this demo I have created one table with 1000000 rows. 

SQL> select count(*) from bigtab;

  COUNT(*)
----------
   1000000

Next check the SQLID, PHV and TEXT of the SQL for which you want to setup this QUARANTINE feature and will create the quarantine configuration for it. You can set it for SQL ID, SQL ID + PHV or for SQL TEXT.

SQL> select sql_id, plan_hash_value, sql_text from v$sqlarea where sql_text like '%select * from bigtab%';

SQL_ID        PLAN_HASH_VALUE
------------- ---------------
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6fwqzurbc8y7k       441133017
select * from bigtab

SQL>


-- You can create CREATE_QUARANTINE BY SQL_ID 
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '&sql_id')
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '&sql_id', PLAN_HASH_VALUE => '&PLAN_HASH_VALUE');

-- You can create CREATE_QUARANTINE_BY_SQL_TEXT.
quarantine_config := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_TEXt(SQL_TEXT => to_clob('select count(*) from emp'));



-- This command creates a quarantine configuration for SQL ID '6fwqzurbc8y7k' and all its execution plans.
SQL>
SQL> DECLARE
quarantine_sql VARCHAR2(30);
BEGIN
quarantine_sql :=   DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(SQL_ID => '6fwqzurbc8y7k');
END;
/  

PL/SQL procedure successfully completed.

SQL>



-- Check if quarantine config/profile is created.

SQL> select NAME, SQL_TEXT, ELAPSED_TIME, cpu_time,CREATED, ENABLED from dba_sql_quarantine;


NAME                           SQL_TEXT                       ELAPSED_TIME         CPU_TIME                       CREATED                        ENA
------------------------------ ------------------------------ -------------------- ------------------------------ ------------------------------ ---
SQL_QUARANTINE_2akx3bq44wvc5   select * from bigtab           ALWAYS               ALWAYS                         04-JUL-21 12.09.25.567422 AM   YES


Now next we will use the DBMS_SQLQ.ALTER_QUARANTINE procedure to put limits on resource usage i.e. Elapsed time, CPU, IO (MBs), Number of physical (PIOs) requests, Number of logical (LIOs) requests.

SQL>
SQL> BEGIN
    DBMS_SQLQ.ALTER_QUARANTINE(
       QUARANTINE_NAME => 'SQL_QUARANTINE_2akx3bq44wvc5',
       PARAMETER_NAME  => 'CPU_TIME',
       PARAMETER_VALUE => '5');

    DBMS_SQLQ.ALTER_QUARANTINE(
       QUARANTINE_NAME => 'SQL_QUARANTINE_2akx3bq44wvc5',
       PARAMETER_NAME  => 'ELAPSED_TIME',
       PARAMETER_VALUE => '10');
END;
/  2    3    4    5    6    7    8    9   10   11   12

PL/SQL procedure successfully completed.

SQL>


-- let's verify limits that we have set against this SQL Quarantine profile.

NAME                           SQL_TEXT                       ELAPSED_TIME         CPU_TIME                       CREATED                        ENA
------------------------------ ------------------------------ -------------------- ------------------------------ ------------------------------ ---
SQL_QUARANTINE_2akx3bq44wvc5   select * from bigtab           10                   5                              04-JUL-21 12.12.41.918609 AM   YES

Next we will setup the RESOURCE MANAGER.

-- Create a pending area.
SQL> begin
 dbms_resource_manager.create_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

-- Create a consumer group with name 'DIXITPOC_GROUP'.
SQL> begin
 dbms_resource_manager.create_consumer_group(CONSUMER_GROUP=>'DIXITPOC_GROUP',COMMENT=>'To test SQL quarantine feature for one SQL');
 end;
 /  2    3    4

PL/SQL procedure successfully completed.


-- Now bind the consumer group with the user 'DIXIT', I mean this could be any group through which you will execute the query.
SQL> begin
 dbms_resource_manager.set_consumer_group_mapping(attribute => 'ORACLE_USER',value => 'DIXIT',consumer_group =>'DIXITPOC_GROUP' );
 end;
  /  2    3    4

PL/SQL procedure successfully completed.


-- Create a resource plan with name 'POC_FOR_QUARANTINE'.
SQL> begin
dbms_resource_manager.create_plan(plan => 'POC_FOR_QUARANTINE',comment => 'Kill SQL and its other execution plans when exceeding CPU and Elapsed seconds');
end;
/  2    3    4

PL/SQL procedure successfully completed.


-- Create a plan directive by allocating resource plan 'POC_FOR_QUARANTINE' to consumer group 'DIXITPOC_GROUP'. I am setting execution limit of 5 seconds for the SQL.
SQL> begin
 dbms_resource_manager.create_plan_directive(
 plan => 'POC_FOR_QUARANTINE',
 group_or_subplan => 'DIXITPOC_GROUP',
 comment => 'Kill SQL and its other execution plans when exceeding CPU and Elapsed seconds',
 switch_group=>'CANCEL_SQL',
 switch_time => 5,
 switch_estimate=>false);
 end;
 /  2    3    4    5    6    7    8    9   10

PL/SQL procedure successfully completed.




-- Allocate full resources to rest of the sessions which are not part of this plan.
SQL> begin
 dbms_resource_manager.create_plan_directive(PLAN=> 'POC_FOR_QUARANTINE', GROUP_OR_SUBPLAN=>'OTHER_GROUPS',COMMENT=>'leave rest as is');
 end;
 /  2    3    4

PL/SQL procedure successfully completed.



-- Validate and submit pending area.
SQL> begin
 dbms_resource_manager.validate_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

SQL> begin
 dbms_resource_manager.submit_pending_area();
 end;
 /  2    3    4

PL/SQL procedure successfully completed.

SQL>


-- Grant switch privilege to the DIXIT user which will switch it to 'DIXITPOC_GROUP' group.
SQL> begin
dbms_resource_manager_privs.grant_switch_consumer_group('DIXIT','DIXITPOC_GROUP',false);
end;
/  2    3    4

PL/SQL procedure successfully completed.

-- Create initial consumer group for DIXIT user for 'DIXITPOC_GROUP' group.
SQL> begin
dbms_resource_manager.set_initial_consumer_group('DIXIT','DIXITPOC_GROUP');
end;
/  2    3    4

PL/SQL procedure successfully completed.

SQL>

Next set the resource manager at the database level, we will have to set the RESOURCE_MANAGER_PLAN to value or name of the plan ‘POC_FOR_QUARANTINE’.

SQL> show parameter resource

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
resource_limit                       boolean     TRUE
resource_manage_goldengate           boolean     FALSE
resource_manager_cpu_allocation      integer     5
resource_manager_plan                string
SQL> 


SQL> alter DIXITtem set RESOURCE_MANAGER_PLAN = 'POC_FOR_QUARANTINE';
DIXITtem altered.

SQL> show parameter resource

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
resource_limit                       boolean     TRUE
resource_manage_goldengate           boolean     FALSE
resource_manager_cpu_allocation      integer     5
resource_manager_plan                string      POC_FOR_QUARANTINE
SQL>

Execute the SQL for the first time, you will get ‘active time limit exceeded’ due to RM.

SQL>
SQL> select * from emp;

ERROR at line 2:
ORA-00040: active time limit exceeded - call aborted

Now, if you run the SQL for the second time, you will see your quarantine plan/profile will be used and will abort the execution of the SQL.

SQL>
SQL> select * from emp;

ERROR at line 2:
ORA-56955: quarantined plan used

Hope It Helped
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , , , , | Leave a Comment »

Something cool was introduced in Oracle Golden Gate 12.2 …

Posted by FatDBA on June 16, 2021

Hi Guys,

Last week someone from my team contacted me for a problem that he was facing with the Golden Gate, he was getting an error message with replicat process saying ‘Error mapping TEST.TABLE1 to TEST.TABLE1’ and the replicat process is abending at the target end. He was using Golden Gate version 12.1.2.0. Now during the initial investigation we found the table structure both at the source and target is same except column order/positions, just like below.

Source
-----------------------
    TEST.TABLE1
ID Varchar2(10)
Name Varchar2(10)
Codes Numeric(10)

Target:
------------------------
    TEST.TABLE1
ID Varchar2(10)
Codes Numeric(10)
Name Varchar2(10)

As GG works with the column order and not explicitly with the column names, it lead to that error at the target while doing the INSERTS in to the table. We knew that this is now auto handled by the GG, but not sure on the version when it got fixed, so immediately, we started looking for newer releases and their documentations, and luckily we found release notes for Golden Gate 12.2 which clearly explained that now there isn’t any need to generate the DEFGEN files, or to use ASSUMETARGETDEFS or SOURCEDEFS parameter (in some cases).

Starting from Golden Gate version 12.2 (released in year 2016), the trail files now contains the metadata (TDR or table definition records), so GG knows table details like column order/positions, data type, column length etc. before applying the changes to the table. So, now even if you add a new table, drop a column, add a new column, rename an existing column, change column datatype length, you don’t need to worry as Golden Gate will take care by its own.

So, about the solution, we had two options to choose from – Upgrade GG from 12.1 to 12.2 or else use FORMAT RELEASE 12.2 parameter, with that, there is no need to use any of the above mentioned parameters. We choose to upgraded GG version to 12.2 and that fixed the problem!

Hope It Helped
Prashant Dixit

Posted in Advanced | Tagged: , | Leave a Comment »

repmgr failover done, but where is my Standby ?

Posted by FatDBA on December 15, 2020

Hey Guys,

Last weekend I’ve got a call from one of my friend asking about a weird issue in his master-slave (primary-standby) setup where he’s using repmgr for automatic failover and switchover operations. He said after the failover (server crashed) with role reversal or promotion the old standby becomes new primary (expected, all good till here), but as soon as the old primary comes back online and rejoins the configuration, it started as a primary too. I mean you can imagine something like below, both of the two hosts becomes master and there is no sign of a new standby.


[postgresdb@fatdba ~]$ repmgr -f /etc/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+-----------------------------------------
 1  | node1 | primary | * running |          | default  | host=db_node1 dbname=repmgr user=repmgr
 2  | node2 | primary |   running |          | default  | host=db_node2 dbname=repmgr user=repmgr
 

He checked repmgr logs on the standby node and everything looks okay, standby promotion was sucessfull. It seems something was stopping the repmgr to update the information and bring that crashed node (previous primary) back as a standby. So the final situation is there are two primary nodes at this point.

So, how to and where to start from ?
A quick solution to this problem is to create an UPDATE TRIGGER on one of the internal table EVENTS of repmgr database (it creates its own database with default name of repmgr). The logic of creating a trigger is to handle events “standby_promote”, I mean when it happens a script is triggered that will bring back the old primary as new standby.

Next we would be needing a script that will help to automate the standby promotion. For that purpose you can use this script (link) or you can write something similar. And now when we have to call a bash script or to write stored procedures in a shell, we need the ‘.plsh’ extension in the database. Let’s install one of the most popular extension to do that, petere/plsh (link)


[postgresdb@fatdba ~]$ cd /usr/share/postgresql/12/extension/
[postgresdb@fatdba ~]$ sudo git clone https://github.com/petere/plsh.git
[postgresdb@fatdba ~]$ cd plsh
[postgresdb@fatdba ~]$ sudo make PG_CONFIG=/usr/bin/pg_config
[postgresdb@fatdba ~]$ sudo apt-get install make
[postgresdb@fatdba ~]$ sudo apt-get install gcc
[postgresdb@fatdba ~]$ sudo apt-get install postgresql-server-dev-12
[postgresdb@fatdba ~]$ sudo make install PG_CONFIG=/usr/bin/pg_config
 

Now, next create the extension in the database.


[postgresdb@fatdba ~]$ psql -U repmgr
 

This one needs to be executed only n primary side.


repmgr=# create extension plsh;
 

Now, let’s create the function and a trigger to invoke the script failover_promote.sh


[postgresdb@fatdba ~]$ psql
postgres=# CREATE FUNCTION failover_promote() RETURNS trigger AS $$
#!/bin/sh
/bin/bash /tmp/psqlscripts/failover_promote.sh $1 $2
$$
LANGUAGE plsh;
 

And the UPDATE TRIGGER on EVENTS table to call function created above.


postgres=#  create trigger failover
after insert
on events
for each row
execute procedure failover_promote();
 

Yes, that’s it. All done!
Now you can mimic the failover scenario i.e by stopping the primary database and check repmgr logs on the standby server.
You will see the standby is promoted and becomes new primary and as soon it’s finished, it will start pinging the crashed node and the moment it is back it will be promoted as a new standby.

Hope It helped!
Prashant Dixit

Posted in Uncategorized | Tagged: , , | Leave a Comment »

Could not send replication command “TIMELINE_HISTORY”: ERROR: could not open file pg_wal/00xxxx.history

Posted by FatDBA on October 20, 2020

Hi All,

Ever encountered a situation where the backup history (TIMELINE_HISTORY) file was deleted by mistake or maybe someone removed it purposely, it was quite old and you try to restore a new backup. I remember many issues related with replication, backup tools (BARMAN & BART) that you might face if that file is removed from PG_WAL directory. Would like to discuss a problem that we encountered while taking BART backup on EDB version 10.

These ‘timeline files’ are quite important, as using the timeline history files, the pg_basebackup can follow the latest timeline present in the primary, just as it can follow new timelines appearing in an archive WAL directory. So, in short, it shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. So, its important to have this file there in WAL directory.


[enterprisedb@fatdba ~]$ bart -c /usr/edb-bart-1.1/etc/bart.cfg BACKUP -s edbserver --backup-name MAINFULLBKP_10-13-20
INFO:  DebugTarget - getVar(checkDiskSpace.bytesAvailable)
INFO:  creating full backup using pg_basebackup for server 'edbserver'
INFO:  creating backup for server 'edbserver'
INFO:  backup identifier: '1602788909136'
ERROR: backup failed for server 'edbserver' 

pg_basebackup: could not send replication command "TIMELINE_HISTORY": ERROR:  could not open file "pg_wal/00000002.history": No such file or directory 

1633701/1633701 kB (100%), 2/2 tablespaces
pg_basebackup: child process exited with error 1
pg_basebackup: removing data directory "/edbbackup/edbserver/1602788909136"
 

The file is not there under said directory.


[enterprisedb@fatdba ~]$ cd /edb/as10/as10/data/pg_wal/
[enterprisedb@fatdba pg_wal]$ ls
0000000200000005000000EA  0000000200000005000000EB.00000060.backup  0000000200000005000000ED  archive_status
0000000200000005000000EB  0000000200000005000000EC                  0000000200000005000000EE
 

In case of file missing/moved, you can always create a brand new empty file and that will be used by the respective utility and will be populated with metadata soon after. So, in order to quickly restore this issue, let’s create one.

 [enterprisedb@fatdba pg_wal]$ touch 00000002.history
[enterprisedb@fatdba pg_wal]$
[enterprisedb@fatdba pg_wal]$ ls *hist*
00000002.history 

Let’s try to take the backup once again.


[enterprisedb@fatdba pg_wal]$ bart -c /usr/edb-bart-1.1/etc/bart.cfg BACKUP -s edbserver --backup-name MAINFULLBKP_10-13-20
INFO:  DebugTarget - getVar(checkDiskSpace.bytesAvailable)
INFO:  creating full backup using pg_basebackup for server 'edbserver'
INFO:  creating backup for server 'edbserver'
INFO:  backup identifier: '1602789425665'
INFO:  backup completed successfully
INFO:
BART VERSION: 2.5.5
BACKUP DETAILS:
BACKUP STATUS: active
BACKUP IDENTIFIER: 1602789425665
BACKUP NAME: MAINFULLBKP_10-13-20
BACKUP PARENT: none
BACKUP LOCATION: /edbbackup/edbserver/1602789425665
BACKUP SIZE: 1.57 GB
BACKUP FORMAT: tar
BACKUP TIMEZONE: Europe/Berlin
XLOG METHOD: stream
BACKUP CHECKSUM(s): 0
TABLESPACE(s): 1
 Oid     Name      Location
 42250   UNKNOWN   /edb/as10/as10/data_test/pg_tblspc

START WAL LOCATION: 0000000200000005000000ED
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2020-10-15 21:17:05 CEST
STOP TIME: 2020-10-15 21:17:38 CEST
TOTAL DURATION: 33 sec(s)


[enterprisedb@fatdba pg_wal]$  bart -c /usr/edb-bart-1.1/etc/bart.cfg SHOW-BACKUPS
 SERVER NAME   BACKUP ID       BACKUP NAME            BACKUP PARENT   BACKUP TIME                BACKUP SIZE   WAL(s) SIZE   WAL FILES   STATUS

 edbserver     1602789425665   MAINFULLBKP_10-13-20   none            2020-10-15 21:17:38 CEST   1.57 GB       16.00 MB      1           active
 

And it worked.


Hope It Helped!
Prashant Dixit

Posted in Uncategorized | Tagged: , , , | Leave a Comment »

EDB PostgreSQL BART Error: tablespace_path is not set

Posted by FatDBA on October 16, 2020

Today would like to discuss about the issue that we faced while doing a BART restore operation of one of the EDB 11 PostgreSQL instance. This was a new system under realization phase (before delivery to customer). So, during one of the test we saw the restore got failed with a message which says something about the value ‘tablespace_path’. I know I have a tablespace in this system, but I initially though that BART will take care of it by its own, but its was not the case.

Below was the error what I have encountered during the test.


[enterprisedb@fatdba archived_wals]$ bart -c /usr/edb-bart-1.1/etc/bart.cfg RESTORE -s edbserver -i 1602187005158 -p /edb/as10/as10/data/
INFO:  restoring backup '1602187005158' of server 'edbserver'
ERROR: "tablespace_path" is not set
[enterprisedb@fatdba archived_wals]$
 

Okay, let’s first check tablespace details, we use the metacommand of ‘db’ to get that info about tablespaces. Okay, so we have it’s location, size details.
Let’s go inside the said directoy and see what all is there.
Note: Last two are the default tablespaces so need to worry about them.


enterprisedb=# \db+
                                              List of tablespaces
    Name    |    Owner     |           Location           | Access privileges | Options |  Size   | Description
------------+--------------+------------------------------+-------------------+---------+---------+-------------
 newtblspc  | dixit        | /home/enterprisedb/newtblspc |                   |         | 52 kB   |
 pg_default | enterprisedb |                              |                   |         | 1362 MB |
 pg_global  | enterprisedb |                              |                   |         | 774 kB  |
(3 rows)

[enterprisedb@fatdba pg_tblspc]$ pwd
/edb/as10/as10/data_test/pg_tblspc
[enterprisedb@fatdba pg_tblspc]$ ls -ltrh
total 4.0K
lrwxrwxrwx. 1 enterprisedb enterprisedb   28 May  5 17:58 42250 -> /home/enterprisedb/newtblspc
drwx------. 3 enterprisedb enterprisedb 4.0K Oct  8 21:56 PG_10_201707211
 

Okay, so we have a soft-link created for the tablespace under PG_TBLSPC directory under DATA dir with OID 42250.
Now when we have all the information, time to add requisite parameter in bart.cfg file to consider tablespaces, just like below.
Format: OID_1=tablespace_path_1;OID_2=tablespace_path_2 …
example: tablespace_path = 42250=/edb/as10/as10/data_test/pg_tblspc
Note: tablespace_path parameter must exist or to be empty at the time you perform the BART RESTORE operation.

Now let’s modify our bart confguration file, will look something like below with the ‘tablespace_path’ option set.


[EDBSERVER]
host = 10.0.0.144
port = 5444
user = enterprisedb
backup_name = mktg_%year-%month-%dayT%hour:%minute
cluster_owner = enterprisedb
description = "EDB PROD server"
archive_command='scp %p enterprisedb@10.0.0.144:/edbbackup/edbserver/archived_wals/%f'
tablespace_path = 42250=/edb/as10/as10/data_test/pg_tblspc
allow_incremental_backups=enabled
 

All set for the restore now, let’s try that.


[enterprisedb@fatdba pg_tblspc]$ bart -c /usr/edb-bart-1.1/etc/bart.cfg RESTORE -s edbserver -i 1602187005158 -p /edb/as10/as10/data/
INFO:  restoring backup '1602187005158' of server 'edbserver'
WARNING: tablespace restore path is not empty (/edb/as10/as10/data_test/pg_tblspc), restoring anyway
INFO:  base backup restored
INFO:  writing recovery.conf file
INFO:  WAL file(s) will be streamed from the BART host
INFO:  archiving is disabled
INFO:  permissions set on $PGDATA
INFO:  restore completed successfully
[enterprisedb@fatdba pg_tblspc]$
[enterprisedb@fatdba pg_tblspc]$
 

This is fixed.

Hope That Helped!
Prashant Dixit

Posted in Uncategorized | Tagged: , , | Leave a Comment »

oracle.net.ns.NetException: Got minus one from a read call

Posted by FatDBA on September 3, 2020

Hi Guys,

Would like to discuss one problem that I was facing today in on one of the Oracle 12c Release 1 standalone database where application team started explaining the problem that they are getting when doing application restart, specially oracle NET exception of ‘Got minus one from a read call


Caused by: java.lang.RuntimeException: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/jdbc/ProdMonkeyPD
    Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/jdbc/ProdMonkeyPD
    Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:/jdbc/ProdMonkeyPD
    Caused by: javax.resource.ResourceException: IJ031084: Unable to create connection
    Caused by: java.sql.SQLRecoverableException: IO Error: Got minus one from a read call
    Caused by: oracle.net.ns.NetException: Got minus one from a read call"}}
 

They were suspecting the issue with the high number of opened files on OS. The files count was too high when listing using lsof command on this RHEL7 system and the count goes down once the services are stopped. well I was able to explain the difference that exists between RHEL6 & RHEL7 when doing count using lsof. It was in RHEL7/EL7 that shows output including TID as default compared to RHEL6/OL6. Hence the number of open files count increases in RHEL7 as compared to RHEL6. So, it has nothing to do with the error that they have reported.

So, next we have checked database alert log and it was all good, all clean, no errors nothing. I immediately checked the value of “OS_AUTHENT_PREFIX” parameter as it specifies a prefix that Oracle uses to authenticate users attempting to connect to the system. Oracle simply appends this value to the beginning of user’s operating system account name and password and which it later on compares. So, it was set to its default value that is OPS$ and was set for the backward compatibility with previous versions.

So, I have two solutions for the problem

– Set “OS_AUTHENT_PREFIX” to “” (a null string), thereby eliminating the addition of any prefix to operating system account names.
– Set “tcp.validnode_checking = no” in SQLNET.ora file
This is to enable and disable valid node checking for incoming connections. If this parameter is set to yes, then incoming connections are allowed only if they originate from a node that conforms to list specified by TCP.INVITED_NODES or TCP.EXCLUDED_NODES parameters.

So, I tried with the first option and rebooted the database to make changes persistent (this parameter is static) and asked application team to give it a try again, and as expected it worked. The error or the ORACLE NET exception ‘Got minus one from a read call‘ was resolved after applying the first fix itself.

Here the second option is valid too as that also does the same thing, but one fix at a time.

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , | Leave a Comment »

Cassandra node refused to start – MismatchedInputException: No content to map due to end-of-input

Posted by FatDBA on August 5, 2020

Hi All,

This one will be a quick one, kind of error and solution approach.
This time it’s about Cassandra statup error which I have faced while doing some testing on a POC box and how I resolved it.

In this case I was using Datastax Cassandra 6.8.1 on RHEL6, I tried starting the instance but failed with error which says “MismatchedInputException: No content to map due to end-of-input”.
I’ve attempted to start it few more times but ended up with the same error message. Below is the complete/full error details.


WARN  [main] 2020-08-05 13:43:46,277  DatabaseDescriptor.java:1517 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
INFO  [main] 2020-08-05 13:43:46,281  DseDelegateSnitch.java:39 - Setting my workloads to [Cassandra]
INFO  [main] 2020-08-05 13:43:46,358  YamlConfigurationLoader.java:77 - Configuration location: file:/home/cassandra/dse-6.8.1_node1/resources/cassandra/conf/cassandra.yaml
INFO  [main] 2020-08-05 13:43:46,360  DseDelegateSnitch.java:41 - Initialized DseDelegateSnitch with workloads [Cassandra], delegating to com.datastax.bdp.snitch.DseSimpleSnitch
INFO  [main] 2020-08-05 13:43:46,424  YamlConfigurationLoader.java:77 - Configuration location: file:/home/cassandra/dse-6.8.1_node1/resources/cassandra/conf/cassandra.yaml
ERROR [main] 2020-08-05 13:43:47,904  DseModule.java:126 - Unable to start server. Exiting...
org.apache.cassandra.io.FSReadError: com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (BufferedInputStream); line: -1, column: 0]
        at com.datastax.bdp.db.nodes.Nodes.transactionalRead(Nodes.java:195)
        at com.datastax.bdp.db.nodes.Nodes.access$100(Nodes.java:76)
        at com.datastax.bdp.db.nodes.Nodes$Local.(Nodes.java:477)
        at com.datastax.bdp.db.nodes.Nodes.(Nodes.java:97)
        at com.datastax.bdp.db.nodes.Nodes$Instance.setup(Nodes.java:911)
        at org.apache.cassandra.config.DatabaseDescriptor.applyMetadata(DatabaseDescriptor.java:587)
        at org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:234)
        at org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:249)
        at com.datastax.bdp.DseCoreModule.(DseCoreModule.java:84)
        at com.datastax.bdp.DseModule.getRequiredModules(DseModule.java:163)
        at com.datastax.bdp.server.AbstractDseModule.configure(AbstractDseModule.java:27)
        at com.datastax.bdp.DseModule.configure(DseModule.java:89)
        at com.google.inject.AbstractModule.configure(AbstractModule.java:62)
        at com.google.inject.spi.Elements$RecordingBinder.install(Elements.java:340)
        at com.google.inject.spi.Elements.getElements(Elements.java:110)
        at com.google.inject.internal.InjectorShell$Builder.build(InjectorShell.java:138)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:104)
        at com.google.inject.Guice.createInjector(Guice.java:96)
        at com.google.inject.Guice.createInjector(Guice.java:73)
        at com.google.inject.Guice.createInjector(Guice.java:62)
        at com.datastax.bdp.ioc.DseInjector.get(DseInjector.java:36)
        at com.datastax.bdp.DseModule.main(DseModule.java:103)
Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (BufferedInputStream); line: -1, column: 0]
 


Now, let’s talk about the fix that I have applied to get rid of the problem, but before that I recalled that the database was abruptly shutdown last time due to server/hsot crashed. This file contains local startup information like data center, release info, scheam version, native transport address, port info (JMX, storage etc.), broadcast address etc.


[cassandra@fatdba-doccass bin]$ cd /var/lib/cassandra/metadata/nodes/
[cassandra@fatdba-doccass nodes]$ ls
local  peers  snapshots
[cassandra@fatdba-doccass nodes]$ rm -rf local
[cassandra@fatdba-doccass nodes]$
[cassandra@fatdba-doccass nodes]$
[cassandra@fatdba-doccass nodes]$ ls -ltrh
total 4.0K
drwxrwxr-x. 2 cassandra cassandra 6 Jul 27 16:07 snapshots
-rw-rw-r--. 1 cassandra cassandra 1 Aug  4 18:29 peers
 

Now let’s try to start the node once again.


[cassandra@fatdba-doccass bin]$
[cassandra@fatdba-doccass bin]$ ./dse cassandra
 


And it got started after clearning the local details/information which was there when node was crashed.

Hope It Helps
Prashant Dixit

Posted in Advanced, troubleshooting | Tagged: , | Leave a Comment »

High Latch Free waits due to Result Cache: RC Latch contention

Posted by FatDBA on June 16, 2020

Hi Everyone,

This is one of the latest issue that I will be discussing next where in one of the 12c (12.1.0.2.0) Non-RAC production box we suddenly started seeing a huge spike in database workload (or AAS). As usual I started with some live monitoring using scripts, OEM Performance page and tools and saw huge numbers of sessions waiting on ‘Latch Free’ (Others classed event) waits. Next I saw the event wait were contributing more than 87% of the total DBTime % with exceptionally high latch wait times of 54 ms (Yes, that’s too much for latch gets) and what else caught my eye was ‘enq: RC – Result Cache: Contention’ with average wait times of 19.31 ms.

But lets not jump directly to the conclusion, next I checked SQLs with excessive buffer scans (SQL Ordered by Gets) and saw the source statement that caused the mess with exceptionally high number of buffer scan requests and as you must be aware that each scan that you do in memory buffers (db buffer cache) you will have to acquire a latch for your scan. This one SQL was responsible for around 85% of total buffer reads or gets and interestingly within three hours of AWR report not even a single execution was completed.

Alright now next task was to identify the type of latch which was causing ‘latch free’ or latch shortage in the database. And for that I checked ‘Latch Statistics’ section and ‘Latch Sleep Breakdown’ subsection to see what latch was frequently missed to get and with more sleep requests, and I have got the culprit, ‘Result Cache: RC Latch’ was coming with huge Miss and sleep ratio. The latch was sleeping mainly for ‘Result Cache: Serialization12’ and little bit on ‘Result Cache: Serialization01’.

Next just to be double sure, I checked for P2 value for ‘Latch Free’ event which was coming as 559. So I queried V$Latchname to see if it’s matching the same what we identified so far. And yes, it was same RC Latch!


> SELECT latch#,name FROM gv$latchname WHERE latch#=559;

    LATCH# NAME
---------- ----------------------------------------------------------------
       559 Result Cache: RC Latch
 

You can do more deep down analysis and troubleshooting on such issues, Tanel Poder has written a great presentation for all who want to dig deep and want to pin point the addresses, code etc.
Download the copy of the document, click next –> Oracle_Latch_And_Mutex_Contention_Troubleshooting

Alright, now when the latch name, SQL is identified lets check more about the RESULT CACHE, and it’s set to MANUAL (parameter result_cache_mode) and this is strange. Next we saw that the SQL using the RC contains DS_SVC hints and are part of dynamic statistics activity.
Then I thought to look out for some official documentation on the same, and luckily found one official note from Oracle Doc ID 2002089.1 which is matching almost same with my problem.

And as per the document the workload we can apply is to disable the Adaptive Dynamic Statistics mechanism.


alter system set "_optimizer_ads_use_result_cache" = FALSE;
 

or else One-off Patch 26436717 can be requested and applied to fix the issue for a permanent fix.

Hope It Helps
Prashant Dixit

Posted in Advanced | Tagged: , | Leave a Comment »

 
%d bloggers like this: