Tales From A Lazy Fat DBA

Its all about Databases & their performance, troubleshooting & much more …. ¯\_(ツ)_/¯

Adaptive Log File Sync is not always good for you …

Posted by FatDBA on April 11, 2022

Last week I was part of one system/database stability short term assignment where customer running a critical Telco application on 2-Node RAC Cluster (RHEL) on reported slowness in few of the critical data processing modules, which in turn had slowed down their entire system. Todays post is about buggy behavior of an adaptive feature which had caused a huge mess in the system in the form of ‘Log File Sync‘ waits.

After few of the initial calls with the customer, we come to know that the database has a history of high ‘log file sync‘ waits, but they simply come and go, and it seemed that the events were never handled correctly or being analyzed! And interestingly, restarting the cluster resolves the issue for few hours, sometimes for few days. LFS event was quite prominently seen in this database and on an average found consuming > 38% of the total DB Time% available with very high average wait times (248 ms).

Below are few of the database statistics captured from the problem time.

About the event, When a user session commits, all redo records generated by that session’s transaction need to be flushed from memory to the redo logfile to insure changes to the database made by that transaction become permanent. The time between the user session posting the LGWR and the LGWR posting the user after the write has completed is the wait time for ‘log file sync’ that the user session will show. This event is known as the time lost as a result of the LGWR process waiting while users initiate a Transaction Commit or Rollback.

Next in order to get more idea about the event, I ran the lfsdiag.sql (Oracle provided script to catch diag info on it – Doc ID 1064487.1)). The script will look at the important parameters involved in log file sync waits, wait histogram data, and at the worst average LFS times in the active session history data and AWR data and dump information to help determine why those times were the highest.

Below are the ASH LFS background process waits caught during its worst minute and it had some really bad stats captured for the worst minute for the same time when customer had worst application performance.

MINUTE          INST_ID EVENT                            TOTAL_WAIT_TIME      WAITS   AVG_TIME_WAITED
------------ ---------- ------------------------------ ----------------- ---------- -----------------
Apr06_1742            2 log file sync                        2819744.300       1973          1429.166
Apr06_1745            2 log file sync                        1219765.027       1200          1016.471
Apr06_1747            2 log file sync                        2045077.213       1745          1171.964
Apr06_1748            2 log file sync                        1417639.236       1170          1211.657
Apr06_1749            2 log file sync                        2202804.958       2190          1005.847
Apr06_1753            2 log file sync                        1967863.159       1871          1051.771
Apr06_1756            2 log file sync                        1096747.638        336          3264.130
Apr06_1843            2 log file sync                        1710602.016        531          3221.473
Apr06_1846            2 log file sync                         814607.205        247          3298.005

Another intersting section was the Histogram data for LFS and other related events. Here was can see the LFS waits at “wait_time_milli” and specially the high wait times to correlate them with other wait events. From below stats its evident that Node 2 of this RAC cluster was severely impacted with the LFS waits., with very high wait counts and wait times (ms), and lot of ‘gcs log flush sync’ along with LFS events that pushes LGWR process to write data to the disk.

The stats are very bad, with highest wait time of 1048576 ms on Node 2 and average of 99864 ms (1.6 mins), both ‘log file parallel write’ and ‘gcs log flush sync’ were quite high too.

   INST_ID EVENT                                    WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
         1 log file sync                                          1   76126281
         1 log file sync                                          2   31805429
         1 log file sync                                          4   18893320
         1 log file sync                                          8   29604071
         1 log file sync                                         16   21903517
         1 log file sync                                         32    7252292
         1 log file sync                                         64    3692615
         1 log file sync                                        128    1615552
         1 log file sync                                        256     262632
         1 log file sync                                        512      25341
         1 log file sync                                       1024       5431
         1 log file sync                                       2048        901
         1 log file sync                                       4096        188
         1 log file sync                                       8192        102
         1 log file sync                                      16384         33
         2 log file sync                                          1   56003954
         2 log file sync                                          2   19903760
         2 log file sync                                          4   12749437
         2 log file sync                                          8   17572031
         2 log file sync                                         16   12266342
         2 log file sync                                         32    3209853
         2 log file sync                                         64    1124852
         2 log file sync                                        128    2912977
         2 log file sync                                        256    5516739
         2 log file sync                                        512    3226632
         2 log file sync                                       1024     783055
         2 log file sync                                       2048     119096
         2 log file sync                                       4096      24974
         2 log file sync                                       8192       8841
         2 log file sync                                      16384       3782
         2 log file sync                                      32768       1640
         2 log file sync                                      65536        578
         2 log file sync                                     131072        259
         2 log file sync                                     262144         78
         2 log file sync                                     524288         22
         2 log file sync                                    1048576          4

         1 log file parallel write                                1  198650048
         1 log file parallel write                                2   16777612
         1 log file parallel write                                4    9934905
         1 log file parallel write                                8    4511957
         1 log file parallel write                               16     808734
         1 log file parallel write                               32     107699
         1 log file parallel write                               64      17255
         1 log file parallel write                              128       1837
         1 log file parallel write                              256        139
         1 log file parallel write                              512         24
         1 log file parallel write                             1024          2
         2 log file parallel write                                1  150805280
         2 log file parallel write                                2    9299589
         2 log file parallel write                                4    4816280
         2 log file parallel write                                8    3145522
         2 log file parallel write                               16     793765
         2 log file parallel write                               32     116312
         2 log file parallel write                               64      12469
         2 log file parallel write                              128       1298
         2 log file parallel write                              256        118
         2 log file parallel write                              512         14
         2 log file parallel write                             1024          1

         1 gcs log flush sync                                     1   85952301
         1 gcs log flush sync                                     2    8052174
         1 gcs log flush sync                                     4    4224838
         1 gcs log flush sync                                     8    4017059
         1 gcs log flush sync                                    16    1226469
         1 gcs log flush sync                                    32      88359
         1 gcs log flush sync                                    64       4263
         1 gcs log flush sync                                   128         90
         1 gcs log flush sync                                   256          1
         2 gcs log flush sync                                     1   60282021
         2 gcs log flush sync                                     2    5102517
         2 gcs log flush sync                                     4    1951187
         2 gcs log flush sync                                     8    2470197
         2 gcs log flush sync                                    16    1088285
         2 gcs log flush sync                                    32     214273
         2 gcs log flush sync                                    64      26585
         2 gcs log flush sync                                   128        147

As a possible solution, we tried few of the quick ones i.e. batching redo (commit_logging = batch) for the log writer (I know it has its own risks) to reduce LFS, but that didn’t worked either.

Next thing I’d generated the system state dump to understand the system and situation bettern, and the LFS events were caught in the system state dump as well with wait chains pointing to ‘rdbms ipc message'<=’log file sync’.

Process traces are always considered a wealth of diagnostic information, So I’d checked the LGWR process traces and thats where I saw some strangeness with frequent entries related with switching between post/wait and polling method which is an adaptive way to control switching between post/wait (older way) and polling (new method) for log file syncs. This gave me little hint about the possible reason on why so many LFS waits.

Talking about the adaptive log file sync, there are 2 methods by which LGWR and foreground processes can communicate in order to acknowledge that a commit has completed:
Post/wait: traditional method available in previous Oracle releases LGWR explicitly posts all processes waiting for the commit to complete. The advantage of the post/wait method is that sessions should find out almost immediately when the redo has been flushed to disk.
Polling: Foreground processes sleep and poll to see if the commit is complete, this was introduced to free high CPU usage by the LGWR.

This behavior is controlled by the parameter “_use_adaptive_log_file_sync” and was introduced in 11gR2 and controls whether adaptive switching between post/wait and polling is enabled. In and the default value for the parameter is false. From, the default value has been changed to true.

-- LGWR traces 
WARNING:io_submit failed due to kernel limitations MAXAIO for process=128 pending aio=128
WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=1048576 AIO-NR=169402

*** 2022-04-07 06:03:31.916
Warning: log write broadcast wait time 2612477ms (SCN 0xad1.f8c170fe)

*** 2022-04-07 06:03:31.916
Warning: log write broadcast wait time 2598008ms (SCN 0xad1.f8c21251)
kcrfw_update_adaptive_sync_mode: post->poll long#=33 sync#=202 sync=5963 poll=8730 rw=383 rw+=383 ack=3982 min_sleep=1135

*** 2022-04-07 07:46:20.018
Log file sync switching to polling --------------------->>>>>>> It shows current method is polling 
Current scheduling delay is 1 usec 
Current approximate redo synch write rate is 67 per sec

*** 2022-04-07 07:47:13.877
kcrfw_update_adaptive_sync_mode: poll->post current_sched_delay=0 switch_sched_delay=1 current_sync_count_delta=63 switch_sync_count_delta=202

*** 2022-04-07 07:47:13.877
Log file sync switching to post/wait ------------------>>>>>>>> It shows current method is post/wait
Current approximate redo synch write rate is 21 per sec

-- Below stats shows that the POLLING is happening on the database for LGWR wrtes 
SQL> select name,value from v$sysstat where name in ('redo synch poll writes','redo synch polls');

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo synch poll writes                                             10500129
redo synch polls                                                   10773618

As there were no other symptoms of issues with I/O or in other areas, the problem could be with excessive switching between post/wait and polling wait methods.

After consulting with Oracle support, we found the issue was happening due to BUG (25178179) and the issue gets severe when only while using log file sync “polling mode“. To prevent the problem from happening, they suggested us to turn off the _use_adaptive_log_file_sync by setting it to FALSE in either in the spfile and restarting the database or dynamically in memory.
This will force the log file sync waits to use (the traditional) “post/wait mode” rather than the automatically switching between “post/wait mode” and “polling mode” based on performance statistics. Changing the said parameter to FALSE disabled adaptive LFS in the database and that resolved the issue and system performance was restored.

-- Set the parameter _use_adaptive_log_file_sync = false and restart the database:
SQL> ALTER SYSTEM SET "_use_adaptive_log_file_sync" = FALSE;

-- In cases where a restart is not feasible, then you can set in memory and also in the SP file for when a restart does occur:
SQL> ALTER SYSTEM SET "_use_adaptive_log_file_sync"=FALSE; -- to set in memory and spfile as by default scope=both
SQL> ALTER SYSTEM SET "_use_adaptive_log_file_sync"=FALSE scope=sfile sid='*'; -- to set on spfile, so parameter is used at next restart.

Though In the vast majority of cases adaptive log file sync improves the overall performance of log file synchronization, but there are few bugs associated with this feature i.e. 13707904, 13074706 and 25178179.

Hope It Helped!
Prashant Dixit


One Response to “Adaptive Log File Sync is not always good for you …”

  1. shadab said

    even i had tweak this parameter to false on and reaped the benefits. After setting the parameter to false performance improved many fold.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: