Hi Guys,
I would like to discuss a strange situation I faced while running a plain SELECT on one of my tables, which is around 1 GB in size and contains 20,000,000 rows of test data. After less than a minute the session gets killed and I am thrown back to the Linux terminal.
And yes, the OS is RHEL 7 (64-bit) running on VMware.
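In case you want to reproduce a similar setup, a table of roughly this size can be cooked up with generate_series. This is only a sketch; the column layout is my own guess, not the exact test table used here.
postgres=# create table test as select g as id, md5(g::text) as payload from generate_series(1, 20000000) g;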
So, this is what happened …
postgres=# select * from test;
Killed
-bash-4.2$
As always, to troubleshoot the issue I started with the instance/database logs to see what was going on, but they didn't say much and left me confused, with no clear reason for the session getting killed every time during this SELECT operation.
LOG: could not send data to client: Broken pipe
STATEMENT: SELECT * from test;
FATAL: connection to client lost
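By the way, if you are not sure where these server log entries end up, the log destination is driven by a few standard GUCs and can be checked right from psql (nothing specific to this box):
postgres=# show logging_collector;
postgres=# show log_directory;
postgres=# show log_filename;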
Table details are here below.
postgres=#
postgres=# \dt+ test
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | test | table | postgres | 996 MB |
(1 row)
postgres=# select count(*) from test;
count
----------
20000000
(1 row)
Time: 2045.816 ms
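Just as a cross-check, a similar size figure can be pulled with the standard size functions (pg_total_relation_size also counts indexes and TOAST, so it may come out a bit larger than what \dt+ shows):
postgres=# select pg_size_pretty(pg_total_relation_size('test'));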
I fired the same query once again to see what was happening on the OS side; pasted below are the top results captured while the SELECT was running. You can use pg_top too, which is more PostgreSQL specific.
last pid: 15382; load avg: 4.40, 2.70, 1.65; up 0+10:27:52
0 processes:
CPU states: 94.0% user, 0.0% nice, 6.0% system, 0.0% idle, 0.0% iowait
Memory: 1836M used, 15M free, 139M cached
DB activity: 0 tps, 0 rollbs/s, 0 buffer r/s, 0 hit%, 0 row r/s, 0 row w/s
DB I/O: 51 reads/s, 25940 KB/s, 0 writes/s, 0 KB/s
DB disk: 0.0 GB total, 0.0 GB free (100% used)
Swap: 1925M used, 123M free, 8916K cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15367 postgres 20 0 358572 3660 3376 D 29.8 0.2 0:33.31 postgres: postgres postgres [local] SELECT
1721 mongod 20 0 976044 3192 0 S 19.0 0.2 4:13.26 /usr/bin/mongod -f /etc/mongod.conf
15382 postgres 20 0 2012488 1.584g 228 S 2.4 87.6 0:02.62 psql >>>>>>>>>>>>>>>>>>>>>>>>> Culprit
If we take a look at the top results, it's evident that the psql session with PID 15382 is consuming a lot under the 'RES' column, which is the non-swapped physical memory a task has used. It's using around 1.5 GB of physical RAM, and its CPU usage is a little high as well.
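If you prefer a quick single-process view over the full top output, plain ps against that PID tells the same story (the PID is just the one from the run above):
-bash-4.2$ ps -o pid,rss,vsz,pmem,cmd -p 15382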
This makes sense, as we only have around 1.5 GB of RAM allocated to this VM, so every time I fire this SELECT it maxes out the memory. Still, I wanted to dig deeper with all my WHY, WHERE and WHAT questions, and the best place to go next is the system messages/logs.
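On RHEL 7 the OOM killer leaves its verdict in the kernel ring buffer and in /var/log/messages, so something along these lines (run as root, or via sudo) pulls out the relevant entries:
-bash-4.2$ dmesg -T | grep -i -E 'out of memory|oom'
-bash-4.2$ grep -i 'killed process' /var/log/messages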
Below are the system logs for the same time.
Sep 26 11:33:04 fatdba kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
Sep 26 11:33:04 fatdba kernel: ffff880027f41f60 00000000a0ebe647 ffff88006c9c7ad0 ffffffff81686e13
Sep 26 11:33:07 fatdba kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 26 11:33:08 fatdba kernel: [14918] 26 14918 29067 1 13 312 0 bash
Sep 26 11:33:08 fatdba kernel: [15382] 26 15382 521305 419602 920 27070 0 psql >>>>>>>>>>>>>>>>>>>>>>>>> CULPRIT PROCESS
Sep 26 11:33:08 fatdba kernel: [15384] 26 15384 89643 440 64 405 0 postmaster
Sep 26 11:33:08 fatdba kernel: [15385] 26 15385 19311 259 43 0 0 pg_top
Sep 26 11:33:08 fatdba kernel: [15411] 0 15411 28811 46 11 23 0 ksmtuned
Sep 26 11:33:08 fatdba kernel: [15412] 0 15412 28811 61 11 23 0 ksmtuned
Sep 26 11:33:08 fatdba kernel: [15413] 0 15413 37148 172 29 0 0 pgrep
Sep 26 11:33:08 fatdba kernel: [15416] 26 15416 89489 303 59 215 0 postmaster
Sep 26 11:33:08 fatdba kernel: [15417] 26 15417 89511 239 57 224 0 postmaster
Sep 26 11:33:08 fatdba kernel: Out of memory: Kill process 15382 (psql) score 448 or sacrifice child
Sep 26 11:33:08 fatdba kernel: Killed process 15382 (psql) total-vm:2085220kB, anon-rss:1678260kB, file-rss:148kB, shmem-rss:0kB
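For the record, the kernel picks its victim by OOM score, which is what the "score 448" above refers to. For any live process you can see the current value under /proc (the <pid> below is a placeholder; our psql session is of course already gone):
-bash-4.2$ cat /proc/<pid>/oom_score
-bash-4.2$ cat /proc/<pid>/oom_score_adj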
Sometimes I am shocked by the terms the OS, software and apps use while logging their internal activities: "sacrifice child", "node amnesia", "shoot the other node in the head", "node suicide" … 🙂
Okay, coming back to the point: it is now clear that the OS has killed process 15382 (psql) because it was consuming almost all of the system's memory. Take a look at a few of the columns, total_vm, rss, swapents; all of them are very high and point to the real reason this individual session was killed. Note that the victim is the psql client, not the server backend: by default psql pulls the entire result set into client memory before displaying it, so a 20-million-row SELECT easily blows past the RAM available on this small VM.
We increased the physical memory on this box and ran the same query again with success!
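As a side note, and not what we did here, psql also has a FETCH_COUNT variable that makes it fetch the result in batches through a cursor instead of buffering the whole thing, which keeps client memory flat for big result sets (the batch size below is arbitrary):
postgres=# \set FETCH_COUNT 10000
postgres=# select * from test;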
Hope It Helps
Prashant Dixit