28 Jun 2024
Analysing Slow Query Performance on Aurora PostgreSQL
In my previous post, I wrote about how I used AWS DMS to create an Oracle database replica in Aurora PostgreSQL. I touched on how my tests uncovered slow query performance and how the Query Plan Management extension helped me achieve comparable execution times.
In this blog post I will take you through the journey of identifying the root cause of the problem and how I solved it.
Problem description
I updated the application configuration to connect to the new Aurora PostgreSQL database and began my tests. The first success: the application connected to the database and executed queries. Throughout my tests, however, I experienced database connection timeouts. Using AWS RDS Performance Insights, I was able to gain a better understanding of what was going on: while the application timed out, the query was still executing on the database. After monitoring Performance Insights for several minutes, the query execution completed. Performance Insights stores a copy of the executed SQL query, which allowed me to run the query manually using pgAdmin. After waiting for ~7 minutes, I finally received the result. I then reran the query on the Oracle database using SQL Developer; it took Oracle less than a second. I compared the results and confirmed that they matched.
The application we are discussing is an ASP.NET Core web application, and short response times are essential. Queries are generated programmatically by NHibernate, which often results in very large queries. The query in question is ~25 KB in size and includes 23 joins.
Increasing Memory
Performance Insights also highlighted excessive buffer file read and write activity. This can be an indicator that the database does not have enough memory available to cache data. Based on this initial finding, I increased work_mem and reran the query. The query performance did not improve significantly, but I did shave off a few seconds, and Performance Insights no longer showed buffer file activity.
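For illustration, here is a minimal session-level sketch of that change; the value is a placeholder, not the setting I actually used, and on RDS/Aurora the change would normally be made in the DB parameter group:

-- Raise the memory budget each sort/hash operation may use before spilling to disk.
SET work_mem = '256MB';
-- Confirm the active value:
SHOW work_mem;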
To get more insight into the query, I ran it again, this time with EXPLAIN ANALYSE.
EXPLAIN ANALYSE shows the query plan the planner generated, along with actual run-time statistics. Reviewing the output, we can identify heavy use of sequential scans and hash joins. The output further indicates that one sequential scan loops through 900k rows a total of 69 times.
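For readers less familiar with the output, here is a heavily simplified sketch of the kind of plan node that stood out; the table name and figures are placeholders, not the actual plan:

EXPLAIN ANALYSE
SELECT ... ;  -- the ~25 KB NHibernate-generated query goes here
-- Offending node in the output (simplified):
--   ->  Seq Scan on some_large_table (cost=0.00..25000.00 rows=900000 width=48)
--         (actual time=0.012..180.345 rows=900000 loops=69)
-- loops=69 means this 900k-row sequential scan is repeated 69 times.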
Even with my limited database knowledge, I know that this is suboptimal; it may hint at missing indexes. So I reviewed the indexes in the schema converted by AWS SCT and verified that they existed on the database. They all did, except for one. I created the missing index and reran the query, without success: the execution time was still as high as before, even though EXPLAIN ANALYSE showed one additional index being used.
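For illustration only; the table and column names below are hypothetical, since the real schema is generated:

-- Recreate the one index that AWS SCT had converted but that was missing from the database.
CREATE INDEX ix_orders_customer_id ON orders (customer_id);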
Increase Instance Size
Performance Insights also revealed heavy CPU activity, so I tried moving to a larger database instance size, with little success. After testing various instance sizes and types, the shortest execution time I achieved was about 3 minutes; still far too long for our web application.
Discouraging PostgreSQL
I kept asking myself the question, why is PostgreSQL choosing sequential scans over indexes?
We need to dig deeper. While doing my research, I came across a talk James Finnerty gave in 2019 at AWS Migration Day – PostgresConf: “Deep Dive on Amazon Aurora PostgreSQL Performance Tuning“. His talk is a great resource on Aurora PostgreSQL performance tuning. I have probably watched it a hundred times by now.
About 13 minutes in, he talks about Optimizer Parameters.
This is really useful: a summary of the available configuration parameters I can use to force the generation of new query plans. Before using any of the enable_* configuration parameters, you should know that you cannot force the use of particular plan types. You are politely asking PostgreSQL to favour certain plan types over others when generating query plans.
EXPLAIN ANALYSE uncovered excessive use of hash joins, so I tried to discourage PostgreSQL from using hash joins for my query by setting enable_hashjoin to false.
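As a session-level sketch; this is a strong hint rather than a hard ban, since the planner merely assigns hash joins a prohibitive cost:

-- Discourage hash joins for the current session only.
SET enable_hashjoin = off;
-- Rerun the query with EXPLAIN ANALYSE to see which join strategy the planner picks instead.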
No improvement. Looking at the EXPLAIN ANALYSE output, we can see that PostgreSQL now uses a Merge Left Join instead of a Hash Join. The question remains: why is PostgreSQL not using the indexes?
Restructuring the query
I started looking at the query itself, which NHibernate hadn't structured very efficiently. Since the query is generated programmatically, I don't have direct control over it. Still, I wanted to see what execution time I could realistically expect from the database. After spending some time restructuring the query, and with some help, I finally had a rewritten version. I ran it, and at last I had an acceptable execution time of under a second.
Now I knew that it is possible to get an execution time comparable to the Oracle database. The only question was how to get there. The queries in the application cannot be altered directly, so I had to come up with another solution.
Configuration Tweaking
After all those failed attempts, I watched James' talk again, and this time I stopped at minute ~14, where James presents a slide with recommended settings for Aurora PostgreSQL.
Following his recommendations, I updated the parameter group. I reran my query, and to my surprise it took only half the time to execute: instead of ~7 minutes, it dropped to around ~3 minutes. I must be on the right track. At minute ~18, James talks about ensuring there is sufficient RAM available and recommends reviewing the cache hit ratio using the following query:
select
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as cache_hit_ratio
from pg_statio_user_tables;
PostgreSQL returned a cache_hit_ratio of 70%. Time to play around with instance sizes again. Yet again I switched between various instance types and sizes, and was able to reduce the execution time to 1 minute. Pretty impressive considering the initial execution time of ~7 minutes, but still too long for our web application.
At minute ~55, James presents a summary of what you can do to make a slow statement faster. Changing the instance size is one of the last things to consider. I got the feeling that playing around with instance sizes wouldn't provide any further improvement.
Switching the Database type
The existing Oracle database runs on RDS. Could that be the difference here? So I recreated the database using RDS PostgreSQL. Unfortunately, you cannot use a snapshot of an Aurora PostgreSQL database to create an RDS PostgreSQL database, so I had to sync the tables again. I repeated all of the steps above, with no improvement.
Back to school
I clicked through the video a few more times until I found the patience to focus on the part at minute ~25, where he talks about the Query Plan Management extension. This sounds like a very useful feature: an extension that keeps track of the query plans for every query executed at least twice. It allows you to review all generated plans, disable inefficient ones, and enable more efficient ones. Sounds reasonable. I tried to enable the extension, only to discover that it's not available for RDS PostgreSQL. So I enabled the Query Plan Management extension on the Aurora PostgreSQL database and ran the query multiple times throughout the day. By the end of the day, I had run the query at least 10 times, enough to review all generated plans. The result was sobering: PostgreSQL did not come up with any alternative plans.
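For reference, a minimal sketch of what enabling the extension involves; it assumes apg_plan_mgmt has already been added to shared_preload_libraries via the cluster parameter group:

-- Install the Query Plan Management extension (Aurora PostgreSQL only).
CREATE EXTENSION apg_plan_mgmt;
-- Automatically capture a plan baseline for every statement that runs at least twice.
SET apg_plan_mgmt.capture_plan_baselines = 'automatic';
-- Review the captured plans and their status.
SELECT sql_hash, plan_hash, status, estimated_total_cost
FROM apg_plan_mgmt.dba_plans;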
I want to believe
More frustration, but I still believed there was something out there to improve the query execution time. What kept me believing was James' slide from minute ~55.
James talks about various ways to make slow queries faster; throwing more resources at the problem is the last thing to consider. I now had control over the query plans, so maybe discouraging certain plan types might work. But there are many configuration parameters: which one should I configure?
PostgreSQL Join Limits
Back to good old Google: I tried to look up similar user experiences. While scrolling through various blogs and forums, I came across an article on pganalyze.com. The article explains how PostgreSQL uses the genetic query optimizer to generate query plans for queries containing many joins, and describes the configuration parameter join_collapse_limit, which is also one of the parameters James mentions in his presentation. This parameter tells PostgreSQL not to attempt to calculate an optimised join order for queries with more joins than the limit allows; beyond it, PostgreSQL simply follows the join order written in the query. The reasoning is that, past a certain point, searching for an optimised join order costs more than just following the order of the query.
I had already proven that the generated query structure is inefficient. Could tweaking this configuration parameter have an effect on my execution time? I updated the parameter to the number of joins in my query plus one. This should at least force PostgreSQL to come up with a new plan.
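To make the change concrete, here is a session-level sketch; including from_collapse_limit is my assumption, since it is the sibling parameter for subqueries flattened into the FROM list and is commonly raised in tandem:

-- With 23 joins, a limit of 24 lets the planner reorder all of them (the default is 8).
SET join_collapse_limit = 24;
SET from_collapse_limit = 24;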
To my surprise, I reached an execution time below one second. To prove this wasn't just a coincidence, I set the configuration parameter back to -1 and was waiting minutes again. I switched the parameter back and forth a couple of times because I couldn't believe it. After weeks of analysis, I had found a solution to my problem. As further validation, I updated the parameter group, switched to an instance type similar to the RDS Oracle instance, and achieved a very similar response time.
Winner Winner Chicken Dinner
While executing further tests, I discovered that PostgreSQL would still sometimes choose the old “bad” plan, leading to an execution time of several minutes. Since I already knew that I had control over the query execution plans, I used Query Plan Management to disable the “bad” plans and enable the good ones. Luckily, the extension provides a built-in function to do this without major effort. I finally achieved a consistently low execution time.
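A sketch of that built-in function; the hash values below are placeholders you would first look up in the apg_plan_mgmt.dba_plans view:

-- Reject the slow plan and approve the fast one, identified by their hashes (placeholders).
SELECT apg_plan_mgmt.set_plan_status(<sql_hash>, <bad_plan_hash>, 'Rejected');
SELECT apg_plan_mgmt.set_plan_status(<sql_hash>, <good_plan_hash>, 'Approved');
-- Make the planner prefer approved baselines at run time.
SET apg_plan_mgmt.use_plan_baselines = true;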
Conclusion
This exercise was a roller coaster ride for me, with many highs, lows, and loops in between. James’ performance tuning talk provided me with many new insights into how to approach my problem. What kept me going was his slide from minute ~55: if scaling up to a larger instance class & size doesn’t solve my performance problem, then there must be something else. I am glad I kept on pursuing the problem to gain a deeper understanding and find a solution. We are finally able to create a read replica of our Oracle database using Aurora PostgreSQL.
The Query Plan Management extension is a must-have when you are running an Aurora PostgreSQL database. It helps you identify bottlenecks and improve query performance. Make sure you schedule regular maintenance intervals to maintain the query plans. We will use an RDS Proxy to monitor query response latency and identify slow queries early on.