Analysing Slow Query Performance on Aurora PostgreSQL

In my previous post, I wrote about how I used AWS DMS to create an Oracle database replica in Aurora PostgreSQL. I touched on how my tests uncovered slow query performance and how using the query plan management extension helped me to achieve comparable execution times.

In this blog post I will take you through the journey of identifying the root cause of the problem and how I solved it.

Problem description

I updated the application configuration to connect to the new Aurora PostgreSQL database and began my tests. The first success: the application connected to the database and executed queries. Throughout my tests, however, I experienced database connection timeouts. Using AWS RDS Performance Insights, I gained a better understanding of what was going on: while the application timed out, the query was still executing on the database. After I had monitored Performance Insights for several minutes, the query execution completed. Performance Insights stores a copy of the executed SQL query, which allowed me to run the query manually using pgAdmin. After waiting ~7 minutes, I finally received the result. I reran the query on the Oracle database using SQL Developer; it took Oracle less than a second. I compared the results and confirmed that they matched.

The application we are discussing is an ASP.NET Core web application, and short response times are essential. Queries are generated programmatically by NHibernate, which often results in very large queries. The query in question is ~25 KB in size and includes 23 joins.

Increasing Memory

Performance Insights also highlighted excessive buffer file read & write activity. This can be an indicator that the database does not have enough memory available to cache data. Based on this initial finding, I increased work_mem and reran the query. Although the query performance did not improve significantly, I shaved off a few seconds, and Performance Insights no longer showed any buffer file activity.
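
For illustration, work_mem can be raised at the session level before rerunning a query; the 256MB below is only an assumed starting point, not necessarily the value I ended up with. On RDS/Aurora, a permanent change would normally go through the DB parameter group instead.

-- Raise the per-operation sort/hash memory for this session only;
-- 256MB is an illustrative value, not a recommendation.
SET work_mem = '256MB';

-- Verify the active setting.
SHOW work_mem;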

To get more insights about the query, I ran the query again using EXPLAIN ANALYSE this time.

Query result after increasing work_mem

EXPLAIN ANALYSE tells us about the generated query plan. Reviewing the output, we can identify heavy usage of sequential scans and hash joins. The output also shows that one sequential scan loops through 900k rows a total of 69 times.
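
For readers unfamiliar with it, the invocation looks like this; the tables and join below are placeholders, not the actual NHibernate-generated query.

-- Prefix the statement to get the actual plan with runtime statistics.
-- "orders" and "customers" are placeholder tables, not the real schema.
EXPLAIN ANALYSE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'OPEN';

-- In the output, watch for nodes such as:
--   Seq Scan on orders (... rows=900000 loops=69)
-- A large row count multiplied by a high loop count is a red flag.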

Even with my limited database knowledge, I know that this is suboptimal; it may hint at missing indexes. So I reviewed the indexes in the schema converted by AWS SCT and verified that they existed on the database. They all did, except for one. I created the missing index and reran the query, with no success: the execution time was still as high as before, even though EXPLAIN ANALYSE showed the use of one additional index.
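
Recreating the missing index was a one-liner; the names here are hypothetical stand-ins for the converted schema.

-- Recreate the index that AWS SCT defined but that was missing on the target.
-- Table and column names are placeholders.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);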

Increase Instance Size

Performance Insights also revealed heavy CPU activity, so I tried moving to a larger database instance size, with little success. After testing various instance sizes and types, the shortest execution time I achieved was about 3 minutes: still far too long for our web application.

Discouraging PostgreSQL

I kept asking myself the same question: why is PostgreSQL choosing sequential scans over indexes?

We needed to dig deeper. While doing my research, I came across a talk James Finnerty gave in 2019 at AWS Migration Day – PostgresConf: “Deep Dive on Amazon Aurora PostgreSQL Performance Tuning”. His talk is a great resource on Aurora PostgreSQL performance tuning; I have probably watched it a hundred times by now.

About 13 minutes in, he talks about optimizer parameters.

Deep Dive on Amazon Aurora PostgreSQL Performance Tuning – Optimizer Parameters

This is really useful: it summarises the available configuration parameters that can be used to force the creation of new query plans. Before using any of the enable_* configuration parameters, you must know that you cannot force the use of particular plan types. You are asking PostgreSQL politely to favour certain plan types over others when generating query plans.

EXPLAIN ANALYSE uncovered excessive usage of hash joins, so I tried to discourage PostgreSQL from using hash joins for my query by setting enable_hashjoin to false.
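
This is a session-level setting, so it is easy to experiment with; a minimal sketch:

-- Politely discourage (not forbid) hash joins for this session.
SET enable_hashjoin = off;

-- Rerun the query, then restore the default.
RESET enable_hashjoin;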

Query result after disabling hash joins

No improvement. Looking at the EXPLAIN ANALYSE output, we can see that PostgreSQL is now using a Merge Left Join instead of a Hash Join. The question remains: why is PostgreSQL not using the indexes?

Restructuring the query

I started looking at the query itself, which NHibernate hadn’t structured very efficiently. Since the query is generated programmatically, I don’t have direct control over it. Still, I wanted to see what execution time I could realistically expect from the database. After spending some time restructuring the query, and getting some help, I finally had a rewritten version. I ran it, and at last I had an acceptable execution time of less than a second.

Query result after restructuring the query

Now I knew that it is possible to get an execution time comparable to the Oracle database. The only question was how to get there. The queries in the application cannot be altered directly, so I had to come up with another solution.

Configuration Tweaking

After all those failed attempts, I watched James’ talk again, and this time I stopped at minute ~14, where James presents a slide with recommended settings for Aurora PostgreSQL.

Deep Dive on Amazon Aurora PostgreSQL – Recommended settings

Following his recommendations, I updated the parameter group. I reran my query, and to my surprise it took half the time to execute: instead of ~7 minutes, it dropped to around ~3 minutes. I must be on the right track! At minute ~18, James talks about ensuring there’s sufficient RAM available and recommends reviewing the cache hit ratio using the following query:

select
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as cache_hit_ratio
from pg_statio_user_tables;

PostgreSQL returned a cache_hit_ratio of 70%. Time to play around with the instance sizes again. Yet again, I switched between various instance types & sizes and was able to reduce the execution time to 1 minute. Pretty impressive, considering the initial execution time of ~7 minutes, but still too long for our web application.

At minute ~55, James presents a summary of what you can do to make a slow statement faster. Changing the instance size is one of the last things to consider. I got the feeling that playing around with instance sizes wouldn’t provide any further improvement.

Switching the Database type

The existing Oracle database is running on RDS. Could that be the difference here? So I recreated the database using RDS PostgreSQL. Unfortunately, you cannot use a snapshot of an Aurora PostgreSQL database to create an RDS PostgreSQL database, so I had to sync the tables again. I repeated all the steps above, with no improvement.

Back to school

I clicked through the video a few more times until I found the patience to focus on the part at minute ~25, where James talks about the Query Plan Management extension. This sounded like a very useful feature: an extension that keeps track of all query plans for each query executed more than twice. It allows you to review all generated plans, disable inefficient plans, and enable more efficient ones. Sounds reasonable. I tried to enable the extension, only to discover that it is not available for RDS PostgreSQL. So I enabled the Query Plan Management extension on the Aurora PostgreSQL database and ran the query multiple times throughout the day. By the end of the day, I had run the query at least 10 times: enough to review all generated plans. The result was sobering, because PostgreSQL did not come up with any alternative plans.
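
For reference, the capture workflow looks roughly like this; this is a sketch assuming the apg_plan_mgmt extension has already been installed and added via the cluster parameter group.

-- Capture a plan baseline for every statement run while this is on.
SET apg_plan_mgmt.capture_plan_baselines = automatic;

-- ... execute the slow query a few times ...

-- Review the plans the extension has recorded.
SELECT sql_hash, plan_hash, status, enabled
FROM apg_plan_mgmt.dba_plans;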

I want to believe

More frustration, but I still believed that there was something out there to improve the query execution time. What kept me believing was James’ slide from minute ~55.

Deep Dive on Amazon Aurora PostgreSQL Performance Tuning – make slow statement faster

In it, James covers various ways to make slow queries faster; throwing more resources at the problem is the last thing to consider. I now had control over the query plans, so maybe discouraging the use of certain plan types might work. But there are many configuration parameters: which one should I configure?

PostgreSQL Join Limits

Back to good old Google: I tried to look up similar user experiences. While scrolling through various blogs & forums, I came across an article from pganalyze.com. The article discusses how PostgreSQL uses the genetic query optimizer to generate query plans for queries containing multiple joins. It describes the configuration parameter join_collapse_limit, which is also one of the configuration parameters James talks about in his presentation. This parameter tells PostgreSQL not to attempt to calculate an optimised join order for queries with more joins than the limit allows; if the limit is exceeded, PostgreSQL simply follows the join order written in the query. The reasoning is that, beyond a certain number of joins, calculating an optimised plan costs more than just following the written join order.

I had already proved that the generated query structure was inefficient. Could tweaking this configuration parameter have an effect on my execution time? I updated the parameter to the number of joins in my query + 1, as in the sketch below. This should at least force PostgreSQL to come up with a new plan.
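
A minimal sketch of that change, given the 23 joins in the generated query:

-- 23 joins in the generated query, so lift the limit just above it.
-- Tested per session first; the permanent change goes into the
-- DB parameter group.
SET join_collapse_limit = 24;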

Query result after increasing join_collapse_limit

To my surprise, I got an execution time below one second. To prove that this wasn’t just a coincidence, I set the configuration parameter back to -1 and found myself waiting minutes again. I switched the parameter back and forth a couple of times because I couldn’t believe it. After weeks of analysis, I had found a solution to my problem. As further validation, I updated the parameter group, switched to an instance type similar to the RDS Oracle instance, and achieved a very similar response time.

Winner Winner Chicken Dinner

While executing further tests, I discovered that PostgreSQL would still sometimes choose the old “bad” plan, leading to an execution time of several minutes. Since I already knew that I had control over the query execution plans, I used the Query Plan Management extension to disable the “bad” plans and enable the good ones. Luckily, the extension provides a built-in function to do this without major effort, as sketched below. I finally achieved a repeatably low execution time.
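
A sketch of that step; the hash values are placeholders and have to be read from apg_plan_mgmt.dba_plans first.

-- Reject the slow plan and approve the fast one for the same statement.
-- 12345 / 67890 / 67891 are placeholder hashes, not real values.
SELECT apg_plan_mgmt.set_plan_status(12345, 67890, 'Rejected');
SELECT apg_plan_mgmt.set_plan_status(12345, 67891, 'Approved');

-- Make the planner prefer approved baselines.
SET apg_plan_mgmt.use_plan_baselines = true;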

Conclusion

This exercise was a roller coaster ride for me, with many highs, lows, and loops in between. James’ performance tuning talk provided me with many new insights into how to approach my problem. What kept me going was his slide from minute ~55: if scaling up to a larger instance class & size doesn’t solve a performance problem, there must be something else. I am glad I kept pursuing the problem to gain a deeper understanding and find a solution. We are finally able to create a read replica of our Oracle database using Aurora PostgreSQL.

The Query Plan Management extension is a must-have when you are running an Aurora PostgreSQL database. It helps you identify bottlenecks and improve query performance. Make sure you schedule regular maintenance intervals to maintain the query plans. We will use an RDS Proxy to monitor query response latency and identify slow queries early on.
