Recently, I was tasked with writing a query to identify and remove duplicate entities while preserving those with specific criteria. In the past, I tackled similar problems with monstrous MySQL queries involving multiple levels of subqueries. Despite writing them myself, I often found them confusing and difficult to maintain. This time, I decided to take a different approach and asked ChatGPT for help. It introduced me to using Common Table Expressions (CTEs) and partitions, features I wasn’t very familiar with.
I had encountered CTEs when querying data warehouses but never fully understood their potential. It was time to dig deeper.
What are Common Table Expressions (CTEs)?
CTEs have been available since MySQL 8.0, and they are described as follows:
A common table expression (CTE) is a named temporary result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.
While it’s true that subqueries, temporary tables, or views can achieve similar results, CTEs offer unique advantages.
Differences
So, what are the differences between CTEs, and some of the other options which may be used to achieve a similar result?
Feature/Aspect | CTEs | Subqueries | Temporary Tables | Views |
---|---|---|---|---|
Scope | Single query | Single query | Session or until explicitly dropped | Permanent until explicitly dropped |
Syntax | WITH clause |
Nested SELECT |
CREATE TEMPORARY TABLE |
CREATE VIEW AS SELECT |
Readability | High (Named result sets) | Can become complex | Moderate | High |
Maintainability | High (Reusable within query) | Can become difficult to maintain | Moderate | High (Reusable across queries) |
Reusability | Yes, within the same query | Limited (Single use within query) | Yes, within the session | Yes, across multiple queries |
Performance | Optimized by query engine | Optimized by query engine | May use disk storage | Depends on query complexity |
Supports Recursion | Yes | No | No | No |
Use Cases | Simplify complex queries, recursive queries | Intermediate calculations, filtering | Store intermediate results, complex transactions | Encapsulate complex queries |
Temporary Storage | In-memory, query scope | In-memory, query scope | Disk or memory | N/A (Virtual table) |
Ease of Use | Easy to define and use | Easy for simple cases, complex for nested | Moderate (Requires management) | Easy to define and use |
As we can see, each option has its use cases, but for many ad-hoc queries and complex transactional queries, CTEs offer significant advantages, especially where subqueries were used before.
Let’s play!
Run MySQL 8 locally
Enough chatting, let’s play! The only thing you need to follow along is Docker, so assuming you have Docker installed, let’s spin up a docker container with MySQL 8:
We can now connect to the container using a GUI, or the command line with the following details:
- Host: localhost
- Port: 3306
- Username: root
- Password: password
- Database: cte_test
Or, if you prefer the command line, you can execute the following which will open an interactive terminal:
Bootstrapping our database
Now that we have our MySQL 8 instance running, let’s create a few tables and fill them with some dummy data.
Alright, now we have some totally legitimate data in our database.
An example scenario
You’ve had your first coffee of the day, you’ve done your morning stand-up, and you’re ready to continue with an interesting task you were working on, BUT WAIT…
Out of nowhere, someone in your company (most likely a PM) urgently needs some data. A loophole in the checkout process allowed orders of 500 euros or more to be placed without successful payment. They need a list of all buyers who placed such orders, so they can be contacted. Additionally, they want to know the total value of each order above 500 euros to understand the extent of the issue.
Subqueries version
Context switching time! You close your favourite IDE/text editor, open up your favorite GUI for executing SQL queries, and bash out a query like this:
Great! You deliver the results, close your GUI, and open your IDE again, BUT WAIT…
The colleague reappears. Based on some analysis, they conclude that buyers using the example.com
domain were the main perpetrators. They swiftly ask if we can filter for only buyers with example.com
emails, and also include how many line items in total they had - and for good measure, add the average cost of each buyer’s line item.
You smile through the pain, and modify your existing query:
It works, but it’s already starting get a bit hairy, and before you even deliver the results, they are back again. Now, they would also like to know the maximum, single item price each buyer has ordered.
Phew - another nested subquery, another layer of complexity. Through this contrived example, you can see how quickly these queries grow. People love data. You decide this query might be useful in the future, something worth adding to your query favorites, but it does look a little scary. Opening this in 3 months, it might take you a while to decipher what’s going on.
Let’s put our noise cancelling headphones on to deter any more interactions from our colleague, and see if we can simplify things using CTEs.
CTE version
Let’s start from our base query, which simply lists buyers with orders over 500 euros, along with the total value of their orders:
The base query
Contrasting this with the subquery version, it’s similar, but I would already argue it’s more readable. The first thing we do is define our CTE, which will produce a temporary (lifetime of the query) result set, named OrderValues
. This temporary result set contains a list of orders, with the respective buyers, that are over 500 euros in value.
From that point on, we simply query our CTE, join on the buyers
table to fetch any additional info we need, and sort it by order value.
Adding complexity
And in this version, we added a filter based on email address, a SUM, and an AVG:
Another level of complexity
And on the boss level, we add in a second CTE, which picks the highest value line item per order, which we later query in the main query body.
What else?
As the table above indicates, CTEs can also support recursion, but it starts to get a little more complicated, so that’s a topic for another post.
Conclusion
In this post, we didn’t cover anything groundbreaking, but CTEs are a great tool that can transform previously complex and hard-to-read SQL queries into much more manageable and understandable ones. Combined with other features such as partitions, you can build some really powerful queries.
Stay tuned for a future post, where we dive into recursive CTEs and explore their full potential!