Common Table Expressions (CTEs), in MySQL 8

Recently, I was tasked with writing a query to identify and remove duplicate entities while preserving those with specific criteria. In the past, I tackled similar problems with monstrous MySQL queries involving multiple levels of subqueries. Despite writing them myself, I often found them confusing and difficult to maintain. This time, I decided to take a different approach and asked ChatGPT for help. It introduced me to using Common Table Expressions (CTEs) and partitions, features I wasn’t very familiar with.

I had encountered CTEs when querying data warehouses but never fully understood their potential. It was time to dig deeper.

What are Common Table Expressions (CTEs)?

CTEs have been available since MySQL 8.0, and they are described as follows:

A common table expression (CTE) is a named temporary result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.

While it’s true that subqueries, temporary tables, or views can achieve similar results, CTEs offer unique advantages.

Differences

So, what are the differences between CTEs, and some of the other options which may be used to achieve a similar result?

Feature/Aspect	CTEs	Subqueries	Temporary Tables	Views
Scope	Single query	Single query	Session or until explicitly dropped	Permanent until explicitly dropped
Syntax	`WITH` clause	Nested `SELECT`	`CREATE TEMPORARY TABLE`	`CREATE VIEW AS SELECT`
Readability	High (Named result sets)	Can become complex	Moderate	High
Maintainability	High (Reusable within query)	Can become difficult to maintain	Moderate	High (Reusable across queries)
Reusability	Yes, within the same query	Limited (Single use within query)	Yes, within the session	Yes, across multiple queries
Performance	Optimized by query engine	Optimized by query engine	May use disk storage	Depends on query complexity
Supports Recursion	Yes	No	No	No
Use Cases	Simplify complex queries, recursive queries	Intermediate calculations, filtering	Store intermediate results, complex transactions	Encapsulate complex queries
Temporary Storage	In-memory, query scope	In-memory, query scope	Disk or memory	N/A (Virtual table)
Ease of Use	Easy to define and use	Easy for simple cases, complex for nested	Moderate (Requires management)	Easy to define and use

As we can see, each option has its use cases, but for many ad-hoc queries and complex transactional queries, CTEs offer significant advantages, especially where subqueries were used before.

Let’s play!

Run MySQL 8 locally

Enough chatting, let’s play! The only thing you need to follow along is Docker, so assuming you have Docker installed, let’s spin up a docker container with MySQL 8:

docker run --name mysql-8-container -e MYSQL_ROOT_PASSWORD=password -e MYSQL_DATABASE=cte_test -p 3306:3306 -d mysql:8

We can now connect to the container using a GUI, or the command line with the following details:

Host: localhost
Port: 3306
Username: root
Password: password
Database: cte_test

Or, if you prefer the command line, you can execute the following which will open an interactive terminal:

docker exec -it mysql-8-container mysql -uroot -ppassword

Bootstrapping our database

Now that we have our MySQL 8 instance running, let’s create a few tables and fill them with some dummy data.

CREATE TABLE buyers (
    buyer_id INT PRIMARY KEY,
    buyer_name VARCHAR(100),
    email VARCHAR(100)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    buyer_id INT,
    order_date DATE,
    FOREIGN KEY (buyer_id) REFERENCES buyers(buyer_id)
);

CREATE TABLE line_items (
    line_item_id INT PRIMARY KEY,
    order_id INT,
    product_name VARCHAR(100),
    quantity INT,
    price DECIMAL(10, 2),
    FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

INSERT INTO buyers (buyer_id, buyer_name, email) VALUES
(1, 'John Doe', 'john.doe@example.com'),
(2, 'Jane Smith', 'jane.smith@example.com'),
(3, 'Alice Johnson', 'alice.johnson@example.com');

INSERT INTO orders (order_id, buyer_id, order_date) VALUES
(101, 1, '2023-07-01'),
(102, 1, '2023-07-15'),
(103, 2, '2023-07-20'),
(104, 3, '2023-07-25');

INSERT INTO line_items (line_item_id, order_id, product_name, quantity, price) VALUES
(1, 101, 'Laptop', 1, 1200.00),
(2, 101, 'Mouse', 2, 25.00),
(3, 102, 'Keyboard', 1, 75.00),
(4, 103, 'Monitor', 2, 300.00),
(5, 103, 'Headphones', 1, 150.00),
(6, 104, 'Desk', 1, 200.00);

Alright, now we have some totally legitimate data in our database.

An example scenario

You’ve had your first coffee of the day, you’ve done your morning stand-up, and you’re ready to continue with an interesting task you were working on, BUT WAIT…

Out of nowhere, someone in your company (most likely a PM) urgently needs some data. A loophole in the checkout process allowed orders of 500 euros or more to be placed without successful payment. They need a list of all buyers who placed such orders, so they can be contacted. Additionally, they want to know the total value of each order above 500 euros to understand the extent of the issue.

Subqueries version

Context switching time! You close your favourite IDE/text editor, open up your favorite GUI for executing SQL queries, and bash out a query like this:

SELECT
    b.buyer_name,
    ov.total_order_value
FROM
    buyers b
JOIN (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    GROUP BY
        o.buyer_id,
        o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
) ov ON b.buyer_id = ov.buyer_id
ORDER BY
    ov.total_order_value DESC;

Great! You deliver the results, close your GUI, and open your IDE again, BUT WAIT…

The colleague reappears. Based on some analysis, they conclude that buyers using the example.com domain were the main perpetrators. They swiftly ask if we can filter for only buyers with example.com emails, and also include how many line items in total they had - and for good measure, add the average cost of each buyer’s line item.

You smile through the pain, and modify your existing query:

SELECT
    b.buyer_name,
    b.email,
    ov.total_order_value,
    ov.total_items,
    ov.avg_price_per_item
FROM
    buyers b
JOIN (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value,
        SUM(li.quantity) AS total_items,
        AVG(li.price) AS avg_price_per_item
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    GROUP BY
        o.buyer_id,
        o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
) ov ON b.buyer_id = ov.buyer_id
WHERE
    b.email LIKE '%@example.com'
ORDER BY
    ov.total_order_value DESC;

It works, but it’s already starting get a bit hairy, and before you even deliver the results, they are back again. Now, they would also like to know the maximum, single item price each buyer has ordered.

SELECT
    b.buyer_name,
    b.email,
    ov.total_order_value,
    ov.total_items,
    ov.avg_price_per_item,
    ov.max_single_item_price
FROM
    buyers b
JOIN (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value,
        SUM(li.quantity) AS total_items,
        AVG(li.price) AS avg_price_per_item,
        (SELECT MAX(li_inner.price)
         FROM line_items li_inner
         WHERE li_inner.order_id = o.order_id) AS max_single_item_price
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    GROUP BY
        o.buyer_id,
        o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
) ov ON b.buyer_id = ov.buyer_id
WHERE
    b.email LIKE '%@example.com'
ORDER BY
    ov.total_order_value DESC;

Phew - another nested subquery, another layer of complexity. Through this contrived example, you can see how quickly these queries grow. People love data. You decide this query might be useful in the future, something worth adding to your query favorites, but it does look a little scary. Opening this in 3 months, it might take you a while to decipher what’s going on.

Let’s put our noise cancelling headphones on to deter any more interactions from our colleague, and see if we can simplify things using CTEs.

CTE version

Let’s start from our base query, which simply lists buyers with orders over 500 euros, along with the total value of their orders:

The base query

WITH OrderValues AS (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    GROUP BY
        o.buyer_id, o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
)
SELECT
    b.buyer_name,
    ov.total_order_value
FROM
    OrderValues ov
JOIN
    buyers b ON ov.buyer_id = b.buyer_id
ORDER BY
    ov.total_order_value DESC;

Contrasting this with the subquery version, it’s similar, but I would already argue it’s more readable. The first thing we do is define our CTE, which will produce a temporary (lifetime of the query) result set, named OrderValues. This temporary result set contains a list of orders, with the respective buyers, that are over 500 euros in value.

From that point on, we simply query our CTE, join on the buyers table to fetch any additional info we need, and sort it by order value.

Adding complexity

And in this version, we added a filter based on email address, a SUM, and an AVG:

WITH OrderValues AS (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value,
        SUM(li.quantity) AS total_items,
        AVG(li.price) AS avg_price_per_item
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    GROUP BY
        o.buyer_id, o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
)
SELECT
    b.buyer_name,
    b.email,
    ov.total_order_value,
    ov.total_items,
    ov.avg_price_per_item
FROM
    OrderValues ov
JOIN
    buyers b ON ov.buyer_id = b.buyer_id
WHERE
    b.email LIKE '%@example.com'
ORDER BY
    ov.total_order_value DESC;

Another level of complexity

And on the boss level, we add in a second CTE, which picks the highest value line item per order, which we later query in the main query body.

WITH OrderValues AS (
    SELECT
        o.buyer_id,
        o.order_id,
        SUM(li.quantity * li.price) AS total_order_value,
        SUM(li.quantity) AS total_items,
        AVG(li.price) AS avg_price_per_item
    FROM
        orders o
    JOIN
        line_items li ON o.order_id = li.order_id
    WHERE
        o.order_date >= CURDATE() - INTERVAL 4 DAY
    GROUP BY
        o.buyer_id, o.order_id
    HAVING
        SUM(li.quantity * li.price) >= 500
),
MaxItemPrices AS (
    SELECT
        li.order_id,
        MAX(li.price) AS max_single_item_price
    FROM
        line_items li
    GROUP BY
        li.order_id
)
SELECT
    b.buyer_name,
    b.email,
    ov.total_order_value,
    ov.total_items,
    ov.avg_price_per_item,
    mip.max_single_item_price
FROM
    OrderValues ov
JOIN
    buyers b ON ov.buyer_id = b.buyer_id
JOIN
    MaxItemPrices mip ON ov.order_id = mip.order_id
WHERE
    b.email LIKE '%@example.com'
ORDER BY
    ov.total_order_value DESC;

What else?

As the table above indicates, CTEs can also support recursion, but it starts to get a little more complicated, so that’s a topic for another post.

Conclusion

In this post, we didn’t cover anything groundbreaking, but CTEs are a great tool that can transform previously complex and hard-to-read SQL queries into much more manageable and understandable ones. Combined with other features such as partitions, you can build some really powerful queries.

Stay tuned for a future post, where we dive into recursive CTEs and explore their full potential!