This is a (somewhat unusual?) summary of how I debugged and resolved a bottleneck in a migration. Enjoy! 🙂

Background

Some client project required migrating existing content from various internal sources into WordPress. This included both posts and terms, as well as a bit of metadata.

Most of the existing data was provided as a MySQL dump. This file contained one table for article data, a number of cross-reference tables representing article-to-term relationships, and another table for article-to-article relationships.

The overall migration is a three-step process:

  1. For each tag that is referenced by one or more articles, query its data from a remote endpoint, and create a term in the according custom taxonomy in WordPress. Of course, this is not done on a per-tag basis, but in batches of up to 100 items.
  2. For each row in the articles table, create a new post in a custom post type in WordPress, including all post data and metadata, and assign all relevant terms created in step 1 (roughly sketched after this list).
  3. Once all articles have been migrated, import the article-to-article relationships.
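
To make step 2 a bit more concrete, here is a heavily simplified sketch of what creating one such post might look like using the standard WordPress APIs. The post type, taxonomy, and field names are made up for illustration; the actual command does quite a bit more:

// Minimal sketch of step 2 for a single, already-mapped article row.
// wp_insert_post(), update_post_meta() and wp_set_object_terms() are
// the real WordPress functions; everything else is illustrative.
$post_id = wp_insert_post( [
    'post_type'    => 'article',            // assumed custom post type
    'post_title'   => $article['headline'],
    'post_content' => $article['body'],
    'post_status'  => 'publish',
] );

update_post_meta( $post_id, 'legacy_id', $article['id'] );

// Assign the terms created in step 1 (taxonomy name assumed).
wp_set_object_terms( $post_id, $term_ids, 'article_tag' );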

Problem

Now, when performing the articles migration, which is step 2 in the above list, a colleague and I realized that it was running slowly. In fact, it was getting slower and slower the longer the migration ran.

Since we were looking at importing more than 72,000 articles, we had paid attention to writing efficient, well-performing queries, as well as to memory management. But still, something wasn’t working quite as we wanted, and I had to find out what and why…

Debugging

The first thing I did was to log the time it took to process a single article. I did that by using microtime( true ) and simply added the time difference to the existing terminal output. This looked like so:

 ·
 ·
 ·
0.0083923339874 Imported article with ID 102633: Former Jefferies Executive Gets Wells Notice
0.0133037567133 Imported article with ID 102634: Some Relief Seen From Issuance Drought
1.3080917999761 Imported article with ID 102635: Calendar
0.0141859054649 Imported article with ID 102636: Oil Plunge Dampens CLO Outlook
0.0132083892835 Imported article with ID 102637: Values Soar for Rental-Home Bonds
0.0128984451363 Imported article with ID 102638: UK Mulls More Bad-Bank Sales
0.0080108642657 Imported article with ID 102639: US Bank Surges as Deutsche Lags
0.0131845474235 Imported article with ID 102640: Auto-Loan Signals Diverge
1.2493103354183 Imported article with ID 102641: THE GRAPEVINE
0.0084877014248 Imported article with ID 102642: Industry Pro Builds Search Engine
 ·
 ·
 ·
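
For reference, the timing itself was nothing fancy. Here is a minimal sketch of how it might look inside the import loop; the import method and the article fields are hypothetical, while microtime( true ) and WP_CLI::log() are the real APIs:

$start   = microtime( true );
$post_id = $this->import_article( $article ); // hypothetical per-article import routine
$elapsed = microtime( true ) - $start;

WP_CLI::log( sprintf(
    '%.13f Imported article with ID %d: %s',
    $elapsed,
    $article['id'],
    $article['headline']
) );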

On average, processing a given article took about 10 milliseconds. But for some articles, the time needed was more than 1 second, which is about 100 times as long.

I let this run for a while, stopped it at random points, and compared the times. The average time stayed at around 10 milliseconds, while the outliers took longer and longer.

So I restarted the whole thing from scratch, again and again, and compared the measured times. They varied a bit, of course, but the exact same articles that had taken a long time before took a long time again. So it had to be something about their data…

Since my colleague had implemented some offsetting logic in the articles import, I was able to start at arbitrary IDs. So I did just that, and imported only a single slow article, or a window from about 50 articles before it to 50 after. And guess what: the same, previously slow articles were suddenly imported rather quickly.

I then thought about doing some actual profiling, and, to be fair, I had thought about that before. But I decided that this wouldn’t be useful, or usable, given that we are talking about a lot of articles, each of which requires a lot of data processing and database interaction, and also given that the time difference wasn’t large at first.

Anyway, maybe I was wrong, and maybe there was something to be seen, so I was about to stop the migration process to set up profiling—when I realized something in the terminal output.

THE GRAPEVINE

Having stared at several thousand lines of terminal output in the last 30 or so minutes, I saw that all the articles that took a long time to process had one of a handful of different post titles. Sometimes, this was more obvious in the output, sometimes not. One of those was “THE GRAPEVINE”.

Then I remembered a discussion between our team and the product owner. A colleague had asked what was up with the grapevine, by which he was referring to several articles in the database dump that all had that same title. The PO had to clarify with the client team and then came back to us with the explanation that this was some kind of a catchy title they used to write about rumours or gossip and that kind of thing. You know, stuff they “heard through the grapevine”…

When I ran a MySQL query to check how often that title appeared in the articles table, I almost fell off my chair. Out of the roughly 72,000 articles overall, a whopping almost 25,500 (!) had that headline.

At this point, I’ve got a question for you:

Do you know how WordPress ensures unique post slugs?

Naive Implementation

For each custom post type, no two readable posts can have the same slug, unless they live on different levels of a hierarchical custom post type. The algorithm WordPress uses to ensure a unique post slug is rather naive, and it works like this:

  1. Check if there already is a post with the given slug.
  2. If no, all good.
  3. If yes, append -2, and check again.
  4. If still yes, append -3, and check again.
  5. Repeat with increasing suffix…
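
Expressed as code, this boils down to something like the following simplified PHP sketch. The slug_exists() helper is a hypothetical stand-in for the database lookup WordPress performs; the actual logic lives in wp_unique_post_slug():

// Keep appending an incrementing numeric suffix until a free slug is
// found; every slug_exists() call translates to one database query.
$slug   = $desired_slug;
$suffix = 2;

while ( slug_exists( $slug ) ) { // hypothetical helper
    $slug = $desired_slug . '-' . $suffix;
    $suffix++;
}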

Given that, usually, you don’t have a ton of posts that all have the same post title, that might work just fine. But not in this case.

For every single one of these posts, WordPress starts with the suffix at 2 and increments it until it finds a post slug that hasn’t been taken yet.

Just in case you were wondering: since the k-th of these posts needs about k slug checks, the total is the triangular number of 25,500, which is 25,500 × 25,501 / 2 = 325,137,750. That is the number of database queries just to find unique post slugs for these articles.

Addressing the Issue

Once I figured that out, the first thing I did was to check for frequent headlines. I did that by means of a MySQL query similar to the following one:

SELECT headline
FROM news_alerts_articles
GROUP BY headline
HAVING COUNT(*) > 50

This surfaced a total of 11 headlines that were used more than 50 times each.

When I also included the individual count, I was presented with this breakdown:

headline               COUNT(*)
Calendar                  4,553
Correction                  543
Corrections                 385
FUND-RAISING ACTION         139
GRAPEVINE                   424
INVESTMENT VEHICLES         791
NEW DEALS                 1,560
ON THE MARKET               201
RECENT SALES                417
REGULATORY ROUNDUP          546
THE GRAPEVINE            25,449

Frequent headlines with their individual count.

That means quite a few unnecessary database calls!

To stop this from happening, I used these headlines as keys in a counter map in the WP-CLI command’s PHP code. When processing an article, I checked whether its headline was one of the frequent ones, and if so, supplied the post slug myself, making sure to increment the according counter value. This looks like so:

// Frequent titles get their unique slug generated manually, with an
// incrementing numeric suffix, bypassing WordPress's slug queries.
if ( isset( $this->frequent_titles[ $title ] ) ) {
    $post_name = sanitize_title(
        $title . '-' . $this->frequent_titles[ $title ]++
    );
}
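
The counter map itself might be set up along these lines. This is only a sketch: the starting value of 2 is an assumption mirroring WordPress’s own -2, -3, … suffixing, and the actual command may initialize the map differently:

// Hypothetical initialization: one counter per frequent headline,
// starting at 2 to mirror WordPress's own suffixing scheme.
$this->frequent_titles = [
    'Calendar'      => 2,
    'THE GRAPEVINE' => 2,
    // ... the other frequent headlines ...
];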

Now we’re talking! These articles are now being processed just as quickly as all the other ones.

…well, almost.

There were still a few outliers, so I changed the threshold from 50 to 30, which surfaced two additional titles. But even then, there were posts that took longer than all the others, and they all had the title “CALENDAR”.

So I checked my list of frequent titles, and while I could see “Calendar”, the all-uppercase variant wasn’t there, even though it was clearly used a lot.

After a bit of hard thinking (read: Googling), I realized the issue was that MySQL is case-insensitive, or, to be more precise, that the collation in use was. Because of that, “Calendar” and “CALENDAR” were grouped together in my query, and only one of the two spellings showed up in the results. There are several ways of changing this, but the easiest I found was to cast the headline value, which per its field definition was a VARCHAR, to its binary representation.

This is the final MySQL query:

SELECT BINARY(headline) AS h
FROM news_alerts_articles
GROUP BY h
HAVING COUNT(*) > 30

If my colleague hadn’t decided to log the post title to the terminal when processing an article, I’m sure I wouldn’t have been able to find the root cause as quickly as I did!
