By Seth Melnick and Somar Musa
Two weeks ago, our systems encountered an issue that caused significant delays in data consumption. While this delay was ongoing, a user who was considering adding a hashtag to a post they were composing would be shown a view count of 0, even if the hashtag actually had billions of views. This display issue, which impacted a total of 225,611 unique hashtags but could have impacted any searched hashtag, was due to inefficiencies in the design of our data stream flushing service.
Today, we want to clarify exactly what happened, how it happened, and why it happened. We'll go into detail on how our system returns hashtag information to users, what we did to fix the issue in the short term, and how we're going to ensure that this problem doesn't happen again.
How did we learn about the issue?
On May 28th, 2020, we learned through user reports that view counts on a large number of hashtags – including #georgefloyd and #blacklivesmatter – were showing 0 views in the Compose screen (where users prepare their post before uploading).
What did we find out?
Upon hearing about the issue, we began exploring the two paths that would be most likely to cause this error: (1) an issue in the code or (2) a data loading or lagging issue. We found that:
(1) The code was bug-free and provided eventual consistency.
(2) There was a lag in the data stream that counts the number of times a video is viewed.
What did the data lag look like?
When we say we had a lag in a data stream, we mean that some data was so backed up that it had to wait in a queue before it could be consumed. Normally, we calculate the computing power needed to consume the queued data with little to no delay or lag time.
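As a rough illustration of that capacity planning (not our production logic, and with made-up numbers), the arithmetic looks something like this:

```python
import math

# Hypothetical capacity-planning sketch: provision enough consumers so the
# queue drains as fast as events arrive. All numbers here are illustrative.
arrival_rate = 120_000           # assumed events per second entering the queue
per_consumer_throughput = 9_000  # assumed events per second one consumer can process

consumers_needed = math.ceil(arrival_rate / per_consumer_throughput)
print(consumers_needed)  # 14 -> with at least this many consumers, lag stays near zero
```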
In this case, as shown in the above graph, the queued, unprocessed hashtag data piled up into the millions and still needed to be processed one by one, so much of the data could not be pushed back to the user. Eventually the server might have returned a value, but it would have taken several days, at which point the view count it displayed would also have been out of date.
In the above graph you can also see several peaks and valleys in the data consumption slopes. While this might appear as though the issue was resolving itself every few days, it is actually a self-protection mechanism of the data stream infrastructure: when the lag becomes too severe, the data stream abandons all of the lagged data and starts over, which is why the lag regularly peaks and then suddenly falls to 0. This behavior ultimately made the lag last longer.
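To make that peak-and-drop pattern concrete, here is a minimal sketch of that kind of self-protection behavior; the threshold, rates, and names are assumptions for illustration, not our actual infrastructure code.

```python
from collections import deque

MAX_LAG = 3000  # hypothetical threshold at which the stream protects itself

def consume_tick(queue, arrivals, capacity):
    """One tick of a lagging consumer with a self-protection reset."""
    queue.extend(arrivals)
    for _ in range(min(capacity, len(queue))):
        queue.popleft()             # consume events one by one
    if len(queue) > MAX_LAG:
        # Self-protection: abandon all lagged data and start over,
        # which shows up as the lag suddenly falling back to 0.
        queue.clear()
    return len(queue)

queue = deque()
history = [consume_tick(queue, [0] * 1000, 800) for _ in range(20)]
print(history)  # sawtooth: lag climbs by 200 per tick, then drops to 0 past MAX_LAG
```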
Why did the hashtag count view data pile up?
To understand how the hashtag count view data piled up, we need to look at how hashtag count view searches work under normal circumstances and how the data is consumed:
If the system is working correctly, the request for the hashtag works through a server proxy and a data stream flushing service (the process of the data stream system consuming the queued data) that keeps the entire system moving quickly. Ideally, this returns a value immediately.
Generally speaking, a one-to-one mapping check backed by a data flushing service is much faster than a thorough search (which involves indexing, matching, ranking, and so on). In this case, the data flushing failed for a period of time, which caused the data lag and led the system to assume there was no value to return (that is, 0 views).
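Here is a minimal sketch of why a stale mapping surfaces as "0 views"; the data structure, names, and counts are hypothetical, but the idea is the same: the fast path is a direct lookup that falls back to 0 when the flushing service hasn't kept the entry fresh.

```python
# Hypothetical fast path: a one-to-one mapping from plaintext hashtag to its
# precomputed view count, kept fresh by the flushing service.
view_count_by_hashtag = {"#blacklivesmatter": 12_000_000_000}  # illustrative value

def fast_lookup(hashtag: str) -> int:
    # If the flushing service lags, the entry is missing or stale and the
    # lookup returns 0, which is exactly what users saw in the Compose screen.
    return view_count_by_hashtag.get(hashtag, 0)

print(fast_lookup("#blacklivesmatter"))  # 12000000000 when the mapping is fresh
print(fast_lookup("#georgefloyd"))       # 0 if the entry never got flushed
```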
What happened that caused the data flushing system to fail?
The system is supposed to support global hashtag flushing, and it flushes every attribute of a hashtag nearly 150 times (once per country). This is unnecessary because most of the attributes are the same across all countries (e.g., plaintext, hashtag IDs), and the redundant work can cause lags like the one behind this issue.
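A rough sketch of that redundant pattern, with placeholder names and a made-up attribute set, shows how quickly the writes multiply:

```python
# Hypothetical sketch of the redundant flushing described above: every
# attribute is written once per country, even when its value is identical
# everywhere (e.g., plaintext, hashtag ID).
COUNTRIES = [f"country_{i}" for i in range(150)]  # ~150 regions, names are placeholders

def flush_redundantly(attributes: dict) -> int:
    writes = 0
    for _country in COUNTRIES:
        # One write per attribute per country, regardless of whether the value
        # differs by country -- this is the wasted work that builds up lag.
        writes += len(attributes)
    return writes

print(flush_redundantly({"plaintext": "#blacklivesmatter", "hashtag_id": 42, "views": 1}))
# 450 writes for just 3 attributes of a single hashtag
```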
So how did we fix it?
Once our engineering team ascertained that the issue was caused by the data stream flushing not working, we fell back to a thorough search procedure: normalizing the query and searching the hashtag library directly. This meant essentially cutting out the mapping data and the data stream flushing, pulling the data in plaintext, and then communicating between the server proxy and the hashtag library to get the proper hashtag ID and, from it, the view count. It looks something like this:
As mentioned above, this method is slower in general, but in this case it pulled accurate view counts for hashtags in the Compose screen more quickly than waiting for the lagged data stream.
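In sketch form, the fallback skips the precomputed mapping and resolves the plaintext against the hashtag library to get an ID, then fetches the count by ID. The names and values below are hypothetical.

```python
# Hypothetical fallback: bypass the stale mapping and resolve the plaintext
# directly against the hashtag library, then look up the count by ID.
hashtag_library = {"#blacklivesmatter": 1001, "#georgefloyd": 1002}  # plaintext -> ID (illustrative)
views_by_id = {1001: 12_000_000_000, 1002: 8_000_000_000}            # ID -> view count (illustrative)

def thorough_search(plaintext: str) -> int:
    hashtag_id = hashtag_library.get(plaintext)
    if hashtag_id is None:
        return 0
    return views_by_id.get(hashtag_id, 0)

print(thorough_search("#blacklivesmatter"))  # 12000000000, even while the mapping is lagging
```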
Unfortunately, switching to plaintext didn't solve the problem for differently cased versions of a hashtag. This meant that if you typed #GeorgeFloyd or #BlacklivesMatter, you would still see a 0 view value.
This happens because we store all of our plaintext hashtags in lowercase.
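In other words, an exact-match lookup on a differently cased query simply misses the lowercase key. A tiny illustrative example (with a made-up ID):

```python
# The library keys are stored in lowercase, so an exact-match lookup on a
# differently cased query comes back empty.
hashtag_library = {"#georgefloyd": 1002}

print(hashtag_library.get("#GeorgeFloyd"))  # None -> surfaced to the user as 0 views
```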
How did we solve the uppercase issue?
We solved the upper- and mixed-case issue by normalizing all plaintext, case-sensitive permutations to lowercase before our system counted them against the hashtag library, and then doing a final check and reconciliation before returning the result to the user. It looks something like this:
So now the server has multiple checkpoints: it normalizes the plaintext query entered by the user to lowercase before sending it to be counted against the hashtag library, then checks whether the result returned by the recommendation side has a corresponding lowercase version. This helps ensure that every time a user enters a hashtag in the Compose screen, they see every view of that hashtag.
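A minimal sketch of those checkpoints, assuming a lowercase-keyed library and illustrative counts (the function and field names are hypothetical):

```python
# Hypothetical sketch of the fix: normalize the user's query to lowercase
# before the library lookup, and normalize results coming back from the
# recommendation side, so case variants resolve to the same entry.
hashtag_library = {"#georgefloyd": 1002}  # stored in lowercase (illustrative)
views_by_id = {1002: 8_000_000_000}       # illustrative count

def view_count(query: str) -> int:
    normalized = query.lower()                  # checkpoint 1: normalize the query
    hashtag_id = hashtag_library.get(normalized)
    if hashtag_id is None:
        return 0
    return views_by_id.get(hashtag_id, 0)

def reconcile(recommended: list[str]) -> dict[str, int]:
    # checkpoint 2: confirm recommended hashtags map back to a lowercase entry
    return {tag: view_count(tag) for tag in recommended}

print(view_count("#GeorgeFloyd"))                   # 8000000000
print(reconcile(["#GeorgeFloyd", "#georgefloyd"]))  # both resolve to the same count
```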
What's next?
We've explored the design of our data stream flushing and found inefficiencies. We're upgrading the design to eliminate this issue. Specifically, the new system will recognize identical attributes and avoid flushing them redundantly: if an attribute is the same across all countries, we will flush it only once instead of nearly 150 times (once per country). This will allow us to process data faster than before while using far less computing power, meaning we shouldn't see a lag issue like this again.
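Continuing the earlier sketch (same placeholder names and made-up attributes), the deduplicated flush writes shared attributes once and repeats only genuinely regional values:

```python
# Hypothetical sketch of the upgraded flushing: attributes whose value is the
# same everywhere are written once; only country-specific attributes are
# written per country.
COUNTRIES = [f"country_{i}" for i in range(150)]  # placeholder region list

def flush_deduplicated(shared: dict, per_country: dict) -> int:
    writes = len(shared)               # identical attributes flushed only once
    for _country in COUNTRIES:
        writes += len(per_country)     # only regional values repeat per country
    return writes

shared = {"plaintext": "#blacklivesmatter", "hashtag_id": 42}
per_country = {"views": 1}             # assumed to differ by region for illustration
print(flush_deduplicated(shared, per_country))  # 152 writes instead of 450
```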
Moreover, as an extra layer of precaution, we will be increasing our monitoring, which will notify us of delays in data stream processing early on and help us correct these issues as they arise.
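As a sketch of what such a check might look like (the thresholds and names here are assumptions, not our actual alerting rules), the idea is to alert on both the size of the backlog and how fast it is growing:

```python
# Hypothetical monitoring check: alert early when queue lag or its growth rate
# crosses a threshold, instead of waiting for users to notice stale counts.
LAG_ALERT_THRESHOLD = 100_000    # assumed number of unprocessed events
GROWTH_ALERT_THRESHOLD = 5_000   # assumed growth per check interval

def should_alert(previous_lag: int, current_lag: int) -> bool:
    growing_fast = (current_lag - previous_lag) > GROWTH_ALERT_THRESHOLD
    return current_lag > LAG_ALERT_THRESHOLD or growing_fast

print(should_alert(90_000, 120_000))  # True: backlog is both large and growing
print(should_alert(1_000, 2_000))     # False: well within normal range
```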