Serious forum bug, and downtime that isn't my fault

posted by Jeff | Friday, September 9, 2022, 12:01 AM | comments: 0

Last night I finally finished integrating the latest forum code into the hosted forum product, and at first, it seemed like it was all good. It had been running fine on CoasterBuzz, which is different because that's a single tenant. The hosted product is intended to run a bunch of forums at the same time, so the try-out forum and the PointBuzz Forums are literally the same site, just skinned differently with different data because of the domain name. I worked very hard a few years ago to light up this scenario, where I could mostly just drop in the existing forums to a project with extra plumbing to facilitate the multitenancy, and that's been working ever since. Until it didn't.

One of the biggest changes in the new version is notifications, and using them in place of email subscriptions. It used to be that you could subscribe to a topic, and when it was updated, you would get an email saying as much. Of course, no one has used that in years, because email. I replaced that with in-app notifications, and they were working pretty well. But in the old days, I wasn't going to make the user wait until everyone who subscribed was emailed, so I spawned a new thread to do that. I knew this was a terrible idea years ago when I wrote it, because I couldn't unit test it right, and I even had the compiler stop bothering me about it.

#pragma warning disable 1998
	public async Task NotifySubscribers(Topic topic, User postingUser, string topicLink, Func<User, Topic, string> unsubscribeLinkGenerator)
	{
		new Thread(async () => {
			var users = await _subscribedTopicsRepository.GetSubscribedUsersThatHaveViewed(topic.TopicID);
			foreach (var user in users)
			{
				if (user.UserID != postingUser.UserID)
				{
					var unsubScribeLink = unsubscribeLinkGenerator(user, topic);
					await _subscribedTopicEmailComposer.ComposeAndQueue(topic, user, topicLink, unsubScribeLink);
				}
			}
			await _subscribedTopicsRepository.MarkSubscribedTopicUnviewed(topic.TopicID);
		}).Start();
	}
#pragma warning restore 1998

I mean, there were all of the signs this was a bad idea. I had to add the warning overrides when I made all of the other stuff inside run asynchronously, and even before that, I couldn't run unit tests. This mostly worked fine, because the thread it spawned seemed to always do its job and not die or get garbage collected before it was done. But after I changed that code so it did notifications instead of email, and then put it in a multitenant app, that spawned thread had no idea what tenant it was working with, so it got crushed hard. The worst thing is that I spent a lot of time building a test environment, and it's all automated and deploys every time I change code. But I didn't do some basic testing around the new features in that environment.

As you might suspect, I spent some time this evening writing code to just queue the topic's ID number and the tenant, and let a processor read off the queue and do the notifications. It should have always been that, but I was lazy.
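For illustration, here is a minimal sketch of that queue-and-processor shape. It is not the actual code from the forum project: the names `TopicNotificationPayload`, `TopicNotificationQueue` and `TopicNotificationProcessor` are hypothetical, and an in-memory `Channel<T>` stands in for whatever durable queue the real app uses. The point is that the tenant rides along in the message, so the processor never has to guess it from ambient request state.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical payload: only IDs and the tenant travel through the queue.
public record TopicNotificationPayload(string TenantID, int TopicID, int PostingUserID);

public class TopicNotificationQueue
{
    // Assumption: an in-memory channel as a stand-in for a durable queue.
    private readonly Channel<TopicNotificationPayload> _channel =
        Channel.CreateUnbounded<TopicNotificationPayload>();

    // Called while handling the post; returns as soon as the payload is queued.
    public ValueTask Enqueue(TopicNotificationPayload payload) =>
        _channel.Writer.WriteAsync(payload);

    // Signals that no more work is coming, so the processor loop can end.
    public void Complete() => _channel.Writer.Complete();

    public IAsyncEnumerable<TopicNotificationPayload> ReadAll() =>
        _channel.Reader.ReadAllAsync();
}

public class TopicNotificationProcessor
{
    private readonly TopicNotificationQueue _queue;
    private readonly Func<TopicNotificationPayload, Task> _notify;

    // The notify delegate is a placeholder for "set the tenant context,
    // load the subscribers, and send the notifications".
    public TopicNotificationProcessor(TopicNotificationQueue queue,
        Func<TopicNotificationPayload, Task> notify)
    {
        _queue = queue;
        _notify = notify;
    }

    public async Task Run()
    {
        await foreach (var payload in _queue.ReadAll())
        {
            try
            {
                await _notify(payload);
            }
            catch (Exception ex)
            {
                // One bad message is logged and skipped; the loop keeps going.
                Console.Error.WriteLine(
                    $"Notification failed for topic {payload.TopicID} ({payload.TenantID}): {ex.Message}");
            }
        }
    }
}
```

Because the processor is an ordinary async method that takes its dependencies through the constructor, it can be unit tested by handing it a fake notify delegate, which the spawned thread never allowed.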

So that was the first problem to solve, because the mass of exceptions thrown every time a new post was made, and all those people had to be notified, brought the app down. It recovered pretty quickly, but that's not ideal. At the same time, by sheer coincidence, most of the audience started getting gateway errors. I thought maybe I had not fixed the problem, and for three hours I tried to diagnose what was going on. When I started looking at the log streams, I noticed that someone could still reach it, because there was traffic. Then I tried other tenants, like the try-out site, and they were working fine. So why the heck wasn't the gateway taking in users for the PointBuzz forum?

On a hunch, I removed the domain name from the app, then added it back in. Within a minute or two, people could reach the site again. There aren't any obvious reasons for this, and I'm pretty sure it's not my fault. I opened a support ticket for it, but we'll see if I get any traction or explanation.

I knew things were going too well, but the problem I made wasn't that hard to find once I tested it locally.

