I've been letting the new forum version simmer on CoasterBuzz for a while before releasing it to the hosted version (which powers PointBuzz). Now that it's in both places, I did have to make a big fix to how notifications work, specifically to queue the "new reply" notifications, since there can be many for any given topic. I got that fix in, and now I'm watching it. I'm seeing some interesting things.
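In rough sketch form, the queuing idea is an in-process channel with a background consumer, so deliveries happen off the request path. The names here (`ReplyNotification`, `NotificationQueue`, `SendAsync`) are simplified for illustration, not the actual code:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical payload for a "new reply" notification.
public record ReplyNotification(int TopicId, int PostId, int UserId);

public class NotificationQueue
{
    // Unbounded channel: request threads enqueue instantly, and a single
    // background consumer drains the queue at its own pace.
    private readonly Channel<ReplyNotification> _channel =
        Channel.CreateUnbounded<ReplyNotification>();

    public void Enqueue(ReplyNotification notification) =>
        _channel.Writer.TryWrite(notification);

    public IAsyncEnumerable<ReplyNotification> DequeueAll(CancellationToken ct) =>
        _channel.Reader.ReadAllAsync(ct);
}

public class NotificationWorker : BackgroundService
{
    private readonly NotificationQueue _queue;

    public NotificationWorker(NotificationQueue queue) => _queue = queue;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Deliveries happen here, so a topic with hundreds of followers
        // doesn't fan out synchronously on every new post.
        await foreach (var notification in _queue.DequeueAll(stoppingToken))
            await SendAsync(notification); // hypothetical delivery method
    }

    private Task SendAsync(ReplyNotification notification) => Task.CompletedTask;
}
```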
Before that, an update on the outage that I had. I was able to open up a support ticket, and the guy handling the case was able to reproduce the problem, where one of the domain names becomes unbound from the SSL certificate, so the gateway throws 502s. That's reassuring, because again it means that the outage wasn't self-inflicted. He's had to get the product team involved, because it sounds like there may be a bug in the infrastructure. Neat!
The first thing that I noticed was that I very suddenly started seeing waves of errors, maybe a hundred at a time, of SQL timeouts caused by running out of connections in the connection pool. That seemed weird until I realized that the errors were in fact being logged in the database, and that these waves came within a window of about two seconds. I did have a logging bug that would cause the whole app to crash if it got into a loop trying to record the database failure in the failing database (duh), but after I fixed that, I was still seeing the waves. I quickly realized that most of the failures were from some kind of search bot from China, hitting a hundred URLs inside a half-second. That's definitely naughty behavior, but the app should be able to handle it. So the first thing I did was block the entire subnet from China, and I could see hundreds of requests being blocked.
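The fix for the logging loop was basically to make the logger give up instead of recursing into the same failing database. Stripped down to a sketch (the real code differs), the shape is:

```csharp
using System;
using System.Threading.Tasks;

public class ErrorLog
{
    private readonly Func<Exception, Task> _writeToDatabase; // the normal sink

    public ErrorLog(Func<Exception, Task> writeToDatabase) =>
        _writeToDatabase = writeToDatabase;

    public async Task LogAsync(Exception exception)
    {
        try
        {
            await _writeToDatabase(exception);
        }
        catch (Exception loggingFailure)
        {
            // If the database is the thing that's failing, writing the error
            // back to it just loops. Fall back to stderr and move on.
            Console.Error.WriteLine($"Log write failed: {loggingFailure.Message}");
            Console.Error.WriteLine($"Original error: {exception}");
        }
    }
}
```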
But still, the connection pools should be big enough to handle all of that. What I learned was that the database didn't even break a sweat, so the problem wasn't there. Instead, it turns out that the SQL client library manages 100 connections in the pool by default, and each in-flight request holds one for the duration of its queries, so a hundred-plus simultaneous requests can exhaust the pool and leave the stragglers to time out. I changed the pool to a maximum of 200 connections, and so far so good. It's a hard problem to spot in the wild, because under normal load I only see 5 to 10 open connections at a time. It would help if I had the app running across two or more nodes, but the rest of my apps can't run that way, mostly because of local caching. I have to revisit those.
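For reference, the default `Max Pool Size` in `Microsoft.Data.SqlClient` is 100, and raising it is just a connection string change. The server and credentials here are placeholders:

```csharp
using Microsoft.Data.SqlClient;

// Max Pool Size defaults to 100, so a burst of a hundred-plus concurrent
// requests can exhaust the pool and the next caller times out waiting for
// a free connection. Raising the ceiling is one connection string setting.
var connectionString =
    "Server=db.example.com;Database=Forums;User ID=app;Password=secret;" +
    "Max Pool Size=200;";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();
// Run queries here; disposing returns the connection to the pool, which is
// why only 5 to 10 stay open under normal load.
```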
The hosted thing does run across two nodes, and it's not recording any errors at all. I'm really surprised at how well it's getting on. It has been running that way for about two years, but with the new version doing all of this real-time websockets stuff, I wasn't sure what to expect. I still get periodic Redis cache failures, but I did refactor the code a little to give up sooner if it can't reach the cache, and that's working well for what amounts to transient failures.
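The "give up sooner" refactor amounts to short timeouts plus treating a cache failure as a miss and falling through to the database. Roughly what that looks like with StackExchange.Redis, with illustrative values rather than what I actually run:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Connect with short timeouts so a dead cache fails fast instead of
// hanging requests. Endpoint and values here are placeholders.
var options = new ConfigurationOptions
{
    EndPoints = { "cache.example.com:6379" },
    ConnectTimeout = 1000,      // ms to establish a connection
    SyncTimeout = 1000,         // ms for individual operations
    AbortOnConnectFail = false  // keep retrying the connection in the background
};
var redis = await ConnectionMultiplexer.ConnectAsync(options);
var db = redis.GetDatabase();

// Treat any cache failure as a miss instead of retrying.
async Task<string?> TryGetAsync(string key)
{
    try
    {
        return await db.StringGetAsync(key);
    }
    catch (Exception ex) when (ex is RedisConnectionException or RedisTimeoutException)
    {
        return null; // transient failure: act like the key wasn't cached
    }
}

var cached = await TryGetAsync("forums:topic:123"); // hypothetical key
```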
Meanwhile, I updated the documentation quite a bit, and the sample project is also using the newer bits on one branch. My biggest task now is to monitor and test some from-scratch installations to see if everything is generally working.
I'd really like to have a few paying customers for this thing, and I have some shower ideas that I need to try out.