I checked to see how CoasterBuzz's search index was going today, and was pretty shocked to see that it now as about 15 million rows. Can't say that I've personally written anything that generated that much data on its own before.
Once upon a time, POP Forums used SQL's FullText service to index posts. The frustration that came out of that black box was two-fold: It was grinding the disk and CPU, and frankly the results sucked. Long before I ever got v8 into a usable state, I prototyped the current search engine against data from the old version.
Building something like that is one of those exercises where you worry about all kinds of scalability issues instead of just trying to build something and refactor when you fail the first time. That's why it took me so long to just try something. Eventually I passed that brain block and wrote something, which predictably sucked. I kept refactoring until I had a workable solution.
The solution went like this. Find all of the words in a topic, toss out junk words like "the" and other things people won't search for, and score them based on frequency. Bonus points if it lived in the topic's title. Then save the words, along with their topic ID and score.
To search, simply find the topics that rank highest by averaging the rank for each word. There is an absolutely horrible query built ad-hoc in the data query that does it. It's partly ugly because it has to page the data, so there are some weird common table expressions being formed. It's so hard to read that I'm not even sure where I'd start to refactor it! But despite this, it works surprisingly well.
I think the one thing I'd tweak is the scoring, but aside from that the searching part works pretty well. I'm sure that I'm not the first to think of it. The joy comes from the fact that SQL Server is fast enough to get the work done. One of these days I'll see if I can get a guru to look harder at it and see how it can be made even faster.
No comments yet.