If you know SQL, you can skip this paragraph. If you're not familiar with SQL databases, this is all you need to know to understand: SQL databases are comprised of many tables. I (the programmer) define what columns are in each of the tables. However, the actions of the user will create, update, or delete rows in this table. The set of tables and their columns are called "the schema". The rows of the table are called "entries". For example, on this website there is a table called "users" and each row represents one registered user. When you registered, a row was added. There are columns for your username, email address, password, etc. Each time you log in, there's a column that indicates the last time you used the website which gets updated. This is how dynamic websites generally work.
How the Q&A used to be
For the Q&A on TwoCans, initially there were only two tables. "questions" and "answers". Each question has a unique ID#. The questions table would of course have the question ID#, the user that asked it, and the text of the question itself. The answers table would have the question ID# it was an answer to, who answered it, a 1 or 0 depending on if the user had read the answer, and of course the text of the answer.
This was all fine and dandy at first when there were only a few thousand questions and answers. However a few problems arose once this number got closer to a million.
When you answer questions, you don't want to see questions you've already answered. This required me to check not only the questions table, but also the answers table to see if you have any answers for a particular question. When there are a million answers, this is a very slow operation.
When you pick a random item in a table, SQL does this by selecting the contents of the WHOLE table, ordering it randomly, and then taking the first one. Shuffling a whole table to get 1 entry is a very slow operation.
Some people's ancient questions were still getting answers while others were not getting answers to their new questions.
At this point the page load time had slowed to about 2 seconds. Additionally, the database would crash and need to be rebooted every few minutes. I had to take the site down and rewrite it in a panic.
The New (Current) System
Working off a giant data set is something that should be avoided whenever possible. Therefore, there are now several tables that allow me to creatively work around doing things directly to the questions/answers table.
The Active Pile
There is a table called "active_questions". This table contains all the questions that have been asked or re-activated in the last 24 hours. This table usually contains around 600 entries at any given moment, last I checked. This table contains just the question ID#'s of active questions and a few other bits of interesting information such as the number of answers the question has, the user who asked it, the language it was asked in, and the time it was last activated or asked. This entry is deleted once it is more than 24 hours old. However the question itself is not deleted.
When you load the Answer page, the question you receive is coming from this table. The text of the question still comes from the questions table, however the full questions table is indexed against the question ID# so looking that up is not a big deal in terms of page load time. If you understand SQL, you know indexes are magical. If you do not understand SQL, then understand this: indexes are magical.
However, selecting a question randomly and checking to see if you already answered it is still a bit slow. So there are a few more optimizations...
The Question Plate
As I said earlier, the way SQL picks a random row out of a table is to shuffle the order of a whole table and then pick the first row. This is stupid slow for 1 row. It also seems silly if I need to do this many times one after another. So there is a table called "question_plate". What I do instead is just get the last 100 questions in the active_questions table and return them. No random ordering. Once the rows are received, they are randomly ordered by PHP and are inserted into the question_plate table along with your user ID# attached to them. This is your personal queue of randomly ordered questions. You may notice the page load is a bit slow once in a while, but just slightly.
After this point, each time you load the page, the next question in your queue is pulled out and fetched from the questions table. This is a nearly instantaneous operation (because of the index on the question ID#). Your queue is cleared and restocked after every few minutes or when you run out of questions in your queue. Those page loads are slightly slower than the others, but not really noticeable.
The Question Acknowledgement
The last bit of information is in a table called question_ack. This table contains a list of question ID#'s paired with user ID#'s. This list indicates the questions that you have "acknowledged". This means you have either answered the question or pressed the "Skip Forever" button.
When a question is pulled from the active_questions table to put in your queue, it is checked against the question_ack table. If an entry exists that corresponds to you and that question, it is ignored. Also, once an entry is pulled out of the queue, it is checked again. This prevents bugs such as if you have to browsers open and are answering questions in both. Imagine if you left your browser open on question #123456 and you opened a new browser and after a while your queue was restocked. Then when you got to #123456 in your second browser and answered it, then go back and answer it in the old browser, that would be a problem because you've answered the same question twice. This prevents that.
That's pretty much it. Now you have instant page loads instead of 5-second-long page load times. I hope I didn't put you to sleep too much.
User Comments: 27
Huh. Cool, I always wondered how it worked.
Must reread this when it's not midnight.
got to the third paragraph and my brain exploded
Cool stuff. I'm learning SQL soon in my DBMS class.
Cool stuff. I'm learning SQL soon in my DBMS class.
I accidentally misread that as "BDSM class". Now I feel foolish.
Edit: It's worth noting that I have dyslexia.
Best class ever.
I was wondering.
And thanks, Blake. Pretty nifty.
That's pretty cool. Thanks.
I still don't get this whole programming thing. My dad is a programmer. I think I'll ask him.
^So is mine. Maybe he'd explain this for me.
So, is there a way to get to the older questions that remain active? I find that once you burn through the pool of fresh questions, they start to get very repetitive very quickly. Soon, you end up with about 20-30 questions that seem to just churn over. I'd love to see a link to a questions_old table. I would be understanding that this pool would take significantly longer to load based on the pool size.
I like what kind of people post in the comments on this post, awesome people.
This is absolutely brilliant!
I was thinking about this while in school, this website is taking over my life. @_@
I was like that last year, but j visited here and IRC so often that they blocked both.
I only check my messages in school during lunch.
Great Site. I hope I could create like this site too, but I am still thinking for a different useful purpose. I am an IT student 3rd yr. college and studied web programming. All I want is a free web hosting as .com on the internet, Internet offers a free but they have 2 dots in an address, one dot you should pay.
You're gonna have to pay for a .com address, at least without a subdomain. I know that .tk TLDs are free though.
I still don't get it... But I guess that doesn't matter.
Huh, that's surprisingly simple. Very cool and effective though. I wondered why you needed to shuffle the entries anyway; particularly if you're just taking the same initial 100.
26 users online right now 0 are asking questions, 2 are answering questions, 9 are checking their messages, 2 are on the forum, 13 are doing something else 51833 users registered, 11526452 messages sent The current time is 13 Vigeo 13:4 - 0.65.23what is this?