Day 12 – TweetGrid Fights For Its Life

14 Jun, 20:30:29

Editor’s Note: Day 12 was Friday, June 12, 2009.

After a very productive previous day, I decided to sleep in just a little bit and then keep the momentum of my progress going. Unfortunately that plan went out the window the second I woke up. I awoke to messages in my inbox and on Twitter asking me “Why is TweetGrid down?” and “When do you plan to bring it back?”

TweetGrid has had some hosting issues before; it would be down for a few minutes or so and then things would go back to normal. About six weeks ago I got a phone call from GoDaddy saying that TweetGrid was using too many resources on its Shared Hosting server and it would be moved to its own server for one month while I was supposed to figure out how to fix it.  If the problem was not solved by then, my account would be suspended. Well, that’s great, but they didn’t tell me what the actual problem was, much less how to fix it. This left me utterly confused.

Some stats:

  • The entire TweetGrid site comes to about 24 MB of source code and images. I am allotted something like 150 GB of space on a Shared Hosting account, so I am using about 0.016% of my quota. So, that isn’t the problem.
  • My Shared Hosting account is allotted 1.5 TB (1,500 GB) of bandwidth usage each month. I have never even reached 50% bandwidth usage in any given month.  So, that isn’t the problem.

What’s going on then? After some prodding I was able to learn that for some reason TweetGrid was using 3 GB of RAM on the server and using a “disproportionately high” percentage of the server CPU cycles.

There really was nothing I could do to change anything about the situation (except for moving the site to another host, which can be an expensive and time-intensive process). The site had been running fine for over 6 months without incident. Growth was steady, but it wasn’t exponential. What was going on?

A month passed, and the site continued to run normally. I thought things had sorted themselves out and I had averted a crisis. Then about two weeks ago I got another phone call. The problem had not gone away, and it was time for me to either upgrade my account to a dedicated server (very, very expensive) or remove TweetGrid from my account. After reminding them of my disk usage and bandwidth stats, I argued that getting a dedicated server would be overkill, and that I couldn’t afford it. They got back to me and asked if I would mind trying their new Grid Hosting offering. The normal price is something like $19.99 per month (still about 4x more than I was paying for my Shared Hosting plan, but much less than a dedicated server), but while Grid Hosting is in Beta it would only be $4.99 per month. That sounded like a good deal to me, so I said sure, let’s try that.

That night they migrated the site over to their Grid Hosting servers. I really have no idea how their Grid Server implementation works, but I was told that I would have 100% of the grid CPU cycles (which seemed like a non-sequitur to me), and I wouldn’t have to worry about excessive CPU cycle usage anymore.  TweetGrid + GoDaddy Grid. Sounded like a perfect match. I logged into the administrative panel and saw that there were, in fact, several nodes running the site. This still didn’t exactly explain how it all works, but at least there was an actual grid I could look at.

Flash forward to Friday. TweetGrid was down. And it stayed down. I logged into the administrative panel at GoDaddy, and all of the nodes I had seen before had vanished. Not good. I called support.

“Hello, I’d like to know if you have any information about the state of my Grid Hosting account?”

“One moment please, let me call the Grid department and see what I can find…” Insert hold music for about 5 minutes. “Hello, sir. I was told that there is a known issue with the Grid and that they are working on it.”

“Any other details? ETA on a fix?”

“No, sir. It will be back as soon as possible.”

“Ok, thank you.”

At this point I was basically at the end of my rope with GoDaddy’s hosting of TweetGrid. Having played around with several server instances on the Rackspace Cloud, I decided to just host TweetGrid there and maintain the server administration myself. I spun up a server and quickly installed a LAMP stack on it. I transferred all of TweetGrid’s files to the new server from my computer (I have a complete mirror backup of the site at all times) and started testing the new site to make sure everything worked.

When I was satisfied with the new server’s functionality I was prepared to update the DNS entry to point all of the traffic there. As soon as I was about to hit the button I started seeing messages that “TweetGrid is back!” Sure enough, the site was alive again on the GoDaddy servers. I held off updating the DNS entry to see if things would actually settle down and become stable again. Hosting TweetGrid in the Rackspace Cloud would cost about $22.00 per month, but at least I would have total control over everything. I was happy to see the site come back to life, but I left the new server running as a hot stand-by just in case. That was a smart decision.

Twenty minutes later GoDaddy’s servers were down again. My head exploded. I clicked the “Update DNS” button so hard I think I broke my mouse. Seconds later I started seeing traffic pour into the new server and the site was live once again.

That was all well and good, but the most frustrating part came several minutes later when I had the ultimate epiphany.

The TweetGrid Widget gets several million pageviews per month. Yes, millions. I still cannot believe this number, but I have to admit it’s pretty cool to see it pop up around the internet. The widget is just one 20 KB JavaScript file, but it gets served over, and over, and over, and over… Since I host the file on TweetGrid’s server itself, I get the traffic and bandwidth hit from all of the other sites using it on their pages. This has pros and cons, but the biggest advantage of the other sites letting me host it is that they get upgrades to the widget automatically. I don’t have to tell anybody how to upgrade their widget or tell them to install the newest version. I just update the code on my site, and they get the newest version without having to think about it.

When it first launched, I hosted TweetGrid on my home computer. I had been hosting websites long enough to know that if you are going to have a high-traffic website that only serves a few files per page load, you are better off disabling HTTP KeepAlives and only responding to requests with the HTTP/1.0 protocol instead of HTTP/1.1. This means that when somebody loads a file (a page, an image, or the widget), the server closes the connection as soon as the response is sent (the HTTP/1.0 default) instead of leaving the connection open to listen for more requests (the HTTP/1.1 default). This can have a HUGE impact on web server and site performance. If you have a file that gets served over and over and over (the widget in my case), it’s better to turn KeepAlives off to avoid tying up precious server resources that could be better spent serving other requests.
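A rough way to see why this matters is Little's law: the average number of connections a server is holding equals the arrival rate times how long each connection is held. The sketch below is only a back-of-the-envelope model in Python, and it assumes each visitor opens one connection and requests one file. The function name `workers_in_use` and all three input numbers are hypothetical, chosen purely for illustration; they are not measurements from TweetGrid or GoDaddy.

```python
def workers_in_use(requests_per_second, service_time_s, keepalive_timeout_s=0.0):
    """Little's law: average connections held = arrival rate * time each is held.

    With KeepAlives on, a connection that only ever carries one request
    is still held for the idle timeout after the response is sent.
    """
    return requests_per_second * (service_time_s + keepalive_timeout_s)


# Hypothetical inputs, for illustration only.
rate = 2.0       # widget requests per second
service = 0.05   # seconds spent actually sending the file
timeout = 15.0   # seconds an idle keep-alive connection stays open

without_keepalive = workers_in_use(rate, service)
with_keepalive = workers_in_use(rate, service, timeout)

print(f"KeepAlive off: ~{without_keepalive:.1f} connections held on average")
print(f"KeepAlive on:  ~{with_keepalive:.1f} connections held on average")
```

With these made-up inputs the server holds roughly 0.1 connections on average with KeepAlives off and roughly 30 with them on, even though the amount of real work is identical. On a server that dedicates a process or thread to each open connection, that gap is what turns into memory use.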

When I set up my new cloud server to host TweetGrid, I disabled KeepAlives instinctively before flipping the switch. When traffic started to come in, I was very curious to see how much RAM would actually be eaten up by the amount of traffic as well as how much CPU was being used by the server, especially to compare with the GoDaddy stats.

The numbers were alarming:

  • Total amount of RAM used by the webserver processes: 77 MB
  • Total CPU utilization: < 1% (meaning, the CPU is over 99% idle).

I could actually host the site on a computer with 256 MB of RAM and a 500 MHz processor with no problem if I wanted.

That’s when it dawned on me. GoDaddy must be using KeepAlives. I never had a reason to know or care before now, so I never really checked. Everything had worked and it didn’t matter. Now I had to find out.

Sure enough, I did a raw socket connection to another one of my GoDaddy hosted sites. The response header came back with “HTTP/1.1” in it. Then I tried to force an HTTP/1.0 connection by specifying it explicitly in the GET request. The response still came back “HTTP/1.1”. That must have been the problem. With the frequency of connection requests for the widget, it must have been chewing up connection resources on their server and thus creating a memory and CPU hog on the shared server. I’m still not sure what the issue with their Grid was, but since it is in Beta I’ll chalk it up to early stage technical issues.
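For anyone who wants to repeat this kind of check, here is a minimal sketch of a raw socket probe in Python. It is not the exact test described above; the function names (`build_request`, `parse_response_head`, `probe`) are made up for this example. It also reports the `Connection` response header, since a server may answer with its own protocol version in the status line regardless of what the client asked for, so the header is a second signal worth looking at.

```python
import socket


def build_request(host, path="/", version="1.0"):
    """Build a minimal raw GET request for the given HTTP version."""
    lines = [f"GET {path} HTTP/{version}", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Return (protocol, headers) parsed from the raw bytes of a response."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    status_line, *header_lines = head.split("\r\n")
    protocol = status_line.split(" ", 1)[0]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return protocol, headers


def probe(host, port=80, path="/", version="1.0", timeout=5.0):
    """Send one raw request and report how the server answered."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path, version))
        raw = b""
        # Read until the end of the response headers (or the server hangs up).
        while b"\r\n\r\n" not in raw:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    return parse_response_head(raw)


# Example (needs network access; the hostname is a placeholder):
# protocol, headers = probe("example.com", version="1.0")
# print(protocol, headers.get("connection"))
```

A reply that includes `Connection: close` means the server intends to drop the connection after this response; a `Connection: Keep-Alive` reply, or an HTTP/1.1 exchange with no `Connection` header at all, suggests it will be held open.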

I am going to see if there is a way to ask them to put me back on a shared server with KeepAlives disabled (since that is not an option anywhere on their hosting administration panel), but I am not very optimistic about their response.

Long story short, I basically burned all of Friday battling with this. However, this story has a silver lining. I was encouraged by the fact that it is pretty darn easy to get a cloud server up and running and that my Linux admin chops are good enough to get it done. This will most certainly come in handy when I am getting ready to launch other sites with Rackspace that will most definitely need more than 77 MB of RAM and 1% CPU to run the backend of the sites.

TweetGrid has been running in the Rackspace Cloud since Friday with no issues. I am not sure if it will live there permanently as that is a business cost/revenue decision I will have to make. For now, I am just happy to have the stability and happy users once again.


Filed under: Summary
Comments (5) Trackbacks (0)
  1. Robyn
    8:40 PM on June 14th, 2009

    I’ve had similar experience with Godaddy hosting. I and others I know never found them to be good as a host unless it’s a site no one goes to or uses. Good with domain names, though.

  2. Kristen, your sis
    12:33 AM on June 15th, 2009

    Nicely written! I understood about 58.2% of all that- but your explanations helped a lot. I figured you were freaking on Friday- but never doubted you had a backup plan. Where are some of your widgets located that I could see?

  3. Debbie Yost
    12:17 PM on June 15th, 2009

    Our company is recently dealing with some of these issues so more of this made sense than it might have a few weeks ago. We are currently sending our stuff to a local hosting company and getting away from GoDaddy. I’ve only been looking at hosting companies for a short time for my personal blog and the discussions we’ve had with work, but that’s the problem I’ve heard with GoDaddy, too much traffic. Glad you have a solution, if only for a short time. Good luck.

  4. James Hartig
    1:43 PM on June 15th, 2009

    Awesome report! Sorry about the problems. I have looked into the KeepAlives for my own server for iSociale but with the constant reconnections, it was not a good choice. If you have access to htaccess or anything similar you could disable it for a certain directory (the widget one). http://httpd.apache.org/docs/2.0/mod/core.html#keepalive

    Also, take a look at: http://httpd.apache.org/docs/2.0/env.html#special

  5. Lisa Spear
    11:32 AM on June 16th, 2009

    Could the switch over to the RackSpace Cloud be related to the Search API problem with Twitter?

    No, the Search API has problems all on its own independent of where TweetGrid is hosted. The API is usually pretty robust, but lately it has been bogged down with lots of traffic (an issue with their servers, not mine).
    -Chad
