
How Efficient is Digg?

You would imagine that, given the limited venture capital budgets of social media sites like Digg.com, they would scrutinize every cost-saving measure to ensure they are operating at maximum efficiency. A few recent posts on Digg suggest that this may not be true.
According to a post submitted by vicnick,

Digg.com is not gzipped. Page Size: 41 K, Size if Gzipped: 9 K, Potential Savings: 78.05% !!! That’s a potential bandwidth saving for Digg as well as the end-user.

Looking at the statistics from WhatsMyIP.com, I noted the following:
[Screenshot: WhatsMyIP.com gzip statistics for digg.com]
As the utility points out, Digg can save substantially (bandwidth-wise) by gzipping the content on the site, and even improve the site’s performance for people using dial-up modems.
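A minimal sketch of how you could reproduce that kind of measurement yourself (Python 3, standard library only; the URL is simply whichever page you want to test):

import gzip
import urllib.request

url = "http://digg.com/"  # any page you want to test
with urllib.request.urlopen(url) as response:
    raw = response.read()

compressed = gzip.compress(raw)
savings = 100 * (1 - len(compressed) / len(raw))
print(f"Page size: {len(raw) / 1024:.1f} K")
print(f"Size if gzipped: {len(compressed) / 1024:.1f} K")
print(f"Potential savings: {savings:.2f}%")
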
That, however, is not the complete story. While you may save on bandwidth, two problems may arise.
1. gzipping can be very CPU intensive.
2. AJAX doesn’t play well with mod_gzip.
The first problem means that you may ultimately not save much money from gzipping. Before you can determine the economically optimal decision, you have to weigh how much you will save in bandwidth-related costs against how much additional investment in hardware you will need for the extra processing power. The second problem means that you wouldn’t be able to use any AJAX-related enhancements on Digg (i.e. digg/spy).
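As a rough illustration of that trade-off, here is a back-of-envelope sketch; the pageview figure is Markus Frind’s estimate quoted below, and the bandwidth price is an assumed placeholder, not Digg’s actual cost:

# All figures below are assumptions for illustration, not Digg's actual costs.
PAGEVIEWS_PER_DAY = 7_000_000   # Markus Frind's estimate, quoted below
UNCOMPRESSED_KB = 41            # page size reported by the gzip test
COMPRESSED_KB = 9               # gzipped size reported by the gzip test
COST_PER_GB = 0.50              # assumed bandwidth price in $/GB (hypothetical)

saved_gb_per_month = PAGEVIEWS_PER_DAY * 30 * (UNCOMPRESSED_KB - COMPRESSED_KB) / 1024 / 1024
monthly_savings = saved_gb_per_month * COST_PER_GB
print(f"Bandwidth saved: {saved_gb_per_month:,.0f} GB/month")
print(f"Savings at ${COST_PER_GB}/GB: ${monthly_savings:,.0f}/month")
# Whatever the real numbers are, gzipping only pays off if this figure exceeds
# the cost of the extra CPU capacity needed for compression.
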
Furthermore, fkr points out what may be a potentially inefficient use of Digg’s 75 servers. According to Markus Frind’s calculations, Digg serves about 7 million pageviews a day, which comes to an average of 81 pageviews every second. Dividing that by the number of servers Digg has, he calculates that Digg is displaying roughly 1 pageview per server per second. He ultimately concludes that,

I think digg.com wins the worst infrastructure/setup award of any major site hands down. If their [myspace] infrastructure was as bad as digg.com’s they would need 18,750 servers!!!

I would love to hear some more technical input on this, preferably from the Digg team.
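For reference, the arithmetic behind Frind’s figures is straightforward (a quick check using his estimates rather than any measured data):

pageviews_per_day = 7_000_000
servers = 75

per_second = pageviews_per_day / (24 * 60 * 60)
per_server = per_second / servers
print(f"{per_second:.0f} pageviews/second overall")     # ~81
print(f"{per_server:.2f} pageviews/second per server")  # ~1.08
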
The last story that addresses potential inefficiencies on Digg comes from Oatmeal, and his offer to save Digg a million dollars with just 3 lines of code. Of course the title of the post is an exaggeration, but that is not to discount the point of the linked article. As pointed out by Matthew Inman, Digg is not using 301 redirects to enforce a preferred canonical URL, and consequently its search engine rankings, and the resulting traffic to the site, are suffering.

The difference in click-through rates for the top three [ranked search results in Google] versus 4-10 are incredibly substantial. Click-throughs from Google mean more visitors to Digg from a broader audience. This audience might be inclined to click on some of your ads, meaning more money in your pocket.

The three lines of code that would fix this are,

RewriteEngine On
RewriteCond %{HTTP_HOST} !^digg.com
RewriteRule ^/(.*) http://digg.com/$1 [R=301,L]


14 thoughts on “How Efficient is Digg?”

  1. Or even just using Google Webmaster Tools – you can set how the search bot should ‘search’ your pages w/ the www or w/o it.

  2. hey… Big problem with your numbers though…
    “According to Markus Frind’s calculations, Digg serves about 7 million page views a day, which comes to an average of 81 pageviews every second.”
    Well, so you are assuming that the digg traffic is constant 24 hours a day?
    And not… heavily biased towards early morning, or at least work hours?
    Sites like myspace would probably be a little more even 24 hours per day.

  3. Oh, and even if it were relatively constant throughout the day… that’s an average of 81 pageviews every second. I would assume there were peaks of at least 2-3 times that. However, I strongly believe that most of Digg’s traffic occurs at certain times of the day; the first 3-4 hours of the work day, depending on timezone, probably contain much of that day’s traffic. I don’t know many people that read Digg when not at work/on lunch…

  4. speedy… that means that during non-peak times, Digg has several servers doing LESS than 1 pageview a second.
    Even if the peak times give 5 times the average traffic, which isn’t likely, at least not for any sustained period of time, it is still a VERY inefficient use of 75 servers. And I am not even factoring in that with those 75 servers it still takes SEVERAL seconds for any given Digg page to load, and the site often gets overloaded to the point of having to display an error message.

  5. The CPU load of gzip is inconsequential compared to the bandwidth savings. If each server is handling one request per second, the few milliseconds of CPU time needed for the compression would just be a modest increase in CPU usage.
    Keep in mind, of course, that every page load might entail many other associated files. Javascript, stylesheets, XML, etc.
    Of course, the supposed problems with AJAX might make that a moot point. This is the first I’ve heard about the problem, though, and there is no inherent reason why AJAX and gzip wouldn’t play nice together. gzip compression can be done either by the server (via something like mod_gzip), or by the script itself (in PHP, a simple output filter can be added, or you can just gzip the output yourself).
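For anyone who wants to sanity-check the “few milliseconds” figure in the comment above, a minimal Python sketch (the payload is synthetic, sized only to match the reported 41 K page):

import gzip
import time

# Synthetic HTML-like payload, trimmed to roughly the reported 41 K page size.
payload = (b"<div class='story'>sample digg story markup</div>\n" * 830)[:41 * 1024]

start = time.perf_counter()
for _ in range(100):
    gzip.compress(payload)
elapsed_ms = (time.perf_counter() - start) * 1000 / 100
print(f"~{elapsed_ms:.2f} ms to gzip one 41 K page")
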

  6. I’ve been saying this for months. As soon as I read an article that Digg runs on 75 servers I was like WTF. I could cut that figure in half and save them… let’s say (30*$250/mo) $7,500 a month.. crap, it wouldn’t be worth the effort/worry about babysitting the servers. Go Digg!

  7. I’m not sure whether the inefficiency accusation is supposed to be a criticism of developers supposedly creating a system so horrible it requires 1 second to handle a simple page request, or, if not, of the engineers implementing it being idiots for having so many servers going to waste, or possibly both. However, all three variations are equally asinine:
    Let’s pretend that digg is a mission-critical application; I imagine it is to the people who run it.
    Chances are that ‘average’ digg traffic can be supported by 1 or 2 servers — on a news site, wouldn’t you consider it prudent to provision peak capacity of at least 100x mean traffic? From the above numbers it’s not hard to imagine digg getting 20mm views/hr, even if only for a few minutes a year, or ~6k msgs/s — less than 80x avg load.
    Presuming 50 out of 75 servers are dedicated web nodes, that works out to 120 msgs / server / second, or under 10ms per page request — almost exactly what I would’ve guessed.
    Aside from Amazon S3 there’s no such thing as on-demand cpu cycles that I’m aware of, and while I think S3 is cool I wouldn’t necessarily build my business around it. Thus, like digg, I sit around with loads of spare processing capacity going to ‘waste’, except for the fact that I need it at peak load, which perhaps surprisingly happens to coincide with peak revenue. — Expense Justified — QED.
    The gzipping issue is an accountant’s call of bandwidth cost vs. the cost of more servers to handle peak load. Given that bandwidth at a tier-1 hosting or colo facility IS effectively on-demand, and servers and rackspace are not, it’s likely that gzipping is not such a hot idea.
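A quick check of the peak-load arithmetic in that comment (the 20 million views/hour figure and the 50-web-node split are the commenter’s hypotheticals, not measured Digg traffic):

peak_views_per_hour = 20_000_000   # commenter's hypothetical peak
web_nodes = 50                     # commenter's assumed share of the 75 servers

peak_per_second = peak_views_per_hour / 3600   # ~5,600 requests/s
per_node = peak_per_second / web_nodes         # ~110 requests/s per node
ms_per_request = 1000 / per_node               # ~9 ms of server time per request
print(f"{peak_per_second:,.0f} req/s total, {per_node:.0f} req/s per node, "
      f"{ms_per_request:.1f} ms per request")
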

  8. There is also the fact that digg.com is serving up excess amounts of JavaScript (i.e. the entire script.aculo.us library when it only uses parts of it) and it doesn’t compress it in any way (e.g. remove comments / whitespace / newlines, etc.).

  9. I don’t think bandwidth is really that much of a consideration, especially for a text-based site. It’s not like they’re trying to keep under a certain monthly limit on some cheap web hosting plan. It’s certainly unlimited and measured in Mbps. It’s unlikely that gzipping it would really make a difference here.

  10. When asked if Digg traffic might be “heavily biased towards early morning” – I wonder what “early morning” is?
    A presentation by a Yahoo engineer some years ago pointed out that part of their traffic load balancing came from the simple fact that not the entire world wakes up at the same time.
    An early morning in New York is my mid-afternoon.

  11. “All the traffic charts for major sites look similar because of timezones and 50% of use occurs at work and another 50% after work. Average pageviews are 1/2 of peak.”

    Really? 1/2 of peak?
    Digg traffic never more than doubles when there is a particularly interesting story which is linked to by other sites? Or is it just that digg traffic volume is so huge that even the additional traffic from a slashdot reference is so minuscule as to be statistically unnoticeable?
    Isn’t there a difference between more global sites like google/yahoo/amazon and, AFAIK, English-only news sites that serve an audience of which perhaps 50% resides in one of two US coastal timezones, with maybe 45% of the rest scattered in the UK, 1 or 2 Australian/Pacific timezones and the central US?
    Even if 50% of use occurs during work and 50% occurs after, aren’t ‘work’ and ‘after-work’ 2/3 of a whole including sleep?
    I would be surprised if Digg’s traffic was the same at 4am/1am EST/PST as it is during work, or at least waking, hours on the US coasts.
    Tell me the same thing about an international search and/or shopping giant like google or amazon — not so surprising.

  12. “When asked if Digg traffic might be ‘heavily biased towards early morning’ – I wonder what ‘early morning’ is?
    A presentation by a Yahoo engineer some years ago pointed out that part of their traffic load balancing came from the simple fact that not the entire world wakes up at the same time.
    An early morning in New York is my mid-afternoon.”
    Yes, but I would hazard a guess that a fair majority of Digg users are in the States, meaning that the period of ‘morning’ spans something like 9am EST to 11am PST, which is still just a small fraction of the day…