Fun with Words. WordCount and QueryCount

I came across a very interesting online tool while doing research for my post about proper title capitalization. The tool is about the real-life use of the English language and is called WordCount.

WordCount™ is an artistic experiment in the way we use language. It presents the 86,800 most frequently used English words, ranked in order of commonness. Each word is scaled to reflect its frequency relative to the words that precede and follow it, giving a visual barometer of relevance. The larger the word, the more we use it. The smaller the word, the more uncommon it is.

WordCount™ data currently comes from the British National Corpus®, a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent an accurate cross-section of current English usage. WordCount includes all words that occur at least twice in the BNC®. In the future, WordCount™ will be modified to track word usage within any desired text, website, and eventually the entire Internet.
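For the curious, the ranking idea described above can be sketched in a few lines of Python. This is only a rough illustration and not how WordCount itself is built: the function name `rank_words`, the simple tokenizing pattern, and the sample sentence are all assumptions made for the example. It counts words, keeps those that occur at least twice (the same cutoff WordCount applies to the BNC), and ranks them by frequency.

```python
import re
from collections import Counter


def rank_words(text, min_count=2):
    """Return (rank, word, count) tuples, most frequent word first.

    Hypothetical helper for illustration only; not WordCount's actual code.
    """
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Keep only words that occur at least min_count times.
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Sort by descending count; break ties alphabetically for stable output.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(kept, start=1)]


# Made-up sample text, standing in for a real corpus.
sample = "the cat and the dog and the bird saw the cat"
ranking = rank_words(sample)
for rank, word, count in ranking:
    print(rank, word, count)
```

With a real corpus in place of the sample sentence, looking at the neighbors of a word in this ranking is what produces sequences like the ones quoted further down.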

Here is a nice shot of what the results look like.

WordCount Small

About the look and purpose of the tool, the creator states the following.

WordCount™ was designed with a minimalist aesthetic, to let the information speak for itself. The interface is clean, basic and intuitive. The goal is for the user to feel embedded in the language, sifting through words like an archaeologist through sand, awaiting the unexpected find. Observing closely ranked words tells us a great deal about our culture. For instance, “God” is one word from “began”, two words from “start”, and six words from “war”. Another sequence is “america ensure oil opportunity”. Conspiracists unite! As ever, the more one explores, the more is revealed.

WordCount Conspiracy.
The author published a number of WordCount sequences that people discovered in the vast amount of ranking data and emailed to him. The results are quite funny and show that some people certainly have too much free time on their hands. Look for yourself. Conspiracists unite!

I like WordCount's spin-off tool, QueryCount, even more.

QueryCount shows the top words queried for at WordCount by users of the site. It is something like a mini-Google Zeitgeist, only a lot smaller, but certainly not censored like its big brother from Google 🙂

The results shown by QueryCount seem to be about right and more or less represent what I would expect as the top words used in queries on the major search engines as well.

QueryCount Small

I checked out the 2006 results at Google Zeitgeist, and my slightly modified picture below expresses what I thought when I looked at them.

Google Zeitgeist

The number one term, “bebo”, really did have many searches. Bebo is a social network like MySpace, Xanga or Yahoo! 360. Michael Birch and his wife founded it in 2005. It saw a huge increase in popularity and membership last year, and the site has already broken into the Alexa Top 100.

I have not checked it out myself yet, but I am intrigued.

The Soccer World Cup in Germany was big, of course. is another social network for video sharing. Radioblog is the music search engine “” at is the web music player powered by Adobe Flash and PHP used by the site.

I took the number two word “myspace” and the number six word “Wikipedia” of the top 10 list and hopped over to Google Trends to compare the search trends of the two words with the trends of the top two words of QueryCount.

Google Trends

The number one QueryCount word clearly beats the number two word of Google Zeitgeist.

It should have been number two in Google Zeitgeist instead. Number one, actually, because “bebo” did not have enough searches to beat the real #1 either.

I can to some degree understand why Google censors words like p**n in something like Google Zeitgeist, but the word Sex? C’mon. Show the world that Americans are not as prudish as people say they are. There is nothing wrong with sex. Without it, nobody would be around to talk about it.

It is interesting to see how uncensored tools like QueryCount show the real human US. We are what we are and there is no reason to be ashamed to be human. Make love and peace…


Carsten Cumbrowski
Owner of the uncensored Internet Marketing Resources Portal at

  • Marjory Meechan

    That was fun!

  • Raj Dash

    These guys have a couple of other cool site experiments, particularly visualizations. Though I don’t think they were corpus-based.

    I originally had been collecting and analyzing my incoming spam email using a simple Perl script, with the intent of building a Spam Corpus and comparing it to the Brown’s (English) Corpus (which if I’m not mistaken is not the same as the British National Corpus). But it’s been lying dormant as I got sick of processing spam every few days 😛

  • CarstenCumbrowski

    Hey Raj,

    Sounds like a project like this one, which sounds like a pretty effective approach against email spam to me.

  • Raj Dash

    @Carsten: thanks for the link. I had actually thought about applying my “web corpus” (or spam corpus, actually) against spam, but it’s a great deal of work. I haven’t studied language parsing a great deal, but my aborted Master’s in Comp Sci was for NQLs (Natural Query Languages). My feeling has always been that nothing short of progressive Neural Networks is going to be able to keep up with spam.

    That’s especially true now that spam content is being generated using advanced mathematical techniques including, I think, Bayesian techniques. So Bayesian filtering, as mentioned in that article, may not be as effective today as it was 5 years ago.

    Ahem. But I’m getting off topic.

  • Raj Dash

    Let me clarify: Neural Networks that are dedicated to learning language parsing. Then there’s probably also non-English spam as well.

  • CarstenCumbrowski


    I don’t know if you have read the full article yet, and also the one it refers to regarding the Bayesian methods used for the filtering, but the crucial thing is that the corpus is created and refined for each mailbox individually. What is spam for one person is not spam for another. Take for for example.

    I have one mailbox that is used for general affiliate marketing communication that gets newsletters and offers from hundreds of different merchants.

    If a normal person's corpus were used for that inbox, the result would be 90% false positives 🙂

    I did not study mathematics and am still trying to get my head around the Bayesian stuff.

    I have no clue about neural networks. You might want to write about it, like this guy at ReveNews did for Bayesian inference in search marketing. That would be interesting and it is obviously a subject that you enjoy.

  • Raj Dash

    Carsten: Ah, that’s true. I’ve heard people refer to anything they didn’t ask for as spam, including something from a friend that’s “garbage”. My old method would have been too brute-force. However, any spam filtering method that is truly adaptive (to both user needs and to spam becoming more sophisticated) would theoretically work.

    Now, to tie my comments all back to your original post, it’d be interesting to see some visualization experiment for spam word frequency. If I can find my raw spam data and come up with something, I’ll post it. I wouldn’t be surprised, though, if there’s a long-tail of word frequency count even for spam.

  • CarstenCumbrowski

    Ask Paul Graham, the guy I referred to two comments earlier. He must have some good data, and I am sure that he would give you some of it if you tell him what you want to do with it and prove to him that you are really into the stuff for the right reasons and not a spammer yourself, hehe.

  • Raj Dash

    Good idea. I’ll have to fit it amongst my million other unfinished web projects 🙂