How Google May (Theoretically) Discover Web Pages

Very often you find that Google knows much more about your site than it is supposed to: you see it crawl pages with no backlinks (that you are aware of), or you find pages in its index that have never really existed.

This brings us to plenty of speculation, theories and observations about what can be used for web page discovery; here are 15 of them (article inspired by a WebmasterWorld thread):

  1. “Dofollow” direct links (either external or internal) pointing to a page;
  2. URL manipulation – i.e. if site.com/?one-word exists, then perhaps so does site.com/?two-words;
  3. Links inside forms:

     Matt [Cutts] confirmed that such “links” send PageRank. Google establishes a virtual link on their back end when they find something through form navigation and they add that virtual link to the webgraph.

  4. Clicking a link using a browser with the Google toolbar installed (or a PageRank indicator of some sort that sends every page you visit to Google);
  5. Typing the URL into the Google search box and searching for it (you may be surprised to know how many people use Google to navigate the web instead of the regular browser address bar);
  6. Other sites hotlinking to your images;
  7. Other sites linking to your JavaScript or CSS files;
  8. Links in email a search engine has access to (a link in Gmail);
  9. URLs within the metadata of graphics and video files;
  10. URLs within HTML comments, within the head section or metadata of an HTML page, or within HTML attributes (alt, name, id, etc.);
  11. Links in Flash movies (games, quizzes, etc.);
  12. Non-linked URLs (http://www.domain.com);
  13. Links in any documents other than web pages, e.g. .doc, .pdf, .txt, etc. – see the detailed experiment on search engines and PDF;
  14. Links in other Google-produced software (gadgets, widgets);
  15. Advertising links (AdWords/Yahoo), and other services like Maps.
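
To make a few of these ideas concrete, here is a minimal Python sketch of how any crawler could, in principle, pick up URLs from link attributes, HTML comments, other attributes and non-linked text. It is purely illustrative and says nothing about how Google actually works; the names `UrlCollector` and `discover_urls` and the loose URL pattern are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Loose pattern for absolute http(s) URLs; an assumption for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


class UrlCollector(HTMLParser):
    """Hypothetical helper: collects URLs from attributes, comments and text."""

    def __init__(self):
        super().__init__()
        self.found = {}  # url -> set of places where it was seen

    def _record(self, url, where):
        # Drop trailing punctuation that usually belongs to the sentence.
        self.found.setdefault(url.rstrip(".,;"), set()).add(where)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in ("href", "src", "action"):
                # Regular links, hotlinked files and form targets.
                self._record(value, f"{tag}[{name}]")
            else:
                # Any other attribute (alt, name, id, content, ...).
                for url in URL_RE.findall(value):
                    self._record(url, f"{tag}[{name}]")

    def handle_comment(self, data):
        for url in URL_RE.findall(data):
            self._record(url, "comment")

    def handle_data(self, data):
        # Non-linked URLs sitting in plain text.
        for url in URL_RE.findall(data):
            self._record(url, "text")


def discover_urls(html):
    """Return a dict mapping each URL found to the places it appeared in."""
    collector = UrlCollector()
    collector.feed(html)
    collector.close()
    return collector.found


if __name__ == "__main__":
    sample = """
    <a href="http://www.domain.com/linked">linked</a>
    <!-- old page: http://www.domain.com/hidden -->
    <p>Visit http://www.domain.com/plain for details.</p>
    <img src="http://www.domain.com/pic.png" alt="see http://www.domain.com/alt">
    """
    for url, places in sorted(discover_urls(sample).items()):
        print(url, sorted(places))
```

The point is simply that a URL does not need to be a clickable link to be machine-readable: anything that looks like an address can be harvested from the page source.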

Let’s watch this list grow with your additions!

Written By: Ann Smarty | My Blog Guest | @seosmarty

Ann Smarty is a blogger and marketer specializing in SEO consulting and guest blogging. Ann's expertise in blogging and tools serves as a base for her writing, tutorials and her guest blogging project, MyBlogGuest.com.

Comments

  1. Roland says:

    Good list. But I think you forgot an important one: the browser address bar of Google Chrome. The suggest feature collects URLs typed in.

  2. Good list. I was recently doing some CI work and found that an unlinked URL actually counted as a backlink for Google. Stunned.

  3. I think the links in Gmail having any value is the scariest…*puts on tinfoil hat*

  4. Add my voice to the roster of “good listers”. Wish I could think of something more profound to say.

  5. Franz says:

    Good list, will link to your articles in my growing collection of Internet marketing newbie tips. What I found when regularly checking backlinks via quirks search status is that no matter how I try to determine the pure backlinks of a certain client’s page, even if I state the complete URL in quotes, Google says there are 25,000+ backlinks, which on inspection is overblown about 150-fold. Yahoo, on the other hand, states too few, and MSN LiveSearch pretended not to have heard of the page although it is PR 3 and has existed since 2005. Highly unreliable altogether if you think you could just do a quick check on something!

  6. Roland says:

    I have another interesting one: the internal search engine of a website.

    I saw some evidence of that on our own site. If you search for (using google.nl) [ allinurl:search site:cdc.informaat.nl ] (without brackets), you see lots of indexed links that were found using the internal site search engine. Judging by the words used in the searches, they are taken from the titles and first paragraphs of pages.

    Of course this could also just be an example of the deep web search of Google.

    Also interesting is that some admin links are in the index. No idea how Google found these. Maybe they use knowledge of the CMS used to guess these links.

  7. andreas.wpv says:

    Awesome. I love no.2. Guess you take a thesaurus and run it on this, with different file endings. They might use a thesaurus built from similar web pages (related pages or Google wheel).

    Perhaps the trickiest way pages get added: you are using Gmail, you use the toolbar, and the web history tracks everything. Now you, as editor / author / webmaster, access a page which is hidden or has no inbound links, but you know it because you are working on it. Ping, Google has the link.

  8. I had an interesting experiment last year when I bought a new domain and deliberately didn’t do anything to promote the domain … Google found and trawled it incredibly quickly – http://tr.im/lnLz

    It got me thinking along these lines as to how Google found it so quickly.

    I registered the domain and got confirmation via my Gmail address (and also the hosting account), so they may have trawled it there.

    The other thoughts I had at the time were that they were mining the domain registry database when new entries were posted or (slightly more worrying) tapping the root name servers (there is one at NASA Ames in Mountain View that’s right on Google’s campus).

    I’ve also seen Googlebot hitting links that I tweet within seconds of the post going up.

    They are certainly doing their darndest to make sure they find everything, even if you might not be ready to reveal it to the world.

  9. Good list Ann

    Not sure if #15 (links in advertising) includes it, but an analysis of a new client’s site last week showed me that Google has indexed a slew of affiliate landing page links – that can only come from the affiliate banners they have out there all over the web.

    The client was quite surprised to find that such pages were indexed, something they had not wanted to occur.

  10. david coxon says:

    I’m not certain, but I would assume that if you have anything that auto-generates sitemaps from your site, that would pick up unlinked pages as well. I would also assume that Google mines data across all of their products, so if you email a test link to someone with a Gmail account, it will find it and take a look.

    I would also assume that people that don’t want pages found know to add a script to tell search engines to ignore the page and only remove it when you want the page to be public.

    This wouldn’t stop the search engine taking a look at an unpublished link, but would stop it returning that page as a result.
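
    Assuming the "script" above means a robots meta tag (name="robots" with a noindex directive), a quick way to check whether a page still carries it is sketched below in Python. The helper names `RobotsMetaParser` and `is_noindex` are made up for this illustration, and the sketch only looks at meta tags, not at HTTP headers or robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: gathers directives from robots meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # "robots" addresses all crawlers; "googlebot" addresses Google only.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html):
    """True if the page asks search engines not to show it in results."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in parser.directives or "none" in parser.directives


if __name__ == "__main__":
    draft = '<head><meta name="robots" content="noindex, nofollow"></head>'
    public = '<head><meta name="robots" content="index, follow"></head>'
    print(is_noindex(draft), is_noindex(public))
```

    Run against a draft page before launch, a check like this would flag a page that is still hidden from results, or one where the tag was forgotten.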