Very often you find that Google knows much more about your site than it is supposed to: it crawls pages with no backlinks (that you are aware of), or its index contains pages that never really existed.
This leads to plenty of speculation, theories and observations about what can be used for web page discovery; here are 15 of them (article inspired by a WebmasterWorld thread):
- “Dofollow” direct links (either external or internal) pointing to a page;
- “URL manipulation”, i.e. if site.com/?one-word exists, then perhaps so does site.com/?two-words;
- Links inside forms;
- Clicking a link in a browser with the Google Toolbar installed (or a PageRank indicator of some sort that sends every page you visit to Google);
- Entering the link in the Google search box and searching for it (you may be surprised to know how many people use Google to navigate the web instead of the browser's address bar);
- Other sites hotlinking to your images;
- Links in email that a search engine has access to (e.g. a link in Gmail);
- URLs within the metadata of graphics and video files;
- URLs within HTML comments;
- URLs within the head section or metadata of an HTML page, or within HTML attributes (alt, name, id, etc.);
- Links in Flash movies (games, quizzes, etc.);
- Non-linked URLs (http://www.domain.com);
- Links in documents other than web pages, e.g. .doc, .pdf, .txt (see the detailed experiment on Search engines and PDF);
- Links in other Google-produced software (gadgets, widgets);
- Advertising links (AdWords/Yahoo) and links in other services such as Maps.
Regarding links inside forms, Matt [Cutts] confirmed that such “links” pass PageRank: when Google finds something through form navigation, it establishes a virtual link on its back end and adds that virtual link to the webgraph.
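Several items in the list (non-linked URLs, URLs in HTML comments, URLs in meta tags and attributes) come down to the same thing: a crawler that scans the raw text of a page rather than only its anchor tags will see them. The Python sketch below illustrates that idea. It is not a description of how Google actually works; the regular expression is deliberately simple, and the function name `find_urls` and the sample paths under www.domain.com are made up for illustration.

```python
import re

# Deliberately simple pattern (an assumption for this sketch): an http(s) URL
# runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(raw_text):
    """Return unique URLs found anywhere in raw_text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(raw_text):
        # Drop trailing punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)")
        if url not in found:
            found.append(url)
    return found


# Hypothetical page: none of these URLs appears in an <a href="..."> link.
sample_html = """
<html>
<head><meta name="source" content="http://www.domain.com/from-meta"></head>
<body>
<!-- old version lives at http://www.domain.com/from-comment -->
<p>Visit http://www.domain.com/plain-text.</p>
<img src="/logo.png" alt="http://www.domain.com/from-alt">
</body>
</html>
"""

for url in find_urls(sample_html):
    print(url)
```

Running the sketch prints all four URLs even though the page contains no anchor tag at all, which is why keeping a URL out of your visible links does not by itself keep it hidden.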
Let’s watch this list grow with your additions!