SEOs have been discussing how Google may treat page elements differently depending on where they sit on the page. Part of that process is identifying so-called “boilerplate”: finding and analyzing “repeated non-content across [web] pages”.
Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.
Boilerplate is classified as follows:
- (Sitewide) global navigation (home, about us, etc.)
- Certain spatial areas of the page, especially if they include links (blogroll, navbar)
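To make the “repeated non-content across pages” idea concrete, here is a minimal sketch of one way repeated blocks could be spotted. It is not how Google does it. It assumes each page has already been split into text blocks, and the names `find_boilerplate` and `strip_boilerplate`, the 0.8 threshold, and the sample pages are all made up for illustration.

```python
from collections import Counter


def find_boilerplate(pages, threshold=0.8):
    """Return the text blocks that appear on at least `threshold` of the pages.

    `pages` is a list of pages, each page being a list of text blocks.
    The threshold is an arbitrary illustrative value, not a known constant.
    """
    if not pages:
        return set()
    counts = Counter()
    for blocks in pages:
        # Count each block once per page, however often it repeats on that page.
        counts.update(set(blocks))
    return {block for block, n in counts.items() if n / len(pages) >= threshold}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks of one page that were not flagged as boilerplate."""
    return [block for block in blocks if block not in boilerplate]


# Hypothetical three-page site: navigation and blogroll repeat on every page.
pages = [
    ["Home | About us | Contact", "Review of blue widgets", "Blogroll: site A, site B"],
    ["Home | About us | Contact", "How to repair a widget", "Blogroll: site A, site B"],
    ["Home | About us | Contact", "Widget history", "Blogroll: site A, site B"],
]

boilerplate = find_boilerplate(pages)
main_content = [strip_boilerplate(page, boilerplate) for page in pages]
```

In this toy example the navigation bar and the blogroll are flagged because they appear on every page, while each page's unique text survives. A real system would also need to cope with near-duplicates and with the position of a block on the page, which this sketch ignores.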
How Google may be treating boilerplate:
- Ignore it completely (e.g. never waste time following each link on each page, and never store pages like “Contact us” in the index unless the page contains some valuable data, such as the physical address of the business);
- Index the links within the boilerplate, understanding that they are repeated links, and use them for better or for worse (e.g. combined with other factors, those links may be identified as paid);
- Identify the boilerplate and use it to understand the overall structure of the site (e.g. identifying duplicate content issues, adjusting PageRank (PR) flow, etc.)
Related discussions on boilerplate and how it can impact the algorithm:
I think it goes back to what M*tt C*tts said a long time ago. If we penalize all the websites out there that don’t have W3C compliant code, we’d lose 40% of the Internet. Translation: Google knows about crappy coding, duplicate content, and boilerplate phrases and they “TRY” to adjust to it, but don’t let it affect their results. They may in fact put out what they prefer to see on websites in the form of doctrine, but if they act on that doctrine they lose valuable results searchers need.
And soon to come HTML5 with new named markup – article, section, header, footer, nav… html markup that specifies content might just be what the SEs ordered
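As a rough illustration of the point about HTML5 markup, the sketch below shows how named elements would make the boilerplate areas trivial to set aside. It uses Python's standard `html.parser` module. The class name `MainTextExtractor`, the choice of which tags to skip, and the sample page are assumptions for the example, and nothing here reflects what any search engine actually does.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this sketch (taken from the list in the comment above).
BOILERPLATE_TAGS = {"header", "footer", "nav"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not inside a header, footer, or nav element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of boilerplate elements currently open
        self.chunks = []  # text found outside boilerplate elements

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text chunks of a page that sit outside boilerplate elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page using the HTML5 elements named above.
sample = """
<header><h1>Widget World</h1></header>
<nav><a href="/">Home</a> <a href="/about">About us</a></nav>
<article><h2>Blue widgets</h2><p>Blue widgets are sturdy.</p></article>
<footer>Terms and conditions apply.</footer>
"""

main_text = extract_main_text(sample)
```

Only the text inside the article element is kept here. Note that this simple version would also drop a header placed inside an article, so it is a starting point for thinking about the idea, not a finished filter.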
The first time I noticed “boilerplate” issues, though I didn’t know what to call them then, was during the horrendous Florida update of November 2003. At that time I suspected it with on-page text, and it was the first time I suspected that excessive use of keywords in internal anchor text could be a problem. After all the years since then, and now that others have been seeing the effect of navigation repetition issues, it kind of figures, since this patent was applied for soon after that.
…boilerplate text is part of the public record. Excluding those embedded terms and conditions that are placed in footer text misrepresents the content of the page. If a search engine is claiming to make the Web searchable, then morally it MUST provide some means for searching boilerplate content even if that requires the user to stipulate bypassing a filter.
Boilerplate text is not any less relevant or important to a user’s query simply because people are tired of seeing it. Boilerplate text helps you identify which site you’re actually looking at, and that is always very important to know.
- Yahoo: Page-level Template Detection via Isotonic Smoothing (pdf)
- Google: Methods and apparatus for estimating similarity