Solving the Content Uniqueness and Integrity Problem in Social Media


Two problems can compromise content on socially driven media sites. The first, and more prevalent, is duplicate content. The second, less prevalent but more worrisome, is content integrity and spam.
Whenever breaking news is released, all the major news networks cover the same story, but each posts it on its own site under its own uniquely sensationalized headline. Socially driven sites such as Digg.com and Netscape.com decide whether a submission is unique or a duplicate by checking the URL and story title, so the same story from different sites can pass through the system undetected.
Additionally, because of URL structures that are unfriendly to these duplicate submission prevention mechanisms, even the same article from the same site can be posted to a community several times over. For example,

http://www.technologyreview.com/read_article.aspx?id=18097&ch=infotech
http://www.technologyreview.com/Infotech/18097/
http://www.technologyreview.com/Infotech/18097/page1/

All three URLs are unique, but they resolve to the same article.
Moreover, there are people who will intentionally alter URLs to re-submit content that was already submitted. Although the mechanisms used by socially driven media sites are continuously improving and evolving, they are currently in a relatively primitive state. For example,

You can also add extra slashes to the URL in any valid spot. The host you are linking to handles the request the same way, but the Digg dupe checker accepts it as a non-dupe.

Unfortunately, a ? at the end of a URL will always be a way to get around the dupe check, because the ? allows GET variables to be passed to the server. Some sites pass an ID to tell the server which page to bring up, so ?var=14 and ?var=15 will be treated as different pages. The downside for Digg dupe detection is that you can add any variable name/value to a URL without affecting what the site displays, because the server simply disregards any variable the page does not use.
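
To make the slash and query-string tricks concrete, here is a minimal sketch of URL normalization in Python. It is an illustration under stated assumptions, not how Digg's dupe checker actually works: `normalize_url` and its `keep_params` argument are hypothetical names, and deciding which query variables a given site really uses (here, only `id`) would have to be configured per site.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url, keep_params=None):
    """Return a canonical form of `url` for duplicate checks (hypothetical helper)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Collapse runs of slashes in the path and drop a trailing slash.
    path = re.sub(r"/{2,}", "/", parts.path).rstrip("/") or "/"
    # Parse the query string; a bare "?" yields no pairs at all.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if keep_params is not None:
        # Assumption: the caller knows which variables the site actually uses.
        pairs = [(k, v) for k, v in pairs if k in keep_params]
    query = urlencode(sorted(pairs))
    # Drop the fragment, which is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))

a = normalize_url("http://www.technologyreview.com/Infotech/18097/")
b = normalize_url("http://www.technologyreview.com//Infotech///18097/?")
c = normalize_url(
    "http://www.technologyreview.com/read_article.aspx?id=18097&ch=infotech&junk=1",
    keep_params={"id"},
)
print(a == b)  # the extra slashes and the bare "?" no longer matter
print(c)
```

Note that normalization alone cannot tell that /read_article.aspx?id=18097 and /Infotech/18097/ are the same article; that takes a look at the content itself.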

Before I talk about the solution, let me briefly touch upon the other problem.
Imagine the following scenario. A person writes the perfect Digg-bait, the story gets promoted to the home page of Digg, and it subsequently goes out on Digg's home page RSS feed. Once it is out on the RSS feed (and of course can't be pulled back), the site owner removes the content and replaces it with an ad farm. This is a bait-and-switch tactic that spammers can easily use to abuse large-scale social systems that have millions of RSS readers. Community moderation ensures that such content will be noticed, reported as spam, and removed from the site's home page, but no comparable mechanism prevents unsuspecting RSS subscribers from being baited by the spammer.
To solve both problems at the same time, socially driven news and content sites could take advantage of content-scraping mechanisms. For duplicate submission prevention, they could first look at the story URL, title, and summary. If a new submission bears even the slightest resemblance to an existing one, they could then scrape the body content from the two linked pages and compare it. If the two bodies are more than 25% similar, the second user should be prevented from making the submission. Furthermore, once a story is promoted to the home page, the linked site should be checked every 15 minutes or so to make sure that the content on the other end has not been significantly altered.
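As a rough illustration of the comparison step, the sketch below measures how much two article bodies overlap using word shingles and Jaccard similarity. This is one possible measure that I am swapping in for the sake of example; the function names are hypothetical, the 25% threshold is the figure suggested above, and the scraping itself (fetching the page and extracting the body text) is assumed to have already happened.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences ("shingles") in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(new_body, existing_body, threshold=0.25):
    """True if the new submission overlaps the existing one by more than `threshold`."""
    return similarity(new_body, existing_body) > threshold

first = "the quick brown fox jumps over the lazy dog"
second = "the quick brown fox leaps over the lazy dog"
print(similarity(first, second))    # 4 shared shingles out of 10 distinct ones
print(is_duplicate(first, second))
```

A real system would want to strip navigation, ads, and comments before comparing, since shared page furniture can inflate the score for unrelated stories on the same site.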
These two steps (scraping and comparing) will make sure that the community is not plagued by duplicate submissions (so the original submission is promoted as efficiently and promptly as possible), and furthermore, that the promoted story is not significantly altered for spamming purposes. As far as implementing this method is concerned, given the tools widely available, it should be incredibly easy.
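The re-checking of promoted stories could look something like the following sketch. It assumes the site stores a text snapshot of each story at promotion time and is handed a `fetch_text` function that returns the page's current text; both are hypothetical, as is the `min_ratio` cutoff of 0.5, which would need tuning so that ordinary edits and corrections do not trigger false alarms.

```python
from difflib import SequenceMatcher

RECHECK_SECONDS = 15 * 60  # "every 15 minutes or so"

def content_changed(snapshot, current, min_ratio=0.5):
    """True if `current` has drifted far from the snapshot taken at promotion."""
    matcher = SequenceMatcher(None, snapshot.split(), current.split(), autojunk=False)
    return matcher.ratio() < min_ratio

def recheck(story, fetch_text):
    """Re-fetch one promoted story and report whether it should be flagged."""
    current = fetch_text(story["url"])
    return content_changed(story["snapshot"], current)

story = {
    "url": "http://example.com/article",
    "snapshot": "ten tips for faster web pages and happier readers",
}
# Stand-ins for a real scraper: one unchanged page, one swapped for ads.
print(recheck(story, lambda url: story["snapshot"]))
print(recheck(story, lambda url: "buy cheap pills now click here for ads"))
```

A scheduler would call `recheck` for each home page story every `RECHECK_SECONDS` and pull flagged stories from the home page and the feed for review.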
