
Solving the Content Uniqueness and Integrity Problem in Social Media

There are two problems that can compromise content on socially driven media sites. The first, and more prevalent, is the problem of duplicate content. The second, less prevalent but more worrisome, is the problem of content integrity and spam.
Whenever breaking news is released, all the major news networks cover the same story, but each posts it on its own site under its own uniquely sensationalized headline. Because of the way the duplicate-detection mechanisms work at socially driven sites such as Digg.com and Netscape.com (checking the URL and story title to determine whether a submission is unique or a duplicate), the same story from different sites can pass through the system undetected.
Additionally, because of URL structures that are unfriendly to these duplicate-submission prevention mechanisms, even the same article from the same site can be posted to a community several times over. For example:

http://www.technologyreview.com/read_article.aspx?id=18097&ch=infotech

http://www.technologyreview.com/Infotech/18097/

http://www.technologyreview.com/Infotech/18097/page1/

All three of these URLs are unique, but they resolve to the same article.
Moreover, there are people who will intentionally alter URLs to re-submit content that has already been submitted. Although the mechanisms used by socially driven media sites are continuously improving and evolving, they are currently in a relatively primitive state. For example:

You can also add extra slashes to the URL in any valid spot; the site you’re linking to handles the request exactly the same, but the Digg dupe checker accepts it as a non-dupe.

Unfortunately, a ? at the end of a URL will always be a way to get around a dupe check, because the ? allows GET variables to be passed along with the request. Some sites pass an ID to tell the server which page to bring up, so ?var=14 and ?var=15 will be treated as different pages. The downside for Digg dupes is that you can add any variable name/value to a URL and it will not affect the site’s display, because the server simply disregards any variable it does not use on the page.
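One way socially driven sites could blunt these tricks is to canonicalize every submitted URL before running the dupe check. The sketch below is purely illustrative: the normalization rules (lower-casing the host, collapsing repeated slashes, dropping the fragment, and keeping only query parameters on a hypothetical per-site whitelist) are my assumptions, not how Digg’s checker actually works.

```python
# Illustrative URL canonicalizer; the normalization rules and the per-site
# parameter whitelist are assumptions for the sketch, not any site's real logic.
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical whitelist of query parameters that actually select content.
MEANINGFUL_PARAMS = {
    "www.technologyreview.com": {"id", "ch"},
}

def canonicalize(url):
    """Normalize a URL so trivially altered copies compare as equal strings."""
    scheme, netloc, path, query, _fragment = urlsplit(url.strip())
    netloc = netloc.lower()
    path = re.sub(r"/+", "/", path).rstrip("/") or "/"   # collapse "//" and trailing "/"
    allowed = MEANINGFUL_PARAMS.get(netloc, set())
    params = sorted((k, v) for k, v in parse_qsl(query) if k in allowed)
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# Both of the following collapse to the same canonical string, so the second
# submission would be flagged as a duplicate of the first:
#   canonicalize("http://www.technologyreview.com/Infotech/18097//")
#   canonicalize("http://www.technologyreview.com/Infotech/18097/?junk=1")
```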

Before I talk about the solution, let me briefly touch upon the other problem.
Imagine the following scenario: a person writes the perfect Digg bait, the story gets promoted to the home page of Digg, and it subsequently gets sent out on Digg’s home-page RSS feed. Once it is out on the RSS feed (and of course can’t be pulled back), the site owner removes the content and replaces it with an ad farm. This is a type of bait-and-switch tactic that spammers can easily use to abuse large-scale social systems that have millions of RSS readers. While community moderation ensures that such content will be noticed by the community and reported as spam, and subsequently removed from the site’s home page, there is no such mechanism to prevent uninformed RSS subscribers from being baited by the spammer.
So, to solve the two problems mentioned above at the same time, socially driven news and content sites could take advantage of content-scraping mechanisms. With respect to duplicate-submission prevention, they could look at the story URL, title, and summary first; if there is even the slightest resemblance between two submissions, they could then scrape the body content from the two sites and see whether it matches. If the content is more than 25% similar, prevent the second user from making the submission. Furthermore, once a story is promoted to the home page, the linked site should be checked every 15 minutes or so to make sure that the content on the other end has not been significantly altered.
These two steps (scraping and comparing) will ensure that the community is not plagued by duplicate submissions (and so the original submission is promoted as efficiently and as quickly as possible), and, furthermore, that a promoted story is not subsequently altered for spamming purposes. As far as implementing this method is concerned, given the tools widely available, it should be incredibly easy.
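As a rough illustration of how straightforward the scraping-and-comparing step could be with widely available tools, here is a minimal sketch in Python. The word-set (Jaccard) comparison, the 25% duplicate threshold, and the 50% “switched” threshold for the periodic re-check are assumptions chosen for the example, not a prescribed implementation.

```python
# A minimal sketch of the "scrape and compare" idea: fetch both pages, reduce
# them to word sets, and compare. The Jaccard measure and the thresholds are
# illustrative assumptions, not a prescription.
import re
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def page_words(url):
    """Fetch a page and return the set of lower-cased words in its visible text."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    parser = TextExtractor()
    parser.feed(html)
    return set(re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower()))


def jaccard(a, b):
    """Similarity of two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def is_duplicate(candidate_url, existing_url, threshold=0.25):
    """Reject the new submission if the two pages are more than 25% similar."""
    return jaccard(page_words(candidate_url), page_words(existing_url)) > threshold


def has_been_switched(url, snapshot_words, threshold=0.5):
    """Periodic integrity check: has the promoted page drifted far from the
    word-set snapshot taken at promotion time (e.g. swapped out for an ad farm)?"""
    return jaccard(page_words(url), snapshot_words) < threshold
```

A scheduled job could call has_been_switched every 15 minutes or so against the word-set snapshot stored when the story was promoted.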


4 thoughts on “Solving the Content Uniqueness and Integrity Problem in Social Media”

  1. I am more concerned with good news that never ranks in search engines simply because some SEO fellah hit Digg first with his super-titled linkbait that pulled in all the links.
    There is a lot of extremely good content, and plenty of smart people out here, who do not have friends (or fake accounts) to submit their news to the big self-promoting trash heap known as Digg.
    I do agree that social media sites require more self-policing and algorithm development. Take Wiki, for instance: because their system is primitive, they were forced to adopt the lame nofollow tag, when what they should be doing, just as you say Muhammad, is develop their own system.
    Good post man.

  2. I’m not sure how your ‘scrape and compare’ system actually solves the bait-and-switch problem. As you wrote, “community moderation ensures that such content will be noticed by the community and reported as spam, and subsequently removed from the site’s home page, there is no such mechanism to prevent uninformed RSS subscribers from being baited by the spammer.”
    So ‘scrape and compare’ serves the same function as community moderation, though I fail to see how it would do anything more to help RSS subscribers. It might be useful on a site with a less active community, but of course there’s also less incentive for spammers to target those less visible sites.
    It does seem like a viable solution to the problem of duplicate submissions, but I’d consider 25% similarity to be a poor test for duplicates. Two different short pages from the same site would probably have more than 25% of their text in common, when you consider navigation, footers, and sidebars that may be common across all pages on a site. You’d also exclude rebuttal posts that quote heavily from the post they’re responding to. You could set that “sameness threshold” at 90% and still catch most of the dupes.

  3. Kevin, I understand what you are saying. What they should do, in case a website is marked as spam or fails the ‘scrape and compare’ test, is keep the Digg page for that story but disable the outbound link from that submission (possibly with a comment indicating why this happened). This way, even though feed subscribers will land on the Digg page, they will not follow through to the spammer’s page.
    Furthermore, as far as the 25% threshold is concerned, I chose it to be that low because when CNN, Fox, and ABC cover a story, while the items are largely the same, the language may not be 90% identical (but may pass a much lower similarity test). Of course, this test can be limited to the body text, which would prevent it from taking navigation, footers, and sidebars into account. I don’t know enough to comment on whether or not it is possible to exclude quoted material (exclusion by block-quoting syntax, perhaps?).

  4. A few months ago, one guy generated around 27 million pages of all kinds of “high-ranking” content (medical terms, tech terms, etc.), pulled primarily from news sites. Google indexed these pages and directed a considerable amount of traffic to his site.
    I don’t know how much this guy made on advertising, but he clearly exploited one of the weak points of the current search/ad system.
    I think the only way to identify duplicates is to calculate some kind of multi-dimensional signature of the content. Duplicates will sit very close to each other in that space, which could make it possible to catch them (one possible sketch of such a signature follows below).
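Purely as an illustration of the kind of signature described in the last comment, here is a rough sketch of MinHash over word shingles; the shingle size, the number of hash functions, and the use of MD5 as the underlying hash are arbitrary assumptions, not the commenter’s implementation.

```python
# A rough sketch of one "multi-dimensional signature" approach (MinHash over
# word shingles); shingle size, hash count, and MD5 are illustrative choices.
import hashlib


def shingles(text, k=5):
    """k-word shingles of the text, lower-cased."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seeded hash function; similar texts share many."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig


def signature_similarity(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity
    of the two shingle sets, so near-duplicates score close to 1.0."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because the signatures are small and fixed-length, they can be stored for every submission and compared cheaply, which is what makes this kind of scheme attractive at the scale the commenter describes.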