Defining (Blog) Content Theft



My posts today on SecurityRatty inspired a bit more debate than I expected. A number of commenters asked: if someone still links back to my site, how can I consider it theft? What makes it different from other content aggregators?

This is actually a big problem on many of the sites where I contribute content. From TidBITS to industry news sites, skimmers scrape the content, and often present it as their own. Some, like Ratty, aren’t as bad since they still link back. Others I never even see since they skip the linking process. I’ve been in discussions with other bloggers, analysts, and journalists where we all struggle with this issue. The good news is most of it is little more than an annoyance; my popularity is high enough now that people who search for my content will hit me on Google long before any of these other sites. But it’s still annoying.

Here’s my take on theft vs. legal use:

  1. Per my Creative Commons license, I allow non-commercial use of my content if it’s attributed back to me. By “non-commercial” I mean you don’t directly profit from the content. A security vendor linking into my posts and commenting on them is totally fine, since they aren’t using the content directly to profit. Reposting every single post I put up, with full content (as Ratty does), and placing advertising around it, is a violation. I purposely don’t sell advertising on this site; the closest I come is something like the SANS affiliate program, which is a partner organization that I think offers value to my readers.
  2. Thieves take entire posts (attributed or not) and do not contribute their own content. They leech off others. By contrast, if someone produces a feed with my headlines, and maybe a couple-line summary, and then links into the original posts, I consider that legitimate.
  3. Related to (2), search engines and feed aggregators are fine since they don’t repurpose the entire content. Technorati, Google, and others help people find my content, but they don’t host it. To get the full content people need to visit my site, or subscribe to my feed. Yes, they sell advertising, but not on my full content, for which readers need to visit my site.
  4. In some cases I may authorize a full representation of my content/feed, but it’s *my* decision. I do this with the Security Bloggers Network since it expands my reach, I have full access to readership statistics, and it’s content I like to be associated with.
  5. Many people use large chunks of my content on their sites, but they attribute back and use my content as something to blog about, thus contributing to the collective dialog. Thieves just scrape, and don’t contribute.
  6. Thieves steal content even when asked to cease and desist. I know 2 other bloggers who asked Ratty to drop them and he didn’t. I know one who did get dropped on request, but I only found that out after I put up my post (and knew the other requests were ignored). I didn’t ask myself, since others reported that their requests were ignored.

Thus thieves violate content licenses, take full content and not just snippets, ignore requests to stop, and don’t contribute to the community dialog/discussion. Attributed or not, it’s still theft (albeit slightly less evil than unattributed theft).

I’m not naive; I don’t expect the problem to ever go away. To be honest, if it does it means my content is no longer of value. But that doesn’t mean I don’t reserve the right to protect my content when I can. I’ve been posting nearly daily for 2 years, trying to put up a large volume of valuable content that helps people in their day-to-day jobs, not just comments on news stories. It’s one of the most difficult undertakings of my life, and even though I don’t directly generate revenue from advertising I get both personal satisfaction and other business benefits from having readers on my site, or reading my feed. To be blunt, my words feed my family.

The content is free, but I own my words — they are not in the public domain.

13 comments

  1. Daniel Philpott Jul 2

    This is in response to comments under the original post but seems more appropriate here.

    You are absolutely correct about your content. You own your words and have the right of ownership to distribute them under the license and terms of your choice. That you want your words distributed for others to read in no way negates your copyright of those words. That is the basis of the ‘open source’ content model. If another entity reuses it without permission and in violation of your license terms then you could call that theft.

    But is this a bad thing? Is it a problem to be overcome? Or is it an example of the network effect and a force to be harnessed to achieve your goals? I just love questions like that.

    Before I put forward my argument I want to be clear this is not meant to be antagonistic. My intention is to illuminate what I see as a logical inconsistency in opposing aggregators. I hadn’t heard of Securosis or SecurityRatty before today so I have no preset opinions on the value of either.

    You seem to have an interest in distributing your copyrighted content to a public audience. To this end the Securosis blog is public facing, the sitemap is registered with Google, an RSS feed is available, and reciprocal links (blogroll) and other customary web link mechanisms are in place. The prima facie evidence points to one of the blog’s goals being to communicate the blog’s content to a wide audience. Another goal, which was previously stated, is to retain copyright to that content and maintain distribution of the content under the terms of the license. Now let’s apply a test as to which of these goals is predominant. If the primary goal was to maintain control of the content, what would be the logical action? To restrict distribution to a limited audience with commensurate security measures to prevent redistribution. As this action is not in evidence, my conclusion is the communications goal is primary and the copyright goal is secondary.

    How is the primary goal of communicating to a wide audience achieved? There are a number of mechanisms and methods used here to increase the audience communicated to (these were mentioned previously). Reciprocal link relationships with similar content providers are in place. The blog is registered with search sites, both blog specific and general. An RSS feed is offered to ease the effort required to keep current with the blog. It is likely that word of mouth as a discovery method is also a factor. By using this mixture of methods and mechanisms to grow the blog’s audience and communicate to them, the achievement of the primary goal is pursued. Let’s focus on what is arguably the most important mechanism in this mix, the search engine.

    With modern search engines rank is achieved through an obscure calculation of link quantity and link quality. By targeting an aggregator and preventing reuse of the content (which links back to the original content) the connection to the aggregator’s nexus of links is removed. In a single instance this may or may not negatively affect the blog’s search engine rankings, but if this is repeated with other similar nexuses of links it will eventually lower the quantity of links that the aggregator and those who connect through the aggregator provide. This prevention of aggregator reuse supports the secondary goal.

    Likewise the search engines themselves take the blog’s content and use it to drive advertising revenue. An examination of the Google cache shows the whole of the blog’s content is retained, and Google Reader likewise displays the whole of the posts. To prevent this unauthorized reuse, both legal and technical means can be pursued (C&D orders, removal of the RSS feed, placement of robots.txt, IP blocks, etc.). Again, preventing search engine reuse of content supports the secondary goal.

    But how does this pursuit of the secondary goal affect the primary goal, communication? The ability to communicate is still present. The posts still are published and are as accessible as ever for reading. What happens is the audience growth is restricted and audience size will be limited. This is irrespective of the quality of the content. The end result is that the likelihood of a person discovering the content through search engines decreases. Eventually through the magic of a lack of compound growth and attrition the audience size diminishes. This negatively impacts the primary goal.

    While the reuse of your content by either aggregator or search engine is a clear license violation (which you are free to call theft) it is also a service transaction. This transaction provides value to each of the three parties involved. The blog gains from increased visibility. The reader gains from being able to discover you. The aggregator gains from advertising revenue. And you still maintain ownership and copyright despite the license violation. So where is the downside?

    To be clear, this is not an abstract argument. I only know about Securosis because of a ts/sci repost of the whole of your content which I read in Google Reader, leading me to do a Google search and discover you. In this sample of 1 the conclusion is supported by 100% of the anecdotal evidence.

    Before you set it in your mind that I am a puppet of the capitalist class take a look at my FISMApedia and FISMA Arts projects. The licenses are Creative Commons 3.0 USA Attribution Share-Alike. You will note that ‘Non-Commercial’ is not listed. I do not restrict reuse on the basis of commercial purpose. Let’s call me a refugee from the era of the BSD license. If someone wants to repackage my content in a commercial product I’ll be happy to see it used by more people. In fact I gleefully and frequently suggest people do exactly that even though the projects aren’t feature complete or publicly announced yet.

    And for the record I would not call this theft. Misuse, yes. License violation, yes. The MPAA and RIAA would call this theft. But this is not the criminal act of theft. Wiki’s take on theft is “The actus reus of theft is usually defined as an unauthorized taking, keeping or using of another’s property which must be accompanied by a mens rea of dishonesty and/or the intent to permanently deprive the owner or the person with rightful possession of that property or its use.” But that’s an argument for another day.

    See that? I stole from Wiki. Somehow I don’t think I’ll lose sleep over it. Speaking of sleep …

  2. Eponymous Jul 3

    Non-attributed use of someone else’s writing is plagiarism, plain and simple. Doesn’t matter whether it’s a blog, a freshman research paper, or a widely distributed book. Of course the world needs choads to balance out the creative, so don’t look for it to cease any time soon.

  3. Marcin Jul 3

    @ Daniel, just to be clear… My repost of Rich’s content was intended as a joke. I also cleared it with Rich beforehand ;)

  4. Daniel Philpott Jul 3

    @Marcin: I’d inferred your post was a show of unity deal from your comment about having forgotten to post. I only mentioned it because when I clicked through the RSS to the ts/sci post it was gone, which led me to Google for the original material, which led me to Securosis, which led me to open my big mouth.

    @Eponymous: I can’t help but see some irony in a plagiarism and attribution comment being signed Eponymous. I mean, unless your name really is Securosis, in which case it goes from ironic to apropos. ;-)

  5. Jonathan Bailey Jul 3

    I’m actually going to skirt the debate here. My attitude on the topic is that every author has to choose his or her own comfort level with how others use their writing and it is up to those that wish to use the content to respect those wishes.

    So, I’ll offer no opinion on where the lines are drawn.

    However, I did want to say that I am sorry about what has happened to you and that, if there is anything I can do to help, please let me know. I’ve shut down over 600 plagiarists of my own content so there’s a better than average chance I might be able to help.

    Just let me know if there is anything that I can do!

  6. alan shimel Jul 3

    Rich - not sure how to trackback to you anymore. so am leaving comment. I took another view than yours (surprise). Have posted here: http://www.stillsecureafteralltheseyears.com/ashimmy/2008/07/a-thin-line-bet.html

  7. Pepper Jul 4

    Daniel Philpott,
    Um, no. Rich doesn’t have to choose only one goal (communications) and forgo all others, as you suggest. It’s also absurd to say that he retains copyright as a good, when discussing the lack of enforcement.

    Rich is entirely entitled to say he wants to make no technical restrictions on the content, and expects people to restrict themselves to snippets and commentary (per fair use) instead of republishing. Declining to use copy protection techniques, which would impair human (RSS) readers, does not remove Rich’s right to point out this misbehavior, or to jab SR with humor if he feels like it.

    Rich,
    James Duncan Davidson has a long and rambling post where he ruminates on republishing. The comments there are interesting too.

  8. Jazz Jul 5

    The problem with content showing up on other web sites can solely be blamed on the authors of the content and their misuse of what the RSS, Atom and XML feeds were intended for.

    Unknowing authors put all their content in their feeds. They treat the feeds like it is their personal newsletter to the world. And so they load their feed with every bit of writing in their article. As such, the recipients of those feeds have a full copy of the syndicated article.

    Instead, the author should only be putting a synopsis or the first 500 characters of their content into their feed, with a link back to their site or original article for an intrigued reader to read the entire article.

    Misuse of the RSS feeds by the authors has led to scumbag sites appearing that happily take all the content and make it available on their web site. They do this not to benefit the author, but instead to boost their own rankings in the search engines. They call it SEO: search engine optimization.

    But in most cases it is outright theft of content with the full intention of placing Google Adsense ads in proximity of the hijacked content.

    In fact, many of these SEO optimized sites laugh at how easy it is for any brainless noob to become a kingpin in a specific topic area through the use of other people’s content.

    So let me just say that any author who cries about all his content being found on another person’s web site is actually contributing to the theft of their content and to others capitalizing on it. Don’t give all your content away in your RSS feeds. Use the feed as an enticer to catch the interest of readers and force them to click to the originating source to get the full story.

    I came to read this article from the SecurityNewsPortal.com web site. On SNPX.com they only allow the first 255 characters of an RSS gathered feed to be shown. But they also provide the hyperlink that got me here.

    In most cases a clever and appropriate title description is enough to catch my eye when I am scanning for interesting or intriguing articles to read. Sometimes the first 255 characters are enough to intrigue me to follow the link back to the originating author’s article.

    I won’t even discuss the ultimate scumbags who run scrapers against original content web sites and steal all their content. Them you can deal with harshly for their wanton disregard of copyright or honor.

    But I have seen other less reputable sites that are indeed benefitting from authors giving away the house and loading all their content into their RSS feed. Why oh why would anyone click back to the original author’s article when they have so foolishly made that content available to the world in their feed? The answer is simple: stop packing all of your content into your RSS feed. Give them enough to catch their interest and make them want to click back to your site.

    Now… as for Google and other search engines not keeping copies of your content - in its entirety - on their servers, I guess you have never checked out their cache they maintain of all your content. Yup they have a complete copy of all your works - in their full entirety in their stored cache copies, which they make readily available to anyone who clicks the respective link on the search engine.

    And what about other questionable players like Archive.org who keep entire copies of web sites stored on their site. They claim it is to give a historical record of the evolution of a web site. I say it serves as a central point of content theft being done under the guise of some worthy or legit purpose. Frankly the way they make money is to sell copies of their archive to lawyers who are involved in lawsuits involving content.

    I could go on but I think the whole complaint about republishers of RSS feeds ultimately falls flat on the shoulders of the original authors.

    And when you think about it… most RSS feeds are usually less than 10K in size. But then you have bloated ego or unknowing authors who make available everything in one RSS feed, that weigh in at 50k, 100k, etc. What a waste of bandwidth for all parties concerned.

    So put your RSS feed on a diet and you won’t have to gripe about content theft by SEO weasels like Ratty.

  9. Dan Philpott Jul 6

    Why put all the content in a feed? I read about 90% of most blogs through my RSS reader, clicking through to about the last 10%. I habitually unsubscribe from blogs that don’t put full content in feeds as the cost in time to read a given blog through a web interface is greater than the benefit received for doing so in most cases. So the calculation here is, which is of greater benefit to the content creator: 10% click through or 0% click through?

    Let’s assume I am an atypical user, representing 10% of the average blog’s readership and that readership follows the continuum from 100% RSS, 0% click through (e.g., using RSS only and not going to the web or being unsubscribed) to 0% RSS, 100% click through (e.g., reading only on the web interface and not using RSS). What are the benefits for the content creator for each of these classes of users? How does the limiting of RSS feed size affect the population? At what point do users switch to an alternate provider of similar content?

    I’m asking the questions because I don’t have the answers though I suspect someone has done this market analysis. The irony would be if someone did know and blogged or commented about the analysis results, wouldn’t they be stealing the original content?

  10. Dan Philpott Jul 6

    @Pepper:

    First off, good link. Well worth the read.

    Second, he retains copyright. Copyright is simply (very, very simply) the right to enforce your ownership of a particular type of intellectual property through the law. He can show he originated the content and that another party has used it. So he is within his rights to send a C&D or pursue a lawsuit to enforce his rights. A certain amount of that is necessary depending on context in order to keep his rights enforceable.

    The question is at what point enforcing these rights is of maximal benefit based on the content provider’s purpose for publication. The differentiation in my logical construct was never binary; it was an examination of the spectrum between the primary and secondary goals. As an object example, look at my use of Wiki for the definition of theft. Despite the negative connotations of depending on Wikipedia as a source I used it instead of the West’s Business Law sitting on the shelf next to my desk. Why? Because of my perception of the relative risk in using content from the given provider. Sometimes holding content too tightly makes it valueless in the Internet context.

    And Rich can and should do whatever he thinks best in support of his interests. All I’m doing is commenting on them.

  11. rmogull Jul 7

    First, the good news. That site seems to have converted to snippets with links back to the original site. Hard to complain about that.

    Now to some of the comments here. First, I put all my content into the RSS feed because that’s how I, as a consumer of information, like to get it. I rarely click through to full articles via my RSS reader; at the volume I consume, it just takes too much time.

    I also fully recognize that once you put content out there, there is absolutely no way to protect it. It will be used and abused in every way imaginable. For the most part this works in my favor.

    But it doesn’t mean I don’t have the right to get a little nasty myself towards the occasional abuser that ends up on my radar. In this case, I not only improved the situation for myself but for every other feed that was being pulled.

    While I accept my content will be misused, I get really irked at all the SEO crap and blog spam taking advantage of average Internet users (who are more the victims than I am). Every now and then I have the urge to lash out, and in this case it made the world just a smidge better.

  12. Marcin Jul 7

    Like Rich, I also publish full content because that’s how I like to get it. Very few blogs that only post excerpts are worth subscribing to. I only click through to a blog if I’m interested in reading any comments or posting one anyways.
