Reddit CEO: LLMs ‘Would Not Exist’ Without Reddit Data

  • Huffman said LLMs "would not exist as we know them" without Reddit and called the platform's content "modern oil" for the modern internet.
  • On new data licensing deals, he said the company is "open for business."
  • Huffman says Reddit's community is "starting to reject" AI-written content through downvotes.

Reddit CEO Steve Huffman calls user content "modern oil" for AI, discusses deals with Google and OpenAI, and outlines why some companies face lawsuits.

Reddit CEO Steve Huffman said large language models “would not exist as we know them” without Reddit’s content. He called the platform’s user-generated data “modern oil” for AI.

Huffman made the comments during an interview at Fast Company’s Most Innovative Companies Summit.

What Huffman Said About Reddit’s Value To AI

Huffman described the position Reddit’s data holds in the AI ecosystem.

Huffman said:

“LLMs would not exist as we know them without Reddit. Reddit is one of the single largest sources of training data for the LLMs and Reddit continues to be one of the primary sources of both training data and we’re also the most cited, the most cited platform across all models.”

He attributed the citation claim to Profound, a firm that tracks AI citation data.

Huffman explained why AI companies depend on the content.

“There’s no artificial intelligence without actual intelligence. At the end of the day, these models are quite simple. They’re regurgitating on an absolutely massive scale what they’ve consumed elsewhere and a large portion of that consumption is actually just the human conversation on Reddit because it’s natural and it covers basically every topic imaginable.”

Deals For Some, Lawsuits For Others

Reddit announced data licensing agreements with Google and OpenAI in 2024. Huffman referenced those as Reddit’s original two AI data deals and didn’t announce any additional agreements.

“Since we did the original two deals with Google and OpenAI, that was over two years ago, so we’ve learned a lot. They’ve learned a lot. The whole world’s learned a lot. Specifically how valuable Reddit’s data is and how useful it is. And so we’re being I think very deliberate and selective there. But yeah, we’re open and open for business.”

For companies that haven’t agreed to licensing terms, Reddit has taken legal action. The company sued Anthropic in California Superior Court, alleging unauthorized use of Reddit content and violations of Reddit’s terms. Reddit filed a federal lawsuit against Perplexity in the Southern District of New York, along with three data-scraping firms, alleging DMCA anti-circumvention violations and related claims.

Huffman drew a line between the two groups.

“Companies like Google and OpenAI where we had good relationships, we can actually do a deal and put some guard rails on use and access to our data on behalf of our users but then collaborate on making products for the next generation of the internet.”

He added that “not every company is willing to be a collaborative partner and so unfortunately we have to go the other way which is lawsuits.”

Huffman told the audience Reddit’s position on commercial use is simple. “Commercial use of our data requires commercial terms,” he said. Reddit began charging for commercial API access in 2023, a move that preceded the current licensing deals.

Huffman said Reddit still provides free data access to researchers and universities and tries to remain flexible for non-commercial use.

What Changed Reddit’s Openness

According to Huffman, Reddit’s willingness to share data freely changed when the AI industry moved away from open research. As SEJ previously reported, Reddit limited access for many search engine crawlers while Google remained an exception.

“Historically, Reddit has been like we’re born of the open internet and Reddit has been open and very permissive for access to its data. And honestly, I think we would be in a different position today if the AI companies were still basically open and open source and doing open research.”

Huffman said the issue was that Reddit could no longer track how its data was being used. “People are using our data and we don’t know what it was being used for,” he told the audience.

Beyond commercial terms, Huffman said Reddit wants to prevent its data from being used to identify users, target them with ads, or replace or disintermediate the platform.

Reddit’s Own AI Efforts

Huffman acknowledged what he called a “paradox.” Reddit’s content powers external AI systems, but the company also uses AI across its platform.

The most visible product is Reddit Answers, an LLM-powered search feature. It reads posts and comments, then organizes them into responses built from verbatim user quotes. Huffman noted it’s designed for questions without definitive answers.

“What Reddit Answers does is a couple of things that are unique to Reddit. One, it basically only answers in verbatim quotes from actual people. And then the second thing it does is it tries to present multiple perspectives because the whole point if you’re on Reddit, you want the human perspective.”

Behind the scenes, Reddit uses AI for content moderation and classification. LLMs can evaluate whether a comment crosses into bullying, something Huffman described as previously difficult because of the subjectivity involved.

Huffman presented AI moderation as a way to reduce exposure to the worst content, not as a replacement for Reddit’s community moderation model.

“The worst job on the internet used to be looking at the worst content on the internet and deciding whether it could be online or not,” Huffman said. “That job just goes away.”

The Gray Area Of AI-Written Posts

Huffman also addressed the challenge of users writing content with AI tools and pasting it into Reddit. That’s different from automated bot activity, he stressed.

“The most annoying thing that I see not just on Reddit, but all over the internet is somebody who wrote their post or comment with ChatGPT and then pasted it into Reddit. Like, is that a bot? Certainly feels like a bot, but there’s a human behind the idea.”

Huffman cast the issue as one of intent. “It’s very important to us that there’s a human behind the idea, behind the content, behind the prompt,” he said. But he also noted that “the writing sucks” when users rely on AI to compose their posts.

Rather than creating a policy to address it, Huffman indicated Reddit will let its community handle the issue. Users are already downvoting AI-written content and calling it out in comments. Huffman said Reddit will “empower the users more and the subreddits more to just reject that sort of content altogether.”

He compared the broader question to calculators in math class. “Kids these days are just learning how to write with AI. What are we going to do about it?” he said. “We kind of have to learn, I think, along with everybody else.”

Why This Matters

Huffman’s comments reinforce Reddit’s pitch that its user discussions are a core input for AI systems.

The AI-written content problem Huffman described is one SEJ covered as part of a broader YouTube AI slop investigation. Reddit’s decision to let community voting handle AI-generated posts, rather than building detection tools, sets it apart from platforms that have deployed automated labeling.

Looking Ahead

Huffman told Fast Company that Reddit is “in the market talking to folks all the time” about new data deals, though he didn’t hint at a third agreement.

Reddit’s lawsuits against Anthropic and Perplexity are both ongoing. The Anthropic case was the subject of a federal court remand hearing in March.

SEJ STAFF Matt G. Southern Senior News Writer at Search Engine Journal
