Google admits to using content from publishers who opt out to train its Search AI
By willowt // 2025-05-07
 
  • Google confirmed it uses web content to train AI-powered search features (e.g., Gemini) even when publishers opt out, because its search division operates under different rules from the company's general AI training policies.
  • To fully block AI training, publishers must opt out of Google Search indexing via robots.txt. But this renders their content invisible in search results, harming traffic and ad revenue.
  • The Justice Department proposes drastic measures, including forcing Google to divest Chrome, end default-search payments, and share search/AI data with competitors to curb dominance.
  • Publishers and authors accuse Google and OpenAI of exploiting copyrighted content for AI training without fair compensation, raising unresolved questions about fair use and consent.
  • The case underscores tensions between tech innovation and publisher rights, with Judge Amit Mehta's ruling potentially setting precedents for data ownership and AI’s use of online content.
In a federal antitrust trial that could reshape the future of online search, Google admitted it continues to use web content to train its AI-powered search features – even when publishers explicitly opt out. The revelation came during testimony from Eli Collins, a vice president at Google DeepMind. He confirmed that while the company respects opt-outs for general AI training, its search division operates under different rules. The case, unfolding in Washington, D.C., underscores growing tensions between tech giants and publishers over who controls – and profits from – online content.

Google's AI loophole: Opt-outs don't apply to search

During cross-examination by Department of Justice (DOJ) attorney Diana Aguilar, Collins acknowledged that once Google's Gemini AI model is integrated into its search engine, the company can train it on data publishers sought to block. "Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Aguilar asked. Collins replied: "Correct – for use in search." This distinction has alarmed publishers, who argue that Google's AI-generated summaries – displayed above traditional search results – divert traffic from their sites, eroding ad revenue. To fully block AI training, publishers must opt out of Google Search indexing entirely via the robots.txt protocol, a move that would effectively render their content invisible in search results.
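The distinction matters at the robots.txt level. Google's documented Google-Extended token lets sites opt out of general Gemini training, but per the testimony it does not cover training for search features; the only lever that blocks search-AI training is blocking Googlebot itself, which also removes the site from search results. A minimal sketch of the two options (comments are illustrative, not Google's documentation):

```
# Option 1: opt out of general AI training via the
# Google-Extended token. Per the testimony, this does NOT
# stop Google from training its search AI on the content.
User-agent: Google-Extended
Disallow: /

# Option 2: block Googlebot entirely. This is currently the
# only way to keep content out of search-AI training, but it
# also removes the site from Google Search results.
# User-agent: Googlebot
# Disallow: /
```

A site would use one option or the other; Option 2 is commented out here because applying both simultaneously makes Option 1 moot.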

The antitrust case: Breaking Google's stranglehold on search

The testimony emerged during a pivotal antitrust trial before Judge Amit Mehta, who ruled in 2024 that Google unlawfully monopolized the search market. The DOJ is now pushing for drastic remedies, including forcing Google to divest its Chrome browser and prohibiting payments to secure default search status on devices. Internal documents revealed that Google filtered out 80 billion of 160 billion content "tokens" due to publisher opt-outs. Despite this, the search giant retained vast datasets from search sessions and YouTube videos.

The DOJ's broader efforts also include proposals to prevent Google from dominating future AI development. Regulators may require Google to open its search indexes, data and AI models to competitors and restrict agreements that limit rivals' access to web content. Additionally, the DOJ suggests allowing websites to opt out of AI training without sacrificing search visibility – a policy that could redefine digital consent. When pressed about whether Google's search dominance unfairly advantaged its AI, Collins conceded that DeepMind CEO Demis Hassabis had explored using search rankings to enhance AI performance, though no such model was confirmed to exist.

Publishers' dilemma: Surrender data or disappear from search

The case highlights a Catch-22 for publishers: Allow Google to scrape content for AI training or vanish from search rankings altogether. As AI Overviews increasingly replace click-throughs, smaller outlets face existential threats. Earlier this year, education platform Chegg sued Google, alleging AI summaries decimated its revenue. A Google spokesperson defended the policy, stating publishers could use robots.txt to block indexing – ignoring the collateral damage to their visibility.

Meanwhile, OpenAI faces similar legal challenges, with lawsuits accusing it of training AI models on "stolen private information" without consent. Authors and publishers argue that AI firms exploit copyrighted material, raising questions about fair use in the age of machine learning. Google's recent privacy policy updates may be an attempt to preempt legal action, but experts warn that courts have yet to establish clear boundaries on AI data usage.

The trial underscores a broader debate over data ownership in the AI era. While Google frames its practices as innovation, critics see a pattern of leveraging monopoly power to override consent. As Judge Mehta weighs remedies, the outcome could set a precedent for how tech giants balance progress with publisher rights. For now, the message to content creators is clear: In Google's ecosystem, opting out of AI training may mean opting out of the internet itself.

Sources for this article include:

ReclaimTheNet.org

Bloomberg.com

NiemanLab.org