Recent lawsuits by Dow Jones, the New York Post, the New York Times and Amazon against AI search engine Perplexity highlight how automated extraction has become a boardroom crisis affecting fair competition and fiduciary duty. AI policy researcher and data protection manager Areejit Banerjee explores how OWASP is redefining scraping risk from “server load” to “value extraction” that erodes ROI on data assets, why technical defenses operate without a clear legal backstop and how boards should deploy layered countermeasures, including limiting exposed value, making automated use harder and instrumenting abnormal access patterns, while waiting for federal reform.
Web scraping began as a tool for search indexing, but it has since mutated into a global extraction industry. Industry research estimates the web-scraping market currently sits at $1.03 billion and is projected to nearly double to $2 billion by 2030. For boards, compliance officers and chief information security officers (CISOs), this is no longer a purely technical problem; it is a governance issue that affects fair competition, fiduciary duty and the credibility of the organization’s data-protection commitments.
Technological defenses have produced an arms race, and we now face a strategic crisis. As automation scales, a “free-rider” dynamic is taking hold: One side invests capital to build, curate and verify high-quality data infrastructure, while automated actors appropriate that value at zero cost. In effect, if you are building data products today, you are subsidizing your competitor’s product.
This imbalance destabilizes competition and discourages innovation. As recent federal policy discussions have highlighted, US law has not kept pace with automated harvesting techniques, leaving high-value data assets exposed to industrial-scale extraction.
From nuisance to litigation
This “free-rider” problem is now flooding the US court system. Dow Jones, the New York Post and the New York Times have all filed major lawsuits against AI search engine Perplexity, alleging copyright infringement and data theft. Amazon, meanwhile, has taken legal action against Perplexity as well. The core issue in these cases is the use of “agentic” browsers. Unlike traditional bots, agents simulate human user behavior and bypass both terms of service and technical protections against automated scraping. This makes traditional perimeter defenses, such as CAPTCHAs and basic rate limiting, much less effective on their own.
hiQ Labs v. LinkedIn narrowed what counts as “unauthorized access” under the Computer Fraud and Abuse Act (CFAA) for public data, weakening the legal backstop for bot blocking long before Perplexity. That gap is why the Perplexity lawsuits feel like a last resort: When your technical filters fail, the law doesn’t give you a clean way to argue “this is infrastructure theft.”
The result is a regulatory gray zone. While platforms can still attempt to block bots technically, the legal deterrent is gone. Companies are left managing relentless exploitation with no clear recourse when technical filters fail.
It’s about ROI, not just bandwidth
The industry’s understanding of the threat is finally shifting from “server load” to “value extraction.”
OWASP’s Automated Threat project is updating its definition of scraping to reflect this reality, recognizing that the primary symptom is not just network lag, but the erosion of return on investment (ROI) for high-quality data infrastructure.
This distinction is critical. When a competitor scrapes your pricing, inventory or proprietary content, they aren’t just using your bandwidth; they are eroding the ROI of your data assets. This dynamic means the original platform can no longer recover the substantial investments made to assemble and sustain its dataset.
A federal framework
Technical defenses can slow attackers, but as long as federal law treats industrial-scale harvesting as a gray area, the free-rider problem persists. For boards and compliance leaders, this means today’s controls are operating without a clear legal backstop. A modernized federal framework could close that gap by:
- Redefining “unauthorized access”: Treats automated access as “unauthorized” whenever it ignores published access rules (such as robots.txt or terms of service).
- Establishing “data misappropriation”: Recognizes large-scale stripping of investment-heavy datasets as asset misappropriation rather than a contractual dispute.
- Creating a unified standard: Replaces today’s patchwork of state rules with a single federal standard aligned to emerging international views on scraping and intellectual property.
- Preserving research exceptions: Maintains narrow, documented carve-outs for bona fide research and interoperability.
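Published access rules of the kind the first point describes are already machine-readable today. As a minimal sketch, Python’s standard-library robots.txt parser shows how a crawler can check whether a site’s published rules authorize a given fetch; the rules, bot names and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site operator might publish: one named
# scraper is barred entirely, and all other agents are barred from /pricing/.
rules = """
User-agent: DataHarvester
Disallow: /

User-agent: *
Disallow: /pricing/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler consults the published rules before fetching.
print(rp.can_fetch("DataHarvester", "https://example.com/articles/1"))   # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/pricing/widget"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/1"))    # True
```

The policy question the framework would settle is precisely what happens when an automated actor ignores a check like this: Today that is a gray area; under a modernized statute it would be unauthorized access.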
A layered approach
While that kind of reform works its way through Washington (if it ever does), boards and CISOs still have to keep their data products defendable today. OWASP’s handbook confirms that scraping is not solved by a single control. Instead, application owners are advised to deploy a coordinated set of countermeasures:
- Limit exposed value: Expose only the data fields needed for legitimate use and rely on aggregation, truncation, masking, anonymization or encryption wherever possible.
- Make automated use harder: Vary how content and URLs are delivered, set explicit scraping requirements and build test cases that simulate abusive collection patterns.
- Identify and slow automation: Use fingerprinting, reputation and behavioral signals to spot non-human usage, then apply rate limits, delays or stronger authentication to high-risk access.
- Instrument and formalize the response: Log and monitor abnormal access patterns and back technical measures with contracts, playbooks and information-sharing with peers and emergency response teams.
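The “identify and slow automation” and “instrument the response” steps above can be sketched together. The following is a minimal sliding-window rate limiter keyed on a client fingerprint; the threshold, window and fingerprint format are illustrative assumptions, not OWASP-prescribed values:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client key."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client key -> request timestamps
        self.flagged = []               # stand-in for a real audit log

    def allow(self, client_key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_key]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            # Abnormal access pattern: log it and refuse the request.
            self.flagged.append(client_key)
            return False
        q.append(now)
        return True


# Usage: a client hammering an endpoint is throttled after `limit` hits,
# and every refusal leaves an audit trail for the response playbook.
limiter = SlidingWindowLimiter(limit=5, window=60.0)
results = [limiter.allow("fp:abc123", now=float(t)) for t in range(7)]
print(results)  # first 5 allowed, the rest denied and flagged
```

In production, the key would come from device fingerprinting or reputation signals rather than a raw string, and refusals would feed the monitoring and escalation processes the last bullet describes.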
For boards and compliance leaders, the key is not to manage each control directly but to ensure that scraping risk is explicitly in scope for data-protection governance, that these kinds of layered measures are being implemented and that the organization can explain to regulators, customers and investors how it is protecting its data infrastructure against free-rider abuse.
Earlier in 2025, I described a layered-defense approach that treats scraping mitigation as a stacked system: make it harder for automated actors to enter, harder for them to operate at scale and harder for them to convert stolen output into competitive value. That philosophy aligns closely with the OWASP guidance: multiple, coordinated controls that raise the cost of extraction, while we wait for a federal “data misappropriation” standard to give defenders a legal backstop that matches the technical reality.
Innovation requires boundaries
We cannot build a robust AI economy on a foundation of infrastructure theft. If the free-rider problem remains unchecked, we risk a market where no one invests in data quality because no one can protect it.
The solution is not to ban automation but to govern it. As AI reshapes the nature of work, we must protect the data infrastructure that makes these models effective. Preserving the value of high-quality data is essential for the sustained advancement of the industry. By defining “data misappropriation” at the federal level, we can safeguard legitimate research and interoperability while ensuring that the companies building the digital future can sustain the infrastructure that supports it.


Areejit Banerjee is a senior data protection and product security leader focused on reducing automated data harvesting and misuse risk across digital products. He is a graduate researcher at Purdue University studying the compliance and accountability implications of AI-enabled data extraction. He contributes to OWASP Foundation community standards efforts on protection against automated threats, helping modernize guidance for today’s attacker capabilities. 