Why General-Purpose AI Fails at Regulatory Research
Copilot cited the wrong state, fabricated rule language, and missed data that existed in the filing. Here's why general-purpose AI can't do regulatory research.

We tested Microsoft Copilot against a source-grounded research tool on real regulatory questions. The results weren't a matter of preference. Copilot got basic facts wrong in ways that would damage an analyst's credibility if submitted to a commission.
This isn't a criticism of Copilot specifically. It's a structural problem with how general-purpose AI tools work, and why they can't be relied on for regulatory research.
Summary: General-purpose AI tools like Copilot fail at regulatory research because they search the internet, not primary source filings. In testing, Copilot cited the wrong jurisdiction, fabricated rule language, and missed data that existed in actual case documents.
The core problem
General-purpose AI tools search the public internet. They find what Google or Bing has indexed: law firm blogs, news articles, public-facing utility websites, and whatever fragments of government data have made it onto the open web.
Most regulatory filings never make it onto the open web. Public utility commission filings, testimony, orders, compliance documents, and staff reports live in state document management systems that search engines don't fully crawl. An AI that searches the internet is working from an incomplete dataset. It doesn't know what it's missing.
Source-grounded research works differently. The system crawls official PUC websites, pulls every docket and document within each case, and stores them in its own database. When an analyst asks a question, it searches that database of actual filings as its only source of truth.
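The core of that design can be sketched in a few lines. This is a minimal, hypothetical illustration (the class, field names, and matching logic are ours, not the actual tool's implementation): answers come only from stored filings, and a query that matches nothing returns nothing rather than falling back to the open web.

```python
from dataclasses import dataclass

@dataclass
class Filing:
    """One document pulled from a commission's docket system."""
    docket: str
    doc_name: str
    filing_date: str
    text: str

# Hypothetical corpus: in a real system these records would be
# crawled from official PUC sites and stored in a database.
CORPUS = [
    Filing("XX-G-24-01", "Compliance Filing", "2024-03-15",
           "Total transaction costs were $284,031.70, including agent fees."),
    Filing("XX-G-24-01", "Final Order", "2024-02-01",
           "The Commission approves the security issuance as proposed."),
]

def grounded_search(query: str, corpus: list[Filing]) -> list[Filing]:
    """Return only filings whose text matches the query.

    If no stored document matches, return an empty list instead of
    guessing -- the corpus is the only source of truth.
    """
    terms = query.lower().split()
    return [f for f in corpus if any(t in f.text.lower() for t in terms)]

for f in grounded_search("transaction costs", CORPUS):
    print(f"{f.docket} | {f.doc_name} (filed {f.filing_date})")
```

The detail that matters is the empty-list behavior: a grounded system can say "no document contains this," which is exactly what prevents the fabricated-quote failure described below.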
What happened when we tested both
We ran identical regulatory questions through Copilot and through a source-grounded research tool built for utility commission filings. Four categories of failure emerged.
Wrong jurisdiction
We asked about a specific Idaho PUC case. Copilot identified it as a Washington UTC case, cited Washington statutes, and described a completely different type of proceeding. An analyst submitting this would cite the wrong state's laws to the wrong commission.
The source-grounded tool correctly identified the case as an Idaho depreciation proceeding and returned the actual stipulation terms, effective dates, and Commission findings, with citations to the order.
Wrong case number
When asked for a company's most recent rate case, Copilot returned a docket number that was months out of date. The correct filing, which requested a significant rate increase, had already been submitted. Copilot's web results simply hadn't caught up.
Data that existed but wasn't found
On a security issuance case, Copilot said cost estimates were "not disclosed in the public order." The source-grounded tool found the exact figures: $284,031.70 in total transaction costs, broken down by agent fees and legal fees, pulled directly from the compliance filing with page numbers.
The information existed; it just wasn't in the order, but in a compliance filing within the same docket. Copilot could only find what was on the web, and the compliance filing wasn't indexed by any search engine. The source-grounded tool searched the actual case docket and found it immediately.
Fabricated rule language
This was the most dangerous result. Copilot quoted a state administrative rule verbatim, complete with quotation marks and a rule citation. The quoted language does not exist. It was fabricated.
When asked to verify the quote, the source-grounded tool confirmed that no document in its database contained that text, and it identified the actual regulatory precedent: staff testimony from a prior rate case citing specific grounds for disallowance.
If an analyst had included Copilot's fabricated quote in testimony or a commission filing, it would have been immediately identifiable as false. In a regulated environment, that's not just embarrassing. It's a credibility-ending mistake.
The citation problem
Beyond factual errors, there's a structural difference in how results are sourced.
Copilot cites website domains ("[utc.wa.gov]" or "[puc.idaho.gov]") without specifying which document, which docket, or which page. Verifying a Copilot citation means manually searching an entire government website to find what the AI might have been referencing. In practice, that makes the citation unverifiable.
Source-grounded research cites the document name, docket number, filing date, company name, and specific page numbers, so verification is one click away. For work submitted to a commission (rate case testimony, staff responses, regulatory filings), unverifiable citations are not usable.
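The structural difference between the two citation styles is easy to make concrete. A short sketch (the function, field names, and values are illustrative assumptions, not the actual tool's schema or a real filing):

```python
# A domain-level citation, as Copilot returns it: it points at an
# entire government website, so verifying it means manually searching
# that site for whatever the AI might have been referencing.
copilot_style = "[puc.idaho.gov]"

def cite(company: str, doc_name: str, docket: str,
         filing_date: str, page: int) -> str:
    """Format a document-level citation precise enough to verify
    in one click: document, docket, date, company, and page."""
    return (f"{company}, {doc_name}, Docket No. {docket} "
            f"(filed {filing_date}), p. {page}")

print(cite("Example Utility Co.", "Compliance Filing",
           "XX-G-24-01", "2024-03-15", 3))
```

A domain is a starting point for a search; a document-plus-page citation is a checkable claim. That is the whole difference.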
The speed tradeoff
Copilot returns answers in 15-30 seconds. Source-grounded research takes 3+ minutes for a comprehensive answer.
The reason is straightforward: Copilot skims the internet and summarizes. Source-grounded research reads through case documents, cross-references filings across multiple dockets, and builds citations with page numbers. That takes longer, but it's also why the output is citable and accurate.
For perspective: the research completed in 3 minutes would take an analyst days or weeks to do manually. Copilot is faster because it's doing less.
What this means for regulated organizations
A Stanford study on legal AI tools found that even premium RAG-based research tools from LexisNexis and Thomson Reuters hallucinate 17-33% of the time. General-purpose tools like Copilot, which aren't built for any specific domain, perform worse.
For organizations in regulated industries (utilities, legal, financial services), AI accuracy isn't a nice-to-have. A wrong case number in a rate filing, a fabricated rule citation in testimony, or a missed compliance document can cost credibility, money, or both.
The solution isn't to avoid AI. It's to use AI that's grounded in the actual source material (the filings, the orders, the testimony) rather than AI that searches the internet and hopes for the best.
We built a tool that does exactly that. You can see how it works in our case study, including a full side-by-side comparison of every question we tested.