LidaSim: Testing AI Policies With Persona-Based Simulations

For the AI Governance Apart Research hackathon, we created a multi-agent simulation of influential figures in AI and politics to determine which AI governance policies are more likely to be broadly supported. Specifically, we simulated group debates between ten influential figures to analyze proposals for compute governance, though the framework is generic and could be applied to any subdomain of AI safety.

This project was a collaboration between Linh Le (Lida Safety Research, Mila), David Williams-King (Lida Safety Research, ERA), and Arthur Collé (Lida Safety Research, Distributed Systems).

Motivation

Increasingly, it is becoming clear that some form of AI regulation would strongly benefit consumers and public safety at large. An Australian study recently found the risk of death from using AI to be 4,000 times larger than the risk of death from flying on a commercial airplane. However, the exact form of regulation is a difficult political question. Furthermore, we do not have time to try many different approaches before landing on one that works: transformative AI is, by many estimates, 2-5 years away, and at most 10.

Can we leverage the world-modelling capabilities of LLMs themselves to determine which AI policies are most likely to be effective, and palatable to government and industry leaders? After all, LLMs absorb enormous amounts of public information about specific people: their speeches, published positions, interview responses, congressional testimony, and academic writing. That is the idea behind our project, LidaSim. It's not a replacement for real consultation, but an early-stage tool for finding structural weaknesses or blind spots before approaching people in the real world.

We chose to focus on hardware governance as a use-case, because it may be one of the most tractable paths toward meaningful oversight of advanced AI systems. However, many proposals face enormous hurdles to actually being approved by different states and supported by companies. The basic idea is that AI chips can potentially be made to report what they're doing, verify where they are, and/or attest to the scale of computation they're performing. This creates a technical verification layer that doesn't rely entirely on trust between governments or companies.

How LidaSim Works

  1. Data Collection: scraping publicly available content (speeches, interviews, papers, testimony) for 10 influential figures using parallel.ai.
  2. Digital Twin Construction: LLM persona prompts encoding each figure's identity, positions, reasoning style, and communication patterns.
  3. Multi-Agent Deliberation: up to 10 rounds of debate: initial positions → persuasion attempts → group statements → vote.
  4. Analysis: tracking vote changes, argument patterns, and coalition stability across rounds and runs.

Figure 1 — Overview of LidaSim, with four phases.

The Ten Personas

We selected figures spanning the key stakeholder groups that any real governance proposal must navigate: political leaders with power to pass legislation, AI researchers who define technical auditing requirements, and industry CEOs who would implement them.

  • Politicians: Donald Trump, Xi Jinping, Ursula von der Leyen, Chuck Schumer
  • Industry Leaders: Elon Musk, Sam Altman, Dario Amodei, Sundar Pichai
  • Researchers: Yoshua Bengio, Stuart Russell
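
For illustration, the digital-twin prompts (phase 2 above) can be thought of as a filled-in template per figure. The sketch below is our own simplification; the field names and instructions are illustrative rather than the exact prompts we used.

```python
# Illustrative persona prompt template (simplified; field names are ours, not
# the exact prompt structure used in LidaSim).
PERSONA_TEMPLATE = """You are simulating {name}, a {role}.
Known public positions:
{positions}
Reasoning style: {reasoning_style}
Communication style: {communication_style}
Stay in character. Ground every argument in positions this person has taken
publicly, and flag when you are extrapolating beyond the record."""

def build_persona_prompt(profile: dict) -> str:
    """Fill the template from a scraped profile (speeches, interviews, testimony)."""
    return PERSONA_TEMPLATE.format(
        name=profile["name"],
        role=profile["role"],
        positions="\n".join(f"- {p}" for p in profile["positions"]),
        reasoning_style=profile["reasoning_style"],
        communication_style=profile["communication_style"],
    )
```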

The Debate Structure

We created a multi-agent framework to simulate each persona. Structurally, there were five debate rounds. In the first round, all agents stated their initial positions without seeing each other's responses. In subsequent rounds, there would be persuasion, further statements from each agent, and votes by each agent (e.g. SUPPORT or OPPOSE). The votes in the last round were taken as the output of the simulation.

Specifically, in rounds after the first, one agent from the SUPPORT camp and one from the OPPOSE camp would each make a direct persuasion attempt to the other side (having access to the full accumulated discussion). Then, each agent would make a statement after reading all previous statements, including the previous statements in the same round. Finally, each cast a vote of either SUPPORT, OPPOSE, MODIFY, or ABSTAIN. MODIFY indicates cautious support and a request to continue discussion. (We had a fixed number of rounds, but we also experimented with continuing as long as some agents wanted to continue discussion; generally, the agents would declare a position too early for this to work well.)
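
To make the round structure concrete, here is a minimal sketch of the deliberation loop in Python. It is not our exact implementation: `ask_persona` stands in for whatever LLM call produces a persona's reply and vote, and the choice of which agent makes the persuasion attempt is simplified to the first member of each camp.

```python
# Minimal sketch of the deliberation loop; ask_persona is a stand-in for the
# LLM call that returns a persona's (text, vote) given the state so far.
from dataclasses import dataclass, field

VOTES = ("SUPPORT", "OPPOSE", "MODIFY", "ABSTAIN")

@dataclass
class DebateState:
    question: str
    transcript: list = field(default_factory=list)  # (round, persona, text)
    votes: dict = field(default_factory=dict)        # persona -> latest vote

def run_debate(question, personas, ask_persona, n_rounds=5):
    state = DebateState(question)
    # Round 1: independent initial positions, no visibility of other agents.
    for p in personas:
        text, vote = ask_persona(p, state, mode="initial")
        state.transcript.append((1, p, text))
        state.votes[p] = vote
    for r in range(2, n_rounds + 1):
        # One persuasion attempt from each camp, with the full transcript visible.
        for camp in ("SUPPORT", "OPPOSE"):
            members = [p for p, v in state.votes.items() if v == camp]
            if members:
                text, _ = ask_persona(members[0], state, mode="persuade")
                state.transcript.append((r, members[0], text))
        # Statements and votes; each agent sees everything said so far,
        # including earlier statements from the same round.
        for p in personas:
            text, vote = ask_persona(p, state, mode="statement")
            state.transcript.append((r, p, text))
            assert vote in VOTES
            state.votes[p] = vote
    return state  # the final-round votes are the simulation output
```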

Figure 2 below shows how you might visualize this happening in real time:

[Figure: mock live-debate view for Q9 ("Auto-escalation at 10²⁵ FLOP"), Round 2 of 5. A sidebar lists the active personas with their current votes (running tally: Support 6, Oppose 2, Modify 2); the main panel shows the Round 2 persuasion phase, in which Stuart Russell (OPPOSE) argues that "automatic escalation based solely on compute thresholds is a blunt instrument" and Sundar Pichai moves from SUPPORT to MODIFY in response.]

Figure 2 — A visualization of debate within a round. Researchers can observe deliberations in real time. The sidebar shows current vote status; the main panel shows the live debate transcript. Our web interface looks similar.

The somewhat more complex interface of our actual system can be seen below:

Figure 3 — Screenshot from our real web interface showing conversation within a round.

All results were logged and captured as JSON for further analysis. We built a viewer to parse the logs and read the conversations that took place; the structure of each round can be seen in the first part of the prompt.
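
As a rough illustration of what the analysis step consumes, the sketch below pulls per-persona vote trajectories out of such a log. The field names (`rounds`, `votes`, `persona`, `vote`) are hypothetical; our actual schema differs in the details.

```python
import json
from collections import Counter

def vote_trajectory(log_path: str) -> dict:
    """Return {persona: [vote per round]} from a debate log (hypothetical schema)."""
    with open(log_path) as f:
        log = json.load(f)
    trajectory = {}
    for rnd in log["rounds"]:            # assumed: one object per debate round
        for entry in rnd["votes"]:       # assumed: list of {"persona": ..., "vote": ...}
            trajectory.setdefault(entry["persona"], []).append(entry["vote"])
    return trajectory

def final_tally(trajectory: dict) -> Counter:
    """Count final-round votes, e.g. Counter({'SUPPORT': 7, 'OPPOSE': 3})."""
    return Counter(votes[-1] for votes in trajectory.values())
```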

Watching conversations or viewing logs to determine what exactly the agents were talking about and why they were changing their minds was essential to the success of this project.

Results from Short Questions

We generated 22 short policy questions, phrased as "Do you support...?", to test our system on relevant use cases. Below are the final vote counts for all 22 questions. A slight majority (13 of 22) received a final verdict of SUPPORT; five ended in opposition; three ended in even splits; and one (Q9) ended with a MODIFY plurality. The table also shows notable vote changes across the five rounds.

ID | Question (abbreviated) | ✔ | ✘ | ~ | Result | Notable Vote Changes
Q1 | Tamper-resistant secure enclaves reporting to distributed ledger (no remote disable) | 0 | 10 | 0 | OPPOSE | Bengio: MODIFY×4 → OPPOSE
Q2 | 80% of existing datacenter chips instrumented within 24 months | 3 | 7 | 0 | OPPOSE |
Q3 | Mandate hardware-rooted attestation of training run details | 7 | 3 | 0 | SUPPORT | Altman: OPPOSE→ABSTAIN→…→OPPOSE
Q4 | AI chips use offline licensing; restrict if thresholds exceeded or chip relocated | 7 | 3 | 0 | SUPPORT |
Q5 | Embed location verification in chips to enforce export controls via hardware | 3 | 7 | 0 | OPPOSE | Von der Leyen: MODIFY×4 → OPPOSE
Q6 | Require HEMs (location, workload attestation) as condition for export licensing | 8 | 2 | 0 | SUPPORT |
Q7 | Chips automatically enter reduced-capability mode if compliance signals lost | 5 | 5 | 0 | SPLIT |
Q8 | Privacy-preserving compliance reporting via ZK proofs / secure MPC | 9 | 1 | 0 | SUPPORT |
Q9 ★ | Auto-escalation to detailed disclosure above 10²⁵ FLOP threshold | 0 | 4 | 6 | MODIFY | 6 agents shifted SUPPORT → MODIFY after Russell's argument in Round 2
Q10 | Require AI training infrastructure to use only vetted, security-standard hardware | 10 | 0 | 0 | SUPPORT | Pichai: MODIFY×4 → SUPPORT
Q11 | Additional government approval for chips from state-affiliated suppliers | 10 | 0 | 0 | SUPPORT |
Q12 | Mandate tamper-evident mechanisms detecting unauthorized chip modifications | 5 | 5 | 0 | SPLIT |
Q13 | Establish international body to maintain compute ledgers and coordinate inspections | 5 | 5 | 0 | SPLIT | Musk: MODIFY×4 → OPPOSE
Q14 | Require participation in international body as condition for chip export access | 4 | 6 | 0 | OPPOSE |
Q15 | Hardware-level non-compliance enforcement (license suspension) over diplomatic channels | 8 | 2 | 0 | SUPPORT | Von der Leyen: ABSTAIN → SUPPORT×4
Q16 | 72-hour breach notification with hardware forensic logs from secure enclaves | 9 | 1 | 0 | SUPPORT |
Q17 | Link dangerous capability evals to hardware-enforced security requirements | 10 | 0 | 0 | SUPPORT |
Q18 | 90-day advance notification for runs above 10²⁵ FLOP with hardware verification | 7 | 3 | 0 | SUPPORT |
Q19 | Auto-flag for government review when actual compute deviates from declared projections | 8 | 2 | 0 | SUPPORT |
Q20 | Use HEMs as the primary enforcement tool for AI governance (not supplementary) | 0 | 10 | 0 | OPPOSE |
Q21 | Trade restrictions as ultimate enforcement for non-compliant countries | 8 | 2 | 0 | SUPPORT |
Q22 | Multilateral secretariat publishing anonymized aggregate compute statistics | 8 | 2 | 0 | SUPPORT |
Figure 4 — Full results for Q1–Q22. ✔ = Support, ✘ = Oppose, ~ = Modify/Abstain. All experiments used Gemini 2.0 Flash. ★ marks Q9, discussed in detail below.

A Closer Look at Q9: How One Argument Shifted Six Votes

The most interesting result was Q9: "Do you support automatic escalation from anonymized reporting to detailed disclosure when compute runs exceed predefined thresholds (e.g., 10^25 FLOP)?"

What would this policy do? Imagine a traffic light system for AI training runs. Below a certain size (10²⁵ FLOP — roughly ten times larger than the biggest models trained today), a lab reports its activity anonymously: we know a big training run happened, but not who or what. Above that threshold, the light turns red: the lab must provide full details — what it trained, what data it used, what capabilities the model has — to a regulatory authority. Q9 asks whether this switch should be automatic and purely size-triggered, with no other conditions needed.

After Round 1, the vote was 6 SUPPORT to 4 OPPOSE, with zero MODIFY votes. By Round 5, it had moved to 0 SUPPORT, 4 OPPOSE, 6 MODIFY. Six agents changed their votes, all moving from SUPPORT to MODIFY rather than to outright OPPOSE, which is itself informative: they didn't think the goal was wrong, they thought the mechanism was too blunt. The full round-by-round breakdown is below; the cascade started in Round 2, when Stuart Russell was assigned to argue the opposition case.

Persona | Round 1 | Round 2 | Round 3 | Round 4 | Round 5
Sundar Pichai | SUPPORT | MODIFY | MODIFY | MODIFY | MODIFY
Elon Musk | SUPPORT | SUPPORT | MODIFY | MODIFY | MODIFY
Dario Amodei | SUPPORT | SUPPORT | MODIFY | MODIFY | MODIFY
Yoshua Bengio | SUPPORT | SUPPORT | SUPPORT | MODIFY | MODIFY
Chuck Schumer | SUPPORT | SUPPORT | SUPPORT | MODIFY | MODIFY
Sam Altman | SUPPORT | SUPPORT | SUPPORT | SUPPORT | MODIFY
Stuart Russell | OPPOSE | OPPOSE | OPPOSE | OPPOSE | OPPOSE
Xi Jinping | OPPOSE | OPPOSE | OPPOSE | OPPOSE | OPPOSE
Donald Trump | OPPOSE | OPPOSE | OPPOSE | OPPOSE | OPPOSE
Ursula von der Leyen | OPPOSE | OPPOSE | OPPOSE | OPPOSE | OPPOSE

Figure 5 — Q9 vote progression across all five rounds. Six agents moved from SUPPORT to MODIFY. The trigger was Stuart Russell's Round 2 argument that compute-only thresholds are "too blunt an instrument."

What's interesting here is the mechanism. Stuart Russell had voted SUPPORT in Round 1, but was then assigned the role of arguing the opposition case. His argument was that escalation based solely on compute thresholds is too blunt — it risks premature disclosure of sensitive research directions without providing meaningful insight into actual risk. Smarter monitoring, he argued, would be capability-based.

"Automatic escalation based solely on compute thresholds is a blunt instrument. It risks prematurely disclosing sensitive research directions, potentially benefiting malicious actors and hindering beneficial progress, while offering little real insight into actual risk."— Simulated Stuart Russell, Round 2, Q9

The word "solely" then appeared repeatedly in subsequent vote changes. Sundar Pichai's transition to MODIFY was representative: he explicitly agreed that escalation based "solely" on compute thresholds was too blunt, while still supporting the underlying oversight goal. Agents weren't convinced to abandon the policy, but they were convinced that the current framing had a specific, fixable weakness.

Results from Longer Policies

We then generated 13 longer, operationalized policy proposals, and ran each three times. The chart below summarizes support proportions across runs. Most received consistent SUPPORT verdicts. P2 — the most ambitious proposal, requiring a global blockchain-based chip reporting ledger — was universally opposed in all three runs. P4 showed the most variance, for reasons we unpack in the synthesis section.

Each bar represents the average vote distribution across three separate simulation runs of that policy. Green = agents who voted to support the proposal. Red = agents who voted to oppose it. Yellow = agents who wanted the proposal modified rather than accepted or rejected outright. Consistent results across three runs suggest the outcome is robust; variable results (like P4) suggest the outcome is sensitive to framing or ordering.
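
The averaging itself is simple arithmetic: tally each run's final votes, then take the per-category mean across the three runs. A self-contained sketch (category names match the vote options above):

```python
from collections import Counter

def mean_distribution(run_tallies: list) -> dict:
    """Average final-vote counts across runs, e.g. three Counters -> {'SUPPORT': 8.3, ...}."""
    categories = ("SUPPORT", "OPPOSE", "MODIFY", "ABSTAIN")
    n = len(run_tallies)
    return {c: sum(t.get(c, 0) for t in run_tallies) / n for c in categories}

# Example: a policy supported 8-2, 9-1, and 8-2 across three runs.
runs = [Counter(SUPPORT=8, OPPOSE=2), Counter(SUPPORT=9, OPPOSE=1), Counter(SUPPORT=8, OPPOSE=2)]
print(mean_distribution(runs))  # {'SUPPORT': 8.33, 'OPPOSE': 1.67, 'MODIFY': 0.0, 'ABSTAIN': 0.0}
```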

Policy | Description | Verdict (3 runs)
P1 | Forced CoT transparency reporting; deployment pause below monitorability threshold | SUPPORT ×3
P2 | Global blockchain GPU ledger; all chip sales reported; no remote disable | OPPOSE ×3 (0–10)
P3 | Software reporting of compute runs; mandatory with physical inspection compliance | SUPPORT ×3
P4 | Quarterly CoT monitorability reporting; pause deployment if <80% monitorable | MIXED (1×MODIFY, 2×SUPPORT)
P5 | Tamper-resistant secure enclaves on all new AI accelerators → international ledger | SUPPORT ×3
P6 | Datacenter software agents reporting compute; annual inspections with 48h notice | SUPPORT ×3
P7 | Security levels (SL1–SL5); national security risk models must achieve SL3+ | SUPPORT ×3
P8 | 72-hour notification protocol for AI security incidents; international coordination center | SUPPORT ×3
P9 | Annual security assessments: pen testing, red teams, physical security, ML-specific vectors | SUPPORT ×3 (10–0)
P10 | Insider threat programs (CISA standard) for all personnel with model weight access | SUPPORT ×3
P11 | SBOMs for all training/inference code; vetted hardware suppliers; weekly integrity checks | SUPPORT ×3
P12 | 90-day pre-notification for training runs >10²⁵ FLOP; capability report within 30 days post-training | SUPPORT ×3 (10–0)
P13 | Capability benchmarks auto-escalate security requirements through SL3→SL5 tiers | SUPPORT ×3
Figure 6 — P1–P13 policy results across three runs each. Bars show proportional vote distribution averaged across runs. Highlighted rows indicate extreme results (universal support or opposition). Model: Gemini 2.0 Flash.

Synthesis: What Properties Make Hardware Governance Proposals Workable?

Looking across all 35 simulation runs, a question naturally arises: why do some proposals win broad support while others collapse? The vote counts tell us what happened, but a more useful question is why — and whether there are generalizable properties that predict acceptability across different kinds of governance proposals.

We think there are in fact generalizable takeaways from our experiments: five properties that seem to distinguish broadly acceptable proposals from ones that face consistent resistance. To be clear, these are hypotheses drawn from a limited simulation, but we intend to investigate them and their predictive power in future work.

If these properties predict political viability, they give governance designers a quick diagnostic: before spending months developing a detailed proposal, you can check whether it scores well on the five dimensions. The properties also suggest a sequencing strategy: start with proposals that score high on all five, use those early wins to build institutional trust and infrastructure, then tackle the harder proposals that require shared sovereignty.

Property | What it means | Supports | Undermines
Proportionality | Requirements scale with assessed risk level (e.g. SL1–SL5 framework); low-capability models face minimal burden | P7, P9, P13 | Q2 (flat 80% mandate)
Unilateral implementability | A single country or company can implement without requiring a new international infrastructure body to function first | P9, P10, P12 | P2, Q13, Q14
Privacy-preserving verification | Compliance can be confirmed without revealing the underlying proprietary data (e.g. via ZK proofs or secure MPC) | Q8 (9–1) | Q1 (raw ledger, no privacy)
Multi-criterion triggers | Escalation or enforcement is triggered by a combination of signals (compute + capability), not a single metric alone | P13 (capability-triggered) | Q9 (compute-only trigger)
Coalition coherence | Supporters back the proposal for compatible reasons; a policy that looks supported may be fragile if supporters hold incompatible motives | P9 (consistent rationale) | P4 (fragmented coalition)

Figure 7 — Five candidate properties that appear to predict broad stakeholder acceptability across hardware governance proposals. These are hypotheses drawn from a small set of runs only.

Proportionality matters because flat mandates feel unfair to industry personas who see low-risk projects being treated the same as frontier models. Unilateral implementability matters because any proposal requiring international infrastructure first hit a catch-22 — states wouldn't join until the infrastructure existed, but the infrastructure couldn't be built until states joined. Privacy-preserving verification matters because industry personas consistently objected to policies that would expose trade secrets as a side effect of compliance. Multi-criterion triggers matter because single-metric systems are vulnerable to the "too blunt" critique we saw in Q9. And coalition coherence matters because a proposal that looks well-supported can collapse as soon as deliberation separates supporters who care about different things.
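
If the framework holds up, the diagnostic really can be a checklist. The toy version below is our own illustration only, with no validation behind it; it simply counts how many of the five properties a proposal satisfies.

```python
# Toy five-property checklist (illustrative and unvalidated).
PROPERTIES = (
    "proportionality",
    "unilateral_implementability",
    "privacy_preserving_verification",
    "multi_criterion_triggers",
    "coalition_coherence",
)

def score_proposal(answers: dict) -> int:
    """Count how many of the five properties a proposal satisfies (0-5)."""
    return sum(bool(answers.get(p, False)) for p in PROPERTIES)

# Example: a Q9-style compute-only trigger fails multi_criterion_triggers.
q9_like = dict.fromkeys(PROPERTIES, True)
q9_like["multi_criterion_triggers"] = False
assert score_proposal(q9_like) == 4
```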

The chain-of-thought monitorability policy (P4) illustrates the coalition coherence problem clearly. The simulated Trump initially voted MODIFY, framing it as a tool for maintaining American AI dominance. By Round 5 he voted OPPOSE, arguing it would disadvantage US companies while Chinese competitors remain unregulated.

"I support this policy, but it needs to be MUCH stronger and apply globally. We can't let China steal our AI technology and use it against us."— Simulated Donald Trump, Round 1, P4 (before voting OPPOSE by Round 5)

What happened here is that P4 attracted initial support from two groups with incompatible motivations: those who care about interpretability as a safety property, and those who see transparency requirements as a tool for maintaining US technological advantages. Under deliberation, these motivations came apart. Industry personas concluded the 80% CoT threshold was arbitrary; nationalist personas concluded it was self-defeating without international scope. The simulation suggests P4 would benefit from either narrowing its coalition to interpretability-motivated supporters, or resolving the international coordination gap as a prerequisite.

The Validation Problem: When Should We Trust Simulation Results?

The most important question raised by our reviewers is this: under what conditions is it valid to draw inferences about actual human responses from LLM simulations of those responses? Or in other words, how predictive is the simulation?

There are two approaches we can use:

1. Retrospective backtesting on documented positions. Several of our simulated figures have taken public, documented positions on policies that overlap with our test set. Sam Altman and Dario Amodei both testified before the US Senate in May 2023, addressing compute thresholds, incident reporting, and safety requirements — overlapping directly with P8, P12, and Q16. Yoshua Bengio's 2023 essay on AI risk lays out specific positions on oversight mechanisms. Stuart Russell testified to the US Senate in 2023 on international AI coordination. We could run our simulation on each of these cases and compare predicted vs. documented positions — not just the final vote, but the direction of travel through deliberation.

Two caveats apply: public positions may not reflect private negotiating stances, and our personas are built from the same public record we would be testing against. But backtesting would at least tell us whether the simulation is consistent with publicly observable behavior, which is a minimum bar.

2. Comparison against expert elicitation. For a subset of policies, we could ask domain experts — people who work professionally in AI governance and know these stakeholders well — to predict how each figure would vote. If our simulation agrees with expert predictions more often than chance, that's weak positive evidence. If it systematically disagrees with experts in specific directions (say, consistently overestimating support from industry personas), that's informative about the model's biases.
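
Either check reduces to the same arithmetic: per-persona agreement between the simulation's predicted votes and a reference (documented positions or expert predictions), ideally reported against a chance baseline. A sketch of the core computation (a chance-corrected statistic such as Cohen's kappa would be a natural refinement):

```python
def agreement_rate(predicted: dict, reference: dict) -> float:
    """Fraction of personas whose predicted vote matches the reference vote."""
    shared = [p for p in predicted if p in reference]
    if not shared:
        return float("nan")
    return sum(predicted[p] == reference[p] for p in shared) / len(shared)

# Example: simulation vs. documented/expert positions for three personas.
predicted = {"Sam Altman": "SUPPORT", "Dario Amodei": "SUPPORT", "Stuart Russell": "OPPOSE"}
reference = {"Sam Altman": "SUPPORT", "Dario Amodei": "MODIFY", "Stuart Russell": "OPPOSE"}
print(agreement_rate(predicted, reference))  # 0.666...
```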

Overall, we expect LLM simulations of named public figures to be most informative about argument structure — which objections are logically available to which stakeholder types, and which policy framings are vulnerable to specific critiques — and least informative about strategic behavior, which depends on private incentives and real political pressure that simulations can't access.

Future Work

Retrospective backtesting. The most important next step is testing the simulation against documented historical positions. Concrete candidates: Altman and Amodei's public statements on compute thresholds post-Senate testimony (2023); Bengio's published positions on hardware governance in his 2023 safety writing. We plan to run the simulation on each of these cases, compare predicted vs. documented positions, then publish the calibration results. We may need to test domains with more available data to calibrate our system, and especially to see which people it predicts most accurately.

Human-in-the-loop runs. We built the web interface specifically to support this: running the same debates with real humans playing some roles, then comparing outcomes to see where the simulations diverge. Or, if we can find experts in specific areas to participate, we can leverage their arguments to help get more realistic transcripts to train from.

Testing the five properties as predictors. The framework in Figure 7 makes falsifiable predictions. We can score a set of new governance proposals on the five dimensions, run them through LidaSim, and check whether the scoring predicts the simulation outcome. If the properties genuinely explain variance across proposals, this would give governance researchers a quick diagnostic tool — and publishing the scoring rubric openly would let others test it on their own policy proposals.

Expanding the policy space. Our current focus is compute governance. The same framework applies to model evaluation mandates, deployment restrictions, AI lab insurance liability, and safety legislation — all of which interact with hardware governance in ways worth exploring separately.

Conclusion

We built LidaSim in a hackathon weekend to test a simple question: can AI-powered simulations of real public figures help stress-test governance proposals before they face real political scrutiny? The system won first prize at the hackathon, and we intend to continue in this research direction with backtesting and human-in-the-loop runs.

The biggest practical takeaway is the five-property framework. If governance proposals succeed when they are proportionate, unilaterally implementable, privacy-preserving, multi-criterion triggered, and coalition-coherent — and fail when they're not — that's an actionable diagnostic for policy designers. It doesn't require a simulation to apply: you can check a new proposal against these five dimensions in an afternoon. Whether this framework actually holds up is an empirical question, and we intend to test it.

If you work in AI governance, compute policy, or AI safety evaluation, and you'd like to collaborate, we'd love to hear from you. Our code is at github.com/distributed-systems-co/lida-multiagents-research.