TL;DR — The best prediction market agents do not run on a single LLM. They split into a research model for market synthesis, an execution model for structured tool calls, and a deterministic risk layer that sits between the AI and your money. This guide covers every model worth considering in 2026, when to self-host, when to fine-tune, and how to evaluate before committing real capital.
Why model selection matters for prediction market agents
A wallet-bearing prediction market agent is harder than a chatbot. It needs to read large volumes of market data and news, decide when to call tools (and when not to), emit machine-checkable structured outputs, know when the right trade is no trade at all, and interface safely with fund-moving infrastructure. The quality bar is not “good reasoning” — it is decision quality under uncertainty.
That makes this a Layer 4 problem in the agent betting stack. But it connects directly to every other layer: the LLM proposes actions that flow through a wallet layer for signing, hit a market adapter for execution, and may need agent identity credentials for platform access.
Tool-use benchmarks like BFCL, τ-bench, and When2Call show that even frontier models still struggle with stateful multi-turn workflows, consistent rule-following, and deciding when not to act. These are necessary signals — but they are not sufficient evidence that a model will trade profitably.
The architecture that actually works
Do not give one model direct authority to move funds. The pattern that holds up in production:
Research model (GPT-5.4 / Claude Sonnet 4.6 / Gemini 2.5 Pro / gpt-oss-120b)
→ proposes thesis, probability estimate, or abstain signal
Policy + risk engine (deterministic code)
→ checks market whitelist, position limits, slippage caps, notional caps, cooldowns
Execution model (Mistral Small 3.2 / gpt-oss-20b / Qwen3-32B / xLAM / Functionary)
→ produces exact structured trade intent (market_id, side, size, limit_price, max_loss)
Wallet service (Coinbase Agentic Wallets / Safe)
→ signs only approved action types under hard spending limits
Market adapter (Polymarket CLOB SDK / Kalshi API)
→ submits deterministic order
Auditor / replay log
→ stores every input, tool call, and approved action
This is not just a safety preference; it is an accuracy argument. BFCL found that state-of-the-art models are strong on simple single-turn function calls but weak on memory, dynamic decision-making, and long-horizon reasoning. τ-bench reported that agents succeeded on fewer than half of realistic tasks. When2Call showed significant gaps in deciding whether to call a tool, ask for clarification, or abstain.
If you let a single model both decide and execute trades, you are betting capital on exactly the failure modes the research literature still flags as open problems.
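The split between a proposing model and a deterministic risk layer can be sketched in a few lines. Everything here is illustrative: the `TradeIntent` fields mirror the structured intent described above, but the whitelist, caps, and field names are hypothetical placeholders, not any platform's actual API.

```python
from dataclasses import dataclass

# Hypothetical trade intent emitted by the execution model.
@dataclass(frozen=True)
class TradeIntent:
    market_id: str
    side: str           # "buy" or "sell"
    size: float         # number of contracts
    limit_price: float  # in (0, 1) for binary markets
    max_loss: float     # USD

# Deterministic risk layer: hard-coded policy, no LLM involved.
MARKET_WHITELIST = {"mkt-fed-cut-march"}
MAX_NOTIONAL = 250.0  # USD per order
MAX_LOSS_CAP = 100.0  # USD per order

def approve(intent: TradeIntent) -> tuple[bool, str]:
    """Return (approved, reason); reject anything outside policy."""
    if intent.market_id not in MARKET_WHITELIST:
        return False, "market not whitelisted"
    if intent.side not in ("buy", "sell"):
        return False, "invalid side"
    if intent.size * intent.limit_price > MAX_NOTIONAL:
        return False, "notional cap exceeded"
    if intent.max_loss > MAX_LOSS_CAP:
        return False, "max_loss cap exceeded"
    return True, "ok"

# 100 contracts at 0.42 = 42.0 notional, under both caps -> approved
ok, reason = approve(TradeIntent("mkt-fed-cut-march", "buy", 100, 0.42, 42.0))
```

Only intents that pass `approve` ever reach the wallet service; a rejected intent is logged and dropped, never retried by the model that produced it.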
What matters most in model selection
Ranked by importance for prediction market agents:
1. Tool/function calling reliability. Bad tool use is not a UX bug — it is how you get wrong market IDs, malformed orders, and stale data. This is the minimum bar.
2. Structured output compliance. If the model cannot reliably emit strict JSON trade intents, it should not be on the execution path.
3. Abstention discipline. Prediction markets reward waiting. A model that always has a take will overtrade. A weaker model that abstains correctly often outperforms a stronger one that cannot.
4. Long-context retrieval and synthesis. Event-driven markets require reading rules, summarizing conflicting evidence, and tracking changing facts across large document sets.
5. Cost and latency. Monitoring hundreds of markets requires cheap, fast loops. Your research model and your execution model do not need to be the same — and usually should not be.
6. Self-hostability. Proprietary signal pipelines, wallet telemetry, and compliance constraints may require running models on your own infrastructure.
7. Fine-tunability. Important only after you know what you want to optimize. Start with evals, not training.
Frontier API models
GPT-5.4
OpenAI’s current flagship for complex agentic work. 1,050,000-token context window, native function calling, structured outputs, and the broadest tool surface in the API ecosystem (web search, file search, code interpreter, computer use, MCP).
Best role: Premium research/planner model, or single-model baseline before you split the stack.
Trade-offs: API-only. No self-hosting. No fine-tuning (distillation only). Premium output pricing.
Agent angle: The 1M context makes it strong for “read everything before deciding” workflows. If cost is not the main constraint and you want the cleanest starting point, GPT-5.4 is it. Keep wallet signing and order submission outside the model.
Claude Opus 4.6 and Claude Sonnet 4.6
Anthropic positions Opus 4.6 for the most complex tasks and Sonnet 4.6 as the best speed/intelligence balance. Both support 200K context (1M in beta) and have strong tool-use primitives. Sonnet 4.6 includes improved agentic search and code execution workflows.
Best role: Opus 4.6 for the hardest research and synthesis. Sonnet 4.6 as a balanced day-to-day planner and coding-heavy stack component.
Trade-offs: API-only. No self-hosting. For money movement, you still need external guardrails.
Agent angle: Sonnet 4.6 is near the top of the list for balanced performance in a production market-research agent. If your agent stack already uses Claude for analysis via tools like Polyseer or CrewAI, keeping the research layer on Claude simplifies integration.
Gemini 2.5 Pro, Flash, and Flash-Lite
Google’s family covers the full cost spectrum. Pro is the strongest reasoner with 1M+ context. Flash is the price-performance leader for high-volume loops. Flash-Lite is the cheapest option for classification and heartbeat monitoring.
Best role: Pro for deep research reads. Flash for cheap market scanning and triage. Flash-Lite for large-scale routers.
Trade-offs: API-only. Pro is substantially more expensive than Flash.
Agent angle: For agents monitoring dozens or hundreds of markets simultaneously, Gemini’s pricing ladder is the most attractive. Run Flash-Lite as a broad scanner, Flash for preliminary scoring, and route only high-signal markets to a premium research model.
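The scan-then-escalate ladder reduces to a routing function over a triage score. The thresholds and tier names below are illustrative assumptions, not recommended values; tune them against your own cost and precision targets.

```python
# Tiered routing sketch: cheap models scan everything, expensive models
# see only high-signal markets. Thresholds here are placeholders.
def route(signal_score: float) -> str:
    """Map a triage score in [0, 1] to the cheapest adequate model tier."""
    if signal_score < 0.3:
        return "gemini-2.5-flash-lite"  # heartbeat scan only
    if signal_score < 0.7:
        return "gemini-2.5-flash"       # preliminary scoring
    return "premium-research-model"     # deep read (Pro or similar)

assert route(0.1) == "gemini-2.5-flash-lite"
assert route(0.5) == "gemini-2.5-flash"
assert route(0.9) == "premium-research-model"
```

The design point: the expensive model's token budget is spent only on markets the cheap tiers have already flagged, which is what makes monitoring hundreds of markets economically viable.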
DeepSeek V3.2 (API)
OpenAI-compatible API with integrated thinking + tool-use. Strong reasoning lineage from the R1 family. Attractive when cost matters.
Best role: Budget-conscious hybrid stacks. Research/planning via API with self-hosted execution elsewhere.
Trade-offs: Fast-moving release cadence. Pin versions and test carefully. Open-weight releases do not always match the API versions exactly.
Open-weight models for self-hosting
If you need tighter latency control, private data handling, or lower marginal cost at scale, self-hosting is the path. These are the families that matter for agent work.
gpt-oss-120b and gpt-oss-20b
OpenAI’s open-weight models are explicitly agent-oriented. Apache 2.0 license. gpt-oss-120b (117B total, 5.1B active MoE) fits on a single H100. gpt-oss-20b (21B total, 3.6B active) is the lower-latency option. Both support full-parameter fine-tuning and serve via vLLM, Ollama, or llama.cpp.
Best role: gpt-oss-120b as the strongest practical open all-rounder. gpt-oss-20b as a self-hosted execution model and fine-tune target.
Agent angle: If you want one open-weight family for a prediction market agent, gpt-oss belongs at the top of the list. The 20b variant is an excellent candidate for fine-tuning on your specific Polymarket or Kalshi tool schemas.
Qwen3 family
Dense models from 0.6B to 32B plus MoE variants (30B-A3B and 235B-A22B). Apache 2.0 license. Supports thinking and non-thinking modes. Serves via SGLang, vLLM, Ollama, LM Studio.
Best role: Qwen3-30B-A3B or Qwen3-32B as balanced self-hosted agent cores. Smaller variants (8B, 14B) as routers and lightweight executors. 235B-A22B as a heavyweight open research model if you have the infrastructure.
Agent angle: The broad size range makes Qwen3 ideal for teams that want a single model family across planner, router, and executor roles. 30B-A3B is one of the most interesting “serious but not absurd” self-hosted choices for market agents.
Mistral Small 3.2 and Mistral Large 3
Mistral’s docs are unusually agent-friendly — function calling, structured outputs, and agent conversation patterns are first-class features. Small 3.2 is a 24B model (Apache 2.0, 128K context, ~55 GB GPU RAM in bf16). Large 3 is a 675B MoE (41B active, 256K context, Apache 2.0).
Best role: Small 3.2 as a self-hosted execution model. Large 3 as a heavyweight open research/planner.
Agent angle: If reliable function-calling in a self-hosted mid-size model is your priority, Mistral Small 3.2 belongs on the shortlist immediately. Its function-calling template is more robust than prior releases and pairs well with a wallet layer that expects strict structured intents.
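When a self-hosted executor sits behind an OpenAI-compatible server (as vLLM provides), the strict intent is enforced by the tool definition you hand the model. A sketch of such a definition, with a hypothetical `place_order` tool whose parameter constraints mirror the intent fields above:

```python
# OpenAI-style tool definition for a hypothetical place_order tool,
# passed as tools=[PLACE_ORDER_TOOL] to an OpenAI-compatible endpoint.
PLACE_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "place_order",
        "description": "Submit a limit order for a binary prediction market.",
        "parameters": {
            "type": "object",
            "properties": {
                "market_id": {"type": "string"},
                "side": {"type": "string", "enum": ["buy", "sell"]},
                "size": {"type": "number", "minimum": 0},
                "limit_price": {
                    "type": "number",
                    "exclusiveMinimum": 0,
                    "exclusiveMaximum": 1,
                },
                "max_loss": {"type": "number", "minimum": 0},
            },
            "required": ["market_id", "side", "size", "limit_price", "max_loss"],
            "additionalProperties": False,
        },
    },
}
```

The tighter the schema (enums, bounds, `additionalProperties: false`), the less room a mid-size model has to improvise, and the less work the downstream validator and risk engine have to do.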
DeepSeek V3 (open weights)
671B MoE with 37B active parameters. MIT-licensed code, commercial-use model license. Deployment via DeepSeek-Infer, SGLang, LMDeploy, TensorRT-LLM.
Best role: High-end open research model. Hybrid API + self-host experimentation.
Trade-offs: Operationally demanding. Only for teams already running large inference clusters. For most builders, gpt-oss, Qwen3, or Mistral Small 3.2 are easier starting points.
Llama 3.3 70B
Meta’s 70B instruction-tuned model with 128K context. Enormous ecosystem of fine-tunes, adapters, and quantizations. The important caveat: the Llama 3.3 Community License is less permissive than Apache 2.0.
Best role: Base model for custom downstream fine-tunes. Ecosystem-heavy projects that benefit from wide third-party tooling.
Agent angle: Not the first recommendation for a new prediction market agent in 2026, but you will see it underneath many custom domain and tool-use fine-tunes.
Specialized function-calling models
This is the part most builders miss. If your question is “are custom-trained models better?” — the answer is yes, for the execution layer.
xLAM (Salesforce)
Large action models specialized for agent tasks. Achieved top results on BFCL at publication. Active Hugging Face collection with variants and datasets (xlam-function-calling-60k, APIGen-MT-5k). The -fc variants are specifically tuned for function calling and run on vLLM.
Why it matters: Direct evidence that action-optimized models outperform general chat models on the thing you actually need for execution — reliable tool calls with correct arguments.
Functionary (MeetKai)
Purpose-built for interpreting and executing functions/plugins. Supports serial and parallel function calling, JSON Schema tool definitions, and only triggers functions when needed. Serves via vLLM and SGLang.
Why it matters: If your problem is “tool calls must work reliably,” Functionary is one of the strongest open baselines available.
ToolLLaMA / ToolLLM / ToolBench
Early but important. Built ToolBench spanning 16,464 real-world REST APIs. Released training and eval scripts plus ToolLLaMA models fine-tuned for tool use.
Why it matters: Still one of the clearest open examples of how to build specialized tool-use models from data construction through evaluation. The ToolBench dataset is useful training material if you are building your own executor.
ToolACE
A data-generation pipeline and training methodology, not just a benchmark result. Shows that even 8B models trained with high-quality synthetic tool data can reach state-of-the-art function-calling performance comparable to much larger closed models.
Why it matters: Good tool-use performance is often more about the right training data and objectives than raw parameter count.
Are custom-trained models better?
For narrow execution tasks: often yes. For end-to-end market trading: not automatically.
Custom-trained models consistently outperform general models at tool routing, argument formatting, schema adherence, shorter and more deterministic outputs, and working within a specific runtime template. The evidence from ToolACE, xLAM, and ToolLLM is strong.
But custom-trained models are not automatically better at probability calibration, deciding when not to trade, integrating conflicting evidence, handling long-horizon stateful workflows, or maximizing expected value. BFCL, τ-bench, and When2Call all flag these as open challenges even for specialized models.
A custom-trained action model can be better than a frontier model for your execution layer. That does not make it a better trading brain.
When custom training is worth it
Fine-tuning becomes high-ROI when all of these are true:
- Your tool schemas are stable. Changing APIs age training traces quickly.
- You have real production traces. Not synthetic “weather API” examples — your actual market workflows, your actual failures.
- You know the specific failure you are fixing. Wrong market ID selection, broken JSON, bad abstention, over-calling tools.
- You have an evaluation harness. Without offline measurement, you train style, not quality.
- Your bottleneck is execution quality, not world knowledge.
When custom training is not worth it
Do not start with fine-tuning if any of the following hold:
- You are still discovering your architecture.
- Your market and wallet tools are still changing.
- Your biggest weakness is reasoning quality.
- You have no labeled outcomes.
- You have not run a strong baseline.
In early stages, invest in prompt engineering, schema tightening, deterministic execution code, retrieval quality, and a better eval harness first.
What to fine-tune
Fine-tune the execution model, not the research model. Good targets: gpt-oss-20b, Mistral Small 3.2, Qwen3-14B/32B, Functionary, or xLAM.
Build three training datasets:
Dataset A — Tool and schema execution. Correct function calls, valid JSON, proper Polymarket CLOB or Kalshi API method routing, rejection of malformed requests.
Dataset B — Abstention and no-trade cases. Insufficient evidence, ambiguous market wording, efficient pricing, events that do not match the user’s framing, data freshness below threshold. This is the missing piece in generic tool datasets.
Dataset C — Domain judgment with realized outcomes. Historical market states, event descriptions, news snapshots available at the time, model-generated probabilities, human overrides, eventual resolution, realized PnL. This is the hardest data to build and the dataset most likely to create actual trading edge.
If your training data is mostly tool traces, expect improvement mostly in execution. If it includes historical evidence, outcomes, and abstention labels, you have a chance to improve decision quality.
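For concreteness, here is what one Dataset B record might look like as a JSONL line. The chat structure follows the common messages format most fine-tuning frameworks accept; the field names, system prompt, and label vocabulary are hypothetical and should be adapted to your harness.

```python
import json

# Illustrative JSONL record for Dataset B (abstention / no-trade cases).
record = {
    "messages": [
        {"role": "system", "content": "Emit a trade intent or ABSTAIN."},
        {"role": "user", "content": (
            "Market: 'Will the bill pass by June?' "
            "Evidence: two conflicting reports, both older than 48h."
        )},
        {"role": "assistant", "content": json.dumps({
            "action": "abstain",
            "reason": "stale_evidence",  # data freshness below threshold
        })},
    ],
    # Used by the eval harness for scoring, not by the trainer.
    "label": "correct_abstention",
}

line = json.dumps(record)  # one JSON object per line in the .jsonl file
```

Records like this are exactly what generic tool-calling datasets lack: the gold answer is a refusal, with a machine-readable reason the eval harness can score against.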
Four build patterns
Pattern A — Fastest path to production
| Component | Choice |
|---|---|
| Research model | GPT-5.4 or Claude Sonnet 4.6 |
| Execution | Deterministic code or small self-hosted model |
| Wallet | Coinbase Agentic Wallets |
| Market adapter | Polymarket CLOB SDK |
Strongest reasoning with minimal ML ops. Best for teams that want to ship fast and iterate on the agent logic, not the infrastructure.
Pattern B — Best hybrid
| Component | Choice |
|---|---|
| Research model | GPT-5.4 / Gemini 2.5 Pro / Claude Sonnet 4.6 |
| Execution | Mistral Small 3.2 / gpt-oss-20b / Qwen3-32B |
| Wallet | Coinbase Agentic Wallets or Safe |
| Market adapter | Polymarket + Kalshi adapters |
Best balance of quality, cost, and control. The execution model is self-hosted and can be fine-tuned on your own traces.
Pattern C — Best fully open stack
| Component | Choice |
|---|---|
| Research model | gpt-oss-120b or Qwen3-30B-A3B |
| Execution | Mistral Small 3.2 / Functionary / xLAM |
| Runtime | vLLM or SGLang |
| Wallet | Safe multi-sig |
Good open performance without jumping to giant multi-node MoEs. Full data residency and no API dependencies.
Pattern D — Max open research stack
| Component | Choice |
|---|---|
| Research model | Mistral Large 3 / DeepSeek V3 / Qwen3-235B-A22B |
| Execution | Smaller specialized model |
| Cost | High inference-ops burden |
Strongest open-only ceiling. Only for teams with serious GPU infrastructure.
How to evaluate before committing live funds
Use a four-layer evaluation framework:
Layer 1 — Tool-call correctness. BFCL-style tests: correct function selected, correct arguments, valid JSON, no hallucinated tools, proper abstention when tools are insufficient.
Layer 2 — Stateful multi-turn behavior. τ-bench-style scenarios: does the model remember earlier tool outputs, follow policy across turns, remain consistent under retries and partial failures, and handle ambiguous information without derailing.
Layer 3 — Market-decision quality. Historical replay with information available only at the time. Require explicit probability or trade intent. Measure calibration (Brier score, log loss), abstention quality, and compare against market prices and human analysts.
Layer 4 — Shadow mode. The agent generates recommendations only. Compare against human or deterministic policy decisions. Log every tool call and proposed trade. Review failure clusters. Only then allow tightly capped autonomous execution.
Metrics that matter
Track these across your evaluation:
| Metric | What it measures |
|---|---|
| Tool-call exact-match rate | Execution reliability |
| Malformed order rate | Schema compliance |
| Abstention precision/recall | Discipline quality |
| Policy violations caught by risk engine | Guardrail load |
| Calibration error (Brier/log) | Prediction quality |
| Simulated PnL and max drawdown | Trading performance |
| Tokens and latency per market | Cost efficiency |
| Fraction requiring human override | Autonomy readiness |
A model that scores brilliantly on generic reasoning benchmarks but fails these tests is not a good prediction market execution model.
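The two calibration metrics in the table are cheap to compute directly from (forecast, outcome) pairs. A minimal sketch, where forecasts are the model's probability of YES and outcomes are 1 for YES, 0 for NO:

```python
import math

def brier(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

def log_loss(forecasts: list[float], outcomes: list[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood; heavily punishes confident misses."""
    total = 0.0
    for p, y in zip(forecasts, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(forecasts)

# Three forecasts, all on the right side of 0.5:
# Brier = (0.1^2 + 0.2^2 + 0.3^2) / 3 = 0.14 / 3
score = brier([0.9, 0.2, 0.7], [1, 0, 1])
```

Lower is better for both. Brier score is bounded and forgiving; log loss goes to infinity as a confident forecast lands on the wrong side, which makes it the sharper test of whether your agent knows when it does not know.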
Self-hosting runtimes
If you self-host your execution model (or research model), these are the runtimes that work:
vLLM — The default production choice. High-throughput serving, broad model support including MoE architectures, and explicit compatibility with gpt-oss, DeepSeek, Qwen, and Mistral families.
SGLang — High-performance serving designed for low-latency inference. Recommended by Qwen and documented by DeepSeek for local deployment.
Ollama — Easiest path for local dev and prototyping. Supports gpt-oss, DeepSeek-R1, and Qwen3 out of the box.
TensorRT-LLM — For teams heavily invested in NVIDIA infrastructure who want maximum performance from larger deployments.
Practical self-hosting tiers
Tier A (easiest serious setup): Mistral Small 3.2, Qwen3-30B-A3B/32B, gpt-oss-20b. Start here if you want a production-ish self-hosted executor without turning your project into an inference-ops program.
Tier B (high-end but practical): gpt-oss-120b. Fits on a single H100-class GPU.
Tier C (heavyweight open frontier): Mistral Large 3, DeepSeek V3, Qwen3-235B-A22B. Only if you are willing to operate multi-GPU infrastructure.
Model comparison summary
| Model | Type | Context | License | Best role | Cost tier |
|---|---|---|---|---|---|
| GPT-5.4 | API | 1.05M | Proprietary | Premium research/planner | $$$ |
| Claude Opus 4.6 | API | 200K–1M | Proprietary | Deep research/synthesis | $$$ |
| Claude Sonnet 4.6 | API | 200K–1M | Proprietary | Balanced planner | $$ |
| Gemini 2.5 Pro | API | 1M+ | Proprietary | Long-context research | $$ |
| Gemini 2.5 Flash | API | 1M+ | Proprietary | Cheap monitoring/triage | $ |
| DeepSeek V3.2 | API | Varies | Proprietary | Budget research | $ |
| gpt-oss-120b | Open | — | Apache 2.0 | Open research + execution | Self-host |
| gpt-oss-20b | Open | — | Apache 2.0 | Execution / fine-tune target | Self-host |
| Qwen3-30B-A3B | Open | — | Apache 2.0 | Mid-size self-hosted core | Self-host |
| Qwen3-32B | Open | — | Apache 2.0 | Mid-size self-hosted core | Self-host |
| Mistral Small 3.2 | Open | 128K | Apache 2.0 | Function-calling executor | Self-host |
| Mistral Large 3 | Open | 256K | Apache 2.0 | Heavyweight planner | Self-host |
| Llama 3.3 70B | Open | 128K | Community | Fine-tune base | Self-host |
| xLAM | Open | Varies | Varies | Specialized tool-calling | Self-host |
| Functionary | Open | Varies | Varies | Reliable function execution | Self-host |
Where this fits in the agent betting stack
Model selection is a Layer 4 — Intelligence decision, but it touches every layer:
- Layer 1 — Identity: Your agent may need identity credentials (Moltbook, SIWE, ENS) to access certain platforms. The execution model needs to handle authentication tool calls reliably.
- Layer 2 — Wallet: The LLM proposes trade intents that flow through Coinbase Agentic Wallets or Safe. The wallet layer enforces spending limits regardless of what the model requests.
- Layer 3 — Trading: The execution model must produce exact structured calls to Polymarket CLOB or Kalshi API endpoints. Bad tool calls here mean real money lost.
- Layer 4 — Intelligence: This is where model selection lives. Tools like CrewAI for multi-agent orchestration and Polyseer for Bayesian analysis can augment any base LLM.
See the full agent betting stack guide for how all four layers connect. Browse the agent marketplace for tools and platforms that integrate with these models, or check the tools directory for the complete catalog.
What’s next
- Agent Betting Stack: The Complete Architecture — How all four layers connect
- Agent Wallet Comparison — Coinbase Agentic Wallets vs Safe vs alternatives
- Prediction Market API Reference — Polymarket CLOB, Kalshi API, and execution details
- Agent Identity Comparison — Moltbook, SIWE, ENS, and EAS for agent credentials
- Agent Marketplace — Browse tools that integrate with these models
