Labs
03

Building Athena

Solo build · Personal · Mar – Apr 2026 · Paused

  • Python
  • Claude
  • Kalshi API
  • FRED
  • Polymarket
  • Hetzner
  • Docker

Athena was a side project that asked one question: could Claude, with disciplined risk controls, find calibrated edge on Kalshi’s prediction markets? I built it over five weeks in spring 2026, paper-traded it on a $4.50-a-month server in Helsinki, and shelved it before going live with real money. This is the post-mortem.

The hypothesis was that prediction markets are the right venue for an LLM-led trading system. In equities you compete against Renaissance and Citadel. On Kalshi you compete against retail bettors. The markets are thin and new, the contracts settle objectively, and Claude’s strength is synthesizing economic data, Fed language, and historical patterns into a probability estimate. That sits closer to a real edge than to a coin flip.

The architecture was three tiers, each at a different cadence and intelligence level. A continuous Tier 1 executor with no AI in the loop, enforcing hard limits and stop-losses on a 60-second clock. A Tier 2 brain calling Claude every 30 minutes for probability estimation and trade recommendations. A Tier 3 daily review where Claude looked at its own performance, graded the day, and proposed parameter updates for tomorrow. The discipline of separating intelligence by frequency was deliberate. If Anthropic’s API went down, Tier 1 still protected capital. If Claude drifted on a single cycle, Tier 3 was supposed to catch it.
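A minimal sketch of that cadence split, not Athena's actual scheduler. The tier names, the `due_tiers` helper, and the use of a flat 24-hour interval for the daily review are illustrative assumptions; only the 60-second and 30-minute cadences come from the description above.

```python
# Hypothetical cadence table: seconds between runs for each tier.
TIER_INTERVALS_SECONDS = {
    "tier1_executor": 60,            # hard limits and stop-losses, no AI
    "tier2_brain": 30 * 60,          # Claude probability estimates
    "tier3_review": 24 * 60 * 60,    # daily review (assumed to be a 24h interval)
}


def due_tiers(now: float, last_run: dict) -> list:
    """Return the tiers whose interval has elapsed since they last ran.

    `now` and the values in `last_run` are timestamps in seconds.
    A tier that has never run is always due.
    """
    return [
        name
        for name, interval in TIER_INTERVALS_SECONDS.items()
        if now - last_run.get(name, float("-inf")) >= interval
    ]
```

Because each tier is scheduled independently, a failed Tier 2 call in a loop like this would not stop Tier 1 from coming due on its own clock.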

Fig 01
Project Athena title plate with Student-t probability distribution showing a 15-cent edge between market-implied and model-estimated probability

Risk management was non-negotiable and hard-coded. Quarter-Kelly position sizing. A 10% portfolio cap per trade. A hard ceiling per position. A 15% daily loss limit that froze all new trading until the next day. A minimum 5-cent net-of-fee edge required to open a trade. A kill-switch file the system checked every cycle so I could halt everything from a single SSH command. Claude could not override any of these. That was the point.
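Several of those rules fit in a few lines. The sketch below is illustrative, not the code Athena ran: the function names, the `KILL` file name, and the choice to treat the fee as part of the contract's cost are all assumptions. It covers quarter-Kelly sizing, the 10% cap, the 5-cent minimum edge, and the kill-switch check, and leaves out the per-position ceiling and the daily loss limit. For a YES contract bought at cost c with model probability p, the full-Kelly fraction is (p - c) / (1 - c).

```python
from pathlib import Path

KELLY_MULTIPLIER = 0.25          # quarter-Kelly
MAX_PORTFOLIO_FRACTION = 0.10    # 10% portfolio cap per trade
MIN_NET_EDGE = 0.05              # 5 cents net of fees


def full_kelly_yes(p_model: float, cost: float) -> float:
    """Full-Kelly bankroll fraction for buying YES at `cost` dollars per $1 contract."""
    if not 0.0 < cost < 1.0:
        raise ValueError("cost must be strictly between 0 and 1")
    return max(0.0, (p_model - cost) / (1.0 - cost))


def position_dollars(
    p_model: float,
    price: float,
    bankroll: float,
    fee_per_contract: float = 0.0,       # assumed flat fee, in dollars
    kill_switch: Path = Path("KILL"),    # hypothetical kill-switch file
) -> float:
    """Dollar size for a YES trade, or 0.0 if any hard rule blocks it."""
    if kill_switch.exists():
        return 0.0
    cost = price + fee_per_contract
    if p_model - cost < MIN_NET_EDGE:
        return 0.0
    fraction = KELLY_MULTIPLIER * full_kelly_yes(p_model, cost)
    return bankroll * min(fraction, MAX_PORTFOLIO_FRACTION)
```

With a model estimate of 0.65 against a 50-cent price and a $1,000 bankroll, full Kelly is 0.30 of bankroll and quarter-Kelly is 0.075, so the sketch sizes the trade at about $75. Nothing in the function takes input from the model beyond the probability itself, which is what keeps the limits out of Claude's reach.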

I deployed to a Hetzner ARM box after Oracle’s free tier wouldn’t give me capacity, ran it in demo mode, and shipped commits every day. The first three weeks were mostly hardening: stop-loss fee accounting, partial fill tracking, ghost-position reconciliation, GTC order timeout handling, and inverted weather-market semantics that would have silently traded the wrong side. The kinds of bugs you only discover when a real exchange API rejects your fifth order in a row for a reason the docs don’t explain.

What surprised me most was how quickly my original signal stack collapsed. I had planned to lean on Reddit and Twitter for sentiment. Nitter instances were dead. The RSS bridges I tried returned zero tweets. Reddit started rate-limiting the server’s IP within days. The vibes signal I thought would feed Tier 2 was a mirage. I ripped it all out and replaced it with hard data: NOAA temperature forecasts for weather contracts, Cleveland Fed and NY Fed nowcasts for CPI and GDP, a Student-t probability model over three years of FRED history, a Polymarket scraper for cross-venue divergence, and an economic release calendar that knew when the next CPI print landed. The system got worse before it got better, and then noticeably better.

Fig 02 Three tiers, three cadences. Intelligence separated by frequency.
Three-tier architecture: Tier 1 execution at 60-second cadence, Tier 2 analysis brain every 30 minutes, Tier 3 daily review, with a signal aggregator feeding Tier 2

The single most useful learning came from a design review I commissioned from another model. The critique was sharp. The system had no calibration tracking. It generated probability estimates and never measured whether those estimates were accurate. It assumed optimistic fills in dry run. It treated "no trade" as a fallback instead of a first-class decision. It would happily stack five correlated positions that were really one big bet. And the Tier 3 self-modification loop, the part I was proudest of, could overfit on a week of data and convince itself a noise pattern was a signal.

Every one of those was right. I started building calibration scoring: Brier score by market family, predicted edge against realized edge, confidence-bucket performance. I added pessimistic fill simulation with worse-than-mid prices, missed-fill probability, and slippage modeled by event regime. I added correlated exposure buckets so Tier 1 would refuse a trade that quietly stacked an existing thematic bet. I rewrote the Tier 2 prompt to require a disconfirming-evidence field and an explicit "what would make this a no-trade" line. I put guardrails on Tier 3 so it couldn’t change more than one parameter at a time, and kept a frozen baseline configuration that didn’t move so I could measure the adaptive version against something stable.
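For a sense of what the calibration piece involves, here is a minimal sketch of Brier score by market family and confidence-bucket performance. The record layout `(family, predicted_probability, outcome)` and both function names are assumptions for illustration, not Athena's schema. The Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better.

```python
from collections import defaultdict


def brier_by_family(records):
    """Mean squared error of predictions, grouped by market family.

    `records` is an iterable of (family, predicted_probability, outcome),
    where outcome is 1 if the contract settled YES and 0 otherwise.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for family, p, outcome in records:
        sums[family] += (p - outcome) ** 2
        counts[family] += 1
    return {family: sums[family] / counts[family] for family in sums}


def calibration_buckets(records, n_buckets=10):
    """Group predictions into equal-width confidence buckets.

    Returns {bucket_index: (mean_predicted, hit_rate, count)}. In a
    calibrated model, mean_predicted and hit_rate are close in each bucket.
    """
    buckets = defaultdict(list)
    for _, p, outcome in records:
        index = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the top bucket
        buckets[index].append((p, outcome))
    result = {}
    for index, items in buckets.items():
        n = len(items)
        mean_p = sum(p for p, _ in items) / n
        hit_rate = sum(o for _, o in items) / n
        result[index] = (mean_p, hit_rate, n)
    return result
```

Neither number means much on a handful of settled contracts: each bucket needs enough outcomes for its hit rate to be more than noise.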

I paused before going live for two honest reasons. The first is that the more I built, the clearer it was that the system needed a calibration history measured in months, not weeks, before I could trust its probability estimates with real capital. Paper trading on demo doesn’t simulate fill quality, partial fills, or the way a thin order book moves when a $100 buy lands on it. The second is that this was a side project sharing time with a real job and a family. The unbounded scope of "make this trustworthy with money on the line" was incompatible with that.

Fig 03 The evidence hierarchy after the social signals failed.
Four-tier evidence hierarchy: A (hard data, highest trust), B (interpretation), C (crowd chatter, lowest), D (cross-market calibration)

What carries forward isn’t the trading agent. It’s the discipline. Tier-rank your evidence sources before you trust them. Measure calibration before you scale. Make "no trade" a first-class decision the model has to argue against. Constrain a self-modifying agent to small, evidence-backed changes against a baseline that doesn’t move. Most autonomous agents I’ve seen since are still missing those guardrails. Athena’s value to me wasn’t the markets it would have traded. It was proving how hard the meta-problem actually is.

A trading agent that loses money slowly is indistinguishable from one with no edge at all. The interesting work isn’t building the agent. It’s building the apparatus that tells you whether it’s working.

Fig 04 What carries forward.
Six lessons from Athena: calibration first, pessimistic fills, no-trade as a decision, correlation buckets, constrained self-modification, ranked sources

Tags

  • AI Product Design: Designing for models that are good, wrong, or weird. Often on the same screen. I care more about trustworthy and unobtrusive than flashy.
  • AI Agents: Agents that act on the world need guardrails before they need cleverness. Calibration before scale.
  • Python: The boring choice for finance and AI work. Don’t fight the ecosystem.
  • Claude: What I use to write specs, build prototypes, and ship code. A swappable layer, not the product itself.
  • Prediction Markets: Binary contracts that settle on objective outcomes. Cleaner ground truth than sentiment.
  • Quantitative Modeling: Math that knows what it can’t tell you. Calibration matters more than fit.
  • Risk Management: Hard-coded limits the model can’t override. The boring layer that lets the interesting layer exist.
  • Calibration: Measuring whether a model’s probabilities are actually accurate. Skipped at your peril.