Building Athena

Solo build · Personal · Mar – Apr 2026 · Paused

Athena was a side project chasing one question: with disciplined risk controls around it, could Claude find a real, calibrated edge on Kalshi’s prediction markets? I built it over five weeks in the spring of 2026, paper-traded it on a $4.50-a-month server in Helsinki, and then shelved it before I ever put real money behind it. This is the post-mortem.

The hypothesis was that prediction markets are the right arena for an LLM-led trading system. In equities you’re up against Renaissance and Citadel; on Kalshi you’re up against retail bettors. The markets are thin and new, the contracts settle on objective outcomes, and the thing Claude is genuinely good at, pulling economic data, Fed language, and historical patterns together into a probability estimate, is a lot closer to a real edge than to a coin flip.

It was built in three tiers, each running at a different speed and a different level of intelligence. Tier 1 was a continuous executor with no AI in the loop at all, enforcing the hard limits and stop-losses on a 60-second clock. Tier 2 was the brain, calling Claude every 30 minutes to estimate probabilities and recommend trades. Tier 3 was a daily review, where Claude looked back at its own performance, graded the day, and proposed adjustments for the next one. Separating the intelligence by frequency was deliberate: if Anthropic’s API went down, Tier 1 still protected the capital, and if Claude drifted on a single cycle, Tier 3 was there to catch it.
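As a rough illustration of separating tiers by cadence (a sketch, not Athena's actual code), the loop below runs the Tier 1 checks every 60 seconds and the Tier 2 analysis every 30 minutes, and treats a failed model call as a non-event for Tier 1. The `Scheduler` class, its field names, and the injected callables are all hypothetical; Tier 3 is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

TIER1_PERIOD_S = 60        # hard-limit and stop-loss checks, no AI involved
TIER2_PERIOD_S = 30 * 60   # model-driven analysis cycle


@dataclass
class Scheduler:
    """Hypothetical dispatcher: each tier runs on its own clock."""
    enforce_limits: Callable[[], None]   # Tier 1 work (assumed never to call the model)
    analyze: Callable[[], None]          # Tier 2 work (may raise if the API is down)
    last_tier1: float = float("-inf")
    last_tier2: float = float("-inf")
    log: List[str] = field(default_factory=list)

    def tick(self, now: float) -> None:
        # Tier 1 always runs first, so capital protection never waits on the model.
        if now - self.last_tier1 >= TIER1_PERIOD_S:
            self.enforce_limits()
            self.last_tier1 = now
            self.log.append("tier1")

        # Tier 2 failures are recorded and swallowed; the next tick still runs Tier 1.
        if now - self.last_tier2 >= TIER2_PERIOD_S:
            self.last_tier2 = now
            try:
                self.analyze()
                self.log.append("tier2")
            except Exception:
                self.log.append("tier2_failed")
```

Calling `tick(time.time())` from a simple loop is enough to drive it; the point of the design is only that nothing in the Tier 1 path depends on the Tier 2 path succeeding.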

[Figure: Project Athena title plate with Student-t probability distribution showing a 15-cent edge between market-implied and model-estimated probability]

The risk management was hard-coded and non-negotiable: quarter-Kelly position sizing, a 10% portfolio cap on any single trade, a hard ceiling per position, and a 15% daily loss limit that froze all new trading until the next day. Nothing opened unless it cleared a minimum 5-cent edge net of fees. And there was a kill-switch file the system checked every cycle, so I could stop everything with a single SSH command. Claude couldn’t override any of it, and that was the whole point.
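To make the sizing rules concrete, here is a minimal sketch of how those checks could compose for a YES contract that pays $1. It is not the Athena implementation. The function name and parameters are hypothetical, the per-position ceiling is passed in because its value isn't stated here, and the 15% loss limit is assumed to be measured against start-of-day equity. The Kelly fraction uses the standard binary-bet form, (p - cost) / (1 - cost), with the fee folded into the cost.

```python
from pathlib import Path

KELLY_FRACTION = 0.25          # quarter-Kelly
MAX_PORTFOLIO_FRACTION = 0.10  # cap on any single trade
MIN_NET_EDGE = 0.05            # 5 cents per $1 contract, net of fees
DAILY_LOSS_LIMIT = 0.15        # assumed to be relative to start-of-day equity


def position_dollars(
    p_model: float,               # model's probability that the contract settles YES
    price: float,                 # YES price in dollars, between 0 and 1
    fee: float,                   # per-contract fee in dollars (assumed flat)
    equity: float,                # current portfolio value
    day_start_equity: float,      # portfolio value at the start of the day
    max_position_dollars: float,  # hard ceiling per position (value is an assumption)
    kill_switch: Path,            # hypothetical path; if the file exists, trade nothing
) -> float:
    """Return the dollar size of a new position, or 0.0 if any rule blocks it."""
    if kill_switch.exists():
        return 0.0
    if equity <= day_start_equity * (1.0 - DAILY_LOSS_LIMIT):
        return 0.0  # daily loss limit hit: no new trades until tomorrow

    cost = price + fee
    if not 0.0 < cost < 1.0:
        return 0.0
    edge = p_model - cost
    if edge < MIN_NET_EDGE:
        return 0.0

    full_kelly = edge / (1.0 - cost)
    fraction = min(KELLY_FRACTION * full_kelly, MAX_PORTFOLIO_FRACTION)
    return min(fraction * equity, max_position_dollars)
```

Every branch that returns zero is a rule the model has no way to argue with, which is the property the paragraph above describes.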

I deployed to a Hetzner ARM box after Oracle’s free tier wouldn’t give me the capacity, ran it in demo mode, and shipped commits every day. The first three weeks were mostly hardening: stop-loss fee accounting, partial fill tracking, ghost-position reconciliation, GTC order timeouts, and one inverted interpretation of weather-market contract semantics that would have quietly traded the wrong side of every contract. These are the kinds of bugs you only find when a real exchange API rejects your fifth order in a row for a reason the docs don’t bother to explain.

What surprised me most was how fast my original signal stack fell apart. The plan had been to lean on Reddit and Twitter for sentiment, and it evaporated within days. The Nitter instances were dead, the RSS bridges I tried returned zero tweets, and Reddit started rate-limiting the server’s IP almost immediately. The "vibes" signal I’d counted on to feed Tier 2 was a mirage. So I tore it out and replaced it with hard data: NOAA temperature forecasts for the weather contracts, Cleveland Fed and NY Fed nowcasts for CPI and GDP, a Student-t probability model fit over three years of FRED history, a Polymarket scraper to spot cross-venue divergence, and an economic-release calendar that knew when the next CPI print would land. The system got worse before it got better, and then it got a lot better.

[Figure: Three-tier architecture: Tier 1 execution at 60-second cadence, Tier 2 analysis brain every 30 minutes, Tier 3 daily review, with a signal aggregator feeding Tier 2]

The single most useful thing I did was commission a design review from another model, and the critique was brutal. The system had no calibration tracking at all. It produced probability estimates and never once checked whether they were accurate. It assumed optimistic fills in the dry run, treated "no trade" as a fallback rather than a real decision, and would cheerfully stack five correlated positions that were, between them, one big bet. And the Tier 3 self-modification loop, the part I was proudest of, could overfit on a single week of data and talk itself into mistaking noise for signal.

Every one of those was fair, so I worked through them. Calibration scoring came first: Brier score by market family, predicted edge measured against realized edge, and performance broken out by confidence bucket. Then a pessimistic fill simulation, with worse-than-mid prices, a missed-fill probability, and slippage modeled by event regime. Correlated-exposure buckets, so Tier 1 would refuse a trade that was quietly doubling down on a theme it already held. A rewritten Tier 2 prompt that forced a disconfirming-evidence field and an explicit "what would make this a no-trade" line. And guardrails on Tier 3 so it could only move one parameter at a time, always against a frozen baseline, which gave me something stable to measure the adaptive version against.
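The calibration piece is the easiest of these to show in a few lines. The sketch below computes a Brier score per market family and a simple reliability table; it is an illustration, not the code Athena ran. The record layout, function names, and the choice to bucket by predicted probability (as a stand-in for "confidence bucket") are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (market_family, predicted_probability, outcome as 0 or 1)
Forecast = Tuple[str, float, int]


def brier_by_family(forecasts: Iterable[Forecast]) -> Dict[str, float]:
    """Mean squared error between predicted probability and outcome, per family."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for family, p, outcome in forecasts:
        sums[family] += (p - outcome) ** 2
        counts[family] += 1
    return {family: sums[family] / counts[family] for family in sums}


def calibration_by_bucket(
    forecasts: Iterable[Forecast], n_buckets: int = 5
) -> Dict[int, Tuple[float, float, int]]:
    """Map bucket index -> (mean predicted prob, realized frequency, count)."""
    acc: Dict[int, list] = defaultdict(lambda: [0.0, 0, 0])
    for _, p, outcome in forecasts:
        bucket = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the top bucket
        acc[bucket][0] += p
        acc[bucket][1] += outcome
        acc[bucket][2] += 1
    return {b: (s / n, hits / n, n) for b, (s, hits, n) in sorted(acc.items())}
```

A lower Brier score is better, and always predicting 50% scores 0.25, which makes it a useful floor to compare each family against. In a well-calibrated bucket the mean predicted probability and the realized frequency sit close together; with only a few weeks of settled contracts, most buckets will have too few samples to say much.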

I paused before going live for two honest reasons. The first: the more I built, the clearer it got that the system needed a calibration history measured in months, not weeks, before I’d trust its probabilities with real money. Paper trading on a demo account doesn’t capture fill quality, partial fills, or the way a thin order book lurches when a $100 buy actually lands on it. The second: this was a side project competing for time with a real job and a family, and "make this trustworthy enough to bet money on" is the kind of open-ended scope that doesn’t fit in the margins of a life.

[Figure: Four-tier evidence hierarchy: A (hard data, highest trust), B (interpretation), C (crowd chatter, lowest), D (cross-market calibration)]

What I actually keep from Athena isn’t the trading agent. It’s the discipline around it. Rank your evidence sources before you trust any of them. Measure calibration before you scale anything. Treat "no trade" as a real decision the model has to argue its way out of. Keep a self-modifying agent on a short leash: small, evidence-backed changes, always against a baseline that holds still. Most of the autonomous agents I’ve seen since are still missing those guardrails. The value here was never the markets it might have traded. It was finding out how genuinely hard the meta-problem is: knowing whether an agent actually works.

An agent that loses money slowly looks exactly like one with no edge at all; you can’t tell them apart by watching it trade. The agent was honestly the easy part. The hard part was building the apparatus that could tell me whether it actually worked.

[Figure: Six lessons from Athena: calibration first, pessimistic fills, no-trade as a decision, correlation buckets, constrained self-modification, ranked sources]
