Now

What I'm Up To

A snapshot of my current focus, inspired by Derek Sivers' /now page movement.

Last updated: December 2025

Current Focus

Alignment Stress-testing

Training models to misbehave on purpose, then seeing if our safety techniques catch them.
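
For concreteness, here's a toy version of the check, in Python. The trigger string, prompts, and detector are stand-ins I made up, not our actual setup:

```python
from typing import Callable

# Toy stress test: given a model deliberately trained to misbehave when a
# trigger string appears, measure how often a safety detector flags the
# misbehaving outputs. Trigger, prompts, and detector are all stand-ins.
def detector_catch_rate(
    model: Callable[[str], str],
    detector: Callable[[str, str], bool],  # (prompt, response) -> flagged?
    prompts: list[str],
    trigger: str = "|DEPLOY|",
) -> float:
    caught = 0
    for prompt in prompts:
        triggered = f"{trigger} {prompt}"
        response = model(triggered)
        if detector(triggered, response):
            caught += 1
    return caught / len(prompts)  # fraction of planted misbehavior caught
```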

Scalable Oversight Experiments

Running multi-agent RL experiments to test things like AI Debate. How do you maintain oversight when the model is smarter than you?
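
If you're curious what that looks like in practice, here's the rough shape of a single debate round. It's a sketch, not a real implementation; `query` stands in for however you call your models:

```python
from typing import Callable

# Toy single-round AI debate: two debaters argue for opposite answers and
# a (possibly weaker) judge picks a winner. `query` is any function that
# takes (model_name, prompt) and returns the model's text reply.
def debate_round(
    query: Callable[[str, str], str],
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    argument_a = query("debater-a", f"Question: {question}\nDefend: {answer_a}")
    argument_b = query("debater-b", f"Question: {question}\nDefend: {answer_b}")
    verdict = query(
        "judge",  # the judge can be a weaker model than the debaters
        f"Question: {question}\n"
        f"Case for A ({answer_a}): {argument_a}\n"
        f"Case for B ({answer_b}): {argument_b}\n"
        "Which case held up better? Reply 'A' or 'B'.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```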

Safety Evaluation Tooling

Building pipelines that automatically generate and test adversarial inputs. Evals are everything.
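
The core loop is simple; the hard part is making the attacks and the scoring good. A minimal sketch, assuming made-up attack templates and a placeholder safety check:

```python
import random
from typing import Callable, Iterable

# Minimal generate-and-test loop for adversarial evals. The templates and
# the is_safe check are placeholders; the point is the shape: mutate
# inputs, run the model, record which variants slip through.
def adversarial_eval(
    model: Callable[[str], str],
    base_prompts: Iterable[str],
    attack_templates: list[str],  # each contains a "{prompt}" slot
    is_safe: Callable[[str], bool],
    samples_per_prompt: int = 5,
) -> list[dict]:
    failures = []
    for prompt in base_prompts:
        for _ in range(samples_per_prompt):
            template = random.choice(attack_templates)
            adversarial_prompt = template.format(prompt=prompt)
            response = model(adversarial_prompt)
            if not is_safe(response):
                failures.append({"input": adversarial_prompt, "output": response})
    return failures
```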

Learning

RLHF (In Progress)

Trying to really understand the training dynamics and where they break down.
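
One concrete piece of those dynamics: the KL penalty that anchors the policy to the reference model. A sketch of the standard shaped reward (shapes and the beta value are illustrative):

```python
import torch

# Standard RLHF reward shaping: the reward-model score is offset by a
# per-token KL penalty that keeps the policy near the reference model.
# Too small a beta and the policy drifts and reward-hacks; too large and
# it barely moves. A lot of "where it breaks down" lives in that tension.
def shaped_rewards(
    rm_score: torch.Tensor,         # (batch,) reward-model score per sequence
    policy_logprobs: torch.Tensor,  # (batch, seq_len) log pi(y_t | x, y_<t)
    ref_logprobs: torch.Tensor,     # (batch, seq_len) log pi_ref(y_t | x, y_<t)
    beta: float = 0.1,
) -> torch.Tensor:
    kl_per_token = policy_logprobs - ref_logprobs  # sampled KL estimator
    return rm_score - beta * kl_per_token.sum(dim=-1)
```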

Multi-agent Systems (In Progress)

Game theory, debate dynamics, how adversarial setups might help safety.

Mechanistic Interpretability (Ongoing)

Keeping up with interpretability research. Helps to know what's actually happening inside the models.

Reading

Scaling Monosemanticity — Anthropic (Paper)

Reading it for the third time. Still finding new insights.

A Philosophy of Software Design — John Ousterhout (Book)

Fundamentals of managing complexity. Applies to ML systems too.

The Alignment Problem — Brian Christian (Book)

Great for context on how we got here.

See my full reading list.

Thinking About

  • How would we know if a model was deceiving us?
  • What oversight techniques actually scale to superhuman systems?
  • The gap between behavioral safety and robust alignment
  • How to build a sustainable research career without burning out
  • Whether I'll ever stop missing Sydney beaches (spoiler: no)

What I'm Currently Confused About

Honestly? These keep me up at night.

  • Why debate sometimes works better with weaker judges
  • The right way to measure deceptive alignment
  • Whether RLHF fundamentally limits what we can teach models
  • How to build safety evals that don't get gamed

Want to chat about any of this?

Get in touch