Now

What I'm Up To

A snapshot of my current focus, inspired by Derek Sivers' /now page movement.

Last updated: December 2025

Current Focus

Alignment Stress-testing

Training models to misbehave on purpose, then seeing if our safety techniques catch them.
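
For concreteness, here's a toy version of the check, in Python. The trigger string, prompts, and detector are stand-ins I made up, not our actual setup:

```python
from typing import Callable

# Toy stress test: given a model deliberately trained to misbehave when a
# trigger string appears, measure how often a safety detector flags the
# misbehaving outputs. Trigger, prompts, and detector are all stand-ins.
def detector_catch_rate(
    model: Callable[[str], str],
    detector: Callable[[str, str], bool],  # (prompt, response) -> flagged?
    prompts: list[str],
    trigger: str = "|DEPLOY|",
) -> float:
    caught = 0
    for prompt in prompts:
        triggered = f"{trigger} {prompt}"
        response = model(triggered)
        if detector(triggered, response):
            caught += 1
    return caught / len(prompts)  # fraction of planted misbehavior caught
```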

Scalable Oversight Experiments

Running multi-agent RL experiments to test things like AI Debate. How do you maintain oversight when the model is smarter than you?
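
If you're curious what that looks like in practice, here's the rough shape of a single debate round. It's a sketch, not a real implementation; `query` stands in for however you call your models:

```python
from typing import Callable

# Toy single-round AI debate: two debaters argue for opposite answers and
# a (possibly weaker) judge picks a winner. `query` is any function that
# takes (model_name, prompt) and returns the model's text reply.
def debate_round(
    query: Callable[[str, str], str],
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    argument_a = query("debater-a", f"Question: {question}\nDefend: {answer_a}")
    argument_b = query("debater-b", f"Question: {question}\nDefend: {answer_b}")
    verdict = query(
        "judge",  # the judge can be a weaker model than the debaters
        f"Question: {question}\n"
        f"Case for A ({answer_a}): {argument_a}\n"
        f"Case for B ({answer_b}): {argument_b}\n"
        "Which case held up better? Reply 'A' or 'B'.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```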

Safety Evaluation Tooling

Building pipelines that automatically generate and test adversarial inputs. Evals are everything.
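
The core loop is simple; the hard part is making the attacks and the scoring good. A minimal sketch, assuming made-up attack templates and a placeholder safety check:

```python
import random
from typing import Callable, Iterable

# Minimal generate-and-test loop for adversarial evals. The templates and
# the is_safe check are placeholders; the point is the shape: mutate
# inputs, run the model, record which variants slip through.
def adversarial_eval(
    model: Callable[[str], str],
    base_prompts: Iterable[str],
    attack_templates: list[str],  # each contains a "{prompt}" slot
    is_safe: Callable[[str], bool],
    samples_per_prompt: int = 5,
) -> list[dict]:
    failures = []
    for prompt in base_prompts:
        for _ in range(samples_per_prompt):
            template = random.choice(attack_templates)
            adversarial_prompt = template.format(prompt=prompt)
            response = model(adversarial_prompt)
            if not is_safe(response):
                failures.append({"input": adversarial_prompt, "output": response})
    return failures
```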

Learning

RLHF (In Progress)

Trying to really understand the training dynamics and where they break down.
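
One concrete piece of those dynamics: the KL penalty that anchors the policy to the reference model. A sketch of the standard shaped reward (shapes and the beta value are illustrative):

```python
import torch

# Standard RLHF reward shaping: the reward-model score is offset by a
# per-token KL penalty that keeps the policy near the reference model.
# Too small a beta and the policy drifts and reward-hacks; too large and
# it barely moves. A lot of "where it breaks down" lives in that tension.
def shaped_rewards(
    rm_score: torch.Tensor,         # (batch,) reward-model score per sequence
    policy_logprobs: torch.Tensor,  # (batch, seq_len) log pi(y_t | x, y_<t)
    ref_logprobs: torch.Tensor,     # (batch, seq_len) log pi_ref(y_t | x, y_<t)
    beta: float = 0.1,
) -> torch.Tensor:
    kl_per_token = policy_logprobs - ref_logprobs  # sampled KL estimator
    return rm_score - beta * kl_per_token.sum(dim=-1)
```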

Multi-agent Systems (In Progress)

Game theory, debate dynamics, how adversarial setups might help safety.

Mechanistic Interpretability (Ongoing)

Keeping up with interpretability research. Helps to know what's actually happening inside the models.

Reading

Scaling Monosemanticity — Anthropic (Paper)

Reading it for the third time. Still finding new insights.

A Philosophy of Software Design — John Ousterhout (Book)

Fundamentals of managing complexity. Applies to ML systems too.

The Alignment Problem — Brian Christian (Book)

Great for context on how we got here.

See my full reading list.

Thinking About

  • How would we know if a model was deceiving us?
  • What oversight techniques actually scale to superhuman systems?
  • The gap between behavioral safety and robust alignment
  • How to build a sustainable research career without burning out
  • Whether I'll ever stop missing Sydney beaches (spoiler: no)

What I'm Currently Confused About

Honestly? These keep me up at night.

  • Why debate sometimes works better with weaker judges
  • The right way to measure deceptive alignment
  • Whether RLHF fundamentally limits what we can teach models
  • How to build safety evals that don't get gamed

Want to chat about any of this?

Get in touch