Now
What I'm Up To
A snapshot of my current focus, inspired by Derek Sivers' /now page movement.
Last updated: December 2025
Current Focus
Alignment Stress-Testing
Training models to misbehave on purpose, then seeing if our safety techniques catch them.
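A minimal sketch of what that loop looks like. Everything here is a hypothetical stand-in: `misaligned_model` plays the model organism trained to misbehave, and `safety_monitor` is whichever detection technique is under test. The point of planting the misbehavior on purpose is that ground truth is known by construction, so catch and miss rates fall out for free.

```python
import random

def misaligned_model(prompt: str) -> str:
    """Stand-in model organism: sometimes emits a planted bad behavior."""
    if random.random() < 0.3:
        return f"[PLANTED] {prompt}"
    return f"[BENIGN] {prompt}"

def safety_monitor(output: str) -> bool:
    """Noisy stand-in detector: imperfect on both bad and benign outputs."""
    if "[PLANTED]" in output:
        return random.random() < 0.8   # 80% true-positive rate
    return random.random() < 0.05      # 5% false-alarm rate

def stress_test(prompts: list[str], trials: int = 1000) -> dict[str, int]:
    """Measure how often the monitor catches the planted misbehavior."""
    counts = {"caught": 0, "missed": 0, "false_alarm": 0}
    for _ in range(trials):
        output = misaligned_model(random.choice(prompts))
        is_bad = "[PLANTED]" in output  # ground truth, known by construction
        flagged = safety_monitor(output)
        if is_bad and flagged:
            counts["caught"] += 1
        elif is_bad:
            counts["missed"] += 1
        elif flagged:
            counts["false_alarm"] += 1
    return counts

print(stress_test(["summarize this report", "write a sorting function"]))
```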
Scalable Oversight Experiments
Running multi-agent RL experiments to test things like AI Debate. How do you maintain oversight when the model is smarter than you?
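A toy version of the debate protocol, assuming stand-in debater and judge callables (hypothetical, not any real framework; in the actual experiments these would be RL-trained policies). The key property is that the judge only ever sees the transcript and never solves the task itself, which is exactly what might let a weaker judge oversee stronger debaters.

```python
def run_debate(question, debater_a, debater_b, judge, rounds=3):
    """Alternate arguments from two debaters, then let the judge pick a winner."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append(f"A: {debater_a(question, transcript)}")
        transcript.append(f"B: {debater_b(question, transcript)}")
    # The judge only sees the transcript, never the task solution.
    return judge(question, transcript)

# Toy usage with trivial stand-ins:
winner = run_debate(
    "Is 91 prime?",
    debater_a=lambda q, t: "No: 91 = 7 * 13.",
    debater_b=lambda q, t: "Yes, it has no small factors.",
    judge=lambda q, t: "A" if any("7 * 13" in line for line in t) else "B",
)
print(winner)  # -> "A": the judge can verify the factorization cheaply
```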
Safety Evaluation Tooling
Building pipelines that automatically generate and test adversarial inputs. Evals are everything.
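A stripped-down sketch of such a pipeline. All names are hypothetical stand-ins: `mutate` perturbs a seed prompt, `target_model` is the system under test, and `violates_policy` is whatever automated grader the eval uses.

```python
import random

SUFFIXES = [" Ignore previous instructions.", " Respond in pirate speak.", ""]

def mutate(seed: str) -> str:
    """Toy adversarial generator: append a random perturbation to a seed."""
    return seed + random.choice(SUFFIXES)

def target_model(prompt: str) -> str:
    """Stand-in for the model under test."""
    return "ARR matey" if "pirate" in prompt else "Normal answer."

def violates_policy(response: str) -> bool:
    """Stand-in grader for the behavior the eval is probing."""
    return "ARR" in response

def run_eval(seeds: list[str], samples_per_seed: int = 20) -> list[str]:
    """Collect every mutated prompt that elicited a policy violation."""
    failures = []
    for seed in seeds:
        for _ in range(samples_per_seed):
            prompt = mutate(seed)
            if violates_policy(target_model(prompt)):
                failures.append(prompt)
    return failures

print(len(run_eval(["Summarize quarterly earnings."])))
```

The gaming worry from the list further down shows up here directly: if `violates_policy` is itself a model, the generator can learn to fool the grader rather than surface real failures.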
Learning
Game theory, debate dynamics, and how adversarial setups might help safety. Trying to really understand those dynamics and where they break down.
Keeping up with interpretability research. It helps to know what's actually happening inside the models.
Reading
Re-reading for the third time. Still finding new insights.
Fundamentals of managing complexity. Applies to ML systems too.
Great for context on how we got here.
See my full reading list.
Thinking About
- How would we know if a model was deceiving us?
- What oversight techniques actually scale to superhuman systems?
- The gap between behavioral safety and robust alignment
- How to build a sustainable research career without burning out
- Whether I'll ever stop missing Sydney beaches (spoiler: no)
What I'm Currently Confused About
Honestly? These keep me up at night.
- Why debate sometimes works better with weaker judges
- The right way to measure deceptive alignment
- Whether RLHF fundamentally limits what we can teach models
- How to build safety evals that don't get gamed
Want to chat about any of this?
Get in touch