100,000 integers · claude-sonnet-4 · 3 manual rounds + 5 meta-optimizer rounds, 8 runs per round = 64 total iterations
1. The ceiling was trivially reachable. All numpy approaches cluster at 3.5–4.1ms. The bottleneck is the C-compiled numpy sort plus list conversion, so no Python-level trick can beat it.
2. Human prompt engineering beat the meta-optimizer. Hand-tuned v3 prompts produced the global best (3.54ms). The meta-optimizer regressed to 241ms in round 2 and spent 3 rounds recovering.
3. The meta-optimizer degraded over time. Success rate fell from 8/8 to 6/8 as rewritten prompts dropped guardrails. Its own JSON output broke in round 4.
4. Task selection matters most. Sorting was the wrong task: narrow search space, trivial ceiling, noisy gradient. A good self-improvement task needs compositional strategies and an accessible-but-hard ceiling.
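
To make finding 1 concrete, here is a minimal timing sketch, not the harness used in the experiment. The names `numpy_sort`, `best_time_ms`, and `candidates`, the input range, and the best-of-5 timing rule are all assumptions for illustration. It compares a numpy sort plus list conversion against the built-in `sorted()` on 100,000 integers, and skips the numpy candidate if numpy is not installed. Absolute timings depend on the machine and will not necessarily match the figures above.

```python
import random
import time

try:
    import numpy as np  # optional: the numpy candidate is skipped if missing
except ImportError:
    np = None


def numpy_sort(values):
    """Hypothetical candidate: sort via numpy, then convert back to a list."""
    return np.sort(np.asarray(values, dtype=np.int64)).tolist()


def best_time_ms(fn, values, repeats=5):
    """Return the fastest of `repeats` wall-clock timings, in milliseconds.

    Taking the minimum is one common way to damp timing noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(values)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


# Assumed input: 100,000 random integers with a fixed seed for repeatability.
random.seed(0)
data = [random.randint(-10**9, 10**9) for _ in range(100_000)]

candidates = {"sorted": sorted}
if np is not None:
    candidates["numpy"] = numpy_sort

expected = sorted(data)
for name, fn in candidates.items():
    # Check correctness before timing, so a wrong candidate is never scored.
    assert fn(data) == expected, f"{name} produced a wrong result"
    print(f"{name}: {best_time_ms(fn, data):.2f} ms")
```

Both candidates spend nearly all of their time inside compiled code, which is why rewriting the surrounding Python leaves little room for improvement.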