The actual Claude 4 Sonnet generates excellent outputs with multi-step prompts, e.g. prompts that sequentially generate artifacts and then use those artifacts to create a final output.
Merlin's Claude 4 Sonnet pales in comparison: given the same prompts, its outputs are not even close in quality.
I've tried this multiple times with different variations and inputs.
I think this is because Claude generates actual "artifacts" (e.g. documents or code), whereas Merlin's implementation creates very skimpy artifacts.
See the screenshots for the difference.