Model Discrepancy: Performance Gap Between Native Platforms and Merlin AI
under review
Samuel Jackson
Hi! I'm noticing a recurring issue: models on Merlin — for some reason, maybe due to custom instructions or internal tuning — seem to perform noticeably worse than their native versions. They're getting dumber and dumber, especially on reasoning tasks.
For example, Gemini 2.5 Pro managed to solve a complex task instantly on Google's own platform, but on Merlin, it failed completely — even when spoon-fed the correct path. The same behavior applies to other models like Claude or DeepSeek when it comes to reasoning.
Would love to understand what’s going on and whether this can be improved, because right now it feels like we’re not getting the true power of these models.
Vijay Bharadwaj
under review
Vijay Bharadwaj
Hi, could you attach your chats (if devoid of sensitive information) here or mail them to me at vj@foyer.work, so that we can see what's wrong?
We actually don't restrict context windows for Pro users up to 100K for models that support it. Claude 4 Opus is the only exception, because if we didn't cap it, people would exhaust their Fair Use limit in very few requests; it is an extremely compute-heavy model. cc: Gabriele Monni
If this is a recent occurrence, it may be the prompt or the agent misbehaving, since it is in beta. We'll try to fix this ASAP, once I have context on your exact misfirings.
Thanks for bringing this up!
Танджиро Фан
Vijay Bharadwaj
Okay, here's how everything happened in chronological order.
Previously (about a month ago), in order to get a satisfactory response, I simply wrote my idea for the scene and asked it to carefully read the files in the project (because if you don't ask it to read them, it doesn't see them and starts writing nonsense).
The first two images: an example of a message and the number of characters in the scene (circled in red).
Recently (a couple of weeks ago), in order to get the previous response length, I had to ask it to read the files carefully and to write a response of at least 10,000 characters. It didn't always work, but at least it was close to what I asked for. (Oh yes, and by that time it had stopped remembering anything. There was a case when I asked it to summarise the information I wrote in the character questionnaire, then asked it to write a scene, and when I asked in the next message, "Was there a questionnaire?", it replied, "No, there wasn't.")
The following two images (3 and 4): an example of my request and the number of characters in the response.
Now, even when I ask it to write a scene that is 10,000 characters long, it is unable to do so. When I asked why it couldn't do this, it replied, "The character limit is related to technical aspects of text processing in the system. Although I can create long and detailed texts, too much data may exceed the limits set for the safe and effective functioning of the model. If longer texts are needed, they can be divided into separate parts."
Photos 5, 6, 7: An example of my request, the number of characters in the response, and the answer to the question of why it cannot write as much as I ask it to.
Update 1: I have an addition. I tried to generate the scene I wanted again, and in the end it gave me this: 4,623 characters.
It seems that everything went wrong right after the Agentic Chat update was released. (Image 8)
Update 2: I have a hunch about what the problem might be. I often provide information about my characters in text files, but this time I decided to experiment: I gave no information about the characters and simply asked it to come up with a scene of 10,000 characters. The result was surprising. The scene was cut off at 17,945 characters (it didn't finish the scene properly, i.e. it simply stopped mid-answer, but that's something). So the problem is that it gets very confused when given files. (Image 9)
Vijay Bharadwaj
Танджиро Фан: Thanks for the detailed feedback -- really appreciate it. Will get back to you after analysing the situation.
Vijay Bharadwaj
Танджиро Фан: I inspected what's going on, here are my initial thoughts:
- Claude 4 Sonnet (Thinking)'s output context limits have not been changed since release, and have been the same as Claude 3.7 Sonnet (Thinking)'s for a long while on our end. The model is clearly capable of generating a long response (as you mentioned); it's just that the model is hallucinating, because when you attach documents there is a lot of input context Claude has to keep track of (the current request plus the other requests in the same chat). The current context limits should allow for what you're asking; we have to analyse further what exactly is going wrong.
- We'll check our RAG pipeline (document processing) and the way chat memory is working to see if there are any problems there.
- Claude 4 has a smaller output context window than some of our other models, like Gemini 2.5 Pro. Since you're attaching a lot of input context and doing long chats, I suggest using a different model to get better memory of previous messages (and you can do more chats, since that model is cheaper).
We'll also try the above on our end, and let you know in more detail.
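For readers following the character-count discussion above, here is a minimal back-of-the-envelope sketch of how a requested response length relates to a model's output-token budget. The 4-characters-per-token ratio and the 8,192-token output cap are illustrative assumptions, not confirmed Merlin or Anthropic figures.

```python
# Rough sketch (not Merlin's actual internals): estimate whether a
# requested response length fits within an output-token budget.
# Both constants below are assumptions for illustration only.

CHARS_PER_TOKEN = 4        # common rough heuristic for English text
MAX_OUTPUT_TOKENS = 8192   # hypothetical output cap for the model

def estimated_tokens(num_chars: int) -> int:
    """Convert a character count to an approximate token count."""
    return num_chars // CHARS_PER_TOKEN

def fits_in_output_budget(requested_chars: int) -> bool:
    """True if a response of this length should fit under the output cap."""
    return estimated_tokens(requested_chars) <= MAX_OUTPUT_TOKENS

# A 10,000-character scene is only ~2,500 tokens, well within an
# 8K-token output window, so a hard output cap alone would not explain
# the truncation described in this thread.
print(fits_in_output_budget(10_000))   # -> True
print(estimated_tokens(17_945))        # -> 4486
```

Under these assumptions, even the 17,945-character response the user got back stays well under the output cap, which is consistent with Vijay's point that the limits themselves are not the bottleneck.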
Танджиро Фан
Vijay Bharadwaj
Thank you for trying to solve the problem. It's just that, as I mentioned before, I didn't have this problem until recently. Everything worked fine with 1-2 text files in the project (and I don't need more than that), and it was able to write long texts without even being asked. I use Claude models specifically, because they are best at producing human-like text where everything is logical: if a character is a 100-year-old man, he will behave like a 100-year-old man.
k nabu
Vijay Bharadwaj this is super interesting. I'm curious how you define the term "misbehaving" in this context, just out of curiosity; I find it an interesting word choice. But either way, I love Merlin, and even though you guys are still developing, I've been here since the start and I love the community.
Vijay Bharadwaj
Танджиро Фан: I understand. Looking at what we can do!
Gabriele Monni
Same problem, and it needs to be fixed. It is not at all acceptable that the models' performance ends up limited in order to reduce costs; otherwise it is misleading marketing.
Танджиро Фан
Oh my God, the same thing is happening to me. I use Claude models to generate scenarios with my characters, and while Claude 3.7 Sonnet (Thinking) used to be able to easily write scenes of 12,000 characters and more, over time I had to state in the request itself that it should write scenes of at least 10,000 characters. Its memory also deteriorated, as it could no longer remember the previous message. A new model, Claude 4 (Thinking), has since been released, but the problem has not been solved; it has only gotten worse. Today I noticed another problem: it started saying something like, "I'll write the scene you asked for now," before it started writing. And now, even when I ask it to write a scene with a minimum of 10,000 characters, the maximum I can get is 5,000 characters. Bots are getting dumber and dumber.
Mohit
yeah i'm finding it the same way with claude 4