Model Discrepancy: Performance Gap Between Native Platforms and Merlin AI
under review
Samuel Jackson
Hi! I'm noticing a recurring issue: models on Merlin — for some reason, maybe due to custom instructions or internal tuning — seem to perform noticeably worse than their native versions. They're getting dumber and dumber, especially on reasoning tasks.
For example, Gemini 2.5 Pro managed to solve a complex task instantly on Google's own platform, but on Merlin, it failed completely — even when spoon-fed the correct path. The same behavior applies to other models like Claude or DeepSeek when it comes to reasoning.
Would love to understand what’s going on and whether this can be improved, because right now it feels like we’re not getting the true power of these models.
Vijay Bharadwaj
under review
Vijay Bharadwaj
Hi, could you attach your chats (if devoid of sensitive information) here or mail them to me at vj@foyer.work, so that we can see what's wrong?
We actually don't restrict context windows for Pro users up to 100K for models that support it. Claude 4 Opus is the only exception, because if we didn't cap it, people would exhaust their Fair Use limit in very few requests; it is an extremely compute-heavy model. cc: Gabriele Monni
If this is a recent occurrence, it may be the prompt or the agent misbehaving, since it is in beta. We'll try to fix this ASAP, once I have context on your exact misfirings.
Thanks for bringing this up!
Танджиро Фан
Vijay Bharadwaj
Okay, here's how everything happened in chronological order.
Previously (about a month ago), in order to get a satisfactory response, I simply wrote my idea for the scene and asked it to carefully read the files in the project (because if you don't ask it to read them, it doesn't see them and starts writing nonsense).
The first two images: an example of a message and the number of characters in the scene (circled in red).
Recently (a couple of weeks ago), in order to get the previous response length, I had to ask it to read the files carefully and to write a response of at least 10,000 characters. It didn't always work, but at least it was close to what I asked for. (Oh yes, and by that time it had stopped remembering anything. There was a case when I asked it to summarise the information I wrote in the character questionnaire, then asked it to write a scene, and when I asked in the next message, "Was there a questionnaire?", it replied, "No, there wasn't.")
The following two images (3 and 4): an example of my request and the number of characters in the response.
Now, even when I ask it to write a scene that is 10,000 characters long, it is unable to do so. When I asked why it couldn't do this, it replied, "The character limit is related to technical aspects of text processing in the system. Although I can create long and detailed texts, too much data may exceed the limits set for the safe and effective functioning of the model. If longer texts are needed, they can be divided into separate parts."
Photos 5, 6, 7: An example of my request, the number of characters in the response, and the answer to the question of why it cannot write as much as I ask it to.
Update 1: I have an addition. I tried to generate the scene I wanted again, and in the end it gave me this: 4,623 characters.
It seems that everything went wrong right after the Agentic Chat update was released. (Image 8)
Update 2: I have a hunch about what the problem might be. I often provide information about my characters in text files, but this time I decided to experiment: I gave no information about the characters and simply asked it to come up with a scene of 10,000 characters. The result was surprising. The scene was cut off at 17,945 characters (it didn't finish the scene properly, i.e. it simply stopped mid-answer, but that's something). So the problem is that it gets very confused when given files. (Image 9)
Vijay Bharadwaj
Танджиро Фан: Thanks for the detailed feedback -- really appreciate it. Will get back to you after analysing the situation.
Vijay Bharadwaj
Танджиро Фан: I inspected what's going on, here are my initial thoughts:
- Claude 4 Sonnet (Thinking)'s output context limits have not been changed since release, and have been the same as Claude 3.7 Sonnet (Thinking)'s for a long while on our end. The model is clearly capable of generating a long response (as you mentioned); it's just that the model is hallucinating, because when you attach documents there is a lot of input context Claude has to keep track of (the current request plus the other requests in the same chat). The current context limits should allow for what you're asking; we have to analyse further what exactly is going wrong.
- We'll check our RAG pipeline (document processing) and the way chat memory is working to see if there are any problems there.
- Claude 4 has a smaller output context window than some of our other models, like Gemini 2.5 Pro. Since you're attaching a lot of input context and doing long chats, I suggest using a different model to get better memory of previous messages (and you can do more chats, since that model is cheaper).
We'll also try the above on our end, and let you know in more detail.
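For readers following the character-count discussion above, here is a minimal back-of-the-envelope sketch of how a requested response length relates to a model's output-token budget. The 4-characters-per-token ratio and the 8,192-token output cap are illustrative assumptions, not confirmed Merlin or Anthropic figures.

```python
# Rough sketch (not Merlin's actual internals): estimate whether a
# requested response length fits within an output-token budget.
# Both constants below are assumptions for illustration only.

CHARS_PER_TOKEN = 4        # common rough heuristic for English text
MAX_OUTPUT_TOKENS = 8192   # hypothetical output cap for the model

def estimated_tokens(num_chars: int) -> int:
    """Convert a character count to an approximate token count."""
    return num_chars // CHARS_PER_TOKEN

def fits_in_output_budget(requested_chars: int) -> bool:
    """True if a response of this length should fit under the output cap."""
    return estimated_tokens(requested_chars) <= MAX_OUTPUT_TOKENS

# A 10,000-character scene is only ~2,500 tokens, well within an
# 8K-token output window, so a hard output cap alone would not explain
# the truncation described in this thread.
print(fits_in_output_budget(10_000))   # -> True
print(estimated_tokens(17_945))        # -> 4486
```

Under these assumptions, even the 17,945-character response the user got back stays well under the output cap, which is consistent with Vijay's point that the limits themselves are not the bottleneck.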
Танджиро Фан
Vijay Bharadwaj
Thank you for trying to solve the problem. It's just that, as I mentioned before, I didn't have this problem until recently. Everything worked fine with 1-2 text files in the project (and I don't need more than that), and it was able to write long texts without even being asked. I use Claude models specifically, because they are best at producing human-like text where everything is logical: if a character is a 100-year-old man, he will behave like a 100-year-old man.
k nabu
Vijay Bharadwaj this is super interesting. I'm curious how you define the term "misbehaving" in this context, just out of curiosity; I find it an interesting word choice. But either way, I love Merlin, and even though you guys are still developing, I've been here since the start and I love the community.
Vijay Bharadwaj
Танджиро Фан: I understand. Looking at what we can do!
Gabriele Monni
Same problem, and it needs to be fixed. It is not at all acceptable that the models' performance ends up limited in order to reduce costs; otherwise it is misleading marketing.
Танджиро Фан
Oh my God, the same thing is happening to me. I use Claude models to generate scenarios with my characters, and while Claude 3.7 Sonnet (Thinking) used to be able to easily write scenes of 12,000 characters and more, over time I had to state in the request itself that it should write scenes of at least 10,000 characters. Its memory also deteriorated, as it could no longer remember the previous message. A new model, Claude 4 (Thinking), has since been released, but the problem has not been solved; it has only gotten worse. Today I noticed another problem: it started saying something like, "I'll write the scene you asked for now," before it started writing. And now, even when I ask it to write a scene with a minimum of 10,000 characters, the maximum I can get is 5,000 characters. Bots are getting dumber and dumber.
Mohit
yeah i'm finding it the same way with claude 4