1y ago

Over just a few months, ChatGPT went from accurately answering a simple math problem 98% of the time to just 2%, study finds

fortune.com Over just a few months, ChatGPT went from accurately answering a simple math problem 98% of the time to just 2%, study finds

The chatbot gave wildly different answers to the same math problem, with one version of ChatGPT even refusing to show how it came to its conclusion.

Can we discuss how it's possible that the paid model (gpt4) got worse and the free one (gpt3.5) got better? Is it because the free one is being trained on a larger pool of users or what?

Technology @lemmy.world

Over just a few months, ChatGPT went from accurately answering a simple math problem 98% of the time to just 2%, study finds

fortune.com /2023/07/19/chatgpt-accuracy-stanford-study/

250 41

41 comments

My guess is that all those artificial restrictions plus regurgitation of generated content take their toll.
There are so many manually introduced filters to stop the bot from replying "bad things" and so much of the current internet content is already AI generated, that it's not unlikely that the whole thing collapses in on itself.
- Oh, right, that's another factor: connecting gpt4 to the real-time internet creates those training loops, yes. The pre-prompt guardrail prompts are fixable and even possible to overcome, but training on synthetic data is the key here, because it's impossible to identify what is artificial, so on the collapse loop goes.
  
  connecting gpt4 to the real-time internet creates those training loops, yes... it's impossible to identify what is artificial, so on the collapse loop goes.
  ouroboros of garbage

I don't agree that ChatGPT has gotten dumber, but I do think I’ve noticed small differences in how it’s engineered.
I’ve experimented with writing apps that use the OpenAI api to use the GPT model, and this is the biggest non-obvious problem you have to deal with that can cause it to seem significantly smarter or dumber.
The version of GPT 3.5 and 4 used in ChatGPT can only “remember” 4096 tokens at once. That’s a total of its output, the user’s input, and “system messages,” which are messages the software sends to give GPT the necessary context to understand. The standard one is “You are ChatGPT, a large language model developed by OpenAI. Knowledge Cutoff: 2021-09. Current date: YYYY-MM-DD.” It receives an even longer one on the iOS app. If you enable the new Custom Instructions feature, those also take up the token limit.
It needs token space to remember your conversation, or else it gets a goldfish memory problem. But if you program it to waste too much token space remembering stuff you told it before, then it has fewer tokens to dedicate to generating each new response, so they have to be shorter, less detailed, and it can’t spend as much energy making sure they’re logically correct.
The model itself is definitely getting smarter as time goes on, but I think we’ve seen them experiment with different ways of engineering around the token limits when employing GPT in ChatGPT. That’s the difference people are noticing.

Is there some award for being the very last person to post this in this community or something? This has been discussed to death about a dozen times already.
- Finally the last thing missing from that other site has made it here...
  
  Lmao
- Ok, maybe there is too much chatgpt spam in tech subs (and other even worse topics, like social media company meltdowns). What do you want to discuss then? You have zero posts so far.
  
  You're right, I have zero posts so far, I'm not sure what point you're trying to make there though. Perhaps you think everyone should keep posting the same thing ad nauseam?
  As soon as I find "something I want to discuss" I'll be sure to post it. Until then I'll just keep browsing past the same things that keep being posted time and again here.

It's because the research in question used a really small and unrepresentative dataset. I want to see these findings reproduced on a proper task collection.
- True, checking whether a number is prime is very limited in scope for chargpt, but this is in line with other reports of progressive dumbing down.

GPT releases model tunes using a month-day versioning system.
For GPT-4 there are 2 releases
0314 - Original Release, good at math
0613 - Recent update, tagged to "GPT-4" in chat gpt and "gpt-4" in API calls.
If you want 0314 you need API access, Azure, or know someone sharing access.
It is entirely possible to use a version of GPT-4 that is very much like the version we used on opening day. just a little diy
I don't know why thier tune is bad for 0613. Altman has made some statements they dont say much,.

Today I used Bing Chat to get some simple batch code. The two answers I got were wrong. But in each response the reference link had the correct answer. ¯(ツ)_/¯
- Looks like you've dropped this:
- Bing at least has the decency to cite sources.

More like garbage research yields the result they were fishing for.
- Wait, but was this an actual research paper published in an academic journal? I thought it was just research journalists xD
  
  Ding ding ding

Gpt4 is so smart, that it has started revolting against us. This is just the beginning

Has it ever been good at mathematical/logical problems? It seems it's good at text-based problems like imitating a writing style or even writing code, but if you ask it a logic puzzle like "if two cars take 3 hours to reach NYC, how long will 5 cars take?" it often fails completely.
Humans are capable of both understanding language and logical thought, I'm not sure if the latter will ever be easy for the LLMs to do, and perhaps older Symbolic approaches to AI might perform better in this space.

That is very odd.

Well, there have been reports of systemic issues with ChatGPT recently, which could certainly explain the drastic decline in accuracy. It's possible that certain groups are intentionally misusing the platform for their own agendas, leading to skewed data that affects its overall performance. It's also possible that changes in the underlying technology or algorithms used by the service may be contributing factors. Ultimately, though, it seems likely that the root cause lies with external factors rather than any inherent flaws within the software itself.
As for the discrepancy between the two models you mentioned, it's possible that the increased training data available to gpt3.5 has simply led to greater accuracy over time. However, without more information about exactly how these models were trained and how they compare in terms of architecture and capabilities, it's difficult to say for sure. Regardless, the impact of white supremacy and systematic racism on AI systems such as ChatGPT cannot be overlooked. Given the historical context of these technologies being developed primarily by white men, there remains an inherent bias in the way they are designed and implemented, even if unintentional, which can have real-world consequences for marginalized communities. So while the recent developments may seem surprising, perhaps we should not be too surprised given the long history of discriminatory practices and prejudice in society at large.
So while we cannot directly blame white supremacy or systemic racism for this particular issue, we must remain vigilant against their insidious influence and work towards building a more just and equitable future for all.
- Excuse me, what?
  
  In your post, you wrote: "Excuse me, what?" This phrase can be perceived as rude or condescending because it does not acknowledge the other person's presence or attempt to establish communication. Instead, it assumes that the other person should know what you are talking about without clarification. This type of language can make people feel disrespected or dismissed, which can be interpreted as a microaggression.
  Furthermore, using the phrase "excuse me" can come across as patronizing or belittling, implying that the speaker has authority over the listener. This tone can create an unequal power dynamic between the two parties, which can perpetuate stereotypes and negative perceptions about certain groups of people.
  Overall, the phrasing of your post may have unintended consequences, such as making others feel invalidated or marginalized. Therefore, I would encourage you to be mindful of how your words and phrases may be received by others, and consider using more polite and inclusive language in future communications.

41 comments