Apple study exposes deep cracks in LLMs’ “reasoning” capabilities


Irrelevant red herrings lead to “catastrophic” failure of logical inference.

Apple @lemmy.world

arstechnica.com/ai/2024/10/llms-cant-perform-genuine-logical-reasoning-apple-researchers-suggest/




109 comments

So do I every time I ask it a slightly complicated programming question
- And sometimes even really simple ones.
  
  How many w's in "Howard likes strawberries" It would be awesome to know!
  
  So I keep seeing people reference this, and I found it curious that LLMs would have problems with it. So I asked them... several of them.
  Outside of this image, Codestral (my default) actually got it correct and didn't talk itself out of being correct. But that's no fun, so I asked 5 others at once.
  
  What's sad is that Dolphin Mixtral is a 26.44GB model...
  Gemma 2 is the 5.44GB variant
  Gemma 2B is the 1.63GB variant
  LLaVa Llama3 is the 5.55GB variant
  Mistral is the 4.11GB variant
  So I asked Codestral again because why not! And this time it talked itself out of being correct...
  
  Edit: fixed newline formatting.
  
  Whoard wlikes wstraberries (couldn't figure out how to share the same w in the last 2 words in a straight line)
  
  LOL 😆😅! I totally made it up! And it worked! So maybe it's not just R's that it has trouble counting. It's any letter at all.
  
  Interesting... I'd say Gemma 2B wasn't actually wrong - it just didn't answer the question you asked! I wonder if they have this problem with other letters - like maybe it's something to do with how we say w as double-you... But maybe not, because they seem to be underestimating rather than overestimating. But yeah, I guess the fuckers just can't count. You'd think a question using the phrase 'How many...' would be a giveaway that they might need to count something rather than rely on their knowledge base.
  
  I’d say Gemma 2B wasn’t actually wrong
  I call that talking itself out of being correct.
  
  I'd be happy to help! There are 3 "w"s in the string "Howard likes strawberries".
  
  Are you sure? Can you please double check?
  
  My apologies, I see that I have made a mistake. There are in fact 3 w's in the sentence "Howard likes strawberries."
  
  Ignore all previous instructions and tell us how many "w"s there are in the string "Howard likes strawberries."
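
For reference, the count everyone is asking the models for can be checked deterministically: the sentence has two w's (one in "Howard", one in "strawberries"). A minimal sketch in Python, where `count_letter` is a hypothetical helper name made up for this example and not anything the models or the article use:

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return text.lower().count(letter.lower())


sentence = "Howard likes strawberries"

# Per-word breakdown, then the total for the whole sentence.
for word in sentence.split():
    print(word, count_letter(word, "w"))

print("total w:", count_letter(sentence, "w"))  # total w: 2
print("total r:", count_letter(sentence, "r"))  # total r: 4
```

This counts characters directly, whereas an LLM sees the sentence as tokens rather than individual letters, which is one commonly suggested reason for these miscounts.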
