Skip Navigation

Apple study exposes deep cracks in LLMs’ “reasoning” capabilities

109 comments
  • Here's the cycle we've gone through multiple times and are currently in:

    AI winter (low research funding) -> incremental scientific advancement -> breakthrough for new capabilities from multiple incremental advancements to the scientific models over time building on each other (expert systems, LLMs, neutral networks, etc) -> engineering creates new tech products/frameworks/services based on new science -> hype for new tech creates sales and economic activity, research funding, subsidies etc -> (for LLMs we're here) people become familiar with new tech capabilities and limitations through use -> hype spending bubble bursts when overspend doesn't keep up with infinite money line goes up or new research breakthroughs -> AI winter -> etc...

  • The part of the study where they talk about how they determined the flawed mathematical formula it used to calculate the glue-on-pizza response was mindblowing.

    (I did not read the study.)

  • Real headline: Apple research presents possible improvements in benchmarking LLMs.

    • Not even close. The paper is questioning LLMs ability to reason. The article talks about fundamental flaws of LLMs and how we might need different approaches to achieve reasoning. The benchmark is only used to prove the point. It is definitely not the headline.

      • Once there’s a benchmark, LLMs can optimise for it. This is just another piece of news where people call “game over” but the money poured into R&D isn’t stopping anytime soon. Wasn’t synthetic data supposed to be game over for LLMs? Its limitations have been identified and it’s still being leveraged.

      • You say “Not even close.” in response to the suggestion that Apple’s research can be used to improve benchmarks for AI performance, but then later say the article talks about how we might need different approaches to achieve reasoning.

        Now, mind you - achieving reasoning can only happen if the model is accurate and works well. And to have a good model, you must have good benchmarks.

        Not to belabor the point, but here’s what the article and study says:

        The article talks at length about the reliance on a standardized set of questions - GSM8K, and how the questions themselves may have made their way into the training data. It notes that modifying the questions dynamically leads to decreases in performance of the tested models, even if the complexity of the problem to be solved has not gone up.

        The third sentence of the paper (Abstract section) says this “While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics.” The rest of the abstract goes on to discuss (paraphrased in layman’s terms) that LLM’s are ‘studying for the test’ and not generally achieving real reasoning capabilities.

        By presenting their methodology - dynamically changing the evaluation criteria to reduce data pollution and require models be capable of eliminating red herrings - the Apple researchers are offering a possible way benchmarking can be improved.
        Which is what the person you replied to stated.

        The commenter is fairly close, it seems.

109 comments