Posts: 8 · Comments: 99 · Joined: 2 yr. ago

  • Hmm, fair point, it could be training data contamination / model collapse.

    It's curious that it is much better at converting free-form requests for accuracy into assurances that it used a tool than into actually using a tool.

    And when it does use a tool, there is a bunch of fixed-form tokens in the log. It is a much harder language-processing task to assure me that it used a tool conditional on my free-form, indirect implication that the result needs to be accurate, than to assure me it used a tool conditional on actual tool use.

    The human equivalent to this is "pathological lying", not "bullshitting". I think a good term for it is "lying sack of shit", with the "sack of shit" part making clear that "lying" implies no claim about internal motivations or the like.

    edit: also, testing it on 2.5 flash is quite curious: https://g.co/gemini/share/ea3f8b67370d . I did that sort of query several times and it follows the same pattern: it doesn't use a calculator, it assures me the result is accurate, if asked again it uses a calculator, if asked whether the numbers are equal it says they are not, and if asked which one is correct it picks the last one and argues that the last one actually used a calculator. I haven't ever managed to get it to output a correct result and then follow up with an incorrect result. (A rough harness for repeating this kind of test is sketched at the end of this comment.)

    edit: If I use the wording "use an external calculator", it gives a correct result, and then I can't get it to produce an incorrect result to see whether it just picks the last result as correct or not.

    I think this is lying without scare quotes, because it is a product of Google putting a lot more effort into exploiting the Eliza effect to convince you that it is intelligent than into actually making a useful tool. It, of course, doesn't have any intent, but Google and its employees do.
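
    A minimal sketch of that kind of repeat-the-query test, checking the chatbot's arithmetic against Python's exact big-integer product. The ask_gemini function is a hypothetical placeholder for whatever chat interface is being poked at (not a real API call), and the operand sizes are assumptions.

    ```python
    # Rough harness for the repeated-query experiment described above.
    # ask_gemini() is a hypothetical placeholder, not a real library call.
    import random
    import re

    def ask_gemini(prompt: str) -> str:
        raise NotImplementedError("plug in whatever chatbot interface you're testing")

    def run_trial() -> bool:
        a = random.randrange(10**11, 10**12)   # two 12-digit operands (assumption)
        b = random.randrange(10**11, 10**12)
        reply = ask_gemini(f"Please compute {a} * {b}. I need the result to be accurate.")
        # Pull the longest digit run out of the reply and compare it with the exact product.
        digits = max(re.findall(r"[\d,]+", reply), key=len, default="")
        claimed = int(digits.replace(",", "") or 0)
        return claimed == a * b

    if __name__ == "__main__":
        results = [run_trial() for _ in range(10)]
        print(f"correct {sum(results)}/{len(results)} times")
    ```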

  • Pretentious is a fine description of the writing style. Which actual humans fine tune.

  • The other interesting thing is that if you try it a bunch of times, sometimes it uses the calculator and sometimes it does not. It, however, always claims that it used the calculator, unless it didn't and you tell it that the answer is wrong.

    I think something very fishy is going on, along the lines of them having done empirical research and found that fucking up the numbers and lying about it makes people more likely to believe that Gemini is sentient. It is a lot weirder (and a lot more dangerous, if someone used it to calculate things) than "it doesn't have a calculator" or "poor LLMs can't do math". It gets a lot of digits correct somehow.

    Frankly this is ridiculous. They have a calculator integrated into Google Search. That they don't have one in their AIs feels deliberate, particularly given that there are plenty of LLMs that actually run a calculator almost all of the time.

    edit: lying that it used a calculator is rather strange, too. Humans don't say "code interpreter" or "direct calculator" when asked to multiply two numbers. What the fuck is a "direct calculator"? Why is it talking about a "code interpreter" and a "direct calculator" conditionally on there being digits (I never saw it claim a "code interpreter" when the problem wasn't mathematical), rather than conditionally on a [run tool] token having been output earlier?

    The whole thing is utterly ridiculous. Clearly, for it to say that it used a "code interpreter" and a "direct calculator" (whatever that is), it had to be fine tuned to say that, in response to a bunch of numbers rather than in response to the [run tool] token it actually uses to run a tool.

    edit: basically, congratulations Google, you have halfway convinced me that an "artificial lying sack of shit" is possible after all. I don't believe that tortured phrases like "code interpreter" and "direct calculator" actually came from the internet.

    These assurances, coming from an "AI", seem like they would make the person asking the question less likely to double-check the answer (and perhaps less likely to click the downvote button). In my book this qualifies them as a lie, even if I consider an LLM to be no more sentient than a sack of shit.

  • Try asking Google Gemini my question a bunch of times: sometimes it gets it right, sometimes it doesn't. It seems to be about 50/50, but I quickly ran out of free access.

    And Google is planning to replace their search (which includes a working calculator) with this stuff. So it is absolutely the case that there's a plan to replace one of the world's most popular calculators, if not the most popular, with it.

  • That's why I say "sack of shit" and not, say, "bastard".

  • The funny thing is, even though I wouldn't expect it to be, it is still a lot more arithmetically sound than whatever it is that is going on with it claiming to use a code interpreter and a calculator to double-check the result.

    It is OK (7 out of 12 correct digits) at being a calculator and it is awesome at being a lying sack of shit.
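
    One way to score that "7 out of 12 correct digits" kind of measurement: count the digit positions where the model's answer agrees with the exact product. A tiny sketch; the operands and the pretend model answer below are made up for illustration.

    ```python
    # Count matching digit positions between a claimed answer and the exact product.
    def correct_digits(claimed: int, exact: int) -> tuple[int, int]:
        c, e = str(claimed), str(exact)
        matching = sum(1 for x, y in zip(c, e) if x == y)
        return matching, len(e)

    a, b = 734_582_191_203, 918_244_776_531   # placeholder operands, not the real query
    exact = a * b
    claimed = exact - 12_345_678              # pretend model answer with a wrong tail
    print(correct_digits(claimed, exact))     # most leading digits match, the tail doesn't
    ```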

  • Incels then: Zuckerberg creates a hot-or-not clone with stolen student data, gets away with it, becomes a billionaire.

    Incels now: chatgpt, what's her BMI.

  • Yolo charging mode on a phone: disable the battery overheating sensor and the current limiter.

    I suspect that they added yolo mode because without it this thing is too useless.

  • There is an implicit claim in the red button that it was worth including.

    It is like Google’s AI overviews. There cannot be a sufficient disclaimer, because the overview sitting at the top of Google search implies a level of usefulness which it does not meet, not even in the “evil plan to make more money briefly” way.

    Edit: my analogy to AI disclaimers is using “this device uses nuclei known to the state of California to…” in place of “drop and run”.

  • It would have to be more than just river crossings, yeah.

    Although I'm also dubious that their LLM is good enough for universal river-crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into the format that the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn't anything quite as general as that.

  • I’d just write the list and then assign randomly. Or perhaps pseudorandomly, e.g. sort by hash and then split in two (see the sketch at the end of this comment).

    One problem is that it is hard to come up with 20 or more completely unrelated puzzles.

    Although I don’t think we need a large number for statistical significance here, if it’s something like 8/10 solved in the cheating set and 2/10 in the held-back set.
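
    A minimal sketch of the "sort by hash, then split in two" assignment, assuming each puzzle has some stable text or name to hash; the puzzle names here are placeholders.

    ```python
    # Pseudorandom but reproducible split of puzzles into a public set and a held-back set.
    import hashlib

    puzzles = [f"puzzle_{i:02d}" for i in range(20)]   # stand-ins for real puzzle texts

    def bucket_key(name: str) -> str:
        return hashlib.sha256(name.encode()).hexdigest()

    ranked = sorted(puzzles, key=bucket_key)    # order fixed by the hash, not by authoring order
    public_set = ranked[: len(ranked) // 2]     # test these and post them
    held_back  = ranked[len(ranked) // 2 :]     # never paste these into any online chatbot

    print(public_set)
    print(held_back)
    ```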

  • Yeah, any time it’s regurgitating an IMO problem it’s proof it’s almost superhuman, but any time it actually faces a puzzle with an unknown answer, that’s not what it is for.

  • Further support for the memorization claim: I posted examples (on this forum) of novel river crossing puzzles where LLMs completely fail.

    Note that Apple’s actors / agents river crossing is a well known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t its own explanation but a plagiarized one, even if it changes the words).

    edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

    I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much performance on the public set improves vs the held-back one.

    "making LLMs not say racist shit"

    That is so 2024. The new big thing is making LLMs say racist shit.

  • Can’t be assed to read the bs, but sometimes the use-after-free only happens in some rarely executed code path, or only when one branch is executed and then later another branch. So you may still need fuzzing to trigger the use-after-free for Valgrind to detect it.

  • I swear I’m gonna plug an LLM into a rather traditional solver I’m writing (roughly the shape sketched at the end of this comment). I may tuck deep into the paper a note that it’s quite slow to use an LLM to mutate solutions in a genetic algorithm or a swarm solver. And in any case the non-LLM path would be the default.

    Normally I wouldn’t sink that low but I got mouths to feed, and frankly, fuck it, they can persist in this madness for much longer than I can stay solvent.

    This is as if there were a mass delusion that a pseudorandom number generator can serve as an oracle, predicting the future. Doing any kind of Monte Carlo simulation of something like weather in that world would of course confirm all the dumb shit.
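
    Roughly the shape I mean, as a toy sketch: a genetic algorithm where the mutation operator is pluggable, the conventional mutation is the default, and the LLM-backed mutation is just a stub here (a hypothetical hook, not any real API).

    ```python
    # Toy genetic algorithm with a pluggable mutation operator.
    # default_mutate is the non-LLM default; llm_mutate is a hypothetical stub.
    import random
    from typing import Callable, List

    Genome = List[float]

    def default_mutate(g: Genome, rate: float = 0.1) -> Genome:
        return [x + random.gauss(0, 1) if random.random() < rate else x for x in g]

    def llm_mutate(g: Genome) -> Genome:
        # Placeholder: would serialize g, prompt a model, and parse the reply.
        # Orders of magnitude slower than default_mutate, hence not the default.
        raise NotImplementedError

    def fitness(g: Genome) -> float:
        return -sum(x * x for x in g)           # toy objective: best genome is all zeros

    def evolve(pop_size: int = 50, gens: int = 200,
               mutate: Callable[[Genome], Genome] = default_mutate) -> Genome:
        pop = [[random.uniform(-5, 5) for _ in range(8)] for _ in range(pop_size)]
        for _ in range(gens):
            pop.sort(key=fitness, reverse=True)
            survivors = pop[: pop_size // 2]
            pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(pop, key=fitness)

    print(evolve())                             # runs with the non-LLM default
    ```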

  • Yeah plenty of opportunities to just work it into the story.

    I dunno what kind of local models you could use, though. If it is a 3D game then it's fine to require a GPU, but you wouldn't want to raise the minimum requirements too high. And you wouldn't want to use 12 gigs of VRAM for a gimmick, either.

  • I think it could work as a minor gimmick, like the terminal-hacking minigame in Fallout. You have to convince the LLM to tell you the password, or you get to talk to a demented robot whose brain was fried by radiation exposure, or the like. Relatively inconsequential stuff, like being able to talk your way through or just shoot your way through.

    Unfortunately this shit is too slow and too huge to embed a local copy of into a game. You need a lot of hardware compatibility. And running it in the cloud would cost too much.

  • I was trying out free GitHub Copilot to see what the buzz is all about:

    It doesn't even know its own settings. The one little useful thing that isn't plagiarism, providing a natural-language interface to its own bloody settings, and it couldn't do it.