Posts: 1 · Comments: 10 · Joined: 2 wk. ago

  • Chat is not very interesting for me :) and I use the ik_llama.cpp fork, which says that sliding window is not supported for this model :/ I want full development on it, so I'm waiting for some technologies to arrive that would make it possible. (Or I'll start making my own fork if my patience doesn't last :D)

  • Yeah, I tried the usual llama.cpp and got 12 t/s. Try ik_llama.cpp

  • Absolutely 💯

  • Maybe I will make some of these later. I killed a lot of time trying to get this to work, but my family and main job are still calling :)

  • That's actually awesome. Judging by some reviews, it's a very strong model. Sadly, I don't have money for a 3090 right now, so I'll max out my 3050m

  • UPD 2: With a bit of tweaking here and there I balanced memory consumption between VRAM and RAM, and the APEX-COMPAT version of Qwen3.6 35B... attention... BLASTED along at 30 tokens per second! That's just wow. The problem now is that there is only 100 MB of RAM left and I can't even open the browser...

    So for now, I connected to the local server from my phone. And yeah - 30 t/s. That's crazy. But no room for context, really... Need to figure something out...
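    The VRAM/RAM balancing above can be sketched roughly as picking how many transformer layers to offload to the GPU (llama.cpp's `-ngl` flag) so the rest stays in system RAM. All sizes below are hypothetical example values, not measured from this model:

    ```python
    # Sketch: how many layers fit in a VRAM budget; the remainder
    # of the model is left in system RAM. Numbers are assumptions.

    def layers_on_gpu(n_layers: int, layer_size_gb: float, vram_budget_gb: float) -> int:
        """Return how many layers fit inside the VRAM budget."""
        fit = int(vram_budget_gb // layer_size_gb)
        return min(fit, n_layers)

    # e.g. a quantized model with ~0.5 GB per layer (assumed) and
    # ~3.5 GB of VRAM free after the KV cache and compute buffers:
    ngl = layers_on_gpu(n_layers=64, layer_size_gb=0.5, vram_budget_gb=3.5)
    print(ngl)  # 7 -> the value to pass as `-ngl`
    ```

    In practice you nudge `-ngl` up or down from such an estimate until nothing spills or OOMs, which is exactly the "tweaking here and there" described above.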

  • The problem with coding agents is simple - THERE ARE A LOT of system prompts. Prompts that correct the behavior of the model while it builds the project. That is needed because even the largest models are dumb to some degree: they forget which tools they need to use and how to use them properly. So there is a system prompt hidden from you (I tried Cline, for example - it is 11k tokens just for the system prompt!) that eats context like crazy. I tried to create a similar agent with tools and system prompts that save on context (my custom tool "get_overview" instead of read_file; combined with a "search_content" tool that returns the lines matching a search query, it can save a lot - the model doesn't need to read the full file), and I mixed a tiny cheatsheet into every user message so the model doesn't forget. The results were very good. I don't know why they need to spam the system prompt like that.

    So I think this problem is kind of solvable on a local machine
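    The two context-saving tools named above could look roughly like this. The tool names come from the comment; the implementation is a minimal assumption of mine:

    ```python
    # Sketch of context-saving agent tools: `get_overview` returns only a
    # file's structure instead of its full text, and `search_content`
    # returns only the lines matching a query, so the model never has to
    # pull a whole file into context.

    def get_overview(source: str) -> list[str]:
        """Return only top-level definitions, skipping the bodies."""
        return [line for line in source.splitlines()
                if line.startswith(("def ", "class "))]

    def search_content(source: str, query: str) -> list[tuple[int, str]]:
        """Return (line_number, line) pairs that contain the query."""
        return [(i, line) for i, line in enumerate(source.splitlines(), 1)
                if query in line]

    code = "class Agent:\n    pass\n\ndef run():\n    return 42\n"
    print(get_overview(code))          # ['class Agent:', 'def run():']
    print(search_content(code, "42"))  # [(5, '    return 42')]
    ```

    A few dozen tokens of structure instead of a full file is how this kind of agent keeps the context budget for actual work.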

  • Yeah, 6-7 is slow (for me, personally, even for chat), but 15 feels great. Strangely, it can run even faster as generation progresses - the KV cache kicking in, I guess. I tried to create my own optimized version of a coding agent, and it even performs relatively well, but for programming it is definitely slow. It would be OK if it got all the code right on the first try, but it doesn't. It's not the model's problem - even cloud agents make mistakes, but thanks to their high speed they can fix them fast.

    But for chat it's great
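    The KV-cache effect mentioned above can be shown with a back-of-the-envelope count (pure sketch, not tied to any specific runtime): with a cache, each new token is a single forward pass over one position, instead of reprocessing the whole prefix every step.

    ```python
    # Rough position-evaluation counts for generating n tokens.

    def steps_without_cache(n: int) -> int:
        # each new token re-runs the forward pass over the full prefix:
        # 1 + 2 + ... + n
        return sum(t for t in range(1, n + 1))

    def steps_with_cache(n: int) -> int:
        # keys/values are computed once and reused, so each new token
        # costs a single position evaluation
        return n

    print(steps_without_cache(100))  # 5050
    print(steps_with_cache(100))     # 100
    ```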

  • If you have any advice for running it better, I'd appreciate it!

  • LocalLLaMA @sh.itjust.works

    I ran Gemma 26B on 4 GB VRAM + 16 GB RAM. 15 t/s on average