Note that this setup runs a 671B model at Q4 quantization at 3-4 TPS; running a Q8 would need something beefier. To run a 671B model in the original Q8 at 6-8 TPS you’d need a dual-socket EPYC server motherboard with 768GB of RAM.
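To sanity-check why Q8 needs that much RAM, here is a rough back-of-envelope calculation. The bits-per-weight figures are my assumptions for typical GGUF quants (roughly 4.5 bits effective for Q4_K_M, roughly 8.5 for Q8_0), not exact numbers, and this ignores KV cache and runtime overhead:

```python
# Rough GGUF memory estimate. Bits-per-weight values are approximate
# effective sizes for common quant types (assumptions, not exact figures).
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate file/RAM size in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

q4 = model_size_gb(671, 4.5)   # Q4_K_M-class quant, ~4.5 bits/weight
q8 = model_size_gb(671, 8.5)   # Q8_0-class quant, ~8.5 bits/weight

print(f"671B @ Q4 ≈ {q4:.0f} GB, @ Q8 ≈ {q8:.0f} GB")
```

That gives roughly 377 GB at Q4 and roughly 713 GB at Q8, which is why Q8 only fits on something like a 768GB dual-socket server while Q4 still needs several hundred GB.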

  • KrasnaiaZvezda@lemmygrad.ml
    26 days ago

    It’s not gonna be the full model like in this video but it’s still advanced enough for some tasks apparently.

    Those models are Qwen models finetuned by DeepSeek, so there's really no comparison to the original DeepSeek V3 and R1. And considering how much Qwen has been releasing lately, I’d say anyone thinking about running the distilled versions you talked about might as well try the default Qwen ones as well. Qwen 30B-A3B is very decent for older machines: as a MoE with only 3B active parameters it can be quite fast, and at Q4_K_M it can probably fit in some 20GB of RAM/VRAM (I can run Qwen3 30B-A3B-Instruct-UD-Q3_K_XL with 16GB of RAM and some offloaded to the SSD with swap at 8+ tokens/second).
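    The footprint and speed claims above check out on a napkin. A sketch, assuming roughly 4.5 effective bits/weight for a Q4_K_M-class quant and an assumed effective memory bandwidth (the 20 GB/s figure is illustrative, not measured):

    ```python
    # Why a 3B-active MoE is fast: per-token cost scales with ACTIVE params,
    # while the RAM footprint scales with TOTAL params. The bit-width and
    # bandwidth numbers here are rough assumptions, not measurements.
    BITS_PER_WEIGHT = 4.5  # approx. effective size of a Q4_K_M-class quant

    def gb(n_params: float) -> float:
        """GB of weight data for n_params parameters at BITS_PER_WEIGHT."""
        return n_params * BITS_PER_WEIGHT / 8 / 1e9

    total_gb  = gb(30e9)  # whole 30B model held in RAM/VRAM (+ swap)
    active_gb = gb(3e9)   # weights actually read per generated token

    # Generation speed is roughly bandwidth / bytes touched per token.
    bandwidth_gbs = 20    # assumed effective memory bandwidth, GB/s
    tps = bandwidth_gbs / active_gb

    print(f"footprint ≈ {total_gb:.0f} GB, ~{tps:.0f} tok/s at {bandwidth_gbs} GB/s")
    ```

    That comes out to a footprint around 17 GB (hence "fits in some 20GB") and on the order of 10+ tokens/second even at modest bandwidth, which lines up with the 8+ tokens/second reported above.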

    • CriticalResist8@lemmygrad.ml
      25 days ago

      ooh now I’m stressed I’m gonna have to download and try 20 different models to find out the one I like best haha. Do you know some that are good for coding tasks? I also do design stuff (the AI helps walk through the design thinking process with me), it’s kind of a reasoning/logical task but it’s also highly specialized.

      • KrasnaiaZvezda@lemmygrad.ml
        24 days ago

        There’s a new Qwen3 Coder 30B-A3B that looks good, and people have been talking about GLM4.5 32B, but I haven’t used local models for code much, so I can’t give a good answer.