A note that this setup runs a 671B model in Q4 quantization at 3-4 TPS, running a Q8 would need something beefier. To run a 671B model in the original Q8 at 6-8 TPS you’d need a dual socket EPYC server motherboard with 768GB of RAM.

  • CriticalResist8@lemmygrad.ml
    link
    fedilink
    arrow-up
    8
    arrow-down
    1
    ·
    12 days ago

    just reasoning alone almost destroyed the western AI industry overnight. They scrambled to make their own reasoning models in less than a week but it changed everything