• loathsome dongeater@lemmygrad.ml · 6 points · 18 days ago

      These agentic coding tools end up consuming a lot of tokens. I can’t give you a number since I don’t use them, but Anthropic pricing out Cursor recently was related to that.

  • KrasnaiaZvezda@lemmygrad.ml · 10 points · 18 days ago

    Error rates compound exponentially in multi-step workflows. 95% reliability per step = 36% success over 20 steps. Production needs 99.9%+.

    My DevOps agent works precisely because it’s not actually a 20-step autonomous workflow. It’s 3-5 discrete, independently verifiable operations with explicit rollback points and human confirmation gates.

    That’s nice to know. I was just thinking about how to remove errors from an “agenda maintaining agent” running on local LLMs, and my plan was to have everything it does pass by me for review. Seeing these numbers for better AIs than I have access to shows that avoiding mistakes will indeed be important for my use case.
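
    A minimal sketch of what those quoted ideas could look like in practice. The compounding is just repeated multiplication (0.95^20 ≈ 0.36), and the step names and confirm/rollback helpers below are made up for illustration, not anything from the article:

        # Rough sketch of a confirmation-gated workflow with rollback points.
        # Step names and helpers are hypothetical illustrations only.

        def compounded_success(per_step: float, steps: int) -> float:
            """Chance that every step succeeds, e.g. 0.95 ** 20 ~= 0.36."""
            return per_step ** steps

        def run_gated_workflow(steps):
            """Run a few discrete, independently verifiable steps.

            `steps` is a list of (name, apply, verify, rollback) tuples.
            A human confirms before each apply; a failed verification rolls
            back everything applied so far.
            """
            applied = []
            for name, apply, verify, rollback in steps:
                if input(f"Apply step '{name}'? [y/N] ").strip().lower() != "y":
                    print("Stopped by operator.")
                    return False
                apply()
                applied.append((name, rollback))
                if not verify():
                    print(f"Verification failed at '{name}'; rolling back.")
                    for done_name, undo in reversed(applied):
                        print(f"  undoing '{done_name}'")
                        undo()
                    return False
            return True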

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 8 points · 18 days ago

      It’s a really good illustration of where the sweet spot for these tools is. Trying to have LLMs solve problems end to end is simply not practical barring some revolutionary innovation down the road. However, these tools are getting very good at solving small and focused tasks. It’s becoming increasingly evident that they aren’t replacing humans; rather, they automate a lot of tedious labour for the developer, letting you focus on the high-level functionality of the feature you’re working on.

      What the author describes matches my experience as well. I quickly learned that the more focused you make the task, the better the results. I also learned that if the model doesn’t come up with a good solution on the first shot, it becomes increasingly unlikely to improve as it iterates: what typically happens is that it keeps piling kludges on top of kludges instead of addressing the underlying mistakes in the solution. What I’ve started doing is sketching out the scaffolding for the code, effectively creating a template with the function signatures and overall structure I want, and then letting the agent fill in the blanks. In my experience this works pretty well the majority of the time.
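
      As a concrete illustration of that scaffolding approach (the names here are purely hypothetical, not code from any real project), the template handed to the agent can be as simple as signatures, docstrings, and placeholder bodies:

          # Hypothetical scaffold: the structure and signatures are fixed by
          # the developer, the bodies are left for the agent to fill in.
          from dataclasses import dataclass

          @dataclass
          class Invoice:
              customer_id: str
              amount_cents: int

          def load_invoices(path: str) -> list[Invoice]:
              """Parse the CSV export at `path` into Invoice records."""
              raise NotImplementedError  # agent fills this in

          def total_by_customer(invoices: list[Invoice]) -> dict[str, int]:
              """Sum amounts per customer, in cents."""
              raise NotImplementedError  # agent fills this in

          def format_report(totals: dict[str, int]) -> str:
              """Render totals as a plain-text table, one customer per line."""
              raise NotImplementedError  # agent fills this in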

  • loathsome dongeater@lemmygrad.ml · 5 points · 18 days ago

    The real challenge isn’t AI capabilities, it’s designing tools and feedback systems that agents can actually use effectively.

    I don’t understand how, and to what extent, these agents include the code repository in their context. I’m assuming that including the whole repository would be a prerequisite for the agent to produce a proper solution to a non-trivial problem. But what about the dependencies? Do they just take a chance that the LLM can guess how to use them properly? Plus, so many big companies have a culture of massive monorepos. I suppose there are islands of isolated code within them, but that’s still another barrier.

    Besides this, I think “AI capabilities” are just as big a challenge as the tooling, if not bigger. AI capabilities are surprisingly good, but they still habitually fail in critical scenarios. And it’s not just about them not being intelligent enough: every improvement in model quality is accompanied by an increase in compute costs. Infinite VC funding is softening that blow, but it probably can’t go on forever.

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 4 points · 18 days ago

      Generally, the agent will use the whole repo. The workflow is that you get it to make the changes to satisfy the query, then run the tests, and use the feedback from the tests to iterate. These tools are getting surprisingly good at this nowadays for common tasks.
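
      A rough sketch of that loop, assuming a hypothetical agent.propose_patch interface that returns a unified diff and a plain pytest run (this isn’t any specific tool’s real API):

          # Sketch of the edit -> test -> feed-failures-back loop.
          import subprocess

          MAX_ROUNDS = 5

          def run_tests() -> tuple[bool, str]:
              """Run the test suite and capture its output as feedback."""
              proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
              return proc.returncode == 0, proc.stdout + proc.stderr

          def apply_patch(diff: str) -> None:
              """Apply a unified diff to the working tree (git reads stdin)."""
              subprocess.run(["git", "apply"], input=diff, text=True, check=True)

          def iterate(agent, task: str) -> bool:
              feedback = ""
              for _ in range(MAX_ROUNDS):
                  # `propose_patch` is a hypothetical stand-in for whatever the
                  # coding agent exposes; it gets the failing-test output back.
                  diff = agent.propose_patch(task=task, test_feedback=feedback)
                  apply_patch(diff)
                  ok, feedback = run_tests()
                  if ok:
                      return True   # tests pass, stop iterating
              return False          # cap the rounds instead of piling on kludges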