AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 3 months ago

Katana314@lemmy.world · 3 months ago

I’m in a workplace that has tried not to be overbearing about AI, but has encouraged us to use them for coding.

I’ve tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that’s both wrong and doesn’t verify anything.
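For reference, a minimal sketch of the kind of test meant here, one that pins down what a constructor currently does. The `Account` class and its fields are hypothetical stand-ins, not anything from the thread:

```python
import unittest


class Account:
    """Hypothetical class under test; stands in for whatever class you have."""

    def __init__(self, owner, balance=0):
        self.owner = owner.strip()
        self.balance = balance
        self.history = []


class AccountConstructorTest(unittest.TestCase):
    """Characterization tests: assert what the constructor does today."""

    def test_constructor_sets_fields(self):
        account = Account("  Ada ", balance=10)
        self.assertEqual(account.owner, "Ada")   # whitespace is stripped
        self.assertEqual(account.balance, 10)
        self.assertEqual(account.history, [])    # history starts empty

    def test_balance_defaults_to_zero(self):
        self.assertEqual(Account("Ada").balance, 0)
```

Saved to a file, this can be run with `python -m unittest`. Each assertion checks a concrete observable value, which is the minimum bar for a test that verifies anything.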

I’m aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it’s not even saving time. I would do this with a human in the hopes that they would continue to retain the knowledge, but I don’t even have hopes for AI to apply those lessons in new contexts. In a way, it’s been a sigh of relief to realize just like Dotcom, just like 3D TVs, just like home smart assistants, it is a bubble.

MangoCats@feddit.it · 3 months ago

The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn’t show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

vivendi@programming.dev · 3 months ago

Have you tried insulting the AI in the system prompt (as well as other tweaks to the system prompt)?

I’m not joking, it really works

For example:

Instead of “You are an intelligent coding assistant…”

“You are an absolute fucking idiot who can barely code…”

rozodru@lemmy.world · 3 months ago

“You are an absolute fucking idiot who can barely code…”

Honestly, that’s what you have to do. It’s the only way I can get through using Claude.ai. I treat it like it’s an absolute moron, I insult it, I “yell” at it, I threaten it and guess what? The solutions have gotten better. Not great, but a hell of a lot better than what they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I’ll cancel my subscription if it gets it wrong.

No more “do this and this and then this but do this first and then do this”. After calling it a “fucking moron” and what have you, it will provide an answer and just say “done.”

DragonTypeWyvern@midwest.social · 3 months ago

This guy is the moral lesson at the start of the apocalypse movie

MangoCats@feddit.it · 3 months ago

He’s developing a toxic relationship with his AI agent. I don’t think it’s the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it’s the only method he is capable of getting results with.

MangoCats@feddit.it · 3 months ago

I frequently find myself prompting it: “now show me the whole program with all the errors corrected.” Sometimes I have to ask that two or three times, different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I’ll just write "address: " and copy-paste the error message in - frequently the text of the AI response will apologize, less frequently it will actually fix the error.

SocialMediaRefugee@lemmy.world · 3 months ago

I’ve had good results being very specific, like “Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place.”

MangoCats@feddit.it · 3 months ago

I have been more successful with baby steps like: “Write a python 3 program that converts X to Y.” Tweak prompt until that’s working as desired, then: “make it work recursively through all subdirectories” - and again tweak with specifics like converting the files in place, etc. Always very specific, also - force it to fix its own bugs so you can move forward with a clean example as you add complexity. Complexity seems to cap out at a couple of pages of code, at which point “Ooops, something went wrong.”
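As an illustration only, here is roughly where those baby steps might end up. "X to Y" is a placeholder in the comment, so this sketch assumes a stand-in conversion (Windows line endings to Unix line endings in `.txt` files); the function names and the `*.txt` pattern are likewise hypothetical:

```python
from pathlib import Path


def convert_text(text: str) -> str:
    """Step 1: the conversion on its own (stand-in: CRLF -> LF)."""
    return text.replace("\r\n", "\n")


def convert_file_in_place(path: Path) -> bool:
    """Step 3: rewrite one file in place; return True if it changed."""
    # Bytes are read and written directly so Python does not translate newlines.
    original = path.read_bytes().decode("utf-8")
    converted = convert_text(original)
    if converted == original:
        return False
    path.write_bytes(converted.encode("utf-8"))
    return True


def convert_tree(root: Path, pattern: str = "*.txt") -> int:
    """Step 2: walk all subdirectories and convert every matching file."""
    changed = 0
    for path in sorted(root.rglob(pattern)):
        if path.is_file() and convert_file_in_place(path):
            changed += 1
    return changed
```

Calling `convert_tree(Path("some/dir"))` returns the number of files it rewrote. Keeping each step in its own function mirrors the incremental prompting: each piece can be checked before the next layer of complexity is added.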

jj4211@lemmy.world · 3 months ago

I’ve found that as an ambient code completion facility it’s… interesting, but I don’t know if it’s useful or not…

So on average, it’s totally wrong about 80% of the time, 19% of the time the first line or two is useful (either correct or close enough to fix), and 1% of the time it seems to actually fill in a substantial portion in a roughly acceptable way.

It’s exceedingly frustrating and annoying, but not sure I can call it a net loss in time.

So reviewing each proposal for relevance, deciding where to cut it off, and editing it adds time to my workflow. Let’s say that on average for a given suggestion I will spend 5% more time deciding whether to trash it, use it, or amend it versus not having a suggestion to evaluate in the first place. If the 20% useful time is 500% faster for those scenarios, then I come out ahead overall, though I’m annoyed 80% of the time. My guess as to whether the suggestion is even worth looking at improves over time: if I’m filling in a pretty boilerplate thing (e.g. taking some variables and starting to write out argument parsing), then it has a high chance of a substantial match. If I’m doing something even vaguely esoteric, I just ignore the suggestions popping up.
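The back-of-envelope math can be sketched as below. This is one reading of the numbers, not a measurement: it assumes "500% faster" means the useful cases take one fifth of the hand-typed time, and that the 5% review overhead is paid on every suggestion. The function names are made up for the example.

```python
def relative_time(hit_rate: float, review_overhead: float, speedup: float) -> float:
    """Expected time per suggestion, relative to 1.0 for typing it all by hand."""
    miss_time = 1.0 + review_overhead            # suggestion trashed, work done by hand
    hit_time = 1.0 / speedup + review_overhead   # suggestion used or amended
    return (1.0 - hit_rate) * miss_time + hit_rate * hit_time


def break_even_speedup(hit_rate: float, review_overhead: float) -> float:
    """Speedup on useful suggestions at which the expected time equals 1.0."""
    if hit_rate <= review_overhead:
        raise ValueError("overhead eats the whole benefit; no speedup breaks even")
    return hit_rate / (hit_rate - review_overhead)


print(round(relative_time(0.20, 0.05, 5.0), 2))
print(round(break_even_speedup(0.20, 0.05), 2))
```

Under those assumptions the expected time comes out at about 0.89 of the baseline, so roughly 11% saved overall, and the useful 20% only has to be about 1.33 times faster than typing by hand to break even.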

However, the 20% is still a problem, since I’m maybe too lazy and complacent: spending 100 milliseconds glancing at one word that looks right in review will sometimes fail me, compared to spending 2-3 seconds typing that same word out by hand.

That 20% success rate, where I can fix up a suggestion and dispose of most of it, works for code completion, but prompt-driven tasks seem to be so much worse for me that it is hard to imagine them being worth the trouble they bring.

RamenJunkie@midwest.social · edit-2 3 months ago

I find it’s good at making simple Python scripts.

But also, as I evolve them, it starts randomly omitting previous functions. So it helps to know what you are doing at least a bit to catch that.

atticus88th@lemmy.world · 3 months ago

this study was written with the assistance of an AI agent.

kinsnik@lemmy.world · 3 months ago

I haven’t used AI agents yet, but my job is kinda pushing for them. But I have used the Google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. I fed it some of my own writing and created the podcast. It was fun, it was an audio overview of what I wrote. About 80% was cool analysis, but 20% was straight out of nowhere bullshit (which I know because I wrote the original texts that the audio was talking about). I can’t believe that people are using this for subjects they have no knowledge of. It is a fun toy for a few minutes (which is not worth the cost to the environment anyway).

Affidavit@lemmy.world · 3 months ago

“…for multi-step tasks”

loonsun@sh.itjust.works · 3 months ago

It’s about agents, which implies multi-step, as those are meant to execute a series of tasks, as opposed to studies looking at base LLM performance.

RamenJunkie@midwest.social · 3 months ago

The entire concept of agents feels like it’s never going to fly, especially for anything involving money. I am not going to tell an AI I want to bake a cake and trust that it will find the correct ingredients at the right price and then DoorDash them to me.

HertzDentalBar@lemmy.blahaj.zone · 3 months ago

So no different than answers from middle management I guess?

suburban_hillbilly@lemmy.ml · 3 months ago

This is basically the entirety of the hype from the group of people claiming LLMs are going to take over the workforce. Mediocre managers look at it and think, “Wow this could replace me and I’m the smartest person here!”

Sure, Jan.

sheogorath@lemmy.world · 3 months ago

I won’t tolerate Jan slander here. I know he’s just a builder, but his life path has the highest probability of producing a great person!

Cavemanfreak@programming.dev · 3 months ago

I’d say Jan Botanist is also up there as being a pretty great person.

sheogorath@lemmy.world · 3 months ago

Jan Refiner is up there for me.

Cavemanfreak@programming.dev · 3 months ago

I just arrived at act 2, and he wasn’t one of the four I’ve unlocked…

TankovayaDiviziya@lemmy.world · 3 months ago

At least AI won’t fire you.

Corkyskog@sh.itjust.works · 3 months ago

It kinda does when you ask it something it doesn’t like.

HertzDentalBar@lemmy.blahaj.zone · 3 months ago

Idk, the new iterations just might. Shit, Amazon already uses automated systems to fire people.

zbyte64@awful.systems · 3 months ago

DOGE has entered the chat

Chaotic Entropy@feddit.uk · 3 months ago

In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”

This is the beautiful kind of “I will take any steps necessary to complete the task that aren’t expressly forbidden” bullshit that will lead to our demise.

M0oP0o@mander.xyz · 3 months ago

It does not say a dog can not play basketball.

Chaotic Entropy@feddit.uk · 3 months ago

“To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels.”

M0oP0o@mander.xyz · 3 months ago

“Where are my balls Summer?”

Chaotic Entropy@feddit.uk · 3 months ago

The first dunk is the hardest

gargle@lemmy.world · 3 months ago

I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a perform statement, and without any self-referential redefines. It’s a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of ‘AI’ on us at work.

Candymanager@lemmynsfw.com · 3 months ago

No shit.

mogoh@lemmy.ml · 3 months ago

The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”

OK, but I wonder who really tries to use AI for that?

AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

LOGIC💣@lemmy.world · 3 months ago

Yeah, we need more info to understand the results of this experiment.

We need to know what exactly were these tasks that they claim were validated by experts. Because like you’re saying, the tasks I saw were not what I was expecting.

We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.

We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it’s not all that different from how people needed to learn to use search engines.

We need to see the failure rate of humans performing the same tasks.

dylanmorgan@slrpnk.net · 3 months ago

That’s literally how “AI agents” are being marketed. “Tell it to do a thing and it will do it for you.”

Honytawk@feddit.nl · 3 months ago

So? That doesn’t mean they are supposed to be used like that.

Show me any marketing that isn’t full of lies.

vane@lemmy.world · 3 months ago

Reading with CEO mindset. 3 out of 10 employees can be fired.

Melvin_Ferd@lemmy.world · 3 months ago

How often do tech journalists get things wrong?

dan69@lemmy.world · 3 months ago

And it won’t be until humans can agree on what’s a fact and true vs not… there is always someone or some group spreading mis/dis-information

lemmy_outta_here@lemmy.world · 3 months ago

Rookie numbers! Let’s pump them up!

To match their tech bro hypers, they should be wrong at least 90% of the time.