Ton of work already being done on this. I am a Vulnerability Researcher @ MIT and I know of a few efforts being worked on just at my lab alone. So far nearly everything I have seen seems to do nothing but report false positives, while missing bugs a fuzzer could have found in minutes. I will be impressed when one finds high-severity/exploitable bugs; I think we are a bit too far from that, if it's achievable at all. On the flip side, LLMs have been very useful for reverse engineering binaries. Binary Ninja with Sidekick (their LLM plugin) can recover and name data structures quite well, which saves a ton of time, and it also does a decent job providing high-level overviews of code.
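To make the fuzzer comparison concrete, here is the kind of shallow crash a naive fuzz loop surfaces almost immediately. This is purely illustrative; the parser and its planted bug are made up.

    import random

    def parse_record(data: bytes):
        """Toy parser with a planted bug: it trusts the length byte."""
        if not data:
            raise ValueError("empty input")
        length = data[0]
        payload = data[1:1 + length]
        # Bug: no check that we actually received `length` bytes of payload.
        return length, payload[length - 1]  # IndexError on short or zero-length records

    def fuzz(iterations: int = 100_000) -> None:
        for i in range(iterations):
            blob = bytes(random.randrange(256) for _ in range(random.randrange(8)))
            try:
                parse_record(blob)
            except ValueError:
                pass  # expected rejection of empty input
            except Exception as exc:  # anything else is a finding
                print(f"iteration {i}: {exc!r} on input {blob.hex()}")
                return

    if __name__ == "__main__":
        fuzz()

A loop like this trips over the IndexError within seconds, which is roughly the bar an AI bug finder needs to clear before high-severity findings are on the table.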
Solving the false positive problem is like solving the halting problem. I don’t think we get to a world where static analysis tools don’t have them, AI or otherwise.
That said, I have found LLMs can find bugs in binaries. It’s not all false positives, as far as I can tell. I have a side project I’ve been working on that does just this (shameless plug): PwnScan.com. It’s currently free and focused on binaries.
The bad news is that you quickly end up with so many false positives that it's sometimes not feasible to sort through them all.
I definitely agree that there's a lot of research happening in this space, and the false positive issue is a significant hurdle. From my own research and experimentation, I have also seen how challenging it is to get LLM-powered tools to consistently find real bugs.
Our approach with Jazzberry is specifically focused on the dynamic execution aspect within the PR context. I am seeing that by actually running the code with the specific changes, we can get a clearer signal about functional errors. We're very aware of the need to demonstrate our ability to find those high-severity/exploitable bugs you mentioned, and that's a key metric for us as we continue to develop it.
Given your background, I'd be really interested to hear if you have any thoughts on what approaches you think might be most promising for moving beyond the false positive problem in AI-driven bug finding. Any insights from your work at MIT would be incredibly valuable.
Agree with you on that. There is nothing about LLMs that makes them uniquely suited to bug finding. However, they could excel at bug work by recovering traces, as you say, and, taking it one step further, even recommending fixes.
One possibility is crafting (somewhat-)minimal reproductions. There's some work in the FP community to do this via traditional techniques, but they seem quite limited.
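The traditional version of that is essentially greedy test-case reduction in the delta-debugging family. A rough sketch, where `still_fails` stands in for whatever reproduction oracle you have (re-running the crashing binary, replaying a failing test); none of this is PwnScan code.

    from typing import Callable

    def minimize(data: bytes, still_fails: Callable[[bytes], bool]) -> bytes:
        """Greedily delete chunks of the input while the failure still reproduces."""
        assert still_fails(data), "need a failing input to start from"
        chunk = max(1, len(data) // 2)
        while chunk >= 1:
            i = 0
            while i < len(data):
                candidate = data[:i] + data[i + chunk:]
                if candidate and still_fails(candidate):
                    data = candidate   # keep the smaller reproducer, retry this spot
                else:
                    i += chunk         # this chunk is load-bearing, move on
            chunk //= 2
        return data

    # Usage (hypothetical oracle): minimize(crashing_input, my_repro_check)

The open question is whether an LLM can propose better deletion candidates than blind chunking, e.g. by dropping whole syntactic or semantic units at a time.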
Correct, what is unique about LLMs is their ability to match an existing tool or practice to a unique problem.
We largely agree; we don't think pure LLM-based approaches are sufficient. Having an LLM automatically orchestrate tools, like a software fuzzer, is something we've been thinking about for a while, and we view incorporating code execution as the first step.
We think LLMs can capture, in a hands-off way, semantic bugs that traditional software testing cannot find, and that ideas from both worlds will be needed for a truly holistic bug finder.
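For anyone wondering what "have the LLM orchestrate the tools" could look like in the small, here's a hand-wavy sketch. `ask_model` is a placeholder rather than any real API, the tool list is arbitrary, and none of this is how Jazzberry is actually built.

    import json
    import subprocess

    TOOLS = {
        "run_tests": lambda arg: subprocess.run(
            ["pytest", "-x", arg], capture_output=True, text=True).stdout[-2000:],
        "read_file": lambda arg: open(arg).read()[:4000],
        "git_diff": lambda arg: subprocess.run(
            ["git", "diff", arg], capture_output=True, text=True).stdout[:4000],
    }

    def ask_model(transcript):
        """Placeholder: a real version would send the transcript to an LLM and parse
        a JSON action like {"tool": "run_tests", "arg": "tests/"} or {"report": "..."}
        out of the reply. Here it just returns a canned 'done' action."""
        return {"report": "stub model: no analysis performed"}

    def find_bugs(pr_ref: str, max_steps: int = 10) -> str:
        transcript = [{"role": "user", "content": f"Look for bugs introduced by {pr_ref}."}]
        for _ in range(max_steps):
            action = ask_model(transcript)
            if "report" in action:
                return action["report"]  # model is done, hand back its findings
            output = TOOLS[action["tool"]](action["arg"])
            transcript.append({"role": "tool", "content": json.dumps({"result": output})})
        return "no conclusive findings"

A fuzzer would slot in as just another entry in the tool table.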
I've used your product and particularly like that you show bugs in a table instead of littering my entire PR.
Does Jazzberry run on the entire codebase, or does it look at the specific PR? Would also like some more details about the tool - it seems much faster than others I've tried - are you using some kind of smaller fine-tuned model?
Thank you! This is definitely a design choice we believe strongly in.
Jazzberry starts by viewing the patch associated with the PR, but can choose to view any file in the repo as it searches for bugs. All the models in use are off-the-shelf for now, but we have been thinking about training a model from the beginning. Stay tuned!
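For context, the "start from the patch, widen scope as needed" flow can be pictured with plain git. This is just the rough shape of PR-scoped analysis, not our actual implementation.

    import subprocess

    def changed_files(base: str, head: str) -> list[str]:
        """Files touched by the PR: the natural starting set for bug hunting."""
        out = subprocess.run(["git", "diff", "--name-only", f"{base}...{head}"],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line]

    def patch_text(base: str, head: str) -> str:
        """The diff itself, which is what gets inspected first."""
        return subprocess.run(["git", "diff", f"{base}...{head}"],
                              capture_output=True, text=True, check=True).stdout

    # From there the analysis is free to open any other file in the checkout
    # (callers, helpers, configs) when the diff alone doesn't explain the behavior.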
I'm kinda curious how this compares to GitLab's similar offering: https://docs.gitlab.com/user/project/merge_requests/duo_in_m...
We are laser-focused on bug finding and aren't targeting general code review, like comments on code style or variable names. We also run your code as part of bug finding instead of only having an LLM inspect it.
Cool demo! You mentioned using a microVM, which I think is Firecracker? And if it is, any issues with it?
Thanks! We are indeed using Firecracker. No issues so far.
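For those curious, booting a throwaway Firecracker microVM from its JSON config file looks roughly like this. The kernel and rootfs paths are placeholders, and this sketch says nothing about how we actually provision or tear down VMs.

    import json
    import subprocess
    import tempfile

    VM_CONFIG = {
        "boot-source": {
            "kernel_image_path": "/path/to/vmlinux",       # placeholder
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [{
            "drive_id": "rootfs",
            "path_on_host": "/path/to/rootfs.ext4",        # placeholder
            "is_root_device": True,
            "is_read_only": False,
        }],
        "machine-config": {"vcpu_count": 1, "mem_size_mib": 512},
    }

    def boot_microvm() -> subprocess.Popen:
        cfg = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
        json.dump(VM_CONFIG, cfg)
        cfg.flush()
        # One disposable VM per run; killing the process tears everything down.
        return subprocess.Popen(["firecracker", "--api-sock", "/tmp/fc.sock",
                                 "--config-file", cfg.name])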
> Jazzberry is focused on dynamically testing your code in a sandbox to confirm the presence of real bugs.
That seems like a waste of resources to perform a job that a static linter could do in nanoseconds. Paying to spin up a new VM for every test is going to incur a cost penalty that other competitors can skip entirely.
You are right that static linters are incredibly fast and efficient for catching certain classes of issues.
Our focus with the dynamic sandbox execution is aimed at finding bugs that are much harder for static analysis to detect. These are bugs like logical flaws in specific execution paths and unexpected interactions between code changes.
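A toy example of the class of bug we mean: this passes linting and type checking without a murmur, and only actually executing it reveals the logic error.

    def paginate(items: list, page: int, page_size: int) -> list:
        """Intended behavior: return the given 1-indexed page of items."""
        start = page * page_size  # bug: should be (page - 1) * page_size
        return items[start:start + page_size]

    # Running it with a concrete input exposes the off-by-one-page flaw:
    print(paginate(list(range(10)), page=1, page_size=3))  # expected [0, 1, 2], prints [3, 4, 5]

A linter has nothing to object to here; the bug only exists relative to the intended behavior, which is why we execute the code.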
Do you guide the LLM to do this specifically? So it doesn't "waste" time on what can be taken care of by static analysis? Would be interesting if you could also integrate traditional analysis tools pre-LLM.
How did, and how do, you validate that this is of any value at all?
How many test cases do you have? How do you score the responses? How do you ensure that random changes by the people who did almost all of the work (training the models) don't wreck your product?
Not the OP but -- I would immediately believe that finding bugs would be a valuable problem to solve. Your questions should probably be answered on an FAQ though, since they are pretty good.