Behind the scenes at socket, we have been quietly using GPT as an internal production tool for over 4 months now. More recently with the launch of ChatGPT, the enthusiasm for large language models has taken the internet by storm. In this post, we want to share some of our learnings about the incredible strengths of ChatGPT and how we've worked around some of the limits.
Socket's internal tooling has a variety of automated analysis that run in real time on the npm change feed. For those unfamiliar, every time a package is published to the registry it is propagated out for replication purposes. We can actually listen to this and do automation on it. Whenever a suspicious new package is published we want to check it out as soon as possible. We do use static analysis extensively but even the best signals still require a human to filter through the noise and check if a package is truly dangerous malware.
This is where we get leads on other blog posts like These Chinese devs are storing 1000s of eBooks on GitHub and npm - Socket ; a human goes to understand what is going on when some sort of signal flags the package published to the registry. However, given the enormous volume of the npm publish feed, doing this manually can quickly turn into a full time job. Our slack bot logs hundreds of these every day, and since last October we've had GPT helping us to get quick at a glance summaries to prioritize auditing.
This might seem like GPT is about to completely replace our jobs given other blog posts we have seen while we wait for our understanding at Socket, but you would be wrong. GPT3 is helpful, though still quite easy to trick into both false positives and negatives. One that constantly gets a chuckle is a test fixtures like @lavamoat/allow-scripts - NPM Package File explorer - Socket have obvious mentions of things like "evil" but are not actually doing anything. Social engineering sandbox escapes also work on these types of LLMs so don't be worried, your job is safe for now. Other fun evasions can occur by saying things aren't evil but actually are safe with things like inline comments.
However, we can actually use the same idea behind these kinds of social engineering attacks to get around limitations in the AI themselves. With Chat GPT some limitations are avoided by giving more context to previous conversation points to the AI as you keep feeding it more data, but in general having context is the exact problem for trying to analyze large files. GPT's Davinci model is limited to 4000 tokens of content to analyze and most JS files are larger than that. Removing whitespace and minifying are classic approaches to make JS code smaller, but these start to impact the AI's ability to explain things, just like a normal programmer.
Instead, we segment up the code of a file and analyze subsets recursively. We split up the file small enough to fit in our model's limits and use comments to social engineer the AI into thinking something is going on. You can imagine it as if we had an imaginary limited AI that can only analyze 3 lines at a time analyzing the following and feeding it back as a comment (example shown using Chat GPT so you can recreate it yourself easily, but we are using GPT3's Davinci model internally):
After splitting it on the fetch call, having GPT analyze it, and feeding that analysis back as it we might get a new code block that is smaller that our AI could analyze:
Running this analysis on the npm firehose has been an interesting exercise in working around the limits of OpenAI's services. One of the biggest challenges has not only been the size of files, but the massive number of API calls to do this recursive analysis. The API tokens we are using have had to have their limits increased a few times and we've done quite a bit of tweaking to get the prompts and drivers running smoothly. As of a few months ago, increasing API limits required some email conversations with an OpenAI sales rep, but hopefully this process will get better in the future.
Getting the analysis to start to resist becoming paranoid or being too lenient on code it sees is another story.
We have had to not just tweak prompts but also the temperature of the AI. The temperature is a configuration option for how confident the AI must be when making a prediction effectively. When given a low temperature with our prompts the AI tends to say that nothing is malicious. However, when you put the temperature too high, the AI seems to lose all ability to coherently analyze code. At those high temperatures paranoia and hallucination best describe the results. In the end our prompts seem to work best at a temperature of 0.6 which is a little lower than the average usage for creative purposes.
Are these new types of large language model AIs going to replace tooling? Probably not due to the variety of ways to evade its analytical tooling with things like simple comments. I do think more malware will start to pepper counter intuitive comments in to avoid AI detection but it will be an arms race. Eventually with things like coloring prompts to help avoid social engineering the AI it will be mitigable but there is a long way to go.
That said, we are using it as a helpful form of summarization and how to triage existing work we have to do, even if we have to manually go and check it ourselves.