One year of LLM usage with Clojure
Introduction
This essay is a reflection on using LLM agents with the Clojure programming language. At Shortcut, we have spent the last year building Korey, an LLM agent focused on product management. During that time, we have attempted to use different LLM agents to work with our rather large Clojure code base of roughly 250,000–300,000 lines. In doing so, we discovered a lot. I hope this essay helps people in the Clojure community, and those working in other less-mainstream languages more generally, take better advantage of LLM tools in their work. I would break our approach down into eras:
- Early adoption and struggles
- The doldrums of Claude Code
- The clojure-mcp revolution
- Skills, Prompts, and OpenCode
- System prompt refinements
- PI Agent and liberation
Early Adoption and Struggles
It is commonly understood that LLMs are more effective with languages that dominate their training data. Research has shown that model performance varies significantly with the volume of training data available for each programming language, with Python, JavaScript, and TypeScript heavily represented in public code repositories. This was, and remains, concerning to me. For one thing, at Shortcut we have a large Clojure code base that we have grown, nurtured, and maintained over the last 11 years. We don't have the time, money, or interest to rewrite it in Python or TypeScript simply because state-of-the-art models prefer those languages. And throwing out working code does not seem like a good business move to me.
With that in mind, we decided to try to teach Claude Code how to write Clojure well. We quickly realized that certain aspects of Claude Code's default behavior make it very difficult to work with a large code base like ours. For example, Claude would run the entire test suite after each change it implemented, and running the full suite locally takes several minutes at this point. As Clojure engineers, we prefer tight feedback loops at the REPL, so a several-minute test run after every change was unacceptable to us.
We began to tweak our AGENT.md file, teaching Claude Code how Clojure and its data structures work. With that, we were able to narrow the static verification steps that Claude Code ran after each step of its implementation. We noticed a large improvement in both performance and our iterative flow.
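To make that concrete, here is a minimal sketch of the kind of targeted check we steered Claude toward instead of the full suite; the namespace name is hypothetical:

```clojure
;; Reload only the namespace under test, then run just its tests from the REPL,
;; instead of kicking off the whole multi-minute suite.
(require '[clojure.test :as t])
(require '[my.app.billing-test] :reload) ; hypothetical test namespace
(t/run-tests 'my.app.billing-test)
;; => {:test 12, :pass 30, :fail 0, :error 0, :type :summary}
```

A run like this completes in seconds, which keeps the agent's verification step inside the REPL feedback loop.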
The doldrums of Claude Code
At this point, Claude Code was mostly functional for us, and we were able to achieve a certain level of development flow with it. Claude used our large code base itself as a model for how it should write Clojure, although we often had to prompt it to do so. I began experimenting with the best ways of prompting the LLM agent, and one of the things I discovered is that specifying in detail what the agent should do is critical; you can't leave any ambiguity.
However, we still struggled with certain aspects of how Claude approached software engineering. For example, we noticed that Claude often leapt ahead before looking. Claude would write a bunch of code — potentially several thousand lines over a few minutes — and then attempt to verify it. The problem is that because of hallucinations or misunderstandings, the new code would often contain errors. At that time, roughly six months ago, Claude was poor at debugging these errors.
Additionally, we observed a pattern: when Claude ran into an error, it often tried to solve the problem by adding more complexity, which, as software engineers, we know is rarely the right approach. Consequently, we would see Claude spend a lot of time and tokens debugging a problem it had created and fail to resolve it, because the code didn't belong in our code base in the first place. This leap-before-you-look behavior severely limited Claude's effectiveness in our code base.
We also noticed some more fundamental flaws in the way Claude (this was Sonnet 3.5 at the time) approached Clojure code. One thing we commonly saw was that Sonnet defaulted to an imperative programming style, which we know does not work well in Clojure. For example, embedded in a large code change, we discovered a doseq mutating an atom where a map or a reduce would have been preferable. These issues were problematic because Claude can generate a lot of code, and a lot of code is very difficult to undo; ultimately, we want the LLM to generate the right code in the first place. So we faced a dilemma: how do we get better functional Clojure code out of the LLM? The next step was to explore avenues for achieving that.
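To make the pattern concrete, here is a hedged sketch of the kind of change we kept seeing; the names and data shapes are illustrative, not code from our repository:

```clojure
;; The imperative shape the model tended to produce: an atom mutated inside a doseq.
(defn totals-by-user-imperative [events]
  (let [totals (atom {})]
    (doseq [{:keys [user-id amount]} events]
      (swap! totals update user-id (fnil + 0) amount))
    @totals))

;; The idiomatic version we wanted instead: a pure reduce over the collection.
(defn totals-by-user [events]
  (reduce (fn [acc {:keys [user-id amount]}]
            (update acc user-id (fnil + 0) amount))
          {}
          events))
```

Both return the same map, but the second has no hidden mutation and matches the style the rest of our code base is built on.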
The clojure-mcp revolution
The first step I took was to explore the Clojure MCP tool. It exposes a set of editing and evaluation features that greatly reduced hallucinations, and with it we were able to ground the LLM much better in our code base. Its edit functionality was essential for keeping unbalanced parentheses, syntax errors, and other issues out of our code base.
Clojure MCP fundamentally altered my belief in the ability of LLMs to write effective Clojure code. Previously, I was struggling and frustrated; working with Clojure and LLMs was a constant source of hallucinated functions, invalid syntax, and a generally unpleasant experience, with a lot of rework and misdirection. Clojure MCP really changed that. Thanks a ton to Bruce Hauman for building Clojure MCP and Clojure MCP Light and releasing them as open-source tools.
Skills, Prompts, and OpenCode
The next step I took toward better LLM output was to evaluate how Anthropic's skill system works. I also developed a tool I ended up calling Clojure Skills. Clojure Skills was envisioned as a SQLite database with a command-line interface that would let both the LLM agent and the human search through a set of Anthropic-style skills. These skills would inform the agent about patterns, idioms, specific libraries, and so on, with the hope that it would produce much better output and that I would spend far less time debugging. This was also the point when I started experimenting a little with tools like OpenCode.
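To give a feel for the shape of the tool, here is a minimal sketch of the lookup side, assuming next.jdbc and the xerial SQLite driver are on the classpath; the database path, table, and column names are hypothetical:

```clojure
(ns clojure-skills.search
  (:require [next.jdbc :as jdbc]))

;; Hypothetical skills database produced by an indexing step.
(def ds (jdbc/get-datasource {:dbtype "sqlite" :dbname "skills.db"}))

(defn find-skills
  "Return skills whose name or body matches the search term."
  [term]
  (let [pattern (str "%" term "%")]
    (jdbc/execute! ds ["select name, summary, body from skills
                        where name like ? or body like ?"
                       pattern pattern])))

;; (find-skills "transducer")
;; => [#:skills{:name "transducers", :summary "...", :body "..."}]
```

The command-line interface was a thin wrapper over queries like this, so the agent and I could search the same skill set.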
What's interesting about OpenCode is that it lets you easily define your own system prompt. So at this point I defined a system prompt for skill-building: the agent would dynamically load a library onto the Clojure REPL, explore it, and produce a skill for that library.
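The dynamic-loading part leaned on the REPL itself. Here is a sketch of the idea, assuming Clojure 1.12's clojure.repl.deps in a CLI-started REPL where add-lib is permitted; the library and version are just examples:

```clojure
;; Add a library to the running REPL without a restart, then explore it
;; before writing the skill file.
(require '[clojure.repl.deps :refer [add-lib]])
(add-lib 'hiccup/hiccup {:mvn/version "2.0.0-RC3"})

(require '[clojure.repl :refer [dir doc]]
         '[hiccup.core])
(dir hiccup.core)       ; list the library's public vars
(doc hiccup.core/html)  ; read the docstring for the main entry point
```

With the library loaded and inspected, the skill the agent writes is grounded in what the library actually exposes rather than what the model remembers about it.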
Here is an example of using OpenCode with a custom system prompt:
{ "$schema": "https://opencode.ai/config.json", "agent": { "clojure": { "description": "Expert Clojure developer with REPL-driven workflow", "model": "anthropic/claude-sonnet-4", "prompt": "{file:./SYSTEM.md}" } }, "default_agent": "clojure" }
It was during the process of building my own skill system and using those skills day to day that I noticed a fundamental issue: over long LLM sessions with large context windows, the knowledge in the skills didn't stay very sticky. The LLM initially applies the correct skills and patterns, but eventually it starts to forget them, ignore them, and just do its own thing.
After some research I discovered that this problem is documented in the literature. In their 2023 paper "Lost in the Middle: How Language Models Use Long Contexts", Liu et al. demonstrated that LLMs effectively have a U-shaped memory curve: the most recent conversation turns and the system prompt are weighted more heavily than the middle of the conversation. Consequently, when skills are loaded mid-conversation, the LLM often fails to follow them.
I began experimenting within the clojure-skills system with a concept I call "prompt packing." I saw on Reddit that other people were putting all the knowledge of their code base directly into the system prompt. That made immediate sense given what we know about context windows. So I tried both inlining and referencing skills in the system prompt. OpenCode let me carefully curate the skills in my system prompt.
With this approach I achieved a new level of effectiveness: I was able to one-shot more and more tasks with the LLM than I ever could with Claude Code alone.
System Prompt Refinements and Prompt Evals
At this point in our development of Korey, it was time to tweak the system prompt for Korey itself. I was lucky enough to be assigned this task at work, and I spent a couple of weeks understanding how people evaluate system prompts and make them more effective for their task. I then drew the connection between my engineering system prompt and a prompt-evaluation system, which became key to iterating on and evaluating how we use LLMs and how our system prompt could become more effective. A helpful resource for this type of research is hamel.dev. Hamel Husain is a great communicator and writer, and he was critical to my understanding of how to think about refining and defining system prompts, especially for working with code. Thanks, Hamel.
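To give a feel for what I mean by a prompt-evaluation system, here is a minimal sketch in Clojure; call-llm is a stand-in for whatever client you use, and the cases and checks are purely illustrative:

```clojure
;; Each eval case pairs a task with a predicate over the model's output.
(def eval-cases
  [{:task  "Sum the :amount values in a vector of maps."
    :check #(re-find #"\breduce\b" %)}    ; expect a functional approach
   {:task  "Parse an ISO-8601 timestamp."
    :check #(re-find #"java\.time" %)}])  ; expect java.time, not SimpleDateFormat

(defn call-llm
  "Stand-in for a real client: send the system prompt and task, return the reply."
  [system-prompt task]
  "...model output...")

(defn run-evals
  "Run every case against a candidate system prompt and report the pass rate."
  [system-prompt]
  (let [results (map (fn [{:keys [task check]}]
                       (boolean (check (call-llm system-prompt task))))
                     eval-cases)]
    {:passed (count (filter true? results))
     :total  (count results)}))
```

Scoring a prompt change this way turns "the new prompt feels better" into a number you can compare across iterations.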
At this point, I started working on what eventually became my Clojure system prompt project. This is a prompt we iterated on a lot internally at Shortcut, and I have released it as an open-source project. I hope other Clojure developers will find it a helpful starting point for devising their own system prompts. One thing you will notice is that I use the REPL extensively to establish ground truth, that is, to prevent hallucinations, before the LLM agent writes any code. This was key to transforming what was an incredibly frustrating experience into a much smoother and more graceful LLM interaction.
Another step I took was to deepen the LLM's ability to use the Clojure platform itself. Clojure is designed to be interacted with from the REPL, and this turns out to be a huge advantage when working with an LLM. For example, instead of defining a new skill for each library, I taught the LLM about clojure.repl, which lets it dynamically explore any Clojure library on the REPL. Individual skills or lessons became less important, and the platform itself served as a dynamic feedback loop. Here is an example of that:
<discovering-functions>
The REPL is your documentation browser. Explore before coding:
List all public functions in a namespace:
```shell
clj-nrepl-eval -p 7889 "(clojure.repl/dir clojure.string)"
# Output: blank? capitalize ends-with? escape includes? index-of join
# lower-case replace reverse split split-lines starts-with? trim ...
```
Get detailed documentation for any function:
```shell
clj-nrepl-eval -p 7889 "(clojure.repl/doc map)"
# Output:
# -------------------------
# clojure.core/map
# ([f] [f coll] [f c1 c2] [f c1 c2 c3] [f c1 c2 c3 & colls])
# Returns a lazy sequence consisting of the result of applying f to
# the set of first items of each coll, followed by applying f to the
# set of second items in each coll, until any one of the colls is
# exhausted. Any remaining items in other colls are ignored...
```
Search for functions by name pattern:
```shell
clj-nrepl-eval -p 7889 "(clojure.repl/apropos \"split\")"
# Output: (clojure.string/split clojure.string/split-lines split-at split-with)
```
Search documentation text:
```shell
clj-nrepl-eval -p 7889 "(clojure.repl/find-doc \"regular expression\")"
# Shows all functions whose documentation mentions "regular expression"
```
Read function source code:
```shell
clj-nrepl-eval -p 7889 "(clojure.repl/source filter)"
# Shows the actual implementation - great for understanding edge cases
```
</discovering-functions>
Despite these breakthroughs, the development flow was still not what I wanted it to be. First, OpenCode seems to consume a large amount of memory. It is a real bummer when you're working with a context window that you have crafted over thirty or forty minutes, and the LLM agent crashes because of the harness. It's a very frustrating experience.
Additionally, I've noticed that OpenCode tries to minimize the output of tool calls. This follows the general industry pattern we see in Claude Code, where certain information is hidden from the engineer. It's unclear to me what the exact design goals are, but I believe they're trying to minimize the information the developer sees so as not to overwhelm them.
However, when you're refining your system prompts and thinking carefully about your LLM interactions, this hiding of information becomes a real hindrance. I want to know that my tool calls are correct, that my edits are clear, and that my system functions as effectively as possible.
At this point I read a blog post about a new agent, pi-agent, and I was hooked.
PI Agent and liberation
Moving from an LLM agent harness like Claude Code, which attempts to hide what the LLM is doing, to a simple harness that shows you everything felt liberating. Not only that, but pi-agent encourages you to solve your own problems, just like Emacs does. If there's a tool or piece of functionality you need that pi-agent doesn't provide, you simply build your own plugin. I have written a couple of plugins and used several more from the community.
One plugin I wrote tracks my energy usage while working with LLMs. Like many, I am deeply concerned about LLMs' effects on our planet. The plugin tracks how much carbon my interactions generate and compares that to typical car use.
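The estimate itself is simple arithmetic. Here is a rough sketch of the calculation; the energy-per-token and carbon-intensity figures are configurable assumptions, not authoritative numbers:

```clojure
;; All coefficients are assumptions kept in config, not measured values.
(def wh-per-1k-tokens 3.0)        ; inference energy per 1,000 tokens
(def grams-co2-per-kwh 400.0)     ; grid carbon intensity
(def grams-co2-per-car-km 170.0)  ; average petrol car

(defn session-footprint
  "Estimate a session's CO2 in grams and the equivalent distance driven by car."
  [total-tokens]
  (let [kwh   (/ (* (/ total-tokens 1000.0) wh-per-1k-tokens) 1000.0)
        grams (* kwh grams-co2-per-kwh)]
    {:grams-co2 grams
     :car-km-eq (/ grams grams-co2-per-car-km)}))

;; (session-footprint 250000)
;; => roughly {:grams-co2 300.0, :car-km-eq 1.76} under these assumed coefficients
```

The exact numbers matter less to me than the habit of seeing them after every session.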
What I value most about pi-agent is its reliability over long contexts. It also clearly indicates whether a tool call succeeded or failed with a green or red output. It's very minimal, does not use sub-agents, and allows me to focus on what my LLM is doing. As I follow along, I can tighten my iteration loop even more.
Current stack and future directions
My current stack is Clojure MCP Light, pi-agent, and my own system prompt. I plan to continue iterating on the Clojure system prompt and develop tools that make working with Clojure from LLMs more effective. I've also begun to experiment more and more with open-weight models, including the excellent Kimi 2.5 model from Moonshot AI, which is effectively my day-to-day driver now along with Sonnet 4.5.
There are a bunch of future directions I would love to explore given the time and opportunity. The rise of very effective open-weight models like Kimi 2.5 and MinMax 2.5 opens up the possibility of post-training these models on Clojure and on our code base. Theoretically, this could remove the need for custom system prompts, skills, or specialized tools: we could train our own Kimi Clojure or MinMax Clojure model with all of our best practices baked in. Of course, this is theoretical; I've done a little research, and there seems to be strong evidence that it would work, but we won't really know until we try it.
I also think it would be worthwhile to begin curating all the tools that individual Clojure developers are building for this kind of work, and to build an ecosystem and documentation around them.
Conclusion
I think there is a tendency among certain LLM users to limit the platforms and languages they use with LLMs, and given how models behave out of the box, there are good reasons to consider this. If I were starting a greenfield project, I might consider something like Python. However, I believe software decisions should serve the needs of the people who work on and with the software, not the needs of the LLM that might play a role in creating it.
Ultimately, human beings are responsible for managing these software systems, and human knowledge still, in my view, trumps LLM statistical output. With that in mind, the artifact produced by the LLM process remains a critical part of software development: I prefer to debug and deploy Clojure JVM artifacts, and I know many other organizations do as well.
Also, I think there are important reasons not to abandon all the lessons we've learned over nearly twenty years of using Clojure. Clojure itself eliminates a whole class of common engineering problems around unmanaged mutable state. I don't see the advantage of going back to a language where we have to mutate state to iterate, for example, or where functional idioms are not the default.
My experience using LLMs for software engineering, and building an LLM product, has taught me that rather than constraining the platforms we use and adopt, the LLM era could actually free us. The idea that LLMs are only good at what dominates their training set ignores the tools, skills, and system prompts we can apply after training.
We can craft and refine LLMs for our specific platform. Not only that, there is something beautiful about LLMs: instead of narrowing how we work, they can allow people to work in very different ways.
I hope this blog post was insightful and informative, and I hope that my experience and our experience at Shortcut can help you craft and shape an LLM experience that solves your particular engineering problems. Thank you for reading.
References
Katzy, J., & Izadi, M. (2023). On the Impact of Language Selection for Training and Evaluating Programming Language Models. arXiv preprint arXiv:2308.13354. https://arxiv.org/abs/2308.13354
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172. https://arxiv.org/abs/2307.03172