AI Engineer's World Fair Recap

Agents don't work yet, but let's pretend they do...

July 2024

With a title as grand as "World Fair", I should've known better than to hope for an intimate gathering for sharing technical epiphanies. Even if that's not the real point of conferences… a nerd can dream.

It was, however, an excellent self-selected gathering of people in the arena. Had many interesting conversations. Grounded my intuitions on multiple fronts. Scanned way too many LinkedIn QR codes.

Overview

In one word, I'd say the eigentopic of the conference was "control."

In a sentence: How do we constrain the output of non-deterministic LLMs to the parts of latent space that we define as "good"?

We have a growing bag of tricks (RAG, knowledge graphs, structured output, control vectors, fine-tuning). But added together, it still doesn't get us to the 99.999999% that both users and developers crave.

Without reliable control, the holy grail of real agents (capable of high-level, multi-step delegation) remains seductively out of reach. In the interim, the term "agent" tacitly refers to any AI system that demos well as a flowchart, even if it's just a set of nested conditionals… but I digress.

Still, the prevailing sense is that the problems are tractable, and collectively, we have a lot of lead bullets left. A breakthrough would accelerate things, but 6-12 months of old-fashioned engineering schlep can paper over a lot of problems.


"Goldilocks"

At one end of the spectrum, prompt engineering is fast, cheap, and unreliable. On the other end, fine-tuning is slow, expensive, and still unreliable 😅.

My hunch is that a lot of alpha lives in the messy middle ground between these two extremes:

Structured Generation is powerful. When you know what kind of output you want, there's really no reason not to use a library like Outlines to improve accuracy, speed, and cost. Conceptually, I think of it like creating a high-level type system for the LLM to adhere to.
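A minimal sketch of what that looks like with Outlines, assuming its mid-2024 API; the model name and schema below are illustrative stand-ins, not anything shown at the conference:

```python
# Minimal structured-generation sketch with Outlines (mid-2024 API).
# Model name and schema are illustrative stand-ins.
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):
    vendor: str
    total_usd: float
    line_items: list[str]

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Invoice)

# Decoding is constrained to tokens that keep the output valid JSON for
# Invoice, so the result always parses -- no retry loops or regex cleanup.
invoice = generator("Extract the invoice: ACME Corp billed $1,200 for hosting and support.")
print(invoice.vendor, invoice.total_usd)
```

The schema is effectively that high-level type system: anything outside it simply can't be sampled.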

Custom YAML DSLs. My favorite workshop was Manuel Odendahl's 100x programmer session, in which he shared a grab bag of prompting techniques and mindsets.

A common frustration in code generation is that the lazy "implement XYZ" prompt doesn't give the AI enough detail to succeed, yet it's tedious to actually type out every relevant detail. Instead, ask the LLM to generate a YAML DSL of the thing you're asking for first. This is deceptively simple, but YAML is the perfect middle ground — easily skimmable, yet precise and structured enough to describe non-trivial complexity. Much easier to make your corrections in the DSL before proceeding to actual generation. This is useful outside of coding, too.
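A rough sketch of the two-step flow as I'd reconstruct it (my own prompts and the OpenAI client as stand-ins, not Odendahl's exact method):

```python
# Two-stage generation: spec as YAML first, code second.
# Prompts, model name, and the example tool are my own stand-ins.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 1: get a skimmable YAML spec instead of diving straight into code.
spec = complete(
    "Write a YAML spec (fields: purpose, commands, flags, error_handling) "
    "for a CLI tool that syncs a local folder to S3. YAML only, no prose."
)

# Step 2: skim the YAML, correct it by hand, then generate from the fixed spec.
corrected_spec = spec  # apply your edits here
print(complete(f"Implement this spec in Python:\n\n{corrected_spec}"))
```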

Control vectors and GPU-poor feature engineering. My favorite hallway conversation was with one of the folks producing "abliterated" models — an elegantly simple technique for performing crude surgery on model weights. In this case, removing the model's ability to refuse requests! Crucially, this does not require a rack of training GPUs… just enough for inference.
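My (possibly lossy) understanding of the core step, sketched on toy tensors rather than a real checkpoint:

```python
# Toy sketch of the linear algebra behind "abliteration", as I understand it:
# find the direction the residual stream moves in when the model refuses, then
# project that direction out of the weights that write to the residual stream.
# Real pipelines do this per layer on an actual checkpoint, not on random toys.
import torch

def refusal_direction(h_refused: torch.Tensor, h_answered: torch.Tensor) -> torch.Tensor:
    """Unit vector from the gap between mean activations on refused vs. answered prompts."""
    r = h_refused.mean(dim=0) - h_answered.mean(dim=0)
    return r / r.norm()

def ablate(w_out: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Project r out of an output-projection matrix, so this layer can no
    longer push the residual stream along the refusal direction."""
    return w_out - torch.outer(r, r) @ w_out

# Toy shapes: 32 prompts x 64-dim residual stream, one 64x64 output projection.
h_refused, h_answered = torch.randn(32, 64), torch.randn(32, 64)
w_new = ablate(torch.randn(64, 64), refusal_direction(h_refused, h_answered))
```

All of that fits comfortably on inference-class hardware, which is the point.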

Beyond semantic similarity for RAG. Plain vector search falls flat on many queries. A simple enhancement is to ask an LLM to generate relevant keywords from the prompt, then supplement with an old-fashioned keyword search. Fancier would be to use an LLM to extract a knowledge graph, then query that as well.
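A sketch of the keyword half of that idea; rank_bm25, the toy corpus, and the merge heuristic are my own stand-ins, not something from the conference:

```python
# Hybrid retrieval sketch: LLM-extracted keywords scored with classic BM25,
# merged with whatever the vector index already returned.
from openai import OpenAI
from rank_bm25 import BM25Okapi

client = OpenAI()
docs = [
    "q3 revenue grew 12 percent year over year",
    "the new sso rollout broke okta logins for contractors",
    "postgres connection pool exhaustion during the friday deploy",
]
bm25 = BM25Okapi([d.split() for d in docs])

def hybrid_search(query: str, vector_hits: list[int], k: int = 3) -> list[int]:
    # 1. LLM turns a fuzzy question into literal search terms the embedding
    #    model might gloss over (IDs, acronyms, product names).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"List 5 search keywords for: {query}. One per line, no prose."}],
    )
    keywords = resp.choices[0].message.content.lower().split()
    # 2. Old-fashioned keyword scoring over the corpus.
    scores = bm25.get_scores(keywords)
    keyword_hits = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    # 3. Naive merge: keep anything either channel surfaced, keyword hits first.
    return list(dict.fromkeys(keyword_hits + vector_hits))[:k]
```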

"Like and Subscribe"

Other projects that I'm watching more closely, thanks to the conference:

ARC Prize is an elegant canary for AGI. Solving ARC doesn't necessarily imply that AGI is here, but it intuitively seems like a necessary step. Also, great for nerdsniping.

Mozilla is offering grants for open source local AI projects. Many AI use cases, IMO, feel like they necessarily want to flow to the OS layer. Having open source win at that level seems important, if we're to retain any level of user control over our devices.

Open Questions

Some questions that are still tugging at my consciousness:

What are enterprise customers actually paying for? There sure were a lot of infrastructure vendors, each offering convenient, end-to-end training, evals, deployment, observability, etc. But who is actually paying for these… and what are they actually running in production?

What are use cases that Big Tech will not do, that customers can not do? The AI infra and application layers are fast becoming blood-red oceans, and I sense that startups are being squeezed from multiple directions. Generalist use cases feel fated to eventually be absorbed into the OS / foundation models. Meanwhile, end users armed with ever more capable tools will be able to generate bespoke solutions for specific, niche problems. Neither of these will happen overnight, so there are still fortunes to be made in racing for distribution.

Why do people still use LangChain? I don't get it. I really don't.