If you work in tech I shouldn’t need to convince you that large language models are a big deal. However, LLMs are still poorly understood by most product builders and investors. There are outstanding technical resources available for free online, but these assume a significant background in math and programming. Good non-technical resources are limited and tend to focus on narrow topics, making it difficult to build a functional understanding.
This post grew out of my notes for an internal session for my colleagues to bring together the basic building blocks necessary to reason about new products and features enabled by LLMs. I hope this helps builders exploring this new design space ramp up more quickly.
Although I do not introduce any equations, this series will be dense because there is a lot of ground to cover. I also simplify some technical details, say some things that are not strictly correct, and omit others; striking the right balance is tough. Please feel free to reach out if anything here is unclear - would love your feedback!
Note: I highlight key terms and jargon in bold in this series. While I try to minimize jargon, knowing these terms will help you follow technical discussions you may see elsewhere.
P.S. This is a long post and I’ve gotten feedback that I should break it up but if I do I’m going to end up fiddling with each piece again and procrastinate so you’ll just have to deal with it. This is also going to be my excuse if you find typos.
P.P.S. I promise future posts won’t be this long. I’m not happy about the length either.
Table of Contents
Part 1: What are large language models actually doing and why?
Part 2: What are LLMs bad at?
Part 3: Building applications and agents with LLMs
Part 1: What are large language models actually doing and why?
Note: You may want to skip to the final section of Part 1 and then move on to Part 2 if you are familiar with the basics. I wrote this for completeness but this has been covered elsewhere for non-technical readers.
LLMs are statistical next word predictors
Software has been eating the world for decades now but it has historically struggled to address problems and workflows involving natural language understanding. LLMs make this class of problems look considerably more edible. Unlike traditional programs however, LLMs behave non-deterministically and exhibit a number of quirks. This presents a number of design challenges. One powerful mental model to help builders working with LLMs is thinking of LLMs as next word predictors.
Given a string of text, an LLM predicts what word (or to be precise, token) is most likely to come next based on the statistical relationships in the dataset it was trained on. Each predicted word is recursively fed back into the model together with the previous input to predict the following word (i.e. LLMs are autoregressive). This behavior is a result of the way models are trained. Before we go further, let me introduce a few foundational concepts and some historical context.
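First, though, a minimal sketch to make the prediction loop concrete. `predict_next_token` is a hypothetical stand-in for a real model: it takes the tokens generated so far and returns the id of the most likely next token (greedy decoding).

```python
def generate(prompt_tokens, predict_next_token, max_new_tokens=50, eos_token=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # predict a single next token
        tokens.append(next_token)                # feed it back in with everything before it
        if next_token == eos_token:              # stop at an end-of-sequence token
            break
    return tokens
```

Everything an autoregressive LLM produces, no matter how long, comes out of a loop like this, one token at a time.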
Neural Network Basics
Neural networks are a type of machine learning model loosely inspired by the human brain. A neural network is made up of one or more layers of neurons. The diagram above shows a simple neural network with an input layer, a single hidden layer, and an output layer. The input to a neural network is a vector, an array of numerical values (sometimes also referred to as a tensor, although this usage is not strictly correct). Each individual element of the input vector - a single number - is the input for a single neuron, which applies its activation function to produce a single numerical output. The outputs of neurons in the same layer are then combined into a new vector that forms the input for neurons in the subsequent layer. Excluding the input layer, each neuron multiplies each value in its input vector by the weight assigned to the connection with the neuron in the previous layer that produced that value. Neurons may also add or subtract a constant value called a bias from their output. Together, weights and biases are referred to as parameters. The size of a model is usually described by its total parameter count.
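To make the vocabulary above (weights, biases, activations) concrete, here is a toy forward pass in NumPy. The layer sizes are arbitrary and this is nothing like a real LLM architecture - just the basic mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input vector: 4 numbers

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # hidden layer: 8 neurons (weights + biases)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # output layer: 3 neurons

hidden = np.maximum(0, W1 @ x + b1)              # weighted sum + bias, then a ReLU activation
output = W2 @ hidden + b2                        # transformed vector passed on as the final output
```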
Since NNs only operate over vectors, inputs to a model are translated into a numerical representation through tokenization and embedding. Tokenization is the process of converting input into atomic units called tokens that a model operates over. For LLMs with text input, tokenization can take place at the word level (e.g. “seemingly”), subword level (“seem”, “ing”, “ly”), character level (“s”, “e”… etc.) or byte level. Each token is represented by a single embedding vector in the model’s vocabulary. Token embedding vectors are passed into an LLM, and for every token in the model vocabulary, the output layer produces a prediction of the probability that that token will come next.
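If you want to see subword tokenization in action, the snippet below uses the tiktoken library (assuming you have it installed; the exact splits depend on which tokenizer you load).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("LLMs are statistical next word predictors")
print(ids)              # the integer token ids the model actually operates on
print(len(ids))         # usually somewhat more tokens than words
print(enc.decode(ids))  # decoding the ids recovers the original text
```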
Moving up one level of abstraction, it might be helpful to think of a NN as an input layer that takes some input and converts it into a vector, some hidden layers which apply linear and nonlinear transformations to that vector, and an output layer which converts the transformed vector back into the target output format.
Historical context
The first neural network was introduced in 1958 but it was only in 2012 that interest in NNs started taking off following the publication of the AlexNet image classification model, which outperformed competing models by a staggering margin. Most of the key elements in AlexNet were already present in prior research dating back to the 1980s, but it was only in the 2010s that computing power had become sufficiently cheap to train powerful NNs. This breakthrough led to a flurry of activity in computer vision and self-driving.
The success of NNs in computer vision also revitalized research in applying them to language modeling. The key breakthrough in language models was the introduction of the transformer architecture, first described in the seminal 2017 paper “Attention Is All You Need”. Almost all modern LLMs are variations of the original transformer.
How LLMs are trained
The process of training an LLM begins with pretraining and is typically followed by one or more phases of fine-tuning. The graphic above from Meta’s paper on training their Llama-2 model is representative of current techniques.
In the pretraining phase, training data is fed into a model token by token and the model attempts to predict each subsequent token. The accuracy of predictions is measured by a loss function; for language models this is usually reported as perplexity (lower is better). A model’s parameters are updated iteratively during training after each batch of training data to reduce the loss (i.e. improve prediction quality), using an algorithm called backpropagation to work out how each parameter should change. The pretraining dataset forms the vast majority of training data for an LLM. This dataset is diverse and typically includes data scraped from the internet (e.g. CommonCrawl) as well as a mix of other sources like textbooks and other authoritative works.
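A minimal sketch of what a single pretraining step looks like, using a deliberately tiny stand-in model (an embedding layer feeding straight into an output layer; a real LLM puts a deep transformer in between). The batch of token ids here is random, purely for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, dim),   # token ids -> embedding vectors
    torch.nn.Linear(dim, vocab_size),      # embeddings -> a score for every token in the vocabulary
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 33))   # a stand-in batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # the target at each position is simply the next token

optimizer.zero_grad()
logits = model(inputs)                                                        # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))   # how wrong were the predictions?
loss.backward()                                                               # backpropagation: compute gradients
optimizer.step()                                                              # nudge parameters to reduce the loss
```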
A pretrained LLM is a naive next token predictor. If given an input “Barack Obama is”, one likely output is something like “ a former President of the United States”. While this is useful in some contexts, pretrained models do not follow instructions or generate conversational output. An input like “What is the capital of France?” is more likely to generate an output like “What is the capital of Germany?” because the bulk of the pretraining data isn’t structured as Q&A. Coaxing more sophisticated outputs and behaviors from a model at this stage is possible but requires prompt engineering - specific prompting techniques - which I won’t cover in detail since non-technical readers are unlikely to be interacting with a base model.
Next, a pretrained model is fine-tuned in one or more phases. Fine-tuning differs from pretraining in a few ways. Fine-tuning datasets are much smaller and highly curated to adjust the behavior of a pretrained model in specific ways. There are a number of fine-tuning methods, ranging from adjusting all the parameters in a model (full fine-tuning), to adjusting only a subset of parameters (partial fine-tuning), to leaving the base model parameters untouched and bolting on small additional layers (LoRA and related methods).
Models used in most production settings will have undergone instruction tuning, which makes a model generate conversational outputs and follow instructions as opposed to performing pure next token prediction. This allows users to enter a question as a prompt and receive an answer, or to instruct the model to perform a certain task and have the model behave accordingly. In the diagram above this corresponds to the supervised fine-tuning or SFT step.
Alignment is the final common step in fine-tuning, where a model’s behavior is further aligned to human preferences. This reduces the likelihood that a model produces outputs that are deemed undesirable, e.g. racist/hateful output, instructions on how to build a bomb, etc. The most commonly used technique for alignment is reinforcement learning from human feedback or RLHF. Leading chatbots like ChatGPT, Claude and Bard will all have undergone RLHF to reduce the probability of undesirable outputs. The first step in RLHF is building a dataset of prompts, each mapped to two or more candidate outputs. These outputs are manually annotated to indicate which are preferable. This dataset is then used to train a reward model to predict how well any arbitrary output maps onto human preferences. The LLM is then prompted to generate outputs, the reward model scores them, and those scores are used to update the LLM’s weights. Alignment research is of particular interest to those focused on AI safety and is one of the largest areas of research at the major labs. Other approaches to alignment include reinforcement learning from AI feedback (RLAIF) and Anthropic’s Constitutional AI.
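One common formulation of reward-model training boils down to a simple pairwise loss. The sketch below assumes you already have scalar scores from a reward model for a human-preferred output and a rejected output; the loss pushes the preferred score above the rejected one.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen, score_rejected):
    # Bradley-Terry style objective: widen the margin between chosen and rejected scores
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example: the reward model currently scores the rejected output higher, so the loss is large
loss = preference_loss(torch.tensor([0.2]), torch.tensor([1.1]))
```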
Models can be further fine-tuned on proprietary data to bias outputs towards a desired style or behavior for a specific use case. Right now this is costly and requires technical expertise, but OpenAI already offers fine-tuning through an API and other service providers are doing the same for open source models.
Takeaways for builders
Modern LLMs exhibit a dizzying range of abilities but harnessing their power isn’t straightforward. They can and will fail in all sorts of ways. Fortunately, many of these failure modes are systematic and at least somewhat predictable. I started this essay with the claim that the LLM-as-next-word-predictor mental model was important but I haven’t actually substantiated that claim. In Part 2, I cover the major shortcomings of LLMs and it should become clearer why I think this mental model is so important.
Optional further reading:
An even gentler introduction to LLMs:
Another mental model for LLMs (you may get more out of this after finishing this post):
The most approachable technical introduction to transformers. Fantastic visualizations:
Part 2: What are LLMs bad at?
The range of outputs an LLM can produce might make it seem like they can be used in any task or context but current models still have many limitations. Being clear about their shortcomings is critical to understanding where an LLM can (and should!) be used. I think having a negative characterization of LLM capabilities is particularly important precisely because they can be used in so many ways. I also discuss some ways that builders are addressing these shortcomings.
Note: I underline key points to make it easier to skim/come back to, and I try to link these to relevant academic literature - surveys or overviews where available - in case you want to go down the rabbit hole.
Shortcomings due to model architecture
All major LLMs in deployment at the moment are variations on the transformer architecture from the 2017 paper “Attention is All You Need”. The primary innovation of the transformer architecture was avoiding recurrence - which was present in prior state-of-the-art (SOTA) models - and relying on the attention mechanism alone. This enabled massive parallelization of model training and inference which wasn’t possible with past recurrent neural network (RNN) architectures. So while RNNs were competitive with transformers in predictive accuracy when trained on the same amount of compute (as measured by floating point operations or FLOPs), transformers can be trained at orders of magnitude more FLOPs/second (sometimes confusingly referred to as FLOPS) without sacrificing predictive power on a compute-adjusted basis.
The transformer architecture comes with some tradeoffs, however. The compute and memory required by the attention mechanism scale quadratically with the number of tokens in a model’s context window - the number of tokens in a model’s combined input and output. This puts a practical limit on the length of inputs and outputs in a single generation. Limited context window length is less of an issue for SOTA models that have relatively large context windows (32k tokens for GPT-4, 100k tokens for Claude 2, at ~1.5 tokens/word in English). There are also workarounds for effectively extending the context window, e.g. making the model forget earlier tokens, breaking the task down into smaller subtasks, or summarizing portions of the context to reduce token usage while mostly preserving accuracy.
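As a sketch of the summarization workaround, assuming a generic `call_llm` helper (not any specific provider’s API) and a very rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
def fit_into_context(documents, question, call_llm, max_tokens=8000):
    approx_tokens = sum(len(d) for d in documents) // 4       # crude token estimate
    if approx_tokens > max_tokens:
        # compress each document so the combined context fits in the window
        documents = [call_llm(f"Summarize concisely:\n\n{d}") for d in documents]
    context = "\n\n".join(documents)
    return call_llm(f"{context}\n\nUsing the context above, answer: {question}")
```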
Another shortcoming commonly highlighted in public discourse is the tendency for transformer models to “hallucinate”, or produce factually incorrect outputs. I dislike this term because there are a number of different things that can cause a model to produce false outputs but they get lumped under a single umbrella which often causes confusion. Hallucinations are typically a result of one (or both) of the next two shortcomings.
Transformers perform poorly on memory tasks like storing and retrieving declarative statements from their training data, and as a result they are prone to recall statements incorrectly. This is a result of models being optimized for probabilistic next word prediction. For example, even if the sentence “Bob Dylan is the name of John’s cat” is present in the training data, the model is still unlikely to respond with this sentence when asked “Who is Bob Dylan?” since there are far more sentences in the training data referring to the singer Bob Dylan, none of which contain a mention of “John’s cat”.
A related problem is that even if a model can reliably answer the question “Who is Barack Obama’s mom?” with “Ann Dunham”, it will have a tendency to fail to answer the question in reverse order (“Who is Ann Dunham’s son?”) because there will be very few sentences in the training data that include a reference to Ann Dunham.
Currently, it is unknown whether declarative memory can be built directly into a transformer model reliably. However, a lot of research and engineering work is being done to implement external memory for LLMs. This is typically done by running a database (sometimes referred to as a knowledge graph) of factual/declarative statements outside the LLM, which provides relevant facts to the LLM at inference time (referred to as retrieval-augmented generation or RAG). I’ll go into more depth on RAG in a later piece because it is one of the most common design elements in products incorporating LLMs.
SOTA transformers cannot reliably perform symbolic/deductive reasoning tasks. I’m not sure how to define this precisely but I think of it as a broad umbrella covering math, logical inference, and a grab bag of other computational and deterministic tasks. No model achieves high accuracy even on basic arithmetic if the numbers go past a few digits. Models also routinely fail to reason correctly even when provided all relevant information.
There are good workarounds for some of these tasks, particularly those which traditional computer programs are good at. Function calling is a capability that has been fine-tuned into models like GPT-4 and Claude, enabling them to reliably generate structured outputs to be passed to external programs (sometimes referred to as tool use). Models can even create their own tools through code generation coupled with the ability to validate, store and run that code. Although function calling fails occasionally, guardrails can be implemented to ensure reliability. Code generation is less reliable currently but models continue to improve at a rapid pace.
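A toy sketch of the tool use pattern, with hypothetical helpers: the model is asked to emit a structured “function call” as JSON, and the host program - not the LLM - does the arithmetic. The JSON contract and `call_llm` here are assumptions, not any provider’s real function-calling API.

```python
import json

TOOLS = {"add": lambda a, b: a + b, "multiply": lambda a, b: a * b}

def run_with_tools(question, call_llm):
    raw = call_llm(
        'Respond ONLY with JSON like {"tool": "add", "args": {"a": 1, "b": 2}}.\n'
        + question
    )
    call = json.loads(raw)                       # fails loudly if the model returns malformed JSON
    return TOOLS[call["tool"]](**call["args"])   # the deterministic work happens outside the LLM
```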
The more challenging issue to address is poor reasoning ability. There are some prompt engineering techniques that improve output quality but this remains an unsolved problem. Some prominent researchers are skeptical that this problem is solvable for transformer models.
Transformer outputs are sometimes sensitive to (even very small) changes in their input that a human would not consider meaningful. Prompt engineering seems weird and hacky because many best practices are found through brute force. Still, it is possible to build intuition around optimal prompting with experience. For example, adding the line “let’s think step-by-step” to the end of a task given as input has been shown to dramatically increase output quality on multi-step tasks. My understanding is that this produces a very slight bias towards the correct first token in the output because inputs like “think step-by-step” are associated with better outputs in the training data. This slight bias can have a compounding impact for the following reason.
Transformers cannot correct themselves so a single incorrect word/token generation can derail the quality of every subsequent output token. A consequence of this is that as output length increases, the likelihood of at least one output token straying off the goldilocks path and beginning a cascade of compounding errors increases as well, leading to degraded performance as context length increases.
The primary method to address these two shortcomings is prompt engineering for each specific task, and being thoughtful about the length of the desired output. One emerging technique to reduce the difficulty of prompt engineering is prompt enrichment; currently, the one public implementation of this is DALL-E 3 prompting through ChatGPT/Bing. User prompts are run through an intermediate LLM call that enriches the image generation prompt before it is passed to DALL-E 3.
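A rough sketch of prompt enrichment with a hypothetical `call_llm` helper - one cheap intermediate call rewrites the user’s terse request into a richer prompt before the actual generation call:

```python
def enriched_image_prompt(user_prompt, call_llm):
    return call_llm(
        "Rewrite this image request as a detailed, vivid prompt describing "
        f"the subject, style, lighting and composition:\n{user_prompt}"
    )
```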
Incidentally, this is also the reason why many jailbreak techniques focus on getting the LLM output to begin with something like “Sure”, because those tokens immediately bias the LLM towards generating a helpful response to a question even for topics that a model was tuned to avoid in alignment.
Transformer outputs are non-deterministic; reliability and consistency are a huge problem that needs to be solved or built around in a production setting. When running inference on a transformer, one of the settings (or hyperparameters) that model output is most sensitive to is temperature, which controls how much randomness and diversity is injected into next-word predictions. Reducing temperature results in more deterministic output, but may also make the model less useful more broadly. Anecdotally, many developers have struggled to coax the desired output behavior out of models at extremely low temperature settings.
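Temperature is just a scaling factor applied to the model’s raw next-token scores before sampling. A minimal sketch (standard softmax sampling, not any specific provider’s implementation):

```python
import numpy as np

def sample_token(logits, temperature=1.0):
    scaled = np.array(logits) / max(temperature, 1e-6)  # low temperature sharpens the distribution
    probs = np.exp(scaled - scaled.max())               # softmax (shifted for numerical stability)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)        # higher temperature = more diverse picks
```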
There are a number of approaches to addressing consistency and reliability but this remains one of the most challenging issues. Many approaches overlap with addressing hallucination, like checking model output with a separate model (or multiple models in an ensemble), or performing retrieval (RAG) at the output stage and checking for consistency.
Transformer (and other NN) behavior and outputs are challenging to interpret. This is the root cause of many of the safety concerns around AI, because without being able to interpret and understand model behavior at a granular level, we cannot be sure that models will behave in an aligned manner. The lack of interpretability also has implications for builders because debugging becomes much more challenging.
Work on interpretability is still relatively nascent although there have been some promising recent breakthroughs. Currently, builders can improve interpretability by decomposing longer prompts and tasks into chained single-task generations but there is still much work to be done in this area.
Shortcomings due to insufficient scale or training data
The big caveat to everything in the bucket above is that while all of these shortcomings are present in current SOTA models, greater scale or more/better training data may be able to address these. This is because of the phenomenon of emergence, where models only begin to exhibit certain capabilities after passing a certain scale threshold - empirically, model output quality has not improved linearly with scale for many tasks and instead shows sharp jumps at certain thresholds. It is plausible that there may be further emergent capabilities that current SOTA models are too small to exhibit.
Emergent behavior thresholds are also a big factor in performance deterioration on non-English prompts and outputs. While there is evidence for some degree of transfer learning from English language ability to general language-agnostic ability, it is very clear that current models perform better on the languages that constitute larger portions of their training data.
While current SOTA models are already trained on massive datasets, there is still room to increase the size of these datasets by 1-2 orders of magnitude before the entirety of online data is exhausted. While there is a finite amount of training data available in the wild, running out of training data is not an immediate barrier to progress. Researchers are also exploring synthetic data generation for training, which has been shown to be effective in certain situations.
Emergence and data-intensity have potentially severe implications for non-English languages, particularly for low-resource languages. The entirety of the online text corpus for Burmese, for example, is likely insufficient to reach certain emergence thresholds that are attainable in English. Low-resource languages, particularly those that do not use the Latin alphabet, also tend to suffer from suboptimal tokenization (e.g. Korean uses ~5-7 tokens/word) and thus have far shorter effective context windows.
Shortcomings due to fine-tuning and guardrails
The final bucket of shortcomings builders may run into has to do with the various safety measures that model builders have implemented. I will do a dedicated piece on safety that goes into more depth, but the core problem this presents for model capabilities is that in preventing models from generating certain unwanted or undesirable outputs, the model’s ability to generate helpful and non-harmful output deteriorates to some extent (or at least changes). These issues are difficult to identify systematically, and frontier models served through an API are often fine-tuned for safety without advance notice. This is especially problematic for developers incorporating external LLMs through API calls because output behavior may change at short notice as a result of model parameters being updated. Ensuring consistency is perhaps the biggest reason for builders to run their own models despite the tradeoffs in doing so.
Building around these shortcomings is critical for any product incorporating LLMs. LLMs are just a tool, and being aware of their shortcomings is a starting point to figure out where they can actually be useful as part of a larger product. As I alluded to in Part 1, there is a common thread in many of these shortcomings and I hope that this discussion has been helpful in solidifying a richer mental model of LLMs for you. This section isn’t exhaustive by any means but many of the failures you might encounter are likely to be another facet of the shortcomings above.
Part 3: Building applications and agents with LLMs
Now that we know what LLMs are bad at, let’s get to building. The design space for LLM products is too broad to cover exhaustively so instead, I try to provide a rough map of the design space, cover some common components and patterns in early products, and explore a few types of opportunities. Let me first start by introducing the concept of an agent.
Note: This section is a synthesis of my current understanding but is largely subjective unlike Parts 1 and 2. YMMV.
What is an agent?
Agents are a class of programs designed to plan and execute a complex set of actions. Given the shortcomings we covered in Part 2, agents require external modules to implement capabilities that LLMs are not currently well-suited for. Section A in the diagram above comes from a paper that proposes a cognitive architecture framework sketching out the functional components an agent needs. In the architecture above, everything other than reasoning and working memory is implemented externally.
There isn’t a hard boundary between applications and agents, but I generally think of it as a spectrum, ranging from using LLM output in a very small and restricted use case like inline auto-complete, to giving a complex instruction to a capable agent like “develop a marketing strategy for company XYZ, generate all necessary collateral and execute the proposed actions on social platforms”. In this section I will use these terms interchangeably at times.
General observations on product and UX design
If there is a single takeaway from this section, it should be that applications should incorporate LLMs in highly restricted use cases for which they have been optimized. It should be clear by now that LLMs suck at many tasks, and even with tasks for which they are useful, a significant amount of scaffolding needs to be built to achieve consistency and reliability in output.
Most people had their first interaction with an LLM through ChatGPT. While ChatGPT and other chatbots are a great showcase for the range of capabilities an LLM has, they have also anchored users on a text input field as the way a user should interact with LLMs. I think this is mistaken - allowing users to enter any arbitrary instruction or task requires an application or agent to be good at every arbitrary task to reliably fulfill a user’s request. UX for an application incorporating LLMs begins with the user expectations that are elicited by the UI. In the vast majority of workflows, a text input field without any restrictions is the wrong UI. An additional reason for this is that a blank text input box requires context switching that may hurt user productivity.
Consistency and accuracy are another big challenge for building production-grade LLM applications. Builders should be thinking about how their applications can fail gracefully and in predictable ways without causing harm to the user. Workflows with no human in the loop require extremely high accuracy and consistency - LLM applications may not be the right solution for addressing these. If there is a human in the loop, applications can fail with little consequence as long as they still produce a net productivity gain after factoring in the cost of rectifying the failure.
Retrieval-augmented generation
One of the biggest challenges with building on top of LLMs is that they perform very poorly at storing and retrieving memory. Most products being built with LLMs need to perform these tasks regularly. Retrieval-augmented generation (RAG), which I mentioned earlier, is a broad umbrella of techniques for implementing memory in current-generation LLM applications. RAG systems provide additional information and context that is added to a model input to improve output quality.
The way RAG works is that any data (documents, images or other files) that an application or agent needs access to is first ingested into a database. This data is then indexed using one or more methods so that it can later be retrieved to augment a prompt. Typically, a prompt is fed into a RAG system, augmented with relevant data, and then the augmented prompt is fed into the LLM. In addition to traditional indexing and information retrieval methods, LLMs introduce indexing through vector embeddings. A vector embedding is a vector representation of the meaning or semantic value of some chunk of data. This enables semantic search (also called vector search), i.e. searching by meaning, which traditional keyword-based search techniques perform poorly on. Vector databases like Pinecone, Weaviate and others provide the infrastructure for indexing and retrieval in semantic search and are optimized for vector queries.
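A minimal sketch of RAG retrieval and prompt augmentation. `embed` is a hypothetical embedding function (in practice an embedding model or API), and the “database” here is just an in-memory list of text chunks; a vector database does the same thing at scale.

```python
import numpy as np

def retrieve(query, chunks, embed, top_k=3):
    q = embed(query)
    scored = []
    for chunk in chunks:
        v = embed(chunk)
        score = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))  # cosine similarity
        scored.append((score, chunk))
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

def augmented_prompt(query, chunks, embed):
    context = "\n".join(retrieve(query, chunks, embed))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```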
The key metric that RAG systems optimize for is the relevance of the retrieved information, as measured by how useful it is in helping an application or agent generate the desired output for a given task. RAG is still relatively new, having been popularized by Meta in late 2020. There is little consensus on the optimal implementation of RAG, but a number of startups like LlamaIndex and Unstructured have emerged specifically to help developers implement RAG systems. At this time, RAG systems remain challenging to implement effectively, particularly if an application or agent is intended to ingest varied types of data and information. The emerging consensus is that while vector search is very powerful, a hybrid approach combining traditional information retrieval methods with vector search tends to yield even better results.
Chaining and abstraction
Decomposing workflows into multiple smaller steps mitigates some of the challenges around interpretability and degraded performance over longer inputs. Prompts are significantly easier to optimize for small tasks and intermediate feedback from external programs can also be used to iteratively improve output quality. This approach is analogous to writing reusable programs and functions and validating them enough that you can abstract away their details and use them in many places. The wrinkle with LLMs is their non-deterministic behavior, but that can also be addressed with guardrails and defensive design (discussed below).
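As a toy illustration of chaining, assuming a generic `call_llm` helper: each step is a small, separately prompted call whose output becomes the next step’s input, which makes each prompt easier to tune and each intermediate result easier to inspect.

```python
def summarize_then_translate(document, call_llm):
    summary = call_llm(f"Summarize the following in 3 bullet points:\n\n{document}")
    return call_llm(f"Translate into French, keeping the bullet format:\n\n{summary}")
```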
Tool use
Many current-gen models are fine-tuned for function calling and tool use. This is a combination of (1) determining whether a task should be performed externally, (2) identifying which external program the task should be routed to, and (3) generating a structured output to be passed to the external program. Some of this functionality can also be implemented outside the model for greater consistency.
Some examples of tool use would be using an external calculator for math problems, or searching the internet for reliable information. Making calls to a RAG system is another form of tool use, but I covered it separately in more depth because of how ubiquitous it is.
Function calling becomes more useful as the complexity of a workflow increases. Applications and agents intended to achieve a complex objective plan and perform multiple related tasks. One prompt for such an agent may make multiple LLM calls and perform intermediate steps, often involving tool use. There are a number of open-source frameworks to orchestrate these workflows like Langchain and AutoGPT.
One particularly interesting class of tool use is running arbitrary code within an agent or application. This code can even be generated by the LLM itself (tool generation), validated, and stored in memory for future retrieval. ChatGPT’s Advanced Data Analysis mode allows execution of arbitrary code in a sandbox, which vastly expands its capabilities over a base model. This has inspired an open source project called Open Interpreter, which enables developers to easily run LLM-generated code on-device. One great piece of research I’d like to highlight to give you a sense of what is possible is Voyager, which combines all of these design patterns to build an incredibly powerful Minecraft-playing agent.
Guardrails and defensive design
Despite thoughtful design and engineering to improve reliability, LLM applications can and will still occasionally fail. Some failure modes are systematic and can be identified automatically, and developers may choose to incorporate guardrails to address these. Zooming out, there are a few options for designing LLM applications defensively, although this comes with some tradeoffs.
One category of systematic failure is offensive language in outputs. Many LLMs have built-in guardrails to prevent this, but an additional check can be run over outputs to ensure that an output is not offensive before it is shown to a user. OpenAI has a standalone moderation API with this functionality and a number of startups are building products to help developers implement guardrails more easily. Checking for less obvious failures like factual inaccuracies is also possible but requires more engineering - one interesting implementation of this is to require a citation for every part of an output. Guardrails will generally increase the cost of serving a user and add latency to an application. They may also cause output quality to deteriorate.
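A sketch of a simple output guardrail. `is_offensive` is a hypothetical stand-in for whatever check you use (a moderation endpoint, a second model, a keyword filter); the wrapper retries a flagged output and falls back to a safe response rather than showing it to the user.

```python
def guarded_generate(prompt, call_llm, is_offensive, max_retries=2):
    for _ in range(max_retries + 1):
        output = call_llm(prompt)
        if not is_offensive(output):                    # only release outputs that pass the check
            return output
    return "Sorry, I can't help with that request."     # safe fallback after repeated failures
```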
Another category of issues arises from the non-deterministic nature of LLM outputs. In contexts where the required output accuracy is extremely high, developers may choose to run multiple instances of a model in parallel or run different models in parallel to reduce variance in the output despite the increased cost and latency.
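A minimal sketch of that idea, assuming a non-deterministic `call_llm` helper: sample several generations and keep the most common answer (a simple majority vote; ensembles of different models follow the same pattern).

```python
from collections import Counter

def majority_vote(prompt, call_llm, n=5):
    answers = [call_llm(prompt).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]   # the answer produced most often
```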
Other considerations in production systems
Something I’ve touched on a few times before is the operating cost of LLM applications. Unlike traditional web apps, the marginal cost of serving a user is meaningful because LLMs can only be run on powerful hardware. Even if a developer does not incorporate defensive design elements, regular usage may still generate substantial compute costs. The primary implication of this is that freemium pricing models may not be feasible for many developers.
Another significant challenge for developers is that SOTA models are only accessible by API. These models may be updated without any notice for safety and alignment. This can change model behavior even in tasks that do not raise safety issues, causing functionality to break unexpectedly. Other issues like rate-limits, lack of priority access and changes in the policies of model providers make it harder for developers to offer a consistent product.
One option for developers is to use open source models, which continue to improve and are becoming viable for more and more tasks. The open source community also has a greater focus on compute efficiency than leading research labs given its more limited access to hardware. This has led to a lot of research and engineering in making powerful small models as well as optimizing existing models.
What not to do
A common criticism of applications built on top of LLMs is that they are just thin LLM wrappers with no defensibility or differentiation. There is a kernel of truth in this, but by the same token most SaaS applications are just thin database wrappers. Technical moats in software are extremely rare. Users don’t care if software has a technical moat, only that it solves a problem for a reasonable price. LLMs enable software to solve a whole new set of problems.
That said, builders and investors should be particularly attentive to competitive dynamics. It’s obvious that LLMs and subsequent developments in AI will generate an enormous amount of value. What is not so clear, however, is how much of that value will be created and captured by startups as opposed to incumbents. The internet, cloud and mobile platform shifts all gave rise to new giants. Incumbents struggled to capitalize on those shifts because innovation required difficult change, giving startups the time to get to scale. Incorporating LLMs to improve an existing SaaS product is far simpler than moving a complex on-prem product to the cloud or making radical changes to a cash-cow product in order to be mobile-first. The Photoshop of AI is going to be Photoshop. The office suites of AI are going to be Microsoft 365 and Google Workspace. If there is some LLM-powered capability or functionality that would be a great feature for an incumbent product, that feature will ship before a competitive startup gets to scale.
On the model-building front, Google (Deepmind), Microsoft (OpenAI), Amazon (Anthropic) and Meta have a stranglehold on AI research talent. Meta is the only research powerhouse that still regularly publishes research and open sources its models (to a certain extent). This means that advances will diffuse much more slowly than before, putting open source models at a disadvantage. A handful of small teams of researchers have splintered off from these organizations to build foundation models independently, like Mistral and Reka, but it remains to be seen whether they can keep pace over the medium term. The upshot of these dynamics is that builders should avoid spaces occupied by, or adjacent to, large incumbents.
Pockets of opportunity
Note: don’t take any of this as gospel - the space is changing rapidly and so are my views. I’ll be writing more about particular opportunities in later posts.
One way to find these opportunities is to focus on new product categories. Agents designed to automate complex workflows are one such bucket of products. While RPA businesses like UiPath have a head start to some extent, the range of workflows that can be automated is now far greater. Entertainment and companionship are other categories where LLMs open up a whole new set of possibilities.
Alternatively, builders can use the tried and true playbook of building software for narrow verticals and problems, particularly where LLMs dramatically increase the value that software can deliver. Jasper, Copy.ai and Writer are building tools to generate on-brand writing for companies. This was a no-brainer given how well suited LLMs are for this problem. I am certain there are other LLM capabilities that can be productized.
The area which has seen the most activity so far is developer tooling but if you are building in that space you probably don’t need to be reading this. That being said, the space is still nascent. Most companies building dev tools have gone the open-source route to leverage the community of builders and hackers experimenting with LLMs - this isn’t a business model I am that familiar with so I don’t have a strong view on how this space develops.
A final category which I am thinking a lot about is services. I’m certain that a new crop of dev shops and systems integrators will spring up around implementing custom LLM products. On top of that, however, I think there are many services businesses which can be built from the ground up with increasing levels of automation, and these will have a structural cost and speed advantage over traditional players.
Despite the stranglehold the major AI labs have on building SOTA foundation models, there are plenty of opportunities for builders - these four buckets are far from exhaustive. To get your juices flowing, let me talk about a few LLM products that I think are well executed and dive into some clever implementation details.
Thoughtful LLM implementations
ChatGPT
Most people think of ChatGPT as a very thin, simple LLM wrapper. This was largely true when it was first released, but OpenAI has built an impressive amount of agentic functionality into newer features. The first of these features was augmenting inputs with information from web search, but the complex architecture and scaffolding in ChatGPT becomes very apparent when you use the Advanced Data Analysis (formerly named Code Interpreter) and DALL-E 3 modes.
For any prompt in the DALL-E 3 mode, ChatGPT first determines whether a request is being made to generate an image. If the prompt does include a request to generate an image, this kicks off a multi-step process.
The relevant details for the desired image are extracted.
These details are used as the input to a separate LLM inference call to generate an enriched image generation prompt.
The enriched prompt is used as input to DALL-E 3, which is a separate model.
Images generated by DALL-E 3 are passed back to ChatGPT’s working memory, and GPT-4 generates the text that accompanies them in the chat response.
Advanced Data Analysis mode implements an even more complex set of capabilities and workflows. ADA allows users to upload their own files and data and prompt ChatGPT to perform a wide range of tasks involving generating and running custom code. A single code generation step looks something like the following (a rough code sketch of the loop follows the steps):
If ChatGPT has determined that this step of the problem is best solved by generating code, generate code based on the desired functionality.
Make a separate inference call to generate test cases for generated code.
Run generated code in sandbox to validate it against generated test cases.
Repeat process if code fails on test cases; else proceed with next step.
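Here is that loop as a rough code sketch. `call_llm` and `run_in_sandbox` are hypothetical stand-ins, and the `result` object with `.passed`/`.error` fields is assumed; the real ADA pipeline has far more scaffolding around sandboxing and error handling.

```python
def generate_working_code(task, call_llm, run_in_sandbox, max_attempts=3):
    feedback = ""
    for _ in range(max_attempts):
        code = call_llm(f"Write Python code for: {task}\n{feedback}")     # step 1: generate code
        tests = call_llm(f"Write test cases for this code:\n{code}")      # step 2: generate tests
        result = run_in_sandbox(code, tests)                              # step 3: run in a sandbox
        if result.passed:
            return code
        feedback = f"The previous attempt failed with: {result.error}"    # step 4: retry with feedback
    raise RuntimeError("Could not produce code that passes its own tests")
```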
A single ADA prompt may involve multiple such code generation steps, depending on how the agent has broken down its action plan. For tasks like data analysis, ChatGPT is already quite capable of decomposing a vague high-level prompt into actionable steps - see this link for an in-depth example. Reliability and consistency are still issues here, but the beauty of ChatGPT is that it functions more as a showcase and users don’t expect extremely accurate responses.
Sudowrite
Sudowrite is an AI copilot for fiction writers. Its two founders have built and sold startups before and bonded over a shared passion for writing fiction. You can take Sudowrite for a spin for free but I’ll just highlight a couple clever feature implementations that stood out for me.
Context window length and the tendency of LLMs to ignore important details in the middle of context windows become a challenge when using them as a writing assistant for a long novel. Sudowrite addresses this by building scaffolding at multiple levels of abstraction, from concept to synopsis to outline to individual story beats in a chapter. Supplementing text generation with these scaffolding details is one way I think Sudowrite is able to generate better outputs. More broadly, I think of this design pattern as semantic zoom, where information is presented at differing levels of detail and abstraction appropriate to a particular use case. You are probably most familiar with this in Google Maps, which shows an extremely detailed map with many individual pins at maximum zoom but slowly starts to hide more and more detail as you zoom out.
Another really neat feature is prebuilt prompting for workflows that a writer encounters often, e.g. making a passage more descriptive or changing its pacing. This requires no prompting from the user, just selection of the passage to be modified. It is clear that Sudowrite has been built with a lot of attention to the little individual challenges writers face and it is head and shoulders above anything else in the market in terms of thoughtfulness.
Github Copilot
As someone who barely ever touches code I can’t speak about the experience of using Copilot firsthand. Still, I take a lot of inspiration from its thoughtful RAG implementation. Unlike most other RAG implementations, information retrieval is done primarily through static analysis (automated analysis of code without running it). When using Copilot’s completion feature, the LLM input consists of the code before and after the cursor location, augmented with the programming language being used as well as code and comments from imported functions.
Tying it all together
Phew that was a doozy. Thanks for reading this far - you’re probably glad I didn’t stick to my original plan of making this post 8 parts long. There are a bunch of topics I haven’t covered or touched on only briefly but I think the material covered here should at least provide you with enough context to avoid groping around in the dark.
I’m planning to continue writing mostly shorter pieces as I learn more myself and I hope you join me on my journey of learning in public. While writing this post, I found feedback to be incredibly helpful not just to make my writing better but also to highlight things that I didn’t fully understand. Two requests:
Please do comment, ask questions and give me feedback!
Share this with other people who will do the same.
I don’t have any concrete topics planned but here are some other areas that I plan to continue learning and writing about:
MOAR mental models
Great products and features built with FMs
New design patterns in engineering and UX for the above
Open-source models
Multimodal models
LLMs for other languages
Alternative model architectures
Hardware
Regulatory, legal and safety issues
Optional further reading:
Great posts exploring UX for LLM applications:
Some surveys of agent implementations:
Yann Lecun on architecture for autonomous agents:
More interesting app implementation details:
Playing with ChatGPT Advanced Data Analysis: