Generative AI#


Fig. 14 A conversation with an Artificial Intelligence?#

Note

This guide is a work in progress. For corrections and suggested changes, please raise an issue or contact the Analysis Standards and Pipelines hub.

Generative AI (Artificial Intelligence) has become the most talked-about technological development in recent years. Chatbots and image generators are now widespread in workplaces. Most of us have heard the confident assertion that the use of generative AI as a coding assistant will revolutionise how coders work. Whether you are curious about using AI tools at work, or you have been encouraged to do so, this guide will explain some of the fundamental technology behind them so you can understand how they work, and offer suggestions for how to get the most out of them. This guide is intended to supplement the existing Government Playbook on AI - all the principles laid out in the playbook still apply here.

Introduction - Machine learning and Large Language Models#

Generative AI (Artificial Intelligence) is a term that refers to a class of machine learning models. A machine learning model is a piece of software that is trained on a set of data to make predictions when given new data. Machine learning has been around for a long time and it is used in almost every industry to solve a wide variety of problems. As a government analyst, you may be quite familiar with machine learning already.

In a generative AI model, these predictions are novel instances of text, images, video or audio (or any combination of these) that are produced when given some input data. This guide will focus on a particular subcategory of generative AI models called Large Language Models, or LLMs. LLMs typically take text as input and generate text as output (although most major LLMs today can take inputs and generate outputs in multiple formats, the typical use case is still text-to-text). The most well-known LLMs include OpenAI’s GPT series, Google’s Gemini, Anthropic’s Claude, and DeepSeek’s eponymous models.

How do LLMs generate novel instances of text that so closely resemble something that was written by a human? The simplest way of putting it is that LLMs are next-word-predictor machines. They continuously predict the next word in a sequence over and over again, until they have a ‘finished’ output, whether that is a business email, an academic essay, or an entire Python package. But how does a model know what word to add to the end of a sequence of words so that it ‘looks right’? This is only possible because the model has learned a mathematical pattern for human text. The technology behind this is a neural network, a form of machine learning model architecture used to find non-linear patterns in data.

Neural networks get their name because they consist of a network of model ‘neurons’ - computational units that take many inputs and deliver one output. Neural networks are layered constructions: the first layer is the input layer, which receives the model’s inputs. Next come the hidden layers, of which there can be one or more.

Each hidden layer consists of one or more neurons. Each neuron in a hidden layer takes every output of the layer before it and computes a single output. It does this by computing a weighted sum of the previous layer’s outputs (each neuron has its own unique set of weights), adding a bias, and then feeding the result into an activation function that transforms this linear combination into a non-linear output.

A commonly used activation function is the rectified linear unit (ReLU), defined as \(f(x) = \max(0, x)\). The activation function is a crucial component of a neural network: without one, the network cannot learn non-linear patterns. The activation function also models how real neurons behave - they only fire once the electrical stimulus from their inputs reaches a certain threshold.


Fig. 16 A diagram of neural network, showing two hidden layers consisting of four neurons each. The input layer takes two inputs, and the output layer produces four outputs. The outputs for the first hidden layer are calculated using the outputs of the input layer, weight vector \(W_1\), bias vector \(b_1\) and the activation function, the outputs of the second hidden layer are calculated using the outputs of the first hidden layer, weight vector \(W_2\), bias vector \(b_2\) and the activation function, and so on. Credit: Wolfram Research#
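As a concrete illustration, the layer computation described above can be sketched in a few lines of Python. The shapes follow the figure (two inputs, hidden layers of four neurons each), but the weights and biases below are illustrative values chosen by hand, not trained ones.

```python
def relu(x):
    # Rectified linear unit: f(x) = max(0, x)
    return max(0.0, x)

def layer(inputs, weights, biases):
    # Each neuron: weighted sum of every input from the previous layer,
    # plus that neuron's bias, fed through the activation function.
    return [relu(sum(w * i for w, i in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 2.0]  # input layer: two inputs, as in the figure

# First hidden layer: four neurons, each with two weights and a bias (W1, b1)
W1 = [[0.5, -0.25], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
b1 = [0.5, 0.5, 0.5, 0.5]
h1 = layer(x, W1, b1)

# Second hidden layer: four neurons, each with four weights and a bias (W2, b2)
W2 = [[0.25] * 4, [-0.5] * 4, [0.5] * 4, [1.0, 0.0, -1.0, 0.0]]
b2 = [0.0, 0.0, 0.0, 0.0]
h2 = layer(h1, W2, b2)

print(h1)  # [0.5, 1.5, 2.5, 0.5]
print(h2)  # [1.25, 0.0, 2.5, 0.0]  (negative sums are clamped to 0 by ReLU)
```

Note how the second layer’s outputs include zeros: those neurons received a negative weighted sum and did not ‘fire’, which is exactly the non-linearity the activation function provides.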

While the individual components of a neural network perform essentially trivial mathematical operations, when networks are scaled up to include many hidden layers of many neurons (this is where the term deep learning comes from - deep neural networks, meaning neural networks with many layers), they can predict exceptionally complex non-linear patterns. An even more surprising feature is that they can work out for themselves what these patterns are; they do not need instructions. For example, a neural network trained as an image classifier is able to learn on its own what constitutes an image of a car and what constitutes an image of a bicycle, just by being given examples of each - it does not need a definition of what is ‘car-like’ or ‘bicycle-like’ in advance.

Training a neural network to recognise a pattern or reproduce a non-linear function requires supplying it with enough examples, each consisting of an input and the desired output. The network is then trained to find the weights and biases needed to reproduce the desired outputs. This is done via gradient descent: a loss is calculated, the weights and biases are adjusted in the direction that reduces the loss, and the process repeats until the model converges.

The drop down section above contains a primer on loss and gradient descent. In neural networks, gradient descent is implemented using an algorithm called backpropagation.
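To make the idea concrete, here is a minimal sketch of gradient descent on a single weight. It uses a toy one-parameter model rather than a neural network - backpropagation extends the same idea to every weight and bias in a network - and the data and learning rate are illustrative.

```python
# Toy model: y = w * x. The training examples below were generated with
# the 'true' weight w = 2, so gradient descent should recover w close to 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0              # start from an arbitrary initial weight
learning_rate = 0.05

for step in range(200):
    # Loss is the mean squared error; its gradient with respect to w
    # is the mean of 2 * (w * x - y) * x over the training examples.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad    # step against the gradient to reduce loss

print(round(w, 3))  # 2.0
```

Each pass nudges the weight downhill on the loss surface; after enough steps the model has converged on the weight that best reproduces the examples.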


Fig. 17 A neural network learning to predict a non-linear function. Credit: Stephen Wolfram#

LLMs are essentially just very large neural networks. The underlying neural network of an LLM is trained on a vast body of text, from which it learns patterns of words and how to assign probabilities to individual words occurring within sequences of other words, allowing the model to produce long sequences of text that quite accurately mimic human language. However, it’s important to remember that a trained LLM is a black box. The trained model contains billions, if not trillions, of parameters, so nobody can say for sure why it is capable of such accurate mimicry; we can only observe that it is.
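The next-word-prediction loop itself is simple enough to sketch. In the toy example below, a tiny hand-written probability table stands in for the trained neural network - all the words and probabilities are made up for illustration, whereas a real LLM derives them from its billions of parameters and uses a far longer context than two words.

```python
import random

# Toy stand-in for an LLM: given the last two words, a probability
# distribution over possible next words (all values invented).
next_word_probs = {
    ("the", "cat"): {"sat": 0.6, "slept": 0.3, "ran": 0.1},
    ("cat", "sat"): {"on": 0.9, "quietly": 0.1},
    ("sat", "on"): {"the": 0.95, "a": 0.05},
    ("on", "the"): {"mat": 0.7, "sofa": 0.3},
}

random.seed(0)
sequence = ["the", "cat"]

# Repeatedly predict the next word from the current context, until the
# model has no prediction left - the 'finished' output.
while tuple(sequence[-2:]) in next_word_probs:
    probs = next_word_probs[tuple(sequence[-2:])]
    words, weights = zip(*probs.items())
    # Sample the next word in proportion to its predicted probability.
    sequence.append(random.choices(words, weights=weights)[0])

print(" ".join(sequence))  # e.g. "the cat sat on the mat"
```

The outline - predict a distribution, pick a word, append it, repeat - is exactly what an LLM does; the difference is that its ‘table’ is implicit in the learned weights of a very deep neural network.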

The explanation for why LLMs are so ubiquitous now (when previous language models got far less attention) is a specific neural network architecture called the ‘transformer’, invented at Google in 2017 for the purpose of machine translation. This architecture represented a breakthrough in language models because it opened them up to ‘hyperscaling’: increasing the size of the neural network and the size of the training data led to predictable increases in performance, so both have increased exponentially ever since. Further discussion of the transformer architecture is beyond the scope of this guide, but the Financial Times’ Visual Storytelling Team have written an excellent introduction.

An important point to note about LLMs is that, from a user perspective, the neural network training is only half the story. That training produces the ‘foundation’ model (e.g., a GPT-5 series model), but a finished chatbot product (such as ChatGPT) has also gone through an enormous post-training (also known as ‘fine-tuning’) phase. Post-training can be summarised as a kind of reinforcement learning driven by human feedback, but it is much harder to discuss in detail: different companies have developed different methods for what seems to work, and in many cases it is less about technology and more about substantial human labour. Even though we cannot cover it properly in this guide, post-training is increasingly important in the field of LLMs.

Now that we have a rough handle on what LLMs are and how they work, we can discuss their use in the workplace.

Vibe coding: Using LLMs to write code in the workplace#

LLM-powered chatbots have become particularly popular as automated coding assistants. Many programmers around the world have opted (or been instructed) to use LLMs to generate code as a way to increase productivity, on the logic that it is faster to write prompts that generate code than it is to write code by hand. The slang term for this working practice is ‘vibe coding’.

Before we go any further, we need to cover hallucinations. LLMs can, and regularly do, generate plausible but inaccurate statements. The production of hallucinations is mathematically inevitable. This is not an example of the models ‘failing’; it is a feature - as we discussed earlier, all LLMs do is model patterns in text and replicate them. Outputs that ‘look right’ are all that the models are trained to produce. You must always keep in mind that LLMs do not have an internal world-model of what is true and false, only a very long list of parameters used to output next-word probabilities. It is from this that we get the first golden rule of vibe coding:

The first golden rule of vibe coding

You must never use or submit to peer review any code generated by an LLM without first reviewing it yourself.

We can then derive the second golden rule of vibe coding directly from the first rule:

The second golden rule of vibe coding

The only person responsible for the code you generated with an LLM is you.

We need one more golden rule to round everything out. Our first two golden rules concern outputs, but we must also set a rule for inputs. An LLM-powered chatbot is also a website, so you should take the same general precautions you would when using any other website on the internet.

The third golden rule of vibe coding

Treat the input of an LLM chatbot as you would any other public online domain. Under no circumstances should you enter sensitive information or data in the input of an LLM chatbot.

We can now discuss the benefits and risks of vibe coding, while bearing in mind that we must refer to our golden rules at all times.

Don’t kill my vibe - The benefits of vibe coding#

The most touted advantage of vibe coding is a significant increase in productivity. From prompts of only a few words, an LLM can generate expressions, functions, classes, even entire modules for you to add to your project. Integrating an LLM-powered chatbot into an IDE (such as pairing VS Code with GitHub CoPilot) allows the LLM to directly modify code files, removing the need to even copy and paste outputs. Vibe coding can free you and your team from time-consuming menial tasks to focus on delivering functionality, and it can empower individuals to tackle more ambitious tasks than they would otherwise be comfortable with.

One study across two companies (Microsoft and Accenture) found experimental evidence supporting the assertion that AI-assisted programmers write code faster than non-AI-assisted programmers. In this study there was a statistically significant increase in the number of pull requests and builds submitted by AI-assisted programmers when compared to non-AI-assisted programmers. However, there was no overall statistically significant difference in the number of successful builds between the two groups.

Vibe check failed - The risks of vibe coding#

Although vibe coding lets you write code faster, in practice you will inevitably spend a substantial amount of time re-prompting, reviewing, and editing the outputs of LLMs to achieve the desired final result. An AI-assisted programmer may quickly find themselves turning into a full-time code reviewer for their LLM assistant, with the total time spent on a task staying the same or even increasing, despite code being written faster.

Researchers at METR tested the productivity of experienced programmers and found the surprising result that AI-assisted programmers actually took longer to complete tasks than non-AI-assisted programmers, but the AI-assisted programmers believed that they had been more efficient! The authors of the study proposed five potential explanations for the result:

  • There is a degree of complacency caused by overoptimism about the usefulness of AI tools.

  • AI tools lack domain-specific knowledge.

  • AI tools perform worse in large and complex codebases.

  • There is a high level of rejection of AI-generated outputs, and a significant amount of time is spent reviewing and editing AI-generated outputs.

  • AI tools lack a lot of the implicit context that exists in projects.

There is also the pertinent question of how individuals and organisations build and retain skill in workplaces inundated with vibe coding. A study by Anthropic suggests that LLM adoption, especially if it is done quickly and aggressively, can have a negative impact on individuals’ ability to learn on the job. This has potentially quite serious implications at the organisational level regarding internal career progression and technical debt within teams. Negative impacts on psychological wellbeing have also been measured after the adoption of LLMs in the workplace, with increased boredom and decreased motivation associated with using LLMs.

There is also a risk that LLM-generated code can contain critical security vulnerabilities. This is of particular importance to anybody working on web-based platforms and applications.

Finally, we should also quickly mention the environmental and ethical concerns related to LLM-powered chatbots. These chatbots are not efficient machines. They consume vast quantities of electricity and fresh water (especially in the model training process), and the demand for both is only increasing. There are also ethical concerns around using LLMs, such as how LLMs use copyrighted works in their training data, the bias exhibited by LLMs as a result of them reproducing biases present in their training data, and the use of LLMs to produce offensive and harmful content.

Conclusion - Centaurs and reverse-centaurs#

Author and journalist Cory Doctorow uses the idea of centaurs and reverse-centaurs when discussing AI-assisted work. A centaur is a human that is assisted in their work by a machine. It’s nice to be a centaur, as the machine takes some of the drudgery out of work by doing the menial time-consuming tasks at a fraction of the effort, allowing you to be more productive and creative.

A reverse-centaur is a human that assists a machine in their work. It’s much less nice to be a reverse-centaur, as you are stripped of agency in your work - the machine dictates what you do and when you do it. When we use AI tools at work, we want to be centaurs and we don’t want to be reverse-centaurs.

Let’s look at the following example. Say you’re working on a data pipeline, and you need a function that flags outliers in your data. You could use CoPilot as an LLM-powered chatbot assistant, and give it the following prompt:


Fig. 18 A prompt offloading most of the thinking to CoPilot.#

I then get the following Python code from CoPilot:

You asked for one function, but because the prompt was vague and also asked for ‘quality standards’ to be adhered to (rather than verifying those yourself!), CoPilot has given a verbose, defensively-coded response. In total, the output has:

  • 1 class (that contains only attributes)

  • 5 functions

  • 225 new lines

  • An import of __future__ that suggests there could be compatibility issues in the code

You have to review all of this before you can add some simple outlier flagging to the data pipeline - practically the size of an entire pull request. By being too vague in your prompting and offloading too much of the cognitive work to CoPilot, you may find yourself turning into a reverse-centaur: CoPilot gets to design and write all the code, while you must test and review it. Can we change this prompt so that we are being centaurs, not reverse-centaurs?

Try the following updated prompt:


Fig. 19 An updated prompt where I am more precise with my requirements.#

And CoPilot produces the following output:

You now have only 1 function and 46 new lines to review (and the only imports are numpy and pandas). Because you thought about what you actually needed in the data pipeline (i.e., specifying that you’re working with numerical data and pandas DataFrames, and deciding on the thresholding yourself) rather than offloading the cognitive work to CoPilot, CoPilot has output a much more compact piece of code that will be easier to review and add to the data pipeline. In this example, you’re much more like a centaur - you get to design the code, while CoPilot does the menial work of typing it up.
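The actual CoPilot outputs are not reproduced in this guide, but for illustration, a compact function of the kind the second prompt asks for might look something like the sketch below. The z-score method, the threshold default, and the column handling here are assumptions made for the sake of example, not CoPilot’s actual response.

```python
import numpy as np
import pandas as pd

def flag_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Return a copy of df with a boolean 'is_outlier' column, True where
    any numeric value lies more than `threshold` population standard
    deviations from its column mean."""
    numeric = df.select_dtypes(include=np.number)
    z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)
    out = df.copy()
    out["is_outlier"] = z_scores.abs().gt(threshold).any(axis=1)
    return out

# A small demonstration: only the extreme value 100.0 is flagged.
data = pd.DataFrame({"value": [1.0, 2.0, 1.5, 2.5, 100.0]})
print(flag_outliers(data, threshold=1.5))
```

Note that the choice of method and threshold is precisely the design decision the updated prompt keeps in your hands rather than CoPilot’s.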

This outlines a general principle - LLM-powered chatbots are most useful when you direct them to work for you, rather than when you are picking up after them and trying to shape their outputs into something functional. Remember that they have no world-model of what is right and wrong, they don’t have any innate understanding of the requirements for your work, and they can’t be held responsible for shoddy work.

Acknowledgements#

Stephen Wolfram’s article What Is ChatGPT Doing … and Why Does It Work is an excellent primer on the underlying mechanics of LLMs and informed much of the first section in this guide. In addition, the Google Machine Learning Crash Course was enormously helpful and is another starting point for anyone interested in machine learning and AI. The sections on neural networks, embeddings, and LLMs are particularly relevant.