The AI Revolution
Artificial intelligence (AI) has been one of the most rapidly evolving fields in computer science over the past few decades, and in the last few years it has come of age, with increasing numbers of real-world applications. AI has been used to solve a wide range of problems, from image recognition to natural language processing. With the public exposure of LLM systems in search engines and image generators, public interest and the range of solutions on offer have exploded. In this article, we discuss some of the recent advances in AI, including deep learning, reinforcement learning, and computer vision.
Deep learning is a subfield of machine learning that involves training artificial neural networks: neural networks with multiple layers of "virtual neurons". Deep learning has been used to solve many complex problems such as image recognition, speech recognition, and natural language processing. One of the most significant advances in deep learning has been the development of large language models (LLMs) such as GPT-4 by OpenAI; BERT, T5 & Gemini by Google; XLNet by Google/CMU; Turing-NLG by Microsoft; Megatron by NVIDIA; Grok by xAI; Claude (Sonnet) by Anthropic; RoBERTa & LLaMA by Facebook; DeepSeek by DeepSeek; Nova by Amazon; and Mistral & Mixtral by Mistral AI, plus a host of other smaller and emerging systems. LLMs have been trained on massive datasets and can generate human-like responses by processing natural-language inputs.
Reinforcement learning is another subfield of machine learning, in which agents are trained to make decisions based on rewards and punishments. Some of the earlier public AI successes came from the reinforcement learning space, and the technique is used to solve complex problems in game playing, robotics, industrial control and autonomous vehicles. One of the better-known advances in reinforcement learning was the development of AlphaGo by DeepMind some years ago: AlphaGo was able to defeat the world champion at the game of Go using a combination of deep neural networks and reinforcement learning.
Reinforcement learning works particularly well when an independent observer or interpreter knows the path that should be walked, or can algorithmically determine whether the last action taken has moved the environment to a state closer to a known end goal than it was before. It is essentially an implementation of the trial-and-error approach humans employ when developing new skills with little or no pre-training, or where a skill requires practice with gradual improvement (like piano playing). The method is very dependent on the suitability of the reward function and suits complex problems where trial-and-error learning is possible. It tends to be costly in terms of computational effort and may be defeated in situations where local minima occur on the path. A local minimum occurs where the current state is the closest available state to the end goal and every next step leads to a state that is further away; going "backwards" (in terms of reward) for a while may therefore be necessary in order to go forward and find an even closer state. In these scenarios the cleverness of the reward function becomes particularly important, or the agent may fail to progress. There are broadly two forms of reinforcement: positive (strengthening a behaviour when a positive outcome occurs, so as to increase the frequency or strength of an action) and negative (strengthening a behaviour because a negative outcome is stopped or avoided). Reinforcement learning is important in AI generally, and in LLM prompt engineering, because it is one of the methods that can be employed to resolve complex reasoning and logic problems, and it is broadly the method used to fine-tune a range of neural networks.
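To make the trial-and-error loop concrete, here is a minimal sketch of tabular Q-learning, one of the simplest reinforcement learning algorithms. The corridor environment, reward values and hyper-parameters are invented purely for illustration; a real application would need a far richer environment and reward function.

```python
import random

# A minimal tabular Q-learning sketch on a 1-D corridor: the agent starts at
# cell 0 and is rewarded only when it reaches the goal cell. All environment
# details and hyper-parameters here are illustrative assumptions.
N_STATES = 6          # cells 0..5, goal at 5
ACTIONS = [-1, +1]    # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Apply an action, return (next_state, reward): +1 at the goal, 0 elsewhere."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted best future value.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# The learned policy: the preferred action in each non-goal cell.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```

After training, the learned policy for every non-goal cell is to step towards the goal, which is exactly the behaviour this (admittedly trivial) reward function encourages.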
Computer vision is a subfield of AI that involves training machines to interpret and understand visual data from the world around them. Computer vision has been used to solve problems such as object recognition, facial recognition, and vision interpretation for self-driving cars. One of the most significant advances in computer vision has been the development of convolutional neural networks (CNNs). CNNs have been used to achieve state-of-the-art performance on many computer vision tasks.
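As a rough sketch of what a CNN looks like in code, the toy classifier below (written with PyTorch) stacks two convolution-and-pooling stages over a small image and finishes with a fully connected layer. The channel counts, kernel sizes and 32x32 input are assumptions chosen only to keep the example small, not a recommended architecture.

```python
import torch
import torch.nn as nn

# A toy convolutional network: two conv/pool stages followed by a linear
# classifier. All sizes here are illustrative assumptions.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local edge/texture filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # one fake RGB image
print(logits.shape)  # torch.Size([1, 10]): one score per class
```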
AI has made significant advances in natural language processing (NLP). NLP involves training machines to understand and generate human language, and has been used to solve many complex problems such as machine translation, sentiment analysis, and chatbots. One of the most significant advances in NLP has been the development of transformer neural networks, such as the LLMs listed above, which have been used to achieve state-of-the-art performance on many NLP tasks.
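As an indication of how accessible transformer-based NLP has become, the snippet below uses the Hugging Face transformers library (assuming it is installed) to run sentiment analysis with a pretrained model; the specific model downloaded is the library's default, not something specified here.

```python
# Requires: pip install transformers  (a pretrained model is downloaded on first use)
from transformers import pipeline

# The pipeline wraps tokenisation, the transformer model and post-processing.
classifier = pipeline("sentiment-analysis")
print(classifier("The latest generation of language models is remarkably capable."))
# Typical output (exact score varies): [{'label': 'POSITIVE', 'score': 0.99...}]
```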
Generative Pretrained Transformer (GPT) networks such as LLMs derive meaning from long sequences of text, learning how semantic components relate to one another in a probability-driven framework and determining how likely one semantic component is to appear in proximity to another. The transformer networks are trained, unsupervised, on a vast body of textual (and other) data to create a pretrained model, which is then fine-tuned by humans interacting with it to establish patterns of response and to train response classification. For example, a raw LLM trained only on extensive bodies of narrative text would not understand the question-and-answer pattern, and would likely simply attempt to continue the question with sentences made up of tokens that are likely to follow the tokens in the question. In fine-tuning, the LLM is taught the pattern of question and answer. Modern hyper-scale LLMs like those listed above have been trained on much of the world's currently available knowledge in digital form, and thus (at least in theory) have everything we know as a species (to the extent that it has been transcribed into digital form) embedded in their "brains" as multi-dimensional probability networks of how tokens relate to one another. For simplicity you can think of tokens as "words"; that is not quite correct all the time, but it is close enough most of the time to work.
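The toy model below captures the "probability of one token following another" idea in miniature: it counts bigrams in a tiny made-up corpus and then generates text by repeatedly choosing the most probable next token. Real LLMs learn these relationships over enormous contexts with billions of parameters, and use sub-word tokens rather than whitespace-split words, but the underlying principle of next-token prediction is the same.

```python
from collections import Counter, defaultdict

# A miniature "language model": bigram counts turned into next-token probabilities.
# The corpus is a made-up example; splitting on whitespace stands in for real
# sub-word tokenisation.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def next_token(token):
    """Return the most probable next token seen after `token`, plus the full distribution."""
    counts = follows[token]
    total = sum(counts.values())
    probs = {t: c / total for t, c in counts.items()}
    return max(probs, key=probs.get), probs

token = "the"
generated = [token]
for _ in range(4):
    token, _ = next_token(token)
    generated.append(token)
print(" ".join(generated))  # e.g. "the cat sat on the"
```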
Generative adversarial networks (GANs) are another area where AI has made significant advances. GANs involve training two neural networks against each other: a generator network that generates new data samples, and a discriminator network that tries to distinguish between real and fake samples. GANs have been used to generate realistic images, videos, and audio samples. However, after some initial successes they have fallen out of favour: generated images tended to lack variety, training times were excessive, they are subject to mode collapse (where the generator falls into producing a limited set of outputs, yielding repeatedly similar or identical samples because the discriminator becomes too strong, a local minimum similar to the reinforcement learning problem described earlier), and they are inherently difficult to train because of the adversarial nature of the training strategy.
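The skeleton below shows the adversarial loop in its simplest possible form (written with PyTorch): a generator tries to turn random noise into samples resembling a 1-D target distribution while a discriminator tries to tell real from fake. The network sizes, learning rates and target distribution are illustrative assumptions; real image GANs use convolutional networks and far longer training, and are where the mode collapse and training instability problems described above show up.

```python
import torch
import torch.nn as nn

# Tiny GAN on 1-D data: the "real" data is a normal distribution centred on 4.
# Every architectural and hyper-parameter choice here is illustrative only.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) + 4.0          # samples from the target distribution
    fake = G(torch.randn(64, 8))             # samples produced by the generator

    # Discriminator update: label real samples 1 and generated samples 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(256, 8)).mean().item())  # should drift toward ~4.0 as training succeeds
```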
For image, audio and video data sets, an extensive data lake of labelled images is used to train the network, and a process of diffusion is applied: noise is added to create randomness in an image and is then removed progressively while the model attempts to match semantically similar images. This is the neural analogue of the Markov chains used to model the physical process of gas diffusion. The diffusion model has two main actions, forward and backward diffusion. During the training phase, forward diffusion merges random noise with the captured images, while during the inference phase, backward diffusion gradually de-noises a sample to reveal the image. In a sense, the noisy pattern stored by the model after forward diffusion contains every possible image that could be produced by the network (if only you knew how to extract them!). Models that perform text-to-image or sound generation use this process. Diffusion models are immune to the mode collapse problem of GANs, but on their own they are computationally expensive to operate because, in their basic form, they work at the pixel level and all the Markov states have to be predicted and held in memory at all times (think: imagining every possible way to draw a cat at once while you work through settling on one of them).
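Here is a minimal sketch of the forward (noising) half of that process, assuming the standard closed form used in DDPM-style diffusion models: the noisy image at step t is a weighted blend of the original image and Gaussian noise, with the weights set by a noise schedule. The schedule values and the tiny "image" below are made up for illustration; the backward (de-noising) half, which requires a trained neural network to predict the noise, is not shown.

```python
import numpy as np

# Forward diffusion in closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
# The linear beta schedule and the tiny 4x4 "image" are illustrative assumptions.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # how much noise is added at each step
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)              # cumulative signal retention up to step t

def noisify(x0, t, rng=np.random.default_rng(0)):
    """Return the image after t steps of forward diffusion."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.ones((4, 4))                        # stand-in for a training image
for t in (0, 250, 999):
    print(t, noisify(x0, t).round(2))
# At t=0 the image is nearly intact; by t=999 it is almost pure noise.
```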
NLP (transformer networks), GANs and diffusion all have their own advantages and disadvantages in image processing. In response, the concept of the Latent Diffusion Model (LDM) has emerged. The LDM combines the goal-seeking power of GANs, the detail preservation of diffusion and the semantic power of transformers. It works in the latent space rather than the pixel space and is thus more efficient and versatile than any of the previous solutions on their own. The pixel space is the pixel representation of the image being learned and contains all of the original pixel data, including redundant and duplicated information, while the latent space is a compressed representation of the original pixel space that preserves the main features of the underlying pixel structure. The transformation from pixel space to latent space typically involves reducing the dimensions of the original high-resolution image, but it could take other forms. In a sense, feature extraction is a form of latent space transformation, in that each feature represents a data point summarizing a collection of pixel data points; similarly, edge-finding algorithms typically reduce a complex image to a collection of lines (edges) and thus also reduce the image detail stored while extracting a set of features. So the LDM gains its first advantage over standard diffusion by reducing the amount of data to which the noisification process must be applied while extracting the key elements from the noise. This in turn equips cross-attention algorithms with features to which attention weights can be applied, and facilitates a semantic understanding of the image elements. When generating images from text prompts, a diffusion network uses an LLM to interpret the text and provide a semantic understanding of what is desired, which the diffusion model uses, via its labelled features, to direct the generation of the corresponding image. The LDM is initially trained with text-labelled images. While the image labels are turned into vectors at the start of the process and concatenated with the image data, they are not themselves "noisified" and travel through the process unpolluted, so they can be used to guide image selection and assembly during the generation (de-noisification) process.
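Here is a rough sketch of the pixel-space versus latent-space idea, using a tiny untrained convolutional autoencoder (written with PyTorch): the encoder compresses a 64x64 image into a much smaller latent tensor, and a latent diffusion model would add and remove noise in that compressed space rather than over raw pixels. The layer shapes are assumptions chosen only to show the compression; production LDMs use much larger, carefully trained autoencoders plus cross-attention to text embeddings.

```python
import torch
import torch.nn as nn

# Encoder: 3x64x64 pixels -> 4x8x8 latent (roughly a 48x reduction in values to diffuse over).
# Decoder: maps the latent back toward pixel space. All sizes are illustrative.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.Conv2d(64, 4, 3, stride=2, padding=1),              # 16 -> 8, 4 latent channels
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),              # 32 -> 64
)

image = torch.randn(1, 3, 64, 64)     # stand-in for a training image
latent = encoder(image)               # diffusion would add/remove noise here, not in pixel space
reconstruction = decoder(latent)
print(image.numel(), latent.numel(), tuple(reconstruction.shape))  # 12288 pixel values vs 256 latent values
```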
Broadly, we can classify current AI systems as goal seeking, discriminative (classifier) or generative.
GPT/LLMs can be trained into any of these three forms (goal seeking, discriminative/classifier or generative) during the training, fine-tuning or even prompt-engineering stages. The earlier the classifier or goal-seeking training is injected, the more strongly the network will behave in that fashion, and the less well it will perform as a general generative network. It is therefore probably preferable, at least in the general case, to reserve classifier and goal-seeking behaviour modification for the later stages (fine-tuning or prompt engineering), or perhaps to apply it to an ancillary network, with agents used to select how and when it should be applied. If the AI is intended for edge IoT devices, their limited power, memory and computational resources would argue for a dedicated classifier or goal-seeking network rather than a more general-purpose GPT solution.
As one might guess, the deep learning approach of generative AI is computationally intensive. It makes use of the parallel and concurrent mathematical processing capabilities of GPUs to drive an iterative unsupervised learning model, consuming far more compute cycles and memory than earlier generations of AI such as Expert Systems, which were essentially human-coded, fixed and inflexible problem-solving approaches.
Early neural networks were essentially permanently in learning mode, which could result in the knowledge base being corrupted over time as the network learned from right and wrong responses alike as if they were all "right" responses. The modern LLM has distinct training and inference phases: the computational demands during training are extensive, while the computational demands during inference are, comparatively, insignificant.
AI has made possible significant advances in robotics and autonomous systems. Robotics involves training machines to perform physical tasks such as grasping objects or navigating through an environment. Autonomous systems involve training machines to make decisions without human intervention. These technologies are being used in many applications such as self-driving cars, drones, and industrial automation.
Some of the most impressive advances in AI over recent years have come from areas such as deep learning, reinforcement learning, computer vision, natural language processing, generative adversarial networks, robotics, and autonomous systems. These technologies have the potential to revolutionize many fields, such as healthcare and education, by processing natural-language inputs and generating human-like responses.
Increasingly, current AI direction hinges on the power of LLMs, which drive both text-only networks and other specialised AI systems such as diffusion networks (for image, audio, etc.). During the inference phase (most widely recognised as the "chat-bot" pattern that most users associate with LLM AIs and, indeed, how most people experience an AI), the LLM receives a prompt from a user (or agent, or similar) and frames some form of response to that prompt. The skill of forming effective prompts to elicit a relevant and valid response from the LLM is called "prompt engineering". Prompt engineering is essentially programming the LLM for a specific problem using English (or another human language) and logic, to help it extract from its knowledge base the answer desired. It is a semi-complex skill that is rapidly becoming essential in the modern age.
That might sound simple: after all, we can all speak and write, so anybody should be able to prompt an LLM and get as good an answer as anybody else, right? I urge a sense of caution. In theory, anybody who knows a topic should be able to teach it, but we have all experienced good and bad teachers, and the most knowledgeable person on a topic is not necessarily the best teacher of it. So clearly there is a skill to teaching beyond merely knowing the topic at hand. Likewise with LLM prompting: while anyone can ask an LLM a question and get an answer, some will ask that question in a way that gets the right answer, while others will ask it in a way that the LLM either misunderstands, or understands but does not know how to answer, and will thus get the wrong answer. In these cases the skill of prompt engineering, correctly prompting the LLM, becomes critical.
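As an illustration of the difference, the snippet below assembles a structured prompt (role, context, task, output format) rather than a bare question. The structure shown is a common prompt-engineering convention rather than a requirement of any particular LLM, and the scenario and wording are invented for the example; the resulting string would be pasted into, or sent via the API of, whichever LLM you actually use.

```python
# Building a structured prompt. The scenario below is invented for illustration;
# substitute the chat window or API of whatever LLM you actually use.
def build_prompt(role, context, task, output_format):
    return "\n".join([
        f"You are {role}.",                 # role: frames the perspective the model should adopt
        f"Context: {context}",              # context: the facts it should rely on
        f"Task: {task}",                    # task: the specific action requested
        f"Respond as: {output_format}",     # format: constrains the shape of the answer
    ])

prompt = build_prompt(
    role="an experienced network engineer explaining concepts to a junior colleague",
    context="a small office network that drops its internet connection every afternoon",
    task="list the three most likely causes and one diagnostic step for each",
    output_format="a numbered list, one or two sentences per item",
)
print(prompt)
# Compare this with the bare prompt "why does my internet drop?": the structured
# version gives the model a role, the relevant facts and a concrete output shape.
```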
In the collection of articles on this site we will explore AI and generative systems, Diffusion networks and Large Language Models in particular, while providing extensive references to other sites that drill into the individual topics covered here in more detail. These articles are intended as a survey of the current state of the art, a course in the use of some of the AI technologies (in particular LLMs), a repository of resources available to, and accessible by, the public, and, lastly, a thorough course in the IT discipline of prompt engineering. The right-hand column of each page has a list of relevant references for further reading on the topics discussed on the associated page.