Exploring Large Language Models Under the Hood: Creativity, Reliability, and Challenges
The discipline of machine learning is not a foreign concept to most people; it has been part of our society for more than 60 years. If you are not too familiar with machine learning overall, Tom Mitchell's well-known definition captures its fundamental principle and goal really well:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
Essentially, machine learning is a discipline that integrates statistics and computer science to build models that replicate or automate tasks. No matter how smart your algorithm gets, there will always be outliers, so you track your model's performance with metrics built on false positives, false negatives, and similar counts.
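As a concrete illustration, here is a minimal Python sketch of tracking a classifier with such metrics; the confusion-matrix counts are made up for the example:

```python
# A minimal sketch of tracking classifier performance from a confusion
# matrix. The counts below are invented purely for illustration.
true_positives = 90
false_positives = 10
false_negatives = 25
true_negatives = 875

precision = true_positives / (true_positives + false_positives)  # 0.90
recall = true_positives / (true_positives + false_negatives)     # ~0.78
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```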
How Does a Large Language Model Work?
First, you need to understand that ChatGPT is a derivative of the GPT base models. GPT models, in general, perform a form of sentence completion, i.e., continuing your prompt based on patterns learned from their training data.
From many perspectives, ChatGPT shares technical similarities with a search engine. First, both rely on transformer-based models that encode the user query as a dense embedding, a numeric representation of the query's context that the computer can work with. Second, in both cases the goal is for the output to be semantically similar to the input.
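To make the embedding idea concrete, here is a small Python sketch that scores semantic similarity between embedding vectors with cosine similarity. The four-dimensional vectors are invented for illustration; real models use embeddings with hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, invented for this sketch.
query = np.array([0.2, 0.8, 0.1, 0.4])
doc_a = np.array([0.3, 0.7, 0.0, 0.5])  # semantically close to the query
doc_b = np.array([0.9, 0.1, 0.8, 0.0])  # semantically distant

print(cosine_similarity(query, doc_a))  # ~0.98, high similarity
print(cosine_similarity(query, doc_b))  # ~0.31, low similarity
```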
Unlike a search engine, which returns a verifiable response stored somewhere in a database, a GPT model constructs its own sentences, which are difficult to verify. To put it in the simplest terms, the model predicts the most likely next word after your sentence and iterates continuously. For example, a sentence such as "AI has the ability to" can be completed by "learn", "destroy", or "create", based on the dense embedding the model creates conditioned on your prompt context and its training data. This iterative sentence construction is carried out by a neural network, another important component, which I will explain in a future blog post.
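As a rough illustration of this iterative prediction, here is a Python sketch in which a hand-written probability table stands in for the neural network. The table and the complete function are assumptions made for the example, not how GPT is actually implemented:

```python
import random

# Toy "model": a lookup table of next-word probabilities, standing in
# for the neural network that would score every token in a real GPT.
NEXT_TOKEN_PROBS = {
    "AI has the ability to": {"learn": 0.5, "create": 0.3, "destroy": 0.2},
    "AI has the ability to learn": {"from": 0.7, "quickly": 0.3},
}

def complete(prompt: str, max_tokens: int = 2) -> str:
    """Repeatedly sample the next word and append it, GPT-style."""
    text = prompt
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(text)
        if probs is None:  # the toy table has no further continuations
            break
        tokens, weights = zip(*probs.items())
        text += " " + random.choices(tokens, weights=weights)[0]
    return text

print(complete("AI has the ability to"))
# e.g. "AI has the ability to learn from"
```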
Creativity and Hallucination: Two Sides of the Same Coin
Unlike a sentiment classifier that optimizes for accuracy, ChatGPT does not simply pick the highest-probability word. Instead, it randomly samples one of the top candidates, with the spread of that sampling controlled by a "temperature" parameter. If you always take the highest-ranked word or sentence, you get a flat essay. Because a ChatGPT-like model does pick lower-ranked words, it can be creative, interesting, and sometimes even unexpected, which fits the goal of recreating a human-like response. Like us, it sometimes says and writes things that do not make sense.
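Here is a small Python sketch of temperature-controlled sampling, using hypothetical scores for the three candidate words from the earlier example:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Scale scores by 1/temperature, then sample from the softmax."""
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(logits), p=probs))

# Hypothetical scores for the candidates "learn", "destroy", "create".
logits = np.array([2.0, 1.0, 0.5])

# Low temperature: almost always picks the top-ranked "learn".
# High temperature: flattens the distribution, so lower-ranked
# words show up far more often.
for t in (0.2, 1.0, 2.0):
    picks = [sample_with_temperature(logits, t) for _ in range(1000)]
    print(t, np.bincount(picks, minlength=3) / 1000)
```

As the temperature approaches zero, sampling collapses onto the single top-ranked word; raising it spreads probability onto the lower-ranked candidates, which is where the creativity (and the nonsense) comes from.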
This tension between creativity and reliability reflects the fact that the model intrinsically makes mistakes.
Problems and Remediation
If you only ever sample the highest-probability word, would that resolve hallucination? It would reduce it rather than remove it. Therefore, as with any other ML use case, it is incredibly important to keep humans in the loop.
To date, evaluating the truthfulness of ChatGPT's output remains an open problem. It is difficult to establish a metric that determines whether the model is making things up, especially since a large language model operates on context drawn from all facets of life and society: something that is false in one context can be true in another.
Some strategies can help, such as caching responses and running inference overnight, then having human quality assurance evaluate the outputs before serving them to customers.
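A minimal Python sketch of that idea follows; the generate_answer function and the question list are placeholders invented for the example, not a real model API:

```python
import json

def generate_answer(question: str) -> str:
    # Placeholder: in practice this would call the actual model.
    return f"(model answer for: {question})"

questions = ["What is a transformer?", "What is temperature sampling?"]

# Offline batch job (e.g. run overnight): generate and cache candidates.
cache = {q: {"answer": generate_answer(q), "approved": False} for q in questions}

# Human QA pass: a reviewer approves entries before they can be served.
cache["What is a transformer?"]["approved"] = True

# Persist the reviewed cache for the serving layer.
with open("reviewed_answers.json", "w") as f:
    json.dump(cache, f, indent=2)

def serve(question: str) -> str | None:
    """Return only answers a human has approved; otherwise nothing."""
    entry = cache.get(question)
    return entry["answer"] if entry and entry["approved"] else None
```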
Other, relatively novel, ideas investigate whether the human quality-assurance layer can be removed, such as running several ChatGPT instances as moderators of one another, or Tree of Thoughts, which augments the model's logical reasoning.
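A rough sketch of the moderator idea is below; ask_model is a stand-in for a real chat-model API call, and the prompt wording is purely illustrative:

```python
def ask_model(prompt: str) -> str:
    # Placeholder: in practice this would call a hosted chat model.
    return f"(model response to: {prompt[:40]}...)"

def moderated_answer(question: str, n_drafts: int = 3) -> str:
    # 1. Several independent instances each draft an answer.
    drafts = [ask_model(question) for _ in range(n_drafts)]

    # 2. A separate "moderator" instance compares the drafts, keeping
    #    the consensus and flagging claims the drafts disagree on.
    review_prompt = (
        "You are a moderator. Compare these draft answers to the question "
        f"{question!r} and return the answer best supported by consensus, "
        "noting any claim the drafts disagree on:\n\n" + "\n---\n".join(drafts)
    )
    return ask_model(review_prompt)

print(moderated_answer("What limits ChatGPT's reliability?"))
```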