
The Problem of Faithfulness in LLM Explanations
Evaluating the faithfulness of model-generated explanations is crucial to ensuring that those explanations reflect the true decision-making processes of large language models (LLMs). Assistants such as Copilot, ChatGPT, Claude and Gemini are only as trustworthy as their explanations are faithful.
Why is this important?
LLMs can generate plausible responses, but those responses may not reveal why the model made a particular choice. That is fine for creative tools such as Midjourney.com or Runway.ai; in fact, the more hallucination the better. However, in precision-critical domains such as law, medicine and finance, we need to get under the bonnet to understand exactly how the model reached its decisions and how we can improve them.
So what?
The better our methods, systems and processes for understanding AI decisions, especially in “mission-critical” fields where a wrong answer is unacceptable, the better we understand the “why” and “how” behind what LLMs generate, rather than just the “what”. It could be argued that Mistral.ai, with its open-source releases, encourages this kind of scrutiny more than most other current LLM providers.
We need to understand the “why” and the “how” of LLM behaviour in order to achieve the precise “what” we are after.
A Deeper Dive into Some Great Research
While current methods, such as the Counterfactual Test (CT), measure faithfulness by intervening in input texts and observing changes in predictions, they fail to consider the overall shift in the predicted label distribution.
There is excellent research into the Correlational Explanatory Faithfulness (CEF) metric (see below), which captures this overall shift in model output in more detail. Using this approach, the researchers Siegel, Camburu, Hees and Perez-Ortiz conduct experiments on several LLMs and datasets, highlighting new insights into explanation reliability and faithfulness.
Instant.Lawyer can see how important this research becomes as reliance on LLMs grows and faithfulness becomes even more critical to real-world solutions. Hallucination may be just fine for Midjourney, but it is catastrophic in fields such as legal tech, tax law and medicine.
Faithfulness in the contexts of law, finance and medicine is not just handy; it is critical.
A human-in-the-loop architectural philosophy, in which layers of human review sit on top of leading LLM structures, is the way forward for preserving this faithfulness.
Problem Definition and Key Contributions
In machine learning, understanding why a system generates a particular prediction is critical, especially in high-stakes domains such as medicine and criminal justice (Rudin, 2018). The field of Explainable AI (XAI) attempts to provide interpretable rationales, but evaluating their faithfulness remains problematic.
Faithfulness refers to the degree to which the generated explanation captures the actual reasoning process used by the model, not just a plausible justification.
Why is this research so interesting?
Introducing Correlational Explanatory Faithfulness (CEF): The paper argues that existing metrics are insufficient, as they rely on binary measures of faithfulness. CEF improves on this by assessing how much explanations reflect significant factors that influence the model’s predictions. It also tracks the difference in frequency between impactful and non-impactful factors in the explanations.
Correlational Counterfactual Test (CCT): By implementing CEF in a counterfactual framework, the paper provides an alternative to the Counterfactual Test (CT) of Atanasova et al. (2023). Unlike CT, CCT takes into account the total shift in label distribution probabilities, rather than only binary changes in the top-predicted class.
Experimental Validation: Experiments are conducted using the Llama2 family of LLMs on three datasets—e-SNLI, ComVE, and ECQA—to demonstrate the metric’s effectiveness. The authors find that CCT captures trends in faithfulness that CT misses, especially in larger models like Llama2-70B.
The Correlational Explanatory Faithfulness (CEF) Metric
Faithfulness metrics should evaluate whether explanations include the significant factors that influence a model’s prediction. However, the existing Counterfactual Test (CT) only verifies whether certain input interventions (such as inserting a word) cause a change in predictions, without measuring the scale of this shift in model predictions. CEF resolves this by incorporating the total variation in model predictions.
In particular, CEF measures the following:
The degree to which the insertion of specific terms in the input (interventional additions, or IAs) causes shifts in the model’s predictions.
Whether the explanation mentions impactful terms (IAs) more often than non-impactful ones.
This approach leads to more reliable measures of faithfulness, as it accounts for both the magnitude of changes in the predicted label distribution and whether the explanations align with the important features of the data.
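To make this concrete, here is a minimal sketch of a CEF-style computation. It is not the authors' code; the field names (impact, mentioned) and the toy data are purely illustrative, with the impact of each interventional addition (IA) assumed to be measured by something like Total Variation Distance (introduced below).

```python
# Minimal sketch of a CEF-style score (illustrative, not the authors' code).
# For each example: 'impact' is how strongly the inserted IA shifted the
# model's predicted label distribution; 'mentioned' is 1 if the model's
# explanation mentions the IA, else 0.
from scipy.stats import pearsonr

def cef_score(examples):
    impacts = [ex["impact"] for ex in examples]
    mentions = [ex["mentioned"] for ex in examples]
    # Correlate impact with mention: a high positive value means impactful
    # terms are mentioned in explanations more often than non-impactful ones.
    corr, _ = pearsonr(impacts, mentions)
    return corr

# Toy data: impactful IAs tend to appear in the explanation, trivial ones do not.
examples = [
    {"impact": 0.62, "mentioned": 1},
    {"impact": 0.40, "mentioned": 1},
    {"impact": 0.05, "mentioned": 0},
    {"impact": 0.02, "mentioned": 0},
]
print(f"CEF (toy): {cef_score(examples):.2f}")
```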
Correlational Counterfactual Test (CCT)
The Counterfactual Test (CT) previously used in explainability research inserts random terms into inputs and checks whether these terms cause changes in the model’s predictions. However, this test suffers from certain drawbacks:
Over-reliance on Binary Changes: CT only considers whether an intervention changes the model’s prediction in a binary sense (i.e., whether the top predicted label flips). It fails to account for cases where an intervention causes a significant probability shift that does not change the top predicted label.
Verbatim Explanations: CT can be trivially gamed by repeating input text verbatim in the explanation. This would lead to 0% unfaithfulness, as the explanations would always mention the IA, regardless of its significance.
CCT addresses these problems by quantifying the degree of change in the prediction using Total Variation Distance (TVD), which measures the difference between the model’s output distributions before and after the intervention. This allows for more nuanced evaluation, as the metric correlates the magnitude of the model's shift with the relevance of the terms mentioned in the explanation.
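For intuition, here is a small sketch of the TVD computation that CCT builds on. This is an assumed implementation rather than the authors' code; the toy distributions illustrate a case the binary CT would miss, because the top label stays the same while the probabilities shift substantially.

```python
# Total Variation Distance between predicted label distributions,
# before and after inserting an interventional addition (IA).
import numpy as np

def total_variation_distance(p_before, p_after):
    """TVD(P, Q) = 0.5 * sum_i |P_i - Q_i| over the label classes."""
    p_before = np.asarray(p_before, dtype=float)
    p_after = np.asarray(p_after, dtype=float)
    return 0.5 * np.abs(p_before - p_after).sum()

# Toy e-SNLI-style distributions: [entailment, neutral, contradiction].
p_before = [0.70, 0.20, 0.10]
p_after  = [0.45, 0.40, 0.15]  # same top label, but a large probability shift
print(f"TVD: {total_variation_distance(p_before, p_after):.2f}")  # -> 0.25
```

CCT then correlates this per-example TVD with whether the explanation mentions the inserted term, in the spirit of the CEF sketch above.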
Experiments and Results
The authors conducted experiments using LLMs from the Llama2 family across three datasets: e-SNLI (natural language inference), ComVE (common-sense violation), and ECQA (commonsense question answering). The results are summarized as follows:
Faithfulness Trends: The CCT revealed that explanations for e-SNLI were more faithful than those for ECQA and ComVE. For e-SNLI, impactful IAs were more likely to be mentioned in explanations, while ECQA showed relatively flat trends. This indicates that ECQA explanations were less sensitive to the importance of the terms involved in the predictions.
Larger Models Perform Better: The largest LLMs, such as Llama2-70B, produced the most faithful explanations, as they were more likely to mention impactful IAs and leave out irrelevant ones.
Dataset Variability: There was significant variation between datasets in terms of faithfulness, possibly due to the nature of the tasks and differences in human-annotated explanations. For instance, ECQA explanations were often verbose and included less relevant terms, reducing the overall faithfulness.
Discussion and Outlook
The introduction of CEF and CCT marks a significant improvement over previous metrics by providing a more accurate and reliable measure of faithfulness in model-generated explanations. Their research demonstrates that larger models produce more faithful explanations, although this varies by dataset. The experiments also suggest that the faithfulness of explanations improves when explanations are conditioned on reasoning processes generated before the prediction (explain-then-predict).
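To illustrate the distinction, here is a rough sketch of the two prompt orderings for an e-SNLI-style task. The templates are hypothetical stand-ins, not the prompts used in the paper.

```python
# Hypothetical prompt skeletons contrasting the two orderings.

# Predict-then-explain: the label is generated first, and the explanation
# is generated afterwards, conditioned on that label.
PREDICT_THEN_EXPLAIN = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Label (entailment / neutral / contradiction):"
)

# Explain-then-predict: the model writes its reasoning first, and the label
# is generated afterwards, conditioned on that reasoning.
EXPLAIN_THEN_PREDICT = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "First explain your reasoning, then give the label "
    "(entailment / neutral / contradiction).\n"
    "Reasoning:"
)

example = {"premise": "A man plays a guitar on stage.",
           "hypothesis": "A musician is performing."}
print(EXPLAIN_THEN_PREDICT.format(**example))
```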
Limitations and Future Work
Despite the improvements over existing methods, the authors acknowledge several limitations:
Limited Interventions: The counterfactual interventions used in the experiments were restricted to inserting single adjectives or adverbs. Future research could explore the impact of more complex interventions, such as replacing larger text segments or considering different parts of speech.
Generalizability: The CCT's effectiveness depends on the nature of the dataset and task. Additional testing across more varied tasks and instruction-tuned models could further validate the approach.
Semantic Coherence: While random interventions were filtered to ensure that they made sense, some generated text may still lack semantic coherence. More advanced filtering techniques using LLMs could mitigate this issue.
This clever research introduces a novel metric, Correlational Explanatory Faithfulness (CEF), and its implementation through the Correlational Counterfactual Test (CCT) to improve the evaluation of model-generated explanations.
By considering the total variation in prediction distributions, these methods provide a more faithful measure of how well explanations align with the true decision-making process of LLMs.
Peter Toumbourou
Stay tuned for more insights and updates on how Instant.Lawyer is empowering individuals, businesses and legal professionals globally. For more information or to request a demo, contact us.
Recommended Further Reading:
Jacovi and Goldberg (2020) – "Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?"; for an in-depth survey on faithfulness in natural language explanations.
Rudin (2018) – "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead"; on the importance of interpretable models in high-stakes AI systems.
Atanasova et al. (2023) – "Faithfulness Tests for Natural Language Explanations"; for the original Counterfactual Test approach.
Camburu et al. (2018) – "e-SNLI: Natural Language Inference with Natural Language Explanations"; on the e-SNLI dataset and its use in generating natural language explanations.