Evaluating Log-Likelihood for Confidence Estimation in LLM-Based Multiple-Choice Question Answering
DOI:
https://doi.org/10.70844/ijas.2025.2.29
Keywords:
Artificial intelligence, Machine learning, Large Language Models (LLMs), Confidence estimation, Log-likelihood, Calibration, Multiple-Choice Question Answering (MCQA), Softmax, Uncertainty quantification, Model reliability, Answer scoring methods, NLP evaluation
Abstract
Reliable deployment of Large Language Models (LLMs) in question-answering tasks requires well-calibrated confidence estimates. This work investigates whether token-level log-likelihoods—sums of log-probabilities over answer tokens—can serve as effective confidence signals in Multiple-Choice Question Answering (MCQA). We compare three methods: (1) raw log-likelihood, (2) length-normalized log-likelihood, and (3) conventional softmax-based choice probability. Across four diverse MCQA benchmarks, we find that no single scoring method is universally best. Length normalization can significantly improve calibration but may reduce accuracy, while softmax and raw log-likelihood yield identical predictions. These results highlight important trade-offs between calibration and accuracy and offer insights into selecting or adapting confidence measures for different tasks. Our findings inform the design of more trustworthy LLM-based QA systems and lay groundwork for broader uncertainty quantification efforts.
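As a rough illustration of the three scoring methods named in the abstract, the sketch below computes them from per-token log-probabilities of each answer option. It is a minimal example, not the paper's implementation: the function name, input format, and toy log-probability values are hypothetical, and only the definitions (summed log-probabilities, per-token average, and a softmax over the summed scores) follow the abstract.

```python
import math

def score_choices(token_logprobs_per_choice):
    """Compute three confidence scores for MCQA answer options.

    token_logprobs_per_choice: one list per answer option, containing the
    model's log-probabilities for that option's answer tokens
    (hypothetical input format; values would come from an LLM's
    per-token output scores).
    """
    # (1) Raw log-likelihood: sum of token log-probabilities per option.
    raw = [sum(lps) for lps in token_logprobs_per_choice]

    # (2) Length-normalized log-likelihood: average log-probability per
    # token, removing the bias against longer answer strings.
    normalized = [sum(lps) / len(lps) for lps in token_logprobs_per_choice]

    # (3) Softmax over the raw log-likelihoods, giving a probability
    # distribution across the answer choices.
    max_raw = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - max_raw) for r in raw]
    total = sum(exps)
    softmax = [e / total for e in exps]

    return raw, normalized, softmax


# Toy example: three answer options with made-up token log-probabilities.
choices = [
    [-0.2, -0.5],        # option A, two tokens
    [-0.1, -0.3, -0.9],  # option B, three tokens
    [-1.2],              # option C, one token
]
raw, norm, soft = score_choices(choices)
print("raw:", raw)
print("length-normalized:", norm)
print("softmax confidence:", soft)
```

Because the softmax is a monotonic transformation of the raw log-likelihoods, the argmax (and hence the predicted answer) is the same for methods (1) and (3), consistent with the abstract's observation that they yield identical predictions; length normalization can change the ranking when options differ in token count.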
