[2001.03985v1] Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling
We have seen empirically that the number of samples required for reliable estimation varies between tasks, models, and parameters of interest.

Figure 1: A. Number of samples used by fixed sampling (red curves) or inverse binomial sampling (blue; expected value) to estimate the log-likelihood log p on a single trial with probability p. IBS uses on average 1/p samples. B. Bias of the log-likelihood estimate. The bias of IBS is identically zero. C. Standard deviation of the log-likelihood estimate. (Why fixed sampling fails)

Figure 2: A. Bias of fixed-sampling estimators of the log-likelihood, plotted as a function of pM, where p is the likelihood on a given trial and M the number of samples. As M → ∞, the bias converges to a master curve (Equation ??). B. Same, but for the standard deviation of the estimate. (Why fixed sampling fails)

Figure 3: A. z-score plot for the total number of samples used by IBS. B. z-score plot for the estimates returned by IBS, using the exact variance formula for known probability. C. Calibration plot for the estimates returned by IBS, using the variance estimate from Equation ??. These figures show that the number of samples taken by IBS and the estimated log-likelihood are Gaussian, and that the variance estimate from Equation ?? is calibrated. (Higher-order moments)

Figure 4: A. Trial structure of the simulated orientation discrimination task. An oriented patch appears on screen for 250 ms, after which participants decide whether it is rotated rightwards or leftwards with respect to a vertical reference. B. Graphical illustration of the behavioral model, which specifies the probability of choosing rightwards as a function of the true stimulus orientation. The three model parameters σ, µ, and γ correspond to the (inverse) slope, horizontal offset, and (double) asymptote of the psychometric curve, as per Equation ??. Note that we parametrize the model with η ≡ log σ. (Orientation discrimination)
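As a concrete illustration of the estimator these figures characterize, here is a minimal Python sketch of single-trial IBS: draw samples from the model until one matches the observed response, then return minus the harmonic partial sum. The model is abstracted as a Bernoulli draw with hit probability p; this is an illustrative sketch, not the authors' reference implementation.

```python
import math
import random

random.seed(0)

def ibs_estimate(p):
    """Single-trial IBS: sample from the model until the first 'hit',
    then return the unbiased log-likelihood estimate
    L_hat = -sum_{j=1}^{K-1} 1/j, where K is the number of samples drawn."""
    k = 1
    while random.random() >= p:  # a miss occurs with probability 1 - p
        k += 1
    # Harmonic partial sum; equals 0 when the first sample is a hit (K = 1).
    return -sum(1.0 / j for j in range(1, k))

# Averaging many estimates recovers log p without bias,
# at an expected cost of 1/p samples per estimate.
p = 0.3
estimates = [ibs_estimate(p) for _ in range(200000)]
mean_est = sum(estimates) / len(estimates)
print(mean_est, math.log(p))  # the two values should be close
```

The stopping time K is geometric with mean 1/p, which is why the blue curve in Figure 1A is 1/p.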

Figure 5: A. Estimated values of η ≡ log σ as a function of the true η in simulated data, using IBS with R = 1 repeat (blue), fixed sampling with M = 10 (red), or the exact likelihood function (green). The black line denotes equality. Error bars indicate standard deviation across 100 simulated data sets. IBS uses on average 2.22 samples per trial. B. Mean and standard error (shaded regions) of estimates of η for 100 simulated data sets with η_true = log 2°, using fixed sampling, IBS, or the exact likelihood function. For fixed sampling and IBS, we plot mean and standard error as a function of the (average) number of samples used. C. Root mean squared error (RMSE) of estimates of η, averaged across the range of η_true in A, as a function of the number of samples used by IBS or fixed sampling. Shaded regions denote ±1 standard error across the 100 simulated data sets. We also plot the RMSE of exact maximum-likelihood estimation, which is nonzero since we simulated data sets with only 600 trials. D-F. Same, for γ. These results demonstrate that IBS estimates the parameters of the orientation discrimination model more accurately than fixed sampling using equally many or even fewer samples. (Orientation discrimination)

Figure 6: A. Trial structure of the simulated change localization task. While the participant fixates on a cross, 6 oriented patches appear for 250 ms, disappear, and then re-appear after a delay. In the second display, one patch has changed orientation (in this example, the top left). The participant indicates with a mouse click which patch they believe changed. B. The generative model is fully characterized by the proportion correct as a function of the model parameters and the circular distance between the orientations of the changed patch in its first and second presentation (see text). Here we plot this curve for two values of η ≡ log σ. In both curves, γ = 0.2. We can read off η from the slope and γ from the asymptote.
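The psychometric curve in Figure 4B can be sketched in a common form consistent with the caption: σ an (inverse) slope, µ a horizontal offset, and γ a lapse rate that sets a double asymptote at γ/2 and 1 − γ/2. The exact Equation ?? may differ in details; this is an assumed standard parametrization, not a transcription of the paper's formula.

```python
import math

def p_rightward(s, sigma, mu, gamma):
    """Probability of a 'rightwards' response to a stimulus at orientation s.
    gamma is a lapse rate (the curve asymptotes at gamma/2 and 1 - gamma/2),
    mu a horizontal offset, and sigma an inverse slope."""
    # Standard normal CDF via the error function.
    phi = 0.5 * (1.0 + math.erf((s - mu) / (sigma * math.sqrt(2.0))))
    return gamma / 2.0 + (1.0 - gamma) * phi

# With eta = log sigma as the fitted parameter, as in the captions:
eta, mu, gamma = math.log(2.0), 0.0, 0.1
sigma = math.exp(eta)
print(p_rightward(0.0, sigma, mu, gamma))  # at s = mu the curve passes through 0.5
```

Parametrizing with η ≡ log σ keeps the fitted parameter unconstrained while σ = exp(η) stays positive.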
(Change localization)

Figure 7: Same as Figure ??, for the change localization experiment and estimates of η ≡ log σ and γ. (Change localization)

Figure 8: A. Example board configuration in the 4-in-a-row task, in which two players alternate placing pieces (white or black circles) on a 4-by-9 board (gray grid), and the first player to get 4 pieces in a row wins. In this example, the black player can win by placing a piece on the square in the bottom row, third column. B. Illustration of the features used in the value function of the heuristic search model (Equation ??). For details on the model, see Appendix ?? and van Opheusden et al. (2016). (Four-in-a-row game)

Figure 9: Same as Figures ?? and ??, for the 4-in-a-row experiment and estimates of the value noise η ≡ log σ, pruning threshold ξ, and feature drop rate δ. (Four-in-a-row game)

Figure 15: Full parameter recovery results for the orientation discrimination model. A. Mean estimates recovered by fixed sampling with different numbers of samples. Error bars are omitted to avoid visual clutter. B. Mean estimates recovered by IBS with different numbers of repeats. The legend reports the average number of samples per trial that IBS uses to obtain these estimates. C. Mean estimates recovered using the ‘exact’ log-likelihood function (Equation ??). D-F. Same, for the bias parameter µ. G-I. Same, for the lapse rate γ. Overall, fixed sampling produces severely biased estimates of η and γ, while IBS is much more accurate. The bias parameter µ can be accurately estimated by either method, regardless of the number of samples or repeats. (Complete parameter recovery results)

Figure 10: A. Log-likelihood loss with respect to ground truth, as a function of the number of samples, for the orientation discrimination task. Lines are mean and standard error across 120 generating parameter values, with 100 simulated data sets each (error bars are smaller than the line thickness). B. Log-likelihood loss for the change localization task. Lines are mean and standard error across 80 generating parameter values, with 100 simulated data sets each. (Log-likelihood loss)

Figure 11: Likelihood function of log λ given that fixed sampling returns m = 0 (none of the samples from the model match the participant’s response). The likelihood is approximately flat for all log λ ≤ −2. Since λ ≡ pM, this implies that the posterior distribution over p will be dominated by the prior rather than by the evidence, as quantified by the likelihood. (Analysis of bias of fixed sampling)

Figure 12: Standard deviation of IBS (blue curve) and the lower bound given by the information inequality (black; see Equation ??). The standard deviation of IBS is within 30% of the lower bound across the entire range of p. (Estimator variance and information inequality)

Figure 13: Standard deviation times the square root of the expected number of samples drawn by IBS (blue) and fixed sampling (red), and the master curve (black) that fixed sampling converges to as M → ∞. (Estimator variance and information inequality)

Figure 14: Same as Figure ?? in the main text, but for the alternative fixed-sampling estimator defined by Equation ??. The results are qualitatively identical. (Alternative fixed sampling estimator)

Figure 16: Same as Figure ??, for the change localization model. Fixed sampling is severely biased for both the measurement noise η and the lapse rate γ, whereas IBS is accurate for η and biased for γ, but still much less biased than fixed sampling. (Complete parameter recovery results)

Figure 17: Same as Figure ??, for the four-in-a-row task. For this model, we do not have an exact log-likelihood formula or numerical approximation, so we only show fixed sampling and IBS. Overall, fixed sampling has severe biases in its estimates of η and δ, and a smaller bias in estimating ξ. IBS has almost no bias for η and only a small bias for ξ and δ. (Complete parameter recovery results)
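The bias of fixed sampling discussed in Figures 2, 11, and 15-17 is easy to reproduce by simulation. The sketch below contrasts a naive fixed-sampling estimator log(m/M) against single-trial IBS for a low-probability trial; the clipping of m to 0.5 to avoid log 0 is an illustrative choice, not the paper's estimator.

```python
import math
import random

random.seed(1)

def fixed_sampling_estimate(p, M):
    """Naive fixed-sampling estimator of log p: draw M model samples,
    count hits m, return log(m/M). Clipping m to 0.5 avoids log 0;
    this clipping rule is an illustrative choice, not the paper's."""
    m = sum(random.random() < p for _ in range(M))
    return math.log(max(m, 0.5) / M)

def ibs_estimate(p):
    """Single-trial IBS estimate of log p (unbiased)."""
    k = 1
    while random.random() >= p:
        k += 1
    return -sum(1.0 / j for j in range(1, k))

p, M, n = 0.02, 10, 20000
bias_fixed = sum(fixed_sampling_estimate(p, M) for _ in range(n)) / n - math.log(p)
bias_ibs = sum(ibs_estimate(p) for _ in range(n)) / n - math.log(p)
print(bias_fixed, bias_ibs)  # fixed sampling is badly biased here; IBS is not
```

With pM = 0.2, most fixed-sampling runs return m = 0 and carry almost no information about p, which is the regime Figure 11 analyzes.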

Figure 18: Graphical illustration of the two methods to implement IBS with multiple trials, in this case N = 6. In this figure, each column represents a trial; each box above the trial number represents a successive sample from the model for that trial, with red crosses for samples that do not match the participant’s response (‘misses’) and green checkmarks for ones that do (‘hits’). Above each column, we indicate K, the number of samples until a hit. For trials 2 and 4, K = 1, so L̂_IBS = 0. The most obvious implementation of multi-trial IBS is ‘columns-first’: sample model responses for each trial until a hit, and only then move on to the next trial. However, a more convenient sampling method is ‘rows-first’: sample one response for each trial with k = 1, then one response for each trial with k = 2 (excluding trials 2 and 4, since their first sample was a hit), and continue increasing k until all trials reach a hit. This method allows for early stopping and parallel processing. (Early stopping threshold)

Figure 19: Graph of √(p Li₂(1−p)), which is proportional to the optimal number of repeats for a trial with likelihood p (see Equation ??). We observe that the optimal allocation of computational resources entails repeated sampling for trials with p ≈ 1/2, while avoiding p ≈ 0 or p ≈ 1. (Reducing variance by trial-dependent repeated sampling)
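The ‘rows-first’ scheme of Figure 18 can be sketched as follows. Here `matches_response(i)` is a hypothetical stand-in for the user's simulator: it draws one synthetic response for trial i and reports whether it matches the participant's response.

```python
import math
import random

def ibs_rows_first(n_trials, matches_response):
    """Multi-trial IBS, 'rows-first': on each pass k, draw one model sample
    for every trial that has not yet produced a hit, then sum the per-trial
    estimates L_hat_i = -sum_{j=1}^{K_i - 1} 1/j."""
    K = [0] * n_trials               # samples-to-hit, per trial
    active = set(range(n_trials))    # trials still awaiting a hit
    k = 1
    while active:                    # row k of Figure 18
        for i in list(active):
            if matches_response(i):  # hit: record K_i, retire the trial
                K[i] = k
                active.discard(i)
        k += 1
    return sum(-sum(1.0 / j for j in range(1, Ki)) for Ki in K)

# Toy usage: trial i's synthetic response matches with probability ps[i].
random.seed(0)
ps = [0.5, 0.2, 0.8]
L_hat = ibs_rows_first(len(ps), lambda i: random.random() < ps[i])
print(L_hat)  # an unbiased estimate of sum_i log p_i
```

Because each pass touches every active trial independently, the inner loop can be parallelized, and the `while` loop can be cut off at an early-stopping threshold on k, as the caption notes.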