A Functional Usability Analysis of Appearance-Based Gaze Tracking for Accessibility

Appearance-based gaze tracking algorithms, which compute gaze direction from images of the user's face, are an attractive alternative to infrared-based external devices. Their accuracy has benefited greatly from powerful machine-learning techniques. The performance of appearance-based algorithms is normally evaluated on standard benchmarks, which typically involve users fixating points on the screen. However, these metrics do not easily translate into functional usability characteristics. In this work, we evaluate a state-of-the-art algorithm, FAZE, on a number of tasks of interest to the human-computer interaction community. Specifically, we study how gaze measured by FAZE could be used for dwell-based selection and reading progression (line identification and progression along a line), which are key functionalities for users facing motor and visual impairments. We compared the quality of gaze data from 7 participants using FAZE against that from an infrared tracker (Tobii Pro Spark). Our analysis highlights the usability of appearance-based gaze tracking for such applications.


INTRODUCTION
Eye gaze tracking has been used extensively as a human-computer interface modality (e.g., pointer control [Drewes et al. 2007; Sibert and Jacob 2000], magnification control [Ashmore et al. 2005; Manduchi and Chung 2022]), to measure the user's attention (e.g., when driving a vehicle [Vicente et al. 2015] or visiting a web site [Pan et al. 2004]), to study reading behaviors [Rajendran et al. 2018; Vo et al. 2010], and to identify specific conditions such as autism spectrum disorders [Murias et al. 2018], ADHD [De Silva et al. 2019], or dyslexia [Raatikainen et al. 2021; Rayner 1998; Wang et al. 2024]. Measurements of the user's gaze point (point of regard on the screen) are usually obtained through an external device (a gaze tracker) that uses an infrared illuminator and one or more cameras to compute the visual axis [Guestrin and Eizenman 2006] (the line from the point of regard to the center of the fovea through the pupil). Modern commercial gaze trackers can be rather accurate (with errors of a fraction of a degree) while allowing users to move their heads within a certain volume of space [Tobii n.d.].
In recent years, there has been increasing interest in software systems that leverage modern machine learning to estimate a person's gaze direction from an image of their face, taken, e.g., from a screen camera. The practical advantages of "appearance-based" tracking software are apparent, both in terms of convenience (no need for an external device to connect) and cost (infrared-based trackers are still quite expensive). However, the accuracy of appearance-based trackers still lags behind that of infrared trackers [Zhang et al. 2019].
This article presents a functional usability analysis of a state-of-the-art appearance-based tracker. While the performance of gaze trackers is normally expressed in quantities such as angular errors, typically computed in specific settings (e.g., with users looking at a target on the screen), these quantities do not easily translate into desired usability parameters. Therefore, we investigate whether state-of-the-art appearance-based trackers can serve as potential substitutes for infrared-based systems, especially in the following applications tailored for users with disabilities:
Dwell-Based Selection. This is a standard technique for users who are unable to trigger a click event using a mouse or a switch [Jacob 1991; Müller-Tomfelde 2007; Paulus and Remijn 2021; Sibert and Jacob 2000; Zhang and MacKenzie 2007]. While other approaches have been considered (e.g., blink-based [Huckauf and Urbina 2008; Lu et al. 2020]), dwell-based selection remains a popular choice and is implemented in commercial devices such as the Tobii Dynavox communication system [Menges et al. 2019], enhancing accessibility for those with physical limitations.
Reading Progression Tracking. Measuring progression when reading a document can be useful to assess one's cognitive reading skills [Huck 2016; Patterson and Ralph 1999] or to provide gaze-contingent reading support (e.g., highlighting the line currently being read [Rosenberg 2008], controlling the speed of auto-scrolling [Kumar et al. 2007; Sharmin et al. 2013] or of text-to-speech [Schiavo et al. 2015], magnifying the text being gazed at [Ashmore et al. 2005; Manduchi and Chung 2022; Maus et al. 2020], or detecting reading difficulties and augmenting text [Biedert et al. 2009; Bottos and Balasingam 2020; Lunte and Boll 2020]), aiding those with dyslexia or low vision [Wang et al. 2024]. We are interested in reading line identification (detecting which text line in the document is currently being read [Bottos and Balasingam 2020; Sun and Balasingam 2021; Wang et al. 2024]) as well as in tracking progression along a line by measuring fixation scanpaths [Deng et al. 2023; Reichle et al. 2003].
We selected FAZE [Park et al. 2019] as our reference appearance-based gaze tracking algorithm (described in Sec. 3.1.3). FAZE has been shown to achieve an accuracy of about 3° on multiple standard benchmark data sets. One important feature of FAZE is that it adapts to the appearance characteristics of a new user from just a few calibration images. An open-source implementation of FAZE was made available by its authors. On a Lambda Tensorbook, FAZE produces gaze data at a rate of 6 fps.
To evaluate the feasibility of FAZE for the considered applications, we conducted a small study with 7 participants, who completed two tasks: a fixation task (representative of dwell-based selection) and a reading task. Images of the participants during these tasks were taken by a computer camera. In addition, we used an infrared-based gaze tracker (Tobii Pro Spark) to capture their gaze direction. The Tobii tracker is used as a reference against which to compare FAZE data. We define specific metrics for each task, and evaluate FAZE and Tobii data comparatively against these metrics. Our results give a detailed picture of the types of errors that can be expected when using FAZE, providing insights for designers integrating appearance-based gaze tracking in applications for individuals with disabilities.

RELATED WORK
Hohlfeld et al. [Hohlfeld et al. 2015] presented an analysis of the applicability of computer vision-based gaze tracking for mobile scenarios that is germane to our work. The main differences between our contribution and [Hohlfeld et al. 2015] are the following.
1. Appearance-based algorithm: Hohlfeld et al. used EyeTab [Wood and Bulling 2014], a model-based tracker whose accuracy (errors of 7° in ideal conditions) is substantially lower than that of learning-based algorithms such as FAZE.
2. Task set: The following tasks were considered in [Hohlfeld et al. 2015]: Focus on Device (determining whether the user was looking at a tablet computer or behind it); Line Progression: Line Test (finding regressions when following a moving dot); Word Fixation: Point Test (finding fixation times). Our tasks (dwell-based selection, reading line identification, progression along a line) are substantially different from those in [Hohlfeld et al. 2015].
3. Infrared gaze tracker as reference: We use a commercial-grade infrared gaze tracker to produce a reliable baseline against which to compare the data from appearance-based tracking. Comparison between the two trackers is important to establish whether an appearance-based tracker can be substituted for an infrared-based tracker, which is the main research question motivating our work.
Zhang et al. [Zhang et al. 2019] presented a comparative evaluation of two appearance-based gaze tracking algorithms (MPI-IFaceGaze [Zhang et al. 2017] and GazeML [Park et al. 2018]) against a consumer-grade infrared-based device (Tobii EyeX). This work was concerned with the range of viewing distances for which gaze could be reliably computed, the required number of calibration samples, the systems' robustness to varying illumination (indoor vs. outdoor), and their ability to measure gaze for users wearing glasses. While very valuable, these evaluations are very different from the tasks considered in our contribution.
Wang et al. [Wang et al. 2024] developed GazePrompt to improve digital reading for low-vision users by providing line-switching and difficult-word recognition features, utilizing an infrared-based tracker. This work highlights the need to investigate appearance-based gaze tracking as a means to enhance usability and accessibility. Such exploration could lead to significant advancements in assistive reading technologies.

METHOD
3.1 Apparatus
3.1.1 Computer. We used a Lambda Tensorbook (equipped with an NVIDIA RTX 2080 GPU and an 8-core Intel i7-10875H at 2.30 GHz, running Ubuntu 20.04.6) for our tests. The screen size (active pixel area) was 349 mm by 195 mm, for a resolution of 1920 by 1080 pixels. A 1080p webcam was located on the top edge of the screen.

3.1.2 Infrared Gaze Tracker. We used a Tobii Pro Spark gaze tracker for baseline measurements. This is a moderately priced model that produces binocular measurements at 60 Hz. In ideal conditions, its nominal accuracy (mean angular error) is 0.45°, while its precision (standard deviation of the error) is 0.26° [Tobii n.d.]. For a person looking at the Tensorbook's screen from a distance of 500 mm, these values translate to 20 and 11 pixels, respectively. The tracker can measure gaze from a user located between 450 mm and 950 mm from the screen, with a nominal freedom of head movement of 350 × 350 mm. The tracker was placed at the bottom of the Tensorbook's screen and was calibrated for each participant using the Tobii Pro Eye Tracker Manager utility (9 targets).

3.1.3 Appearance-Based Gaze Detection. FAZE (Few-shot Adaptive GaZE Estimation) is a state-of-the-art appearance-based gaze tracking algorithm. It incorporates several few-shot learning paradigms, most notably Model-Agnostic Meta-Learning (MAML). At the core of FAZE is an encoder-decoder architecture that captures latent representations related to appearance, gaze direction, and head pose from eye region imagery. After the initial learning phase of these latent features, FAZE undergoes fine-tuning with a minimal set of calibration samples from individual users. The use of MAML significantly reduces over-fitting, thereby facilitating rapid and person-specific model fine-tuning. The average angular error of FAZE is 3.14° [Park et al. 2019].
In our tests, we noted that data from FAZE sometimes exhibit a consistent location bias, even after calibration. To remedy this, we applied an additional geometric calibration. Specifically, for each participant, we recorded the barycenter of the gaze points produced by FAZE while the participant fixated each of the 9 points in a pattern (Sec. 3.3), then regressed the parameters of an affine transform minimizing the squared norm of the location error. This affine transform was then applied to the gaze points returned by FAZE for that participant.
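The geometric calibration described above can be illustrated with a minimal least-squares sketch. This is not the authors' code: the function names `fit_affine` and `apply_affine`, the grid coordinates, and the synthetic distortion are all assumptions made for illustration.

```python
import numpy as np

def fit_affine(measured, targets):
    """Least-squares affine map from measured barycenters to target locations.

    measured, targets: arrays of shape (N, 2) in pixels, with N >= 3.
    Returns a (3, 2) matrix M such that [x, y, 1] @ M approximates the target.
    """
    measured = np.asarray(measured, dtype=float)
    targets = np.asarray(targets, dtype=float)
    homogeneous = np.hstack([measured, np.ones((len(measured), 1))])
    M, *_ = np.linalg.lstsq(homogeneous, targets, rcond=None)
    return M

def apply_affine(M, points):
    """Apply a fitted affine map to an (N, 2) array of gaze points."""
    points = np.asarray(points, dtype=float)
    return np.hstack([points, np.ones((len(points), 1))]) @ M

# Hypothetical 3 x 3 target grid on a 1920 x 1080 screen (not the paper's layout).
xs, ys = np.meshgrid([160.0, 960.0, 1760.0], [90.0, 540.0, 990.0])
targets = np.column_stack([xs.ravel(), ys.ravel()])

# Synthetic "measured" barycenters: targets distorted by a made-up affine bias.
distortion = np.array([[1.05, 0.02], [-0.03, 0.95], [40.0, -25.0]])
measured = apply_affine(distortion, targets)

M = fit_affine(measured, targets)
corrected = apply_affine(M, measured)
```

With real data the fit will not be exact; the residual after correction reflects the non-affine part of the error and the noise in the barycenters.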

3.2 Population
We recruited 7 participants (3 female, 4 male; age min: 22; max: 58; mean: 33.7) for this test. Three participants (P5, P6, P7) wore glasses during the test. The study was conducted following a Human Subject protocol approved by the Institutional Review Board at our school.

3.3 Procedure
Participants were asked to sit in front of the computer, which was placed on a tabletop. The experimenter ensured that they sat at a distance from the screen that was within the admissible range for the Tobii tracker. The average distance of each participant to the screen was recorded by the tracker (min: 533 mm; max: 717 mm; mean: 632 mm). Participants first completed the calibration procedure for the Tobii gaze tracker, then the calibration procedure for the FAZE algorithm. At this point, data acquisition started. This comprised two tasks.
Task 1: Participants were asked to stare at a target (a small blue disk of 16 pixels in diameter) appearing in a sequence of 9 locations on the screen (see Fig. 1) and remaining at each location for 6 seconds before moving to the next one. (This amount of time is consistent with other experiments on fixation stability [Fragiotta et al. 2018].)
Task 2: Participants were presented with a text document (extracted from Carroll's Alice in Wonderland). The document was formatted in Times New Roman at 11pt and consisted of 15 lines with an interline distance of 18pt (24 pixels). Participants were asked to read it in its entirety. In addition, they were asked to press a key on the keyboard when they started a new line and another key when they ended that line. In this way, we were able to record the in-line time intervals. Participants were at liberty to read the text aloud or silently (only P1 read it aloud).
Timestamped images of the participants were recorded from the computer camera at a rate of 10 fps for offline processing. Timestamped gaze points from the Tobii tracker were recorded by a Python application built on the Tobii Pro SDK.
Dwell-based selection [… 2007] typically defines an area (e.g., a circle with diameter $D$) around a certain target (e.g., a button to be clicked). When the gaze point is located within this area continuously for a period of time $T$, the selection is triggered. We are interested in evaluating how errors in gaze measurements affect selection by dwelling, and how to properly design a system that accounts for these errors. We do not consider here the dynamic aspects of this task, which can be described using variants of Fitts' law [Zhang et al. 2010, 2011]. Rather, we look for the minimum diameter $D_{\min}$ of a circle around the target that ensures, with a certain probability $P$, that selection is triggered when the user is fixating the target for a period of time $T$. Intuitively, $D_{\min}$ will need to be larger for noisy measurements, as noise may push measurements away from the point of fixation. In our experiments, we set $T$, the dwelling time, to 700 ms, as this was found to be appropriate for simple tasks in prior research [Stampe and Reingold 1995; Zhang et al. 2011]. We set $P$ to 0.9. To find $D_{\min}$, we first considered the interval of time $[t_s(i), t_e(i)]$ (approximately 6 seconds long) during which participants fixated the $i$-th target in Task 1. We defined a sequence of finely spaced values for $D$, and for each such value, we slid a time window of duration $T$ through $[t_s(i), t_e(i)]$. For each window location, we checked whether or not all gaze points measured in that time window were within a distance of $D/2$ from the center of the target. The proportion of window locations for which this was the case represents the probability $P_i(D)$ that, when staring at the $i$-th target, selection would be triggered for a dwelling circle of diameter $D$. Finally, we defined $D_{\min}$ as the smallest value of $D$ for which $P_i(D) \geq 0.9$.
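The sliding-window procedure above could be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function names, the choice of starting one window at every gaze sample, the candidate diameter grid, and the synthetic data are assumptions.

```python
import numpy as np

def trigger_probability(t, xy, target, diameter, dwell_time):
    """Fraction of windows of length dwell_time in which every gaze sample
    lies within diameter / 2 of the target center.

    Assumption: one window is started at each sample timestamp.
    """
    t = np.asarray(t, dtype=float)
    xy = np.asarray(xy, dtype=float)
    distance = np.linalg.norm(xy - np.asarray(target, dtype=float), axis=1)
    inside = distance <= diameter / 2.0
    hits = 0
    windows = 0
    for i in range(len(t)):
        if t[i] + dwell_time > t[-1]:
            break  # window would extend past the fixation interval
        j = np.searchsorted(t, t[i] + dwell_time, side="right")
        windows += 1
        hits += bool(inside[i:j].all())
    return hits / windows if windows else 0.0

def minimum_diameter(t, xy, target, dwell_time=0.7, probability=0.9,
                     candidates=None):
    """Smallest candidate diameter whose trigger probability reaches the goal."""
    if candidates is None:
        candidates = np.arange(1.0, 1001.0, 1.0)  # hypothetical grid, in pixels
    for diameter in candidates:
        if trigger_probability(t, xy, target, diameter, dwell_time) >= probability:
            return float(diameter)
    return None  # no candidate was large enough

# Synthetic 6-second fixation sampled at 10 Hz with a constant 10-pixel offset.
t = np.arange(0.0, 6.0, 0.1)
target = (960.0, 540.0)
xy = np.tile([970.0, 540.0], (len(t), 1))
d_min = minimum_diameter(t, xy, target)

# A single stray sample lowers the trigger probability for a tight circle.
xy_outlier = xy.copy()
xy_outlier[30] = [1200.0, 540.0]
p_outlier = trigger_probability(t, xy_outlier, target, 20.0, 0.7)
```

Because the trigger probability cannot decrease as the diameter grows, a linear scan over increasing candidates finds the smallest qualifying value; a bisection search would also work.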

3.4 Measurements
In addition, we provide measures of bias and dispersion. Bias is defined as the distance, for each target, between the barycenter of the gaze points measured by Tobii or FAZE and the actual target location. Dispersion is measured as the square root of BCEA (bivariate contour ellipse area). BCEA, a metric commonly used in fixation studies (e.g., [Blignaut and Beelders 2012; Niehorster et al. 2020]), represents the area of the ellipse containing 63% of the gaze values, which are modeled as normally distributed. Noisy measurements are typically characterized by large BCEA values. We used all the data within each period $[t_s(i), t_e(i)]$ to measure bias and BCEA at each target.
It is important to note that both $D_{\min}$ and dispersion are affected by measurement noise as well as by any fixation instability of the viewer. BCEA is unaffected by bias (constant error terms).
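As an illustration of these two measures, the sketch below computes bias and BCEA using the formula commonly given in the fixation-stability literature, BCEA = 2kπ·σ_x·σ_y·sqrt(1 − ρ²), where k = 1 corresponds to roughly 63% coverage. Whether the authors used exactly this formula and these estimator choices (sample standard deviation, Pearson correlation) is an assumption; the function names are hypothetical.

```python
import numpy as np

def bias(gaze_xy, target):
    """Distance between the barycenter of the gaze points and the target."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    center = gaze_xy.mean(axis=0)
    return float(np.linalg.norm(center - np.asarray(target, dtype=float)))

def bcea(gaze_xy, k=1.0):
    """Bivariate contour ellipse area; k = 1 covers about 63% of samples."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    sigma_x, sigma_y = gaze_xy.std(axis=0, ddof=1)
    rho = np.corrcoef(gaze_xy[:, 0], gaze_xy[:, 1])[0, 1]
    return float(2.0 * k * np.pi * sigma_x * sigma_y * np.sqrt(1.0 - rho ** 2))

# Tiny synthetic example: four points on the corners of a 2 x 2 square.
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
dispersion = np.sqrt(bcea(points))      # square root of BCEA, same unit as input
offset = bias(points, target=(4.0, 5.0))

# Shifting every sample by a constant changes the bias but not the BCEA.
shifted_bcea = bcea(points + 100.0)
```

The last line mirrors the remark above that BCEA is insensitive to constant error terms.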
To determine the intervals $[t_s(i), t_e(i)]$, we define a circle of radius 4 pixels around each marker. $t_s(i)$ and $t_e(i)$ are the times at which gaze, as measured by the reference Tobii tracker, enters and exits the circle defined at the $i$-th marker.

3.4.2 Text Reading - Line Identification. The ability to identify which text line in an onscreen document one is currently reading hinges on the measured gaze being located within a narrow area containing the line. We are only concerned with in-line reading here, and neglect retracing time (return sweeps [Rayner and Pollatsek 2006]). We do not consider a specific vertical coordinate as a reference (e.g., the midline of the text) since the user's gaze is not constrained to such a line while reading. Instead, we take the Tobii data as a reference, against which to compare FAZE data.
For the $j$-th text line, we measure, for both Tobii and FAZE data, the mean $\mu_Y(j)$ and standard deviation $\sigma_Y(j)$ of the Y coordinate of gaze points. $\sigma_Y(j)$ measures the vertical dispersion; it provides an indication of the minimum interline distance for reliable line identification. The differences of the means $\mu_Y(j)$ between FAZE data and the reference Tobii data represent the residual vertical bias.
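A minimal sketch of these per-line statistics is given below. The function names, the use of the population standard deviation, and the aggregation of per-line mean differences into a single RMSE value are assumptions made for illustration; the sample values are synthetic.

```python
import numpy as np

def line_statistics(y, line_index):
    """Per-line mean and standard deviation of the gaze Y coordinate.

    y: Y coordinates of gaze points recorded during in-line reading.
    line_index: index of the text line being read for each gaze point.
    Returns {line: (mean_y, std_y)}.
    """
    y = np.asarray(y, dtype=float)
    line_index = np.asarray(line_index)
    stats = {}
    for j in np.unique(line_index):
        values = y[line_index == j]
        stats[int(j)] = (float(values.mean()), float(values.std()))
    return stats

def vertical_bias_rmse(stats_a, stats_b):
    """RMSE across lines of the difference between per-line mean Y values."""
    common = sorted(set(stats_a) & set(stats_b))
    diffs = np.array([stats_a[j][0] - stats_b[j][0] for j in common])
    return float(np.sqrt(np.mean(diffs ** 2)))

# Synthetic data for two text lines (values in pixels).
lines = [0, 0, 1, 1]
reference_stats = line_statistics([10.0, 14.0, 30.0, 30.0], lines)
estimate_stats = line_statistics([13.0, 17.0, 34.0, 34.0], lines)
residual_bias = vertical_bias_rmse(estimate_stats, reference_stats)
```

Comparing the per-line standard deviation with the interline distance (24 pixels in this study) gives a rough sense of whether adjacent lines can be told apart.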

3.4.3 Text Reading - Progression Along a Line. During reading, one's eyes do not glide smoothly along a text line; rather, gaze proceeds as a sequence of fixations (during which gaze is relatively static) and saccades, which are rapid movements forward in the line or, occasionally, backward (regressions) [Rayner and Pollatsek 2006]. For our measure of progression along a line, we consider all fixations detected from Tobii data during line reading. Fixation detection is a relatively straightforward operation, and the accuracy of infrared trackers such as the Tobii Pro Spark is adequate for this purpose [Olsen 2012]. To detect fixations, we use a simple velocity-based algorithm inspired by the Tobii I-VT fixation filter [Olsen 2012]. For the $k$-th fixation period, we compute the average value $\mu_{X,f}(k)$ and the standard deviation $\sigma_{X,f}(k)$ of the X coordinate for both Tobii and FAZE data. The difference between the $\mu_{X,f}(k)$ values in the two cases is an indication of how accurately the reading location along a line can be tracked using an algorithm like FAZE.
In addition, we computed the standard deviation $\sigma_{X,s}(k)$ of the X coordinate of gaze points for both Tobii and FAZE data in the periods outside fixations (saccades [Rayner and Pollatsek 2006]). Comparison of $\sigma_{X,f}(k)$ against $\sigma_{X,s}(k)$ provides an indication of the relative dispersion during fixations (periods with low gaze point variance) and during saccades (when variance is large due to fast motion).
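The velocity-based fixation detection and the per-fixation statistics described above might look like the following simplified sketch. It is not the Tobii I-VT filter itself, which operates on angular velocity and includes gap fill-in, noise reduction, and fixation merging; here velocity is measured in pixels per second, and the threshold, minimum duration, function names, and data are hypothetical.

```python
import numpy as np

def fixation_intervals(t, xy, velocity_threshold, min_duration=0.06):
    """Return (start_time, end_time) pairs for runs of slow gaze samples.

    velocity_threshold is in pixels per second (an assumption; I-VT filters
    usually threshold angular velocity).
    """
    t = np.asarray(t, dtype=float)
    xy = np.asarray(xy, dtype=float)
    step_speed = np.linalg.norm(np.diff(xy, axis=0), axis=1) / np.diff(t)
    # Each sample takes the speed of the step that reached it; the first
    # sample reuses the speed of the first step.
    speed = np.concatenate([step_speed[:1], step_speed])
    slow = speed < velocity_threshold
    change = np.flatnonzero(slow[1:] != slow[:-1]) + 1
    bounds = np.concatenate([[0], change, [len(slow)]])
    intervals = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if slow[a] and t[b - 1] - t[a] >= min_duration:
            intervals.append((float(t[a]), float(t[b - 1])))
    return intervals

def x_statistics(t, x, intervals):
    """Mean and standard deviation of X within each time interval."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    stats = []
    for start, end in intervals:
        selected = x[(t >= start) & (t <= end)]
        if selected.size:
            stats.append((float(selected.mean()), float(selected.std())))
        else:
            stats.append((float("nan"), float("nan")))
    return stats

# Synthetic reference trace at 60 Hz: two fixations separated by a fast jump.
t_ref = np.arange(41) / 60.0
x_ref = np.array([100.0] * 20 + [200.0] + [300.0] * 20)
xy_ref = np.column_stack([x_ref, np.zeros_like(x_ref)])
intervals = fixation_intervals(t_ref, xy_ref, velocity_threshold=1000.0)

# Synthetic second tracker at 10 Hz, evaluated on the reference intervals.
t_est = np.arange(7) * 0.1
x_est = np.array([120.0, 130.0, 110.0, 120.0, 310.0, 330.0, 320.0])
ref_stats = x_statistics(t_ref, x_ref, intervals)
est_stats = x_statistics(t_est, x_est, intervals)
```

Detecting fixations on the reference trace and then evaluating both trackers over the same time intervals follows the comparison described above; the difference between the per-fixation means indicates the horizontal tracking error.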

RESULTS
We present the results of our experiments below. All statistical tests were conducted at the 5% significance level. To visually highlight any dependencies of the recorded values on the participants' distance to the screen, participant indices were sorted according to increasing distance to the screen.

Fixation
Recorded values of bias, dispersion, and $D_{\min}$ are shown in Fig. 1. Specifically, we report, for both Tobii and FAZE, the values averaged across participants for each target, as well as the values averaged across targets for each participant. As expected, FAZE data have significantly larger bias and dispersion than Tobii data (as revealed by a paired t-test).
Total means for Tobii data were: bias: 54.0 pixels; square root of BCEA: 1.13°; $D_{\min}$: 125.3 pixels. For FAZE data, the total […]. Note from Fig. 1 that the BCEA value for P5 was substantially higher than for other participants, though this did not translate into a larger $D_{\min}$ value. Two-way analysis of variance revealed a significant effect of participants on both $D_{\min}$ and square root of BCEA, for both Tobii and FAZE data. A significant effect of target was found for Tobii data only, on both $D_{\min}$ and square root of BCEA. A significant correlation between distance and both bias and $D_{\min}$ was found for FAZE data only (correlation coefficient of 0.85 in both cases). A graphical representation of $D_{\min}$ for each target (averaged over all participants) for both Tobii and FAZE is shown in Fig. 2, left. An example of data collected with the two modalities for a single participant (P7) is presented in Fig. 2, right, which shows contours at the same percentile levels of the probability density functions fitted to the recorded samples.

Text Reading - Line Identification
Relevant data from the experiment are shown in Fig. 3. The text line index was not found to have a significant effect on either bias (RMSE of the difference of the means $\mu_Y$ measured for each line for FAZE and Tobii) or $\sigma_Y$ for either FAZE or Tobii data. Participant index had a significant effect on both bias and $\sigma_Y$. $\sigma_Y$ was found to be significantly larger for FAZE than for Tobii. For Tobii data only, $\sigma_Y$ was found to be correlated with distance to the screen (correlation coefficient of 0.78). The total mean of the bias was 91.7 pixels, while the total mean of $\sigma_Y$ was 15.9 pixels for Tobii data and 51.5 pixels for FAZE data. From Fig. 3, it is seen that P5 had a much larger value of $\sigma_Y$ (averaged across lines) than the others. An example of strips containing gaze data at $\mu_Y(j) \pm \sigma_Y(j)$ is shown in Fig. 4, left.

Text Reading - Progression Along a Line
We computed all fixation times (during in-line reading intervals) on the Tobii data. Then, as explained in Sec. 3.4.3, we computed the RMSE of the difference of the mean values $\mu_{X,f}(k)$ of the X coordinate of measurements from Tobii and FAZE. The resulting bias value is shown in Fig. 5, left. The mean of the RMSE across participants was 89.2 pixels. For both Tobii and FAZE data, we also computed the standard deviations $\sigma_{X,f}(k)$ and $\sigma_{X,s}(k)$ of the X coordinate of gaze for all periods identified as fixations and saccades, respectively, based on the Tobii data. The mean values are shown in Fig. 5. For both Tobii and FAZE data, a paired t-test rejected the null hypothesis of equal means of $\sigma_{X,f}$ and $\sigma_{X,s}$. An example of gaze data on a text line is shown in Fig. 4, right.

DISCUSSION AND CONCLUSIONS
Appearance-based gaze tracking algorithms hold the promise to "democratize" gaze-based interactions and analysis by removing the need to purchase dedicated devices.However, it is critical that these systems be tested in realistic applications, in order to assess their practical usability [Hohlfeld et al. 2015;Zhang et al. 2019].In this paper, we proposed a number of metrics associated with specific applications of interest, and compared measurements taken with a state-of-the-art appearance-based tracker against those taken with an infrared gaze tracker.
Our first experiment showed that dwelling-based selection is possible with FAZE, but the dwelling areas must be substantially larger than those afforded by an infrared tracker for equal effectiveness (Fig. 2, left). In our measurements, the ratio of the diameters $D_{\min}$ found for FAZE to those found with Tobii (averaged over all participants) varied from 2.3 to 6.3.
Our text reading - line identification experiment showed the dispersion along the Y coordinate of FAZE data to be more than 3 times larger than that of Tobii data. This suggests that the minimum interline distance needs to be larger by at least that same factor in order to ensure reliable text line identification. This is compounded by the effect of bias, which measures the difference between the Y coordinates of the values measured by Tobii and FAZE in the same line, and which was found to be 92 pixels on average in our experiment. This is almost 4 times the interline distance used in the text document considered for our experiment (see Fig. 4, left).
Our text reading - progression along a line experiment showed an RMSE value of the difference of X coordinates during fixations of almost 90 pixels. Considering that in our document the width of a character was about 13.5 pixels on average, this bias translates to an expected error of about 7 characters. Interestingly, we found a significant difference in the mean of the standard deviation of FAZE data measured during fixation and saccade intervals (where these intervals were computed based on our reference Tobii data). This suggests that it may be possible to identify fixations on FAZE data using appropriate local analysis.
In most cases, measurements on the FAZE data were found to correlate positively with the distance to the screen. This is not surprising, considering that gaze tracking algorithms measure the direction of the visual axis, and the effect of a given angular error on the location of the gaze point increases linearly with the distance.
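The linear dependence on distance can be made concrete with a small geometric sketch. It assumes the line of sight is roughly perpendicular to the screen and uses the pixel pitch implied by the screen size and resolution reported in Sec. 3.1.1; the function name and the example angle and distances are illustrative only.

```python
import math

# Pixel pitch from the reported active area (349 mm across 1920 pixels).
MM_PER_PIXEL = 349.0 / 1920.0

def angular_error_to_pixels(error_deg, distance_mm, mm_per_pixel=MM_PER_PIXEL):
    """On-screen displacement, in pixels, caused by an angular gaze error.

    Assumes the gaze direction is approximately normal to the screen plane.
    """
    displacement_mm = distance_mm * math.tan(math.radians(error_deg))
    return displacement_mm / mm_per_pixel

# For a fixed angular error, doubling the distance doubles the pixel error.
near = angular_error_to_pixels(3.0, 350.0)
far = angular_error_to_pixels(3.0, 700.0)
```

For gaze directed far from the screen normal the displacement grows faster than this simple model predicts, so the sketch is best read as a first-order approximation.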
Our study considered a relatively small population sample (7 participants), and we are planning a larger study in the near future, which will include different illumination types (which can affect the quality of FAZE data) and a larger range of viewing distances. Another limitation of this work is that the image data was processed offline. In future experiments, we will run FAZE online. Besides a reduced frame rate (6 frames/second on our Tensorbook), latency (delay) should be expected, and its effect on specific tasks (e.g., dwelling) will be analyzed.

Figure 2: Dwelling experiment. Left: A visualization of the dwelling circles with a diameter equal to the mean of $D_{\min}$ across all participants. Right: Contour plots of probability density functions fitted to the recorded data for P7. Contour levels were set at the 20th, 40th, 60th, and 80th percentiles. Solid blue line: Tobii data. Yellow dashed line: FAZE data.

Figure 3: Text Reading - Line Identification experiment. (a) Bias (RMSE of the difference of $\mu_Y(j)$ between Tobii and FAZE data). (b) $\sigma_Y$ averaged over all text lines.

Figure 4: Left: Strips showing $\mu_Y(j) \pm \sigma_Y(j)$ for data recorded for participant P4, shown for three text lines. Right: An example of gaze points recorded for a text line (P6). Data from Tobii was subsampled to 10 Hz for comparison with FAZE data. Blue forward slash or circular symbol: Tobii data. Yellow backward slash or cross symbol: FAZE data.

Figure 5: Text Reading - Progression Along a Line experiment. (a) Bias (RMSE of the difference of $\mu_{X,f}(k)$ for Tobii and FAZE data). (b) $\sigma_X$ averaged over all fixations (left bar in bar groups) and saccades (right bar in bar groups).