Towards human-compatible autonomous car: A study of non-verbal Turing test in automated driving with affective transition modelling

Autonomous cars are indispensable when humans go further down the hands-free route. Although existing literature highlights that the acceptance of the autonomous car will increase if it drives in a human-like manner, sparse research offers the naturalistic experience from a passenger's seat perspective to examine the humanness of current autonomous cars. The present study tested whether the AI driver could create a human-like ride experience for passengers based on 69 participants' feedback in a real-road scenario. We designed a ride experience-based version of the non-verbal Turing test for automated driving. Participants rode in autonomous cars (driven by either human or AI drivers) as a passenger and judged whether the driver was human or AI. The AI driver failed to pass our test because passengers detected the AI driver above chance. In contrast, when the human driver drove the car, the passengers' judgement was around chance. We further investigated how human passengers ascribe humanness in our test. Based on Lewin's field theory, we advanced a computational model combining signal detection theory with pre-trained language models to predict passengers' humanness rating behaviour. We employed affective transition between pre-study baseline emotions and corresponding post-stage emotions as the signal strength of our model. Results showed that the passengers' ascription of humanness would increase with the greater affective transition. Our study suggested an important role of affective transition in passengers' ascription of humanness, which might become a future direction for autonomous driving.

the number of deaths from road traffic crashes worldwide. In other words, every 23 seconds, someone in this world is told that a loved one has died in a road crash. It is worth noting that 94% of road accidents are due to human error [3]. Against these critical and tragic facts, it is promising that autonomous cars 1 (ACs) have the potential to reduce human error substantially [5], [6]. Notably, artificial intelligence (AI) algorithms in ACs can make faster driving decisions than human drivers to prevent crashes [7], [8]. Globally, ACs are poised to save 10 million lives per decade simply by removing the human-error element [9].
Despite the paramount lifesaving benefits of ACs, and although researchers from academia and industry have made significant progress [10], [11], [12] since the first landmark AC appeared nearly 40 years ago [13], [14], [15], there has yet to be a large-scale deployment of ACs [16], [17]. In other words, ACs still face enormous challenges in replacing humans (e.g., the Tesla Autopilot deaths [18]). In addition to safety and trust issues [19], [20], [21], another main obstacle is that these cars are not humanoid: they do not drive in a human-like manner. More importantly, existing literature highlights that the acceptance of the AC will increase if the AC drives in a human-like manner [22], [23], [24], [25] (the rationale is that 'humans will find it easier to interact and feel at ease with ACs in such cases [26]').

Footnote 1: Following the literature [4], which promotes the term 'autonomous cars', from a range of candidate terms, to facilitate public acceptance of automated driving, we adopt the term 'autonomous cars' in this paper.
On the other hand, many researchers in the area of human factors mainly use simulators in the laboratory or online surveys to examine how drivers [25], [41], [42], [43], pedestrians [44], [45] or passengers [37], [46], [47] respond to ACs that have been deliberately programmed and designed to perform in a human-like manner, i.e., validating AC algorithms from the perspective of human factors studies. However, sparse research offers a true-to-life ride experience for passengers to examine the human likeness of the AC. Given that human likeness is key to improving the acceptance of the AC, we posed the following research question: how can we offer a naturalistic experience from the passenger's seat to measure the human likeness of current ACs?
To tackle the research question and overcome the limitations of driving simulators and laboratory settings, we developed a ride experience-based version of the 'Turing test [48]' on SAE Level 4 [49] ACs (in which human intervention is unnecessary in limited spatial areas or under special circumstances) in the real world (Section 2.1). In 1950, Alan Turing proposed the Turing test [48] to evaluate the ascription of intelligence, i.e., whether humans would ascribe human-like intelligent behaviour to machines. In the Turing test, a human interrogator (C) engages in verbal interaction with a computer program (A) and another human (B) (i.e., asks questions to A and B through written notes) and tries to determine with whom C is interacting. If the human interrogator cannot determine which answers are given by a human and which by a computer program, the latter is said to pass the test. The rationale of the Turing test, i.e., 'human judges impartially compare and evaluate outputs from different systems while ignoring the source of the outputs [50]', has been used in the literature [51], [52] in different ways to investigate the ascription of humanness.
With the same motivation, we conducted a ride experience-based version of the Turing test (Section 2.1.2), which in the following will be referred to as 'a non-verbal variation of the Turing test'. In particular, 69 participants (Section 2.1.1) assumed the role of a passenger in the rear seat of SAE Level 4 cars, unable to see the driver cabin, as depicted in Fig. 1. They had to judge whether a real human or an AI algorithm was behind the wheel based on their ride experience on the road stage just passed. Specifically, after each road stage, passengers answered a variation of a Turing test question, 'Do you think the driver was a real human or an AI algorithm?', on a 1-3 scale: 1 for 'AI driver', 2 for 'Not sure', 3 for 'Human driver'. Our main goal was to test whether the AI driver (i.e., WeRide ONE, a universal self-driving algorithm for comprehensive open urban roads [53]) could create a human-like ride experience for passengers, such that passengers would give either chance-level or even higher humanness ratings under the AI driver condition. The results showed that when the AI driver controlled the AC, passengers' humanness ratings were significantly below the chance level, indicating that passengers could detect and discriminate between the human and AI drivers. Thus, the AI driver did not pass our non-verbal variation of the Turing test (Section 3.1).
The AI driver's failure inspired us to explore why the AI algorithm could trick human passengers in some trials but not in most others. Digging into this rabbit hole may yield informative insights for future ACs. Accordingly, we posed the following thought-provoking research question: how do human passengers ascribe humanness in the non-verbal variation of the Turing test?
In our non-verbal variation of the Turing test, passengers' ascription of humanness would likely be influenced by their affective states, cognitive inference and external stimuli (i.e., human and AI drivers). In particular, we can formulate the problem above in the language of field theory [54], the most central and influential work of Kurt Lewin [55] (who is widely regarded as the father of modern social psychology). Field theory states that a person's psychological field (i.e., the total psychological environment that the person experiences subjectively) determines their behaviour (B) [56], which can be expressed by Lewin's equation [57]:

B = f(P, E),

where P and E represent the person and their environment, respectively. In line with the idea of Gestalt psychology that 'the whole is more than the sum of its parts [58]', the parts, i.e., P and E, combine to form something larger, the psychological field [59]. In our case, B is the humanness rating behaviour, and P and E denote the passenger and the driving environment, respectively. More importantly, P and E together form the passenger's psychological field, i.e., their subjective ride experience. Given B, we aimed to figure out the computation of the right-hand side of Lewin's equation. To phrase the matter another way, we intended to investigate why a given trial (i.e., a particular passenger P in a particular driving environment E) has the event B (e.g., a high humanness rating) and no other as its result in the non-verbal variation of the Turing test. To that end, we proposed a computational model which combines signal detection theory (SDT) [60], [61], [62] with pre-trained language models (PLMs) [63], [64], [65] (Section 2.2), as depicted in Fig. 2.
In this SDT-based model (Section 2.2.2), we used affective transition (AT, Section 2.2.3) between prestudy baseline emotions and corresponding post-stage emotions (collected using the modified Differential Emotions Scale and written description, Section 2.2.1), transformed by PLM (Section 2.2.4), as the signal strength.
The results showed that our proposed computational model could adequately predict passengers' humanness rating behaviour in the non-verbal variation of the Turing test (Section 3.2). Further analysis (Section 4) suggested that affective transition, serving as a hypothetical essential part (i.e., P) of passengers' subjective ride experience in our model, may play a crucial role in their ascription of humanness. Specifically, we found that the passengers' ascription of humanness would increase with the greater AT (Section 4.1). Moreover, based on the analysis of AT, we also gave concrete suggestions for the AI driver to offer a human-like ride experience for the passenger (Section 4.2-3). Taking the results of behavioural experiments and computational modelling together, we conjecture that the lack of a certain level of mentalising ability in the current self-driving algorithm may underlie its failure to pass our non-verbal variation of the Turing test.

(Fig. 1 caption, panels B-C: B. From top to bottom: manual driving mode (the human driver actively steering); autonomous mode, in which the human driver releases the steering wheel and the AI driver (the WeRide ONE algorithm [53]) takes control of the car; the participant rides in the rear seat as a passenger, with a thick black drape hiding the driver cabin from the passenger's viewpoint. C. Sub-figure (1) is the satellite image of the test stages (yellow); the dark blue in sub-figures (2)-(4) marks the first, second and third stages, respectively. Each participant experienced the three stages in turn (randomly assigned to manual or autonomous mode). The red arrow indicates the direction of travel; the 4-point and 5-point stars mark the start and end locations of the AC; the traffic lights in all three stages are marked.)
In this regard, our study calls for a spotlight on the importance of ensuring ACs (or artificial social intelligence, more broadly speaking) have at least some mentalising ability (Section 5).

The non-verbal variation of the Turing test
In this subsection, we first describe the participants and how we recruited them, and then detail the conducted non-verbal variation of the Turing test, as illustrated in Fig. 1.

Participants
We recruited 23 employees of WeRide (a Chinese high-tech company aiming to develop the most advanced autonomous driving technology) and 46 tourists and passers-by via on-site registration on Guangzhou International Biological Island. The entire sample included 45 males and 24 females, aged 34.48 years on average (SD = 10.44, range = [21, 60]). After a welcome, all participants received information about the aims of the experiment and provided informed consent. Each participant received a plush toy for taking part in our study. The local ethics committee approved our research protocol (2020-0515-0140).

Procedure
In the double-blind non-verbal variation of the Turing test, participants rode in the SAE Level 4 AC (see Fig. 1A), driven by either the human driver (manual driving mode) or the AI driver (autonomous mode). Due to legal restrictions, one engineer sat in the front passenger's seat to monitor the AC. The participants therefore rode in the rear seat, taking the role of a passenger, with a thick black drape hiding the driver cabin from the passenger's viewpoint (see Fig. 1B). If passengers cannot distinguish between the manual driving mode and the autonomous mode, the non-verbal variation of the Turing test is passed. Each participant experienced three stages in turn, each randomly assigned to the manual driving mode or the autonomous mode.
There were three stages in the predetermined course (around 3.4 km in total). Starting from the Xingdao Ring Road North, the first stage (around 1.6 km) included six traffic lights and a left-hand turn towards the second stage on Luoxuan Avenue. After a straight ride of around 1.2 km with two traffic lights, a left-hand turn to the third stage on the Xingdao Ring Road South was performed. Finally, the third stage was around 0.6 km, including a big left-hand curve and two traffic lights. The predetermined course ended at the beginning of the first stage (see Fig. 1C). Self-reported emotions were assessed before the whole study as pre-study baseline emotions. In contrast, humanness ratings, safety, comfort, post-stage emotions and mixed feelings were measured during the lag time after each stage (participants had 1-2 minutes to rest before the next stage); we further introduce these data in Section 2.2.1.

(Fig. 2 caption: Our computational model, built on Lewin's field theory [54] and expressed by a cartoon version of the formula B = f(P, E) at the centre of the figure. A. From left to right: a participant filling out pre-study self-reported scores of the modified DES-IV on his smartphone; the stage underway; after the stage, the participant completing the online questionnaire, including his humanness rating (i.e., the answer to a variation of a Turing test question, 'Do you think the driver was a real human or an AI algorithm?', 1 for 'AI driver', 2 for 'Not sure', 3 for 'Human driver'), post-stage modified DES-IV scores, safety and comfort scores, and optional written mixed feelings. B. The high-level illustration of our model, with the framework of SDT as the backbone: the signal strength (computed as the affective transition, AT) and the stimulus (human driver or AI driver) are the model's inputs, while the output is the participant's humanness rating behaviour; both competing hypotheses (H1 and H2) about the possible relatedness between the participant's humanness rating and the magnitude of signal strength are depicted. C. The further computation of AT, i.e., the distance between the pre-study baseline and post-stage vectors, in which vectors are transformed by 'Optimus Prime' (a fictional character created by the Transformers franchise), i.e., the transformation module, covering the pre-trained language model and the whitening and dimensionality reduction steps. D. The internal transformation procedure when giving Optimus Prime a participant's post-stage rating scores and mixed feelings.)

How do human passengers ascribe humanness?
To understand passengers' ascription of humanness in the non-verbal variation of the Turing test, we advanced a computational model which specifies the detailed steps for generating passengers' humanness rating behaviour, as shown in Fig. 2. At the centre of Fig. 2, we portrayed Lewin's equation [54], B = f (P, E), for the highest-level illustration of our computational modelling method. In the following four parts, we will first introduce the details of the participant data collected in the non-verbal variation of the Turing test (see Fig. 2A) and subsequently describe our model in detail from a top-down perspective (see Fig. 2B-D).

2.2.1 Participant data: Self-reported scores, humanness ratings, and mixed feelings

We collected self-reported scores (including pre-study baseline emotions, post-stage emotions, safety and comfort), humanness ratings and mixed feelings from participants in the non-verbal variation of the Turing test (Fig. 2A). Specifically, pre-study baseline and post-stage emotions were collected using the modified DES-IV [66], [67] (on Likert scales from 1-4), since it has been suggested that passengers' emotions play a fundamental role in the social acceptance of ACs [16], [68], [69]. The left side of Fig. 2D shows an example in which a participant rated the six emotions (translated from Chinese) as follows: 'fairly strong enjoyment' (Enjoyment 3/4), 'fairly strong interest' (Interest 3/4), 'slight surprise' (Surprise 2/4), 'no fear at all' (Fear 1/4), 'no tension at all' (Tension 1/4) and 'fairly strong satisfaction' (Satisfaction 3/4). User acceptance also resides in the increase of trust towards the AC [70]. Therefore, given that passengers' safety and comfort can establish trust towards the AC [71], [72], [73], self-reported scores of safety and comfort were rated on an integer scale from 1 to 4, 1 meaning 'Not safe (comfortable) at all' and 4 meaning 'Very safe (comfortable)'. In addition, the humanness rating (B in Lewin's equation), i.e., the answer to a variation of a Turing test question, 'Do you think the driver was a real human or an AI algorithm?', was given from 1-3: 1 for 'AI driver', 2 for 'Not sure', 3 for 'Human driver'. Notice that a three-option scale rather than a forced-choice scale (with no middle option 'Not sure') was used because 1) humanness is more like a continuous rather than a simple dichotomous variable; and 2) a three-option rating scale is a decent trade-off between creating an approximately continuous humanness variable (i.e., a rating scale with more options is better) and keeping it convenient for passengers to ascribe humanness (i.e., a rating scale with fewer options is better).
In addition to the quantitative ratings, qualitative assessments, i.e., participants' mixed feelings, were also collected, given that the information contained in natural-language texts may predict human behaviour [74]. The lower-left corner of Fig. 2D shows an example of one participant's mixed feelings about the past stage (translated from Chinese): 'The car stopped rather abruptly at traffic lights.' In total, we obtained data from 68, 68 and 65 participants for the first, second and third stages, respectively.

Backbone: Signal detection theory
Signal detection theory (SDT) [60], [61], [62] is a general framework widely used by psychologists to describe decisions made under conditions of uncertainty. Here, we adopted the most common SDT framework, the equal-variance SDT (EVSDT) model (which assumes that the two signal strength distributions are Gaussian with equal variances), as the backbone of our computational model, motivated by regarding the passenger's perceptual system as an information-processing system [75], [76] (Fig. 2B). Thus, we could formulate passengers' ascription of humanness as detecting a signal from noise, in which the stimulus (E in Lewin's equation) from the human driver represents the signal, and that from the AI driver represents the noise.
To introduce this information processing, take as an example the input signal strength SS_k, stimulus E_k and output humanness rating behaviour B_k of observation k. We begin by calculating the point estimates of the EVSDT parameters. More specifically, using observations (excluding k) from passengers, we can compute hit rates H_{1/2}, H_{2/3} and false alarm rates F_{1/2}, F_{2/3} under the two criteria (here, we hypothesised that the signal strength from the human driver was greater than that from the AI driver, i.e., hypothesis 1 (H1), as depicted on the left of Fig. 2B):

H_{1/2} = #{Observ. in which B = 2 or 3, E = Human driver} / #{Human driver},
H_{2/3} = #{Observ. in which B = 3, E = Human driver} / #{Human driver},

and analogously for F_{1/2} and F_{2/3} with E = AI driver. Then the response criteria c_1 and c_2 can be estimated by the standard EVSDT formulas:

c_1 = -[Φ^{-1}(H_{1/2}) + Φ^{-1}(F_{1/2})] / 2,
c_2 = -[Φ^{-1}(H_{2/3}) + Φ^{-1}(F_{2/3})] / 2,

where Φ^{-1} is the inverse cumulative normal distribution function, which converts a hit rate or false alarm rate into a z score. Therefore, B_k given SS_k and E_k is:

B_k = 1 if M_k < c_1; B_k = 2 if c_1 ≤ M_k < c_2; B_k = 3 if M_k ≥ c_2,

where M_k is the magnitude of SS_k. Notice that in this example B_k increases with M_k, which means B and M are positively correlated. An alternative hypothesis, hypothesis 2 (H2), is that the signal strength from the AI driver was greater than that from the human driver, i.e., B and M are negatively correlated, as shown on the right of Fig. 2B; the above calculations can easily be adapted to H2.
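The EVSDT point estimates and criterion-based rating rule can be sketched in Python. This is a minimal illustration on invented observations; the criterion estimator c = -[z(H) + z(F)]/2 is the textbook EVSDT choice, which we assume matches the one intended here.

```python
from statistics import NormalDist

# Toy observations: (humanness rating B in {1, 2, 3}, stimulus E).
obs = [(3, "human"), (2, "human"), (3, "human"), (1, "human"),
       (1, "ai"), (2, "ai"), (1, "ai"), (3, "ai")]

z = NormalDist().inv_cdf  # Phi^{-1}: converts a rate into a z score

def rates(observations):
    human = [b for b, e in observations if e == "human"]
    ai = [b for b, e in observations if e == "ai"]
    # Hit rates under the two criteria (signal = human driver) ...
    h12 = sum(b >= 2 for b in human) / len(human)
    h23 = sum(b == 3 for b in human) / len(human)
    # ... and false alarm rates (noise = AI driver).
    f12 = sum(b >= 2 for b in ai) / len(ai)
    f23 = sum(b == 3 for b in ai) / len(ai)
    return h12, h23, f12, f23

def criteria(h12, h23, f12, f23):
    # Textbook EVSDT criterion location: c = -(z(H) + z(F)) / 2.
    return -(z(h12) + z(f12)) / 2, -(z(h23) + z(f23)) / 2

def predict_rating(magnitude, c1, c2):
    # Map the signal-strength magnitude M_k onto a rating via c1 < c2.
    if magnitude < c1:
        return 1
    return 2 if magnitude < c2 else 3

c1, c2 = criteria(*rates(obs))
```

Under H2 the mapping is simply reversed, so that larger magnitudes map onto lower humanness ratings.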

Signal strength: Affective transition
Further, to figure out how to represent the signal strength in SDT, we examined whether pre-study baseline emotions (including enjoyment, interest, surprise, fear, tension and satisfaction), post-stage emotions (the same set as the baseline emotions) and safety and comfort scores were associated with passengers' humanness rating behaviour (i.e., B). None of these measures was consistently correlated with B across the three road stages (Table 1). Moreover, the raw scores of the above measures were not significantly different between the human and AI driver conditions (Fig. 3), so they cannot play the role of signal strength for detecting humanness across the three stages.
If neither the pre-study nor the post-stage measures themselves affect the passenger's humanness rating behaviour, then perhaps a dynamic change in emotions, i.e., the affective transition (AT) between pre-study baseline emotions and corresponding post-stage emotions, holds the key. We tested this possibility using representational similarity analysis (RSA) [77].

(Fig. 4 caption: Intertrial variability in affective transition (AT) was significantly and consistently correlated with intertrial variability in humanness rating behaviour (B) across three road stages and two conditions. Each triangular dissimilarity matrix reflects intertrial variability in AT (derived from the distance between multidimensional scores of pre-study and post-stage emotions without the transformation procedure) and B, respectively. All correlation scores are in Spearman rho rank-order units (** p < .01, **** p < .0001), and related p-values were derived from one-tailed permutation tests (10,000 iterations).)
RSA is a widely used framework for analysing common representational mappings between computational models, brain activity and behavioural data [77], [78], [79], in which second-order isomorphism [80] (i.e., the match of dissimilarity matrices) is of the essence. By relating the representational geometry of affective transition to humanness rating behaviour, we found that intertrial variability in AT was significantly and consistently correlated with that in B across three road stages and two conditions (Fig. 4), indicating the potential of AT to play a role in passengers' ascription of humanness. Ergo, we employed AT, computed as the proximity between self-reported scores of pre-study and post-stage emotions, as the signal strength in SDT. That is to say, we leveraged passengers' AT to represent the variable P when investigating the specific and concrete form of Lewin's equation [54] B = f(P, E) in our case. Continuing with the above example, i.e., observation k, we can compute AT_k as the distance between the pre-study baseline vector and the post-stage vector:

AT_k = dist(z(v_pre), z(v_post,k)),

where z denotes z-score normalisation, and dist represents the distance measure, which could be absolute distance; one of the Anna Karenina distances [81] (including mean distance, minimum distance and the product of the absolute and minimum distances); the reversed Anna Karenina distance [81] (i.e., maximum distance); Pearson distance; Euclidean distance; Mahalanobis distance; cosine distance; Manhattan distance; word mover's distance [82]; or word rotator's distance [83]. Notice that the selection of the specific distance measure was performed under the cross-validation procedure (Section 3.2).

(Table 1 caption: 'B' denotes passengers' humanness ratings. All correlation coefficients in Table 1a are Spearman's rank correlation scores, and the regression coefficients in Table 1b were derived from ordinal logistic regressions. Significant effects (* p < .05, ** p < .01, *** p < .001) are in bold. All p-values (uncorrected, in parentheses) were derived from two-tailed tests.)
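As a minimal illustration of computing AT for one observation, the sketch below z-scores two invented six-emotion score vectors and applies two of the candidate distance measures (Euclidean and Manhattan); the scores themselves are made up.

```python
import math
from statistics import mean, stdev

def zscore(v):
    # z-score normalisation of a score vector
    m, s = mean(v), stdev(v)
    return [(x - m) / s for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Six DES-IV scores (enjoyment, interest, surprise, fear, tension,
# satisfaction) on the 1-4 scale; the numbers are invented.
pre = [2, 3, 1, 1, 2, 3]   # pre-study baseline vector
post = [3, 3, 2, 1, 1, 3]  # post-stage vector

at_euclidean = euclidean(zscore(pre), zscore(post))
at_manhattan = manhattan(zscore(pre), zscore(post))
```

The same skeleton applies to the other listed measures by swapping the `dist` function; which one is used is itself a cross-validated choice.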

Transformation: Leveraging pre-trained language models
Another key point to remember is that we transformed passengers' rating scores of emotions into corresponding language descriptions (together with their written mixed feelings) and leveraged PLMs to obtain the high-dimensional text representation to compute affective transition (Fig. 2D).
The intuition of transformation is two-fold: Firstly, recent evidence from cognitive neuroscience has shown that, in addition to sensory-derived, embodied knowledge representation, there is another language-derived, non-sensory knowledge representation for concepts with sensory referents in the human brain [84], [85], [86], and PLMs hold great promise for simulating this type of knowledge coding system [87], [88], [89], [90]. Therefore, we may better represent passengers' emotional experiences by utilising PLMs to simulate the language-derived coding system in their brains. Secondly, PLMs have achieved unprecedented success in many natural language processing (NLP) tasks [91], [92], [93], [94]. Incorporating the prior semantic knowledge in PLMs into our computational model might further boost the model performance. Thus, we could gain a better understanding of passengers' humanness rating behaviour from a data-driven perspective.
In this study, we tested 282 different PLMs, including 120 pre-trained word embeddings [63], [95] and 162 transformer-based [96] PLMs (such as ELECTRA [64] and T5 [65]), for encoding passengers' emotion scores and their written mixed feelings. Specifically, given the corresponding language description L_k of observation k, the general feature extraction process of a multi-layer transformer-based PLM is as follows. The input representation is constructed by summing the corresponding token embeddings, segment embeddings and position embeddings:

H^0_k = E^token_k + E^seg_k + E^pos_k.

Then, the hidden representations of L_k at the α-th layer of the N-layer PLM can be calculated as:

H^α_k = Transformer_α(H^(α-1)_k), α ∈ [1, N].

Empirically, we compute the average of the hidden representations from the first and last layers,

H^avg_k = (H^1_k + H^N_k) / 2 ∈ R^(n×d),

as the final extracted feature of L_k [97], [98], where n is the length of L_k and d is the size of the transformer layer. Notice that we obtain sentence-level representations via the above procedure. To get document-level representations, we first obtain the sentence-level representations for the six emotions and the mixed feelings separately, then conduct global average pooling over each matrix and stack the resulting vectors vertically.
Next, we conduct global pooling [99] over H^avg_k to get the vector representation v_k ∈ R^d:

v_k = pooling(H^avg_k),

where the pooling operation could be a max-, mean- or min-over-time operation, or a combination of two or three of these operations.
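The layer-averaging and pooling steps above can be sketched as follows; the hidden states are invented stand-ins for a real PLM's outputs, and the function names are ours.

```python
# Sketch of H^avg = (H^1 + H^N) / 2 and global pooling over time,
# on made-up n x d hidden-state matrices (n tokens, d dimensions).

def layer_average(h_first, h_last):
    # Elementwise average of the first- and last-layer matrices.
    return [[(a + b) / 2 for a, b in zip(ra, rb)]
            for ra, rb in zip(h_first, h_last)]

def global_pool(h, op="mean"):
    # Pool over the time (token) axis to get a d-dimensional vector.
    cols = list(zip(*h))  # d columns of length n
    if op == "mean":
        return [sum(c) / len(c) for c in cols]
    if op == "max":
        return [max(c) for c in cols]
    if op == "min":
        return [min(c) for c in cols]
    raise ValueError(op)

# Hypothetical hidden states for a 3-token sentence, d = 4.
h1 = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 0.0, 2.0]]
hN = [[2.0, 1.0, 0.0, 1.0], [1.0, 3.0, 1.0, 1.0], [0.0, 0.0, 2.0, 0.0]]

h_avg = layer_average(h1, hN)
v_mean = global_pool(h_avg, "mean")
# A combined pooling: concatenate max- and min-over-time vectors.
v_maxmin = global_pool(h_avg, "max") + global_pool(h_avg, "min")
```

With a real transformer-based PLM, `h1` and `hN` would be the first- and last-layer hidden states returned by the model for the sentence L_k.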
Finally, we further conduct a whitening transformation and dimensionality reduction [98] to improve the representations obtained via the above procedure. Given a set of vector representations of N observations {v_i}, i = 1, ..., N, we can compute the mean vector μ and covariance matrix Σ as follows:

μ = (1/N) Σ_i v_i,  Σ = (1/N) Σ_i (v_i - μ)^T (v_i - μ).

Then we conduct SVD decomposition [100] over Σ to get the related orthogonal matrix U and diagonal matrix Λ, i.e., Σ = U Λ U^T. Let W = (U √(Λ^{-1}))[:, :κ] (κ ∈ [1, d/2], where κ denotes the number of columns kept in W); the transformed vector ṽ_k can then be given as:

ṽ_k = (v_k - μ) W.

The selection of the representation level, the specific pooling operations and the κ value was performed under the cross-validation procedure (Section 3.2).
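A minimal sketch of the whitening-plus-truncation step, restricted for brevity to the special case of an already-diagonal covariance matrix (so U is the identity and W reduces to √(Λ⁻¹) truncated to its first κ columns); the full method applies the same recipe after an SVD of Σ.

```python
import math

# Whitening sketch, diagonal-covariance special case: each kept dimension
# is centred and rescaled to unit variance, and only the first kappa
# dimensions are retained. The toy vectors below are invented.

def whiten(vectors, kappa):
    n, d = len(vectors), len(vectors[0])
    mu = [sum(v[j] for v in vectors) / n for j in range(d)]
    var = [sum((v[j] - mu[j]) ** 2 for v in vectors) / n for j in range(d)]
    # W = sqrt(Lambda^{-1}) truncated to its first kappa columns
    scale = [1 / math.sqrt(var[j]) for j in range(kappa)]
    # v_tilde = (v - mu) W
    return [[(v[j] - mu[j]) * scale[j] for j in range(kappa)] for v in vectors]

vecs = [[1.0, 10.0, 3.0], [3.0, 30.0, 3.5], [5.0, 20.0, 4.0]]
w = whiten(vecs, kappa=2)
```

In the general case, the columns of U order the directions by variance, so truncating to κ columns keeps the dominant directions while whitening them.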

Results of the non-verbal variation of the Turing test
To examine whether the AI driver passed our non-verbal variation of the Turing test, we ran one-sample Wilcoxon tests on the average humanness rating scores (normalised to the range [0, 1] for better illustration) across trials for each condition against the chance level of 0.5 (i.e., the expected value of random rating). As shown in Fig. 5, when the AI driver controlled the AC, passengers' humanness ratings were significantly below the chance level across three separate road stages and all stages (first stage: CI =

Results of the computational models
We trained and evaluated the computational models under the nested leave-one-out cross-validation (nested-LOOCV) procedure [101]. We compared our models with machine learning baselines 3: MLR, the multi-class logistic regression classifier; KNN, the nearest-neighbour classifier; SVC, the support vector machine classifier; RF, the random forest classifier; XGBoost, the decision-tree-based ensemble classifier that uses a gradient boosting framework; and MLP, the multilayer perceptron classifier; as well as naive baselines: Random, which posits that the passenger's humanness rating behaviour is generated at random with equal probability; Probability, which posits that the passenger's humanness rating behaviour is drawn at random from the population of historical ratings; and Detective, which posits that the passenger can discern the difference between the human and AI drivers and thus makes the correct guess all the time. Within our proposed SDT-AT framework, we tested the following models: Original, in which AT was derived directly from a distance between multidimensional scores of pre-study and post-stage emotions without transformation by a PLM; PLM-wv, in which pre-trained word embeddings transform the passenger's emotion scores or mixed feelings; and PLM-tf, in which a transformer-based PLM transforms the passenger's emotion scores or mixed feelings. For the above SDT-AT models, we computed AT based on different data components: positive affect (PA, including enjoyment, interest, surprise and satisfaction), negative affect (NA, including fear and tension), all affect (AA, including PA and NA), mixed feelings (MF) 4 or a combination of MF and the other data components. For the machine learning baselines, we tested different model inputs: AA, PA or NA of the pre-study baselines, of the post-stage measures, or a combination of the two.
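The three naive baselines can be sketched as follows; the rating history and stimuli are hypothetical, and the function names are ours.

```python
import random

# Naive baselines for predicting humanness ratings (1-3 scale).
history = [1, 1, 2, 3, 1, 2, 1, 3, 3, 1]  # past ratings (invented)
stimuli = ["human", "ai", "human", "ai"]  # ground-truth drivers (invented)

def random_baseline(rng):
    # Random: each option is equally likely.
    return rng.choice([1, 2, 3])

def probability_baseline(rng, history):
    # Probability: draw from the empirical distribution of past ratings.
    return rng.choice(history)

def detective_baseline(stimulus):
    # Detective: an always-correct oracle over the driver identity.
    return 3 if stimulus == "human" else 1

preds = [detective_baseline(s) for s in stimuli]
```

These baselines bracket the problem: Random and Probability ignore the stimulus entirely, while Detective assumes perfect discrimination, so an informative model should land between them.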
We used Spearman's rank correlation score (rho) as the evaluation metric and selected hyperparameters with the highest rho score in the inner loop cross-validation of the nested-LOOCV.
The performance of the different computational models is shown in Table 2. Based on Lewin's equation, our proposed SDT-AT models provided superior within-stage (Table 2a-c) and cross-stage performance (Table 2d). We also conducted model simulations to verify whether our proposed winning computational models could replicate the passenger's humanness rating behaviour. Fig. 6 shows that our computational model accurately captured the passenger's humanness rating behaviour patterns in the non-verbal variation of the Turing test. Further, using representational similarity analysis (RSA) [77], we directly compared the representational geometry of the empirically observed humanness rating behaviour with those of the model simulations. As shown in Fig. 7, we found that the representational dissimilarity matrices (RDMs) of the model simulations were highly correlated with the RDM of the empirically observed humanness rating behaviour (within-stage: rho = 0.6607, p = 0.0039; cross-stage: rho = 0.6577, p = 0.0049), suggesting that our computational model exhibited the same humanness rating behaviour pattern as the passengers did. Altogether, these results permit us to use our computational models to further elucidate the implications that radiate from passengers' ascription of humanness in the non-verbal variation of the Turing test (see Section 4).
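The RDM comparison can be sketched as follows, with a hand-rolled Spearman's rho and invented ratings; the permutation test and the exact dissimilarity measure used for the real RDMs are omitted here.

```python
from itertools import combinations

# Build each representational dissimilarity matrix (RDM) as pairwise
# absolute differences, then rank-correlate the two upper triangles.

def rdm_upper(values):
    return [abs(a - b) for a, b in combinations(values, 2)]

def rank(xs):
    # Ranks with ties averaged (needed for Spearman on discrete data).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

observed = [3, 1, 2, 3, 1]   # empirical humanness ratings (hypothetical)
simulated = [3, 1, 2, 2, 1]  # model-simulated ratings (hypothetical)
rho = spearman(rdm_upper(observed), rdm_upper(simulated))
```

A permutation test on rho would then shuffle one set of ratings many times, recompute the correlation, and report the fraction of shuffles exceeding the observed value.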

Analysis of relatedness between the humanness rating and the magnitude of affective transition
In computational modelling, we incorporated two competing hypotheses (H1 and H2, see Section 2.2.2 and Fig. 2B) about the relatedness between humanness rating behaviour and the magnitude of affective transition into our proposed SDT-AT models. We then selected the winning model with the highest rho score in the outer loop cross-validation of the nested-LOOCV, as reported in Table 2. To reveal which hypothesis holds true, we compared the passenger's humanness rating to the magnitude of AT derived from our winning models. In favour of H1, we observed strong positive within- and cross-stage associations between the humanness rating and the magnitude of AT (first stage: rho = 0.4768, p = 3.94 × 10^−5; second stage: rho = 0.4739, p = 4.46 × 10^−5; third stage: rho = 0.5615, p = 1.14 × 10^−6; all stages: rho = 0.5093, p < 1.0 × 10^−13, see Fig. 8), such that the ascription of humanness increased with greater affective transition. This analysis suggested that AT, serving as a hypothesised crucial component of passengers' ride experience in our model, may indeed affect their ascription of humanness.
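The decision rule at the heart of such an SDT model, under H1 that humanness ratings increase monotonically with AT, can be sketched as follows: the signal strength is compared against ordered criteria, and the rating is the number of criteria exceeded plus one. The three-point scale (1 = 'AI driver', 3 = 'Human driver') and the criterion values here are illustrative assumptions.

```python
def sdt_rating(at, criteria=(0.4, 0.8)):
    # H1 decision rule: a stronger affective transition (signal strength)
    # crosses more decision criteria and yields a higher humanness rating
    assert list(criteria) == sorted(criteria), "criteria must be ordered"
    return 1 + sum(at > c for c in criteria)
```

A reversed mapping (ratings falling as AT grows) would implement the competing H2; the model comparison in Table 2 and the correlations above arbitrate between the two.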

Analysis of the direction of AT in the starting two stages
Our proposed SDT-AT models in which AT was derived (or partly derived) from positive affect (PA) dominated the comparisons on the first and second stages (see Table 2a-b). However, we did not know how the passenger's PA changed under the two conditions during the starting two stages, since AT is a scalar quantity with no direction. Given that the previous analysis (Section 4.1) showed that the signal strength (i.e., AT) from the human driver was greater than that from the AI driver, the passenger's PA might have increased or decreased greatly or moderately under the human driver condition while increasing or decreasing only moderately or slightly under the AI driver condition. To investigate this further, we examined mean changes in PA (calculated as post-stage minus pre-study PA summary scores of enjoyment, interest, surprise and satisfaction) during the first and second stages, respectively. As presented in Table 3, the passenger's PA was enhanced during these stages, and the greater the affective transition accompanying this enhancement, the higher the passenger's humanness rating.
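The PA-change computation described here can be sketched as below; the passenger records and item scores are illustrative, while the four PA items follow the definition above.

```python
PA_ITEMS = ("enjoyment", "interest", "surprise", "satisfaction")

def pa_summary(ratings):
    # summary PA score: sum over the four positive-affect items
    return sum(ratings[item] for item in PA_ITEMS)

def mean_pa_change(pre_study, post_stage):
    # mean change in PA: post-stage minus pre-study, averaged over passengers
    diffs = [pa_summary(post) - pa_summary(pre)
             for pre, post in zip(pre_study, post_stage)]
    return sum(diffs) / len(diffs)
```

A positive value indicates that, on average, the stage enhanced the passengers' positive affect relative to their pre-study baseline; computed separately per condition, it recovers the direction that the scalar AT discards.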

Word cloud analysis of mixed feelings
Given that our proposed SDT-AT models in which AT was obtained from the mixed feelings (MF) yielded the best performance on the third and all stages (see Table 2c-d), we further conducted a word cloud analysis to compare the MF induced by the human and AI drivers. As shown in Fig. 9, the word cloud highlights the salient MF items (i.e., those with larger sizes) under each condition, with the size of each MF item proportional to its z-scored AT from the cross-stage model simulations (positive for the human driver condition, negative for the AI driver condition). Specifically, under the human driver condition, the defining MF items in predicting passengers' highest humanness rating, 'Human driver' (3), were clustered (based on their semantics) as follows: 'Kerb distance was relatively constant.'; 'The driving mode was standard.'; 'The car ran (or started, or braked or stopped) smoothly.' Under the AI driver condition, the defining MF items in predicting passengers' lowest humanness rating, 'AI driver' (1), were clustered as follows: 'The car braked sharply or non-linearly.'; 'The car had a rough or bumpy start.' This comparison vividly shows the difference in the passenger's subjective ride experience between the two conditions and illustrates what current automated driving needs to improve in order to offer a human-like ride experience for the passenger and thus increase the social acceptance of ACs.
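The sizing rule for such a word cloud can be sketched as follows: each MF item's z-scored AT sets its font size, and the sign assigns it to the human- or AI-driver panel. The base size, scale factor and toy AT values are illustrative assumptions, not the figure's actual rendering parameters.

```python
def zscore(xs):
    # standardise to mean 0, (population) standard deviation 1
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def cloud_layout(items, at_values, base=12.0, scale=8.0):
    # font size grows with |z-scored AT|; the sign picks the condition panel
    return {item: {"size": base + scale * abs(z),
                   "panel": "human driver" if z > 0 else "AI driver"}
            for item, z in zip(items, zscore(at_values))}
```

Items with AT well above the mean thus dominate the human-driver panel, while those well below it dominate the AI-driver panel, matching the positive/negative proportionality described for Fig. 9.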

Contributions and implications
As autonomous cars become increasingly common on our roads, the human role gradually shifts from active driver to passive passenger. Meanwhile, a growing body of literature [22], [23], [24], [25], [26] highlights that the acceptance of the AC will increase if it drives in a stereotypically human manner. Nevertheless, very little research has been devoted to investigating the human likeness of the AC from the perspective of passive passengers. Herein, in the present study, for the first time, we examined whether a current SAE Level 4 AC, i.e., an AC with WeRide ONE [53] as its self-driving algorithm, could create a human-like ride experience for passengers in a real-road scenario and hence pass the non-verbal variation of the Turing test from the perspective of passive passengers. Our results showed that human passengers might be sensitive to the human-like ride experience, as indicated by the higher humanness rating in our non-verbal variation of the Turing test for the human driver condition relative to the AI driver condition. When the AI driver controlled the AC, passengers' humanness ratings were below the chance level, indicating that WeRide ONE did not pass our non-verbal variation of the Turing test because human passengers could successfully detect the AI driver based on their subjective ride experience (Fig. 5). Nonetheless, we also noticed that WeRide ONE could successfully trick human passengers in some trials, revealing the promising fact that some self-driving algorithms, like WeRide ONE, are beginning to learn and imitate human behaviour in a convincing manner.
As the literature suggests [102], even the best technology, such as a vehicle that drives itself, is of little use if the user does not accept it. Consequently, given the key role that human likeness plays in improving passengers' acceptance of ACs, we investigated further why passengers could discern the AI driver in most trials but not in others in our non-verbal variation of the Turing test. Specifically, on the basis of Lewin's field theory [54], we advanced a computational model combining signal detection theory (SDT) with pre-trained language models (PLMs) to predict passengers' humanness rating behaviour. We employed affective transition (AT), computed as the proximity between rating vectors of pre-study and post-stage emotions transformed by a PLM, as the signal strength in our SDT models. The results showed that our SDT-AT models could adequately predict passengers' humanness rating behaviour in the non-verbal variation of the Turing test (Table 2, Fig. 6 and Fig. 7), the implications of which are as follows.
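The AT computation described here can be sketched as below, with cosine distance between pre-study and post-stage representation vectors standing in for 'proximity'. The two-dimensional toy vectors stand in for PLM embeddings, and the actual distance measure used in the study may differ.

```python
def cosine_distance(u, v):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return 1.0 - dot / (norm_u * norm_v)

def affective_transition(pre_embedding, post_embedding):
    # AT: the further the post-stage representation drifts from the
    # pre-study baseline, the stronger the signal in the SDT model
    return cosine_distance(pre_embedding, post_embedding)
```

An unchanged emotional state yields AT = 0 (no signal), while a large shift in the embedded emotion representation yields a large AT and, under H1, a higher predicted humanness rating.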
First, our proposed computational model is a concrete application of Lewin's field theory, in which we replaced the variables in Lewin's equation with the specific situational and personal characteristics of the passenger (e.g., B with the humanness rating, P with AT and E with the stimulus). The practical success of basing the computational modelling on Lewin's seemingly abstract and theoretical field theory speaks directly to his famous maxim that 'there is nothing as practical as a good theory' [103]. Second, our proposed models not only achieved superior within-stage performance relative to all other baselines (Table 2a-c) but also showed superiority in cross-stage performance (Table 2d). Together with the agreement between model simulations and empirical observations (Fig. 6 and Fig. 7), our results indicate that we may have succeeded in discovering the general law B = f(P, E), which is valid for the dynamic structure of the passenger's psychological field (i.e., (P, E)). Finally, these results also demonstrate the possibility and feasibility of using NLP techniques, such as PLMs, as adjuncts to the interaction between social cognition and artificial intelligence to guide theorising and the generation of conceptual insights [104], [105].
Overall, conducting affective computing in this novel way enables us to discover the latent relatedness between AT and the passenger's humanness rating behaviour. Importantly, we offer the first insights into what renders passengers' subjective ride experience truly human-like for future automated driving: the passengers' ascription of humanness increases with greater affective transition (Fig. 8). Our further analysis of AT provided more concrete suggestions for the self-driving algorithm to offer a human-like ride experience for the passenger, e.g., improving passengers' positive affect during the starting stage (Table 3) and ensuring smoother starting and braking (Fig. 9).
Mentalising is a holistic process of inferring a target agent's beliefs and motivations (i.e., cognitive mentalising) and emotions and feelings (i.e., affective mentalising [106]), which not only plays a pivotal role in human social interaction [107], [108] but is also central to human-machine communication [109], [110]. We conjecture that the reason behind the phenomena described above (e.g., the relatively lower humanness rating and AT in the AI driver condition) is that the current self-driving algorithm may lack a certain level of mentalising ability (especially affective mentalising ability). For instance, without understanding the emotions and feelings of the passenger and how specific driving behaviours affect them (for a similar example of pedestrian-AC interaction, see [111]), the self-driving algorithm may not be able to provide passengers with as comfortable and pleasant a ride experience as the human driver does. More generally, as suggested by the literature [112], current AI is yet to fully embrace 'hot' cognition (i.e., emotional and social cognition, in contrast to 'cold' cognition, i.e., non-emotional information processing [113]), and it is crucial that AI applications include a mentalising system to help improve human-machine interaction. Ergo, we think it is very likely that imbuing future ACs with artificial mentalising ability will increase their human likeness and thus encourage automated driving to be integrated into human society. Interdisciplinary collaboration incorporating psychology, neuroscience and computer science is the path we must take to develop such artificial social intelligence with mentalising ability [114], [115], [116].

Limitations and future work
One could argue that passengers' humanness rating behaviour might emerge not after but during the stage. In other words, passengers might make the humanness rating first (which later results in their affective transition) during the road stage, before they report post-stage emotions. In response to this concern about logical rationality, let us return to the buttress of our computational modelling, i.e., Lewin's field theory. One principle of Lewin's field theory is contemporaneity, which means that the behaviour in a psychological field depends only upon the psychological field 'at that time' [117], i.e., B_t = f(P_t, E_t) (for brevity, we omitted the time subscript t in the earlier description of Lewin's equation). Empirically, a 'field at a given time' does not refer to a moment without time extension, but to a certain time period [117] (quite similar to describing the velocity of a point by treating a moment as a certain time period in physics [118]). In our case, it is worth noting that the psychological past and psychological future within a road stage are simultaneous parts of the passenger's psychological field existing at a given time t. That is to say, with regard to the passenger's humanness rating behaviour, the whole road stage the passenger rode would be considered the extent of the passenger's psychological field or subjective ride experience (cf. [117]). Outside of Lewin's field theory, one cannot exclude the possibility that passengers' humanness rating behaviour emerged during the road stage before passengers reported post-stage emotions, though the contemporaneity principle of field theory already suffices to address this concern. Future work should investigate this possibility with a more rigorous and sophisticated experimental design.
There are also several limitations that we should address. First, we conducted the non-verbal variation of the Turing test in a non-social context in which no pedestrians were present in the test stages. Thus, neither the AI nor the human driver in our experiment faced the so-called social or moral dilemma (e.g., the trolley problem) [119], [120], [121], [122]. Given the far-reaching importance of AI ethical decision making to its social acceptance [123], further research on this topic is necessary. Second, due to the limited number of people the event could hold, the number of participants in the current study was limited (68, 68 and 65 effective observations in the first, second and third stages, respectively). Third, we ignored the inherent differences between passengers, e.g., individual differences in their driving experiences and social cognition (large individual differences have been found in human mentalising ability and social behaviour [124], [125], [126], [127]), all of which might affect the generalisation of our results. Hence, a validation test would be crucial in future work to test whether our findings hold. Finally, we only used self-reported scores to measure the emotional experiences of passengers, which limits our exploration of the neural underpinnings supporting passengers' ascription of humanness in our non-verbal variation of the Turing test. Future studies might uncover these by using physiological measurement (e.g., heart rate, eye-movement entropy, galvanic skin response [128]), mobile electroencephalography (EEG) [129] (or even combined with mouse-tracking [130]) or portable functional near-infrared spectroscopy [131], [132].

DATA AND CODE AVAILABILITY STATEMENT
The data and code used in this paper are available at http://github.com/Das-Boot/bot_or_not.