Abstract
Social interaction has long been a subject of theoretical inquiry in both Computer-Mediated Communication (CMC) and Human-Computer Interaction (HCI), but it has seldom been examined through the lens of digital embodiment. As the metaverse gains traction as a platform for learning and collaboration, understanding how its affordances shape behavioral engagement demands empirical scrutiny. This study therefore examines the effects of avatar customization and communication modality on behavioral engagement within a metaverse-based school event. Using a 2×2 factorial between-subjects design, participants were randomly assigned to avatar (customized vs. generic) and modality (voice vs. text) conditions, with engagement tracked via a stealth assessment approach across multiple sessions. Findings indicate that avatar customization facilitated broader spatial exploration, while voice-based communication elicited more interpersonal interaction. The two factors also interacted, yielding selective effects on the temporal (time spent in activities) and social (number of unique interaction partners) dimensions of engagement. This study contributes a framework of affordance convergence that informs both the theoretical modeling of digital embodiment and the practical design of immersive learning platforms. As educational experiences increasingly unfold within socio-technical systems, the challenge for both HCI and CMC is to design environments where social interaction is both mediated and dynamically co-constructed through the alignment of interactional affordances.
Keywords: Metaverse, Virtual Worlds, Social Interaction, Stealth Assessment, Computer-Mediated Communication, Human-Computer Interaction
Introduction
The social affordances of metaverse platforms represent an emergent and increasingly theorized dimension of computer-mediated communication (CMC; Oh et al., 2023). Unlike traditional CMC modalities that primarily rely on symbolic exchange via text-based interfaces (e.g., asynchronous email exchanges, threaded discussions) or constrained synchronicity in video-mediated telepresence (e.g., Zoom, Microsoft Teams), metaverse ecosystems instantiate persistent digital environments that facilitate embodied copresence. This sociotechnical evolution extends conventional CMC frameworks by integrating synchronous and asynchronous communicative modalities, avatar-driven virtual embodiments, and spatially anchored, topologically persistent digital loci (Voinea et al., 2022). As a result, metaverse platforms lend new phenomenological salience to human-computer interaction (HCI), as users experience an intensified perception of copresence and immersion (Garcia et al., 2024; Jo & Lee, 2024). In educational contexts, the spatial configuration of metaverse environments becomes integral to shaping learner interactions, access to resources, and the construction of shared meaning.
To explain this heightened mediated immediacy, recent studies increasingly apply Embodied Social Presence Theory (ESPT; e.g., Garcia et al., 2023; Zhang et al., 2022) as a theoretical framework for understanding digital copresence in metaverse environments. According to Mennecke et al. (2010), ESPT postulates that the salience of social presence is a function of corporeal digital representation and the system's capacity to simulate kinaesthetic affordances akin to real-world social embodiment. Metaverse environments operationalize ESPT through a convergence of multimodal affordances that approximate physical-world interaction with increasing fidelity. The integration of kinematic avatar tracking, gesture-driven expressivity, and dynamic environmental reactivity enables users to experience not just visual but also proprioceptive and affective dimensions of virtual presence. Advanced metaverse infrastructures further amplify these visuospatial social heuristics through haptic fidelity, spatialized acoustics, and algorithmically modulated avatar gesturality (Kim et al., 2023; Tang et al., 2024). Taken together, these technologies enable immersive virtual spaces to foster ontologically rich para-reality engagements that transcend the semiotic limitations of legacy CMC paradigms.
In addition to its transformative implications for CMC, the metaverse is increasingly recognized as a pedagogically disruptive paradigm that reconfigures the spatiotemporal and interactional architectures of conventional learning environments (Chang & Hsiao, 2025; Qian et al., 2023; Zheng et al., 2025). Recent acceptance-oriented research further indicates that sustained participation in metaverse environments depends not only on technological usability but also on embodied, social, and immersive experience factors that shape how users inhabit and interpret virtual worlds (Garcia, 2025). Departing from the constraints of geographical fixity and synchronous participation, metaverse-based platforms cultivate participatory learning ecologies that dissolve institutional, temporal, and physical barriers (Sá & Serpa, 2023). This transition aligns with broader educational shifts toward hybridized and digitally mediated instructional models, wherein immersive presence and embodied interaction function as cognitive and affective amplifiers of knowledge construction and collaborative meaning-making. At the core of this educational evolution lie the social affordances of metaverse environments, which reconceptualize the school as a digitally co-habitable and co-constructed space rather than a static physical locality. Whereas virtual learning environments typically confine interaction to text-based discourse and asynchronous engagement that primarily facilitate disembodied interaction and symbolic exchange, the metaverse engenders a more ontologically situated learning experience through spatialized sociality, immersive embodiment, and persistent communal copresence (Büyüközkan & Mukul, 2024; İbili et al., 2024). This restructuring of educational topology transforms the metaverse from merely a tool for digital pedagogy into an alternative venue for academic socialization, institutional cohesion, and collaborative inquiry.
Literature Review
Avatar Customization
The capacity for avatar customization represents a foundational affordance in virtual environments (Wu et al., 2023). This feature shapes not only user identity expression (Vasalou & Joinson, 2009) but also cognitive and affective engagement in digital interactions (Lin & Wang, 2014). Customizable avatars function as mediated extensions of self-representation by allowing users to exercise agency over their visual, morphological, and expressive attributes (Messinger et al., 2019; Zimmermann et al., 2023). This flexibility in digital embodiment has been extensively examined in the context of presence theory (Garcia et al., 2023), self-discrepancy theory (Devlin et al., 2024), and Proteus effect studies (Ratan et al., 2020), with each framework offering unique insights into the psychosocial and behavioral ramifications of avatar personalization. A central theoretical construct underpinning avatar customization is appearance-based anthropomorphism, wherein humanlike visual and behavioral characteristics facilitate greater social identification and relational engagement (Garcia, 2025). Research on anthropomorphic design suggests that higher avatar realism enhances social presence, interpersonal trust, and affective resonance within virtual environments (Kim et al., 2023; Sinatra et al., 2021). However, the uncanny valley hypothesis posits that digital embodiments that approach, but fall short of, full human realism may induce perceptual dissonance (Bae et al., 2024) and undermine social-affective congruence. Thus, the degree of realism vs. abstraction in avatar customization is a nuanced variable that influences not only social interaction patterns but also self-perception and behavioral modulation in digital spaces.
In addition to anthropomorphic fidelity, avatar customization is also deeply intertwined with identity construction (Vasalou & Joinson, 2009) and social signaling (Lee et al., 2023) in the metaverse and other virtual environments. Studies in digital self-presentation reveal that the ability to modify one's avatar fosters greater psychological ownership (Chung et al., 2024), reinforcing self-congruence (Huang et al., 2024) and social confidence (Kang & Rhee, 2025) in virtual interactions. This phenomenon aligns with the Proteus effect (Ratan et al., 2020), wherein users subconsciously internalize and enact behaviors associated with their avatar's visual and symbolic attributes. Empirical evidence suggests that customizable avatars can modulate user behavior and influence levels of assertiveness, prosociality, and engagement within collaborative and competitive digital contexts (Jo & Lee, 2024; Lehdonvirta et al., 2012). In metaverse-based social and educational settings, avatar customization holds substantive implications for user engagement, group dynamics, and participatory involvement (Barta et al., 2024; Garcia, 2025). The capacity to personalize avatars may reduce the psychological distance between users and enhance interpersonal relatability and communicative fluency. Avatar customization may also serve as a pedagogical scaffold that reinforces student agency, intrinsic motivation, and social cohesion in virtual learning spaces. Given these potential cognitive, social, and affective affordances, avatar customization emerges as a critical construct warranting further empirical investigation within the context of metaverse-based institutional engagement.
Communication Modality
The modalities of communication embedded within digital environments shape social presence, cognitive engagement, and interactional dynamics in mediated spaces (Nguyen et al., 2021). Synchronous multimodal communication in virtual environments introduces layered affordances that influence expressivity, immediacy, and relational fluidity (Matusitz & Dacas, 2024). The degree to which a communication mode aligns with the user's cognitive and affective expectations can determine whether interactions feel immersive and natural or detached and mechanized. Within the metaverse, the choice of communication modality operates as a structural determinant of digital user engagement that mediates the perceived authenticity, reciprocity, and affective salience of interactions (Dong & Lee, 2022; Matusitz & Dacas, 2024; Sediyaningsih et al., 2023). A critical framework for understanding communication mode affordances in virtual environments is media richness theory (MRT), which posits that communicative modalities vary in their capacity to convey immediate feedback, emotional nuance, and nonverbal cues (e.g., Zhang et al., 2024). Text-based communication, while facilitating asynchronous discourse and reflective composition, is inherently low in immediacy and socio-emotional signaling, which often results in attenuated affective engagement. Conversely, voice-based communication supports prosodic modulation, turn-taking synchronization, and expressive dynamism, fostering heightened social presence and relational cohesion. Empirical studies grounded in media naturalness theory (MNT) further substantiate this distinction (e.g., Kahai, 2025), as MNT holds that human cognitive systems are evolutionarily predisposed to process vocal and gestural cues more efficiently than text-based inputs. Given these findings, voice-enabled communication would be expected to function as a more immersive and socially potent mode of interaction.
In addition to affective and cognitive processing, communication modality plays a pivotal role in group dynamics, collaborative efficacy, and participatory agency within metaverse-based interactions (Wang et al., 2024). A similar dichotomy between voice and text communication emerges in this context. Speech-driven environments facilitate higher conversational spontaneity, which enables fluid discourse, immediate clarification, and co-regulation of meaning-making processes. In contrast, text-based interactions offer persistence, deliberative processing, and asynchronous accessibility, which make this modality particularly useful for complex problem-solving and archival knowledge retrieval. This functional contrast underscores the need for a contextually adaptive approach to communication mode selection (Matusitz & Dacas, 2024; Salinäs, 2002), particularly one that strategically balances expressive immediacy with cognitive load management to optimize engagement in virtual immersive environments. Within metaverse-based educational and social infrastructures, communication modality likewise constitutes a critical determinant of user experience, interactional depth, and engagement patterns. In pedagogical applications, voice communication fosters interactive discourse, rapid ideation, and real-time collaborative scaffolding, while textual engagement enables structured reflection, precision in articulation, and cognitive offloading (Garcia et al., 2022). The integration of multimodal communication tools offers a hybridized communicative ecology that can support cognitive bandwidth and social presence simultaneously (Ghamandi et al., 2024). Given its fundamental role in shaping user behavior, participatory agency, and social connectedness, communication modality remains a theoretically and empirically salient construct that warrants further examination within metaverse-based institutional frameworks.
Interaction Between Avatar Customization and Communication Modality
The synergistic interplay between avatar customization and communication modality constitutes a critical yet underexplored dimension of embodied computer-mediated interaction. While avatar customization influences self-representation, identity projection, and social affordances (Kang & Rhee, 2025; Messinger et al., 2019; Zimmermann et al., 2023), communication modality governs the immediacy, expressiveness, and relational depth of interactions (Dong & Lee, 2022; Matusitz & Dacas, 2024; Wang et al., 2024). These two variables co-construct the communicative landscape that determines the extent to which users experience a sense of social immersion, interpersonal connectedness, and participatory agency within virtual spaces. Their alignment, in addition to their independent effects, may dictate the salience of digital presence and interactional coherence in metaverse environments. When digital self-representation is reinforced by multimodal communication affordances, users may experience heightened copresence and interactional fluidity. In this context, the degree of avatar customization may amplify or constrain the effectiveness of communication modality, as users tend to exhibit greater expressive alignment when their digital representation is congruent with their communicative intent (Messinger et al., 2019; Zimmermann et al., 2023).
The alignment between avatar customization and communication modality may also influence discourse cohesion, participatory engagement, and cognitive-affective investment. A highly personalized avatar in a voice-based learning environment may serve as a behavioral amplifier that reinforces student agency, discourse fluidity, and interactive spontaneity (Huang et al., 2024; Kim et al., 2023). The sense of embodied presence created by a customized avatar, when combined with real-time spoken communication, may facilitate deeper relational continuity and collaborative engagement. However, in asynchronous or text-heavy virtual discussions, avatar customization may function more as a symbolic identity marker than as an active communicative amplifier. The absence of immediate verbal reciprocity in text-based conditions may limit the activation of avatar-induced behavioral modulation, reducing the impact of customization on interaction depth. These variations underscore the necessity of examining how the alignment of avatar customization and communication modality shapes digital socialization, interactional fluidity, and knowledge co-construction in the metaverse environment.
Gap and Research Questions
Despite the expanding discourse on metaverse-based education, existing scholarship largely positions the metaverse as a pedagogical instrument rather than a fully realized social and institutional space. Prior research centers primarily on its utility for immersive content delivery, interactive simulations, and virtualized skill acquisition, often overlooking the mechanisms through which presence and engagement manifest in metaverse-based interactions. This pedagogical framing has led to a narrow conceptualization of the metaverse that prioritizes instructional affordances while neglecting its capacity to function as a socially immersive environment. It is therefore unsurprising that the relationship between digital embodiment and communicative affordances remains underexamined in the field. While presence in virtual environments is often assumed to translate into meaningful engagement, the extent to which avatar customization and communication modality shape social dynamics and participatory behaviors remains insufficiently theorized. Given that schools serve as both cognitive and social ecosystems, it is imperative to explore how the metaverse not only replicates but actively reconfigures these communal functions through digital self-representation and interactional modality. To address this gap, the present study investigates how avatar customization and communication modality influence user engagement within a metaverse-based school event. Specifically, this study seeks to answer the following research questions (RQs):
- To what extent do avatar customization and communication modality independently shape user engagement in a metaverse-based school event?
- How does the interplay between avatar customization and communication modality modulate engagement dynamics within a metaverse environment?
Methods
Research Design
This study employed a 2×2 factorial between-subjects experimental design to examine the independent and interactive effects of avatar customization (personalized vs. generic) and communication modality (voice vs. text) on user engagement in a metaverse-based school event. An experimental approach was chosen for its ability to establish causal relationships by systematically manipulating independent variables while controlling potential confounds (Cobb et al., 2003). This methodology allows observed differences in engagement to be attributed to experimental conditions rather than extraneous factors. In addition, a factorial design was selected because it allows for the simultaneous examination of multiple independent variables and their interaction effects (Parke, 2010). The factorial design is particularly useful in CMC and HCI research, where engagement is often influenced by the interplay of multiple factors rather than isolated variables (e.g., Garcia, 2025). This experimental setup reflects the interactive nature of learning environments in the metaverse, where user engagement is shaped by both social presence and interface design (Lee et al., 2023). In this study, participants (N = 120) were randomly assigned to one of four experimental conditions: Personalized Avatar + Voice Modality (Group 1), Personalized Avatar + Text Modality (Group 2), Generic Avatar + Voice Modality (Group 3), and Generic Avatar + Text Modality (Group 4). This study adhered to the established ethical guidelines of the institution and the broader principles of research integrity.
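The study reports random assignment to the four cells but not the allocation procedure or per-cell sizes. As a hypothetical sketch only, a balanced allocation (shuffle the participant IDs, then deal them round-robin into the cells) could be written in Python as follows; the condition labels, the ID range, and the seed are illustrative assumptions, not details from the study.

```python
import random

# Hypothetical labels for the four cells of the 2x2 design (Groups 1-4).
CONDITIONS = [
    ("personalized", "voice"),  # Group 1
    ("personalized", "text"),   # Group 2
    ("generic", "voice"),       # Group 3
    ("generic", "text"),        # Group 4
]


def assign_balanced(participant_ids, seed=None):
    """Randomly allocate participants to the four cells with (near-)equal cell sizes.

    Assumption: shuffle-then-deal allocation; the study does not report
    its exact randomization procedure.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Dealing shuffled IDs in turn gives each cell len(ids) // 4 members,
    # plus at most one extra when len(ids) is not a multiple of 4.
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(ids)}


assignment = assign_balanced(range(1, 121), seed=2024)
for cell in CONDITIONS:
    print(cell, sum(1 for c in assignment.values() if c == cell))
```

With 120 IDs this sketch yields 30 participants per cell. Independent coin flips per participant would also be random but could leave unequal cells, which complicates the simpler balanced-design ANOVA formulas.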
Setting and Metaverse Application
The study was conducted within an institution recognized for its pioneering use of metaverse platforms in education (Garcia et al., 2023). This university was selected due to its established commitment to immersive virtual learning and digital interaction research. Its metaverse application has been the subject of various studies on immersive education (e.g., Garcia, 2025), demonstrating its potential for both pedagogical and social engagement. In June 2024, the institution unveiled the Summer Edition of its metaverse ecosystem (Figure 1). This version was designed specifically to improve social interaction in virtual spaces (Garcia et al., 2024). This edition marked a deliberate departure from traditional academic spaces by introducing a beach-themed environment optimized for social engagement and participatory experiences. To enhance digital presence and interaction, the Summer Edition incorporated a diverse range of immersive social affordances, including thematic activities, interactive entertainment spaces, and simulated recreational experiences. These features not only provided a dynamic setting for informal peer interactions but also positioned the metaverse as an alternative institutional space for social relationship-building beyond classroom-based learning.
To support broader student engagement, the platform was also designed with accessibility and flexibility in mind. Importantly, it does not require the use of virtual reality (VR) headsets, as it operates across standard devices such as desktop computers, tablets, and mobile phones. This cross-device accessibility helped ensure comparable interaction experiences across conditions, particularly by mitigating potential limitations for participants in the text-based communication modality. While the term metaverse is often associated with fully immersive ecosystems, the present study adopts an inclusive conceptualization consistent with emerging scholarly frameworks. In this view, the metaverse encompasses persistent, multi-user, three-dimensional virtual environments that afford embodied copresence and synchronous interaction. This perspective aligns with Kye et al. (2021) and the systematic review by Tlili et al. (2022), who drew on the Metaverse Roadmap proposed by the Acceleration Studies Foundation in 2006. The roadmap conceptualized the metaverse as consisting of four categories, namely Augmented Reality, Lifelogging, Mirror Worlds, and Virtual Worlds. These categories demonstrate that the metaverse is not confined to fully immersive VR experiences but rather encompasses a spectrum of digital environments that facilitate presence, interaction, and continuity.
Procedures and Instrumentation
The study commenced with virtual orientation sessions, during which participants were briefed on their assigned conditions. To maintain ecological validity while preserving experimental control, each participant was given a metaverse application pre-configured for their assigned condition. All participants interacted using full-body avatars equipped with emote features (Figure 2), which allowed them to perform a range of expressive actions (e.g., waving, sitting, clapping) to support nonverbal interaction. Because the institutional metaverse platform had been widely used in prior courses and co-curricular events, participants were already familiar with avatar customization and navigation, which reduced the risk that variability in user competence would influence engagement outcomes. Over the course of two weeks, participants engaged in four structured sessions designed for exploratory engagement, social interaction, and participation in pre-existing activities within the metaverse environment. Throughout these sessions, participants were free to navigate the digital space, interact with peers, and engage in various virtual experiences. To assess user engagement objectively, the study employed stealth assessment, a performance-based, unobtrusive data collection technique that tracks user behavior in real time without disrupting the natural flow of interaction (Rahimi & Shute, 2024). In the present study, engagement was operationalized as behavioral engagement, reflected in observable patterns of spatial movement, activity participation, and social interaction within the metaverse environment. While these indicators do not capture subjective or affective engagement directly, they provide an ecologically grounded approximation of users' enacted involvement and participatory investment during the event.
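The study does not describe the platform's logging schema. As a hypothetical sketch, assuming the platform emits timestamped events (avatar position samples, activity entry and exit, and interaction events tagged with a partner ID), the five behavioral indicators later reported in Table 1 could be derived from one participant's log as shown below. The `Event` fields, event names, and counting rules are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """One hypothetical log record; field names are illustrative, not the platform's."""
    t: float              # seconds since session start
    kind: str             # "move", "enter_activity", "exit_activity", or "interact"
    x: float = 0.0        # avatar position in metres (used by "move")
    y: float = 0.0
    activity: str = ""    # activity ID (used by enter/exit events)
    partner: str = ""     # other participant's ID (used by "interact")


def engagement_metrics(events):
    """Aggregate one participant's event log into five behavioral indicators."""
    distance_m = 0.0
    last_pos = None
    open_activities = {}      # activity ID -> entry time
    activity_seconds = 0.0
    joined = set()
    partners = set()
    total_interactions = 0

    for e in sorted(events, key=lambda ev: ev.t):
        if e.kind == "move":
            # Sum straight-line distances between consecutive position samples.
            if last_pos is not None:
                distance_m += math.dist(last_pos, (e.x, e.y))
            last_pos = (e.x, e.y)
        elif e.kind == "enter_activity":
            open_activities[e.activity] = e.t
            joined.add(e.activity)
        elif e.kind == "exit_activity" and e.activity in open_activities:
            # Overlapping activities are timed separately; entries without a
            # matching exit are ignored (both simplifying assumptions).
            activity_seconds += e.t - open_activities.pop(e.activity)
        elif e.kind == "interact":
            total_interactions += 1
            partners.add(e.partner)

    return {
        "exploration_distance_km": distance_m / 1000,
        "time_in_activities_min": activity_seconds / 60,
        "activity_participation": len(joined),   # distinct activities joined
        "unique_interactions": len(partners),
        "total_interactions": total_interactions,
    }


log = [
    Event(0, "move", x=0, y=0),
    Event(10, "move", x=300, y=400),                         # 500 m walked
    Event(20, "enter_activity", activity="volleyball"),
    Event(620, "exit_activity", activity="volleyball"),      # 10 minutes
    Event(630, "interact", partner="P017"),
    Event(640, "interact", partner="P017"),
    Event(650, "interact", partner="P042"),
]
metrics = engagement_metrics(log)
print(metrics)
```

For this toy log the sketch reports 0.5 km, 10 minutes, one activity, two unique partners, and three interactions. Whether the study counted activity participation as distinct activities or as repeated entries is not stated, so that rule in particular is an assumption.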
Data Collection and Analysis
This study employed a multi-phase data collection approach, wherein engagement metrics were systematically recorded over a predefined interaction window within the metaverse event. Consistent with prior work in technology-rich environments, engagement was treated as a multidimensional behavioral construct rather than a latent psychological state and was inferred from log-based indicators of enacted participation rather than self-reported experience. System logs were extracted, anonymized, and structured into quantitative datasets for statistical analysis. Descriptive statistics (mean and standard deviation) were computed to summarize performance across conditions, and assumption testing was conducted to ensure the validity of inferential analyses. The Shapiro-Wilk test, along with visual inspections of histograms and Q-Q plots, was used to assess normality, and Levene's test indicated that the assumption of homogeneity of variances across groups was met. A two-way analysis of variance (ANOVA) was performed to examine the main effects of avatar customization and communication modality, as well as their interaction effect, on engagement metrics. To assess the practical significance of findings, effect sizes were computed as partial eta squared (reported as η² in the text and tables) and interpreted using conventional benchmarks (approximately .01 small, .06 medium, and .14 large). For significant interaction effects, simple effects analyses with Bonferroni-adjusted post hoc comparisons were conducted to determine pairwise differences across conditions. Findings were interpreted through the lens of CMC frameworks, HCI principles, affective computing, social presence frameworks, and digital embodiment research.
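A minimal sketch of the two-way ANOVA described above, written in plain Python for a balanced 2×2 design. Equal cell sizes are an assumption, since per-cell sizes are not reported. The function returns F, degrees of freedom, and partial eta squared (SS_effect / (SS_effect + SS_error)) for each effect. P-values are omitted to keep the block dependency-free; with SciPy installed, `scipy.stats.f.sf(F, df1, df2)` gives the upper-tail p-value. In practice a statistics package (for example, statsmodels' `anova_lm`) would typically be used. This is for exposition only, and the toy scores are not study data.

```python
from statistics import mean


def two_way_anova_balanced(cells):
    """Two-way between-subjects ANOVA for a balanced factorial design.

    `cells` maps (level_a, level_b) -> list of scores, e.g.
    ("personalized", "voice") -> [...]. Factor A is the first key element
    (avatar) and factor B the second (modality). All cells must have equal n.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(scores) != n for scores in cells.values()):
        raise ValueError("this sketch assumes equal cell sizes")

    cell_mean = {key: mean(scores) for key, scores in cells.items()}
    grand = mean(cell_mean.values())  # equals the raw grand mean when cells are balanced
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    # Sums of squares for the two main effects, the interaction, and within-cell error.
    ss_a = n * len(b_levels) * sum((m - grand) ** 2 for m in a_mean.values())
    ss_b = n * len(a_levels) * sum((m - grand) ** 2 for m in b_mean.values())
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    ss_err = sum((x - cell_mean[key]) ** 2 for key, scores in cells.items() for x in scores)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_err = len(cells) * (n - 1)
    ms_err = ss_err / df_err

    results = {}
    for name, ss, df in (("A", ss_a, df_a), ("B", ss_b, df_b), ("AxB", ss_ab, df_a * df_b)):
        results[name] = {
            "F": (ss / df) / ms_err,
            "df": (df, df_err),
            "partial_eta_sq": ss / (ss + ss_err),
        }
    return results


# Toy data (not study data): three scores per cell.
toy = {
    ("generic", "text"): [1, 2, 3],
    ("generic", "voice"): [3, 4, 5],
    ("personalized", "text"): [2, 3, 4],
    ("personalized", "voice"): [6, 7, 8],
}
res = two_way_anova_balanced(toy)
for effect, stats in res.items():
    print(effect, stats)
```

For the toy data this gives F = 12 for factor A, F = 27 for factor B, and F = 3 for the interaction, each with df = (1, 8). Simple effects would then be examined only where the interaction term is significant.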
Results
All recruited participants (N = 120) successfully completed the study. Their mean age was 21.4 years (SD = 2.9), with 57% (n = 68) identifying as male, 41% (n = 49) as female, and 2% (n = 3) preferring not to disclose their gender. Although most participants had prior experience with the institution's educational metaverse, baseline familiarity did not differ significantly across experimental conditions (p = .315), consistent with random assignment. Descriptive statistics (Table 1) revealed notable variations in engagement metrics across the four experimental conditions. Group 1 (Personalized Avatar + Voice Modality) exhibited the highest levels of engagement across all dependent variables, with the greatest exploration distance (M = 3.52 km, SD = 0.45), the longest time spent in activities (M = 47.1 min, SD = 5.2), the highest unique interaction count (M = 21.2, SD = 3.5), and the highest total interaction count (M = 94.6, SD = 9.2). Conversely, Group 4 (Generic Avatar + Text Modality) consistently exhibited the lowest engagement levels, particularly in exploration distance (M = 2.89 km, SD = 0.39), time spent in activities (M = 36.4 min, SD = 4.5), and total interaction count (M = 68.5, SD = 7.8).
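Table 1. Descriptive statistics (M ± SD) of engagement metrics by condition
| Condition | Exploration Distance (km) | Time Spent on Activities (min) | Activity Participation (count) | Unique Interactions (count) | Total Interactions (count) |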
|---|---|---|---|---|---|
| Personalized Avatar + Voice Modality | 3.52 ± 0.45 | 47.1 ± 5.2 | 5.4 ± 0.9 | 21.2 ± 3.5 | 94.6 ± 9.2 |
| Personalized Avatar + Text Modality | 3.14 ± 0.38 | 38.7 ± 4.8 | 5.2 ± 0.8 | 16.3 ± 3.1 | 73.2 ± 8.5 |
| Generic Avatar + Voice Modality | 2.97 ± 0.42 | 42.6 ± 5.0 | 4.9 ± 0.7 | 19.4 ± 3.3 | 87.3 ± 8.9 |
| Generic Avatar + Text Modality | 2.89 ± 0.39 | 36.4 ± 4.5 | 4.6 ± 0.6 | 15.9 ± 3.0 | 68.5 ± 7.8 |
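Each cell in the table reports a metric as M ± SD for one condition. A minimal sketch of how such a summary could be produced from per-participant records follows, assuming a simple list-of-dicts layout with hypothetical field names and the sample standard deviation (n - 1 denominator, so each condition needs at least two records). The toy values are illustrative, not study data.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(records, metric, decimals=1):
    """Group per-participant records by condition and format `metric` as 'M ± SD'.

    `records` is a list of dicts with a "condition" key and one key per metric
    (a hypothetical layout). SD is the sample standard deviation (n - 1).
    """
    by_condition = defaultdict(list)
    for r in records:
        by_condition[r["condition"]].append(r[metric])
    return {
        cond: f"{mean(vals):.{decimals}f} ± {stdev(vals):.{decimals}f}"
        for cond, vals in by_condition.items()
    }


# Toy records (not study data) for two conditions.
records = [
    {"condition": "Personalized + Voice", "time_min": 44.0},
    {"condition": "Personalized + Voice", "time_min": 47.0},
    {"condition": "Personalized + Voice", "time_min": 50.0},
    {"condition": "Generic + Text", "time_min": 35.0},
    {"condition": "Generic + Text", "time_min": 37.0},
]
table = summarize(records, "time_min")
print(table)
```

Here the first toy condition summarizes to "47.0 ± 3.0" and the second to "36.0 ± 1.4".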
A similar pattern was observed in the unique interaction count, where participants using personalized avatars interacted with more distinct peers (M = 21.2, SD = 3.5 in Group 1; M = 16.3, SD = 3.1 in Group 2) than those using generic avatars (M = 19.4, SD = 3.3 in Group 3; M = 15.9, SD = 3.0 in Group 4) within each communication modality, although the avatar difference was small in the text-based conditions and the difference between voice and text conditions was larger at both avatar levels. This descriptive pattern suggests that avatar personalization may enhance social openness or perceived approachability within virtual environments. Activity participation remained relatively stable across conditions, though it was marginally higher in voice-based conditions (M = 5.4, SD = 0.9 in Group 1; M = 4.9, SD = 0.7 in Group 3) than in the corresponding text-based conditions (M = 5.2, SD = 0.8 in Group 2; M = 4.6, SD = 0.6 in Group 4). These descriptive trends suggest that avatar customization and communication modality may exert distinct influences on metaverse-based engagement, which the inferential analyses examine formally.
RQ1: To what extent do avatar customization and communication modality independently shape user engagement in a metaverse-based school event?
The main effects analysis revealed a differentiated pattern: some engagement metrics differed significantly by avatar customization, whereas others were more strongly influenced by communication modality (Table 2). For avatar customization, a significant effect was observed for exploration distance (F = 5.89, p = .017, η² = .048), indicating that participants using personalized avatars explored the virtual environment more extensively than those with generic avatars. However, avatar customization did not significantly affect any other engagement metric, including time spent in activities (F = 2.14, p = .146, η² = .019), activity participation count (F = 1.87, p = .174, η² = .016), unique interaction count (F = 3.02, p = .085, η² = .026), and total interaction count (F = 2.95, p = .089, η² = .025). These findings suggest that while customized avatars may encourage greater spatial exploration, their influence on social interaction and activity engagement is less pronounced.
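Table 2. Main effects of avatar customization and communication modality on engagement metrics
| Engagement Metric | Avatar Customization F | Avatar Customization p | Avatar Customization η² | Communication Modality F | Communication Modality p | Communication Modality η² |
|---|---|---|---|---|---|---|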
| Exploration Distance | 5.89 | .017 | .048 | 1.76 | .187 | .015 |
| Time Spent on Activities | 2.14 | .146 | .019 | 2.81 | .096 | .024 |
| Activity Participation | 1.87 | .174 | .016 | 0.42 | .517 | .004 |
| Unique Interaction | 3.02 | .085 | .026 | 4.78 | .032 | .040 |
| Total Interaction | 2.95 | .089 | .025 | 12.41 | <.001 | .097 |
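Because each effect in this 2×2 design has one degree of freedom and, with all 120 participants retained, the error term has 120 - 4 = 116, partial eta squared can be recovered directly from an F ratio as η²ₚ = (F × df_effect) / (F × df_effect + df_error). The short Python check below applies this standard identity to selected F values from the table above. The df values are inferred from the design rather than reported, so treat this as a consistency check under that assumption. Applied to every F in the table, the identity reproduces the reported η² values to within ±.001 of rounding.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio.

    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Assumed df: 1 for each two-level effect; 120 participants - 4 cells = 116 for error.
DF_EFFECT, DF_ERROR = 1, 116

# Selected F ratios from the main-effects table.
table2_f = {
    "exploration_distance_avatar": 5.89,
    "unique_interaction_modality": 4.78,
    "total_interaction_modality": 12.41,
}
recovered = {
    name: round(partial_eta_squared(f, DF_EFFECT, DF_ERROR), 3)
    for name, f in table2_f.items()
}
print(recovered)
```

These three values round to .048, .040, and .097, matching the table.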
Conversely, communication modality exerted a stronger influence on social engagement metrics. While it did not significantly affect exploration distance (F = 1.76, p = .187, η² = .015), time spent in activities (F = 2.81, p = .096, η² = .024), or activity participation count (F = 0.42, p = .517, η² = .004), it had a significant effect on unique interaction count (F = 4.78, p = .032, η² = .040) and total interaction count (F = 12.41, p < .001, η² = .097). These results indicate that participants in voice-based communication conditions engaged with a greater number of unique users and had significantly higher overall interaction volumes than those in text-based conditions. Collectively, these findings highlight a contrasting pattern in how avatar customization and communication modality influence metaverse engagement. This pattern suggests that the factors driving spatial and social engagement in the metaverse may operate through distinct mechanisms, warranting further exploration of their potential interactive effects.
RQ2: How does the interplay between avatar customization and communication modality modulate engagement dynamics within a metaverse environment?
The interaction analysis revealed that avatar customization and communication modality jointly influenced certain aspects of engagement, though their combined effects were more pronounced in some metrics than others (Table 3). A significant interaction effect was observed for time spent in activities (F = 5.03, p = .027, η² = .041), suggesting that the alignment between self-representation and communicative affordances shaped the extent to which participants remained engaged in structured activities within the metaverse environment. Similarly, the interaction effect for unique interaction was statistically significant (F = 4.62, p = .034, η² = .038), suggesting that the interplay between avatar customization and communication modality influenced the breadth of participants' social networks, that is, the number of distinct individuals they interacted with across sessions. In contrast, no significant interaction effects were detected for exploration distance (F = 1.34, p = .251, η² = .011), activity participation (F = 2.22, p = .139, η² = .019), or total interaction (F = 2.19, p = .142, η² = .019). These findings indicate that, although each factor independently shaped some engagement metrics, their combination did not meaningfully alter spatial navigation patterns, overall participation rates, or total communicative output beyond their respective main effects. The presence of interaction effects in specific engagement dimensions suggests that the alignment (or misalignment) between digital embodiment and communicative affordances is particularly consequential for sustained activity engagement and the diversification of social interactions within metaverse environments.
| Engagement Metric | F | p | η² | Interpretation |
|---|---|---|---|---|
| Exploration Distance | 1.34 | .251 | .011 | Not significant; small effect |
| Time Spent on Activities | 5.03 | .027 | .041 | Significant; small-to-medium effect |
| Activity Participation | 2.22 | .139 | .019 | Not significant; small effect |
| Unique Interaction | 4.62 | .034 | .038 | Significant; small-to-medium effect |
| Total Interaction | 2.19 | .142 | .019 | Not significant; small effect |
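The factorial analysis behind the F and η² values above can be sketched as follows. The code computes main and interaction effects for a balanced 2×2 between-subjects design from raw scores and reports partial η², that is, SS_effect / (SS_effect + SS_error). It rests on two assumptions: equal cell sizes, and that the reported η² values are partial. This section states neither the cell sizes nor the degrees of freedom, so the helper that converts a reported F to partial η² takes them as inputs. The labels follow Cohen's (1988) conventional benchmarks for η² (.01 small, .06 medium, .14 large). The cell names and demo scores are hypothetical.

```python
from statistics import mean


def two_by_two_anova(cells: dict[tuple[str, str], list[float]]) -> dict[str, dict[str, float]]:
    """Balanced 2x2 between-subjects ANOVA.

    `cells` maps (avatar_level, modality_level) to the scores in that cell.
    Every effect in a 2x2 design has 1 degree of freedom.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    if len(a_levels) != 2 or len(b_levels) != 2 or len(cells) != 4:
        raise ValueError("expected exactly four cells in a 2x2 layout")
    n = len(next(iter(cells.values())))
    if n < 2 or any(len(scores) != n for scores in cells.values()):
        raise ValueError("expected equal cell sizes of at least 2")

    grand = mean(x for scores in cells.values() for x in scores)
    cell_mean = {key: mean(scores) for key, scores in cells.items()}
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    # Sums of squares for the two main effects and the interaction.
    ss = {
        "avatar": 2 * n * sum((m - grand) ** 2 for m in a_mean.values()),
        "modality": 2 * n * sum((m - grand) ** 2 for m in b_mean.values()),
        "avatar x modality": n * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels for b in b_levels
        ),
    }
    # Within-cell (error) variation.
    ss_error = sum((x - cell_mean[key]) ** 2 for key, scores in cells.items() for x in scores)
    df_error = 4 * (n - 1)
    ms_error = ss_error / df_error

    return {
        effect: {
            "F": ss_effect / ms_error,  # MS_effect = SS_effect / 1
            "partial_eta_sq": ss_effect / (ss_effect + ss_error),
            "df_error": df_error,
        }
        for effect, ss_effect in ss.items()
    }


def partial_eta_sq_from_f(f: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from a reported F and its degrees of freedom."""
    return f * df_effect / (f * df_effect + df_error)


def cohen_label(eta_sq: float) -> str:
    """Cohen's (1988) conventional benchmarks for eta squared."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"
```

Every effect in a 2×2 design has one degree of freedom. Partial η² can therefore be recovered from any reported F as F / (F + df_error), which is the general-form calculation that `partial_eta_sq_from_f` performs.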
Discussion
This study sought to interrogate the often-assumed equivalence between the illusion of presence and the actuality of engagement within a metaverse environment by examining how avatar customization and communication modality shape user interaction. Clarifying how digital self-representation and communication modality jointly modulate user behavior contributes to the broader discourse on social presence, digital identity construction, and multimodal engagement in virtual ecosystems. Framed within the traditions of CMC and HCI, the results offer an empirical response to the central question of the extent to which the metaverse can replicate, augment, or reconfigure socio-spatial affordances in sociotechnical systems designed for immersive interaction. At the heart of this inquiry is the proposition that behavioral engagement in virtual environments is not merely a function of spatial copresence but an emergent pattern of enacted participation shaped by the alignment between digital self-representation and communicative affordances. The resulting portrait is of engagement that depends not on isolated design features but on the alignment between visual identity, communicative immediacy, and contextual affordances. Whereas prior work has modeled metaverse participation primarily through acceptance and experiential appraisal frameworks (e.g., the META model; Garcia, 2025), the present findings extend this line of inquiry by demonstrating how representational and communicative affordances condition engagement as a set of observable enactments within immersive environments.
The Behavioral Consequences of Avatar Customization
While avatar customization is typically viewed as a matter of user preference, emerging evidence suggests it plays a far more consequential behavioral role. The findings reinforce the proposition that avatar customization extends beyond superficial aesthetic modification to function as a behavioral catalyst that modulates user agency and engagement dynamics (Garcia, 2025; Wu et al., 2023). In line with the Proteus Effect, the ability to curate one's digital embodiment appears to foster heightened psychological ownership over the virtual self (Ratan et al., 2020). The significant increase in exploration distance among users with customized avatars is consistent with the assertion that stronger avatar self-representation enhances perceptual and cognitive mapping of the virtual environment. These findings also align with prior work by Messinger et al. (2019) and Zimmermann et al. (2023), both of which suggest that avatar personalization enhances spatial immersion and self-extension in digital environments. This spatial behavior may be partially attributable to social signaling motives, as users with personalized avatars could be more inclined to traverse high-visibility areas within the metaverse to display their digital identity to others. Goffman's (1959) self-presentation theory, which conceptualizes social interaction as a form of dramaturgical performance, holds that individuals engage in impression management by actively shaping how they appear to others. In virtual environments, avatars function as front-stage personas, and customization becomes a means through which users communicate identity, status, or affiliation. Within this performative framework, spatial movement assumes symbolic weight, positioning users as actors navigating visibility, recognition, and social legibility.
Thus, avatar customization not only informs how users appear, but also where they choose to go, how they negotiate presence, and how they enact digital selfhood within immersive sociotechnical systems (Barta et al., 2024; Hube et al., 2024).
Yet, while avatar customization clearly shaped how users moved through virtual space, it did not meaningfully alter how they moved toward one another. The non-significant effects of avatar customization on activity participation and interaction frequency suggest that digital embodiment alone does not intrinsically foster social engagement. The observed dissociation between spatial immersion and social participation underscores the need to conceptualize engagement in virtual environments as multidimensional rather than monolithic. This finding contrasts conceptually with Lee et al. (2023), who found that avatar customization amplified the relationship between avatar identification and enjoyment, partially through increased social engagement. Although their study did not isolate customization as a direct predictor of interaction, it highlights the conditional role customization plays in shaping socially engaging experiences when identification is already high. A plausible explanation for this divergence lies in contextual differences. Whereas prior studies examined socially driven or entertainment-oriented platforms with clear interactional incentives, the present study was situated in an educational metaverse with more diffuse and pedagogically mediated engagement structures. In such environments, the presence of a personalized avatar may enhance spatial presence without necessarily translating into deeper interpersonal interaction. This interpretation is also consistent with Garcia (2025), who noted that avatar realism exerts limited influence on communicative outcomes in the absence of behavioral cues and synchronous modalities. These results collectively suggest that while avatar customization may facilitate individual-level immersion and navigational agency, its capacity to elicit reciprocal engagement remains contingent on the communicative scaffolds and social architecture of the virtual environment. 
Consequently, the ability of avatar customization to foster peer-to-peer engagement may depend less on the feature itself and more on how virtual spaces are intentionally designed to activate social intent.
Communication Modality as a Driver of Interactional Depth
Just as avatar customization shaped the spatial logic of user behavior, communication modality emerged as a central determinant of interactional structure. Rather than functioning as a neutral conduit for message delivery, modality operated as an embedded affordance that modulated the rhythm, reach, and reciprocity of social engagement. The observed increase in both unique and total interaction aligns with previous scholarship highlighting the role of modality in scaffolding social presence and interactional immediacy within virtual spaces (Matusitz & Dacas, 2024; Nguyen et al., 2021). This pattern is consistent with MRT, which posits that richer, more expressive modalities foster more fluid exchanges through immediate feedback and emotional nuance (Zhang et al., 2024). Similarly, the heightened interactional fluency observed in the voice conditions resonates with MNT, which holds that vocal cues are processed more intuitively than textual inputs because of evolved cognitive predispositions (Kahai, 2025). This interactional advantage may be explained by the reduced cognitive effort and greater temporal fluidity afforded by synchronous voice communication. Unlike text, which requires message construction and editing and must be interpreted without prosodic cues to tone, voice allows for immediate articulation and spontaneous expression. This reduction in linguistic overhead lowers the threshold for initiating and sustaining conversations, particularly in dynamic social environments. Voice communication may also heighten perceived copresence, prompting users to treat virtual interaction as more socially obligatory, immediate, and interpersonally salient. These results also extend earlier work by Dong and Lee (2022), who identified modality as a mediator of reciprocity and relational salience in immersive environments.
Taken together, these findings reinforce the proposition that communication modality constitutes a fundamental design variable that directly governs not only how users express themselves, but also how relationships are enacted, maintained, and expanded within immersive sociotechnical systems.
Although communication modality clearly governed the shape of social interaction, it exerted little influence on how users navigated or engaged with the environment itself. The absence of significant effects on exploration distance, time spent on activities, and activity participation suggests that the behavioral reach of modality may be bounded by its social function. This notion is consistent with Sediyaningsih et al. (2023), who emphasized that communicative mode primarily shapes perceived authenticity and affective salience in interpersonal exchanges rather than task performance. Whereas interaction with others is acutely sensitive to the immediacy and expressivity of a communicative channel, spatial exploration and activity engagement are often driven by environmental design, individual motivation, and task relevance. Users may also implicitly associate modality with interpersonal affordances rather than system-oriented actions (Ghamandi et al., 2024; Matusitz & Dacas, 2024). In educational contexts, this distinction echoes findings by Garcia et al. (2022), who noted that while voice enhances collaborative scaffolding, textual modes often support solitary reflection and content-focused engagement. From a behavioral perspective, this may reflect an implicit mapping of modality to perceived utility, whereby users calibrate their use of voice or text according to whether a task feels socially or cognitively mediated. In this sense, modality shapes human connection rather than the extent of movement or task participation. These findings reinforce the distinction between the relational scaffolding provided by communication modality and the autonomous or goal-directed behaviors shaped by user intent and platform architecture.
The Synergistic Influence of Avatar Customization and Communication Modality
Visual identity and communicative immediacy functioned as orthogonal but complementary affordances, each operating along distinct perceptual axes yet converging at the level of user engagement. When users' digital embodiment was supported by an expressive channel for social exchange, the result was a distinct engagement profile marked by increased temporal investment and expanded social reach. This pattern supports the proposition that immersive participation in metaverse environments is not reducible to additive affordances but emerges from the dynamic interplay of representational and relational architectures (Kang & Rhee, 2025; Wang et al., 2024). Prior research has emphasized the discrete impacts of avatar customization and modality on identity projection and interpersonal immediacy, respectively (Dong & Lee, 2022; Zimmermann et al., 2023), yet the present findings suggest that these effects converge when communicative intent is matched by a congruent digital persona. A plausible mechanism for this convergence is perceptual-conceptual congruence, the alignment of how users appear and how they sound, which creates a unified self-presentation that minimizes dissonance and enhances social intelligibility. As Messinger et al. (2019) observed, expressive alignment fosters more naturalistic engagement, and in the context of synchronous communication this alignment may amplify behavioral continuity and perceived copresence. Where synchronous voice communication reinforces avatar-based identity cues, users may be more likely to experience the interaction as authentic, reciprocal, and socially saturated. Conversely, when such alignment is absent, as in asynchronous or text-heavy contexts, customization may function primarily as a symbolic artifact rather than a behavioral amplifier.
These findings advance the notion that engagement in the metaverse is an emergent property of affordance integration, wherein representational coherence across sensory modalities scaffolds more immersive, affectively resonant, and socially durable forms of participation.
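The claim that engagement is not reducible to additive affordances has a precise counterpart in the 2×2 design. Let μ_jk denote the population mean of an engagement metric under avatar condition j (C = customized, G = generic) and modality k (V = voice, T = text); this notation is introduced here for illustration. The interaction effect is then the difference between the two simple effects of modality:

$$
\psi_{A \times B} = (\mu_{CV} - \mu_{CT}) - (\mu_{GV} - \mu_{GT})
$$

Under a purely additive model, each cell mean is a grand mean plus separate avatar and modality terms, and the contrast vanishes:

$$
\mu_{jk} = \mu + \alpha_j + \beta_k \;\Longrightarrow\; \psi_{A \times B} = (\beta_V - \beta_T) - (\beta_V - \beta_T) = 0
$$

A significant interaction, as found for time spent in activities and unique interaction count, therefore indicates ψ ≠ 0 for those metrics. Establishing which avatar–modality pairing drives the difference would additionally require simple-effects comparisons of the cell means.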
Implications, Limitations, and Future Directions
This study offers a theoretically novel and empirically grounded account of how the alignment between avatar customization and communication modality modulates user engagement in immersive virtual environments. Departing from prior research that treated these variables as functionally independent, the findings support a model of affordance convergence in which representational and relational systems interact to produce emergent patterns of behavioral coherence, most clearly for time spent in activities and the breadth of social interaction. In doing so, the study contributes to the discourse on embodied interaction and sociotechnical design within CMC and HCI. It also advances an integrated framework wherein avatar embodiment and communicative expressivity are co-constitutive of participatory depth in metaverse-based ecologies.
The implications for educational metaverse infrastructures are both conceptual and operational. The behavioral impact of user-driven digital embodiment indicates that avatar customization should be repositioned from a peripheral aesthetic affordance to a central pedagogical vector (Garcia, 2025). Platforms designed for learning and collaboration would benefit from enabling granular and semantically resonant customization. The greater exploration distance observed among users with customized avatars suggests that such features can support self-referential anchoring, spatial agency, and exploratory autonomy within immersive contexts. In parallel, the observed interaction between avatar embodiment and communication modality foregrounds the need for multimodal communicative architectures. Rather than privileging a single channel of expression, platform designers should develop adaptive communicative ecologies that enable users to transition fluidly between high-immediacy (e.g., voice) and high-precision (e.g., text) modalities.
Accordingly, the study articulates the following design heuristics and guidelines for metaverse developers, educators, and institutional technologists as part of a broader framework for enhancing engagement and learning within immersive environments:
- Cross-device functionality (desktop, tablet, mobile) should remain a design priority. This study's non-VR implementation achieved high engagement consistency across modalities, suggesting that immersive learning can be realized without hardware exclusivity.
- Personalized avatars significantly increased spatial exploration and trended toward richer social engagement. Platforms should therefore offer semantically resonant, identity-relevant customization that supports self-referential anchoring, agency, and motivation.
- Communication modality had a significant effect on social interaction volume and diversity, with voice-based engagement outperforming text-based interaction. Designers should implement adaptive architectures that enable seamless transitions between high-immediacy and high-precision channels to accommodate communicative preferences.
- The significant interaction effects for time-on-task and social reach indicate that engagement deepens when avatar embodiment and communication modality are congruent. Platforms should thus promote coherence between users' expressive representations and the modalities through which they interact.
- The positive associations between embodiment and engagement suggest that users respond dynamically to how their avatars perform in context. Incorporating real-time behavioral or affective feedback could strengthen sustained participation.
- Rather than treating visual, spatial, and communicative features as discrete elements, developers should design for their synchrony. Engagement emerges from the alignment of perceptual and relational cues that together scaffold a sense of presence and participation.
- Engagement in the metaverse is an emergent property of cognitive, emotional, and social congruence within the environment. Rather than feature novelty, system coherence should guide design decisions to ensure platforms feel socially inhabitable.
Collectively, these recommendations advance the view that metaverse participation is the product of affordance orchestration, or the perceptual and behavioral synchrony between how users see themselves, how they are seen by others, and how they communicate across modalities. Thus, interaction design becomes an exercise in constructing interactional harmony, wherein avatar embodiment and communicative immediacy act as amplifying vectors of digital sociality.
Notwithstanding these contributions, the study is bounded by several methodological and conceptual constraints. First, the experimental design captured immediate behavioral responses, which precludes claims about the longitudinal persistence of the observed effects. Future research should adopt extended temporal frames to examine how users internalize, habituate to, or recalibrate avatar–modality alignment over time, particularly as environmental familiarity and group dynamics evolve. Second, while the customization protocol permitted expressive variation, it did not manipulate specific dimensions of avatar design, such as anthropomorphism, realism, or dynamic facial expressivity. Given prior evidence linking avatar realism to the user–avatar relationship (Kim et al., 2023), subsequent inquiries should investigate how these design attributes interact with communication modality to influence relational depth and epistemic trust. Third, the study did not incorporate environmental variables such as spatial configuration, object interactivity, or ambient social cues. These contextual factors likely serve as mediators or moderators of the behavioral manifestations of embodiment–modality alignment. Fourth, the study did not use VR headsets or other immersive hardware, as the platform was accessed via standard devices (e.g., desktop, tablet, or mobile). This choice was intentional to ensure methodological consistency across conditions, particularly for participants in the text-based modality. However, it also limits the study's applicability to fully immersive VR contexts, where presence, embodiment, and sensory cues may operate differently. Future research should compare cross-platform engagement (screen-based vs. VR-based environments) to better understand how device affordances shape the dynamics of avatar–modality alignment and user experience.
Finally, while the use of stealth assessment added ecological validity by unobtrusively capturing behavioral data, the study did not incorporate complementary qualitative methods (e.g., interviews or open-ended feedback) that could shed light on participants' subjective experiences. Although this decision was consistent with the study's quantitative focus on engagement behaviors, future research could benefit from integrating mixed-methods approaches to provide a richer, multidimensional understanding of user engagement in metaverse environments.
Toward this end, the study advocates a paradigm shift away from discrete-feature evaluation toward an ecological interactionism model. This approach situates digital engagement as a function of perceptual and social congruence within complex sociotechnical systems. It recognizes that meaningful participation emerges not from isolated design elements but from the alignment of users' cognitive, emotional, and social orientations with the immersive environment. Such a perspective invites a more holistic approach to platform development, one that privileges coherence over novelty and integration over fragmentation. Beyond advancing theoretical understanding of metaverse participation, this perspective provides a conceptual roadmap for designing immersive environments that feel not merely inhabited but inhabitable: socially coherent, experientially resonant, and psychologically meaningful.
Conclusion
As the metaverse continues to evolve from a speculative frontier into a functional paradigm for communication, collaboration, and learning, understanding how its core affordances shape user engagement has become urgent. In educational contexts, especially those where presence, interaction, and agency are foundational to pedagogical efficacy, the metaverse offers both unprecedented opportunities and novel design challenges. This study contributes to that growing body of knowledge by illuminating how the individual and interactive effects of avatar customization and communication modality configure the behavioral contours of engagement in immersive environments. Examining these foundational constructs both as isolated variables and as mutually conditioning systems advances a more integrated framework for understanding how digital selfhood and communicative immediacy converge to scaffold meaningful participation. The implications extend to educators seeking to design engaging virtual learning environments, to developers building the next generation of immersive platforms, and to researchers developing theories of embodied digital interaction. In a world where identity is increasingly mediated through screens and simulations, the findings underscore a critical insight: how we appear and how we speak are not separate acts but interwoven performances of self in digitally constructed spaces. As digital spaces increasingly mediate human presence, the task is no longer to replicate reality but to design environments that are experientially legible, socially intelligible, and psychologically sustainable. What lies ahead is a redefinition of what it means to be present, to be seen, and to be heard.