When learning to bind visual symbols to sounds, to what extent do beginning readers track seemingly irrelevant information, such as a symbol’s position within a visual display? In this study, we used adult typical readers’ own webcams to track their eye movements during a paired-associate learning task that arbitrarily bound unfamiliar characters to monosyllabic pseudowords. Overall, participants’ error rate in recognition (Phase 1) decreased as a function of exposure, but was not modulated by the episodic memory-based ‘looking-at-nothing’ effect. Moreover, participants’ lowest error rates in both recognition and recall (Phases 1 and 2) were associated with item consistency across multiple exposures, in terms of spatial and contextual properties (i.e., a stimulus’s screen location and its co-occurrence with specific distractor items during encoding). Taken together, our findings suggest that typical readers extract statistical regularities in the input during visual-phonological associative learning, leading to rapid acquisition of these pre-orthographic representations.