Recent models of voice perception propose a hierarchy of processing steps leading from a more general, “low-level” acoustic analysis of the voice signal to a voice-specific, “higher-level” analysis. We aimed to engage two of these stages: first, a more general detection task in which voices had to be identified amid environmental sounds, and second, a more voice-specific task requiring a same/different decision about pairs of unfamiliar speakers (Bangor Voice Matching Test [BVMT]). We explored how vulnerable voice recognition is to interfering distractor voices and whether performance on these two tasks could predict resistance to such interference. In addition, we manipulated the similarity of the distractor voices to explore its impact on recognition accuracy. We found moderate correlations between voice detection ability and resistance to distraction (r = .44) and between BVMT performance and resistance to distraction (r = .57). A hierarchical regression revealed both tasks as significant predictors of the ability to tolerate distractors (R² = .36). The first stage of the regression (BVMT as sole predictor) already explained 32% of the variance. Descriptively, the “higher-level” BVMT was a better predictor (β = .47) than the more general detection task (β = .25), although further analysis revealed no significant difference between the two beta weights. Furthermore, distractor similarity did not affect performance on the distractor task. Overall, our findings suggest that it is possible to target specific stages of the voice perception process. This could help explore different stages of voice perception and their contributions to specific auditory abilities, possibly also in forensic and clinical settings.
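
The reported regression figures can be cross-checked with a short calculation. The following is a minimal sketch, not the authors' analysis: it uses only the rounded values given above, and the variable names are illustrative. It relies on the fact that, with a single predictor, R² equals the squared zero-order correlation, so the first stage (BVMT only) should explain about .57² ≈ .32 of the variance. The remainder, up to the full-model value of .36, is the approximate increment attributable to the detection task (about .04). Because the inputs are rounded, this increment is approximate as well.

```python
# Values taken from the abstract (rounded to two decimals there).
r_bvmt = 0.57    # correlation: BVMT vs. resistance to distraction
r2_full = 0.36   # R^2 of the full model (BVMT + voice detection)

# With one predictor, R^2 is the squared zero-order correlation.
r2_step1 = r_bvmt ** 2

# Approximate gain in explained variance from adding voice detection in step 2.
delta_r2 = r2_full - r2_step1

print(f"Step 1 R^2 (BVMT only): {r2_step1:.2f}")
print(f"Approximate delta R^2:  {delta_r2:.2f}")
```

The step 1 value agrees with the 32% stated for the first stage of the regression.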