Audio is captured, an FFT is computed, and the log spectrum is then cepstrally liftered: the high-quefrency part of the cepstrum (which carries pitch and harmonic structure) is zeroed out, leaving only the smooth spectral envelope. The formants F1 and F2 are taken from the two strongest prominent peaks of that envelope above F0, the fundamental frequency.
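
The liftering and peak-picking steps can be sketched roughly as follows. This is a minimal illustration in plain Python, not the app's actual code: the function names (dft, lifter_envelope, pick_formants), the cutoff of 8, the 100 Hz bin spacing, and the synthetic log spectrum are all assumptions made for the example. "Prominent" is simplified here to "a local maximum", and peaks are ranked by height.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; fine for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def lifter_envelope(log_spectrum, cutoff):
    """Zero the high-quefrency cepstral bins and return the smoothed log spectrum."""
    n = len(log_spectrum)
    cepstrum = dft(log_spectrum, inverse=True)
    # Keep the low quefrencies and their mirror images so the result stays real.
    liftered = [c if (q < cutoff or q > n - cutoff) else 0j
                for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered)]

def pick_formants(envelope, bin_hz, f0_hz, count=2):
    """Return the `count` highest local maxima above f0_hz, sorted by frequency."""
    half = len(envelope) // 2  # only the non-mirrored half of the spectrum
    peaks = []
    for k in range(1, half):
        freq = k * bin_hz
        if freq > f0_hz and envelope[k - 1] < envelope[k] > envelope[k + 1]:
            peaks.append((envelope[k], freq))
    strongest = sorted(peaks, reverse=True)[:count]
    return sorted(freq for _, freq in strongest)

# Synthetic log spectrum (assumed): a slow "envelope" term plus a fast
# "harmonic ripple" term that liftering should remove.
N = 64
log_spec = [-math.cos(2 * math.pi * 4 * k / N) + 0.5 * math.cos(2 * math.pi * 20 * k / N)
            for k in range(N)]

envelope = lifter_envelope(log_spec, cutoff=8)
formants = pick_formants(envelope, bin_hz=100.0, f0_hz=150.0)
print(formants)
```

In this toy input the ripple sits at quefrency 20, above the cutoff, so it is removed and only the slow envelope term survives; the two peaks are then read off that envelope. A real implementation would start from windowed audio frames and would likely use a proper FFT routine and a stricter prominence test.
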
The voice scaling toggle multiplies the reference F1/F2 values by 1.17 for adult female voices. Detection itself doesn't depend on language or sex; those settings only move the reference circles. Your dot always shows your measured F1/F2.
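
As a small illustration of the scaling, the sketch below applies the 1.17 factor to a reference point only. The function name scale_reference and the 300 Hz / 2300 Hz reference values are hypothetical, chosen just for the example; the measured values are never touched.

```python
FEMALE_SCALE = 1.17  # factor stated above for adult female voices

def scale_reference(f1_hz, f2_hz, adult_female):
    """Scale a reference (F1, F2) pair; measured formants are not passed through here."""
    factor = FEMALE_SCALE if adult_female else 1.0
    return f1_hz * factor, f2_hz * factor

# Hypothetical reference circle centre, in Hz.
ref_off = scale_reference(300.0, 2300.0, adult_female=False)
ref_on = scale_reference(300.0, 2300.0, adult_female=True)
print(ref_off, ref_on)
```

With the toggle on, the reference point moves to roughly 351 Hz and 2691 Hz, while the dot for the detected F1/F2 stays where it was.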