Abstract: Talking-face generation requires precise joint modeling of facial texture and the driving audio. To this end, we study semantic-guided deformation of texture features and propose a sketch-guided few-shot talking-face video generation framework that employs dual-stage generation for modality alignment. In the first stage, real prior facial-landmark information is used to generate the target facial landmarks from audio; in the second stage, the landmarks are converted into sketches, which serve as intermediate representations for semantic alignment with the reference image. Introducing sketches effectively resolves the modality mismatch between audio and images. In experiments, the algorithm achieves FID scores of 15.676 and 8.618 on the public HDTF and MEAD datasets, respectively. These results show that, driven by the target audio, the proposed algorithm effectively models facial texture through intermediate representations and achieves performance comparable to state-of-the-art methods.
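
The dual-stage pipeline described above can be outlined as follows. This is a minimal, hypothetical sketch of the data flow only; all function names, signatures, and toy inputs are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the dual-stage pipeline from the abstract.
# All names and the toy arithmetic are illustrative, not the actual models.

def audio_to_landmarks(audio_feats, prior_landmarks):
    """Stage 1: predict target facial landmarks from driving audio,
    conditioned on real prior landmark information (placeholder logic)."""
    return [(a + p) / 2 for a, p in zip(audio_feats, prior_landmarks)]

def landmarks_to_sketch(landmarks):
    """Convert predicted landmarks into a sketch, the intermediate
    representation used for semantic alignment (placeholder logic)."""
    return ["stroke(%.2f)" % v for v in landmarks]

def render_face(sketch, reference_image):
    """Stage 2: deform the reference image's texture guided by the
    sketch to produce an output frame (placeholder logic)."""
    return {"sketch": sketch, "texture_from": reference_image}

# Toy end-to-end usage with dummy inputs.
audio_feats = [0.2, 0.4, 0.6]
prior = [0.1, 0.3, 0.5]
landmarks = audio_to_landmarks(audio_feats, prior)
frame = render_face(landmarks_to_sketch(landmarks), "reference.png")
```

The sketch serves only to make the two stages explicit: audio drives landmark prediction, and landmarks (via sketches) drive texture deformation of the reference image.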