We introduce a novel approach to modeling the dynamics of human facial motion induced by the action of speech for the purpose of synthesis. We represent the trajectories of a number of salient features on the human face as the output of a dynamical system composed of two subsystems, one driven by the deterministic speech input and the other driven by an unknown stochastic input. Inference of the model (learning) is performed automatically and involves an extension of independent component analysis to time-dependent data. Using a shape-texture decomposition to represent the face, we generate facial image sequences reconstructed from synthesized feature point positions.
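
For concreteness, a two-input dynamical model of this kind can be sketched in a generic state-space form (a notational illustration only, not necessarily the exact parameterization used in this work), where the feature trajectories $y_t$ are generated from a hidden state $x_t$ driven jointly by the deterministic speech input $u_t$ and the unknown stochastic input $w_t$; the symbols $A$, $B$, $C$, $H$, $x_t$, $u_t$, and $w_t$ are assumptions introduced here for illustration:
\[
x_{t+1} = A x_t + B u_t + C w_t, \qquad y_t = H x_t .
\]
Under this reading, learning amounts to estimating the parameters and the statistics of $w_t$ from observed trajectories, which is where an extension of independent component analysis to time-dependent data would enter.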