Markerless human motion capture has a wide variety of applications, but recovering a pose from calibrated multi-view images remains challenging. Pose estimation is generally formulated as an optimization problem, but cost functions based on extrinsic similarities (silhouettes, edges, etc.) are easily trapped in local minima, and pose estimation fails when the body is largely self-occluded. We define an effective cost function that combines extrinsic and intrinsic similarities: the extrinsic term projects a mesh model onto the camera views to best match the observed silhouettes, while the intrinsic term uses heat diffusion on predefined mesh models to preserve the intrinsic geometry of the shape. Experimental results show that the proposed method is more robust for motion capture over image sequences than using extrinsic or intrinsic similarity alone.
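
To make the idea of a combined cost concrete, the following is a minimal sketch of how an extrinsic silhouette term and an intrinsic shape term might be summed into one objective. It is an illustration under stated assumptions, not the formulation used in this work: the function names, the XOR-over-union silhouette measure, the mean-squared descriptor difference, and the `weight` parameter are all hypothetical. The intrinsic descriptors are assumed to be precomputed per-vertex values (for example, derived from heat diffusion on the mesh); how they are computed is outside the scope of the sketch.

```python
import numpy as np


def silhouette_cost(rendered, observed):
    """Extrinsic term (assumed form): share of the union area where two
    binary silhouettes disagree, i.e. 1 - intersection-over-union."""
    rendered = np.asarray(rendered, dtype=bool)
    observed = np.asarray(observed, dtype=bool)
    union = np.logical_or(rendered, observed).sum()
    if union == 0:
        # Both masks are empty, so there is nothing to disagree about.
        return 0.0
    return float(np.logical_xor(rendered, observed).sum() / union)


def intrinsic_cost(descriptor, reference_descriptor):
    """Intrinsic term (assumed form): mean squared difference between
    per-vertex descriptors of the posed mesh and the reference mesh."""
    d = np.asarray(descriptor, dtype=float)
    r = np.asarray(reference_descriptor, dtype=float)
    return float(np.mean((d - r) ** 2))


def combined_cost(rendered_views, observed_views,
                  descriptor, reference_descriptor, weight=1.0):
    """Average the silhouette cost over all camera views and add the
    weighted intrinsic cost. `weight` is a hypothetical trade-off factor."""
    extrinsic = np.mean([silhouette_cost(r, o)
                         for r, o in zip(rendered_views, observed_views)])
    return float(extrinsic
                 + weight * intrinsic_cost(descriptor, reference_descriptor))
```

In such a scheme, an optimizer would evaluate `combined_cost` for each candidate pose, rendering the mesh into every calibrated view to obtain `rendered_views`. The intrinsic term penalizes poses that match the silhouettes but distort the shape, which is one plausible way a combined objective could help under self-occlusion.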