In this paper, a semiautomated system for modeling 3D objects, particularly buildings, in a semi-urban scene from aerial video is presented. First, the video frames are preprocessed to minimize the rotational effects of camera motion. The 3D translational coordinates of the sensor are then used to stitch the video frames into nadir and stereo mosaics. Features extracted from the stereo mosaics, such as elevation, edges and corners, visual entropy, and color, are combined in a Bayesian framework to identify the 3D objects in the scene, such as buildings and trees. The initial 3D building models are further refined by projecting them onto individual video frames. A novel method is also designed for setting the input parameters of the vision algorithms required for feature extraction, using data-driven probabilistic inference in Bayesian networks. This method automates the 3D object identification process and eliminates the need for manual intervention. Finally, improvements in the accuracy of the 3D models that can be achieved by fusing lidar data with the aerial video during object identification are discussed.
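
The Bayesian combination of scene cues mentioned above can be sketched, in greatly simplified form, as a per-region naive-Bayes classifier. The class set, Gaussian likelihood parameters, and priors below are illustrative assumptions for two of the cues (elevation and visual entropy), not values taken from the paper:

```python
import math

# Hypothetical per-class Gaussian likelihoods for two cues from the abstract:
# elevation (meters) and visual entropy. All numbers are illustrative.
CLASSES = {
    # class: (prior, (elev_mean, elev_var), (entropy_mean, entropy_var))
    "building": (0.3, (12.0, 9.0), (3.0, 0.5)),
    "tree":     (0.2, (8.0, 16.0), (5.0, 1.0)),
    "ground":   (0.5, (0.5, 1.0),  (2.0, 1.0)),
}

def gaussian_loglik(x, mean, var):
    # Log of the univariate Gaussian density N(x; mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(elevation, entropy):
    """Naive-Bayes fusion of cues: pick the class with the highest
    log-posterior (log prior + sum of per-cue log-likelihoods)."""
    scores = {}
    for cls, (prior, elev_params, ent_params) in CLASSES.items():
        scores[cls] = (math.log(prior)
                       + gaussian_loglik(elevation, *elev_params)
                       + gaussian_loglik(entropy, *ent_params))
    return max(scores, key=scores.get)

print(classify(11.0, 3.2))  # tall, low-entropy region -> "building"
print(classify(0.3, 2.1))   # low, smooth region -> "ground"
```

Additional cues (edges and corners, color) would enter the same sum as further log-likelihood terms; the paper's system additionally infers the feature-extraction parameters themselves via Bayesian-network inference rather than fixing them by hand.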