We propose fusing an image's local and global information for scene classification. Local information is represented by contextual information that is exploited using spatial pyramid matching. Each image is divided into patches on a regular grid, and scale-invariant feature transform (SIFT) features are extracted from each patch. The patch features are then clustered and quantized to obtain visual words. Visual word pairs and visual word triplets are formed from neighboring visual words that differ from one another. By analogy between image pixel space and patch space, we also obtain visual word groups, which are contiguous occurrences of the same visual word. Global information is extracted with the spatial envelope, a holistic description of the scene that does not take local information into account. Finally, a stacked support vector machine (SVM) fusion method combines the two kinds of information to produce the scene classification results. Experiments on three benchmark data sets show that our method achieves better results than most popular scene classification methods presented in recent years.
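
To make the notions of visual word pairs and visual word groups concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes that each grid patch has already been assigned a visual word index, that "neighboring" means 4-adjacency on the patch grid, and that a group is a 4-connected region of identical words (mirroring connected components in pixel space). The function names `word_pairs` and `word_groups` and the toy grid are hypothetical, introduced only for illustration.

```python
from collections import Counter, deque


def word_pairs(grid):
    """Count unordered pairs of different visual words in 4-adjacent patches.

    Assumption: 'neighboring' is taken to mean horizontally or vertically
    adjacent patches; the abstract does not fix the neighborhood.
    """
    rows, cols = len(grid), len(grid[0])
    pairs = Counter()
    for r in range(rows):
        for c in range(cols):
            # Look right and down only, so each adjacency is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and grid[r][c] != grid[rr][cc]:
                    pairs[tuple(sorted((grid[r][c], grid[rr][cc])))] += 1
    return pairs


def word_groups(grid):
    """Return (word, size) for each 4-connected region of identical words.

    Assumption: a 'visual word group' is modeled as a connected component
    in patch space, found here by breadth-first search.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    groups = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            word, size = grid[r][c], 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == word):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            groups.append((word, size))
    return groups


# Toy 3x3 patch grid of visual word indices (hypothetical data).
grid = [
    [0, 0, 1],
    [0, 2, 1],
    [3, 2, 2],
]
pairs = word_pairs(grid)
groups = word_groups(grid)
```

On this toy grid, the pair (0, 2) occurs twice and word 0 forms a single group of three patches. Histograms of such pair counts and group sizes could then serve as local context features, although the exact feature construction here is an assumption.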