Fine-grained visual classification (FGVC) remains difficult, in part because low-level features are under-utilized. This paper proposes MBNet, a real-time method based on a multi-stream, multi-scale cross bilinear CNN, to address this problem. First, features are extracted from each layer of the multi-stream CNN using a backbone network such as VGGNet; multi-stream cross bilinear vectors and bottom bilinear vectors are then computed from the low- and high-level features, respectively. FGVC predictions are made after feature fusion, which mitigates the tendency to overlook small, low-level details in the original image. On the widely used Caltech-UCSD Birds, Stanford Cars, and Aircraft datasets, the proposed method reaches accuracies of 88.51%, 94.73%, and 92.41%, respectively, a significant improvement over existing methods and a state-of-the-art level. It also meets the requirements of real-time tasks.
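To make the bilinear step concrete, the following is a minimal NumPy sketch of generic bilinear pooling between two feature maps, followed by fusion through concatenation. It is not the MBNet implementation. The function names `bilinear_vector` and `fuse`, the channel counts, the signed square-root and L2 normalization, the choice of concatenation as the fusion operator, and the reading of the "bottom" vector as a low-level map pooled with itself are all assumptions made for illustration. The sketch also assumes the two maps have already been brought to the same spatial size.

```python
import numpy as np


def bilinear_vector(feat_a, feat_b, eps=1e-12):
    """Bilinear-pool two feature maps of shape (C, H, W) into one vector.

    Hypothetical helper: a generic bilinear pooling step, not the
    paper's exact formulation.
    """
    ca, h, w = feat_a.shape
    cb = feat_b.shape[0]
    if feat_b.shape[1:] != (h, w):
        raise ValueError("feature maps must share the same spatial size")

    # Flatten spatial positions: (C, H*W).
    a = feat_a.reshape(ca, h * w)
    b = feat_b.reshape(cb, h * w)

    # Average the outer products over all spatial positions: (ca, cb).
    pooled = a @ b.T / (h * w)

    # Assumed normalization: signed square root, then L2.
    v = pooled.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)


def fuse(vectors):
    """Assumed fusion operator: plain concatenation of the vectors."""
    return np.concatenate(vectors)


# Random stand-ins for a low-level and a high-level feature map.
rng = np.random.default_rng(0)
low = rng.standard_normal((8, 7, 7))
high = rng.standard_normal((16, 7, 7))

cross = bilinear_vector(low, high)   # cross-layer vector, length 8 * 16
bottom = bilinear_vector(low, low)   # low-level map pooled with itself, length 8 * 8
fused = fuse([cross, bottom])        # input to a classifier, length 192
```

In a full model the fused vector would feed a classification layer. Pooling a low-level map with a high-level one is what lets fine details interact with semantic features in the resulting descriptor.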