Retail product recognition systems today mostly rely on traditional two-stage computer vision methods: hand-crafted features are extracted first, and a classification algorithm then distinguishes the products. Since deep learning methods have achieved state-of-the-art results on many tasks and offer unified pipelines, applying deep models to product recognition is promising. In this paper, we build a new lightweight CNN architecture, named ProNet, for this task. The 27-layer ProNet combines the advantages of ResNet and MobileNet: depth-wise separable convolution and residual connections are the two main operations in its design. Depth-wise separable convolution cuts down the computational cost, while residual connections help the network learn better feature representations and converge to a better solution during training. Compared with other commonly used CNN architectures, ProNet is computationally efficient yet still performs well on several public datasets. We first evaluate ProNet on the ImageNet dataset, obtaining a top-1 average accuracy of 70.8%. We then fine-tune this base model via transfer learning on the public ALOI dataset and on GroOpt, our own task-specific retail product dataset, reaching average accuracies of 98% on ALOI and 96% on GroOpt, both much higher than those of traditional SIFT-based methods. These results show that ProNet is an accurate model. To make ProNet transferable to other environments, we apply two strategies: (1) a white balance augmentation algorithm that randomly changes the RGB ratio of every image, and (2) an additional linear classifier on the top feature maps to help distinguish very similar samples. Training on the augmented set with the modified model yields ProNetV2, an improved version that reaches 99% accuracy on both ALOI and GroOpt.
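The white balance augmentation mentioned above can be sketched as follows. This is a minimal illustration of randomly rescaling the RGB ratio of an image; the function name and the gain range are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def white_balance_augment(image, low=0.8, high=1.2, rng=None):
    """Randomly scale each RGB channel to simulate different illuminants.

    `low`/`high` bound the per-channel gains (assumed values, not from
    the paper). `image` is an H x W x 3 uint8 array.
    """
    rng = np.random.default_rng() if rng is None else rng
    gains = rng.uniform(low, high, size=3)      # one random gain per channel
    out = image.astype(np.float32) * gains      # broadcast over H x W x 3
    return np.clip(out, 0, 255).astype(np.uint8)

# usage on a dummy 4x4 gray image
img = np.full((4, 4, 3), 128, dtype=np.uint8)
aug = white_balance_augment(img, rng=np.random.default_rng(0))
```

Applying this per training image exposes the classifier to shifted color casts, which is what makes the model less sensitive to the lighting of a new environment.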
We have also deployed the ProNetV2 model on a smartphone with 2 GB of RAM and tested it under different conditions, including varying illumination and backgrounds. It reaches an average accuracy of 96% with a processing time of 0.1 s per image. These results demonstrate the effectiveness and practicality of our proposed networks.
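As a back-of-the-envelope illustration of why the depth-wise separable convolution used in ProNet suits such resource-constrained devices, the snippet below compares parameter counts for one layer. The channel and kernel sizes are assumed for illustration; ProNet's actual layer widths are not specified here.

```python
# Parameter counts for one conv layer with assumed sizes:
# 64 input channels, 128 output channels, 3x3 kernel.
c_in, c_out, k = 64, 128, 3

standard = c_in * c_out * k * k   # dense 3x3 convolution over all channels
depthwise = c_in * k * k          # one 3x3 filter per input channel
pointwise = c_in * c_out          # 1x1 convolution projecting to c_out
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 1))
```

For these sizes the separable factorization uses roughly 8x fewer parameters (and proportionally fewer multiply-adds), which is the main source of ProNet's efficiency on mobile hardware.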