This paper investigates an RF-based system for automatic American Sign Language (ASL) recognition. We apply radar to ASL recognition through joint spatio-temporal preprocessing of the radar returns, using time-frequency (TF) analysis and high-resolution receive beamforming. The additional degrees of freedom offered by joint temporal and spatial processing with a multi-antenna sensor can help recognize ASL conversations between two or more individuals. To this end, beamforming is applied to collect spatial images in an attempt to resolve individuals who communicate at the same time through hand and arm movements. The spatio-temporal images are fused and classified by a convolutional neural network (CNN), which is capable of discerning signs performed by different individuals even when the beamformer is unable to separate the respective signs completely. The focus group comprises individuals with varying levels of expertise in sign language, and real-time measurements at 77 GHz are performed using a Texas Instruments (TI) cascade radar.
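As an illustration of the joint spatial and temporal processing described above, the following minimal sketch steers a receive beam toward one of two simulated signers and then computes a time-frequency image of the beamformed output. It is not the paper's implementation. Several things are assumptions made for illustration only: the half-wavelength uniform linear array, the conventional delay-and-sum weights (used here in place of a high-resolution beamformer), the constant-Doppler point targets standing in for hand and arm motion, and all function names and parameter values.

```python
import cmath
import math


def steering_vector(theta_deg, n_antennas):
    """Steering vector of an assumed half-wavelength-spaced uniform linear array."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * phase * n) for n in range(n_antennas)]


def simulate_returns(targets, n_antennas, n_samples):
    """Simulate slow-time returns x[antenna][sample].

    targets: list of (angle_deg, doppler_in_cycles_per_sample) pairs,
    each a hypothetical point scatterer with constant Doppler.
    """
    x = [[0j] * n_samples for _ in range(n_antennas)]
    for angle, fd in targets:
        a = steering_vector(angle, n_antennas)
        for t in range(n_samples):
            s = cmath.exp(2j * math.pi * fd * t)
            for n in range(n_antennas):
                x[n][t] += a[n] * s
    return x


def beamform(x, theta_deg):
    """Delay-and-sum receive beamforming toward theta_deg."""
    n_antennas, n_samples = len(x), len(x[0])
    a = steering_vector(theta_deg, n_antennas)
    return [
        sum(a[n].conjugate() * x[n][t] for n in range(n_antennas)) / n_antennas
        for t in range(n_samples)
    ]


def spectrogram(y, win=32, hop=16):
    """Magnitude short-time Fourier transform with a Hann window: frames[k][f]."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * i / win) for i in range(win)]
    frames = []
    for start in range(0, len(y) - win + 1, hop):
        seg = [y[start + i] * w[i] for i in range(win)]
        frames.append([
            abs(sum(seg[i] * cmath.exp(-2j * math.pi * f * i / win)
                    for i in range(win)))
            for f in range(win)
        ])
    return frames


def dominant_bin(frames):
    """Index of the Doppler bin with the largest energy summed over time."""
    n_bins = len(frames[0])
    energy = [sum(frame[f] for frame in frames) for f in range(n_bins)]
    return max(range(n_bins), key=energy.__getitem__)


# Two hypothetical signers at -30 and +20 degrees with different Doppler shifts.
targets = [(-30.0, 4 / 32), (20.0, 10 / 32)]
x = simulate_returns(targets, n_antennas=8, n_samples=128)

# One time-frequency image per steering direction; in a full pipeline these
# images would be fused and passed to a classifier.
tf_left = spectrogram(beamform(x, -30.0))
tf_right = spectrogram(beamform(x, 20.0))
```

With only eight elements, the beam steered at one signer still passes some energy from the other, so each time-frequency image contains a weak residual of the second target. This mirrors the situation the abstract mentions, in which the classifier must cope with signs that the beamformer does not separate completely.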