Typical surveillance systems employ decision- or feature-level fusion approaches to integrate heterogeneous sensor data, which are sub-optimal and incur information loss. In this paper, we investigate data-level heterogeneous sensor fusion. Since the sensors monitor the common targets of interest, whose states can be determined by only a few parameters, it is reasonable to assume that the measurement domain has a low intrinsic dimensionality. For heterogeneous sensor data, we develop a joint-sparse data-level fusion (JSDLF) approach based on the emerging joint sparse signal recovery techniques by discretizing the target state space. This approach is applied to fuse signals from multiple distributed radio frequency (RF) signal sensors and a video camera for joint target detection and state estimation. The JSDLF approach is data-driven and requires minimum prior information, since there is no need to know the time-varying RF signal amplitudes, or the image intensity of the targets. It can handle non-linearity in the sensor data due to state space discretization and the use of frequency/pixel selection matrices. Furthermore, for a multi-target case with J targets, the JSDLF approach only requires discretization in a single-target state space, instead of discretization in a J-target state space, as in the case of the generalized likelihood ratio test (GLRT) or the maximum likelihood estimator (MLE). Numerical examples are provided to demonstrate that the proposed JSDLF approach achieves excellent performance with near real-time accurate target position and velocity estimates.