Automatic detection and tracking of maritime targets in imagery can greatly increase situational awareness on naval vessels. Various methods for detection and tracking have been proposed, both reasoning-based and learning-based. Learning approaches promise to outperform reasoning approaches. They typically detect targets in a single frame, followed by a tracking step to follow targets over time. However, such approaches are sub-optimal for detecting small or distant objects, because these are hard to distinguish in single frames. We propose a new spatiotemporal learning approach that detects targets directly from a series of frames. The method is based on a deep-learning segmentation model, applied here to temporal input data. This way, targets are detected based not only on their appearance in a single frame, but also on their movement over time. Detection thereby becomes more similar to how the human eye performs it: by focusing on structures that move differently from their surroundings. The performance of the proposed method is compared to both ground-truth detections and the detections of a contrast-based detector that operates per frame. We investigate performance on a variety of infrared video datasets, recorded with static and moving cameras and containing different types of targets and scenes. We show that spatiotemporal detection achieves similar or slightly better performance on small-object detection than the state-of-the-art frame-wise detection method, while generalizing better, requiring fewer adjustable parameters, and reducing clutter more effectively.
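As a rough illustration of the core idea (not the authors' implementation), the spatiotemporal approach feeds a stack of consecutive frames to a segmentation network, so motion cues become part of the input rather than being recovered by a separate tracking step. In the sketch below the frame count, image size, and the model itself are placeholder assumptions:

```python
import numpy as np

T, H, W = 8, 128, 160  # assumed: 8 consecutive infrared frames of 128x160 pixels
frames = np.random.rand(T, H, W).astype(np.float32)  # stand-in for a video clip

# A frame-wise detector sees a single (1, H, W) image per inference.
single_frame_input = frames[0][np.newaxis, np.newaxis, ...]  # (1, 1, H, W)

# The spatiotemporal approach instead treats the T frames as input channels,
# so the segmentation network can respond to structures that move differently
# from their surroundings (e.g. a small target against sea clutter).
clip_input = frames[np.newaxis, ...]  # (1, T, H, W): batch of one clip

assert single_frame_input.shape == (1, 1, H, W)
assert clip_input.shape == (1, T, H, W)

# Hypothetical model call; any encoder-decoder segmentation net would fit:
# mask = model(clip_input)  # -> (1, 1, H, W) per-pixel target probability
```

The only structural change relative to frame-wise detection is the input tensor: the temporal dimension is presented as channels, leaving the segmentation architecture itself otherwise unchanged.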