From its inception, the iWarp microprocessor was designed for parallel computing. The iWarp processor (cell) comprises a computation agent and a communication agent, which operate independently. Both synchronous and asynchronous communication are supported. A processor may have up to 8 asynchronous data movement operations proceeding simultaneously through the spooling (DMA) mechanism. Independently, the computation agent may synchronously move up to 4 words (2 reads, 2 writes) between the cell and the interconnecting pathways in each instruction. The pathways are a set of 4 physical interconnects, each providing 40 MBytes/sec of bidirectional communication to adjacent cells (320 MBytes/sec/cell). Local memory bandwidth for the computation agent is 160 MBytes/sec. Communication overhead is critical to efficient parallel computing. iWarp's communication is based on connections, which establish the desired topology once; the connections remain in place until removed by the program. Multiple (logical) connections may share a single pathway. Communication over established connections incurs little or no overhead. We discuss the software tools that help build efficient programs at both the single-cell level (C, F77 compilers) and the array level (Apply/Adapt, Assign, C*, etc.). Using the communication features of iWarp, we present measured performance on some frequently used data movement operations (scatter, gather, broadcast, transpose).
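
The per-cell bandwidth figure quoted above can be reproduced with a short calculation. This is only a sketch: it assumes "bidirectional" means 40 MBytes/sec in each of the two directions of a pathway, and the function and constant names are illustrative, not part of any iWarp tool.

```python
# Sketch of the bandwidth arithmetic in the abstract.
# Assumption: each pathway carries 40 MBytes/sec in each direction.

PATHWAYS_PER_CELL = 4               # physical interconnects per cell
DIRECTIONS_PER_PATHWAY = 2          # assumed: one inbound, one outbound
MBYTES_PER_SEC_PER_DIRECTION = 40   # quoted rate per pathway
LOCAL_MEMORY_MBYTES_PER_SEC = 160   # quoted local memory bandwidth


def aggregate_pathway_bandwidth(
    pathways: int = PATHWAYS_PER_CELL,
    directions: int = DIRECTIONS_PER_PATHWAY,
    rate: int = MBYTES_PER_SEC_PER_DIRECTION,
) -> int:
    """Return the total pathway bandwidth of one cell in MBytes/sec."""
    return pathways * directions * rate


total = aggregate_pathway_bandwidth()
# Ratio of aggregate pathway bandwidth to local memory bandwidth.
ratio = total / LOCAL_MEMORY_MBYTES_PER_SEC

print(f"aggregate pathway bandwidth: {total} MBytes/sec/cell")
print(f"pathway-to-memory bandwidth ratio: {ratio}")
```

Under this reading, the aggregate pathway bandwidth is twice the local memory bandwidth available to the computation agent.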