Li and Allebach recently proposed parameter-trainable tone-dependent error diffusion (TDED), which yields outstanding halftone quality among error-diffusion-based algorithms. In TDED, the tone-dependent weights and thresholds, as well as a halftone bitmap for threshold modulation, are implemented as look-up tables (LUTs), which consume on-chip memory. In addition, the diffused errors must be buffered in on-chip memory and, in most cases, transferred to off-chip memory. Off-chip memory access, however, considerably degrades system performance. In this paper, we propose two approaches to improve memory efficiency. First, we use deterministic bit flipping to replace threshold modulation, and we linearize the weights and thresholds of TDED. This replaces the full LUTs with only a few constants, reducing the memory requirement, and generates halftones whose quality is nearly indistinguishable from that of standard TDED. Second, we propose a block-based processing strategy that significantly reduces off-chip memory access. We devise a novel scan path that enables our algorithm to process any input image block by block without yielding block-boundary artifacts. Special filters are designed and optimized for the block diagonals so that the resulting halftone quality is comparable to that of standard TDED.
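To make the memory argument concrete, the following is a minimal sketch of error diffusion in which each weight and the threshold is a linear function of the input tone, so only a few constants are stored instead of full LUTs. It is an illustration under stated assumptions, not the method of the paper: the constants in `CONSTS`, the helper names `linear` and `halftone`, the four Floyd-Steinberg neighbor positions, and the plain raster scan are all hypothetical choices, and neither deterministic bit flipping nor the block-based scan path is modeled.

```python
# Hypothetical (intercept, slope) pairs; illustrative values, not from the paper.
# The weight slopes sum to zero, so the four weights sum to 1 for every tone.
CONSTS = {
    "right": (7 / 16, 0.1),
    "down_left": (3 / 16, 0.0),
    "down": (5 / 16, -0.1),
    "down_right": (1 / 16, 0.0),
    "threshold": (0.45, 0.1),
}


def linear(name, tone):
    """Evaluate a linearized tone-dependent parameter: intercept + slope * tone."""
    intercept, slope = CONSTS[name]
    return intercept + slope * tone


def halftone(image):
    """Binarize a grayscale image (rows of floats in [0, 1]) by error diffusion.

    Only two row-sized error buffers are kept. Each buffer is padded by one
    cell on both sides so that diffusion at the image edges needs no bounds
    checks; error written into the padding is simply discarded.
    """
    height, width = len(image), len(image[0])
    cur_err = [0.0] * (width + 2)
    out = []
    for y in range(height):
        next_err = [0.0] * (width + 2)
        row = []
        for x in range(width):
            tone = image[y][x]
            value = tone + cur_err[x + 1]  # input plus error diffused so far
            bit = 1 if value >= linear("threshold", tone) else 0
            err = value - bit  # quantization error to distribute
            cur_err[x + 2] += err * linear("right", tone)
            next_err[x] += err * linear("down_left", tone)
            next_err[x + 1] += err * linear("down", tone)
            next_err[x + 2] += err * linear("down_right", tone)
            row.append(bit)
        out.append(row)
        cur_err = next_err
    return out
```

In this sketch the parameter storage is ten numbers regardless of bit depth, whereas a LUT-based design would need one entry per tone level for each weight and for the threshold. The row-sized error buffers are the part that a block-based strategy would aim to keep on chip.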