In recent years, the Time-Correlated Single Photon Counting (TCSPC) technique has gained a prominent role in many fields where the analysis of extremely fast and faint luminous signals is required. In the life sciences, for instance, the estimation of fluorescence time constants with picosecond accuracy has led to deeper insight into many biological processes. Despite the many advantages of TCSPC-based techniques, their intrinsically repetitive nature leads to a relatively long acquisition time, especially when time-resolved images are obtained with a single detector combined with a point-scanning system. In the last decade, TCSPC acquisition systems have evolved rapidly toward the parallelization of many independent channels in order to speed up the measurement. On the one hand, some high-performance multi-module systems are already commercially available, but the large area and high power consumption of each module have limited the number of channels to only a few. On the other hand, many compact systems based on Single Photon Avalanche Diodes (SPADs) have been proposed in the literature, featuring thousands of independent acquisition chains on a single chip. The integration of both detectors and conversion electronics in the same pixel area, though, has imposed tight constraints on the power dissipation and area occupation of the electronics, resulting in a tradeoff with performance in terms of both differential nonlinearity and timing jitter. Furthermore, in the ideal case of simultaneous readout of a very large number of channels, the overall data rate can be as high as 100 Gbit/s, which is currently too high to be easily processed in real time by a PC.
Commonly adopted solutions involve an arbitrary dwell time followed by a sequential readout of the converters, which limits the maximum operating frequency of each channel and impairs the measurement speed; the latter still lies well below the limit imposed by the saturation of the transfer rate toward the processing unit. We developed a novel readout architecture that starts from a completely different perspective: given the maximum data rate that a PC can manage, a limited set of conversion data is selected and transferred to the processing unit during each excitation period, in order to take full advantage of the bandwidth of the bus toward the PC. In particular, we introduce a smart routing logic able to dynamically connect a large number of SPAD detectors to a limited set of high-performance external acquisition chains. This paves the way for a more efficient use of resources and allows us to effectively break the tradeoff between integration and performance that affects the solutions proposed so far. The routing electronics feature a pixelated architecture, while 3D-stacking techniques are exploited to connect each SPAD to its dedicated electronics, minimizing the overall number of interconnections crossing the integrated system, which is one of the main issues in high-density arrays.
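The routing idea described above can be illustrated with a minimal behavioral sketch in Python. It is not the circuit presented in this work: the selection policy (lowest pixel index wins), the function names `route_events` and `simulate`, and all numerical values (1024 pixels, 8 acquisition chains, a 0.5% detection probability per pixel per excitation period) are assumptions chosen only for illustration. The sketch shows how, in each excitation period, the pixels that detected a photon compete for a limited number of shared acquisition chains, and how many events are routed or discarded.

```python
import random


def route_events(fired_pixels, num_chains):
    """Assign the pixels that fired in one excitation period to acquisition chains.

    Assumption (not from the source): a fixed priority, lowest pixel index first.
    Returns (routed, dropped), where routed maps chain index -> pixel index and
    dropped lists the pixels that could not be served in this period.
    """
    ordered = sorted(fired_pixels)
    routed = {chain: pixel for chain, pixel in enumerate(ordered[:num_chains])}
    dropped = ordered[num_chains:]
    return routed, dropped


def simulate(num_pixels, num_chains, p_detect, periods, seed=0):
    """Count routed and dropped events over many excitation periods.

    Each pixel fires independently with probability p_detect per period
    (a hypothetical, simplified photon-detection model).
    """
    rng = random.Random(seed)
    routed_total = 0
    dropped_total = 0
    for _ in range(periods):
        fired = [p for p in range(num_pixels) if rng.random() < p_detect]
        routed, dropped = route_events(fired, num_chains)
        routed_total += len(routed)
        dropped_total += len(dropped)
    return routed_total, dropped_total


if __name__ == "__main__":
    # Illustrative values only; none of them are taken from the source.
    routed, dropped = simulate(num_pixels=1024, num_chains=8,
                               p_detect=0.005, periods=10_000)
    total = routed + dropped
    print(f"routed events:  {routed}")
    print(f"dropped events: {dropped}")
    if total:
        print(f"fraction served: {routed / total:.3f}")
```

In this toy model the number of chains plays the role of the bandwidth budget toward the PC: at most `num_chains` conversions are produced per excitation period, regardless of how many pixels the array contains. A fixed-priority policy favors low-index pixels, so a real design would likely need a fairer arbitration scheme; that choice is left open here.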