The early-vision principle of redundancy reduction of $10^8$ sensor excitations can be understood from the computer-vision viewpoint as a drive toward sparse edge maps. Only recently, however, has it been derived from a truly unsupervised learning paradigm of artificial neural networks (ANN). In fact, the Hubel-Wiesel edge maps of biological vision are reproduced by seeking the underlying independent components (independent component analysis, ICA) among $10^2$ image samples, maximizing the ANN output entropy through $\partial H(V)/\partial [W] = \partial [W]/\partial t$. When a pair of newborn eyes or ears meets the bustling and hustling world without supervision, they seek ICA by comparing two sensory measurements, $X(t) \equiv (x_1(t), x_2(t))^T$. Assuming a linear and instantaneous mixture model of the external world, $X(t) = [A]\,S(t)$, where both the mixing matrix $[A] \equiv [a_1, a_2]$ of ICA vectors and the source percentages $S(t) \equiv (s_1(t), s_2(t))^T$ are unknown, we seek independent sources satisfying $\langle S(t)\,S^T(t) \rangle \approx [I]$, where the approximate sign indicates that higher-order statistics (HOS) may not be trivial. Without a teacher, the ANN weight matrix $[W] \equiv [w_1, w_2]$ adjusts the outputs $V(t) = \tanh([W]X(t)) \approx [W]X(t)$ until no desired outputs remain except the (Gaussian) 'garbage' (neither YES '1' nor NO '-1', but in the linear 'maybe' range near the origin '0'), defined by the Gaussian covariance $\langle V(t)\,V(t)^T \rangle_G = [I] = [W][A]\,\langle S(t)\,S^T(t) \rangle\,[A]^T[W]^T$. Thus the ANN obtains $[W][A] \approx [I]$ without an explicit teacher and discovers the internal knowledge representation $[W]$ as the inverse of the external-world matrix, $[A]^{-1}$. To unify the ICA, PCA, ANN, and HOS theories developed since 1991 (advanced by Jutten & Herault, Comon, Oja, Bell-Sejnowski, Amari-Cichocki, and Cardoso), the Lyapunov function $L(v_1,\ldots,v_n, w_1,\ldots,w_n) = E(v_1,\ldots,v_n) - H(w_1,\ldots,w_n)$ is constructed as the Helmholtz free energy to prove the convergence of both supervised energy ($E$) learning and unsupervised entropy ($H$) learning.
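The two-sensor de-mixing described above can be illustrated with a minimal sketch. It assumes Laplacian (super-Gaussian, speech-like) sources, an arbitrary mixing matrix, and a batch natural-gradient form of the entropy-ascent rule for tanh outputs; the function name `infomax_ica`, the learning rate, the iteration count, and the sample size are all hypothetical choices made for illustration, not values taken from the paper.

```python
import numpy as np

def infomax_ica(X, lr=0.05, n_iter=2000):
    """Estimate a de-mixing matrix W from zero-mean mixtures X (channels x samples).

    Batch natural-gradient ascent on the entropy of V = tanh(W X):
        dW = lr * (I - 2 <V U^T>) W,  with U = W X.
    The factor 2 comes from d/du log(1 - tanh(u)^2) = -2 tanh(u).
    """
    n_ch, n_samp = X.shape
    W = np.eye(n_ch)
    identity = np.eye(n_ch)
    for _ in range(n_iter):
        U = W @ X            # linear outputs [W]X(t)
        V = np.tanh(U)       # squashed outputs V(t) = tanh([W]X(t))
        W = W + lr * (identity - 2.0 * (V @ U.T) / n_samp) @ W
    return W

# Illustrative data (assumed, not from the paper): two Laplacian sources.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 20000))             # unknown sources S(t)
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix [A]
X = A @ S                                    # observed mixtures X(t) = [A]S(t)
X = X - X.mean(axis=1, keepdims=True)        # remove the sample mean

W = infomax_ica(X)
G = W @ A   # should be close to a scaled permutation matrix if de-mixing worked
print(np.round(G, 2))
```

Because ICA recovers sources only up to scale and ordering, the product `W @ A` is expected to approach a scaled permutation matrix rather than the identity itself. The tanh rule used here is suited to super-Gaussian sources; sub-Gaussian sources would need a different nonlinearity.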
Consequently, rather than relying on the faithful but dumb computer ('GARBAGE-IN, GARBAGE-OUT'), the smarter neurocomputer will be equipped with unsupervised learning that operates as 'RAW INFO-IN, (until) GARBAGE-OUT' for sensory knowledge acquisition, enhancing machine IQ. We must go beyond the LMS error energy and apply HOS to ANN. We begin with autoregression (AR), which extrapolates from the past $X(t)$ to the future $u_i(t+1) = w_i^T X(t)$ by varying the weight vector to minimize the LMS error energy $E = \langle [x(t+1) - u_i(t+1)]^2 \rangle$; at the fixed point $\partial E/\partial w_i = 0$, this results in an exact Toeplitz matrix inversion under a stationary-covariance assumption. We generalize AR with a nonlinear output $v_i(t+1) = \tanh(w_i^T X(t))$ within $E = \langle [x(t+1) - v_i(t+1)]^2 \rangle$ and the gradient descent $\partial E/\partial w_i = -\partial w_i/\partial t$. Further generalization is possible because a specific image or speech signal has a specific histogram whose gray-scale statistics depart from those of a Gaussian random variable and can be measured by the fourth-order cumulant, the kurtosis $K(v_i) = \langle v_i^4 \rangle - 3\langle v_i^2 \rangle^2$ ($K \ge 0$, super-Gaussian, for speech; $K \le 0$, sub-Gaussian, for images). Thus the stationary value reached through $\partial K/\partial w_i = \pm 4\,\partial w_i/\partial t$ can de-mix unknown mixtures of noisy images or speech without a teacher. These stationary statistics may be implemented in parallel using the factorized pdf code $\rho(v_1, v_2) = \rho(v_1)\,\rho(v_2)$, which arises in a maximum-entropy algorithm improved by the natural gradient of Amari. Real-world applications are given in Part II (Wavelet Appl-VI, SPIE Proc. Vol. 3723), such as remote-sensing subpixel composition, speech segmentation by means of ICA de-hyphenation, and cable TV bandwidth enhancement by simultaneously mixing sport and movie entertainment events.
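Two of the quantities named above, the linear AR fixed point and the kurtosis, can be sketched numerically. The example below is only an illustration under stated assumptions: the helper names `fit_ar` and `kurtosis`, the AR(2) coefficients, and the test distributions are hypothetical, and the AR weights are obtained by solving the Toeplitz normal equations directly rather than by any specific method from the paper.

```python
import numpy as np

def kurtosis(v):
    """Fourth-order cumulant K(v) = <v^4> - 3 <v^2>^2 of a mean-removed signal."""
    v = np.asarray(v, dtype=float)
    v = v - v.mean()
    return np.mean(v**4) - 3.0 * np.mean(v**2) ** 2

def fit_ar(x, order):
    """Linear AR weights from the fixed point dE/dw = 0 (Toeplitz normal equations).

    Returns w such that x(t+1) is predicted by w[0]*x(t) + w[1]*x(t-1) + ...
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased sample autocovariance c[k] = <x(t) x(t+k)>, k = 0..order
    c = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Stationarity makes the covariance matrix Toeplitz: R[i, j] = c[|i - j|]
    R = np.array([[c[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, c[1:])

rng = np.random.default_rng(0)

# Assumed AR(2) test signal: x(t) = 0.6 x(t-1) - 0.3 x(t-2) + noise
n = 20000
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + noise[t]
w_hat = fit_ar(x, order=2)
print("estimated AR weights:", np.round(w_hat, 2))

# Sign of the kurtosis separates super-Gaussian from sub-Gaussian histograms
k_super = kurtosis(rng.laplace(size=n))           # heavy-tailed, speech-like
k_sub = kurtosis(rng.uniform(-1.0, 1.0, size=n))  # flat histogram
print("K(Laplacian) > 0:", k_super > 0, " K(uniform) < 0:", k_sub < 0)
```

The Laplacian and uniform samples stand in for the super-Gaussian and sub-Gaussian cases only; real speech and image histograms would be measured from data.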