Matrix Manual: Stochastic Proofs

Matrix Reference Manual
Proofs Section 5: Stochastic Matrices

5.1	N(x ; a, A) N(x ; b, B) = N(a ; b , A+B) × N(x ; c, C) where C = (A^-1+B^-1)^-1 = A(A+B)^-1B = B(A+B)^-1A and c = C(A^-1a+B^-1b) = A(A+B)^-1b + B(A+B)^-1a
We assume without comment below that A, B, C and their sums and inverses are all symmetric matrices, and we use N(x ; a, A) = det(2 pi A)^-½ exp(-½(x-a)^TA^-1(x-a)).
First we define C = (A^-1+B^-1)^-1 = A(BA^-1A+BB^-1A)^-1B = A(B+A)^-1B = A(A+B)^-1B = (similarly) B(A+B)^-1A.
We also define c = C(A^-1a+B^-1b) = B(A+B)^-1A A^-1a + A(A+B)^-1B B^-1b = A(A+B)^-1b + B(A+B)^-1a
We see that c^TC^-1c = (a^TA^-1+b^TB^-1)CC^-1C(A^-1a+B^-1b)
= (a^TA^-1+b^TB^-1)C(A^-1a+B^-1b)
= a^TA^-1CA^-1a + 2a^TA^-1CB^-1b + b^TB^-1CB^-1b
= a^T(A+B)^-1BA^-1a + 2a^TA^-1A(A+B)^-1BB^-1b + b^T(A+B)^-1AB^-1b [substituting for C]
= a^T(A+B)^-1((A+B)A^-1-I)a + 2a^T(A+B)^-1b + b^T(A+B)^-1((A+B)B^-1-I)b
= a^T(A^-1-(A+B)^-1)a + 2a^T(A+B)^-1b + b^T(B^-1-(A+B)^-1)b
= a^TA^-1a + b^TB^-1b - (a^T(A+B)^-1a - 2a^T(A+B)^-1b + b^T(A+B)^-1b)
= a^TA^-1a + b^TB^-1b - (a-b)^T(A+B)^-1(a-b)
Hence (x-c)^TC^-1(x-c) = x^TC^-1x - 2x^TC^-1c + c^TC^-1c
= x^T(A^-1+B^-1)x - 2x^TC^-1C(A^-1a+B^-1b) + a^TA^-1a + b^TB^-1b - (a-b)^T(A+B)^-1(a-b)
= x^TA^-1x + x^TB^-1x - 2x^TA^-1a - 2x^TB^-1b + a^TA^-1a + b^TB^-1b - (a-b)^T(A+B)^-1(a-b)
= (x-a)^TA^-1(x-a) + (x-b)^TB^-1(x-b) - (a-b)^T(A+B)^-1(a-b)
It follows that N(a ; b , A+B) × N(x ; c, C)
= det(2 pi (A+B))^-½ exp(-½(a-b)^T(A+B)^-1(a-b)) × det(2 pi C)^-½ exp(-½(x-c)^TC^-1(x-c))
= det(4 pi² (A+B)C)^-½ exp(-½((x-c)^TC^-1(x-c) + (a-b)^T(A+B)^-1(a-b))) [multiplying determinants and adding exponents]
= det(4 pi² (A+B)A(A+B)^-1B)^-½ exp(-½((x-a)^TA^-1(x-a) + (x-b)^TB^-1(x-b)))
= det(2 pi A)^-½ exp(-½(x-a)^TA^-1(x-a)) × det(2 pi B)^-½ exp(-½(x-b)^TB^-1(x-b))
= N(x ; a, A) N(x ; b, B)
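The identity in [5.1] can be checked numerically in the scalar case. The following is only a sketch: it assumes the usual density N(x ; a, A) = (2 pi A)^-½ exp(-½(x-a)²/A) for a scalar variance A, and the helper names normal_pdf and product_params are illustrative rather than part of the manual.

```python
import math

def normal_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def product_params(a, A, b, B):
    """Return (c, C) as defined in [5.1], specialised to scalar variances."""
    C = A * B / (A + B)            # C = A(A+B)^-1 B
    c = (A * b + B * a) / (A + B)  # c = A(A+B)^-1 b + B(A+B)^-1 a
    return c, C

# Arbitrary test values (assumptions chosen for illustration only).
a, A, b, B, x = 1.0, 2.0, -0.5, 0.7, 0.3
c, C = product_params(a, A, b, B)
lhs = normal_pdf(x, a, A) * normal_pdf(x, b, B)
rhs = normal_pdf(a, b, A + B) * normal_pdf(x, c, C)
print(lhs, rhs)  # the two values should agree to rounding error
```

The scalar case exercises every step of the proof except the determinant manipulation, which reduces to ordinary multiplication here.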
5.2	N(x ; a, A)^m = N(0 ; 0, m(2×pi)^m-2A^m-1) × N(x ; a, m^-1A) We prove this by induction. For m=1, we have N(x ; a, A)¹ = N(0 ; 0, (2×pi)^-1I) × N(x ; a, A) which is true. N(x ; a, A)^m = N(x ; a, A) × N(x ; a, A)^m-1 = N(x ; a, A) × N(0 ; 0, (m-1)(2×pi)^m-3A^m-2) × N(x ; a, (m-1)^-1A) = N(0 ; 0, (m-1)(2×pi)^m-3A^m-2) × N(x ; a, A) × N(x ; a, (m-1)^-1A) = N(0 ; 0, (m-1)(2×pi)^m-3A^m-2) × N(a ; a, (1+(m-1)^-1)A) × N(x ; c, C) where from [5.1] C = (A^-1+(m-1)A^-1)^-1 = (mA^-1)^-1 = m^-1A and c = C(A^-1a+(m-1)A^-1a) = mCA^-1a = a So N(x ; a, A)^m = N(0 ; 0, (m-1)(2×pi)^m-3A^m-2) × N(a ; a, (1+(m-1)^-1)A) × N(x ; a, m^-1A) = N(0 ; 0, 2 × pi × (m-1)(2×pi)^m-3A^m-2× (1+(m-1)^-1)A) × N(x ; a, m^-1A) = N(0 ; 0, (m-1)(1+(m-1)^-1)(2×pi)^m-2A^m-1) × N(x ; a, m^-1A) = N(0 ; 0, m(2×pi)^m-2A^m-1) × N(x ; a, m^-1A)
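As a quick sanity check of [5.2], the scalar case can be evaluated directly for several values of m. This is a sketch under the assumption that N(x ; a, A) denotes the usual Gaussian density; the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gaussian_power(x, a, A, m):
    """Right-hand side of [5.2] for a scalar variance A and integer m >= 1."""
    scale_var = m * (2 * math.pi) ** (m - 2) * A ** (m - 1)
    return normal_pdf(0.0, 0.0, scale_var) * normal_pdf(x, a, A / m)

x, a, A = 0.8, -0.3, 1.9  # arbitrary illustrative values
for m in range(1, 6):
    print(m, normal_pdf(x, a, A) ** m, gaussian_power(x, a, A, m))
```

For m=1 the scale factor N(0 ; 0, (2×pi)^-1) evaluates to 1, which matches the base case of the induction.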
5.3	If x_[n] has a complex Gaussian distribution then Cov(x) = E(xx^H) - mm^H = S_[n#n] ≤> K_[2n#2n] and E(xx^T) = mm^T From the definition of a complex gaussian we have E(x_i^Rx_k^R) = ½s_i,k^R + m_i^Rm_k^R E(x_i^Ix_k^I) = ½s_i,k^R + m_i^Im_k^I E(x_i^Ix_k^R) = ½s_i,k^I + m_i^Im_k^R E(x_i^Rx_k^I) = -½s_i,k^I + m_i^Rm_k^I E(xx^H)_i,k = E(x_ix_k^C) = E((x_i^Rx_k^R + x_i^Ix_k^I) + j(x_i^Ix_k^R - x_i^Rx_k^I)) = s_i,k^R + js_i,k^I + m_im_k^C = (S + mm^H)_i,k E(xx^T)_i,k = E(x_ix_k) = E((x_i^Rx_k^R - x_i^Ix_k^I) + j(x_i^Ix_k^R + x_i^Rx_k^I)) = m_im_k = (mm^T)_i,k
5.4	If x_[n] has a complex Gaussian distribution then E(x • x^C) = diag(S) + m • m^C E(x • x^C) = E(diag(xx^H)) = diag(E(xx^H)) = diag(S + mm^H) = diag(S) + m • m^C [5.3]
5.5	If x_[n] has a complex Gaussian distribution and p = x • x^C then Cov(p) = E(pp^T) - E(p)E(p)^T = S • S^C + 2(mm^H • S^T)^R
If we define v = diag(S) and w = m • m^C, then from [5.4] E(p)E(p)^T = (v+w)(v+w)^T
Define y = x - m, which has zero mean and E(y • y^C) = diag(S) = v. Then
E((y • y^C)(y^T • y^H))_i,k = E(((y_i^R)² + (y_i^I)²)((y_k^R)² + (y_k^I)²)) = E((y_i^Ry_k^R)² + (y_i^Ry_k^I)² + (y_i^Iy_k^R)² + (y_i^Iy_k^I)²)
Now from Wick's theorem [5.18] and [5.3],
E((y_i^Ry_k^R)²) = E(y_i^Ry_i^R)E(y_k^Ry_k^R) + 2(E(y_i^Ry_k^R))² = ¼v_iv_k + ½(s_i,k^R)² = (¼vv^T + ½S^R • S^R)_i,k
Similarly E((y_i^Iy_k^I)²) = (¼vv^T + ½S^R • S^R)_i,k and E((y_i^Iy_k^R)²) = E((y_i^Ry_k^I)²) = (¼vv^T + ½S^I • S^I)_i,k
Hence E((y • y^C)(y^T • y^H))_i,k = 2(¼vv^T + ½S^R • S^R)_i,k + 2(¼vv^T + ½S^I • S^I)_i,k = (vv^T + S • S^C)_i,k
Since the expected value of any odd power of y_i is zero and, from [5.3], E(yy^T) = 0, we can omit the terms of odd order in y and write
E((x • x^C)(x^T • x^H)) = E(((y+m) • (y+m)^C)((y+m)^T • (y+m)^H))
= E((y • y^C)(y^T • y^H) + (y • y^C)(m^T • m^H) + (y • m^C)(y^T • m^H) + (y • m^C)(m^T • y^H) + (m • y^C)(y^T • m^H) + (m • y^C)(m^T • y^H) + (m • m^C)(y^T • y^H) + (m • m^C)(m^T • m^H))
= vv^T + S • S^C + vw^T + 0 + S • (mm^H)^C + S^C • (mm^H) + 0 + wv^T + ww^T
= S • S^C + 2(mm^H • S^T)^R + (v+w)(v+w)^T
Subtracting E(p)E(p)^T = (v+w)(v+w)^T gives the result.
5.6	If x_[n] has a real Gaussian distribution then E(x • x) = diag(S) + m • m = diag(S + mm^T)
E(x • x)_k = E(x_k²) = s_k,k + m_k² = (diag(S) + m • m)_k
5.7	If x_[n] has a real Gaussian distribution then Cov(x • x) = 2 S • (S + 2mm^T)
Assume that x = y + m with E(y) = 0 and Cov(y) = S. Terms of odd order in y have zero mean, so
Cov(x • x)_i,k = E(x_i²x_k²) - (s_i,i + m_i²)(s_k,k + m_k²)
= E((y_i²+2y_im_i+m_i²)(y_k²+2y_km_k+m_k²)) - (s_i,i + m_i²)(s_k,k + m_k²)
= s_i,is_k,k + 2s_i,k² + s_i,im_k² + s_k,km_i² + 4s_i,km_im_k + m_i²m_k² - (s_i,i + m_i²)(s_k,k + m_k²)
= 2s_i,k(s_i,k+2m_im_k) = (2 S • (S + 2mm^T))_i,k
5.8	If the joint distribution [x; y] ~ N([x; y]; [p; q], [P R; R^T Q]) then x | y ~ N(x; p+RQ^-1(y-q), P - RQ^-1R^T).
The functions f_i(y) below denote functions that are independent of x.
p(x | y) = p(x, y) / p(y) = f₁(y) exp( -½[x-p; y-q]^T S^-1 [x-p; y-q] ) where S = [P R; R^T Q]
From [3.5] S^-1 = [C^-1, -C^-1RQ^-1; -Q^-1R^TC^-1, Q^-1(I+R^TC^-1RQ^-1)] where C = (P-RQ^-1R^T)
Hence [x-p; y-q]^T S^-1 [x-p; y-q] = (x-p)^TC^-1(x-p) - 2(x-p)^TC^-1RQ^-1(y-q) + f₂(y)
= (x-p-RQ^-1(y-q))^TC^-1(x-p-RQ^-1(y-q)) + f₃(y)
Hence p(x | y) = f₄(y) exp( -½(x-p-RQ^-1(y-q))^TC^-1(x-p-RQ^-1(y-q)) )
= f₅(y) N(x; p+RQ^-1(y-q), C)
= f₅(y) N(x; p+RQ^-1(y-q), P - RQ^-1R^T)
where f₅(y) must equal 1 to give the correct integral.
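The conditional formula in [5.8] can be illustrated with scalar x and y, where P, R and Q are numbers. The sketch below assumes the standard bivariate Gaussian density and checks that p(x, y) / p(y) equals a Gaussian in x with mean p+RQ^-1(y-q) and variance P - RQ^-1R^T; all function names and test values are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def joint_pdf(x, y, p, q, P, R, Q):
    """Bivariate Gaussian density with mean [p; q] and covariance [P R; R Q]."""
    det = P * Q - R * R
    dx, dy = x - p, y - q
    # Quadratic form using the explicit inverse of the 2x2 covariance matrix.
    quad = (Q * dx * dx - 2 * R * dx * dy + P * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def conditional_params(y, p, q, P, R, Q):
    """Mean and variance of x given y, following [5.8] with scalar blocks."""
    return p + R / Q * (y - q), P - R * R / Q

p, q, P, R, Q = 1.0, -2.0, 2.0, 0.6, 1.5  # arbitrary valid covariance (P*Q > R*R)
x, y = 0.4, -1.1
cond_mean, cond_var = conditional_params(y, p, q, P, R, Q)
ratio = joint_pdf(x, y, p, q, P, R, Q) / normal_pdf(y, q, Q)
print(ratio, normal_pdf(x, cond_mean, cond_var))
```

Agreement of the two printed values corresponds to f₅(y) = 1 in the proof.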
5.9	If x ~ N(x; m, S) then x | A^Tx=b ~ N(x; (I-HA^T)m+Hb, (I-HA^T)S) where H=SA(A^TSA)^-1
Define z = [x; A^Tx] = [I; A^T]x. Then z has mean [m; A^Tm] and Cov(z) = E([I; A^T](x-m)(x-m)^T[I, A]) = [I; A^T]S[I, A] = [S SA; A^TS A^TSA]
From [5.8] x | A^Tx=b ~ N(x; m+SA(A^TSA)^-1(b-A^Tm), S - SA(A^TSA)^-1A^TS) = N(x; (I-HA^T)m+Hb, (I-HA^T)S)
5.10	If x ~ N(x; m, S) then y = Ax+b ~ N(y; Am+b, ASA^T)
E(y) = E(Ax+b) = A E(x) + b = Am+b
Var(y) = E(yy^T) - E(y)E(y)^T
= E(Axx^TA^T + Axb^T + bx^TA^T + bb^T) - (Amm^TA^T + Amb^T + bm^TA^T + bb^T)
= (A(S+mm^T)A^T + Amb^T + bm^TA^T + bb^T) - (Amm^TA^T + Amb^T + bm^TA^T + bb^T)
= ASA^T
5.11	If S is +ve semidefinite Hermitian, then the elements of R = DIAG(S)^-½ S DIAG(S)^-½ have magnitude ≤ 1.
|r_i,j|² = |s_i,j|² (s_i,is_j,j)^-1 ≤ 1 from [3.6]
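The normalisation in [5.11] is the usual conversion of a covariance matrix to a correlation matrix. A minimal sketch, assuming S is given as a nested list with strictly positive diagonal (the function name correlation_matrix is hypothetical):

```python
import math

def correlation_matrix(S):
    """Return R = DIAG(S)^-1/2 S DIAG(S)^-1/2 for a Hermitian matrix S.

    The diagonal of a Hermitian matrix is real, so abs() is used only to
    convert a possibly complex-typed diagonal entry to a float.
    """
    n = len(S)
    d = [1.0 / math.sqrt(abs(S[i][i])) for i in range(n)]
    return [[d[i] * S[i][j] * d[j] for j in range(n)] for i in range(n)]

# Illustrative 2x2 positive definite Hermitian matrix (determinant 6 - 2 = 4).
S = [[2.0, 1 + 1j], [1 - 1j, 3.0]]
R = correlation_matrix(S)
print([[abs(r) for r in row] for row in R])
```

Every printed magnitude is at most 1 (up to rounding), and the diagonal entries are 1.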
5.12	If the joint distribution [x; y] ~ N([x; y]; [p; q], [P R^T; R Q]) then Var(x_i - a^Ty) is minimum when a = Q^-1R_i and its minimum value is diag(P)_i - R_i^TQ^-1R_i.
This proof is adapted from [R.1]. Since variances are unaffected by a shift of mean we can assume wlog that p=0 and q=0. Now set b = a - Q^-1R_i so that a = Q^-1R_i + b, and note that for a zero-mean scalar z, Var(z) = E(zz^T). Hence for any a we have
Var(x_i - a^Ty) = Var(x_i - R_i^TQ^-1y - b^Ty)
= E((x_i - R_i^TQ^-1y - b^Ty)(x_i - R_i^TQ^-1y - b^Ty)^T)
= E(x_ix_i^T + R_i^TQ^-1yy^TQ^-1R_i + b^Tyy^Tb - 2x_iy^TQ^-1R_i - 2x_iy^Tb + 2R_i^TQ^-1yy^Tb)
= diag(P)_i + R_i^TQ^-1QQ^-1R_i + b^TQb - 2R_i^TQ^-1R_i - 2R_i^Tb + 2R_i^TQ^-1Qb
= diag(P)_i + R_i^TQ^-1R_i + b^TQb - 2R_i^TQ^-1R_i - 2R_i^Tb + 2R_i^Tb
= diag(P)_i - R_i^TQ^-1R_i + b^TQb
Only the last term depends on b and, since Q is positive semidefinite, this is minimum when b = 0.
5.13	If the joint distribution [x; y] ~ N([x; y]; [p; q], [P R^T; R Q]) then the maximum (over a) of the correlation between x_i and a^Ty is obtained when a = Q^-1R_i This proof is adapted from [R.1]. Since correlations are unaffected by a shift of mean we can assume wlog that p=0 and q=0. For any a we can define c=√(R_i^TQ^-1R_i / a^TQa). From [5.12] we know that Var(x_i - ca^Ty) ≥ Var(x_i - R_i^TQ^-1y) E(x_ix_i^T + c²a^Tyy^Ta - 2cx_iy^Ta) ≥ E(x_ix_i^T + R_i^TQ^-1yy^TQ^-1R_i - 2x_iy^TQ^-1R_i) diag(P)_i + c²a^TQa - 2cE(x_iy^Ta) ≥ diag(P)_i + R_i^TQ^-1QQ^-1R_i - 2E(x_iy^TQ^-1R_i) c²a^TQa - 2cE(x_iy^Ta) ≥ R_i^TQ^-1R_i - 2E(x_iy^TQ^-1R_i) R_i^TQ^-1R_i - 2cE(x_iy^Ta) ≥ R_i^TQ^-1R_i - 2E(x_iy^TQ^-1R_i) cE(x_iy^Ta) ≤ E(x_iy^TQ^-1R_i) E(x_iy^Ta) / √(a^TQa) ≤ E(x_iy^TQ^-1R_i) / √(R_i^TQ^-1R_i ) E(x_iy^Ta) / √( Var(x_i) × a^TQa) ≤ E(x_iy^TQ^-1R_i) / √( Var(x_i) × R_i^TQ^-1R_i ) E(x_iy^Ta) / √( Var(x_i) × Var(a^Ty)) ≤ E(x_iy^TQ^-1R_i) / √( Var(x_i) × Var(R_i^TQ^-1y)) Thus the correlation between x_i and a^Ty is bounded above by the right hand side and attains this bound when a = Q^-1R_i. The value of the correlation is given by E(x_iy^TQ^-1R_i) / √( Var(x_i) × Var(R_i^TQ^-1y)) = R_i^TQ^-1R_i / √( diag(P)_i × R_i^TQ^-1R_i) = √(R_i^TQ^-1R_i / diag(P)_i)
5.14	(g_x|y)_i² = 1 - p_ii^-1 det([p_ii , r_i^T; r_i , Q]) det(Q)^-1
1 - p_ii^-1 det([p_ii , r_i^T; r_i , Q]) det(Q)^-1 = 1 - p_ii^-1 ((p_ii - r_i^TQ^-1r_i) det(Q)) det(Q)^-1 [3.1]
= 1 - (1 - p_ii^-1r_i^TQ^-1r_i) = p_ii^-1r_i^TQ^-1r_i = (FR)_ii ÷ p_ii = (g_x|y)_i²
5.15	d/dq (ln(p(X))) = 1_[k#1]^T(X-M)^TS^-1 dm/dq - ½(k S^-1 - S^-1(X-M)(X-M)^TS^-1):^T dS/dq where M_[n#k] = m×1_[k#1]^T.
d/dq (ln(p(X))) = d/dm (ln(p(X))) dm/dq + d/dP (ln(p(X))) dP/dS dS/dq where P = S^-1
d/dm (ln(p(X))) = d/dM (ln(p(X))) dM/dm = - ½ d/dM ( tr((X-M)^T S^-1 (X-M))) dM/dm
= (S^-1(X-M)):^T dM/dm [Derivative of Trace]
= (S^-1(X-M)):^T (1_[k#1] ⊗ I) [Derivative of Linear Function]
= 1_[k#1]^T(X-M)^TS^-1 [Kronecker Product Identity]
d/dP (ln(p(X))) = ½k d/dP (ln(det(2pi×P))) - ½ d/dP (tr((X-M)^T P (X-M)))
= ½kP^-1:^T - ½ d/dP (tr((X-M)^T P (X-M))) [Derivative of Determinant]
= ½kP^-1:^T - ½ ((X-M)(X-M)^T):^T [Derivative of Quadratic]
dP/dS = -(P ⊗ P) = -(S^-1 ⊗ S^-1) [Derivative of Inverse]
d/dP (ln(p(X))) dP/dS = -½(kS:^T - ((X-M)(X-M)^T):^T)(S^-1 ⊗ S^-1)
= -½(S^-1(kS - (X-M)(X-M)^T)S^-1):^T
= -½(kS^-1 - S^-1(X-M)(X-M)^TS^-1):^T
5.16	E(vv^T) = k ((dm/dq)^T S^-1 dm/dq + ½(dS/dq)^T (S ⊗ S)^-1 dS/dq ) where v^T = 1_[k#1]^T(X-M)^TS^-1 dm/dq - ½(k S^-1:^T - ((X-M)(X-M)^T):^T (S^-1 ⊗ S^-1)) dS/dq
Since v is the sum of k independent identically distributed zero-mean vectors, we can calculate E(vv^T) for k=1 and then multiply the result by k. For k=1, v simplifies to
v^T = (x-m)^TS^-1 dm/dq + ½(S - (x-m)(x-m)^T ):^T dP/dq where P = S^-1 (see [5.15])
When we form the product vv^T, the cross terms contain only odd powers of (x-m) and therefore have zero mean. We deal with the two squared terms in vv^T separately.
For the first term
k E (((x-m)^TS^-1 dm/dq)^T ((x-m)^TS^-1 dm/dq)) = k E ((dm/dq)^T S^-1(x-m)(x-m)^TS^-1 dm/dq) = k (dm/dq)^T S^-1 dm/dq
Since the second term has zero mean, we can write
¼k E {(dP/dq)^T (S - (x-m)(x-m)^T ): (S - (x-m)(x-m)^T ):^T dP/dq} = ¼k (dP/dq)^TE { ((x-m)(x-m)^T ): ((x-m)(x-m)^T):^T }dP/dq - ¼k(dP/dq)^T S: S:^T dP/dq
We define T = TVEC(n,n) so that T S: = T^T S: = S^T: = S: since S and T are symmetrical. [see vectorized transpose]
To calculate the quartic expectation, E {((x-m)(x-m)^T ): ((x-m)(x-m)^T):^T }, we use Wick's theorem [5.18]: we take the three possible pairings of the (x-m) terms and, treating x and y as distinct random vectors, we get:
E {((x-m)(x-m)^T ): ((x-m)(x-m)^T):^T }
= E {((x-m)(x-m)^T ): ((y-m)(y-m)^T):^T } + E {((x-m)(y-m)^T ): ((x-m)(y-m)^T):^T } + E {((x-m)(y-m)^T ): ((y-m)(x-m)^T):^T }
= S:S:^T + E {((y-m) ⊗ (x-m) )((y-m) ⊗ (x-m))^T } + E {((y-m) ⊗ (x-m) )((y-m) ⊗ (x-m))^TT^T } [see Kronecker product and vectorized transpose]
= S:S:^T + E {((y-m)(y-m)^T) ⊗ ((x-m)(x-m)^T) } (I + T^T)
= S:S:^T + (S ⊗ S) (I + T^T)
Substituting this into the second term expression gives
¼k (dP/dq)^T(S:S:^T + (S ⊗ S) (I + T^T))dP/dq - ¼k(dP/dq)^T S: S:^T dP/dq
= ¼k (dP/dq)^T (S ⊗ S) (I + T^T)dP/dq [Since the first and last terms cancel]
= ½k (dP/dq)^T(S ⊗ S) dP/dq [Since the symmetry of dP means that T^T dP/dq = dP/dq]
Putting this all together and using dP/dq = dP/dS dS/dq = -(P ⊗ P) dS/dq = -(S^-1 ⊗ S^-1) dS/dq [Derivative of Inverse] we obtain
J = E {vv^T} = k (dm/dq)^T S^-1 dm/dq + ½k (dP/dq)^T(S ⊗ S)dP/dq
= k (dm/dq)^T S^-1 dm/dq + ½k (dS/dq)^T(S^-1 ⊗ S^-1)(S ⊗ S)(S^-1 ⊗ S^-1)dS/dq
= k (dm/dq)^T S^-1 dm/dq + ½k (dS/dq)^T(S^-1 ⊗ S^-1)dS/dq
5.17	If f_[r#1](X) is a function of X with mean value g(q), then Cov(f) ≥ dg/dq J^-1 (dg/dq)^T g = E (f) means that dg/dq = E(df/dq + f v^T) = E(f v^T) where v^T = d/dq (ln(p(X)). For any arbitrary vector a, we have the scalar equation a^T dg/dq J^-1 (dg/dq)^T a = E(a^T f v^TJ^-1 (dg/dq)^T a) = Cov(a^T f, a^T dg/dq J^-1v) Hence from the Cauchy-Schwarz inequality (a^T dg/dq J^-1 (dg/dq)^T a)² ≤ Var(a^T f)Var(a^T dg/dq J^-1v) = (a^TCov(f)a) (a^T dg/dq J^-1Cov(v) J^-1(dg/dq)^T a) = (a^TCov(f)a) (a^T dg/dq J^-1(dg/dq)^T a) [Since Cov(v) = J ] Hence (a^T dg/dq J^-1 (dg/dq)^T a) ≤ a^TCov(f)a for any a So Cov(f) ≥ dg/dq J^-1 (dg/dq)^T where ≥ represents the Loewner partial order.
5.18	For even n, E(prod(x_[n]-m)) = ((½n)!)^-1 2^-½n sum_v(s_v(1),v(2)s_v(3),v(4)...s_v(n-1),v(n)) where the sum is over all n! permutations v of the numbers 1:n. An equivalent formula is to omit the normalizing factor, ((½n)!)^-1 2^-½n, and to restrict the summation to all distinct pairings of the numbers 1:n.
Without loss of generality, we assume that m = 0. The characteristic function is f(t) = E(exp(jt^Tx)) = exp(-½t^TSt) where j=√(-1). Differentiating E(exp(jt^Tx)) shows that E(prod(x)) = j^-n dⁿf/dt₁dt₂...dt_n evaluated at t=0.
Expanding the power series gives f(t) = exp(-½t^TSt) = sum((-½t^TSt)^r/r!; r=0..inf). Term r in this is a homogeneous polynomial of degree 2r in the elements of t. When differentiated n times, terms with r < n/2 will become zero, whereas terms with r > n/2 will still contain powers of the elements of t and will therefore be zero when t is set to 0. It follows that when n is odd, all terms vanish and E(prod(x))=0.
When n is even, E(prod(x)) = dⁿg/dt₁dt₂...dt_n evaluated at t=0 where g(t) = j^-n (-½t^TSt)^½n/(½n)! = ½^½n(t^TSt)^½n/(½n)!
We can expand (t^TSt)^½n = t^TSt × t^TSt × ... × t^TSt and we now differentiate in turn with respect to t_n, t_n-1, ... , t₁. We note that t occurs n times in this expansion and that dt/dt_i = e_i, a column of the identity matrix. When we differentiate with respect to t_n, we obtain a sum of n terms, in each of which one of the occurrences of t has been replaced by e_n and the remaining n-1 occurrences remain as t. If we now differentiate with respect to t_n-1, each of the n terms in the previous step changes into a sum of n-1 terms in which one of the occurrences of t is now replaced by e_n-1 and the remaining n-2 occurrences remain as t. We thus have a total of n(n-1) terms. We repeat this process for t_n-2, ... , t₁ and we end up with n! terms of the form
(e_v(1))^TSe_v(2)(e_v(3))^TSe_v(4)...(e_v(n-1))^TSe_v(n) = s_v(1),v(2)s_v(3),v(4)...s_v(n-1),v(n)
where v runs through all n! permutations of 1:n.
Thus E(prod(x)) = ((½n)!)^-1 2^-½n sum_v(s_v(1),v(2)s_v(3),v(4)...s_v(n-1),v(n))
Note that each term in the summation arises (½n)! 2^½n times since the ½n factors s_ij can be rearranged in (½n)! orders and for each factor s_ij = s_ji since S is symmetric. Thus an equivalent formula is to omit the normalizing factor, ((½n)!)^-1 2^-½n, and restrict the summation to all distinct pairings of the numbers 1:n. This is known as Isserlis' theorem or Wick's theorem.
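The two equivalent forms of [5.18] can be compared directly for small n. The sketch below is illustrative only: it uses 0-based indices, represents S as a nested list, and the function names are assumptions rather than anything defined in the manual. The permutation form requires an even number of indices.

```python
from itertools import permutations
from math import factorial

def pairings(indices):
    """Yield every distinct way of splitting `indices` into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k in range(len(rest)):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, rest[k])] + tail

def moment_by_pairings(S, idx):
    """Sum over distinct pairings of the product of covariances (no normalizing factor)."""
    total = 0.0
    for pairing in pairings(list(idx)):
        term = 1.0
        for i, j in pairing:
            term *= S[i][j]
        total += term
    return total

def moment_by_permutations(S, idx):
    """Sum over all n! permutations, scaled by ((n/2)!)^-1 2^(-n/2); len(idx) must be even."""
    n = len(idx)
    total = 0.0
    for v in permutations(idx):
        term = 1.0
        for k in range(0, n, 2):
            term *= S[v[k]][v[k + 1]]
        total += term
    return total / (factorial(n // 2) * 2 ** (n // 2))

# Arbitrary symmetric matrix used purely for illustration.
S = [[2.0, 0.5, 0.3, 0.1],
     [0.5, 1.0, 0.2, 0.4],
     [0.3, 0.2, 1.5, 0.6],
     [0.1, 0.4, 0.6, 1.2]]
print(moment_by_pairings(S, (0, 1, 2, 3)), moment_by_permutations(S, (0, 1, 2, 3)))
```

For n=4 the pairing form reduces to s_1,2 s_3,4 + s_1,3 s_2,4 + s_1,4 s_2,3, and an odd number of indices admits no complete pairing, so the pairing form returns zero in that case.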
5.19	[x Real Gaussian, m=0] Y_n=<(xx^T)ⁿ>, then Y₁=S and Y_n+1= tr(Y_n)S+2nSY_n
By definition, Y₁=S. So we assume n>0. Y_n+1 consists of 2n+2 x terms multiplied together. By Wick's theorem [5.18] we can obtain Y_n+1 by considering all possible pairings of the x's and treating each pair as an independent Gaussian vector. We write Y_n+1= <x(x^Tx)ⁿx^T> and note that the central term, (x^Tx)ⁿ, is a product of scalars which may therefore be re-ordered arbitrarily. If we denote the first x by x₁, its pair can be in any of 2n+1 positions; we consider three cases:
The pair to x₁ can be the first x in one of the scalars that form (x^Tx)ⁿ; this gives n terms of the form <x₁(x₁^Tx)(x^Tx)^n-1x^T> = <x₁x₁^T(xx^T)ⁿ> = SY_n
The pair to x₁ can be the second x in one of the scalars that form (x^Tx)ⁿ; this gives n terms of the form <x₁(x^Tx₁)(x^Tx)^n-1x^T> = <x₁(x₁^Tx)(x^Tx)^n-1x^T> = <x₁x₁^T(xx^T)ⁿ> = SY_n
The pair to x₁ can be the final x; this gives a single term of the form <x₁(x^Tx)ⁿx₁^T> = <(x^Tx)ⁿx₁x₁^T> = tr(Y_n)S
Adding these 2n+1 terms gives Y_n+1= nSY_n + nSY_n + tr(Y_n)S = tr(Y_n)S+2nSY_n.
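The recursion in [5.19] is straightforward to evaluate. The sketch below uses plain nested lists for matrices (the helper names are illustrative) and compares the scalar case with the known even moments E(x^2n) = (2n-1)!! sⁿ of a zero-mean real Gaussian.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    """Sum of the diagonal elements."""
    return sum(A[i][i] for i in range(len(A)))

def even_moments(S, n_max):
    """Return [Y_1, ..., Y_n_max] with Y_1 = S and Y_n+1 = tr(Y_n) S + 2n S Y_n."""
    Y = [S]
    size = len(S)
    for n in range(1, n_max):
        SY = matmul(S, Y[-1])
        t = trace(Y[-1])
        Y.append([[t * S[i][j] + 2 * n * SY[i][j] for j in range(size)]
                  for i in range(size)])
    return Y

s = 1.7  # arbitrary scalar variance for illustration
scalar_moments = [Y[0][0] for Y in even_moments([[s]], 4)]
print(scalar_moments)  # expected to match s, 3 s^2, 15 s^3, 105 s^4

Y2 = even_moments([[2.0, 0.5], [0.5, 1.0]], 2)[1]
print(Y2)  # tr(S) S + 2 S S for a 2x2 example
```

In the scalar case the recursion collapses to Y_n+1 = (2n+1) s Y_n, which generates the double factorials.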
5.20	[x Complex Gaussian, m=0] Y_n=<(xx^H)ⁿ>, then Y₁=S and Y_n+1= tr(Y_n)S+nSY_n
By definition, Y₁=S. So we assume n>0. Y_n+1 consists of 2n+2 x terms multiplied together. By Wick's theorem [5.18] we can obtain Y_n+1 by considering all possible pairings of the x's and treating each pair as an independent complex Gaussian vector. We write Y_n+1= <x(x^Hx)ⁿx^H> and note that the central term, (x^Hx)ⁿ, is a product of scalars which may therefore be re-ordered arbitrarily. If we denote the first x by x₁, its pair can be in any of 2n+1 positions; we consider three cases:
The pair to x₁ can be the first x in one of the scalars that form (x^Hx)ⁿ; this gives n terms of the form <x₁(x₁^Hx)(x^Hx)^n-1x^H> = <x₁x₁^H(xx^H)ⁿ> = SY_n
The pair to x₁ can be the second x in one of the scalars that form (x^Hx)ⁿ; this gives n terms of the form <x₁(x^Hx₁)(x^Hx)^n-1x^H> = <x₁(x₁^Tx^C)(x^Hx)^n-1x^H> = <x₁x₁^Tx^Cx^H(xx^H)^n-1> = 0 since <x₁x₁^T> = 0 [5.3]
The pair to x₁ can be the final x; this gives a single term of the form <x₁(x^Hx)ⁿx₁^H> = <(x^Hx)ⁿx₁x₁^H> = tr(Y_n)S
Adding these 2n+1 terms gives Y_n+1= nSY_n + 0 + tr(Y_n)S = tr(Y_n)S+nSY_n.
5.21	If y ~ kN(y; 0, 1) is restricted to the domain p<y<q, then E(y) = r = (f(p)-f(q))/(F(q)-F(p)) and Var(y) = v = 1-(q f(q) - p f(p))/(F(q)-F(p)) - r² where k=1/(F(q)-F(p)), with f(x) and F(x) being the pdf and cdf of the standard Gaussian.
We first note that f(-x)=f(x), F(-x)=1-F(x), dF/dx=f(x) and df/dx = -x f(x). So with all integrals going from p to q we can write ∫ f(x)dx = F(q)-F(p), ∫ xf(x)dx = f(p)-f(q), ∫ x²f(x)dx = F(q) - F(p) + p f(p) - q f(q).
From this, E(y) = (f(p)-f(q))/(F(q)-F(p)) = r and Var(y) = 1-(q f(q) - p f(p))/(F(q)-F(p)) - r².
For the special case p=-∞, f(p)=F(p)=0 and E(y) = r = -f(q)/F(q) and Var(y) = v = 1+r (q-r).
For the special case q=+∞, f(q)=0, F(q)=1 and E(y) = r = f(p)/(1-F(p)) = f(p)/F(-p) and Var(y) = v = 1+ p f(p)/F(-p) - r².
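The closed-form moments in [5.21] can be compared against a direct numerical integration. This is a sketch for finite p and q only (the infinite special cases would need separate handling); the function names and the midpoint-rule step count are assumptions made for illustration.

```python
import math

def pdf(x):
    """Standard Gaussian pdf f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard Gaussian cdf F(x), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_moments(p, q):
    """Mean r and variance v of a standard Gaussian restricted to p < y < q."""
    mass = cdf(q) - cdf(p)
    r = (pdf(p) - pdf(q)) / mass
    v = 1.0 - (q * pdf(q) - p * pdf(p)) / mass - r * r
    return r, v

def numeric_moments(p, q, steps=20000):
    """Midpoint-rule estimate of the same mean and variance."""
    h = (q - p) / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps):
        x = p + (k + 0.5) * h
        w = pdf(x) * h
        m0 += w
        m1 += x * w
        m2 += x * x * w
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

r, v = truncated_moments(-0.5, 1.2)
r_num, v_num = numeric_moments(-0.5, 1.2)
print(r, r_num)
print(v, v_num)
```

For a symmetric interval such as -1 < y < 1 the mean r is zero, as expected from f(-x) = f(x).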
5.22	Suppose y ~ kN(y; m, S) is restricted to the domain satisfying b<a^Ty<c with k a normalizing constant. Then E(y) = m+grSa and Cov(y) = S - g²vSa(Sa)^T where g=1/√(a^TSa), p=g(b-a^Tm), q=g(c-a^Tm), r= (f(p)-f(q))/(F(q)-F(p)), v=(q f(q) - p f(p))/(F(q) - F(p)) + r² and f(q) and F(q) are the pdf and cdf respectively of a standard 1-dimensional Gaussian with f(q)=dF/dq.
We use Cholesky decomposition to find T such that S=TT^T and define w=T^-1(y-m) which implies y = Tw+m. So w ~ kN(w; 0, T^-1ST^-T) = kN(w; 0, I) within the domain satisfying b-a^Tm < a^TTw < c-a^Tm.
Now find an orthogonal Q such that gQT^Ta = e where e is the first column of the identity matrix and g = 1/√(a^TTT^Ta) = 1/√(a^TSa) > 0. Thus Q is any orthogonal matrix whose first row is ga^TT. It follows that gT^Ta = Q^Te which implies ga^TTQ^T = e^T.
Now we define z=Qw which implies w=Q^Tz. So z ~ kN(z; 0, I) within the domain defined by b-a^Tm < a^TTQ^Tz < c-a^Tm which is equivalent to g(b-a^Tm) < e^Tz < g(c-a^Tm).
Since only the first element of z is constrained and all elements are independent, we can use [5.21] to write E(z) = re and Cov(z) = I - vee^T where p=g(b-a^Tm), q=g(c-a^Tm), r=(f(p)-f(q))/(F(q)-F(p)) and v=(q f(q) - p f(p))/(F(q) - F(p)) + r².
We have y = TQ^Tz+m and so E(y) = rTQ^Te+m and Cov(y) = TQ^T(I - vee^T)QT^T = TQ^TQT^T - vTQ^Tee^TQT^T. From above we have gT^Ta = Q^Te which implies TQ^Te = gTT^Ta = gSa. Using this substitution (along with Q^TQ = I) gives E(y) = m+grSa and Cov(y) = S - g²vSa(Sa)^T as required.
5.23	<(Ax + a)(Bx + b)^H> = ASB^H + (Am+a)(Bm+b)^H We define the zero-mean vector u = x-m so that <u> = 0 and <uu^H> = S. <(Ax + a)(Bx + b)^H> = <(Au + Am + a)(Bu + Bm + b)^H> = <Auu^HB^H + Au(Bm+b)^H + (Am+a)u^HB^H + (Am + a)(Bm + b)^H> = ASB^H + 0 + 0 + (Am + a)(Bm + b)^H = ASB^H + (Am + a)(Bm + b)^H
5.24	<(Ax+a)^H (Bx+b)> = <tr((Bx+b)(Ax+a)^H )> = tr(BSA^H) + (Am+a)^H (Bm+b) <(Ax + a)^H(Bx + b)> = tr(<(Bx + b)(Ax + a)^H>) = tr(BSA^H) + tr((Bm + b)(Am + a)^H) [5.23] = tr(BSA^H) + (Am + a)^H(Bm + b)
5.25	argmin_K<||(AKB+C)x||²> = -(A^HA)^-1A^HC(S+mm^H)B^H(B(S+mm^H)B^H)^-1
argmin_K<||(AKB+C)x||²> = argmin_K<((AKB+C)x)^H(AKB+C)x>
= argmin_K<tr{(AKB+C)xx^H(AKB+C)^H}>
= argmin_K{tr((AKB+C)(S+mm^H)(AKB+C)^H)}
= -(A^HA)^-1A^HC(S+mm^H)B^H(B(S+mm^H)B^H)^-1 [2.7]
5.26	argmin_K<||(AKB+C)x + (AKE+F)y||²> = -(A^HA)^-1A^H(CS_xB^H+FS_yE^H)(BS_xB^H+ES_yE^H)^-1
We can define S = <[x; y][x; y]^H> = [S_x 0; 0 S_y] since x and y are independent and zero mean
argmin_K<||(AKB+C)x + (AKE+F)y||²> = argmin_K<||(AK[B E]+[C F])[x; y]||²>
= -(A^HA)^-1A^H[C F]S[B E]^H([B E]S[B E]^H)^-1 [5.25]
= -(A^HA)^-1A^H(CS_xB^H+FS_yE^H)(BS_xB^H+ES_yE^H)^-1
5.27	<(Ax + a)(Bx + b)^T(Cx + c) (Dx + d)^T> = (ASB^T+(Am+a)(Bm+b)^T)(CSD^T+(Cm+c) (Dm+d)^T) + (ASC^T+(Am+a)(Cm+c)^T)(BSD^T+(Bm+b) (Dm+d)^T) + (Bm+b)^T(Cm+c)×(ASD^T - (Am+a)(Dm+d)^T) + tr(BSC^T)×(ASD^T + (Am+a)(Dm+d)^T)
We define the zero-mean vector u = x-m so that <u> = 0 and <uu^T> = S. We now multiply out the quartic expression, omitting terms involving odd powers of u since their mean is zero [5.18]
<(Au + a)(Bu + b)^T(Cu + c) (Du + d)^T>
= <Auu^TB^TCuu^TD^T + Auu^TB^Tcd^T + Aub^TCud^T + Aub^Tcu^TD^T + au^TB^TCud^T + au^TB^Tcu^TD^T + ab^T Cuu^TD^T + ab^Tcd^T>
= <Auu^TB^TCuu^TD^T + Auu^TB^Tcd^T + Auu^TC^Tbd^T + b^Tc×Auu^TD^T + au^TB^TCud^T + ac^TBuu^TD^T + ab^T Cuu^TD^T + ab^Tcd^T>
where we have moved or transposed some scalar factors of the form p^Tq.
Defining v to have the same statistics as u but independent of it, we use Isserlis' theorem [5.18] to decompose the first term as
<Auu^TB^TCuu^TD^T> = <Auu^TB^TCvv^TD^T> + <Auv^TB^TCuv^TD^T> + <Auv^TB^TCvu^TD^T>
By again moving or transposing scalar factors of the form p^Tq we can rearrange to get
= <Auu^TB^TCvv^TD^T> + <Auu^TC^TBvv^TD^T> + <Auu^TD^T×v^TB^TCv>
We now use the quadratic expectation theorems to give
= ASB^TCSD^T + ASC^TBSD^T + ASD^T×tr(B^TCS)
Therefore we can write
<(Au + a)(Bu + b)^T(Cu + c) (Du + d)^T>
= <Auu^TB^TCuu^TD^T + Auu^TB^Tcd^T + Auu^TC^Tbd^T + b^Tc×Auu^TD^T + au^TB^TCud^T + ac^TBuu^TD^T + ab^T Cuu^TD^T + ab^Tcd^T>
= ASB^TCSD^T + ASC^TBSD^T + ASD^T×tr(B^TCS) + ASB^Tcd^T + ASC^Tbd^T + b^Tc×ASD^T + ad^T×tr(B^TCS) + ac^TBSD^T + ab^TCSD^T + ab^Tcd^T
= (ASB^T+ab^T)(CSD^T+cd^T) + (ASC^T+ac^T)(BSD^T+bd^T) + b^Tc×(ASD^T - ad^T) + tr(BSC^T)×(ASD^T + ad^T)
where we use tr(B^TCS) = tr(BSC^T) and also add and subtract a copy of ab^Tcd^T = ac^Tbd^T = b^Tc×ad^T
Now we note that Ax + a = Au + (Am + a) so if we replace a by Am+a, b by Bm+b etc in the above expression, we obtain
<(Ax + a)(Bx + b)^T(Cx + c) (Dx + d)^T> = (ASB^T+(Am+a)(Bm+b)^T)(CSD^T+(Cm+c) (Dm+d)^T) + (ASC^T+(Am+a)(Cm+c)^T)(BSD^T+(Bm+b) (Dm+d)^T) + (Bm+b)^T(Cm+c)×(ASD^T - (Am+a)(Dm+d)^T) + tr(BSC^T)×(ASD^T + (Am+a)(Dm+d)^T)
5.28	<(Ax + a)^T(Bx + b) (Cx + c)^T(Dx + d)> = tr(AS(C^TD+D^TC)SB^T) + ((Am+a)^TB + (Bm+b)^TA)S(C^T(Dm+d) + D^T(Cm+c)) + (tr(ASB^T)+(Am+a)^T(Bm+b))(tr(CSD^T)+(Cm+c)^T(Dm+d))
<(Ax + a)^T(Bx + b) (Cx + c)^T(Dx + d)> = tr(<(Bx + b) (Cx + c)^T(Dx + d)(Ax + a)^T>) so, from [5.27] we can write
= tr((BSC^T+(Bm+b)(Cm+c)^T)(DSA^T+(Dm+d) (Am+a)^T) + (BSD^T+(Bm+b)(Dm+d)^T)(CSA^T+(Cm+c) (Am+a)^T) + (Cm+c)^T(Dm+d)×(BSA^T - (Bm+b)(Am+a)^T) + tr(CSD^T)×(BSA^T + (Bm+b)(Am+a)^T))
= tr(BSC^TDSA^T) + tr(BSC^T(Dm+d)(Am+a)^T) + tr((Bm+b)(Cm+c)^T(DSA^T+(Dm+d)(Am+a)^T)) + tr(BSD^TCSA^T) + tr(BSD^T(Cm+c)(Am+a)^T) + tr((Bm+b)(Dm+d)^T(CSA^T+(Cm+c)(Am+a)^T)) + (Cm+c)^T(Dm+d)×tr(BSA^T - (Bm+b)(Am+a)^T) + tr(CSD^T)×tr(BSA^T + (Bm+b)(Am+a)^T)
Now, noting that tr(pq^T) = q^Tp and also collecting together all terms that equal (Am+a)^T(Bm+b)(Cm+c)^T(Dm+d), we can write
= tr(BSC^TDSA^T) + (Am+a)^TBSC^T(Dm+d) + (Cm+c)^TDSA^T(Bm+b) + tr(BSD^TCSA^T) + (Am+a)^TBSD^T(Cm+c) + (Dm+d)^TCSA^T(Bm+b) + (Cm+c)^T(Dm+d)×tr(BSA^T) + tr(CSD^T)×(tr(BSA^T) + (Am+a)^T(Bm+b)) + (2-1)×(Am+a)^T(Bm+b)(Cm+c)^T(Dm+d)
= tr(BSC^TDSA^T + BSD^TCSA^T) + (Am+a)^TBS(C^T(Dm+d) + D^T(Cm+c)) + (Bm+b)^TASD^T(Cm+c) + (Bm+b)^TASC^T(Dm+d) + (Cm+c)^T(Dm+d)×tr(ASB^T) + tr(CSD^T)×(tr(ASB^T) + (Am+a)^T(Bm+b)) + (Am+a)^T(Bm+b)(Cm+c)^T(Dm+d)
= tr(ASD^TCSB^T + ASC^TDSB^T) + (Am+a)^TBS(C^T(Dm+d) + D^T(Cm+c)) + (Bm+b)^TAS(C^T(Dm+d) + D^T(Cm+c)) + (tr(CSD^T) + (Cm+c)^T(Dm+d)) × (tr(ASB^T) + (Am+a)^T(Bm+b))
= tr(AS(D^TC+C^TD)SB^T) + ((Am+a)^TB + (Bm+b)^TA)S(C^T(Dm+d) + D^T(Cm+c)) + (tr(CSD^T) + (Cm+c)^T(Dm+d)) (tr(ASB^T) + (Am+a)^T(Bm+b))

This page is part of The Matrix Reference Manual. Copyright © 1998-2022 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".
Updated: $Id: proof005.html 11291 2021-01-05 18:26:10Z dmb $