Matrix Reference Manual
Proofs Section 5: Stochastic Matrices


Go to: Introduction, Notation, Index



5.1 N(x ; a, A) N(x ; b, B) = N(a ; b , A+B) × N(x ; c, C) where C = (A-1+B-1)-1 = A(A+B)-1B = B(A+B)-1A and c = C(A-1a+B-1b) = A(A+B)-1b + B(A+B)-1a
  • We assume without comment below that A, B, C and their sums and inverses are all symmetric matrices.
  • First we define C = (A-1+B-1)-1 = A(BA-1A+BB-1A)-1B = A(B+A)-1B = A(A+B)-1B = (similarly) B(A+B)-1A.
  • We also define c = C(A-1a+B-1b) = B(A+B)-1A A-1a+A(A+B)-1B B-1b= A(A+B)-1b + B(A+B)-1a
  • We see that cTC-1c = (aTA-1+bTB-1)CC-1C(A-1a+B-1b)
    = (aTA-1+bTB-1)C(A-1a+B-1b)
    = aTA-1CA-1a + 2aTA-1CB-1b +bTB-1CB-1b
    = aT(A+B)-1BA-1a + 2aTA-1A(A+B)-1BB-1b +bT(A+B)-1AB-1b substituting for C
    = aT(A+B)-1((A+B)A-1-I) a + 2aT(A+B)-1b +bT(A+B)-1((A+B)B-1-I) b
    = a
    T(A-1- (A+B)-1) a + 2aT(A+B)-1b +bT(B-1-(A+B)-1) b
    = aTA-1a +bTB-1b -(aT(A+B)-1a - 2aT(A+B)-1b +bT(A+B)-1b)
    = aTA-1a +bTB-1b -(a-b)T(A+B)-1(a-b)
  • Hence (x-c)TC-1(x-c) = xTC-1x -2xTC-1c + cTC-1c
    =
    xT(A-1+B-1)x -2xTC-1C(A-1a+B-1b) + aTA-1a +bTB-1b -(a-b)T(A+B)-1(a-b)
    = xTA-1x + xTB-1x - 2xTA-1a - 2xTB-1b + aTA-1a + bTB-1b - (a-b)T(A+B)-1(a-b)
    = (x-a)TA-1(x-a) + (x-b)TB-1(x-b) - (a-b)T(A+B)-1(a-b)
  • It follows that N(a ; b , A+B) × N(x ; c, C) = det(2 pi (A+B))
  • exp((a-b)T(A+B)-1(a-b)) × det(2 pi C) exp((x-c)TC-1(x-c))
    = det(4 pi2 (A+B)C) exp((x-c)TC-1(x-c) + (a-b)T(A+B)-1(a-b)) multiplying determinants and adding exponents
    = det(4 pi2 (A+B)A(A+B)-1B) exp((x-a)TA-1(x-a) + (x-b)TB-1(x-b))
    = det(2 pi A) exp((x-a)TA-1(x-a)) × det(2 pi B) exp((x-b)TB-1(x-b))
    = N(x ; a, A) N(x ; b, B)
5.2

N(x ; a, A)m = N(0 ; 0, m(2×pi)m-2Am-1) × N(x ; a, m-1A)

We prove this by induction. For m=1, we have N(x ; a, A)1 = N(0 ; 0, (2×pi)-1I) × N(x ; a, A) which is true.

  • N(x ; a, A)m = N(x ; a, A) × N(x ; a, A)m-1
    = N(x ; a, A) × N(0 ; 0, (m-1)(2×pi)m-3Am-2) × N(x ; a, (m-1)-1A)
    = N(0 ; 0, (m-1)(2×pi)m-3Am-2) × N(x ; a, A) × N(x ; a, (m-1)-1A)
    = N(0 ; 0, (m-1)(2×pi)m-3Am-2) × N(a ; a, (1+(m-1)-1)A) × N(x ; c, C) where from [5.1] C = (A-1+(m-1)A-1)-1 = (mA-1)-1 = m-1A and c = C(A-1a+(m-1)A-1a) = mCA-1a = a
  • So N(x ; a, A)m = N(0 ; 0, (m-1)(2×pi)m-3Am-2) × N(a ; a, (1+(m-1)-1)A) × N(x ; a, m-1A)
    = N(0 ; 0, 2 × pi × (m-1)(2×pi)m-3Am-2× (1+(m-1)-1)A) × N(x ; a, m-1A)
    = N(0 ; 0, (m-1)(1+(m-1)-1)(2×pi)m-2Am-1) × N(x ; a, m-1A)
    = N(0 ; 0, m(2×pi)m-2Am-1) × N(x ; a, m-1A)
5.3 If x[n] has a complex Gaussian distribution then Cov(x) = E(xxH) - mmH = S[n#n] <=> K[2n#2n] and E(xxT) = mmT
  • From the definition of a complex gaussian we have
    • E(xiRxkR) = ½si,kR + miRmkR
    • E(xiIxkI) = ½si,kR + miImkI
    • E(xiIxkR) = ½si,kI + miImkR
    • E(xiRxkI) = -½si,kI + miRmkI
  • E(xxH)i,k = E(xixkC) = E((xiRxkR + xiIxkI) + j(xiIxkR - xiRxkI)) = si,kR + jsi,kI + mimkC = (S + mmH)i,k
  • E(xxT)i,k = E(xixk) = E((xiRxkR - xiIxkI) + j(xiIxkR + xiRxkI)) = mimk = (mmT)i,k
5.4 If x[n] has a complex Gaussian distribution then E(xxC) = diag(S) + mmC
  • E(xxC) = E(diag(xxH)) = diag(E(xxH)) = diag(S + mmH) = diag(S) + mm [5.3]
5.5 If x[n] has a complex Gaussian distribution and p = xxC then Cov(p) = E(ppT) - E(p)E(p)T = SSC + 2(mmHST)R
  • If we define v = diag(S) and w = mmC , then E(p)E(p)T = (v+w)(v+w)T
  • Assume that y = x - m has zero mean and that E(yyC) = diag(S) = v
    • E((yyC)(yTyH))i,k = E(((xiR)2 + (xiI)2)((xkR)2 + (xkI)2)) =  E((xiRxkR)2 + (xiRxkI)2 + (xiIxkR)2 + (xiIxkI)2)
    • Now from Wick's theorem and [5.3], E((xiRxkR)2) = E(xiRxiR)E(xkRxkR) + 2(E(xiRxkR))2 = ¼vivk + ½(si,kR)2 = (¼vvT + ½SRSR)i,k
    • Similarly E((xiIxkI)2) = (¼vvT + ½SRSR)i,k and E((xiIxkR)2) = E((xiIxkI)2) = (¼vvT + ½SISI)i,k
    • Hence E((yyC)(yTyH))i,k = 2(¼vvT + ½SRSR)i,k + 2(¼vvT + ½SISI)i,k = (vvT + SSC)i,k
  • Since the expected value of any odd power of yi is zero and, from [5.3], E(yyT) = 0, we have
    E((xxC)(xTxH)) = E(((y+m) • (y+m)C)((y+m)T • (y+m)H))
    = E((yyC)(yTyH) + (yyC)(mTmH) + (ymC)(yTmH) + (ymC)(mTyH) + (myC)(yTmH) + (myC)(mTyH) + (mmC)(yTyH) + (mmC)(mTmH))
    = vvT + SSC + vwT + 0 + S • (mmH)C + SC • (mmH) + 0 + wvT + wwT
    = SSC + 2(mmHST)R + (v+w)(v+w)T
5.6 If x[n] has a real Gaussian distribution then E(xx) = diag(S) + mm  = diag(S + mmT)
  • E(xx)k = E(xkxk) = sk,k + mk2 = (diag(S) + mm)k
5.7 If x[n] has a real Gaussian distribution then Cov(xx) = 2 S • (S + 2mmT)
  • Assume that x = y + m with E(y) = 0 and and Cov(y) = S
  • Cov(xx)i,k = E(xi2xk2) - (si,i + mi2)(sk,k + mk2) = E((yi2+2yimi+mi2)(yk2+2ykmk+mk2)) - (si,i + mi2)(sk,k + mk2) = si,isk,k + 2si,k2 + si,imk2 + sk,kmi2 + 4si,kmimk + mi2mk2 - (si,i + mi2)(sk,k + mk2) = 2si,k (si,k+2mimk) = (2 S • (S + 2mmT))i,k
5.8 If the joint distribution [x; y] ~ N([x; y]; [p; q], [P R; RT Q]) then x | y ~ N(x, p+RQ-1(y-q), P - RQ-1RT). The functions fi(y) below denote functions that are independent of x.
  • p(x | y) = p(x, y) / p(y) = f1(y) exp( -½[x-p; y-q]T S-1 [x-p; y-q] ) where S = [P R; RT Q]
  • From  [3.5] S-1 = [C-1, -C-1RQ-1; -Q-1RTC-1, Q-1(I+RTC-1RQ-1)] where C =(P-RQ-1RT)
  • Hence [x-p; y-q]T S-1 [x-p; y-q] = (x-p)TC-1(x-p) - 2(x-p)TC-1RQ-1(y-p) + f2(y) = (x-p-RQ-1(y-q))TC-1(x-p-RQ-1(y-q)) + f3(y)
  • Hence p(x | y) =  f4(y) exp( -½(x-p-RQ-1(y-q))TC-1(x-p-RQ-1(y-q)) ) = f5(y) N(x, p+RQ-1(y-q), C) = f5(y) N(x, p+RQ-1(y-q), P - RQ-1RT) where f5(y) must equal 1 to give the correct integral.
5.9 If x ~ N(x; m, S) then x | ATx=b ~ N(x, (I-HAT)m+Hb, (I-HAT)S) where H=SA(ATSA)-1
  • Define z = [x; ATx] = [I; AT](x-m). Then z has mean [m; ATm] and Cov(z) = E([I; AT](x-m)(x-m)T[I, A]) = [I; AT]S[I, A] = [S SA; ATS ATSA]
  • From  [5.8] x | ATx=b ~ N(x, m+SA(ATSA)-1(b-ATm), S - SA(ATSA)-1ATS) = N(x, (I-HAT)m+Hb, (I-HAT)S)
5.10 If x ~ N(x; m, S) then y = Ax+b  ~  N(y; Am+b, ASAT)
  • E(y) = E(Ax+b) = A E(x) + b = Am+b
  • Var(y) = E(yyT) - E(y)E(y)T
    = E(AxxTAT + AxbT + bxTAT + bbT) - (AmmTAT + AmbT + bmTAT + bbT)
    = (A(S+mmT)AT + AmbT + bmTAT + bbT) - (AmmTAT + AmbT + bmTAT + bbT)
    = ASAT
5.11 If S is +ve semidefinite Hermitian, then the elements of R = DIAG(S) S DIAG(S) have magnitude <= 1.
  • |ri,j|2 = si,j2 (si,isj,j)-1 <= 1 from  [3.6]
5.12 If the joint distribution [x; y] ~ N([x; y]; [p; q], [P RT; R Q]) then Var(xi - aTy) is minimum when a = Q-1Ri when its value is diag(P)i - RiTQ-1Ri.

This proof is adapted from [R.1]. Since variances are unaffected by a shift of mean we can assume wlog that p=0 and q=0.  Now set b = a - Q-1Ri, then for any a we have

  • Now note that for a zero-mean scalar z, Var(z) = E(zzT). Hence
    Var(xi - aTy) = Var(xi - RiTQ-1y - bTy)
    = E((xi - RiTQ-1y - bTy)(xi - RiTQ-1y - bTy)T)
    = E(xixiT + RiTQ-1yyTQ-1Ri + bTyyTb - 2xiyTQ-1Ri - 2xiyTb + 2RiTQ-1yyTb)
    = diag(P)i + RiTQ-1QQ-1Ri + bTQb - 2RiTQ-1Ri - 2RiTb + 2RiTQ-1Qb
    = diag(P)i + RiTQ-1Ri + bTQb - 2RiTQ-1Ri - 2RiTb + 2RiTb
    = diag
    (P)i - RiTQ-1Ri + bTQb

Only the last term depends on b and, since Q is positive semidefinite, this is minimum when b = 0.

5.13 If the joint distribution [x; y] ~ N([x; y]; [p; q], [P RT; R Q]) then the maximum (over a) of the correlation between xi and aTy is obtained when a = Q-1Ri

This proof is adapted from [R.1]. Since correlations are unaffected by a shift of mean we can assume wlog that p=0 and q=0. For any a we can define c=sqrt(RiTQ-1Ri / aTQa). From  [5.12]  we know that

  • Var(xi - caTy) >= Var(xi - RiTQ-1y)
  • E(xixiT + c2aTyyTa - 2cxiyTa) >= E(xixiTRiTQ-1yyTQ-1Ri - 2xiyTQ-1Ri)
  • diag(P)i + c2aTQa - 2cE(xiyTa) >= diag(P)iRiTQ-1QQ-1Ri - 2E(xiyTQ-1Ri)
  • c2aTQa - 2cE(xiyTa) >= RiTQ-1Ri - 2E(xiyTQ-1Ri)
  • RiTQ-1Ri  - 2cE(xiyTa) >= RiTQ-1Ri - 2E(xiyTQ-1Ri)
  • cE(xiyTa) <= E(xiyTQ-1Ri)
  • E(xiyTa) / sqrt(aTQa) <= E(xiyTQ-1Ri) / sqrt(RiTQ-1Ri )
  • E(xiyTa) / sqrt( Var(xi) × aTQa) <= E(xiyTQ-1Ri) / sqrt( Var(xi) × RiTQ-1Ri )
  • E(xiyTa) / sqrt( Var(xi) × Var(aTy)) <= E(xiyTQ-1Ri) / sqrt( Var(xi) × Var(RiTQ-1y))

Thus the correlation between xi and aTy is bounded above by the right hand side and attains this bound when a = Q-1Ri.  The value of the correlation is given by

  • E(xiyTQ-1Ri) / sqrt( Var(xi) × Var(RiTQ-1y))
    = RiTQ-1Ri / sqrt( diag(P)i × RiTQ-1Ri)
    = sqrt(RiTQ-1Ri /  diag(P)i)
5.14 (gx|y)i2 = 1 - pii-1 det([pii , riT; ri , Q]) det(Q)-1
  • 1 - pii-1 det([pii , riT; ri , Q]) det(Q)-1
  • = 1 - pii-1 ((pii - riTQ-1ri) det(Q)) det(Q)-1    [3.1]
  • = 1 -  (1 - pii-1riTQ-1ri) = pii-1riTQ-1ri = (FR)ii  ÷ pii = (gx|y)i2
5.15 d/dq (ln(p(X)) = 1[k#1]T(X-M)TS-1 dm/dq - ½(k S-1 - S-1(X-M)(X-M)TS-1):T dS/dq   where  M[n#k] = m×1[k#1]T.
  • d/dq (ln(p(X)) = d/dm (ln(p(X)) dm/dq + d/dP (ln(p(X)) dP/dS dS/dq where P = S-1
  • d/dm (ln(p(X)) = d/dM (ln(p(X)) dM/dm
    =
    - ½ d/dM ( tr((X-M)T S-1 (X-M))) dM/dm
    = (S-1(X-M)):T dM/dm   [Derivative of Trace]
    = (S-1(X-M)):T (1[k#1] ¤ I)    [Derivative of Linear Function]
    = 1[k#1]T(X-M)TS-1     [Kroneker Product Identity]
  • d/dP (ln(p(X)) = ½k  d/dP (ln(det(2pi×P))) - ½  d/dS (tr((X-M)T P (X-M)))
    = ½kP-1:T  - ½  d/dP (tr((X-M)T P (X-M)))    [Derivative of Determinant]
    = ½kP-1:T  - ½  ((X-M)(X-M)T):T   [Derivative of Quadratic]
  • dP/dS = -(P ¤ P) = -(S-1 ¤ S-1)   [Derivative of Inverse]
  • d/dP (ln(p(X)) dP/dS = -½( kS:T  - ½  ((X-M)(X-M)T):T)(S-1 ¤ S-1)
    = -½(S-1( kS  - ½  (X-M)(X-M)T)S-1):T
    = -½(( kS-1  - ½  S-1(X-M)(X-M)T)S-1):T
5.16 E(vvT) = k ((dm/dq)T S-1 dm/dq + ½(dS/dq)T (S ¤ S)-1 dS/dq )
where vT = 1[k#1]T(X-M)TS-1 dm/dq - ½(k S-1:T - ((X-M)(X-M)T):T (S-1 ¤ S-1)) dS/dq
  • Since v is the sum of k independent identically distributed zero-mean vectors, we can calculate E(vvT) for k=1 and then multiply the result by k. For k=1, v simplifies to
     vT = (x-m)TS-1 dm/dq + ½(S - (x-m)(x-m)T ):T dP/dq where P = S-1  (see  [5.15])
  • When we form the product vvT, the cross terms are cubic in (x-m) and therefore have zero mean. We deal with the two squared terms in vvT separately. For the first term
    • k E (((x-m)TS-1 dm/dq)T ((x-m)TS-1 dm/dq))
      = k E ((dm/dq)T S-1(x-m)(x-m)TS-1 dm/dq)
      = k E ((dm/dq)T S-1 dm/dq)
  • Since the second term has zero mean, we can write
    • ¼k E {(dP/dq)T (S - (x-m)(x-m)T ):  (S - (x-m)(x-m)T ):T dP/dq}
      = ¼k (dP/dq)TE { ((x-m)(x-m)T ):  ((x-m)(x-m)T):T }dP/dq - ¼k(dP/dq)T  SS:T dP/dq
  • We define T = TVEC(n,n) so that T S:  = TT S:  = ST: =S: since S and T are symmetrical. [see vectorized transpose]
  • To calculate the quartic expectation, E {((x-m)(x-m)T ):  ((x-m)(x-m)T):T }, we use Wick's rule take the three possible pairings of the (x-m) terms and, treating x and y as distinct random vectors, we get:
    • E {((x-m)(x-m)T ):  ((x-m)(x-m)T):T }
      = E {((x-m)(x-m)T ):  ((y-m)(y-m)T):T } + E {((x-m)(y-m)T ):  ((x-m)(y-m)T):T } + E {((x-m)(y-m)T ):  ((y-m)(x-m)T):T }
      = S:S:T + E {((y-m) ¤ (x-m) )((y-m) ¤ (x-m))T } + E {((y-m) ¤ (x-m) )((y-m) ¤ (x-m))TTT } [see kroneker product and vectorized transpose]
      = S:S:T + E {((y-m)(y-m) ) ¤ ((x-m)(x-m))T } (I + TT)
      = S:S:T + (S ¤ S) (I + TT)
    • Substituting this into the second term expression gives
      ¼k (dP/dq)T(S:S:T + (S ¤ S) (I + TT))dP/dq - ¼k(dP/dq)T  SS:T dP/dq
      =
      ¼k (dP/dq)T (S ¤ S) (I + TT)dP/d  [Since the first and last terms cancel]
      =
      ½k (dP/dq)T(S ¤ S) dP/dq    [Since the symmetry of dP means that TT dP/dq = dP/dq]
  • Putting this all together and using dP/dq = dP/dS dS/dq= -(P ¤ P) dS/dq = -(S-1 ¤ S-1) dS/dq   [Derivative of Inverse] we obtain
    J = E {vvT} = k E ((dm/dq)T S-1 dm/dq) + ½k (dP/dq)T(S ¤ S)dP/dq
    =
    k E ((dm/dq)T S-1 dm/dq) + ½k (dS/dq)T(S-1 ¤ S-1)(S ¤ S)(S-1 ¤ S-1)dS/dq
    =
    k E ((dm/dq)T S-1 dm/dq) + ½k (dS/dq)T(S-1 ¤ S-1)dS/dq
5.17 If f[r#1](X) is a function of X with mean value g(q), then Cov(f) >= dg/dq J-1 (dg/dq)T
  • g = E (f) means that dg/dq = E(df/dq + f vT) = E(f vT) where vT = d/dq (ln(p(X)).
  • For any arbitrary vector a, we have the scalar equation
    aT dg/dq J-1 (dg/dq)T a = E(aT f vTJ-1 (dg/dq)T a) = Cov(aT f, aT dg/dq  J-1v)
  • Hence from the Cauchy-Schwarz inequality
    (aT dg/dq J-1 (dg/dq)T a)2 <= Var(aT f)Var(aT dg/dq  J-1v)
    = (aTCov(f)a) (aT dg/dq  J-1Cov(v) J-1(dg/dq)T a)
    = (aTCov(f)a) (aT dg/dq  J-1(dg/dq)T a)   [Since Cov(v) = J ]
  • Hence (aT dg/dq J-1 (dg/dq)T a) <= aTCov(f)a for any a
  • So Cov(f) >= dg/dq J-1 (dg/dq)T  where >= represents the Loewner partial order.
5.18 For even n, E(prod(x[n]-m)) = (½n)!-12n sumv(sv(1),v(2)sv(3),v(4)...sv(n-1),v(n)) where the sum is over all n! permutations v of the numbers 1:n.
  • Without loss of generality, we assume that m = 0. The characteristic function is f(t) = E(exp(jtTx)) = exp(-½tTSt) where j=sqrt(-1).
  • Differentiating E(exp(jtTx)) shows that E(prod(x)) = j-n dnf/dt1dt2...dtn evaluated at t=0.
  • Expanding the power series f(t) = exp(-½tTSt) = sum((-½tTSt)r/r!; r=0..inf). Term r in this contains only terms in txr where x is some combination of the subscripts 1:n. When differentiated n times, terms with r < n/2 will become zero, whereas terms with r > n/2 will still contain powers of tx and will therefore be zero when t is set to 0.
  • It follows that when n is odd,  all terms vanish and E(prod(x))=0.
  • When n is even, E(prod(x)) =  dng/dt1dt2...dtn evaluated at t=0 where g(t) = j-n (-½tTSt)½n/(½n)! = ½½n(tTSt)½n/(½n)!
  • We can expand (tTSt)½n=tTSt × tTSt × ...× tTSt and we now differentiate in turn with respect to tn, tn-1, ... , t1. We note that t occurs n times in this expansion and that  dt/dti = ei a column of the identity matrix.
  • When we differentiate with respect to tn, we obtain a sum of n terms, in each of which one of the occurences of t has been replaced by en and the remaining n-1 occurrences remain as t .
  • If we now differentiate with respect to tn-1, each of the n terms in the previous step changes into a sum of n-1 terms in which one ot the occurences of t is now replaced by en-1 and the remaining n-2 occurrences remain as t . We thus have a total of n(n-1) terms.
  • We repeat this process for tn-2, ... , t1and we end up with n! terms of the form (ev(1))TSev(2)(ev(3))TSev(4)...(ev(n-1))TSev(n) = sv(1),v(2)sv(3),v(4)...sv(n-1),v(n) where v runs through all n! permutations of 1:n.
  • Thus E(prod(x)) = (½n)!-12n sumv(sv(1),v(2)sv(3),v(4)...sv(n-1),v(n))

This page is part of The Matrix Reference Manual. Copyright © 1998-2005 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".
Updated: $Id: proof005.html 1621 2012-03-15 09:45:07Z dmb $