Matrix Reference Manual
Proofs Section 5: Stochastic Matrices
5.1 $N(x; a, A)\,N(x; b, B) = N(a; b, A+B) \times N(x; c, C)$ where $C = (A^{-1}+B^{-1})^{-1} = A(A+B)^{-1}B = B(A+B)^{-1}A$ and $c = C(A^{-1}a+B^{-1}b) = A(A+B)^{-1}b + B(A+B)^{-1}a$
  • We assume without comment below that $A$, $B$, $C$ and their sums and inverses are all symmetric matrices.
  • First we define $C = (A^{-1}+B^{-1})^{-1} = A(BA^{-1}A+BB^{-1}A)^{-1}B = A(B+A)^{-1}B = A(A+B)^{-1}B =$ (similarly) $B(A+B)^{-1}A$.
  • We also define $c = C(A^{-1}a+B^{-1}b) = B(A+B)^{-1}A\,A^{-1}a + A(A+B)^{-1}B\,B^{-1}b = A(A+B)^{-1}b + B(A+B)^{-1}a$.
  • We see that $c^TC^{-1}c = (a^TA^{-1}+b^TB^{-1})CC^{-1}C(A^{-1}a+B^{-1}b)$
    $= (a^TA^{-1}+b^TB^{-1})C(A^{-1}a+B^{-1}b)$
    $= a^TA^{-1}CA^{-1}a + 2a^TA^{-1}CB^{-1}b + b^TB^{-1}CB^{-1}b$
    $= a^T(A+B)^{-1}BA^{-1}a + 2a^T(A+B)^{-1}b + b^T(A+B)^{-1}AB^{-1}b$   [substituting for $C$]
    $= a^T(A+B)^{-1}((A+B)A^{-1}-I)a + 2a^T(A+B)^{-1}b + b^T(A+B)^{-1}((A+B)B^{-1}-I)b$
    $= a^T(A^{-1}-(A+B)^{-1})a + 2a^T(A+B)^{-1}b + b^T(B^{-1}-(A+B)^{-1})b$
    $= a^TA^{-1}a + b^TB^{-1}b - (a^T(A+B)^{-1}a - 2a^T(A+B)^{-1}b + b^T(A+B)^{-1}b)$
    $= a^TA^{-1}a + b^TB^{-1}b - (a-b)^T(A+B)^{-1}(a-b)$
  • Hence $(x-c)^TC^{-1}(x-c) = x^TC^{-1}x - 2x^TC^{-1}c + c^TC^{-1}c$
    $= x^T(A^{-1}+B^{-1})x - 2x^TC^{-1}C(A^{-1}a+B^{-1}b) + a^TA^{-1}a + b^TB^{-1}b - (a-b)^T(A+B)^{-1}(a-b)$
    $= x^TA^{-1}x + x^TB^{-1}x - 2x^TA^{-1}a - 2x^TB^{-1}b + a^TA^{-1}a + b^TB^{-1}b - (a-b)^T(A+B)^{-1}(a-b)$
    $= (x-a)^TA^{-1}(x-a) + (x-b)^TB^{-1}(x-b) - (a-b)^T(A+B)^{-1}(a-b)$
  • It follows that $N(a; b, A+B) \times N(x; c, C) = \det(2\pi(A+B))^{-1/2}\exp(-\tfrac{1}{2}(a-b)^T(A+B)^{-1}(a-b)) \times \det(2\pi C)^{-1/2}\exp(-\tfrac{1}{2}(x-c)^TC^{-1}(x-c))$
    $= \det(4\pi^2(A+B)C)^{-1/2}\exp(-\tfrac{1}{2}[(x-c)^TC^{-1}(x-c) + (a-b)^T(A+B)^{-1}(a-b)])$   [multiplying determinants and adding exponents]
    $= \det(4\pi^2(A+B)A(A+B)^{-1}B)^{-1/2}\exp(-\tfrac{1}{2}[(x-a)^TA^{-1}(x-a) + (x-b)^TB^{-1}(x-b)])$
    $= \det(2\pi A)^{-1/2}\exp(-\tfrac{1}{2}(x-a)^TA^{-1}(x-a)) \times \det(2\pi B)^{-1/2}\exp(-\tfrac{1}{2}(x-b)^TB^{-1}(x-b))$
    $= N(x; a, A)\,N(x; b, B)$
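
This identity is easy to check numerically. The following short numpy/scipy sketch compares the two sides at random test values; the helper name rand_spd and all test values are illustrative assumptions, not part of the manual.

```python
# Check [5.1]: N(x;a,A) N(x;b,B) = N(a;b,A+B) * N(x;c,C)   (illustrative)
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
n = 3

def rand_spd(n):
    # random symmetric positive definite matrix (hypothetical helper)
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A, B = rand_spd(n), rand_spd(n)
a, b, x = rng.standard_normal((3, n))

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))   # C = (A^-1 + B^-1)^-1
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))  # c = C(A^-1 a + B^-1 b)

lhs = mvn.pdf(x, a, A) * mvn.pdf(x, b, B)
rhs = mvn.pdf(a, b, A + B) * mvn.pdf(x, c, C)
print(lhs, rhs)                                          # equal to rounding error
```
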
5.2

$N(x; a, A)^m = N(0; 0, m(2\pi)^{m-2}A^{m-1}) \times N(x; a, m^{-1}A)$

We prove this by induction. For $m=1$ we have $N(x; a, A)^1 = N(0; 0, (2\pi)^{-1}I) \times N(x; a, A)$, which is true since $N(0; 0, (2\pi)^{-1}I) = \det(2\pi(2\pi)^{-1}I)^{-1/2} = 1$.

  • $N(x; a, A)^m = N(x; a, A) \times N(x; a, A)^{m-1}$
    $= N(x; a, A) \times N(0; 0, (m-1)(2\pi)^{m-3}A^{m-2}) \times N(x; a, (m-1)^{-1}A)$
    $= N(0; 0, (m-1)(2\pi)^{m-3}A^{m-2}) \times N(x; a, A) \times N(x; a, (m-1)^{-1}A)$
    $= N(0; 0, (m-1)(2\pi)^{m-3}A^{m-2}) \times N(a; a, (1+(m-1)^{-1})A) \times N(x; c, C)$ where from [5.1] $C = (A^{-1}+(m-1)A^{-1})^{-1} = (mA^{-1})^{-1} = m^{-1}A$ and $c = C(A^{-1}a+(m-1)A^{-1}a) = mCA^{-1}a = a$
  • So $N(x; a, A)^m = N(0; 0, (m-1)(2\pi)^{m-3}A^{m-2}) \times N(a; a, (1+(m-1)^{-1})A) \times N(x; a, m^{-1}A)$
    $= N(0; 0, 2\pi(m-1)(2\pi)^{m-3}A^{m-2} \times (1+(m-1)^{-1})A) \times N(x; a, m^{-1}A)$   [using $N(0; 0, P)\,N(a; a, Q) = \det(4\pi^2PQ)^{-1/2} = N(0; 0, 2\pi PQ)$]
    $= N(0; 0, (m-1)(1+(m-1)^{-1})(2\pi)^{m-2}A^{m-1}) \times N(x; a, m^{-1}A)$
    $= N(0; 0, m(2\pi)^{m-2}A^{m-1}) \times N(x; a, m^{-1}A)$
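
A corresponding numerical check of [5.2], again an illustrative sketch with assumed test values:

```python
# Check [5.2]: N(x;a,A)^m = N(0;0, m(2*pi)^(m-2) A^(m-1)) * N(x;a, A/m)
import numpy as np
from numpy.linalg import matrix_power
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
n, m = 2, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                     # symmetric positive definite
a, x = rng.standard_normal((2, n))

lhs = mvn.pdf(x, a, A) ** m
V = m * (2 * np.pi) ** (m - 2) * matrix_power(A, m - 1)
rhs = mvn.pdf(np.zeros(n), np.zeros(n), V) * mvn.pdf(x, a, A / m)
print(lhs, rhs)                                 # equal to rounding error
```
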
5.3 If $x_{[n]}$ has a complex Gaussian distribution then $\mathrm{Cov}(x) = E(xx^H) - mm^H = S_{[n\#n]} \Leftrightarrow K_{[2n\#2n]}$ and $E(xx^T) = mm^T$
  • From the definition of a complex Gaussian we have
    • $E(x_i^Rx_k^R) = \tfrac{1}{2}s_{i,k}^R + m_i^Rm_k^R$
    • $E(x_i^Ix_k^I) = \tfrac{1}{2}s_{i,k}^R + m_i^Im_k^I$
    • $E(x_i^Ix_k^R) = \tfrac{1}{2}s_{i,k}^I + m_i^Im_k^R$
    • $E(x_i^Rx_k^I) = -\tfrac{1}{2}s_{i,k}^I + m_i^Rm_k^I$
  • $E(xx^H)_{i,k} = E(x_ix_k^C) = E((x_i^Rx_k^R + x_i^Ix_k^I) + j(x_i^Ix_k^R - x_i^Rx_k^I)) = s_{i,k}^R + js_{i,k}^I + m_im_k^C = (S + mm^H)_{i,k}$
  • $E(xx^T)_{i,k} = E(x_ix_k) = E((x_i^Rx_k^R - x_i^Ix_k^I) + j(x_i^Ix_k^R + x_i^Rx_k^I)) = m_im_k = (mm^T)_{i,k}$
5.4 If $x_{[n]}$ has a complex Gaussian distribution then $E(x \bullet x^C) = \mathrm{diag}(S) + m \bullet m^C$
  • $E(x \bullet x^C) = E(\mathrm{diag}(xx^H)) = \mathrm{diag}(E(xx^H)) = \mathrm{diag}(S + mm^H) = \mathrm{diag}(S) + m \bullet m^C$   [5.3]
5.5 If $x_{[n]}$ has a complex Gaussian distribution and $p = x \bullet x^C$ then $\mathrm{Cov}(p) = E(pp^T) - E(p)E(p)^T = S \bullet S^C + 2(mm^H \bullet S^T)^R$
  • If we define $v = \mathrm{diag}(S)$ and $w = m \bullet m^C$, then $E(p)E(p)^T = (v+w)(v+w)^T$   [5.4]
  • Define $y = x - m$; then $y$ has zero mean and $E(y \bullet y^C) = \mathrm{diag}(S) = v$   [5.4 with $m = 0$]
    • $E((y \bullet y^C)(y^T \bullet y^H))_{i,k} = E(((y_i^R)^2 + (y_i^I)^2)((y_k^R)^2 + (y_k^I)^2)) = E((y_i^Ry_k^R)^2 + (y_i^Ry_k^I)^2 + (y_i^Iy_k^R)^2 + (y_i^Iy_k^I)^2)$
    • Now from Wick's theorem and [5.3], $E((y_i^Ry_k^R)^2) = E(y_i^Ry_i^R)E(y_k^Ry_k^R) + 2(E(y_i^Ry_k^R))^2 = \tfrac{1}{4}v_iv_k + \tfrac{1}{2}(s_{i,k}^R)^2 = (\tfrac{1}{4}vv^T + \tfrac{1}{2}S^R \bullet S^R)_{i,k}$
    • Similarly $E((y_i^Iy_k^I)^2) = (\tfrac{1}{4}vv^T + \tfrac{1}{2}S^R \bullet S^R)_{i,k}$ and $E((y_i^Iy_k^R)^2) = E((y_i^Ry_k^I)^2) = (\tfrac{1}{4}vv^T + \tfrac{1}{2}S^I \bullet S^I)_{i,k}$
    • Hence $E((y \bullet y^C)(y^T \bullet y^H))_{i,k} = 2(\tfrac{1}{4}vv^T + \tfrac{1}{2}S^R \bullet S^R)_{i,k} + 2(\tfrac{1}{4}vv^T + \tfrac{1}{2}S^I \bullet S^I)_{i,k} = (vv^T + S \bullet S^C)_{i,k}$
  • Since the expected value of any odd power of $y_i$ is zero and, from [5.3], $E(yy^T) = 0$, we have
    $E((x \bullet x^C)(x^T \bullet x^H)) = E(((y+m) \bullet (y+m)^C)((y+m)^T \bullet (y+m)^H))$
    $= E((y \bullet y^C)(y^T \bullet y^H) + (y \bullet y^C)(m^T \bullet m^H) + (y \bullet m^C)(y^T \bullet m^H) + (y \bullet m^C)(m^T \bullet y^H) + (m \bullet y^C)(y^T \bullet m^H) + (m \bullet y^C)(m^T \bullet y^H) + (m \bullet m^C)(y^T \bullet y^H) + (m \bullet m^C)(m^T \bullet m^H))$
    $= vv^T + S \bullet S^C + vw^T + 0 + S \bullet (mm^H)^C + S^C \bullet (mm^H) + 0 + wv^T + ww^T$
    $= S \bullet S^C + 2(mm^H \bullet S^T)^R + (v+w)(v+w)^T$   [using $S^C = S^T$ since $S$ is Hermitian]
5.6 If $x_{[n]}$ has a real Gaussian distribution then $E(x \bullet x) = \mathrm{diag}(S) + m \bullet m = \mathrm{diag}(S + mm^T)$
  • $E(x \bullet x)_k = E(x_kx_k) = s_{k,k} + m_k^2 = (\mathrm{diag}(S) + m \bullet m)_k$
5.7 If $x_{[n]}$ has a real Gaussian distribution then $\mathrm{Cov}(x \bullet x) = 2S \bullet (S + 2mm^T)$
  • Assume that $x = y + m$ with $E(y) = 0$ and $\mathrm{Cov}(y) = S$
  • $\mathrm{Cov}(x \bullet x)_{i,k} = E(x_i^2x_k^2) - (s_{i,i} + m_i^2)(s_{k,k} + m_k^2)$
    $= E((y_i^2 + 2y_im_i + m_i^2)(y_k^2 + 2y_km_k + m_k^2)) - (s_{i,i} + m_i^2)(s_{k,k} + m_k^2)$
    $= s_{i,i}s_{k,k} + 2s_{i,k}^2 + s_{i,i}m_k^2 + s_{k,k}m_i^2 + 4s_{i,k}m_im_k + m_i^2m_k^2 - (s_{i,i} + m_i^2)(s_{k,k} + m_k^2)$   [dropping odd moments of $y$, which vanish]
    $= 2s_{i,k}(s_{i,k} + 2m_im_k) = (2S \bullet (S + 2mm^T))_{i,k}$
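
A Monte Carlo sketch illustrating [5.7]; sample size and seed are arbitrary assumptions, and agreement is only to sampling accuracy.

```python
# Check [5.7]: Cov(x .* x) = 2 S .* (S + 2 m m^T)   (Monte Carlo, illustrative)
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 2_000_000
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)
m = rng.standard_normal(n)

X = rng.multivariate_normal(m, S, size=N)
P = X * X                                   # elementwise square of each sample
emp = np.cov(P, rowvar=False)               # sample covariance of p = x .* x
thy = 2 * S * (S + 2 * np.outer(m, m))      # elementwise (Hadamard) products
print(np.max(np.abs(emp - thy)))            # small, at Monte Carlo accuracy
```
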
5.8 If the joint distribution is $[x; y] \sim N([x; y]; [p; q], [P\ R; R^T\ Q])$ then $x\,|\,y \sim N(x; p+RQ^{-1}(y-q), P - RQ^{-1}R^T)$. The functions $f_i(y)$ below denote functions that are independent of $x$.
  • $p(x\,|\,y) = p(x, y)\,/\,p(y) = f_1(y)\exp(-\tfrac{1}{2}[x-p; y-q]^TS^{-1}[x-p; y-q])$ where $S = [P\ R; R^T\ Q]$
  • From [3.5], $S^{-1} = [C^{-1},\ -C^{-1}RQ^{-1};\ -Q^{-1}R^TC^{-1},\ Q^{-1}(I+R^TC^{-1}RQ^{-1})]$ where $C = P-RQ^{-1}R^T$
  • Hence $[x-p; y-q]^TS^{-1}[x-p; y-q] = (x-p)^TC^{-1}(x-p) - 2(x-p)^TC^{-1}RQ^{-1}(y-q) + f_2(y) = (x-p-RQ^{-1}(y-q))^TC^{-1}(x-p-RQ^{-1}(y-q)) + f_3(y)$
  • Hence $p(x\,|\,y) = f_4(y)\exp(-\tfrac{1}{2}(x-p-RQ^{-1}(y-q))^TC^{-1}(x-p-RQ^{-1}(y-q))) = f_5(y)\,N(x; p+RQ^{-1}(y-q), C) = f_5(y)\,N(x; p+RQ^{-1}(y-q), P - RQ^{-1}R^T)$ where $f_5(y)$ must equal 1 to give the correct integral.
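
The conditioning formulas of [5.8] translate directly into code; this is a minimal sketch, with the function name and argument layout my own.

```python
# Sketch of [5.8]: mean and covariance of x | y for a joint Gaussian
import numpy as np

def gaussian_condition(p, q, P, R, Q, y):
    """[x; y] ~ N([p; q], [[P, R], [R.T, Q]]); return mean, cov of x | y."""
    K = R @ np.linalg.inv(Q)        # gain R Q^-1
    return p + K @ (y - q), P - K @ R.T
```
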
5.9 If $x \sim N(x; m, S)$ then $x\,|\,A^Tx=b \sim N(x; (I-HA^T)m+Hb, (I-HA^T)S)$ where $H = SA(A^TSA)^{-1}$
  • Define $z = [x; A^Tx] = [I; A^T]x$. Then $z$ has mean $[m; A^Tm]$ and $\mathrm{Cov}(z) = E([I; A^T](x-m)(x-m)^T[I,\ A]) = [I; A^T]S[I,\ A] = [S\ SA; A^TS\ A^TSA]$
  • From [5.8], $x\,|\,A^Tx=b \sim N(x; m+SA(A^TSA)^{-1}(b-A^Tm), S - SA(A^TSA)^{-1}A^TS) = N(x; (I-HA^T)m+Hb, (I-HA^T)S)$
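
Similarly, [5.9] gives the Gaussian conditioned on an exact linear constraint; a minimal sketch with assumed names:

```python
# Sketch of [5.9]: condition x ~ N(m, S) on A^T x = b
import numpy as np

def condition_on_constraint(m, S, A, b):
    H = S @ A @ np.linalg.inv(A.T @ S @ A)   # H = S A (A^T S A)^-1
    mean = m + H @ (b - A.T @ m)             # (I - H A^T) m + H b
    cov = S - H @ A.T @ S                    # (I - H A^T) S
    return mean, cov
```
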
5.10 If $x \sim N(x; m, S)$ then $y = Ax+b \sim N(y; Am+b, ASA^T)$
  • $E(y) = E(Ax+b) = A\,E(x) + b = Am+b$
  • $\mathrm{Var}(y) = E(yy^T) - E(y)E(y)^T$
    $= E(Axx^TA^T + Axb^T + bx^TA^T + bb^T) - (Amm^TA^T + Amb^T + bm^TA^T + bb^T)$
    $= (A(S+mm^T)A^T + Amb^T + bm^TA^T + bb^T) - (Amm^TA^T + Amb^T + bm^TA^T + bb^T)$
    $= ASA^T$
5.11 If $S$ is +ve semidefinite Hermitian, then the elements of $R = \mathrm{DIAG}(S)^{-1/2}\,S\,\mathrm{DIAG}(S)^{-1/2}$ have magnitude $\le 1$.
  • $|r_{i,j}|^2 = |s_{i,j}|^2(s_{i,i}s_{j,j})^{-1} \le 1$ from [3.6]
5.12 If the joint distribution is $[x; y] \sim N([x; y]; [p; q], [P\ R^T; R\ Q])$ then $\mathrm{Var}(x_i - a^Ty)$ is minimized when $a = Q^{-1}R_i$, at which point its value is $\mathrm{diag}(P)_i - R_i^TQ^{-1}R_i$.

This proof is adapted from [R.1]. Since variances are unaffected by a shift of mean we can assume wlog that $p=0$ and $q=0$. Now set $b = a - Q^{-1}R_i$; then for any $a$ we have

  • Noting that, for a zero-mean scalar $z$, $\mathrm{Var}(z) = E(zz^T)$,
    $\mathrm{Var}(x_i - a^Ty) = \mathrm{Var}(x_i - R_i^TQ^{-1}y - b^Ty)$
    $= E((x_i - R_i^TQ^{-1}y - b^Ty)(x_i - R_i^TQ^{-1}y - b^Ty)^T)$
    $= E(x_ix_i^T + R_i^TQ^{-1}yy^TQ^{-1}R_i + b^Tyy^Tb - 2x_iy^TQ^{-1}R_i - 2x_iy^Tb + 2R_i^TQ^{-1}yy^Tb)$
    $= \mathrm{diag}(P)_i + R_i^TQ^{-1}QQ^{-1}R_i + b^TQb - 2R_i^TQ^{-1}R_i - 2R_i^Tb + 2R_i^TQ^{-1}Qb$   [using $E(x_i^2) = \mathrm{diag}(P)_i$ and $E(x_iy^T) = R_i^T$]
    $= \mathrm{diag}(P)_i + R_i^TQ^{-1}R_i + b^TQb - 2R_i^TQ^{-1}R_i - 2R_i^Tb + 2R_i^Tb$
    $= \mathrm{diag}(P)_i - R_i^TQ^{-1}R_i + b^TQb$

Only the last term depends on $b$ and, since $Q$ is positive semidefinite, this is minimized when $b = 0$, i.e. when $a = Q^{-1}R_i$.
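
A Monte Carlo illustration of [5.12], with all test values assumed: the optimal $a$ attains the predicted minimum variance and any perturbed $a$ does worse.

```python
# Check [5.12]: a = Q^-1 R_i minimises Var(x_i - a^T y)   (illustrative)
import numpy as np

rng = np.random.default_rng(3)
nx, ny, N = 2, 3, 1_000_000
M = rng.standard_normal((nx + ny, nx + ny))
C = M @ M.T + (nx + ny) * np.eye(nx + ny)         # joint covariance [[P, R^T], [R, Q]]
P, Rt, Q = C[:nx, :nx], C[:nx, nx:], C[nx:, nx:]  # Rt = R^T, so Rt[i] = R_i

Z = rng.multivariate_normal(np.zeros(nx + ny), C, size=N)
X, Y = Z[:, :nx], Z[:, nx:]
i = 0
a_opt = np.linalg.solve(Q, Rt[i])                 # a = Q^-1 R_i
print(np.var(X[:, i] - Y @ a_opt))                # ~ diag(P)_i - R_i^T Q^-1 R_i
print(P[i, i] - Rt[i] @ a_opt)                    # predicted minimum
print(np.var(X[:, i] - Y @ (a_opt + 0.3 * rng.standard_normal(ny))))  # larger
```
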

5.13 If the joint distribution is $[x; y] \sim N([x; y]; [p; q], [P\ R^T; R\ Q])$ then the maximum (over $a$) of the correlation between $x_i$ and $a^Ty$ is obtained when $a = Q^{-1}R_i$

This proof is adapted from [R.1]. Since correlations are unaffected by a shift of mean we can assume wlog that $p=0$ and $q=0$. For any $a$ we can define the scalar $c = \sqrt{R_i^TQ^{-1}R_i\,/\,a^TQa}$, so that $c^2a^TQa = R_i^TQ^{-1}R_i$. From [5.12] we know that

  • $\mathrm{Var}(x_i - ca^Ty) \ge \mathrm{Var}(x_i - R_i^TQ^{-1}y)$
  • $E(x_ix_i^T + c^2a^Tyy^Ta - 2cx_iy^Ta) \ge E(x_ix_i^T + R_i^TQ^{-1}yy^TQ^{-1}R_i - 2x_iy^TQ^{-1}R_i)$
  • $\mathrm{diag}(P)_i + c^2a^TQa - 2cE(x_iy^Ta) \ge \mathrm{diag}(P)_i + R_i^TQ^{-1}R_i - 2E(x_iy^TQ^{-1}R_i)$
  • $c^2a^TQa - 2cE(x_iy^Ta) \ge R_i^TQ^{-1}R_i - 2E(x_iy^TQ^{-1}R_i)$
  • $R_i^TQ^{-1}R_i - 2cE(x_iy^Ta) \ge R_i^TQ^{-1}R_i - 2E(x_iy^TQ^{-1}R_i)$   [since $c^2a^TQa = R_i^TQ^{-1}R_i$]
  • $cE(x_iy^Ta) \le E(x_iy^TQ^{-1}R_i)$
  • $E(x_iy^Ta)\,/\,\sqrt{a^TQa} \le E(x_iy^TQ^{-1}R_i)\,/\,\sqrt{R_i^TQ^{-1}R_i}$
  • $E(x_iy^Ta)\,/\,\sqrt{\mathrm{Var}(x_i) \times a^TQa} \le E(x_iy^TQ^{-1}R_i)\,/\,\sqrt{\mathrm{Var}(x_i) \times R_i^TQ^{-1}R_i}$
  • $E(x_iy^Ta)\,/\,\sqrt{\mathrm{Var}(x_i) \times \mathrm{Var}(a^Ty)} \le E(x_iy^TQ^{-1}R_i)\,/\,\sqrt{\mathrm{Var}(x_i) \times \mathrm{Var}(R_i^TQ^{-1}y)}$

Thus the correlation between $x_i$ and $a^Ty$ is bounded above by the right-hand side and attains this bound when $a = Q^{-1}R_i$. The value of the correlation is given by

  • $E(x_iy^TQ^{-1}R_i)\,/\,\sqrt{\mathrm{Var}(x_i) \times \mathrm{Var}(R_i^TQ^{-1}y)}$
    $= R_i^TQ^{-1}R_i\,/\,\sqrt{\mathrm{diag}(P)_i \times R_i^TQ^{-1}R_i}$
    $= \sqrt{R_i^TQ^{-1}R_i\,/\,\mathrm{diag}(P)_i}$
5.14 $(g_{x|y})_i^2 = 1 - p_{ii}^{-1}\det([p_{ii},\ r_i^T;\ r_i,\ Q])\det(Q)^{-1}$
  • $1 - p_{ii}^{-1}\det([p_{ii},\ r_i^T;\ r_i,\ Q])\det(Q)^{-1}$
    $= 1 - p_{ii}^{-1}((p_{ii} - r_i^TQ^{-1}r_i)\det(Q))\det(Q)^{-1}$   [3.1]
    $= 1 - (1 - p_{ii}^{-1}r_i^TQ^{-1}r_i) = p_{ii}^{-1}r_i^TQ^{-1}r_i = (FR)_{ii}\,/\,p_{ii} = (g_{x|y})_i^2$
5.15 $d/dq\,(\ln p(X)) = 1_{[k\#1]}^T(X-M)^TS^{-1}\,dm/dq - \tfrac{1}{2}(kS^{-1} - S^{-1}(X-M)(X-M)^TS^{-1})^{:T}\,dS/dq$ where $M_{[n\#k]} = m \times 1_{[k\#1]}^T$.
  • $d/dq\,(\ln p(X)) = d/dm\,(\ln p(X))\,dm/dq + d/dP\,(\ln p(X))\,dP/dS\,dS/dq$ where $P = S^{-1}$
  • $d/dm\,(\ln p(X)) = d/dM\,(\ln p(X))\,dM/dm$
    $= -\tfrac{1}{2}\,d/dM\,(\mathrm{tr}((X-M)^TS^{-1}(X-M)))\,dM/dm$
    $= (S^{-1}(X-M))^{:T}\,dM/dm$   [Derivative of Trace]
    $= (S^{-1}(X-M))^{:T}(1_{[k\#1]} \otimes I)$   [Derivative of Linear Function]
    $= 1_{[k\#1]}^T(X-M)^TS^{-1}$   [Kronecker Product Identity]
  • $d/dP\,(\ln p(X)) = \tfrac{1}{2}k\,d/dP\,(\ln\det(2\pi P)) - \tfrac{1}{2}\,d/dP\,(\mathrm{tr}((X-M)^TP(X-M)))$
    $= \tfrac{1}{2}k(P^{-1})^{:T} - \tfrac{1}{2}\,d/dP\,(\mathrm{tr}((X-M)^TP(X-M)))$   [Derivative of Determinant]
    $= \tfrac{1}{2}k(P^{-1})^{:T} - \tfrac{1}{2}((X-M)(X-M)^T)^{:T}$   [Derivative of Quadratic]
  • $dP/dS = -(P \otimes P) = -(S^{-1} \otimes S^{-1})$   [Derivative of Inverse]
  • $d/dP\,(\ln p(X))\,dP/dS = -\tfrac{1}{2}(kS^{:T} - ((X-M)(X-M)^T)^{:T})(S^{-1} \otimes S^{-1})$
    $= -\tfrac{1}{2}(S^{-1}(kS - (X-M)(X-M)^T)S^{-1})^{:T}$
    $= -\tfrac{1}{2}(kS^{-1} - S^{-1}(X-M)(X-M)^TS^{-1})^{:T}$
5.16 $E(vv^T) = k\,((dm/dq)^TS^{-1}\,dm/dq + \tfrac{1}{2}(dS/dq)^T(S \otimes S)^{-1}\,dS/dq)$
where $v^T = 1_{[k\#1]}^T(X-M)^TS^{-1}\,dm/dq - \tfrac{1}{2}(k(S^{-1})^{:T} - ((X-M)(X-M)^T)^{:T}(S^{-1} \otimes S^{-1}))\,dS/dq$
  • Since $v$ is the sum of $k$ independent identically distributed zero-mean vectors, we can calculate $E(vv^T)$ for $k=1$ and then multiply the result by $k$. For $k=1$, $v$ simplifies to
    $v^T = (x-m)^TS^{-1}\,dm/dq + \tfrac{1}{2}(S - (x-m)(x-m)^T)^{:T}\,dP/dq$ where $P = S^{-1}$   (see [5.15])
  • When we form the product $vv^T$, the cross terms are cubic in $(x-m)$ and therefore have zero mean. We deal with the two squared terms in $vv^T$ separately. For the first term
    • $k\,E(((x-m)^TS^{-1}\,dm/dq)^T((x-m)^TS^{-1}\,dm/dq))$
      $= k\,E((dm/dq)^TS^{-1}(x-m)(x-m)^TS^{-1}\,dm/dq)$
      $= k\,E((dm/dq)^TS^{-1}\,dm/dq)$
  • Since the second term has zero mean, we can write
    • $\tfrac{1}{4}k\,E\{(dP/dq)^T(S - (x-m)(x-m)^T)^{:}\,(S - (x-m)(x-m)^T)^{:T}\,dP/dq\}$
      $= \tfrac{1}{4}k\,(dP/dq)^TE\{((x-m)(x-m)^T)^{:}\,((x-m)(x-m)^T)^{:T}\}\,dP/dq - \tfrac{1}{4}k\,(dP/dq)^TS^{:}S^{:T}\,dP/dq$
  • We define $T = \mathrm{TVEC}(n,n)$ so that $TS^{:} = T^TS^{:} = (S^T)^{:} = S^{:}$ since $S$ and $T$ are symmetric. [see vectorized transpose]
  • To calculate the quartic expectation $E\{((x-m)(x-m)^T)^{:}\,((x-m)(x-m)^T)^{:T}\}$, we use Wick's rule to take the three possible pairings of the $(x-m)$ terms and, treating $x$ and $y$ as distinct random vectors, we get:
    • $E\{((x-m)(x-m)^T)^{:}\,((x-m)(x-m)^T)^{:T}\}$
      $= E\{((x-m)(x-m)^T)^{:}((y-m)(y-m)^T)^{:T}\} + E\{((x-m)(y-m)^T)^{:}((x-m)(y-m)^T)^{:T}\} + E\{((x-m)(y-m)^T)^{:}((y-m)(x-m)^T)^{:T}\}$
      $= S^{:}S^{:T} + E\{((y-m) \otimes (x-m))((y-m) \otimes (x-m))^T\} + E\{((y-m) \otimes (x-m))((y-m) \otimes (x-m))^TT^T\}$   [see Kronecker product and vectorized transpose]
      $= S^{:}S^{:T} + E\{((y-m)(y-m)^T) \otimes ((x-m)(x-m)^T)\}(I + T^T)$
      $= S^{:}S^{:T} + (S \otimes S)(I + T^T)$
    • Substituting this into the second-term expression gives
      $\tfrac{1}{4}k\,(dP/dq)^T(S^{:}S^{:T} + (S \otimes S)(I + T^T))\,dP/dq - \tfrac{1}{4}k\,(dP/dq)^TS^{:}S^{:T}\,dP/dq$
      $= \tfrac{1}{4}k\,(dP/dq)^T(S \otimes S)(I + T^T)\,dP/dq$   [since the first and last terms cancel]
      $= \tfrac{1}{2}k\,(dP/dq)^T(S \otimes S)\,dP/dq$   [since the symmetry of $P$ means that $T^T\,dP/dq = dP/dq$]
  • Putting this all together and using $dP/dq = dP/dS\,dS/dq = -(P \otimes P)\,dS/dq = -(S^{-1} \otimes S^{-1})\,dS/dq$   [Derivative of Inverse], we obtain
    $J = E\{vv^T\} = k\,E((dm/dq)^TS^{-1}\,dm/dq) + \tfrac{1}{2}k\,(dP/dq)^T(S \otimes S)\,dP/dq$
    $= k\,E((dm/dq)^TS^{-1}\,dm/dq) + \tfrac{1}{2}k\,(dS/dq)^T(S^{-1} \otimes S^{-1})(S \otimes S)(S^{-1} \otimes S^{-1})\,dS/dq$
    $= k\,E((dm/dq)^TS^{-1}\,dm/dq) + \tfrac{1}{2}k\,(dS/dq)^T(S^{-1} \otimes S^{-1})\,dS/dq$
5.17 If $f_{[r\#1]}(X)$ is a function of $X$ with mean value $g(q)$, then $\mathrm{Cov}(f) \ge (dg/dq)\,J^{-1}(dg/dq)^T$
  • $g = E(f)$ means that $dg/dq = E(df/dq + fv^T) = E(fv^T)$ where $v^T = d/dq\,(\ln p(X))$; differentiating $g = \int f\,p\,dX$ under the integral sign gives $dg/dq = \int f\,(d\ln p/dq)\,p\,dX = E(fv^T)$ since $f$ does not depend on $q$ directly.
  • For any arbitrary vector $a$, we have the scalar equation (noting that $E(v) = 0$)
    $a^T(dg/dq)\,J^{-1}(dg/dq)^Ta = E(a^Tfv^TJ^{-1}(dg/dq)^Ta) = \mathrm{Cov}(a^Tf,\ a^T(dg/dq)\,J^{-1}v)$
  • Hence from the Cauchy-Schwarz inequality
    $(a^T(dg/dq)\,J^{-1}(dg/dq)^Ta)^2 \le \mathrm{Var}(a^Tf)\,\mathrm{Var}(a^T(dg/dq)\,J^{-1}v)$
    $= (a^T\mathrm{Cov}(f)a)\,(a^T(dg/dq)\,J^{-1}\mathrm{Cov}(v)\,J^{-1}(dg/dq)^Ta)$
    $= (a^T\mathrm{Cov}(f)a)\,(a^T(dg/dq)\,J^{-1}(dg/dq)^Ta)$   [since $\mathrm{Cov}(v) = J$]
  • Hence $a^T(dg/dq)\,J^{-1}(dg/dq)^Ta \le a^T\mathrm{Cov}(f)a$ for any $a$
  • So $\mathrm{Cov}(f) \ge (dg/dq)\,J^{-1}(dg/dq)^T$ where $\ge$ represents the Loewner partial order.
5.18 For even $n$, $E(\mathrm{prod}(x_{[n]}-m)) = ((\tfrac{n}{2})!)^{-1}2^{-n/2}\sum_v(s_{v(1),v(2)}s_{v(3),v(4)}\cdots s_{v(n-1),v(n)})$ where the sum is over all $n!$ permutations $v$ of the numbers $1{:}n$. An equivalent formula is to omit the normalizing factor $((\tfrac{n}{2})!)^{-1}2^{-n/2}$ and to restrict the summation to all distinct pairings of the numbers $1{:}n$.
  • Without loss of generality, we assume that $m = 0$. The characteristic function is $f(t) = E(\exp(jt^Tx)) = \exp(-\tfrac{1}{2}t^TSt)$ where $j = \sqrt{-1}$.
  • Differentiating $E(\exp(jt^Tx))$ shows that $E(\mathrm{prod}(x)) = j^{-n}\,d^nf/dt_1dt_2\cdots dt_n$ evaluated at $t=0$.
  • Expanding the power series $f(t) = \exp(-\tfrac{1}{2}t^TSt) = \sum_{r=0}^{\infty}(-\tfrac{1}{2}t^TSt)^r/r!$, term $r$ is a homogeneous polynomial of degree $2r$ in the elements of $t$. When differentiated $n$ times, terms with $r < n/2$ will become zero, whereas terms with $r > n/2$ will still contain powers of the elements of $t$ and will therefore be zero when $t$ is set to $0$.
  • It follows that when $n$ is odd, all terms vanish and $E(\mathrm{prod}(x)) = 0$.
  • When $n$ is even, $E(\mathrm{prod}(x)) = d^ng/dt_1dt_2\cdots dt_n$ evaluated at $t=0$, where $g(t) = j^{-n}(-\tfrac{1}{2}t^TSt)^{n/2}/(\tfrac{n}{2})! = 2^{-n/2}(t^TSt)^{n/2}/(\tfrac{n}{2})!$
  • We can expand $(t^TSt)^{n/2} = t^TSt \times t^TSt \times \cdots \times t^TSt$ and we now differentiate in turn with respect to $t_n, t_{n-1}, \ldots, t_1$. We note that $t$ occurs $n$ times in this expansion and that $dt/dt_i = e_i$, a column of the identity matrix.
  • When we differentiate with respect to $t_n$, we obtain a sum of $n$ terms, in each of which one of the occurrences of $t$ has been replaced by $e_n$ and the remaining $n-1$ occurrences remain as $t$.
  • If we now differentiate with respect to $t_{n-1}$, each of the $n$ terms in the previous step changes into a sum of $n-1$ terms in which one of the occurrences of $t$ is now replaced by $e_{n-1}$ and the remaining $n-2$ occurrences remain as $t$. We thus have a total of $n(n-1)$ terms.
  • We repeat this process for $t_{n-2}, \ldots, t_1$ and we end up with $n!$ terms of the form $(e_{v(1)})^TSe_{v(2)}(e_{v(3)})^TSe_{v(4)}\cdots(e_{v(n-1)})^TSe_{v(n)} = s_{v(1),v(2)}s_{v(3),v(4)}\cdots s_{v(n-1),v(n)}$ where $v$ runs through all $n!$ permutations of $1{:}n$.
  • Thus $E(\mathrm{prod}(x)) = ((\tfrac{n}{2})!)^{-1}2^{-n/2}\sum_v(s_{v(1),v(2)}s_{v(3),v(4)}\cdots s_{v(n-1),v(n)})$
  • Note that each term in the summation arises $(\tfrac{n}{2})!\,2^{n/2}$ times since the $\tfrac{n}{2}$ factors $s_{ij}$ can be rearranged in $(\tfrac{n}{2})!$ orders and, for each factor, $s_{ij} = s_{ji}$ since $S$ is symmetric. Thus an equivalent formula is to omit the normalizing factor $((\tfrac{n}{2})!)^{-1}2^{-n/2}$ and restrict the summation to all distinct pairings of the numbers $1{:}n$. This is known as Isserlis' theorem or Wick's theorem.
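
A Monte Carlo check of Isserlis'/Wick's theorem for the simplest nontrivial case $n = 4$ (illustrative values; agreement is to sampling accuracy):

```python
# Check [5.18] for n = 4: E(x1 x2 x3 x4) = s12 s34 + s13 s24 + s14 s23
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
S = M @ M.T + 4 * np.eye(4)
X = rng.multivariate_normal(np.zeros(4), S, size=2_000_000)

emp = np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3])
thy = S[0, 1] * S[2, 3] + S[0, 2] * S[1, 3] + S[0, 3] * S[1, 2]
print(emp, thy)    # equal to Monte Carlo accuracy
```
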
5.19 [$x$ real Gaussian, $m=0$] If $Y_n = \langle(xx^T)^n\rangle$, then $Y_1 = S$ and $Y_{n+1} = \mathrm{tr}(Y_n)S + 2nSY_n$
  • By definition, $Y_1 = S$. So we assume $n > 0$.
  • $Y_{n+1}$ consists of $2n+2$ $x$ terms multiplied together. By Wick's theorem [5.18] we can obtain $Y_{n+1}$ by considering all possible pairings of the $x$'s and treating each pair as an independent Gaussian vector.
  • We write $Y_{n+1} = \langle x(x^Tx)^nx^T\rangle$ and note that the central term, $(x^Tx)^n$, is a product of scalars which may therefore be reordered arbitrarily. If we denote the first $x$ by $x_1$, its pair can be in any of $2n+1$ positions; we consider three cases:
    1. The pair of $x_1$ can be the first $x$ in one of the scalars that form $(x^Tx)^n$; this gives $n$ terms of the form $\langle x_1(x_1^Tx)(x^Tx)^{n-1}x^T\rangle = \langle x_1x_1^T(xx^T)^n\rangle = SY_n$
    2. The pair of $x_1$ can be the second $x$ in one of the scalars that form $(x^Tx)^n$; this gives $n$ terms of the form $\langle x_1(x^Tx_1)(x^Tx)^{n-1}x^T\rangle = \langle x_1(x_1^Tx)(x^Tx)^{n-1}x^T\rangle = \langle x_1x_1^T(xx^T)^n\rangle = SY_n$
    3. The pair of $x_1$ can be the final $x$; this gives a single term of the form $\langle x_1(x^Tx)^nx_1^T\rangle = \langle(x^Tx)^nx_1x_1^T\rangle = \mathrm{tr}(Y_n)S$
  • Adding these $2n+1$ terms gives $Y_{n+1} = nSY_n + nSY_n + \mathrm{tr}(Y_n)S = \mathrm{tr}(Y_n)S + 2nSY_n$.
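
The first step of the recursion, $Y_2 = \mathrm{tr}(S)S + 2S^2$, can be checked by simulation (an illustrative sketch):

```python
# Check [5.19] for n = 1: Y_2 = <(x x^T)^2> = tr(S) S + 2 S^2
import numpy as np

rng = np.random.default_rng(5)
d = 3
M = rng.standard_normal((d, d))
S = M @ M.T + d * np.eye(d)
X = rng.multivariate_normal(np.zeros(d), S, size=2_000_000)

q = np.einsum('ni,ni->n', X, X)                   # x^T x for each sample
emp = np.einsum('n,ni,nk->ik', q, X, X) / len(X)  # <x (x^T x) x^T> = <(x x^T)^2>
thy = np.trace(S) * S + 2 * S @ S
print(np.max(np.abs(emp - thy)))                  # small, at Monte Carlo accuracy
```
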
5.20 [$x$ complex Gaussian, $m=0$] If $Y_n = \langle(xx^H)^n\rangle$, then $Y_1 = S$ and $Y_{n+1} = \mathrm{tr}(Y_n)S + nSY_n$
  • By definition, $Y_1 = S$. So we assume $n > 0$.
  • $Y_{n+1}$ consists of $2n+2$ $x$ terms multiplied together. By Wick's theorem [5.18] we can obtain $Y_{n+1}$ by considering all possible pairings of the $x$'s and treating each pair as an independent complex Gaussian vector.
  • We write $Y_{n+1} = \langle x(x^Hx)^nx^H\rangle$ and note that the central term, $(x^Hx)^n$, is a product of scalars which may therefore be reordered arbitrarily. If we denote the first $x$ by $x_1$, its pair can be in any of $2n+1$ positions; we consider three cases:
    1. The pair of $x_1$ can be the first $x$ in one of the scalars that form $(x^Hx)^n$; this gives $n$ terms of the form $\langle x_1(x_1^Hx)(x^Hx)^{n-1}x^H\rangle = \langle x_1x_1^H(xx^H)^n\rangle = SY_n$
    2. The pair of $x_1$ can be the second $x$ in one of the scalars that form $(x^Hx)^n$; this gives $n$ terms of the form $\langle x_1(x^Hx_1)(x^Hx)^{n-1}x^H\rangle = \langle x_1(x_1^Tx^C)(x^Hx)^{n-1}x^H\rangle = \langle x_1x_1^Tx^Cx^H(xx^H)^{n-1}\rangle = 0$   [since $E(x_1x_1^T) = 0$ by [5.3]]
    3. The pair of $x_1$ can be the final $x$; this gives a single term of the form $\langle x_1(x^Hx)^nx_1^H\rangle = \langle(x^Hx)^nx_1x_1^H\rangle = \mathrm{tr}(Y_n)S$
  • Adding these $2n+1$ terms gives $Y_{n+1} = nSY_n + 0 + \mathrm{tr}(Y_n)S = \mathrm{tr}(Y_n)S + nSY_n$.
5.21 If $y \sim kN(y; 0, 1)$ is restricted to the domain $p < y < q$, then $E(y) = r = (f(p)-f(q))/(F(q)-F(p))$ and $\mathrm{Var}(y) = v = 1 - (qf(q) - pf(p))/(F(q)-F(p)) - r^2$ where $k = 1/(F(q)-F(p))$, with $f(x)$ and $F(x)$ being the pdf and cdf of the standard Gaussian.

We first note that $f(-x) = f(x)$, $F(-x) = 1-F(x)$, $dF/dx = f(x)$ and $df/dx = -xf(x)$.
So, with all integrals taken from $p$ to $q$, we can write $\int f(x)dx = F(q)-F(p)$, $\int xf(x)dx = f(p)-f(q)$ and $\int x^2f(x)dx = F(q) - F(p) + pf(p) - qf(q)$ (the last by integrating $x \times xf(x)$ by parts).
From this, $E(y) = k\int xf(x)dx = (f(p)-f(q))/(F(q)-F(p)) = r$ and $\mathrm{Var}(y) = k\int x^2f(x)dx - r^2 = 1 - (qf(q) - pf(p))/(F(q)-F(p)) - r^2$.
For the special case $p = -\infty$: $f(p) = F(p) = 0$, so $E(y) = r = -f(q)/F(q)$ and $\mathrm{Var}(y) = v = 1 + r(q-r)$.
For the special case $q = +\infty$: $f(q) = 0$ and $F(q) = 1$, so $E(y) = r = f(p)/(1-F(p)) = f(p)/F(-p)$ and $\mathrm{Var}(y) = v = 1 + pf(p)/F(-p) - r^2$.
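
These truncated moments can be checked against scipy's truncated normal distribution; an illustrative sketch with assumed values of $p$ and $q$:

```python
# Check [5.21] against scipy.stats.truncnorm for p < y < q
import numpy as np
from scipy.stats import norm, truncnorm

p, q = -0.7, 1.4
Z = norm.cdf(q) - norm.cdf(p)                           # 1/k
r = (norm.pdf(p) - norm.pdf(q)) / Z                     # E(y)
v = 1 - (q * norm.pdf(q) - p * norm.pdf(p)) / Z - r**2  # Var(y)
print(r, truncnorm.mean(p, q))    # agree
print(v, truncnorm.var(p, q))     # agree
```
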
5.22 Suppose $y \sim kN(y; m, S)$ is restricted to the domain satisfying $b < a^Ty < c$ with $k$ a normalizing constant. Then $E(y) = m + grSa$ and $\mathrm{Cov}(y) = S - g^2vSa(Sa)^T$ where $g = 1/\sqrt{a^TSa}$, $p = g(b-a^Tm)$, $q = g(c-a^Tm)$, $r = (f(p)-f(q))/(F(q)-F(p))$, $v = (qf(q) - pf(p))/(F(q)-F(p)) + r^2$ and $f(q)$ and $F(q)$ are the pdf and cdf respectively of a standard 1-dimensional Gaussian with $f(q) = dF/dq$.

We use the Cholesky decomposition to find $T$ such that $S = TT^T$ and define $w = T^{-1}(y-m)$, which implies $y = Tw+m$.
So $w \sim kN(w; 0, T^{-1}ST^{-T}) = kN(w; 0, I)$ within the domain satisfying $b-a^Tm < a^TTw < c-a^Tm$.
Now find an orthogonal $Q$ such that $gQT^Ta = e$ where $e$ is the first column of the identity matrix and $g = 1/\sqrt{a^TTT^Ta} = 1/\sqrt{a^TSa} > 0$. Thus $Q$ is any orthogonal matrix whose first row is $ga^TT$. It follows that $gT^Ta = Q^Te$, which implies $ga^TTQ^T = e^T$.
Now we define $z = Qw$, which implies $w = Q^Tz$.
So $z \sim kN(z; 0, I)$ within the domain defined by $b-a^Tm < a^TTQ^Tz < c-a^Tm$, which is equivalent to $g(b-a^Tm) < e^Tz < g(c-a^Tm)$.
Since only the first element of $z$ is constrained and all elements are independent, we can use [5.21] to write $E(z) = re$ and $\mathrm{Cov}(z) = I - vee^T$ where $p = g(b-a^Tm)$, $q = g(c-a^Tm)$, $r = (f(p)-f(q))/(F(q)-F(p))$ and $v = (qf(q) - pf(p))/(F(q)-F(p)) + r^2$.
We have $y = TQ^Tz+m$ and so $E(y) = rTQ^Te+m$ and $\mathrm{Cov}(y) = TQ^T(I - vee^T)QT^T = TQ^TQT^T - vTQ^Tee^TQT^T$.
From above we have $gT^Ta = Q^Te$, which implies $TQ^Te = gTT^Ta = gSa$.
Using this substitution (along with $Q^TQ = I$) gives $E(y) = m + grSa$ and $\mathrm{Cov}(y) = S - g^2vSa(Sa)^T$ as required.
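
A rejection-sampling Monte Carlo check of [5.22]; the constraint limits, dimensions and sample size below are assumptions:

```python
# Check [5.22]: moments of y ~ N(m, S) restricted to b < a^T y < c
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 3
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)
m, a = rng.standard_normal((2, n))
b, c = -1.0, 2.0

X = rng.multivariate_normal(m, S, size=2_000_000)
t = X @ a
X = X[(t > b) & (t < c)]                         # rejection sampling

g = 1 / np.sqrt(a @ S @ a)
p, q = g * (b - a @ m), g * (c - a @ m)
Z = norm.cdf(q) - norm.cdf(p)
r = (norm.pdf(p) - norm.pdf(q)) / Z
v = (q * norm.pdf(q) - p * norm.pdf(p)) / Z + r**2

Sa = S @ a
print(np.max(np.abs(X.mean(axis=0) - (m + g * r * Sa))))
print(np.max(np.abs(np.cov(X, rowvar=False) - (S - g**2 * v * np.outer(Sa, Sa)))))
```
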
5.23 $\langle(Ax + a)(Bx + b)^H\rangle = ASB^H + (Am+a)(Bm+b)^H$
  • We define the zero-mean vector $u = x-m$ so that $\langle u\rangle = 0$ and $\langle uu^H\rangle = S$.
  • $\langle(Ax + a)(Bx + b)^H\rangle = \langle(Au + Am + a)(Bu + Bm + b)^H\rangle$
    $= \langle Auu^HB^H + Au(Bm+b)^H + (Am+a)u^HB^H + (Am+a)(Bm+b)^H\rangle$
    $= ASB^H + 0 + 0 + (Am+a)(Bm+b)^H = ASB^H + (Am+a)(Bm+b)^H$
5.24 $\langle(Ax+a)^H(Bx+b)\rangle = \langle\mathrm{tr}((Bx+b)(Ax+a)^H)\rangle = \mathrm{tr}(BSA^H) + (Am+a)^H(Bm+b)$
  • $\langle(Ax + a)^H(Bx + b)\rangle = \mathrm{tr}(\langle(Bx + b)(Ax + a)^H\rangle)$
    $= \mathrm{tr}(BSA^H) + \mathrm{tr}((Bm + b)(Am + a)^H)$   [5.23]
    $= \mathrm{tr}(BSA^H) + (Am + a)^H(Bm + b)$
5.25 $\mathrm{argmin}_K\langle\|(AKB+C)x\|^2\rangle = -(A^HA)^{-1}A^HC(S+mm^H)B^H(B(S+mm^H)B^H)^{-1}$
  • $\mathrm{argmin}_K\langle\|(AKB+C)x\|^2\rangle = \mathrm{argmin}_K\langle\mathrm{tr}\{(AKB+C)xx^H(AKB+C)^H\}\rangle$
    $= \mathrm{argmin}_K\{\mathrm{tr}((AKB+C)(S+mm^H)(AKB+C)^H)\}$   [since $\langle xx^H\rangle = S+mm^H$ by [5.23]]
    $= -(A^HA)^{-1}A^HC(S+mm^H)B^H(B(S+mm^H)B^H)^{-1}$   [2.7]
5.26 $\mathrm{argmin}_K\langle\|(AKB+C)x + (AKE+F)y\|^2\rangle = -(A^HA)^{-1}A^H(CS_xB^H+FS_yE^H)(BS_xB^H+ES_yE^H)^{-1}$
  • We can define $S = \langle[x; y][x; y]^H\rangle = [S_x\ 0;\ 0\ S_y]$ since $x$ and $y$ are independent and zero mean
  • $\mathrm{argmin}_K\langle\|(AKB+C)x + (AKE+F)y\|^2\rangle$
    $= \mathrm{argmin}_K\langle\|(AK[B\ E]+[C\ F])[x; y]\|^2\rangle$
    $= -(A^HA)^{-1}A^H[C\ F]S[B\ E]^H([B\ E]S[B\ E]^H)^{-1}$   [5.25]
    $= -(A^HA)^{-1}A^H(CS_xB^H+FS_yE^H)(BS_xB^H+ES_yE^H)^{-1}$
5.27 $\langle(Ax + a)(Bx + b)^T(Cx + c)(Dx + d)^T\rangle = (ASB^T+(Am+a)(Bm+b)^T)(CSD^T+(Cm+c)(Dm+d)^T) + (ASC^T+(Am+a)(Cm+c)^T)(BSD^T+(Bm+b)(Dm+d)^T) + (Bm+b)^T(Cm+c)\times(ASD^T - (Am+a)(Dm+d)^T) + \mathrm{tr}(BSC^T)\times(ASD^T + (Am+a)(Dm+d)^T)$
  • We define the zero-mean vector $u = x-m$ so that $\langle u\rangle = 0$ and $\langle uu^T\rangle = S$.
  • We now multiply out the quartic expression, omitting terms involving odd powers of $u$ since their mean is zero [5.18]:
    $\langle(Au + a)(Bu + b)^T(Cu + c)(Du + d)^T\rangle$
    $= \langle Auu^TB^TCuu^TD^T + Auu^TB^Tcd^T + Aub^TCud^T + Aub^Tcu^TD^T + au^TB^TCud^T + au^TB^Tcu^TD^T + ab^TCuu^TD^T + ab^Tcd^T\rangle$
    $= \langle Auu^TB^TCuu^TD^T + Auu^TB^Tcd^T + Auu^TC^Tbd^T + b^Tc\times Auu^TD^T + au^TB^TCud^T + ac^TBuu^TD^T + ab^TCuu^TD^T + ab^Tcd^T\rangle$
    where we have moved or transposed some scalar factors of the form $p^Tq$.
  • Defining $v$ to have the same statistics as $u$ but independent of it, we use Isserlis' theorem [5.18] to decompose the first term as
    $\langle Auu^TB^TCuu^TD^T\rangle = \langle Auu^TB^TCvv^TD^T\rangle + \langle Auv^TB^TCuv^TD^T\rangle + \langle Auv^TB^TCvu^TD^T\rangle$
  • By again moving or transposing scalar factors of the form $p^Tq$ we can rearrange to get
    $= \langle Auu^TB^TCvv^TD^T\rangle + \langle Auu^TC^TBvv^TD^T\rangle + \langle Auu^TD^T\times v^TB^TCv\rangle$
  • We now use the quadratic expectation theorems to give
    $= ASB^TCSD^T + ASC^TBSD^T + ASD^T\times\mathrm{tr}(B^TCS)$
  • Therefore we can write
    $\langle(Au + a)(Bu + b)^T(Cu + c)(Du + d)^T\rangle$
    $= ASB^TCSD^T + ASC^TBSD^T + ASD^T\times\mathrm{tr}(B^TCS) + ASB^Tcd^T + ASC^Tbd^T + b^Tc\times ASD^T + ad^T\times\mathrm{tr}(B^TCS) + ac^TBSD^T + ab^TCSD^T + ab^Tcd^T$
    $= (ASB^T+ab^T)(CSD^T+cd^T) + (ASC^T+ac^T)(BSD^T+bd^T) + b^Tc\times(ASD^T - ad^T) + \mathrm{tr}(BSC^T)\times(ASD^T + ad^T)$
    where we use $\mathrm{tr}(B^TCS) = \mathrm{tr}(BSC^T)$ and also add and subtract a copy of $ab^Tcd^T = ac^Tbd^T = b^Tc\times ad^T$.
  • Now we note that $Ax + a = Au + (Am + a)$, so if we replace $a$ by $Am+a$, $b$ by $Bm+b$ etc. in the above expression, we obtain
    $\langle(Ax + a)(Bx + b)^T(Cx + c)(Dx + d)^T\rangle = (ASB^T+(Am+a)(Bm+b)^T)(CSD^T+(Cm+c)(Dm+d)^T) + (ASC^T+(Am+a)(Cm+c)^T)(BSD^T+(Bm+b)(Dm+d)^T) + (Bm+b)^T(Cm+c)\times(ASD^T - (Am+a)(Dm+d)^T) + \mathrm{tr}(BSC^T)\times(ASD^T + (Am+a)(Dm+d)^T)$
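
The quartic expectation formula is easy to get wrong, so a Monte Carlo check is worthwhile; the sketch below compares [5.27] with simulation (all test values assumed, real case):

```python
# Check [5.27]: <(Ax+a)(Bx+b)^T(Cx+c)(Dx+d)^T> against simulation
import numpy as np

rng = np.random.default_rng(7)
n, N = 2, 2_000_000
Mt = rng.standard_normal((n, n))
S = Mt @ Mt.T + n * np.eye(n)
m = rng.standard_normal(n)
A, B, C, D = rng.standard_normal((4, n, n))
a, b, c, d = rng.standard_normal((4, n))

X = rng.multivariate_normal(m, S, size=N)
P1, P2, P3, P4 = X @ A.T + a, X @ B.T + b, X @ C.T + c, X @ D.T + d
s = np.einsum('nj,nj->n', P2, P3)                # (Bx+b)^T (Cx+c) per sample
emp = np.einsum('n,ni,nk->ik', s, P1, P4) / N

Am, Bm, Cm, Dm = A @ m + a, B @ m + b, C @ m + c, D @ m + d
ASB, ASC, ASD = A @ S @ B.T, A @ S @ C.T, A @ S @ D.T
BSD, CSD, BSC = B @ S @ D.T, C @ S @ D.T, B @ S @ C.T
thy = ((ASB + np.outer(Am, Bm)) @ (CSD + np.outer(Cm, Dm))
       + (ASC + np.outer(Am, Cm)) @ (BSD + np.outer(Bm, Dm))
       + (Bm @ Cm) * (ASD - np.outer(Am, Dm))
       + np.trace(BSC) * (ASD + np.outer(Am, Dm)))
print(np.max(np.abs(emp - thy)))                 # small, Monte Carlo accuracy
```
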
5.28 $\langle(Ax + a)^T(Bx + b)(Cx + c)^T(Dx + d)\rangle = \mathrm{tr}(AS(C^TD+D^TC)SB^T) + ((Am+a)^TB + (Bm+b)^TA)S(C^T(Dm+d) + D^T(Cm+c)) + (\mathrm{tr}(ASB^T)+(Am+a)^T(Bm+b))(\mathrm{tr}(CSD^T)+(Cm+c)^T(Dm+d))$
  • $\langle(Ax + a)^T(Bx + b)(Cx + c)^T(Dx + d)\rangle = \mathrm{tr}(\langle(Bx + b)(Cx + c)^T(Dx + d)(Ax + a)^T\rangle)$
  • so, from [5.27], we can write this as
    $= \mathrm{tr}((BSC^T+(Bm+b)(Cm+c)^T)(DSA^T+(Dm+d)(Am+a)^T) + (BSD^T+(Bm+b)(Dm+d)^T)(CSA^T+(Cm+c)(Am+a)^T) + (Cm+c)^T(Dm+d)\times(BSA^T - (Bm+b)(Am+a)^T) + \mathrm{tr}(CSD^T)\times(BSA^T + (Bm+b)(Am+a)^T))$
    $= \mathrm{tr}(BSC^TDSA^T) + \mathrm{tr}(BSC^T(Dm+d)(Am+a)^T) + \mathrm{tr}((Bm+b)(Cm+c)^T(DSA^T+(Dm+d)(Am+a)^T)) + \mathrm{tr}(BSD^TCSA^T) + \mathrm{tr}(BSD^T(Cm+c)(Am+a)^T) + \mathrm{tr}((Bm+b)(Dm+d)^T(CSA^T+(Cm+c)(Am+a)^T)) + (Cm+c)^T(Dm+d)\times\mathrm{tr}(BSA^T - (Bm+b)(Am+a)^T) + \mathrm{tr}(CSD^T)\times\mathrm{tr}(BSA^T + (Bm+b)(Am+a)^T)$
  • Now, noting that $\mathrm{tr}(pq^T) = q^Tp$ and also collecting together all terms that equal $(Am+a)^T(Bm+b)(Cm+c)^T(Dm+d)$, we can write
    $= \mathrm{tr}(BSC^TDSA^T) + (Am+a)^TBSC^T(Dm+d) + (Cm+c)^TDSA^T(Bm+b) + \mathrm{tr}(BSD^TCSA^T) + (Am+a)^TBSD^T(Cm+c) + (Dm+d)^TCSA^T(Bm+b) + (Cm+c)^T(Dm+d)\times\mathrm{tr}(BSA^T) + \mathrm{tr}(CSD^T)\times(\mathrm{tr}(BSA^T) + (Am+a)^T(Bm+b)) + (2-1)\times(Am+a)^T(Bm+b)(Cm+c)^T(Dm+d)$
    $= \mathrm{tr}(BSC^TDSA^T + BSD^TCSA^T) + (Am+a)^TBS(C^T(Dm+d) + D^T(Cm+c)) + (Bm+b)^TASD^T(Cm+c) + (Bm+b)^TASC^T(Dm+d) + (Cm+c)^T(Dm+d)\times\mathrm{tr}(ASB^T) + \mathrm{tr}(CSD^T)\times(\mathrm{tr}(ASB^T) + (Am+a)^T(Bm+b)) + (Am+a)^T(Bm+b)(Cm+c)^T(Dm+d)$
    $= \mathrm{tr}(ASD^TCSB^T + ASC^TDSB^T) + (Am+a)^TBS(C^T(Dm+d) + D^T(Cm+c)) + (Bm+b)^TAS(C^T(Dm+d) + D^T(Cm+c)) + (\mathrm{tr}(CSD^T) + (Cm+c)^T(Dm+d))\times(\mathrm{tr}(ASB^T) + (Am+a)^T(Bm+b))$
    $= \mathrm{tr}(AS(D^TC+C^TD)SB^T) + ((Am+a)^TB + (Bm+b)^TA)S(C^T(Dm+d) + D^T(Cm+c)) + (\mathrm{tr}(CSD^T) + (Cm+c)^T(Dm+d))(\mathrm{tr}(ASB^T) + (Am+a)^T(Bm+b))$

This page is part of The Matrix Reference Manual. Copyright © 1998-2022 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".