# Matrix Calculus


## Notation

• j is the square root of -1
• XR and XI are the real and imaginary parts of X = XR + jXI
• (XY)R = XRYR - XIYI
• (XY)I = XRYI + XIYR
• XC = XR - jXI is the complex conjugate of X
• XH = (XC)T = (XT)C is the Hermitian transpose of X
• X: denotes the long column vector formed by concatenating the columns of X (see vectorization).
• A⊗B = KRON(A,B), the Kronecker product
• A∘B is the Hadamard or elementwise product
• matrices and vectors A, B, C do not depend on X
• In = I[n#n] is the n#n identity matrix
• Tm,n = TVEC(m,n) is the vectorized transpose matrix, i.e. XT:=Tm,nX: for X[m,n]
• ∂Y/∂X and ∂Y/∂XC are partial derivatives with XC and X respectively held constant (note that XH=(XC)T)
• ∂Y/∂XR and ∂Y/∂XI are partial derivatives with XI and XR respectively held constant
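
The vectorized transpose matrix Tm,n is easy to construct explicitly. The NumPy sketch below (helper names `tvec` and `vec` are illustrative, not library functions; `vec` implements the column-stacking operator written X: above) builds Tm,n from its defining property XT: = Tm,n X:.

```python
import numpy as np

def vec(M):
    # column-stacking vectorization, written M: in this page
    return M.reshape(-1, order="F")

def tvec(m, n):
    # vectorized transpose matrix T_{m,n}: vec(X^T) = T_{m,n} vec(X) for X m#n
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            T[j + i * n, i + j * m] = 1.0  # entry X[i,j] moves to the slot of X^T[j,i]
    return T

X = np.arange(12.0).reshape(3, 4)          # any 3#4 matrix
assert np.allclose(tvec(3, 4) @ vec(X), vec(X.T))
assert np.allclose(tvec(4, 3) @ tvec(3, 4), np.eye(12))   # T_{n,m} T_{m,n} = I
```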

### Derivatives

In the main part of this page we express results in terms of differentials rather than derivatives for two reasons: they avoid notational disagreements and they cope easily with the complex case. In most cases however, the differentials have been written in the form dY: = dY/dX dX: so that the corresponding derivative may be easily extracted.

#### Derivatives with respect to a real matrix

If X is p#q and Y is m#n, then dY: = dY/dX dX: where the derivative dY/dX is a large mn#pq matrix. If X and/or Y are column vectors or scalars, then the vectorization operator : has no effect and may be omitted. dY/dX is also called the Jacobian Matrix of Y: with respect to X: and det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).
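
As a concrete check of dY: = dY/dX dX: (a NumPy sketch; `vec` is an illustrative helper implementing the column-stacking operator X:), take the linear function Y = AXB, whose derivative is dY/dX = BT⊗A; since Y is linear in X the relation holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def vec(M):
    return M.reshape(-1, order="F")   # column-stacking vectorization X:

A = rng.standard_normal((4, 3))       # Y = A X B, so X is 3#2 and Y is 4#5
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 5))
dX = rng.standard_normal((3, 2))

J = np.kron(B.T, A)                   # dY/dX = B^T (x) A, an mn#pq = 20#6 matrix
dY = A @ (X + dX) @ B - A @ X @ B     # exact change in Y (Y is linear in X)
assert np.allclose(vec(dY), J @ vec(dX))
```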

Although they do not generalise so well, other authors use alternative notations for the cases when X and Y are both vectors or when one is a scalar. In particular:

• dy/dx is sometimes written as a column vector rather than a row vector
• dy/dx is sometimes transposed from the above definition or else is sometimes written dy/dxT to emphasise the correspondence between the columns of the derivative and those of xT.
• dY/dx and dy/dX are often written as matrices rather than, as here, a column vector and row vector respectively. The matrix form may be converted to the form used here by appending : or :T respectively.

#### Derivatives with respect to a complex matrix

If X is complex then dY: = dY/dX dX: can only hold in general if Y(X) is an analytic function of X. This normally implies that Y(X) does not depend explicitly on XC or XH.

Even for non-analytic functions we can treat X and XC (with XH=(XC)T) as distinct variables and write uniquely dY: = ∂Y/∂X dX: + ∂Y/∂XC dXC: provided that Y is analytic with respect to X and XC individually (or equivalently with respect to XR and XI individually).  ∂Y/∂X is the Generalized Complex Derivative and ∂Y/∂XC is the Complex Conjugate Derivative [R.4, R.9]; their properties are studied in Wirtinger Calculus.

We define the generalized derivatives in terms of partial derivatives with respect to XR and XI:

• ∂Y/∂X = ½ (∂Y/∂XR - j ∂Y/∂XI)
• ∂Y/∂XC = (∂YC/∂X)C = ½ (∂Y/∂XR + j ∂Y/∂XI)

We have the following relationships for both analytic and non-analytic functions Y(X):

• The following are equivalent ways of saying that Y(X) is analytic:
• Y(X) is an analytic function of X
• dY: = ∂Y/∂X dX:
• ∂Y/∂XC = 0 for all X
• ∂Y/∂XR + j ∂Y/∂XI = 0 for all X (these are the Cauchy-Riemann equations)
• dY: = ∂Y/∂X dX: + ∂Y/∂XC dXC:
• ∂Y/∂XR = ∂Y/∂X + ∂Y/∂XC
• ∂Y/∂XI = j (∂Y/∂X - ∂Y/∂XC)
• ∂Y/∂XC = (∂YC/∂X)C
• Chain rule: If Z is a function of Y which is itself a function of X, then ∂Z/∂X = ∂Z/∂Y ∂Y/∂X. This is the same as for real derivatives.
• Real-valued: If Y(X) is real for all complex X, then
• ∂Y/∂XC = (∂Y/∂X)C
• dY: = 2(∂Y/∂X dX:)R
• If  Y(X) is real for all complex X and W(X) is analytic and if W(X)=Y(X) for all real-valued X, then ∂W/∂X = 2 (∂Y/∂X)R for all real X
• Example: If C=CH, y(x)=xHCx and w(x)=xTCx, then ∂y/∂x = xHC and ∂w/∂x = 2xTCR
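
These relationships are easy to verify numerically. The NumPy sketch below (illustrative, not from any library) forms the generalized complex derivative ∂y/∂x = ½(∂y/∂xR - j ∂y/∂xI) of y(x) = xHCx by central differences and compares it with the closed form xHC:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = M + M.conj().T                      # C = C^H, so y(x) = x^H C x is real
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

def y(z):
    return (z.conj() @ C @ z).real

h = 1e-6
E = np.eye(n)
dy_dxR = np.array([(y(x + h * e) - y(x - h * e)) / (2 * h) for e in E])
dy_dxI = np.array([(y(x + 1j * h * e) - y(x - 1j * h * e)) / (2 * h) for e in E])

wirtinger = 0.5 * (dy_dxR - 1j * dy_dxI)    # generalized complex derivative of y
assert np.allclose(wirtinger, x.conj() @ C, atol=1e-6)   # equals x^H C
```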

#### Complex Constrained Minimization

Suppose f(X) is a scalar real function of a complex matrix (or vector), X, and G(X) is a complex-valued matrix (or vector or scalar) function of X. To minimize f(X) subject to G(X)=0, we use complex Lagrange multipliers and minimize f(X)+tr(KHG(X))+tr(KTG(X)C) subject to G(X)=0. Hence we solve ∂f/∂X + ∂tr(KHG)/∂X + ∂tr(KTGC)/∂X = 0T subject to G(X)=0. If g(X) is a vector, this becomes ∂f/∂X + kH∂g/∂X + kT∂gC/∂X = 0T. If g(X) is a scalar, this becomes ∂f/∂X + kC∂g/∂X + k∂gC/∂X = 0T.

• Example: If f(x)=xHSx where S=SH and g(x)=aHx-1, then ∂f/∂x + kC∂g/∂x + k∂gC/∂x = xHS + kCaH + 0T = 0T, which implies Sx+ka=0 from which x=-kS-1a. Substituting this into the constraint, g(x)=aHx-1=0, gives -kaHS-1a = 1 from which k=-(aHS-1a)-1 (this is real since S=SH). Substituting this back into the expression for x gives x = S-1a(aHS-1a)-1.
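
The example above can be checked numerically. This NumPy sketch (illustrative) builds a random Hermitian positive-definite S, forms x = S-1a(aHS-1a)-1, and confirms that it satisfies the constraint and does at least as well as other feasible points:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = M @ M.conj().T + n * np.eye(n)      # Hermitian positive definite
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)

Sinv_a = np.linalg.solve(S, a)
x = Sinv_a / (a.conj() @ Sinv_a)        # claimed minimizer S^-1 a (a^H S^-1 a)^-1

assert np.isclose(a.conj() @ x, 1.0)    # constraint a^H x = 1 is satisfied
fx = (x.conj() @ S @ x).real
for _ in range(50):                     # no other feasible point does better
    z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    z = z / (a.conj() @ z)              # rescale so that a^H z = 1
    assert (z.conj() @ S @ z).real >= fx - 1e-9
```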

If f(X) is a real function of a complex matrix (or vector), X, then ∂f/∂XC = (∂f/∂X)C and we can define the complex-valued column vector grad(f(X)) = 2 (∂f/∂X)H = (∂f/∂XR + j ∂f/∂XI)T as the Complex Gradient Vector [R.9] with the properties listed below. If we use <-> to represent the vector mapping associated with the Complex-to-Real isomorphism, and X[m#n]: <-> y[2mn] where y is real, then grad(f(X)) <-> grad(f(y)) where the latter is the conventional grad function from vector calculus.
• grad(f(X)) is zero at an extreme value of f .
• grad(f(X)) points in the direction of steepest slope of f(X)
• The magnitude of the steepest slope is equal to |grad(f(X))|. Specifically, if g(X) = grad(f(X)), then lima->0 a-1( f(X+ag(X)) - f(X) ) = | g(X) |2
• grad(f(X)) is normal to the surface f(X) = constant which means that it can be used for gradient ascent/descent algorithms.

## Basic Properties

• We may write the following differentials unambiguously without parentheses:
• Transpose: dYT=d(YT)=(dY)T
• Hermitian Transpose: dYH=d(YH)=(dY)H
• Conjugate: dYC=d(YC)=(dY)C
• Linearity: d(Y+Z)=dY+dZ
• Chain Rule: If Z is a function of Y which is itself a function of X, then for both the normal and the generalized complex derivative: dZ: = dZ/dY dY: = dZ/dY dY/dX dX:
• Product Rule: d(YZ) =Y dZ + dY  Z
• d(YZ): = (I⊗Y) dZ: + (ZT⊗I) dY: = ((I⊗Y) dZ/dX + (ZT⊗I) dY/dX) dX:
• Hadamard Product: d(Y∘Z) = Y∘dZ + dY∘Z
• Kronecker Product: d(Y⊗Z) = Y⊗dZ + dY⊗Z
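
The vectorized product rule can be confirmed directly from the Kronecker identity vec(AXB) = (BT⊗A) vec(X); a NumPy sketch (with an illustrative `vec` helper):

```python
import numpy as np

rng = np.random.default_rng(3)

def vec(M):
    return M.reshape(-1, order="F")

Y, dY = rng.standard_normal((2, 3, 4))        # Y and an arbitrary differential dY
Z, dZ = rng.standard_normal((2, 4, 5))

lhs = vec(Y @ dZ + dY @ Z)                    # d(YZ): from the product rule
rhs = np.kron(np.eye(5), Y) @ vec(dZ) + np.kron(Z.T, np.eye(3)) @ vec(dY)
assert np.allclose(lhs, rhs)                  # (I(x)Y) dZ: + (Z^T(x)I) dY:
```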

## Differentials of Linear Functions

• d(Ax) = d(xTAT): =A dx
• d(xTa) = d(aTx) = aT dx
• d(bxTa) = baT dx
• d(AXB): = (A dX B): = (BT⊗A) dX:
• d(aTXb) = (b⊗a)T dX: = (abT):T dX:
• d(aTXa) = d(aTXTa) = (a⊗a)T dX: = (aaT):T dX:
• [X[m#n]] d(AX): = (In⊗A) dX:
• [X[m#n]] d(XB): = (dX B): = (BT⊗Im) dX:
• [x[n]] d(xbT): = (dx bT): = (b⊗In) dx
• d(AXTB): = (BT⊗A) dXT:
• d(aTXTb) = (a⊗b)T dX: = (abT):T dXT: = (baT):T dX:
• d(|x|) = |x|-1xT dx
• [x: Complex] d (xHA): = AT dxC
• d(X[m#n]⊗A[p#q]): = (In⊗Tq,m⊗Ip)(Imn⊗A:) dX: = (Inq⊗Tm,p)(In⊗A:⊗Im) dX:
• d(A[p#q]⊗X[m#n]): = (Iq⊗Tn,p⊗Im)(A:⊗Imn) dX: = (Tm,n⊗Ipq)(In⊗A:⊗Im) dX:

• d(Ax+b)TC(Dx+e) = ((Ax+b)TCD + (Dx+e)TCTA) dx
• d(xTCx) = xT(C+CT)dx  = [C=CT] 2xTCdx
• d(xTx) = 2xTdx
• d(Ax+b)T (Dx+e) =  ( (Ax+b)TD + (Dx+e)TA)dx
• d(Ax+b)TC(Ax+b) = [C=CT] 2(Ax+b)TCA dx
• d(Ax+b)HC(Dx+e) = (Ax+b)HCD dx + (Dx+e)TCTAC dxC
• d(Ax+b)HC(Ax+b) = (Ax+b)HCA dx + (Ax+b)TCTAC dxC = [C=CH] 2((Ax+b)HCA dx)R
• d(Ax+b)H(Ax+b) = 2((Ax+b)HA dx)R
• d (xHCx) = xHC dx +xTCT dxC = [C=CH] 2(xHC dx)R
• d (xHx) = 2(xH dx)R
• d(aTXTXb) = (X(abT + baT)):T dX:
• d(aTXTXa) = 2(XaaT ):T dX:
• d(aTXTCXb) = (CTXabT + CXbaT):T dX:
• d(aTXTCXa) = ((C + CT)XaaT ):T dX: = [C=CT] 2(CXaaT):T dX:
• d((Xa+b)TC(Xa+b)) = ((C+CT)(Xa+b)aT ):T dX:
• [X[n#n]] d(X2): = (XdX + dX X): = (In⊗X + XT⊗In) dX:
• [X[m#n]] d(XTCX): = (In⊗XTC) dX: + (XTCT⊗In) dXT: = (In⊗XTC + Tn,n(In⊗XTCT)) dX:
• [X[m#n], C[m#m]=CT] d(XTCX): =  (In×n+Tn,n)(In⊗ XTC) dX:
• [X[m#n]] d(XTX): = (In⊗XT) dX: + (XT⊗In) dXT: = (In×n + Tn,n)(In⊗XT) dX:
• [X[m#n]] d(XHCX): = (XHCdX): + (d(XH) CX): = (In⊗XHC) dX: + (XTCT⊗In) dXH:
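
As a check of the d(XTX): entry above, the sketch below (illustrative `vec` and `tvec` helpers; `tvec` builds the vectorized transpose matrix Tm,n) verifies d(XTX): = (In×n + Tn,n)(In⊗XT) dX::

```python
import numpy as np

rng = np.random.default_rng(4)

def vec(M):
    return M.reshape(-1, order="F")

def tvec(m, n):
    # vectorized transpose matrix: vec(X^T) = T_{m,n} vec(X) for X m#n
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            T[j + i * n, i + j * m] = 1.0
    return T

m, n = 4, 3
X = rng.standard_normal((m, n))
dX = rng.standard_normal((m, n))

lhs = vec(X.T @ dX + dX.T @ X)    # first-order term of (X+dX)^T (X+dX) - X^T X
rhs = (np.eye(n * n) + tvec(n, n)) @ np.kron(np.eye(n), X.T) @ vec(dX)
assert np.allclose(lhs, rhs)
```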

## Differentials of Cubic Products

• d(xxTAx) = (xxT(A+AT)+xTAx×I )dx
• d(xxTx) = (2xxT+xTx×I )dx
• [X[m#n]] d(XAXTBX): = (XTBTXAT⊗Im + In⊗XAXTB) dX: + (XTBT⊗XA) dXT: = (XTBTXAT⊗Im + Tn,m(XA⊗XTBT) + In⊗XAXTB) dX:
• [X[m#n]] d(XXTX): = (XTX⊗Im + In⊗XXT) dX: + (XT⊗X) dXT: = (XTX⊗Im + Tn,m(X⊗XT) + In⊗XXT) dX:
• [X[m#n]] d(XAXBX): = (XTBTXTAT⊗Im + XTBT⊗XA + In⊗XAXB) dX:
• [X[n#n]] d(X3): = ((XT)2⊗In + XT⊗X + In⊗X2) dX:
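
A quick NumPy check of the d(X3): entry (illustrative `vec` helper):

```python
import numpy as np

rng = np.random.default_rng(10)

def vec(M):
    return M.reshape(-1, order="F")

n = 3
X = rng.standard_normal((n, n))
dX = rng.standard_normal((n, n))

lhs = vec(dX @ X @ X + X @ dX @ X + X @ X @ dX)   # first-order term of d(X^3)
J = (np.kron(X.T @ X.T, np.eye(n))                # (X^T)^2 (x) I_n
     + np.kron(X.T, X)                            # X^T (x) X
     + np.kron(np.eye(n), X @ X))                 # I_n (x) X^2
assert np.allclose(lhs, J @ vec(dX))
```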

## Differentials of Inverses

• d(X-1) = -X-1dX X-1  [2.1]
• d(X-1): = -(X-T⊗X-1) dX:
• d(aTX-1b) = -(X-TabTX-T):T dX: = -(abT):T (X-T⊗X-1) dX: [2.9]
• d(tr(ATX-1B)) = d(tr(BTX-TA)) = -(X-TABTX-T):T dX: = -(ABT):T (X-T⊗X-1) dX:
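
The first identity can be checked by a small perturbation (NumPy sketch, illustrative): for a small dX, inv(X+dX) - inv(X) should agree with -X-1 dX X-1 up to second-order terms.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well away from singular
dX = 1e-7 * rng.standard_normal((n, n))

Xinv = np.linalg.inv(X)
finite_diff = np.linalg.inv(X + dX) - Xinv
assert np.allclose(finite_diff, -Xinv @ dX @ Xinv, atol=1e-10)
```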

## Differentials of Trace

Note: matrix dimensions must result in an n#n argument for tr().

• d(tr(Y))=tr(dY)
• d(tr(X)) = d(tr(XT)) = I:T dX:   [2.4]
• d(tr(Xk)) =k(Xk-1)T:T dX:
• d(tr(AXk)) = (SUMr=0:k-1(XrAXk-r-1)T  ):T dX:
• d(tr(AX-1B)) = -(X-1BAX-1)T:T dX:= -(X-TATBTX-T):T dX:   [2.5]
• d(tr(AX-1)) =d(tr(X-1A)) = -(X-TATX-T ):T dX:
• d(tr(ATXBT)) = d(tr(BXTA)) = (AB):T dX:    [2.4]
• d(tr(XAT)) = d(tr(ATX)) =d(tr(XTA)) = d(tr(AXT)) = A:T dX:
• d(tr(ATX-1BT)) = d(tr(BX-TA)) = -(X-TABX-T):T dX: = -(AB):T (X-T⊗X-1) dX:
• d(tr(AXBXTC)) = (ATCTXBT + CAXB):T dX:
• d(tr(XAXT)) = d(tr(AXTX)) = d(tr(XTXA)) =( X(A+AT)):T dX:
• d(tr(XTAX)) = d(tr(AXXT)) = d(tr(XXTA)) = ((A+AT)X):T dX:
• d(tr(XXT))  = d(tr(XTX)) = 2X:T dX:
• d(tr(AXBX)) = (ATXTBT + BTXTAT ):T dX:
• d(tr((AXb+c)(AXb+c)T)) = 2(AT(AXb+c)bT):T dX:
• [C=CT] d(tr((XTCX)-1A)) = d(tr(A(XTCX)-1)) = -((CX(XTCX)-1)(A+AT)(XTCX)-1):T dX:
• [B=BT, C=CT] d(tr((XTCX)-1(XTBX))) = d(tr((XTBX)(XTCX)-1)) = 2(BX(XTCX)-1 - (CX(XTCX)-1)XTBX(XTCX)-1):T dX:
• [D=DH] d(tr((AXB+C)D(AXB+C)H)) = ((2AH(AXB+C)DBH):H dX:)R    [2.6]
• d(tr((AXB+C)(AXB+C)H)) = ((2AH(AXB+C)BH):H dX:)R
• [D=DH] d(tr(XDXH)) = ((2XD):H dX:)R
• d(tr(XXH)) = (2X:H dX:)R
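
Two of the trace identities above, checked numerically (NumPy sketch, illustrative `vec` helper): d(tr(XAT)) = A:T dX:, which is exact since the trace is linear in X, and d(tr(Xk)) = k(Xk-1)T:T dX: via a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(6)

def vec(M):
    return M.reshape(-1, order="F")

A = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))
dX = rng.standard_normal((3, 4))
d_tr = np.trace((X + dX) @ A.T) - np.trace(X @ A.T)
assert np.allclose(d_tr, vec(A) @ vec(dX))            # d(tr(X A^T)) = A:^T dX:

k, n = 3, 4
Xs = rng.standard_normal((n, n))
dXs = 1e-7 * rng.standard_normal((n, n))
pw = np.linalg.matrix_power
d_trk = np.trace(pw(Xs + dXs, k)) - np.trace(pw(Xs, k))
assert np.allclose(d_trk, k * vec(pw(Xs, k - 1).T) @ vec(dXs), atol=1e-9)
```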

## Trace Minimization

In the following expressions M# denotes the inverse of M or, if M is singular, any generalized inverse (including the pseudoinverse).

• [D=DH] argminX{tr((AXB+C)D(AXB+C)H)} = -(AHA)#AHCDBH(BDBH)#  [2.7]
• [D=DH] argminX{tr((AX+C)D(AX+C)H)} = -(AHA)#AHC
• [D=DH] argminX{tr((AXB+C)HD(AXB+C))} = -(AHDA)#AHDCBH(BBH)#
• [D=DH] argminX{tr((AX+C)HD(AX+C))} = -(AHDA)#AHDC
• [D=DH] argminx{(Ax+c)HD(Ax+c)} = -(AHDA)#AHDc
• [D=DH, R=RH] argminX{tr((AXB+C)D(AXB+C)H+(AXP+Q)R(AXP+Q)H)} = -(AHA)#AH(CDBH+QRPH)(BDBH+PRPH)#
• [D=DH, R=RH] argminX{tr((AX+C)D(AX+C)H+(AX+Q)R(AX+Q)H)} = -(AHA)#AH(CD+QR)(D+R)#
• [D=DH, R=RH] argminX{tr((AXB+C)D(AXB+C)H+(AX)R(AX)H)} = -(AHA)#AH(CDBH)(BDBH+R)#
• [D=DH, R=RH] argminX{tr((XB+C)D(XB+C)H+XRXH)} = -(CDBH)(BDBH+R)#
• [D=DH] argminX{tr((AXB+C)D(AXB+C)H) | EXF=G} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHCDBH(BDBH)#F+G}{FH(BDBH)#F}#FH - AHCDBH)(BDBH)#  [2.8]
• [D=DH] argminX{tr((AX+C)D(AX+C)H) | EX=G} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHC+G} - AHC)
• [D=DH] argminX{tr((Ax+c)D(Ax+c)H) | Ex=g} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHc+g} - AHc)
• [D=DH] argminX{tr((AXB+C)HD(AXB+C)) | EXF=G} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDCBH(BBH)#F+G}{FH(BBH)#F}#FH - AHDCBH)(BBH)#
• [D=DH] argminX{tr((AX+C)HD(AX+C)) | EX=G} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDC+G}- AHDC)
• [D=DH] argminx{(Ax+c)HD(Ax+c) | Ex=g} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDc+g}- AHDc)
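
For D = I the second entry reduces to ordinary least squares, so it can be cross-checked against a standard solver. A NumPy sketch (illustrative): tr((AX+C)(AX+C)H) is the squared Frobenius norm of AX+C, whose minimizer -(AHA)#AHC should match numpy.linalg.lstsq applied to A and -C.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3))
C = rng.standard_normal((6, 2))

X_formula = -np.linalg.pinv(A.T @ A) @ A.T @ C       # -(A^H A)# A^H C (real case)
X_lstsq = np.linalg.lstsq(A, -C, rcond=None)[0]      # argmin ||A X + C||_F
assert np.allclose(X_formula, X_lstsq)
```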

## Differentials of Determinant

Note: matrix dimensions must result in an n#n argument for det(). Some of the expressions below involve inverses: these forms apply only if the quantity being inverted is square and non-singular; alternative forms involving the adjoint, ADJ(), do not have the non-singular requirement.

• d(det(X)) = d(det(XT)) = ADJ(XT):T dX: = det(X) (X-T):T dX:  [2.10]
• d(det(ATXB)) = d(det(BTXTA)) = (A ADJ(ATXB)TBT):T dX: = [A,B: nonsingular] det(ATXB) × (X-T):T dX: [2.11]
• d(ln(det(ATXB))) = [A,B: nonsingular] (X-T):T dX: [2.12]
• d(ln(det(X))) = (X-T):T dX:
• d(det(Xk)) = d(det(X)k) = k × det(Xk) × (X-T):T dX: [2.13]
• d(ln(det(Xk))) = k × (X-T):T dX:
• d(det(XTCX)) = [C=CT] 2(CX ADJ(XTCX)):T dX: = 2det(XTCX)×(CX(XTCX)-1):T dX: [2.14]
• = [C=CT, CX: nonsingular] 2det(XTCX)×(X-T):T dX:
• d(ln(det(XTCX))) = [C=CT] 2(CX(XTCX)-1):T dX:
•  = [C=CT, CX: nonsingular] 2(X-T):T dX:
• d(det(XHCX)) = det(XHCX) × ((CTXC (XTCTXC)-1):T dX: + (CX(XHCX)-1):T dXC:) [2.15]
• d(ln(det(XHCX))) = (CTXC (XTCTXC)-1):TdX:   + (CX(XHCX)-1):T dXC: [2.16]
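
A perturbation check of d(ln(det(X))) = (X-T):T dX: (NumPy sketch, illustrative `vec` helper; slogdet is used for numerical safety):

```python
import numpy as np

rng = np.random.default_rng(8)

def vec(M):
    return M.reshape(-1, order="F")

n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # safely non-singular
dX = 1e-7 * rng.standard_normal((n, n))

d_logdet = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
assert np.allclose(d_logdet, vec(np.linalg.inv(X).T) @ vec(dX), atol=1e-10)
```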

## Jacobian

dY/dX is called the Jacobian Matrix of Y: with respect to X: and JX(Y)=det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).

• JX(X[n#n]-1)= (-1)ndet(X)-2n
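
This Jacobian follows from d(X-1): = -(X-T⊗X-1) dX: together with det(A⊗B) = det(A)n det(B)n for n#n A and B; a NumPy sketch (illustrative) confirms it for n = 3:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)       # safely non-singular

J = -np.kron(np.linalg.inv(X).T, np.linalg.inv(X))    # dY/dX for Y = X^-1
expected = (-1) ** n * np.linalg.det(X) ** (-2 * n)
assert np.isclose(np.linalg.det(J), expected, rtol=1e-6, atol=0.0)
```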

## Hessian matrix

If f is a real function of x then the Hermitian matrix Hx  f = (d/dx (df/dx)H)T  is the Hessian matrix of f(x). A value of x for which grad f(x) = 0 corresponds to a minimum, maximum or saddle point according to whether Hx f is positive definite, negative definite or indefinite.

• [Real] Hx  f = d/dx (df/dx)T
• Hx  f is symmetric
• Hx (aTx) = 0
• Hx (Ax+b)TC(Dx+e) = ATCD + DTCTA
• Hx (Ax+b)T (Dx+e) = ATD + DTA
• Hx (Ax+b)TC(Ax+b)  = AT(C + CT)A = [C=CT] 2ATCA
• Hx (Ax+b)T (Ax+b) = 2ATA
• Hx (xTCx) = C+CT  = [C=CT] 2C
• Hx (xTx) = 2I
• [x: Complex] Hx  f  = (d/dx (df/dx)H)T = d/dxC (df/dx)T
• Hx  f is hermitian
• Hx (Ax+b)HC(Ax+b) = [C=CH] (AHCA)T  [2.17]
• Hx (xHCx) = [C=CH] CT
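
A finite-difference check of Hx(xTCx) = C + CT (NumPy sketch, illustrative; since f is quadratic the second difference is exact up to rounding):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3
C = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f(z):
    return z @ C @ z

h = 1e-3
E = np.eye(n)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # second-order difference for the (i,j) Hessian entry
        H[i, j] = (f(x + h * E[i] + h * E[j]) - f(x + h * E[i])
                   - f(x + h * E[j]) + f(x)) / h ** 2
assert np.allclose(H, C + C.T, atol=1e-6)
```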

This page is part of The Matrix Reference Manual. Copyright © 1998-2021 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".
Updated: 2021-01-05