Matrix Calculus

Contents of Calculus Section

Notation
Differentials of Linear, Quadratic and Cubic Products
Differentials of Inverses, Trace and Determinant
Hessian matrices

Notation

j is the square root of -1
X^R and X^I are the real and imaginary parts of X = X^R + jX^I
- (XY)^R = X^RY^R- X^IY^I
- (XY)^I = X^RY^I + X^IY^R
X^C = X^R - jX^I is the complex conjugate of X
X^H=(X^C)^T=(X^T)^C is the Hermitian transpose of X
X: denotes the long column vector formed by concatenating the columns of X (see vectorization).
A ⊗ B = KRON(A,B), the Kronecker product
A • B the Hadamard or elementwise product
A ÷ B the elementwise quotient
matrices and vectors A, B, C do not depend on X
I_n = I_[n#n] the n#n identity matrix
T_m,n = TVEC(m,n) is the vectorized transpose matrix, i.e. X^T: = T_m,n X: for X_[m#n]
∂Y/∂X and ∂Y/∂X^C are partial derivatives with X^C and X respectively held constant (note that X^H=(X^C)^T)
∂Y/∂X^R and ∂Y/∂X^I are partial derivatives with X^I and X^R respectively held constant
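
The notation above can be exercised numerically. The following is a minimal sketch, assuming NumPy; the helper names `vec` and `tvec` are illustrative stand-ins for the ":" operator and TVEC(m,n) and are not part of the manual.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

def tvec(m, n):
    """Hypothetical helper: T_m,n with vec(X.T) == tvec(m, n) @ vec(X) for X of size m#n."""
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # X[i, j] is entry i + j*m of vec(X) and entry j + i*n of vec(X.T)
            T[j + i * n, i + j * m] = 1.0
    return T

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
Y = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))

transpose_ok = np.allclose(vec(X.T), tvec(2, 3) @ vec(X))       # X^T: = T_m,n X:
hermitian_ok = np.allclose(X.conj().T, X.T.conj())              # X^H = (X^C)^T = (X^T)^C
real_part_ok = np.allclose((X @ Y).real, X.real @ Y.real - X.imag @ Y.imag)
imag_part_ok = np.allclose((X @ Y).imag, X.real @ Y.imag + X.imag @ Y.real)
```

Column-major ordering (`order="F"`) is what makes `vec` agree with the column-concatenation definition of X:.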

Derivatives

In the main part of this page we express results in terms of differentials rather than derivatives for two reasons: they avoid notational disagreements and they cope easily with the complex case. In most cases however, the differentials have been written in the form dY: = dY/dX dX: so that the corresponding derivative may be easily extracted.

Derivatives with respect to a real matrix

If X is p#q and Y is m#n, then dY: = dY/dX dX: where the derivative dY/dX is a large mn#pq matrix. If X and/or Y are column vectors or scalars, then the vectorization operator : has no effect and may be omitted. dY/dX is also called the Jacobian Matrix of Y: with respect to X: and det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).

Although they do not generalise so well, other authors use alternative notations for the cases when X and Y are both vectors or when one is a scalar. In particular:

dy/dx is sometimes written as a column vector rather than a row vector
dy/dx is sometimes transposed from the above definition or else is sometimes written dy/dx^T to emphasise the correspondence between the columns of the derivative and those of x^T.
dY/dx and dy/dX are often written as matrices rather than, as here, a column vector and row vector respectively. The matrix form may be converted to the form used here by appending : or :^T respectively.
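
As an illustration of the mn#pq layout of dY/dX, the sketch below estimates the derivative by central differences and compares it with the closed form for Y = AXB. It assumes NumPy; `numerical_jacobian` is a hypothetical helper written for this example.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

def numerical_jacobian(func, X, eps=1e-6):
    """Central-difference estimate of dY/dX as an (mn)#(pq) matrix."""
    p, q = X.shape
    cols = []
    for k in range(p * q):
        step = np.zeros(p * q)
        step[k] = eps
        dX = step.reshape((p, q), order="F")     # perturb the k-th entry of X:
        cols.append(vec(func(X + dX) - func(X - dX)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2))
X = rng.standard_normal((2, 4))                  # p#q = 2#4, so Y = AXB is m#n = 3#2

J = numerical_jacobian(lambda M: A @ M @ B, X)   # expected to match B^T ⊗ A
```

Each column of `J` is the response of Y: to a unit change in one element of X:, which is why the result has mn rows and pq columns.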

Derivatives with respect to a complex matrix

If X is complex then dY: = dY/dX dX: holds in general iff Y(X) is an analytic function. This normally implies that Y(X) does not depend explicitly on X^C or X^H.

Even for non-analytic functions we can treat X and X^C (with X^H=(X^C)^T) as distinct variables and write uniquely dY: = ∂Y/∂X dX: + ∂Y/∂X^C dX^C: provided that Y is analytic with respect to X and X^C individually (or equivalently with respect to X^R and X^I individually). ∂Y/∂X is the Generalized Complex Derivative and ∂Y/∂X^C is the Complex Conjugate Derivative [R.4, R.9]; their properties are studied in Wirtinger Calculus.

We define the generalized derivatives in terms of partial derivatives with respect to X^R and X^I:

∂Y/∂X = ½ (∂Y/∂X^R - j ∂Y/∂X^I)
∂Y/∂X^C = (∂Y^C/∂X)^C = ½ (∂Y/∂X^R + j ∂Y/∂X^I)
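
For a scalar x the two definitions can be checked directly. The sketch below uses only built-in complex arithmetic; `wirtinger` is a hypothetical helper and the test functions x³ and x x^C are chosen purely for illustration.

```python
def wirtinger(f, x, eps=1e-6):
    """Estimate (df/dx, df/dx^C) from central differences along the real and imaginary axes."""
    d_re = (f(x + eps) - f(x - eps)) / (2 * eps)              # partial derivative w.r.t. x^R
    d_im = (f(x + 1j * eps) - f(x - 1j * eps)) / (2 * eps)    # partial derivative w.r.t. x^I
    return 0.5 * (d_re - 1j * d_im), 0.5 * (d_re + 1j * d_im)

x0 = 0.7 - 1.2j
analytic = wirtinger(lambda x: x ** 3, x0)                    # expect (3 x0^2, 0)
non_analytic = wirtinger(lambda x: x * x.conjugate(), x0)     # expect (x0^C, x0)
```

The analytic function has a vanishing conjugate derivative, while |x|² depends on both x and x^C.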

We have the following relationships for both analytic and non-analytic functions Y(X):

The following are equivalent ways of saying that Y(X) is analytic:
- Y(X) is an analytic function of X
- dY: = ∂Y/∂X dX:
- ∂Y/∂X^C = 0 for all X
- ∂Y/∂X^R + j ∂Y/∂X^I = 0 for all X (these are the Cauchy Riemann equations)
dY: = ∂Y/∂X dX: + ∂Y/∂X^C dX^C:
∂Y/∂X^R = ∂Y/∂X + ∂Y/∂X^C
∂Y/∂X^I = j (∂Y/∂X - ∂Y/∂X^C)
∂Y/∂X^C = (∂Y^C/∂X)^C
Chain rule: If Z is a function of Y which is itself a function of X, then ∂Z/∂X = ∂Z/∂Y ∂Y/∂X + ∂Z/∂Y^C ∂Y^C/∂X. If Z is an analytic function of Y, the second term vanishes and the rule is the same as for real derivatives.
Real-valued: If Y(X) is real for all complex X, then
- ∂Y/∂X^C= (∂Y/∂X)^C
- dY: = 2(∂Y/∂X dX:)^R
- If Y(X) is real for all complex X and W(X) is analytic and if W(X)=Y(X) for all real-valued X, then ∂W/∂X = 2 (∂Y/∂X)^R for all real X
  - Example: If C=C^H, y(x)=x^HCx and w(x)=x^TCx, then ∂y/∂x = x^HC and ∂w/∂x = 2x^TC^R
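
The real-valued relations can be checked on the example y(x)=x^HCx. This is a minimal sketch assuming NumPy, with a randomly generated Hermitian C and an arbitrary small step dx.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = M + M.conj().T                           # Hermitian: C = C^H
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
dx = 1e-6 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

y = lambda v: (v.conj() @ C @ v).real        # x^H C x is real when C = C^H
dy_exact = y(x + dx) - y(x)
dy_formula = 2 * (x.conj() @ C @ dx).real    # 2 (∂y/∂x dx)^R with ∂y/∂x = x^H C

# For real x, the analytic w(x) = x^T C x has ∂w/∂x = x^T (C + C^T)
xr = rng.standard_normal(n)
dw_dx = xr @ (C + C.T)
dw_dx_from_y = 2 * xr @ C.real               # 2 (∂y/∂x)^R evaluated at real x
```

The two values of dy differ only by the second-order term dx^H C dx.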

Complex Constrained Minimization

Suppose f(X) is a scalar real function of a complex matrix (or vector), X, and G(X) is a complex-valued matrix (or vector or scalar) function of X. To minimize f(X) subject to G(X)=0, we use complex Lagrange multipliers and minimize f(X)+tr(K^HG(X))+tr(K^TG(X)^C) subject to G(X)=0. Hence we solve ∂f/∂X+∂tr(K^HG)/∂X+∂tr(K^TG^C)/∂X = 0^T subject to G(X)=0. If g(X) is a vector, this becomes ∂f/∂X+k^H∂g/∂X+k^T∂g^C/∂X = 0^T. If g(X) is a scalar, this becomes ∂f/∂X+k^C∂g/∂X+k∂g^C/∂X = 0^T.

Example: If f(x)=x^HSx where S=S^H and g(x)=a^Hx-1, then ∂f/∂x+k^C∂g/∂x+k∂g^C/∂x=x^HS+k^Ca^H+0^T=0^T which implies Sx+ka=0 from which x=-kS^-1a. Substituting this into the constraint, g(x)=a^Hx-1=0, gives -ka^HS^-1a = 1 from which k=-(a^HS^-1a)^-1. Substituting this back into the expression for x gives x = S^-1a(a^HS^-1a)^-1.
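
The closed-form solution of the example can be sanity-checked numerically. The sketch below assumes NumPy and, as an extra assumption not stated in the text, takes S to be positive definite so that a minimum exists.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = M @ M.conj().T + n * np.eye(n)       # Hermitian positive definite (assumption)
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)

Sinv_a = np.linalg.solve(S, a)
x_opt = Sinv_a / (a.conj() @ Sinv_a)     # x = S^-1 a (a^H S^-1 a)^-1

f = lambda v: (v.conj() @ S @ v).real    # f(x) = x^H S x
constraint = a.conj() @ x_opt - 1        # g(x) = a^H x - 1

# A feasible perturbation d must satisfy a^H d = 0, so project a random d accordingly.
d = rng.standard_normal(n) + 1j * rng.standard_normal(n)
d = d - a * (a.conj() @ d) / (a.conj() @ a)
f_perturbed = f(x_opt + 0.1 * d)
```

Moving away from `x_opt` along any direction that keeps the constraint satisfied increases f by d^HSd, scaled by the square of the step.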

Complex Gradient Vector

If f(X) is a real function of a complex matrix (or vector), X, then ∂f/∂X^C= (∂f/∂X)^C and we can define the complex-valued column vector grad(f(X)) = 2 (∂f/∂X)^H = (∂f/∂X^R+j ∂f/∂X^I)^T as the Complex Gradient Vector [R.9] with the properties listed below. If we use <-> to represent the vector mapping associated with the Complex-to-Real isomorphism, and X_[m#n]: <-> y_[2mn] where y is real, then grad(f(X)) <-> grad(f(y)) where the latter is the conventional grad function from vector calculus.

grad(f(X)) is zero at an extreme value of f.
grad(f(X)) points in the direction of steepest slope of f(X)
The magnitude of the steepest slope is equal to |grad(f(X))|. Specifically, if g(X) = grad(f(X)), then lim_a->0 a^-1( f(X+ag(X)) - f(X) ) = | g(X) |²
grad(f(X)) is normal to the surface f(X) = constant which means that it can be used for gradient ascent/descent algorithms.
If f(X)=y^Hy, then grad(f(X))=2(∂y/∂X)^Hy+2(∂y/∂X^C)^Ty^C
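
The slope property and the use of grad in a descent algorithm can be illustrated on f(x)=(Ax+b)^H(Ax+b), for which grad(f) = 2A^H(Ax+b). This is a sketch assuming NumPy; the step size and iteration count are illustrative choices, not taken from the manual.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 2)) + 1j * rng.standard_normal((8, 2))
b = rng.standard_normal(8) + 1j * rng.standard_normal(8)

f = lambda x: np.linalg.norm(A @ x + b) ** 2        # (Ax+b)^H (Ax+b)
grad = lambda x: 2 * A.conj().T @ (A @ x + b)       # complex gradient vector

x0 = rng.standard_normal(2) + 1j * rng.standard_normal(2)
g = grad(x0)
a = 1e-7
slope = (f(x0 + a * g) - f(x0)) / a                 # approaches |g|^2 as a -> 0

# Plain gradient descent with a conservative fixed step (illustrative choice).
step = 0.5 / np.linalg.norm(A, 2) ** 2
x = x0
for _ in range(2000):
    x = x - step * grad(x)
x_ls = -np.linalg.pinv(A) @ b                       # closed-form least-squares minimizer
```

The iterate converges to the least-squares solution, where the complex gradient vanishes.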

Basic Properties

We may write the following differentials unambiguously without parentheses:
- Transpose: dY^T=d(Y^T)=(dY)^T
- Hermitian Transpose: dY^H=d(Y^H)=(dY)^H
- Conjugate: dY^C=d(Y^C)=(dY)^C
Linearity: d(Y+Z)=dY+dZ
Chain Rule: If Z is a function of Y which is itself a function of X, then for both the normal and the generalized complex derivative: dZ: = dZ/dY dY: = dZ/dY dY/dX dX:
Product Rule: d(YZ) =Y dZ + dY Z
- d(YZ): = (I ⊗ Y) dZ: + (Z^T ⊗ I) dY: = ((I ⊗ Y) dZ/dX + (Z^T ⊗ I) dY/dX ) dX:
Hadamard Product: d(Y • Z) =Y • dZ + dY • Z
Kronecker Product: d(Y ⊗ Z) =Y ⊗ dZ + dY ⊗ Z
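
The vectorized product rule is a linear identity in dY and dZ, so it can be checked exactly with random matrices. The sketch below assumes NumPy; the identity matrices are sized I_p and I_m for Y_[m#n] and Z_[n#p].

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

rng = np.random.default_rng(5)
m, n, p = 2, 3, 4
Y, dY = rng.standard_normal((m, n)), rng.standard_normal((m, n))
Z, dZ = rng.standard_normal((n, p)), rng.standard_normal((n, p))

# d(YZ): = (I ⊗ Y) dZ: + (Z^T ⊗ I) dY:
lhs = vec(Y @ dZ + dY @ Z)
rhs = np.kron(np.eye(p), Y) @ vec(dZ) + np.kron(Z.T, np.eye(m)) @ vec(dY)

# Product rule for ⊗, compared with a one-sided finite difference
eps = 1e-6
kron_first_order = np.kron(Y, dZ) + np.kron(dY, Z)
kron_fd = (np.kron(Y + eps * dY, Z + eps * dZ) - np.kron(Y, Z)) / eps
```

The finite-difference version differs from the first-order expression only by the term eps (dY ⊗ dZ).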

Differentials of Linear Functions

d(Ax) = d(x^TA^T): =A dx
- d(x^Ta) = d(a^Tx) = a^T dx
- d(bx^Ta) = ba^T dx
d(AXB): = (A dX B): = (B^T ⊗ A) dX:
- d(a^TXb) = (b ⊗ a)^T dX: = (ab^T):^T dX:
  - d(a^TXa) = d(a^TX^Ta) = (a ⊗ a)^T dX: = (aa^T):^T dX:
- [X_[m#n]] d(AX): = (I_n ⊗ A) dX:
- [X_[m#n]] d(XB): = (dX B): = (B^T ⊗ I_m) dX:
  - [x_[n]] d(xb^T): = (dx b^T): = (b ⊗ I_n) dx
d(AX^TB): = (B^T ⊗ A) dX^T:
- d(a^TX^Tb) = (a ⊗ b)^T dX: = (ab^T):^T dX^T:= (ba^T):^T dX:
d(|x|) = |x|^-1x^T dx
[x: Complex] d (x^HA): = A^T dx^C
d(X_[m#n] ⊗ A_[p#q]): = (I_n ⊗ T_q,m ⊗ I_p)(I_mn ⊗ A:) dX: = (I_nq ⊗ T_m,p )(I_n ⊗ A: ⊗ I_m) dX:
d(A_[p#q] ⊗ X_[m#n]): = (I_q ⊗ T_n,p ⊗ I_m)(A: ⊗ I_mn) dX: = (T_m,n ⊗ I_pq )(I_n ⊗ A: ⊗ I_m) dX:
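
A few of the linear results can be spot-checked as follows. This is a sketch assuming NumPy with randomly chosen real data; it covers only d(a^TXb), d(XB): and d(|x|), not the Kronecker-product entries.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

rng = np.random.default_rng(6)
m, n = 3, 2
dX = rng.standard_normal((m, n))
a, b = rng.standard_normal(m), rng.standard_normal(n)
B = rng.standard_normal((n, 4))

# d(a^T X b) = (b ⊗ a)^T dX: = (a b^T):^T dX:
scalar_lhs = a @ dX @ b
scalar_rhs = vec(np.outer(a, b)) @ vec(dX)
scalar_kron = np.kron(b, a) @ vec(dX)

# d(XB): = (B^T ⊗ I_m) dX:
right_lhs = vec(dX @ B)
right_rhs = np.kron(B.T, np.eye(m)) @ vec(dX)

# d(|x|) = |x|^-1 x^T dx, checked with a small finite step
x, dx = rng.standard_normal(5), 1e-7 * rng.standard_normal(5)
norm_change = np.linalg.norm(x + dx) - np.linalg.norm(x)
norm_formula = x @ dx / np.linalg.norm(x)
```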

Differentials of Quadratic Products

d(Ax+b)^TC(Dx+e) = ((Ax+b)^TCD + (Dx+e)^TC^TA) dx
- d(x^TCx) = x^T(C+C^T)dx = [C=C^T] 2x^TCdx
  - d(x^Tx) = 2x^Tdx
- d(Ax+b)^T (Dx+e) = ( (Ax+b)^TD + (Dx+e)^TA)dx
  - d(Ax+b)^T (Ax+b) = 2(Ax+b)^TAdx
- d(Ax+b)^TC(Ax+b) = [C=C^T] 2(Ax+b)^TCA dx
d(Ax+b)^HC(Dx+e) = (Ax+b)^HCD dx + (Dx+e)^TC^TA^C dx^C
- d(Ax+b)^HC(Ax+b) = (Ax+b)^HCA dx + (Ax+b)^TC^TA^C dx^C = [C=C^H] 2((Ax+b)^HCA dx)^R
- d(Ax+b)^H(Ax+b) = 2((Ax+b)^HA dx)^R
- d (x^HCx) = x^HC dx +x^TC^T dx^C = [C=C^H] 2(x^HC dx)^R
- d (x^Hx) = 2(x^H dx)^R
d(a^TX^TXb) = (X(ab^T + ba^T)):^T dX:
- d(a^TX^TXa) = 2(Xaa^T ):^T dX:
d(a^TX^TCXb) = (C^TXab^T + CXba^T):^T dX:
- d(a^TX^TCXa) = ((C + C^T)Xaa^T ):^T dX: = [C=C^T] 2(CXaa^T):^T dX:
d((Xa+b)^TC(Xa+b)) = ((C+C^T)(Xa+b)a^T ):^T dX:
[X_[n#n]] d(X²): = (XdX + dX X): = (I_n ⊗ X + X^T ⊗ I_n) dX:
[X_[m#n]] d(X^TCX): = (I_n ⊗ X^TC) dX: + (X^TC^T ⊗ I_n) dX^T: = (I_n ⊗ X^TC+T_n,n(I_n⊗ X^TC^T)) dX:
- [X_[m#n], C_[m#m]=C^T] d(X^TCX): = (I_n×n+T_n,n)(I_n⊗ X^TC) dX:
- [X_[m#n], C_[m#m]=C^T] d(diag(X^TCX)) = 2diag(X^TC dX)
- [X_[m#n]] d(X^TX): = (I_n ⊗ X^T) dX: + (X^T ⊗ I_n) dX^T: = (I_n×n + T_n,n)(I_n ⊗ X^T) dX:
[X_[m#n]] d(X^HCX): = (X^HCdX): + (d(X^H) CX): = (I_n ⊗ X^HC) dX: + (X^TC^T ⊗ I_n) dX^H:
grad((Ax+b)^H(Ax+b)) = 2A^H(Ax+b)
- grad(x^Hx) = 2x
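
The matrix-valued quadratic differentials involve the vectorized transpose matrix, which makes them easy to get wrong by hand. The sketch below, assuming NumPy and the hypothetical helpers `vec` and `tvec`, checks d(X^TCX): and d(X²): against the first-order terms they are meant to represent.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

def tvec(m, n):
    """Hypothetical helper: T_m,n with vec(X.T) == tvec(m, n) @ vec(X) for X of size m#n."""
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            T[j + i * n, i + j * m] = 1.0
    return T

rng = np.random.default_rng(7)
m, n = 3, 2
X, dX = rng.standard_normal((m, n)), rng.standard_normal((m, n))
C = rng.standard_normal((m, m))

# d(X^T C X): = (I_n ⊗ X^T C + T_n,n (I_n ⊗ X^T C^T)) dX:
lhs = vec(dX.T @ C @ X + X.T @ C @ dX)
J = np.kron(np.eye(n), X.T @ C) + tvec(n, n) @ np.kron(np.eye(n), X.T @ C.T)
rhs = J @ vec(dX)

# d(X²): = (I_n ⊗ X + X^T ⊗ I_n) dX: for square X
S, dS = rng.standard_normal((n, n)), rng.standard_normal((n, n))
sq_lhs = vec(S @ dS + dS @ S)
sq_rhs = (np.kron(np.eye(n), S) + np.kron(S.T, np.eye(n))) @ vec(dS)
```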

Differentials of Cubic Products

d(xx^TAx) = (xx^T(A+A^T)+x^TAx×I )dx
- d(xx^Tx) = (2xx^T+x^Tx×I )dx
[X_[m#n]] d(XAX^TBX): = (X^TB^TXA^T ⊗ I_m + I_n ⊗ XAX^TB) dX: + (X^TB^T ⊗ XA) dX^T: = (X^TB^TXA^T ⊗ I_m + T_n,m(XA ⊗ X^TB^T) + I_n ⊗ XAX^TB) dX:
- [X_[m#n]] d(XX^TX): = (X^TX ⊗ I_m + I_n ⊗ XX^T) dX: + (X^T ⊗ X) dX^T: = (X^TX ⊗ I_m + T_n,m(X ⊗ X^T) + I_n⊗ XX^T) dX:
[X_[m#n]] d(XAXBX): = (X^TB^TX^TA^T ⊗ I_m + X^TB^T ⊗ XA + I_n ⊗ XAXB) dX:
- [X_[n#n]] d(X³): = ((X^T)² ⊗ I_n + X^T ⊗ X + I_n ⊗ X²) dX:

Differentials of Inverses

d(X^-1) = -X^-1dX X^-1 [2.1]
- d(X^-1): = -(X^-T ⊗ X^-1) dX:
d(a^TX^-1b) = - (X^-Tab^TX^-T ):^T dX: = - (ab^T):^T (X^-T ⊗ X^-1) dX: [2.9]
d(tr(A^TX^-1B)) = d(tr(B^TX^-TA)) = -(X^-TAB^TX^-T):^T dX: = -(AB^T):^T (X^-T ⊗ X^-1) dX:
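
The differential of the inverse can be compared with the actual change in X^-1 for a small perturbation. This is a sketch assuming NumPy; X is deliberately constructed to be well conditioned so that the second-order error stays negligible.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

rng = np.random.default_rng(8)
n = 3
X = 4 * np.eye(n) + 0.5 * rng.standard_normal((n, n))   # well conditioned (assumption)
dX = 1e-6 * rng.standard_normal((n, n))
Xi = np.linalg.inv(X)

change = np.linalg.inv(X + dX) - Xi                      # actual change in X^-1
direct = -Xi @ dX @ Xi                                   # d(X^-1) = -X^-1 dX X^-1
vectorized = (-np.kron(Xi.T, Xi) @ vec(dX)).reshape((n, n), order="F")

# d(a^T X^-1 b) = -(X^-T a b^T X^-T):^T dX:
a, b = rng.standard_normal(n), rng.standard_normal(n)
scalar_change = a @ change @ b
scalar_formula = -vec(Xi.T @ np.outer(a, b) @ Xi.T) @ vec(dX)
```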

Differentials of Trace

Note: matrix dimensions must result in an n#n argument for tr().

d(tr(Y))=tr(dY)
d(tr(X)) = d(tr(X^T)) = I:^T dX: [2.4]
d(tr(X^k)) = k(X^(k-1))^T:^T dX:
d(tr(AX^k)) = (SUM_r=0:k-1(X^rAX^(k-r-1))^T ):^T dX:
d(tr(AX^-1B)) = -(X^-1BAX^-1)^T:^T dX:= -(X^-TA^TB^TX^-T):^T dX: [2.5]
- d(tr(AX^-1)) =d(tr(X^-1A)) = -(X^-TA^TX^-T ):^T dX:
d(tr(A^TXB^T)) = d(tr(BX^TA)) = (AB):^T dX: [2.4]
- d(tr(XA^T)) = d(tr(A^TX)) =d(tr(X^TA)) = d(tr(AX^T)) = A:^T dX:
- d(tr(A^TX^-1B^T)) = d(tr(BX^-TA)) = -(X^-TABX^-T):^T dX: = -(AB):^T (X^-T ⊗ X^-1) dX:
d(tr(AXBX^TC)) = (A^TC^TXB^T + CAXB):^T dX:
- d(tr(XAX^T)) = d(tr(AX^TX)) = d(tr(X^TXA)) =( X(A+A^T)):^T dX:
- d(tr(X^TAX)) = d(tr(AXX^T)) = d(tr(XX^TA)) = ((A+A^T)X):^T dX:
- d(tr(XX^T)) = d(tr(X^TX)) = 2X:^T dX:
d(tr(AXBX)) = (A^TX^TB^T + B^TX^TA^T ):^T dX:
d(tr((AXb+c)(AXb+c)^T)) = 2(A^T(AXb+c)b^T):^T dX:
[C=C^T] d(tr((X^TCX)^-1A)) = d(tr(A (X^TCX)^-1)) = -((CX(X^TCX)^-1)(A+A^T)(X^TCX)^-1):^T dX:
[B=B^T, C=C^T] d(tr((X^TCX)^-1(X^TBX))) = d(tr( (X^TBX)(X^TCX)^-1)) = 2(BX(X^TCX)^-1-(CX(X^TCX)^-1)X^TBX(X^TCX)^-1 ):^T dX:
[D=D^H] d(tr((AXB+C)D(AXB+C)^H)) = ((2A^H(AXB+C)DB^H):^H dX:)^R [2.6]
- d(tr((AXB+C)(AXB+C)^H)) = ((2A^H(AXB+C)B^H):^H dX:)^R
- [D=D^H] d(tr(XDX^H)) = ((2XD):^H dX:)^R
- d(tr(XX^H)) = (2X:^H dX:)^R
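
Two of the trace results are checked below, one real and one complex. The sketch assumes NumPy; because both functions are quadratic in X, a symmetric difference reproduces the first-order term up to rounding.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

rng = np.random.default_rng(10)
m, n, p = 3, 2, 4
A = rng.standard_normal((p, m))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, p))
X, dX = rng.standard_normal((m, n)), 1e-3 * rng.standard_normal((m, n))

# d(tr(A X B X^T C)) = (A^T C^T X B^T + C A X B):^T dX:
f = lambda M: np.trace(A @ M @ B @ M.T @ C)
G = A.T @ C.T @ X @ B.T + C @ A @ X @ B
df_fd = (f(X + dX) - f(X - dX)) / 2
df_formula = vec(G) @ vec(dX)

# d(tr(X X^H)) = (2 X:^H dX:)^R for complex X
Z = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
dZ = 1e-3 * (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n)))
g = lambda M: np.trace(M @ M.conj().T).real
dg_fd = (g(Z + dZ) - g(Z - dZ)) / 2
dg_formula = (2 * np.vdot(vec(Z), vec(dZ))).real   # np.vdot conjugates its first argument
```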

Trace Minimization

In the following expressions M^# denotes the inverse of M or, if M is singular, any generalized inverse (including the pseudoinverse).

[D=D^H] argmin_X{tr((AXB+C)D(AXB+C)^H)} = -(A^HA)^#A^HCDB^H(BDB^H)^# [2.7]
- [D=D^H] argmin_X{tr((AX+C)D(AX+C)^H)} = -(A^HA)^#A^HC
[D=D^H] argmin_X{tr((AXB+C)^HD(AXB+C))} = -(A^HDA)^#A^HDCB^H(BB^H)^#
- [D=D^H] argmin_X{tr((AX+C)^HD(AX+C))} = -(A^HDA)^#A^HDC
- [D=D^H] argmin_x{(Ax+c)^HD(Ax+c)} = -(A^HDA)^#A^HDc
[D=D^H, R=R^H] argmin_X{tr((AXB+C)D(AXB+C)^H+(AXP+Q)R(AXP+Q)^H)} = -(A^HA)^#A^H(CDB^H+QRP^H)(BDB^H+PRP^H)^#
- [D=D^H, R=R^H] argmin_X{tr((AX+C)D(AX+C)^H+(AX+Q)R(AX+Q)^H)} = -(A^HA)^#A^H(CD+QR)(D+R)^#
- [D=D^H, R=R^H] argmin_X{tr((AXB+C)D(AXB+C)^H+(AX)R(AX)^H)} = -(A^HA)^#A^H(CDB^H)(BDB^H+R)^#
- [D=D^H, R=R^H] argmin_X{tr((XB+C)D(XB+C)^H+XRX^H)} = -(CDB^H)(BDB^H+R)^#
[D=D^H] argmin_X{tr((AXB+C)D(AXB+C)^H) | EXF=G} = (A^HA)^#(E^H{E(A^HA)^#E^H}^#{E(A^HA)^#A^HCDB^H(BDB^H)^#F+G}{F^H(BDB^H)^#F}^#F^H - A^HCDB^H)(BDB^H)^# [2.8]
- [D=D^H] argmin_X{tr((AX+C)D(AX+C)^H) | EX=G} = (A^HA)^#(E^H{E(A^HA)^#E^H}^#{E(A^HA)^#A^HC+G} - A^HC)
- [D=D^H] argmin_X{tr((Ax+c)D(Ax+c)^H) | Ex=g} = (A^HA)^#(E^H{E(A^HA)^#E^H}^#{E(A^HA)^#A^Hc+g} - A^Hc)
[D=D^H] argmin_X{tr((AXB+C)^HD(AXB+C)) | EXF=G} = (A^HDA)^#(E^H{E(A^HDA)^#E^H}^#{E(A^HDA)^#A^HDCB^H(BB^H)^#F+G}{F^H(BB^H)^#F}^#F^H - A^HDCB^H)(BB^H)^#
- [D=D^H] argmin_X{tr((AX+C)^HD(AX+C)) | EX=G} = (A^HDA)^#(E^H{E(A^HDA)^#E^H}^#{E(A^HDA)^#A^HDC+G}- A^HDC)
- [D=D^H] argmin_x{(Ax+c)^HD(Ax+c) | Ex=g} = (A^HDA)^#(E^H{E(A^HDA)^#E^H}^#{E(A^HDA)^#A^HDc+g}- A^HDc)
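
The first unconstrained result [2.7] can be checked by confirming that the gradient 2A^H(AXB+C)DB^H vanishes at the stated X. This is a sketch assuming NumPy, with `np.linalg.pinv` standing in for the generalized inverse M^# and with D taken to be positive definite (an added assumption, so that the stationary point is a minimum).

```python
import numpy as np

rng = np.random.default_rng(11)
crandn = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
H = lambda Z: Z.conj().T                       # Hermitian transpose

A, B, C = crandn(6, 3), crandn(2, 5), crandn(6, 5)
M = crandn(5, 5)
D = M @ H(M) + np.eye(5)                       # Hermitian positive definite (assumption)

def objective(X):
    E = A @ X @ B + C
    return np.trace(E @ D @ H(E)).real         # tr((AXB+C) D (AXB+C)^H)

def gradient(X):
    return 2 * H(A) @ (A @ X @ B + C) @ D @ H(B)

# argmin_X = -(A^H A)^# A^H C D B^H (B D B^H)^#
X_opt = -np.linalg.pinv(H(A) @ A) @ H(A) @ C @ D @ H(B) @ np.linalg.pinv(B @ D @ H(B))
X_other = X_opt + 0.1 * crandn(3, 2)           # arbitrary nearby point for comparison
```

Any other X gives a larger objective, by the non-negative amount tr(APBDB^HP^HA^H) where P is the offset from `X_opt`.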

Differentials of Determinant

Note: matrix dimensions must result in an n#n argument for det(). Some of the expressions below involve inverses: these forms apply only if the quantity being inverted is square and non-singular; alternative forms involving the adjoint, ADJ(), do not have the non-singular requirement.

d(det(X)) = d(det(X^T)) = ADJ(X^T):^T dX: = det(X) (X^-T):^T dX: [2.10]
d(det(A^TXB)) = d(det(B^TX^TA)) = (A ADJ(A^TXB)^TB^T):^T dX: = [A,B: nonsingular] det(A^TXB) × (X^-T):^T dX: [2.11]
d(ln(det(A^TXB))) = [A,B: nonsingular] (X^-T):^T dX: [2.12]
- d(ln(det(X))) = (X^-T):^T dX:
d(det(X^k)) = d(det(X)^k) = k × det(X^k) × (X^-T):^T dX: [2.13]
d(ln(det(X^k))) = k × (X^-T):^T dX:
d(det(X^TCX)) = [C=C^T] 2(CX ADJ(X^TCX)):^T dX: = 2det(X^TCX)×(CX(X^TCX)^-1):^T dX: [2.14]
- = [C=C^T, CX: nonsingular] 2det(X^TCX)×(X^-T):^T dX:
d(ln(det(X^TCX))) = [C=C^T] 2(CX(X^TCX)^-1):^T dX:
- = [C=C^T, CX: nonsingular] 2(X^-T):^T dX:
d(ln(det(DIAG(diag(X^TCX))))) = [C=C^T,X_m#n] 2(CX ÷ (1_m diag(X^TCX)^T)):^T dX:
d(det(X^HCX)) = det(X^HCX) × ((C^TX^C (X^TC^TX^C)^-1):^TdX: + (CX(X^HCX)^-1):^T dX^C:) [2.15]
d(ln(det(X^HCX))) = (C^TX^C (X^TC^TX^C)^-1):^TdX: + (CX(X^HCX)^-1):^T dX^C: [2.16]
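
The basic determinant differentials [2.10] and d(ln(det(X))) can be compared with symmetric differences. This is a sketch assuming NumPy; X is constructed to be well conditioned with a positive determinant so that the logarithm is defined.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into one long vector (the ':' operator)."""
    return np.asarray(X).reshape(-1, order="F")

rng = np.random.default_rng(12)
n = 3
X = 4 * np.eye(n) + 0.5 * rng.standard_normal((n, n))   # well conditioned, det > 0 (assumption)
dX = 1e-5 * rng.standard_normal((n, n))

logdet = lambda M: np.log(np.linalg.det(M))
row = vec(np.linalg.inv(X).T)                 # (X^-T):^T stored as a 1-D array

# d(ln(det(X))) = (X^-T):^T dX:
dlogdet_fd = (logdet(X + dX) - logdet(X - dX)) / 2
dlogdet_formula = row @ vec(dX)

# d(det(X)) = det(X) (X^-T):^T dX:
ddet_fd = (np.linalg.det(X + dX) - np.linalg.det(X - dX)) / 2
ddet_formula = np.linalg.det(X) * dlogdet_formula
```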

Jacobian

dY/dX is called the Jacobian Matrix of Y: with respect to X: and J_X(Y)=det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:)=Integral(f(Y(X)) det(dY/dX) dX:).

J_X(X_[n#n]^-1)= (-1)ⁿdet(X)^-2n
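
The Jacobian of the matrix inverse follows from d(X^-1): = -(X^-T ⊗ X^-1) dX:, and the closed form can be checked by taking the determinant of that n²#n² matrix. A minimal sketch assuming NumPy, with n=3 chosen so that the sign factor is visible:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 3
X = 2 * np.eye(n) + 0.3 * rng.standard_normal((n, n))   # invertible test matrix (assumption)
Xi = np.linalg.inv(X)

jacobian_matrix = -np.kron(Xi.T, Xi)          # d(X^-1):/dX:, an n²#n² matrix
J = np.linalg.det(jacobian_matrix)
J_formula = (-1) ** n * np.linalg.det(X) ** (-2 * n)
```

For odd n the Jacobian is negative regardless of the sign of det(X), since det(X) appears to an even power.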

Hessian matrix

If f is a real function of x then the Hermitian matrix H_x f = (d/dx (df/dx)^H)^T is the Hessian matrix of f(x). A value of x for which grad f(x) = 0 corresponds to a minimum, maximum or saddle point according to whether H_x f is positive definite, negative definite or indefinite.

[Real] H_x f = d/dx (df/dx)^T
- H_x f is symmetric
- H_x (a^Tx) = 0
- H_x (Ax+b)^TC(Dx+e) = A^TCD + D^TC^TA
  - H_x (Ax+b)^T (Dx+e) = A^TD + D^TA
  - H_x (Ax+b)^TC(Ax+b) = A^T(C + C^T)A = [C=C^T] 2A^TCA
    - H_x (Ax+b)^T (Ax+b) = 2A^TA
    - H_x (x^TCx) = C+C^T = [C=C^T] 2C
    - H_x (x^Tx) = 2I
[x: Complex] H_x f = (d/dx (df/dx)^H)^T = d/dx^C (df/dx)^T
- H_x f is Hermitian
- H_x (Ax+b)^HC(Ax+b) = [C=C^H] (A^HCA)^T [2.17]
  - H_x (x^HCx) = [C=C^H] C^T
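
The real Hessian of the general quadratic form can be confirmed with second differences. The sketch assumes NumPy; `numerical_hessian` is a hypothetical helper, and the fairly large step h is acceptable only because the function is exactly quadratic.

```python
import numpy as np

rng = np.random.default_rng(14)
n = 3
A, D = rng.standard_normal((4, n)), rng.standard_normal((4, n))
b, e = rng.standard_normal(4), rng.standard_normal(4)
C = rng.standard_normal((4, 4))

f = lambda x: (A @ x + b) @ C @ (D @ x + e)    # (Ax+b)^T C (Dx+e)

def numerical_hessian(func, x, h=0.1):
    """Second-order central differences; exact up to rounding for quadratic func."""
    k = x.size
    I = np.eye(k)
    Hm = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            Hm[i, j] = (func(x + h * I[i] + h * I[j]) - func(x + h * I[i] - h * I[j])
                        - func(x - h * I[i] + h * I[j]) + func(x - h * I[i] - h * I[j])) / (4 * h * h)
    return Hm

x = rng.standard_normal(n)
H_num = numerical_hessian(f, x)
H_formula = A.T @ C @ D + D.T @ C.T @ A        # H_x (Ax+b)^T C (Dx+e)
```

The result is symmetric even though C is not, because the two terms of the formula are transposes of each other.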

This page is part of The Matrix Reference Manual. Copyright © 1998-2022 Mike Brookes, Imperial College, London, UK. See the file gfl.html for copying instructions. Please send any comments or suggestions to "mike.brookes" at "imperial.ac.uk".
Updated: $Id: calculus.html 11291 2021-01-05 18:26:10Z dmb $