import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Arrow
def make_plot(xmin, ymin, xmax, ymax):
    fig, ax = plt.subplots()
    ax.set_xlim(xmin, xmax)
    ax.set_ylim(ymin, ymax)
    ax.set_xticks(np.arange(xmin, xmax, 1))
    ax.set_yticks(np.arange(ymin, ymax, 1))
    ax.set_aspect("equal")
    ax.grid(visible=True)
    return fig, ax


def draw_arrow(x0, y0, x1, y1, axes, color="blue", alpha=1):
    axes.add_patch(Arrow(x0, y0, x1 - x0, y1 - y0, width=0.3, color=color, alpha=alpha))
    return axes
Essential Linear Algebra
Fundamentals of Data Science
R and Python
For a look at linear algebra basics in R and Python, see:
Vectors and Scalars
\(\mathbf{R}^{n}\) is the set of vectors (ordered tuples) of real numbers of length \(n\). A scalar is a real number.
- Vectors are added componentwise.
- A vector can be multiplied by a scalar.
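For example, using the NumPy import at the top of these notes, both operations work entry by entry (a small illustrative sketch):

v = np.array([3, 5, 1])
w = np.array([1, 4, 2])
print(v + w)  # componentwise sum: (4, 9, 3)
print(2 * v)  # multiply every component by the scalar 2: (6, 10, 2)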
Addition
fig, ax = make_plot(0, 0, 12, 12)
ax.set_title("Vector Addition")
ax = draw_arrow(0, 0, 3, 5, ax)
ax = draw_arrow(0, 0, 1, 4, ax)
ax = draw_arrow(1, 4, 4, 9, ax)
ax = draw_arrow(3, 5, 4, 9, ax)
plt.show()
Scalar Multiplication
fig, ax = make_plot(0, 0, 12, 12)
ax.set_title("Scalar Multiplication")
ax = draw_arrow(0, 0, 3, 5, ax)
ax = draw_arrow(0, 0, 6, 10, ax, color="red", alpha=0.5)
Geometric Interpretation in 2 and 3 dimensions
Feature Space
Each ‘dimension’ in feature space corresponds to a ‘feature’ or measurement of the data. Here we are assuming for now that the features are continuous and measured by real numbers.
Let’s choose some numerical features of the penguins dataset.
data = pd.read_csv("data/penguins-raw.csv")
data = (
    data[
        [
            "Culmen Length (mm)",
            "Culmen Depth (mm)",
            "Flipper Length (mm)",
            "Body Mass (g)",
        ]
    ]
    .dropna()
    .values
)
Each penguin is represented by a vector in \(\mathbf{R}^{4}\). So for example penguin number 34 is represented as follows.
data[34, :]
array([ 39.2, 21.1, 196. , 4150. ])
This abstraction of penguins into vectors is sometimes called “an embedding”.
Features
We can also look at a single feature for all of the penguins. For example, ‘Culmen Length (mm)’ is a feature and there is a vector in \(\mathbf{R}^{342}\) consisting of all of the Culmen Lengths for all of the penguins.
In the tidy convention, we summarize our data in an array or matrix where each row corresponds to a sample and each column to a feature. So our penguin data has \(342\) rows (corresponding to the 342 penguins with no missing data) and 4 columns corresponding to four features.
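As a quick check (a small sketch, assuming the data array built above from the penguins file):

# one row per penguin (sample), one column per feature
print(data.shape)  # (342, 4)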
Image Embeddings
Each sample in the MNIST database is a \(28\times 28\) grayscale image, represented by a \(28\times 28\) array of integers between 0 and 255.
with open("data/train-images.idx3-ubyte", "rb") as f:
    f.read(16)  # skip the 16-byte IDX header
    buf = f.read(28 * 28)  # read the first 28 x 28 image

data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
data = data.reshape(1, 28, 28)

for x in range(28):
    for y in range(28):
        print("{:>4}".format(int(data[0, x, y])), end="")
    print("\n")

image = np.asarray(data[0].squeeze())
plt.imshow(image)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 3 18 18 18 126 136 175 26 166 255 247 127 0 0 0 0
0 0 0 0 0 0 0 0 30 36 94 154 170 253 253 253 253 253 225 172 253 242 195 64 0 0 0 0
0 0 0 0 0 0 0 49 238 253 253 253 253 253 253 253 253 251 93 82 82 56 39 0 0 0 0 0
0 0 0 0 0 0 0 18 219 253 253 253 253 253 198 182 247 241 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 80 156 107 253 253 205 11 0 43 154 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 14 1 154 253 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 139 253 190 2 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 11 190 253 70 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 35 241 225 160 108 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 81 240 253 253 119 25 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 45 186 253 253 150 27 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 93 252 253 187 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 249 253 249 64 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 46 130 183 253 253 207 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 39 148 229 253 253 253 250 182 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 24 114 221 253 253 253 253 201 78 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 23 66 213 253 253 253 253 198 81 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 18 171 219 253 253 253 253 195 80 9 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 55 172 226 253 253 253 253 244 133 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 136 253 253 253 212 135 132 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here we can view each image as a vector in a 784-dimensional (\(28\times 28 = 784\)) space.
A collection of 100 images would be represented by an array with 100 rows and 784 columns.
A \(28\times 28\) image in color has \(28\times 28\times 3 = 2352\) numbers to account for the RGB channels.
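The sketch below (new code, assuming the data array built from the MNIST file above) flattens the \(28\times 28\) image into a vector of length 784 and counts the values a color image would need:

# flatten the 28 x 28 grayscale image into a vector in R^784
flat = data[0].reshape(-1)
print(flat.shape)  # (784,)

# a collection of 100 such images: a 100 x 784 array
batch = np.zeros((100, 28 * 28))
print(batch.shape)  # (100, 784)

# a 28 x 28 color image has three RGB channels
print(28 * 28 * 3)  # 2352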
One Hot Embedding
Normally categorical variables don’t embed directly into \(\mathbf{R}^{n}\), but one can use a “one-hot” embedding.
Suppose our categorical vector has 4 levels: red, green, blue, orange.
The “one-hot” embedding uses four features, and each color corresponds to a vector with a one in the column corresponding to the color and zeros elsewhere.
| red | green | blue | orange |
|-----|-------|------|--------|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
So to use one-hot encoding of our feature data, we’d add four columns to our data matrix.
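Here is a minimal sketch of one-hot encoding with pandas (the color values below are made up for illustration):

colors = pd.Series(["red", "green", "blue", "orange", "red"])

# one indicator column per level; each row has a single 1
onehot = pd.get_dummies(colors, dtype=int)
print(onehot)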
Word embeddings
Word2vec is a technique developed by researchers at Google that embeds a vocabulary into \(\mathbf{R}^{n}\). Each of its 3 million words is represented by a 300-dimensional vector.
The Curse of Dimensionality
Our intuition misleads us when we think about high dimensional space. There is just vastly more “space” in high dimensional space than we expect.
One way to see this is to compare the unit sphere in \(\mathbf{R}^{n}\) and the unit cube in \(\mathbf{R}^{n}\).
In the plane, the “unit cube” is the square \([-1,1]^2\), with area 4, and the unit circle has area \(\pi\approx 3.14\). So the unit circle fills up most of the cube: given vectors with coordinates \((x,y)\) between \(-1\) and \(1\), about 78 percent (a fraction of \(\pi/4\approx 0.785\)) are within distance one of the origin.
rng = np.random.default_rng()


def hypercube(n, d):
    data = rng.uniform(-1, 1, size=(n, d))
    r = np.linalg.norm(data, axis=1)
    hist, edges = np.histogram(r, bins=20, density=True)
    fig = plt.figure()
    axes = fig.subplots()
    axes.bar(x=edges[:-1], height=hist, align='edge', width=edges[1] - edges[0], edgecolor='white')
    axes.set_title("Distance to zero for {} points\n in the hypercube in {} dimensions".format(n, d))


hypercube(10000, 2)
In 20 dimensions, the situation is quite different. The hypercube \([-1,1]^{20}\) has volume \(2^{20}\) (this volume grows exponentially with the dimension) while the unit sphere has volume \[ \frac{\pi^{10}}{10!}\approx 0.026 \]
If you compare the two distance histograms, you see that most of the points are very far from the origin; essentially none are within the unit sphere.
hypercube(10000, 20)
This means that even huge numbers of points in relatively high dimensions are very sparsely distributed.
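A small Monte Carlo sketch (new code, reusing the rng generator defined above) estimates the fraction of hypercube points that land inside the unit sphere as the dimension grows:

def fraction_inside_sphere(n, d):
    # sample n points uniformly from the hypercube [-1, 1]^d
    points = rng.uniform(-1, 1, size=(n, d))
    # fraction within distance 1 of the origin
    return np.mean(np.linalg.norm(points, axis=1) <= 1)

for d in [2, 5, 10, 20]:
    print(d, fraction_inside_sphere(100000, d))  # the fraction collapses toward zero as d grows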
Linear combinations
If \(v_1,\ldots, v_k\) are vectors in \(\mathbf{R}^{n}\) then a weighted sum of the \(v_{i}\) is called a linear combination.
\[ w = \sum b_{i}v_{i} \]
Suppose our data is the performance of students on two homeworks, one midterm, and one final, all scored on a 100-point scale, with each homework worth 10% of the total, the midterm worth 25%, and the final worth 55%. If there are 20 students, our data is a \(20\times 4\) array with each row holding the grades of a single student and each column holding all the scores for a particular assignment.
Let \(v_1, v_2, v_3, v_4\) be the four columns. Then the final score is the linear combination \[ s=.1v_1+.1v_2+.25v_3+.55v_4 \]
The vector \(s\), the score, is a linear combination of the features.
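As a sketch (with made-up scores for three students), the score vector is the data matrix times the weight vector, i.e. exactly this linear combination of the columns:

# rows: students; columns: HW1, HW2, midterm, final (each out of 100)
grades = np.array([
    [80.0, 90.0, 75.0, 85.0],
    [60.0, 70.0, 65.0, 90.0],
    [95.0, 85.0, 80.0, 70.0],
])
weights = np.array([0.1, 0.1, 0.25, 0.55])

# s = .1*v1 + .1*v2 + .25*v3 + .55*v4
s = grades @ weights
print(s)  # 82.5, 78.75, 76.5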
Linear Dependence and Linear Independence
In the example above, the final score is a linear combination of the features. We say that the final score is dependent on the features. More generally, a collection of vectors is linearly dependent if there are constants, not all zero, so that
\[ b_1v_1+...+b_kv_k=0. \]
If they’re dependent, it means one of them can be written in terms of the others.
If they aren’t dependent, they are independent. This means that the only way you can get \[ b_1v_1+...+b_kv_k=0 \] is if all the constants are zero.
None of the vectors can be written in terms of the others.
Mathematically, dependence is an exact relationship.
Linear relations in Python
vectors = [np.random.uniform(size=8) for i in range(5)]  # 5 random vectors in R^8
scalars = [np.random.normal() for i in range(5)]  # 5 random scalars
prods = [scalars[i] * vectors[i] for i in range(5)]  # scalar-vector products
result = sum(prods)  # sum of products: a linear combination
print(result)
[ 1.35814571 0.53992968 0.94057224 1.03528351 -0.29581081 1.68075209
0.81547344 0.28800681]
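As a follow-up sketch (new code, assuming the vectors and result objects from the block above), we can confirm numerically that result is linearly dependent on the five random vectors by comparing matrix ranks:

# stack the five vectors as the rows of a 5 x 8 matrix
A = np.vstack(vectors)
# appending result does not increase the rank: it lies in the span of the rows
B = np.vstack(vectors + [result])
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))  # both are 5 (with probability 1)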
Approximate linear dependence
Sometimes two features are “almost” linearly related.
In an old dataset about car models, with 398 types of cars and 9 features, two of the features are “miles per gallon” and “engine displacement”. Let’s look at mpg and displacement relative to their means, by subtracting their average values.
Miles per gallon (relative to its mean) is roughly linearly dependent on displacement (relative to its mean): \[ \mathrm{mpg}-\overline{\mathrm{mpg}} = -.0603(\mathrm{disp}-\overline{\mathrm{disp}}) \]
So we don’t learn much new from ‘mpg’ that isn’t already in ‘displacement’.
import statsmodels.api as sm
mpg = pd.read_csv("data/auto-mpg.csv")
x = mpg["displacement"].values
x = x - np.mean(x)
y = mpg["mpg"].values
y = y - np.mean(y)
x1 = sm.add_constant(x)
model = sm.OLS(y, x1).fit()
x0 = np.linspace(-100, 300, 10)
x1 = sm.add_constant(x0)
predictions = model.predict(x1)

plt.scatter(x, y)
plt.plot(x0, predictions, color="red")
plt.title("Miles per Gallon vs Engine Displacement")
plt.xlabel("Displacement")
Text(0.5, 0, 'Displacement')
Linear Regression (studied later) tries to capture this.
Span
The span of a collection of vectors is the set of all linear combinations of those vectors.
Basis and dimension
A basis is a linearly independent, spanning set. The number of elements in a basis is always the same; it is called the dimension of the vector space.
The dimension of \(\mathbf{R}^{n}\) is \(n\) (the standard vectors are independent and span).
Distances and the Euclidean norm
The norm of a vector \(v=(a_1,\ldots, a_n)\) is \[ \|v\| = (\sum a_{i}^2)^{1/2} \]
It is the “length” of the vector.
The Euclidean distance between two points \(v\) and \(w\) is \[ \|v-w\| \]
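A small sketch in NumPy (using the import at the top of these notes):

v = np.array([3.0, 4.0])
w = np.array([1.0, 1.0])
print(np.linalg.norm(v))      # 5.0, the length of v: sqrt(3^2 + 4^2)
print(np.linalg.norm(v - w))  # Euclidean distance from v to w: sqrt(2^2 + 3^2) = sqrt(13)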
Mean Squared Error in vector form
Remember our example of the “almost” dependence of mpg and displacement.
mpg0 = mpg["mpg"] - np.mean(mpg["mpg"])
disp0 = mpg["displacement"] - np.mean(mpg["displacement"])
predicted = -0.0603 * disp0

plt.scatter(disp0, mpg0)
plt.scatter(disp0, predicted)
plt.title("MSE={:.2f}".format(np.linalg.norm(mpg0 - predicted) / mpg0.shape[0]))
Text(0.5, 1.0, 'MSE=0.23')
The mean squared error is the squared distance between the vector of true values and the vector of predicted values, divided by the number of values.
The dot product
Suppose we have two vectors: \[\begin{aligned} v_1 & =[a_1,a_2,\dots, a_n] \\ v_2 &= [b_1,b_2,\ldots, b_n] \end{aligned} \]
The “dot product” or “inner product” of these two vectors is: \[ v_1\cdot v_2 = \sum_{i=1}^{n} a_i b_i. \]
Important: The dot product converts two vectors into a scalar!
Properties of the dot product
- \(v_1\cdot v_1 = \|v_1\|^2\)
- \((av_1+bv_2)\cdot v_3 = a(v_1\cdot v_3) + b(v_2\cdot v_3)\)
- \(v_1\cdot v_2 = v_2 \cdot v_1\).
Angles and Cauchy-Schwarz
The law of cosines: \[\|v_1\|^2+\|v_2\|^2 - 2\|v_1\|\|v_2\|\cos(\theta) =\|v_1-v_2\|^2\]
means that
\[ v_1\cdot v_2 = \|v_1\|\|v_2\|\cos(\theta) \tag{1}\]
In particular:
\[|v_1\cdot v_2|\le \|v_1\|\|v_2\| \tag{2}\]
This says \[ |\sum_{i=1}^{n} a_{i}b_{i}|^2\le (\sum_{i=1}^{n} a_{i}^2)(\sum_{i=1}^{n} b_{i}^2) \]
Equation 2 is called the “Cauchy-Schwarz inequality.”
Python
# python
v = np.array([1, 2, 3, 4, 5])
w = np.array([2, 4, 6, 8, 10])
print('Entry by entry product = {}'.format(v * w))
print('Dot product = {}'.format(np.dot(v, w)))
Note: In R, the symbol for the dot product is %*%.
In Listing 1 we show how to compute the dot product in Python.
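As an extra sketch (new code, not part of Listing 1), we can also check Equation 1 and the Cauchy-Schwarz inequality (Equation 2) numerically:

v = np.array([1, 2, 3, 4, 5])
w = np.array([5, 4, 3, 2, 1])
lhs = abs(np.dot(v, w))                       # |v . w| = 35
rhs = np.linalg.norm(v) * np.linalg.norm(w)   # ||v|| ||w|| = 55
print(lhs <= rhs)                             # True: Cauchy-Schwarz holds
print(np.dot(v, w) / rhs)                     # cos(theta) = 35/55, about 0.64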
Orthogonality
If \(v_1\cdot v_2=0\) then either one of \(v_1\) or \(v_2\) is zero, or the angle between them is 90 degrees.
In this case we say the vectors are orthogonal.
Unit vectors and projection
A vector \(u\) is a unit vector if \(u\cdot u=1\).
The quantity \(v\cdot u\) is the (signed) length of the projection of \(v\) onto the direction given by \(u\).
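A brief sketch of projecting onto a unit vector (new code, using NumPy):

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])   # a unit vector along the x-axis
print(np.dot(u, u))        # 1.0, so u is a unit vector
print(np.dot(v, u))        # 3.0, the length of v in the direction of u
print(np.dot(v, u) * u)    # [3. 0.], the projection of v onto the line through u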
Variance, Correlation and cosine similarity
If \(v\) is a feature vector, let \(\overline{v}=\frac{1}{n}\sum_{i=1}^{n} v_{i}\).
Notice that \(\overline{v}=\frac{v\cdot E}{n}\) where \(E\) is the vector with \(1\) in each entry.
The variance of \(v\) is \[ \sigma_{v}^{2}=\frac{\|v-\overline{v}E\|^{2}}{n} \]
Covariance
If \(v\) and \(w\) are two vectors, their covariance is \[ \sigma_{vw} = \frac{(v-\overline{v}E)\cdot (w-\overline{w}E)}{n} \]
Correlation
The correlation coefficient of \(v\) and \(w\) is \[ r_{vw} = \frac{\sigma_{vw}}{\sigma_{v}\sigma_{w}}=\frac{(v-\overline{v}E)\cdot(w-\overline{w}E)}{\|v-\overline{v}E\|\,\|w-\overline{w}E\|}=\cos\theta \]
It measures the cosine of the angle between \(v-\overline{v}E\) and \(w-\overline{w}E\).
Cosine similarity
In general,
\[ \cos(\theta) = \frac{v\cdot w}{\|v\|\|w\|} \] measures the angle between two feature vectors and is a measure of “similarity” between \(-1\) and \(1\).
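A short sketch tying these together (new code, assuming the mpg DataFrame loaded in the displacement example earlier): the correlation between mpg and displacement is the cosine of the angle between the centered feature vectors.

v = mpg["mpg"].values
w = mpg["displacement"].values
vc = v - v.mean()
wc = w - w.mean()

# cosine of the angle between the centered vectors ...
cos_theta = np.dot(vc, wc) / (np.linalg.norm(vc) * np.linalg.norm(wc))
# ... agrees with the correlation coefficient
print(cos_theta, np.corrcoef(v, w)[0, 1])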
Hyperplanes
A (linear) hyperplane is a subspace of dimension \(n-1\) in a vector space of dimension \(n\). It is given by an equation of the form
\[ \sum a_{i}x_{i}=0. \]
Geometrically this can be written \(v\cdot x=0\) where \(v=[a_1,\ldots, a_n]\) and \(x=[x_1,\ldots, x_n]\). The vector \(v\) is called the normal vector to the hyperplane.
An (affine) hyperplane is given by an equation of the form \[ v\cdot x = b \] for some constant \(b\).
x = np.linspace(-5, 5, 10)
y = 2 / 3 * x

plt.plot(x, y)
plt.grid(True)
plt.gca().set_aspect("equal")
plt.plot([0, -2], [0, 3])
plt.plot([-2, -2], [3, 2])
plt.plot([-2, -1], [3, 2.6])
plt.title("Hyperplane 2x-3y=0 with normal vector [-2,3]")
Text(0.5, 1.0, 'Hyperplane 2x-3y=0 with normal vector [-2,3]')
Affine hyperplanes
For fixed \(v\) and varying \(b\), the hyperplanes \(v\cdot x=b\) form a parallel family.
Distance from a point to a hyperplane
The distance from a point \(w\) to the hyperplane \(v\cdot x = b\) is \[ D = \frac{|w\cdot v-b|}{\|v\|}. \]
Up to sign, this is the projection of the vector from a point \(x\) on the hyperplane to \(w\) onto the unit normal \(\frac{v}{\|v\|}\):
\[ (w-x)\cdot\frac{v}{\|v\|} =\frac{(w\cdot v -x\cdot v)}{\|v\|} = \frac{(w\cdot v - b)}{\|v\|} \]
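A small sketch of this formula in code (a hypothetical helper, not from the original notes):

def distance_to_hyperplane(w, v, b):
    # distance from the point w to the hyperplane v . x = b
    return abs(np.dot(w, v) - b) / np.linalg.norm(v)

# distance from the point (-2, 3) to the hyperplane 2x - 3y = 0
print(distance_to_hyperplane(np.array([-2.0, 3.0]), np.array([2.0, -3.0]), 0.0))  # sqrt(13), about 3.61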
x = np.linspace(-4, 4, 10)
y = 2 / 3 * x

plt.plot(x, y)
plt.grid(True)
plt.gca().set_aspect("equal")
plt.title("Dist. from point to hyperplane")
plt.plot([0, -2 / np.sqrt(13)], [0, 3 / np.sqrt(13)], color="red", linewidth=3)
plt.plot([0, -2], [0, 3])
plt.plot([-3, -2], [-2, 3])
plt.text(-2.4, 3, r"w")
plt.text(-3, -2.3, r"x")
plt.text(-3.6, 0, r"$\|w-x\|$")
plt.text(-1.9, 2.2, r"$\theta$")
plt.text(-0.5, 1.4, r"$\|w-x\|\cos(\theta)$")
Text(-0.5, 1.4, '$\\|w-x\\|\\cos(\\theta)$')
Matrices
An \(n\times m\) matrix is an array of real numbers with \(n\) rows, \(m\) columns, and a total of \(nm\) entries.
Matrix times Vector
If \(M\) is an \(n\times m\) matrix and \(v\) is an \(m\times 1\) vector (a column vector) then \(Mv\) is the \(n\times 1\) column vector whose entries are \[ M[i,:]\cdot v \] where \(i\) runs from 1 to \(n\). Here \(M[i,:]\) is the \(i^{th}\) row of \(M\).
If \(v\) is a \(1\times n\) row vector then \(vM\) is the \(1\times m\) row vector whose entries are \(v\cdot M[:,i]\) as \(i\) runs from 1 to \(m\).
Matrix times vector is linear
\[M(v+w) = Mv+Mw\] \[M(aw) = aMw\]
Matrix times standard vector gives row/column
If \(v\) is a column vector with a \(1\) in position \(i\) and zeros elsewhere, then \(Mv\) is the \(i^{th}\) column of \(M\).
If \(v\) is a row vector with a \(1\) in position \(i\) and zeros elsewhere, then \(vM\) is the \(i^{th}\) row of \(M\).
\(Mv\)/\(vM\) gives linear combination of columns/rows
\(Mv\) is a linear combination of the columns of \(M\) weighted by the entries of \(v\).
\(vM\) is a linear combination of the rows of \(M\) weighted by the entries of \(v\).
Matrix times matrix
An \(n\times m\) matrix times an \(m\times k\) matrix yields an \(n\times k\) matrix.
You can view \(MN\) as \(Mv\) where \(v\) runs through the columns of \(N\). Each resulting column has \(n\) entries.
\[ MN=\left[ \begin{matrix} M\,N[:,1] & M\,N[:,2] &\cdots& M\,N[:,k]\\ \end{matrix} \right] \]
OR you can view \(MN\) as \(wN\) where \(w\) runs through the rows of \(M\).
\[ MN = \left[\begin{matrix} M[1,:]\,N \\ M[2,:]\,N \\\vdots\\ M[n,:]\,N\end{matrix}\right] \]
OR
you can view \(MN\) entry by entry: the \(i,j\) entry of \(MN\) is the dot product of the \(i^{th}\) row of \(M\) with the \(j^{th}\) column of \(N\) (each of which has \(m\) entries).
\[(MN)_{ij} = M[i,:]\cdot N[:,j]\]
Python
The ‘@’ sign gives matrix multiplication in Python. In R, it’s %*%.
M = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
v = np.array([[-1], [0], [1]])
print(M)
print(v)
print(M @ v)
print("Compare -M[:,0]+M[:,2] with M@v\n both are {}".format(-M[:, 0] + M[:, 2]))
[[1 2 3]
[2 3 4]
[3 4 5]]
[[-1]
[ 0]
[ 1]]
[[2]
[2]
[2]]
Compare -M[:,0]+M[:,2] with M@v
both are [2 2 2]
Where does matrix multiplication come from?
If \(v\) is in \(\mathbf{R}^{m}\) as a column vector, and \(M\) is an \(n\times m\) matrix, then \(Mv\in\mathbf{R}^{n}\). So the function \(v\to Mv\) is a function from \(\mathbf{R}^{m}\) to \(\mathbf{R}^{n}\).
Now if \(N\) is a \(k\times n\) matrix, then \(NMv\) is in \(\mathbf{R}^{k}\).
If we want \(N(Mv)=(NM)v\) to be true then this forces the definition of matrix multiplication.
The Matrix Product gives composition of functions
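A quick numerical sketch of this fact (new code, with small made-up matrices): applying \(M\) and then \(N\) agrees with applying the single matrix \(NM\).

M = np.array([[1, 2], [3, 4], [5, 6]])  # 3 x 2: maps R^2 to R^3
N = np.array([[1, 0, 1], [0, 1, 1]])    # 2 x 3: maps R^3 to R^2
v = np.array([1, -1])

print(N @ (M @ v))  # apply M, then N
print((N @ M) @ v)  # apply the single product NM; the results agree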
Transpose of a matrix
The transpose \(M^{T}\) of a matrix is the matrix obtained from \(M\) by switching rows and columns.
The transpose switches the order of a product.
\[(MN)^{T}=N^{T}M^{T}\]
Covariance Matrix
Remember that if \(v\) and \(w\) are feature vectors, then \(\frac{1}{n}(v-\overline{v}E)\cdot (w-\overline{w}E)\) is the covariance of \(v\) and \(w\).
Suppose \(v_1,\ldots, v_n\) are features forming a data matrix \(X\).
Let \(X_{0}\) be the matrix whose columns are \(v_{i}-\overline{v_{i}}\).
Then \(\frac{1}{N}X_{0}^{T}X_{0}\) is \(n\times n\) and called the covariance matrix.
If \(Y_{0}\) is obtained from \(X_{0}\) by dividing each column by its norm, then \(Y_{0}^{T}Y_{0}\) is the correlation matrix – ones on the diagonal, correlation coefficients off diagonal.
data = pd.read_csv("data/penguins-raw.csv")
data = (
    data[
        [
            "Culmen Length (mm)",
            "Culmen Depth (mm)",
            "Flipper Length (mm)",
            "Body Mass (g)",
        ]
    ]
    .dropna()
    .values
)
# axis=0 means take the average of the columns (summarize over rows)
data.mean(axis=0)

# "center" each column, then scale each column by its norm
data0 = data - data.mean(axis=0)
print(np.linalg.norm(data0, axis=0))
data0 = data0 / np.linalg.norm(data0, axis=0)
D = data0.transpose() @ data0
print(D)
plt.imshow(D, cmap="hot", interpolation="nearest")
plt.title("covariance matrix heatmap")
[ 100.81768459 36.46689639 259.66621062 14809.0410685 ]
[[ 1. -0.23505287 0.65618134 0.59510982]
[-0.23505287 1. -0.58385122 -0.47191562]
[ 0.65618134 -0.58385122 1. 0.87120177]
[ 0.59510982 -0.47191562 0.87120177 1. ]]
Text(0.5, 1.0, 'covariance matrix heatmap')
plt.scatter(x=data0[:, 2], y=data0[:, 3])
Rank and invertibility
The column rank of a matrix is the dimension of the space spanned by its columns; this is the number of linearly independent columns.
The row rank is the number of linearly independent rows.
Theorem: These two numbers are equal.
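As a closing sketch (new code), here is a small matrix whose third column is the sum of its first two; NumPy reports the same rank for the matrix and its transpose, illustrating the theorem.

M = np.array([[1, 2, 3],
              [2, 3, 5],
              [4, 1, 5]])   # third column = first column + second column

# column rank and row rank agree
print(np.linalg.matrix_rank(M))    # 2
print(np.linalg.matrix_rank(M.T))  # 2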