- Quick Tips
- Whenever possible, process training sets without explicit for loops, because looping over large datasets is too slow.
- In logistic regression, there is a forward propagation step followed by a backward propagation step.
- Let's say that you have an image that is 64px x 64px. To represent that image in a computer, it will be represented as 3 separate matrices (red matrix, blue matrix, green matrix) and each one will be 64 x 64.
- The output is y. You will either get a 0 or a 1 as a result that will tell you whether or not the image is what you expect.
- To find x, we unroll all of the pixel values into a single feature vector. For a 64 x 64 image, x has nx = 64 x 64 x 3 = 12288 entries.
- (x, y): x is a vector with nx entries, y is 0 or 1
- m training examples = {(x1, y1), (x2, y2), (x3, y3), …, (xm, ym)}
- m (or m_train) - the number of training examples
- m_test - the number of test examples
- X = [x1, x2, x3…xm]
- This matrix will have m columns and nx rows
- X = nx x m matrix
X.shape = (nx, m). This will give you the shape of a matrix in python.
- Y = [y1, y2, y3…ym]
- Y = 1 x m matrix
Y.shape = (1, m). This is a 1 x m matrix.
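As a rough illustration of the notation above, the sketch below builds X and Y from a handful of images. The random pixel data, the labels, and the choice of m = 5 are all made up for the example; only the shapes matter.

```python
import numpy as np

# Hypothetical data: m = 5 random RGB images, each 64 x 64 pixels.
m = 5
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(m, 64, 64, 3))

# Unroll each image into one column of length nx = 64 * 64 * 3.
X = images.reshape(m, -1).T
# Hypothetical labels (0 or 1), one per image, stored as a 1 x m row.
Y = rng.integers(0, 2, size=(1, m))

nx = X.shape[0]
print(X.shape)  # (12288, 5)
print(Y.shape)  # (1, 5)
```

Each column of X is one training example, which is why X has nx rows and m columns.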
Given x, we want ŷ to equal the probability that y = 1.
- x is a vector with nx entries, with parameters w (also a vector with nx entries) and b (which is a real number)
- ŷ = w transpose x + b (linear function of the input x)
- This is good for linear regression.
- The problem with this is that it is hard to enforce that 0 <= ŷ <= 1. The value of w transpose x + b can be negative, or much bigger than 1.
- So, the solution is ŷ = sigmoid(w transpose x + b)

- We use z to replace the w transpose x + b
- So, sigmoid(z) = 1 / (1 + e^(-z))
- If z is large then sigmoid(z) will be very close to 1.
- If z is very small (large negative number) then sigmoid(z) will be close to 0.
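A minimal sketch of the sigmoid function described above, assuming NumPy is available. The function name `sigmoid` is just a convenient label for this example.

```python
import numpy as np

def sigmoid(z):
    """Return 1 / (1 + e^(-z)), applied element-wise to z."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly halfway between 0 and 1
print(sigmoid(10))   # large z: very close to 1
print(sigmoid(-10))  # large negative z: very close to 0
```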
- Squared error = 1/2 (ŷ - y)^2 is one way to define the loss, but it makes the optimization problem non-convex, so gradient descent does not work well with it and we don't want to use it.
- Loss(ŷ, y) = -(y log ŷ + (1-y)log(1-ŷ))
- We want the loss function to be as small as possible
- If y = 1, want ŷ to be large
- If y = 0, want ŷ to be small
- The loss function works to see how a single training example is doing.
- The cost function J(w, b) measures how you are doing on the entire training set. It is the average of the loss over all m training examples.
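The loss and cost described above can be sketched in a few lines. This is only an illustration: the function names `loss` and `cost` and the sample predictions are assumptions made for the example.

```python
import numpy as np

def loss(y_hat, y):
    """Loss for a single example: -(y log(y_hat) + (1 - y) log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost for the whole training set: the average loss over all examples."""
    return np.mean(loss(Y_hat, Y))

# When y = 1, a prediction near 1 gives a small loss and one near 0 a large loss.
print(loss(0.9, 1))
print(loss(0.1, 1))

# Hypothetical predictions and labels for two training examples.
print(cost(np.array([[0.9, 0.1]]), np.array([[1, 0]])))
```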
- We can use gradient descent to learn the parameters w and b from the training set. We want to find the w, b that minimize J(w, b).
- Convex functions look like big bowls (in contrast to non-convex functions, which have many bumps and dips).
- Because J is convex, you can initialize w and b anywhere and you will always end up at (or very close to) the same point.
- You always take a step in the steepest downhill direction.
- Global optimum - The lowest point in the bowl, where the cost is smallest.
- The derivative term dJ(w)/dw is usually represented in code as "dw", and the derivative with respect to b is represented as "db".


- The actual equations to update each of the parameters (w and b), where alpha is the learning rate:
w := w - alpha * dJ(w, b)/dw
b := b - alpha * dJ(w, b)/db
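To make the repeated downhill step concrete, here is a small sketch on a made-up one-parameter convex cost, J(w) = (w - 3)^2. The cost function, the learning rate `alpha`, and the number of steps are all assumptions chosen for illustration, not values from these notes.

```python
def J(w):
    """A hypothetical bowl-shaped (convex) cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of the hypothetical cost above."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, steps=200):
    """Repeatedly apply the update w := w - alpha * dw."""
    for _ in range(steps):
        dw = dJ_dw(w)
        w = w - alpha * dw
    return w

# Two very different starting points end up at the same lowest point.
print(gradient_descent(-10.0))
print(gradient_descent(25.0))
```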
Partial derivative symbol - ∂ (a stylized lowercase d that is used to describe a derivative). This symbol is shown in place of a lowercase d if the function has more than one parameter. This is a convention of calculus.
- Derivative - A fancy term for the slope of a line (height/width or rise/run). Slope and derivative can be used interchangeably.
- df(a)/da or d/da(f(a))
- As you move along a straight line, the slope remains the same.
- If you have f(a) = a^2 then the slope will change as you move along the curve.
d/da(a^2) = 2a
- If you nudge a up by a tiny amount at some point, then you can expect f(a) (f of a) to go up by about 2a times that amount. At a = 2, f(a) goes up 4 times as much.
- d/da(a^3) = 3a^2
- If a = 2, f(a) = 8. The derivative is 3a^2 = 12, meaning that f(a) will go up 12 times as much as a.
- f(a) = log_e(a): d/da(f(a)) = 1/a
- If a = 2, f(a) = 0.69315. The derivative is 1/2, so we'd expect f(a) to go up by 1/2 as much as a.
- Derivatives of common functions can usually be found in textbooks, so you can always look them up.
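One way to check the derivative formulas above is to nudge a by a tiny amount and see how much f(a) moves. The helper name `numerical_derivative` and the nudge size `eps` are assumptions made for this sketch.

```python
import math

def numerical_derivative(f, a, eps=1e-6):
    """Approximate df/da by nudging a up by a small amount eps."""
    return (f(a + eps) - f(a)) / eps

a = 2.0
slope_square = numerical_derivative(lambda t: t ** 2, a)  # formula: 2a = 4
slope_cube = numerical_derivative(lambda t: t ** 3, a)    # formula: 3a^2 = 12
slope_log = numerical_derivative(math.log, a)             # formula: 1/a = 0.5
print(slope_square, slope_cube, slope_log)
```

The printed values are approximations, so they land very close to 4, 12, and 0.5 rather than exactly on them.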
- The computation graph helps to explain why both a forward and a backward propagation step are necessary.


- This shows how you can take a left-to-right pass to figure out J, and we are going to learn how you can take a right-to-left pass to figure out some of the other variables as well.
- One step of backward propagation on a computation graph yields the derivative of the final output variable with respect to one of the earlier variables.
- dJ/dv = What is the derivative of J with respect to v?
Chain rule - If a affects v and v affects J (a => v => J), then the change in J caused by a is the product of the two rates of change: dJ/da = dJ/dv * dv/da
- The picture below shows what we would name a variable in code that holds the derivative with respect to a certain variable. For the final output variable that you care about (such as J), dJ/da is simply named "da", in the same way as "dw" and "db".
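The chain rule can be traced on a tiny computation graph. The graph below (u = b * c, v = a + u, J = 3 * v) and its input values are a made-up example, not taken from the missing picture. The left-to-right pass computes J and the right-to-left pass computes the derivatives.

```python
# Forward pass (left to right) on a hypothetical graph.
a, b, c = 5.0, 3.0, 2.0
u = b * c      # 6
v = a + u      # 11
J = 3.0 * v    # 33

# Backward pass (right to left): start at J and apply the chain rule.
dJ_dv = 3.0              # J = 3v
dJ_da = dJ_dv * 1.0      # dv/da = 1
dJ_du = dJ_dv * 1.0      # dv/du = 1
dJ_db = dJ_du * c        # du/db = c
dJ_dc = dJ_du * b        # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)
```

Each derivative reuses the one computed just to its right in the graph, which is why the backward pass goes from right to left.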

- We want to modify the parameters (w and b) to reduce the loss.


- Remember: to get the derivatives for a training example, you always need to go backwards through the computation graph.
- The first step is to find the derivative of the loss function.
- The above code is what gradient descent looks like with two variables. You’ll notice the two for loops—we will get rid of those for loops using vectorization.
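For reference, a loop-based version of one gradient descent step might look roughly like the sketch below. The function name `gradient_step_loops`, the sample data, and the learning rate `alpha` are assumptions for illustration. The two kinds of for loop (over training examples and over features) are what vectorization removes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_loops(X, Y, w, b, alpha=0.1):
    """One iteration of gradient descent written with explicit for loops."""
    nx, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros((nx, 1))
    for i in range(m):                 # loop over training examples
        z = b
        for j in range(nx):            # loop over features (forward step)
            z += w[j, 0] * X[j, i]
        a = sigmoid(z)
        y = Y[0, i]
        J += -(y * np.log(a) + (1 - y) * np.log(1 - a))
        dz = a - y
        for j in range(nx):            # loop over features (backward step)
            dw[j, 0] += X[j, i] * dz
        db += dz
    J, dw, db = J / m, dw / m, db / m  # average over the training set
    return w - alpha * dw, b - alpha * db, J

# Hypothetical data: 2 features, 3 training examples.
X = np.array([[1.0, 2.0, -1.0], [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w, b, J = gradient_step_loops(X, Y, np.zeros((2, 1)), 0.0)
print(w, b, J)
```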
- Vectorization techniques allow us to get rid of explicit for loops in our code.
- Vectorization is so important because it helps code run faster. Since we are working with such big datasets, that can make a massive difference.
z = np.dot(w, x) + b
- w transpose x = np.dot(w, x)
import numpy as np
a = np.array([1, 2, 3, 4])
print(a)
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorized version: " + str(1000 * (toc - tic)) + "ms")
# Same computation with an explicit for loop
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print(c)
print("For loop: " + str(1000 * (toc - tic)) + "ms")
- Deep learning algorithms run better on a GPU but can also be run on a CPU.
- Both GPUs and CPUs have SIMD instructions. GPUs are better at SIMD calculations.
- SIMD - Single instruction, multiple data - takes advantage of parallelism to run your computations much faster if you use methods like np.dot().
np.zeros((nx, 1)) = this is the function you can use to create a column vector of zeros with nx rows
- This is the code required for a forward propagation step.
X = [x1 x2 x3 … xm]. Any capital letter that is used denotes this same format.
- (nx by m matrix = (nx, m))
Z = np.dot(w.T, X) + b (this is a 1 x m matrix that holds all of the z values)
- Activation logic
A = [a1 a2 a3 … am] = sigmoid(Z)
- Broadcasting - python will take the real number b and "broadcast" it out to a 1 x m row [b b … b] so that it can be added to every entry.
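The forward propagation step above can be sketched as follows. The sizes (nx = 3, m = 4), the random inputs, and the zero-initialized parameters are assumptions picked for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, m = 3, 4                       # hypothetical sizes
rng = np.random.default_rng(1)
X = rng.standard_normal((nx, m))   # one training example per column
w = np.zeros((nx, 1))              # column vector of weights
b = 0.0                            # a single real number

Z = np.dot(w.T, X) + b             # b is broadcast across all m columns
A = sigmoid(Z)
print(Z.shape, A.shape)
```

With w and b initialized to zero, every entry of A starts at 0.5, which is a quick sanity check on the shapes and the sigmoid.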


- You can see how dZ = A - Y through the logic included in the picture above.

- The perfect example on the right compared to the first bad example on the left:
- Using the code on the right, we have just implemented a single iteration of gradient descent for logistic regression.
- If you want to implement multiple iterations of gradient descent then you still need to use a for loop. There isn’t a known better way of doing this.
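Putting the pieces together, a vectorized training loop might look like the sketch below. The function name `train`, the tiny dataset, the learning rate, and the iteration count are all assumptions for illustration. The only for loop left is the one over gradient descent iterations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Vectorized gradient descent for logistic regression."""
    nx, m = X.shape
    w = np.zeros((nx, 1))
    b = 0.0
    for _ in range(iterations):       # the one remaining for loop
        Z = np.dot(w.T, X) + b        # forward propagation
        A = sigmoid(Z)
        dZ = A - Y                    # backward propagation
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw            # parameter updates
        b = b - alpha * db
    return w, b

# Hypothetical one-feature dataset: negative inputs are class 0, positive are class 1.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)
```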
import numpy as np
A = np.array([[56.0, 0.0, 4.0, 68.0],
[1.2, 104.0, 52.0, 8.0],
[1.8, 135.0, 99.0, 0.9]])
print(A)
cal = A.sum(axis=0) # this will sum them up vertically
print(cal)
# Calling reshape on cal is a little redundant because broadcasting already handles its shape,
# but sometimes it's good to do that if you aren't sure what your matrix looks like.
percentage = 100 * A / cal.reshape(1, 4)
print(percentage)
- The reshape command is very cheap to call so you don’t have to worry about it.
- If you multiply a vector, e.g. [1, 2, 3, 4], by a single number such as 100, python will expand that 100 into [100, 100, 100, 100] and multiply it across the board, element by element.
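A few small broadcasting cases, sketched with made-up arrays. The names `v`, `M`, `row`, and `col` are only for this example.

```python
import numpy as np

# A single number is expanded to match the vector.
v = np.array([[1, 2, 3, 4]])
print(v * 100)  # [[100 200 300 400]]

M = np.array([[1, 2, 3],
              [4, 5, 6]])           # shape (2, 3)
row = np.array([[100, 200, 300]])   # shape (1, 3): copied down both rows
col = np.array([[10], [20]])        # shape (2, 1): copied across all columns

print(M + row)
print(M + col)
```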


- Broadcasting has pros and cons in python.
- Pro: It can allow for great expressivity.
- Con: It can bring about strange bugs that are difficult to understand.
- Rank 1 Array - Data structure in python that doesn't consistently act like either a column vector or a row vector. Its shape looks like (5,). Avoid these with neural networks.
- Instead, make a column vector (np.random.randn(5,1)) or a row vector (np.random.randn(1,5)).
- If you are unsure about what type of vector you are working with, you can throw in an assertion:
assert(a.shape == (5,1))
- Assertions are very inexpensive to execute and they help to serve as code documentation.
- If you do end up with a rank 1 array, you can reshape it (reshape returns a new array, so assign the result):
a = a.reshape((5,1))
import numpy as np
a = np.random.rand(5)
print(a)
print(a.shape) # This will produce a “rank 1 array” i.e. (5,)
print(a.T) # This will look the same as a
print(np.dot(a, a.T)) # This produces a single number
- When coding neural networks, don’t use structures like this: (5,). Don’t use rank 1 arrays.
a = np.random.randn(5,1)
print(a)
print(a.T) # This produces a row vector.
print(np.dot(a, a.T)) # This will give you the outer product of the vector, a 5 x 5 matrix
- Jupyter notebooks are open-source notebooks that run in your web browser. They are great for working through python problems.
- https://jupyter.org/install.html








