%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section {Linear regression}
Let's consider a supervised learning example where the data points are houses with two features ($x^1$ = living area; $x^2$ = number of bedrooms) and the labels are the prices.\\
%\begin{table}[h!]
\begin{center}
\begin{tabular}{ccc}
Living area (ft$^2$) & \#bedrooms & price (1000\$s)\\
\hline
2104 & 3 & 400\\
1600 & 3 & 330\\
2400 & 3 & 369\\
1416 & 2 & 232\\
3000 & 4 & 540\\
... & ... & ...\\
\hline
\end{tabular}
\end{center}
%\end{table}
Each datapoint is thus $(\tb{x}_t,y_t)$, where $\tb{x}_t = (x_t^1,x_t^2)$ are the living area and the number of bedrooms respectively, and $y_t$ is the price. The main problem addressed by \emph{regression} is: can we predict the price (or label) $y_t$ from the input features $\tb{x}_t$? Today we study \emph{linear regression}, that is, predicting the price from the features using a linear regressor
\[
h_{w}(\tb{x}) = w^0 + w^1 x^1 + w^2 x^2
\]
where $w = (w^0, w^1,w^2)$ are the parameters of the regression function. Within the class of linear functions (regressors) our task shall be to find the best parameters $w$. When there is no risk of confusion, we will drop $w$ from the $h$ notation, and we will assume a dummy feature $x^0=1$ for all datapoints such that we can write
\[
h(\tb{x}) = \sum_{d=0}^D w^d x^d
\]
where $d$ runs over $0,1,\ldots,D$, with $x^0=1$ the dummy feature ($D=2$ in our example).
What do we mean by the best regression fit? For today, we will measure the error (or cost) of the regression function by the \emph{mean square error}
\[
J(w) = \frac{1}{2}\sum_t (h_{w}(\tb{x}_t) - y_t)^2
\]
(the factor $\frac{1}{2}$ is a convention that simplifies the derivatives later) and we will naturally look for the $w$ that minimizes the error function. The regression obtained using the square error function $J$ is called \emph{least squares regression}, or \emph{least squares fit}. We present two methods for minimizing $J$: a direct linear algebra solution next, and gradient descent optimization later.
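To make the cost concrete, here is a minimal sketch in Python that evaluates the least-squares cost on the five houses from the table above, with the $\frac{1}{2}$ convention used in these notes. The two candidate parameter vectors are made up for illustration; `h` and `J` mirror the notation in the text.

```python
# The five houses from the table, each with the dummy feature x^0 = 1
# prepended: (x^0, living area, #bedrooms). Prices are in $1000s.
X = [(1, 2104, 3), (1, 1600, 3), (1, 2400, 3), (1, 1416, 2), (1, 3000, 4)]
y = [400, 330, 369, 232, 540]

def h(w, x):
    """Linear regressor h_w(x) = sum_d w^d x^d."""
    return sum(wd * xd for wd, xd in zip(w, x))

def J(w):
    """Least-squares cost J(w) = 1/2 * sum_t (h_w(x_t) - y_t)^2."""
    return 0.5 * sum((h(w, x) - yt) ** 2 for x, yt in zip(X, y))

# Two made-up candidates: "price ~ 0.2 * living area" vs. "predict 0".
print(J((0.0, 0.2, 0.0)))  # much smaller cost than the guess below
print(J((0.0, 0.0, 0.0)))
```

The first candidate is far from optimal, but already has a much smaller cost than predicting zero for every house; minimizing $J$ over all $w$ is the subject of the rest of the notes.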
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Least mean squares via Gradient descent}
% \subsubsection{Gradient descent}
% The gradient descent algorithm finds a local minimum of the objective function $J$ by guessing an initial set of parameters $w$ and then "walking" episodically in the direction opposite the gradient $\partial J / \partial w$. Since $w$ is vector valued (3 components in our example) we need to perform the update for each component separately
% \[
% w^j = w^j - \lambda \frac{\partial J(w)}{\partial w^j}
% \]
% where $\lambda$ is the \emph{learning rate} parameter or the step of the update.
% To see this in practice, let's consider an example (other than the mean square error): say $J(x) = (x-2)^2+1$ and the initial guess for a minimum is $x_0 = 3$. The derivative of $J$ is
% \[
% \frac{\partial J(x)}{\partial x} = 2x - 4
% \]
% Assuming a fixed $\lambda=0.25$, the first several episodes of gradient descent are:
% \begin{eqnarray*}
% x_1 = x_0 - \lambda \frac{\partial J(x_0)}{\partial x} = 3 - .25(2\cdot 3 - 4) = 2.5\\
% x_2 = x_1 - \lambda \frac{\partial J(x_1)}{\partial x} = 2.5 - .25(2\cdot 2.5 - 4) = 2.25\\
% x_3 = x_2 - \lambda \frac{\partial J(x_2)}{\partial x} = 2.25 - .25(2\cdot 2.25 - 4) = 2.125
% \end{eqnarray*}
% Figure~\ref{gradient_descent} illustrates the episodic process.
% \begin{figure}[h!]
% \begin{center}
% %\begin{tabular}{cc}
% \includegraphics[width=0.55\columnwidth]{gradient_descent}
% %\end{tabular}
% \vspace{3ex}
% \end{center}
% \caption{Gradient descent iterations} \label{gradient_descent}
% \end{figure}
%
%
% \subsection {Least mean squares}
% In order to apply gradient descent to our problem, we first need to differentiate our error (objective) function $J$.
% \[
% J(w) = \frac{1}{2}\sum_t (h_{w}(\tb{x}_t) - y_t)^2
% \]
% Let's look at the derivative for the case when there is only one datapoint (so there are no sums)
% \begin{eqnarray*}
% \frac{\partial J(w)}{\partial w^j } &=& \frac{\partial \frac{1}{2} (h_{w}(\tb{x}) - y)^2 } {\partial w^j} \\
% &=& (h_{w}(\tb{x}) - y) \frac{\partial (h_{w}(\tb{x}) - y) } {\partial w^j} \\
% &=& (h_{w}(\tb{x}) - y) \frac{\partial (\sum_d{w^d x^d} - y) } {\partial w^j} \\
% &=& (h_{w}(\tb{x}) - y) x^j
% \end{eqnarray*}
% Thus the gradient descent update rule (for one data point) is
% \[
% w^j = w^j - \lambda (h_{w}(\tb{x}) - y) x^j
% \]
% The update has to be performed for all components $j = 0, 1, \ldots, D$ simultaneously (in parallel).
%
% There are two ways to adapt the rule derived above to the case where there are many datapoints. The first, \emph{batch gradient descent}, uses the error function $J$ over all datapoints when differentiating (essentially making the update for all points at once). The sum inside the expression for $J$ propagates to the derivative, and the update becomes
% \[
% w^j = w^j - \lambda \sum_t (h_{w}(\tb{x}_t) - y_t) x_t^j
% \]
% for all $j = 0, 1, \ldots, D$. Note that in batch gradient descent, a single update requires a pass over all datapoints; this can be very expensive if the data set is large.
%
% The second way to involve all datapoints is to make the update separately for each datapoint (\emph{stochastic gradient descent}). Each update step then becomes a loop over all datapoints:\\
% \\
% LOOP for t=1 to m
% \[
% \hspace{30ex} w^j = w^j - \lambda (h_{w}(\tb{x}_t) - y_t) x_t^j \text{ for all } j
% \]
% END LOOP
%
% Stochastic (or online) gradient descent advances by updating based on one datapoint at a time. If the regression problem is not too complicated, a few iterations are often enough to converge, so in practice this version of gradient descent is often faster. It is possible (although very unlikely) to have a problem where the stochastic updates "dance" around the best $w$ (the one that realizes the minimum error) without actually converging to it.
%
% To conclude regression via gradient descent, we make one final observation. The objective function $J$ is convex because its second derivative is nonnegative
% \[
% \frac{\partial^2 J(w)}{\partial (w^j)^2} = (x^j)^2 \geq 0 \text{ for all } j
% \]
% which means any local minimum is in fact a global minimum; thus gradient descent (or any method that finds local minima) finds a global minimum.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Least mean square via normal equations}
\subsection {Matrix derivatives}
Let $f:\mathbb{R}^{m\times n} \mapsto \mathbb{R}$ be a function that takes a matrix as input and outputs a real number. We define the derivative of $f$ with respect to the matrix $A$ as
\[
\nabla_A f(A) = \left [
\begin{array}{ccc}
\frac{\partial f}{\partial a_{11}} & ... & \frac{\partial f}{\partial a_{1n}} \\
... & ... & ... \\
\frac{\partial f}{\partial a_{m1}} & ... & \frac{\partial f}{\partial a_{mn}} \\
\end{array}
\right ]
\]
where $a_{ij}$ is the element of $A$ on row $i$ and column $j$. For example, consider
$A = \left[ \begin{array}{cc}
a_{11} & a_{12}\\
a_{21} & a_{22}
\end{array} \right]$
and the function $f(A) = \frac{3}{2}a_{11} + 5a_{12}^2 + a_{21}a_{22}$. Then
\[
\nabla_A f(A) = \left [
\begin{array}{cc}
\frac{3}{2} & 10 a_{12}\\
a_{22} & a_{21}
\end{array}
\right ]
\]
Today we will look at the \emph{trace} function as $f$. The trace of a matrix $A$ is the sum of the elements of the main diagonal
\[
\text{tr}(A) = \sum_i a_{ii}
\]
The following are simple and well known properties of the trace:
\begin{eqnarray*}
\text{tr}(AB) = \text{tr}(BA) \\
\text{tr}(A) = \text{tr}(A^T)\\
\text{tr}(A+B) = \text{tr}(A) + \text{tr}(B)\\
\text{tr}(xA) = x\text{tr}(A)
\end{eqnarray*}
The following properties of the trace matrix derivative are going to be useful for finding an exact regression solution:
\begin{eqnarray}
\nabla_A \text{tr}(AB) = B^T\\
\nabla_{A^T} f(A) = (\nabla_A f(A))^T\\
\nabla_A \text{tr}(ABA^TC) = CAB + C^TAB^T
\end{eqnarray}
Combining the second and third statements above we get
\begin{eqnarray}
\nabla_{A^T} \text{tr}(ABA^TC) = B^TA^TC^T + BA^TC
\label{eqn_5_linear_alg}
\end{eqnarray}
\subsection{An exact solution for regression using linear algebra}
Given the datapoints $(\tb{x}_t, y_t)$ for $t=1,2,\ldots,m$, each with $D$ input features plus the dummy feature $x^0=1$, we arrange them in matrix form
\[
X = \left[\begin{array}{cccc}
x_1^0 & x_1^1 & ... & x_1^D \\
... & ... & ... & ... \\
x_m^0 & x_m^1 & ... & x_m^D \\
\end{array}\right] \hspace {3ex}
Y= \left[\begin{array}{c}
y_1\\
...\\
y_m\\
\end{array}\right]
\]
Then the error vector associated with our regressor $h_{w}(\tb{x}) = \sum_d w^d x^d$ is
\[
E = \left[\begin{array}{c}
h_{w}(\tb{x}_1) - y_1\\
...\\
h_{w}(\tb{x}_m) - y_m\\
\end{array}\right]
=
\left[\begin{array}{c}
\tb{x}_1 w \\
...\\
\tb{x}_m w \\
\end{array}\right]
-
\left[\begin{array}{c}
y_1\\
...\\
y_m\\
\end{array}\right]
= Xw - Y
\]
(we used $w = (w^0, w^1, \ldots, w^D)^T$ as a column vector). We can now write the mean square error as
\[
J(w) = \frac{1}{2}\sum_t (h_{w}(\tb{x}_t) - y_t)^2 = \frac{1}{2}E^T E = \frac{1}{2} (Xw - Y)^T(Xw - Y)
\]
Then
\begin{eqnarray*}
\nabla_{w}J(w) &=& \nabla_{w}\frac{1}{2}E^T E = \nabla_{w} \frac{1}{2} (Xw - Y)^T(Xw - Y)\\
&=& \frac{1}{2} \nabla_{w} (w^TX^TXw - w^TX^TY - Y^TXw + Y^TY)\\
&=& \frac{1}{2} \nabla_{w} \text{tr}(w^TX^TXw - w^TX^TY - Y^TXw + Y^TY)\\
&=& \frac{1}{2} \nabla_{w} (\text{tr}(w^TX^TXw) - 2\text{tr}(Y^TXw))\\
&=& \frac{1}{2} (X^TXw + X^TXw - 2X^TY) \\
&=& X^TXw - X^TY
\end{eqnarray*}
because: in the third step we use $\text{tr}(x) = x$ for a scalar $x$, in the fourth step we use $\text{tr}(A) = \text{tr}(A^T)$ (and drop $Y^TY$, which does not depend on $w$), and in the fifth step we use equation \ref{eqn_5_linear_alg} with $A^T = w$, $B = B^T = X^TX$, $C = I$.
Since we are trying to minimize $J$, a convex function, a sure way to find $w$ that minimizes $J$ is to set its derivative to zero. In doing so we obtain
\[
X^TXw = X^TY \text{ or } w = (X^TX)^{-1}X^TY
\]
This is the exact $w$ that minimizes the mean square error (assuming $X^TX$ is invertible).
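The closed-form solution can be exercised on a tiny made-up data set. The sketch below builds $X^TX$ and $X^TY$ for a single input feature plus the dummy $x^0=1$, then solves the resulting $2\times 2$ normal equations $X^TXw = X^TY$ by Cramer's rule (a hand-rolled stand-in for a proper linear solver):

```python
# Toy data set (made up): dummy feature x^0 = 1 plus one input feature.
# The labels lie roughly on the line y = 1 + 2x.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
Y = [1.0, 3.2, 4.8, 7.0]

# Build X^T X (2x2) and X^T Y (2-vector) directly from the definitions.
XtX = [[sum(x[i] * x[j] for x in X) for j in range(2)] for i in range(2)]
XtY = [sum(x[i] * y for x, y in zip(X, Y)) for i in range(2)]

# Solve the 2x2 system XtX w = XtY by Cramer's rule.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
w0 = (XtY[0] * XtX[1][1] - XtX[0][1] * XtY[1]) / det
w1 = (XtX[0][0] * XtY[1] - XtY[0] * XtX[1][0]) / det
print(w0, w1)  # intercept and slope of the least-squares line
```

The recovered intercept and slope land close to the $(1, 2)$ used to generate the labels, as expected for a least-squares fit of nearly linear data.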
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section {Least mean square probabilistic interpretation}
Why mean square error? We now show that the objective $J$ we used is a direct consequence of a very common assumption about the data. Let's look at the errors
\[
\epsilon_t = h(\tb{x}_t) - y_t
\]
and let's assume that they are IID according to a Gaussian (normal) distribution of mean $\mu=0$ and variance $\sigma^2$. We write this $\epsilon_t \sim \mathcal{N}(0,\sigma^2)$, or
\[
p(\epsilon_t) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\epsilon_t^2}{2\sigma^2}\right)
\]
which implies
\[
p(y|\tb{x};w) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(w\tb{x}-y)^2}{2\sigma^2}\right)
\]
Note that above $w$ is a parameter (array) and not a random variable. Given the input $X$, what is the probability of $Y$ given the parameters $w$? Equivalently, this is the \emph{likelihood} that $w$ is the correct parameter for the model
\[
L(w) = L(w;X,Y) = p(Y|X;w)
\]
Using the IID assumption the likelihood becomes
\begin{eqnarray*}
L(w) &=& \prod_t p(y_t|\tb{x}_t;w)\\
&=& \prod_t \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(w\tb{x}_t-y_t)^2}{2\sigma^2}\right)
\end{eqnarray*}
Since we now have a probabilistic model of the data, a common way to determine the best parameters is \emph{maximum likelihood}; in other words, find the $w$ that realizes the maximum of $L$. Instead of maximizing $L$ we shall maximize the \emph{log likelihood} $\log L(w)$, because it simplifies the math (and produces the same ``best'' $w$)
\begin{eqnarray*}
l(w) &=& \log L(w)\\
&=& \log \prod_t \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(w\tb{x}_t-y_t)^2}{2\sigma^2}\right) \\
&=& \sum_t \log \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(w\tb{x}_t-y_t)^2}{2\sigma^2}\right) \\
&=& m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_t (h_{w}(\tb{x}_t)-y_t)^2 \\
\end{eqnarray*}
Hence, maximizing the likelihood $L$ produces the same $w$ as minimizing the mean square error (since the first term does not depend on $w$). That is to say, if we believe the errors to be IID normal, then the maximum likelihood is attained by the parameters $w$ that minimize the mean square error.
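The equivalence is easy to check numerically. Under the model above (with $\sigma=1$ for concreteness), the sketch below computes both the log likelihood $l(w)$ and the squared error for two made-up candidate parameters on toy one-feature data; the candidate with the larger likelihood is exactly the one with the smaller squared error:

```python
import math

# Toy 1-D data (made up): one feature, no intercept, labels near y = 2x.
xs = [0.5, 1.0, 2.0, 3.0]
ys = [1.1, 1.9, 4.2, 5.8]
sigma = 1.0
m = len(xs)

def sse(w):
    """Sum of squared errors for the scalar parameter w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def log_likelihood(w):
    """l(w) = m log(1/(sqrt(2 pi) sigma)) - SSE(w) / (2 sigma^2)."""
    return m * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma)) \
        - sse(w) / (2 * sigma ** 2)

w_good, w_bad = 2.0, 0.5  # two made-up candidates
print(log_likelihood(w_good) > log_likelihood(w_bad))  # True
print(sse(w_good) < sse(w_bad))                        # True
```

Since $l(w)$ is a constant minus a positive multiple of the squared error, the two orderings always agree, which is the point of the derivation above.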
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section {Polynomial regression}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section {Classification and logistic regression}
In classification, the labels $y$ are not numeric values (like prices), but instead \emph{class labels}. For today, let's assume that we have two classes denoted by 0 and 1; we call this \emph{binary classification} and we write $y\in\{0,1\}$.
\subsection{Logistic transformation}
We could, in principle, run the linear regression we just studied without making use of the fact that $y\in\{0,1\}$ (essentially assuming the $y$ are simply real numbers). There are several problems with this approach: first, regression assumes the data supports a linear fit, which might no longer be true for classification problems; second, our regressor $h(\tb{x})$ will take many undesirable values (such as ones far outside the interval $[0,1]$).
To make an explicit mapping between the real-valued regressor $h$ and the set $\{0,1\}$, we would like a function that preserves differentiability and has an easily interpretable meaning. We choose the \emph{logistic function}, also called the \emph{sigmoid}
\[
g(z) = \frac{1}{1+e^{-z}}
\]
\begin{figure}[h!]
\begin{center}
%\begin{tabular}{cc}
\includegraphics[width=0.41\columnwidth]{logistic_function}
%\end{tabular}
\vspace{3ex}
\end{center}
\caption{Logistic function} \label{logistic_function}
\end{figure}
Note that $g$ acts like an indicator for $\{0,1\}$, but it is much more sensitive than a linear function. We can make it even more sensitive by, for example, doubling the input $z$ before applying the function. Let's state the derivative of $g$, since we are going to use it later on
\begin{eqnarray*}
g'(z) &=& \frac{\partial g(z)}{\partial z} \\
&=& \frac{1}{(1+e^{-z})^2} e^{-z}\\
&=&\frac{1}{1+e^{-z}} \left(1-\frac{1}{1+e^{-z}} \right)\\
&=& g(z) (1-g(z))
\end{eqnarray*}
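The identity $g'(z) = g(z)(1-g(z))$ can be spot-checked numerically. The sketch below compares the analytic form against a central finite difference at a few arbitrary points (the helper names are illustrative):

```python
import math

def g(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime_fd(z, eps=1e-6):
    """Central finite-difference estimate of g'(z)."""
    return (g(z + eps) - g(z - eps)) / (2 * eps)

for z in (-2.0, 0.0, 0.7, 3.0):
    analytic = g(z) * (1.0 - g(z))
    print(z, analytic, g_prime_fd(z))  # the two values agree closely
```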
\subsection{Logistic regression}
We apply $g$ to the linear regression function to obtain a \emph{logistic regression}. Our new hypothesis (or predictor, or regressor) becomes
\[
h_{w}(\tb{x}) = g(w\tb{x}) = \frac{1}{1+e^{-w\tb{x}}} = \frac{1}{1+e^{-\sum_d w^d x^d}}
\]
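As a small illustration (with made-up weights and inputs), the logistic regressor squashes the linear score $w\tb{x}$ into $(0,1)$, which can be read as a score for class 1 and, if a hard decision is needed, thresholded at $0.5$:

```python
import math

def h(w, x):
    """Logistic regressor h_w(x) = g(w . x), with dummy feature x^0 = 1."""
    z = sum(wd * xd for wd, xd in zip(w, x))  # linear score w . x
    return 1.0 / (1.0 + math.exp(-z))         # sigmoid squashes into (0, 1)

w = (-1.0, 2.0)  # made-up parameters: (intercept w^0, slope w^1)
for x in ((1.0, 0.0), (1.0, 0.5), (1.0, 2.0)):
    p = h(w, x)
    print(x, round(p, 3), int(p > 0.5))  # score and thresholded class
```

With these weights the score crosses $0.5$ exactly where the linear part $w\tb{x}$ crosses zero (here at $x^1 = 0.5$), which is the decision boundary of the classifier.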
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%