Linear Regression


• As with the Perceptron, we start with an activation that is a linearly weighted sum of the inputs, ∑_j w_j x_j (note: x_0 = 1 is a constant input, so that w_0 is the bias); a small numerical sketch follows after this list
• The activation is the output (no thresholding)
• Regression: the target function can take on real values

Effect of a Single Binary Input
• Consider only binary inputs with x_{i,j} ∈ {0, 1}
• When x_{i,j} switches from 0 to 1 and the other inputs remain fixed (intervention), the j-th input adds the quantity w_j to the output, independent of context, i.e., independent of the other inputs! (The average causal effect is identical to the individual causal effect)
• Recall that for the Perceptron, the effect of an input on the output critically depends on context: when x_{i,j} switches from 0 to 1 and the other inputs remain fixed (intervention), the output might, depending on the other inputs, stay as is, change from 1 to −1, or change from −1 to 1
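The following is a minimal NumPy sketch (not from the slides; all names are illustrative) of this linear unit, with the constant input x_0 = 1 prepended so that w_0 acts as the bias:

    import numpy as np

    def predict(X, w):
        # linear unit: output = sum_j w_j * x_j, with x_0 = 1 as a constant input (bias)
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1
        return X1 @ w                                   # the activation is the output (no thresholding)

    # toy data: three examples with two inputs each
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    w = np.array([0.5, 2.0, -1.0])                      # w_0 = 0.5 is the bias
    print(predict(X, w))                                # [-0.5  2.5  1.5]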

Method of Least Squares
• Squared-loss cost function: cost(w) = ∑_{i=1}^N (y_i − ŷ_i)^2, with ŷ_i = ∑_{j=0}^M w_j x_{i,j}
• The parameters that minimize the cost function are called the least squares (LS) estimators: w_ls = arg min_w cost(w)
• For visualization, we take M = 1 (although linear regression is often applied to high-dimensional inputs)
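Continuing the illustrative sketch above (reusing the hypothetical predict function), the squared-loss cost can be evaluated as:

    def cost(X, y, w):
        # squared-loss cost: sum over the N training examples of (y_i - yhat_i)^2
        residuals = y - predict(X, w)
        return np.sum(residuals ** 2)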

Least-squares Estimator for Regression
One-dimensional regression (M = 1). Goal: find w_0 and w_1 such that the line ŷ = w_0 + w_1 x fits the training data {(x_i, y_i)}, i = 1, ..., N, as well as possible, i.e., minimizes the squared-loss cost.

Least-squares Estimator in Several Dimensions
General model: ŷ_i = w_0 + ∑_{j=1}^M w_j x_{i,j} = x_i^T w, where x_i = (1, x_{i,1}, ..., x_{i,M})^T and w = (w_0, w_1, ..., w_M)^T contains the M_p = M + 1 parameters.

Linear Regression with Several Inputs
Predictions as Matrix-Vector Product
• The vector of all predictions at the training data is ŷ = Xw, where X is the N × M_p matrix whose i-th row is x_i^T

Gradient Descent Learning
• Initialize parameters (typically using small random numbers)
• Adapt the parameters in the direction of the negative gradient
• The parameter gradient is (example: w_j) ∂cost(w)/∂w_j = −2 ∑_{i=1}^N (y_i − ŷ_i) x_{i,j}
• A sensible learning rule is w_j ← w_j + η ∑_{i=1}^N (y_i − ŷ_i) x_{i,j}, with learning rate η > 0
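A minimal sketch of batch gradient descent for this cost (illustrative names; the factor 2 from the gradient is kept explicit, and the learning rate eta may need tuning for a given data set):

    def gradient_descent(X, y, eta=0.001, steps=5000):
        # batch gradient descent on the squared-loss cost
        rng = np.random.default_rng(0)
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # constant input x_0 = 1
        w = 0.01 * rng.standard_normal(X1.shape[1])     # initialize with small random numbers
        for _ in range(steps):
            grad = -2 * X1.T @ (y - X1 @ w)             # gradient of sum_i (y_i - x_i^T w)^2
            w -= eta * grad                             # adapt in the direction of the negative gradient
        return w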

Analytic Solution
• The ADALINE is optimized by SGD
• Online adaptation: a physical system constantly produces new data; the ADALINE (SGD in general) can even track changes in the system
• With a fixed training data set, the least-squares solution can be calculated analytically in one step (least-squares regression)

Cost Function in Matrix Form
• cost(w) = ∑_{i=1}^N (y_i − x_i^T w)^2 = (y − Xw)^T (y − Xw)

Necessary Condition for an Optimum
• A necessary condition for an optimum is that ∂cost(w)/∂w |_{w=w_opt} = 0

One Parameter: Explicit
• With a single parameter w, cost(w) = ∑_{i=1}^N (y_i − w x_i)^2
• (Chain rule: inner derivative times outer derivative) ∂cost(w)/∂w = ∑_{i=1}^N 2 (y_i − w x_i)(−x_i)
• Thus ∂cost(w)/∂w = −2 ∑_{i=1}^N (y_i − w x_i) x_i
• (The same chain-rule argument in matrix form gives ∂cost(w)/∂w = −2 X^T (y − Xw))

Setting First Derivative to Zero
• In the one-parameter case: w_ls = (∑_{i=1}^N x_i y_i) / (∑_{i=1}^N x_i^2)
• Complexity (linear in N): both sums require a single pass over the N training examples
• In matrix form, the normal equations X^T X w = X^T y give w_ls = (X^T X)^{-1} X^T y; the matrix (X^T X)^{-1} X^T is called the Moore-Penrose pseudo inverse (generalized inverse) (Roger Penrose won the 2020 Nobel Prize in Physics)

Derivatives of Vector Products
• Useful identities: ∂(a^T w)/∂w = a and ∂(w^T A w)/∂w = (A + A^T) w
• Comment: one also finds the conventions in which these derivatives are written as row vectors

Machine Learning as an Inverse Problem
• The map w → y = Xw is the forward model
• Machine learning is an "inverse" problem
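A minimal sketch of the analytic solution (the explicit normal-equations form is chosen to mirror the formula above; np.linalg.lstsq would be the numerically preferred route in practice):

    def least_squares(X, y):
        # analytic LS solution: solve the normal equations X^T X w = X^T y
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # constant input x_0 = 1
        # numerically more robust alternative: np.linalg.lstsq(X1, y, rcond=None)
        return np.linalg.solve(X1.T @ X1, X1.T @ y)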

Stability of the Solution
• When N ≫ M_p, the LS solution is stable (small changes in the data lead to small changes in the parameter estimates)
• When N < M_p, there are many solutions which all produce zero training error
• Of all these solutions, one selects the one that minimizes ∑_{j=0}^M w_j^2 = w^T w (regularised solution)
• Even with N > M_p it is advantageous to regularize the solution, in particular with noise on the target

Linear Regression and Regularisation
• Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): cost^pen(w) = ∑_{i=1}^N (y_i − x_i^T w)^2 + λ ∑_{j=0}^M w_j^2; the idea is that the influence of a single data point should be small
• Stochastic gradient descent: let x_t and y_t be the training pattern in iteration t; then we adapt, for t = 1, 2, . . ., one pattern at a time, as sketched below
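A minimal sketch (assumptions: λ is a regularization hyperparameter, the bias is penalized too for brevity, and the per-pattern update shown is one common weight-decay form of the adaptation rule, not necessarily the exact one from the slides):

    def ridge(X, y, lam=1.0):
        # penalized least squares: solve (X^T X + lam*I) w = X^T y
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # constant input x_0 = 1
        M_p = X1.shape[1]
        return np.linalg.solve(X1.T @ X1 + lam * np.eye(M_p), X1.T @ y)

    def sgd_step(w, x_t, y_t, eta=0.01, lam=0.0):
        # adapt on one training pattern (x_t, y_t); x_t includes the constant input x_0 = 1,
        # and lam > 0 adds weight decay
        return w + eta * ((y_t - x_t @ w) * x_t - lam * w)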
Toy Example: One-dimensional Model (cont'd)
• A deeper analysis (in terms of the expected value E and the standard deviation stdev of the estimator) reveals that the estimate w_1 reflects the causal dependency of y on x_1
• The second coefficient, r_2 = 0.96, does not reflect a causal effect; it reflects the fact that x_1 and x_2 are highly correlated, and thus y and x_2 are as well (correlation does not imply causality).
• The third value, w_3, is correctly closer to 0, but not really small in magnitude.

Toy Example: Least Squares Regression
• We get the least-squares estimates for the three inputs; here it is important to see that the standard deviation of the estimate for the spurious input is largely reduced!
• Overall, in regression, the causal influence of x_1 stands out much more clearly!
• Both the influence of the correlated (non-causal) input x_2 and that of the noise input x_3 are largely reduced

The Power of Supervised Learning
• In regression: an input only has to model what the other inputs could not model
• In a one-dimensional analysis: each input on its own tries to model the dependency on y as well as possible!
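The toy example above can be reproduced in spirit with a small simulation (the data-generating process and numbers below are illustrative assumptions, not the slides' actual values; least_squares is the sketch from the analytic-solution section):

    rng = np.random.default_rng(1)
    N = 100
    x1 = rng.standard_normal(N)
    x2 = x1 + 0.1 * rng.standard_normal(N)   # highly correlated with x1, but no causal effect on y
    x3 = rng.standard_normal(N)              # pure noise input
    y = x1 + 0.1 * rng.standard_normal(N)    # y depends causally only on x1

    X = np.column_stack([x1, x2, x3])
    r = [np.corrcoef(X[:, j], y)[0, 1] for j in range(3)]  # one-dimensional analysis per input
    w = least_squares(X, y)                                 # joint regression: bias plus three weights
    print(r)      # r_1 and r_2 are both close to 1, r_3 is close to 0
    print(w[1:])  # w_1 dominates, while w_2 and w_3 are close to 0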
Application to Healthcare (cont'd)
• Example: Only patients without any other disease (x_1 = 1) get the treatment, but they get healthy (y = 1) independent of the treatment (x_2), since their bodies can focus on the disease of interest; still, this results in a correlation between treatment (x_2) and outcome (y)
• Fisher's hypothesis (epidemiology): a certain gene variant (x_1 = 1) causes lung cancer, but also makes you want to smoke (x_2 = 1); smoking itself has no effect on lung cancer (y); still, this results in a correlation between smoking (x_2) and outcome (y)
• Now that we can measure genetic variants: Fisher's hypothesis is (mostly) wrong

Remarks: Correlation versus Regression
• The Pearson correlation coefficient is independent of context and objective. Karl Pearson: "I interpreted that sentence of Francis Galton (1822-1911) [his advisor] to mean that there was a category broader than causation, namely correlation, of which causation was only the limit, and that this new conception of correlation brought psychology, anthropology, medicine, and sociology in large parts into the field of mathematical treatment."
• But the Pearson correlation coefficient does not reflect causality (dependencies)
• The regression coefficients display causal behavior much more closely; causality analysis based on observed data requires complete models
• "Gold standard": prospective randomized controlled trial (RCT): assign patients randomly to the treatment group

• For the toy data, the matrix X^T X shows the strong correlation between x_1 and x_2
• X^T y = (99, 97, −20)^T; this is N × (r_1, r_2, r_3)^T! We see the strong correlation of both x_1 and x_2 with y
• ŵ_3 = −0.018 is much closer to 0, compared to the sample Pearson correlation coefficient r_3 = −0.21

• If parameter interpretation is essential, or if, for computational reasons, one wants to keep the number of inputs small (a small sketch follows at the end of this section):
 - Forward selection: start with the empty model; at each step add the input that reduces the error most
 - Backward selection (pruning): start with the full model; at each step remove the input that increases the error the least
• But there is no guarantee that one finds the best subset of inputs or that one finds the true inputs

Experiments with Real World Data: Data from Prostate Cancer Patients

• With Var(ŵ^z_{ls,j}) = σ^2 / (N d_j^2), we get more shrinkage for weights which are uncertain
• In Hastie et al., "The Elements of Statistical Learning", it is shown that this shrinkage also occurs in the original coordinate system, without a diagonalization as preprocessing
• Regularization is especially important when N ≈ M_p, or when N < M_p
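For the input-selection remark above, a minimal sketch of greedy forward selection (illustrative; it uses training squared error as the criterion, whereas in practice one would score candidates on validation data; least_squares is the earlier sketch):

    def forward_selection(X, y, k):
        # greedily add, at each step, the input that most reduces the squared training error
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(k):
            best_j, best_err = None, np.inf
            for j in remaining:
                cols = selected + [j]
                w = least_squares(X[:, cols], y)
                X1 = np.hstack([np.ones((X.shape[0], 1)), X[:, cols]])
                err = np.sum((y - X1 @ w) ** 2)
                if err < best_err:
                    best_j, best_err = j, err
            selected.append(best_j)
            remaining.remove(best_j)
        return selected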