| { | |
| "title": "Linear Regression Mastery: 100 MCQs", | |
| "description": "A comprehensive set of 100 multiple-choice questions designed to test and deepen your understanding of Linear Regression, from fundamental concepts to advanced topics like model evaluation, assumptions, and regularization.", | |
| "questions": [ | |
| { | |
| "id": 1, | |
| "questionText": "What is the primary goal of Simple Linear Regression?", | |
| "options": [ | |
| "To classify data into two distinct categories.", | |
| "To model the linear relationship between two continuous variables.", | |
| "To find the clusters within a dataset.", | |
| "To reduce the dimensionality of the data." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Simple Linear Regression aims to establish a linear relationship between a single independent variable (predictor) and a single dependent variable (outcome). Its goal is to find the best-fitting straight line, represented by the equation y = beta_0 + beta_1x, that describes how the dependent variable changes as the independent variable changes." | |
| }, | |
| { | |
| "id": 2, | |
| "questionText": "In the equation y = beta_0 + beta_1x, what does beta_1 represent?", | |
| "options": [ | |
| "The y-intercept", | |
| "The slope of the line", | |
| "The predicted value of y", | |
| "The error term" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "beta_1 represents the slope of the regression line. It indicates the change in the dependent variable (y) for a one-unit change in the independent variable (x). A positive slope means y increases as x increases, while a negative slope means y decreases as x increases." | |
| }, | |
| { | |
| "id": 3, | |
| "questionText": "The value that a linear regression model predicts is called the:", | |
| "options": [ | |
| "Independent variable", | |
| "Residual", | |
| "Dependent variable", | |
| "Coefficient" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The linear regression model is built to predict the value of the dependent variable (also known as the response, target, or outcome variable). The independent variables are used as inputs to make this prediction." | |
| }, | |
| { | |
| "id": 4, | |
| "questionText": "What is a 'residual' in the context of linear regression?", | |
| "options": [ | |
| "The difference between the actual value and the predicted value.", | |
| "The slope of the regression line.", | |
| "The intercept of the regression line.", | |
| "The correlation between the two variables." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "A residual is the vertical distance between an actual data point and the regression line. It is calculated as e = y - y_hat, where y is the actual value and y_hat is the predicted value. The goal of regression is to minimize these residuals." | |
| }, | |
| { | |
| "id": 5, | |
| "questionText": "The method of Ordinary Least Squares (OLS) aims to minimize which of the following?", | |
| "options": [ | |
| "The sum of the absolute values of residuals", | |
| "The sum of the squared residuals", | |
| "The number of outliers", | |
| "The correlation coefficient" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Ordinary Least Squares (OLS) is the most common method for fitting a linear regression model. It works by finding the parameter values (beta_0 and beta_1) that minimize the sum of the squared differences between the observed dependent variable and the values predicted by the model. This is also known as minimizing the Sum of Squared Errors (SSE)." | |
| }, | |
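A minimal NumPy sketch of the OLS idea on synthetic (hypothetical) data: the closed-form slope and intercept are exactly the values that minimize the sum of squared residuals.

```python
import numpy as np

# Synthetic (hypothetical) data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, 50)

# Closed-form OLS estimates for simple linear regression
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
beta_0 = y.mean() - beta_1 * x.mean()                    # intercept

# The fitted line minimizes this quantity: the sum of squared residuals (SSE)
residuals = y - (beta_0 + beta_1 * x)
sse = np.sum(residuals ** 2)
print(beta_0, beta_1, sse)
```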
| { | |
| "id": 6, | |
| "questionText": "Which of the following is a key assumption of linear regression regarding the residuals?", | |
| "options": [ | |
| "Residuals must be positively correlated.", | |
| "Residuals should have a mean of 1.", | |
| "Residuals should be normally distributed.", | |
| "Residuals must be greater than the predicted values." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "One of the key assumptions of linear regression is that the error terms (residuals) are normally distributed with a mean of zero. This assumption is important for conducting hypothesis tests and constructing confidence intervals for the model parameters." | |
| }, | |
| { | |
| "id": 7, | |
| "questionText": "What does an R-squared (R^2) value of 0.85 signify?", | |
| "options": [ | |
| "85% of the predictions are correct.", | |
| "The correlation between the variables is 0.85.", | |
| "85% of the variance in the dependent variable is explained by the independent variable(s).", | |
| "The model is incorrect 15% of the time." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "R-squared, also known as the coefficient of determination, measures the proportion of the total variance in the dependent variable that can be explained by the independent variable(s) in the model. An R^2 of 0.85 means that 85% of the variability in the outcome can be accounted for by the predictors." | |
| }, | |
| { | |
| "id": 8, | |
| "questionText": "If you add more independent variables to a multiple linear regression model, what will happen to the R-squared (R^2) value?", | |
| "options": [ | |
| "It will always decrease.", | |
| "It will remain the same.", | |
| "It will always increase or stay the same.", | |
| "It will become negative." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The standard R-squared value will never decrease when you add more predictors to the model, even if the new predictors are not useful. It will either increase or, in a very rare case, stay the same. This is why Adjusted R-squared is often preferred, as it penalizes the addition of non-significant variables." | |
| }, | |
| { | |
| "id": 9, | |
| "questionText": "What is the primary purpose of Adjusted R-squared?", | |
| "options": [ | |
| "To measure the absolute error of the model.", | |
| "To account for the number of predictors in the model.", | |
| "To ensure the residuals are normally distributed.", | |
| "To calculate the y-intercept." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Adjusted R-squared modifies the R-squared value to account for the number of independent variables in the model. Unlike R^2, it can decrease if a newly added variable does not improve the model more than would be expected by chance. It is a more reliable measure when comparing models with different numbers of predictors." | |
| }, | |
| { | |
| "id": 10, | |
| "questionText": "What does the term 'multicollinearity' refer to in multiple regression?", | |
| "options": [ | |
| "A high correlation between the independent variables and the dependent variable.", | |
| "A lack of correlation between the independent variables.", | |
| "A high correlation between two or more independent variables.", | |
| "A non-linear relationship between variables." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This can make it difficult to determine the individual effect of each predictor on the dependent variable and can lead to unstable and unreliable coefficient estimates." | |
| }, | |
| { | |
| "id": 11, | |
| "questionText": "How can you detect multicollinearity?", | |
| "options": [ | |
| "By checking the R-squared value.", | |
| "By calculating the Variance Inflation Factor (VIF).", | |
| "By plotting the residuals.", | |
| "By examining the p-value of the model." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The Variance Inflation Factor (VIF) is a common metric used to detect multicollinearity. VIF measures how much the variance of an estimated regression coefficient is increased because of collinearity. A general rule of thumb is that a VIF value greater than 5 or 10 indicates a problematic level of multicollinearity." | |
| }, | |
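A short sketch of how VIFs might be computed with statsmodels, assuming the predictors live in a pandas DataFrame `X` (a hypothetical name); the correlated columns below are only illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return a VIF for each predictor column of X."""
    exog = sm.add_constant(X)  # include an intercept, as in the fitted model
    vifs = {col: variance_inflation_factor(exog.values, i)
            for i, col in enumerate(exog.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Example with two deliberately correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(0, 0.1, 200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})
print(vif_table(X))  # x1 and x2 should show inflated VIFs
```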
| { | |
| "id": 12, | |
| "questionText": "Which assumption states that the variance of the residuals should be constant for all levels of the independent variables?", | |
| "options": [ | |
| "Normality", | |
| "Linearity", | |
| "Independence", | |
| "Homoscedasticity" | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "Homoscedasticity means 'same variance'. This assumption implies that the variance of the error terms (residuals) is constant across all values of the independent variables. If the variance changes, the condition is called heteroscedasticity." | |
| }, | |
| { | |
| "id": 13, | |
| "questionText": "A scatter plot of residuals versus predicted values is useful for checking which assumption?", | |
| "options": [ | |
| "Normality of residuals", | |
| "Homoscedasticity", | |
| "Multicollinearity", | |
| "Autocorrelation" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Plotting residuals against predicted values (y_hat) helps to check for homoscedasticity. If the points are randomly scattered around the horizontal line at zero without any discernible pattern, the assumption of homoscedasticity is likely met. A cone shape or other systematic pattern suggests heteroscedasticity." | |
| }, | |
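A sketch of the residuals-vs-fitted diagnostic plot on synthetic data; in practice you would substitute your own fitted model and training data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic (hypothetical) data just to illustrate the diagnostic plot
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 1.5 + 2.0 * X[:, 0] + rng.normal(0, 1, 100)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
residuals = y - y_hat

plt.scatter(y_hat, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")  # reference line at zero
plt.xlabel("Fitted values (y_hat)")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted: a random band suggests homoscedasticity")
plt.show()
```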
| { | |
| "id": 14, | |
| "questionText": "What does a p-value for an independent variable's coefficient represent?", | |
| "options": [ | |
| "The probability that the coefficient is correct.", | |
| "The strength of the relationship.", | |
| "The probability of observing the data if the null hypothesis is true.", | |
| "The variance of the coefficient." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The p-value tests the null hypothesis that the coefficient is equal to zero (i.e., the variable has no effect). A small p-value (typically < 0.05) indicates that you can reject the null hypothesis, suggesting that the independent variable is a statistically significant predictor of the dependent variable." | |
| }, | |
| { | |
| "id": 15, | |
| "questionText": "What is overfitting in the context of linear regression?", | |
| "options": [ | |
| "The model is too simple to capture the underlying trend.", | |
| "The model performs very well on training data but poorly on unseen data.", | |
| "The model has a high bias and low variance.", | |
| "The model fails to converge during training." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations. This results in a model that performs exceptionally well on the data it was trained on but fails to generalize to new, unseen data. It is characterized by low bias and high variance." | |
| }, | |
| { | |
| "id": 16, | |
| "questionText": "Which of the following techniques is used to address overfitting in linear regression?", | |
| "options": [ | |
| "Increasing the number of features.", | |
| "Using a simpler model.", | |
| "Regularization (e.g., Ridge or Lasso).", | |
| "Both B and C." | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "Both using a simpler model (with fewer features) and applying regularization techniques can combat overfitting. Regularization methods like Ridge (L2) and Lasso (L1) add a penalty term to the cost function to shrink the magnitude of the coefficients, preventing them from becoming too large and complex." | |
| }, | |
| { | |
| "id": 17, | |
| "questionText": "What is the primary difference between Lasso (L1) and Ridge (L2) regularization?", | |
| "options": [ | |
| "Ridge can shrink coefficients to exactly zero, while Lasso cannot.", | |
| "Lasso can shrink coefficients to exactly zero, while Ridge cannot.", | |
| "Lasso uses the squared magnitude of coefficients as a penalty.", | |
| "Ridge is better for feature selection." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The key difference is in the penalty term. Lasso (L1) uses the absolute value of the coefficients (|beta_j|), which can force some coefficients to become exactly zero. This makes Lasso useful for automatic feature selection. Ridge (L2) uses the squared magnitude of coefficients (beta_j^2), which shrinks them close to zero but never exactly to zero." | |
| }, | |
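A small scikit-learn sketch, on synthetic data, contrasting how Lasso can drive weak coefficients to exactly zero while Ridge only shrinks them; the alpha values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 10 features actually matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
print("coefficients set exactly to zero by Ridge:", int(np.sum(ridge.coef_ == 0)))
```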
| { | |
| "id": 18, | |
| "questionText": "What does the term 'y-intercept' (beta_0) represent in a practical sense?", | |
| "options": [ | |
| "The value of the independent variable when the dependent variable is zero.", | |
| "The predicted value of the dependent variable when all independent variables are zero.", | |
| "The minimum possible value of the dependent variable.", | |
| "The average value of the independent variable." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The y-intercept (beta_0) is the predicted value of the dependent variable (y) when all independent variables in the model are equal to zero. In some contexts, this value might not have a practical or meaningful interpretation, especially if x=0 is outside the range of observed data." | |
| }, | |
| { | |
| "id": 19, | |
| "questionText": "If the correlation coefficient between two variables is close to -1, what does this indicate for linear regression?", | |
| "options": [ | |
| "A strong positive linear relationship.", | |
| "A strong negative linear relationship.", | |
| "No linear relationship.", | |
| "A perfect non-linear relationship." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A correlation coefficient close to -1 signifies a strong, negative linear relationship. This means that as one variable increases, the other variable tends to decrease in a highly predictable, linear fashion. This is an ideal scenario for applying simple linear regression." | |
| }, | |
| { | |
| "id": 20, | |
| "questionText": "Which of the following metrics is most sensitive to outliers?", | |
| "options": [ | |
| "Mean Absolute Error (MAE)", | |
| "Mean Squared Error (MSE)", | |
| "R-squared", | |
| "Adjusted R-squared" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Mean Squared Error (MSE) calculates the average of the squared residuals. Because the errors are squared, large errors (outliers) are penalized much more heavily than smaller errors. This makes MSE more sensitive to outliers compared to MAE, which uses the absolute value of the errors." | |
| }, | |
| { | |
| "id": 21, | |
| "questionText": "What is the purpose of Gradient Descent in linear regression?", | |
| "options": [ | |
| "To calculate the R-squared value.", | |
| "To find the optimal coefficients that minimize the cost function.", | |
| "To check for multicollinearity.", | |
| "To normalize the input features." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Gradient Descent is an iterative optimization algorithm used to find the values of the model parameters (coefficients) that minimize the cost function (like MSE). It works by taking steps in the direction of the steepest descent of the cost function's slope." | |
| }, | |
| { | |
| "id": 22, | |
| "questionText": "What does the 'learning rate' hyperparameter in Gradient Descent control?", | |
| "options": [ | |
| "The number of iterations.", | |
| "The size of the steps taken to minimize the cost function.", | |
| "The number of features to use.", | |
| "The penalty term in regularization." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The learning rate (often denoted by alpha) determines the size of the steps the algorithm takes during each iteration. A learning rate that is too small can lead to slow convergence, while one that is too large can cause the algorithm to overshoot the minimum and fail to converge." | |
| }, | |
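A from-scratch sketch of batch gradient descent for simple linear regression, showing where the learning rate enters; the data and hyperparameter values are hypothetical.

```python
import numpy as np

def gradient_descent_simple_lr(x, y, learning_rate=0.01, n_iters=1000):
    """Fit y ~ b0 + b1*x by minimizing MSE with batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = (b0 + b1 * x) - y
        # Gradients of MSE with respect to b0 and b1
        grad_b0 = (2 / n) * error.sum()
        grad_b1 = (2 / n) * (error * x).sum()
        # The learning rate scales the size of each update step
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Hypothetical data for demonstration: y = 4 + 2.5x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 100)
y = 4.0 + 2.5 * x + rng.normal(0, 0.3, 100)
print(gradient_descent_simple_lr(x, y, learning_rate=0.05, n_iters=5000))
```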
| { | |
| "id": 23, | |
| "questionText": "Which type of regression would be most appropriate for modeling a relationship that looks like a curve?", | |
| "options": [ | |
| "Simple Linear Regression", | |
| "Multiple Linear Regression", | |
| "Polynomial Regression", | |
| "Ridge Regression" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Polynomial Regression is used to model non-linear relationships. It does this by adding polynomial terms (e.g., x^2, x^3) of the independent variable as new features in the model. Even though the relationship is curved, it is still considered a type of linear regression because the model is linear in its coefficients." | |
| }, | |
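A sketch of polynomial regression with scikit-learn on synthetic quadratic data; the model remains linear in its coefficients even though the fitted curve is not a straight line.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Curved (quadratic) relationship -- hypothetical data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 0] ** 2 + rng.normal(0, 0.5, 150)

# Adding x^2 as a feature keeps the model linear in its coefficients
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.score(X, y))  # R^2 on the training data
```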
| { | |
| "id": 24, | |
| "questionText": "An outlier is a data point that:", | |
| "options": [ | |
| "Always improves model accuracy.", | |
| "Has a value of zero.", | |
| "Is significantly different from other observations.", | |
| "Is the average of all other points." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "An outlier is an observation that lies an abnormal distance from other values in a dataset. In regression, outliers can have a significant impact on the estimated regression line, potentially pulling it towards them and skewing the results." | |
| }, | |
| { | |
| "id": 25, | |
| "questionText": "How do you handle categorical independent variables in a linear regression model?", | |
| "options": [ | |
| "Remove them from the model.", | |
| "Convert them into a continuous variable.", | |
| "Use one-hot encoding or create dummy variables.", | |
| "Assign a unique integer to each category." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Categorical variables must be converted into a numerical format. The best practice is to use one-hot encoding, which creates new binary (0 or 1) columns for each category. This prevents the model from assuming an ordinal relationship between categories, which would happen if you simply assigned integers (e.g., 1, 2, 3)." | |
| }, | |
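A pandas sketch of one-hot encoding with k-1 dummies (`drop_first=True`); the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with one categorical predictor
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 990],
    "city": ["Austin", "Boston", "Austin", "Chicago"],
    "price": [200, 450, 320, 310],
})

# drop_first=True keeps k-1 dummies and avoids the dummy variable trap
X = pd.get_dummies(df[["sqft", "city"]], columns=["city"], drop_first=True)
print(X.columns.tolist())  # e.g. ['sqft', 'city_Boston', 'city_Chicago']
```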
| { | |
| "id": 26, | |
| "questionText": "The F-statistic in a regression output tests the:", | |
| "options": [ | |
| "Significance of a single coefficient.", | |
| "Overall significance of the entire model.", | |
| "Presence of heteroscedasticity.", | |
| "Normality of the residuals." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The F-statistic is used to test the null hypothesis that all of the model's coefficients are equal to zero. A statistically significant F-test (indicated by a small p-value) suggests that at least one of the independent variables is related to the dependent variable, meaning the model as a whole is useful." | |
| }, | |
| { | |
| "id": 27, | |
| "questionText": "What does it mean if the confidence interval for a coefficient contains zero?", | |
| "options": [ | |
| "The coefficient is highly significant.", | |
| "The variable has no relationship with the outcome.", | |
| "The coefficient is not statistically significant at the chosen confidence level.", | |
| "The model is overfitted." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A confidence interval provides a range of plausible values for a coefficient. If this range includes zero, it means we cannot be confident that the true value of the coefficient is different from zero. Therefore, the variable is not considered statistically significant." | |
| }, | |
| { | |
| "id": 28, | |
| "questionText": "Which plot is used to check the normality of residuals assumption?", | |
| "options": [ | |
| "Scatter plot of y vs. x", | |
| "Residuals vs. Fitted plot", | |
| "Q-Q (Quantile-Quantile) plot", | |
| "Correlation matrix heatmap" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A Q-Q plot is the standard method for visually checking if a set of data (in this case, the residuals) follows a specific distribution (in this case, the normal distribution). If the residuals are normally distributed, the points on the Q-Q plot will lie closely along the diagonal reference line." | |
| }, | |
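A sketch of a Q-Q plot with SciPy; the residuals here are simulated, but in practice you would pass the residuals of your fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals; replace with your model's residuals in practice
rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)

# Points close to the reference line suggest approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
```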
| { | |
| "id": 29, | |
| "questionText": "In the context of bias-variance tradeoff, a simple linear regression model with few predictors typically has:", | |
| "options": [ | |
| "High bias, low variance", | |
| "Low bias, high variance", | |
| "High bias, high variance", | |
| "Low bias, low variance" | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "A simple model (like one that underfits the data) has high bias because its assumptions are too strong and it cannot capture the complexity of the data. However, it has low variance because it will produce similar results across different training datasets. It is stable but systematically wrong." | |
| }, | |
| { | |
| "id": 30, | |
| "questionText": "What is the effect of standardizing (or scaling) the independent variables before fitting a linear regression model?", | |
| "options": [ | |
| "It changes the R-squared value of the model.", | |
| "It makes the interpretation of coefficients easier, especially in regularized models.", | |
| "It violates the linearity assumption.", | |
| "It always increases the model's accuracy." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Standardizing features (e.g., using StandardScaler to give them a mean of 0 and a standard deviation of 1) does not change the fundamental performance of a standard OLS model. However, it is crucial for algorithms that are sensitive to the scale of input features, like Gradient Descent and regularized models (Ridge, Lasso), as it ensures that the penalty term treats all coefficients equally." | |
| }, | |
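A sketch of standardizing features before a regularized fit, using a scikit-learn pipeline on synthetic data whose features have very different scales.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Hypothetical features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 10_000, 300)])
y = 2.0 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 300)

# Scaling first ensures the Ridge penalty treats both coefficients comparably
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```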
| { | |
| "id": 31, | |
| "questionText": "If the dependent variable is binary (e.g., yes/no), which model is generally more appropriate than linear regression?", | |
| "options": [ | |
| "Polynomial Regression", | |
| "Logistic Regression", | |
| "Exponential Regression", | |
| "Time Series Analysis" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "When the dependent variable is categorical (especially binary), Logistic Regression is the appropriate choice. Linear regression is designed for continuous outcomes and can produce predicted probabilities outside the 0-1 range, which is nonsensical for a binary outcome." | |
| }, | |
| { | |
| "id": 32, | |
| "questionText": "What does the term 'linearity' in linear regression assumptions mean?", | |
| "options": [ | |
| "The relationship between the independent and dependent variables is linear.", | |
| "The residuals are linear.", | |
| "The data points must form a perfect straight line.", | |
| "The variables must be from a linear dataset." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "The linearity assumption states that there should be a linear relationship between the predictor variables (x) and the outcome variable (y). If this relationship is non-linear, the model will be a poor fit, and its predictions will be inaccurate." | |
| }, | |
| { | |
| "id": 33, | |
| "questionText": "Which statement about Simple and Multiple Linear Regression is TRUE?", | |
| "options": [ | |
| "Simple Linear Regression can have multiple dependent variables.", | |
| "Multiple Linear Regression has only one independent variable.", | |
| "Multiple Linear Regression can model the effect of several independent variables on one dependent variable.", | |
| "Simple Linear Regression is always more accurate." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Simple Linear Regression involves one independent and one dependent variable. Multiple Linear Regression extends this by allowing for two or more independent variables to predict a single dependent variable, which can often create a more accurate and comprehensive model." | |
| }, | |
| { | |
| "id": 34, | |
| "questionText": "Root Mean Squared Error (RMSE) is calculated as:", | |
| "options": [ | |
| "The sum of squared errors.", | |
| "The square root of the Mean Absolute Error.", | |
| "The square root of the Mean Squared Error.", | |
| "The absolute value of the R-squared." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "RMSE is the square root of the Mean Squared Error (MSE). Taking the square root puts the error metric back into the original units of the dependent variable, making it easier to interpret than MSE." | |
| }, | |
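A tiny NumPy sketch computing MAE, MSE, and RMSE on hypothetical values, to show which metrics share the units of y.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # same units as y
mse = np.mean(errors ** 2)      # squared units
rmse = np.sqrt(mse)             # back to the units of y
print(mae, mse, rmse)
```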
| { | |
| "id": 35, | |
| "questionText": "A model that is too complex and captures the noise in the data is said to have:", | |
| "options": [ | |
| "Low bias, high variance", | |
| "High bias, low variance", | |
| "Low bias, low variance", | |
| "High bias, high variance" | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "This describes an overfitted model. It has low bias because it fits the training data very closely. It has high variance because it is highly sensitive to the specific training data; a different training set would result in a very different model." | |
| }, | |
| { | |
| "id": 36, | |
| "questionText": "In a regression model, if an independent variable's coefficient is 5.0, what is the correct interpretation?", | |
| "options": [ | |
| "For every 5 units increase in y, x increases by 1 unit.", | |
| "For every 1 unit increase in x, the predicted value of y increases by 5.0 units, holding other variables constant.", | |
| "The correlation between x and y is 5.0.", | |
| "5% of the variance in y is explained by x." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The coefficient of an independent variable represents the average change in the dependent variable for a one-unit increase in that independent variable, assuming all other variables in the model are held constant." | |
| }, | |
| { | |
| "id": 37, | |
| "questionText": "What problem does heteroscedasticity cause in linear regression?", | |
| "options": [ | |
| "It makes the coefficient estimates biased.", | |
| "It makes the coefficient estimates inconsistent.", | |
| "It makes the standard errors of the coefficients unreliable.", | |
| "It always reduces the R-squared value." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "While heteroscedasticity does not cause bias in the coefficient estimates themselves, it leads to biased and unreliable standard errors. This, in turn, makes the t-tests and F-tests for coefficient significance invalid, potentially leading to incorrect conclusions about the importance of predictors." | |
| }, | |
| { | |
| "id": 38, | |
| "questionText": "A Breusch-Pagan test is used to check for:", | |
| "options": [ | |
| "Normality", | |
| "Linearity", | |
| "Multicollinearity", | |
| "Heteroscedasticity" | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "The Breusch-Pagan test is a formal statistical test used to determine if heteroscedasticity is present in a regression model. The null hypothesis is that homoscedasticity is present." | |
| }, | |
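A sketch of the Breusch-Pagan test with statsmodels on synthetic heteroscedastic data; the simulated noise pattern is only illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data where the error variance grows with x (heteroscedastic)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)  # noise standard deviation scales with x

exog = sm.add_constant(x)
results = sm.OLS(y, exog).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, exog)
print(lm_pvalue)  # a small p-value is evidence against homoscedasticity
```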
| { | |
| "id": 39, | |
| "questionText": "What does 'extrapolation' mean in the context of regression?", | |
| "options": [ | |
| "Using the model to predict values within the range of the training data.", | |
| "Using the model to predict values outside the range of the training data.", | |
| "Removing outliers from the data.", | |
| "Adding new variables to the model." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Extrapolation is the process of making predictions for values of the independent variable(s) that are outside the range of the data used to build the model. Such predictions are often unreliable because the linear relationship observed within the data range may not hold true outside of it." | |
| }, | |
| { | |
| "id": 40, | |
| "questionText": "Which of these is NOT an assumption of classical linear regression?", | |
| "options": [ | |
| "The independent variables must be normally distributed.", | |
| "The error terms are normally distributed.", | |
| "The error terms have constant variance.", | |
| "The error terms are independent of each other." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "Classical linear regression does not assume that the independent variables themselves are normally distributed. The key normality assumption applies to the error terms (residuals), not the predictors." | |
| }, | |
| { | |
| "id": 41, | |
| "questionText": "If two independent variables are perfectly collinear (correlation = 1 or -1), what happens when you try to fit an OLS model?", | |
| "options": [ | |
| "The model will have a perfect R-squared of 1.", | |
| "The model fitting will fail because the coefficient matrix is not invertible.", | |
| "The coefficients will be exactly zero.", | |
| "The model will be more accurate." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Perfect multicollinearity makes it impossible for the OLS algorithm to find a unique solution for the coefficients. Mathematically, it means the design matrix (X^T X) is singular and cannot be inverted, so the calculation for the coefficients fails." | |
| }, | |
| { | |
| "id": 42, | |
| "questionText": "The Durbin-Watson statistic is used to detect:", | |
| "options": [ | |
| "Outliers", | |
| "Autocorrelation in the residuals", | |
| "Heteroscedasticity", | |
| "Multicollinearity" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The Durbin-Watson statistic is a test for autocorrelation (also known as serial correlation) in the residuals from a regression analysis. Autocorrelation is common in time-series data, where the error in one time period is correlated with the error in the subsequent period." | |
| }, | |
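A sketch of the Durbin-Watson statistic with statsmodels on synthetic data with AR(1) errors; values near 2 suggest little autocorrelation, values well below 2 suggest positive autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical time-series-like data with autocorrelated (AR(1)) errors
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
noise = np.zeros(200)
for i in range(1, 200):
    noise[i] = 0.8 * noise[i - 1] + rng.normal(0, 1)
y = 5 + 0.3 * t + noise

results = sm.OLS(y, sm.add_constant(t)).fit()
print(durbin_watson(results.resid))
```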
| { | |
| "id": 43, | |
| "questionText": "Which cost function is minimized in Lasso Regression?", | |
| "options": [ | |
| "MSE", | |
| "MSE + lambda * sum(beta_j^2)", | |
| "MSE + lambda * sum(|beta_j|)", | |
| "MAE" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Lasso (L1) regression minimizes the standard Mean Squared Error (MSE) plus a penalty term. This penalty is the product of a regularization parameter (lambda) and the sum of the absolute values of the coefficients (|beta_j|)." | |
| }, | |
| { | |
| "id": 44, | |
| "questionText": "An R-squared value can range from:", | |
| "options": [ | |
| "-1 to 1", | |
| "0 to 1", | |
| "0 to infinity", | |
| "-infinity to +infinity" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The standard R-squared value ranges from 0 to 1 (or 0% to 100%). A value of 0 indicates that the model explains none of the variability in the dependent variable, while a value of 1 indicates that the model explains all the variability." | |
| }, | |
| { | |
| "id": 45, | |
| "questionText": "What is 'underfitting'?", | |
| "options": [ | |
| "The model is too complex and fits the noise.", | |
| "The model is too simple and fails to capture the underlying pattern in the data.", | |
| "The model has high variance and low bias.", | |
| "The model performs well on test data but poorly on training data." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Underfitting occurs when a model is too simple to capture the underlying structure of the data. It performs poorly on both the training data and new, unseen data. It is characterized by high bias and low variance." | |
| }, | |
| { | |
| "id": 46, | |
| "questionText": "An interaction term in a multiple regression model allows for:", | |
| "options": [ | |
| "The effect of one independent variable on the dependent variable to depend on the value of another independent variable.", | |
| "The removal of multicollinearity.", | |
| "A non-linear relationship with a single predictor.", | |
| "The standardization of coefficients." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "An interaction term (e.g., x1 * x2) is included to model the situation where the relationship between one predictor (x1) and the outcome (y) changes depending on the level of another predictor (x2)." | |
| }, | |
| { | |
| "id": 47, | |
| "questionText": "If the residuals of a model show a 'cone' shape when plotted against the fitted values, this suggests:", | |
| "options": [ | |
| "Multicollinearity", | |
| "The model is a good fit", | |
| "Homoscedasticity", | |
| "Heteroscedasticity" | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "A cone-shaped pattern (fanning out or in) in the residual plot is a classic sign of heteroscedasticity. It indicates that the variance of the errors is not constant across all levels of the predicted values." | |
| }, | |
| { | |
| "id": 48, | |
| "questionText": "The 'best-fit line' in simple linear regression is the line that:", | |
| "options": [ | |
| "Passes through the most data points.", | |
| "Has the smallest sum of squared vertical distances from the points to the line.", | |
| "Connects the first and last data points.", | |
| "Has a slope of 1 and an intercept of 0." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "This is the definition of the Ordinary Least Squares (OLS) criterion. The best-fit line is the one that minimizes the sum of the squared residuals, which are the vertical distances between the observed data points and the line." | |
| }, | |
| { | |
| "id": 49, | |
| "questionText": "A small p-value (< 0.05) for the F-statistic of a regression model indicates that:", | |
| "options": [ | |
| "The model is not useful.", | |
| "All coefficients are statistically significant.", | |
| "At least one predictor variable is significantly related to the outcome variable.", | |
| "The residuals are normally distributed." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A significant F-statistic means we can reject the null hypothesis that all model coefficients are zero. It provides evidence that the model, as a whole, is statistically significant and has some predictive capability." | |
| }, | |
| { | |
| "id": 50, | |
| "questionText": "In the formula for Ridge Regression, what does the hyperparameter lambda control?", | |
| "options": [ | |
| "The number of features to select.", | |
| "The learning rate of gradient descent.", | |
| "The strength of the penalty on the coefficient size.", | |
| "The y-intercept." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Lambda is the regularization parameter. It controls the trade-off between fitting the data well (minimizing MSE) and keeping the model simple (keeping coefficients small). A larger lambda imposes a stronger penalty, leading to smaller coefficients and more bias." | |
| }, | |
| { | |
| "id": 51, | |
| "questionText": "Which metric is in the same units as the dependent variable?", | |
| "options": [ | |
| "R-squared", | |
| "Mean Squared Error (MSE)", | |
| "Root Mean Squared Error (RMSE)", | |
| "F-statistic" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Both MAE and RMSE are in the original units of the dependent variable. MSE is in squared units. Because RMSE is in the original units, it is often more interpretable than MSE for describing the model's typical prediction error." | |
| }, | |
| { | |
| "id": 52, | |
| "questionText": "To use a categorical variable with 4 categories in a regression model, how many dummy variables should you create?", | |
| "options": [ | |
| "1", | |
| "2", | |
| "3", | |
| "4" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "You should always use k-1 dummy variables for a categorical variable with k categories. One category is chosen as the 'reference' or 'baseline' category and is represented by having all dummy variables for that feature set to 0. This avoids the 'dummy variable trap', a form of perfect multicollinearity." | |
| }, | |
| { | |
| "id": 53, | |
| "questionText": "A confidence interval for a regression slope is [2.5, 4.5]. What is the correct interpretation?", | |
| "options": [ | |
| "There is a 95% probability that the true slope is between 2.5 and 4.5.", | |
| "The slope of the line is guaranteed to be between 2.5 and 4.5.", | |
| "We are 95% confident that for every one-unit increase in x, y will increase by an amount between 2.5 and 4.5.", | |
| "95% of the data points fall within this range." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A 95% confidence interval means that if we were to repeat the sampling process many times, 95% of the calculated intervals would contain the true population slope. Therefore, we are 95% confident that the true effect of x on y lies within this range." | |
| }, | |
| { | |
| "id": 54, | |
| "questionText": "What is a major limitation of linear regression?", | |
| "options": [ | |
| "It cannot be used for prediction.", | |
| "It can only model linear relationships.", | |
| "It requires the data to be perfectly clean with no noise.", | |
| "It cannot handle more than two independent variables." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The primary limitation is its core assumption: the relationship between the predictors and the outcome must be linear. If the true relationship is non-linear, the model will produce biased and inaccurate predictions. While techniques like polynomial regression can help, they are extensions built upon this linear framework." | |
| }, | |
| { | |
| "id": 55, | |
| "questionText": "If you plot the independent variable on the x-axis and the dependent variable on the y-axis, what are you creating?", | |
| "options": [ | |
| "A histogram", | |
| "A Q-Q plot", | |
| "A scatter plot", | |
| "A box plot" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A scatter plot is the standard way to visualize the relationship between two continuous variables. It is the first step you should take before building a simple linear regression model to visually assess if a linear relationship exists." | |
| }, | |
| { | |
| "id": 56, | |
| "questionText": "Which of these regression models can perform feature selection automatically?", | |
| "options": [ | |
| "Simple Linear Regression", | |
| "Ridge Regression", | |
| "Lasso Regression", | |
| "Polynomial Regression" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Lasso (L1) regression is known for its ability to perform automatic feature selection. By applying a penalty based on the absolute value of the coefficients, it can shrink the coefficients of less important features to exactly zero, effectively removing them from the model." | |
| }, | |
| { | |
| "id": 57, | |
| "questionText": "What does a VIF (Variance Inflation Factor) value of 1 indicate?", | |
| "options": [ | |
| "Perfect multicollinearity", | |
| "High multicollinearity", | |
| "No multicollinearity", | |
| "The variable is not significant." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A VIF of 1 indicates that there is no correlation between a given independent variable and any of the other independent variables in the model. This is the ideal scenario, meaning no inflation of the coefficient's variance." | |
| }, | |
| { | |
| "id": 58, | |
| "questionText": "What is the relationship between the correlation coefficient (r) and the coefficient of determination (R^2) in simple linear regression?", | |
| "options": [ | |
| "R^2 = r", | |
| "R^2 = 2r", | |
| "R^2 = r^2", | |
| "R^2 = sqrt(r)" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "In simple linear regression (with only one predictor), the coefficient of determination (R^2) is equal to the square of the Pearson correlation coefficient (r) between the independent and dependent variables. This relationship does not hold for multiple linear regression." | |
| }, | |
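A quick numerical check, on synthetic data, that R-squared equals the squared Pearson correlation in simple linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data with a single predictor
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 2, 100)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = r2_score(y, model.predict(x.reshape(-1, 1)))
print(r ** 2, r2)  # these should match (up to rounding)
```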
| { | |
| "id": 59, | |
| "questionText": "If your model's residuals show a clear pattern (e.g., a curve), what does this suggest?", | |
| "options": [ | |
| "The linearity assumption is violated.", | |
| "The model is a perfect fit.", | |
| "The data contains no outliers.", | |
| "The residuals are normally distributed." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "A non-random pattern in the residuals (like a U-shape or a systematic trend) plotted against the fitted values is a strong indication that the linearity assumption is not met. It suggests that the model is missing a non-linear component." | |
| }, | |
| { | |
| "id": 60, | |
| "questionText": "In statistics, what is the 'null hypothesis' for a regression coefficient beta_1?", | |
| "options": [ | |
| "beta_1 = 1", | |
| "beta_1 != 0", | |
| "beta_1 = 0", | |
| "beta_1 > 0" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The null hypothesis (H_0) for the t-test of a regression coefficient is that the coefficient is equal to zero. This implies that the corresponding independent variable has no linear relationship with the dependent variable." | |
| }, | |
| { | |
| "id": 61, | |
| "questionText": "Why is it important to split data into training and testing sets?", | |
| "options": [ | |
| "To make the model run faster.", | |
| "To evaluate the model's performance on unseen data.", | |
| "To increase the R-squared value.", | |
| "To satisfy the linearity assumption." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Splitting the data allows you to train the model on one subset (the training set) and then test its ability to generalize on a separate, unseen subset (the testing set). This is crucial for assessing whether the model is overfitted and how it will likely perform in the real world." | |
| }, | |
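A sketch of a train/test split with scikit-learn on synthetic data, comparing R-squared on seen versus unseen data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compare fit on seen vs. unseen data to spot overfitting
print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```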
| { | |
| "id": 62, | |
| "questionText": "If the p-value of a coefficient is 0.03, what would you conclude at a 5% significance level?", | |
| "options": [ | |
| "Fail to reject the null hypothesis; the variable is not significant.", | |
| "Reject the null hypothesis; the variable is significant.", | |
| "Accept the alternative hypothesis; the variable is not significant.", | |
| "The result is inconclusive." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A significance level of 5% corresponds to a threshold of 0.05. Since the p-value (0.03) is less than the significance level (0.05), we reject the null hypothesis. This means there is statistically significant evidence that the variable has an effect on the outcome." | |
| }, | |
| { | |
| "id": 63, | |
| "questionText": "Which technique combines L1 and L2 penalties?", | |
| "options": [ | |
| "Ridge Regression", | |
| "Lasso Regression", | |
| "Polynomial Regression", | |
| "Elastic Net Regression" | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "Elastic Net is a regularized regression method that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods. It is useful when there are multiple correlated features, as it tends to group and shrink their coefficients together." | |
| }, | |
| { | |
| "id": 64, | |
| "questionText": "A model's performance on the training data is excellent, but on the test data, it is poor. This is a classic sign of:", | |
| "options": [ | |
| "Underfitting", | |
| "Overfitting", | |
| "Multicollinearity", | |
| "Heteroscedasticity" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "This performance gap between training and testing data is the hallmark of overfitting. The model has learned the specifics of the training data, including its noise, and cannot generalize its predictive power to new data." | |
| }, | |
| { | |
| "id": 65, | |
| "questionText": "What does a residual of 0 for a data point mean?", | |
| "options": [ | |
| "The data point is an outlier.", | |
| "The model made a mistake.", | |
| "The predicted value is exactly equal to the actual value.", | |
| "The independent variable for that point is 0." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A residual is calculated as (actual value - predicted value). If the residual is 0, it means the predicted value was perfect for that specific observation, and the data point lies exactly on the regression line." | |
| }, | |
| { | |
| "id": 66, | |
| "questionText": "What is the main advantage of MAE over MSE?", | |
| "options": [ | |
| "It is easier to calculate.", | |
| "It is more robust to outliers.", | |
| "It is always a smaller number.", | |
| "It is differentiable everywhere." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Mean Absolute Error (MAE) is less sensitive to outliers than Mean Squared Error (MSE). Because MSE squares the errors, a few large errors from outliers can dominate the metric. MAE, which takes the absolute value, is not as heavily influenced by these extreme values." | |
| }, | |
| { | |
| "id": 67, | |
| "questionText": "The assumption of independence of errors is particularly important for which type of data?", | |
| "options": [ | |
| "Cross-sectional data", | |
| "Time-series data", | |
| "Categorical data", | |
| "Standardized data" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The independence of errors assumption means that the residual for one observation is not correlated with the residual for another. This is often violated in time-series data, where an observation at one point in time is likely related to the observation at the previous point in time (autocorrelation)." | |
| }, | |
| { | |
| "id": 68, | |
| "questionText": "If the slope of a regression line is zero, what does it imply?", | |
| "options": [ | |
| "A perfect positive linear relationship.", | |
| "A perfect negative linear relationship.", | |
| "No linear relationship between the independent and dependent variables.", | |
| "The intercept must also be zero." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "A slope of zero means the regression line is horizontal. This indicates that as the independent variable changes, the predicted value of the dependent variable does not change at all. Therefore, there is no linear relationship between the two variables." | |
| }, | |
| { | |
| "id": 69, | |
| "questionText": "Which of the following is an example of a linear model?", | |
| "options": [ | |
| "y = beta_0 + beta_1x + beta_2x^2", | |
| "y = beta_0 + beta_1log(x)", | |
| "y = beta_0e^(beta_1x)", | |
| "Both A and B" | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "A model is considered 'linear' in linear regression if it is linear in its parameters (the beta coefficients). Both polynomial regression (y = beta_0 + beta_1x + beta_2x^2) and models with transformed variables (y = beta_0 + beta_1log(x)) are linear in their parameters. The model y = beta_0e^(beta_1x) is non-linear in its parameters." | |
| }, | |
| { | |
| "id": 70, | |
| "questionText": "What is the effect of removing a significant predictor variable from a multiple regression model?", | |
| "options": [ | |
| "The Adjusted R-squared will likely increase.", | |
| "The model's predictive power will likely decrease.", | |
| "The issue of multicollinearity will be solved.", | |
| "The intercept will become zero." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Removing a variable that has a statistically significant relationship with the outcome will typically reduce the model's ability to explain the variance in the dependent variable. This will likely lead to a decrease in both R-squared and Adjusted R-squared, and poorer overall performance." | |
| }, | |
| { | |
| "id": 71, | |
| "questionText": "What is the primary motivation for using multiple linear regression over simple linear regression?", | |
| "options": [ | |
| "It is computationally less expensive.", | |
| "It is easier to interpret.", | |
| "It can account for the effects of multiple factors simultaneously.", | |
| "It does not require assumptions about the data." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The real world is complex, and outcomes are rarely influenced by just one factor. Multiple linear regression allows us to build more realistic models that can incorporate several predictors, potentially leading to better predictions and a more nuanced understanding of the relationships." | |
| }, | |
| { | |
| "id": 72, | |
| "questionText": "In a model predicting salary based on years of experience, what are the dependent and independent variables?", | |
| "options": [ | |
| "Dependent: Years of experience, Independent: Salary", | |
| "Dependent: Salary, Independent: Years of experience", | |
| "Both are dependent variables.", | |
| "Both are independent variables." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "We are trying to predict salary, so 'Salary' is the dependent (outcome) variable. We are using 'Years of experience' to make that prediction, so it is the independent (predictor) variable." | |
| }, | |
| { | |
| "id": 73, | |
| "questionText": "What does a high leverage point in regression refer to?", | |
| "options": [ | |
| "A point with a large residual.", | |
| "A point with an extreme value for the independent variable.", | |
| "A point with an extreme value for the dependent variable.", | |
| "A point that strengthens the model's R-squared." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A high leverage point is an observation with an extreme value for one or more of its predictor (independent) variables. These points have the potential to exert a strong influence on the slope of the regression line." | |
| }, | |
| { | |
| "id": 74, | |
| "questionText": "An influential point is an observation that:", | |
| "options": [ | |
| "Is always an outlier.", | |
| "Has high leverage.", | |
| "Significantly alters the regression line if removed from the dataset.", | |
| "Has a residual of zero." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "An influential point is one whose removal causes a substantial change in the model's coefficients, predictions, or overall fit. A point can be influential if it is an outlier, has high leverage, or both. Cook's distance is a common metric to measure influence." | |
| }, | |
| { | |
| "id": 75, | |
| "questionText": "If the goal is purely prediction and not interpretation, which of the following issues might be less of a concern?", | |
| "options": [ | |
| "Overfitting", | |
| "Heteroscedasticity", | |
| "Multicollinearity", | |
| "Outliers" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "Multicollinearity primarily affects the reliability and interpretation of individual coefficient estimates. However, as long as the same collinear relationships exist in the new data you are predicting on, it may not significantly harm the model's overall predictive accuracy." | |
| }, | |
| { | |
| "id": 76, | |
| "questionText": "A student builds a model to predict exam scores from hours studied. The intercept is 35. How is this best interpreted?", | |
| "options": [ | |
| "The minimum possible score is 35.", | |
| "The model predicts a score of 35 for a student who studied for 0 hours.", | |
| "For every hour studied, the score increases by 35.", | |
| "The interpretation is not meaningful." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The intercept (beta_0) is the predicted value of the dependent variable (exam score) when the independent variable (hours studied) is zero. So, the model predicts a baseline score of 35 with zero study time." | |
| }, | |
| { | |
| "id": 77, | |
| "questionText": "What does the 'least squares' in Ordinary Least Squares refer to?", | |
| "options": [ | |
| "Using the smallest number of data points possible.", | |
| "Minimizing the sum of squared errors.", | |
| "Using the smallest number of features.", | |
| "Finding the line with the least steep slope." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The term 'least squares' refers directly to the optimization criterion of the method: finding the line that results in the minimum (least) possible value for the sum of the squares of the residuals." | |
| }, | |
| { | |
| "id": 78, | |
| "questionText": "Why can't R-squared be used to compare models with different dependent variables?", | |
| "options": [ | |
| "R-squared is always the same for different dependent variables.", | |
| "It's computationally impossible.", | |
| "R-squared is a measure of variance explained *in a specific dependent variable*.", | |
| "Adjusted R-squared should be used instead." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "R-squared is calculated based on the total sum of squares (TSS) of the dependent variable. If you change the dependent variable (e.g., from predicting `price` to `log(price)`), the TSS changes, making the R-squared values incomparable. They are on different scales." | |
| }, | |
| { | |
| "id": 79, | |
| "questionText": "A model has a very low R-squared but a highly significant p-value for its coefficient. What does this mean?", | |
| "options": [ | |
| "The model is useless.", | |
| "The independent variable has a real, but very small, effect on the dependent variable.", | |
| "There is a calculation error.", | |
| "The model is overfitted." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A significant p-value indicates that the relationship is likely not due to random chance. A low R-squared means that this relationship, while real, does not explain much of the variance in the outcome. This often happens in noisy datasets or when many other factors influence the outcome." | |
| }, | |
| { | |
| "id": 80, | |
| "questionText": "Which statement is true about the residuals of a well-fitted OLS regression model?", | |
| "options": [ | |
| "Their sum is always greater than zero.", | |
| "Their sum is always less than zero.", | |
| "Their sum is exactly zero.", | |
| "Their sum is equal to the number of data points." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "One of the mathematical properties of an OLS regression that includes an intercept term is that the sum (and therefore the mean) of the residuals will be exactly zero." | |
| }, | |
| { | |
| "id": 81, | |
| "questionText": "Which type of variable transformation might help to correct for heteroscedasticity and non-normality of residuals?", | |
| "options": [ | |
| "Standardizing the independent variables.", | |
| "Applying a logarithmic transformation to the dependent variable.", | |
| "Creating interaction terms.", | |
| "Removing outliers." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "When the dependent variable or residuals have a skewed distribution, or when the variance increases with the mean, a logarithmic transformation (e.g., predicting `log(y)` instead of `y`) can often stabilize the variance and make the distribution more normal." | |
| }, | |
| { | |
| "id": 82, | |
| "questionText": "If the true relationship between X and Y is Y=X^2, what would a simple linear regression model likely show?", | |
| "options": [ | |
| "A perfect fit with R-squared = 1.", | |
| "A poor fit and a pattern in the residuals.", | |
| "A negative slope.", | |
| "A slope of exactly 2." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A simple linear regression model can only fit a straight line. It would fail to capture the U-shape of the quadratic relationship, resulting in a poor fit (low R-squared) and a distinct parabolic pattern in the residual plot, indicating a violation of the linearity assumption." | |
| }, | |
| { | |
| "id": 83, | |
| "questionText": "The assumption that the observations are independent of one another is referred to as:", | |
| "options": [ | |
| "No multicollinearity", | |
| "No autocorrelation", | |
| "Homoscedasticity", | |
| "Normality" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "This is the assumption of no autocorrelation or no serial correlation. It means that the value of the error term for one observation is not related to the value of the error term for another observation." | |
| }, | |
| { | |
| "id": 84, | |
| "questionText": "What does a negative coefficient for a variable 'age' in a model predicting 'running_speed' imply?", | |
| "options": [ | |
| "Older individuals are predicted to run faster.", | |
| "Older individuals are predicted to run slower.", | |
| "Age has no effect on running speed.", | |
| "The model is incorrect." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "A negative coefficient indicates an inverse relationship. Holding all other factors constant, as the value of 'age' increases, the predicted value of 'running_speed' decreases." | |
| }, | |
| { | |
| "id": 85, | |
| "questionText": "Which of these is a valid reason to prefer a simpler model over a more complex one?", | |
| "options": [ | |
| "The simple model has a slightly lower R-squared.", | |
| "The simple model is easier to interpret and less likely to overfit.", | |
| "The complex model has more features.", | |
| "The complex model was harder to build." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "This is the principle of parsimony (or Occam's Razor). A simpler model is generally preferred because it is more interpretable, more robust, and less prone to overfitting, even if a more complex model achieves a slightly better fit on the training data." | |
| }, | |
| { | |
| "id": 86, | |
| "questionText": "The coefficients in a linear regression model are determined by:", | |
| "options": [ | |
| "Random chance.", | |
| "The algorithm that minimizes a chosen cost function.", | |
| "The number of data points.", | |
| "The p-values of the variables." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "The coefficients are not random; they are the specific values that the fitting algorithm (like OLS or Gradient Descent) calculates to minimize the difference between the model's predictions and the actual data, as defined by the cost function (e.g., SSE)." | |
| }, | |
| { | |
| "id": 87, | |
| "questionText": "In the context of linear regression, 'bias' refers to:", | |
| "options": [ | |
| "The error introduced by approximating a real-world problem with a simpler model.", | |
| "The personal prejudice of the data scientist.", | |
| "The variance of the model's predictions.", | |
| "The y-intercept of the regression line." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "In the bias-variance tradeoff, bias is the simplifying assumption made by a model to make the target function easier to learn. A high-bias model, like a simple linear model applied to complex data, makes strong assumptions and is likely to underfit." | |
| }, | |
| { | |
| "id": 88, | |
| "questionText": "Which of the following would NOT be a reasonable next step after finding evidence of heteroscedasticity?", | |
| "options": [ | |
| "Transforming the dependent variable (e.g., log or square root).", | |
| "Using weighted least squares regression.", | |
| "Using heteroscedasticity-consistent standard errors (robust standard errors).", | |
| "Adding more independent variables to the model." | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "Simply adding more variables does not address the underlying issue of non-constant variance of the errors. The other three options are all standard and appropriate methods for dealing with heteroscedasticity." | |
| }, | |
| { | |
| "id": 89, | |
| "questionText": "What happens to the confidence intervals for coefficients when multicollinearity is high?", | |
| "options": [ | |
| "They become narrower.", | |
| "They become wider.", | |
| "They do not change.", | |
| "They become negative." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "High multicollinearity increases the standard errors of the coefficients. Since the width of a confidence interval is a multiple of the standard error, larger standard errors lead to wider confidence intervals, reflecting our increased uncertainty about the true value of the coefficients." | |
| }, | |
| { | |
| "id": 90, | |
| "questionText": "A modeler tries to predict a house's price using its size in square feet and its size in square meters. This is a classic example of:", | |
| "options": [ | |
| "Heteroscedasticity", | |
| "Perfect multicollinearity", | |
| "An interaction effect", | |
| "A well-specified model" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Size in square feet and size in square meters are perfectly linearly related (one is just a constant multiple of the other). Including both as predictors creates perfect multicollinearity, and the OLS algorithm will fail to find a unique solution." | |
| }, | |
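| A small numpy-only sketch (the synthetic house sizes are assumed purely for illustration) of why including both square feet and square meters leaves the design matrix rank-deficient, so OLS has no unique solution: | |
| ```python | |
| import numpy as np | |
| rng = np.random.default_rng(2) | |
| sqft = rng.uniform(500, 3000, 100) | |
| sqm = sqft * 0.092903  # an exact multiple of sqft: perfect multicollinearity | |
| X = np.column_stack([np.ones_like(sqft), sqft, sqm]) | |
| # Three columns but only rank 2, so X'X is singular and the normal equations | |
| # do not pin down a unique coefficient vector. | |
| print(np.linalg.matrix_rank(X))  # 2 | |
| print(np.linalg.cond(X.T @ X))   # an enormous condition number | |
| ``` | |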
| { | |
| "id": 91, | |
| "questionText": "The total sum of squares (TSS) represents the total variance in the:", | |
| "options": [ | |
| "Independent variable", | |
| "Dependent variable", | |
| "Residuals", | |
| "Predicted values" | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "TSS measures the total variation of the observed dependent variable values (y_i) around their mean (mean of y). It is calculated as sum((y_i - mean of y)^2). R-squared is the proportion of this total variance that is explained by the model." | |
| }, | |
| { | |
| "id": 92, | |
| "questionText": "If a regression model is linear in its parameters, it means:", | |
| "options": [ | |
| "The plot of Y vs X must be a straight line.", | |
| "The parameters (coefficients) appear as simple linear terms in the model equation.", | |
| "The variables themselves cannot be transformed.", | |
| "The model can only contain one independent variable." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Linearity in parameters means that the regression equation is a linear combination of its coefficients. For example, y = beta_0 + beta_1x^2 is linear in its parameters (beta_0, beta_1) even though the relationship between y and x is non-linear." | |
| }, | |
| { | |
| "id": 93, | |
| "questionText": "What is the main drawback of using only R-squared to evaluate a model?", | |
| "options": [ | |
| "It is difficult to calculate.", | |
| "It can never be 1.", | |
| "It always increases as you add more variables, which can be misleading.", | |
| "It is not sensitive to outliers." | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "R-squared is guaranteed to increase (or stay the same) whenever a new predictor is added, regardless of whether that predictor is actually useful. This can tempt modelers to add irrelevant variables just to inflate the R-squared value. Adjusted R-squared is designed to correct for this." | |
| }, | |
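| A minimal sketch, assuming Python with numpy and statsmodels and simulated data, showing that R-squared never drops when irrelevant predictors are added while adjusted R-squared typically does: | |
| ```python | |
| import numpy as np | |
| import statsmodels.api as sm | |
| rng = np.random.default_rng(3) | |
| n = 100 | |
| x = rng.normal(size=n) | |
| y = 2 * x + rng.normal(size=n) | |
| junk = rng.normal(size=(n, 10))  # ten predictors with no relationship to y | |
| base = sm.OLS(y, sm.add_constant(x)).fit() | |
| bloated = sm.OLS(y, sm.add_constant(np.column_stack([x, junk]))).fit() | |
| print(base.rsquared, bloated.rsquared)          # the bloated model's R-squared is never lower | |
| print(base.rsquared_adj, bloated.rsquared_adj)  # its adjusted R-squared typically is lower | |
| ``` | |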
| { | |
| "id": 94, | |
| "questionText": "Stepwise regression is a method for:", | |
| "options": [ | |
| "Checking model assumptions.", | |
| "Automatically selecting variables for a model.", | |
| "Calculating robust standard errors.", | |
| "Dealing with outliers." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Stepwise regression is an automated, iterative procedure for building a model by adding or removing predictor variables one at a time, based on criteria like p-values or AIC/BIC. It comes in forward, backward, and bidirectional variants." | |
| }, | |
| { | |
| "id": 95, | |
| "questionText": "The assumption that the error term has a mean of zero (E(epsilon) = 0) implies that:", | |
| "options": [ | |
| "The model is, on average, correct in its predictions.", | |
| "The model has no errors.", | |
| "The R-squared must be 1.", | |
| "The dependent variable must have a mean of zero." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "This assumption means that while the model will make errors for individual predictions (some positive, some negative), these errors are not systematic. On average, across all observations, the model's predictions are centered around the true values." | |
| }, | |
| { | |
| "id": 96, | |
| "questionText": "Which of the following is TRUE about linear regression?", | |
| "options": [ | |
| "It can prove causation.", | |
| "It can establish correlation and association.", | |
| "It is a type of unsupervised learning.", | |
| "It requires no assumptions to be valid." | |
| ], | |
| "correctAnswerIndex": 1, | |
| "explanation": "Regression can demonstrate the strength and direction of an association between variables (correlation). However, correlation does not imply causation. Proving causation requires a carefully designed experiment, not just observational data analysis." | |
| }, | |
| { | |
| "id": 97, | |
| "questionText": "You are building a model where the impact of each additional year of education on income is greater for people with more work experience. How would you model this?", | |
| "options": [ | |
| "With an interaction term between 'education' and 'experience'.", | |
| "By using two separate simple linear regression models.", | |
| "By removing the 'experience' variable.", | |
| "By standardizing both variables." | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "This scenario describes an interaction effect. The effect of education on income depends on the level of experience. This is modeled by including a new term in the equation that is the product of the two variables: `income = beta_0 + beta_1(education) + beta_2(experience) + beta_3(education * experience)`." | |
| }, | |
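| A minimal sketch, assuming Python with pandas and statsmodels; the simulated income data and coefficient values are illustrative assumptions, not figures from the quiz: | |
| ```python | |
| import numpy as np | |
| import pandas as pd | |
| import statsmodels.formula.api as smf | |
| rng = np.random.default_rng(4) | |
| n = 500 | |
| education = rng.uniform(8, 20, n) | |
| experience = rng.uniform(0, 30, n) | |
| # Simulated income in which each extra year of education pays off more at higher experience. | |
| income = 20000 + 1500 * education + 800 * experience + 120 * education * experience + rng.normal(0, 5000, size=n) | |
| df = pd.DataFrame({"income": income, "education": education, "experience": experience}) | |
| # 'education * experience' expands to both main effects plus the education:experience product. | |
| fit = smf.ols("income ~ education * experience", data=df).fit() | |
| print(fit.params)  # the education:experience coefficient lands near the simulated value of 120 | |
| ``` | |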
| { | |
| "id": 98, | |
| "questionText": "Which value is NOT typically found in a standard multiple regression output summary?", | |
| "options": [ | |
| "Coefficients for each predictor", | |
| "P-values for each coefficient", | |
| "The learning rate of the optimization algorithm", | |
| "The R-squared and Adjusted R-squared values" | |
| ], | |
| "correctAnswerIndex": 2, | |
| "explanation": "The learning rate is a hyperparameter used in iterative optimization algorithms like Gradient Descent. Standard regression output from statistical packages that use Ordinary Least Squares (an analytical solution) will not include a learning rate. It would only be relevant if you were implementing the regression using such an algorithm yourself." | |
| }, | |
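| A minimal sketch, assuming Python with statsmodels and simulated data, of a standard OLS summary: it reports coefficients, standard errors, p-values, confidence intervals, and the R-squared values, but no learning rate: | |
| ```python | |
| import numpy as np | |
| import statsmodels.api as sm | |
| rng = np.random.default_rng(5) | |
| X = sm.add_constant(rng.normal(size=(200, 3))) | |
| y = X @ np.array([1.0, 2.0, -0.5, 0.0]) + rng.normal(size=200) | |
| results = sm.OLS(y, X).fit() | |
| # OLS is solved analytically, so nothing like a learning rate appears in the output. | |
| print(results.summary()) | |
| ``` | |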
| { | |
| "id": 99, | |
| "questionText": "A predictor variable with a very small coefficient value (e.g., 0.0001) is:", | |
| "options": [ | |
| "Definitely not statistically significant.", | |
| "Definitely not practically significant.", | |
| "Definitely the most important variable.", | |
| "Not necessarily insignificant; its importance depends on its scale and p-value." | |
| ], | |
| "correctAnswerIndex": 3, | |
| "explanation": "The magnitude of a coefficient is dependent on the scale of its corresponding variable. A variable measured in millions (e.g., company revenue) will naturally have a very small coefficient compared to a variable measured in single units (e.g., years of experience). You must consider the variable's scale, the p-value, and the practical context to judge its importance." | |
| }, | |
| { | |
| "id": 100, | |
| "questionText": "The line y = beta_0 + beta_1x is a model for the:", | |
| "options": [ | |
| "Conditional mean of y given x, E(y|x)", | |
| "Conditional variance of y given x, Var(y|x)", | |
| "Conditional distribution of x given y", | |
| "Correlation between y and x" | |
| ], | |
| "correctAnswerIndex": 0, | |
| "explanation": "The regression line does not predict the exact value of y for a given x, because there is inherent variability. Instead, it models the conditional mean of the dependent variable. It predicts the average value of y for all individuals that have a specific value of x." | |
| } | |
| ] | |
| } |