1. What is the Central Limit Theorem and why is it important?
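Suppose that we are interested in estimating the average height of all people. Collecting data for
every person in the world is impossible, but we can still sample some people. The question then becomes:
what can we say about the average height of the entire population given a single sample?
The Central Limit Theorem addresses exactly this. It states that, for independent samples drawn from a
population with finite variance, the distribution of the sample mean approaches a normal distribution as
the sample size grows, whatever the shape of the population distribution. This is what lets us attach
confidence intervals and hypothesis tests to a sample mean.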
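A small simulation can make the theorem concrete. The sketch below is only an illustration under stated assumptions: the population is a made-up, strongly skewed set of exponential values rather than real height data, and names such as `population` and `sample_means` are hypothetical. It draws many samples of a given size and compares the spread of the sample means with the value the theorem suggests, the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical skewed "population": exponential values with mean close to 1.0
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_means(n, repeats=2_000):
    """Draw `repeats` samples of size n (with replacement) and return each sample's mean."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(repeats)]


for n in (5, 50, 500):
    means = sample_means(n)
    print(
        f"n={n:>3}  mean of sample means={statistics.fmean(means):.3f}  "
        f"sd of sample means={statistics.stdev(means):.3f}  "
        f"pop_sd/sqrt(n)={pop_sd / n ** 0.5:.3f}"
    )
```

As the sample size increases, the sample means should cluster more tightly around the population mean, even though the population itself is far from normal.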
2. What is sampling? How many sampling methods do you know?
“Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative
subset of data points to identify patterns and trends in the larger data set being examined.”
Common methods include simple random sampling, systematic sampling, stratified sampling and cluster
sampling.
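To illustrate how a representative subset might be selected, here is a minimal sketch of three common approaches (simple random, systematic and stratified sampling) using only Python's standard library. The `records` list, its `group` field and the `stratified` helper are hypothetical and exist only for this example.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data set: 1,000 records, the first 250 in group "A", the rest in group "B"
records = [{"id": i, "group": "A" if i < 250 else "B"} for i in range(1000)]

# Simple random sampling: every record has the same chance of being selected
simple = random.sample(records, k=100)

# Systematic sampling: take every `step`-th record after a random starting point
step = len(records) // 100
start = random.randrange(step)
systematic = records[start::step]


def stratified(rows, key, fraction):
    """Sample the same fraction from each stratum defined by `key`."""
    strata = {}
    for row in rows:
        strata.setdefault(row[key], []).append(row)
    picked = []
    for members in strata.values():
        picked.extend(random.sample(members, k=round(len(members) * fraction)))
    return picked


# Stratified sampling: group proportions in the sample match the full data set
strat = stratified(records, "group", 0.1)
print(len(simple), len(systematic), len(strat))
```

Stratified sampling is useful when a small group must not be under-represented by chance; simple random sampling gives no such guarantee for any single draw.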
3. What is the difference between a Type I and a Type II error?
- A Type I error occurs when the null hypothesis is true but is rejected (a false positive).
- A Type II error occurs when the null hypothesis is false but is not rejected (a false negative).
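The two error types can be estimated by simulation. The sketch below assumes a two-sided z-test with a known standard deviation, a significance level of 0.05, and an arbitrary alternative mean of 0.3; all of these choices, along with the helper `rejects_null`, are illustrative assumptions rather than part of the original answer.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)  # fixed seed so the sketch is reproducible

ALPHA = 0.05
# Two-sided critical value of the standard normal distribution (about 1.96)
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)


def rejects_null(true_mean, n=30, sigma=1.0, null_mean=0.0):
    """Simulate one sample and run a two-sided z-test of H0: mean == null_mean."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = (fmean(sample) - null_mean) / (sigma / n ** 0.5)
    return abs(z) > Z_CRIT


trials = 5_000

# H0 is true (true mean is 0): every rejection is a Type I error
type_1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials

# H0 is false (true mean is 0.3): every failure to reject is a Type II error
type_2_rate = sum(not rejects_null(0.3) for _ in range(trials)) / trials

print(f"Type I rate: {type_1_rate:.3f}  Type II rate: {type_2_rate:.3f}")
```

The Type I rate should land near the chosen significance level, while the Type II rate depends on the sample size and on how far the true mean is from the null value.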
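4. What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is
the significance of each of these components?
A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends
on a myriad of factors, such as its size or its location. To see the relationship between these
variables, we fit a linear regression, which estimates the line of best fit between them and helps us
conclude whether a factor has a positive or negative relationship with the price.
- The coefficient is the estimated change in the dependent variable for a one-unit increase in a predictor,
holding the other predictors fixed.
- The p-value is the probability of seeing a coefficient at least this far from zero if the true coefficient
were zero; a small p-value is evidence that the predictor matters.
- The r-squared value is the proportion of the variance in the dependent variable that the model explains.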
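As a worked example, the sketch below fits a one-predictor least-squares line and computes the slope, intercept and r-squared by hand. The house sizes and prices are invented numbers used only for illustration, and the helper names `fit_line` and `r_squared` are hypothetical. A p-value for the slope is not computed here, since it needs a t-distribution with n - 2 degrees of freedom, which Python's standard library does not provide.

```python
from statistics import fmean

# Hypothetical data: house size in square metres and price in thousands
sizes = [50, 60, 80, 100, 110]
prices = [150, 180, 240, 310, 320]


def fit_line(x, y):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = fmean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(sizes, prices)
r2 = r_squared(sizes, prices, slope, intercept)
print(f"slope={slope:.3f}  intercept={intercept:.3f}  r_squared={r2:.4f}")
```

A positive slope here means that larger houses are associated with higher prices in this toy data set; the sign of the slope is what indicates a positive or negative relationship.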
5. What are the assumptions required for linear regression?
There are four major assumptions:
- There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
- The errors (residuals) are normally distributed and independent of each other.
- There is minimal multicollinearity between explanatory variables.
- Homoscedasticity: the variance of the errors around the regression line is the same for all values of the predictor variables.
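Some of these assumptions can be checked informally from the residuals of a fitted model. The sketch below is a rough illustration, not a formal test such as Breusch-Pagan: it simulates hypothetical data that satisfies the assumptions, fits a one-predictor least-squares line, and compares the residual spread for small and large predictor values. All data and variable names are assumptions made for this example.

```python
import random
from statistics import fmean, stdev

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical data that meets the assumptions: y = 2 + 3x plus constant-variance normal noise
xs = [random.uniform(0, 10) for _ in range(400)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

# Ordinary least squares fit for a single predictor
x_bar, y_bar = fmean(xs), fmean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Residuals: observed value minus fitted value
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Rough homoscedasticity check: residual spread below versus above the mean of x
low = [r for x, r in zip(xs, residuals) if x < x_bar]
high = [r for x, r in zip(xs, residuals) if x >= x_bar]
spread_ratio = stdev(high) / stdev(low)

print(f"slope={slope:.3f}  mean residual={fmean(residuals):.2e}  spread ratio={spread_ratio:.3f}")
```

A spread ratio far from 1 would hint that the error variance changes with the predictor, which is the situation homoscedasticity rules out.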