Introduction to Statistics Interview Questions
Statistics is one of the most important parts of data science, and this guide is meant to help you crack your next statistics interview. In this series we discuss the most important statistics interview questions and their answers.
Q1. What is Selection Bias?
This is one of the most important questions in the series. Selection bias (also called sampling bias) is created when a sample is collected from a population and some members of the population are less likely to be chosen than others (remember, in a simple random sample each member of the population should have an equal chance of being chosen). When selection bias happens, incorrect conclusions can be drawn about the population being studied.
Q2. You are given a data set consisting of variables having more than 30% missing values. How will you deal with them?
First, check the null values in the data frame and identify the columns that contain more than 30% null values. Such a column can usually be removed from the dataset. However, if the column is highly related to the output variable, keep it and check its data type: if it is categorical, fill the null values with the mode; if it is numerical, check its distribution and fill the null values with the mean or the median.
Q3. What is an outlier? How do you handle outliers found in data?
An outlier is a value that lies far outside the range of most of the data. Example: for the salaries 12K, 14K, 15K, 13K, 90K, 100K, most values lie between 12K and 15K, so the much higher values 90K and 100K are outliers.
One way to detect outliers is the Z-score.
The Z-score is the number of standard deviations a data point lies from the mean.
Z-score = (x – μ) / σ
- x: Value of the element
- μ: Population mean
- σ: Standard Deviation
A z-score of zero tells you the value is exactly average, while a score of +3 tells you that the value is much higher than average. A common rule of thumb treats values with a z-score above +3 or below -3 as outliers.
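The z-score formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the salary figures, and the cutoff of 3 standard deviations are assumptions made for the example (3 is a common convention, not a fixed rule).

```python
import statistics

def z_scores(values):
    """Return the z-score (x - mean) / std for every value in the list."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical salaries in thousands: twenty typical values and one extreme value.
salaries = [12, 13, 14, 15] * 5 + [100]
outliers = z_score_outliers(salaries)
print(outliers)
```

In a very small sample, for instance one with only six values, no point can reach a z-score beyond 3, so this approach needs a reasonable number of observations to be useful.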
Interquartile Range (IQR):-
The IQR is the difference between the third and first quartiles (IQR = Q3 – Q1). Its main advantage is that it is not affected by outliers, because it doesn’t take into account observations below Q1 or above Q3.
It can still be used to look for possible outliers in your study.
As a rule of thumb, observations can be qualified as outliers when they lie more than 1.5 IQR below the first quartile or 1.5 IQR above the third quartile. Outliers are values that “lie outside” the other values. Outliers: x < Q1 – 1.5 * IQR or x > Q3 + 1.5 * IQR
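The 1.5 IQR rule can be sketched as follows. The helper name and the salary list are hypothetical, and the quartiles come from Python's statistics.quantiles with its default method; other tools use slightly different quartile conventions, so the fences may differ a little between libraries.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical salaries in thousands, mostly between 12 and 16.
salaries = [12, 13, 13, 14, 14, 15, 15, 16, 90, 100]
outliers = iqr_outliers(salaries)
print(outliers)
```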
Q4. How do you handle missing values in data?
There are two ways to handle missing values in data:
Dropping:- If a column contains more than about 30% null values, dropping that column is a reasonable option.
Imputation:- If a column contains fewer null values than that (below roughly 30%-40%), filling them in is usually the better method. If the data is categorical, fill the null values with the mode. If the data is numerical and not skewed, use the mean; if it is skewed, use the median.
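The imputation rules above can be illustrated with a small sketch. It represents missing values as None in plain Python lists; the column names, the values, and the helper functions are all hypothetical, and in practice a data frame library would usually do this work.

```python
import statistics

def missing_fraction(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def impute(column, strategy):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in column if v is not None]
    fillers = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill_value = fillers[strategy](observed)
    return [fill_value if v is None else v for v in column]

city = ["Pune", None, "Delhi", "Pune", "Pune"]   # categorical: use the mode
income = [30, 32, None, 31, 250]                  # skewed numeric: use the median
age = [25, 27, None, 26, 28]                      # roughly symmetric: use the mean

filled_city = impute(city, "mode")
filled_income = impute(income, "median")
filled_age = impute(age, "mean")
print(missing_fraction(income), filled_city, filled_income, filled_age)
```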
Q5. How do you identify whether data is skewed? What are the different types of skewness and how do you identify them?
If the mean and median of numerical data differ noticeably, this is an indication that the data is skewed. If Mean = Median = Mode, the data is approximately symmetric, as is the case for normally distributed data.
There are two types of skewness:
Positively Skewed:- Most of the data lies on the left side of the distribution and the tail extends towards the right side. Typically Mean > Median > Mode. This is also known as right-skewed.
Negatively Skewed:- Most of the data lies on the right side of the distribution and the tail extends towards the left side. Typically Mean < Median < Mode. This is also known as left-skewed.
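Besides comparing the mean and median, skewness can be measured directly. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one common definition and an assumption of this example; the data sets are made up for illustration.

```python
import statistics

def skewness(values):
    """Moment-based skewness: positive for a right tail, negative for a left tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail to the right
left_tailed = [-x for x in right_tailed]     # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, statistics.fmean(data), statistics.median(data), skewness(data))
```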
Q6. What is Normal Distribution?
A normal distribution is one in which the data is distributed equally on the left and right sides of the center, so the curve forms a bell shape. Such data is called normally distributed.
Properties of Normal Distribution
Unimodal:- One peak
Symmetrical:- Both sides of the distribution are mirror images of each other
Mean, Median, Mode are equal.
Q7. What is correlation?
Correlation is an important term. To check the relationship between two quantitative variables, we find the correlation between them.
Pearson’s r or Pearson Correlation: When two sets of data are strongly linked together, they have a High Correlation.
The word Correlation is made of Co- (meaning “together”), and Relation
Correlation is Positive when the values increase together, and
Correlation is Negative when one value decreases as the other increases
The correlation coefficient takes a value between -1 and 1:
- 1 is a perfect positive correlation
- 0 is no correlation
- -1 is a perfect negative correlation
- The absolute value shows how strong the correlation is, regardless of whether it is positive or negative. Note: Correlation is not causation.
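As an illustration, Pearson's r can be computed directly from its definition (covariance divided by the product of the standard deviations). The function name and the data below are hypothetical and chosen so that the perfect positive and perfect negative cases are easy to see.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]       # increases with hours
falling = [10, 8, 6, 4, 2]      # decreases as hours increase
noisy = [2, 9, 4, 8, 3]         # no clear linear pattern

print(pearson_r(hours, rising), pearson_r(hours, falling), pearson_r(hours, noisy))
```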
Q8. Formula to calculate Standard Deviation and Variance
Standard Deviation: The Standard Deviation is a measure of how spread out numbers are.
Variance: The average of the squared differences from the mean, i.e. the square of the standard deviation.
For a population:
variance = σ² = Σ(x – μ)² / n
standard deviation = σ = √σ²
For a sample:
variance = s² = Σ(x – x̄)² / (n – 1)
standard deviation = s = √s²
- x is an individual value
- n is the size of the population or sample
- μ is the population mean and x̄ is the sample mean
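The population and sample formulas can be checked numerically. The sketch below implements both by hand and compares them with Python's statistics module; the data set is an arbitrary example.

```python
import math
import statistics

def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = population_variance(data)
samp_var = sample_variance(data)
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)
print(pop_var, pop_sd, samp_var, samp_sd)
```

The sample version divides by n – 1 instead of n, so it is always slightly larger than the population version for the same data.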
Q9. What is the difference between a Point Estimate and a Confidence Interval?
Point Estimation:- A point estimate is a single value used as an estimate of a population parameter. The method of moments and maximum likelihood estimation are common ways to obtain point estimates.
Confidence Interval:- A confidence interval is a range of values that is likely to contain the population parameter. The confidence level describes how often intervals built this way would contain the parameter over repeated sampling, so it quantifies how confident we are that the parameter lies in the interval.
Confidence level = 1 – alpha, where alpha is the level of significance.
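To make the contrast concrete, the sketch below computes a point estimate (the sample mean) and an interval around it. It assumes a normal approximation with a z critical value, which is a simplification: for small samples a t-distribution is normally used instead. The function name and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, alpha=0.05):
    """Return (point estimate, (lower, upper)) for the mean at level 1 - alpha."""
    point_estimate = fmean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)            # critical value, normal approximation
    margin = z * stdev(sample) / math.sqrt(len(sample))
    return point_estimate, (point_estimate - margin, point_estimate + margin)

# Hypothetical measurements.
sample = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48, 51, 49]

estimate, ci_95 = mean_confidence_interval(sample, alpha=0.05)
_, ci_99 = mean_confidence_interval(sample, alpha=0.01)
print(estimate, ci_95, ci_99)
```

A smaller alpha (a higher confidence level) gives a wider interval around the same point estimate.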
Q10. What is P-Value?
This is an important question in statistics interviews. A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis. The p-value measures the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
P values are expressed as decimals although it may be easier to understand what they are if you convert them to a percentage.
- A p-value of 0.0254 is 2.54%. This means that, if the null hypothesis were true, there would be a 2.54% chance of seeing results at least as extreme as yours purely by chance.
- A large p-value of 0.9 (90%) means that results at least as extreme as yours would occur 90% of the time under the null hypothesis, so the data give no evidence of an effect from your experiment.
When the p-value is above the significance level (commonly 0.05), you fail to reject the null hypothesis.
When the p-value is at or below the significance level (≤ 0.05), you reject the null hypothesis.
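A small worked case may help: testing whether a coin is fair after seeing 60 heads in 100 tosses. The sketch computes an exact two-sided binomial p-value, that is, the probability under the null hypothesis (fair coin) of a result at least as far from 50 heads as the one observed. The scenario and function name are assumptions made for illustration.

```python
import math

def binomial_two_sided_p_value(heads, tosses):
    """Exact two-sided p-value for H0: the coin is fair."""
    expected = tosses / 2
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(tosses + 1):
        # Add up every outcome at least as extreme as the observed one.
        if abs(k - expected) >= observed_gap:
            total += math.comb(tosses, k) * 0.5 ** tosses
    return total

p_value = binomial_two_sided_p_value(60, 100)
reject_null = p_value <= 0.05
print(p_value, reject_null)
```

Here the p-value comes out slightly above 0.05, so at that significance level the null hypothesis of a fair coin is not rejected.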
Q11. What is the goal of A/B Testing?
A/B testing is a randomized experiment with two variants, A and B, that are compared using a hypothesis test.
A/B testing is a direct industry application of the two-sample proportion test.
While developing an e-commerce website, there could be different opinions about the choices of various elements, such as the shape of buttons, the text on the call-to-action buttons, the color of various UI elements, the copy on the website, or numerous other such things.
Often, the choice of these elements is very subjective, and it is difficult to predict which option would perform better. To resolve such conflicts, you can use A/B testing. A/B testing provides a way for you to test two different versions of the same element and see which one performs better.
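A two-sample proportion test for an A/B experiment might look like the sketch below. It uses the pooled z-test with a normal approximation, which assumes reasonably large samples; the conversion counts and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z statistic, two-sided p-value) for H0: both conversion rates are equal."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: button A converts 200 of 2000 visitors, button B 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(z, p_value)
```

With these made-up numbers the p-value is below 0.05, so version B's higher conversion rate would be judged statistically significant at that level.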
Q12. Siva has 1000 coins, of which 999 are fair and 1 is double-headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?
There are two ways of choosing the coin. One is to pick a fair coin and the other is to pick the one with two heads.
Probability of selecting fair coin = 999/1000 = 0.999
Probability of selecting unfair coin = 1/1000 = 0.001
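Probability of 10 heads in a row = P(selecting fair coin) * P(10 heads with fair coin) + P(selecting unfair coin) * P(10 heads with unfair coin)
Let A be the event "fair coin selected and 10 heads seen" and B the event "unfair coin selected and 10 heads seen".
P(A) = 0.999 * (1/2)^10 = 0.999 * (1/1024) = 0.000976
P(B) = 0.001 * 1 = 0.001
P(fair | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / (0.000976 + 0.001) = 0.4939
P(unfair | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 = 0.5061
Probability of another head = P(fair | 10 heads) * 0.5 + P(unfair | 10 heads) * 1 = 0.4939 * 0.5 + 0.5061 = 0.7531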
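The same Bayes' rule calculation can be reproduced exactly with fractions, which avoids rounding in the intermediate steps. The variable names are chosen for this sketch only.

```python
from fractions import Fraction

prior_fair = Fraction(999, 1000)
prior_double = Fraction(1, 1000)

likelihood_fair = Fraction(1, 2) ** 10   # 10 heads in a row from a fair coin
likelihood_double = Fraction(1)          # a double-headed coin always shows heads

joint_fair = prior_fair * likelihood_fair
joint_double = prior_double * likelihood_double
evidence = joint_fair + joint_double     # total probability of seeing 10 heads

posterior_fair = joint_fair / evidence
posterior_double = joint_double / evidence

p_next_head = posterior_fair * Fraction(1, 2) + posterior_double * 1
print(p_next_head, round(float(p_next_head), 4))
```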
Q13. Define quality assurance and Six Sigma.
Quality assurance:- Maintaining a desired level of quality by minimizing mistakes and defects in an activity or set of activities.
Six Sigma:- Six Sigma is a set of management tools and techniques used to improve a process by reducing the likelihood of errors or defects.
Q14. Why is Scaling Required?
Most machine learning algorithms take into account only the magnitude of the measurements, not the units of those measurements.
So a feature that is expressed in a very high magnitude (number) may affect the prediction a lot more than an equally important feature expressed in a smaller magnitude.
Q15. Explain Standard Scaler
The Standard Scaler is one of the most widely used scaling algorithms out there. It assumes that your data follows a Gaussian Distribution (Gaussian distribution is the same thing as Normal distribution)
The Mean and the Standard Deviation are calculated for the feature and then the feature is scaled based on:
SC = (xi – mean(x)) / stdev(x)
The idea behind Standard Scaler is that it will transform your data, such that the distribution will have a mean value of 0 and a standard deviation of 1.
If the data is not normally distributed, it’s not recommended to use the Standard Scaler.
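The scaling formula can be sketched in plain Python as below. This is not the code of any particular library; it assumes the population standard deviation is used in the denominator, and the two feature lists are hypothetical.

```python
import statistics

def standard_scale(values):
    """Subtract the mean and divide by the standard deviation of the feature."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # assumption: population standard deviation
    return [(x - mean) / std for x in values]

# Two hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 35000, 50000, 65000, 80000]

scaled_ages = standard_scale(ages)
scaled_incomes = standard_scale(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, both features have mean 0 and standard deviation 1, so neither dominates simply because its raw numbers are larger.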
Q16. Explain Univariate and Bivariate Graph analysis.
Univariate graph analysis uses only one variable. Plots used in univariate analysis include count plots, distribution plots, histograms, etc.
Bivariate graph analysis uses two variables to analyze the relationship between them, and works with both qualitative and quantitative data. Graphs used for bivariate data include scatter plots, bar graphs, etc.
Q17. What is kurtosis? What are the different types of Kurtosis?
In statistics, kurtosis is defined as the parameter of relative sharpness of the peak of the probability distribution curve. It ascertains the way observations are clustered around the center of the distribution.
It is used to indicate the Flatness or Peakedness of the frequency distribution curve and measures the tails or outliers of the distribution.
Mesokurtic:- A distribution whose kurtosis is similar to that of the normal distribution, i.e. an excess kurtosis of zero (a kurtosis of 3).
Leptokurtic:- A distribution with kurtosis greater than that of a mesokurtic distribution. The tails of such distributions are thick and heavy.
Platykurtic:- A distribution with kurtosis less than that of a mesokurtic distribution. The tails of such distributions are thinner.
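The three types can be told apart by the sign of the excess kurtosis. The sketch below uses the moment-based definition (fourth central moment divided by the squared second moment, minus 3), which is an assumption of this example; the two data sets are made up to show a flat shape and a heavy-tailed shape.

```python
import statistics

def excess_kurtosis(values):
    """Moment-based excess kurtosis: about 0 for normal data, > 0 heavy tails, < 0 light tails."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))              # evenly spread values, light tails
heavy_tailed = [0] * 18 + [-10, 10]    # mostly central values plus two extremes

print(excess_kurtosis(flat), excess_kurtosis(heavy_tailed))
```

A negative result points to a platykurtic shape and a positive result to a leptokurtic one.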
Q18. What are the types of modalities?
Unimodal:- It has only one peak
Bimodal:- It has two peaks
Multimodal:- It has many peaks
Uniform:- All values are equally likely, so there is no distinct peak
Q19. When to use which measure of central tendency?
Mean – When your data is not skewed i.e Symmetric/Normally Distributed. In other words, there are no extreme values present in the data set (Outliers).
Median – When your data is skewed or you are dealing with ordinal (ordered categories) data.
Mode – When dealing with nominal (unordered categories) data.
If your data is quantitative, go for the mean or median. If the data has influential outliers or is highly skewed, the median is the best measure of central tendency; otherwise go for the mean.
If the data is categorical, the mean cannot be calculated. For nominal data the median is not defined either, so go for the mode; for ordinal data the median can still be used.
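These choices can be seen on small examples. In the sketch below the data sets and the ordering of the size categories are hypothetical; the ordinal median is found by ranking the categories first.

```python
import statistics

incomes = [30, 32, 31, 33, 250]              # quantitative with one extreme value
sizes = ["S", "M", "M", "L", "XL"]           # ordinal categories
colors = ["red", "blue", "red", "green"]     # nominal categories

mean_income = statistics.fmean(incomes)      # pulled upwards by the extreme value
median_income = statistics.median(incomes)   # stays near the typical incomes

# Ordinal data: map categories to ranks, take the median rank, map back.
size_order = ["S", "M", "L", "XL"]
ranks = [size_order.index(s) for s in sizes]
median_size = size_order[statistics.median_low(ranks)]

most_common_color = statistics.mode(colors)  # nominal data: only the mode applies
print(mean_income, median_income, median_size, most_common_color)
```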
Q20. Explain the Process of data analysis.
In this series we discussed the most important statistics interview questions and their answers.