Info Edge Data Science and Analytics Interview Questions and Answers

Data analytics has become the backbone of modern businesses, providing invaluable insights that drive strategic decisions. If you’re eyeing a role at Info Edge or simply want to brush up on your data analytics interview skills, you’ve come to the right place. Here, we delve into some of the most pertinent questions and polished answers to help you ace your interview at Info Edge.

Python Skill Test

Question: What are some ways to find outliers?

Answer: Some common ways to find outliers include:

  • Statistical Methods: Such as z-score, modified z-score, or the IQR (Interquartile Range) method (see the sketch after this list).
  • Visualization Techniques: Like box plots, scatter plots, or histograms.
  • Machine Learning Algorithms: Certain algorithms like Isolation Forest or One-Class SVM can detect outliers.
  • Domain Knowledge: Understanding the context of the data and identifying anomalies based on subject-matter expertise.
  • Clustering Methods: Outliers may stand out as individual clusters or points distant from the main cluster centroids.
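
A minimal NumPy sketch of the IQR method mentioned above; the data values here are invented for illustration:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
                 12, 14, 17, 19, 107, 10, 13, 12, 14, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # e.g. [102 107]
```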

Question: Define entropy and cross entropy.

Answer:

Entropy is a measure of uncertainty or disorder in a set of data. In information theory, it quantifies the average amount of information produced by a stochastic process. Mathematically, it’s calculated using probabilities and logarithms.

Cross-entropy, on the other hand, is a measure of the difference between two probability distributions. It’s commonly used in machine learning as a loss function to measure the dissimilarity between the predicted probability distribution and the actual distribution of the data. In classification tasks, minimizing cross-entropy helps improve the accuracy of the model’s predictions.
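
To make the definitions concrete, here is a small NumPy sketch computing entropy, H = -Σ p·log(p), and cross-entropy, -Σ p·log(q), for a hypothetical one-hot label and predicted distribution:

```python
import numpy as np

# Hypothetical one-hot true label p and a model's predicted probabilities q
p = np.array([0.0, 1.0, 0.0])
q = np.array([0.2, 0.7, 0.1])

entropy_q = -np.sum(q * np.log(q))      # uncertainty of the predicted distribution
cross_entropy = -np.sum(p * np.log(q))  # reduces to -log(0.7) ≈ 0.357
print(entropy_q, cross_entropy)
```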

Question: How to draw a boxplot?

Answer: To draw a boxplot, follow these steps (a matplotlib sketch follows the list):

  • Arrange Data: Arrange your data in ascending order.
  • Calculate Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3).
  • Find Interquartile Range (IQR): Calculate the IQR, which is the difference between Q3 and Q1.
  • Identify Outliers: Determine if there are any outliers in the data using the IQR method.
  • Draw Box: Draw a box from Q1 to Q3, with a line at the median (Q2).
  • Draw Whiskers: Extend lines (whiskers) from the box to the smallest and largest data points that fall within 1.5 times the IQR of the box edges.
  • Plot Outliers: Plot any outliers beyond the whiskers.
  • Label Axes: Label the axes appropriately.
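
A minimal sketch of the same steps using matplotlib, which handles the quartile and whisker calculations internally (the data is randomly generated for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)

fig, ax = plt.subplots()
ax.boxplot(data)          # whiskers default to 1.5 * IQR
ax.set_ylabel("Value")
ax.set_title("Boxplot")
plt.show()
```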

Question: What is the IQR?

Answer: The Interquartile Range (IQR) is a measure of statistical dispersion, specifically a measure of variability based on dividing a data set into quartiles. It is the difference between the third quartile (Q3) and the first quartile (Q1). Mathematically, IQR = Q3 – Q1.

Question: What is the central limit theorem?

Answer: The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that the sampling distribution of the sample mean (or sum) of a sufficiently large sample size drawn from any population will approximate a normal distribution, regardless of the shape of the original population distribution.
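
A quick NumPy simulation of the CLT: even though the exponential distribution is heavily skewed, the distribution of its sample means looks approximately normal:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from a skewed population
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

plt.hist(sample_means, bins=50)
plt.title("Distribution of 10,000 sample means (n=50)")
plt.show()
```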

Question: What are the different optimizers in deep learning?

Answer: There are various optimizers in deep learning: Gradient Descent, Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent, Adam, RMSprop, Adagrad, Adadelta, and AdamW. These optimizers differ in how they update the parameters during training. Adam and RMSprop are adaptive learning rate algorithms, while Adagrad and Adadelta adjust learning rates based on historical gradients. The choice of optimizer depends on factors like the problem at hand and the architecture of the neural network.
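
As a sketch of how these optimizers are selected in practice, assuming PyTorch (the same idea applies in Keras and other frameworks), with a placeholder one-layer model:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

# Each optimizer updates the same parameters with a different rule:
sgd   = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3)
rms   = torch.optim.RMSprop(model.parameters(), lr=1e-3)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```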

Statistics

Question: Definition of a random variable.

Answer: A random variable is a number whose value we don't know in advance: it could take any of several values, each with a known probability. It's used to describe uncertain outcomes, like rolling a die or flipping a coin. Random variables help us reason about probabilities and make predictions about uncertain events.

Question: What is Bayes Theorem?

Answer: Bayes’ Theorem is a fundamental concept in probability theory that describes the probability of an event based on prior knowledge of conditions that might be related to the event. Mathematically, it expresses the probability of A given B in terms of the probability of B given A and the probabilities of A and B separately. In essence, it allows us to update our beliefs about the likelihood of an event occurring based on new evidence. Bayes’ Theorem is widely used in various fields such as statistics, machine learning, and Bayesian inference.
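
A small numeric sketch of Bayes' Theorem, P(A|B) = P(B|A)·P(A) / P(B), using invented numbers for a diagnostic-test scenario:

```python
# Hypothetical numbers, purely for illustration
p_disease = 0.01            # P(A): prior probability of disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P(B): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161
```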

Question: Explain p-test, t-test, z-test.

Answer:

  • Z-test: A statistical test used when the sample size is large (typically n > 30) or when the population standard deviation is known. It compares a sample mean to a known population mean and determines whether the difference between them is statistically significant.
  • T-test: A statistical test used when the sample size is small (typically n < 30) and the population standard deviation is unknown. There are different types of t-tests for different scenarios, such as the one-sample t-test, independent samples t-test, and paired samples t-test (see the sketch after this list).
  • P-test: There is no widely recognized statistical test called a “p-test.” Interviewers usually mean either a significance test for a proportion (such as a z-test for proportions) or are referring loosely to the p-value that tests like the z-test and t-test produce.
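
A minimal one-sample t-test sketch using SciPy (the sample here is simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.2, scale=1.0, size=25)  # small sample, sigma unknown

# One-sample t-test against a hypothesized population mean of 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)
```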

Question: What is a random variable?

Answer: A random variable is a variable whose possible values are outcomes of a random phenomenon. Essentially, it represents a quantity whose value is uncertain and determined by chance. Random variables can be either discrete, taking on a countable set of values (like the outcome of rolling a die), or continuous, taking on any value within a specified range (like the height of a person). They are fundamental to probability theory and statistics, enabling the modeling and analysis of uncertain events and processes.
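
A quick NumPy sketch contrasting a discrete and a continuous random variable (the parameters are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)

die_roll = rng.integers(1, 7, size=10)  # discrete: values 1 through 6
height = rng.normal(170, 8, size=10)    # continuous: any value in a range
print(die_roll, height.round(1))
```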

Question: What is the rank of a matrix?

Answer: The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix. In simpler terms, it represents the dimension of the vector space spanned by the rows or columns of the matrix. The rank of a matrix provides information about its properties, such as its invertibility and the number of solutions to a system of linear equations it represents. It is an important concept in linear algebra and has various applications in fields such as engineering, physics, and computer science.
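
A one-line check with NumPy: the second row below is twice the first, so only two rows are linearly independent:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],   # linearly dependent on row 1
              [1, 0, 1]])

print(np.linalg.matrix_rank(A))  # 2
```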

Machine Learning

Question: Explain linear regression and its assumptions.

Answer: Linear regression is a statistical method used to model the relationship between one or more independent variables (predictors) and a dependent variable (response). It assumes a linear relationship between the predictors and the response variable, meaning that changes in the predictors result in proportional changes in the response.

Assumptions of linear regression include (a fitting sketch follows the list):

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Independence: The residuals (the differences between observed and predicted values) are independent of each other.
  • Homoscedasticity: The variance of the residuals is constant across all levels of the independent variables.
  • Normality: The residuals are normally distributed.
  • No perfect multicollinearity: The independent variables are not perfectly correlated with each other.
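
A minimal fitting sketch with scikit-learn on simulated data, with a rough check of the zero-mean, constant-spread residual assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, size=100)  # known linear truth

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should center on zero with a stable spread
print(model.coef_, model.intercept_)
print(residuals.mean(), residuals.std())
```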

Question: What are linear and non-linear models?

Answer: In a linear model, the relationship between variables is assumed to be straight and proportional, whereas in a non-linear model, the relationship can take on various shapes, such as curves or bends. Linear models involve straight lines or planes, while non-linear models encompass more complex patterns like curves, exponentials, or logarithms. The choice between linear and non-linear models depends on the underlying data and the nature of the relationship being studied.
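
A small NumPy sketch on simulated curved data: a straight-line fit underfits, while a higher-degree polynomial follows the curvature (the degrees are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 50)
y = np.exp(0.8 * x) + rng.normal(0, 1, size=50)  # curved relationship

linear = np.polyfit(x, y, deg=1)  # straight-line (linear) fit
cubic = np.polyfit(x, y, deg=3)   # captures the curvature better

# Compare predictions at x = 4.0 against the true underlying value
print(np.polyval(linear, 4.0), np.polyval(cubic, 4.0), np.exp(0.8 * 4.0))
```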

Question: What are the pros and cons of KNN?

Answer:

Pros:

  • Simple to understand and implement.
  • No training period required.
  • Versatile for classification and regression tasks.
  • Reasonably robust to noisy data when k is large enough.
  • Handles non-linear data relationships effectively.

Cons:

  • Computationally expensive for large datasets.
  • Memory intensive due to storing all training data.
  • Sensitivity to irrelevant features may affect performance.
  • Requires feature scaling for optimal results (the sketch after this list scales features first).
  • Performance degradation in high-dimensional feature spaces due to the curse of dimensionality.
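
A minimal scikit-learn sketch that scales features before fitting KNN, as noted in the cons above (Iris is used purely as a convenient built-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling first matters because KNN is distance-based
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```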

Question: What is tree pruning?

Answer: Tree pruning in machine learning involves reducing the size of decision trees to prevent overfitting and improve generalization. Pre-pruning sets stopping criteria during tree construction, like limiting maximum depth or requiring a minimum number of samples in leaf nodes. Post-pruning, or cost-complexity pruning, removes less important sections of the tree after it’s fully grown, based on their impact on validation data. Pruning helps create simpler, more interpretable models while maintaining predictive accuracy.
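
A minimal cost-complexity pruning sketch using scikit-learn's ccp_alpha parameter (the alpha value here is arbitrary; in practice it is tuned, e.g. via cost_complexity_pruning_path or cross-validation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(full.score(X_test, y_test), pruned.score(X_test, y_test))
print(full.get_depth(), pruned.get_depth())  # the pruned tree is shallower
```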

Question: How to overcome the issue of overfitting in the decision trees?

Answer: To mitigate overfitting in decision trees, strategies like pruning, limiting tree depth, and increasing minimum samples per leaf can be employed to simplify the model and prevent it from capturing noise in the training data. Additionally, feature selection helps focus on relevant features, while ensemble methods like Random Forest or Gradient Boosting combine multiple trees to improve generalization. Cross-validation techniques further validate the model’s performance on unseen data, ensuring it doesn’t overfit to the training set.
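
A quick cross-validation sketch comparing a depth-limited tree with a Random Forest, assuming scikit-learn (the hyperparameters are illustrative, not tuned):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation estimates performance on unseen data
print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```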

Deep Learning

Question: What is the role of the activation function?

Answer: The activation function introduces nonlinearity into neural networks, enabling them to learn complex patterns and relationships in data. It determines whether a neuron should be activated based on its input, allowing the network to model and approximate nonlinear functions. Without activation functions, neural networks would only perform linear transformations, limiting their ability to learn from data. Activation functions are crucial for enabling neural networks to solve a wide range of tasks, including classification, regression, and pattern recognition.
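
A small NumPy sketch of three common activation functions applied elementwise; each output is a non-linear function of its input:

```python
import numpy as np

x = np.linspace(-3, 3, 7)

relu = np.maximum(0, x)         # ReLU: max(0, x)
sigmoid = 1 / (1 + np.exp(-x))  # squashes input to (0, 1)
tanh = np.tanh(x)               # squashes input to (-1, 1)

print(relu)
```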

Question: What is the vanishing gradient issue, and how to overcome it?

Answer: The vanishing gradient issue occurs during the training of deep neural networks when the gradients become extremely small as they are propagated backward through the network layers during backpropagation. This phenomenon is particularly problematic in deep architectures with many layers, as it hinders the ability of earlier layers to learn meaningful representations, resulting in slow or stalled learning.

To overcome the vanishing gradient issue in deep neural networks, employ ReLU, Leaky ReLU, or ELU activation functions to maintain gradient flow. Implement batch normalization to stabilize training and reduce internal covariate shift. Use residual connections (ResNets) to facilitate easier backpropagation and address vanishing gradients. Additionally, consider gradient clipping and proper weight initialization techniques to prevent gradients from becoming too small during training. These strategies collectively help ensure stable and efficient training of deep networks.
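
A minimal PyTorch sketch of a residual block combining ReLU, batch normalization, and a skip connection (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: ReLU + batch norm + skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)

    def forward(self, x):
        out = torch.relu(self.bn1(self.fc1(x)))
        out = self.bn2(self.fc2(out))
        return torch.relu(out + x)  # skip connection keeps gradients flowing

block = ResidualBlock(16)
print(block(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```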

Question: How does a CNN help reduce parameters?

Answer:

  • Parameter sharing: Filters are applied across different regions of the input, so the same small set of weights is reused rather than learning unique parameters per location.
  • Pooling layers: Down-sampling feature maps summarizes information, reducing spatial dimensions and the parameter count of subsequent layers.

These mechanisms enable CNNs to efficiently capture spatial patterns in data while maintaining translation invariance and reducing overfitting.
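
A back-of-the-envelope comparison for a hypothetical 28x28 grayscale input: a dense layer needs a weight per input pixel per unit, while a convolutional layer reuses one small shared filter per output channel:

```python
# Parameter counts for a 28x28 grayscale input (bias terms included)

# Dense layer mapping the flattened image to 100 units:
dense_params = (28 * 28) * 100 + 100  # 78,500

# Conv layer with 32 filters of size 3x3, shared across the whole image:
conv_params = (3 * 3 * 1) * 32 + 32   # 320

print(dense_params, conv_params)
```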

Question: Why do we do negative sampling?

Answer: Negative sampling is utilized in machine learning tasks like word embeddings and recommender systems for efficiency and balance. By focusing training on a subset of negative examples, it reduces computational complexity and prevents bias towards the majority class. Additionally, negative sampling improves representation learning by encouraging the model to capture more robust and informative patterns in the data.
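
A minimal NumPy sketch of word2vec-style negative sampling, where negatives are drawn from the unigram distribution raised to the 0.75 power (the frequency counts here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word frequencies (unigram counts for a tiny vocabulary)
freqs = np.array([100, 50, 10, 5, 1], dtype=float)

# word2vec samples negatives from the unigram distribution ** 0.75,
# which flattens the distribution so rare words get sampled more often
probs = freqs ** 0.75
probs /= probs.sum()

negatives = rng.choice(len(freqs), size=5, p=probs)
print(negatives)  # indices of sampled negative words
```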

Conclusion

Preparing for a data analytics interview at Info Edge requires a blend of technical prowess, analytical thinking, and effective communication skills. By familiarizing yourself with these key questions and crafting insightful responses, you’ll be well-equipped to impress your potential employers and embark on a rewarding career in data analytics at Info Edge. Good luck!

Remember, the journey to mastering data analytics begins with understanding its intricacies and honing your skills to become a sought-after asset in the industry.
