What is the expected net profit from playing this ticket?
A) $-2.81
B) $2.81
C) $-1.82
D) $1.82
Ans: (B)
Expected value in this case:
E(X) = P(grand prize)*(10405-5) + P(small)*(100-5) + P(losing)*(-5)
P(grand prize) = (1/10)*(1/10)*(1/26) = 1/2600
P(small) = 1/26 - 1/2600. We subtract 1/2600 to exclude the case where he gets the letter right and also both numbers right, since that case is already counted as the grand prize.
P(losing) = 1 - P(grand prize) - P(small) = 1 - 1/26
Therefore, we can plug in the values to get the expected value as $2.81.
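The arithmetic can be checked with a short script. This is a minimal sketch that assumes the setup implied by the formula: a $5 ticket, a $10405 grand prize for matching two digits and a letter, and a $100 prize for matching only the letter. The constant and variable names are illustrative.

```python
from fractions import Fraction

# Assumed from the formula in the answer: ticket price and prizes in dollars.
TICKET_PRICE = 5
GRAND_PRIZE = 10405
SMALL_PRIZE = 100

# Two digits (1/10 each) and one letter (1/26) must all match for the grand prize.
p_grand = Fraction(1, 10) * Fraction(1, 10) * Fraction(1, 26)
# The letter matches, but not both digits.
p_small = Fraction(1, 26) - p_grand
# Every other outcome is a loss.
p_lose = 1 - p_grand - p_small

expected_profit = (
    p_grand * (GRAND_PRIZE - TICKET_PRICE)
    + p_small * (SMALL_PRIZE - TICKET_PRICE)
    + p_lose * (-TICKET_PRICE)
)

print(p_lose)                            # 25/26
print(round(float(expected_profit), 2))  # 2.81
```

Using exact fractions avoids rounding noise until the final step.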
Assume you sell sandwiches. 70% people choose egg, and the rest choose chicken. What is the probability of selling 2 egg sandwiches to the next 3 customers?
A) 0.343
B) 0.063
C) 0.147
D) 0.027
Ans: (C)
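The probability of selling an egg sandwich is 0.7 and that of a chicken sandwich is 0.3. The probability that the next 3 customers order egg, egg, chicken in that specific sequence is 0.7 * 0.7 * 0.3 = 0.147, which is the listed answer. Each of the 3 possible sequences with 2 egg sandwiches has this same probability, so if the sequence does not matter, the probability of exactly 2 egg sandwiches is 3 * 0.147 = 0.441, which is not among the options.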
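As a quick check, the sketch below computes both the probability of one specific sequence (egg, egg, chicken) and the binomial probability of exactly 2 egg orders in any sequence. The function name is illustrative and not part of the original answer.

```python
from math import comb

P_EGG = 0.7  # probability that a customer chooses egg, from the question


def prob_exactly_k_egg(k, n, p=P_EGG):
    """Binomial probability of exactly k egg orders among n customers."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


one_sequence = P_EGG * P_EGG * (1 - P_EGG)  # egg, egg, chicken in that order
any_sequence = prob_exactly_k_egg(2, 3)     # exactly 2 egg, any order

print(round(one_sequence, 3))  # 0.147
print(round(any_sequence, 3))  # 0.441
```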
Explain star schema.
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
How regularly must an algorithm be updated?
You will want to update an algorithm when:
You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity
What is Data Science? Also, list the differences between supervised and unsupervised learning.
Data Science involves using automated methods to analyze massive amounts of data and to extract knowledge from them. By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.
Supervised Learning vs Unsupervised Learning
Supervised Learning | Unsupervised Learning
1. Input data is labeled. | 1. Input data is unlabeled.
2. Uses a training data set. | 2. Uses the input data set.
3. Used for prediction. | 3. Used for analysis.
4. Enables classification and regression. | 4. Enables clustering, density estimation, and dimension reduction.
What are feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
Explain the steps in making a decision tree.
1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps 1 to 2 to the divided data.
5. Stop when you meet some stopping criteria.
6. Clean up the tree if you went too far doing splits. This step is called pruning.
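The "look for a split" step can be sketched in a few lines. The answer does not name a split criterion, so this example assumes Gini impurity, one common choice. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # one side is empty, so this is not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


# Hypothetical toy data: one numeric feature, two classes.
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(rows, labels))  # (0, 3.0, 0.0)
```

A full tree builder would call `best_split` recursively on the left and right subsets until a stopping criterion is met.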
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
What is logistic regression?
Logistic Regression is also known as the logit model. It is a technique to predict a binary outcome from a linear combination of predictor variables.
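To make "a linear combination of predictor variables" concrete, here is a minimal sketch of the prediction step only. The weights and bias are hypothetical; in practice they are fitted to training data.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear combination of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is a conventional default threshold
print(round(p, 3), label)     # 0.769 1
```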
What are the important skills to have in Python with regard to data analysis?
The following are some of the important skills that will come in handy when performing data analysis using Python.
Good understanding of the built-in data types especially lists, dictionaries, tuples and sets.
Mastery of N-dimensional NumPy arrays.
Mastery of pandas dataframes.
Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.
Knowing that you should use the Anaconda distribution and the conda package manager.
Familiarity with scikit-learn.
Ability to write efficient list comprehensions instead of traditional for loops.
Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
These skills will help to tackle most problems in data analytics and machine learning.
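A small sketch of three of the skills listed: a list comprehension in place of a for loop, a pure function that leaves its input untouched, and basic profiling with the standard library's timeit module. The function names are illustrative, and the timings will differ from machine to machine.

```python
import timeit


def squares_loop(values):
    """Traditional for loop that builds a new list."""
    result = []
    for v in values:
        result.append(v * v)
    return result


def squares_comprehension(values):
    """Same result as a list comprehension; the input list is not modified."""
    return [v * v for v in values]


data = list(range(1000))
assert squares_loop(data) == squares_comprehension(data)

# Simple profiling: time 200 calls of each version.
for fn in (squares_loop, squares_comprehension):
    seconds = timeit.timeit(lambda: fn(data), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s")
```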
What is Selection Bias?
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: When specific subsets of data are chosen to support a conclusion, or when bad data is rejected on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants), that is, discounting trial subjects or tests that did not run to completion.
What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B Testing is to identify changes to a web page that maximize or increase an outcome of interest.
An example of this could be measuring the click-through rate for a banner ad.
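One way to run such a test on click-through rates is a two-proportion z-test. The answer does not name a specific test, so treat this as one reasonable choice among several. The click and view counts below are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: both variants share one click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for banner A and banner B.
z, p_value = two_proportion_z_test(clicks_a=200, views_a=10000,
                                   clicks_b=260, views_b=10000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in click-through rate is unlikely to be due to chance alone, given the test's assumptions (independent visitors and large samples).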
A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not getting 4 on the second roll?
A) 1/36
B) 1/18
C) 5/36
D) 1/6
E) 1/3
Ans: (C)
The two events mentioned are independent. The first roll of the die is independent of the second roll. Therefore the probabilities can be directly multiplied.
P(getting first 2) = 1/6
P(no second 4) = 5/6
Therefore P(getting first 2 and no second 4) = 1/6* 5/6 = 5/36
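The same result can be confirmed by enumerating all 36 outcomes. This is a brute-force check, not a replacement for the independence argument.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 (first, second) outcomes
favourable = [(a, b) for a, b in rolls if a == 2 and b != 4]

print(Fraction(len(favourable), len(rolls)))  # 5/36
```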
Consider a tetrahedral die and roll it twice. What is the probability that the number on the first roll is strictly higher than the number on the second roll?
Note: A tetrahedral die has only four sides (1, 2, 3 and 4).
A) 1/2
B) 3/8
C) 7/16
D) 9/16
Ans: (B)
(1,1) (2,1) (3,1) (4,1)
(1,2) (2,2) (3,2) (4,2)
(1,3) (2,3) (3,3) (4,3)
(1,4) (2,4) (3,4) (4,4)
There are 6 out of 16 possibilities where the first roll is strictly higher than the second roll, so the probability is 6/16 = 3/8.
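The count can be verified by listing every pair of rolls, as in this short sketch.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 5), repeat=2))  # 16 (first, second) pairs
higher = [(a, b) for a, b in outcomes if a > b]  # first roll strictly higher

print(len(higher), Fraction(len(higher), len(outcomes)))  # 6 3/8
```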
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).
Sensitivity is nothing but “Predicted True events / Total events”. True events here are the events which were true and which the model also predicted as true.
Calculation of sensitivity is pretty straightforward.
Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )
where true positives are positive events which are correctly classified as positives.
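The ratio of true positives to actual positives can be computed directly from two label lists. The sketch below uses hypothetical labels, with 1 for a positive event and 0 for a negative event.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (also called recall)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    if actual_positives == 0:
        raise ValueError("no positive events in the actual labels")
    return true_positives / actual_positives


# Hypothetical labels: 1 = positive event, 0 = negative event.
actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
print(sensitivity(actual, predicted))  # 0.6
```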
Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats, and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. What will happen if a passenger who is a true threat is flagged as a non-threat by the airport's model?
Example 2: What if a jury or judge decides to let a criminal go free?
Example 3: What if you declined to marry a very good person based on your predictive model, and you happen to meet him/her after a few years and realize that you had a false negative?
Can you cite some examples where false positives and false negatives are equally important?
In the banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
Can you explain the difference between a Validation Set and a Test Set?
A validation set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters (i.e. weights), and the test set is used to assess the performance of the model (i.e. evaluating its predictive power and generalization).
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to set aside a data set to test the model in the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
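A common variant is k-fold cross-validation, where each fold takes one turn as the validation set. The answer does not name a variant, so k-fold is an assumption here. The sketch below only generates the index splits, without shuffling, and the function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=6, k=3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

In practice the model is fitted on each training split and scored on the matching validation split, and the scores are averaged.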
What is Machine Learning?
Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics. It is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.
What is Supervised Learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks
E.g. If you built a fruit classifier, the labels will be “this is an orange, this is an apple and this is a banana”, based on showing the classifier examples of apples, oranges and bananas.
What is Unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models
E.g. In the same example, a fruit clustering algorithm will group the fruits into categories such as “fruits with soft skin and lots of dimples”, “fruits with shiny hard skin” and “elongated yellow fruits”.
Adobe Systems Data Science Recently Asked Interview Questions Answers