Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
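As a minimal sketch of the idea, the snippet below splits sample indices into k folds and scores a toy model on each held-out fold. The function names and the "always predict the training mean" model are assumptions made for illustration, not part of the original answer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate_mean_model(targets, k):
    """Average validation MSE of a toy model that predicts the training mean."""
    fold_errors = []
    for train, validation in k_fold_indices(len(targets), k):
        prediction = sum(targets[i] for i in train) / len(train)
        mse = sum((targets[i] - prediction) ** 2 for i in validation) / len(validation)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Hypothetical target values, used only to exercise the functions.
print(cross_validate_mean_model([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3))
```

Each sample lands in exactly one validation fold, so every data point is used for both training and validation across the k rounds.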
What is Collaborative Filtering?
Collaborative filtering is the filtering process used by most recommender systems to find patterns and information by combining the viewpoints of multiple users, data sources, and agents.
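One common variant is user-based collaborative filtering, sketched below under simple assumptions: the ratings dictionary, the user and item names, and the choice of cosine similarity are all hypothetical and chosen only to illustrate the idea.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"book": 5, "lamp": 3, "pen": 4},
    "bob": {"book": 4, "lamp": 2, "pen": 5, "desk": 4},
    "cat": {"book": 1, "lamp": 5, "desk": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none shared)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        weight_total += similarity
    return weighted_sum / weight_total if weight_total else None


# "ann" has not rated "desk"; estimate it from users with similar tastes.
print(predict_rating("ann", "desk", ratings))
```

The prediction leans toward the rating given by the user whose past ratings most resemble the target user's.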
Amazon Data Science Recently Asked Interview Questions Answers
What is the expected net profit from playing this ticket?
A) $-2.81
B) $2.81
C) $-1.82
D) $1.82
Ans: (B)
The expected value in this case is:
E(X) = P(grand prize)*(10405-5) + P(small)*(100-5) + P(losing)*(-5)
P(grand prize) = (1/10)*(1/10)*(1/26) = 1/2600
P(small) = 1/26 - 1/2600. The subtraction is needed because the small prize excludes the case where both the letter and the numbers are right, so the grand-prize scenario must be removed from the probability of getting the letter right.
P(losing) = 1 - P(grand prize) - P(small) = 1 - 1/26
Substituting these values gives an expected value of about $2.81.
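The arithmetic can be checked with exact fractions. This sketch takes the ticket price ($5) and the prize amounts ($10405 and $100) from the formula above, since the full problem statement is not reproduced here; the variable names are illustrative.

```python
from fractions import Fraction

ticket_price = 5      # read off the "-5" terms in the formula
grand_prize = 10405
small_prize = 100

p_grand = Fraction(1, 10) * Fraction(1, 10) * Fraction(1, 26)
p_small = Fraction(1, 26) - p_grand   # letter right, but not both numbers
p_losing = 1 - p_grand - p_small      # the three outcomes must sum to 1

expected_profit = (
    p_grand * (grand_prize - ticket_price)
    + p_small * (small_prize - ticket_price)
    - p_losing * ticket_price
)
print(round(float(expected_profit), 2))  # 2.81
```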
Assume you sell sandwiches. 70% people choose egg, and the rest choose chicken. What is the probability of selling 2 egg sandwiches to the next 3 customers?
A) 0.343
B) 0.063
C) 0.147
D) 0.027
Ans: (C)
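The probability of selling an egg sandwich is 0.7 and that of a chicken sandwich is 0.3. The probability that the next 3 customers order 2 egg sandwiches and 1 chicken sandwich in one particular sequence is 0.7 * 0.7 * 0.3 = 0.147, and every such sequence has the same probability. Note that 0.147 covers a single sequence; if any of the 3 possible sequences counts, the total is 3 * 0.147 = 0.441, which is not among the listed options.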
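The two readings of the question can be compared with a short calculation. The sketch below computes the probability of one fixed ordering and the binomial probability of exactly 2 egg sandwiches in any order; the helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(successes, trials, p_success):
    """Probability of exactly `successes` successes in `trials` independent trials."""
    return (
        comb(trials, successes)
        * p_success ** successes
        * (1 - p_success) ** (trials - successes)
    )


one_sequence = 0.7 * 0.7 * 0.3       # e.g. egg, egg, chicken in that order
any_order = binomial_pmf(2, 3, 0.7)  # 3 orderings, each with the same probability

print(round(one_sequence, 3), round(any_order, 3))  # 0.147 0.441
```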
Explain star schema.
A star schema is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
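A small in-memory example may make the layout concrete. The table and column names below (`fact_sales`, `dim_product`, and so on) are hypothetical; the sketch only shows a fact table that stores IDs and a lookup table that maps those IDs to names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup (dimension) table: maps an ID to a descriptive name.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    -- Central fact table: stores only IDs and measurements.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'egg sandwich'), (2, 'chicken sandwich');
    INSERT INTO fact_sales VALUES (1, 1, 2), (2, 2, 1), (3, 1, 3);
""")

# Join the fact table to the lookup table through the shared ID field.
rows = conn.execute("""
    SELECT d.product_name, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.product_name
    ORDER BY d.product_name
""").fetchall()
print(rows)  # [('chicken sandwich', 1), ('egg sandwich', 5)]
```

Because the fact table repeats only the small integer ID rather than the full product name, it stays compact even with many rows.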
How regularly must an algorithm be updated?
You will want to update an algorithm when:
You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity
What is Data Science? Also, list the differences between supervised and unsupervised learning.
Data Science involves using automated methods to analyze massive amounts of data and to extract knowledge from them. By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.
Supervised Learning vs Unsupervised Learning
Supervised Learning | Unsupervised Learning
1. Input data is labeled. | 1. Input data is unlabeled.
2. Uses training dataset. | 2. Uses the input data set.
3. Used for prediction. | 3. Used for analysis.
4. Enables classification and regression. | 4. Enables clustering, density estimation, and dimension reduction.
What are feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
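As an illustration, the sketch below turns a dictionary describing an object into a flat numeric vector, using a one-hot encoding for the symbolic attribute. The attribute names and the color list are made up for the example.

```python
COLORS = ["red", "green", "blue"]  # hypothetical set of categories


def to_feature_vector(item):
    """Encode a dict describing an object as a flat list of floats."""
    # One-hot encode the symbolic "color" feature.
    one_hot = [1.0 if item["color"] == color else 0.0 for color in COLORS]
    return [float(item["weight_g"]), float(item["length_cm"])] + one_hot


apple = {"weight_g": 150, "length_cm": 8, "color": "red"}
print(to_feature_vector(apple))  # [150.0, 8.0, 1.0, 0.0, 0.0]
```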
Explain the steps in making a decision tree.
1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps 1 to 2 to the divided data.
5. Stop when you meet some stopping criteria.
6. Clean up the tree if you went too far doing splits. This step is called pruning.
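The split search is the core of the procedure. The sketch below covers only that part, for a single numeric feature, and uses Gini impurity as the separation measure; the text does not name a criterion, so that choice, like the function names and sample data, is an assumption. Recursion, stopping criteria, and pruning are left out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure set)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' test."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical one-feature data set with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # (3.0, 0.0)
```

A full tree builder would call `best_split` on each resulting subset until a stopping criterion is met.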
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
What is logistic regression?
Logistic regression is also known as the logit model. It is a technique for predicting a binary outcome from a linear combination of predictor variables.
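The prediction step can be sketched as follows: a linear combination of the predictors is passed through the logistic (sigmoid) function to give a probability, which is then thresholded. The weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration; fitting the weights from data is not shown.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class from a linear combination of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


weights = [0.8, -0.5]  # hypothetical, as if already fitted
bias = 0.1
probability = predict_probability([2.0, 1.0], weights, bias)
label = 1 if probability >= 0.5 else 0
print(round(probability, 3), label)
```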
What are the important skills to have in Python with regard to data analysis?
The following are some of the important skills that come in handy when performing data analysis using Python.
Good understanding of the built-in data types especially lists, dictionaries, tuples and sets.
Mastery of N-dimensional NumPy arrays.
Mastery of pandas dataframes.
Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.
Knowing that you should use the Anaconda distribution and the conda package manager.
Familiarity with scikit-learn.
Ability to write efficient list comprehensions instead of traditional for loops.
Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
These skills will help you tackle any problem in data analytics and machine learning.
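Three of the skills above (list comprehensions, small pure functions, and profiling) can be illustrated together using only the standard library. The function names and input sizes are arbitrary, and the measured times will vary by machine, so no expected timings are given.

```python
import timeit


def squares_loop(n):
    """Build the list of squares with a traditional for loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


# Both functions are pure: they return a new list and alter nothing else.
loop_time = timeit.timeit(lambda: squares_loop(10_000), number=50)
comprehension_time = timeit.timeit(lambda: squares_comprehension(10_000), number=50)
print(f"loop: {loop_time:.4f}s, comprehension: {comprehension_time:.4f}s")
```

For finding bottlenecks in a whole script rather than a single function, the standard library's `cProfile` module is the usual next step.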
What is Selection Bias?
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, with the result that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: When specific subsets of data are chosen to support a conclusion, or when data is rejected as bad on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.
What is the goal of A/B Testing?
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest.
An example of this is measuring the click-through rate for a banner ad.
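For the click-through example, one common way to run the hypothesis test is a two-proportion z-test, sketched below. The click and view counts are invented, and the z-test is one reasonable choice among several rather than the only method for A/B tests.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two_sided_p_value) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: variant A vs. variant B of a banner ad.
z, p_value = two_proportion_z_test(clicks_a=200, views_a=10_000,
                                   clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in click-through rates is unlikely to be due to chance alone, assuming users were randomly assigned to A and B.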