A) -0.00001
B) 0.5
C) 1.001
A) Only A
B) Only B
C) Only C
D) A and B
E) B and C
F) A and C
Ans: (F)
A probability must always lie between 0 and 1, so the values -0.00001 and 1.001 (options A and C) cannot be probabilities.
Dell EMC Data Science: Recently Asked Interview Questions and Answers
Anita randomly picks 4 cards from a deck of 52 cards and places them back into the deck (any set of 4 cards is equally likely). Then Babita randomly chooses 8 cards from the same deck (any set of 8 cards is equally likely). Assume that the choice of 4 cards by Anita and the choice of 8 cards by Babita are independent. What is the probability that all 4 cards chosen by Anita are in the set of 8 cards chosen by Babita?
A) 48C4 x 52C4
B) 48C4 x 52C8
C) 48C8 x 52C8
D) None of the above
Ans: (A)
The total number of possible combinations is 52C4 (ways for Anita to select her 4 cards) x 52C8 (ways for Babita to select her 8 cards).
Since the 4 cards chosen by Anita must all be among the 8 cards chosen by Babita, the number of favourable combinations is 52C4 (for Anita's 4 cards) x 48C4 (for the remaining 4 cards Babita picks from the other 48, the first 4 being the same as Anita's). The required probability is therefore (52C4 x 48C4) / (52C4 x 52C8) = 48C4 / 52C8, and a quick simulation to sanity-check this is sketched below.
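As a sanity check (not part of the original answer), here is a small Monte Carlo sketch in Python: math.comb gives the exact value from the combinatorial argument above, and the simulation should land in the same ballpark, though the estimate is rough because the event is rare.

```python
import random
from math import comb

# Exact value from the combinatorial argument above: 48C4 / 52C8
exact = comb(48, 4) / comb(52, 8)

# Monte Carlo estimate (rough, since the event is rare)
trials = 1_000_000
deck = list(range(52))
hits = 0
for _ in range(trials):
    anita = set(random.sample(deck, 4))    # Anita's 4 cards (returned to the deck)
    babita = set(random.sample(deck, 8))   # Babita's independent choice of 8 cards
    if anita <= babita:                    # all 4 of Anita's cards among Babita's 8
        hits += 1

print(f"exact = {exact:.6f}, simulated = {hits / trials:.6f}")
```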
Question Context:
A player is randomly dealt a sequence of 13 cards from a deck of 52 cards. All sequences of 13 cards are equally likely. In an equivalent model, the cards are chosen and dealt one at a time. When choosing a card, the dealer is equally likely to pick any of the cards that remain in the deck.
What are Recommender Systems?
Recommender systems are a subclass of information filtering systems that predict the preferences or ratings a user would give to a product. They are widely used for movies, news, research articles, products, social tags, music, etc.
Examples include movie recommenders on IMDB, Netflix & BookMyShow, product recommenders on e-commerce sites such as Amazon, eBay & Flipkart, YouTube video recommendations, and game recommendations on Xbox.
What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
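As a minimal illustration (the data below is made up), a simple linear fit of the criterion variable Y on the predictor variable X can be obtained with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Illustrative data: X is the predictor, Y the criterion variable
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Fit Y = slope * X + intercept by ordinary least squares
slope, intercept = np.polyfit(X, Y, deg=1)
print(f"Y ~ {slope:.2f} * X + {intercept:.2f}")

# Predict the criterion value for a new predictor score
print("prediction for X = 6:", slope * 6 + intercept)
```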
How can outlier values be treated?
Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; when there are many, the values can be substituted with either the 99th or the 1st percentile values.
Not all extreme values are outliers. The most common ways to treat outlier values (sketched in code below) are:
To cap the value and bring it within a range.
To simply remove the value.
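A short sketch of the percentile-capping idea mentioned above, on made-up data; values beyond the 1st and 99th percentiles are either clipped or dropped:

```python
import numpy as np

# Illustrative data with a few extreme values
values = np.array([12.0, 14.0, 13.5, 15.0, 14.2, 250.0, 13.8, -90.0, 14.5])

# Cap (winsorize) values outside the 1st and 99th percentiles
low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)

# Alternatively, simply drop the extreme values
kept = values[(values >= low) & (values <= high)]

print("capped:", capped)
print("kept:", kept)
```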
What are various steps involved in an analytics project?
The following are the various steps involved in an analytics project:
Understand the business problem
Explore the data and become familiar with it.
Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
After data preparation, start running the model, analyse the results and tweak the approach. This is an iterative step that is repeated until the best possible outcome is achieved.
Validate the model using a new data set.
Start implementing the model and track the results to analyse the performance of the model over a period of time.
What are the differences between overfitting and underfitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so as to be able to make reliable predictions on new, unseen data.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
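To make the contrast concrete, here is a small sketch on synthetic non-linear data (everything below is illustrative): a degree-1 polynomial underfits, while a very high-degree polynomial fits the training noise and generalises poorly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-linear data: y = sin(x) + noise
x_train = np.linspace(0, 6, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 6, 200)
y_test = np.sin(x_test)

def fit_and_score(degree):
    # Fit a polynomial of the given degree and report train/test RMSE
    coeffs = np.polyfit(x_train, y_train, degree)
    train_rmse = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_rmse, test_rmse

for degree in (1, 3, 10):   # 1: underfits, 3: reasonable, 10: overfits
    print(degree, fit_and_score(degree))
```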
Python or R – Which one would you prefer for text analytics?
We would prefer Python for the following reasons:
Python would be the best option because it has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
R is more suitable for machine learning than for plain text analysis.
Python generally performs faster for text analytics tasks; a small pandas example is sketched below.
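A tiny sketch of the kind of text handling pandas makes convenient (the review strings are invented for illustration):

```python
import pandas as pd

# Illustrative text data: a few short reviews
reviews = pd.Series([
    "Great product, fast delivery",
    "Terrible support, slow delivery",
    "Great value and great support",
])

# Basic text analytics: normalise case, tokenise and count word frequencies
words = reviews.str.lower().str.split().explode().str.strip(",.!?")
print(words.value_counts().head())
```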
How does data cleaning play a vital role in analysis?
Data cleaning can help in analysis because:
Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
Data cleaning helps to increase the accuracy of a machine learning model.
It is a cumbersome process: as the number of data sources grows, the time taken to clean the data increases sharply with the number of sources and the volume of data they generate.
Data cleaning alone might take up to 80% of the time, making it a critical part of the analysis task; a brief pandas sketch follows.
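A brief sketch of typical cleaning steps in pandas on a hypothetical dataset (column names and values are made up): removing duplicates, standardising formats and treating missing values.

```python
import pandas as pd

# Hypothetical raw data combined from multiple sources
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "country":     ["US", "us", "us", "IN ", None],
    "revenue":     [120.0, 85.5, 85.5, None, 42.0],
})

df = df.drop_duplicates()                                     # remove duplicate rows
df["country"] = df["country"].str.strip().str.upper()         # standardise formats
df["revenue"] = df["revenue"].fillna(df["revenue"].median())  # treat missing values
print(df)
```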
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
What is logistic regression?
Logistic regression, also known as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.
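A minimal sketch with scikit-learn on made-up data; the fitted model applies the logistic (sigmoid) function to a linear combination of the predictors to produce the probability of the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied (predictor) vs. pass/fail (binary outcome)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Predicted probability of passing for 4.5 hours of study
print(model.predict_proba([[4.5]])[0, 1])
```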
What are the important skills to have in Python with regard to data analysis?
The following are some of the important skills to possess which will come handy when performing data analysis using Python.
Good understanding of the built-in data types especially lists, dictionaries, tuples and sets.
Mastery of N-dimensional NumPy arrays.
Mastery of pandas dataframes.
Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.
Knowing that you should use the Anaconda distribution and the conda package manager.
Familiarity with scikit-learn.
Ability to write efficient list comprehensions instead of traditional for loops.
Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
These skills will help in tackling most problems in data analytics and machine learning; the vectorisation and list-comprehension points are illustrated briefly below.
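A short illustration of the shift away from explicit for loops mentioned above (the arrays and numbers are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Element-wise vector operations instead of an explicit for loop
print(a * b + 1)          # [ 11.  41.  91. 161.]

# A list comprehension instead of building a list with a for loop
squares = [x ** 2 for x in range(5) if x % 2 == 0]
print(squares)            # [0, 4, 16]
```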
What is Selection Bias?
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
Sampling bias: a systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others and resulting in a biased sample.
Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: when specific subsets of data are chosen to support a conclusion, or bad data are rejected on arbitrary grounds rather than according to previously stated or generally agreed criteria.
Attrition: attrition bias is a kind of selection bias caused by attrition (loss of participants), i.e. discounting trial subjects or tests that did not run to completion.
What is the goal of A/B Testing?
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest.
An example would be comparing the click-through rate of two versions of a banner ad.
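A sketch of how such a click-through-rate comparison might be tested, assuming the statsmodels library is available; the click and impression counts are invented.

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented data: clicks and impressions for banner variants A and B
clicks = [310, 355]         # successes in A and B
impressions = [5000, 5000]  # trials in A and B

# Two-proportion z-test for a difference in click-through rate
z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```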
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).
Sensitivity is nothing but "correctly predicted true events / total actual true events". True events here are the events which were actually true and which the model also predicted as true.
Calculation of sensitivity is pretty straightforward:
Sensitivity = (True Positives) / (Total Positives in the Actual Dependent Variable)
where true positives are positive events which are correctly classified as positives.
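A short sketch computing sensitivity (recall) on made-up labels, both directly from the formula above and via scikit-learn:

```python
import numpy as np
from sklearn.metrics import recall_score

# Illustrative actual vs. predicted labels (1 = positive event)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

true_positives = np.sum((y_true == 1) & (y_pred == 1))
actual_positives = np.sum(y_true == 1)
print("sensitivity =", true_positives / actual_positives)  # TP / (TP + FN)

# Same value via scikit-learn
print("recall_score =", recall_score(y_true, y_pred))
```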