October 16, 2018

Srikaanth

MuleSoft Most Frequently Asked Data Science Interview Questions Answers

What is Machine Learning?

The simplest way to answer this question is: we give the data and an equation to the machine, and ask the machine to look at the data and identify the coefficient values in the equation.

For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
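
As a rough illustration of "learning m and c from the data", the sketch below fits y = mx + c by ordinary least squares. The function name fit_line and the sample data are made up for this example; a real project would more likely use a library routine.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data generated from y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

m, c = fit_line(xs, ys)
print(f"m = {m:.2f}, c = {c:.2f}")
```

Because the sample points lie exactly on a line, the fitted values match the slope and intercept used to generate them.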

What is the difference between skewed and uniform distribution?

When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.
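
One common way to put a number on this is the moment-based skewness coefficient, which is near zero for a symmetric spread, negative for a left skew and positive for a right skew. The sketch below is only an illustration; the function name and the sample lists are invented for the example.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]       # no peak, symmetric
right_skewed = [1, 1, 1, 2, 2, 3, 10]             # few observations at high values
left_skewed = [11 - v for v in right_skewed]      # mirror image: few at low values

for name, data in [("even", evenly_spread), ("right", right_skewed), ("left", left_skewed)]:
    print(name, round(skewness(data), 3))
```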

You created a predictive model of a quantitative outcome variable using multiple regression. What are the steps you would follow to validate the model?

Since the question is about the stage after the model has been built, we will assume that you have already tested the null hypothesis, multicollinearity and the standard errors of the coefficients.

Once you have built the model, you should check the following:

·         Global F-test to see the significance of the group of independent variables on the dependent variable

·         R^2

·         Adjusted R^2

·         RMSE, MAPE

In addition to the above-mentioned quantitative metrics, you should also check:

·         Residual plot

·         Assumptions of linear regression
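
The quantitative metrics listed above can be sketched in a few lines. This is a minimal illustration, not a full validation routine: it covers R^2, adjusted R^2, RMSE and MAPE only, leaves out the F-test and residual plots, and assumes no actual value is zero (MAPE is undefined otherwise). The function name, the sample values and the number of predictors are all assumptions made for the example.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return R^2, adjusted R^2, RMSE and MAPE (in percent) for a fitted model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalises the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mape = 100 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mape": mape}


# Hypothetical actual and predicted values from a model with 2 predictors
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pred = [12.0, 18.0, 33.0, 37.0, 50.0]

metrics = regression_metrics(y_true, y_pred, n_predictors=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```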

What do you understand by Recall and Precision?

Recall measures "Of all the actual true samples how many did we classify as true?"

Precision measures "Of all the samples we classified as true how many are actually true?"

We will explain this with a simple example for better understanding:

Imagine that your wife has given you a surprise every year on your anniversary for the last 12 years. One day, all of a sudden, your wife asks: "Darling, do you remember all the anniversary surprises from me?"

This simple question puts your life in danger. To save your life, you need to recall all 12 anniversary surprises from memory. Thus, recall (R) is the ratio of the number of events you can correctly recall to the number of all correct events. If you can recall all 12 surprises correctly, the recall ratio is 1 (100%), but if you can recall only 10 of the 12 surprises correctly, the recall ratio is 0.83 (83.3%).

However, you might be wrong in some cases. For instance, you answer 15 times: 10 of the surprises you guess are correct and 5 are wrong. Your recall ratio is then still 83.3% (10 of the 12 actual surprises), but your precision is only 66.67% (10 of your 15 answers).

Precision (P) is the ratio of the number of events you can correctly recall to the number of all events you recall (the combination of wrong and correct recalls).
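
The anniversary example can be checked with a short sketch. Counting against all 12 actual surprises, 15 answers with 10 correct give a recall of 10/12 and a precision of 10/15. The function name and the labels used for the surprises are made up for illustration.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall from the sets of actual and predicted positives."""
    actual, predicted = set(actual), set(predicted)
    true_positives = len(actual & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# 12 real surprises; 15 answers, of which 10 are real and 5 are wrong guesses
actual = {f"surprise_{i}" for i in range(1, 13)}
answers = {f"surprise_{i}" for i in range(1, 11)} | {f"wrong_guess_{i}" for i in range(1, 6)}

precision, recall = precision_recall(actual, answers)
print(f"precision = {precision:.4f}, recall = {recall:.4f}")
```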

How can you deal with different types of seasonality in time series modelling?

Seasonality in a time series occurs when the series shows a repeated pattern over time. For example, stationery sales decrease during the holiday season and air conditioner sales increase during the summer.

Seasonality makes a time series non-stationary because the average value of the variable differs across time periods. Differencing the time series is generally regarded as the best method of removing seasonality. Seasonal differencing is the numerical difference between a particular value and the value one seasonal period earlier (e.g. a lag of 12 for monthly data with a yearly pattern).
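
A minimal sketch of seasonal differencing with a lag of 12 is shown below. The monthly series is synthetic: a made-up seasonal pattern repeated for three years, plus a step of 10 per year, so that the effect of differencing is easy to see. The function name is an assumption for this example.

```python
def seasonal_difference(series, lag=12):
    """Subtract from each value the value observed `lag` periods earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic monthly data: the same 12-month pattern each year, shifted up by 10 per year
pattern = [5, 4, 6, 8, 12, 18, 25, 24, 15, 9, 6, 5]
series = [pattern[t % 12] + 10 * (t // 12) for t in range(36)]

differenced = seasonal_difference(series, lag=12)
print(differenced)
```

After differencing, the repeating pattern cancels out and only the year-over-year change remains; the first 12 observations are lost because they have no earlier value to compare with.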

Explain the K-means algorithm.

K-means is a basic unsupervised learning algorithm that partitions the data into K groups, known as clusters, by grouping similar data together. One center is defined for each of the K clusters, and each object is assigned to its nearest cluster center. The centers are then recomputed from the objects assigned to them, and the assignment step is repeated until the clusters stop changing. All objects in the same cluster are related to each other and different from the objects in other clusters. The algorithm works well on large data sets.
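
The assign-and-update loop can be sketched in plain Python as below. This is a simplified illustration: the initial centers are picked at random from the data with a fixed seed, distances are squared Euclidean, and the function name, parameters and sample points are all assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of coordinate tuples into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct data points as starting centers
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, clusters


# Two well-separated groups of hypothetical 2-D points
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9), (9, 10), (10, 9), (10, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```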

What is Linear Regression?

Linear regression is used mainly for predictive analysis. The method describes the relationship between a dependent variable and one or more independent variables. In simple linear regression, a single straight line is fitted to the points of a scatter plot. It consists of the following three steps:

Analyzing and determining the direction and correlation of the data
Estimating the model
Ensuring the validity and usefulness of the model, which also helps to determine the outcomes of various events

In a real-world scenario, how is machine learning deployed?

The real world applications of machine learning include:

Finance: To evaluate risks, investment opportunities and in the detection of fraud

Robotics: To handle unusual situations

Search Engine: To rank the pages as per the user’s personal preferences

Information Extraction: To frame possible questions and extract the answers from a database

E-commerce: To deploy targeted advertising and re-marketing, and to predict customer churn

What is the importance of data cleansing in data analysis?

Because data comes from multiple sources, it is important to extract the useful and relevant parts, which is why data cleansing becomes so important. Data cleansing is the process of detecting and correcting inaccurate data components and deleting the irrelevant ones. For data cleansing, the data is processed concurrently or in batches.
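
As a small illustration of what cleansing a batch of records can involve, the sketch below trims stray whitespace, normalizes the case of a key field, drops records with missing required fields and removes duplicates. The field names, the function name and the sample records are hypothetical.

```python
def clean_records(records, required=("name", "email"), key_field="email"):
    """Return cleaned copies of the records: trimmed, complete and de-duplicated."""
    seen = set()
    cleaned = []
    for record in records:
        # Correct: strip surrounding whitespace from every string value
        record = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Delete: skip records where a required field is missing or empty
        if any(not record.get(field) for field in required):
            continue
        # Correct: normalise the key field so that duplicates can be detected
        record[key_field] = record[key_field].lower()
        if record[key_field] in seen:
            continue
        seen.add(record[key_field])
        cleaned.append(record)
    return cleaned


raw = [
    {"name": " Alice ", "email": "ALICE@example.com"},
    {"name": "Alice", "email": "alice@example.com "},  # duplicate of the first record
    {"name": "Bob", "email": ""},                      # missing email
    {"name": "Carol", "email": "carol@example.com"},
]

cleaned = clean_records(raw)
print(cleaned)
```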

How is statistics used by Data Scientists?

With the help of statistics, Data Scientists can turn huge amounts of data into insights. These insights give a better idea of what customers are expecting. Using statistics, Data Scientists can understand a customer's behavior, engagement, interests and final conversion. They can make powerful predictions and draw certain inferences. The insights can also be converted into powerful business propositions, and customers can be offered suitable deals.

What are the benefits of R language?

R is a software suite for statistical computing, graphical representation, data calculation and manipulation. Following are a few characteristics of R programming:

It has an extensive collection of tools
It provides operators for matrix operations and calculations on arrays
It offers graphical techniques for data analysis
It is a simple language with many effective features
It supports machine learning applications
It acts as a connecting link between a number of data sets, tools and software
It can be used to solve data-oriented problems

Explain Recommender Systems.

A recommender system works on the basis of a person's past behavior and is widely deployed in a number of fields, such as music preferences, movie recommendations, research articles, social tags and search queries. From this behavior a model can be built that predicts the person's future behavior, for example which product the person would prefer to buy, which movie they will watch or which book they will read. It can also use the discrete characteristics of the items to recommend additional items.
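
Recommending from the discrete characteristics of items is often called content-based filtering. The sketch below is one simple way to do it, not the only one: each movie is described by a hand-made feature vector, the person's taste is the average of the vectors of the movies they liked, and the remaining movies are ranked by cosine similarity to that average. The titles, the features and the function names are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked, catalog, top_n=2):
    """Rank items the person has not seen by similarity to the items they liked."""
    dims = len(next(iter(catalog.values())))
    # Taste profile: the mean feature vector of the liked items
    profile = [sum(catalog[title][d] for title in liked) / len(liked) for d in range(dims)]
    scored = [
        (cosine(profile, features), title)
        for title, features in catalog.items()
        if title not in liked
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_n]]


# Hypothetical catalog; features are [sci-fi, romance, action]
catalog = {
    "Space Saga": [1, 0, 1],
    "Galaxy Raiders": [1, 0, 1],
    "Love Letters": [0, 1, 0],
    "Star Romance": [1, 1, 0],
    "Street Chase": [0, 0, 1],
}

suggestions = recommend(["Space Saga"], catalog, top_n=2)
print(suggestions)
```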

Which language, R or Python, is more suitable for text analytics?

Python is generally considered more suitable for text analytics. It has a rich library called Pandas, which gives analysts high-level data analysis tools and data structures. R also offers data frames and text-mining packages, but it is geared more towards statistical analysis.

What is the difference between Data Analytics, Big Data, and Data Science?

Big Data: Big Data deals with huge volumes of data in structured and semi-structured form and requires just a basic knowledge of mathematics and statistics.
Data Analytics: Data Analytics provides operational insights into complex business scenarios.
Data Science: Data Science deals with slicing and dicing of data and requires deep knowledge of mathematics and statistics.

