October 16, 2018

Srikaanth

OpenText Data Science Interview Questions Answers

Explain the K-means algorithm.

K-Means is a basic unsupervised learning algorithm that partitions data into K clusters. Similar data points are grouped together: each of the K clusters has a center, each object is assigned to the cluster whose center is nearest, and objects in the same cluster are similar to one another and different from objects in other clusters. The algorithm works well on large data sets.
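The assignment and update steps described above can be sketched in a few lines of Python (a toy illustration, not a production implementation; the function name and sample points are made up):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy K-means: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                         for d in range(dim)))
            else:
                new_centers.append(center)  # keep an empty cluster's old center
        if new_centers == centers:
            break  # converged: assignments no longer change
        centers = new_centers
    return centers, clusters
```

On well-separated points such as two tight 2-D groups, the two centers settle on the two obvious groups.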

What is Linear Regression?

Linear regression is used primarily for predictive analysis. The method describes the relationship between a dependent variable and one or more independent variables by fitting a single straight line through a scatter plot of the data. It consists of the following three steps:

Analyzing the correlation and direction of the data
Estimating the model, i.e. fitting the line
Evaluating the validity and usefulness of the model, which also helps to determine the outcomes of various events
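The estimation step can be sketched with ordinary least squares for a line y = a·x + b (pure Python; the function name and data are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b to paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a = covariance(x, y) / variance(x);
    # the intercept b keeps the line passing through the point of means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x
```

For perfectly linear data such as xs = [1, 2, 3, 4], ys = [2, 4, 6, 8], this returns a slope of 2 and an intercept of 0.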

How is machine learning deployed in real-world scenarios?

The real world applications of machine learning include:

Finance: To evaluate risks and investment opportunities, and to detect fraud

Robotics: To handle out-of-the-ordinary situations

Search Engine: To rank pages according to the user's personal preferences

Information Extraction: To frame possible questions for extracting answers from databases

E-commerce: To deploy targeted advertising and re-marketing, and to predict customer churn

What is the importance of data cleansing in data analysis?

Since data comes from multiple sources, it is important to extract the useful and relevant portions, which makes data cleansing essential. Data cleansing is the process of detecting and correcting inaccurate or corrupt data components and deleting the irrelevant ones. For data cleansing, the data can be processed concurrently or in batches.
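A minimal sketch of such a cleansing pass in Python (the field names and rules here are hypothetical examples): it normalizes strings, drops records missing required fields, and removes duplicates by id.

```python
def clean_records(records, required=("id", "email")):
    """Drop duplicate and incomplete records; normalize string whitespace/case."""
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize: trim whitespace and lowercase every string value.
        norm = {k: v.strip().lower() if isinstance(v, str) else v
                for k, v in rec.items()}
        if any(not norm.get(f) for f in required):
            continue  # incomplete record: delete it
        if norm["id"] in seen:
            continue  # duplicate record: keep only the first occurrence
        seen.add(norm["id"])
        cleaned.append(norm)
    return cleaned
```

A messy input like a repeated id or a blank email field is filtered out, leaving one normalized record per entity.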

How is statistics used by Data Scientists?

With the help of statistics, Data Scientists can convert huge amounts of data into insights, which provide a better idea of what customers are expecting. Using statistics, Data Scientists can understand customer behavior, engagement, interests and final conversion, and can make powerful predictions and inferences. These insights can also be converted into powerful business propositions, and customers can be offered suitable deals.

What are the benefits of the R language?

The R environment comprises a number of software suites for statistical computing, graphical representation, and data calculation and manipulation. Following are a few characteristics of R programming:

It has an extensive collection of tools
Its operators can perform calculations on matrices and arrays
It offers data analysis techniques using graphical representation
It is a simple yet effective language with many powerful features
It supports machine learning applications
It acts as a connecting link between a number of data sets, tools and software packages
It can be used to solve data-oriented problems

Explain the Recommender System.

A recommender system works on the basis of a person's past behavior and is widely deployed in a number of fields such as music preferences, movie recommendations, research articles, social tags and search queries. With this system, a model can be built to predict a person's future behavior: which product the person would prefer to buy, which movie he will watch or which book he will read. It uses the discrete characteristics of items to recommend additional items.

Which language, R or Python, is more suitable for text analytics?

Python includes the rich Pandas library, which gives analysts access to high-level data analysis tools and data structures. Because R lacks this feature, Python is more suitable for text analytics.

What is the difference between Data Analytics, Big Data, and Data Science?

Big Data: Big Data deals with huge data volumes in structured and semi-structured form and requires only basic knowledge of mathematics and statistics.
Data Analytics: Data Analytics provides operational insights into complex business scenarios.
Data Science: Data Science deals with slicing and dicing of data and requires deep knowledge of mathematics and statistics.

Which language is more suitable for text analytics? R or Python?

Python consists of a rich library called Pandas, which allows analysts to use high-level data analysis tools and data structures, while R lacks this feature. Hence Python is more suitable for text analytics.

What is a Recommender System?

A recommender system is today widely deployed in multiple fields such as movie recommendations, music preferences, social tags, research articles, search queries and so on. Recommender systems work through collaborative or content-based filtering, or by deploying a personality-based approach. This type of system builds a model from a person's past behavior in order to predict future behavior: the products people will buy, the movies they will view or the books they will read. It also creates a filtering approach using the discrete characteristics of items while recommending additional items.
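A toy collaborative-filtering sketch in Python (the user and item names are made up): score items the target user has not yet seen by how strongly each other user's history overlaps with the target user's.

```python
def recommend(history, user, top_n=2):
    """Recommend unseen items favored by users with overlapping histories."""
    liked = history[user]
    scores = {}
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(liked & items)  # shared taste between the two users
        if overlap == 0:
            continue
        for item in items - liked:    # only items the target user hasn't seen
            scores[item] = scores.get(item, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

If a user shares two movies with another viewer who also watched a third, that third movie is recommended first.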

Compare SAS, R and Python programming.

SAS: It is one of the most widely used analytics tools, adopted by some of the biggest companies in the world. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag, so smaller enterprises cannot readily adopt it.
R: The best part about R is that it is an open-source tool, and hence it is used generously by academia and the research community. It is a robust tool for statistical computation, graphical representation and reporting. Due to its open-source nature it is always being updated with the latest features, which are then readily available to everybody.
Python: Python is a powerful open-source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is its innumerable libraries and community-created modules, which make it very robust. It has functions for statistical operations, model building and more.

R and Python are two of the most important programming languages for machine learning algorithms.

Explain the various benefits of the R language.

The R programming language includes a set of software suites used for graphical representation, statistical computing, and data manipulation and calculation.

Some of the highlights of R programming environment include the following:

An extensive collection of tools for data analysis
Operators for performing calculations on matrices and arrays
Data analysis techniques for graphical representation
A highly developed yet simple and effective programming language
It extensively supports machine learning applications
It acts as a connecting link between various software, tools and datasets
It creates high-quality, reproducible analyses that are flexible and powerful
It provides a robust package ecosystem for diverse needs
It is useful when you have to solve a data-oriented problem

What are the two main components of the Hadoop Framework?

HDFS and YARN are the two major components of the Hadoop framework.

HDFS- Stands for Hadoop Distributed File System. It is the distributed file system that underpins Hadoop and is capable of storing and retrieving bulk datasets quickly.
YARN- Stands for Yet Another Resource Negotiator. It allocates resources dynamically and handles the workloads.

How do Data Scientists use Statistics?

Statistics helps Data Scientists to look into the data for patterns and hidden insights, and to convert Big Data into big insights. It helps to get a better idea of what the customers are expecting. Data Scientists can learn about consumer behavior, interest, engagement, retention and finally conversion, all through the power of insightful statistics. It helps them to build powerful data models and to validate certain inferences and predictions. All this can be converted into a powerful business proposition by giving users what they want precisely when they want it.

What is logistic regression?

It is a statistical technique, or model, used to analyze a dataset and predict a binary outcome: the outcome must be binary, i.e. zero or one, yes or no. (Random Forest, by contrast, is a technique used for classification, regression and other tasks on data.)
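A bare-bones sketch of logistic regression in pure Python, trained by stochastic gradient descent on the log-loss (a toy single-feature version; the names and data are illustrative):

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            # Gradient of the log-loss w.r.t. w and b is (p - y)*x and (p - y).
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    """Binary decision: class 1 when the predicted probability reaches 0.5."""
    return 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0
```

On a simple threshold-like dataset (labels 0 below some x, 1 above), the learned decision boundary separates the two classes.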

Why is data cleansing important in data analysis?

With data coming in from multiple sources, it is important to ensure that the data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing deals with detecting and correcting data records, ensuring that data is complete and accurate, and deleting or modifying the irrelevant components of data as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.
Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data can be prone to error due to human negligence, or corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of a Data Scientist's time and effort because of the multiple sources from which data emanates and the speed at which it arrives.

Describe univariate, bivariate and multivariate analysis.

As the names suggest, these are analysis methodologies involving a single variable, two variables or multiple variables.
A univariate analysis has one variable, so there are no relationships or causes to examine. Its major aspect is to summarize the data and find patterns within it in order to make actionable decisions.
A bivariate analysis deals with the relationship between two sets of data, where the paired data come from related sources or samples. Various tools can analyze such data, including chi-squared tests and t-tests when the data are correlated. If the data can be quantified, they can be analyzed using a graph plot or a scatterplot. A bivariate analysis tests the strength of the correlation between the two data sets.
A multivariate analysis extends this to three or more variables, examining how they jointly relate to the outcome.
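The strength of a bivariate correlation can be quantified with the Pearson coefficient, sketched here in pure Python (the function name is illustrative):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient (-1..1) for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # r = covariance(x, y) / (std(x) * std(y))
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Perfectly increasing paired data yields r = 1, perfectly decreasing data yields r = -1, and uncorrelated data lands near 0.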

How is machine learning deployed in real-world scenarios?

Here are some of the scenarios in which machine learning finds real-world applications:

E-commerce: Understanding customer churn, deploying targeted advertising, remarketing
Search engine: Ranking pages depending on the personal preferences of the searcher
Finance: Evaluating investment opportunities and risks, detecting fraudulent transactions
Healthcare: Designing drugs depending on the patient's history and needs
Robotics: Handling situations that are out of the ordinary
Social media: Understanding relationships and recommending connections
Information extraction: Framing questions for getting answers from databases over the web

What are the various aspects of a Machine Learning process?

The following components are involved in solving a problem using machine learning.

Domain knowledge

This is the first step, wherein we need to understand how to extract the various features from the data and learn more about the data we are dealing with. It has to do with the domain we are working in and with familiarizing the system with that domain.

Feature Selection

This step deals with choosing features from the set of features we have. Sometimes there are a great many features, and we have to make an intelligent decision about which ones to select to go ahead with our machine learning endeavor.
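One simple feature-selection heuristic is a variance threshold: drop near-constant columns, since they carry little signal. A pure-Python sketch (the function name and threshold are illustrative):

```python
def select_by_variance(features, threshold=0.0):
    """Return the indices of feature columns whose variance exceeds threshold."""
    kept = []
    for col in range(len(features[0])):
        vals = [row[col] for row in features]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        if var > threshold:
            kept.append(col)  # the column varies, so it may carry signal
    return kept
```

A constant column (variance 0) is dropped, while any varying column is retained for the modeling step.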

Algorithm

This is a vital step, since the algorithm we choose has a major impact on the entire machine learning process. We can choose between linear and nonlinear algorithms. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.

Training

This is the most important part of the machine learning technique, and this is where it differs from traditional programming. Training is done on the data we have, exposing the model to more real-world examples. With each subsequent training step the machine gets better and smarter and is able to make improved decisions.

Evaluation

In this step we evaluate the decisions made by the machine in order to decide whether they are up to the mark. There are various metrics involved in this process, and we have to apply each of them closely to judge the efficacy of the whole machine learning endeavor.

Optimization

This process involves improving the performance of the machine learning pipeline using various optimization techniques. Optimization is one of the most vital components, as it can vastly improve an algorithm's performance. The best part is that machine learning is not just a consumer of optimization techniques; it also provides new ideas for optimization.

Testing

Here various tests are carried out, some of them on unseen sets of test cases. The data is partitioned into test and training sets. There are various testing techniques, such as cross-validation, to deal with multiple situations.
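Cross-validation can be sketched as a k-fold index split, where each fold serves once as the test set (pure Python; the function name is illustrative):

```python
def kfold_indices(n, k):
    """Split n sample indices into k folds; each fold serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # Every other fold is pooled into the training set for this round.
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

With n = 6 and k = 3, each round trains on four samples and tests on the remaining two, so every sample is tested exactly once across the three rounds.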

https://mytecbooks.blogspot.com/2018/10/opentext-data-science-interview.html