**Data Science Interview Questions and Answers**

**Q.What do you mean by the term Data Science?**

**Ans:** Data Science is the extraction of knowledge from large volumes of structured or unstructured data. It is a continuation of the fields of data mining and predictive analytics, and is also known as knowledge discovery and data mining.

**Q.Explain the term botnet?**

**Ans:** A botnet is a network of compromised computers (bots), typically infected through malware such as a Trojan, that an attacker controls remotely, for example over an IRC channel.

**Q.What is Data Visualization?**

**Ans:** Data visualization is a common term that describes any effort to help people understand the significance of data by placing it in a visual context.

**Q.Why is data cleaning a critical part of the process?**

**Ans:** Cleaning up data to the point where you can work with it is a huge amount of work. If we are trying to reconcile many sources of data that we do not control, it can take as much as 80% of our time.

**Q.Point out 7 Ways how Data Scientists use Statistics?**

**Ans:**

1. Design and interpret experiments to inform product decisions.

2. Build models that predict signal, not noise.

3. Turn big data into the big picture.

4. Understand user retention, engagement, conversion, and leads.

5. Give your users what they want.

6. Estimate intelligently.

7. Tell the story with the data.

**Q.Differentiate between Data modeling and Database design?**

**Ans:** Data Modeling – Data modeling (or modeling) in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.

Database Design – Database design is the process of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system.

**Q.Describe in brief the data Science Process flowchart?**

**Ans:**

1. Data is collected from sensors in the environment.

2. Data is “cleaned” or otherwise processed to produce a data set (typically a data table) usable for analysis.

3. Exploratory data analysis and statistical modeling may be performed.

4. A data product is a program such as retailers use to inform new purchases based on purchase history. It may also create data and feed it back into the environment.

**Q.What do you understand by term hash table collisions?**

**Ans:** A hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function will assign each key to a unique bucket, but sometimes two keys generate an identical hash, causing both keys to point to the same bucket. This is known as a hash collision.
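
A minimal R sketch of a collision, assuming a deliberately weak, hypothetical hash function `toy_hash` (the sum of character codes modulo the number of buckets). Real hash tables use much stronger functions, but the effect is the same:

```r
# Hypothetical toy hash: sum of character codes modulo the bucket count.
toy_hash <- function(key, n_buckets = 8L) {
  sum(utf8ToInt(key)) %% n_buckets
}

# "ab" and "ba" contain the same characters, so they land in the same bucket.
bucket_ab <- toy_hash("ab")
bucket_ba <- toy_hash("ba")
collision <- bucket_ab == bucket_ba
print(collision)
```

A real implementation then has to resolve the collision, for example by keeping a list of entries per bucket.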

**Q.Compare and contrast R and SAS?**

**Ans:** SAS is commercial software whereas R is free, open-source software that anyone can download.

SAS is easy to learn and offers an easy option for people who already know SQL, whereas R requires more programming, so simple procedures can take longer code.

**Q.What do you understand by letter ‘R’?**

**Ans:** R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories.

**Q.What all things R environment includes?**

**Ans:**

1. A suite of operators for calculations on arrays, in particular matrices,

2. An effective data handling and storage facility,

3. A large, coherent, integrated collection of intermediate tools for data analysis,

4. Graphical facilities for data analysis and display either on-screen or on hardcopy, and

5. A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

**Q.What are the applied Machine Learning Process Steps?**

**Ans:**

1. Problem Definition: Understand and clearly describe the problem that is being solved.

2. Analyze Data: Understand the information available that will be used to develop a model.

3. Prepare Data: Define and expose the structure in the dataset.

4. Evaluate Algorithms: Develop robust test harness and baseline accuracy from which to improve and spot check algorithms.

5. Improve Results: Improve results to develop more accurate models.

6. Present Results: Describe the problem and solution so that they can be understood by third parties.

**Q.Compare Multivariate, Univariate and Bivariate analysis?**

**Ans:** MULTIVARIATE: Multivariate analysis focuses on the results of observations of many different variables for a number of objects.

UNIVARIATE: Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved.

BIVARIATE: Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.

**Q.What is Hypothesis in Machine Learning?**

**Ans:** The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be returned by it. It is typically defined by a hypothesis language, possibly in conjunction with a language bias.

**Q.Differentiate between Uniform and Skewed Distribution?**

**Ans:** UNIFORM DISTRIBUTION: A uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability over its range, so every value in that range is equally likely.

SKEWED DISTRIBUTION: In probability theory and statistics, Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated.

**Q.What do you understand by term Transformation in Data Acquisition?**

**Ans:** The transformation process allows you to consolidate, cleanse, and integrate data. We can semantically arrange the data from heterogeneous sources.

**Q.What do you understand by term Normal Distribution?**

**Ans:** It is a function which shows the distribution of many random variables as a symmetrical bell-shaped graph.

**Q.What is Data Acquisition?**

**Ans:** It is the process of measuring an electrical or physical phenomenon such as voltage, current, temperature, pressure, or sound with a computer. A DAQ system consists of sensors, DAQ measurement hardware, and a computer with programmable software.

**Q.What is Data Collection?**

**Ans:** Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

**Q.What do you understand by term Use case?**

**Ans:** A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. The use case consists of a set of possible sequences of interactions between systems and users in a particular environment and related to a defined particular goal.

**Q.What is Sampling and Sampling Distribution?**

**Ans:** SAMPLING: Sampling is the process of choosing units (e.g., people, organizations) from a population of interest so that by studying the sample we can fairly generalize our results back to the population from which they were chosen.

SAMPLING DISTRIBUTION: The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size.
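
The idea can be illustrated with a small simulation. This is only a sketch: the population, the sample size and the number of repetitions are all hypothetical choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical population: 10,000 values from an exponential distribution
population <- rexp(10000, rate = 1)

# Draw 2,000 random samples of size n = 30 and record each sample mean
n <- 30
sample_means <- replicate(2000, mean(sample(population, n)))

# The sample means cluster around the population mean, and their spread
# is close to sd(population) / sqrt(n)
c(mean(population), mean(sample_means))
c(sd(population) / sqrt(n), sd(sample_means))
```

The vector `sample_means` is an empirical approximation of the sampling distribution of the mean for samples of size 30; `hist(sample_means)` would show its shape.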

**Q.What is Linear Regression?**

**Ans:** In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted by X. The case of one explanatory variable is known as simple linear regression.

**Q.Differentiate between Extrapolation and Interpolation?**

**Ans:** Extrapolation is an estimate of a value based on extending a known sequence of values or facts beyond the range that is certainly known. Interpolation is an estimate of a value that lies between two known values in a list of values.
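
A short R sketch of the difference, using made-up points that lie on the line y = 2x. Base R's `approx()` does linear interpolation, and a fitted `lm()` model is one simple way to extrapolate:

```r
# Known points lying on the line y = 2 * x (illustrative data)
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Interpolation: estimate y at x = 2.5, between two known points
interpolated <- approx(x, y, xout = 2.5)$y

# Extrapolation: fit a line and predict at x = 6, outside the known range
fit <- lm(y ~ x)
extrapolated <- unname(predict(fit, newdata = data.frame(x = 6)))

# approx() returns NA outside the known range by default (rule = 1)
outside <- approx(x, y, xout = 6)$y
```

The extrapolated value is only as trustworthy as the assumption that the same linear trend continues beyond the observed data.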

**Q.How expected value is different from Mean value?**

**Ans:** There is no difference. These are two names for the same thing. They are mostly used in different contexts, though: we usually speak of the expected value of a random variable and of the mean of a sample, population or probability distribution.

**Q.Differentiate between Systematic and Cluster Sampling?**

**Ans:** SYSTEMATIC SAMPLING: Systematic sampling is a statistical method involving the selection of elements from an ordered sampling frame. The most common form of systematic sampling is an equal-probability method.

CLUSTER SAMPLING: A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

**Q.What are the advantages of Systematic Sampling?**

**Ans:**

1. Easier to perform in the field, especially if a proper frame is not available.

2. Regularly provides more information per unit cost than simple random sampling, in the sense of smaller variances.

**Q.What do you understand by term Threshold limit value?**

**Ans:** The threshold limit value (TLV) of a chemical substance is a level to which it is believed a worker can be exposed day after day for a working lifetime without adverse effects on his/her health.

**Q.Differentiate between Validation Set and Test set?**

**Ans:** Validation set: It is a set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.

**Q.How can R and Hadoop be used together?**

**Ans:** The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and use Map Reduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modeling exercises on a subset of prepared data in R.

**Q.What do you understand by term RIMPALA?**

**Ans:** RImpala-package contains the R functions required to connect, execute queries and retrieve back results from Impala. It uses the rJava package to create a JDBC connection to any of the impala servers running on a Hadoop Cluster.

**Q.What is Collaborative Filtering?**

**Ans:** Collaborative filtering (CF) is a method used by some recommender systems. The term has two senses, a narrow one and a more general one. In the general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, and data sources.

**Q.What are the challenges of Collaborative Filtering?**

**Ans:**

1. Scalability

2. Data sparsity

3. Synonyms

4. Grey sheep

5. Shilling attacks

6. Diversity and the Long Tail

**Q.What do you understand by Big data?**

**Ans:** Big data is a buzzword, or catch-phrase, which describes a volume of both structured and unstructured data so massive that it is difficult to process using traditional database and software techniques.

**Q.What do you understand by Matrix factorization?**

**Ans:** Matrix factorization is simply a mathematical tool for decomposing a matrix into a product of matrices, and is therefore applicable in many scenarios in which one wants to find structure hidden in the data.

**Q.What do you understand by term Singular Value Decomposition?**

**Ans:** In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It has many useful applications in signal processing and statistics.

**Q.What do you mean by Recommender systems?**

**Ans:** Recommender systems or recommendation systems (sometimes replacing “system” with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item.

**Q.What are the applications of Recommender Systems?**

**Ans:** Recommender systems have become extremely common in recent years, and are applied in a variety of applications. The most popular ones are probably movies, music, news, books, research articles, search queries, social tags, and products in general.

**Q.What are the two ways of Recommender System?**

**Ans:** Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user’s past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.

**Q.What are the factors to find the most accurate recommendation algorithms?**

**Ans:**

1. Diversity

2. Recommender Persistence

3. Privacy

4. User Demographics

5. Robustness

6. Serendipity

7. Trust

8. Labeling

**Q.What is K-Nearest Neighbor?**

**Ans:** k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

**Q.What is Horizontal Slicing?**

**Ans:** In horizontal slicing, projects are broken up roughly along architectural lines. That is, there would be one team for UI, one team for business logic and services (SOA), and another team for data.

**Q.What are the advantages of vertical slicing?**

**Ans:** The advantage of slicing vertically is you are more efficient. You don’t have the overhead, and effort that comes from trying to coordinate activities across multiple teams. No need to negotiate for resources. You’re all on the same team.

**Q.What is null hypothesis?**

**Ans:** In inferential statistics the null hypothesis usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no difference among groups.

**Q.What is Statistical hypothesis?**

**Ans:** In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test.

**Q.What is performance measure?**

**Ans:** Performance measurement is the method of collecting, analyzing and/or reporting information regarding the performance of an individual, group, organization, system or component.

**Q.What is the use of tree command?**

**Ans:** This command is used to list contents of directories in a tree-like format.

**Q.What is the use of uniq command?**

**Ans:** This command is used to report or omit repeated lines.

**Q.Which command is used to translate or delete characters?**

**Ans:** tr command is used to translate or delete characters.

**Q.What is the use of tapkee command?**

**Ans:** This command is used to reduce dimensionality of a data set using various algorithms.

**Q.Which command is used to sort the lines of text files?**

**Ans:** sort command is used to sort the lines of text files.

**Data Science Interview Questions and Answers in R Programming**

**Q.How can you merge two data frames in R language?**

**Ans:** Data frames in R can be combined column-wise using the cbind() function, or joined on common columns or row names using the merge() function.

**Q.Explain about data import in R language**

**Ans:** R Commander can be used to import data in R language. To start the R Commander GUI, the user must load the package by typing library(Rcmdr) into the console. There are 3 different ways in which data can be imported –

- Users can select the data set in the dialog box or enter the name of the data set (if they know).
- Data can also be entered directly using the editor of R Commander via Data->New Data Set. However, this works well when the data set is not too large.
- Data can also be imported from a URL or from a plain text file (ASCII), from any other statistical package or from the clipboard.

**Q.Two vectors X and Y are defined as follows – X <- c(3, 2, 4) and Y <- c(1, 2). What will be output of vector Z that is defined as Z <- X*Y.**

**Ans:** When two vectors have different lengths, R recycles the elements of the shorter vector until every element of the longer vector has been used. Here Y is reused as (1, 2, 1), and because the length of X is not a multiple of the length of Y, R also issues a warning.

The output of the above code will be –

Z is (3, 4, 4), that is 3*1, 2*2 and 4*1.

**Q.How are missing values and impossible values represented in R language?**

**Ans:** NaN (Not a Number) is used to represent impossible values whereas NA (Not Available) is used to represent missing values. The best way to answer this question would be to mention that simply deleting missing values is not a good idea, because the probable cause for a missing value could be some problem with data collection, programming or the query. It is better to find the root cause of the missing values and then take the necessary steps to handle them.
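
A small illustrative sketch of how the two kinds of value behave, using a made-up vector:

```r
x <- c(4, NA, 0 / 0, 7)   # 0 / 0 evaluates to NaN

is.na(x)    # TRUE for both NA and NaN
is.nan(x)   # TRUE only for NaN

# Most summary functions can skip both kinds of value with na.rm = TRUE
mean(x, na.rm = TRUE)
```

Note that `is.na()` treats NaN as missing as well, so `is.nan()` is needed to tell the two apart.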

**Q.R language has several packages for solving a particular problem. How do you make a decision on which one is the best to use?**

**Ans:** The CRAN package ecosystem has thousands of packages (more than 6000 when this was written). The best way for beginners to answer this question is to mention that they would look for a package that follows good software development principles. The next thing would be to look for user reviews and find out if other data scientists or analysts have been able to solve a similar problem.

**Q.Which function in R language is used to find out whether the means of 2 groups are equal to each other or not?**

**Ans:** t.test()
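
As an illustration, a two-sample t-test can be run on the `sleep` data set that ships with R. The choice of data set is only an assumption for the example:

```r
# Built-in sleep data: extra hours of sleep under two drugs (group 1 and 2)
result <- t.test(extra ~ group, data = sleep)

# A small p-value suggests the two group means differ
result$p.value
```

The returned object also holds the test statistic (`result$statistic`) and a confidence interval for the difference in means (`result$conf.int`).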

**Q.What is the best way to communicate the results of data analysis using R language?**

**Ans:** The best possible way to do this is to combine the data, code and analysis results in a single document using knitr for reproducible research. This helps others to verify the findings, add to them and engage in discussions. Reproducible research makes it easy to redo the experiments by inserting new data and to apply the analysis to a different problem.

**Q.How many data structures does R language have?**

**Ans:** R language has homogeneous and heterogeneous data structures. Homogeneous data structures hold objects of the same type – Vector, Matrix and Array. Heterogeneous data structures can hold objects of different types – Data frames and lists.

**Q.What is the value of f (2) for the following R code?**

**Ans: **

b <- 4
f <- function (a)
{
b <- 3
b^3 + g (a)
}
g <- function (a)
{
a*b
}

The answer to the above code snippet is 35. The value of “a” passed to the function is 2 and the value for “b” defined in the function f (a) is 3. So the output would be 3^3 + g (2). The function g is defined in the global environment and it takes the value of b as 4 (due to lexical scoping in R), not 3, returning a value 2*4 = 8 to the function f. The result will be 3^3 + 8 = 35.

**Q.What is the process to create a table in R language without using external files?**

**Ans: **

MyTable <- data.frame()

MyTable <- edit(MyTable)

The above code will open R's spreadsheet-like data editor for entering data into MyTable; the edited copy is assigned back to MyTable.

**Q.Explain about the significance of transpose in R language**

**Ans:** The transpose function t() swaps the rows and columns of a matrix or data frame, which makes it the easiest method for reshaping data before analysis.

**Q.What are with() and by() functions used for?**

**Ans:** The with() function evaluates an expression using the columns of a given dataset, and the by() function applies a function to each level of a factor or combination of factors.

**Q.dplyr package is used to speed up data frame management code. Which package can be integrated with dplyr for large fast tables?**

**Ans:** data.table

**Q.In base graphics system, which function is used to add elements to a plot?**

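**Ans:** text(). Related functions such as points(), lines() and legend() also add to an existing plot, whereas boxplot() normally draws a new plot.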

**Q.What are the different type of sorting algorithms available in R language?**

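**Ans:** The following are general sorting algorithms that can be implemented in R. The built-in sort() function itself offers the "radix", "quick" and "shell" methods.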

Bucket Sort

Selection Sort

Quick Sort

Bubble Sort

Merge Sort

**Q.What is the command used to store R objects in a file?**

**Ans:** save(x, file = "x.RData")

**Q.What is the best way to use Hadoop and R together for analysis?**

**Ans:** HDFS can be used for storing the data for long-term. MapReduce jobs submitted from either Oozie, Pig or Hive can be used to encode, improve and sample the data sets from HDFS into R. This helps to leverage complex analysis tasks on the subset of data prepared in R.

**Q.What will be the output of log (-5.8) when executed on R console?**

**Ans:** Executing the above on the R console will return NaN (Not a Number) along with a warning that NaNs were produced, because the logarithm of a negative number is not defined for real numbers.

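**Q.How is a Date object represented internally in R language?**

**Ans:** A Date object is stored internally as the number of days since 1970-01-01. For example, unclass(as.Date("2016-10-05")) returns 17079.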

**Q.What will be the output of the below code –**

printmessage <- function (a) {
if (is.na (a))
print ("a is a missing value!")
else if (a < 0)
print ("a is less than zero")
else
print ("a is greater than or equal to zero")
invisible (a)
}
printmessage (NA)

The output of the above R programming code will be "a is a missing value!". The function is.na() is used to check whether the input passed is a missing value.

**Q.Which package in R supports the exploratory analysis of genomic data?**

**Ans:** adegenet

**Q.What is the difference between data frame and a matrix in R?**

**Ans:** Data frame can contain heterogeneous inputs while a matrix cannot. In matrix only similar data types can be stored whereas in a data frame there can be different data types like characters, integers or other data frames.

**Q.How can you add datasets in R?**

**Ans:** The rbind() function can be used to add datasets in R language, provided the columns in the datasets are the same.

**Q.What are factor variables in R language?**

**Ans:** Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and particularly for statistical modelling where the correct number of degrees of freedom is assigned to them.

**Q.What is the memory limit in R?**

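**Ans:** On older Windows builds of R, the documented limit was 8TB of memory on a 64-bit system and about 3GB on a 32-bit system. In practice, the usable limit is set by the memory available on the machine.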

**Q.What are the data types in R on which binary operators can be applied?**

**Ans:** Scalars, Matrices and Vectors.

**Q.How do you create log linear models in R language?**

**Ans:** Using the loglm() function from the MASS package.

**Q.What will be the class of the resulting vector if you concatenate a number and NA?**

**Ans:** numeric

**Q.What is meant by K-nearest neighbour?**

**Ans:** K-Nearest Neighbour is one of the simplest machine learning classification algorithms that is a subset of supervised learning based on lazy learning. In this algorithm the function is approximated locally and any computations are deferred until classification.

**Q.What will be the class of the resulting vector if you concatenate a number and a character?**

**Ans:** character

**Q.If you want to know all the values in c (1, 3, 5, 7, 10) that are not in c (1, 5, 10, 12, 14). Which in-built function in R can be used to do this? Also, how this can be achieved without using the in-built function.**

**Ans:** Using in-built function – setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))

Without using in-built function – c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]

Both return 3 and 7.

**Q.How can you debug and test R programming code?**

**Ans:** R code can be tested using Hadley’s testthat package. For debugging, base R provides functions such as traceback(), debug() and browser().

**Q.Write a function in R language to replace the missing value in a vector with the mean of that vector.**

**Ans:** mean_impute <- function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE); x}

**Q.What happens if the application object is not able to handle an event?**

**Ans:** The event is dispatched to the delegate for processing.

**Q.Differentiate between lapply and sapply.**

**Ans:** Both functions apply a function to each element of a list or vector. lapply always returns a list, whereas sapply tries to simplify the result to a vector or matrix. There is one more function known as vapply, which is preferred over sapply in non-interactive code because it makes the programmer specify the output type and fails if a result does not match. The disadvantage of vapply is that it is more verbose.
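
A brief sketch of the three functions applied to the same made-up list:

```r
values <- list(a = 1:3, b = 4:6)

as_list   <- lapply(values, mean)               # always a list
as_vector <- sapply(values, mean)               # simplified to a named vector
checked   <- vapply(values, mean, numeric(1))   # errors unless each result is one number
```

Here `as_vector` and `checked` hold the same values; the difference is that `vapply()` would stop with an error if any element produced a result that is not a single number.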

**Q.Differentiate between seq (6) and seq_along (6)**

**Ans:** seq(6) produces the sequence from 1 to 6, c(1, 2, 3, 4, 5, 6). seq_along(6) produces a sequence as long as its argument; because 6 is a vector of length 1, the result is just 1.
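
The behaviour of the two functions can be checked directly in the console; the vectors below are only illustrative:

```r
seq(6)          # 1 2 3 4 5 6
seq_along(6)    # 1, because the vector c(6) has length 1

v <- c("a", "b", "c")
seq_along(v)    # 1 2 3, one index per element

empty <- character(0)
seq_along(empty)   # integer(0), so a for loop over it runs zero times
```

The last case is the usual reason to prefer `seq_along(v)` over `1:length(v)` in loops, since `1:0` would yield the values 1 and 0.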

**Q.How will you read a .csv file in R language?**

**Ans:** read.csv () function is used to read a .csv file in R language. Below is a simple example –

filecontent <- read.csv("sample.csv")
print(filecontent)

**Q.How do you write comments in R?**

**Ans:** A comment in R language begins with a hash symbol (#); everything after it on that line is ignored by the interpreter.

**Q.How can you verify if a given object “X” is a matrix data object?**

**Ans:** If the function call is.matrix(X ) returns TRUE then X can be termed as a matrix data object.

**Q.What do you understand by element recycling in R?**

**Ans:** If two vectors with different lengths perform an operation –the elements of the shorter vector will be re-used to complete the operation. This is referred to as element recycling.

Example – If vector A <- c(1,2,0,4) and vector B <- c(3,6), then the result of A*B will be (3,12,0,24). Here 3 and 6 of vector B are repeated when computing the result.
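
The same recycling example as runnable R code, with the repetition of the shorter vector written out explicitly for comparison:

```r
A <- c(1, 2, 0, 4)
B <- c(3, 6)

A * B          # B is recycled as 3, 6, 3, 6, giving 3 12 0 24

# Multiplying by the explicitly repeated vector gives the same result
identical(A * B, A * rep(B, length.out = length(A)))
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.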

**Q.How will you measure the probability of a binary response variable in R language?**

**Ans:** Logistic regression can be used for this and the function glm () in R language provides this functionality.
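
A minimal sketch using the built-in `mtcars` data; the choice of `am` as the binary response and `wt` as the predictor is only an assumption for illustration:

```r
# Built-in mtcars data: am is 0 (automatic) or 1 (manual), wt is weight
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a car with wt = 3
prob <- predict(model, newdata = data.frame(wt = 3), type = "response")
prob
```

Setting `type = "response"` returns a probability between 0 and 1 rather than a value on the log-odds scale.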

**Q.What is the use of sample and subset functions in R programming language?**

**Ans:** The sample() function can be used to select a random sample of size ‘n’ from a huge dataset.

The subset() function is used to select variables and observations from a given dataset.

**Q.There is a function fn <- function(a, b, c, d, e) a + b * c - d / e. Write the code to call fn on the vector c(1,2,3,4,5) such that the output is same as fn(1,2,3,4,5).**

**Ans:** do.call (fn, as.list(c (1, 2, 3, 4, 5)))
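
A runnable version of this answer, with `fn` defined as in the question and `arg_vec` as a hypothetical name for the input vector:

```r
fn <- function(a, b, c, d, e) a + b * c - d / e

arg_vec <- c(1, 2, 3, 4, 5)

direct   <- fn(1, 2, 3, 4, 5)
via_call <- do.call(fn, as.list(arg_vec))

identical(direct, via_call)
```

`as.list()` turns the vector into a list of five separate arguments, which `do.call()` then passes to `fn` positionally.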

**Q.How can you resample statistical tests in R language?**

**Ans:** The coin package in R provides various options for re-randomization and permutation-based statistical tests. When test assumptions cannot be met, this package serves as an alternative to classical methods, as it does not assume random sampling from well-defined populations.

**Q.What is the purpose of using Next statement in R language?**

**Ans:** If a developer wants to skip the current iteration of a loop without terminating the loop, they can use the next statement. Whenever the R interpreter comes across the next statement, it skips the rest of the loop body and jumps to the next iteration of the loop.
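
A short illustrative loop that uses `next` to skip even numbers:

```r
odd_numbers <- integer(0)
for (i in 1:6) {
  if (i %% 2 == 0) next   # skip even values, continue with the next i
  odd_numbers <- c(odd_numbers, i)
}
odd_numbers
```

The loop still runs six times; `next` only skips the remaining statements for the even values of `i`.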

**Q.How will you create scatterplot matrices in R language?**

**Ans:** A matrix of scatterplots can be produced using the pairs() function, which takes various parameters like formula, data, subset, labels, etc.

The two key parameters required to build a scatterplot matrix are –

- formula- A formula basically like ~a+b+c . Each term gives a separate variable in the pairs plots where the terms should be numerical vectors. It basically represents the series of variables used in pairs.
- data- It basically represents the dataset from which the variables have to be taken for building a scatterplot.

**Q.How will you check if an element 25 is present in a vector?**

**Ans:** There are various ways to do this-

- It can be done using the match() function, which returns the position of the first appearance of a particular element (or NA if it is absent).
- The other is to use %in%, which returns a Boolean value, either TRUE or FALSE.
- The is.element() function also returns TRUE or FALSE based on whether the element is present in the vector or not.
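
The options above can be compared on a small made-up vector:

```r
v <- c(10, 25, 40, 25)

match(25, v)        # position of the first match, NA if absent
25 %in% v           # TRUE or FALSE
is.element(25, v)   # same result as %in%
any(v == 25)        # another common idiom
```

`match()` is the one to use when the position matters; the other three only report presence.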

**Q.What is the difference between library() and require() functions in R language?**

**Ans:** Both functions load a package; they differ in what happens when the package cannot be found. library() stops with an error, whereas require() issues a warning and returns FALSE (and TRUE on success), which is why require() is sometimes used inside functions to check whether a package is available.

**Q.What are the rules to define a variable name in R programming language?**

**Ans:** A variable name in R programming language can contain letters and digits along with the dot (.) and underscore (_) characters. Variable names can begin with a letter or a dot. However, if the variable name begins with a dot, the dot must not be followed by a digit.

**Q.What do you understand by a workspace in R programming language?**

**Ans:** The current R working environment of a user that has user defined objects like lists, vectors, etc. is referred to as Workspace in R language.

**Q.Which function helps you perform sorting in R language?**

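**Ans:** sort() returns the sorted values, while order() returns the index positions that would sort the data and is the usual choice for sorting data frames.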

**Q.How will you list all the data sets available in all R packages?**

**Ans:** Using the below line of code-

data(package = .packages(all.available = TRUE))

**Q. Which function is used to create a histogram visualisation in R programming language?**

**Ans:** hist()

**Q. Write the syntax to set the path for current working directory in R environment.**

**Ans:** setwd("dir_path")

**Q.How will you drop variables using indices in a data frame?**

**Ans:** Let’s take a dataframe df<-data.frame(v1=c(1:5),v2=c(2:6),v3=c(3:7),v4=c(4:8))
df
## v1 v2 v3 v4
## 1 1 2 3 4
## 2 2 3 4 5
## 3 3 4 5 6
## 4 4 5 6 7
## 5 5 6 7 8
Suppose we want to drop variables v2 & v3. They can be dropped using negative indices as follows-
df1<-df[-c(2,3)]
df1
## v1 v4
## 1 1 4
## 2 2 5
## 3 3 6
## 4 4 7
## 5 5 8

**Q.What will be the output of runif (7)?**

**Ans:** It will generate 7 random numbers between 0 and 1, drawn from a uniform distribution.

**Q.What is the difference between rnorm and runif functions ?**

**Ans:** rnorm function generates “n” normal random numbers based on the mean and standard deviation arguments passed to the function.

**Syntax of rnorm function –**

rnorm(n, mean = , sd = )

runif function generates “n” uniform random numbers in the interval between the minimum and maximum values passed to the function.

**Syntax of runif function –**

runif(n, min = , max = )

**Q.What will be the output on executing the following R programming code –**

mat<-matrix(rep(c(TRUE,FALSE),8),nrow=4)
sum(mat)

**Ans:** 8. The matrix holds 16 logical values, 8 of which are TRUE, and sum() counts each TRUE as 1.

**Q.How will you combine multiple different strings like "Data", "Science", "in", "R", "Programming" as a single string "Data_Science_in_R_Programming"?**

**Ans:** paste("Data", "Science", "in", "R", "Programming", sep="_")

**Q.Write a function to extract the first name from the string “Mr. Tom White”.**

**Ans:** substr("Mr. Tom White", start=5, stop=7)

**Q.Can you tell if the equation given below is linear or not ?**

**Emp_sal = 2000 + 2.5(emp_age)^2**

**Ans:** Yes, it is a linear equation in the regression sense, because it is linear in its coefficients, even though emp_age enters as a squared term.

**Q.What will be the output of the following R programming code ?**

**var2<- c("I","Love,"DeZyre")**

**var2**

**Ans:** It will give an error, because the closing quote after Love is missing.

**Q.What will be the output of the following R programming code?**

**x<-5**

**if(x%%2==0)**

**print("X is an even number")**

**else**

**print("X is an odd number")**

**Ans:** Executing the above code will result in an error as shown below –

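## Error: :4:1: unexpected 'else'

## 3: print("X is an even number")

## 4: else

## ^

When this code is run at the top level (outside braces or a function body), R treats the if statement as complete as soon as the line with the first print() ends. The else on the next line therefore starts a new statement, which is not valid on its own. Placing else on the same line as the end of the if branch, or wrapping the whole construct in braces, avoids the error.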
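
A corrected version of the snippet might look like the sketch below. Storing the message in a variable named result is an addition for illustration, not part of the original question.

```r
x <- 5

# Keeping "else" on the same line as the closing brace tells the parser
# that the if statement is not finished yet
if (x %% 2 == 0) {
  result <- "X is an even number"
} else {
  result <- "X is an odd number"
}
print(result)
```
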

**Q.What is R Base package?**

**Ans**: The R base package is the package that is loaded by default whenever the R programming environment is started. It provides the basic functionality of the R environment, such as arithmetic calculations and input/output.

**Q. How will you merge two dataframes in R programming language?**

**Ans:** The merge() function combines two data frames by matching rows on one or more common columns. By default it returns only the rows whose key values appear in both data frames, that is, the intersection of the two data sets.

The most commonly used arguments of merge() are listed below.

Syntax for using the merge function in R language –

merge(x, y, by.x = , by.y = , all.x = , all.y = , all = )

- x represents the first data frame.
- y represents the second data frame.
- by.x is the name of the column in x that is used for matching against y.
- by.y is the name of the column in y that is used for matching against x.
- all.x is a logical value that specifies the type of merge. Set all.x to TRUE to keep all the observations from x. This results in a left join.
- all.y is a logical value that specifies the type of merge. Set all.y to TRUE to keep all the observations from y. This results in a right join.
- all defaults to FALSE, which means that only matching rows are returned, resulting in an inner join. Set it to TRUE to keep all the observations from both x and y, resulting in an outer join.
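
The four join types can be compared on a small example. The data frames and column names below (employees, salaries, emp_id) are hypothetical and chosen only to show which rows survive each merge.

```r
# Hypothetical data: employee 3 has no salary row, employee 4 has no name row
employees <- data.frame(emp_id = c(1, 2, 3), name = c("Ann", "Bob", "Cid"))
salaries  <- data.frame(emp_id = c(1, 2, 4), salary = c(100, 200, 400))

inner <- merge(employees, salaries, by = "emp_id")                # matching rows only
left  <- merge(employees, salaries, by = "emp_id", all.x = TRUE)  # all rows of employees
right <- merge(employees, salaries, by = "emp_id", all.y = TRUE)  # all rows of salaries
outer <- merge(employees, salaries, by = "emp_id", all = TRUE)    # all rows of both

nrow(inner)  # 2
nrow(left)   # 3
nrow(right)  # 3
nrow(outer)  # 4
```

Rows without a match on the other side are filled with NA, for example the salary of employee 3 in the left join.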

**Q.Write the R programming code for an array of words so that the output is displayed in decreasing frequency order.**

**Ans:** R programming code to display the three most frequent words in decreasing order of frequency:

tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing=TRUE)
depth <- 3
tt[1:depth]
Output:
## a a1 b
## 3 3 2

**Q.How to check the frequency distribution of a categorical variable?**

**Ans:** The frequency distribution of a categorical variable can be checked using the table() function in R, which counts the number of observations in each category.

gender = factor(c("M","F","M","F","F","F"))

table(gender)

**Output of the above R code:**

gender

F M

4 2

Programmers can also calculate the percentage of values in each category by storing the output in a data frame and dividing each count by the total, as shown below:

t = data.frame(table(gender))

t$percent= round(t$Freq / sum(t$Freq)*100,2)

| Gender | Frequency | Percent |
|---|---|---|
| F | 4 | 66.67 |
| M | 2 | 33.33 |
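
As an alternative sketch, base R's prop.table() gives the same percentages directly from the frequency table, without building a data frame first:

```r
gender <- factor(c("M", "F", "M", "F", "F", "F"))

counts <- table(gender)

# prop.table() divides each count by the total, so the proportions sum to 1
percent <- round(prop.table(counts) * 100, 2)
percent
```
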

**Q.What is the procedure to check the cumulative frequency distribution of any categorical variable?**

**Ans:** The cumulative frequency distribution of a categorical variable can be checked by applying the cumsum() function to its frequency table.

**Example –**

gender = factor(c("f","m","m","f","m","f"))

y = table(gender)

cumsum(y)

**Output of the above R code:**

f m

3 6

**Q.What will be the result of multiplying two vectors in R having different lengths?**

**Ans:** The multiplication is still performed, but R issues a warning: "longer object length is not a multiple of shorter object length". Suppose there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3). Then a*b returns 2 6 6 along with the warning. R multiplies the vectors element by element and recycles the shorter vector when it runs out of elements, so the first element of b (2) is reused and multiplied with the last element of a (3).
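
The recycling behaviour can be checked with a few lines. suppressWarnings() is used in this sketch only to keep the console output clean; without it the warning quoted above is printed.

```r
a <- c(1, 2, 3)
b <- c(2, 3)

# b is recycled to c(2, 3, 2), so the products are 1*2, 2*3 and 3*2
product <- suppressWarnings(a * b)
product

# When the longer length is a multiple of the shorter one, no warning is raised
even_product <- c(1, 2, 3, 4) * c(2, 3)
even_product
```
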