Adv. Data Science Using R | Assignment – 3

Quiz – 3

Q1. Which of the following languages is used in data science?
R
C
C++
Ruby

Q2. What is the primary file type of R?
Vector
Text file
RScripts
Statistical file

Q3. Which one of the following R packages is used for data products?
haven
igraph
slidify
forecast

Q4. Which of the following is valid for checking a categorical variable?
Level
Table
Unique
All of the above

Q5. Suppose ABC is a matrix with 3 rows and 4 columns. Choose the correct option(s) to rename the rows:
row_names(ABC) = c("row1", "row2", "row3")
rownames(ABC) = c("row1", "row2")
row(ABC) = c("row1", "row2")
rownames(ABC) = c("row", "row2", "row3")

Q6. Arrange the data types in the proper order:
Logical, integer, numeric, character
Integer, numeric, character, logical
Character, logical, integer, numeric
Numeric, integer, character, logical

Q7. What is the output of the code below?
A=10
B=20
print(A,B)
10 20
Error
(10, 20)
None of the above

Q8. A return statement is compulsory when writing a function in R.
True
False

Q9. By default, the last value evaluated in a function is its return value in R.
True
False

Q10. Which package needs to be installed for reading?
Read_excel
Readxl
Readcsv
read_csv

Q11. What is the output of the code below?
logic1 = c(T, F, F, T, F, T)
print(which(logic1))
1 4 6
2 3 6
6 4 1
1 2 3

Q12. If A = c(1, 13, 42, 13, 4), what is A after A = A[-4]?
1, 13, 42, 4
1, 13, 42, 13
13
1, 42, 13, 4

Q13. Which function can be used to split the string?
The output will be: "Navin" "Mr. Naresh J"
strsplit(name, "[.]")
charsplit(name, "[,]")
stringsplit(name)
strsplit(name, "[,]")

Q14. Given i = 100, how do you find the data type of i?
Option 1
type(i)
class(i)
none of the above

Q15. dt = "01-12-2020" is stored as a character string. Which option converts it into a date in the form "MM-DD-YYYY"?
To_date(dt, "MM - DD - YYYY")
date(x = dt, format = "%m / %d / %Y")
Date(x = dt, format = "%m / %d / %Y")
none of the above

Assignment – 3

1. What is the KNN algorithm? What are its features? How does it work? Write pseudocode for the KNN algorithm and give a practical implementation in R.

The K-Nearest Neighbors (KNN) algorithm is a non-parametric, instance-based method for classification and regression. It is a supervised learning algorithm that stores all available cases and classifies new cases based on a similarity measure, such as Euclidean distance.

Features of KNN Algorithm:

Simple to understand and implement
No assumptions about the distribution of the data
Can be used for both classification and regression problems

The algorithm works by taking a new data point and finding the k closest points in the training set. For classification, the new point is assigned the majority class among those k nearest neighbors; for regression, the prediction is typically the average of their values.

Pseudocode for the KNN algorithm:

1. Initialize the number of nearest neighbors (k)
2. For each point in the dataset:
   a. Calculate the distance between the point and the new data point
   b. Add the distance and the point to a list
3. Sort the list by distance
4. Take the first k elements from the sorted list
5. Determine the majority class among the k elements
6. Classify the new data point as the majority class
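The pseudocode above can be sketched directly in base R. This is a minimal illustration only, not how the class package implements knn(); the function name knn_predict and the toy data are hypothetical, and Euclidean distance is assumed as the similarity measure.

```r
# Minimal KNN classifier in base R (illustrative sketch).
# knn_predict, train_x, train_y and new_point are hypothetical names.
knn_predict <- function(train_x, train_y, new_point, k = 3) {
  # Euclidean distance from every training point to the new point
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from each row
  distances <- sqrt(rowSums(diffs^2))

  # Sort by distance and keep the indices of the k nearest neighbors
  nearest <- order(distances)[seq_len(k)]

  # Majority vote among the neighbors' labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy data: two well-separated clusters labelled A and B
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(8, 8), c(8, 9), c(9, 8))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)
knn_predict(train_x, train_y, c(8.5, 8.5), k = 3)
```

Ties in the vote are resolved here by taking the first class in factor-level order, which is a simplification; an odd k avoids ties in two-class problems.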

Practical Implementation of KNN Algorithm in R:

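# Load the library that provides knn()
library(class)

# Create a sample data set: 50 points, 2 features, 2 classes
set.seed(42)
x <- cbind(rnorm(50), rnorm(50))
y <- gl(2, 25, labels = c("A", "B"))

# knn() has no separate fitting step: it returns the predicted classes
# of the test points directly. Here the training points are classified with k = 3.
train_pred <- knn(train = x, test = x, cl = y, k = 3)

# Predict the class of new data points
newdata <- rbind(c(1, 2), c(3, 4))
predicted_class <- knn(train = x, test = newdata, cl = y, k = 3)
predicted_class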

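This code creates a sample dataset of 50 points with two features and two classes (A and B). It then uses knn() with k = 3 to predict the class of two new data points. Note that knn() from the class package returns the predicted labels directly; it does not produce a model object that can be passed to predict().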

2. Develop a machine learning model using SVM in R to solve a business problem. Add screenshots of the graphs and code to validate your answer.

Applying SVM to solve a business use case

The data source is https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

Read the data and check the structure of both train and test

library(lubridate)
library(caret)
library(dplyr)
library(DMwR)
library(ROSE)
library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(data.table)
library(e1071)
library(gridExtra)

train <- fread('../input/train_sample.csv', stringsAsFactors = FALSE,
               data.table = FALSE)
test <- fread('../input/test.csv', stringsAsFactors = FALSE,
              data.table = FALSE)
str(train)

str(test)

The train and test data have the same structure, with two exceptions: the target (is_attributed) has to be predicted for the test data, and attributed_time (the time of the app download) is not given in the test data.

Missing value checking and estimation

colSums(is.na(train))

There are no missing (NA) values; the data is clean.

colSums(train=='')

attributed_time has blank entries, which is logically correct: the field is empty when the application was not downloaded.
Let's check the target variable to see how many applications were not downloaded in the train data.

table(train$is_attributed)

This confirms the assumption: the number of blank entries in attributed_time matches the number of applications not downloaded in the train data.
Since the blanks are logically correct, no further action is needed on them.
This variable is also absent from the test data, so there is no point in keeping it in the train data.

train$attributed_time=NULL
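The SVM code further down refers to smote_train and test_val, but this write-up does not show how they were built. The sketch below is one possible way to obtain objects with those names, under stated assumptions: the data frame here is synthetic, the 70/30 split is an arbitrary choice, and plain random oversampling of the minority class is used as a simple stand-in for SMOTE (which the name smote_train suggests was used originally).

```r
# Hypothetical stand-in for the cleaned `train` data frame (synthetic, imbalanced)
set.seed(1234)
n <- 1000
train <- data.frame(
  app = sample(1:20, n, replace = TRUE),
  channel = sample(100:120, n, replace = TRUE),
  is_attributed = rbinom(n, 1, 0.05)
)

# SVM classification and confusionMatrix() both expect a factor target
train$is_attributed <- factor(train$is_attributed, levels = c(0, 1))

# Hold out 30% of the rows for validation (assumed split ratio)
val_idx <- sample(nrow(train), size = round(0.3 * nrow(train)))
test_val <- train[val_idx, ]
train_part <- train[-val_idx, ]

# Random oversampling of the minority class: NOT SMOTE, only a simple substitute
majority <- train_part[train_part$is_attributed == "0", ]
minority <- train_part[train_part$is_attributed == "1", ]
extra <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
smote_train <- rbind(majority, extra)  # name kept only to match the later code

table(smote_train$is_attributed)
```

Only the training part is rebalanced; the validation set keeps its original class proportions so that the evaluation reflects the real imbalance. A stratified split, for example with createDataPartition() from caret, would be a reasonable refinement.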

Applying the SVM on the data
Linear Support Vector Machine

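Before fitting the model, let's tune the cost parameter. The code below uses smote_train (a class-balanced training set) and test_val (a held-out validation set); the code that created them is not included in this write-up.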

set.seed(1234)
# Try several cost values for the linear kernel
liner.tune = tune.svm(is_attributed ~ ., data = smote_train, kernel = "linear",
                      cost = c(0.1, 0.5, 1, 5, 10, 50))
liner.tune

tune.svm() reports the best cost value for the linear kernel. It selects it by k-fold cross-validation (10 folds by default in e1071).
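As a small self-contained illustration of what tune.svm() returns, the sketch below runs the same kind of cost search on a two-class subset of the built-in iris data. The dataset and the shorter cost grid are stand-ins chosen for speed; they are not part of the original analysis.

```r
library(e1071)

# Two-class subset of iris, used here only as a stand-in dataset
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(1234)
tuned <- tune.svm(Species ~ ., data = iris2, kernel = "linear",
                  cost = c(0.1, 1, 10))

tuned$best.parameters    # cost value with the lowest cross-validated error
tuned$best.performance   # the corresponding error estimate
summary(tuned)           # error for every cost value that was tried
```

The model refitted with the winning cost is available as tuned$best.model, which is the same element the linear and radial sections extract from their own tuning results.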

Let's see how the linear model performs. First, extract the best model from the tuning result:

best.linear = liner.tune$best.model

# Predict on the validation data (predict() for svm objects returns class labels by default)
best.test = predict(best.linear, newdata = test_val)
confusionMatrix(best.test, test_val$is_attributed)

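The Kernel Trick – Radial Support Vector Machine

set.seed(1234)
# Tune gamma for the radial kernel; seq(0.1, 5) steps by 1, giving 0.1, 1.1, 2.1, 3.1, 4.1
rd.poly = tune.svm(is_attributed ~ ., data = smote_train, kernel = "radial",
                   gamma = seq(0.1, 5))
summary(rd.poly)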

Let's predict on the validation data with the best radial model:

best.rd=rd.poly$best.model
pre.rd=predict(best.rd,newdata = test_val)
confusionMatrix(pre.rd,test_val$is_attributed)
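To make the confusionMatrix() output easier to read, the sketch below computes accuracy, precision, and recall by hand from a small confusion table. The label vectors are hypothetical values made up for illustration; they are not results from the models above.

```r
# Hypothetical actual and predicted labels (1 = app downloaded)
actual    <- factor(c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["1", "1"]  # predicted 1, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

c(accuracy = accuracy, precision = precision, recall = recall)
```

On heavily imbalanced data, accuracy alone can look good even when few downloads are detected, so precision and recall for the download class are worth checking. By default, confusionMatrix() in caret treats the first factor level as the positive class, so passing positive = "1" may be needed to get these values for the download class.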

3. Write down the step-by-step procedure for Naive Bayes classification in R.

The steps for Naive Bayes classification in R are as follows:

1. Load the necessary libraries, such as the "e1071" library for the Naive Bayes classifier.
2. Prepare the data for the model. This includes splitting the data into a training and test set, and converting any categorical variables into factors.
3. Train the model by fitting the training data to the Naive Bayes classifier.
4. Use the trained model to predict the class of the test data.
5. Evaluate the performance of the model by comparing the predicted class to the actual class of the test data. This can be done using metrics such as accuracy, precision, and recall.
6. Repeat steps 3-5 for different input data and/or different model parameters to find the best model for the given data.
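The steps above can be sketched as follows. This is a minimal example under stated assumptions: the built-in iris dataset stands in for real data, the 70/30 split is an arbitrary choice, and accuracy is the only metric computed.

```r
# Load the library that provides naiveBayes()
library(e1071)

# Prepare the data: iris is a stand-in, and Species is already a factor
set.seed(42)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Train the Naive Bayes classifier on the training set
nb_model <- naiveBayes(Species ~ ., data = train_set)

# Predict the class of the test data
nb_pred <- predict(nb_model, newdata = test_set)

# Evaluate: confusion table and overall accuracy
cm <- table(Predicted = nb_pred, Actual = test_set$Species)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

For the final step, one parameter that can be varied is the laplace argument of naiveBayes(), which applies Laplace smoothing to categorical predictors.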
