Saturday, March 30, 2019
Medical Data Analytics Using R
The dataset contains 5 attributes: 1.) R for Recency = months since the last donation, 2.) F for Frequency = total number of donations, 3.) M for Monetary = total amount of blood donated in c.c., 4.) T for Time = months since the first donation, and 5.) a binary variable = 1 if the customer donated blood, 0 if the customer did not donate blood.

The main idea behind this dataset is the concept of customer relationship management (CRM). Based on three metrics, Recency, Frequency and Monetary (RFM), which are 3 of the 5 attributes of the dataset, we would be able to predict whether a customer is likely to donate blood again in response to a marketing campaign. For example, customers who have donated or visited more recently (Recency), more often (Frequency) or who gave larger amounts (Monetary) are more likely to respond to a marketing effort. Customers with a lower RFM score are less likely to react. It is also known from customer behaviour that the time of the first positive interaction (donation, purchase) is not significant; however, the recency of the last donation is very important. In the traditional RFM implementation, each customer is ranked on his RFM values against all the other customers, and this produces a score for every customer. Customers with higher scores are more likely to react in a positive way (for example, visit again or donate). The model constructs a formula which addresses the following problem: keep only the customers who are more likely to continue donating in the future and remove those who are less likely to donate, given a certain period of time.
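The traditional RFM ranking just described can be sketched in a few lines of R. The toy data frame, the fixed 250 c.c. per donation and the tercile helper below are illustrative assumptions, not part of the original analysis:

```r
# Illustrative sketch of RFM ranking (toy data, not the real dataset).
set.seed(1)
customers <- data.frame(
  recency   = sample(1:40, 8),  # months since last donation (lower is better)
  frequency = sample(1:30, 8)   # number of donations (higher is better)
)
customers$monetary <- customers$frequency * 250  # assumed fixed 250 c.c. per donation

# Rank every customer against all the others and bin the ranks into terciles (1-3)
tercile <- function(x) cut(rank(x), breaks = 3, labels = FALSE)

customers$r_score  <- tercile(-customers$recency)   # recent donors score high
customers$f_score  <- tercile(customers$frequency)
customers$rf_score <- customers$r_score + customers$f_score  # ranges from 2 to 6

# customers with a higher rf_score are the more promising marketing targets
customers[order(-customers$rf_score), ]
```

Customers at the top of this ordering are the ones an RFM-driven campaign would contact first.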
The previous statement also defines the problem which will be examined and tested in this project. First, I created a .csv file and generated 748 unique random numbers in Excel, in the range 1-748, in the first column; these correspond to the customer (user) IDs. Then I transferred all the data from the .txt file (transfusion.data) to the .csv file in Excel using the delimited (,) option. I then randomly split it into a train file and a test file. The train file contains 530 instances and the test file 218 instances. Afterwards, I read both the training dataset and the testing dataset. From the previous results, we can see that there are no missing or invalid values, and the data ranges and units are reasonable. Figure 1 above depicts boxplots of all the attributes for both the train and test datasets. By examining the figure, we notice that both datasets have similar distributions and that some outliers (Monetary > 2,500) are visible. The volume-of-blood variable has a high correlation with Frequency: because the volume of blood donated each time is fixed, the Monetary value is proportional to the Frequency (number of donations) of each person. For example, if the amount of blood drawn from each person was 250 ml/bag (Taiwan Blood Services Foundation 2007), then Monetary = 250 * Frequency. This is also why the predictive model will not consider the Monetary attribute in the implementation. So it is reasonable to expect that customers with a higher Frequency will have a much higher Monetary value, which can also be verified visually by examining the Monetary outliers for the train set.
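The preprocessing step above can be sketched in a self-contained way. Since the Excel step cannot be reproduced here, a synthetic data frame with the same five columns and 748 rows stands in for transfusion.data (an assumption for illustration only):

```r
# Synthetic stand-in for transfusion.data: 748 rows, same columns as the UCI file.
set.seed(42)
raw <- data.frame(
  Recency   = sample(0:40, 748, replace = TRUE),  # months since last donation
  Frequency = sample(1:30, 748, replace = TRUE),  # number of donations
  Time      = sample(2:98, 748, replace = TRUE),  # months since first donation
  Donated   = rbinom(748, 1, 0.24)                # 1 = donated in March 2007
)
raw$Monetary <- raw$Frequency * 250               # total c.c., fixed volume per donation

# 748 unique random numbers in the first column as customer IDs
raw <- cbind(ID = sample(748), raw)

# random split: 530 training instances, the remaining 218 for testing
train_idx <- sample(748, 530)
traindata <- raw[train_idx, ]
testdata  <- raw[-train_idx, ]

# write.csv(traindata, "transfusion.csv", row.names = FALSE)
# write.csv(testdata,  "test.csv",        row.names = FALSE)
```

The two write.csv calls (commented out) would produce the train and test files read by the script in the Appendix.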
We retrieve 83 instances. In order to better understand the statistical dispersion of the whole dataset (748 instances), we look at the standard deviation (SD) between Recency and the binary variable (whether the customer has donated blood), and the SD between Frequency and the binary variable. The dispersion of values around the mean is small, which means the data is concentrated; this can also be noticed from the plots. From this correlation matrix, we can verify what was stated above: the Frequency and Monetary values are proportional inputs, as can be seen from their high correlation. Another observation is that the various Recency numbers are not multiples of 3. This contradicts what the description says about the data being collected every 3 months. Additionally, there is always a maximum number of times one can donate blood per given period (e.g. once per month), but the data shows that 36 customers donated blood more than once, and 6 customers donated 3 or more times, in the same month. The features that will be used to predict whether a customer is likely to donate again are 2, Recency and Frequency (RF); the Monetary feature will be dropped. The number of categories for the R and F attributes will be 3. The highest RF score will be 33, equivalent to 6 when the two digits are added together, and the lowest will be 11, equivalent to 2. The threshold on the added score, used to determine whether a customer is likely to donate blood again or not, will be set to 4, which is the median value. The users are assigned to categories by sorting on the RF attributes as well as their scores. The file with the donators will be sorted on Recency first (in ascending order), because we want to see which customers have donated blood most recently.
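The dispersion and correlation checks described above can be reproduced with base R's sd() and cor(); the synthetic donors frame below is a stand-in for the real data, so the printed numbers are illustrative:

```r
# Synthetic stand-in data with the dataset's column names.
set.seed(7)
donors <- data.frame(
  Recency   = sample(0:40, 748, replace = TRUE),
  Frequency = sample(1:30, 748, replace = TRUE),
  Donated   = rbinom(748, 1, 0.24)
)
donors$Monetary <- donors$Frequency * 250  # fixed volume per donation

# dispersion of Recency and Frequency within the donated / not-donated groups
tapply(donors$Recency,   donors$Donated, sd)
tapply(donors$Frequency, donors$Donated, sd)

# correlation matrix: Monetary is an exact multiple of Frequency,
# so their correlation is exactly 1 and Monetary adds no information
round(cor(donors[, c("Recency", "Frequency", "Monetary")]), 3)
```

The perfect Frequency-Monetary correlation is exactly why the Monetary attribute is dropped from the model.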
Then it will be sorted on Frequency (in descending order this time, because we want to see which customers have donated the most times) within each Recency category. Apart from sorting, we need to apply some business rules that emerged after multiple tests.

For Recency (Business rule 1):
If the Recency in months is less than 15, the customer is assigned to category 3.
If the Recency is equal to or greater than 15 months and less than 26 months, the customer is assigned to category 2.
Otherwise, if the Recency is equal to or greater than 26 months, the customer is assigned to category 1.

And for Frequency (Business rule 2):
If the Frequency is equal to or greater than 25 times, the customer is assigned to category 3.
If the Frequency is less than 25 times and greater than 15 times, the customer is assigned to category 2.
If the Frequency is equal to or less than 15 times, the customer is assigned to category 1.

RESULTS

The output of the program is two smaller files, one resulting from the train file and one from the test file, which exclude the customers that should not be considered future targets and keep those that are likely to respond. Statistics for the precision, recall and balanced F-score of the train and test files are calculated and printed. Furthermore, we compute the absolute difference between the results retrieved from the train and test files to get the offset error of these statistics. By doing this and verifying that the error values are negligible, we validate the consistency of the implemented model. Moreover, we depict two confusion matrices, one for the test set and one for the training set, by calculating the true positives, false negatives, false positives and true negatives.
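The two business rules above, together with the threshold of 4, can be sketched as small R helpers. This is a minimal illustration of the same logic that the Appendix applies row by row to the train and test matrices; the three-row donors frame is hypothetical:

```r
# Business rule 1: Recency (months) -> category 3, 2 or 1
r_score <- function(recency) {
  ifelse(recency < 15, 3L, ifelse(recency < 26, 2L, 1L))
}
# Business rule 2: Frequency (times) -> category 3, 2 or 1
f_score <- function(frequency) {
  ifelse(frequency >= 25, 3L, ifelse(frequency > 15, 2L, 1L))
}

# hypothetical customers showing the combined RF score and the threshold
donors <- data.frame(Recency = c(2, 20, 30), Frequency = c(28, 16, 5))
donors$RF     <- r_score(donors$Recency) + f_score(donors$Frequency)  # 2..6
donors$target <- donors$RF >= 4  # keep customers at or above the median score
```

Customers with target equal to TRUE are the ones kept in the reduced output files.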
In our case, true positives correspond to the customers who donated in March 2007 and were classified as possible future donators. False negatives correspond to the customers who donated in March 2007 but were not classified as future targets for marketing campaigns. False positives correspond to customers who did not donate in March 2007 but were falsely classified as possible future targets. Lastly, true negatives are customers who did not donate in March 2007 and were correctly classified as improbable future donators, and were therefore removed from the data file. By sorting we mean the application of the threshold (4) to separate the customers who are more likely and less likely to donate again in a certain future period. Lastly, we calculate 2 more single-value metrics for both the train and test files: the Kappa statistic (a general statistic used for classification systems) and the Matthews Correlation Coefficient (MCC), or cost/reward measure. Both are normalized statistics for classification systems whose values never exceed 1, so the same statistic can be used even as the number of observations grows. The errors for both measures are MCC error 0.002577 and Kappa error 0.002808, which are very small (negligible), as with all the previous measures.

REFERENCES

UCI Machine Learning Repository (2008) Blood Transfusion Service Center data set. Available at: http://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center (Accessed: 30 January 2017).
Taiwan Blood Services Foundation (2015) Operation department. Available at: http://www.blood.org.tw/web/english/docDetail.aspx?uid=7741&pid=7681&docid=37144 (Accessed: 31 January 2017).

The Appendix with the code starts below.
However, the whole code has been uploaded to my GitHub profile and can be accessed at this link:
https://github.com/it21208/RassignmentDataAnalysis/blob/master/RassignmentDataAnalysis.R

library(ggplot2)
library(car)

# read the training and testing datasets
traindata <- read.csv("C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/transfusion.csv")
testdata <- read.csv("C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/test.csv")

# assign the datasets to data frames
dftrain <- data.frame(traindata)
dftest <- data.frame(testdata)
sapply(dftrain, typeof)

# give better names to the columns
names(dftrain) <- c("ID", "recency", "frequency", "cc", "time", "donated")
names(dftest) <- c("ID", "recency", "frequency", "cc", "time", "donated")

# drop the time column from both files
dftrain$time <- NULL
dftest$time <- NULL

# sort the (train) data frame on Recency in ascending order
sorted_dftrain <- dftrain[order(dftrain[, 2]), ]
# add a column to the (train) data frame to hold the score (rank) of Recency for each customer
sorted_dftrain[, "Rrank"] <- 0
# transform the train file from data frame format to a matrix
matrix_train <- as.matrix(sapply(sorted_dftrain, as.numeric))

# sort the (test) data frame on Recency in ascending order
sorted_dftest <- dftest[order(dftest[, 2]), ]
# add a column to the (test) data frame to hold the score (rank) of Recency for each customer
sorted_dftest[, "Rrank"] <- 0
# transform the test file from data frame format to a matrix
matrix_test <- as.matrix(sapply(sorted_dftest, as.numeric))

# categorise matrix_train and add the scores for Recency - apply business rule 1
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 2] < 15) {
    matrix_train[i, 6] <- 3
  } else if (matrix_train[i, 2] >= 15 && matrix_train[i, 2] < 26) {
    matrix_train[i, 6] <- 2
  } else {
    matrix_train[i, 6] <- 1
  }
}

# categorise matrix_test and add the scores for Recency - apply business rule 1
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 2] < 15) {
    matrix_test[i, 6] <- 3
  } else if (matrix_test[i, 2] >= 15 && matrix_test[i, 2] < 26) {
    matrix_test[i, 6] <- 2
  } else {
    matrix_test[i, 6] <- 1
  }
}

# convert matrix_train back to a data frame
sorted_dftrain <- data.frame(matrix_train)
# sort the data frame first by Recency rank (desc.) then by Frequency (desc.)
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 6], -sorted_dftrain[, 3]), ]
# add a column to the train data frame to hold the Frequency score (rank) for each customer
sorted_dftrain_2[, "Frank"] <- 0
# convert the data frame to a matrix
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))

# convert matrix_test back to a data frame
sorted_dftest <- data.frame(matrix_test)
# sort the data frame first by Recency rank (desc.) then by Frequency (desc.)
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 6], -sorted_dftest[, 3]), ]
# add a column to the test data frame to hold the Frequency score (rank) for each customer
sorted_dftest2[, "Frank"] <- 0
# convert the data frame to a matrix
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# categorise matrix_train and add the scores for Frequency - apply business rule 2
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 3] >= 25) {
    matrix_train[i, 7] <- 3
  } else if (matrix_train[i, 3] > 15 && matrix_train[i, 3] < 25) {
    matrix_train[i, 7] <- 2
  } else {
    matrix_train[i, 7] <- 1
  }
}

# categorise matrix_test and add the scores for Frequency - apply business rule 2
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 3] >= 25) {
    matrix_test[i, 7] <- 3
  } else if (matrix_test[i, 3] > 15 && matrix_test[i, 3] < 25) {
    matrix_test[i, 7] <- 2
  } else {
    matrix_test[i, 7] <- 1
  }
}

# convert matrix_train back to a data frame
sorted_dftrain <- data.frame(matrix_train)
# sort the (train) data frame first on Recency rank (desc.), second on Frequency rank (desc.)
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 6], -sorted_dftrain[, 7]), ]
# add another column for the sum of the Recency rank and the Frequency rank
sorted_dftrain_2[, "SumRankRAndF"] <- 0
# convert the data frame to a matrix
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))

# convert matrix_test back to a data frame
sorted_dftest <- data.frame(matrix_test)
# sort the (test) data frame first on Recency rank (desc.), second on Frequency rank (desc.)
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 6], -sorted_dftest[, 7]), ]
# add another column for the sum of the Recency rank and the Frequency rank
sorted_dftest2[, "SumRankRAndF"] <- 0
# convert the data frame to a matrix
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# sum the Recency rank and the Frequency rank for the train file
for (i in 1:nrow(matrix_train)) {
  matrix_train[i, 8] <- matrix_train[i, 6] + matrix_train[i, 7]
}
# sum the Recency rank and the Frequency rank for the test file
for (i in 1:nrow(matrix_test)) {
  matrix_test[i, 8] <- matrix_test[i, 6] + matrix_test[i, 7]
}

# convert matrix_train back to a data frame
sorted_dftrain <- data.frame(matrix_train)
# sort the train data frame according to the total rank in descending order
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 8]), ]
# convert the sorted train data frame to a matrix
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))

# convert matrix_test back to a data frame
sorted_dftest <- data.frame(matrix_test)
# sort the test data frame according to the total rank in descending order
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 8]), ]
# convert the sorted test data frame to a matrix
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# apply the threshold: count the customers whose score >= 4 and who have donated (train file)
# and count all the customers that have donated in the train dataset
count_train_predicted_donations <- 0
counter_train <- 0
number_donation_instances_whole_train <- 0
false_positives_train_counter <- 0
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 8] >= 4 && matrix_train[i, 5] == 1) {
    count_train_predicted_donations <- count_train_predicted_donations + 1
  }
  if (matrix_train[i, 8] >= 4 && matrix_train[i, 5] == 0) {
    false_positives_train_counter <- false_positives_train_counter + 1
  }
  if (matrix_train[i, 8] >= 4) {
    counter_train <- counter_train + 1
  }
  if (matrix_train[i, 5] == 1) {
    number_donation_instances_whole_train <- number_donation_instances_whole_train + 1
  }
}

# apply the threshold: count the customers whose score >= 4 and who have donated (test file)
# and count all the customers that have donated in the test dataset
count_test_predicted_donations <- 0
counter_test <- 0
number_donation_instances_whole_test <- 0
false_positives_test_counter <- 0
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 8] >= 4 && matrix_test[i, 5] == 1) {
    count_test_predicted_donations <- count_test_predicted_donations + 1
  }
  if (matrix_test[i, 8] >= 4 && matrix_test[i, 5] == 0) {
    false_positives_test_counter <- false_positives_test_counter + 1
  }
  if (matrix_test[i, 8] >= 4) {
    counter_test <- counter_test + 1
  }
  if (matrix_test[i, 5] == 1) {
    number_donation_instances_whole_test <- number_donation_instances_whole_test + 1
  }
}

# convert matrix_train to a data frame and remove the group of customers
# who are less likely to donate again in the future from the train file
dftrain <- data.frame(matrix_train)
dftrain_final <- dftrain[1:counter_train, 1:8]
# convert matrix_test to a data frame and remove the group of customers
# who are less likely to donate again in the future from the test file
dftest <- data.frame(matrix_test)
dftest_final <- dftest[1:counter_test, 1:8]

# save the final train data frame as a CSV in the specified directory (reduced target future customers)
write.csv(dftrain_final, file = "C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/train_output.csv", row.names = FALSE)
# save the final test data frame as a CSV in the specified directory (reduced target future customers)
write.csv(dftest_final, file = "C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/test_output.csv", row.names = FALSE)

# train precision = number of relevant instances retrieved / number of retrieved instances (collection: 530)
precision_train <- count_train_predicted_donations / counter_train
# train recall = number of relevant instances retrieved / number of relevant instances in the collection (530)
recall_train <- count_train_predicted_donations / number_donation_instances_whole_train
# the balanced F-score combines precision and recall as their harmonic mean (train file)
f_balanced_score_train <- 2 * (precision_train * recall_train) / (precision_train + recall_train)
# test precision
precision_test <- count_test_predicted_donations / counter_test
# test recall
recall_test <- count_test_predicted_donations / number_donation_instances_whole_test
# the balanced F-score for the test file
f_balanced_score_test <- 2 * (precision_test * recall_test) / (precision_test + recall_test)
# error in precision
error_precision <- abs(precision_train - precision_test)
# error in recall
error_recall <- abs(recall_train - recall_test)
# error in the balanced F-scores
error_f_balanced_scores <- abs(f_balanced_score_train - f_balanced_score_test)

# print the statistics for verification and validation
cat("Precision with training dataset:", precision_train, "\n")
cat("Recall with training dataset:", recall_train, "\n")
cat("Precision with testing dataset:", precision_test, "\n")
cat("Recall with testing dataset:", recall_test, "\n")
cat("The F-balanced score with training dataset:", f_balanced_score_train, "\n")
cat("The F-balanced score with testing dataset:", f_balanced_score_test, "\n")
cat("Error in precision:", error_precision, "\n")
cat("Error in recall:", error_recall, "\n")
cat("Error in F-balanced scores:", error_f_balanced_scores, "\n")

# confusion matrix (true positives, false positives, false negatives, true negatives)
# the true positives for train are the variable count_train_predicted_donations
# the false positives for train are the variable false_positives_train_counter
# calculate the false negatives for train
false_negatives_for_train <- number_donation_instances_whole_train - count_train_predicted_donations
# calculate the true negatives for train
true_negatives_for_train <- (nrow(matrix_train) - number_donation_instances_whole_train) - false_positives_train_counter
collect_train <- c(false_positives_train_counter, true_negatives_for_train, count_train_predicted_donations, false_negatives_for_train)
# the true positives for test are the variable count_test_predicted_donations
# the false positives for test are the variable false_positives_test_counter
# calculate the false negatives for test
false_negatives_for_test <- number_donation_instances_whole_test - count_test_predicted_donations
# calculate the true negatives for test
true_negatives_for_test <- (nrow(matrix_test) - number_donation_instances_whole_test) - false_positives_test_counter
collect_test <- c(false_positives_test_counter, true_negatives_for_test, count_test_predicted_donations, false_negatives_for_test)

TrueCondition <- factor(c(0, 0, 1, 1))
PredictedCondition <- factor(c(1, 0, 1, 0))

# plot the confusion matrix for train
df_conf_mat_train <- data.frame(TrueCondition, PredictedCondition, collect_train)
ggplot(data = df_conf_mat_train, mapping = aes(x = PredictedCondition, y = TrueCondition)) +
  geom_tile(aes(fill = collect_train), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_train)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")

# plot the confusion matrix for test
df_conf_mat_test <- data.frame(TrueCondition, PredictedCondition, collect_test)
ggplot(data = df_conf_mat_test, mapping = aes(x = PredictedCondition, y = TrueCondition)) +
  geom_tile(aes(fill = collect_test), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_test)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")

# MCC = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(FP+TN)(TN+FN)) for the train values
mcc_train <- ((count_train_predicted_donations * true_negatives_for_train) - (false_positives_train_counter * false_negatives_for_train)) /
  sqrt((count_train_predicted_donations + false_positives_train_counter) *
       (count_train_predicted_donations + false_negatives_for_train) *
       (false_positives_train_counter + true_negatives_for_train) *
       (true_negatives_for_train + false_negatives_for_train))
# print the MCC for train
cat("Matthews Correlation Coefficient for train:", mcc_train, "\n")

# MCC = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(FP+TN)(TN+FN)) for the test values
mcc_test <- ((count_test_predicted_donations * true_negatives_for_test) - (false_positives_test_counter * false_negatives_for_test)) /
  sqrt((count_test_predicted_donations + false_positives_test_counter) *
       (count_test_predicted_donations + false_negatives_for_test) *
       (false_positives_test_counter + true_negatives_for_test) *
       (true_negatives_for_test + false_negatives_for_test))
# print the MCC for test
cat("Matthews Correlation Coefficient for test:", mcc_test, "\n")
# print the MCC error between train and test
cat("Matthews Correlation Coefficient error:", abs(mcc_train - mcc_test), "\n")

# Total = TP + TN + FP + FN for train
total_train <- count_train_predicted_donations + true_negatives_for_train + false_positives_train_counter + false_negatives_for_train
# Total = TP + TN + FP + FN for test
total_test <- count_test_predicted_donations + true_negatives_for_test + false_positives_test_counter + false_negatives_for_test
# totalAccuracy = (TP + TN) / Total for the train values
totalAccuracyTrain <- (count_train_predicted_donations + true_negatives_for_train) / total_train
# totalAccuracy = (TP + TN) / Total for the test values
totalAccuracyTest <- (count_test_predicted_donations + true_negatives_for_test) / total_test
# randomAccuracy = ((TN+FP)*(TN+FN) + (FN+TP)*(FP+TP)) / (Total*Total) for the train values
randomAccuracyTrain <- ((true_negatives_for_train + false_positives_train_counter) * (true_negatives_for_train + false_negatives_for_train) +
                        (false_negatives_for_train + count_train_predicted_donations) * (false_positives_train_counter + count_train_predicted_donations)) /
  (total_train * total_train)
# randomAccuracy = ((TN+FP)*(TN+FN) + (FN+TP)*(FP+TP)) / (Total*Total) for the test values
randomAccuracyTest <- ((true_negatives_for_test + false_positives_test_counter) * (true_negatives_for_test + false_negatives_for_test) +
                       (false_negatives_for_test + count_test_predicted_donations) * (false_positives_test_counter + count_test_predicted_donations)) /
  (total_test * total_test)
# kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy) for train
kappa_train <- (totalAccuracyTrain - randomAccuracyTrain) / (1 - randomAccuracyTrain)
# kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy) for test
kappa_test <- (totalAccuracyTest - randomAccuracyTest) / (1 - randomAccuracyTest)
# print the kappa error
cat("Kappa error:", abs(kappa_train - kappa_test), "\n")