
Contents

S.No.  Topic
1.     List of Linux Programs
2.     List of Data Mining Programs
3.     Week 1  (Programs 1-4)
4.     Week 2  (Programs 5-7)
5.     Week 3  (Programs 8-10)
6.     Week 4  (Programs 11-12)
7.     Week 5  (Programs 13-15)
8.     Week 6  (Programs 16-18)
9.     Week 7  (Programs 19-22)
10.    Week 8  (Programs 23-24)
11.    Week 9  (Programs 25-26)
12.    Week 10 (Programs 27-28)
13.    Listing the categorical attributes and the real-valued attributes separately
14.    Rules for identifying attributes
15.    Training a decision tree
16.    Test on classification of decision tree
17.    Testing on the training set
18.    Using cross-validation for training
19.    Significance of attributes in a decision tree
20.    Trying generation of decision trees with various numbers of attributes
21.    Differences in results using a decision tree and cross-validation on a data set
22.    Decision trees
23.    Reduced error pruning for training decision trees using cross-validation
24.    Converting a decision tree into "if-then-else" rules

List of Linux Programs


1. Write a shell script that accepts a file name, starting and ending line numbers as arguments and displays all the lines between the given line numbers.
2. Write a shell script that deletes all lines containing a specified word in one or more files supplied as arguments to it.
3. Write a shell script that displays a list of all the files in the current directory to which the user has read, write and execute permissions.
4. Write a shell script that receives any number of file names as arguments, checks if every argument supplied is a file or a directory and reports accordingly. Whenever the argument is a file, the number of lines in it is also reported.
5. Write a shell script that accepts a list of file names as its arguments, counts and reports the occurrence of each word that is present in the first argument file on other argument files.
6. Write a shell script to list all of the directory files in a directory.
7. Write a shell script to find the factorial of a given integer.
8. Write an awk script to count the number of lines in a file that do not contain vowels.
9. Write an awk script to find the number of characters, words and lines in a file.
10. Write a C program that makes a copy of a file using standard I/O and system calls.
11. Implement in C the following UNIX commands using system calls: A. cat B. ls C. mv
12. Write a program that takes one or more file/directory names as command line input and reports the following information on the file: A. File type B. Number of links C. Time of last access D. Read, write and execute permissions.
13. Write a C program to emulate the UNIX ls -l command.
14. Write a C program to list for every file in a directory, its inode number and file name.
15. Write a C program that demonstrates redirection of standard output to a file. Ex: ls > f1.
16. Write a C program to create a child process and allow the parent to display "parent" and the child to display "child" on the screen.
17. Write a C program to create a Zombie process.
18. Write a C program that illustrates how an orphan is created.
19. Write a C program that illustrates how to execute two commands concurrently with a command pipe. Ex: ls -l | sort
20. Write C programs that illustrate communication between two unrelated processes using a named pipe.
21. Write a C program to create a message queue with read and write permissions and write 3 messages to it with different priority numbers.
22. Write a C program that receives the messages (from the above message queue as specified in 21) and displays them.
23. Write a C program to allow cooperating processes to lock a resource for exclusive use, using a) semaphores b) flock or lockf system calls.
24. Write a C program that illustrates suspending and resuming processes using signals.
25. Write a C program that implements a producer-consumer system with two processes (using semaphores).
26. Write client and server programs (using C) for interaction between server and client processes using Unix domain sockets.
27. Write client and server programs (using C) for interaction between server and client processes using Internet domain sockets.
28. Write a C program that illustrates two processes communicating using shared memory.

List of Data Mining Programs

S.No.  Task Description

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

3. One type of model that you can create is a Decision Tree. Train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

4. Suppose you use your above model, trained on the complete dataset, and classify credit as good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

5. Is testing on the training set, as you did above, a good idea? Why or why not?

6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why? (10 marks)

7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal-status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the one created for the full dataset, which you have already done. To remove an attribute you can use the Preprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.)

9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case. You can do this by using a cost matrix in Weka. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal costs)?

10. Do you think it is a good idea to prefer simple decision trees instead of long, complex ones? How does the complexity of a Decision Tree relate to the bias of the model?

11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning; explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in Weka) and report the Decision Tree you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?

12. (Extra credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model in the form of rules; one such classifier in Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision, yes, just one! Can you predict which attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier.

Week 1

1. Write a shell script that accepts a file name, starting and ending line numbers as arguments and displays all the lines between the given line numbers.

Aim: To write a shell script that accepts a file name, starting and ending line numbers as arguments and displays all the lines between the given line numbers.

Script:

if [ $# -ne 3 ]
then
    echo "Error : Invalid number of arguments."
    exit
fi
if [ $2 -gt $3 ]
then
    echo "Error : Invalid range value."
    exit
fi
l=`expr $3 - $2 + 1`
cat $1 | tail +$2 | head -$l

Output:

$ sh 11b.sh test 5 7
abc 1234
def 5678
ghi 91011

Description:

head command: displays lines at the beginning of one or more files. By default it displays the first 10 lines of a file.
    head [ -count ] filename
tail command: displays the last few lines at the end of a file. By default it displays the last 10 lines of a file.
    tail [ +/- start ] filename      (start is the starting line number)
    tail -5 filename    displays the last 5 lines of the file
    tail +5 filename    displays all the lines, beginning from line number 5, to the end of the file

2. Write a shell script that deletes all lines containing a specified word in one or more files supplied as arguments to it.

Aim: To write a shell script that deletes all lines containing a specified word in one or more files supplied as arguments to it.

Script:

clear
if [ $# -eq 0 ]
then
    echo no arguments passed
    exit
fi
echo the contents before deleting
for i in $*
do
    echo $i
    cat $i
done
echo enter the word to be deleted
read word
for i in $*
do
    grep -vi "$word" $i > temp
    mv temp $i
    echo after deleting
    cat $i
done

Output:

$ sh 8b.sh test1
the contents before deleting
test1
hello
hello bangalore
mysore city
enter the word to be deleted
city
after deleting
hello
hello Bangalore

$ sh 8b.sh
no arguments passed

3. Write a shell script that displays a list of all the files in the current directory to which the user has read, write and execute permissions.

Aim: To write a shell script that displays a list of all the files in the current directory to which the user has read, write and execute permissions.

Script:

echo "enter the directory name"
read dir
if [ -d $dir ]
then
    cd $dir
    ls > f
    exec < f
    while read line
    do
        if [ -f $line ]
        then
            if [ -r $line -a -w $line -a -x $line ]
            then
                echo "$line has all permissions"
            else
                echo "files not having all permissions"
            fi
        fi
    done
fi

4. Write a shell script that receives any number of file names as arguments, checks if every argument supplied is a file or a directory and reports accordingly. Whenever the argument is a file, the number of lines in it is also reported.

Aim: To write a shell script that receives any number of file names as arguments and checks if every argument supplied is a file or a directory.

Script:

for x in $*
do
    if [ -f $x ]
    then
        echo "$x is a file"
        echo "no of lines in the file are"
        wc -l $x
    elif [ -d $x ]
    then
        echo "$x is a directory"
    else
        echo "enter valid file name or directory name"
    fi
done

Week 2

5. Write a shell script that accepts a list of file names as its arguments, counts and reports the occurrence of each word that is present in the first argument file on other argument files.

Aim: To write a shell script that accepts a list of file names as its arguments, counts and reports the occurrence of each word that is present in the first argument file on other argument files.

Script:

if [ $# -ne 2 ]
then
    echo "Error : Invalid number of arguments."
    exit
fi
str=`cat $1 | tr '\n' ' '`
for a in $str
do
    echo "Word = $a, Count = `grep -c "$a" $2`"
done

Output:

$ cat test
hello AEC
$ cat test1
hello AEC hello AEC hello
$ sh 1.sh test test1
Word = hello, Count = 3
Word = AEC, Count = 2

6. Write a shell script to list all of the directory files in a directory.



Script:

#!/bin/bash
echo "enter directory name"
read dir
if [ -d $dir ]
then
    echo "list of files in the directory"
    ls $dir
else
    echo "enter proper directory name"
fi

Output:

enter directory name
AEC
list of files in the directory
CSE.txt  ECE.txt

7. Write a shell script to find the factorial of a given integer.

Script:

#!/bin/bash
echo "enter a number"
read num
n=$num
fact=1
while [ $num -ge 1 ]
do
    fact=`echo $fact \* $num | bc`
    let num--
done
echo "factorial of $n is $fact"

Output:

enter a number
5
factorial of 5 is 120


Week 3

8. Write an awk script to count the number of lines in a file that do not contain vowels.
9. Write an awk script to find the number of characters, words and lines in a file.
10. Write a C program that makes a copy of a file using standard I/O and system calls.

(The manual provides a listing only for program 9; sketches for programs 8 and 10 are given after it.)

Aim: To write an awk script to find the number of characters, words and lines in a file.

Script:

BEGIN { print "record.\t characters \t words" }
# BODY section
{
    len = length($0)
    total_len += len
    print(NR, ":\t", len, ":\t", NF, $0)
    words += NF
}
END {
    print("\n total")
    print("characters :\t" total_len)
    print("words :\t" words)
    print("lines :\t" NR)
}
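No listing is given for program 8. A minimal sketch follows: any line that matches no vowel, upper or lower case, increments a counter. The script file name novowels.awk is only an assumed name.

# novowels.awk - count lines that contain no vowels
!/[aeiouAEIOU]/ { count++ }
END { print "Lines without vowels: " count+0 }

Run it as: awk -f novowels.awk filename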

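Program 10 likewise has no listing in the manual. A minimal sketch using the read()/write() system calls is given below; the 1024-byte buffer size is an arbitrary choice. A standard-I/O variant would use fopen(), fgetc()/fputc() and fclose() instead.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[1024];
    int src, dst;
    ssize_t n;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s source dest\n", argv[0]);
        exit(1);
    }
    src = open(argv[1], O_RDONLY);
    dst = creat(argv[2], 0644);
    if (src < 0 || dst < 0)
    {
        perror("open/creat");
        exit(1);
    }
    while ((n = read(src, buf, sizeof(buf))) > 0)
        write(dst, buf, n);          /* copy one block at a time */
    close(src);
    close(dst);
    return 0;
}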

Week 4

11. Implement in C the following UNIX commands using system calls: A. cat B. ls C. mv

12. Write a program that takes one or more file/directory names as command line input and reports the following information on the file: A. File type B. Number of links C. Time of last access D. Read, write and execute permissions. (A sketch for program 12 is given at the end of this week.)

AIM: Implement in C the cat Unix command using system calls.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#define BUFSIZE 1

int main(int argc, char **argv)
{
    int fd1;
    int n;
    char buf;

    fd1 = open(argv[1], O_RDONLY);
    printf("Welcome to AEC\n");
    while ((n = read(fd1, &buf, 1)) > 0)
    {
        printf("%c", buf);    /* or write(1, &buf, 1); */
    }
    return 0;
}

AIM: Implement in C the ls Unix command using system calls.

Algorithm:
1. Start.
2. Open the directory using the opendir() system call.
3. Read the directory using the readdir() system call.
4. Print the entry's name and inode number.
5. Repeat the above steps until the end of the directory.
6. End.

#include <sys/types.h>
#include <sys/dir.h>
#include <sys/param.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FALSE 0
#define TRUE 1

extern int alphasort();
char pathname[MAXPATHLEN];

main()
{
    int count, i;
    struct dirent **files;
    int file_select();

    if (getwd(pathname) == NULL)
    {
        printf("Error getting path\n");
        exit(0);
    }
    printf("Current Working Directory = %s\n", pathname);
    count = scandir(pathname, &files, file_select, alphasort);
    if (count <= 0)
    {
        printf("No files in this directory\n");
        exit(0);
    }
    printf("Number of files = %d\n", count);
    for (i = 1; i < count + 1; ++i)
        printf("%s\n", files[i-1]->d_name);
}

int file_select(struct direct *entry)
{
    if ((strcmp(entry->d_name, ".") == 0) || (strcmp(entry->d_name, "..") == 0))
        return (FALSE);
    else
        return (TRUE);
}

AIM: Implement in C the mv Unix command using system calls.

Algorithm:
1. Start.
2. Open the existing file and create the new file using the open() and creat() system calls.
3. Read the contents from the existing file using the read() system call.
4. Write these contents into the new file using the write() system call.
5. Repeat the above two steps until end of file.
6. Close both files using the close() system call.
7. Delete the existing file using the unlink() system call.
8. End.

Program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int fd1, fd2;
    int n;
    char buf;

    fd1 = open(argv[1], O_RDONLY);
    fd2 = creat(argv[2], S_IWUSR | S_IRUSR);
    while ((n = read(fd1, &buf, 1)) > 0)
        write(fd2, &buf, 1);         /* copy the contents byte by byte */
    close(fd1);
    close(fd2);
    unlink(argv[1]);                 /* remove the original, completing the move */
    printf("file is moved\n");
    return 0;
}
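The manual gives no listing for program 12. A minimal sketch using the stat() system call is shown below; the output format, and reporting only the owner's rwx bits, are assumptions.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct stat st;
    int i;

    for (i = 1; i < argc; i++)
    {
        if (stat(argv[i], &st) < 0)
        {
            perror(argv[i]);
            continue;
        }
        printf("%s:\n", argv[i]);
        printf("  type        : %s\n",
               S_ISDIR(st.st_mode) ? "directory" :
               S_ISREG(st.st_mode) ? "regular file" : "other");
        printf("  links       : %ld\n", (long)st.st_nlink);
        printf("  last access : %s", ctime(&st.st_atime));
        printf("  permissions : %c%c%c (owner)\n",
               (st.st_mode & S_IRUSR) ? 'r' : '-',
               (st.st_mode & S_IWUSR) ? 'w' : '-',
               (st.st_mode & S_IXUSR) ? 'x' : '-');
    }
    return 0;
}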

Week 5

13. Write a C program to emulate the UNIX ls -l command.

ALGORITHM:
Step 1: Include the necessary header files for manipulating the directory.
Step 2: Declare and initialize the required objects.
Step 3: Read the directory name from the user.
Step 4: Open the directory using the opendir() system call and report an error if the directory is not available.
Step 5: Read an entry available in the directory.
Step 6: Display the directory entry, i.e., the name of the file or subdirectory.
Step 7: Repeat steps 5 and 6 until all the entries have been read.

/* 1. Simulation of ls command */

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

main()
{
    char dirname[10];
    DIR *p;
    struct dirent *d;

    printf("Enter directory name ");
    scanf("%s", dirname);
    p = opendir(dirname);
    if (p == NULL)
    {
        perror("Cannot find dir.");
        exit(-1);
    }
    while (d = readdir(p))
        printf("%s\n", d->d_name);
}

SAMPLE OUTPUT:

enter directory name
iii
...
f2

14. Write a C program to list for every file in a directory, its inode number and file name. (A sketch follows the description below.)

15. Write a C program that demonstrates redirection of standard output to a file. Ex: ls > f1. (A sketch follows as well.)

Description:
An inode number points to an inode. An inode is a data structure that stores the following information about a file:

- Size of the file
- Device ID
- User ID of the file
- Group ID of the file
- The file mode information and access privileges for owner, group and others
- File protection flags
- The timestamps for file creation, modification, etc.
- A link counter to determine the number of hard links
- Pointers to the blocks storing the file's contents
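The manual provides no listing for program 14. Since readdir() already returns each entry's inode number in the d_ino field, a minimal sketch is:

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    DIR *dp;
    struct dirent *d;

    dp = opendir(argc > 1 ? argv[1] : ".");   /* default to the current directory */
    if (dp == NULL)
    {
        perror("opendir");
        exit(1);
    }
    while ((d = readdir(dp)) != NULL)
        printf("%-10lu %s\n", (unsigned long)d->d_ino, d->d_name);
    closedir(dp);
    return 0;
}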

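Program 15 also has no listing. The usual technique is to open the target file and duplicate its descriptor onto standard output (descriptor 1) with dup2() before exec'ing the command. A sketch that reproduces ls > f1 (the file name f1 is taken from the task statement):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("f1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        perror("open");
        exit(1);
    }
    dup2(fd, 1);                          /* standard output now refers to f1 */
    close(fd);
    execlp("ls", "ls", (char *)NULL);     /* equivalent of: ls > f1 */
    perror("execlp");                     /* reached only if exec fails */
    return 1;
}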


Week 6

16. Write a C program to create a child process and allow the parent to display "parent" and the child to display "child" on the screen.

#include <stdio.h>
#include <unistd.h>

main()
{
    int childpid;
    if ((childpid = fork()) < 0)
    {
        printf("cannot fork");
    }
    else if (childpid > 0)
    {
        printf("parent process");    /* parent branch */
    }
    else
        printf("child process");     /* child branch */
}

17. Write a C program to create a Zombie process.

If a child terminates before its parent has waited for it, the child becomes a zombie process: it has exited, but it still occupies an entry in the process table.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

main()
{
    int childpid;
    if ((childpid = fork()) < 0)
    {
        printf("cannot fork");
    }
    else if (childpid == 0)
    {
        printf("child process");
        exit(0);          /* child exits immediately */
    }
    else
    {
        sleep(100);       /* parent does not wait, so the child stays a zombie */
        printf("parent process");
    }
}

18. Write a C program that illustrates how an orphan is created.

#include <stdio.h>
#include <unistd.h>

main()
{
    int id;

    printf("Before fork()\n");
    id = fork();
    if (id == 0)
    {
        printf("Child has started: %d\n", getpid());
        printf("Parent of this child : %d\n", getppid());
        printf("child prints 1 item :\n");
        sleep(25);        /* by now the parent has exited; the child is an orphan */
        printf("child prints 2 item :\n");
    }
    else
    {
        printf("Parent has started: %d\n", getpid());
        printf("Parent of the parent proc : %d\n", getppid());
    }
    printf("After fork()");
}


Week 7

19. Write a C program that illustrates how to execute two commands concurrently with a command pipe. Ex: ls -l | sort

AIM: Implementing pipes.

DESCRIPTION:
A pipe is created by calling the pipe() function.

    int pipe(int filedesc[2]);

It returns a pair of file descriptors: filedesc[0] is open for reading and filedesc[1] is open for writing. This function returns 0 if OK and -1 on error.

ALGORITHM:
The following is a simple algorithm for creating, writing to and reading from a pipe.
1) Create a pipe through a pipe() function call.
2) Use the write() function to write the data into the pipe. The syntax is as follows:
       write(int [], ip_string, size);
   int []     - file descriptor variable; in this case, if int filedesc[2] is the variable, then use filedesc[1] as the first parameter.
   ip_string  - the string to be written into the pipe.
   size       - buffer size for storing the input.
3) Use the read() function to read the data that has been written to the pipe. The syntax is as follows:
       read(int [], char, size);

PROGRAM:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

main()
{
    int pipe1[2], pipe2[2], childpid;

    if (pipe(pipe1) < 0 || pipe(pipe2) < 0)
        printf("pipe creation error");
    if ((childpid = fork()) < 0)
    {
        printf("cannot fork");
    }
    else if (childpid > 0)
    {
        close(pipe1[0]);
        close(pipe2[1]);
        client(pipe2[0], pipe1[1]);
        while (wait((int *)0) != childpid);
        close(pipe1[1]);
        close(pipe2[0]);
        exit(0);
    }
    else
    {
        close(pipe1[1]);
        close(pipe2[0]);
        server(pipe1[0], pipe2[1]);
        close(pipe1[0]);
        close(pipe2[1]);
        exit(0);
    }
}

client(int readfd, int writefd)
{
    int n;
    char buff[1024];

    if (fgets(buff, 1024, stdin) == NULL)
        printf("file name read error");
    n = strlen(buff);
    if (buff[n-1] == '\n')
        n--;
    if (write(writefd, buff, n) != n)
        printf("file name write error");
    while ((n = read(readfd, buff, 1024)) > 0)
        if (write(1, buff, n) != n)
            printf("data write error");
    if (n < 0)
        printf("data error");
}

server(int readfd, int writefd)
{
    char buff[1024];
    int n, fd;

    n = read(readfd, buff, 1024);
    buff[n] = '\0';
    if ((fd = open(buff, 0)) < 0)
    {
        sprintf(buff, "file does not exist");
        write(writefd, buff, 1024);
    }
    else
    {
        while ((n = read(fd, buff, 1024)) > 0)
            write(writefd, buff, n);
    }
}
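Note that the listing above passes a file between a client and a server over two pipes rather than running two commands. For the stated example ls -l | sort, a minimal sketch of the shell's own technique follows: the parent creates one pipe, the first child points its standard output at the pipe's write end and execs ls -l, and the second child points its standard input at the read end and execs sort.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];

    if (pipe(fd) < 0)
    {
        perror("pipe");
        exit(1);
    }
    if (fork() == 0)              /* first child: ls -l */
    {
        dup2(fd[1], 1);           /* stdout -> write end of pipe */
        close(fd[0]);
        close(fd[1]);
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp ls");
        exit(1);
    }
    if (fork() == 0)              /* second child: sort */
    {
        dup2(fd[0], 0);           /* stdin <- read end of pipe */
        close(fd[0]);
        close(fd[1]);
        execlp("sort", "sort", (char *)NULL);
        perror("execlp sort");
        exit(1);
    }
    close(fd[0]);                 /* parent closes both ends and waits */
    close(fd[1]);
    wait(NULL);
    wait(NULL);
    return 0;
}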

20. Write C programs that illustrate communication between two unrelated processes using a named pipe.

AIM: Implementing IPC using a FIFO (or named pipe).

DESCRIPTION:
Another kind of IPC is the FIFO (First In First Out), sometimes also called a named pipe. It is like a pipe, except that it has a name. Here the name is that of a file that multiple processes can open(), read and write to. A FIFO is created using the mknod() system call. The syntax is as follows:

    int mknod(char *pathname, int mode, int dev);

The pathname is a normal Unix pathname, and this is the name of the FIFO. The mode argument specifies the file access mode. The dev value is ignored for a FIFO. Once a FIFO is created, it must be opened for reading or writing, using either the open() system call or one of the standard I/O open functions (fopen or freopen).

ALGORITHM:
The following is a simple algorithm for creating, writing to and reading from a FIFO.
1) Create a FIFO through the mknod() function call.
2) Use the write() function to write the data into the FIFO. The syntax is as follows:
       write(int [], ip_string, size);
   int []     - file descriptor variable; in this case, if int filedesc[2] is the variable, then use filedesc[1] as the first parameter.
   ip_string  - the string to be written into the FIFO.
   size       - buffer size for storing the input.
3) Use the read() function to read the data that has been written to the FIFO. The syntax is as follows:
       read(int [], char, size);

PROGRAM:

#define FIFO1 "Fifo1"
#define FIFO2 "Fifo2"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/stat.h>

main()
{
    int childpid, wfd, rfd;

    mknod(FIFO1, 0666 | S_IFIFO, 0);
    mknod(FIFO2, 0666 | S_IFIFO, 0);
    if ((childpid = fork()) == -1)
    {
        printf("cannot fork");
    }
    else if (childpid > 0)
    {
        wfd = open(FIFO1, 1);
        rfd = open(FIFO2, 0);
        client(rfd, wfd);
        while (wait((int *)0) != childpid);
        close(rfd);
        close(wfd);
        unlink(FIFO1);
        unlink(FIFO2);
    }
    else
    {
        rfd = open(FIFO1, 0);
        wfd = open(FIFO2, 1);
        server(rfd, wfd);
        close(rfd);
        close(wfd);
    }
}

client(int readfd, int writefd)
{
    int n;
    char buff[1024];

    printf("enter a file name");
    if (fgets(buff, 1024, stdin) == NULL)
        printf("file name read error");
    n = strlen(buff);
    if (buff[n-1] == '\n')
        n--;
    if (write(writefd, buff, n) != n)
        printf("file name write error");
    while ((n = read(readfd, buff, 1024)) > 0)
        if (write(1, buff, n) != n)
            printf("data write error");
    if (n < 0)
        printf("data error");
}

server(int readfd, int writefd)
{
    char buff[1024];
    int n, fd;

    n = read(readfd, buff, 1024);
    buff[n] = '\0';
    if ((fd = open(buff, 0)) < 0)
    {
        sprintf(buff, "file does not exist");
        write(writefd, buff, 1024);
    }
    else
    {
        while ((n = read(fd, buff, 1024)) > 0)
            write(writefd, buff, n);
    }
}

21. Write a C program to create a message queue with read and write permissions and write 3 messages to it with different priority numbers.

22. Write a C program that receives the messages (from the above message queue as specified in 21) and displays them.

Aim: To create a message queue.

DESCRIPTION:
Message passing between processes is provided by the operating system and is done through a message queue. Messages are stored in the kernel and are associated with a message queue identifier (msqid). Processes read and write messages to an arbitrary queue in such a way that one process writes a message to the queue and exits, and another process reads it at a later time.

ALGORITHM:
Before defining the structure, the ipc_perm structure should be available, which is done by including the following files:

#include <sys/types.h>
#include <sys/ipc.h>

A structure of information is maintained by the kernel for each queue; it contains the following:

struct msqid_ds {
    struct ipc_perm msg_perm;  /* operation permission */
    struct msg *msg_first;     /* ptr to first msg on queue */
    struct msg *msg_last;      /* ptr to last msg on queue */
    ushort msg_cbytes;         /* current bytes on queue */
    ushort msg_qnum;           /* current no of msgs on queue */
    ushort msg_qbytes;         /* max no of bytes on queue */
    ushort msg_lspid;          /* pid of last msg send */
    ushort msg_lrpid;          /* pid of last msg recvd */
    time_t msg_stime;          /* time of last msg snd */
    time_t msg_rtime;          /* time of last msg rcv */
    time_t msg_ctime;          /* time of last msg ctl */
};

To create a new message queue, or to access an existing one, the msgget() function is used.

Syntax: int msgget(key_t key, int msgflag);

Msg flag values:

Numeric   Symbolic      Description
0400      MSG_R         Read by owner
0200      MSG_W         Write by owner
0040      MSG_R >> 3    Read by group
0020      MSG_W >> 3    Write by group

msgget() returns the msqid, or -1 on error.

1. To put a message on a queue the msgsnd() function is used.

Syntax: int msgsnd(int msqid, struct msgbuf *ptr, int length, int flag);

msqid is the message queue id, a unique id. msgbuf is the actual content to send, a pointer to a structure which contains the following:

struct msgbuf {
    long mtype;      /* message type > 0 */
    char mtext[1];   /* data */
};

length is the size of the message in bytes. flag may be IPC_NOWAIT, which allows the system call to return immediately when there is no room on the queue; when this is specified, msgsnd() returns -1 if there is no room on the queue. Otherwise the flag can be specified as 0.

2. To receive a message the msgrcv() function is used.

Syntax: int msgrcv(int msqid, struct msgbuf *ptr, int length, long msgtype, int flag);

*ptr is a pointer to the structure where the received message is to be stored, and length is the size to be received and stored in that area. The flag may be MSG_NOERROR: without it, an error is returned if length is not large enough to receive the message; with it, a message whose data portion is greater than length is truncated and returned.

3. A variety of control operations on a message queue can be done through the msgctl() function.

Syntax: int msgctl(int msqid, int cmd, struct msqid_ds *buff);

IPC_RMID given as cmd removes a message queue from the system.

Let us create a header file msgq.h with the following in it:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/errno.h>
extern int errno;
#define MKEY1 1234L
#define MKEY2 2345L
#define PERMS 0666

Server process:

#include "msgq.h"
main()
{
    int readid, writeid;

    if ((readid = msgget(MKEY1, PERMS | IPC_CREAT)) < 0)
        err_sys("server: can't get message queue 1");
    if ((writeid = msgget(MKEY2, PERMS | IPC_CREAT)) < 0)
        err_sys("server: can't get message queue 2");
    server(readid, writeid);
    exit(0);
}

Client process:

#include "msgq.h"
main()
{
    int readid, writeid;

    /* open the queues which the server has already created */
    if ((writeid = msgget(MKEY1, 0)) < 0)
        err_sys("client: can't access message queue 1");
    if ((readid = msgget(MKEY2, 0)) < 0)
        err_sys("client: can't access message queue 2");
    client(readid, writeid);
    /* delete the message queues */
    if (msgctl(readid, IPC_RMID, (struct msqid_ds *)0) < 0)
        err_sys("client: can't RMID message queue 1");
    if (msgctl(writeid, IPC_RMID, (struct msqid_ds *)0) < 0)
        err_sys("client: can't RMID message queue 2");
    exit(0);
}
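The skeletons above omit the actual sending and receiving that programs 21 and 22 ask for. A minimal sketch follows, using the message type field as the priority number; the key value 1234, the message texts and the 64-byte text size are arbitrary choices. Run the sender first, then the receiver.

/* sender - program 21 (sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct mymsg { long mtype; char mtext[64]; };

int main(void)
{
    int msqid = msgget((key_t)1234, 0666 | IPC_CREAT);
    struct mymsg m;
    long prio;

    if (msqid == -1) { perror("msgget"); exit(1); }
    for (prio = 1; prio <= 3; prio++)     /* three messages, priorities 1..3 */
    {
        m.mtype = prio;
        sprintf(m.mtext, "message with priority %ld", prio);
        if (msgsnd(msqid, &m, strlen(m.mtext) + 1, 0) == -1)
            perror("msgsnd");
    }
    return 0;
}

/* receiver - program 22 (sketch, separate file) */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct mymsg { long mtype; char mtext[64]; };

int main(void)
{
    int msqid = msgget((key_t)1234, 0666);
    struct mymsg m;
    int i;

    if (msqid == -1) { perror("msgget"); exit(1); }
    for (i = 0; i < 3; i++)
    {
        /* msgtype -3 retrieves the message with the lowest mtype not
           exceeding 3, i.e. the highest-priority message first */
        if (msgrcv(msqid, &m, sizeof(m.mtext), -3, 0) == -1)
        { perror("msgrcv"); exit(1); }
        printf("received (priority %ld): %s\n", m.mtype, m.mtext);
    }
    msgctl(msqid, IPC_RMID, NULL);        /* remove the queue when done */
    return 0;
}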
Week 8

23. Write a C program to allow cooperating processes to lock a resource for exclusive use, using a) semaphores b) flock or lockf system calls.

24. Write a C program that illustrates suspending and resuming processes using signals. (A sketch is given after program 23 below.)

23. a) AIM: C program that illustrates file locking using semaphores.

PROGRAM:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun {                /* must be defined by the caller on Linux */
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

int main(void)
{
    key_t key;
    int semid;
    union semun arg;

    if ((key = ftok("semdemo.c", 'j')) == -1)
    {
        perror("ftok");
        exit(1);
    }
    if ((semid = semget(key, 1, 0666 | IPC_CREAT)) == -1)
    {
        perror("semget");
        exit(1);
    }
    arg.val = 1;             /* initialise the semaphore to 1: resource free */
    if (semctl(semid, 0, SETVAL, arg) == -1)
    {
        perror("semctl");
        exit(1);
    }
    return 0;
}
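For part (b) no listing is given. A minimal sketch using lockf() follows; the file name lock.txt and the 10-second hold are arbitrary. A second copy of the program started while the first holds the lock will block inside lockf() until the lock is released, which demonstrates the exclusive use.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("lock.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    printf("pid %d: requesting lock\n", getpid());
    if (lockf(fd, F_LOCK, 0) == -1)    /* lock the whole file; blocks if held */
    { perror("lockf"); exit(1); }
    printf("pid %d: lock acquired, using resource\n", getpid());
    sleep(10);                         /* simulate exclusive use */
    lockf(fd, F_ULOCK, 0);             /* release the lock */
    printf("pid %d: lock released\n", getpid());
    close(fd);
    return 0;
}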

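No listing is given for program 24 either. Suspension and resumption are conventionally done by sending SIGSTOP and SIGCONT; a sketch in which the parent suspends and then resumes a counting child (the sleep intervals are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)                      /* child: count forever */
    {
        int i = 0;
        while (1)
        {
            printf("child: %d\n", i++);
            sleep(1);
        }
    }
    sleep(3);
    printf("parent: suspending child\n");
    kill(pid, SIGSTOP);                /* suspend the child */
    sleep(3);                          /* the child prints nothing here */
    printf("parent: resuming child\n");
    kill(pid, SIGCONT);                /* resume the child */
    sleep(3);
    kill(pid, SIGKILL);                /* clean up */
    waitpid(pid, NULL, 0);
    return 0;
}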
Week 9

25. Write a C program that implements a producer-consumer system with two processes (using semaphores).

26. Write client and server programs (using C) for interaction between server and client processes using Unix domain sockets. (A sketch is given after program 25's output.)

Algorithm (program 25):
1. Start.
2. Create a semaphore set using the semget() system call.
3. If successful, it returns a positive value.
4. Fork a child process, so that two processes run.
5. The first process produces.
6. Until the first process produces, the second process cannot consume.
7. End.

Source code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>
#include <time.h>

#define NUM_LOOPS 2

int main(int argc, char *argv[])
{
    int sem_set_id;
    int child_pid, i;
    struct sembuf sem_op;
    struct timespec delay;

    sem_set_id = semget(IPC_PRIVATE, 2, 0600);
    if (sem_set_id == -1)
    {
        perror("main: semget");
        exit(1);
    }
    printf("semaphore set created, semaphore set id %d\n", sem_set_id);
    child_pid = fork();
    switch (child_pid)
    {
    case -1:
        perror("fork");
        exit(1);
    case 0:                           /* child process */
        for (i = 0; i < NUM_LOOPS; i++)
        {
            sem_op.sem_num = 0;
            sem_op.sem_op = -1;       /* wait until the parent signals */
            sem_op.sem_flg = 0;
            semop(sem_set_id, &sem_op, 1);
            printf("producer: %d\n", i);
            fflush(stdout);
        }
        break;
    default:                          /* parent process */
        for (i = 0; i < NUM_LOOPS; i++)
        {
            printf("consumer: %d\n", i);
            fflush(stdout);
            sem_op.sem_num = 0;
            sem_op.sem_op = 1;        /* signal the child */
            sem_op.sem_flg = 0;
            semop(sem_set_id, &sem_op, 1);
            if (rand() > 3 * (RAND_MAX / 4))
            {
                delay.tv_sec = 0;
                delay.tv_nsec = 10;
                nanosleep(&delay, NULL);
            }
        }
        break;
    }
    return 0;
}

Output:

semaphore set created, semaphore set id 327690
producer: 0
consumer: 0
producer: 1
consumer: 1
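No listing is given for program 26. A compact sketch follows in which one process acts as the server and a forked child acts as the client over a Unix domain stream socket; the socket path /tmp/demo_socket is arbitrary and error checking is omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>

#define SOCK_PATH "/tmp/demo_socket"

int main(void)
{
    struct sockaddr_un addr;
    char buf[64];
    int sfd, cfd, n;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, SOCK_PATH);
    unlink(SOCK_PATH);                          /* remove any stale socket file */

    sfd = socket(AF_UNIX, SOCK_STREAM, 0);      /* server socket */
    bind(sfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(sfd, 1);

    if (fork() == 0)                            /* client process */
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        write(fd, "hello from client", 18);
        n = read(fd, buf, sizeof(buf));
        printf("client received: %.*s\n", n, buf);
        close(fd);
        exit(0);
    }

    cfd = accept(sfd, NULL, NULL);              /* server side */
    n = read(cfd, buf, sizeof(buf));
    printf("server received: %.*s\n", n, buf);
    write(cfd, "hello from server", 18);
    close(cfd);
    close(sfd);
    wait(NULL);
    unlink(SOCK_PATH);
    return 0;
}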
Week 10

27. Write client and server programs (using C) for interaction between server and client processes using Internet domain sockets. (A sketch is given at the end of this section.)

28. Write a C program that illustrates two processes communicating using shared memory.

DESCRIPTION:
Shared memory is an efficient means of passing data between programs. One program creates a memory portion which other processes (if permitted) can access. The problem with pipes, FIFOs and message queues is that for two processes to exchange information, the information has to go through the kernel. Shared memory provides a way around this by letting two or more processes share a memory segment. With shared memory, if one process is reading into some shared memory, for example, other processes must wait for the read to finish before processing the data.

A process creates a shared memory segment using shmget(). The original owner of a shared memory segment can assign ownership to another user with shmctl(). It can also revoke this assignment. Other processes with proper permission can perform various control functions on the shared memory segment using shmctl(). Once created, a shared segment can be attached to a process address space using shmat(). It can be detached using shmdt() (see shmop()). The attaching process must have the appropriate permissions for shmat(). Once attached, the process can read or write to the segment, as allowed by the permission requested in the attach operation. A shared segment can be attached multiple times by the same process. A shared memory segment is described by a control structure with a unique ID that points to an area of physical memory. The identifier of the segment is called the shmid. The structure definitions for the shared memory segment control structures and prototypes can be found in <sys/shm.h>.

shmget() is used to obtain access to a shared memory segment. It is prototyped by:

    int shmget(key_t key, size_t size, int shmflg);

The key argument is an access value associated with the segment ID. The size argument is the size in bytes of the requested shared memory. The shmflg argument specifies the initial access permissions and creation control flags. When the call succeeds, it returns the shared memory segment ID. This call is also used to get the ID of an existing shared segment (from a process requesting sharing of some existing memory portion).

The following code illustrates shmget():

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
...
key_t key;     /* key to be passed to shmget() */
int shmflg;    /* shmflg to be passed to shmget() */
int shmid;     /* return value from shmget() */
int size;      /* size to be passed to shmget() */
...
key = ...
size = ...
shmflg = ...

if ((shmid = shmget(key, size, shmflg)) == -1) {
    perror("shmget: shmget failed");
    exit(1);
} else {
    (void) fprintf(stderr, "shmget: shmget returned %d\n", shmid);
    exit(0);
}
...

Controlling a Shared Memory Segment

shmctl() is used to alter the permissions and other characteristics of a shared memory segment. It is prototyped as follows:

    int shmctl(int shmid, int cmd, struct shmid_ds *buf);

The process must have an effective shmid of owner, creator or superuser to perform this command. The cmd argument is one of the following control commands:

SHM_LOCK   -- Lock the specified shared memory segment in memory. The process must have the effective ID of superuser to perform this command.
SHM_UNLOCK -- Unlock the shared memory segment. The process must have the effective ID of superuser to perform this command.
IPC_STAT   -- Return the status information contained in the control structure and place it in the buffer pointed to by buf. The process must have read permission on the segment to perform this command.
IPC_SET    -- Set the effective user and group identification and access permissions. The process must have an effective ID of owner, creator or superuser to perform this command.
IPC_RMID   -- Remove the shared memory segment.

The buf is a structure of type struct shmid_ds, which is defined in <sys/shm.h>.

The following code illustrates shmctl():

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
...
int cmd;                   /* command code for shmctl() */
int shmid;                 /* segment ID */
struct shmid_ds shmid_ds;  /* shared memory data structure to hold results */
...
shmid = ...
cmd = ...
if ((rtrn = shmctl(shmid, cmd, &shmid_ds)) == -1) {
    perror("shmctl: shmctl failed");
    exit(1);
}
...

Attaching and Detaching a Shared Memory Segment

shmat() and shmdt() are used to attach and detach shared memory segments. They are prototyped as follows:

    void *shmat(int shmid, const void *shmaddr, int shmflg);
    int shmdt(const void *shmaddr);

shmat() returns a pointer, shmaddr, to the head of the shared segment associated with a valid shmid. shmdt() detaches the shared memory segment located at the address indicated by shmaddr.

The following code illustrates calls to shmat() and shmdt():

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

static struct state {   /* Internal record of attached segments. */
    int shmid;          /* shmid of attached segment */
    char *shmaddr;      /* attach point */
    int shmflg;         /* flags used on attach */
} ap[MAXnap];           /* State of current attached segments. */
int nap;                /* Number of currently attached segments. */
...
char *addr;             /* address work variable */
register int i;         /* work area */
register struct state *p;  /* ptr to current state entry */
...
p = &ap[nap++];
p->shmid = ...
p->shmaddr = ...
p->shmflg = ...
p->shmaddr = shmat(p->shmid, p->shmaddr, p->shmflg);
if (p->shmaddr == (char *)-1) {
    perror("shmop: shmat failed");
    nap--;
} else
    (void) fprintf(stderr, "shmop: shmat returned %#8.8x\n", p->shmaddr);
...
i = shmdt(addr);
if (i == -1) {
    perror("shmop: shmdt failed");
} else {
    (void) fprintf(stderr, "shmop: shmdt returned %d\n", i);
    for (p = ap, i = nap; i--; p++)
        if (p->shmaddr == addr)
            *p = ap[--nap];
}
...

Algorithm:
1. Start.
2. Create shared memory using the shmget() system call.
3. If successful, it returns a positive value, the segment ID.
4. Attach the created shared memory using the shmat() system call.
5. Write to the shared memory through the attached address.
6. Read the contents back from the shared memory through the attached address.
7. End.

Source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE 1024

int main(int argc, char *argv[])
{
    key_t key;
    int shmid;
    char *data;

    if (argc > 2)
    {
        fprintf(stderr, "usage: shmdemo [data_to_write]\n");
        exit(1);
    }
    if ((key = ftok("shmdemo.c", 'R')) == -1)   /* any existing file can serve as the key source */
    {
        perror("ftok");
        exit(1);
    }
    if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1)
    {
        perror("shmget");
        exit(1);
    }
    data = shmat(shmid, (void *)0, 0);
    if (data == (char *)(-1))
    {
        perror("shmat");
        exit(1);
    }
    if (argc == 2)
    {
        printf("writing to segment: \"%s\"\n", argv[1]);
        strncpy(data, argv[1], SHM_SIZE);
    }
    else
        printf("segment contains: \"%s\"\n", data);
    if (shmdt(data) == -1)
    {
        perror("shmdt");
        exit(1);
    }
    return 0;
}

Input:
# ./a.out swarupa

Output:
writing to segment: "swarupa"
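Program 27 has no listing in the manual either. Its structure mirrors the Unix domain sketch in Week 9, with sockaddr_in, the loopback address and a TCP port in place of a path name; the port number 8888 is arbitrary and error checking is again omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/wait.h>

#define PORT 8888

int main(void)
{
    struct sockaddr_in addr;
    char buf[64];
    int sfd, cfd, n;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");   /* loopback only */

    sfd = socket(AF_INET, SOCK_STREAM, 0);           /* server socket */
    bind(sfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(sfd, 1);

    if (fork() == 0)                                 /* client process */
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        write(fd, "hello over TCP", 15);
        n = read(fd, buf, sizeof(buf));
        printf("client received: %.*s\n", n, buf);
        close(fd);
        exit(0);
    }

    cfd = accept(sfd, NULL, NULL);                   /* server side */
    n = read(cfd, buf, sizeof(buf));
    printf("server received: %.*s\n", n, buf);
    write(cfd, "hello from server", 18);
    close(cfd);
    close(sfd);
    wait(NULL);
    return 0;
}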

Data Mining Lab

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the credit worthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not to, approve a loan application.

The German Credit Data:
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset: the (original) Excel spreadsheet version of the German credit data (download from the web). In spite of the fact that the data is German, you should probably make use of it for this assignment (unless you really can consult a real loan officer!).

A few notes on the German dataset:
- DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
- owns_telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
- foreign_worker: There are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.

There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.

Subtasks: (Turn in your answers to the following tasks.)

Laboratory Manual For Data Mining

EXPERIMENT-1

Aim: To list all the categorical (or nominal) attributes and the real-valued attributes using the Weka mining tool.

Tools/Apparatus: Weka mining tool.

Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse to the file that is already stored in the system (bank.csv).
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.

Sample Output:

EXPERIMENT-2

Aim: To identify the rules with some of the important attributes a) manually and b) using Weka.

Tools/Apparatus: Weka mining tool.

Theory:
Association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule respectively.

To illustrate the concepts, we use a small example from the supermarket domain.

The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table below.

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter.

Note: this example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).

The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

ALGORITHM:
Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two subproblems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraint of minimal confidence. Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}; association rules with this itemset are generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}; by checking the confidence this rule can be determined as interesting or not. Then other rules are generated by deleting the last items in the antecedent and inserting them into the consequent, and the confidences of the new rules are checked to determine their interestingness. Those processes iterate until the antecedent becomes empty. Since the second subproblem is quite straightforward, most of the research focuses on the first subproblem.

The Apriori algorithm finds the frequent sets L in database D. To find a frequent set Lk:
- Join step: Ck is generated by joining L(k-1) with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence it should be removed.
(Ck: candidate itemset of size k; Lk: frequent itemset of size k)


Apriori Pseudocode:

Apriori(T, ε)
    L1 <- { large 1-itemsets that appear in more than ε transactions }
    k <- 2
    while L(k-1) is not empty
        C(k) <- Generate(L(k-1))
        for each transaction t in T
            C(t) <- Subset(C(k), t)
            for each candidate c in C(t)
                count[c] <- count[c] + 1
        L(k) <- { c in C(k) | count[c] >= ε }
        k <- k + 1
    return the union over all k of L(k)

Procedure:
1) Given the Bank database for mining.
2) Select EXPLORER in the WEKA GUI Chooser.
3) Load Bank.csv in Weka by Open file in the Preprocess tab.
4) Select only nominal values.
5) Go to the Associate tab.
6) Select the Apriori algorithm from the Choose button present in the Associator: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
7) Select the Start button.
8) Now we can see the sample rules.

Sample output:


EXPERIMENT-3

Aim: To create a Decision tree by training a data set using the Weka mining tool.

Tools/Apparatus: Weka mining tool.

Theory:
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.

Classifications are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm. The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.

In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown. Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model, the other for testing the model. Scoring a classification model results in class assignments and probabilities for each case. For example, a model that classifies customers as low, medium, or high value would also predict the probability of each classification for each customer. Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.

Different Classification Algorithms: Oracle Data Mining provides the following algorithms for classification:
- Decision Tree: decision trees automatically generate rules, which are conditional statements that reveal the logic used to build the tree.
- Naive Bayes: naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.

Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse to the file that is already stored in the system (bank.csv).
5) Go to the Classify tab.
6) Choose the C4.5 algorithm, which is entitled J48 in Java, by clicking the Choose button and selecting trees > J48.
7) Select Test options: Use training set.
8) If needed, select an attribute.
9) Click Start.
10) Now we can see the output details in the Classifier output.
11) Right-click on the result list and select the visualize tree option.

Sample output:


The decision tree constructed by using the implemented C4.5 algorithm


EXPERIMENT-4

Aim: To find the percentage of examples that are classified correctly by using the above created decision tree model, i.e., testing on the training set.

Tools/Apparatus: Weka mining tool.

Theory:
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even though these features depend on the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined, and not the entire covariance matrix.

The naive Bayes probabilistic model:
The probability model for a classifier is a conditional model P(C | F1, ..., Fn) over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

    P(C | F1, ..., Fn) = P(C) P(F1, ..., Fn | C) / P(F1, ..., Fn)

In plain English the above equation can be written as

    posterior = (prior * likelihood) / evidence

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, ..., Fn), which can be rewritten as follows, using repeated applications of the definition of conditional probability:

    p(C, F1, ..., Fn)
    = p(C) p(F1, ..., Fn | C)
    = p(C) p(F1 | C) p(F2, ..., Fn | C, F1)
    = p(C) p(F1 | C) p(F2 | C, F1) p(F3, ..., Fn | C, F1, F2)
    = ...
    = p(C) p(F1 | C) p(F2 | C, F1) ... p(Fn | C, F1, F2, ..., Fn-1)

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j ≠ i. This means that p(Fi | C, Fj) = p(Fi | C), and so the joint model can be expressed as

    p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C) = p(C) ∏i p(Fi | C)

This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed like this:

    p(C | F1, ..., Fn) = (1/Z) p(C) ∏i p(Fi | C)

where Z (the evidence) is a scaling factor dependent only on F1, ..., Fn, i.e., a constant if the values of the feature variables are known. Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k - 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for prediction.

Bayes' theorem:

    P(h | D) = P(D | h) P(h) / P(D)

P(h)     : prior probability of hypothesis h
P(D)     : prior probability of training data D
P(h | D) : probability of h given D
P(D | h) : probability of D given h

Naive Bayes Classifier: Derivation
- D is a set of tuples; each tuple is an n-dimensional attribute vector X = (x1, x2, x3, ..., xn).
- Let there be m classes: C1, C2, C3, ..., Cm.
- The NB classifier predicts that X belongs to class Ci iff P(Ci | X) > P(Cj | X) for 1 <= j <= m, j ≠ i.
- Maximum posteriori hypothesis: P(Ci | X) = P(X | Ci) P(Ci) / P(X); maximize P(X | Ci) P(Ci), as P(X) is constant.
- With many attributes, it is computationally expensive to evaluate P(X | Ci), so make the naive assumption of class conditional independence:

    P(X | Ci) = ∏(k=1..n) P(xk | Ci) = P(x1 | Ci) * P(x2 | Ci) * ... * P(xn | Ci)

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse to the file that is already stored in the system (bank.csv).
6) Go to the Classify tab.
7) Choose Classifier > Tree.
8) Select NBTree, i.e., the naive Bayesian tree.
9) Select Test options: Use training set.
10) If needed, select an attribute.
11) Now start Weka.
12) Now we can see the output details in the Classifier output.

Sample output:

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         554               92.3333 %
Incorrectly Classified Instances        46                7.6667 %
Kappa statistic                          0.845
Mean absolute error                      0.1389
Root mean squared error                  0.2636
Relative absolute error                 27.9979 %
Root relative squared error             52.9137 %
Total Number of Instances              600

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.894     0.052     0.935       0.894    0.914       0.936      YES
                 0.948     0.106     0.914       0.948    0.931       0.936      NO
Weighted Avg.    0.923     0.081     0.924       0.923    0.923       0.936

=== Confusion Matrix ===

   a   b   <-- classified as
 245  29 |  a = YES
  17 309 |  b = NO

EXPERIMENT-5

Aim: To determine whether testing on the training set is a good idea.

Tools/Apparatus: Weka mining tool.

Procedure:
1) In Test options, select the Supplied test set radio button.
2) Click Set.
3) Choose the file which contains records that were not in the training set we used to create the model.
4) Click Start. (WEKA will run this test data set through the model we already created.)
5) Compare the output results with those of the 4th experiment.

Sample output: This can be experienced through the different problem solutions while doing practice.

The important numbers to focus on here are the numbers next to "Correctly Classified Instances" (92.3 percent) and "Incorrectly Classified Instances" (7.6 percent). Other important numbers are in the "ROC Area" column, in the first row (0.936). Finally, the "Confusion Matrix" shows the number of false positives and false negatives: the false positives are 29, and the false negatives are 17 in this matrix. Based on our accuracy rate of 92.3 percent, we say that upon initial analysis this is a good model. One final step in validating our classification tree is to run our test set through the model and check its accuracy. Comparing the "Correctly Classified Instances" from this test set with the "Correctly Classified Instances" from the training set, we see the accuracy of the model, which indicates that the model will not break down with unknown data, or when future data is applied to it.

EXPERIMENT-6

Aim: To create a Decision tree by cross-validation on the training data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory: Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making. This section deals with decision trees in data mining.

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all instances in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions. In data mining, trees can also be described as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data.

Data comes in records of the form:
(x, y) = (x1, x2, x3, ..., xk, y)

The dependent variable, y, is the target variable that we are trying to understand, classify or generalise. The vector x is composed of the input variables x1, x2, x3, etc., that are used for that task.
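As an alternative to the GUI procedure that follows, 10-fold cross-validation with J48 can be scripted through the Weka Java API. This is a sketch only; bank.arff with the class as the last attribute is an assumption:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();  // the C4.5 decision tree learner

        // 10-fold stratified cross-validation, matching steps 9) and 10) below
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}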

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Go to the Classify tab.
7) Choose Classifier → Trees.
8) Select J48.
9) In Test options, select Cross-validation.
10) Set Folds, e.g. 10.
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output.
14) Compare the output results with those of the 4th experiment.
15) Check whether the accuracy increased or decreased.
Sample output:


=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances         539               89.8333 %
Incorrectly Classified Instances        61               10.1667 %
Kappa statistic                          0.7942
Mean absolute error                      0.167
Root mean squared error                  0.305
Relative absolute error                 33.6511 %
Root relative squared error             61.2344 %
Total Number of Instances              600

=== Detailed Accuracy By Class ===


              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                0.861     0.071       0.911    0.861       0.886      0.883   YES
                0.929     0.139       0.889    0.929       0.909      0.883   NO
Weighted Avg.   0.898     0.108       0.899    0.898       0.898      0.883

=== Confusion Matrix ===

   a   b   <-- classified as
 236  38 |  a = YES
  23 303 |  b = NO

EXPERIMENT-7

Aim: To delete one attribute from the GUI Explorer and see the effect, using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with the list of available filters.

7) Select weka.filters.unsupervised.attribute.Remove.
8) Next, click on the text box immediately to the right of the "Choose" button.
9) In the resulting dialog box enter the index of the attribute to be filtered out (make sure that the "invertSelection" option is set to false).
10) Then click "OK". Now, in the filter box you will see "Remove -R 1".
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and create a new working relation.
12) To save the new working relation as an ARFF file, click on the Save button in the top panel.
13) Go to OPEN file and browse the newly saved (attribute-deleted) file.
14) Go to the Classify tab.
15) Choose Classifier → Trees.
16) Select the J48 tree.
17) In Test options, select "Use training set".
18) If needed, select the class attribute.
19) Now click Start.
20) Now we can see the output details in the Classifier output.
21) Right-click on the result list and select the "Visualize tree" option.
22) Compare the output results with those of the 4th experiment.
23) Check whether the accuracy increased or decreased.
24) Check whether removing this attribute has any significant effect.
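The Remove filter in steps 7)-11) can also be applied programmatically. A sketch, assuming bank.arff and that attribute 1 is the "id" attribute:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributeAndClassify {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");

        // Equivalent of "Remove -R 1" in the filter box
        Remove remove = new Remove();
        remove.setAttributeIndices("1");   // index of the attribute to drop (assumed: "id")
        remove.setInvertSelection(false);  // false = remove the listed attribute, keep the rest
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);

        filtered.setClassIndex(filtered.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(filtered);

        // "Use training set" evaluation, as in step 17)
        Evaluation eval = new Evaluation(filtered);
        eval.evaluateModel(tree, filtered);
        System.out.println(eval.toSummaryString());
    }
}

Note that setting invertSelection to true keeps only the listed attributes instead of removing them, which is the effect wanted in Experiment-8 below.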

Sample output:


EXPERIMENT-8

Aim: To select some attributes from the GUI Explorer, perform classification, and see the effect, using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Select from the attributes list the attributes that are to be removed. After this step only the attributes necessary for classification are left in the attributes panel.
7) Then go to the Classify tab.
8) Choose Classifier → Trees.
9) Select J48.
10) In Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output.
14) Right-click on the result list and select the "Visualize tree" option.
15) Compare the output results with those of the 4th experiment.
16) Check whether the accuracy increased or decreased.
17) Check whether removing these attributes has any significant effect.
Sample output:


EXPERIMENT-9

Aim: To create a Decision tree by cross-validation on the training data set, changing the cost matrix in the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Go to the Classify tab.
7) Choose Classifier → Trees.
8) Select J48.
9) In Test options, select "Use training set".
10) Click on More options.
11) Select cost-sensitive evaluation and click on the Set button.
12) Set the matrix values and click on Resize. Then close the window.
13) Click OK.
14) Click Start.
15) We can see the output details in the Classifier output.
16) In Test options, select Cross-validation.
17) Set Folds, e.g. 10.
18) If needed, select the class attribute.
19) Now click Start.
20) Now we can see the output details in the Classifier output.
21) Compare the results of steps 15 and 20.
22) Compare the results with those of experiment 6.
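Cost-sensitive evaluation (steps 10)-15)) can be sketched with the Weka API as below. The cost values are illustrative only, it is assumed that class index 0 is YES and index 1 is NO, and the exact CostMatrix setters can vary slightly between Weka versions:

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix: cell (i, j) = cost of predicting class j
        // for an instance whose actual class is i; the diagonal stays 0.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0);  // actual YES predicted NO: cost 5 (illustrative)
        costs.setCell(1, 0, 1.0);  // actual NO predicted YES: cost 1 (illustrative)

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Cost-sensitive evaluation on the training set
        Evaluation eval = new Evaluation(data, costs);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}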

Sample output:

EXPERIMENT-10

Aim: To check whether a small (short) rule is better than a long rule, i.e., to examine the bias, by training on the data set using the Weka mining tool.

Tools/Apparatus: Weka mining tool.
Procedure: This depends on the attribute set and on the relationships among the attributes that we want to study. It can be examined based on the database and the user's requirements.

EXPERIMENT-11

Aim: To create a Decision tree using pruned mode and reduced-error pruning, and to show the accuracy for a cross-validation trained data set, using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory: Reduced-error pruning:
- Each node of the (over-fit) tree is examined for pruning.
- A node is pruned (removed) only if the resulting pruned tree performs no worse than the original over the validation set.
- Pruning a node consists of:
  - removing the sub-tree rooted at the pruned node,
  - making the pruned node a leaf node,
  - assigning the pruned node the most common classification of the training instances attached to that node.
- Pruning nodes iteratively:
  - always select the node whose removal most increases the decision-tree accuracy over the validation set;
  - stop when further pruning decreases the decision-tree accuracy over the validation set.
Example rule: IF (Children = yes) AND (income >= 30000) THEN (car = Yes)

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.

4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier → Trees.
9) Select J48 (its pruning properties are used in the following steps).
10) In Test options, select "Use training set".
11) Right-click on the text box beside the Choose button and select "Show properties".
12) Now set the "reducedErrorPruning" property from false to true (keep "unpruned" false).
13) Adjust the number of folds used for reduced-error pruning as needed.
14) If needed, select the class attribute.
15) Now click Start.
16) Now we can see the output details in the Classifier output.
17) Right-click on the result list and select the "Visualize tree" option.
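Steps 11)-13) correspond to J48's pruning properties. A sketch of reduced-error pruning driven from the API, under the same bank.arff assumption as in the earlier experiments:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruningJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(false);            // pruning enabled
        tree.setReducedErrorPruning(true);  // use reduced-error pruning
        tree.setNumFolds(3);                // one fold is held out as the validation (pruning) set

        // Accuracy for the cross-validation trained data set
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Build on the full data to print the pruned tree (the "Visualize tree" equivalent)
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}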

Sample output:


EXPERIMENT-12

Aim: To compare the OneR classifier, which uses a single attribute and rule, with the J48 and PART classifiers, by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure for J48:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier → Trees.
9) Select J48.
10) In Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output.
14) Right-click on the result list and select the "Visualize tree" option.
(Or run from the command line: java weka.classifiers.trees.J48 -t c:\temp\bank.arff)

Procedure for OneR:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.

5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier → Rules.
9) Select OneR.
10) In Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output.

Procedure for PART:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system: bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier → Rules.
9) Select PART.
10) In Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output.
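The three runs above can be condensed into one API sketch that trains J48, OneR and PART on the same data and prints their training-set accuracies side by side (bank.arff and class-last are assumptions, as before):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRuleClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new OneR(), new PART() };
        for (Classifier model : models) {
            model.buildClassifier(data);
            // "Use training set" evaluation, as in the three procedures above
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(model, data);
            System.out.printf("%s: %.4f %% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}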

Attribute relevance with respect to the class: relevant attribute (science)
IF accounting = 1 THEN class = A (Error = 0, Coverage = 7 instances)
IF accounting = 0 THEN class = B (Error = 4/13, Coverage = 13 instances)

Sample output:
J48: java weka.classifiers.trees.J48 -t c:/temp/bank.arff


OneR

PART
