
: KNN Pseudocode

1 Problem Statement

Input:
    l is an array of objects of length n
    k is the number of nearest neighbors
    d is a distance function for which the triangle inequality does not hold

Output: the index of the object in the array for which the sum of distances from the object to its k nearest neighbors is maximal (the outlier), where

$\text{knn-dist}(obj) = \sum_{i=1}^{k} \text{nndist}_{obj,i}$

and $\text{nndist}_{obj,i}$ is the distance from obj to its i-th closest neighbor.
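To make the definition of knn-dist concrete, here is a minimal Python sketch. The helper names `knn_dist` and `sq_diff` are illustrative and not part of the original write-up, and the sketch assumes that an object is not counted as its own neighbor. The squared difference is used only as a hypothetical example of a distance that violates the triangle inequality.

```python
import heapq
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def knn_dist(objs: Sequence[T], idx: int, k: int, d: Callable[[T, T], float]) -> float:
    """Sum of the distances from objs[idx] to its k nearest neighbors."""
    # Assumption: the object itself is skipped, so d(obj, obj) is never counted.
    dists = (d(objs[idx], other) for j, other in enumerate(objs) if j != idx)
    return sum(heapq.nsmallest(k, dists))


def sq_diff(a: float, b: float) -> float:
    """Hypothetical non-metric distance: squared difference."""
    return (a - b) ** 2


points = [0.0, 1.0, 2.0, 10.0]
print(knn_dist(points, 3, 2, sq_diff))  # 64 + 81 = 145.0
print(knn_dist(points, 0, 2, sq_diff))  # 1 + 4 = 5.0
```

With k = 2, the point 10.0 has by far the largest knn-dist of the four.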

2 Naive Brute-force Algorithm


    function OutlierSearch(l, k, d())
        dist ← float[n][n]                  // dist is an n-by-n matrix
        dist(i, j) ← d(i, j) ∀ i, j          // O(n^2) operations
        knn ← float[n][k]                   // knn is an n-by-k matrix
        for i = 1 to n do
            knn[i] ← min k distances in dist[i] in ascending order
        end for                             // nk operations for each element, so O(n^2 k) operations in total
        totaldist ← float[n]                // totaldist is an n-by-1 vector
        for i = 1 to n do
            totaldist[i] ← sum(knn[i])
        end for                             // nk operations for the entire loop
        return outlier = arg max_i (totaldist[i])
        // The above algorithm has an O(n^2 k) runtime.
    end function
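The brute-force search can be sketched in Python as follows. This is one possible reading of the pseudocode, not a definitive implementation: indices are 0-based, the self-distance dist[i][i] is skipped (an assumption the pseudocode does not state), and `heapq.nsmallest` stands in for "min k distances". The names `outlier_search_naive` and `sq_diff` are illustrative.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def outlier_search_naive(objs: Sequence[T], k: int, d: Callable[[T, T], float]) -> int:
    """Return the 0-based index of the object with the largest knn-dist."""
    n = len(objs)
    # Compute all n^2 distances up front, as in the pseudocode.
    dist: List[List[float]] = [[d(objs[i], objs[j]) for j in range(n)] for i in range(n)]

    totaldist: List[float] = []
    for i in range(n):
        # Assumption: an object is not its own neighbor.
        row = [dist[i][j] for j in range(n) if j != i]
        totaldist.append(sum(heapq.nsmallest(k, row)))

    # arg max over the summed k-nearest-neighbor distances.
    return max(range(n), key=totaldist.__getitem__)


def sq_diff(a: float, b: float) -> float:
    """Hypothetical non-metric distance: squared difference."""
    return (a - b) ** 2


print(outlier_search_naive([0.0, 1.0, 2.0, 10.0], 2, sq_diff))  # 3
```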

I tried to find an algorithm with a lower big-O complexity but was unsuccessful. Because the triangle inequality does not hold here, we cannot derive an upper bound for $D_{ik}$ even if we know $D_{ij} + D_{jk}$, which means we need to compute all $n^2$ distances. If we had a metric coordinate system, we could use structures such as k-d trees, or bins if we knew the density of the points. However, we can still reduce the average number of operations we need to perform. The first optimized algorithm uses an upper bound to do so.
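As a small illustration of why a known $D_{ij} + D_{jk}$ gives no bound on $D_{ik}$, the sketch below lists the index triples that violate the triangle inequality. The squared difference is a hypothetical stand-in for d, and `triangle_violations` is an illustrative helper, not something from the original write-up.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def triangle_violations(
    objs: Sequence[T], d: Callable[[T, T], float]
) -> List[Tuple[int, int, int]]:
    """Return index triples (i, j, k) with d(i, k) > d(i, j) + d(j, k)."""
    return [
        (i, j, k)
        for i, j, k in permutations(range(len(objs)), 3)
        if d(objs[i], objs[k]) > d(objs[i], objs[j]) + d(objs[j], objs[k])
    ]


def sq_diff(a: float, b: float) -> float:
    """Hypothetical non-metric distance: squared difference."""
    return (a - b) ** 2


print(triangle_violations([0.0, 1.0, 2.0], sq_diff))  # [(0, 1, 2), (2, 1, 0)]
```

Here d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2, so the two shorter legs say nothing about the longer one.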


3 Optimized Algorithm

    function OutlierSearch(l, k, d())
        dist ← float[n][n]                  // dist is an n-by-n matrix
        dist(i, j) ← d(i, j) ∀ i, j          // O(n^2) operations
        [a1...ak] ← min k distances in dist[1] in ascending order
        maxv ← sum([a1...ak])
        maxi ← 1
        for i = 2 to n do
            val_i ← max(dist(i, 1)...dist(i, n))    // takes O(n) operations to find the max
            if val_i · k ≥ maxv then        // val_i is the largest element, so k · val_i is an upper bound on knn-dist(i); if it is below maxv, object i can be skipped
                [a1...ak] ← min k distances in dist[i] in ascending order
                if sum([a1...ak]) > maxv then
                    maxv ← sum([a1...ak])
                    maxi ← i
                end if
            end if                          // in this case, still nk operations to find the k smallest distances for the element
        end for
        // In the average case, the first knn-dist has probability one half of being smaller or greater than the knn-dist of each of the other objects.
        // Thus the runtime should be O((n/2) · n) + O((n/2) · nk).
        return maxi
    end function
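A minimal Python sketch of the upper-bound idea follows. It assumes 0-based indices, skips self-distances, and additionally returns how many objects were pruned so the effect of the bound is visible. The pruning rule used here is: skip object i when k times its largest distance is already below the best knn-dist found so far. The names `outlier_search_bound` and `sq_diff` are illustrative only.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def outlier_search_bound(
    objs: Sequence[T], k: int, d: Callable[[T, T], float]
) -> Tuple[int, int]:
    """Return (index of the object with the largest knn-dist, number of objects pruned)."""
    n = len(objs)
    dist: List[List[float]] = [[d(objs[i], objs[j]) for j in range(n)] for i in range(n)]

    def row_without_self(i: int) -> List[float]:
        # Assumption: an object is not its own neighbor.
        return [dist[i][j] for j in range(n) if j != i]

    maxi = 0
    maxv = sum(heapq.nsmallest(k, row_without_self(0)))
    pruned = 0
    for i in range(1, n):
        row = row_without_self(i)
        # k * (largest distance) bounds the sum of the k smallest distances from above.
        if k * max(row) < maxv:
            pruned += 1  # object i cannot beat the current maximum
            continue
        currv = sum(heapq.nsmallest(k, row))
        if currv > maxv:
            maxi, maxv = i, currv
    return maxi, pruned


def sq_diff(a: float, b: float) -> float:
    """Hypothetical non-metric distance: squared difference."""
    return (a - b) ** 2


print(outlier_search_bound([10.0, 0.0, 1.0, 2.0], 2, sq_diff))  # (0, 1)
```

In this small example only one object is pruned, because every row contains a large distance to the outlier. The bound k · val_i is loose, so how much work it saves depends heavily on the data.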

Optimized Algorithm

There are a couple of other optimizations you can also perform. One of them takes advantage of the fact that the knn-dist you calculate from any k distances you have observed so far is an upper bound on knn-dist(i): as more distances are examined, it can only decrease. Thus, if at any point this running upper bound drops below the current maximal knn-dist, you can safely break, because it will only get smaller. In the worst case, the first k elements of dist(i) are exactly the k farthest neighbors, so you have to traverse the entire row dist(i) before the running sum falls below your current max.
    function OutlierSearch(l, k, d())
        dist ← float[n][n]                  // dist is an n-by-n matrix
        dist(i, j) ← d(i, j) ∀ i, j          // O(n^2) operations
        [a1...ak] ← min k distances in dist[1] in ascending order
        maxv ← sum([a1...ak])
        maxi ← 1
        for i = 2 to n do
            [b1...bk] ← sort([dist(i, 1)...dist(i, k)]) in ascending order
            currv ← sum([b1...bk])
            for j = k + 1 to n do
                if dist(i, j) < bk then
                    [b1...bk] ← sort([b1...b(k−1), dist(i, j)]) in ascending order
                    currv ← sum([b1...bk])
                    if currv < maxv then
                        break               // currv is an upper bound on knn-dist(i), so if it is < maxv now, it will only get smaller
                    end if
                end if
            end for
            if currv > maxv then
                maxv ← currv
                maxi ← i
            end if
        end for
        return maxi
    end function
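The early-break variant can be sketched in Python as below. This is an illustration under stated assumptions rather than the author's exact procedure: indices are 0-based, self-distances are skipped, k is at most n − 1, and `bisect.insort` replaces the full re-sort of the k-element list. The function also returns the number of inner-loop entries it visited, which is an addition for demonstration purposes. The names `outlier_search_early_break` and `sq_diff` are hypothetical.

```python
import bisect
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def outlier_search_early_break(
    objs: Sequence[T], k: int, d: Callable[[T, T], float]
) -> Tuple[int, int]:
    """Return (index of the object with the largest knn-dist, inner-loop entries visited)."""
    n = len(objs)
    dist: List[List[float]] = [[d(objs[i], objs[j]) for j in range(n)] for i in range(n)]

    def row_without_self(i: int) -> List[float]:
        # Assumption: an object is not its own neighbor.
        return [dist[i][j] for j in range(n) if j != i]

    maxi = 0
    maxv = sum(sorted(row_without_self(0))[:k])
    visited = 0
    for i in range(1, n):
        row = row_without_self(i)
        b = sorted(row[:k])   # k smallest distances seen so far, ascending
        currv = sum(b)        # running upper bound on knn-dist(i)
        for x in row[k:]:
            if currv < maxv:
                break         # the bound can only shrink, so i cannot be the outlier
            visited += 1
            if x < b[-1]:
                b.pop()                  # drop the largest of the current k
                bisect.insort(b, x)      # keep b sorted
                currv = sum(b)
        if currv > maxv:
            maxi, maxv = i, currv
    return maxi, visited


def sq_diff(a: float, b: float) -> float:
    """Hypothetical non-metric distance: squared difference."""
    return (a - b) ** 2


print(outlier_search_early_break([0.0, 1.0, 2.0, 10.0], 2, sq_diff))  # (3, 2)
```

In this example the inner loop visits 2 entries, whereas scanning every remaining entry of rows 1 to 3 would visit 3. The break test runs before each step, so a row whose first k distances already sum below the current maximum is skipped without entering the loop body.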

If you use this optimized algorithm, it would almost never go through the entire inner loop over j. That would require (A) that the objects be arranged so that, as i ascends, knn-dist(i) is in descending order, and furthermore (B) that within each row, as j ascends, dist(i, j) is in descending order. Since there is only one such arrangement out of n!·n! permutations, I think one can conclude that the chance of this is very low. I'm not exactly sure about the average-case analysis of this algorithm, but I expect the average case to be better than $O(kn^2)$, where $n^2$ is the number of entries in dist and k is the number of nearest neighbors, since we need to keep track of a list of the k nearest neighbors.
