Applications of K-Means Clustering in Security Domain

Among the many Unsupervised Machine Learning techniques, one of the most popular is K-Means Clustering. But a question arises: how is K-Means used in real-world scenarios? Stick with me in this blog to learn more about K-Means Clustering and its applications in the Security Domain.

Sep 5, 2021

What is Unsupervised Machine Learning ?

Unsupervised Learning is a kind of machine learning that deals with unlabeled, unclassified data. These algorithms help us group unsorted information according to similarities, patterns, and differences, without any prior training on labeled examples.

In industry, there are a lot of situations where we don't have any labeled historical data but still need to apply machine learning to real-time data. For example, server logs arrive in real time, and if we want to find unusual activity in those logs, we have to work without labeled past examples. Here we need to apply different kinds of Unsupervised Machine Learning algorithms.

Some of the Algorithms that come under Unsupervised Machine Learning are :

K-means Clustering
Hierarchical Clustering
Anomaly detection
Neural Networks (for example, autoencoders)
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition

What is Clustering ?

Clustering is an exploratory data analysis technique that helps us understand unlabeled data. The basic idea of Clustering is to identify subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.

Types of Clustering :

Hierarchical clustering
Partitioning clustering

Hierarchical clustering is further subdivided into :

Agglomerative clustering : It is a bottom-up approach where we begin with each element as a separate cluster and merge them into successively larger clusters.
Divisive clustering : It is a top-down approach where we begin with the whole set and proceed to divide it into successively smaller clusters.
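
To make the bottom-up idea concrete, here is a minimal sketch of agglomerative clustering on one-dimensional data. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible choices, and the function name and sample values are made up for illustration.

```python
def agglomerative(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)   # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical 1-D data with three obvious groups.
result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3)
print(result)  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Divisive clustering would run in the opposite direction: start from one cluster holding all five points and keep splitting it.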

Partitioning clustering is further subdivided into :

K-Means clustering : In this method, the objects are divided into the number of clusters given by “K”. We choose the value of “K”, and the algorithm finds that many clusters in our data.
Fuzzy C-Means clustering : It is very similar to k-means in the sense that it clusters objects that have similar characteristics together. In k-means clustering, a single object cannot belong to two different clusters. But in c-means, objects can belong to more than one cluster.
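
The difference between the two can be sketched for a single point and a fixed set of cluster centers. The snippet below assumes the standard fuzzy c-means membership formula with fuzziness exponent m = 2; the centers and the point are hypothetical values chosen for illustration.

```python
import math


def hard_assignment(point, centroids):
    """K-means style: return the index of the single nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style: return one membership weight per centroid.

    The weights sum to 1; m > 1 controls how fuzzy the memberships are.
    """
    distances = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


centroids = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical cluster centers
point = (4.0, 0.0)                     # lies between the two centers

print(hard_assignment(point, centroids))                          # 0
print([round(u, 3) for u in fuzzy_memberships(point, centroids)])  # [0.692, 0.308]
```

With k-means the point simply belongs to cluster 0, while c-means says it belongs mostly to cluster 0 and partly to cluster 1.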

What is K-Means Clustering ?

K-Means Clustering is an Unsupervised Learning algorithm that groups an unlabeled dataset into different clusters. Here K is the pre-defined number of clusters to be created: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

The k-means clustering algorithm mainly performs two tasks:

Determines the best positions for the K center points (centroids) by an iterative process.
Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

The process K-Means Clustering follows is mentioned below :

1st, select the value of “K”, which fixes the number of clusters.
2nd, choose K random points as the initial centroids. These can be any points; they do not have to be data points from the dataset.
3rd, assign every data point to its nearest centroid. For that we can use either Euclidean distance or a correlation-based distance.
4th, recompute each centroid as the average of all the data points assigned to its cluster.
Finally, repeat the 3rd and 4th steps until the assignments stop changing (or a maximum number of iterations is reached). At that point our model is created, and any new data point can be allocated to its nearest centroid.
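
The iterative process can be sketched in plain Python as below. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, picks the initial centroids from the data points, and the function name, the `max_iters` and `seed` parameters and the sample points are all made up for the example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of tuples of floats) into k groups.

    Returns (centroids, labels); labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the average of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which group gets index 0 and which gets index 1 depends on the random starting centroids, so the cluster numbers themselves carry no meaning.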

One Example of K-Means Clustering :

Let's say we have an online grocery delivery service that also runs supermarket outlets. Based on customers' interests and the locations most customers order from, the company wants to set up stores so that they cover most of the area. For that they can use “K-Means Clustering”.

They first need to decide how many stores they can open in that location.
Next they do market analysis to understand which areas are in high demand, that is, from which exact locations customers are ordering. Each customer's location becomes a data point in that high-demand area.
Lastly they run the “K-Means algorithm” with K equal to the number of stores they want to set up, and the resulting centroids suggest where to place the stores.

Applications of K-Means Clustering in Security Domain :

Nowadays the most critical thing for a company is “Data”, because for them data is business. That's why a lot of hackers try to steal that data. And from the company's perspective, they always need to secure this data and store it safely.

Here comes the power of the many different algorithms that have been implemented to stop the hackers. Let's discuss some of the use cases where companies are using “K-Means” clustering to detect possible cyber attacks.

Cyber Profiling :

Profiling means trying to classify what is known and unknown to us about a particular individual or group. The idea of Cyber Profiling comes from criminal profiles, which give the investigation division information to classify the types of criminals who were at the crime scene.
In this era of billions of internet users with very different interests, it's very hard to tell the difference between a cyber criminal and a general user. But some brilliant minds have analyzed human behavior to find the patterns in it. They then continuously analyze the internet activities of suspected users and, using a lot of different algorithms, classify those activities. One of the methods used for this is “K-Means clustering”.
Because personal behavior differs from one individual to another, inductive generalizations can be very reliable but can also lead to misreadings in behavior analysis. Therefore the cyber-profiling process uses a combination of deductive and inductive methods. For investigations, the cyber-profiling process makes a good contribution to the field of forensic computer science.

The diagram below shows how K-Means Clustering works in Cyber Profiling :

Let's say that after applying the K-Means algorithm to the internet traffic of the suspected users, we get the graph below :

Cluster 1 means these users have normal web activity. Cluster 2 has slightly suspicious web activity, whereas the users who belong to cluster 3 have very suspicious web activity.

From this kind of analysis, we can draw conclusions or predict which users are most likely to be cyber criminals.
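
One practical detail: K-Means only hands back arbitrary cluster numbers, so someone still has to decide which cluster is the "normal" one and which is the "very suspicious" one. The sketch below shows one possible way to do that, by ordering clusters by their average score. The per-user "suspicion score", the cluster ids and the function name are all hypothetical, invented only for this illustration.

```python
from statistics import mean


def name_clusters(scores, labels, names):
    """Map each cluster id to a name, ordered from lowest to highest mean score."""
    cluster_ids = sorted(set(labels))
    ordered = sorted(
        cluster_ids,
        key=lambda c: mean(s for s, lab in zip(scores, labels) if lab == c),
    )
    return dict(zip(ordered, names))


# Hypothetical data: one made-up suspicion score per user, plus the cluster id
# that a K=3 k-means run assigned to that user.
scores = [0.02, 0.05, 0.03, 0.40, 0.35, 0.90, 0.85]
labels = [0, 0, 0, 2, 2, 1, 1]

names = ["normal", "slightly suspicious", "very suspicious"]
cluster_names = name_clusters(scores, labels, names)
print(cluster_names)
# {0: 'normal', 2: 'slightly suspicious', 1: 'very suspicious'}

for score, lab in zip(scores, labels):
    print(score, cluster_names[lab])
```

Here cluster id 1 turns out to be the most suspicious group even though it is not the highest-numbered one, which is why the naming step is needed.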

Crime Pattern Analysis :

Another use case of K-Means clustering is in security devices and security programs. Each day, millions of customer requests hit the servers of popular companies. In that much traffic it's very hard to track whether an attack is happening. Most cyber attacks aim either to overload the servers or to steal data from the databases.

Using the K-Means clustering method, these companies keep analyzing the millions of web requests that reach their servers. Their goal is to group the web traffic into multiple clusters based on its behaviour, so that traffic belonging to a cluster that looks like a possible cyber attack can be blocked. There are a lot of security devices and software products that use Unsupervised Learning techniques in the back end to detect possible cyber attacks.
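
As a rough sketch of the blocking idea, suppose the clusters have already been fitted and an analyst has marked one of them as a likely attack. A new request is then assigned to its nearest centroid and blocked if that cluster is flagged. The two features, the centroid values and the flagged cluster below are assumptions made up for this example and do not come from any real product.

```python
import math

# Hypothetical centroids from an earlier k-means fit on two made-up features,
# both scaled to the range 0..1: (request rate, error rate).
CENTROIDS = [(0.05, 0.02), (0.30, 0.10), (0.90, 0.60)]

# Cluster indices that an analyst has marked as likely attacks (assumption).
ATTACK_CLUSTERS = {2}


def nearest_cluster(features):
    """Return the index of the centroid closest to `features`."""
    return min(
        range(len(CENTROIDS)),
        key=lambda j: math.dist(features, CENTROIDS[j]),
    )


def should_block(features):
    """Block the request if it falls into a flagged cluster."""
    return nearest_cluster(features) in ATTACK_CLUSTERS


print(should_block((0.04, 0.01)))  # False: looks like ordinary browsing
print(should_block((0.95, 0.70)))  # True: high rate with many errors
```

In practice the centroids would be refitted regularly, since what counts as normal traffic changes over time.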

Another example of Crime Pattern analysis is shown in the picture below :

Final Words :

These are two widely known use cases of K-Means Clustering in the Security Domain. But besides the security domain, there are a lot of examples where K-Means and similar algorithms are used. Technology keeps developing day by day, but the core concepts and logic behind it stay the same, and these kinds of algorithms are key to the success of a lot of companies.
I tried to discuss a little bit about Clustering, Unsupervised ML, K-Means and some real-world implementations. If you find this blog informative, hit that clap button to let me know you liked it.
I keep writing blogs on Machine Learning, DevOps Automation, Cloud Computing, Big Data etc. So, if you want to read my future blogs, follow me on Medium. You can also ping me on LinkedIn; check out my LinkedIn profile below…

Raktim Midya - Microsoft Learn Student Ambassadors (Beta) - Microsoft | LinkedIn
www.linkedin.com

Thanks Everyone for reading. That’s all… Signing Off… 😊