---
title: "k-means_clustering.rmd"
output: html_document
date: "2024-04-17"
---

```{r setup, include=FALSE}
# Load necessary libraries
library(tidyverse)
library(ggfortify)
```

## K-means Clustering

This document applies k-means clustering to the diabetes dataset. The chunk below loads the data, standardizes the numeric columns, fits k-means with three clusters, and plots the result on the first two principal components.

```{r kmeans}

# Step 1: Load the data
diabetes <- read.csv("./data/Diabetes_prediction.csv")

# Step 2: Preprocess the data
# K-means relies on Euclidean distance, so columns with larger ranges would dominate.
# scale() standardizes each numeric column to mean 0 and standard deviation 1.
numeric_cols <- sapply(diabetes, is.numeric)
diabetes_scaled <- scale(diabetes[, numeric_cols])

# Step 3: Perform K-means clustering
set.seed(123)  # for reproducibility
k <- 3  # number of clusters
diabetes_kmeans <- kmeans(diabetes_scaled, centers = k, nstart = 25)

# Step 4: Analyze the results
# Print cluster centers
print(diabetes_kmeans$centers)

# Add cluster assignments to the original data.
# Stored as a factor so that plots use a discrete color per cluster
# rather than a continuous gradient.
diabetes$cluster <- as.factor(diabetes_kmeans$cluster)

# Step 5: Visualize the results
# ggfortify's autoplot() projects the data onto the first two principal components;
# frame = TRUE outlines each cluster.
autoplot(prcomp(diabetes_scaled), data = diabetes, colour = 'cluster', frame = TRUE)
```
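
The chunk above fixes `k <- 3` by hand. One common way to sanity-check that choice is an elbow plot of the total within-cluster sum of squares over a range of `k` values. The following is a minimal sketch rather than part of the original analysis: `toy` is hypothetical synthetic data standing in for `diabetes_scaled` so that the example runs on its own, and the range `1:8` is an arbitrary assumption.

```r
# Hypothetical stand-in for diabetes_scaled: three synthetic groups in 2-D
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for each candidate k
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which extra clusters stop reducing WSS sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

To try this on the real data, `toy_scaled` would be replaced with `diabetes_scaled`. The elbow is a heuristic, so it is worth reading alongside the cluster centers printed above.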

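## Plotting the Clusters with ggplot2

The same projection can also be built by hand with `prcomp()` and `ggplot2`, which gives direct control over colors and labels: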

```{r pca-plot, echo=FALSE}
diabetes$cluster <- as.factor(diabetes_kmeans$cluster)  # Convert cluster to a factor for coloring

# Perform PCA on the scaled data
pca_results <- prcomp(diabetes_scaled)

# Extract the first two principal components
pc_data <- as.data.frame(pca_results$x[, 1:2])
colnames(pc_data) <- c("PC1", "PC2")
pc_data$cluster <- diabetes$cluster  # Add cluster information for coloring

# Create a scatter plot using ggplot2
ggplot(pc_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.5, size = 3) +
  scale_color_manual(values = c("red", "blue", "green")) +  # Custom colors for each cluster
  labs(title = "PCA of Diabetes Dataset with K-means Clustering",
       x = "Principal Component 1",
       y = "Principal Component 2",
       color = "Cluster") +
  theme_minimal()
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
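
Because the scatter plot shows only the first two principal components, it can help to check how much of the total variance they capture. The sketch below illustrates the calculation on the built-in `iris` data, which is used here only as a stand-in because the diabetes CSV is not reproduced in this document. The names `var_explained` and `pc12_share` are illustrative.

```r
# Stand-in data: numeric columns of the built-in iris dataset, standardized
iris_scaled <- scale(iris[, sapply(iris, is.numeric)])
pca <- prcomp(iris_scaled)

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of variance captured by PC1 and PC2 together
pc12_share <- sum(var_explained[1:2])

print(round(var_explained, 3))
print(round(pc12_share, 3))
```

Applied to `pca_results` from the chunk above, a low combined share would suggest that clusters which overlap in the 2-D plot may still be separated in the remaining components.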