Look at the sky from any place with low light pollution and you will notice that it is full of stars. The stars show a multitude of colors because they do not all have the same temperature. To define color precisely, astronomers have devised quantitative methods for characterizing the color of a star and then using those colors to determine stellar temperatures. The different colors are explained by Wien's law: a star's color does not depend on its distance, but on the wavelength at which its radiation peaks, and it gives us insight into the star's surface temperature.
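As a quick illustration of Wien's law, the sketch below computes the wavelength at which a star's emission peaks for a given surface temperature (the function name is mine; the constant is the standard Wien displacement constant b ≈ 2.898 × 10⁻³ m·K):
# Wien's displacement law: lambda_max = b / T
wienPeak <- function(temperature_kelvin) {
  b <- 2.898e-3                # Wien's displacement constant (m * K)
  b / temperature_kelvin       # peak wavelength in meters
}
wienPeak(5778) * 1e9           # the Sun: ~502 nm, in the green part of the spectrum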
Furthermore, every color that we perceive can be expressed as a combination of Red, Green and Blue (RGB). My goal is to retrieve the RGB value of a star with some clustering techniques and then compare that color to a dataset containing the most common star colors (spectra) and their corresponding temperature ranges. In this way I want to estimate the star's surface temperature.
In this article I will use R (CRAN) and clustering techniques for image processing in order to classify stars based on their spectrum and surface temperature. All the images are free to use and can be taken from various astronomy websites, particularly the NASA Photojournal.
Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.[1]
In this research, after retrieving the RGB combination from some sample/test star images, I will apply clustering, color isolation and color-distance comparison algorithms to extract information. For the clustering of the images' RGB values I will use the CLARA algorithm (Clustering Large Applications), as it performs best with large datasets and is the most suitable for the purpose of this research.
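To give an idea of why CLARA scales well (it runs PAM on random sub-samples of the data and keeps the best set of medoids, instead of clustering all points at once), here is a minimal sketch on toy data; the samples and sampsize values are purely illustrative and not the settings used for the star images:
library(cluster)
set.seed(1)
toy <- matrix(runif(3000), ncol = 3)             # 1000 toy "pixels" with 3 channels
cl <- clara(toy, k = 2, samples = 5, sampsize = 200)
cl$medoids                                       # one representative point per cluster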
For the first test case I will use VEGA from the above table, and I display the image dimensions (height, width):
library(jpeg)
library(cluster)
vega <- readJPEG("images\\vega.jpg")   # height x width x 3 (R, G, B) array, values in [0, 1]
dimension <- dim(vega); dimension[1:2]
## [1] 800 800
Now that I have read the image, I will retrieve the RGB values that I need and plot the image from these values.
RGBValues <- data.frame(
  x = rep(1:dimension[2], each = dimension[1]),   # column index of each pixel
  y = rep(dimension[1]:1, dimension[2]),          # row index, flipped so the image plots upright
  r.value = as.vector(vega[,,1]),
  g.value = as.vector(vega[,,2]),
  b.value = as.vector(vega[,,3]))
plot(y ~ x, data = RGBValues, main = "VEGA",
     col = rgb(RGBValues[c("r.value", "g.value", "b.value")]),
     asp = 1, pch = ".")
Once I have all the data, I need to determine the optimal number of clusters (k) for the image processing:
I use CLARA and test k from 2 to 10 (the silhouette is not defined for a single cluster). For each k I compare the average silhouette value, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other points in its cluster and b(i) is the mean distance to the nearest other cluster. It measures how well each object has been classified and ranges from −1 (inconsistent) to +1 (consistent).
var <- c()
# Start from k = 2: clara()$silinfo is NULL for k = 1, so var[1] stays NA
for (k in 2:10) {
  clara_ <- clara(RGBValues[, c("r.value", "g.value", "b.value")], k)
  var[k] <- clara_$silinfo$avg.width
}
plot(var, type = 'l',
     main = "Optimal number of clusters for VEGA",
     xlab = "Number of clusters",
     ylab = "Average silhouette",
     col = "blue")
From the above chart I can see that the optimal number of clusters is k = 2 (highest silhouette value). This can be partially explained by the fact that the target image has two dominant colors: the star's color and the dark color of space (the background).
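The same choice can also be made programmatically instead of being read off the chart (a small sketch using the var vector computed above; which.max ignores the NA left at k = 1):
best_k <- which.max(var)
best_k   # 2 for the VEGA image, matching the chart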
Once I have the optimal k, I can run CLARA:
star <- RGBValues[, c("r.value", "g.value", "b.value")]
clara <- clara(star, 2)
plot(silhouette(clara))
The silhouette plot shows that the clustering is consistent, so we can proceed.
Now I can plot the clustered image:
colors <- rgb(clara$medoids[clara$clustering, ])
plot(y ~ x,
     data = RGBValues,
     main = "VEGA",
     col = colors,
     asp = 1,
     pch = ".")
Then I create a dataframe containing the information that I need for the analysis:
colors_data <- as.data.frame(table(colors))
colors_data$distribution <- round((((colors_data$Freq)/sum(colors_data$Freq)) * 100), 2)
colors_data
## colors Freq distribution
## 1 #2F414D 436735 68.24
## 2 #A3BED9 203265 31.76
And plot a simple pie chart to show the two colors with their percentage distributions:
colors_data$colors <- as.character(colors_data$colors)
pie(colors_data$Freq,
    labels = colors_data$distribution,
    col = colors_data$colors,
    xlab = "Colors",
    ylab = "Frequency")
From the above pie chart I can see the first problem to solve: isolating the star's color from its surroundings/background.
As images can capture different proportions of a star's surroundings, I cannot simply create an algorithm that automatically excludes the color with the higher percentage distribution (as in this case), because it is likely that an error would occur and the algorithm would output false results.
Here is an example: both pictures display the same object, but from different perspectives. If I implemented an algorithm that automatically excluded the color with the higher percentage distribution, in the first picture it would exclude the object (cluster) that I want to analyze, while in the second it would exclude the surroundings.
To solve this, I created a function that computes the brightness of a color from its RGB values, so that I can exclude the darkest color (space itself has no brightness, or only some residual brightness after clustering, depending on the quality of the images).
In my dataframe the colors are expressed as HEX values, so I will convert them into RGB values with a simple built-in R function:
hexToRgb <- function(x) {
  # col2rgb() returns a 3x1 matrix of 0-255 values; collapse it to an "R G B" string
  paste(as.vector(col2rgb(x)), collapse = " ")
}
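For example, the lighter cluster color from the table above converts as follows:
hexToRgb("#A3BED9")
## [1] "163 190 217"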
Then I can create a function that returns the brightness of a given RGB color, using the luma (Rec. 709) coefficients:
getBrightness <- function(rgb) {
  vector <- unlist(strsplit(rgb, " "))
  r <- strtoi(vector[1])
  g <- strtoi(vector[2])
  b <- strtoi(vector[3])
  # Rec. 709 luma: green contributes most to perceived brightness
  0.2126 * r + 0.7152 * g + 0.0722 * b
}
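A quick check against the two cluster colors reproduces the brightness values reported in the table below:
getBrightness("47 65 77")      # the RGB of "#2F414D"
## [1] 62.0396
getBrightness("163 190 217")   # the RGB of "#A3BED9"
## [1] 186.2092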
colors_data$brightness <- unlist(lapply(lapply(colors_data$colors, hexToRgb), getBrightness))
Now I can display the colors’ brightness and exclude the darkest one:
color_to_analyze <- colors_data[colors_data$brightness == max(colors_data$brightness), ]$colors
colors_data
| colors | Freq | distribution | brightness |
|---|---|---|---|
| #2F414D | 436735 | 68.24 | 62.0396 |
| #A3BED9 | 203265 | 31.76 | 186.2092 |
Now that I have the star's color, I need an algorithm that, for a given color, returns the star's surface temperature (in Kelvin). Another thing to consider is that not all stars have the same shade of color: some are dark blue, some light blue, and so on.
My idea is to create a dataframe with the 5 most common colors/shades and the corresponding surface temperatures (as shown in the first image in the introduction section); then, for each star analyzed, I evaluate its RGB values against the database's RGB values and return the most similar one using the Euclidean distance:
1. Create a Euclidean distance function for RGB:
checkColorsDifference <- function(col1, col2) {
  vector1 <- unlist(strsplit(col1, " "))
  vector2 <- unlist(strsplit(col2, " "))
  r <- (strtoi(vector1[1]) - strtoi(vector2[1]))**2
  g <- (strtoi(vector1[2]) - strtoi(vector2[2]))**2
  b <- (strtoi(vector1[3]) - strtoi(vector2[3]))**2
  sqrt(r + g + b)
}
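As a sanity check, the distance between the star color found above (163 190 217) and the white-blue entry of the database defined in the next step reproduces the matchColor value reported later:
checkColorsDifference("163 190 217", "232 251 255")
## [1] 99.62931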
2. Create a database of star colors and temperatures (in Kelvin). I used this website to retrieve the RGB values of each star in the picture:
rgbColors <- c("30 125 241", "232 251 255", "179 176 7", "209 108 36", "194 6 10")
temperatures <- c("28000-18000", "11000-7500", "6000-5000", "5000-3600", "3600-2000")
# stringsAsFactors = FALSE keeps the colors as character strings on R < 4.0
db_data <- data.frame(rgbColors, temperatures, stringsAsFactors = FALSE)
db_data
| rgbColors | temperatures |
|---|---|
| 30 125 241 | 28000-18000 |
| 232 251 255 | 11000-7500 |
| 179 176 7 | 6000-5000 |
| 209 108 36 | 5000-3600 |
| 194 6 10 | 3600-2000 |
3. I check the similarity of my color against each color in the database:
The similarity value follows the Euclidean distance rule mentioned above: the higher the distance value, the less similar the match. In my case I need the most similar match (lowest distance value).
db_data$matchColor <- unlist(lapply(db_data$rgbColors, checkColorsDifference, col1=hexToRgb(color_to_analyze)))
temperature <- db_data[db_data$matchColor == min(db_data$matchColor), ]$temperatures
db_data
| rgbColors | temperatures | matchColor |
|---|---|---|
| 30 125 241 | 28000-18000 | 149.96666 |
| 232 251 255 | 11000-7500 | 99.62931 |
| 179 176 7 | 6000-5000 | 211.07345 |
| 209 108 36 | 5000-3600 | 203.96323 |
| 194 6 10 | 3600-2000 | 278.68620 |
print(paste0("The star has a temperature between ", temperature, " K"))
## [1] "The star has a temperature between 11000-7500 K"
Let's check whether the algorithm returns the correct surface temperature range:
Vega's surface temperature is approximately 9,600 K, so the clustering algorithm does return the right temperature range.
Now let’s try with our most familiar star: THE SUN!
I will group all the steps into a single function so that I can quickly test my algorithm, and then I run it:
checkTemperature <- function(star_file) {
  img <- readJPEG(star_file)
  dimension <- dim(img); dimension[1:2]
  RGBValues <- data.frame(
    x = rep(1:dimension[2], each = dimension[1]),
    y = rep(dimension[1]:1, dimension[2]),
    r.value = as.vector(img[,,1]),
    g.value = as.vector(img[,,2]),
    b.value = as.vector(img[,,3]))
  plot(y ~ x, data = RGBValues, main = star_file,
       col = rgb(RGBValues[c("r.value", "g.value", "b.value")]),
       asp = 1, pch = ".")
  # Silhouette profile, kept for inspection; silhouettes are undefined for k = 1
  var <- c()
  for (k in 2:10) {
    clara_ <- clara(RGBValues[, c("r.value", "g.value", "b.value")], k)
    var[k] <- clara_$silinfo$avg.width
  }
  star <- RGBValues[, c("r.value", "g.value", "b.value")]
  clara <- clara(star, 2)   # k = 2, the optimum found above; which.max(var) would generalize this
  colors <- rgb(clara$medoids[clara$clustering, ])
  colors_data <- as.data.frame(table(colors))
  colors_data$distribution <- round((colors_data$Freq / sum(colors_data$Freq)) * 100, 2)
  colors_data$colors <- as.character(colors_data$colors)
  colors_data$brightness <- unlist(lapply(lapply(colors_data$colors, hexToRgb), getBrightness))
  color_to_analyze <- colors_data[colors_data$brightness == max(colors_data$brightness), ]$colors
  rgbColors <- c("30 125 241", "232 251 255", "179 176 7", "209 108 36", "194 6 10")
  temperatures <- c("28000-18000", "11000-7500", "6000-5000", "5000-3600", "3600-2000")
  db_data <- data.frame(rgbColors, temperatures, stringsAsFactors = FALSE)
  db_data$matchColor <- unlist(lapply(db_data$rgbColors, checkColorsDifference,
                                      col1 = hexToRgb(color_to_analyze)))
  temperature <- db_data[db_data$matchColor == min(db_data$matchColor), ]$temperatures
  print(paste0("The star has a temperature between ", temperature, " K"))
}
checkTemperature("Images\\sun.jpg")
## [1] "The star has a temperature between 6000-5000 K"
Considering that the Sun's surface temperature is around 5778 K [2], the algorithm returned the correct range in this case as well.
Once I have tested the algorithm and verified that it provides the correct results for the test cases, the next step is to improve its computation time.
Considering that the running time is directly related to the picture's size and quality, one of the most direct ways to speed up the computation is to reduce/compress the image resolution without altering the star's spectrum.
Principal component analysis is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.[3]
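As a minimal sketch of the reconstruction idea applied below (a toy 3×2 matrix, not one of the star images): for nearly rank-one data, keeping only the first principal component reproduces the matrix almost exactly, using the same x %*% t(rotation) product that is later applied to each image channel.
m <- matrix(c(1, 2, 3, 2.1, 3.9, 6.2), ncol = 2)    # second column is roughly twice the first
p <- prcomp(m, center = FALSE)
approx1 <- p$x[, 1, drop = FALSE] %*% t(p$rotation[, 1, drop = FALSE])
max(abs(m - approx1))                               # small reconstruction error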
library(tidyverse)
library(jpeg)
library(factoextra)
library(knitr)
sun <- readJPEG("Images\\sun.jpg")
r <- sun[,,1]
g <- sun[,,2]
b <- sun[,,3]
# PCA on each channel matrix; center = FALSE so the image can be
# reconstructed later with a plain matrix product
pcar <- prcomp(r, center = FALSE)
pcag <- prcomp(g, center = FALSE)
pcab <- prcomp(b, center = FALSE)
pcaimage <- list(pcar, pcag, pcab)
df <- data.frame(scheme=rep(c("R", "G", "B"), each=nrow(sun)),
index=rep(1:nrow(sun), 3),
var=c(pcar$sdev^2,
pcag$sdev^2,
pcab$sdev^2))
# Put the channels in R, G, B order for the plots
df$scheme <- factor(df$scheme, levels = c("R", "G", "B"))
I plot the variance of each RGB channel. I can notice that the skewness of Red and Green is related to their low presence in the picture, while Blue is less skewed and has a lower variance because of its predominance.
df %>%
group_by(scheme) %>%
mutate(propvar=100*var/sum(var)) %>%
ungroup() %>%
ggplot(aes(x=index, y=propvar, fill=scheme)) +
geom_bar(stat="identity") +
geom_line() +
labs(title="Scree plot", x="Principal Component",
y="% of Variance") +
scale_x_continuous(limits=c(0, 20)) +
facet_wrap(~scheme) +
theme_bw() +
theme(legend.title=element_blank(),
legend.position="bottom")
I can expand this analysis by plotting the cumulative proportion of the variance:
df %>%
group_by(scheme) %>%
mutate(propvar=100*var/sum(var)) %>%
mutate(cumsum=cumsum(propvar)) %>%
ungroup() %>%
ggplot(aes(x=index, y=cumsum, fill=scheme)) +
geom_bar(stat="identity") +
geom_line() +
labs(title="Cumulative proportion of variance explained",
x="Principal Component", y="Cumulative % of Variance") +
scale_x_continuous(limits=c(0, 20)) +
facet_wrap(~scheme) +
theme_bw() +
theme(legend.title=element_blank(),
legend.position="bottom")
Now I can use these PCA fundamentals to compress the image: specifically, I will reconstruct the image using [2, 30, 200, 300] principal components, then run my algorithm to check whether the results are altered and measure the computation time.
# PCs values
pcnum <- c(2, 30, 200, 300)
# Reconstruct the image four times, once per number of principal components
for (i in pcnum) {
  pca.img <- sapply(pcaimage, function(j) {
    # Reconstruction from the first i principal components of each channel
    j$x[, 1:i] %*% t(j$rotation[, 1:i])
  }, simplify = 'array')
  # Clamp to [0, 1]: the reconstruction can slightly overshoot the valid channel range
  pca.img <- pmin(pmax(pca.img, 0), 1)
  writeJPEG(pca.img, paste("PCA IMAGE/Image reconstruction with",
                           round(i, 0), "principal components.jpg"))
  startTime <- Sys.time()
  checkTemperature(paste("PCA IMAGE/Image reconstruction with",
                         round(i, 0), "principal components.jpg"))
  print(paste(i, " principal components"))
  print(Sys.time() - startTime)
}
## [1] "The star has a temperature between 6000-5000 K"
## [1] "2 principal components"
## Time difference of 14.42367 secs
## [1] "The star has a temperature between 6000-5000 K"
## [1] "30 principal components"
## Time difference of 14.75798 secs
## [1] "The star has a temperature between 6000-5000 K"
## [1] "200 principal components"
## Time difference of 17.30336 secs
## [1] "The star has a temperature between 6000-5000 K"
## [1] "300 principal components"
## Time difference of 17.63228 secs
Below you can see the image reconstructed with [2, 30, 200, 300] principal components:
I verified that while the results remained the same, the computation time decreased as the number of principal components was reduced, leading to a more efficient algorithm.
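To quantify the gain against an uncompressed baseline, the same timing pattern can be applied to the original image (a sketch reusing the function defined above; output not shown):
startTime <- Sys.time()
checkTemperature("Images\\sun.jpg")   # baseline: original, uncompressed image
print(Sys.time() - startTime)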
Considering the computing improvement for a single star analysis, imagine how much the computational cost would benefit from PCA when running the same algorithm on a large number of stars.
We verified that a star's spectrum is directly correlated with its surface temperature, and that with a proper clustering algorithm we can clean and classify stars following Wien's law. This algorithm can be freely used and compared against the provided dataset of RGB/temperature criteria.
Furthermore, PCA proved to be a fundamental tool for improving the algorithm; used on a large scale, it can lead to a huge improvement in computing time.
If you find this topic interesting, you can expand my analysis with a different test dataset, or you can extend the algorithm to process multiple stars in one picture.