Machine Learning Data Set 1

Our First Data Set

Training data is the backbone of all machine learning: it is the data we use to determine what our neural network will look like.

The neural network is a sort of graph that we use to model and better understand our machine learning tool. We’ll start to build that as soon as we see our data.

Here, our data will be a long series of hex codes that refer to colors. We will display each code alongside its color, and hopefully you will immediately see the two most likely places for bias to crop up in machine learning.

I have even arranged this data somewhat suggestively. First, I’ll note that each hex code is ordered RGB: the first 2 digits give the saturation (strictly speaking, the intensity) of red in hexadecimal, the next 2 give that of green, and the final 2 give that of blue. This suggests we choose three input perceptrons, one corresponding to each of those colors. This is somewhat analogous to how the cone cells in our eyes work!
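To make the RGB layout concrete, here is a minimal Python sketch that splits a hex code into its three channel values. The function names `hex_to_rgb` and `to_input_vector` are made up for illustration, and scaling each channel to the range 0 to 1 is an assumption about how the inputs might be fed to the network, not something the data set requires.

```python
def hex_to_rgb(code: str) -> tuple[int, int, int]:
    """Split a 6-digit hex color such as '#FF8000' into (red, green, blue)."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {code!r}")
    # Each pair of hex digits encodes one channel in the range 0..255.
    red = int(digits[0:2], 16)
    green = int(digits[2:4], 16)
    blue = int(digits[4:6], 16)
    return red, green, blue


def to_input_vector(code: str) -> list[float]:
    """Scale each channel to 0.0..1.0 (an assumed convention) for the three inputs."""
    return [channel / 255 for channel in hex_to_rgb(code)]
```

For example, `hex_to_rgb("#FF8000")` gives `(255, 128, 0)`: full red, about half green, and no blue.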

What is the first factor that might induce bias here?

It appears that some colors may be oversampled while others are underrepresented.
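One rough way to check for this kind of imbalance is to count how many samples lean toward each channel. The Python sketch below is only an illustration: the `sample` list is invented rather than taken from our data set, the helper names are hypothetical, and "dominant channel" is a deliberately crude stand-in for a real color category.

```python
from collections import Counter


def dominant_channel(code: str) -> str:
    """Return 'red', 'green', or 'blue', whichever channel is strongest."""
    digits = code.lstrip("#")
    channels = {
        "red": int(digits[0:2], 16),
        "green": int(digits[2:4], 16),
        "blue": int(digits[4:6], 16),
    }
    # On a tie (for example a pure gray), max() keeps the first name listed.
    return max(channels, key=channels.get)


def channel_counts(codes: list[str]) -> Counter:
    """Tally how many hex codes fall under each dominant channel."""
    return Counter(dominant_channel(code) for code in codes)


# Invented sample for illustration only, not the data set from this lesson.
sample = ["#FF0000", "#EE1111", "#CC2200", "#00FF00", "#0000FF"]
counts = channel_counts(sample)
```

In this invented sample, three of the five codes are mostly red, so a network trained on it would see far more reds than greens or blues.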

The next big choice about our network is how we will label this data. How many categories do we want? Do we want separate categories for light green, hunter green, mint green, dark green, slime green, etc., or do we want just a single category for green? This step is crucial. I’ll save you the trouble: the class I initially did this with made some democratic choices, and the list we arrived at is on the right. Here we find a second potential source of bias in our algorithm: what categories did we choose?

Imagine a fruit and vegetable classifying algorithm that had only two options: berry or leafy vegetable. It’s easy to find scenarios where that could prove very faulty. How might it classify brussels sprouts, peanuts, peas, or potatoes?

We end up with thirteen categories. Right now, our network looks something like the one below. We take in a vector with 3 numbers (red, green, and blue saturation) and output a vector with 13 entries where, hopefully, one of them is a 1 indicating the color we are interested in and all the rest are zeroes!
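The 13-entry output vector described above is often called a one-hot encoding, and it can be sketched in a few lines of Python. The category names in `CATEGORIES` are placeholders invented for this sketch; the class's actual list of thirteen is not reproduced here. The helpers `one_hot` and `decode` are likewise hypothetical names.

```python
# Placeholder labels (an assumption): substitute the thirteen the class chose.
CATEGORIES = [
    "red", "orange", "yellow", "green", "blue", "purple", "pink",
    "brown", "black", "white", "gray", "teal", "tan",
]


def one_hot(label: str) -> list[int]:
    """Build the target vector: 1 in the label's position, 0 everywhere else."""
    if label not in CATEGORIES:
        raise ValueError(f"unknown category: {label!r}")
    return [1 if name == label else 0 for name in CATEGORIES]


def decode(output: list[float]) -> str:
    """Read a network output by picking the category with the largest entry."""
    best = max(range(len(output)), key=lambda i: output[i])
    return CATEGORIES[best]
```

A real network rarely outputs exact ones and zeroes, which is why `decode` takes the largest entry rather than looking for a literal 1.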