Preprocessing data
A very large portion of the work of machine learning is preprocessing, or “labelling,” the data. Here, we have to break the data apart into chunks our perceptrons can take in and describe our data in a consistent and concrete framework.
I’m going to show only a sample of the preprocessed data here because of the volume of space it would otherwise take up. What we are doing here is called supervised learning because we (or some other person) are responsible for labelling the data in some way. More specifically, we would call this classification, a kind of supervised learning where the data set is sorted into qualitative categories. Let’s discuss what’s going on below. First, the hex code is broken into its RGB constituents. Those are then converted into decimal numbers. Finally, since each of those decimal values ranges between 0 and 255, we transform them into numbers between 0 and 1 by dividing each value by 255. This gives us our three-dimensional input vector of color components.
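To make those steps concrete, here is a minimal sketch of the transformation in Python. The language choice and the helper name `hex_to_input` are my own for illustration, not part of the original pipeline:

```python
def hex_to_input(hex_code):
    """Split a 6-character hex color into its R, G, B pairs, convert each
    pair to decimal, and scale it down to the 0-1 range."""
    pairs = [hex_code[i:i + 2] for i in (0, 2, 4)]        # "0ffafa" -> ["0f", "fa", "fa"]
    decimals = [int(pair, 16) for pair in pairs]          # -> [15, 250, 250]
    # Rounded to two places for readability, matching the example values in the text.
    return [round(value / 255, 2) for value in decimals]  # -> [0.06, 0.98, 0.98]

print(hex_to_input("0ffafa"))  # [0.06, 0.98, 0.98]
```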
Next, we label this with zeros and a one for the color we believe it to be. 0ffafa becomes 0f, fa, fa, which is 15, 250, 250 in decimal notation, which normalizes to (.06, .98, .98). Our output vector becomes (0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0). We repeat this for each of the original samples in the training data set.
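For the labelling step, a sketch along the same lines might look like the following. The category names and their ordering are my own guesses (only the position of the 1 is taken from the example above); the real list depends on how the training data was actually classified:

```python
# Hypothetical color categories -- the actual names and ordering come from
# whoever labelled the training data; only the slot for "aqua" (index 5)
# is chosen here so it matches the example output vector above.
COLORS = ["red", "orange", "yellow", "green", "teal", "aqua", "blue",
          "purple", "pink", "brown", "grey", "white", "black"]

def label_for(color_name):
    """Build a one-hot output vector: a 1 in the chosen color's slot, 0 elsewhere."""
    return [1 if color == color_name else 0 for color in COLORS]

# Pair the normalized input from the previous step with its one-hot label.
sample = ([0.06, 0.98, 0.98], label_for("aqua"))
print(sample[1])  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```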
Now you can see yet a third place where bias can crop up: is caffee really aqua? Could it not be blue, or perhaps even mint? Data classification by humans will inherently introduce some sort of bias into a machine learning system.