As a student I seemed to miss out on statistics. I did the mean, median, and mode, and even managed standard deviation, but that was about it. I got through my professional life with that level of understanding of the subject, but lately I have had a bit of an Epiphany (well, Easter is approaching!). That is Statistics – I now get it!
We now run lots of instruments in the lab that can produce mountains of data, and I spend a lot of my time trying to make sense of it. To be honest, we are not really very good at it. How do you know when the data is good? Are there patterns in this data or is it just noise? How do you know which parameters are key to the properties that you are trying to optimise? Sometimes it all gets too much and you find yourself overwhelmed by the complexity of the whole thing. Some people refer to the task as “data mining”, and that is what it feels like: really hard work and sometimes not much to show for it at the end. Well, the answer is statistics. I will now try to explain how it works.
A classic problem that we have is processing FTIR spectral data. In each spectrum there are 700 data points, and we can easily have data sets of 100 spectra to process. What we used to do was measure the peak heights of the bands of the materials of interest, maybe with a reference peak for normalization, and use that to predict the level of the component that we are interested in. This can quite simply be automated, which is good, but unfortunately with spectra you often get interference: our peaks of interest get masked by other peaks, and the whole process stops working. What we want is something that will go through a set of data for which we know the level of material x and look for patterns that explain the differences between the spectra. It then builds a model, so that when we give it 100 samples with unknown levels of x, it gives us an accurate prediction based on what it has learned from the training set of spectra.
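To make the old peak-height approach concrete, here is a minimal Python sketch of it. Everything in it is an assumption for illustration: the spectra are synthetic Gaussian bands, the band positions (1250, 1280 and 2900) are made up, and `band` and `peak_ratio` are hypothetical helper names, not anything from our instrument software.

```python
import numpy as np

# Synthetic wavenumber axis with 700 points per spectrum (positions are made up).
wavenumbers = np.linspace(4000, 700, 700)

def band(centre, width):
    """Gaussian absorption band on the wavenumber axis."""
    return np.exp(-0.5 * ((wavenumbers - centre) / width) ** 2)

def peak_ratio(absorbance, analyte_pos, reference_pos):
    """Analyte peak height divided by reference peak height (nearest grid point)."""
    analyte = absorbance[np.argmin(np.abs(wavenumbers - analyte_pos))]
    reference = absorbance[np.argmin(np.abs(wavenumbers - reference_pos))]
    return analyte / reference

def spectrum(level, interferent=0.0):
    """Analyte band scaled by level, a fixed reference band, optional interferent."""
    return (level * 0.05 * band(1250, 30)
            + 0.4 * band(2900, 40)
            + interferent * band(1280, 40))

# Calibrate: straight line through peak ratio versus known level.
levels = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
ratios = np.array([peak_ratio(spectrum(lv), 1250, 2900) for lv in levels])
slope, intercept = np.polyfit(ratios, levels, 1)

def predict(absorbance):
    """Predicted level of x from the peak-ratio calibration line."""
    return slope * peak_ratio(absorbance, 1250, 2900) + intercept

clean = predict(spectrum(5.0))                     # no interference
masked = predict(spectrum(5.0, interferent=0.3))   # overlapping band present
print(f"true level 5.0, clean: {clean:.2f}, with interference: {masked:.2f}")
```

In this toy case the clean sample comes back at its true level, while the sample with an overlapping band is badly over-estimated, because the neighbouring peak adds to the height we are measuring. That is the failure we wanted a pattern-finding model to cope with.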
Well, amazing as it may sound, this is just the job that the latest statistical treatments can do. There are lots of acronyms out there for all of the approaches, each with its own strengths (LS, CLS, ILS, PCA, PCR, PLS1, PLS2), but with the power of a PC they can do the number crunching that brings sense to what can look like a rather random set of data.
We used PLS (partial least squares). From what I have gathered, this is one of the most powerful algorithms for this sort of work. Our project was quite tricky and I thought that we might as well use the best. Checking up on the background, I found this on Wikipedia about PLS:
“Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space.”
That is quite self-explanatory. I especially like the bit about projecting the variables into a new space. Honestly, trying to work out how it does it is impossible unless you have done degree-level maths with a lot of matrix algebra. It is heavy-duty stuff, but luckily, to do it (make the models and run the analysis) you don’t have to understand it. The software is surprisingly easy (even for me, known to be not good on software issues) and it does it all for you.
This is how it works.
- Take the spectra of your training set. We use 10 standards and take 5 spectra from each, which gives us 50 spectra for the training set. (This is the bit that takes all the work; the rest is easy.)
- Load the spectra into GRAMS IQ and click on analyse. It does its work (3 seconds) and gives a correlation graph with a key value, R squared. We want this to be as close to 1 as possible: 0.85 is OK, 0.9 good, 0.99 very good, 0.995 great.
- Tune the model. This involves changing the parts of the spectra to use (e.g. trying just 1700-700 usually helps get the R squared up), preprocessing with auto-baseline, or even getting it to work on the 1st derivative. All this sounds complex but it is not, and in 5 minutes you can get the R squared from 0.8 to 0.98 by fiddling with the parameters and keeping an eye on whether it goes up or down. When you get as close to 1 as you can, you tell the software to make a calibration file for the next stage.
- Load the calibration file into the spectral software. We run Agilent FTIRs and it is now simple to put the calibration file into the Agilent Microlab PC software. Write a simple test routine and get the analysis running, which now all happens seamlessly in the background of the standard Microlab software.
This has been a lot of fun, but also a revelation as to what can be done nowadays by software and PCs. I will be using PLS for other jobs now that I have seen what it can do, not just on spectral data but on other tasks where we are struggling to get on top of loads of multivariate data. I always remember the old Microsoft slogan, “work smarter, not harder”. I think that is what it is all about.