Topic: multi variable regression- or maybe just a line chart

charlie

  • Jackass In Charge
  • Posts: 7896
  • Karma: +84/-53
multi variable regression- or maybe just a line chart
« on: May 15, 2014, 03:17:34 PM »
So...

For work, I want to estimate something. Let's call it the size of a customer's data set. Higher-ups want to figure this value out exactly, but I think that's a waste of time (also it's actually probably easy and I'm just ignorant). Instead, I want to do some quick queries for information that probably correlates to the total size of the data set.

I have 10-20 samples that I can gather data from. I can get the actual size and the values for the quick queries. I'd like to identify which query best correlates with the actual size so I can use it in the product.

I see people on websites I read do this all the time. I have value X, plus values A, B, C and D, and I want to know which of A-D best correlates with X.

Can somebody tell me how? :)

I'm using Excel right now, although I do have R/RStudio on another machine (which I don't know how to use), if that's relevant.

If I didn't know these statistical tools existed, I would normally just plot values A-D and value X on the same graph and see which line looked closest. But I can't even do that (the series need two different Y-axis scales).

Help!!

Please. :)
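On the two-Y-axis problem above: one common workaround is to standardize each series (subtract its mean, divide by its standard deviation) so that everything lands on the same unitless scale and can share one chart. A minimal sketch, assuming made-up numbers; `z_scores`, `actual_size` and `query_count` are hypothetical names, not anything from the real data:

```python
import statistics

def z_scores(values):
    """Rescale a series to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / spread for v in values]

# Made-up illustration data on very different scales.
actual_size = [1200.0, 3400.0, 560.0, 8900.0]
query_count = [12.0, 30.0, 7.0, 85.0]

# After standardizing, both series can be drawn against a single Y axis.
size_z = z_scores(actual_size)
count_z = z_scores(query_count)
print([round(z, 3) for z in size_z])
print([round(z, 3) for z in count_z])
```

A scatter plot of each candidate against the actual size is another option and avoids the shared-axis issue entirely.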

kermi3

  • ?
  • Ass Wipe
  • Posts: 5513
  • Karma: +56/-22
Re: multi variable regression- or maybe just a line chart
« Reply #1 on: May 15, 2014, 04:24:47 PM »
So to make sure I have it right - you have data set A, and you want to see if data set X, Y, or Z best correlates to data set A?

There are two ways to do this. The first, and simplest, is to do a simple correlation - which is basically your line chart.  In Excel the function is =CORREL().  The output is the r value, which tells you how strong the correlation is.  The values run from -1 to 1.  0 means there's no linear correlation.  1 means a perfect positive correlation, and -1 a perfect negative one.  If you square the r you get a more sensible estimate of the strength of the correlation.  You could fill in this sentence: Variable X accounts for <R^2 x 100>% of the variation in variable A.
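For anyone who wants to see the arithmetic outside Excel, here is a minimal sketch of the same calculation in Python: Pearson's r for each candidate against the actual size, squared, then sorted best-first. The data and the names `pearson_r`, `rank_candidates`, `query_a` and `query_b` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (the quantity Excel's CORREL returns)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

def rank_candidates(target, named_series):
    """Return (name, r squared) pairs, strongest correlation first."""
    scored = [(name, pearson_r(series, target) ** 2)
              for name, series in named_series.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up illustration data: actual size plus two candidate quick-query values.
actual_size = [10.0, 20.0, 30.0, 40.0, 50.0]
candidates = {
    "query_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # exactly linear in size
    "query_b": [5.0, 1.0, 4.0, 2.0, 3.0],  # mostly noise
}

for name, r2 in rank_candidates(actual_size, candidates):
    print(f"{name}: r^2 = {r2:.3f}")
```

Squaring r throws away the sign, so a strongly negative correlation ranks as high as a strongly positive one, which is usually what you want when picking a predictor.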

The other thing you could do is a bit more powerful, but more complex.  It's called multiple regression.  The point of multiple regression is to drop all of the variables into one (generally) linear formula to predict the dependent variable.  Basically, you drop X, Y, and Z into the regression and you get the following formula that predicts A:

A = B_0 + B_1 * X + B_2 * Y + B_3 * Z    (B_0 is the constant term)

This should look familiar - it's an extension of your old middle school y = mx + b.

The output will tell you the R value and p value (significance) of the model as a whole.  The Beta values are the regression coefficients for each variable (when standardized, they behave like unique correlation coefficients).  Each one has its own p value that tells you whether that variable is actually a significant unique predictor of variation in the model or just dead weight.
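To make the formula concrete, here is a rough sketch of an ordinary least squares fit in plain Python using the normal equations. It returns the coefficients and the model R^2 only; it does not compute p values, which a stats package would give you. The function names and the data are made up, and the data is built so that A = 2 + 3*X + 0.5*Y exactly.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_multiple_regression(predictor_rows, target):
    """Return ([b0, b1, ...], r_squared) for target ~ b0 + b1*x1 + b2*x2 + ..."""
    # The leading 1.0 in each row is the intercept (constant) column.
    design = [[1.0] + list(row) for row in predictor_rows]
    width = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(width)]
           for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(width)]
    coeffs = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(coeffs, r)) for r in design]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - p) ** 2 for y, p in zip(target, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return coeffs, 1.0 - ss_res / ss_tot

# Made-up data: each row is (X, Y); A is generated from a known formula.
rows = [(1.0, 4.0), (2.0, 1.0), (3.0, 7.0), (4.0, 2.0), (5.0, 9.0)]
a_values = [2.0 + 3.0 * x + 0.5 * y for x, y in rows]

coeffs, r_squared = fit_multiple_regression(rows, a_values)
print("coefficients:", [round(b, 6) for b in coeffs])
print("R^2:", round(r_squared, 6))
```

With only 10-20 samples, every extra predictor eats into what the fit can reliably tell you, so a high R^2 from several variables deserves more suspicion than the same number from one.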

This method is more complex, but it is also more powerful.  It really shouldn't be TOO hard in R.  I've never used R before though, so I'm not totally sure how to walk you through it.  I'd be happy to explain more if you'd like - but school just let out and it's birthday burgers and whiskey time!
govtcheez03:  i kind of look for it - i seek out stupidity and annoy it until it either gets better, gets banned, or goes away on its own

charlie

  • Jackass In Charge
  • Posts: 7896
  • Karma: +84/-53
Re: multi variable regression- or maybe just a line chart
« Reply #2 on: May 15, 2014, 04:40:42 PM »
Happy Birthday kermi! Enjoy it.

I'll play with CORREL a bit for now. Thanks!

kermi3

  • ?
  • Ass Wipe
  • Posts: 5513
  • Karma: +56/-22
Re: multi variable regression- or maybe just a line chart
« Reply #3 on: May 16, 2014, 10:35:27 AM »
Thanks!  It was great.  Good luck playing!  Let me know if I can help - my guess is that the multiple regression would be really easy to do and I'm happy to help you understand the printout if you want.

charlie

  • Jackass In Charge
  • Posts: 7896
  • Karma: +84/-53
Re: multi variable regression- or maybe just a line chart
« Reply #4 on: May 16, 2014, 01:53:33 PM »
CORREL gave me back good information. Let's see if I take the time to try the multiple regression just for fun.

I got R^2 values of .95, .91 and .90 for the three variables I was looking at. I'll probably gather more information to see if anything is even better and make sure it applies to more samples.

Thanks!

kermi3

  • ?
  • Ass Wipe
  • Posts: 5513
  • Karma: +56/-22
Re: multi variable regression- or maybe just a line chart
« Reply #5 on: May 17, 2014, 12:40:56 AM »
Shit man. You got an R SQUARED of .95????  You're done.  Don't bother with a thing else.  That variable accounts for NINETY-FIVE percent of the variation in your target variable? Fuck it.  That's psycho high. 

I'm usually ecstatic with .35.

charlie

  • Jackass In Charge
  • Posts: 7896
  • Karma: +84/-53
Re: multi variable regression- or maybe just a line chart
« Reply #6 on: May 17, 2014, 10:17:41 AM »
Ha ha... This is a slightly different field than yours. :)

But yeah I was just going to gather more samples to be sure. I also want to determine if there are any major outliers. If I did the stuff in R it would be for the fun of it.

charlie

  • Jackass In Charge
  • Posts: 7896
  • Karma: +84/-53
Re: multi variable regression- or maybe just a line chart
« Reply #7 on: May 21, 2014, 07:55:47 PM »
Well... I put in more data and my numbers tanked (relatively speaking).

Funny thing was that I have 3 groups of data I've added. First, with just group B, I got that .95 R^2 for the COV data set. I added group C and it went pretty far down. But then I tried subtracting CO from COV and got .98 R^2. That made me happy-- more data and an even better number.

Then I added group A and everything went to shit. Now COV is best again, but it's at .59.

Maybe that's pretty good, but still, I got spoiled early. Oh, wait, if I delete C23 from the dataset it gets better (0.85). Hmmm... I hate that one piece of data changes that number so drastically. Is there a different value for how volatile the data is?
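One simple way to put a number on that kind of sensitivity is a leave-one-out check: recompute R^2 with each sample removed in turn and see which removal moves it most. A minimal sketch, assuming made-up data (five points on a line plus one stray) and hypothetical function names; Cook's distance is a more formal regression measure in the same spirit.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov * cov / (var_x * var_y)

def leave_one_out_r_squared(xs, ys):
    """Return (index, r squared with that sample removed) for every sample."""
    results = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        results.append((i, r_squared(rest_x, rest_y)))
    return results

# Made-up data: the last point sits far off the line the others follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 1.0]

full = r_squared(xs, ys)
loo = leave_one_out_r_squared(xs, ys)
# The sample whose removal changes r squared the most is the most influential.
most_influential = max(loo, key=lambda item: abs(item[1] - full))

print(f"all samples: r^2 = {full:.3f}")
for index, value in loo:
    print(f"without sample {index}: r^2 = {value:.3f}")
print("most influential sample index:", most_influential[0])
```

If one removal swings the number as much as C23 does, that sample is worth inspecting by hand before deciding whether it is bad data or a sign that a variable is missing.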

By the way, I'm not really doing this for work as they've decided on doing something else. This is just for fun now. My data is here for the curious:
https://docs.google.com/spreadsheets/d/1IcdUnMv2SpdcVernGZAHvg2OU9CIhfNXfshWJ46yW3g/edit?usp=sharing


Edit: I looked at C23 and noticed it had a very high COO count, so I added that variable to the data. COO + COV + OL gives an R^2 of 0.94.
« Last Edit: May 21, 2014, 08:28:12 PM by charlie »