Topic: Stats people, significance testing

Perspective

  • badfish
  • Jackass In Charge
  • Posts: 4635
  • Karma: +64/-22
    • http://jeff.bagu.org
Stats people, significance testing
« on: October 11, 2012, 12:17:38 PM »
I'm trying to figure out what type of test I should use for statistical significance here.

I have groups of data, and each data point can have a bunch of types. My hypothesis is that data points of the same type tend to co-occur in the same groups. So if I see an X in a group, I'm more likely to see another X than random chance would predict. How do I test significance here?

I can compute the distribution of types over the whole population, and for a given type I can compute the probability of seeing a second item of the same type over all groups (then I can iterate over all types). But I'm not sure which test is suitable here. The data is not normally distributed, so a t-test is out. Is there some way to frame this as a chi-squared test?

Computer scientists don't generally do real science, so I'm a little out of my league here. Kermi, Jawib, I'm looking at you here!

kermi3

  • ?
  • Ass Wipe
  • Posts: 5513
  • Karma: +56/-22
Re: Stats people, significance testing
« Reply #1 on: October 11, 2012, 07:10:20 PM »
So basically you have a categorical dependent variable...

How many types (levels) are there?  What's the distribution look like?  Any other variables?

Assuming there are no other variables involved, I think you're talking about a chi^2 test for goodness of fit.

http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Goodness_of_fit
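For reference, a minimal sketch of the goodness-of-fit statistic mentioned above, in plain Python. The function name and the toy counts are assumptions made for illustration; nothing here comes from the actual data set.

```python
def chi_squared_statistic(observed, expected_probs):
    """Pearson's chi-squared goodness-of-fit statistic.

    observed       -- observed counts per category
    expected_probs -- hypothesized probability of each category (sums to 1)
    """
    total = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        expected = total * p                      # expected count under the null
        stat += (obs - expected) ** 2 / expected
    return stat


# Toy example (hypothetical counts): two categories assumed equally likely.
stat = chi_squared_statistic([44, 56], [0.5, 0.5])
# With k categories the statistic has k - 1 degrees of freedom. For 1 degree
# of freedom the 5% critical value is 3.841, so this toy result would not
# reject the null of equal proportions.
print(round(stat, 2))
```

As the next reply points out, this tests proportions over the whole population, which is not quite the within-group hypothesis in the original question.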
govtcheez03:  i kind of look for it - i seek out stupidity and annoy it until it either gets better, gets banned, or goes away on its own

Perspective

Re: Stats people, significance testing
« Reply #2 on: October 11, 2012, 07:18:55 PM »
The issue is that my hypothesis is not over the entire population, it's within an arbitrary group. To use the example from the Wikipedia article, I'm not hypothesizing that the numbers of men and women are equal. My hypothesis is that if you see a man in a group, you are more likely to see another man.

Assuming men and women are equal in the population (for this example), I'd expect the probability of seeing a man, given you've already seen one man in the group, to be (significantly) greater than 0.5. In my scenario though, there are many types, not just men and women.
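One way to put a number on "significantly" in the men/women example is a permutation test: shuffle the labels across groups (keeping group sizes fixed) and see how often chance alone produces a conditional probability as large as the observed one. This is a sketch of a technique nobody in the thread has proposed, under the simplifying assumption that each member carries exactly one label. The function names and the toy groups are hypothetical.

```python
import random


def same_label_probability(groups, label):
    """P(second member has `label` | first member has `label`), counted over
    ordered pairs of distinct members drawn from the same group."""
    hits = trials = 0
    for group in groups:
        m = sum(1 for x in group if x == label)
        hits += m * (m - 1)              # ordered pairs where both carry the label
        trials += m * (len(group) - 1)   # ordered pairs where the first carries it
    return hits / trials if trials else float("nan")


def permutation_p_value(groups, label, n_shuffles=2000, seed=0):
    """One-sided permutation test: shuffle labels across groups, keep sizes."""
    rng = random.Random(seed)
    observed = same_label_probability(groups, label)
    pooled = [x for group in groups for x in group]
    sizes = [len(group) for group in groups]
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:               # re-cut the shuffled labels into groups
            shuffled.append(pooled[start:start + size])
            start += size
        if same_label_probability(shuffled, label) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_shuffles + 1)


# Toy data (hypothetical): 8 men and 8 women, clustered by group.
groups = [list("MMMM"), list("WWWW"), list("MMMW"), list("WWWM")]
observed, p_value = permutation_p_value(groups, "M")
print(observed, p_value)  # observed is 0.75 for this toy data
```

The shuffle keeps the overall label frequencies intact, so the same idea should carry over to a skewed (for example Zipfian) population without assuming normality.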

>How many types (levels) are there?  What's the distribution look like?  Any other variables?

There are hundreds of thousands of types, and each data point can have multiple types (probably around 10 on average, but it can vary a lot). The distribution of types over the entire population is Zipfian (follows a power law). No other variables (or I'm assuming independence for tractability).

kermi3

Re: Stats people, significance testing
« Reply #3 on: October 11, 2012, 09:13:57 PM »
Yeah, I'm not sure of the best thing to use here. It's not my traditional field of stats. Basically you're talking about

If I was forced to do this...I think what I might do is pick one of the types (as a starter) and randomly sample one of them (eventually I'd do more than one) and then take say, the closest 500 items and run a chi2 for goodness of fit....Then repeat...

Basically, it's almost a regression with a chi^2, but I'm sure that's not the best method.

This is outside the realm of my traditional hypothesis testing, wherein we test to reject a null hypothesis. I think what you actually need is some sort of loglinear analysis or logistic regression, but I really don't know about those. JaWiB, do you have an idea?

Perspective

Re: Stats people, significance testing
« Reply #4 on: October 12, 2012, 10:34:24 AM »
Yeah, I ran it by my wife (also clinical psych) last night and she said the same thing. It's not the type of analysis you guys traditionally use.

>If I was forced to do this...I think what I might do is pick one of the types (as a starter) and randomly sample one of them (eventually I'd do more than one) and then take say, the closest 500 items and run a chi2 for goodness of fit....Then repeat...

I was originally thinking of something along these lines. Compute the probability of type A over the whole population, then compute the probability of type A given that type A has already been seen, over all groups, and compare the two. Then do this for all types and maybe compute an average increase in the probability(?). Then I could say something like: seeing a type that was already seen in a group is on average 3x more likely than the random chance given by the baseline distribution.
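The comparison described above could be sketched roughly like this. It assumes the data is laid out as a list of groups, where each group is a list of items and each item is a set of type labels. That layout, the function name `type_lifts`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def type_lifts(groups):
    """For each type, the ratio P(type | type already seen in group) / P(type).

    groups -- list of groups; each group is a list of items;
              each item is a set of type labels.
    """
    total_items = sum(len(group) for group in groups)
    item_counts = Counter()   # items carrying each type, whole population
    pair_hits = Counter()     # ordered same-group pairs where both carry the type
    pair_trials = Counter()   # ordered same-group pairs where the first carries it
    for group in groups:
        in_group = Counter(t for item in group for t in item)
        for t, m in in_group.items():
            item_counts[t] += m
            pair_hits[t] += m * (m - 1)
            pair_trials[t] += m * (len(group) - 1)
    lifts = {}
    for t, trials in pair_trials.items():
        if trials == 0:
            continue          # type only ever appears in single-item groups
        baseline = item_counts[t] / total_items
        conditional = pair_hits[t] / trials
        lifts[t] = conditional / baseline
    return lifts


# Toy data (hypothetical): two perfectly segregated groups.
groups = [[{"a"}, {"a"}, {"a"}], [{"b"}, {"b"}, {"b"}]]
lifts = type_lifts(groups)
mean_lift = sum(lifts.values()) / len(lifts)
print(lifts, mean_lift)
```

A lift above 1 means the type shows up again in its group more often than the baseline would suggest. On its own this is a descriptive number, not a significance test, and with a long-tailed type distribution the rare types are likely to contribute very noisy ratios to a plain average.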

Thanks for the suggestions :thumbsup:

kermi3

Re: Stats people, significance testing
« Reply #5 on: October 12, 2012, 11:14:56 AM »
Cool, yeah, give that a shot. Look into logistic regression or loglinear analysis if you want something more specific. I think that's the direction you have to go.

JaWiB

  • definitelys definately no MacGyver
  • Jackass V
  • Posts: 1443
  • Karma: +57/-4
Re: Stats people, significance testing
« Reply #6 on: October 12, 2012, 06:31:42 PM »
I don't really do stats much, and I've never been good at it...

Can you calculate the expected value of (# of type x in group y)? Shouldn't that value be normally distributed about the expected value?  :dunno:
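A rough sketch of that idea, assuming each item in a group independently carries type x with the population-wide probability (a binomial model, which is an assumption and not something established in the thread). The function name and the numbers are hypothetical.

```python
import math


def count_z_score(observed, group_size, p_type):
    """Z-score of an observed count of one type in one group, under the
    assumption that the count is Binomial(group_size, p_type)."""
    expected = group_size * p_type
    variance = group_size * p_type * (1.0 - p_type)
    if variance == 0:
        raise ValueError("variance is zero; z-score is undefined")
    return (observed - expected) / math.sqrt(variance)


# Toy numbers (hypothetical): 60 items of type x in a group of 100,
# where type x makes up half of the population.
z = count_z_score(observed=60, group_size=100, p_type=0.5)
print(z)
```

The normal approximation to a binomial count is usually considered reasonable only when both group_size * p_type and group_size * (1 - p_type) are not small. With a Zipfian distribution over hundreds of thousands of types, most types would be rare, so the approximation would probably break down for them.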