Author Topic: Stats people, significance testing (Read 2817 times)

Perspective · « **on:** October 11, 2012, 12:17:38 PM »

I'm trying to figure out what type of test I should use for statistical significance here.

I have groups of data, and each data point can have a bunch of types. My hypothesis is that data points of the same type tend to co-occur in the same groups. So if I see an X in a group, I'm more likely to see another X than random chance would predict. How do I test significance here?

I can compute the distribution of types over the whole population, and for a given type I can compute the probability of seeing a second item of the same type over all groups (then I can iterate over all types). But I'm not sure which test is suitable here. The data is not normally distributed, so a t-test is out. Is there some way to frame this as a chi-squared test?
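One way to make "more likely than random chance" concrete is a label-shuffling (permutation) test. Nobody in the thread proposes this; it is offered here only as a sketch. It assumes, for simplicity, that each data point has exactly one type, which is not true of the real data described later. The function names and the toy groups are made up for illustration.

```python
import random
from collections import Counter

def same_type_pairs(groups):
    """Count unordered pairs of items that share a type within the same group."""
    total = 0
    for group in groups:
        for count in Counter(group).values():
            total += count * (count - 1) // 2
    return total

def shuffle_test(groups, n_shuffles=2000, seed=0):
    """Compare the observed pair count with counts after shuffling type labels.

    Group sizes and the overall type distribution are kept fixed; only the
    assignment of labels to groups is randomized.
    """
    rng = random.Random(seed)
    observed = same_type_pairs(groups)
    labels = [label for group in groups for label in group]
    sizes = [len(group) for group in groups]
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(labels[start:start + size])
            start += size
        if same_type_pairs(shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_shuffles + 1)
    return observed, p_value

# Toy data (invented): three groups are pure, one is mixed.
groups = [["a", "a", "a"], ["b", "b", "b"], ["c", "c", "c"], ["a", "b", "c"]]
observed, p_value = shuffle_test(groups)
print(observed, p_value)
```

A small p-value here would mean that shuffled data rarely shows as many same-type pairs as the real grouping does. Extending the statistic to data points with several types each would need a decision about how to count partial overlaps.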

Computer scientists don't generally do real science, so I'm a little out of my league here. Kermi, Jawib, I'm looking at you here!

kermi3 · « **Reply #1 on:** October 11, 2012, 07:10:20 PM »

So basically you have a categorical dependent variable...

How many types (levels) are there? What's the distribution look like? Any other variables?

Assuming there are no other variables involved, I think you're talking about a chi^2 test for goodness of fit.

http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Goodness_of_fit
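For reference, the goodness-of-fit statistic compares observed category counts with the counts expected under the null hypothesis. The sketch below uses invented counts for a two-category case and is only meant to show the arithmetic. The closed-form p-value shown is valid only for one degree of freedom; for more categories a library routine such as `scipy.stats.chisquare` would be the usual choice.

```python
import math

def chi_squared_gof(observed, expected):
    """Pearson statistic: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented example: 100 people, null hypothesis of a 50/50 split.
observed = [44, 56]
expected = [50, 50]
stat = chi_squared_gof(observed, expected)

# Two categories give 1 degree of freedom. For 1 degree of freedom the
# upper-tail probability of the chi-squared distribution is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, p_value)
```

With these counts the statistic is well below 3.841, the 5% critical value for one degree of freedom, so the 50/50 null would not be rejected.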

Perspective · « **Reply #2 on:** October 11, 2012, 07:18:55 PM »

The issue is that my hypothesis is not over the entire population, it's within an arbitrary group. To use the example from the Wikipedia article, I'm not hypothesizing that the numbers of men and women are equal. My hypothesis is that if you see a man in a group, you are more likely to see another man.

Assuming men and women are equally common in the population (for this example), I'd expect the probability of seeing a man, given you've already seen one man in the group, to be (significantly) greater than 0.5. In my scenario though, there are many types, not just men and women.

>How many types (levels) are there? What's the distribution look like? Any other variables?

There are hundreds of thousands of types, and each data point can have multiple types (probably around 10 on average, but it can vary a lot). The distribution of types over the entire population is Zipfian (follows a power law). No other variables (or I'm assuming independence for tractability).

kermi3 · « **Reply #3 on:** October 11, 2012, 09:13:57 PM »

Yea - I'm not sure the best thing to use here....not my traditional field of stats....Basically you're talking about

If I was forced to do this...I think what I might do is pick one of the types (as a starter) and randomly sample one of them (eventually I'd do more than one) and then take say, the closest 500 items and run a chi2 for goodness of fit....Then repeat...

Basically, it's almost a regression with a chi 2....but I'm sure that's not the best method....

This is outside the realm of my traditional hypothesis testing, wherein we test to reject a null hypothesis....I think what you actually need is some sort of loglinear analysis or logistic regression - but I really don't know about those - JaWiB, you have an idea?

Perspective · « **Reply #4 on:** October 12, 2012, 10:34:24 AM »

Yeah, I ran it by my wife (also clinical psych) last night and she said the same thing. It's not the type of analysis you guys traditionally use.

Quote from: kermi3 on October 11, 2012, 09:13:57 PM

If I was forced to do this...I think what I might do is pick one of the types (as a starter) and randomly sample one of them (eventually I'd do more than one) and then take say, the closest 500 items and run a chi2 for goodness of fit....Then repeat...

I was originally thinking of something along these lines. Compute the probability of type A over the whole population, then compute the probability of type A given that type A has already been seen, over all groups, and compare the two. Then do this for all types and maybe compute an average increase in the probability(?). Then I could say something like "seeing a type that was already seen in a group is on average 3x more likely than the random chance given by the baseline distribution."
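The comparison described above can be sketched roughly as follows. This is an illustration, not the code actually used: it assumes each group is a list of data points and each data point is a set of types, and it estimates the conditional probability from ordered pairs of distinct points within a group. The name `type_lifts` and the toy data are invented.

```python
from collections import Counter
from itertools import permutations

def type_lifts(groups):
    """Return, per type, P(type | type already seen in group) / P(type overall)."""
    points = [point for group in groups for point in group]
    n_points = len(points)
    # Baseline: how many data points carry each type.
    baseline = Counter(t for point in points for t in point)
    seen = Counter()  # ordered pairs whose first point has type t
    both = Counter()  # ... and whose second point also has type t
    for group in groups:
        for first, second in permutations(group, 2):
            for t in first:
                seen[t] += 1
                if t in second:
                    both[t] += 1
    # Types that never appear in a group of two or more points are omitted.
    return {
        t: (both[t] / seen[t]) / (baseline[t] / n_points)
        for t in seen
    }

# Toy data (invented): two groups of three data points each.
groups = [
    [{"x"}, {"x"}, {"x", "y"}],
    [{"y"}, {"y"}, {"z"}],
]
lifts = type_lifts(groups)
average_lift = sum(lifts.values()) / len(lifts)
print(lifts, average_lift)
```

A lift above 1 means the type co-occurs more often than the baseline suggests. With a Zipfian type distribution, an unweighted average like this may be dominated by rare types whose estimates rest on very few pairs, so weighting by frequency might be worth considering.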

Thanks for the suggestions

kermi3 · « **Reply #5 on:** October 12, 2012, 11:14:56 AM »

Cool - yea, give that a shot...Look into the logistic regression or loglinear analysis if you want something more specific. I think that's the direction you have to go in....

JaWiB · « **Reply #6 on:** October 12, 2012, 06:31:42 PM »

I don't really do stats much, and I've never been good at it...

Can you calculate the expected value of (# of type x in group y)? Shouldn't that value be normally distributed about the expected value?
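A rough sketch of that idea, under an assumption the thread does not state: each of the items in group y independently has type x with the population probability, so the count is binomial. The numbers below are invented.

```python
import math

def count_z_score(observed_count, group_size, type_probability):
    """Expected count and z-score of the observed count under a binomial model."""
    expected = group_size * type_probability
    std_dev = math.sqrt(group_size * type_probability * (1 - type_probability))
    return expected, (observed_count - expected) / std_dev

# Invented example: a type with population probability 0.2,
# seen 30 times in a group of 100 items.
expected, z = count_z_score(observed_count=30, group_size=100, type_probability=0.2)
print(expected, z)
```

The normal approximation to a count is usually considered reasonable only when the expected count is not tiny. For the rare types in a Zipfian distribution the expected count per group is likely close to zero, so the approximation would probably be poor there.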
