The issue is that my hypothesis isn't about the entire population; it's about what happens within an arbitrary group. To use the example from the Wikipedia article, I'm not hypothesizing that the numbers of men and women are equal. My hypothesis is that if you see a man in a group, you are more likely to see another man in that same group.
Assuming men and women are equally common in the population (for this example), I'd expect the probability of seeing a man, given that you've already seen one man in the group, to be (significantly) greater than 0.5. In my scenario, though, there are many types, not just men and women.
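To make that quantity concrete, here is a minimal sketch of how I'd estimate it in the two-type case. The function name `same_type_rate` and the toy groups are made up for illustration; they are not my real data.

```python
def same_type_rate(groups, target="M"):
    """Estimate P(second member is `target` | first member is `target`)
    by pooling ordered pairs of distinct members within each group."""
    hits = 0    # ordered pairs (first, second) where both are `target`
    trials = 0  # ordered pairs where the first is `target`
    for group in groups:
        m = sum(1 for x in group if x == target)  # targets in this group
        n = len(group)
        hits += m * (m - 1)
        trials += m * (n - 1)
    return hits / trials if trials else float("nan")


# Hypothetical toy data: balanced overall (6 M, 6 F) but clustered by group.
groups = [["M", "M", "M"], ["F", "F", "F"], ["M", "M", "F"], ["F", "F", "M"]]
rate = same_type_rate(groups)
print(rate)  # 8 / 12, about 0.67
```

With no within-group clustering and a 50/50 population, I'd expect this number to sit near 0.5; my hypothesis is that it is noticeably higher.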
>How many types (levels) are there? What's the distribution look like? Any other variables?
There are hundreds of thousands of types, and each data point can have multiple types (probably around 10 on average, but it can vary a lot). The distribution of types over the entire population is Zipfian (follows a power law). No other variables (or at least I'm assuming independence for tractability).
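For the many-types case, one way I could imagine checking this is a permutation test: shuffle which group each data point belongs to (keeping group sizes and each point's type set fixed) and see how often the shuffled data shows as much within-group type sharing as the real data. This is only a sketch under my own assumptions, not something I've validated. The statistic (fraction of within-group pairs sharing at least one type), the function names, and the toy data are all hypothetical. Because the shuffle keeps the overall type frequencies, the Zipfian skew is present in the null as well.

```python
import random
from collections import defaultdict
from itertools import combinations


def shared_type_rate(points, group_ids):
    """Fraction of within-group pairs of data points sharing at least one type."""
    by_group = defaultdict(list)
    for types, g in zip(points, group_ids):
        by_group[g].append(types)
    shared = total = 0
    for members in by_group.values():
        for a, b in combinations(members, 2):
            total += 1
            shared += bool(a & b)  # True if the two type sets overlap
    return shared / total if total else float("nan")


def permutation_test(points, group_ids, n_perm=1000, seed=0):
    """Compare the observed rate with rates under shuffled group labels."""
    rng = random.Random(seed)
    observed = shared_type_rate(points, group_ids)
    labels = list(group_ids)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign points to groups, sizes unchanged
        null.append(shared_type_rate(points, labels))
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + sum(v >= observed for v in null)) / (1 + n_perm)
    return observed, null, p_value


# Hypothetical toy data: each data point is a set of types.
points = [{"a", "b"}, {"a", "c"}, {"d", "e"}, {"d", "f"}, {"g", "h"}, {"g", "i"}]
group_ids = [0, 0, 1, 1, 2, 2]
observed, null, p_value = permutation_test(points, group_ids, n_perm=200)
```

The pairwise loop is quadratic in group size, so for large groups I'd have to sample pairs rather than enumerate them.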