Friday, March 25, 2011

Causation vs Association

One of the biggest mistakes in Statistics is confusing Causation with Association.

We can say that cigarette smoking causes lung cancer, but we cannot say that stopping smoking causes weight gain. Yes, it is true that most subjects will gain weight when they stop smoking. There is an association, but no causation.

One of the best examples: when more ice cream is sold at the beach, there are more drownings. Does this mean that eating ice cream causes the drownings? NO!! There is an association between ice cream sales and drownings, but NO causation.

The reason there are more drownings is that there are simply more people at the beach. The warmer the weather, the more people go swimming, and the ice cream truck shows up to make some money. The warm weather, which drives both, is what is sometimes called the confounding variable or lurking variable.
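Here is a small simulation of the beach example, just as a sketch. Every number in it is made up for illustration (the temperature range, the crowd sizes, the coefficients), and the names `simulate_beach_days` and `pearson` are my own. In the simulated world, ice cream sales and water incidents each depend only on the size of the crowd, never on each other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_beach_days(n_days=2000, seed=0):
    """Return (crowd, ice_cream, incidents) for each simulated day.

    All coefficients are hypothetical. Ice cream sales and incidents
    are both driven by the crowd, and the crowd by the temperature.
    """
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        temperature = rng.uniform(15, 35)               # the lurking variable
        crowd = 20 * temperature + rng.gauss(0, 30)     # warmer day, bigger crowd
        ice_cream = 0.5 * crowd + rng.gauss(0, 20)      # depends on crowd only
        incidents = 0.02 * crowd + rng.gauss(0, 2)      # depends on crowd only
        days.append((crowd, ice_cream, incidents))
    return days


days = simulate_beach_days()

# Association across all days: ice cream sales vs. water incidents.
overall = pearson([d[1] for d in days], [d[2] for d in days])

# Hold the confounder roughly fixed: only days with similar crowd sizes.
similar = [d for d in days if 490 <= d[0] <= 510]
within = pearson([d[1] for d in similar], [d[2] for d in similar])

print(f"all days: r = {overall:.2f}")
print(f"similar-crowd days ({len(similar)} days): r = {within:.2f}")
```

Across all days the two series move together strongly. Once we look only at days with about the same crowd, the association should mostly vanish, which is what we would expect if the crowd (and behind it the weather) is doing all the work.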

There was a very important report recently that there were more shark attacks in Florida, and that it was therefore more dangerous to go swimming. Well, this is not necessarily true.

People wanted to know why the sharks were all of a sudden attacking people. Was this JAWS' revenge?

NO, it just turns out that there are MORE people swimming in the ocean, where the sharks just happen to live. It is usually the people running INTO the poor shark, not the other way around.

The key in statistics is to find that confounding variable.

Another interesting stat is that children of people who have more books at home are more likely to succeed in school. Does having those books at home make kids smarter? Can a family go out and buy a library and all of a sudden their children will be brighter? NO!! (Although I have met families that have tried doing just this... maybe more for show than to help their children's GPA... lol)

I have a few family friends that have the Stephen Hawking book "A Brief History of Time". The author claims anyone with an education should be able to read it. I have tried several times with no success, and I have read Singh's Big Bang, which is three times as big. So I found this book in my friend's home library. It probably sat on the coffee table for a year or two (when the book was popular and Stephen was making the book tour in his wheelchair... he is considered to be one of the brightest men alive). So I taped a $20 bill to one of the back pages of the book, to see if anyone ever actually tries to READ the book. Every year I check the book on the home shelves and the $20 is still there... lol

Wednesday, March 23, 2011

Sampling

Many professors of statistics will state that 50% of all research papers are bad because of sampling. When doing research, the two things one must consider are cost and time. Doing the sampling correctly takes time and sometimes too much money. So researchers, especially college students, take shortcuts like using their friends as their sample.

Sampling is also why liberal cable shows like CNN and conservative cable shows like FOX get different results in their polls.

If you were to poll customers at the grocery store at 10 am, you would get completely different results from taking the poll at 6 pm or even 8 pm.

The instrument used most often in gathering data is the questionnaire. And the biggest problem with questionnaires is question bias. You get different results depending on how a question is phrased.

The size of the sample is important too. Believe it or not, a common rule of thumb says that a sample size of only about 30 is often sufficient for the sample mean to behave approximately normally.

For presidential polls a size of 2,000 is usually used. Even with such a large size one can run into problems. If the pollster is using phone numbers, it is possible that most of the phones will be from households in the big cities. That is why we now do something called "stratified" sampling. This is where we decide to sample 500 from the Northeast, 500 from the South, 500 from the Midwest and 500 from the West. This way we get a good representation of the entire country. But anytime you start to use judgment in your method of sampling, you might have brought some bias into the sampling.
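The 500-per-region idea can be sketched in a few lines. This is only an illustration: the phone list, the region sizes, and the `stratified_sample` helper are all hypothetical, not any pollster's actual procedure.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(population, key, per_stratum, seed=None):
    """Draw up to `per_stratum` records at random from each stratum.

    `key` maps a record to the name of its stratum (here, the region).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in population:
        strata[key(record)].append(record)

    sample = []
    for name in sorted(strata):            # fixed order keeps runs reproducible
        members = strata[name]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))   # random draw without replacement
    return sample


# Hypothetical phone list with unequal region sizes.
region_sizes = {"Northeast": 3000, "South": 6000, "Midwest": 3500, "West": 4000}
population = []
for region, size in region_sizes.items():
    for _ in range(size):
        population.append({"id": len(population), "region": region})

sample = stratified_sample(population, key=lambda p: p["region"],
                           per_stratum=500, seed=1)
counts = Counter(p["region"] for p in sample)
print(counts)
```

One thing to keep in mind: taking the same 500 from every region over-represents the smaller regions, so a national estimate would normally weight each region by its share of the population.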

The best type of sampling is called SRS (Simple Random Sampling). The best method here is to use random numbers to determine whom to ask the questions.
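A minimal sketch of an SRS using Python's random number generator. The list of phone numbers is invented, and `simple_random_sample` is just a name I picked; the point is only that every entry has the same chance of being chosen and nobody is picked twice.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Pick n distinct members so that every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Hypothetical sampling frame: 10,000 made-up phone numbers.
phone_list = [f"555-{i:04d}" for i in range(10000)]

chosen = simple_random_sample(phone_list, 30, seed=42)
print(chosen[:5])
```

Fixing the seed makes the draw repeatable, which is handy if someone later wants to check how the sample was picked.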

But even with an SRS, we still do not know who will pick up the phone (the lady of the house or the man), who will take the time to talk to a stranger, and who will not. Nowadays we also have the problem of the land line versus the cell phone. My house, for example, does not have a land line.

In general, we should all be wary of any poll, especially without seeing the methodology. But it is interesting that I have read that polls have been successful in predicting 99% of the elections around the country. It is only the two or three upsets that make the headlines.

One of the most recent examples of this occurred during the Kerry vs Bush election of 2004. Exit polls had Kerry winning some key states. Unfortunately, it was later discovered that the company conducting the polling had used young people to talk to voters. Guess who the young people interviewed? You guessed it, other young people. And guess who in general would not agree to be interviewed after voting? You guessed it again, older conservative voters.
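To see how this kind of uneven response can flip a poll, here is a bit of arithmetic in code. The numbers are completely invented for illustration (they are NOT 2004 data), and `true_and_polled_share` is a hypothetical helper: two groups of voters, each with its own share of the electorate, its own support for candidate A, and its own willingness to be interviewed.

```python
def true_and_polled_share(groups):
    """Return (true support, support seen by the poll) for candidate A.

    Each group has a `share` of the electorate, a `support` rate for
    candidate A, and a `response_rate` (chance of agreeing to be polled).
    """
    total = sum(g["share"] for g in groups)
    true_share = sum(g["share"] * g["support"] for g in groups) / total

    # Only people who agree to be interviewed show up in the poll.
    responders = sum(g["share"] * g["response_rate"] for g in groups)
    polled_share = sum(
        g["share"] * g["response_rate"] * g["support"] for g in groups
    ) / responders
    return true_share, polled_share


# Invented numbers: younger voters lean toward A and talk to pollsters more.
groups = [
    {"name": "younger", "share": 0.3, "support": 0.60, "response_rate": 0.7},
    {"name": "older",   "share": 0.7, "support": 0.45, "response_rate": 0.4},
]

true_share, polled_share = true_and_polled_share(groups)
print(f"true support for A: {true_share:.1%}")
print(f"exit-poll estimate: {polled_share:.1%}")
```

With these made-up inputs the poll shows candidate A ahead even though A is actually just under half of the vote, simply because one group is easier to interview than the other.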