Statistics

This post covers the following topics:

  1. Hypothesis Testing
  2. Central Limit Theorem
  3. Confidence Intervals
  4. Student’s T Distribution
  5. Chi-Square Distribution and Chi-Square Test
  6. Poisson Distribution and its questions

Hypothesis Testing

Hypothesis testing starts when we have a sample at hand and, based on that sample, make assumptions about the parameters of the population. Such an assumption is called a hypothesis (a quantitative statement about the population). There are two types of hypothesis:

  1. Null Hypothesis: The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. This is generally denoted by $H_0$ and is considered to be true unless we have sufficient evidence to reject it in favor of an alternative hypothesis.

  2. Alternative (Research) Hypothesis: The alternative hypothesis is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed). This is generally denoted by $H_1$.

Let’s take the example of a soda company that claims the average volume of soda in its bottles is 300 ml. To test this claim, we take a sample of 100 bottles and measure the volume of soda in each bottle. We set up our hypotheses as follows:

$H_0$: The average volume of soda in the bottles is 300 ml.

$H_1$: The average volume of soda in the bottles is not 300 ml. (Two tailed test)

$H_1$: The average volume of soda in the bottles is less than 300 ml. (One tailed test)

$H_1$: The average volume of soda in the bottles is greater than 300 ml. (One tailed test)

We choose only one of the above alternative hypotheses. Once we have selected the pair of hypotheses, we select a significance level. The significance level is the probability of rejecting the null hypothesis when it is true. It is denoted by $\alpha$. The most common significance levels are 0.1, 0.05 and 0.01. The significance level is also called the type I error rate, because rejecting a true null hypothesis is a type I error and $\alpha$ is exactly the probability of making it. So if we increase the significance level, the type I error rate increases with it.

Now we select a test statistic. A test statistic is a quantity calculated from our sample of data. It is used in a hypothesis test to determine whether to reject the null hypothesis. In the case above the test statistic is based on the sample mean, standardized into a Z-score or T-score as described below.

Now it is time to select a testing procedure. The testing procedure is the method we use to determine whether to reject the null hypothesis. It depends on the information we have:

  1. If the population standard deviation is known, we use the Z-test. The population should be normally distributed or the sample size should be large (a common rule of thumb is greater than 30).
  2. If the population standard deviation is unknown and the sample size is small (less than 30), we use the T-test, which assumes the population is approximately normally distributed.
  3. If the population standard deviation is unknown but the sample size is greater than 30, we can use the Z-test as an approximation. In this case we use the sample standard deviation in place of the population standard deviation.

For the Z-test we calculate the Z-score as follows:

\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

where $\bar{X}$ is the sample mean, $\mu$ is the population mean under the null hypothesis, $\sigma$ is the population standard deviation and $n$ is the sample size. The Z-score is then looked up in the Z-score table to find the area under the curve in the tail(s) beyond it (both tails for a two-tailed test). This tail area is called the p-value. If the p-value is less than the significance level, we reject the null hypothesis.
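The Z-test above can be sketched in Python using only the standard library. The sample mean of 299 ml and the population standard deviation of 5 ml are hypothetical numbers chosen for illustration, and `z_test` is an illustrative helper name, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a one-sample Z-test of H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal N(0, 1)
    if alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # area in both tails
    elif alternative == "less":
        p_value = std_normal.cdf(z)                 # area in the left tail
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)             # area in the right tail
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return z, p_value


# Hypothetical soda data: n = 100 bottles, sample mean 299 ml, sigma = 5 ml.
alpha = 0.05
z, p_value = z_test(sample_mean=299.0, mu0=300.0, sigma=5.0, n=100)
reject = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

With these assumed numbers the sample mean lies two standard errors below 300 ml, so the two-tailed test rejects $H_0$ at the 5% level.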

For the T-test we calculate the T-score as follows:

\[T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}}}\]

where $\bar{X}$ is the sample mean, $\mu$ is the population mean under the null hypothesis, $s$ is the sample standard deviation and $n$ is the sample size. The T-score is then looked up in the T-score table to find the area under the curve in the tail(s) beyond it. This tail area is the p-value. If the p-value is less than the significance level, we reject the null hypothesis.

An alternative is the critical-value approach. For a two-tailed test we look up the critical T-score in the T-score table using the degrees of freedom and half of the significance level ($\alpha/2$ in each tail). The degrees of freedom is calculated as follows:

\[df = n - 1\]

where $n$ is the sample size. If the absolute value of the calculated T-score is greater than the critical value from the table, we reject the null hypothesis; otherwise we fail to reject it.
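A minimal sketch of the T-test with the critical-value approach, assuming a hypothetical sample of 10 bottle volumes. The standard library has no t-distribution, so the critical value 2.262 is copied by hand from a T-table (two-tailed, $\alpha = 0.05$, $df = 9$).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 10 bottle volumes in ml (made up for illustration).
sample = [298.2, 301.1, 299.5, 297.8, 300.4, 298.9, 299.7, 301.3, 298.0, 299.1]
mu0 = 300.0  # population mean claimed by H0

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (divides by n - 1)

t_score = (x_bar - mu0) / (s / sqrt(n))
df = n - 1

# Table value for df = 9 with alpha / 2 = 0.025 in each tail.
t_critical = 2.262
reject = abs(t_score) > t_critical

print(f"mean = {x_bar:.1f}, s = {s:.3f}, T = {t_score:.3f}, df = {df}")
print("reject H0" if reject else "fail to reject H0")
```

For this made-up sample the T-score stays inside the critical region boundaries, so the claim of 300 ml is not rejected.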

The outcome of the test can be one of the following:

  1. Reject the null hypothesis when it is true. (Type I error)
  2. Reject the null hypothesis when it is false. (Correct decision)
  3. Fail to reject the null hypothesis when it is true. (Correct decision)
  4. Fail to reject the null hypothesis when it is false. (Type II error)

Type I and Type II Errors

A type I error is the rejection of a true null hypothesis (also known as a “false positive” finding or conclusion). Its probability is denoted by $\alpha$ and is called the significance level.

A type II error is the failure to reject a false null hypothesis (also known as a “false negative” finding or conclusion). Its probability is denoted by $\beta$.

The power of a test is the probability of rejecting the null hypothesis when it is false. It is denoted by $1 - \beta$.

In other words, a type I error is to falsely infer the existence of something that is not there, while a type II error is to falsely infer the absence of something that is.
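To make $\alpha$, $\beta$ and power concrete, here is a small sketch that computes the power of a two-tailed one-sample Z-test. The function name `z_test_power` and the numbers (true mean 299 ml, $\sigma = 5$, $n = 100$) are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample Z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean, the Z statistic is normal with this mean and sd 1.
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # Probability that Z lands in the rejection region |Z| > z_crit.
    left_tail = std_normal.cdf(-z_crit - shift)
    right_tail = 1 - std_normal.cdf(z_crit - shift)
    return left_tail + right_tail


# When H0 is true, the rejection probability is just the type I error rate.
type_i_rate = z_test_power(mu0=300.0, mu_true=300.0, sigma=5.0, n=100)

# Hypothetical alternative: the true mean is 299 ml.
power = z_test_power(mu0=300.0, mu_true=299.0, sigma=5.0, n=100)
beta = 1 - power  # type II error rate

print(f"type I error rate = {type_i_rate:.3f}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Under these assumptions the test detects the 1 ml shortfall only about half of the time; a larger sample would raise the power without changing $\alpha$.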

Let’s take an example where we want to test whether a person should take medication. The null hypothesis is that the person is healthy and the alternative hypothesis is that the person is sick. If we reject the null hypothesis when it is true (a healthy person is given medication), it is a type I error; if we fail to reject the null hypothesis when it is false (a sick person goes untreated), it is a type II error. By convention the type I error is treated as the more serious one, which is why we fix $\alpha$ in advance, but which error is actually more costly depends on the context. For a fixed sample size, reducing one error rate increases the other.

Central Limit Theorem

Let $X_1, X_2, X_3, \ldots, X_n$ be a random sample of size $n$ from a population with mean $\mu$ and finite variance $\sigma^2$. Then the distribution of the sample mean $\bar{X}$ is approximately normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$ for large $n$, regardless of the shape of the population distribution.
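The theorem can be checked with a small simulation. This sketch assumes an exponential population with rate 1 (mean 1, variance 1), which is strongly skewed; the sample size, number of trials and seed are arbitrary choices.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the run is repeatable

n = 50         # size of each sample
trials = 4000  # number of samples drawn

# Draw many samples from an Exponential(1) population and keep each sample mean.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

mean_of_means = mean(sample_means)
variance_of_means = pvariance(sample_means)

print(f"mean of sample means     = {mean_of_means:.4f} (theory: 1.0)")
print(f"variance of sample means = {variance_of_means:.4f} (theory: {1.0 / n:.4f})")
```

The sample means cluster around $\mu = 1$ with variance close to $\sigma^2 / n = 0.02$, even though the individual observations are far from normal.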

Standard Normal Distribution and Z-Score

A standard normal distribution is a normal distribution with mean 0 and variance 1. It is denoted by $N(0,1)$. The area under the curve within one standard deviation of the mean is about 0.68, within two standard deviations about 0.95 and within three standard deviations about 0.997.

The Z-score is the number of standard deviations a value lies from the mean. It is calculated as follows: \(Z = \frac{X - \mu}{\sigma}\) where $X$ is the random variable, $\mu$ is the mean and $\sigma$ is the standard deviation.

The Z-score is used to find the area under the curve. The Z-score table gives the area under the curve from the left (or, in some tables, from the right) up to the given value of Z.
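Instead of a printed Z-score table, the same areas can be computed with `statistics.NormalDist` from the Python standard library. The helper names and the example value (180 from a population with mean 170 and standard deviation 5) are illustrative assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # N(0, 1)


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def area_within(k):
    """Area under the standard normal curve within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} s.d.: {area:.4f}")

# Hypothetical value: 180 from a population with mean 170 and s.d. 5.
z = z_score(180.0, 170.0, 5.0)
left_area = std_normal.cdf(z)  # what a left-tail Z-table would give
print(f"z = {z:.1f}, area to the left = {left_area:.4f}")
```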

Standard Error of the Mean

The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the sample mean, that is, how much the sample mean varies from sample to sample as an estimate of the population mean. From the central limit theorem, we know that the standard error of the mean is given by:

\[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\]

where $\sigma$ is the standard deviation of the population and $n$ is the sample size. But in practice we don’t know the standard deviation of the population. So we use the standard deviation of the sample as an estimate of the standard deviation of the population. So the standard error of the mean is given by:

\[\sigma_{\bar{X}} = \frac{s}{\sqrt{n}}\]

where $s$ is the standard deviation of the sample and $n$ is the sample size. This should be used when the sample size is greater than 30. If the sample size is less than 30 then we use the t-distribution.

Confidence Intervals

A confidence interval is a range of values that is likely to contain the true value of an unknown population parameter. The confidence interval is denoted by $[L,U]$ where $L$ is the lower bound and $U$ is the upper bound. It is easier to understand the concept of a confidence interval with the help of an example.

Let’s say we want to estimate the average height of all the students in a class. We take a sample of 100 students and we know that the population standard deviation is 5 cm. The sample mean is 170 cm. Now we want to compute a 95% confidence interval for the population mean.

We have a sample of size 100 with a mean of 170 cm, and a population standard deviation of 5 cm.

The first thing we calculate is the standard error of the mean (the standard deviation of the sampling distribution of the mean), which is given by:

\[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5\]

Next we look up the Z-score for a 95% confidence interval, which is 1.96. We then calculate the confidence interval as follows:

\[CI = \bar{X} \pm Z \times \sigma_{\bar{X}} = 170 \pm 1.96 \times 0.5 = [169.02, 170.98]\]

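It might look like we have treated the sample mean as the population mean, but the interval is an estimate for the population mean. By the central limit theorem, 95% of the time the sample mean lies within $1.96 \times \sigma_{\bar{X}}$ of the population mean, which is the same as saying that the population mean lies within that distance of the sample mean. So if we repeated the sampling many times, about 95% of the intervals constructed this way would contain the population mean. In that sense we are 95% confident that the population mean lies between 169.02 and 170.98.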
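A minimal Python sketch of the calculation above, followed by a small simulation of repeated sampling. The helper name `z_confidence_interval`, the assumed true mean of 170 cm, the seed and the number of trials are illustrative choices, not part of the original example.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sd is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# The worked example: n = 100, sample mean 170 cm, sigma = 5 cm.
low, high = z_confidence_interval(sample_mean=170.0, sigma=5.0, n=100)
print(f"95% CI: [{low:.2f}, {high:.2f}]")

# Repeated sampling from an assumed normal population with true mean 170 cm.
random.seed(1)
true_mu, sigma, n, trials = 170.0, 5.0, 100, 2000
covered = 0
for _ in range(trials):
    heights = [random.gauss(true_mu, sigma) for _ in range(n)]
    lo, hi = z_confidence_interval(sum(heights) / n, sigma, n)
    if lo <= true_mu <= hi:
        covered += 1
coverage = covered / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The first part reproduces the interval from the worked example; the second part shows that roughly 95% of intervals built this way contain the true mean.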

Hypothesis Testing - Difference of Two Means - Student’s t-Distribution & Normal Distribution

Student’s t-distribution is a heavy-tailed version of the normal distribution; once the degrees of freedom reach around 30, it becomes almost identical to the normal distribution. We use the t-distribution when the sample size is less than 30. It is denoted by $t_{df}$ where $df$ is the degrees of freedom. When comparing the means of two samples whose variances may differ, the degrees of freedom is calculated as follows (the Welch-Satterthwaite approximation):

\[df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)}\]

where $s_1$ and $s_2$ are the standard deviations of the two samples and $n_1$ and $n_2$ are the sample sizes of the two samples.
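The degrees-of-freedom formula can be sketched as a small Python function. The function name `welch_t` and the summary statistics are hypothetical; the T-score it returns uses the same standard error as the two-sample statistic in this section, with a hypothesized difference of 0.

```python
from math import sqrt


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for two samples with possibly unequal variances."""
    v1 = s1 ** 2 / n1  # squared standard error of the first mean
    v2 = s2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics for two small samples.
t_score, df = welch_t(mean1=20.0, s1=4.0, n1=10, mean2=17.0, s2=3.0, n2=12)
print(f"T = {t_score:.3f}, df = {df:.2f}")
```

The resulting degrees of freedom is usually not an integer; when reading a T-table by hand it is common to round it down, which is the conservative choice.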

If both sample sizes are large, we can use the normal distribution instead of the t-distribution. The test statistic divides the difference of the sample means by its standard error and is given by:

\[Z = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

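where $\bar{X_1}$ and $\bar{X_2}$ are the sample means, $\mu_1 - \mu_2$ is the difference of the population means under the null hypothesis (usually 0), $s_1$ and $s_2$ are the standard deviations of the two samples and $n_1$ and $n_2$ are the sample sizes of the two samples.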

Based on this we reject or fail to reject the null hypothesis. For a two-tailed test, if the absolute value of the Z-score is greater than the critical value we reject the null hypothesis; otherwise we fail to reject it.
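A short sketch of the two-sample Z-test for large samples. The function name `two_sample_z` and the summary statistics are made up for illustration; the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, s1, n1, mean2, s2, n2, hypothesized_diff=0.0):
    """Return (z, p_value) for a two-tailed test of mu1 - mu2 = hypothesized_diff."""
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    z = ((mean1 - mean2) - hypothesized_diff) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical summary statistics for two large samples.
alpha = 0.05
z, p_value = two_sample_z(mean1=52.0, s1=8.0, n1=64, mean2=50.0, s2=6.0, n2=36)
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
reject = abs(z) > z_critical

print(f"Z = {z:.3f}, critical value = {z_critical:.2f}, p-value = {p_value:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Comparing $|Z|$ with the critical value and comparing the p-value with $\alpha$ always lead to the same decision.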

Chi-Square Test

The chi-squared test is primarily used to compare a theoretical distribution with an observed distribution for categorical data.

  1. It is used to test the independence of two categorical variables.
  2. It is used to test the goodness of fit.

Goodness of Fit

Q: In a class of 75 students, 11 are left-handed. Does this fit the prevailing theory that 12% of students are left-handed?

$H_0$: The observed distribution is the same as the theoretical distribution.

$H_1$: The observed distribution is not the same as the theoretical distribution.

Here the observed distribution is represented by O and the theoretical (expected) distribution by E. The expected counts are 12% and 88% of 75, that is 9 and 66.

|              | O  | E  |
|--------------|----|----|
| Left Handed  | 11 | 9  |
| Right Handed | 64 | 66 |
| Total        | 75 | 75 |

We take the significance level as 5%. The degrees of freedom is 1 because we have two categories and, if one count is known, the other is the total minus the known count. There is one more condition: the expected value should be at least 5 for each of the categories. If an expected value is less than 5, one solution is to combine categories.

The chi-square value is given by:

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

The chi-square value is 0.505. From the chi-square table, the critical value for 1 degree of freedom at the 5% significance level is 3.84. As the chi-square value is less than the critical value, we fail to reject the null hypothesis: the data are consistent with the theory.
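The goodness-of-fit calculation can be reproduced in a few lines of Python. The critical value 3.841 is copied from a chi-square table. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it only applies to this $df = 1$ case.

```python
from math import sqrt
from statistics import NormalDist

observed = {"left": 11, "right": 64}
total = sum(observed.values())

# Expected counts under H0: 12% left-handed, 88% right-handed.
expected = {"left": 0.12 * total, "right": 0.88 * total}

chi_square = sum(
    (observed[k] - expected[k]) ** 2 / expected[k] for k in observed
)

critical_value = 3.841  # chi-square table, df = 1, alpha = 0.05
reject = chi_square > critical_value

# Valid for df = 1 only: P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
p_value = 2 * (1 - NormalDist().cdf(sqrt(chi_square)))

print(f"chi-square = {chi_square:.3f}, p-value = {p_value:.3f}")
print("reject H0" if reject else "fail to reject H0")
```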

Test of Independence

Q: 120 people were surveyed about their preferred social media platform. Is there enough evidence to suggest that social media preference is not independent of sex?

|           | Male | Female | Total |
|-----------|------|--------|-------|
| Facebook  | 15   | 20     | 35    |
| Instagram | 30   | 35     | 65    |
| TikTok    | 5    | 15     | 20    |
| Total     | 50   | 70     | 120   |

The first step is to calculate the expected value. The expected value is given by:

\[E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}\]

The above formula comes from the definition of independence, which is $P(A \cap B) = P(A) \times P(B)$. Here A is sex and B is social media platform. For example, under independence $P(\text{Male} \cap \text{Facebook}) = P(\text{Male}) \times P(\text{Facebook})$, which is $\frac{50}{120} \times \frac{35}{120}$. Multiplying this probability by the number of people surveyed (120) gives the expected value, 14.6.

|           | Male | Female | Total |
|-----------|------|--------|-------|
| Facebook  | 14.6 | 20.4   | 35    |
| Instagram | 27.1 | 37.9   | 65    |
| TikTok    | 8.3  | 11.7   | 20    |
| Total     | 50   | 70     | 120   |

Once we have the expected values we can calculate the chi-square value. The chi-square value is given by:

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

We sum over all the cells, that is, over both sexes and all the social media platforms. The next step is to determine the degrees of freedom. Here the row and column sums are fixed, so once we know two suitable cell values (for example, two cells in the Male column) we can fill in the whole table. Hence the degrees of freedom is 2, which matches the general formula $(\text{rows} - 1) \times (\text{columns} - 1) = (3 - 1) \times (2 - 1)$.

Now we follow the same procedure as before.
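The whole procedure for the survey table can be sketched in Python. The critical value 5.991 is copied from a chi-square table ($df = 2$, $\alpha = 0.05$). The p-value line uses the fact that the chi-square distribution with 2 degrees of freedom has tail probability $e^{-x/2}$, so it is specific to $df = 2$.

```python
from math import exp

# Observed counts: rows are Facebook, Instagram, TikTok; columns are Male, Female.
observed = [
    [15, 20],
    [30, 35],
    [5, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991  # chi-square table, df = 2, alpha = 0.05
reject = chi_square > critical_value

# Valid for df = 2 only: P(chi2 > x) = exp(-x / 2).
p_value = exp(-chi_square / 2)

print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.3f}")
print("reject H0" if reject else "fail to reject H0")
```

With these counts the chi-square value is below the critical value, so there is not enough evidence at the 5% level to say that platform preference depends on sex.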
