So far, you have conducted analyses on data for 75 people, 3,000 people, and so on. But what if you need to analyse a very large amount of data, e.g., data on 300,000 (3 lakh) people? Or what if you need to do this for, say, the entire Indian population?

Suppose for a business application, you want to find out the average number of times people in urban India visited malls last year. That’s 400 million (40 crore) people! You cannot possibly go and ask every single person how many times they visited the mall. That’s a costly and time-consuming process. How can you reduce the time and money spent on finding this number?

To reiterate, these are the notations and formulas related to populations and their samples:
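Reconstructing the standard definitions referred to here (population of size $N$, sample of size $n$; population mean $\mu$ and variance $\sigma^2$; sample mean $\bar{x}$ and variance $s^2$):

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

Note that the sample variance divides by $n-1$ rather than $n$.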

The reason for dividing by n-1 and not n is beyond the scope of this course. However, if you want to learn more about it, please refer to this link.

For an upcoming government project, you want to find the average height of the students in Class VIII of a given school. Instead of asking each student, suppose you took a few students as your sample and wrote the data down:

| Roll Number | Height (cm) |
| --- | --- |
| 8012 | 121.92 |
| 8045 | 133.21 |
| 8053 | 141.34 |
| 8099 | 126.23 |
| 8125 | 175.74 |
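The sample statistics for this table can be computed directly with Python's standard library; `statistics.stdev` uses the $n-1$ divisor mentioned above:

```python
import statistics

# Sample heights (cm) from the table above
heights = [121.92, 133.21, 141.34, 126.23, 175.74]

sample_mean = statistics.mean(heights)  # x-bar = 139.688 cm
sample_sd = statistics.stdev(heights)   # s, divides by n-1, approx. 21.46 cm

print(f"Sample mean: {sample_mean:.2f} cm")
print(f"Sample standard deviation: {sample_sd:.2f} cm")
```

Notice how a single unusual observation (the 175.74 cm student) pulls the sample mean and standard deviation around; small samples are sensitive to such values.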

Let’s go through another example of sampling.

In order to counter fake news, let’s say that Facebook is planning to include a new feature in its timeline. Below each post, a fact-checking warning will be provided, like this.

In case you want to read more about this feature, please refer to this link.

Before changing the timelines of all Facebook users to include this feature, Facebook first wants to evaluate how its users would react to this new feature.

So, it lets a small sample (~10,000 users) try out the new timeline. Then, it asks the 10,000 users whether they prefer the new timeline (Feature B) or the old timeline (Feature A).

Let’s say that one such survey shows that 50.5% of the people prefer Feature B to Feature A. Based on this, Facebook can say that Feature B is preferred by more people than Feature A, and hence, B should replace A.

By conducting the exercise on a sample and not the entire population, it **saved time and money** and **avoided risks** that could arise from rolling out an untested feature.

But hold on! How can you be sure that the insights inferred for the sample hold true for the population as well? In other words, just because 50.5% of the people in the sample preferred feature B, is it fair to infer that 50.5% of the people in the population (1.86 billion Facebook users) will also prefer feature B to A?
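A quick simulation makes the concern concrete. Suppose (hypothetically) that the population is actually split exactly 50–50 between the two features. Even then, repeated surveys of 10,000 users will each return a slightly different sample proportion, and a figure like 50.5% can easily arise by chance alone:

```python
import random

random.seed(0)  # for reproducibility

TRUE_P = 0.50        # hypothetical: exactly half the population prefers Feature B
SAMPLE_SIZE = 10_000

def sample_proportion():
    """Survey SAMPLE_SIZE random users; return the fraction preferring Feature B."""
    prefers_b = sum(random.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    return prefers_b / SAMPLE_SIZE

# Repeat the survey five times: each sample gives a slightly different answer.
proportions = [sample_proportion() for _ in range(5)]
print(proportions)  # values hover around 0.50, not exactly 0.50
```

So a single sample proportion of 50.5% does not, by itself, prove that a majority of the population prefers Feature B.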

You cannot answer this question with the information that you have right now. However, after the next few lectures, which will cover sampling distributions, the central limit theorem and confidence intervals, you will be equipped with the knowledge required to answer it.