Problem-Solving Using The Statistical Investigation Process
Theory
The Statistical Investigation Process has five stages: Question → Collect → Display → Analyse → Conclude. A good question anticipates variability. Sampling should be representative (random or stratified, not convenience). Watch for bias, correlation vs causation, and overgeneralisation.
The Statistical Investigation Process is a structured way to answer real-world questions using data. It has five main stages.
Stage 1: Pose a statistical question. A statistical question anticipates variability. "How tall are Year 11 students?" is statistical (heights vary). "How tall is the principal?" is not. Vague questions like "study habits" need to be refined into something measurable like "the average weekly study hours of Year 11 students".
Stage 2: Collect data (sampling). Identify the population (everyone you want to know about) and the sample (the smaller group you actually measure). Common sampling methods:
- Random sampling: every member of the population has an equal chance of being chosen.
- Stratified sampling: divide the population into groups and randomly sample from each, in proportion to group size.
- Convenience sampling: pick whoever's easiest. Quick, but biased.
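The three sampling methods can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the school, its year-level sizes, the student labels, and the function names are all hypothetical.

```python
import random


def simple_random_sample(population, n, rng):
    """Random sampling: every member has an equal chance of being chosen."""
    return rng.sample(population, n)


def stratified_sample(strata, n, rng):
    """Stratified sampling: sample each group in proportion to its size.

    `strata` maps a group name to the list of its members.
    Rounding means the total can differ slightly from n.
    """
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, k))
    return sample


def convenience_sample(population, n):
    """Convenience sampling: take whoever comes first. Quick, but not random."""
    return population[:n]


# Hypothetical school: 120 Year 10, 90 Year 11 and 90 Year 12 students.
strata = {
    "Year 10": [f"Y10-{i}" for i in range(120)],
    "Year 11": [f"Y11-{i}" for i in range(90)],
    "Year 12": [f"Y12-{i}" for i in range(90)],
}
population = [s for members in strata.values() for s in members]
rng = random.Random(42)  # fixed seed so the run is repeatable

random_pick = simple_random_sample(population, 30, rng)
stratified_pick = stratified_sample(strata, 30, rng)
convenience_pick = convenience_sample(population, 30)
```

With these made-up sizes the stratified sample takes 12, 9 and 9 students from the three year levels, while the convenience sample contains only Year 10 students because they happen to be listed first.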
Stage 3: Display the data using whichever chart suits the data type.
Stage 4: Analyse. Calculate appropriate summary statistics: mean/median/mode for centre; range/IQR/SD for spread. Symmetric data → mean + SD. Skewed data or data with outliers → median + IQR.
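As a rough sketch of the Analyse stage, the summary statistics can be computed with Python's standard `statistics` module. The data set is invented for illustration. Two assumptions to note: `statistics.stdev` gives the sample standard deviation (use `statistics.pstdev` for the population version), and `statistics.quantiles` uses one particular quartile convention, so the IQR may differ slightly from a hand method taught in class.

```python
import statistics


def summarise(data):
    """Return measures of centre and spread for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": statistics.stdev(data),  # sample standard deviation
    }


# Hypothetical weekly study hours; 30 is an outlier.
hours = [2, 3, 3, 4, 4, 5, 5, 6, 30]
summary = summarise(hours)

# The outlier drags the mean (about 6.9) well above the median (4),
# which is why median + IQR is the safer summary for this data.
print(summary)
```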
Stage 5: Conclude. State findings in plain English, in the context of the original question. Include a measure of centre, spread, and any limitations.
The first diagram shows the five stages of the Statistical Investigation Process. The second compares the three main sampling methods and their levels of bias.
This subtopic is about process and judgment, with no calculation formulas. Use the references below.
The five stages
| Stage | Goal |
|---|---|
| 1. Question | Pose a question that anticipates variability |
| 2. Collect | Sample the population fairly |
| 3. Display | Choose the right chart for the data type |
| 4. Analyse | Compute appropriate summary statistics |
| 5. Conclude | Answer in plain English, in context, with limitations |
Sampling methods
| Method | How it works | Bias level |
|---|---|---|
| Random | Every member equally likely to be chosen | Low |
| Stratified | Random sample from each subgroup, proportional to size | Low |
| Convenience | Whoever is easiest to access | High |
Display choices by data type
| Data type | Display |
|---|---|
| Categorical | frequency table, bar chart |
| Discrete numerical (small) | dot plot |
| Mid-size numerical | stem-and-leaf plot |
| Large/continuous numerical | histogram, boxplot |
| Two groups | parallel boxplots, back-to-back stem-and-leaf |
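To make one of these displays concrete, here is a small sketch that builds a text stem-and-leaf plot. It assumes non-negative whole numbers, with the tens as the stem and the units as the leaf. The function name and the test scores are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines


# Hypothetical test scores. Key: 4 | 7 means 47.
scores = [47, 52, 55, 58, 61, 61, 64, 83]
for line in stem_and_leaf(scores):
    print(line)
```

The empty row for stem 7 shows that no score fell in the 70s, something a plain list of numbers hides.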
Working through a statistical investigation
- Pose a precise statistical question. Replace vague phrasing with something measurable that anticipates variability.
- Choose a sampling method. Random or stratified for unbiased data. Define the population and sample clearly.
- Display the data using a chart suited to the type: bar chart, dot plot, stem-and-leaf, histogram, or parallel boxplots.
- Analyse with appropriate summary statistics. Symmetric → mean + SD. Skewed or outliers → median + IQR.
- Conclude in plain English, in the original units, and in context. State any limitations (small sample, possible bias, etc.).
Identifying bias
- Ask: which parts of the population are over- or under-represented?
- Common sources: convenience samples, voluntary response, sampling from only one location or time, leading survey questions.
- Name the type of bias if you can (e.g. selection bias, response bias).
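A short simulation can show how an over-represented group distorts an estimate. Everything here is invented for illustration: the population size, the share of gym members, and the exercise hours are assumptions, not real data.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable


def weekly_hours(is_member):
    """Made-up exercise hours: gym members centre on 5, others on 2."""
    centre = 5.0 if is_member else 2.0
    return max(0.0, rng.gauss(centre, 1.5))


# Hypothetical population of 10,000 adults, 1 in 5 a gym member.
membership = [True] * 2000 + [False] * 8000
population = [(m, weekly_hours(m)) for m in membership]

true_mean = statistics.mean(h for _, h in population)

# Convenience sample: 100 people surveyed at the gym.
gym_only = [h for member, h in population if member]
convenience_mean = statistics.mean(rng.sample(gym_only, 100))

# Random sample: 100 people drawn from the whole population.
random_mean = statistics.mean(h for _, h in rng.sample(population, 100))

print(f"population mean:  {true_mean:.2f}")
print(f"gym-only sample:  {convenience_mean:.2f}")
print(f"random sample:    {random_mean:.2f}")
```

In this setup the gym-only sample overestimates the population mean, while the random sample lands close to it.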
Spotting correlation vs causation
- If a claim says one variable causes another, ask: could there be a third variable influencing both?
- Observational data shows association, not causation. Only a controlled experiment can establish causation.
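The third-variable idea can be illustrated with a simulation in which tablets have no effect on scores by construction, yet tablet owners still score higher on average. All of the numbers (ownership rates, score levels, sample size) are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

students = []
for _ in range(4000):
    # Third variable: whether the family is well resourced.
    well_resourced = rng.random() < 0.5
    # Well-resourced families are more likely to own a tablet.
    owns_tablet = rng.random() < (0.8 if well_resourced else 0.2)
    # The score depends on resources only; the tablet plays no part.
    score = 60 + (15 if well_resourced else 0) + rng.gauss(0, 5)
    students.append((well_resourced, owns_tablet, score))


def tablet_gap(rows):
    """Mean score of tablet owners minus mean score of non-owners."""
    owners = [s for _, tablet, s in rows if tablet]
    others = [s for _, tablet, s in rows if not tablet]
    return statistics.mean(owners) - statistics.mean(others)


overall_gap = tablet_gap(students)
gap_well_resourced = tablet_gap([r for r in students if r[0]])
gap_less_resourced = tablet_gap([r for r in students if not r[0]])

print(f"overall gap:               {overall_gap:.1f}")
print(f"gap within well resourced: {gap_well_resourced:.1f}")
print(f"gap within less resourced: {gap_less_resourced:.1f}")
```

The overall gap is large, but it nearly vanishes once students are compared within the same resource group. The association comes entirely from the third variable.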
The phrase "study habits" is too vague to measure. A statistical question needs to anticipate variability and pin down what is being measured.
Answer: "What is the average number of hours per week that Year 11 students at this school spend studying?" This is measurable (hours per week), anticipates variability (different students will report different hours), and specifies the population (Year 11 students at this school).
Identify whom the researcher is trying to learn about (population) vs whom they actually measured (sample).
| Group | Who it covers |
|---|---|
| Population | all Brisbane workers |
| Sample | the train commuters surveyed at one station during peak hour |
Concern: the sample is biased because it includes only train commuters at one station during peak hour. Car drivers, bus users, and off-peak commuters are excluded.
Gym members are not representative of all Brisbane adults, because they're already a group that exercises more than average.
Answer: selection bias. The sample over-represents people who exercise. The gym's estimate of "Brisbane adults' exercise level" will be too high.
The teacher confuses correlation (an observed association) with causation (one variable causing another). Tablet owners may also tend to be from families with more resources, more study support, or more general access to learning materials.
Answer: the flaw is treating correlation as causation. Owning a tablet is associated with higher scores but doesn't necessarily cause them. Only a controlled experiment (e.g. randomly giving half the students tablets) could test whether the tablet itself raises scores.
Common pitfalls
Frequently asked questions
What is a statistical question?
A statistical question is one that anticipates variability: there will be different answers from different people or measurements. 'How tall are Year 11 students?' is statistical (heights vary). 'How tall is the principal?' is not, because there's only one answer.
What is the difference between population and sample?
The population is everyone (or everything) you want to know about. The sample is the smaller group you actually measure. The sample needs to be representative of the population for conclusions to be reliable.
What is random sampling?
Random sampling means every member of the population has an equal chance of being chosen. It's the gold standard for unbiased data collection, but can be hard to do in practice.
What is stratified sampling?
Stratified sampling means dividing the population into groups (called strata, like year levels), then randomly sampling from each group in proportion to its size. It ensures every subgroup is represented.
What is selection bias?
Selection bias is when the sampling method makes the sample unrepresentative. For example, a survey of gym members about exercise habits will give an estimate that is too high, because gym members exercise more than average.
What does 'correlation is not causation' mean?
It means that even if two variables are associated, one does not necessarily cause the other. 'Tablet owners scored higher' might be true, but tablets did not necessarily cause the higher scores; there could be confounding factors like family income or study support.