Concepts and Methodologies

The Department of Statistics Singapore (DOS) collects data through surveys (e.g. Household Expenditure Surveys, Price Surveys, Census, etc.) and from administrative sources (e.g. births and deaths data from the Immigration and Checkpoints Authority, education data from the Ministry of Education, Singapore, etc.).

Such data are commonly used in various forms of analysis, such as describing or visualising patterns and trends, or uncovering relationships between variables. Learn about the common types of data, chart types typically used for data visualisation, and real-world applications of mathematical concepts such as seasonal adjustment.

Units of Analysis: Who’s in the Spotlight?

The unit of analysis refers to the entity that is being studied in a dataset. For example, in clinical trials, the unit of analysis could be an individual (i.e. the patient). Macroeconomic studies may consider a whole country as a unit of analysis, e.g. comparing the productivity of workers between countries.

Units of analysis commonly found in the data published by DOS include:

Individuals, on Population Structure (by Residency, Age, etc.), Marital Status, etc.
Households, on Household Expenditure, Household Income from Work, etc.
Firms, on Productivity, Value-added, Business Receipts Index, Retail Sales Index, etc.

Types of Data: Which One Tells Your Story?

Quantitative Data versus Qualitative Data

Qualitative data are generally non-numerical and take the form of words, pictures or even videos. They help to answer ‘what’ or ‘why’ questions. An example is sentiment data, which describe how individuals are feeling. Qualitative data can be analysed by grouping the data into themes or categories.

Quantitative data are numerical in nature, and help to answer questions pertaining to ‘how much’ or ‘how many’ etc. Quantitative data can be analysed using statistical analysis. Examples of quantitative data are as follows:

Cross-Sectional

Data are collected over a single time period (e.g. records are from a specific year).
Allows the comparison of characteristics between units or groups at a fixed point in time.

From the table below, it can be seen that in 2022, there were 6 males and 4 females. Of the 4 females, only 1 lives in HDB housing while the rest live in private estates.

Table 1: Cross-sectional data of individuals in 2022 and their characteristics

Name	Year	Age	Sex	House Type	Highest Qualification Attained
Christopher	2022	22	M	4-Room HDB	Diploma
David	2022	30	M	3-Room HDB	University
Grace	2022	40	F	Condominium	Diploma
Jason	2022	23	M	4-Room HDB	Post-Secondary
John	2022	55	M	1-2 Room HDB	Post-Secondary
Michelle	2022	13	F	4-Room HDB	Below-Secondary
Peter	2022	45	M	5-Room HDB	University
Sally	2022	23	F	Landed	Post-Secondary
Samuel	2022	19	M	4-Room HDB	Secondary
Yvonne	2022	7	F	Landed	No formal qualification
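
The counts quoted above can be reproduced with a short tabulation. The following is a minimal sketch in Python, assuming the rows of Table 1 are typed in by hand; the variable names are illustrative and not part of any DOS tool.

```python
from collections import Counter

# Rows copied from Table 1: (name, sex, house type).
records = [
    ("Christopher", "M", "4-Room HDB"),
    ("David", "M", "3-Room HDB"),
    ("Grace", "F", "Condominium"),
    ("Jason", "M", "4-Room HDB"),
    ("John", "M", "1-2 Room HDB"),
    ("Michelle", "F", "4-Room HDB"),
    ("Peter", "M", "5-Room HDB"),
    ("Sally", "F", "Landed"),
    ("Samuel", "M", "4-Room HDB"),
    ("Yvonne", "F", "Landed"),
]

# Count individuals by sex.
sex_counts = Counter(sex for _, sex, _ in records)

# Among females, count those whose house type is an HDB flat.
females_in_hdb = sum(
    1 for _, sex, house in records if sex == "F" and "HDB" in house
)

print(sex_counts["M"], sex_counts["F"], females_in_hdb)
```

Because every record refers to the same year, the comparison is purely between individuals and says nothing about change over time.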

Time Series

Data are collected at several points in time.
Data can be collected/observed at different frequencies such as hourly, daily, weekly, monthly, quarterly, or annually.
Ordering of records by time matters since there is a time dimension.
Time series data allow us to identify and analyse trends or behaviours over time. Table 2 shows an upward trend in the number of individuals with higher education levels (e.g., Diploma, University) from 2017 to 2022, and a downward trend in the number of those with Secondary education or below.
Other examples include interest rates (monthly), exchange rates (daily), GDP (annually), and inflation (monthly or annually).

Table 2: Number of Singapore Residents aged 25 years and above by highest qualification attained, from 2017 to 2022

Year	Below secondary	Secondary	Post-secondary	Diploma	University
2017	813,800	488,800	253,500	415,900	874,000
2018	761,500	511,200	260,700	434,900	908,700
2019	745,700	503,700	265,900	461,200	946,200
2020	757,800	484,300	296,900	456,800	982,000
2021	643,900	492,600	281,300	486,200	1,074,300
2022	646,300	492,200	308,000	521,900	1,116,900
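
One simple way to summarise the trends in Table 2 is the percentage change between the first and last year of each series. The sketch below assumes the figures are entered by hand from Table 2; `percent_change` is an illustrative helper, not a DOS method.

```python
# Figures copied from Table 2 (number of residents, by year).
years = [2017, 2018, 2019, 2020, 2021, 2022]
series = {
    "Below secondary": [813_800, 761_500, 745_700, 757_800, 643_900, 646_300],
    "Secondary": [488_800, 511_200, 503_700, 484_300, 492_600, 492_200],
    "Post-secondary": [253_500, 260_700, 265_900, 296_900, 281_300, 308_000],
    "Diploma": [415_900, 434_900, 461_200, 456_800, 486_200, 521_900],
    "University": [874_000, 908_700, 946_200, 982_000, 1_074_300, 1_116_900],
}


def percent_change(values):
    """Percentage change from the first to the last observation."""
    return (values[-1] - values[0]) / values[0] * 100


changes = {name: round(percent_change(v), 1) for name, v in series.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}% from {years[0]} to {years[-1]}")
```

A first-to-last comparison ignores movements in between (for example, the Diploma series dips in 2020), so it is best read alongside the full table or a line chart.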

Pooled

Combines time series and cross-sectional data, but observations in each cross section do not necessarily refer to the same unit.
For example, individuals across the years can be randomly sampled to obtain multiple years of cross-sectional data, and these datasets can be combined together.
The main advantage of pooled data over cross-sectional data is that the former typically has a larger sample size, since multiple cross-sectional datasets from different time periods are combined.

Table 3: Pooled cross-sectional data of individuals from 2010 to 2022, and their characteristics

Name	Year	Age	Sex	House Type	Highest Qualification Attained
Anthony	2017	22	M	3-Room HDB	Diploma
Anthony	2018	23	M	4-Room HDB	Diploma
Cobalt	2018	24	M	4-Room HDB	University
Daniel	2019	30	M	3-Room HDB	University
Ernest	2022	32	M	3-Room HDB	University
Felicia	2010	40	F	Condominium	Diploma
Germaine	2019	41	F	Condominium	Diploma
Hailey	2015	45	F	Condominium	Diploma
Isabelle	2022	13	F	4-Room HDB	Below-Secondary
James	2021	45	M	5-Room HDB	University

Note: Records are “pooled” into the same table. The records may be from different periods of time.
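
The sample-size advantage of pooling can be seen by comparing the size of the pooled table with the size of each yearly cross section inside it. This is a minimal sketch in Python using the names and years from Table 3; the variable names are illustrative.

```python
from collections import Counter

# (name, year) pairs copied from Table 3.
pooled = [
    ("Anthony", 2017), ("Anthony", 2018), ("Cobalt", 2018), ("Daniel", 2019),
    ("Ernest", 2022), ("Felicia", 2010), ("Germaine", 2019), ("Hailey", 2015),
    ("Isabelle", 2022), ("James", 2021),
]

# Size of each yearly cross section inside the pooled table.
records_per_year = Counter(year for _, year in pooled)

pooled_size = len(pooled)
largest_single_year = max(records_per_year.values())

print(pooled_size, largest_single_year)
```

No single year contributes more than two records here, while the pooled table holds ten.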

Longitudinal

Combines cross-sectional and time series data, similar to pooled data. The difference is that the same cross-sectional units (e.g. individuals) are observed at several points in time (days, months, years, before and after an event, etc.). It is also known as panel data.
Allows the same group to be observed over a period of time, so that changes over the period can be tracked. This is useful for detecting development trends or changes in the characteristics of interest. Hence, panel data are often used in studies that investigate the effects of certain events, policies or treatments. For example, the effects of a particular government policy on the wages of individuals can be studied by looking at the change in each individual's wages before and after policy implementation.

Table 4: Longitudinal data of individuals and their characteristics

Name	Year	Age	Sex	House Type	Highest Qualification Attained
Christopher	2020	22	M	4-Room HDB	Diploma
Christopher	2021	23	M	4-Room HDB	Diploma
Christopher	2022	24	M	4-Room HDB	University
David	2020	30	M	3-Room HDB	University
David	2021	31	M	3-Room HDB	Post-Graduate Degree
David	2022	32	M	3-Room HDB	Post-Graduate Degree
Grace	2019	40	F	4-Room HDB	Diploma
Grace	2020	41	F	4-Room HDB	Diploma
Grace	2021	42	F	Condominium	Diploma
Jason	2021	23	M	4-Room HDB	Post-Secondary
Jason	2022	24	M	4-Room HDB	Post-Secondary
Jason	2023	25	M	4-Room HDB	University
Michelle	2021	13	F	4-Room HDB	Below-Secondary
Michelle	2022	14	F	4-Room HDB	Below-Secondary
Michelle	2023	15	F	4-Room HDB	Below-Secondary

Note: Each unit (person in this example) is observed for multiple periods of time.
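
Because each person appears in several years, changes can be tracked within individuals. The sketch below, in Python, groups a subset of the Table 4 records by person and lists the years in which the highest qualification changed; the data structure and variable names are assumptions made for illustration.

```python
from collections import defaultdict

# (name, year, highest qualification) for three of the individuals in Table 4.
panel = [
    ("Christopher", 2020, "Diploma"),
    ("Christopher", 2021, "Diploma"),
    ("Christopher", 2022, "University"),
    ("David", 2020, "University"),
    ("David", 2021, "Post-Graduate Degree"),
    ("David", 2022, "Post-Graduate Degree"),
    ("Michelle", 2021, "Below-Secondary"),
    ("Michelle", 2022, "Below-Secondary"),
    ("Michelle", 2023, "Below-Secondary"),
]

# Group each person's records and order them by year.
history = defaultdict(list)
for name, year, qualification in sorted(panel, key=lambda r: (r[0], r[1])):
    history[name].append((year, qualification))

# For each person, record the years in which the qualification changed.
changes = {}
for name, observations in history.items():
    changes[name] = [
        (year, before, after)
        for (_, before), (year, after) in zip(observations, observations[1:])
        if before != after
    ]

print(changes)
```

This kind of within-person comparison is not possible with pooled data, where the records in different years generally belong to different people.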

Primary Data vs. Secondary Data

Primary data refer to data directly collected from the data source. This can be through surveys, interviews, or experiments. Primary data are generally considered reliable and objective. However, due to limitations such as cost and complexity of data, collecting primary data may not always be possible.

Secondary data refer to data collected by another party. One drawback is that secondary data are often not tailored to the specific needs of the researcher. They may also be costly to purchase if they are not freely available to the public.

How to Analyse the Data: A Chart is Worth a Thousand Words!

Graphs and charts help to present complex data in a visually appealing and simple-to-understand manner. Learn about the common types of graphs and charts below.

Download a copy of the graphs and charts pdf (1 MB).

Using Statistics Correctly: Can Numbers Lie?

Correlation vs. Causation

Correlation is a statistical measure that indicates the extent to which the values of two or more variables move in relation to each other. Positively correlated variables tend to move in the same direction, while negatively correlated variables tend to move in opposite directions. However, a change in one variable does not necessarily cause the change in the other. Causation, on the other hand, means that a change in one variable causes the other variable to change.

The figure below illustrates the difference between correlation and causation. Hot sunny weather would cause ice-cream to melt and cause sunburn (with prolonged sun exposure). Melting ice-cream and getting a sunburn are correlated, as they tend to occur together in hot sunny weather. If the presence of the hot sunny weather were ignored, it would be wrongly concluded that melting ice-cream causes sunburn!

Difference Between Correlation and Causation
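
The ice-cream and sunburn example can be mimicked numerically. The sketch below is in Python and uses made-up readings (they are hypothetical, not real measurements) in which temperature drives both variables. The two variables are strongly correlated overall, but the correlation vanishes once the temperature is held fixed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical readings, made up for illustration only.
temperature = [24] * 4 + [30] * 4 + [36] * 4   # the common cause
noise_melt = [1, -1, 1, -1] * 3                 # unrelated variation
noise_burn = [1, 1, -1, -1] * 3                 # unrelated variation

# Both outcomes depend on temperature, but not on each other.
melting = [t + e for t, e in zip(temperature, noise_melt)]
sunburn = [0.5 * t + e for t, e in zip(temperature, noise_burn)]

r_overall = pearson(melting, sunburn)

# Hold the weather fixed: look only at the four readings taken at 30 degrees.
fixed = [i for i, t in enumerate(temperature) if t == 30]
r_within = pearson([melting[i] for i in fixed], [sunburn[i] for i in fixed])

print(round(r_overall, 2), round(r_within, 2))
```

With these made-up numbers the overall correlation is about 0.91, while the correlation at a fixed temperature is 0, which is what one would expect if the weather, not the ice-cream, is responsible for the sunburn.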

Misleading Visualisations

Authors can unknowingly produce bad visualisations of data, or worse, be out to misinform their readers. It is important to be armed with knowledge to identify bad visualisations.

Using the scatter plot of Average Monthly Real Earnings Per Employee against Mean Years of Schooling above, data points for certain years could be omitted to give the impression that mean years of schooling are not correlated with average monthly real earnings, as seen in the scatter plot below. This is known as truncated data.

Scatter plot of Average Monthly Real Earnings Per Employee against Mean Years of Schooling, with omitted data points

Scatter plot of Average Monthly Real Earnings Per Employee against Mean Years of Schooling, without omitted data points

Simpson’s Paradox

Statistics from the Office for National Statistics (ONS) of the United Kingdom showed that death rates for the vaccinated were greater than those for the unvaccinated. These numbers could be wrongly used to argue that vaccination increases the risk of death.

Simpson’s Paradox refers to a trend that is present when data are aggregated, but disappears or reverses when the data are broken down into groups (e.g. by age band). The paradox arises here because death rates increase significantly with age, such that mortality rates are higher for older people than for younger people, other things being equal. Older people are more likely to be vaccinated than younger people, and are also more likely to die from Covid-19 infection or other health reasons. Therefore, age is a confounding variable since it is positively related to both vaccination rates and death rates. It is age, rather than vaccination, that is driving up the death rates of vaccinated people.

The mortality rates shown in the line chart below are crude measures of mortality, and do not take into account the age structure of the population. Accounting for the age structure of the population is important since it can influence the number of deaths. For example, a younger population will likely have fewer deaths than an older population, all else being equal.

Mortality Rates of Individuals by Vaccination Status in UK (Jan to Sep 2021)

Instead, age-standardised mortality rates^ can be used to meaningfully compare populations with different age structures. Age-standardised mortality rates are computed by first calculating the mortality rate within each specified age band, and then taking the weighted average of these rates based on a standardised age distribution.

Suppose there are 2 cities, City A and City B. City A has a younger population than City B but has a higher mortality rate in every age band. For simplicity, assume that there are 3 age bands: young, middle and old.

Mortality Rates of City A and City B

City	Measure	Young	Middle	Old	Overall
A	Mortality rate	8%	8%	20%	9.2%
A	Proportion of age group	0.3	0.6	0.1	1
B	Mortality rate	5%	5%	15%	10.0%
B	Proportion of age group	0.1	0.4	0.5	1

The unadjusted overall mortality rate in City A is 9.2% (based on (0.3 x 8%) + (0.6 x 8%) + (0.1 x 20%) = 9.2%), lower than the 10.0% of City B, even though City A’s mortality rate in each age band is higher.

Age-Standardised Mortality Rates of City A and City B

City	Measure	Young	Middle	Old	Overall (Age-Standardised)
A	Mortality rate	8%	8%	20%	11.6%
B	Mortality rate	5%	5%	15%	8.0%
	Standardised Proportion of age group	0.2	0.5	0.3	1

To adjust for age, take the weighted average of each city’s age-specific mortality rates using the standardised set of age group proportions in the table above. In this case, the mortality rate of City A becomes 11.6% (= (0.2 x 8%) + (0.5 x 8%) + (0.3 x 20%)), which is higher than City B’s 8.0% (= (0.2 x 5%) + (0.5 x 5%) + (0.3 x 15%)).
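
The City A and City B arithmetic can be checked with a few lines of code. This is a minimal sketch in Python using the figures from the two tables above; the function name `weighted_rate` is illustrative.

```python
# Age-specific mortality rates (in %) for the young, middle and old bands.
rates = {"A": [8, 8, 20], "B": [5, 5, 15]}

# Each city's own age distribution, and the common standardised distribution.
own_weights = {"A": [0.3, 0.6, 0.1], "B": [0.1, 0.4, 0.5]}
standard_weights = [0.2, 0.5, 0.3]


def weighted_rate(band_rates, weights):
    """Weighted average of age-specific rates; the weights should sum to 1."""
    return sum(r * w for r, w in zip(band_rates, weights))


# Crude rates use each city's own age mix.
crude = {c: round(weighted_rate(rates[c], own_weights[c]), 1) for c in rates}

# Age-standardised rates apply the same age mix to both cities.
standardised = {
    c: round(weighted_rate(rates[c], standard_weights), 1) for c in rates
}

print(crude, standardised)
```

The ranking of the two cities flips between the crude and the age-standardised rates because only the weights change; the age-specific rates are identical in both calculations.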

Going back to the Covid-19 vaccination example from ONS, the age-standardised death rates of the vaccinated are lower than those of the unvaccinated.

Age-Standardised* Mortality Rates of Individuals by Vaccination Status in UK (Jan to Sep 2021)

^For more information on age-standardised mortality rates for Singapore, please refer to the Statistics Singapore Newsletter article pdf (778 KB).

*Age-standardised mortality rates per 100,000 people, standardised to the 2013 European Standard Population using 5-year age groups from age 10 and over.

The above information is cited from the Office for National Statistics (ONS) of the United Kingdom, and its usage is subject to ONS's terms and conditions.

Beware of results from small sample sizes, or polls

When testing out a hypothesis, it may not always be possible to collect data for the entire population due to logistical or financial reasons (e.g. research budget). Hence, an option for researchers would be to use a smaller group, which is known as a sample (Figure 10).

Population vs Sample

However, small sample sizes could affect the reliability of the results. One reason is that small sample sizes decrease the statistical power of a study, which means that the study is less likely to detect a true effect that exists in the entire group. Another reason could be that the sample is not representative of the population. In online polls, for example, often only people who feel strongly about a subject respond, so the results are skewed towards this group even though the majority could be neutral about the subject. As such, robust statistical reporting or research typically requires a large enough sample size. One way to guard against non-representativeness is simple random sampling, where samples are chosen strictly by chance, so that all members of the population have the same chance of being selected for the study.
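
The effect of sample size on reliability can be illustrated with a small simulation. The sketch below is in Python and assumes a hypothetical population of 100,000 people, 30% of whom hold a given opinion. It repeatedly draws simple random samples of 10 and of 1,000 people and compares how much the estimated proportion varies from sample to sample. The population, the sample sizes and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical population: 30% of 100,000 people hold a given opinion.
population = [1] * 30_000 + [0] * 70_000


def sample_proportions(sample_size, repetitions=200):
    """Proportion holding the opinion in each of many simple random samples."""
    return [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(repetitions)
    ]


small = sample_proportions(10)
large = sample_proportions(1_000)

# Spread of the estimates across repeated samples.
spread_small = pstdev(small)
spread_large = pstdev(large)

print(round(spread_small, 3), round(spread_large, 3))
```

Estimates from the samples of 10 scatter far more widely around the true proportion of 0.30 than those from the samples of 1,000. Note that this only illustrates sampling variability; it does not address the separate problem of self-selected respondents in online polls.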

Concluding Remarks

As statistics is a broad field, the content above serves as a brief and simple introduction to the different types of data, ways to analyse data, and the common pitfalls of using statistics. With this new-found knowledge, enjoy exploring and working with data to gain useful insights.