Web-scraping Principles
Background
- Web-scraping is the collection of data from the internet using a program (i.e., a crawler) to search for the required data and a web tool (i.e., a scraper) to extract it. It has numerous applications across industries, such as news and price monitoring, market research and sentiment analysis, which would be tedious to perform manually.
- The Singapore Department of Statistics (DOS) uses alternative data sources and methods such as web-scraping to supplement the information gleaned from traditional surveys. Web-scraping allows more firms to be covered and data to be collected in a more timely manner, while reducing respondent burden and enabling DOS to deliver more insightful statistics. An example is the use of online price information in the compilation of the Consumer Price Index (CPI); a simple illustration of this crawl-and-extract pattern is sketched after this list.
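A minimal sketch of the crawl-and-extract pattern described above, using Python with the requests and BeautifulSoup libraries; the product URL and CSS selector are hypothetical placeholders, not an actual DOS data source:

```python
# Minimal crawl-and-extract sketch: fetch a page (crawler step),
# then parse out a price (scraper step). URL and selector are hypothetical.
import requests
from bs4 import BeautifulSoup

PRODUCT_URL = "https://example.com/products/rice-5kg"  # hypothetical page


def fetch_page(url: str) -> str:
    """Crawler step: request the page and return its HTML."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def extract_price(html: str) -> str:
    """Scraper step: parse the HTML and pull out the listed price."""
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.select_one("span.price")  # hypothetical selector
    if price_tag is None:
        raise ValueError("Price element not found on page")
    return price_tag.get_text(strip=True)


if __name__ == "__main__":
    html = fetch_page(PRODUCT_URL)
    print("Observed price:", extract_price(html))
```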
DOS’s Web-scraping Principles
With an increasing amount of data residing on websites, DOS conducts web-scraping activities as part of our data collection while minimising the burden on respondents who would otherwise provide the information. The data may be used by DOS or shared with other public agencies to fulfil public duties, including policy analyses and service delivery.
We adopt the following principles to ensure that web-scraping is carried out consistently, ethically and transparently.
Principles
- Abiding by applicable national legislation;
- Minimising the burden on website owners (e.g., by adding idle time between requests and by web-scraping at times of day when the web server is not expected to be under heavy load); and
- Identifying ourselves to the website owners when carrying out web-scraping (e.g., in user agent strings), as sketched in the example below.
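A minimal sketch of the burden-minimisation and self-identification principles, assuming the Python requests library; the bot name, contact URL and target URLs are hypothetical placeholders, not an actual DOS configuration:

```python
# Polite-scraping sketch: identify the scraper in the User-Agent string
# and add idle time between consecutive requests.
import time

import requests

HEADERS = {
    # Identify ourselves to the website owner, with a contact point (hypothetical).
    "User-Agent": "ExampleStatsAgencyBot/1.0 (+https://example.gov.sg/webscraping-contact)"
}

URLS = [
    "https://example.com/catalogue/page-1",  # hypothetical target pages
    "https://example.com/catalogue/page-2",
]

IDLE_SECONDS = 5  # pause between requests to avoid loading the web server


def scrape(urls):
    """Fetch each URL in turn, pausing between requests."""
    results = {}
    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        results[url] = response.text
        time.sleep(IDLE_SECONDS)  # idle time between consecutive requests
    return results


if __name__ == "__main__":
    pages = scrape(URLS)
    print(f"Fetched {len(pages)} pages")
```

A fixed pause keeps the request rate predictable for the site operator, and the contact URL in the user agent string gives the website owner a direct way to raise concerns about the scraping activity.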