The Post-Apartheid Labour Market Series

The Post-Apartheid Labour Market Series (palms) is a compilation of microdata from 69 household surveys conducted in South Africa. The dataset and the code used to create the data are publicly available from DataFirst, a data repository at the University of Cape Town (www.doi.org/10.25828/gtr1-8r20). Harmonising the data required understanding the differences across the surveys, which has generated new knowledge about the South African labour market.


Introduction
South Africa has conducted multiple nationally representative labour market-related household surveys since 1993. The Post-Apartheid Labour Market Series (palms) is a harmonised compilation of microdata from 69 of these surveys and was created by the authors together with David Lam at the University of Michigan (Kerr, Lam & Wittenberg, 2019). The surveys included in palms are the October Household Surveys (ohss) (1994-1999), the biannual Labour Force Surveys (lfss) (2000-2007) and the Quarterly Labour Force Surveys (qlfss) (2008-2019), all conducted by Statistics South Africa, the National Statistics Office (nso). palms also includes the 1993 Project for Statistics on Living Standards and Development (pslsd) conducted by the Southern African Labour and Development Research Unit (saldru) at the University of Cape Town (uct). palms is publicly available through DataFirst at uct (see Figure 1). We have also released a guide to palms to help users understand the data as well as issues they might encounter when using it to undertake labour market analysis (Kerr & Wittenberg, 2019a). Here we discuss the most recent version (3.3) of the data. Several prior versions have been released, and we plan to continue updating the data once a year as new surveys are released.
The Post-Apartheid Labour MARKET | 10.1163/24523666-bja10011 research data journal for the humanities and social sciences (2020) 1-11

Background
The surveys included in palms enable, in theory, research about inequality, unemployment, changes in employment structure and many other pressing labour market issues in South Africa. The microdata for all these surveys are publicly available, so any researcher could download each survey and assemble them to better understand the evolution of the South African labour market. Given the rapid change in the country since 1993, however, it was inevitable that the surveys differed from each other in a multitude of ways, as we discuss in more detail below. palms is an important dataset for several reasons. Firstly, publicly available and harmonised microdata from these labour market-related surveys make it much easier for researchers to understand the South African labour market. Secondly, the availability of palms means that researchers do not have to duplicate work done by many others in creating the data. Thirdly, if results on important issues differ across researchers who are all using palms, then one can at least rule out the explanation that the researchers created the data in different ways. Finally, many researchers have simply used two surveys to describe trends over time, but the conclusions drawn often depend on which two surveys were used. In South Africa, the 1995 ohs was used by many researchers to describe changes between 1995 and a later point, but it turned out that 1995 was not an ideal anchor point for such comparisons, for reasons that are not clear (Branson & Wittenberg, 2007). ohs 1995 found many more employed African men, many more orphans and a much smaller gender wage gap than the surrounding surveys (Wittenberg, 2014b). Having all the surveys together allows researchers to examine trends and to make sure their results are not an artefact of the two surveys they chose to compare.

Problem
There are many difficulties in constructing a consistent picture of earnings, employment and unemployment from South African household surveys. These include changes in questions on key outcomes, sampling methods, weighting, fieldwork, data imputation and data processing by the data producer. These difficulties mean that, depending on which surveys researchers use and what decisions they make about the many aspects of data processing and creation, they can reach very different conclusions. The point of palms is that a common dataset is created that researchers can build on, but also that all the processing undertaken is made transparent by providing the code used to create the data, which allows others to criticise and/or replicate the data. Although we give examples of difficulties and issues from South African labour market-related surveys, these issues also apply in many other contexts, and we hope that this discussion is helpful for other researchers conducting similar exercises elsewhere.

Methods
In prior research we have documented many of the changes and inconsistencies across the surveys included in palms. We briefly discuss these here to demonstrate how the harmonisation of the various surveys in creating palms required a substantial research investment, generated a new research program in understanding the evolution of the South African labour market, and gave new insights into results and trends that were not well understood. The issues we discuss include changes in questions about earnings and processing of earnings data, earnings imputation methods, fieldwork changes and issues with weighting. palms includes several labour market-related variables about education, employment status, employment and employer characteristics. It also includes a harmonised earnings variable. Several changes across the surveys make constructing even simple descriptions of changes in earnings over time very difficult. Wittenberg (2014a) provides a detailed discussion of the inconsistencies in earnings across the different surveys. These include very different questions to the self-employed about their income in different ohss, changes in the way income bracket questions were asked, the extent of outliers in earnings across the surveys, and the treatment of those reporting zero earnings. Probably the most valuable aspect of palms is the harmonisation of all these earnings data into a single earnings variable. But palms also includes a separate earnings data file with the original variables from all the surveys, in case researchers need to investigate these further.
Between 10% and 15% of the employed in the surveys in palms are missing earnings data. There are also bracket responses and responses that are clearly outliers. These response types are unlikely to occur randomly across individuals. Imputation is the usual solution to this type of problem, but single imputation will understate the true statistical uncertainty of any estimate. palms thus also includes a separate file with multiply imputed earnings data: earnings for refusals, bracket responses and outliers are all imputed for each individual to allow analysts to understand the impact of missing earnings data and to generate unbiased standard errors. Schafer (1999) recommends imputing 5-10 times, and palms contains 10 imputations.
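The way estimates from a multiply imputed file are usually pooled can be sketched as follows. This is a minimal illustration of Rubin's combining rules and is not taken from the palms code: the function name and the numbers in the usage lines are hypothetical, and the per-imputation estimates and variances are assumed to have been computed already, with the survey design taken into account.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool one estimate per imputed dataset using Rubin's combining rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least 2 imputations")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical numbers: one estimate and its variance from each of three imputations.
pooled, total_var = combine_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
std_error = total_var ** 0.5
```

The between-imputation term is what a single imputation cannot capture, which is why a single imputed value per person tends to give standard errors that are too small.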
Data producers also undertake imputation of missing or nonsensical earnings responses. Unfortunately, and unlike the palms imputations, the methods used are often not carefully documented or explained. The 1994 ohs earnings data were heavily imputed by the nso without any documentation. palms provides ohs 1994 earnings created from a process of reverse engineering based on Wittenberg (2008). The more recent qlfss also have substantial imputations by the nso, and again the methodology is not documented and no imputation flags are provided in the data. We have used palms to show that there are two imputation regimes for two different periods of time and that the imputation is likely to be driving impossible changes in the Gini coefficient of earnings (Kerr & Wittenberg, 2019b). We cannot do any more without publicly available unimputed data from the nso, which we have requested multiple times, unfortunately without success.
Kerr and Wittenberg (2019b) discuss several fieldwork-related issues in the surveys in palms. These include the vast overestimates of subsistence agricultural workers in the two lfss from 2000 and of the informally self-employed in February 2001, which together resulted in an impossibly large increase in the employment rate and labour force participation rate at the start of the new millennium. Since the qlfss began in 2008, Statistics South Africa has excluded subsistence agriculture from the definition of employment. We cannot undo these kinds of issues post-fieldwork, but some partial solutions are possible. For example, palms includes an employment dummy variable that excludes those employed in subsistence agriculture, given the inconsistencies in their treatment over the surveys. The palms guide also explains some of these issues, so that researchers are made aware of them (Kerr & Wittenberg, 2019a).
Many household surveys are calibrated to a demographic model of the population they cover. This means that the sample design weights, adjusted for non-response, are calibrated so that the weighted totals match the best estimates from the demographic model on a few key characteristics, usually sex, age groups, province and (in South Africa) self-declared race. The difficulty is that the demographic models may be updated and improved, and thus earlier population estimates may be incorrect. This leads to large jumps in totals whenever major adjustments are made to these demographic models (Branson & Wittenberg, 2014). To ameliorate this issue, palms includes weights constructed using a consistent demographic model for the entire period, using cross-entropy weighting to calibrate the weights (Wittenberg, 2010).
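The idea of calibrating weights to population totals can be illustrated with a short sketch. The code below uses raking (iterative proportional fitting), a simple and closely related technique whose solution stays as close as possible to the design weights in an entropy sense; it is not the procedure of Wittenberg (2010). The respondents, margins and population totals are all hypothetical.

```python
def weighted_totals(weights, labels):
    """Sum the weights within each category of one calibration margin."""
    totals = {}
    for w, label in zip(weights, labels):
        totals[label] = totals.get(label, 0.0) + w
    return totals


def rake(design_weights, margins, targets, cycles=100):
    """Rescale weights margin by margin until weighted totals match the targets.

    margins: margin name -> list with one category label per respondent
    targets: margin name -> {category: population total}
    """
    weights = list(design_weights)
    for _ in range(cycles):
        for name, labels in margins.items():
            totals = weighted_totals(weights, labels)
            weights = [w * targets[name][label] / totals[label]
                       for w, label in zip(weights, labels)]
    return weights


# Hypothetical toy sample of four respondents and made-up population totals.
design_weights = [1.0, 2.0, 3.0, 4.0]
margins = {"sex": ["m", "m", "f", "f"], "province": ["a", "b", "a", "b"]}
targets = {"sex": {"m": 60.0, "f": 40.0}, "province": {"a": 70.0, "b": 30.0}}
calibrated = rake(design_weights, margins, targets)
```

If the targets for every survey are drawn from one demographic model, the weighted totals no longer jump when the data producer revises its model between survey releases.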

Data
- Post-Apartheid Labour Market Series (palms) deposited at DataFirst, doi: www.doi.org/10.25828/gtr1-8r20
- Temporal coverage: 1993-2019 and onward

There are 69 surveys included in palms version 3.3. The earliest survey is the 1993 pslsd. Before this, the Apartheid state did not collect nationally representative data, as it did not want to publicise the dire state of the living standards of many South Africans. The data that it did collect were also not publicly available. The pslsd is also the only survey in palms not conducted by the nso. It was included because it has probably been the most widely used household survey since it was undertaken in 1993 and contains valuable information about the state of the labour market just before the advent of democracy. The ohss were run annually between 1994 and 1999, the 1994 survey being run six months after the first democratic elections. The ohss (and the pslsd) included a much broader set of questions than just those about the labour market, but we have not included much of this data in palms since it is a labour-focused dataset. The lfss were run biannually and focused on labour market-related issues. The qlfss have run every quarter since 2008 and are similar to the lfss. Table 1 summarises the data.
Table 1. Summary of surveys included in palms

Several types of variables are included in palms. These include the household and person identifiers, the survey design variables, basic demographic and location information on each individual, educational attainment, labour force status and numerous variables pertaining to the individual's employment and employer. Table 2 shows a summary of the variable types in palms, with some examples. Since the surveys differed, not all surveys have the same set of variables, but there is a core of common variables. palms includes the original household and person identifiers so that researchers can merge in other data from the surveys that are not in palms. All the surveys in palms collected data on all resident household members. We have included children and the elderly in the data even though they have no labour market data, because many labour-related research questions involve these groups. All the surveys included in palms were two-stage cluster samples with stratification, although the number of households sampled per cluster and the strata have varied substantially over the surveys. One of the features of palms is the inclusion of the correct strata and cluster variables, which were often incorrect in the original versions released by Statistics South Africa. The inclusion of these variables allows the user to specify the correct sample design in the statistical software used to conduct any analysis (along with the cross-entropy weights discussed earlier). The pslsd sample size was 9000 households. The sample sizes of the ohss varied between 16 000 and 30 000 households, partly as a result of how much funding was available to conduct the surveys. The sample size for the lfss and qlfss was 30 000 households until 2014 and 33 000 households from 2015 onwards. The realised samples are lower than this for all the surveys as a result of refusals, non-contact, vacant dwellings, etc.
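Why the strata and cluster variables matter can be seen in a small sketch of a design-based variance calculation. The code below applies the standard Taylor-linearised estimator for a weighted mean, treating clusters as sampled with replacement within strata. It is a hedged illustration only: the record layout and the toy data are hypothetical, and in practice analysts would declare the design in their survey software rather than code this by hand.

```python
from collections import defaultdict


def weighted_mean_with_design_variance(records):
    """records: iterable of (stratum, cluster, weight, value) tuples."""
    records = list(records)
    weight_sum = sum(w for _, _, w, _ in records)
    estimate = sum(w * y for _, _, w, y in records) / weight_sum

    # Linearised contribution of each record, summed to cluster level.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        cluster_totals[stratum][cluster] += w * (y - estimate) / weight_sum

    # Between-cluster variation within each stratum, added over strata.
    design_variance = 0.0
    for clusters in cluster_totals.values():
        n = len(clusters)
        if n < 2:
            raise ValueError("each stratum needs at least two clusters")
        centre = sum(clusters.values()) / n
        design_variance += n / (n - 1) * sum((t - centre) ** 2 for t in clusters.values())
    return estimate, design_variance


# Hypothetical toy data: two strata, each with two clusters.
toy = [
    ("s1", "A", 1.0, 1.0), ("s1", "A", 1.0, 3.0),
    ("s1", "B", 1.0, 5.0), ("s1", "B", 1.0, 7.0),
    ("s2", "C", 1.0, 2.0), ("s2", "D", 1.0, 6.0),
]
estimate, design_variance = weighted_mean_with_design_variance(toy)
```

Because the variance is built from cluster-level totals within strata, incorrect cluster or stratum identifiers change the standard error even when the point estimate is unaffected.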

Palms Use Case
Having explained the palms data, we now show two brief examples of how palms can be used to shed light on aspects of the South African labour market. Figure 2 shows the employment rate, non-participation rate and broad and strict unemployment rates. It shows the well-known fact that unemployment has grown dramatically in the post-Apartheid period and is very high, whether one uses the strict or broad rates (broad unemployment includes those who have not looked for work but who want work). But the figure also shows the much less well-known fact that the employment rate has been roughly constant over the post-Apartheid period. These two facts are compatible because of the large decline in the non-participation rate. The two red lines show the change between the ohss and lfss (in 2000) and between the lfss and the qlfss (in 2008). Clearly, there have been changes in definitions between these surveys that affect the measurement of labour force participation.
Figure 3 shows various percentiles of the earnings distribution, extending the work of Wittenberg (2017a, 2017b). Median earnings have declined since 1993, whilst the 75th and 90th percentiles have increased, the 90th percentile very substantially. The 25th and 10th percentiles have actually increased. Table 3 shows the 95% confidence intervals for these percentiles in 1993 and 2017, accounting for the complex survey design. These changes mean that inequality within the bottom half of the distribution has decreased, whilst inequality in the top half has increased substantially. It seems that earnings across the distribution have been flat or declining since about 2012. Earnings information was collected in several different ways across the surveys in palms, and the harmonisation process has made the creation of the sorts of trends displayed in Figure 2 and Figure 3 much simpler.
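The relationship between the rates discussed above can be made concrete with a small sketch. The status codes, the function name and the toy numbers below are hypothetical and are not palms variable names. The input is assumed to be restricted to the working-age population already, and non-participation is reported under both definitions because which one a given series uses is an assumption the analyst must check.

```python
def labour_market_rates(people):
    """people: iterable of (status, weight) pairs for working-age individuals.

    Hypothetical status codes: 'employed', 'searching' (unemployed and looking
    for work), 'wants_work' (wants work but has not looked), 'inactive'.
    """
    totals = {"employed": 0.0, "searching": 0.0, "wants_work": 0.0, "inactive": 0.0}
    for status, weight in people:
        totals[status] += weight
    employed = totals["employed"]
    searching = totals["searching"]
    wants_work = totals["wants_work"]
    inactive = totals["inactive"]
    working_age = employed + searching + wants_work + inactive
    return {
        "employment_rate": employed / working_age,
        "strict_unemployment_rate": searching / (employed + searching),
        "broad_unemployment_rate": (searching + wants_work)
        / (employed + searching + wants_work),
        # Non-searchers count as outside the labour force under the strict definition.
        "non_participation_rate_strict": (wants_work + inactive) / working_age,
        "non_participation_rate_broad": inactive / working_age,
    }


# Hypothetical weighted counts, one record per status for brevity.
rates = labour_market_rates([
    ("employed", 40.0), ("searching", 10.0), ("wants_work", 10.0), ("inactive", 40.0),
])
```

With the employment rate held fixed, any fall in non-participation must show up as a rise in unemployment, which is the pattern described for Figure 2.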

Concluding Remarks
The Post-Apartheid Labour Market Series (palms) is a publicly available source of harmonised microdata focusing mainly on the South African labour market and has been created from 69 household surveys conducted between 1993 and 2019. All the code used to create the data is also publicly available. palms is specific to South Africa but provides an example that could be followed by researchers in other contexts. Since such a data source is a public good, similar projects may have to be undertaken by nsos or funded by international donors. The creation of palms has not just been an exercise in data production. It has led to a substantive research program that has improved the quality of the data and shed new light on several aspects of the South African labour market.