I’m currently working on a paper examining the relationship between occupational characteristics and childbearing using the NEPS Starting Cohort 6 (SC6). I have some questions regarding missing values in the income and gender role variables and would greatly appreciate your guidance.
For income, I’m using variables from the spEmp dataset, such as ts23511_g1, ts23411_g1, and other related variables. However, I encounter substantial missing values: fewer than 20,000 non-missing values out of approximately 85,000 employment episodes up to Wave 14. According to the documentation, many of the missing values are either “missing by design” or “not determinable.”
Could you please clarify what “not determinable” means in this context? Also, do you have any information on whether these missing values can be considered missing at random (MAR)? I’m considering multiple imputation using the mice package in R, and it would help to know whether this approach is appropriate. Since my study covers a long time span (from the 1980s onward), it’s important for me to maximise coverage of income information across employment episodes.
Additionally, I have a quick question regarding the gender role variables from the pTarget dataset: how should I interpret “system missing values” in those items?
Thank you very much for your support and for providing access to this excellent dataset.
In this case, “not determinable” means that none of the source variables from which the variable is derived contains usable information. Missing income data is very common. For instance, ts23511_g1 is filled with information from ts23510. If ts23510 is missing, ts23512 is used; if ts23512 is missing, ts23513 is used; and if ts23513 is also missing, ts23514 is used. If ts23514 is missing as well, the median of ts23510 is used, provided ts23511 is available.
In short: ts23510 >> ts23512 >> ts23513 >> ts23514 >> median(ts23510).
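To make the order of precedence concrete, here is a minimal Python sketch of such a fallback chain. It is only an illustration, not the actual NEPS derivation code: the example values are invented, treating every negative value as a missing code is an assumption, and the extra condition on ts23511 for the median step is not modelled.

```python
from statistics import median

# Order of precedence as described above.
SOURCE_ORDER = ["ts23510", "ts23512", "ts23513", "ts23514"]


def is_missing(value):
    # Assumption: None stands for a system missing and every
    # negative value is a missing code (e.g. -54).
    return value is None or value < 0


def derive_income(episode, fallback_median=None):
    """Return the first non-missing source value, else the fallback."""
    for name in SOURCE_ORDER:
        value = episode.get(name)
        if not is_missing(value):
            return value
    # Nothing usable in any source variable: fall back to the median
    # (or None, i.e. "not determinable", if no median is supplied).
    return fallback_median


# Invented example episodes, not real NEPS data.
episodes = [
    {"ts23510": 2400},
    {"ts23510": -97, "ts23512": 1800},
    {"ts23510": None, "ts23512": None, "ts23513": -98, "ts23514": 1200},
    {"ts23510": None, "ts23512": None, "ts23513": None, "ts23514": None},
    {"ts23510": 3000},
]

valid = [e["ts23510"] for e in episodes if not is_missing(e.get("ts23510"))]
fallback = median(valid) if valid else None

derived = [derive_income(e, fallback) for e in episodes]
print(derived)
```

The fourth episode has no usable source value at all, so it receives the median of the valid ts23510 values; without a fallback it would stay undetermined.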
System missings in the gender role variables: I can’t blame you for getting stuck here. All 5208 system-missing cases should actually be coded as missing by design (-54). Only target persons who had already been interviewed before wave 4 (Panelbefragte, i.e. panel respondents) were asked about gender roles. Those who started the survey in wave 4 (Erstbefragte, i.e. first-time respondents) did not get this item set.
This is something you can’t easily find out by looking at the field instrument (https://www.neps-data.de/Portals/0/NEPS/Datenzentrum/Forschungsdaten/SC6/Feldversionen/NEPS_SC6_SurveyInstruments_Field_w4-5_de.pdf).
On page 639, variable t43637 is available in the questionnaire for Erstbefragte, but it is filtered out in the questionnaire programming (variable zqs1a_1: if (h_etappe = 8) goto 20115Z). All target persons in SC6 belong to etappe 8. Therefore, nobody in that cohort saw these questions. This is almost impossible to find out. I am very sorry, but the designers of that question module obviously did not keep the user’s point of view in mind.
I will forward the missing at random (MAR) question to colleagues in the statistics department. They should give you more information on that soon.
Thank you very much for your clear and helpful explanation. If I understand correctly, “not determinable” suggests that a variable is derived from others, while “missing by design” means that a respondent was not surveyed on a particular question.
However, I’m still a bit unsure about how to interpret “system missings.” I’m sorry — I find it quite challenging to fully grasp the distinctions between the different types of missing value codes.
When variables are derived from open-ended questions or derived using covariates, “not determinable” means that either the information is incomplete or the open-ended answer could not be coded into a scheme (if somebody said their desired occupation is “batman”, this open-ended information does not fit into the scheme).
Missing by design means that a question was not asked in a wave, either for all target persons or for a certain subpopulation. The “Erstbefragte” are such a subpopulation and should be coded with -54, but as this information is hidden within the logic of the questionnaire programming (if (h_etappe = 8) goto 20115Z), nobody found out that it should be coded with -54.
If a target person skips items depending on previous questions, it would be great if the item developers defined a missing code (e.g. -99 ‘filtered’) for this in the programming, but in the early days this did not always happen. Therefore, system missings are often due to undefined or unlabeled filtering. I could only find out about it using raw data, which is not available to you.
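If you want to make the gender role case explicit in your own working data, a recode along the following lines might help. This is only a sketch under assumptions: the row layout and the field name first_wave (the wave of the first interview) are hypothetical and not NEPS variables, and None stands in for a system missing.

```python
MISSING_BY_DESIGN = -54


def recode_system_missing(rows, item="t43637"):
    """Set system missings of first-time wave 4 respondents to -54."""
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        # Hypothetical rule: system missing on the item AND first
        # interviewed in wave 4 (Erstbefragte) means missing by design.
        if new_row.get(item) is None and new_row["first_wave"] == 4:
            new_row[item] = MISSING_BY_DESIGN
        recoded.append(new_row)
    return recoded


# Invented example rows, not real NEPS data.
rows = [
    {"person": 1, "first_wave": 2, "t43637": 3},
    {"person": 2, "first_wave": 4, "t43637": None},
    {"person": 3, "first_wave": 2, "t43637": None},
]

result = recode_system_missing(rows)
print(result)
```

The third row is deliberately left as it is: a system missing for a panel respondent has some other cause and should not be relabelled as missing by design.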
I fully understand that it is quite hard to distinguish between the different missing codes.