Identifying Novice Teachers in NEPS/LAP Dataset

sdemirtas · 5 March 2026, 06:16

Hello!

I recently received access to the NEPS/LAP data for my doctoral dissertation at the University of Cologne.

My dissertation focuses on the self-efficacy growth trajectories of novice teachers, specifically participants who were in the early stages of their teaching career during the NEPS/LAP study period between 2015 and 2022.

While exploring the dataset, I noticed that the sample also includes highly experienced teachers, including some participants who appear to be well beyond the novice stage. Since these cases are not suitable for my study, I have been trying to identify and exclude experienced teachers from the sample. However, the variables “length of service at the school” and “year of second state examination” have not helped me to do this (as there are many N/A responses there).

I would therefore like to ask whether you could suggest a way to identify novice teachers in the NEPS/LAP data and exclude more experienced participants. For example, is there any variable or indicator that would help determine participants’ career stage at the time they first entered the study, or information on when they began and completed their Referendariat?

Thank you very much in advance for your help.

Saniye

dietmar.angerer · 18 March 2026, 15:20

Hello Saniye!

Sorry it took me so long to reply!

I could not find the variable “length of service at the school”. Please tell me the dataset and the variable name.

There are loads of variables with information on the Referendariat, but the metadata of those variables often notes that they should not be used for analysis. You can check this with infoquery (part of the Stata tools on our website) or in the data view in SPSS.

But even the variables that are meant for analysis seem to be contradictory. Try using data from spEmp to pick out teachers and the dates on which they started teaching. Data from spVocExtExam might also help to gather information on exam dates.
This takes some extensive data preparation, but it could be worse! You can find the code after the example on episode splitting.
You need to match the interview dates with the episode data. This is called episode splitting and has been described in the forum before. Using the wave indicator in the episode datasets is not recommended, because it only marks the time of the last interview, not the date on which the event happened!

First, create an interview date in CohortProfile:

*CohortProfile: wave and intdate

ID_t    wave        intdate 
--------------------------------
1          1        201603
1          2        201612
1          3        201705
1          4        201804

Then take the start and end of the events in spEmp:

*spEmp: event

ID_t    start    end
--------------------------------
1      201703    201801

Expand each event into one row per month of its duration, creating a variable intdate that runs from the start to the end of the event:

*spEmp:  episode-split event

ID_t    start    end      intdate
---------------------------------
1      201703    201801    201703
1      201703    201801    201704
1      201703    201801    201705
1      201703    201801    201706
1      201703    201801    201707
1      201703    201801    201708
1      201703    201801    201709
1      201703    201801    201710
1      201703    201801    201711
1      201703    201801    201712
1      201703    201801    201801

Match the events with CohortProfile on intdate to get a proper wave indicator:

use "CohortProfile.dta", clear
merge 1:m ID_t intdate using "spEmp_split.dta", keep(matched) 

ID_t    wave        intdate     start       end 
------------------------------------------------
1          3        201705     201703    201801
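
The steps above can be sketched on toy data as follows. This is only a minimal illustration, not NEPS data: the person and dates are made up to mirror the example tables, and the helper variables iy, im, sy, sm, ey, em as well as the tempfile cp are hypothetical names.

```stata
* toy CohortProfile: one person, four interviews (made-up dates)
clear
input ID_t wave iy im
1 1 2016 3
1 2 2016 12
1 3 2017 5
1 4 2018 4
end
generate intdate = ym(iy, im)
format %tm intdate
drop iy im
tempfile cp
save `cp'

* toy spEmp: one episode lasting from 2017m3 to 2018m1
clear
input ID_t sy sm ey em
1 2017 3 2018 1
end
generate start = ym(sy, sm)
generate end   = ym(ey, em)
format %tm start end
drop sy sm ey em

* episode splitting: one row per month of the episode
generate duration = end - start + 1
expand duration
bysort ID_t: generate intdate = start + _n - 1
format %tm intdate

* keep only the episode months that coincide with an interview
merge 1:1 ID_t intdate using `cp', keep(match) nogenerate
list ID_t wave intdate start end, noobs
```

Only the interview in 2017m5 falls inside the episode, so a single row (wave 3) remains. The block should be run in one go as a do-file, because the tempfile macro is only valid within the same run.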

Here is the syntax to control for teaching experience.


global sufpath "PATH_TO_YOUR_SUF_DATA"
global workdir "PATH_TO_YOUR_WORKING_DIR"

global version D_19-0-0
//global version R_19-0-0


label define yesno 0 "No" 1 "Yes" // name must match the "label values ... yesno" calls below

**********************************************************************************
********** Prepare interview dates for matching with episode data ****************
**********************************************************************************

// generate interview-date-variable to match with episode data later (episode splitting)
use "${sufpath}/SC5_CohortProfile_${version}.dta" , clear
label language en
local idvars ID_t wave
keep ID_t wave tx8600y tx8600m tx8610y tx8610m tx80121 tx80220 tx80521 tx80522

// generate interview date
generate intdate = ym(tx8600y,tx8600m)
 replace intdate = ym(tx8610y,tx8610m) if missing(intdate) // if missing date of interview replaced by date of testing
 
 // replace missing intdate with the wave mean of intdate if the person is a temporary drop-out
generate mean_as_intdate 		= (missing(intdate) & tx80220 == 2)
bysort wave: egen intdate_mean 	= mean(intdate)
replace intdate = round(intdate_mean) if missing(intdate) & tx80220 != 3
keep if !missing(intdate) 
format %tm intdate

// date of interview is identical in two waves for a person: add one month to the later wave to make ID_t + intdate unique
bysort ID_t intdate (wave): replace intdate = intdate + 1 if _n==_N & _N > 1
isid ID_t intdate
 
label variable intdate "date of interview, missings filled"
label variable mean_as_intdate "date of interview replaced by mean of interview date" 
 
drop tx8600y tx8600m tx8610y tx8610m intdate_mean

// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
	label variable `var' "`: variable label  `var'' (source: CohortProfile)"
}
save "${workdir}/SC5_Teachers_CP.dta", replace  // >> interview dates prepared for merging



// create a one-line-per-person file in which the minimum interview date of each wave (across all respondents) is stored in a separate variable
* this is handy for merging data from spVocExtExam with higher matching numbers, as episode splitting is not possible there
sum wave
forvalues wave=1/`r(max)' {
	egen wave_min_w`wave'_temp = min(intdate) if wave == `wave'
	egen wave_min_w`wave' = max(wave_min_w`wave'_temp) // spread the wave minimum to all rows
	replace wave_min_w`wave' = round(wave_min_w`wave')
	format %tm wave_min_w`wave'
	drop wave_min_w`wave'_temp
}
keep ID_t wave_min_w*
duplicates drop
save "${workdir}/SC5_Teachers_CP_wavemin.dta", replace



**********************************************************************************
********** Extract possibly useful variables from spVocExtExam *******************
**********************************************************************************
use "${sufpath}/SC5_spVocExtExam_${version}.dta", clear
label language en

// Teachers degree picked by professional qualification (KldB2010)
generate teacher_kldb2010 = 1 if  inlist(ts15301_g2,84114,84124,84134,84184,84213,84214) 
 replace teacher_kldb2010 = 0 if !inlist(ts15301_g2,84114,84124,84134,84184,84213,84214) & !missing(ts15301_g2) & ts15301_g2 >= 0
 replace teacher_kldb2010 = ts15301_g2 if ts15301_g2 < 0 | missing(ts15301_g2)

 // date of exam - might be useful for some of your analyses
replace ts1530m = ts1530m -20 if inrange(ts1530m,21,32) // values 21-32 are mapped to months 1-12
generate exam_date = ym(ts1530y,ts1530m)
format %tm exam_date
label variable exam_date "date of external exam (ts1530y,ts1530m)"

// achieved 2nd state exam
generate degree_stateEx2 = (ts15304 == 30)
label variable degree_stateEx2 "degree achieved: 2nd state examination (ts15304)"
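capture label define yesno 0 "No" 1 "Yes" // define the value label here, because "use ..., clear" removes labels defined earlier
label values degree_stateEx2 yesno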

label variable teacher_kldb2010 "degree teacher (kldb2010), ts15301_g2"
label values teacher_kldb2010 yesno

//drop if missing(exam_date)
clonevar intdate = exam_date
local idvars ID_t wave


// drop almost perfect duplicates
bysort ID_t degree_stateEx2 teacher_kldb2010 exam_date ts15201 tg24150_g2 ts15219_g1 ts15304 t724401 t724402 tg24310 ts15301_g2 ts15301_g3 ts15301_g4 ts15301_g5 ts15301_g6 ts15301_g7 ts15301_g9 ts15301_g14 ts15301_g16 ts15302 ts15302_g1 th28370 ts15303_g2 (wave): keep if _n ==_N
keep ID_t wave ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 intdate

label variable intdate "Date of interview - if matching with CohortProfile"

// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
	label variable `var' "`: variable label  `var'' (source: spVocExtExam)"
}

// adding the minimum interview dates of all waves to spVocExtExam to create the wave variable
merge m:1 ID_t using "${workdir}/SC5_Teachers_CP_wavemin.dta", keep(matched) nogenerate


// creating "proper" wave-variable (event happened)

sum wave // using wave (event reported) to gain maximum value for wave
local maxwave = r(max) // store it now, as later commands may overwrite r()
rename wave wave_ext
generate wave = .
// filling up with wave when exam_date is older than or equal to the minimum interview date of the wave
forvalues wave = 1/`maxwave' {
	local wave_prior = `wave'-1
	if `wave' == 1 replace wave = `wave' if exam_date <= wave_min_w`wave'
	else {
		replace wave = `wave' if exam_date <= wave_min_w`wave' & exam_date >= wave_min_w`wave_prior' 
	}
	if `wave'==`maxwave' replace wave = `wave' if exam_date > wave_min_w`wave' 
}
	
//drop duplicates in desired order (example)
bysort ID_t wave (degree_stateEx2 teacher_kldb2010 exam_date wave_ext): keep if _n ==_N
	
keep ID_t wave ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 
label variable wave "wave (source: CohortProfile)"
save "${workdir}/SC5_Teachers_extExam.dta", replace



**********************************************************************************
*************** Extract possibly useful variables from spEmp *********************
**********************************************************************************

use "${sufpath}/SC5_spEmp_${version}.dta" , clear
label language en
local idvars ID_t splink
keep if subspell == 0 & disagint != 2 // keeping harmonized/completed and non-cancelled episodes 

generate ts23201_g2_LA 	= inlist(ts23201_g2,84114,84124,84134,84184,84213,84214)  // picking people who work(ed) as teachers according to KldB2010

// generating dummy completed referendariat for sorting/dropping duplicates later 
generate tg64002_tmp = tg64002
 replace tg64002_tmp = 0 if tg64002 == 2
 
 // adding smoothed starting and end dates of events
merge 1:1 ID_t splink using "${sufpath}/SC5_Biography_${version}.dta", nogenerate keep(matched) keepusing(start? end? splast)
generate end 	= ym(endy,endm)
generate start 	= ym(starty,startm)
drop if missing(start) | missing(end) // drop episodes with missing dates as they can't be matched
format %tm start end
keep ID_t splink start end splast ts23201_g2 ts23201_g2_LA tg64002 tg64002_tmp

// splitting episodes into months and creating "intdate", a date that is later than or equal to the start of the event and earlier than or equal to its end
generate duration = end -start +1
expand duration
bysort ID_t splink : generate line = _n - 1

generate intdate = start + line
format %tm intdate
assert intdate >= start & intdate <= end 

distinct ID_t intdate, joint // checking for duplicates within ID_t+intdate ("distinct" is user-written: ssc install distinct)
//drop duplicates in desired order (example)
bysort ID_t intdate (ts23201_g2_LA tg64002_tmp splast duration splink): keep if _n==_N


// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
	label variable `var' "`: variable label  `var'' (spEmp)"
}
//save "${workdir}/SC5_spEmp_split.dta", replace


// merge with intdate of CohortProfile to get the proper wave-variable
merge 1:1 ID_t intdate using  "${workdir}/SC5_Teachers_CP.dta", keep(matched using) generate(merge_spEmp_CP) keepusing(wave)
rename (start end splast duration splink) (start_emp end_emp splast_emp duration_emp splink_emp)
label variable ts23201_g2_LA "works as teacher; ts23201_g2 (kldb2010)"
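capture label define yesno 0 "No" 1 "Yes" // define the value label here, because "use ..., clear" removes labels defined earlier
label values ts23201_g2_LA yesno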
label variable end_emp   "end of employment episode (Biography)"
label variable start_emp "start of employment episode (Biography)"
label variable splast_emp "employment episode is lasting (Biography)"
label variable duration_emp "duration of employment episode in months (Biography)"
label variable wave "wave (CohortProfile)"
save "${workdir}/SC5_Teachers_spEmp.dta", replace

****************************************************************************
***************** merge data from spEmp and spVocExtExam *******************
****************************************************************************
use "${workdir}/SC5_Teachers_CP.dta", clear
merge 1:1 ID_t wave using "${workdir}/SC5_Teachers_spEmp.dta", nogenerate keep(master matched)
merge 1:1 ID_t wave using "${workdir}/SC5_Teachers_extExam.dta", nogenerate keep(master matched)

*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*
*** you could merge data from pTarget*-datasets here ***
*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*

// generate an uncorrected dummy that tells you whether there is any hint that an observation is a teacher (wave-wise)
egen any_teacher = anymatch(teacher_kldb2010 degree_stateEx2 ts23201_g2_LA tg64002_tmp), values(1)
label variable any_teacher "target is teacher any information (wave-wise)"
label values any_teacher yesno


 

distinct ID_t if any_teacher == 1 // How many teachers are there?
/*

. distinct ID_t if any_teacher == 1

       |        Observations
       |      total   distinct
-------+----------------------
  ID_t |      18773       3076



*/

				
keep ID_t wave tx80121 tx80220 tx80521 intdate ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 ts23201_g2 ts23201_g2_LA tg64002_tmp tg64002 splast_emp end_emp start_emp any_teacher

// generate an uncorrected dummy that tells you whether there is any hint that the target person is a teacher
bysort ID_t (wave): egen any_teacher_total = max(any_teacher)
label variable any_teacher_total "target is teacher any information (ID-wise)"

/*
. tab wave tx80220 if any_teacher_total, mis

                      |  Participation/drop-out status
                      |     (source: CohortProfile)
                 Wave | missing b  Participa  Temporary |     Total
----------------------+---------------------------------+----------
2010/2011 (CATI+compe |         0      3,076          0 |     3,076 
          2011 (CAWI) |         0      2,418        656 |     3,074 
          2012 (CATI) |         0      2,755        319 |     3,074 
          2012 (CAWI) |         0      2,439        635 |     3,074 
2013 (CATI+competenci |         0      2,966        106 |     3,072 
          2013 (CAWI) |         0      2,296        776 |     3,072 
2014 (CATI+competence |     1,265      1,650        151 |     3,066 
          2014 (CAWI) |         0      2,157        901 |     3,058 
          2015 (CATI) |         0      2,723        330 |     3,053 
          2016 (CATI) |         0      2,600        431 |     3,031 
          2016 (CAWI) |         0      1,855      1,152 |     3,007 
          2017 (CATI) |         0      2,600        354 |     2,954 
          2018 (CATI) |         0      2,187        690 |     2,877 
          2018 (CAWI) |         0      1,473      1,347 |     2,820 
          2019 (CATI) |         0      1,956        800 |     2,756 
          2020 (CATI) |         0      1,883        681 |     2,564 
          2020 (CAWI) |         0      1,544        821 |     2,365 
          2021 (CATI) |         0      1,778        557 |     2,335 
   2022 (CATI & CAWI) |         0      1,644        515 |     2,159 
----------------------+---------------------------------+----------
                Total |     1,265     42,000     11,222 |    54,487 
*/

// if you just want to keep targets qualified or working as teachers...
keep if any_teacher_total == 1


format %12.0g teacher_kldb2010 ts15301_g2 degree_stateEx2 ts15304 ts23201_g2_LA ts23201_g2 tg64002


// correcting teacher-dummies - contradicting information will be coded to -20
egen no_teacher = anymatch(teacher_kldb2010 degree_stateEx2 ts23201_g2_LA tg64002_tmp), values(0) // coded to 1 if one variable indicates that target is no teacher

clonevar any_teacher_g1 = any_teacher
 replace any_teacher_g1 = -20 if any_teacher == 1 & no_teacher == 1
 replace any_teacher_g1 = 0 if any_teacher == 0 & no_teacher == 1
 
bysort ID_t (wave): egen any_teacher_total_g1 = max(any_teacher_g1)
label variable any_teacher_total_g1 "target is teacher, contradicting information excluded (ID-wise)"

// if you just want to keep targets qualified or working as teachers, dropping those with only contradicting information
distinct ID_t if any_teacher_total == 1
/*

       |        Observations
       |      total   distinct
-------+----------------------
  ID_t |      54487       3076


*/

distinct ID_t if any_teacher_total_g1 == 1

/*

. distinct ID_t if any_teacher_total_g1 == 1

       |        Observations
       |      total   distinct
-------+----------------------
  ID_t |      53200       3000
*/
keep if any_teacher_total_g1 == 1

tab wave tx80220 if any_teacher_total_g1, mis

/*

. tab wave tx80220 if any_teacher_total_g1, mis

                      |  Participation/drop-out status
                      |     (source: CohortProfile)
                 Wave | missing b  Participa  Temporary |     Total
----------------------+---------------------------------+----------
2010/2011 (CATI+compe |         0      3,000          0 |     3,000 
          2011 (CAWI) |         0      2,357        641 |     2,998 
          2012 (CATI) |         0      2,681        317 |     2,998 
          2012 (CAWI) |         0      2,377        621 |     2,998 
2013 (CATI+competenci |         0      2,891        105 |     2,996 
          2013 (CAWI) |         0      2,235        761 |     2,996 
2014 (CATI+competence |     1,239      1,603        148 |     2,990 
          2014 (CAWI) |         0      2,105        877 |     2,982 
          2015 (CATI) |         0      2,661        317 |     2,978 
          2016 (CATI) |         0      2,545        411 |     2,956 
          2016 (CAWI) |         0      1,814      1,121 |     2,935 
          2017 (CATI) |         0      2,545        342 |     2,887 
          2018 (CATI) |         0      2,142        672 |     2,814 
          2018 (CAWI) |         0      1,435      1,323 |     2,758 
          2019 (CATI) |         0      1,912        784 |     2,696 
          2020 (CATI) |         0      1,843        666 |     2,509 
          2020 (CAWI) |         0      1,507        807 |     2,314 
          2021 (CATI) |         0      1,739        545 |     2,284 
   2022 (CATI & CAWI) |         0      1,607        504 |     2,111 
----------------------+---------------------------------+----------
                Total |     1,239     40,999     10,962 |    53,200 

*/

local analyze_vars ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 ts23201_g2 tg64002 ts23201_g2_LA tg64002_tmp splast_emp end_emp start_emp
local num_ana_vars = `: word count `analyze_vars''

egen num_miss_vars = rowmiss(`analyze_vars')
generate no_data = (num_miss_vars == `num_ana_vars')

// drop last waves when no data available
sum wave
global maxwave `r(max)'
forvalues num =1/$maxwave {
	bysort ID_t (wave): drop if _n==_N & no_data == 1 & tx80220 == 2
}

br ID_t wave tx80521 intdate splast_emp end_emp start_emp no_data

// identify CATI-waves because episode-information is gathered only in CATI waves 
generate cati_wave = inlist(wave,1,3,5,7,9,10,12,13,15,16,18,19)
bysort ID_t cati_wave ( wave): generate last_cati = ((_n == _N) & cati_wave==1)

// identifying targets with teaching episodes that last until they drop out of the survey
bysort ID_t cati_wave (wave): generate employed_temp = ((_n == _N) & splast_emp == 1 & ts23201_g2_LA == 1 & last_cati == 1)
bysort ID_t (wave): egen employed_spEmp = max (employed_temp)

// identify teachers who started later than 2014
generate teach_start_2015_2022_temp = start_emp > ym(2014,12) & !missing(start_emp) & ts23201_g2_LA == 1
bysort ID_t (wave): egen teach_start_2015_2022 = max(teach_start_2015_2022_temp)

// identify teachers who started 2014 or earlier
generate teach_start_upto_2014_temp = start_emp <= ym(2014,12) & !missing(start_emp) & ts23201_g2_LA == 1
bysort ID_t (wave): egen teach_start_upto_2014 = max(teach_start_upto_2014_temp)

// keep only first-time starters: reset the flag for those who had already been teaching in 2014 or earlier
replace teach_start_2015_2022 = 0 if teach_start_upto_2014 == 1

drop tg64002_tmp any_teacher any_teacher_total no_teacher any_teacher_g1 any_teacher_total_g1 num_miss_vars employed_temp teach_start_2015_2022_temp teach_start_upto_2014_temp

// how many people are still teaching and started later than 2014
distinct ID_t if teach_start_2015_2022 == 1 & employed_spEmp == 1

// how many people are still teaching and started in 2014 or earlier
distinct ID_t if teach_start_upto_2014 == 1 & employed_spEmp == 1
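
Building on the start dates prepared above, teaching experience at each interview could be derived roughly as follows. This is only a sketch on made-up toy data: the 36-month cut-off for "novice" (novice_max), the dummy teacher, and the variables first_teach_start, exp_months and novice are assumptions for illustration, not NEPS variables or an official definition.

```stata
* toy person-wave file: interview date plus start of the matched teaching episode
clear
input ID_t wave iy im sy sm teacher
1 1 2016 3  2015 8 1
1 2 2017 5  2015 8 1
1 3 2019 10 2015 8 1
2 1 2016 3  2009 2 1
2 2 2017 5  2009 2 1
end
generate intdate   = ym(iy, im)
generate start_emp = ym(sy, sm)
format %tm intdate start_emp
drop iy im sy sm

* earliest start of any teaching episode per person
bysort ID_t: egen first_teach_start = min(cond(teacher == 1, start_emp, .))
format %tm first_teach_start

* months of teaching experience at each interview
generate exp_months = intdate - first_teach_start

* hypothetical cut-off: at most 36 months of experience counts as novice
local novice_max 36
generate novice = inrange(exp_months, 0, `novice_max') if !missing(exp_months)

list ID_t wave intdate first_teach_start exp_months novice, sepby(ID_t) noobs
```

In the prepared file, teacher would correspond to ts23201_g2_LA and the dates to intdate and start_emp. The cut-off should follow whatever definition of the novice stage the study uses.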

I hope this helps you in some ways.

Bye,
Dietmar

sdemirtas · 23 April 2026, 15:26

Hello Dietmar,

Thank you very, very much for your help. I wish I could have thanked you earlier; I tried to do so via email, but apparently it did not work.

dietmar.angerer · 24 April 2026, 13:24

Hey Saniye!

Thank you, I’m glad I could help.

Best,
Dietmar