Hello Saniye!
Sorry it took me so long!
I could not find the variable "length of service at the school". Please tell me the dataset and the variable name…
There are loads of variables with information on the Referendariat, but the metadata of those variables often notes that you should not use them for analysis. You can check that with infoquery (part of the Stata tools on our website) or with the data view in SPSS.
But even the variables that are meant for analysis seem to contradict each other. Try using data from spEmp to pick teachers and the dates on which they started teaching. Data from spVocExtExam might help as well to gather information on exam dates.
This is some extensive data preparation - but it could be worse! You can find the code after the example on episode splitting…
You need to match the interview dates with the episode data. This is called episode splitting and has been described in the forum before. The wave indicator in the episode datasets is not recommended for this because it just marks the time of the last interview, not the date on which the event happened!
First, you create an interview date in CohortProfile…
*CohortProfile: wave and intdate
ID_t wave intdate
--------------------------------
1 1 201603
1 2 201612
1 3 201705
1 4 201804
Then you take the start and end of the events from spEmp:
*spEmp: event
ID_t start end
--------------------------------
1 201703 201801
Multiply the data rows by the duration of the event (one row per month), creating a variable intdate that runs from the start to the end of the event:
*spEmp: episode-split event
ID_t start end intdate
---------------------------------
1 201703 201801 201703
1 201703 201801 201704
1 201703 201801 201705
1 201703 201801 201706
1 201703 201801 201707
1 201703 201801 201708
1 201703 201801 201709
1 201703 201801 201710
1 201703 201801 201711
1 201703 201801 201712
1 201703 201801 201801
Match the events with CohortProfile using intdate to get a proper wave indicator:
use "CohortProfile.dta", clear
merge 1:m ID_t intdate using "spEmp_split.dta", keep(matched)
ID_t wave intdate start end
------------------------------------------------
1 3 201705 201703 201801
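The three steps above can be tried out on toy data before touching the SUF files. This is only a minimal sketch: the values are typed in from the example tables, the year/month helper variables (y, m, starty, startm, endy, endm) are made up for the illustration, and the toy event has no splink because there is only one episode.

```stata
* Toy data only: values mirror the example tables above, nothing is read from the SUF
clear
input long ID_t byte wave int y byte m
1 1 2016 3
1 2 2016 12
1 3 2017 5
1 4 2018 4
end
generate intdate = ym(y, m)          // monthly interview date
format %tm intdate
drop y m
tempfile cp
save `cp'                            // stands in for CohortProfile

clear
input long ID_t int starty byte startm int endy byte endm
1 2017 3 2018 1
end
generate start = ym(starty, startm)
generate end   = ym(endy, endm)
format %tm start end
drop starty startm endy endm

* episode splitting: one row per month of the event
generate duration = end - start + 1
expand duration
bysort ID_t start end: generate intdate = start + _n - 1
format %tm intdate

* keep only the months in which an interview took place
merge 1:1 ID_t intdate using `cp', keep(match) nogenerate
list ID_t wave intdate start end, noobs
```

With these toy values only the interview in 2017m5 falls inside the event, so a single row with wave 3 should remain.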
Here is the syntax to control for teaching experience:
global sufpath "PATH_TO_YOUR_SUF_DATA"
global workdir "PATH_TO_YOUR_WORKING_DIR"
global version D_19-0-0
//global version R_19-0-0
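label define yesno 0 "No" 1 "Yes" // caution: -use ..., clear- removes value labels from memory, so re-run this line after loading data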
**********************************************************************************
********** Prepare interview dates for matching with episode data ****************
**********************************************************************************
// generate interview-date-variable to match with episode data later (episode splitting)
use "${sufpath}/SC5_CohortProfile_${version}.dta" , clear
label language en
local idvars ID_t wave
keep ID_t wave tx8600y tx8600m tx8610y tx8610m tx80121 tx80220 tx80521 tx80522
// generate interview date
generate intdate = ym(tx8600y,tx8600m)
replace intdate = ym(tx8610y,tx8610m) if missing(intdate) // if missing date of interview replaced by date of testing
// use the mean interview date of the wave if the person is a temporary drop-out
generate mean_as_intdate = (missing(intdate) & tx80220 == 2)
bysort wave: egen intdate_mean = mean(intdate)
replace intdate = round(intdate_mean) if missing(intdate) & tx80220 != 3
keep if !missing(intdate)
format %tm intdate
// if the date of interview is the same in two waves for a person: add one month for the later wave to achieve uniqueness within ID_t + intdate
bysort ID_t intdate (wave): replace intdate = intdate + 1 if _n==_N & _N > 1
isid ID_t intdate
label variable intdate "date of interview, missings filled"
label variable mean_as_intdate "date of interview replaced by mean of interview date"
drop tx8600y tx8600m tx8610y tx8610m intdate_mean
// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
label variable `var' "`: variable label `var'' (source: CohortProfile)"
}
save "${workdir}/SC5_Teachers_CP.dta", replace // >> interview dates prepared for merging
// create a one-line-per-person file in which the earliest interview date of each wave is stored in a separate variable
* this is handy for merging data from spVocExtExam: episode splitting is not possible there, and this yields more matches
sum wave
local maxwave = r(max) // store it right away: later commands may overwrite r()
forvalues wave = 1/`maxwave' {
egen wave_min_w`wave'_temp = min(intdate) if wave == `wave'
egen wave_min_w`wave' = max(wave_min_w`wave'_temp) // spread the wave minimum to all rows
replace wave_min_w`wave' = round(wave_min_w`wave')
format %tm wave_min_w`wave'
drop wave_min_w`wave'_temp
}
keep ID_t wave_min_w*
duplicates drop
save "${workdir}/SC5_Teachers_CP_wavemin.dta", replace
**********************************************************************************
********** Extract possibly useful variables from spVocExtExam *******************
**********************************************************************************
use "${sufpath}/SC5_spVocExtExam_${version}.dta", clear
label language en
// Teachers degree picked by professional qualification (KldB2010)
generate teacher_kldb2010 = 1 if inlist(ts15301_g2,84114,84124,84134,84184,84213,84214)
replace teacher_kldb2010 = 0 if !inlist(ts15301_g2,84114,84124,84134,84184,84213,84214) & !missing(ts15301_g2) & ts15301_g2 >= 0
replace teacher_kldb2010 = ts15301_g2 if ts15301_g2 < 0 | missing(ts15301_g2)
// date of exam - might be useful for some of your analyses
replace ts1530m = ts1530m -20 if inrange(ts1530m,21,32)
generate exam_date = ym(ts1530y,ts1530m)
format %tm exam_date
label variable exam_date "date of external exam (ts1530y,ts1530m)"
// achieved 2nd state exam
generate degree_stateEx2 = (ts15304 == 30)
label variable degree_stateEx2 "degree achieved: 2nd state examination (ts15304)"
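capture label define yesno 0 "No" 1 "Yes" // value labels are cleared by -use ..., clear-, so make sure the label exists here
label values degree_stateEx2 yesno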
label variable teacher_kldb2010 "degree teacher (kldb2010), ts15301_g2"
label values teacher_kldb2010 yesno
//drop if missing(exam_date)
clonevar intdate = exam_date
local idvars ID_t wave
// drop almost perfect duplicates
bysort ID_t degree_stateEx2 teacher_kldb2010 exam_date ts15201 tg24150_g2 ts15219_g1 ts15304 t724401 t724402 tg24310 ts15301_g2 ts15301_g3 ts15301_g4 ts15301_g5 ts15301_g6 ts15301_g7 ts15301_g9 ts15301_g14 ts15301_g16 ts15302 ts15302_g1 th28370 ts15303_g2 (wave): keep if _n ==_N
keep ID_t wave ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 intdate
label variable intdate "Date of interview - if matching with CohortProfile"
// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
label variable `var' "`: variable label `var'' (source: spVocExtExam)"
}
// adding the minimum interview dates of all waves to spVocExtExam to create the wave variable
merge m:1 ID_t using "${workdir}/SC5_Teachers_CP_wavemin.dta", keep(matched) nogenerate
// creating "proper" wave-variable (event happened)
sum wave // using wave (event reported) to gain maximum value for wave
local maxwave = r(max) // store it right away: later commands may overwrite r()
rename wave wave_ext
generate wave = .
// filling up with wave if exam_date is earlier than or equal to the minimum interview date of the wave
forvalues wave = 1/`maxwave' {
local wave_prior = `wave'-1
if `wave' == 1 replace wave = `wave' if exam_date <= wave_min_w`wave'
else {
replace wave = `wave' if exam_date <= wave_min_w`wave' & exam_date >= wave_min_w`wave_prior'
}
// note: a missing exam_date counts as larger than any date, so those rows end up in the last wave
if `wave'==`maxwave' replace wave = `wave' if exam_date > wave_min_w`wave'
}
// drop duplicates in desired order (example)
bysort ID_t wave (degree_stateEx2 teacher_kldb2010 exam_date wave_ext): keep if _n ==_N
keep ID_t wave ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2
label variable wave "wave (source: CohortProfile)"
save "${workdir}/SC5_Teachers_extExam.dta", replace
**********************************************************************************
*************** Extract possibly useful variables from spEmp *********************
**********************************************************************************
use "${sufpath}/SC5_spEmp_${version}.dta" , clear
label language en
local idvars ID_t splink
keep if subspell == 0 & disagint != 2 // keeping harmonized/completed and non-cancelled episodes
generate ts23201_g2_LA = inlist(ts23201_g2,84114,84124,84134,84184,84213,84214) // picking people who work(ed) as teachers according to KldB2010
// generating dummy completed referendariat for sorting/dropping duplicates later
generate tg64002_tmp = tg64002
replace tg64002_tmp = 0 if tg64002 == 2
// adding smoothed starting and end dates of events
merge 1:1 ID_t splink using "${sufpath}/SC5_Biography_${version}.dta", nogenerate keep(matched) keepusing(start? end? splast)
generate end = ym(endy,endm)
generate start = ym(starty,startm)
drop if missing(start) | missing(end) // drop episodes with missing dates as they can't be matched
format %tm start end
keep ID_t splink start end splast ts23201_g2 ts23201_g2_LA tg64002 tg64002_tmp
// splitting episodes into months and creating "intdate", a date that is later than or equal to the start of the event and earlier than or equal to the end of the event
generate duration = end -start +1
expand duration
bysort ID_t splink : generate line = _n - 1
generate intdate = start + line
format %tm intdate
assert intdate >= start & intdate <= end
// checking for duplicates within ID_t+intdate (distinct is user-written: ssc install distinct)
distinct ID_t intdate, joint
//drop duplicates in desired order (example)
bysort ID_t intdate (ts23201_g2_LA tg64002_tmp splast duration splink): keep if _n==_N
// adding name of dataset to variable label
unab allvars: _all
local relabelvars : list allvars - idvars
foreach var of local relabelvars {
label variable `var' "`: variable label `var'' (spEmp)"
}
//save "${workdir}/SC5_spEmp_split.dta", replace
// merge with intdate of CohortProfile to get the proper wave-variable
merge 1:1 ID_t intdate using "${workdir}/SC5_Teachers_CP.dta", keep(matched using) generate(merge_spEmp_CP) keepusing(wave)
rename (start end splast duration splink) (start_emp end_emp splast_emp duration_emp splink_emp)
label variable ts23201_g2_LA "works as teacher; ts23201_g2 (kldb2010)"
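capture label define yesno 0 "No" 1 "Yes" // value labels are cleared by -use ..., clear-, so make sure the label exists here
label values ts23201_g2_LA yesno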
label variable end_emp "end of employment episode (Biography)"
label variable start_emp "start of employment episode (Biography)"
label variable splast_emp "employment episode is lasting (Biography)"
label variable duration_emp "duration of employment episode in months (Biography)"
label variable wave "wave (CohortProfile)"
save "${workdir}/SC5_Teachers_spEmp.dta", replace
****************************************************************************
***************** merge data from spEmp and spVocExtExam *******************
****************************************************************************
use "${workdir}/SC5_Teachers_CP.dta", clear
merge 1:1 ID_t wave using "${workdir}/SC5_Teachers_spEmp.dta", nogenerate keep(master matched)
merge 1:1 ID_t wave using "${workdir}/SC5_Teachers_extExam.dta", nogenerate keep(master matched)
*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*
*** you could merge data from pTarget*-datasets here ***
*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*
// generate an uncorrected dummy that tells you whether there is any hint that an observation is a teacher (wave-wise)
egen any_teacher = anymatch(teacher_kldb2010 degree_stateEx2 ts23201_g2_LA tg64002_tmp), values(1)
label variable any_teacher "target is teacher any information (wave-wise)"
label values any_teacher yesno
distinct ID_t if any_teacher == 1 // How many teachers are there?
/*
. distinct ID_t if any_teacher == 1
| Observations
| total distinct
-------+----------------------
ID_t | 18773 3076
*/
keep ID_t wave tx80121 tx80220 tx80521 intdate ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 ts23201_g2 ts23201_g2_LA tg64002_tmp tg64002 splast_emp end_emp start_emp any_teacher
// generate an uncorrected dummy that tells you whether there is any hint that the target person is a teacher
bysort ID_t (wave): egen any_teacher_total = max(any_teacher)
label variable any_teacher_total "target is teacher any information (ID-wise)"
/*
. tab wave tx80220 if any_teacher_total, mis
| Participation/drop-out status
| (source: CohortProfile)
Wave | missing b Participa Temporary | Total
----------------------+---------------------------------+----------
2010/2011 (CATI+compe | 0 3,076 0 | 3,076
2011 (CAWI) | 0 2,418 656 | 3,074
2012 (CATI) | 0 2,755 319 | 3,074
2012 (CAWI) | 0 2,439 635 | 3,074
2013 (CATI+competenci | 0 2,966 106 | 3,072
2013 (CAWI) | 0 2,296 776 | 3,072
2014 (CATI+competence | 1,265 1,650 151 | 3,066
2014 (CAWI) | 0 2,157 901 | 3,058
2015 (CATI) | 0 2,723 330 | 3,053
2016 (CATI) | 0 2,600 431 | 3,031
2016 (CAWI) | 0 1,855 1,152 | 3,007
2017 (CATI) | 0 2,600 354 | 2,954
2018 (CATI) | 0 2,187 690 | 2,877
2018 (CAWI) | 0 1,473 1,347 | 2,820
2019 (CATI) | 0 1,956 800 | 2,756
2020 (CATI) | 0 1,883 681 | 2,564
2020 (CAWI) | 0 1,544 821 | 2,365
2021 (CATI) | 0 1,778 557 | 2,335
2022 (CATI & CAWI) | 0 1,644 515 | 2,159
----------------------+---------------------------------+----------
Total | 1,265 42,000 11,222 | 54,487
*/
// if you just want to keep targets qualified or working as teachers...
keep if any_teacher_total == 1
format %12.0g teacher_kldb2010 ts15301_g2 degree_stateEx2 ts15304 ts23201_g2_LA ts23201_g2 tg64002
// correcting teacher-dummies - contradicting information will be coded to -20
egen no_teacher = anymatch(teacher_kldb2010 degree_stateEx2 ts23201_g2_LA tg64002_tmp), values(0) // coded to 1 if one variable indicates that target is no teacher
clonevar any_teacher_g1 = any_teacher
replace any_teacher_g1 = -20 if any_teacher == 1 & no_teacher == 1
replace any_teacher_g1 = 0 if any_teacher == 0 & no_teacher == 1
bysort ID_t (wave): egen any_teacher_total_g1 = max(any_teacher_g1)
label variable any_teacher_total_g1 "target is teacher any information, contradictions coded -20 (ID-wise)"
// if you just want to keep targets qualified or working as teachers, dropping contradicting cases
distinct ID_t if any_teacher_total == 1
/*
| Observations
| total distinct
-------+----------------------
ID_t | 54487 3076
*/
distinct ID_t if any_teacher_total_g1 == 1
/*
. distinct ID_t if any_teacher_total_g1 == 1
| Observations
| total distinct
-------+----------------------
ID_t | 53200 3000
*/
keep if any_teacher_total_g1 == 1
tab wave tx80220 if any_teacher_total_g1, mis
/*
. tab wave tx80220 if any_teacher_total_g1, mis
| Participation/drop-out status
| (source: CohortProfile)
Wave | missing b Participa Temporary | Total
----------------------+---------------------------------+----------
2010/2011 (CATI+compe | 0 3,000 0 | 3,000
2011 (CAWI) | 0 2,357 641 | 2,998
2012 (CATI) | 0 2,681 317 | 2,998
2012 (CAWI) | 0 2,377 621 | 2,998
2013 (CATI+competenci | 0 2,891 105 | 2,996
2013 (CAWI) | 0 2,235 761 | 2,996
2014 (CATI+competence | 1,239 1,603 148 | 2,990
2014 (CAWI) | 0 2,105 877 | 2,982
2015 (CATI) | 0 2,661 317 | 2,978
2016 (CATI) | 0 2,545 411 | 2,956
2016 (CAWI) | 0 1,814 1,121 | 2,935
2017 (CATI) | 0 2,545 342 | 2,887
2018 (CATI) | 0 2,142 672 | 2,814
2018 (CAWI) | 0 1,435 1,323 | 2,758
2019 (CATI) | 0 1,912 784 | 2,696
2020 (CATI) | 0 1,843 666 | 2,509
2020 (CAWI) | 0 1,507 807 | 2,314
2021 (CATI) | 0 1,739 545 | 2,284
2022 (CATI & CAWI) | 0 1,607 504 | 2,111
----------------------+---------------------------------+----------
Total | 1,239 40,999 10,962 | 53,200
*/
local analyze_vars ts15304 ts15301_g2 teacher_kldb2010 exam_date degree_stateEx2 ts23201_g2 tg64002 ts23201_g2_LA tg64002_tmp splast_emp end_emp start_emp
local num_ana_vars = `: word count `analyze_vars''
egen num_miss_vars = rowmiss(`analyze_vars')
generate no_data = (num_miss_vars == `num_ana_vars')
// drop last waves when no data available
sum wave
global maxwave `r(max)'
forvalues num =1/$maxwave {
bysort ID_t (wave): drop if _n==_N & no_data == 1 & tx80220 == 2
}
br ID_t wave tx80521 intdate splast_emp end_emp start_emp no_data
// identify CATI-waves because episode-information is gathered only in CATI waves
generate cati_wave = inlist(wave,1,3,5,7,9,10,12,13,15,16,18,19)
bysort ID_t cati_wave ( wave): generate last_cati = ((_n == _N) & cati_wave==1)
// identifying targets with teaching episodes that last until they drop out of the survey
bysort ID_t cati_wave (wave): generate employed_temp = ((_n == _N) & splast_emp == 1 & ts23201_g2_LA == 1 & last_cati == 1)
bysort ID_t (wave): egen employed_spEmp = max (employed_temp)
// identify teachers who started later than 2014
generate teach_start_2015_2022_temp = start_emp > ym(2014,12) & !missing(start_emp) & ts23201_g2_LA == 1
bysort ID_t (wave): egen teach_start_2015_2022 = max(teach_start_2015_2022_temp)
// identify teachers who started 2014 or earlier
generate teach_start_upto_2014_temp = start_emp <= ym(2014,12) & !missing(start_emp) & ts23201_g2_LA == 1
bysort ID_t (wave): egen teach_start_upto_2014 = max(teach_start_upto_2014_temp)
// among teachers who started later than 2014, exclude those who had already been teachers earlier
replace teach_start_2015_2022 = 0 if teach_start_upto_2014 == 1
drop tg64002_tmp any_teacher any_teacher_total no_teacher any_teacher_g1 any_teacher_total_g1 num_miss_vars employed_temp teach_start_2015_2022_temp teach_start_upto_2014_temp
// how many people are still teaching and started later than 2014
distinct ID_t if teach_start_2015_2022 == 1 & employed_spEmp == 1
// how many people are still teaching and started in 2014 or earlier
distinct ID_t if teach_start_upto_2014 == 1 & employed_spEmp == 1
I hope this helps you in some way.
Bye,
Dietmar