In this case we assume that we observe $(y_i^*, x_i^*, z_i)$ in the non-probability sample and $(x_i, z_i)$ in the probability sample (or population), where $^*$ indicates that a given variable is misclassified.
A motivating example is as follows:
- target variable ($Y$): whether English is required. This may be stated explicitly in a given ad, but where it is missing we could derive it from the text (say, the ad is written in English, or the text states that English is "our language")
- auxiliary variables ($X$): information about the occupation is missing, and we derive it using our classifier
- auxiliary variables ($Z$): information about a given company, measured without error (say, size, NACE, public/private)
Research questions:
- how can we deal with such cases, and what does the literature say about them?
- what is the bias when we estimate a regression model for $E(y_i^* | x_i^*, z_i)$ instead of $E(y_i | x_i, z_i)$?
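
To get a feel for the second question, here is a minimal simulation sketch for a deliberately simplified case, not the full setting above. It assumes a binary $y_i$, a single binary $x_i$ measured without error, no $z_i$, and misclassification of $y_i$ that does not depend on $x_i$. The function name, the rates `fp_rate` and `fn_rate`, and the probabilities 0.7 and 0.3 are all hypothetical choices made for illustration. Under these assumptions the difference in $E(y_i^* | x_i)$ between the two groups should be close to the true difference multiplied by $1 - \text{fp\_rate} - \text{fn\_rate}$, that is, attenuated towards zero.

```python
import numpy as np


def simulate_attenuation(n=200_000, fp_rate=0.10, fn_rate=0.20, seed=0):
    """Compare group differences in the true y and the misclassified y*.

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = np.random.default_rng(seed)

    # Binary covariate x, assumed here to be measured without error.
    x = rng.integers(0, 2, size=n)

    # True binary target: P(y = 1 | x = 1) = 0.7, P(y = 1 | x = 0) = 0.3.
    p_true = np.where(x == 1, 0.7, 0.3)
    y = rng.random(n) < p_true

    # Misclassified target y*, with error rates that do not depend on x:
    #   P(y* = 1 | y = 0) = fp_rate, P(y* = 0 | y = 1) = fn_rate.
    u = rng.random(n)
    y_star = np.where(y, u >= fn_rate, u < fp_rate)

    # Difference in means between x = 1 and x = 0.
    true_diff = y[x == 1].mean() - y[x == 0].mean()
    obs_diff = y_star[x == 1].mean() - y_star[x == 0].mean()

    # Attenuation factor implied by the assumed error rates.
    factor = 1.0 - fp_rate - fn_rate
    return true_diff, obs_diff, factor


true_diff, obs_diff, factor = simulate_attenuation()
print(f"true difference:      {true_diff:.3f}")
print(f"observed difference:  {obs_diff:.3f}")
print(f"factor * true diff:   {factor * true_diff:.3f}")
```

The sketch leaves out what makes the case above harder: misclassification in $x_i^*$ as well, error rates that may depend on $z_i$, and selection into the non-probability sample.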