Multiple stages of validation help ensure our accuracy
Note: This page is actively being worked on - parts may be incomplete.
We have embedded multiple stages of validation throughout our development process:
- Qualitative Development of the Textual Structures: Ensuring that the textual structures reflect the underlying material
- Validation of the Classifications: Ensuring that the material populates the developed textual structures correctly
- Manual Oversight and Face-validity: Ongoing checks to ensure that the overall population process is occurring as expected
Qualitative Development of the Textual Structures
The first stage in classifying the texts involved developing a representation of the material. While each sentence in a particular communication medium (e.g., managerial backgrounds) is unique, there is substantial underlying similarity in the material discussed. For example, managerial backgrounds typically discuss positions that a manager has held, experience that they have gained, and qualifications and professional licenses that they have received.
Each of the textual structures was developed with substantial qualitative consideration of the underlying material, drawing on research on themes (e.g., Glaser and Strauss, 1967; Ryan and Bernard, 2003) to help ensure that the textual structures reflect the underlying text. After the primary dimensions of the text were identified, each sentence was then dissected further into its components.
Validation of the Classifications
The second stage in classifying the text is to ensure that the material is correctly populated into the textual structures. Classification involved a combination of manual and machine-learned classification. Three primary approaches are used to ensure that the material is classified appropriately:
Validation by context
Checks that the context in which a term occurs in a sentence is appropriate. For example, the concept sequencing PERSON_NAME IS MANAGEMENT_TITLE AT COMPANY_NAME is common and appropriate, whereas a sequencing such as PERSON_NAME RECEIVED COMPANY_NAME FROM UNIVERSITY_NAME is uncommon and unlikely to be correct (i.e., it likely indicates that a degree acronym has been incorrectly classified as a company name). This validation includes three components:
- Manual checks to identify unlikely concept sequencing.
- Machine-learned identification, flagging cases where a machine-learned classification is inconsistent with the concept assigned.
- Identification of concept sequencing that does not conform to that expected in the textual structures.
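The third component can be illustrated with a minimal sketch, assuming each sentence has already been reduced to a sequence of concept labels. The set of expected sequences, the DEGREE_NAME label, and the function name are hypothetical and chosen only for illustration; they are not taken from the actual textual structures.

```python
# Hypothetical set of concept sequences that the textual structures allow.
# DEGREE_NAME is an assumed label, used here only for illustration.
EXPECTED_SEQUENCES = {
    ("PERSON_NAME", "IS", "MANAGEMENT_TITLE", "AT", "COMPANY_NAME"),
    ("PERSON_NAME", "RECEIVED", "DEGREE_NAME", "FROM", "UNIVERSITY_NAME"),
}


def flag_unexpected(sequences, expected=EXPECTED_SEQUENCES):
    """Return the concept sequences that match no expected sequencing."""
    return [seq for seq in sequences if tuple(seq) not in expected]


observed = [
    ["PERSON_NAME", "IS", "MANAGEMENT_TITLE", "AT", "COMPANY_NAME"],
    # A degree acronym misclassified as a company name produces this sequence.
    ["PERSON_NAME", "RECEIVED", "COMPANY_NAME", "FROM", "UNIVERSITY_NAME"],
]

# Only the second sequence is flagged for manual review.
flagged = flag_unexpected(observed)
```

A flagged sequence is a candidate for review rather than a confirmed error, since an uncommon sequencing may still be legitimate.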
Validation by dissection
By dissecting concepts into their underlying properties, and manually verifying the much smaller number of terms in the sub-concepts together with the sequencing of those sub-concepts, it is possible to validate a much larger number of terms. For example, the validity of concepts composed of separate parts (e.g., MANAGEMENT_TITLE) can be assessed, despite there being tens of thousands of unique titles at the overall level.
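As a rough sketch of the idea, a title can be split into words, each word mapped to a sub-concept from a small hand-verified vocabulary, and the resulting sub-concept sequence checked against an allowed ordering. The vocabulary, the sub-concept names (MODIFIER, ROLE, FUNCTION, OF), and the allowed pattern below are all assumptions made for illustration, not the actual sub-concepts used.

```python
import re

# Hypothetical hand-verified vocabulary mapping words to sub-concepts.
WORD_TO_SUBCONCEPT = {
    "senior": "MODIFIER", "vice": "MODIFIER", "executive": "MODIFIER",
    "chief": "MODIFIER",
    "president": "ROLE", "director": "ROLE", "officer": "ROLE",
    "manager": "ROLE",
    "finance": "FUNCTION", "financial": "FUNCTION",
    "marketing": "FUNCTION", "sales": "FUNCTION",
    "of": "OF",
}

# Assumed allowed ordering: optional modifiers, an optional function,
# a role, then an optional "of <function>" suffix.
ALLOWED_ORDER = re.compile(r"(MODIFIER )*(FUNCTION )?ROLE( OF FUNCTION)?")


def dissect(title):
    """Map each word of a title to its sub-concept (None if unknown)."""
    return [WORD_TO_SUBCONCEPT.get(word) for word in title.lower().split()]


def is_valid_title(title):
    """True if every word is verified and the sub-concept order is allowed."""
    labels = dissect(title)
    if None in labels:
        return False
    return ALLOWED_ORDER.fullmatch(" ".join(labels)) is not None
```

Under these assumptions, a few dozen verified words and one ordering rule cover many distinct titles: "Senior Vice President of Marketing" and "Chief Financial Officer" both pass, while "President of President" fails on ordering and a title containing an unverified word fails on vocabulary.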
Validation of terms through external-data-checks
By connecting terms to external databases, it is possible to verify concepts that are underpinned by a large number of labels, such as location information. Such concepts are infeasible to verify manually and lack the repetition in underlying words that would allow dissection.
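A minimal sketch of such a check follows, with a small in-memory set standing in for the external database. The reference set, the normalisation rule, and the function names are hypothetical; a real check would query whichever external source is actually used.

```python
# Stand-in for an external location database (hypothetical contents).
REFERENCE_LOCATIONS = {"new york", "london", "san francisco"}


def normalise(term):
    """Lower-case a term and collapse repeated or surrounding whitespace."""
    return " ".join(term.casefold().split())


def check_against_reference(terms, reference=REFERENCE_LOCATIONS):
    """Split terms into those found in the reference set and those not."""
    verified, unverified = [], []
    for term in terms:
        if normalise(term) in reference:
            verified.append(term)
        else:
            unverified.append(term)
    return verified, unverified


# Terms classified as locations; "MBA" is a deliberate misclassification.
verified, unverified = check_against_reference(["New York", "london ", "MBA"])
```

Terms that are not found are only unverified, not necessarily wrong, because the external source may itself be incomplete.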
Manual Oversight and Face-validity
Checks are made throughout the process to ensure that the textual structures are being populated in line with the structures as developed.
Beyond documenting the textual structures, the examples and summary statistics included in Appendix D illustrate that the textual structures, properties, and classifications correspond closely to what would be expected.