Menai Insight combines qualitative approaches to textual analysis, which ensure that nuanced relationships within sentences are captured, with machine learning, which ensures that text is processed to a high degree of accuracy across millions of sentences. This article explains the key stages in our approach to producing high-accuracy, large-scale classifications of text.
A key challenge with textual constructs is capturing the desired information from the text. Nuanced, multi-dimensional constructs, common in organizational research, typically do not reduce to individual words, so traditional approaches based on word counts or term clusterings are often relatively coarse proxies for the desired construct. At Menai Insight, we take care at every stage to ensure both that our textual structures capture the nuance of the sentences and that the information is extracted consistently across texts.
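To illustrate why word counts can be a coarse proxy, the following minimal sketch scores two sentences against a small term list. The term list, the example sentences, and the function name are all hypothetical and chosen only for illustration; they are not drawn from Menai Insight's data or methods.

```python
import re

# Hypothetical term list standing in for a "finance background" construct.
FINANCE_TERMS = {"finance", "financial", "cfo"}


def keyword_count(sentence, keywords):
    """Count how many tokens in the sentence appear in the keyword set."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(1 for token in tokens if token in keywords)


# Two invented sentences with very different meanings.
held = "She served as CFO and led the finance function for ten years."
advised = "He advised the CFO on hiring but has never worked in finance."

held_score = keyword_count(held, FINANCE_TERMS)
advised_score = keyword_count(advised, FINANCE_TERMS)

# Both sentences receive the same score even though only the first
# describes a manager who actually held a finance position.
print(held_score, advised_score)
```

Because the count ignores who did what, the two sentences are indistinguishable to the proxy, which is the kind of nuance that sentence-level structures are intended to preserve.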
Our ability to capture the information consistently is underpinned by multiple layers of qualitative and quantitative validations throughout the development and population process, including:
Qualitative Development of the Textual Structures
The first stage in classifying the texts involved developing a representation of the material. While each sentence in a particular communication medium, such as managerial backgrounds, is unique, there is substantial underlying similarity in the material discussed. Although specific details such as names, dates, and experiences inherently differ between managers, backgrounds typically discuss the positions that a manager has worked in, the experiences that they have gained, and the qualifications and professional licenses that they have received.
To identify the similarities that provide the basis for our textual structures, we drew on qualitative research on themes (e.g., Glaser and Strauss, 1967; Ryan and Bernard, 2003), reading a substantial number of texts in each medium to identify the underlying basis on which the texts were comparable and the specifics on which they differed.
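As a rough sketch of what a textual structure for managerial backgrounds might look like, the example below groups extracted information under the three areas named above: positions, experiences, and qualifications. All class and field names are hypothetical; the actual structures used by Menai Insight are not described in this article. Each entry keeps the source sentence so that the link back to the underlying text stays visible.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Position:
    """A position held by a manager (hypothetical fields)."""
    title: str
    organization: Optional[str] = None
    source_sentence: str = ""  # the sentence the entry was extracted from


@dataclass
class Qualification:
    """A qualification or professional license (hypothetical fields)."""
    name: str
    source_sentence: str = ""


@dataclass
class ManagerBackground:
    """Common structure shared by otherwise unique background texts."""
    positions: List[Position] = field(default_factory=list)
    experiences: List[str] = field(default_factory=list)
    qualifications: List[Qualification] = field(default_factory=list)


# Invented example text, used only to show how the structure is filled.
sentence = "She previously served as Chief Financial Officer of Example Corp."
background = ManagerBackground()
background.positions.append(
    Position(
        title="Chief Financial Officer",
        organization="Example Corp",
        source_sentence=sentence,
    )
)
```

The details differ from manager to manager, but every background populates the same fields, which is what makes the texts comparable.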
Populating the Textual Structures and Validating the Classifications
After the textual structures were developed, we developed analytical approaches, drawing on information extraction research, to systematically populate the textual structures with information from the text. Again, substantial qualitative and quantitative validation was implemented to ensure that the textual structures are populated as expected. For example, we give substantial, continued attention to ensuring that machine-learned classifications conform to plausible sequencing, and we identify and correct classifications that are not populating the textual structures as expected.
Since machine learning inherently relies on well-classified data, our care in identifying and correcting mistakes leads to cumulative benefits: correcting mistakes increases the accuracy of our classifications, which in turn makes it easier to identify and correct other errors. Indeed, while working with millions of sentences introduces its own challenges, the diversity of sentences previously classified helps ensure the accuracy of classifying new material.
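One simple way to check that classifications conform to plausible sequencing is to flag label transitions between consecutive sentences that fall outside an allowed set. The sketch below assumes hypothetical labels and a hypothetical set of allowed transitions; the article does not specify how sequencing is actually validated, so this should be read as one possible illustration rather than the method in use.

```python
from typing import List, Set, Tuple

# Hypothetical set of label transitions considered plausible between
# consecutive sentences in a background text.
ALLOWED_TRANSITIONS: Set[Tuple[str, str]] = {
    ("position", "position"),
    ("position", "experience"),
    ("position", "qualification"),
    ("experience", "experience"),
    ("experience", "qualification"),
    ("qualification", "qualification"),
}


def flag_implausible(labels: List[str],
                     allowed: Set[Tuple[str, str]]) -> List[int]:
    """Return indices of sentences whose label follows the previous
    sentence's label in a way that is not in the allowed set."""
    return [
        i
        for i in range(1, len(labels))
        if (labels[i - 1], labels[i]) not in allowed
    ]


# Invented sequence of predicted labels for one text.
predicted = ["position", "position", "qualification", "position"]
flagged = flag_implausible(predicted, ALLOWED_TRANSITIONS)

# Flagged sentences would be passed to a person for review and correction.
print(flagged)
```

Flagged sentences are candidates for manual review rather than automatic errors, since an unusual sequence can still be a correct classification.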
Manual Oversight and Face Validity
Finally, our entire process has substantial manual oversight, with the transparent connections between the underlying text and the textual structures illustrating the validity of the extracted constructs. By overseeing the entire population process, we help ensure that classifications are continually improving and not drifting from those expected. Moreover, the direct correspondence to the underlying text further illustrates the validity of the textual structures, while allowing more nuanced theorizing than is feasible with distant proxies.
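As one possible aid to this kind of oversight, the sketch below compares the share of each label in a reference batch with its share in a newer batch, using total variation distance as a simple drift signal. The labels, batches, and threshold are hypothetical, and the article does not state that this measure is used; it is offered only as an illustration of how drift could be surfaced for manual review.

```python
from collections import Counter
from typing import Dict, List


def label_shares(labels: List[str]) -> Dict[str, float]:
    """Return the proportion of each label; empty input gives an empty dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_variation(reference: List[str], new: List[str]) -> float:
    """Half the sum of absolute differences between label proportions."""
    p, q = label_shares(reference), label_shares(new)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


# Invented batches of predicted labels.
reference_batch = ["position"] * 5 + ["experience"] * 3 + ["qualification"] * 2
new_batch = ["position"] * 2 + ["experience"] * 3 + ["qualification"] * 5

DRIFT_THRESHOLD = 0.2  # hypothetical cut-off for triggering a manual review

drift = total_variation(reference_batch, new_batch)
needs_review = drift > DRIFT_THRESHOLD
```

A distance of zero means the two batches have identical label proportions; larger values suggest that the mix of classifications has shifted and may merit a closer look at the underlying sentences.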
Glaser, B., and A. Strauss
1967 The Discovery of Grounded Theory: Strategies for Qualitative Research. New Brunswick: Aldine Transaction.
Ryan, G.W., and H.R. Bernard
2003 Techniques to identify themes. Field Methods, 15: 85-109.