Menai Insight combines qualitative approaches to textual analysis, which ensure that nuanced relationships within sentences are captured, with machine learning, which ensures that text is processed to a high degree of accuracy across millions of sentences. This article explains the key stages in our approach to high-accuracy, large-scale classification of text.
A key challenge with textual constructs is capturing the desired information from the text. Nuanced, multi-dimensional constructs, common in organizational research, typically do not reduce to individual words, so traditional approaches based on word counts or term clusterings are often relatively coarse proxies for the desired construct. At Menai Insight, we take care at every stage to ensure both that our textual structures capture the nuance of the sentences and that the information is extracted consistently across texts.
Our ability to capture the information consistently is underpinned by multiple layers of qualitative and quantitative validation throughout the development and population process, which are summarized below.
Qualitative Analysis: Development of the Textual Structures
The first stage in classifying the texts involved developing a representation of the material. While each sentence in a particular communication medium, such as managerial backgrounds, is unique, there is substantial underlying similarity in the material discussed. Thus, although specific details such as names, dates, and experiences inherently differ between managers, backgrounds typically discuss the positions that a manager has worked in, the experiences that they have gained, and the qualifications and professional licenses that they have received.
To identify the similarities that provide the basis for our textual structures, we drew on qualitative research on identifying themes (e.g., Glaser and Strauss, 1967; Ryan and Bernard, 2003), reading a substantial number of texts in each medium to identify the underlying basis on which the texts were comparable and the specifics on which they differed.
Machine Learning: Populating the Textual Structures and Validating the Classifications
After the textual structures were developed, we developed analytical approaches, drawing on information extraction research, to systematically populate the textual structures with information from the text. Again, substantial qualitative and quantitative validation was implemented to ensure that the textual structures are populated as expected. For example, we give substantial, continued attention to ensuring that machine-learned classifications conform to plausible sequencing, and we identify and correct classifications that are not populating the textual structures as expected.
Since machine learning inherently relies on well-classified data, our care to identify and correct mistakes leads to cumulative benefits: correcting mistakes increases the accuracy of our classifications, which in turn makes it easier to identify and correct other errors. Indeed, while working with millions of sentences introduces its own challenges, the diversity of sentences previously classified helps ensure the accuracy of classifying new material.
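As an illustration of what a sequencing check might look like, the sketch below flags label transitions that fall outside a set of allowed ones so that they can be routed to manual review. The label names, the `ALLOWED_NEXT` table, and the `flag_implausible` function are all hypothetical and chosen only for illustration; they are not a description of Menai Insight's actual rules or implementation.

```python
from typing import List, Tuple

# Hypothetical labels and transition rules, invented for this example only.
ALLOWED_NEXT = {
    "START": {"POSITION", "QUALIFICATION"},
    "POSITION": {"POSITION", "EXPERIENCE", "QUALIFICATION"},
    "EXPERIENCE": {"POSITION", "EXPERIENCE", "QUALIFICATION"},
    "QUALIFICATION": {"QUALIFICATION", "POSITION"},
}


def flag_implausible(labels: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, previous_label, label) for every transition
    that is not listed in ALLOWED_NEXT."""
    flagged = []
    previous = "START"
    for index, label in enumerate(labels):
        # Unknown previous labels have no allowed successors.
        if label not in ALLOWED_NEXT.get(previous, set()):
            flagged.append((index, previous, label))
        previous = label
    return flagged


# Example: the second label is flagged for manual review.
review_queue = flag_implausible(["QUALIFICATION", "EXPERIENCE", "POSITION"])
```

A check of this kind does not decide which classification is wrong; it only narrows down where a human reviewer should look.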
Manual Oversight: Ensuring Structures Are Populated as Expected
To ensure that mistakes are identified, our entire process has substantial manual oversight. This oversight, which spans the entire population process, helps ensure that classifications are continually improving and not drifting from those expected. Moreover, the direct correspondence to the underlying text further illustrates the validity of the textual structures, while allowing more nuanced theorizing than is feasible with distant proxies.
Transparency: Avoiding black-box analysis
Our overall process, and our usage of the textual structures, is designed to be transparent, with clear connections between the underlying text and the textual structures helping to illustrate the validity of extracted constructs. Being able to identify and verify at the specific sentence level that the desired terms are being extracted gives confidence in the validity of aggregated constructs. Moreover, we extend this level of transparency with clear documentation specifying the textual structure and detailing the classification process.
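One way to picture sentence-level traceability is to keep every extracted value attached to the sentence it came from, so that any aggregated figure can be traced back to its source text. The following is a minimal sketch under that assumption; the `Extraction` record, its attribute names, and the sample data are hypothetical and do not reflect Menai Insight's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Extraction:
    """One extracted value together with the sentence it came from (hypothetical schema)."""
    doc_id: str
    sentence_index: int
    sentence: str
    slot: str   # which part of the textual structure the value fills
    value: str


def aggregate_with_provenance(
    extractions: List[Extraction],
) -> Dict[Tuple[str, str], List[Extraction]]:
    """Group extractions by (document, slot) without discarding the source sentences."""
    grouped = defaultdict(list)
    for item in extractions:
        grouped[(item.doc_id, item.slot)].append(item)
    return dict(grouped)


# Invented sample data for illustration.
sample = [
    Extraction("mgr-1", 0, "She served as CFO of Acme.", "position", "CFO"),
    Extraction("mgr-1", 1, "She later became CEO.", "position", "CEO"),
    Extraction("mgr-1", 2, "She is a licensed CPA.", "qualification", "CPA"),
]
grouped = aggregate_with_provenance(sample)

# The aggregate (a count) and the sentences behind it are both available.
position_count = len(grouped[("mgr-1", "position")])
supporting_sentences = [e.sentence for e in grouped[("mgr-1", "position")]]
```

Because the aggregate is computed from records that still carry their sentences, a reviewer can inspect exactly which text produced a given count.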
Glaser, B., and A. Strauss
1967 The Discovery of Grounded Theory: Strategies for Qualitative Research. New Brunswick: Aldine Transaction.
Ryan, G.W., and H.R. Bernard
2003 Techniques to identify themes. Field Methods, 15: 85-109.