Text mining of key pharmacology textbooks
The introductory sections of each of five key textbooks, selected by the
research team as the most commonly used texts in their local context,
were mined to extract the pharmacology terms that were most commonly
used. The relevant texts and sections were:
- Rang & Dale’s Pharmacology, 8e Section I: General Principles ISBN
978-0702053627
- Goodman and Gilman’s Manual of Pharmacology and Therapeutics, 13e
Section I: General Principles ISBN 978-0071624428
- Katzung’s Basic and Clinical Pharmacology, 14e Section I: Basic
Principles ISBN 978-1259641152
- Golan’s Principles of Pharmacology, 4e Section I: Fundamental
Principles ISBN 978-1451191004
- Bryant and Knights’ Pharmacology for Health Professionals, 4e Unit II:
Principles of Pharmacology ISBN 978-0729541701
Because the source texts were in PDF format, the contents of the
introductory chapters were converted to raw text files using PyPDF2.
Pre-processing was conducted using the Natural Language Toolkit (NLTK).
First, stop words (e.g. “the”, “an”), punctuation, numbers, tags and
special characters were removed, and uppercase letters were converted to
lowercase. Words shorter than two characters were also removed. Next,
additional user-defined keywords such as “chapter” and “section” were
excluded from the corpora. Artefacts arising from encoding (e.g.
“\u02da”, “\n”) and ligatures (e.g. “ff”
appearing as “©”) were also corrected. Subsequently, since the majority
of the keywords were nouns, words tagged as nouns were extracted using
part-of-speech tagging. Finally, all of the extracted words were
normalised using stemming, which strips affixes to reduce each word to
its root, so that the keywords “drugs” and “drug” are both counted
under the root “drug”. Key terms were extracted using
scikit-learn’s Term Frequency-Inverse Document Frequency (TF-IDF)
vectorizer (TfidfVectorizer) with an n-gram range of 2 to 3, so that
only two- and three-word terms were scored. The top 100 terms ranked
by TF-IDF score were then selected for further analysis.