Text mining of key pharmacology textbooks
The introductory sections of five key textbooks, selected by the research team as the most commonly used texts in their local context, were mined to extract the most frequently used pharmacology terms. The relevant texts and sections were:
  1. Rang & Dale’s Pharmacology, 8e, Section I: General Principles (ISBN 978-0702053627)
  2. Goodman and Gilman’s Manual of Pharmacology and Therapeutics, 13e, Section I: General Principles (ISBN 978-0071624428)
  3. Katzung’s Basic and Clinical Pharmacology, 14e, Section I: Basic Principles (ISBN 978-1259641152)
  4. Golan’s Principles of Pharmacology, 4e, Section I: Fundamental Principles (ISBN 978-1451191004)
  5. Bryant and Knights’ Pharmacology for Health Professionals, 4e, Unit II: Principles of Pharmacology (ISBN 978-0729541701)
Because the source texts were in PDF format, the contents of the introductory chapters were first converted to raw text files using PyPDF2. Pre-processing was then conducted using the Natural Language Toolkit (NLTK). First, stop words (e.g. “the”, “an”), punctuation, numbers, tags and special characters were removed, and uppercase letters were converted to lowercase. Words shorter than two characters were also removed. Next, additional user-defined keywords such as “chapter” and “section” were excluded from the corpora. Artefacts arising from encoding (e.g. “\u02da”, “\n”) and ligatures (e.g. “ff” appearing as “©”) were also addressed. Subsequently, since the majority of the keywords were nouns, words with noun tags were extracted using part-of-speech tagging. Finally, all of the extracted words were normalised using stemming, which strips affixes to reduce each word to its root, so that the keywords “drugs” and “drug” are both treated as the root “drug”. Key terms were extracted using scikit-learn’s Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer with an n-gram range of 2 to 3, i.e. two- and three-word phrases. The top 100 terms sorted by TF-IDF score were then selected for further analysis.
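The cleaning and ranking steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline used in the study: PDF extraction, part-of-speech tagging and stemming are omitted, the stop-word list is a tiny placeholder, and the function names (`clean`, `ngrams`, `top_terms`) are hypothetical. The smoothed IDF and per-document L2 normalisation are intended to mirror scikit-learn's documented defaults, and summing scores across documents to obtain a single ranking is an assumption, since the text does not state how scores were aggregated.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); the study used NLTK's resources.
STOP_WORDS = {"the", "an", "a", "of", "and", "to", "in", "is", "are", "by", "on"}
# User-defined exclusions named in the text.
CUSTOM_EXCLUSIONS = {"chapter", "section"}


def clean(text):
    """Lowercase, keep letter-only tokens, drop stop words, exclusions and 1-char tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if len(t) >= 2 and t not in STOP_WORDS and t not in CUSTOM_EXCLUSIONS]


def ngrams(tokens, n_min=2, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def top_terms(documents, k=100):
    """Rank n-grams by TF-IDF summed over documents (the aggregation is an assumption)."""
    counts = [Counter(ngrams(clean(doc))) for doc in documents]
    n_docs = len(counts)
    # Document frequency: number of documents in which each n-gram occurs.
    doc_freq = Counter(term for c in counts for term in c)
    scores = Counter()
    for c in counts:
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        weights = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, tf in c.items()}
        # L2-normalise each document vector before aggregating.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for t, w in weights.items():
            scores[t] += w / norm
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Calling `top_terms(texts, k=100)` on a list of per-textbook strings would return up to 100 `(phrase, score)` pairs. In a full pipeline, noun filtering and stemming would be applied to the token list between `clean` and `ngrams`.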