Fiction has always captivated me, and I find audiobooks to be the perfect medium to experience it. While most audiobooks are in English, which isn't my native language, I've developed strong proficiency in it over the years. However, fiction often introduces rare words that remain unfamiliar to non-native speakers, especially those of us who don't live in English-speaking countries. My goal is simple: learn these words to enhance my audiobook experience. After all, the fewer unknown words I encounter, the more vivid and immersive the story becomes in my mind.
In this article, I'll walk you through a practical solution for expanding your vocabulary while enjoying audiobooks. I've developed a TypeScript script that helps identify and extract unfamiliar words, making the learning process more systematic. The complete implementation is available on GitHub if you'd like to try it yourself or build upon it.
There are three files specific to you in this project:

- input.epub: the book from which you want to extract words.
- output.txt: this file will contain all the new words found in the book you provided.
- ignore_words.txt: this file lists all the words you already know or wish to ignore.

To get started, install the dependencies with yarn, then navigate to the product/words directory. Run yarn start to process your book, and the script will generate an output.txt file containing all the new words found in your text.
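Before the first run, it can help to verify that these files are in place. Here is a minimal preflight sketch; the ensureProjectFiles helper is my own convenience, not part of the project:

```typescript
import fs from "fs"
import path from "path"

// Hypothetical preflight helper: checks that input.epub exists and creates
// an empty ignore_words.txt on the first run so the main script has both
export const ensureProjectFiles = (dir: string): void => {
  const inputSrc = path.join(dir, "input.epub")
  if (!fs.existsSync(inputSrc)) {
    throw new Error(`Missing ${inputSrc}: place the EPUB you want to process there`)
  }
  const ignoreWordsSrc = path.join(dir, "ignore_words.txt")
  if (!fs.existsSync(ignoreWordsSrc)) {
    fs.writeFileSync(ignoreWordsSrc, "", "utf-8")
  }
}
```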
The implementation processes the book in several steps. First, it extracts all words from the EPUB file and loads your list of known words. It then filters out words that are too short (less than 3 characters), already known, or incorrectly spelled according to the English dictionary. For each remaining word, it determines the base form (for example, "running" becomes "run") to avoid learning different forms of the same word separately. Finally, it saves the unique set of new base words to the output file.
import fs from "fs"
import path from "path"
import nspell from "nspell"
import { extractWordsFromEpub } from "./core/extractWordsFromEpub"
import { getWordBaseForm } from "./core/getWordBaseForm"
const inputSrc = path.join(__dirname, "input.epub")
const outputSrc = path.join(__dirname, "output.txt")
const ignoreWordsSrc = path.join(__dirname, "ignore_words.txt")
const main = async () => {
  // Load the English dictionary and set up the spell checker
  const en = await import("dictionary-en")
  const spell = nspell(en.default as any)

  const allWords = await extractWordsFromEpub(inputSrc)
  const ignoreWords = await fs.promises.readFile(ignoreWordsSrc, "utf-8")
  const ignoreWordsSet = new Set(ignoreWords.split("\n"))

  const result = new Set<string>()
  allWords.forEach((word) => {
    // Skip words that are too short, already known, or misspelled
    if (word.length < 3 || ignoreWordsSet.has(word) || !spell.correct(word)) {
      return
    }
    // Reduce to the base form so "running" and "runs" collapse into "run"
    const baseForm = getWordBaseForm(word)
    if (ignoreWordsSet.has(baseForm)) {
      return
    }
    result.add(baseForm)
  })

  await fs.promises.writeFile(outputSrc, Array.from(result).join("\n"), "utf-8")
}
main()
The core functionality for extracting words from EPUB files relies on the epub package to parse the book's content. The implementation processes each chapter sequentially, stripping HTML tags and splitting the text into sentences. From each sentence it extracts individual words, applying several filters: removing punctuation, collapsing whitespace, and excluding anything containing digits. To account for English capitalization conventions, the first word of each sentence is lowercased, since its capital letter only marks the sentence start; in any other position, words beginning with a capital letter are skipped as likely proper nouns. This keeps the focus on learning common vocabulary rather than names or places.
import EPub from "epub"
export const extractWordsFromEpub = async (
  filePath: string,
): Promise<Set<string>> => {
  return new Promise((resolve, reject) => {
    const epub = new EPub(filePath)
    const wordSet = new Set<string>()

    epub.on("end", () => {
      const chapters = [...epub.flow]

      // Chapters are processed one at a time; recursion keeps the
      // callback-based epub API sequential
      const processChapter = (chapterIndex: number) => {
        if (chapterIndex >= chapters.length) {
          resolve(wordSet)
          return
        }
        const chapter = chapters[chapterIndex]
        epub.getChapter(chapter.id, (error: Error | null, text: string) => {
          if (error) {
            reject(error)
            return
          }
          // Strip HTML tags, then split the plain text into sentences
          const strippedText = text.replace(/<[^>]*>/g, " ")
          const sentences = strippedText.split(/[.!?]+/)
          sentences.forEach((sentence) => {
            const words = sentence
              .replace(/[^\w\s-]/g, " ")
              .replace(/\s+/g, " ")
              .trim()
              .split(" ")
              .filter((word) => word.length > 0)
              .filter((word) => !/\d/.test(word))
            if (words.length > 0) {
              // The first word's capital letter only marks the sentence
              // start, so it is safe to lowercase it
              const firstWord = words[0].toLowerCase()
              if (firstWord.length > 0) {
                wordSet.add(firstWord)
              }
              // Elsewhere, a capitalized word is likely a proper noun; skip it
              words.slice(1).forEach((word) => {
                if (word.length > 0 && !/^[A-Z]/.test(word)) {
                  wordSet.add(word.toLowerCase())
                }
              })
            }
          })
          processChapter(chapterIndex + 1)
        })
      }
      processChapter(0)
    })

    epub.on("error", (error: Error) => {
      reject(error)
    })

    epub.parse()
  })
}
To handle different word forms effectively, we need to extract the base form of each word. For example, we want to treat "running," "runs," and "ran" as variations of "run." The getWordBaseForm function uses the compromise natural language processing library to handle this conversion. It processes each word through several steps: first checking whether it is a noun that should be converted to its singular form, then attempting to find the infinitive form of verbs, and finally handling adjectives and adverbs. For adverbs specifically, we maintain a mapping of common suffixes to transform them back into their adjectival form.
After running the script, I review the words in the output.txt file and look up any unfamiliar ones in a ChatGPT chat to learn their meanings. As I work through the list, I add the words I already know to the ignore_words.txt file, ensuring they won't appear in subsequent runs of the tool.
Here is the prompt I use:

"I am a native Russian speaker with a strong command of English, but I occasionally encounter new words. I will provide you with an English word, and I would like you to translate it and offer a clear explanation to help me remember its meaning effectively."
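Moving reviewed words from output.txt into ignore_words.txt can itself be scripted. Here is a minimal sketch; the addToIgnoreList helper is my own, not part of the original project:

```typescript
import fs from "fs"

// Hypothetical helper: appends reviewed words to ignore_words.txt,
// deduplicating entries and dropping empty lines
export const addToIgnoreList = (
  ignoreWordsSrc: string,
  reviewedWords: string[],
): void => {
  const existing = fs.existsSync(ignoreWordsSrc)
    ? fs.readFileSync(ignoreWordsSrc, "utf-8").split("\n")
    : []
  const merged = new Set(
    [...existing, ...reviewedWords].map((w) => w.trim()).filter(Boolean),
  )
  fs.writeFileSync(ignoreWordsSrc, Array.from(merged).join("\n"), "utf-8")
}
```

Because the ignore list is a Set on the script's side, duplicates are harmless, but keeping the file clean makes manual review easier.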
import { order } from "@lib/utils/array/order"
import nlp from "compromise"
import Three from "compromise/types/view/three"
// Try to read an adjective's base form from a compromise document
const getAdjective = (doc: Three) => {
  const adjective = doc.adjectives()
  const [adjectiveResult] = adjective.conjugate()
  if (adjectiveResult && "Adjective" in adjectiveResult) {
    return adjectiveResult.Adjective as string
  }
}

// Common adverb suffixes mapped to their adjectival counterparts,
// e.g. "incredibly" -> "incredible", "happily" -> "happy"
const adverbAdjectiveSuffixes: Record<string, string> = {
  ibly: "ible",
  ably: "able",
  ally: "al",
  ily: "y",
  ly: "",
}

// Longest suffixes first, so "ibly" matches before the generic "ly"
const adverbSuffixes = order(
  Object.keys(adverbAdjectiveSuffixes),
  (suffix) => suffix.length,
  "desc",
)

export const getWordBaseForm = (word: string): string => {
  // Appending "to" nudges compromise to parse the word as a verb when possible
  const doc = nlp(`${word} to`)

  const singularForm = doc.nouns().toSingular().text()
  if (singularForm) {
    return singularForm
  }

  const baseForm = doc.verbs().toInfinitive().text()
  if (baseForm) {
    return baseForm.replace(" to", "")
  }

  const adjective = getAdjective(doc)
  if (adjective) {
    return adjective
  }

  // For adverbs, swap the suffix and check the result is a valid adjective
  const adverb = doc.adverbs().text()
  if (adverb) {
    const suffix = adverbSuffixes.find((suffix) => adverb.endsWith(suffix))
    if (suffix) {
      const replacement = adverbAdjectiveSuffixes[suffix]
      const adjective = getAdjective(
        nlp(adverb.slice(0, -suffix.length) + replacement),
      )
      if (adjective) {
        return adjective
      }
    }
  }

  return word
}
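The suffix substitution can be exercised on its own, independent of compromise. Here is a standalone sketch of the same longest-suffix-first lookup; adverbToAdjective is an illustrative name, and it omits the adjective-validation step the real function performs:

```typescript
// Standalone sketch of the adverb-suffix substitution,
// without the compromise validation step
const suffixMap: Record<string, string> = {
  ibly: "ible",
  ably: "able",
  ally: "al",
  ily: "y",
  ly: "",
}

// Longest suffix first, so "ibly" wins over the generic "ly"
const suffixes = Object.keys(suffixMap).sort((a, b) => b.length - a.length)

const adverbToAdjective = (adverb: string): string => {
  const suffix = suffixes.find((s) => adverb.endsWith(s))
  if (!suffix) return adverb
  return adverb.slice(0, -suffix.length) + suffixMap[suffix]
}

// adverbToAdjective("incredibly") -> "incredible"
// adverbToAdjective("happily") -> "happy"
// adverbToAdjective("quickly") -> "quick"
```

The ordering matters: without it, "incredibly" would match the bare "ly" suffix first and yield the non-word "incredib".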
This TypeScript implementation combines EPUB parsing, natural language processing, and word form normalization to create a practical tool for vocabulary expansion. By systematically extracting and processing unfamiliar words from books, it provides a structured approach to language learning that makes the listening experience richer for non-native English speakers.