--- title: "BTM" output: rmarkdown::html_vignette author: - Chung-hong Chan ^[GESIS] vignette: > %\VignetteIndexEntry{BTM} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(46709394) ``` The package BTM by Jan Wijffels et al. finds "topics in collections of short text". Compared to other topic model packages, BTM requires a special data format for training. Oolong has no problem generating word intrusion tests with BTM. However, that special data format can make creation of topic intrusion tests very tricky. This guide provides our recommendations on how to use BTM, so that the model can be used for generating topic intrusion tests. # Requirement #1: Keep your quanteda corpus It is because every document has a unique document id. ```{r} require(BTM) require(quanteda) require(oolong) trump_corpus <- corpus(trump2k) ``` And then you can do regular text cleaning, stemming procedure with `quanteda`. Instead of making the product a `DFM` object, make it a `token` object. You may read [this issue](https://github.com/quanteda/quanteda/issues/1404) by Benoit et al. ```{r} tokens(trump_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE) %>% tokens_tolower() %>% tokens_remove(stopwords("en")) %>% tokens_remove("@*") -> trump_toks ``` # Requirement #2: Keep your data frame Use this function to convert the `token` object to a data frame. ```{r} as.data.frame.tokens <- function(x) { data.frame( doc_id = rep(names(x), lengths(x)), tokens = unlist(x, use.names = FALSE) ) } trump_dat <- as.data.frame.tokens(trump_toks) ``` Train a BTM model ```{r, message = FALSE, results = 'hide', warning = FALSE} trump_btm <- BTM(trump_dat, k = 8, iter = 500, trace = 10) ``` ## Pecularities of BTM This is how you should generate $\theta_{t}$ . However, there are many NaN and there are only 1994 rows (`trump2k` has 2000 tweets) due to empty documents. ```{r} theta <- predict(trump_btm, newdata = trump_dat) dim(theta) ``` ```{r} setdiff(docid(trump_corpus), row.names(theta)) ``` ```{r} trump_corpus[604] ``` Also, the row order is messed up. ```{r} head(row.names(theta), 100) ``` # Oolong's support for BTM Oolong has no problem generating word intrusion test for BTM like you do with other topic models. ```{r} oolong <- create_oolong(trump_btm) oolong ``` For generating topic intrusion tests, however, you must provide the data frame you used for training (in this case `trump_dat`). Your `input_corpus` must be a quanteda corpus too. ```{r} oolong <- create_oolong(trump_btm, trump_corpus, btm_dataframe = trump_dat) oolong ``` `btm_dataframe` must not be NULL. ```{r btm_error1, error = TRUE} oolong <- create_oolong(trump_btm, trump_corpus) ``` `input_corpus` must be a quanteda corpus. ```{r btm_error2, error = TRUE} oolong <- create_oolong(trump_btm, trump2k, btm_dataframe = trump_dat) ```