Title: | Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda' |
---|---|
Description: | Duct tape the 'quanteda' ecosystem (Benoit et al., 2018) <doi:10.21105/joss.00774> to modern Transformer-based text classification models (Wolf et al., 2020) <doi:10.18653/v1/2020.emnlp-demos.6>, in order to facilitate supervised machine learning for textual data. This package mimics the behaviors of 'quanteda.textmodels' and provides a function to set up the 'Python' environment to use the pretrained models from 'Hugging Face' <https://huggingface.co/>. More information: <doi:10.5117/CCR2023.1.003.CHAN>. |
Authors: | Chung-hong Chan [aut, cre] |
Maintainer: | Chung-hong Chan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.11 |
Built: | 2025-01-16 05:55:04 UTC |
Source: | https://github.com/gesistsa/grafzahl |
These functions detect miniconda and CUDA.
detect_conda()
detect_cuda()
detect_conda conducts a test to check whether 1) a miniconda installation and 2) the grafzahl miniconda environment exist.
detect_cuda checks whether cuda is available. If setup_grafzahl was executed with cuda being FALSE, this function will return FALSE. If setup_grafzahl was executed with cuda being TRUE but something prevents cuda from being enabled (e.g. no Nvidia GPU, the environment was incorrectly created), this function will also return FALSE.
logical, whether the respective system (miniconda or cuda) is available.
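The two checks can be chained so that CUDA is only probed once the conda environment is known to exist. This is a minimal sketch, not part of the package documentation: the variable names are illustrative, and the `requireNamespace()` guard is only there so the snippet also runs where grafzahl is not installed.

```r
# Sketch: decide whether CUDA can be requested, without assuming grafzahl is installed.
have_grafzahl <- requireNamespace("grafzahl", quietly = TRUE)

# detect_conda() is only called when the package is available.
conda_ready <- have_grafzahl && isTRUE(grafzahl::detect_conda())

# detect_cuda() is only called when the conda environment was found.
use_cuda <- conda_ready && isTRUE(grafzahl::detect_cuda())

message("grafzahl conda environment ready: ", conda_ready)
message("CUDA usable: ", use_cuda)
```

The resulting `use_cuda` value could then be passed as the `cuda` argument of the training and prediction functions.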
This is a dataset from the paper "The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms." The data frame contains four columns: id (identifier), headline (the actual text data), value (sentiment: 0 Neutral, +1 Positive, -1 Negative), gold (whether or not this row is "gold standard", i.e. part of the test set). The data is available from Wouter van Atteveldt's GitHub: https://github.com/vanatteveldt/ecosent
ecosent
An object of class data.frame with 6322 rows and 4 columns.
Van Atteveldt, W., Van der Velden, M. A., & Boukes, M. (2021). The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2), 121-140.
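The gold column described above separates training rows from test rows. The following sketch shows one way to make that split with base R. It assumes the dataset can be reached as `grafzahl::ecosent`; when the package is not installed, a toy data frame with the documented columns (the rows are made up purely for illustration) stands in for it.

```r
# Sketch: split the data into training and test rows using the gold column.
if (requireNamespace("grafzahl", quietly = TRUE)) {
  dat <- grafzahl::ecosent
} else {
  # Toy stand-in with the four documented columns; these rows are invented.
  dat <- data.frame(
    id = 1:4,
    headline = c("headline a", "headline b", "headline c", "headline d"),
    value = c(0, 1, -1, 0),
    gold = c(FALSE, FALSE, TRUE, FALSE)
  )
}

train <- dat[which(!dat$gold), ]  # rows outside the gold standard
test <- dat[which(dat$gold), ]    # gold-standard rows, i.e. the test set

# Distribution of the sentiment labels in the training rows
table(train$value)
```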
This function downloads the training and test sets of the Amharic News Text Classification Dataset from Hugging Face.
get_amharic_data()
A named list of two corpora: training and test
Azime, Israel Abebe, and Nebil Mohammed (2021). "An Amharic News Text classification Dataset." arXiv preprint arXiv:2103.05639
Fine-tune (or train) a pretrained Transformer model on your labelled training data x and y. The prediction task can be classification (if regression is FALSE, the default) or regression (if regression is TRUE).
grafzahl(x, y = NULL, model_name = "xlm-roberta-base", regression = FALSE,
  output_dir, cuda = detect_cuda(), num_train_epochs = 4, train_size = 0.8,
  args = NULL, cleanup = TRUE, model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)), verbose = TRUE)

## Default S3 method:
grafzahl(x, y = NULL, model_name = "xlm-roberta-base", regression = FALSE,
  output_dir, cuda = detect_cuda(), num_train_epochs = 4, train_size = 0.8,
  args = NULL, cleanup = TRUE, model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)), verbose = TRUE)

## S3 method for class 'corpus'
grafzahl(x, y = NULL, model_name = "xlm-roberta-base", regression = FALSE,
  output_dir, cuda = detect_cuda(), num_train_epochs = 4, train_size = 0.8,
  args = NULL, cleanup = TRUE, model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)), verbose = TRUE)

textmodel_transformer(...)

## S3 method for class 'character'
grafzahl(x, y = NULL, model_name = "xlmroberta", regression = FALSE,
  output_dir, cuda = detect_cuda(), num_train_epochs = 4, train_size = 0.8,
  args = NULL, cleanup = TRUE, model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)), verbose = TRUE)
x |
the corpus or character vector of texts on which the model will be trained. Depending on |
y |
training labels. It can either be a single string indicating which docvar of the corpus holds the training labels; a vector of training labels in either character or factor; or |
model_name |
string indicating either 1) the model name on the Hugging Face website; 2) the local path of the model |
regression |
logical, if |
output_dir |
string, location of the output model. If missing, the model will be stored in a temporary directory. Important: Please note that if this directory exists, it will be overwritten. |
cuda |
logical, whether to use CUDA, default to |
num_train_epochs |
numeric, if |
train_size |
numeric, proportion of data in |
args |
list, additional parameters to be used in the underlying simple transformers |
cleanup |
logical, if |
model_type |
a string indicating model_type of the input model. If |
manual_seed |
numeric, random seed |
verbose |
logical, if |
... |
parameters passed to |
a grafzahl S3 object with the following items
call |
original function call |
input_data |
input_data for the underlying python function |
output_dir |
location of the output model |
model_type |
model type |
model_name |
model name |
regression |
whether or not it is a regression model |
levels |
factor levels of y |
manual_seed |
random seed |
meta |
metadata about the current session |
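As a sketch of how the arguments above fit together, the following call fixes manual_seed and sets an explicit output_dir so that a run can be repeated and the fitted model located afterwards. The directory name and the reduced epoch count are illustrative choices, not recommendations from the package, and it is assumed that the dataset is reachable as `grafzahl::ecosent`. Training is only attempted in an interactive session where grafzahl, quanteda and the conda environment are available.

```r
# Sketch only: nothing is trained unless all prerequisites are met.
model_dir <- file.path(tempdir(), "grafzahl_demo_model")  # hypothetical location
seed <- 20190721  # fixed seed instead of the random default

can_train <- interactive() &&
  requireNamespace("grafzahl", quietly = TRUE) &&
  requireNamespace("quanteda", quietly = TRUE) &&
  isTRUE(grafzahl::detect_conda())

if (can_train) {
  input <- quanteda::corpus(grafzahl::ecosent, text_field = "headline")
  training_corpus <- quanteda::corpus_subset(input, !gold)
  model <- grafzahl::grafzahl(
    x = training_corpus,
    y = "value",                 # docvar that holds the training labels
    model_name = "GroNLP/bert-base-dutch-cased",
    output_dir = model_dir,      # overwritten if it already exists
    num_train_epochs = 2,        # illustrative; the default is 4
    manual_seed = seed
  )
  # Items of the returned object, as listed above
  print(model$output_dir)
  print(model$manual_seed)
}
```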
if (detect_conda() && interactive()) {
  library(quanteda)
  set.seed(20190721)
  ## Using the default cross validation method
  model1 <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
  predict(model1)

  ## Using LIME
  input <- corpus(ecosent, text_field = "headline")
  training_corpus <- corpus_subset(input, !gold)
  model2 <- grafzahl(x = training_corpus, y = "value", model_name = "GroNLP/bert-base-dutch-cased")
  test_corpus <- corpus_subset(input, gold)
  predicted_sentiment <- predict(model2, test_corpus)
  require(lime)
  sentences <- c("Dijsselbloem pessimistisch over snelle stappen Grieken",
                 "Aandelenbeurzen zetten koersopmars voort")
  explainer <- lime(training_corpus, model2)
  explanations <- explain(sentences, explainer, n_labels = 1, n_features = 2)
  plot_text_explanations(explanations)
}
Create a grafzahl S3 object from the output_dir
hydrate(output_dir, model_type = NULL, regression = FALSE)
output_dir |
string, location of the output model. If missing, the model will be stored in a temporary directory. Important: Please note that if this directory exists, it will be overwritten. |
model_type |
a string indicating model_type of the input model. If |
regression |
logical, if |
a grafzahl S3 object with the following items
call |
original function call |
input_data |
input_data for the underlying python function |
output_dir |
location of the output model |
model_type |
model type |
model_name |
model name |
regression |
whether or not it is a regression model |
levels |
factor levels of y |
manual_seed |
random seed |
meta |
metadata about the current session |
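A possible use of hydrate is to pick up a model that an earlier session wrote to disk. The sketch below assumes a model directory left behind by a previous grafzahl() call; the path is hypothetical. The model_type value "bert" is also an assumption about that earlier model, and it is passed explicitly because the behaviour for NULL is not spelled out here.

```r
# Sketch: restore a grafzahl object from a previously written model directory.
model_dir <- file.path(tempdir(), "grafzahl_demo_model")  # hypothetical location

can_hydrate <- requireNamespace("grafzahl", quietly = TRUE) &&
  isTRUE(grafzahl::detect_conda()) &&
  dir.exists(model_dir)

if (can_hydrate) {
  # model_type = "bert" is an assumption about the stored model.
  restored <- grafzahl::hydrate(model_dir, model_type = "bert", regression = FALSE)
  print(restored$output_dir)
} else {
  message("No model directory (or no grafzahl environment) found; nothing restored.")
}
```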
Make predictions from a fine-tuned grafzahl object.
## S3 method for class 'grafzahl'
predict(object, newdata, cuda = detect_cuda(), return_raw = FALSE, ...)
object |
an S3 object trained with |
newdata |
a corpus or a character vector of texts on which prediction should be made. |
cuda |
logical, whether to use CUDA, default to |
return_raw |
logical, if |
... |
not used |
a vector of class predictions or a matrix of logits
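When a matrix of logits is returned, it can be turned into per-class probabilities with a softmax. The following is a minimal sketch that assumes the matrix has one row per document and one column per class; the helper name `softmax_rows` and the toy logits are made up for illustration and are not output from an actual model.

```r
# Convert a matrix of logits (rows = documents, columns = classes) into
# row-wise probabilities using a numerically stable softmax.
softmax_rows <- function(logits) {
  # Subtract each row's maximum; the vector is recycled down the columns,
  # so element [i, j] is reduced by the maximum of row i.
  shifted <- logits - apply(logits, 1, max)
  expd <- exp(shifted)
  expd / rowSums(expd)
}

# Made-up logits for two documents and three classes
toy_logits <- matrix(c(2.0, 0.5, -1.0,
                       0.1, 0.2,  3.0), nrow = 2, byrow = TRUE)

probs <- softmax_rows(toy_logits)
best_class <- max.col(probs)  # column index of the most probable class per row
print(round(probs, 3))
print(best_class)
```

Subtracting the row maximum before exponentiating does not change the result but avoids overflow for large logits.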
Install a self-contained miniconda environment with all Python components (PyTorch, Transformers, Simpletransformers, etc.) that grafzahl requires. The default location is "~/.local/share/r-miniconda/envs/grafzahl_condaenv" (the suffix "_cuda" is added if cuda is TRUE).
On Linux or Mac, if miniconda is not found, this function will also install miniconda. The path can be changed via the environment variable GRAFZAHL_MINICONDA_PATH.
setup_grafzahl(cuda = FALSE, force = FALSE, cuda_version = "11.3")
cuda |
logical, if |
force |
logical, if |
cuda_version |
character, indicates the CUDA version; ignored if |
TRUE (invisibly) if installation is successful.
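The environment variable GRAFZAHL_MINICONDA_PATH mentioned above can be set from within R before the setup is run. This sketch uses a hypothetical directory and assumes the variable is read at the time setup_grafzahl() is called; the installation itself is only attempted in an interactive session.

```r
# Sketch: point grafzahl at a custom miniconda location before running the setup.
custom_path <- file.path(path.expand("~"), "grafzahl_miniconda")  # hypothetical path
Sys.setenv(GRAFZAHL_MINICONDA_PATH = custom_path)

if (interactive() && requireNamespace("grafzahl", quietly = TRUE)) {
  # CPU-only environment; set cuda = TRUE for a CUDA-enabled one.
  grafzahl::setup_grafzahl(cuda = FALSE)
}

# Confirm what the current R session will report for the variable.
Sys.getenv("GRAFZAHL_MINICONDA_PATH")
```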
# setup an environment with cuda enabled.
if (detect_conda() && interactive()) {
  setup_grafzahl(cuda = TRUE)
}
A vector of all supported model types.
supported_model_types
An object of class character of length 23.
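One use of this vector is to validate a model_type before starting a long training run. The helper below is a hypothetical sketch, not a function of the package; it takes the supported types as an argument so it can be tried without grafzahl installed. The final check uses "bertweet", the model type that appears in the grafzahl example.

```r
# Hypothetical helper: stop early when a model_type is not among the supported ones.
check_model_type <- function(model_type, supported) {
  if (!model_type %in% supported) {
    stop("Unsupported model_type: ", model_type, call. = FALSE)
  }
  invisible(model_type)
}

# Toy vector used only to demonstrate the helper
check_model_type("a", supported = c("a", "b"))

if (requireNamespace("grafzahl", quietly = TRUE)) {
  # Report, without stopping, whether "bertweet" is listed.
  print("bertweet" %in% grafzahl::supported_model_types)
}
```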
This is a dataset from the paper "The Dynamics of Political Incivility on Twitter". The tweets were by Members of Congress elected to the 115th Congress (2017–2018). It is important to note that not all the incivility labels were coded by humans; the majority of the labels were coded by the Google Perspective API. All mentions were removed. The dataset is available from Pablo Barbera's GitHub: https://github.com/pablobarbera/incivility-sage-open
unciviltweets
An object of class corpus (inherits from character) of length 19982.
Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
Set up grafzahl to be used on Google Colab or similar environments. This function is also useful if you do not want to use conda on a local machine, e.g. because you have already configured the required Python packages.
use_nonconda(install = TRUE, check = TRUE, verbose = TRUE)
install |
logical, whether to install the required Python packages |
check |
logical, whether to perform a check after the setup. The check displays 1) whether CUDA can be detected, 2) whether the non-conda mode has been activated, i.e. whether the option 'grafzahl.nonconda' is |
verbose |
logical, whether to display messages |
TRUE (invisibly) if installation is successful.
# A typical use case for Google Colab
if (interactive()) {
  use_nonconda()
}