Package 'grafzahl' reference manual

Title:	Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda'
Description:	Duct tape the 'quanteda' ecosystem (Benoit et al., 2018) <doi:10.21105/joss.00774> to modern Transformer-based text classification models (Wolf et al., 2020) <doi:10.18653/v1/2020.emnlp-demos.6>, in order to facilitate supervised machine learning for textual data. This package mimics the behaviors of 'quanteda.textmodels' and provides a function to setup the 'Python' environment to use the pretrained models from 'Hugging Face' <https://huggingface.co/>. More information: <doi:10.5117/CCR2023.1.003.CHAN>.
Authors:	Chung-hong Chan [aut, cre]
Maintainer:	Chung-hong Chan <[email protected]>
License:	GPL (>= 3)
Version:	0.0.11
Built:	2025-03-21 06:20:55 UTC
Source:	https://github.com/gesistsa/grafzahl

Detecting Miniconda And Cuda

Description

These functions detects miniconda and cuda.

Usage

detect_conda()

detect_cuda()
detect_conda()

detect_cuda()

Details

detect_conda conducts a test to check whether 1) a miniconda installation and 2) the grafzahl miniconda environment exist.

detect_cuda checks whether cuda is available. If setup_grafzahl was executed with cuda being FALSE, this function will return FALSE. Even if setup_grafzahl was executed with cuda being TRUE but with any factor that can't enable cuda (e.g. no Nvidia GPU, the environment was incorrectly created), this function will also return FALSE.

Value

boolean, whether the system is available.

A Corpus Of Dutch News Headlines

Description

This is a dataset from the paper "The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms." The data frame contains four columns: id (identifier), headline (the actual text data), value (sentiment: 0 Neutral, +1 Positive, -1 Negative), gold (whether or not this row is "gold standard", i.e. test set). The data is available from Wouter van Atteveldt's Github. https://github.com/vanatteveldt/ecosent

Usage

ecosent
ecosent

Format

An object of class data.frame with 6322 rows and 4 columns.

References

Van Atteveldt, W., Van der Velden, M. A., & Boukes, M. (2021). The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2), 121-140.

Download The Amharic News Text Classification Dataset

Description

This function downloads the training and test sets of the Amharic News Text Classification Dataset from Hugging Face.

Usage

get_amharic_data()
get_amharic_data()

Value

A named list of two corpora: training and test

References

Azime, Israel Abebe, and Nebil Mohammed (2021). "An Amharic News Text classification Dataset." arXiv preprint arXiv:2103.05639

Fine tune a pretrained Transformer model for texts

Description

Fine tune (or train) a pretrained Transformer model for your given training labelled data x and y. The prediction task can be classification (if regression is FALSE, default) or regression (if regression is TRUE).

Usage

grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

## Default S3 method:
grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

## S3 method for class 'corpus'
grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

textmodel_transformer(...)

## S3 method for class 'character'
grafzahl(
  x,
  y = NULL,
  model_name = "xlmroberta",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)
grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

## Default S3 method:
grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

## S3 method for class 'corpus'
grafzahl(
  x,
  y = NULL,
  model_name = "xlm-roberta-base",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

textmodel_transformer(...)

## S3 method for class 'character'
grafzahl(
  x,
  y = NULL,
  model_name = "xlmroberta",
  regression = FALSE,
  output_dir,
  cuda = detect_cuda(),
  num_train_epochs = 4,
  train_size = 0.8,
  args = NULL,
  cleanup = TRUE,
  model_type = NULL,
  manual_seed = floor(runif(1, min = 1, max = 721831)),
  verbose = TRUE
)

Arguments

`x`	the corpus or character vector of texts on which the model will be trained. Depending on `train_size`, some texts will be used for cross-validation.
`y`	training labels. It can either be a single string indicating which docvars of the corpus is the training labels; a vector of training labels in either character or factor; or `NULL` if the corpus contains exactly one column in docvars and that column is the training labels. If `x` is a character vector, `y` must be a vector of the same length.
`model_name`	string indicates either 1) the model name on Hugging Face website; 2) the local path of the model
`regression`	logical, if `TRUE`, the task is regression, classification otherwise.
`output_dir`	string, location of the output model. If missing, the model will be stored in a temporary directory. Important: Please note that if this directory exists, it will be overwritten.
`cuda`	logical, whether to use CUDA, default to `detect_cuda()`.
`num_train_epochs`	numeric, if `train_size` is not exactly 1.0, the maximum number of epochs to try in the "early stop" regime will be this number times 5 (i.e. 4 * 5 = 20 by default). If `train_size` is exactly 1.0, the number of epochs is exactly that.
`train_size`	numeric, proportion of data in `x` and `y` to be used actually for training. The rest will be used for cross validation.
`args`	list, additionally parameters to be used in the underlying simple transformers
`cleanup`	logical, if `TRUE`, the `runs` directory generated will be removed when the training is done
`model_type`	a string indicating model_type of the input model. If `NULL`, it will be inferred from `model_name`. Supported model types are available in supported_model_types.
`manual_seed`	numeric, random seed
`verbose`	logical, if `TRUE`, debug messages will be displayed
`...`	paramters pass to `grafzahl()`

Value

a grafzahl S3 object with the following items

`call`	original function call
`input_data`	input_data for the underlying python function
`output_dir`	location of the output model
`model_type`	model type
`model_name`	model name
`regression`	whether or not it is a regression model
`levels`	factor levels of y
`manual_seed`	random seed
`meta`	metadata about the current session

Examples

if (detect_conda() && interactive()) {
library(quanteda)
set.seed(20190721)
## Using the default cross validation method
model1 <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
predict(model1)

## Using LIME
input <- corpus(ecosent, text_field = "headline")
training_corpus <- corpus_subset(input, !gold)
model2 <- grafzahl(x = training_corpus,
                 y = "value",
                 model_name = "GroNLP/bert-base-dutch-cased")
test_corpus <- corpus_subset(input, gold)
predicted_sentiment <- predict(model2, test_corpus)
require(lime)
sentences <- c("Dijsselbloem pessimistisch over snelle stappen Grieken",
               "Aandelenbeurzen zetten koersopmars voort")
explainer <- lime(training_corpus, model2)
explanations <- explain(sentences, explainer, n_labels = 1,
                        n_features = 2)
plot_text_explanations(explanations)
}
if (detect_conda() && interactive()) {
library(quanteda)
set.seed(20190721)
## Using the default cross validation method
model1 <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
predict(model1)

## Using LIME
input <- corpus(ecosent, text_field = "headline")
training_corpus <- corpus_subset(input, !gold)
model2 <- grafzahl(x = training_corpus,
                 y = "value",
                 model_name = "GroNLP/bert-base-dutch-cased")
test_corpus <- corpus_subset(input, gold)
predicted_sentiment <- predict(model2, test_corpus)
require(lime)
sentences <- c("Dijsselbloem pessimistisch over snelle stappen Grieken",
               "Aandelenbeurzen zetten koersopmars voort")
explainer <- lime(training_corpus, model2)
explanations <- explain(sentences, explainer, n_labels = 1,
                        n_features = 2)
plot_text_explanations(explanations)
}

Create a grafzahl S3 object from the output_dir

Description

Create a grafzahl S3 object from the output_dir

Usage

hydrate(output_dir, model_type = NULL, regression = FALSE)
hydrate(output_dir, model_type = NULL, regression = FALSE)

Arguments

`output_dir`	string, location of the output model. If missing, the model will be stored in a temporary directory. Important: Please note that if this directory exists, it will be overwritten.
`model_type`	a string indicating model_type of the input model. If `NULL`, it will be inferred from `model_name`. Supported model types are available in supported_model_types.
`regression`	logical, if `TRUE`, the task is regression, classification otherwise.

Value

a grafzahl S3 object with the following items

`call`	original function call
`input_data`	input_data for the underlying python function
`output_dir`	location of the output model
`model_type`	model type
`model_name`	model name
`regression`	whether or not it is a regression model
`levels`	factor levels of y
`manual_seed`	random seed
`meta`	metadata about the current session

Prediction from a fine-tuned grafzahl object

Description

Make prediction from a fine-tuned grafzahl object.

Usage

## S3 method for class 'grafzahl'
predict(object, newdata, cuda = detect_cuda(), return_raw = FALSE, ...)
## S3 method for class 'grafzahl'
predict(object, newdata, cuda = detect_cuda(), return_raw = FALSE, ...)

Arguments

`object`	an S3 object trained with `grafzahl()`
`newdata`	a corpus or a character vector of texts on which prediction should be made.
`cuda`	logical, whether to use CUDA, default to `detect_cuda()`.
`return_raw`	logical, if `TRUE`, return a matrix of logits; a vector of class prediction otherwise
`...`	not used

Value

a vector of class prediction or a matrix of logits

Setup grafzahl

Description

Install a self-contained miniconda environment with all Python components (PyTorch, Transformers, Simpletransformers, etc) which grafzahl required. The default location is "~/.local/share/r-miniconda/envs/grafzahl_condaenv" (suffix "_cuda" is added if cuda is TRUE). On Linux or Mac and if miniconda is not found, this function will also install miniconda. The path can be changed by the environment variable GRAFZAHL_MINICONDA_PATH

Usage

setup_grafzahl(cuda = FALSE, force = FALSE, cuda_version = "11.3")
setup_grafzahl(cuda = FALSE, force = FALSE, cuda_version = "11.3")

Arguments

`cuda`	logical, if `TRUE`, indicate whether a CUDA-enabled environment is wanted.
`force`	logical, if `TRUE`, delete previous environment (if exists) and create a new environment
`cuda_version`	character, indicate CUDA version, ignore if `cuda` is `FALSE`

Value

TRUE (invisibly) if installation is successful.

Examples

# setup an environment with cuda enabled.
if (detect_conda() && interactive()) {
    setup_grafzahl(cuda = TRUE)
}
# setup an environment with cuda enabled.
if (detect_conda() && interactive()) {
    setup_grafzahl(cuda = TRUE)
}

Supported model types

Description

A vector of all supported model types.

Usage

supported_model_types
supported_model_types

Format

An object of class character of length 23.

A Corpus Of Tweets With Incivility Labels

Description

This is a dataset from the paper "The Dynamics of Political Incivility on Twitter". The tweets were by Members of Congress elected to the 115th Congress (2017–2018). It is important to note that not all the incivility labels were coded by human. Majority of the labels were coded by the Google Perspective API. All mentions were removed. The dataset is available from Pablo Barbera's Github. https://github.com/pablobarbera/incivility-sage-open

Usage

unciviltweets
unciviltweets

Format

An object of class corpus (inherits from character) of length 19982.

References

Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.

Set up grafzahl to be used on Google Colab or similar environments

Description

Set up grafzahl to be used on Google Colab or similar environments. This function is also useful if you do not want to use conda on a local machine, e.g. you have configurateed the required Python package.

Usage

use_nonconda(install = TRUE, check = TRUE, verbose = TRUE)
use_nonconda(install = TRUE, check = TRUE, verbose = TRUE)

Arguments

`install`	logical, whether to install the required Python packages
`check`	logical, whether to perform a check after the setup. The check displays 1) whether CUDA can be detected, 2) whether the non-conda mode has been activated, i.e. whether the option 'grafzahl.nonconda' is `TRUE`.
`verbose`	logical, whether to display messages

Value

TRUE (invisibly) if installation is successful.

Examples

# A typical use case for Google Colab
if (interactive()) {
    use_nonconda()
}
# A typical use case for Google Colab
if (interactive()) {
    use_nonconda()
}

Package 'grafzahl'

Help Index

Detecting Miniconda And Cuda

Description

Usage

Details

Value

A Corpus Of Dutch News Headlines

Description

Usage

Format

References

Download The Amharic News Text Classification Dataset

Description

Usage

Value

References

Fine tune a pretrained Transformer model for texts

Description

Usage

Arguments

Value

See Also

Examples

Create a grafzahl S3 object from the output_dir

Description

Usage

Arguments

Value

Prediction from a fine-tuned grafzahl object

Description

Usage

Arguments

Value

Setup grafzahl

Description

Usage

Arguments

Value

Examples

Supported model types

Description

Usage

Format

A Corpus Of Tweets With Incivility Labels

Description

Usage

Format

References

Set up grafzahl to be used on Google Colab or similar environments

Description

Usage

Arguments

Value

Examples