Shift object for calculating the Kullback-Leibler divergence (KLD) between two systems.

Usage

kldivergence_shift(
  type2freq_1,
  type2freq_2,
  base = 2L,
  reference_value = 0,
  normalization = "variation"
)

Arguments

type2freq_1

A data.frame containing words and their frequencies.

type2freq_2

A data.frame containing words and their frequencies.

base

The base for the logarithm when computing entropy scores.

reference_value

Optional. String or numeric. The reference score used to partition scores into two different regimes. If 'average', the average score according to type2freq_1 and type2score_1 is used. If a lexicon is used for type2score, use the middle point of that lexicon's scale. If no value is supplied, zero is used as the reference point. See Details for more information.

normalization

Optional. Default value: "variation". If 'variation', normalizes the shift scores so that their absolute values sum to 1. If 'trajectory', normalizes them so that the shift scores sum to 1 or -1. The trajectory normalization cannot be applied if the total shift score is 0, so if the total is 0 and 'trajectory' is specified, the scores are left unnormalized.
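
As a rough illustration of the two schemes, here is a toy base R sketch computed outside the shift object; the scores are invented for the example:

scores <- c(0.04, -0.01, 0.02, -0.03)  # hypothetical raw per-word shift scores

# 'variation': absolute values sum to 1
variation <- scores / sum(abs(scores))
sum(abs(variation))  # 1

# 'trajectory': scores sum to 1 or -1; undefined when sum(scores) == 0
trajectory <- scores / abs(sum(scores))
sum(trajectory)      # 1 here, since the total shift is positive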

Value

Returns a list object of class shift.

Details

The Kullback-Leibler divergence (KLD) is a useful asymmetric measure of how two texts differ. One text is the reference text and the other is the comparison text. If we let type2freq_1 be the reference text and type2freq_2 be the comparison text, then we can calculate the KLD as

\(D^{(KL)}(P^{(2)} || P^{(1)}) = \sum_i p_i^{(2)} \log \frac{p_i^{(2)}}{p_i^{(1)}}\).
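
To make the formula concrete, here is a minimal base R sketch that computes the divergence from two aligned relative-frequency vectors; p1 and p2 are invented for this illustration:

p1 <- c(the = 0.5, quick = 0.3, fox = 0.2)  # reference text, P^(1)
p2 <- c(the = 0.4, quick = 0.4, fox = 0.2)  # comparison text, P^(2)

# D_KL(P^(2) || P^(1)) with base-2 logarithms, matching the default base
kld <- sum(p2 * log2(p2 / p1))
kld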

A word's contribution can be written as the difference in surprisals between the reference and comparison texts, similar to the Shannon entropy, except that each surprisal is weighted by the word's frequency in the comparison text:

\(\delta KLD_i = p_i^{(2)} \log \frac{1}{p_i^{(1)}} - p_i^{(2)} \log \frac{1}{p_i^{(2)}}\).

The contribution is positive if \(p_i^{(2)} > p_i^{(1)}\). Similarly, it is negative if \(p_i^{(2)} < p_i^{(1)}\).
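
Continuing the toy vectors from the sketch above, the per-word contributions can be computed directly, and they sum to the total divergence:

# delta_i = p_i^(2) * (log2(1 / p_i^(1)) - log2(1 / p_i^(2)))
delta <- p2 * (log2(1 / p1) - log2(1 / p2))
delta                       # positive where p2 > p1, negative where p2 < p1
all.equal(sum(delta), kld)  # TRUE: contributions sum to the total KLD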

The total Kullback-Leibler divergence can be accessed through the difference column in the created shift object.

WARNING

The KLD is only well-defined if every word in the comparison text also appears in the reference text. If this is not the case, the KLD diverges to infinity.
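
One quick sanity check before calling the function is to look for comparison-text types that are missing from the reference text. This sketch assumes the words sit in the first column of each data frame, as in the example below:

# Types in the comparison text that never occur in the reference text;
# the KLD is finite only when this set is empty
missing_types <- setdiff(type2freq_2[[1]], type2freq_1[[1]])
length(missing_types)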

Examples

if (FALSE) {
library(shifterator)
library(quanteda)
library(quanteda.textstats)
library(dplyr)

reagan <- corpus_subset(data_corpus_inaugural, President == "Reagan") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>% # drop the frequency and textstat classes to get a plain data.frame
  select(feature, frequency)

bush <- corpus_subset(data_corpus_inaugural, President == "Bush" & FirstName == "George W.") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>%
  select(feature, frequency)

# This will return the message that the KL divergence is not well-defined.
kld <- kldivergence_shift(reagan, bush)
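
# One possible workaround (a sketch, not the package's recommended
# approach): restrict the comparison text to the types that also occur
# in the reference text, so every comparison word has a reference
# frequency. Note that this changes the comparison distribution.
bush_shared <- bush %>% filter(feature %in% reagan$feature)
kld_shared <- kldivergence_shift(reagan, bush_shared)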
}