Shift object for calculating the Kullback-Leibler divergence (KLD) between two systems.
Usage
kldivergence_shift(
type2freq_1,
type2freq_2,
base = 2L,
reference_value = 0,
normalization = "variation"
)
Arguments
- type2freq_1
A data.frame containing words and their frequencies for the first system (treated as the reference text in the details below).
- type2freq_2
A data.frame containing words and their frequencies for the second system (treated as the comparison text in the details below).
- base
The base of the logarithm used when computing the divergence. Default value: 2.
- reference_value
Optional. String or numeric. The reference score to use to partition scores into two different regimes. If 'average', uses the average score according to type2freq_1 and type2score_1. If a lexicon is used for type2score, you need to use the middle point of that lexicon's scale. If no value is supplied, zero will be used as the reference point. See details for more information.
- normalization
Optional. Default value: "variation". If 'variation', normalizes shift scores so that their absolute values sum to 1. If 'trajectory', normalizes them so that the shift scores sum to 1 or -1. The trajectory normalization cannot be applied if the total shift score is 0, so scores are left unnormalized if the total is 0 and 'trajectory' is specified.
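The two normalization options can be sketched in base R as follows. This is only an illustration of the description above, not the package's internal code: normalize_scores() is a hypothetical helper, the score vectors are made up, and leaving an all-zero vector unchanged under 'variation' is an assumption added here to avoid dividing by zero.

```r
# Hypothetical helper (not part of shifterator) mirroring the two options.
normalize_scores <- function(scores, normalization = "variation") {
  denom <- switch(
    normalization,
    variation  = sum(abs(scores)),  # absolute values will sum to 1
    trajectory = abs(sum(scores)),  # scores will sum to 1 or -1
    stop("normalization must be 'variation' or 'trajectory'")
  )
  # If the denominator is 0, leave the scores unnormalized.
  if (denom == 0) {
    return(scores)
  }
  scores / denom
}

scores <- c(0.4, -0.1, -0.1)  # made-up shift scores

variation_scores  <- normalize_scores(scores, "variation")
trajectory_scores <- normalize_scores(scores, "trajectory")

# A total of exactly 0 cannot be trajectory-normalized, so it is returned as is.
zero_total <- normalize_scores(c(0.5, -0.5), "trajectory")
```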
Details
The Kullback-Leibler divergence (KLD) is a useful asymmetric measure of how two texts differ. One text is the reference text and the other is the comparison text. If we let type2freq_1 be the reference text and type2freq_2 be the comparison text, then we can calculate the KLD as
\(D^{(KL)}(P^{(2)} || P^{(1)}) = \sum_i p_i^{(2)} \log \frac{p_i^{(2)}}{p_i^{(1)}}\).
A word's contribution can be written as the difference in surprisals between the reference and comparison text, similar to the Shannon entropy except weighting each surprisal by the frequency of the word in the comparison.
\(\delta KLD_i = p_i^{(2)} \log \frac{1}{p_i^{(1)}} - p_i^{(2)} \log \frac{1}{p_i^{(2)}}\).
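The contribution is positive if \(p_i^{(2)} > p_i^{(1)}\). Similarly, it is negative if \(p_i^{(2)} < p_i^{(1)}\). Mathematically, the divergence is finite only if every word that occurs in the comparison text also occurs in the reference text, because \(p_i^{(1)} = 0\) with \(p_i^{(2)} > 0\) makes the logarithm diverge.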
The total Kullback-Leibler divergence can be accessed through the difference column in the created shift object.
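As a rough illustration of the formulas above, the following base R sketch computes the per-word contributions for two small distributions and checks that they add up to the total divergence. The vectors p1 and p2, the words, and their probabilities are invented for this example; the sketch does not call shifterator and is not meant to reflect its internals.

```r
# Made-up relative frequencies over a shared vocabulary.
p1 <- c(the = 0.5, freedom = 0.3, government = 0.2)  # reference text
p2 <- c(the = 0.4, freedom = 0.2, government = 0.4)  # comparison text

log_base <- 2

# Per-word contribution: p2 * log(1 / p1) - p2 * log(1 / p2)
delta_kld <- p2 * log(1 / p1, base = log_base) -
  p2 * log(1 / p2, base = log_base)

# Summing the contributions gives the total divergence D(P2 || P1).
kld_total <- sum(delta_kld)

# The same total computed directly from the definition.
kld_direct <- sum(p2 * log(p2 / p1, base = log_base))

print(delta_kld)
print(kld_total)
```

Words that are more frequent in the comparison text ("government" here) get a positive contribution, and words that are less frequent get a negative one.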
See also
Other shifts: entropy_shift(), jsdivergence_shift(), proportion_shift(), weighted_avg_shift()
Examples
if (FALSE) {
  library(shifterator)
  library(quanteda)
  library(quanteda.textstats)
  library(dplyr)

  reagan <- corpus_subset(data_corpus_inaugural, President == "Reagan") %>%
    tokens(remove_punct = TRUE) %>%
    dfm() %>%
    textstat_frequency() %>%
    # to move from classes frequency, textstat, and data.frame to data.frame
    as.data.frame() %>%
    select(feature, frequency)

  bush <- corpus_subset(data_corpus_inaugural, President == "Bush" & FirstName == "George W.") %>%
    tokens(remove_punct = TRUE) %>%
    dfm() %>%
    textstat_frequency() %>%
    as.data.frame() %>%
    select(feature, frequency)

  # This will return the message that the KL divergence is not well-defined.
  kld <- kldivergence_shift(reagan, bush)
}