
Creates a shift object for calculating the shift in entropy between two systems.

Usage

entropy_shift(
  type2freq_1,
  type2freq_2,
  base = 2L,
  alpha = 1,
  reference_value = 0,
  normalization = "variation"
)

Arguments

type2freq_1

A data.frame containing words and their frequencies for the first system.

type2freq_2

A data.frame containing words and their frequencies for the second system.

base

The base for the logarithm when computing entropy scores.

alpha

The parameter for the generalized Tsallis entropy. Setting 'alpha = 1' recovers the Shannon entropy.

reference_value

Optional. String or numeric. The reference score used to partition scores into two different regimes. If 'average', the average score according to type2freq_1 and type2score_1 is used. If a lexicon is used for type2score, use the midpoint of that lexicon's scale. If no value is supplied, zero is used as the reference point. See details for more information.

normalization

Optional. Default value: "variation". If 'variation', normalizes the shift scores so that their absolute values sum to 1. If 'trajectory', normalizes them so that they sum to 1 or -1. Trajectory normalization cannot be applied when the total shift score is 0; in that case, the scores are left unnormalized even if 'trajectory' is specified.
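
As a quick sketch of the arithmetic behind the two schemes (the scores below are made-up values, not output of entropy_shift()):

scores <- c(0.30, -0.10, 0.20, -0.05)

# "variation": divide by the sum of absolute values, so the absolute values sum to 1
scores / sum(abs(scores))
#> [1]  0.4615385 -0.1538462  0.3076923 -0.0769231

# "trajectory": divide by the absolute total, so the scores sum to 1 or -1
scores / abs(sum(scores))
#> [1]  0.8571429 -0.2857143  0.5714286 -0.1428571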

Value

Returns a list object of class shift.

Shannon Entropy Shifts

We can use the Shannon entropy to identify more "surprising" words and how they vary between two texts. The Shannon entropy H is calculated as:

\(H(P) = \sum_i p_i \log \frac{1}{p_i},\)

where the factor \(\log \frac{1}{p_i} = -\log p_i\) is the surprisal value of a word. The less often a word appears in a text, the more surprising it is. The Shannon entropy can be interpreted as the average surprisal value of a text. We can compare two texts by taking the difference between their entropies, \(H(P^{(2)}) - H(P^{(1)})\). When we do this, we can get the contribution \(\delta H_i\) of each word to that difference:

\(\delta H_i = p_i^{(2)} \log \frac{1}{p_i^{(2)}} - p_i^{(1)} \log \frac{1}{p_i^{(1)}}\)

We can rank these contributions and plot them as a Shannon entropy word shift. If the contribution \(\delta H_i\) is positive, then word i has a higher score in the second text. If the contribution is negative, then its score is higher in the first text.

The contributions \(\delta H_i\) are available in the type2shift_score column in the shift_scores data.frame in the shift object. The surprisals are available in the type2score_1 and type2score_2 columns.
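
To make the formula concrete, here is a minimal hand computation with hypothetical relative frequencies (not package output), using base 2:

p1 <- 0.01 # relative frequency of word i in the first text
p2 <- 0.05 # relative frequency of word i in the second text

# delta H_i = p2 * log2(1/p2) - p1 * log2(1/p1)
p2 * log2(1 / p2) - p1 * log2(1 / p1)
#> [1] 0.1496578

The positive value indicates that this word contributes more surprisal in the second text.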

Tsallis Entropy Shifts

The Tsallis entropy is a generalization of the Shannon entropy which allows us to emphasize common or less common words by altering an order parameter \(\alpha > 0\). When \(\alpha < 1\), uncommon words are weighted more heavily, and when \(\alpha > 1\), common words are weighted more heavily. In the case where \(\alpha = 1\), the Tsallis entropy is equivalent to the Shannon entropy, which weights common and uncommon words equally.

The contribution \(\delta H_i^{\alpha}\) of a word to the difference in Tsallis entropy of two texts is given by

\(\delta H_i^{\alpha} = \frac{-\bigl(p_i^{(2)}\bigr)^\alpha + \bigl(p_i^{(1)}\bigr)^\alpha}{\alpha - 1}\).

Tsallis entropy shifts can be calculated with entropy_shift by passing it the alpha parameter.
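
As a rough sketch of how the order parameter reweights contributions, using the same toy frequencies as in the Shannon example above (hypothetical values, not package output):

alpha <- 0.8 # alpha < 1 weights uncommon words more heavily
p1 <- 0.01
p2 <- 0.05

# delta H_i^alpha = (-(p2^alpha) + p1^alpha) / (alpha - 1)
(-(p2^alpha) + p1^alpha) / (alpha - 1)
#> [1] 0.329547

For this relatively rare word, the contribution is roughly double its Shannon contribution of about 0.15.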

Examples

library(shifterator)
library(quanteda)
#> Package version: 3.2.3
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 2 of 2 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

reagan <- corpus_subset(data_corpus_inaugural, President == "Reagan") %>% 
  tokens(remove_punct = TRUE) %>% 
  dfm() %>% 
  textstat_frequency() %>% 
  as.data.frame() %>% # drop the frequency and textstat classes to get a plain data.frame
  select(feature, frequency)

bush <- corpus_subset(data_corpus_inaugural, President == "Bush" & FirstName == "George W.") %>% 
  tokens(remove_punct = TRUE) %>% 
  dfm() %>% 
  textstat_frequency() %>% 
  as.data.frame() %>% 
  select(feature, frequency)

shannon_entropy_shift <- entropy_shift(reagan, bush)

tsallis_entropy_shift <- entropy_shift(reagan, bush, alpha = 0.8)
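
Both calls return a shift object. Assuming list-style access to its components (it is documented above as a list of class shift containing a shift_scores data.frame), the word-level contributions and surprisals can be inspected directly:

# type2shift_score holds the contributions; type2score_1 and type2score_2 the surprisals
head(shannon_entropy_shift$shift_scores)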