Shift object for calculating differences in proportions of types across two systems.
Arguments
- type2freq_1
A data.frame containing words and their frequencies.
- type2freq_2
A data.frame containing words and their frequencies.
Details
The simplest word shift graph that we can construct is a proportion shift. If \(p_i^{(1)}\) is the relative frequency of word \(i\) in the first text, and \(p_i^{(2)}\) is its relative frequency in the second text, then the proportion shift is their difference:
\(\delta p_i = p_i^{(2)} - p_i^{(1)}\)
If the difference is positive (\(\delta p_i > 0\)), then the word is relatively more common in the second text. If it is negative (\(\delta p_i < 0\)), then it is relatively more common in the first text. We can rank words by this difference and plot them as a word shift graph.
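As a rough illustration of the formula above, the difference can be computed by hand in base R. This is only a sketch on made-up toy data, not the package's implementation: the column names `feature` and `frequency` mirror the examples on this page, and treating a word that is absent from one text as having frequency 0 is an assumption made here.

```r
# Toy frequency tables (hypothetical data) in a two-column layout:
# one row per word, with its raw count.
text_1 <- data.frame(feature = c("freedom", "government", "people"),
                     frequency = c(2, 6, 2))
text_2 <- data.frame(feature = c("freedom", "people", "story"),
                     frequency = c(5, 3, 2))

# Full outer join on the word, so that every word from either text is kept.
merged <- merge(text_1, text_2, by = "feature", all = TRUE,
                suffixes = c("_1", "_2"))

# Assumption: a word missing from one text has frequency 0 there.
merged[is.na(merged)] <- 0

# Relative frequencies p_i^(1) and p_i^(2), and their difference delta p_i.
merged$p_1 <- merged$frequency_1 / sum(merged$frequency_1)
merged$p_2 <- merged$frequency_2 / sum(merged$frequency_2)
merged$delta_p <- merged$p_2 - merged$p_1

# Rank words by the magnitude of the difference, largest first.
ranked <- merged[order(abs(merged$delta_p), decreasing = TRUE), ]
print(ranked[, c("feature", "delta_p")])
```

In this toy data, "government" has the largest negative difference (more common in the first text) and "freedom" the largest positive one (more common in the second). Because both sets of relative frequencies sum to 1, the differences sum to 0.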
See also
Other shifts: entropy_shift(), jsdivergence_shift(), kldivergence_shift(), weighted_avg_shift()
Examples
library(shifterator)
library(quanteda)
library(quanteda.textstats)
library(dplyr)
reagan <- corpus_subset(data_corpus_inaugural, President == "Reagan") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>% # to move from classes frequency, textstat, and data.frame to data.frame
  select(feature, frequency)

bush <- corpus_subset(data_corpus_inaugural, President == "Bush" & FirstName == "George W.") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>%
  select(feature, frequency)

prop <- proportion_shift(reagan, bush)