nlpstats.correlations
The nlpstats.correlations module provides tools for meta-evaluating metrics.
The quality of a metric is quantified by calculating the correlation between its scores and human scores for the outputs of a set of systems on a set of inputs. The correlation can be calculated in several different ways, each of which is a function of two score matrices.
Let \(X \in \mathbb{R}^{m \times n_1}\) and \(Z \in \mathbb{R}^{m \times n_2}\) be the metric and human score matrices, in which \(x_i^{j_1}\) and \(z_i^{j_2}\) are the respective scores on the output from the \(i\)th system on the \(j_1\)th and \(j_2\)th inputs. Note that the rows of \(X\) and \(Z\) correspond to each other, but this is not necessarily true for their columns. \(X\) and \(Z\) can be used to calculate three different correlations as follows.
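The system-level correlation quantifies the extent to which the metric scores systems similarly to humans. It is defined as:
\[r_{\textrm{sys}} = r\left(\left\{(\bar{x}_i, \bar{z}_i)\right\}_{i=1}^{m}\right)\]
where \(r(\cdot)\) is some function which calculates the correlation between the paired observations and \(\bar{x}_i\) and \(\bar{z}_i\) are the metric and human scores for system \(i\), averaged over the inputs:
\[\bar{x}_i = \frac{1}{n_1} \sum_{j=1}^{n_1} x_i^{j}, \qquad \bar{z}_i = \frac{1}{n_2} \sum_{j=1}^{n_2} z_i^{j}\]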
The input-level correlation (also called the summary-level correlation in the summarization literature) quantifies how similarly the metric and humans score different outputs for the same input. The input-level correlation requires the columns of \(X\) and \(Z\) to be paired (i.e., the \(j\)th column of \(X\) and \(Z\) both correspond to the same input and \(n_1 = n_2\)). It is defined as:
\[r_{\textrm{inp}} = \frac{1}{n} \sum_{j=1}^{n} r\left(\left\{(x_i^{j}, z_i^{j})\right\}_{i=1}^{m}\right)\]
where \(n = n_1 = n_2\). This function calculates the average correlation between the columns of \(X\) and \(Z\).
Finally, the global-level correlation directly calculates the correlation between all of the \(x_i^{j}\) and \(z_i^{j}\) pairs. It also requires the columns to be paired:
\[r_{\textrm{glo}} = r\left(\left\{(x_i^{j}, z_i^{j})\right\}_{i=1, j=1}^{m, n}\right)\]
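The three correlation levels described above can be sketched directly in NumPy. This is a minimal illustration rather than the library's implementation: the function names (pearson, system_level_sketch, input_level_sketch, global_level_sketch) are hypothetical, Pearson is used as the default \(r(\cdot)\), and the score matrices are assumed to contain no missing values.

```python
import numpy as np


def pearson(x, z):
    """Pearson correlation between two equal-length vectors."""
    return float(np.corrcoef(x, z)[0, 1])


def system_level_sketch(X, Z, r=pearson):
    """Correlate the per-system averages (row means) of X and Z."""
    X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
    # Columns need not be paired here: each matrix is averaged separately.
    return r(X.mean(axis=1), Z.mean(axis=1))


def input_level_sketch(X, Z, r=pearson):
    """Average the per-input (column-wise) correlations of X and Z."""
    X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
    per_input = [r(X[:, j], Z[:, j]) for j in range(X.shape[1])]
    return float(np.mean(per_input))


def global_level_sketch(X, Z, r=pearson):
    """Correlate all paired entries of X and Z at once."""
    X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
    return r(X.ravel(), Z.ravel())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_demo = rng.random((10, 25))
    Z_demo = rng.random((10, 25))
    print(system_level_sketch(X_demo, Z_demo))
    print(input_level_sketch(X_demo, Z_demo))
    print(global_level_sketch(X_demo, Z_demo))
```

Only the system-level sketch accepts matrices with different numbers of columns, which mirrors the pairing requirements stated above.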
This module provides methods for:
- Calculating Correlations
- Confidence Intervals
- Hypothesis Testing
Calculating Correlations
- nlpstats.correlations.correlations.correlate(X, Z, level, coefficient)
Calculates a correlation between score matrices X and Z.
The rows of X and Z should always correspond to each other. That is, X[i] and Z[i] contain the scores for the outputs from system i. For input- and global-level correlations, the columns should also correspond to each other and thus X and Z must be the same shape; there is no such requirement for system-level correlations.
If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.
The different level correlations also have their own functions to compute them directly (see system_level(), input_level(), and global_level()).
- Parameters
  - X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the ith system on the jth input.
  - Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the ith system on the jth input.
  - level (str) – The correlation to calculate, either "system", "input", or "global".
  - coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.
- Returns
  The correlation, or np.nan if it does not exist
- Return type
  float
Examples
Suppose we have two score matrices, \(X\) and \(Z\), of size \(m \times n\). Here, we randomly generate them.
>>> import numpy as np
>>> from nlpstats.correlations.correlations import correlate
>>> np.random.seed(4)
>>>
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)
\(X\) and \(Z\) can be used to calculate several different correlations. The system-level Pearson:
>>> correlate(X, Z, "system", "pearson")
-0.5011117333825296
The input-level Spearman:
>>> correlate(X, Z, "input", "spearman")
-0.07103030303030303
or the global-level Kendall:
>>> correlate(X, Z, "global", "kendall")
-0.05413654618473896
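A custom coefficient only has to accept two vectors and return a number. As a hedged illustration, the hypothetical function below computes Kendall's tau-a (no tie correction) in plain NumPy. Under the contract described for the coefficient parameter, a callable of this shape could be passed in place of one of the built-in names. Its value may differ from the built-in "kendall" option when ties are present, since which Kendall variant the library uses is not stated here.

```python
import numpy as np


def kendall_tau_a(x, z):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Hypothetical example of a custom coefficient; ties contribute zero.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(x)
    if n < 2:
        return float("nan")
    # Sign of every pairwise difference; the diagonal is zero.
    dx = np.sign(x[:, None] - x[None, :])
    dz = np.sign(z[:, None] - z[None, :])
    # Each unordered pair appears twice in the sum, hence n * (n - 1).
    return float((dx * dz).sum() / (n * (n - 1)))
```

The function takes two one-dimensional vectors and returns a float, matching the interface the documentation requires for custom coefficients.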
- nlpstats.correlations.correlations.system_level(X, Z, coefficient)
Calculates the system-level correlation between \(X\) and \(Z\).
See correlate() for details.
- nlpstats.correlations.correlations.input_level(X, Z, coefficient)
Calculates the input-level correlation between \(X\) and \(Z\).
See correlate() for details.
- nlpstats.correlations.correlations.global_level(X, Z, coefficient)
Calculates the global-level correlation between \(X\) and \(Z\).
See correlate() for details.
Confidence Intervals
This module contains two methods for calculating confidence intervals: bootstrapping and the Fisher transformation. Both functions are documented below.
- nlpstats.correlations.bootstrap.bootstrap(X, Z, level, coefficient, resampling_method, paired_inputs=True, confidence_level=0.95, n_resamples=9999)
Calculates a confidence interval for a correlation via bootstrapping.
The rows of X and Z should always correspond to each other. That is, X[i] and Z[i] contain the scores for the outputs from system i. If the columns of X and Z correspond to each other, paired_inputs should be set to True, and this is required for input- and global-level correlations.
If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.
The resampling method indicates whether the set of systems (rows) and/or inputs (columns) are resampled during bootstrapping. The options correspond to the Boot-Inputs, Boot-Systems, and Boot-Both resampling methods from Deutsch et al. (2021). Each method results in a different interpretation of the resulting confidence interval.
- Parameters
  - X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the ith system on the jth input.
  - Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the ith system on the jth input.
  - level (str) – The level of correlation, either "system", "input", or "global".
  - coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.
  - resampling_method (str) – The resampling method to use, either "systems", "inputs", or "both" to indicate whether the systems and/or inputs should be resampled during bootstrapping. If paired_inputs=True and inputs are resampled, they will be sampled in parallel for X and Z, otherwise they will not.
  - paired_inputs (bool) – Indicates whether the columns of X and Z are paired.
  - confidence_level (float) – The confidence level of the correlation interval, between 0 and 1.
  - n_resamples (int) – The number of resamples to take.
- Return type
Examples
Given score matrices X and Z, the 95% confidence interval for the system-level Pearson correlation using the Boot-Both resampling method can be calculated as:
>>> import numpy as np
>>> from nlpstats.correlations.bootstrap import bootstrap
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)
>>> bootstrap(X, Z, "system", "pearson", "both")
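To make the resampling idea concrete, the following is a rough sketch of a Boot-Both style bootstrap for the system-level Pearson correlation with paired inputs. It is an illustration under stated assumptions, not the library's code: the function name boot_both_sketch is hypothetical, missing values are not handled, and a simple percentile interval is assumed, which may differ from the interval construction the library uses.

```python
import numpy as np


def boot_both_sketch(X, Z, confidence_level=0.95, n_resamples=999, seed=0):
    """Percentile bootstrap interval, resampling both systems and inputs."""
    rng = np.random.default_rng(seed)
    X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
    m, n = X.shape
    samples = []
    for _ in range(n_resamples):
        rows = rng.integers(0, m, size=m)  # resample systems with replacement
        cols = rng.integers(0, n, size=n)  # resample inputs in parallel for X and Z
        xs = X[np.ix_(rows, cols)].mean(axis=1)
        zs = Z[np.ix_(rows, cols)].mean(axis=1)
        # Skip degenerate resamples in which the correlation is undefined.
        if xs.max() == xs.min() or zs.max() == zs.min():
            continue
        samples.append(np.corrcoef(xs, zs)[0, 1])
    samples = np.asarray(samples)
    alpha = (1.0 - confidence_level) / 2.0
    lower, upper = np.quantile(samples, [alpha, 1.0 - alpha])
    return float(lower), float(upper), samples
```

Resampling only rows or only columns would give sketches of the Boot-Systems and Boot-Inputs variants, respectively.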
- class nlpstats.correlations.bootstrap.BootstrapResult(lower, upper, samples)
- property lower
The lower bound
- property upper
The upper bound
- property samples
The bootstrapped samples
- nlpstats.correlations.fisher.fisher(X, Z, level, coefficient, confidence_level=0.95)
Calculates a confidence interval for a correlation via the Fisher transformation.
The Fisher transformation is a parametric method for calculating the confidence interval for a correlation (see Bonett & Wright (2000)).
The rows of X and Z should always correspond to each other. That is, X[i] and Z[i] contain the scores for the outputs from system i. For input- and global-level correlations, the columns should also correspond to each other and thus X and Z must be the same shape; there is no such requirement for system-level correlations.
If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.
- Parameters
  - X (npt.ArrayLike) – A two-dimensional score matrix
  - Z (npt.ArrayLike) – A two-dimensional score matrix
  - level (str) – The level of correlation, either "system", "input", or "global".
  - coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", or "kendall".
  - confidence_level (float) – The confidence level of the correlation interval, between 0 and 1.
- Return type
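For intuition, the textbook Fisher interval for a Pearson correlation can be written in a few lines of standard-library Python. This is a sketch, not the library's implementation: the name fisher_interval_sketch is hypothetical, it takes an already-computed correlation r and the number of paired observations n as inputs, and it covers only the Pearson case (the rank-based coefficients use different variance estimates that are not shown here).

```python
import math
from statistics import NormalDist


def fisher_interval_sketch(r, n, confidence_level=0.95):
    """Fisher-transformation confidence interval for a Pearson correlation.

    r: observed correlation, strictly between -1 and 1
    n: number of paired observations (must exceed 3)
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)               # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)     # standard error in the transformed space
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    # Build the interval in z-space, then map it back to correlation space.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

For a system-level correlation, n would be the number of systems being correlated. That mapping is an assumption of this sketch.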
Hypothesis Testing
This module contains two methods for testing the difference between two correlations: permutation tests and Williams’ test. Both functions are documented below.
- nlpstats.correlations.permutation.permutation_test(X, Y, Z, level, coefficient, permutation_method, alternative='two-sided', n_resamples=9999)
Runs a hypothesis test on the difference between correlations.
This function will compare the correlations of X to Z and Y to Z. Typically, X and Y correspond to metric score matrices and Z is the human score matrix.
The rows of X, Y, and Z must correspond to each other, and the columns of X and Y must too. For input- and global-level correlations, the columns of X and Y must also be paired with those of Z.
If a value from the matrices is missing, it should be replaced with np.nan. The np.nan locations must always be identical for X and Y. The same is true for Z for input- and global-level correlations.
The permutation method indicates whether the set of systems (rows) and/or inputs (columns) are permuted during the test. The options correspond to the Perm-Inputs, Perm-Systems, and Perm-Both permutation methods from Deutsch et al. (2021). Each method results in a different interpretation of the resulting hypothesis test.
- Parameters
  - X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the ith system on the jth input.
  - Y (npt.ArrayLike) – A two-dimensional score matrix in which Y[i][j] contains the Y score for the ith system on the jth input.
  - Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the ith system on the jth input.
  - level (str) – The level of correlation, either "system", "input", or "global".
  - coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.
  - permutation_method (str) – The permutation method to use, either "systems", "inputs", or "both" to indicate whether the systems and/or inputs should be permuted during the test.
  - alternative (str) – The alternative hypothesis. "two-sided" corresponds to an alternative hypothesis that \(r(X, Z) \neq r(Y, Z)\), "greater" corresponds to \(r(X, Z) > r(Y, Z)\), and "less" corresponds to \(r(X, Z) < r(Y, Z)\).
  - n_resamples (int) – The number of permutation samples to take.
- Return type
Examples
An example of hypothesis testing the input-level Pearson correlation of X to Z and Y to Z using Perm-Inputs:
>>> import numpy as np
>>> from nlpstats.correlations.permutation import permutation_test
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Y = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)
>>> permutation_test(X, Y, Z, "input", "pearson", "inputs")
The system-level correlations can be compared even if the columns of X and Y do not match those of Z:
>>> X = np.random.rand(m, 2 * n)
>>> Y = np.random.rand(m, 2 * n)
>>> permutation_test(X, Y, Z, "system", "pearson", "inputs")
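The permutation idea can be sketched for the system-level Pearson case as follows. This is a hedged illustration, not the library's implementation: the name permutation_test_sketch is hypothetical, missing values are ignored, only the two-sided alternative is shown, and the add-one p-value formula is a common convention assumed here. Each permutation swaps entries between X and Y, by whole rows ("systems"), whole columns ("inputs"), or individual cells ("both"), and recomputes the difference in correlation with Z.

```python
import numpy as np


def permutation_test_sketch(X, Y, Z, permutation_method="both",
                            n_resamples=999, seed=0):
    """Two-sided permutation test on r(X, Z) - r(Y, Z) at the system level."""
    rng = np.random.default_rng(seed)
    X, Y, Z = (np.asarray(a, dtype=float) for a in (X, Y, Z))
    m, n = X.shape
    # The shape of the swap mask decides what gets exchanged as one unit.
    mask_shape = {"systems": (m, 1), "inputs": (1, n), "both": (m, n)}[permutation_method]
    z_bar = Z.mean(axis=1)

    def delta(A, B):
        r_a = np.corrcoef(A.mean(axis=1), z_bar)[0, 1]
        r_b = np.corrcoef(B.mean(axis=1), z_bar)[0, 1]
        return r_a - r_b

    observed = delta(X, Y)
    samples = np.empty(n_resamples)
    for k in range(n_resamples):
        swap = rng.random(mask_shape) < 0.5  # broadcasts against (m, n)
        X_perm = np.where(swap, Y, X)
        Y_perm = np.where(swap, X, Y)
        samples[k] = delta(X_perm, Y_perm)
    # Add-one correction keeps the p-value strictly positive.
    extreme = np.sum(np.abs(samples) >= abs(observed))
    pvalue = (extreme + 1) / (n_resamples + 1)
    return float(pvalue), samples
```

A small p-value suggests that the observed difference between the two correlations is unlikely if X and Y were interchangeable.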
- class nlpstats.correlations.permutation.PermutationResult(pvalue, samples)
- property pvalue
The p-value of the test
- property samples
The values of the test statistic sampled during the test
- nlpstats.correlations.williams.williams_test(X, Y, Z, level, coefficient, alternative='two-sided')
Calculates a hypothesis test between two correlations using Williams’ test.
See Graham & Baldwin (2014) for details on Williams’ test.
The rows of X, Y, and Z must correspond to each other, and the columns of X and Y must too. For input- and global-level correlations, the columns of X and Y must also be paired with those of Z.
If a value from the matrices is missing, it should be replaced with np.nan. The np.nan locations must always be identical for X and Y. The same is true for Z for input- and global-level correlations.
- Parameters
  - X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the ith system on the jth input.
  - Y (npt.ArrayLike) – A two-dimensional score matrix in which Y[i][j] contains the Y score for the ith system on the jth input.
  - Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the ith system on the jth input.
  - level (str) – The level of correlation, either "system", "input", or "global".
  - coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", or "kendall".
  - alternative (str) – The alternative hypothesis. "two-sided" corresponds to an alternative hypothesis that \(r(X, Z) \neq r(Y, Z)\), "greater" corresponds to \(r(X, Z) > r(Y, Z)\), and "less" corresponds to \(r(X, Z) < r(Y, Z)\).
- Return type
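For reference, the commonly stated form of Williams' test statistic for two dependent correlations that share one variable can be sketched as below. The function name williams_statistic_sketch and its argument names are hypothetical, and the mapping to this module (r12 as \(r(X, Z)\), r13 as \(r(Y, Z)\), r23 as \(r(X, Y)\), and n as the number of paired observations) is an assumption of the sketch rather than a description of the library's internals.

```python
import math


def williams_statistic_sketch(r12, r13, r23, n):
    """Williams' t statistic for comparing two correlations sharing a variable.

    r12: correlation of the first metric with the human scores
    r13: correlation of the second metric with the human scores
    r23: correlation between the two metrics
    n:   number of paired observations (must exceed 3)
    Returns the statistic and its degrees of freedom (n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Determinant of the 3x3 correlation matrix.
    k = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1.0 + r23))
    denominator = math.sqrt(
        2.0 * k * (n - 1) / (n - 3)
        + ((r12 + r13) ** 2 / 4.0) * (1.0 - r23) ** 3
    )
    return numerator / denominator, n - 3
```

A one-sided p-value could then be read from the upper tail of a Student t distribution with the returned degrees of freedom, for example with scipy.stats.t.sf(t, df) if SciPy is available.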