nlpstats.correlations

The nlpstats.correlations module provides tools for meta-evaluating metrics.

The quality of a metric is quantified by calculating the correlation between its scores and human scores for the outputs of a set of systems on a set of inputs. The correlation can be calculated in several different ways, each of which is a function of two score matrices.

Let \(X \in \mathbb{R}^{m \times n_1}\) and \(Z \in \mathbb{R}^{m \times n_2}\) be the metric and human score matrices in which \(x_i^{j_1}\) and \(z_i^{j_2}\) are the respective scores on the output from the \(i\) th system on the \(j_1\) th and \(j_2\) th inputs. Note that the rows of \(X\) and \(Z\) correspond to each other, but this is not necessarily true for their columns. \(X\) and \(Z\) can be used to calculate three different correlations as follows.

The system-level correlation quantifies the extent to which the metric scores systems similarly to humans. It is defined as:

\[r_{\textrm{sys}} = r(\left\{\left(\bar{x}_1, \bar{z}_1\right), \dots, \left(\bar{x}_m, \bar{z}_m\right)\right\})\]

where \(r(\cdot)\) is a function that calculates the correlation between the paired observations, and \(\bar{x}_i\) and \(\bar{z}_i\) are the average metric and human scores for system \(i\):

\[ \begin{align}\begin{aligned}\bar{x}_i = \frac{1}{n_1} \sum_{j=1}^{n_1} x_i^j\\\bar{z}_i = \frac{1}{n_2} \sum_{j=1}^{n_2} z_i^j\end{aligned}\end{align} \]
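The definition above can be illustrated with a minimal NumPy sketch. The helper names `pearson` and `system_level_sketch` are hypothetical and are not part of nlpstats; the sketch only mirrors the formula (row means, then one correlation) and may differ from the library's implementation.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D vectors (NumPy only)."""
    return float(np.corrcoef(a, b)[0, 1])

def system_level_sketch(X, Z, r=pearson):
    # Average each row (system) over its inputs, ignoring missing scores.
    x_bar = np.nanmean(np.asarray(X, dtype=float), axis=1)
    z_bar = np.nanmean(np.asarray(Z, dtype=float), axis=1)
    # Correlate the m pairs of per-system averages.
    return r(x_bar, z_bar)

# X and Z share rows (systems) but may have different numbers of columns.
X = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 5.0, 5.0]])
Z = np.array([[0.1, 0.3], [0.2, 0.6], [0.9, 0.7]])
r_sys = system_level_sketch(X, Z)
```

Here the per-system averages are (2, 3, 5) and (0.2, 0.4, 0.8), which are linearly related, so the Pearson correlation is 1.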

The input-level correlation (also called the summary-level correlation in the summarization literature) quantifies how similarly the metric and humans score different outputs for the same input. The input-level correlation requires the columns of \(X\) and \(Z\) to be paired (i.e., the \(j\) th column of \(X\) and \(Z\) both correspond to the same input and \(n_1 = n_2\)). It is defined as:

\[r_{\textrm{inp}} = \frac{1}{n} \sum_{j=1}^n r\left(\left\{(x_1^j, z_1^j), \dots, (x_m^j, z_m^j)\right\}\right)\]

where \(n = n_1 = n_2\). That is, it is the average of the correlations between corresponding columns of \(X\) and \(Z\).
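As a rough illustration of the input-level definition, the following NumPy sketch computes one correlation per column and averages them. The names `pearson` and `input_level_sketch` are hypothetical, and the sketch does not handle missing values or undefined per-column correlations, which the library may treat differently.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D vectors (NumPy only)."""
    return float(np.corrcoef(a, b)[0, 1])

def input_level_sketch(X, Z, r=pearson):
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if X.shape != Z.shape:
        raise ValueError("input-level correlation needs paired columns")
    # One correlation per input (column), then the mean over inputs.
    per_input = [r(X[:, j], Z[:, j]) for j in range(X.shape[1])]
    return float(np.mean(per_input))

X = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
Z = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])
r_inp = input_level_sketch(X, Z)
```

In this toy example the first column correlates perfectly (+1) and the second is perfectly anti-correlated (-1), so the average is 0.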

Finally, the global-level correlation directly calculates the correlation between all of the \(x_i^{j}\) and \(z_i^{j}\) pairs. It also requires the columns to be paired:

\[r_{\textrm{glo}} = r\left(\left\{(x_1^1, z_1^1), \dots, (x_m^1, z_m^1), \dots, (x_m^n, z_m^n)\right\}\right)\]
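A minimal sketch of the global-level definition flattens both matrices and computes a single correlation over all pairs. The names `pearson` and `global_level_sketch` are hypothetical, and dropping pairs that contain np.nan is an assumption about how missing scores could be handled, not a description of the library's code.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D vectors (NumPy only)."""
    return float(np.corrcoef(a, b)[0, 1])

def global_level_sketch(X, Z, r=pearson):
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if X.shape != Z.shape:
        raise ValueError("global-level correlation needs paired columns")
    # Flatten both matrices so that every (system, input) cell is one pair.
    x, z = X.ravel(), Z.ravel()
    # Assumption: drop pairs in which either score is missing.
    keep = ~(np.isnan(x) | np.isnan(z))
    return r(x[keep], z[keep])

X = np.array([[1.0, 2.0], [3.0, np.nan]])
Z = np.array([[10.0, 20.0], [30.0, np.nan]])
r_glo = global_level_sketch(X, Z)
```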

This module provides methods for:

calculating these correlations
estimating confidence intervals for correlations
statistically testing the difference between correlations

Calculating Correlations

nlpstats.correlations.correlations.correlate(X, Z, level, coefficient)

Calculates a correlation between score matrices X and Z.

The rows of X and Z should always correspond to each other. That is, X[i] and Z[i] contain the scores for the outputs from system i. For input- and global-level correlations, the columns should also correspond to each other, so X and Z must have the same shape; there is no such requirement for system-level correlations.

If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.
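Before computing an input- or global-level correlation, it can be useful to check that the missing entries line up. The helper below, `nan_masks_match`, is a hypothetical convenience function written for this illustration and is not part of nlpstats.

```python
import numpy as np

def nan_masks_match(X, Z):
    """Return True if X and Z have the same shape and np.nan in the same cells."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    return X.shape == Z.shape and bool(np.array_equal(np.isnan(X), np.isnan(Z)))

X = np.array([[0.2, np.nan, 0.7], [0.5, 0.1, 0.9]])
Z_ok = np.array([[3.0, np.nan, 4.0], [2.0, 1.0, 5.0]])   # same missing cell as X
Z_bad = np.array([[3.0, 2.0, np.nan], [2.0, 1.0, 5.0]])  # different missing cell

ok = nan_masks_match(X, Z_ok)
bad = nan_masks_match(X, Z_bad)
```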

The different level correlations also have their own functions to compute them directly (see system_level(), input_level(), and global_level()).

Parameters

X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the i th system on the j th input.
Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the i th system on the j th input.
level (str) – The correlation to calculate, either "system", "input", or "global".
coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.

Returns

The correlation, or np.nan if it is undefined

Return type

float

Examples

Suppose we have two score matrices, \(X\) and \(Z\), of size \(m \times n\). Here, we randomly generate them.

>>> import numpy as np
>>> np.random.seed(4)
>>>
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)

\(X\) and \(Z\) can be used to calculate several different correlations. The system-level Pearson:

>>> correlate(X, Z, "system", "pearson")
-0.5011117333825296

The input-level Spearman:

>>> correlate(X, Z, "input", "spearman")
-0.07103030303030303

or the global-level Kendall:

>>> correlate(X, Z, "global", "kendall")
-0.05413654618473896
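The coefficient argument also accepts a custom function that takes two vectors and returns a correlation. As a sketch of what such a function might look like, the hypothetical `tau_a` below computes a simple Kendall-style statistic (concordant minus discordant pairs, divided by the number of pairs). It is written for illustration only and is not provided by nlpstats.

```python
import numpy as np

def tau_a(x, z):
    """Hypothetical custom coefficient: (concordant - discordant) / number of pairs."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(x)
    if n < 2:
        return float("nan")
    # All index pairs (i, j) with i < j.
    i, j = np.triu_indices(n, k=1)
    # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
    signs = np.sign(x[i] - x[j]) * np.sign(z[i] - z[j])
    return float(signs.sum() / len(i))

tau = tau_a([1, 2, 3, 4], [1, 3, 2, 4])
```

With five concordant pairs and one discordant pair out of six, this example gives 4/6. Given the signature documented above, such a function could then be passed in place of a string, for example `correlate(X, Z, "system", tau_a)`.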

nlpstats.correlations.correlations.system_level(X, Z, coefficient)

Calculates the system-level correlation between \(X\) and \(Z\).

See correlate() for details.

nlpstats.correlations.correlations.input_level(X, Z, coefficient)

Calculates the input-level correlation between \(X\) and \(Z\).

See correlate() for details.

nlpstats.correlations.correlations.global_level(X, Z, coefficient)

Calculates the global-level correlation between \(X\) and \(Z\).

See correlate() for details.

Confidence Intervals

This module contains two methods for calculating confidence intervals: bootstrapping and the Fisher transformation. Both functions are documented below.

nlpstats.correlations.bootstrap.bootstrap(X, Z, level, coefficient, resampling_method, paired_inputs=True, confidence_level=0.95, n_resamples=9999)

Calculates a confidence interval for a correlation via bootstrapping.

The rows of X and Z should always correspond to each other. That is, X[i] and Z[i] contain the scores for the outputs from system i. If the columns of X and Z correspond to each other, paired_inputs should be set to True, and this is required for input- and global-level correlations.

If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.

The resampling method indicates whether the set of systems (rows) and/or inputs (columns) is resampled during bootstrapping. The options correspond to the Boot-Inputs, Boot-Systems, and Boot-Both resampling methods from Deutsch et al. (2021). Each method leads to a different interpretation of the resulting confidence interval.

Parameters

X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the i th system on the j th input.
Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the i th system on the j th input.
level (str) – The level of correlation, either "system", "input", or "global".
coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.
resampling_method (str) – The resampling method to use, either "systems", "inputs", or "both" to indicate whether the systems and/or inputs should be resampled during bootstrapping. If paired_inputs=True and inputs are resampled, they will be sampled in parallel for X and Z; otherwise, they will not be.
paired_inputs (bool) – Indicates whether the columns of X and Z are paired
confidence_level (float) – The confidence level of the correlation interval, between 0 and 1.
n_resamples (int) – The number of resamples to take

Return type

BootstrapResult

Examples

Given score matrices X and Z, the 95% confidence interval for the system-level Pearson correlation using the Boot-Both resampling method can be calculated as:

>>> import numpy as np
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)
>>> bootstrap(X, Z, "system", "pearson", "both")
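To give an intuition for what Boot-Both resampling involves, here is a minimal NumPy sketch for the system-level Pearson correlation with paired inputs. The function name `boot_both_system_pearson`, the percentile-based interval, and the handling of degenerate resamples are all assumptions made for this illustration; the library's bootstrap() may differ in these details.

```python
import numpy as np

def boot_both_system_pearson(X, Z, confidence_level=0.95, n_resamples=999, seed=0):
    """Hypothetical sketch: percentile bootstrap over both systems and inputs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    m, n = X.shape
    samples = []
    for _ in range(n_resamples):
        rows = rng.integers(0, m, size=m)  # resample systems with replacement
        cols = rng.integers(0, n, size=n)  # resample inputs with replacement (paired)
        x_bar = X[np.ix_(rows, cols)].mean(axis=1)
        z_bar = Z[np.ix_(rows, cols)].mean(axis=1)
        # Skip degenerate resamples where the correlation is undefined.
        if np.ptp(x_bar) == 0 or np.ptp(z_bar) == 0:
            continue
        samples.append(np.corrcoef(x_bar, z_bar)[0, 1])
    alpha = (1 - confidence_level) / 2
    lower, upper = np.quantile(samples, [alpha, 1 - alpha])
    return float(lower), float(upper), np.asarray(samples)

rng = np.random.default_rng(4)
X = rng.random((10, 25))
Z = rng.random((10, 25))
lower, upper, samples = boot_both_system_pearson(X, Z)
```

Resampling only the rows or only the columns in the loop would give sketches of the "systems" and "inputs" variants, respectively.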

class nlpstats.correlations.bootstrap.BootstrapResult(lower, upper, samples)

property lower: The lower bound

property upper: The upper bound

property samples: The bootstrapped samples

nlpstats.correlations.fisher.fisher(X, Z, level, coefficient, confidence_level=0.95)

Calculates a confidence interval for a correlation via the Fisher transformation.

The Fisher transformation is a parametric method for calculating the confidence interval for a correlation (see Bonett & Wright (2000)).

If a score is missing for a specific output, that value should be equal to np.nan. For input- and global-level correlations, X and Z must have np.nan values in the same locations.

Parameters

X (npt.ArrayLike) – A two-dimensional score matrix
Z (npt.ArrayLike) – A two-dimensional score matrix
level (str) – The level of correlation, either "system", "input", or "global".
coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", or "kendall".
confidence_level (float) – The confidence level of the correlation interval, between 0 and 1.

Return type

FisherResult
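For intuition, the textbook Fisher interval for a Pearson correlation r computed from n paired observations can be sketched with the standard library alone. The function `fisher_interval_pearson` is hypothetical; it covers only the Pearson case with standard error 1/sqrt(n - 3), and it makes no claim about which n the library uses at each correlation level or how it treats Spearman and Kendall.

```python
import math
from statistics import NormalDist

def fisher_interval_pearson(r, n, confidence_level=0.95):
    """Hypothetical sketch of a Fisher-transformation interval for Pearson's r."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)               # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    # Two-sided critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Build the interval on the z scale, then map back to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lower, upper = fisher_interval_pearson(r=0.5, n=28)
```

Because the interval is symmetric on the transformed scale rather than on the correlation scale, the bounds are not equidistant from r.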

class nlpstats.correlations.fisher.FisherResult(lower, upper)

property lower: The lower bound

property upper: The upper bound

Hypothesis Testing

This module contains two methods for testing the difference between two correlations: permutation tests and Williams’ test. Both functions are documented below.

nlpstats.correlations.permutation.permutation_test(X, Y, Z, level, coefficient, permutation_method, alternative='two-sided', n_resamples=9999)

Runs a hypothesis test on the difference between correlations.

This function will compare the correlations of X to Z and Y to Z. Typically, X and Y correspond to metric score matrices and Z is the human score matrix.

The rows of X, Y, and Z must correspond to each other, and so must the columns of X and Y. For input- and global-level correlations, the columns of X and Y must also be paired with those of Z.

If a value from the matrices is missing, it should be replaced with np.nan. The np.nan locations must always be identical for X and Y. The same is true for Z for input- and global-level correlations.

The permutation method indicates whether the set of systems (rows) and/or inputs (columns) is permuted during the test. The options correspond to the Perm-Inputs, Perm-Systems, and Perm-Both permutation methods from Deutsch et al. (2021). Each method leads to a different interpretation of the resulting hypothesis test.

Parameters

X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the i th system on the j th input.
Y (npt.ArrayLike) – A two-dimensional score matrix in which Y[i][j] contains the Y score for the i th system on the j th input.
Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the i th system on the j th input.
level (str) – The level of correlation, either "system", "input", or "global".
coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", "kendall", or a custom correlation function. The custom function must accept two vectors as input and return the correlation between them.
permutation_method (str) – The permutation method to use, either "systems", "inputs", or "both" to indicate whether the systems and/or inputs should be permuted during the test.
alternative (str) – The alternative hypothesis. "two-sided" corresponds to an alternative hypothesis that \(r(X, Z) \neq r(Y, Z)\), "greater" corresponds to \(r(X, Z) > r(Y, Z)\), and "less" corresponds to \(r(X, Z) < r(Y, Z)\).
n_resamples (int) – The number of permutation samples to take

Return type

PermutationResult

Examples

An example of hypothesis testing the input-level Pearson correlation of X to Z and Y to Z using Perm-Inputs.

>>> import numpy as np
>>> m, n = 10, 25
>>> X = np.random.rand(m, n)
>>> Y = np.random.rand(m, n)
>>> Z = np.random.rand(m, n)
>>> permutation_test(X, Y, Z, "input", "pearson", "inputs")

The system-level correlations can be compared even if the columns of X and Y do not match those of Z:

>>> X = np.random.rand(m, 2 * n)
>>> Y = np.random.rand(m, 2 * n)
>>> permutation_test(X, Y, Z, "system", "pearson", "inputs")
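The idea behind a Perm-Inputs test on system-level Pearson correlations can be sketched in NumPy as follows. The names `system_delta` and `perm_inputs_sketch` are hypothetical, and both the swapping scheme (one coin flip per input column, exchanging that column between X and Y) and the two-sided p-value formula are assumptions made for this illustration rather than a description of permutation_test().

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D vectors (NumPy only)."""
    return float(np.corrcoef(a, b)[0, 1])

def system_delta(X, Y, Z):
    # Test statistic: difference between the two system-level correlations.
    z_bar = Z.mean(axis=1)
    return pearson(X.mean(axis=1), z_bar) - pearson(Y.mean(axis=1), z_bar)

def perm_inputs_sketch(X, Y, Z, n_resamples=999, seed=0):
    rng = np.random.default_rng(seed)
    X, Y, Z = (np.asarray(A, dtype=float) for A in (X, Y, Z))
    observed = system_delta(X, Y, Z)
    samples = np.empty(n_resamples)
    for k in range(n_resamples):
        # One coin flip per input: swap that whole column between X and Y.
        swap = rng.random(X.shape[1]) < 0.5
        X_perm = np.where(swap, Y, X)  # the column mask broadcasts over rows
        Y_perm = np.where(swap, X, Y)
        samples[k] = system_delta(X_perm, Y_perm, Z)
    # Two-sided p-value: share of permuted statistics at least as extreme.
    pvalue = float(np.mean(np.abs(samples) >= abs(observed)))
    return pvalue, samples

rng = np.random.default_rng(4)
m, n = 10, 25
X = rng.random((m, 2 * n))
Y = rng.random((m, 2 * n))
Z = rng.random((m, n))
pvalue, samples = perm_inputs_sketch(X, Y, Z)
```

Because only the columns of X and Y are exchanged with each other, the sketch works even when Z has a different number of columns, as in the example above.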

class nlpstats.correlations.permutation.PermutationResult(pvalue, samples)

property pvalue: The p-value of the test

property samples: The values of the test statistic sampled during the test

nlpstats.correlations.williams.williams_test(X, Y, Z, level, coefficient, alternative='two-sided')

Calculates a hypothesis test between two correlations using Williams’ test.

See Graham & Baldwin (2014) for details on Williams’ test.
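For reference, the test statistic as presented by Graham & Baldwin (2014) can be written in this page's notation as follows, where \(r_{XZ}\), \(r_{YZ}\), and \(r_{XY}\) are the three pairwise correlations and \(n\) is the number of paired observations. The subscripted symbols are introduced here for illustration, and how each correlation level maps onto \(n\) in this library is not specified here.

\[t_{n-3} = \frac{(r_{XZ} - r_{YZ}) \sqrt{(n - 1)(1 + r_{XY})}}{\sqrt{2K \frac{n - 1}{n - 3} + \frac{(r_{XZ} + r_{YZ})^2}{4} (1 - r_{XY})^3}}\]

\[K = 1 - r_{XZ}^2 - r_{YZ}^2 - r_{XY}^2 + 2 r_{XZ} r_{YZ} r_{XY}\]

The p-value is then obtained from a Student's t distribution with \(n - 3\) degrees of freedom.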

Parameters

X (npt.ArrayLike) – A two-dimensional score matrix in which X[i][j] contains the X score for the i th system on the j th input.
Y (npt.ArrayLike) – A two-dimensional score matrix in which Y[i][j] contains the Y score for the i th system on the j th input.
Z (npt.ArrayLike) – A two-dimensional score matrix in which Z[i][j] contains the Z score for the i th system on the j th input.
level (str) – The level of correlation, either "system", "input", or "global".
coefficient (Union[Callable, str]) – The correlation coefficient to use, either "pearson", "spearman", or "kendall".
alternative (str) – The alternative hypothesis. "two-sided" corresponds to an alternative hypothesis that \(r(X, Z) \neq r(Y, Z)\), "greater" corresponds to \(r(X, Z) > r(Y, Z)\), and "less" corresponds to \(r(X, Z) < r(Y, Z)\).

Return type

WilliamsResult

class nlpstats.correlations.williams.WilliamsResult(pvalue)

property pvalue: The p-value of the test