The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y. It has been described as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning". Applications include distributional clustering and dimensionality reduction, and more recently it has been suggested as a theoretical foundation for deep learning. It generalizes the classical notion of minimal sufficient statistics from parametric statistics to arbitrary distributions, not necessarily of exponential form, by relaxing the sufficiency condition to capture some fraction of the mutual information with the relevant variable Y. The information bottleneck can also be viewed as a rate-distortion problem, with a distortion function that measures how well Y is predicted from a compressed representation T compared to its direct prediction from X. This interpretation provides a general iterative algorithm for solving the information bottleneck tradeoff and calculating the information curve from the distribution p(X,Y).
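Since the tradeoff is stated entirely in terms of mutual information, a small numerical helper makes the quantities concrete. The following sketch (the function name and the example table are illustrative, not from the original text) computes I(X;Y) in nats from a discrete joint distribution:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats from a joint distribution table p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal p(y)
    mask = p_xy > 0                              # skip zero-probability cells
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Illustrative example: a noisy binary channel
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
```

With this helper, both terms of the bottleneck tradeoff, I(X;T) and I(T;Y), can be evaluated for any candidate discrete compression T.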
The compressed variable is T, and the algorithm minimizes the functional

\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y),

where I(X;T) and I(T;Y) are the mutual information between X and T and between T and Y, respectively, and β is a Lagrange multiplier.

The theory of the information bottleneck has recently been used to study deep neural networks (DNNs). Consider X and Y respectively as the input and output layers of a DNN, and let T be any hidden layer of the network. Shwartz-Ziv and Tishby proposed the information bottleneck as expressing the tradeoff between the mutual information measures I(X;T) and I(T;Y), which respectively quantify the amount of information that the hidden layer contains about the input and the output. They conjectured that the training process of a DNN consists of two separate phases: 1) an initial fitting phase in which I(T;Y) increases, and 2) a subsequent compression phase in which I(X;T) decreases. Saxe et al. countered the claim of Shwartz-Ziv and Tishby, stating that this compression phenomenon in DNNs is not comprehensive and depends on the particular activation function; in particular, they claimed that compression does not happen with ReLU activation functions. Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had not observed compression due to weak estimates of the mutual information. More recently, Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLU and max-pooling activations.
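For discrete distributions, this objective can be minimized by the general iterative algorithm mentioned above, alternating the self-consistent updates p(t) = Σ_x p(x) p(t|x), p(y|t) = Σ_x p(y|x) p(x|t), and p(t|x) ∝ p(t) exp(−β D_KL[p(y|x) ‖ p(y|t)]). A minimal sketch for a small finite alphabet (function name, random initialization, and fixed iteration count are simplifying assumptions, not part of the original method's statement):

```python
import numpy as np

rng = np.random.default_rng(0)

def ib_iterate(p_xy, n_clusters, beta, n_iter=200):
    """Iterative (Blahut-Arimoto-style) information bottleneck updates
    for a discrete joint distribution p_xy of shape (|X|, |Y|)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                          # p(x)
    p_y_x = p_xy / p_x[:, None]                     # p(y|x)
    # random soft initialization of the encoder p(t|x)
    q_t_x = rng.random((len(p_x), n_clusters))
    q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        q_t = p_x @ q_t_x                           # p(t)
        q_x_t = (q_t_x * p_x[:, None]) / (q_t[None, :] + eps)  # p(x|t)
        q_y_t = q_x_t.T @ p_y_x                     # p(y|t), shape (|T|, |Y|)
        # KL divergence D[p(y|x) || p(y|t)] for every pair (x, t)
        kl = np.einsum('xy,xty->xt', p_y_x,
                       np.log((p_y_x[:, None, :] + eps) /
                              (q_y_t[None, :, :] + eps)))
        q_t_x = q_t[None, :] * np.exp(-beta * kl)   # unnormalized p(t|x)
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    return q_t_x

# Illustrative joint distribution: x in {0,1} predict y similarly, x in {2,3} oppositely
p_xy = np.array([[0.23, 0.02],
                 [0.22, 0.03],
                 [0.03, 0.22],
                 [0.02, 0.23]])
encoder = ib_iterate(p_xy, n_clusters=2, beta=20.0)
```

At large β the updates typically converge to a near-deterministic encoder that groups together the x values with similar conditional distributions p(y|x), tracing out one point on the information curve per β.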
The Gaussian bottleneck, namely applying the information bottleneck approach to Gaussian variables, leads to solutions related to canonical correlation analysis. Assume X, Y are jointly multivariate zero-mean normal vectors with covariances Σ_{XX}, Σ_{YY}, and T is a compressed version of X that must maintain a given value of mutual information with Y. It can be shown that the optimum T is a normal vector consisting of linear combinations of the elements of X, T = AX, where the matrix A has orthogonal rows. The projection matrix A in fact contains M rows selected from the weighted left eigenvectors of the singular value decomposition of the (generally asymmetric) matrix Σ_{X|Y} Σ_{XX}^{-1}.
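As a rough numerical illustration of the Gaussian case (the dimensions and synthetic data below are assumptions for demonstration, not from the original text), one can estimate covariances from samples, form the conditional covariance Σ_{X|Y} = Σ_{XX} − Σ_{XY} Σ_{YY}^{-1} Σ_{YX}, and examine the spectrum of Σ_{X|Y} Σ_{XX}^{-1}, whose eigenvalues equal 1 − ρ_i² in terms of the canonical correlations ρ_i between X and Y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic jointly Gaussian data: x and y share a 2-dimensional latent factor
n, dx, dy = 10000, 3, 2
z = rng.standard_normal((n, 2))
x = z @ rng.standard_normal((2, dx)) + 0.5 * rng.standard_normal((n, dx))
y = z @ rng.standard_normal((2, dy)) + 0.5 * rng.standard_normal((n, dy))

S = np.cov(np.hstack([x, y]).T)                 # full sample covariance
Sxx, Syy = S[:dx, :dx], S[dx:, dx:]
Sxy = S[:dx, dx:]

# Conditional covariance Sigma_{X|Y} = Sxx - Sxy Syy^{-1} Syx
Sx_y = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

# Left eigenvectors of M = Sigma_{X|Y} Sigma_{XX}^{-1}
# (rows v with v M = lambda v; obtained as right eigenvectors of M.T)
M = Sx_y @ np.linalg.inv(Sxx)
eigvals, left_vecs = np.linalg.eig(M.T)
```

Since M is similar to a symmetric positive semi-definite matrix, its eigenvalues are real and lie in [0, 1]; the directions with the smallest eigenvalues (largest canonical correlations) are the first ones the Gaussian bottleneck retains as β grows.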
