Set balancing

The set balancing problem in mathematics is the problem of dividing a set into two subsets that have roughly the same characteristics. It arises naturally in the design of experiments.

There is a group of subjects. Each subject has several features, which are considered binary: for example, each subject can be either young or old, either black or white, either tall or short, etc. The goal is to divide the subjects into two sub-groups, a treatment group (T) and a control group (C), such that for each feature, the number of subjects with that feature in T is roughly equal to the number of subjects with that feature in C. For example, both groups should contain roughly the same number of young people, the same number of black people, the same number of tall people, etc.

Matrix representation

Formally, the set balancing problem can be described as follows.

m {\displaystyle m} is the number of subjects in the general population.

n {\displaystyle n} is the number of potential features.

The subjects are described by A {\displaystyle A} , an n × m {\displaystyle n\times m} matrix with entries in { 0 , 1 } {\displaystyle \{0,1\}} . Each column represents a subject and each row represents a feature. a i , j = 1 {\displaystyle a_{i,j}=1} if subject j {\displaystyle j} has feature i {\displaystyle i} , and a i , j = 0 {\displaystyle a_{i,j}=0} if subject j {\displaystyle j} does not have feature i {\displaystyle i} .

The partition into groups is described by b {\displaystyle b} , an m × 1 {\displaystyle m\times 1} vector with entries in { − 1 , 1 } {\displaystyle \{-1,1\}} . b j = 1 {\displaystyle b_{j}=1} if subject j {\displaystyle j} is in the treatment group T, and b j = − 1 {\displaystyle b_{j}=-1} if subject j {\displaystyle j} is in the control group C.

The balance of features is described by c = A ⋅ b {\displaystyle c=A\cdot b} . This is an n × 1 {\displaystyle n\times 1} vector. The value of c i {\displaystyle c_{i}} is the imbalance in feature i {\displaystyle i} : if c i > 0 {\displaystyle c_{i}>0} then there are more subjects with feature i {\displaystyle i} in T, and if c i < 0 {\displaystyle c_{i}<0} then there are more subjects with feature i {\displaystyle i} in C.

The imbalance of a given partition is defined as:

{\displaystyle I(b)=\|A\cdot b\|_{\infty }=\max _{i\in \{1,\dots ,n\}}|c_{i}|}

The set balancing problem is to find a vector b {\displaystyle b} which minimizes the imbalance I ( b ) {\displaystyle I(b)} .
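As an illustration of the matrix formulation, the following short Python sketch (the function name and the example data are purely illustrative assumptions, not taken from the article) computes the imbalance I(b) of a given partition as the largest entry, in absolute value, of the vector c = A·b:

    import numpy as np

    def imbalance(A, b):
        """Return I(b) = ||A.b||_inf, the largest per-feature imbalance.

        A is an n-by-m 0/1 matrix (rows = features, columns = subjects);
        b is a length-m vector with entries in {-1, +1}.
        """
        c = A @ b                 # c[i] = (# in T with feature i) - (# in C with feature i)
        return np.max(np.abs(c))  # worst imbalance over all features

    # Tiny example: 3 features, 4 subjects.
    A = np.array([[1, 1, 0, 0],   # feature 0 is present in subjects 0 and 1
                  [1, 0, 1, 0],   # feature 1 is present in subjects 0 and 2
                  [0, 1, 1, 1]])  # feature 2 is present in subjects 1, 2 and 3
    b = np.array([1, -1, -1, 1])  # subjects 0 and 3 go to T; subjects 1 and 2 go to C
    print(imbalance(A, b))        # prints 1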

Randomized algorithm

An approximate solution can be found with the following very simple randomized algorithm:

Send each subject to the treatment group with probability 1/2, and to the control group otherwise.

In matrix formulation:

Choose each element of b {\displaystyle b} independently at random, setting it to 1 or −1, each with probability 1/2.
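In code, the algorithm is a single random draw. The sketch below (a hedged illustration; the instance sizes and the random 0/1 matrix are arbitrary assumptions, not from the article) assigns each subject to T or C with probability 1/2 and compares the resulting imbalance to the √(4m ln n) threshold analysed below:

    import numpy as np

    rng = np.random.default_rng(0)

    n, m = 50, 200                       # n features, m subjects (arbitrary example sizes)
    A = rng.integers(0, 2, size=(n, m))  # random 0/1 feature matrix, just for illustration

    # The randomized algorithm: each subject goes to T (+1) or C (-1) with probability 1/2.
    b = rng.choice([-1, 1], size=m)

    imbalance = np.max(np.abs(A @ b))    # I(b) = ||A.b||_inf
    bound = np.sqrt(4 * m * np.log(n))   # the sqrt(4 m ln n) threshold from the analysis
    print(imbalance, bound)              # with high probability, imbalance <= bound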

Surprisingly, although this algorithm completely ignores the matrix A {\displaystyle A} , it achieves a small imbalance with high probability when there are many features. Formally, for a random vector b {\displaystyle b} :

{\displaystyle Prob\left[I(b)\geq {\sqrt {4m\ln n}}\right]\leq {\frac {2}{n}}}

PROOF:

Let k i {\displaystyle k_{i}} be the total number of subjects that have feature i {\displaystyle i} (equivalently, the number of ones in the i {\displaystyle i} -th row of the matrix A {\displaystyle A} ). Consider the following two cases:

Easy case: k i ≤ 4 m ln n {\displaystyle k_{i}\leq {\sqrt {4m\ln n}}} . Then, with probability 1, the imbalance in feature i {\displaystyle i} (which we denoted by c i {\displaystyle c_{i}} ) is at most 4 m ln n {\displaystyle {\sqrt {4m\ln n}}} , since | c i | ≤ k i {\displaystyle |c_{i}|\leq k_{i}} .

Hard case: k i > 4 m ln n {\displaystyle k_{i}>{\sqrt {4m\ln n}}} . For every j {\displaystyle j} , let X j = a i , j b j {\displaystyle X_{j}=a_{i,j}b_{j}} . Each such X j {\displaystyle X_{j}} is a random variable: it is 0 if a i , j = 0 {\displaystyle a_{i,j}=0} , and otherwise it is 1 or −1, each with probability 1/2. The imbalance in feature i {\displaystyle i} is c i = ∑ j = 1 m X j {\displaystyle c_{i}=\sum _{j=1}^{m}{X_{j}}} . Since the X j {\displaystyle X_{j}} are independent, zero-mean random variables bounded in [−1,1], the Chernoff bound gives, for every a > 0 {\displaystyle a>0} :

{\displaystyle Prob\left[|c_{i}|\geq a\right]\leq 2\exp(-a^{2}/2m)}

Selecting a = 4 m ln n {\displaystyle a={\sqrt {4m\ln n}}} gives exp ( − a 2 / 2 m ) = exp ( − 2 ln n ) = 1 / n 2 {\displaystyle \exp(-a^{2}/2m)=\exp(-2\ln n)=1/n^{2}} , and therefore:

{\displaystyle Prob\left[|c_{i}|\geq {\sqrt {4m\ln n}}\right]\leq {\frac {2}{n^{2}}}}

By the union bound,

{\displaystyle Prob\left[\exists i:|c_{i}|\geq {\sqrt {4m\ln n}}\right]\leq {\frac {2}{n}}} .
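The bound can also be checked empirically. The following sketch (an illustrative Monte Carlo experiment, not part of the original analysis; the instance sizes and trial count are arbitrary assumptions) estimates, over many random partitions of one fixed random instance, how often the imbalance reaches √(4m ln n); by the theorem above, this fraction should be at most 2/n:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, trials = 100, 400, 2000        # illustrative sizes, not from the article

    A = rng.integers(0, 2, size=(n, m))  # one fixed random 0/1 instance
    bound = np.sqrt(4 * m * np.log(n))

    exceed = 0
    for _ in range(trials):
        b = rng.choice([-1, 1], size=m)  # random partition: each subject is +1 or -1 with prob. 1/2
        if np.max(np.abs(A @ b)) >= bound:
            exceed += 1

    print(exceed / trials, "vs. the theoretical bound", 2 / n)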
