On Extension of Rationals

This note is based on my talk An Expedition to the World of \(p\)-adic Numbers at Carnegie Mellon University on January 15, 2014.

Construction from Cauchy Sequences

A standard approach to constructing the real numbers from the rational numbers is a procedure called completion, which forces all Cauchy sequences in a metric space to converge by adding new points to the metric space.

Definition A norm, denoted by \(|\cdot|\), on the field \(F\) is a function from \(F\) to the set of nonnegative numbers in an ordered field \(R\) such that (1) \(|x|=0\) if and only if \(x=0\); (2) \(|xy|=|x||y|\); (3) \(|x+y|\leq|x|+|y|\).

Remark The ordered field is usually chosen to be \(\mathbb{R}\). However, to construct \(\mathbb{R}\) itself, the ordered field is chosen to be \(\mathbb{Q}\).

A norm on the field \(F\) naturally gives rise to a metric \(d(x,y)=|x-y|\) on \(F\). For example, the standard metric on the rationals is defined by the absolute value, namely, \(d(x, y) = |x-y|\), where \(x,y\in\mathbb{Q}\), and \(|\cdot|\) is the standard norm, i.e., the absolute value, on \(\mathbb{Q}\).

Given a metric \(d\) on the field \(F\), the completion procedure considers the set of all Cauchy sequences on \(F\) and an equivalence relation

$$(a_n)\sim (b_n)\text{ if and only if }d(a_n, b_n)\to 0\text{ as }n\to\infty.$$

Definition Two norms on \(F\), \(|\cdot|_1, |\cdot|_2\), are equivalent if and only if for every sequence \((a_n)\), it is a Cauchy sequence with respect to \(d_1\) if and only if it is so with respect to \(d_2\), where \(d_1, d_2\) are the metrics determined by \(|\cdot|_1, |\cdot|_2\) respectively.

Remark One might worry about the situation in which two norms \(|\cdot|_1\) and \(|\cdot|_2\) are equivalent but induce different equivalence relations on the set of Cauchy sequences. This cannot happen: given two equivalent norms, \(|a_n|_1\) converges to 0 if and only if \(|a_n|_2\) does. (Hint: prove by contradiction and consider the sequence \((1/a_n)\).)

Definition The trivial norm on \(F\) is a norm \(|\cdot|\) such that \(|0|=0\) and \(|x|=1\) for \(x\neq 0\).

Since we are interested in norms that generate different completions of the field, it would be great if we can classify all nontrivial norms modulo the norm equivalence.

Alternative Norms on Rationals

Definition Let \(p\) be any prime number. For any non-zero integer \(a\), let \(\mathrm{ord}_pa\) be the highest power of \(p\) which divides \(a\). For any rational \(x=a/b\), define \(\mathrm{ord}_px=\mathrm{ord}_pa-\mathrm{ord}_pb\). Further define a map \(|\cdot|_p\) on \(\mathbb{Q}\) as follows: $$|x|_p = \begin{cases}p^{-\mathrm{ord}_px} & \text{if }x\neq 0 \\ 0 & \text{if }x=0\end{cases}.$$

Proposition \(|\cdot|_p\) is a norm on \(\mathbb{Q}\). We call it the \(p\)-adic norm on \(\mathbb{Q}\).

Proof (sketch) We only check the triangle inequality. Notice that $$\begin{eqnarray*}\mathrm{ord}_p\left(\frac{a}{b}-\frac{c}{d}\right) &=& \mathrm{ord}_p\left(\frac{ad-bc}{bd}\right) \\ &=& \mathrm{ord}_p(ad-bc)-\mathrm{ord}_p(bd) \\ &\geq& \min(\mathrm{ord}_p(ad), \mathrm{ord}_p(bc)) – \mathrm{ord}_p(bd) \\ &=& \min(\mathrm{ord}_p(ad/bd), \mathrm{ord}_p(bc/bd)).\end{eqnarray*}$$ Therefore, we obtain \(|a/b-c/d|_p \leq \max(|a/b|_p, |c/d|_p) \leq |a/b|_p+|c/d|_p\). QED
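As a quick sanity check, here is a small Python sketch of \(\mathrm{ord}_p\) and \(|\cdot|_p\) (the function names `ord_p` and `norm_p` are mine, not from the note); the last lines verify the strong triangle inequality on an example.

```python
from fractions import Fraction

def ord_p(x: Fraction, p: int) -> int:
    """ord_p(a/b) = ord_p(a) - ord_p(b) for a nonzero rational x."""
    a, b, k = x.numerator, x.denominator, 0
    while a % p == 0:
        a //= p
        k += 1
    while b % p == 0:
        b //= p
        k -= 1
    return k

def norm_p(x: Fraction, p: int) -> Fraction:
    """|x|_p = p^(-ord_p x) for x != 0, and |0|_p = 0."""
    return Fraction(p) ** -ord_p(x, p) if x != 0 else Fraction(0)

# |12|_2 = 1/4 since 12 = 2^2 * 3, and |3/8|_2 = 8 since ord_2(3/8) = -3.
assert norm_p(Fraction(12), 2) == Fraction(1, 4)
assert norm_p(Fraction(3, 8), 2) == 8
# The triangle inequality holds in its strong form: |x+y|_2 <= max(|x|_2, |y|_2).
x, y = Fraction(1, 2), Fraction(1, 4)
assert norm_p(x + y, 2) <= max(norm_p(x, 2), norm_p(y, 2))
```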

Remark Some counterintuitive facts about the \(p\)-adic norm: high powers of \(p\) are small, since \(|p^n|_p = p^{-n}\to 0\) as \(n\to\infty\), and consequently a series such as \(1 + p + p^2 + \ldots\) converges with respect to \(|\cdot|_p\).

The following theorem due to Ostrowski classifies all possible norms on the rationals up to norm equivalence. We denote the standard absolute value by \(|\cdot|_\infty\).

Theorem [OSTROWSKI 1916] Every nontrivial norm \(|\cdot|\) on \(\mathbb{Q}\) is equivalent to \(|\cdot|_p\) for some prime \(p\) or for \(p=\infty\).

Proof We consider two cases (i) There is \(n\in\{1,2,\ldots\}\) such that \(|n|>1\); (ii) For all \(n\in\{1,2,\ldots\}\), \(|n|\leq 1\). As we shall see, in the 1st case, the norm is equivalent to \(|\cdot|_\infty\), whereas, in the 2nd case, the norm is equivalent to \(|\cdot|_p\) for some prime \(p\).

Case (i) Let \(n_0\) be the least such \(n\in\{1,2,\ldots\}\) such that \(|n|>1\). Let \(\alpha > 0\) be such that \(|n_0|=n_0^\alpha\).

For every positive integer \(n\), if \(n_0^s \leq n < n_0^{s+1}\), then we can write \(n\) in base \(n_0\): $$n = a_0 + a_1n_0 + \ldots + a_sn_0^s.$$

By the choice of \(n_0\), we know that \(|a_i|\leq 1\) for all \(i\). Therefore, we obtain $$\begin{eqnarray*}|n| & \leq & |a_0| + |a_1||n_0| + \ldots + |a_s||n_0|^s \\ & \leq & 1 + |n_0| + \ldots + |n_0|^s \\ & \leq & n_0^{s\alpha}\left(1+n_0^{-\alpha}+n_0^{-2\alpha}+\ldots\right) \\ & \leq & Cn^\alpha,\end{eqnarray*}$$ where \(C\) does not depend on \(n\). Replace \(n\) by \(n^N\) and get \(|n|^N = |n^N| \leq Cn^{N\alpha}\), and so \(|n|\leq \sqrt[N]{C}n^\alpha\). As we can choose \(N\) to be arbitrarily large, we obtain \(|n| \leq n^\alpha\).

On the other hand, we have $$\begin{eqnarray*}|n| & \geq & |n_0^{s+1}| - |n_0^{s+1}-n| \\ & \geq & n_0^{(s+1)\alpha} - (n_0^{s+1}-n_0^s)^\alpha\\ & = & n_0^{(s+1)\alpha}\left[1-(1-1/n_0)^\alpha\right] \\ & > & C'n^\alpha.\end{eqnarray*}$$ Using the same trick, we can actually take \(C'=1\).

Therefore \(|n| = n^\alpha\). It is easy to see it is equivalent to \(|\cdot|_\infty\).

Case (ii) Since the norm is nontrivial, let \(n_0\) be the least \(n\) such that \(|n|<1\).

Claim 1 \(n_0=p\) is a prime.

Claim 2 \(|q|=1\) if \(q\) is a prime other than \(p\).

Suppose \(|q| < 1\). Find \(M\) large enough so that both \(|p^M|\) and \(|q^M|\) are less than \(1/2\). By Bézout’s lemma, there exist \(a,b\in\mathbb{Z}\) such that \(ap^M + bq^M = 1\). However, $$1 = |1| \leq |a||p^M| + |b||q^M| < 1/2 + 1/2 = 1,$$ a contradiction.

Therefore, we know \(|n|=|p|^{\mathrm{ord}_p n}\). It is easy to see it is equivalent to \(|\cdot|_p\). QED

Non-Archimedean Norm

As one might have noticed, the \(p\)-adic norm satisfies an inequality stronger than the triangle inequality, namely \(|x\pm y|_p\leq \max(|x|_p, |y|_p)\).

Definition A norm is non-Archimedean provided \(|x\pm y|\leq \max(|x|, |y|)\).

The world of non-Archimedean norms is strange and wonderful. Here are two testimonies.

Proposition (No more scalene triangles!) If \(|x|\neq |y|\), then \(|x\pm y| = \max(|x|, |y|)\).

Proof Suppose \(|x| < |y|\). On one hand, we have \(|x\pm y| \leq |y|\). On the other hand, \(|y| \leq \max(|x\pm y|, |x|)\). Since \(|x| < |y|\), we must have \(|y| \leq |x\pm y|\). QED.

Proposition (All points are centers!) \(D(a, r^-) = D(b, r^-)\) for all \(b\in D(a, r^-)\) and \(D(a, r) = D(b,r)\) for all \(b\in D(a,r)\), where \(D(c, r^-) = \{x : |x-c|<r\}\) and \(D(c,r)=\{x:|x-c|\leq r\}\).
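Both propositions can be illustrated numerically for \(|\cdot|_5\) on a finite sample of rationals. The Python sketch below (the helper names `vp` and `nrm` are mine) is only an illustration on a sample, not a proof.

```python
from fractions import Fraction

def vp(x: Fraction, p: int = 5) -> int:
    """p-adic valuation of a nonzero rational."""
    a, b, k = x.numerator, x.denominator, 0
    while a % p == 0:
        a //= p
        k += 1
    while b % p == 0:
        b //= p
        k -= 1
    return k

def nrm(x: Fraction, p: int = 5) -> Fraction:
    return Fraction(p) ** -vp(x, p) if x != 0 else Fraction(0)

# "No more scalene triangles": if |x|_5 != |y|_5 then |x+y|_5 = max(|x|_5, |y|_5).
x, y = Fraction(25), Fraction(3)          # |25|_5 = 1/25, |3|_5 = 1
assert nrm(x + y) == max(nrm(x), nrm(y))  # |28|_5 = 1

# "All points are centers": within a finite sample of Q, every b in the
# closed ball D(0, 1/5) determines the same ball.
sample = [Fraction(n, d) for n in range(-30, 31) for d in (1, 2, 3)]
r = Fraction(1, 5)
def ball(c):
    return {z for z in sample if nrm(z - c) <= r}
for b in ball(Fraction(0)):
    assert ball(b) == ball(Fraction(0))
```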

Construction of \(p\)-adic Numbers

The \(p\)-adic numbers are the completion of \(\mathbb{Q}\) via the \(p\)-adic norm.

Definition The set of \(p\)-adic numbers is defined as $$\mathbb{Q}_p = \{\text{Cauchy sequences with respect to }|\cdot|_p\} / \sim_p,$$ where \((a_n)\sim_p(b_n)\) iff \(|a_n-b_n|_p\to 0\) as \(n\to\infty\).

We would like to extend \(|\cdot|_p\) from \(\mathbb{Q}\) to \(\mathbb{Q}_p\). When extending \(|\cdot|_\infty\) from \(\mathbb{Q}\) to \(\mathbb{R}\), we set \(|[(a_n)]|_\infty\) to be \([(|a_n|)]\), an element of \(\mathbb{R}\). For \(|\cdot|_p\), however, no such step is needed: the values it takes, even after the extension, remain in \(\mathbb{Q}\).

Definition The extension of \(|\cdot|_p\) on \(\mathbb{Q}_p\) is defined by \(|[(a_n)]|_p = \lim_{n\to\infty}|a_n|_p\).

Remark Suppose \((a_n)\sim_p (a_n')\). Then \(\lim_{n\to\infty}|a_n-a_n'|_p=0\), and so \(\lim_{n\to\infty}|a_n|_p=\lim_{n\to\infty}|a_n'|_p\). Moreover, one can prove that \(\lim_{n\to\infty}|a_n|_p\) always exists provided that \((a_n)\) is a Cauchy sequence. (Hint: Suppose \(|a_n|_p\) does not converge to \(0\). There exists \(\epsilon > 0\) such that \(|a_n|_p > \epsilon\) infinitely often. Choose \(N\) large enough so that \(|a_m - a_n|_p < \epsilon\) for all \(m,n>N\). Use the ‘no more scalene triangles!’ property to deduce that \(|a_n|_p\) is eventually constant.)

Representation of \(p\)-adic Numbers

Even though each real number is an equivalence class of Cauchy sequences, each equivalence class has a canonical representative. For instance, the canonical representative of \(\pi\) is \((3, 3.1, 3.14, 3.141, 3.1415, \ldots)\). The analog for \(\mathbb{Q}_p\) is the following.

Theorem Every equivalence class \(a\) in \(\mathbb{Q}_p\) for which \(|a|_p\leq 1\) has exactly one representative Cauchy sequence of the form \((a_i)\) for which (1) \(0\leq a_i < p^i\) for \(i=1,2,3,\ldots\); (2) \(a_i \equiv a_{i+1} \pmod{p^i}\) for \(i=1,2,3,\ldots\).

Proof of Uniqueness Prove by definition chasing.

Proof of Existence We shall repeatedly apply the following lemma.

Lemma For every \(b\in\mathbb{Q}\) for which \(|b|_p \leq 1\) and \(i\in\mathbb{N}\), there exists \(a\in\{0, \ldots, p^i-1\}\) such that \(|a-b|_p \leq p^{-i}\).

Proof of Lemma Suppose \(b=m/n\) is in lowest terms. As \(|b|_p\leq 1\), we know that \((n, p^i)=1\). By Bézout’s lemma, \(an+a'p^i=m\) for some integers \(a,a'\). We may assume \(a\in\{0,\ldots,p^i-1\}\). Note that \(a-b=a'p^i/n\), and so \(|a-b|_p \leq p^{-i}\). QED

Suppose \((c_i)\) is a representative of \(a\). As \((c_i)\) is a Cauchy sequence, we can extract a subsequence \((b_i)\) such that \(|b_i - b_j|_p \leq p^{-i}\) for all \(i < j\), which is still a representative of \(a\). Using the lemma above, for each \(b_i\), we can find \(0 \leq a_i < p^i\) such that \(|a_i-b_i|_p \leq p^{-i}\). Therefore \((a_i)\) is a representative of \(a\) as well. For all \(i<j\), we have \(|a_i-a_j|_p \leq \max(|a_i - b_i|_p, |b_i-b_j|_p, |b_j-a_j|_p) \leq p^{-i}\). Therefore \(p^i\) divides \(a_i - a_j\).


For \(|a|_p\leq 1\), we write \(a=b_0 + b_1p + b_2p^2 + \ldots\), where \((b_{n-1}b_{n-2}\ldots b_0)_p = a_n\).

What if \(|a|_p > 1\)? As \(|ap^m|_p=|a|_p/p^m\), we have \(|ap^m|_p\leq 1\) for some natural number \(m\). By the representation theorem, we can write \(ap^m = b_0 + b_1p + b_2p^2 + \ldots\), and \(a = b_0p^{-m} + b_1p^{-m+1} + b_2p^{-m+2} + \ldots\).
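The existence proof is effective: one can compute the canonical digits of a rational with \(|\cdot|_p\leq 1\) by repeatedly reading off the residue mod \(p\) (via the inverse of the denominator) and dividing by \(p\). A Python sketch, with the hypothetical helper name `p_adic_digits`:

```python
from fractions import Fraction

def p_adic_digits(x: Fraction, p: int, k: int) -> list:
    """First k digits b_0, ..., b_{k-1} of x, assuming |x|_p <= 1,
    i.e. p does not divide the denominator of x."""
    digits = []
    for _ in range(k):
        # the digit is x mod p, computed via the inverse of the denominator mod p
        b = x.numerator * pow(x.denominator, -1, p) % p
        digits.append(b)
        x = (x - b) / p  # x - b is divisible by p, so |x|_p <= 1 is preserved
    return digits

# In Q_5: 1/3 = 2 + 3*5 + 1*5^2 + 3*5^3 + ... (the digits 3, 1 repeat),
assert p_adic_digits(Fraction(1, 3), 5, 4) == [2, 3, 1, 3]
# and -1 = 4 + 4*5 + 4*5^2 + ..., the 5-adic analog of 0.999... = 1.
assert p_adic_digits(Fraction(-1), 5, 4) == [4, 4, 4, 4]
# Sanity check: the partial sum 2 + 3*5 + 1*25 = 42 agrees with 1/3 modulo 5^3.
assert 42 * 3 % 125 == 1
```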

Using the representation of \(p\)-adic numbers, one can perform arithmetic operations such as addition, subtraction, multiplication and division.

Like \(\mathbb{R}\), \(\mathbb{Q}_p\) is not algebraically closed. Although \(\mathbb{C}\), the algebraic closure of \(\mathbb{R}\), has degree \(2\) over \(\mathbb{R}\) and is complete with respect to the absolute value, neither is true for \(\overline{\mathbb{Q}_p}\), the algebraic closure of \(\mathbb{Q}_p\). In fact, \(\overline{\mathbb{Q}_p}\) has infinite degree over \(\mathbb{Q}_p\) and is, unfortunately, not complete with respect to the natural extension of \(|\cdot|_p\). The good news is that the completion of \(\overline{\mathbb{Q}_p}\), denoted by \(\Omega\), is algebraically closed.


Poisson Summation Formula and Basel Problem

This note is based on Professor Noam Elkies’ talk at Carnegie Mellon University on December 2, 2014.

A classical mathematical analysis problem, also known as the Basel problem, first posed by Pietro Mengoli in 1644 and solved by Leonhard Euler in 1735, asks for the precise sum in closed form of the infinite series $$\sum_{n=1}^\infty \frac{1}{n^2} = \lim_{n\to\infty}\left(\frac{1}{1^2}+\frac{1}{2^2}+\ldots+\frac{1}{n^2}\right).$$

Here we present an attack on the Basel problem using the Poisson summation formula.

Poisson Summation Formula

The Poisson summation formula states that $$\sum_{n\in\mathbb{Z}}f(n) = \sum_{k\in\mathbb{Z}}\hat{f}(k),$$ where \(f: \mathbb{R}\to\mathbb{R}\) is a “nice” function and \(\hat{f}(\omega):=\int_\mathbb{R}f(x)e^{-2\pi i\omega x}dx\) is the Fourier transformation of \(f\).

For the sake of completeness, we are going to prove the Poisson summation formula and specify how “nice” the function should be for the formula to work.

Proof of Poisson summation formula Let \(R:\mathbb{R}\to\mathbb{R}\) be defined by \(R(t) = \sum_{n\in\mathbb{Z}}f(n+t)\). Because \(R(t+m) = R(t)\) for all \(m\in\mathbb{Z}\), \(R\) is a periodic function, and we can expand \(R(t)\) in the Fourier series $$R(t) = \sum_{k\in\mathbb{Z}}a_ke^{2\pi i kt},$$ where $$\begin{eqnarray}a_k & = & \int_0^1R(t)e^{-2\pi ikt}dt \\ & = & \int_0^1\sum_{n\in\mathbb{Z}}f(n+t)e^{-2\pi ikt}dt \\ & = & \sum_{n\in\mathbb{Z}}\int_0^1f(n+t)e^{-2\pi ikt}dt \\ & = & \sum_{n\in\mathbb{Z}}\int_n^{n+1}f(t)e^{-2\pi ikt}dt \\ & = & \int_\mathbb{R}f(t)e^{-2\pi ikt}dt,\end{eqnarray}$$ which is exactly the value \(\hat{f}(k)\) of the Fourier transform of \(f\). We conclude that $$\sum_{n\in\mathbb{Z}}f(n) = R(0)=\sum_{k\in\mathbb{Z}}a_k = \sum_{k\in\mathbb{Z}}\hat{f}(k).$$ Q.E.D.

Remark In order to use Fourier series for \(R\), we need \(R\in C^1\) to ensure that its Fourier series expansion \(\sum_{n\in\mathbb{Z}}a_ke^{2\pi ikt}\) converges to \(R(t)\) pointwise (and uniformly in fact). This allows us to evaluate \(R\) at \(0\) and equate it with \(\sum_{k\in\mathbb{Z}} a_k\). On the other hand, we switched the integration and the infinite summation $$\int_0^1\sum_{n\in\mathbb{Z}}f(n+t)e^{-2\pi ikt}dt = \sum_{n\in\mathbb{Z}}\int_0^1f(n+t)e^{-2\pi ikt}dt,$$ in the computation. The Fubini–Tonelli theorem says that this is legit if $$\int_0^1\sum_{n\in\mathbb{Z}}\left|f(n+t)e^{-2\pi ikt}\right|dt = \int_0^1\sum_{n\in\mathbb{Z}}\left|f(n+t)\right|dt < \infty$$ or $$\sum_{n\in\mathbb{Z}}\int_0^1\left|f(n+t)e^{-2\pi ikt}\right|dt = \sum_{n\in\mathbb{Z}}\int_n^{n+1}|f(x)|dx = \int_\mathbb{R}|f(x)|dx < \infty.$$
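As a numerical illustration (not part of the original talk), one can test the formula on the Gaussian \(f(x)=e^{-\pi tx^2}\), whose Fourier transform under the convention above is \(\hat{f}(\omega)=t^{-1/2}e^{-\pi\omega^2/t}\); both sides of the Poisson summation formula then agree to machine precision.

```python
import math

def poisson_lhs(t: float, N: int = 50) -> float:
    """Sum of f(n) for f(x) = exp(-pi * t * x^2), truncated to |n| <= N."""
    return sum(math.exp(-math.pi * t * n * n) for n in range(-N, N + 1))

def poisson_rhs(t: float, N: int = 50) -> float:
    """Sum of fhat(k) with fhat(w) = t^(-1/2) exp(-pi w^2 / t), |k| <= N."""
    return sum(math.exp(-math.pi * k * k / t) for k in range(-N, N + 1)) / math.sqrt(t)

assert abs(poisson_lhs(2.0) - poisson_rhs(2.0)) < 1e-12
assert abs(poisson_lhs(0.5) - poisson_rhs(0.5)) < 1e-12
```

The resulting identity is the classical transformation law of the Jacobi theta function.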

Application to Basel Problem

Take \(f(x) = \frac{1}{x^2+c^2}\), where \(c > 0\) is a constant. For such \(f\), \(R(t) = \sum_{n\in\mathbb{Z}}f(n+t)\) defined above converges uniformly. As each \(f(n+t)\) is continuous, so is \(R\). As \(\sum_{n\in\mathbb{Z}}f'(n+t) = \sum_{n\in\mathbb{Z}}\frac{-2(t+n)}{((t+n)^2+c^2)^2}\) converges uniformly, we have that \(R'(t) = \sum_{n\in\mathbb{Z}}f'(n+t)\), and moreover, as each \(f'(n+t)\) is continuous, so is \(R'\). Now \(R\) is continuously differentiable, and so we can use Fourier series for \(R\). On the other hand, since \(f\) is absolutely integrable, we can switch the integral with the summation. Therefore we can apply the Poisson summation formula.

Its Fourier transform is, by Cauchy’s integral formula, $$\hat{f}(\omega) = \int_\mathbb{R}\frac{1}{x^2+c^2}e^{-2\pi i\omega x}dx = \frac{\pi}{c}e^{-2\pi c|\omega|}.$$ By the Poisson summation formula, we obtain $$\sum_{n\in\mathbb{Z}}\frac{1}{n^2+c^2} = \sum_{k\in\mathbb{Z}}\frac{\pi}{c}e^{-2\pi c|k|} = \frac{\pi}{c}\frac{e^{2\pi c}+1}{e^{2\pi c}-1},$$ and so $$\sum_{n=1}^\infty \frac{1}{n^2+c^2} = \frac{1}{2}\left(\frac{\pi}{c}\frac{e^{2\pi c}+1}{e^{2\pi c}-1}-\frac{1}{c^2}\right).$$

Sending \(c\) to \(0^+\), the left hand side converges to \(\sum_{n=1}^\infty\frac{1}{n^2}\) and the right hand side is equal to $$\begin{eqnarray}& & \frac{1}{2}\left(\frac{\pi}{c}\frac{2+2\pi c+2\pi^2 c^2 + \ldots}{2\pi c + 2\pi^2 c^2 + 4\pi^3c^3/3 + \ldots}-\frac{1}{c^2}\right) \\ & = & \frac{1}{2}\left(\frac{(2 + 2\pi c+2\pi^2c^2 + \ldots) - (2 + 2\pi c + 4\pi^2c^2/3 + \ldots)}{2c^2 + 2\pi c^3 + 4\pi^2c^4/3 + \ldots}\right) \\ & = & \frac{1}{2}\frac{2\pi^2c^2 - 4\pi^2c^2/3 + \ldots}{2c^2 + \ldots }\end{eqnarray}$$ and converges to \(\frac{\pi^2}{6}\) as \(c\to 0^+\).
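Both the closed form for \(\sum_{n\geq 1} 1/(n^2+c^2)\) and its \(c\to 0^+\) limit can be checked numerically; here is a quick Python sanity check (the truncation bounds are rough, and the function names are mine):

```python
import math

def basel_lhs(c: float, N: int = 200_000) -> float:
    """Partial sum of the series over n >= 1; the tail is below 1/N."""
    return sum(1.0 / (n * n + c * c) for n in range(1, N + 1))

def basel_rhs(c: float) -> float:
    """(1/2) * ((pi/c) * (e^{2 pi c}+1)/(e^{2 pi c}-1) - 1/c^2)."""
    e = math.exp(2 * math.pi * c)
    return 0.5 * ((math.pi / c) * (e + 1) / (e - 1) - 1.0 / (c * c))

assert abs(basel_lhs(0.1) - basel_rhs(0.1)) < 1e-4       # the closed form at c = 0.1
assert abs(basel_rhs(0.001) - math.pi ** 2 / 6) < 1e-3   # the limit is pi^2/6
```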


A Note on Thue’s Method

This note is based on my talk Introduction to Diophantine Approximation at Carnegie Mellon University on November 4, 2014. Owing to time constraints, details of Thue’s method were not fully presented in the talk. The note is supposed to serve as a complement to that talk.

Diophantine Approximation

Diophantine approximation deals with the approximation of real numbers by rational numbers. In other words, given a real number \(\alpha\), we are interested in finding a rational approximation \(p/q\) such that the error \(|\alpha – p/q|\) is “small”. The smaller the error, the better the approximation. Since the set of rationals is dense, there is no best rational approximation of an irrational number (apparently, the best rational approximation of a rational number is itself). However, if we compare the error with the complexity of the rational, i.e., the denominator of the rational, it is natural to expect that the more complicated the rational approximation, the smaller the error. In particular, we can make sense of this in the following way.

Theorem (DIRICHLET) Given an irrational \(\alpha\), we have \(|\alpha – p/q| < 1/q^2\) for infinitely many rationals \(p/q\).

Proof Think of \(C = \mathbb{R}/\mathbb{Z}\) as a circle with circumference 1. Fix a natural number \(N\) and mark \(n\alpha\) on \(C\) for \(n=0,1,\ldots,N\). By the pigeonhole principle, there are \(0\leq n_1 < n_2\leq N\) such that \(n_1\alpha\) and \(n_2\alpha\) are at most \(1/N\) apart from each other on \(C\). Let \(q = n_2 - n_1 \leq N\). Then there is \(p\) such that \(|q\alpha - p| < 1/N\), and so \(|\alpha - p/q| < 1/(Nq) \leq 1/q^2\). Taking \(N\) larger and larger yields infinitely many approximations \(p/q\) with the desired property.
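The pigeonhole argument is entirely constructive. Here is a Python sketch (the function name `dirichlet_approx` is mine) that finds such a \(p/q\) for \(\alpha=\sqrt{2}\):

```python
import math

def dirichlet_approx(alpha: float, N: int):
    """Drop the fractional parts of 0*a, 1*a, ..., N*a into N intervals of
    length 1/N; a collision gives q <= N with |q*a - p| < 1/N, hence
    |a - p/q| < 1/(N*q) <= 1/q^2."""
    bins = {}
    for n in range(N + 1):
        j = int((n * alpha) % 1.0 * N)  # index of the interval [j/N, (j+1)/N)
        if j in bins:
            q = n - bins[j]
            return round(q * alpha), q
        bins[j] = n  # N+1 marks, N intervals: a collision is guaranteed

p, q = dirichlet_approx(math.sqrt(2), 1000)
assert abs(math.sqrt(2) - p / q) < 1 / q ** 2
```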

By means of continued fractions, one can push the result a bit further by a constant factor.

Theorem (HURWITZ 1891) Given an irrational \(\alpha\), we have \(|\alpha - p/q| < 1/(\sqrt{5}q^2)\) for infinitely many rationals \(p/q\). Moreover, the constant \(\sqrt{5}\) is the best possible for \(\alpha = (1+\sqrt{5})/2\).

These show us how “well” we can approximate reals by rationals. On the other hand, we would also like to say that we cannot approximate “too well”, i.e., given a real number \(\alpha\), we have \(|\alpha - p/q| > 1/q^n\) for all but finitely many rationals \(p/q\), where \(n\) may depend on \(\alpha\). Unfortunately, this is not so because of the existence of Liouville numbers.

Definition A Liouville number is an irrational number \(\alpha\) with the property that, for every natural number \(n\), there exists a rational \(p/q\) with \(q > 1\) such that \(|\alpha - p/q| < q^{-n}\).

For example, \(\alpha = \sum_{k=1}^\infty 2^{-k!}\) is a Liouville number.
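The defining inequality can be verified for this example with exact rational arithmetic. A Python sketch, treating a deep partial sum of the series as a stand-in for \(\alpha\) (the function name `liouville_partial` is mine):

```python
import math
from fractions import Fraction

def liouville_partial(n: int) -> Fraction:
    """p/q = sum of 2^(-k!) for k = 1..n, a rational with q = 2^(n!)."""
    return sum(Fraction(1, 2 ** math.factorial(k)) for k in range(1, n + 1))

# Stand-in for alpha: the tail beyond k = 7 is smaller than 2^(-5040).
alpha = liouville_partial(7)

for n in range(2, 6):
    r = liouville_partial(n)
    q = r.denominator  # exactly 2^(n!), since the last summand has odd numerator
    # the next term 2^(-(n+1)!) is already smaller than q^(-n) = 2^(-n * n!)
    assert abs(alpha - r) < Fraction(1, q ** n)
```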

However, we indeed can say something along those lines for algebraic numbers.

Diophantine Approximation for Algebraic Numbers

Definition An algebraic number is a number that is a root of a non-zero polynomial in one variable with rational coefficients (or equivalently integer coefficients). Given an algebraic number, there is a monic polynomial (with rational coefficients) of least degree that has the number as a root. This polynomial is called its minimal polynomial. If its minimal polynomial has degree \(d\), then the algebraic number is said to be of degree \(d\).

Theorem (LIOUVILLE 1844) Given an algebraic irrational number \(\alpha\) of degree \(d\), there exists \(A>0\) such that for all rationals \(p/q\), \(|\alpha – p/q|>Aq^{-d}\).

An immediate consequence of the theorem is that Liouville numbers are not algebraic.

Proof Let \(f(x)\) be a polynomial with integer coefficients of degree \(d\) that has \(\alpha\) as a root. Note that \(f'(\alpha)\neq 0\); otherwise \(f'(x)\) would be a non-zero polynomial of degree \(<d\) that has \(\alpha\) as a root. As \(f'(x)\) is continuous, there is \(r>0\) such that \(|f'(x)| < 1 / A\) for all \(|x-\alpha|<r\). Without loss of generality, we may assume \(p/q\) is within distance \(r\) of \(\alpha\), by taking \(A < r\). Note also that \(f(p/q)\neq 0\); otherwise \(f(x) / (x-p/q)\) would be a polynomial of degree \(<d\) that has \(\alpha\) as a root. Since \(q^df(p/q)\) is a non-zero integer, \(|f(p/q)| \geq q^{-d}\). On the other hand, by the mean value theorem, there is \(c\) between \(p/q\) and \(\alpha\) such that \(|f(\alpha)-f(p/q)| = |f'(c)||\alpha - p/q|\). Therefore \(|\alpha - p/q| = |f(p/q)|/|f'(c)| > Aq^{-d}\) since \(|c-\alpha| < r\).
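For \(\alpha=\sqrt{2}\) (degree \(d=2\), with \(f(x)=x^2-2\)) the mechanism is transparent: \(q^2|f(p/q)|=|p^2-2q^2|\) is a non-zero integer, hence at least \(1\), which forces \(|\sqrt{2}-p/q|>Aq^{-2}\). A quick numerical check against the best approximations of \(\sqrt{2}\), its continued-fraction convergents, using the ad hoc constant \(A=1/5\):

```python
import math

# Convergents of sqrt(2); each satisfies |p^2 - 2q^2| = 1, the smallest possible.
convergents = [(1, 1), (3, 2), (7, 5), (17, 12), (41, 29), (99, 70), (239, 169)]

for p, q in convergents:
    # q^2 * |f(p/q)| is a non-zero integer, hence >= 1 ...
    assert abs(p * p - 2 * q * q) >= 1
    # ... and since |sqrt(2) + p/q| < 5 here, we get
    # |sqrt(2) - p/q| = |p^2 - 2q^2| / (q^2 * |sqrt(2) + p/q|) > 1/(5 q^2).
    assert abs(math.sqrt(2) - p / q) > 1 / (5 * q * q)
```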

Liouville’s theorem (on Diophantine approximation) can be phrased in an alternative way.

Corollary Given an irrational algebraic number \(\alpha\) of degree \(d\). If the equation \(|\alpha – p/q| < 1/q^n\) has infinitely many rational solutions \(p/q\), then \(n\leq d\).

Subsequent progress was made by Axel Thue (1909), Carl Ludwig Siegel (1921), and Freeman Dyson (1947), and culminated in Klaus Roth’s result (1955):

Theorem (ROTH 1955) Given an irrational algebraic number \(\alpha\). If the equation \(|\alpha-p/q|<1/q^n\) has infinitely many rational solutions \(p/q\), then \(n\leq 2\).

The constant \(2\) cannot be improved due to Dirichlet’s theorem. Another way to interpret Roth’s theorem is as follows.

Corollary Any irrational algebraic number \(\alpha\) has approximation exponent equal to 2, i.e., for any \(\epsilon>0\), the inequality \(|\alpha-p/q|<1/q^{2+\epsilon}\) can only have finitely many solutions \(p/q\).

This amounts to saying that we cannot approximate irrational algebraic numbers “too well”: if we slightly shrink the error bound from \(q^{-2}\) to \(q^{-(2+\epsilon)}\), the number of “good” approximations drops from infinite to finite.

Thue’s Method

Although Roth’s proof has merit in itself, part of his idea dates back to Thue’s work. Indeed, Thue used a method, now known as the polynomial method, in a special way to prove the following theorem.

Theorem (THUE 1909) Given an irrational algebraic number \(\alpha\). If the equation \(|\alpha-p/q|<1/q^n\) has infinitely many rational solutions \(p/q\), then \(n\leq d/2+1\), where \(d\) is the degree of \(\alpha\).

Proof Suppose that the irrational algebraic number \(\alpha\) has two “good” rational approximations \(r_1 = p_1/q_1, r_2 = p_2/q_2\), i.e., \(|\alpha – p_i/q_i| < q_i^{-n}\) for \(i = 1,2\). Since there are infinitely many “good” rational approximations, we get to choose later how large we want \(q_i\) to be.

The rest of the proof can be broken into three parts. Let \(m\in\mathbb{N}\) and \(0 < \epsilon \ll 1\) be two constants to be determined. The constant \(\epsilon\) is fixed before we send it to \(0\).

We are going to use \(C_0, \ldots, C_6\) to indicate positive constants that may depend on \(\epsilon, d\) and \(c\), a constant determined by \(\alpha\) (defined later). We may always assume \(\alpha, d, c, C_0,\ldots, C_6 \ll m\).

Part I Find a non-zero polynomial \(f(x,y) = P(x) + Q(x)y\) with integer coefficients that satisfies the following properties:

  1. it vanishes at \((\alpha, \alpha)\) to order \((m, 1)\), i.e., \(\partial_x^k f(\alpha, \alpha) = 0\) for all \(k<m\);
  2. its degree in \(x\) is \(\leq D := \frac{1}{2}(1+\epsilon)dm\).
  3. its size, i.e., the maximum of all its coefficients’ absolute values, denoted by \(|f|\), is \(\leq C_1^m\) for some constant \(C_1\).

Proof of Part I The proof is a classical “parameter counting” argument. Consider a polynomial \(f(x,y)=P(x)+Q(x)y = \sum_{i<D}a_ix^i + \sum_{i<D}b_ix^iy\) with degree \(\leq D\) in \(x\) and think of its \(2D\) coefficients as undetermined. We can interpret the first property as a homogeneous system of \(m\) linear equations with coefficients in \(\mathbb{Z}[\alpha]\) because $$\frac{1}{k!}\partial_x^k f(\alpha, \alpha) = 0 \text{ if and only if } \sum_{i < D}{i\choose k}\alpha^{i-k}a_i + \sum_{i < D}{i\choose k}\alpha^{i-k+1}b_i = 0.$$

Fix a (non-zero) polynomial \(c_dx^d + \ldots + c_0\) with integer coefficients that has \(\alpha\) as a root, and let \(c\) be the size of this polynomial. Notice that \(c_d\alpha^d = -(c_{d-1}\alpha^{d-1} + \ldots + c_0)\). Each of the linear equations above has the form $$L_0 + L_1\alpha + \ldots + L_t\alpha^t = 0,$$ for some \(t \geq d\), where the \(L_k\) are linear combinations of \(a_i, b_i\) with integer coefficients; we can replace it by $$c_d(L_0 + L_1\alpha + \ldots + L_{t-1}\alpha^{t-1}) - L_t(c_0\alpha^{t-d} + \ldots + c_{d-1}\alpha^{t-1}) = 0,$$ i.e., $$L'_0 + L'_1\alpha + \ldots + L'_{t-1}\alpha^{t-1} = 0$$ with $$L'_k = \begin{cases} c_dL_k & 0\leq k < t-d \\ c_dL_k - c_{k-t+d}L_t & t-d \leq k < t\end{cases}$$ and we can repeat this until its degree in \(\alpha\) becomes smaller than \(d\). Let \(|L_k|_1\) be the sum of the absolute values of all coefficients in \(L_k\). Each replacement increases \(\max_k |L_k|_1\) to at most \(2c \max_k|L_k|_1\), and initially \(\max_k |L_k|_1 < 2^D\). After at most \(D\) replacements, the equation satisfies \(\max_k |L_k|_1 < 2^D(2c)^D = C_0^m\) for some constant \(C_0\).
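The replacement step above is mechanical; here is a Python sketch of it (function names mine), checked on \(\alpha=\sqrt{2}\) with minimal polynomial \(x^2-2\), i.e. \((c_0,c_1,c_2)=(-2,0,1)\). Since \(c_d=1\) here, each step preserves the represented value \(\sum_k L_k\alpha^k\) (in general it multiplies it by \(c_d\)).

```python
def reduce_step(L, c):
    """One replacement: L'_k = c_d * L_k for k < t - d, and
    L'_k = c_d * L_k - c_{k-t+d} * L_t for t - d <= k < t."""
    d, t, Lt = len(c) - 1, len(L) - 1, L[-1]
    new = [c[d] * L[k] for k in range(t)]
    for k in range(t - d, t):
        new[k] -= c[k - t + d] * Lt
    return new

def reduce_full(L, c):
    """Repeat until the degree in alpha drops below d = deg(c)."""
    while len(L) - 1 >= len(c) - 1:
        L = reduce_step(L, c)
    return L

# 1 + 2a + 3a^2 + 4a^3 with a = sqrt(2) reduces to 7 + 10a (a^2 = 2, a^3 = 2a).
assert reduce_full([1, 2, 3, 4], [-2, 0, 1]) == [7, 10]
```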

Finally, each of the linear equations, after the replacements, has the form \(L_0 + L_1\alpha + \ldots + L_{d-1}\alpha^{d-1} = 0\), which is equivalent to \(L_0 = L_1 = \ldots = L_{d-1} = 0\) because \(1, \alpha, \ldots, \alpha^{d-1}\) are linearly independent over \(\mathbb{Q}\). Therefore each linear equation with coefficients in \(\mathbb{Z}[\alpha]\) corresponding to \(\partial_x^k f(\alpha, \alpha) = 0\) is equivalent to \(d\) linear equations whose \(|\cdot|_1\) are all bounded by \(C_0^m\). In total, we have \(dm\) linear equations in \(2D\) unknowns.

One can view this homogeneous system of linear equations as a linear transformation \(L = (L_1, \ldots, L_{dm}): \mathbb{Z}^{2D} \to \mathbb{Z}^{dm}\). Consider all integral points in \([-M, M]^{2D}\) and their image under \(L\). Because \(|L_k|_1 \leq C_0^m\), their image is contained in \([-C_0^mM, C_0^mM]^{dm}\). By the pigeonhole principle, as long as $$(2M+1)^{2D} > (2C_0^mM+1)^{dm},$$ there are two integral points \(v', v''\) in \([-M,M]^{2D}\) such that \(Lv'=Lv''\), hence there is an integral point \(v=v'-v''\in [-2M, 2M]^{2D}\) such that \(Lv=0\) and \(v\neq 0\), hence a non-zero solution to the linear system. To guarantee the above inequality involving \(M\), it is enough to have \(2M > C_0^{m/\epsilon}\), and so \(2M = C_1^m\), for some constant \(C_1\), works.
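This pigeonhole step can be run verbatim on a toy system. A Python sketch (the function name `small_kernel_vector` is mine) that finds a small non-zero integer solution by recording images until two integral points collide:

```python
from itertools import product

def small_kernel_vector(rows, M):
    """Scan the integral points of [-M, M]^n; once two share an image under
    the linear map, their difference is a non-zero solution of Lv = 0
    lying in [-2M, 2M]^n."""
    n = len(rows[0])
    seen = {}
    for v in product(range(-M, M + 1), repeat=n):
        img = tuple(sum(r[i] * v[i] for i in range(n)) for r in rows)
        if img in seen:
            w = seen[img]
            return tuple(v[i] - w[i] for i in range(n))
        seen[img] = v
    return None  # cannot happen once (2M+1)^n exceeds the size of the image box

# One equation in four unknowns: 3a + 5b - 2c + 7d = 0.
rows = [[3, 5, -2, 7]]
v = small_kernel_vector(rows, 2)
assert v is not None and any(v)                       # non-zero solution
assert 3 * v[0] + 5 * v[1] - 2 * v[2] + 7 * v[3] == 0  # in the kernel
assert all(abs(x) <= 4 for x in v)                    # and small: within [-2M, 2M]
```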

Part II Find \(l < L := \epsilon m\) such that \(g(x,y) = \frac{1}{l!}\partial_x^l f(x,y)\) satisfies the following properties:

  1. it vanishes at \((\alpha, \alpha)\) to order \((m-L, 1)\);
  2. its degree in \(x\) is \(\leq D\) (\(=\frac{1}{2}(1+\epsilon)dm\));
  3. its size is \(|g| \leq {D\choose l}C_1^m < C_2^m\) for some constant \(C_2\);
  4. and \(g(r_1, r_2)\neq 0\).

Proof of Part II The first three properties follow immediately from the properties of \(f\). Hence, we only have to show there exists \(l < L\) such that the last property holds. For the sake of contradiction, assume that \(\partial_x^l f(r_1, r_2) = 0\) for all \(l < L\). Define $$D(x) := \begin{vmatrix}P(x) & Q(x) \\ P'(x) & Q'(x)\end{vmatrix} = P(x)Q'(x) - P'(x)Q(x).$$ Because \(f(x,y)\) vanishes at \((r_1, r_2)\) to order \((L,1)\), by the Leibniz rule, \(D(x)\) vanishes at \(r_1\) to order \(L-1\). Note that \(D(x)\) is in \(\mathbb{Z}[x]\). By Gauss's lemma, \((q_1x-p_1)^{L-1}\) divides \(D(x)\). Therefore either (1) \(|D(x)| \geq q_1^{L-1}\); or (2) \(D(x) = 0\) as a polynomial.

If it is the first case, because \(|D(x)| = |P(x)Q'(x) - P'(x)Q(x)| < 2D|f|^2\), we have \(2D|f|^2 > q_1^{L-1}\). Because \(|f| < C_1^m\), we get a contradiction as long as $$q_1 > C_3,$$ for some constant \(C_3\).

If it is the second case, \(P(x)\) and \(Q(x)\) are linearly dependent over \(\mathbb{Q}\). Therefore either (1) \(P(x) = -r_2'Q(x)\) and \(Q(x)\) vanishes at \(r_1\) to order \(< L\) or (2) \(Q(x)\) vanishes at \(r_1\) to order \(L\). If it is the first case, because \(f(x,y) = (y-r_2')Q(x)\) vanishes at \((r_1, r_2)\) to order \((L, 1)\) and \(Q(x)\) vanishes at \(r_1\) to order \(<L\), we know that \(r_2' = r_2\), hence by Gauss's lemma, \(f(x,y)\) is a multiple of \(q_2x - p_2\), and so \(|f| \geq q_2\). If it is the second case, because \(Q\) vanishes at \(r_1\) to order \(L\), by Gauss's lemma, \(Q(x)\) is a multiple of \((q_1x - p_1)^L\), and so \(|f| \geq |Q| \geq q_1^L\). To get a contradiction in both cases, we need $$q_1 > C_4\text{ and }q_2 > C_1^m.$$

Therefore, it is enough to have $$q_1 > C_5\text{ and }q_2 > q_1^m$$ for some constant \(C_5\).

Part III We claim that \(|g(r_1, r_2)| < C_6^m\left(q_1^{(-m+L)n}+q_2^{-n}\right)\) for some constant \(C_6\).

Proof of Part III We know that $$|g(r_1, r_2)-g(r_1, \alpha)| = |Q(r_1)||\alpha - r_2| \leq D|g|(|\alpha|+1)^D|\alpha-r_2|.$$ Because \(g\) vanishes at \((\alpha, \alpha)\) to order \((m-L, 1)\), using the Taylor expansion, we have $$|g(r_1, \alpha) - g(\alpha, \alpha)| \leq \sup_{|x| < |\alpha|+1}\left|\frac{1}{(m-L)!}\partial_x^{m-L}g(x,\alpha)\right| |\alpha-r_1|^{m-L} < D2^D|g|(|\alpha|+1)^D|\alpha-r_1|^{m-L}.$$ As \(r_1, r_2\) are two “good” approximations of \(\alpha\), we obtain $$\begin{eqnarray}|g(r_1, r_2)| & \leq & D|g|(|\alpha|+1)^D|\alpha-r_2| + D2^D|g|(|\alpha|+1)^D|\alpha-r_1|^{m-L}\\ & < & C_6^m\left(q_2^{-n} + q_1^{(-m+L)n}\right),\end{eqnarray}$$ for some constant \(C_6\).

From Part II, we know that \(g(r_1, r_2)\neq 0\), and so \(|g(r_1, r_2)| \geq q_1^{-D}q_2^{-1}\). From Part III, we know that \(|g(r_1, r_2)| < C_6^m(q_1^{(-m+L)n}+q_2^{-n})\). Therefore, we obtain $$q_1^{-D}q_2^{-1} < C_6^m\left(q_1^{(-m+L)n}+q_2^{-n}\right).$$ Set \(m\) to be the biggest integer less than \(\log{q_2} / \log{q_1}\), so that \(q_1^m < q_2 \leq q_1^{m+1}\). To ensure that Part II works, we only need to take \(q_1 > C_5\). In addition, we require \(q_1^\epsilon > C_6\), and so \(C_6^m < q_1^{\epsilon m} < q_2^\epsilon\). The choice of \(m\) implies that $$q_2^{-\frac{1}{2}(1+\epsilon)d - 1} = q_2^{-D/m}q_2^{-1} < q_2^\epsilon\left(q_2^{-\frac{m-L}{m+1}n}+q_2^{-n}\right) < 2q_2^{-(1-2\epsilon)n+\epsilon}.$$ In order to have this inequality hold for arbitrarily large \(q_2\), we need \(\frac{1}{2}(1+\epsilon)d + 1 > (1-2\epsilon)n-\epsilon\). Let \(\epsilon\) go to \(0\) and obtain \(n\leq \frac{1}{2}d + 1\). Q.E.D.


The proof largely relies on Larry Guth's notes on the polynomial method.




Besides the two-day TSTST held in the third week of MOP, there are also four MOP tests and two mock IMOs during the first two weeks. None of these six exams counts toward next year's team selection. An amusing feature of them is that the graders (mostly past IMO participants) make subjective judgments of style and award style points: if your solution is one long brute-force computation, then sorry, your style score may be low. To encourage first-time participants, especially those in the red and green groups, their four MOP tests come with two extra easy problems on top of the usual three.

To foster cooperation among the students, the first two weeks also feature team contests in teams of three. The red and green groups use a slightly easier problem set, while the blue and black groups use a slightly harder one. Scoring is done by explaining your solutions to a grader in person. In other words, in the process the students must not only solve the problems but also teach the teammates who have not solved them. This nicely echoes the teamwork in industry described in the Jane Street talk.

Beyond the tests mentioned above, the program also features a competition called ELMO that is hugely popular among the students. What ELMO actually stands for is anyone's guess, and every year the name receives a new interpretation; this year ELMO stands for Ego Loss (surely) Must Occur. The contest is run by the veterans (students attending MOP for at least the second time) and imitates the format of the IMO: the veterans propose problems and compile the ELMO Shortlist, the rookies (first-time MOP participants) form teams, veterans are assigned to each team as leaders, and after the rookies take the contest, the veterans grade the papers and coordinate the scores among themselves.

Although this is merely a student-run competition, every year they duly post the problems and the shortlist to the Art of Problem Solving website mentioned in the previous post; an outsider who did not know better would surely take it, at first glance, for an official contest run by some organization. When taking photos, instead of saying "Cheese", the students shout "ELMO Meeting!", which tells you how popular the contest is.

Besides the ELMO, another incident completely overturned my stereotype of American high-school math education. One day, the coaches and graders came back to the dorm lobby after a meeting and saw the following scene.

Plank Countdown

To explain this contest, let me first describe MathCounts, an American competition designed for middle schoolers, comparable to China's Hua Luogeng Mathematics Cup. Its final round is the Countdown Round, a buzzer round and the only oral part of the competition, used to decide each year's champion. What the photo shows is the high schoolers' version of MathCounts, which everyone calls Plank Countdown, because the contestants must hold a plank the whole time they are buzzing in. To my surprise, the kids' mental arithmetic was extremely fast: in the final rounds, the two contestants could usually give the answer by the time the problem had barely finished being read aloud.

While the students keep innovating, the coaching staff also does its best to optimize the experience of the whole program. For example, on the Friday of the second week, Po-Shen scheduled a session called Philosophy, in which the students split into four groups to discuss "philosophy-of-life" questions with different coaches and graders. While grading the TSTST in the final week, the coaches and graders also discussed whether to return the papers to the students. To help the students grow, they decided on maximum transparency: the scores are announced first, any disputes can be discussed with the graders, and the papers are returned once the scores have settled, which is quite different from the practice of recent years. In addition, there was a town-hall-style session called Beyond MOP, with everyone gathered in the blue hall of Neihardt for open questions and answers.

At the student talent show that ended last night, the program director Steve Dunbar mentioned that besides D. E. Shaw & Co and Jane Street, the sponsors of the program also include Two Sigma, Dropbox, Citadel, Art of Problem Solving, and, above all, the Akamai Foundation. These sponsoring companies have one thing in common: their founders all have a background in math competitions, many of them once attended MOP, and they have not forgotten to give back after their success. After the talent show, two graders sat in the hallway telling the students stories from their own MOP days. Brian told a rather amusing one: back then, the year I was a high-school senior, Zuming Feng had the US team train together with the Chinese team, and one American student's way of making friends was to pick people up and slam them onto the ground back first; someone joked that this is why the Chinese team did not win first place that year. I believe these amusing anecdotes will be passed down from generation to generation.

Owing to personnel changes at the American Mathematics Competitions, the whole office is moving from Nebraska to Washington, so this year is MOP's last in Nebraska. And as Po-Shen has just been appointed leader of the US team, next year MOP will be held at Carnegie Mellon University in Pittsburgh.





Last semester, every now and then I would go for a short run in the late afternoon, before the sun went down, on the hill near Carnegie Mellon University. Coming back from one of these runs and entering Doherty Hall, I ran into Po-Shen Loh. I had already heard that Po-Shen was the deputy leader of the US International Mathematical Olympiad (hereafter IMO) team. Pushing his bicycle, he suddenly asked me whether I would like to help coach the US IMO training camp this summer. I replied that I would love to, but that having not touched competition math for many years, my skills were surely not up to it any more, except (!) that I could still guarantee quality in plane geometry. And that is how I ended up in Lincoln, the capital of Nebraska.

At the mention of Nebraska, most Americans react with "in the middle of nowhere"; fans of The Big Bang Theory will surely say it is Penny's home state. But Lincoln is the state capital after all, and the camp's housing and classes are at the University of Nebraska-Lincoln. American universities have always been lavish with their campus buildings, and the camp stays in the oldest residence hall on campus; the conditions are, it is fair to say, very comfortable.

The official name of this summer program is the Maths Olympiad Summer Program, abbreviated MOSP, but everyone likes to call it MOP, which is literally a mop, and its participants are naturally called Moppers, literally people who mop the floor. Based on their overall performance in the past year's competitions, the Moppers are divided into a black group (ten students), a blue group (fifteen), a green group (fifteen), and a red group (eleven). The ten members of the black group include the team members who will compete at this year's IMO in Cape Town, Africa.

MOP officially started on June 9; before that there was Pre-MOP, an extra session just for the black group. To prepare my lectures, my mother photographed many of my old notes and math journals for me back in Shanghai. I selected about a hundred problems, spent quite some time redoing every one of them, and wrote up solutions in English.

Even with all that preparation, I was still stunned by the black-group students on the first day. Here is what happened: for various reasons I could not photocopy handouts, so I had to copy the problems onto the blackboard. I wrote five or six problems on the board, thinking, suffer, kids! But a while before the class ended, a student named James Tao asked me whether I had more problems, and I had to copy two or three more onto the board to keep the situation under control.


Pre-MOP went by quickly. On June 9 the blue, green and red groups arrived in force; judging by their faces, most were of Chinese descent. I do not know whether this is an ethnic advantage or whether Chinese parents simply care more about mathematics education. So I asked James Tao how he got started in olympiad mathematics. He said that, on one hand, his father tutored him, translating a Chinese geometry book aloud for him when he was little; on the other hand, he discovered the website Art of Problem Solving and a book called Problems from the Book. My personal take is that Chinese parents are more deliberate about cultivating and discovering their children's mathematical ability.

On the evening of June 10 there was a short meeting for all the Moppers. First, the program director, Steve Dunbar of the American Mathematics Competitions, went over some disciplinary matters; then Po-Shen gave a speech I found well worth pondering. He said that, by the pigeonhole principle, not everyone can land a faculty position in academia. The kids who do olympiad math will therefore spread into all kinds of industries, and most will become leaders in their fields, so this program is a great opportunity to make friends.

The program is in fact not only a great opportunity to make friends but also a great opportunity to broaden the students' horizons. It is supported by D. E. Shaw & Co., which allows the whole program to run longer. During the first week of MOP, besides a talk by D. E. Shaw & Co., people from Jane Street came over the weekend to introduce their company. I remember the Jane Street speaker opening by saying that he hoped these olympiad kids would leave their comfort zone and take a shot at industry.

Besides this non-academic side, MOP also has a distinctly academic one. Every other evening at 8:30 there is a seminar, given by instructors like me. Po-Shen asked that the talks preferably relate to current research while remaining easy to follow. Most of the instructors, me included, are not famous teachers but people with past competition experience in different countries, some now working in industry and some in academia.

What upended my expectations even more is that MOP is not at all a program for selecting who goes to the IMO! In the entire program, only one exam counts: the TSTST, short for Team Selection Test Selection Test, that is, the test that selects who takes the team selection test. In other words, the students who do well on this exam (thirteen of them last year) qualify for next year's team selection test.

So the only parts of MOP that directly concern competitions are: first, training the US team members competing at this year's IMO; second, using the TSTST to select suitable candidates for next year's TST, whose results, combined with their performance on next year's USAMO, determine the US national team.

In my view, this strategy suits the situation in the United States very well. Imagine: in a country without an olympiad-math industry, where people are not that into olympiad mathematics to begin with, selecting students by eliminating them through a string of consecutive competitions would likely depress participation even further. The students who do well on the TSTST not only gain confidence but also have a long time to prepare for next year's competitions and keep improving, and the whole process is less stressful.



How I Wrote My First Independent Paper

This is what happened behind the scenes.

In the first draft, I was very pedantic. The resulting draft was notation-heavy and had no introductory section. It was eventually discarded entirely, although the structure of the proof was reused in later drafts. My advisor suggested a useful dictionary by Trzeciak for mathematicians who, like me, write in English as a second language.

The second draft was finished after 3 months. This time, I started with the history of the problem. The entire process of tracking down the history went like this.

How citation works.

The picture is exaggerated. In reality, it was the first time I had to handle more than ten references at once, and I had to figure out how, and in what order, to introduce them in my article. The introduction took 1/3 of the time. For another 1/3 of the time, I struggled with turning the notation-heavy proof of my first draft into English words and sentences friendlier to readers. For the remaining 1/3 of the time, I drew pictures and figures. I had never used TikZ (a standard LaTeX graphics package) before. After I learnt how to draw points and lines, fill shapes with color and place labels, I started an ambitious project to draw a 3D triangulation of a sphere. The result was a very slow Ruby script that outputs TikZ code, which was later compiled to generate the pictures.

Figures in the 3rd draft with enhanced shading.

The hope was to have top-notch graphics in my paper to compensate for the weak mathematical result I got.

My advisor was pretty happy with the second draft and provided several suggestions. Takeaways:

  • In LaTeX, : is typeset as a binary operator; the correct symbol for the colon in \(f\colon A\to B\) is \colon. Use \smallskip, \medskip or \bigskip to manually adjust vertical spacing.
  • Citation information on MathSciNet is generally more accurate than that on Google Scholar.
  • Starting a paper with the introduction section might be the better approach. By reading others’ writing, one can familiarize oneself with the language and writing style of the field.

My advisor also drew the following picture on my second draft.

Heuristic readership of different parts of an academic paper.

The third draft was a revision of the second with two new figures. One of the major pictures was enhanced with the Phong shading algorithm. A more refined fourth draft followed, with more than 70 modifications.

The outcome of this creative process can be found on arXiv:1405.2503.


Fun with Hex (Cont.)

In the previous post Fun with Hex, I asserted without proof that the Hex Game can never end in a draw. In Jiri Matousek’s An Invitation to Discrete Mathematics, I found an elegant proof of the assertion, which requires only a little elementary graph theory. The idea behind the proof can also be used to prove Sperner’s Lemma. Moreover, later in this post I will show how to push the idea further and discover something new. The only bit of elementary graph theory we need is the Handshaking Lemma.

Handshaking Lemma: Every finite undirected graph has an even number of vertices with odd degree.

To prove the lemma, simply observe that the sum of the degrees of all vertices equals twice the number of edges, hence is even.
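As a quick sanity check, here is a minimal Python sketch (the small graph below is an arbitrary example of mine, not one from the post) verifying both the degree-sum identity and the even count of odd-degree vertices:

```python
from collections import defaultdict

# An arbitrary small undirected graph, given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 0)]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# The sum of all degrees counts every edge twice, hence is even...
assert sum(degree.values()) == 2 * len(edges)

# ...so the number of odd-degree vertices must be even.
odd = [v for v in degree if degree[v] % 2 == 1]
print(len(odd) % 2 == 0)
```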

Now let us review the Hex Game and recall its two versions.

Voronoi Duality

The original version of the game is played by coloring the hexagons of the game board (illustrated in black); each player tries to link the opposite sides of the board with a connected chain of hexagons of a single color.

Alternatively, one can play with the Voronoi dual of the hexagon game board (illustrated in red). This time, two players color the vertices and compete to link the opposite sides with a path of vertices of their own color. For instance, player one would try to form a red path from the northwest side to the southeast side, while player two would strive to build a blue one from the northeast side to the southwest side.

We can tilt and rescale the dual graph without changing the game as follows.

Dual graph which can be easily drawn

In this new graph, the first player wants to build a red path from the west to the east and the second player wants to build a blue path from the north to the south. Furthermore, for the purpose of the proof, we add four auxiliary vertices, one on each side, colored according to the player the side belongs to.

Dual graph with four auxiliary vertices, one on each side

As shown in the graph above, the vertices on the west and the east are red and the ones on the north and the south are blue. Our goal is to prove that no matter how you color the remaining vertices with red and blue, there is always a monochromatic path connecting two of the four auxiliary vertices. For the coloring below, there is a red path from W to E.

A red path from W to E

Proof: Assume, for the sake of contradiction, that some coloring does not form any red path from W to E or blue path from N to S. First, for every red vertex unreachable by any red path starting from W, we change it to green. Similarly, for every blue vertex unreachable by any blue path starting from N, we change it to green, too.

Under the assumption that there is no red path from W to E and no blue path from N to S, the vertices E and S are both changed to green. Now we have a triangulated planar graph whose four vertices on the external face, W, N, E, S, are colored red, blue, green, green respectively.

Lemma: Given a 3-coloring of a triangulated planar graph whose four vertices on the external face, say W, N, E, S, are colored red, blue, green, green respectively, there is an internal triangular face whose vertices all have different colors.

By this lemma, we can conclude that there exists an internal triangular face, say ABC, such that A is red, B blue and C green. This implies that there is a red path from W to A and a blue path from N to B. If the vertex C was originally red, then it would still be red because there is an edge between A and C. For the same reason, if it was originally blue, it would still be blue. This gives a contradiction! To finish the proof for the Hex Game, we only need to prove the lemma.

Proof of Lemma: For each face of the triangulated planar graph, including the external face, we put a vertex in it. Two vertices are adjacent if and only if their corresponding faces share a common edge whose endpoints are colored red and blue.

3-coloring of a triangulated planar graph and its corresponding ‘face graph’.

Note that the degree of the vertex placed in the external face is 1. By the handshaking lemma, there is another vertex of odd degree, say \(v\), placed in some internal face, say \(F\). It is easy to see that the only way to color the vertices of \(F\) so that \(v\) has odd degree is to use 3 different colors. Q.E.D.
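The no-draw theorem can also be checked by brute force on small boards. Below is a hedged Python sketch (the function names and the integer encoding are mine) that enumerates every 2-coloring of an \(n\times n\) Hex board, using the standard hexagonal adjacency on a rhombus, and confirms that some player always has a crossing path:

```python
from itertools import product
from collections import deque

# Hexagonal adjacency on an n-by-n rhombus board.
def neighbors(cell, n):
    i, j = cell
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield (i + di, j + dj)

def wins(color, player, n):
    # Player 0 links column i=0 to i=n-1; player 1 links row j=0 to j=n-1.
    axis = 0 if player == 0 else 1
    frontier = deque(c for c in color if color[c] == player and c[axis] == 0)
    seen = set(frontier)
    while frontier:  # breadth-first search through same-colored cells
        c = frontier.popleft()
        if c[axis] == n - 1:
            return True
        for d in neighbors(c, n):
            if color[d] == player and d not in seen:
                seen.add(d)
                frontier.append(d)
    return False

def hex_never_draws(n):
    cells = [(i, j) for i in range(n) for j in range(n)]
    return all(
        wins(dict(zip(cells, bits)), 0, n) or wins(dict(zip(cells, bits)), 1, n)
        for bits in product((0, 1), repeat=n * n)
    )

print(hex_never_draws(3))  # checks all 512 colorings of the 3-by-3 board
```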

Now I am going to show something new. The idea is to consider another game board, which looks like this:

Symmetric Hex Game of size 3

The idea is to get a symmetric board. To build such a game board, one can start with a hexagon in the center and then add several layers of hexagons around it. In the picture above, the hexagon labeled 1 is the one we start with; we call it the 1st layer. Then we add 6 hexagons labeled 2 around it to form the 2nd layer, and another 12 hexagons labeled 3 around the 2nd layer. In general, to build a symmetric Hex board of size \(n\), one repeats the process \(n\) times.

For this symmetric Hex game board, we have three pairs of opposite sides. One reasonable goal, analogous to the goal of the Hex Game, is to build a monochromatic path linking two opposite sides. However, in this case the two players can cooperate to tie the game.

Two players can cooperate to avoid connecting any pair of opposite sides

Therefore, for the symmetric Hex, the goal is to build a monochromatic connected component that touches two opposite sides or three nonadjacent sides. For simplicity, we call such a monochromatic connected component a winning component. In other words, if we label the six sides 1 through 6, both players want to use their own color to connect sides 1 and 4, sides 2 and 5, sides 3 and 6, sides 1, 3 and 5, or sides 2, 4 and 6. Again, we will only consider the dual variant, in which each player colors the vertices of the following graph.

Dual variant of symmetric Hex Game

Proposition: The Symmetric Hex Game can never end in a tie.

Proof: Assume, for the sake of contradiction, that there is a 2-coloring of the symmetric Hex that contains no winning component. Consider the graph with 4 extra vertices W, E, N, S which connect to the vertices of the original graph as follows.

Graph with 4 extra vertices

To be specific, W connects to all the vertices on side 5, E to those on side 2, N to those on sides 1 and 6, and S to those on sides 3 and 4.

By the same argument we used in the proof of the Hex Game, there is either a red path from W to E or a blue path from N to S. In other words, we should have one of the following schematic pictures.

Five possibilities

By our assumption, the situation must look like one of the pictures in the second row. Without loss of generality, assume we are in the first picture of the second row and that the blue path connects vertices \(a\) and \(b\).

Now we change the way the extra 4 vertices connect to the vertices on the boundary as follows.

4 extra vertices connect the vertices on the sides in a different way

Again, there is either a blue path from W to E or a red path from N to S. If there is a blue path from W to E, then this path combined with the blue path from \(a\) to \(b\) gives us a connected component that touches two opposite sides or 3 nonadjacent sides. Therefore it has to be the case that a red path connects N to S.

A red path on the blue path's right side.

To summarize, whenever we have a blue path \(P_1\) connecting sides 4 and 6, there is a red path \(P_2\) connecting the same sides on its right side. By symmetry, there is a blue path \(P_3\) connecting sides 4 and 6 on \(P_2\)’s right side. We can keep going and get an infinite sequence of paths connecting sides 4 and 6 with alternating colors, which is impossible! Q.E.D.

Corollary: For any 2-coloring of the symmetric Hex board of size \(n\), there is a monochromatic connected component of size at least \(2n+1\).

Proof: By the proposition above, there is a monochromatic connected component that touches 2 opposite sides or 3 nonadjacent sides. In either case, the size of the component is at least \(2n+1\). Q.E.D.


Communicating Information through Randomness

This week Staren sent me a puzzle on WeChat. After we discovered a solution to the puzzle, I tried to trace its source. I found that it appeared in the blog post Yet another prisoner puzzle by Oliver Nash. The author seemed to be working at SIG, a quantitative trading firm, at the time. Given that Staren is working at JP Morgan, I guess this is one of the many brain-teasers that were circulating among traders around 2010. Anyway, here is the version of the puzzle used in that blog post.

A room contains a normal 8 by 8 chessboard together with 64 identical coins, each with one “heads” side and one “tails” side. Two prisoners are at the mercy of a typically eccentric jailer who has decided to play a game with them for their freedom. The rules of the game are as follows.

The jailer will take one of the prisoners (let us call him the “first” prisoner) with him into the aforementioned room, leaving the second prisoner outside. Inside the room the jailer will place exactly one coin on each square of the chess board, choosing to show heads or tails as he sees fit (e.g. randomly). Having done this he will then choose one square of the chess board and declare to the first prisoner that this is the “magic” square. The first prisoner must then turn over exactly one of the coins and exit the room. After the first prisoner has left the room, the second prisoner is admitted. The jailer will ask him to identify the magic square. If he is able to do this, both prisoners will be granted their freedom.

These rules are explained to both prisoners before the game begins and they are allowed some time together to discuss their strategy. Of course the question is: what strategy should the prisoners adopt?

There is an extremely simple and clever solution that uses the XOR operation. However, we will take a detour into the realm of linear algebra and information theory and come back to the elegant one-line solution at the end of the post.

In this puzzle, it doesn’t matter that the coins sit on an 8 by 8 chessboard. All that matters is that we have \(n=64\) places to put the coins in. It also doesn’t matter that the jailer declares a magic square. All that matters is that he picks a magic number from \(0, 1, \ldots, n-1\), corresponding to the position of the magic square, and tells it to the first prisoner.

Henceforth, we can consider the following puzzle.

The jailer chooses an \(n\)-bit binary and picks a magic number from \(1, 2, \ldots, n\). The first prisoner flips at most one bit of the binary (not necessarily the magic bit) and shows the second prisoner the altered binary. What strategy should the prisoners adopt so that the second prisoner can tell the magic number?

I have changed the puzzle a little bit. In the original puzzle, the first prisoner must flip a bit. Now we allow him to do nothing if he wants to. We shall solve this weaker version of the puzzle first and come back to the original one.

Think of the set of all \(n\)-bit binaries as a metric space under the Hamming distance. I claim that such a strategy exists if and only if one can partition the \(n\)-bit binaries into \(n\) groups \(A_1, \ldots, A_n\) such that for every \(k\), the 1-neighborhood of \(A_k\) covers all the \(n\)-bit binaries, i.e., any \(n\)-bit binary is at most 1 bit away from some binary in \(A_k\).

The reason is as follows. Suppose we are given such a partition \(A_1, \ldots, A_n\), and the jailer chooses the initial binary \(x\) and the magic number \(k\). Because there is some binary \(y\) in \(A_k\) such that \(x\) and \(y\) differ in at most 1 bit, the first prisoner can turn the initial binary into \(y\) by flipping at most 1 bit of \(x\). The second prisoner then takes this binary \(y\), looks it up in the partition \(A_1, \ldots, A_n\) and finds the index of the group containing \(y\). Conversely, one can argue that such a partition is necessary for a strategy to exist.

For a reason we will see later, we call a subset \(A\) of the \(n\)-bit binaries a covering code provided its 1-neighborhood covers all \(n\)-bit binaries. In terms of covering codes, our goal is to find \(n\) disjoint covering codes. (By the way, in information theory, a subset of binaries is called a code.)

Notice that if we can find \(n+1\) disjoint covering codes, we can find a strategy that works even if we allow the jailer to pick the magic number from \(1, 2, \ldots, n+1\). However, we cannot hope for more than \(n+1\) disjoint covering codes. Here is why. Suppose we have \(N\) disjoint covering codes. By the pigeonhole principle, there is a covering code, say \(A\), of size no more than \(2^n/N\). Note that the 1-neighborhood of a single binary contains exactly \(n+1\) binaries. Hence the 1-neighborhood of \(A\) contains at most \(2^n\times\frac{n+1}{N}\) binaries. In order for the 1-neighborhood of \(A\) to cover all the binaries, we must have \(N\leq n+1\). Moreover, if \(N=n+1\), all the codes must have the same size \(2^n/(n+1)\).

Let us start with small \(n\).

  • For \(n=1\), we can find 2 disjoint covering codes \(A_1=\{0\}, A_2=\{1\}\).
  • For \(n=2\), we can find 2 disjoint covering codes \(A_1=\{00, 11\}, A_2=\{01, 10\}\) or \(A_1=\{00, 01\}, A_2=\{10, 11\}\).
  • For \(n=3\), we can find 4 disjoint covering codes \(A_1=\{000,111\}, A_2=\{001,110\}, A_3=\{010,101\}, A_4=\{100, 011\}\).
  • For \(n=4\), we can get 4 disjoint covering codes by adding a wild-card bit to the covering codes for \(n=3\). These covering codes are \(A_1=\{*000,*111\}, A_2=\{*001,*110\}, A_3=\{*010,*101\}, A_4=\{*100, *011\}\), where \(*\) is a wild card that can be either 0 or 1.
  • For \(n=5\), it is impossible to find 5 disjoint covering codes: by the pigeonhole principle, one of the 5 disjoint covering codes would contain no more than \(\lfloor 2^5/5\rfloor = 6\) binaries, and the 1-neighborhood of any 6-element subset always leaves at least 2 binaries uncovered. This can be checked by a computer program.
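The \(n=3\) construction above is easy to verify mechanically. Here is a small Python sketch (the integer-bitmask encoding of binaries is my choice) doing so:

```python
# Each 3-bit binary is encoded as an integer 0..7.
def one_neighborhood(x, n):
    # x together with all binaries at Hamming distance 1 from x
    return {x} | {x ^ (1 << i) for i in range(n)}

n = 3
A = [{0b000, 0b111}, {0b001, 0b110}, {0b010, 0b101}, {0b100, 0b011}]
universe = set(range(2 ** n))

# Each A_k is a covering code: its 1-neighborhood is everything.
for code in A:
    assert set().union(*(one_neighborhood(x, n) for x in code)) == universe

# The four codes are pairwise disjoint; in fact they partition the universe.
assert sum(len(code) for code in A) == len(set().union(*A)) == 2 ** n
print("A_1, ..., A_4 are disjoint covering codes for n = 3")
```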

It is interesting that for some \(n\) we can find \(n\) or even \(n+1\) disjoint covering codes, while for other \(n\), for instance 5, there are no \(n\) disjoint covering codes at all. The situation is quite complicated.

Therefore we had better focus on the extremal cases in which we can find \(n+1\) disjoint covering codes. If \(n+1\) disjoint covering codes exist, all the covering codes must be of the same size \(2^n/(n+1)\), and for each covering code \(A\), the 1-neighborhoods of two distinct points in \(A\) do not overlap. This implies \(n+1=2^m\), i.e., \(n=2^m-1\), for some \(m\). Since we are interested in the extremal case, we henceforth assume \(n=2^m-1\).

Finding even one covering code is hard, let alone \(n+1\) of them. However, in 1950 Richard Hamming invented the first error-correcting code, the Hamming code \(H\), which has the following properties:

  1. it is a linear code, i.e., \(H\) is an \((n-m)\)-dimensional linear subspace of \(\mathbb{Z}_2^n\), the linear space of all \(n\)-bit binaries over the finite field \(\mathbb{Z}_2\);
  2. the minimum Hamming distance between two distinct elements in \(H\) is exactly three.

For an introduction to Hamming code, I recommend Thirty-three Miniatures by Jiri Matousek. The 5th miniature, Error-Correcting Codes, includes detailed construction of the Hamming code.

Because of the 1st property, \(H\) has \(2^{n-m}\) elements. The 2nd property says that the 1-neighborhoods of any two distinct elements of \(H\) are disjoint, since \(1+1<3\). Hence the 1-neighborhood of \(H\) contains \(2^{n-m}(n+1)=2^{n-m}2^m=2^n\) elements. In other words, the Hamming code is a linear covering code!

Good! We have found a linear covering code \(H\). By translating the Hamming code \(H\) by \(\mathbf{e_1}=100\ldots 00, \mathbf{e_2}=010\ldots 00, \ldots, \mathbf{e_n}=000\ldots 01\) respectively, we get \(n+1\) covering codes in total (including \(H\) itself), since a translate of a covering code is still a covering code. Moreover, they are pairwise disjoint because of the 2nd property of the Hamming code.

Great! We have found \(n+1=2^m\) disjoint covering codes for \(n\)-bit binaries. We can take this construction and use the wild-card trick to get \(n+1\) disjoint covering codes for \((n+1)\)-bit binaries. In fact, given any \(n+1\) disjoint covering codes for \(n\)-bit binaries, say \(A_0, \ldots, A_n\), define \(A_k^* = \{0,1\}\times A_k \). For each \(k\), \(A_k^*\) is a covering code for \((n+1)\)-bit binaries, and, what’s more, they are disjoint.

Hence the weaker version of the puzzle in which the first prisoner can opt to not flip the coin is totally settled since \(8\times 8=2^6\). The same strategy works for the stronger version of the puzzle, because if unfortunately the initial binary is an element of \(A_k^* = \{0,1\}\times A_k \) and the magic number is also \(k\), the first prisoner can flip the leftmost bit so that the outcome stays in \(A_k^*\).

Practically, in order to carry out such a strategy, the first prisoner needs to memorize the Hamming code. However, there is an easy way to reconstruct it, as it can be characterized as follows.

The Hamming code is the solution set of the linear system \(R\mathbf{x}=\mathbf{0}\), where the matrix \(R=[\mathbf{c_1}, \mathbf{c_2}, \ldots, \mathbf{c_n}]\) is an \(m \times n\) matrix whose \(k\)th column \(\mathbf{c_k}\) is the binary expansion of \(k\) for \(k=1,2,\ldots, n\).

For instance, when \(m=3, n=2^m-1=7\), the matrix $$R=\begin{bmatrix}0 & 0 & 0 & 1 & 1 & 1 & 1\\ 0  & 1 & 1 & 0 & 0 & 1 & 1\\ 1 & 0 & 1 & 0 & 1 & 0 & 1\end{bmatrix}.$$
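Taking this characterization literally, one can build \(H\) for \(m=3\) in a few lines of Python and check both defining properties, as well as the disjoint-covering-code construction above. This is a sketch under my own encoding conventions (a binary \(x_1x_2\ldots x_7\) is stored as an integer whose bit \(k-1\) is \(x_k\)):

```python
from functools import reduce

m = 3
n = 2 ** m - 1  # n = 7

# x is in the Hamming code H iff Rx = 0, i.e. iff XOR-ing the (1-based)
# positions of its 1-bits gives 0.
def syndrome(x):
    return reduce(lambda a, k: a ^ k,
                  (k for k in range(1, n + 1) if x >> (k - 1) & 1), 0)

H = [x for x in range(2 ** n) if syndrome(x) == 0]

# Property 1: H is an (n - m)-dimensional subspace, so it has 2^(n-m) elements.
assert len(H) == 2 ** (n - m)

# Property 2: the minimum Hamming distance between distinct codewords is 3.
dist = lambda x, y: bin(x ^ y).count("1")
assert min(dist(x, y) for x in H for y in H if x != y) == 3

# H is a covering code: the 1-neighborhoods of its codewords cover everything.
nbhd = lambda x: {x} | {x ^ (1 << i) for i in range(n)}
assert set().union(*(nbhd(x) for x in H)) == set(range(2 ** n))

# H and its n translates H + e_k form n + 1 = 8 disjoint covering codes.
translates = [set(H)] + [{x ^ (1 << k) for x in H} for k in range(n)]
assert sum(len(t) for t in translates) == len(set().union(*translates)) == 2 ** n
print("8 disjoint covering codes partition the 7-bit binaries")
```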

Similarly, we can characterize \(\{0,1\}\times H\) as the solution set of $$\begin{bmatrix}\mathbf{0} & R\end{bmatrix}\mathbf{x}=\mathbf{0}.$$ We denote \(R^*=[\mathbf{c_0}, \mathbf{c_1}, \ldots, \mathbf{c_n}]\) as the matrix \(R\) prepended by a zero column.

Suppose \(x_0x_1\ldots x_n\) is in \(\{0,1\}\times H\). This means $$\sum_{x_k\neq 0}\mathbf{c_k}=\sum_{k=0}^nx_k\mathbf{c_k}=\mathbf{0},$$ where all summations take place in \(\mathbb{Z}_2^m\). The left hand side of the equation above can be thought of as the following procedure: we take the indices of all the non-zero bits in the binary, \(\{k: x_k\neq 0\}\), write out their binary expansions, \(\{\mathbf{c_k}: x_k\neq 0\}\), and then add them up bitwise. Recall that adding binaries bitwise is equivalent to the XOR operation on numbers. The number corresponding to the left hand side is \(\mathrm{XOR}_{x_k\neq 0}k\). We call this number the XOR-signature of the binary \(x_0x_1\ldots x_n\). Therefore the equation on XOR-signatures $$\mathrm{XOR}_{x_k\neq 0}k=0$$ is another characterization of the set \(\{0,1\}\times H\). Similarly, a binary \(x_0x_1\ldots x_n\) is in \(\{0,1\}\times (H+\mathbf{e_k})\) if and only if its XOR-signature equals \(k\).

All right. Now we have a convenient way of telling which covering code a binary belongs to from its XOR-signature. How does this help the first prisoner figure out which coin to flip? Given the initial binary \(x_0x_1\ldots x_n\), the first prisoner computes its XOR-signature, denoted \(s\). He wants to flip the \(k\)th bit of the binary so as to change its XOR-signature from \(s\) to \(s’\), the magic number the jailer has picked from \(0,1,\ldots, n(=2^m-1)\). Note that when he flips the \(k\)th bit of the binary, the XOR-signature changes from \(s\) to \(s \,\mathrm{XOR}\, k\). So \(k\) must satisfy \(s \,\mathrm{XOR}\, k=s’\). Solving this equation by XOR-ing both sides with \(s\) gives \(k = s \,\mathrm{XOR}\, s’\).

In other words, the first prisoner only needs to compute the XOR-signature \(s\) of the initial binary and flip its \((s\,\mathrm{XOR}\,s’)\)th bit, which changes the XOR-signature of the binary from \(s\) to \(s’\). Then the second prisoner comes in and computes the XOR-signature of the new binary, which gives the magic number.
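The whole strategy fits in a few lines. Here is a hedged Python sketch (function names are mine) simulating the original 64-square puzzle on random boards:

```python
import random
from functools import reduce

def xor_signature(board):
    # XOR of the positions (0..63) of all coins showing heads
    return reduce(lambda a, k: a ^ k, (k for k, b in enumerate(board) if b), 0)

def first_prisoner(board, magic):
    # Flipping coin k changes the signature from s to s XOR k,
    # so flip k = s XOR magic to make the new signature equal magic.
    return xor_signature(board) ^ magic

random.seed(2014)
for _ in range(1000):
    board = [random.randint(0, 1) for _ in range(64)]
    magic = random.randrange(64)
    k = first_prisoner(board, magic)
    board[k] ^= 1  # exactly one coin flipped (coin 0 is the signature no-op)
    assert xor_signature(board) == magic  # second prisoner reads off the magic square
print("1000 random boards: the second prisoner always finds the magic square")
```

Note that when \(s\) already equals the magic number, the scheme flips coin 0, which leaves the signature unchanged, so the first prisoner still flips exactly one coin, matching the remark about the leftmost bit above.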

Pretty neat, isn’t it?


Cousin’s Wedding

I had two cousins in Shanghai before my younger cousin and I came to North America. Thanks to the one-child policy, I don’t have a sister or a brother of my own, so I just call my two cousins my younger brother and elder sister. I recall that, when I was little, we met at grandpa’s home every weekend and had a great time. Quite like Modern Family, isn’t it? Well, time flies. My elder sister just had her wedding ceremony last weekend.

So last summer, I took a trip to Montreal to visit my younger brother, whom I mentioned above, and my younger sister, who was born and raised in Montreal, and we prepared a homemade video for our elder sister.

This video was played during the wedding ceremony.


Crazy Telescoping

In mathematics, a telescoping series is a series whose partial sums eventually have only a fixed number of terms after cancellation. I learnt this technique back when I was doing maths olympiads, but I did not learn the buzzword ‘telescoping’ until last year, since I was educated in China, where we call it ‘the method of differences’.

Anyway, yesterday I was helping Po-Shen Loh proctor the Annual Virginia Regional Mathematics Contest at CMU. Unfortunately, I had not had breakfast, so all I could digest were the seven problems. Even worse, I got stuck on the last problem, which is the following.

Find $$\sum_{n=1}^\infty\frac{n}{(2^n+2^{-n})^2}+\frac{(-1)^nn}{(2^n-2^{-n})^2}.$$

It turns out that the punch line is, as you might have expected, telescoping. But it is just a bit crazy.

There is Pleasure sure, In being Mad, which none but Madmen know!

John Dryden’s “The Spanish Friar”

First consider the first term: $$\frac{1}{(2^1+2^{-1})^2}+\frac{-1}{(2^1-2^{-1})^2}=\frac{-2\times 2}{(2^2-2^{-2})^2}.$$ Then take a look at the second term: $$\frac{2}{(2^2+2^{-2})^2}+\frac{2}{(2^2-2^{-2})^2}.$$ Aha, the sum of the first two terms becomes $$\frac{2}{(2^2+2^{-2})^2}+\frac{-2}{(2^2-2^{-2})^2}=\frac{-4\times 2}{(2^4-2^{-4})^2},$$ which again interacts nicely with the 4th term (not the 3rd term). In general, the sum of the 1st, 2nd, 4th, 8th, …, \(n=(2^m)\)th terms is $$\frac{-4n}{(2^{2n}-2^{-2n})^2}.$$ Note that this goes to zero.

But this only handles the terms indexed by powers of 2. Naturally, the next thing to look at is the sum of the 3rd, 6th, 12th, 24th, …, \(n=(3\times 2^m)\)th terms, which amazingly exhibits the telescoping phenomenon again and equals $$\frac{-4n}{(2^{2n}-2^{-2n})^2}.$$ Again, this goes to zero.

We claim that, starting with any odd index \(2l+1\), the partial sum of the \((2l+1)\)th, \(2(2l+1)\)th, …, \(2^m(2l+1)\)th, … terms goes to zero.

For those willing to suffer a bit more for the sake of rigor, the argument for the previous claim can easily be formalized by the method of differences. The key fact we have been exploiting is the following identity.

$$\frac{2n}{(2^{2n}+2^{-2n})^2}+\frac{2n}{(2^{2n}-2^{-2n})^2} = \frac{4n}{(2^{2n}-2^{-2n})^2}-\frac{8n}{(2^{4n}-2^{-4n})^2}$$

The last ingredient is the absolute summability of the original series, which implies that changing the order of summation does not affect the result. Hence the total sum is \(0\).
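As a numerical sanity check (mine, not part of the original solution), summing enough terms in Python shows the partial sums indeed tend to 0:

```python
# Partial sums of sum_{n>=1} n/(2^n + 2^-n)^2 + (-1)^n n/(2^n - 2^-n)^2.
# Terms decay like n/4^n, so 60 terms are far beyond double precision.
total = sum(
    n / (2 ** n + 2 ** -n) ** 2 + (-1) ** n * n / (2 ** n - 2 ** -n) ** 2
    for n in range(1, 61)
)
print(abs(total) < 1e-12)  # the series sums to 0
```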
