The magic of Artificial Neural Network — Part 4

Simon Kwan
9 min readJan 5, 2025

--

Photo by Jeremy Perkins on Unsplash

In part 2, I have sketched the proof of the Universal Approximation Theorem:

Let S be the set of all artificial neural network functions.

S = {f ∈ C(Im)|f(x)=∑υ[i]σ(w[i]Tx+b[i]) where v[i],b[i]∈ℝ,w[i]∈ℝm,1≤i≤M }

Suppose Cl(S) is not dense in C(Im)
⇒ Cl(S) ≠ C(Im)
⇒ Cl(S) ⊂ C(Im) ⇒ Cl(S) is a closed linear subspace of C(Im)
⇒ For h∈C(Im)\Cl(S), there exists Continuous Linear functional L on C(Im) such that
L(w) = 0 ∀w∈Cl(S)
L(h) 0 [ By Hahn-Banach Theorem ]
⇒ L(h) = ∫ h(x)dμ(x) [ By RMK Representation Theorem ]
⇒ L(σ(wTx+b))=∫ σ(wTx+b)dμ(x)=0 ∀w∈Rn and b∈R [ Since σ(wTx+b) ∈ S ]
⇒ μ=0 [ Since σ is discriminatory ]
⇒ L(h) = ∫ h(x)dμ(x) = 0
contradict L(h) 0

In this article, we are going to prove that:

L(h) = ∫ h(x)dμ(x)
L(σ(wTx+b))=∫ σ(wTx+b)dμ(x)=0 ∀w∈Rn and b∈R

Lebesgue Integral and Riesz Markov Kakutani representation theorem

By using the Riesz Markov Kakutani (RMK) representation theorem, this Continuous Linear Functional L on C(Im) can be represented as the Lebesgue Integral:

L(f) = ∫_Im f(x)dμ(x) ∀f ∈ C(Im)

Note that I use ∫_Im to denote the Lebesgue Integration over Im which is the domain of f(x). So, what is Lebesgue Integral? The Lebesgue Integral is a more generalized version of the Riemann Integral:

R(f) = ∫ f(x)dx where x∈[0, 1]

For the Riemann integral, the domain is partitioned into intervals. By constructing the bars with heights that meet the graph, the integral can be approximated by the sum of the area of these bars.

For the Lebesgue integral, the range is partitioned into intervals by choosing a finite number of target values. By constructing the bars with heights that equal to these values but below the function, the integral can be approximated by the sum of the area of these bars. Note that there may be several bars with heights that equal to these values but with different range of the function.

Riemann Integral (Upper one), Lebesgue Integral (Lower one) From wikipedia

The Lebesgue Integral is more general than the Riemann Integral since the domain of the function can be more general space, not only Euclidean space. The Lebesgue Integral is also more powerful and let’s consider the function q: [0,1] → {0, 1} below which has no Riemann Integral:

q(x) = 1 if x is a rational number
q(x) = 0 otherwise

upper Riemann sum = inf(Σ sup(q(xi*)) ∆x) = 1*(1–0) = 1
lower Riemann sum = sup(Σ inf(q(xi*)) ∆x) = 0*(1–0) = 0
⇒ upper Riemann sum lower Riemann sum ⇒ R(q) is not Riemann Integrable

Note that since rational number is dense in R, there is always a rational number between any real numbers, so sup(q(xi*)) = 1. Irrational number is also dense in R, so inf(q(xi*)) = 0.

In the Riemann Integral, “dx” is an infinitesimal change in x whereas in the Lebesgue Integral, μ is a “measure” of a “measurable set” and f(x) is a “measurable function”. Then, what is the meaning of “measure”, “measurable set” and “measurable function”?

Measure Theory 101

So, what is the meaning of the Measure ? Actually, it is the basic concept in the Measure Theory. In layman terms, it is a function that measures the the size of a measurable set and always returns a finite non-negative real number. For measure that can return negative real number, then it is called signed measure.

Let X be a set.
A σ-algebra Σ on a set X is a nonempty collection of subsets of X which satisfies the following properties:
SA1. X ∈ Σ
SA2. ∀A ∈ Σ, X\A ∈ Σ
SA3. for all countable collection A[i] ∈ Σ, A[1]∪…∪A[∞] ∈ Σ

A Measurable Space is the ordered pair (X, Σ).

A Measurable Set is an element of the σ-algebra.

A Measure μ on a measurable space (X, Σ) is a function μ:Σ→ℝ∪{∞}, which satisfies the following properties:
MS1. ∀A ∈ Σ, μ(A) ≥ 0
MS2. μ(∅) = 0
MS3. for all countable collection A[i], if A[i]∩A[j]=∅ where i j
μ(A[1]∪…∪A[∞]) = μ(A[1])+…+μ(A[∞])

Note that ℝ∪{∞} is the Extended Real Numbers. It is constructed from the real number system ℝ by adding the element ∞ which is greater than every real number. The product of 0 * ∞ is defined to be 0.

The measure μ has the following properties:
MS4. μ(A) ≤ μ(B) if A ⊆ B [ This does not applies to signed measure ]
Proof
A ⊆ B ⇒ B=(B\A)∪A
⇒ μ(B) = μ((B\A)∪A) = μ(B\A) + μ(A) [ Since B\A ∩ A = ∅ ]
⇒ μ(B) ≥ μ(A) [ By MS1 ]

Let (X, Σx) and (Y, Σy) be the measurable spaces.
A function f : X → Y is said to be Measurable if
∀ B ∈ Σy, f−1(B) ∈ Σx

A measurable function f : X → ℝ is said to be Simple Function if it takes only finitely many values: α[1] < α[2] < .. α[n]
∀ x ∈ X, f(x) = α[1] IA[1](x) + … + α[n] IA[n](x)
with A[i] = f-1({α[i]}) ∈ Σx and A[1]∪…∪A[n] = X

I denotes the Indicator Function:
Given a subset A ⊆ X, IA: X → {0, 1}
IA(x) = 1 if x ∈ A
IA(x) = 0 if x ∉ A

Lebesgue Integral of the simple function f : X → ℝ+ with the canonical representation f = Σ α[i] IA[i] is defined as follows:

∫_X f(x)dμ(x) = Σ α[i] μ(A[i])

If B is a measurable subset of X, the Lebesgue Integral becomes:
∫_B f(x)dμ(x) = ∫_X IB f(x) dμ(x) = Σ α[i] μ(A[i] ∩ B)

Lebesgue Integral of the non-negative function f : X → ℝ+ is defined as the supremum of set of Lebesgue Integral of simple functions:

∫_X f(x)dμ(x) = sup { ∫_X s(x)dμ(x) : 0 ≤ s ≤ f, s is simple }

Note that the measure μ is defined on extended real numbers. Thus, ∫_Xs(x)dμ(x) = Σ α[i] μ(A[i]) < ∞ and the set {∫_X s(x)dμ(x)} has the upper bound ∞. Also, s(x) = 0 is a simple function, so the set {∫_X s(x)dμ(x)} contains the Lebesgue Integral of this function and thus is non-empty. By the least upper bound axiom, the supremum of the set exists. Therefore, the definition of the Lebesgue Integral of the non-negative function is justified.

For measurable function f that is not non-negative, it can be written as:
f = f+ — f-
f+ = max(f, 0)
f- = min(f, 0)
Note that both f+ and f− are non-negative measurable functions. Also note that:
|f| = f+ + f-

Lebesgue Integral of the general function f : X → ℝ is defined as follows if min(∫_X f+(x)dμ(x) , ∫_X f-(x)dμ(x)) < ∞:

∫_X f(x)dμ(x) = ∫_X f+(x)dμ(x) — ∫_X f-(x)dμ(x)

This condition is required since ∞ — ∞ is not defined if both ∫_X f+(x)dμ(x) = ∞ and ∫_X f-(x)dμ(x) = ∞.
A measurable function f: X → ℝ is said to be Lebesgue Integrable if
∫_X |f(x)| dμ(x) < ∞

Let’s go back to the function q(x) and consider its Lebesgue Integral.
L(q) = ∫_[0,1]q(x)dμ(x)
= 0*μ([0,1]\ℚ) + 1*μ([0,1]⋂ℚ)
= 1*0 [ measure of countable set is 0 w.r.t. Lebesgue Measure ]
= 0

Properties of Lebesgue Integral

Now, I state all the related properties and theorem of the Lebesgue Integral without proof. You can refer to the book on Measure Theory for details.

LB1: ∫_A⋃B f(x)dμ(x) = ∫_A f(x)dμ(x) + ∫_B f(x)dμ(x) where A⋂B = ø

LB2: ∫_A c·(f+g)(x)dμ(x) = c ∫_A f(x)dμ(x) + c ∫_A g(x)dμ(x) where c is a constant and c ∈ℝ

LB3: | ∫_A f(x)dμ(x) | ≤ ∫_A |f(x)| dμ(x)

LB4: ∫_A f(x)dμ(x) = 0 iff f(x) = 0 a.e.

Almost everywhere ??

Note that in LB4, a.e. is an abbreviation of “almost everywhere”. A property P holds “almost everywhere” if it holds for all elements in a set except a subset of measure zero. In LB4, “f(x) = 0 a.e.” means μ({x∈A | f(x) 0}) = 0. If property P1 is a.e. and P2 is a.e., then P1^P2 will be a.e. too.

P1 a.e. and P2 a.e => P1^P2 a.e
Proof
P1 a.e. and P2 a.e
⇒ P1: ∀x ∈ X\A where A = {x∈X| ¬P1(x)} and μ(A) = 0 and
P2: ∀x ∈ X\B where B = {x∈X| ¬P2(x)} and μ(B) = 0
⇒ P1^P2 ∀x ∈ X\A∩X\B where μ(A)=μ(B)=0
⇒ P1^P2 ∀x ∈ X\A⋃B where 0≤μ(A⋃B)≤μ(A)+μ(B)=0
⇒ P1^P2 ∀x ∈ X\C where C=A⋃B={x∈X| ¬P1^P2(x)} and μ(C) = 0
⇒ P1^P2 a.e

Theorem of Lebesgue Integral

Lebesgue Bounded Convergence Theorem:
Let (fn(x)) be a sequence of measurable function defined on measurable set A and μ(A) < ∞. If the following conditions 1) and 2) are satisfied:
1) There exists M ∈ ℝ s.t. |fn(x)| ≤ M ∀n∈ℕ ∀x∈A and
2) lim_n→∞ fn(x) = f(x) ∀x∈A
Then lim_n→∞ ∫_A fn(x) dμ(x) = ∫_A f(x) dμ(x)

RMK representation theorem (Signed version)
Let X be a locally compact Hausdorff space
Let C(X) be the space of continuous function on X
Let Cc(X) be the space of continuous function on X with compact support
Let L be a continuous linear functional on Cc(X)
Then there exists a unique signed regular Borel measure μ on X such that
L(f) = ∫_X f(x) dμ(x) ∀f ∈ Cc(X)

It can be proved that
1) Im is a locally compact Hausdorff space and
2) Cc(Im) = C(Im)
So, we can now apply the RMK representation theorem and get the following result:
L(f) = ∫_Im f(x)dμ(x) ∀f ∈ C(Im)

In part 3, I have shown that when Cl(S) is a closed linear subspace in C(Im) and for every continuous function h∈C(Im)\Cl(S), there exists a corresponding Continuous Linear Functional L on C(Im) with L(h) 0 and L(w) = 0 ∀w∈Cl(S). With the help of RMK representation theorem, the Lebesgue Integral of any neural network function in S is zero.

∀w∈Rn and b∈R,
σ(wTx+b) ∈ S
⇒ σ(wTx+b) ∈ Cl(S) [ By CL4 ]
⇒ L(σ(wTx+b)) = 0
L(σ(wTx+b))=∫ σ(wTx+b)dμ(x)=0 ∀w∈Rn and b∈R

Sigmoidal function σ is discriminatory

Our next step is to show that any continuous sigmoidal function σ is discriminatory.

Let M(Im) denote the space of finite, signed regular Borel measure on Im and μ is a member of M(Im). The regular Borel measure is one kind of the measure but the detail is not important here, so I will not describe here.

σ is sigmoidal if:
σ(t) → 1 as t → +∞
σ(t) → 0 as t → -∞

Sigmoidal function example: f(t)=1/(1+e^-t)

σ is discriminatory if for a measure μ ∈ M(Im), ∀y∈ℝ^n and θ∈ℝ,
∫_Im σ(y^T x + θ)dμ(x) = 0 ⇒ μ = 0

Any bounded, measurable sigmoidal function σ is discriminatory.
Proof
Define σλ(x) = σ(λ(y^T x+θ)+φ) where y∈ℝ^n and φ,θ∈ℝ, x∈Im:
σλ(x) → 1 for y^T x + θ > 0 as λ → +∞
σλ(x) → 0 for y^T x + θ < 0 as λ → +∞
σλ(x) = σ(φ) for y^T x + θ = 0 ∀ λ

Define γ(x):
γ(x) = 1 for y^T x + θ > 0
γ(x) = 0 for y^T x + θ < 0
γ(x) = σ(φ) for y^T x + θ = 0

Note that (σλ(x)) is a sequence of measurable function and γ(x) is the limit of this sequence: lim_λ→∞ σλ(x)=γ(x) ∀x∈Im
Also, |σλ(x)| = |σ(x)| ≤ 1 [ Since σ is bounded ]
⇒ lim_λ→∞ ∫_Im σλ(x)dμ(x) = ∫_Im γ(x)dμ(x) [ By Lebesgue Bounded Convergence Theorem ]

Let Hy,θ={x|y^T x + θ > 0}, Πy,θ={x|y^T x + θ = 0} and Gy,θ={x|y^T x + θ < 0}. That is, Hy,θ is the half space above the plane Πy,θ and Gy,θ is the half space below the plane Πy,θ. Thus, Im = Hy,θ ∪ Πy,θ ∪ Gy,θ and Hy,θ∩Πy,θ=∅, Gy,θ∩Πy,θ=∅ and Hy,θ∩Gy,θ=∅. Thus,

γ(x) = 1 ∀x ∈ Hy,θ
γ(x) = 0 ∀x ∈ Gy,θ
γ(x) = σ(φ) ∀x ∈ Πy,θ

If for a measure μ ∈ M(Im), ∫_Im σ(y’^T x + θ’)dμ(x) = 0 ∀y’∈ℝ^n and θ’∈ℝ
∫_Im σλ(x)dμ(x)
= ∫_Im σ(λ(y^T x+θ)+φ) dμ(x)
= ∫_Im σ((λy)^T x+(λθ+φ)) dμ(x) [ Put y’=λy, θ’=λθ+φ ]
= 0
⇒ lim_λ→∞ ∫_Im σλ(x)dμ(x) = 0
⇒ ∫_Im γ(x)dμ(x) = 0 [ Since lim_λ→∞ ∫_Im σλ(x)dμ(x) = ∫_Im γ(x)dμ(x) ]
⇒ ∫_Im γ(x)dμ(x) = μ(Hy,θ) + σ(φ)μ(Πy,θ) = 0 [ By LB1, LB2 and definition of γ ]
⇒ μ(Hy,θ) + σ(φ)μ(Πy,θ) = 0 ∀y∈ℝ^n and φ,θ∈ℝ
⇒ μ(Hy,θ) = 0 [ Consider σ(φ) → 0 as φ → -∞ ]
⇒ μ(Πy,θ) = 0 [ Consider σ(φ) → 1 as φ → +∞ ]

Conclusion

At this point, we have proved that μ(Hy,θ) = μ(Πy,θ) = 0. That is, μ = 0 for any plane and its upper half space. However, “μ = 0 for any set in Im” is not proven yet and we continue our proof in the next article.

Your feedback is highly appreciated and will help me to continue my articles.

Please give this post a clap if you like this post. Thanks!!

--

--

Simon Kwan
Simon Kwan

Written by Simon Kwan

"The fear of the LORD is the beginning of wisdom" ✝️Seasoned Java Programmer. MSc Comp Sc, BSc Math (1st Hon), BSc Mech Eng

No responses yet