disinformation and proof length

José Manuel Rodríguez Caballero josephcmac at gmail.com
Wed Jul 6 22:00:14 EDT 2022


Vaughan Pratt wrote:

> Vladik Kreinovich proposes that disinformation about a proposition P
> increases Shannon entropy, thereby decreasing audience A's information
> about the state of the system.  He illustrates this with (essentially) the
> example in which ~P is the only other proposition and A's probability of P
> is decreased from 1 to 0.5, raising the system entropy from zero to 1 bit
> and implying that A has lost 1 bit of information.


I think that binary logic is too coarse a model for describing
disinformation, since human thinking tends to be fuzzier than the plain
binary of belief and disbelief. Let's apply Bayesian epistemology to the
audience A (I will call her Alice, to give the discussion the flavor of
cryptography, where this name is customary).

Reference about Bayesian epistemology:
https://plato.stanford.edu/entries/epistemology-bayesian/

For simplicity, let's define Alice's worldview as a probability
distribution concerning P and not P. This worldview can be characterized by
an extended real number theta (an element of the extended real line, i.e.,
the real numbers together with positive and negative infinity), which
expresses Alice's subjective degree of certainty about P: theta is positive
infinity if Alice is certain that P is true, and negative infinity if Alice
is certain that P is false.

Let's start by considering that Alice doesn't have any bias concerning P
and not P. Alice's prior probability density could be just the standard
normal distribution (mean = 0, variance = 1):

Pr( theta = x ) = 1/sqrt(2*pi)*exp(-x^2/2)

where Pr denotes the probability (this notation is used in cryptography to
avoid confusion with the letter P).

Every time Alice gets data d, she updates her probability density for
theta using Bayes' theorem,

Pr( theta = x | data = d ) = Pr( data = d | theta = x ) * Pr( theta = x ) /
Pr( data = d )

The only non-trivial point in Bayes' theorem, when applied to disinformation
theory, is the interpretation of Pr( data = d | theta = x ). I would
suggest that this quantity is the degree to which a person who believes P
with intensity theta = x would expect the data to be d. That is, different
intensities of belief in a given proposition produce different expectations
about the data.
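To make the update concrete, here is a minimal numerical sketch in Python.
The logistic likelihood and the truncation of the extended real line to
[-10, 10] are purely illustrative assumptions of mine, not part of the
framework itself:

    import numpy as np

    # Grid approximation of theta; the infinities are truncated to +/-10.
    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]

    # Alice's unbiased prior: standard normal density over theta.
    prior = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    # Hypothetical likelihood: a believer with intensity theta = x expects
    # the observed datum d with probability given by a logistic curve.
    likelihood = 1.0 / (1.0 + np.exp(-x))

    # Bayes' theorem: posterior proportional to likelihood * prior; the
    # division by Pr( data = d ) is the normalization over the grid.
    posterior = likelihood * prior
    posterior /= posterior.sum() * dx

    print("posterior mode at theta =", x[np.argmax(posterior)])

In this toy model, observing a datum that a believer in P would expect
moves the mode of Alice's density from 0 to a positive value of theta.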

The prior probability distribution Pr( theta = x ) cannot be affected by
disinformation, since it encodes Alice's previous beliefs. So there are only
two ways in which disinformation can change Alice's worldview:

- Method I (disinformation by unconditional probability): Alice is told
that Pr( data = d ) is higher or lower than it really is.

- Method II (disinformation by conditional probability): Alice is told that
Pr( data = d | theta = x ) is higher or lower than it really is.
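As a toy illustration of where the two methods enter the update, here is a
sketch that continues in the spirit of the code above (the logistic
likelihood and the factor 3 are again purely illustrative assumptions):

    import numpy as np

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    prior = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    honest_likelihood = 1.0 / (1.0 + np.exp(-x))  # hypothetical honest model

    # Method I: Eve misreports the unconditional probability Pr( data = d ).
    # Alice divides by the wrong normalizer, so her "posterior" is
    # uniformly rescaled and no longer integrates to 1.
    true_evidence = (honest_likelihood * prior).sum() * dx
    posterior_I = honest_likelihood * prior / (3.0 * true_evidence)

    # Method II: Eve misreports the conditional probability
    # Pr( data = d | theta = x ), here by reversing the direction of the
    # hypothetical likelihood, so the data appear to favor not P.
    fake_likelihood = 1.0 / (1.0 + np.exp(+x))
    posterior_II = fake_likelihood * prior
    posterior_II /= posterior_II.sum() * dx

    print("mode under Method II: theta =", x[np.argmax(posterior_II)])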

Let Eve be the agent who wants to feed disinformation to Alice. Eve can
try two strategies:

- Strategy A (confusion): Eve will try to keep the skewness of Alice's
probability distribution as close to zero as possible. This will increase
the symmetry in the probability distribution. Consequently, it will be
harder for Alice to decide between P and not P.

- Strategy B (recruitment): Eve will try to keep the skewness of Alice's
probability density at the opposite sign from the skewness that would
result from genuine information. This will cause Alice to believe the
opposite of what is true.

Note that in this dynamic approach to disinformation, the entropy of the
probability distribution plays no role (the entropy is independent of the
skewness). What matters here is the sign of the point x at which
Pr( theta = x | data = d ) reaches its maximum (there may be more than one
such point). Of course, in other frameworks, Shannon entropy may be more
relevant.
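One can check in code which strategy a given distorted likelihood
implements by looking at the sign of the posterior mode and at the
skewness, again under the illustrative assumptions above:

    import numpy as np

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    prior = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    def posterior(likelihood):
        p = likelihood * prior
        return p / (p.sum() * dx)

    def mode_sign_and_skewness(p):
        mode = x[np.argmax(p)]
        mean = (x * p).sum() * dx
        var = ((x - mean) ** 2 * p).sum() * dx
        skew = ((x - mean) ** 3 * p).sum() * dx / var ** 1.5
        return np.sign(mode), skew

    genuine = 1.0 / (1.0 + np.exp(-x))     # hypothetical genuine likelihood
    confusion = np.ones_like(x)            # Strategy A: symmetric, uninformative
    recruitment = 1.0 / (1.0 + np.exp(x))  # Strategy B: direction reversed

    for name, lik in [("genuine", genuine), ("Strategy A", confusion),
                      ("Strategy B", recruitment)]:
        print(name, mode_sign_and_skewness(posterior(lik)))

With the genuine likelihood the mode sits at a positive theta; under
Strategy A the posterior stays symmetric (mode at zero, skewness zero);
under Strategy B the sign of the mode is reversed.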

From the point of view of deductive systems, the framework that I am
presenting concerns the interaction between two extensions of an incomplete
theory T: one extension (T + P) in which the independent proposition P is
true, and another (T + not P) in which P is false. The data are
propositions of T that are true in T + P, in T + not P, or in both. Given a
proposition (data point), there is a probability that it came from T + P,
another probability that it came from T + not P, and another probability
that it came from T.

This probability can be defined using the notion of proof complexity. For
example, the probability that a data point came from a theory R (where R
could be T, T + P, or T + not P) can be defined as one divided by the
length of the shortest proof of that data point in R, normalized by the
same quantity applied to all data points in the dataset (this is the first
idea that comes to my mind; maybe other functions are more suitable than
1/x for this problem). This definition assigns higher probability to
propositions with shorter proofs, which, as a heuristic assumption, are
more likely to arise in reality because they require fewer steps to
produce. A better definition might take into account several proofs of the
same proposition, not just the shortest one.
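For example, here is a minimal sketch of the 1/(proof length) assignment,
with hypothetical shortest-proof lengths supplied by hand (computing them
for a real theory R is not attempted here):

    def proof_length_probability(proof_lengths):
        """proof_lengths: dict mapping each data point to the length of its
        shortest known proof in a fixed theory R.  Returns the probability
        assigned to each data point, proportional to 1/length."""
        weights = {d: 1.0 / n for d, n in proof_lengths.items()}
        total = sum(weights.values())
        return {d: w / total for d, w in weights.items()}

    # Hypothetical shortest-proof lengths of three data points in R = T + P.
    example = {"d1": 3, "d2": 12, "d3": 30}
    print(proof_length_probability(example))
    # Shorter proofs receive higher probability, matching the heuristic
    # assumption that such propositions are more likely to arise in reality.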

Kind regards,
Jose M.

