Tag Archives: Coding

Information Reduction 1: Coding

Two types of coding

In a previous post, I described two fundamentally different types of coding. In the first, the intention is to carry all the information contained in the source over into the encoded version. In the second, on the other hand, we deliberately refrain from doing this. It is the second – the information-losing – type that is of particular interest to us.

When I highlighted this difference in my presentations twenty years ago and the phrase ‘information reduction’ appeared prominently in my slides, my project partners pointed out that this might not go down too well with the audience. After all, everyone wants to win; nobody wants to lose. How can I promote a product for which loss is a quality feature?

Well, sometimes we have to face the fact that the thing we have been trying to avoid at all costs is actually of great value. And that’s certainly the case for information-losing coding.

Medical coding

Our company specialised in the encoding of free-text medical diagnoses. Our program read the diagnoses that doctors write in free text in their patients’ medical records and automatically assigned them a code based upon a standard coding system (ICD-10) with about 15,000 codes (Switzerland, 2019). This sounds like a lot, but the number is small considering the billions of distinguishable diagnoses and diagnostic formulations that occur in the field of medicine (see article). Of course, the individual code cannot contain more information than the standard code is able to discern for the case in question. The full-text diagnoses usually contained more information than this and our task was to automatically extract the relevant parts from the free texts in order to assign the correct code. We were fairly successful in this attempt.

Coding is part of a longer chain

But coding is only one step in a bigger process. Firstly, the information-processing chain extends from codes to flat rates per case (Diagnosis Related Groups = DRGs). Secondly, the free texts to be coded in the medical record are themselves the result of a multi-stage chain of information processing and reduction that has already been performed. Overall, a hospital case involves a chain made up of the following stages from patient examination to flat rate per case:

  • Patient: amount of information contained in the patient.
  • Doctor: amount of information about the patient that the doctor recognises.
  • Medical record: amount of information documented by the doctor.
  • Diagnoses: amount of information contained in the texts regarding the diagnoses.
  • Codes: amount of information contained in the diagnosis codes.
  • Flat rate per case: amount of information contained in the flat rate per case.

The information is reduced at every step, usually quite dramatically. The question is, how does this process work? Can the reduction be automated. And is it a determinate process, or one in which multiple options exist?


This is a page about information reduction — see also overview.

Translation: Tony Häfliger and Vivien Blandford

Two Types of Coding 2

The two types of coding in set diagrams

I would like to return to the subject of my article Two types of coding 1 and clarify the difference between the two types of coding using set diagrams. I believe this distinction is so important for the field of semanticsand for information theory in general, that it should be generally understood.

Information-preserving coding

The information-preserving type of coding can be represented using the following diagram

Mengendiagramm 1:1-Kodierung

Fig 1: Information-preserving coding (1:1, all codes reachable)

The original form is shown on the left and the encoded form on the right. The red dot on the left could, for example, represent the letter A and the dot on the right the Morse code sequence dot dash. Since this is a 1:1 representation, you can always find your way back from each element on the right to the initial element on the left, i.e. from the dot dash of Morse code to the letter A.

Mengendiagramm 1:1-Kodierung, nicht alle Kodes erreicht

Fig. 2: Information-preserving coding (1:1, not all codes reachable)

Of course, a 1:1 coding system preserves information even if not all codes are used. Since the unused ones can never arise during coding, they play no role at all. For each element of the set depicted on the right that is used for a code, there is exactly one element in the initial form. The code is therefore reversible without loss of information, i.e. decodable, and the original form can be restored without loss for each resulting code.

Mengendarstellung: Informationserhaltende Kodierung (1:n)

Fig. 3: Information-preserving coding (1:n)

With a 1:n system of coding, too, the original form can be reconstructed without loss. An original element can be coded in different ways, but each code has only one original element. There is thus no danger of not getting back to the initial value. Again, it does not matter whether or not all possible codes (elements on the right) are used, since unused codes never need to be reached and therefore do not have to be retranslated.

For all the coding ratios shown so far (1:1 and 1:n), the original information can be fully reconstructed. It doesn’t matter whether we choose a ratio of 1:1 or 1:n, or whether all possible codes are used or some remain free. The only important thing is that each code can only be reached from a single original element. In the language of mathematics, information-preserving codes are injective relations.

Information-reducing coding
Mengendiagramm: Informationsreduzierende Kodierung

Fig. 4: Information-reducing coding (n:1)

In this type of coding, several elements from the initial set point to the same code, i.e. to the same element in the set of resulting codes. This means that the original form can no longer be reconstructed at a later time. The red dot in the figure on the right, for example, represents a code for which there are three different initial forms. The information about the difference between the three dots on the left is lost in the dot on the right and can never be reconstructed. Mathematicians call this a non-injective relation. Coding systems of this type lose information.

Although this type of coding is less ‘clean’, it is nevertheless the one that interests us most, as it typifies many processes in reality.

Two Types of Coding 1

A simple broken bone

In the world of healthcare, medical diagnoses are encoded to improve transparency. This is necessary because they can be formulated in such a wide variety of different ways. For example, a patient may suffer from the following:

– a broken arm
– a distal radius fracture
– a fractura radii loco classico
– a closed extension fracture of the distal radius
– a Raikar’s fracture, left
– a bone fracture of the left distal forearm
– an Fx of the dist. radius l.
– a Colles fracture

Even though they are constructed from different words and abbreviations, all the above expressions can be used to describe the same factual situation, some with more precision than others. And this list is by no means exhaustive. I have been studying such expressions for decades and can assure you without any exaggeration whatsoever that there are billions of different formulations for medical diagnoses, all of them absolutely correct.

Of course, this  huge array of free texts in all variations cannot be processed statistically. The diagnoses are therefore encoded, often using the ICD (International Classification of Diseases) system, which comprises between 15,000 and 80,000 different codes depending on variant. That’s a lot of codes, but much clearer than the billions of possible text formulations it replaces.

Incidentally, the methods used to automate the interpretation of texts so that it can be performed by a computer program are a fascinating subject.

Morse code 

Morse code is used for communication in situations where it’s only possible to send very simple signals. The sender encodes the letters of the alphabet in the form of dots and dashes, which are then transmitted to the recipient, who decodes them by converting them back into letters. An E, for example, becomes a dot and an A becomes a dot followed by a dash. The process of encoding/decoding is perfectly reversible, and the representation unambiguous.

Cryptography

In the field of cryptography, too, we need to be able to translate the code back into its original form. This approach differs from Morse code only in that the translation rule is usually a little more complicated and is known only to a select few. As with Morse code, however, the encrypted form needs to carry the same information as the original.

Information reduction

Morse code and cryptographic codes are both designed so that the receiver can ultimately recreate the original message. The information itself needs to remain unchanged, with only its outer form being altered.

The situation is quite different for ICD coding. Here, we are not dealing with words that are interchangeable on a one-for-one basis such as tibia and shinbone – the ICD is not, and was never intended to be, a reversible coding system. Instead, ICD codes are like drawers in which different diagnoses can be placed, and the process of classification involves deliberately discarding information which is then lost for ever. This is because there is simply too much detail in the diagnoses themselves. For example, a fracture can have the following independent characteristics:

– Name of the bone in question
– Site on the bone
– State of the skin barrier (open, closed)
– Joint involvement (intra-articular, extra-articular)
– Direction of the deformity (flexion, extension, etc.)
– Type of break line (spiral, etc.)
– Number and type of fracture fragments (monoblock, comminuted)
– Cause (trauma, tumour metastasis, fatigue)
– etc.

All these characteristics can be combined, which multiplies the number of possibilities. A statistical breakdown naturally cannot take all combination variants into account, so the diagnostic code covers only a few. In Germany and Switzerland, the ICD can cope with fewer than 20,000 categories for the entire field of medicine. The question of what information the ‘drawers’ can and cannot take into account, is an important topic both for players within the healthcare system and those of us who are interested in information theory and its practical application. Let’s turn now to the coding process.

Two types of coding

I believe that the distinction described above is an important one. On the one hand, we have coding systems that aim to preserve the information itself and change only its form, such as Morse code and cryptographic systems. On the other hand, we have systems such as those for encoding medical diagnosis. These aim to reduce the total amount of information because this is simply too large and needs to be cut down – usually dramatically – for the sake of clarity. Coding to reduce information behaves very differently from coding to preserve information.

This distinction is critical. Mathematical models and scientific theories that apply to information-preserving systems are not suitable for information-reducing ones. In terms of information theory, we are faced with a completely different situation.