C 0000005 mmmv number array bytestream t1

From commentsarchive
Jump to: navigation, search

C 1..100

This specification is UNUSABLE due to being INCOMPLETE. It is still being written.

The mmmv_number_array_bytestream_t1 is meant to be a storage and data transmission format, not an in-memory processing format.

Some of the ideas used in this Specification

This list of ideas is in the role of a comment of the specification. The aim of this list of ideas is to at least partly explain the reasoning behind the choices made in this specification.

Light Compression

As a storage and data transmission format the mmmv_number_array_bytestream_t1 tries to implement some inherent lightweight compression. The core idea of a lossless compression algorithm is that a compression algorithm tries to minimize data duplication. Different compression algorithms have different ways how to detect data duplication and how to re-encode the data, but one idea is that there is a table

abbreviation original bitstream
abbreviation_bitstream_x1 original_bitstream_x1
abbreviation_bitstream_x2 original_bitstream_x2
... ...
abbreviation_bitstream_xK original_bitstream_xK
... ...
abbreviation_bitstream_xN original_bitstream_xN

and the compression algorithm does plain search and replace and after that repeated sequences like

    # are replaced with 
    5 <some_bitstream_Z>

Different compression algorithms may assemble that table differently, but the general idea used in the mmmv_number_array_bytestream_t1 specification is that the table is hardcoded into the codec. The table rows are chosen quite arbitrarily, partly by random guess and certainly based on the 2020_05 snapshot of the life experience of the author of this specification, me, Martin.Vahi@softf1.com .

Digital Signal Accompanied by a Clock Signal

The idea originates from digital electronics and the idea is that there are 2 signals: data bitstream signal and a clock signal. The data bitstream signal is assumed to be self-synchronizing and the clock signal is used for indicating the start of an arbitrary length bitstream segment at the data bitstream. Like

self-synchronizing scheme: 1=="01"  0=="10"

 data bitstream: 10010101011010011010100110011001010101100110010110010110101010011010101010101010  is     self-synchronizing 
clock bitstream: 10000000000010000010000000000000001000000000000000000010000000000000000000000000  is not self-synchronizing 
clock bitstream: 1___________1_____1_______________1___________________1_________________________

Arbitrarily Sized Positive Whole Numbers are a Series of Digits that are self Smaller Positive Whole Numbers than 2-digit Versions of the Arbitrarily Sized Positive Whole Numbers

In a simplistic, tautological way, the format is

|<number N1 length in number of digits>|<the number N1 as a series of digits>|

Given that the length of the N1 is also a number and given that a limited set of bits, for example, 8b = 1B, sets a limit to the maximum positive whole number that can be represented by the limited set of bits, the implementation idea in EBNF is:


Boolean operations on Bytes can be Defined for the text form of the Bytes

Sometimes raw bytestreams are not easy to process or not processable at all. For example, an AJAX web application web browser side JavaScript can process text and the classical, CPU-supported, 4B integers, but working with raw bytestreams is not that simple, if the processing is expected to work flawlessly on different CPU-types that have different byte endianness and bit endianness. Any bytestream can be converted to text and one way out of many ways(Base64, Uuencoding, Parchive, yEnc, Xxencoding) to convert a bytestream to text is to convert it to a commaseparated list of whole numbers at the range of 0..255. In that case it is possible to define the classical commutative boolean bitwise operations in a huge table that has (256*256-256)/2+256=32896 compartments. The "-256" part comes from removing the main diagonal of the 256x256 table and the "+256" comes from adding the main diagonal of the 256x256 table back. The "/2" comes from the idea that the triangular sets of compartments at either side of the main diagonal are symmetrical and have exactly the same number of compartments. The optimisation idea is:


For example

    XOR(01000011,01001100) == 00001111

    convert_to_text(01000011) ==  "67"
    convert_to_text(01001100) ==  "76"
    convert_to_text(00001111) ==  "15"

    textform_XOR("67","76") = textform_XOR_implementation(
                                  ) == "15"

Irrational Numbers have Infinitely many Digits

It takes infinite amount of RAM to store an infinitely long series of digits. Some of the irrational numbers that tend to be often used like the $\pi$ and the $e$ and $\sqrt{-1}$ and $\sqrt{2}$ can be stored very efficiently as a symbol and without rounding. The same applies for less frequently used irrational numbers like $\sqrt{e}$. To avoid forcing lack of precision where the lack of precision can be relatively easily avoided, a number array bytestream format should allow to store symbolically succinctly expressible irrational numbers in a symbolic form.

The Specification

All bytes and bits are in the Little-Endian format, id est least significant bit/byte is at the lowest possible index. The format in EBNF:


ENCODING_ALGORITHM_ID :== BYTE_00000001 BYTE_00000000{3} // 4B C++ int in Little-Endian byte order and bit order
BYTE_00000000 :== <A byte, where all bits are 0>
BYTE_00000001 :== <A byte, where all bits, except the least significant bit, are 0>

DIGITBYTE :== <A digit in base 256, id est its value is in range [0;255]. It is in Little-endian format.>

                                     CLOCKBYTE_TYPE_BIT :== <0 for START_CLOCKBYTE, 1 for CONTINUATION_CLOCKBYTE>
                        4_BIT_DIMENSION_ID_BITS_6_5_4_3 :== <4b positive whole number in Little-Endian format>

3_BIT_EXPONENT_OF_2_EQUALS_DISTANCE_TILL_NEXT_CLOCKBYTE :== <3b positive whole number in Little-Endian format. The distance is 2^(the 3b number value), 
                                                             id est the range of the distance is [1,2,4,8,16,32,64,128] and that sort of range
                                                             has been chosen because usually the CPU-s of year 2020 computers use unsigned integer types
                                                             that have those kind of sizes in number of bytes.
                                                             The distance between 2 clockbytes is the number of digitbytes between the 2 clockbytes.>

CONTINUATION_CLOCKBYTE :== <matches the nearest preceding START_CLOCKBYTE, except that the CLOCKBYTE_TYPE_BIT==1 and the bits 2,1,0 are updated.>

BYTESTREAM_END_CLOCKBYTE :== <the same as the START_CLOCKBYTE, except that 4_BIT_DIMENSION_ID_BITS_6_5_4_3==1111===15 and all of the rest of the bits do not matter.>


4_BIT_DIMENSION_ID_BITS_6_5_4_3 in base 2 4_BIT_DIMENSION_ID_BITS_6_5_4_3 in base 10 meaning
0000 0 $n \in \left\{\mathbb{N} \cup \{ 0 \}\right\}$
0001 1 $\frac{1}{n},\;\; n \in \left\{\mathbb{N} \setminus \{ 0 \}\right\}$
0010 2 $\sqrt[]{-1} \cdot n,\;\; n \in \left\{\mathbb{N} \cup \{ 0 \}\right\}$
0011 3 $\sqrt[]{-1} \cdot \frac{1}{n},\;\; n \in \left\{\mathbb{N} \setminus \{ 0 \}\right\}$
0100 4 POOLELI. Z
0101 5 POOLELI. Z
0110 6 POOLELI. Z
0111 7 POOLELI. Z
1000 8 POOLELI. Z
1001 9 POOLELI. Z
1010 10 POOLELI. Z
1011 11 POOLELI. Z
1100 12 POOLELI. Z
1101 13 POOLELI. Z
1110 14 POOLELI. Z
1111 15 POOLELI. Z

POOLELI. sümbolkonstantide tabel koostada ja füüsikute muud numbriformaadid lisaks imaginaar-arvudele üles otsida, mingi 5 tükki oli