[swift-dev] Reducing the size of Swift binaries by shortening symbols

Dmitri Gribenko gribozavr at gmail.com
Sun Dec 20 23:53:03 CST 2015


On Sun, Dec 20, 2015 at 9:43 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
> On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
>
> + Stephen Canon, because he probably has good ideas in this domain.
>
> On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev
> <swift-dev at swift.org> wrote:
>>
>>
>> What’s next?
>>
>> The small experiment I described above showed that compressing the names
>> in the string table has a huge potential for reducing the size of swift
>> binaries. I’d like for us (swift-developers) to talk about the implications
>> of this change and start working on the two tasks of tightening our existing
>> mangling format and on implementing a new compression layer on top.
>
>
>
> Hi Dmitri,
>
> Hi Nadav,
>
> This is a great start that shows that there is a potential for improvement
> in our mangled names!
>
> To make this effort more visible, I would suggest creating a bug on
> https://bugs.swift.org/ .
>
> I think we survey existing solutions that industry has developed for
> compressing short messages.  What comes to mind:
>
> - header compression in HTTP2:
> https://http2.github.io/http2-spec/compression.html
>
> - PPM algorithms are one of the best-performing compression algorithms for
> text.
>
>
> It is important to keep in mind that we are not inventing a new
> text-compression algorithms here. We are solving a specific problem with
> specific characteristics and constraints. Nothing is groundbreaking here.

This is exactly the case for using a specialized algorithm: a
well-defined problem with specific constraints, not a general case of
arbitrary input.

> Lempel-Ziv based algorithms rely on repetitions within the text, and can use
> some prior statistical knowledge on the text (pre-computed tables). Our
> mangled names are relatively short and don’t have significant repetitions,
> so having a reference to some pre-computed table, and not having an internal
> reference is obvious.

Names are formed according to well-defined mangling rules with their
own redundancy and patterns, and identifiers usually come from
English.  This is why using a high-order statistical model makes
sense.

Pre-computed tables on the other hand have their own issues: since we
have a runtime demangler that our metadata parsing relies upon, the
whole precomputed table of known substitutions has to be a part of the
runtime, and not just the demangler.  This creates interesting
constraints for bare-metal environments for example.  I'm not saying
that it is wrong to use them, I'm just saying that the trade-off is
not so clear-cut.

> On top of the basic encoding layer we need to apply some kind of variable
> length encoding (Cameron suggested Huffman). The restricted character set is
> not a problem here. To encode a string of bits using the charset
> [a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the
> whole bit stream.
>
> - Arithmetic coding is also a natural starting point for experimentation.
>
> Since the input mangled name also comes in a restricted character set, we
> could also remove useless bits first, and try an existing compression
> algorithm on the resulting binary string.
>
>
> The upper bits never come into play. You can conceptually imagine that bytes
> have ~6 bits (63 combinations). You only need to worry about these kind of
> things when you use variable length encoding (see my previous comment about
> Cameron’s suggestion to use Huffman encoding as a second pass).
>
>
> We should also build a scheme that uses shortest one between the compressed
> and non-compressed names.
>
> For running experiments it would be useful to publish a sample corpus of
> mangled names that we will be using for comparing the algorithms and
> approaches.
>
>
> nm -a libswiftCore.dylib > strings.txt
>
> I also have a concern about making mangled names completely unreadable.
> Today, I can frequently at least get a gist of what the referenced entity is
> without a demangler.  What we could do is make the name consist of a
> human-readable prefix that encodes just the base name and a compressed
> suffix that encodes the rest of the information.
>
> _T<length><class name><length><method name><compressed suffix>
>
>
> When are you looking at symbols directly?   SIL already shows the detangled
> names as comments, and for everything else there is a demangler.

System tools to inspect object files on Linux don't know about Swift
mangling.  Yes, I can pass their output through a demangler, sometimes
(when they don't have a GUI for example).

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/


More information about the swift-dev mailing list