[swift-dev] Reducing the size of Swift binaries by shortening symbols

Sun Dec 20 23:43:44 CST 2015

> On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
> 
> + Stephen Canon, because he probably has good ideas in this domain.
> 
> On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <swift-dev at swift.org <mailto:swift-dev at swift.org>> wrote:
> 
> What’s next?
> 
> The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top. 
> 

Hi Dmitri, 

> Hi Nadav,
> 
> This is a great start that shows that there is a potential for improvement in our mangled names!
> 
> To make this effort more visible, I would suggest creating a bug on https://bugs.swift.org/ <https://bugs.swift.org/> .
> 
> I think we survey existing solutions that industry has developed for compressing short messages.  What comes to mind:
> 
> - header compression in HTTP2:
> https://http2.github.io/http2-spec/compression.html <https://http2.github.io/http2-spec/compression.html>
> 
> - PPM algorithms are one of the best-performing compression algorithms for text.

It is important to keep in mind that we are not inventing a new text-compression algorithms here. We are solving a specific problem with specific characteristics and constraints. Nothing is groundbreaking here. Lempel-Ziv based algorithms rely on repetitions within the text, and can use some prior statistical knowledge on the text (pre-computed tables). Our mangled names are relatively short and don’t have significant repetitions, so having a reference to some pre-computed table, and not having an internal reference is obvious. 

On top of the basic encoding layer we need to apply some kind of variable length encoding (Cameron suggested Huffman). The restricted character set is not a problem here. To encode a string of bits using the charset [a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the whole bit stream. 

> - Arithmetic coding is also a natural starting point for experimentation.
> 
> Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.

The upper bits never come into play. You can conceptually imagine that bytes have ~6 bits (63 combinations). You only need to worry about these kind of things when you use variable length encoding (see my previous comment about Cameron’s suggestion to use Huffman encoding as a second pass).

> 
> We should also build a scheme that uses shortest one between the compressed and non-compressed names.
> 
> For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.
> 

nm -a libswiftCore.dylib > strings.txt

> I also have a concern about making mangled names completely unreadable.  Today, I can frequently at least get a gist of what the referenced entity is without a demangler.  What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.
> 
> _T<length><class name><length><method name><compressed suffix>

When are you looking at symbols directly?   SIL already shows the detangled names as comments, and for everything else there is a demangler. 

> 
> We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.
> 
> This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.
> 
> We should also investigate improving existing mangling scheme to produce shorter results.  For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity.  This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.

Yes. 

Thanks,
Nadav

> 
> Dmitri
> 
> -- 
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com <mailto:gribozavr at gmail.com>>*/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-dev/attachments/20151220/25939dd1/attachment.html>