<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko &lt;<a href="mailto:gribozavr@gmail.com" class="">gribozavr@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote">+ Stephen Canon, because he probably has good ideas in this domain.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <span dir="ltr" class="">&lt;<a href="mailto:swift-dev@swift.org" target="_blank" class="">swift-dev@swift.org</a>&gt;</span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><b class=""><u class=""><br class=""></u></b></div><div class=""><b class=""><u class="">What’s next?</u></b></div><div class=""><br class=""></div><div class="">The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.&nbsp;</div></div></blockquote><div class=""><br class=""></div></div></div></div></div></blockquote><div><br class=""></div><div>Hi Dmitri,&nbsp;</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class="">Hi Nadav,</div><div class=""><br class=""></div><div class="">This is a great start that shows that there is a potential for improvement in our mangled names!</div><div class=""><br class=""></div><div class="">To make this effort more visible, I would suggest creating a bug on&nbsp;<a href="https://bugs.swift.org/" class="">https://bugs.swift.org/</a> .</div><div class=""><br class=""></div><div class="">I think we survey existing solutions that industry has developed for compressing short messages.&nbsp; What comes to mind:</div><div class=""><br class=""></div><div class="">- header compression in HTTP2:</div><div class=""><a href="https://http2.github.io/http2-spec/compression.html" target="_blank" class="">https://http2.github.io/http2-spec/compression.html</a><br class=""></div></div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">- PPM algorithms are one of the best-performing compression algorithms for text.</div></div></div></div></blockquote><div><br class=""></div>It is important to keep in mind that we are not inventing a new text-compression algorithms here. We are solving a specific problem with specific characteristics and constraints. Nothing is groundbreaking here. Lempel-Ziv based algorithms rely on repetitions within the text, and can use some prior statistical knowledge on the text (pre-computed tables). Our mangled names are relatively short and don’t have significant repetitions, so having a reference to some pre-computed table, and not having an internal reference is obvious.&nbsp;</div><div><br class=""></div><div>On top of the basic encoding layer we need to apply some kind of variable length encoding (Cameron suggested Huffman). The restricted character set is not a problem here. To encode a string of bits using the charset [a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the whole bit stream.&nbsp;</div><div><br class=""></div><div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_extra">- Arithmetic coding is also a natural starting point for experimentation.</div><div class="gmail_extra"><br class=""></div>Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.</div></div></div></blockquote><div><br class=""></div><div>The upper bits never come into play. You can conceptually imagine that bytes have ~6 bits (63 combinations). You only need to worry about these kind of things when you use variable length encoding (see my previous comment about Cameron’s suggestion to use Huffman encoding as a second pass).</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><br class=""></div><div class="gmail_extra">We should also build a scheme that uses shortest one between the compressed and non-compressed names.</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.</div><div class="gmail_extra"><br class=""></div></div></div></blockquote><div><br class=""></div><div>nm -a libswiftCore.dylib &gt; strings.txt</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra">I also have a concern about making mangled names completely unreadable.&nbsp; Today, I can frequently at least get a gist of what the referenced entity is without a demangler.&nbsp; What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">_T&lt;length&gt;&lt;class name&gt;&lt;length&gt;&lt;method name&gt;&lt;compressed suffix&gt;</div></div></div></blockquote><div><br class=""></div><div>When are you looking at symbols directly? &nbsp; SIL already shows the detangled names as comments, and for everything else there is a demangler.&nbsp;</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><br class=""></div><div class="gmail_extra">We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">We should also investigate improving existing mangling scheme to produce shorter results.&nbsp; For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity.&nbsp; This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.<br clear="all" class=""></div></div></div></blockquote><div><br class=""></div><div>Yes.&nbsp;</div><div><br class=""></div><div>Thanks,</div><div>Nadav</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class=""><br class=""></div><div class="">Dmitri</div><div class=""><br class=""></div>-- <br class=""><div class="">main(i,j){for(i=2;;i++){for(j=2;j&lt;i;j++){if(!(i%j)){j=0;break;}}if<br class="">(j){printf("%d\n",i);}}} /*Dmitri Gribenko &lt;<a href="mailto:gribozavr@gmail.com" target="_blank" class="">gribozavr@gmail.com</a>&gt;*/</div>

</div></div>

</div></blockquote></div><br class=""></body></html>