<p><em>Primitives Engineering Blog, 2022-12-17</em></p>
<h1 id="yellow-manual---applied-politics-for-the-crypto-industry">Yellow Manual: Applied Politics for the Crypto Industry</h1>
<h2 id="preface">Preface</h2>
<p>Moderation is not an ideology<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. It is not an opinion.</p>
<p>It is not a thought.
Rather, it is an absence of thought.</p>
<p>In other words, the problem with moderation is that the “center” is not fixed. It moves. And since it moves, and people being people, people will try to move it. This creates an incentive for violence, but we should not look at this as a moral problem; rather, it is an engineering problem. Any solution that solves the problem is acceptable.</p>
<p>Any solution that does not solve the problem is not acceptable.</p>
<h2 id="motivation">Motivation</h2>
<p>This manual exists because it was required. Under normal circumstances, such a manual
would never have been written, because the author does not believe such information can be
conveyed without on-deck coordination. With that said, it exists, and it serves a purpose.<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>This manual cannot solve systemic problems; it cannot deliver miracles, and it cannot
turn the world upside down.</p>
<p>What this manual can and does provide is the primary guidance for building the tools
necessary for a competent enough force to try to alleviate some of the problems currently
faced by the crypto industry, as well as some of the problems that will soon befall it.</p>
<h3 id="objectives">Objectives</h3>
<p>This document’s purpose is to analyze the situation and to provide strategic knowledge and easy-to-follow
instructions meant to fulfill the agreed-upon goal(s).</p>
<p>We start with a SWOT analysis because this industry is brand new. It literally did
not exist 15 years ago. That does not mean centuries-old
knowledge cannot be applied to it. It certainly can, but there are challenges uniquely
associated with this cause that do not exist with almost any other.</p>
<p>With that out of the way, we can proceed to the kind of people needed for the
operation, the institutional architecture, how to use it, country-specific issues and, finally,
some conclusions.</p>
<h3 id="defining-patterns">Defining Patterns</h3>
<p>We use the term “pattern” for items directly relevant to our protocol, and “anti-pattern” for items whose interpretation is more subjective, or whose inclusion is hard to determine automatically.</p>
<h4 id="anti-patterns-for-governance-proposals">Anti-Patterns for Governance Proposals</h4>
<ul>
<li>Proposals that address a problem that has not been defined</li>
<li>Proposals that address a problem that no longer exists</li>
<li>Proposals that address more than one problem</li>
<li>Proposals that have no stated purpose</li>
<li>Proposals whose language is vague or complex</li>
<li>Proposals that cannot achieve their stated goal</li>
</ul>
<h3 id="axioms-and-principles-for-governance-actions">Axioms and Principles for Governance Actions</h3>
<ul>
<li>
<p>A Requirement (the need for a new law) is realized by a Principle (the to-be law)</p>
</li>
<li>
<p>If you are in the business of producing laws, then the law is a Business Object</p>
</li>
</ul>
<p><img src="https://i.imgur.com/GyzuuBU.png" alt="" /></p>
<ul>
<li>
<p>Governance / Legal elements are not <em>passive</em>.</p>
</li>
<li>
<p>This document does not seek to define an <em>imperative</em> set but rather <em>relational</em> sets.</p>
</li>
<li>
<p>Restrictions of freedom perpetrated through explicit threats should not be any more prima facie problematic than restrictions perpetrated via other means.<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</li>
</ul>
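<p>The first two bullets can be read as a small relational model: a Requirement is realized by a Principle, and the resulting law is handled as a Business Object rather than a passive text. A sketch under those assumptions (all class and field names are ours, not the manual's):</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    """The need for a new law."""
    description: str

@dataclass(frozen=True)
class Principle:
    """The to-be law; it realizes exactly one Requirement."""
    title: str
    realizes: Requirement

@dataclass(frozen=True)
class BusinessObject:
    """If you are in the business of producing laws, the law itself is a
    Business Object: produced, versioned and related to other objects,
    that is, not a passive element."""
    principle: Principle
    version: int = 1
```

This is what "relational rather than imperative sets" amounts to in practice: the model captures how elements relate, not a procedure to execute.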
<h2 id="patterns-for-finding-important-governance-patterns">Patterns for Finding Important Governance Patterns</h2>
<h3 id="patterns">Patterns</h3>
<ul>
<li>Laws that address a problem that has not been defined</li>
<li>Laws that address a problem that no longer exists</li>
<li>Laws that address more than one problem</li>
<li>Laws that have no stated purpose</li>
<li>Laws whose language is vague or complex</li>
<li>Laws that cannot achieve their stated goal</li>
</ul>
<h3 id="anti-patterns">Anti-Patterns</h3>
<ul>
<li>Laws that address problems that have not been defined</li>
<li>Laws that address problems that no longer exist</li>
<li>Laws that address more than one problem in different domains</li>
<li>Laws that lack a stated, measurable problem-solving goal or purpose</li>
<li>Laws that fail to achieve their goal or lack stated goals</li>
<li>Laws that lack a citation of references</li>
<li>Laws whose burdens are greater than their problem-solving benefit</li>
<li>Laws whose problem-solving benefit and burdens are equal</li>
<li>Laws whose results cannot be measured</li>
<li>Laws that interfere with other laws</li>
<li>Laws that duplicate other laws</li>
<li>Laws flagged as requiring review</li>
<li>Laws that are not enforced*</li>
<li>Laws that are overly vague or complex*</li>
<li>Laws that have not undergone QA analysis within a specified time frame</li>
</ul>
<h2 id="governance-primitives-for-potential-smart-contract-events--emits">Governance Primitives for (Potential) Smart Contract Events / Emits</h2>
<p>Now that we have established governance patterns and a classification system, we can begin to map out how these relationships present themselves: by acting upon other elements, by being acted upon, by emitting events, and so on.</p>
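<p>As a sketch of what such events might look like off-chain, here is a minimal Python event emitter mirroring smart-contract <code>emit</code> semantics. The event names used below (<code>ProposalSubmitted</code>, <code>LawEnacted</code>) are hypothetical; the manual does not define a concrete event set:</p>

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GovernanceEventBus:
    """Minimal event emitter mirroring smart-contract `emit` semantics.
    Event names are hypothetical illustrations, not a defined schema."""
    _handlers: dict[str, list[Callable[[dict], None]]] = field(default_factory=dict)

    def on(self, event: str, handler: Callable[[dict], None]) -> None:
        """Subscribe a handler to an event name."""
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, payload: dict) -> None:
        """Deliver the payload to every subscriber of this event."""
        for handler in self._handlers.get(event, []):
            handler(payload)

# Usage: an off-chain indexer might subscribe to enactment events.
bus = GovernanceEventBus()
seen = []
bus.on("LawEnacted", seen.append)
bus.emit("LawEnacted", {"law_id": 7, "acts_upon": "ProposalSubmitted"})
```

The "acting upon / being acted upon" relationships then become payload fields on the emitted events.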
<h2 id="industry-swot">Industry SWOT</h2>
<p>To execute the task(s) at hand, there are aspects that need to be kept in mind
throughout the execution of the task(s), until or unless reality changes. The crypto industry
itself is volatile so, for all intents and purposes, reality itself may very well change as far
as the industry is concerned. This is why this SWOT analysis should not be believed to be
exhaustive for more than 24 months from the time this material was produced.</p>
<h3 id="strengths">Strengths</h3>
<ul>
<li>
<blockquote>
<p>The industry is new. Which means it can attract public (think: agitation) support from
people otherwise opposed to the ideals the author of this manual holds dear (think:
progressives, hipsters, techno-optimists, etc.)</p>
</blockquote>
</li>
<li>
<blockquote>
<p>The industry naturally attracts people who distrust government. This means the cost
of training people for the operation is lower, since they no longer have to be
convinced that the State is not their friend.</p>
</blockquote>
</li>
</ul>
<h3 id="weaknesses">Weaknesses</h3>
<ul>
<li>
<blockquote>
<p>Cryptocurrency is difficult to use, difficult to explain to the public and, so far,
of limited use for normies.</p>
</blockquote>
</li>
</ul>
<p>Now, sure, the counter-argument would be that exchanges exist, and all sorts of
websites/firms/services now exist with friendly interfaces that can mitigate the difficulty
of use by removing the necessity of holding one’s own wallet. Of course, that also means those
exchanges are subject to regulation (by removing anonymity, for starters), thus
defeating the point of crypto and torpedoing the oft-used talking point that “crypto is
like cash, but better”. It most obviously is not. It may become that, but it is not at
the current moment.</p>
<p>This weakness may change some day. But if the industry goes the way of FOSS
(which also appeared as an “alternative” to “corporate industry” with its “free as in
freedom” software), this is highly unlikely to happen.</p>
<p>Bottom line: anyone who doesn’t understand that cryptocurrency is here to stay as a
niche instrument (because it’s difficult to use and of limited use for normies) cannot be
of help in the execution of the task(s) at hand.</p>
<p>In crypto circles, this is countered by all sorts of ‘copium’ and dodgy numbers.</p>
<p>For instance, the most “reliable” figure on crypto use says 4.2% of the global population,
or 320 million people, are using crypto^1. If that were true, it would make crypto about a
tenth of the video game market (which now stands at around 3.24 billion people).</p>
<p>(^1) https://triple-a.io/crypto-ownership-data/ (archived link: https://archive.ph/wip/EJUcD )</p>
<p>The comparison with the video game industry is not random. Video games are in and of
themselves a niche, culturally and legislatively speaking. #Gamergate was a partial success
even though it was backed by almost 50% of the global population.</p>
<p>Crypto, being a tenth of that, stands no chance of defending itself publicly in the culture.</p>
<p>But the reality is that crypto is not even that. Heck, the source for that number is a
dodgy website ending in .io (a red flag for any normie) that claims to represent a
company registered with the Central Bank of Singapore - which makes it a no-go for
privacy-minded people, since their exchange is, in effect, nothing more than a glorified
CBDC.</p>
<ul>
<li>
<blockquote>
<p>Cryptocurrency is very easy to demonize in public propaganda, especially by the
Regime.</p>
</blockquote>
</li>
</ul>
<p>This cannot change for as long as cryptocurrency remains very difficult to use.</p>
<h3 id="opportunities">Opportunities</h3>
<ul>
<li>
<blockquote>
<p>Cryptocurrencies can be hard to trace. This means any operation to defend the
crypto industry can attract the support of people one wouldn’t normally think of when it
comes to a lobbying operation (think: criminals, the underworld, etc.)</p>
</blockquote>
</li>
<li>
<blockquote>
<p>Increasing adoption can only happen if crypto becomes easier to use while at the
same time maintaining all of its selling points. Any financing toward that should be
considered (incl. from dodgy sources), since it would help the industry as a whole.</p>
</blockquote>
</li>
<li>
<blockquote>
<p>Being marginal also means any operation benefits from the element of surprise.
Nobody expects a crypto lobbyist. Just like nobody expected a gamer lobbyist.</p>
</blockquote>
</li>
</ul>
<h3 id="threats">Threats</h3>
<ul>
<li>
<blockquote>
<p>The crypto industry is based on almost blind trust in technology. For the
task(s) at hand to be effectively executed, the exact opposite mentality will be needed.</p>
</blockquote>
</li>
<li>
<blockquote>
<p>The Regime. This threat, of course, will never go away. Apologies for stating the
obvious.</p>
</blockquote>
</li>
<li>
<blockquote>
<p>Internal mess. The fact that the industry is new means there are no rules. Oftentimes
that’s a great thing, but it’s also a threat in multiple ways:</p>
</blockquote>
</li>
<li>in propaganda: The Regime can always credibly say it’s an industry of crooks,
because crooks routinely do thrive in this industry</li>
<li>in agitation: Almost all crypto-bros are cringe and routinely shady (this can be
mitigated by growing your own crypto-bros that are not cringe or shady)</li>
</ul>
<h1 id="necessary-institutional-architectureorganizational-structure">Necessary institutional architecture/organizational structure</h1>
<p>To execute the task(s) ahead of you, the organizational structure has to mimic the one of
a PIA (Private Intelligence Agency).</p>
<p>The author is not aware of the budget and dedication that the reader has committed to
this, so the author will assume a mid-tier of both.
It should be noted that the proposed structure is modular. That is to say, depending
on the budget and assets available, the organizational structure can lack some of the
elements and still more or less work (with the obvious trade-off of decreased effectiveness).</p>
<p>The following graph represents an adaptation (and a slight downsizing) of the structure of
AggregateIQ.</p>
<p>Please notice the arrows. Where an arrow is drawn, the individuals know each
other. Where there is none, they should not know each other.</p>
<p>The task(s) required by the reader can best be executed if as few people as possible
know the whole agenda. All intel should be transmitted on a “need to know” basis. No
more than three people should be aware of the entirety of the agenda. Even three is
too risky - but fewer than three jeopardizes the execution, so it’s a trade-off that can’t be
avoided.</p>
<p>Everything in the pink circle represents the leadership. These 7 people should meet
in person as rarely as possible (no more than twice a year, preferably once every other
year). At the same time, these 7 people should communicate with each other
electronically as rarely as possible (preferably never, but that’s very difficult to achieve in
this day and age). It goes without saying that when they do communicate electronically,
all traces should be shredded as soon as possible.</p>
<p>In terms of legal entities needed: an advocacy group / NGO and a think tank (which is
usually an NGO as well) are needed.</p>
<p>These need to be “serious” institutions. They will be glorified paid shills alright, but they
need to be able to provide very serious policy papers and talking points and must not be
officially connectable to each other.</p>
<p>The purpose is to create the illusion of a societal debate in which a think tank (yours)
happens to agree with some of the goals of a random advocacy group (also yours, made
to be harsher than needed), and in which their research then gets picked up by a serious academic
(your paid shill in the institutions). Once that process is done, the lobbyist can start his
work.</p>
<p>Depending on the budget this operation has, new circles can emerge underneath the
blue circles. That is to say, each of the four blue circles is free to create its own
network. In fact, if possible, it is preferable that this happens, as long as the networks don’t know of
each other.</p>
<p>Those in the advocacy group must not know that their work is the primary research for a
serious think tank. Those in the think tank must not know that their work is expected and
thoroughly read by a paid shill in the institutions who is on the same payroll as they are.</p>
<p>Also, nobody from the blue circles (and most definitely nobody from the circles beneath them)
must know more than two people from the pink circle. Ideally, nobody from beneath the
blue circles should know anyone from the pink circle.</p>
<p>The more rigid the chain of command is kept, the faster the results can be delivered.</p>
<p>Enforcing those limits can prove to be quite a challenge (working with people is the
hardest job), but it’s doable with the right incentives.</p>
<h1 id="who-to-recruit">Who to recruit</h1>
<p>There are two different criteria, depending on the circle. We will go through each of them.</p>
<h2 id="the-pink-circle">The pink circle</h2>
<p>In the leadership structure, loyalty should always be personal rather than institutional.
Loyalty to individuals makes the world go round. Institutions come and go; personal
loyalty stays.</p>
<p>Generally, these are the conditions:</p>
<ul>
<li>male (always male!)</li>
<li>speaker of at least two languages besides English (Arabic, Afrikaans, Russian, Spanish
and Mandarin are the most common languages besides English in this space)</li>
<li>ability to develop personal loyalty (pre-existent loyalty is preferable)</li>
<li>willingness to accept irregular work schedule at weird/uncommon hours</li>
<li>pre-existent experience in intelligence is preferable</li>
<li>little or no family obligations (or willingness and ability to manage them competently if
such obligations exist)</li>
</ul>
<p>The policy wonk should be from outside the industry.
The current or former deep-state member should be very knowledgeable about the industry. If you
catch a true believer who also fits the criteria, you’ve hit the jackpot (nobody is more hard-working
than a recent convert).
The jurist/lawyer should preferably have experience in/with the criminal underworld.</p>
<h2 id="the-ngo-circle">The NGO circle</h2>
<p>The person running your NGO should be someone who has run non-profits in the
designated country before. Ideally, someone at least partially acquainted with the notion
of strategic messaging.</p>
<div class="important">IMPORTANT: You need an NGO in every country you want to influence!</div>
<p>The head of the NGO should have a personal loyalty to you or someone in the pink circle.</p>
<p>The head of the NGO is the most trusted person outside the pink circle. He needs to
deliver the strategic messaging, recruit volunteers, manage fundraisers, build up a
positive image in the community (e.g. children’s camps, feeding the homeless, etc.). The
NGO is the propaganda arm that the public sees.</p>
<p>For instance, for great public impact, a campaign to feed the homeless should be funded
by legally dodgy fundraising through crypto - thus drawing the public’s attention to how a
force for good is hampered by the paranoid policies of the State.</p>
<p>The head of the NGO should be a person who has no qualms about calling “national security”
names and being ruthless about it in opinion pieces / editorials.</p>
<p>The head of the NGO should be someone not known as a conservative/libertarian.
It should be someone whose paid articles would be taken up by the Washington Post, the New
York Times, and others. He should always be known as “Executive director of [your
NGO]”, and it should be as hard as possible for the Regime media to tie him to any non-Leftist
causes.</p>
<h3 id="lobbyist">Lobbyist</h3>
<p>At first, the best course of action is to hire an existing lobbyist, or a lobbying firm.
Then, in time, try to either “steal the profession” (i.e. have someone of your own
learn how to do it from a pro and, in time, take his place) or “steal the employee” (the
lobbyist the firm assigns to you ends up on your payroll) if he is that good.</p>
<p>Of course, you need at least one lobbyist in every country of interest.</p>
<h3 id="think-tank">Think Tank</h3>
<p>Creating a think-tank is not difficult from a legal standpoint - but running it can be
complicated.</p>
<p>The advantage is that you can have only one think tank for all the operations. You don’t
really need one in every country.</p>
<p>Unfortunately, things that appear Russia-related are not fashionable these days. Which
means that connecting Vitalik Buterin to your think-tank officially comes with some PR
challenges.
With that said, the think tank should be connected to a personality. That personality does
not necessarily have to be involved in your think tank in any capacity (though it doesn’t
hurt if he or she is). Think: Elon Musk Policy Initiative or Vitalik Buterin Public
Policy Committee or… Anatoly Yakovenko Policy Institute.</p>
<p>If such a personality is unavailable, then at the very least the think tank should strive to
have a “neutral sounding” name. The author’s proposals:</p>
<ol>
<li>Committee for Economic Empowerment (CEE)</li>
<li>Economic Empowerment Council (EEC)</li>
<li>Council on Economic and Social Science Research (CESSR)</li>
<li>The Twenty-First Century Fund</li>
<li>The Twenty-First Century Economic Initiative</li>
<li>The Progress and Prosperity Initiative (PPI)</li>
</ol>
<p>Pick one of those. Under no circumstances should a think tank related to this operation
have an honest/upfront name. The institute behind cultural Marxism, the Frankfurt School, is officially called neither of those things.</p>
<blockquote>
<p>It is called <em>Institut für Sozialforschung</em> (The Institute for Social Research).</p>
</blockquote>
<p>The most successful free-market think tank is called the Mackinac Center for Public Policy
(notice the absolute lack of words like Liberty, Free Markets, Flat Taxes or other honest
keywords in its name).
The author cannot stress enough just how important it is for the think tank not to have a
name that betrays its agenda.</p>
<h3 id="paid-shill-in-institutions">Paid Shill in institutions</h3>
<p>Naturally, the first targets for this task are economics professors and career bureaucrats in
the EB (Bureau of Economic and Business Affairs) and the US Department of Commerce.</p>
<ul>
<li>Their equivalents in other targeted countries are also targets (usually a Ministry of the
Economy or, in the case of the UK, the Chancellor). The DoJ (and, respectively, the Ministry of Justice
in other countries) is likewise a primary target, because they make the political
decisions on whether and how laws are enforced.</li>
</ul>
<p>But immediately after that, any university professor and any “journalist” is a legitimate
target for becoming a shill (paid or not) in the institutions.</p>
<div class="note">NOTE: Ultimately, crypto is a radical agenda because it seeks to upend the traditional financial
system: in order to succeed, the agenda needs shills in key positions wherever it can get
them.
</div>
<h2 id="how-to-recruit">How to recruit</h2>
<p>When it comes to operations like these, effective recruitment can be very expensive.</p>
<p>For starters: <strong>Nobody you haven’t met in person should be recruited for any top role</strong>
(especially the pink circle!). All recruitment has to be done in person - in order to make
it easier to create personal loyalty.</p>
<p>Secondly: For the “pink circle” you should do it yourself (alongside one or two
confidantes - preferably people with beliefs as similar to yours as possible).
For the blue circles, you can rely on a recruitment firm to filter candidates and use a
predefined process - but you and your pink circle should have the final say after you
meet the selected candidates in person.</p>
<p>Thirdly: The process should not be publicly transparent. As few people as possible
should know how you selected a person.</p>
<p>When it comes to lobbying, preference should be given to potential lobbyists who
already have personal relationships with lawmakers or relevant bureaucrats.</p>
<p>Get a place that is simultaneously easily accessible and at least in part off the grid. Or at
least a room like that. With white noise (to prevent any recording) and electromagnetic
shielding (to prevent broadcast). Ideally, all sensitive information - including ironing out
the details of recruitment with any individual and doling out of tasks - should be spoken
only there.</p>
<p>All of this may sound like overkill - but it doesn’t take long to get on the radar of the
Regime. So at least make it hard for them to find out the details.</p>
<p>When approaching a new potential recruit, always have them visit your place. Ideally,
record (or photograph), from the outside, the moment they enter your establishment. Those
pictures may end up being helpful down the line if that individual later becomes
a politician or influential figure and turns against you. Yes, this is nasty, but… welcome
to politics.</p>
<p>Always read up on your potential recruits from their haters. If they don’t have haters,
then they’d better have a very good case. But, generally, in public-facing roles and
lobbying - those that don’t have haters are probably weak.</p>
<h2 id="initiation-on-effective-lobby">Initiation on effective lobby</h2>
<p>All lobbyists, propagandists, political operatives, researchers, etc. etc. like to portray their
work as immensely complicated. And, sometimes that’s true. But oftentimes it’s not. It’s
more like a “crime of opportunity” + basic formula + knowledge.</p>
<p>This manual can’t help with all the knowledge (because a lot of it is situational, and there’s no
way all or even most situations can be conceived of ahead of time, and also because
this manual is not supposed to be a novel). But this manual can explain the basic
formula.</p>
<h3 id="procedural-methodology">Procedural methodology</h3>
<p>All countries, effectively everywhere, including dictatorships, do not function so much on
Law or the Dictator’s will - but on procedure. All lobbyists, judicial activists and other
people in this line of work that are worth their salt know that procedure is God.</p>
<p>So here’s how this works:</p>
<p>Congress/Parliament passes a law saying that trading securities without a license and
without paying the taxes owed is an offense punishable by 2 to 48 months in prison and/or
a fine of no more than $200,000. [This is a textbook example!]</p>
<p>Okay… now what? Everyone trading Bitcoin without getting a license goes to jail, right?
Because you failed to convince the Congress/Parliament not to pass the law in the first
place. Well… not so fast.</p>
<p>What is a security? If the law doesn’t say, then the burden to define it falls on either the
Cabinet (through something called methodological norms - or detailed rules or detailed
arrangements) or on a specific agency (the SEC in the US, the FCA in the UK, the CSA
in Canada, etc.).</p>
<p>This presents the second lobbying opportunity. The people in those agencies have the
power to define “trading securities” as “trading securities worth more than
$1,000,000/year”. If you get them to do that, effectively retail use of cryptocurrency
remains untouched.</p>
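<p>The effect of such a definitional threshold is easy to illustrate. A sketch in Python, assuming the hypothetical $1,000,000/year definition described above:</p>

```python
def is_regulated_trading(annual_volume_usd: float,
                         threshold_usd: float = 1_000_000) -> bool:
    """Under the lobbied-for agency definition sketched in the text,
    only trading above the yearly threshold counts as 'trading securities'."""
    return annual_volume_usd > threshold_usd

# Retail use stays untouched; only large traders fall under the rule.
print(is_regulated_trading(25_000))      # a retail trader
print(is_regulated_trading(2_500_000))   # a large desk
```

The point is that the statute's words never change: the coverage is decided entirely by the definition the agency plugs in.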
<p>But let’s say you fail on that too, and the agencies define it as “any trading of any
security not licensed by this agency”. Now everyone who trades Bitcoin goes to jail, right?
Well… not so fast.</p>
<p>If the law doesn’t say who can prosecute this (i.e. who has prosecutorial jurisdiction),
then the presumption (in the USA and Canada) is that the local authority does. In the
USA, the powers not delegated to the United States by the Constitution (or, in practice, by
Congress or SCOTUS), nor prohibited by it to the States, are reserved to the States
respectively, or to the people.</p>
<p>This means there is a lobbying opportunity at the state level. If states, or even cities, can
declare themselves “sanctuary states/cities” and refuse to enforce federal law on
immigration, then there is nothing stopping states from declaring themselves crypto havens.
Such a decision is usually made at the gubernatorial level but, if you’re ballsy, you go for
the jugular and try to have limitations on the application of the law passed through the
State legislature. From that point on, it becomes a complicated legal matter for the State
AG to litigate against the Feds. That can take years - years which can then be used to
bullhorn just how good a policy it is (through your think tank(s), your NGO(s), your
shills, etc.) - and by the time it reaches SCOTUS, public opinion may be radically
different.</p>
<p>But let’s assume no governor anywhere is willing to give a damn and no state legislature
anywhere is willing to go to legal war with the Feds on this matter.</p>
<p>In such a case, enforcement still falls to the state/local police first and foremost. Do
those people even know what cryptocurrency is? The answer is obviously NO.</p>
<p>So there is a lobbying opportunity at the police level. A nice donation to the NYPD secures a
lot of political favors (just ask Cuomo or Trump). This is also possible with the London
Metropolitan Police. And it’s definitely possible in South Africa. Not so possible in Singapore,
though.</p>
<p>But let’s say the police are uninterested in favors and will follow through on their duty to enforce the law.</p>
<p>Can they do it competently? Most of the time, the answer will be no. So they will defer
to the feds (in the case of the US and Canada).</p>
<h3 id="common-law">Common Law</h3>
<p>In some countries (the UK, Singapore, South Africa and most of the EU), a law can be
nullified by a judge if it’s not clear enough. Making the case for that requires a
competent lawyer (which may not be cheap) - but, if done right, it works most of the
time, because most laws do not actually meet the criteria for being clear enough (and the
criteria for that are detailed in thick books on “judicial procedure” - there’s that word
again).</p>
<p>The example given above (with all its imperfections and attempts to fit it through the
legal framework of more than one country) does, however, show how a law can (and
almost always does) become, in practice, something other than what the public
pressured/elected legislators to enact.</p>
<p>This is true of almost any law. The public definitely didn’t want the EPA to harass
small farmers for having a pond on their property^2 by spending millions of dollars on
pointless lawsuits and threatening citizens with fines of $37,500 per day^3.</p>
<p>(^2) https://pacificlegal.org/case/johnson-v-environmental-protection-agency/</p>
<p>The lawmakers also certainly didn’t think this was possible when they entrusted
the EPA with the power to regulate environmental concerns. Yet it did happen. And
there are two reasons for that: bureaucracies’ inherent tendency to expand their power (the
libertarians’ favorite reason to cite) and lobbying by radical environmentalists (the reason
nobody cites in public, because then the public would become too aware of the reality).</p>
<p>Radical environmentalists are 0.5% of any country (maybe even less). But they are very
noisy, very well funded, and have worked 40 to 60 years to get their policies across. Their
policies, not their message. Most of their message AND their policies remain wildly
unpopular. But so what?</p>
<p>The radical environmentalists were never picky about their funds. They accepted funds
from the USSR to fight against nuclear energy, then from the Russian Federation to fight
in favor of gas in Europe but against gas in the USA, then from China to fight in favor of
electric cars (90%+ of batteries are made in China) and so on.</p>
<p>The reason this example is given is also to underscore the point made in the SWOT
analysis. Crypto advocacy should seriously consider accepting financial backing even
from unsavory corners such as the criminal underworld or the state of Kazakhstan (which
produces cheap energy and is thus a hub for crypto mining - making it in its interest
to have crypto lightly regulated in the West). There is a moral line here, so everyone will
have to make their own decisions. If the operation is to succeed, it will need to be a bit more
flexible on these “moral” questions. And “moral” is in scare quotes because some of these
business questions have been turned into “moral” ones by the CFR (another think tank!
Which is why you need your own think tank!).</p>
<p>Here’s a graph:</p>
<p>The basics of lobbying are to be able to interfere and intervene in all 5 of these circles.</p>
<p>The public is only aware of the top circle (where lobbyists try to influence legislators).
But actually, most of the lobbying is done in the other 4 circles.</p>
<h3 id="how-to-write-policy">How to write policy</h3>
<p>The nitty-gritty is done by the Think Tank in cooperation with your Jurist/Lawyer in
particular and the Pink Circle in general.</p>
<p>It is important though to separate dissemination. The Think Tank must appear to be
working independently from the advocacy NGO.</p>
<p>All else can be connected (or not) depending on the current political project, the
propagandistic environment or other constraints (or lack thereof). Here’s a graph:</p>
<p>The author insists on these graphs because this way of doing things has proven itself to be
the most likely to succeed.</p>
<p>Lobbying, activist policy-making and all of this is not a 100% exact science. It’s also an
art. And it’s made more difficult because you work with people. However, abiding by the
institutional architecture laid out here removes many of the obstacles and follies that
other activists and activist groups end up encountering because of their ignorance of how
the world works and their excessive focus on how they believe the world should work.</p>
<p>The most important thing is to take full advantage of the institutional architecture.</p>
<p>Before tasking the Think Tank to write the policy, you (and the Pink Circle) must answer
the following questions after coming up with a new idea:</p>
<ul>
<li>Does this policy forward our strategic goals (short, medium and long term)?</li>
<li>Do we personally know at least one relevant legislator with a strong chance of taking
up the issue?</li>
<li>Who do we have in the media that would push this with a straight face?</li>
<li>Can this realistically be formulated into public policy? (here the jurist/lawyer has the
most important word)</li>
<li>Can this be done without resorting to government?</li>
<li>Who would oppose? (here the former/current deep state has the most important word)</li>
<li>Can this be sold to the public without fomenting a backlash? (here the propagandist has
the most important word; and if the answer is NO, this should be reconsidered)</li>
<li>Is there precedent or close enough analogy? (policy wonk should know this)</li>
</ul>
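<p>For readers who want to operationalize this gate, the checklist above can be sketched as a small script. This is purely illustrative: the question labels, function name and verdict strings are the editor’s assumptions, not terminology from the manual.</p>

```python
# Illustrative sketch of the pre-drafting checklist above.
# Question labels, function name and verdicts are assumptions,
# not wording taken from the manual itself.

GATING_QUESTIONS = [
    "forwards strategic goals",
    "relevant legislator likely to take it up",
    "media contact who would push it with a straight face",
    "can realistically be formulated into public policy",
    "can be done without resorting to government",
    "opposition has been mapped",
    "sellable to the public without fomenting a backlash",
    "precedent or close-enough analogy exists",
]

def vet_idea(answers: dict) -> str:
    """Gate a new policy idea before it reaches the Think Tank.

    `answers` maps each question to True (satisfactory) or False.
    Per the manual, a NO on the public-backlash question alone
    means the idea should be reconsidered.
    """
    if not answers.get("sellable to the public without fomenting a backlash", False):
        return "reconsider"
    if all(answers.get(q, False) for q in GATING_QUESTIONS):
        return "pass to the think tank"
    return "rework internally"
```

<p>The failed-backlash case vetoes the idea outright, mirroring the manual’s instruction that a NO there “should be reconsidered”; any other unsatisfactory answer sends the idea back for rework rather than on to the Think Tank.</p>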
<p>If the answers to these questions are satisfactory, then the idea should be passed to the
think tank and the propagandist should be meeting the NGO.</p>
<p>If your architecture doesn’t (yet) include a think tank, then policy wonk and jurist from
the Pink Circle should start writing.</p>
<p>You may want (at least at first, when funds are tight) to have the same people (policy
wonk and jurist) also de facto lead the Think Tank.</p>
<p>Whatever you do, though - never come out in public with a new idea without such an
internal process.</p>
<p>An idea that sounds great to a crypto enthusiast may end up being frightening to the
public. And since in the public crypto is still new/fringe - the last thing you want is to
give the Regime another reason to start vilifying you.</p>
<h3 id="agitprop">Agitprop</h3>
<p>The NGO should deal with most of the practical aspects of agitprop.</p>
<p>The same idea can rarely be pitched the same way in two different countries.</p>
<p>So, for instance, a policy that may allow people to make more money by reducing their
fiscal burden through crypto may be relatively easy to pitch to the American public -
where tens of millions of people actively despise the IRS and 100+ million people would
pay attention to something that could save them money by reducing their tax bill.</p>
<p>The exact same policy may not be possible to pitch at all to the British public - which
tends to be more reverential toward Big Government. Put simply: the British, Canadian,
Australian and NZ publics are content with the Maximal State. To them, higher taxes are not
the non-starter they are with Americans. So, in such countries, the NGO would have to
take an entirely different approach (such as, for instance, by avoiding public debate and
instead focusing on key marginal seats in the upcoming election to swing the vote for a
candidate willing to take up the issue).</p>
<p>In South Africa, such a policy would be meaningless (tax evasion is the norm anyway).
But in South Africa the same policy can be sold as empowering individuals (which is a
message that definitely resonates with middle class blacks).</p>
<p>A lesser-known fact is that Africans in general are more accustomed to electronic
money and electronic payments out of necessity: many central banks in African countries
(Zimbabwe, Uganda, Kenya, Nigeria) simply cannot print as much cash as is needed, either
because they can’t afford it (the case in Zimbabwe and Uganda) or because they can
print it but can’t distribute it fast enough (the case in Kenya and Nigeria).</p>
<p>Some of that is true in South Africa as well. So those people can be converted to try out
crypto much faster than Brits or rural Americans. There are more smartphones in Africa
than in the United States and Europe combined. And South Africa has the highest
mobile internet penetration (read: potential crypto customers who can then become
advocates) in Africa and one of the highest penetration rates in the world (only China,
where it’s mandatory, Sweden, which is small, and South Korea have higher rates).</p>
<p>Other agitprop tips:</p>
<ul>
<li>Have the NGO run at least two websites and not be transparent about owning them both</li>
<li>Have the NGO run a network of profiles on social media not transparently connected</li>
<li>The NGO should always try to subvert other protests and make them about your Cause</li>
</ul>
<h1 id="country-specific-indications">Country-specific indications</h1>
<p>The nitty-gritty of each country’s tax and securities law will be revealed by the relevant
structure in that country. The purpose of this chapter is to indicate which aspects of this
manual do or don’t apply to each country, the equivalents (if any), and the aspects of
lobbying and propaganda that are unique to that place.</p>
<h3 id="the-united-states-of-america">The United States of America</h3>
<p>The premise of this manual is that the operation is headquartered in the United States of
America.</p>
<p>If the reader wishes otherwise, the author still suggests that the operation be headquartered
in the US: of the 320 million crypto users worldwide, 46 million are in the United States,
and 51 million in North America. In other words, roughly a sixth of the whole market is in
North America. Not headquartering this in the USA would be a mistake, because not only is it
the biggest market, but the USA is also the source of light in terms of policy for most of
the English-speaking world.</p>
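<p>As a quick sanity check on these figures (the user counts are the manual’s own estimates), the regional shares work out as follows:</p>

```python
# Market-share arithmetic using the manual's own estimates (millions of users).
worldwide = 320
usa = 46
north_america = 51

usa_share = usa / worldwide            # 0.14375 -> about 14%
na_share = north_america / worldwide   # 0.159375 -> about 16%

print(f"US share of global crypto users: {usa_share:.1%}")
print(f"North American share of global crypto users: {na_share:.1%}")
```

<p>So North America holds roughly a sixth of the global user base - still the largest single regional market.</p>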
<p>What to keep in mind:</p>
<p>=> Make sure you are not subjected to Foreign Agents Registration Act (FARA)
requirements</p>
<p>You can have foreign employees/collaborators, but if they live in the US on a visa other
than B1/B2 (tourist/business), they are subject to FARA requirements if they are
involved (even distantly) in lobbying activities.
FARA requirements apply to US citizens as well if those individuals receive payments
from abroad or have a recent history of working for a foreign government or a large
foreign entity (especially Chinese or Russian).</p>
<p>Having one of your top members be on the public list of foreign agents is very bad
propaganda.</p>
<p>So, if you get a notification from the DoJ to register someone as a foreign agent, in
public that person’s relationship with your organization should be terminated. In private,
that person can be assigned another, less public, role.</p>
<p>=> Create personal loyalty relationships with pundits and legislators from both parties</p>
<p>Foreign policy hawks can be helpful sometimes.
For instance, since Russia’s invasion of Ukraine, Kyiv has not only legalized the use of
crypto, but instituted one of the most relaxed regimes in the world^4.</p>
<p>(^4) Ukraine legalizes crypto sector as digital currency donations continue to pour in -
https://www.cnbc.com/2022/03/17/ukraine-legalizes-cryptocurrency-sector-as-donations-pour-in.html
(published on March 17, 2022)</p>
<p>This means opportunity both for good (financing the Ukrainian army via black-ops/unaccounted-for
budgets) and for “bad” (money laundering, etc.). The people
who get to decide these things in the USA are 65% Democrat.</p>
<p>What the author is saying is that one should approach this issue without prejudging
things. There are potential friends of crypto in both the Democrat and the Republican
party. And there are haters of crypto in all parties, including so-called Libertarian.</p>
<p>Speaking of which: The organization should not waste too much time cultivating
relationships with the Libertarian Party. If they occur naturally, fine. But not much
intentional energy should be spent on the LP, because the LP will not be in power, or gain
so much as a Senate seat, anytime soon. Lobbying is about influencing current
Power. And current Power resides with the Democrats and Republicans.</p>
<p>=> Social media: Public-facing roles should nuke all of their social media accounts</p>
<p>The USA is a post-Puritan society - therefore it has the reflexes of a Puritan one. What you
said in 2012 can and will be used against you successfully in 2023. The easiest defense is to
keep no written records.</p>
<p>=> Lobbying Disclosure Act of 1995</p>
<p>In all countries, the rule is the same: what is not explicitly forbidden is implicitly
permitted. Make sure you don’t run afoul of the Lobbying Disclosure Act of 1995.
The best way to do that is to never meet the requirements for disclosure.</p>
<h3 id="united-kingdom">United Kingdom</h3>
<p>=> No official requirement for lobbyists to register</p>
<p>However, all departments have to publish quarterly online reports detailing
ministerial meetings with interest groups and hospitality received by ministers and their
advisers. Details of meetings between officials and outside groups do not have to be
published. This means the best course of action is to avoid “official” meetings as
much as possible and cloak them in generic “civil society” activity [i.e. a random civilian
goes to meet his/her local MP], as those are not subject to reporting rules.</p>
<p>It is important to stay out of the public eye as much as possible. The
UK lobbying space is dominated by very well-funded left-wing institutions: The
Common Purpose, Friends of Environment and all sorts of less-known but very well-funded
genderist “charities” (such as Mermaids, which is currently fighting a lawsuit
worth tens of millions of pounds against the very existence of a genderist NGO that is not
friendly enough with “transgenders”). You don’t want to get caught up in such
a mess before you become a moneyed, established interest in the country.</p>
<p>It is in your interest to be in the UK because there are millions of people who own crypto
over there and most of them are what would be classified as dissidents.</p>
<p>=> Focus primarily on the Tory party</p>
<p>Realistically, there won’t be a Labour government anytime soon, and the smaller parties
are too small to matter for this discussion anytime soon.
The Tory party in the UK is the equivalent of FIDESZ in Hungary or the BJP in India. It is
the party. It’s where most of the Power resides.</p>
<p>=> Propaganda not oriented towards the masses</p>
<p>As mentioned earlier in this paper, the British public is more at ease with Big
Government. Generally speaking, ways to shrink one’s tax bill are not the hook there that
they are in the US.</p>
<p>As such, the propaganda in Britain should be focused on businesses (including smaller
ones).</p>
<p>In February 2022, the first physical card that lets one spend crypto was launched in the
UK^5 so there is plenty of room for crypto lobbying and propaganda to businesses.</p>
<p>(^5) CoinJar Card is now available in the UK - https://blog.coinjar.com/coinjar-card-uk/ (published on
February 17, 2022)</p>
<h3 id="australia">Australia</h3>
<p>Australia’s legal and societal structure is a combination of the worst elements of British,
American and continental European structures.</p>
<p>It has federal subjects (states) like the US, but with mixed powers (like devolved powers
in Britain) and with onerous legislation (like continental Europe) that assumes that if
something is not regulated it is thus suspect.</p>
<p>Following the events of the Wuhan Flu, the author is not a big fan of Australia period. On
anything. Ideally, this place should be boycotted by normal people in general and
freedom-loving people in particular. With that said…</p>
<p>=> Lobbying is highly regulated</p>
<p>Lobbying in Australia is regulated both at the federal level and at the level of some states
(though, in practice, it’s all states since most AU states are sparsely populated).</p>
<p>The Attorney General of the Australian Government maintains a Lobbyist Register^6 and
is in charge of enforcing the Lobbying Code of Conduct^7, which is a mammoth document
with plenty of external citations. You will need a full-time person just to untangle that in
Australia.</p>
<p>There are no meaningful exceptions. Even a teenagers’ three-person NGO lobbying on
behalf of teenagers’ right to breathe normally (i.e. sans the ridiculous Covid muzzle)
has to register as a federal lobbyist in order to be able to even speak.</p>
<p>Government officials are prohibited from meeting anyone who is not a registered
lobbyist.</p>
<p>The only de facto exceptions are if you can catch them at an event or campaign rally (and
those are very rare anyway).</p>
<p>The workaround most use to be spared registration is the so-called “in-house lobbyist”,
in which one is officially employed by Organization Inc. with the official title of “lobbyist”.
Sounds silly, but that’s the law in Australia.</p>
<p>Also exempted from the onerous Code are:</p>
<ul>
<li>charities and religious organisations</li>
<li>non-profit organisations and associations</li>
<li>individuals making representations on behalf of relatives or friends</li>
<li>members of foreign trade delegations</li>
</ul>
<p>(^6) https://lobbyists.ag.gov.au/register
(^7) https://www.ag.gov.au/integrity/publications/lobbying-code-conduct</p>
<ul>
<li>people already registered under a Commonwealth scheme regulating certain professions (such as tax agents and customs brokers) who make representations to the government on behalf of clients</li>
<li>service providers (such as lawyers, doctors, accountants and other service providers) who make occasional representations to the government on behalf of clients in a way that is incidental to the provision of their professional services</li>
</ul>
<p>None of these are helpful for the purposes of this operation, though.</p>
<p>=> Indebted and fiscally irresponsible population</p>
<p>All English-speaking nations have populations that spend more than they make and take on
onerous debt. But Australia is on a whole new level. Buy-Now-Pay-Later
(BNPL) schemes have seen explosive growth^8 - and this is all before the central bank
ramps up interest rates.</p>
<p>Australia will have a severe downturn in 2023 - which means less disposable income.</p>
<p>It is highly unlikely that Australia will be a target nation for this operation. Or, if it
is, one should be prepared to sink a lot of resources into it for quite some time.</p>
<p>=> Hostile government and hostile population</p>
<p>The population in Australia seriously believes (as a statistical norm, of course) that
paying more money to the government changes the weather (otherwise known as
“climate change policy”).</p>
<p>The current Labor administration will do just that. And the crypto sector is a large
energy consumer, so it will be hit by this.</p>
<p>Admittedly, this also presents a lobbying opportunity. The operation will have to decide
whether it’s worth it for the 2% of the Australian population that owns any crypto.</p>
<p>Perspective: There are more crypto owners in two small neighborhoods in London than
in the entire nation of Australia.</p>
<h3 id="new-zealand">New Zealand</h3>
<p>There are fewer than 80,000 (eighty thousand) crypto holders in New Zealand, the vast
majority of them holding less than $100 worth of crypto.</p>
<p>With that said…</p>
<p>(^8) One in seven buy now, pay later customers had more than 20 loans last year, Choice survey shows -
https://www.theguardian.com/australia-news/2022/sep/16/one-in-seven-buy-now-pay-later-customers-had-more-than-20-loans-last-year-choice-survey-shows (published on September 15, 2022)</p>
<p>=> Lobbying is unregulated in New Zealand</p>
<p>There are no rules concerning lobbying in NZ. None whatsoever.</p>
<p>Left wing groups complain that ministers start their own lobbying firms immediately
after leaving office^9 thus increasing the level of corruption in the country. They’re not
wrong, but this also means your operation can easily get into the arena.</p>
<p>=> Incoming draconian financial surveillance regime</p>
<p>The Financial Markets (Conduct of Institutions) Amendment Act 2022 comes into force
in 2025^10 and, as it stands now, it has the potential to effectively ban retail use of crypto
for anyone not complying with onerous bureaucratic and registration processes.</p>
<p>The consultation period for the parts relevant to the crypto industry expired in
July 2022. A consultation on “registration fees” (read: nothing important) remains open
until November.</p>
<p>The lobbying opportunity here is to change this act after 2026 and, in the meantime, to
work through propaganda to get Labour out of power.</p>
<h3 id="singapore">Singapore</h3>
<p>Singapore’s political regime is, effectively, fascism. Or, more descriptively, tax-haven
fascism. The government isn’t particularly concerned with economic regulation, but it is
very concerned with controlling society by any means necessary in order to preserve its
power and advance its ideological agenda.</p>
<p>=> There are no rules and regulations concerning local lobbyists</p>
<p>Locals can engage in all forms of lobbying without any restrictions.</p>
<p>=> Foreign lobbying is effectively forbidden</p>
<p>This is done through a complex patchwork of laws that work as follows:</p>
<p>The Public Order Act (Chapter 257A, 2012 Rev Ed) (POA), the Newspaper and Printing
Presses Act (Chapter 206, 2002 Rev Ed) (NPPA), the Broadcasting Act (Chapter 28, 2012
Rev Ed), the Political Donations Act (Chapter 236, 2001 Rev Ed) (PDA) and the
Parliamentary Elections Act (Chapter 218, 2011 Rev Ed) (PEA) all have very strict rules
and restrictions that make effective lobbying by a foreign entity very difficult - even when
using locals as an interface.</p>
<p>(^9) From minister to lobbyist in three months: New Zealand needs to do better on transparency -
https://www.theguardian.com/world/2022/oct/07/from-minister-to-lobbyist-in-three-months-new-zealand-needs-to-do-better-on-transparency (published on October 7, 2022)
(^10) https://www.mbie.govt.nz/business-and-employment/business/financial-markets-regulation/conduct-of-financial-institutions-regime/</p>
<p>Not to mention that finding locals willing and capable of doing this is a monumental task
in itself.</p>
<p>For instance, under the Public Order Act, the Commissioner of Police may refuse to grant
a permit for a public assembly or procession if he or she has reasonable grounds to
believe that it is directed towards a political end and involves foreign entities and
individuals (section 7(2)).</p>
<p>The Newspaper and Printing Presses Act and the Broadcasting Act empower the
government to restrict and control the ownership of newspapers and broadcast media so
as to prevent foreigners from manipulating Singapore media platforms to influence local
politics. This has started to be enforced on the Internet as well.</p>
<p>Section 19(1) of the NPPA sets out that ministerial approval is required for a newspaper
to receive funds directly or indirectly from a foreign source. Section 19(8) of the NPPA
makes it an offence for any journalist to have received funds and failed to declare the
receipt within seven days to the managing director of his or her newspaper. So buying a
paid shill in the press is also complicated.</p>
<p>Section 24(2) of the NPPA prohibits the sale, distribution or import or possession for sale
or distribution of any declared foreign newspaper, unless ministerial approval is
obtained. Section 25(1) also prohibits the reproduction for sale or distribution in
Singapore of any copy of a declared foreign newspaper, unless ministerial approval is
obtained.</p>
<p>Section 31(1) of the Broadcasting Act prohibits the rebroadcast of any declared foreign
broadcasting service, unless ministerial approval is obtained. Section 43(1) prohibits the
receipt of funds from any foreign source in order to finance a broadcasting service, unless
consent from the Info-communications Media Development Authority (IMDA) is
obtained. Section 44 also contains restrictions on foreign ownership of broadcasting
companies.</p>
<p>=> Personal relationships</p>
<p>Considering what’s explained above, the best course of action in this country is to travel
and establish personal relationships and loyalties over there.</p>
<p>Singaporean politicians may be open to further relaxation of the measures on the crypto
sector if it is explained to them in terms of national well-being. After all, criminals
laundering their crypto through Singaporean banks will pay fees and maybe even some
taxes to the State of Singapore.</p>
<p>It’s not ideal, but there are few legal ways to do this otherwise.</p>
<h3 id="south-africa">South Africa</h3>
<p>As mentioned earlier in this paper, ZA is a place with a huge growth potential in the
crypto sector.</p>
<p>On October 19, 2022, the FSCA (Financial Sector Conduct Authority) announced that it
would classify cryptocurrency assets as financial products^11, opening the door to
regulating the industry.</p>
<p>=> Large lobbying industry</p>
<p>There are no state-enforced rules for lobbying in South Africa. However, the industry is
very large and self-regulating.</p>
<p>One is not obligated to join the lobbying bodies, but it’s nearly impossible to get anything
done if one is not part of them - bodies such as LOBBYINGSA (the South African
Lobbying Association)^12 or the African Lobbying Union (a pan-African lobbying
organization).</p>
<p>These organizations’ purpose is twofold - to regulate the industry and to offer lobbying
opportunities through networking.</p>
<p>Some would say it’s a grift. In reality, it’s both.
The lobbying sector is so well developed in South Africa that it is the only
country in Africa where Finsbury International Policy &amp; Regulatory Advisers (FIPRA)
has a presence.</p>
<p>FIPRA Network</p>
<p>(^11) South Africa moves to regulate crypto assets - https://www.reuters.com/technology/south-africa-moves-regulate-crypto-assets-2022-10-19/ (published on October 20, 2022)
(^12) https://ethicore.co.za/membership/south-african-lobbying-association-lobbyingsa/</p>
<p>FIPRA is the largest lobbying firm in the world and if you find yourself fighting against
them, be ready - because they don’t mess around. They’re very ruthless in general and
particularly ruthless in South Africa.</p>
<p>=> Free speech</p>
<p>As long as you’re not into racial politics, there is both de jure and de facto freedom of
speech in ZA. So it’s open season for agitprop and implementing the tactics laid out in
this manual.</p>
<h3 id="canada">Canada</h3>
<p>Canada resembles the USA for the purposes of this manual, so the author will only list
what’s different in Canada compared to the United States of America.</p>
<p>=> Provincial lobbying Acts</p>
<p>The US has some state-level legislation for lobbyists (none of it quite relevant for the
purposes of this manual). Canada, however, has much more far-reaching lobbying legislation
at the provincial level. To make matters more complicated, the type of regulation differs
quite a bit from province to province, with the exception of the Northwest Territories and
Nunavut, which do not have lobbying legislation.</p>
<p>=> At least 7 million radicalized people</p>
<p>Traditionally, Canada as a whole would be further-Left than the Democrats. Covid19
changed that. Maybe not forever, but definitely for the remainder of this decade.</p>
<p>At least 7 million people (20% of the population) can sociologically be described as fully
radicalized. They don’t trust the government, authority, or really any of the established
institutions. And they particularly don’t trust banks after what they saw during the Ottawa
protests (when the Regime froze the protesters’ bank accounts).</p>
<p>This population does not have a real voice (the PPC doesn’t really count, yet).</p>
<p>This presents an opportunity and a challenge. The challenge is to get these people to vote
for the same party.
The opportunity is for your operation to simply take over a party (the PPC, why not?) and
essentially get into the Parliament with your operation’s agenda baked into the new
government.</p>
<p>A right-wing (or, better said, non-Leftist) coalition will likely take power in Canada after
Trudeau. Covid19 hearings and Ottawa-Protest/Emergencies Act hearings are tearing
Trudeau’s prime ministership apart. So… if your operation moves fast, it could field its
own candidates in 2025 and stand a good chance of being part of the next
government.</p>
<p>The body politic of Canada is ripe for a shakeup. 7 million radicalized people can do a
lot, if they’re propagandized enough.</p>
<p>Main enemy: Voter suppression propaganda. Or… blackpilling.
It is no secret that the Canadian army ran a military campaign to influence public
opinion^13 and nothing happened. Nobody was punished for it, even after those who ran it
disobeyed direct orders to stop.</p>
<p>So your operation will be fighting the military in the information war. That doesn’t mean
it’s not a winnable fight. It most certainly is. But it will require some boots on the ground
and will not be cheap.</p>
<p>=> Crypto platforms are regulated in Canada, but not crypto use</p>
<p>This might change but as of November 2022, crypto use and crypto lending are not yet
subject to a specific regulatory regime in Canada.</p>
<p>Registration with the Canadian Securities Administrators of platforms appears to be
voluntary, for now. But with the panic in the crypto markets in the summer of 2022,
many have started the process to register with the CSA.</p>
<p>Protecting at least this arrangement (which is surprisingly economically liberal for an
otherwise socialist country) should be the primary objective of your operation. Which is
why what’s suggested just above (taking over a party and fielding candidates) is a worthy
investment - especially considering Canada’s propensity toward left-wing lunacy. If you
control this, then the chances of the sector being assaulted by SJWs diminish as well.</p>
<p>(^13) Military campaign to influence public opinion continued after defense chief shut it down -
https://www.cbc.ca/news/politics/psychological-warfare-influence-campaign-canadian-armed-forces-
1.6079084 (Canadian Broadcasting Corporation, published on June 24, 2021)</p>
<h1 id="some-conclusions">Some conclusions</h1>
<p>This manual may or may not be exhaustive, depending on how much the reader is willing
to try. The larger the operation attempted in reality, the less useful this manual becomes.</p>
<p>The reason is simple: To start a global lobbying operation, a manual/roadmap like this
one is enough. To run it for more than 12 months, however, more than a manual like this
is needed.</p>
<p>Ideally, the reader would have a trusted inner circle (not too different from the Pink Circle
exemplified in this paper) of two or three people who are not just reasonably well paid,
but also committed for the long haul.</p>
<p>Lobbying is not an activity that yields results in three months. Or even in twelve.</p>
<p>The author of this manual just managed - with a network not very dissimilar to what’s
described in this paper - to defeat the banking lobby and reverse a long-standing
anti-cash legislative trend. It took 4 years - a period in which the Regime threw
everything at it, including the inconspicuous anti-cash shilling of the Wuhan Flu pandemic
(“you get viruses from cash - use card!”).</p>
<p>And blocking anti-cash measures is relatively simple compared to what is required in
some countries for the benefit of the crypto industry. This fair warning is placed here just
to give the reader an idea of the timeframes involved.</p>
<h2 id="the-end">THE END.</h2>
<h2 id="footnotes">Footnotes</h2>
<p>of the knowledge and experience of the author, with all the biases and limitations that
come with that. The version you are reading may not be the final version as the author is willing to add,
change or clarify anything upon request.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>This is not an adaptation of an existent one - but an adaptation to the issue at hand <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
</ol>
</div>

<h1 id="optimal-autonomous-organizations">Optimal autonomous organizations</h1>
<blockquote>
<p>Originally posted on
https://graymirror.substack.com/p/optimal-autonomous-organizations</p>
</blockquote>
<p>In 2022, the distributed autonomous organization is <em>sort of</em> a thing. DAOs
exist. They are not a joke. But we can hardly say they are <em>mature</em>.</p>
<p>The 2022 DAO standard design has not been ruthlessly optimized by experience and
competition—we should not expect the current practice of the field to be in any
way optimal. We should expect it to reflect the hopes of the players, not their
experience.</p>
<p><em>Autonomous</em> and <em>sovereign</em> are of course synonyms—or near synonyms. When we
draw up a contract for a DAO, we are drawing up a constitution for a government.
In both cases, there is no power above the organization that can directly
regulate it; and the organization has some power (mathematical or military) that
can effectively defend it from all competing powers.</p>
<p>Contrast the DAO with the corporation, which is anything but autonomous—while it
does act independently like a sovereign, the corporation’s process is regulated
by the laws of a higher government. The sovereign can do anything it wants to
the company; but the company trusts it to enforce contracts, not through math
but through lawyers. Have you hired a lawyer lately? Lawyers <em>are</em> lovely
people.</p>
<p>Yet the architecture of the corporation <em>has</em> been ruthlessly optimized. Big,
ambitious companies approach and exceed the scale and ambition of many
historical sovereigns. If there was a better way for Apple to organize itself to
build phones, or SpaceX to build rockets, probably someone would have found this
way.</p>
<p>Instead the basic design of the Anglo-American limited-liability joint-stock
company has remained roughly unchanged since the start of the Industrial
Revolution—which, a contrarian historian might argue, might actually have been a
Corporate Revolution. If the joint-stock design is not perfectly optimal, we can
expect it to be nearly optimal.</p>
<p>While there is a categorical difference between these two types of
organizations—we could call them first-order (<em>sovereign</em>) and second-order
(<em>contractual</em>) organizations—it seems that society in the current year has very
effective second-order organizations, but not very effective first-order
organizations.</p>
<p>Therefore, we probably know more about second-order organizations. So, when
designing a DAO, we should start from corporate governance, not political
science.</p>
<p>There is a categorical difference between sovereign and contractual power. We
have to adapt the corporate design for that difference—while optimizing away any
ancient spandrels from the paper age.</p>
<p>But at least we are looking for something optimal. Let us try to rederive it
from scratch, in the context of a DAO: the optimal autonomous organization, or
<em><strong>OAO</strong></em>.</p>
<p>Suppose you are an asshole and you have a governance token. This makes you an
asshole with a governance token. Why do you, not some other asshole, have this
token? What determines whether you do the right thing with it?</p>
<p>What determines whether you do the right thing with your governance token is the
product of two factors: <em>purpose</em> and <em>competence</em>. Competence cannot be
culturally instilled—except by filtering who gets the token. There is no such
thing as uniform competence, except as enforced by some competence filter.</p>
<p>But we would like the culture of the governance tokenholders—the <em>governors</em>—to
be at least uniform in <em>purpose</em>. They may be more or less clueful; but they all
should be at least trying to solve the same abstract problem.</p>
<p>An optimal autonomous organization assumes the goal of the governance holders is
<em>unconflicted</em>. Their governance actions have only one motivation: the success
of the organization. Maybe this should go without saying, but it most certainly
does not.</p>
<p>Success itself is difficult to define, of course—but ignore this problem for a
moment, as each organization solves it differently. What other motivation can a
governor have? You could have a <em>conflict of purpose</em>. As a governor, you can be
conflicted in two ways:</p>
<p>First, you have a <em>straightforward</em> conflict of purpose if directing the
organization to zig rather than zag would benefit you personally.</p>
<p>For example, insiders in an organization commonly have straightforward conflicts
of purpose, which is why companies are not democratically governed by their
employees. No one wants to kill their own project, etc. A more sordid conflict
of purpose is the common or garden conflict of interest, in which you steer a
contract to your cousin.</p>
<p>Second, you have a <em>vicarious</em> conflict of purpose if your governance token is
actually a source of <em>entertainment</em>—if you find enjoyment and meaning in
directing, or even just trying to direct, the organization. Vicarious conflicts
of purpose often lead governors to push an organization
<em><a href="https://en.wikipedia.org/wiki/Ultra_vires">ultra vires</a></em>—outside its
constitutional definition of success.</p>
<p>For example, if you have a strong sense of good and evil, you may find meaning
in directing the organization to do good rather than evil, according to your
personal interpretation of some specific historical situation.</p>
<p>If the purpose of the organization is to do good rather than evil, this is not a
conflict of purpose. If the purpose of the organization is to make widgets, or
to make money by selling widgets, it is. Imagine you are managing the HOA of a
condo complex, but send the landscaping funds to help starving refugees in
Yemen. Are you guilty of embezzling? You are.</p>
<p>Vicarious conflicts of purpose do not benefit the governor directly—only
emotionally. But most people have emotions, and these emotions are socially
mediated. Moreover, the most common socially mediated purpose of governance is
not even related to good or evil—it is just the governor’s socially-mediated
desire to experience status through importance. “Impact” is power, and power is
intrinsically pleasant to feeble human beings.</p>
<h2 id="direct-democracy-and-vicarious-purpose">Direct democracy and vicarious purpose</h2>
<p>When we look at DAOs in 2022, we see a lot of direct democracy. It is normal for
the holders of governance tokens, or the equivalent, to vote directly on policy
decisions. Imagine if the holders of Chipotle shares had to elect the next new
burrito flavor.</p>
<p>Direct democracy is clearly a tradeoff in favor of a vicarious conflict of
purpose. One way to define vicarious purpose is a thought-experiment: if you, a
governance holder, could proxy all governance decisions irrevocably to a
governance expert whom you absolutely <em>knew</em> would make better decisions for the
organization—would you?</p>
<p>If so, you have no attachment to the vicarious purpose of impact. But few
governors of today’s DAOs seem to be looking to surrender their power in any
such way.</p>
<p>Constitutions designed to pander to vicarious purpose may have a much greater
chance of being created. But, because management by shareholder democracy looks
like an absurdly ineffective practice from the corporate perspective,
constitutions which make no concession to vicarious purpose seem more likely to
be efficient.</p>
<p>So, on a level playing field, we would expect corporations to kick the ass of
DAOs. This is the most fundamental motivation for the design of the OAO, which
is nothing more than a streamlined version of corporate governance translated
into blockchain. But this design has absolutely no role for vicarious purpose.</p>
<p>We will assume that governance-token holders have no vicarious purpose—they are
uniformly unconflicted in their goal of an efficient organization. Of course
this is an ideal approximation, but the closer to it the better. If it does not
approximate reality, reality should be educated to accommodate it.</p>
<p>But if vicarious purpose persists and cannot be educated away, the OAO is not
the right design. Folly has always had its tax to pay.</p>
<h2 id="efficiency-and-accountability">Efficiency and accountability</h2>
<p>The OAO is designed to maximize the combination of <em>efficiency</em> and
<em>accountability</em>.</p>
<p>Draw the OAO as an hourglass. At the top of the OAO are the <em>governors</em>. Below
them is an inverted pyramid which ends in a smaller pyramid. The base of this
pyramidion is the <em>board of trustees</em>. Its tip, the tip of both pyramids, is the
<em>chief executive</em>. Below this nexus is a right-side-up pyramid: the hierarchy of
<em>employees</em>, as a classic orgchart.</p>
<p>The governors pick the board. The board picks the CEO. The CEO bosses the
employees. This simple monarchical design makes everything that works in the
world. Look around you—everything you see was made by a monarchy. You can
probably see the work of twenty monarchies without turning your head.</p>
<p>The top <em>backup</em> pyramid provides <em>accountability</em> and <em>responsibility</em>. The
bottom <em>operating</em> pyramid provides <em>authority</em> and <em>efficiency</em>. The
combination is an <em>accountable monarchy</em> or <em>responsible monarchy</em>—essentially,
a “benevolent dictatorship.”</p>
<p>Thousands of years of human history have proven at every scale that the
pyramid-shaped hierarchy is the best way for people working together to get
things done. The fundamental problem of the classic pyramidal monarchy, at a
sovereign level, is the single point of failure at the top. The joint-stock
design is the basis of a solution, but this solution has never been adapted well
to the sovereign level.</p>
<p>Sovereign monarchy goes wrong less often than most think, but often enough that
an unaccountable autocracy is a real problem. If accountability can be solved
with little or no compromise to the efficiency of the regime, the result is
probably near optimal.</p>
<p>The purpose of the hourglass shape is to avoid interference between efficiency
and accountability. Accountability ensures a level of competence which
indirectly also assures sanity—resolving the fear of a deranged, yet autonomous,
CEO-dictator-king. Accountability certainly has a nonzero impact on efficiency;
it cannot help but cramp the style of the CEO, especially in a crisis. But this
cost can be quite tolerable.</p>
<p>Corporate governance is not corporate management. The CEO is always the manager.
A company micromanaged by its board of directors is a disaster. Here,
accountability is cutting into efficiency in a major way—but not as badly as if
the shareholders were in charge directly!</p>
<p>The goal of a well-chosen board is simply to ensure that the CEO is excellent in
every sensible way—since purely numerical targets, while incredibly useful, can
always be hacked in counterintuitive ways. The board is human because humans
have common sense, and the organization must not act in ways that violate common
sense.</p>
<p>Because the board consists of professionals, and because they are not involved
in the actual use of power, and because in a successful company they need not be
consulted at all, the vicarious purpose of the board can be kept extremely low.</p>
<p>The board is <em>not</em> part of the operating loop of the company. A board is a
backup device. Power corrupts, but power does not corrupt the board—it is as far
from power, or at least any power that would conflict with its interests, as
possible. The whole point of a board is that it makes as few decisions as
possible—making it literally independent.</p>
<p>A board can be repurposed for other accountability-like purposes. For example,
it can approve structural decisions or even set budgets, set the CEO’s pay, etc.
The board in a DAO might collectively control the cold wallet of the treasury,
set the burn rate from this treasury to the organization, etc.</p>
<p>But when a board makes policy decisions, it becomes a player within the
management of the corporation. This has more impact on efficiency—it degrades
not only the performance of the company, but also of the board (which is no
longer independent).</p>
<p>Almost all major corporate boards recognize this, and refrain from micromanaging
(which, as many a startup has learned to its sorrow, any board can do)—of their
own cultural volition. The purpose of a professional board can be culturally
standardized in a way that is impossible for a motley crew of random
governors—which is why the shareholders do not directly elect the CEO.</p>
<h2 id="securing-autonomy">Securing autonomy</h2>
<p>For a DAO to be truly autonomous, its governance must be secure against
sovereign power. For a regime to be truly sovereign, its governance must be
secure against all external and internal powers. These goals are slightly
different but clearly related.</p>
<p>If the board can be pressured or coaxed by any other force to overmanage the
CEO, or to malfunction in any other way, it is not an independent board. If the
board is not truly independent, the OAO design does not work and should not be
tried.</p>
<p>The advantage of the OAO is that an OAO is secure so long as (a) its process and
(b) its trustees are secure. The process is secured by running the OAO on a
blockchain. The trustees are secured by making them anonymous—even to each
other.</p>
<p>So long as the trustees are anonymous, the employees and even the CEO can be
completely public. They still cannot be coerced, except secretly or by surprise.
Ultimately, all the assets of the OAO are under the trustees’ control. The
trustees can even update the constitutional contract of the OAO.</p>
<p>If they sense the influence of an outside power on their employees, the trustees
can get new employees, or even pick a new CEO. Without a lucky investigation, a
board of anonymous trustees is almost impossible to find and kill. Autonomy is
ultimately a security problem—and an anonymous board is the best solution to
this problem.</p>
<h2 id="selecting-the-board">Selecting the board</h2>
<p>To select <em>N</em> trustees from a random pool of governors, ask every governor to
either be a candidate, or delegate their support to another, wiser governor.</p>
<p>Every candidate’s point total is the cumulative weight of their tree of
supporters. Candidates who are not in the top <em>N</em> either waste their points, or
point their votes toward a more successful candidate.</p>
<p>If you see no one wiser than you, you can only be a candidate. If you are not in
the top N, you might as well throw your votes toward one of them you prefer to
the rest.</p>
<p>Once everyone stops changing their vote, the top <em>N</em> represent the community’s
collective wisdom on who in the community is the wisest. The weighting of points
is preserved, and counts in votes within the board—not every seat has equal
weight.</p>
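<p>As a sketch of the tally just described (the types and function names here are my own, hypothetical, and not from any particular DAO implementation): each governor either stands as a candidate or names a delegate; each governor’s weight flows along the delegation chain to its terminal candidate; the top <em>N</em> candidates take seats, each keeping its accumulated weight:</p>

```rust
use std::collections::HashMap;

// Tally delegated support. Each governor either stands as a candidate (None)
// or delegates to another governor (Some(id)). A governor's weight follows
// the delegation chain until it reaches someone who stands; the top `n`
// candidates win seats, each seat keeping its accumulated weight.
fn select_board(
    weights: &HashMap<u32, u64>,
    delegates: &HashMap<u32, Option<u32>>,
    n: usize,
) -> Vec<(u32, u64)> {
    let mut totals: HashMap<u32, u64> = HashMap::new();
    for (&governor, &weight) in weights {
        let mut current = governor;
        let mut hops = 0;
        // Follow the chain to a terminal candidate; the hop limit guards
        // against delegation cycles, which a real system would have to reject.
        while let Some(&Some(next)) = delegates.get(&current) {
            current = next;
            hops += 1;
            if hops > delegates.len() {
                break;
            }
        }
        *totals.entry(current).or_insert(0) += weight;
    }
    let mut seats: Vec<(u32, u64)> = totals.into_iter().collect();
    seats.sort_by(|a, b| b.1.cmp(&a.1)); // heaviest cumulative support first
    seats.truncate(n);
    seats
}
```

<p>For example, with governors 1–4 weighted 10, 5, 1, and 2, where governor 2 delegates to 1 and governor 4 delegates to 2, candidate 1 ends with a cumulative weight of 17 and candidate 3 with 1. Once everyone stops changing their delegation, the fixed point of this tally is the board.</p>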
<p>In some cases, an OAO begins with one or more founders, not a community. In this
case, the founders should select their own trustees—whose mission is to continue
the founders’ vision, even if the founders die or go insane. But initial trustee
selection has as many variations as there are organizations.</p>
<h2 id="cutting-the-cord">Cutting the cord</h2>
<p>Of course, the board election process can run continuously—occasionally changing
the board membership as the body of opinion amongst the governors shifts, and,
as the board changes, possibly changing the CEO as well.</p>
<p><em>This is a terrible idea</em>. It destabilizes the OAO by making its accountability
structures vulnerable to any power that can manipulate the public opinion of the
governors. The OAO starts from the assumption that public opinion is highly
volatile and imperfect. The governance structure is bootstrapped from it because
there is no alternative. But the governance structure should be more stable and
sane than the governors, <em>forever</em>.</p>
<p>If the governors have no vicarious purpose, they should be willing to delegate
their power <em>permanently</em> to any institution designed to be more responsible
than them. If after this initial election public opinion shifts, and the
opinions of the community diverge from the opinions of the trustees, <em>the
trustees are more likely to be right</em>—since the whole point of the election was
to concentrate responsibility into the trustees.</p>
<p>Why would any governor, if focused only on implementing responsible governance,
<em>not</em> permanently abandon his vote to an inherently even more responsible
regime? One reason might be: he likes voting. In other words, he has a vicarious
purpose.</p>
<h2 id="the-reset-button">The reset button</h2>
<p>But another reason might be: he is not quite 100% sure of the permanent
perfection of the design. Is the OAO 100% inherently forever more responsible
than the community of governors? Is the engineering perfect? It should be… but
you never know.</p>
<p>One way to retain the backup accountability of the community of governors,
without pandering unnecessarily to their vicarious purpose, is to give them a
<em>reset button</em> that reboots the government completely from the start.</p>
<p>Board selection ends; but if, at any time in the future, the governors become
absurdly dissatisfied with the board, they can reset the regime—resulting in a
new election, etc. An election always starts out by producing a board aligned
with the governors.</p>
<p>So, while the reset button exists, the governors retain their ultimate
sovereignty. But they have no way at all to manage the board, except by hitting
the button repeatedly—which, democracy being democracy, they will get tired of.</p>
<p>A reset button should have a high threshold, so that in any reasonably
well-governed system it is never a serious possibility—just as in the lifecycle
of most corporations there is never a single serious shareholder vote. In the
long run, it is probably a bad design. In some ways it is the most conservative
design, and it deserves consideration.</p>
<h2 id="anonymizing-the-trustees">Anonymizing the trustees</h2>
<p>It is difficult to anonymize the candidates in an election. We would expect
election winners to be identified. Indeed, considering the threat model of the
organization, there may be nothing wrong with an identified board.</p>
<p>But there are so many ways to pressure a fragile human being. If any entity can
put pressure on the board, that entity joins the board in the circle of power—an
unholy and unacceptable intrusion. And anyone whose name is known can be
pressured.</p>
<p>Once the cord of the selection process is cut, however, board seats are the
permanent property of their holders. They are tokens. They can be transferred,
like any token.</p>
<p>And the natural and inevitable response of the responsible trustee, who feels
no vicarious purpose at all, to any form of pressure at all is simple: transfer
the token. The receiving party should be anonymous.</p>
<p>Once this chain, which can grow at any time, is sufficiently long, no authority
will have the practical power to trace it. A single initial transfer for each
seat anonymizes the board forever, unless there is some leak—which only takes
one transfer to fix.</p>
<h2 id="the-purpose-of-the-trustees">The purpose of the trustees</h2>
<p>What is the definition of success? Any organization is dysfunctional if it
deviates from its own definition of success.</p>
<p>This definition must match the ethos of the trustees. Ultimately, the stability
of any government or other autonomous system rests upon the ethos of some set of
human beings. If this set of humans is corrupted, the government is also
corrupted.</p>
<p>The OAO addresses this by resting its ethical stability on the trustees. Its
reasoning is that, since the trustees are exposed to far smaller doses of
power—ideally, replacing a retiring CEO every decade or two—they are far less
likely to be corrupted by power.</p>
<p>Therefore, it can be assumed that the trustees have some shared ethos which they
pass on to their successors. The ethos is enforced by no one—no one can fire a
trustee. But since it sets a standard which no incentive is pressing against,
this ethos of governance can maintain itself indefinitely.</p>
<h2 id="a-generic-ethos">A generic ethos</h2>
<p>The first part of the ethos of an OAO trustee is mission-specific. The OAO (like
early corporations, which were chartered with a
<a href="https://en.wikipedia.org/wiki/Objects_clause">purpose</a>) must define its
mission. Any intentional action outside this mission is
<em><a href="https://en.wikipedia.org/wiki/Ultra_vires">ultra vires</a></em>. It must displease
the trustee. If the OAO’s only mission is to make money by any means necessary,
this is fine; but even this mission must be written down.</p>
<p>The second part of the ethos is mission-independent. It includes these
directives, which only the trustee’s ethical conscience can enforce:</p>
<ol>
<li>The trustee votes to maximize the performance of the CEO in achieving the
mission, without using any other criterion.</li>
<li>The trustee has a succession plan which will be activated in case of death
or resignation, and has selected a successor of equal or greater
responsibility.</li>
<li>The trustee has the responsibility to remain anonymous, and the
responsibility to resign if anonymity is compromised, even to other
trustees.</li>
<li>The trustee never acknowledges ownership of the seat, except to other
trustees.</li>
<li>The trustee does not communicate with other trustees or with the CEO, except
within an anonymous virtual meeting of all the trustees.</li>
<li>The trustee does not work in, for or with the organization.</li>
<li>The trustee does not hold more than one seat.</li>
<li>The trustee does not buy or sell a seat.</li>
</ol>
<p>The temptation to violate these commandments is fairly weak. Also, the damage
done if some small percentage of the trustees violates them is fairly small.
Again, since the structure and ethos of the OAO insulate the trustees from the
direct execution of power, power is unlikely to corrupt them.</p>
<p>And if some seats are corrupted—for instance, if a seat is offered publicly for
sale—a majority of the trustees can defend themselves by rewriting the
constitution.</p>
<h2 id="the-executive">The executive</h2>
<p>The executive should be one, two, or (rarely) three people. Three works only
because a triumvirate can take a vote; two can do nothing without consensus.</p>
<p>In general, a multiple executive tends to work only with founding executives. In
that case, though, it can work quite well.</p>
<p>The executive should <em>not</em> have control of the cold assets of the OAO, only of
operating cash. If the OAO is burning cash, the trustees set the burn rate. If
cash dividends need to be distributed to the governors, the trustees send them
directly from the cold wallet. The trustees must always be the direct stewards
of the OAO’s capital and treasury.</p>
<h2 id="three-types-of-oao">Three types of OAO</h2>
<p>The three types of OAO are <em>sovereign</em>, <em>legal</em>, and <em>criminal</em>.</p>
<p>A sovereign OAO is a government with a monopoly of force over some territory and
population. The problem in a sovereign OAO is how to give the CEO control over
the military, whose guns do not care about the blockchain.</p>
<p>The solution is a system of military weapons which do care about the
blockchain—which use
<a href="https://en.wikipedia.org/wiki/Permissive_Action_Link">permissive action links</a>
all the way down to the handgun. As a result, the CEO can decide which units can
or cannot fire their weapons—a decisive factor, to say the least, in any attempt
at civil strife or rebellion.</p>
<p>A legal OAO is protected by law—law may even help it enforce its contracts. A
legal OAO may not even need its trustees to be anonymous. But it may want this
anyway.</p>
<p>A criminal OAO is a sovereign OAO without a monopoly of force. As a result, all
its operations must be protected from every kind of sovereign force. Its assets
must be privacy coins; its CEO and employees must be anonymous, as well as its
trustees and its governors. Yet the effectiveness of this design for, say, a
drug cartel, is undeniable. This is a scientific assessment, not an endorsement.</p>
<h2 id="summary">Summary</h2>
<p>The DAO is a fun and workable concept which many people are trying, but which
has not yet learned to punch at its full weight when in the ring with “real”
corporations.</p>
<p>This is because DAOs have all been governed (a) by committee, (b) by democracy,
or (c) by informal power structures. These forms of governance have no chance in
any equal competition with a joint-stock corporation, which has the advantage of
autocratic efficiency—the company’s trains run on time. So they are confined to
protected niches.</p>
<p>An OAO is a DAO running a modern version of the joint-stock design. Eliminating
the official rituals that slow down every official corporation, while
maintaining the management structure that makes those corporations scalable and
efficient, might produce a more equal competition between new and old management
forms.</p>
<p>Copyright 2022 Curtis Yarvin -
https://graymirror.substack.com/p/optimal-autonomous-organizations</p>
<p>High Resolution High Throughput - 2022-10-11 - /primitives/2022/10/11/High-Resolution-High-Throughput</p>
<h2 id="high-resolution-high-throughput-pt-2">High resolution, high throughput (pt 2)</h2>
<blockquote>
<p>source:
https://raw.githubusercontent.com/frankmcsherry/blog/master/posts/2017-03-01.md</p>
</blockquote>
<p>This post is about the second of two issues I outlined a while back in a
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md">differential dataflow roadmap</a>.
I’ve recently written a bit about the first issue,
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-11.md">performance degradation over time</a>,
and steps to ameliorate the issue. That seems to be mostly working now, and I’ll
write a bit more about that as it settles.</p>
<p>Instead, we’ll talk in this post about the second concern: with fine-grained
updates, perhaps just a few updates per timestamp, additional workers do not
increase the throughput of update processing (and they mostly slow it down).</p>
<p>Stealing
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#resolution-and-scaling">a figure from the roadmap post</a>,
let’s look at doing 10,000 updates to a reachability computation with two
million edges, but batching the 10,000 updates three different ways: one, ten,
and one hundred updates at a time:</p>
<p><img src="https://github.com/frankmcsherry/blog/blob/master/assets/roadmap/batching.png" alt="batching" /></p>
<p>The solid lines are the distributions of single-worker latencies, and the dotted
lines are the distributions of two-worker latencies. Visually, the second worker
helps when we have larger input batches and hurts when we have smaller input
batches. In fact, the second worker helps enough on the tail (up at the top of
the plot) that it always gives a throughput increase, but this seems like good
luck more than anything. We would like to see curves that look more like the
rightmost pair.</p>
<p>We would love to get the throughput scaling of larger batch sizes, so why not
always work with large batch sizes? The single-element updates provide something
valuable: very precise information about which input changes lead to which
output changes. By lumping all updates together in a larger batch, we lose
resolution on the effects of the changes. We have to dumb down the computation
to get the performance benefits, and that sucks.</p>
<p>In this post, I’ll explain the plan to fix this.</p>
<h3 id="a-tale-of-three-loops">A tale of three loops</h3>
<p>Imagine you were asked to hand-write a program that gets provided with a
timestamped sequence of edge changes (additions, deletions) and you need to
provide the corresponding timestamped changes to the number of nodes at each
distance from node zero.</p>
<p>That is, the input looks a bit like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre> edge change time
(0,3) +1 0
(0,2) +1 5
(2,3) +1 10
(0,3) -1 11
</pre></td></tr></tbody></table></code></pre></div></div>
<p>and your output should look something like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>dist change time
1 +1 0
1 +1 5
1 -1 11
2 +1 11
</pre></td></tr></tbody></table></code></pre></div></div>
<p>where these counts are (I hope) the correct changes in counts for the distances
in the graph. Let me know if they are not.</p>
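<p>If you don’t want to take my word for it, a brute-force check (my own throwaway sketch, nothing to do with differential dataflow’s actual machinery) can replay the edge changes, recompute BFS distances from node <code class="language-plaintext highlighter-rouge">0</code> at each timestamp, count the nodes at each distance, and emit the differences between consecutive times:</p>

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

// Replay timestamped edge changes, recompute BFS distances from node 0 at
// each time, count nodes per distance, and emit (distance, change, time)
// triples -- a brute-force check of the tables above.
fn distance_count_changes(changes: &[((u32, u32), i64, u64)]) -> Vec<(u32, i64, u64)> {
    let mut edges: HashMap<(u32, u32), i64> = HashMap::new();
    // The root sits at distance 0 from the start, so seed the prior counts.
    let mut prev: BTreeMap<u32, i64> = BTreeMap::from([(0, 1)]);
    let mut output = Vec::new();
    let mut times: Vec<u64> = changes.iter().map(|&(_, _, t)| t).collect();
    times.sort();
    times.dedup();
    for time in times {
        // Apply all edge changes occurring at this time.
        for &(edge, diff, _) in changes.iter().filter(|&&(_, _, t)| t == time) {
            *edges.entry(edge).or_insert(0) += diff;
        }
        // BFS from node 0 over edges with positive multiplicity.
        let mut dist: HashMap<u32, u32> = HashMap::from([(0, 0)]);
        let mut queue: VecDeque<u32> = VecDeque::from([0]);
        while let Some(node) = queue.pop_front() {
            let d = dist[&node];
            for (&(src, dst), &mult) in edges.iter() {
                if src == node && mult > 0 && !dist.contains_key(&dst) {
                    dist.insert(dst, d + 1);
                    queue.push_back(dst);
                }
            }
        }
        // Count nodes at each distance and diff against the previous time.
        let mut counts: BTreeMap<u32, i64> = BTreeMap::new();
        for &d in dist.values() {
            *counts.entry(d).or_insert(0) += 1;
        }
        let mut keys: Vec<u32> = counts.keys().chain(prev.keys()).cloned().collect();
        keys.sort();
        keys.dedup();
        for k in keys {
            let delta = counts.get(&k).copied().unwrap_or(0)
                - prev.get(&k).copied().unwrap_or(0);
            if delta != 0 {
                output.push((k, delta, time));
            }
        }
        prev = counts;
    }
    output
}
```

<p>Feeding it the four edge changes above yields exactly the four distance-count changes in the second table, which is reassuring.</p>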
<hr />
<p>As an exercise, actually imagine this. How <em>would</em> you structure your
hand-written program?</p>
<hr />
<p>If I had to guess (and I do), I would guess that most people would write a
program that foremost (i) iterates forward over timestamps, for each time (ii)
iterates over distances from the root, and for each depth (iii) iterates over
reachable nodes and their edges to determine the reachable set of the next
depth.</p>
<p>That is, a program that looks roughly like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>foreach time
foreach depth (until converged)
foreach node at depth
set depth of neighbors to at most depth+1
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This program seems totally fine, and I suspect a normal computer scientist will
understand it better than the sort of loop we are going to end up with. To be
totally clear, we aren’t going to change the written program at all, we are just
going to execute our program differently. But, if we had to write a program to
explain how the execution works, it would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>foreach depth (until converged)
foreach node at depth
foreach time
set depth of neighbors at time to at most depth+1
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Oh geez. Why can’t we just write normal programs for once, huh?</p>
<p>Let’s walk through the loop ordering above, using our example just above. Recall</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre> edge change time
(0,3) +1 0
(0,2) +1 5
(2,3) +1 10
(0,3) -1 11
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now, we do “time” last, and we do iteration over depth first. So, that means
that we start with the depth 0 nodes. As it turns out there is just one, the
root (node <code class="language-plaintext highlighter-rouge">0</code>). We iterate over its edges, and determine which neighbors are
reachable at which times, and offer them “depth 1”. I think they are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>(3,1) +1 0
(2,1) +1 5
(3,1) -1 11
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is all we do for the first depth. We are now ready to head into the next
depth, which is depth 1. These nodes (and their history) are listed just
above. When we line this up with edges, we get proposals for depth 2:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>(3,2) +1 10
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now, this proposal is mostly uninteresting to node <code class="language-plaintext highlighter-rouge">3</code>, except come time <code class="language-plaintext highlighter-rouge">11</code>.
At that time, node <code class="language-plaintext highlighter-rouge">3</code> does actually end up with depth 2, and so we want to do
another round of iteration. But, node <code class="language-plaintext highlighter-rouge">3</code> has no outgoing edges so there isn’t
anything to do.</p>
<p>Nothing in this execution required us to perform work in time order, except
possibly within a <code class="language-plaintext highlighter-rouge">(depth, key)</code> pair. We could literally take the whole input
history, if we had access to it, and compute the entire output history, doing
the computation depth-by-depth.</p>
<p>This is possible only because we have chosen to map functional computations
across input streams. This restriction on our computational model turns in to a
flexibility in how we execute the computation. Isn’t it just delightful when
that happens?</p>
<h3 id="why-would-we-do-this">Why would we do this?</h3>
<p>We can apparently pivot around iterative algorithms so that rather than
time-by-time, we do rounds of iterations. Why would we do that?</p>
<p>There are a few reasons I can think of, and they kinda boil down to the same
reason: the only sense in which data-parallel computation needs to wait on input
times is that work should be done in-order for each key.</p>
<ol>
<li>
<p><strong>Each distinct timestamp is some serious overhead in timely dataflow.</strong></p>
<p>This is really annoying. Each distinct timestamp results in all of the
timely dataflow workers having a little chat. These chats can be boxcar’d
together, but we are sending bytes of coordination traffic around for each
distinct time. If there is one record for each time, we would be sending
much more coordination traffic than data traffic. If we only need to send
progress traffic for each iteration, rather than each (round, iteration), we
cut out untenable overhead.</p>
</li>
<li>
<p><strong>Workers can proceed independently on decoupled times, scaling better.</strong></p>
<p>When we worry about times <em>last</em>, workers can get more done without having
to coordinate. This means workers end up with larger hunks of work to
perform before they need to wait on others, and generally higher
utilization, and possibly higher throughput (we’ll have to see).</p>
</li>
<li>
<p><strong>Workers can re-order work across times to increase locality.</strong></p>
<p>Even with a rich and complicated history of updates, workers can sort the
entire collection by key and do only one scan of key-indexed state. For each
key there may be many times to consider (in order!), but the worker only
needs to visit each key once, and in whichever order is most convenient.</p>
</li>
</ol>
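<p>To make the third point concrete, here is a minimal sketch in plain Rust (no timely or differential dataflow involved; the tuple shapes and the name <code class="language-plaintext highlighter-rouge">process_by_key</code> are illustrative, not the library’s actual types): sort a pile of <code class="language-plaintext highlighter-rouge">(key, time, diff)</code> updates by key and then by time, so that each key is visited exactly once with its times in order.</p>

```rust
// Sketch: reorder a pile of (key, (round, iter), diff) updates so that each
// key is visited exactly once, with its times processed in order.
// Illustrative types only; not differential dataflow's actual update type.
fn process_by_key(
    mut updates: Vec<(u64, (u32, u32), i64)>,
) -> Vec<(u64, Vec<((u32, u32), i64)>)> {
    // Sort primarily by key, secondarily by (round, iter) time.
    updates.sort();
    let mut out: Vec<(u64, Vec<((u32, u32), i64)>)> = Vec::new();
    for (key, time, diff) in updates {
        // Extend the current key's run, or start a new one.
        if let Some(last) = out.last_mut() {
            if last.0 == key {
                last.1.push((time, diff));
                continue;
            }
        }
        out.push((key, vec![(time, diff)]));
    }
    out
}
```

<p>The point is just that once we are free to re-order work across times, a single sort buys us one in-order pass per key.</p>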
<p>There might be a few other cool reasons. Each one is an opportunity for me to
screw things up.</p>
<h3 id="making-this-happen">Making this happen</h3>
<p>What would it take to let us do this sort of transformation on iterative
computations? Run batches of input changes concurrently, before we have finished
all of the iterations of earlier batches? What black magic would we need to
summon this power?</p>
<p>Actually, timely dataflow already does this.</p>
<p>Ok, ok. Let’s remind ourselves about our reachability computation, which
iteratively joins current distances with edges to propose new distances to each
neighbor, followed by minimizing over the proposed distances for each node:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre> <span class="c">// initialize roots as reaching themselves at distance 0</span>
<span class="k">let</span> <span class="n">nodes</span> <span class="o">=</span> <span class="n">roots</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="c">// repeatedly update minimal distances each node can be reached from each root</span>
<span class="n">nodes</span><span class="nf">.iterate</span><span class="p">(|</span><span class="n">inner</span><span class="p">|</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">edges</span> <span class="o">=</span> <span class="n">edges</span><span class="nf">.enter</span><span class="p">(</span><span class="o">&</span><span class="n">inner</span><span class="nf">.scope</span><span class="p">());</span>
<span class="k">let</span> <span class="n">nodes</span> <span class="o">=</span> <span class="n">nodes</span><span class="nf">.enter</span><span class="p">(</span><span class="o">&</span><span class="n">inner</span><span class="nf">.scope</span><span class="p">());</span>
<span class="c">// propose dist+1 to neighbors, take mins.</span>
<span class="n">inner</span><span class="nf">.join_map</span><span class="p">(</span><span class="o">&</span><span class="n">edges</span><span class="p">,</span> <span class="p">|</span><span class="mi">_</span><span class="n">k</span><span class="p">,</span><span class="n">l</span><span class="p">,</span><span class="n">d</span><span class="p">|</span> <span class="p">(</span><span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="nf">.concat</span><span class="p">(</span><span class="o">&</span><span class="n">nodes</span><span class="p">)</span>
<span class="nf">.group</span><span class="p">(|</span><span class="mi">_</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">t</span><span class="p">|</span> <span class="n">t</span><span class="nf">.push</span><span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="na">.0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span>
<span class="p">})</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Before we do anything, let’s add one line after the <code class="language-plaintext highlighter-rouge">group</code>:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre> <span class="nf">.inspect_batch</span><span class="p">(|</span><span class="n">t</span><span class="p">,</span><span class="mi">_</span><span class="p">|</span> <span class="nd">println!</span><span class="p">(</span><span class="s">"time: {:?}"</span><span class="p">,</span> <span class="n">t</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is going to tell us each time we see a batch of data produced by the
<code class="language-plaintext highlighter-rouge">group</code> operator (the “min” on depths), and at what logical time we see it. It
should clue us in to how the computation is actually being executed.</p>
<p>The code above is just the definition of the computation; we can run it a few
different ways.</p>
<h4 id="way-1-one-update-at-a-time">Way 1: One update at a time</h4>
<p>Let’s start with the traditional way we run these computations: we introduce a
change to an input edge, adding a new edge and removing an old edge, and we then
run the computation until the output reflects that change. In our timely code we
might write something like:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="k">for</span> <span class="n">round</span> <span class="n">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">rounds</span> <span class="p">{</span>
<span class="c">// sliding window, let's pretend ...</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">edge_count</span> <span class="o">+</span> <span class="n">round</span><span class="p">],</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">round</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">));</span>
<span class="c">// advance input and run.</span>
<span class="n">graph</span><span class="nf">.advance_to</span><span class="p">(</span><span class="n">round</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computation</span><span class="nf">.step_while</span><span class="p">(||</span> <span class="n">probe</span><span class="nf">.lt</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="nf">.time</span><span class="p">()));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Here we push some changes into the computation, we advance the <code class="language-plaintext highlighter-rouge">graph</code> input
(important!), and then we let the computation run until our <code class="language-plaintext highlighter-rouge">probe</code> (definition
not shown) tells us that our output has caught up to the new input round.</p>
<p>Advancing the input is very important. This is what reveals to timely dataflow
that there will be no more input data with whatever timestamps have been left
behind, which is what allows it to pass this information along to differential
dataflow operators. Then they get to go and do some work.</p>
<p>Advancing is also what tells our <code class="language-plaintext highlighter-rouge">probe</code> that there can’t be any more output.
For homework, convince yourself that this version of the code doesn’t work:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">for</span> <span class="n">round</span> <span class="n">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">rounds</span> <span class="p">{</span>
<span class="n">graph</span><span class="nf">.advance_to</span><span class="p">(</span><span class="n">round</span><span class="p">);</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">edge_count</span> <span class="o">+</span> <span class="n">round</span><span class="p">],</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">round</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">));</span>
<span class="n">computation</span><span class="nf">.step_while</span><span class="p">(||</span> <span class="n">probe</span><span class="nf">.le</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="nf">.time</span><span class="p">()));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Back to breadth-first search and depth computation. I’m going to run the
computation one update at a time for ten rounds, on a graph with 100 nodes and
100 edges, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>cargo run --example bfs -- 100 100 1 10
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This produces a bunch of output times, each of the form
<code class="language-plaintext highlighter-rouge">((Root, round), iteration)</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>time: ((Root, 0), 0)
time: ((Root, 0), 1)
time: ((Root, 0), 2)
time: ((Root, 0), 3)
time: ((Root, 0), 4)
time: ((Root, 2), 4)
time: ((Root, 2), 5)
time: ((Root, 5), 2)
time: ((Root, 5), 3)
time: ((Root, 7), 3)
time: ((Root, 7), 4)
time: ((Root, 9), 1)
time: ((Root, 10), 3)
time: ((Root, 10), 4)
time: ((Root, 10), 5)
time: ((Root, 10), 6)
time: ((Root, 10), 7)
time: ((Root, 10), 8)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As intended, we do all the work for one round before we proceed to the next
round. Within each round, we perform work by iteration, as we kinda need to do
one iteration before the next.</p>
<p>Actually, the <em>real</em> reason we do iterations in order is that timely dataflow
sees that there is a back-edge in our dataflow graph, and that updates at
<code class="language-plaintext highlighter-rouge">(round, iter)</code> can result in updates at <code class="language-plaintext highlighter-rouge">(round, iter+1)</code>. Timely dataflow does
not give the go-ahead to differential dataflow operators until all of the work
of the previous iteration has finished. That is why things actually happen in
iteration order.</p>
<p>Notice that there is not a back edge from “previous rounds” to “subsequent
rounds”. Timely dataflow can see that updates at <code class="language-plaintext highlighter-rouge">(round, iter)</code> cannot result
in updates at <code class="language-plaintext highlighter-rouge">(round+1, iter)</code>. What could the implications be …</p>
<h4 id="way-2-update-all-the-things">Way 2: Update all the things!</h4>
<p>Let’s let timely and differential off the leash. Instead of holding back on
advancing the inputs, let’s just put all the data in right away (but still at the
correct rounds):</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="k">for</span> <span class="n">round</span> <span class="n">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">rounds</span> <span class="p">{</span>
<span class="c">// sliding window, let's pretend ...</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">edge_count</span> <span class="o">+</span> <span class="n">round</span><span class="p">],</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">graph</span><span class="nf">.send</span><span class="p">((</span><span class="n">edges</span><span class="p">[</span><span class="n">round</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">));</span>
<span class="n">graph</span><span class="nf">.advance_to</span><span class="p">(</span><span class="n">round</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="c">// run like crazy!</span>
<span class="n">computation</span><span class="nf">.step_while</span><span class="p">(||</span> <span class="n">probe</span><span class="nf">.lt</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="nf">.time</span><span class="p">()));</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This version of the code just dumps all the data in, and only once it is done
does it go and start running the computation. At this point, timely knows that
the input can’t produce anything before <code class="language-plaintext highlighter-rouge">rounds</code>; what happens when
differential sees this information?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>time: ((Root, 0), 0)
time: ((Root, 0), 1)
time: ((Root, 0), 2)
time: ((Root, 5), 2)
time: ((Root, 9), 1) <-- wtf?
time: ((Root, 0), 3)
time: ((Root, 5), 3)
time: ((Root, 7), 3)
time: ((Root, 10), 3)
time: ((Root, 0), 4)
time: ((Root, 2), 4)
time: ((Root, 7), 4)
time: ((Root, 10), 4)
time: ((Root, 2), 5)
time: ((Root, 10), 5)
time: ((Root, 10), 6)
time: ((Root, 10), 7)
time: ((Root, 10), 8)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Chew on that for a bit.</p>
<p>Actually, I think this all makes a lot of sense if you ignore the <code class="language-plaintext highlighter-rouge">(9,1)</code> for
the moment. If you ignore that time, all of the other updates are done in
iteration order. Timely and differential agree that we can do the work for each
of iterations <code class="language-plaintext highlighter-rouge">2</code>, <code class="language-plaintext highlighter-rouge">3</code>, <code class="language-plaintext highlighter-rouge">4</code>, and <code class="language-plaintext highlighter-rouge">5</code> at the same time, even before all work
at prior rounds has completed.</p>
<p>The <code class="language-plaintext highlighter-rouge">(9,1)</code> update is a bit of a mystery, but nothing about differential
dataflow’s operator implementation guarantees that all work that can be
performed will be performed <em>immediately</em>. In particular, there are several
points where the operator learns it will need to do some more work, and enqueues
the work rather than testing whether the work can be done right away. The
apparent <code class="language-plaintext highlighter-rouge">(9,1)</code> disorder may just be a result of this. It’s not an incorrect
disorder, just work we could have done before <code class="language-plaintext highlighter-rouge">(0,2)</code> and <code class="language-plaintext highlighter-rouge">(5,2)</code> if we wanted
to.</p>
<h4 id="way-3-a-little-bit-of-both">Way 3: A little bit of both</h4>
<p>We could also do a bit of both: ingest some data, do some computation, ingest
some data, do some computation. This is a lot more like what we actually expect
in a streaming system. Taking all the timestamped input at once is more like a
temporal database (as I understand them), and taking the timestamped input only
one update at a time is like .. a bad streaming system I guess.</p>
<p>So let’s do that, doing a few rounds (three) of computation after each update,
but not necessarily running until all updates for the round are complete. What
do we see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>time: ((Root, 0), 0)
time: ((Root, 0), 1)
time: ((Root, 0), 2)
time: ((Root, 0), 3)
time: ((Root, 5), 2)
time: ((Root, 0), 4)
time: ((Root, 2), 4)
time: ((Root, 5), 3)
time: ((Root, 7), 3)
time: ((Root, 2), 5)
time: ((Root, 7), 4)
time: ((Root, 9), 1)
time: ((Root, 10), 3)
time: ((Root, 10), 4)
time: ((Root, 10), 5)
time: ((Root, 10), 6)
time: ((Root, 10), 7)
time: ((Root, 10), 8)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This begins and ends pretty predictably, for obvious reasons (nothing to work on
at beginning / end other than the first / last update). But in the middle we see
some pretty neat stuff. I’m thinking specifically of</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>...
time: ((Root, 2), 5)
time: ((Root, 7), 4)
time: ((Root, 9), 1)
...
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Here we’ve got a neat little wave-front cutting through our <code class="language-plaintext highlighter-rouge">(round, iter)</code>
partial order. These times are mutually incomparable (none can lead to
another), and they can all be processed concurrently.</p>
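<p>As a sanity check, a little plain-Rust sketch (hypothetical helpers, not timely’s actual <code class="language-plaintext highlighter-rouge">PartialOrder</code> API) confirms that those three times form an antichain under the product order on <code class="language-plaintext highlighter-rouge">(round, iter)</code> pairs:</p>

```rust
// Two (round, iter) times are comparable iff one is <= the other in *both*
// coordinates (the product partial order). An antichain has no comparable
// pair: every pair of its times can be worked on concurrently.
fn comparable(a: (u32, u32), b: (u32, u32)) -> bool {
    (a.0 <= b.0 && a.1 <= b.1) || (b.0 <= a.0 && b.1 <= a.1)
}

fn is_antichain(times: &[(u32, u32)]) -> bool {
    times
        .iter()
        .enumerate()
        .all(|(i, &a)| times[i + 1..].iter().all(|&b| !comparable(a, b)))
}
```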
<h3 id="what-needs-to-be-different">What needs to be different</h3>
<p>If timely dataflow already lets us re-order the computation, and allows us to
process multi-element wavefronts concurrently, what is the problem?</p>
<p>Although timely gives operators enough information, there are several
implementation issues that emerge if we just let timely dataflow run free on
fine-grained timestamps.</p>
<ol>
<li>
<p><strong>Each timestamp has lots of overhead</strong></p>
<p>We already mentioned that timely does coordination for each timestamp, and
that is still true a few sections later. If we want to avoid bogging down
the computation with control traffic, we’ll need to think of a better way of
talking about all the different timestamps.</p>
</li>
<li>
<p><strong>Differential operators run first by time, then by key</strong></p>
<p>Even though timely informs the operators that they can re-order computation
by iteration rather than by round, within an operator the implementations
still operate in blocks of logical time, rather than processing all
available times for each key. We’ll want to fix this (for sanity), but it
also opens the door to improved locality (one pass over keys per
invocation).</p>
</li>
</ol>
<p>These two problems have relatively tractable solutions, which I’ll just spill
out there. Neither is properly implemented, but the first is in use in timely’s
logging infrastructure, and the second has been typed into commented-out code.
Pretty serious business.</p>
<p>Honestly, the first step seems totally simple and workable, and I expect no
issues. The second step will likely eventually work, but it risks discovering
some horrible algorithmic nightmares along the way. That being said, here we go:</p>
<h4 id="step-1-high-resolution-streams">Step 1: High-resolution streams</h4>
<p>Right now update streams in differential take the form of timely dataflow
messages, where the data have the form</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre> <span class="p">(</span><span class="n">record</span><span class="p">,</span> <span class="n">diff</span><span class="p">):</span> <span class="p">(</span><span class="n">D</span><span class="p">,</span> <span class="nb">isize</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>There is some record, and a count indicating by how much it has changed. Like
all timely dataflow messages, there is a time attached to the message, and we
treat that as the time for all updates in the message. A timely dataflow message
therefore looks something like (but is actually nothing like):</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre> <span class="p">(</span><span class="n">Time</span><span class="p">,</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">D</span><span class="p">,</span> <span class="nb">isize</span><span class="p">)</span><span class="o">></span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>That is great if there are lots of updates with the same time, as they can get
bundled together. This doesn’t work especially well if, in the limit, there is
just one update for each time. In addition to the control traffic, each update
gets sent out as a singleton message with lots of associated overhead.</p>
<p>So, a different way of doing things, a more painful way if you don’t actually
need the flexibility, is to pack the times in as data as well, sending messages
like</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre> <span class="p">(</span><span class="n">Time</span><span class="p">,</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">D</span><span class="p">,</span> <span class="n">Time</span><span class="p">,</span> <span class="nb">isize</span><span class="p">)</span><span class="o">></span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We have <code class="language-plaintext highlighter-rouge">Time</code> in there twice now, but the two uses serve different roles. The
first <code class="language-plaintext highlighter-rouge">Time</code> is timely dataflow’s “capability”. It tells timely dataflow, and
us, at which logical times downstream operators are allowed to send data by
virtue of holding on to this message. The second times tell us when changes
<em>actually</em> occur, but these times don’t need to be equal to that of the
capability; timely dataflow doesn’t know about them.</p>
<p>It turns out that for things to make sense, all of the second times should be
greater or equal to that of the capability. If a change occurs, it may
precipitate changes at that or future times, and we really want a capability
that allows us to send messages reflecting those times. Correspondingly, we want
timely dataflow’s promise that “no messages with a given capability will arrive”
to have meaning; the completion of a capability timestamp will imply the
completion of the corresponding data timestamps.</p>
<p>So that’s the plan. Bundle up batches of <code class="language-plaintext highlighter-rouge">(D, Time, isize)</code> updates and send
them along with a capability that is less or equal to each of the times. Of
course we can’t just mint a capability out of nowhere, so it will really be the
reverse: grab a capability and use it to send all the pending updates at times
greater than or equal to its time. Once we’ve sent everything we need to, throw
away the capability and let workers proceed to whatever bundles of times are
next.</p>
<p>If we ever end up needing to send an update in the future of no capability we
hold, we done screwed up.</p>
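<p>That invariant can be sketched in a few lines of plain Rust (the type alias and <code class="language-plaintext highlighter-rouge">valid_batch</code> are illustrative names, not the actual implementation): a batch is well-formed when its capability time is less-or-equal, in the partial order, to every payload time.</p>

```rust
type Time = (u32, u32); // (round, iteration), product partial order

fn less_equal(a: Time, b: Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

// Sketch of the Step 1 message shape: one capability time plus a payload of
// (data, time, diff) triples. The invariant: the capability must be
// less-or-equal to every payload time, so retiring the capability retires
// all of the data times it covers.
fn valid_batch<D>(capability: Time, updates: &[(D, Time, i64)]) -> bool {
    updates.iter().all(|(_, t, _)| less_equal(capability, *t))
}
```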
<h4 id="step-2-high-resolution-operators">Step 2: High-resolution operators</h4>
<p>Operators currently receive timely dataflow messages of the first (time-free)
form above, and receive progress information about the capabilities on those
messages. We will need to rethink both of these, as well as the general
structure of the operator’s logic.</p>
<p>Informally, a differential dataflow operator accepts input updates into a pile,
differentiated by timestamp. When it learns from its input that a timestamp in
its pile is finished, it picks up all the updates with that timestamp and
commits them. It then flips through all the keys in these committed updates, and
checks whether the operator logic applied to the input collection at this time
still produces the accumulated output at this time, and issues updates if not.</p>
<p>Actually it is a bit more complicated, but let’s not worry about that here.</p>
<p>The rough structure up above is time-by-time, but there is nothing much that
prevents it from operating in terms of time <em>intervals</em> rather than individual
times. You probably know what an interval is, right? Something like <code class="language-plaintext highlighter-rouge">[a, b)</code>
that says “<code class="language-plaintext highlighter-rouge">a</code> and stuff up to but not including <code class="language-plaintext highlighter-rouge">b</code>”.</p>
<p>We are going to do this, but where <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are
<a href="https://en.wikipedia.org/wiki/Antichain">antichains</a>.</p>
<p>An antichain is a collection of mutually incomparable elements from a partial
order, and in timely dataflow it acts a bit like a line cut across the partial
order (not actually; that would be a
<a href="https://en.wikipedia.org/wiki/Antichain#Height_and_width">maximal antichain</a>).
We will speak of the interval <code class="language-plaintext highlighter-rouge">[a, b)</code> as those elements of the partial order
greater or equal to some element of <code class="language-plaintext highlighter-rouge">a</code>, but not greater or equal to any element
of <code class="language-plaintext highlighter-rouge">b</code>.</p>
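<p>Membership in such an interval falls straight out of the definition. A plain-Rust sketch (illustrative names, using the product order on <code class="language-plaintext highlighter-rouge">(round, iter)</code> as the partial order): <code class="language-plaintext highlighter-rouge">t</code> lies in <code class="language-plaintext highlighter-rouge">[a, b)</code> when some element of <code class="language-plaintext highlighter-rouge">a</code> is less-or-equal to it and no element of <code class="language-plaintext highlighter-rouge">b</code> is.</p>

```rust
type Time = (u32, u32); // (round, iteration), product partial order

fn less_equal(x: Time, y: Time) -> bool {
    x.0 <= y.0 && x.1 <= y.1
}

// t is in the interval [a, b) between antichains a and b if it is
// greater-or-equal to some element of a, and not greater-or-equal to any
// element of b.
fn in_interval(t: Time, a: &[Time], b: &[Time]) -> bool {
    a.iter().any(|&x| less_equal(x, t)) && !b.iter().any(|&x| less_equal(x, t))
}
```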
<p>It may make more sense to think of an interval as those times that, from
the point of view of a differential dataflow operator, were not previously
complete (greater-or-equal to the prior input frontier) but are now complete
(not greater-or-equal to the current input frontier). As an operator executes,
the sequence of input frontiers it observes evolves, and each step defines an
interval of this form.</p>
<p>With that thought in mind, our plan is to have each operator first identify the
interval of newly completed times, say <code class="language-plaintext highlighter-rouge">[a,b)</code>, and then pull all updates with
times in this interval. I don’t know a great data structure for this, so the
working plan is that all <code class="language-plaintext highlighter-rouge">(D, Time, isize)</code> updates are just going to be in a
big list that we scan each time the frontier changes. Once we pull out updates
at newly completed times, we order them by key and process each key
independently.</p>
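<p>That working plan might look something like the following sketch (plain Rust with illustrative types; not the actual operator code): scan the pile, drain the updates whose times are no longer greater-or-equal to any frontier element, and sort what we drained by key.</p>

```rust
type Time = (u32, u32); // (round, iteration), product partial order

fn less_equal(x: Time, y: Time) -> bool {
    x.0 <= y.0 && x.1 <= y.1
}

// A time is complete once it is no longer greater-or-equal to any element of
// the current input frontier (an antichain).
fn complete(t: Time, frontier: &[Time]) -> bool {
    !frontier.iter().any(|&f| less_equal(f, t))
}

// Drain all newly completed updates out of the pile, then order them by key
// (and by time within each key) so each key is processed once.
fn drain_completed(
    pile: &mut Vec<(u64, Time, i64)>,
    frontier: &[Time],
) -> Vec<(u64, Time, i64)> {
    let mut ready = Vec::new();
    let mut rest = Vec::new();
    for update in pile.drain(..) {
        if complete(update.1, frontier) {
            ready.push(update);
        } else {
            rest.push(update);
        }
    }
    *pile = rest;
    ready.sort(); // key first, then time
    ready
}
```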
<p>There are more details for sure, but once we are willing to just re-scan piles
of updates in the name of performance, many doors are open to us.</p>
<h4 id="organization">Organization</h4>
<p>I’m not sure I want to try and write operators that hybridize high-resolution
and low-resolution implementations. At the moment I’m more inclined to
specialize the <code class="language-plaintext highlighter-rouge">Collection</code> type, which wraps a stream of updates, into two
types:</p>
<ol>
<li>
<p><code class="language-plaintext highlighter-rouge">LoResCollection</code>, which has relatively few distinct times, and does not
bundle additional logical times in with the data.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">HiResCollection</code>, which has relatively many distinct times, and bundles
logical times in with the data.</p>
</li>
</ol>
<p>These two types can now have separate implementations of <code class="language-plaintext highlighter-rouge">group</code> and <code class="language-plaintext highlighter-rouge">join</code> and
such. This does raise the question of what happens with <code class="language-plaintext highlighter-rouge">join</code> when the inputs
have different granularities; I don’t know, other than that it is pretty easy to
promote a <code class="language-plaintext highlighter-rouge">LoResCollection</code> to a <code class="language-plaintext highlighter-rouge">HiResCollection</code> just by sticking the same
time in the payload. We could go the other way, but at an unboundedly horrible
cost, so let’s not.</p>
<p>Actually, the current <code class="language-plaintext highlighter-rouge">Trace</code> interface masks details about high-resolution vs
low-resolution, and operators like <code class="language-plaintext highlighter-rouge">join</code> just take pre-arranged traces rather
than weirdly typed <code class="language-plaintext highlighter-rouge">Collection</code> structs. It might be surprisingly non-horrible
to meld the two representations together, for example supporting a frequently
changing graph and infrequently changing queries against it. I’m not sure how
we would choose which output type to produce, though (the higher-resolution, of
course, but how to determine this without specialization).</p>
<p>Related, we will eventually want to meld high- and low-resolution trace
representations. Quickly changing edge sets call for a high-resolution
representation, but once the times have passed and we want to coalesce the
updates, the resulting updates change only with iterations and not rounds, and
admit a low-resolution representation. The low-resolution implementations can be
much more efficient than the high-resolution ones, because they avoid some
massive redundancy in re-transcribing the times with every update.</p>
<p>All in all, I think there are some great things to try out here, many likely
pitfalls, but some fairly cool potential. I’m optimistic that we will soon get
to a system that processes updates with high-resolution <em>and</em> high-throughput,
for as long as you run the system.</p>
<p>It will probably be slower on some batch graph compute, but are people really
still working on that stuff?</p>
<h3 id="addendum-a-prototype-march-5-2017">Addendum: A Prototype (March 5, 2017)</h3>
<p>I have a prototype up and running. It seems to produce the correct output, in
the sense that it produces exactly the same outputs whether you run it with one
update at a time, or one million updates at a time. Also, the output isn’t
empty; I thought to check that.</p>
<p>First up, let’s look at some measurements from the previous pile of code. This
previous pile takes batches of records which all have the same time. This means
that if you want each update to have its own timestamp, you get lots of small
batches. If you put multiple updates together in a batch, they all have the same
timestamp and their effects can’t be distinguished.</p>
<p>Using this implementation, let’s get some baseline measurements. We are going to
look at the breadth-first search computation (how many nodes are at each
distance from the root) doing one million updates to two random graphs, one with
1,000 nodes and 2,000 edges, and one with 1,000,000 nodes and 10,000,000 edges.
We will do the one million updates a few different ways, batching the updates in
batches of sizes from one up to one million (e.g. <code class="language-plaintext highlighter-rouge">10 x 100000</code> means batches of
size 10, done 100,000 times). All updates in a batch have the same timestamp.</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">142s</td>
<td style="text-align: right">100s</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">73s</td>
<td style="text-align: right">64s</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">27s</td>
<td style="text-align: right">51s</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">5s</td>
<td style="text-align: right">48s</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">-</td>
<td style="text-align: right">34s</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">-</td>
<td style="text-align: right">21s</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">-</td>
<td style="text-align: right">12s</td>
</tr>
</tbody>
</table>
<p>We don’t have measurements for 10,000 and larger batch sizes for the small
graph, because with only 2,000 edges and the same timestamp for all the updates
in a batch, most of the changes would just cancel. I should say, although loading
is trivial for the <code class="language-plaintext highlighter-rouge">1k / 2k</code> graph, the <code class="language-plaintext highlighter-rouge">1m / 10m</code> graph takes about eight seconds
to load its ten million edges, and these numbers include that.</p>
<p>Notice the massive discrepancy between single-element batches (142s and 100s)
and the larger batches (5s and 12s). This is a pretty substantial throughput
difference. We would love to get that throughput, or something close to it,
while keeping the resolution of single-element updates.</p>
<h4 id="the-prototype">The prototype</h4>
<p>There is some prototype code! Yay! It is <em>pretty weird</em> code, not like much I’ve
written before. I’m quite certain there are inefficiencies in it, so the
absolute numbers are just an indication that we are moving in the right
direction. These are the same experiments as above, except here <em>every update
has a distinct timestamp</em>. We are producing the output that corresponds to the
<code class="language-plaintext highlighter-rouge">1 x 1000000</code> row from above, but without shattering all the updates into
different batches.</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">237s</td>
<td style="text-align: right">94s</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">173s</td>
<td style="text-align: right">75s</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">148s</td>
<td style="text-align: right">57s</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">623s</td>
<td style="text-align: right">43s</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">-</td>
<td style="text-align: right">31s</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">-</td>
<td style="text-align: right">25s</td>
</tr>
</tbody>
</table>
<p>There are several things different about this chart.</p>
<ol>
<li>
<p>First up, you may notice we didn’t do a batch size 1 row. There are some
things we do assuming there will be lots of work, and when there isn’t lots
of work we do them anyhow. The whole point of this research is to move to
larger batches. That being said, this will probably be fixed. These same
issues end up hurting the small graph more than the large graph; the small
graph is sparser, and updates cause longer cascades of small updates.</p>
</li>
<li>
<p>We have a <code class="language-plaintext highlighter-rouge">10000 x 100</code> entry for the smaller graph! It makes sense to run
the experiment now, because each update with a different time doesn’t result
in cancelation. Unfortunately, it is terrible. The reason here seems to be
the same reason we had to do that
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-11.md">compaction</a>
stuff: with so many updates, each of the 1,000 keys gets a sizeable history,
and within a batch we are trying to process all of it without compacting it.
This makes us go quadratic in the number of updates per key per batch. The
good news is that we should be able to do compaction on our own. The bad news is
that I have to code that up.</p>
</li>
<li>
<p>The <code class="language-plaintext highlighter-rouge">1m / 10m</code> column doesn’t look so bad, does it? The times are worse than
before, for sure, but not by all that much. They are roughly “one batch size
worse”, I think. And the results tell us the exact consequences of each
individual update, corresponding to the <code class="language-plaintext highlighter-rouge">1 x 1000000</code> row up above. I think
these could also get a bit better, because there are some fairly feeble
moments in the code.</p>
</li>
</ol>
<p>Let’s take the <code class="language-plaintext highlighter-rouge">1m / 10m</code> experiment and crank up the number of workers. Note:
we are still producing the high-resolution outputs.</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1 worker</th>
<th style="text-align: right">2 workers</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">94s</td>
<td style="text-align: right">82s</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">75s</td>
<td style="text-align: right">58s</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">57s</td>
<td style="text-align: right">39s</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">43s</td>
<td style="text-align: right">28s</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">31s</td>
<td style="text-align: right">20s</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">25s</td>
<td style="text-align: right">15s</td>
</tr>
</tbody>
</table>
<p>Here, one worker takes 8s before it starts processing updates, and two workers
take 5s before they start processing updates. The reported numbers include those
start-up times (and look a bit better if you mentally subtract them out).</p>
<p>This is pretty good news, I think. For small batches the second worker doesn’t
help much, which is what we should expect; the high-resolution changes don’t
improve the performance of small batches, they make larger batches produce the
same output. The larger batches do get a decent benefit from additional workers;
the scaling isn’t 2x, and it probably shouldn’t be (we have to do data exchange,
and flail around with some buffers).</p>
<p>This looks pretty promising to me. We can get the output that used to take us
92s (100s - 8s) now in just 10s to 15s. Or, maybe 23s if we want sub-second
response time. See, we need to take the total time and divide by the number of
batches to get the average response time, and we only get 1m updates / 10s
throughput if we want to wait for 10s. In fact, if that is our strategy there
are going to be some updates that take 20s before we see their implications.
We’d really like to draw down the numbers for the medium batch sizes.</p>
<p>There are for sure things to improve in the code, and I hope and expect these
numbers to come down. I’m also worried about (and planning on fixing) the
numbers for the smaller graph, which I’d very much like to work hitch-free. In
particular, I’d love to have an “idiot-proof” implementation that just works for
any reasonable problem, without careful caveats about settings of batch sizes
and the like. Watch this space!</p>
<h3 id="addendum-small-message-optimization-march-7-2017">Addendum: Small message optimization (March 7, 2017)</h3>
<p>One of the “things we do assuming there will be lots of work”, alluded to above
as a reason we might have poor performance on small batch sizes, is radix sort.
As I’ve written it, there are 256 numbers to go and check each time you radix
shuffle on a byte, because there are that many different bytes each record might
have produced. You do this for each byte position of (in this case) an
eight-byte key.</p>
<p>If you just have 10 elements to sort, just call <code class="language-plaintext highlighter-rouge">.sort()</code>.</p>
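<p>Concretely, the change amounts to a size check in front of the radix machinery. A sketch of the hybrid (illustrative; not the actual <code class="language-plaintext highlighter-rouge">timely_sort</code> code, and the threshold is not tuned):</p>

```rust
// Least-significant-byte radix sort over an eight-byte key, falling back to
// the standard comparison sort when the input is too small to amortize the
// 256 buckets of bookkeeping that each radix pass requires.
const RADIX_THRESHOLD: usize = 256; // illustrative cutoff, not tuned

fn sort_by_key(items: &mut Vec<(u64, u32)>) {
    if items.len() < RADIX_THRESHOLD {
        items.sort(); // small input: just call .sort()
        return;
    }
    for byte in 0..8 {
        let mut buckets: Vec<Vec<(u64, u32)>> = (0..256).map(|_| Vec::new()).collect();
        for item in items.drain(..) {
            buckets[((item.0 >> (8 * byte)) & 0xFF) as usize].push(item);
        }
        for bucket in buckets {
            items.extend(bucket);
        }
    }
}

fn main() {
    let mut small = vec![(9, 0), (3, 1), (7, 2)];
    sort_by_key(&mut small); // takes the .sort() path
    assert_eq!(small, vec![(3, 1), (7, 2), (9, 0)]);
    let mut large: Vec<(u64, u32)> = (0..1000u64).rev().map(|i| (i, i as u32)).collect();
    sort_by_key(&mut large); // takes the radix path
    assert!(large.windows(2).all(|w| w[0].0 <= w[1].0));
}
```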
<p>I’ve done that now. The times have improved, generally. Old times are in
parentheses:</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">157s (237s)</td>
<td style="text-align: right">73s (94s)</td>
<td style="text-align: right">64s (82s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">79s (173s)</td>
<td style="text-align: right">58s (75s)</td>
<td style="text-align: right">46s (58s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">(148s)</td>
<td style="text-align: right">53s (57s)</td>
<td style="text-align: right">36s (39s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">(623s)</td>
<td style="text-align: right">41s (43s)</td>
<td style="text-align: right">28s (28s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">-</td>
<td style="text-align: right">(31s)</td>
<td style="text-align: right">(20s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">-</td>
<td style="text-align: right">(25s)</td>
<td style="text-align: right">(15s)</td>
</tr>
</tbody>
</table>
<p>Some measurements weren’t re-taken, under the premise that they shouldn’t be
improved (and I’m getting dodgy numbers the more my laptop runs and heats up).</p>
<p>The small instance still suffers from the second issue above: that the
implementation’s behavior is quadratic in the number of times per key in each
batch. For the <code class="language-plaintext highlighter-rouge">10000 x 100</code> experiment, several keys have more than 100 times,
resulting in 100x overhead that could be substantially reduced. I have a partial
solution for that, but it is vexingly hard to do some things with general
partial orders that are so very, very simple for integers that just increase.</p>
<p>Even in the larger graph, we can see large numbers of times for each key. I had
<code class="language-plaintext highlighter-rouge">group</code> capture a histogram of the number of distinct times each key processes
in each batch, and for the <code class="language-plaintext highlighter-rouge">1000000 x 1</code> experiment (the largest batch size,
admittedly, but also one we thought was getting decent performance), we get
distributions of distinct times that look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
</pre></td><td class="rouge-code"><pre>counts[1]: 56707
counts[2]: 106391
counts[3]: 144178
counts[4]: 158547
counts[5]: 149205
counts[6]: 123057
counts[7]: 91704
counts[8]: 62347
counts[9]: 39843
counts[10]: 23667
counts[11]: 13367
counts[12]: 7006
counts[13]: 3644
counts[14]: 1823
counts[15]: 857
counts[16]: 347
counts[17]: 173
counts[18]: 67
counts[19]: 33
counts[20]: 19
counts[21]: 6
counts[22]: 3
counts[23]: 2
counts[24]: 1
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Most of the keys are doing some amount of redundant work here. Each time
currently rescans the input updates and re-accumulates collections, whereas most
of this work can be done just once and then updated as we move through times.
That’s not the whole story though, which will have to wait for the next
addendum.</p>
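<p>The intended fix can be sketched as follows: process a key’s distinct times in order, folding each time’s updates into one running accumulation, instead of rescanning and re-accumulating the whole history once per time. (Hypothetical types, and totally ordered times assumed; the actual problem involves partial orders.)</p>

```rust
use std::collections::HashMap;

// For one key: given updates (val, time, diff) sorted by time, produce the
// accumulated collection at each distinct time. Each update is folded into
// the running accumulation exactly once, rather than being rescanned for
// every later time.
fn accumulate_per_time(updates: &[(u32, u64, i64)]) -> Vec<(u64, Vec<(u32, i64)>)> {
    let mut acc: HashMap<u32, i64> = HashMap::new();
    let mut out = Vec::new();
    let mut idx = 0;
    while idx < updates.len() {
        let time = updates[idx].1;
        // fold in every update at this time
        while idx < updates.len() && updates[idx].1 == time {
            let (val, _, diff) = updates[idx];
            *acc.entry(val).or_insert(0) += diff;
            if acc[&val] == 0 { acc.remove(&val); }
            idx += 1;
        }
        // snapshot the accumulation as of this time
        let mut snapshot: Vec<(u32, i64)> = acc.iter().map(|(&v, &d)| (v, d)).collect();
        snapshot.sort();
        out.push((time, snapshot));
    }
    out
}

fn main() {
    // val 5 added at time 1, val 6 added at time 2, val 5 removed at time 3.
    let out = accumulate_per_time(&[(5, 1, 1), (6, 2, 1), (5, 3, -1)]);
    assert_eq!(out, vec![(1, vec![(5, 1)]), (2, vec![(5, 1), (6, 1)]), (3, vec![(6, 1)])]);
}
```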
<h3 id="addendum-many-distinct-times-optimizations-march-24-2017">Addendum: Many distinct times optimizations (March 24, 2017)</h3>
<p>I have a candidate for <code class="language-plaintext highlighter-rouge">group</code> that works relatively well even with large
numbers of distinct times for each key. The details will need to wait for a
longer blog post, but they roughly amount to looking for totally ordered runs in
the times we work with, and (future work) re-arranging the times to have longer
runs. The result is an implementation that is linear (plus sorting) in the
number of updates, multiplied by the number of times that are not <code class="language-plaintext highlighter-rouge">gt</code> their
immediate predecessor.</p>
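<p>To make that cost measure concrete, here is the quantity being counted, sketched with <code class="language-plaintext highlighter-rouge">(round, iteration)</code> pairs under the product partial order as times (illustrative only): every time that is not <code class="language-plaintext highlighter-rouge">gt</code> its immediate predecessor starts a new totally ordered run, and the run count is the multiplier on the linear work.</p>

```rust
// A time in the product partial order: (round, iteration).
// `gt` holds when both coordinates are >= and the times differ.
fn gt(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 >= b.0 && a.1 >= b.1 && a != b
}

// Count the times that are not `gt` their immediate predecessor: the number
// of totally ordered runs in the sequence, and hence the cost multiplier.
fn run_count(times: &[(u32, u32)]) -> usize {
    (0..times.len())
        .filter(|&i| i == 0 || !gt(times[i], times[i - 1]))
        .count()
}

fn main() {
    // One run: each time strictly dominates its predecessor.
    assert_eq!(run_count(&[(0, 0), (0, 1), (0, 2)]), 1);
    // Incomparable neighbors break runs: three runs here.
    assert_eq!(run_count(&[(0, 1), (1, 0), (0, 2)]), 3);
}
```

<p>Re-arranging the second example as <code class="language-plaintext highlighter-rouge">(0, 1), (0, 2), (1, 0)</code> drops the count to two, which is the flavor of re-ordering described as future work above.</p>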
<p>This works great for total orders, and is a start for partial orders. I still
have some more to do with respect to re-ordering times to cut down on this
number, but already there are some improvements in running times. Here are
updated numbers with old execution times in parentheses (note: other
optimizations have happened along the way, so this isn’t just about a new
algorithm).</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">117s (157s)</td>
<td style="text-align: right">82s (73s)</td>
<td style="text-align: right">68s (64s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">75s (79s)</td>
<td style="text-align: right">65s (58s)</td>
<td style="text-align: right">46s (46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">87s (148s)</td>
<td style="text-align: right">58s (53s)</td>
<td style="text-align: right">40s (36s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">70s (623s)</td>
<td style="text-align: right">47s (41s)</td>
<td style="text-align: right">33s (28s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">131s</td>
<td style="text-align: right">34s (31s)</td>
<td style="text-align: right">21s (20s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">385s</td>
<td style="text-align: right">26s (25s)</td>
<td style="text-align: right">19s (15s)</td>
</tr>
</tbody>
</table>
<p>As you can see, several numbers for the smaller graph got much better, and at
the same time the numbers for the larger graph got a bit worse. This makes
sense, as the code is certainly more sophisticated than before, and if the
problem didn’t exist (e.g. the larger graph) we are just paying a cost. That
being said, I bet we can recover these losses and more when we actually try and
optimize the implementations; if nothing else, we can just drop in to the
simpler implementation for small numbers of times and save the complex one for
large numbers of times.</p>
<p>Also in the measurements, the times for the small graph are not strictly
improving as we increase the batch size. This is probably a result of not really
nailing the smallest number of totally ordered chains, though I can’t yet
confirm that. There are some other reasons that arbitrarily large batches
aren’t perfect for iterative algorithms (in each iteration we must at least pick
up previous updates, making each iteration take time linear in the sum of batch
sizes of prior iterations, rather than just their own size).</p>
<h3 id="addendum-fixing-some-deranged-allocation-march-26-2017">Addendum: Fixing some deranged allocation (March 26, 2017)</h3>
<p>You might notice in the numbers above a disappointing spike up to <code class="language-plaintext highlighter-rouge">87s</code> for the
small graph in the <code class="language-plaintext highlighter-rouge">1000 x 1000</code> configuration. It turns out this is because one
thousand updates, which turns into two thousand changes (one edge in, one edge
out), is just over the threshold we used for “should we radix sort or not?”.
This means that for these settings, we end up allocating some 256 buffers and
working with them a fair bit. And then we drop them on the floor so that we can
re-allocate them the next time around. Not very bright.</p>
<p>In fact, it was much sillier than that. The sorting happens as part of
separating an undifferentiated pile of updates into “sealed” updates, those
whose times have passed and we are now ready to finalize, and updates that stay
in the pile. We were doing the “should we radix sort” based on the
undifferentiated number, rather than the number we would eventually have to sort
(those of finished times). Because of how partially ordered times work, and that
timely dataflow can only carry one capability at a time, we end up slicing these
batches even more finely when the frontier advances, so that we have several
small sealed sets. Each of them has the bad radix sorting allocate-and-drop
behavior.</p>
<p>So that should be fixed. I even pushed a new version of <code class="language-plaintext highlighter-rouge">timely_sort</code> (some
radix sorting code) that does less allocation. I’m not quite using it yet, but
even with the local fixes, numbers look better:</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">63s (87s)</td>
<td style="text-align: right">54s (58s)</td>
<td style="text-align: right">36s (40s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">61s (70s)</td>
<td style="text-align: right">43s (47s)</td>
<td style="text-align: right">28s (33s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">127s (131s)</td>
<td style="text-align: right">31s (34s)</td>
<td style="text-align: right">20s (21s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">(385s)</td>
<td style="text-align: right">23s (26s)</td>
<td style="text-align: right">16s (19s)</td>
</tr>
</tbody>
</table>
<p>The configurations with batches smaller than one thousand really shouldn’t see
much change, and the much larger batches shouldn’t have <em>much</em> change (some
small batches emerge in the computation for larger batches). There is some
serious improvement for the small graph, and decent improvement for the large
graph. We are mostly regaining ground on the larger graph, having taken a hit
from the complexity of the new and complicated “linear-ish” algorithm.</p>
<p>What these numbers should tell you, though, is that all this code is new enough
that we are getting 10% improvements just by looking at it and removing the
stupid. I’m planning on doing a bit more of that next. For example, each time we
radix sort one thousand elements, we compute the hash of each element eight
times. Why
would we do that?</p>
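<p>The obvious fix is to compute each record’s hash once, carry it alongside the record, and let all eight radix passes read the cached value. A sketch under that assumption (the hasher choice here is illustrative):</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Hash each record exactly once up front, radix-shuffle on bytes of the
// cached hash, then strip the hashes; no pass recomputes a hash.
fn radix_sort_by_hash<T: Hash>(items: Vec<T>) -> Vec<T> {
    let mut pairs: Vec<(u64, T)> = items.into_iter().map(|x| (hash_of(&x), x)).collect();
    for byte in 0..8 {
        let mut buckets: Vec<Vec<(u64, T)>> = (0..256).map(|_| Vec::new()).collect();
        for pair in pairs.drain(..) {
            buckets[((pair.0 >> (8 * byte)) & 0xFF) as usize].push(pair);
        }
        for bucket in buckets {
            pairs.extend(bucket);
        }
    }
    pairs.into_iter().map(|(_, x)| x).collect()
}

fn main() {
    let out = radix_sort_by_hash((0..100u32).collect::<Vec<_>>());
    // The output is the input permuted into hash order.
    let hashes: Vec<u64> = out.iter().map(|x| hash_of(x)).collect();
    assert!(hashes.windows(2).all(|w| w[0] <= w[1]));
    let mut back = out.clone();
    back.sort();
    assert_eq!(back, (0..100u32).collect::<Vec<_>>());
}
```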
<p>One appealing aspect of Rust (over managed languages) is that there is no reason
we shouldn’t be able to write the code that does what we think it should do, and
in this case we kinda think we should be able to sort some updates by key, zip
them along stashed state, and compute and ship output differences. Anything
that takes time is either because (i) we aren’t actually doing that yet, or (ii)
we are doing it badly. Each of these should be fixable.</p>
<p>One problem is that I don’t actually know how fast we <em>should</em> be able to
compute one million related bfs computations. Should we hope to get the <code class="language-plaintext highlighter-rouge">1k/2k</code>
number down to one second? Why not? That seems like a good goal to aim for. Or
at least, we should understand which fundamental computations prevent us from
reaching that goal.</p>
<h3 id="addendum-quadratic-behavior-in-join-march-31-2017">Addendum: Quadratic behavior in <code class="language-plaintext highlighter-rouge">join</code> (March 31, 2017)</h3>
<p>I found the source of the bad behavior for the small graph!</p>
<p>When shifting from “each batch contains one time” to “batches may contain
multiple times” I was pleased to find that the <code class="language-plaintext highlighter-rouge">join</code> logic still passed its
tests and didn’t seem to need any fixing. Wow was that wrong.</p>
<p>First, it turns out that <code class="language-plaintext highlighter-rouge">join</code> as written wasn’t even correct. It passed all
the tests because <code class="language-plaintext highlighter-rouge">bfs</code> (which I used as the “integration” test) never has both
of its join inputs vary at the same time. Sure the <code class="language-plaintext highlighter-rouge">edges</code> change and the
<code class="language-plaintext highlighter-rouge">dists</code> change, but it is always either one (new round of edge data) or the
other (new iteration). So I fixed that (the bug, not the test).</p>
<p>More interestingly, I think, it became patently clear that the implementation
would quite happily go quadratic even on joins where there was <em>no</em> output. Let
me explain how join used to work:</p>
<p>When joining two sources of data, we get sequences of batches of updates, each
of which looks kind of like a <code class="language-plaintext highlighter-rouge">Vec<(K, V, T, R)></code> (where you should think of <code class="language-plaintext highlighter-rouge">R</code>
as <code class="language-plaintext highlighter-rouge">isize</code>). For each batch we receive, we want to join it with all batches
received so far on the <em>other</em> input. This is a fairly traditional streaming
equijoin. So we join one <code class="language-plaintext highlighter-rouge">batch</code> with the <code class="language-plaintext highlighter-rouge">trace</code> of all history for the other
input (probably compacted a bit; that isn’t the issue) on the other side:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="k">for</span> <span class="n">key</span> <span class="n">in</span> <span class="n">batch</span><span class="nf">.keys</span><span class="p">()</span> <span class="p">{</span>
<span class="k">for</span> <span class="o">&</span><span class="p">(</span><span class="n">val1</span><span class="p">,</span> <span class="n">time1</span><span class="p">,</span> <span class="n">diff1</span><span class="p">)</span> <span class="n">in</span> <span class="o">&</span><span class="n">batch</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="p">{</span>
<span class="k">for</span> <span class="o">&</span><span class="p">(</span><span class="n">val2</span><span class="p">,</span> <span class="n">time2</span><span class="p">,</span> <span class="n">diff2</span><span class="p">)</span> <span class="n">in</span> <span class="o">&</span><span class="n">trace</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="p">{</span>
<span class="n">output</span><span class="nf">.ship</span><span class="p">(((</span><span class="n">val1</span><span class="p">,</span> <span class="n">val2</span><span class="p">),</span> <span class="n">time1</span><span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">time2</span><span class="p">),</span> <span class="n">diff1</span> <span class="o">*</span> <span class="n">diff2</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Amazingly, due to the bi-linearity of <code class="language-plaintext highlighter-rouge">join</code> and the way differential dataflow
differences work, this is actually correct. Even more clearly, this will do an
amount of work proportional to the product of the sizes of <code class="language-plaintext highlighter-rouge">batch</code> and <code class="language-plaintext highlighter-rouge">trace</code>.
That makes some sense, because we probably expect to see an output for each pair
of values in <code class="language-plaintext highlighter-rouge">batch</code> and <code class="language-plaintext highlighter-rouge">trace</code>, right?</p>
<p>WRONG!</p>
<p>This is where I started to think that maybe I should read about temporal
databases or something, rather than “discovering” all this stuff myself. Over
the course of the history of <code class="language-plaintext highlighter-rouge">batch</code> and <code class="language-plaintext highlighter-rouge">trace</code>, the collections may never grow
to be all that big. In fact, they could totally alternate empty / non-empty out
of sync, in which case there would be no matches. All we would need to do to see
this would be to play each history in order, which takes time linear in the
inputs.</p>
<p>So I wrote a better inner loop for join when the histories are big and scary
(and fall back to the default implementation when they are not). The idea is
roughly to walk through the histories in order, maintaining each collection’s
updates accumulated with respect to the remaining frontier of times for the
other collection (for total orders, read this as: “updated in place”).</p>
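<p>For totally ordered times the improved loop can be sketched like this (illustrative; the real code must also handle partial orders and compaction): walk both histories by time, keep each side’s accumulation updated in place, and only match batches of same-time updates against the other side’s current accumulation.</p>

```rust
use std::collections::HashMap;

// Histories for one key: (val, time, diff) updates, sorted by time.
// Walk both in time order; each pair of updates is matched exactly once, at
// the max of their times, so work is linear in the histories plus outputs,
// rather than proportional to the product of the history sizes.
fn join_histories(
    left: &[(u32, u64, i64)],
    right: &[(u32, u64, i64)],
) -> Vec<((u32, u32), u64, i64)> {
    let mut acc_l: HashMap<u32, i64> = HashMap::new(); // left accumulation so far
    let mut acc_r: HashMap<u32, i64> = HashMap::new(); // right accumulation so far
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() || j < right.len() {
        // the earliest unprocessed time on either side
        let t = match (left.get(i), right.get(j)) {
            (Some(l), Some(r)) => l.1.min(r.1),
            (Some(l), None) => l.1,
            (None, Some(r)) => r.1,
            (None, None) => unreachable!(),
        };
        let i0 = i; while i < left.len() && left[i].1 == t { i += 1; }
        let j0 = j; while j < right.len() && right[j].1 == t { j += 1; }
        // new left updates against the right accumulation (times before t)
        for &(v, _, d) in &left[i0..i] {
            for (&w, &e) in &acc_r { out.push(((v, w), t, d * e)); }
        }
        // new right updates against the left accumulation (times before t)
        for &(w, _, e) in &right[j0..j] {
            for (&v, &d) in &acc_l { out.push(((v, w), t, d * e)); }
        }
        // new left against new right, both at time t
        for &(v, _, d) in &left[i0..i] {
            for &(w, _, e) in &right[j0..j] { out.push(((v, w), t, d * e)); }
        }
        // fold the batches into the accumulations, dropping zeros
        for &(v, _, d) in &left[i0..i] {
            *acc_l.entry(v).or_insert(0) += d;
            if acc_l[&v] == 0 { acc_l.remove(&v); }
        }
        for &(w, _, e) in &right[j0..j] {
            *acc_r.entry(w).or_insert(0) += e;
            if acc_r[&w] == 0 { acc_r.remove(&w); }
        }
    }
    out
}

fn main() {
    // Left value 1 present from time 0; right value 2 arrives at time 3.
    assert_eq!(join_histories(&[(1, 0, 1)], &[(2, 3, 1)]), vec![((1, 2), 3, 1)]);
    // Left leaves exactly when right arrives: the matches consolidate to zero.
    let out = join_histories(&[(1, 0, 1), (1, 2, -1)], &[(2, 2, 1), (2, 4, -1)]);
    assert_eq!(out.iter().map(|x| x.2).sum::<i64>(), 0);
}
```

<p>In the alternating empty/non-empty scenario above, this does work linear in the two histories and produces updates that consolidate away, where the nested-loop version would still pay for every pair.</p>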
<p>Returning to our trusty <code class="language-plaintext highlighter-rouge">bfs</code> experiment, we get new numbers that look like (old
numbers in parentheses):</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">62s (63s)</td>
<td style="text-align: right">56s (54s)</td>
<td style="text-align: right">38s (36s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">59s (61s)</td>
<td style="text-align: right">47s (43s)</td>
<td style="text-align: right">32s (28s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">69s (127s)</td>
<td style="text-align: right">36s (31s)</td>
<td style="text-align: right">23s (20s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">73s (385s)</td>
<td style="text-align: right">32s (23s)</td>
<td style="text-align: right">21s (16s)</td>
</tr>
</tbody>
</table>
<p>This is, much like a post or two back, a serious improvement for the small
graph, and a non-trivial regression for the larger graph.</p>
<p>I’m not entirely sure what is wrong with the larger graph, in that the join
implementation is largely the same for uncomplicated histories, except that it
must first extract the history to check if it is complicated; the old
implementation didn’t have to copy data out from <code class="language-plaintext highlighter-rouge">batch</code> and <code class="language-plaintext highlighter-rouge">trace</code> to look at
it, which is perhaps the issue? I feel like we can eventually work around that,
especially given that batch exfiltration of data should be faster than the
careful navigation we were (and still are, unfortunately) doing to read the
data.</p>
<p>Looking at a profile, the large graph <code class="language-plaintext highlighter-rouge">1000000 x 1</code> experiment spends only 6% of
its time in <code class="language-plaintext highlighter-rouge">join</code> at all, so the serious regression seems unlikely to live
there. I don’t think I’ve changed <code class="language-plaintext highlighter-rouge">group</code> in the meantime, so I’m not exactly
sure what is screwed up; perhaps I tweaked the measurement program
inappropriately, or perhaps I caught a dodgy measurement the previous time
around (when there was, in fairness, a buggy join implementation).</p>
<p>For the small graph, the bulk of the time is now spent in <code class="language-plaintext highlighter-rouge">group</code>, in some
operations that may still have some defective performance (sorting mostly, it
seems; technically super-linear). It would be great to get performance to
improve with increasing batch size before starting to optimize the
implementations.</p>
<h3 id="addendum-simplifying-interesting-times-march-31-2017">Addendum: Simplifying “interesting times” (March 31, 2017)</h3>
<p>One complicated bit of logic in <code class="language-plaintext highlighter-rouge">group</code> determines the logical times at which we
may need to re-evaluate user logic. It is not so much complicated, as much as I
made it complicated. The logic is meant to take two sets of times, <code class="language-plaintext highlighter-rouge">old</code> and
<code class="language-plaintext highlighter-rouge">new</code> let’s say, and determine the times that are the lattice join of a subset
of <code class="language-plaintext highlighter-rouge">old</code> and a non-empty subset of <code class="language-plaintext highlighter-rouge">new</code>.</p>
<p>For example, here is the reference implementation that I wrote (here <code class="language-plaintext highlighter-rouge">edits</code> is
the old set and <code class="language-plaintext highlighter-rouge">times</code> is the new set):</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="rouge-code"><pre><span class="c">// REFERENCE IMPLEMENTATION (LESS CLEVER)</span>
<span class="k">let</span> <span class="n">times_len</span> <span class="o">=</span> <span class="n">times</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">for</span> <span class="n">position</span> <span class="n">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">times_len</span> <span class="p">{</span>
<span class="k">for</span> <span class="o">&</span><span class="p">(</span><span class="mi">_</span><span class="p">,</span> <span class="k">ref</span> <span class="n">time</span><span class="p">,</span> <span class="mi">_</span><span class="p">)</span> <span class="n">in</span> <span class="n">edits</span> <span class="p">{</span>
<span class="k">if</span> <span class="o">!</span><span class="n">time</span><span class="nf">.le</span><span class="p">(</span><span class="o">&</span><span class="n">times</span><span class="p">[</span><span class="n">position</span><span class="p">])</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">join</span> <span class="o">=</span> <span class="n">time</span><span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">times</span><span class="p">[</span><span class="n">position</span><span class="p">]);</span>
<span class="n">times</span><span class="nf">.push</span><span class="p">(</span><span class="n">join</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">position</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="n">position</span> <span class="o"><</span> <span class="n">times</span><span class="nf">.len</span><span class="p">()</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">index</span> <span class="n">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">position</span> <span class="p">{</span>
<span class="k">if</span> <span class="o">!</span><span class="n">times</span><span class="p">[</span><span class="n">index</span><span class="p">]</span><span class="nf">.le</span><span class="p">(</span><span class="o">&</span><span class="n">times</span><span class="p">[</span><span class="n">position</span><span class="p">])</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">join</span> <span class="o">=</span> <span class="n">times</span><span class="p">[</span><span class="n">index</span><span class="p">]</span><span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">times</span><span class="p">[</span><span class="n">position</span><span class="p">]);</span>
<span class="n">times</span><span class="nf">.push</span><span class="p">(</span><span class="n">join</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">position</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">times</span><span class="p">[</span><span class="n">position</span><span class="o">..</span><span class="p">]</span><span class="nf">.sort</span><span class="p">();</span>
<span class="n">times</span><span class="nf">.dedup</span><span class="p">();</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It does a bunch of work, much more than the possibly linear-time implementation
I worked hard on. Of course, it is so much simpler (by about 80 lines, and many
loops), and we should probably just use it when we don’t have lots of edits.
Because often we don’t have lots of edits.</p>
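<p>To make that concrete, here is a minimal standalone sketch of the same closure computation, assuming timestamps are <code class="language-plaintext highlighter-rouge">(u64, u64)</code> pairs ordered coordinate-wise; the helper names <code class="language-plaintext highlighter-rouge">le</code>, <code class="language-plaintext highlighter-rouge">join</code>, and <code class="language-plaintext highlighter-rouge">close_interesting_times</code> are illustrative, not the library’s API:</p>

```rust
// Sketch: close a set of times under joins with an "old" set (`edits`)
// and under joins among themselves. Times are (u64, u64) pairs with the
// product partial order; `join` is the coordinate-wise max.
type Time = (u64, u64);

fn le(a: &Time, b: &Time) -> bool { a.0 <= b.0 && a.1 <= b.1 }
fn join(a: &Time, b: &Time) -> Time { (a.0.max(b.0), a.1.max(b.1)) }

/// Extend `times` (the new set) with joins against `edits` (the old set),
/// then close the result under join, deduplicating as we go.
fn close_interesting_times(edits: &[Time], times: &mut Vec<Time>) {
    // Joins of each old time with each new time, when not already dominated.
    let times_len = times.len();
    for position in 0..times_len {
        for time in edits {
            if !le(time, &times[position]) {
                let j = join(time, &times[position]);
                times.push(j);
            }
        }
    }
    // Close the accumulated set under pairwise joins.
    let mut position = 0;
    while position < times.len() {
        for index in 0..position {
            if !le(&times[index], &times[position]) {
                let j = join(&times[index], &times[position]);
                times.push(j);
            }
        }
        position += 1;
        times[position..].sort();
        times.dedup();
    }
}

fn main() {
    let edits = vec![(2u64, 0u64)];
    let mut times = vec![(0u64, 3u64)];
    close_interesting_times(&edits, &mut times);
    // The join (2, 0) v (0, 3) = (2, 3) must now be present.
    assert!(times.contains(&(2, 3)));
    println!("{:?}", times);
}
```

<p>This mirrors the reference implementation’s quadratic flavor; the point is the behavior, not the asymptotics.</p>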
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">58s (62s)</td>
<td style="text-align: right">55s (56s)</td>
<td style="text-align: right">38s (38s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">56s (59s)</td>
<td style="text-align: right">46s (47s)</td>
<td style="text-align: right">30s (32s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">67s (69s)</td>
<td style="text-align: right">35s (36s)</td>
<td style="text-align: right">23s (23s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">74s (73s)</td>
<td style="text-align: right">31s (32s)</td>
<td style="text-align: right">23s (21s)</td>
</tr>
</tbody>
</table>
<p>These are pretty minor effects, with some light improvement in the smaller batch
sizes where we expect less complicated histories. I was initially really excited
about this because I conflated the improvements with the next optimization, but
once I broke them apart this was not the better part. Sorry!</p>
<h3 id="addendum-avoiding-expensive-hashing-march-31-2017">Addendum: Avoiding expensive hashing (March 31, 2017)</h3>
<p>What actually makes a difference is ripping out a fair amount of redundant
hashing. Our default storage uses hash tables to index data by key, which is
really helpful when we have relatively few keys in each batch. At the same time,
our default implementation just calls each type’s associated hash function whenever
it needs a hash, which can happen quite a lot.</p>
<p>In particular, when we first arrange data into batches we sort it (by key, then
value, then time), and while this is primarily a radix sort using the hash, we
need to finish it with a standard Rust sort to deal with possible hash
collisions and to get values and times ordered too. If we have lots of data, and
especially if we have lots of equivalent keys, this ends up calling the hash
function on the key quite a lot.</p>
<p>There is a small change we can make to cache the hash value; doing that doesn’t
seem to help all that much; it probably makes sorting faster but then costs
later on when we need to move around keys and hash values together. This is
worth looking into more, because if you show up with long <code class="language-plaintext highlighter-rouge">String</code> keys you
aren’t going to want lots of hash re-evaluation.</p>
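<p>One way such caching might look, sketched here with a hypothetical <code class="language-plaintext highlighter-rouge">Hashed</code> wrapper (not differential dataflow’s actual type): the hash is computed once at construction and replayed thereafter, so sorting and indexing by <code class="language-plaintext highlighter-rouge">(hash, key)</code> never re-hash the key:</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical wrapper that computes a key's hash once and carries it along,
// so sorting and indexing can reuse the cached value instead of re-hashing.
// Derived ordering compares `hash` first, then `key`, matching a radix sort
// on hashes finished by a standard comparison sort.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hashed<K> {
    hash: u64,
    key: K,
}

impl<K: Hash> Hashed<K> {
    fn new(key: K) -> Self {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        Hashed { hash: hasher.finish(), key }
    }
}

// Hashing a `Hashed<K>` just replays the cached value: no re-hash of `key`.
impl<K> Hash for Hashed<K> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

fn main() {
    let a = Hashed::new(String::from("some-long-key"));
    let b = Hashed::new(String::from("some-long-key"));
    // Equal keys cache equal hashes; only construction pays the hashing cost.
    assert_eq!(a.hash, b.hash);
    println!("cached hash: {}", a.hash);
}
```

<p>As noted above, the trade-off is that the cached hash must travel with the key, which costs when moving data around.</p>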
<p>A simple change, for the purposes of graphs, is to use random node identifiers
and have each identifier be its own hash value. This works out great, and we get
generally improved performance:</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">51s (58s)</td>
<td style="text-align: right">51s (55s)</td>
<td style="text-align: right">31s (38s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">51s (56s)</td>
<td style="text-align: right">38s (46s)</td>
<td style="text-align: right">25s (30s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">60s (67s)</td>
<td style="text-align: right">28s (35s)</td>
<td style="text-align: right">19s (23s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">66s (74s)</td>
<td style="text-align: right">24s (31s)</td>
<td style="text-align: right">17s (23s)</td>
</tr>
</tbody>
</table>
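<p>The identity-hash trick above can be sketched as a pass-through <code class="language-plaintext highlighter-rouge">Hasher</code>; the <code class="language-plaintext highlighter-rouge">IdentityHasher</code> name is illustrative, and this assumes keys are already well-distributed random <code class="language-plaintext highlighter-rouge">u64</code> identifiers:</p>

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical pass-through hasher: a u64 node identifier (assumed random,
// hence already well-distributed) is used directly as its own hash value.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("only u64 keys are supported");
    }
    fn write_u64(&mut self, n: u64) { self.0 = n; }
    fn finish(&self) -> u64 { self.0 }
}

// A hash map keyed by u64 identifiers that skips hashing entirely.
type IdentityMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut degrees: IdentityMap<usize> = IdentityMap::default();
    degrees.insert(42, 3);
    assert_eq!(degrees.get(&42), Some(&3));
    println!("ok");
}
```

<p>Note this only makes sense when the identifiers are effectively random; sequential identifiers would cluster badly in some table implementations.</p>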
<p>This recovers a fair chunk of time we lost previously, and this difference could
actually be the source of the apparent regression (perhaps I was using this
version before; I certainly have in the past).</p>
<p>One question this raises is whether we really need hash tables as the index
structure. They are helpful for sparse access, but if our plan is to push hard
on throughput, perhaps simple ordered lists are good enough. They are much
simpler to construct, and very cheap to merge. They would likely kill the
numbers for small batch sizes, effectively raising the “minimum latency” you
would experience for small loads. This will also be fun to check out, though.</p>
<p>Plus we are actually going to put real indices in place at some point, which
should make the distinction less important.</p>
<p>We still have an uptick for increasing batch sizes in the small graph, and I
still want to sort that out. Removing all this hashing is one way of getting rid
of noise that has been keeping the source of the problem a mystery.</p>
<h3 id="addendum-re-engineering-group-april-2-2017">Addendum: Re-engineering <code class="language-plaintext highlighter-rouge">group</code> (April 2, 2017)</h3>
<p>I did a bit of a re-write of the core <code class="language-plaintext highlighter-rouge">group</code> logic. Not much has changed
algorithmically, but certain parts were tidied up enough that we spend less time
futzing around with messy piles of data.</p>
<p>For example, previously the operator accepted batches of keyed input data, and
for each key flipped through all times to create a list of <code class="language-plaintext highlighter-rouge">(key, time)</code> pairs
we should look into. That’s great, but we didn’t really need to do that; we can
just wait until we start to work on the key, and put together the list of times
for that key. This required a bit of sanity checking about “exactly what times
are we planning on working on” that was enabled by the simplified code
structure.</p>
<p>We also bite off a larger chunk of the graph to work on, doing only one sweep
through the keys where we may previously have done several, feeding the output
into different batches as appropriate (when we have multiple incomparable
capabilities).</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">45s (51s)</td>
<td style="text-align: right">57s (51s)</td>
<td style="text-align: right">44s (31s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">39s (51s)</td>
<td style="text-align: right">43s (38s)</td>
<td style="text-align: right">30s (25s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">48s (60s)</td>
<td style="text-align: right">29s (28s)</td>
<td style="text-align: right">19s (19s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">55s (66s)</td>
<td style="text-align: right">20s (24s)</td>
<td style="text-align: right">12s (17s)</td>
</tr>
</tbody>
</table>
<p>Some interesting things happen here. The small graph performance improves a fair
bit, the very large batch performance improves quite a bit (more on this), and
the small batch large graph performance takes a bit of a hit. I’m not exactly
sure what the deal is here, except that we are doing more work in larger batches
now, and this provides both opportunities to do things well and to do things
badly.</p>
<p>I want to call out the large-graph, multiple-worker numbers. There is a pretty
serious improvement there, which is even more impressive when you learn that the
first 4.5 seconds are spent just prepping the computation (loading the graph and
doing the initial bfs computation). So what we are actually seeing appears to be
12s of compute going down to 8s of compute. I just need to do that a few more
times. Also, I should make sure to run the integration tests to check that we
are producing the correct output (ed: apparently).</p>
<h3 id="addendum-less-interesting-times-april-7-2017">Addendum: Less interesting times (April 7, 2017)</h3>
<p>I made what I think of as a pretty substantial change to the way <code class="language-plaintext highlighter-rouge">group</code> works.
Let me recap, both because it gets us on the same page, and because I need the
practice.</p>
<p>The <code class="language-plaintext highlighter-rouge">group</code> operator works on a bunch of keys in parallel, and for our purposes
we are just going to talk about what it does for each key individually (it maps
this behavior across all keys).</p>
<p>The <code class="language-plaintext highlighter-rouge">group</code> operator repeatedly gets presented with batches of updates each of
which corresponds to an <em>interval</em> of partially ordered time: <code class="language-plaintext highlighter-rouge">[lower, upper)</code>,
where <code class="language-plaintext highlighter-rouge">lower</code> and <code class="language-plaintext highlighter-rouge">upper</code> are both antichains (sets whose elements are mutually
incomparable) and the interval includes those times greater or equal to an
element of <code class="language-plaintext highlighter-rouge">lower</code> but not greater or equal to any element of <code class="language-plaintext highlighter-rouge">upper</code>.</p>
<p>When presented with an interval of updates, the <code class="language-plaintext highlighter-rouge">group</code> operator is now in a
position to determine the corresponding interval of updates to its output. All
of the input updates at times not after <code class="language-plaintext highlighter-rouge">upper</code> have been locked in, and this
means, mathematically, that all of the output updates at times not after
<code class="language-plaintext highlighter-rouge">upper</code> are also locked in; we just haven’t computed them yet. So the <code class="language-plaintext highlighter-rouge">group</code>
operator needs to determine the output updates at times in the interval
<code class="language-plaintext highlighter-rouge">[lower, upper)</code>.</p>
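<p>The interval membership test follows directly from the definitions above; this sketch again assumes coordinate-wise ordered <code class="language-plaintext highlighter-rouge">(u64, u64)</code> timestamps, with <code class="language-plaintext highlighter-rouge">in_interval</code> as a hypothetical helper name:</p>

```rust
// Sketch of membership in `[lower, upper)`, where `lower` and `upper` are
// antichains of (u64, u64) timestamps under the product partial order.
type Time = (u64, u64);

fn le(a: &Time, b: &Time) -> bool { a.0 <= b.0 && a.1 <= b.1 }

/// A time lies in `[lower, upper)` if it is greater than or equal to some
/// element of `lower`, and not greater than or equal to any element of `upper`.
fn in_interval(lower: &[Time], upper: &[Time], time: &Time) -> bool {
    lower.iter().any(|l| le(l, time)) && !upper.iter().any(|u| le(u, time))
}

fn main() {
    let lower = [(2u64, 0u64), (0, 3)]; // antichain: mutually incomparable
    let upper = [(4u64, 0u64), (0, 5)];
    assert!(in_interval(&lower, &upper, &(2, 1)));  // above (2,0), below both uppers
    assert!(!in_interval(&lower, &upper, &(4, 2))); // at or beyond (4,0)
    assert!(!in_interval(&lower, &upper, &(1, 1))); // below every element of lower
    println!("ok");
}
```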
<p>The previous implementation did this by tracking all of the times at which the
output might change, and each time around seeing which of these times are in the
interval <code class="language-plaintext highlighter-rouge">[lower, upper)</code> and working on those times. This was intended to be
very precise, but it has some serious overhead and, counter-intuitively, can end
up less precise than simpler methods.</p>
<p>The current implementation (this is all a work in progress) just takes as input
<code class="language-plaintext highlighter-rouge">lower</code> and <code class="language-plaintext highlighter-rouge">upper</code>, and starts looking for times that land in this interval. A
time is plausibly interesting, in that it could possibly have a non-zero output
update, if it is the join of sets of times found in input or output updates. As
we are (currently) planning on walking through all updates anyhow (to “simulate”
the history of the values for the key), we have the opportunity to start forming
these sets of joined things and seeing which land in our target interval.</p>
<p>Although we might consider lots of times, each time will either be (i) in the
<code class="language-plaintext highlighter-rouge">[lower, upper)</code> interval, in which case we want to reconsider it, or (ii) at
<code class="language-plaintext highlighter-rouge">upper</code> or beyond, in which case we should defer it for future processing. We
can also skip any times in the future of deferred times, because we’ll just
re-discover them when we get to them in the future, right?</p>
<p>Or will we?</p>
<p>This is meant to be the “good news” of this approach: if in the future it turns
out that the updates that prompted some possibly interesting time vanish,
perhaps because they cancel when seen from this point in the future, then great!
Although we thought it might be worth looking into what the input and output
look like at that time, if by the time we get to the interval containing the
time the updates just aren’t there any more, no work for us to do!</p>
<p>Let’s look at an example: Imagine we are supplying one thousand rounds of input
to an iterative computation, so timestamps look like <code class="language-plaintext highlighter-rouge">(round, iteration)</code>. We
might start with updates that look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>("hello", (17, 0), +1)
("world", (23, 0), +1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Meaning that <code class="language-plaintext highlighter-rouge">"hello"</code> shows up in the 17th round of input and <code class="language-plaintext highlighter-rouge">"world"</code> shows
up in the 23rd round of input. Perhaps over the course of the iterative
computation, the <code class="language-plaintext highlighter-rouge">"world"</code> record evolves a bit and eventually goes away</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>("world", (23, 3), -1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Of course, <code class="language-plaintext highlighter-rouge">"hello"</code> can evolve too, and perhaps in a later iteration it prompts
something exciting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>("wombat", (17, 5), +1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is very exciting, because wombats are magical animals. Now, based on our
traditional reasoning, in addition to our general excitement about wombats we may
also come to the conclusion that the time <code class="language-plaintext highlighter-rouge">(23, 5)</code> is pretty interesting. Some
stuff happens at <code class="language-plaintext highlighter-rouge">(_, 5)</code>, and some stuff happens at <code class="language-plaintext highlighter-rouge">(23, _)</code>, so stuff
probably happens at <code class="language-plaintext highlighter-rouge">(23, 5)</code> that we should check out.</p>
<p>As it turns out, nothing happens at <code class="language-plaintext highlighter-rouge">(23, 5)</code>, because by the time we’ve gotten
to iteration five, the <code class="language-plaintext highlighter-rouge">"world"</code> updates have canceled with each other. The
input collection is identical to the collection at <code class="language-plaintext highlighter-rouge">(23, 4)</code> and at <code class="language-plaintext highlighter-rouge">(22, 5)</code>
and even at <code class="language-plaintext highlighter-rouge">(22, 4)</code>, which pretty much means that it doesn’t experience change
and so its output doesn’t change either.</p>
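<p>We can check this cancellation mechanically: accumulating the diffs of all updates at or before a query time (a hypothetical <code class="language-plaintext highlighter-rouge">accumulate</code> helper, not the operator’s real code) shows the <code class="language-plaintext highlighter-rouge">"world"</code> updates net to zero by <code class="language-plaintext highlighter-rouge">(23, 5)</code>:</p>

```rust
// Accumulate the example's updates at a query time: sum the diffs of all
// updates for a record whose time is `<=` the query time.
type Time = (u64, u64);

fn le(a: &Time, b: &Time) -> bool { a.0 <= b.0 && a.1 <= b.1 }

fn accumulate(updates: &[(&str, Time, i64)], query: &Time, record: &str) -> i64 {
    updates.iter()
           .filter(|(rec, time, _)| *rec == record && le(time, query))
           .map(|&(_, _, diff)| diff)
           .sum()
}

fn main() {
    let updates = [
        ("hello",  (17, 0),  1),
        ("world",  (23, 0),  1),
        ("world",  (23, 3), -1),
        ("wombat", (17, 5),  1),
    ];
    // By (23, 5) the "world" updates have canceled: nothing to recompute there.
    assert_eq!(accumulate(&updates, &(23, 5), "world"), 0);
    assert_eq!(accumulate(&updates, &(23, 5), "wombat"), 1);
    println!("ok");
}
```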
<p>Our prior implementations, each of which tracked all possibly interesting times
explicitly, would miss this opportunity: they would flag the time <code class="language-plaintext highlighter-rouge">(23, 5)</code> as
interesting, and lose track of the fact that the updates that made it
interesting cancel each other out. When we arrive at the interval containing
<code class="language-plaintext highlighter-rouge">(23, 5)</code> we would be warned about the excitement associated with it, and
would re-evaluate the user logic there only to produce no output change.</p>
<p>So that’s all a very nice hypothetical optimization, but what does it do for our
<code class="language-plaintext highlighter-rouge">bfs</code> computation?</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">40s (45s)</td>
<td style="text-align: right">59s (57s)</td>
<td style="text-align: right">45s (44s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">39s (39s)</td>
<td style="text-align: right">45s (43s)</td>
<td style="text-align: right">31s (30s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">49s (48s)</td>
<td style="text-align: right">30s (29s)</td>
<td style="text-align: right">20s (19s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">55s (56s)</td>
<td style="text-align: right">18s (20s)</td>
<td style="text-align: right">11s (12s)</td>
</tr>
</tbody>
</table>
<p>Not a great deal. There is a little bit of movement, but I think most of it is
attributable to noise.</p>
<p>This is sort of good news, because we haven’t actually put the optimization of
ignoring canceling times into place yet, we are just seeing how well we do when
we have to rediscover times in each <code class="language-plaintext highlighter-rouge">[lower, upper)</code> interval rather than having
them listed for us. We removed a fair amount of “time management” code, at the
possible cost of re-evaluating the user logic at more times than strictly
necessary. Though, practically, I’m not sure we actually do any more evaluation
this way, as we were fairly conservative about which times we would consider
previously (in that we considered quite a lot).</p>
<table>
<thead>
<tr>
<th style="text-align: right">experiment</th>
<th style="text-align: right">1k / 2k</th>
<th style="text-align: right">1m / 10m</th>
<th style="text-align: right">1m / 10m -w2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1 x 1000000</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
<td style="text-align: right">-</td>
</tr>
<tr>
<td style="text-align: right">10 x 100000</td>
<td style="text-align: right">(117s)</td>
<td style="text-align: right">(82s)</td>
<td style="text-align: right">(68s)</td>
</tr>
<tr>
<td style="text-align: right">100 x 10000</td>
<td style="text-align: right">(75s)</td>
<td style="text-align: right">(65s)</td>
<td style="text-align: right">(46s)</td>
</tr>
<tr>
<td style="text-align: right">1000 x 1000</td>
<td style="text-align: right">40s (40s)</td>
<td style="text-align: right">59s (57s)</td>
<td style="text-align: right">42s (45s)</td>
</tr>
<tr>
<td style="text-align: right">10000 x 100</td>
<td style="text-align: right">36s (39s)</td>
<td style="text-align: right">44s (45s)</td>
<td style="text-align: right">30s (30s)</td>
</tr>
<tr>
<td style="text-align: right">100000 x 10</td>
<td style="text-align: right">46s (49s)</td>
<td style="text-align: right">30s (30s)</td>
<td style="text-align: right">20s (20s)</td>
</tr>
<tr>
<td style="text-align: right">1000000 x 1</td>
<td style="text-align: right">56s (55s)</td>
<td style="text-align: right">18s (18s)</td>
<td style="text-align: right">10s (11s)</td>
</tr>
</tbody>
</table>
<!-- end of post: "High resolution, high throughput (pt 2)" -->
<!-- post: "Differential Dataflow Pt1", published 2022-10-10, /primitives/2022/10/10/Differential-Dataflow-pt1 -->
<h1 id="differential-dataflow-roadmap">Differential dataflow roadmap</h1>
<blockquote>
<p>source https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md</p>
</blockquote>
<p>I’m going to take this post to try and outline what I think is an important
direction for differential dataflow, and to explain how to start moving in this
direction. I think I have a handle on most of the path, but talking things out
and explaining them, with examples and data and such, makes me a lot more
comfortable before just writing a lot of code.</p>
<p>The main goal is to support “high resolution” updates to input streams. Right
now, updates to differential dataflow come in batches, and get relatively decent
scaling as long as the batches are not small. While you can cut the size of
batches to improve resolution, increasing the number of workers no longer
improves performance.</p>
<p>It would be <em>great</em>, and this write-up is meant to be a first step, to be able
to have input updates timestamped with the nanosecond of their arrival and the
corresponding output updates with the same resolution, while still maintaining
the throughput you would expect for large batch updates.</p>
<h2 id="the-problem">The problem</h2>
<p>Let’s start with a simple-ish, motivating problem to explain what is missing. We
can also use it to evaluate our progress (none yet!), and possibly to tell us
when we are done.</p>
<p>Imagine you are performing reachability queries, an iterative Datalog-style
computation, over dynamic graph data from user-specified starting locations. The
computation is relatively simple to write:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre><span class="c">// two inputs, one for roots, one for edges.</span>
<span class="k">let</span> <span class="p">(</span><span class="n">root_input</span><span class="p">,</span> <span class="n">roots</span><span class="p">)</span> <span class="o">=</span> <span class="n">scope</span><span class="nf">.new_input</span><span class="p">();</span>
<span class="k">let</span> <span class="p">(</span><span class="n">edge_input</span><span class="p">,</span> <span class="n">edges</span><span class="p">)</span> <span class="o">=</span> <span class="n">scope</span><span class="nf">.new_input</span><span class="p">();</span>
<span class="c">// iteratively expand set of (root, node) reachable pairs.</span>
<span class="n">roots</span><span class="nf">.map</span><span class="p">(|</span><span class="n">root</span><span class="p">|</span> <span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">root</span><span class="p">))</span>
<span class="nf">.iterate</span><span class="p">(|</span><span class="n">reach</span><span class="p">|</span> <span class="p">{</span>
<span class="c">// bring un-changing collections into loop.</span>
<span class="k">let</span> <span class="n">roots</span> <span class="o">=</span> <span class="n">roots</span><span class="nf">.map</span><span class="p">(|</span><span class="n">root</span><span class="p">|</span> <span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">root</span><span class="p">))</span><span class="nf">.enter</span><span class="p">(</span><span class="o">&</span><span class="n">reach</span><span class="nf">.scope</span><span class="p">());</span>
<span class="k">let</span> <span class="n">edges</span> <span class="o">=</span> <span class="n">edges</span><span class="nf">.enter</span><span class="p">(</span><span class="o">&</span><span class="n">reach</span><span class="nf">.scope</span><span class="p">());</span>
<span class="c">// join `reach` and `edges` on `node` field.</span>
<span class="n">reach</span><span class="nf">.map</span><span class="p">(|(</span><span class="n">root</span><span class="p">,</span> <span class="n">node</span><span class="p">)|</span> <span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">root</span><span class="p">))</span>
<span class="nf">.join_map</span><span class="p">(</span><span class="o">&</span><span class="n">edges</span><span class="p">,</span> <span class="p">|</span><span class="mi">_</span><span class="n">node</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">dest</span><span class="p">|</span> <span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">dest</span><span class="p">))</span>
<span class="nf">.concat</span><span class="p">(</span><span class="o">&</span><span class="n">roots</span><span class="p">)</span>
<span class="nf">.distinct</span><span class="p">()</span>
<span class="p">});</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The result of this computation is a collection of pairs <code class="language-plaintext highlighter-rouge">(root, node)</code>
corresponding to those elements <code class="language-plaintext highlighter-rouge">root</code> of <code class="language-plaintext highlighter-rouge">roots</code>, and those elements <code class="language-plaintext highlighter-rouge">node</code>
they can reach transitively along elements in <code class="language-plaintext highlighter-rouge">edges</code>.</p>
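<p>As a reference for what the dataflow converges to, here is the same semantics written as a plain breadth-first traversal (no incremental machinery; <code class="language-plaintext highlighter-rouge">reachable</code> is an illustrative helper, not part of the library):</p>

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Reference semantics for the dataflow above: the set of (root, node) pairs
// the `iterate` converges to, computed by breadth-first search from each root.
fn reachable(roots: &[u64], edges: &[(u64, u64)]) -> HashSet<(u64, u64)> {
    // Index the edges by source node.
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(src, dst) in edges {
        adj.entry(src).or_default().push(dst);
    }
    let mut reach = HashSet::new();
    for &root in roots {
        let mut queue = VecDeque::new();
        queue.push_back(root);
        while let Some(node) = queue.pop_front() {
            // `insert` returns true only the first time we see (root, node).
            if reach.insert((root, node)) {
                for &next in adj.get(&node).into_iter().flatten() {
                    queue.push_back(next);
                }
            }
        }
    }
    reach
}

fn main() {
    let reach = reachable(&[0], &[(0, 1), (1, 2), (3, 4)]);
    assert!(reach.contains(&(0, 2)));  // 0 reaches 2 via 1
    assert!(!reach.contains(&(0, 4))); // 4 is only reachable from 3
    println!("{} pairs", reach.len());
}
```

<p>Differential dataflow maintains exactly this set as <code class="language-plaintext highlighter-rouge">roots</code> and <code class="language-plaintext highlighter-rouge">edges</code> change, without recomputing it from scratch.</p>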
<p>Of course, the heart of differential dataflow lies in incrementally updating its
computations. We are interested in what happens to this computation as the
inputs <code class="language-plaintext highlighter-rouge">roots</code> and <code class="language-plaintext highlighter-rouge">edges</code> change. More specifically,</p>
<ol>
<li>
<p>The <code class="language-plaintext highlighter-rouge">roots</code> collection may be updated by adding and removing root elements,
which issue and cancel standing queries for reachable nodes, respectively.</p>
</li>
<li>
<p>The <code class="language-plaintext highlighter-rouge">edges</code> collection may be updated by adding and removing edge elements,
which affect the reachable set of nodes from any of the elements of <code class="language-plaintext highlighter-rouge">roots</code>.</p>
</li>
</ol>
<p>Consider a version of this computation that runs “forever”, where the timestamp
type is a <code class="language-plaintext highlighter-rouge">u64</code> indicating “nanosecond since something”. Each change
to <code class="language-plaintext highlighter-rouge">edges</code> or <code class="language-plaintext highlighter-rouge">roots</code> happens at a likely distinct nanosecond, and so we
imagine many single-element updates to our computation. We don’t expect to
actually process them within nanoseconds (that would be great, but no), but the
nanosecond units mean that corresponding output updates also indicate the
logical nanosecond at which the change happens.</p>
<p>This isn’t difficult in differential dataflow: timely dataflow, on which it is
built, does no work for epochs in which no data are exchanged, no matter how
fine grained the measurement. We could use
<a href="https://en.wikipedia.org/wiki/Planck_time">Planck time</a> if we wanted; our
computation wouldn’t run any differently (it might overflow the 64 bit numbers
sooner).</p>
<p>But, this doesn’t mean we don’t have problems.</p>
<h3 id="degradation-with-time">Degradation with time</h3>
<p>For now, let’s put ten roots into <code class="language-plaintext highlighter-rouge">roots</code> and load up two million random edges
between one million nodes. We are then going to repeatedly remove the oldest
existing edge and introduce a new random edge in its place. This is a sliding
window over an unbounded stream of random edges, two million elements wide.</p>
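<p>The update pattern is just a fixed-width window sliding over a random edge stream. A scaled-down sketch (the constants, function names, and the toy RNG are mine):</p>

```rust
use std::collections::VecDeque;

// Toy linear congruential generator, so the sketch needs no external crates.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

// Load a window of random edges, then repeatedly retract the oldest edge and
// introduce a fresh random one. In the dataflow these become (edge, time, -1)
// and (edge, time, +1) updates respectively.
fn run(nodes: u64, window: usize, rounds: usize) -> usize {
    let mut rng = 42u64;
    let mut edges: VecDeque<(u64, u64)> = VecDeque::new();
    for _ in 0..window {
        let e = (lcg(&mut rng) % nodes, lcg(&mut rng) % nodes);
        edges.push_back(e);
    }
    for _ in 0..rounds {
        let _retraction = edges.pop_front().unwrap(); // would be delta -1
        let addition = (lcg(&mut rng) % nodes, lcg(&mut rng) % nodes);
        edges.push_back(addition); // would be delta +1
    }
    edges.len() // the window width is invariant
}
```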
<p>Our computation determines the reachable sets for our ten roots, and maintains
them as we change the graph. How quickly does this happen? Here are some
empirical cumulative density functions, computed by capturing the last 100
latencies after each of 100, 1000, 10000, and 100000 updates have been
processed.</p>
<p><img src="https://github.com/frankmcsherry/blog/blob/master/assets/roadmap/gnp1m.png" alt="gnp1m" /></p>
<p>This is all a bit of a tangle, but we see a fairly consistent shape for the
first 100,000 updates. However, there is clearly some degradation that starts to
happen. On the plus side, most of the latencies are still milliseconds at most,
which is pretty speedy. Should we be happy?</p>
<p>Let’s look at a slight variation on this experiment, where instead of millions
of edges and nodes we use <em>thousands</em>. Yeah, smaller, by a lot. Same deal as
above, latencies at 100, 1000, 10000, and … urg.</p>
<p><img src="https://github.com/frankmcsherry/blog/blob/master/assets/roadmap/gnp1k.png" alt="gnp1k" /></p>
<p>These curves are very different from the curves above. I couldn’t compute the
100,000 update measurement because it took so long.</p>
<h4 id="whats-going-on">What’s going on?</h4>
<p>Differential dataflow’s internal data structures are append-only, and over the
course of 10,000 updates we are dumping a large number of updates <em>relative to
the number of nodes</em>. Back when we had one million nodes, doing 100,000 updates
wasn’t such a big deal because on average each node got just a few (multiply by
ten, because of the roots!). With only 1,000 nodes, all of those updates are
being forced onto far fewer nodes, which means that each node has a much more
complicated history. Unfortunately, to determine what state a node is currently
in, at any point in the computation, we need to examine all of its history.</p>
<p>As the number of updates for each key increases, the amount of work we have to
do for each key increases.</p>
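<p>To see the cost concretely, here is a minimal sketch (types and names mine) of one key's append-only history, and the scan needed to recover its accumulated weight; every read is linear in the number of updates the key has absorbed, and that number only grows:</p>

```rust
// A sketch of one key's append-only history in the current design:
// a growing list of (time, delta) pairs.
struct History {
    updates: Vec<(u64, isize)>,
}

impl History {
    // Recovering the accumulated weight as of `time` must examine the whole
    // history: O(#updates) work per read.
    fn weight_at(&self, time: u64) -> isize {
        self.updates
            .iter()
            .filter(|&&(t, _)| t <= time)
            .map(|&(_, d)| d)
            .sum()
    }
}
```

<p>With 100,000 updates landing on 1,000 nodes, each such scan covers hundreds of entries; spread over one million nodes, it covers a handful.</p>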
<h3 id="resolution-and-scaling">Resolution and scaling</h3>
<p>How about we try to speed things up by adding more workers? Perhaps
unsurprisingly, with single-element updates, multiple workers do not really help
out. At least, the way the code is written at the moment, all the workers chill
out waiting for that single update to get sorted out before moving on to the
next update. As there is only a small amount of work to do, most workers sit on
their hands instead of doing productive work.</p>
<p>Let’s evaluate this, plus alternatives we might have hoped for. We are going to
do single element updates 10,000 times to the two million edge graph, but we
will also do 10 element updates 1,000 times, and 100 element updates 100 times.
We are doing the same set of updates, just in coarser granularities, leading to
lower resolution outputs.</p>
<p><img src="https://github.com/frankmcsherry/blog/blob/master/assets/roadmap/batching.png" alt="batching" /></p>
<p>The plot above shows solid lines for single-threaded execution and dashed lines
for two-threaded execution. When we have the single-element updates, the solid
line is better than the dashed line (one worker is better than two). When we
have hundred-element updates, the dashed line is better than the solid line (two
workers are better than one). As the amount of work in each batch increases, the
second worker can more productively contribute.</p>
<p>While we can eyeball the latencies and see some trends, what are the actual
throughputs for each of these configurations?</p>
<table>
<thead>
<tr>
<th style="text-align: right">batch size</th>
<th style="text-align: right">one worker</th>
<th style="text-align: right">two workers</th>
<th style="text-align: right">increase</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1</td>
<td style="text-align: right">1244.96/s</td>
<td style="text-align: right">1297.07/s</td>
<td style="text-align: right">1.042x</td>
</tr>
<tr>
<td style="text-align: right">10</td>
<td style="text-align: right">1988.71/s</td>
<td style="text-align: right">2530.23/s</td>
<td style="text-align: right">1.272x</td>
</tr>
<tr>
<td style="text-align: right">100</td>
<td style="text-align: right">1563.32/s</td>
<td style="text-align: right">2743.31/s</td>
<td style="text-align: right">1.755x</td>
</tr>
</tbody>
</table>
<p>Something good seemed to happen for one worker at batch size 10 that doesn’t
happen at batch size 100; I’m not sure what that is about. But, we see that the second
worker helps more and more with increasing batch sizes. We don’t get 2x
improvement, which is partly due to the introduction of data exchange going from
one to two workers (no data shuffling happens for one worker).</p>
<h4 id="whats-going-on-1">What’s going on?</h4>
<p>This isn’t too mysterious: processing single elements at a time and asking all
workers to remain idle until each is finished leaves a lot of cycles on the
table. At the same time, lumping lots of updates together improves the
utilization and allows more workers to reduce the total time to process, but
comes at the cost of resolution: we can’t see which of the 100 updates had which
effect.</p>
<p>We would love to get the resolution of single-element updates with the
throughput scaling of the batched updates, if at all possible. We’d also like
the <em>latency</em> of the single-element updates, but note that this is not the same
thing as either resolution or throughput.</p>
<ul>
<li>
<p><strong>Resolution</strong> is important for correctness; we can’t alter the resolution of
inputs and outputs without changing the definition of the computation itself.</p>
</li>
<li>
<p><strong>Throughput</strong> is the rate of changes we can accommodate without falling over.
We want this to be as large as possible, ideally scaling with the number of
workers, so that we can handle more updates per unit time.</p>
</li>
<li>
<p><strong>Latency</strong> is the time to respond to an input update with its corresponding
output update. The lower the latency the better, but this fights a little
against throughput.</p>
</li>
</ul>
<p>At the moment, single-element updates focus on latency; workers do nothing
except attend to the most recent single update. Getting great latency would be
excellent, but if it comes at the cost of throughput we might want a different
trade-off.</p>
<h3 id="goals">Goals</h3>
<p>The intent of this write-up is to investigate these problems in more detail,
propose some solutions, and (importantly, for me) come up with a framework for
evaluating the results. There is a saying that “you can’t manage what you don’t
measure”, one corollary of which is that I’m not personally too motivated to
work hard on code until I have a benchmark for it. With that in mind, here are
two benchmarks that (i) are important, (ii) currently suck, and (iii) could be a
lot better:</p>
<ol>
<li>
<p><strong>Sustained latency:</strong> For windowed computations (or those with bounded
inputs, generally), the latency distribution should stabilize with time. The
latency distribution for 1,000 node 2,000 edge reachability computations
after one million updates should be pretty much the same as the distribution
after one thousand updates. Minimize the difference, and report only the
former.</p>
</li>
<li>
<p><strong>Single-update throughput scaling:</strong> The throughput of single-element
updates should improve with multiple workers (up to a point). The
single-update throughput for 1,000 node 2,000 edge reachability computations
should scale nearly linearly with (a few) workers. Maximize the throughput,
reporting single-element updates per second per worker.</p>
</li>
</ol>
<p>These aren’t really grand challenges or anything, especially as I think I know
how to do them already, but goal setting is an important part of getting things
done.</p>
<h2 id="the-problems">The problems</h2>
<p>There are two main problems that we are going to want to re-work bits of
differential dataflow to fix. There are also some secondary “constraints”, which
are currently non-issues but which we could break if we try and be too clever.</p>
<p>To give you a heads up, and to let you skip around, the problems (with links!)
are:</p>
<ul>
<li>
<p><strong><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#problem-0-data-structures-for-high-resolution-times">Problem 0: Data structures for high-resolution times</a></strong>
The data structure differential dataflow currently uses to store collection
data isn’t great for high resolution times, even ignoring the more subtle
performance issues. It works, but it doesn’t expect large numbers of times and
should be reconsidered.</p>
</li>
<li>
<p><strong><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#problem-1-unbounded-increase-in-latency">Problem 1: Unbounded increase in latency</a></strong>
As the computation proceeds, the latencies increase without bound. This is
because we keep appending state, and the amount that must be considered to
evaluate the current configuration grows without bound.</p>
</li>
<li>
<p><strong><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#problem-2-poor-scaling-with-small-updates">Problem 2: Poor scaling with small updates</a></strong>
As we increase the number of workers, we do not get increased throughput
without also increasing the sizes of batches of input we process. Increases in
performance come at the cost of coarser, less granular updates.</p>
</li>
</ul>
<p>There are some constraints that are currently in place, and we will go through
them to remember what is hard and annoying about just typing these things in.</p>
<ul>
<li>
<p><strong><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#constraint-1-compact-representation-in-memory">Constraint 1: Compact representation in memory</a></strong>
The representation of a trace should not be so large that I can’t fit normal
graphs in memory on my laptop. Ideally the memory footprint should be not much
larger than that required to write the data to disk.</p>
</li>
<li>
<p><strong><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#constraint-2-shared-index-structures-between-operators">Constraint 2: Shared index structures between operators</a></strong>
Index structures are currently shared between operators, so that a collection
only needs to be indexed and maintained once per computation, for each key on
which it is indexed.</p>
</li>
</ul>
<h3 id="problem-0-data-structures-for-high-resolution-times">Problem 0: Data structures for high-resolution times</h3>
<p>Each differential dataflow collection is described by a bunch of tuples, each of
which reflect a change described by three things:</p>
<ul>
<li>
<p><strong>Data:</strong> Each change that occurs relates to some data. Typically these are
<code class="language-plaintext highlighter-rouge">(key, val)</code> pairs, but they could also just be <code class="language-plaintext highlighter-rouge">key</code> records, or they could
be even more complicated.</p>
</li>
<li>
<p><strong>Time:</strong> Each change occurs at some logical time. In the simplest case each
is just an integer indicating which round the change happens in, but it can be
more complex and is generally only known to be an element from a partially
ordered set.</p>
</li>
<li>
<p><strong>Delta:</strong> Each change has a signed integer change to the frequency of the
element, indicating whether the change adds an element or removes an element.</p>
</li>
</ul>
<p>This collection of <code class="language-plaintext highlighter-rouge">(data, time, delta)</code> tuples needs to be maintained in a form
that allows relatively efficient enumeration of the history of individual data
records: those <code class="language-plaintext highlighter-rouge">(data, time, delta)</code> tuples matching <code class="language-plaintext highlighter-rouge">data</code>.</p>
<p>Differential dataflow currently maintains its tuples ordered first by <code class="language-plaintext highlighter-rouge">key</code>,
then by <code class="language-plaintext highlighter-rouge">time</code>, and then by <code class="language-plaintext highlighter-rouge">val</code>. This makes some sense if you imagine that
many changes to <code class="language-plaintext highlighter-rouge">key</code> occur at the same time, as you can perform per-<code class="language-plaintext highlighter-rouge">time</code>
logic once per distinct time. In batch-iterative computation, where there is
just one input and relatively few iterations, this is a reasonable assumption.
It is less reasonable for high-resolution times.</p>
<p>Ideally, we would define an interface for the storage layer, so that operators
can be backed by data structures appropriate for high-resolution times, or for
batch data as appropriate. Let’s describe what interface the storage should
provide, somewhat abstractly:</p>
<ol>
<li>
<p>Accept batches of updates, <code class="language-plaintext highlighter-rouge">(data, time, delta)</code>.</p>
<p>This is perhaps obvious, but without this we don’t really have a problem.
Importantly, we should be able to submit <em>batches</em> of updates corresponding
to multiple <code class="language-plaintext highlighter-rouge">data</code> and multiple <code class="language-plaintext highlighter-rouge">time</code> entries. The batch interface
communicates that the data structure doesn’t need to be in an indexable state
for each element, only once it accepts the batch.</p>
</li>
<li>
<p>Enumerate those <code class="language-plaintext highlighter-rouge">data</code> associated with a <code class="language-plaintext highlighter-rouge">key</code>.</p>
<p>Many operators (e.g. <code class="language-plaintext highlighter-rouge">join</code> and <code class="language-plaintext highlighter-rouge">group</code>) drive computation by <code class="language-plaintext highlighter-rouge">key</code>, operating on
the associated <code class="language-plaintext highlighter-rouge">val</code> values. One should be able to enumerate values
associated with a key, preferably supporting some sort of navigation (e.g.
searching for values).</p>
</li>
<li>
<p>Report the history <code class="language-plaintext highlighter-rouge">(time, delta)</code> for each <code class="language-plaintext highlighter-rouge">data</code>.</p>
<p>The history of <code class="language-plaintext highlighter-rouge">data</code> is used by many operators to determine (i) the
cumulative weight at any other <code class="language-plaintext highlighter-rouge">time</code>, and (ii) which times are associated with a
<code class="language-plaintext highlighter-rouge">key</code>, which drives when user-defined logic needs to be re-run.</p>
</li>
</ol>
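<p>The three requirements above can be written down as a trait; this is a sketch whose names and signatures are mine, not an actual differential dataflow interface, together with a deliberately naive implementation just to pin down the contract:</p>

```rust
// A sketch of the storage interface described above.
trait Storage<K, V, T: Clone> {
    // 1. Accept a batch of updates spanning many keys and times; the
    //    structure need only be indexable once the whole batch has landed.
    fn insert_batch(&mut self, batch: Vec<((K, V), T, isize)>);
    // 2. Enumerate the values associated with a key.
    fn values(&self, key: &K) -> Vec<&V>;
    // 3. Report the (time, delta) history for one (key, val) pair.
    fn history(&self, key: &K, val: &V) -> Vec<(T, isize)>;
}

// A naive implementation; a real store would index, sort, and consolidate.
struct Naive<K, V, T> {
    updates: Vec<((K, V), T, isize)>,
}

impl<K: PartialEq, V: PartialEq, T: Clone> Storage<K, V, T> for Naive<K, V, T> {
    fn insert_batch(&mut self, mut batch: Vec<((K, V), T, isize)>) {
        self.updates.append(&mut batch);
    }
    fn values(&self, key: &K) -> Vec<&V> {
        self.updates
            .iter()
            .filter(|((k, _), _, _)| k == key)
            .map(|((_, v), _, _)| v)
            .collect()
    }
    fn history(&self, key: &K, val: &V) -> Vec<(T, isize)> {
        self.updates
            .iter()
            .filter(|((k, v), _, _)| k == key && v == val)
            .map(|(_, t, d)| (t.clone(), *d))
            .collect()
    }
}
```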
<p><a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#constraint-1-compact-representation-in-memory">Constraint #1</a>
makes life a little difficult for random access, navigation, and mutation, as
these usually fight with compactness. Perhaps less obviously,
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#constraint-2-shared-index-structures-between-operators">Constraint #2</a>
complicates in-place updating, because multiple readers may share read access to
the same hunk of memory, and something needs to stay true about it.</p>
<h4 id="a-proposal">A proposal</h4>
<p>My best plan for the moment is something like a log-structure merge trie, which
probably isn’t an existing term, but let me explain:</p>
<ol>
<li>
<p>We maintain several immutable collections of <code class="language-plaintext highlighter-rouge">((key, val), time, delta)</code>
tuples, of geometrically decreasing size. When we add new collections,
corresponding to an inserted batch, we merge any collections whose sizes are
within a factor of two, amortizing the merge effort over subsequent
insertions.</p>
</li>
<li>
<p>Each of the <code class="language-plaintext highlighter-rouge">((key, val), time, delta)</code> collections is represented as a trie,
with three vectors corresponding to “keys and the offsets of their values”,
“values and the offsets of their history”, and “histories”:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="k">struct</span> <span class="n">Trie</span><span class="o"><</span><span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">T</span><span class="o">></span> <span class="p">{</span>
<span class="n">keys</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// key and offset into self.values</span>
<span class="n">values</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// val and offset into self.histories</span>
<span class="n">histories</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="nb">isize</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// bunch of times and deltas</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ol>
<p>Adding new batches of data is standard; this type of data structure is meant to
be efficient at writing, with the main (only?) cost being the merging. As the
collections are immutable, this can happen in the background, but needs to
happen at a sufficient rate to avoid falling behind. However, merging feels like
a relatively high-throughput operation compared to a large amount of random
access (computation) that will come with each inserted element. Said
differently, we only merge in data involved in computation, so we shouldn’t be
doing more writes than reads.</p>
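<p>The merge discipline (sorted immutable runs of geometrically decreasing size, merging neighbors whose sizes come within a factor of two) can be sketched on plain sorted vectors. The payloads here are bare <code>u64</code>s; the real runs hold <code>((key, val), time, delta)</code> tuples and would also consolidate matching entries while merging:</p>

```rust
// Sorted immutable runs, largest first; a stand-in for the tries above.
struct Runs {
    layers: Vec<Vec<u64>>, // sorted runs of geometrically decreasing size
}

impl Runs {
    fn insert(&mut self, mut batch: Vec<u64>) {
        batch.sort_unstable();
        self.layers.push(batch);
        // Eager merging, for simplicity; the text amortizes this work in the
        // background over subsequent insertions.
        while self.layers.len() >= 2 {
            let n = self.layers.len();
            if self.layers[n - 2].len() < 2 * self.layers[n - 1].len() {
                let small = self.layers.pop().unwrap();
                let large = self.layers.pop().unwrap();
                self.layers.push(merge(large, small));
            } else {
                break;
            }
        }
    }
}

// Standard two-way merge of sorted vectors.
fn merge(a: Vec<u64>, b: Vec<u64>) -> Vec<u64> {
    let (mut ia, mut ib) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while ia < a.len() && ib < b.len() {
        if a[ia] <= b[ib] {
            out.push(a[ia]);
            ia += 1;
        } else {
            out.push(b[ib]);
            ib += 1;
        }
    }
    out.extend_from_slice(&a[ia..]);
    out.extend_from_slice(&b[ib..]);
    out
}
```

<p>Sixteen single-element insertions, for example, cascade into a single sorted run, at an amortized logarithmic number of copies per element.</p>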
<p>Reading data out requires indexing into each of the tries to find a target key.
One could look into each of the tries for the key, using something like binary
search, or a galloping cursor (as most operators process keys in some known
order). Another option is to maintain an index for keys, indicating for each the
lowest level (smallest) trie in which the key exists and the key’s offset in
that trie’s <code class="language-plaintext highlighter-rouge">keys</code> field. With each <code class="language-plaintext highlighter-rouge">(K, usize)</code> pair, we could store again an
index of the next-lowest level trie in which the key exists and the key’s offset
there.</p>
<p>This allows us to find the keys and their tries with one index look-up and as
many pointer jumps as trie levels in which the key exists. Adding and merging
tries only requires updating the index for involved keys, and does not require
rewriting anything in existing trie layers.</p>
<p>Here is a sketch of the involved structures:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="k">struct</span> <span class="n">Storage</span><span class="o"><</span><span class="n">K</span><span class="p">,</span><span class="n">V</span><span class="p">,</span><span class="n">T</span><span class="o">></span> <span class="p">{</span>
<span class="n">index</span><span class="p">:</span> <span class="n">HashMap</span><span class="o"><</span><span class="n">K</span><span class="p">,</span> <span class="n">KeyLoc</span><span class="o">></span><span class="p">,</span> <span class="c">// something better, ideally</span>
<span class="n">tries</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="n">Trie</span><span class="o"><</span><span class="n">K</span><span class="p">,</span><span class="n">V</span><span class="p">,</span><span class="n">T</span><span class="o">>></span><span class="p">,</span> <span class="c">// tries of decreasing size</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">KeyLoc</span> <span class="p">{</span>
<span class="n">level</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="c">// trie level</span>
<span class="n">index</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="c">// index into trie.keys</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">Trie</span><span class="o"><</span><span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">T</span><span class="o">></span> <span class="p">{</span>
<span class="n">keys</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">KeyLoc</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// key, offset into self.values, next key location</span>
<span class="n">values</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// val and offset into self.histories</span>
<span class="n">histories</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="nb">isize</span><span class="p">)</span><span class="o">></span><span class="p">,</span> <span class="c">// bunch of times and deltas</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I think this design makes a good deal of sense in principle, but it remains to
be seen how it will work out in practice. On the plus side, it doesn’t seem all that
complicated at this point, so trying it out shouldn’t be terrifying. Also, I’m
much happier with something that works in principle, maybe loses a factor of two
over a better implementation, but doesn’t require a full-time employee to
maintain.</p>
<p>The design has a few other appealing features, because each of the bits of
state is a contiguous ordered hunk of memory:</p>
<ol>
<li>
<p>They are relatively easy to serialize to disk and <code class="language-plaintext highlighter-rouge">mmap</code> back in.</p>
</li>
<li>
<p>Processing batches of keys in order results in one sequential scan over each
array, which is good for performance and helpful if we spill to disk.</p>
</li>
<li>
<p>The large unit of data means that sharing between operators is relatively low
cost (we can wrap each layer in an <code class="language-plaintext highlighter-rouge">Rc</code> reference count).</p>
</li>
</ol>
<p>You might notice that this doesn’t yet meet
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-26.md#constraint-1-compact-representation-in-memory">Constraint #1</a>,
the requirement that the memory size look something like what it would take to
write the data down compactly. For example, if all times and deltas are
identical (say <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">+1</code>, respectively), the <code class="language-plaintext highlighter-rouge">histories</code> field will hold a
very large amount of identical <code class="language-plaintext highlighter-rouge">(0,1)</code> pairs. There are some remedies that I can
think of, and will discuss them below, but for the moment too bad.</p>
<h3 id="problem-1-unbounded-increase-in-latency">Problem 1: Unbounded increase in latency</h3>
<p>Latency increases without bound as the computation proceeds. If we were to look
at memory utilization, we would also see that it increases without bound as the
computation proceeds. Neither of these are good news if you are expecting to run
indefinitely.</p>
<p>This is not unexpected for an implementation whose internal datastructures are
append-only. As a differential dataflow computation proceeds, each operator
absorbs changes to its inputs and appends them to its internal representation of
the input. This representation grows and grows, which means (i) it takes more
memory, and (ii) the operator must flip through more memory to determine the
state at any given logical time.</p>
<p>Let’s look at an example to see the issue, and get a hint at how to solve it.</p>
<p>In the reachability example above, we update the query set <code class="language-plaintext highlighter-rouge">roots</code> by adding and
removing elements. These changes look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>(root, time_1, +1)
(root, time_2, -1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We add an element <code class="language-plaintext highlighter-rouge">root</code> at some first time, and then subtract it out at some
later time.</p>
<p>Although it was important to have both of these differences, at some point in
the computation, once we have processed everything up through <code class="language-plaintext highlighter-rouge">time_2</code>, we are
going to be scanning these differences over and over, and they will always
cancel. Not only that, but all of their consequent reachability updates</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>((root, node), (time_1, iter), +1)
((root, node), (time_2, iter), -1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>are going to live on as well, despite cancelling completely after <code class="language-plaintext highlighter-rouge">time_2</code>.
Future changes to <code class="language-plaintext highlighter-rouge">edges</code> will flip through each of these updates to determine
if they should provoke an output update related to <code class="language-plaintext highlighter-rouge">root</code>, and while they will
eventually determine that no they shouldn’t, they do a fair bit of work to see
this.</p>
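<p>Once a pair of updates like these occupy the same logical time (when that is legitimate is the subject of the next section), removing them is plain consolidation: sort, sum the deltas of identical <code>(data, time)</code> pairs, and drop anything that cancels to zero. A sketch (the function name is mine):</p>

```rust
// Sum the deltas of identical (data, time) pairs and drop the zeros.
fn consolidate<D: Ord>(mut updates: Vec<(D, u64, isize)>) -> Vec<(D, u64, isize)> {
    updates.sort_by(|a, b| a.0.cmp(&b.0).then(a.1.cmp(&b.1)));
    let mut out: Vec<(D, u64, isize)> = Vec::new();
    for (data, time, delta) in updates {
        // Identical (data, time) as the previous entry: accumulate in place.
        if let Some(last) = out.last_mut() {
            if last.0 == data && last.1 == time {
                last.2 += delta;
                continue;
            }
        }
        out.push((data, time, delta));
    }
    // Fully cancelled updates disappear entirely.
    out.retain(|&(_, _, w)| w != 0);
    out
}
```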
<h4 id="compaction">Compaction</h4>
<p>We know that once we have “passed” <code class="language-plaintext highlighter-rouge">time_2</code> we really don’t care about <code class="language-plaintext highlighter-rouge">root</code>,
do we? At that point, and from that point forward, its updates will just cancel
out.</p>
<p>This is true, and while it is good enough for a system where times are totally
ordered, we need to be a bit smarter with partially ordered times. Martin Abadi
and I did the math out for “being a bit smarter” a few years ago, and I’m going
to have to reconstruct it (sadly, our mutual former employer deleted the work).</p>
<p>In a world with partially ordered times, we talk about progress with
“frontiers”: sets of partially ordered times none of which comes before any
others in the set. At any point in a timely dataflow computation, there is a
frontier of logical times defining those logical times we may see in the future:
times greater or equal to a time in the frontier.</p>
<p>Frontiers are what we will use to compact our differences, rather than the idea
of “passing” times.</p>
<p>Any frontier of partially ordered elements partitions the set of all times
(past, present, future) into equivalence classes based on “distinguishability”
in the future: two times are indistinguishable if they compare identically to
every future time:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>t1 == t2 : for all f in Future, t1 <= f iff t2 <= f.
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As the only thing we know about times is that they are partially ordered, their
behavior under the <code class="language-plaintext highlighter-rouge"><=</code> comparison is sufficient to describe each fully.
Differences at indistinguishable times can be coalesced into (at most) one
difference.</p>
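<p>For small pair timestamps the definition can be checked mechanically. This sketch brute-forces it, enumerating candidate future times only up to a bound (the bound, the names, and the restriction to pairs are my simplifications; a system would never do it this way):</p>

```rust
type Time = (u64, u64);

// Product partial order on pair timestamps.
fn le(a: Time, b: Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

// Brute-force check of the definition above: `t1` and `t2` are
// indistinguishable under `frontier` if every possible future time (one
// greater-or-equal to some frontier element, enumerated here up to `bound`)
// compares identically to both.
fn indistinguishable(t1: Time, t2: Time, frontier: &[Time], bound: u64) -> bool {
    for x in 0..bound {
        for y in 0..bound {
            let f = (x, y);
            let possible = frontier.iter().any(|&g| le(g, f));
            if possible && le(t1, f) != le(t2, f) {
                return false;
            }
        }
    }
    true
}
```

<p>Under the frontier <code>{ (0, 3), (1, 2), (2, 0) }</code> discussed below, <code>(0, 1)</code> and <code>(0, 2)</code> remain distinguishable, while under the frontier <code>{ (1, 1) }</code> the times <code>(0, 0)</code> and <code>(0, 1)</code> coalesce.</p>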
<p>Let’s look at an example. Imagine we have the following updates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>((a, b), +1) @ (0, 0)
((b, c), +1) @ (0, 1)
((a, c), +1) @ (1, 0)
((b, c), -1) @ (1, 1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>So, we initially have an <code class="language-plaintext highlighter-rouge">(a,b)</code> and we generate a <code class="language-plaintext highlighter-rouge">(b,c)</code> in the first
iteration of some iterative computation, say. Someone then changes our input to
have <code class="language-plaintext highlighter-rouge">(a,c)</code> in the input, and now we remove <code class="language-plaintext highlighter-rouge">(b,c)</code> in the second iteration.</p>
<p>Imagine now that our frontier, the lower envelope of times we might yet see in
the computation, is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>{ (0, 3), (1, 2), (2, 0) } .
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Meaning, we may still see any time that is greater-or-equal to one of these
times. While this does rule out times like <code class="language-plaintext highlighter-rouge">(0,1)</code> and <code class="language-plaintext highlighter-rouge">(0,2)</code>, it does <em>not</em>
mean that we can just coalesce them. There is a difference between these two
times, in that the possible future time <code class="language-plaintext highlighter-rouge">(2,1)</code> can tell them apart:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>(0,1) <= (2,1) : true
(0,2) <= (2,1) : false
</pre></td></tr></tbody></table></code></pre></div></div>
<p>So how then do we determine which times are equivalent to which others? Ideally,
we would consult our notes, but this option is not available to us. We can do
the next best thing, which is to look at
<a href="https://github.com/MicrosoftResearch/Naiad/blob/release_0.5/Frameworks/DifferentialDataflow/LatticeInternTable.cs#L59-L74">what we did in Naiad’s implementation</a>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="c1">/// <summary></span>
<span class="c1">/// Joins the given time against all elements of reachable times, and returns the meet of these joined times.</span>
<span class="c1">/// </summary></span>
<span class="c1">/// <param name="s"></param></span>
<span class="c1">/// <returns></returns></span>
<span class="k">private</span> <span class="n">T</span> <span class="nf">Advance</span><span class="p">(</span><span class="n">T</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="n">reachableTimes</span> <span class="p">!=</span> <span class="k">null</span><span class="p">);</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="n">reachableTimes</span><span class="p">.</span><span class="n">Count</span> <span class="p">></span> <span class="m">0</span><span class="p">);</span>

<span class="kt">var</span> <span class="n">meet</span> <span class="p">=</span> <span class="k">this</span><span class="p">.</span><span class="n">reachableTimes</span><span class="p">.</span><span class="n">Array</span><span class="p">[</span><span class="m">0</span><span class="p">].</span><span class="nf">Join</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="k">this</span><span class="p">.</span><span class="n">reachableTimes</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">meet</span> <span class="p">=</span> <span class="n">meet</span><span class="p">.</span><span class="nf">Meet</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="n">reachableTimes</span><span class="p">.</span><span class="n">Array</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">Join</span><span class="p">(</span><span class="n">s</span><span class="p">));</span>

<span class="k">return</span> <span class="n">meet</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Ok, this is <em>NOT</em> Rust. C-sharp is object-oriented, and has a <code class="language-plaintext highlighter-rouge">this</code> keyword
that wraps some state local to whatever “this” is. It turns out “this” is a
table of timestamps, whose values we update as <code class="language-plaintext highlighter-rouge">this.reachableTimes</code> advances.
This <code class="language-plaintext highlighter-rouge">reachableTimes</code> thing is how Naiad refers to frontiers: timestamps that
the operator can still receive.</p>
<p>What the code tells us is that to determine what a time <code class="language-plaintext highlighter-rouge">s</code> should look like
given a frontier, we should join <code class="language-plaintext highlighter-rouge">s</code> with each element in the frontier, and take
its meet. If you aren’t familiar with “join” and “meet”, let’s review those:</p>
<ul>
<li>
<p>The <strong>join</strong> method determines the least upper bound of two arguments. That is,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>a <= join(a,b), and
b <= join(a,b), and
for all c: if (a <= c and b <= c) then join(a,b) <= c.
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ul>
<p>This may not always exist in a general partial order, so we need to be in at
least a join semi-lattice (a partial order where join is always defined).</p>
<ul>
<li>
<p>The <strong>meet</strong> method determines the greatest lower bound of two arguments. That
is,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>meet(a,b) <= a, and
meet(a,b) <= b, and
for all c: if (c <= a and c <= b) then c <= meet(a,b).
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>This may not always exist in a general partial order, so we need to be in at
least a meet semi-lattice (a partial order where meet is always defined).</p>
</li>
</ul>
<p>If both join and meet are defined for all pairs of elements in our partial
order, we have what is called a
“<a href="https://en.wikipedia.org/wiki/Lattice_(order)">lattice</a>”. Differential
dataflow should <em>probably</em> require all of its timestamps to be lattices, but at
the moment it just uses least upper bounds. This discussion may prompt the
change to lattices.</p>
<p>For very simple examples of join and meet, consider pairs of integers in which
you compare pairs coordinate-wise, and</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>(a1,b1) <= (a2, b2) iff a1 <= a2 && b1 <= b2 .
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The join (least upper bound) of two elements is the pair with the
coordinate-wise maximums, and the meet (greatest lower bound) of two elements is
the pair with the coordinate-wise minimums.</p>
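<p>For concreteness, here is a minimal sketch of this order, join, and meet on integer pairs (Python, with function names of my choosing, not any library’s API):</p>

```python
def less_equal(a, b):
    """The partial order: compare pairs coordinate-wise."""
    return a[0] <= b[0] and a[1] <= b[1]

def join(a, b):
    """Least upper bound: coordinate-wise maximum."""
    return (max(a[0], b[0]), max(a[1], b[1]))

def meet(a, b):
    """Greatest lower bound: coordinate-wise minimum."""
    return (min(a[0], b[0]), min(a[1], b[1]))

# The future time (2,1) tells (0,1) and (0,2) apart, as in the text.
assert less_equal((0, 1), (2, 1))
assert not less_equal((0, 2), (2, 1))

# Join and meet bound their arguments from above and below.
assert join((0, 3), (1, 2)) == (1, 3)
assert meet((0, 3), (1, 2)) == (0, 2)
```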
<h4 id="an-example-redux">An example (redux)</h4>
<p>Let’s look at our example again. We have updates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>((a, b), +1) @ (0, 0)
((b, c), +1) @ (0, 1)
((a, c), +1) @ (1, 0)
((b, c), -1) @ (1, 1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>and perhaps the frontier is currently</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>{ (0, 3), (1, 2), (2, 0) } .
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We can update each of our times using the “meet of joins” rule above, here</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>time -> meet(join(time, (0,3)), join(time, (1,2)), join(time, (2,0)))
</pre></td></tr></tbody></table></code></pre></div></div>
<p>For each of our times, we get the following updates</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>(0,0) -> meet((0,3), (1,2), (2,0)) = (0,0)
(0,1) -> meet((0,3), (1,2), (2,1)) = (0,1)
(1,0) -> meet((1,3), (1,2), (2,0)) = (1,0)
(1,1) -> meet((1,3), (1,2), (2,1)) = (1,1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It doesn’t seem like this changed anything, did it? Well, all four times can
still be distinguished in the future. The future time <code class="language-plaintext highlighter-rouge">(0,3)</code> can tell the
difference between times that differ in the first coordinate, and the future
time <code class="language-plaintext highlighter-rouge">(2,0)</code> can distinguish between the times that differ in the second
coordinate.</p>
<p>Imagine our frontier advances, finishing input epoch zero, and becomes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>{ (1, 2), (2, 0) } .
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now we get different results when we advance times, as the first term drops out
of each meet.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>(0,0) -> meet((1,2), (2,0)) = (1,0)
(0,1) -> meet((1,2), (2,1)) = (1,1)
(1,0) -> meet((1,2), (2,0)) = (1,0)
(1,1) -> meet((1,2), (2,1)) = (1,1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Ooooo! Now some things are starting to look the same! The two <code class="language-plaintext highlighter-rouge">(b,c)</code> updates in
times <code class="language-plaintext highlighter-rouge">(0,1)</code> and <code class="language-plaintext highlighter-rouge">(1,1)</code> can now cancel.</p>
<p>Imagine instead we closed our input, removing the possibility of new input
epochs, setting the frontier to</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>{ (0,3), (1,1) }
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now we get even more contraction, where we can contract across iterations as
well as rounds of input:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>(0,0) -> meet((0,3), (1,1)) = (0,1)
(0,1) -> meet((0,3), (1,1)) = (0,1)
(1,0) -> meet((1,3), (1,1)) = (1,1)
(1,1) -> meet((1,3), (1,1)) = (1,1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now we are able to aggregate updates across iterations, rather than epochs. In
our example it doesn’t actually change anything, but in an iterative computation
with closed inputs it means that we can update “in place” rather than retaining
the history of all iterations.</p>
<p>If both happen, and the frontier becomes just</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>{ (1,1) }
</pre></td></tr></tbody></table></code></pre></div></div>
<p>all of the updates we have can be aggregated. The meet of joins logic works
seamlessly for all modes.</p>
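<p>The whole worked example can be replayed mechanically. Here is a sketch of the “meet of joins” rule (Python; <code class="language-plaintext highlighter-rouge">advance</code> is my name for it, taken from the rule rather than from any library):</p>

```python
from functools import reduce

def join(a, b):
    """Least upper bound of two pair timestamps."""
    return (max(a[0], b[0]), max(a[1], b[1]))

def meet(a, b):
    """Greatest lower bound of two pair timestamps."""
    return (min(a[0], b[0]), min(a[1], b[1]))

def advance(time, frontier):
    """Advance `time` to the meet of its joins with each frontier element."""
    return reduce(meet, (join(time, f) for f in frontier))

times = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Initial frontier: all four times remain distinct.
assert [advance(t, [(0, 3), (1, 2), (2, 0)]) for t in times] == [
    (0, 0), (0, 1), (1, 0), (1, 1)]

# Epoch zero closed: times coalesce across epochs.
assert [advance(t, [(1, 2), (2, 0)]) for t in times] == [
    (1, 0), (1, 1), (1, 0), (1, 1)]

# Input closed instead: times coalesce across iterations.
assert [advance(t, [(0, 3), (1, 1)]) for t in times] == [
    (0, 1), (0, 1), (1, 1), (1, 1)]

# Both: every update lands at (1,1).
assert all(advance(t, [(1, 1)]) == (1, 1) for t in times)
```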
<h4 id="proving-things">Proving things</h4>
<p>Imagine we have a frontier F, is it true that the technique above (take the
meets of joins) is correct? What would that even mean? Here is a correctness
claim we might try to prove:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>Claim (correctness):

    For any frontier F and time s, let

        t = meet_{f in F} join(f,s).

    then for all g >= F, we have s <= g iff t <= g.
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Let’s prove the <code class="language-plaintext highlighter-rouge">iff</code> in two parts,</p>
<ol>
  <li><strong>If <code class="language-plaintext highlighter-rouge">t <= g</code>, then <code class="language-plaintext highlighter-rouge">s <= g</code>:</strong></li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>For any `f` we have that `s <= join(s,f)`, but in particular for those
`f in F`. Because `s` is less than all terms in the meet, and by the main
property of meets, we have that `s <= t` as `t` is that meet. We combine
this with the assumption `t <= g` and reach our conclusion using
transitivity of `<=`.
</pre></td></tr></tbody></table></code></pre></div></div>
<ol start="2">
  <li>
    <p><strong>If <code class="language-plaintext highlighter-rouge">s <= g</code>, then <code class="language-plaintext highlighter-rouge">t <= g</code>:</strong></p>
<p>By assumption, <code class="language-plaintext highlighter-rouge">g</code> is greater than or equal to some element <code class="language-plaintext highlighter-rouge">f in F</code>. As
such, <code class="language-plaintext highlighter-rouge">join(s,f) <= g</code>, by the main property of joins (as both <code class="language-plaintext highlighter-rouge">s <= g</code> and
<code class="language-plaintext highlighter-rouge">f <= g</code>). The meet operation always produces an element less or equal to
its arguments, and because the definition of <code class="language-plaintext highlighter-rouge">t</code> has at least the
<code class="language-plaintext highlighter-rouge">join(s,f)</code> term in its meets, we conclude that <code class="language-plaintext highlighter-rouge">t <= g</code>.</p>
</li>
</ol>
<p>Wow proofs are fun! Let’s do another one!</p>
<p>How about proving that this contraction is optimal? What would that even mean?
Here is an optimality claim we might try and prove:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>Claim (optimality):

    For two times s1 and s2, if for all g >= F we have that

        s1 <= g iff s2 <= g ,

    then meet_{f in F} join(f,s1) == meet_{f in F} join(f,s2).
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What we are saying here is that if two times are in fact indistinguishable for
all future times, then they will result in the same surrogate times <code class="language-plaintext highlighter-rouge">t1</code> and
<code class="language-plaintext highlighter-rouge">t2</code>. As we cannot correctly equate two times that are not indistinguishable,
this would be optimality.</p>
<p>Let’s try and prove this.</p>
<p><strong>Proof deferred.</strong> I couldn’t remember how to prove optimality, or even if we
did prove it. Sigh. However, I asked Martin Abadi what he thought, and he came
back with the following alternate optimality statement, which I’m going to call
“maximality” to keep it distinct from the previous claim.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>Claim (maximality):

    For two times s and t', if for all g >= F we have that

        s <= g iff t' <= g

    then

        t' <= meet_{f in F} join(f,s).
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What this claim says is that if you were thinking of contracting <code class="language-plaintext highlighter-rouge">s</code> to any time
<code class="language-plaintext highlighter-rouge">t'</code> other than <code class="language-plaintext highlighter-rouge">meet_{f in F} join(f,s)</code>, your <code class="language-plaintext highlighter-rouge">t'</code> will have to be less or
equal to ours. Our choice is “maximal”, in that sense. This proves that we’ve
done as well as we can, but it doesn’t prove that if <code class="language-plaintext highlighter-rouge">s1</code> and <code class="language-plaintext highlighter-rouge">s2</code> are
indistinguishable they result in the same contraction. Yet!</p>
<p>Here is Martin’s proof (mutatis mutandis):</p>
<p>For all <code class="language-plaintext highlighter-rouge">f</code> (but in particular <code class="language-plaintext highlighter-rouge">f in F</code>) we have that <code class="language-plaintext highlighter-rouge">s <= join(f,s)</code> by the
properties of join, and because <code class="language-plaintext highlighter-rouge">join(f,s) >= F</code> we have by assumption that
<code class="language-plaintext highlighter-rouge">t' <= join(f,s)</code>. As this holds for all <code class="language-plaintext highlighter-rouge">f in F</code>, <code class="language-plaintext highlighter-rouge">t'</code> must also be less or
equal to the meet of all these terms, by the main property of meet. Done!</p>
<p>Now we can prove optimality, using maximality as help.</p>
<p>First, let’s define</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>t1 = meet_{f in F} join(f,s1), and
t2 = meet_{f in F} join(f,s2).
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Now, we have assumed that <code class="language-plaintext highlighter-rouge">s1</code> and <code class="language-plaintext highlighter-rouge">s2</code> are indistinguishable in the future of
<code class="language-plaintext highlighter-rouge">F</code>, and we know by correctness that <code class="language-plaintext highlighter-rouge">s1</code> and <code class="language-plaintext highlighter-rouge">t1</code> are similarly
indistinguishable, as are <code class="language-plaintext highlighter-rouge">s2</code> and <code class="language-plaintext highlighter-rouge">t2</code>. This means that <code class="language-plaintext highlighter-rouge">s1</code> and <code class="language-plaintext highlighter-rouge">t2</code> are
indistinguishable, as are <code class="language-plaintext highlighter-rouge">s2</code> and <code class="language-plaintext highlighter-rouge">t1</code>. Applying each of these observations
with maximality, we conclude that</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>t1 <= meet_{f in F} join(f,s2), and
t2 <= meet_{f in F} join(f,s1).
</pre></td></tr></tbody></table></code></pre></div></div>
<p>However, the right hand sides are exactly <code class="language-plaintext highlighter-rouge">t2</code> and <code class="language-plaintext highlighter-rouge">t1</code>, respectively, and if
each of <code class="language-plaintext highlighter-rouge">t1</code> and <code class="language-plaintext highlighter-rouge">t2</code> are less or equal to each other, they must be the same
(the “antisymmetry” property of a partial order). Done!</p>
<p>Proofs are still fun! Let’s hope it’s actually true.</p>
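<p>For extra reassurance, both claims can be checked by brute force over a small grid of pair timestamps. This is a sketch (exhaustive only over the grid, so a sanity check rather than a proof):</p>

```python
from functools import reduce
from itertools import product

def leq(a, b):  return a[0] <= b[0] and a[1] <= b[1]
def join(a, b): return (max(a[0], b[0]), max(a[1], b[1]))
def meet(a, b): return (min(a[0], b[0]), min(a[1], b[1]))

def advance(s, frontier):
    return reduce(meet, (join(s, f) for f in frontier))

grid = list(product(range(3), repeat=2))
frontier = [(1, 2), (2, 0)]
# Times we might still see: anything greater-or-equal to a frontier element.
future = [g for g in grid if any(leq(f, g) for f in frontier)]

# Correctness: s and advance(s, F) agree at every future time.
for s in grid:
    t = advance(s, frontier)
    assert all(leq(s, g) == leq(t, g) for g in future)

# Optimality: indistinguishable times advance to the same surrogate.
for s1, s2 in product(grid, repeat=2):
    if all(leq(s1, g) == leq(s2, g) for g in future):
        assert advance(s1, frontier) == advance(s2, frontier)
```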
<h4 id="implementation">Implementation</h4>
<p>We now have an awesome rule for compacting differences, by advancing timestamps
using the rule from up above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>advance(s,F) = meet_{f in F} join(s,f) .
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We can apply this rule whenever we get a chance to rewrite bits of internal
state. Our optimality result tells us that as long as we apply this rule
regularly enough, we should be able to cancel any indistinguishable updates.</p>
<p>For various reasons, including compaction, we will make sure we take this
opportunity regularly. In the log-structured merge thing up above, each time we
do a merge we can write new times out after subjecting them to this change.</p>
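<p>Sketched on the running example: advance each update’s time under the frontier, then coalesce equal <code class="language-plaintext highlighter-rouge">(data, time)</code> pairs (Python; <code class="language-plaintext highlighter-rouge">consolidate</code> is an illustrative name, not the library’s API):</p>

```python
from collections import defaultdict
from functools import reduce

def join(a, b): return (max(a[0], b[0]), max(a[1], b[1]))
def meet(a, b): return (min(a[0], b[0]), min(a[1], b[1]))
def advance(t, frontier): return reduce(meet, (join(t, f) for f in frontier))

def consolidate(updates, frontier):
    """Advance times, then sum deltas for equal (data, time) pairs,
    dropping anything that cancels to zero."""
    totals = defaultdict(int)
    for data, time, delta in updates:
        totals[(data, advance(time, frontier))] += delta
    return sorted((d, t, v) for (d, t), v in totals.items() if v != 0)

updates = [
    (("a", "b"), (0, 0), +1),
    (("b", "c"), (0, 1), +1),
    (("a", "c"), (1, 0), +1),
    (("b", "c"), (1, 1), -1),
]

# Once epoch zero closes, the two (b,c) updates land at (1,1) and cancel.
assert consolidate(updates, [(1, 2), (2, 0)]) == [
    (("a", "b"), (1, 0), +1),
    (("a", "c"), (1, 0), +1),
]
```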
<p>In principle, we could also use this rule to rewrite times within layers of the
merge trie, though I’m a bit hesitant to do that without thinking harder about
the implications of departing from the immutable realm.</p>
<h3 id="problem-2-poor-scaling-with-small-updates">Problem 2: Poor scaling with small updates</h3>
<p>As we increase the number of workers, we hope to see a corresponding improvement
in performance. This improvement can take a few different forms:</p>
<ul>
  <li>
    <p><strong>Weak scaling:</strong> As the number of workers increases, the amount of work that can be performed in a fixed time increases.</p>
  </li>
</ul>
<p>As best as I understand, differential dataflow does a fine job with weak
scaling: more workers can do more work in a fixed amount of time. Increasing
the amount of work does not need to increase the amount of coordination, as
long as the number of batches does not increase.</p>
<ul>
  <li>
    <p><strong>Strong scaling:</strong> As the number of workers increases, the amount of time taken to perform a fixed amount of work decreases.</p>
  </li>
</ul>
<p>Adding more workers does not necessarily decrease the amount of time to
perform a fixed amount of work. In the limit, when each batch has just a
single record, the existence of additional workers simply does not offer
anything of use; the single record goes to one worker who is then the only
worker able to perform productive computation.</p>
<p>Lots of systems do weak scaling pretty well, and strong scaling up to a point.
While we want as much strong scaling as possible, there is only so fast we can
hope to go (with me writing all the code).</p>
<h4 id="high-resolution-timestamps">High-resolution timestamps</h4>
<p>Rather than try and get excellent strong scaling, our somewhat more modest goal
is to develop weak scaling without altering the resolution of timestamps in
differential dataflow. That is, we will accept inputs at the same frequency as a
strongly scaled system (high resolution) and produce outputs with the same
frequency, but we only need to sustain a high throughput rather than low
latency.</p>
<p>For an example of what I’m talking about, think about a sequence of ten
updates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>(datum_0, 0, +1)
(datum_1, 1, +1)
..
(datum_9, 9, +1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>In our current implementation, these each have distinct times, and go into
distinct input batches. Each worker worries about the completion of each batch
independently, and doesn’t get started on batch 7 until all batches up to and
including batch 6 have been confirmed processed.</p>
<p>It doesn’t have to work this way (and doesn’t, in some other systems).</p>
<p>Timely dataflow certainly allows for multiple times in flight, and if we put all
ten messages into the system and announce “done with rounds <code class="language-plaintext highlighter-rouge">0-9</code>”, each
differential dataflow operator will pick up various messages, let’s say a worker
picks up <code class="language-plaintext highlighter-rouge">datum_7</code>, and receives word from timely dataflow that all inputs up
through round <code class="language-plaintext highlighter-rouge">9</code> are accounted for. The work isn’t all done yet, but the
operator now knows enough to get processing.</p>
<p>Conceptually, we are going to take this approach, with some implementation
details fixed.</p>
<p>Timely dataflow’s progress tracking machinery gets stressed out proportional to
the number of distinct times that you use. Each distinct time needs to be
announced to all other participants, so even if there is just one data record
there would be <code class="language-plaintext highlighter-rouge">#workers</code> control messages sent out. This means that we
shouldn’t really send records at individual times. In addition, all sorts of
internal buffering and such are broken on timestamp boundaries; all channel
buffers get flushed, that sort of thing. We’d really like to avoid that.</p>
<p>Fortunately, there is something simple and smart to do, lifted from timely
dataflow’s logging infrastructure. Rather than send each record with its own
distinct timestamp, we use just the smallest timestamp and send several records whose
actual times are presented as data. For example,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>((datum_0, 0), 0, +1)
((datum_1, 1), 0, +1)
..
((datum_9, 9), 0, +1)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Here we’ve sent the same data all with timestamp zero, but we have provided
enough information to determine the actual time for each record.</p>
<p>Let’s call the actual timely dataflow timestamp the “message timestamp”; this is
the one that is all zeros. Let’s call the embedded timestamp the “data
timestamp”; this ranges from zero up to nine in this example. The choice to have
each data timestamp in the future of the message timestamp results in two
important properties:</p>
<ol>
<li>
<p>Operators receive messages whose message timestamp allows them to send
messages at any of the received data timestamps. Operators can safely
“advance” any capability they hold, and in particular they can advance the
message timestamp capability into a capability for any data timestamp.</p>
</li>
<li>
<p>When timely guarantees that no messages will arrive with message timestamp
<code class="language-plaintext highlighter-rouge">time</code>, the same must also be true for data timestamp <code class="language-plaintext highlighter-rouge">time</code>. This ensures that
any logic based on timely dataflow progress statements can still take effect.</p>
</li>
</ol>
<p>What we’ve done here is embed a higher-resolution timestamp in a
lower-resolution timestamp, using the former for application logic and the
latter for progress logic. We haven’t committed to any particular difference
between the two, and we seem to be at liberty to lower the resolution for
progress tracking as we see fit.</p>
<p>The downside to lower-resolution progress tracking is that other workers don’t
learn as quickly that they can make forward progress. You might be sitting on a
message with message timestamp <code class="language-plaintext highlighter-rouge">0</code> and a record with data timestamp
<code class="language-plaintext highlighter-rouge">10_000_000</code>, which is totally safe and correct, but really annoying to all the
other workers who are waiting to see if you produce a message with message and
data timestamp <code class="language-plaintext highlighter-rouge">0</code>. One can imagine lots of policies to address this, so let’s
name a few.</p>
<h5 id="millisecond-resolution">Millisecond resolution</h5>
<p>One very simple scheme fixes the lower-resolution timestamp to be something like
“milliseconds” and has the data timestamp indicate the remaining fractional
millisecond, giving us nanosecond accuracy at the data timestamp level.</p>
<p>This approach has one very appealing property, which is that because all workers
use the same scaling, when timely dataflow indicates that time <code class="language-plaintext highlighter-rouge">i</code> has completed
you know that all times up to <code class="language-plaintext highlighter-rouge">i+1</code> are complete. Not just <code class="language-plaintext highlighter-rouge">i</code> milliseconds, but
anything strictly less than <code class="language-plaintext highlighter-rouge">i+1</code> milliseconds.</p>
<p>The downside here is lack of flexibility. Perhaps in a millisecond we can
accumulate thousands of records; we will have to wait for the millisecond to
expire before we start processing them.</p>
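<p>A sketch of this fixed-resolution split, assuming nanosecond event times (the names here are illustrative):</p>

```python
NANOS_PER_MILLI = 1_000_000

def envelope(nanos):
    """Split a nanosecond event time into a low-resolution message
    timestamp (whole milliseconds, used for progress tracking) and a
    high-resolution data timestamp (the remaining fraction)."""
    return nanos // NANOS_PER_MILLI, nanos % NANOS_PER_MILLI

message_time, data_time = envelope(3_141_592)
assert (message_time, data_time) == (3, 141_592)

# The original time is recoverable, and because every worker uses the
# same scaling, "millisecond i complete" means everything strictly
# below (i+1) milliseconds is complete.
assert message_time * NANOS_PER_MILLI + data_time == 3_141_592
```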
<h5 id="variable-resolution">Variable resolution</h5>
<p>A more optimistic approach might pay attention to how much data is being sent,
and refresh the message timestamp every 1024 records it sends, or something
similarly chosen to amortize the amount of progress traffic that will result
against the data being sent. This ensures that there is at least a certain
amount of work in each batch for each other worker.</p>
<p>One must use a bit of care here to ensure that the timestamps are a coarsening
of some common time. It would be too bad if one operator had relatively few
records to ingest, and advanced times at a slower rate than other operators.
Rather, each should probably have some common notion of time, and when it is
time to advance the low-resolution timestamp each worker consults the common
time and leaps to its now current value.</p>
<p>The downside here is less information about what progress information from
timely dataflow means. Whereas up above, an indication that time <code class="language-plaintext highlighter-rouge">i</code> was
complete meant up to <code class="language-plaintext highlighter-rouge">i+1</code>, here it means no such thing.</p>
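<p>A sketch of such a policy, assuming some shared monotonic clock that all workers can consult (<code class="language-plaintext highlighter-rouge">common_clock</code> is a stand-in for that common notion of time, not a real API):</p>

```python
class VariableResolution:
    """Refresh the low-resolution message timestamp every `batch` records
    sent, leaping to a common clock so that all workers coarsen time the
    same way. (A policy sketch, not the library's implementation.)"""
    def __init__(self, common_clock, batch=1024):
        self.clock = common_clock
        self.batch = batch
        self.sent = 0
        self.message_time = common_clock()

    def stamp(self, data_time):
        """Return (message_time, data_time) for the next record."""
        if self.sent >= self.batch:
            # Time to advance: consult the common clock and leap to it.
            self.message_time = max(self.message_time, self.clock())
            self.sent = 0
        self.sent += 1
        return self.message_time, data_time

# A toy clock that ticks once per consultation.
tick = iter(range(100)).__next__
worker = VariableResolution(tick, batch=2)
stamps = [worker.stamp(d)[0] for d in range(5)]
assert stamps == [0, 0, 1, 1, 2]
```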
<h4 id="operator-implementations">Operator implementations</h4>
<p>Differential dataflow’s operator implementations currently act “time-at-a-time”,
maintaining a list of timestamps they should process and acting on each in turn.
What the operator does depends on the operator, but it typically involves
looking at the history for certain keys, up to and including the timestamp. The
“time-at-a-time” discipline works well enough if there are few times, but when
there are as many timestamps as there are data records, it needs a bit more
thought.</p>
<p>The “time-at-a-time” discipline does maintain an important property, that each
key processes its timestamps according to their partial order. We can still
maintain this property if we want to retire a large batch of data timestamps at
once, roughly as:</p>
<ol>
<li>
<p>Identify the subset of unprocessed <code class="language-plaintext highlighter-rouge">((data, dtime), mtime, delta)</code> tuples for
which <code class="language-plaintext highlighter-rouge">dtime</code> is not greater or equal to any element in the operator’s input
frontier (the condition normally used for <code class="language-plaintext highlighter-rouge">mtime</code>).</p>
</li>
<li>
<p>Group this subset by <code class="language-plaintext highlighter-rouge">key</code>, and order within each key respecting the partial
order on <code class="language-plaintext highlighter-rouge">dtime</code>.</p>
</li>
<li>
<p>For each <code class="language-plaintext highlighter-rouge">(key, dtime)</code> pair, do the thing the operator used to do for each
<code class="language-plaintext highlighter-rouge">(mtime, key)</code> pair.</p>
</li>
</ol>
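<p>The three steps above can be sketched over <code class="language-plaintext highlighter-rouge">((data, dtime), mtime, delta)</code> tuples with pair timestamps (an illustrative shape; the lexicographic sort is one total order that respects the coordinate-wise partial order on <code class="language-plaintext highlighter-rouge">dtime</code>):</p>

```python
from itertools import groupby

def leq(a, b):
    return a[0] <= b[0] and a[1] <= b[1]

def ready_batches(tuples, frontier, key_of):
    """1. Keep tuples whose data timestamp is not greater-or-equal to any
       frontier element; 2. group them by key, ordered consistently with
       the partial order on dtime; 3. yield one (key, dtime) work unit
       per group."""
    ready = [t for t in tuples
             if not any(leq(f, t[0][1]) for f in frontier)]
    # Lexicographic order on pairs extends the coordinate-wise partial order.
    ready.sort(key=lambda t: (key_of(t[0][0]), t[0][1]))
    for (key, dtime), group in groupby(
            ready, key=lambda t: (key_of(t[0][0]), t[0][1])):
        yield key, dtime, list(group)

tuples = [
    ((("a", "b"), (0, 0)), (0, 0), +1),
    ((("a", "c"), (2, 0)), (0, 0), +1),  # dtime still in the future
]
work = list(ready_batches(tuples, frontier=[(2, 0)], key_of=lambda d: d[0]))
assert [(key, dtime) for key, dtime, _ in work] == [("a", (0, 0))]
```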
<p>One advantage this new approach has is that despite a large number of times to
process, we still make just one sequential scan through the keys, resulting in
at most one scan through the collection store.</p>
<p>There are likely to be an abundance of other subtle issues about operator
implementations, which I can’t yet foresee. This is one of the advantages of
writing code though, rather than just speculating. You get to find out!</p>
<h4 id="timely-dataflow">Timely dataflow</h4>
<p>It would be great for timely dataflow to support lower-resolution timestamps for
progress tracking natively. It isn’t obvious that there is one correct way to do
it, so for now we are going to try it out “user mode” style. Perhaps we will
learn something about it (e.g. “not worth it”) that will inform a timely
adoption.</p>
<h3 id="constraint-1-compact-representation-in-memory">Constraint 1: Compact representation in memory</h3>
<p>A collection represents a set of tuples of type <code class="language-plaintext highlighter-rouge">((Key, Val), Time, isize)</code>. If
we were to write them down, the space requirements would be</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>size_of::<((Key, Val), Time, isize)>() * #(distinct (key,val,time)s)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>because any tuples with the same <code class="language-plaintext highlighter-rouge">(key,val,time)</code> entries can be coalesced.</p>
<p>But simply writing down the tuples is not the most efficient way to represent
them. We have seen above the “trie” representation, which sorts tuples and
compresses out common prefixes. For example, the trie representation would
require</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre> size_of::<(Key, usize)>() * #(distinct keys)
+ size_of::<(Val, usize)>() * #(distinct (key,val)s)
+ size_of::<(Time, isize)>() * #(distinct (key,val,time)s)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This can be much smaller than the raw tuple representation. It has other
advantages, like clearly indicating where key and value ranges start and stop,
which means our code doesn’t constantly have to check.</p>
<p>In principle, the data can be much smaller still in some not-uncommon cases.
When the data are static, for example, we have no need of the <code class="language-plaintext highlighter-rouge">(time, isize)</code>
entries because nothing changes. Even when the data are not static, if a
large number of entries have timestamps that can be contracted to the same
timestamp, most of the data do not require <code class="language-plaintext highlighter-rouge">(time, isize)</code> entries.</p>
<p>Economies like this can be accommodated using alternate trie representations.
Relatively few distinct timestamps are well accommodated by a trie for data
structured as <code class="language-plaintext highlighter-rouge">(time, (key, val), delta)</code>, organized first by time. This type of
arrangement has the annoyance that <code class="language-plaintext highlighter-rouge">key</code> data are in multiple locations, and must
be merged in order to determine cumulative counts at any time. This is not such
a pain for few times, as we were going to need to merge the geometrically sized
trie layers anyhow, but obviously more difficult and less efficient when the
number of times is large.</p>
<p>At the moment, I don’t have particularly great thoughts on choosing between
these representations other than to try and have a solid trait hiding the
specifics, behind which we can put several implementations. With some luck, we
could even have composite implementations that wrap a few implementations and
drop tuples into the one best suited to represent them. But decisions that
prevent something like this seem like poor ideas.</p>
<h3 id="constraint-2-shared-index-structures-between-operators">Constraint 2: Shared index structures between operators</h3>
<p>Several computations re-use the same collection indexed the same way. For
example, the “people you may know” query from the recent
<a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-06-21.md">differential dataflow post</a>,
which looks like so:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="c">// symmetrize the graph, because they do that too.</span>
<span class="k">let</span> <span class="n">graph</span> <span class="o">=</span> <span class="n">graph</span><span class="nf">.map</span><span class="p">(|(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)|</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">))</span><span class="nf">.concat</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="p">);</span>
<span class="n">graph</span><span class="nf">.semijoin</span><span class="p">(</span><span class="o">&</span><span class="n">query</span><span class="p">)</span>
<span class="nf">.map</span><span class="p">(|(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)|</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">))</span>
<span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="p">)</span>
<span class="nf">.map</span><span class="p">(|(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)|</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">))</span>
<span class="nf">.filter</span><span class="p">(|</span><span class="o">&</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)|</span> <span class="n">x</span> <span class="o">!=</span> <span class="n">z</span><span class="p">)</span>
<span class="c">// <-- put antijoin here if you had one</span>
<span class="nf">.topk</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The collection <code class="language-plaintext highlighter-rouge">graph</code> is used twice; both times its edge records
<code class="language-plaintext highlighter-rouge">(source, target)</code> are keyed by <code class="language-plaintext highlighter-rouge">source</code>. The code as written above will have
both <code class="language-plaintext highlighter-rouge">semijoin</code> and <code class="language-plaintext highlighter-rouge">join</code> create and maintain their own indexed copies of the
data.</p>
<p>We can be less wasteful by explicitly managing the arrangement of data into
indexed collections, and the sharing of those collections between operators.
Each of <code class="language-plaintext highlighter-rouge">semijoin</code> and <code class="language-plaintext highlighter-rouge">join</code> internally uses differential’s <code class="language-plaintext highlighter-rouge">arrange</code> operator,
which takes a keyed collection of data and returns an <code class="language-plaintext highlighter-rouge">Arranged</code>, which contains
a reference counted pointer to the collection trace the arrange operator
maintains. Because the collection is logically append-only, the sharing can be
made relatively safe (there are rules on how you are allowed to interpret the
contents).</p>
<p>Explicitly arranging and then re-using the arrangements, the code above looks
like (note: arrangement not currently optimized for visual appeal):</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="c">// symmetrize the graph</span>
<span class="k">let</span> <span class="n">graph</span> <span class="o">=</span> <span class="n">graph</span><span class="nf">.map</span><span class="p">(|(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)|</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">))</span><span class="nf">.concat</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="p">);</span>
<span class="c">// "arrange" graph, because we'll want to use it twice the same way.</span>
<span class="k">let</span> <span class="n">graph</span> <span class="o">=</span> <span class="n">graph</span><span class="nf">.arrange_by_key</span><span class="p">(|</span><span class="n">k</span><span class="p">|</span> <span class="n">k</span><span class="nf">.clone</span><span class="p">(),</span> <span class="p">|</span><span class="n">x</span><span class="p">|</span> <span class="p">(</span><span class="nn">VecMap</span><span class="p">::</span><span class="nf">new</span><span class="p">(),</span> <span class="n">x</span><span class="p">));</span>
<span class="k">let</span> <span class="n">query</span> <span class="o">=</span> <span class="n">query</span><span class="nf">.arrange_by_self</span><span class="p">(|</span><span class="n">k</span><span class="p">:</span> <span class="o">&</span><span class="nb">u32</span><span class="p">|</span> <span class="n">k</span><span class="nf">.as_u64</span><span class="p">(),</span> <span class="p">|</span><span class="n">x</span><span class="p">|</span> <span class="p">(</span><span class="nn">VecMap</span><span class="p">::</span><span class="nf">new</span><span class="p">(),</span> <span class="n">x</span><span class="p">));</span>
<span class="c">// restrict attention to edges from query nodes</span>
<span class="n">graph</span><span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">query</span><span class="p">,</span> <span class="p">|</span><span class="n">k</span><span class="p">,</span><span class="n">v</span><span class="p">,</span><span class="mi">_</span><span class="p">|</span> <span class="p">(</span><span class="n">v</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">k</span><span class="nf">.clone</span><span class="p">()))</span>
<span class="nf">.arrange_by_key</span><span class="p">(|</span><span class="n">k</span><span class="p">|</span> <span class="n">k</span><span class="nf">.clone</span><span class="p">(),</span> <span class="p">|</span><span class="n">x</span><span class="p">|</span> <span class="p">(</span><span class="nn">VecMap</span><span class="p">::</span><span class="nf">new</span><span class="p">(),</span> <span class="n">x</span><span class="p">))</span>
<span class="nf">.join</span><span class="p">(</span><span class="o">&</span><span class="n">graph</span><span class="p">,</span> <span class="p">|</span><span class="mi">_</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">|</span> <span class="p">(</span><span class="n">x</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">y</span><span class="nf">.clone</span><span class="p">()))</span>
<span class="nf">.map</span><span class="p">(|(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)|</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">))</span>
<span class="nf">.filter</span><span class="p">(|</span><span class="o">&</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)|</span> <span class="n">x</span> <span class="o">!=</span> <span class="n">z</span><span class="p">)</span>
<span class="c">// <-- put antijoin here if you had one</span>
<span class="nf">.topk</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>There is some excessive arrangement going on (e.g. <code class="language-plaintext highlighter-rouge">query</code> and the results of
the first <code class="language-plaintext highlighter-rouge">join</code>) because the arranged operators only work on pairs of
arrangements. This could be cleaned up if important, but it is assumed you know
a bit about what you are doing at this point.</p>
<p>If all of the code above makes little sense, it boils down to: whatever we do
with our collection data structure, we need to worry that multiple operators may
be looking at the same data.</p>
<p>For example, in the context of one operator we can easily speak about “the
frontier” and do compaction based on this information. When multiple operators
are sharing the same data, there is no one frontier; there is a set of
frontiers, or something like that. It can all be made to work (mostly you just
union together the frontiers with <code class="language-plaintext highlighter-rouge">MutableAntichain</code>), but some attention to
detail is important.</p>
<h2 id="conclusions">Conclusions</h2>
<p>This is a pretty beefy write-up, and possibly more for my benefit than for yours
(maybe I should have said that at the beginning; I’ve mostly realized it here at
the end, though). I’d really like to lay out the criteria for a successful data
structure and maintenance strategy more clearly, but there are lots of
constraints that come together. For now, I think it is time to start trying it
out and seeing what goes horribly wrong. Then I can tell you about that.</p><h1 id="file-locking-in-linux">File locking in Linux</h1>
<p><strong>Table of contents</strong></p>
<ul>
<li><a href="https://gavv.net/articles/file-locks/#introduction">Introduction</a></li>
<li><a href="https://gavv.net/articles/file-locks/#advisory-locking">Advisory locking</a></li>
<li><a href="https://gavv.net/articles/file-locks/#common-features">Common features</a></li>
<li><a href="https://gavv.net/articles/file-locks/#differing-features">Differing features</a></li>
<li><a href="https://gavv.net/articles/file-locks/#file-descriptors-and-i-nodes">File descriptors and i-nodes</a></li>
<li><a href="https://gavv.net/articles/file-locks/#bsd-locks-flock">BSD locks (flock)</a></li>
<li><a href="https://gavv.net/articles/file-locks/#posix-record-locks-fcntl">POSIX record locks (fcntl)</a></li>
<li><a href="https://gavv.net/articles/file-locks/#lockf-function">lockf function</a></li>
<li><a href="https://gavv.net/articles/file-locks/#open-file-description-locks-fcntl">Open file description locks (fcntl)</a></li>
<li><a href="https://gavv.net/articles/file-locks/#emulating-open-file-description-locks">Emulating Open file description locks</a></li>
<li><a href="https://gavv.net/articles/file-locks/#test-program">Test program</a></li>
<li><a href="https://gavv.net/articles/file-locks/#command-line-tools">Command-line tools</a></li>
<li><a href="https://gavv.net/articles/file-locks/#mandatory-locking">Mandatory locking</a></li>
<li><a href="https://gavv.net/articles/file-locks/#example-usage">Example usage</a></li>
</ul>
<hr />
<h2 id="introduction"><a href="https://gavv.net/articles/file-locks/#introduction"></a>Introduction</h2>
<p><a href="https://en.wikipedia.org/wiki/File_locking">File locking</a> is a mutual-exclusion
mechanism for files. Linux supports two major kinds of file locks:</p>
<ul>
<li>advisory locks</li>
<li>mandatory locks</li>
</ul>
<p>Below we discuss all lock types available in POSIX and Linux and provide usage
examples.</p>
<hr />
<h2 id="advisory-locking"><a href="https://gavv.net/articles/file-locks/#advisory-locking"></a>Advisory locking</h2>
<p>Traditionally, locks are
<a href="https://unix.stackexchange.com/questions/147392/what-is-advisory-locking-on-files-that-unix-systems-typically-employs">advisory</a>
in Unix. They work only when a process explicitly acquires and releases locks,
and are ignored if a process is not aware of locks.</p>
<p>There are several types of advisory locks available in Linux:</p>
<ul>
<li>BSD locks (flock)</li>
<li>POSIX record locks (fcntl, lockf)</li>
<li>Open file description locks (fcntl)</li>
</ul>
<p>All locks except the <code class="language-plaintext highlighter-rouge">lockf</code> function are
<a href="https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock">reader-writer locks</a>,
i.e. support exclusive and shared modes.</p>
<p>Note that <a href="http://man7.org/linux/man-pages/man3/flockfile.3.html"><code class="language-plaintext highlighter-rouge">flockfile</code></a>
and friends have nothing to do with the file locks. They manage internal mutex
of the <code class="language-plaintext highlighter-rouge">FILE</code> object from stdio.</p>
<p>Reference:</p>
<ul>
<li><a href="https://www.gnu.org/software/libc/manual/html_node/File-Locks.html">File Locks</a>,
GNU libc manual</li>
<li><a href="https://www.gnu.org/software/libc/manual/html_node/Open-File-Description-Locks.html">Open File Description Locks</a>,
GNU libc manual</li>
<li><a href="https://lwn.net/Articles/586904/">File-private POSIX locks</a>, an LWN article
about the predecessor of open file description locks</li>
</ul>
<h3 id="common-features"><a href="https://gavv.net/articles/file-locks/#common-features"></a>Common features</h3>
<p>The following features are common for locks of all types:</p>
<ul>
<li>All locks support blocking and non-blocking operations.</li>
<li>Locks are allowed only on files, not directories.</li>
<li>Locks are automatically removed when the process terminates. It’s
guaranteed that if a lock is acquired, the process holding the lock is still
alive.</li>
</ul>
<h3 id="differing-features"><a href="https://gavv.net/articles/file-locks/#differing-features"></a>Differing features</h3>
<p>This table summarizes the difference between the lock types. A more detailed
description and usage examples are provided below.</p>
<table>
<thead>
<tr>
<th> </th>
<th>BSD locks</th>
<th>lockf function</th>
<th>POSIX record locks</th>
<th>Open file description locks</th>
</tr>
</thead>
<tbody>
<tr>
<td>Portability</td>
<td>widely available</td>
<td>POSIX (XSI)</td>
<td>POSIX (base standard)</td>
<td>Linux 3.15+</td>
</tr>
<tr>
<td>Associated with</td>
<td>File object</td>
<td>[i-node, pid] pair</td>
<td>[i-node, pid] pair</td>
<td>File object</td>
</tr>
<tr>
<td>Applying to byte range</td>
<td>no</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Support exclusive and shared modes</td>
<td>yes</td>
<td>no</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Atomic mode switch</td>
<td>no</td>
<td>-</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Works on NFS (Linux)</td>
<td>Linux 2.6.12+</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
</tbody>
</table>
<h3 id="file-descriptors-and-i-nodes"><a href="https://gavv.net/articles/file-locks/#file-descriptors-and-i-nodes"></a>File descriptors and i-nodes</h3>
<p>A <a href="https://en.wikipedia.org/wiki/File_descriptor"><em>file descriptor</em></a> is an index
in the per-process file descriptor table (in the left of the picture). Each file
descriptor table entry contains a reference to a <em>file object</em>, stored in the
file table (in the middle of the picture). Each file object contains a reference
to an <a href="https://en.wikipedia.org/wiki/Inode">i-node</a>, stored in the i-node table
(in the right of the picture).</p>
<p><img src="https://gavv.net/articles/file-locks/tables.png" alt="" /></p>
<p>A file descriptor is just a number that is used to refer to a file object from
user space. A file object represents an opened file. It contains things like the
current read/write offset, the non-blocking flag, and other non-persistent state.
An i-node represents a filesystem object. It contains things like file
meta-information (e.g. owner and permissions) and references to data blocks.</p>
<p>File descriptors created by several <code class="language-plaintext highlighter-rouge">open()</code> calls for the same file path point
to different file objects, but these file objects point to the same i-node.
Duplicated file descriptors created by <code class="language-plaintext highlighter-rouge">dup2()</code> or <code class="language-plaintext highlighter-rouge">fork()</code> point to the same
file object.</p>
<p>A BSD lock and an Open file description lock are associated with a file object,
while a POSIX record lock is associated with an <code class="language-plaintext highlighter-rouge">[i-node, pid]</code> pair. We’ll
discuss it below.</p>
<h3 id="bsd-locks-flock"><a href="https://gavv.net/articles/file-locks/#bsd-locks-flock"></a>BSD locks (flock)</h3>
<p>The simplest and most common file locks are provided by
<a href="http://man7.org/linux/man-pages/man2/flock.2.html"><code class="language-plaintext highlighter-rouge">flock(2)</code></a>.</p>
<p>Features:</p>
<ul>
<li>not specified in POSIX, but widely available on various Unix systems</li>
<li>always lock the entire file</li>
<li>associated with a file object</li>
<li>do not guarantee atomic switch between the locking modes (exclusive and
shared)</li>
<li>up to Linux 2.6.11, didn’t work on NFS; since Linux 2.6.12, flock() locks on
NFS are emulated using fcntl() POSIX record byte-range locks on the entire
file (unless the emulation is disabled in the NFS mount options)</li>
</ul>
<p>The lock acquisition is associated with a file object, i.e.:</p>
<ul>
<li>duplicated file descriptors, e.g. created using <code class="language-plaintext highlighter-rouge">dup2</code> or <code class="language-plaintext highlighter-rouge">fork</code>, share the
lock acquisition;</li>
<li>independent file descriptors, e.g. created using two <code class="language-plaintext highlighter-rouge">open</code> calls (even for
the same file), don’t share the lock acquisition;</li>
</ul>
<p>This means that with BSD locks you can’t synchronize threads or processes
through the same or duplicated file descriptors, but you can synchronize both
threads and processes through independent file descriptors.</p>
<p><code class="language-plaintext highlighter-rouge">flock()</code> doesn’t guarantee atomic mode switch. From the man page:</p>
<blockquote>
<p>Converting a lock (shared to exclusive, or vice versa) is not guaranteed to be
atomic: the existing lock is first removed, and then a new lock is
established. Between these two steps, a pending lock request by another
process may be granted, with the result that the conversion either blocks, or
fails if LOCK_NB was specified. (This is the original BSD behaviour, and
occurs on many other implementations.)</p>
</blockquote>
<p>This problem is solved by POSIX record locks and Open file description locks.</p>
<p>Usage example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="rouge-code"><pre>#include <sys/file.h>
// acquire shared lock
if (flock(fd, LOCK_SH) == -1) {
exit(1);
}
// non-atomically upgrade to exclusive lock
// do it in non-blocking mode, i.e. fail if can't upgrade immediately
if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
exit(1);
}
// release lock
// lock is also released automatically when close() is called or process exits
if (flock(fd, LOCK_UN) == -1) {
exit(1);
}
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="posix-record-locks-fcntl"><a href="https://gavv.net/articles/file-locks/#posix-record-locks-fcntl"></a>POSIX record locks (fcntl)</h3>
<p>POSIX record locks, also known as process-associated locks, are provided by
<a href="http://man7.org/linux/man-pages/man2/fcntl.2.html"><code class="language-plaintext highlighter-rouge">fcntl(2)</code></a>, see “Advisory
record locking” section in the man page.</p>
<p>Features:</p>
<ul>
<li><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html">specified</a>
in POSIX (base standard)</li>
<li>can be applied to a byte range</li>
<li>associated with an <code class="language-plaintext highlighter-rouge">[i-node, pid]</code> pair instead of a file object</li>
<li>guarantee atomic switch between the locking modes (exclusive and shared)</li>
<li>work on NFS (on Linux)</li>
</ul>
<p>The lock acquisition is associated with an <code class="language-plaintext highlighter-rouge">[i-node, pid]</code> pair, i.e.:</p>
<ul>
<li>file descriptors opened by the same process for the same file share the lock
acquisition (even independent file descriptors, e.g. created using two <code class="language-plaintext highlighter-rouge">open</code>
calls);</li>
<li>file descriptors opened by different processes don’t share the lock
acquisition;</li>
</ul>
<p>This means that with POSIX record locks, it is possible to synchronize
processes, but not threads. All threads belonging to the same process always
share the lock acquisition of a file, which means that:</p>
<ul>
<li>the lock acquired through some file descriptor by some thread may be released
through another file descriptor by another thread;</li>
<li>when any thread calls <code class="language-plaintext highlighter-rouge">close</code> on any descriptor referring to given file, the
lock is released for the whole process, even if there are other opened
descriptors referring to this file.</li>
</ul>
<p>This problem is solved by Open file description locks.</p>
<p>Usage example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
</pre></td><td class="rouge-code"><pre>#include <fcntl.h>
struct flock fl;
memset(&fl, 0, sizeof(fl));
// lock in shared mode
fl.l_type = F_RDLCK;
// lock entire file
fl.l_whence = SEEK_SET; // offset base is start of the file
fl.l_start = 0; // starting offset is zero
fl.l_len = 0; // len is zero, which is a special value representing end
// of file (no matter how large the file grows in future)
fl.l_pid = 0; // F_SETLK(W) ignores it; F_OFD_SETLK(W) requires it to be zero
// F_SETLKW specifies blocking mode
if (fcntl(fd, F_SETLKW, &fl) == -1) {
exit(1);
}
// atomically upgrade shared lock to exclusive lock, but only
// for bytes in range [10; 15)
//
// after this call, the process will hold three lock regions:
// [0; 10) - shared lock
// [10; 15) - exclusive lock
// [15; SEEK_END) - shared lock
fl.l_type = F_WRLCK;
fl.l_start = 10;
fl.l_len = 5;
// F_SETLK specifies non-blocking mode
if (fcntl(fd, F_SETLK, &fl) == -1) {
exit(1);
}
// release lock for bytes in range [10; 15)
fl.l_type = F_UNLCK;
if (fcntl(fd, F_SETLK, &fl) == -1) {
exit(1);
}
// close file and release locks for all regions
// remember that locks are released when process calls close()
// on any descriptor for a lock file
close(fd);
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="lockf-function"><a href="https://gavv.net/articles/file-locks/#lockf-function"></a>lockf function</h3>
<p><a href="http://man7.org/linux/man-pages/man3/lockf.3.html"><code class="language-plaintext highlighter-rouge">lockf(3)</code></a> function is a
simplified version of POSIX record locks.</p>
<p>Features:</p>
<ul>
<li><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/lockf.html">specified</a>
in POSIX (XSI)</li>
<li>can be applied to a byte range (optionally automatically expanding when data
is appended in future)</li>
<li>associated with an <code class="language-plaintext highlighter-rouge">[i-node, pid]</code> pair instead of a file object</li>
<li>supports only exclusive locks</li>
<li>works on NFS (on Linux)</li>
</ul>
<p>Since <code class="language-plaintext highlighter-rouge">lockf</code> locks are associated with an <code class="language-plaintext highlighter-rouge">[i-node, pid]</code> pair, they have the
same problems as POSIX record locks described above.</p>
<p>The interaction between <code class="language-plaintext highlighter-rouge">lockf</code> and other types of locks is not specified by
POSIX. On Linux, <code class="language-plaintext highlighter-rouge">lockf</code> is
<a href="https://github.com/lattera/glibc/blob/master/io/lockf.c">just a wrapper</a> for
POSIX record locks.</p>
<p>Usage example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre>#include <unistd.h>
// set current position to byte 10
if (lseek(fd, 10, SEEK_SET) == -1) {
exit(1);
}
// acquire exclusive lock for bytes in range [10; 15)
// F_LOCK specifies blocking mode
if (lockf(fd, F_LOCK, 5) == -1) {
exit(1);
}
// release lock for bytes in range [10; 15)
if (lockf(fd, F_ULOCK, 5) == -1) {
exit(1);
}
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="open-file-description-locks-fcntl"><a href="https://gavv.net/articles/file-locks/#open-file-description-locks-fcntl"></a>Open file description locks (fcntl)</h3>
<p>Open file description locks are Linux-specific and combine advantages of the BSD
locks and POSIX record locks. They are provided by
<a href="http://man7.org/linux/man-pages/man2/fcntl.2.html"><code class="language-plaintext highlighter-rouge">fcntl(2)</code></a>, see “Open file
description locks (non-POSIX)” section in the man page.</p>
<p>Features:</p>
<ul>
<li>Linux-specific, not specified in POSIX</li>
<li>can be applied to a byte range</li>
<li>associated with a file object</li>
<li>guarantee atomic switch between the locking modes (exclusive and shared)</li>
<li>work on NFS (on Linux)</li>
</ul>
<p>Thus, Open file description locks combine advantages of BSD locks and POSIX
record locks: they provide both atomic switch between the locking modes, and the
ability to synchronize both threads and processes.</p>
<p>These locks are available since the 3.15 kernel.</p>
<p>The API is the same as for POSIX record locks (see above). It uses
<code class="language-plaintext highlighter-rouge">struct flock</code> too. The only difference is in <code class="language-plaintext highlighter-rouge">fcntl</code> command names:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">F_OFD_SETLK</code> instead of <code class="language-plaintext highlighter-rouge">F_SETLK</code></li>
<li><code class="language-plaintext highlighter-rouge">F_OFD_SETLKW</code> instead of <code class="language-plaintext highlighter-rouge">F_SETLKW</code></li>
<li><code class="language-plaintext highlighter-rouge">F_OFD_GETLK</code> instead of <code class="language-plaintext highlighter-rouge">F_GETLK</code></li>
</ul>
<h3 id="emulating-open-file-description-locks"><a href="https://gavv.net/articles/file-locks/#emulating-open-file-description-locks"></a>Emulating Open file description locks</h3>
<p>What do we have for multithreading and atomicity so far?</p>
<ul>
<li>BSD locks allow thread synchronization but don’t allow atomic mode switch.</li>
<li>POSIX record locks don’t allow thread synchronization but allow atomic mode
switch.</li>
<li>Open file description locks allow both but are available only on recent Linux
kernels.</li>
</ul>
<p>If you need both features but can’t use Open file description locks (e.g. you’re
using some embedded system with an outdated Linux kernel), you can <em>emulate</em>
them on top of the POSIX record locks.</p>
<p>Here is one possible approach:</p>
<ul>
<li>Implement your own API for file locks. Ensure that all threads always use this
API instead of using <code class="language-plaintext highlighter-rouge">fcntl()</code> directly. Ensure that threads never open and
close lock-files directly.</li>
<li>In the API, implement a process-wide singleton (shared by all threads) holding
all currently acquired locks.</li>
<li>Associate two additional objects with every acquired lock:
<ul>
<li>a counter</li>
<li>an RW-mutex, e.g.
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_rwlock_destroy.html"><code class="language-plaintext highlighter-rouge">pthread_rwlock</code></a></li>
</ul>
</li>
</ul>
<p>Now, you can implement lock operations as follows:</p>
<ul>
<li>Acquiring a lock:
<ul>
<li>First, acquire the RW-mutex. If the user requested the shared mode, acquire a
read lock. If the user requested the exclusive mode, acquire a write lock.</li>
<li>Check the counter. If it’s zero, also acquire the file lock using <code class="language-plaintext highlighter-rouge">fcntl()</code>.</li>
<li>Increment the counter.</li>
</ul>
</li>
<li>Releasing a lock:
<ul>
<li>Decrement the counter.</li>
<li>If the counter becomes zero, release the file lock using <code class="language-plaintext highlighter-rouge">fcntl()</code>.</li>
<li>Release the RW-mutex.</li>
</ul>
</li>
</ul>
<p>This approach makes possible both thread and process synchronization.</p>
<h3 id="test-program">Test program</h3>
<p>I’ve prepared a
<a href="https://github.com/gavv/snippets/blob/master/fs/locks.c">small program</a> that
helps to learn the behavior of different lock types.</p>
<p>The program starts two threads or processes, both of which wait to acquire the
lock, then sleep for one second, and then release the lock. It has three
parameters:</p>
<ul>
<li>lock mode: <code class="language-plaintext highlighter-rouge">flock</code> (BSD locks), <code class="language-plaintext highlighter-rouge">lockf</code>, <code class="language-plaintext highlighter-rouge">fcntl_posix</code> (POSIX record locks),
<code class="language-plaintext highlighter-rouge">fcntl_linux</code> (Open file description locks)</li>
<li>access mode: <code class="language-plaintext highlighter-rouge">same_fd</code> (access lock via the same descriptor), <code class="language-plaintext highlighter-rouge">dup_fd</code> (access
lock via duplicated descriptors), <code class="language-plaintext highlighter-rouge">two_fds</code> (access lock via two descriptors
opened independently for the same path)</li>
<li>concurrency mode: <code class="language-plaintext highlighter-rouge">threads</code> (access lock from two threads), <code class="language-plaintext highlighter-rouge">processes</code>
(access lock from two processes)</li>
</ul>
<p>Below you can find some examples.</p>
<p>Threads are not serialized if they use BSD locks on duplicated descriptors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>$ ./a.out flock dup_fd threads
13:00:58 pid=5790 tid=5790 lock
13:00:58 pid=5790 tid=5791 lock
13:00:58 pid=5790 tid=5790 sleep
13:00:58 pid=5790 tid=5791 sleep
13:00:59 pid=5790 tid=5791 unlock
13:00:59 pid=5790 tid=5790 unlock
</pre></td></tr></tbody></table></code></pre></div></div>
<p>But they are serialized if the locks are acquired on two independent descriptors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>$ ./a.out flock two_fds threads
13:01:03 pid=5792 tid=5792 lock
13:01:03 pid=5792 tid=5794 lock
13:01:03 pid=5792 tid=5792 sleep
13:01:04 pid=5792 tid=5792 unlock
13:01:04 pid=5792 tid=5794 sleep
13:01:05 pid=5792 tid=5794 unlock
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Threads are not serialized if they use POSIX record locks on two independent
descriptors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>$ ./a.out fcntl_posix two_fds threads
13:01:08 pid=5795 tid=5795 lock
13:01:08 pid=5795 tid=5796 lock
13:01:08 pid=5795 tid=5795 sleep
13:01:08 pid=5795 tid=5796 sleep
13:01:09 pid=5795 tid=5795 unlock
13:01:09 pid=5795 tid=5796 unlock
</pre></td></tr></tbody></table></code></pre></div></div>
<p>But processes are serialized:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>$ ./a.out fcntl_posix two_fds processes
13:01:13 pid=5797 tid=5797 lock
13:01:13 pid=5798 tid=5798 lock
13:01:13 pid=5797 tid=5797 sleep
13:01:14 pid=5797 tid=5797 unlock
13:01:14 pid=5798 tid=5798 sleep
13:01:15 pid=5798 tid=5798 unlock
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="command-line-tools">Command-line tools</h3>
<p>The following tools may be used to acquire and release file locks from the
command line:</p>
<ul>
<li>
<p><a href="http://man7.org/linux/man-pages/man1/flock.1.html"><strong><code class="language-plaintext highlighter-rouge">flock</code></strong></a></p>
<p>Provided by the <code class="language-plaintext highlighter-rouge">util-linux</code> package. Uses the <code class="language-plaintext highlighter-rouge">flock()</code> function.</p>
<p>There are two ways to use this tool:</p>
<ul>
<li>
<p>run a command while holding a lock:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>flock my.lock sleep 10
</pre></td></tr></tbody></table></code></pre></div> </div>
<p><code class="language-plaintext highlighter-rouge">flock</code> will acquire the lock, run the command, and release the lock.</p>
</li>
<li>
<p>open a file descriptor in bash and use <code class="language-plaintext highlighter-rouge">flock</code> to acquire and release the
lock manually:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>set -e # die on errors
exec 100>my.lock # open file 'my.lock' and link file descriptor 100 to it
flock -n 100 # acquire a lock
echo hello
sleep 10
echo goodbye
flock -u -n 100 # release the lock
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ul>
<p>You can run these two snippets in parallel in different terminals and
see that while one sleeps holding the lock, the other is blocked in
flock.</p>
</li>
<li>
<p><a href="https://linux.die.net/man/1/lockfile"><strong><code class="language-plaintext highlighter-rouge">lockfile</code></strong></a></p>
<p>Provided by the <code class="language-plaintext highlighter-rouge">procmail</code> package.</p>
<p>Runs the given command while holding a lock. It can use the <code class="language-plaintext highlighter-rouge">flock()</code>,
<code class="language-plaintext highlighter-rouge">lockf()</code>, or <code class="language-plaintext highlighter-rouge">fcntl()</code> function, depending on what’s available on the system.</p>
</li>
</ul>
<p>There are also two ways to inspect the currently acquired locks:</p>
<ul>
<li>
<p><a href="http://man7.org/linux/man-pages/man8/lslocks.8.html"><strong><code class="language-plaintext highlighter-rouge">lslocks</code></strong></a></p>
<p>Provided by the <code class="language-plaintext highlighter-rouge">util-linux</code> package.</p>
<p>Lists all file locks currently held in the entire system. It can filter by
PID and configure the output format.</p>
<p>Example output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>COMMAND PID TYPE SIZE MODE M START END PATH
containerd 4498 FLOCK 256K WRITE 0 0 0 /var/lib/docker/containerd/...
dockerd 4289 FLOCK 256K WRITE 0 0 0 /var/lib/docker/volumes/...
(undefined) -1 OFDLCK READ 0 0 0 /dev...
dockerd 4289 FLOCK 16K WRITE 0 0 0 /var/lib/docker/builder/...
dockerd 4289 FLOCK 16K WRITE 0 0 0 /var/lib/docker/buildkit/...
dockerd 4289 FLOCK 16K WRITE 0 0 0 /var/lib/docker/buildkit/...
dockerd 4289 FLOCK 32K WRITE 0 0 0 /var/lib/docker/buildkit/...
(unknown) 4417 FLOCK WRITE 0 0 0 /run...
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
<li>
<p><a href="http://man7.org/linux/man-pages/man5/proc.5.html"><strong><code class="language-plaintext highlighter-rouge">/proc/locks</code></strong></a></p>
<p>A file in the <code class="language-plaintext highlighter-rouge">procfs</code> virtual file system that shows the current file locks of all
types. The <code class="language-plaintext highlighter-rouge">lslocks</code> tool relies on this file.</p>
<p>Example content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre>16: FLOCK ADVISORY WRITE 4417 00:17:23319 0 EOF
27: FLOCK ADVISORY WRITE 4289 08:03:9441686 0 EOF
28: FLOCK ADVISORY WRITE 4289 08:03:9441684 0 EOF
29: FLOCK ADVISORY WRITE 4289 08:03:9441681 0 EOF
30: FLOCK ADVISORY WRITE 4289 08:03:8528339 0 EOF
31: OFDLCK ADVISORY READ -1 00:06:9218 0 EOF
43: FLOCK ADVISORY WRITE 4289 08:03:8536567 0 EOF
52: FLOCK ADVISORY WRITE 4498 08:03:8520185 0 EOF
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ul>
<hr />
<h2 id="mandatory-locking">Mandatory locking</h2>
<p>Linux has limited support for
<a href="https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt">mandatory file locking</a>.
See the “Mandatory locking” section in the
<a href="http://man7.org/linux/man-pages/man2/fcntl.2.html"><code class="language-plaintext highlighter-rouge">fcntl(2)</code></a> man page.</p>
<p>A mandatory lock is activated for a file when all of these conditions are met:</p>
<ul>
<li>The partition was mounted with the <code class="language-plaintext highlighter-rouge">mand</code> option.</li>
<li>The set-group-ID bit is on and group-execute bit is off for the file.</li>
<li>A POSIX record lock is acquired.</li>
</ul>
<p>Note that the <a href="https://en.wikipedia.org/wiki/Setuid">set-group-ID</a> bit has its
regular meaning of elevating privileges when the group-execute bit is on and a
special meaning of enabling mandatory locking when the group-execute bit is off.</p>
<p>When a mandatory lock is activated, it affects regular system calls on the file:</p>
<ul>
<li>When an exclusive or shared lock is acquired, all system calls that <em>modify</em>
the file (e.g. <code class="language-plaintext highlighter-rouge">open()</code> and <code class="language-plaintext highlighter-rouge">truncate()</code>) are blocked until the lock is
released.</li>
<li>When an exclusive lock is acquired, all system calls that <em>read</em> from the file
(e.g. <code class="language-plaintext highlighter-rouge">read()</code>) are blocked until the lock is released.</li>
</ul>
<p>However, the documentation mentions that the current implementation is not reliable;
in particular:</p>
<ul>
<li>races are possible when locks are acquired concurrently with <code class="language-plaintext highlighter-rouge">read()</code> or
<code class="language-plaintext highlighter-rouge">write()</code></li>
<li>races are possible when using <code class="language-plaintext highlighter-rouge">mmap()</code></li>
</ul>
<p>Since mandatory locks are not allowed for directories and are ignored by
<code class="language-plaintext highlighter-rouge">unlink()</code> and <code class="language-plaintext highlighter-rouge">rename()</code> calls, you can’t prevent file deletion or renaming
using these locks.</p>
<h3 id="example-usage">Example usage</h3>
<p>Below you can find a usage example of mandatory locking.</p>
<p><code class="language-plaintext highlighter-rouge">fcntl_lock.c</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="rouge-code"><pre>#include <sys/fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
if (argc != 2) {
fprintf(stderr, "usage: %s file\n", argv[0]);
exit(1);
}
int fd = open(argv[1], O_RDWR);
if (fd == -1) {
perror("open");
exit(1);
}
struct flock fl = {};
fl.l_type = F_WRLCK;
fl.l_whence = SEEK_SET;
fl.l_start = 0;
fl.l_len = 0;
if (fcntl(fd, F_SETLKW, &fl) == -1) {
perror("fcntl");
exit(1);
}
pause();
exit(0);
}
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Build <code class="language-plaintext highlighter-rouge">fcntl_lock</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>$ gcc -o fcntl_lock fcntl_lock.c
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Mount the partition and create a file with the mandatory locking enabled:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre>$ mkdir dir
$ mount -t tmpfs -o mand,size=1m tmpfs ./dir
$ echo hello > dir/lockfile
$ chmod g+s,g-x dir/lockfile
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Acquire a lock in the first terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>$ ./fcntl_lock dir/lockfile
(wait for a while)
^C
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Try to read the file in the second terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>$ cat dir/lockfile
(hangs until ^C is pressed in the first terminal)
hello
</pre></td></tr></tbody></table></code></pre></div></div>Victor GaydovFile locking in LinuxQuestions For New Tech2022-10-04T00:00:00+00:002022-10-04T00:00:00+00:00/primitives/2022/10/04/Questions-for-new-Tech<h1 id="questions-for-a-new-technology">Questions for a new technology</h1>
<p>Given that coordination and communication swamp all other costs in modern
software development it is a pressing area to invest in,
<a href="https://kellanem.com/notes/on-team-size">especially as your team scales</a>.</p>
<p>I use a framework of a <em>Small Number of Well Known Tools</em> to build shared
understanding in our complex systems over time. When we want to do something
other than use the <em>Small Number of Well Known Tools</em> (in the small number of
well known patterns), that’s a <em>Departure</em>.</p>
<p>I have a long note I want to post on technical decision making and departures.</p>
<p>In the meantime I want to share a short list of questions I’ve been using in
various forms for over a decade to engage with <em>The Dreaded Question</em>.</p>
<p>The Dreaded Question goes something like:</p>
<blockquote>
<p>“We should use this new technology X, it’s faster, it’s better, it’s more
elegant, it’s more actively developed, aren’t you committed to people learning
and growing here at company Y, look I whipped up a prototype over the weekend
and it’s in production, isn’t this technology amazing, huh, well fuck this
fascist totalitarian state, I’m out of here.”</p>
</blockquote>
<p>(maybe that wasn’t a question)</p>
<p>If you’ve led technology teams for any period of time, you’ve had this
conversation. (or, in rare cases, missed important opportunities to have this
conversation)</p>
<h3 id="the-questions">The Questions</h3>
<p>They aren’t particularly subtle in their bias. They aren’t meant to be. They
also aren’t meant to be a series of boxes to be checked or hoops to be jumped
through.</p>
<ol>
<li>What problem are we trying to solve? (Tech should never be introduced as an
end to itself)</li>
<li>How could we solve the problem with our current tech stack? (If the answer
is we can’t, then we probably haven’t thought about the problem deeply
enough)</li>
<li>Are we clear on what new costs we are taking on with the new technology?
(monitoring, training, cognitive load, etc)</li>
<li>What about our current stack makes solving this problem in a cost-effective
manner (in terms of money, people or time) difficult?</li>
<li>If this new tech is a replacement for something we currently do, are we
committed to moving everything to this new technology in the future? Or are
we proliferating multiple solutions to the same problem? (aka “Will this
solution kill and eat the solution that it replaces?”)</li>
<li>Who do we know and trust who uses this tech? Have we talked to them about
it? What did they say about it? What don’t they like about it? (if they
don’t hate it, they haven’t used it in depth yet)</li>
<li>What’s a low risk way to get started?</li>
<li>Have you gotten a mixed discipline group of senior folks together and
thrashed out each of the above points? Where is that documented?</li>
</ol>Questions for a new technologyPricing Transaction Costs2022-09-28T00:00:00+00:002022-09-28T00:00:00+00:00/primitives/2022/09/28/Pricing-Transaction-Costs<h3 id="gas-pricing-notes-and-suggestions">Gas Pricing Notes and Suggestions</h3>
<p><img align="right" src="https://gist.githubusercontent.com/sambacha/9ec6a1a70466bcabe04eca3821e1c9d4/raw/1364229703b5f903e2852895cfae79845e5ddab9/app.svg" height="710" alt="" /></p>
<p>Carrying over from the issues we have:</p>
<ol>
<li>DO NOT TRACK GAS USED VIA WRAPPER: gas used through a wrapper contract is not
accurate with Multicall due to EIP-2929</li>
</ol>
<blockquote>
<p>This is probably the source of a lot of issues with gas price estimation.</p>
</blockquote>
<blockquote>
<p>V3 = Trident, V2 = SushiV1</p>
</blockquote>
<p>For each gas estimate, normalize decimals to that of the chosen <code class="language-plaintext highlighter-rouge">usd token</code>.</p>
<p>Use the BFS approach. It allows us to keep a reference to nodes that we want to
come back to, even though we haven’t checked/visited them yet. This is crucial
in both pathfinding and gas pricing, which is elaborated below.</p>
<ol>
<li>First we seed the BFS (breadth-first search) queue with the best quotes for each
percentage, i.e. the best quote when sending 10% of the amount, the best quote when
sending 20% of the amount, and so on.</li>
<li>Then we explore the various combinations from each node.
<ul>
<li>The size of the queue at this point is the number of potential routes we are
investigating for the given number of splits.</li>
<li>If we didn’t improve our quote by adding another split, it is very unlikely to
improve by splitting further after that.</li>
</ul>
</li>
<li>For all other percentages, add a new potential route.
<ul>
<li>E.g. if our current aggregated route is missing 50%, we will create new nodes
and add them to the queue for: 50% + a new 10% route, 50% + a new 20% route, etc.</li>
</ul>
</li>
<li>[Calculate] If on L1, the estimated gas used based on hops and ticks across
all the routes; if on L2, the gas used on the L2 based on hops and ticks
across all the routes.</li>
<li>If swapping on an L2 that includes an L1 security fee, calculate the fee and
include it in the gas-adjusted quotes.</li>
<li>[check] Ensure any addresses are aliased if needed for L2>L1.</li>
<li>
<p>[assert] Ensure the <code class="language-plaintext highlighter-rouge">gasModel</code> exists and that the swap route is a v3-only
route.</p>
</li>
<li>Include a <code class="language-plaintext highlighter-rouge">networkCongestion</code> property when requesting EIP-1559-compatible
gas fee estimates. This value, a number from 0 to 1 where 0
represents “not congested” and 1 represents “extremely congested”, can be
used to communicate the status of the overall network to the DApp and end
user.</li>
</ol>
<table>
<thead>
<tr>
<th><strong>Field</strong></th>
<th><strong>Value</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>networkCongestion</td>
<td>A normalized number that can be used to gauge the congestion level of the network, with 0 meaning not congested and 1 meaning extremely congested</td>
</tr>
<tr>
<td>minWaitTimeEstimate</td>
<td>The fastest the transaction will take, in milliseconds</td>
</tr>
<tr>
<td>maxWaitTimeEstimate</td>
<td>The slowest the transaction will take, in milliseconds</td>
</tr>
<tr>
<td>suggestedMaxPriorityFeePerGas</td>
<td>A suggested “tip”, a GWEI hex number</td>
</tr>
<tr>
<td>suggestedMaxFeePerGas</td>
<td>A suggested max fee, the most a user will pay, a GWEI hex number</td>
</tr>
</tbody>
</table>
<h2 id="gwei-service">Gwei Service</h2>
<p>The Gwei Service is an important part of the overall system. Since Gwei pricing
is the most important portion of the overall system efficacy it is decoupled
from the application itself and run in a separate stack entirely. We inject the
Gwei pricing service by loading it at runtime via <code class="language-plaintext highlighter-rouge">startGasWorker()</code>. <em>Note:</em> we use
the term GasWorker to draw a distinction between <code class="language-plaintext highlighter-rouge">gwei</code> and <code class="language-plaintext highlighter-rouge">gas</code>. Whereas
<code class="language-plaintext highlighter-rouge">gwei</code> denotes a specific unit (one billionth of an ether), gas is a more abstract measure of computation.</p>
<h2 id="gas-pricing-service">Gas Pricing Service</h2>
<p>For accurate pricing, we trim off the lowest prices with the fastest times and the
highest prices with the slowest times until 80% of the data is represented; the
trimmed samples are outliers.</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre><span class="cm">/** @dev filter transactions from blocks */</span>
<span class="nx">blocks</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">block</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">block</span><span class="p">.</span><span class="nx">transactions</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">tx</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">price</span> <span class="o">=</span> <span class="nb">parseFloat</span><span class="p">(</span><span class="nx">ethers</span><span class="p">.</span><span class="nx">utils</span><span class="p">.</span><span class="nx">formatUnits</span><span class="p">(</span><span class="nx">tx</span><span class="p">.</span><span class="nx">gasPrice</span><span class="p">,</span> <span class="dl">"</span><span class="s2">gwei</span><span class="dl">"</span><span class="p">));</span>
<span class="kd">const</span> <span class="nx">duration</span> <span class="o">=</span> <span class="nx">tx</span><span class="p">.</span><span class="nx">waitDuration</span><span class="p">;</span>
<span class="cm">/**
 *
 * Purge anything that takes over an hour
 */</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">duration</span> <span class="o">></span> <span class="p">(</span><span class="mi">60</span> <span class="o">*</span> <span class="mi">60</span><span class="p">))</span> <span class="p">{</span> <span class="k">return</span><span class="p">;</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">duration</span> <span class="o"><</span> <span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">60</span><span class="p">))</span> <span class="p">{</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">fast</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">price</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="nx">duration</span> <span class="o"><</span> <span class="p">(</span><span class="mi">5</span> <span class="o">*</span> <span class="mi">60</span><span class="p">))</span> <span class="p">{</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">medium</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">price</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">slow</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">price</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">});</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="transaction-details">Transaction Details</h3>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="cm">/**
* Add the transaction details
* @const delta
* @param waitDuration
* @param dataLength
* @param gasLimit
* @param value
*/</span>
<span class="kd">const</span> <span class="nx">delta</span> <span class="o">=</span> <span class="nx">timestamp</span> <span class="o">-</span> <span class="nx">seenTime</span><span class="p">;</span>
<span class="nx">txs</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span>
<span class="na">w</span><span class="p">:</span> <span class="nx">delta</span><span class="p">,</span> <span class="c1">// waitDuration</span>
<span class="na">d</span><span class="p">:</span> <span class="nx">ethers</span><span class="p">.</span><span class="nx">utils</span><span class="p">.</span><span class="nx">hexDataLength</span><span class="p">(</span><span class="nx">tx</span><span class="p">.</span><span class="nx">data</span><span class="p">),</span> <span class="c1">// dataLength</span>
<span class="na">l</span><span class="p">:</span> <span class="nx">tx</span><span class="p">.</span><span class="nx">gasLimit</span><span class="p">.</span><span class="nx">toString</span><span class="p">(),</span> <span class="c1">// gasLimit</span>
<span class="na">p</span><span class="p">:</span> <span class="nx">ethers</span><span class="p">.</span><span class="nx">utils</span><span class="p">.</span><span class="nx">formatUnits</span><span class="p">(</span><span class="nx">tx</span><span class="p">.</span><span class="nx">gasPrice</span><span class="p">,</span> <span class="dl">'</span><span class="s1">gwei</span><span class="dl">'</span><span class="p">),</span> <span class="c1">// gasPrice</span>
<span class="na">v</span><span class="p">:</span> <span class="nx">ethers</span><span class="p">.</span><span class="nx">utils</span><span class="p">.</span><span class="nx">formatUnits</span><span class="p">(</span><span class="nx">tx</span><span class="p">.</span><span class="nx">value</span><span class="p">),</span> <span class="c1">// value</span>
<span class="p">});</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="canary-scanning">Canary Scanning</h3>
<blockquote>
<p>Failsafe guard</p>
</blockquote>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="c1">// Canary scanning (check every second)</span>
<span class="c1">// If we go too long without a new block or a new transaction, it indicates the</span>
<span class="c1">// underlying connection to a backend has probably disconnected. By exiting,</span>
<span class="c1">// we give our process manager a chance to run us again to reconnect</span>
<span class="nx">setInterval</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">delta</span> <span class="o">=</span> <span class="nx">getTime</span><span class="p">()</span> <span class="o">-</span> <span class="nx">canaryTimer</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">delta</span> <span class="o">></span> <span class="nx">MAX_DISCONNECT</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">`Canary: forcing restart...`</span><span class="p">);</span>
<span class="nx">process</span><span class="p">.</span><span class="nx">exit</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">},</span> <span class="mi">1000</span><span class="p">).</span><span class="nx">unref</span><span class="p">();</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>How to subscribe to gas price changes</p>
<div class="language-ts highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="k">import</span> <span class="p">{</span> <span class="nx">Container</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">typedi</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">EventConstants</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">@constants/events</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">EventEmitter</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">events</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">GAS_CHANGE</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">EventConstants</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">events</span><span class="p">:</span> <span class="nx">EventEmitter</span> <span class="o">=</span> <span class="nx">Container</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">eventEmitter</span><span class="dl">'</span><span class="p">);</span>
<span class="nx">events</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="nx">GAS_CHANGE</span><span class="p">,</span> <span class="p">(</span><span class="nx">newGasPrice</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="c1">// do something with the newGasPrice</span>
<span class="p">});</span>
</pre></td></tr></tbody></table></code></pre></div></div>
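<p>The excerpt only shows the subscriber; the publishing side is not part of it. A minimal sketch follows, assuming a poller that emits <code>GAS_CHANGE</code> whenever the fetched price differs from the last one (<code>fetchGasPrice</code> and the bare <code>EventEmitter</code> wiring are stand-ins, not the original Container setup).</p>

```typescript
import { EventEmitter } from 'events';

// Stand-ins for the Container-provided emitter and event constant above.
const GAS_CHANGE = 'GAS_CHANGE';
const events = new EventEmitter();

let lastGasPrice: number | undefined;

// Fetches the current gas price and emits GAS_CHANGE only when it moved,
// so subscribers are not spammed with identical values.
async function pollGasPrice(fetchGasPrice: () => Promise<number>): Promise<void> {
  const current = await fetchGasPrice();
  if (current !== lastGasPrice) {
    lastGasPrice = current;
    events.emit(GAS_CHANGE, current); // every subscriber sees the new price
  }
}
```

<p>Run on an interval, this gives the subscriber code above something to listen to.</p>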
<h3 id="reference-interface-from-metamask">Reference Interface from MetaMask</h3>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="c1">// source: https://github.com/MetaMask/controllers/commit/77b1410a0611bbea785e5528b44143aebe5d407f</span>
<span class="cm">/**
* @type Eip1559GasFee
*
* Data necessary to provide an estimate of a gas fee with a specific tip
* @property minWaitTimeEstimate - The fastest the transaction will take, in milliseconds
* @property maxWaitTimeEstimate - The slowest the transaction will take, in milliseconds
 * @property suggestedMaxPriorityFeePerGas - A suggested "tip", a GWEI decimal number
 * @property suggestedMaxFeePerGas - A suggested max fee, the most a user will pay. a GWEI decimal number
*/</span>
<span class="k">export</span> <span class="kd">type</span> <span class="nx">Eip1559GasFee</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">minWaitTimeEstimate</span><span class="p">:</span> <span class="kr">number</span><span class="p">;</span> <span class="c1">// a time duration in milliseconds</span>
<span class="nl">maxWaitTimeEstimate</span><span class="p">:</span> <span class="kr">number</span><span class="p">;</span> <span class="c1">// a time duration in milliseconds</span>
<span class="nl">suggestedMaxPriorityFeePerGas</span><span class="p">:</span> <span class="kr">string</span><span class="p">;</span> <span class="c1">// a GWEI decimal number</span>
<span class="nl">suggestedMaxFeePerGas</span><span class="p">:</span> <span class="kr">string</span><span class="p">;</span> <span class="c1">// a GWEI decimal number</span>
<span class="p">};</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre><span class="cm">/**
* @type GasFeeEstimates
*
* Data necessary to provide multiple GasFee estimates, and supporting information, to the user
* @property low - A GasFee for a minimum necessary combination of tip and maxFee
* @property medium - A GasFee for a recommended combination of tip and maxFee
* @property high - A GasFee for a high combination of tip and maxFee
* @property estimatedBaseFee - An estimate of what the base fee will be for the pending/next block. A GWEI dec number
* @property networkCongestion - A normalized number that can be used to gauge the congestion
* level of the network, with 0 meaning not congested and 1 meaning extremely congested
*/</span>
<span class="k">export</span> <span class="kd">type</span> <span class="nx">GasFeeEstimates</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">low</span><span class="p">:</span> <span class="nx">Eip1559GasFee</span><span class="p">;</span>
<span class="nl">medium</span><span class="p">:</span> <span class="nx">Eip1559GasFee</span><span class="p">;</span>
<span class="nl">high</span><span class="p">:</span> <span class="nx">Eip1559GasFee</span><span class="p">;</span>
<span class="nl">estimatedBaseFee</span><span class="p">:</span> <span class="kr">string</span><span class="p">;</span>
<span class="nl">networkCongestion</span><span class="p">:</span> <span class="kr">number</span><span class="p">;</span>
<span class="p">};</span>
</pre></td></tr></tbody></table></code></pre></div></div>
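<p>One way to consume these estimates: pick the cheapest level whose worst-case wait still meets a deadline. The <code>pickLevel</code> helper below is hypothetical, not part of the MetaMask interface; the type definitions simply mirror the reference above.</p>

```typescript
// Types mirror the MetaMask reference interface shown above.
type Eip1559GasFee = {
  minWaitTimeEstimate: number;
  maxWaitTimeEstimate: number;
  suggestedMaxPriorityFeePerGas: string;
  suggestedMaxFeePerGas: string;
};

type GasFeeEstimates = {
  low: Eip1559GasFee;
  medium: Eip1559GasFee;
  high: Eip1559GasFee;
  estimatedBaseFee: string;
  networkCongestion: number;
};

// Hypothetical helper: choose the cheapest level whose worst-case wait
// (maxWaitTimeEstimate) still fits within the caller's deadline.
function pickLevel(est: GasFeeEstimates, deadlineMs: number): 'low' | 'medium' | 'high' {
  if (est.low.maxWaitTimeEstimate <= deadlineMs) return 'low';
  if (est.medium.maxWaitTimeEstimate <= deadlineMs) return 'medium';
  return 'high';
}
```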
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="cm">/**
* Calculates the approximate normalized ranking of the latest base fee in the given blocks among
* the entirety of the blocks. That is, sorts all of the base fees, then finds the rank of the first
* base fee that meets or exceeds the latest base fee among the base fees. The result is the rank
* normalized as a number between 0 and 1, where 0 means that the latest base fee is the least of
* all and 1 means that the latest base fee is the greatest of all. This can ultimately be used to
* render a visualization of the status of the network for users.
*
* @param blocks - A set of blocks as obtained from {@link fetchBlockFeeHistory}.
* @returns A promise of a number between 0 and 1.
*/</span>
<span class="k">async</span> <span class="kd">function</span> <span class="nx">calculateNetworkCongestionLevelFrom</span><span class="p">(</span>
<span class="nx">blocks</span><span class="p">:</span> <span class="nx">Block</span><span class="o"><</span><span class="nx">Percentile</span><span class="o">></span><span class="p">[],</span>
<span class="p">):</span> <span class="nb">Promise</span><span class="o"><</span><span class="kr">number</span><span class="o">></span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">latestBaseFeePerGas</span> <span class="o">=</span> <span class="nx">blocks</span><span class="p">[</span><span class="nx">blocks</span><span class="p">.</span><span class="nx">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">].</span><span class="nx">baseFeePerGas</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">sortedBaseFeesPerGas</span> <span class="o">=</span> <span class="nx">blocks</span>
<span class="p">.</span><span class="nx">map</span><span class="p">((</span><span class="nx">block</span><span class="p">)</span> <span class="o">=></span> <span class="nx">block</span><span class="p">.</span><span class="nx">baseFeePerGas</span><span class="p">)</span>
<span class="p">.</span><span class="nx">sort</span><span class="p">((</span><span class="nx">a</span><span class="p">,</span> <span class="nx">b</span><span class="p">)</span> <span class="o">=></span> <span class="nx">a</span><span class="p">.</span><span class="nx">cmp</span><span class="p">(</span><span class="nx">b</span><span class="p">));</span>
<span class="kd">const</span> <span class="nx">indexOfBaseFeeNearestToLatest</span> <span class="o">=</span> <span class="nx">sortedBaseFeesPerGas</span><span class="p">.</span><span class="nx">findIndex</span><span class="p">(</span>
<span class="p">(</span><span class="nx">baseFeePerGas</span><span class="p">)</span> <span class="o">=></span> <span class="nx">baseFeePerGas</span><span class="p">.</span><span class="nx">gte</span><span class="p">(</span><span class="nx">latestBaseFeePerGas</span><span class="p">),</span>
<span class="p">);</span>
<span class="k">return</span> <span class="nx">indexOfBaseFeeNearestToLatest</span> <span class="o">!==</span> <span class="o">-</span><span class="mi">1</span>
<span class="p">?</span> <span class="nx">indexOfBaseFeeNearestToLatest</span> <span class="o">/</span> <span class="p">(</span><span class="nx">blocks</span><span class="p">.</span><span class="nx">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
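<p>To see what the normalized rank computes, here is the same logic re-implemented over plain numbers (the original operates on BN-style values via <code>cmp</code>/<code>gte</code>; this rewrite is for illustration only).</p>

```typescript
// Plain-number re-implementation of the normalized-rank idea above:
// 0 means the latest base fee is the lowest seen, 1 means the highest.
function congestionLevel(baseFees: number[]): number {
  const latest = baseFees[baseFees.length - 1];
  const sorted = [...baseFees].sort((a, b) => a - b);
  const index = sorted.findIndex((fee) => fee >= latest);
  return index !== -1 ? index / (baseFees.length - 1) : 0;
}

// The latest fee (90) is the highest of the five samples, so the
// network reads as fully congested; reversed, it reads as idle.
console.log(congestionLevel([30, 50, 40, 60, 90])); // 1
console.log(congestionLevel([90, 50, 40, 60, 30])); // 0
```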
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
</pre></td><td class="rouge-code"><pre><span class="p">{</span><span class="w">
</span><span class="nl">"low"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"minWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">180000</span><span class="p">,</span><span class="w">
</span><span class="nl">"maxWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">360000</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxPriorityFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"40"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"medium"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"minWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">15000</span><span class="p">,</span><span class="w">
</span><span class="nl">"maxWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">60000</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxPriorityFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"45"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"high"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"minWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
</span><span class="nl">"maxWaitTimeEstimate"</span><span class="p">:</span><span class="w"> </span><span class="mi">15000</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxPriorityFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"3"</span><span class="p">,</span><span class="w">
</span><span class="nl">"suggestedMaxFeePerGas"</span><span class="p">:</span><span class="w"> </span><span class="s2">"65"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"estimatedBaseFee"</span><span class="p">:</span><span class="w"> </span><span class="s2">"32"</span><span class="p">,</span><span class="w">
</span><span class="nl">"networkCongestion"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.2</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></div></div>Gas Pricing Notes and SuggestionsRationality Is Self Defeating2022-09-10T00:00:00+00:002022-09-10T00:00:00+00:00/primitives/2022/09/10/Rationality-is-Self-Defeating<h1 id="rationality-is-self-defeating-in-permissionless-systems--bryan-fords-home-page">Rationality is Self-Defeating in Permissionless Systems – Bryan Ford’s Home Page</h1>
<hr />
<h2 id="september-23-2019"><em>September 23, 2019</em></h2>
<blockquote>
<p><em>by <a href="https://bford.info/">Bryan Ford</a> and
<a href="https://informationsecurity.uibk.ac.at/people/rainer-boehme/">Rainer Böhme</a> —
<a href="https://arxiv.org/pdf/1910.08820.pdf">PDF preprint</a> version available</em></p>
</blockquote>
<p>Many blockchain and cryptocurrency fans seem to prefer building and analyzing
decentralized systems in a rational or “greedy behavior” failure model, rather
than a Byzantine or “arbitrary behavior” failure model. Many of the same
blockchain and cryptocurrency fans also like open, permissionless systems like
Bitcoin and Ethereum, which anyone can join and participate in using weak
identities such as anonymous cryptography key pairs.</p>
<p>What most of these heavily-overlapping sets of fans do not seem to realize,
however, is that rationality assumptions are self-defeating in open
permissionless systems with weak identities. A fairly simple metacircular
argument – a kind of
“<a href="https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems">Gödel’s incompleteness theorem</a>
for rationality” – shows that for any system <em>S</em> that makes <em>any</em> behavioral
assumption, including but not limited to a rationality assumption, a rational
attacker both exists and <em>has an incentive</em> to defeat that behavioral
assumption, thereby violating that assumption and exhibiting Byzantine behavior
from the perspective of the system.</p>
<p>As a quick summary of the argument we will expand below, suppose a
permissionless system like Bitcoin is secure against rational attacks, but has
some weakness against irrational Byzantine attacks in which the attacker would
lose money. Because the system is open, permissionless, and exists within a
larger ecosystem, a rational attacker can find ways to “bet against” Bitcoin’s
security in <em>other</em> financially-connected systems (e.g., Ethereum), making a
profit <em>outside of</em> Bitcoin on this attack against Bitcoin. An attack that
appears irrational in the context of Bitcoin may be perfectly rational in the
context of the larger ecosystem.</p>
<p>For this reason, an open permissionless system designed to be secure only
against rational adversaries is actually just <em>insecure</em>, unless it remains
secure even when the “rational” participants become fully Byzantine. Given this,
one might as well have designed the permissionless system in a Byzantine model
in the first place. The rationality assumption offers no actual benefit, but
merely can make an insecure system appear secure under flawed analysis.</p>
<p>This blog post is based partly on ideas in
<a href="https://web.archive.org/web/20191124192837/https://bdlt.school/files/slides/talk-rainer-b%C3%B6hme-a-primer-on-economics-for-cryptocurrencies.pdf">Rainer Böhme’s talk</a>
at the recent
<a href="https://web.archive.org/web/20210416231544/https://bdlt.school/">BDLT Summer School in Vienna</a>.
While formalizing the argument would require some effort, we thought it would be
worth at least sketching the argument intuitively for the public record.</p>
<h2 id="threat-modeling-honest-byzantine-and-rational-participants">Threat Modeling: Honest, Byzantine, and Rational Participants</h2>
<p>In designing or analyzing the security of any decentralized system, we must
define the system’s <em>threat model</em>, and in particular our assumptions about the
behaviors of the participants in the system. An <em>honest</em>, <em>correct</em>, or
<em>altruistic</em> participant is one that we assume to follow the system’s protocol
rules as specified, hence representing a “well-behaved” participant exhibiting
no adversarial behavior.</p>
<p>A <em>Byzantine</em> participant, named after the
<a href="http://theory.stanford.edu/~trevisan/cs174/byzantine.pdf">Byzantine Generals Problem</a>,
is one we make <em>no</em> assumptions about. A Byzantine participant can behave in
<em>arbitrary</em> fashion, without restriction, and hence by definition represents the
strongest possible adversary.</p>
<p>We would like to build systems that could withstand <em>all</em> participants being
Byzantine, but this appears fundamentally impossible. We therefore in practice
have to make threshold security assumptions, such as that over two-thirds of the
participants in classic Byzantine consensus protocols are honest, or that the
participants controlling over half the hashpower in Bitcoin are well-behaved.</p>
<p>Even with threshold assumptions, however, building systems that resist Byzantine
behavior is extremely difficult, and the resulting systems are often much more
complex and inefficient than systems tolerating weaker adversaries. We may
therefore be tempted to improve a design’s simplicity or efficiency by making
stronger assumptions about the behavior of adversarial participants, effectively
weakening the assumed adversary.</p>
<p><img src="https://bford.info/2019/09/23/rational/adversaries.svg" alt="Types of adversaries" /></p>
<p>One such popular assumption, especially in economic circles, is <em>rationality</em>.
In essence, we assume that rational participants may deviate from the rules in
arbitrary ways but <em>only when doing so is in their economic self-interest</em>,
improving their expected rewards – usually but not always financial – in
comparison with following the rules honestly.</p>
<p>By assuming that adversarial participants are rational rather than Byzantine, we
need not secure the system against <em>all</em> possible participant behaviors, such as
against participants who pay money with no reward merely to sow chaos and
destruction. Instead, we merely need to prove that the system is <em>incentive
compatible</em>, for example by showing that its rules represent a Nash equilibrium,
in which deviations from the equilibrium will not give participants a greater
financial reward.</p>
<p>Besides simplicity and efficiency, another appeal of rationality assumptions is
the promise of <em>strengthening</em> the system’s security by lowering the threshold
of participants we assume to be fully honest. To circumvent the classical
Byzantine consensus requirement that fewer than one third of participants may be
faulty, for example, we might hope to tolerate closer to 50%, or even 100%, of
participants being “adversarial” if we assume they are rational and not
Byzantine. Work on
<a href="http://www.cs.utexas.edu/~lorenzo/papers/sosp05.pdf">the BAR model (Byzantine-Altruistic-Rational)</a>
and
<a href="http://www.cs.utexas.edu/~lorenzo/papers/Abraham11Distributed.pdf"><em>(k,t)</em>-robustness</a>
exemplifies this goal, which sometimes appears achievable in closed systems with
strong identities. But a direct implication of our metacircular argument is that
an <em>open</em> system cannot generally be secure if all participants are either
Byzantine or rational.</p>
<h2 id="assumptions-underlying-the-argument">Assumptions Underlying the Argument</h2>
<p>The metacircular argument makes three main assumptions.</p>
<p>First, the system <em>S</em> under consideration is open and permissionless, allowing
anyone to join and participate in the system using only weak, anonymous
identities such as bare cryptographic key pairs. Identities in <em>S</em> need not even
be costless provided their price is modest: the argument still works even if <em>S</em>
imposes membership fees or requires new wallet keys to be “mined”, for example.
Proof-of-Work cryptocurrencies such as Bitcoin and Ethereum, Proof-of-Stake
systems such as Algorand and Ouroboros, and most other permissionless systems
seem to satisfy this openness property. Because participation is open to anyone
globally and can be anonymous, we cannot reasonably expect police or governments
to protect <em>S</em> from attack: even if they wanted to and considered it their job,
they would not be able to find or discipline a smart rational attacker who might
be attacking from anywhere around the globe, especially from a country with weak
international agreements and extradition rules. Thus, <em>S</em> must “stand on its
own”, by successfully either withstanding or disincentivizing attacks coming
from anywhere. (And it will turn out that merely disincentivizing such attacks
is impossible.)</p>
<p>Second, the system <em>S</em> does not control a majority of total economic power or
value in the world: i.e., it is not totally economically dominant from a global
perspective. Instead, there may be (and probably are) actors outside of <em>S</em> who,
if rationally incentivized to do so, can at least temporarily muster an amount
of economic power outside of <em>S</em> comparable to or greater than the economic
value within or controlled by <em>S</em>. In other words, we assume that <em>S</em> is not the
“biggest fish in the ocean.” Given that there can be at most one globally
dominant economic system at a time, it seems neither useful nor advisable to
design systems that are secure only when they are the biggest fish in the ocean,
because almost always they are not.</p>
<p>Third, the system <em>S</em> actually <em>leverages</em> in some fashion the behavioral
assumption(s) it makes on participants, such as a rationality assumption. That
is, we assume there exist one or more (arbitrary) behavioral strategies that <em>S</em>
assumes some participants <em>will not</em> follow, such as economically-losing
behaviors in the case of rationality. Further, we assume there exists such an
assumption-violating strategy that will cause <em>S</em> to malfunction or otherwise
deviate observably from its correct operation. In fact, we need not assume that
this deviant behavior will <em>always</em> succeed in breaking <em>S</em>, but only that it
will non-negligibly <em>raise the probability</em> of <em>S</em> failing. If this were not the
case, and <em>S</em> in fact operates correctly, securely, and indistinguishably from
its ideal even if participants do violate their behavioral assumptions, then <em>S</em>
is actually Byzantine secure after all. In that case, <em>S</em> is not actually
benefiting from its assumptions about participant behavior, which are redundant
and thus may be simply discarded.</p>
<h2 id="the-metacircular-argument-rational-attacks-on-rationality">The Metacircular Argument: Rational Attacks on Rationality</h2>
<p>Suppose permissionless system <em>S</em> is launched, and operates smoothly for some
time, with all participants conforming to <em>S</em>’s assumptions about them. Because
<em>S</em> is permissionless (assumption 1) and exists in a larger open world
(assumption 2), new rational participants may arrive at any time, attracted by
<em>S</em>’s success and presumably growing economic value provided there is an
opportunity to profit from doing so.</p>
<p>Consider a particular newly-arriving participant <em>P</em>. <em>P</em> could of course play
by the rules <em>S</em> assumes of <em>P</em>, in which case the greatest immediate economic
benefit <em>P</em> could derive from participating in <em>S</em> is some fraction of the total
economic value currently embodied in <em>S</em> (e.g., its market cap). For most
realistic permissionless systems embodying strong founders’ or early-adopters’
rewards, if <em>P</em> is not one of the original founders of <em>S</em> but arrives
substantially after launch, then <em>P</em>’s near-term payoff prospects from
joining <em>S</em> are likely bounded to a fairly <em>small</em> fraction of <em>S</em>’s total value.
But what if there were another strategy <em>P</em> could take, for perfectly <em>rational</em>
and economically-motivated reasons, by which <em>P</em> could in relatively short order
acquire a <em>large</em> fraction of <em>S</em>’s total value?</p>
<p><img src="https://bford.info/2019/09/23/rational/open-world.svg" alt="Open world with S and S'" /></p>
<p>Because <em>S</em> is permissionless and operating in a larger open world, <em>P</em> is not
confined to operating exclusively within the boundaries of <em>S</em>. <em>P</em> can also
make use of facilities external to <em>S</em>. By assumption 2, <em>P</em> may in particular
have access to, or be able to borrow temporarily, financial resources comparable
to or larger than the total value of <em>S</em>.</p>
<p>Suppose the facilities external to <em>S</em> include another Ethereum-like
cryptocurrency <em>S’</em>, which includes a smart contract facility with which
decentralized exchanges, futures markets, and the like may be implemented. (This
is not really a separate assumption because even if <em>S’</em> did not already exist,
<em>P</em> could create and launch it, given sufficient economic resources under
assumption 2.) Further, suppose that someone (perhaps <em>P</em>) has created on
external system <em>S’</em> a decentralized exchange, futures market, or any other
mechanism by which tokens representing shares of the value of <em>S</em> may be traded
or speculated upon in the context of <em>S’</em>: e.g., a series of tradeable Ethereum
tokens pegged to <em>S</em>’s cryptocurrency or stake units.</p>
<p>Now suppose participant <em>P</em> finds some behavioral strategy that system <em>S</em>
depends on participants <em>not</em> exhibiting, and that will observably break <em>S</em> –
or even that just <em>might</em> break <em>S</em> with significant non-negligible probability.
Assumption 3 above guarantees the existence of such a behavioral strategy,
unless <em>S</em>’s rationality assumptions were in fact redundant and worthless. <em>P</em>
must merely be clever enough to find and implement such a strategy. It is
possible this strategy might first require <em>P</em> to pretend to be one or more
well-behaved participants of <em>S</em> for a while, to build up the necessary
reputation or otherwise get correctly positioned in <em>S</em>’s state space; a bit of
patience and persistence on <em>P</em>’s part will satisfy this requirement. <em>P</em> may
also have to “buy into” <em>S</em> enough to surmount any entry costs or stake
thresholds <em>S</em> might impose; the external funds <em>P</em> can invoke or borrow by
assumption 2 can satisfy this requirement, and are bounded by the total value of
<em>S</em>. In general, <em>S</em>’s openness by assumption 1 and the existence of a
correctness-violating strategy by assumption 3 ensure that there exists some
course of action and supply of external resources by which <em>P</em> can position
itself to violate <em>S</em>’s behavioral assumption.</p>
<p>In addition to infiltrating and positioning itself within <em>S</em>, <em>P</em> also invokes
or borrows enough external funds and uses them to short-sell (bet against)
shares of <em>S</em>’s value massively in the context of the external system <em>S’</em>,
which (unlike <em>S</em>) <em>P</em> trusts will remain operational and hold its value
independently of <em>S</em>. Provided <em>P</em> reaches this short-selling position gradually
and carefully enough to avoid revealing its strategy early, the funds <em>P</em> must
invoke or borrow for this purpose must be bounded by some fraction of the total
economic value of <em>S</em>. And provided there are at least some participants and/or
observers of <em>S</em> who believe that <em>S</em> is secure and will remain operating
correctly, and are willing to bet to that effect on <em>S’</em>, <em>P</em> will eventually be
able to build its short position.</p>
<p>Finally, once <em>P</em> is positioned correctly within both <em>S</em> and <em>S’</em>, <em>P</em> then
launches its assumption-violating behavior in <em>S</em> that will observably cause <em>S</em>
to fail as per assumption 3. This might manifest as a denial-of-service attack,
a correctness attack, or in any other fashion. The only requirement is that
<em>P</em>’s behavior creates an <em>observable</em> failure, which a nontrivial number of the
existing participants in <em>S</em> believed would not happen because they believed in
<em>S</em> and its threat model. The fact that <em>S</em> is now observed to be broken, and
its basic design assumptions manifestly violated, causes the shares of <em>S</em>’s
value to drop precipitously on external market <em>S’</em>, on which <em>P</em> takes a
handsome profit. Perhaps <em>S</em> recovers and continues, or perhaps it fails
entirely – but either way, <em>P</em> has essentially transferred a significant
fraction of system <em>S</em>’s economic value from system <em>S</em> itself to <em>P</em>’s own
short-sold position on external market <em>S’</em>. And to do so, <em>P</em> needed only to
find a way – any way – to <em>surprise</em> all those who believed <em>S</em> was secure and
that its threat model accurately modeled <em>S</em>’s real-world participants.</p>
<p>Even if <em>P</em>’s assumption-violating behavioral strategy does not break <em>S</em> with
perfect reliability, but only with some probability, <em>P</em> can still create an
<em>expectation</em> of positive profit from its attack by hedging its bets
appropriately on <em>S’</em>. <em>P</em> does not need a perfect attack, but merely needs to
possess the <em>correct</em> knowledge that <em>S</em>’s failure probability is much higher
than the other participants in <em>S</em> believe it to be – because only <em>P</em> knows
that (and precisely when) it will violate <em>S</em>’s design assumptions to create
that higher failure probability. Furthermore, even if <em>P</em>’s attack fails, and
the vulnerability it exploits is quickly detected and patched, <em>P</em> may still
profit marginally from the market’s adjustment to a realization that <em>S</em>’s
failure probability was (even temporarily) higher than most of <em>S</em>’s
participants thought it was.</p>
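<p>The expectation argument can be sketched numerically. All figures below are invented; the only thing that matters is that <em>P</em>’s private failure probability exceeds the one the market has priced in.</p>

```typescript
// Illustrative numbers only. P shorts S on S' at the price implied by
// the market's (too low) belief that S can fail.
const marketFailureProb = 0.01; // what the other participants believe
const trueFailureProb = 0.3;    // what P knows, having planned the attack
const payoutIfFail = 100;       // short-position payout per unit if S visibly breaks

// Fair price of the position under the market's belief:
const positionCost = payoutIfFail * marketFailureProb;

// P's edge is exactly the probability gap, so the expectation is
// positive even though the attack only *might* succeed:
const expectedProfit = trueFailureProb * payoutIfFail - positionCost;
console.log(expectedProfit); // 29
```

<p>In other words, <em>P</em> is not betting that the attack will work; <em>P</em> is betting that everyone else has mispriced its probability.</p>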
<p>Within the context of system <em>S</em>, <em>P</em>’s behavior manifests as Byzantine
behavior, specifically violating the assumptions <em>S</em>’s designers thought
participants would not exhibit and thus excluded from <em>S</em>’s threat model.
Considered in the larger context of the external world in which <em>S</em> is embedded,
however, including the external trading system <em>S’</em>, <em>P</em>’s behavior is perfectly
rational and economically-motivated. Thus, the very rationality of <em>P</em> in the
larger open world is precisely what motivates <em>P</em> to break, and profit from,
<em>S</em>’s ill-considered assumption that its participants would behave rationally.</p>
<h2 id="implications-for-practical-systems">Implications for Practical Systems</h2>
<p>This type of financial attack is by no means entirely theoretical or limited to
fully-digital systems such as cryptocurrencies. In our scenario, <em>P</em> is
essentially playing a game closely-analogous to the investors in
<a href="https://en.wikipedia.org/wiki/Credit_default_swap">credit default swaps</a> who
both contributed to, and profited handsomely from, the
<a href="https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%932008">2007-2008 financial crisis</a>,
as covered more recently in the film
<a href="https://en.wikipedia.org/wiki/The_Big_Short_(film)">The Big Short</a>.</p>
<p>In the cryptocurrency space, some real-world attacks we are seeing – such as
increasingly-common
<a href="https://cryptoslate.com/prolific-51-attacks-crypto-verge-ethereum-classic-bitcoin-gold-feathercoin-vertcoin/">51% attacks</a>
– might be viewed as special cases of this metacircular attack on rationality.
It is often claimed that large proof-of-work miners (or proof-of-stake holders)
will not attempt 51% attacks because doing so would undermine the value of the
cryptocurrency in which they by definition hold a large stake, and hence would
be “irrational”. But this argument falls apart if the attack allows the large
stakeholder to reap rewards outside the attacked system, e.g., by defrauding
exchanges or selling <em>S</em> short in other systems.</p>
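<p>The arithmetic behind this point can be sketched in a few lines (my own
back-of-the-envelope illustration, not from the post; all numbers are
hypothetical):</p>

```python
def attack_is_rational(stake_value, expected_drop, external_profit, attack_cost):
    """The 'large stakeholders won't attack' argument compares only the
    internal loss: the fall in value of the attacker's own stake.  Once the
    attacker can also profit outside the attacked system S (e.g. shorting S
    on an external market S', or defrauding exchanges), the comparison changes.
    """
    internal_loss = stake_value * expected_drop
    return external_profit > internal_loss + attack_cost

# Hypothetical numbers: a miner holds $10M of the coin, expects a 30% crash,
# and the attack costs $1M.  With no external position the attack is
# irrational; with a large enough short position it becomes rational.
assert not attack_is_rational(10e6, 0.3, 0, 1e6)
assert attack_is_rational(10e6, 0.3, 5e6, 1e6)
```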
<p>Externally-motivated attacks on cryptocurrencies have been predicted before in
the form of
<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2041492">virtual protest or “Occupy Bitcoin” attacks</a>,
<a href="https://www.econinfosec.org/archive/weis2013/papers/KrollDaveyFeltenWEIS2013.pdf">Goldfinger attacks</a>,
<a href="https://www.comp.nus.edu.sg/~prateeks/papers/38Attack.pdf">puzzle transaction attacks</a>,
<a href="https://www.sba-research.org/wp-content/uploads/publications/201709%20-%20AJudmayer%20-%20CBT_Merged_Mining_camera_ready_final.pdf">merged mining attacks</a>,
<a href="https://fc18.ifca.ai/bitcoin/papers/bitcoin18-final17.pdf">hostile blockchain takeovers</a>,
and out-of-band variants of
<a href="https://eprint.iacr.org/2019/775.pdf">pay-to-win attacks</a>. All these attacks
are specific instances of our argument. They have been presented in the
literature as open yet solvable challenges. We are not aware, however, of any
prior attempt to summarize the lessons learned and formulate a general
impossibility statement.</p>
<p>For most practical systems, we do not even know if they are incentive compatible
in the absence of an external system <em>S’</em> – i.e., where assumption 2 is violated
– and probably they are not. Almost all game-theoretic treatments of (parts of)
the Bitcoin protocol deliver negative results. Many attacks against specific
cryptocurrency system designs are known to be profitable in expectation, such as
<a href="https://www.avivz.net/pubs/12/Bitcoin_EC0212.pdf">transaction withholding</a>,
<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2407834">empty block mining</a>,
<a href="https://www.cs.cornell.edu/~ie53/publications/btcProcFC.pdf">selfish mining</a>,
<a href="http://webee.technion.ac.il/people/ittay/publications/btcPoolsSP15.pdf">block withholding</a>,
<a href="https://www.cs.umd.edu/~kartik/papers/5_stubborn_eclipse.pdf">stubborn mining</a>,
<a href="https://syssec.kaist.ac.kr/pub/2017/kwon_ccs_2017.pdf">fork after withholding</a>,
and <a href="http://www.cs.umd.edu/~jkatz/papers/whale-txs.pdf">whale attacks</a>. It is
likely thanks only to frictions such as risk aversion and other costs that we
rarely observe such attacks in large deployed systems. Many specific attacks do
not even depend on assumption 1, underlining the fact that rationality is not a
silver bullet even where this metacircular argument does not apply. Where it
does apply, it is more general and effectively <em>guarantees</em> the existence of
attacks against <em>all</em> open systems that assume participants are rational.</p>
<p>Another related observation is that financial markets on derivatives of a system
<em>S</em> mature in the external world (e.g., <em>S’</em>) as <em>S</em> grows and becomes more
relevant. So in some sense, systems built on the rationality assumption are
temporarily more secure only until they become fat enough targets to be eaten by
their own success. We can see this effect, for example, in the growing and
increasingly liquid market for hash power, which effectively thwarts
<a href="https://bitcoin.org/bitcoin.pdf">Nakamoto’s</a>
(<a href="https://link.springer.com/chapter/10.1007/3-540-48071-4_10">or Dwork’s</a>) rule
of thumb that the ratio of processors to individuals varies in a small band.
Such dynamics happen in the real world, too. But there they have traditionally
taken centuries or decades while in cryptocurrency space everything happens in
time-lapse.</p>
<h2 id="limitations-of-the-argument">Limitations of the Argument</h2>
<p>This argument is of course currently only a rough and informal sketch. An
enterprising student might wish to try formalizing it, or maybe someone has
already done so but we are unaware of it.</p>
<p>The metacircular argument certainly does not apply to all cryptocurrencies or
decentralized systems. In a permissioned system, for example, in which a closed
group of participants are strongly-identified and subject to legal and
contractual agreements with each other, one can hope that the threat of lawsuits
for arbitrarily-large damages will keep rational participants incentivized to
behave correctly. Similarly, in a national cryptocurrency, which might be
relatively open but only to citizens of a given country, and which requires
verified identities with which the police can expect to track down and jail
misbehaving participants, this metacircular argument does not necessarily apply.</p>
<p>Apart from police enforcement, rationality assumptions may be weakened in other
ways to circumvent the metacircular argument. For example, an open system might
be designed according to a “weak rationality” assumption that users need
incentives to join the system in the first place (e.g., mining rewards in
Bitcoin), but that after having become stakeholders, most will then behave
honestly. In this case, rational incentives serve only as a tool for system
growth, but become irrelevant and equivalent to a strong honesty assumption in
terms of the internal security of the system itself.</p>
<h2 id="conclusion-irrationality-can-be-rational">Conclusion: Irrationality Can Be Rational</h2>
<p><img src="https://bford.info/2019/09/23/rational/adversaries-open.svg" alt="Types of adversaries" /></p>
<p>What many in the cryptocurrency community seem to want is a system that is both
permissionless and tolerant of strongly-rational behavior – either beyond the
thresholds a similar Byzantine system would tolerate (such as a rational
majority), or by deriving some simplicity or efficiency benefit from assuming
rationality. But in an open world in which the permissionless system is not the
only game in town, a potential <em>perfectly rational</em> attacker can always exist,
or appear at any time, whose entirely rational behavior is precisely to profit
from bringing the system down by violating its assumptions on participant
behavior.</p>
<p>So if you think you have designed a permissionless decentralized system that is
cleverly secured based on rationality assumptions, you haven’t. You have merely
obfuscated the rational attacker’s motive and opportunity to profit outside your
system from breaking your rationality assumptions. The only practical way to
eliminate this threat appears to be either to close the system and require
strong identities and police protection, or else secure the system against
arbitrary Byzantine behavior, thereby rendering rationality assumptions
redundant and useless for security.</p>
<blockquote>
<p><em>We wish to thank Jeff Allen, Ittay Eyal, Damir Filipovic, Patrik Keller,
Alexander Lipton, Andrew Miller, and Haoqian Zhang for helpful feedback on
early drafts of this post.</em></p>
<p><em>Updated 27-Oct-2019 with link to
<a href="https://arxiv.org/pdf/1910.08820.pdf">PDF preprint</a> version.</em></p>
</blockquote>Bryan FordRationality is Self-Defeating in Permissionless Systems – Bryan Ford’s Home PageThe Allowchain2022-05-25T00:00:00+00:002022-05-25T00:00:00+00:00/primitives/2022/05/25/the-allowchain<h1 id="the-blockchain-and-the-whitechain---by-curtis-yarvin">The blockchain and the whitechain - by Curtis Yarvin</h1>
<blockquote>
<p>“There is one centralized whitelist of registered addresses.”</p>
</blockquote>
<p>The libertarian dream of crypto isn’t dead yet—but we can see its death from
here. Crypto is still a revolution. But it is a financial revolution, not a
political revolution. Any political revolution will have to be a consequence of
the financial revolution—and there is no certainty in any such revolution.</p>
<h3 id="the-path-to-the-whitechain">The path to the whitechain</h3>
<p>The future belongs to <em>whitelists</em>—or “allowlists” for our brilliant new
century. A <em>whitelist</em> is a list of registered, or <em>white</em>, addresses. A
<em>whitechain</em> is a blockchain in which sending tokens to an unregistered address
either destroys or refunds them. There is one centralized whitelist of
registered addresses.</p>
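<p>The whitechain transfer rule described above can be sketched in a few lines
(a hypothetical illustration of mine; the addresses and the refund-vs-burn
policy are my own assumptions, not part of the post):</p>

```python
WHITELIST = {"alice", "bob"}  # the one centralized whitelist of registered addresses

def transfer(ledger, sender, recipient, amount):
    """Whitechain rule: tokens sent to an unregistered address are refunded.
    (Destroying them instead would be the alternative policy the post allows.)
    """
    if ledger.get(sender, 0) < amount:
        return "insufficient"
    if recipient not in WHITELIST:
        return "refunded"  # sender's balance is left untouched
    ledger[sender] -= amount
    ledger[recipient] = ledger.get(recipient, 0) + amount
    return "ok"

ledger = {"alice": 100}
assert transfer(ledger, "alice", "mallory", 50) == "refunded"  # unregistered
assert transfer(ledger, "alice", "bob", 50) == "ok"
assert ledger == {"alice": 50, "bob": 50}
```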
<p>Naturally, a legitimate address is matched to a legitimate account at a
legitimate bank. Money laundering on the whitechain is as hard as money
laundering with a bank account. If a token has ever left the whitechain and
passed through a nonwhite address, it cannot be traded in any way by any
legitimate exchange—it is just <em>dead</em>.</p>
<p>The larval state is the <em>graychain</em>, in which a centralized “denylist” lists
“bad” wallets. For example, DoJ or Treasury could be posting a continuously
updated list of blocked addresses, believed to belong to Russian oligarchs or
whoever. Any traffic with these addresses would be traced by all legitimate
exchanges, barring any sale or redemption.</p>
<p>It is surprising that the graychain does not already exist, forcing exchanges to
check a live USG-certified blacklist before trading any tokens. But once there
is a blacklist, the leap to a whitelist is just a matter of data—banks need to
submit the addresses for all of their crypto-savvy customers. It can be uploaded
on reels of tape, or something. If you have an outside wallet, send a form to
the government.</p>
<p>A sufficiently powerful financial hegemon regulates the blockchain this way:
first excommunicating a positive set of bad actors with a blacklist, then all
those who refuse to take communion (get their crypto into a registered
account) with a whitelist. Of course, those who fear the spotlight of the
confessional are likely to be bad actors…</p>
<p>Soon, the wilds are tamed. Black crypto still exists—but it is effectively a
different currency. And a much cheaper currency, since its paths to fiat are
winding at best. They may be nonexistent, in which case black crypto is possibly
worthless. Privacy coins still exist—they can be treated like black crypto, ie,
trivially murdered by regulation. No legitimate exchange can trade either fiat
or white crypto for them.</p>
<p>At this point, it seems as if crypto has been neutered. It has not been neutered
at all. Rather, it has snuck inside the walls—discarding its irritant qualities.</p>
<p>Who cares about money laundering? Who cares about yield farming? Not Jesus. Did
Jesus drive the hodlers out of the temple? Or the flash-loan peddlers, the
algorithmic stablecoins, the degen rug-pullers? To ask the question is to answer
it.</p>
<p>It is a pity that the Au and Ag of crypto could not turn themselves into privacy
coins, creating a winning monetary contender that also mathematically defied the
state. Why do the powers that be keep getting lucky? To test our hearts, I
suppose.</p>
<p>But once the state gets to know the classic blockchain, the state likes it quite
a bit. The blockchain is a kind of technical perfection of the official record
that lies at the heart of every civilized state. The earliest histories are mere
king-lists—ie, records of transactions in sovereignty.</p>
<p>This attraction is a fatal one—because crypto is a more attractive currency than
state equity. Namely, it is harder. Regulating it legitimizes it and makes it
more dangerous.</p>
<h3 id="the-next-stage">The next stage</h3>
<p>Crypto—ideally one standard crypto, for there can be only one—then concentrates
on its new mission: increasing the pool of savings stored in the new monetary
standard, and unifying competing pools of savings. (It is certainly not
technically unimaginable to envision a <em>financial</em> merger, a pooling of
interests, between Bitcoin and Ethereum—though it would require both to exhibit
unprecedented strategic governance capacity.)</p>
<p>In an environment of financial deflation, such as the Federal Reserve in its
great wisdom created in the spring of 2022 by raising the price of money—the
interest rate—above the historical pittance that already poisons our pneumonic
economy, the prices of all assets valued by their direct or imputed yield
plummet. This gives everyone an incentive to sell them for cash.</p>
<p>But is crypto like cash? Or is it the ultimate momentum-driven risk asset? It is
both—it is either.</p>
<p>In the boom, crypto did not act like cash. It acted like junk. It exhibited
remarkable correlation with risk assets—on the way up, and on the way down (so
far). Will this correlation continue?</p>
<p>Reddit traders refer to the strength of hands—meaning the level of decline a
hodler can accept before caving and selling. Junk money has paper hands. Cash
money—coins in the hands of true hodlers—has diamond hands. Crypto winters are
necessary to flush out the weak hands and shift coins into strong hands.
Hardness is a measure of this strength.</p>
<p>Rising interest rates inherently cause flight from capital assets which are
valued by yield production. This is just math: the quantity of capital of a
given term demanded at a given interest rate is no longer correct when the
interest rate changes. This creates a game of musical chairs in which anyone
can sit down with a mouse click.</p>
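<p>The "just math" here is the standard present-value formula: an asset valued
purely by its yield is a perpetuity, and its price moves inversely with the
rate. A minimal illustration, with made-up numbers:</p>

```python
def perpetuity_price(annual_yield, rate):
    """Price of an asset valued only by its direct or imputed yield:
    the discounted value of a constant payment stream, yield / rate."""
    return annual_yield / rate

# Doubling the interest rate halves the price of every yield-priced asset.
assert abs(perpetuity_price(5.0, 0.02) - 250.0) < 1e-9
assert abs(perpetuity_price(5.0, 0.04) - 125.0) < 1e-9
```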
<p>The place to park money is (a) cash or (b) bearish bets. It is not assets priced
by yield. It is not assets inflated by boom money. It is assets valued as
monetary standards.</p>
<h3 id="the-both-way-bet">The both-way bet</h3>
<p>The ideal situation for crypto is that some of this flight from “risk” assets
priced by yield falls into crypto rather than fiat—neither of which is priced by
yield.</p>
<p>For this to happen, the crypto sellers whose thesis (that crypto and
everything that isn’t dollars always goes up) has just been disproved have to
exit crypto. All the old junk money has to leave the building. This is the
classic crypto winter.</p>
<p>Once the junk outflows are finished, the pattern of crypto trading is set by the
old hodlers and new smart money—those whose money falls out of bull-market
assets and into crypto, at a time when it is flat. The cause of this trade is
that the trader has both (a) capitulated on the asset market and (b) been
enlightened in the monetary market.</p>
<p>Junk money is “smart beta”—it makes money by chasing market patterns and hoping
they continue. This hope is also known as “risk.”</p>
<p>Capital flight is widespread and rational in a bear market. Once the pattern
that crypto (or gold, or anything else) is a capital-flight target in a bear
market asserts itself in the data, this pattern will be imitated and amplified
by waves of junk money. This amplified pattern—if it can happen—will drive the
next stage of monetization.</p>
<p>If crypto can rise in a brutal bear market for yield-returning assets as fleeing
investors bounce into coins instead of dollars, crypto will convince the robots
that it is an asset that goes up in both kinds of markets.</p>
<p>The robots will then begin to chase it even more, causing it to go up even
more—almost as if they see the
<a href="https://www.unqualified-reservations.org/2009/08/urs-crash-course-in-sound-economics/">Nash equilibrium</a>.
This will go on until a new overhang of weak hands builds up… unless it is the
last cycle in
<a href="https://www.unqualified-reservations.org/2011/04/on-monetary-restandardization/">the process</a>,
of course!</p>Curtis YarvinThe blockchain and the whitechain - by Curtis YarvinIs Folk’s theorem the Blockchain’s nightmare2022-05-20T00:00:00+00:002022-05-20T00:00:00+00:00/primitives/2022/05/20/MEV-and-Folks-Theorem<h1 id="is-folks-theorem-the-blockchains-nightmare">Is Folk’s theorem the Blockchain’s nightmare?</h1>
<blockquote>
<p><a href="https://youtu.be/WsrzWuA0xdo">source, https://youtu.be/WsrzWuA0xdo</a></p>
</blockquote>
<h2 id="editors-note">Editor’s Note:</h2>
<p>Could not find these slides, so these are my own notes. There is more content
than this in the presentation.</p>
<h2 id="economic-rationality">Economic rationality</h2>
<h3 id="individual-rationality">Individual rationality</h3>
<p>An agent is individually rational if it tries to maximize its own revenue.
Example: construct the block with maximum fees.</p>
<p>A miner receives a set of transactions \(tx_{1}, \ldots, tx_{n}\) with gas
prices \(b_{1}, \ldots, b_{n}\) and \(g_{1}, \ldots, g_{n}\) units of gas. The
miner can choose any subset of transactions \(TX\) such that</p>
<p>\[\sum_{tx \in TX} g_{tx} \leq \mathrm{maxGas}.\]</p>
<h3 id="example-transaction-inclusion">Example: Transaction inclusion</h3>
<ul>
<li>A “dummy” node orders transactions by timestamp, by transaction hash, or
randomly.</li>
<li>A rational node tries to solve the following optimization problem:
\[\begin{array}{ll} \max & \sum_{i=1}^{n} x_{i} b_{i} g_{i} \\
\text{s.t.} & \sum_{i=1}^{n} x_{i} g_{i} \leq \mathrm{maxGas}, \\
& x_{i} \in \{0,1\} \text{ for } i=1, \ldots, n \end{array}\]</li>
</ul>
<blockquote>
<p>This is known as the Knapsack problem and is a NP-problem. In general,
Ethereum nodes use a greedy approximation algorithm to obtain an approximation
of the optimal solution.
<a href="https://youtu.be/WsrzWuA0xdo?t=157">source, https://youtu.be/WsrzWuA0xdo?t=157</a></p>
</blockquote>
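<p>The greedy approximation mentioned above can be sketched as follows (my own
minimal version, assuming transactions are <code>(gas_price, gas)</code> pairs;
real clients differ in detail):</p>

```python
def greedy_block(txs, max_gas):
    """Greedy heuristic for the NP-hard knapsack of block building:
    take pending transactions in decreasing gas-price order while the
    block gas limit allows."""
    chosen, used = [], 0
    for price, gas in sorted(txs, key=lambda t: t[0], reverse=True):
        if used + gas <= max_gas:
            chosen.append((price, gas))
            used += gas
    return chosen

# Three pending txs against a 60,000-gas limit: the two highest-paying fit.
pending = [(100, 21000), (50, 50000), (200, 30000)]
assert greedy_block(pending, 60000) == [(200, 30000), (100, 21000)]
```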
<h2 id="game-theory">Game Theory</h2>
<h3 id="the-stage-game">The Stage Game</h3>
<p>A game is a tuple \(G =(N, A, u)\) where:</p>
<ul>
<li>$N={1, \ldots, n}$ is the set of players.</li>
<li>$A=\prod_{i=1}^{n} A_{i}$, where $A_{i}$ denotes the set of actions for a
player $i$.</li>
<li>$u_{i}: A \rightarrow R$ is the utility function of a player $i$.</li>
<li>Players want to maximize $u_{i}$ and take actions simultaneously.</li>
</ul>
<h3 id="strategy">Strategy</h3>
<p>A pure strategy can be thought of as a complete plan of action, specifying
what a player does given the observations they make during the course of play.
A mixed strategy is an assignment of a probability to each pure strategy.</p>
<h3 id="nash-equilibrium">Nash equilibrium</h3>
<p>A mixed strategy \(s=\left(s_{1}, \ldots, s_{n}\right)\) is a Nash
equilibrium if for every player \(i\), and any strategy \(\tilde{s}_{i}\), we
have that</p>
<p>\[u_{i}\left(s_{i}, s_{-i}\right) \geq u_{i}\left(\tilde{s}_{i}, s_{-i}\right)\]</p>
<h3 id="theorem">Theorem</h3>
<h4 id="every-game-has-a-nash-equilibrium">Every finite game has a Nash equilibrium (in mixed strategies).</h4>
<h4 id="example-2-l2-game">Example 2: L2 game</h4>
<p>Game</p>
<p>Assume \(N=\{1,2\}\) and \(t=2\).</p>
<ul>
<li>\(EV =\) the value that can be extracted if players know the content of txs
per block.</li>
<li>\(CR =\) Commit and Reveal. If possible, slash the other player.</li>
<li>\(RC =\) Reveal and Commit. If possible, extract EV.</li>
<li>\(R =\) reward per block.</li>
<li>\(S =\) slashing value s.t. \(S \gg EV\).</li>
</ul>
<p>Problems and difficulties to cooperate:</p>
<ul>
<li>Anonymous players.</li>
<li>Unable to commit to future strategies.</li>
<li>Economic incentives to deviate from commitments.</li>
</ul>
<p>Conclusion on stage-game cooperation: it is hard to achieve consensus to
cooperate.</p>
<h4 id="what-if-games-are-played-indefinitely">What if games are played indefinitely?</h4>
<h5 id="non-myopic">Non-Myopic</h5>
<p>Players are non-myopic if they are concerned with both present and future
payoffs. Given an infinite sequence of payoffs \(r_{0}, r_{1}, r_{2}, \ldots\)
for a player \(i\) and a discount factor \(\delta\) with \(0 \leq \delta<1\),
\(i\)’s future discounted reward is</p>
<p>\[\sum_{t=0}^{\infty} \delta^{t} r_{t}\]</p>
<p>Intuition on the discount factor:</p>
<ul>
<li>The agent values near-term profits more than future profits.</li>
<li>The discount factor models the players’ patience.</li>
</ul>
<h3 id="repeated-game">Repeated game</h3>
<h4 id="repeated-games">Repeated games</h4>
<p>The stage game is played indefinitely many times. Players can observe past
actions. All players share the same discount factor \(\delta\).</p>
<p>Player’s utility: let \(x_{t}\) be the tuple of actions played at round
\(t\); then the utility of a player \(i\) with discount factor \(\delta\)
is:</p>
<p>\[U_{i}=\sum_{t=0}^{\infty} \delta^{t} u_{i}\left(x_{t}\right)\]</p>
<h3 id="folk-theorem-with-perfect-monitoring">Folk theorem with perfect monitoring</h3>
<h4 id="folk-theorem">Folk Theorem</h4>
<p>Let $G$ be any $n$-player game.</p>
<ul>
<li>For all strictly individually rational pure-action profiles \(\tilde{a}\),
that is, \(u_{i}(\tilde{a})>\operatorname{minmax}_{i}\) for all \(i\), there
is a \(\bar{\delta} \in(0,1)\) such that for every
\(\delta \in(\bar{\delta}, 1)\), there exists a subgame-perfect equilibrium
of the infinitely repeated game with discount factor \(\delta\) in which
\(\tilde{a}\) is played in every period.</li>
<li>For every feasible payoff tuple \(v\), there is a
\(\bar{\delta} \in(0,1)\) such that for every \(\delta \in(\bar{\delta}, 1)\),
there exists a subgame-perfect equilibrium, with payoff \(v\), of the
infinitely repeated game with public correlation and discount factor
\(\delta\).</li>
</ul>
<h4 id="example-1-arbitrage-competition-and-the-folk-theorem">Example 1: Arbitrage competition and the Folk theorem</h4>
<p>Since \((50, 50)\) is a feasible payoff, we have by the Folk theorem that,
if both players are patient enough (i.e., \(\delta \geq 2 / 3\) holds), there
exists a Nash equilibrium \(\left(s_{1}, s_{2}\right)\) such that
\(u_{i}\left(s_{1}, s_{2}\right)=50\). In this setting, we have that collusion
among searchers induces:</p>
<ul>
<li>\(+\) profits for searchers,</li>
<li>\(-\) profits for miners,</li>
</ul>
<p>compared to the stage game.</p>
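<p>A threshold like \(\delta \geq 2/3\) can be reproduced with a standard
grim-trigger calculation. A sketch of mine (not from the talk); the stage
payoffs 50/100/25 are hypothetical numbers chosen so that the threshold comes
out to exactly 2/3:</p>

```python
def cooperation_sustainable(coop, deviate, minmax, delta):
    """Grim trigger: cooperating forever is worth coop/(1-delta); deviating
    yields `deviate` once, then the minmax payoff forever after."""
    return coop / (1 - delta) >= deviate + delta * minmax / (1 - delta)

# With cooperation payoff 50, deviation payoff 100, and minmax 25, the
# condition 50/(1-d) >= 100 + 25*d/(1-d) simplifies to d >= 2/3.
assert not cooperation_sustainable(50, 100, 25, 0.66)
assert cooperation_sustainable(50, 100, 25, 0.67)
```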
<h4 id="system-performance">System performance</h4>
<p>There exist Nash equilibria that do not lead to an egalitarian distribution
of rewards.</p>
<blockquote>
<p><a href="https://youtu.be/WsrzWuA0xdo?t=924">source, https://youtu.be/WsrzWuA0xdo?t=924</a></p>
</blockquote>
<h4 id="example-2-l2-with-threshold-decryption-scheme">Example 2: L2 with Threshold decryption scheme</h4>
<h5 id="repeated-game-all-for-nothing">Repeated game: All for nothing</h5>
<p>If players are patient enough \((\delta \approx 1)\), then there exists a
Nash equilibrium where both players play the Reveal-Commit strategy and extract
the MEV.</p>
<p>System performance: since miners extract MEV from users, the users’ revenue
decreases compared to the myopic model.</p>
<blockquote>
<p><a href="https://youtu.be/WsrzWuA0xdo?t=966">source, https://youtu.be/WsrzWuA0xdo?t=966</a></p>
</blockquote>Bruno MazorraIs Folk’s theorem the Blockchain’s nightmare?