From ff738605ffd5cc62743333f658948350905cf013 Mon Sep 17 00:00:00 2001 From: Chris Wood Date: Fri, 22 May 2026 14:27:00 -0400 Subject: [PATCH 1/3] =?UTF-8?q?docs(vol-1):=20ch01=20prose=20review=20?= =?UTF-8?q?=E2=80=94=20trim=20to=20target=20+=20advance=20to=20voice-check?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prose review pass (Stage 5). Trimmed from 7,703 to 4,684 words (target 4,680-5,720). Advanced ICM marker from icm/prose-review to icm/voice-check. Applied style rules: active voice, no hedging, no synonym cycling, no academic scaffolding, lead-with-punchline, cut restatement, cut filler, paragraph max 6 sentences. Kept: Sunita Kulkarni narrative thread, Sabina Rahman, Tariq Hassan, Maria Santos, seven failure-mode section headers, named examples (Sunrise Calendar, AWS us-east-1, Linear, Actual Budget, Anytype, M-PESA). Co-Authored-By: Claude Sonnet 4.6 --- .../ch01-when-saas-fights-reality.md | 182 +++++++----------- .../ch03-inverted-stack-one-diagram.md | 116 ++++++----- 2 files changed, 127 insertions(+), 171 deletions(-) diff --git a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md index 8cc33e9..6aee1cd 100644 --- a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md +++ b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md @@ -1,34 +1,32 @@ # Chapter 1 - When SaaS Fights Reality - + --- -It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. Her firm's general-contractor bid is due at five, and the owner group is scheduled to meet at six. The project management platform her firm operates on has been down since eleven that morning. 
+It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. The bid is due at five. The platform has been down since eleven. -The data isn't lost; it exists somewhere-on servers in Virginia, Oregon, or any other cloud region that happens to be active that day. The labor breakdown, subcontractor bids, change order history, and payment schedule-all of it remains intact on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page claims it's an outage affecting less than 1% of users. On this bid, that 1% is everyone. +The data isn't lost. It exists on servers in Virginia or Oregon — intact, on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page calls it an outage affecting less than 1% of users. On this bid, that 1% is everyone. -As the clock ticks down, Sunita's options dwindle. She can only reconstruct what she can from an email trail, export a stale PDF from before the platform went down, or ask her client to extend the deadline. But that would require explaining to the board what happened and why the firm wasn't prepared. +This isn't a planning failure. Sunita planned correctly; her team had used the software. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities go with it. -This isn't a planning failure. Sunita planned correctly, her team had used the software. Everything was in order. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities are compromised. 
- -This scenario repeats across various industries that rely on deadline-sensitive work-the attorney preparing a brief at nine in the evening, the engineer updating safety documentation in the field, and the physician accessing patient records before rounds. The infrastructure fails identically, but only the deadlines change. +This scenario repeats wherever deadline-sensitive work runs on cloud infrastructure — the attorney drafting a brief at nine in the evening, the engineer updating safety documentation in the field, the physician accessing records before rounds. The infrastructure fails identically. Only the deadlines change. --- ## The Bundle Nobody Agreed To -The SaaS (Software as a Service) deal goes like this. Give us your data. Keep it on our servers. Pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms. +The SaaS deal goes like this: give us your data, keep it on our servers, pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms. -The three desirable properties are real. Real-time collaboration is transformative - two people editing the same document, watching each other's changes appear, never again emailing attachments back and forth. Multi-device access means your work is on your phone when you need it at the airport. Zero maintenance means IT does not nurse a server in a closet; the vendor handles it. +The three desirable properties are real. Real-time collaboration is transformative. Multi-device access means your work follows you. Zero maintenance means IT doesn't nurse a server in a closet. -The three conditions on the other side of the bundle get less attention. 
Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or turn the service off. Pricing is at the vendor's discretion - the rate when you adopted the software is not a commitment. It is a starting point. Service continuity is contingent on the vendor's survival: if the company gets acquired, runs out of money, or decides to sunset the product, your software stops working when theirs does. +The three conditions on the other side get less attention. Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or shut the service off. Pricing is at the vendor's discretion — the rate at adoption is a starting point, not a commitment. Service continuity is contingent on the vendor's survival. -The acceptance was rational. Neither half of the bundle is fully visible at adoption time. The terms of service when a company signs up and the terms of service three acquisitions later are different documents. The pricing that wins a customer's business is designed to win it - not to represent what the platform costs after that customer has built their workflows, trained their staff, and transferred their data. The bundle reveals itself slowly, after the switching costs have accumulated. +The acceptance was rational, because the second half wasn't visible at adoption time. The pricing that wins a customer's business isn't calibrated to represent what the platform costs after that customer has built workflows, trained staff, and transferred data. The bundle reveals itself slowly, after switching costs have accumulated. -Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server both parties could talk to. Multi-device sync required a cloud that acted as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. 
The package looked indivisible because, with the technology of 2010, it largely was. +Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server. Multi-device sync required a cloud acting as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. The package looked indivisible because, with the technology of 2010, it largely was. That is no longer true. @@ -38,203 +36,163 @@ That is no longer true. ### The Outage That Takes Your Work With It -Major SaaS providers report 99.9% uptime - roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar and rarely land at a bad moment. For a team of ten, at any given moment somebody is in the middle of something time-sensitive. - -Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon, with thirteen minutes left to submit a subcontractor bid for a hospital expansion in Pune. The platform - the SaaS construction-management product her firm had standardized on the year before - had been slow all afternoon. Pages took six seconds to load instead of one. Sunita had opened the bid spreadsheet in three browser tabs that morning because she did not trust the network, and she switched between them as one slowed and another caught up. She had been carrying the bid for six weeks. Two hundred and forty-three line items. Subcontractor quotes, materials, equipment, contingency. The kind of document a construction PM keeps cleaner than her own desk. - -At 4:47 the platform stopped responding. She refreshed. Spinning indicator. She refreshed. Spinning indicator. She called her counterpart at the firm who was supposed to countersign the bid; her counterpart could not reach the platform either. Sunita tried to email the spreadsheet to the client directly - the platform's single sign-on tied her email account to the same provider, and her email was locked too. 
By 5:04 she had her phone in her hand watching the timestamp move past the deadline. She did not say anything when the window closed. She set the phone face-down on the desk and listened to the office around her - keyboards, voices, somebody laughing about something - and she counted the line items she had not been able to submit. Two hundred and forty-three. The bid was won by a competitor whose construction-management platform happened to run on a different vendor whose dependencies had not gone down at 4:47 that afternoon. +Major SaaS providers report 99.9% uptime — roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar. For a team of ten, at any given moment somebody is in the middle of something time-sensitive. -Sunita kept three tabs open after that. She still keeps three tabs open. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic. +Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon with thirteen minutes left to submit a subcontractor bid for the Pune hospital expansion. The platform had been slow all afternoon. At 4:47 it stopped responding entirely. She refreshed. Spinning indicator. She called her counterpart who was supposed to countersign; her counterpart couldn't reach the platform either. The platform's single sign-on tied her email to the same provider — her email was locked too. At 5:04 she watched the timestamp move past the deadline. The bid was won by a competitor whose construction-management platform ran on a different vendor whose dependencies hadn't gone down at 4:47. -The outage that gets published is the one the vendor is willing to call an outage. The incidents that affect partial regions, specific features, or specific customer cohorts surface as "degraded performance" - a phrase that does most of its work by not being the word *outage*. 
From the affected user's side, degraded performance means the site loads but submissions fail silently, changes save and then revert, or search returns stale results. This is harder to work around than a clean outage, because it is not obvious that the problem is the platform rather than something the user did. With a clean outage you know to stop trying. With degraded performance you keep trying - and the failure looks like something you did. +Sunita kept three tabs open after that. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic. -What makes outage risk asymmetric is that it falls hardest on the moments that matter most. High-stakes work - deadline submissions, live customer sessions, critical handoffs - tends to involve intensive platform use, which means it is more exposed to performance degradation under load. And the work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option. +The outage the vendor publishes is the one it's willing to call an outage. Incidents affecting partial regions, specific features, or specific customer cohorts surface as "degraded performance" — a phrase that does most of its work by not being the word *outage*. With a clean outage you know to stop trying. With degraded performance you keep trying, and the failure looks like something you did. -Sunita's afternoon is not unusual for her industry. Construction project management is deadline-driven by definition. A subcontractor bid has a submission deadline that is not negotiable after the fact. A change order authorization has a response window tied to contract terms. A safety inspection log has a regulatory timestamp requirement. 
When any of these processes depends on cloud infrastructure being available exactly when needed, the infrastructure becomes a single point of failure in a workflow that cannot tolerate one. +Outage risk falls hardest on the moments that matter most. High-stakes work — deadline submissions, live customer sessions, critical handoffs — involves intensive platform use, which means it's more exposed to performance degradation under load. The work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option. -Availability statistics miss a compounding factor. The concentration of cloud hosting means failures cascade across unrelated products at the same instant. The December 2021 AWS us-east-1 outage affected every product hosted there - project management tools, document collaboration platforms, file storage services, communication tools - at the same moment. A single incident becomes an industry-wide incident for everyone whose vendor chose the same region. Users who experience a simultaneous failure across multiple tools they rely on do not find redundancy in having adopted multiple platforms; they find that all their fallback options went down at the same time. This is the dependency chain. Not your vendor failing, but the infrastructure layer beneath your vendor - shared cloud regions, CDN providers, authentication services - none of which appear in your vendor's SLA (Service Level Agreement), and none of which you have any contract with. +The concentration of cloud hosting compounds this. The December 2021 AWS us-east-1 outage hit every product hosted there simultaneously — project management tools, document platforms, file storage, communication tools. Users who had adopted multiple platforms found that all their fallback options went down at the same time. 
Their vendor SLAs (Service Level Agreements) say nothing about the infrastructure layer beneath their vendor — shared cloud regions, CDN providers, authentication services — none of which the user has any contract with. -Outages hit hardest the users who can least work around them. Assistive technology users - those who rely on screen readers, switch access devices, or voice control software - experience SaaS connectivity failure as complete access failure. The screen reader announces a failed load. Voice control has no form fields to target. The application stops responding. Degraded performance that a connected user circumvents by refreshing is inaccessible in a more absolute sense - the AT user cannot navigate what is not there. The architecture this dissertation proposes keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software. +Outages hit hardest the users who can least work around them. Assistive technology users — those who rely on screen readers, switch access devices, or voice control — experience SaaS connectivity failure as complete access failure. Degraded performance that a user without assistive technology circumvents by refreshing is inaccessible in a more absolute sense: the screen reader announces a failed load; voice control has no form fields to target. The architecture developed in later chapters keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software. ### The Vendor That Disappears -In 2015, Sunrise Calendar had a substantial mobile user base (estimated by industry coverage in the low millions) and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year. Microsoft shut it down in August 2016. Users received a few weeks' notice. 
The data was exportable - in a format that no other calendar app read natively, requiring manual remapping of categories and recurrence rules. +In 2015, Sunrise Calendar had a substantial mobile user base and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year and shut it down in August 2016. Users received a few weeks' notice. The data was exportable in a format no other calendar app read natively. Sunrise was not exceptional. It was typical of how software products end. -The mechanism changes - acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger - but the pattern is consistent. The product goes dark. Users who built their workflows around it are left with whatever they managed to export before the deadline. +The mechanism changes — acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger — but the pattern is consistent. The product goes dark. Users who built workflows around it are left with whatever they managed to export before the deadline. Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the structure was stored in a format only Quip controlled. -Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the investment worthless on migration because the structure was stored in a format only Quip controlled. That is not a product failure. It is the custody model working exactly as designed: the user's workflow lives on vendor infrastructure until it doesn't. +When a vendor announces shutdown, it typically offers an export. What that export contains, what format it uses, and whether any other software can consume it are highly variable. For project management data, the export is typically a CSV of the task list — without comments, without attachment history, without the relationship structure that made the tool useful. 
For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability. -The data export problem deserves specific attention. When a vendor announces shutdown, it typically offers an export function. What that export contains, what format it uses, and whether any other software can actually consume it are highly variable. For project management data, vendors typically export a CSV of the task list - without the comments, without the attachment history, without the relationship structure that made the tool useful. For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability. - -The legal firm whose vendor gets acquired faces this directly. They adopted the software, trained staff, integrated it with billing and document management workflows, and accumulated years of matter history. Now they evaluate whether to migrate to the acquirer's competing product under the acquirer's pricing, or start over with a third party, reconstructing what they can from a flat CSV and a folder of PDFs. - -The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. It is routine. Thousands of SaaS products shut down every year. Most are small enough that their shutdowns do not make news; their users find out through an email or a banner in the app. The shutdowns that do make news - Evernote's degraded state following years of ownership changes, Google Reader's abrupt termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment - are notable primarily because of the scale of the disruption, not because the pattern is unusual. +The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. Thousands of SaaS products shut down every year. 
Most are small enough that their shutdowns don't make news; their users find out through an email or a banner. The shutdowns that do make news — Google Reader's termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment — are notable for scale, not for being unusual. ### The Connectivity That Wasn't There -Not everyone's internet is always on - and this is consistently underweighted in the architecture of software sold to the industries where it most frequently fails. - -Construction sites operate at the edge of mobile coverage. A superintendent in a concrete frame building cannot get a signal three floors underground. Rural professional service firms - accounting firms in small towns, medical practices in counties with limited broadband, legal practices in areas where fiber has not reached - operate on connectivity that drops daily and fails entirely during weather events. Hospital clinical environments include zones where mobile devices are restricted near sensitive equipment. Air-gapped facilities - manufacturing, defense, government - cannot connect to any external network at all as a policy requirement. +Construction sites operate at the edge of mobile coverage. A superintendent in a concrete frame building can't get a signal three floors underground. Rural professional service firms operate on connectivity that drops daily. Hospital clinical environments restrict wireless devices near sensitive equipment. Air-gapped facilities — manufacturing, defense, government — can't connect to any external network by policy. For these users, offline capability is not a feature request. It is the baseline requirement. -The SaaS vendor's marketing page says "works on mobile," which is true when there is a signal. It does not say "works when there isn't one," because the centralized architecture makes that impossible without fundamental redesign. 
The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render. +The SaaS vendor's marketing page says "works on mobile," which is true when there's a signal. The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render. -Most SaaS platforms offer some form of "offline mode." What this means in practice is usually a read-only cache of recently viewed data, with form submissions that queue locally and attempt to upload when connectivity returns - with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. You cannot create new records, cannot run reports, cannot access data you have not recently viewed, and cannot have any confidence that what you submitted offline actually made it to the server. +Most SaaS platforms offer some form of "offline mode." In practice this means a read-only cache of recently viewed data, with form submissions that queue locally and attempt upload when connectivity returns — with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. You cannot create new records, run reports, or access data you haven't recently viewed. -The field operations manager who needs to log a safety inspection at seven in the morning on a construction site, before the crew starts work, has a few options when the SaaS is unreachable. Write it in a notebook and transcribe it later, with all the transcription errors that introduces. Use the app's read-only offline mode and hope the form submission queues correctly. Or skip the log and fill it in from memory when back in the office. All three options introduce risk. None of them should be necessary. The software should work on a construction site because that is where the work happens. 
+Sabina Rahman is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh. She covers eleven villages twice a week on a company motorbike, processing loan applications, KYC documentation, and repayment ledgers on a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day. -The mismatch extends beyond any single vertical. Reliable internet access is not universal, even in developed economies. Hospital clinical environments restrict wireless devices near sensitive equipment. Manufacturing and warehouse floors often have RF environments hostile to Wi-Fi. Agricultural operations span hundreds of acres - the field where something needs to be logged is rarely next to the fiber drop. Emergency response personnel work in exactly the places infrastructure fails first. For all of these workers, SaaS software's connectivity assumption is not an occasional inconvenience. It is a systematic design error applied to environments the designers never worked in. +The day she stopped trusting it was a monsoon-relief disbursement morning. Forty-seven applicants in queue by 8:00 a.m. The platform took submissions until 11:14. Then it went down. Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she called *shotti'r khata* — the truth book — with the borrowers' thumbprints on the carbons. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing; the audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps didn't match the borrowers' submissions. -Intermittent connectivity is not a US edge case. It is the global operational baseline. 
In Nigeria and South Africa, scheduled load-shedding cuts power for six to twelve hours daily; when electricity goes, routers and base stations go with it, and connectivity fails regardless of coverage quality. Hundreds of millions of enterprise workers in those economies plan their workdays around outage schedules, not around the assumption that the network is always available. In India, the 4G/3G/2G coverage gradient means that enterprise field operations - agricultural services, construction, financial services, healthcare - routinely run on intermittent connectivity across large portions of Tier 2 and Tier 3 cities and rural areas. Rural Brazil, rural Mexico, and most of Southeast Asia present comparable patterns at comparable scale. A SaaS platform that cannot function without a persistent connection does not have a niche offline problem. It has an architecture that excludes the majority of the world's enterprise users from full functionality. +Tariq Hassan works the other end of the spectrum, where connectivity fails for different reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi. The platform's primary uplink is a Ku-band satellite. When weather conditions degrade the satellite — on average twice a month — the platform falls to a microwave backup. When both links drop, the platform is offline. -Sabina Rahman is one of those workers. She is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh, in a Rangpur Division village forty kilometers from the nearest upazila headquarters; she covers eleven villages on a route she runs twice a week on a company motorbike. Her work is relationship banking the way it has been done in Bangladesh since 1976 - the year Muhammad Yunus made the first thirty loans of what would become Grameen Bank - and digital paperwork the way it has been done for the last decade. 
Loan applications, KYC documentation, repayment ledgers, monsoon-relief disbursements - all of it lives in a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day. The mornings are the worst, when the entire upazila wakes up and pulls bandwidth at the same time. +The day Tariq stopped trusting the cloud's ingestion pipeline was a six-hour double-link outage. The data buffered on the platform's local server. The uplinks returned. The buffer drained. The SaaS application the operator had standardized on was a thin client — it expected the data to be in the cloud already, and the ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. Tariq spent the next ten days writing a manual reconciliation report. -The day she stopped trusting the platform entirely was a monsoon-relief disbursement morning. Forty-seven applicants in queue at her branch by 8:00 a.m. The platform took submissions until 11:14. Then it went down. The applicants had taken half a day off from rice-paddy work to sit in the queue. Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she had been keeping for two years and called *shotti'r khata* - the truth book - with the borrowers' thumbprints on the carbons and her own signature in blue ink. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing. The bank's audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps did not match the borrowers' submissions. 
- -Sabina keeps a paper backup of every digital sign-off she has made since. Twelve years of binders. Grameen-style microfinance, she has been heard to say, teaches you not to trust networks you cannot see - the field officer carries the bank's reputation in her notebook because the village will trust the notebook longer than it will trust any vendor's uptime page. - -Tariq Hassan works the other end of the spectrum, where connectivity fails for opposite reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi, one of nine Pakistani crew on a roster of forty-two. The platform's primary uplink is a Ku-band satellite. The backup is a microwave repeater on the next platform north. When weather conditions degrade the satellite - which happens on average twice a month and can last from forty minutes to fourteen hours - the platform falls back to the microwave. When the platform north is also degraded, both links drop and the platform is offline. Tariq's job is to keep the drilling-data feed running into the operator's onshore monitoring center in Dubai. - -The day Tariq stopped trusting the cloud's ingestion pipeline was a continuous double-link outage of just under six hours. The data buffered on the platform's local server. The uplinks returned. The buffer drained. The SaaS application the operator had standardized on the year Tariq was hired was a thin client - it expected the data to be in the cloud already, and the application's ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. It sat on the platform's local server for anyone who knew where to look. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. 
Tariq spent the next ten days writing a manual reconciliation report that the SaaS vendor's account manager called "an inconvenience." Tariq called it something else, in Urdu, to a colleague who asked him later how the report had gone. - -Tariq learned to run a parallel local data capture in addition to the SaaS feed, on a laptop in his bunk that he had reformatted to a Linux distribution the platform's IT department was not aware existed. He never trusted cloud telemetry on the platforms after that. The practice did not fail him. He kept it. +Intermittent connectivity is not a US edge case. Scheduled load-shedding in Nigeria and South Africa cuts power for six to twelve hours daily; connectivity fails with it. Hundreds of millions of enterprise workers plan their workdays around outage schedules, not around the assumption that the network is always on. A SaaS platform that can't function without a persistent connection doesn't have a niche offline problem — it has an architecture that excludes the majority of the world's enterprise users from full functionality. ### The Data You Can't Get Back -Your vendor's terms of service say your data is yours. They are often technically correct - the vendor does not claim ownership of the content you create. What the terms of service do not address is *accessibility*. - -Data that you own but cannot retrieve is data you do not have. +Your vendor's terms of service say your data is yours. They are often technically correct — the vendor doesn't claim ownership of the content you create. What the terms don't address is *accessibility*. -Four mechanisms make data inaccessible while it technically "belongs" to you. +Data you own but cannot retrieve is data you don't have. -Export rate limits are the first. Many platforms allow data export but rate-limit the export API (Application Programming Interface) to prevent bulk extraction. 
A legal firm with ten years of matter history attempting a bulk export may find that retrieving its own data at the permitted rate takes weeks. During that window, the firm remains dependent on the vendor's infrastructure to operate - which is, not coincidentally, exactly the position the vendor prefers it to be in. - -Proprietary formats are the second. The export is available, but in a format only the vendor's tools read well. Attachments export without their metadata. Comment threads export as flat text without threading structure. Custom fields export as raw column headers without the semantic context that made them useful. The data is present; the information it represented is partially lost. - -Feature-gated access is the third. Some platforms require paid subscriptions to access export features, or limit export to higher pricing tiers. Users on free or lower tiers discover that their data is portable only as long as they keep paying - which means it is not portable at all. - -Account closure timing is the fourth. When a user cancels a subscription, access typically ends when the billing period ends. A user who cancels on the first of the month with a billing cycle that ends on the fifteenth has fifteen days to export before the account closes. Miss that window - because you changed jobs, because the cancellation notice did not clearly state the deadline - and the data may be gone. +Four mechanisms make data inaccessible while it technically "belongs" to you. Export rate limits: many platforms allow data export but rate-limit the export API to prevent bulk extraction; a legal firm with ten years of matter history may find that retrieving its own data at the permitted rate takes weeks. Proprietary formats: the export is available, but in a format only the vendor's tools read well — comment threads export as flat text, custom fields export as raw headers without semantic context. 
Feature-gated access: some platforms require paid subscriptions to access export features, so portability is contingent on continued payment. Account closure timing: access ends when the billing period ends; miss the export window — because you changed jobs, because the notice was unclear — and the data may be gone. None of these are edge cases. They are the routine operational parameters of vendor-managed data. ### The Price That Changes After You've Committed -Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns - these represent real investments. Vendors know this. Pricing structures often reflect it. - -Pricing is competitive during the acquisition phase, when vendors are winning customers and competing on features and price. After adoption, when the switching cost is real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month, built an organization-wide workflow on it over two years, and now faces a renewal at $18 per seat per month confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins. - -Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. The roadmap description from three years ago that listed a capability as "included on Professional" may not match the current pricing page. Users who built workflows on features they understood to be included sometimes discover those features now require the next tier up. +Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns — these represent real investments. Vendors know this. -The per-seat model creates structural pressure as teams grow. A ten-person team's annual SaaS bill is manageable. 
A fifty-person team's bill at the same per-seat rate is five times larger, and by the time a company has reached fifty people using a platform, the switching cost has compounded accordingly. Teams that grow into enterprise sizes often find that per-seat pricing which was attractive at ten seats has become a significant budget line that IT attempts to renegotiate - often without success, because leverage has shifted. +Pricing is competitive during acquisition, when vendors are winning customers. After adoption, when switching costs are real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month and now faces renewal at $18 per seat confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins. -Mid-contract price changes are less common but not rare. Platform economics shift, investor pressure changes, the competitive landscape evolves. Users who committed workflows and data to a platform signed a contract of sorts - and then discovered the other party's interpretation of that contract differed from their own. +Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. Per-seat models create structural pressure as teams grow — at the same per-seat rate, a fifty-person team's bill is five times a ten-person team's, and by then the switching cost has compounded accordingly. -The lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. When one vendor raises prices, the team is not evaluating that product in isolation - they are evaluating the cost of unwinding a set of integrations built over years. 
Integration ecosystems serve the vendor's retention objectives as reliably as they serve the user's productivity. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. +Lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. ### The Drift You Don't See -The first five modes manifest visibly. The platform stops loading, the vendor announces a shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. +The first five modes manifest visibly. The platform stops loading, the vendor announces shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. -This one does not. The system continues to operate normally. Two users edit the same record on different devices, then a sync conflict resolves silently in favor of one set of changes; the other user's work is gone, but no error appears and no notification fires. A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the application logic that depended on uniqueness produces wrong results until someone notices the second copy. The work appears to continue. The output is wrong. +This one doesn't. 
Two users edit the same record on different devices; a sync conflict resolves silently in favor of one set of changes, the other user's work is gone, and no error fires. A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the logic that depended on uniqueness produces wrong results until someone notices the second copy. -Silent corruption and silent divergence are the failure modes the user catches last and trusts the system about most. Production engineering teams who have shipped collaborative SaaS describe these as the bugs they fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number does not add up or a record they remember saving is no longer there. The architecture matters here because of where convergence is decided. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it is wrong enough to notice. The architecture I argue for in the chapters that follow makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. +Silent corruption and silent divergence are the failure modes production engineering teams fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number doesn't add up. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it's wrong enough to notice. The architecture developed in later chapters makes the convergence question first-class at the data layer rather than implicit in vendor behavior. ### The Third-Party Veto -The first six failure modes originate inside the service relationship. 
The vendor fails, decides, prices, or quietly drifts. Both the vendor and the customer are subject to the same disruption, and in most cases neither party wanted it. +The first six failure modes originate inside the service relationship. The seventh does not. An external authority — a government, a regulator, a court — can restrict access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides has acted. -The seventh does not. An external authority - a government, a regulator, a court - restricts access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides of the relationship has acted, and the service relationship cannot continue. +In 2022, Western SaaS providers — Adobe, Autodesk, Microsoft, Figma, and dozens of others — suspended service across Russia and CIS markets under sanctions enforcement. Organizations across those markets, accounting for hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1]. Federal agencies with active Anthropic deployments received direction to cease using them. Anthropic contested the designation legally [2], and a California court enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. -The authority can act on the vendor. 
In 2022, Western SaaS providers - Adobe, Autodesk, Microsoft, Figma ([figma.com](https://www.figma.com/), the design tool), and dozens of others - suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement; organizations across those markets, accounting for many hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. Software that had been licensed, trained on, and integrated into operational workflows became inaccessible with days of notice, not months. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1]. Federal agencies with active Anthropic deployments - deployments they found valuable and wished to continue - received direction under executive order to cease using them. Anthropic contested the designation legally [2], and a California court subsequently enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. The analytically significant detail in both cases: the restriction came from a party with authority over the vendor, independent of both the vendor's and the customer's preferences. +The authority can act on the customer instead. Russia's Federal Law 242-FZ has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture can't provide on-premises data residency by design. 
The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023 creates comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. -The authority can act on the customer. Russia's Federal Law 242-FZ - among the first general-purpose data localization laws globally, predating GDPR (General Data Protection Regulation) by two years - has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture cannot provide on-premises data residency by design. The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the vendor continued operating; the customer's legal ability to continue using it was constrained. India's DPDP (Digital Personal Data Protection) Act 2023 is now creating comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. In each case, the customer becomes non-compliant regardless of the vendor's preferences or actions. - -The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. Data on hardware the user controls requires action targeted specifically at the user. The architecture either concentrates that exposure surface at the vendor or distributes it. +The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. 
Data on hardware the user controls requires action targeted specifically at the user. --- ## The Work That Doesn't Stop -The seven failure modes above describe what breaks. The work itself continues - that is the part most cloud-dependency arguments miss. They reach for whatever still works. +The seven failure modes above describe what breaks. The work itself continues — that's the part most cloud-dependency arguments miss. Workers reach for whatever still works. -In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. What follows is recognizable to anyone who has lived through an actual EHR outage: dry-erase boards return to the nurses' station, a fax machine reappears at triage, paper prescription pads come out of the supply closet, and triplicate forms circulate among medical assistants who have never seen them before - felt-tip markers oblivious to the carbon backing, the bottom copies coming out blank. A senior nurse spends much of the episode correcting the younger staff on the conventions of an analog workflow they have only heard about in training. The trauma center keeps operating. The patients still get seen. The work does not stop. +In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. Dry-erase boards return to the nurses' station. Paper prescription pads come out of the supply closet. Triplicate forms circulate among medical assistants who have never seen them — felt-tip markers oblivious to the carbon backing. The trauma center keeps operating. The patients get seen. The work doesn't stop. The episode is fiction. The pattern is not. 
Maria Santos lived it. -Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. She was three hours into her shift, sitting in her office with a coffee that had gone cold during the second of two morning standups, when the help-desk queue lit up. By 9:14 the EHR was unavailable system-wide. By 9:21 the radiology PACS was unreachable. By 9:30 she was in the CIO's office watching him try to reach the vendor's emergency line and getting an automated message that confirmed only that the vendor was aware of "an incident affecting multiple customers." - -The hospital had forty-seven patients in the OR queue that morning. The list of who was scheduled for what existed in the EHR. Without the EHR, the list existed in the heads of the nurses who had been reading it at 7 a.m. before everything went dark. Maria spent the next eleven hours doing things hospital administrators are not supposed to have to do. She walked the floor with a clipboard. She watched the triage nurses recreate patient acuity ratings on dry-erase boards. She stood next to a charge nurse who was trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that would not respond. She made eight phone calls that morning that ended with sentences she will not say again. *I don't know yet.* *We're working on it.* *I will call you when I have something to tell you.* - -The vendor restored access seventy-three hours later. The hospital had not lost a patient. Several other hospitals in the same vendor's customer base, hit the same week, had. Maria does not know what those hospitals' administrators were doing during their seventy-three hours and she does not need to know. She knows the shape of those hours from inside. +Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. By 9:14 the EHR was unavailable system-wide. 
By 9:21 the radiology PACS was unreachable. The hospital had forty-seven patients in the OR queue. Without the EHR, that list existed in the heads of the nurses who had read it at 7 a.m. Maria spent the next eleven hours walking the floor with a clipboard, watching triage nurses recreate patient acuity ratings on dry-erase boards, standing next to a charge nurse trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that wouldn't respond. -She still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she could not tell a charge nurse whether a man's chart said sulfa or penicillin. +The vendor restored access seventy-three hours later. The hospital had not lost a patient. Maria still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she couldn't tell a charge nurse whether a man's chart said sulfa or penicillin. -Healthcare ransomware incidents are tracked publicly by trackers including Comparitech, the HIPAA Journal, and the HHS OCR breach portal, and the count of US hospital ransomware events has run into the hundreds per year for several years now. Healthcare-services research has consistently associated ransomware-driven EHR downtime with elevated patient-harm metrics - the specific magnitudes vary by study and by the size of the disruption window. Healthcare professionals interviewed about *The Pitt* identified the same artifacts in their own incident logs: paper charts piling up at the nurses' station, prescriptions written by hand, hours of post-restoration overtime to back-fill the EHR with what happened on paper while the system was offline. The on-screen chaos is not exaggerated. It is documentary realism dressed as drama. 
+Healthcare ransomware incidents have run into the hundreds per year for several years. Healthcare-services research consistently associates EHR downtime with elevated patient-harm metrics. The on-screen chaos in *The Pitt* is not exaggerated — it is documentary realism dressed as drama. -Two observations matter for any architecture decision. First: the work continued because human practitioners knew what to do without the digital system. Triage worked. Charting worked. Billing eventually caught up. Domain expertise outlasts the software that depends on it. Second: the digital affordances did not survive. Search disappeared. Cross-shift handoff slowed to verbal report. Pattern detection across patient histories - the analytic work that justified the EHR investment in the first place - became impossible until the system came back. The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. +Two observations drive every architecture decision that follows. First: the work continued because human practitioners knew what to do without the digital system. Domain expertise outlasts the software that depends on it. Second: the digital affordances didn't survive. Search disappeared. Pattern detection across patient histories — the analytic work that justified the EHR investment — became impossible until the system came back. The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. -The same pattern repeats outside the hospital. When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. When the SaaS field-service application fails, the technician carries a paper work order and reconciles in the system the next day. None of these workarounds are the failure of the people. 
They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. +When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. None of these workarounds are the failure of the people. They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. -This is the gap the inverted stack closes. A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet - every hospital should have one, every law firm should have one, every construction office should have one - but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. The patient-harm signature of EHR downtime becomes a statistic about an architecture that the next generation of systems was designed to replace. That is the empirical case this dissertation builds. +This is the gap the inverted stack closes. A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet — but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. --- ## Who Pays the Most -These seven failure modes do not hit every organization equally. The organizations most exposed share a characteristic: they have the least structural leverage to address any of them. 
+These seven failure modes don't hit every organization equally. The most exposed share a characteristic: they have the least structural leverage to address any of them. -A large enterprise with a skilled procurement and IT organization can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data - these are available to buyers with enough revenue to make the vendor's legal team engage seriously. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms or negotiate exit conditions. +A large enterprise with a skilled procurement team can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data — these are available to buyers with enough revenue to make the vendor's legal team engage. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms. -Small and medium-sized professional service firms do not have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through a terms of service that nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified. They have no SLA. They have no escrow. They have no explicit data portability requirement. If the vendor changes pricing, those users have no mechanism to object. If the vendor shuts down, they have whatever the shutdown announcement says they have. +Small and medium-sized professional service firms don't have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through terms of service nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified — no SLA, no escrow, no explicit data portability requirement. 
-These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid - and damages the relationship with the client. The legal practice unable to access case files has a professional responsibility exposure. The medical practice that cannot retrieve patient records has regulatory risk. The stakes of availability are not abstract. +These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid and damages the client relationship. The legal practice unable to access case files has professional responsibility exposure. The medical practice that can't retrieve patient records has regulatory risk. The stakes of availability are not abstract. -And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and the procurement counsel is using enterprise-licensed software with negotiated protections. The eight-attorney law firm is using the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode described in this chapter. +And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and procurement counsel uses enterprise-licensed software with negotiated protections. The eight-attorney law firm uses the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode in this chapter. -This is not a coincidence. 
The SaaS bundle packages its desirable and undesirable properties together in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. +This is not a coincidence. The SaaS bundle packages its desirable and undesirable properties in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. -The regulatory dimension compounds this asymmetry. A legal practice storing confidential client communications in a vendor's cloud carries a professional duty to understand where that data lives and who can access it. A medical practice has HIPAA (Health Insurance Portability and Accountability Act) obligations. A construction firm with government contracts may have data residency requirements tied to those contracts. For large enterprises, these obligations get negotiated into vendor agreements with audit rights and data processing addenda. For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy - a document written to protect the vendor, not the client. +The regulatory dimension compounds this. A legal practice storing client communications in a vendor's cloud carries a professional duty to understand where that data lives. A medical practice has HIPAA obligations. For large enterprises, these get negotiated into vendor agreements with audit rights and data processing addenda. For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy — a document written to protect the vendor, not the client. -The jurisdictional scope of this compliance argument is wider than US-centric discussions typically acknowledge. 
The EU's Schrems II ruling, India's Digital Personal Data Protection Act 2023, the UAE's DIFC (Dubai International Financial Centre) Data Protection Law 2020, China's Personal Information Protection Law (PIPL, 2021), Brazil's LGPD (Lei Geral de Proteção de Dados, 2018), South Africa's POPIA (Protection of Personal Information Act, 2013), Nigeria's NDPR (Nigeria Data Protection Regulation, 2019), Japan's APPI (Act on the Protection of Personal Information), South Korea's PIPA (Personal Information Protection Act), and Russia's Federal Law 242-FZ are representative - each, in different language, makes data residency a compliance mechanism rather than a preference. The same pattern repeats across more than thirty national and regional frameworks; the full coverage table for this chapter is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware - not in a vendor's cloud region - is not merely preferred. In many configurations, it is the architecture that makes compliance tractable. The architecture I propose is frequently a legal requirement before it is an architectural choice. +The jurisdictional scope is wider than US-centric discussions acknowledge. The EU's Schrems II ruling, India's DPDP Act 2023, China's PIPL (2021), Brazil's LGPD (2018), South Africa's POPIA (2013), Nigeria's NDPR (2019), and Russia's Federal Law 242-FZ each make data residency a compliance mechanism rather than a preference. The full coverage table is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware is not merely preferred — in many configurations it is the architecture that makes compliance tractable. --- ## Why Users Have Accepted This -Until recently, they did not have a choice. +Until recently, they didn't have a choice. -Real-time collaboration requires that all parties see consistent state when they make concurrent changes. 
In 2008, the most practical way to guarantee this was a central server both parties could read from and write to simultaneously. Every other approach - emailing files, shared drives, version control - introduced either merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. Real-time collaboration solved both problems by making divergence impossible: one copy, everyone editing the same one. +Real-time collaboration required a central server both parties could read from and write to simultaneously. Every other approach — emailing files, shared drives, version control — introduced merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. One copy, everyone editing the same one, solved both. -Multi-device sync requires an authoritative copy that all devices agree on. When the cloud holds the authoritative copy, sync is the cloud pushing updates to each device. Without a cloud authority, devices have to figure out among themselves which version is current - and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without requiring user intervention, did not exist. Merging concurrent edits deterministically, without a server to adjudicate conflicts, was an unsolved problem for end-user software. +Multi-device sync required an authoritative copy that all devices agreed on. Without a cloud authority, devices had to figure out among themselves which version was current — and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without user intervention didn't exist. -Zero maintenance requires that someone else manage the infrastructure. The alternative is the user managing it, which requires IT capability that most small organizations do not have and do not want to develop. 
The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker, a self-hosted document collaboration platform - all theoretically possible, all practically demanding enough that most organizations paid someone else to handle it. - -The dependencies looked structural because they were structural. The technology for delivering these properties without vendor infrastructure either did not exist or was not mature enough to deploy without specialized expertise. CRDTs (Conflict-free Replicated Data Types) were academic research with a handful of experimental implementations. Gossip protocols ran inside distributed databases; nobody was building them into end-user applications. Container runtimes existed for server workloads; the packaged, embeddable, consumer-invisible form that makes Docker Desktop run silently on your laptop had not been built. +Zero maintenance required that someone else manage the infrastructure. The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker — all theoretically possible, all practically demanding enough that most organizations paid someone else. Users accepted the SaaS bundle not because they preferred the conditions on the second half but because the technology of the time made those conditions appear to be the cost of the first half. They were not accepting a bargain so much as acknowledging a constraint. -The constraint is removable - by the architecture this dissertation proposes. - -The evidence is commercial, not theoretical. The earliest and most consequential proof is African mobile money: M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. 
Both are built on offline-tolerant transaction patterns - store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps - because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. +The constraint is removable. -In the professional software space, Linear ([linear.app](https://linear.app/), the issue tracker) demonstrates that a sync engine can run locally even inside a SaaS architecture - clients keep a local SQLite replica, and the cloud is demoted to a relay peer for the engine layer. Authoritative data still lives on Linear's servers; the architecture I argue for takes the next step. Figma is often cited in the same breath because Figma uses CRDT-flavored mechanisms for multiplayer cursor coordination - but Figma's data lives on Figma's servers and the local client is not authoritative; Figma is a collaboration win, not a data-sovereignty architecture. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional, with no vendor data custody required. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. +The evidence is commercial, not theoretical. M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. Both are built on offline-tolerant transaction patterns — store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps — because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. 
-These products demonstrate that the desirable half of the SaaS bundle - collaboration, sync, responsive UI - does not require vendor data custody to function. Users who have worked with software built on these foundations know what it feels like when software keeps running after the internet goes out. The acceptance erodes when the alternative is observable, not theoretical. +In professional software, Linear demonstrates that a sync engine can run locally even inside a SaaS architecture — clients keep a local SQLite replica, and the cloud is demoted to a relay peer. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. These products demonstrate that the desirable half of the SaaS bundle — collaboration, sync, responsive UI — doesn't require vendor data custody to function. --- ## The Dependency That Looks Inevitable -Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge (the same family of protocols Cassandra and DynamoDB use at planetary scale, applied without modification at five-machine team scale); and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence - that the technical reasons SaaS architectures had to concentrate data at the vendor are gone - followed from those solutions. Chapter 2 develops each in full. 
+Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge — the same family of protocols Cassandra and DynamoDB use at planetary scale, applied at five-machine team scale; and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence — that the technical reasons SaaS architectures had to concentrate data at the vendor are gone — followed from those solutions. Chapter 2 develops each in full. -The architecture this dissertation proposes has real costs. They do not disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator does not own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Some will not. Chapter 4 helps you decide. +The architecture this book proposes has real costs. They don't disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator doesn't own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. 
Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Chapter 4 helps you decide. -Marcus's scenario - deadline-critical work held hostage by infrastructure he does not control - is the failure mode this architecture addresses first. His data was never gone. It was inaccessible because the software's design placed it somewhere he could not reach. The remaining chapters specify a design where that distinction does not exist. +Sunita's scenario — deadline-critical work held hostage by infrastructure she doesn't control — is the failure mode this architecture addresses first. Her data was never gone. It was inaccessible because the software's design placed it somewhere she couldn't reach. The remaining chapters specify a design where that distinction doesn't exist. -The building blocks are production-proven. What remains is the specific assembly that produces a node - not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. +The building blocks are production-proven. What remains is the specific assembly that produces a node — not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. 
--- diff --git a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md index 916834c..ce38780 100644 --- a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md +++ b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md @@ -1,6 +1,6 @@ # Chapter 3 - The Inverted Stack in One Diagram - + @@ -9,16 +9,16 @@ ## The Inversion in One Sentence -Every architectural decision in this dissertation follows from one reversal of priority: +Every architectural decision in this book follows from one reversal of priority: -> **Conventional SaaS (Software as a Service):** Cloud database is primary - local device caches and renders. -> **Local-Node Architecture:** Local node is primary - cloud relay is an optional sync peer. +> **Conventional SaaS:** Cloud database is primary — local device caches and renders. +> **Local-Node Architecture:** Local node is primary — cloud relay is an optional sync peer. -In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing - a shell waiting for instructions that will not arrive. +In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing — a shell waiting for instructions that will not arrive. -In the local-node model, the device *is* the server. The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode (with one exception that earns its complexity: CP-class records that require distributed lease coordination - covered later in this chapter). It carries no dependency on any remote service for core function. 
+In the local-node model, the device *is* the server. The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode — with one exception: CP-class records that require distributed lease coordination, covered later in this chapter. It carries no dependency on any remote service for core function. -The architecture resolves into one mental model that the principal diagram below anchors. Supporting diagrams in this chapter visualize specific layer interactions; the principal diagram is what the reader holds. +The architecture resolves into one mental model anchored by the principal diagram below. ```mermaid graph LR @@ -47,7 +47,7 @@ Primary: Node B")] end ``` -The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path at all. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. If that also fails, they work offline and catch up when connectivity returns. +The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. If that also fails, they work offline and catch up when connectivity returns. This is the inversion. Everything else is implementation. @@ -55,7 +55,7 @@ This is the inversion. Everything else is implementation. ## The Five Layers -The inversion is one sentence. The five-layer model is why that sentence is implementable - the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. 
Each layer has a clear owner. Each layer has a clear boundary. Each layer has an answer to the question every distributed system must answer: what happens when the network is unavailable? +The inversion is one sentence. The five-layer model is why that sentence is implementable — the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. Each layer has a clear owner, a clear boundary, and an answer to the question every distributed system must answer: what happens when the network is unavailable? ```mermaid graph TB @@ -84,30 +84,30 @@ Peer Discovery · NAT Traversal"] ### Layer 1: Presentation -The presentation layer renders what the local store contains. That is its entire job. It owns no state. It caches nothing independently. It makes no decisions about data. +The presentation layer renders what the local store contains. It owns no state, caches nothing independently, and makes no decisions about data. -In the Zone A accelerator (the Anchor pattern - offline-by-default local-first desktop), this layer is a .NET MAUI (.NET Multi-platform App UI) Blazor Hybrid shell - a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern - hybrid multi-tenant SaaS) browser shell: the same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. This is deliberate. If a UI component only works against a cloud backend, it has not been designed correctly for this architecture. +In the Zone A accelerator (the Anchor pattern), this layer is a .NET MAUI Blazor Hybrid shell: a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern) browser shell. 
The same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. A UI component that only works against a cloud backend has not been designed correctly for this architecture. -The presentation layer's primary local-first responsibility is status indication. Users should always know the state of their data without interrogating it. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: +The presentation layer's primary local-first responsibility is status indication. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: - **Sync-healthy:** The node is connected to at least one peer and has exchanged a recent delta. - **Stale:** The node has not synced within its configured freshness threshold; local data may lag behind changes made by others. -- **Offline:** No peers are reachable. The node is operating on its own authoritative copy. +- **Offline:** No peers are reachable. The node operates on its own authoritative copy. - **Conflict-pending:** One or more records have diverged from a peer version and require resolution. -Each state must be communicated through more than color. The `SunfishNodeHealthBar` sets `SemanticProperties.Description` to a text equivalent for each state - screen readers announce the current sync status without requiring the user to inspect the color indicator. State transitions trigger a live region announcement, so an AT user receives the same notification a sighted user receives visually. The full accessibility specification appears in Chapter 20. +Each state communicates through more than color. The component sets `SemanticProperties.Description` to a text equivalent — screen readers announce sync status without requiring the user to inspect the color indicator. State transitions trigger a live region announcement. 
The full accessibility specification is in Chapter 20. -When the network is unavailable, the presentation layer changes nothing about its behavior. It continues to render from the local store. The status indicator moves from sync-healthy to offline. The user can still create records, navigate, query, and run any domain workflow that does not require distributed lease coordination. They receive no error page. No spinner. No apology. The software works. +When the network is unavailable, the presentation layer changes nothing. It continues to render from the local store. The status indicator moves to offline. The user creates records, navigates, queries, and runs any domain workflow that does not require distributed lease coordination. No error page. No spinner. No apology. ### Layer 2: Application Logic -The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer determines what constitutes a valid state transition, enforces invariants, and emits events that both the local store and the sync daemon consume. +The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer enforces invariants and emits events that both the local store and the sync daemon consume. -This layer holds no network-aware code. It does not know whether the sync daemon is connected to peers. It writes to the local CRDT store unconditionally - the sync daemon propagates those writes when it can, not when consulted before they happen. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or a remote validation service. +This layer holds no network-aware code. It does not know whether the sync daemon is connected to peers. 
It writes to the local CRDT store unconditionally — the sync daemon propagates those writes when it can. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or validation service. -The one exception is CP-class records - those whose correctness requires distributed coordination, such as resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these records, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. This is an explicit design choice. The user sees a constraint, not a mystery failure. +The one exception is CP-class records — those whose correctness requires distributed coordination: resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. The user sees a constraint, not a mystery failure. -The CAP positioning is per record class, not per application: +CAP positioning is per record class, not per application: | Record Class | CAP Position | Why | |---|---|---| @@ -118,92 +118,90 @@ The CAP positioning is per record class, not per application: ### Layer 3: Sync Daemon -The sync daemon is a separate long-running process. It is not a thread in the application. It is not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers - the application reconnects to a daemon that has been working the whole time. 
+The sync daemon is a separate long-running process — not a thread in the application, not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers. The daemon manages five concerns: -**Peer discovery.** Discovery follows a three-tier hierarchy. On the local network, mDNS provides zero-configuration discovery - two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. (Many enterprise Wi-Fi configurations filter mDNS by default; on those networks, the next tier is the path that actually works.) Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. +**Peer discovery.** On the local network, mDNS provides zero-configuration discovery — two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. -**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta - the operations each holds that the other lacks. Vector clocks scoped per-document (one entry per peer that has produced operations on that document) track what each peer has seen. This is the same anti-entropy mechanism used by large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. 
+**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta — the operations each holds that the other lacks. Vector clocks scoped per-document track what each peer has seen. The same anti-entropy mechanism underpins large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. -**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The protocol wire format is CBOR (Concise Binary Object Representation) - compact binary encoding that minimizes bandwidth on the intermittent connections that are the baseline operating condition for hundreds of millions of enterprise workers worldwide, not an edge case. +**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The wire format is CBOR (Concise Binary Object Representation) — compact binary encoding that minimizes bandwidth on intermittent connections. -**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. The lease is granted when a quorum of reachable peers acknowledges - the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set, so the system never grants two contradictory leases simultaneously. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window under the reference network model. A node that goes offline releases its lease at expiry - the team is never permanently blocked by one disconnected device. +**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. 
When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. The lease is granted when a quorum of reachable peers acknowledges — the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window. A node that goes offline releases its lease at expiry; the team is never permanently blocked by one disconnected device. -**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment. A power interruption between buffering and peer delivery does not lose data. The moment a peer becomes reachable - on the LAN, via VPN, or via the managed relay - the daemon begins working through the buffer. The application never needs to know that writes were queued. +**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment — a power interruption between buffering and peer delivery does not lose data. The moment a peer becomes reachable, the daemon begins working through the buffer. The application never needs to know that writes were queued. ### Layer 4: Storage -Layer 4 is the source of truth for this node. Everything the presentation layer renders, everything the application logic layer reads, comes from here. Nothing here depends on a remote service. +Layer 4 is the source of truth for this node. The presentation layer renders from here. The application logic layer reads from here. Nothing here depends on a remote service. -The primary store is SQLite encrypted with SQLCipher. 
The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore - the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields nothing readable. +The primary store is SQLite encrypted with SQLCipher. The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore — the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields ciphertext. Three storage structures coexist: -**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. List documents hold ordered sequences. Text documents hold rich text. The CRDT library handles merge semantics - the merge function is commutative, associative, and idempotent, so any two diverged copies of a document produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation currently ships YDotNet (a .NET port of Yjs); Loro is the aspirational target when its C# bindings mature. The `ICrdtEngine` abstraction keeps that choice reversible. (See Appendix G for the full glossary of these libraries and their licenses.) +**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. List documents hold ordered sequences. Text documents hold rich text. The merge function is commutative, associative, and idempotent — any two diverged copies produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation ships YDotNet (a .NET port of Yjs); Loro is the aspirational target. The `ICrdtEngine` abstraction keeps that choice reversible. -**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. It never modifies past entries. 
Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail that regulated industries require. +**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail regulated industries require. -**Read-model projections** are materialized views derived from the event log - the tables, indexes, and calculated fields that make queries fast. If a projection becomes corrupted or stale, it is rebuilt from the event log. The event log is the ground truth. Projections are a performance optimization. +**Read-model projections** are materialized views derived from the event log — tables, indexes, and calculated fields that make queries fast. A corrupted or stale projection rebuilds from the event log. Projections are a performance optimization; the event log is the ground truth. ### Layer 5: Relay and Discovery Layer 5 is the only layer that touches infrastructure outside the local node, and it is optional. -The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. The relay holds no authoritative data. It stores no decrypted content. It cannot read the payloads it routes - every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer, and the relay has no access to any key. +The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. 
The relay stores no decrypted content. Every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer; the relay holds no key. -The relay's two default trust levels reflect this: +The relay's two default trust levels: -- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration that satisfies data sovereignty requirements without exception. -- **Attested hosted peer (opt-in):** An administrator explicitly issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination - useful for teams too small to form quorum from workstations alone. +- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration and satisfies data sovereignty requirements without exception. +- **Attested hosted peer (opt-in):** An administrator issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination — useful for teams too small to form quorum from workstations alone. -The relay protocol is open and the relay is self-hostable. Any organization that requires full independence from managed relay infrastructure can operate its own relay with no changes to node configuration. +The relay protocol is open and the relay is self-hostable. Organizations that require full independence from managed relay infrastructure can operate their own relay with no changes to node configuration. -A note on what "optional" means in practice. The relay is *architecturally* optional - the protocol does not require it, two nodes on the same LAN sync directly via mDNS, and a small team whose members all work from one office can run indefinitely without any relay at all. 
The relay is *operationally* mandatory for the modal team in this dissertation's audience: members across symmetric NATs, members on cellular networks, members on different corporate Wi-Fi networks where mDNS is filtered. For those teams, the relay is what lets two members reach each other when neither is on the same LAN. The architecture does not pretend otherwise; the distinction matters because operational planning has to account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted on the team's own VPS. Fleet observability - relay availability, peer reachability, sync health across the fleet - is what the operator monitors; Chapter 21 specifies the fleet observability primitives. - -The relay's failure is not the application's failure. +The relay is architecturally optional — the protocol does not require it, and a small team whose members all work from one office can run indefinitely without one. The relay is operationally required for the modal team this book addresses: members across symmetric NATs, on cellular networks, or on separate corporate Wi-Fi networks where mDNS is filtered. Operational planning must account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted. The relay's failure is not the application's failure. --- ## How This Changes Failure Modes -Chapter 1 named seven failure modes. The inversion addresses each of them specifically. There are also failure modes the SaaS model created that may not have been visible as such - they only become legible once you understand what the vendor was holding on your behalf. And there are new failure modes the inverted architecture introduces. All three categories deserve honest treatment. +Chapter 1 named seven failure modes. The inversion addresses each directly. 
There are also failure modes the SaaS model created that only become legible once you understand what the vendor was holding on your behalf. And the inverted architecture introduces failure modes of its own. All three categories deserve honest treatment. **What the inversion resolves:** -*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure - your vendor's, or the cloud region beneath your vendor - interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. The construction PM submitting a bid at 4:58 PM does not care whether a cloud region is degraded, because his node does not consult any remote service to function. +*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure — your vendor's, or the cloud region beneath your vendor — interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. *The Vendor.* Data on vendor infrastructure is at the vendor's business decision's mercy. Data on the user's hardware is not. A vendor acquisition, pivot, or shutdown interrupts the sync service. It does not interrupt access to the user's data. -*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy. Connectivity enables sync. It is not a prerequisite for function. The operational precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years, demonstrating that the pattern works at population scale in the markets that most require it. 
+*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy — connectivity enables sync; it is not a prerequisite for function. The precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years. -*The Data.* Vendor-managed data is portable only on vendor terms - export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. +*The Data.* Vendor-managed data is portable only on vendor terms — export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. -*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay - the one remaining billable dependency - is replaceable. The data custody that makes price changes coercive is removed from the equation. +*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay — the one remaining billable dependency — is replaceable. The data custody that makes price changes coercive is gone. -*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. The architecture I propose makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. 
CRDT merge semantics produce deterministically convergent state across peers - no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not as a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create the divergence, rather than accepting both and discovering the inconsistency later. The convergence semantics are testable, the divergence cases are observable, and the resolution is auditable. The cost: developers have to model their domain in operations rather than current-state assignments. Chapter 12 specifies the CRDT engine; Chapter 13 specifies the conflict UX. +*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. CRDT merge semantics produce deterministically convergent state across peers — no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create divergence. The convergence semantics are testable, divergence cases are observable, and resolution is auditable. The cost: developers must model their domain in operations rather than current-state assignments. Chapters 12 and 13 specify the CRDT engine and the conflict UX. -*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement. Hundreds of thousands of organizations that had built workflows on those platforms found their operations interrupted - not because their vendors failed them, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector entirely. A relay can be targeted. 
The software vendor itself can be targeted. But the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted or replaced for the highest-sensitivity deployments. Chapter 11 specifies relay governance. Chapter 15 covers the compliance framework for the customer-directed variant of this failure mode. +*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS markets under sanctions enforcement. Organizations that had built workflows on those platforms found their operations interrupted — not because their vendors failed, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector — a relay can be targeted, the software vendor itself can be targeted — but the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted for the highest-sensitivity deployments. Chapters 11 and 15 cover relay governance and the compliance framework. -The regulatory landscape this failure mode operates in is worth naming. The dominant European driver is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the strongest European legal argument for local-first data residency, enforced nationally by Germany's BSI (Bundesamt für Sicherheit in der Informationstechnik) and France's CNIL (Commission nationale de l'informatique et des libertés). 
India's DPDP Act 2023 and the RBI's payment-data localization circular, China's PIPL (Personal Information Protection Law) 2021, Russia's Federal Law 242-FZ (Russian-citizen personal data on Russian territory since 2015), the UAE's DIFC DPL 2020, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's PDPL cluster (KSA, Bahrain) are representative of the parallel pattern across GCC, APAC, African, and Americas markets; the full coverage matrix is in Appendix F. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. One nuance worth flagging: when peer nodes reside in different jurisdictions, a direct peer-to-peer sync becomes a cross-border data transfer in legal terms, even when the data is encrypted in transit and never lands on a vendor server. Chapter 15 specifies the compliance framework for that case. +The dominant regulatory driver for data residency is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023, China's PIPL 2021, Brazil's LGPD, and analogous frameworks across APAC and GCC markets follow the same structural logic. The full coverage matrix is in Appendix F. When peer nodes reside in different jurisdictions, a direct peer-to-peer sync constitutes a cross-border data transfer in legal terms, even when encrypted and never touching a vendor server. Chapter 15 specifies the compliance framework for that case. **What you may not have noticed you were exposed to:** -*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you have stored with them. A breach anywhere in their infrastructure stack - servers, sub-processors, privileged internal access - is a breach of your data, regardless of any action you took or failed to take. 
This failure mode is invisible until it has already happened. You cannot evaluate a vendor's internal security posture from outside it. In this architecture, the relay holds only ciphertext: it receives post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. There is no decryptable content to exfiltrate. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. The attack surface moves to the endpoints - which this architecture addresses explicitly rather than hiding. +*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you stored with them. A breach anywhere in their infrastructure stack — servers, sub-processors, privileged internal access — is a breach of your data, regardless of any action you took. In this architecture, the relay holds only ciphertext: post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. -Hayoon Kim found out about her vendor's breach at a hotel in Singapore at 6:47 in the morning, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P (Information Security Management System - Personal) consultancy out of Gangnam-gu in Seoul. 
Her practice management SaaS - a Korean-language platform serving a few thousand domestic compliance professionals - had been breached six weeks earlier. The breach was disclosed to customers via an email that landed in her promotions folder. Hayoon never saw it. The article was the disclosure that reached her. Eleven of her clients were named on the dump that surfaced overnight on a Russian-language forum, each report carrying her name on the cover page, each report listing the specific PIPA (Personal Information Protection Act) Article 29 safety-measure controls she had documented during her 2023 audit work. +Hayoon Kim found out about her vendor's breach at 6:47 in the morning at a hotel in Singapore, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P consultancy out of Gangnam-gu in Seoul. Her practice management SaaS had been breached six weeks earlier. The vendor disclosed by email; the email landed in her promotions folder. The article was the disclosure that reached her. Eleven of her clients appeared in the overnight dump on a Russian-language forum, each report carrying her name on the cover page, each listing the specific PIPA Article 29 controls she had documented during her 2023 audit work. -She spent the next eleven days drafting individual letters to each affected client explaining what had happened, what data was exposed, what they should do. She had spent her career advising other organizations on this exact kind of letter. Writing eleven of them about her own practice was a different exercise. The platform vendor's chief executive sent a personal apology that was identical, paragraph for paragraph, to an apology another vendor's chief executive had sent the year before - Hayoon recognized three of the sentences from a precedent she had cited in a 2022 article she had written for the Korea Internet & Security Agency's quarterly compliance bulletin. 
+She spent the next eleven days drafting individual letters to each affected client. She had spent her career advising other organizations on exactly this kind of letter. The platform vendor's CEO sent a personal apology identical, paragraph for paragraph, to an apology another vendor's CEO had sent the year before — Hayoon recognized three sentences from a precedent she had cited in a 2022 article for the Korea Internet & Security Agency's quarterly compliance bulletin. -She still keeps her active client documents on a local encrypted drive that no SaaS vendor has access to. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. +She still keeps her active client documents on a local encrypted drive. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. **What the architecture introduces honestly:** -*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. SQLCipher encryption at rest limits the damage from physical device loss - storage extraction without credentials yields ciphertext. But a compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense - encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes - reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. +*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. 
SQLCipher encryption at rest limits the damage from physical device loss. A compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense — encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes — reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. -*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently. A twenty-person team may run five schema versions simultaneously. The expand-contract pattern - new fields additive and backward-compatible during a compatibility window, old fields retired once all active nodes have updated - handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. +*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently — a twenty-person team may run five schema versions simultaneously. The expand-contract pattern handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. -*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. 
The three-tier GC policy - aggressive compaction for stable documents, 90-day retention for active collaboration documents (configurable per deployment; Chapter 6 derives the default), indefinite retention for compliance-classified records bounded in practice by jurisdiction-specific schedules (six years for HIPAA (Health Insurance Portability and Accountability Act), seven for SOX, as configured) - keeps growth bounded. But GC in a peer-to-peer system requires coordination. A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. This architecture addresses it. It does not make it disappear. +*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. The three-tier GC policy — aggressive compaction for stable documents, 90-day retention for active collaboration documents, indefinite retention for compliance-classified records bounded by jurisdiction-specific schedules — keeps growth bounded. A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. The architecture addresses it; it does not make it disappear. Part II is six rounds of adversarial review by people who were looking for exactly these problems. @@ -213,11 +211,11 @@ Part II is six rounds of adversarial review by people who were looking for exact The five-layer model admits two canonical deployment shapes. Both use the same Harborline component surface, the same sync protocol, and the same five-layer architecture. They differ in where the authoritative data location lives. 
-**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid - a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. A user who never enables sync has a fully functional local application. A user who enables sync connects to a managed relay or a direct peer via the gossip protocol. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation - pre-1.0, in active development. +**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid — a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. A user who never enables sync has a fully functional local application. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation — pre-1.0, in active development. -**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. 
The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default - it routes encrypted deltas but cannot read them. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want the deployment simplicity of a hosted service alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation - pre-1.0, in active development. +**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want deployment simplicity alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation — pre-1.0, in active development. -Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` (pre-1.0). Neither shape changes the sync protocol, the CAP positioning model, or the storage architecture. The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. +The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. 
A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. --- @@ -225,11 +223,11 @@ Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` This architecture shifts three fundamental habits. -**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Command handlers succeed on local durability, not remote confirmation. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes - operations rather than current-state assignments. This discipline is the fundamental shift. +**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes — operations rather than current-state assignments. This discipline is the fundamental shift. -**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent - which means it fails in the field. +**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. 
Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent. -**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The system's failure modes are designed to be visible. The developer's job is to wire those signals to the UI correctly, not to paper over them. +**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The developer's job is to wire those signals to the UI correctly, not to paper over them. The five layers in one diagram are the picture Part II will adversarially test. Everything that follows is detail. From e2b048ededc9b2458cd0e656a097fc458b9336ae Mon Sep 17 00:00:00 2001 From: Chris Wood Date: Fri, 22 May 2026 14:30:17 -0400 Subject: [PATCH 2/3] =?UTF-8?q?docs(vol-1):=20ch02=20prose=20review=20?= =?UTF-8?q?=E2=80=94=20trim=20to=20target=20+=20advance=20to=20voice-check?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prose review pass (Stage 5). Cut from 5,509 to 4,401 words (target 4,000 +/-10%). Removed academic scaffolding ("this dissertation", "my contribution"), passive constructions, hedging phrases, and restatement sentences. Renamed "What This Dissertation Adds" to "What This Book Adds". Advanced ICM marker to voice-check. 
Co-Authored-By: Claude Sonnet 4.6 --- .../ch02-local-first-serious-stack.md | 120 ++++++++---------- 1 file changed, 54 insertions(+), 66 deletions(-) diff --git a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md index 20e27ae..4e9c5fa 100644 --- a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md +++ b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md @@ -1,89 +1,87 @@ # Chapter 2 - Local-First: From Sync Toy to Serious Stack - + --- -In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties - a testable definition the field could use to separate what counts from what merely calls itself local-first. +In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties — a testable definition the field could use to separate what counts from what merely calls itself local-first. -The seven properties expose exactly where every existing attempt falls short - including the best commercial ones. Getting to all seven requires more than clever sync. It requires running a complete application stack at the edge, not a smarter cache of someone else's database. +The seven properties expose exactly where every existing attempt falls short, including the best commercial ones. Getting to all seven requires more than clever sync. 
It requires running a complete application stack at the edge, not a smarter cache of someone else's database. -The word "serious" in this chapter's title is not a claim about complexity. It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven. And it adds what the ideals paper did not. The deployment model. The security model. The governance model. The migration story. The path to commercial sustainability. **The composition is the contribution** - not the individual components, which are all production-proven somewhere, but the assembly that lets them be one system. +The word "serious" in this chapter's title is not a claim about complexity. It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven — and adds what the ideals paper did not: a deployment model, a security model, a governance model, a migration story, and a path to commercial sustainability. **The composition is the contribution** — not the individual components, which are all production-proven somewhere, but the assembly that lets them function as one system. --- ## The Seven Ideals -The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar - a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. +The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar — a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. 
-**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. A project management tool that must phone home to load the task list fails the property during the first round-trip. It fails permanently when the network is gone. +**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. A project management tool that phones home to load the task list fails the property during the first round-trip and fails permanently when the network is gone. -**Work is not trapped on one device.** Your data on your laptop should be your data on your desktop, your tablet, your colleague's machine. Sync across devices and across people - not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. +**Work is not trapped on one device.** Data on a laptop should be data on a desktop, a tablet, a colleague's machine. Sync across devices and across people — not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. -**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." Optional means the full application works without any network connection, indefinitely, and then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API (Application Programming Interface). It eliminates every app whose write path queues locally and waits. 
Real offline requires that the local node hold an authoritative copy of data it is allowed to act on. +**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." Optional means the full application works without any network connection, indefinitely, then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API and every app whose write path queues locally and waits. Real offline requires the local node to hold an authoritative copy of data it is allowed to act on. -**Seamless collaboration.** Multiple people should be able to edit the same data simultaneously - without explicit locking, without "checkout" workflows, without a person designated to resolve conflicts manually. This is the property that made centralized servers feel necessary. If two people are writing concurrently, something has to decide the order. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. +**Seamless collaboration.** Multiple people should be able to edit the same data simultaneously — without explicit locking, without checkout workflows, without a person designated to resolve conflicts manually. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. -**The long now.** Your data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about.
A more recent and more consequential demonstration came in 2022. Adobe suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement [10]. Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma ([figma.com](https://www.figma.com/), the design tool) blocked Russia-based users in compliance with US sanctions [11]. Dozens of other Western SaaS (Software as a Service) providers followed. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool - or the jurisdiction the company operates in. Proprietary sync formats - even sync formats that feel invisible - fail this property. +**The long now.** Data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about. In 2022, Adobe suspended service across Russia and CIS markets under sanctions enforcement [10]. Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma blocked Russia-based users in compliance with US sanctions [11]. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool — or the jurisdiction the company operates in. 
-**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. Distributed local stores raise the cost of attack - an adversary who compromises one node gets one user's data, not all users' data. Local storage without encryption creates a different problem: physical access to the device is sufficient. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. A distinct threat model applies in jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements: architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. A local app that stores data in plaintext fails this property as badly as a cloud app does. +**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. Distributed local stores raise the cost of attack — an adversary who compromises one node gets one user's data, not all users' data. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. -**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee. It is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. 
Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. +**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee — it is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. -Seven properties. Together they describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. The closest candidate is Anytype, which satisfies five - CRDT (Conflict-free Replicated Data Type)-based collaboration and zero-knowledge encryption by default - but falls short on the long now (its full-fidelity export uses a proprietary Any-Block format no competing app reads natively) and on ultimate ownership (the application layer is "source available," not open-source; structural vendor independence depends on a contractual arrangement with the Any Association, not the architecture alone). Kleppmann himself no longer treats the seven as a binary checklist. At Local-First Conf 2024 he acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. No production app has cleared them all. +Together, the seven properties describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. At Local-First Conf 2024, Kleppmann acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. 
--- ## What Exists Today: A Taxonomy of Local-First Attempts -The local-first community has produced serious work. The apps below are not failures. They are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. +The local-first community has produced serious work. The apps below are not failures — they are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. ### The Document Sync Apps (Obsidian, Notion) -Obsidian stores notes as plain markdown files on your local filesystem. This is a genuinely correct choice. Plain text in an open format, on your own storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears tomorrow, the files remain and every text editor on the planet reads them. The long-now property is satisfied by the data format alone. +Obsidian stores notes as plain markdown files on a local filesystem. Plain text in an open format, on user-controlled storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears, the files remain and every text editor reads them. The long-now property is satisfied by the data format alone. -Where Obsidian stops is structured data and collaboration. Markdown files have a limited conflict resolution strategy: when two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original. Resolution is manual. 
For a solo note-taker, this is an infrequent and tolerable annoyance. For a team using shared notes to track client work, project status, or decisions - where concurrent edits are the norm - the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. +Where Obsidian stops is structured data and collaboration. When two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original; resolution is manual. For a solo note-taker, this is an infrequent and tolerable annoyance. For a team using shared notes to track client work, project status, or decisions — where concurrent edits are the norm — the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. -The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same convention. The moment a team needs structured data - not documents, but records - Obsidian's model breaks down. It is a document tool that happens to sync, not a structured-data tool with local-first properties. +The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. 
None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same one. The moment a team needs structured data — not documents, but records — Obsidian's model breaks down. -Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers throughout. Concurrent edits go through those servers, which hold the authoritative copy. The long-now property fails immediately. Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs - a representation, not a migration. The relational structure, the filters, the formulas, the comment threads - none of these export faithfully to a format another application understands. +Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers. Concurrent edits go through those servers. The long-now property fails immediately: Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs — a representation, not a migration. The relational structure, the filters, the formulas, the comment threads — none export faithfully to a format another application understands. -Both approaches demonstrate a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent - which is what CRDTs over a typed document store provide. 
+Both approaches expose a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent — which is what CRDTs over a typed document store provide. -### The Lightweight Replica Apps (Linear ([linear.app](https://linear.app/), the issue tracker), Liveblocks) +### The Lightweight Replica Apps (Linear, Liveblocks) -Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first. The sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant - no loading spinners, no optimistic-update lag, no visible round trips. The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation (status changes on issues, comment submissions, project mutations) are visibly queued rather than silently dropped - but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. +Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first; the sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant — no loading spinners, no optimistic-update lag, no visible round trips. 
-Background jobs - notifications, automations, integrations - run server-side. An automation that moves issues between states when conditions are met does not run on the local node. It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. +The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation are visibly queued rather than silently dropped — but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. -The practical consequence: Linear passes the "no spinners" property and partially passes "the network is optional" for reads. It does not pass network-optional for writes to server-owned records, does not pass peer-to-peer collaboration without Linear's relay, does not pass vendor independence, and does not pass the long now - Linear's data lives in Linear's format, accessible through Linear's API, exportable to CSV only. Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. +Background jobs — notifications, automations, integrations — run server-side. An automation that moves issues between states when conditions are met does not run on the local node. It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. 
-Replicache ([replicache.dev](https://replicache.dev/), the sync framework from Rocicorp) is the most direct production competitor in this category and the system most often suggested as an off-the-shelf path to local-first apps. Replicache provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache for free [9]. The model is correct for the sync layer it covers - optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side to validate against authoritative state, and offline writes queue against an eventual reconciliation that the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. The framework is also deliberately scoped to the sync transport - schema migration, key custody, MDM packaging, and the business model are application-developer responsibilities, not framework features. +Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. -### The Local-First Finance App (Actual Budget) - -Actual Budget runs entirely offline by default - no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available, because its operation does not depend on the network at any point. +Replicache ([replicache.dev](https://replicache.dev/)) is the most direct production competitor in this category. 
It provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache [9]. The model is correct for the sync layer it covers — optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side, and offline writes queue against a reconciliation the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. -This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control - the user has a file on their disk). It makes a credible attempt at the fifth (the long now) by virtue of using an open database format that other tools can read. +### The Local-First Finance App (Actual Budget) -Where Actual Budget stops is collaboration and multi-device sync. The application is single-user by design. Two people cannot jointly manage a budget in Actual Budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service Actual Budget offers addresses multi-device access for a single user - the budget file syncs across the user's own devices through a hosted relay. This reintroduces a central server, though the server's role is deliberately minimal: relay and backup, not authority. +Actual Budget runs entirely offline by default — no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available. -The team collaboration case does not exist. 
Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. Its data model is single-user because its design is single-user. Adapting it to multi-user team workflows would require adding CRDTs, a distributed data model, access control, and a sync protocol - at which point it would no longer be Actual Budget, but a substantially new system. +This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control). It makes a credible attempt at the fifth (the long now) by using an open database format that other tools can read. -The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part that Actual Budget does not attempt. +Where Actual Budget stops is collaboration and multi-device sync. Two people cannot jointly manage a budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service addresses multi-device access for a single user through a hosted relay — which reintroduces a central server, though its role is deliberately minimal: relay and backup, not authority. The team collaboration case does not exist. Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. -### The Research Prototypes (Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library), Ink & Switch Essays) +The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part Actual Budget does not attempt. 
-Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library. Given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. The library is production-quality for its intended use case. Ink & Switch has published detailed essays on collaborative applications built on Automerge - Pushpin, Backchat, Trellis - that demonstrate what local-first collaboration looks like in practice when the data model is right. +### The Research Prototypes (Automerge, Ink & Switch Essays) -The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport - something to move operations between peers. Several sync backends exist (the Automerge sync server, AutomergeRepo), and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. +Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge)) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library: given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. 
Ink & Switch has published detailed essays on collaborative applications built on Automerge — Pushpin, Backchat, Trellis — that demonstrate what local-first collaboration looks like in practice when the data model is right. -The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. The essays document what is possible and identify what remains to be engineered. They are research artifacts, not shipping products. A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport. They have not acquired a production system. They have acquired the foundation for one. +The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport — something to move operations between peers. Several sync backends exist and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. -The document-centric nature of Automerge is also a structural constraint. Documents are a natural fit for rich text, drawings, and unstructured collaborative content. A team running a field operation with structured records - work orders, inspection logs, invoices, asset registries - needs typed records with schema migration, not just documents. The CRDT merge semantics generalize across both cases, but the tooling, the query model, and the schema evolution story are different problems that Automerge leaves to application builders. +The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. 
A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport — not a production system, but the foundation for one. ```mermaid graph LR @@ -115,9 +113,9 @@ graph LR --- -## What Each Gets Right - and Where It Stops +## What Each Gets Right — and Where It Stops -Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. Each dependency reflects a real problem the approach did not attempt to solve. +Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. 
The pattern becomes clearest in a like-for-like comparison across the four axes that determine whether a system meets a serious local-first bar: @@ -132,31 +130,27 @@ The pattern becomes clearest in a like-for-like comparison across the four axes | **Actual Budget** | Fully local + optional self-hosted sync | User-held SQLite | User-device only | Open-source; user runs everything | | **Automerge** | Library + sync transport (developer-supplied) | Whatever the application chooses | Whatever the application chooses | Open-source library | -The table makes the gap visible. Every system that satisfies vendor-independent data ownership stops short of team collaboration; every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node - the composition that no system in this table currently delivers. +Every system that satisfies vendor-independent data ownership stops short of team collaboration. Every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node — which no system in this table currently delivers. --- ## The Missing Step: Full Node, Not Smart Cache -The question that distinguishes this architecture from the approaches above is this: - -> What if a user's workstation ran a full node of the system - including state, business logic, and sync - such that "the cloud" is merely another peer, not the source of truth? - A smart cache knows what the server knows, slightly earlier. A full node knows what the user's data is. The distinction matters when the server is down, when the vendor goes away, when the network is unreachable, and when the user needs to understand, export, or migrate their data. 
-A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup - assistance for coordination and disaster recovery, not a source of truth. +A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup — assistance for coordination and disaster recovery, not a source of truth. -Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When the sync eventually completes, there may be conflicts between the superintendent's offline writes and changes made by others - conflicts the smart-cache app resolves by whatever heuristic the vendor chose, without surfacing the conflict to the user. +Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When sync eventually completes, the smart-cache app resolves conflicts using whatever heuristic the vendor chose, without surfacing them to the user. 
-A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally - the automation runs on the node, not on a server. When the sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. +A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally. When sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. -The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach. The full node acts on behalf of the user. +The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach; the full node acts on behalf of the user. -The pattern has operational precedent at scale. Modern point-of-sale systems - Square Reader and Toast - operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns, and the merchant's authoritative state advances against the local replica until then. 
Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. These products demonstrate user-device-replica operation at commercial scale in domains where the cost of failed offline operation is concrete. What I describe in this dissertation generalizes that pattern beyond payments and field service to structured-data applications more broadly: typed records with evolving schemas, collaborative edits across multiple peers, and enterprise governance that survives procurement review. +The pattern has operational precedent at scale. Square Reader and Toast operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns. Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. Both demonstrate user-device-replica operation at commercial scale in domains where failed offline operation has concrete cost. -This reframes what "offline support" means. Offline support in the smart-cache model means "some operations work offline, with degraded functionality." Offline support in the full-node model means "all operations work offline, identically." The distinction is not a feature comparison. It is a structural property that follows from where authority lives. +"Offline support" in the smart-cache model means some operations work offline, with degraded functionality. In the full-node model it means all operations work offline, identically. The distinction is not a feature comparison — it is a structural property that follows from where authority lives. -Every component of this model has a production analogue that validates it separately. 
CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production, and the Automerge library is deployed in commercial collaborative applications - though Automerge users have to budget for known operational costs (document size growth with edit history, cold-sync time on long-lived documents, and garbage-collection cadence) that the library leaves to the application. Figma's multiplayer editor is not a pure CRDT deployment - its engineers describe it as "inspired by multiple separate CRDTs" over a server-authoritative, per-property merge - but it independently validates that per-property conflict resolution works for real-time collaborative editing at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. None of these components are speculative. My contribution is the *composition* - specifically, three pieces no other published architecture combines: a per-record CAP boundary that lets AP-class records and CP-class records coexist in one system, an MDM (Mobile Device Management)-deployable installer model that lets enterprise IT ship full-node software without bespoke onboarding, and an AGPLv3-with-managed-relay business model that makes the architecture economically viable without forcing vendor data custody. +Every component of this model has a production analogue. CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production. 
The Automerge library is deployed in commercial collaborative applications, though users must budget for known operational costs — document size growth, cold-sync time on long-lived documents, and garbage-collection cadence — that the library leaves to the application. Figma's multiplayer editor independently validates that per-property conflict resolution works at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. ```mermaid graph TB @@ -178,38 +172,32 @@ graph TB --- -## What This Dissertation Adds - -The seven Kleppmann ideals [1] define the target. They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement and not just a preference, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a "couch device" returning after six months offline, and generates revenue that funds ongoing development. +## What This Book Adds -The regulatory pressure is now global, and the laws cluster by region. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards - making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis rather than an architectural preference, with national implementation guidance from Germany's BSI and France's CNIL. +The seven Kleppmann ideals [1] define the target. 
They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a device returning after six months offline, and generates revenue that funds ongoing development. -The pattern repeats across regions with named regulators in each: India's DPDP Act 2023 [5] and the RBI's payment-data localization circular; the UAE's DIFC DPL 2020 [6]; Russia's Federal Law 242-FZ [7]; China's PIPL (Personal Information Protection Law) 2021; Brazil's LGPD (Lei Geral de Proteção de Dados); South Africa's POPIA (Protection of Personal Information Act); Nigeria's NDPR (Nigeria Data Protection Regulation); Japan's APPI (Act on the Protection of Personal Information); South Korea's PIPA (Personal Information Protection Act); and the GCC's emerging cluster (KSA's PDPL, Bahrain's PDPL). Each, in different language, treats data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix across these and ~30+ other frameworks is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. +The regulatory pressure is now global. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards — making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis, with national implementation guidance from Germany's BSI and France's CNIL. 
India's DPDP Act 2023 [5], the UAE's DIFC DPL 2020 [6], Russia's Federal Law 242-FZ [7], China's PIPL, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's emerging cluster each treat data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, data on the user's own hardware is the architecture that makes compliance tractable. -The existing implementations - Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage - each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None of them addresses the full set, and none provides the composition. +The existing implementations — Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage — each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None addresses the full set, and none provides the composition. -The seven properties define target state. They do not tell you how to get there - what phases to sequence, what assumptions to validate, what to trade when two properties conflict, what to verify when you claim you are done. This dissertation is the plan that sits under the properties: phases in the five-layer stack and the deployment zones (Chapter 3, Chapter 4), adversarial validation in the council chapters (Part II), verification specification (Part III), and execution playbooks (Part IV). +Three disciplines separate working implementations from prototypes that stall. 
First, integration is where local-first projects die — every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, security is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id) used opaquely, with the DEK/KEK hierarchy composed against a specification a cryptographic engineer has reviewed. Third, long-term portability has one product-level decision that can kill the architecture alone — invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs or Automerge and inherit their portability guarantees. The choice, not the invention, is what makes it feasible. -Three disciplines separate working implementations from prototypes that stall. First, integration is where local-first projects die - every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, Property 6 is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id reference) are used opaquely, and the DEK (Data Encryption Key)/KEK (Key Encryption Key) hierarchy composes them against a specification a cryptographic engineer has reviewed. Third, Property 5 has one product-level decision that can kill the architecture alone - invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs ([github.com/yjs/yjs](https://github.com/yjs/yjs), the JavaScript CRDT library) or Automerge and inherit their portability guarantees. Feasibility is contingent on choosing, not inventing. +The contribution here is the composition. Not new primitives — every component has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. 
The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. -My contribution is the composition. Not new primitives - every component in this architecture has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. - -What I assemble from those proven components: +What that assembly produces: - A node architecture with a stable microkernel and domain plugins under strict versioned contracts, so the system can evolve without breaking in-field deployments. - A per-record CAP positioning model that treats CRDT-merge records and lease-coordinated records as first-class distinct classes, with a defined boundary and a defined handoff between them. - A three-tier CRDT GC policy that keeps document growth bounded without sacrificing merge correctness for active peers. -- A key hierarchy - root organization key, per-role key encryption keys, per-document data encryption keys - that makes key rotation proportional to document count rather than document size, and makes member removal cryptographically effective rather than contractually promised. +- A key hierarchy — root organization key, per-role key encryption keys, per-document data encryption keys — that makes key rotation proportional to document count, and makes member removal cryptographically effective rather than contractually promised. 
- A schema migration strategy using expand-contract, bidirectional lenses, and epoch coordination that allows nodes running different schema versions to coexist on a live team. - An enterprise deployment model: MDM-compatible installers, SBOM (Software Bill of Materials) generation, code signing and notarization, air-gap operation, incident response runbooks. - A business model: AGPLv3 core, managed relay as the paid service, relay economics that become cash-flow positive before meaningful scale. - A governance model: foundation-backed structure, community contributor path, dual-license CLA for enterprise customers. -The managed relay is a residual vendor dependency the architecture does not eliminate - it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction between SaaS vendor dependency and managed-relay dependency is not rhetorical: the former holds decryptable data; the latter does not. - -The architecture stands on the local-first community's work. The paper that named the seven ideals [1] is the benchmark against which my dissertation's design is measured throughout. The Ink & Switch essays on Automerge, Cambria, and collaborative document design are the intellectual foundation for the CRDT and schema evolution sections. Kleppmann's distributed systems work [2] provides the vocabulary that runs throughout Part III. +The managed relay is a residual vendor dependency the architecture does not eliminate — it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction is not rhetorical: a SaaS vendor holds decryptable data; a managed relay does not. 
-The composition is the contribution. The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for determining when this architecture is the right choice and when it is not. +The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for when this architecture is the right choice and when it is not. --- From 116119f3f035bda472b67eef420bad4f60797ee1 Mon Sep 17 00:00:00 2001 From: Chris Wood Date: Fri, 22 May 2026 14:37:04 -0400 Subject: [PATCH 3/3] =?UTF-8?q?docs(vol-1):=20ch16=20prose=20review=20?= =?UTF-8?q?=E2=80=94=20trim=20to=20target=20+=20advance=20to=20voice-check?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prose review pass (Stage 5). Trimmed from 6,549 to 3,849 words (aggressive structural cut). Advanced ICM marker to icm/voice-check. Co-Authored-By: Claude Sonnet 4.6 --- .../ch16-persistence-beyond-the-node.md | 186 +++++++----------- 1 file changed, 69 insertions(+), 117 deletions(-) diff --git a/vol-1/part-3-reference-architecture/ch16-persistence-beyond-the-node.md b/vol-1/part-3-reference-architecture/ch16-persistence-beyond-the-node.md index c1a9b3c..512d12a 100644 --- a/vol-1/part-3-reference-architecture/ch16-persistence-beyond-the-node.md +++ b/vol-1/part-3-reference-architecture/ch16-persistence-beyond-the-node.md @@ -1,6 +1,6 @@ # Chapter 16 - Persistence Beyond the Node - + @@ -9,23 +9,23 @@ ## The Problem Single-Node Storage Cannot Solve -A node that stores data only on its local device fails in three ways. Drives fail. Phones are lost. Laptops are stolen. Multi-gigabyte local databases work for primary work data - not for every archive, every team member's history, every binary asset ever uploaded. Users move to new devices and expect their work to follow them. +A node that stores data only on its local device fails in three ways. Drives fail. Phones are lost. 
Laptops are stolen. Multi-gigabyte local databases suit active working data — not every archive, every team member's history, every binary asset ever uploaded. Users move to new devices and expect their work to follow. -Local-first architecture does not mean data lives only on one machine. It means the node is the authority over the data it holds. The architecture must then specify how that data survives beyond it. +Local-first architecture does not mean data lives on one machine. It means the node is the authority over the data it holds. The architecture must specify how that data survives beyond it. --- ## Five-Layer Storage Architecture -Persistence in the local-first architecture composes five tiers, specified in Chapter 12 §The Five-Layer Storage Architecture. This chapter focuses on what each tier owns operationally - bucket subscription, lazy fetch, snapshot rehydration, backup UX, relay metadata posture, and disaster recovery - assuming the reader has the five-tier model from Ch12. Tiers 4 and 5 are opt-in; the core system is fully operational on tiers 1–3 alone. +Persistence in the local-first architecture composes five tiers, specified in Chapter 12 §The Five-Layer Storage Architecture. This chapter focuses on what each tier owns operationally — bucket subscription, lazy fetch, snapshot rehydration, backup UX, relay metadata posture, and disaster recovery — assuming the five-tier model from Ch12. Tiers 4 and 5 are opt-in; the core system is fully operational on tiers 1–3 alone. --- ## Declarative Sync Buckets -Full replication to every node breaks at scale in two ways. As a storage problem, multi-gigabyte local databases overwhelm devices with constrained storage. As a security problem, nodes hold data they are not authorized to use, protected only by application-layer access control - a security boundary one application bug wide. +Full replication to every node fails at scale in two ways. 
As a storage problem, multi-gigabyte databases overwhelm devices with constrained storage. As a security problem, nodes hold data they are not authorized to use, protected only by application-layer access control — a boundary one application bug wide. -The architecture solves both problems with declarative sync buckets. A bucket is a named, declaratively specified subset of the team dataset. Bucket membership is tied to role attestations, not to application-layer decisions made after data arrives at the node. Non-eligible nodes never receive bucket events because the sync daemon excludes them at capability negotiation - not because the application filters them afterward. +Declarative sync buckets solve both. A bucket is a named, declaratively specified subset of the team dataset. Bucket membership ties to role attestations, not to application-layer decisions made after data arrives at the node. Non-eligible nodes never receive bucket events because the sync daemon excludes them at capability negotiation. Buckets are declared in YAML: @@ -53,16 +53,14 @@ buckets: Each bucket specifies: -- **name** - unique identifier used in sync daemon routing and in backup manifests. -- **record_types** - the document types included in the bucket. -- **filter** - a predicate evaluated per record against peer attributes. Only records that satisfy the filter are replicated to a given peer. -- **replication** - `eager` or `lazy`. Eager buckets sync immediately on connect. Lazy buckets use demand-driven fetch. -- **required_attestation** - the role attestation a peer must present to receive bucket events. Attestation is verified cryptographically by the sync daemon before any data flows. -- **max_local_age_days** - for lazy buckets, the maximum age of locally cached records before eviction. Records older than this threshold are evicted; their stubs are retained. +- **name** — unique identifier used in sync daemon routing and backup manifests. 
+- **record_types** — the document types included in the bucket. +- **filter** — a predicate evaluated per record against peer attributes. Only records satisfying the filter replicate to a given peer. +- **replication** — `eager` or `lazy`. Eager buckets sync immediately on connect. Lazy buckets use demand-driven fetch. +- **required_attestation** — the role attestation a peer must present to receive bucket events. The sync daemon verifies it cryptographically before any data flows. +- **max_local_age_days** — for lazy buckets, the maximum age of locally cached records before eviction. -Bucket eligibility is evaluated at capability negotiation - the initial handshake described in Chapter 14. The sync daemon constructs the minimal subscription set from the peer's verified attestations. A peer with only `team_member` receives `team_core` and `archived_projects`. A peer with `financial_role` receives all three. A peer with neither attestation receives nothing. - -Data minimization operates at this layer. Each document schema defines a subscription scope - the minimal set of fields required for each role. The daemon enforces these scopes when responding to subscriptions. Unauthorized data never reaches a node that is not authorized to hold it, because the sync daemon never sends it. +Bucket eligibility is evaluated at capability negotiation — the handshake described in Chapter 14. The sync daemon constructs the minimal subscription set from the peer's verified attestations. A peer with only `team_member` receives `team_core` and `archived_projects`. A peer with `financial_role` receives all three. Unauthorized data never reaches a node that is not authorized to hold it. --- @@ -70,15 +68,9 @@ Data minimization operates at this layer. Each document schema defines a subscri Eager replication serves active working data. Lazy replication serves archives, large binary assets, and records with infrequent access. -A lazy-replicated record is represented locally as a stub. 
The stub contains: - -- The record's identifier -- Metadata required for display and navigation (title, type, author, last-modified timestamp) -- A content hash +A lazy-replicated record is represented locally as a stub containing the record's identifier, metadata required for display and navigation, and a content hash. The stub lets the application render navigation, search indexes, and list views without fetching full content. When the user opens a record, the application detects the stub, fetches full content from a peer or from the backup tier, verifies the content hash, and writes the full record to the local database. -The stub enables the application to render navigation, search indexes, and list views without fetching full content. When the user opens a record, the application detects the stub, fetches full content from a peer or from the backup tier, verifies the content hash, and writes the full record to the local database. - -Nodes enforce a configurable local storage budget. The default is 10 GB. When the node approaches the budget ceiling, the sync daemon evicts least-recently-used records from lazy buckets. Eviction converts a full record back to a stub - the identifier, metadata, and content hash are retained, and the content is released. The record is not deleted. It remains accessible on demand. +Nodes enforce a configurable local storage budget (default 10 GB). When the node approaches the budget ceiling, the sync daemon evicts least-recently-used records from lazy buckets. Eviction converts a full record back to a stub — the identifier, metadata, and content hash are retained; the content is released. The record is not deleted. It remains accessible on demand. ```mermaid sequenceDiagram @@ -99,79 +91,55 @@ sequenceDiagram LocalDB-->>App: full record ``` -Content hash verification on re-fetch is mandatory. A fetched record whose hash does not match the stub's stored hash is rejected and re-requested from an alternate peer. 
This protects against both corruption and deliberate tampering by a compromised peer. - -The storage budget, eviction policy, and minimum stub retention period are configurable via `Harborline.Kernel.Buckets`. The defaults suit most deployments; teams with specialized storage constraints adjust them at the workspace level. +Content hash verification on re-fetch is mandatory. A fetched record whose hash does not match the stub's stored hash is rejected and re-requested from an alternate peer, protecting against corruption and deliberate tampering. The storage budget, eviction policy, and minimum stub retention period are configurable via `Harborline.Kernel.Buckets`. --- -## Per-Data-Class Device-Distribution +## Per-Data-Class Device Distribution - + -Device fleets in production are heterogeneous by design. A restaurant's floor tablet - handled by servers - holds customer orders and table assignments. The same restaurant's back-office laptop - held by the owner - holds payroll and vendor invoices. Uniform replication fails this fleet two ways at once: the floor tablet holds payroll records a server has no operational need for, and a constrained Android tablet's storage budget is consumed by classes it will never display. +Device fleets are heterogeneous by design. A restaurant floor tablet holds customer orders and table assignments; the back-office laptop holds payroll and vendor invoices. Uniform replication fails both ways: the floor tablet holds payroll records a server has no operational need for, and a constrained device's storage budget fills with classes it will never display. -The first failure is a security-surface problem. Application access controls prevent a server from executing a payroll lookup, but once payroll records sit in the local encrypted database the risk surface shifts to a decryption-key exposure, a debugger attach, or a future application bug. A record that is not on a device cannot be leaked from that device. 
The second failure is a storage-budget problem. The bucket model in §Declarative Sync Buckets already filters at bucket granularity, asking what attestations the user holds. Per-data-class device-distribution adds an orthogonal axis: not what the user is authorized to see, but what classes this physical device's operational role requires it to hold at all. The distinction matters in MDM (Mobile Device Management)-managed fleets where IT policy sets device class independently of the user's role. +The first failure is a security-surface problem. Application access controls prevent unauthorized payroll lookups, but once payroll records sit in the local encrypted database the risk surface shifts to a decryption-key exposure or a future application bug. A record not on a device cannot be leaked from it. The second failure is a storage-budget problem. The bucket model filters by user attestation; per-data-class device distribution adds an orthogonal axis — not what the user is authorized to see, but what classes this device's operational role requires. ### The class-subscription manifest -Each device carries a signed manifest declaring the data classes it accepts. The manifest is not an attestation. Role attestations are user-bound claims issued by the identity authority that authorize access to specific buckets; the class-subscription manifest is device-bound policy, set by the MDM operator or by the user in consumer deployments, declaring which classes the device's operational role requires. A device can hold a `financial_role` attestation and still exclude detailed customer-record classes through its manifest - the manifest restricts the attestation-granted set, never expands it. +Each device carries a signed manifest declaring the data classes it accepts. The manifest is device-bound policy, set by the MDM operator or by the user in consumer deployments. It is not a role attestation. 
A device can hold a `financial_role` attestation and still exclude detailed customer-record classes through its manifest — the manifest restricts the attestation-granted set, never expands it. -The manifest is a signed CBOR (Concise Binary Object Representation) document under the device's own Ed25519 keypair, making it tamper-evident and attributable. It carries the device identifier, the issuer (an MDM authority key, or a self-signed user key in consumer deployments), the list of accepted data-class identifiers, an issued-at timestamp, an expiry, and the signature. The manifest travels with the device identity during the five-step handshake (Ch14 §Five-Step Handshake), where the sending peer reads it before constructing any outbound delta. +The manifest is a signed CBOR document under the device's own Ed25519 keypair, carrying the device identifier, issuer, accepted data-class identifiers, issued-at timestamp, expiry, and signature. It travels with the device identity during the five-step handshake (Ch14 §Five-Step Handshake), where the sending peer reads it before constructing any outbound delta. -A data class is a higher-level abstraction over buckets. Each bucket entry in the YAML carries an optional `data_class` label; a class resolves to the union of bucket entries sharing that label. The manifest operates at the class level; `Harborline.Kernel.Buckets` resolves class to bucket membership internally. Every manifest change produces a new signed version retained in the audit log for compliance reconstruction. PowerSync's bucket-definition model [5] influenced this shape with one inversion: PowerSync evaluates rules server-side per client parameter; the architecture evaluates the manifest client-side as device-declared policy and applies it on the sending peer. +A data class is a higher-level abstraction over buckets. Each bucket entry carries an optional `data_class` label; a class resolves to the union of bucket entries sharing that label. 
`Harborline.Kernel.Buckets` resolves class to bucket membership internally. ### Sync-daemon push filter -`Harborline.Kernel.Sync` on the sending node applies the receiver's manifest as a push filter before constructing outbound deltas. The filter sits at the same tier as the stream-level scope in Ch14 §Data Minimization at the Stream Level - after attestation verification, before delta construction. The two compose: the attestation filter removes streams the receiver lacks role authorization for; the class-subscription filter removes record-class operations within otherwise-authorized streams. - -The send tier drops records of an excluded class silently. The receiving daemon never sees the operation. The filter emits no error, mirroring the existing field-level out-of-scope behavior. The filter operates on the data-class label attached at write time; classes are declared in the document schema and assigned at record creation. Reclassification at runtime is the domain of event-triggered escalation (Ch23 §Event-Triggered Re-classification) and composes with this filter through the eviction protocol below. Filter evaluation is O(1) per operation - a hash-set membership check against the receiver's accepted classes. ElectricSQL's shape filtering [4] is the closest production analogue at the WAN-sync level; the architecture differs in operating on schema-declared class labels rather than SQL `WHERE` predicates. +`Harborline.Kernel.Sync` on the sending node applies the receiver's manifest as a push filter before constructing outbound deltas. The filter sits at the same tier as the stream-level scope in Ch14 §Data Minimization at the Stream Level — after attestation verification, before delta construction. Attestation removes streams the receiver lacks role authorization for; the class-subscription filter removes record-class operations within otherwise-authorized streams. Records of an excluded class are dropped silently; the receiving daemon never sees the operation. 
Filter evaluation is O(1) per operation — a hash-set membership check against the receiver's accepted classes. ### Cross-class references: the policy-blocked placeholder -A record in class A that holds a reference to a record in class B presents a problem on a device subscribed to class A but not class B: the A-record arrives with a reference the device cannot resolve locally. Three responses are possible - refuse delivery of the A-record, deliver it with a reference that silently returns null, or deliver it with an explicit placeholder. - -The architecture chooses the third. The placeholder follows the stub model from §Lazy Fetch and Storage Budgets, with one critical difference. **A lazy-evicted stub is fetchable on demand. A class-excluded placeholder is not.** The device's manifest excludes the referenced class, and the sync daemon will not retrieve it regardless of demand. - -This is where consumer-software analogues mislead. OneDrive Files On-Demand [2] presents a placeholder identical to a downloaded file in Explorer; the file fetches transparently on first access. iCloud's Optimize Mac Storage [3] removes local content and re-materialises it on access. In both, the stub indicates deferred latency, not policy denial. Dropbox Selective Sync [1] comes closer - an excluded folder simply does not appear locally - but creates an untyped void rather than a typed placeholder, leaving applications with broken paths and no semantic signal. +A record in class A holding a reference to a record in class B presents a problem on a device subscribed to class A but not class B. The architecture delivers the A-record with an explicit placeholder for the B-reference rather than refusing delivery or returning a silent null. -The class-excluded placeholder differs in kind. Its structure carries the referenced record's identifier, its class label, an exclusion reason of `class_not_subscribed`, and no content. 
The application renders it as a restricted-reference indicator - not a missing-data error, not a broken link, but a policy-gated boundary the user can see and reason about. A task referencing a payroll record (class: financial) on a device excluding financial records does not render as "no data found." It renders as "restricted - not available on this device." +**A lazy-evicted stub is fetchable on demand. A class-excluded placeholder is not.** The device's manifest excludes the referenced class; the sync daemon will not retrieve it regardless of demand. The placeholder carries the referenced record's identifier, its class label, an exclusion reason of `class_not_subscribed`, and no content. The application renders it as a restricted-reference indicator — not a missing-data error, a policy-gated boundary the user can see and reason about. A task referencing a payroll record on a device excluding financial classes renders as "restricted — not available on this device," not "no data found." -The UI layer enforces the rendering contract because the architecture cannot detect every misuse of a placeholder. The architecture guarantees that unresolvable cross-class references are detectable and labeled, not silent. A device holding class A verifies every class-A-internal reference; references to excluded classes carry explicit marks. +### MDM-driven manifest update -### MDM-driven manifest update and propagation +The class-subscription manifest changes by signed update. An IT administrator pushes a new manifest version through the OTA channel, signed under the MDM authority key. The receiving device's sync daemon loads the new manifest at the next capability negotiation cycle. -The class-subscription manifest changes by signed update. An IT administrator pushes a new manifest version through the OTA (Over-the-Air) update channel, signed under the MDM authority key. The receiving device's sync daemon loads the new manifest at the next capability negotiation cycle. 
The manifest version increments. The audit log records the change. Subscription change is a revocation-shaped event; cross-reference Ch23 §Collaborator Revocation for the analogous primitive at the user-attestation layer. - -When a manifest tightens, the sync daemon evicts every record of the removed class from the local database. Eviction follows the stub-conversion mechanism from §Lazy Fetch and Storage Budgets: the daemon retains identifiers and metadata as class-excluded placeholders and purges content. The daemon logs the eviction to `Harborline.Kernel.Audit` against the manifest version that triggered it. Composition with event-triggered class escalation (Ch23 §Event-Triggered Re-classification) reuses this path: when escalation moves a record into a class the device's manifest excludes, the daemon receives the class-change record, evaluates it against the current manifest, and schedules eviction. The two extensions compose at the manifest interface - escalation produces the class-change event; the manifest filter reacts to it. - -When a manifest expands, backfill proceeds according to the bucket entry's replication mode. Eager buckets backfill on the next sync cycle. Lazy buckets produce stubs immediately and full content on demand. Expansion does not blanket-fetch every newly-accepted record. Cross-reference Ch21 §21.1 Why fleet management is a distinct discipline (and the §11a–§11d sub-patterns that follow) for the administrative workflow that governs manifest update authorization, approval, and rollout. - -### Audit and observability - -An administrator who cannot verify what a device actually holds cannot reason about the fleet's data-exposure surface. Each device maintains a signed class-inventory record listing the classes it currently holds, the count of full records and placeholders per class, and the manifest version under which each class was acquired or evicted. The inventory updates on every sync session and on every manifest change. 
- -`Harborline.Foundation.Fleet` aggregates per-device inventories into a fleet-level view. An administrator queries, per device, the subscribed classes, the actual held counts, the last manifest version, and the last sync timestamp. When a device's actual held classes diverge from its current manifest - the window that opens during an offline manifest update before eviction completes - the fleet dashboard flags the discrepancy; the device resolves it at the next sync cycle. Every manifest change, every eviction, and every class backfill produces a signed, attributable, append-only entry in `Harborline.Kernel.Audit`, reusing the substrate Ch23 specifies for collaborator revocation and Ch22 specifies for key-loss recovery. Bayou's subscription model [6] is the academic precedent - device-level partial replication with explicit subscription declarations dates to 1995. The architecture's contribution is the policy-blocked placeholder semantics and the MDM-signed manifest as a first-class fleet artifact. +When a manifest tightens, the sync daemon evicts every record of the removed class, converting them to class-excluded placeholders and purging content. The eviction logs to `Harborline.Kernel.Audit` against the manifest version that triggered it. When a manifest expands, backfill proceeds by bucket replication mode: eager buckets backfill on the next sync cycle; lazy buckets produce stubs immediately and full content on demand. ### Failure modes -**Manifest conflated with attestation.** The manifest is device-bound operator policy. Attestation is user-bound identity claim. Conflating them collapses the security model - a user with `financial_role` attestation on a device whose MDM manifest excludes financial classes must not receive financial records. The manifest restricts; attestation does not override. 
- -**Placeholder treated as error state.** A class-excluded placeholder is a visible policy boundary, not a missing record, not a sync failure, not a data-integrity defect. Applications that render it as "data not found" mislead users about whether the data exists or simply is unreachable from this device. +**Manifest conflated with attestation.** A user with `financial_role` attestation on a device whose MDM manifest excludes financial classes must not receive financial records. The manifest restricts; attestation does not override. -**Manifest expansion treated as eager backfill.** The bucket's replication mode governs backfill rate. Eager backfills on the next sync cycle; lazy produces stubs and fetches on demand. Treating every expansion as eager saturates network and storage budgets. +**Placeholder treated as error state.** A class-excluded placeholder is a visible policy boundary, not a sync failure. Applications that render it as "data not found" mislead users. -**Eviction-on-tightening skipped.** When a class is removed, every record of that class on the device must convert to a class-excluded placeholder. Skipping eviction leaves orphaned content on the device after the policy change - the security benefit the manifest exists to provide collapses. - -**Forward-secrecy boundary at mid-stream subscription.** A device added to a class subscription mid-stream may not be able to decrypt historical operations encrypted under earlier session key material. Ch22 §Forward Secrecy and Post-Compromise Security specifies a per-message ratchet between session pairs (sub-pattern 46a-46b), not a per-class key chain - so the boundary the manifest expansion creates depends on whether the architecture chooses to derive class-scoped session keys from the per-message ratchet or to ship a one-time key bundle to newly-subscribing devices. The newly-subscribed device receives operations from the manifest's effective date forward in either case. 
- -**Kill trigger.** If technical review determines that the class-subscription manifest cannot coexist with the existing bucket YAML schema without a breaking change to the `required_attestation` field's role-driven semantics, escalate before continuing. The manifest's device-policy axis and the attestation's user-role axis must compose in a single bucket evaluation; if they cannot, the extension's scope changes and the present specification requires redesign. +**Eviction-on-tightening skipped.** When a class is removed, every record of that class must convert to a class-excluded placeholder. Skipping eviction leaves orphaned content on the device after the policy change. --- ## Snapshot Format and Rehydration -Reading an aggregate's state from the raw event log becomes expensive as the log grows. Snapshots exist to bound that cost. A snapshot captures the current state of an aggregate at a point in time, indexed to the last event it incorporates. +Reading an aggregate's state from the raw event log becomes expensive as the log grows. Snapshots bound that cost. A snapshot captures the current state of an aggregate at a point in time, indexed to the last event it incorporates. **Snapshot structure:** @@ -186,7 +154,7 @@ Reading an aggregate's state from the raw event log becomes expensive as the log } ``` -Snapshots are stored separately from the event log. They can be deleted and regenerated at any point without affecting correctness - the event log is the source of truth; the snapshot is a performance optimization. +Snapshots are stored separately from the event log. They can be deleted and regenerated at any point without affecting correctness — the event log is the source of truth; the snapshot is a performance optimization. **Rehydration follows four steps:** @@ -195,29 +163,29 @@ Snapshots are stored separately from the event log. They can be deleted and rege 3. Replay events from the log after `last_event_seq`. 4. 
Apply any pending upcasters to events from older schema versions. -When no valid snapshot exists - on a fresh install, after a breaking schema migration, or after explicit snapshot deletion - rehydration replays from the beginning of the log. This is correct and complete. It is simply slower. The system writes a new snapshot after rehydration to avoid repeating the replay on the next load. +When no valid snapshot exists — on a fresh install, after a breaking schema migration, or after explicit snapshot deletion — rehydration replays from the beginning of the log. This is correct and complete; it is simply slower. The system writes a new snapshot after rehydration to avoid repeating the replay on the next load. -**Interaction with schema migrations.** Snapshots are epoch-scoped and schema-scoped. After a breaking migration, old snapshots are discarded. The system rehydrates from the most recent pre-migration snapshot that still falls within the log, applies schema lenses to bring events forward to the new schema shape, and writes a new snapshot tagged with the current epoch and schema version. The migration runbook in Chapter 13 specifies the sequencing required to keep this process safe under concurrent writes. +After a breaking migration, old snapshots are discarded. The system rehydrates from the most recent pre-migration snapshot, applies schema lenses to bring events forward to the new shape, and writes a new snapshot tagged with the current epoch and schema version. The migration runbook in Chapter 13 specifies the sequencing required to keep this safe under concurrent writes. -**Snapshot scheduling policy.** The system writes a new snapshot after three triggers: rehydration completes (to amortize future replay cost); the event log crosses a configurable operation-count threshold (default 5,000 operations since the last snapshot); and explicit snapshot creation is requested via `Harborline.Kernel.Buckets` at application shutdown. 
The operation-count threshold is the primary driver in practice. An aggregate that accumulates operations quickly generates snapshots frequently. A rarely-modified aggregate may hold a single snapshot for months. The threshold is per document type, not per deployment. Teams with high-frequency write patterns reduce the threshold; teams prioritizing storage efficiency raise it. The cost of an incorrect threshold is measured in rehydration latency, not correctness - the event log remains intact regardless of snapshot frequency. +The system writes a new snapshot after three triggers: rehydration completes; the event log crosses a configurable operation-count threshold (default 5,000 operations since the last snapshot); and explicit snapshot creation is requested via `Harborline.Kernel.Buckets` at application shutdown. The threshold is per document type. The cost of an incorrect threshold is measured in rehydration latency, not correctness — the event log remains intact regardless of snapshot frequency. --- ## CRDT Growth and Garbage Collection -CRDT growth and the three-tier garbage collection policy are specified in Chapter 12 §CRDT Growth and Garbage Collection. The garbage collection tier assignment lives in the bucket's `IStreamDefinition`; Ch12 specifies the GC policy itself. Teams that enable application-level purging or shallow snapshots for a document type accept the tradeoff: nodes holding history older than the shallow snapshot cannot merge with nodes that have discarded that history. The tradeoff is explicit and schema-bound. +CRDT growth and the three-tier garbage collection policy are specified in Chapter 12 §CRDT Growth and Garbage Collection. The garbage collection tier assignment lives in the bucket's `IStreamDefinition`. Teams that enable application-level purging or shallow snapshots accept a tradeoff: nodes holding history older than the shallow snapshot cannot merge with nodes that have discarded it. The tradeoff is explicit and schema-bound. 
--- ## Backup UX: Three-State Model -The backup system exposes three states to the user. Internal replication factors, CRDT vector clocks, and sync daemon health checks are not visible. The user sees a status, and the status demands a specific action or confirms that none is needed. +The backup system exposes three states to the user. Internal replication factors, CRDT vector clocks, and sync daemon health checks are not visible. The user sees a status that demands a specific action or confirms that none is needed. -**Protected.** All nodes have synchronized within the configured backup policy window. The policy window is operator-defined per deployment. A green indicator confirms protection. No action is required. +**Protected.** All nodes have synchronized within the configured backup policy window. No action is required. **Attention.** Backup lag has exceeded the policy window on one or more nodes, but no data has been lost. The UI surfaces one actionable prompt: "Back up now." The prompt is dismissible once acknowledged. -**At Risk.** No successful backup has completed within the escalation threshold - a configurable multiple of the policy window. The UI displays a persistent warning. Not a dismissible notification. Not a banner that fades. The user must explicitly acknowledge the risk before the warning clears. Acknowledging records awareness; it does not resolve the risk. The warning returns each session until backup completes. +**At Risk.** No successful backup has completed within the escalation threshold — a configurable multiple of the policy window. The UI displays a persistent warning — not a dismissible notification, not a banner that fades. The user must explicitly acknowledge the risk before the warning clears. Acknowledging records awareness; it does not resolve the risk. The warning returns each session until backup completes. 
```mermaid stateDiagram-v2 @@ -229,31 +197,27 @@ stateDiagram-v2 AtRisk --> AtRisk: user acknowledges (warning persists) ``` -This model is intentionally non-technical. "Your data is protected" requires no understanding of sync daemons or replication factors. "You are at risk" requires only the user's attention. The three states map directly to the three things a user can do: nothing, back up now, or acknowledge an emergency. +This model is intentionally non-technical. "Your data is protected" requires no understanding of sync daemons. "You are at risk" requires only the user's attention. The three states map to the three things a user can do: nothing, back up now, or acknowledge an emergency. -The backup status is surfaced in `Harborline.Foundation` as a typed state that the host application renders. The package provides the state machine; the application provides the UI. No backup UI is prescribed - the state model is the contract, not the presentation. +`Harborline.Foundation` exposes the backup status as a typed state that the host application renders. The package provides the state machine; the application provides the UI. -**BYOC (Bring Your Own Cloud) backup destination.** The Tier 3 backup adapter is not bound to a specific cloud provider. The architecture specifies a generic object storage interface; operator deployments configure the destination. The backup object contains a full encrypted snapshot of the node's CRDT event log and the current snapshot tier - not a database file, not a ZIP archive, but the serialized event log the system already maintains as Tier 2. The encryption key for the backup is derived by HKDF (HMAC-based Key Derivation Function)-SHA256 from the same DEK (Data Encryption Key)/KEK (Key Encryption Key) hierarchy specified in Chapter 15: the same root seed protects both the local database and its off-node backup. A backup stored in an untrusted object store is still encrypted under the user's key material. 
The storage provider cannot read it. If the user loses their key, they lose access to the backup - the same tradeoff that governs the local store. The adapter configuration accepts any endpoint that speaks the S3 API (Application Programming Interface): hyperscaler services (Azure Blob via compatibility adapter, Google Cloud Storage, AWS S3); EU-resident providers (Hetzner Object Storage, OVHcloud, Scaleway) for post-Schrems II compliance; GCC (Gulf Cooperation Council) and Indian sovereign cloud providers for UAE DPL (Data Protection Law), DIFC (Dubai International Financial Centre) DPL 2020, and RBI (Reserve Bank of India) obligations (see Appendix F); domestic providers in Japan (IDCFrontier, NTT Object Storage, Sakura), China (Aliyun OSS with PIPL (Personal Information Protection Law)-compliant configuration), and South Korea; domestically hosted endpoints for Russia's Federal Law 242-FZ and parallel CIS (Commonwealth of Independent States) data localization regimes; and on-premise object storage (self-hosted MinIO (self-hosted S3-compatible object storage), Ceph RGW, network shares) for air-gapped or import-substitution-mandated deployments. The BYOC model is the architectural answer to the 2022 SaaS (Software as a Service) terminations - when Adobe, Autodesk, Figma ([figma.com](https://www.figma.com/), the design tool), and dozens of other Western SaaS vendors suspended service across Russia and CIS markets under sanctions enforcement, organizations whose backup endpoints lived in vendor cloud infrastructure lost access to their own data. A backup endpoint the user controls survives that failure mode. The architecture makes it a structural property, not a configuration choice. +**BYOC backup destination.** The Tier 3 backup adapter is not bound to a specific cloud provider. The backup object contains a full encrypted snapshot of the node's CRDT event log — the serialized event log the system already maintains as Tier 2. 
The encryption key derives by HKDF-SHA256 from the same DEK/KEK hierarchy specified in Chapter 15. The adapter accepts any endpoint speaking the S3 API: hyperscaler services, EU-resident providers for post-Schrems II compliance, sovereign cloud providers for regional data-residency obligations (see Appendix F), and on-premise object storage for air-gapped deployments. When the 2022 SaaS service suspensions cut access for organizations whose backup endpoints lived in vendor infrastructure, user-controlled endpoints were unaffected. The BYOC model makes that resilience structural. --- ## Relay Architecture -The relay is the architecture's most structurally ambiguous component: an external service the node depends on for WAN peer reachability when direct peer-to-peer connectivity is not viable. Chapter 14 specified the sync protocol that peers use across the relay. This section specifies the relay itself. +The relay routes encrypted CRDT operation frames between authenticated peers when direct peer-to-peer connectivity is not viable. Chapter 14 specified the sync protocol; this section specifies the relay itself. -**Ciphertext-only invariant.** The relay routes encrypted CRDT operation frames between authenticated peers. It does not hold decryption keys. It cannot read payload content. Every frame the relay forwards is encrypted end-to-end under keys that never leave originating devices - the relay sees peer identities (Ed25519 public keys), workspace identifiers, and frame envelopes, but no plaintext data. This is the guarantee Chapter 7's Okonkwo council established as inviolable and Chapter 15's security architecture specified cryptographically. A compromised relay exposes connection metadata - who communicates with whom, at what times, at what volume - not content. +**Ciphertext-only invariant.** The relay does not hold decryption keys. It cannot read payload content. 
Every frame the relay forwards is encrypted end-to-end under keys that never leave originating devices. A compromised relay exposes connection metadata — who communicates with whom, at what times, at what volume — not content. -**Managed relay deployment.** The managed relay is a horizontally-scaled service accepting WebSocket connections over TLS (Transport Layer Security) 1.3 on port 443. The service terminates TLS at an edge proxy, authenticates each connection against the Ed25519 public key presented in the handshake, and forwards CRDT operation frames to subscribing peers. Horizontal scaling is stateless at the forwarding layer: any relay node can route any frame. Per-peer connection affinity is handled by consistent hashing on node_id for subscription routing. The managed relay operates per jurisdiction; teams select the relay endpoint at onboarding to satisfy data-residency obligations (Appendix F maps jurisdictional endpoints to regulatory frameworks). Cross-jurisdiction deployments configure multiple endpoints with relay-to-relay interconnect over TLS 1.3, authenticated by operator-held relay keys. +**Managed relay deployment.** The managed relay accepts WebSocket connections over TLS 1.3 on port 443, authenticates each connection against the Ed25519 public key presented in the handshake, and forwards CRDT operation frames to subscribing peers. Horizontal scaling is stateless at the forwarding layer. Teams select the relay endpoint at onboarding to satisfy data-residency obligations. -**Self-hosted relay.** The relay is a single binary distributed as both a native executable and an OCI container image. Resource profile for a fifty-person team: 512 MiB RAM, 2 vCPU, 10 GiB disk for operational logs, no persistent state required beyond the subscription routing table. The self-hosted relay implements the same protocol as the managed relay; a node cannot distinguish them at the protocol level. 
Organizations operating under compelled-access threat models - CIS jurisdictions, regulated financial services, public sector - deploy the self-hosted relay on infrastructure they control, or on sovereign-cloud infrastructure within their jurisdiction, and point their nodes' `relayEndpoint` configuration at it. The protocol specification (Chapter 14) is published under the same license as the kernel; any organization can implement a compatible relay. +**Self-hosted relay.** The relay is a single binary distributed as both a native executable and an OCI container image. Resource profile for a fifty-person team: 512 MiB RAM, 2 vCPU, 10 GiB disk, no persistent state required beyond the subscription routing table. The self-hosted relay implements the same protocol as the managed relay; a node cannot distinguish them at the protocol level. Organizations under compelled-access threat models deploy the self-hosted relay on infrastructure they control and point their nodes' `relayEndpoint` configuration at it. -**Protocol openness.** The relay protocol is specified in Chapter 14 with sufficient precision for third-party implementation. There is no proprietary wire format. There is no vendor-specific handshake extension. A relay written from scratch by a third party that conforms to the protocol is indistinguishable to nodes from a first-party relay. This prevents vendor lock-in at the relay layer - the architecture's most SaaS-like component cannot become a vendor trap because any competent infrastructure team can replace it. +**Protocol openness.** The relay protocol is specified in Chapter 14 with sufficient precision for third-party implementation. There is no proprietary wire format. A relay written from scratch by a third party that conforms to the protocol is indistinguishable to nodes from a first-party relay. This prevents vendor lock-in at the architecture's most SaaS-like component. 
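
The ciphertext-only invariant can be illustrated with a minimal routing sketch. It is not the relay's actual implementation: the `Frame` and `Relay` names, the string stand-ins for Ed25519 public keys, and the in-memory subscription table are all assumptions made for illustration, and handshake authentication is omitted. The sketch shows only that forwarding needs the envelope and never the payload.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Envelope fields the relay can see; the payload stays opaque ciphertext."""
    workspace_id: str
    sender_key: str    # hypothetical stand-in for the sender's Ed25519 public key
    ciphertext: bytes  # encrypted on the originating device, never decrypted here


class Relay:
    """Forwards frames to subscribers without holding any decryption key."""

    def __init__(self) -> None:
        # workspace_id -> subscribed peer keys (the only state the relay keeps)
        self._subscriptions: dict[str, set[str]] = defaultdict(set)

    def subscribe(self, workspace_id: str, peer_key: str) -> None:
        # Assumption: the peer was already authenticated during the handshake.
        self._subscriptions[workspace_id].add(peer_key)

    def forward(self, frame: Frame) -> dict[str, bytes]:
        """Map each subscriber (except the sender) to the unmodified ciphertext."""
        recipients = self._subscriptions.get(frame.workspace_id, set())
        return {
            peer: frame.ciphertext
            for peer in sorted(recipients)
            if peer != frame.sender_key
        }
```

Because `forward` only copies bytes, a conforming third-party relay needs nothing beyond the envelope fields, which is consistent with the protocol-openness claim.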
-**Multi-tenant isolation for the Zone C comms mesh.** The Zone C accelerator deploys per-tenant hosted nodes alongside the shared relay. Each tenant's relay traffic is isolated at the subscription routing layer: node identity carries tenant scope, and the relay enforces that a frame destined for tenant A's nodes is never delivered to a node whose identity is scoped to tenant B. Cross-tenant data exchange - when it occurs by design - is an application-layer decision executed through explicit sharing primitives, not a default relay behavior. Chapter 18 specifies the Zone C tenant model in full. - -**Compelled access.** The relay cannot produce decryptable content under legal compulsion because the relay does not possess decryptable content. A subpoena to the relay operator yields connection logs and message envelopes - never payload plaintext. This is the structural answer to the compelled-access threat model across CIS jurisdictions (242-FZ + import substitution), the EU (post-Schrems II transfer safeguards), the GCC (DIFC Data Protection Law 2020), China (PIPL + MLPS (Multi-Level Protection Scheme) 2.0), and every other regulatory regime where cloud-operator data access is a live procurement concern. Chapter 15 specifies the cryptographic mechanism; Chapter 16 specifies the operational configuration that activates it. - ---- +**Compelled access.** The relay cannot produce decryptable content under legal compulsion because it does not possess decryptable content. A subpoena to the relay operator yields connection logs and message envelopes — never payload plaintext. Chapter 15 specifies the cryptographic mechanism; Chapter 16 specifies the operational configuration that activates it. --- @@ -265,11 +229,9 @@ The recovery sequence on a new device: 1. The application detects no local CRDT state on first launch. 2. The user authenticates against the identity provider. -3. 
An existing team member opens the application on their own device and scans the new device's QR code. The QR code encodes a one-time key exchange for the role attestation bundle. The existing member's device transfers the attestation bundle and an initial CRDT snapshot of all eager-bucket records the new device is authorized to hold. +3. An existing team member scans the new device's QR code. The QR code encodes a one-time key exchange for the role attestation bundle. The existing member's device transfers the attestation bundle and an initial CRDT snapshot of all eager-bucket records the new device is authorized to hold. 4. The sync daemon completes eager-bucket synchronization in the background. For most team workspaces, this completes within minutes. -5. Lazy-bucket stubs are present immediately after step 3. The user sees navigation and list views for all lazy records. Full content fetches on first access. - -The user is in working state before background sync completes. The application continues to function normally during sync; the sync daemon's progress is visible in a status indicator, not in the application's primary navigation. +5. Lazy-bucket stubs are present immediately after step 3. The user sees navigation and list views for all lazy records; full content fetches on first access. ```mermaid sequenceDiagram @@ -289,41 +251,39 @@ sequenceDiagram SyncDaemon-->>NewDevice: eager sync complete (background) ``` -The QR-code attestation transfer is a cryptographic key exchange, not a file copy. The attestation bundle is signed by the team's identity authority. The new device cannot forge it, and the team member's device cannot transfer attestations it does not hold. The scope of what transfers is bounded by the existing member's own attestation set. - -Recovery from backup - when no team member is available to perform the QR exchange - follows a different path. 
The user authenticates against the IdP (Identity Provider), the system retrieves the most recent backup snapshot from the user-controlled object storage, and applies it to the local database. The sync daemon then re-synchronizes with peers to incorporate any changes that occurred after the backup. This path is slower than the peer-assisted path but requires no human coordination beyond the user's own credentials. +The QR-code attestation transfer is a cryptographic key exchange, not a file copy. The attestation bundle is signed by the team's identity authority. The new device cannot forge it, and the team member's device cannot transfer attestations it does not hold. -Both recovery paths produce the same end state: a node with full local authority over its data, synchronized with peers, and operating without dependency on any central server's availability. The recovery mechanism does not introduce a central point of failure that did not exist before the device was lost. +Recovery from backup — when no team member is available for the QR exchange — follows a different path. The user authenticates against the IdP, the system retrieves the most recent backup snapshot from user-controlled object storage, and applies it to the local database. The sync daemon re-synchronizes with peers to incorporate changes since the backup. This path requires no human coordination beyond the user's own credentials. -**Offline recovery fallback.** For deployments where IdP availability at the moment of recovery cannot be assumed - Sub-Saharan African field operations during outage windows, rural Indian BFSI (Banking, Financial Services, and Insurance) field teams on VSAT links, GCC construction sites during load-shedding, LATAM (Latin America) rural secondary cities - a third recovery path operates without network connectivity. 
At onboarding, each node generates an optional offline recovery bundle: a one-time-use cryptographic blob containing a wrapped recovery key plus the minimum attestation the node needs to bootstrap its local database from a backup without contacting the IdP. The bundle is stored out-of-band by the user - printed QR code in a sealed envelope, secondary device, organizational escrow. Recovery from the offline bundle restores the node to a read-write local state; sync with peers resumes when connectivity returns, at which point the recovered node re-attests against the IdP and rotates to a fresh recovery bundle. The offline bundle expires on use or on a configurable wall-clock timeout (default 12 months) to bound the exposure window. This path is the architecture's honest answer to the recovery scenario the deployment contexts most demand it for: a user who has just lost their device at a remote site with no network access needs to resume work before they can return to a connected environment. +**Offline recovery fallback.** For deployments where IdP availability at recovery cannot be assumed — field operations during outage windows, rural teams on satellite links, construction sites during load-shedding — a third path operates without network connectivity. At onboarding, each node generates an optional offline recovery bundle: a one-time-use cryptographic blob containing a wrapped recovery key plus the minimum attestation the node needs to bootstrap from a backup without contacting the IdP. The user stores the bundle out-of-band — printed QR code, secondary device, or organizational escrow. Recovery from the bundle restores the node to a read-write local state; sync resumes when connectivity returns, at which point the node re-attests against the IdP and rotates to a fresh bundle. The bundle expires on use or on a configurable wall-clock timeout (default 12 months). 
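
The bundle's two expiry conditions, use and wall-clock timeout, can be sketched as a small state check. This is an illustrative model under stated assumptions, not the specified mechanism: the `RecoveryBundle` class and its field names are hypothetical, the 12-month default is approximated as 365 days, and the wrapped recovery key and attestation are left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "12 months" is approximated as 365 days for illustration.
DEFAULT_TTL = timedelta(days=365)


@dataclass
class RecoveryBundle:
    created_at: datetime
    ttl: timedelta = DEFAULT_TTL
    used: bool = False

    def is_usable(self, now: datetime) -> bool:
        """Usable only while unused and inside the wall-clock window."""
        return not self.used and now < self.created_at + self.ttl

    def redeem(self, now: datetime) -> None:
        """Consume the bundle; an expired or already-used bundle is rejected."""
        if not self.is_usable(now):
            raise ValueError("recovery bundle is expired or already used")
        self.used = True
```

Redeeming marks the bundle as spent, which matches the rotation step: after re-attesting against the IdP, the node would generate a fresh bundle rather than reuse the old one.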
--- ## Plain-File Export -All user data must be exportable as standard formats without running the application. This requirement is architectural, not aspirational. Export is a first-class feature, not an afterthought added to a compliance checkbox. +All user data must be exportable as standard formats without running the application. Export is a first-class feature, not a compliance checkbox. The export formats: -- **Relational data** - SQLite database file, readable with any SQLite client without application software. -- **Documents and text** - JSON with human-readable field names. No internal identifiers without corresponding human-readable labels. -- **Tabular data** - CSV for spreadsheet-compatible export. Column headers match the field names used in the JSON export. -- **Binary assets** - Original format, no transcoding. A file uploaded as PNG exports as PNG. +- **Relational data** — SQLite database file, readable with any SQLite client. +- **Documents and text** — JSON with human-readable field names. +- **Tabular data** — CSV with column headers matching the JSON field names. +- **Binary assets** — Original format, no transcoding. +- **Long-form content** — Markdown for notes, project descriptions, and inline text content. -Export runs as a background task initiated from the application and produces a self-contained directory. The directory contains a `README.txt` that explains its structure in plain language - file names, what each format contains, and how to open each type without any specialized software. The README assumes the reader has no prior knowledge of the application's internal structure. +Export runs as a background task and produces a self-contained directory with a `README.txt` explaining its structure in plain language, assuming no prior knowledge of the application. Export requirements: - No network connectivity required. Export reads only from the local database and the local CRDT log. - No telemetry. 
The export process produces no network requests. -- Deterministic. The same local state produces the same export directory structure and the same file contents. Timestamps in export filenames use UTC ISO 8601. -- Complete. Every record the local node holds is included. Export does not omit records based on their lazy-or-eager status - a full record in the local database is always exported; a stub is exported as its metadata, with the content hash recorded and the content field absent. - -Stubs in the export represent data the local node does not hold. The README documents this explicitly: a stub export entry includes the record identifier, the metadata, and the content hash, with a note that full content is available from the application's backup or from team peers. +- Deterministic. The same local state produces the same export directory structure and file contents. +- Complete. Every record the local node holds is included. A stub exports as its metadata with the content hash recorded and the content field absent. ``` export-2026-04-23/ ├── README.txt +├── manifest.json ├── data.sqlite ├── documents/ │ ├── project-alpha-brief.json @@ -336,21 +296,15 @@ export-2026-04-23/ └── architecture-diagram.pdf ``` -The export directory is the user's data in a form that outlasts the application. If the application ceases to exist, the data remains in formats that any competent developer can parse. That distinguishes local-first from vendor-managed storage: the user's data belongs to them in a form they can actually use. - -`Harborline.Foundation` exposes the export pipeline as a background task. The host application provides a destination path and receives progress events; the package handles serialization, format selection, and README generation. The export format specification is versioned separately from the application - a document exported today must be parseable by any future export reader that supports the same format version. 
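
The versioning contract can be sketched as a manifest-first compatibility check. The key name `format_version` and the set of supported versions are assumptions for illustration; the text names what the manifest records but not its key names, so a real reader would follow the published export format specification.

```python
import json
from pathlib import Path

# Hypothetical: the format versions this reader knows how to parse.
SUPPORTED_VERSIONS = {"1"}


def check_export(export_dir: Path) -> dict:
    """Read manifest.json first and refuse to continue on a version mismatch."""
    manifest_path = export_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    version = str(manifest.get("format_version"))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported export format version: {version}")
    # Only after this check would a reader open data.sqlite or the documents.
    return manifest
```

Checking the manifest before opening any data file keeps an incompatible reader from misinterpreting records it was never built to parse.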
- -The format version is recorded in the `README.txt` and in a machine-readable `manifest.json` at the root of the export directory. The manifest records the format version, the export timestamp, the node identifier, the list of included document types, and the count of stubs versus full records. A future import or recovery tool reads the manifest first to determine compatibility before touching the data files. This versioning contract makes the export durable across application updates - a reader built years from now can inspect the manifest version and apply the appropriate parsing logic without guessing at the structure. +The `manifest.json` records the format version, export timestamp, node identifier, included document types, and stub-versus-full-record counts. A future import or recovery tool reads the manifest first to determine compatibility before touching data files. -Regulated industries treat the plain-file export as a retention artifact, not just a convenience. HIPAA (Health Insurance Portability and Accountability Act) requires patient record retention periods measured in years. GDPR (General Data Protection Regulation) Article 17 requires erasure on request - but only after the retention period expires. The export format must survive both obligations. It must be readable long after the application that produced it is gone. The content hash in every exported stub must remain verifiable so that a regulator can confirm that no content was silently omitted. The hash is in the export. The format version is in the manifest. The manifest is in the directory. No application-specific tooling is required to verify any of it. - -The export pipeline emits four formats: JSON for structured records, CSV for tabular collections, SQLite for complete per-node database snapshots, and Markdown for long-form document content (notes, project descriptions, inline text content). 
Markdown's inclusion is deliberate - Chapter 9's Ferreira named it as non-negotiable precisely because it is the one format that is simultaneously human-readable, machine-parseable, and version-control friendly. A user who has not opened the application in five years can still read a Markdown file in any text editor. A developer can still parse it with any competent library. A version control system can still diff it without specialized tooling. The four formats together close Property 5 (the long now) and Property 7 (ultimate ownership and control) as structural properties rather than contractual promises. +The export directory is the user's data in a form that outlasts the application. If the application ceases to exist, the data remains in formats any competent developer can parse. The five formats together close Property 5 (the long now) and Property 7 (ultimate ownership and control) as structural properties rather than contractual promises. `Harborline.Foundation` exposes the export pipeline as a background task; the host application provides a destination path and receives progress events. --- ## Layer 5 - Decentralized Archival (Phase 2) -Layer 5 was introduced in the five-layer architecture as an optional enterprise tier providing cryptographic proof-of-storage for regulated industries with long-term retention obligations. The operational mechanism - whether implemented via Filecoin's Proof of Replication and Proof of Spacetime, Arweave's Succinct Proofs of Random Access, or a Merkle-tree-based challenge-response against a known-responsive archival provider - is under active specification work and is not part of the 1.0 specification. Organizations with regulatory retention obligations satisfy them today through Layer 3's BYOC backup with long-term retention policies on user-controlled object storage. 
Layer 5's decentralized archival is a planned Phase 2 component for deployments where proof-of-storage auditability - rather than backup presence alone - is a compliance requirement. The five-layer diagram retains Tier 5 to signal the architectural extension point; the specification for that tier is deferred to v2.0 of this book. +Layer 5 provides cryptographic proof-of-storage for regulated industries with long-term retention obligations. The operational mechanism — whether Filecoin's Proof of Replication, Arweave's Succinct Proofs of Random Access, or a Merkle-tree-based challenge-response — is under active specification and is not part of the 1.0 specification. Organizations with regulatory retention obligations satisfy them today through Layer 3's BYOC backup with long-term retention policies on user-controlled object storage. Layer 5 is a planned Phase 2 component for deployments where proof-of-storage auditability — rather than backup presence alone — is a compliance requirement. The five-layer diagram retains Tier 5 to signal the architectural extension point; the specification is deferred to v2.0. --- @@ -358,9 +312,7 @@ Layer 5 was introduced in the five-layer architecture as an optional enterprise Persistence beyond the node is a composition of decisions, not a single mechanism. Each layer resolves a distinct failure mode. Together they ensure the user's data survives the device, the application, and the operator. -The governing constraint across all five layers is the same: the user's data must remain in the user's control and in a form the user can verify. Bucket access control enforces minimization at the protocol layer. Backup destinations are user-controlled and provider-agnostic, with named jurisdictional endpoints satisfying every major data sovereignty regime. The relay routes ciphertext only, is protocol-open and self-hostable, and cannot produce decryptable content under legal compulsion because it does not possess decryptable content. 
Snapshots are performance optimizations over an event log the user can read. Export produces four formats - JSON, CSV, SQLite, Markdown - that require no vendor cooperation to open, closing Property 5 (the long now) and Property 7 (ultimate ownership) as structural guarantees. The offline recovery bundle ensures that device loss at a site without connectivity is recoverable without IdP availability. The three-state backup UX surfaces risk honestly rather than hiding it behind a perpetually green indicator. None of these properties are incidental. Each is a design decision made in favor of the person who owns the data. - ---- +The governing constraint across all five layers is the same: the user's data must remain in the user's control and in a form the user can verify. Bucket access control enforces minimization at the protocol layer. Backup destinations are user-controlled and provider-agnostic, with jurisdictional endpoints satisfying major data sovereignty regimes. The relay routes ciphertext only, is protocol-open and self-hostable, and cannot produce decryptable content under legal compulsion because it does not possess it. Snapshots are performance optimizations over an event log the user can read. Export produces five formats — JSON, CSV, SQLite, Markdown, and binary originals — that require no vendor cooperation to open, closing Property 5 and Property 7 as structural guarantees. The offline recovery bundle ensures device loss at a site without connectivity is recoverable without IdP availability. The three-state backup UX surfaces risk honestly. None of these properties are incidental. Each is a design decision made in favor of the person who owns the data. ---