diff --git a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
index 8cc33e9..6aee1cd 100644
--- a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
+++ b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
@@ -1,34 +1,32 @@
 # Chapter 1 - When SaaS Fights Reality
-
+
 ---
-It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. Her firm's general-contractor bid is due at five, and the owner group is scheduled to meet at six. The project management platform her firm operates on has been down since eleven that morning.
+It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. The bid is due at five. The platform has been down since eleven.
-The data isn't lost; it exists somewhere-on servers in Virginia, Oregon, or any other cloud region that happens to be active that day. The labor breakdown, subcontractor bids, change order history, and payment schedule-all of it remains intact on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page claims it's an outage affecting less than 1% of users. On this bid, that 1% is everyone.
+The data isn't lost. It exists on servers in Virginia or Oregon — intact, on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page calls it an outage affecting less than 1% of users. On this bid, that 1% is everyone.
-As the clock ticks down, Sunita's options dwindle. She can only reconstruct what she can from an email trail, export a stale PDF from before the platform went down, or ask her client to extend the deadline. But that would require explaining to the board what happened and why the firm wasn't prepared.
+This isn't a planning failure. Sunita planned correctly; her team had used the software. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities go with it.
-This isn't a planning failure. Sunita planned correctly, her team had used the software. Everything was in order. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities are compromised.
-
-This scenario repeats across various industries that rely on deadline-sensitive work-the attorney preparing a brief at nine in the evening, the engineer updating safety documentation in the field, and the physician accessing patient records before rounds. The infrastructure fails identically, but only the deadlines change.
+This scenario repeats wherever deadline-sensitive work runs on cloud infrastructure — the attorney drafting a brief at nine in the evening, the engineer updating safety documentation in the field, the physician accessing records before rounds. The infrastructure fails identically. Only the deadlines change.
 ---
 ## The Bundle Nobody Agreed To
-The SaaS (Software as a Service) deal goes like this. Give us your data. Keep it on our servers. Pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms.
+The SaaS deal goes like this: give us your data, keep it on our servers, pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms.
-The three desirable properties are real. Real-time collaboration is transformative - two people editing the same document, watching each other's changes appear, never again emailing attachments back and forth. Multi-device access means your work is on your phone when you need it at the airport. Zero maintenance means IT does not nurse a server in a closet; the vendor handles it.
+The three desirable properties are real. Real-time collaboration is transformative. Multi-device access means your work follows you. Zero maintenance means IT doesn't nurse a server in a closet.
-The three conditions on the other side of the bundle get less attention. Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or turn the service off. Pricing is at the vendor's discretion - the rate when you adopted the software is not a commitment. It is a starting point. Service continuity is contingent on the vendor's survival: if the company gets acquired, runs out of money, or decides to sunset the product, your software stops working when theirs does.
+The three conditions on the other side get less attention. Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or shut the service off. Pricing is at the vendor's discretion — the rate at adoption is a starting point, not a commitment. Service continuity is contingent on the vendor's survival.
-The acceptance was rational. Neither half of the bundle is fully visible at adoption time. The terms of service when a company signs up and the terms of service three acquisitions later are different documents. The pricing that wins a customer's business is designed to win it - not to represent what the platform costs after that customer has built their workflows, trained their staff, and transferred their data. The bundle reveals itself slowly, after the switching costs have accumulated.
+The acceptance was rational, because the second half wasn't visible at adoption time. The pricing that wins a customer's business isn't calibrated to represent what the platform costs after that customer has built workflows, trained staff, and transferred data. The bundle reveals itself slowly, after switching costs have accumulated.
-Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server both parties could talk to. Multi-device sync required a cloud that acted as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. The package looked indivisible because, with the technology of 2010, it largely was.
+Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server. Multi-device sync required a cloud acting as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. The package looked indivisible because, with the technology of 2010, it largely was.
 That is no longer true.
@@ -38,203 +36,163 @@ That is no longer true.
 ### The Outage That Takes Your Work With It
-Major SaaS providers report 99.9% uptime - roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar and rarely land at a bad moment. For a team of ten, at any given moment somebody is in the middle of something time-sensitive.
-
-Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon, with thirteen minutes left to submit a subcontractor bid for a hospital expansion in Pune. The platform - the SaaS construction-management product her firm had standardized on the year before - had been slow all afternoon. Pages took six seconds to load instead of one. Sunita had opened the bid spreadsheet in three browser tabs that morning because she did not trust the network, and she switched between them as one slowed and another caught up. She had been carrying the bid for six weeks. Two hundred and forty-three line items. Subcontractor quotes, materials, equipment, contingency. The kind of document a construction PM keeps cleaner than her own desk.
-
-At 4:47 the platform stopped responding. She refreshed. Spinning indicator. She refreshed. Spinning indicator. She called her counterpart at the firm who was supposed to countersign the bid; her counterpart could not reach the platform either. Sunita tried to email the spreadsheet to the client directly - the platform's single sign-on tied her email account to the same provider, and her email was locked too. By 5:04 she had her phone in her hand watching the timestamp move past the deadline. She did not say anything when the window closed. She set the phone face-down on the desk and listened to the office around her - keyboards, voices, somebody laughing about something - and she counted the line items she had not been able to submit. Two hundred and forty-three. The bid was won by a competitor whose construction-management platform happened to run on a different vendor whose dependencies had not gone down at 4:47 that afternoon.
+Major SaaS providers report 99.9% uptime — roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar. For a team of ten, at any given moment somebody is in the middle of something time-sensitive.
-Sunita kept three tabs open after that. She still keeps three tabs open. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic.
+Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon with thirteen minutes left to submit a subcontractor bid for the Pune hospital expansion. The platform had been slow all afternoon. At 4:47 it stopped responding entirely. She refreshed. Spinning indicator. She called her counterpart who was supposed to countersign; her counterpart couldn't reach the platform either. The platform's single sign-on tied her email to the same provider — her email was locked too. At 5:04 she watched the timestamp move past the deadline. The bid was won by a competitor whose construction-management platform ran on a different vendor whose dependencies hadn't gone down at 4:47.
-The outage that gets published is the one the vendor is willing to call an outage. The incidents that affect partial regions, specific features, or specific customer cohorts surface as "degraded performance" - a phrase that does most of its work by not being the word *outage*. From the affected user's side, degraded performance means the site loads but submissions fail silently, changes save and then revert, or search returns stale results. This is harder to work around than a clean outage, because it is not obvious that the problem is the platform rather than something the user did. With a clean outage you know to stop trying. With degraded performance you keep trying - and the failure looks like something you did.
+Sunita kept three tabs open after that. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic.
-What makes outage risk asymmetric is that it falls hardest on the moments that matter most. High-stakes work - deadline submissions, live customer sessions, critical handoffs - tends to involve intensive platform use, which means it is more exposed to performance degradation under load. And the work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option.
+The outage the vendor publishes is the one it's willing to call an outage. Incidents affecting partial regions, specific features, or specific customer cohorts surface as "degraded performance" — a phrase that does most of its work by not being the word *outage*. With a clean outage you know to stop trying. With degraded performance you keep trying, and the failure looks like something you did.
-Sunita's afternoon is not unusual for her industry. Construction project management is deadline-driven by definition. A subcontractor bid has a submission deadline that is not negotiable after the fact. A change order authorization has a response window tied to contract terms. A safety inspection log has a regulatory timestamp requirement. When any of these processes depends on cloud infrastructure being available exactly when needed, the infrastructure becomes a single point of failure in a workflow that cannot tolerate one.
+Outage risk falls hardest on the moments that matter most. High-stakes work — deadline submissions, live customer sessions, critical handoffs — involves intensive platform use, which means it's more exposed to performance degradation under load. The work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option.
-Availability statistics miss a compounding factor. The concentration of cloud hosting means failures cascade across unrelated products at the same instant. The December 2021 AWS us-east-1 outage affected every product hosted there - project management tools, document collaboration platforms, file storage services, communication tools - at the same moment. A single incident becomes an industry-wide incident for everyone whose vendor chose the same region. Users who experience a simultaneous failure across multiple tools they rely on do not find redundancy in having adopted multiple platforms; they find that all their fallback options went down at the same time. This is the dependency chain. Not your vendor failing, but the infrastructure layer beneath your vendor - shared cloud regions, CDN providers, authentication services - none of which appear in your vendor's SLA (Service Level Agreement), and none of which you have any contract with.
+The concentration of cloud hosting compounds this. The December 2021 AWS us-east-1 outage hit every product hosted there simultaneously — project management tools, document platforms, file storage, communication tools. Users who had adopted multiple platforms found that all their fallback options went down at the same time. Their vendor SLAs (Service Level Agreements) say nothing about the infrastructure layer beneath their vendor — shared cloud regions, CDN providers, authentication services — none of which the user has any contract with.
-Outages hit hardest the users who can least work around them. Assistive technology users - those who rely on screen readers, switch access devices, or voice control software - experience SaaS connectivity failure as complete access failure. The screen reader announces a failed load. Voice control has no form fields to target. The application stops responding. Degraded performance that a connected user circumvents by refreshing is inaccessible in a more absolute sense - the AT user cannot navigate what is not there. The architecture this dissertation proposes keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software.
+Outages hit hardest the users who can least work around them. Assistive technology users — those who rely on screen readers, switch access devices, or voice control — experience SaaS connectivity failure as complete access failure. Degraded performance that a sighted user circumvents by refreshing is inaccessible in a more absolute sense: the screen reader announces a failed load; voice control has no form fields to target. The architecture developed in later chapters keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software.
 ### The Vendor That Disappears
-In 2015, Sunrise Calendar had a substantial mobile user base (estimated by industry coverage in the low millions) and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year. Microsoft shut it down in August 2016. Users received a few weeks' notice. The data was exportable - in a format that no other calendar app read natively, requiring manual remapping of categories and recurrence rules.
+In 2015, Sunrise Calendar had a substantial mobile user base and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year and shut it down in August 2016. Users received a few weeks' notice. The data was exportable in a format no other calendar app read natively.
 Sunrise was not exceptional. It was typical of how software products end.
-The mechanism changes - acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger - but the pattern is consistent. The product goes dark. Users who built their workflows around it are left with whatever they managed to export before the deadline.
+The mechanism changes — acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger — but the pattern is consistent. The product goes dark. Users who built workflows around it are left with whatever they managed to export before the deadline. Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the structure was stored in a format only Quip controlled.
-Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the investment worthless on migration because the structure was stored in a format only Quip controlled. That is not a product failure. It is the custody model working exactly as designed: the user's workflow lives on vendor infrastructure until it doesn't.
+When a vendor announces shutdown, it typically offers an export. What that export contains, what format it uses, and whether any other software can consume it are highly variable. For project management data, the export is typically a CSV of the task list — without comments, without attachment history, without the relationship structure that made the tool useful. For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability.
-The data export problem deserves specific attention. When a vendor announces shutdown, it typically offers an export function. What that export contains, what format it uses, and whether any other software can actually consume it are highly variable. For project management data, vendors typically export a CSV of the task list - without the comments, without the attachment history, without the relationship structure that made the tool useful. For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability.
-
-The legal firm whose vendor gets acquired faces this directly. They adopted the software, trained staff, integrated it with billing and document management workflows, and accumulated years of matter history. Now they evaluate whether to migrate to the acquirer's competing product under the acquirer's pricing, or start over with a third party, reconstructing what they can from a flat CSV and a folder of PDFs.
-
-The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. It is routine. Thousands of SaaS products shut down every year. Most are small enough that their shutdowns do not make news; their users find out through an email or a banner in the app. The shutdowns that do make news - Evernote's degraded state following years of ownership changes, Google Reader's abrupt termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment - are notable primarily because of the scale of the disruption, not because the pattern is unusual.
+The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. Thousands of SaaS products shut down every year. Most are small enough that their shutdowns don't make news; their users find out through an email or a banner. The shutdowns that do make news — Google Reader's termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment — are notable for scale, not for being unusual.
 ### The Connectivity That Wasn't There
-Not everyone's internet is always on - and this is consistently underweighted in the architecture of software sold to the industries where it most frequently fails.
-
-Construction sites operate at the edge of mobile coverage. A superintendent in a concrete frame building cannot get a signal three floors underground. Rural professional service firms - accounting firms in small towns, medical practices in counties with limited broadband, legal practices in areas where fiber has not reached - operate on connectivity that drops daily and fails entirely during weather events. Hospital clinical environments include zones where mobile devices are restricted near sensitive equipment. Air-gapped facilities - manufacturing, defense, government - cannot connect to any external network at all as a policy requirement.
+Construction sites operate at the edge of mobile coverage. A superintendent in a concrete frame building can't get a signal three floors underground. Rural professional service firms operate on connectivity that drops daily. Hospital clinical environments restrict wireless devices near sensitive equipment. Air-gapped facilities — manufacturing, defense, government — can't connect to any external network by policy.
 For these users, offline capability is not a feature request. It is the baseline requirement.
-The SaaS vendor's marketing page says "works on mobile," which is true when there is a signal. It does not say "works when there isn't one," because the centralized architecture makes that impossible without fundamental redesign. The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render.
+The SaaS vendor's marketing page says "works on mobile," which is true when there's a signal. The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render.
-Most SaaS platforms offer some form of "offline mode." What this means in practice is usually a read-only cache of recently viewed data, with form submissions that queue locally and attempt to upload when connectivity returns - with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. You cannot create new records, cannot run reports, cannot access data you have not recently viewed, and cannot have any confidence that what you submitted offline actually made it to the server.
+Most SaaS platforms offer some form of "offline mode." In practice this means a read-only cache of recently viewed data, with form submissions that queue locally and attempt upload when connectivity returns — with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. You cannot create new records, run reports, or access data you haven't recently viewed.
-The field operations manager who needs to log a safety inspection at seven in the morning on a construction site, before the crew starts work, has a few options when the SaaS is unreachable. Write it in a notebook and transcribe it later, with all the transcription errors that introduces. Use the app's read-only offline mode and hope the form submission queues correctly. Or skip the log and fill it in from memory when back in the office. All three options introduce risk. None of them should be necessary. The software should work on a construction site because that is where the work happens.
+Sabina Rahman is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh. She covers eleven villages twice a week on a company motorbike, processing loan applications, KYC documentation, and repayment ledgers on a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day.
-The mismatch extends beyond any single vertical. Reliable internet access is not universal, even in developed economies. Hospital clinical environments restrict wireless devices near sensitive equipment. Manufacturing and warehouse floors often have RF environments hostile to Wi-Fi. Agricultural operations span hundreds of acres - the field where something needs to be logged is rarely next to the fiber drop. Emergency response personnel work in exactly the places infrastructure fails first. For all of these workers, SaaS software's connectivity assumption is not an occasional inconvenience. It is a systematic design error applied to environments the designers never worked in.
+The day she stopped trusting it was a monsoon-relief disbursement morning. Forty-seven applicants in queue by 8:00 a.m. The platform took submissions until 11:14. Then it went down. Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she called *shotti'r khata* — the truth book — with the borrowers' thumbprints on the carbons. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing; the audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps didn't match the borrowers' submissions.
-Intermittent connectivity is not a US edge case. It is the global operational baseline. In Nigeria and South Africa, scheduled load-shedding cuts power for six to twelve hours daily; when electricity goes, routers and base stations go with it, and connectivity fails regardless of coverage quality. Hundreds of millions of enterprise workers in those economies plan their workdays around outage schedules, not around the assumption that the network is always available. In India, the 4G/3G/2G coverage gradient means that enterprise field operations - agricultural services, construction, financial services, healthcare - routinely run on intermittent connectivity across large portions of Tier 2 and Tier 3 cities and rural areas. Rural Brazil, rural Mexico, and most of Southeast Asia present comparable patterns at comparable scale. A SaaS platform that cannot function without a persistent connection does not have a niche offline problem. It has an architecture that excludes the majority of the world's enterprise users from full functionality.
+Tariq Hassan works the other end of the spectrum, where connectivity fails for different reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi. The platform's primary uplink is a Ku-band satellite. When weather conditions degrade the satellite — on average twice a month — the platform falls to a microwave backup. When both links drop, the platform is offline.
-Sabina Rahman is one of those workers. She is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh, in a Rangpur Division village forty kilometers from the nearest upazila headquarters; she covers eleven villages on a route she runs twice a week on a company motorbike. Her work is relationship banking the way it has been done in Bangladesh since 1976 - the year Muhammad Yunus made the first thirty loans of what would become Grameen Bank - and digital paperwork the way it has been done for the last decade. Loan applications, KYC documentation, repayment ledgers, monsoon-relief disbursements - all of it lives in a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day. The mornings are the worst, when the entire upazila wakes up and pulls bandwidth at the same time.
+The day Tariq stopped trusting the cloud's ingestion pipeline was a six-hour double-link outage. The data buffered on the platform's local server. The uplinks returned. The buffer drained. The SaaS application the operator had standardized on was a thin client — it expected the data to be in the cloud already, and the ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. Tariq spent the next ten days writing a manual reconciliation report.
-The day she stopped trusting the platform entirely was a monsoon-relief disbursement morning. Forty-seven applicants in queue at her branch by 8:00 a.m. The platform took submissions until 11:14. Then it went down. The applicants had taken half a day off from rice-paddy work to sit in the queue. Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she had been keeping for two years and called *shotti'r khata* - the truth book - with the borrowers' thumbprints on the carbons and her own signature in blue ink. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing. The bank's audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps did not match the borrowers' submissions.
-
-Sabina keeps a paper backup of every digital sign-off she has made since. Twelve years of binders. Grameen-style microfinance, she has been heard to say, teaches you not to trust networks you cannot see - the field officer carries the bank's reputation in her notebook because the village will trust the notebook longer than it will trust any vendor's uptime page.
-
-Tariq Hassan works the other end of the spectrum, where connectivity fails for opposite reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi, one of nine Pakistani crew on a roster of forty-two. The platform's primary uplink is a Ku-band satellite. The backup is a microwave repeater on the next platform north. When weather conditions degrade the satellite - which happens on average twice a month and can last from forty minutes to fourteen hours - the platform falls back to the microwave. When the platform north is also degraded, both links drop and the platform is offline. Tariq's job is to keep the drilling-data feed running into the operator's onshore monitoring center in Dubai.
-
-The day Tariq stopped trusting the cloud's ingestion pipeline was a continuous double-link outage of just under six hours. The data buffered on the platform's local server. The uplinks returned. The buffer drained. The SaaS application the operator had standardized on the year Tariq was hired was a thin client - it expected the data to be in the cloud already, and the application's ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. It sat on the platform's local server for anyone who knew where to look. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. Tariq spent the next ten days writing a manual reconciliation report that the SaaS vendor's account manager called "an inconvenience." Tariq called it something else, in Urdu, to a colleague who asked him later how the report had gone.
-
-Tariq learned to run a parallel local data capture in addition to the SaaS feed, on a laptop in his bunk that he had reformatted to a Linux distribution the platform's IT department was not aware existed. He never trusted cloud telemetry on the platforms after that. The practice did not fail him. He kept it.
+Intermittent connectivity is not a US edge case. Scheduled load-shedding in Nigeria and South Africa cuts power for six to twelve hours daily; connectivity fails with it. Hundreds of millions of enterprise workers plan their workdays around outage schedules, not around the assumption that the network is always on. A SaaS platform that can't function without a persistent connection doesn't have a niche offline problem — it has an architecture that excludes the majority of the world's enterprise users from full functionality.
 ### The Data You Can't Get Back
-Your vendor's terms of service say your data is yours. They are often technically correct - the vendor does not claim ownership of the content you create. What the terms of service do not address is *accessibility*.
-
-Data that you own but cannot retrieve is data you do not have.
+Your vendor's terms of service say your data is yours. They are often technically correct — the vendor doesn't claim ownership of the content you create. What the terms don't address is *accessibility*.
-Four mechanisms make data inaccessible while it technically "belongs" to you.
+Data you own but cannot retrieve is data you don't have.
-Export rate limits are the first. Many platforms allow data export but rate-limit the export API (Application Programming Interface) to prevent bulk extraction. A legal firm with ten years of matter history attempting a bulk export may find that retrieving its own data at the permitted rate takes weeks. During that window, the firm remains dependent on the vendor's infrastructure to operate - which is, not coincidentally, exactly the position the vendor prefers it to be in.
-
-Proprietary formats are the second. The export is available, but in a format only the vendor's tools read well. Attachments export without their metadata. Comment threads export as flat text without threading structure. Custom fields export as raw column headers without the semantic context that made them useful. The data is present; the information it represented is partially lost.
-
-Feature-gated access is the third. Some platforms require paid subscriptions to access export features, or limit export to higher pricing tiers. Users on free or lower tiers discover that their data is portable only as long as they keep paying - which means it is not portable at all.
-
-Account closure timing is the fourth. When a user cancels a subscription, access typically ends when the billing period ends. A user who cancels on the first of the month with a billing cycle that ends on the fifteenth has fifteen days to export before the account closes. Miss that window - because you changed jobs, because the cancellation notice did not clearly state the deadline - and the data may be gone.
+Four mechanisms make data inaccessible while it technically "belongs" to you. Export rate limits: many platforms allow data export but rate-limit the export API to prevent bulk extraction; a legal firm with ten years of matter history may find that retrieving its own data at the permitted rate takes weeks. Proprietary formats: the export is available, but in a format only the vendor's tools read well — comment threads export as flat text, custom fields export as raw headers without semantic context. Feature-gated access: some platforms require paid subscriptions to access export features, so portability is contingent on continued payment. Account closure timing: access ends when the billing period ends; miss the export window — because you changed jobs, because the notice was unclear — and the data may be gone.
 None of these are edge cases. They are the routine operational parameters of vendor-managed data.
 ### The Price That Changes After You've Committed
-Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns - these represent real investments. Vendors know this. Pricing structures often reflect it.
-
-Pricing is competitive during the acquisition phase, when vendors are winning customers and competing on features and price. After adoption, when the switching cost is real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month, built an organization-wide workflow on it over two years, and now faces a renewal at $18 per seat per month confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins.
-
-Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. The roadmap description from three years ago that listed a capability as "included on Professional" may not match the current pricing page.
Users who built workflows on features they understood to be included sometimes discover those features now require the next tier up. +Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns — these represent real investments. Vendors know this. -The per-seat model creates structural pressure as teams grow. A ten-person team's annual SaaS bill is manageable. A fifty-person team's bill at the same per-seat rate is five times larger, and by the time a company has reached fifty people using a platform, the switching cost has compounded accordingly. Teams that grow into enterprise sizes often find that per-seat pricing which was attractive at ten seats has become a significant budget line that IT attempts to renegotiate - often without success, because leverage has shifted. +Pricing is competitive during acquisition, when vendors are winning customers. After adoption, when switching costs are real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month and now faces renewal at $18 per seat confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins. -Mid-contract price changes are less common but not rare. Platform economics shift, investor pressure changes, the competitive landscape evolves. Users who committed workflows and data to a platform signed a contract of sorts - and then discovered the other party's interpretation of that contract differed from their own. +Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. Per-seat models create structural pressure as teams grow: at the same per-seat rate, a fifty-person team pays five times what a ten-person team pays, and by that point the switching cost has compounded accordingly.
-The lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. When one vendor raises prices, the team is not evaluating that product in isolation - they are evaluating the cost of unwinding a set of integrations built over years. Integration ecosystems serve the vendor's retention objectives as reliably as they serve the user's productivity. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. +Lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. ### The Drift You Don't See -The first five modes manifest visibly. The platform stops loading, the vendor announces a shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. +The first five modes manifest visibly. The platform stops loading, the vendor announces shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. -This one does not. The system continues to operate normally. Two users edit the same record on different devices, then a sync conflict resolves silently in favor of one set of changes; the other user's work is gone, but no error appears and no notification fires. 
A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the application logic that depended on uniqueness produces wrong results until someone notices the second copy. The work appears to continue. The output is wrong. +This one doesn't. Two users edit the same record on different devices; a sync conflict resolves silently in favor of one set of changes, the other user's work is gone, and no error fires. A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the logic that depended on uniqueness produces wrong results until someone notices the second copy. -Silent corruption and silent divergence are the failure modes the user catches last and trusts the system about most. Production engineering teams who have shipped collaborative SaaS describe these as the bugs they fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number does not add up or a record they remember saving is no longer there. The architecture matters here because of where convergence is decided. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it is wrong enough to notice. The architecture I argue for in the chapters that follow makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. 
+Silent corruption and silent divergence are the failure modes production engineering teams fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number doesn't add up. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it's wrong enough to notice. The architecture developed in later chapters makes the convergence question first-class at the data layer rather than implicit in vendor behavior. ### The Third-Party Veto -The first six failure modes originate inside the service relationship. The vendor fails, decides, prices, or quietly drifts. Both the vendor and the customer are subject to the same disruption, and in most cases neither party wanted it. +The first six failure modes originate inside the service relationship. The seventh does not. An external authority (a government, a regulator, a court) can restrict access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides has acted. -The seventh does not. An external authority - a government, a regulator, a court - restricts access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides of the relationship has acted, and the service relationship cannot continue. +The authority can act on the vendor. In 2022, Western SaaS providers — Adobe, Autodesk, Microsoft, Figma, and dozens of others — suspended service across Russia and CIS markets under sanctions enforcement. Organizations across those markets, accounting for hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1].
Federal agencies with active Anthropic deployments received direction to cease using them. Anthropic contested the designation legally [2], and a California court enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. -The authority can act on the vendor. In 2022, Western SaaS providers - Adobe, Autodesk, Microsoft, Figma ([figma.com](https://www.figma.com/), the design tool), and dozens of others - suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement; organizations across those markets, accounting for many hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. Software that had been licensed, trained on, and integrated into operational workflows became inaccessible with days of notice, not months. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1]. Federal agencies with active Anthropic deployments - deployments they found valuable and wished to continue - received direction under executive order to cease using them. Anthropic contested the designation legally [2], and a California court subsequently enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. The analytically significant detail in both cases: the restriction came from a party with authority over the vendor, independent of both the vendor's and the customer's preferences. +The authority can act on the customer instead. 
Russia's Federal Law 242-FZ has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture can't provide on-premises data residency by design. The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023 creates comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. -The authority can act on the customer. Russia's Federal Law 242-FZ - among the first general-purpose data localization laws globally, predating GDPR (General Data Protection Regulation) by two years - has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture cannot provide on-premises data residency by design. The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the vendor continued operating; the customer's legal ability to continue using it was constrained. India's DPDP (Digital Personal Data Protection) Act 2023 is now creating comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. In each case, the customer becomes non-compliant regardless of the vendor's preferences or actions. - -The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. 
Data on hardware the user controls requires action targeted specifically at the user. The architecture either concentrates that exposure surface at the vendor or distributes it. +The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. Data on hardware the user controls requires action targeted specifically at the user. --- ## The Work That Doesn't Stop -The seven failure modes above describe what breaks. The work itself continues - that is the part most cloud-dependency arguments miss. They reach for whatever still works. +The seven failure modes above describe what breaks. The work itself continues — that's the part most cloud-dependency arguments miss. Workers reach for whatever still works. -In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. What follows is recognizable to anyone who has lived through an actual EHR outage: dry-erase boards return to the nurses' station, a fax machine reappears at triage, paper prescription pads come out of the supply closet, and triplicate forms circulate among medical assistants who have never seen them before - felt-tip markers oblivious to the carbon backing, the bottom copies coming out blank. A senior nurse spends much of the episode correcting the younger staff on the conventions of an analog workflow they have only heard about in training. The trauma center keeps operating. The patients still get seen. The work does not stop. +In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. 
The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. Dry-erase boards return to the nurses' station. Paper prescription pads come out of the supply closet. Triplicate forms circulate among medical assistants who have never seen them — felt-tip markers oblivious to the carbon backing. The trauma center keeps operating. The patients get seen. The work doesn't stop. The episode is fiction. The pattern is not. Maria Santos lived it. -Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. She was three hours into her shift, sitting in her office with a coffee that had gone cold during the second of two morning standups, when the help-desk queue lit up. By 9:14 the EHR was unavailable system-wide. By 9:21 the radiology PACS was unreachable. By 9:30 she was in the CIO's office watching him try to reach the vendor's emergency line and getting an automated message that confirmed only that the vendor was aware of "an incident affecting multiple customers." - -The hospital had forty-seven patients in the OR queue that morning. The list of who was scheduled for what existed in the EHR. Without the EHR, the list existed in the heads of the nurses who had been reading it at 7 a.m. before everything went dark. Maria spent the next eleven hours doing things hospital administrators are not supposed to have to do. She walked the floor with a clipboard. She watched the triage nurses recreate patient acuity ratings on dry-erase boards. She stood next to a charge nurse who was trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that would not respond. She made eight phone calls that morning that ended with sentences she will not say again. 
*I don't know yet.* *We're working on it.* *I will call you when I have something to tell you.* - -The vendor restored access seventy-three hours later. The hospital had not lost a patient. Several other hospitals in the same vendor's customer base, hit the same week, had. Maria does not know what those hospitals' administrators were doing during their seventy-three hours and she does not need to know. She knows the shape of those hours from inside. +Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. By 9:14 the EHR was unavailable system-wide. By 9:21 the radiology PACS was unreachable. The hospital had forty-seven patients in the OR queue. Without the EHR, that list existed in the heads of the nurses who had read it at 7 a.m. Maria spent the next eleven hours walking the floor with a clipboard, watching triage nurses recreate patient acuity ratings on dry-erase boards, standing next to a charge nurse trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that wouldn't respond. -She still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she could not tell a charge nurse whether a man's chart said sulfa or penicillin. +The vendor restored access seventy-three hours later. The hospital had not lost a patient. Maria still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she couldn't tell a charge nurse whether a man's chart said sulfa or penicillin. -Healthcare ransomware incidents are tracked publicly by trackers including Comparitech, the HIPAA Journal, and the HHS OCR breach portal, and the count of US hospital ransomware events has run into the hundreds per year for several years now. 
Healthcare-services research has consistently associated ransomware-driven EHR downtime with elevated patient-harm metrics - the specific magnitudes vary by study and by the size of the disruption window. Healthcare professionals interviewed about *The Pitt* identified the same artifacts in their own incident logs: paper charts piling up at the nurses' station, prescriptions written by hand, hours of post-restoration overtime to back-fill the EHR with what happened on paper while the system was offline. The on-screen chaos is not exaggerated. It is documentary realism dressed as drama. +Healthcare ransomware incidents have run into the hundreds per year for several years. Healthcare-services research consistently associates EHR downtime with elevated patient-harm metrics. The on-screen chaos in *The Pitt* is not exaggerated — it is documentary realism dressed as drama. -Two observations matter for any architecture decision. First: the work continued because human practitioners knew what to do without the digital system. Triage worked. Charting worked. Billing eventually caught up. Domain expertise outlasts the software that depends on it. Second: the digital affordances did not survive. Search disappeared. Cross-shift handoff slowed to verbal report. Pattern detection across patient histories - the analytic work that justified the EHR investment in the first place - became impossible until the system came back. The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. +Two observations drive every architecture decision that follows. First: the work continued because human practitioners knew what to do without the digital system. Domain expertise outlasts the software that depends on it. Second: the digital affordances didn't survive. Search disappeared. Pattern detection across patient histories — the analytic work that justified the EHR investment — became impossible until the system came back. 
The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. -The same pattern repeats outside the hospital. When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. When the SaaS field-service application fails, the technician carries a paper work order and reconciles in the system the next day. None of these workarounds are the failure of the people. They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. +When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. None of these workarounds are the failure of the people. They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. -This is the gap the inverted stack closes. A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet - every hospital should have one, every law firm should have one, every construction office should have one - but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. The patient-harm signature of EHR downtime becomes a statistic about an architecture that the next generation of systems was designed to replace. That is the empirical case this dissertation builds. +This is the gap the inverted stack closes. 
A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet — but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. --- ## Who Pays the Most -These seven failure modes do not hit every organization equally. The organizations most exposed share a characteristic: they have the least structural leverage to address any of them. +These seven failure modes don't hit every organization equally. The most exposed share a characteristic: they have the least structural leverage to address any of them. -A large enterprise with a skilled procurement and IT organization can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data - these are available to buyers with enough revenue to make the vendor's legal team engage seriously. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms or negotiate exit conditions. +A large enterprise with a skilled procurement team can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data — these are available to buyers with enough revenue to make the vendor's legal team engage. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms. -Small and medium-sized professional service firms do not have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through a terms of service that nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified. They have no SLA. They have no escrow. They have no explicit data portability requirement. 
If the vendor changes pricing, those users have no mechanism to object. If the vendor shuts down, they have whatever the shutdown announcement says they have. +Small and medium-sized professional service firms don't have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through terms of service nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified — no SLA, no escrow, no explicit data portability requirement. -These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid - and damages the relationship with the client. The legal practice unable to access case files has a professional responsibility exposure. The medical practice that cannot retrieve patient records has regulatory risk. The stakes of availability are not abstract. +These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid and damages the client relationship. The legal practice unable to access case files has professional responsibility exposure. The medical practice that can't retrieve patient records has regulatory risk. The stakes of availability are not abstract. -And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and the procurement counsel is using enterprise-licensed software with negotiated protections. The eight-attorney law firm is using the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode described in this chapter. 
+And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and procurement counsel uses enterprise-licensed software with negotiated protections. The eight-attorney law firm uses the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode in this chapter. -This is not a coincidence. The SaaS bundle packages its desirable and undesirable properties together in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. +This is not a coincidence. The SaaS bundle packages its desirable and undesirable properties in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. -The regulatory dimension compounds this asymmetry. A legal practice storing confidential client communications in a vendor's cloud carries a professional duty to understand where that data lives and who can access it. A medical practice has HIPAA (Health Insurance Portability and Accountability Act) obligations. A construction firm with government contracts may have data residency requirements tied to those contracts. For large enterprises, these obligations get negotiated into vendor agreements with audit rights and data processing addenda. For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy - a document written to protect the vendor, not the client. +The regulatory dimension compounds this. A legal practice storing client communications in a vendor's cloud carries a professional duty to understand where that data lives. A medical practice has HIPAA obligations. For large enterprises, these get negotiated into vendor agreements with audit rights and data processing addenda. 
For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy — a document written to protect the vendor, not the client. -The jurisdictional scope of this compliance argument is wider than US-centric discussions typically acknowledge. The EU's Schrems II ruling, India's Digital Personal Data Protection Act 2023, the UAE's DIFC (Dubai International Financial Centre) Data Protection Law 2020, China's Personal Information Protection Law (PIPL, 2021), Brazil's LGPD (Lei Geral de Proteção de Dados, 2018), South Africa's POPIA (Protection of Personal Information Act, 2013), Nigeria's NDPR (Nigeria Data Protection Regulation, 2019), Japan's APPI (Act on the Protection of Personal Information), South Korea's PIPA (Personal Information Protection Act), and Russia's Federal Law 242-FZ are representative - each, in different language, makes data residency a compliance mechanism rather than a preference. The same pattern repeats across more than thirty national and regional frameworks; the full coverage table for this chapter is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware - not in a vendor's cloud region - is not merely preferred. In many configurations, it is the architecture that makes compliance tractable. The architecture I propose is frequently a legal requirement before it is an architectural choice. +The jurisdictional scope is wider than US-centric discussions acknowledge. The EU's Schrems II ruling, India's DPDP Act 2023, China's PIPL (2021), Brazil's LGPD (2018), South Africa's POPIA (2013), Nigeria's NDPR (2019), and Russia's Federal Law 242-FZ each make data residency a compliance mechanism rather than a preference. The full coverage table is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware is not merely preferred — in many configurations it is the architecture that makes compliance tractable. 
--- ## Why Users Have Accepted This -Until recently, they did not have a choice. +Until recently, they didn't have a choice. -Real-time collaboration requires that all parties see consistent state when they make concurrent changes. In 2008, the most practical way to guarantee this was a central server both parties could read from and write to simultaneously. Every other approach - emailing files, shared drives, version control - introduced either merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. Real-time collaboration solved both problems by making divergence impossible: one copy, everyone editing the same one. +Real-time collaboration required a central server both parties could read from and write to simultaneously. Every other approach — emailing files, shared drives, version control — introduced merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. One copy, everyone editing the same one, solved both. -Multi-device sync requires an authoritative copy that all devices agree on. When the cloud holds the authoritative copy, sync is the cloud pushing updates to each device. Without a cloud authority, devices have to figure out among themselves which version is current - and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without requiring user intervention, did not exist. Merging concurrent edits deterministically, without a server to adjudicate conflicts, was an unsolved problem for end-user software. +Multi-device sync required an authoritative copy that all devices agreed on. Without a cloud authority, devices had to figure out among themselves which version was current — and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without user intervention didn't exist. -Zero maintenance requires that someone else manage the infrastructure. 
The alternative is the user managing it, which requires IT capability that most small organizations do not have and do not want to develop. The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker, a self-hosted document collaboration platform - all theoretically possible, all practically demanding enough that most organizations paid someone else to handle it. - -The dependencies looked structural because they were structural. The technology for delivering these properties without vendor infrastructure either did not exist or was not mature enough to deploy without specialized expertise. CRDTs (Conflict-free Replicated Data Types) were academic research with a handful of experimental implementations. Gossip protocols ran inside distributed databases; nobody was building them into end-user applications. Container runtimes existed for server workloads; the packaged, embeddable, consumer-invisible form that makes Docker Desktop run silently on your laptop had not been built. +Zero maintenance required that someone else manage the infrastructure. The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker — all theoretically possible, all practically demanding enough that most organizations paid someone else. Users accepted the SaaS bundle not because they preferred the conditions on the second half but because the technology of the time made those conditions appear to be the cost of the first half. They were not accepting a bargain so much as acknowledging a constraint. -The constraint is removable - by the architecture this dissertation proposes. - -The evidence is commercial, not theoretical. The earliest and most consequential proof is African mobile money: M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. 
Both are built on offline-tolerant transaction patterns - store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps - because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. +The constraint is removable. -In the professional software space, Linear ([linear.app](https://linear.app/), the issue tracker) demonstrates that a sync engine can run locally even inside a SaaS architecture - clients keep a local SQLite replica, and the cloud is demoted to a relay peer for the engine layer. Authoritative data still lives on Linear's servers; the architecture I argue for takes the next step. Figma is often cited in the same breath because Figma uses CRDT-flavored mechanisms for multiplayer cursor coordination - but Figma's data lives on Figma's servers and the local client is not authoritative; Figma is a collaboration win, not a data-sovereignty architecture. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional, with no vendor data custody required. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. +The evidence is commercial, not theoretical. M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. Both are built on offline-tolerant transaction patterns — store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps — because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. 
-These products demonstrate that the desirable half of the SaaS bundle - collaboration, sync, responsive UI - does not require vendor data custody to function. Users who have worked with software built on these foundations know what it feels like when software keeps running after the internet goes out. The acceptance erodes when the alternative is observable, not theoretical. +In professional software, Linear demonstrates that a sync engine can run locally even inside a SaaS architecture — clients keep a local SQLite replica, and the cloud is demoted to a relay peer. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. These products demonstrate that the desirable half of the SaaS bundle — collaboration, sync, responsive UI — doesn't require vendor data custody to function. --- ## The Dependency That Looks Inevitable -Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge (the same family of protocols Cassandra and DynamoDB use at planetary scale, applied without modification at five-machine team scale); and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence - that the technical reasons SaaS architectures had to concentrate data at the vendor are gone - followed from those solutions. Chapter 2 develops each in full. 
+Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge — the same family of protocols Cassandra and DynamoDB use at planetary scale, applied at five-machine team scale; and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence — that the technical reasons SaaS architectures had to concentrate data at the vendor are gone — followed from those solutions. Chapter 2 develops each in full. -The architecture this dissertation proposes has real costs. They do not disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator does not own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Some will not. Chapter 4 helps you decide. +The architecture this book proposes has real costs. They don't disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator doesn't own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. 
Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Chapter 4 helps you decide. -Marcus's scenario - deadline-critical work held hostage by infrastructure he does not control - is the failure mode this architecture addresses first. His data was never gone. It was inaccessible because the software's design placed it somewhere he could not reach. The remaining chapters specify a design where that distinction does not exist. +Sunita's scenario — deadline-critical work held hostage by infrastructure she doesn't control — is the failure mode this architecture addresses first. Her data was never gone. It was inaccessible because the software's design placed it somewhere she couldn't reach. The remaining chapters specify a design where that distinction doesn't exist. -The building blocks are production-proven. What remains is the specific assembly that produces a node - not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. +The building blocks are production-proven. What remains is the specific assembly that produces a node — not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. 
--- diff --git a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md index 20e27ae..4e9c5fa 100644 --- a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md +++ b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md @@ -1,89 +1,87 @@ # Chapter 2 - Local-First: From Sync Toy to Serious Stack - + --- -In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties - a testable definition the field could use to separate what counts from what merely calls itself local-first. +In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties — a testable definition the field could use to separate what counts from what merely calls itself local-first. -The seven properties expose exactly where every existing attempt falls short - including the best commercial ones. Getting to all seven requires more than clever sync. It requires running a complete application stack at the edge, not a smarter cache of someone else's database. +The seven properties expose exactly where every existing attempt falls short, including the best commercial ones. Getting to all seven requires more than clever sync. It requires running a complete application stack at the edge, not a smarter cache of someone else's database. -The word "serious" in this chapter's title is not a claim about complexity. 
It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven. And it adds what the ideals paper did not. The deployment model. The security model. The governance model. The migration story. The path to commercial sustainability. **The composition is the contribution** - not the individual components, which are all production-proven somewhere, but the assembly that lets them be one system. +The word "serious" in this chapter's title is not a claim about complexity. It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven — and adds what the ideals paper did not: a deployment model, a security model, a governance model, a migration story, and a path to commercial sustainability. **The composition is the contribution** — not the individual components, which are all production-proven somewhere, but the assembly that lets them function as one system. --- ## The Seven Ideals -The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar - a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. +The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar — a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. -**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. 
A project management tool that must phone home to load the task list fails the property during the first round-trip. It fails permanently when the network is gone. +**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. A project management tool that phones home to load the task list fails the property during the first round-trip and fails permanently when the network is gone. -**Work is not trapped on one device.** Your data on your laptop should be your data on your desktop, your tablet, your colleague's machine. Sync across devices and across people - not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. +**Work is not trapped on one device.** Data on a laptop should be data on a desktop, a tablet, a colleague's machine. Sync across devices and across people — not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. -**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." Optional means the full application works without any network connection, indefinitely, and then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API (Application Programming Interface). It eliminates every app whose write path queues locally and waits. Real offline requires that the local node hold an authoritative copy of data it is allowed to act on. +**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." 
Optional means the full application works without any network connection, indefinitely, then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API and every app whose write path queues locally and waits. Real offline requires the local node to hold an authoritative copy of data it is allowed to act on. -**Seamless collaboration.** Multiple people should be able to edit the same data simultaneously - without explicit locking, without "checkout" workflows, without a person designated to resolve conflicts manually. This is the property that made centralized servers feel necessary. If two people are writing concurrently, something has to decide the order. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. +**Seamless collaboration.** Multiple people should edit the same data simultaneously — without explicit locking, without checkout workflows, without a person designated to resolve conflicts manually. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. -**The long now.** Your data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about. A more recent and more consequential demonstration came in 2022. Adobe suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement [10]. 
Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma ([figma.com](https://www.figma.com/), the design tool) blocked Russia-based users in compliance with US sanctions [11]. Dozens of other Western SaaS (Software as a Service) providers followed. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool - or the jurisdiction the company operates in. Proprietary sync formats - even sync formats that feel invisible - fail this property. +**The long now.** Data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about. In 2022, Adobe suspended service across Russia and CIS markets under sanctions enforcement [10]. Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma blocked Russia-based users in compliance with US sanctions [11]. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool — or the jurisdiction the company operates in. -**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. 
Distributed local stores raise the cost of attack - an adversary who compromises one node gets one user's data, not all users' data. Local storage without encryption creates a different problem: physical access to the device is sufficient. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. A distinct threat model applies in jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements: architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. A local app that stores data in plaintext fails this property as badly as a cloud app does. +**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. Distributed local stores raise the cost of attack — an adversary who compromises one node gets one user's data, not all users' data. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. -**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee. It is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. 
+**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee — it is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. -Seven properties. Together they describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. The closest candidate is Anytype, which satisfies five - CRDT (Conflict-free Replicated Data Type)-based collaboration and zero-knowledge encryption by default - but falls short on the long now (its full-fidelity export uses a proprietary Any-Block format no competing app reads natively) and on ultimate ownership (the application layer is "source available," not open-source; structural vendor independence depends on a contractual arrangement with the Any Association, not the architecture alone). Kleppmann himself no longer treats the seven as a binary checklist. At Local-First Conf 2024 he acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. No production app has cleared them all. +Together, the seven properties describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. At Local-First Conf 2024, Kleppmann acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. --- ## What Exists Today: A Taxonomy of Local-First Attempts -The local-first community has produced serious work. The apps below are not failures. 
They are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. +The local-first community has produced serious work. The apps below are not failures — they are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. ### The Document Sync Apps (Obsidian, Notion) -Obsidian stores notes as plain markdown files on your local filesystem. This is a genuinely correct choice. Plain text in an open format, on your own storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears tomorrow, the files remain and every text editor on the planet reads them. The long-now property is satisfied by the data format alone. +Obsidian stores notes as plain markdown files on a local filesystem. Plain text in an open format, on user-controlled storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears, the files remain and every text editor reads them. The long-now property is satisfied by the data format alone. -Where Obsidian stops is structured data and collaboration. Markdown files have a limited conflict resolution strategy: when two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original. Resolution is manual. For a solo note-taker, this is an infrequent and tolerable annoyance. 
For a team using shared notes to track client work, project status, or decisions - where concurrent edits are the norm - the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. +Where Obsidian stops is structured data and collaboration. When two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original; resolution is manual. For a solo note-taker, this is an infrequent and tolerable annoyance. For a team using shared notes to track client work, project status, or decisions — where concurrent edits are the norm — the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. -The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same convention. The moment a team needs structured data - not documents, but records - Obsidian's model breaks down. It is a document tool that happens to sync, not a structured-data tool with local-first properties. +The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same one. 
The moment a team needs structured data — not documents, but records — Obsidian's model breaks down. -Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers throughout. Concurrent edits go through those servers, which hold the authoritative copy. The long-now property fails immediately. Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs - a representation, not a migration. The relational structure, the filters, the formulas, the comment threads - none of these export faithfully to a format another application understands. +Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers. Concurrent edits go through those servers. The long-now property fails immediately: Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs — a representation, not a migration. The relational structure, the filters, the formulas, the comment threads — none export faithfully to a format another application understands. -Both approaches demonstrate a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent - which is what CRDTs over a typed document store provide. +Both approaches expose a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. 
Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent — which is what CRDTs over a typed document store provide. -### The Lightweight Replica Apps (Linear ([linear.app](https://linear.app/), the issue tracker), Liveblocks) +### The Lightweight Replica Apps (Linear, Liveblocks) -Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first. The sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant - no loading spinners, no optimistic-update lag, no visible round trips. The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation (status changes on issues, comment submissions, project mutations) are visibly queued rather than silently dropped - but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. +Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first; the sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant — no loading spinners, no optimistic-update lag, no visible round trips. -Background jobs - notifications, automations, integrations - run server-side. An automation that moves issues between states when conditions are met does not run on the local node. 
It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. +The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation are visibly queued rather than silently dropped — but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. -The practical consequence: Linear passes the "no spinners" property and partially passes "the network is optional" for reads. It does not pass network-optional for writes to server-owned records, does not pass peer-to-peer collaboration without Linear's relay, does not pass vendor independence, and does not pass the long now - Linear's data lives in Linear's format, accessible through Linear's API, exportable to CSV only. Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. +Background jobs — notifications, automations, integrations — run server-side. An automation that moves issues between states when conditions are met does not run on the local node. It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. 
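The write path described above for lightweight replica apps can be sketched in a few lines. This is a minimal illustration, not Linear's implementation: the `ReplicaClient` class, its `outbox` field, and the plain `dict` standing in for the server are all hypothetical names chosen for the example. It shows the two properties the text describes: a write lands in the local replica immediately, and it stays queued until the server is reachable again.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaClient:
    """Hypothetical smart-cache client: a local replica plus an outbox of unsent writes."""
    local: dict = field(default_factory=dict)    # local copy of server state
    outbox: list = field(default_factory=list)   # writes waiting for the server

    def write(self, key, value):
        # Apply to the local replica first so the UI never waits on the network.
        self.local[key] = value
        self.outbox.append((key, value))

    def flush(self, server, online):
        # Queued writes only become authoritative once the server accepts them.
        if not online:
            return 0
        sent = 0
        while self.outbox:
            key, value = self.outbox.pop(0)
            server[key] = value
            sent += 1
        return sent


server = {}                      # stand-in for the vendor's authoritative store
client = ReplicaClient()
client.write("ISSUE-1", "in progress")

flushed_offline = client.flush(server, online=False)
pending_offline = len(client.outbox)
flushed_online = client.flush(server, online=True)
```

While `online` is false, the local replica already shows the new value but the server knows nothing about it. The queue drains only when the relay returns, which is the dependency the paragraph above points at.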
-Replicache ([replicache.dev](https://replicache.dev/), the sync framework from Rocicorp) is the most direct production competitor in this category and the system most often suggested as an off-the-shelf path to local-first apps. Replicache provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache for free [9]. The model is correct for the sync layer it covers - optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side to validate against authoritative state, and offline writes queue against an eventual reconciliation that the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. The framework is also deliberately scoped to the sync transport - schema migration, key custody, MDM packaging, and the business model are application-developer responsibilities, not framework features. +Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. -### The Local-First Finance App (Actual Budget) - -Actual Budget runs entirely offline by default - no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available, because its operation does not depend on the network at any point. +Replicache ([replicache.dev](https://replicache.dev/)) is the most direct production competitor in this category. 
It provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache [9]. The model is correct for the sync layer it covers — optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side, and offline writes queue against a reconciliation the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. -This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control - the user has a file on their disk). It makes a credible attempt at the fifth (the long now) by virtue of using an open database format that other tools can read. +### The Local-First Finance App (Actual Budget) -Where Actual Budget stops is collaboration and multi-device sync. The application is single-user by design. Two people cannot jointly manage a budget in Actual Budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service Actual Budget offers addresses multi-device access for a single user - the budget file syncs across the user's own devices through a hosted relay. This reintroduces a central server, though the server's role is deliberately minimal: relay and backup, not authority. +Actual Budget runs entirely offline by default — no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available. -The team collaboration case does not exist. 
Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. Its data model is single-user because its design is single-user. Adapting it to multi-user team workflows would require adding CRDTs, a distributed data model, access control, and a sync protocol - at which point it would no longer be Actual Budget, but a substantially new system. +This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control). It makes a credible attempt at the fifth (the long now) by using an open database format that other tools can read. -The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part that Actual Budget does not attempt. +Where Actual Budget stops is collaboration and multi-device sync. Two people cannot jointly manage a budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service addresses multi-device access for a single user through a hosted relay — which reintroduces a central server, though its role is deliberately minimal: relay and backup, not authority. The team collaboration case does not exist. Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. -### The Research Prototypes (Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library), Ink & Switch Essays) +The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part Actual Budget does not attempt. 
-Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library. Given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. The library is production-quality for its intended use case. Ink & Switch has published detailed essays on collaborative applications built on Automerge - Pushpin, Backchat, Trellis - that demonstrate what local-first collaboration looks like in practice when the data model is right. +### The Research Prototypes (Automerge, Ink & Switch Essays) -The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport - something to move operations between peers. Several sync backends exist (the Automerge sync server, AutomergeRepo), and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. +Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge)) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library: given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. 
Ink & Switch has published detailed essays on collaborative applications built on Automerge — Pushpin, Backchat, Trellis — that demonstrate what local-first collaboration looks like in practice when the data model is right. -The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. The essays document what is possible and identify what remains to be engineered. They are research artifacts, not shipping products. A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport. They have not acquired a production system. They have acquired the foundation for one. +The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport — something to move operations between peers. Several sync backends exist and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. -The document-centric nature of Automerge is also a structural constraint. Documents are a natural fit for rich text, drawings, and unstructured collaborative content. A team running a field operation with structured records - work orders, inspection logs, invoices, asset registries - needs typed records with schema migration, not just documents. The CRDT merge semantics generalize across both cases, but the tooling, the query model, and the schema evolution story are different problems that Automerge leaves to application builders. +The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. 
A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport — not a production system, but the foundation for one. ```mermaid graph LR @@ -115,9 +113,9 @@ graph LR --- -## What Each Gets Right - and Where It Stops +## What Each Gets Right — and Where It Stops -Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. Each dependency reflects a real problem the approach did not attempt to solve. +Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. 
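The merge-order independence attributed to CRDTs above can be illustrated with a deliberately small example. The sketch below is a last-writer-wins map, which is much simpler than Automerge's actual algorithm and is used here only as an assumption-laden stand-in; the `merge` function and the `(logical_clock, replica_id, value)` entry layout are invented for illustration.

```python
def merge(a, b):
    """Merge two last-writer-wins maps of key -> (logical_clock, replica_id, value)."""
    merged = dict(a)
    for key, entry in b.items():
        # Keep whichever entry carries the larger (clock, replica_id) stamp.
        # The replica id breaks ties deterministically on every peer.
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged


base = {"status": (1, "a", "open")}

# Two replicas diverge while partitioned.
replica_a = merge(base, {"status": (2, "a", "in review")})
replica_b = merge(base, {"status": (2, "b", "blocked"),
                         "owner": (1, "b", "site-manager")})

ab = merge(replica_a, replica_b)   # A receives B's state
ba = merge(replica_b, replica_a)   # B receives A's state
```

Both peers end with identical state whichever direction the exchange runs, and merging a state with itself changes nothing. Note that this toy resolves the concurrent `status` edit by picking a winner; surfacing such edits to the user instead would need extra bookkeeping beyond what is shown here.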
The pattern becomes clearest in a like-for-like comparison across the four axes that determine whether a system meets a serious local-first bar: @@ -132,31 +130,27 @@ The pattern becomes clearest in a like-for-like comparison across the four axes | **Actual Budget** | Fully local + optional self-hosted sync | User-held SQLite | User-device only | Open-source; user runs everything | | **Automerge** | Library + sync transport (developer-supplied) | Whatever the application chooses | Whatever the application chooses | Open-source library | -The table makes the gap visible. Every system that satisfies vendor-independent data ownership stops short of team collaboration; every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node - the composition that no system in this table currently delivers. +Every system that satisfies vendor-independent data ownership stops short of team collaboration. Every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node — which no system in this table currently delivers. --- ## The Missing Step: Full Node, Not Smart Cache -The question that distinguishes this architecture from the approaches above is this: - -> What if a user's workstation ran a full node of the system - including state, business logic, and sync - such that "the cloud" is merely another peer, not the source of truth? - A smart cache knows what the server knows, slightly earlier. A full node knows what the user's data is. The distinction matters when the server is down, when the vendor goes away, when the network is unreachable, and when the user needs to understand, export, or migrate their data. 
-A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup - assistance for coordination and disaster recovery, not a source of truth. +A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup — assistance for coordination and disaster recovery, not a source of truth. -Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When the sync eventually completes, there may be conflicts between the superintendent's offline writes and changes made by others - conflicts the smart-cache app resolves by whatever heuristic the vendor chose, without surfacing the conflict to the user. +Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When sync eventually completes, the smart-cache app resolves conflicts using whatever heuristic the vendor chose, without surfacing them to the user. 
-A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally - the automation runs on the node, not on a server. When the sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. +A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally. When sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. -The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach. The full node acts on behalf of the user. +The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach; the full node acts on behalf of the user. -The pattern has operational precedent at scale. Modern point-of-sale systems - Square Reader and Toast - operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns, and the merchant's authoritative state advances against the local replica until then. 
Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. These products demonstrate user-device-replica operation at commercial scale in domains where the cost of failed offline operation is concrete. What I describe in this dissertation generalizes that pattern beyond payments and field service to structured-data applications more broadly: typed records with evolving schemas, collaborative edits across multiple peers, and enterprise governance that survives procurement review. +The pattern has operational precedent at scale. Square Reader and Toast operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns. Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. Both demonstrate user-device-replica operation at commercial scale in domains where failed offline operation has concrete cost. -This reframes what "offline support" means. Offline support in the smart-cache model means "some operations work offline, with degraded functionality." Offline support in the full-node model means "all operations work offline, identically." The distinction is not a feature comparison. It is a structural property that follows from where authority lives. +"Offline support" in the smart-cache model means some operations work offline, with degraded functionality. In the full-node model it means all operations work offline, identically. The distinction is not a feature comparison — it is a structural property that follows from where authority lives. -Every component of this model has a production analogue that validates it separately. 
CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production, and the Automerge library is deployed in commercial collaborative applications - though Automerge users have to budget for known operational costs (document size growth with edit history, cold-sync time on long-lived documents, and garbage-collection cadence) that the library leaves to the application. Figma's multiplayer editor is not a pure CRDT deployment - its engineers describe it as "inspired by multiple separate CRDTs" over a server-authoritative, per-property merge - but it independently validates that per-property conflict resolution works for real-time collaborative editing at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. None of these components are speculative. My contribution is the *composition* - specifically, three pieces no other published architecture combines: a per-record CAP boundary that lets AP-class records and CP-class records coexist in one system, an MDM (Mobile Device Management)-deployable installer model that lets enterprise IT ship full-node software without bespoke onboarding, and an AGPLv3-with-managed-relay business model that makes the architecture economically viable without forcing vendor data custody. +Every component of this model has a production analogue. CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production. 
The Automerge library is deployed in commercial collaborative applications, though users must budget for known operational costs — document size growth, cold-sync time on long-lived documents, and garbage-collection cadence — that the library leaves to the application. Figma's multiplayer editor independently validates that per-property conflict resolution works at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. ```mermaid graph TB @@ -178,38 +172,32 @@ graph TB --- -## What This Dissertation Adds - -The seven Kleppmann ideals [1] define the target. They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement and not just a preference, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a "couch device" returning after six months offline, and generates revenue that funds ongoing development. +## What This Book Adds -The regulatory pressure is now global, and the laws cluster by region. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards - making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis rather than an architectural preference, with national implementation guidance from Germany's BSI and France's CNIL. +The seven Kleppmann ideals [1] define the target. 
They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a device returning after six months offline, and generates revenue that funds ongoing development. -The pattern repeats across regions with named regulators in each: India's DPDP Act 2023 [5] and the RBI's payment-data localization circular; the UAE's DIFC DPL 2020 [6]; Russia's Federal Law 242-FZ [7]; China's PIPL (Personal Information Protection Law) 2021; Brazil's LGPD (Lei Geral de Proteção de Dados); South Africa's POPIA (Protection of Personal Information Act); Nigeria's NDPR (Nigeria Data Protection Regulation); Japan's APPI (Act on the Protection of Personal Information); South Korea's PIPA (Personal Information Protection Act); and the GCC's emerging cluster (KSA's PDPL, Bahrain's PDPL). Each, in different language, treats data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix across these and ~30+ other frameworks is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. +The regulatory pressure is now global. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards — making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis, with national implementation guidance from Germany's BSI and France's CNIL. 
India's DPDP Act 2023 [5], the UAE's DIFC DPL 2020 [6], Russia's Federal Law 242-FZ [7], China's PIPL, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's emerging cluster each treat data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, data on the user's own hardware is the architecture that makes compliance tractable. -The existing implementations - Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage - each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None of them addresses the full set, and none provides the composition. +The existing implementations — Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage — each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None addresses the full set, and none provides the composition. -The seven properties define target state. They do not tell you how to get there - what phases to sequence, what assumptions to validate, what to trade when two properties conflict, what to verify when you claim you are done. This dissertation is the plan that sits under the properties: phases in the five-layer stack and the deployment zones (Chapter 3, Chapter 4), adversarial validation in the council chapters (Part II), verification specification (Part III), and execution playbooks (Part IV). +Three disciplines separate working implementations from prototypes that stall. 
First, integration is where local-first projects die — every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, security is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id) used opaquely, with the DEK/KEK hierarchy composed against a specification a cryptographic engineer has reviewed. Third, long-term portability has one product-level decision that can kill the architecture alone — invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs or Automerge and inherit their portability guarantees. The choice, not the invention, is what makes it feasible. -Three disciplines separate working implementations from prototypes that stall. First, integration is where local-first projects die - every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, Property 6 is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id reference) are used opaquely, and the DEK (Data Encryption Key)/KEK (Key Encryption Key) hierarchy composes them against a specification a cryptographic engineer has reviewed. Third, Property 5 has one product-level decision that can kill the architecture alone - invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs ([github.com/yjs/yjs](https://github.com/yjs/yjs), the JavaScript CRDT library) or Automerge and inherit their portability guarantees. Feasibility is contingent on choosing, not inventing. +The contribution here is the composition. Not new primitives — every component has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. 
The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. -My contribution is the composition. Not new primitives - every component in this architecture has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. - -What I assemble from those proven components: +What that assembly produces: - A node architecture with a stable microkernel and domain plugins under strict versioned contracts, so the system can evolve without breaking in-field deployments. - A per-record CAP positioning model that treats CRDT-merge records and lease-coordinated records as first-class distinct classes, with a defined boundary and a defined handoff between them. - A three-tier CRDT GC policy that keeps document growth bounded without sacrificing merge correctness for active peers. -- A key hierarchy - root organization key, per-role key encryption keys, per-document data encryption keys - that makes key rotation proportional to document count rather than document size, and makes member removal cryptographically effective rather than contractually promised. +- A key hierarchy — root organization key, per-role key encryption keys, per-document data encryption keys — that makes key rotation proportional to document count, and makes member removal cryptographically effective rather than contractually promised. 
- A schema migration strategy using expand-contract, bidirectional lenses, and epoch coordination that allows nodes running different schema versions to coexist on a live team. - An enterprise deployment model: MDM-compatible installers, SBOM (Software Bill of Materials) generation, code signing and notarization, air-gap operation, incident response runbooks. - A business model: AGPLv3 core, managed relay as the paid service, relay economics that become cash-flow positive before meaningful scale. - A governance model: foundation-backed structure, community contributor path, dual-license CLA for enterprise customers. -The managed relay is a residual vendor dependency the architecture does not eliminate - it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction between SaaS vendor dependency and managed-relay dependency is not rhetorical: the former holds decryptable data; the latter does not. - -The architecture stands on the local-first community's work. The paper that named the seven ideals [1] is the benchmark against which my dissertation's design is measured throughout. The Ink & Switch essays on Automerge, Cambria, and collaborative document design are the intellectual foundation for the CRDT and schema evolution sections. Kleppmann's distributed systems work [2] provides the vocabulary that runs throughout Part III. +The managed relay is a residual vendor dependency the architecture does not eliminate — it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction is not rhetorical: a SaaS vendor holds decryptable data; a managed relay does not. 
-The composition is the contribution. The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for determining when this architecture is the right choice and when it is not. +The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for when this architecture is the right choice and when it is not. --- diff --git a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md index 916834c..ce38780 100644 --- a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md +++ b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md @@ -1,6 +1,6 @@ # Chapter 3 - The Inverted Stack in One Diagram - + @@ -9,16 +9,16 @@ ## The Inversion in One Sentence -Every architectural decision in this dissertation follows from one reversal of priority: +Every architectural decision in this book follows from one reversal of priority: -> **Conventional SaaS (Software as a Service):** Cloud database is primary - local device caches and renders. -> **Local-Node Architecture:** Local node is primary - cloud relay is an optional sync peer. +> **Conventional SaaS:** Cloud database is primary — local device caches and renders. +> **Local-Node Architecture:** Local node is primary — cloud relay is an optional sync peer. -In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing - a shell waiting for instructions that will not arrive. +In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing — a shell waiting for instructions that will not arrive. -In the local-node model, the device *is* the server. 
The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode (with one exception that earns its complexity: CP-class records that require distributed lease coordination - covered later in this chapter). It carries no dependency on any remote service for core function. +In the local-node model, the device *is* the server. The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode — with one exception: CP-class records that require distributed lease coordination, covered later in this chapter. It carries no dependency on any remote service for core function. -The architecture resolves into one mental model that the principal diagram below anchors. Supporting diagrams in this chapter visualize specific layer interactions; the principal diagram is what the reader holds. +The architecture resolves into one mental model anchored by the principal diagram below. ```mermaid graph LR @@ -47,7 +47,7 @@ Primary: Node B")] end ``` -The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path at all. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. If that also fails, they work offline and catch up when connectivity returns. +The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. 
If that also fails, they work offline and catch up when connectivity returns. This is the inversion. Everything else is implementation. @@ -55,7 +55,7 @@ This is the inversion. Everything else is implementation. ## The Five Layers -The inversion is one sentence. The five-layer model is why that sentence is implementable - the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. Each layer has a clear owner. Each layer has a clear boundary. Each layer has an answer to the question every distributed system must answer: what happens when the network is unavailable? +The inversion is one sentence. The five-layer model is why that sentence is implementable — the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. Each layer has a clear owner, a clear boundary, and an answer to the question every distributed system must answer: what happens when the network is unavailable? ```mermaid graph TB @@ -84,30 +84,30 @@ Peer Discovery · NAT Traversal"] ### Layer 1: Presentation -The presentation layer renders what the local store contains. That is its entire job. It owns no state. It caches nothing independently. It makes no decisions about data. +The presentation layer renders what the local store contains. It owns no state, caches nothing independently, and makes no decisions about data. -In the Zone A accelerator (the Anchor pattern - offline-by-default local-first desktop), this layer is a .NET MAUI (.NET Multi-platform App UI) Blazor Hybrid shell - a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern - hybrid multi-tenant SaaS) browser shell: the same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. 
This is deliberate. If a UI component only works against a cloud backend, it has not been designed correctly for this architecture. +In the Zone A accelerator (the Anchor pattern), this layer is a .NET MAUI Blazor Hybrid shell: a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern) browser shell. The same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. A UI component that only works against a cloud backend has not been designed correctly for this architecture. -The presentation layer's primary local-first responsibility is status indication. Users should always know the state of their data without interrogating it. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: +The presentation layer's primary local-first responsibility is status indication. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: - **Sync-healthy:** The node is connected to at least one peer and has exchanged a recent delta. - **Stale:** The node has not synced within its configured freshness threshold; local data may lag behind changes made by others. -- **Offline:** No peers are reachable. The node is operating on its own authoritative copy. +- **Offline:** No peers are reachable. The node operates on its own authoritative copy. - **Conflict-pending:** One or more records have diverged from a peer version and require resolution. -Each state must be communicated through more than color. The `SunfishNodeHealthBar` sets `SemanticProperties.Description` to a text equivalent for each state - screen readers announce the current sync status without requiring the user to inspect the color indicator. 
State transitions trigger a live region announcement, so an AT user receives the same notification a sighted user receives visually. The full accessibility specification appears in Chapter 20. +Each state communicates through more than color. The component sets `SemanticProperties.Description` to a text equivalent — screen readers announce sync status without requiring the user to inspect the color indicator. State transitions trigger a live region announcement. The full accessibility specification is in Chapter 20. -When the network is unavailable, the presentation layer changes nothing about its behavior. It continues to render from the local store. The status indicator moves from sync-healthy to offline. The user can still create records, navigate, query, and run any domain workflow that does not require distributed lease coordination. They receive no error page. No spinner. No apology. The software works. +When the network is unavailable, the presentation layer changes nothing. It continues to render from the local store. The status indicator moves to offline. The user creates records, navigates, queries, and runs any domain workflow that does not require distributed lease coordination. No error page. No spinner. No apology. ### Layer 2: Application Logic -The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer determines what constitutes a valid state transition, enforces invariants, and emits events that both the local store and the sync daemon consume. +The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer enforces invariants and emits events that both the local store and the sync daemon consume. -This layer holds no network-aware code. 
It does not know whether the sync daemon is connected to peers. It writes to the local CRDT store unconditionally - the sync daemon propagates those writes when it can, not when consulted before they happen. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or a remote validation service. +This layer holds no network-aware code. It does not know whether the sync daemon is connected to peers. It writes to the local CRDT store unconditionally — the sync daemon propagates those writes when it can. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or validation service. -The one exception is CP-class records - those whose correctness requires distributed coordination, such as resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these records, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. This is an explicit design choice. The user sees a constraint, not a mystery failure. +The one exception is CP-class records — those whose correctness requires distributed coordination: resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. The user sees a constraint, not a mystery failure. 
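The write path described above can be sketched as follows. This is a minimal illustration in Python, not the reference implementation: `RecordClass`, `LeaseUnavailable`, `has_majority`, and `write_record` are hypothetical names, the store is a plain list, and the sketch assumes that the acknowledgment count includes the requesting node's own vote and that quorum means a strict majority of the configured peer set.

```python
from enum import Enum


class RecordClass(Enum):
    AP = "ap"  # CRDT-merge records: always writable against local state
    CP = "cp"  # lease-coordinated records: need quorum before writing


class LeaseUnavailable(Exception):
    """Raised when a CP-class write cannot obtain a majority of acknowledgments."""


def has_majority(acks: int, configured_peers: int) -> bool:
    # Assumption: strict majority of the configured peer set, so two
    # competing lease requests can never both succeed.
    return acks > configured_peers // 2


def write_record(store: list, record_class: RecordClass, payload: dict,
                 acks: int = 0, configured_peers: int = 1) -> None:
    # AP-class writes never consult the network; CP-class writes block
    # (here: raise) when quorum is unreachable.
    if record_class is RecordClass.CP and not has_majority(acks, configured_peers):
        raise LeaseUnavailable("quorum unreachable; write blocked")
    store.append(payload)
```

In this sketch an AP-class write succeeds with zero acknowledgments, while a CP-class write on a five-node team needs three. The raised exception stands in for the clear indicator the UI would surface.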
-The CAP positioning is per record class, not per application: +CAP positioning is per record class, not per application: | Record Class | CAP Position | Why | |---|---|---| @@ -118,92 +118,90 @@ The CAP positioning is per record class, not per application: ### Layer 3: Sync Daemon -The sync daemon is a separate long-running process. It is not a thread in the application. It is not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers - the application reconnects to a daemon that has been working the whole time. +The sync daemon is a separate long-running process — not a thread in the application, not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers. The daemon manages five concerns: -**Peer discovery.** Discovery follows a three-tier hierarchy. On the local network, mDNS provides zero-configuration discovery - two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. (Many enterprise Wi-Fi configurations filter mDNS by default; on those networks, the next tier is the path that actually works.) Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. +**Peer discovery.** On the local network, mDNS provides zero-configuration discovery — two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. 
Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. -**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta - the operations each holds that the other lacks. Vector clocks scoped per-document (one entry per peer that has produced operations on that document) track what each peer has seen. This is the same anti-entropy mechanism used by large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. +**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta — the operations each holds that the other lacks. Vector clocks scoped per-document track what each peer has seen. The same anti-entropy mechanism underpins large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. -**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The protocol wire format is CBOR (Concise Binary Object Representation) - compact binary encoding that minimizes bandwidth on the intermittent connections that are the baseline operating condition for hundreds of millions of enterprise workers worldwide, not an edge case. +**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The wire format is CBOR (Concise Binary Object Representation) — compact binary encoding that minimizes bandwidth on intermittent connections. -**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. 
The lease is granted when a quorum of reachable peers acknowledges - the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set, so the system never grants two contradictory leases simultaneously. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window under the reference network model. A node that goes offline releases its lease at expiry - the team is never permanently blocked by one disconnected device. +**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. The lease is granted when a quorum of reachable peers acknowledges — the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window. A node that goes offline releases its lease at expiry; the team is never permanently blocked by one disconnected device. -**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment. A power interruption between buffering and peer delivery does not lose data. The moment a peer becomes reachable - on the LAN, via VPN, or via the managed relay - the daemon begins working through the buffer. The application never needs to know that writes were queued. +**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment — a power interruption between buffering and peer delivery does not lose data. 
The moment a peer becomes reachable, the daemon begins working through the buffer. The application never needs to know that writes were queued. ### Layer 4: Storage -Layer 4 is the source of truth for this node. Everything the presentation layer renders, everything the application logic layer reads, comes from here. Nothing here depends on a remote service. +Layer 4 is the source of truth for this node. The presentation layer renders from here. The application logic layer reads from here. Nothing here depends on a remote service. -The primary store is SQLite encrypted with SQLCipher. The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore - the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields nothing readable. +The primary store is SQLite encrypted with SQLCipher. The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore — the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields ciphertext. Three storage structures coexist: -**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. List documents hold ordered sequences. Text documents hold rich text. The CRDT library handles merge semantics - the merge function is commutative, associative, and idempotent, so any two diverged copies of a document produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation currently ships YDotNet (a .NET port of Yjs); Loro is the aspirational target when its C# bindings mature. The `ICrdtEngine` abstraction keeps that choice reversible. (See Appendix G for the full glossary of these libraries and their licenses.) +**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. 
List documents hold ordered sequences. Text documents hold rich text. The merge function is commutative, associative, and idempotent — any two diverged copies produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation ships YDotNet (a .NET port of Yjs); Loro is the aspirational target. The `ICrdtEngine` abstraction keeps that choice reversible. -**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. It never modifies past entries. Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail that regulated industries require. +**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail regulated industries require. -**Read-model projections** are materialized views derived from the event log - the tables, indexes, and calculated fields that make queries fast. If a projection becomes corrupted or stale, it is rebuilt from the event log. The event log is the ground truth. Projections are a performance optimization. +**Read-model projections** are materialized views derived from the event log — tables, indexes, and calculated fields that make queries fast. A corrupted or stale projection rebuilds from the event log. Projections are a performance optimization; the event log is the ground truth. ### Layer 5: Relay and Discovery Layer 5 is the only layer that touches infrastructure outside the local node, and it is optional. 
-The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. The relay holds no authoritative data. It stores no decrypted content. It cannot read the payloads it routes - every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer, and the relay has no access to any key. +The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. The relay stores no decrypted content. Every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer; the relay holds no key. -The relay's two default trust levels reflect this: +The relay's two default trust levels: -- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration that satisfies data sovereignty requirements without exception. -- **Attested hosted peer (opt-in):** An administrator explicitly issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination - useful for teams too small to form quorum from workstations alone. +- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration and satisfies data sovereignty requirements without exception. +- **Attested hosted peer (opt-in):** An administrator issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination — useful for teams too small to form quorum from workstations alone. 
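The relay-only behavior can be illustrated with a toy in-memory relay. This is a hedged sketch under stated assumptions, not the actual relay protocol: the `Relay` class and its `subscribe`, `publish`, and `drain` methods are hypothetical names, and payloads are treated as opaque bytes that the relay forwards without parsing or decrypting.

```python
from collections import defaultdict


class Relay:
    """Toy relay: routes opaque byte payloads and never holds a key."""

    def __init__(self) -> None:
        # document id -> {peer id: inbox of undelivered ciphertext blobs}
        self._subscribers: dict[str, dict[str, list[bytes]]] = defaultdict(dict)

    def subscribe(self, doc_id: str, peer_id: str) -> None:
        # Register a peer for a document; keep any existing inbox.
        self._subscribers[doc_id].setdefault(peer_id, [])

    def publish(self, doc_id: str, sender_id: str, ciphertext: bytes) -> int:
        """Fan a delta out to every co-subscribed peer except the sender."""
        delivered = 0
        for peer_id, inbox in self._subscribers[doc_id].items():
            if peer_id != sender_id:
                inbox.append(ciphertext)  # forwarded byte-for-byte, never parsed
                delivered += 1
        return delivered

    def drain(self, doc_id: str, peer_id: str) -> list[bytes]:
        """Return and clear the pending deltas for one peer."""
        inbox = self._subscribers[doc_id].get(peer_id, [])
        pending = list(inbox)
        inbox.clear()
        return pending
```

Nothing in the class accepts or stores a key, which is the point of the relay-only trust level: compromising this component yields only the ciphertext it was asked to route.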
-The relay protocol is open and the relay is self-hostable. Any organization that requires full independence from managed relay infrastructure can operate its own relay with no changes to node configuration. +The relay protocol is open and the relay is self-hostable. Organizations that require full independence from managed relay infrastructure can operate their own relay with no changes to node configuration. -A note on what "optional" means in practice. The relay is *architecturally* optional - the protocol does not require it, two nodes on the same LAN sync directly via mDNS, and a small team whose members all work from one office can run indefinitely without any relay at all. The relay is *operationally* mandatory for the modal team in this dissertation's audience: members across symmetric NATs, members on cellular networks, members on different corporate Wi-Fi networks where mDNS is filtered. For those teams, the relay is what lets two members reach each other when neither is on the same LAN. The architecture does not pretend otherwise; the distinction matters because operational planning has to account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted on the team's own VPS. Fleet observability - relay availability, peer reachability, sync health across the fleet - is what the operator monitors; Chapter 21 specifies the fleet observability primitives. - -The relay's failure is not the application's failure. +The relay is architecturally optional — the protocol does not require it, and a small team whose members all work from one office can run indefinitely without one. The relay is operationally required for the modal team this book addresses: members across symmetric NATs, on cellular networks, or on separate corporate Wi-Fi networks where mDNS is filtered. 
Operational planning must account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted. The relay's failure is not the application's failure. --- ## How This Changes Failure Modes -Chapter 1 named seven failure modes. The inversion addresses each of them specifically. There are also failure modes the SaaS model created that may not have been visible as such - they only become legible once you understand what the vendor was holding on your behalf. And there are new failure modes the inverted architecture introduces. All three categories deserve honest treatment. +Chapter 1 named seven failure modes. The inversion addresses each directly. There are also failure modes the SaaS model created that only become legible once you understand what the vendor was holding on your behalf. And the inverted architecture introduces failure modes of its own. All three categories deserve honest treatment. **What the inversion resolves:** -*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure - your vendor's, or the cloud region beneath your vendor - interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. The construction PM submitting a bid at 4:58 PM does not care whether a cloud region is degraded, because his node does not consult any remote service to function. +*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure — your vendor's, or the cloud region beneath your vendor — interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. *The Vendor.* Data on vendor infrastructure is at the mercy of the vendor's business decisions.
Data on the user's hardware is not. A vendor acquisition, pivot, or shutdown interrupts the sync service. It does not interrupt access to the user's data. -*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy. Connectivity enables sync. It is not a prerequisite for function. The operational precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years, demonstrating that the pattern works at population scale in the markets that most require it. +*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy — connectivity enables sync; it is not a prerequisite for function. The precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years. -*The Data.* Vendor-managed data is portable only on vendor terms - export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. +*The Data.* Vendor-managed data is portable only on vendor terms — export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. -*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay - the one remaining billable dependency - is replaceable. 
The data custody that makes price changes coercive is removed from the equation. +*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay — the one remaining billable dependency — is replaceable. The data custody that makes price changes coercive is gone. -*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. The architecture I propose makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. CRDT merge semantics produce deterministically convergent state across peers - no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not as a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create the divergence, rather than accepting both and discovering the inconsistency later. The convergence semantics are testable, the divergence cases are observable, and the resolution is auditable. The cost: developers have to model their domain in operations rather than current-state assignments. Chapter 12 specifies the CRDT engine; Chapter 13 specifies the conflict UX. +*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. CRDT merge semantics produce deterministically convergent state across peers — no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create divergence. The convergence semantics are testable, divergence cases are observable, and resolution is auditable. 
The cost: developers must model their domain in operations rather than current-state assignments. Chapters 12 and 13 specify the CRDT engine and the conflict UX. -*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement. Hundreds of thousands of organizations that had built workflows on those platforms found their operations interrupted - not because their vendors failed them, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector entirely. A relay can be targeted. The software vendor itself can be targeted. But the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted or replaced for the highest-sensitivity deployments. Chapter 11 specifies relay governance. Chapter 15 covers the compliance framework for the customer-directed variant of this failure mode. +*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS markets under sanctions enforcement. Organizations that had built workflows on those platforms found their operations interrupted — not because their vendors failed, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector — a relay can be targeted, the software vendor itself can be targeted — but the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted for the highest-sensitivity deployments. Chapters 11 and 15 cover relay governance and the compliance framework. -The regulatory landscape this failure mode operates in is worth naming. 
The dominant European driver is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the strongest European legal argument for local-first data residency, enforced nationally by Germany's BSI (Bundesamt für Sicherheit in der Informationstechnik) and France's CNIL (Commission nationale de l'informatique et des libertés). India's DPDP Act 2023 and the RBI's payment-data localization circular, China's PIPL (Personal Information Protection Law) 2021, Russia's Federal Law 242-FZ (Russian-citizen personal data on Russian territory since 2015), the UAE's DIFC DPL 2020, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's PDPL cluster (KSA, Bahrain) are representative of the parallel pattern across GCC, APAC, African, and Americas markets; the full coverage matrix is in Appendix F. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. One nuance worth flagging: when peer nodes reside in different jurisdictions, a direct peer-to-peer sync becomes a cross-border data transfer in legal terms, even when the data is encrypted in transit and never lands on a vendor server. Chapter 15 specifies the compliance framework for that case. +The dominant regulatory driver for data residency is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023, China's PIPL 2021, Brazil's LGPD, and analogous frameworks across APAC and GCC markets follow the same structural logic. The full coverage matrix is in Appendix F. 
When peer nodes reside in different jurisdictions, a direct peer-to-peer sync constitutes a cross-border data transfer in legal terms, even when encrypted and never touching a vendor server. Chapter 15 specifies the compliance framework for that case. **What you may not have noticed you were exposed to:** -*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you have stored with them. A breach anywhere in their infrastructure stack - servers, sub-processors, privileged internal access - is a breach of your data, regardless of any action you took or failed to take. This failure mode is invisible until it has already happened. You cannot evaluate a vendor's internal security posture from outside it. In this architecture, the relay holds only ciphertext: it receives post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. There is no decryptable content to exfiltrate. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. The attack surface moves to the endpoints - which this architecture addresses explicitly rather than hiding. +*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you stored with them. A breach anywhere in their infrastructure stack — servers, sub-processors, privileged internal access — is a breach of your data, regardless of any action you took. In this architecture, the relay holds only ciphertext: post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. 
In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. -Hayoon Kim found out about her vendor's breach at a hotel in Singapore at 6:47 in the morning, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P (Information Security Management System - Personal) consultancy out of Gangnam-gu in Seoul. Her practice management SaaS - a Korean-language platform serving a few thousand domestic compliance professionals - had been breached six weeks earlier. The breach was disclosed to customers via an email that landed in her promotions folder. Hayoon never saw it. The article was the disclosure that reached her. Eleven of her clients were named on the dump that surfaced overnight on a Russian-language forum, each report carrying her name on the cover page, each report listing the specific PIPA (Personal Information Protection Act) Article 29 safety-measure controls she had documented during her 2023 audit work. +Hayoon Kim found out about her vendor's breach at 6:47 in the morning at a hotel in Singapore, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P consultancy out of Gangnam-gu in Seoul. Her practice management SaaS had been breached six weeks earlier. The vendor disclosed by email; the email landed in her promotions folder. The article was the disclosure that reached her. Eleven of her clients appeared in the overnight dump on a Russian-language forum, each report carrying her name on the cover page, each listing the specific PIPA Article 29 controls she had documented during her 2023 audit work. 
-She spent the next eleven days drafting individual letters to each affected client explaining what had happened, what data was exposed, what they should do. She had spent her career advising other organizations on this exact kind of letter. Writing eleven of them about her own practice was a different exercise. The platform vendor's chief executive sent a personal apology that was identical, paragraph for paragraph, to an apology another vendor's chief executive had sent the year before - Hayoon recognized three of the sentences from a precedent she had cited in a 2022 article she had written for the Korea Internet & Security Agency's quarterly compliance bulletin. +She spent the next eleven days drafting individual letters to each affected client. She had spent her career advising other organizations on exactly this kind of letter. The platform vendor's CEO sent a personal apology identical, paragraph for paragraph, to an apology another vendor's CEO had sent the year before — Hayoon recognized three sentences from a precedent she had cited in a 2022 article for the Korea Internet & Security Agency's quarterly compliance bulletin. -She still keeps her active client documents on a local encrypted drive that no SaaS vendor has access to. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. +She still keeps her active client documents on a local encrypted drive. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. **What the architecture introduces honestly:** -*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. SQLCipher encryption at rest limits the damage from physical device loss - storage extraction without credentials yields ciphertext. 
But a compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense - encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes - reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. +*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. SQLCipher encryption at rest limits the damage from physical device loss. A compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense — encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes — reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. -*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently. A twenty-person team may run five schema versions simultaneously. The expand-contract pattern - new fields additive and backward-compatible during a compatibility window, old fields retired once all active nodes have updated - handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. 
+*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently — a twenty-person team may run five schema versions simultaneously. The expand-contract pattern handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. -*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. The three-tier GC policy - aggressive compaction for stable documents, 90-day retention for active collaboration documents (configurable per deployment; Chapter 6 derives the default), indefinite retention for compliance-classified records bounded in practice by jurisdiction-specific schedules (six years for HIPAA (Health Insurance Portability and Accountability Act), seven for SOX, as configured) - keeps growth bounded. But GC in a peer-to-peer system requires coordination. A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. This architecture addresses it. It does not make it disappear. +*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. The three-tier GC policy — aggressive compaction for stable documents, 90-day retention for active collaboration documents, indefinite retention for compliance-classified records bounded by jurisdiction-specific schedules — keeps growth bounded. 
A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. The architecture addresses it; it does not make it disappear. Part II is six rounds of adversarial review by people who were looking for exactly these problems. @@ -213,11 +211,11 @@ Part II is six rounds of adversarial review by people who were looking for exact The five-layer model admits two canonical deployment shapes. Both use the same Harborline component surface, the same sync protocol, and the same five-layer architecture. They differ in where the authoritative data location lives. -**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid - a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. A user who never enables sync has a fully functional local application. A user who enables sync connects to a managed relay or a direct peer via the gossip protocol. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation - pre-1.0, in active development. +**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid — a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. 
A user who never enables sync has a fully functional local application. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation — pre-1.0, in active development. -**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default - it routes encrypted deltas but cannot read them. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want the deployment simplicity of a hosted service alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation - pre-1.0, in active development. +**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want deployment simplicity alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation — pre-1.0, in active development. 
-Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` (pre-1.0). Neither shape changes the sync protocol, the CAP positioning model, or the storage architecture. The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. +The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. --- @@ -225,11 +223,11 @@ Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` This architecture shifts three fundamental habits. -**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Command handlers succeed on local durability, not remote confirmation. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes - operations rather than current-state assignments. This discipline is the fundamental shift. +**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes — operations rather than current-state assignments. This discipline is the fundamental shift. 
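The "operations rather than current-state assignments" discipline can be illustrated with a minimal sketch. It assumes a state-based PN-counter, which is a stand-in chosen for brevity and not necessarily the CRDT type the architecture's engine uses. The names `PNCounter`, `apply`, and `merge` are hypothetical and do not come from `Harborline.Kernel.Sync`.

```python
from dataclasses import dataclass, field


@dataclass
class PNCounter:
    """Hypothetical PN-counter: per-node totals of increments and decrements."""
    node_id: str
    incs: dict[str, int] = field(default_factory=dict)
    decs: dict[str, int] = field(default_factory=dict)

    def apply(self, delta: int) -> None:
        # A local write records the operation in this node's own slot.
        # It completes without any network round trip.
        target = self.incs if delta >= 0 else self.decs
        target[self.node_id] = target.get(self.node_id, 0) + abs(delta)

    def merge(self, other: "PNCounter") -> None:
        # Merge takes the per-node maximum, which makes it commutative,
        # associative, and idempotent.
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for node, count in theirs.items():
                mine[node] = max(mine.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.incs.values()) - sum(self.decs.values())


# Two nodes change the same quantity while disconnected from each other.
office = PNCounter("office")
field_laptop = PNCounter("field")
office.apply(+5)
field_laptop.apply(-2)

# Sync in either order; both replicas converge on the same value.
office.merge(field_laptop)
field_laptop.merge(office)
```

Had each node instead assigned a current value, one assignment would have silently overwritten the other. Recording the two adjustments as operations lets both survive the merge.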
-**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent - which means it fails in the field. +**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent. -**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The system's failure modes are designed to be visible. The developer's job is to wire those signals to the UI correctly, not to paper over them. +**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The developer's job is to wire those signals to the UI correctly, not to paper over them. The five layers in one diagram are the picture Part II will adversarially test. Everything that follows is detail. 
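As a rough illustration of explicit failure modes, the sketch below models the two write paths as returning a visible result instead of raising a generic error. Every name here (`Outcome`, `WriteResult`, `LocalStore`, `write_ap`, `write_cp`) is hypothetical, the lease set is a local stand-in for distributed lease coordination, and the conflict inbox is not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"                  # durable in the local store
    LEASE_UNAVAILABLE = "lease_unavailable"  # CP-class write refused


@dataclass(frozen=True)
class WriteResult:
    outcome: Outcome
    message: str


class LocalStore:
    """Hypothetical local store; held_leases stands in for lease coordination."""

    def __init__(self) -> None:
        self.records: dict[str, list[str]] = {}
        self.held_leases: set[str] = set()

    def write_ap(self, key: str, op: str) -> WriteResult:
        # AP-class: success is defined by local durability alone.
        self.records.setdefault(key, []).append(op)
        return WriteResult(Outcome.COMMITTED, "Saved locally; will sync later.")

    def write_cp(self, key: str, op: str) -> WriteResult:
        # CP-class: without a lease the write is refused up front,
        # and the caller receives a constraint it can show the user.
        if key not in self.held_leases:
            return WriteResult(
                Outcome.LEASE_UNAVAILABLE,
                "This record needs coordination; try again when connected.",
            )
        self.records.setdefault(key, []).append(op)
        return WriteResult(Outcome.COMMITTED, "Saved under lease.")


store = LocalStore()
note_result = store.write_ap("site-note-17", "append: poured slab B")
refused = store.write_cp("invoice-2041", "approve")

store.held_leases.add("invoice-2041")  # simulate a lease being granted
approved = store.write_cp("invoice-2041", "approve")
```

The design point is that the UI receives `refused.message` as data it must handle, so there is no code path in which a CP-class write appears to succeed and is later discarded.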
diff --git a/vol-1/part-2-council-reads-the-paper/ch07-security-lens.md b/vol-1/part-2-council-reads-the-paper/ch07-security-lens.md index 10b44ac..84cdcde 100644 --- a/vol-1/part-2-council-reads-the-paper/ch07-security-lens.md +++ b/vol-1/part-2-council-reads-the-paper/ch07-security-lens.md @@ -1,69 +1,59 @@ # Chapter 7 - The Security Lens - + --- -Nia Okonkwo held the security seat on Joel's dissertation committee. Her charter was not to evaluate whether the architecture was elegant - it was to find the gap between the key hierarchy on paper and the incident response on the morning a key actually gets stolen. +Nia Okonkwo holds the security seat on Joel's dissertation committee. Her charter is not to evaluate elegance — it is to find the gap between the key hierarchy on paper and the incident response on the morning a key actually gets stolen. -Nia Okonkwo has broken three "local-first" demos in under twenty minutes. The pattern was the same all three times. She ignored the application layer. She ignored the data-at-rest story. She went straight for the sync channel. Two demos had no auth on the sync socket at all. The third had auth - a sixteen-character string hardcoded in the config. She found it by running `strings` on the binary. +She has broken three "local-first" demos in under twenty minutes. The pattern was the same each time: ignore the application layer, ignore the data-at-rest story, go straight for the sync channel. Two demos had no auth on the sync socket at all. The third had auth — a sixteen-character string hardcoded in the config, found by running `strings` on the binary. -She is not a hostile reviewer because she dislikes the inverted stack. She is a hostile reviewer because she has learned that distributed architectures fail at exactly the places their designers felt most confident. The encryption is usually fine. The key hierarchy is often documented. 
What breaks is the gap between the hierarchy on paper and the incident response on the morning a key actually gets stolen. +Okonkwo is hostile because distributed architectures fail at exactly the places their designers felt most confident. The encryption is usually fine. The key hierarchy is often documented. What breaks is the incident response on the morning a key actually gets stolen. -Okonkwo read the first version of Joel's dissertation with that question in front of her - not *is the cryptography correct*, but *what happens the day after the breach.* +She read the first version of Joel's dissertation with that question in front of her — not *is the cryptography correct*, but *what happens the day after the breach.* --- -## Act 1: Round 1 - The Key Compromise Gap +## Act 1: Round 1 — The Key Compromise Gap ### What Earned a 9/10 -The first version of Joel's dissertation gets one dimension nearly right: data minimization at the protocol layer. Subscription filtering is enforced at the sync daemon's send tier - not at the application layer, not at the UI - and that placement is specified clearly. A node that lacks the required role attestation never receives the operations. There is no receive-and-hide. There is no "we filter it before displaying." There is no trust placed in the application to discard what it should not have. The daemon does not send it. +The first version gets data minimization at the protocol layer nearly right. Subscription filtering is enforced at the sync daemon's send tier — not at the application layer, not at the UI — and that placement is specified clearly. A node that lacks the required role attestation never receives the operations. There is no receive-and-hide. There is no application layer trusted to discard what it should not have. The daemon does not send it. -Okonkwo scored this dimension a 9 out of 10. In her experience, this is the dimension most commonly implemented backwards. 
Teams build an application that receives all data and enforces visibility rules in UI components. Which means the data already crossed the network. Already landed in local storage. Already accessible to anyone who knows where to look. Send-tier filtering is the architectural achievement that makes the rest of the security story coherent. If filtering had been left to the application layer, no amount of key management would have compensated. +Okonkwo scored this a 9 out of 10. In her experience, teams build an application that receives all data and enforces visibility rules in UI components — which means the data already crossed the network, already landed in local storage, already accessible to anyone who knows where to look. Send-tier filtering is the achievement that makes the rest of the security story coherent. -The threat model section earns her respect for a related reason. The paper acknowledges that distributing data to endpoints does not eliminate the honeypot problem - it distributes it to the weakest endpoint. A cloud database is one high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous posture. The paper does not pretend otherwise. - -What it does not do is follow that acknowledgment to its conclusion. +The threat model section earns her respect for a related reason. The paper acknowledges that distributing data to endpoints distributes the honeypot problem to the weakest endpoint. A cloud database is one high-value target. A fleet of workstations is a larger attack surface with heterogeneous posture. The paper does not pretend otherwise — but it does not follow that acknowledgment to its conclusion. ### The Blocking Issue: No Key Compromise Response -The key hierarchy in the first version uses envelope encryption. Each document gets a random Data Encryption Key (DEK). Each role gets a Key Encryption Key (KEK). The DEK is encrypted with the role KEK and stored alongside the ciphertext. 
When role membership changes, the administrator generates a new KEK, re-wraps all DEKs with it, and discards the old KEK. Nodes that cannot obtain the new KEK cannot decrypt future records. - -This is the correct model. The trouble is what happens when the KEK itself is compromised - not rotated on schedule, but actually stolen. +The key hierarchy uses envelope encryption. Each document gets a random Data Encryption Key (DEK). Each role gets a Key Encryption Key (KEK). The DEK is encrypted with the role KEK and stored alongside the ciphertext. When role membership changes, the administrator generates a new KEK, re-wraps all DEKs, and discards the old KEK. -The first version of the dissertation scores a 5 out of 10 on incident response for key compromise. It provides no detection mechanism. It specifies no re-keying procedure for the compromise case as opposed to the scheduled rotation case. It analyzes no historical data exposure window for an attacker who holds the KEK. It defines no user notification path. +This is the correct model. The trouble is what happens when the KEK is compromised — not rotated on schedule, but stolen. -Consider the failure scenario concretely. A senior administrator's workstation is physically stolen on the train home. The attacker recovers the device, breaks the full-disk encryption - a realistic attack if the device is powered on - and extracts the OS keychain. The keychain holds the current role KEK for every role this administrator manages. With the KEK, the attacker can decrypt every wrapped DEK in the sync log. Every document those roles ever had access to is now readable. +The first version scores a 5 out of 10 on incident response for key compromise. No detection mechanism. No re-keying procedure for the compromise case. No analysis of the historical data exposure window. No user notification path. -The paper describes the key hierarchy and stops there. +Consider the failure scenario. 
A senior administrator's workstation is stolen on the train home. The attacker breaks full-disk encryption — realistic if the device is powered on — and extracts the OS keychain. The keychain holds the current role KEK for every role this administrator manages. With the KEK, the attacker decrypts every wrapped DEK in the sync log. Every document those roles ever accessed is now readable. The paper describes the key hierarchy and stops there. -For Okonkwo, this is not a documentation gap. An architecture that specifies a key hierarchy without specifying what to do when the hierarchy is violated has not specified a security model. It has specified a pleasant normal-path story. Security architectures are evaluated on their failure modes. The normal path is never the problem. +For Okonkwo, this is not a documentation gap. An architecture that specifies a key hierarchy without specifying what to do when the hierarchy is violated has not specified a security model. It has specified a pleasant normal-path story. -Four questions sit unanswered. *How does the system detect a key compromise?* The detection pathway determines the data-at-risk scope, because the time between compromise and detection is the time the attacker operates with the key. *What is the re-keying procedure for the compromise case?* Scheduled rotation uses the existing KEK to re-wrap DEKs; a compromised KEK cannot be used to re-wrap, because doing so produces DEKs wrapped with the same compromised key. The procedure must generate an entirely new KEK chain. +Four questions sit unanswered. *How does the system detect a key compromise?* Detection time defines the data-at-risk scope — the attacker operates with the key until detection. *What is the re-keying procedure for the compromise case?* Scheduled rotation uses the existing KEK to re-wrap DEKs; a compromised KEK cannot be used to re-wrap, because doing so produces DEKs wrapped with the same compromised key. 
The procedure must generate an entirely new KEK chain. -The third question is the one that wakes Okonkwo at 3 a.m. *What historical data is at risk?* A compromised KEK exposes every document that KEK ever protected, all the way back to the moment the key was created. The data-at-risk window is not defined by when the compromise occurred. It is defined by the KEK's age. +The third question is the one that wakes Okonkwo at 3 a.m. *What historical data is at risk?* A compromised KEK exposes every document it ever protected, back to the moment the key was created. The data-at-risk window is defined by the KEK's age, not by when the compromise occurred. -The fourth question is the one most architectures forget. *What does the user see?* An incident response that produces correct cryptographic behavior without user-visible notification is not an incident response. Someone must be told their data was potentially exposed. - -The three conditions Okonkwo raised alongside the block - diagram the key hierarchy, specify the offline node revocation reconnection flow, address in-memory key material - are completeness items. They are real. But the block stands on the compromise response alone. +The fourth question is the one most architectures forget. *What does the user see?* An incident response that produces correct cryptographic behavior without user-visible notification is not an incident response. ### Round 1 Verdict: PROCEED WITH CONDITIONS -Okonkwo issues PROCEED WITH CONDITIONS. The domain average of 7.3 out of 10 supports that verdict. But one condition is not a condition in the normal sense. It is a prerequisite. A security review cannot clear a key-based system without a specified compromise response. A score of 5 out of 10 on the weakest dimension - one that governs every other security property in the architecture - means the architecture cannot advance past a security review until that dimension is resolved. - -The architecture is unusually honest for its class. 
The threat model is real. The send-tier filtering is correct. The attacker-mindset framing - that distributing data to endpoints distributes the attack surface - is rare in local-first literature. The incident response gap is exactly the kind of gap that fails real-world security reviews. The condition holds until resolved. +Okonkwo issues PROCEED WITH CONDITIONS. The domain average of 7.3 supports that verdict — but one condition is a prerequisite, not a condition in the ordinary sense. A security review cannot clear a key-based system without a specified compromise response. The architecture is unusually honest: the threat model is real, the send-tier filtering is correct, the attacker-mindset framing is rare in local-first literature. The incident response gap is exactly the kind of gap that fails real-world security reviews. The condition holds until resolved. --- ## What Changed Between Rounds -The revision resolved the blocking issue. It specified the compromise response procedure in full and diagrammed the key hierarchy from the root organization key down through role KEKs, per-node wrapped copies, per-record DEKs, and ciphertext. The relationships between layers are explicit: role KEKs are wrapped with each authorized node's public key, DEKs are wrapped with the role KEK, and ciphertext is produced by the DEK using a symmetric cipher. No level of the hierarchy is implicit. - -This is the hierarchy Okonkwo asked for in Round 1: +The revision resolved the blocking issue. It specified the compromise response in full and diagrammed the key hierarchy from the root organization key through role KEKs, per-node wrapped copies, per-record DEKs, and ciphertext. No level is implicit. ```mermaid graph TD @@ -82,125 +72,103 @@ graph TD D3 --> E3["Ciphertext"] ``` -The revision specifies the key compromise response procedure. Detection triggers include physical loss reports, anomalous access patterns identified in the audit log, and explicit administrator reports. 
Detection triggers this sequence: generate an entirely new KEK for the affected role, not derived from the compromised key. Re-wrap every DEK owned by that role using the new KEK. Discard the old KEK and all node-level copies of it. Broadcast revocation through the relay. Notify affected users with the data-at-risk window - from the compromised key's creation date to the moment of revocation. +The revision specifies the key compromise response procedure. Detection triggers include physical loss reports, anomalous access patterns in the audit log, and explicit administrator reports. On detection: generate an entirely new KEK for the affected role, not derived from the compromised key; re-wrap every DEK owned by that role using the new KEK; discard the old KEK and all node-level copies; broadcast revocation through the relay; notify affected users with the data-at-risk window — from the compromised key's creation date to the moment of revocation. -The offline node revocation reconnection flow is now specified at the step level. When an offline node reconnects, the sync daemon presents its current attestation bundle to the relay. The relay checks the revocation log. If any key in the node's bundle has been revoked, the relay rejects the sync handshake. The node receives a specific error code indicating revocation, not a generic connection failure. Before sync can resume, the node must obtain a fresh key bundle - which requires the user to re-authenticate against the IdP (Identity Provider), establish new role attestations, and receive new wrapped KEK copies from the administrator. The user sees a message: "Your access credentials have been updated. Sign in again to continue syncing." +The offline node revocation reconnection flow is now specified at the step level. When an offline node reconnects, the sync daemon presents its attestation bundle to the relay. The relay checks the revocation log. 
If any key in the bundle has been revoked, the relay rejects the sync handshake with a specific error code indicating revocation — not a generic connection failure. Before sync can resume, the node must obtain a fresh key bundle, which requires re-authentication against the IdP, new role attestations, and new wrapped KEK copies from the administrator. The user sees: "Your access credentials have been updated. Sign in again to continue syncing." -In-memory key material is addressed at the implementation level. Locked memory pages prevent the OS from swapping key material to disk. The application zeros key material on process exit. These are implementation constraints on `Harborline.Kernel.Security`, not suggestions. +In-memory key material is addressed as implementation constraints on `Harborline.Kernel.Security`: locked memory pages prevent the OS from swapping key material to disk; the application zeros key material on process exit. --- -## Act 2: Round 2 - Four Remaining Conditions - -Round 2 opens with a commendation. The data minimization at the sync layer is architecturally correct and, in Okonkwo's assessment, represents a meaningful improvement over most commercial CRDT (Conflict-free Replicated Data Type) implementations. The question is what remains. +## Act 2: Round 2 — Four Remaining Conditions -Four conditions emerge. None is a block. All are real. +Round 2 opens with a commendation. The data minimization at the sync layer is architecturally correct and, in Okonkwo's assessment, represents a meaningful improvement over most commercial CRDT implementations. Four conditions remain. None is a block. ### Supply Chain: Who Signs the Release -The architecture uses content-addressed identifiers for update distribution. When a new release is published, the CID of the package is computed and distributed. Clients verify the CID before installation. A compromised CDN cannot serve a corrupt package because the CID mismatch fails immediately. 
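The client-side CID check described above can be sketched as follows. This is a minimal illustration, not the architecture's actual update code: `compute_cid` and `verify_package` are hypothetical names, and a plain hex SHA-256 digest stands in for a real content identifier (production CIDs use a multihash encoding). The sketch shows only that the check binds package bytes to a CID; it says nothing about where the CID came from.

```python
import hashlib
import hmac


def compute_cid(package: bytes) -> str:
    """Hypothetical stand-in for a content identifier: hex SHA-256 of the bytes."""
    return hashlib.sha256(package).hexdigest()


def verify_package(package: bytes, expected_cid: str) -> bool:
    """Return True only if the package bytes hash to the expected CID."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(compute_cid(package), expected_cid)


if __name__ == "__main__":
    published = b"release-1.0 package bytes"
    cid = compute_cid(published)

    # A CDN that serves altered bytes fails the check.
    print(verify_package(published, cid))
    print(verify_package(published + b"tampered", cid))

    # A package built by an attacker verifies against the attacker's own CID,
    # so the check alone does not establish that the CID is legitimate.
    malicious = b"attacker payload"
    print(verify_package(malicious, compute_cid(malicious)))
```

The last call is the point of the sketch: integrity relative to a CID is not provenance, which is why signing key custody and a transparency log sit on top of the content-addressing model.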
+The architecture uses content-addressed identifiers for update distribution. Clients verify the content identifier (CID) before installation. A compromised CDN cannot serve a corrupt package because the CID mismatch fails immediately. -This is correct. The gap is one step earlier. +The gap is one step earlier. The CID guarantees the integrity of the package relative to the CID. It does not guarantee that the CID itself came from the legitimate build process. An attacker who compromises the build system produces a valid package, computes its correct CID, and signs that CID with a compromised release signing key. Clients verify the CID, confirm it matches, and install the attacker's payload — exactly as the protocol specifies. -The CID guarantees the integrity of the package relative to the CID. It does not guarantee that the CID itself came from the legitimate build process. An attacker who compromises the build system can produce a valid package, compute its correct CID, and sign that CID with a compromised release signing key. Clients verify the CID, confirm it matches, and install the attacker's payload - as the protocol specifies. +Three gaps remain. First, the release signing key needs a custody specification: who holds it, how it is stored, what happens if it is compromised. A release signing key on a developer's laptop is a single point of failure with a coffee shop's WiFi attached. Second, reproducible builds: independent parties must verify that the published binary matches the published source. Third, integration with a supply chain transparency framework such as Sigstore [1], which provides a publicly auditable log of signing events. A signing event that does not appear in the transparency log can be detected and rejected by clients. -Three gaps remain. First, the release signing key needs a custody specification: who holds it, how it is stored, what happens if it is compromised. 
A release signing key stored on a developer's laptop is not a supply chain security posture - it is a single point of failure with a coffee shop's WiFi attached to it. Second, reproducible builds: independent parties must be able to verify that the published binary matches the published source. Without reproducibility, the build process is an unauditable black box. Third, integration with a supply chain transparency framework such as Sigstore ([sigstore.dev](https://www.sigstore.dev/), the supply-chain signing toolkit) [1], which provides a publicly auditable log of signing events. A signing event that does not appear in the transparency log can be detected and rejected by clients. - -Okonkwo scores this dimension 7 out of 10. The content-addressing model is the right foundation. The signing key custody and the transparency layer are what complete it. +Okonkwo scores this dimension 7 out of 10. ### The Compromised Relay -The revised paper addresses relay compromise correctly. The relay is untrusted transport. All data is end-to-end encrypted. The relay handles ciphertext. A relay operator who reads everything on the wire gets operation identifiers and timestamps, not payloads. - -This is the right architecture. The condition is about what the relay can see even when it cannot read payloads. +The relay is untrusted transport. All data is end-to-end encrypted. The relay handles ciphertext. A relay operator who reads everything on the wire gets operation identifiers and timestamps, not payloads. -Traffic analysis is sensitive. A relay operator who cannot read messages can still observe which nodes communicate with which, at what times, and at what volume. For a legal firm, the communication pattern between two nodes during a specific time window can reveal which matters are active and which team members are collaborating - without any payload access at all. For healthcare deployments, communication frequency between specific nodes can reveal patient activity patterns. 
+This is the right architecture. The condition is about what the relay can see even when it cannot read payloads. A relay operator who cannot read messages can still observe which nodes communicate with which, at what times, and at what volume. For a legal firm, the communication pattern between two nodes during a specific time window can reveal which matters are active and which team members are collaborating — without any payload access. For healthcare deployments, communication frequency between specific nodes can reveal patient activity patterns. -The architecture is not broken. The limitation is real, and the dissertation must disclose it. Organizations for whom metadata privacy is a hard requirement should run a self-hosted relay on infrastructure they control, removing the third-party relay operator as a metadata observer. +The limitation is real and must be disclosed. Organizations for whom metadata privacy is a hard requirement should run a self-hosted relay on infrastructure they control. -**Compelled Access as a Distinct Threat Model.** The compromised-relay threat model has a cousin that deserves its own name: compelled access. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, the architecture's end-to-end encryption with keys that never leave the originating device addresses a threat model that cloud storage cannot satisfy architecturally. The relay operator cannot produce decryptable content under a compulsion order because the relay operator does not possess decryptable content. This is not a cryptographic subtlety. It is the structural reason the architecture answers compelled-access regimes that BYOK and customer-managed-key (CMK) approaches cannot reach. +**Compelled Access as a Distinct Threat Model.** The compromised-relay threat model has a cousin that deserves its own name: compelled access. 
In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device answers a threat model that cloud storage cannot satisfy architecturally. The relay operator cannot produce decryptable content under a compulsion order because the relay operator does not hold decryptable content. This is not a cryptographic subtlety — it is the structural reason the architecture answers compelled-access regimes that customer-managed-key approaches cannot reach. -The 2022 sanctions enforcement event is the canonical empirical anchor. Adobe, Autodesk, Microsoft, Figma, and dozens of other Western SaaS vendors suspended service across Russia and CIS markets on days of notice. Hundreds of thousands of organizations that had cleared SOC 2, ISO 27001, and vendor risk assessments lost access. The failure mode was not technical. It was jurisdictional. CIS-region import substitution (импортозамещение) requirements followed directly, and the architecture's local-key, ciphertext-relay posture is the structural answer those requirements describe. - -The closest SaaS analog - customer-managed keys with Microsoft 365 / Salesforce Shield / Box KeySafe - is a partial answer. The customer's key sits outside the cloud provider's direct custody, but the data still traverses third-party infrastructure under that provider's jurisdictional control, which makes the provider legally compellable through other means: court orders to the provider's parent jurisdiction, gag orders, or service termination. CMK narrows the legal surface; it does not move the trust boundary off vendor-controlled infrastructure. The inverted stack moves the trust boundary onto the customer's endpoint. For deployments where compelled access is a named threat, the difference is the architecture. - -The regulatory alignment runs across every major regime. **Western:** Schrems II (*Data Protection Commissioner v. 
Facebook Ireland Limited*, CJEU C-311/18, 2020) for EU personal-data transfers; the EU's NIS2 Directive (Article 21 risk-management measures, in force October 2024) for essential and important entities; Germany's BSI C5 cloud-security catalogue; the EU Cyber Resilience Act for connected products; CNIL guidance on cloud sovereignty for French deployments. **CIS / Russia:** Federal Law 242-FZ for Russia-resident personal data, with parallel localization regimes in Kazakhstan and Belarus. **Asia / Middle East / Africa:** UAE DIFC (Dubai International Financial Centre) DPL 2020 and ADGM (Abu Dhabi Global Market) Data Protection Regulations 2021 for GCC financial-zone licensing; India DPDP (Digital Personal Data Protection) Act + RBI 2018 BFSI circular requiring financial data to reside on India-resident servers; Japan APPI (Act on the Protection of Personal Information, 2022 revision); South Korea PIPA, distinct from Japan APPI, plus ISMS-P for Korean financial-services on-premise mandates; China PIPL (Personal Information Protection Law) and MLPS 2.0 (Multi-Level Protection Scheme); Nigeria NDPR (re-enacted 2023); South Africa POPIA; Kenya Data Protection Act 2019; Brazil LGPD; Mexico LFPDPPP; Colombia Ley 1581. The full matrix sits in Appendix F. - -The architectural property - relay routes ciphertext, keys stay with the user - answers all of these structurally rather than contractually. African fintech ran the same play before the architecture formalized it: M-PESA, MTN MoMo, and FarmerLine survived Western-cloud disruptions because their architecture never depended on Western cloud in the first place. For deployments where compelled access is a named threat, the self-hosted relay is the additional guarantee - the metadata itself stays on user-controlled infrastructure. +The 2022 sanctions enforcement event is the canonical empirical anchor. 
Adobe, Autodesk, Microsoft, Figma, and dozens of other Western SaaS vendors suspended service across Russia and CIS markets on days of notice. Hundreds of thousands of organizations lost access. The failure mode was not technical — it was jurisdictional. CIS-region import substitution requirements followed directly, and the local-key, ciphertext-relay posture is the structural answer those requirements describe. The closest SaaS analog — customer-managed keys with Microsoft 365 or Salesforce Shield — narrows the legal surface but does not move the trust boundary off vendor-controlled infrastructure. The inverted stack moves the trust boundary onto the customer's endpoint. ### Physical Access and the Memory Window The at-rest encryption story is correct. SQLCipher protects local databases. Keys are derived from user credentials using Argon2id and stored in OS-native keystores. Physical storage extraction without credentials produces no plaintext. -The gap is the memory window while the application is running. - -An attacker with thirty minutes of physical access to a live system can use cold boot attack techniques or memory forensics tools. Cold boot exploits the remanence of DRAM: memory contents persist briefly after power loss and can be read if the attacker acts within seconds to minutes of shutdown, depending on hardware. Memory forensics tools that run from a bootable USB can dump process memory directly. The decryption key that is in memory while the application is running is readable by both techniques. +The gap is the memory window while the application is running. An attacker with thirty minutes of physical access to a live system can use cold boot techniques — DRAM contents persist briefly after power loss and can be read if the attacker acts within seconds to minutes of shutdown — or memory forensics tools that run from a bootable USB to dump process memory directly. The decryption key in memory is readable by both techniques. 
-The mitigation is a re-authentication interval. The application requests re-authentication from the OS keychain at configurable intervals - every four hours is the recommended default for high-security deployments. An attacker who gains physical access to an authenticated session can operate within that window. An attacker who encounters a session requiring re-authentication cannot proceed without the user's credentials. - -This is a hardening recommendation, not an architecture flaw. The base model is correct; the recommendation narrows the exposure window for deployments where physical access is a realistic threat vector. Okonkwo scores physical access an 8 out of 10. +The mitigation is a re-authentication interval: the application requests re-authentication from the OS keychain at configurable intervals. Four hours is the recommended default for high-security deployments. This is a hardening recommendation, not an architecture flaw. Okonkwo scores physical access an 8 out of 10. ### Credential Recovery and Account Continuity -Okonkwo's fifth prompt asked what happens the day after the user loses their passphrase. The paper now specifies three recovery paths and names what is not supported. - -For passphrase loss: an optional recovery-key file generated at account setup unseals the local keystore without the passphrase; organizations that require centralized recovery can enable administrator-held wrapped KEK copies under a defined escrow procedure. For OS keystore corruption: re-enrollment via the organization's MDM (Mobile Device Management) re-delivers the role attestation to a fresh keystore, and relay-assisted re-sync restores the node's data from peers. For legal hold on a departed employee's local device: the encrypted SQLCipher database is extractable by IT with the administrator's escrow KEK; decryption proceeds under applicable legal authority. +The paper specifies three recovery paths. 
For passphrase loss: an optional recovery-key file generated at account setup unseals the local keystore without the passphrase; organizations requiring centralized recovery can enable administrator-held wrapped KEK copies under a defined escrow procedure. For OS keystore corruption: re-enrollment via the organization's MDM re-delivers the role attestation to a fresh keystore, and relay-assisted re-sync restores the node's data from peers. For legal hold on a departed employee's device: the encrypted SQLCipher database is extractable by IT with the administrator's escrow KEK. -What the architecture does not support: instant recovery without a recovery artifact. A user who lost their passphrase, declined recovery-key backup, declined organizational escrow, and whose device is the only copy of local-only records has lost them. End-to-end custody without recovery mechanisms makes permanent loss possible. Organizations must choose at least one recovery path and test it before production. +What the architecture does not support: instant recovery without a recovery artifact. A user who lost their passphrase, declined recovery-key backup, declined organizational escrow, and whose device is the only copy of local-only records has lost them. Organizations must choose at least one recovery path and test it before production. Okonkwo scores credential recovery 7 out of 10. ### GDPR Article 17 in a CRDT System -This is the condition Okonkwo scores lowest in Round 2: compliance framework mapping, 5 out of 10. It surfaces a genuine conflict. Article 17 of the General Data Protection Regulation [2] requires deletion of personal data on request; the no-GC compliance tier's operation log is immutable by design, and the immutability is the feature - append-only signed entries are what regulated industries require for tamper-evident audit. The architecture cannot simultaneously provide an immutable audit log and comply with Article 17 through conventional deletion. 
+This is the condition Okonkwo scores lowest in Round 2: compliance framework mapping, 5 out of 10. Article 17 of the GDPR [2] requires deletion of personal data on request; the no-GC compliance tier's operation log is immutable by design, and the immutability is the feature — append-only signed entries are what regulated industries require for tamper-evident audit. The architecture cannot simultaneously provide an immutable audit log and comply with Article 17 through conventional deletion. -The resolution is *crypto-shredding* - destruction of the DEK that protects the operation's content rather than removal of the operation itself. The operation entry remains in the log, preserving DAG (Directed Acyclic Graph) integrity; its ciphertext becomes an unrecoverable stub. Chapter 15 (Security Architecture) specifies the mechanism, including the procedural exemption under GDPR Article 17(3)(b) for processing necessary for legal obligations and public interest. +The resolution is crypto-shredding: destroy the DEK that protects the operation's content rather than remove the operation itself. The operation entry remains in the log, preserving DAG integrity; its ciphertext becomes an unrecoverable stub. Chapter 15 specifies the mechanism, including the procedural exemption under GDPR Article 17(3)(b) for processing necessary for legal obligations. -The pattern's known limitation is metadata residue. Operation identifiers, timestamps, and structural position in the DAG remain after DEK destruction. Whether that metadata constitutes personal data under Article 17 is jurisdictional - a legal question, not an architectural one. Disclose it. - -The same DEK-destruction-with-metadata-residue pattern is the architectural answer across every major right-to-erasure regime. 
GDPR Article 17, India's DPDP Act erasure right, and Brazil's LGPD Article 18 are representative; parallel provisions exist under POPIA, NDPR, Kenya DPA 2019, Japan APPI (2022), South Korea PIPA, China PIPL, LFPDPPP ARCO rights, Colombia Ley 1581, Argentina Ley 25.326, and the regimes named in Appendix F. The cryptographic pattern is architectural; the compliance procedure is jurisdictional. +The pattern's known limitation is metadata residue. Operation identifiers, timestamps, and structural position in the DAG remain after DEK destruction. Whether that metadata constitutes personal data under Article 17 is jurisdictional — a legal question, not an architectural one. Disclose it. The same crypto-shredding pattern applies to the parallel right-to-erasure regimes in GDPR, India's DPDP Act, Brazil's LGPD, and the regimes named in Appendix F. ### Round 2 Verdict: PROCEED WITH CONDITIONS -Okonkwo issues PROCEED WITH CONDITIONS. Domain average 7.0 out of 10. The blocking issue from Round 1 is fully resolved. The full condition list: +Okonkwo issues PROCEED WITH CONDITIONS. Domain average 7.0. The blocking issue from Round 1 is fully resolved. The full condition list: **C1 (High):** Specify release signing key custody, reproducible build requirement, and Sigstore integration for update supply chain transparency. -**C2 (High):** Address GDPR Article 17 for the no-GC compliance CRDT tier - document the crypto-shredding pattern and explicitly scope the limitation on operation metadata. +**C2 (High):** Address GDPR Article 17 for the no-GC compliance CRDT tier — document the crypto-shredding pattern and explicitly scope the limitation on operation metadata. **C3 (Medium):** Acknowledge relay metadata and traffic analysis limitation for high-sensitivity deployments. State the self-hosted relay as the mitigation. -**C4 (Medium):** Specify a recommended default re-attestation interval - twenty-four hours balances a bounded revocation window against operational friction. 
+**C4 (Medium):** Specify a recommended default re-attestation interval — twenty-four hours balances a bounded revocation window against operational friction. **C5 (Low):** Add cold boot and in-memory key hardening recommendation for high-security deployments, including the four-hour re-authentication interval guidance. -**C6 (Medium):** Document the three supported credential recovery paths (recovery-key file, administrator-held wrapped KEK escrow, MDM re-enrollment plus relay-assisted re-sync) and explicitly name the unsupported no-artifact case. +**C6 (Medium):** Document the three supported credential recovery paths and explicitly name the unsupported no-artifact case. -C1 and C2 must be addressed before first external release. C3 through C6 are addressable in the companion document without blocking alpha implementation. The architecture cleared a security review that began with three demos broken in twenty minutes. The conditions govern the operational hardening - the supply chain custody, the metadata disclosures, the recovery paths, the re-attestation cadence - that turns a sound key hierarchy into a deployable security posture. +C1 and C2 must be addressed before first external release. C3 through C6 are addressable in the companion document without blocking alpha implementation. The architecture cleared a security review that began with three demos broken in twenty minutes. The conditions govern the operational hardening — supply chain custody, metadata disclosures, recovery paths, re-attestation cadence — that turns a sound key hierarchy into a deployable security posture. --- ## The Principle: Defense-in-Depth Is Not Optional -The council's security review surfaces the central tension in distributed endpoint architectures. The inverted stack solves the central honeypot problem: a fleet of workstations is a harder target than a single cloud database, because there is no single high-value target and no single breach that exposes all data for all users. 
A compromised node exposes only what that node is authorized to access. - -This is a genuine improvement - and a displacement of the problem rather than an elimination of it. The architect's vulnerability-first move is to say so out loud, in the same paper that announces the improvement. +The security review surfaces the central tension in distributed endpoint architectures. The inverted stack solves the central honeypot problem: a compromised node exposes only what that node is authorized to access. A fleet of workstations is a harder target than a single cloud database. -A fleet of workstations is a distributed attack surface. Each node is a potential target. The security posture of the weakest endpoint is the security posture of the data that endpoint holds. In an enterprise deployment with fifty nodes, an attacker does not target the strongest endpoint - they target the one belonging to the administrator with the broadest role access and the worst patch cadence. +This is a genuine improvement — and a displacement of the problem, not an elimination of it. A fleet is a distributed attack surface. The security posture of the weakest endpoint is the security posture of the data that endpoint holds. In an enterprise deployment with fifty nodes, an attacker targets the administrator with the broadest role access and the worst patch cadence. -The architecture requires defense-in-depth across four layers. None is optional. Each of these layers is built from cryptographic primitives that have been independently audited - libsodium, age, Argon2id reference, SQLCipher - composed against a specification a cryptographic engineer has reviewed. The crypto discipline established in Chapter 2 holds because the primitives are opaque and the composition is specified, not because the system is novel. +The architecture requires defense-in-depth across four layers. 
Each is built from independently audited cryptographic primitives — libsodium, age, Argon2id reference, SQLCipher — composed against a written specification a cryptographic engineer has reviewed. -Layer one is encryption at rest. SQLCipher on local databases. Argon2id key derivation. OS-native keystores. Physical storage extraction without credentials yields no plaintext. This layer is table stakes. +Layer one: encryption at rest. SQLCipher, Argon2id key derivation, OS-native keystores. Physical storage extraction without credentials yields no plaintext. Table stakes. -Layer two is field-level encryption. Per-record DEKs. Per-role KEKs. DEK/KEK envelope encryption. An attacker who compromises a node's local storage gets encrypted blobs. Without the KEK, the DEKs are useless. Without the DEKs, the ciphertext is useless. +Layer two: field-level encryption. Per-record DEKs, per-role KEKs, DEK/KEK envelope encryption. An attacker who compromises a node's local storage gets encrypted blobs. Without the KEK, the DEKs are useless. Without the DEKs, the ciphertext is useless. -Layer three is stream-level data minimization. Subscription filtering at the sync daemon's send tier. A compromised node is limited to the operations it was authorized to receive. The blast radius of a single node compromise is bounded by role scope, enforced at the protocol layer where it cannot be bypassed by application changes. +Layer three: stream-level data minimization. Subscription filtering at the sync daemon's send tier. A compromised node is limited to the operations it was authorized to receive. The blast radius of a single node compromise is bounded by role scope, enforced at the protocol layer where application changes cannot bypass it. -Layer four is circuit breaker and quarantine. Offline writes queue for validation against current team state before promotion. 
A node that reconnects after a long offline period does not automatically push its queued writes to peers - those writes enter a quarantine queue and are validated against current policy before merging. This prevents a compromised offline node from pushing malicious writes on reconnection. +Layer four: circuit breaker and quarantine. When a node reconnects after a long offline period, its queued writes enter a quarantine queue and are validated against current policy before merging. A compromised offline node cannot push malicious writes on reconnection. -The data minimization invariant - send-tier filtering, enforced at the protocol layer - is what makes the security story credible. Without it, layers one and two protect data at rest but cannot contain a breach once data is in transit. An application-layer filter that receives all operations and hides some in the UI is not a security control. It is a UI control. An attacker with access to the sync socket or the local database bypasses it entirely. +The send-tier filtering invariant is what makes the security story credible. Without it, layers one and two protect data at rest but cannot contain a breach once data is in transit. An application-layer filter that receives all operations and hides some in the UI is a UI control, not a security control. An attacker with access to the sync socket or the local database bypasses it entirely. -Every practitioner building on this architecture should treat the send-tier filtering invariant as inviolable. The filter belongs in the sync daemon. It does not belong in the view layer. It does not belong in the API (Application Programming Interface) handler. It does not belong in a permission check on a UI component. The moment it moves, the blast radius of any node compromise expands from role-scoped to total. +Every practitioner building on this architecture must treat the send-tier filtering invariant as inviolable. The filter belongs in the sync daemon. 
It does not belong in the view layer, the API handler, or a permission check on a UI component. The moment it moves, the blast radius of any node compromise expands from role-scoped to total. Distribute the data to endpoints for resilience. Treat each endpoint as a potential breach. Four layers. No shortcuts. @@ -212,12 +180,12 @@ What a practitioner carries forward from Okonkwo's review: - **DEK/KEK envelope encryption is enforced at the architecture level, not the application level.** The key hierarchy is audited, not invented; primitives are libsodium, age, Argon2id reference, SQLCipher; compositions require a cryptographic engineer's sign-off against a written specification. - **Send-tier filtering is an inviolable protocol invariant.** Subscription filtering lives in the sync daemon, not in the UI, not in an API handler, not in a permission check on a view component. If it moves, blast radius expands from role-scoped to total. -- **Key compromise response is specified and tested before first production deployment.** Revocation procedure, administrator re-attestation flow with a twenty-four-hour recommended re-attestation interval, offline-node reconnection handling, and capability-rotation propagation are documented with timing commitments, not described as design intent. -- **Root organization key custody uses HSM (Hardware Security Module) or multi-party ceremony.** A compromised root key is a higher-order failure than a compromised role KEK. Deployments under import substitution constraints or in jurisdictions where Western HSM hardware is not approved must use a domestic HSM equivalent or a documented multi-party key ceremony - the custody requirement is structural, not procedural. 
-- **Supply-chain transparency is signed, reproducible, and attestable.** Release signing key custody is documented; reproducible builds are required for release artifacts; Sigstore or equivalent attestations ship with every release; SBOM (Software Bill of Materials) accompanies the binary.
+- **Key compromise response is specified and tested before first production deployment.** Revocation procedure, administrator re-attestation flow with a twenty-four-hour recommended interval, offline-node reconnection handling, and capability-rotation propagation are documented with timing commitments — not described as design intent.
+- **Root organization key custody uses HSM or multi-party ceremony.** A compromised root key is a higher-order failure than a compromised role KEK. Deployments under import substitution constraints or in jurisdictions where Western HSM hardware is not approved must use a domestic HSM equivalent or a documented multi-party key ceremony.
+- **Supply-chain transparency is signed, reproducible, and attestable.** Release signing key custody is documented; reproducible builds are required for release artifacts; Sigstore or equivalent attestations ship with every release; SBOM accompanies the binary.
 - **Relay is ciphertext-only with a self-hosted path for metadata-sensitive deployments.** Compelled-access and traffic-analysis threat models are named explicitly; self-hosted relay operation is a supported configuration, not a fork.
-- **Credential recovery offers at least one artifact-based path.** Recovery-key file, administrator-held wrapped KEK escrow, or MDM re-enrollment plus relay-assisted re-sync - organizations must choose and test one before production. The no-artifact case (permanent loss possible) is disclosed to users at onboarding.
-- **Right-to-erasure is implemented via crypto-shredding with documented metadata residue.** DEK destruction makes operation content unrecoverable; operation metadata remains and must be disclosed to the data protection officer; the pattern applies uniformly to GDPR, India's DPDP, and Brazil's LGPD obligations - and to the parallel right-to-erasure regimes named in Appendix F.
+- **Credential recovery offers at least one artifact-based path.** Recovery-key file, administrator-held wrapped KEK escrow, or MDM re-enrollment plus relay-assisted re-sync — organizations must choose and test one before production. The no-artifact case is disclosed to users at onboarding.
+- **Right-to-erasure is implemented via crypto-shredding with documented metadata residue.** DEK destruction makes operation content unrecoverable; operation metadata remains and must be disclosed to the data protection officer; the pattern applies uniformly to GDPR, India's DPDP, and Brazil's LGPD obligations.
 ---
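
The send-tier filtering invariant named in the list above can be illustrated with a minimal sketch. It is not the chapter's implementation: `Operation`, `Peer`, `SyncDaemon`, the `scope` labels, and the in-memory `outbox` standing in for the sync socket are all hypothetical names chosen for illustration. The only point it tries to show is that the subscription check runs inside the daemon before anything is sent, so an out-of-scope operation never reaches a peer at all.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    op_id: str
    scope: str      # hypothetical role-scope label, e.g. "finance"
    payload: bytes  # would be ciphertext in a real deployment


@dataclass
class Peer:
    peer_id: str
    scopes: frozenset[str]  # scopes this peer is subscribed to
    # Stand-in for the sync socket: whatever lands here has left the node.
    outbox: list[Operation] = field(default_factory=list)


class SyncDaemon:
    """Hypothetical sync daemon that filters at the send tier."""

    def __init__(self) -> None:
        self._peers: dict[str, Peer] = {}

    def register(self, peer: Peer) -> None:
        self._peers[peer.peer_id] = peer

    def broadcast(self, op: Operation) -> list[str]:
        """Send op only to peers subscribed to its scope; return their ids."""
        delivered: list[str] = []
        for peer in self._peers.values():
            # The filter runs before the send, not in a view or API handler.
            if op.scope in peer.scopes:
                peer.outbox.append(op)
                delivered.append(peer.peer_id)
        return delivered


daemon = SyncDaemon()
tablet = Peer("site-tablet", frozenset({"field"}))
laptop = Peer("cfo-laptop", frozenset({"field", "finance"}))
daemon.register(tablet)
daemon.register(laptop)

delivered = daemon.broadcast(Operation("op-1", "finance", b"..."))
```

In this sketch a compromise of `site-tablet` exposes only `field` operations, because `finance` operations were never transmitted to it. A design that sent everything and hid rows in the UI would leave them readable in the peer's local store.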